The Generic Syntax Extension
OCaml 4.01 with its new feature to disambiguate constructors allows to do a nice trick: a simple and generic syntax extension that allows to define your own syntax without having to write complicated parsetree transformers. We propose an implementation in the form of a ppx rewriter.
it does only a simple transformation: replace strings prefixed by an operator starting with ! by a series of constructor applications
for instance:
!! "hello 3"
is rewriten to
!! (Start (H (E (L (L (O (Space (N3 (End))))))))
How is that generic ? We will present you a few examples.
Base 3 Numbers
For instance, if you want to declare base 3 arbitrary big numbers, let's define a syntax for it. We first start by declaring some types.
type start = Start of p
and p =
| N0 of stop
| N1 of q
| N2 of q
and q =
| N0 of q
| N1 of q
| N2 of q
| Underscore of q
| End
and stop = End
This type will only allow to write strings matching the regexp 0 | (1|2)(0|1|2|_)*. Notice that some constructors appear in multiple types like N0. This is not a problem since constructor desambiguation will choose for us the right one at the right place. Let's now define a few functions to use it:
open Num
let rec convert_p = function
| N0 (End) -> Int 0
| N1 t -> convert_q (Int 1) t
| N2 t -> convert_q (Int 2) t
and convert_q acc = function
| N0 t -> convert_q (acc */ Int 3) t
| N1 t -> convert_q (Int 1 +/ acc */ Int 3) t
| N2 t -> convert_q (Int 2 +/ acc */ Int 3) t
| Underscore t -> convert_q acc t
| End -> acc
let convert (Start p) = convert_p p
# val convert : start -> Num.num = <fun>
And we can now try it:
let n1 = convert (Start (N0 End))
# val n1 : Num.num = <num 0>
let n2 = convert (Start (N1 (Underscore (N0 End))))
# val n2 : Num.num = <num 3>
let n3 = convert (Start (N1 (N2 (N0 End))))
# val n3 : Num.num = <num 15>
And the generic syntax extension allows us to write:
let ( !! ) = convert
let n4 = !! "120_121_000"
val n4 : Num.num = <num 11367>
Specialised Format Strings
We can implement specialised format strings for a particular usage. Here, for concision we will restrict to a very small subset of the classical format: the characters %, i, c and space
Let's define the constructors.
type 'a start = Start of 'a a
and 'a a =
| Percent : 'a f -> 'a a
| I : 'a a -> 'a a
| C : 'a a -> 'a a
| Space : 'a a -> 'a a
| End : unit a
and 'a f =
| I : 'a a -> (int -> 'a) f
| C : 'a a -> (char -> 'a) f
| Percent : 'a a -> 'a f
Let's look at the inferred type for some examples:
let (!*) x = x
let v = !* "%i %c";;
# val v : (int -> char -> unit) start = Start (Percent (I (Space (Percent (C End)))))
let v = !* "ici";;
# val v : unit start = Start (I (C (I End)))
This is effectively the types we would like for a format string looking like that. To use it we can define a simple printer:
let rec print (Start cons) =
main cons
and main : type t. t a -> t = function
| I r ->
print_string "i";
main r
| C r ->
print_string "c";
main r
| Space r ->
print_string " ";
main r
| End -> ()
| Percent f ->
format f
and format : type t. t f -> t = function
| I r ->
fun i ->
print_int i;
main r
| C r ->
fun c ->
print_char c;
main r
| Percent r ->
print_string "%";
main r
let (!!) cons = print cons
And voila!
let s = !! "%i %c" 1 'c';;
# 1 c
How generic is it really ?
It may not look like it, but we can do almost any syntax we might want this way. For instance we can do any regular language. To explain how we transform a regular language to a type definition, we will use as an example the language a(a|)b
type start = Start of a
and a =
| A of a';
and a' =
| A of b
| B of stop
and b = B of stop
and stop = End
We can try a few things on it:
let v = Start (A (A (B End)))
# val v : start = Start (A (A (B End)))
let v = Start (A (B End))
# val v : start = Start (A (B End))
let v = Start (B End)
# Characters 15-16:
# let v = Start (B End);;
# ^
# Error: The variant type a has no constructor B
let v = Start (A (A (A (B End))))
# Characters 21-22:
# let v = Start (A (A (A (B End))));;
# ^
# Error: The variant type b has no constructor A
Assumes the language is given as an automaton that:
- has 4 states, a, a', b and stop
- with initial state a
- with final state stop
- with transitions: a - A -> a' a' - A -> b a' - B -> stop b - B -> stop let's write {c} for the constructor corresponding to the character c and
[c][/c]
for the type corresponding to a state of the automaton.
- For each state q we have a type declaration [q]
- For each letter a of the alphabet we have a constructor {a}
- For each transition p - l -> q we have a constructor {l} with parameter [q] in type [p]:
type [p] = {l} of [q]
- The End constructor without any parameter must be present in any final state
- The initial state e is declared by
type start = Start of [e]
Yet more generic
In fact we can encode deterministic context free languages (DCFL) also. To do that we encode pushdown automatons. Here we will only give a small example: the language of well parenthesized words
type empty
type 'a r = Dummy
type _ q =
| End : empty q
| Rparen : 'a q -> 'a r q
| Lparen : 'a r q -> 'a q
type start = Start of empty q
let !! x = x
let m = ! ""
let m = ! "()"
let m = ! "((())())()"
To encode the stack, we use the type parameters: Lparen pushes an r to the stack, Rparen consumes it and End checks that the stack is effectively empty.
There are a few more tricks needed to encode tests on the top value in the stack, and a conversion of a grammar to Greibach normal form to allow this encoding.
We can go even further
a^n b^n c^n
In fact we don't need to restrict to DCFL, we can for instance encode the a^n.b^n.c^n language which is not context free:
type zero
type 'a s = Succ
type (_,_) p =
| End : (zero,zero) p
| A : ('b s, 'c s) p -> ('b, 'c) p
| B : ('b, 'c s) q -> ('b s, 'c s) p
and (_,_) q =
| B : ('b, 'c) q -> ('b s, 'c) q
| C : 'c r -> (zero, 'c s) q
and _ r =
| End : zero r
| C : 'c r -> 'c s r
type start = Start of (zero,zero) p
let v = Start (A (B (C End)))
let v = Start (A (A (B (B (C (C End))))))
Non recursive languages
We can also encode solutions of Post Correspondance Problems (PCP), which are not recursive languages:
Suppose we have two alphabets A = { X, Y, Z } et O = { a, b } and two morphisms m1 and m2 from A to O* defined as
- m1(X) = a, m1(Y) = ab, m1(Z) = bba
- m2(X) = baa, m2(Y) = aa, m2(Z) = bb
Solutions of this instance of PCP are words such that their images by m1 and m2 are equal. for instance ZYZX is a solution: both images are bbaabbbaa. The language of solution can be represented by this type declaration:
type empty
type 'a a = Dummy
type 'a b = Dummy
type (_,_) z =
| X : ('t1, 't2) s -> ('t1 a, 't2 b a a) z
| Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) z
| Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) z
and (_,_) s =
| End : (empty,empty) s
| X : ('t1, 't2) s -> ('t1 a, 't2 b a a) s
| Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) s
| Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) s
type start = Start : ('a, 'a) z -> start
let v = X (Z (Y (Z End)))
let r = Start (X (Z (Y (Z End))))
Open question
Can every context free language (not deterministic) be represented like that ? Notice that the classical example of the palindrome can be represented (proof let to the reader).
Conclusion
So we have a nice extension available that allows you to define a new syntax by merely declaring a type. The code is available on github. We are waiting for the nice syntax you will invent !
PS: Their may remain a small problem... If inadvertently you mistype something you may find some quite complicated type errors attacking you like a pyranha instead of a syntax error.
Au sujet d'OCamlPro :
OCamlPro développe des applications à haute valeur ajoutée depuis plus de 10 ans, en utilisant les langages les plus avancés, tels que OCaml et Rust, visant aussi bien rapidité de développement que robustesse, et en ciblant les domaines les plus exigeants (méthodes formelles, cybersécurité, systèmes distribués/blockchain, conception de DSLs). Fort de plus de 20 ingénieurs R&D, avec une expertise unique sur les langages de programmation, aussi bien théorique (plus de 80% de nos ingénieurs ont une thèse en informatique) que pratique (participation active au développement de plusieurs compilateurs open-source, prototypage de la blockchain Tezos, etc.), diversifiée (OCaml, Rust, Cobol, Python, Scilab, C/C++, etc.) et appliquée à de multiples domaines. Nous dispensons également des [formations sur mesure sur OCaml, Rust, et les méthodes formelles] (https://training.ocamlpro.com/) Pour nous contacter : contact@ocamlpro.com.
Articles les plus récents
2024
- opam 2.3.0 release!
- Optimisation de Geneweb, 1er logiciel français de Généalogie depuis près de 30 ans
- Alt-Ergo 2.6 is Out!
- Flambda2 Ep. 3: Speculative Inlining
- opam 2.2.0 release!
- Flambda2 Ep. 2: Loopifying Tail-Recursive Functions
- Fixing and Optimizing the GnuCOBOL Preprocessor
- OCaml Backtraces on Uncaught Exceptions
- Opam 102: Pinning Packages
- Flambda2 Ep. 1: Foundational Design Decisions
- Behind the Scenes of the OCaml Optimising Compiler Flambda2: Introduction and Roadmap
- Lean 4: When Sound Programs become a Choice
- Opam 101: The First Steps
2023
- Maturing Learn-OCaml to version 1.0: Gateway to the OCaml World
- The latest release of Alt-Ergo version 2.5.1 is out, with improved SMT-LIB and bitvector support!
- 2022 at OCamlPro
- Autofonce, GNU Autotests Revisited
- Sub-single-instruction Peano to machine integer conversion
- Statically guaranteeing security properties on Java bytecode: Paper presentation at VMCAI 23
- Release of ocplib-simplex, version 0.5