The Generic Syntax Extension
OCaml 4.01 with its new feature to disambiguate constructors allows to do a nice trick: a simple and generic syntax extension that allows to define your own syntax without having to write complicated parsetree transformers. We propose an implementation in the form of a ppx rewriter.
it does only a simple transformation: replace strings prefixed by an operator starting with ! by a series of constructor applications
for instance:
!! "hello 3"
is rewriten to
!! (Start (H (E (L (L (O (Space (N3 (End))))))))
How is that generic ? We will present you a few examples.
Base 3 Numbers
For instance, if you want to declare base 3 arbitrary big numbers, let's define a syntax for it. We first start by declaring some types.
type start = Start of p
and p =
| N0 of stop
| N1 of q
| N2 of q
and q =
| N0 of q
| N1 of q
| N2 of q
| Underscore of q
| End
and stop = End
This type will only allow to write strings matching the regexp 0 | (1|2)(0|1|2|_)*. Notice that some constructors appear in multiple types like N0. This is not a problem since constructor desambiguation will choose for us the right one at the right place. Let's now define a few functions to use it:
open Num
let rec convert_p = function
| N0 (End) -> Int 0
| N1 t -> convert_q (Int 1) t
| N2 t -> convert_q (Int 2) t
and convert_q acc = function
| N0 t -> convert_q (acc */ Int 3) t
| N1 t -> convert_q (Int 1 +/ acc */ Int 3) t
| N2 t -> convert_q (Int 2 +/ acc */ Int 3) t
| Underscore t -> convert_q acc t
| End -> acc
let convert (Start p) = convert_p p
# val convert : start -> Num.num = <fun>
And we can now try it:
let n1 = convert (Start (N0 End))
# val n1 : Num.num = <num 0>
let n2 = convert (Start (N1 (Underscore (N0 End))))
# val n2 : Num.num = <num 3>
let n3 = convert (Start (N1 (N2 (N0 End))))
# val n3 : Num.num = <num 15>
And the generic syntax extension allows us to write:
let ( !! ) = convert
let n4 = !! "120_121_000"
val n4 : Num.num = <num 11367>
Specialised Format Strings
We can implement specialised format strings for a particular usage. Here, for concision we will restrict to a very small subset of the classical format: the characters %, i, c and space
Let's define the constructors.
type 'a start = Start of 'a a
and 'a a =
| Percent : 'a f -> 'a a
| I : 'a a -> 'a a
| C : 'a a -> 'a a
| Space : 'a a -> 'a a
| End : unit a
and 'a f =
| I : 'a a -> (int -> 'a) f
| C : 'a a -> (char -> 'a) f
| Percent : 'a a -> 'a f
Let's look at the inferred type for some examples:
let (!*) x = x
let v = !* "%i %c";;
# val v : (int -> char -> unit) start = Start (Percent (I (Space (Percent (C End)))))
let v = !* "ici";;
# val v : unit start = Start (I (C (I End)))
This is effectively the types we would like for a format string looking like that. To use it we can define a simple printer:
let rec print (Start cons) =
main cons
and main : type t. t a -> t = function
| I r ->
print_string "i";
main r
| C r ->
print_string "c";
main r
| Space r ->
print_string " ";
main r
| End -> ()
| Percent f ->
format f
and format : type t. t f -> t = function
| I r ->
fun i ->
print_int i;
main r
| C r ->
fun c ->
print_char c;
main r
| Percent r ->
print_string "%";
main r
let (!!) cons = print cons
And voila!
let s = !! "%i %c" 1 'c';;
# 1 c
How generic is it really ?
It may not look like it, but we can do almost any syntax we might want this way. For instance we can do any regular language. To explain how we transform a regular language to a type definition, we will use as an example the language a(a|)b
type start = Start of a
and a =
| A of a';
and a' =
| A of b
| B of stop
and b = B of stop
and stop = End
We can try a few things on it:
let v = Start (A (A (B End)))
# val v : start = Start (A (A (B End)))
let v = Start (A (B End))
# val v : start = Start (A (B End))
let v = Start (B End)
# Characters 15-16:
# let v = Start (B End);;
# ^
# Error: The variant type a has no constructor B
let v = Start (A (A (A (B End))))
# Characters 21-22:
# let v = Start (A (A (A (B End))));;
# ^
# Error: The variant type b has no constructor A
Assumes the language is given as an automaton that:
- has 4 states, a, a', b and stop
- with initial state a
- with final state stop
- with transitions: a - A -> a' a' - A -> b a' - B -> stop b - B -> stop let's write {c} for the constructor corresponding to the character c and
[c][/c]
for the type corresponding to a state of the automaton.
- For each state q we have a type declaration [q]
- For each letter a of the alphabet we have a constructor {a}
- For each transition p - l -> q we have a constructor {l} with parameter [q] in type [p]:
type [p] = {l} of [q]
- The End constructor without any parameter must be present in any final state
- The initial state e is declared by
type start = Start of [e]
Yet more generic
In fact we can encode deterministic context free languages (DCFL) also. To do that we encode pushdown automatons. Here we will only give a small example: the language of well parenthesized words
type empty
type 'a r = Dummy
type _ q =
| End : empty q
| Rparen : 'a q -> 'a r q
| Lparen : 'a r q -> 'a q
type start = Start of empty q
let !! x = x
let m = ! ""
let m = ! "()"
let m = ! "((())())()"
To encode the stack, we use the type parameters: Lparen pushes an r to the stack, Rparen consumes it and End checks that the stack is effectively empty.
There are a few more tricks needed to encode tests on the top value in the stack, and a conversion of a grammar to Greibach normal form to allow this encoding.
We can go even further
a^n b^n c^n
In fact we don't need to restrict to DCFL, we can for instance encode the a^n.b^n.c^n language which is not context free:
type zero
type 'a s = Succ
type (_,_) p =
| End : (zero,zero) p
| A : ('b s, 'c s) p -> ('b, 'c) p
| B : ('b, 'c s) q -> ('b s, 'c s) p
and (_,_) q =
| B : ('b, 'c) q -> ('b s, 'c) q
| C : 'c r -> (zero, 'c s) q
and _ r =
| End : zero r
| C : 'c r -> 'c s r
type start = Start of (zero,zero) p
let v = Start (A (B (C End)))
let v = Start (A (A (B (B (C (C End))))))
Non recursive languages
We can also encode solutions of Post Correspondance Problems (PCP), which are not recursive languages:
Suppose we have two alphabets A = { X, Y, Z } et O = { a, b } and two morphisms m1 and m2 from A to O* defined as
- m1(X) = a, m1(Y) = ab, m1(Z) = bba
- m2(X) = baa, m2(Y) = aa, m2(Z) = bb
Solutions of this instance of PCP are words such that their images by m1 and m2 are equal. for instance ZYZX is a solution: both images are bbaabbbaa. The language of solution can be represented by this type declaration:
type empty
type 'a a = Dummy
type 'a b = Dummy
type (_,_) z =
| X : ('t1, 't2) s -> ('t1 a, 't2 b a a) z
| Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) z
| Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) z
and (_,_) s =
| End : (empty,empty) s
| X : ('t1, 't2) s -> ('t1 a, 't2 b a a) s
| Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) s
| Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) s
type start = Start : ('a, 'a) z -> start
let v = X (Z (Y (Z End)))
let r = Start (X (Z (Y (Z End))))
Open question
Can every context free language (not deterministic) be represented like that ? Notice that the classical example of the palindrome can be represented (proof let to the reader).
Conclusion
So we have a nice extension available that allows you to define a new syntax by merely declaring a type. The code is available on github. We are waiting for the nice syntax you will invent !
PS: Their may remain a small problem... If inadvertently you mistype something you may find some quite complicated type errors attacking you like a pyranha instead of a syntax error.
About OCamlPro:
OCamlPro is a R&D lab founded in 2011, with the mission to help industrial users benefit from experts with a state-of-the-art knowledge of programming languages theory and practice.
- We provide audit, support, custom developer tools and training for both the most modern languages, such as Rust, Wasm and OCaml, and for legacy languages, such as COBOL or even home-made domain-specific languages;
- We design, create and implement software with great added-value for our clients. High complexity is not a problem for our PhD-level experts. For example, we helped the French Income Tax Administration re-adapt and improve their internally kept M language, we designed a DSL to model and express revenue streams in the Cinema Industry, codename Niagara, and we also developed the prototype of the Tezos proof-of-stake blockchain from 2014 to 2018.
- We have a long history of creating open-source projects, such as the Opam package manager, the LearnOCaml web platform, and contributing to other ones, such as the Flambda optimizing compiler, or the GnuCOBOL compiler.
- We are also experts of Formal Methods, developing tools such as our SMT Solver Alt-Ergo (check our Alt-Ergo Users' Club) and using them to prove safety or security properties of programs.
Please reach out, we'll be delighted to discuss your challenges: contact@ocamlpro.com or book a quick discussion.
Most Recent Articles
2024
- opam 2.3.0 release!
- Optimisation de Geneweb, 1er logiciel français de Généalogie depuis près de 30 ans
- Alt-Ergo 2.6 is Out!
- Flambda2 Ep. 3: Speculative Inlining
- opam 2.2.0 release!
- Flambda2 Ep. 2: Loopifying Tail-Recursive Functions
- Fixing and Optimizing the GnuCOBOL Preprocessor
- OCaml Backtraces on Uncaught Exceptions
- Opam 102: Pinning Packages
- Flambda2 Ep. 1: Foundational Design Decisions
- Behind the Scenes of the OCaml Optimising Compiler Flambda2: Introduction and Roadmap
- Lean 4: When Sound Programs become a Choice
- Opam 101: The First Steps
2023
- Maturing Learn-OCaml to version 1.0: Gateway to the OCaml World
- The latest release of Alt-Ergo version 2.5.1 is out, with improved SMT-LIB and bitvector support!
- 2022 at OCamlPro
- Autofonce, GNU Autotests Revisited
- Sub-single-instruction Peano to machine integer conversion
- Statically guaranteeing security properties on Java bytecode: Paper presentation at VMCAI 23
- Release of ocplib-simplex, version 0.5