The Generic Syntax Extension

Auteurs: Çagdas Bozman
Date: 2014-04-01
Catégorie: OCaml

OCaml 4.01 with its new feature to disambiguate constructors allows to do a nice trick: a simple and generic syntax extension that allows to define your own syntax without having to write complicated parsetree transformers. We propose an implementation in the form of a ppx rewriter.

it does only a simple transformation: replace strings prefixed by an operator starting with ! by a series of constructor applications

for instance:

!! "hello 3"

is rewriten to

!! (Start (H (E (L (L (O (Space (N3 (End))))))))

How is that generic ? We will present you a few examples.

Base 3 Numbers

For instance, if you want to declare base 3 arbitrary big numbers, let's define a syntax for it. We first start by declaring some types.

type start = Start of p

and p =
  | N0 of stop
  | N1 of q
  | N2 of q

and q =
  | N0 of q
  | N1 of q
  | N2 of q
  | Underscore of q
  | End

and stop = End

This type will only allow to write strings matching the regexp 0 | (1|2)(0|1|2|_)*. Notice that some constructors appear in multiple types like N0. This is not a problem since constructor desambiguation will choose for us the right one at the right place. Let's now define a few functions to use it:

open Num

let rec convert_p = function
  | N0 (End) -> Int 0
  | N1 t -> convert_q (Int 1) t
  | N2 t -> convert_q (Int 2) t

and convert_q acc = function
  | N0 t -> convert_q (acc */ Int 3) t
  | N1 t -> convert_q (Int 1 +/ acc */ Int 3) t
  | N2 t -> convert_q (Int 2 +/ acc */ Int 3) t
  | Underscore t -> convert_q acc t
  | End -> acc

let convert (Start p) = convert_p p
# val convert : start -> Num.num = <fun>

And we can now try it:

let n1 = convert (Start (N0 End))
# val n1 : Num.num = <num 0>
let n2 = convert (Start (N1 (Underscore (N0 End))))
# val n2 : Num.num = <num 3>
let n3 = convert (Start (N1 (N2 (N0 End))))
# val n3 : Num.num = <num 15>

And the generic syntax extension allows us to write:

let ( !! ) = convert

let n4 = !! "120_121_000"
val n4 : Num.num = <num 11367>

Specialised Format Strings

We can implement specialised format strings for a particular usage. Here, for concision we will restrict to a very small subset of the classical format: the characters %, i, c and space

Let's define the constructors.

type 'a start = Start of 'a a

and 'a a =
  | Percent : 'a f -> 'a a
  | I : 'a a -> 'a a
  | C : 'a a -> 'a a
  | Space : 'a a -> 'a a
  | End : unit a

and 'a f =
  | I : 'a a -> (int -> 'a) f
  | C : 'a a -> (char -> 'a) f
  | Percent : 'a a -> 'a f

Let's look at the inferred type for some examples:

let (!*) x = x

let v = !* "%i %c";;
# val v : (int -> char -> unit) start = Start (Percent (I (Space (Percent (C End)))))
let v = !* "ici";;
# val v : unit start = Start (I (C (I End)))

This is effectively the types we would like for a format string looking like that. To use it we can define a simple printer:

let rec print (Start cons) =
  main cons

and main : type t. t a -> t = function
  | I r ->
    print_string "i";
    main r
  | C r ->
    print_string "c";
    main r
  | Space r ->
    print_string " ";
    main r
  | End -> ()
  | Percent f ->
    format f

and format : type t. t f -> t = function
  | I r ->
    fun i ->
      print_int i;
      main r
  | C r ->
    fun c ->
      print_char c;
      main r
  | Percent r ->
    print_string "%";
    main r

let (!!) cons = print cons

And voila!

let s = !! "%i %c" 1 'c';;
# 1 c

How generic is it really ?

It may not look like it, but we can do almost any syntax we might want this way. For instance we can do any regular language. To explain how we transform a regular language to a type definition, we will use as an example the language a(a|)b

type start = Start of a

and a =
  | A of a';

and a' =
  | A of b
  | B of stop

and b = B of stop

and stop = End

We can try a few things on it:

let v = Start (A (A (B End)))
# val v : start = Start (A (A (B End)))

let v = Start (A (B End))
# val v : start = Start (A (B End))

let v = Start (B End)
# Characters 15-16:
#   let v = Start (B End);;
#                  ^
# Error: The variant type a has no constructor B

let v = Start (A (A (A (B End))))
# Characters 21-22:
#  let v = Start (A (A (A (B End))));;
#                        ^
# Error: The variant type b has no constructor A

Assumes the language is given as an automaton that:

  • has 4 states, a, a', b and stop
  • with initial state a
  • with final state stop
  • with transitions: a - A -> a' a' - A -> b a' - B -> stop b - B -> stop let's write {c} for the constructor corresponding to the character c and


for the type corresponding to a state of the automaton.

  • For each state q we have a type declaration [q]
  • For each letter a of the alphabet we have a constructor {a}
  • For each transition p - l -> q we have a constructor {l} with parameter [q] in type [p]:
type [p] = {l} of [q]
  • The End constructor without any parameter must be present in any final state
  • The initial state e is declared by
type start = Start of [e]

Yet more generic

In fact we can encode deterministic context free languages (DCFL) also. To do that we encode pushdown automatons. Here we will only give a small example: the language of well parenthesized words

type empty
type 'a r = Dummy

type _ q =
  | End : empty q
  | Rparen : 'a q -> 'a r q
  | Lparen : 'a r q -> 'a q

type start = Start of empty q

let !! x = x

let m = ! ""
let m = ! "()"
let m = ! "((())())()"

To encode the stack, we use the type parameters: Lparen pushes an r to the stack, Rparen consumes it and End checks that the stack is effectively empty.

There are a few more tricks needed to encode tests on the top value in the stack, and a conversion of a grammar to Greibach normal form to allow this encoding.

We can go even further

a^n b^n c^n

In fact we don't need to restrict to DCFL, we can for instance encode the a^n.b^n.c^n language which is not context free:

type zero
type 'a s = Succ

type (_,_) p =
  | End : (zero,zero) p
  | A : ('b s, 'c s) p -> ('b, 'c) p
  | B : ('b, 'c s) q -> ('b s, 'c s) p

and (_,_) q =
  | B : ('b, 'c) q -> ('b s, 'c) q
  | C : 'c r -> (zero, 'c s) q

and _ r =
  | End : zero r
  | C : 'c r -> 'c s r

type start = Start of (zero,zero) p

let v = Start (A (B (C End)))
let v = Start (A (A (B (B (C (C End))))))

Non recursive languages

We can also encode solutions of Post Correspondance Problems (PCP), which are not recursive languages:

Suppose we have two alphabets A = { X, Y, Z } et O = { a, b } and two morphisms m1 and m2 from A to O* defined as

  • m1(X) = a, m1(Y) = ab, m1(Z) = bba
  • m2(X) = baa, m2(Y) = aa, m2(Z) = bb

Solutions of this instance of PCP are words such that their images by m1 and m2 are equal. for instance ZYZX is a solution: both images are bbaabbbaa. The language of solution can be represented by this type declaration:

type empty
type 'a a = Dummy
type 'a b = Dummy

type (_,_) z =
  | X : ('t1, 't2) s -> ('t1 a, 't2 b a a) z
  | Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) z
  | Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) z

and (_,_) s =
  | End : (empty,empty) s
  | X : ('t1, 't2) s -> ('t1 a, 't2 b a a) s
  | Y : ('t1, 't2) s -> ('t1 a b, 't2 a a) s
  | Z : ('t1, 't2) s -> ('t1 b b a, 't2 b b) s

type start = Start : ('a, 'a) z -> start

let v = X (Z (Y (Z End)))
let r = Start (X (Z (Y (Z End))))

Open question

Can every context free language (not deterministic) be represented like that ? Notice that the classical example of the palindrome can be represented (proof let to the reader).


So we have a nice extension available that allows you to define a new syntax by merely declaring a type. The code is available on github. We are waiting for the nice syntax you will invent !

PS: Their may remain a small problem... If inadvertently you mistype something you may find some quite complicated type errors attacking you like a pyranha instead of a syntax error.

Au sujet d'OCamlPro :

OCamlPro développe des applications à haute valeur ajoutée depuis plus de 10 ans, en utilisant les langages les plus avancés, tels que OCaml et Rust, visant aussi bien rapidité de développement que robustesse, et en ciblant les domaines les plus exigeants (méthodes formelles, cybersécurité, systèmes distribués/blockchain, conception de DSLs). Fort de plus de 20 ingénieurs R&D, avec une expertise unique sur les langages de programmation, aussi bien théorique (plus de 80% de nos ingénieurs ont une thèse en informatique) que pratique (participation active au développement de plusieurs compilateurs open-source, prototypage de la blockchain Tezos, etc.), diversifiée (OCaml, Rust, Cobol, Python, Scilab, C/C++, etc.) et appliquée à de multiples domaines. Nous dispensons également des [formations sur mesure certifiées Qualiopi sur OCaml, Rust, et les méthodes formelles] ( Pour nous contacter :