An Indentation Engine for OCaml
Since our last activity report we have released the first stable versions of two projects: OPAM, an installation manager for OCaml source packages, and ocp-indent, an indentation tool.
We have already described the basics of OPAM in two precedent blog posts, so today we will focus on the release of ocp-indent
.
Indentation should be consistent across editors
When you work on a very large code-base, it is crucial to keep a consistent indentation scheme. This is not only good for code review purposes (when the indentation carries semantic properties) but also when your code is starting to evolve and when the one who makes the change is not the one who wrote the initial piece of code. In the latter case, the variety of editors and local configurations usually leads to lot of small changes carrying no semantic value at all (such as changing tabs to spaces, adding few spaces at the beginning or end of lines, and so on). This semantic noise considerably decreases the efficiency of any code-review and change process and is usually very good at hiding hard-to-track bugs in your code-base.
A few months ago, the solutions for OCaml to this indentation problem
were limited. For instance, you could write coding guidelines and hope
that all the developers in your project would follow them. If you wanted
to be more systematic, you could create and share a common
configuration file for some popular editors (most OCaml developers use
the emacs’ tuareg-mode
or vim) but it is very hard to get
consistent indentation result across multiple tools. Moreover, having to
rely on a specific editor mode means that it is harder to fully
automatize the indentation process, for instance when setting-up a VCS
hook.
In order to overcome these pitfalls, Jane Street asked us to design a new external tool with the following high-level specification:
- it should be easy to use inside and outside any editor;
- it should understand the OCaml semantics and reflect it in the indentation;
- it should be easy to maintain and to extend;
So we started to look at the OCaml tools’ ecosystem and we found an early prototype of Jun Furuse’s ocaml-indent.
The foundation looked great but the result on real-world code sources
was not as nice as it could be, so we decided to start from this base to
build our new tool, that we called ocp-indent
. Today, ocp-indent
and ocaml-indent
do not have much code in common anymore, but the global architecture of the system remains the same.
Writing an indentation engine for OCaml
An indentation engine may seem like a rather simple problem: given any line in the program, we want to compute its indentation level, based on the code structure.
It turns out to be much more difficult than that, mainly because
indentation is only marginally semantic, and, worse, is a matter of
taste and “proper layout”. In short, it’s not a problem that can be
expressed concisely, because one really does want lots of specific cases
handled “nicely”, depending on the on-screen layout — position of line
breaks — rather than the semantic structure. Ocp-indent
does contain lots of ad-hoc logic for such cases. To make things harder,
the OCaml syntax is known to be difficult to handle, with a few
ambiguities.
Indent process
Ocp-indent
processes code in a simple and efficient way:
- We lex the input with a modified version of the OCaml lexer, to guarantee complete consistency with OCaml itself. The parser had to be modified to be more robust (ocaml fails on errors, the indentation tool should not) and to keep tokens like comments, quotations, and, in the latest version, some ocamldoc block delimiters.
- Taking the token stream as input, we maintain a “block” stack
that keeps informations like the kinds of blocks we have been through
to get to the cursor position, the column and the indentation
parameters. For instance, the “block” stack
[KBody KFfun; KLet; KBody KModule]
corresponds to the position ofX
in the following piece of (pseudo-) code:
…
module Foo = struct
…
let f = fun a &> X
- Each token may look up the stack to find its starting counterpart (
in
will look forKLet
, etc.), or disambiguate (=
will look forKLet
, stopping on opening tokens likeKBracket
, and will be inserted as an operator if none is found). This is flexible enough to allow for “breaking” the stack when incorrect grammar is found. For example, the unclosed paren inmodule let x = ( end
should not break indent after theend
. Great care was taken in deciding what tokens should be able to remove from the stack in which conditions. - The stack can also be used to find a token that we want to align on, typically bars
|
in a pattern-matching. - On every line break, the stack can be used to compute the indentation of the next line.
- In the case of partial file indentation (typically, reindenting one line or a single block), on lines that shouldn’t be reindented the stack is reversely updated to adapt to the current indentation.
Priorities
The part where some abstraction can be put into the engine is the
knowledge of the semantics, and more precisely of the scope of the
operations. It’s also in that case that the indenter can help you write,
and not only read, your code. On that matter, ocp-indent
has a knowledge of the precedence of operators and constructs that is
used to know how far to unwind the stack, and what to align on. For
example, a ;
will flush function applications and most operators.
It is that part that gives it the most edge over tuareg
,
and avoids semantically incorrect indents. All infix operators are
defined with a priority, a kind of indentation (indentation increment or
alignment over the above concerned expression), and an indentation
value (number of spaces to add). So for example most operators have a
priority lower than function application, but not .
, which yields correct results for:
let f =
somefun
record.
field
y
+ z
Boolean operators like &&
and ||
are setup for alignment instead of indentation:
let r = a
|| b
&& c
|| d
Additionally, some special operators are wanted with a negative alignment in some cases. This is also handled in a generic way by the engine. In particular, this is the case for ;
or |
:
type t = A
| B
let r = { f1 = x
; f2 = y
}
A note on the integration in editors
ocp-indent
can be used on the command-line to reindent whole files (or part of them with --lines
),
but the most common use of an indenter is from an editor. If you are
lucky enough to be able to call OCaml code from your editor, you can use
it directly as a library, but otherwise, the preferred way is to use
the option --numeric
: instead of reprinting the file
reindented, it will only output indentation levels, which you can then
process from your editor (for instance, using indent-line-to
with emacs). That should be cheaper and will help preserve cursor position, etc.
Currently, a simple emacs binding working on either the ocaml or the tuareg mode is provided, together with a vim mode contributed by Raphaël Proust and David Powers.
Results
We’ve built ocp-indent
based on a growing collection of unit-tests. If you find an indentation bug, feel free to send us a code snippet that we will incorporate into our test suite.
Our tests clearly show that the deep understanding that ocp-indent
has of the OCaml syntax makes it shines on specific cases. We are still
discussing and evaluating the implementation of few corner-cases
related, see for instance the currently failing tests.
We have also run some benchmarks on real code-bases and the result is quite conclusive: ocp-indent
is always better than tuareg! This is a very nice result as most of the
existing source files are either indented manually or are following
tuareg standards. But ocp-indent
is also orders of magnitude faster, which means you can integrate it seamlessly into any automatic process.
Au sujet d'OCamlPro :
OCamlPro développe des applications à haute valeur ajoutée depuis plus de 10 ans, en utilisant les langages les plus avancés, tels que OCaml et Rust, visant aussi bien rapidité de développement que robustesse, et en ciblant les domaines les plus exigeants (méthodes formelles, cybersécurité, systèmes distribués/blockchain, conception de DSLs). Fort de plus de 20 ingénieurs R&D, avec une expertise unique sur les langages de programmation, aussi bien théorique (plus de 80% de nos ingénieurs ont une thèse en informatique) que pratique (participation active au développement de plusieurs compilateurs open-source, prototypage de la blockchain Tezos, etc.), diversifiée (OCaml, Rust, Cobol, Python, Scilab, C/C++, etc.) et appliquée à de multiples domaines. Nous dispensons également des [formations sur mesure sur OCaml, Rust, et les méthodes formelles] (https://training.ocamlpro.com/) Pour nous contacter : contact@ocamlpro.com.
Articles les plus récents
2024
- opam 2.3.0 release!
- Optimisation de Geneweb, 1er logiciel français de Généalogie depuis près de 30 ans
- Alt-Ergo 2.6 is Out!
- Flambda2 Ep. 3: Speculative Inlining
- opam 2.2.0 release!
- Flambda2 Ep. 2: Loopifying Tail-Recursive Functions
- Fixing and Optimizing the GnuCOBOL Preprocessor
- OCaml Backtraces on Uncaught Exceptions
- Opam 102: Pinning Packages
- Flambda2 Ep. 1: Foundational Design Decisions
- Behind the Scenes of the OCaml Optimising Compiler Flambda2: Introduction and Roadmap
- Lean 4: When Sound Programs become a Choice
- Opam 101: The First Steps
2023
- Maturing Learn-OCaml to version 1.0: Gateway to the OCaml World
- The latest release of Alt-Ergo version 2.5.1 is out, with improved SMT-LIB and bitvector support!
- 2022 at OCamlPro
- Autofonce, GNU Autotests Revisited
- Sub-single-instruction Peano to machine integer conversion
- Statically guaranteeing security properties on Java bytecode: Paper presentation at VMCAI 23
- Release of ocplib-simplex, version 0.5