Flambda2 Ep. 2: Loopifying Tail-Recursive Functions
Welcome to a new episode of The Flambda2 Snippets!
Today's topic is Loopify, one of Flambda2's many optimisation algorithms, which specifically deals with optimising purely tail-recursive functions as well as functions annotated with the [@@loop] attribute in OCaml.
A lazy explanation of its utility would be to say that it simply aims at reducing the number of memory allocations in the context of recursive and tail-recursive function calls in OCaml. However, we will see that this is just part of the point, and we will thus address the broader context: what tail-calls are, how they are optimised, how they fit in the functional programming world, what dilemma Loopify nullifies exactly and, in time, many details on how it's all implemented!
If you happen to be stumbling upon this article and wish to get a bird's-eye view of the entire F2S series, be sure to refer to Episode 0 which does a good amount of contextualising as well as summarising of, and pointing to, all subsequent episodes.
All feedback is welcome, thank you for staying tuned and happy reading!
The F2S blog posts aim at gradually introducing the world to the inner workings of a complex piece of software engineering: the Flambda2 Optimising Compiler, a technical marvel born from a 10-year-long effort in Research & Development and Compilation, with many more years of expertise in all aspects of Computer Science and Formal Methods.
Tail-Call Optimisation
As far as we know, Tail-Call optimisation (TCO) has been a reality since at least the 70s. Some LISP implementations used it and Scheme specified it into its language around 1975.
The debate over whether to support TCO still comes up regularly today. Nowadays, it's a given that most functional languages support it (Scala, OCaml, Haskell, Scheme and so on). Other languages and compilers have supported it for some time too: either optionally, as with some C compilers (gcc and clang) that perform TCO in specific compilation scenarios; or systematically, like Lua, which, despite not usually being considered a functional language, specifies that TCO occurs whenever possible (you may want to read section 3.4.10 of the Lua manual here).
So what exactly is Tail-Call Optimisation?
A place to start would be the Wikipedia page. You may also find some precious insight about the link between the semantics of GOTO and tail calls here, in a course from Xavier Leroy at the Collège de France (in French).
In addition to these resources, here are images to help you visualise how TCO improves stack memory consumption. Assume that g is a recursive function called from f:
Now, let's consider a tail-recursive implementation of the g function in a context where TCO is not supported. Tail recursion means that the last thing t_rec_g does before returning is call itself. The key is that we still have a frame for the caller version of t_rec_g, but we know that it will only be used to return to the parent. The frame no longer holds any relevant information besides the return address, so the corresponding memory space is mostly wasted.
And finally, let us look at the same function in a context where TCO is supported. It is now apparent that memory consumption is much improved by the fact that we reuse the space from the previous stackframe to allocate the next one all the while preserving its return address:
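To make the pictures concrete, here is a minimal OCaml sketch. The names f, g and t_rec_g mirror those used above, but the bodies are our own illustration: g sums a list with a non-tail recursive call, so it needs one stack frame per element, while t_rec_g carries an accumulator and recurses in tail position, so with TCO it runs in constant stack space.

```ocaml
(* Non-tail version: [x + g r] must run *after* the recursive call
   returns, so each list element keeps a live stack frame. *)
let rec g = function
  | [] -> 0
  | x :: r -> x + g r

(* Tail-recursive version: the recursive call is the very last thing
   done, so TCO lets the same stack frame be reused at every step. *)
let rec t_rec_g acc = function
  | [] -> acc
  | x :: r -> t_rec_g (acc + x) r

(* [f] plays the role of the caller from the diagrams. *)
let f l = t_rec_g 0 l
```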
Tail-Calls in OCaml
The List
data structure is fundamental to and ubiquitous in functional
programming. Therefore, it's important to not have an arbitrary limit on the
size of lists that one can manipulate. Indeed, most List
manipulation functions
are naturally expressed as recursive functions, and can most of the time be
implemented as tail-recursive functions. Without guaranteed TCO, a programmer
could not have the assurance that their program would not stack overflow at
some point. That reasoning also applies to a lot of other recursive data
structures that commonly occur in programs or libraries.
In OCaml, TCO is guaranteed. Ever since its inception, Cameleers have
unanimously agreed to guarantee the optimisation of tail-calls.
While the compiler has supported TCO from the beginning, an attribute, [@tailcall], was later added to help users ensure that their calls are in tail position.
More recently, TCO was also extended with the Tail Mod Cons optimisation, which makes it possible to generate tail-calls in more cases.
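As a quick illustration of both features (a sketch with functions of our own; the [@tail_mod_cons] attribute requires OCaml 4.14 or later): [@tailcall] makes the compiler warn if the annotated call is not actually in tail position, and [@tail_mod_cons] lets a function whose recursive call sits directly under a constructor run in constant stack space.

```ocaml
(* [@tailcall] documents and checks that the call is a tail call;
   the compiler emits a warning if it is not. *)
let rec count_down n =
  if n = 0 then ()
  else (count_down [@tailcall]) (n - 1)

(* Tail Mod Cons: the recursive call appears under the [::]
   constructor, which is not a tail call per se, but
   [@tail_mod_cons] compiles it to run in constant stack space. *)
let[@tail_mod_cons] rec succ_all = function
  | [] -> []
  | x :: r -> (x + 1) :: succ_all r
```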
The Conundrum of Reducing Allocations Versus Writing Clean Code
One of the main purposes for the existence of Loopify can be found in the following conversation: a Discuss post about the unboxing of floating-point values in OCaml and performance. This specific comment sparked a secondary conversation that you may want to read yourself; you will find a quick breakdown of it below, which will be a nice starting point for understanding today's subject.
Consider the following code:
let sum l =
  let rec loop s l =
    match l with
    | [] -> s
    | hd :: tl ->
      (* This allocates a boxed float *)
      let s = s +. hd in
      loop s tl
  in
  loop 0. l
This is a simple tail-recursive implementation of a sum function for a list of floating-point numbers. However, it is not as efficient as we would like it to be.
Indeed, OCaml needs a uniform representation of its values in order to implement polymorphic functions. In the case of floating-point numbers, this means that the numbers are boxed whenever they need to be used as generic values. Moreover, every time we call a function, all parameters have to be treated as generic values; we thus cannot avoid their allocation at each recursive call in this function.
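One way to observe this boxing from regular OCaml code is to sample the minor-heap allocation counter around a call, using the standard Gc.minor_words function. This is a rough sketch only: the exact figures depend on the compiler version and optimisation flags, so treat the printed number as indicative.

```ocaml
(* Same tail-recursive sum as above. *)
let sum l =
  let rec loop s l =
    match l with
    | [] -> s
    | hd :: tl -> loop (s +. hd) tl
  in
  loop 0. l

(* [Gc.minor_words ()] returns the number of words allocated on the
   minor heap so far; sampling it before and after a call gives a
   rough picture of how much that call allocates. *)
let () =
  let l = List.init 1_000 float_of_int in
  let before = Gc.minor_words () in
  let total = sum l in
  let after = Gc.minor_words () in
  Printf.printf "sum = %f, words allocated = %f\n" total (after -. before)
```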
If we were to optimise it in order to get every last bit of performance out of it, we could try something like:
Warning: The following was coded by trained professionals, do NOT try this at home.
let sum l =
  (* Local references *)
  let s = ref 0. in
  let cur = ref l in
  try
    while true do
      match !cur with
      | [] -> raise Exit
      | hd :: tl ->
        (* Unboxed floats -> No allocation *)
        s := !s +. hd;
        cur := tl
    done;
    assert false
  with Exit -> !s (* The only allocation *)
While in general references introduce one allocation and a layer of indirection, when the compiler can prove that a reference is strictly local to a given function it will use mutable variables instead of reference cells.
In our case, s and cur do not escape the function and are therefore eligible for this optimisation.
After this optimisation, s is now a mutable variable of type float, and so it can also trigger another optimisation: float unboxing.
You can see more details here but note that, in this specific example, all occurrences of boxing operations disappear except a single one at the end of the function.
We like to think that not forcing the user to write such code is a benefit, to say the least.
Loopify
Concept
There is a general concept of transforming function-level control-flow into direct IR continuations in order to benefit from "basic block-level" optimisations. One such pattern is present in the local-function optimisation triggered by the [@local] attribute. Here's the link to the PR that implements it. Loopify is an attempt to extend the range of this kind of optimisation to proper (meaning self) tail-recursive calls.
As you saw previously, in some cases (e.g. numerical calculus), recursive functions hurt performance because they introduce allocations. That lost performance can be recovered by hand-writing loops using local references; however, it's unfortunate to encourage non-functional code in a language such as OCaml.
One of Flambda and Flambda2's goals is to avoid such situations and to allow good-looking, functional code to be as performant as code written and optimised by hand at the user level.
Therefore, we introduce a solution to the specific problem described above with Loopify, which, in a nutshell, transforms tail-recursive functions into non-recursive functions containing a loop, hence the name.
Deciding to Loopify or not
The decision to loopify a given function is made during the conversion from the Lambda IR to the Flambda2 IR. The transformation is triggered in two cases:
- when a function is purely tail-recursive, meaning all calls to it within its own body are self-tail calls (these are called proper calls);
- when the user annotates the function in the source code with the [@@loop] attribute.
Let's see an example of each:
(* Not a tail-rec function: is not loopified *)
let rec map f = function
  | [] -> []
  | x :: r -> f x :: map f r

(* Is tail-rec: is loopified *)
let rec fold_left f acc = function
  | [] -> acc
  | x :: r -> fold_left f (f acc x) r
Here, the decision to loopify is automatic and requires no input from the user. Quite straightforward.
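For intuition, here is a sketch of what a loopified fold_left could look like, written in the same informal let_cont/apply_cont notation used in the transformation walkthrough below (this is pseudo-IR, not OCaml source): the function itself is no longer recursive; only the continuation k_self is.

```
let fold_left f acc l =
  let_cont rec k_self f acc l =
    match l with
    | [] -> acc
    | x :: r -> apply_cont k_self (f, f acc x, r)
  in
  apply_cont k_self (f, acc, l)
```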
Onto the second case now:
(* Helper function, not recursive, nothing to do. *)
let log dbg f arg =
  if dbg then
    print_endline "Logging...";
  f arg
[@@inline]

(*
  Not tail-rec in the source, but may become
  tail-rec after inlining of the [log] function.
  At this point we can loopify, provided that the
  user specified a [@@loop] attribute.
*)
let rec iter_with_log dbg f = function
  | [] -> ()
  | x :: r ->
    f x;
    log dbg (iter_with_log dbg f) r
[@@loop]
The recursive function iter_with_log is not initially purely tail-recursive. However, after the inlining of the log function and subsequent simplification, the new code for iter_with_log becomes purely tail-recursive.
At that point we have the ability to loopify the function, but we refrain from doing so unless the user specifies the [@@loop] attribute on the function definition.
The nature of the transformation
Onto the details of the transformation.
First, we introduce a recursive continuation at the start of the function. Let's call it self.
Then, at each tail-recursive call, we replace the function call with a
continuation call to self
with the same arguments as the original call.
let rec iter_with_log dbg f l =
  let_cont rec k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      log dbg (iter_with_log dbg f) r
  in
  apply_cont k_self (dbg, f, l)
Then, we inline the log function:
let rec iter_with_log dbg f l =
  let_cont k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      (* Here the inlined code starts *)
      (*
        We first start by binding the arguments of the
        original call to the parameters of the function's code
      *)
      let dbg = dbg in
      let f = iter_with_log dbg f in
      let arg = r in
      if dbg then
        print_endline "Logging...";
      f arg
  in
  apply_cont k_self (dbg, f, l)
Following these transformations, we discover a proper tail-recursive call, which we replace with the adequate continuation call.
let rec iter_with_log dbg f l =
  let_cont k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      (* Here the inlined code starts *)
      (*
        Here, the let bindings have been substituted
        by the simplification.
      *)
      if dbg then
        print_endline "Logging...";
      apply_cont k_self (dbg, f, r)
  in
  apply_cont k_self (dbg, f, l)
In this context, the benefit of transforming a function call to a continuation
call is mainly about allowing other optimisations to take place. As shown
in the previous section, one of these optimisations is unboxing
which can be
important in some cases like numerical calculus. Such optimisations can take
place because continuations are local to a function while OCaml ABI-abiding
function calls require a prior global analysis.
One could think that a continuation call is intrinsically cheaper than a function call. However, the OCaml compiler already optimises self-tail-calls so that they are as cheap as continuation calls (i.e., a single jump instruction).
An astute reader might realise that this transformation can apply to any function, with one of three possible outcomes:
- if the function is not tail-recursive, or not recursive at all, the transformation does nothing;
- if the function is purely tail-recursive, then all recursive calls are replaced by continuation calls and the function is no longer recursive after the optimisation. This allows us to later inline it and even specialise some of its arguments. This is precisely the case in which we automatically decide to loopify a function;
- if the function is not purely tail-recursive but contains some tail-recursive calls, then the transformation rewrites those calls but not the other ones. This may result in better code, but it's hard to be sure in advance. In such cases (and in cases where functions become purely tail-recursive only after inlining), users can force the transformation with the [@@loop] attribute.
Conclusion
Here it is: the concept behind the Loopify optimisation pass, as well as the general context and philosophy which led to its inception!
It should be clear enough now that having to choose between writing clean code and writing efficient code was always unsatisfactory to us. With Loopify, as well as with the rest of the Flambda and Flambda2 compiler backends, we aim at making sure that users never have to resort to imperative code to match the efficiency of hand-optimised code; ideally, any way of writing a piece of code should be as efficient as the next.
This article describes one of the very first user-facing optimisations of this series of snippets on Flambda2. We have not gotten into any of the neat implementation details yet; that is a topic for another time. The inner workings of Loopify will be much clearer next time we talk about it.
Loopify is only applied automatically when the tail-recursive nature of a function is visible in the source from the get-go. However, the optimisations applied by Loopify can still very much be useful in other situations, as seen above. That is why we have the [@@loop] attribute to enforce loopification. Good canonical examples for applying Loopify with the [@@loop] attribute would be either of the following: loopifying a partially tail-recursive function (i.e., a function with only some tail-recursive paths), or a function which is not obviously tail-recursive in the source code but could become so after some optimisation steps.
This transformation illustrates a core principle behind the Flambda2 design: applying a somewhat naïve optimisation that is not transformative by itself, but that changes the way the compiler can look at the code and thereby triggers a whole lot of other useful optimisations. Conversely, being triggered in the middle of the inlining phase allows some non-obvious cases to become radically better. Coding a single optimisation that would discover the cases demonstrated in the examples above would be quite complex, while this one is rather simple thanks to these principles.
Throughout the entire series of snippets, we will continue seeing these principles in action, starting with the next blog post, which will introduce Downward and Upward Traversals.
Stay tuned, and thank you for reading, until next time, see you Space Cowboy. 🤠
About OCamlPro:
OCamlPro is an R&D lab founded in 2011, with the mission of helping industrial users benefit from experts with state-of-the-art knowledge of programming languages theory and practice.
- We provide audit, support, custom developer tools and training for both the most modern languages, such as Rust, Wasm and OCaml, and for legacy languages, such as COBOL or even home-made domain-specific languages;
- We design, create and implement software with great added-value for our clients. High complexity is not a problem for our PhD-level experts. For example, we helped the French Income Tax Administration re-adapt and improve their internally kept M language, we designed a DSL to model and express revenue streams in the Cinema Industry, codename Niagara, and we also developed the prototype of the Tezos proof-of-stake blockchain from 2014 to 2018.
- We have a long history of creating open-source projects, such as the Opam package manager, the LearnOCaml web platform, and contributing to other ones, such as the Flambda optimizing compiler, or the GnuCOBOL compiler.
- We are also experts of Formal Methods, developing tools such as our SMT Solver Alt-Ergo (check our Alt-Ergo Users' Club) and using them to prove safety or security properties of programs.
Please reach out, we'll be delighted to discuss your challenges: contact@ocamlpro.com or book a quick discussion.