Flambda2 Ep. 2: Loopifying Tail-Recursive Functions




Two camels are taking a break from crossing the desert; they know their path could not have been more optimised.

Welcome to a new episode of The Flambda2 Snippets!

Today's topic is Loopify, one of Flambda2's many optimisation algorithms, which specifically deals with optimising purely tail-recursive functions as well as functions annotated with the [@@loop] attribute in OCaml.

A lazy explanation of its utility would be to say that it simply aims at reducing the number of memory allocations in the context of recursive and tail-recursive function calls in OCaml. However, we will see that this is only part of the point, so we will address the broader context: what tail-calls are, how they are optimised, how they fit into the functional programming world, which dilemma Loopify resolves exactly and, in time, many details on how it's all implemented!

If you happen to be stumbling upon this article and wish to get a bird's-eye view of the entire F2S series, be sure to refer to Episode 0, which provides a good amount of context as well as summaries of, and pointers to, all subsequent episodes.

All feedback is welcome, thank you for staying tuned and happy reading!

The F2S blog posts aim at gradually introducing the world to the inner workings of a complex piece of software engineering: the Flambda2 Optimising Compiler, a technical marvel born from a 10-year-long effort in Research & Development and Compilation, with many more years of expertise in all aspects of Computer Science and Formal Methods.

Tail-Call Optimisation

As far as we know, Tail-Call Optimisation (TCO) has been a reality since at least the 1970s. Some LISP implementations used it, and Scheme made it part of its language specification around 1975.

The debate over whether to support TCO still comes up regularly today. Nowadays, it's a given that most functional languages support it (Scala, OCaml, Haskell, Scheme and so on). Other languages and compilers have supported it for some time too: either optionally, as with some C compilers (gcc and clang) that perform TCO in specific compilation scenarios; or systematically, as in Lua, which, despite not usually being considered a functional language, specifies that TCO occurs whenever possible (you may want to read section 3.4.10 of the Lua manual here).

So what exactly is Tail-Call Optimisation?

A good place to start is the Wikipedia page. You may also find some precious insight about the link between the semantics of GOTO and tail calls here, in a course from Xavier Leroy at the Collège de France (in French).

In addition to these resources, here are some images to help you visualise how TCO improves stack memory consumption. Assume that g is a recursive function called from f:

A representation of the textbook behaviour for recursive functions stackframe allocations. You can see here that the stackframes of non-tail-recursive functions are allocated sequentially on decreasing memory addresses which may eventually lead to a stack overflow.

Now, let's consider a tail-recursive implementation of the g function, which we will call t_rec_g, in a context where TCO is not supported. Tail-recursion means that the last thing t_rec_g does before returning is call itself. The key is that we still have a frame for the calling instance of t_rec_g, but we know it will only be used to return to the parent. The frame no longer holds any relevant information besides the return address, so the corresponding memory space is mostly wasted.
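
To make the figures concrete, here is a minimal sketch of our own; the countdown logic and the parameter names are assumptions, only the names f, g and t_rec_g come from the figures.

(* Plain recursion: the addition happens after the recursive call
   returns, so each nested call needs its own stack frame. *)
let rec g n = if n = 0 then 0 else 1 + g (n - 1)
let f n = g n

(* Tail-recursive variant: the recursive call is the very last thing
   the function does, so with TCO its frame can be reused. *)
let rec t_rec_g acc n = if n = 0 then acc else t_rec_g (acc + 1) (n - 1)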

A representation of the textbook behaviour for tail-recursive functions stackframe allocations without Tail Call Optimisation (TCO). When TCO is not implemented the behaviour for these allocations and the potential for a stack overflow are the same as with non-tail-recursive functions.

And finally, let us look at the same function in a context where TCO is supported. It is now apparent that memory consumption is much improved: we reuse the space of the previous stackframe to allocate the next one, all the while preserving its return address:

A representation of the textbook behaviour for tail-recursive functions stackframe allocations with TCO. Since TCO is implemented, we can see that the stack memory consumption is now constant, and that the potential that this specific tail-recursive function will lead to a stack overflow is diminished.

Tail-Calls in OCaml

The List data structure is fundamental to, and ubiquitous in, functional programming. It is therefore important not to have an arbitrary limit on the size of the lists one can manipulate. Indeed, most list-manipulation functions are naturally expressed as recursive functions, and can most of the time be implemented as tail-recursive functions. Without guaranteed TCO, a programmer could have no assurance that their program would not stack overflow at some point. The same reasoning applies to many other recursive data structures that commonly occur in programs and libraries.

In OCaml, TCO is guaranteed. Ever since the language's inception, Cameleers have unanimously agreed to guarantee the optimisation of tail-calls. While the compiler has supported TCO from the beginning, an attribute, [@tailcall], was later added to help users ensure that their calls are in tail position.
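
Here is a minimal sketch of how the attribute is used (the length function is our own example): the attribute is placed at the call site, and the compiler emits a warning if the annotated call turns out not to be in tail position.

let rec length acc = function
  | [] -> acc
  (* The compiler checks that this call really is a tail call *)
  | _ :: tl -> (length [@tailcall]) (acc + 1) tl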

More recently, TCO was also extended with the Tail Mod Cons (TMC) optimisation, which allows the compiler to generate tail calls in more cases.
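
For instance, the natural way of writing map is not tail-recursive, because the cons cell is built after the recursive call returns; under TMC (available since OCaml 4.14) the compiler can nevertheless make it run in constant stack space. A minimal sketch:

(* Not tail-recursive as written, but [@tail_mod_cons] lets the
   compiler construct the cons cell first and fill in its tail
   destructively, turning the recursive call into a tail call. *)
let[@tail_mod_cons] rec map f = function
  | [] -> []
  | x :: tl -> f x :: map f tl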

The Conundrum of Reducing Allocations Versus Writing Clean Code

One of the main motivations for the existence of Loopify can be found in the following conversation: a Discuss post about the unboxing of floating-point values in OCaml and performance.

This specific comment sparks a secondary conversation that you may want to read yourself; you will find a quick breakdown of it below, which is a nice starting point for understanding today's subject.

Consider the following code:

let sum l =
  let rec loop s l =
    match l with
    | [] -> s
    | hd :: tl ->
      (* This allocates a boxed float *)
      let s = s +. hd in
      loop s tl 
  in
  loop 0. l

This is a simple tail-recursive implementation of a sum function for a list of floating-point numbers. However, it is not as efficient as we would like it to be.

Indeed, OCaml needs a uniform representation of its values in order to implement polymorphic functions. In the case of floating-point numbers, this means that the numbers are boxed whenever they need to be used as generic values.

Besides, every time we call a function, all parameters have to be considered as generic values. We thus cannot avoid the allocation of a boxed float at each recursive call in this function.
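
To make this concrete, here is a tiny example of our own: pair is polymorphic, so its argument must fit the uniform value representation, and a float passed to it is therefore handled in its boxed form.

(* [pair] has type 'a -> 'a * 'a: it treats its argument as a
   generic value, so the float below crosses the call boxed. *)
let pair x = (x, x)
let boxed_pair = pair 3.14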

If we were to optimise it in order to get every last bit of performance out of it, we could try something like:

Warning: the following was coded by trained professionals, do NOT try this at home.

let sum l = 
  (* Local references *)
  let s = ref 0. in
  let cur = ref l in
  try
    while true do
      match !cur with
      | [] -> raise Exit
      | hd :: tl ->
        (* Unboxed floats -> No allocation *)
        s := !s +. hd;
        cur := tl
    done; assert false
  with Exit -> !s (* The only allocation *)

While references in general introduce one allocation and a layer of indirection, when the compiler can prove that a reference is strictly local to a given function, it will use a mutable variable instead of a reference cell.

In our case, s and cur do not escape the function and are therefore eligible for this optimisation.
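
For contrast, here is a counter-example of our own where this optimisation cannot apply: r escapes make_counter through the returned closure, so it must remain a genuine heap-allocated reference cell.

(* [r] outlives the call to [make_counter] because the closure
   captures it, so it cannot become a local mutable variable. *)
let make_counter () =
  let r = ref 0 in
  fun () -> incr r; !r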

After this optimisation, s is now a mutable variable of type float and so it can also trigger another optimisation: float unboxing.

You can see more details here, but note that, in this specific example, all occurrences of boxing operations disappear except a single one at the end of the function.

We like to think that not forcing the user to write such code is a benefit, to say the least.


Loopify

Concept

There is a general concept of transforming function-level control flow into direct IR continuations in order to benefit from "basic block-level" optimisations. One such pattern is present in the local-function optimisation triggered by the [@local] attribute (here's the link to the PR that implements it). Loopify is an attempt to extend the range of this kind of optimisation to proper (meaning self) tail-recursive calls.

As you saw previously, in some cases (e.g., numerical computation), recursive functions can hurt performance because they introduce allocations.

That lost performance can be recovered by hand-writing loops using local references; however, it's unfortunate to encourage non-functional code in a language such as OCaml.

One of Flambda and Flambda2's goals is to avoid such situations and to allow good-looking, functional code to be as performant as code written and optimised by hand at the user level.

Therefore, we introduce a solution to the specific problem described above with Loopify, which, in a nutshell, transforms tail-recursive functions into non-recursive functions containing a loop, hence the name.

Deciding to Loopify or not

The decision to loopify a given function is made during the conversion from the Lambda IR to the Flambda2 IR. The transformation is triggered in two cases:

  • when a function is purely tail-recursive, meaning all its uses within its body are self tail calls (such calls are called proper calls);
  • when an annotation is given by the user in the source code using the [@@loop] attribute.

Let's look at an example of each:

(* Not a tail-rec function: is not loopified *)
let rec map f = function
  | [] -> []
  | x :: r -> f x :: map f r

(* Is tail-rec: is loopified *)
let rec fold_left f acc = function
  | [] -> acc
  | x :: r -> fold_left f (f acc x) r

Here, the decision to loopify is automatic and requires no input from the user. Quite straightforward.


Onto the second case now:

(* Helper function, not recursive, nothing to do. *)
let log dbg f arg =
  if dbg then
    print_endline "Logging...";
  f arg
[@@inline]

(* 
  Not tail-rec in the source, but may become
  tail-rec after inlining of the [log] function.
  At this point we can loopify, provided that the
  user specified a [@@loop] attribute.
*)
let rec iter_with_log dbg f = function
  | [] -> ()
  | x :: r ->
    f x;
    log dbg (iter_with_log dbg f) r
[@@loop]

The recursive function iter_with_log is not initially purely tail-recursive.

However, after inlining the log function and simplifying, the new code for iter_with_log becomes purely tail-recursive.

At that point we have the ability to loopify the function, but we refrain from doing so unless the user has specified the [@@loop] attribute on the function definition.

The nature of the transformation

Onto the details of the transformation.

First, we introduce a recursive continuation at the start of the function. Let's call it k_self.

Then, at each tail-recursive call, we replace the function call with a continuation call to k_self, with the same arguments as the original call.

let rec iter_with_log dbg f l =
  let_cont rec k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      log dbg (iter_with_log dbg f) r
  in
  apply_cont k_self (dbg, f, l)

Then, we inline the log function:

let rec iter_with_log dbg f l =
  let_cont rec k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      (* Here the inlined code starts *)
      (*
        We first start by binding the arguments of the
        original call to the parameters of the function's code
       *)
      let dbg = dbg in
      let f = iter_with_log dbg f in
      let arg = r in
      if dbg then
        print_endline "Logging...";
      f arg
  in
  apply_cont k_self (dbg, f, l)

Following these transformations, a proper self tail call appears, which we replace with the corresponding continuation call.

let rec iter_with_log dbg f l =
  let_cont rec k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      (* Here the inlined code starts *)
      (*
        Here, the let bindings have been substituted
        by the simplification.
       *)
      if dbg then
        print_endline "Logging...";
      apply_cont k_self (dbg, f, r)
  in
  apply_cont k_self (dbg, f, l)

In this context, the benefit of transforming a function call into a continuation call is mainly about allowing other optimisations to take place. As shown in the previous section, one of these optimisations is unboxing, which can be important in some cases like numerical computation. Such optimisations can take place because continuations are local to a function, whereas function calls following the OCaml ABI would require a prior global analysis.
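
For instance, here is what the sum function from earlier might look like once its inner loop function has been loopified and then inlined into sum. This is a sketch of ours in the same pseudo-IR notation as above, not actual compiler output; because k_self is local to sum, the simplifier sees all of its call sites and can unbox s.

let sum l =
  let_cont rec k_self s l =
    match l with
    | [] -> s
    | hd :: tl ->
      (* [s] can now be kept unboxed across iterations *)
      let s = s +. hd in
      apply_cont k_self (s, tl)
  in
  apply_cont k_self (0., l)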

One could think that a continuation call is intrinsically cheaper than a function call. However, the OCaml compiler already optimises self tail calls such that they are as cheap as continuation calls (i.e., a single jump instruction).

An astute reader could realise that this transformation can apply to any function and will result in one of three outcomes:

  • if the function is not tail-recursive, or even not recursive at all, the transformation does nothing;
  • if the function is purely tail-recursive, then all recursive calls are replaced with continuation calls and the function is no longer recursive after the optimisation. This allows us to later inline it and even specialise some of its arguments, as sketched below. This is precisely the case in which we automatically decide to loopify a function;
  • if the function is not purely tail-recursive but contains some tail-recursive calls, then the transformation rewrites those calls but not the other ones. This may result in better code, but it's hard to be sure in advance. In such cases (and in cases where functions become purely tail-recursive only after inlining), users can force the transformation by using the [@@loop] attribute.
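
As an illustration of the second outcome, consider fold_left from earlier, together with a call site of our own invention. Once loopified, fold_left is no longer recursive, so the compiler may inline it below and specialise it on the addition closure:

(* Purely tail-recursive, hence loopified: afterwards it contains a
   loop but is no longer a recursive function. *)
let rec fold_left f acc = function
  | [] -> acc
  | x :: r -> fold_left f (f acc x) r

(* Hypothetical call site: the non-recursive, loopified [fold_left]
   can be inlined here and specialised on [fun acc x -> acc + x]. *)
let sum_ints l = fold_left (fun acc x -> acc + x) 0 l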

Conclusion

Here it is, the concept behind the Loopify optimisation pass as well as the general context and philosophy which led to its inception!

It should be clear enough by now that having to choose between writing clean code and writing efficient code was always unsatisfactory to us. With Loopify, as well as with the rest of the Flambda and Flambda2 compiler backends, we aim at making sure that users do not have to write imperative code to get the efficiency of hand-optimised code, ideally making any way of writing a piece of code as efficient as the next.

This article describes one of the very first user-facing optimisations in this series of snippets on Flambda2. We have not gone into any of the neat implementation details yet; that is a topic for another time. The inner workings of Loopify will become much clearer next time we talk about it.

Loopify is only applied automatically when the tail-recursive nature of a function is visible in the source from the get-go. However, the optimisations applied by Loopify can still very much be useful in other situations, as seen in this section. That is why we have the [@@loop] attribute, to enforce loopification. Good canonical examples for applying Loopify with the [@@loop] attribute are the following: a partially tail-recursive function (i.e., a function with only some tail-recursive paths), or a function which is not obviously tail-recursive in the source code but could become so after some optimisation steps.

This transformation illustrates a core principle behind the Flambda2 design: applying a somewhat naïve optimisation that is not transformative by itself, but that changes the way the compiler can look at the code and triggers a whole lot of other useful optimisations. Conversely, the fact that it is triggered in the middle of the inlining phase allows some non-obvious cases to become radically better. Coding a single optimisation that would discover the cases demonstrated in the examples above would be quite complex, whereas this one is rather simple thanks to these principles.

Throughout the entire series of snippets, we will continue seeing these principles in action, starting with the next blog post that will introduce Downward and Upward Traversals.

Stay tuned, and thank you for reading, until next time, see you Space Cowboy. 🤠



About OCamlPro:

OCamlPro is an R&D lab founded in 2011, with the mission to help industrial users benefit from experts with state-of-the-art knowledge of programming language theory and practice.

  • We provide audit, support, custom developer tools and training for both the most modern languages, such as Rust, Wasm and OCaml, and for legacy languages, such as COBOL or even home-made domain-specific languages;
  • We design, create and implement software with great added-value for our clients. High complexity is not a problem for our PhD-level experts. For example, we helped the French Income Tax Administration re-adapt and improve their internally kept M language, we designed a DSL to model and express revenue streams in the Cinema Industry, codename Niagara, and we also developed the prototype of the Tezos proof-of-stake blockchain from 2014 to 2018.
  • We have a long history of creating open-source projects, such as the Opam package manager, the LearnOCaml web platform, and contributing to other ones, such as the Flambda optimizing compiler, or the GnuCOBOL compiler.
  • We are also experts of Formal Methods, developing tools such as our SMT Solver Alt-Ergo (check our Alt-Ergo Users' Club) and using them to prove safety or security properties of programs.

Please reach out, we'll be delighted to discuss your challenges: contact@ocamlpro.com or book a quick discussion.