Table des Contenus

Flambda2 Ep. 3: Speculative Inlining

Auteurs: Pierre Chambart, Vincent Laviron, Guillaume Bury, Dario Pinto, Nathanaëlle Courant

Date: 2024-08-09

Catégorie: OCaml

Tags: flambda2, flambda, compiler, optimisation, f2s, inlining

A representation of Speculative Inlining through the famous Weighing Of The Heart of Egyptian Mythology. Egyptian God Anubis weighs his OCaml function, to see if it is worth inlining.
Credit: The Weighing of the Heart Ceremony, Ammit. Angus McBride (British, 1931-2007)

Welcome to a new episode of The Flambda2 Snippets!

The F2S blog posts aim at gradually introducing the world to the inner-workings of a complex piece of software engineering: The Flambda2 Optimising Compiler for OCaml, a technical marvel born from a 10 year-long effort in Research & Development and Compilation; with many more years of expertise in all aspects of Computer Science and Formal Methods.

Today's article will serve as an introduction to one of the key design decisions structuring Flambda2 that we will cover in the next episode in the series: Upward and Downward Traversals.

See, there are interesting things to be said about how inlining is conducted inside of our compiler. Inlining in itself is rather ubiquitous in compilers. The goal here is to show how we approach inlining, and present what we call Speculative Inlining.

Table des Contenus

Inlining in general
When inlining is detrimental
How to decide when inlining is beneficial
Speculative inlining
Speculative inlining in practice
Summary
Conclusion

Inlining in general

Given the way people write functional programs, inlining is an important part of the optimisation pipeline of such functional langages.

What we call inlining in this series is the process of duplicating some code to specialise it to a specific context.

Usually, this can be thought as copy-pasting the body of a function at its call site. A common misunderstanding is to think that the main benefit of this optimisation is to remove the cost of the function call. However, with modern computer architectures, this has become less and less relevant in the last decades. The actual benefit is to use the specific context to trigger further optimisations.

Suppose we have the following option_map and double functions:

let option_map f x =
  match x with
  | None -> None
  | Some x -> Some (f x)

let double i =
  i + i

Additionally, suppose we are currently considering the following function:

let stuff () =
  option_map double (Some 21)

In this short example, inlining the option_map function would perform the following transformation:

let stuff () =
  let f = double in
  let x = Some 21 in
  match x with
  | None -> None
  | Some x -> Some (f x)

Now we can inline the double function.

let stuff () =
  let x = Some 21 in
  match x with
  | None -> None
  | Some x ->
    Some (let i = x in i + i)

As you can see, inlining alone isn't that useful of an optimisation per se. In this context, appliquing Constant Propagation will optimise and simplify it to the following:

let stuff () = Some 42

Although this is a toy example, combining small functions is a common pattern in functional programs. It's very convenient that using combinators is not significantly worse than writing this function by hand.

When inlining is detrimental

We cannot just go around and inline everything, everywhere... all at once.

As we said, inlining is mainly code duplication and that would be detrimental and blow the size of the compiled code drastically. However, there is a sweet spot to be found, between both absolute inlining and no inlining at all, but it is hard to find.

Here's an example of exploding code at inlining time:

(* val h : int -> int *)
let h n = (* Some non constant expression *)

(* val f : (int -> int) -> int -> int *)
let f g x = g (g x)

(* 4 calls to f -> 2^4 calls to h *)
let n = f (f (f (f h))) 42

Following through with the inlining process will produce a very large binary relative to its source code. This contrived example highlights potential problems that might arise in ordinary codebases in the wild, even if this one is tailored to be quite nasty for inlining: notice the exponential blowup in the number of nested calls, every additional call to f doubles the number of calls to h after inlining.

How to decide when inlining is beneficial

Most compilers use a collection of heuristics to guide them in the decision making. A good collection of heuristics is hard to both design, and fine-tune. They also can be quite specific to a programming style and unfit for other compilers to integrate. The take away is: there is no best way.

Side Note:

This topic would make for an interesting blog post but, unfortunately, rather remote from the point of this article. If you are interested in going deeper into that subject right now, we have found references for you to explore until we get around to writing a comprehensive, and more digestable, explanation about the heuristic nature of inlining:

Secrets of the Glasgow Haskell Compiler inliner, by SIMON PEYTON JONES and SIMON MARLOW, 2002.

Extending the Scope of Syntactic Abstraction, by OSCAR WADDELL, 1999. Section 4.4 (PDF Download link), for the case of Scheme.

Towards Better Inlining Decisions Using Inlining Trials, by JEFFREY DEAN and CRAIG CHAMBERS, 1994.

Understanding and Exploiting Optimal Function Inlining, by THEODOROS THEODORIDIS, TOBIAS GROSSER, ZHENDONG SU, 2022.

Before we get to a concrete example, and break down Speculative Inlining for you, we would like to discuss the trade-offs of duplicating code.

CPUs execute instructions one by one, or at least they pretend that they do. In order to execute an instruction, they need to load up into memory both code and data. In modern CPUs, most instructions take only a few cycles to execute and in practice, the CPUs often execute several at the same time. To put into perspective, loading memory, however, in the worst case, can take hundreds of CPU cycles... Most of the time it's not the case because CPUs have complex memory cache hierarchies such that loading from instruction cache can take just a few cycles, loading from level 2 caches may take dozens of them, and the worst case is loading from main memory which can take hundreds of cycles.

The take away is, when executing a program, the cost of one instruction that has to be loaded from main memory can be larger than the cost of executing a hundred instructions in caches.

There is a way to avoid the worst case scenario. Since caches are rather small in size, the main component to keeping from loading from main memory is to keep your program rather small, or at least the parts of it that are regularly executed.

Keep these orders of magnitude in mind when we address the trade-offs between improving the number of instructions that we run and keeping the program to a reasonably small size.

Before explaining Speculative Inlining let's consider a piece of code.

The following pattern is quite common in OCaml and other functional languages, let's see how one would go about inlining this code snippet.

Example 1: Notice the higher-order function f:

(*
  val f :
    (condition:bool -> int -> unit) 
    -> condition:bool
    -> int
    -> unit
 *)
let f g ~condition n =
  for i = 0 to n do
    g ~condition i
  done

let g_real ~condition i =
  if condition then
    (* small operation *)
  else
    (* big piece of code *)

let condition = true

let foo n =
  f g_real ~condition n

Even for such a small example we will see that the heuristics involved to finding the right solution can become quite complex.

Keeping in mind the fact that condition is always true, the best set of inlining decisions would yield the following code:

(* All the code before [foo] is kept as is, from the previous codeblock *)
let foo x = 
  for i = 0 to x do
    (* small operation *)
  done

But if condition had been always false, instead of small operation, we would have had a big chunk of g_real duplicated in foo (i.e: (* big piece of code *)). Moreover it would have only spared us the running time of a few call instructions. Therefore, we would have probably preferred to have kept ourselves from inlining anything.

Specifically, we would have liked to have stopped from inlining g, as well as to have avoided inlining f because it would have needlessly increased the size of the code with no substantial benefit.

However, if we want to be able to take an educated decision based on the value of condition, we will have to consider the entirety of the code relevant to that choice. Indeed, if we just look at the code for f, or its call site in foo, nothing would guide us to the right decision. In order to take the right decision, we need to understand that if the ~condition parameter to the g_real function is true, then we can remove a large piece of code, namely: the else branch and the condition check as well.

But to understand that the ~condition in g_real is always true, we need to see it in the context of f in foo. This implies again that, that choice of inlining is not based on a property of g_real but rather a property of the context of its call.

There exists a very large number of combinations of such difficult situations that would each require different heuristics which would be incredibly tedious to design, implement, and maintain.

Speculative inlining

We manage to circumvent the hurdle that this decision problem represents thanks to what we call Speculative Inlining. This strategy requires two properties from the compiler: the ability to inline and optimise at the same time, as well as being able to backtrack inlining decisions.

Lets look at Example 1 again and look into the Speculative Inlining strategy.

let f g ~condition n =
  for i = 0 to n do
    g ~condition i
  done

let g_real ~condition x =
  if condition then
    (* small operation *)
  else
    (* big piece of code *)

let condition = true

let foo x =
  f g_real ~condition x

We will focus only on the traversal of the foo function.

Before we try and inline anything, there are a couple things we have to keep in mind about values and functions in OCaml:

Application arity may not match function arity

To give you an idea, the function foo could also been written in the following way:

let foo x =
  let f1 = f in
  let f2 = f1 g_real in 
  let f3 = f2 ~condition in
  f3 x

We expect the compiler to translate it as well as the original, but we cannot inline a function unless all its arguments are provided. To solve this, we need to handle partial applications precisely. Over-applications also present similar challenges.

Functions are values in OCaml

We have to understand that the call to f in foo is not trivially a direct call to f in this context. Indeed, at this point functions could instead be stored in pairs, or lists, or even hashtables, to be later retrieved and applied at will, and we call such functions general functions.

Since our goal is to inline it, we need to know the body of the function. We call a function concrete when we have knowledge of its body. This entails Constant Propagation in order to associate a concrete function to general function values and, consequently, be able to simplify it while inlining.

Here's the simplest case to demonstrate the importance of Constant Propagation.

let foo_bar y =
  let pair = foo, y in
  (fst pair) (snd pair)

In this case, we have to look inside the pair in order to find the function, this demonstrates that we sometimes have to do some amount of value analysis in order to proceed. It's quite common to come across such cases in OCaml programs due to the module system and other functional languages present similar characteristics.

There are many scenarios which also require a decent amount of context in order to identify which function should be called. For example, when a function passed as parameter is called, we need to know the context of the caller functions, sometimes up to an arbitrarily large context. Analysing the relevant context will tell us which function is being called and thus help us make educated inlining decisions. This problem is specific to functional languages, functions in good old imperative languages are seldom ambiguous; even though such considerations would be relevant when function pointers are involved.

This small code snippet shows us that we have to inline some functions in order to know whether we should have inlined them.

Speculative inlining in practice

In practice, Speculative Inlining is being able to quantify the benefits brought by a set of optimisations, which have to be applied after a given inlining decision, and use these results to determine if said inlining decision is in fact worth to carry out all things considered.

The criteria for accepting an inlining decision is that the resulting code should be faster that the original one. We use "should be" because program speed cannot be fully understood with absolutes.

That's why we use a heuristic algorithm in order to compare the original and the optimised versions of the code. It roughly consists in counting the number of retired (executed) instructions and comparing it to the increase in code size introduced by inlining the body of that function. The value of that cut-off ratio is by definition heuristic and different compilation options given to ocamlopt change it.

As said previously, we cannot go around and evaluate each inlining decision independently because there are cases where inlining a function allows for more of them to happen, and sometimes a given inlining choice validates another one. We can see this in Example 1, where deciding not to inline function g_real would make the inlining of function f useless.

Naturally, every combination of inlining decision cannot be explored exhaustively. We can only explore a small subset of them, and for that we have another heuristic that was already used in Flambda1, although Flambda2 does not yet implement it in full.

It's quite simple: we choose to consider inlining decision relationships only when there are nested calls. As for any other heuristic, it does not cover every useful case, but not only is it the easiest to implement, we are also fairly confident that it covers the most important cases.

Here's a small rundown of that heuristic:

A is a function which calls B
- Case 1: we evaluate the body of A at its definition, possibly inlining B in the process
- Case 2: at a specific callsite of A, we evaluate A in the inlining context.
  - Case 2.a: inlining A is beneficial no matter the decision on B, so we do it.
  - Case 2.b: inlining A is potentially detrimental, so we go and evaluate B before deciding to inline A for good.

Keep in mind that case 2.b is recursive and can go arbitrarily deep. This amounts to looking for the best leaf in the decision tree. Since we can't explore the whole tree, we do have a some limit to the depth of the exploration.

Reminder for our fellow Cameleers: Flambda1 and Flambda2 have a flag you can pass through the CLI which will generate a .org file which will detail all the inlining decisions taken by the compiler. That flag is: -inlining-report. Note that .org files allow to easily visualise a decision tree inside of the Emacs editor.

Summary

By now, you should have a better understanding of the intricacies inherent to Speculative Inlining. Prior to its initial inception, it was fair to question how feasible (and eligible, considering the many requirements for developping a compiler), such an algorithm would be in practice. Since then, it has demonstrated its usefulness in Flambda1 and, consequently, its porting to Flambda2 was called for.

So before we move on to the next stop in the F2S series, lets summarize what we know of Speculative Inlining.

We learned that inlining is the process of copying the body of a function at its callsite. We also learned that it is not a very interesting transformation by itself, especially nowadays with how efficient modern CPUs are, but that its usefulness is found in how it facilitates other optimisations to take place later.

We also learned about the heuristic nature of inlining and how it would be difficult to maintain finely-tailored heuristics in the long run as many others have tried before us. Actually, it is because there is no best way that we have come up with the need for an algorithm that is capable of simultaneously performing inlining and optimising as well as backtracking when needed which we called Speculative Inlining. In a nutshell, Speculative Inlining is one of the algorithms of the optimisation framework of Flambda2 which facilitates other optimisations to take place.

We have covered the constraints that the algorithm has to respect for it to hold ground in practice, like performance. We value a fast compiler and aim to keep both its execution but also the code it generates to be so. Take an optimisation such as Constant Propagation as an example. It would be a naïve approach to try and perform this transformation everywhere because the resulting complexity of the compiler would amount to something like size_of_the_code * number_of_inlinings_performed which is unacceptable to say the least. We aim at making the complexity of our compiler linear to the code size, which in turn entails plenty of logarithms anytime it is possible. Instead, we choose to apply any transformation only in the inlined parts of the code.

With all these parameters in mind, can we imagine ways to tackle these multi-layered challenges all at the same time ? There are solutions out there that do so in an imperative manner. In fact, the most intuitive way to implement such an algorithm may be fairly easily done with imperative code. You may want to read about Equality Saturation for instance, or even download Manuel Serrano's Paper inside the Scheme Bigloo compiler to learn more about it. However, we require backtracking, and the nested nature of these transformations (inlining, followed by different optimising transformations) would make backtracking bug-prone and tedious to maintain if it was to be written imperatively.

It soon became evident for us that we were going to leverage one of the key characteristics of functional languages in order to make this whole ordeal easier to design, implement and maintain: purity of terms. Indeed, not only is it easier to support backtracking when manipulating pure code, but it also becomes impossible for us to introduce cascades of hard to detect nested bugs by avoiding transforming code in place. From this point on, we knew we had to perform all transformations at the same time, making our inlining function one that would return an optimised inlined function. This does introduce complexities that we have chosen over the hurdles of maintaining an imperative version of that same algorithm, which can be seen as pertaining to graph traversal and tree rewriting for all intents and purposes.

Despite the density of this article, keep in mind that we aim at explaining Flambda2 in the most comprehensive manner possible and that there are voluntary shortcuts taken throughout these snippets for all of this to make sense for the broader audience. In time, these articles will go deep into the guts of the compiler and by then, hopefully, we will have done a good job at providing our readers with all necessary information for all of you to continue enjoying this rabbit-hole with us!

Here's a pseudo-code snippet representing Speculative Inlining.

(* Pseudo-code to rpz the actual speculation *)
let try_inlining f env args =
  let inlined_version_of_f = inline f env args in
  let benefit = compare inlined_version_of_f f in
  if benefit > 0 then
    inlined_version_of_f
  else
    f

Conclusion

As we said at the start of this article, this one is but an introduction to a major topic we will cover next, namely: Upwards and Downwards Traversals.

We had to cover Speculative Inlining first. It is a reasonably approachable solution to a complex problem, and having an idea of all the requirements for its good implementation is half of the work done for understanding key design decisions such as how code traversal was designed for algorithms such as Speculative Inlining to hold out.

Thank you all for reading! We hope that these articles will keep the community hungry for more!

Until next time, keep calm and OCaml! ⚱️🐫🏺📜

Au sujet d'OCamlPro :

OCamlPro développe des applications à haute valeur ajoutée depuis plus de 10 ans, en utilisant les langages les plus avancés, tels que OCaml, Rust, et WebAssembly (Wasm) visant aussi bien rapidité de développement que robustesse, et en ciblant les domaines les plus exigeants (méthodes formelles, cybersécurité, systèmes distribués/blockchain, conception de DSLs). Fort de plus de 20 ingénieurs R&D, avec une expertise unique sur les langages de programmation, aussi bien théorique (plus de 80% de nos ingénieurs ont une thèse en informatique) que pratique (participation active au développement de plusieurs compilateurs open-source, prototypage de la blockchain Tezos, etc.), diversifiée (OCaml, Rust, Cobol, Python, Scilab, C/C++, etc.) et appliquée à de multiples domaines. Nous dispensons également des [formations sur mesure certifiées Qualiopi sur OCaml, Rust, et les méthodes formelles] (https://training.ocamlpro.com/) Pour nous contacter : contact@ocamlpro.com.

Articles les plus récents

2025

2024

2023

[Montrer tous]