Flambda2 Ep. 3: Speculative Inlining
Welcome to a new episode of The Flambda2 Snippets!
The F2S blog posts aim at gradually introducing the world to the inner workings of a complex piece of software engineering: the Flambda2 Optimising Compiler for OCaml, a technical marvel born from a ten-year-long effort in Research & Development and Compilation, backed by many more years of expertise in all aspects of Computer Science and Formal Methods.
Today's article will serve as an introduction to one of the key design decisions structuring Flambda2 that we will cover in the next episode in the series: Upward and Downward Traversals.
See, there are interesting things to be said about how inlining is conducted inside our compiler. Inlining in itself is rather ubiquitous in compilers. The goal here is to show how we approach inlining, and to present what we call Speculative Inlining.
Inlining in general
Given the way people write functional programs, inlining is an important part of the optimisation pipeline for functional languages.
What we call inlining in this series is the process of duplicating some code to specialise it to a specific context.
Usually, this can be thought of as copy-pasting the body of a function at its call site. A common misunderstanding is to think that the main benefit of this optimisation is to remove the cost of the function call. However, with modern computer architectures, this has become less and less relevant over the last decades. The actual benefit is to use the specific context to trigger further optimisations.
Suppose we have the following option_map and double functions:
let option_map f x =
  match x with
  | None -> None
  | Some x -> Some (f x)

let double i =
  i + i
Additionally, suppose we are currently considering the following function:
let stuff () =
  option_map double (Some 21)
In this short example, inlining the option_map function would perform the following transformation:
let stuff () =
  let f = double in
  let x = Some 21 in
  match x with
  | None -> None
  | Some x -> Some (f x)
Now we can inline the double function:
let stuff () =
  let x = Some 21 in
  match x with
  | None -> None
  | Some x ->
    Some (let i = x in i + i)
As you can see, inlining alone isn't that useful an optimisation per se. In this context, applying Constant Propagation will optimise and simplify it to the following:
let stuff () = Some 42
Although this is a toy example, combining small functions is a common pattern in functional programs. It's very convenient that using combinators is not significantly worse than writing this function by hand.
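To convince ourselves that these rewrites preserve behaviour, here is a self-contained, runnable version of the example; the assertion mirrors the simplified let stuff () = Some 42 that the compiler arrives at.

```ocaml
(* Self-contained version of the running example. After inlining and
   Constant Propagation, an optimising compiler reduces [stuff] to
   [fun () -> Some 42]; evaluating it confirms behaviour is preserved. *)
let option_map f x =
  match x with
  | None -> None
  | Some x -> Some (f x)

let double i = i + i

let stuff () = option_map double (Some 21)

let () = assert (stuff () = Some 42)
```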
When inlining is detrimental
We cannot just go around and inline everything, everywhere... all at once.
As we said, inlining is mainly code duplication, and doing it everywhere would drastically blow up the size of the compiled code. There is a sweet spot to be found between absolute inlining and no inlining at all, but it is hard to find. Here's an example of exploding code at inlining time:
(* val h : int -> int *)
let h n = (* Some non constant expression *)
(* val f : (int -> int) -> int -> int *)
let f g x = g (g x)
(* 4 calls to f -> 2^4 calls to h *)
let n = f (f (f (f h))) 42
Following through with the inlining process will produce a very large binary relative to its source code. This contrived example highlights potential problems that might arise in ordinary codebases in the wild, even if this one is tailored to be quite nasty for inlining: notice the exponential blowup in the number of nested calls; every additional call to f doubles the number of calls to h after inlining.
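The blowup can be observed directly by instrumenting a variant of the example: with a counter in h (a stand-in for the non-constant expression above), four applications of f trigger 2^4 = 16 calls.

```ocaml
(* Instrumented variant of the blowup example: [h] counts how many times
   it runs. Each application of [f] doubles the number of calls to [h],
   so 4 nested applications yield 2^4 = 16 calls. *)
let calls = ref 0
let h n = incr calls; n + 1
let f g x = g (g x)
let result = f (f (f (f h))) 42
let () =
  assert (!calls = 16);
  assert (result = 42 + 16)
```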
How to decide when inlining is beneficial
Most compilers use a collection of heuristics to guide them in the decision-making. A good collection of heuristics is hard both to design and to fine-tune. Heuristics can also be quite specific to a programming style and unfit for other compilers to adopt. The takeaway is: there is no best way.
Side Note:
This topic would make for an interesting blog post but is, unfortunately, rather remote from the point of this article. If you are interested in going deeper into that subject right now, we have found references for you to explore until we get around to writing a comprehensive, and more digestible, explanation about the heuristic nature of inlining:
- Secrets of the Glasgow Haskell Compiler Inliner, by Simon Peyton Jones and Simon Marlow, 2002.
- Extending the Scope of Syntactic Abstraction, by Oscar Waddell, 1999. Section 4.4, for the case of Scheme.
- Towards Better Inlining Decisions Using Inlining Trials, by Jeffrey Dean and Craig Chambers, 1994.
- Understanding and Exploiting Optimal Function Inlining, by Theodoros Theodoridis, Tobias Grosser and Zhendong Su, 2022.
Before we get to a concrete example and break down Speculative Inlining for you, we would like to discuss the trade-offs of duplicating code.
CPUs execute instructions one by one, or at least they pretend to. In order to execute an instruction, they need to load both code and data into memory. In modern CPUs, most instructions take only a few cycles to execute, and in practice CPUs often execute several at the same time. Loading memory, however, can in the worst case take hundreds of CPU cycles. Most of the time this worst case does not happen, because CPUs have complex memory cache hierarchies: loading from the instruction cache can take just a few cycles, loading from level 2 caches may take dozens of them, and the worst case, loading from main memory, can take hundreds of cycles.
The takeaway is that, when executing a program, the cost of one instruction that has to be loaded from main memory can be larger than the cost of executing a hundred instructions already in cache.
There is a way to avoid this worst-case scenario. Since caches are rather small, the main way to keep from loading from main memory is to keep your program rather small, or at least the parts of it that are regularly executed.
Keep these orders of magnitude in mind when we address the trade-offs between reducing the number of instructions that we run and keeping the program reasonably small.
Before explaining Speculative Inlining, let's consider a piece of code. The following pattern is quite common in OCaml and other functional languages; let's see how one would go about inlining this code snippet.
Example 1: Notice the higher-order function f:
(*
  val f :
    (condition:bool -> int -> unit)
    -> condition:bool
    -> int
    -> unit
*)
let f g ~condition n =
  for i = 0 to n do
    g ~condition i
  done

let g_real ~condition i =
  if condition then
    (* small operation *)
  else
    (* big piece of code *)

let condition = true

let foo n =
  f g_real ~condition n
Even for such a small example, we will see that the heuristics involved in finding the right solution can become quite complex.
Keeping in mind the fact that condition is always true, the best set of inlining decisions would yield the following code:
(* All the code before [foo] is kept as is, from the previous codeblock *)
let foo x =
  for i = 0 to x do
    (* small operation *)
  done
But if condition had always been false, then instead of small operation we would have had a big chunk of g_real duplicated in foo (i.e. (* big piece of code *)). Moreover, it would have only spared us the running time of a few call instructions. Therefore, we would probably have preferred not to inline anything. Specifically, we would have liked to refrain from inlining g, as well as to avoid inlining f, because it would have needlessly increased the size of the code with no substantial benefit.
However, if we want to be able to make an educated decision based on the value of condition, we have to consider the entirety of the code relevant to that choice. Indeed, if we just look at the code of f, or at its call site in foo, nothing guides us to the right decision. In order to make the right decision, we need to understand that if the ~condition parameter to the g_real function is true, then we can remove a large piece of code, namely the else branch and the condition check as well.
But to understand that ~condition in g_real is always true, we need to see it in the context of f in foo. This implies, again, that the choice of inlining is not based on a property of g_real but rather on a property of the context of its call.
There exists a very large number of combinations of such difficult situations, each of which would require different heuristics that would be incredibly tedious to design, implement, and maintain.
Speculative inlining
We manage to circumvent the hurdle that this decision problem represents thanks to what we call Speculative Inlining. This strategy requires two properties from the compiler: the ability to inline and optimise at the same time, and the ability to backtrack inlining decisions.
Let's look at Example 1 again, this time through the lens of the Speculative Inlining strategy.
let f g ~condition n =
  for i = 0 to n do
    g ~condition i
  done

let g_real ~condition x =
  if condition then
    (* small operation *)
  else
    (* big piece of code *)

let condition = true

let foo x =
  f g_real ~condition x
We will focus only on the traversal of the foo function.
Before we try and inline anything, there are a couple of things we have to keep in mind about values and functions in OCaml:
- Application arity may not match function arity
To give you an idea, the function foo could also have been written in the following way:
let foo x =
  let f1 = f in
  let f2 = f1 g_real in
  let f3 = f2 ~condition in
  f3 x
We expect the compiler to translate it as well as the original, but we cannot inline a function unless all its arguments are provided. To solve this, we need to handle partial applications precisely. Over-applications also present similar challenges.
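The flip side, over-application, can be illustrated with a small (hypothetical) example: add below syntactically takes one argument and returns a closure, yet the call site passes two at once, so the compiler must split such calls before it can reason about inlining.

```ocaml
(* Over-application: [add] takes one argument and returns a closure,
   but the call site below applies two arguments in a single expression.
   OCaml evaluates it as [(add 1) 2]. *)
let add x = fun y -> x + y
let r = add 1 2
let () = assert (r = 3)
```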
- Functions are values in OCaml
We have to understand that the call to f in foo is not trivially a direct call to f in this context. Indeed, at this point functions could instead be stored in pairs, or lists, or even hashtables, to be later retrieved and applied at will; we call such functions general functions.
Since our goal is to inline it, we need to know the body of the function. We call a function concrete when we have knowledge of its body. This entails performing Constant Propagation in order to associate a concrete function with general function values and, consequently, to be able to simplify it while inlining.
Here's the simplest case demonstrating the importance of Constant Propagation:
let foo_bar y =
  let pair = foo, y in
  (fst pair) (snd pair)
In this case, we have to look inside the pair in order to find the function. This demonstrates that we sometimes have to do some amount of value analysis in order to proceed. Such cases are quite common in OCaml programs due to the module system, and other functional languages present similar characteristics.
There are many scenarios that require a decent amount of context in order to identify which function should be called. For example, when a function passed as a parameter is called, we need to know the context of the caller functions, sometimes up to an arbitrarily large context. Analysing the relevant context tells us which function is being called, and thus helps us make educated inlining decisions. This problem is specific to functional languages; function calls in good old imperative languages are seldom ambiguous, although similar considerations become relevant when function pointers are involved.
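A minimal (illustrative) instance of this: inside apply below, the target of g x is unknown; only the caller's context reveals that g is succ, which is what makes an inlining decision possible at all.

```ocaml
(* The call [g x] inside [apply] has no fixed target. Only at the call
   site [apply succ 1] do we learn that [g] is [succ], so only analysis
   of the caller's context can guide inlining here. *)
let apply g x = g x
let n = apply succ 1
let () = assert (n = 2)
```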
These small code snippets show us that we sometimes have to inline functions in order to know whether we should have inlined them.
Speculative inlining in practice
In practice, Speculative Inlining means being able to quantify the benefits brought by a set of optimisations, applied after a given inlining decision, and to use these results to determine whether said inlining decision is in fact worth carrying out, all things considered.
The criterion for accepting an inlining decision is that the resulting code should be faster than the original one. We use "should be" because program speed cannot be fully understood in absolutes.
That's why we use a heuristic algorithm to compare the original and the optimised versions of the code. It roughly consists in counting the number of retired (i.e., executed) instructions and comparing it to the increase in code size introduced by inlining the body of that function. The value of that cut-off ratio is by definition heuristic, and different compilation options given to ocamlopt change it.
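As a rough illustration (our assumption for exposition, not Flambda2's actual accounting), that cut-off comparison can be sketched as a simple predicate:

```ocaml
(* Toy cost model: accept an inlining decision when the estimated saving
   in executed instructions outweighs the code-size increase, weighted by
   a heuristic cut-off ratio. Names and units are illustrative only. *)
let should_inline ~saved_instructions ~size_increase ~cutoff_ratio =
  float_of_int saved_instructions
  >= cutoff_ratio *. float_of_int size_increase

let () =
  (* saving 10 instructions for 4 extra bytes at ratio 2.0: accepted *)
  assert (should_inline ~saved_instructions:10 ~size_increase:4 ~cutoff_ratio:2.0);
  (* saving only 5 for the same size increase: rejected *)
  assert (not (should_inline ~saved_instructions:5 ~size_increase:4 ~cutoff_ratio:2.0))
```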
As said previously, we cannot evaluate each inlining decision independently, because there are cases where inlining a function allows more of them to happen, and sometimes a given inlining choice validates another one. We can see this in Example 1, where deciding not to inline the function g_real would make the inlining of the function f useless.
Naturally, every combination of inlining decisions cannot be explored exhaustively. We can only explore a small subset of them, and for that we have another heuristic, one that was already used in Flambda1, although Flambda2 does not yet implement it in full.
It's quite simple: we choose to consider relationships between inlining decisions only when there are nested calls. As with any other heuristic, it does not cover every useful case, but not only is it the easiest to implement, we are also fairly confident that it covers the most important cases.
Here's a small rundown of that heuristic, where A is a function which calls B:
- Case 1: we evaluate the body of A at its definition, possibly inlining B in the process.
- Case 2: at a specific call site of A, we evaluate A in the inlining context.
  - Case 2.a: inlining A is beneficial no matter the decision on B, so we do it.
  - Case 2.b: inlining A is potentially detrimental, so we go and evaluate B before deciding to inline A for good.
Keep in mind that case 2.b is recursive and can go arbitrarily deep. This amounts to looking for the best leaf in the decision tree. Since we can't explore the whole tree, we put a limit on the depth of the exploration.
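The bounded search for the best leaf can be modelled with a toy decision tree (an illustrative sketch under our own simplifying assumptions, not the actual Flambda implementation):

```ocaml
(* Toy model of case 2.b: each node of the decision tree chooses between
   inlining a nested call or keeping it; we look for the best leaf under
   a depth budget. *)
type tree =
  | Leaf of int           (* measured benefit of a full set of decisions *)
  | Choice of tree * tree (* inline the nested call, or keep it *)

let rec best_benefit ~depth = function
  | Leaf b -> b
  | Choice _ when depth = 0 -> 0  (* budget exhausted: assume no gain *)
  | Choice (inlined, kept) ->
    max (best_benefit ~depth:(depth - 1) inlined)
        (best_benefit ~depth:(depth - 1) kept)

let () =
  let t = Choice (Leaf 3, Choice (Leaf 5, Leaf 1)) in
  assert (best_benefit ~depth:2 t = 5);  (* deep enough to find the best leaf *)
  assert (best_benefit ~depth:0 t = 0)   (* no budget: no measured benefit *)
```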
Reminder for our fellow Cameleers: Flambda1 and Flambda2 have a flag you can pass through the CLI, -inlining-report, which will generate a .org file detailing all the inlining decisions taken by the compiler. Note that .org files make it easy to visualise a decision tree inside the Emacs editor.
Summary
By now, you should have a better understanding of the intricacies inherent to Speculative Inlining. Prior to its initial inception, it was fair to question how feasible (and eligible, considering the many requirements for developing a compiler) such an algorithm would be in practice. Since then, it has demonstrated its usefulness in Flambda1 and, consequently, porting it to Flambda2 was called for.
So before we move on to the next stop in the F2S series, let's summarise what we know of Speculative Inlining.
We learned that inlining is the process of copying the body of a function at its call site. We also learned that it is not a very interesting transformation by itself, especially nowadays given how efficient modern CPUs are, but that its usefulness lies in how it facilitates other optimisations later on.
We also learned about the heuristic nature of inlining and how difficult it would be to maintain finely-tailored heuristics in the long run, as many others have tried before us. In fact, it is because there is no best way that we arrived at the need for an algorithm capable of simultaneously performing inlining and optimising, as well as backtracking when needed, which we called Speculative Inlining. In a nutshell, Speculative Inlining is one of the algorithms of the optimisation framework of Flambda2 which facilitates other optimisations.
We have covered the constraints that the algorithm has to respect for it to hold ground in practice, such as performance. We value a fast compiler, and we aim to keep both the compiler itself and the code it generates fast. Take an optimisation such as Constant Propagation as an example.
It would be naïve to try to perform this transformation everywhere, because the resulting complexity of the compiler would amount to something like size_of_the_code * number_of_inlinings_performed, which is unacceptable to say the least. We aim at keeping the complexity of our compiler linear in the code size, which in practice entails plenty of logarithms whenever possible. Instead, we choose to apply such transformations only in the inlined parts of the code.
With all these parameters in mind, can we imagine ways to tackle these multi-layered challenges all at the same time? There are solutions out there that do so in an imperative manner; in fact, the most intuitive way to implement such an algorithm may well be with imperative code. You may want to read about Equality Saturation, for instance, or look at Manuel Serrano's paper on inlining in the Scheme Bigloo compiler to learn more about it. However, we require backtracking, and the nested nature of these transformations (inlining, followed by different optimising transformations) would make backtracking bug-prone and tedious to maintain if it were written imperatively.
It soon became evident to us that we were going to leverage one of the key characteristics of functional languages in order to make this whole ordeal easier to design, implement and maintain: purity of terms. Indeed, not only is it easier to support backtracking when manipulating pure code, but by avoiding transforming code in place it also becomes impossible for us to introduce cascades of hard-to-detect nested bugs. From this point on, we knew we had to perform all transformations at the same time, making our inlining function one that returns an optimised inlined function. This does introduce complexities, which we have chosen over the hurdles of maintaining an imperative version of the same algorithm, and which can be seen as pertaining to graph traversal and tree rewriting for all intents and purposes.
Despite the density of this article, keep in mind that we aim to explain Flambda2 in the most comprehensible manner possible, and that there are voluntary shortcuts taken throughout these snippets for all of this to make sense to a broader audience.
In time, these articles will go deep into the guts of the compiler, and by then, hopefully, we will have done a good job of providing our readers with all the necessary information to continue enjoying this rabbit hole with us!
Here's a pseudo-code snippet representing Speculative Inlining:
(* Pseudo-code representing the actual speculation *)
let try_inlining f env args =
  let inlined_version_of_f = inline f env args in
  let benefit = compare inlined_version_of_f f in
  if benefit > 0 then
    inlined_version_of_f
  else
    f
Conclusion
As we said at the start of this article, this one is but an introduction to a major topic we will cover next, namely Upwards and Downwards Traversals.
We had to cover Speculative Inlining first. It is a reasonably approachable solution to a complex problem, and having an idea of all the requirements for a good implementation is half of the work needed to understand key design decisions, such as how code traversal was designed for algorithms like Speculative Inlining to hold up.
Thank you all for reading! We hope that these articles will keep the community hungry for more!
Until next time, keep calm and OCaml! ⚱️🐫🏺📜
About OCamlPro:
OCamlPro is an R&D lab founded in 2011, with the mission to help industrial users benefit from experts with a state-of-the-art knowledge of programming languages theory and practice.
- We provide audit, support, custom developer tools and training for both the most modern languages, such as Rust, Wasm and OCaml, and for legacy languages, such as COBOL or even home-made domain-specific languages;
- We design, create and implement software with great added-value for our clients. High complexity is not a problem for our PhD-level experts. For example, we helped the French Income Tax Administration re-adapt and improve their internally kept M language, we designed a DSL to model and express revenue streams in the Cinema Industry, codename Niagara, and we also developed the prototype of the Tezos proof-of-stake blockchain from 2014 to 2018.
- We have a long history of creating open-source projects, such as the Opam package manager, the LearnOCaml web platform, and contributing to other ones, such as the Flambda optimizing compiler, or the GnuCOBOL compiler.
- We are also experts of Formal Methods, developing tools such as our SMT Solver Alt-Ergo (check our Alt-Ergo Users' Club) and using them to prove safety or security properties of programs.
Please reach out, we'll be delighted to discuss your challenges: contact@ocamlpro.com or book a quick discussion.