Profiling OCaml amd64 code under Linux
We have recently worked on modifying the OCaml system to be able to profile OCaml code on Linux amd64 systems, using the processor performance counters now supported by stable kernels. This page presents this work, funded by Jane Street.
The patch is provided for OCaml version 4.00.0. If you need it for 3.12.1, some more work is required, as we would need to backport some improvements that were already in the 4.00.0 code generator.
An example: profiling ocamlopt.opt
Here is an example of a session of profiling done using both Linux performance tools and a modified OCaml 4.00.0 system (the patch is available at the end of this article).
Linux performance tools are available as part of the Linux kernel (in the linux-tools
package on Debian/Ubuntu). Most of the tools are invoked through the perf
command, à la git. For example, we are going to check where the time is spent when calling the ocamlopt.opt
command:
perf record -g ./ocamlopt.opt -c -I utils -I parsing -I typing typing/*.ml
This command generates a file perf.data
in the current directory, containing all the events that were received during the execution of the command. These events contain the values of the performance counters in the amd64 processor, and the call-chain (backtrace) at the event.
We can inspect this file using the command:
perf report -g
The command displays:
Events: 3K cycles
+ 9.81% ocamlopt.opt ocamlopt.opt [.] compare_val
+ 8.85% ocamlopt.opt ocamlopt.opt [.] mark_slice
+ 7.75% ocamlopt.opt ocamlopt.opt [.] caml_page_table_lookup
+ 7.40% as as [.] 0x5812
+ 5.60% ocamlopt.opt [kernel.kallsyms] [k] 0xffffffff8103d0ca
+ 3.91% ocamlopt.opt ocamlopt.opt [.] sweep_slice
+ 3.18% ocamlopt.opt ocamlopt.opt [.] caml_oldify_one
+ 3.14% ocamlopt.opt ocamlopt.opt [.] caml_fl_allocate
+ 2.84% as [kernel.kallsyms] [k] 0xffffffff81317467
+ 1.99% ocamlopt.opt ocamlopt.opt [.] caml_c_call
+ 1.99% ocamlopt.opt ocamlopt.opt [.] caml_compare
+ 1.75% ocamlopt.opt ocamlopt.opt [.] camlSet__mem_1148
+ 1.62% ocamlopt.opt ocamlopt.opt [.] caml_oldify_mopup
+ 1.58% ocamlopt.opt ocamlopt.opt [.] camlSet__bal_1053
+ 1.46% ocamlopt.opt ocamlopt.opt [.] camlSet__add_1073
+ 1.37% ocamlopt.opt libc-2.15.so [.] 0x15cbd0
+ 1.37% ocamlopt.opt ocamlopt.opt [.] camlInterf__compare_1009
+ 1.33% ocamlopt.opt ocamlopt.opt [.] caml_apply2
+ 1.09% ocamlopt.opt ocamlopt.opt [.] caml_modify
+ 1.07% sh [kernel.kallsyms] [k] 0xffffffffa07e16fd
+ 1.07% as libc-2.15.so [.] 0x97a61
+ 0.94% ocamlopt.opt ocamlopt.opt [.] caml_alloc_shr
Using the arrow keys and the Enter
key to expand an item, we can get a better idea of where most of the time is spent:
Events: 3K cycles
+ 9.81% ocamlopt.opt ocamlopt.opt [.] compare_val
- compare_val
- 71.68% camlSet__mem_1148
+ 98.01% camlInterf__add_interf_1121
+ 1.99% camlInterf__add_pref_1158
- 21.48% camlSet__add_1073
- camlSet__add_1073
+ 93.41% camlSet__add_1073
+ 6.59% camlInterf__add_interf_1121
+ 1.44% camlReloadgen__fun_1386
+ 1.43% camlClosure__close_approx_var_1373
+ 1.43% camlSwitch__opt_count_1239
+ 1.34% camlTbl__add_1050
+ 1.20% camlEnv__find_1408
+ 8.85% ocamlopt.opt ocamlopt.opt [.] mark_slice
- 7.75% ocamlopt.opt ocamlopt.opt [.] caml_page_table_lookup
- caml_page_table_lookup
+ 50.03% camlBtype__set_commu_1704
+ 49.97% camlCtype__expand_head_1923
+ 7.40% as as [.] 0x5812
+ 5.60% ocamlopt.opt [kernel.kallsyms] [k] 0xffffffff8103d0ca
+ 3.91% ocamlopt.opt ocamlopt.opt [.] sweep_slice
Press `?` for help on key bindings
We notice that a lot of time is spent in the compare_val
primitive, called from the Pervasives.compare
function, itself called from the Set
module in asmcomp/interp.ml
. We can locate the corresponding code at the beginning of the file:
module IntPairSet =
Set.Make(struct type t = int * int let compare = compare end)
Let's replace the polymorphic function compare
by a monomorphic function, optimized for pairs of small ints:
module IntPairSet =
Set.Make(struct type t = int * int
let compare (a1,b1) (a2,b2) =
if a1 = a2 then b1 - b2 else a1 - a2
end)
We can now compare the speed of the two versions:
peerocaml:~/ocaml-4.00.0% time ./ocamlopt.old -c -I utils -I parsing -I typing typing/.ml
./ocamlopt.old 7.38s user 0.56s system 97% cpu 8.106 total
peerocaml:~/ocaml-4.00.0% time ./ocamlopt.new -c -I utils -I parsing -I typing typing/.ml
./ocamlopt.new 6.16s user 0.50s system 97% cpu 6.827 total
And we get an interesting speedup ! Now, we can iterate the process, check where most of the time is spent in the new version, optimize the critical path and so on.
Installation of the modified OCaml system
A modified OCaml system is required because, for each event, the Linux kernel must attach a backtrace of the stack (call-chain). However, the kernel is not able to use standard DWARF debugging information, and current OCaml stack frames are too complex to be unwinded without this DWARF information. Instead, we had to modify OCaml code generator to follow the same conventions as C for frame pointers, i.e. using saving the frame pointer on function entry and restoring it on function exit. This required to decrease the number of available registers from 13 to 12, using %rbp
as the frame pointer, leading to an average 3-5% slowdown in execution time.
The patch for OCaml 4.00.0 is available here:
omit-frame-pointer-4.00.0.patch (20 kB, v2, updated 2012/08/13)
To use it, you can use the following recipe, that will compile and install the patched version in ~/ocaml-4.00-with-fp.
$ wget http://caml.inria.fr/pub/distrib/ocaml-4.00.0/ocaml-4.00.0.tar.gz
$ tar zxf ~/ocaml-4.00.0.tar.gz
$ cd ocaml-4.00.0
$~/ocaml-4.00.0% wget ocamlpro.com/files/omit-frame-pointer-4.00.0.patch
$~/ocaml-4.00.0% patch -p1 < omit-frame-pointer-4.00.0.patch
$~/ocaml-4.00.0% ./configure -prefix ~/ocaml-4.00-with-fp
$~/ocaml-4.00.0% make world opt opt.opt install
$~/ocaml-4.00.0% cd ~
$ export PATH=$HOME/ocaml-4.00.0-with-fp/bin:$PATH
It is important to know that the patch modifies OCaml calling convention, meaning that ALL THE MODULES AND LIBRARIES in your application must be recompiled with this version.
On our benchmarks, the slowdown induced by the patch is between 3 and 5%. You can still compile your application without frame pointers, for production, using a new option -fomit-frame-pointer
that was added by the patch.
This patch has been submitted for inclusion in OCaml. You can follow its status and contribute to the discussion here: http://caml.inria.fr/mantis/view.php?id=5721
About OCamlPro:
OCamlPro is a R&D lab founded in 2011, with the mission to help industrial users benefit from experts with a state-of-the-art knowledge of programming languages theory and practice.
- We provide audit, support, custom developer tools and training for both the most modern languages, such as Rust, Wasm and OCaml, and for legacy languages, such as COBOL or even home-made domain-specific languages;
- We design, create and implement software with great added-value for our clients. High complexity is not a problem for our PhD-level experts. For example, we helped the French Income Tax Administration re-adapt and improve their internally kept M language, we designed a DSL to model and express revenue streams in the Cinema Industry, codename Niagara, and we also developed the prototype of the Tezos proof-of-stake blockchain from 2014 to 2018.
- We have a long history of creating open-source projects, such as the Opam package manager, the LearnOCaml web platform, and contributing to other ones, such as the Flambda optimizing compiler, or the GnuCOBOL compiler.
- We are also experts of Formal Methods, developing tools such as our SMT Solver Alt-Ergo (check our Alt-Ergo Users' Club) and using them to prove safety or security properties of programs.
Please reach out, we'll be delighted to discuss your challenges: contact@ocamlpro.com or book a quick discussion.
Most Recent Articles
2024
- opam 2.3.0 release!
- Optimisation de Geneweb, 1er logiciel français de Généalogie depuis près de 30 ans
- Alt-Ergo 2.6 is Out!
- Flambda2 Ep. 3: Speculative Inlining
- opam 2.2.0 release!
- Flambda2 Ep. 2: Loopifying Tail-Recursive Functions
- Fixing and Optimizing the GnuCOBOL Preprocessor
- OCaml Backtraces on Uncaught Exceptions
- Opam 102: Pinning Packages
- Flambda2 Ep. 1: Foundational Design Decisions
- Behind the Scenes of the OCaml Optimising Compiler Flambda2: Introduction and Roadmap
- Lean 4: When Sound Programs become a Choice
- Opam 101: The First Steps
2023
- Maturing Learn-OCaml to version 1.0: Gateway to the OCaml World
- The latest release of Alt-Ergo version 2.5.1 is out, with improved SMT-LIB and bitvector support!
- 2022 at OCamlPro
- Autofonce, GNU Autotests Revisited
- Sub-single-instruction Peano to machine integer conversion
- Statically guaranteeing security properties on Java bytecode: Paper presentation at VMCAI 23
- Release of ocplib-simplex, version 0.5