Profiling OCaml amd64 code under Linux
We have recently worked on modifying the OCaml system to be able to profile OCaml code on Linux amd64 systems, using the processor performance counters now supported by stable kernels. This page presents this work, funded by Jane Street.
The patch is provided for OCaml version 4.00.0. If you need it for 3.12.1, some more work is required, as we would need to backport some improvements that were already in the 4.00.0 code generator.
An example: profiling ocamlopt.opt
Here is an example of a session of profiling done using both Linux performance tools and a modified OCaml 4.00.0 system (the patch is available at the end of this article).
Linux performance tools are available as part of the Linux kernel (in the linux-tools
package on Debian/Ubuntu). Most of the tools are invoked through the perf
command, à la git. For example, we are going to check where the time is spent when calling the ocamlopt.opt
command:
perf record -g ./ocamlopt.opt -c -I utils -I parsing -I typing typing/*.ml
This command generates a file perf.data
in the current directory, containing all the events that were received during the execution of the command. These events contain the values of the performance counters in the amd64 processor, and the call-chain (backtrace) at the event.
We can inspect this file using the command:
perf report -g
The command displays:
Events: 3K cycles
+ 9.81% ocamlopt.opt ocamlopt.opt [.] compare_val
+ 8.85% ocamlopt.opt ocamlopt.opt [.] mark_slice
+ 7.75% ocamlopt.opt ocamlopt.opt [.] caml_page_table_lookup
+ 7.40% as as [.] 0x5812
+ 5.60% ocamlopt.opt [kernel.kallsyms] [k] 0xffffffff8103d0ca
+ 3.91% ocamlopt.opt ocamlopt.opt [.] sweep_slice
+ 3.18% ocamlopt.opt ocamlopt.opt [.] caml_oldify_one
+ 3.14% ocamlopt.opt ocamlopt.opt [.] caml_fl_allocate
+ 2.84% as [kernel.kallsyms] [k] 0xffffffff81317467
+ 1.99% ocamlopt.opt ocamlopt.opt [.] caml_c_call
+ 1.99% ocamlopt.opt ocamlopt.opt [.] caml_compare
+ 1.75% ocamlopt.opt ocamlopt.opt [.] camlSet__mem_1148
+ 1.62% ocamlopt.opt ocamlopt.opt [.] caml_oldify_mopup
+ 1.58% ocamlopt.opt ocamlopt.opt [.] camlSet__bal_1053
+ 1.46% ocamlopt.opt ocamlopt.opt [.] camlSet__add_1073
+ 1.37% ocamlopt.opt libc-2.15.so [.] 0x15cbd0
+ 1.37% ocamlopt.opt ocamlopt.opt [.] camlInterf__compare_1009
+ 1.33% ocamlopt.opt ocamlopt.opt [.] caml_apply2
+ 1.09% ocamlopt.opt ocamlopt.opt [.] caml_modify
+ 1.07% sh [kernel.kallsyms] [k] 0xffffffffa07e16fd
+ 1.07% as libc-2.15.so [.] 0x97a61
+ 0.94% ocamlopt.opt ocamlopt.opt [.] caml_alloc_shr
Using the arrow keys and the Enter
key to expand an item, we can get a better idea of where most of the time is spent:
Events: 3K cycles
+ 9.81% ocamlopt.opt ocamlopt.opt [.] compare_val
- compare_val
- 71.68% camlSet__mem_1148
+ 98.01% camlInterf__add_interf_1121
+ 1.99% camlInterf__add_pref_1158
- 21.48% camlSet__add_1073
- camlSet__add_1073
+ 93.41% camlSet__add_1073
+ 6.59% camlInterf__add_interf_1121
+ 1.44% camlReloadgen__fun_1386
+ 1.43% camlClosure__close_approx_var_1373
+ 1.43% camlSwitch__opt_count_1239
+ 1.34% camlTbl__add_1050
+ 1.20% camlEnv__find_1408
+ 8.85% ocamlopt.opt ocamlopt.opt [.] mark_slice
- 7.75% ocamlopt.opt ocamlopt.opt [.] caml_page_table_lookup
- caml_page_table_lookup
+ 50.03% camlBtype__set_commu_1704
+ 49.97% camlCtype__expand_head_1923
+ 7.40% as as [.] 0x5812
+ 5.60% ocamlopt.opt [kernel.kallsyms] [k] 0xffffffff8103d0ca
+ 3.91% ocamlopt.opt ocamlopt.opt [.] sweep_slice
Press `?` for help on key bindings
We notice that a lot of time is spent in the compare_val
primitive, called from the Pervasives.compare
function, itself called from the Set
module in asmcomp/interp.ml
. We can locate the corresponding code at the beginning of the file:
module IntPairSet =
Set.Make(struct type t = int * int let compare = compare end)
Let's replace the polymorphic function compare
by a monomorphic function, optimized for pairs of small ints:
module IntPairSet =
Set.Make(struct type t = int * int
let compare (a1,b1) (a2,b2) =
if a1 = a2 then b1 - b2 else a1 - a2
end)
We can now compare the speed of the two versions:
peerocaml:~/ocaml-4.00.0% time ./ocamlopt.old -c -I utils -I parsing -I typing typing/.ml
./ocamlopt.old 7.38s user 0.56s system 97% cpu 8.106 total
peerocaml:~/ocaml-4.00.0% time ./ocamlopt.new -c -I utils -I parsing -I typing typing/.ml
./ocamlopt.new 6.16s user 0.50s system 97% cpu 6.827 total
And we get an interesting speedup ! Now, we can iterate the process, check where most of the time is spent in the new version, optimize the critical path and so on.
Installation of the modified OCaml system
A modified OCaml system is required because, for each event, the Linux kernel must attach a backtrace of the stack (call-chain). However, the kernel is not able to use standard DWARF debugging information, and current OCaml stack frames are too complex to be unwinded without this DWARF information. Instead, we had to modify OCaml code generator to follow the same conventions as C for frame pointers, i.e. using saving the frame pointer on function entry and restoring it on function exit. This required to decrease the number of available registers from 13 to 12, using %rbp
as the frame pointer, leading to an average 3-5% slowdown in execution time.
The patch for OCaml 4.00.0 is available here:
omit-frame-pointer-4.00.0.patch (20 kB, v2, updated 2012/08/13)
To use it, you can use the following recipe, that will compile and install the patched version in ~/ocaml-4.00-with-fp.
$ wget http://caml.inria.fr/pub/distrib/ocaml-4.00.0/ocaml-4.00.0.tar.gz
$ tar zxf ~/ocaml-4.00.0.tar.gz
$ cd ocaml-4.00.0
$~/ocaml-4.00.0% wget ocamlpro.com/files/omit-frame-pointer-4.00.0.patch
$~/ocaml-4.00.0% patch -p1 < omit-frame-pointer-4.00.0.patch
$~/ocaml-4.00.0% ./configure -prefix ~/ocaml-4.00-with-fp
$~/ocaml-4.00.0% make world opt opt.opt install
$~/ocaml-4.00.0% cd ~
$ export PATH=$HOME/ocaml-4.00.0-with-fp/bin:$PATH
It is important to know that the patch modifies OCaml calling convention, meaning that ALL THE MODULES AND LIBRARIES in your application must be recompiled with this version.
On our benchmarks, the slowdown induced by the patch is between 3 and 5%. You can still compile your application without frame pointers, for production, using a new option -fomit-frame-pointer
that was added by the patch.
This patch has been submitted for inclusion in OCaml. You can follow its status and contribute to the discussion here: http://caml.inria.fr/mantis/view.php?id=5721
Au sujet d'OCamlPro :
OCamlPro développe des applications à haute valeur ajoutée depuis plus de 10 ans, en utilisant les langages les plus avancés, tels que OCaml et Rust, visant aussi bien rapidité de développement que robustesse, et en ciblant les domaines les plus exigeants (méthodes formelles, cybersécurité, systèmes distribués/blockchain, conception de DSLs). Fort de plus de 20 ingénieurs R&D, avec une expertise unique sur les langages de programmation, aussi bien théorique (plus de 80% de nos ingénieurs ont une thèse en informatique) que pratique (participation active au développement de plusieurs compilateurs open-source, prototypage de la blockchain Tezos, etc.), diversifiée (OCaml, Rust, Cobol, Python, Scilab, C/C++, etc.) et appliquée à de multiples domaines. Nous dispensons également des [formations sur mesure certifiées Qualiopi sur OCaml, Rust, et les méthodes formelles] (https://training.ocamlpro.com/) Pour nous contacter : contact@ocamlpro.com.
Articles les plus récents
2024
- opam 2.3.0 release!
- Optimisation de Geneweb, 1er logiciel français de Généalogie depuis près de 30 ans
- Alt-Ergo 2.6 is Out!
- Flambda2 Ep. 3: Speculative Inlining
- opam 2.2.0 release!
- Flambda2 Ep. 2: Loopifying Tail-Recursive Functions
- Fixing and Optimizing the GnuCOBOL Preprocessor
- OCaml Backtraces on Uncaught Exceptions
- Opam 102: Pinning Packages
- Flambda2 Ep. 1: Foundational Design Decisions
- Behind the Scenes of the OCaml Optimising Compiler Flambda2: Introduction and Roadmap
- Lean 4: When Sound Programs become a Choice
- Opam 101: The First Steps
2023
- Maturing Learn-OCaml to version 1.0: Gateway to the OCaml World
- The latest release of Alt-Ergo version 2.5.1 is out, with improved SMT-LIB and bitvector support!
- 2022 at OCamlPro
- Autofonce, GNU Autotests Revisited
- Sub-single-instruction Peano to machine integer conversion
- Statically guaranteeing security properties on Java bytecode: Paper presentation at VMCAI 23
- Release of ocplib-simplex, version 0.5