The Growth of the OCaml Distribution

Date: 2023-01-02
Category: OCaml
Tags: OCaml, compiler



We recently worked on a project to build a binary installer for OCaml, inspired by Rustup for Rust. We had to build binary packages of the distribution for every OCaml version since 4.02.0, and we were surprised to discover that their (compressed) size grew from 18 MB to about 200 MB. This post surveys our findings.

Introduction

One of the strengths of Rust is how easily it can be installed in user space on a new computer: with a single command copy-pasted from a website into a terminal, you get everything you need to start building Rust projects in a few seconds. Rustup, together with a set of prebuilt packages for many architectures, is the project that makes all this possible.

OCaml, on the other hand, is a bit harder to install: you need to find in the documentation how to install opam on your operating system, find out how to create a switch with a given compiler version, and then wait for the compiler to be built and installed. This usually takes much longer.

As a winter holiday project, we worked on a tool similar to Rustup, providing binary packages for most versions of the OCaml distribution. It builds upon our experience with opam and opam-bin, our plugin to build and share binary packages for opam.

While building binary packages for most versions of the OCaml distribution, we were surprised to discover that the size of the binary archive grew from 18 MB to about 200 MB in a little over 8 years. This is not a problem on most high-bandwidth connections, but it can become one far from big towns. Fortunately, we designed our tool so that it can also install from sources in such cases, trading a longer installation for a smaller download.

We decided this growth was worth investigating in more detail, and this post presents our early findings.

General Trends

In a little over 8 years, the OCaml Distribution binary archive grew by a factor of about 10, from 18 MB to 198 MB, corresponding to a growth from 73 MB to 522 MB after installation, and from 748 to 2433 installed files.

So, let's look at the evolution of the size of the binary OCaml distribution in more detail. Between version 4.02.0 (Aug 2014) and version 5.0.0 (Dec 2022):

  • The size of the compressed binary archive grew from 18 MB to 198 MB

  • The size of the installed binary distribution grew from 73 MB to 522 MB

  • The number of installed files grew from 748 to 2433

The OCaml Distribution source archive was much more stable, with an overall growth factor below 2.

On the other hand, the source distribution itself was much more stable:

  • The size of the compressed source archive grew only from 3 MB to 5 MB

  • The size of the sources grew from 14 MB to 26 MB

  • The number of source files grew from 2355 to 4084
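The growth factors behind these two lists can be checked with a few lines of arithmetic. The sketch below, written in Python, only uses the figures quoted above; the dictionary labels and the helper name `growth_factors` are illustrative and not part of any existing tool.

```python
# Figures quoted above for OCaml 4.02.0 and 5.0.0, as (before, after) pairs.
BINARY = {
    "compressed archive (MB)": (18, 198),
    "installed size (MB)": (73, 522),
    "installed files": (748, 2433),
}
SOURCE = {
    "compressed archive (MB)": (3, 5),
    "unpacked sources (MB)": (14, 26),
    "source files": (2355, 4084),
}


def growth_factors(figures):
    """Return {label: after / before} for each (before, after) pair."""
    return {label: after / before for label, (before, after) in figures.items()}


for name, figures in (("binary", BINARY), ("source", SOURCE)):
    for label, factor in growth_factors(figures).items():
        print(f"{name:6} {label:24} x{factor:.1f}")
```

With these numbers, the compressed binary archive grows by a factor of 11 and the installed size by a factor of about 7, while every source figure stays below a factor of 2.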

For our project, this makes the source distribution a good alternative to the binary distribution in low-bandwidth settings, especially as OCaml is much faster than Rust at building itself. For the record, version 5.0.0 takes about 1 minute to build on a 16-core computer with 64 GB of RAM.

Interestingly, if we plot the total size of the binary distribution together with the total size of only those files that were already present in the previous version, we can see that the growth is mostly caused by existing files getting bigger, and not by the addition of new files:

The growth is mostly caused by the increase in size of existing files, and not by the addition of new files.

Causes and Consequences

We tried to identify the main causes of this growth. It is linear most of the time, with sharp increases (and decreases) at some versions. We plotted the difference in size between consecutive versions for the total size, the new files, the deleted files and the same files, i.e. the files that made it from one version to the next one:

The difference in size between two consecutive versions is usually small, but some versions exhibit huge increases or decreases.
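The split into new, deleted and same files can be sketched as follows in Python. This is a minimal illustration, not the script used for the plots: the function name `size_breakdown` and the toy sizes are made up, and each version is assumed to be described by a mapping from relative file path to size in bytes.

```python
def size_breakdown(previous, current):
    """Split the size change between two versions into three parts.

    `previous` and `current` map a relative file path to its size in bytes.
    """
    new = current.keys() - previous.keys()
    deleted = previous.keys() - current.keys()
    same = current.keys() & previous.keys()
    return {
        # files that only exist in the newer version
        "new": sum(current[p] for p in new),
        # files that disappeared count as a (negative) gain
        "deleted": -sum(previous[p] for p in deleted),
        # size change of files present in both versions
        "same": sum(current[p] - previous[p] for p in same),
        "total": sum(current.values()) - sum(previous.values()),
    }


# Toy data with made-up sizes, only to show the shape of the result.
previous = {"bin/a": 100, "bin/b": 50, "lib/old": 30}
current = {"bin/a": 400, "bin/b": 50, "lib/new": 10}
print(size_breakdown(previous, current))
```

By construction, the "new", "deleted" and "same" entries always add up to "total", which is what allows the three curves to be compared with the total difference.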

Let's have a look at the versions with the highest increases in size:

  • +86 MB for 4.08.0: though there are a lot of new files (+307), they only account for 3 MB of additional storage. Most of the difference comes from an increase in size of both compiler libraries (probably related to the use of Menhir for parsing) and of some binaries. In particular:

    • +13 MB for bin/ocamlobjinfo.byte (2_386_046 -> 16_907_776)
    • +12 MB for bin/ocamldep.byte (2_199_409 -> 15_541_022)
    • +6 MB for bin/ocamldebug (1_092_173 -> 7_671_300)
    • +6 MB for bin/ocamlprof.byte (630_989 -> 7_043_717)
    • +6 MB for lib/ocaml/compiler-libs/parser.cmt (2_237_513 -> 9_209_256)
  • +74 MB for 4.03.0: again, though there are a lot of new files (+475, mostly in compiler-libs), they only account for 11 MB of additional storage, and a large part of that is compensated by the removal of ocamlbuild from the distribution, which saves 7 MB.

    Indeed, most of the increase in size is probably caused by compiling with debug information (option -g), which considerably increases the size of all executables, for example:

    • +12 MB for bin/ocamlopt (2_016_697 -> 15_046_969)
    • +9 MB for bin/ocaml (1_833_357 -> 11_574_555)
    • +8 MB for bin/ocamlc (1_748_717 -> 11_070_933)
    • +8 MB for lib/ocaml/expunge (1_662_786 -> 10_672_805)
    • +7 MB for lib/ocaml/compiler-libs/ocamlcommon.cma (1_713_947 -> 8_948_807)
  • +72 MB for 4.11.0: again, the increase almost only comes from existing files. For example:

    • +16 MB for bin/ocamldebug (8_170_424 -> 26_451_049)
    • +6 MB for bin/ocamlopt.byte (21_895_130 -> 28_354_131)
    • +5 MB for lib/ocaml/extract_crc (659_967 -> 6_203_791)
    • +5 MB for bin/ocaml (17_074_577 -> 22_388_774)
    • +5 MB for bin/ocamlobjinfo.byte (17_224_939 -> 22_523_686)

    Again, the increase is probably related to adding more debug information to the executables (there is a specific PR on ocamldebug for that, and for all executables more debug info is available for each allocation);

  • +48 MB for 5.0.0: a big difference in storage is not surprising for a change in a major version, but actually half of the difference just comes from an increase of 23 MB of bin/ocamldoc;

  • +34 MB for 4.02.3: this one is worth noting, as it comes at a minor version change. The increase is mostly caused by the addition of 402 new files, corresponding to cmt/cmti files for the stdlib and compiler-libs.

We could of course study some other versions, but understanding the root causes of most of these changes would require going deeper than we can in a blog post. Still, these figures give experts good hints about which versions to investigate first.

Inside the OCaml Installation

Before concluding, it might also be worth studying which parts of the OCaml Installation take most of the space. 5.0.0 is a good candidate for such a study, as libraries have been moved to separate directories, instead of all being directly stored in lib/ocaml.

Here is a decomposition of the OCaml Installation:

  • Total: 529 MB
    • share: 1 MB
    • man: 4 MB
    • bin: 303 MB
    • lib/ocaml: 223 MB
      • compiler-libs: 134 MB
      • expunge: 20 MB

As we can see, more than half of the space is used by executables. For example, all of the following are above 10 MB:

  • 28 MB ocamldoc
  • 26 MB ocamlopt.byte
  • 25 MB ocamldebug
  • 21 MB ocamlobjinfo.byte, ocaml
  • 20 MB ocamldep.byte, ocamlc.byte
  • 19 MB ocamldoc.opt
  • 18 MB ocamlopt.opt
  • 15 MB ocamlobjinfo.opt
  • 14 MB ocamldep.opt, ocamlc.opt, ocamlcmt

There are both bytecode and native code executables in this list.
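A decomposition like the one above can be reproduced on any installation with a short script. The following Python sketch is one possible way to do it, not the tool we used: the helper names are illustrative, the installation prefix is passed on the command line, and sizes are assumed to be reported in units of 2**20 bytes.

```python
import os
import sys
from collections import defaultdict

MB = 1024 * 1024  # assumption: sizes are reported in units of 2**20 bytes


def iter_files(prefix):
    """Yield (relative path, size in bytes) for every file under `prefix`."""
    for dirpath, _dirnames, filenames in os.walk(prefix):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, prefix).replace(os.sep, "/")
            # lstat: count a symbolic link as the link itself, not its target
            yield rel, os.lstat(path).st_size


def sizes_by_dir(prefix, depth=1):
    """Total size per directory, keeping only the first `depth` path components."""
    totals = defaultdict(int)
    for rel, size in iter_files(prefix):
        parents = rel.split("/")[:-1]
        totals["/".join(parents[:depth]) or "."] += size
    return dict(totals)


def largest_files(prefix, min_size=10 * MB):
    """Files of at least `min_size` bytes, biggest first."""
    big = [(size, rel) for rel, size in iter_files(prefix) if size >= min_size]
    return sorted(big, reverse=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    prefix = sys.argv[1]  # e.g. the root directory of an opam switch
    for directory, size in sorted(sizes_by_dir(prefix, depth=2).items()):
        print(f"{size // MB:5} MB  {directory}")
    for size, rel in largest_files(prefix):
        print(f"{size // MB:5} MB  {rel}")
```

Note that with `depth=2`, files stored directly in lib/ocaml and files stored in its subdirectories are all counted under lib/ocaml; a larger depth separates subdirectories such as compiler-libs.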

Conclusion

Our installer project would benefit from a smaller binary OCaml distribution, and so would most OCaml users. After a few years of using OCaml, developers usually end up with a huge $HOME/.opam directory, because each opam switch often takes more than 1 GB of space, and the OCaml distribution accounts for a big part of that. opam-bin partially solves this problem by sharing identical files between several switches (when the --enable-share configuration option has been used).
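To get an idea of how much space sharing identical files could save, one can look for files with the same content in several switches. The Python sketch below only measures the potential gain and does not modify anything. It is an illustration under our own assumptions (content compared by size and SHA-256 digest, helper names made up), not a description of how opam-bin implements sharing.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 digest of a file, read in chunks to keep memory usage low."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_groups(roots):
    """Map (size, digest) to the list of distinct files sharing that content."""
    by_size = defaultdict(list)
    seen_inodes = set()
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue  # skip symbolic links and special files
                inode = (st.st_dev, st.st_ino)
                if inode in seen_inodes:
                    continue  # already shared through a hard link
                seen_inodes.add(inode)
                by_size[st.st_size].append(path)
    by_content = defaultdict(list)
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # only hash files that have a same-size candidate
        for path in paths:
            by_content[(size, file_digest(path))].append(path)
    return {key: paths for key, paths in by_content.items() if len(paths) > 1}


def reclaimable_bytes(groups):
    """Bytes saved if each group of identical files were stored only once."""
    return sum(size * (len(paths) - 1) for (size, _digest), paths in groups.items())
```

For example, `reclaimable_bytes(duplicate_groups([switch_a, switch_b]))` gives an estimate for two switch directories, where `switch_a` and `switch_b` are hypothetical paths.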

Here is a short list of ideas to test to decrease the size of the binary OCaml distribution:

  • Use the same executable for multiple programs (ocamlc.opt, ocamlopt.opt, ocamldep.opt, etc.), using the first command argument (the name under which the program was invoked) to choose the behavior. Rustup, for example, only installs one binary in $HOME/.cargo/bin for cargo, rustc, rustup, etc. Our tool uses the same trick to share a single binary between itself, opam, opam-bin, ocp-indent and drom.

  • Split installed files into separate opam packages, of which only one would be installed as the compiler distribution. For example, most cmt files of compiler-libs are not needed by most users, they might only be useful for compiler/tooling developers, and even then, only in very rare cases. They could be installed as another opam package.

  • Remove the -linkall flag on ocamlcommon.cm[x]a libraries. In general, such a flag should only be set when building an executable that is expected to use plugins, because otherwise, this executable will contain all the modules of the library, even the ones that are not useful for its specific purpose.
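The first idea, a single executable that behaves differently depending on the name it is invoked under, can be sketched in a few lines. The Python example below only illustrates the dispatch technique: the table of tools and the placeholder functions are hypothetical and do nothing beyond returning a message.

```python
import os
import sys


# Placeholder entry points: a real tool would run the corresponding program.
def compile_bytecode(args):
    return f"would compile to bytecode: {args}"


def compile_native(args):
    return f"would compile to native code: {args}"


def list_dependencies(args):
    return f"would list dependencies: {args}"


# Hypothetical table: one entry per name the single executable is installed under.
TOOLS = {
    "ocamlc.opt": compile_bytecode,
    "ocamlopt.opt": compile_native,
    "ocamldep.opt": list_dependencies,
}


def dispatch(argv):
    """Pick the behavior from the name the program was invoked as (argv[0])."""
    name = os.path.basename(argv[0])
    tool = TOOLS.get(name)
    if tool is None:
        return f"{name}: unknown tool name"
    return tool(argv[1:])


if __name__ == "__main__":
    print(dispatch(sys.argv))
```

In such a setup, each name in the table would typically be a hard link or a symbolic link to the same file, so the code shared by all the tools is stored only once on disk.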



About OCamlPro:

OCamlPro has been developing high-value-added applications for more than 10 years, using the most advanced languages, such as OCaml and Rust, aiming at both development speed and robustness, and targeting the most demanding domains (formal methods, cybersecurity, distributed systems/blockchain, DSL design). We have more than 20 R&D engineers with a unique expertise in programming languages that is theoretical (more than 80% of our engineers hold a PhD in computer science), practical (active participation in the development of several open-source compilers, prototyping of the Tezos blockchain, etc.), diverse (OCaml, Rust, Cobol, Python, Scilab, C/C++, etc.) and applied to many domains. We also provide [custom training on OCaml, Rust, and formal methods](https://training.ocamlpro.com/). To contact us: contact@ocamlpro.com.