Distributed compilation in 20 minutes: using distcc

Published in the January 2004 issue of C/C++ Users Journal. Markup is mine. I'm making no claim to copyright here; this is archived for personal record.

Intro

In this article, we're going to take a look at a program called distcc, written by Martin Pool, which is one of the most useful C/C++ building tools to come around in a long time. By using distcc, it's possible to use a cluster of machines to compile a single gcc/g++ source code tree, thereby reducing compilation time dramatically. And best of all, distcc is free software, is very easy to set up and use, and really does make compilation a lot faster.

The speed improvement you see depends on the number of machines you have on your LAN that are available to donate their resources. Two identical machines will typically be able to compile about 1.8 times as fast as one machine alone, and four machines will typically be able to compile about 3.5 times faster than a single machine.
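As a rough worked example of these figures, the sketch below turns the quoted speedup factors into estimated build times. The 600-second baseline is an invented number used only for illustration; real results depend on your hardware and network.

```shell
#!/bin/sh
# Hypothetical baseline: a build that takes 600 seconds on one machine.
baseline=600

# Divide the baseline by a speedup factor and round to whole seconds.
estimate() {
    # $1 = baseline seconds, $2 = speedup factor
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.0f\n", t / s }'
}

# Speedup factors quoted in the article for two and four identical machines.
two=$(estimate "$baseline" 1.8)
four=$(estimate "$baseline" 3.5)

echo "1 machine:  ${baseline}s"
echo "2 machines: ${two}s"
echo "4 machines: ${four}s"
```

With these assumptions, the ten-minute build drops to roughly five and a half minutes on two machines and just under three minutes on four.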

Requirements

To use distcc, you need to have:

And that's about it. In particular, distcc does *not* require any of the following in order to work properly:

Because distcc is so easy to set up, it's possible to start benefitting from distributed compilation in as little as 20 minutes. Distcc is an ideal way to speed up compilation at work, at home, or even on your laptop.

Theory of operation

Here's how distcc works. First, you need to install distcc on each machine on your LAN that will be participating in distributed compilation. On the machines that will be offering their CPU resources to others, you will need to run a daemon called "distccd". We'll call these machines "the compile servers."

Then, to use distcc, you'll need to choose one machine to compile on. This machine will be called "the client." On this machine, you'll use one of several methods to get your Makefiles to call distcc rather than gcc or g++.

Note that a machine can be configured to be a client, a compile server, or both.

Once setup is completed, we can compile sources on the client, and distcc will intercept the compiler calls and distribute the work across all the compile servers. The result? Your program will compile much faster, you'll save a lot of time, and you'll be happier at the end of the day.

Inside distcc

So, that's the overview of how distcc works. On the surface, the theory behind its operation sounds simple. But for those who are very familiar with the internal workings of C and C++ compilers, it raises some interesting questions. How exactly does distcc work when different machines on the LAN may have different sets of header files? How does distcc manage to link object code when not all libraries may be available on all compile servers? And how does one get "make" to execute several things simultaneously? These are good questions, so let's tackle them one by one.

How does distcc work properly when the various compile servers and the client may have different sets of C/C++ header files?

Distcc is able to do this by doing all source code pre-processing on the client machine. It then sends the pre-processed source, along with all the gcc/g++ command-line options, to the remote machine. On the remote machine, the pre-processed source is compiled into object code, which is then sent back to the client.
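To make this split concrete, here is a minimal sketch that performs the two steps by hand on a single machine. It is not how distcc is implemented; it only shows that a preprocessed file can be compiled without the original headers or macros. The file names are made up for the demo, and the compiler is taken from CC (defaulting to gcc).

```shell
#!/bin/sh
# Reproduce by hand the split described above: preprocess first, compile second.
# Both steps run on this machine here; the file names are made up for the demo.
workdir=$(mktemp -d)
cat > "$workdir/answer.c" <<'EOF'
#define ANSWER 42
int answer(void) { return ANSWER; }
EOF

CC=${CC:-gcc}
compiled=no
if command -v "$CC" >/dev/null 2>&1; then
    # "Client" step: run only the preprocessor (-E); macros and headers are resolved here.
    "$CC" -E "$workdir/answer.c" -o "$workdir/answer.i"
    # "Server" step: compile the self-contained preprocessed file to object code (-c).
    "$CC" -c "$workdir/answer.i" -o "$workdir/answer.o" && compiled=yes
else
    echo "no C compiler found; skipping the demo" >&2
fi
echo "compiled: $compiled"
```

If you look at answer.i afterwards, the macro has already been expanded, which is why the machine doing the second step needs nothing but the compiler itself.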

How does distcc link object code?

By doing all linking locally, on the client. Distcc recognizes calls to gcc/g++ that are intended to link object code, and will perform these linking steps on the client machine.
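The following is a deliberately simplified, hypothetical sketch of how a compiler call might be sorted into "can be distributed" or "must stay local". It is not distcc's actual logic; it assumes only that a call carrying the -c (compile only) option is a compile step and that everything else, such as a final link, stays on the client. The function name classify_call is invented for this example.

```shell
#!/bin/sh
# Simplified, hypothetical classifier: NOT distcc's actual source code.
# Assumption: a call carrying -c (compile only) is a candidate for remote
# compilation; anything else, such as a final link, stays on the client.
classify_call() {
    for arg in "$@"; do
        if [ "$arg" = "-c" ]; then
            echo "distribute"
            return 0
        fi
    done
    echo "local"
}

classify_call gcc -O2 -c foo.c -o foo.o   # prints "distribute"
classify_call gcc foo.o bar.o -o app      # prints "local"
```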

Doesn't this make distcc less efficient?

In theory, yes, but in practice it does not make much difference. Linking can't really benefit from being distributed across the network, and pre-processing is generally rather fast. Most of gcc/g++'s CPU time is spent converting pre-processed source code to object code, and this is the very work that distcc is able to distribute across the compile servers.

How does one get "make" to execute multiple jobs simultaneously?

Simply call "make" with the jobs ("-j") command-line option. Using "-j", most Makefiles can be told to execute multiple jobs simultaneously. For example, "-j4" tells "make" to run up to four jobs at once. With four compilations in flight at the same time, there is always work available to hand out to the compile servers.
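A small, self-contained way to watch "-j" at work is sketched below. It builds a throwaway Makefile with four independent targets that each sleep for one second, then runs it with "-j4". The directory and target names are invented for this example, and no compiler or distcc is involved.

```shell
#!/bin/sh
# Demo: a throwaway Makefile with four independent targets, built with -j4.
# The directory and target names are invented for this example.
demo=$(mktemp -d)
# printf writes the literal tab that make requires before each recipe line.
printf 'all: a b c d\na b c d:\n\tsleep 1; touch $@\n' > "$demo/Makefile"

built=skipped
if command -v make >/dev/null 2>&1; then
    # With -j4 the four one-second jobs can overlap instead of running back to back.
    make -C "$demo" -j4 && built=yes
fi
echo "parallel build: $built"
```

Running the same Makefile without "-j" should take noticeably longer, since the four jobs then run one after another.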

Installation

Installation is fairly straightforward. First, head over to http://distcc.samba.org/ and download the latest version of the distcc sources. Then, extract, configure, compile and install them by performing the following steps:

bzip2 -dc /path/to/distcc-x.y.tar.bz2 | tar xvf -
cd distcc-x.y
./configure --prefix=/usr
make
# installing under /usr normally requires root privileges
make install

(fig. 1)
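After the install step, a quick sanity check can confirm that both programs are reachable through your PATH. The snippet below is only a sketch; it reports what it finds and changes nothing on the system.

```shell
#!/bin/sh
# Check that the two programs installed above are reachable through PATH.
missing=0
for prog in distcc distccd; do
    if path=$(command -v "$prog"); then
        echo "$prog found at $path"
    else
        echo "$prog not found in PATH" >&2
        missing=$((missing + 1))
    fi
done
echo "missing programs: $missing"
```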

Now, distcc and distccd will be installed on the machine. If a machine is going to be a compile server, start distccd (it will detach from your terminal and run in the background) by typing:

distccd

(fig. 2)

If your machine will be a client, there are three ways to configure your system so that the /usr/bin/distcc executable will intercept compiler calls. In this next step, we'll perform the initial setup needed for the gcc/g++ masquerading option so that it's available to us later. You only need to set up masquerading on the client machine(s), not the compile servers.

Setting up gcc/g++ masquerading

To use masquerading, we first need to create a directory that contains symbolic links that have the names of the compilers on our system, but have the distcc program as the link target. Later, we can use this masquerading technique to intercept gcc/g++ calls by inserting our new /usr/lib/distcc/bin directory at the beginning of our shell's executable search path. This will stealthily redirect all compiler calls to distcc instead.

Masquerading can be set up by performing these configuration steps:

install -d /usr/lib/distcc/bin
cd /usr/lib/distcc/bin
ln -s /usr/bin/distcc gcc
ln -s /usr/bin/distcc cc
ln -s /usr/bin/distcc g++ 
ln -s /usr/bin/distcc c++
ln -s /usr/bin/distcc i486-pc-linux-gnu-gcc
ln -s /usr/bin/distcc i486-pc-linux-gnu-c++
ln -s /usr/bin/distcc i486-pc-linux-gnu-g++

(fig. 3)

Above, you'll want to replace the "i486-pc-linux-gnu" with the appropriate host string that matches your installed version of gcc. To see which you should use, type "gcc -v" and look at the path displayed in the first line of output.
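If you would rather not type the links one by one, the same setup can be sketched as a loop. This version asks gcc for its host string with "gcc -dumpmachine" and, to stay harmless, writes into a temporary directory unless you set MASQ_DIR. MASQ_DIR and DISTCC_BIN are variable names invented for this example, not something distcc itself reads.

```shell
#!/bin/sh
# Sketch: create the masquerade links in a loop. To keep the demo harmless,
# it writes to a temporary directory unless MASQ_DIR is set (a variable
# invented for this example), e.g. MASQ_DIR=/usr/lib/distcc/bin.
masq_dir=${MASQ_DIR:-$(mktemp -d)}
distcc_bin=${DISTCC_BIN:-/usr/bin/distcc}

names="gcc cc g++ c++"
# Ask gcc for its host string, if gcc is installed, and add the prefixed names.
if host=$(gcc -dumpmachine 2>/dev/null) && [ -n "$host" ]; then
    names="$names $host-gcc $host-c++ $host-g++"
fi

mkdir -p "$masq_dir"
for name in $names; do
    # -f replaces an existing link so the script can be re-run safely.
    ln -sf "$distcc_bin" "$masq_dir/$name"
done
ls -l "$masq_dir"
```

To apply it for real, you would run it as root with MASQ_DIR=/usr/lib/distcc/bin.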

Before compilation

Now, we're almost ready to compile something. First, we'll need to tell distcc the names of the compile servers we'd like it to use. To do this, we'll create a file called /etc/distcc/hosts that will store this information.

In it, we'll list all the hostnames or IP addresses of our compile servers. Each hostname should be separated by whitespace, and we can use the name "localhost" to refer to the client machine. No distccd daemon needs to be running on the client in order to refer to "localhost" in /etc/distcc/hosts. To set up the /etc/distcc/hosts file, first create the /etc/distcc directory:

install -d /etc/distcc

(fig. 4)

Then, create the /etc/distcc/hosts file using your favorite text editor, and add something like this to it:

localhost
eagle
falcon
emu

(fig. 5)

This tells distcc to use the local machine first, and then distribute any additional jobs to the machines named eagle, falcon, and emu in the listed order. You may want to remove localhost from /etc/distcc/hosts, and set something like this instead:

eagle
falcon
emu

(fig. 6)

This will cause all compilation to happen remotely, thus freeing up your client's CPU for preprocessing and linking. Depending on your hardware and network configuration, as well as the number of compile servers you have set up, you may find that this approach works better.
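If you switch between the two layouts often, the hosts file can be generated from a script. The sketch below is one way to do it; HOSTS_FILE and USE_LOCALHOST are names invented for this example, and the demo writes to a temporary file unless you point HOSTS_FILE at /etc/distcc/hosts.

```shell
#!/bin/sh
# Sketch: generate the hosts file from a list, with or without localhost.
# HOSTS_FILE and USE_LOCALHOST are names invented for this example; the demo
# writes to a temporary file unless HOSTS_FILE=/etc/distcc/hosts is given.
hosts_file=${HOSTS_FILE:-$(mktemp)}
use_localhost=${USE_LOCALHOST:-no}
servers="eagle falcon emu"

: > "$hosts_file"    # start with an empty file
if [ "$use_localhost" = yes ]; then
    echo localhost >> "$hosts_file"
fi
for host in $servers; do
    echo "$host" >> "$hosts_file"
done
cat "$hosts_file"
```

Running it with USE_LOCALHOST=yes produces the first layout; the default produces the remote-only one.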

Next, we need to tweak our local PATH setting so that "make" will find our masqueraded symbolic links that point to distcc. To do this under bash, type:

export PATH="/usr/lib/distcc/bin:${PATH}"

(fig. 7)
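To see why putting a directory at the front of PATH is enough, here is a small demonstration that uses a stand-in. A stub script plays the role of /usr/bin/distcc, so nothing outside a temporary directory is touched; the stub and the directory layout are invented for this example.

```shell
#!/bin/sh
# Demo with a stand-in: a stub script plays the role of /usr/bin/distcc so the
# PATH lookup can be shown without touching the real system directories.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/masq"
printf '#!/bin/sh\necho "stub distcc called as $0"\n' > "$sandbox/distcc-stub"
chmod +x "$sandbox/distcc-stub"
ln -s "$sandbox/distcc-stub" "$sandbox/masq/gcc"

# Same idea as the export shown in the article, with the demo directory in front.
PATH="$sandbox/masq:$PATH"
export PATH

# The shell searches PATH left to right, so the link is found before the real gcc.
resolved=$(command -v gcc)
echo "gcc now resolves to: $resolved"
```

On a real client, the same "command -v gcc" check should print a path under /usr/lib/distcc/bin once your PATH is set.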

Now we're ready to compile! To do this, enter your favorite source tree and type:

make -j5

(fig. 8)

You'll want to tweak the number after -j to suit the number of machines participating in your compile farm. It's usually optimal to use a -j number that's just slightly higher than the number of compile servers you are using.
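This rule of thumb can be sketched as a small calculation: count the entries in the hosts file and add one. The demo below uses a temporary copy of the four-host example list; HOSTS_FILE is a name invented here, and you would point it at /etc/distcc/hosts to use your real list. Treat the "plus one" as a starting point to experiment from, not a fixed rule.

```shell
#!/bin/sh
# Sketch: derive a -j value from the hosts file, following the rule of thumb above.
# The demo uses a temporary copy of the example host list; point HOSTS_FILE
# (a name invented here) at /etc/distcc/hosts to use the real one.
if [ -z "${HOSTS_FILE:-}" ]; then
    HOSTS_FILE=$(mktemp)
    printf '%s\n' localhost eagle falcon emu > "$HOSTS_FILE"
fi

# Count whitespace-separated entries, then add one for "slightly higher".
servers=$(wc -w < "$HOSTS_FILE" | tr -d ' ')
njobs=$((servers + 1))
echo "make -j$njobs"
```

With the four example hosts, this prints the same "make -j5" used earlier.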

While your sources are being compiled, log in to the compile servers and monitor their system load. You should notice an increased load on these boxes as they assist your client box. Congratulations -- you're now using distcc! You should notice a significant improvement in compile speed.

Distcc extras

If GNOME is installed on your client machine, then it's likely that a GNOME distcc monitor was compiled and installed along with distcc and distccd. To run it, type:

distccmon-gnome

(fig. 9)

You should see a very nice GNOME-based distcc monitor that looks something like this:

[Screenshot: the GTK/GNOME-based distcc monitor in action. Caption: Monitoring a kernel compile process that has been distributed to a very fast AMD64/NForce3 workstation.]

By using distccmon-gnome, you can see how much time is spent on each step of the build process on all the machines that are being used for compilation. The information from distccmon-gnome can be very useful for configuring distcc to perform optimally. For example, if you notice that a disproportionate amount of time is being spent on preprocessing, then you may want to remove "localhost" from /etc/distcc/hosts. This way, the client can be devoted to preprocessing and linking, and compilation can be left to the compile servers.

If you don't have GNOME available, you can start the text-based version of distccmon by typing distccmon-text followed by the refresh interval in seconds:

distccmon-text 1

(fig. 10)

Other distcc usage strategies

Besides using the masquerading method, there are also a couple of other ways that can be used to get a source tree to use distcc. They're generally not quite as effective as the masquerading method we used above, but they may be appropriate for some situations.

The first alternate method is to prefix the name of the compiler that is being used with "distcc ". This can typically be done as follows:

make CC="distcc gcc" -j5

(fig. 11)

The second alternate method is to call distcc as the compiler itself. This can be done as follows:

make CC="distcc" -j5

(fig. 12)

When called this way, distcc will look for "cc" in the binary search path and use it for compilation.
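Whether these overrides take effect depends on the Makefile honoring CC (and, for C++ sources, the conventional CXX variable, which the commands above do not set). The sketch below uses a throwaway Makefile that only prints the two variables, so you can see what a command-line assignment does without compiling anything.

```shell
#!/bin/sh
# Demo: a throwaway Makefile that only prints the compiler commands it would use.
demo=$(mktemp -d)
printf 'show:\n\t@echo "CC=$(CC) CXX=$(CXX)"\n' > "$demo/Makefile"

shown=skipped
if command -v make >/dev/null 2>&1; then
    # Command-line assignments override the Makefile's own CC and CXX.
    shown=$(cd "$demo" && make -s CC="distcc gcc" CXX="distcc g++" show)
fi
echo "$shown"
```

A Makefile that hard-codes "gcc" in its rules instead of using $(CC) will ignore these settings, which is one reason the masquerading method tends to be more reliable.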

For more information on the various options available for distcc and distccd, be sure to visit the distcc Web site at http://distcc.samba.org/, as well as read the distcc and distccd man pages. In the distcc man page, you can learn how to further refine your DISTCC_HOSTS environment variable for enhanced performance, and the distccd man page has a number of security and connection options (such as ssh-based connections) that can be explored.
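As one example of such a refinement, the sketch below sets the host list through DISTCC_HOSTS and uses the "/N" suffix, which the distcc man page describes as a per-host job limit. Check the man page for your installed version before relying on it. The limits chosen here are invented for the example.

```shell
#!/bin/sh
# Sketch: set the host list through the environment instead of the hosts file.
# The "/N" suffix caps the number of simultaneous jobs sent to a host; the
# limits chosen here are invented for the example.
DISTCC_HOSTS="eagle/4 falcon/2 emu/2 localhost/1"
export DISTCC_HOSTS

# Sum the per-host limits to get an upper bound on useful parallelism.
total=0
for entry in $DISTCC_HOSTS; do
    limit=${entry#*/}    # keep only the part after the slash
    total=$((total + limit))
done
echo "hosts: $DISTCC_HOSTS"
echo "total job slots: $total"
```

The total gives a reasonable ceiling for the "-j" value when the machines in your farm are not identical.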

distcc in the real world

It's quite encouraging to see the positive response that distcc has received. For one, distcc has been integrated into Apple's Xcode developer tools. This allows multiple Apple machines with Xcode to use distcc.

In addition, Gentoo Linux (the free software project I lead) has extensive support for distcc. For information on how to use distcc under Gentoo, please see http://www.gentoo.org/doc/en/distcc.xml. Thanks to the efforts of Lisa M. Seelye (our resident distcc guru) as well as others, you can expect Gentoo's support for distcc to continue to expand. For example, the current installation CDs for Gentoo Linux for the PowerPC can also be used to set up boot-from-CD compile servers. For more information, see http://www.gentoo.org.

And then there's ccache

For those who are interested in accelerating compilation even further, I recommend you take a look at Andrew Tridgell's ccache program, which you can learn about at http://ccache.samba.org/. This compiler tool keeps a local cache of all recently-compiled sources, which allows you to do neat things like perform a "make clean" in a source tree and still be able to recompile it very quickly. Distcc and ccache also happen to be quite a dynamic duo when used together.
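One common way to combine the two, sketched here under assumptions, is to let ccache sit in front and hand its cache misses to distcc through ccache's CCACHE_PREFIX setting. Confirm in the ccache documentation that your version supports it. The make invocation is only printed, not executed, since it needs a real source tree.

```shell
#!/bin/sh
# Sketch: let ccache answer from its cache first and hand cache misses to distcc.
# CCACHE_PREFIX is ccache's setting for a command to put in front of the real
# compiler; the make invocation is only printed here, not executed.
CCACHE_PREFIX="distcc"
export CCACHE_PREFIX

build_cmd='make CC="ccache gcc" CXX="ccache g++" -j5'
echo "CCACHE_PREFIX=$CCACHE_PREFIX"
echo "would run: $build_cmd"
```

With this arrangement, files that are already cached never reach the network, and only genuinely new compilations are distributed.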

[Author] bio: Daniel Robbins is the Chief Architect of Gentoo Linux and leader of the Gentoo free software project (http://www.gentoo.org). He lives in Albuquerque, New Mexico with his wife and two young daughters.
