distcc is prefixed to compiler command lines and acts as a wrapper to invoke the compiler either on the local client machine, or on a remote volunteer host.
For example, to compile the standard application program:
distcc gcc -o hello.o -c hello.c
Standard Makefiles, including those using the GNU autoconf/automake system, use the $CC variable as the name of the compiler to run. In most cases it is sufficient to override this variable, either on the command line or from your login script if you wish to use distcc for all compilation. For example:
make CC='distcc'
Options to distcc must precede the compiler name. Any arguments or options following the name of the compiler are passed through to the compiler.
--help
Print a detailed usage message and exit.
--version
Show distcc version and exit.
The way in which distcc runs the compiler is controlled by a few environment variables.
NOTE:
Some versions of make do not export Make variables as environment variables by default. Also, assignments to variables within the Makefile may override their definitions in the environment that calls make. The most reliable method seems to be to set DISTCC_* variables in the environment of Make, and to set CC on the right-hand side of the Make command line. For example:
$ DISTCC_HOSTS='localhost wistful toey'
$ export DISTCC_HOSTS
$ CC='distcc' ./configure
$ make CC='distcc' all
Some Makefiles may, contrary to convention, explicitly call gcc or some other compiler, in which case overriding $CC will not be enough to call distcc. This should be harmless, however: those jobs will just run locally. The best solution is to update the Makefile to compile and link using $(CC) to promote future maintainability.
DISTCC_HOSTS
Space-separated list of volunteer host specifications.
DISTCC_VERBOSE
If set to 1, distcc produces explanatory messages on the standard error stream. This can be helpful in debugging problems. Bug reports should include verbose output.
DISTCC_LOG
Log file to receive messages from distcc itself, rather than stderr.
DISTCC_SAVE_TEMPS
If set to 1, temporary files are not deleted after use. Good for debugging, or if your disks are too empty.
DISTCC_TCP_CORK
If set to 0, disable use of "TCP corks", even if they're present on this system. Using corks normally helps pack requests into fewer packets and aids performance.
Building a C or C++ program on Unix involves several phases:
Preprocessing source (.c) and headers (.h) to a preprocessed file (.i)
Compiling preprocessed source (.i) to assembly instructions (.s)
Assembling to an object file (.o)
Linking object files and libraries to form an executable or library
distcc only ever runs the compiler and assembler remotely. The preprocessor must always run locally because it needs to access various header files on the local machine which may not be present, or may not be the same, on the volunteer. The linker similarly needs to examine libraries and object files, and so must run locally.
The compiler and assembler take only a single input file, the preprocessed source, and produce a single output, the object file. distcc ships these two files across the network and can therefore run the compiler and assembler remotely.
Fortunately, for most programs running the preprocessor is relatively cheap, and the linker is called relatively infrequently, so most of the work can be distributed.
distcc examines its command line to determine which of these phases are being invoked, and whether the job can be distributed. Here is an example of a typical command that can be preprocessed locally and compiled remotely:
distcc gcc -o hello.o -DGREETING="hello" -c hello.c
The command-line scanner is intended to behave in the same way as gcc. In case of doubt, distcc runs the job locally.
In particular, this means that commands that compile and link in one go cannot be distributed. These are quite rare in realistic projects. Here is one example of a command that could not be distributed, because it calls both the compiler and the linker:
distcc gcc -o hello hello.c
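The distinction between the two example commands can be sketched as a small shell function. This is only an illustration of the rule described above, not distcc's actual scanner: the function name can_distribute is hypothetical, and it assumes that the presence of -c is the only signal that a command stops before linking.

```shell
# Simplified sketch (not distcc's real scanner): classify a compiler
# command line as "remote" if it compiles without linking, else "local".
# Assumption: a -c argument is the only thing checked.
can_distribute() {
    seen_c=no
    for arg in "$@"; do
        case $arg in
            -c) seen_c=yes ;;
        esac
    done
    if [ "$seen_c" = yes ]; then
        echo remote
    else
        # In case of doubt, run the job locally.
        echo local
    fi
}

can_distribute gcc -o hello.o -c hello.c
can_distribute gcc -o hello hello.c
```

The first call corresponds to the distributable example and the second to the compile-and-link example.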
Moving source across the network is less efficient than compiling it locally. If you have access to a machine much faster than your workstation, however, the performance gain may outweigh the cost of transferring the source code, and it may be quicker to ship all your source across the network to compile it there.
In general, it is even better to compile on two or more machines in parallel. Any number of invocations of distcc can run at the same time, and they will distribute their work across the available hosts.
distcc does not manage parallelization, but relies on Make or some other build system to invoke compiles in parallel.
With GNU Make, you should use the -j option to specify a number of parallel tasks slightly higher than the number of available hosts. For example:
$ export DISTCC_HOSTS='angry toey wistful localhost'
$ make -j5
The $DISTCC_HOSTS variable tells distcc which volunteer machines are available to run jobs. This is a space-separated list of host specifications, each of which has the syntax:
HOSTNAME[:PORT]
A numeric TCP port may optionally be specified after a colon. If no port is specified, it uses the default, which is currently 4200.
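As an illustration of the HOSTNAME[:PORT] syntax, the following sketch splits a specification into its host and port. The function name parse_hostspec and the port 4201 in the sample list are made up for this example; 4200 is the default given above.

```shell
# Sketch: split a HOSTNAME[:PORT] specification into "host port".
# Falls back to port 4200, the default stated in the text.
parse_hostspec() {
    spec=$1
    case $spec in
        *:*) host=${spec%%:*}; port=${spec#*:} ;;
        *)   host=$spec; port=4200 ;;
    esac
    echo "$host $port"
}

# Example host list; the explicit port on toey is an arbitrary choice.
for spec in angry toey:4201 localhost; do
    parse_hostspec "$spec"
done
```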
If only one invocation of distcc runs at a time, it will always execute on the first host in the list. (This behaviour is not absolutely guaranteed, however, and may change in future versions.)
The name localhost is handled specially by running the compiler in place.
The daemon may be tested on localhost by setting DISTCC_HOSTS=127.0.0.1. Although localhost causes distcc to execute the job directly, using an IP address will cause it to make a TCP connection to a daemon on localhost. This is slower, but useful for testing.
When distcc is invoked, it needs to decide which of the volunteers in DISTCC_HOSTS should be used to compile a job. It uses a simple heuristic to try to spread load across machines appropriately.
You can imagine all of the compile machines as being leaky buckets, some with larger holes (faster CPUs) than others. The distcc client tries to keep water at the same level on each one (the same number of jobs running), preferring hosts occurring earlier in DISTCC_HOSTS. Over the course of a build, the faster machines will complete jobs more quickly, and therefore be topped up more quickly and do more work overall, but without the client ever actually needing to know which one is fastest.
This design has the advantage of not requiring the client to know in advance the speeds of the volunteers, and being quite simple to implement. It copes quite well with machines that are temporarily slowed down: they are just topped-up more slowly in the future.
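The selection rule in the leaky-bucket description can be sketched as follows. This is a simplified model, not the client's real code: the function name pick_host and the host=count argument format are assumptions made for the example, and the job counts are passed in by hand rather than tracked by the client.

```shell
# Sketch of the selection rule only: given host=running_jobs pairs in
# DISTCC_HOSTS order, choose the host with the fewest running jobs.
# Ties go to the host listed earliest.
pick_host() {
    best=
    best_n=
    for pair in "$@"; do
        host=${pair%%=*}
        n=${pair#*=}
        # Strictly-less comparison keeps the earlier host on a tie.
        if [ -z "$best" ] || [ "$n" -lt "$best_n" ]; then
            best=$host
            best_n=$n
        fi
    done
    echo "$best"
}

pick_host angry=2 toey=1 wistful=1
```

A faster machine finishes its jobs sooner, so its count drops sooner and it is chosen more often, without its speed ever being measured.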
Scheduling is coordinated between different invocations of the distcc client by lockfiles in the temporary directory. There is no coordination between clients running as different users, on different hosts, or with different TMPDIR paths.
On Linux, scheduling slightly too many jobs on any machine is quite harmless, as long as the number is not so high that the machine begins thrashing. So it's OK to provide a -j number substantially higher than the number of available processors.
The biggest problem with this design is that it handles multiprocessor machines poorly: they probably ought to have jobs scheduled proportional to the number of processors. At the moment, the best thing is to run with a -j factor equal to the product of the maximum number of CPUs in any machine (MAX_CPUS) and the number of machines. This should make sure that roughly MAX_CPUS tasks run on every machine at all times, and will therefore keep all CPUs loaded, but will cause excessive task-switching on machines with fewer CPUs. Task switching is not very expensive on Linux so it is not a big problem, but it does lose a few percentage points of speed. This should be fixed in a future release.
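The suggested -j value is simple arithmetic and can be computed from the host list. In this sketch the function name jobs_for is hypothetical, and MAX_CPUS is supplied by hand because distcc does not discover it.

```shell
# Sketch: -j factor = MAX_CPUS * number of machines.
# Usage: jobs_for MAX_CPUS host...
jobs_for() {
    max_cpus=$1
    shift
    # $# is now the number of hosts remaining in the argument list.
    echo $(( max_cpus * $# ))
}

jobs_for 2 angry toey wistful localhost
```

With four machines and at most two CPUs in any of them this gives 8, which could then be passed on as make -j8.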
Error messages or warnings from local or remote compilers are passed through to diagnostic output on the client. The compiler takes all file names and line numbers from pragmas in the preprocessed output, so error messages will always have the correct pathnames for files on the client.
distcc prints a message when it runs a command locally or remotely. For more information, set $DISTCC_VERBOSE and look at the server's log file.
By default, distcc prints diagnostic messages to stderr. Sometimes these are too intrusive into the output of the regular compiler, and so they may be selectively redirected by setting the $DISTCC_LOG environment variable to a filename.
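A debugging setup that combines verbose output with a log file might look like the sketch below. The log path is an arbitrary choice made for this example; any writable filename would do.

```shell
# Ask distcc for explanatory messages.
DISTCC_VERBOSE=1
# Send distcc's own messages to a file rather than stderr.
# The location is an assumption chosen for illustration.
DISTCC_LOG="${TMPDIR:-/tmp}/distcc.log"
export DISTCC_VERBOSE DISTCC_LOG
```

Builds started from the same shell afterwards leave the compiler's own output on the terminal and collect distcc's messages in the named file.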
The current version of the distcc daemon writes diagnostic messages only to files on its own machine. (By default, it uses the syslog daemon channel.) If compilation is failing, please examine the log file on the relevant volunteer machine.
The exit code of distcc is normally that of the compiler: zero for successful compilation and non-zero otherwise.
If distcc fails to distribute a job to a selected volunteer machine, it will try to run the compiler locally on the client. distcc only tries a single remote machine for each job.
distcc tries to distinguish between a failure to distribute the job, and a "genuine" failure of the compiler on the remote machine, for example because of a syntax error in the program. In the second case, distcc does not re-run the compiler locally, and returns the same exit code as the remote compiler.
If distcc fails to run the compiler, it may return one of the following error codes. These are also used by distccd.
EXIT_DISTCC_FAILED
Generic or unspecified failure in distcc.
EXIT_BIND_FAILED
Failed to bind and listen on network socket. Port may already be in use.
EXIT_CONNECT_FAILED
Failed to establish network connection or listen on socket. The host may be invalid or unreachable, or there may be no daemon listening.
EXIT_COMPILER_CRASHED
The underlying compiler exited because of a signal. This probably indicates a compiler bug, or a problem with the hardware or OS on the server.
EXIT_OUT_OF_MEMORY
Memory allocation failed.
EXIT_BAD_HOSTSPEC
$DISTCC_HOSTS was undefined, empty, or syntactically invalid. (At the moment, you should never see this code because distcc will fall back to building locally. Let me know if you would prefer a hard error.)
Cross compilation means building programs to run on a machine whose processor, architecture, or operating system differs from that of the machine on which they are compiled. distcc supports cross compilation, including teams of mixed-architecture machines, although some changes to the compilation commands may be required.
The compilation command passed to distcc must be one that will execute properly on every volunteer machine to produce an object file of the appropriate type. If the machines have different processors, then simply using distcc cc will probably not work, because that will normally invoke the volunteer's native compiler.
Machines with the same instruction set but different operating systems may not necessarily generate compatible .o files. Empirically it seems that the native FreeBSD compiler generates object files compatible with Linux for C programs, but not for C++. It may be a good idea to install a Linux cross compiler on BSD volunteers.
Different versions of the compiler may generate incompatible object files. This seems to be much more of a problem with C++ than with C, because the C++ ABI (application binary interface) has changed in recent years. If you will be building C++ programs, it may be a good idea to install the same version of g++ on all machines.
gcc has two options to select at run time the target platform (-b) and the gcc version (-V) to be used. Several different gcc configurations can be installed side-by-side on any machine, and these options are used by the top-level "driver" program to switch between them. For more information, see Specifying Target Machine and Compiler Version in the gcc manual.
For example, adding -b i386-linux to $CFLAGS ought to make sure the correct compiler is invoked to build Linux/x86 programs. This has no particular effect if all the volunteers are natively of that type, but is very useful if some of the volunteer machines are different: either the correct compiler will be used, or you will see an error message like this if it is not installed:
gcc: installation problem, cannot exec `cpp0': No such file or directory
gcc: file path prefix `/usr/lib/gcc-lib/i386-freebsd/2.95.4/' never used
The parts of gcc particular to target machines and versions are normally kept in the directory /usr/local/lib/gcc-lib/MACHINE/VERSION.
Alternatively, you might specify as the compiler command the name of a script or symbolic link that calls the appropriate version of gcc on each machine. For example:
CC='distcc gcc-i386-linux'
In general, using the -b option is probably better, because it does not require any special creation of scripts on the volunteer machines beyond installing the appropriate gcc configuration. However, using a special compiler name may be useful if you need to make sure that a particular version of gcc's driver program is used, perhaps because you are testing gcc. This approach might also be useful with compilers other than gcc that have no built-in mechanism for choosing a target.
Suggestions for other ways to support cross-compilation or automatically detecting incompatibilities are welcome.
distcc works well with the ccache tool for caching compilation results. To use the two together, set:
CC='ccache distcc'
distcc works quite well with autoconf.
DISTCC_VERBOSE can give autoconf trouble because autoconf tries to parse error messages from the compiler. If you redirect distcc's diagnostics using DISTCC_LOG then it seems to be fine.
Some autoconf-based systems "freeze" the compiler name used for configure into their Makefiles. To make them use distcc, you must set $CC when running ./configure, override $CC on the right-hand side of the Make command line, or both.
Some poorly-written shell scripts may assume that $CC is a single word. At the moment the best fix is to use a shell script that calls distcc.
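One way to create such a wrapper is sketched below. The helper name make_wrapper and the script name distcc-gcc are made up for this example, and the wrapper assumes that gcc is the compiler you want distcc to run.

```shell
# Sketch: write a one-word wrapper script named distcc-gcc into the
# directory given as $1. Both names are hypothetical.
make_wrapper() {
    dir=$1
    # Quoted 'EOF' keeps "$@" literal so it is expanded by the wrapper,
    # not by this function.
    cat > "$dir/distcc-gcc" <<'EOF'
#!/bin/sh
# Pass every argument through to the compiler via distcc.
exec distcc gcc "$@"
EOF
    chmod +x "$dir/distcc-gcc"
}
```

After running make_wrapper on a directory in your PATH, $CC can be set to the single word distcc-gcc.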
Some versions of libtool seem not to cope well when CC is set to more than one word, such as "distcc gcc". Setting CC=distcc, which is supported in 0.10 and later, seems to work well.
MOC is the Qt meta-object compiler.
distcc transfers only the binary contents of source, error, and object files, without any concern for metadata, attributes, character sets or end-of-line conventions.
distcc never transmits file times across the network or modifies them, and so should not care whether the clocks on the client and volunteer machines are synchronized or not. When an object file is received onto the client, its modification time will be the current time on the client machine.