2 How to interpret the Erlang crash dumps
This document describes the erl_crash.dump file generated upon abnormal exit of the Erlang runtime system. The system writes the crash dump to the current directory of the emulator or to the file named by the environment variable ERL_CRASH_DUMP (however environment variables are defined on the current operating system). For a crash dump to be written, a writable file system must be mounted.
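The lookup rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here: the function name crash_dump_path and its parameters are hypothetical, and the working directory passed in must be that of the emulator, which is not necessarily that of the script.

```python
import os
from pathlib import Path


def crash_dump_path(env=None, cwd=None):
    """Guess where the runtime system would write its crash dump.

    env: mapping of environment variables (defaults to os.environ).
    cwd: the emulator's current directory (defaults to this script's).
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    override = env.get("ERL_CRASH_DUMP")
    if override:
        # The environment variable names the dump file directly.
        return Path(override)
    # Otherwise the dump lands in the emulator's current directory.
    return cwd / "erl_crash.dump"
```

For example, crash_dump_path({}, "/tmp") yields /tmp/erl_crash.dump, while setting ERL_CRASH_DUMP in the mapping returns that path instead.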
Crash dumps are written mainly for one of two reasons: either the built-in function erlang:halt/1 is called explicitly with a string argument from running Erlang code, or the runtime system has detected an error that it cannot handle. Most often the system cannot handle the error because of an external limitation, such as running out of memory. A crash dump due to an internal error may be caused by the system reaching a limit in the emulator itself (such as the number of atoms in the system, or too many simultaneous ets tables). Usually the emulator or the operating system can be reconfigured to avoid the crash, which is why interpreting the crash dump correctly is important.
2.1 Reasons for crash dumps
The reason for the dump is noted at the beginning of the file as Slogan: <reason> (the word "slogan" has historical roots). If the system is halted by the BIF erlang:halt/1, the slogan is the string parameter passed to the BIF; otherwise it is a description generated by the emulator or the (Erlang) kernel. Normally the message is enough to understand the problem, but some messages are nevertheless described here. Note, however, that the suggested reasons for the crash are only suggestions. The exact reasons for the errors may vary depending on the local applications and the underlying operating system.
- "Can't allocate N bytes of memory" - The system has run out of memory. The number N indicates the amount of memory needed (in bytes), which can give some hint of what the problem is. If N is very large, an Erlang process may be consuming vast amounts of memory, possibly due to an error in the Erlang code.
- "Can't reallocate N bytes of memory" - Same as above.
- "Can't allocate Something" - Same as above.
- "Got unusable memory block Address, size N" - The emulator has reached the 4 GB limit of the Erlang virtual memory space. Something consumes huge amounts of memory, probably an error in the Erlang code.
- "Unexpected op code N" - Error in compiled code: the beam file is damaged, or there is an error in the compiler.
- "Module Name undefined" | "Function Name undefined" | "No function Name:Name/1" | "No function Name:start/2" - The kernel/stdlib applications are damaged or the start script is damaged.
- "Driver_select called with too large file descriptor N" - The number of file descriptors for sockets exceeds 1024 (Unix only). The limit on file descriptors in some Unix flavors can be set to over 1024, but only 1024 sockets/pipes can be used simultaneously by Erlang (due to limitations in the Unix select call). The number of open regular files is not affected by this.
- "Received SIGUSR1" - The SIGUSR1 signal was sent to the Erlang machine (Unix only).
- "Kernel pid terminated (Who) (Exit-reason)" - The kernel supervisor has detected a failure, usually that the application_controller has shut down (Who = application_controller, Why = shutdown). The application controller may have shut down for a number of reasons, the most usual being that the node name of the distributed Erlang node is already in use. A complete supervisor tree "crash" (i.e., the top supervisors have exited) will give about the same result. This message comes from the Erlang code and not from the virtual machine itself. It is always due to some kind of failure in an application, either within OTP or a "user-written" one. Looking at the error log for your application is probably the first step to take.
- "Init terminating in do_boot ()" - The primitive Erlang boot sequence was terminated, most probably because the boot script has errors or cannot be read. This is usually a configuration error: the system may have been started with a faulty -boot parameter or with a boot script from the wrong version of OTP.
- "Could not start kernel pid (Who) ()" - One of the kernel processes could not start. This is probably due to faulty arguments (like errors in a -config argument) or faulty configuration files. Check that all files are in their correct location and that the configuration files (if any) are not damaged. Usually there are also messages written to the controlling terminal and/or the error log explaining what's wrong.
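As a rough aid when triaging many dumps, the slogans listed above can be matched mechanically. The following Python sketch pulls the Slogan line out of a dump and maps it to a coarse category. The category labels and function names are hypothetical and chosen only for this illustration; the patterns cover just the messages listed above, so any other slogan falls through to "unknown".

```python
import re

# Patterns are taken from the slogans listed above. The labels on the
# right are hypothetical names invented for this sketch.
SLOGAN_PATTERNS = [
    (re.compile(r"Can't (re)?allocate"), "out_of_memory"),
    (re.compile(r"Got unusable memory block"), "out_of_memory"),
    (re.compile(r"Unexpected op code"), "damaged_code"),
    (re.compile(r"(Module|Function) \S+ undefined|No function \S+:\S+/\d"),
     "damaged_installation"),
    (re.compile(r"Driver_select called with too large file descriptor"),
     "too_many_descriptors"),
    (re.compile(r"Received SIGUSR1"), "signal"),
    (re.compile(r"Kernel pid terminated"), "application_failure"),
    (re.compile(r"Init terminating in do_boot"), "boot_failure"),
    (re.compile(r"Could not start kernel pid"), "boot_failure"),
]


def extract_slogan(dump_text):
    """Return the text after 'Slogan:' or None if no such line exists."""
    for line in dump_text.splitlines():
        if line.startswith("Slogan:"):
            return line[len("Slogan:"):].strip()
    return None


def classify_slogan(slogan):
    """Map a slogan to one of the coarse categories, or 'unknown'."""
    for pattern, label in SLOGAN_PATTERNS:
        if pattern.search(slogan):
            return label
    return "unknown"
```

A slogan passed to erlang:halt/1 by application code will normally end up as "unknown" here, which is itself a useful hint that the halt was deliberate.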
Errors other than the ones mentioned above may occur, as the erlang:halt/1 BIF may generate any message. If the message is not generated by the BIF and does not occur in the list above, it may be due to an error in the emulator. There may, however, be unusual messages not mentioned here that are still connected to an application failure. There is a lot more information available, so a more thorough reading of the crash dump may reveal the crash reason. The size of processes, the number of ets tables and the Erlang data on each process stack can be useful for tracking down the problem.
2.2 Process information
After the general information in the crash dump (the date, slogan and version information) follows a listing of each living Erlang process in the system, and zombie processes. The process information for one process may look like this (line numbers have been added):

    (1) <0.2.0> Waiting. Registered as: erl_prim_loader
    (2) Spawned as: erl_prim_loader:start_it/4
    (3) Message buffer data: 262 words
    (4) Link list: [<0.0.0>,<0,1>]
    (5) Dictionary: [{fake, entry}]
    (6) Reductions 2194 stack+heap 987 old_heap_sz=987
    (7) Heap unused=85 OldHeap unused=987
    (8) Stack dump:
    (9) program counter = 0x1875e4 (erl_prim_loader:loop/3 + 52)
    (10) cp = 0xed830 (<terminate process normally>)
    (11) arity = 0
    (12)
    (13) 1d4ae0 Return addr 0xED830 (<terminate process normally>)
    (14) y(0) ["/usr/local/product/releases/otp_beam_sunos5_r7b_patched/lib/kernel-2.6.1.6/ebin","/usr/local/product/releases/otp_beam_sunos5_r7b_patched/lib/stdlib-1.9.3/ebin"]
    (15) y(1) <0.1.0>
    (16) y(2) {state,[],none,get_from_port_efile,stop_port,exit_port,#Port<0.2>,infinity,dummy_in_handler}
    (17) y(3) infinity

Each line of the output should be interpreted as follows:
- (1) - The process id (<0.2.0>), the state of the process (Waiting) and the registered name of the process, if any (erl_prim_loader). The state of the process can be one of the following:
  - Scheduled - The process was scheduled to run but was not currently running ("in the run queue").
  - Waiting - The process was waiting for something (in receive).
  - Running - The process was currently running. If the BIF erlang:halt/1 was called, this was the process calling it.
  - Exiting - The process was on its way to exit.
  - Process is garbing, limited information. - The process was garbage collecting when the crash dump was written, so the rest of the information for this process is limited.
  - Suspended - The process is suspended, either by the BIF erlang:suspend_process/1 or because it is trying to write to a busy port.
- (2) - The entry point of the process, i.e., the function that was referenced in the spawn or spawn_link call that started the process.
- (3) - Size of fragmented heap data (incorrectly called message buffers). This is data either created by messages being sent to the process or by the Erlang BIFs. This amount depends on so many things that this field is utterly uninteresting.
- (4) - Process ids of processes linked to this one. May also contain ports. If process monitoring is used, this field also tells in which direction the monitoring is in effect, i.e., a link "to" a process tells you that the "current" process was monitoring the other, and a link "from" a process tells you that the other process was monitoring the current one.
- (5) - The contents of the process dictionary (the data stored with put/2 and read with get/1), if non-empty.
- (6) - The number of reductions consumed by the process, the size of the stack and heap (they share memory segment) and the size of the "old heap". This "old heap" may require some explanation. The Erlang virtual machine uses generational garbage collection with two generations. There is one heap for new data items and one for the data that have survived two garbage collections. The assumption (which is almost always correct) is that data that survive two garbage collections can be "tenured" to a heap more seldom garbage collected, as they will live for a long period. This is a quite usual technique in virtual machines. The sum of the heaps and stack together constitute most of the process's allocated memory.
- (7) - The amount of unused memory on each heap. This information is usually useless.
- (8) - (17) - A dump of the Erlang process stack. Most of the live data (i.e., variables currently in use) are placed on the stack; thus this can be quite interesting. One has to "guess" what's what, but as the information is symbolic, thorough reading of this information can be very useful. As an example, we can find the state variable of the Erlang primitive loader on line (16).
- (9)-(11) - Miscellaneous information about the process state:
  - (9) program counter - The current instruction pointer, only interesting for runtime system developers.
  - (9) The function into which the program counter points - This is the current function of the process.
  - (10) cp - The current continuation pointer, i.e., the return address for the current call. Usually useless for other than runtime system developers. This may be followed by the function into which cp points, which is the function calling the current function.
  - (11) arity - Number of live argument registers. The argument registers, if any are live, follow. These may contain the arguments of the function if they are not yet moved to the stack.
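To make the layout of lines (1) and (6) concrete, here is a small Python sketch that parses them. The regular expressions are written against the sample listing above only; since the dump format changes between releases, they are an assumption and may need adjusting. The function names and the derived total_words field are hypothetical.

```python
import re

# Line (1): pid, state, and an optional registered name.
HEADER_RE = re.compile(
    r"^(?P<pid><\d+\.\d+\.\d+>)\s+(?P<state>[^.]+)\."
    r"(?:\s+Registered as:\s+(?P<name>\S+))?"
)

# Line (6): reductions, stack+heap size, and old heap size (in words).
SIZES_RE = re.compile(
    r"Reductions\s+(?P<reductions>\d+)\s+"
    r"stack\+heap\s+(?P<stack_heap>\d+)\s+"
    r"old_heap_sz=(?P<old_heap>\d+)"
)


def parse_header(line):
    """Parse line (1) into pid, state and registered name (or None)."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    return {"pid": m["pid"], "state": m["state"].strip(), "name": m["name"]}


def parse_sizes(line):
    """Parse line (6); total_words approximates the process's memory."""
    m = SIZES_RE.search(line)
    if m is None:
        return None
    sizes = {key: int(value) for key, value in m.groupdict().items()}
    # Stack+heap plus old heap make up most of the allocated memory.
    sizes["total_words"] = sizes["stack_heap"] + sizes["old_heap"]
    return sizes
```

Sorting all processes in a dump by total_words is a quick way to spot a process that has grown unexpectedly large.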
When interpreting the data for a process, it is helpful to know that anonymous function objects (funs) are given a name constructed from the name of the function in which they are created, and a number (starting with 0) indicating the number of that fun within that function.
2.3 Port information
This section lists the open ports, their owners, any linked processes, and the name of their driver or external process.
2.4 Internal table information
This section mostly contains information for runtime system developers. The following fields can be of interest:
- Hash Table(atom_tab) - The number of objects in the atom table, indicated by the field objs(N), is the number of atoms present in the system at the time of the crash. Some tens of thousands of atoms is perfectly normal, but more could indicate that the BIF erlang:list_to_atom/1 is used to dynamically generate a lot of different atoms, which is never a good idea.
- Hash Table(module_code) - The field objs(N) indicates the number of loaded modules in the system.
- Allocated binary N - This number indicates how many bytes are allocated to binaries (the binary data type) for the whole system. Binaries allocated directly on process heaps (small binaries) are not accounted for here.
The rest of the information is only of interest for runtime system developers.
2.5 ETS tables
This section contains information about all the ETS tables in the system. The following fields are interesting for each table:
- Table Number(with name)Name - The identifier and the name of the table.
- Owner Pid - The process owning the table.
- Buckets: N | Ordered set (AVL tree), Elements: N - The most interesting thing here is that it indicates whether or not the table is an ordered_set.
- Table's got N objects - the number of objects in the table
- Table's got N words of active data - The number of words (usually 4 bytes/word) allocated to data in the table.
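The word count in the last field converts to bytes by multiplying by the word size. A minimal Python sketch, assuming the line reads exactly as shown above and using the 4 bytes/word figure as the default (the function name and the bytes_per_word parameter are hypothetical):

```python
import re

WORDS_RE = re.compile(r"Table's got (\d+) words of active data")


def active_data_bytes(line, bytes_per_word=4):
    """Convert the 'words of active data' field to bytes.

    bytes_per_word defaults to 4, the usual figure given above; pass a
    different value if the emulator uses another word size.
    """
    m = WORDS_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)) * bytes_per_word
```

For a table reporting 2500 words of active data, this gives 2500 * 4 = 10000 bytes.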
2.6 Timers
This section contains information about all the timers started with the BIFs erlang:start_timer/3 and erlang:send_after/3. Each line includes the message to be sent, the pid to receive the message and how many milliseconds were left until the message would have been sent.
2.7 Distribution information
If the Erlang node was alive, i.e., set up for communicating with other nodes, this section lists the connections that were active.
2.8 Loaded module information
This is a list of all loaded modules, together with the memory usage of each module, in bytes. Note that loaded code is usually larger than the packed format in the beam files.
At the end of the list, the memory usage by loaded code is summarized. There is one field for "Current code" which is code that is the current latest version of the modules. There is also a field for "Old code" which is code where there exists a newer version in the system, but the old version is not yet purged.
2.9 Atoms
Now all the atoms in the system are written. This is only interesting if one suspects that dynamic generation of atoms could be a problem, otherwise this section can be ignored.
2.10 Disclaimer
The format of the crash dump evolves between releases of OTP. Some information here may not apply to your version. A description such as this will never be complete; it is meant as an explanation of the crash dump in general and as a help when trying to find application errors, not as a complete specification.