A few days after upgrading to Celery v3.1, we noticed that the Celerybeat
worker that handles
regular dispatching of jobs would often hang. Upon further inspection, we noticed that the process
seemed to be blocked on reading from a certain file descriptor. What was baffling was that it
wasn't waiting on a file descriptor connected to the RabbitMQ message broker. Instead, it seemed like
the process was trying to read from a random socket.
Celery uses universally unique identifier (UUID) generation for creating unique task IDs, and the Kombu library creates one when initializing the "remote control command exchange," also known as the pidbox queue. Python v2.7 relies on the system libuuid library to create these, and generating a UUID is usually very quick. I could tell that the code was stuck inside the __uuid_generate_random() function of the libuuid library. So why would the process hang on a fairly simple operation?
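To make the operation concrete, the following is a rough Python sketch (not Celery's actual code path) of what a version-4 UUID is: 16 random bytes with the version and variant bits forced, which is essentially what libuuid's __uuid_generate_random() produces.

```python
import os
import uuid

# Hypothetical sketch: a version-4 UUID is 16 random bytes with the
# version and variant bits forced into place.
raw = os.urandom(16)
task_id = uuid.UUID(bytes=raw, version=4)  # version=4 sets the fixed bits

print(task_id.version)  # → 4
```

The only slow step here should be reading the 16 random bytes, which is why a hang inside this code path was so surprising.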
The phenomenon only occurred when running in --detach mode, which forks a process that executes in daemon mode. Without this flag, there was no issue. We could see that Celerybeat would start a few tasks but then freeze and fail to launch subsequent ones.
By monitoring the Celery startup process, I noticed that an attempt was made to spawn a daemon called /usr/sbin/uuidd. The libuuid library attempts to do so from its UUID generation routine, which is ultimately executed on startup when Celery needs to generate a UUID for the pidbox. The problem starts to happen when launching this uuidd daemon fails. In Ubuntu v12.04 and Debian Squeeze, the related /var/run/uuidd directory needs to be created with the correct permissions. Without it, the daemon will be unable to spawn, and the libuuid code will fall back to reading from /dev/urandom.
Because this read is done before the process forks, a file descriptor to /dev/urandom is opened in the parent. To avoid concurrent writes by a parent and child process, the standard practice is to close all existing file descriptors, and Celery performs this operation after forking the process.
However, the libuuid library (v2.20) declares the open file handle to /dev/urandom as static, which means it will remain resident for the lifetime of the program:

static int get_random_fd(void)
{
    static int fd = -2;
    /* ... opened only on the first call, then reused ... */
}
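The consequence of this caching can be modeled in a few lines of Python (a hypothetical sketch of the C behavior, not libuuid itself): once the daemonization step closes the cached descriptor, any later read through it fails.

```python
import os

# Model of libuuid's static-fd caching (hypothetical sketch): the
# descriptor is opened once and kept for the life of the program.
cached_fd = os.open("/dev/urandom", os.O_RDONLY)
first_read = os.read(cached_fd, 16)   # works: 16 random bytes

# Daemonization then closes every inherited descriptor, including ours.
os.close(cached_fd)

# The next "UUID generation" still uses the cached value and fails.
try:
    os.read(cached_fd, 16)
    stale_read_failed = False
except OSError:                        # EBADF: bad file descriptor
    stale_read_failed = True
```

This is exactly the situation Celery's fork-and-close sequence creates for libuuid's cached descriptor.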
Consider what happens the next time a call is made to this function via uuid.uuid4(), which is invoked whenever Celery needs to generate a unique ID for a task. Since the file descriptor to /dev/urandom will only be created once (the initial value is set to -2, and subsequent calls reuse the existing file descriptor), the read operation will fail because the descriptor has already been closed. There is no attempt to reopen /dev/urandom. Fortunately, in this case, there is a fallback mechanism in the library code to generate random data via the rand() function:
/*
 * We do this all the time, but this is the only source of
 * randomness if /dev/random/urandom is out to lunch.
 */
for (cp = buf, i = 0; i < nbytes; i++)
    *cp++ ^= (rand() >> 7) & 0xFF;
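A rough Python rendering of that fallback (a hypothetical sketch of the C loop above, with the function name invented for illustration) looks like this:

```python
import random

def xor_fallback(nbytes):
    # Hypothetical sketch of the C fallback above: XOR each byte of
    # the output buffer with pseudo-random bits drawn from the PRNG.
    buf = bytearray(nbytes)
    for i in range(nbytes):
        buf[i] ^= (random.getrandbits(16) >> 7) & 0xFF
    return bytes(buf)

print(len(xor_fallback(16)))  # → 16
```

The key point is that the fallback produces output even when the descriptor read yields nothing, so a failed read goes unnoticed by the caller.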
What happens, though, if the file descriptor that was opened actually corresponds to one that gets reclaimed later? In this case, the read() operation could succeed by reading from the wrong file descriptor. In our case, we noticed that the process was attempting to read from a UDP socket, which we traced down to a connection to the statsd server.
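The descriptor-reclamation hazard is easy to reproduce directly (a hypothetical sketch, unrelated to Celery's code): POSIX hands out the lowest available descriptor number, so a socket opened after the close can silently take over the old number.

```python
import os
import socket

# Open and close a descriptor, as happens to libuuid's cached fd.
fd = os.open("/dev/urandom", os.O_RDONLY)
os.close(fd)

# The next descriptor allocated reuses the lowest free number, so the
# UDP socket lands on the old fd value.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
reused = sock.fileno() == fd

# A read through the stale number would now block on the socket,
# waiting for a datagram that never arrives -- the observed hang.
# (os.read(fd, 16) would hang here, so we don't execute it.)
sock.close()

print(reused)  # → True
```

This matches what we saw: the cached /dev/urandom descriptor number had been reassigned to the statsd UDP socket, and the read blocked forever.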
We were able to bypass this problem by starting the uuidd daemon before Celery loads. By doing so, fallback reads from /dev/urandom are avoided before the process is forked. Subsequent versions of libuuid (v2.24) appear to have this problem fixed, though the Debian/Ubuntu packages are still lagging in terms of upgrading. You could also simply avoid --detach mode, but this approach may complicate restarting with the provided restart scripts.
For more details about this issue, see: https://github.com/celery/celery/pull/1822.