Automated Office Entrance Theme Music with Python

Scott Lobdell, an engineer here at Hearsay Social, will be giving a Django Meetup talk on February 18, 2014, from 5:30 to 6:30 pm, about how he used open source facial recognition software and Python to play intro music for employees walking into our office. Click here to RSVP or to learn more.

For more info about Scott's background, read about his experiences transitioning from the Army to working at Hearsay Social.

File Descriptor Issues in Celery

A few days after upgrading to Celery v3.1, we noticed that the Celerybeat worker that handles regular dispatching of jobs would often hang. Upon further inspection, we found that the process seemed to be blocked on a read from a certain file descriptor. What was baffling was that it wasn't waiting on a file descriptor connected to the RabbitMQ message broker. Instead, the process appeared to be trying to read from a seemingly random socket.

Celery uses universally unique identifiers (UUIDs) for task IDs, and the Kombu library creates one when initializing the "remote control command exchange," also known as the pidbox queue. Python v2.7 relies on the libuuid library on Unix-based machines to create these, and generating a UUID is usually very quick. I could tell that the code was stuck inside the __uuid_generate_random() function of the libuuid library. So why would the process hang on a fairly simple operation?
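
For context, here is a minimal sketch, based on how Python 2.7's uuid module works, of the path from uuid.uuid4() down into libuuid via ctypes (simplified; the stdlib's actual fallback logic is more involved, and this assumes libuuid is installed on the system):

import ctypes
import ctypes.util
import uuid

# Load libuuid the same way the stdlib does on Linux.
_libuuid = ctypes.CDLL(ctypes.util.find_library('uuid'))
_buffer = ctypes.create_string_buffer(16)

def libuuid_uuid4():
    # uuid_generate_random() is the libuuid entry point that ends up in
    # __uuid_generate_random(), where the hung Celerybeat process was stuck.
    _libuuid.uuid_generate_random(_buffer)
    return uuid.UUID(bytes=_buffer.raw)

print(libuuid_uuid4())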

The phenomenon only occurred when running in --detach mode, which forks the process and runs it as a daemon. Without this flag, there was no issue. With it, Celerybeat would start a few tasks but then freeze and fail to launch subsequent ones.

By monitoring the Celery startup process, I noticed that an attempt was made to spawn a process called /usr/sbin/uuidd. The libuuid library tries to do this when uuid_generate_time() is invoked, which ultimately happens on startup when Celery needs to generate a UUID for the pidbox. The problem starts when launching this uuidd daemon fails. On Ubuntu v12.04 and Debian Squeeze, the related /var/run/uuidd directory needs to be created with the correct permissions. Without it, the daemon cannot be spawned and the libuuid code falls back to reading from /dev/urandom.

Because this read happens before the process is forked, a file descriptor to /dev/urandom is opened in the parent. To avoid a parent and child process writing to the same descriptors concurrently, the standard practice when daemonizing is to close all existing file descriptors, and Celery performs this cleanup after forking (a Python sketch of this practice follows the snippet below). However, the libuuid library (v2.20) stores its open file descriptor to /dev/urandom in a static variable, which means it persists for the lifetime of the program:

static int get_random_fd(void)
{
        static int fd = -2;
        ...
}
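
As an aside, the post-fork cleanup mentioned above generally looks something like the following sketch; this is an illustration of the standard practice, not Celery's actual implementation:

import os
import resource

def close_inherited_fds(keep=()):
    # Close every inherited descriptor above stderr so the daemonized child
    # doesn't share open files or sockets with its parent.
    maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[1]
    if maxfd == resource.RLIM_INFINITY:
        maxfd = 2048
    for fd in range(3, maxfd):
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError:
            pass  # descriptor wasn't open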

Consider what happens the next time get_random_fd() is called via uuid.uuid4(), which runs whenever Celery needs to generate a unique ID for a task. Since the file descriptor to /dev/urandom is only created once (the initial value of -2 marks it as unopened; subsequent calls reuse the stored descriptor), the read operation fails because that descriptor has already been closed, and there is no attempt to reopen /dev/urandom. Fortunately, in this case, there is a fallback mechanism in the library code that generates random data via the rand() function:

/* We do this all the time, but this is the only source of
 * randomness if /dev/random/urandom is out to lunch.
 */
for (cp = buf, i = 0; i < nbytes; i++)
        *cp++ ^= (rand() >> 7) & 0xFF;

But what happens if the closed file descriptor number later gets reclaimed by something else? In that situation, the read() operation could succeed while reading from the wrong file descriptor. In our case, we noticed that the process was attempting to read from a UDP socket, which we traced back to a connection belonging to the statsd client.
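
The failure mode can be reproduced in a few lines of Python; the sketch below is a contrived stand-in for libuuid's cached descriptor, not the library's actual code path:

import os
import socket

_cached_fd = None  # plays the role of libuuid's `static int fd`

def read_random(nbytes):
    global _cached_fd
    if _cached_fd is None:
        _cached_fd = os.open('/dev/urandom', os.O_RDONLY)
    return os.read(_cached_fd, nbytes)  # no check that the fd is still valid

read_random(16)          # opens /dev/urandom and caches the descriptor
os.close(_cached_fd)     # what closing "all" descriptors after the fork does

# If nothing has reused the descriptor number, the next read fails loudly...
try:
    read_random(16)
except OSError as exc:
    print(exc)           # EBADF: Bad file descriptor

# ...but if another object has since been assigned the same number (a UDP
# socket, say, like our statsd connection), the cached descriptor now points
# at that socket and a read can block or return the wrong data entirely.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
print(sock.fileno() == _cached_fd)  # frequently True: the number was reclaimed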

We were able to work around the problem by starting the uuidd daemon before Celery loads. With uuidd running, no fallback read from /dev/urandom happens before the process is forked. Later versions of libuuid (v2.24) appear to have fixed this problem, though the Debian/Ubuntu packages have yet to catch up. You could also simply avoid --detach mode, but that may complicate restarting Celery with the provided restart scripts.
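
As an illustration of the workaround, a launch wrapper along these lines would make sure uuidd is up before Celerybeat starts. The directory ownership, service command, and user name here are assumptions for an Ubuntu 12.04-style host, so treat this as a sketch rather than a drop-in script:

import os
import subprocess

def ensure_uuidd_running():
    run_dir = '/var/run/uuidd'
    if not os.path.isdir(run_dir):
        os.makedirs(run_dir)  # uuidd refuses to start without this directory
        # The uuidd daemon runs as the libuuid user on Debian/Ubuntu (assumed).
        subprocess.check_call(['chown', 'libuuid:libuuid', run_dir])
    subprocess.check_call(['service', 'uuidd', 'start'])

if __name__ == '__main__':
    ensure_uuidd_running()
    # Now it is safe to detach; libuuid talks to uuidd instead of opening
    # /dev/urandom before the fork.
    subprocess.check_call(['celery', 'beat', '--detach'])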

For more details about this issue, see: https://github.com/celery/celery/pull/1822.

Unraveling a CloudFront Caching Issue

A few weeks ago, some users reported that they were unable to log in to our site. At first, I thought it might be something related to Facebook or LinkedIn. What was puzzling was that no matter what the user clicked on the page, nothing would happen. This behavior indicated that one of the main JavaScript frameworks we use was failing to load in their browser.

We asked the customer to try different browsers and to clear their caches, but the same problem persisted. We then had them log in to a different site that didn't rely on Amazon's CloudFront servers, which normally cache our static files and serve them from locations geographically closer to the user. This time, there was no issue. When they went back to the original site, the same phenomenon reappeared.

I provided a URL to one of our main JavaScript files hosted on CloudFront and asked them to see if they could access it. Internet Explorer won't render this file in the browser but will prompt the user to download it. This security restriction turned out to provide the key clue about what was happening: because the user had to send us the entire downloaded file, its contents could be inspected. It turned out that this static JS file had been truncated to 30KB instead of the expected 400KB.

This obviously pointed to a CloudFront caching problem, but why? I started studying Amazon's caching policies to understand how it could have happened. At first I suspected it had something to do with Nginx and CloudFront, since CloudFront uses HTTP v1.0 and the web server must be configured to perform compression for this older protocol version. However, current versions of Nginx already have this configured properly, so that possibility was quickly eliminated.

The section on Dropped TCP Connections, however, intrigued me. Normally CloudFront determines whether a file has been fully received by checking the Content-Length: header in the response. In the absence of this header, it cannot tell whether the file arrived completely, so if the TCP connection is dropped abruptly, only part of the file may end up being cached.

I decided to investigate further by examining what happened when Nginx served the file with and without compression. With a standard web request, the Content-Length: header appeared. However, with a request specifying that the client accepts Gzip-compressed content, no Content-Length: header was returned. This behavior can be confirmed with curl:

curl -0 -I [URL of JS file] -> returns a Content-Length: header

curl -0 -I -H "Accept-Encoding: gzip,deflate" [URL of JS file] -> no Content-Length: header
(Note: -0 forces HTTP/1.0, which is what CloudFront uses; -I fetches only the response headers; and -H adds the Accept-Encoding: request header.)

It turns out that Nginx does not send a Content-Length: header when it performs on-the-fly compression. This behavior is noted in the Nginx documentation (http://wiki.nginx.org/NginxHttpGzipModule#gzip_http_version), and it can also be observed in the Nginx source code, which intentionally removes the header when the web server has to do the compression itself.

What wasn't clear from the documentation was whether serving pre-compressed files by enabling the gzip_static option would avoid this problem. It turns out that the source code skips this header removal when a matching .gz file already exists. If you're using Chef to manage your servers, the pull request with the changes to support the gzip_static option is listed here.

The basic conclusion is that if you intend to serve static files to CloudFront via Nginx, it's probably better to pre-compress the files and use the gzip_static module. CloudFront does not perform on-the-fly Gzip compression, so you can't simply disable Gzip on Nginx without incurring extra bandwidth usage. And if you rely on Nginx to do the compression and the TCP connection gets dropped before the entire static file reaches CloudFront, you could run into this same issue.
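
As a rough illustration of that pre-compression step, a small script like the one below could write a foo.js.gz alongside each foo.js so that gzip_static can serve the pre-built file (with a Content-Length: header) instead of compressing on the fly. The static root path is a placeholder:

import gzip
import os
import shutil

STATIC_ROOT = '/var/www/static'  # hypothetical deploy directory

def precompress(root, extensions=('.js', '.css')):
    # Write a .gz sibling for every matching static file so Nginx's
    # gzip_static module can serve it directly.
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            src = os.path.join(dirpath, name)
            with open(src, 'rb') as fin, gzip.open(src + '.gz', 'wb') as fout:
                shutil.copyfileobj(fin, fout)

if __name__ == '__main__':
    precompress(STATIC_ROOT)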