September 26, 2013

A few weeks ago, some users reported that they were unable to log in to our site. At first, I thought it might be something related to Facebook or LinkedIn. What was puzzling was that no matter what the user clicked on the page, nothing would happen. This behavior suggested that one of the main JavaScript frameworks we use was failing to load in their browser.

We asked the customer to try different browsers and to clear their caches, but the same problem persisted. We then had them log in to a different site that didn't rely on Amazon's CloudFront servers, which normally cache our static files and serve them from locations geographically closer to the user. This time, there was no issue. When we went back to the original site, the same phenomenon occurred.

I provided a URL to one of our main JavaScript files hosted on CloudFront and asked them to see if they could access it. Internet Explorer won't render this file in the browser but will prompt the user to download it. This security restriction turned out to provide the key clue about what was happening. Because the user had to download the entire file, its contents could be inspected. It turned out that this static JS file had been truncated to 30KB instead of the expected 400KB.

The issue obviously pointed to a CloudFront caching problem, but why? I started studying Amazon's caching policies to understand how this could have happened. At first I suspected it had something to do with Nginx and CloudFront, since CloudFront uses HTTP/1.0 and the web server must be configured to perform compression for this older protocol version. However, current versions of Nginx already have this configured properly, so this possibility was quickly eliminated.
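For reference, the Nginx directives involved look roughly like this (a sketch with illustrative values, not our exact configuration):

gzip on;
# By default Nginx only compresses HTTP/1.1 requests; lowering this to 1.0
# allows CloudFront's HTTP/1.0 requests to receive gzipped responses.
gzip_http_version 1.0;
gzip_types text/css application/javascript;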

However, the section on Dropped TCP Connections intrigued me. Normally, CloudFront determines whether a file has been fully received by checking the Content-Length: header in the response. In the absence of this header, it cannot tell whether a file was received completely: an abruptly dropped TCP connection looks the same as a normal end of response, so if the connection is dropped mid-transfer, only part of the file gets cached as if it were the whole thing.

I decided to investigate further by examining what happened when I used Nginx with and without compression. If I made a standard web request, the Content-Length: header would appear. However, if I made a request specifying that I could accept compressed Gzip content, no Content-Length: header was returned. This behavior could be confirmed by using curl:

curl -0 -I [URL of JS file] -> get back Content-Length: header

curl -0 -I -H "Accept-Encoding: gzip,deflate" [URL of JS file] -> no Content-Length: header
(Note: -0 forces HTTP/1.0, which is what CloudFront uses; -I shows only the response headers; and -H adds the Accept-Encoding: header to the request)
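The difference shows up directly in the response headers. The output below is abridged and illustrative (the sizes are made up), but the shape is what you would see:

# Without Accept-Encoding: the exact byte count is known up front.
HTTP/1.1 200 OK
Content-Type: application/javascript
Content-Length: 412687

# With Accept-Encoding: gzip,deflate: Nginx compresses on the fly, so it
# cannot know the final size in advance and omits Content-Length entirely.
HTTP/1.1 200 OK
Content-Type: application/javascript
Content-Encoding: gzip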

It turns out that Nginx deliberately omits the Content-Length: header when it performs on-the-fly compression, since it cannot know the compressed size in advance. This behavior is documented in the Nginx manual (http://wiki.nginx.org/NginxHttpGzipModule#gzip_http_version). It can also be observed in the source code for Nginx, which intentionally removes this header when compression is performed by the web server.
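The relevant spot in the gzip filter module looks roughly like this (paraphrased from ngx_http_gzip_filter_module.c; surrounding context omitted):

/* In ngx_http_gzip_header_filter(): once Nginx decides to compress the
   response on the fly, the original Content-Length no longer describes
   what will actually be sent, so the header is cleared outright. */
ngx_http_clear_content_length(r);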

What wasn't clear from the documentation was whether serving pre-compressed files by enabling the gzip_static option would avoid this problem. It turns out that the source code skips this step when a pre-compressed file with the .gz extension already exists, as shown in the sketch below. If you're using Chef to manage your servers, the PR with the changes to support the gzip_static option is listed here.
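As a sketch, that setup would look something like the following (the file name is illustrative; note that gzip_static lives in a module that has historically required Nginx to be built with --with-http_gzip_static_module):

# Pre-compress the asset at deploy time so Nginx can serve app.js.gz
# directly, with a Content-Length: header, instead of compressing on the fly.
gzip -9 < app.js > app.js.gz

# In nginx.conf: serve the .gz file when it exists next to the original.
gzip_static on;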

The basic conclusion is that if you intend to serve static files to CloudFront via Nginx, it's probably better to pre-compress the files and use the gzip_static module. CloudFront does not perform on-the-fly Gzip compression, so you can't simply disable Gzip on Nginx without incurring extra bandwidth usage. And if you rely on Nginx to do the compression and the TCP connection gets dropped before the entire static file is sent to CloudFront, you could encounter this same issue.
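Once the pre-compressed files are in place, you can verify the fix with the same curl command used earlier; the gzipped response should now include a Content-Length: header:

curl -0 -I -H "Accept-Encoding: gzip,deflate" [URL of JS file] -> Content-Encoding: gzip and a Content-Length: header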


