April 21, 2013

If you've ever run Celery with Django, you may be tempted to set the CELERY_CONFIG_MODULE environment variable before loading your WSGI application:

import os

os.environ['CELERY_CONFIG_MODULE'] = 'myapp.conf'

import django.core.handlers.wsgi
_application = django.core.handlers.wsgi.WSGIHandler()

Normally this works without issue. Problems can arise, though, once the machine also hosts other Django apps in an Apache/mod_wsgi configuration.

In our particular case, I had set up mod_wsgi to host different Django applications using the WSGIApplicationGroup directive together with Apache's SetEnv and SetEnvIf directives. The point of this approach is to infer which Django app to use from the URL being accessed. In the following example, requests are directed to the app1 Python interpreter group unless the requested Host: is app2.example.com.

<VirtualHost *:80>
ServerName *.example.com
SetEnv WSGI_APP_GROUP app1
SetEnvIf Host ^app2\.example\.com WSGI_APP_GROUP=app2
WSGIApplicationGroup %{ENV:WSGI_APP_GROUP}
</VirtualHost>

(Note: You could set up a separate Apache virtual host configuration for each app, but we found that with multiple Django/Celery apps and developers sharing a machine there were too many combinations to handle, so we went with this dynamic URL approach.)

When you use mod_wsgi in embedded mode and rely on the default configuration (or explicitly specify the threads= parameter), you are running in a multi-threaded environment. The mod_wsgi module uses Python subinterpreters to create multiple Python environments within the same process. Each subinterpreter gets its own sys.modules, but global state that lives at the process level is still shared. In contrast, if you use the WSGIDaemonProcess directive, a separate process can be instantiated per app, and the process spaces are isolated.

The downside is that each WSGIDaemonProcess must be explicitly defined and is instantiated at startup, whereas WSGIApplicationGroup values and their Python subinterpreters can be created dynamically. The %{ENV} parameter in WSGIApplicationGroup lets mod_wsgi retrieve an Apache environment variable, which is set by the SetEnv/SetEnvIf directives. We can set WSGI_APP_GROUP to any value, and mod_wsgi will create a new Python subinterpreter if one does not exist. In contrast, all process groups must be defined by WSGIDaemonProcess directives when Apache starts up. (Note: we observed that if you use the %{ENV} parameter but don't set the Apache environment variable, mod_wsgi does not emit a warning; it treats the parameter as a regular string rather than an environment variable lookup.)
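One way to check which application group a request actually landed in is a small diagnostic WSGI app. The sketch below is not part of our setup; it assumes the request environ carries the mod_wsgi.application_group and mod_wsgi.process_group keys that mod_wsgi adds, and the fallback strings are placeholders. If the %{ENV} lookup was not resolved, the literal string should show up here instead of app1 or app2.

```python
def application(environ, start_response):
    # mod_wsgi adds these keys to the per-request WSGI environ;
    # the fallback values are placeholders for runs outside mod_wsgi.
    group = environ.get('mod_wsgi.application_group', '(not under mod_wsgi)')
    process = environ.get('mod_wsgi.process_group', '(not under mod_wsgi)')

    body = 'application group: %r\nprocess group: %r\n' % (group, process)
    body = body.encode('utf-8')

    start_response('200 OK', [
        ('Content-Type', 'text/plain'),
        ('Content-Length', str(len(body))),
    ])
    return [body]
```

Pointing a temporary WSGIScriptAlias at this file and requesting it with different Host: headers should show whether the SetEnv/SetEnvIf rules select the group you expect.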

The problem with using Python subinterpreters is that you have to deal with global variables modified by different threads. Since v3.0, Celery instantiates the object that holds its configuration only when the first task is run. (See "The Big Instance" Refactor.) Celery keeps this app instance in a thread-safe variable, but because the object is instantiated lazily, the configuration is not read until the first task is run. If another mod_wsgi thread changes the CELERY_CONFIG_MODULE environment variable before a task is dispatched, you may notice strange issues, such as messages directed to queues set up for different Django apps.
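The ordering problem can be reproduced without Celery or mod_wsgi. In this minimal sketch, LazyConfig and load_wsgi_script are hypothetical stand-ins (not Celery's real classes), and a plain thread stands in for a second mod_wsgi app, which is a simplification of how subinterpreters behave. It only illustrates that a lazily read environment variable reflects whoever wrote it last.

```python
import os
import threading


class LazyConfig(object):
    """Hypothetical stand-in for a lazily configured app (not Celery's class)."""

    def __init__(self):
        self._module = None

    @property
    def module(self):
        # The environment variable is only read on first access,
        # mimicking configuration that is resolved when the first task runs.
        if self._module is None:
            self._module = os.environ.get('CELERY_CONFIG_MODULE')
        return self._module


def load_wsgi_script(config_module):
    # Mimics a WSGI script that sets the variable at import time.
    os.environ['CELERY_CONFIG_MODULE'] = config_module
    return LazyConfig()


# The first app loads and expects to use app1.conf later.
app1 = load_wsgi_script('app1.conf')

# A second app loads in another thread before app1 dispatches anything.
other = threading.Thread(target=load_wsgi_script, args=('app2.conf',))
other.start()
other.join()

# app1 now resolves its configuration from the value written last.
resolved = app1.module
print(resolved)
```

Here resolved ends up as 'app2.conf', even though app1 was loaded with 'app1.conf'.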

In our particular case, we were noticing "socket error 536871023", which equates to error code 111 (Connection Refused) in the librabbitmq library. I eventually unraveled this mystery by realizing that one of the Celery configurations in a Django mod_wsgi app had an invalid username/password, which prevented a connection from being established. What helped lead to this realization was that the Google App Engine team had tuned their Python 2.7 runtime to convert os.environ to a thread-local variable to allow multithreading (for more info, see Nick Johnson's blog article). Unfortunately, os.environ isn't thread-safe in normal Python 2.7 environments. When I tested this hypothesis, I could prove to myself that os.environ was being overwritten, with every app inheriting the value set by whichever mod_wsgi thread changed the environment variable last.
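The arithmetic behind that error number can be checked in a few lines. This sketch assumes the library marks OS-level errors by setting bit 29 on top of the errno value; that flag is my reading of the number rather than something taken from the librabbitmq documentation, and decode_socket_error is a hypothetical helper name.

```python
import errno
import os

# Assumption: bit 29 marks an OS-level error, with the errno in the low bits.
OS_ERROR_FLAG = 1 << 29  # 536870912


def decode_socket_error(code):
    """Strip the assumed OS-error flag and return the underlying errno."""
    if code & OS_ERROR_FLAG:
        return code & ~OS_ERROR_FLAG
    return code


os_errno = decode_socket_error(536871023)
print(os_errno)  # 536871023 - 536870912 = 111

# The symbolic name and message for an errno value depend on the platform.
print(errno.errorcode.get(os_errno), os.strerror(os_errno))
```

On Linux, errno 111 is ECONNREFUSED, which matches the Connection Refused behavior we saw.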

To fix this issue, I put a workaround in the mod_wsgi script that configures Celery explicitly using the config_from_object() function:

# Celery is thread-safe so force the default app to instantiate
import celery.app
import importlib
celery_app = celery.app.app_or_default()

# The Celery config needs to be imported before it can be used with config_from_object().
importlib.import_module('myapp.celeryconfig')
celery_app.config_from_object("myapp.celeryconfig")

import django.core.handlers.wsgi
_application = django.core.handlers.wsgi.WSGIHandler()

By using this approach, we completely avoid os.environ['CELERY_CONFIG_MODULE'] and the conflicts it causes across our Django/Celery apps. The drawback is that a Celery instance has to be created even when no tasks are running (by calling celery.app.app_or_default()), but avoiding these global variable conflicts made the trade-off worthwhile for us.


