May 21, 2014

Since upgrading to Celery v3.1, we've run into a number of issues with this latest iteration. Here are a few snags we discovered that we thought we'd share:

  1. If you're running Ubuntu v12.04 or Debian Squeeze, consider installing the uuidd daemon. Celery relies on the libuuid library for task ID generation, and Celerybeat can fail to dispatch tasks in daemon mode because of stale file descriptor references to /dev/urandom. There have been numerous fixes to help mitigate this issue, but if you continue to see tasks not being dispatched, consider simply installing the uuidd daemon on any machine that runs Celery. (For more information, see this previous post.)
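
    As a quick sanity check, you can verify that Python's uuid module is delegating to libuuid at all. The snippet below assumes Python 2.7, and _uuid_generate_random is a private CPython implementation detail rather than a supported API, so treat it purely as a diagnostic sketch:

    import uuid

    # Celery generates task IDs with uuid.uuid4(); on Python 2.7 this
    # delegates to libuuid's uuid_generate_random() when it is available.
    print(uuid.uuid4())

    # Non-None here means the ctypes binding to libuuid is in use.
    print(uuid._uuid_generate_random)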

  2. If you're using RabbitMQ and have upgraded to v3.3.0, you must upgrade to Celery v3.1.11. If you're using librabbitmq, you should also upgrade to the latest version, v1.5.1. Celery workers are configured to process only a certain number of messages at a time using the prefetch count. Since scheduled tasks created with countdown/eta parameters are held in memory and left unacknowledged by the workers until they can be executed, Celery needs to increase this prefetch count as such tasks accumulate.

    In RabbitMQ v3.3.0, changes were made to how this prefetch count is managed. Without an upgrade to Celery v3.1.11, requests to increase this number would be applied to new connections but not to existing ones. As a result, Celery would stop processing further tasks once the prefetch limit was reached, thereby causing delays in task dispatching. (For more information, see the reported issue on GitHub and the RabbitMQ release notes.)
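
    For illustration, here is a minimal sketch (the app name and broker URL are placeholders) of the kind of countdown task that gets held unacknowledged and therefore counts against the prefetch limit:

    from celery import Celery

    app = Celery('tasks', broker='amqp://guest@localhost//')

    @app.task
    def send_report():
        pass

    # The message is delivered to a worker immediately, but it is held in
    # memory unacknowledged until the countdown expires, counting against
    # the channel's prefetch limit the whole time.
    send_report.apply_async(countdown=3600)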

  3. The latest version of librabbitmq now appears to support connection pooling, so you must be especially careful when using Celery with concurrent workers. Upon forking, child processes inherit existing connections from their parent process, so care must be taken to close any open connections. For instance, if you issue any type of control command before the process is forked, you must make sure the other processes do not end up reusing the same connection.

    The reason is that AMQP depends on frames being sent in sequence. If multiple processes share the same connection, their packets will interleave, triggering "UNEXPECTED_FRAME - expected content header for class 60" errors from the message broker. It can also cause workers to stall while waiting for a response from the AMQP broker.

    One example of how to fix this issue is to close the AMQP connection using the worker_init signal provided by Celery:

    from celery import current_app
    from celery.signals import worker_init
    
    def cleanup_connections(*args, **kwargs):
        # worker_init fires in the main worker process before the child
        # processes are forked, so closing the app here helps keep the
        # children from inheriting and reusing the same AMQP connection pools.
        current_app.close()
    
    worker_init.connect(cleanup_connections)

    For more background information, see this discussion thread and this posting about fork safety.

  4. If you are using the MAX_TASKS_PER_CHILD parameter to help avoid the memory fragmentation issues that can occur with long-lived Python processes, be wary of workers crashing intermittently when this maximum number of tasks has been reached. Celery v3.1 implements a new asynchronous event loop that relies on the main process to receive messages and dispatch tasks to the worker child processes via interprocess pipes. However, there are issues with the way file descriptors are monitored in this event loop, causing AssertionErrors to be triggered and workers to stop receiving messages.
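
    If you want to set this cap yourself, the setting in Celery v3.1 is CELERYD_MAX_TASKS_PER_CHILD (also available as the worker's --maxtasksperchild option). Here is a minimal sketch; the app name, broker URL, and limit of 100 are arbitrary placeholders:

    from celery import Celery

    app = Celery('tasks', broker='amqp://guest@localhost//')

    # Recycle each worker child process after it has executed 100 tasks,
    # limiting how much memory a long-lived process can accumulate.
    app.conf.CELERYD_MAX_TASKS_PER_CHILD = 100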

    We have contributed a number of fixes to help resolve this, most notably improvements to the test suite that help catch the problem. The main fix has already been merged, so we are awaiting a new release to resolve these intermittent crashes. We're also hopeful it will solve the problem with orphaned workers not waiting for new messages to process. (For more context, see our previous blog post.)

If you have any other experiences with Celery v3.1, we'd love to hear from you!


