Hackbright Academy visits Hearsay Social

Last Friday, Hearsay Social hosted a panel discussion for students from Hackbright Academy. Hackbright Academy offers a 10-week Programming Fellowship in Silicon Valley that is designed to help women from all backgrounds become adept programmers. Hearsay Social's engineering team is lucky to have a number of stellar female engineers, and they were excited to share stories, advice, and thoughts on the industry.

"The first time I learned how to code is also the story of how I got asked to prom," recalled Bansi Shah, a Product Manager at Hearsay Social.

Ruchi Varshney has been a Generalist Engineer at Hearsay Social for over a year. She's worked on projects ranging from site internationalization to mobile development. She advised, "Find a company that allows you to work the whole stack. Keep on learning new skills."

Megan Anctil started her career at Hearsay Social as a Customer Support Associate. She learned to code on the job with help and encouragement from the engineering team. "I was motivated to fix issues rather than forward them to the engineering team. It gave me a great platform to learn." She then went on to show off the traditional Hearsay Social Engineering Barbie. She mentioned, "This is like a rite of passage for women who check in code at Hearsay Social."

Hearsay Social co-founder and CEO, Clara Shih, offered her perspective on finding the right company. "Seek out a growing industry, find the experts of that industry, and work with them to build a team you can learn from," she advised.

As the afternoon wrapped up, a Hackbright student remarked, "We've visited a lot of companies with brilliant female engineers and at Hearsay Social it really feels like you girls are good friends first."

Posted on April 24, 2013

Using mod_wsgi with Celery

If you've ever run Celery with Django, you may be tempted to set up the CELERY_CONFIG_MODULE environment variable before loading your WSGI application:

import os

os.environ['CELERY_CONFIG_MODULE'] = 'myapp.conf'

import django.core.handlers.wsgi
_application = django.core.handlers.wsgi.WSGIHandler()

Normally you might not encounter an issue. Problems can arise, though, once the same machine hosts other Django apps in an Apache/mod_wsgi configuration.

In our particular case, I had set up mod_wsgi to host different Django applications using the WSGIApplicationGroup directive together with Apache's SetEnv and SetEnvIf directives. The reason for this approach is to infer which Django app to use from the URL being accessed. In the following example, requests are directed to the app1 Python interpreter group except when the requested Host: is app2.example.com.

<VirtualHost *:80>
ServerName *.example.com
SetEnv WSGI_APP_GROUP app1
SetEnvIf Host ^app2\.example\.com WSGI_APP_GROUP=app2
WSGIApplicationGroup %{ENV:WSGI_APP_GROUP}
</VirtualHost>

(Note: You could set up separate Apache virtual host configurations for each app, but we found that with multiple Django/Celery apps and developers sharing a machine, there were too many combinations to handle, so we went with this dynamic URL approach.)

When you use mod_wsgi in embedded mode and rely on the default configuration (or explicitly specify the threads= parameter), you are running in a multi-threaded environment. The mod_wsgi module uses Python subinterpreters to create multiple Python environments within the same process. Each subinterpreter gets its own sys.modules, but process-wide state such as environment variables is still shared. In contrast, if you were to use only the WSGIDaemonProcess directive, a separate process could be instantiated per app, and the process spaces would be isolated.

The downside is that each WSGIDaemonProcess must be explicitly defined and instantiated on startup, whereas WSGIApplicationGroup and its Python subinterpreters can be created dynamically. The %{ENV} parameter in the WSGIApplicationGroup directive allows mod_wsgi to retrieve an Apache environment variable, which is set by the SetEnv/SetEnvIf directives. We can set WSGI_APP_GROUP to any value and mod_wsgi will create a new Python subinterpreter if one does not exist. In contrast, all process groups must be defined by the WSGIDaemonProcess directive when Apache starts up. (Note: we observed that if you use the %{ENV} parameter but don't set an Apache environment variable, mod_wsgi will not throw a warning; it treats the parameter as a regular string rather than an environment variable lookup.)
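
Because a missing variable fails silently, it can help to confirm which application group a request actually landed in. The following is only a minimal sketch of a diagnostic WSGI app, assuming it is mounted temporarily in place of the real application; it reads the mod_wsgi.application_group key that mod_wsgi places in the WSGI environ, and the fallback string is purely illustrative.

```python
def application(environ, start_response):
    """Diagnostic WSGI app: report the mod_wsgi application group for this request."""
    # mod_wsgi adds this key to the WSGI environ; the fallback text is
    # illustrative and only appears when running outside mod_wsgi.
    group = environ.get("mod_wsgi.application_group", "<not running under mod_wsgi>")
    body = ("application group: %s\n" % group).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

If the response shows the literal %{ENV:WSGI_APP_GROUP} text rather than app1 or app2, the environment variable lookup did not happen.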

The problem with using Python subinterpreters is that you have to deal with global variables modified by different threads. Since v3.0, Celery tries to instantiate an object that contains the configuration variables only when the first task is run. (See "The Big Instance" Refactor.) Celery keeps this app instance as a thread-safe variable, but since Celery lazily instantiates this object, the configuration will not be set until the first task is run. If another mod_wsgi thread changes this environment variable before a task is dispatched, you may notice strange issues with messages directed to queues set up for different Django apps.
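
The ordering problem can be reproduced without Celery or Apache. The sketch below is a simplified simulation, not Celery's actual implementation: LazyConfig and load_wsgi_script are hypothetical stand-ins for a lazily configured app object and for a WSGI script that sets the environment variable at load time.

```python
import os


class LazyConfig:
    """Hypothetical stand-in for an app object that reads its config lazily."""

    def __init__(self):
        self._module = None

    @property
    def module(self):
        # The environment is consulted only on first use, not at creation time.
        if self._module is None:
            self._module = os.environ["CELERY_CONFIG_MODULE"]
        return self._module


def load_wsgi_script(config_module):
    """Simulate a WSGI script that sets the variable and creates its app."""
    os.environ["CELERY_CONFIG_MODULE"] = config_module
    return LazyConfig()


app1 = load_wsgi_script("app1.conf")
app2 = load_wsgi_script("app2.conf")  # overwrites the shared variable

# app1 dispatches its first task only now, so it picks up app2's setting.
print(app1.module)  # app2.conf
```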

In our particular case, we were seeing "socket error 536871023", which corresponds to error code 111 (Connection refused) in the librabbitmq library. I eventually unraveled this mystery by realizing that one of the Celery configurations in a Django mod_wsgi app had an invalid username/password, which prevented a connection from being established. What helped lead to this realization was that the Google App Engine team had changed their Python 2.7 runtime to make os.environ a thread-local variable in order to allow multithreading (for more info, see Nick Johnson's blog article). Unfortunately, os.environ isn't thread-safe in normal Python 2.7 environments. When I tested this hypothesis, I could prove to myself that os.environ was being overwritten and took on whatever value the last mod_wsgi thread to change the environment variable had set.
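
The relationship between the two numbers is easier to see in hexadecimal. This is a small sketch of the arithmetic only; treating bit 29 as an "OS error" category flag added by the underlying C client library is an assumption here, and the symbolic errno name is platform-dependent (111 is ECONNREFUSED on Linux).

```python
import errno

LIBRABBITMQ_ERROR = 536871023  # value taken from the error message

# Assumption: the high bit is a category flag marking an OS-level error.
OS_ERROR_FLAG = 1 << 29

# Clear the flag to recover the plain errno value.
os_errno = LIBRABBITMQ_ERROR & ~OS_ERROR_FLAG

print(hex(LIBRABBITMQ_ERROR))  # 0x2000006f
print(os_errno)                # 111
# Symbolic name for this errno on the current platform, if it has one.
print(errno.errorcode.get(os_errno))
```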

To fix this issue, I put a workaround in the mod_wsgi script to instantiate Celery explicitly using the config_from_object() function:

# Celery is thread-safe so force the default app to instantiate
import celery.app
import importlib
celery_app = celery.app.app_or_default()

# The Celery config needs to be imported before it can be used with config_from_object().
importlib.import_module('myapp.celeryconfig')
celery_app.config_from_object("myapp.celeryconfig")

import django.core.handlers.wsgi
_application = django.core.handlers.wsgi.WSGIHandler()

By using this approach, we completely avoid os.environ['CELERY_CONFIG_MODULE'] and can avoid conflicts across our Django/Celery apps. The drawback is that a Celery instance has to be created even when no tasks are running (when calling celery.app.app_or_default()), but for us the benefit of avoiding these global variable conflicts outweighed that cost.
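
The effect of binding the configuration at load time can be illustrated with a small simulation. As before, this is only a sketch: EagerConfig and load_wsgi_script are hypothetical names standing in for an explicitly configured app object and a WSGI script, and the module names are placeholders.

```python
import os


class EagerConfig:
    """Hypothetical app object that is given its config explicitly at load time."""

    def __init__(self, config_module):
        self.module = config_module


def load_wsgi_script(config_module):
    """Simulate a WSGI script that passes its config directly to the app."""
    return EagerConfig(config_module)


app1 = load_wsgi_script("app1.celeryconfig")
app2 = load_wsgi_script("app2.celeryconfig")

# A later change to the shared environment no longer affects either app.
os.environ["CELERY_CONFIG_MODULE"] = "some.other.conf"

print(app1.module)  # app1.celeryconfig
print(app2.module)  # app2.celeryconfig
```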

Migrating from Posterous to Jekyll

With Posterous shutting down on April 30th, you may find yourself trying to decide where to move your blog postings. Move to Blogger, Tumblr, or WordPress? Build your own blogging engine in Erlang?

Thanks to a suggestion from Chase Seibert, I decided to check out GitHub Pages. If you visit http://pages.github.com to find documentation, you might be confused, given that there's no quick 1-2-3 process. In fact, what you have to do is create a GitHub repository first! If you want a personalized blog for yourself, your repo must be named "[username].github.io". You can also host a blog within an existing repo by simply creating a gh-pages branch, or you can use the "Automatic Page Generator" button in the repo's Settings page to do so. More details/documentation are available at this GitHub page.

Using the GitHub Pages templates, you may find yourself extremely limited in choices. These CSS/HTML templates are usually only 1-2 columns, and there is little guidance about how to implement pagination, search, and archive listings. GitHub Pages operates under the assumption that you're going to use a Ruby package called Jekyll, which was built by Tom Preston-Werner of GitHub fame and many other individuals. Jekyll provides support for code syntax highlighting, a clean markup language, and the benefits of using the same GitHub pull-request workflow model, but you give up the fancy aspects of WYSIWYG editing and image uploads since the blog is generated as static pages.

Instead of using a WYSIWYG editor that inserts all the HTML markup for you, you end up writing mostly in your editor of choice (e.g. vi/Emacs) and running Jekyll to generate your postings as static HTML files that can be hosted. When you push to the gh-pages branch, you're essentially triggering GitHub to run Jekyll to generate the files. If you want to host the pages elsewhere, you can do a "sudo gem install jekyll", run "jekyll", and push the generated files in the _site/ directory to S3. It still makes sense to have Jekyll installed locally so you can actually see how your pages will be rendered.

Here are some areas of the process that I thought might be useful to share:

  • This blog was built using the Jekyll Bootstrap package, which comes with pagination and archiving support out of the box. If you're going to host the page off a repository instead of your main username, you'll want to adjust the BASE_PATH inside _config.yml. This BASE_PATH is needed since your URLs are no longer relative to '/' but rather to some combination of your GitHub username and the repo name. If you find that your CSS or images are not loading correctly, chances are that you need to update this BASE_PATH configuration! We also switched to the rdiscount markdown processor instead of the included one because of this known issue with embedding iframe links with the default version.

    BASE_PATH : http://hearsaycorp.github.io/hearsay-blog
    markdown: rdiscount

  • Jekyll also comes with a dev server (jekyll --server --auto), which can regenerate files when they change. This Stack Overflow article has a command line that appears to fix reloading by downgrading the directory_watcher gem from version 1.5.1 to an older version. Otherwise, you may be wondering why pages are not being refreshed in spite of the auto: true setting inside the _config.yml file.

    sudo gem uninstall directory_watcher && sudo gem install directory_watcher -v 1.4.1

  • Pagination on the main page could be implemented by simply defining this section inside index.html and defining "paginate: 3" inside the _config.yml. (To implement the paginated Previous/Next links on the main page, you can take the section from the _includes/themes/twitter/post.html file.)

    {% for post in paginator.posts %}
    {% include post.html %}
    {% endfor %} 

  • To migrate our postings from Posterous, I followed the documentation at https://github.com/mojombo/jekyll/wiki/blog-migrations. The key is that you need to be signed into posterous.com and need to get an API token from https://posterous.com/api by clicking the view token link. You would then run the following command:

    ruby -rubygems -e 'require "jekyll/migrators/posterous"; Jekyll::Posterous.process(<my_email>, <my_pass>, <api_token>)'

    ...which will download everything into the _posts/ directory. The key thing to note is that every HTML file downloaded will be prefixed with a YAML header which is used by Jekyll to determine whether and how the posting should be processed:

    ---
    layout: post
    title: My posting
    published: true
    ---

  • Support for multiple authors can be added by following this Gist article. I modified the Rakefile to add an extra author: entry whenever generating a new posting (e.g. rake post title="My Title Here" date="2012-12-05").

    open(filename, 'w') do |post|
      post.puts "---"
      post.puts "layout: post"
      post.puts "title: \"#{title.gsub(/-/,' ')}\""
      post.puts 'description: ""'
      post.puts "author: "
      post.puts "category: "
      post.puts "tags: []"
      post.puts "---"
    end

    You can also update the _includes/post.html file to add the author by-line:

    <div class="meta">
      Posted on <a href="{{ BASE_PATH }}{{ post.url }}">{{ post.date | date: "%B %e, %Y" }}</a> {% if author %}by {{ author.display_name }}{% endif %}
    </div>

  • To show code snippets that demonstrate Jekyll's templating language (Liquid), I added this Jekyll plug-in by downloading it to the _plugins/ directory. It basically adds a raw tag that prevents Jekyll from processing the section. You still have to escape the HTML tags yourself since Jekyll is not going to handle this aspect (i.e. changing < and > to their ampersand equivalents).

  • Take advantage of GitHub's ability to handle custom domains via CNAMEs. You can simply create a file named CNAME in the root directory containing the domain you wish to use (e.g. engineering.hearsaysocial.com). Once this file is created, you just need to change your DNS to point to your GitHub Pages host (i.e. [GitHub username/org name].github.io). In our case, it was hearsaycorp.github.io. For more information, see: https://help.github.com/articles/setting-up-a-custom-domain-with-pages.
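
For readers who prefer not to touch the Rakefile, the YAML header from the multiple-authors item above can also be produced by a short script. This is only a rough Python sketch of the same idea, not part of Jekyll or Jekyll Bootstrap: the new_post helper, the .md extension, and the "jdoe" author key are all assumptions for illustration, and titles containing double quotes are not escaped.

```python
import datetime
import re


def new_post(title, author, date=None):
    """Return (filename, front_matter) for a new Jekyll post (hypothetical helper)."""
    date = date or datetime.date.today()
    # Jekyll expects post files to be named YEAR-MONTH-DAY-title.MARKUP.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    filename = "_posts/%s-%s.md" % (date.isoformat(), slug)
    front_matter = "\n".join([
        "---",
        "layout: post",
        'title: "%s"' % title,
        'description: ""',
        "author: %s" % author,
        "category: ",
        "tags: []",
        "---",
        "",
    ])
    return filename, front_matter


filename, header = new_post("My Title Here", "jdoe", datetime.date(2012, 12, 5))
print(filename)  # _posts/2012-12-05-my-title-here.md
print(header)
```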