November 20, 2011

Building reliable software on ever-changing platforms is tough—you have to move rapidly just to keep up, but you can’t afford to introduce errors at any point. We’ve been meaning to talk about this for a while and have just been too busy building software! 

At Hearsay Social, we’ve built—and are always improving—a development process that allows us to balance these two requirements. We thought we’d show you the path of a feature as it makes its way from development to production, and how we automate as much of the work as possible.

When you start working on a feature, you’re working in a local development environment—this might be your MacBook Air in-flight, or it might be a virtual server spun up on Rackspace Cloud or Amazon EC2. We use git for source control, and we use topic branches like crazy. To start a feature, merely create a branch from upstream/master and start hacking. We use Github to manage our code review process, so you’ll eventually be pushing there.

As you build the feature, you’ll use our feature-flagging system to wrap user-visible code in flags. By giving each flag a simple name and description and inserting these flags (available as Python calls or Jinja2 template methods), you can use our admin console to control the feature’s visibility by customer or even by individual user. This means you can roll out trial versions of a feature to the company (or even a few end users) to get feedback without sending it live to all of our users.
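To make the idea concrete, here’s a minimal sketch of what a flag check like this can look like. The names (`FLAGS`, `flag_enabled`, the customer ids) are illustrative assumptions, not our actual API:

```python
# Hypothetical, minimal feature-flag registry: each flag maps to the
# audience that can see it ("all", or a set of customer ids).
FLAGS = {
    "new_dashboard": {"acme-corp", "globex"},
    "beta_search": "all",
}

def flag_enabled(name, customer_id):
    """Return True if the named feature is visible to this customer."""
    audience = FLAGS.get(name)
    if audience is None:       # unknown flag: fail closed
        return False
    if audience == "all":
        return True
    return customer_id in audience

# Wrapping user-visible code in a flag:
def render_dashboard(customer_id):
    if flag_enabled("new_dashboard", customer_id):
        return "new dashboard"
    return "old dashboard"
```

An admin console then only needs to edit the audience data to turn a feature on for one customer, a handful of testers, or everyone.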

When the feature is ready to show off, you can send around a link to your local dev environment to get feedback. Of course, you’ve been writing tests the whole time: unit tests in the Python/Django unit test framework, and integration tests using Selenium 2. When you’re ready to check in, commit your code locally and push it to your personal Github repo (which is a fork of upstream, our shared “mainline” repository). To commit, you’ll have to pass a local git pre-commit hook that runs all of our static analysis tools (pylint, pyflakes, PEP8, and jslint).
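A pre-commit hook along these lines is easy to sketch. This is not our actual hook—the tool list and flags are assumptions—but it shows the shape: collect staged files, run the right linters, and block the commit on any failure:

```python
#!/usr/bin/env python
# Sketch of a git pre-commit hook that runs static analysis on staged
# files. Illustrative only; the real hook's tools and options may differ.
import subprocess
import sys

CHECKS = {
    ".py": ["pylint", "pyflakes", "pep8"],
    ".js": ["jslint"],
}

def tools_for(filename):
    """Return the linters to run on a staged file, chosen by extension."""
    for ext, tools in CHECKS.items():
        if filename.endswith(ext):
            return tools
    return []

def staged_files():
    """List files added/copied/modified in the index."""
    out = subprocess.check_output(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"])
    return out.decode().split()

def main():
    failed = False
    for path in staged_files():
        for tool in tools_for(path):
            if subprocess.call([tool, path]) != 0:
                failed = True
    sys.exit(1 if failed else 0)  # any non-zero exit blocks the commit

# Installed as .git/hooks/pre-commit, the script would simply call main().
```

Because the hook lives on each developer’s machine, it can’t be enforced—which is why, as described below, the same checks run again on the build server.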

Once you’ve built your local branch the way you want it (we advocate squashing and rebasing along the way to make code review easy), a little command-line tool we’ve built (soon to be open-sourced) called “lgtm” lets you open a new Pull Request from the command line: just type “./lgtm create” and give it a title. If you’re fixing a bug, it’ll show you a list of your assigned bugs in our issue/workitem tracker, Pivotal Tracker, and automatically mark the Pull Request as a fix. (That way, Github and Pivotal’s automatic integration will resolve the bug in Pivotal once the fix gets merged.)
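Under the hood, opening a pull request programmatically is one call to Github’s API. The sketch below shows roughly what a tool like this does; the repo names, token handling, and helper names are hypothetical, though the endpoint and payload follow Github’s v3 pulls API:

```python
# Sketch of opening a pull request via the Github v3 API, roughly what
# an "lgtm create" command might do. Names here are illustrative.
import json
import urllib.request

def pr_payload(title, head, base="master", body=""):
    """Build the JSON body for POST /repos/:owner/:repo/pulls."""
    return {"title": title, "head": head, "base": base, "body": body}

def create_pull_request(owner, repo, token, title, head):
    """head is "username:branch"; returns the API's JSON response."""
    req = urllib.request.Request(
        "https://api.github.com/repos/%s/%s/pulls" % (owner, repo),
        data=json.dumps(pr_payload(title, head)).encode(),
        headers={"Authorization": "token " + token,
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The bug-fixing flow is just a layer on top: list your assigned stories from the tracker, let you pick one, and fold its id into the pull request body so the tracker integration can close it on merge.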


(image courtesy of Pivotal Tracker)

Once your pull request goes out, a couple of things start happening for you. First, a continuous integration bot picks up the pull request, checks it out on a Hudson/Jenkins server, and deploys it into its own environment. This environment gets its own URL via wildcard DNS, which is posted back to Github as a comment so other developers have one-click access to your code, deployed and running (against development data). Any updates to the pull request are automatically detected, deployed, and noted as an update. Unit tests get kicked off automatically, and their results are posted back to the comment stream as well. We’ll describe this system in more detail in an upcoming blog post.
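The wildcard-DNS trick is simpler than it sounds: one `*.dev.example.com`-style record points every subdomain at the CI pool, so the bot only has to derive a unique hostname per pull request. A sketch, with the domain and comment format made up for illustration:

```python
# Hypothetical helpers for a CI bot: derive a per-pull-request URL
# under a wildcard DNS domain and format the comment posted to Github.
def review_url(pr_number, domain="dev.example.com"):
    """Each PR gets its own subdomain; *.dev.example.com resolves to CI."""
    return "http://pr-%d.%s/" % (pr_number, domain)

def deploy_comment(pr_number, commit_sha):
    """Comment body announcing a fresh deploy of the PR's latest commit."""
    return "Deployed %s -- try it at %s" % (
        commit_sha[:7], review_url(pr_number))
```

The web server on the CI side then routes each `pr-N` hostname to the matching checked-out environment.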

Then, as another engineer is reviewing your code, he or she can play around with it and understand what it’s doing in context—this is especially helpful for understanding visual changes. I don’t know about you, but I suck at code-reviewing CSS changes! Once it passes the code-review bar and gets merged into upstream/master, our Hudson/Jenkins server takes over.


Upstream/master gets pulled down, and a series of tests kicks off. Static analysis runs first (pylint and pyflakes against our Python code, jslint against Javascript). Since you’re running that pre-commit hook, this should never be an issue, but because we can’t enforce it, we check again on Hudson. Unit tests run against both Python and Javascript, and Selenium tests kick off in parallel across multiple browser configurations courtesy of our friends at Sauce Labs. A failure in any of these test suites causes immediate notification of the “suspect” developers via HipChat and email.
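Fanning a suite out across browser configurations is mostly a scheduling problem. Here’s a toy sketch of the pattern—the capability dicts and the stubbed runner are assumptions, standing in for real remote WebDriver sessions against a service like Sauce Labs:

```python
# Sketch of running one test suite per browser configuration in
# parallel. The runner is stubbed; a real one would drive a remote
# WebDriver session against the deployed app.
from concurrent.futures import ThreadPoolExecutor

BROWSERS = [
    {"browserName": "firefox", "platform": "Windows"},
    {"browserName": "chrome", "platform": "Windows"},
    {"browserName": "internet explorer", "version": "8"},
]

def run_suite(caps):
    """Run the Selenium suite under one browser config (stubbed here)."""
    return (caps["browserName"], "passed")

def run_all(browsers=BROWSERS):
    """Kick off every configuration concurrently; collect the results."""
    with ThreadPoolExecutor(max_workers=len(browsers)) as pool:
        return dict(pool.map(run_suite, browsers))
```

The CI job then fails the build (and pages the “suspect” developers) if any configuration comes back with something other than a pass.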

If all of these tests pass, we’ve got high confidence that the code is ready to deploy to production. Upstream/master gets tagged, packaged, and staged across the required servers (front-ends, crawlers, databases, etc. depending on the changes involved). If all of the pre-deploy migrations report success, a rolling restart kicks off across machines, flipping symlinks and restarting services one at a time to gracefully transition load from old to new code. End users on the site won’t even notice the switch as each machine finishes existing requests, stops accepting traffic from the load-balancers, bounces services, and comes back into rotation.
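The rolling-restart loop described above can be sketched in a few lines. The `lb`/`remote` objects and method names are hypothetical stand-ins for the load-balancer and remote-execution layers a real deploy tool would use:

```python
# Sketch of a rolling restart: one host at a time leaves the load
# balancer, flips a "current" symlink to the new release, restarts
# services, and rejoins. Names are illustrative.
class _Recorder:
    """Stand-in for the LB / remote-exec layers; records every call."""
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        return lambda *args: self.calls.append((name,) + args)

def rolling_restart(hosts, release, lb, remote):
    for host in hosts:
        lb.drain(host)                            # finish in-flight requests
        remote.symlink(host, release, "current")  # atomically switch code
        remote.restart(host, "app")               # bounce services on new code
        lb.enable(host)                           # back into rotation

# Dry run against two fake hosts:
lb, remote = _Recorder(), _Recorder()
rolling_restart(["web1", "web2"], "v42", lb, remote)
```

Because only one machine is out of rotation at a time, capacity dips briefly but the site never goes down—exactly the graceful transition described above.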

When all is said and done, this system lets us focus on what we do best—building features that delight and deliver value to our customers—with confidence that we won’t break existing features and scenarios.

So what’s next?

  • We want to integrate Sauce Labs’ Scout more deeply to enable one-click viewing of pull requests in different browsers.
  • We’re going to expose this list of in-process features across the company, so we can get quicker and easier feedback from our friends in sales, marketing, and customer success.
  • We’ll kick off automated static analysis runs and Selenium tests on a per-pull-request basis, instead of waiting for merge.
  • We’re going to automate our code-coverage checks to detect any check-in which regresses our coverage numbers. We’ve also got some pretty fun ideas around a code coverage game…seriously!
  • We’re always looking for ways to close the merge-to-deploy gap with automated migration, deploy, monitoring/detection, and if necessary, roll-back.

Sound like fun? We’re hiring! If you like building things, send me an email!
