Monday, September 12, 2011

Continuous Delivery vs Continuous Deployment

·         Every commit should be instantly deployed to production.
o   Alex commits. Minutes later warnings go off that the cluster is no longer healthy.
o   The failure is easily correlated to Alex's change and her change is reverted.
o   Alex spends minimal time debugging, finding the now obvious typo with ease.
o   Her changes still caused a failure cascade, but the downtime was minimal. 
·         Fail Fast – the closer a failure is to the point where it was introduced, the more data you have to correct for that failure.
o   In code, Fail Fast means raising an exception on invalid input, instead of waiting for it to break somewhere later.
o   In a software release process, Fail Fast means releasing undeployed code as fast as possible, instead of waiting for a weekly release to break (a minimal code sketch follows).
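o   To make the code-level meaning concrete, a minimal sketch; the function and its names are invented for this example:

    def set_discount(price, percent):
        # Fail fast: reject bad input at the boundary, rather than letting
        # a nonsense percentage silently corrupt totals somewhere downstream.
        if not 0 <= percent <= 100:
            raise ValueError(f"percent must be between 0 and 100, got {percent}")
        return price * (1 - percent / 100)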
·         Continuous Deployment is simple: just ship your code to customers as often as possible. Maybe today that's weekly instead of monthly, but over time you'll approach the ideal and you'll see the incremental benefits along the way.


·         Continuous delivery is about putting the release schedule in the hands of the business, not in the hands of IT.
o   Implementing continuous delivery means making sure your software is always production ready throughout its entire lifecycle – that any build could potentially be released to users at the touch of a button using a fully automated process in a matter of seconds or minutes.
o   This in turn relies on comprehensive automation of the build, test and deployment process, and excellent collaboration between everyone involved in delivery – developers, testers, DBAs, systems administrators, users, and the business.
·         So when can you say you're doing continuous delivery? I'd say it's when you could flip a switch to go to continuous deployment if you decided that was the best way to deliver value to your customers. In particular, if you can't release every good build to users, what does it mean to be "done" with a story? I think at least the following conditions must apply:
o   You have run your entire test suite against the build containing the story. This validates that the story is delivering the expected business value, and that no regressions have been introduced in the process of developing it. In order to be efficient, that means having comprehensive automated tests at the unit, component and acceptance level.
o   The story has been demonstrated to customers from a production-like environment. Production-like means identical to production, within the bounds of reason. Even if you're deploying to an enormous cluster, you can use a technique like blue-green deployments to run a different version of your app in parallel on the production environment without affecting users (see the sketch after this list).
o   There are no obstacles to deploying to production. In other words, you could deploy the build to users using a fully automated process at the push of a button if you decided to. In particular, that means you've also tested that it fulfills its cross-functional characteristics such as capacity, availability and security. If you're using an SOA or you have dependencies between your application and other systems, it means ensuring there are no integration problems.
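o   A hedged sketch of the blue-green idea mentioned above: two parallel environments, deploy to the idle one, verify it, then switch traffic. The deploy, smoke-test and traffic-switch steps are passed in as stand-ins, since the real mechanism (load balancer, symlink, router) varies; the URLs are invented:

    ENVIRONMENTS = {"blue": "http://blue.internal:8080",
                    "green": "http://green.internal:8080"}

    def blue_green_release(live, deploy, smoke_test, switch_traffic):
        # Deploy to whichever environment is NOT serving users.
        idle = "green" if live == "blue" else "blue"
        deploy(idle)
        # Demonstrate/verify against the production-like idle environment.
        if not smoke_test(ENVIRONMENTS[idle]):
            raise RuntimeError(f"smoke test failed on {idle}; {live} stays live")
        # Only now does the new version receive user traffic.
        switch_traffic(idle)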
·         Continuously integrate (commit early and often). On commit, automatically run all tests. If the tests pass, deploy to the cluster. If the deploy succeeds, repeat.
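o   That loop might look something like this sketch; the test and deploy scripts are stand-ins for the real pipeline, not the author's actual tooling:

    import subprocess, time

    def commit_test_deploy_loop():
        last_deployed = None
        while True:
            subprocess.call(["git", "pull", "--ff-only"])   # pick up new commits
            rev = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
            if rev != last_deployed:
                if subprocess.call(["./run_all_tests.sh"]) == 0:          # all tests pass?
                    if subprocess.call(["./deploy_to_cluster.sh", rev]) == 0:
                        last_deployed = rev                               # deployed; repeat
            time.sleep(30)  # poll interval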
·         Our test suite takes nine minutes to run (distributed across 30-40 machines). Our code pushes take another six minutes. Since these two steps are pipelined that means at peak we're pushing a new revision of the code to the website every nine minutes. That's 6 deploys an hour. Even at that pace we're often batching multiple commits into a single test/push cycle. On average we deploy new code fifty times a day.
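o   Worked out: because the stages overlap, throughput is set by the slower stage rather than the sum, so one revision goes out every max(9, 6) = 9 minutes, and 60 / 9 ≈ 6.7 cycles an hour, roughly the six deploys quoted above.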
·         Continuous Deployment means running all your tests, all the time.
·         The magic is in the scope, scale and thoroughness. It's a thousand test files and counting. 4.4 machine hours of automated tests, to be exact. Over an hour of these tests are instances of Internet Explorer automatically clicking through use cases and asserting on behaviour, thanks to Selenium.
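o   For flavour, a minimal browser-driven test in that spirit, using current Selenium WebDriver syntax with an invented URL and element IDs (and Firefox standing in for IE):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get("https://example.com/login")
        driver.find_element(By.ID, "username").send_keys("testuser")
        driver.find_element(By.ID, "password").send_keys("secret")
        driver.find_element(By.ID, "login-button").click()
        # Assert on behaviour, not just that the page loaded.
        assert "Dashboard" in driver.title
    finally:
        driver.quit()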
·         Schema changes are done out of band. Just deploying them can be a huge pain. Doing an expensive alter on the master requires applying it one-by-one to our dozen read slaves (pulling them in and out of production traffic as you go), then applying it to the master's standby and failing over. It's a two-day affair, not something you roll back from lightly. In the end we have relatively standard practices for schemas (a pseudo DBA who reviews all schema changes extensively) and sometimes that's a bottleneck to agility. If I started this process today, I'd probably invest some time in testing the limits of distributed key-value stores, which in theory don't have any expensive manual processes.
o   My guess is that we could build a decent automated system by having a replicated standby machine just for schema updates. You'd apply new schemas to the standby and then automatically fail over to it. If you have to roll back, you just swap back the original machine, after which you have to rebuild the standby from backups or some other expensive recovery option. You'd still need yet another safety net, in case of MySQL crashes or other shenanigans, so you'd end up with a second standby machine (three hosts per database role).
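o   The guessed-at procedure, as a sketch; every step here is hypothetical glue over MySQL replication, passed in as a stand-in rather than a real API:

    def schema_update_via_standby(alter_sql, master, standby,
                                  apply_sql, wait_for_catchup, promote, rebuild):
        apply_sql(standby, alter_sql)        # run the expensive ALTER off the live path
        wait_for_catchup(standby, master)    # let replication drain before switching
        promote(standby)                     # fail over: the standby becomes master
        # Rolling back means failing back to the old master; the demoted
        # machine then has to be rebuilt from backups before it can stand by again.
        rebuild(master)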
·         A symlink is switched on a small subset of the machines, throwing the code live to its first few customers. We have a fixed queue of 5 copies of the website on each frontend. We rsync with the "next" one and then, when every frontend is rsync'd, we go back through them all and flip a symlink over. That is, rsync the code into a separate folder, then flip a symlink to make it live. For example, the document root is /var/www/current, which is a symlink to /var/www/1. The next deployment rsyncs to /var/www/2 and flips the /var/www/current symlink to point to that when ready.
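o   A sketch of that two-phase deploy; run() stands in for subprocess.check_call or whatever remote-exec tool you use, and the paths mirror the example above:

    import os

    DOCROOTS = [f"/var/www/{i}" for i in range(1, 6)]  # the fixed queue of 5 copies
    CURRENT = "/var/www/current"                       # document root, a symlink

    def next_slot():
        # Assumes the symlink stores the full path, e.g. /var/www/1.
        live = os.readlink(CURRENT)
        return DOCROOTS[(DOCROOTS.index(live) + 1) % len(DOCROOTS)]

    def deploy(source, frontends, run):
        nxt = next_slot()
        for host in frontends:                         # phase 1: stage code everywhere
            run(["rsync", "-a", "--delete", f"{source}/", f"{host}:{nxt}/"])
        for host in frontends:                         # phase 2: flip each frontend live
            run(["ssh", host,
                 f"ln -sfn {nxt} {CURRENT}.tmp && mv -T {CURRENT}.tmp {CURRENT}"])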