Saturday, December 14, 2013

Netflix audits Prod rather than controlling releases through a central approval process

I asked Adrian Cockcroft and Ben Christensen from Netflix how they control the release cycle of production configuration items (pointing out that each level in the Archaius hierarchy potentially has a separate release cycle). In short, they don't. Rather, they constantly audit production (using Chronos) and keep track of real-time events (what, when, where) to use for ad-hoc search queries when diagnosing problems. Reading between the lines, they see more value in getting changes out quickly and letting some of them fail than in holding changes back while an approval process turns into a bottleneck. The takeaway: put more effort into testing in prod (monitoring/alerting/compliance) and less effort into release ceremony.

This snippet seems to cover it:

http://techblog.netflix.com/2013/03/python-at-netflix.html
Chronos
We push hard to always increase our speed of innovation, and at the same time reduce the cost of making changes in the environment.  In the datacenter days, we forced every production change to be logged in a change control system because the first question everyone asks when looking at an issue is “What changed recently?”.  We found a formal change control system didn’t work well with our culture of freedom and responsibility, so we deprecated a formal change control process for the vast majority of changes in favor of Chronos.  Chronos accepts events via a REST interface and allows humans and machines to ask questions like “what happened in the last hour?” or “what software did we deploy in the last day?”.  It integrates with our monkeys and Asgard so the vast majority of changes in our environment are automatically reported to it, including event types such as deployments, AB tests, security events, and other automated actions.
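
To make the idea concrete, here is a minimal in-memory sketch of an event log that answers the two questions quoted above. It is not Chronos and does not reflect its actual REST API or data model, which the snippet doesn't describe. The names Event, EventLog, record and since, the field names, and the sample services are all hypothetical, chosen only to illustrate the "what, when, where" record and a time-window query.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    """One change in the environment (hypothetical shape, not Chronos' schema)."""
    what: str        # description of the change
    where: str       # where it happened, e.g. a region or cluster name
    when: datetime   # timezone-aware timestamp of the change
    kind: str        # event type, e.g. "deployment", "ab_test", "security"


class EventLog:
    """Append-only list of events with a simple time-window query."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def record(self, event: Event) -> None:
        # In a real system this would be the handler behind a REST endpoint.
        self._events.append(event)

    def since(self, window: timedelta, kind: Optional[str] = None,
              now: Optional[datetime] = None) -> List[Event]:
        """Return events from the last `window`, newest first.

        `kind` optionally restricts the result to one event type.
        `now` can be passed in to make queries reproducible.
        """
        now = now or datetime.now(timezone.utc)
        cutoff = now - window
        matches = [
            e for e in self._events
            if cutoff <= e.when <= now and (kind is None or e.kind == kind)
        ]
        return sorted(matches, key=lambda e: e.when, reverse=True)


# Example data (service names and regions are made up for illustration).
now = datetime(2013, 12, 14, 12, 0, tzinfo=timezone.utc)
log = EventLog()
log.record(Event("deployed service-a", "us-east-1",
                 now - timedelta(minutes=20), "deployment"))
log.record(Event("started AB test", "us-east-1",
                 now - timedelta(hours=5), "ab_test"))
log.record(Event("deployed service-b", "eu-west-1",
                 now - timedelta(days=2), "deployment"))

# "What happened in the last hour?"
last_hour = log.since(timedelta(hours=1), now=now)

# "What software did we deploy in the last day?"
deploys_last_day = log.since(timedelta(days=1), kind="deployment", now=now)
```

The point of the sketch is that nothing here approves or blocks a change: events are simply recorded as they happen, and the value comes from being able to ask "what changed recently?" after the fact.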