Monday, September 16, 2013

DevOps Cafe: DevOps At Salesforce and Measuring Service

Dave Mangot - Salesforce Orchestration Engine / Release Tools Team

Great to see the DevOps Cafe podcast finally back online after a six-month break, and with a corker of an episode. Dave Mangot from the "Orchestration Engine" team at Salesforce talks through culture and tools, and Damon and John, as always, bring insightful comments on devops and cultural change.

Salesforce teams are organised into "Clouds", the equivalent of what other organisations would call "Tribes". Made up of ops, dev and QA, Clouds are responsible for keeping the apps they build running in prod.

Something that has stuck with Dave and shapes his philosophy is a pair of priorities given to him by the CEO at a former company:
  1. Keep devs moving as fast as possible
  2. Keep the site up
In that order: getting software to customers faster is more critical than preventing outages.

Ops is a service to dev.
IT department is a service to customers.

For a great example of an aligned organisation, watch "The Carrier" series on Netflix.  Every person onboard the aircraft carrier articulated their role in terms of "winning the war on terror".

Jez Humble's "DevOps kata" - always trying to get better.

Salesforce "Chatter" webapp for online communication

Their IM groups have "organisers".  E.g. Dave is the organiser of the devops group.

They record demos and make them available on demand to those who couldn't be there.

Do you set goals for improvement?
Do you have common visibility? Training?

When they identify an interruption to flow, they write down on paper ideas for metrics that would improve feedback.
"Feel the pain"
Track metrics, get better

Damon: Measuring Service
Bringing visibility to the service side of the organisation (not just development visibility, e.g. velocity)

Damon's 4 things to measure service:
  1. Cycle time: From backlog to customer
  2. Mean Time To Detect problems
    • How quickly can we detect a problem and know something is wrong?
    • Enough visibility/testing/understanding of system?
  3. Mean Time To Repair
    • How good is control of system?
    • Are the right automation/testing/packaging/structures in place?
  4. Quality at the source
    • Scrap rate
    • How often does a problem escape the realm of the person who created it?
    • E.g.
      • Dev - does the testing framework, CI, or continuous deployment catch the problem?
      • Ops - DNS change
      • Dev - changed config param
    • How far does the problem get down the stream?
  5. How good are you at preventing the problem from happening again?
    • How well do you learn?
    • Executing the feedback loop
    • Lowering the recidivism rate
#5 added by John
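These measures reduce to simple arithmetic over incident and defect records. A minimal sketch, with hypothetical field names and sample data (nothing here is from the episode):

```python
from datetime import datetime
from statistics import mean

# Illustrative incident records: when the fault occurred, when it was
# detected, and when service was restored.
incidents = [
    {"occurred": datetime(2013, 9, 1, 10, 0),
     "detected": datetime(2013, 9, 1, 10, 5),
     "repaired": datetime(2013, 9, 1, 10, 35)},
    {"occurred": datetime(2013, 9, 8, 14, 0),
     "detected": datetime(2013, 9, 8, 14, 1),
     "repaired": datetime(2013, 9, 8, 14, 11)},
]

def mttd_minutes(incidents):
    """Mean Time To Detect: fault occurrence -> detection."""
    return mean((i["detected"] - i["occurred"]).total_seconds() / 60
                for i in incidents)

def mttr_minutes(incidents):
    """Mean Time To Repair: detection -> service restored."""
    return mean((i["repaired"] - i["detected"]).total_seconds() / 60
                for i in incidents)

def scrap_rate(defects):
    """Quality at the source: share of defects that escaped past the
    person/stage that created them."""
    return sum(1 for d in defects if d["escaped"]) / len(defects)

defects = [{"escaped": True}, {"escaped": False},
           {"escaped": True}, {"escaped": True}]

print(mttd_minutes(incidents))  # 3.0
print(mttr_minutes(incidents))  # 20.0
print(scrap_rate(defects))      # 0.75
```

Cycle time (backlog to customer) falls out the same way once work items carry timestamps for entering the backlog and shipping.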

Bring prod as far upstream as possible.
E.g. Salesforce runs Chaos Monkey in test envs
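The Chaos Monkey idea can be sketched as a tiny fault injector for a test environment. Everything below is illustrative: `chaos_step`, the instance names, and the `terminate` callback are assumptions, not Salesforce's actual tooling.

```python
import random

def chaos_step(instances, terminate, rng=None):
    """Pick one running instance at random and terminate it.

    `terminate` is a callback supplied by the test harness (it might
    halt a Vagrant VM or kill a container). All names here are
    hypothetical.
    """
    rng = rng or random.Random()
    victim = rng.choice(instances)
    terminate(victim)
    return victim

# Seeding the RNG makes a chaos run reproducible in CI.
killed = []
victim = chaos_step(["web1", "web2", "db1"], killed.append,
                    rng=random.Random(42))
print(victim, killed)
```

Running the same seed in the same environment replays the same failure, which is what makes chaos testing useful upstream of prod.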

"Rouster" for testing puppet functionally with Vagrant:

Salesforce toolchain:
Home-grown tooling -> Rundeck for orchestration -> SaltStack -> Puppet

VMware's Project Zombie (PuppetConf 2013) toolchain:
  • Cassandra for CMDB
  • Mcollective
  • Puppet

"Trigger" for network automation:

A few years ago it was common to find managers who couldn't see the value of Puppet/Chef; nowadays it's an acknowledged best practice. The same resistance is now playing out with network automation.