Tuesday, September 24, 2013

PuppetConf 2013 LiveStreaming Notes

PuppetConf 2013 was held on 22nd & 23rd August.  I tuned in to the live streaming and took the notes below.
Full list of talks with links to pages with video and slides: http://puppetlabs.com/resources/puppetconf-2013
Full playlist of videos: http://www.youtube.com/playlist?list=PLV86BgbREluU02Ytlz80seDSKAbkx5pRg 

Thursday 2013-08-23 [Friday BNE time]

Keynote: Why Did We Think Large Scale Distributed Systems Would be Easy?

Google's Corporate Engineering SRE team provides infrastructure services used by many of Google's desktops, laptops and servers. This talk gives an overview of the design philosophy, challenges, technologies and some interesting failures seen while implementing infrastructure at scale.
Site Reliability Manager, Google

Notes:

  • Post-mortems not about blame – lots to learn from medical community
  • Thundering herds: Randomise cron jobs
  • Anycast
    • Helps you be consistent
    • Traffic could go anywhere
    • Can I handle any type of query?
    • IP address injection in route table
    • TCP or UDP
    • Won’t maintain state
  • Cascading failures far worse than dropping traffic
  • Diversity for people – not for platforms
  • If you can’t automate it, you shouldn’t do it
Q: Does DNS round-robin depend on the client?
SNMP
  • Canary
    • Dev, Test, Stage envs for devs
    • Turn some machines into canary machines
    • Full cluster of canaries
    • Then Rollout
  • How to round-robin DB backends?
    • Typically local LB
    • Not anycast
    • Tricky
    • Depends how persistent
    • Inter-transaction consistency
    • No good/complete answer
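
The "randomise cron jobs" advice for thundering herds can be sketched as below. This is a minimal illustration, not something shown in the keynote: each host derives a stable minute offset from a hash of its own name, so a fleet spreads its runs across the hour instead of all firing at minute 0. The hostnames, job name and command path are hypothetical.

```python
import hashlib


def splay_minute(hostname: str, job: str, window_minutes: int = 60) -> int:
    """Return a stable minute offset in [0, window_minutes) for this host and job."""
    digest = hashlib.sha256(f"{hostname}:{job}".encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer and fold it into the window.
    return int.from_bytes(digest[:8], "big") % window_minutes


def cron_line(hostname: str, job: str, command: str) -> str:
    """Build an hourly crontab entry whose minute field is specific to the host."""
    minute = splay_minute(hostname, job)
    return f"{minute} * * * * {command}"


if __name__ == "__main__":
    # Hypothetical hosts and command, for illustration only.
    for host in ("web01.example.com", "web02.example.com", "db01.example.com"):
        print(host, "->", cron_line(host, "logrotate", "/usr/local/bin/rotate-logs"))
```

Hashing rather than calling a random number generator keeps the offset the same on every run, so the generated crontab does not change each time it is rebuilt.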

Keynote: Open Sourcing the Cloud

Red Hat is putting serious emphasis on cloud computing – with the goal of building agile infrastructure and platform clouds, which can be used to free developers and IT to do great things, faster. Brian will talk about how Red Hat’s “all-in” technology investments will help make this happen; including the external, upstream open source development model for RHEL, the Red Hat OpenStack community and the elasticity of deploying OpenShift on top of OpenStack.
CTO and VP, Worldwide Engineering, Red Hat

Notes:

  • Mantra: Upstream First
  • OpenDaylight: Open-source network-controlled network
    • Decouple application workflows from infrastructure workflows
    • Network virtualisation – similar to server virtualisation
    • Neutron ties together OpenStack and OpenDaylight
  • CI in OpenShift
    • Clone production service
    • Jenkins cartridge
    • Easy
    • Don’t have to modify app, but will change the way you work

How Do We Better Sell DevOps?

In this talk, I will share my top lessons learned over my years studying high performing IT organizations on how to sell the value of DevOps, and help other stakeholders and executives have their own a-ha moments. I will talk about specific stories about the circumstances that led to these a-ha moments, how they created DevOps champions in surprising places (e.g., Development, CTOs, Product Management, UX, Infosec) in organizations you'll recognize, and how they enabled implementing DevOps patterns that had awesome results.
Author "The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win", IT Revolution Press

Nobody Has To Die Today: Keeping The Peace With The Other Meat Sacks


A frank (and, frankly, loud) discussion about the kinds of miscommunication that arise between developers and operations, how it leads to trouble and possible ways we can avoid (figurative) violence in the workplace using both social techniques as well as tooling.
Sr. DevOps Consultant, MomentumSI

Vampires vs Werewolves: Ending the War Between Developers and Sysadmins with Puppet

Developers need to be able to write software and deploy it, and often require cutting edge software tools and system libraries. Sysadmins are charged with maintaining stability in the production environment, and so are often resistant to rapid upgrade cycles. This has traditionally pitted us against each other, but it doesn't have to be that way. Using tools like puppet for maintaining and testing server configuration, nagios for monitoring, and jenkins for continuous code integration, Stanford University Library has brokered a peace that has given us the ability to maintain a stable production environment with a rapid upgrade cycle. I'll discuss the individual tools, our server configuration, and the social engineering that got us here.
Manager for Application Development, Stanford University Library

Notes:

  • You don’t take risks with ppl you don’t trust
    • Need to start with building interpersonal relationships, building trust
    • Let go of the anger
    • Recognise common goals
  • Tactics
    • Get to know the people on the other side, take them out for coffee
    • Show of good faith, show them your test coverage
    • Monitoring with e.g. Nagios
    • Goal: Are all of our projects functioning correctly right now?
    • Friendly Manual
    • Puppet: the ultimate challenge in vampire-werewolf relationships
  • http://www.codinghorror.com/blog/2010/08/vampires-programmers-versus-werewolves-sysadmins.html
    • The art of managing vampires and werewolves, I think, is to ensure that they spend their time not fighting amongst themselves, but instead, using those supernatural powers together to achieve a common goal they could not otherwise. In my experience, when programmers and system administrators fight, it's because they're bored. You haven't given them a sufficiently daunting task, one that requires the full combined use of their unique skills to achieve.
    • Remember, it's not vampires versus werewolves. It's vampires and werewolves. 

So You've Got Scalability. Now What?

Managing over 10k nodes brings unique challenges, one of them is managing all data in a scalable way, but solving the scalability issue isn't enough. The data must be available and manageable in a user friendly way. This talk is about how we successfully implemented a solution using Hiera, Redis, Sensu, Rails and Grape that made us capable of providing our customers with the ability to not only manage their own data but also build their own applications to manage their infrastructure using our API.
Senior Engineer, Reliant Security

Notes:

Multi-Provider Vagrant: AWS, VMware, and More

With Vagrant 1.1+, you can use the same configuration and workflow to spin up and provision machines in VirtualBox, VMware, AWS, RackSpace, and more. You get all the benefits of Vagrant with the power of working in whatever environment you need to. This capability unlocks entirely new use cases for Vagrant that can help better optimize the entire process of developing and testing Puppet code. In this talk, you'll learn about the new multi-provider features, why they exist, and how they can be used. Your life will never be the same again.
Founder, HashiCorp

Notes:

  • Packer
    • If vagrant up becomes a blocker due to length of time Packer becomes important
    • Multi-cloud portability e.g. Dev->Test->Prod
    • Stability and portability:
      • E.g. provision the image with Puppet, don’t run Puppet live; the image still works after 12 months
    • Handles reboots

Building Data-Driven Infrastructure with Puppet

As your Puppet Infrastructure grows, so does the complexity of the Puppet codebase. The complexity of the codebase often creates a scenario where it becomes more time consuming to modify/add to the codebase. Likewise, any new addition or node still may require modifications to the Puppet database, which could include the management of many edge cases. Fortunately, the software industry has been working on developing techniques with code abstraction, refactoring, and software maturity. This talk will focus on how to write scalable modules within Puppet to be used to create Data Driven Infrastructures. In addition, this talk will demonstrate how to structure process/procedure/code to quickly and rapidly scale operations with minimal modifications to Puppet code.
Operations Hacker, GitHub, Inc.

Notes:

  • https://gist.github.com/jfryman/6310477
  • Beer-Ops
  • James White Manifesto
  • Machine parseable: every input and output should be readable by computer
  • There is only one system
  • Systems-thinking
  • gPanel -> PuppetDB
  • gPanel self-driven portal CMDB
  • Machine readable metrics can go right back into the system
  • Unix Philosophy – apps do one thing really well
  • Controller
  • Orchestrator
    • Chat Ops – Jesse Newland talk
    • mcollective
    • Puppet is a state machine
    • Don’t want to wait for entire Puppet convergence
  • Metrics
  • System must be able to self-correct
  • Deployable using text files
  • Modularity
  • To stop config drift: Level up from Templates to Data Driven
  • NagiosDB
    • Dynamically generate Nagios setups outside of Puppet
  • System should introspect itself
  • Refactoring Puppet – least to most specific, modules become systems
  • CloudFormation
  • Autoloading
  • Modelling
    • Puppet has to know about app intimately
    • Augeas allows you to build infra dynamically
    • Pass structured data into module
    • “Turntup” slide
  • Fencing resources
    • Being built but not ready to serve
  • Constant introspection, feedback loop into the system
  • What’s missing?
    • We don’t have a good language for complex actions
    • Need a way to model complex interactions; if-then-else and Nagios event handlers only get you so far
    • Predictive Analysis
  • Data is coming from itself, think about system/feedback loop
  • Only one system
    • Everything machine parsable
    • Must be thinking “how do we make computers do my job?”
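
The move from templates to data-driven configuration, and the idea of generating Nagios setups from machine-readable data, can be sketched roughly as follows. This is not the speaker's NagiosDB code: the inventory is a hypothetical stand-in for data that would come from a CMDB or PuppetDB, and `generic-host` is assumed to be a host template already defined in the Nagios configuration.

```python
from typing import Dict, List

# Hypothetical inventory; a real setup would query a CMDB or PuppetDB instead.
NODES: List[Dict[str, str]] = [
    {"name": "web02", "address": "10.0.0.12", "role": "web"},
    {"name": "web01", "address": "10.0.0.11", "role": "web"},
    {"name": "db01", "address": "10.0.1.21", "role": "db"},
]


def render_host(node: Dict[str, str]) -> str:
    """Render one Nagios host object definition from a node record."""
    lines = [
        "define host {",
        "    use        generic-host",  # assumed pre-existing template
        f"    host_name  {node['name']}",
        f"    address    {node['address']}",
        f"    hostgroups {node['role']}",
        "}",
    ]
    return "\n".join(lines)


def render_all(nodes: List[Dict[str, str]]) -> str:
    """Render every node, sorted by name so the output is stable between runs."""
    ordered = sorted(nodes, key=lambda n: n["name"])
    return "\n\n".join(render_host(n) for n in ordered)


if __name__ == "__main__":
    print(render_all(NODES))
```

Adding a node then means adding a record to the data source; the rendering code does not change.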

Friday 2013-08-24 [Saturday BNE time]



Keynote: Stop Hiring Devops Experts (And Start Growing Them)

Everyone is putting "devops" on their LinkedIn profile, and everyone is trying to hire them. In this talk, Jez will argue this is not a recruitment problem but an organizational failure. This talk discusses how to grow great people and great organizations, and how the two problems are connected.
Principal, ThoughtWorks

Notes:

  • http://es.slideshare.net/michael.sahota/agile-culture-and-adoption-survival-guide
  • Ops is in top right – control
  • It’s about the bottom right – cultivation culture
  • You can’t hire in cultural change
    • It’s very hard to change organisations
    • Organisations reject change in general
  • Cultivation culture
    • Have to build it into your organisation
    • 20% time (3M Post-it notes, Google)
    • Can’t just experiment – need to be able to act on it – Kodak
    • Pairing, only by doing it for 6 months; have to work with people, not just alongside them
    • Focus on experimentation/innovation
      • Okay to fail
    • Role of mgmt
      • Create env in which it’s safe to learn
      • How well do we cultivate knowledge?
    • Measure ppl on how good they are
      • Toyota Kata
      • Can only become manager when you’ve worked on factory floor for 10-15yrs
      • Focus on experimentation
      • Job of mgr = facilitate ppl doing the work to get better at doing it
    • Not effective
      • Training
        • Bring the people with you – make it natural
      • Buying tools
      • DevOps team
        • Creating a silo to fix a silo
        • If you create an org that values learning & ppl enjoy being able to learn and innovate = maybe won’t have a hiring problem
    • Taleb 3
      • Resilience is not the opposite of fragile
      • Antifragile is the opposite of fragile
      • People – e.g. Arnie, applying stress makes you stronger
      • Game days e.g. Jesse Robbins Amazon
        • Firedrills
        • Turn off the power and find out what actually happens
        • Google DiRT exercises, earthquake simulation
        • Expose assumptions e.g. Mountain View failing over to Mountain View
      • Netflix
        • Simian army

Keynote: Puppet for Production in WebEx

Getting started with Puppet configuring an individual machine is straightforward. Managing a cluster of machines across multiple data centers, supporting upgrades while running a 7x24 service, and building for collaboration is significantly more challenging. The WebEx team will discuss the problems and some strategies they are using to manage this complexity.
Cloud Services Architect, Cisco/WebEx

Notes:

  • Sequencing: technical + business
  • Never run just one of anything
    • DC, nodes
    • Why cluster: Technical, business (commercial retail vs federal, privacy etc)
  • Version migration
    • By DC
    • By Node
  • Blueprints -> Orchestration
  • Unix toolchain philosophy
    • Fabric/Salt/Ansible is the leatherman pocketknife multitool
    • Puppet is the real knife
  • Masterless Puppet
    • E.g. Google
    • distribute modules/manifests to nodes
    • Copy /etc/puppet/* to each node
    • Complete resiliency per node
    • No single point of control
    • puppet apply --modulepath "..." --execute "include ..."
  • See also: Sam Bashton “Continuously Integrated Puppet in a Dynamic Environment”
  • Keep problems small
  • Push dependencies into Puppet instead of RPMs
    • “4am-proofing” – Puppet = transparency
    • Favour transparency over DRY 
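
The masterless approach above comes down to each node running `puppet apply` against its own local copy of the modules. A minimal sketch of assembling that command is below; the module path and class names are hypothetical, and nothing here was shown in the talk. It relies only on the `--modulepath` and `--execute` options of `puppet apply`.

```python
import shlex
from typing import List, Sequence


def puppet_apply_argv(module_paths: Sequence[str], classes: Sequence[str]) -> List[str]:
    """Build the argument list for a masterless `puppet apply` run."""
    if not module_paths or not classes:
        raise ValueError("need at least one module path and one class")
    # One `include` statement per class, passed inline instead of a manifest file.
    manifest = "\n".join(f"include {name}" for name in classes)
    return [
        "puppet", "apply",
        "--modulepath", ":".join(module_paths),  # ':' is the separator on POSIX systems
        "--execute", manifest,
    ]


if __name__ == "__main__":
    # Hypothetical path and class names, for illustration only.
    argv = puppet_apply_argv(["/etc/puppet/modules"], ["ntp", "ssh"])
    print(" ".join(shlex.quote(arg) for arg in argv))
```

On a node with Puppet installed, the list could be handed to `subprocess.run(argv, check=True)`; passing a list rather than a shell string avoids quoting problems in the inline manifest.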

Keynote: Puppet at Scale – Case Study of PayPal's Learnings

Large scale and app level management pose challenges to any implementation of puppet. Come and learn some of the challenges PayPal Deployment Systems team faced and how these were overcome.
Sr Dev Manager, PayPal

Notes:

  • Staging = mini paypal.com
  • Commits happening every few seconds
    • How do you manage dependencies?
  • Web API for deployment
  • OpenStack
  • “Project Velocity”
  • Who looks after Puppet in the middle of the night?
    • 3,000 developers?
  • Deployment system where puppet coding is not required to deploy new apps
  • Ninja engine
    • Takes list of apps you want to install
    • Assemble list of
    • Discover dependencies
    • Generate puppet resources from dependency graph
    • Then execute them
  • Caches dynamic resources for next run
  • What packages to install?
    • Roles & Labels
  • System Hierarchy
    • ENC -> Hiera
    • Web tool to visualise Hiera data
    • REST API for CRUD over Hiera data
  • Scaling ActiveMQ
    • Problem with mcollective beyond dev/test where multiple puppetmasters/MQ
    • Mcollective gave inconsistent results
    • Were using MQ cluster through LB – connections would time out
    • Removing LB fixed
    • When ActiveMQ host dies have to reconfigure clients – Use Puppet
  • Mcollective at Scale
    • Mcollective is equally useful as Puppet
    • Paypal heavily depends on mcollective
    • Replaces all SSH scripts
    • Use it to:
      • Query systems
      • Verify package versions
      • Kick off on-demand puppet runs
      • Ssh script replacement
    • REST API enables Mcollective to web and other tools
      • Powerful = careful approval required
  • Worked with PuppetLabs to create “progress” module to work out how long a deploy will take
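
The engine steps noted above (take a list of apps, discover dependencies, generate Puppet resources from the dependency graph) can be illustrated with a small sketch. This is not PayPal's implementation, which the talk did not show in detail; the app catalogue is hypothetical and the output is limited to plain `package` resources. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, Iterable, List, Set

# Hypothetical catalogue: app -> direct dependencies.
DEPENDS_ON: Dict[str, Set[str]] = {
    "checkout": {"payments-lib", "jvm"},
    "payments-lib": {"jvm"},
    "reporting": {"jvm"},
    "jvm": set(),
}


def closure(apps: Iterable[str], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Discover everything the requested apps need, directly or transitively."""
    needed: Set[str] = set()
    stack = list(apps)
    while stack:
        app = stack.pop()
        if app not in needed:
            needed.add(app)
            stack.extend(depends_on.get(app, ()))
    return needed


def install_order(apps: Iterable[str], depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order the needed apps so that dependencies come before their dependents."""
    needed = closure(apps, depends_on)
    graph = {app: depends_on.get(app, set()) for app in needed}
    return list(TopologicalSorter(graph).static_order())


def to_puppet_resources(order: List[str], depends_on: Dict[str, Set[str]]) -> str:
    """Emit one package resource per app, with require edges taken from the graph."""
    blocks = []
    for app in order:
        lines = [f"package {{ '{app}':", "  ensure  => installed,"]
        deps = sorted(depends_on.get(app, ()))
        if deps:
            refs = ", ".join(f"Package['{dep}']" for dep in deps)
            lines.append(f"  require => [{refs}],")
        lines.append("}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    order = install_order(["checkout"], DEPENDS_ON)
    print(to_puppet_resources(order, DEPENDS_ON))
```

Because the resources are generated from data, deploying a new app means adding it to the catalogue rather than writing Puppet code, which matches the stated goal of a deployment system where Puppet coding is not required.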

Keynote: VMware vCHS, Puppet, and Project Zombie

Cloud Automation Architect, Hybrid Cloud Service, VMware
Nicholas Weaver is the Cloud Automation Architect for VMware's vCloud Hybrid Service (vCHS) platform and the primary architect behind the vCHS automation framework (Project Zombie). He is also a co-creator of the PuppetLabs Razor project and many VMware-specific free tools. He previously worked in the CTO office for EMC, in the EMC field as a vSpecialist, and as an infrastructure engineer in financial, media, and retail companies. Nick loves software-driven control, hacking prototypes together...

Notes:

  • Automation = Effort Evolution
  • Why is it important?
    • Warehouse
  • Resiliency = we expect things to fail
    • Can never assume anything is going to stay up
  • Project Zombie
  • Puppet has critical things for VMware to choose
    • Mcollective
    • VM support
    • Cassandra
    • Netflix Astyanax
    • JRuby – good middle ground for dev and ops
  • RabbitMQ
  • Modules
  • Rez = globally distributed “refrigerator”
    • Difficult to manage resources at global distributed scale
    • Automation = baking
    • Millions of resources
    • Razor feed into Rez to manage state
    • REST API
  • Engine = “the chef”
    • Orchestration
    • Wrote their own Operational-based language
    • Controls flow and concurrency
    • ZED (Zombie Engine DSL)
    • Distributed and location-awareness
    • B = Broker
    • P = Processor

Continuously Integrated Puppet in a Dynamic Environment

This talk will show how we deploy Puppet without a Puppetmaster on an autoscaling Amazon Web Services infrastructure. Key points of interest: - Masterless Puppet - Use of Jenkins for Puppet manifest testing and environment promotion (test->staging->production) - Puppet integration with Amazon CloudFormation

Monday, September 16, 2013

DevOps Cafe: DevOps At SalesForce and Measuring Service

Dave Mangot - SalesForce Orchestration Engine / Release Tools Team

Great to see DevOps Cafe Podcast finally back online after a six month break and with a corker of an episode. Dave Mangot from the "Orchestration Engine" team at SalesForce talks through culture and tools and Damon and John as always bring insightful comments on devops and cultural change.

SalesForce teams are organised in "Clouds" which is the equivalent of what other organisations would call "Tribes". Made up of ops, dev and QA, Clouds are responsible for ensuring the apps they build are kept running in prod.

Something that has stuck with Dave and influences his philosophy is the priorities given to him by a CEO at a former company.  CEO gave Dave two priorities:
  1. Keep devs moving as fast as possible
  2. Keep the site up
In that order.  Getting software to customers faster is more critical than preventing outages.

Ops is a service to dev.
IT department is a service to customers.

For a great example of an aligned organisation, watch "The Carrier" series on Netflix.  Every person onboard the aircraft carrier articulated their role in terms of "winning the war on terror".

Jez Humble's "DevOps kata" - always trying to get better.

Salesforce "Chatter" webapp for online communication

Their IM groups have "organisers".  E.g. Dave is the organiser of the devops group.

They record demos and make them available on demand to those who couldn't be there.

Do you set goals for improvement?
Do you have common visibility? Training?

When they identify an interruption to flow, they write down, on a piece of paper, ideas for metrics that would help improve feedback
"Feel the pain"
Track metrics, get better

Damon: Measuring Service
Bringing visibility to service side of organisation (not just development visibility e.g. velocity)

Damon's 4 things to measure service:
  1. Cycle time: From backlog to customer
  2. Mean Time To Detect problems
    • How quickly can we detect that something is wrong?
    • Enough visibility/testing/understanding of system?
  3. Mean Time To Repair
    • How good is control of system?
    • Right Automation/testing/packaging/structures in place
  4. Quality at the source
    • Scrap rate
    • How often does a problem get out of the realm of person who created the problem?
    • E.g.
      • Dev - does testing framework, ci, continuous deployment catch problem?
      • Ops - DNS change
      • Dev - changed config param
    • How far does the problem get down the stream?
  5. How good are you at not having the problem happen again?
    • How well do you learn
    • Executing the feedback loop
    • Lowering the recidivism rate
#5 added by John
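
Measures 2 and 3 above are simple to compute once incidents are recorded with timestamps. The sketch below is illustrative only: the incident records are made up, and it assumes time to repair is measured from detection to resolution (some teams measure from the start of the problem instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when someone or something noticed it
    resolved: datetime  # when service was restored


def mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    """Average a collection of durations."""
    values = list(deltas)
    if not values:
        raise ValueError("no incidents to average")
    return sum(values, timedelta()) / len(values)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean Time To Detect: start of problem until detection."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean Time To Repair: detection until resolution (an assumption, see above)."""
    return mean_delta(i.resolved - i.detected for i in incidents)


# Made-up incidents, for illustration only.
INCIDENTS = [
    Incident(datetime(2013, 9, 2, 10, 0), datetime(2013, 9, 2, 10, 10), datetime(2013, 9, 2, 11, 10)),
    Incident(datetime(2013, 9, 9, 14, 0), datetime(2013, 9, 9, 14, 30), datetime(2013, 9, 9, 14, 50)),
]

if __name__ == "__main__":
    print("MTTD:", mttd(INCIDENTS))
    print("MTTR:", mttr(INCIDENTS))
```

Tracking both numbers over time shows separately whether visibility (detection) or control (repair) is the weaker side.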

Bring prod as far upstream as possible.
E.g. Salesforce runs Chaos Monkey in test envs

"Rouster" for testing puppet functionally with Vagrant:

SalesForce toolchain:
Homebrew tooling -> Rundeck for Orchestration -> SaltStack -> Puppet

VMware's Project Zombie (PuppetConf 2013) toolchain:
  • Cassandra for CMDB
  • Mcollective
  • Puppet

"Trigger" for network automation:


A few years ago it was common to find managers that couldn't see the value of Puppet/Chef - nowadays it's an acknowledged best practice. The same resistance is now happening with network automation.