Wednesday, December 11, 2013

Notes from Netflix OSS Yow Workshop (11/Dec/2013)

The workshop was essentially a series of presentations by Adrian Cockcroft and Ben Christensen

Slides

Diagrams

Cloud at Scale

Adrian Cockcroft

Time to market vs Quality

  • Aggressively go out and assume that things are broken
  • e.g. Land grab, market disruption, web services
  • Default assumptions
    • Always shipping code that is broken
    • Hardware is broken
    • Operationally defensive

Need the ability to see what's happening in real time

  • Small change, quick to fix
  • Able to revert

Cloud Native – a new engineering challenge

  • Construct highly agile and highly available services from ephemeral and assumed-broken components

Inspiration

  • Release It! – Michael Nygard
    • Bulkhead & circuit-breaker patterns
  • Thinking in Systems – Donella H. Meadows
    • Chaotic system
    • Order from chaos
    • Netflix's emergent behaviour is that it shows movies
    • Right feedback loops
    • Not about software – about building feedback loops & rules around things so that they’re stable & predictable
    • Looks like an ants’ nest – chaotic, but order is an emergent property
  • Antifragile – Nassim Taleb
  • Drift Into Failure – Sidney Dekker
    • Aircraft industry lessons
    • Ways to avoid latent failures
    • Netflix outages are unique
    • Byzantine
    • Have enough margin
    • Gradually take fat out of the system – drop dead if you miss a meal
  • Everything Is Obvious – Duncan J. Watts
    • Avoid untrained people rushing in and pushing buttons
  • The REST API Design Handbook
    • 100s of microservices
    • Short book ranting on bad APIs
    • Dell Cloud something
    • $3
    • “REST in Practice” also good
  • Continuous Delivery
  • Cloudonomics
    • Cost model
    • Detailed analysis of cloud costs and how it fits together
  • The Phoenix Project

How to get to Cloud Native

  • Freedom and Responsibility for Developers
  • Decentralize and Automate Ops Activities
  • Integrate DevOps into the Business Organization
  • Re-org!

Four transitions

  • Management: Integrated Roles in a Single Organization
    • Business, Development, Operations -> BusDevOps
  • Developers: Denormalized Data – NoSQL
    • Decentralized, scalable, available, polyglot
    • Hardest thing to get everyone’s head around
    • Don't really need transactions anyway
    • Data checkers run around checking that bits of data have correct ids and “foreign keys” (see the sketch after this list)
    • No such thing as consistency
    • Paranoia covered by backups
  • Responsibility from Ops to Dev: Continuous Delivery
    • Decentralized small daily production updates
    • Push to prod end of every day
  • Responsibility from Ops to Dev: Agile Infrastructure - Cloud
    • Hardware in minutes, provisioned directly by developers
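
A purely hypothetical Java sketch of the “data checker” idea above – with denormalized NoSQL data there are no real foreign keys, so a background job periodically scans records and flags references that no longer resolve (all names and structures are invented for illustration):

    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Illustrative only: report viewing-history rows whose userId no longer
    // exists in the user store ("dangling foreign keys" in a schemaless world).
    public class DataChecker {

        public static List<String> danglingUserRefs(Map<String, String> viewingHistoryByUserId,
                                                    Set<String> knownUserIds) {
            return viewingHistoryByUserId.keySet().stream()
                    .filter(userId -> !knownUserIds.contains(userId))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }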

Fitting into public scale

  • 1,000 – 100,000 instances is ideal for AWS
  • 500k instances 3 years ago, 5m today
  • http://bit.ly/awsiprange

Netflix don’t use AWS for

  • SaaS Applications – PagerDuty, OneLogin etc.
  • Content Delivery Service

Open Connect Appliance Hardware - Netflix Open Source Content Delivery Service

  • 5 engineers got together for 4 months and built their own hardware
  • Build it themselves
  • Give them away
  • $15k
  • Pre-loaded static content
  • Nginx, BIND, BIRD routing daemon
  • FreeBSD
  • UFS filesystem
  • Unmount disk if it fails, lower capacity – no striping
  • Hot content on SSD

DNS Service

  • Route53 missing too many features
  • Amazon will clean up in the DNS market when they finish Route53

Escaping the Death Spiral

  • Get out of the way of innovation
    • Process reduction - aggressively
  • Hardware: Best of breed, by the hour
    • If you don’t like it – get rid of it
  • Choices based on scale
    • E.g. Big scale = US East Region
    • Build your own DNS

Getting to Cloud Native

Getting started with NetflixOSS Step by Step

  1. Set up AWS Accounts to get the foundation in place
  2. Security and access management setup
  3. Account Management: Asgard to deploy & Ice for cost monitoring
  4. Build Tools: Aminator to automate baking AMIs
  5. Service Registry and Searchable Account History: Eureka & Edda
  6. Configuration Management: Archaius dynamic property system
  7. Data storage: Cassandra, Astyanax, Priam, EVCache
  8. Dynamic traffic routing: Denominator, Zuul, Ribbon, Karyon
  9. Availability: Simian Army (Chaos Monkey), Hystrix, Turbine
  10. Developer productivity: Blitz4J, GCViz, Pytheas, RxJava
  11. Big Data: Genie for Hadoop PaaS, Lipstick visualizer for Pig, Suro (logging pipeline)
  12. Sample Apps to get started: RSS Reader, ACME Air (IBM), FluxCapacitor 

Flow of Code & Data between AWS Accounts

  • Auditable Account: Code with dollar signs goes into this account
  • Archive Account
    • Test account gets refreshed every weekend with all schemas/data from Prod
    • Trashes test data
    • Confidential data is encrypted at rest
    • Tokenise sensitive data
  • Cloud LDAP gives you access to Production Account
  • Auditable Account is different LDAP group, need a reason to access it
  • Vault account needs a background check
  • Higher security apps
    • Monitoring systems inside
    • Smaller and smaller for higher security
    • Hyperguard & cloudpassage
  • Cloud security architect used to be a PCI auditor – could talk to auditors at their level
    • Had to educate auditors
  • Archive Account
    • Versioned
    • Can’t delete anything from archive account
    • Delete old copies
    • PGP encrypted backup copies to Google, last resort DR copy
    • Immutable logs from Cassandra for full history

Account Security

  • Protect Accounts
    • Two factor authentication for primary login
  • Delegated Minimum Privilege
    • Create IAM roles for everything
    • Fine-grained – as-needs basis
  • Security Groups
    • Control who can call your services
    • Every service has a security group with the same name
    • Have to be in the group to be able to call the service
    • Managing service ingress permission = customer base
    • Can ignore services not in my security group
    • Superset of interactions
      • Not all customers call the service
      • Call tree is monitored through other means
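
A hedged illustration of the per-service security-group idea (not Netflix's actual tooling): using the AWS SDK for Java to let a hypothetical “recommendations” service call a hypothetical “movies” service by granting ingress from its security group.

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.AuthorizeSecurityGroupIngressRequest;
    import com.amazonaws.services.ec2.model.IpPermission;
    import com.amazonaws.services.ec2.model.UserIdGroupPair;

    // Illustrative only: the "movies" service owns a security group with the same
    // name; callers are added to its ingress rules rather than opening ports wide.
    public class AllowCallerIngress {
        public static void main(String[] args) {
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

            IpPermission fromCaller = new IpPermission()
                    .withIpProtocol("tcp")
                    .withFromPort(7001)          // hypothetical service port
                    .withToPort(7001)
                    .withUserIdGroupPairs(new UserIdGroupPair().withGroupName("recommendations"));

            ec2.authorizeSecurityGroupIngress(new AuthorizeSecurityGroupIngressRequest()
                    .withGroupName("movies")     // the service's own security group
                    .withIpPermissions(fromCaller));
        }
    }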

Cloud Access Control

  • SSH Bastion
  • Sumo Logic
  • Ssh sudo bastion
  • Can’t ssh between instances – have to go via bastion which wraps sudo with audit logs
    • Login is yourself
    • oq ssh wrapper into machine root or login as regular e.g. default is “dal-prod”
    • Your user doesn’t exist on machines
    • E.g. “dal-prod” is 105, “www-prod” is 106
    • Don’t run anything as root
    • Register of accounts – Asgard
  • Failure modes
    • Datacenter dependency
    • 2 copies of bastion host
      • Homedir is the same
    • Scripts that trample in through the bastion – bad idea
      • NFS server died, lost all shares
    • Datacenters keep breaking your cloud
  • 1 service per host
  • AWS firewall layer is dodgy – creates variance in the network


Fast Start AMIs

  • AWS Answers
  • 1 Asgard copy per account

Stateless services talk to memcache/Cassandra/RDS

No SQL queries – all REST calls to web services

Asgard

  • Grails app
  • All UI endpoints can be read by adding .json

Eureka


Edda

  • Timestamped delta cache of service status with a MongoDB/Elasticsearch back-end
  • Searchable history of AWS instance, deployment version, etc changes
  • Every 1min
  • Janitor monkey cleans up
  • Eucalyptus = AWS-compatible private cloud – lets you see more underlying infra data e.g. switches
  • CloudTrail gives you a record of calls made to configure the cloud and who made them
  • E.g. machines that blew up last week no longer exist, but can still be queried
  • Very powerful for security / auditing
  • CMDBs in the data center don’t work – this actually works – strong assertions

Archaius

  • Property Console
    • Not open-sourced yet
    • Based on Pytheas

Archaius Library Config Mgmt

  • Hierarchy of properties
  • Changes are logged with Chronos
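
A minimal sketch of reading a dynamic property with the Archaius library (the property name and default value are made up); the value can be changed at runtime and is picked up without a redeploy:

    import com.netflix.config.DynamicIntProperty;
    import com.netflix.config.DynamicPropertyFactory;

    public class TimeoutConfig {
        // Hypothetical property name and default value.
        private static final DynamicIntProperty READ_TIMEOUT_MS =
                DynamicPropertyFactory.getInstance()
                        .getIntProperty("movies.api.readTimeoutMs", 2000);

        static {
            // Callback fires when the property changes at runtime
            // (changes themselves are logged via Chronos, per the notes above).
            READ_TIMEOUT_MS.addCallback(() ->
                    System.out.println("readTimeoutMs is now " + READ_TIMEOUT_MS.get()));
        }

        public static int readTimeoutMs() {
            // Each call returns the current value from the property hierarchy,
            // falling back to the default if nothing overrides it.
            return READ_TIMEOUT_MS.get();
        }
    }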

Astyanax

  • A6x
  • Son of Hector (brother of Cassandra)
  • Recipes
    • Patterns to solve common problems
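
A rough sketch of a single-row read with Astyanax (cluster, keyspace and column-family names are made up); the recipes build richer patterns on top of calls like this:

    import com.netflix.astyanax.AstyanaxContext;
    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.connectionpool.OperationResult;
    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
    import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
    import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.model.ColumnList;
    import com.netflix.astyanax.serializers.StringSerializer;
    import com.netflix.astyanax.thrift.ThriftFamilyFactory;

    public class AstyanaxReadExample {
        public static void main(String[] args) throws ConnectionException {
            // Hypothetical cluster/keyspace/pool settings.
            AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                    .forCluster("TestCluster")
                    .forKeyspace("TestKeyspace")
                    .withAstyanaxConfiguration(new AstyanaxConfigurationImpl())
                    .withConnectionPoolConfiguration(
                            new ConnectionPoolConfigurationImpl("myPool")
                                    .setPort(9160)
                                    .setMaxConnsPerHost(3)
                                    .setSeeds("127.0.0.1:9160"))
                    .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                    .buildKeyspace(ThriftFamilyFactory.getInstance());
            context.start();
            Keyspace keyspace = context.getClient();

            ColumnFamily<String, String> users = ColumnFamily.newColumnFamily(
                    "users", StringSerializer.get(), StringSerializer.get());

            // Read one row by key and print how many columns came back.
            OperationResult<ColumnList<String>> result =
                    keyspace.prepareQuery(users).getKey("user-123").execute();
            System.out.println("columns: " + result.getResult().size());

            context.shutdown();
        }
    }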

EVCache

  • Ephemeral Volatile Cache
  • Memcache in each zone
  • “Dynomite” Cassandra-like layer above memcache
  • Priam-like sidecar exposes metrics over JMX

Routing Customers to Code

  • Denominator: DNS for multi-region availability
    • Manage traffic via multiple DNS providers with Java code (or command-line)
    • Talks to Ultra, Dyn, Route53, OpenStack
    • Pluggable
    • Ultra does Geo split (partitioning)
      • Switch to Dyn if Ultra breaks
  • Route53 does switching
  • If a Region goes down denominator switches LB at Route53 layer
    • 50 endpoints
  • Talks to Zuul API Router
  • Zuul – Smart routing
    • Groovy filters update every 30s
    • E.g. block Russian addresses
    • Similar to Mashery, Apigee (see the filter sketch after this list)
  • Ribbon
    • Internal LB
    • Wrapper around HTTP Client
    • Round robin connections
    • Backed by Eureka
  • Karyon – common server container
    • Hello world
    • Embedded status page console
      • Machine readable
      • Enables conformity monkey
      • E.g. reject an instance if it has the wrong versions of libraries
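
At Netflix the routing filters are Groovy and hot-deployed, but the shape is easy to show in Java; a hedged sketch of a Zuul pre-filter that rejects requests from a blocked country (the header name and status code are invented for illustration):

    import com.netflix.zuul.ZuulFilter;
    import com.netflix.zuul.context.RequestContext;

    // Minimal pre-filter sketch: reject requests from a blocked country
    // before they are routed to an origin.
    public class BlockCountryFilter extends ZuulFilter {

        @Override
        public String filterType() {
            return "pre";               // run before the request is routed
        }

        @Override
        public int filterOrder() {
            return 10;                  // relative ordering among pre-filters
        }

        @Override
        public boolean shouldFilter() {
            String country = RequestContext.getCurrentContext()
                    .getRequest().getHeader("X-Geo-Country");  // hypothetical header
            return "RU".equals(country);
        }

        @Override
        public Object run() {
            RequestContext ctx = RequestContext.getCurrentContext();
            ctx.setSendZuulResponse(false);      // stop the request from being proxied
            ctx.setResponseStatusCode(403);
            return null;
        }
    }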


Availability

  • Torture Monkeys – a barrel full of monkeys
    • Block DNS
    • Fill up root disk
    • Unmount EBS volumes
    • Block access to EC2 APIs
    • CPU busy
    • Killing all Java processes

Developer productivity

  • Blitz4J – non-blocking logging
  • GCViz
    • Runs off log files
  • Pytheas – OSS based tooling framework
    • Powerful - Just a little code in the right place
    • Scaffolding
    • Guice, Jersey, FreeMarker, jQuery, DataTables, D3, Bootstrap

BigData & Analytics

  • Genie - Hadoop jobs
    • Complex Processing of S3 data
  • Lipstick – visualisation for Pig queries
  • Suro – event logging pipeline
    • Feeds Kafka, Storm, Druid
    • Alerting
    • 80-100bn events/day

Sample App – RSS Reader

Glisten – Workflow DSL – Amazon Simple Workflow


Scale & Resilience (Resilient API Patterns)

Ben Christensen

Constraints

  • Client libraries
    • Service provides a client library
    • To enable speed of iteration
    • While we want resource-oriented integration, procedure-oriented integration works out faster
  • Mixed Environment
    • Polyglot

Client libraries

  • Deal with Logic, Serialisation, Network Request, Deserialisation, Logic
  • Bulkheading to prevent socket timeouts
    • Limit the blast radius
  • Hystrix
  • Tryable Semaphore
    • E.g. 3-4 threads will do 50rps
    • Cap at 10 – don’t reject until 10 concurrent requests are hit
    • Rejects in a non-blocking way – no queue
    • Fail fast to shed load, or fall back
  • Thread pool
    • Size of thread pool + queue size (typically 0)
    • Slight overhead for extra threads
    • Gives extra safety of being able to release the blocking user instead of waiting on connection
    • Enables interrupting a blocking thread
  • Hystrix Command Object pattern (see the sketch after this list)
    • Synchronous execute
    • Asynchronous queue
    • Circuit open?  Rate limit?
  • Failure options
    • Fail fast instead of backing up – backed up systems do not recover quickly
      • Shed load so can start processing immediately
    • Fail silent
      • Netflix shouldn’t fail for customer just because Netflix can’t talk to Facebook
    • Static fallback
      • Instead of turning feature off... turn it to default state (true, DEFAULT_OBJECT)
      • Fail open instead of failing closed
    • Stubbed fallback
      • Stub parameters with defaults if don’t know
    • Fallback via network
      • Try something else based on similar data
  • Hystrix
    • Each app cluster has a page of Circuit Breakers
    • Shows last 10s
    • Links to historical data
    • Different decisions about
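
A minimal HystrixCommand sketch tying the above together – the service name and fallback value are hypothetical; the command provides the bulkhead, timeout, circuit-breaker and fallback behaviour described in this list:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Illustrative command: wraps a call to a hypothetical ratings service and
    // falls back to a static default if the call fails, times out, is rejected
    // by the bulkhead, or the circuit is open.
    public class GetRatingCommand extends HystrixCommand<Integer> {

        private final String videoId;

        public GetRatingCommand(String videoId) {
            // The group key ties the command to a thread pool / circuit breaker.
            super(HystrixCommandGroupKey.Factory.asKey("RatingsService"));
            this.videoId = videoId;
        }

        @Override
        protected Integer run() throws Exception {
            // The network call to the dependency would go here (via the client library).
            return fetchRatingOverNetwork(videoId);
        }

        @Override
        protected Integer getFallback() {
            // Static fallback: fail open with a default rather than propagating the error.
            return 3;
        }

        private Integer fetchRatingOverNetwork(String videoId) throws Exception {
            throw new RuntimeException("dependency unavailable");  // placeholder body
        }
    }

    // Usage:
    //   Integer rating = new GetRatingCommand("some-video").execute();   // synchronous
    //   Future<Integer> f = new GetRatingCommand("some-video").queue();  // asynchronous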







Deployment

  • Zuul Routing Layer
    • Replaces ELBs + Commercial proxy layer
      • Hated it: 1-2 days to make simple rule change
  • Simple Pre and Post filters on HTTP Request/Response
  • Can add & remove filters at runtime
  • Routing changes
  • Use Cases
    • Want to know all logs for particular User-id across entire fleet – via Turbine
    • Canary vs Baseline
      • Launch 2 clusters
      • Run through 1 peak cycle
    • Squeeze Testing
      • Impossible without Zuul
      • RPS on a particular instance (Math in a filter)
      • Increment in 5rps increments
      • Test every binary to see what its breaking point is
      • How many machines will we need in prod?
      • Is this change inefficient?
      • Auto-scaling parameters?
      • Test can’t come close to prod load – different load
      • Really hard to simulate true load, cache hits
      • Don’t bother with Load Tests
      • Squeeze tests as part of prod
      • Acceptable break?
        • Client will typically retry (or get a “Try again” prompt)
        • Retry will probably hit a different box (out of 100s)
      • Rules are tested in a Zuul canary
        • Prod Zuul cluster – small Zuul cluster
        • Activate on main cluster after tested
    • Coalmine
      • Long-term canary cluster
      • Java agents with byte code manipulation
      • Intercept network traffic
      • Watch a particular binary
      • Raise alarm if see network traffic not isolated by Hystrix
        • E.g. someone flips on code that is not correctly isolated
      • Look in Chronos to see what was changed
    • Production
  • Scryer – predictive auto-scaling
    • Uses last 4 weeks for any particular day of week
    • Creates an auto-scaling plan for ASG
    • Move min-floor up and down, leave max high
    • 5% buffer better than 10-20% buffer for reactive
    • Still have reactive plan to kick in for a safety net (e.g. Snow day)
    • Will keep scaling up even if not receiving predicted traffic – avoids outage when traffic suddenly comes back online
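
Scryer itself isn't open source, so this is only a sketch of the idea described above: average the same weekday over the last 4 weeks per time slot, add a small buffer, and use the result as the ASG minimum while leaving the maximum high.

    import java.util.List;

    // Not Scryer's actual code – just an illustration of the prediction step.
    public class PredictiveScalingSketch {

        /** instancesPerRps and bufferFraction are illustrative tuning parameters. */
        public static int[] planMinInstances(List<double[]> lastFourSameWeekdays,
                                             double instancesPerRps,
                                             double bufferFraction) {
            int slots = lastFourSameWeekdays.get(0).length;   // e.g. one slot per 5 minutes
            int[] plan = new int[slots];
            for (int slot = 0; slot < slots; slot++) {
                double sum = 0;
                for (double[] day : lastFourSameWeekdays) {
                    sum += day[slot];
                }
                double predictedRps = sum / lastFourSameWeekdays.size();
                double withBuffer = predictedRps * (1.0 + bufferFraction);   // e.g. 5%
                plan[slot] = (int) Math.ceil(withBuffer * instancesPerRps);
            }
            return plan;   // feed each slot's value into the ASG minimum; leave max high
        }
    }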

Testing

  • Testers do manual testing on UI
  • Engineers have to make their stuff work
  • Code reviews are up to each team culture
  • For risky changes, request feedback on the pull request
  • Canary
    • If significant degradation – canary test fails
  • Integration testing?
    • Smoke testing does a lot of the service testing
    • Expected data testing “prod” branch is latest dependency integration
      • Nightly build
      • If it fails, the integration guys won’t promote it
    • Not scientific
  • A/B Testing – See Jason Brown presentation

Migrating Data Changes

  • Don’t make breaking changes
  • Never attempt to synchronise releases
  • Wait until all consumers are up-to-date
  • Client library facade

Other Notes

  • Build compatibility into client library
  • Client library boils down to just fallback decisions in Hystrix circuit breakers
  • Working on static analysis to check
  • Keep track of which clients call which endpoints – know who’s affected by changes

Performance & Innovation

Ben Christensen

Suro Event Pipeline

  • Cloud native, dynamic, configurable offline and realtime data sinks
  • Open-sourced yesterday
  • S3 demultiplexing -> Hadoop -> BI
  • Kafka -> Druid and ES
  • Druid = realtime data cube
    • Not event processing
    • Counting on multiple dimensions
  • Can plug Storm into Kafka for event processing
  • ES for searching events

Availability

Adrian Cockcroft 

Incident management

  • PagerDuty alerts are automated
  • Incident creation is not automated

Cassandra at Scale

  • Boundary.com – network flow analysis
    • Small number of nodes for free

Failure Modes and Effects

Auto-scaling saves up to 70%

Janitor monkey

  • Uses Edda to find things not used

Compare TCO

  • Place
  • Power
  • Pipes
  • People
  • Patterns
    • Managing, overhead, tooling
  • Jevons Paradox
    • If you make something more efficient, people will consume more of it and more people will use it – more than offsetting the efficiency gains
    • Amazon incent their sales reps to help you save money

Size for the amount of RAM you need then scale horizontally




Bonus notes from sidechat about "Consumer-Driven Contracts" with Sam Newman

Recommended Resources for testing Consumer-Driven Contracts?

Other Tips

  • Rely more on monitoring in prod
    • Synthetic Transactions
    • A/B Testing
  • Releasing apps together is the wrong way to go
    • It’s a smell that your apps are becoming tightly coupled
    • Ends up tying you back to a monolithic structure