Tuesday, December 31, 2013

How to build Netflix RSS Reader Example on Centos 6 with Amazon AWS

Instructions for installing Netflix RSS Reader on Amazon EC2

In the last week of December 2013, I built and installed the Netflix example RSS Reader application by following these instructions on the Netflix recipes-rss wiki.  See also the Netflix Tech Blog for an overview.

There were a lot of dead ends and rabbit warrens involved in this process, as there are many components to get up and running: infrastructure, networking, JDK, Gradle, source code, Tomcat, Jetty.

Hopefully these step-by-step instructions make it easier for someone.

It's worth noting a couple of default assumptions that aren't immediately clear but do make your life easier:
  1. All 3 services (Eureka, RSS Middletier, RSS Edge) are installed on the same host - you don't need to create 3 separate machines and network them all together
  2. The RSS Reader example does not require Cassandra but uses in-memory data storage by default (connecting to Cassandra is optional)
The first set of instructions below are how to get it running on a single instance.  The follow-on instructions describe how to scale it out to separate clusters of nodes - which is where Turbine/Hystrix gets interesting.

On to the instructions...

Basic Single-Instance Installation

# Instructions prefixed with "###" are dead-ends I went down - may save you time to skip them.

# Create the machine

# Create an EC2 Instance in the AWS console with the following:
#   AMI base image: Centos 6 x86_64 with updates
#   Instance: m1.small (WARNING: t1.micro's 600M RAM is insufficient)
#   Create a security group and save the key. Login to your new instance as root with the downloaded key.

# Setup the networking configuration to allow the services to talk to each other and allow you to browse to them:

# Configure AWS security Group: 
#   Open TCP Input ports: 22, 80, 9090, 9092, 9191, 9192
#   Ideally, but optionally, expose them only to your IP address instead of the whole world

# Configure iptables:
# Flush all existing rules
iptables -F
# Block null recon packets
iptables -A INPUT -p tcp --tcp-flags ALL NONE -j DROP
# Drop new connections that don't start with a SYN packet
iptables -A INPUT -p tcp ! --syn -m state --state NEW -j DROP
# Allow loopback for internal services
iptables -A INPUT -i lo -j ACCEPT
# Open ports 22 (ssh) & 80 (http)
iptables -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
# Open ports 9090 & 9092 for RSS Reader Edge webserver
iptables -A INPUT -p tcp -m tcp --dport 9090 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 9092 -j ACCEPT
# Open ports 9191 & 9192 for RSS Reader Middletier webserver
iptables -A INPUT -p tcp -m tcp --dport 9191 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 9192 -j ACCEPT
# Accept replies to outgoing connections (established/related state)
iptables -I INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Drop everything else
iptables -P INPUT DROP

iptables -L -n
service iptables save
service iptables restart

# For better security, consider leaving port 80 closed and redirecting requests from port 80 to port 8080. Then, in the Tomcat instructions below, Tomcat could stay on its default 8080 port without requiring the "root" user

# iptables -A PREROUTING -t nat -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 8080
# iptables -A INPUT -p tcp -m tcp --dport 8080 -j ACCEPT

# Install JDK

### I shouldn't have done this first step... it only installs the JRE... Gradle needs the JDK
### yum install -y java-1.7.0-openjdk.x86_64

# Install Oracle JDK as per: http://parijatmishra.wordpress.com/2013/03/09/oraclesun-jdk-on-ec2-amazon-linux/
### Remove OpenJDK. Hopefully not required if you didn't do the above "yum install java-1.7.0-openjdk.x86_64"
### rpm --erase --nodeps java-1.7.0-openjdk java-1.7.0-openjdk-devel
yum install -y wget
# There's probably a better way to get the most recent JDK - but these instructions made it easy
wget --no-check-certificate --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2Ftechnetwork%2Fjava%2Fjavase%2Fdownloads%2Fjdk-7u3-download-1501626.html;" http://download.oracle.com/otn-pub/java/jdk/7u25-b15/jdk-7u25-linux-x64.rpm
mv jdk-7u25-linux-x64.rpm\?AuthParam\=1388300280_9fd087722658cfbb8e571f2d0449beea jdk-7u25-linux-x64.rpm
yum install -y jdk-7u25-linux-x64.rpm 

for i in /usr/java/jdk1.7.0_25/bin/* ; do
  f=$(basename $i); echo $f
  sudo alternatives --install /usr/bin/$f $f $i 20000
  sudo update-alternatives --config $f
done

cd /etc/alternatives
ln -sfn /usr/java/jdk1.7.0_25 java_sdk
cd /usr/lib/jvm
ln -sfn /usr/java/jdk1.7.0_25/jre jre
# JAVA_HOME must be set for Gradle to work
echo "export JAVA_HOME=/usr/java/jdk1.7.0_25" >> ~/.bashrc
. ~/.bashrc

### Install Gradle - may not be required as the Netflix build steps below use self-contained "gradlew" script (which downloads Gradle)
### curl -O "http://downloads.gradle.org/distributions/gradle-1.10-all.zip"
### yum install -y unzip
### cd /opt
### unzip ~/gradle-1.10-all.zip 
### cd
### echo "export PATH=$PATH:/opt/gradle-1.10/bin" >> .bashrc
### . ~/.bashrc 

# Build RSS Reader Middletier and Edge webapps

yum install -y git

git clone https://github.com/Netflix/recipes-rss.git
cd recipes-rss
./gradlew clean build

### This was required to fix error "Error compiling file: /tmp//org/apache/jsp/jsp/rss_jsp.java org.apache.jasper.JasperException: PWC6033: Unable to compile class for JSP"
### javac /tmp/org/apache/jsp/jsp/rss_jsp.java -cp /root/recipes-rss/rss-edge/build/libs/rss-edge-0.1.0-SNAPSHOT.jar:/tmp -source 1.5 -target 1.5

# Install Tomcat 6 (didn't bother with Tomcat 7 as it wasn't available in the default Centos 6 yum repo, as per "yum search tomcat")
yum install -y tomcat6
sed -i 's/port=\"8080\"/port=\"80\"/g' /etc/tomcat6/server.xml
# Set TOMCAT_USER to "root"
vim /etc/tomcat6/tomcat6.conf
    # Replace the TOMCAT_USER setting with:
    TOMCAT_USER="root"
# End of edit
echo "export TOMCAT_HOME=/usr/share/tomcat6" >> ~/.bashrc
. ~/.bashrc
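The manual vim edit above can be scripted (handy if this setup later gets baked into an AMI). A sketch of the TOMCAT_USER change, demonstrated on a scratch copy rather than the real /etc/tomcat6/tomcat6.conf:

```shell
# Scratch file standing in for /etc/tomcat6/tomcat6.conf
conf=/tmp/tomcat6.conf.demo
echo 'TOMCAT_USER="tomcat"' > "$conf"
# Replace the TOMCAT_USER setting in place, as the vim edit above does
sed -i 's/^TOMCAT_USER=.*/TOMCAT_USER="root"/' "$conf"
cat "$conf"
```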

# Build and deploy Eureka
git clone https://github.com/Netflix/eureka.git
cd eureka/
./gradlew clean build
cp ./eureka-server/build/libs/eureka-server-XXX-SNAPSHOT.war $TOMCAT_HOME/webapps/eureka.war

service tomcat6 start

# Make sure there are no errors
grep "ERROR" /usr/share/tomcat6/logs/catalina.out | less -S
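Rather than re-running the grep, a bounded poll tells you when Eureka has come up. The URL assumes the single-host install on port 80 as above (trailing slash required); adjust host/port if yours differs:

```shell
# Poll the Eureka console up to 3 times, 2s apart, without hanging forever
eureka_url="http://localhost/eureka/"
up=no
for attempt in 1 2 3; do
  if curl -sf -o /dev/null --max-time 5 "$eureka_url"; then
    up=yes
    break
  fi
  sleep 2
done
echo "eureka up: $up"
```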

# (takes around 2mins to startup, expect some startup errors due to Eureka not running in an established cluster)
# Browse to http://[IP ADDRESS]/eureka/   <-- trailing slash required

# In another terminal session
# Start RSS Middletier Webserver
export APP_ENV=dev
cd recipes-rss
java -Xmx128m -XX:MaxPermSize=32m -jar rss-middletier/build/libs/rss-middletier-*SNAPSHOT.jar

# Test via Admin port: Browse to: http://[IP ADDRESS]:9192

# In another terminal session
# Start RSS Edge Webserver
export APP_ENV=dev
cd recipes-rss
java -Xmx128m -XX:MaxPermSize=32m -jar rss-edge/build/libs/rss-edge-*SNAPSHOT.jar

# Test via Admin port: Browse to: http://[IP ADDRESS]:9092
# Browse to http://[IP ADDRESS]:9090/jsp/rss.jsp
# Add the following RSS feeds:
# http://rss.cnn.com/rss/edition.rss
# http://feeds.washingtonpost.com/rss/politics
# http://news.yahoo.com/rss/us
# http://rss.cnn.com/rss/money_autos.rss

# Optional Extras...

# Install Hystrix: https://github.com/Netflix/recipes-rss/wiki/Hystrix-Metrics-%28Optional%29

# Open port 7979 in AWS Security Group

iptables -A INPUT -p tcp -m tcp --dport 7979 -j ACCEPT
iptables -L -n
service iptables save
service iptables restart

git clone https://github.com/Netflix/Hystrix.git
cd Hystrix/hystrix-dashboard
../gradlew jettyRun

# Browse to: http://[IP ADDRESS]:7979/hystrix-dashboard
# Enter http://[IP ADDRESS]:9090/hystrix.stream to see the Hystrix metrics show up in the dashboard. You will have to send a few transactions from the Edge service to have the Hystrix metrics loaded.

# Stress the Edge webserver and watch the circuit trip into "Open" state:
for i in {1..100}; do curl -s -o /dev/null -w "%{http_code} %{url_effective}\\n" "http://[IP ADDRESS]:9090/jsp/rss.jsp" & done

# Hystrix Example application: https://github.com/Netflix/Hystrix/tree/master/hystrix-examples-webapp

# Open port 8989 in AWS Security Group

iptables -A INPUT -p tcp -m tcp --dport 8989 -j ACCEPT
iptables -L -n
service iptables save
service iptables restart

cd Hystrix/hystrix-examples-webapp
../gradlew jettyRun

# Browse to: http://[IP ADDRESS]:8989/hystrix-examples-webapp

# View it on Hystrix dashboard
# Browse to: http://[IP_ADDRESS]:7979/hystrix-dashboard/
# Enter: http://[IP_ADDRESS]:8989/hystrix-examples-webapp/hystrix.stream
# Click "Monitor Stream"

# See the metrics change with this in one window:
curl [IP_ADDRESS]:8989/hystrix-examples-webapp/hystrix.stream
# And this in another window:
while true ; do curl "[IP_ADDRESS]:8989/hystrix-examples-webapp/"; done   # <-- The trailing "/" is required

# Install Turbine: https://github.com/Netflix/Hystrix/wiki/Dashboard
curl -L -O https://github.com/downloads/Netflix/Turbine/turbine-web-1.0.0.war
cp turbine-web-1.0.0.war $TOMCAT_HOME/webapps/turbine.war

# Configure Turbine (using Archaius)
vi /root/rss-edge-turbine.properties
    # From https://github.com/Netflix/Hystrix/wiki/Dashboard
    # Hystrix stream for RSS Edge webapp
# End edit
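The property lines in the edit above were lost when this post was published. A minimal sketch based on Turbine's property-based instance discovery - the cluster name, port, and localhost instance are assumptions for this single-host setup:

```properties
turbine.aggregator.clusterConfig=default
turbine.instanceUrlSuffix=:9090/hystrix.stream
turbine.ConfigPropertyBasedDiscovery.default.instances=localhost
```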

# Add Archaius config to Tomcat "archaius.configurationSource.additionalUrls"
vi /etc/tomcat6/tomcat6.conf
    # Archaius properties for Turbine and Netflix
    JAVA_OPTS="${JAVA_OPTS} -Darchaius.configurationSource.additionalUrls=file:///root/rss-edge-turbine.properties"
# End edit

service tomcat6 restart
# Should see this in the logs: "URLs to be used as dynamic configuration source: [file:/root/rss-edge-turbine.properties]"
grep rss-edge /usr/share/tomcat6/logs/catalina.out

# Browse to: http://[IP_ADDRESS]:7979/hystrix-dashboard/
# Enter:
# Click "Monitor Stream"

# Install hystrix-dashboard in Tomcat
curl -o hystrix-dashboard-1.3.8.war "http://search.maven.org/remotecontent?filepath=com/netflix/hystrix/hystrix-dashboard/1.3.8/hystrix-dashboard-1.3.8.war"
cp hystrix-dashboard-1.3.8.war $TOMCAT_HOME/webapps/hystrix-dashboard.war

# Install hystrix-examples-webapp in Tomcat
cd Hystrix/hystrix-examples-webapp
../gradlew build
cp build/libs/hystrix-examples-webapp-1.3.9-SNAPSHOT.war /usr/share/tomcat6/webapps/hystrix-examples-webapp.war

Cluster Install

Turbine really only starts to shine once there are clusters involved.

Let's set up a cluster of dedicated Edge and Middletier instances, a dedicated Eureka instance (on Tomcat together with Hystrix Dashboard + Turbine) and with ELB in front of the Edge cluster. Something like this:

         Internet                           Internet
             |                                  |
             |                                  v
             |                               AWS ELB
             |                                  |
  /----------|------------------------\        /|\
  |          v                        |        v v v
  | Hystrix Dashboard  <-- Turbine  <-----  RSS Edge (x3) 
  |                                   |    ^  ^ ^ ^
  |                                   |   /    \|/
  |                       Eureka <--------      |
  |                          ^        |         |
  \--------------------------|--------/        /|\
                             |                v v v
                             ---------- RSS Middletier (x3)

# To make things easier, before we scale out we'll set up some convenience 
# hostnames and scripts - in reality we'd come back and automate this 
# properly later on with Puppet/Chef/Ansible/Salt/Baked-into-image

# Add an entry with the Private IP address of the server in /etc/hosts, e.g.: [PRIVATE_IP]    eureka

vim "recipes-rss/rss-edge/src/main/resources/edge.properties"
# End edit

vim "recipes-rss/rss-middletier/src/main/resources/middletier.properties"
# End edit

# Rebuild both jars
cd /root/recipes-rss && ./gradlew build

# Create the following scripts in root's homedir:

# change_eureka_host.sh
if [ $# -lt 1 ]; then
    echo "Usage: $0 IP_ADDRESS"
    exit 1
fi
sed -i "s/.*eureka/$1    eureka/g" /etc/hosts
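To sanity-check the sed expression in change_eureka_host.sh without touching the real /etc/hosts, exercise it against a scratch copy (10.0.0.99 is a placeholder private IP):

```shell
# Scratch file standing in for /etc/hosts
hosts=/tmp/hosts.demo
printf '127.0.0.1   localhost\n10.0.0.5    eureka\n' > "$hosts"
# Same substitution the script performs: rewrite the line ending in "eureka"
sed -i "s/.*eureka/10.0.0.99    eureka/g" "$hosts"
grep eureka "$hosts"
```

Only the eureka line is rewritten; the localhost entry is left alone.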

# start_rss-edge.sh
cd /root/recipes-rss
nohup java -Xmx128m -XX:MaxPermSize=32m -jar rss-edge/build/libs/rss-edge-*SNAPSHOT.jar &
echo "Output is logged to /root/recipes-rss/logs/rss-edge.log"

# start_rss-middletier.sh
cd /root/recipes-rss
nohup java -Xmx128m -XX:MaxPermSize=32m -jar rss-middletier/build/libs/rss-middletier-*SNAPSHOT.jar &
echo "Output is logged to /root/recipes-rss/logs/rss-middletier.log"

# tail_rss-edge.sh
tail -100f /root/recipes-rss/logs/rss-edge.log

# tail_rss-middletier.sh
tail -100f /root/recipes-rss/logs/rss-middletier.log

# Save the instance as an AMI
# Now create 6 more instances based off this AMI - they can all be t1.micro's (in fact we can reprovision our main node as a t1.micro and only run Tomcat on it for Eureka, Hystrix Dashboard, & Turbine)
# Optional: Name them in the EC2 console as rss-edge-N, rss-middletier-N, rss-eureka
# Create 2 more security groups: rss-edge & rss-middletier
# Ensure 'rss-edge' security group has ports 9090 & 9092 open
# Ensure 'rss-middletier' security group has ports 9191 & 9192 open
# Create a Load Balancer named 'RssEdgeLoadBalancer' containing all 3 
# rss-edge-* nodes.  Attach ports 9090 & 9092 to the same ports.
# Optional: add a health check on:
#   Protocol: HTTP
#   Port: 9092
#   Path: /adminres/webadmin/index.html
#   Timeout: 5s
#   Interval: 0.5min
#   Unhealthy Threshold: 2
# Optional: Add cloudwatch alarm in Monitoring tab
#   UnHealthyHostCount >= 1 for 1 minute
#   Send message to topic "NotifyMe"

# On the Eureka node add the Private IP addresses of the nodes to /etc/hosts 
# e.g.:       eureka  rss-edge-1  rss-edge-2  rss-edge-3  rss-middletier-1  rss-middletier-2  rss-middletier-3
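The private IP addresses were stripped from the example above; the layout would look something like this (the 10.0.x.x addresses are placeholders for your instances' private IPs):

```
10.0.0.10   eureka
10.0.0.11   rss-edge-1
10.0.0.12   rss-edge-2
10.0.0.13   rss-edge-3
10.0.0.21   rss-middletier-1
10.0.0.22   rss-middletier-2
10.0.0.23   rss-middletier-3
```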

# On the Eureka node - modify the all-turbine.properties
vi /root/all-turbine.properties
# End edit
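As with the single-host Turbine config, the contents of all-turbine.properties were lost from this post. A sketch using Turbine's property-based discovery against the hostnames registered in /etc/hosts above - the cluster name and port are assumptions:

```properties
turbine.aggregator.clusterConfig=rss-edge
turbine.instanceUrlSuffix=:9090/hystrix.stream
turbine.ConfigPropertyBasedDiscovery.rss-edge.instances=rss-edge-1,rss-edge-2,rss-edge-3
```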

# Create a script to load test the Edge nodes

# hammer_rss_vip.sh

# Usage: hammer_rss_vip.sh [pause_seconds] 
#   e.g. hammer_rss_vip.sh 0.1   - will pause for 100ms between requests
#   e.g. hammer_rss_vip.sh       - no pause: fire requests as fast as we can fork processes in the background


# Set RSS_EDGE_VIP to the DNS name of RssEdgeLoadBalancer
RSS_EDGE_VIP=[ELB_DNS_NAME]
DELAY=${1:-0}
while true ; do
  curl -s -o /dev/null http://$RSS_EDGE_VIP:9090/jsp/rss.jsp &
  sleep $DELAY
done

# Now watch Turbine via the Hystrix Dashboard while you load test the Edge via the ELB and see the difference for different timings
./hammer_rss_vip.sh 1
./hammer_rss_vip.sh 0.1

Saturday, December 14, 2013

Important concepts in software

The following is a collection of important concepts all software professionals should understand.  For each of them, the moment of understanding is often an epiphany - an "aha" moment when something that until then either felt good or was second nature suddenly has a label, a vocabulary.

Jeff Atwood lists the first 3: http://www.codinghorror.com/blog/2009/03/five-dollar-programming-words.html
Brandon Byers lists #4 and #5 when explaining #3: http://brandonbyars.com/2008/07/21/orthogonality/
Some good definitions in this SO answer.

Idempotence

The result of doing something more than once is the exact same result as doing it once


Orthogonality

Nonorthogonal systems are hard to manipulate because it's hard to tweak isolated parts

Composability

Features can be combined. Unix philosophy.

Principle of Least Surprise

Referential Transparency

There's a single, well-defined correct value. Important for reasoning about what a program is doing at any point in time.  In contrast, imperative concurrency can give different answers, depending on a scheduler or whether you're looking or not, and they can even deadlock.

Complexity: Essential vs Accidental

State changes over time. The temporally discrete nature of imperative computation is an accommodation to a particular style of machine, rather than a natural description of behavior itself

Netflix audit Prod rather than controlling releases through central approval process

I asked Adrian Cockcroft & Ben Christensen from Netflix how they control the release cycle of production configuration items (pointing out that each of the levels in the Archaius hierarchy potentially have a separate release cycle). In short, they don't.  Rather, they constantly audit production (using Chronos) and keep track of real-time events (what, when, where) to use for ad-hoc search queries when diagnosing problems. To read between the lines, there's more value in getting changes out quickly and having them fail than not releasing changes while waiting for approval and causing bottlenecks in the approval process.  Put more effort into testing in prod (monitoring/alerting/compliance) and less effort into release ceremony.

This snippet seems to cover it:

We push hard to always increase our speed of innovation, and at the same time reduce the cost of making changes in the environment.  In the datacenter days, we forced every production change to be logged in a change control system because the first question everyone asks when looking at an issue is “What changed recently?”.  We found a formal change control system didn’t work well with our culture of freedom and responsibility, so we deprecated a formal change control process for the vast majority of changes in favor of Chronos.  Chronos accepts events via a REST interface and allows humans and machines to ask questions like “what happened in the last hour?” or “what software did we deploy in the last day?”.  It integrates with our monkeys and Asgard so the vast majority of changes in our environment are automatically reported to it, including event types such as deployments, AB tests, security events, and other automated actions.

Wednesday, December 11, 2013

Notes from Netflix OSS Yow Workshop (11/Dec/2013)

The workshop was essentially a series of presentations by Adrian Cockcroft and Ben Christensen.



Cloud at Scale

Adrian Cockcroft

Time to market vs Quality

  • Aggressively go out and assume that things are broken
  • e.g. Land grab, market disruption, web services
  • Default assumptions
    • Always shipping code that is broken
    • Hardware is broken
    • Operationally defensive

Need ability to see in realtime

  • Small change, quick to fix
  • Able to revert

Cloud Native – a new engineering challenge

  • Construct highly agile and highly available services from ephemeral and assumed-broken components


  • Release It!
    • Bulkhead & circuit-breaker
  • Thinking in Systems – Donella H. Meadows
    • Chaotic system
    • Order from Chaos
    • Emergent behaviour is that it shows movies
    • Right feedback loops
    • Not about software – about building feedback loops & rules around things so that they’re stable & predictable
    • Looks like an ants’ nest – chaotic, but order is an emergent property
  • Anti-fragile
  • Drift Into Failure – Sidney Dekker
    • Aircraft industry lessons
    • Ways to avoid
    • Latent failures
    • Netflix outages are unique
    • Byzantine
    • Have enough margin
    • Gradually take fat out of the system – drop dead if you miss a meal
  • Everything Is Obvious – Duncan J. Watts
    • Avoid untrained people rushing in and pushing buttons
  • The REST API Design Handbook
    • 100s of microservices
    • Short book ranting on bad APIs
    • Dell Cloud something
    • $3
    • “REST in Practice” also good
  • Continuous Delivery
  • Cloudonomics
    • Cost model
    • Detailed analysis of cloud costs and how it fits together
  • The Phoenix Project

How to get to Cloud Native

  • Freedom and Responsibility for Developers
  • Decentralize and Automate Ops Activities
  • Integrate DevOps into the Business Organization
  • Re-org!

Four transitions

  • Management: Integrated Roles in a Single Organization
    • Business, Development, Operations -> BusDevOps
  • Developers: Denormalized Data – NoSQL
    • Decentralized, scalable, available, polyglot
    • Hardest thing to get everyone’s head around
    • Don't really need transactions anyway
    • Data checkers run around checking bits of data have correct ids and “foreign keys”
    • No such thing as consistency
    • Paranoia covered by backups
  • Responsibility from Ops to Dev: Continuous Delivery
    • Decentralized small daily production updates
    • Push to prod end of every day
  • Responsibility from Ops to Dev: Agile Infrastructure - Cloud
    • Hardware in minutes, provisioned directly by developers

Fitting into public scale

  • 1,000 – 100,000 instances is ideal for AWS
  • 500k instances 3 years ago, 5m today
    • http://bit.ly/awsiprange

Netflix don’t use AWS for

  • SaaS Applications – Pagerduty, Onelogin etc.
  • Content Delivery Service

Open Connect Appliance Hardware - Netflix Open Source Content Delivery Service

  • 5 engineers got together for 4 months and built their own hardware
  • Build it themselves
  • Give them away
  • $15k
  • Pre-loaded static content
  • Nginx, Bind, Bird bind library
  • BSD
  • UFS+ filesystem
  • Unmount disk if it fails, lower capacity – no striping
  • Hot content on SSD

DNS Service

  • Route53 missing too many features
  • Amazon will clean up in the DNS market when they finish Route53

Escaping the Death Spiral

  • Get out of the way of innovation
    • Process reduction - aggressively
  • Hardware: Best of breed, by the hour
    • If don’t like it – get rid of it
  • Choices based on scale
    • E.g. Big scale = US East Region
    • Build your own DNS

Getting to Cloud Native

Getting started with NetflixOSS Step by Step

  1. Set up AWS Accounts to get the foundation in place
  2. Security and access management setup
  3. Account Management: Asgard to deploy & Ice for cost monitoring
  4. Build Tools: Aminator to automate baking AMIs
  5. Service Registry and Searchable Account History: Eureka & Edda
  6. Configuration Management: Archaius dynamic property system
  7. Data storage: Cassandra, Astyanax, Priam, EVCache
  8. Dynamic traffic routing: Denominator, Zuul, Ribbon, Karyon
  9. Availability: Simian Army (Chaos Monkey), Hystrix, Turbine
  10. Developer productivity: Blitz4J, GCViz, Pytheas, RxJava
  11. Big Data: Genie for Hadoop PaaS, Lipstick visualizer for Pig, Suro (logging pipeline)
  12. Sample Apps to get started: RSS Reader, ACME Air (IBM), FluxCapacitor 

Flow of Code & Data between AWS Accounts

  • Auditable Account: Code with dollar signs goes into this account
  • Archive Account
    • Test account gets refreshed every weekend of all schemas/data from Prod
    • Trashes test data
    • Confidential data is encrypted at rest
    • Tokenise sensitive data
  • Cloud LDAP gives you access to Production Account
  • Auditable Account is different LDAP group, need a reason to access it
  • Vault account needs a background check
  • Higher security apps
    • Monitoring systems inside
    • Smaller and smaller for higher security
    • Hyperguard & cloudpassage
  • Cloud security architect used to be a PCI auditor – could talk to auditors at their level
    • Had to educate auditors
  • Archive Account
    • Versioned
    • Can’t delete anything from archive account
    • Delete old copies
    • PGP encrypted backup copies to Google, last resort DR copy
    • Immutable logs from Cassandra for full history

Account Security

  • Protect Accounts
    • Two factor authentication for primary login
  • Delegated Minimum Privilege
    • Create IAM roles for everything
    • Fine-grained – as-needs basis
  • Security Groups
    • Control who can call your services
    • Every service has a security group with the same name
    • Have to be in the group to be able to call the service
    • Managing service ingress permission = customer base
    • Can ignore services not in my security group
    • Superset of interactions
      • Not all customers call the service
      • Call tree is monitored through other means

Cloud Access Control

  • SSH Bastion
  • Sumo Logic
  • Ssh sudo bastion
  • Can’t ssh between instances – have to go via bastion which wraps sudo with audit logs
    • Login is yourself
    • oq ssh wrapper into machine root or login as regular e.g. default is “dal-prod”
    • Your user doesn’t exist on machines
    • E.g. “dal-prod” is 105, “www-prod” is 106
    • Don’t run anything as root
    • Register of accounts – Asgard
  • Failure modes
    • Datacenter dependency
    • 2 copies of bastion host
      • Homedir is the same
    • Scripts that trample in through bastion – bad idea
      • NFS server died lost all shares
    • Datacenters keep breaking your cloud
  • 1 service per host
  • AWS firewall layer is dodgy – creates variance in the network

Fast Start AMIs

  • AWS Answers
  • 1 Asgard copy per account

Stateless services talk to memcache/Cassandra/RDS

No SQL queries – all REST calls to webservices


Asgard

  • Grails app
  • All UI endpoints can be read by adding .json



Edda

  • Timestamped delta cache of service status mongodb/ES back-end
  • Searchable history of AWS instance, deployment version, etc changes
  • Every 1min
  • Janitor monkey cleans up
  • Eucalyptus = AWS-compatible private cloud – lets you see more underlying infra data e.g. switches
  • CloudTrail gives you a record of calls made to configure the cloud and who made them
  • E.g. Machines that blew up last week no longer exist
  • Very powerful for security / auditing
  • CMDBs in data center don’t work – this actually works – strong assertions


  • Property Console
    • Not OS yet
    • Based on Pytheas

Archaius Library Config Mgmt

  • Hierarchy of properties
  • Changes are logged with Chronos


Astyanax

  • A6x
  • Son of Hector (brother of Cassandra)
  • Recipes
    • Patterns to solve common problems


EVCache

  • Eccentric (Ephemeral) Volatile Cache
  • Memcache in each zone
  • “Dynomite” Cassandra-like layer above memcache
  • Priam-like sidecar exposes metrics over JMX

Routing Customers to Code

  • Denominator: DNS for multi-region availability
    • Manage traffic via multiple DNS providers with Java code (or command-line)
    • Talks to Ultra, Dyn, Route53, OpenStack
    • Pluggable
    • Ultra does Geo split (partitioning)
      • Switch to Dyn if Ultra breaks
  • Route53 does switching
  • If a Region goes down denominator switches LB at Route53 layer
    • 50 endpoints
  • Talks to Zuul API Router
  • Zuul – Smart routing
    • Groovy filters update every 30s
    • E.g. block Russian addresses
    • Similar to masher, apigee
  • Ribbon
    • Internal LB
    • Wrapper around HTTP Client
    • Round robin connections
    • Backed by Eureka
  • Karyon – common server container
    • Hello world
    • Embedded status page console
      • Machine readable
      • Enables conformity monkey
      • E.g. Reject if it has versions of libraries


  • Torture Monkeys – barrel full of monkeys
    • Block DNS
    • Fill up root disk
    • Unmount ebs
    • Block access to ec2 apis
    • CPU busy
    • Killing all Java processes

Developer productivity

  • Blitz4J – non-blocking logging
  • GCViz
    • Runs off log files
  • Pytheas – OSS based tooling framework
    • Powerful - Just a little code in the right place
    • Scaffolding
    • Guice, Jersey, FreeMarker, JQuery, DataTabler, D3, Bootstrap

BigData & Analytics

  • Genie - Hadoop jobs
    • Complex Processing of S3 data
  • Lipstick – visualisation for Pig queries
  • Suro – event logging pipeline
    • Feeds Kafka, Storm, Druid
    • Alerting
    • 80-100bn events/day

Sample App – RSS Reader

Glisten – Workflow DSL – Amazon Simple Workflow

Scale & Resilience (Resilient API Patterns)

Ben Christensen


  • Client libraries
    • Service provides a client library
    • To enable speed of iteration
    • While resource integration is what we want, procedure integration works out faster
  • Mixed Environment
    • Polyglot

Client libraries

  • Deal with Logic, Serialisation, Network Request, Deserialisation, Logic
  • Bulkheading to prevent socket timeouts
    • Limit the blast radius
  • Hystrix
  • Tryable Semaphore
    • E.g. 3-4 threads will do 50rps
    • Cap at 10 – don’t reject until hit 10 threads
    • Reject in non-blocking way – queue
    • Fast-fail shared load or fall-back
  • Thread pool
    • Size of thread pool + queue size (typically 0)
    • Slight overhead for extra threads
    • Gives extra safety of being able to release the blocking user instead of waiting on connection
    • Enables interrupting the blocking thread
  • Hystrix Command Object pattern
    • Synchronous execute
    • Asynchronously
    • Circuit open?  Rate limit?
  • Failure options
    • Fail fast instead of backing up – backed up systems do not recover quickly
      • Shed load so can start processing immediately
    • Fail silent
      • Netflix shouldn’t fail for customer just because Netflix can’t talk to Facebook
    • Static fallback
      • Instead of turning feature off... turn it to default state (true, DEFAULT_OBJECT)
      • Fail open instead of failing closed
    • Stubbed fallback
      • Stub parameters with defaults if don’t know
    • Fallback via network
      • Try something else based on similar data
  • Hystrix
    • Each app cluster has a page of Circuit Breakers
    • Shows last 10s
    • Links to historical data
    • Different decisions about


  • Zuul Routing Layer
    • Replaces ELBs + Commercial proxy layer
      • Hated it: 1-2 days to make simple rule change
  • Simple Pre and Post filters on HTTP Request/Response
  • Can add & remove filters at runtime
  • Routing changes
  • Use Cases
    • Want to know all logs for particular User-id across entire fleet – via Turbine
    • Canary vs Baseline
      • Launch 2 clusters
      • Run through 1 peak cycle
    • Squeeze Testing
      • Impossible without Zuul
      • RPS on a particular instance (Math in a filter)
      • Increment in 5rps increments
      • Test every binary to see what its breaking point is
      • How many machines will we need in prod?
      • Is this change inefficient?
      • Auto-scaling parameters?
      • Test can’t come close to prod load – different load
      • Really hard to simulate true load, cache hits
      • Don’t bother with Load Tests
      • Squeeze tests as part of prod
      • Acceptable break?
        • Client will typically retry (or get a “Try again” prompt)
        • Retry will probably hit a different box (out of 100s)
      • Rules are tested in a Zuul canary
        • Prod Zuul cluster – small Zuul cluster
        • Activate on main cluster after tested
    • Coalmine
      • Long-term canary cluster
      • Java agents with byte code manipulation
      • Intercept network traffic
      • Watch a particular binary
      • Raise alarm if see network traffic not isolated by Hystrix
        • E.g. Someone Flips code Not correctly isolated
      • Look in Chronos to see what was changed
    • Production
  • Scryer – predictive auto-scaling
    • Uses last 4 weeks for any particular day of week
    • Creates an auto-scaling plan for ASG
    • Move min-floor up and down, leave max high
    • 5% buffer better than 10-20% buffer for reactive
    • Still have reactive plan to kick in for a safety net (e.g. Snow day)
    • Will keep scaling up even if not receiving predicted traffic – avoids outage when traffic suddenly comes back online


  • Testers do manual testing on UI
  • Engineers have to make their stuff work
  • Code reviews are up to each team culture
  • Risky request feedback on pull request
  • Canary
    • If significant degradation – canary test fails
  • Integration testing?
    • Smoke testing does a lot of the service testing
    • Expected data testing “prod” branch is latest dependency integration
      • Nightly build
      • If fails, integration guys won’t promote it
    • Not scientific
  • A/B Testing – See Jason Brown presentation

Migrating Data Changes

  • Don’t make breaking changes
  • Never attempt to synchronise releases
  • Wait until all consumers are up-to-date
  • Client library facade

Other Notes

  • Build compatibility into client library
  • Client library boils down to just fallback decisions in Hystrix circuit breakers
  • Working on static analysis to check
  • Keep track of which clients call which endpoints – know who’s affected by changes

Performance & Innovation

Ben Christensen

Suro Event Pipeline

  • Cloud native, dynamic, configurable offline and realtime data sinks
  • Open-sourced yesterday
  • S3 demultiplexing -> Hadoop -> BI
  • Kafka -> Druid and ES
  • Druid = realtime data cube
    • Not event processing
    • Counting on multiple dimensions
  • Can plug Storm into Kafka for event processing
  • ES for searching events


Adrian Cockcroft 

Incident management

  • PagerDuty alerts are automated
  • Incident creation is not automated

Cassandra at Scale

  • Boundary.com – network flow analysis
    • Small number of nodes for free

Failure Modes and Effects

Auto-scaling saves up to 70%

Janitor monkey

  • Uses Edda to find things not used

Compare TCO

  • Place
  • Power
  • Pipes
  • People
  • Patterns
    • Managing, overhead, tooling
  • Jevons Paradox
    • If you make something more efficient, people will consume more of it – more than the efficiency gains
    • Amazon incent their sales reps to help you save money

Size for the amount of RAM you need then scale horizontally

Bonus notes from sidechat about "Consumer-Driven Contracts" with Sam Newman

Recommended Resources for testing Consumer-Driven Contracts?

Other Tips

  • Rely more on monitoring in prod
    • Synthetic Transactions
    • A/B Testing
  • Releasing apps together is the wrong way to go
    • It’s a smell that your apps are becoming tightly coupled
    • Ends up tying you back to a monolithic structure