Tuesday, December 31, 2013

How to build Netflix RSS Reader Example on Centos 6 with Amazon AWS

Instructions for installing Netflix RSS Reader on Amazon EC2

In the last week of December 2013, I built and installed the Netflix example RSS Reader application by following the instructions on the Netflix recipes-rss wiki. See also the Netflix Tech Blog for an overview.

There were a lot of dead ends and rabbit warrens involved in this process as there are a lot of components to get up and running, including: infrastructure, networking, JDK, Gradle, source code, Tomcat, Jetty.

Hopefully these step-by-step instructions make it easier for someone.

Two default assumptions aren't immediately obvious, but they make life easier:

All 3 services (Eureka, RSS Middletier, RSS Edge) are installed on the same host - you don't need to create 3 separate machines and network them all together
The RSS Reader example does not require Cassandra but uses in-memory data storage by default (connecting to Cassandra is optional)

The first set of instructions below covers getting it running on a single instance. The follow-on instructions describe how to scale it out to separate clusters of nodes, which is where Turbine/Hystrix gets interesting.

On to the instructions...

Basic Single-Instance Installation

# Instructions prefixed with "###" are dead-ends I went down - may save you time to skip them.

# Create the machine

# Create an EC2 Instance in the AWS console with the following:
#   AMI base image: Centos 6 x86_64 with updates
#   Instance: m1.small (WARNING: t1.micro's 600M RAM is insufficient)
#   Create a security group and save the key. Login to your new instance as root with the downloaded key.



# Setup the networking configuration to allow the services to talk to each other and allow you to browse to them:

# Configure AWS security Group: 
#   Open TCP Input ports: 22, 80, 9090, 9092, 9191, 9192
#   Ideally but optionally expose them only to your IP address instead of the whole world (0.0.0.0/0)
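
The same security group rules can also be added from the command line. This is a minimal sketch, assuming the AWS CLI is installed and configured; the group ID sg-12345678 and the CIDR 203.0.113.10/32 are placeholders, not values from my setup. The function only prints the commands so you can review them before running anything:

```shell
#!/bin/bash
# Print one "aws ec2 authorize-security-group-ingress" command per TCP port.
# Nothing is executed; pipe the output to "bash" once it looks right.
# GROUP_ID and MY_CIDR are placeholders - substitute your own values.
GROUP_ID=${GROUP_ID:-sg-12345678}
MY_CIDR=${MY_CIDR:-203.0.113.10/32}

print_ingress_rules() {
  local port
  for port in "$@"; do
    echo "aws ec2 authorize-security-group-ingress --group-id $GROUP_ID --protocol tcp --port $port --cidr $MY_CIDR"
  done
}

print_ingress_rules 22 80 9090 9092 9191 9192
```

Restricting the CIDR to a single /32 address is the "only my IP" option mentioned above.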

# Configure iptables:
# Flush all existing rules
iptables -F
# Block null recon packets
iptables -A INPUT -p tcp --tcp-flags ALL NONE -j DROP
# Drop new TCP connections that do not start with a SYN packet
iptables -A INPUT -p tcp ! --syn -m state --state NEW -j DROP
# Allow loopback for internal services
iptables -A INPUT -i lo -j ACCEPT
# Open ports 22 (ssh) & 80 (http)
iptables -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
# Open ports 9090 & 9092 for RSS Reader Edge webserver
iptables -A INPUT -p tcp -m tcp --dport 9090 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 9092 -j ACCEPT
# Open ports 9191 & 9192 for RSS Reader Middletier webserver
iptables -A INPUT -p tcp -m tcp --dport 9191 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 9192 -j ACCEPT
# Allow replies to connections that are already established
iptables -I INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Allow all outgoing connections
iptables -P OUTPUT ACCEPT
# Drop everything else
iptables -P INPUT DROP

iptables -L -n
service iptables save
service iptables restart

# For better security, consider leaving port 80 closed and forwarding requests on port 80 to port 8080. Then, in the Tomcat instructions below, we could leave Tomcat on the default 8080 port without requiring "root" user

# iptables -A PREROUTING -t nat -i eth1 -p tcp --dport 80 -j REDIRECT --to-port 8080
# iptables -A INPUT -p tcp -m tcp --dport 8080 -j ACCEPT



# Install JDK

### I shouldn't have done this first step... it only installs the JRE... Gradle needs the JDK
### yum install -y java-1.7.0-openjdk.x86_64

# Install Oracle JDK as per: http://parijatmishra.wordpress.com/2013/03/09/oraclesun-jdk-on-ec2-amazon-linux/
### Remove OpenJDK. Hopefully not required if you didn't do the above "yum install java-1.7.0-openjdk.x86_64"
### rpm --erase --nodeps java-1.7.0-openjdk java-1.7.0-openjdk-devel
yum install -y wget
# There's probably a better way to get the most recent JDK - but these instructions made it easy
wget --no-check-certificate --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2Ftechnetwork%2Fjava%2Fjavase%2Fdownloads%2Fjdk-7u3-download-1501626.html;" http://download.oracle.com/otn-pub/java/jdk/7u25-b15/jdk-7u25-linux-x64.rpm
# wget saves the file with a "?AuthParam=..." suffix that differs for every download, so rename it
mv jdk-7u25-linux-x64.rpm\?AuthParam=* jdk-7u25-linux-x64.rpm
yum install -y jdk-7u25-linux-x64.rpm 

for i in /usr/java/jdk1.7.0_25/bin/* ; do \
 f=$(basename $i); echo $f; \
 sudo alternatives --install /usr/bin/$f $f $i 20000 ; \
 sudo update-alternatives --config $f ; \
done

cd /etc/alternatives
ln -sfn /usr/java/jdk1.7.0_25 java_sdk
cd /usr/lib/jvm
ln -sfn /usr/java/jdk1.7.0_25/jre jre
# JAVA_HOME must be set for Gradle to work
echo "export JAVA_HOME=/usr/java/jdk1.7.0_25" >> ~/.bashrc
. ~/.bashrc


### Install Gradle - may not be required as the Netflix build steps below use self-contained "gradlew" script (which downloads Gradle)
### curl -O "http://downloads.gradle.org/distributions/gradle-1.10-all.zip"
### yum install -y unzip
### cd /opt
### unzip ~/gradle-1.10-all.zip
### cd
### echo "export PATH=$PATH:/opt/gradle-1.10/bin" >> .bashrc
### . ~/.bashrc 


# Build RSS Reader Middletier and Edge webapps

yum install -y git

git clone https://github.com/Netflix/recipes-rss.git
cd recipes-rss
./gradlew clean build

### This was required to fix error "Error compiling file: /tmp//org/apache/jsp/jsp/rss_jsp.java org.apache.jasper.JasperException: PWC6033: Unable to compile class for JSP"
### javac /tmp/org/apache/jsp/jsp/rss_jsp.java -cp /root/recipes-rss/rss-edge/build/libs/rss-edge-0.1.0-SNAPSHOT.jar:/tmp -source 1.5 -target 1.5

# Install Tomcat 6 (didn't bother with Tomcat 7 as it wasn't available in the default Centos 6 yum repo, as per "yum search tomcat")
yum install -y tomcat6
sed -i 's/port=\"8080\"/port=\"80\"/g' /etc/tomcat6/server.xml
# Set TOMCAT_USER to "root"
vim /etc/tomcat6/tomcat6.conf
    # Replace TOMCAT_USER setting with:
    TOMCAT_USER="root"
# End of edit
cd
echo "export TOMCAT_HOME=/usr/share/tomcat6" >> .bashrc
. ~/.bashrc

# Build and deploy Eureka
git clone https://github.com/Netflix/eureka.git
cd eureka/
./gradlew clean build
cp ./eureka-server/build/libs/eureka-server-XXX-SNAPSHOT.war $TOMCAT_HOME/webapps/eureka.war

service tomcat6 start

# Make sure there are no errors
grep "ERROR" /usr/share/tomcat6/logs/catalina.out | less -S

# (takes around 2 minutes to start up; expect some startup errors because Eureka is not running in an established cluster)
# Browse to http://[IP ADDRESS]/eureka/   <-- trailing slash required
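
Rather than refreshing the browser for two minutes, the wait can be scripted. This is a minimal sketch; the /eureka/v2/apps path is an assumption based on the eureka.serviceUrl setting used in the cluster section below, and the retry count and pause are arbitrary. PROBE is a hypothetical hook so the check command can be swapped out:

```shell
#!/bin/bash
# Poll a URL until it answers or the attempts run out.
# PROBE is the command used to test the URL; it defaults to curl and can be
# replaced (e.g. for a dry run). Retry count and pause are assumptions.
PROBE=${PROBE:-curl -sf -o /dev/null}

wait_for_url() {
  local url=$1 attempts=${2:-24} pause=${3:-5} i
  for ((i = 1; i <= attempts; i++)); do
    if $PROBE "$url"; then
      echo "up after $i attempt(s): $url"
      return 0
    fi
    sleep "$pause"
  done
  echo "gave up after $attempts attempts: $url" >&2
  return 1
}

# Example (24 attempts x 5s covers the ~2 minute startup):
#   wait_for_url "http://[IP ADDRESS]/eureka/v2/apps"
```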


# In another terminal session
# Start RSS Middletier Webserver
export APP_ENV=dev
cd recipes-rss
java -Xmx128m -XX:MaxPermSize=32m -jar rss-middletier/build/libs/rss-middletier-*SNAPSHOT.jar

# Test via Admin port: Browse to: http://[IP ADDRESS]:9192


# In another terminal session
# Start RSS Edge Webserver
export APP_ENV=dev
cd recipes-rss
java -Xmx128m -XX:MaxPermSize=32m -jar rss-edge/build/libs/rss-edge-*SNAPSHOT.jar

# Test via Admin port: Browse to: http://[IP ADDRESS]:9092
# Browse to http://[IP ADDRESS]:9090/jsp/rss.jsp
# Add the following RSS feeds:
# http://rss.cnn.com/rss/edition.rss
# http://feeds.washingtonpost.com/rss/politics
# http://news.yahoo.com/rss/us
# http://rss.cnn.com/rss/money_autos.rss


# Optional Extras...


# Install Hystrix: https://github.com/Netflix/recipes-rss/wiki/Hystrix-Metrics-%28Optional%29

# Open port 7979 in AWS Security Group

iptables -A INPUT -p tcp -m tcp --dport 7979 -j ACCEPT
iptables -L -n
service iptables save
service iptables restart

git clone https://github.com/Netflix/Hystrix.git
cd Hystrix/hystrix-dashboard
../gradlew jettyRun

# Browse to: http://[IP ADDRESS]:7979/hystrix-dashboard
# Enter http://[IP ADDRESS]:9090/hystrix.stream to see the Hystrix metrics show up in the dashboard. You will have to send a few transactions from the Edge service to have the Hystrix metrics loaded.

# Stress the Edge webserver and watch the circuit trip into "Open" state:
for i in {1..100}; do curl -s -o /dev/null -w "%{http_code} %{url_effective}\\n" "http://[IP ADDRESS]:9090/jsp/rss.jsp" & done




# Hystrix Example application: https://github.com/Netflix/Hystrix/tree/master/hystrix-examples-webapp

# Open port 8989 in AWS Security Group

iptables -A INPUT -p tcp -m tcp --dport 8989 -j ACCEPT
iptables -L -n
service iptables save
service iptables restart

cd Hystrix/hystrix-examples-webapp
../gradlew jettyRun

# Browse to: http://[IP ADDRESS]:8989/hystrix-examples-webapp

# View it on Hystrix dashboard
# Browse to: http://[IP_ADDRESS]:7979/hystrix-dashboard/
# Enter: http://[IP_ADDRESS]:8989/hystrix-examples-webapp/hystrix.stream
# Click "Monitor Stream"

# See the metrics change with this in one window:
curl [IP_ADDRESS]:8989/hystrix-examples-webapp/hystrix.stream
# And this in another window:
while true ; do curl "[IP_ADDRESS]:8989/hystrix-examples-webapp/"; done   # <-- The trailing "/" is required



# Install Turbine: https://github.com/Netflix/Hystrix/wiki/Dashboard
curl -L -O https://github.com/downloads/Netflix/Turbine/turbine-web-1.0.0.war
cp turbine-web-1.0.0.war $TOMCAT_HOME/webapps/turbine.war

# Configure Turbine (using Archaius)
vi /root/rss-edge-turbine.properties
    # From https://github.com/Netflix/Hystrix/wiki/Dashboard
    # Hystrix stream for RSS Edge webapp
    turbine.ConfigPropertyBasedDiscovery.default.instances=localhost
    turbine.instanceUrlSuffix=:9090/hystrix.stream
# End edit

# Add Archaius config to Tomcat "archaius.configurationSource.additionalUrls"
vi /etc/tomcat6/tomcat6.conf
    # Archaius properties for Turbine and Netflix
    JAVA_OPTS="${JAVA_OPTS} -Darchaius.configurationSource.additionalUrls=file:///root/rss-edge-turbine.properties"
# End edit


service tomcat6 restart
# Should see this in the logs: "URLs to be used as dynamic configuration source: [file:/root/rss-edge-turbine.properties]"
grep rss-edge /usr/share/tomcat6/logs/catalina.out

# Browse to: http://[IP_ADDRESS]:7979/hystrix-dashboard/
# Enter: http://[IP_ADDRESS]/turbine/turbine.stream
# Click "Monitor Stream"


# Install hystrix-dashboard in Tomcat
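curl -L -o hystrix-dashboard-1.3.8.war "http://search.maven.org/remotecontent?filepath=com/netflix/hystrix/hystrix-dashboard/1.3.8/hystrix-dashboard-1.3.8.war"
cp hystrix-dashboard-1.3.8.war $TOMCAT_HOME/webapps/hystrix-dashboard.war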

# Install hystrix-examples-webapp in Tomcat
cd Hystrix/hystrix-examples-webapp
../gradlew build
cp build/libs/hystrix-examples-webapp-1.3.9-SNAPSHOT.war /usr/share/tomcat6/webapps/hystrix-examples-webapp.war

Cluster Install

Turbine really only starts to shine once there are clusters involved.

Let's set up a cluster of dedicated Edge and Middletier instances, a dedicated Eureka instance (on Tomcat together with Hystrix Dashboard + Turbine) and with ELB in front of the Edge cluster. Something like this:

        Internet                           Internet
            |                                  |
------------|----------------------------------|---------------------------------------------
            |                                  v
            |                               AWS ELB
            |                                  |
 /----------|------------------------\        /|\
 |          v                        |       v v v
 |  Hystrix Dashboard <-- Turbine <----- RSS Edge (x3)
 |          |                 ^              ^ ^ ^
 |          |                /                \|/
 |        Eureka   <--------                   |
 |                          ^        |         |
 \--------------------------|--------/        /|\
                            |                v v v
                            ---------- RSS Middletier (x3)

# To make things easier, before we scale out we'll set up some convenience
# hostnames and scripts - in reality we'd come back and automate this
# properly later on with Puppet/Chef/Ansible/Salt/Baked-into-image

# Add an entry with the Private IP address of the server in /etc/hosts e.g.:
127.0.0.1       eureka

vim "recipes-rss/rss-edge/src/main/resources/edge.properties"
    eureka.serviceUrl.default=http://eureka/eureka/v2/
# End edit

vim "recipes-rss/rss-middletier/src/main/resources/middletier.properties"
    eureka.serviceUrl.default=http://eureka/eureka/v2/
# End edit

# Rebuild both jars
cd /root/recipes-rss && ./gradlew build

# Create the following scripts in root's homedir:

# change_eureka_host.sh
#!/bin/bash
if [ $# -lt 1 ]; then
    echo "Usage: $0 IP_ADDRESS"
    exit 1
fi
sed -i "s/.*eureka/$1    eureka/g" /etc/hosts

# start_rss-edge.sh
#!/bin/bash
cd /root/recipes-rss
nohup java -Xmx128m -XX:MaxPermSize=32m -jar rss-edge/build/libs/rss-edge-*SNAPSHOT.jar &
echo "Output is logged to /root/recipes-rss/logs/rss-edge.log"

# start_rss-middletier.sh
#!/bin/bash
cd /root/recipes-rss
nohup java -Xmx128m -XX:MaxPermSize=32m -jar rss-middletier/build/libs/rss-middletier-*SNAPSHOT.jar &
echo "Output is logged to /root/recipes-rss/logs/rss-middletier.log"


# tail_rss-edge.sh
#!/bin/bash
tail -100f /root/recipes-rss/logs/rss-edge.log

# tail_rss-middletier.sh
#!/bin/bash
tail -100f /root/recipes-rss/logs/rss-middletier.log


# Save the instance as an AMI
# Now create 6 more instances based off this AMI - they can all be t1.micro's (in fact we can reprovision our main node as a t1.micro and only run Tomcat on it for Eureka, Hystrix Dashboard, & Turbine)
# Optional: Name them in the EC2 console as rss-edge-N, rss-middletier-N, rss-eureka
# Create 2 more security groups: rss-edge & rss-middletier
# Ensure 'rss-edge' security group has ports 9090 & 9092 open
# Ensure 'rss-middletier' security group has ports 9191 & 9192 open
# Create a Load Balancer named 'RssEdgeLoadBalancer' containing all 3 rss-edge-* nodes.
# Attach ports 9090 & 9092 to the same ports.
# Optional: add a health check on:
#   Protocol: HTTP
#   Port: 9092
#   Path: /adminres/webadmin/index.html
#   Timeout: 5s
#   Interval: 0.5min
#   Unhealthy Threshold: 2
# Optional: Add cloudwatch alarm in Monitoring tab
#   UnHealthyHostCount >= 1 for 1 minute
#   Send message to topic "NotifyMe"
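
The health check can also be set with the AWS CLI instead of the console. This is a sketch under assumptions: it targets a classic ELB, the CLI is already configured, and HealthyThreshold=2 is a value I picked because the settings above only give the unhealthy threshold. The function prints the command rather than running it:

```shell
#!/bin/bash
# Print the AWS CLI call for the health check described above.
# Nothing is executed. HealthyThreshold=2 is an assumed value; the notes
# above only specify the unhealthy threshold.
print_health_check() {
  local lb_name=$1
  local target="HTTP:9092/adminres/webadmin/index.html"
  echo "aws elb configure-health-check --load-balancer-name $lb_name" \
       "--health-check Target=$target,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2"
}

print_health_check RssEdgeLoadBalancer
```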

# On the Eureka node add the Private IP addresses of the nodes to /etc/hosts

# e.g.:
127.0.0.1       eureka
172.31.111.333  rss-edge-1
172.31.111.444  rss-edge-2
172.31.111.555  rss-edge-3
172.31.111.666  rss-middletier-1
172.31.111.777  rss-middletier-2
172.31.111.888  rss-middletier-3

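# On the Eureka node, create all-turbine.properties. Tomcat only reads the file named in
# archaius.configurationSource.additionalUrls (set to rss-edge-turbine.properties above), so point that setting at this file too.
vi /root/all-turbine.properties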
    turbine.ConfigPropertyBasedDiscovery.rss-edge.instances=rss-edge-1,rss-edge-2,rss-edge-3
# End edit


# Create a script to load test the Edge nodes

# hammer_rss_vip.sh
#!/bin/bash

# Usage: hammer_rss_vip.sh [pause_seconds] 
#   e.g. hammer_rss_vip.sh 0.1   - will pause for 100ms between requests
#   e.g. hammer_rss_vip.sh       - no pause: fire requests as fast as we can fork processes in the background

RSS_EDGE_VIP=rssedgeloadbalancer-NNNNNNNNNN.xx-region-N.elb.amazonaws.com
DELAY=${1-0}

while true ; do
  curl -s -o /dev/null http://$RSS_EDGE_VIP:9090/jsp/rss.jsp &
  sleep $DELAY
done

# Now watch Turbine via the Hystrix Dashboard while you load test the Edge via the ELB and see the difference for different timings
bash hammer_rss_vip.sh 1
bash hammer_rss_vip.sh 0.1
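
To compare timings by more than eyeballing the dashboard, it helps to count the HTTP status codes that come back. This is a minimal sketch and not part of the original scripts: the function names are made up, and unlike hammer_rss_vip.sh it sends requests one at a time rather than forking them into the background.

```shell
#!/bin/bash
# Fire a fixed number of requests and summarise the HTTP status codes.
# Usage: tally_requests URL [count] [pause_seconds]
tally_codes() {
  # Reads one status code per line on stdin, prints "count code" pairs.
  sort | uniq -c | awk '{print $1, $2}'
}

tally_requests() {
  local url=$1 count=${2:-100} pause=${3:-0} i
  for ((i = 0; i < count; i++)); do
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
    sleep "$pause"
  done | tally_codes
}

# Example:
#   tally_requests "http://$RSS_EDGE_VIP:9090/jsp/rss.jsp" 200 0.1
```

A rising share of non-200 codes as the pause shrinks is the signal to look for alongside the circuit state in Turbine.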

Saturday, December 14, 2013

Important concepts in software

The following is a collection of important concepts all software professionals should understand. For each of them, the moment of understanding is often an epiphany - an "aha" moment when what has up until then either felt good or was second nature now has a label, a vocabulary.

Jeff Atwood lists the first 3: http://www.codinghorror.com/blog/2009/03/five-dollar-programming-words.html
Brandon Byars lists #4 and #5 when explaining #3: http://brandonbyars.com/2008/07/21/orthogonality/
Some good definitions in this SO answer.

Idempotence
The result of doing something more than once is the exact same result as doing it once
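
A small shell illustration of the difference, as a sketch: appending to a file with ">>" is not idempotent (every run adds another copy of the line), whereas the hypothetical helper below leaves the file in the same state however many times it runs.

```shell
#!/bin/bash
# Append a line to a file only if it is not already there, so running the
# function once or many times leaves the file in the same state.
append_once() {
  local line=$1 file=$2
  grep -qxF -- "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Example:
#   append_once 'export JAVA_HOME=/usr/java/jdk1.7.0_25' ~/.bashrc
```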

Immutability

Orthogonality

Nonorthogonal systems are hard to manipulate because it's hard to tweak isolated parts

Composability
Features can be combined. Unix philosophy.

Consistency
Principle of Least Surprise

Determinism
There's a single, well-defined correct value. Important for reasoning about what a program is doing at any point in time. In contrast, imperative concurrency can give different answers, depending on a scheduler or whether you're looking or not, and they can even deadlock.

Complexity: Essential vs Accidental

Temporality
State changes over time. The temporally discrete nature of imperative computation is an accommodation to a particular style of machine, rather than a natural description of behavior itself

Netflix audit Prod rather than controlling releases through central approval process

I asked Adrian Cockcroft & Ben Christensen from Netflix how they control the release cycle of production configuration items (pointing out that each of the levels in the Archaius hierarchy potentially has a separate release cycle). In short, they don't. Rather, they constantly audit production (using Chronos) and keep track of real-time events (what, when, where) to use for ad-hoc search queries when diagnosing problems. To read between the lines, there's more value in getting changes out quickly and having them fail than in holding changes back while waiting for approval and causing bottlenecks in the approval process. Put more effort into testing in prod (monitoring/alerting/compliance) and less effort into release ceremony.

This snippet seems to cover it:

http://techblog.netflix.com/2013/03/python-at-netflix.html
Chronos
We push hard to always increase our speed of innovation, and at the same time reduce the cost of making changes in the environment. In the datacenter days, we forced every production change to be logged in a change control system because the first question everyone asks when looking at an issue is “What changed recently?”. We found a formal change control system didn’t work well for with our culture of freedom and responsibility, so we deprecated a formal change control process for the vast majority of changes in favor of Chronos. Chronos accepts events via a REST interface and allows humans and machines to ask questions like “what happened in the last hour?” or “what software did we deploy in the last day?”. It integrates with our monkeys and Asgard so the vast majority of changes in our environment are automatically reported to it, including event types such as deployments, AB tests, security events, and other automated actions.
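
To make the "what happened in the last hour?" idea concrete, here is a toy event log in shell. It is purely illustrative: the file name, the tab-separated format and both function names are made up, and it says nothing about how Chronos itself stores or serves events.

```shell
#!/bin/bash
# Toy event log in the spirit of "what changed recently?": one line per
# event as "epoch_seconds<TAB>type<TAB>detail". File name and format are
# made up for illustration; this is not the Chronos API.
EVENT_LOG=${EVENT_LOG:-/tmp/events.tsv}

record_event() {
  printf '%s\t%s\t%s\n' "$(date +%s)" "$1" "$2" >> "$EVENT_LOG"
}

events_since() {
  # Print events newer than N seconds ago (default: the last hour).
  local window=${1:-3600} now
  now=$(date +%s)
  awk -F '\t' -v cutoff=$((now - window)) '$1 >= cutoff' "$EVENT_LOG"
}

# Example:
#   record_event deployment "rss-edge 0.1.0-SNAPSHOT"
#   events_since 3600
```

The point is the shape of the workflow: changes are recorded as they happen and queried after the fact, instead of being approved before the fact.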

Wednesday, December 11, 2013

Notes from Netflix OSS Yow Workshop (11/Dec/2013)

Notes taken from this workshop: Patterns for Continuous Delivery, Reactive, High Availability, DevOps & Cloud Native Open Source with NetflixOSS

The workshop was essentially a series of presentations by Adrian Cockcroft and Ben Christensen

Slides

Adrian's Slides:

SpeakerDeck

Ben's Slides:

Application Resilience Engineering and Operations at Netflix [pdf] <-- this is the closest set of slides to the workshop content

Other slides on SpeakerDeck

Diagrams

Netflix Architecture Diagrams

Cloud at Scale

Adrian Cockcroft

Time to market vs Quality

Aggressively go out and assume that things are broken

e.g. Land grab, market disruption, web services

Default assumptions

Always shipping code that is broken

Hardware is broken

Operationally defensive

Need ability to see in realtime

Small change, quick to fix

Able to revert

Cloud Native – a new engineering challenge

Construct a highly agile and highly available service from ephemeral and assumed broken components

Inspiration

·         Release it

·         Bulkhead & circuit-breaker

·         Thinking in Systems – Donella H Meadows

·         Chaotic system

·         Order from Chaos

·         Emergent behaviour is that it shows movies

·         Right feedback loops

·         Not about software – about building feedback loops & rules around things so that they’re stable & predictable

·         Looks like an ants nest – chaotic but order is an emergent property

·         Anti-fragile

·         Drift Into Failure – Sidney Dekker

·         Aircraft industry lessons

·         Ways to avoid

·         Latent failures

·         Netflix outages are unique

·         Byzantine

·         Have enough margin

·         Gradually take fat out of the system – drop dead if you miss a meal

·         Everything is obvious – Duncan J. Watts

·         Avoid untrained people rushing in and pushing buttons

·         The REST API Design Handbook

·         100s of microservices

·         Short book ranting on bad apis

·         Dell Cloud something

·         $3

·         “REST in Practice” also good

·         Continuous Delivery

·         Cloudonomics

·         Cost model

·         Detailed analysis of cloud costs and how it fits together

·         Phoenix Project

How to get to Cloud Native

Freedom and Responsibility for Developers

Decentralize and Automate Ops Activities

Integrate DevOps into the Business Organization

Re-org!

Four transitions

Management: Integrated Roles in a Single Organization

Business, Development, Operations -> BusDevOps

Developers: Denormalized Data – NoSQL

Decentralized, scalable, available, polyglot

Hardest thing to get everyone’s head around

Don't really need transactions anyway

Data checkers run around checking bits of data have correct ids and “foreign keys”

No such thing as consistency

Paranoia covered by backups

Responsibility from Ops to Dev: Continuous Delivery

Decentralized small daily production updates

Push to prod end of every day

Responsibility from Ops to Dev: Agile Infrastructure - Cloud

Hardware in minutes, provisioned directly by developers

Fitting into public scale

·         1,000 – 100,000 instances is ideal for AWS

·         500k instances 3 years ago, 5m today

·         http://bit.ly/awsiprange

Netflix don’t use AWS for

·         SaaS Applications – Pagerduty, Onelogin etc.

·         Content Delivery Service

·         DNS Service

Open Connect Appliance Hardware - Netflix Open Source Content Delivery Service

·         5 engineers got together for 4 months and built their own hardware

·         Build it themselves

·         Give them away

·         $15k

·         Pre-loaded static content

·         Nginx, Bind, Bird bind library

·         BSD

·         UFS+ filesystem

·         Unmount disk if it fails, lower capacity – no striping

·         Hot content on SSD

DNS Service

Route53 missing too many features

Amazon will clean up in the DNS market when they finish Route53

Escaping the Death Spiral

Get out of the way of innovation

Process reduction - aggressively

Hardware: Best of breed, by the hour

If don’t like it – get rid of it

Choices based on scale

E.g. Big scale = US East Region

Build your own DNS

Getting to Cloud Native

Getting started with NetflixOSS Step by Step

Set up AWS Accounts to get the foundation in place

Security and access management setup

Account Management: Asgard to deploy & Ice for cost monitoring

Build Tools: Aminator to automate baking AMIs

Service Registry and Searchable Account History: Eureka & Edda

Conﬁguration Management: Archaius dynamic property system

Data storage: Cassandra, Astyanax, Priam, EVCache

Dynamic traﬃc routing: Denominator, Zuul, Ribbon, Karyon

Availability: Simian Army (Chaos Monkey), Hystrix, Turbine

Developer productivity: Blitz4J, GCViz, Pytheas, RxJava

Big Data: Genie for Hadoop PaaS, Lipstick visualizer for Pig, Suro (logging pipeline)

Sample Apps to get started: RSS Reader, ACME Air (IBM), FluxCapacitor

Flow of Code & Data between AWS Accounts

Auditable Account: Code with dollar signs goes into this account

Archive Account

Test account gets refreshed every weekend of all schemas/data from Prod

Trashes test data

Confidential data is encrypted at rest

Tokenise sensitive data

Cloud LDAP gives you access to Production Account

Auditable Account is different LDAP group, need a reason to access it

Vault account needs a background check

Higher security apps

Monitoring systems inside

Smaller and smaller for higher security

Hyperguard & cloudpassage

Cloud security architect used to be a PCI auditor – could talk to auditors at their level

Had to educate auditors

Archive Account

Versioned

Can’t delete anything from archive account

Delete old copies

PGP encrypted backup copies to Google, last resort DR copy

Immutable logs from Cassandra for full history

Account Security

Protect Accounts

Two factor authentication for primary login

Delegated Minimum Privilege

Create IAM roles for everything

Fine-grained – as-needs basis

Security Groups

Control who can call your services

Every service has a security group with the same name

Have to be in the group to be able to call the service

Managing service ingress permission = customer base

Can ignore services not in my security group

Superset of interactions

Not all customers call the service

Call tree is monitored through other means

Cloud Access Control

SSH Bastion

Sumo Logic

Ssh sudo bastion

Can’t ssh between instances – have to go via bastion which wraps sudo with audit logs

Login is yourself

oq ssh wrapper into machine root or login as regular e.g. default is “dal-prod”

Your user doesn’t exist on machines

E.g. “dal-prod” is 105, “www-prod” is 106

Don’t run anything as root

Register of accounts – Asgard

Failure modes

Datacenter dependency

2 copies of bastion host

Homedir is the same

Scripts that trample in through bastion – bad idea

NFS server died lost all shares

Datacenters keep breaking your cloud

1 service per host

AWS firewall layer is dodgy – creates variance in the network

Fast Start AMIs

AWS Answers

1 Asgard copy per account

Stateless services talk to memcache/Cassandra/RDS

No SQL queries – all REST calls to webservices

Asgard

Grails app

All UI endpoints can be read by adding .json

Eureka

Edda

Timestamped delta cache of service status mongodb/ES back-end

Searchable history of AWS instance, deployment version, etc changes

Every 1min

Janitor monkey cleans up

Eucalyptus = AWS-compatible private cloud – lets you see more underlying infra data e.g. switches

CloudTrail gives you a record of calls made to configure the cloud and who made them

E.g. Machines that blew up last week no longer exist

Very powerful for security / auditing

CMDBs in data center don’t work – this actually works – strong assertions

Archaius

Property Console

Not open source yet

Based on Pytheas

Archaius Library Config Mgmt

Hierarchy of properties

Changes are logged with Chronos

Astyanax

A6x

Son of Hector (brother of Cassandra)

Recipes

Patterns to solve common problems

EVCache

Eccentric (Ephemeral) Volatile Cache

Memcache in each zone

“Dynomite” Cassandra-like layer above memcache

Priam-like sidecar exposes metrics over JMX

Routing Customers to Code

Denominator: DNS for multi-region availability

Manage traffic via multiple DNS providers with Java code (or command-line)

Talks to Ultra, Dyn, Route53, OpenStack

Pluggable

Ultra does Geo split (partitioning)

Switch to Dyn if Ultra breaks

Route53 does switching

If a Region goes down denominator switches LB at Route53 layer

50 endpoints

Talks to Zuul API Router

Zuul – Smart routing

Groovy filters update every 30s

E.g. block Russian addresses

Similar to Mashery, Apigee

Ribbon

Internal LB

Wrapper around HTTP Client

Round robin connections

Backed by Eureka
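
The round-robin part is simple enough to sketch in a few lines of shell. This is a toy illustration only: the host names are made up, the list is fixed rather than fetched from Eureka, and it is not how Ribbon is implemented.

```shell
#!/bin/bash
# Toy round-robin picker over a fixed server list. Real Ribbon gets its
# server list from Eureka; the host names here are made up.
SERVERS=(middletier-1 middletier-2 middletier-3)
NEXT=0

next_server() {
  # Sets CHOSEN to the next server and advances the cursor.
  CHOSEN=${SERVERS[NEXT]}
  NEXT=$(( (NEXT + 1) % ${#SERVERS[@]} ))
}

for _ in 1 2 3 4; do
  next_server
  echo "$CHOSEN"   # middletier-1, middletier-2, middletier-3, middletier-1
done
```

The function sets a global instead of printing its result because a command substitution would run in a subshell and lose the updated cursor.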

Karyon – common server container

Hello world

Embedded status page console

Machine readable

Enables conformity monkey

E.g. Reject if it has versions of libraries

Availability

Torture Monkeys – barrel full of monkeys

Block DNS

Fill up root disk

Unmount ebs

Block access to ec2 apis

CPU busy

Killing all Java processes

Developer productivity

Blitz4J – non-blocking logging

GCViz

Runs off log files

Pytheas – OSS based tooling framework

Powerful - Just a little code in the right place

Scaffolding

Guice, Jersey, FreeMarker, jQuery, DataTables, D3, Bootstrap

BigData & Analytics

Genie - Hadoop jobs

Complex Processing of S3 data

Lipstick – visualisation for Pig queries

Suro – event logging pipeline

Feeds Kafka, Storm, Druid

Alerting

80-100bn events/day

Sample App – RSS Reader

Glisten – Workflow DSL – Amazon Simple Workflow

Scale & Resilience (Resilient API Patterns)

Ben Christensen

Constraints

Client libraries

Service provides a client library

To enable speed of iteration

While we want resources integration, procedure integration works out faster

Mixed Environment

Polyglot

Client libraries

Deal with Logic, Serialisation, Network Request, Deserialisation, Logic

Bulkheading to prevent socket timeouts

Limit the blast radius

Hystrix

Tryable Semaphore

E.g. 3-4 threads will do 50rps

Cap at 10 – don’t reject until hit 10 threads

Reject in non-blocking way – queue

Fast-fail shared load or fall-back

Thread pool

Size of thread pool + queue size (typically 0)

Slight overhead for extra threads

Gives extra safety of being able to release the blocking user instead of waiting on connection

Enables interrupting blocking thread

Hystrix Command Object pattern

Synchronous execute

Asynchronously

Circuit open? Rate limit?

Failure options

Fail fast instead of backing up – backed up systems do not recover quickly

Shed load so can start processing immediately

Fail silent

Netflix shouldn’t fail for customer just because Netflix can’t talk to Facebook

Static fallback

Instead of turning feature off... turn it to default state (true, DEFAULT_OBJECT)

Fail open instead of failing closed

Stubbed fallback

Stub parameters with defaults if don’t know

Fallback via network

Try something else based on similar data
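
The fail-fast and static-fallback options can be sketched in shell. This only illustrates the idea under assumptions: it relies on the coreutils "timeout" command, the function name and fallback value are made up, and it has nothing to do with how Hystrix itself is built.

```shell
#!/bin/bash
# Run a command with a time limit; on failure or timeout print a static
# fallback instead of propagating the error. Illustration of fail fast +
# static fallback only - this is not how Hystrix is implemented.
# Requires the coreutils "timeout" command, which can only run external
# commands (not shell functions).
with_fallback() {
  local limit=$1 fallback=$2 out
  shift 2
  if out=$(timeout "$limit" "$@" 2>/dev/null); then
    printf '%s\n' "$out"
  else
    printf '%s\n' "$fallback"
  fi
}

# Example: a slow dependency is cut off after 1 second and the caller
# gets the default instead of waiting:
#   with_fallback 1 "DEFAULT_OBJECT" sleep 5
```

The caller is released after at most the time limit, which is the property the notes above care about: a backed-up dependency should not back up everything in front of it.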

Hystrix

Each app cluster has a page of Circuit Breakers

Shows last 10s

Links to historical data

Different decisions about

Deployment

Zuul Routing Layer

Replaces ELBs + Commercial proxy layer

Hated it: 1-2 days to make simple rule change

Simple Pre and Post filters on HTTP Request/Response

Can add & remove filters at runtime

Routing changes

Use Cases

Want to know all logs for particular User-id across entire fleet – via Turbine

Canary vs Baseline

Launch 2 clusters

Run through 1 peak cycle

Squeeze Testing

Impossible without Zuul

RPS on a particular instance (Math in a filter)

Increment in 5rps increments

Test every binary to see what its breaking point is

How many machines will we need in prod?

Is this change inefficient?

Auto-scaling parameters?

Test can’t come close to prod load – different load

Really hard to simulate true load, cache hits

Don’t bother with Load Tests

Squeeze tests as part of prod

Acceptable break?

Client will typically retry (or get a “Try again” prompt)

Retry will probably hit a different box (out of 100s)

Rules are tested in a Zuul canary

Prod Zuul cluster – small Zuul cluster

Activate on main cluster after tested

Coalmine

Long-term canary cluster

Java agents with byte code manipulation

Intercept network traffic

Watch a particular binary

Raise alarm if see network traffic not isolated by Hystrix

E.g. Someone Flips code Not correctly isolated

Look in Chronos to see what was changed

Production

Scryer – predictive auto-scaling

Uses last 4 weeks for any particular day of week

Creates an auto-scaling plan for ASG

Move min-floor up and down, leave max high

5% buffer better than 10-20% buffer for reactive

Still have reactive plan to kick in for a safety net (e.g. Snow day)

Will keep scaling up even if not receiving predicted traffic – avoids outage when traffic suddenly comes back online
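
As a toy illustration of "use the last 4 weeks for this day of the week, then add a small buffer": the sketch below averages four weekly peaks and rounds up after adding a percentage. The real Scryer algorithm was not described in the workshop, so the averaging, the function name and the sample numbers are all my own assumptions.

```shell
#!/bin/bash
# Toy prediction: average the peaks seen on the same weekday over recent
# weeks, add a percentage buffer, round up. Illustration only; the real
# Scryer algorithm is not described in these notes.
# Usage: predicted_min_floor BUFFER_PCT PEAK [PEAK...]
predicted_min_floor() {
  local buffer_pct=$1 sum=0 n=0 peak denom
  shift
  for peak in "$@"; do
    sum=$((sum + peak))
    n=$((n + 1))
  done
  denom=$((n * 100))
  # Integer ceiling of (sum / n) * (100 + buffer_pct) / 100
  echo $(( (sum * (100 + buffer_pct) + denom - 1) / denom ))
}

# Peak instance counts from the last four Tuesdays, 5% buffer:
predicted_min_floor 5 100 110 90 100   # prints 105
```

The result would be used as the ASG minimum for that time slot, with the maximum left high so the reactive policy can still scale beyond it.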

Testing

Testers do manual testing on UI

Engineers have to make their stuff work

Code reviews are up to each team culture

Risky request feedback on pull request

Canary

If significant degradation – canary test fails

Integration testing?

Smoke testing does a lot of the service testing

Expected data testing “prod” branch is latest dependency integration

Nightly build

If fails, integration guys won’t promote it

Not scientific

A/B Testing – See Jason Brown presentation

Migrating Data Changes

Don’t make breaking changes

Never attempt to synchronise releases

Wait until all consumers are up-to-date

Client library facade

Other Notes

Build compatibility into client library

Client library boils down to just fallback decisions in Hystrix circuit breakers

Working on static analysis to check

Keep track of which clients call which endpoints – know who’s affected by changes

Performance & Innovation

Ben Christensen

Suro Event Pipeline

Cloud native, dynamic, configurable offline and realtime data sinks

Open-sourced yesterday

S3 demultiplexing -> Hadoop -> BI

Kafka -> Druid and ES

Druid = realtime data cube

Not event processing

Counting on multiple dimensions

Can plug Storm into Kafka for event processing

ES for searching events
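
"Counting on multiple dimensions" can be illustrated with a small in-memory roll-up. This only shows the idea of a data cube and says nothing about how Druid stores or queries data; the event fields are made up for the example.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Sequence, Tuple

Key = Tuple[Tuple[str, str], ...]


def rollup(events: Iterable[Dict[str, str]], dimensions: Sequence[str]) -> Counter:
    """Count events for every non-empty combination of the given dimensions."""
    counts: Counter = Counter()
    for event in events:
        for size in range(1, len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key: Key = tuple((d, event[d]) for d in dims)
                counts[key] += 1
    return counts
```

Each event only increments counters. No per-event logic runs, which matches the "not event processing" distinction in the notes.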

Availability

Adrian Cockcroft

Incident management

PagerDuty alerts are automated

Incident creation is not automated

Cassandra at Scale

Boundary.com – network flow analysis

Small number of nodes for free

Failure Modes and Effects

Auto-scaling saves up to 70%

Janitor monkey

Uses Edda to find things not used

Compare TCO

Place

Power

Pipes

People

Patterns

Managing, overhead, tooling

Jevons Paradox

If you make something more efficient, people will consume more of it and more people will use it, so total consumption grows by more than the efficiency gains
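
A quick arithmetic check of the paradox, with made-up numbers: the unit cost halves, usage triples, and total spend still rises.

```python
def total_spend(unit_cost: float, units_consumed: int) -> float:
    """Total spend is simply cost per unit times units consumed."""
    return unit_cost * units_consumed


# Hypothetical figures, chosen only to illustrate the effect.
before = total_spend(unit_cost=1.00, units_consumed=100)
after = total_spend(unit_cost=0.50, units_consumed=300)
```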

Amazon incent their sales reps to help you save money

Size for the amount of RAM you need then scale horizontally

Bonus notes from sidechat about "Consumer-Driven Contracts" with Sam Newman

Recommended Resources for testing Consumer-Driven Contracts?

Anything by Ian Robinson (“REST in Practice” author)

http://www.infoq.com/articles/consumer-driven-contracts
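
A minimal sketch of the consumer-driven contract idea: each consumer declares only the fields it relies on, and the provider checks its responses against every consumer's contract before releasing. The consumer names, field names and helper functions are hypothetical, and real contract-testing tools do considerably more than this.

```python
from typing import Any, Dict, List, Mapping


def contract_violations(
    contract: Mapping[str, type], response: Mapping[str, Any]
) -> List[str]:
    """List the ways a response breaks one consumer's expectations."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


def check_provider(
    contracts: Mapping[str, Mapping[str, type]], response: Mapping[str, Any]
) -> Dict[str, List[str]]:
    """Return only the consumers that a given response would break."""
    broken = {}
    for consumer, contract in contracts.items():
        problems = contract_violations(contract, response)
        if problems:
            broken[consumer] = problems
    return broken
```

Extra fields in the response are ignored, so the provider can add data freely. Only removing or retyping a field that some consumer declared shows up as a break.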

Other Tips

Rely more on monitoring in prod

Synthetic Transactions

A/B Testing

Releasing apps together is the wrong way to go

It’s a smell that your apps are becoming tightly coupled

Ends up tying you back to a monolithic structure
