Matt Callanan's Blog: 2015

Saturday, October 3, 2015

c3vis: Finding Your Docker Containers on Amazon ECS Clusters

I've just created my first open source project, c3vis (Cloud Container Cluster Visualizer), and released it under the auspices of Expedia's OSS contribution program. It aims to give administrators and developers a simple interface to visualise how Docker containers have been deployed to Amazon's ECS service (for more information on ECS see Intro to AWS ECS and Under the Hood of the Amazon EC2 Container Service).

The Problem

Deploying software as containers promises to solve many problems with regard to interoperability of environments, speed of deployment, and cost reduction. But understanding where our software lives becomes more difficult for both development and operations teams, because it is laborious to find out where the software is now located and how much resource capacity remains for more software. Several ECS console screens must be viewed, and the time required to process this information grows with the amount of software deployed.

c3vis

c3vis aims to give administrators and teams one place to gain rapid insight into the state of where the containers are running and the capacity available for more containers.

How it Works

c3vis is a NodeJS server that retrieves information about the Instances and Tasks for the selected cluster from the ECS API (using the AWS JavaScript SDK). The client processes this information using D3.js to display the EC2 instances in the selected cluster as vertical bars. The Tasks allocated to the instances are represented as stacked boxes indicating their reserved memory. Each unique Task Definition is represented as a different colour, with the legend showing the Task Family name and revision number. Each Task contains one or more containers; the size of the Task box represents the accumulated reserved memory for all containers in the Task.
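The stacking described above can be sketched roughly as follows. This is an illustrative Python sketch, not c3vis's actual code (which is NodeJS and D3.js); the dictionary shapes, the function names and the sample data are all assumptions made up for the example.

```python
from collections import defaultdict


def task_reserved_memory(task):
    """Sum the reserved memory of every container in a task (hypothetical shape)."""
    return sum(container["memory"] for container in task["containers"])


def stack_tasks(tasks):
    """Return {instance: [(task_name, y_offset, height), ...]} for stacked boxes."""
    stacks = defaultdict(list)
    offsets = defaultdict(int)  # running height of each instance's bar
    for task in tasks:
        instance = task["instance"]
        height = task_reserved_memory(task)
        stacks[instance].append((task["name"], offsets[instance], height))
        offsets[instance] += height
    return dict(stacks)


# Made-up sample data: two instances, tasks named "family:revision".
tasks = [
    {"name": "web:3", "instance": "i-aaa", "containers": [{"memory": 256}, {"memory": 64}]},
    {"name": "worker:1", "instance": "i-aaa", "containers": [{"memory": 128}]},
    {"name": "web:3", "instance": "i-bbb", "containers": [{"memory": 256}, {"memory": 64}]},
]
layout = stack_tasks(tasks)
print(layout)
```

Each tuple gives a box's starting offset and height within its instance's bar, which is all a charting library needs to draw the stack.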

Why I Created it

At Expedia we're using Amazon ECS to serve highly available clusters of applications. ECS helps to save costs and speed up deploys (traditionally done with blue-green deploys to clusters of dedicated immutable instances, requiring over-allocation of resources and time-consuming AMI creation). My colleague and I have been automating deploys of microservices to ECS. Diagnosing problems post-deploy is time-consuming: Where was the container installed? Why don't I have enough resources?

Having a single console that summarises ECS clusters visually has been helpful for 1) quickly identifying where to perform further diagnosis, 2) quickly estimating whether the current instance types will suffice as the cluster grows, and 3) explaining concepts to development teams that are on-boarding to ECS.

Future Directions

Here are some enhancement ideas for c3vis:

Multi-region, multi-account support
Represent reserved CPU in addition to memory
Show an exploded view of task with more details:

Show containers within tasks
Show memory breakdown across containers

Sliding time-bar to see historical data for comparison of cluster state
Show container actual memory utilisation vs reserved memory utilisation
Show Service names (toggle-able with Task names)
Support more than 100 instances
Write a plugin system that lets adopters plug in visualisations of their own statistics from their favourite monitoring tool
Cache responses server-side to reduce AWS API calls
Make the data transfer between client and server more efficient - Separate requests for task and instance data and populate graph asynchronously
Pluggable backend system that could support other public or private cloud providers
Provide access to more troubleshooting information (such as docker logs, ECS logs)

Amazon ECS Data Model Limitations

The ECS data model has made getting the information needed for c3vis difficult. I'm hoping to see more two-way relationships between ECS data types:

Mapping from ContainerInstance to Tasks - currently have to traverse all Tasks and group by InstanceId.
Mapping from Task to Service - currently have to traverse all Services and group by Task ARN.
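The two workarounds above amount to building reverse lookups by hand. Here is a minimal Python sketch of the idea, assuming simplified dictionaries rather than real ECS API responses; the "taskArns" and "name" keys on a service and all ARNs below are hypothetical.

```python
from collections import defaultdict


def tasks_by_instance(tasks):
    """Traverse all tasks and group their ARNs by container instance."""
    grouped = defaultdict(list)
    for task in tasks:
        grouped[task["containerInstanceArn"]].append(task["taskArn"])
    return dict(grouped)


def service_by_task(services):
    """Traverse all services and build a task ARN -> service name lookup."""
    lookup = {}
    for service in services:
        for task_arn in service["taskArns"]:  # hypothetical field
            lookup[task_arn] = service["name"]  # hypothetical field
    return lookup


# Made-up identifiers, for illustration only.
tasks = [
    {"taskArn": "task/1", "containerInstanceArn": "instance/a"},
    {"taskArn": "task/2", "containerInstanceArn": "instance/b"},
    {"taskArn": "task/3", "containerInstanceArn": "instance/a"},
]
services = [{"name": "checkout", "taskArns": ["task/1", "task/3"]}]

print(tasks_by_instance(tasks))
print(service_by_task(services))
```

Both lookups cost a full pass over every Task or Service in the cluster, which is why a direct relationship in the data model would help.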

Helping Out

Feedback and pull requests are welcome. c3vis is licensed under Apache 2.0.

https://github.com/ExpediaDotCom/c3vis

Wednesday, June 3, 2015

Updating Definition of Done for Continuous Delivery and DevOps

Agile brought us the idea of the "definition of done" - basically team members must have a shared understanding of what it means for work to be complete (see scrumguides for a good explanation).

But how do Continuous Delivery and DevOps change this, now that we've brought the scope of agile out to the last mile and we're including operations in the picture?

The "Continuous Delivery" book says: "'done' meaning released into production is the ideal situation for a software development project" (page 27)

Steve Ropa on April 30, 2015 says that "no story is done until the code has been written, and the peer review has happened, the tests are all passing, and the hooks for monitoring, performance evaluation and health checks are in as well", "you have to be prepared for when the code meets the real world", and "you have to find out before the customer does if something is not working right".

He suggests expanding stories to incorporate DevOps in their definition of DONE:

Requirements that enable you to understand how the code is doing after deployment.
Estimations based on what it means to have the code in a state that will be successful post-deployment.
Everyone is a developer/tester/operations.

Rex Morrow on May 1, 2015 says we need to update it to also include database changes.

Kris Buytaert on May 2, 2015 says he already updated it in 2013 to mean it's not done until the app is decommissioned (or all end-users are in their graves).

Matthias Marschall has a good definition of what "done" is not:

For developers, committing a piece of code is far from being done. It needs to work in all kinds of weird use cases. And it’s not only QA’s job to find all the bugs. Good developers want to ensure that the new features are not only coded, but tested and ultimately released to their users. Only then the task is really Done.
For sysadmins, having a nice script on your own box is not enough. Every sysadmin needs to make sure it’s possible to re-create each part of the infrastructure at any time. Only when that slick, new script is under version control, written in a way others can understand and modify it, is their task really Done.

Essentially, it's up to each agile team to come up with the definition that makes sense for them. And we need to include Operations in the picture. Our Definition of Done needs to incorporate a feedback loop from production.

Sunday, February 22, 2015

Resources for David Logan's "Tribal Leadership"

I stumbled on David Logan's "Tribal Leadership: Leveraging Natural Groups to Build a Thriving Organization" recently (available on amazon.com). It's an interesting way to categorise stages of culture within teams/organisations and identify how to improve the culture for the mutual benefit of the tribe members.

My quick understanding is that moving team members to a place where they invest in each other instead of looking out for themselves improves the culture of the whole tribe.

Below are some useful resources.

Quick Overview
http://emergentbydesign.com/2012/06/28/a-step-by-step-guide-to-tribal-leadership-part-1-the-five-stages-of-tribal-culture/

In-depth Book Summary
http://wiki.dsoglobal.org/t/library/tribal_leadership/summary

Cheat Sheet on how to "level up"
http://finding-marbles.com/2012/02/29/tribal-leadership-the-1-page-cheat-sheet/

TED Talk from 2009 (16min)
http://www.ted.com/talks/david_logan_on_tribal_leadership

Stage 3 is common among really smart people
Hardest thing is moving from Stage 3 to Stage 4
Stage 4 Example: Zappos
Stage 5 Example: South Africa Reconciliation Council (Desmond Tutu)
Distribution of Employed Tribes:

Stage 1: 2%
Stage 2: 25%
Stage 3: 48%
Stage 4: 22%
Stage 5: 2%

You need to talk all 5 tribe stages because people are at all 5 stages
Tribes can only hear one level above or below where they are - leaders need to nudge them up
Stage 3 to 4: Find someone you don't know and someone else you don't know and connect them - connect them to something greater than themselves

Monday, February 9, 2015

Nathan Marz: Suffering-Oriented Programming

Just read "Suffering-Oriented Programming" - a blog post from Nathan Marz (creator of Storm). The central idea is that you need to understand a problem well before you try to solve it properly - especially before you build generic abstractions to solve multiple use cases. Beyond the ideas of YAGNI and MVP, the post outlines 3 distinct phases that are helpful to think about when setting out to tackle a tricky problem: 1) make it possible, 2) make it beautiful, 3) make it fast.

It has some great takeaway quotes:

don't build technology unless you feel the pain of not having it
reduce risk by ensuring that you're always working on something important
ensure that you are well-versed in a problem space before attempting a large investment.
When encountering a problem domain with which you're unfamiliar, it's a mistake to try to build a "general" or "extensible" solution right off the bat
develop a "map" of the problem space as you explore it by hacking things out
The key to developing the "beautiful" solution is figuring out the simplest set of abstractions that solve the concrete use cases you already have.
Don't try to anticipate use cases you don't actually have or else you'll end up overengineering your solution.
development of beautiful abstractions is similar to statistical regression: you have a set of points on a graph (your use cases) and you're looking for the simplest curve that fits those points (a set of abstractions).
most important... is a relentless focus on refactoring... to prevent accidental complexity from sabotaging the codebase
Use cases are ... worth their weight in gold. The only way to acquire use cases is through gaining experience through hacking.
attempts to make things generic without a deep understanding of the problem domain will lead to complexity and waste. Designs must always be driven by real, tangible use cases.

Sunday, January 18, 2015

Great AWS Tips

Came across some great AWS tips (care of Devops Weekly)

The list is below - go to this website for the full details:

https://wblinks.com/notes/aws-tips-i-wish-id-known-before-i-started/

Application Development

Store no application state on your servers. 
Store extra information in your logs. 
If you need to interact with AWS, use the SDK for your language. 
Have tools to view application logs.

Operations

Disable SSH access to all servers. 
Servers are ephemeral, you don't care about them. You only care about the service as a whole. 
Don't give servers static/elastic IPs. 
Automate everything. 
Everyone gets an IAM account. Never login to the master. 
Get your alerts to become notifications.

Billing

Set up granular billing alerts.

Security

Use EC2 roles, do not give applications an IAM account. 
Assign permissions to groups, not users. 
Set up automated security auditing.
Use CloudTrail to keep an audit log.

S3

Use "-" instead of "." in bucket names for SSL. 
Avoid filesystem mounts (FUSE, etc). 
You don't have to use CloudFront in front of S3 (but it can help).
Use random strings at the start of your keys.

EC2/VPC

Use tags!
Use termination protection for non-auto-scaling instances. Thank me later.
Use a VPC.
Use reserved instances to save big $$$. 
Lock down your security groups.
Don't keep unassociated Elastic IPs.

ELB

Terminate SSL on the load balancer. 
Pre-warm your ELBs if you're expecting heavy traffic.

ElastiCache

Use the configuration endpoints, instead of individual node endpoints.

RDS

Set up event subscriptions for failover.

CloudWatch

Use the CLI tools.
Use the free metrics.
Use custom metrics.
Use detailed monitoring.

Auto-Scaling

Scale down on INSUFFICIENT_DATA as well as ALARM. 
Use ELB health check instead of EC2 health checks. 
Only use the availability zones (AZs) your ELB is configured for. 
Don't use multiple scaling triggers on the same group.

IAM

Use IAM roles.
Users can have multiple API keys.
IAM users can have multi-factor authentication, use it!

Route53

Use ALIAS records.

Elastic MapReduce

Specify a directory on S3 for Hive results.

Miscellaneous Tips

Scale horizontally.
Your application may require changes to work on AWS. 
Always be redundant across availability zones (AZs). 
Be aware of AWS service limits before you deploy. 
Decide on a naming convention early, and stick to it. 
Decide on a key-management strategy from the start. 
Make sure AWS is right for your workload.

Sunday, January 11, 2015

Henrik Kniberg's "Scaling Agile at Spotify"

Henrik Kniberg's done it again. He's taken a working example of complex process implementation and compressed it into a highly informative, easily-watchable and easily-readable format that makes it simple to grasp the concepts and, crucially, to show others and inspire those around you with what's possible in your IT department.

If you've read Henrik's Scrum and XP from the Trenches or Kanban vs Scrum or Lean from the Trenches, you need to see what he's been helping Spotify do with their engineering culture with scaling agile, lean, devops, culture, A/B experiments.

Two-part animated series depicting Spotify's Engineering culture

Part 1:

Part 2:

Scaling Agile at Spotify with Tribes, Squads, Chapters and Guilds

Henrik's blog
PDF: dropbox, non-dropbox

Tuesday, January 6, 2015

Little's Law

Little's Law:

Avg queue time = WIP / Throughput
Delivery rate = WIP / Lead Time
MeanResponseTime = MeanNumberInSystem / MeanThroughput

or

Avg Inventory = Throughput * Avg queue time
Avg Customers in system = Avg arrival rate * avg customer time in system
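A small worked example may make the rearrangements concrete. This is a Python sketch with made-up numbers; the function names are my own and not from any library.

```python
def avg_time_in_system(wip, throughput):
    """Little's Law rearranged: average time = WIP / throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


def delivery_rate(wip, lead_time):
    """Little's Law rearranged: throughput = WIP / lead time."""
    if lead_time <= 0:
        raise ValueError("lead_time must be positive")
    return wip / lead_time


def avg_wip(throughput, avg_time):
    """Little's Law in its usual form: L = lambda * W."""
    return throughput * avg_time


# Hypothetical team: 12 stories in progress, 4 stories finished per week.
weeks_per_story = avg_time_in_system(12, 4)
print(weeks_per_story)        # 3.0 weeks on average per story
print(delivery_rate(12, 3))   # 4.0 stories per week
print(avg_wip(4, 3))          # 12 stories in progress
```

The practical takeaway is visible in the first function: with throughput held constant, halving WIP halves the average time each item spends in the system.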

Chris Young explains how he is using Little’s Law to estimate a project’s delivery rate (5min video):

http://www.infoq.com/presentations/little-law-estimate

Little's Law - The ONE thing you can do to improve process performance (6min video):

https://www.youtube.com/watch?v=lHQZcMRr2n0

Little's Law worked problem (6min video):

https://www.youtube.com/watch?v=h-1Q-uuuQkQ

Wikipedia:

https://en.wikipedia.org/wiki/Little%27s_law
The long-term average number of customers in a stable system L is equal to the long-term average effective arrival rate, λ, multiplied by the (Palm‑)average time a customer spends in the system, W; or expressed algebraically: L = λW.
Although it looks intuitively reasonable, it is quite a remarkable result, as the relationship is "not influenced by the arrival process distribution, the service distribution, the service order, or practically anything else."

Sunday, January 4, 2015

Experimenting with AWS EC2 Container Service

Amazon EC2 Container Service ("ecs" for short) is a Docker cluster management service that runs on top of EC2 instances. There is no additional charge for the service - you pay for the EC2 instances whether you're using them or not. It's early days but it looks like a promising service that should take a lot of the grunt work (networking, security, etc) out of running your own cluster manager such as Kubernetes or Mesos.

ECS is currently in preview - I needed to wait around two weeks to be granted access after signing up here:
https://aws.amazon.com/ecs/

This is a transcript of how I fired up a simple Docker container on ECS using Amazon instructions available on 4/Jan/2015.

Introduction

Watch this video: https://www.youtube.com/watch?v=LE5uBqNp2Ds&t=1m58s
Start from 1m58s to skip the Amazon propaganda and watch the interesting visualisation.

And this video has some good terminology introduction and a live demo: https://www.youtube.com/watch?v=2vJLS8qfhI0&feature=youtu.be&t=11m12s (Slides)

Terminology:

Tasks: A grouping of related containers (e.g. Nginx, Rails app, MySql, Log collector)
Containers
Clusters: A grouping of container instances - a pool of resources for Tasks
Container Instances: An EC2 instance on which Tasks are scheduled. AMI with ecs agent installed

Setting Up

Follow these instructions:
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/get-set-up-for-amazon-ecs.html

This will walk you through the setup for the following:

IAM User
IAM Role
Key Pair
VPC
Security Group
Special copy of AWS CLI that includes "ecs" commands

NOTE: On OS X I needed to:

"brew uninstall awscli" (that removed /usr/local/bin/aws from my path)
And add "export PATH=$PATH:~/.local/lib/aws/bin" to my .bashrc

Creating The Cluster

Follow these instructions: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_GetStarted.html

NOTE: I preferred to create the EC2 instance from the command line (instead of the Launch an Instance with the Amazon ECS AMI instructions):
aws ec2 run-instances --image-id ami-34ddbe5c --count 1 --instance-type t2.small --subnet-id subnet-xxxxxxxx --key-name ecsdemo-keypair --iam-instance-profile Name=ecsdemo-role

... using the subnet-id for my default VPC and the "ecsdemo" keypair and IAM role name I created during the Setting Up phase above.

Then as per the instructions test it out with:

aws ecs list-container-instances

If you see
{
"containerInstanceArns": []
}
... then something has gone wrong and you'll need to terminate your instance and try again.
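If you are scripting this check, the JSON printed by the CLI can be tested directly. Below is a minimal Python sketch that parses a captured output string; the function name is my own and the ARN in the second sample is made up for illustration.

```python
import json


def has_registered_instances(cli_output):
    """Return True if the list-container-instances JSON names at least one instance."""
    response = json.loads(cli_output)
    return len(response.get("containerInstanceArns", [])) > 0


# The failure case shown above: an empty list.
empty_output = '{"containerInstanceArns": []}'
# Made-up ARN, for illustration only.
ok_output = '{"containerInstanceArns": ["arn:aws:ecs:us-east-1:123456789012:container-instance/example"]}'

print(has_registered_instances(empty_output))  # False
print(has_registered_instances(ok_output))     # True
```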

To see more details about your instance:

aws ecs describe-container-instances

Running a Task (Docker process)

As per the instructions, register a Task Definition and start a Task that spins up a single docker container (based on busybox image) that simply sleeps for 6 minutes.

aws ecs register-task-definition --family sleep360 --container-definitions "[{\"environment\":[],\"name\":\"sleep\",\"image\":\"busybox\",\"cpu\":10,\"portMappings\":[],\"entryPoint\":[\"/bin/sh\"],\"memory\":10,\"command\":[\"sleep\",\"360\"],\"essential\":true}]"

aws ecs list-task-definitions
aws ecs run-task --cluster default --task-definition sleep360:1 --count 1
aws ecs list-tasks
aws ecs describe-tasks --tasks 699d5420-1d0d-410e-b105-7e51027b8fd4
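The hand-escaped JSON in the register-task-definition command is easy to get wrong. One option, sketched here in Python, is to build the container definitions as a data structure and let the standard library do the JSON encoding and shell quoting. The values are copied from the command above; the script only prints the command and does not call AWS.

```python
import json
import shlex

# Same container definition as in the command above.
container_definitions = [{
    "environment": [],
    "name": "sleep",
    "image": "busybox",
    "cpu": 10,
    "portMappings": [],
    "entryPoint": ["/bin/sh"],
    "memory": 10,
    "command": ["sleep", "360"],  # 360 seconds = 6 minutes
    "essential": True,
}]

payload = json.dumps(container_definitions)

# shlex.quote wraps the JSON so the shell passes it through as one argument.
command = (
    "aws ecs register-task-definition --family sleep360 "
    "--container-definitions " + shlex.quote(payload)
)
print(command)
```

The printed line can be pasted into a terminal, with single quotes around the JSON instead of backslash-escaped double quotes.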

Log on to your instance and check the docker container is running:

ssh -i ecsdemo-keypair.pem ec2-user@ec2-instance-public-ip

docker ps
You should see:

[ec2-user@ip-ec2-instance-public-ip-name ~]$ docker ps
CONTAINER ID   IMAGE                            COMMAND       CREATED          STATUS          PORTS                        NAMES
ec8a9fca64b0   busybox:buildroot-2014.02        "sleep 360"   3 minutes ago    Up 3 minutes                                 ecs-sleep360-1-sleep-dc8dd4cdfcf593d07d00
58e68cc5bfc3   amazon/amazon-ecs-agent:latest   "/agent"      34 minutes ago   Up 34 minutes   127.0.0.1:51678->51678/tcp   ecs-agent

See more details about your docker container with:

docker inspect ec8a9fca64b0

After 6 minutes of sleeping, the docker process should disappear from the "docker ps" listing.

More Examples

More interesting examples, including Tasks that link together a number of containers, can be found in the videos linked above.