Thursday, May 3, 2012

Notes from Cloudera Hadoop talk



Hadoop
·         Scalable
·         Fault tolerant
·         Open source
HDFS + MapReduce
HDFS: like a normal filesystem, but distributed (highly scalable)
MapReduce: Compute framework

Can process the data where it resides
Direct attached storage at individual nodes
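
A quick sketch of what “like a normal filesystem” means in code, using the HDFS Java FileSystem API (the NameNode URI and path here are illustrative, not from the talk):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative NameNode URI; normally picked up from core-site.xml
            conf.set("fs.default.name", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write once (HDFS files are write-once, append-only)...
            Path path = new Path("/user/demo/notes.txt");
            FSDataOutputStream out = fs.create(path);
            out.writeBytes("hello hdfs\n");
            out.close();

            // ...read many
            BufferedReader in =
                    new BufferedReader(new InputStreamReader(fs.open(path)));
            System.out.println(in.readLine());
            in.close();
            fs.close();
        }
    }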

Concepts
·         Apps are written in high-level code
·         Shared-nothing architecture (between nodes)
·         Data is spread among machines in advance

Schema-on-read
SerDe (Serialiser/Deserialiser)

DB: reads are fast, standards-based
Hadoop: loads are fast, flexible

Enterprise Data Warehouse Architecture
Sqoop & Flume --> Hadoop
Hive & ODBC
Oozie (workflow coordinator)
Sqoop out to BI reports

HDFS
·         NameNode – holds all metadata for HDFS
·         Needs to be a highly reliable machine (RAID 10, dual power supplies, dual bonded network cards)
·         The more memory the better: 48GB-96GB (depending on # of files/blocks)
·         v1.0: if the NameNode disappears, the cluster is down
·         Secondary NameNode – checkpointing for the NameNode (same hardware)
·         DataNodes
·         Hardware will depend on the specific workload
·         JBOD (just a bunch of disks), no RAID, no SAN, no virtualisation
·         Maximise for IO throughput
·         Direct attached disk
·         Data on S3 with EMR instances (throughput limited by how quickly the EMR instances can read from S3)
·         Pay a price for virtualisation

E.g. DataNode
·         12 × 3TB drives
·         2 × 4-6 core CPUs
·         24-96GB RAM

MapReduce
JobTracker takes jobs submitted to the cluster
TaskTracker
·         takes the job and tries to run it on the same machine where the data resides (possible ~90% of the time)
·         works on multiple files
·         one map task per block
·         files are split into multiple blocks
·         you don’t want a massive block: a 2GB file is split into smaller pieces
·         128MB chunks (typical HDFS block size)
Input is split and fed in parallel to the map tasks, which process the data given to them and write intermediate data to their local storage
Shuffle and sort (network intensive)
Reduce tasks pick up the data relevant to their key range from the local storage of the map tasks. They perform the reduce op and write the output to HDFS

Write once, read many: e.g. log data, DB snapshots; not for a system of record with multiple writes
Can only append to a file

Word count (input: “The cat sat on the mat. The aardvark sat on the sofa.”):

Mapping           Shuffling           Reducing          Final Result
The, 1            Aardvark, 1         Aardvark, 1       Aardvark, 1
Cat, 1            Cat, 1              Cat, 1            Cat, 1
Sat, 1            Mat, 1              Mat, 1            Mat, 1
On, 1             On [1,1]            On, 2             On, 2
The, 1            Sat [1,1]           Sat, 2            Sat, 2
Mat, 1            Sofa, 1             Sofa, 1           Sofa, 1
The, 1            The [1,1,1,1]       The, 4            The, 4
Aardvark, 1
Sat, 1
On, 1
The, 1
Sofa, 1
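
As a concrete version of the table above, a minimal sketch of the classic word-count job in the Java MapReduce API (class names and tokenisation are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input split
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().toLowerCase().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);   // intermediate data, stored locally
                }
            }
        }

        // Reduce: after the shuffle/sort, sum the 1s for each word
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));  // final result, to HDFS
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The combiner reuses the reducer, which is safe here because summing counts is associative.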


Hive
SQL-like language for writing MapReduce jobs
Supports SELECT, JOIN, GROUP BY, etc.
Can support very large datasets by allowing:
        Partitioning
        Sampling
        Bucketing
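
A sketch of what running a Hive query from Java might look like over JDBC (the driver class, connection URL, and the weblogs table are assumptions, not from the talk):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Hive-era JDBC driver and URL; both are assumptions
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");

            // Hive compiles this into one or more MapReduce jobs;
            // the "weblogs" table is hypothetical
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(
                    "SELECT status, COUNT(*) AS hits " +
                    "FROM weblogs GROUP BY status");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }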

Sqoop (SQL to Hadoop)
·         Pull data in, push data out
·         Connectors
·         Generates an MR job to do a parallel import from, or export to, an RDBMS

Flume
Bring in log (or other) data
Multiple agents can talk to other agents
Sources: log files, Twitter, Avro, netcat, exec

Oozie
Workflow/coordination service to manage data processing jobs for Hadoop
Chain jobs together (e.g. run every 15 mins)

Pipes and Streaming
Write native-code MR in C++ (Pipes) or in arbitrary scripting languages (Streaming)
stdin/stdout: map -> reduce -> final result

Fuse-DFS
Allows mounting HDFS volumes via the Linux FUSE filesystem

HBase
·         Real-time (not batch) column-oriented datastore (modeled on BigTable)
·         Handles billions of rows, petabytes of data
·         Facebook: 1M writes/s for Likes, 1.6M/s for Insights, log data
·         Profile data is in MySQL with memcache on top; moving to HBase
·         Gave up developing Cassandra in favour of HBase (eventual consistency is better suited to static/write-only data)
·         Mobile devices
·         Strong consistency
·         Not RDBMS: No joins, no indexes, no SQL
·         Columns can be added on the fly and store any kind of data
·         Keeps 3 versions of column cells, write-ahead log, sequential writes/reads -> append-type model
·         Jonathan Gray’s HadoopWorld videos
·         Lookups are by row key only (no secondary indexes); anything else means scanning the whole dataset (see the client sketch below)
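
A minimal client sketch against the 2012-era HBase Java API (the users table and profile column family are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Table "users" with column family "profile" is hypothetical
            HTable table = new HTable(conf, "users");

            // Write: columns can be added on the fly within a family
            Put put = new Put(Bytes.toBytes("user#42"));
            put.add(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                    Bytes.toBytes("Ada"));
            table.put(put);

            // Read: lookups go through the row key
            Get get = new Get(Bytes.toBytes("user#42"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("profile"),
                                    Bytes.toBytes("name"))));
            table.close();
        }
    }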

CDH
Cloudera’s Distribution including Apache Hadoop: an enterprise-ready distribution of Hadoop

Pig
Dataflow language for MapReduce

OpenTSDB (time series DB) http://opentsdb.net/
Interesting example: metrics for 1000s of machines

Other
YCSB (Yahoo Cloud Serving Benchmark)
·         Tests cloud and big-data technologies with realistic loads to show how they will perform, before making expensive decisions based on little data

Real-time processing
·         Trigger-like “coprocessors” attached to data nodes (HBase; see the sketch below)
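
A sketch of such a trigger-like hook as an HBase RegionObserver coprocessor, using the 0.92-era API (exact signatures vary by version, and the class here is hypothetical):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;

    // Runs inside the region server before every Put is applied
    public class AuditObserver extends BaseRegionObserver {
        @Override
        public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                           Put put, WALEdit edit, boolean writeToWAL)
                throws IOException {
            // e.g. validate or audit the write before it lands
            System.out.println("prePut for row "
                    + Bytes.toStringBinary(put.getRow()));
        }
    }

Such an observer is registered per table, or cluster-wide via hbase.coprocessor.region.classes in hbase-site.xml.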