Cloudera Hadoop
Notes from http://www.meetup.com/qldjvm/events/58332582/
Hadoop
· Scalable
· Fault tolerant
· Open source
HDFS + MapReduce
HDFS: like a normal filesystem, but distributed (highly scalable)
MapReduce: compute framework – can process the data where it resides (direct-attached storage at individual nodes)
Concepts
· Apps are written in high-level code
· Shared-nothing architecture (between nodes)
· Data is spread among machines in advance
· Schema-on-read
· SerDe (Serialiser/Deserialiser)
DB: reads are fast, standards
Hadoop: loads are fast, flexibility
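Schema-on-read is why loads are fast: the stored bytes stay raw, and structure is imposed only when a job reads them. A tiny Java illustration (the tab-separated log format here is invented for the example):

    public class SchemaOnRead {
        // The file holds raw text; the "schema" lives in the reader.
        // A different job could apply a completely different schema
        // to the same bytes.
        static String[] parse(String rawLine) {
            return rawLine.split("\t", 3); // [timestamp, level, message]
        }

        public static void main(String[] args) {
            String raw = "2012-05-01T10:15:00\tINFO\tuser logged in";
            String[] fields = parse(raw);
            System.out.println("level = " + fields[1]);
        }
    }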
Enterprise DataWarehouse Arch
Sqoop & Flume --> Hadoop
Hive & ODBC
Oozie (workflow coordinator)
Sqoop out to BI reports
HDFS
· NameNode – holds all metadata for HDFS
· Needs to be a highly reliable machine (RAID 10, dual power supplies, dual network cards bonded)
· The more memory the better, 48GB-96GB (depending on # of files/blocks)
· v1.0: if the NameNode disappears, the cluster is down
· Secondary NameNode – checkpointing for the NameNode (same hardware)
· DataNodes
· Hardware will depend on the specific workload
· JBOD (just a bunch of disks): no RAID, no SAN, no virtualisation
· Maximise for IO throughput
· Direct-attached disk
· Data on S3 with EMR instances: throughput is capped by how quickly the instances can access S3
· You pay a price for virtualisation
E.g. DataNode
· 12 x 3TB drives
· 2 x 4-6-core CPUs
· 24-96GB RAM
MapReduce
JobTracker takes jobs submitted to the cluster
TaskTracker
· takes a task and tries to run it on the machine where the data resides (possible ~90% of the time)
· works on multiple files
· one map task per block
· files are split into multiple blocks
· you don't want massive blocks – a 2GB file is split into smaller pieces
· e.g. 128MB chunks
Input is split and fed in parallel to the map tasks, which process the data given to them and write intermediate data to their local storage.
Shuffle and sort (network intensive).
Reduce tasks pick up the intermediate data relevant to their key range from the local storage of the map tasks. They perform the reduce op and write the output to HDFS.
WriteOnceReadMany – e.g. log data, DB snapshots; not for a system of record with multiple writes
Can only append to a file
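A minimal sketch of the write-once/append-only model via Hadoop's Java FileSystem API; the NameNode URI and paths are hypothetical, and append() only works where the cluster has it enabled:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnce {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode URI
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            Path log = new Path("/logs/app/events.log");
            // Write once: create the file and stream records into it
            try (FSDataOutputStream out = fs.create(log)) {
                out.writeBytes("first record\n");
            }
            // No in-place updates; the only mutation HDFS allows is an
            // append (and only if the cluster has appends enabled)
            try (FSDataOutputStream out = fs.append(log)) {
                out.writeBytes("second record\n");
            }
            fs.close();
        }
    }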
Word count over "The cat sat on the mat. The aardvark sat on the sofa.":

Mapping     | Shuffling          | Reducing    | Final Result
The, 1      | Aardvark, 1        | Aardvark, 1 | Aardvark, 1
Cat, 1      | Cat, 1             | Cat, 1      | Cat, 1
Sat, 1      | Mat, 1             | Mat, 1      | Mat, 1
On, 1       | On, [1, 1]         | On, 2       | On, 2
The, 1      | Sat, [1, 1]        | Sat, 2      | Sat, 2
Mat, 1      | Sofa, 1            | Sofa, 1     | Sofa, 1
The, 1      | The, [1, 1, 1, 1]  | The, 4      | The, 4
Aardvark, 1 |                    |             |
Sat, 1      |                    |             |
On, 1       |                    |             |
The, 1      |                    |             |
Sofa, 1     |                    |             |
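The word count above maps directly onto the Hadoop Java API. A minimal sketch (class names are illustrative): each map task tokenises its split, the framework shuffles and sorts by key, and the reducer sums:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (word, 1) for every token in the input split
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer tok = new StringTokenizer(value.toString());
                while (tok.hasMoreTokens()) {
                    word.set(tok.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: the shuffle has grouped the 1s by word; sum them
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }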
Hive
SQL-like language for writing MapReduce jobs
Supports SELECT, JOIN, GROUP BY, etc.
Can support very large datasets by allowing:
· Partitioning
· Sampling
· Bucketing
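A sketch of driving Hive from Java over JDBC, assuming a HiveServer on localhost:10000 and a hypothetical logs table; Hive compiles the query into MapReduce jobs behind the scenes:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer-era JDBC driver
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            // Runs as one or more MapReduce jobs on the cluster
            ResultSet rs = stmt.executeQuery(
                "SELECT level, COUNT(*) FROM logs GROUP BY level");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }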
Sqoop (SQL to Hadoop)
· Pull data in, push data out
· Connectors
· Generates an MR job to do a parallel import from, or export to, an RDBMS
Flume
Brings in log (or other) data
Multiple agents can talk to other agents
Sources: log files, Twitter, Avro, netcat, exec
Oozie
Workflow/coordination service to manage data processing jobs for Hadoop
Chains jobs together (e.g. run every 15 mins)
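Submitting a workflow from Java, loosely following the Oozie client API; the server URL, app path, and host names are hypothetical:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            // Hypothetical Oozie server and workflow app path
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/my-wf");
            conf.setProperty("jobTracker", "jobtracker-host:8021");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            String jobId = oozie.run(conf); // kicks off the workflow
            System.out.println("Workflow job id: " + jobId);
        }
    }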
Pipes and Streaming
Write native-code MR in C++ (Pipes) or arbitrary scripting languages (Streaming)
Stdin/stdout: map -> reduce -> final result
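Streaming just reads input records on stdin and writes key<TAB>value pairs on stdout, so any executable works. A minimal word-count mapper in Java to illustrate the protocol (in practice you would more likely use a scripting language here):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class StreamingMapper {
        public static void main(String[] args) throws Exception {
            // Hadoop Streaming feeds one input record per line on stdin
            // and expects key<TAB>value pairs on stdout
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) System.out.println(word + "\t1");
                }
            }
        }
    }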
Fuse-DFS
Allows mounting HDFS volumes via the Linux FUSE filesystem
HBase
· Real-time (not batch) column-oriented datastore (modelled on BigTable)
· Handles billions of rows, petabytes of data
· Facebook: 1 million writes/s for likes, 1.6 million/s for insights (log data)
· Profile data sits in MySQL with memcache on top, moving to HBase
· Gave up on developing Cassandra (eventual consistency suits static/write-only data better) in favour of HBase
· Mobile devices
· Strong consistency
· Not an RDBMS: no joins, no indexes, no SQL
· Columns can be added on the fly and can store any kind of data
· Keeps 3 versions of column cells; write-ahead log; sequential writes/reads -> an append-type model
· Jonathan Gray's HadoopWorld videos
· Access is by row key only (no secondary indexes), otherwise you scan the whole dataset – see the sketch below
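A minimal sketch of the era's Java client API (the table, column family, and row key are hypothetical): columns are named on the fly at write time, and reads go by row key:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "metrics");          // hypothetical table

            // Columns can be added on the fly: just name a new qualifier
            Put put = new Put(Bytes.toBytes("host42-20120501")); // row key
            put.add(Bytes.toBytes("d"),                          // column family
                    Bytes.toBytes("cpu_user"),                   // qualifier
                    Bytes.toBytes("0.37"));
            table.put(put);                                      // goes via the write-ahead log

            // Reads are by row key; anything else means a scan
            Result r = table.get(new Get(Bytes.toBytes("host42-20120501")));
            System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("d"), Bytes.toBytes("cpu_user"))));
            table.close();
        }
    }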
CDH
Cloudera's Distribution including Apache Hadoop – an enterprise-ready distribution of Hadoop
Pig
Dataflow language for MapReduce
OpenTSDB (time series DB) http://opentsdb.net/
Interesting example – metrics for 1000s of machines
YCSB (Yahoo Cloud Serving Benchmark)
· Tests cloud and big-data technologies under realistic loads to show how they will perform, before expensive decisions are made on the strength of little data
· Trigger-like “coprocessors” attached to data nodes