30-03-2011, 09:51 AM
Who Am I?
Hadoop Developer
– Core contributor since Hadoop’s infancy
– Project Lead for Hadoop Distributed File System
• Facebook (Hadoop, Hive, Scribe)
• Yahoo! (Hadoop in Yahoo Search)
• Veritas (San Point Direct, Veritas File System)
• IBM Transarc (Andrew File System)
• UW Computer Science Alumni (Condor Project)
Hadoop, Why?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application.
• Nodes fail every day
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
• Need common infrastructure
– Efficient, reliable, Open Source Apache License
• The above goals are same as Condor, but
– Workloads are IO bound and not CPU bound
Hive, Why?
• Need a Multi Petabyte Warehouse
• Files are insufficient data abstractions
– Need tables, schemas, partitions, indices
• SQL is highly popular
• Need for an open data format
– RDBMS have a closed data format
– Flexible schema
Hive is a Hadoop subproject!
Hadoop & Hive History
• Dec 2004 – Google GFS paper published
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Hadoop becomes a Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000 node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
Who uses Hadoop?
• Amazon/A9
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!
Commodity Hardware
Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for Batch Processing
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth
• User space, runs on heterogeneous OS
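The data-locality goal above can be illustrated with a toy scheduler (illustrative code only, not the Hadoop API): given the DataNodes holding a block's replicas, prefer running the computation on one of those nodes so the data does not have to move.

```python
# Toy illustration of HDFS-style data locality (not the real Hadoop API):
# prefer scheduling a task on a node that already holds a replica of the
# block it reads, so computation moves to the data rather than vice versa.

def pick_task_node(replica_nodes, free_nodes):
    """Return (node, is_local): a free node holding the data if possible,
    otherwise any free node (the data must then travel over the network)."""
    for node in free_nodes:
        if node in replica_nodes:
            return node, True          # data-local assignment
    return (free_nodes[0], False) if free_nodes else (None, False)

replicas = {"node3", "node7", "node9"}     # DataNodes holding the block
node, local = pick_task_node(replicas, ["node1", "node7", "node8"])
```

With the inputs shown, the scheduler picks node7, the only free node that also stores a replica.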
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
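A sketch of the block layout described above, assuming the typical 128 MB block size: a file is split into block-sized ranges, and each block is assigned to multiple DataNodes. The round-robin placement here is a deliberate simplification of HDFS's real rack-aware policy, and all names are illustrative.

```python
# Sketch of HDFS-style block splitting and replica placement (simplified:
# real HDFS placement is rack-aware, this round-robin version is not).

BLOCK_SIZE = 128 * 1024 * 1024   # typical HDFS block size (128 MB)
REPLICATION = 3                  # each block stored on multiple DataNodes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) ranges covering the file, one per block."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` DataNodes, round-robin."""
    return [[datanodes[(b + r) % len(datanodes)] for r in range(replication)]
            for b in range(num_blocks)]

blocks = split_into_blocks(300 * 1024 * 1024)            # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

A 300 MB file yields two full 128 MB blocks plus a 44 MB tail block; an intelligent client would ask the NameNode for `placement` and then read each block directly from one of its DataNodes.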
NameNode Metadata
• Metadata in Memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of Metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g., creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
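The design above can be sketched as a minimal in-memory namespace with a transaction log (toy code, not Hadoop's actual classes): all metadata lives in RAM, every mutation is appended to a log, and the namespace can be rebuilt by replaying that log after a restart.

```python
# Toy NameNode-style namespace: all metadata in memory, mutations logged
# so the in-memory state can be reconstructed from the transaction log.

class Namespace:
    def __init__(self):
        self.files = {}   # path -> {"blocks": [...], "replication": int}
        self.log = []     # transaction log of namespace mutations

    def create(self, path, replication=3):
        self.files[path] = {"blocks": [], "replication": replication}
        self.log.append(("create", path, replication))

    def delete(self, path):
        self.files.pop(path, None)
        self.log.append(("delete", path))

    @classmethod
    def replay(cls, log):
        """Rebuild the in-memory namespace by replaying the log."""
        ns = cls()
        for entry in log:
            if entry[0] == "create":
                ns.files[entry[1]] = {"blocks": [], "replication": entry[2]}
            elif entry[0] == "delete":
                ns.files.pop(entry[1], None)
        return ns

ns = Namespace()
ns.create("/logs/a.txt")
ns.create("/logs/b.txt")
ns.delete("/logs/a.txt")
recovered = Namespace.replay(ns.log)   # only /logs/b.txt survives
```

Because nothing is demand-paged, lookups are pure dictionary accesses; the log is what makes the purely in-memory design durable.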