Saturday 11 October 2014

Hadoop & Big Data

Assalamualaikum everyone. Before raya, En Azrul asked what software is used in parallel computing. In the Network Security WhatsApp group, someone said the software used is Hadoop, so I did a small research about Hadoop.
What is Hadoop?
  • At Google, MapReduce operations are run on a special file system called the Google File System (GFS), which is highly optimized for this purpose.
  • GFS is not open source.
  • Doug Cutting and others at Yahoo! built an open-source implementation of the same ideas, based on Google's published papers, and called it the Hadoop Distributed File System (HDFS).
  • The software framework that supports HDFS, MapReduce and other related components is called the Hadoop project, or simply Hadoop.
  • It is open source and distributed by Apache.
Typical Hadoop Cluster


  • 40 nodes/rack, 1,000-4,000 nodes in a cluster
  • 1 Gbps bandwidth within a rack, 8 Gbps out of the rack
  • Node specs (Yahoo! terasort): 8 x 2 GHz cores, 8 GB RAM, 4 disks

Challenges
  1. Cheap nodes fail, especially if you have many
        Mean time between failures for 1 node = 3 years
        Mean time between failures for 1,000 nodes ≈ 1 day (3 years is roughly 1,000 days, spread across 1,000 nodes)
        Solution: build fault tolerance into the system

  2. Commodity network = low bandwidth
        Solution: push computation to the data instead of moving data to the computation

  3. Programming distributed systems is hard
        Solution: a data-parallel programming model: users write “map” & “reduce” functions, and the system distributes the work and handles faults (see the sketch after this list)
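
To make the “map” and “reduce” idea concrete, here is a minimal sketch of a Hadoop MapReduce job in Java, essentially the classic word-count example from the Hadoop tutorial. The class names (WordCount, TokenizerMapper, IntSumReducer) and the input/output paths are just illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "map" function: emits (word, 1) for every word in its input split
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // "reduce" function: sums all the counts emitted for the same word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job is packaged into a jar and submitted with the hadoop jar command; the framework then splits the input across the cluster, runs map tasks close to the data blocks, and re-runs any task whose node fails, which is exactly how the challenges above are handled.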


