Hadoop & Big Data
Assalamualaikum everyone. Before raya, En Azrul asked what software is used in parallel computing. In the Network Security WhatsApp group, someone said the software used is Hadoop, so I did some small research about Hadoop.
What is Hadoop?
- At Google, MapReduce operations are run on a special file system called the Google File System (GFS), which is highly optimized for this purpose.
- GFS is not open source.
- Doug Cutting and others at Yahoo! reverse-engineered GFS and called their version the Hadoop Distributed File System (HDFS).
- The software framework that supports HDFS, MapReduce and other related entities is called the Hadoop project, or simply Hadoop (a small HDFS example follows this list).
- This is open source and distributed by Apache.
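To make the HDFS part concrete, here is a minimal sketch in Java using Hadoop's FileSystem API to write a small file into HDFS and read it back. The NameNode address hdfs://localhost:9000 and the path /tmp/hello.txt are assumptions for a local test setup, not values from the notes above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            // Assumed NameNode address for a local single-node setup
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/hello.txt"); // hypothetical test path

            // Write a small file into HDFS (overwrite if it already exists)
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hadoop");
            }

            // Read it back and print the contents
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }

            fs.close();
        }
    }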
Typical Hadoop Cluster
- 40 nodes/rack, 1000-4000 nodes in cluster
- 1 Gbps bandwidth within rack, 8 Gbps out of rack
- Node specs (Yahoo terasort): 8 x 2GHz cores, 8 GB RAM, 4 disks
Challenges
- Cheap nodes fail, especially if you have many
  – Mean time between failures for 1 node = 3 years
  – Mean time between failures for 1000 nodes = 1 day (3 years is roughly 1,000 days, so with 1,000 nodes you can expect about one failure per day)
  – Solution: Build fault-tolerance into the system
- Commodity network = low bandwidth
  – Solution: Push computation to the data
- Programming distributed systems is hard
  – Solution: Data-parallel programming model: users write "map" & "reduce" functions, and the system distributes the work and handles faults (see the word-count sketch below)
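Since the notes say users only write "map" and "reduce" functions, the classic word-count job gives a feel for what that looks like. This is a sketch using the standard Hadoop MapReduce Java API: the mapper emits (word, 1) pairs, the reducer sums them, and the framework takes care of splitting the input, shuffling, and retrying failed tasks. The input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

It would be run with something like: hadoop jar wordcount.jar WordCount /input /output, where /input and /output are placeholder HDFS paths.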