Saturday 11 October 2014

Hadoop & Big Data

Assalamualaikum everyone. Before raya, En Azrul asked what software is used in parallel computing. In the Network Security WhatsApp group, someone said the software used is Hadoop, so I did a small research about Hadoop.
What is Hadoop?
  • At Google, MapReduce operations are run on a special file system called the Google File System (GFS), which is highly optimized for this purpose.
  • GFS is not open source.
  • Doug Cutting and others at Yahoo! built an open-source implementation of the same ideas, based on Google's published papers, and called it the Hadoop Distributed File System (HDFS).
  • The software framework that supports HDFS, MapReduce and other related components is called the Hadoop project, or simply Hadoop.
  • It is open source and distributed by Apache.
Typical Hadoop Cluster


  • 40 nodes/rack, 1,000-4,000 nodes in a cluster
  • 1 Gbps bandwidth within a rack, 8 Gbps out of the rack
  • Node specs (Yahoo! terasort): 8 x 2 GHz cores, 8 GB RAM, 4 disks

Challenges
  1. Cheap nodes fail, especially if you have many
        Mean time between failures for 1 node = 3 years
        Mean time between failures for 1,000 nodes ≈ 1 day (3 years is roughly 1,000 days, spread across 1,000 nodes)
        Solution: build fault tolerance into the system

  2. Commodity network = low bandwidth
        Solution: push computation to the data instead of moving data to the computation

  3. Programming distributed systems is hard
        Solution: a data-parallel programming model: users write “map” & “reduce” functions, and the system distributes the work and handles faults (see the sketch after this list)
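
To make the “map” and “reduce” idea concrete, here is a minimal sketch of a Hadoop MapReduce job in Java, essentially the classic word-count example from the Hadoop tutorial. The class names (WordCount, TokenizerMapper, IntSumReducer) and the input/output paths are just illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "map" function: emits (word, 1) for every word in its input split
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // "reduce" function: sums all the counts emitted for the same word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job is packaged into a jar and submitted with the hadoop jar command; the framework then splits the input across the cluster, runs map tasks close to the data blocks, and re-runs any task whose node fails, which is exactly how the challenges above are handled.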


