ABCD-2014 : Hadoop MapReduce

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

This section contains simple hadoop MapReduce examples of wordcount,matrix multiplication,finding maximum temperature,converting small image files to sequential files. That will provide basic idea,how to use MapReduce programing framework in hadoop cluster and about some basic APIs required to write Hadoop MapReduce programe and how key-value mapping done to get desire output

Hadoop MapReduce :   Compiliation & Execution :     Command Line )

Example 1.1 Write a Hadoop MapReduce program for breadth-first search (BFS) using Hadoop MapReduce Framework

( Download source code : : bfs-iterative-map-reduce (WinRAR ZIP archive) )
( Download Description of Problem (PDF) : BFS-Iterative MapReduce -PDF file) )

Example 1.2

Write Iterative MapReduce Program for Frequent Sub-graph Mining
( Download Description of Problem (PDF) :
Subgraph-Mining-MapReduce-PDF file) )

Compilation and Execution

Pre-requisites for Hadoop programs Compilation :

  • JavaTM 1.5.x, preferably from Sun, must be installed.

  • ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

  • Hadoop-1.1.x or 2.2.x must be installed.

Commands for Verification of Hadoop setup :

$hadoop version
Hadoop 2.2.0 Subversion -r 1529768 Compiled by hortonmu on 2013-10-07T06:28Z Compiled with protoc 2.5.0 From source with checksum 79e53ce7994d1628b240f09af91e1af4...

7814 DataNode
8288 NodeManager
15704 Jps
7667 NameNode
8170 ResourceManager
7991 SecondaryNameNode

STEP-1 :
For compilation and execution of MapReduce programs we require Java environment(JVM) and hadoop-2.2.0-core.jar(from Hadoop-1.1.x or 2.2.x)
For compilation of code use following commands

$ javac -classpath < path to hadoop-2.2-core.jar> -d wordcount_classes

where wordcount_classes is a folder that contains all .class file generated from this compilation
STEP-2 :
Converting all 'class' files (created in previous STEP-1) into 'jar' file
For creation of jar file use following commands

$ jar -cvf wordcount.jar -C wordcount_classes/ .

where wordcount.jar is the jar file created after this command execution and serves as executable jar for hadoop.
STEP-3 :
After creating 'jar' file, copy it into HDFS by using following command :
$ hadoop fs -copyFromLocal input.txt wordcount_input

where wordcount_input is the input file created in HDFS that contain your input data.
STEP-4 :
Launching hadoop executable to execute 'jar' file with input file
$ hadoop jar wordcount.jar WordCount wordcount_input wordcount_output

where WordCount is the driver class in wordcount.jar and wordcount_output is the output folder created in HDFS that contain two files
  • _SUCCESS which is basically a success log file

  • part-r-00000 which is output file generated by hadoop framework.
For output(part-r-00000) that present in HDFS execute following command.
$ hadoop fs -cat wordcount_output/part-r-00000

For copying file from HDFS to local Filesystem execute following command.
$ hadoop fs -copyToLocal wordcount_output/part-r-00000 output.txt

where output.txt is the output file created in current directory that contain actual output data.