### Big Data Analytics 2014(BIDA-2014)

#### Hadoop Graph Analysis

The amount of images and videos being uploaded to the internet is rapidly in-creasing, with Facebook users uploading over 2.5 billion new photos every month and 100 hours of video every minute however, applications that make use of this data are severely lacking. Current computer vision applications use a small number of input images because of the difficulty is in acquiring computational resources and storage options for large amounts of data.The Hadoop Mapreduce platform provides a system for large and computationally intensive distributed processing , though use of Hadoops system is severely limited by the technical complexities of developing useful applications.Hadoop using an Adaboost-based face detection algorithm (OpenCV) in which the face detection algorithm works on one image at a time, and attempts to detect the existence of faces in the image. Figure 1. Figure 2. Figure 3.

Simple Graph Program Example

 Example 4.1 Find the common edges among a set of graphs(called the graph intersection problem) using Hadoop MapReduce.
( Download source code : Intersect.java )
 Objective Find the common edges among a set of graphs(called the graph intersection problem) using Hadoop MapReduce. Description Finding the intersection among the edges of a set of directed, unweighted graphs is a simple application. In the map phase, we combine the source and destination to form a key and the graph id of the edge forms the value. Since all the values associated with the same key go to the same reducer, we can count the number of edges from distinct graphs sent to a reducer. If it is equal to the number of graphs, the edge is in the intersection of all the graphs. In this problem, we consider the nodes with same ids as the same nodes. Input Graph_idsourcedestination Output The output contains the intersecting edges found among the 3 graphs. Duplicate edges present in the same graph are considered only once.

Apache Giraph

Apache Giraph is an iterative, fault-tolerant in-memory distributed graph processing system which runs on top of Apache Hadoop clusters having thousands of computing nodes. It is capable of running any standard Bulk Synchronous Parallel (BSP) operation over any dataset which can be represented as a graph, and is a loose implementation of Google’s Pregel system. The Giraph library offers MapReduce programs, called GiraphJobs, but no additional services for the Hadoop cluster. Although mappers usually do not communicate with other mappers, Giraph uses MapReduce only during the initialization phase and mapper-to-mapper communication is actually required. Giraph is in use at companies like Facebook and PayPal, for example, to help represent and analyze the billions (or even trillions) of connections across massive datasets. Figure 1.

Giraph Program Example
 Example 4.2 To compute the shortest path between the source and remaining vertices of a graph.
( Download source code : SimpleShortestPathsComputation.java )
 Objective To compute the shortest path between the source and remaining vertices of a graph. Description In the shortest path algorithm, the objective is to mark a particular vertex as the origin, and then determine the travel paths that incur the lowest cost when traveling from that origin to any vertex in the cluster It reads a graph from an input file stored in HDFS. Input [0,0,  [[1,1], [3,3]]] [1,0,  [[0,1],  [2,2],  [3,1]]] [2,0,  [[1,2],  [4,4]]] [3,0,  [[0,3],  [1,1],  [4,4]]] [4,0,  [[3,4],  [2,4]]] Each line above has the format [source_id,source_value,[[dest_id, edge_value],...]]. If this data was read in by Giraph, it would produce a graph like the following: Figure 2. Output 0   1 2   2 1   0 3   1 4   5 (vertex id) (min dist.) Each line above prints Vertex ID along with its Minimum Distance.
Graph computing algorithms example
 Example 4.3 Implement a Parallel Dijkstra's single-source shortest path algorithm(Using Distributed computing MPI ,Hadoop Mapreduce & METIS ) (Assignment)
 Objective You will develop MPI program to compute the shortest path from a source vertex s to all the other vertices in the graph G=(V,E) using Dijkstra's single-source shortest path algorithm on p processors of message passing cluster. Basics : Minimum Spanning Tree (MST) Figure 3. A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G. In a weighted graph, the weight of a sub graph is the sum of the weights of the edges in the subgraph. A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight. Many problems require finding an MST of an undirected graph. For example, the minimum length of cable necessary to connect a set of computers in a network can be determined by finding the MST of the undirected graph containing all the possible connections. If G is not connected, it cannot have a spanning forest. For simplicity in describing the MST algorithm, we assume that G is connected. If G is not connected, we can find its connected components and apply the MST algorithm on them. Alternatively, we can modify the MST algorithm to output a minimum spanning forest.  Description Various algorithms to develop MPI program to compute the shortest path from a source-vertex s to all the other vertices in the graph G=(V,E). We describe Prim's algorithm for finding an MST and Dijkstra's single-source shortest path algorithm. Prim's algorithm   Prim's algorithm for finding an MST is a greedy algorithm. The algorithm begins by selecting an arbitrary starting vertex. It then grows the minimum spanning tree by choosing a new vertex and edge that are guaranteed to be in the minimum spanning tree. The algorithm continues until all the vertices have been selected.  Let G=(V,E,w) be the weighted undirected graph for which the minimum spanning tree is to be found, and let A=(ai,j) be its weighted adjacency matrix. The algorithm uses the set VT to hold the vertices of the minimum spanning tree during its construction. It also uses an array d[1,....,n] in which, for each vertex v=(V-VT), d[v] holds the weight of the edge with the least weight from any vertex in VT to vertex v. Initially, VT contains an arbitrary vertex r that becomes the root of the MST. Furthermore, d[r]=0, and for all v such that v=(V-VT),  d[v] = w(r,v) if such an edge exists; otherwise d[v]= .  During each iteration of the algorithm, a new vertex u is added to VT   such that d[u]=min{d[v]/v = (V-VT)}. After this vertex is added, all the values of d[v] such that v=(V-VT) are updated because there may now be an edge with a smaller weight between vertex v and the newly added vertex u. The algorithm terminates when VT=V. The Figure 1. above illustrates the algorithm. Upon termination of Prim's algorithm, the cost of the minimum spanning tree is Ev=Vd[v]. In program the body of the while loop (lines 10-13) is executed n-1 times. Both the computation of min{d[v]/v  =  (V-VT)} (line 10), and for loop (lines 12 and 13) execute in (n) steps. Thus, the overall complexity of Prim's algorithm is (n2).   Single-Source Shortest Paths: Dijkstra's Algorithm  1.        procedure DIJKSTRA_SINGLE_SOURCE_SP(V, E, w, s)  2.        begin  3.               VT:= (s);  4.                for all v (V - VT) do  5.                       if (s, v) exists set l[v]:= w(s, v);  6.                       else set l[v] = ;  7.                while VT V do  8.                 begin   9.                        find a vertex u such that l[u] = min{l[v]| v (V - VT)};  10.                      VT:= VT {u};  11.                      for all v (V - VT) do  12.                            l[u] = min{l[v], l[u] + w(u, v)};  13.               endwhile  14.      end DIJKSTRA_SINGLE_SOURCE_SP  For a weighed graph G=(V,E,w), the single-source shortest paths problem is to find the shortest paths from a vertex v=V to all other vertices in V. A shortest path from u to v is a mimimum-weight path. Depending on the application, edge weights may represent time,  cost,  penalty,  loss, or any other quantity that accumulates additively along a path to be minimized. In the following section, we present Dijkstra's algorithm, which solves the single-source shortest-paths problem on directed graphs with non-negative weights.  Dijkstra's algorithm, which finds the shortest paths from a single vertex s, is similar to Prim's minimum spanning tree algorithm. Like Prim's algorithm, it incrementally finds the shortest paths from s to the other vertices of G. It is also greedy, that is, it always chooses an edge to a vertex that appears closest. Comparing this algorithm with Prim's minimum spanning tree algorithm, we see that the two are almost identical. The main difference is that, for each vertex u=(V-VT), Dijkstra's algorithm stores l[u], the minimum cost to reach vertex u from vertex s by means of vertex in VT; Prim's algorithm stores d[u], the cost of the minimum-cost edge connecting a vertex in VT to u. The run time of Dijkstra's algorithm is (n2). Parallel formulation  The parallel formulation of Dijkstra's single-source shortest paths algorithm is very similar to the parallel formulation of Prim's algorithm for minimum spanning trees. The weighted adjacency matrix is partitioned using the block- striped mapping. Each of the p processors is assigned n/p consecutive columns of the weighted adjacency matrix, and computes n/p values of the array l.  During each iteration, all processors perform computation and communication of Prim's algorithm.  In this, message-passing program uses collective communication operations will compute the shortest paths from a source-vertex s to all the other vertices in the graph, using Dijkstra's single-source shortest-path algorithm. Dijkstra's algorithm incrementally finds the shortest paths from s to the other vertices inthe graph G=(V,E). At any given time, the vertices V of the graph are partitioned into two sets Vc and Vo. The set Vc contains the vertices whose shortest path from s has already been computed (closed set) and the set Vo contains the remaining vertices (open set). For each vertex v in Vo, the algorithm stores the cost of shortest path d[v] from s to v via only the vertices in Vc. In each iteration of the algorithm, vertex v from Vo is selected such that it has the smallest d[v] value, and is moved to Vc. Then, the shortest-path of the remaining vertices in Vo are updated to reflect the inclusion of v in Vc. At the beginning of the computation, only vertex s belongs in Vc , and the algorithm requires a total of V-1 iterations to compute all the shortest paths.  A parallel version of Dijkstra's algorithm can be devised by performing in parallel both the selection of the vertex to be included in Vc as well as the update of the shortest paths. On a disributed-memory parallel computer, the weighted adjacency-matrix of the graph W is distributed among the processor using a one-dimensional distribution along the columns. That is,  each one of the p processors gets n/p columns of G where n is the number of vertices. Also, every processor stores the corresponding part of the distance array d. That is,  if a processor stores the columns ranging from W(*, i) to W(*,  j), it will also store the distances for the vertices in the range of vi to vj. Our MPI program for Dijkstra's algorithm assigns n/p consecutive columns of W to each processor. Note though that in each iteration, it uses the MPI_MINLOC reduction operation to select the vetex v to be included in Vc.  Remarks  Recall that the MPI_MINLOC operation, for the pairs (a, i) and (a, j) it will return the one that has the smaller index (since both of them have the same value). Consequently, among the vertices that are equal close to the source vertex, it favors the smaller vertices. This may lead to load imbalance problems, especially when a lot of vertices in Vo are at the same minimum distance from the source. In particular, vertices stored in lower-ranked processes will tend to be moved in Vc , as opposed to those stored in higher-ranked processes. Since, both finding the minimum-distance vertex as well as computing the updated distances involve only vertices in Vo , the higher-ranked processes may end-up having more vertices in Vo than the lower-ranked vertices.  One way of correcting this problem is to distribute the columns of w in a cyclic distribution. In this distribution process i gets every pth vertex starting from vertex i. This scheme also assigns n/p vertices to each process but these vertices have indices that span almost the entore graph. Consequently, the preference given to lower-numbered vertices by MPI_MINLOC does not lead to load-imbalance problems.  Input  You are given a graph G(V,E,w) with n (=V) vertices, and E edges with suitable weights w on processor with taskid 0 and source-vertex s.  Output  You should print given graph G(V,E,w) , and the cost of the shortest path from a source-vertex s to all other vertices in the graph G = (V,E) on processor with task id 0.

Example 4.4: Write a Program for Coloring a Sparse Graph(Using distributed computing MPI ,Hadoop Mapreduce & METIS)
(Assignment)

• Objective
• You will develop MPI program to color sparse graph G(V,E) with n vertices on p processors of Message Passing Cluster.

• Definitions
• A directed graph G(V,E) (or a digraph ) consists of a set of vertices V={v1,v2,...}, a set of edges E={e1,e2,...}, and a mapping F that maps every edge onto some ordered pair of vertices (vi,vj). As in the case of undirected graphs, a vertex is represented by a point and edge by a line segment between vi and vj with an arrow directed from vi to vj

Suppose that you are given a graph G(V,E) with n vertices and are asked to paint its vertices such that no two adjacent vertices have the same color. What is the minimum number of colors that you would require ? This constitutes a coloring problem.

Having painted the vertices, you can group them into different sets. One set consisting of all red vertices, another of blue, and so forth. This is a partitioning problem. The coloring and partitioning can of course, be performed on edges or vertices of a graph.

Painting all the vertices of a graph with colors such that no two adjacent vertices have the same color is called the proper coloring (or sometimes simply coloring) of a graph. A graph in which every vertex has been assigned a color according to a proper coloring is called a properly colored graph. Usually a given graph can be properly colored in many different ways. The proper coloring which is of interest to us is one that requires the minimum number of colors.

Proper coloring of a given graph is simple enough, but a proper coloring with the minimum number of colors is, in general, a difficult task.

A proper coloring of a graph naturally induces a partitioning of the vertices into different subsets. A set of vertices in a graph is said to be an independent set of vertices or simply an independent set if no two vertices in the set are adjacent. A single vertex in any graph constitutes an independent set. A maximal independent set (or maximal internally stable set) is an independent set to which no other vertex can be added without destroying its independence property. It is well known that some graphs in general, have many maximal independent sets; and they may be of different sizes. Among all maximal independent sets, one with the largest number of vertices is often of particular interest.

Finding a maximal independent set : A reasonable method of finding a maximal independent set in a graph G(V,E) will be to start with any vertex v of G(V,E) in the set. Add more vertices to the set, selecting at each stage a vertex that is not adjacent to any of those already selected. This procedure will ultimately produce a maximal independent set with a largest number of vertices. The following figures explain the coloring of sparse graph using 3 processors with 3 and 4 colors. • Serial algorithm
• Consider the problem of finding a minimal coloring of an undirected graph G =(V,E) in which each vertex is connected to only few other vertices. Serial graph coloring algorithms are often based on graph traversals. Other class of graph coloring algorithms require independent set of computations.

• Parallel algorithm
• Serial graph coloring algorithms are often based on graph traversals, whereas parallel coloring algorithms are often based on iteratively computing independent set of vertices. Luby's algorithm uses independent set computations to derive message passing programs that performs graph coloring program.

The following figure explains the partition of graph G(V,E) into three subgraphs such that each process has equal number of vertices. We will present one message-passing graph coloring program.  Luby's algorithm consists of a number of steps. During the jth steps, the algorithm selects a maximal independent set Ij of vertices from the uncolored vertices of G. It then, assigns the color j to the vertices in Ij and proceeds to the next phase. Luby's algorithm selects each maximal independent set Ij incrementally by using a randomized algorithm as follows.  A random number is assigned to each uncolored vertex of G. Now if an uncolored vertices has a random number that is smaller than all of the random numbers of its adjacent uncolored vertices, it is then included in Ij. This process is repeated for the vertices of G that are uncolored and not in Ij nor adjacent to vertices in Ij, Ij is augmented similarly. This incremental augmentation of Ij ends when no more vertices can be inserted in Ij.  In particular, the algorithm developed by Luby  is often used to drive message-passing programs that perform graph coloring.

The critical step in Luby's algorithm is the one that computes a maximal independent subset of the uncolored vertices. If we develop a message-passing program for this step, then we can just use it repeatedly to obtain the coloring of the graph. Developing such a message-passing program is quite straight-forward. Consider the graph G =(V,E) with n vertices. The graph is distributed among the p processors such that each processor stores n/p vertices and the corresponding adjacency lists. Let us assume that we are in the jth phase of Luby's algorithm and we seek to compute Ij. Initially, each processor generates a random number for each locally stored uncolored vertex v. Next, every processor needs to check each such vertex v to determine if its random number is the maximum over the uncolored vertices adjacent to v. This can be done easily if the uncolored vertices are assigned to other processors, then these random numbers need to be communicated. Thus, the message-passing program, before proceeding to determine which vertices can be inserted in the independent set Ij, they need to receive the random numbers for all the non-local uncolored vertices that are adjacent subsets of the maximal independent set algorithm.

• Remarks
• However, even though we can develop a message-passing program that exactly implements Luby's algorithm, this program may not necessarily achieve very high performance. One reason is that this program will perform a lot of communication. In particular, a every time we need to find an independent subset of the uncolored vertices we need to perform multiple incremental augmentations. Every such augmentation sub-step leads to communication for the exchange of the generated random numbers. However, each successive augmentation subsets increases the size of the independent subset only by a small factor.

Experiments with Luby's algorithm has shown that if for example 10% of the vertices are included in Ij during the first subsets, a much smaller fraction of the vertices are included in the second subsets, and the number of vertices that are included in subsequent substeps is even smaller. One possible solution is to perform only a single sub-step of the randomized independent subsets algorithm. However, this will undoubtedly increase the number of steps (i.e., independent subsets) required to color the graph; thus, increasing the number of colors. In many applications a coloring with the least number of colors is required making this approach undesirable.

Another limitation of Luby's coloring algorithm compared to serial coloring algorithm is that it requires multiple traversals of the graph. The number of traversals are on the average half the number of colors in the graph . Thus, even if communication overheads are zero, the speedup achieved by Luby's will be quite poor when compared to serial coloring algorithm that require a single traversal of the graph making this approach undesirable.

• Input
• You are given a graph G(V,E) withV vertices and E edges on processor with task id 0. Maximum number of colors available is 16.

• Output
• You should print given graph G(V,E), maximum number of colors required to paint (color) the vertices of graph G, and respective colors for all vertices in the graph G on processor with task id 0.

 Example 4.5 Write a Hadoop Mapreduce Program for coloring graph using METIS.(Assignment)