Friday, March 20, 2015

Playing around with Big Data (Hadoop): Project Crystal Ball



Recently I had an opportunity to learn and play around with Hadoop as a course exercise. I wrote a MapReduce program, designed to run on Hadoop, that computes the relative frequency of word pairs in a document and uses those frequencies to predict the next likely word.
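Concretely, for a word w followed by a word u, the relative frequency is

    f(u | w) = count(w, u) / sum over all u' of count(w, u')

that is, the number of times u appears as a neighbour of w, divided by the total number of neighbour pairs starting with w. To predict the next word after w, pick the u with the highest f(u | w). (The notation here is mine, for exposition.)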

This blog post is a log of what I did.


Steps:
1: Download VirtualBox, import the Hortonworks Sandbox image, and start the sandbox.


----Development environment-------
2: Set up a simple Maven project.
3: Include the following dependencies:
<dependencies>
    <!--For Hadoop Start -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>18.0</version>
    </dependency>
    <!--For Hadoop End -->
</dependencies>


4: Include the Maven assembly plugin, so the build produces a single runnable jar with all dependencies:
<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>edu.mum.crystalBall.stripes.Application</mainClass>
            </manifest>
        </archive>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>
5: Write the Mapper, Reducer and Application classes.
----> I tried 3 approaches to the relative frequency problem; pseudocode for each is attached at the end, and a minimal driver sketch follows below.
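For the Application class, a minimal driver along these lines works with the Hadoop 2.2 API. The mapper/reducer class names here are placeholders (swap in the stripes or hybrid classes as needed), not necessarily the names from my project:

package edu.mum.crystalBall.stripes;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver: wires the mapper and reducer together and takes the
// input and output locations from the command line (see step 12).
public class Application {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crystal-ball");
        job.setJarByClass(Application.class);
        job.setMapperClass(PairsMapper.class);       // hypothetical class names
        job.setReducerClass(PairsReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}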


6: Build the project with the following command:
mvn clean compile assembly:single
---Deploy and run-----
7: Move the jar from /target/ to /usr/lib/hue on the sandbox:
7.1: Using FileZilla, connect to the sandbox as root@localhost on port 2222.
7.2: Copy the jar file from the project's /target/ directory to /usr/lib/hue.
8: Connect to the sandbox from the host machine:
ssh root@localhost -p 2222
9: Change user to hue:
su hue
10: Go to /usr/lib/hue.
11: Create an input file under /user/hue from the web browser: localhost:8000/filebrowser/view/user/hue
12: Execute: hadoop jar {nameofjar} {inputlocation} {outputlocation} (see the example after this list)
13: Browse localhost:8000/jobbrowser/ to monitor the job.

14: The result can be viewed from localhost:8000/filebrowser/view/user/hue
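Putting steps 8 to 12 together, a run looks roughly like this (the jar, input and output names below are placeholders, not the exact names from my project):

ssh root@localhost -p 2222
su hue
cd /usr/lib/hue
hadoop jar crystalBall-jar-with-dependencies.jar /user/hue/input.txt /user/hue/output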


Mapper, Reducer and Application Classes:

Pairs Approach:
Class Mapper{
    method Map(inKey, text){
        for each word w in text
            for each neighbour n of word w
                Pair p = Pair(w, n)
                Emit(p, 1)
                Emit(Pair(w, *), 1)        // marginal: one per neighbour of w
    }
}

Class Reducer{
    s = 0                                  // marginal count for the current word w
    method Reduce(pair p = (w, u); counts [c1; c2; …]){
        count = 0
        for all count c in counts [c1; c2; …] do
            count = count + c
        if u == *                          // (w, *) arrives before every (w, u)
            s = count                      // total number of neighbours of w
        else
            Emit(pair p, count / s)        // relative frequency of u after w
    }
}
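In Java, the pairs mapper and reducer could look roughly like the sketch below. This is my reading of the pseudocode above rather than the attached source; it assumes "neighbour" means the immediately following word, and the class names are placeholders:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsRelativeFrequency {

    public static class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text pair = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] words = value.toString().trim().split("\\s+");
            for (int i = 0; i < words.length - 1; i++) {
                pair.set(words[i] + "\t" + words[i + 1]);  // the pair (w, u)
                context.write(pair, ONE);
                pair.set(words[i] + "\t*");                // the marginal pair (w, *)
                context.write(pair, ONE);
            }
        }
    }

    // Relies on (w, *) sorting before every (w, u): '*' precedes letters in ASCII.
    public static class PairsReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        private double marginal;                           // total neighbour count for current w

        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            if (key.toString().endsWith("\t*")) {
                marginal = sum;
            } else {
                context.write(key, new DoubleWritable(sum / marginal));
            }
        }
    }

    // Partition on the left word only, so (w, *) and all (w, u) pairs
    // reach the same reducer.
    public static class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String left = key.toString().split("\t")[0];
            return (left.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
}

The partitioner matters: with the default hash partitioner, (w, *) and (w, u) could land on different reducers and the marginal would be missing.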

Stripes Approach:
Class Mapper{
    method Map(docid a; doc d){
        for all term w in doc d do
            H = new AssociativeArray
            for all term u in Neighbours(w) do
                H{u} = H{u} + 1
            Emit(term w, stripe H)
    }
}


Class Reducer{
    method Reduce(term w; stripes [H1; H2; H3; …]){
        Hf = new AssociativeArray
        for all stripe H in stripes [H1; H2; H3; …] do
            Sum(Hf, H)                 // element-wise sum
        // Calculate relative frequencies
        count = 0
        Hnew = new AssociativeArray
        for each term u in Hf do
            count = count + Hf{u}
        for each term u in Hf do
            Hnew{u} = Hf{u} / count
        Emit(term w, stripe Hnew)
    }
}
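A Java sketch of the stripes approach using Hadoop's MapWritable follows. Again, this is my interpretation of the pseudocode (neighbour = next word, hypothetical class names), not necessarily the attached source:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StripesRelativeFrequency {

    public static class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Build one stripe (H in the pseudocode) per word in this line.
            Map<String, MapWritable> stripes = new HashMap<String, MapWritable>();
            String[] words = value.toString().trim().split("\\s+");
            for (int i = 0; i < words.length - 1; i++) {
                MapWritable h = stripes.get(words[i]);
                if (h == null) { h = new MapWritable(); stripes.put(words[i], h); }
                Text u = new Text(words[i + 1]);             // neighbour = next word
                IntWritable old = (IntWritable) h.get(u);
                h.put(u, new IntWritable(old == null ? 1 : old.get() + 1));
            }
            for (Map.Entry<String, MapWritable> e : stripes.entrySet())
                context.write(new Text(e.getKey()), e.getValue());
        }
    }

    public static class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
        @Override
        protected void reduce(Text word, Iterable<MapWritable> stripes, Context context)
                throws IOException, InterruptedException {
            MapWritable sum = new MapWritable();             // Hf: element-wise sum
            long total = 0;
            for (MapWritable h : stripes) {
                for (Map.Entry<Writable, Writable> e : h.entrySet()) {
                    int c = ((IntWritable) e.getValue()).get();
                    Text u = new Text(e.getKey().toString()); // copy: Hadoop reuses objects
                    IntWritable old = (IntWritable) sum.get(u);
                    sum.put(u, new IntWritable(old == null ? c : old.get() + c));
                    total += c;
                }
            }
            MapWritable freq = new MapWritable();            // Hnew: relative frequencies
            for (Map.Entry<Writable, Writable> e : sum.entrySet())
                freq.put(e.getKey(),
                        new DoubleWritable(((IntWritable) e.getValue()).get() / (double) total));
            context.write(word, freq);
        }
    }
}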


Mixing Pairs and Stripes: Hybrid

Class Mapper{
    method Map(inKey, text){
        for each word w in text
            for each neighbour n of word w
                Pair p = Pair(w, n)
                Emit(p, 1)
    }
}


Class Reducer{
    Hf = new AssociativeArray
    last = empty
    method Reduce(pair p = (w, u); counts [c1; c2; c3; …]){
        if last != empty and last != w     // new word: flush the stripe for the previous word
            Flush()
        count = 0
        for all count c in counts [c1; c2; …] do
            count = count + c
        Hf{u} = Hf{u} + count              // build the stripe for term w out of its pairs
        last = w
    }
    method Flush(){
        total = 0
        for all term u in Hf do
            total = total + Hf{u}          // all occurrences for term last
        for all term u in Hf do
            Hf{u} = Hf{u} / total          // element-wise division
        Emit(term last, stripe Hf)
        Clear Hf
    }
    method Close(){                        // called after the final Reduce
        Flush()
    }
}
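In Hadoop terms, the Close() hook at the end corresponds to the reducer's cleanup() method. A sketch of the hybrid reducer under the same assumptions as before (pair keys of the form "w<TAB>u", a partitioner on the left word, placeholder names):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Accumulates a stripe per left word across pair keys and flushes it when
// the word changes; sorted keys keep all pairs for the same w contiguous.
public class HybridReducer extends Reducer<Text, IntWritable, Text, MapWritable> {
    private final Map<String, Integer> stripe = new HashMap<String, Integer>(); // Hf
    private String last = null;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        String[] wu = key.toString().split("\t");    // key is the pair (w, u)
        if (last != null && !last.equals(wu[0])) {
            flush(context);                          // new word: emit the previous stripe
        }
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        Integer old = stripe.get(wu[1]);
        stripe.put(wu[1], old == null ? sum : old + sum);
        last = wu[0];
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        flush(context);                              // the pseudocode's Close()
    }

    private void flush(Context context) throws IOException, InterruptedException {
        if (last == null || stripe.isEmpty()) return;
        long total = 0;
        for (int c : stripe.values()) total += c;
        MapWritable freq = new MapWritable();        // element-wise division by the total
        for (Map.Entry<String, Integer> e : stripe.entrySet())
            freq.put(new Text(e.getKey()), new DoubleWritable(e.getValue() / (double) total));
        context.write(new Text(last), freq);
        stripe.clear();
    }
}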

Comparing the Run Times:

I tested these applications with some randomly generated data files. Here is my observation.

Bottom line: I had fun writing this code.

You can download the classes I have written from here

