Recently I had an opportunity to learn and play around with Hadoop as a course exercise. I wrote a MapReduce program, designed to compute the relative frequency of words in a document, to run on Hadoop, and then used the result to predict the next likely word.
This blog is a log of what I did.
Steps:
1: Download VirtualBox, import the Hortonworks Sandbox image file, and start the sandbox.
----development environment-------
2: Set up a simple Maven project.
3: Include the following dependencies:
<dependencies>
    <!-- For Hadoop Start -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>18.0</version>
    </dependency>
    <!-- For Hadoop End -->
</dependencies>
4: Include the Maven Assembly Plugin.
<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>edu.mum.crystalBall.stripes.Application</mainClass>
            </manifest>
        </archive>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>
5: Write the Mapper, Reducer and Application classes.
----> There are 3 approaches to solving the relative frequency problem; the pseudocode is given further down and the full source code is attached at the end. A rough sketch of the driver class follows below.
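As a reference, here is roughly what the Application (driver) class can look like, wired up for the stripes job (the mainClass configured in the assembly plugin above). StripesMapper and StripesReducer are placeholder names for whatever Mapper and Reducer classes you write; the attached source may differ in detail.

package edu.mum.crystalBall.stripes;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Application {
    public static void main(String[] args) throws Exception {
        // args[0] = input location, args[1] = output location (see step 12 below)
        Job job = Job.getInstance(new Configuration(), "relative-frequency-stripes");
        job.setJarByClass(Application.class);

        job.setMapperClass(StripesMapper.class);      // placeholder class names
        job.setReducerClass(StripesReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(MapWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}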
6: Build the project with the following command:
mvn clean compile assembly:single
---Deploy and run-----
7: Copy the jar file from the project's workspace/project/target directory to /usr/lib/hue on the sandbox.
8: Connect to the sandbox from the host machine.
9: Change user to hue:
su hue
10: Go to /usr/lib/hue
11: Create the input file from the web browser: localhost:8000/filebrowser/view/user/hue
12: Execute: hadoop jar {nameofjar} {inputlocation} {outputlocation}
13: Browse localhost:8000/jobbrowser/ to monitor the job.
14: The result can be viewed from localhost:8000/filebrowser/view/user/hue
Mapper, Reducer and Application Classes:
Pairs Approach:
Class Mapper{
  method Map(inKey, text){
    for each word w in text
      for each neighbour u of word w
        Pair p = Pair(w, u)
        Emit(p, 1)
        Emit(Pair(w, *), 1)          // marginal: counts every pair with left word w
  }
}
Class Reducer{
  marginal = 0
  method Reduce(pair p(w, u); counts [c1; c2; …]){
    s = 0
    for all count c in counts [c1; c2; …] do
      s = s + c
    if u == *                        // (w, *) is sorted before every (w, u)
      marginal = s                   // total occurrences of w with any neighbour
    else
      Emit(pair p, s / marginal)     // relative frequency f(u | w)
  }
}
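One straightforward way to realise this in Hadoop (not necessarily how the attached source does it) is to encode the pair as the Text key "w:u" and the marginal as "w:*"; since '*' sorts before letters, the marginal reaches the reducer ahead of the real neighbours. In this sketch a neighbour is simply the next word; the real code may use a wider window.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().toLowerCase().split("\\s+");
        for (int i = 0; i < words.length - 1; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
            pair.set(words[i] + ":" + words[i + 1]);   // pair (w, u): w followed by u
            context.write(pair, ONE);
            pair.set(words[i] + ":*");                 // marginal (w, *)
            context.write(pair, ONE);
        }
    }
}

A custom partitioner on the left word w is also needed so that (w, *) and all (w, u) pairs end up at the same reducer.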
Stripes Approach:
Class Mapper{
  method Map(docid a; doc d){
    for all term w in doc d do
      H = new AssociativeArray         // stripe for term w
      for all term u in Neighbors(w) do
        H{u} = H{u} + 1
      Emit(term w, stripe H)
  }
}
Class Reducer{
  method Reduce(term w; stripes [H1; H2; H3; …]){
    Hf = new AssociativeArray
    for all stripe H in stripes [H1; H2; H3; …] do
      Sum(Hf, H)                       // element-wise sum
    // Calculate relative frequencies
    count = 0
    Hnew = new AssociativeArray
    for each u in Hf do
      count += Hf{u}
    for each u in Hf do
      Hnew{u} = Hf{u} / count
    Emit(term w, stripe Hnew)
  }
}
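A minimal Java sketch of the stripes version, using Hadoop's MapWritable as the associative array. The class names and the one-word neighbourhood are illustrative; the attached source may differ.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().toLowerCase().split("\\s+");
        for (int i = 0; i < words.length - 1; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
            MapWritable stripe = new MapWritable();          // H: stripe for term w
            stripe.put(new Text(words[i + 1]), new IntWritable(1));
            context.write(new Text(words[i]), stripe);       // Emit(w, H)
        }
    }
}

class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text word, Iterable<MapWritable> stripes, Context context)
            throws IOException, InterruptedException {
        MapWritable sum = new MapWritable();                 // Hf: element-wise sum
        long total = 0;
        for (MapWritable stripe : stripes) {
            for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                int c = ((IntWritable) e.getValue()).get();
                IntWritable cur = (IntWritable) sum.get(e.getKey());
                sum.put(new Text(e.getKey().toString()),
                        new IntWritable((cur == null ? 0 : cur.get()) + c));
                total += c;
            }
        }
        MapWritable freq = new MapWritable();                // Hnew: relative frequencies
        for (Map.Entry<Writable, Writable> e : sum.entrySet()) {
            double f = ((IntWritable) e.getValue()).get() / (double) total;
            freq.put(e.getKey(), new DoubleWritable(f));
        }
        context.write(word, freq);
    }
}

In practice the mapper would aggregate all neighbours of a term per line, or the same reducer class would be set as a combiner, so that fewer one-entry stripes are shuffled.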
Mixing Pairs and Stripes: Hybrid
Class Mapper{
  method Map(inKey, text){
    for each word w in text
      for each neighbour u of word w
        Pair p = Pair(w, u)
        Emit(p, 1)
  }
}
Class Reducer{
  Hf = new AssociativeArray
  last = empty
  method Reduce(pair p(w, u); counts [c1; c2; c3; …]){
    if(last != w and last != empty){   // left word changed: flush the stripe for last
      count = 0
      for all u' in Hf do
        count += Hf{u'}                // total occurrences of last
      for all u' in Hf do
        Hf{u'} = Hf{u'} / count        // element-wise division -> relative frequencies
      Emit(term last, stripe Hf)
      Clear Hf
    }
    for all count c in counts [c1; c2; …] do
      Hf{u} = Hf{u} + c                // build the stripe for term w from its pairs
    last = w
  }
  method Cleanup(){                    // called once at the end
    normalize Hf as above
    Emit(term last, stripe Hf)
  }
}
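In Hadoop's Java API, the flush-at-the-end step (the Cleanup() method above) maps naturally onto the reducer's cleanup() hook. Below is a minimal sketch, assuming the pair key arrives as the Text "w:u" and a custom partitioner routes all pairs with the same left word to the same reducer; the names are illustrative, not the exact ones from the attached source.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class HybridReducer extends Reducer<Text, IntWritable, Text, MapWritable> {
    private final Map<String, Integer> stripe = new HashMap<>();   // Hf
    private String last = null;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        String[] parts = key.toString().split(":", 2);              // pair (w, u)
        String w = parts[0], u = parts[1];
        if (last != null && !last.equals(w)) {                      // left word changed
            flush(context);                                          // emit stripe for last
        }
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        stripe.merge(u, sum, Integer::sum);                         // accumulate Hf{u}
        last = w;
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        if (last != null) flush(context);                           // flush the final stripe
    }

    private void flush(Context context) throws IOException, InterruptedException {
        long total = 0;
        for (int c : stripe.values()) total += c;
        MapWritable out = new MapWritable();                        // relative frequencies
        for (Map.Entry<String, Integer> e : stripe.entrySet())
            out.put(new Text(e.getKey()), new DoubleWritable(e.getValue() / (double) total));
        context.write(new Text(last), out);
        stripe.clear();
    }
}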
Comparing the Run times:
I tested these applications with some randomly generated data files. Here is my observation.
Bottom line: I had fun writing this code.
You can download the classes I have written from here.