Recently I had an opportunity to learn and play around with Hadoop as a course exercise. I wrote a MapReduce program, designed to compute the relative frequency of words in a document, to run on Hadoop, and then used the result to predict the next likely word.
This blog is a log of what I did.
Steps:
1: Download VirtualBox, import the Hortonworks Sandbox image file, and start the sandbox.
----development environment-------
2: Set up a simple Maven project.
3: Include the following dependencies:
<dependencies>
    <!-- For Hadoop Start -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>18.0</version>
    </dependency>
    <!-- For Hadoop End -->
</dependencies>
4: Include the Maven Assembly Plugin.
<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>edu.mum.crystalBall.stripes.Application</mainClass>
            </manifest>
        </archive>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>
5: Write the Mapper, Reducer and Application classes.
----> There are 3 approaches to solving the relative frequency problem; the pseudocode is given further down and the full source code is attached at the end. A rough sketch of the driver class follows below.
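As a reference, here is roughly what the Application (driver) class can look like, wired up for the stripes job (the mainClass configured in the assembly plugin above). StripesMapper and StripesReducer are placeholder names for whatever Mapper and Reducer classes you write; the attached source may differ in detail.

package edu.mum.crystalBall.stripes;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Application {
    public static void main(String[] args) throws Exception {
        // args[0] = input location, args[1] = output location (see step 12 below)
        Job job = Job.getInstance(new Configuration(), "relative-frequency-stripes");
        job.setJarByClass(Application.class);

        job.setMapperClass(StripesMapper.class);      // placeholder class names
        job.setReducerClass(StripesReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(MapWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}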
6: Build the project with the following command:
mvn clean compile assembly:single
---Deploy and run-----
7: Copy the jar file from the project's workspace/project/target directory to /usr/lib/hue on the sandbox.
8: Connect to the sandbox from the host machine.
9: Change user to hue:
su hue
10: Go to /usr/lib/hue
11: Create the input file from the web browser: localhost:8000/filebrowser/view/user/hue
12: Execute: hadoop jar {nameofjar} {inputlocation} {outputlocation}
13: Browse localhost:8000/jobbrowser/ to monitor the job.
14: The result can be viewed from localhost:8000/filebrowser/view/user/hue
Mapper, Reducer and Application Classes:
Pairs Approach:
Class Mapper{
  method Map(inKey, text){
    for each word w in text
      for each neighbour u of word w
        Pair p = Pair(w, u)
        Emit(p, 1)
        Emit(Pair(w, *), 1)          // marginal: counts every pair with left word w
  }
}
Class Reducer{
  marginal = 0
  method Reduce(pair p(w, u); counts [c1; c2; …]){
    s = 0
    for all count c in counts [c1; c2; …] do
      s = s + c
    if u == *                        // (w, *) is sorted before every (w, u)
      marginal = s                   // total occurrences of w with any neighbour
    else
      Emit(pair p, s / marginal)     // relative frequency f(u | w)
  }
}
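One straightforward way to realise this in Hadoop (not necessarily how the attached source does it) is to encode the pair as the Text key "w:u" and the marginal as "w:*"; since '*' sorts before letters, the marginal reaches the reducer ahead of the real neighbours. In this sketch a neighbour is simply the next word; the real code may use a wider window.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().toLowerCase().split("\\s+");
        for (int i = 0; i < words.length - 1; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
            pair.set(words[i] + ":" + words[i + 1]);   // pair (w, u): w followed by u
            context.write(pair, ONE);
            pair.set(words[i] + ":*");                 // marginal (w, *)
            context.write(pair, ONE);
        }
    }
}

A custom partitioner on the left word w is also needed so that (w, *) and all (w, u) pairs end up at the same reducer.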
Stripes Approach:
Class Mapper{
  method Map(docid a; doc d){
    for all term w in doc d do
      H = new AssociativeArray         // stripe for term w
      for all term u in Neighbors(w) do
        H{u} = H{u} + 1
      Emit(term w, stripe H)
  }
}
Class Reducer{
  method Reduce(term w; stripes [H1; H2; H3; …]){
    Hf = new AssociativeArray
    for all stripe H in stripes [H1; H2; H3; …] do
      Sum(Hf, H)                       // element-wise sum
    // Calculate relative frequencies
    count = 0
    Hnew = new AssociativeArray
    for each u in Hf do
      count += Hf{u}
    for each u in Hf do
      Hnew{u} = Hf{u} / count
    Emit(term w, stripe Hnew)
  }
}
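A minimal Java sketch of the stripes version, using Hadoop's MapWritable as the associative array. The class names and the one-word neighbourhood are illustrative; the attached source may differ.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().toLowerCase().split("\\s+");
        for (int i = 0; i < words.length - 1; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
            MapWritable stripe = new MapWritable();          // H: stripe for term w
            stripe.put(new Text(words[i + 1]), new IntWritable(1));
            context.write(new Text(words[i]), stripe);       // Emit(w, H)
        }
    }
}

class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text word, Iterable<MapWritable> stripes, Context context)
            throws IOException, InterruptedException {
        MapWritable sum = new MapWritable();                 // Hf: element-wise sum
        long total = 0;
        for (MapWritable stripe : stripes) {
            for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                int c = ((IntWritable) e.getValue()).get();
                IntWritable cur = (IntWritable) sum.get(e.getKey());
                sum.put(new Text(e.getKey().toString()),
                        new IntWritable((cur == null ? 0 : cur.get()) + c));
                total += c;
            }
        }
        MapWritable freq = new MapWritable();                // Hnew: relative frequencies
        for (Map.Entry<Writable, Writable> e : sum.entrySet()) {
            double f = ((IntWritable) e.getValue()).get() / (double) total;
            freq.put(e.getKey(), new DoubleWritable(f));
        }
        context.write(word, freq);
    }
}

In practice the mapper would aggregate all neighbours of a term per line, or the same reducer class would be set as a combiner, so that fewer one-entry stripes are shuffled.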
Mixing Pairs and Stripes: Hybrid
Class Mapper{
  method Map(inKey, text){
    for each word w in text
      for each neighbour u of word w
        Pair p = Pair(w, u)
        Emit(p, 1)
  }
}
Class Reducer{
  Hf = new AssociativeArray
  last = empty
  method Reduce(pair p(w, u); counts [c1; c2; c3; …]){
    if(last != w and last != empty){   // left word changed: flush the stripe for last
      count = 0
      for all u' in Hf do
        count += Hf{u'}                // total occurrences of last
      for all u' in Hf do
        Hf{u'} = Hf{u'} / count        // element-wise division -> relative frequencies
      Emit(term last, stripe Hf)
      Clear Hf
    }
    for all count c in counts [c1; c2; …] do
      Hf{u} = Hf{u} + c                // build the stripe for term w from its pairs
    last = w
  }
  method Cleanup(){                    // called once at the end
    normalize Hf as above
    Emit(term last, stripe Hf)
  }
}
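In Hadoop's Java API, the flush-at-the-end step (the Cleanup() method above) maps naturally onto the reducer's cleanup() hook. Below is a minimal sketch, assuming the pair key arrives as the Text "w:u" and a custom partitioner routes all pairs with the same left word to the same reducer; the names are illustrative, not the exact ones from the attached source.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class HybridReducer extends Reducer<Text, IntWritable, Text, MapWritable> {
    private final Map<String, Integer> stripe = new HashMap<>();   // Hf
    private String last = null;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        String[] parts = key.toString().split(":", 2);              // pair (w, u)
        String w = parts[0], u = parts[1];
        if (last != null && !last.equals(w)) {                      // left word changed
            flush(context);                                          // emit stripe for last
        }
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        stripe.merge(u, sum, Integer::sum);                         // accumulate Hf{u}
        last = w;
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        if (last != null) flush(context);                           // flush the final stripe
    }

    private void flush(Context context) throws IOException, InterruptedException {
        long total = 0;
        for (int c : stripe.values()) total += c;
        MapWritable out = new MapWritable();                        // relative frequencies
        for (Map.Entry<String, Integer> e : stripe.entrySet())
            out.put(new Text(e.getKey()), new DoubleWritable(e.getValue() / (double) total));
        context.write(new Text(last), out);
        stripe.clear();
    }
}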
Comparing the Run times:
I tested these applications with some randomly generated data files. Here is my observation.
Bottom line: I had fun writing this code.
You can download the classes I have written from here.