
Project Report

Team Composition:
Sai Prasad Veluru 50206894
Contribution:
I did everything.
Hadoop DFS Setup and Environment:
1) My primary OS is Linux (the Debian flavor, Ubuntu).
2) I downloaded the Hadoop 2.7.3 distribution and ran everything from the command line.
3) I did the project in pseudo-distributed mode, where the Hadoop daemons run on a local machine, thus simulating a cluster on a small scale (the standard configuration for this mode is sketched below).
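
For reference, the stock single-node configuration from the Hadoop 2.7.3 documentation looks like the following two snippets; the port and replication values are the documented defaults for pseudo-distributed mode, not values recorded in this report.

    <!-- etc/hadoop/core-site.xml: point the default filesystem at the local HDFS daemon -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- etc/hadoop/hdfs-site.xml: with a single node there can be only one replica -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>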

Datasets:
I had to get 94 articles of data, as my ID (50206894) ends in 94.
I used a combination of manual work and some JavaScript selectors run in the browser's JavaScript console to scrape the data. That is, I got the list of 3 URLs with 50 page views and then used a $() selector to select the appropriate text.
Luckily, I found that the URLs were numbered in sequence, so I just changed the URL number programmatically to go to each page and ran the snippet (an equivalent sketch follows below).
In the end, I used cat in the shell on Linux to merge all 94 files into a single file.
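
The scraping itself was done with JavaScript in the browser console, as described above; purely to illustrate the sequential-URL trick, an equivalent sketch in Java would look like this (the base URL and file names are hypothetical, and the real text extraction used console selectors rather than saving raw pages):

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ArticleFetcher {
        public static void main(String[] args) throws IOException {
            // Hypothetical base URL; the real site simply numbered its pages.
            String base = "http://example.com/articles/";
            for (int i = 1; i <= 94; i++) {
                try (InputStream in = new URL(base + i + ".html").openStream()) {
                    // Save each page to its own file; the 94 files were
                    // later merged with cat into a single input file.
                    Files.copy(in, Paths.get("article_" + i + ".txt"));
                }
            }
        }
    }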
Design issues:
Most of my issues were with cleaning the data, not with designing the MapReduce job.
I cleaned the data inside the MapReduce code itself: that is, I literally removed all the extra characters from each line that is input to the map function, within the map function.
Cleaning:
1) I used a regex to remove all the non-word characters and kept only the words (see the mapper sketch after this list).
2) However, there were words like llllllllllllllllllllllllllllllfdfffffffffffffffffff, due to OCR errors. Had I started the assignment earlier, I would have used an English dictionary library to filter out the words that are not English in a semantic sense. However, this may fail because we cannot account for all the nouns in a dictionary, so we would need a more inclusive dictionary with its nouns kept up to date.
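
A minimal sketch of the mapper with this cleaning step, assuming the Hadoop 2.7.3 MapReduce API and a job that counts word-length frequencies (the class name and exact regex are illustrative, not copied from the submitted code):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordLengthMapper
            extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final IntWritable length = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The cleaning happens right here in the map function:
            // strip every non-letter character, keeping only words.
            for (String word : value.toString().toLowerCase().split("[^a-z]+")) {
                // Drop empty tokens and OCR artifacts such as the very
                // long nonsense strings mentioned above.
                if (word.isEmpty() || word.length() > 25) {
                    continue;
                }
                length.set(word.length());
                context.write(length, ONE); // emit (word length, 1)
            }
        }
    }

The reducer then only has to sum the ones emitted for each length key, which gives the histogram discussed in the next section.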
Empirical Results and Discussions:
In the graph below, the x-axis is the length of the words and the y-axis is the frequency of words of that length.

We have avoided words with more than 25 letters, as they are mostly errors. As we can see, the distribution looks roughly normal with a skew to the left; the most frequent word length is 3 letters.
Comparison with the English language:
The average word length, obtained by taking the weighted mean over all the word counts for lengths below 20, is 4.2.
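Written out, with f_l denoting the number of words of length l in the histogram above:

    \bar{l} = \frac{\sum_{l < 20} l \, f_l}{\sum_{l < 20} f_l} \approx 4.2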
However, this data set is small; compared with the 9.7 reported in the paper, the result indicates that newspaper articles generally use more common words, which tend to have fewer characters.
For all intents and purposes, we can ignore the words with length greater than 20 to get the standard statistics.
Practical Experiences:
1) Most of the problems were with cleaning the data, and we still do not know exactly what constitutes an English word. We could use a dictionary that includes all the nouns and the words used in the vernacular (see the sketch after this list).
2) As far as the computation goes, we did not have any issues, since the load is very low, around 3.2 MB of text.
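
As a sketch of the dictionary idea from point 1 (the word-list path is hypothetical; any plain-text list with one valid word per line, proper nouns included, would do):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class DictionaryFilter {
        private final Set<String> validWords = new HashSet<>();

        // Load a plain-text word list, one word per line.
        public DictionaryFilter(String wordListPath) throws IOException {
            for (String line : Files.readAllLines(Paths.get(wordListPath))) {
                validWords.add(line.trim().toLowerCase());
            }
        }

        // A token counts as English only if the dictionary knows it, so
        // OCR junk is dropped; the caveat above still applies, since a
        // list that misses proper nouns will also drop legitimate words.
        public boolean isEnglishWord(String token) {
            return validWords.contains(token.toLowerCase());
        }
    }

In the mapper sketched earlier, isEnglishWord would be called right after the regex split, before emitting the word length.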
