
Project Report

Team Composition:
Sai Prasad Veluru 50206894
Contribution:
I did everything.
Hadoop DFS Setup and Environment:
1) My primary OS is Linux (the Debian flavor, Ubuntu).
2) I downloaded the Hadoop 2.7.3 distribution and ran everything from the command line.
3) I did the project in pseudo-distributed mode, where the Hadoop daemons run on a local machine, thus simulating a cluster on a small scale (the standard configuration for this mode is sketched below).
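
For reference, the stock single-node configuration from the Hadoop 2.7.3 documentation looks like the following two snippets; the port and replication values are the documented defaults for pseudo-distributed mode, not values recorded in this report.

    <!-- etc/hadoop/core-site.xml: point the default filesystem at the local HDFS daemon -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- etc/hadoop/hdfs-site.xml: with a single node there can be only one replica -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>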

Datasets:
I had to get 94 articles of data, as my ID (50206894) ends in 94.
I used a combination of manual work and some JavaScript selectors run in the browser's JavaScript console to scrape the data. That is, I got the list of 3 URLs with 50 page views and then used a $() selector to select the appropriate text.
Luckily, I found that the URLs were numbered in sequence, so I just changed the URL number programmatically to go to each page and ran the snippet (an equivalent sketch follows below).
In the end, I used cat in the shell on Linux to merge all 94 files into a single file.
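
The scraping itself was done with JavaScript in the browser console, as described above; purely to illustrate the sequential-URL trick, an equivalent sketch in Java would look like this (the base URL and file names are hypothetical, and the real text extraction used console selectors rather than saving raw pages):

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ArticleFetcher {
        public static void main(String[] args) throws IOException {
            // Hypothetical base URL; the real site simply numbered its pages.
            String base = "http://example.com/articles/";
            for (int i = 1; i <= 94; i++) {
                try (InputStream in = new URL(base + i + ".html").openStream()) {
                    // Save each page to its own file; the 94 files were
                    // later merged with cat into a single input file.
                    Files.copy(in, Paths.get("article_" + i + ".txt"));
                }
            }
        }
    }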
Design issues:
Most of my issues were with cleaning the data, not with designing the MapReduce job.
I cleaned the data inside the MapReduce code itself: that is, I literally removed all the extra characters from each line that is input to the map function, within the map function.
Cleaning:
1) I used a regex to remove all the non-word characters and kept only the words (see the mapper sketch after this list).
2) However, there were words like llllllllllllllllllllllllllllllfdfffffffffffffffffff, due to OCR errors. Had I started the assignment earlier, I would have used an English dictionary library to filter out the words that are not English in a semantic sense. However, this may fail because we cannot account for all the nouns in a dictionary, so we would need a more inclusive dictionary with its nouns kept up to date.
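
A minimal sketch of the mapper with this cleaning step, assuming the Hadoop 2.7.3 MapReduce API and a job that counts word-length frequencies (the class name and exact regex are illustrative, not copied from the submitted code):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordLengthMapper
            extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final IntWritable length = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The cleaning happens right here in the map function:
            // strip every non-letter character, keeping only words.
            for (String word : value.toString().toLowerCase().split("[^a-z]+")) {
                // Drop empty tokens and OCR artifacts such as the very
                // long nonsense strings mentioned above.
                if (word.isEmpty() || word.length() > 25) {
                    continue;
                }
                length.set(word.length());
                context.write(length, ONE); // emit (word length, 1)
            }
        }
    }

The reducer then only has to sum the ones emitted for each length key, which gives the histogram discussed in the next section.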
Empirical Results and Discussions:
In the graph below, the x-axis is the length of the words and the y-axis is the frequency of words of that length.

We have avoided words with more than 25 letters, as they are mostly errors. As we can see, the distribution looks roughly normal with a skew to the left; the most frequent word length is 3 letters.
Comparison with the English language:
The average word length, obtained by taking the weighted mean over all the word counts for lengths below 20, is 4.2.
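Written out, with f_l denoting the number of words of length l in the histogram above:

    \bar{l} = \frac{\sum_{l < 20} l \, f_l}{\sum_{l < 20} f_l} \approx 4.2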
However, this data set is small; compared with the 9.7 reported in the paper, the result indicates that newspaper articles generally use more common words, which tend to have fewer characters.
For all intents and purposes, we can ignore the words with length greater than 20 to get the standard statistics.
Practical Experiences:
1) Most of the problems were with cleaning the data, and we still do not know exactly what constitutes an English word. We could use a dictionary that includes all the nouns and the words used in the vernacular (see the sketch after this list).
2) As far as the computation goes, we did not have any issues, since the load is very low, around 3.2 MB of text.
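
As a sketch of the dictionary idea from point 1 (the word-list path is hypothetical; any plain-text list with one valid word per line, proper nouns included, would do):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class DictionaryFilter {
        private final Set<String> validWords = new HashSet<>();

        // Load a plain-text word list, one word per line.
        public DictionaryFilter(String wordListPath) throws IOException {
            for (String line : Files.readAllLines(Paths.get(wordListPath))) {
                validWords.add(line.trim().toLowerCase());
            }
        }

        // A token counts as English only if the dictionary knows it, so
        // OCR junk is dropped; the caveat above still applies, since a
        // list that misses proper nouns will also drop legitimate words.
        public boolean isEnglishWord(String token) {
            return validWords.contains(token.toLowerCase());
        }
    }

In the mapper sketched earlier, isEnglishWord would be called right after the regex split, before emitting the word length.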
