
Guided By

Ms. Shikha Pachouly


Assistant Professor, Computer Engineering Department

6/21/2014








Machine Learning
Machine learning is programming computers to
optimize a performance criterion using example data
or past experience.

Machine Learning Strategies
1) Supervised
2) Unsupervised

Common Use Cases
Recommend friends/dates/products
Classify content into predefined groups
Find similar content based on object properties
Find associations/patterns in actions/behaviors
Identify key topics in a large collection of text
Detect anomalies in output
Rank search results
Apache Mahout Introduction
A machine learning library for scalable applications.

Includes core algorithms for recommendation,
clustering, and classification, implemented on
top of the Hadoop MapReduce model.

The core libraries are also highly optimized, so
non-distributed algorithms perform well too.

Mahout is distributed under the commercially friendly
Apache Software License.

The goal of Mahout is to build a vibrant, responsive,
diverse community to facilitate discussions not only on
the project itself but also on potential use cases.

Mahout currently supports three main use cases:
1) Recommendation mining
2) Clustering
3) Classification
Why Mahout
Many open-source ML libraries (PyBrain, Shark, etc.)
either
1) lack community,
2) lack scalability, or
3) lack documentation and examples.

Most Mahout implementations are MapReduce-enabled.


The main goal of Apache Mahout is to be useful to
practitioners.
- Implementations should be easy to use from
within Java applications.
- Deploying trained models should be close to
trivial.
- Scaling to include more, and more diverse, data
should be simple.
Recommendations





Extensive framework for collaborative filtering
Recommenders:
1) User-based
2) Item-based
Many different similarity measures,
e.g. cosine, log-likelihood ratio (LLR), Tanimoto, Pearson
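To make the similarity measures concrete, here is a minimal pure-Python sketch of three of them (cosine, Tanimoto, Pearson) on toy data. It is an illustration only and not Mahout's implementation; the function names and the sample ratings are assumptions made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def tanimoto_similarity(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: shared items / all items seen."""
    items_a, items_b = set(items_a), set(items_b)
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0

def pearson_similarity(a, b):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Toy ratings two users gave to the same four items (made-up values).
alice = [5.0, 3.0, 4.0, 1.0]
bob = [4.0, 2.0, 5.0, 1.0]
print(cosine_similarity(alice, bob))
print(tanimoto_similarity({"a", "b", "c"}, {"b", "c", "d"}))
print(pearson_similarity(alice, bob))
```

Cosine and Pearson work on rating values, while Tanimoto only looks at which items each user touched, so it suits data without explicit ratings.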
Algorithms for Recommendation
User-Based Collaborative Filtering - single machine
Item-Based Collaborative Filtering - single machine /
MapReduce
Matrix Factorization with Alternating Least Squares -
single machine / MapReduce
Matrix Factorization with Alternating Least Squares on
Implicit Feedback - single machine / MapReduce
Weighted Matrix Factorization, SVD++, Parallel SGD -
single machine

User-Based Recommender
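As a rough illustration of the idea behind a user-based recommender, the sketch below scores items a user has not rated by weighting other users' ratings by their similarity to that user. It is plain Python rather than Mahout's Java API; the `ratings` data, the cosine-over-shared-items similarity, and the function names are all assumptions chosen for this example.

```python
import math

# Hypothetical user -> {item: rating} data, invented for this example.
ratings = {
    "u1": {"A": 5.0, "B": 3.0, "C": 4.0},
    "u2": {"A": 4.0, "B": 3.0, "C": 5.0, "D": 5.0},
    "u3": {"A": 1.0, "B": 5.0, "D": 2.0},
}

def similarity(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, top_n=3):
    """Rank unrated items by the similarity-weighted average of others' ratings."""
    totals, weights = {}, {}
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        if sim <= 0:
            continue
        for item, rating in ratings[other].items():
            if item in ratings[user]:
                continue  # only recommend items the user has not rated
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    scores = {item: totals[item] / weights[item] for item in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(recommend("u1"))
```

Here `u1` is much closer to `u2` than to `u3`, so the predicted score for item `D` is pulled toward `u2`'s rating.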






Clustering





Algorithms for Clustering
K-Means Clustering
Fuzzy K-Means
Mean Shift Clustering
Dirichlet Process Clustering (For Topic Modelling)
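For readers new to clustering, the following is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It only illustrates the assign-then-update loop and is not Mahout's MapReduce implementation; the function name, the fixed iteration count, and the sample points are assumptions for this example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Two obvious groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(centroids, labels)
```

Fuzzy k-means follows the same loop but gives each point a degree of membership in every cluster instead of a single label.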

The clustering algorithms can also be run from the command
line on Hadoop infrastructure, e.g.

Canopy Clustering
bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job

k-Means Clustering
bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

Fuzzy k-Means Clustering
bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job

Classification
Algorithms implemented in Mahout for classification:
Logistic Regression (trained via SGD) - single machine
Naive Bayes / Complementary Naive Bayes -
MapReduce
Random Forest - MapReduce
Hidden Markov Models - single machine
Multilayer Perceptron - single machine





Running Naive Bayes from the
Command Line
Three commands:
1) mahout seq2sparse
converts the input text into sparse TF-IDF vectors

2) mahout trainnb
trains the Naive Bayes model

3) mahout testnb
classifies the test data and evaluates the model
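The three steps can be pictured with a small pure-Python sketch: vectorize text as TF-IDF, train a multinomial Naive Bayes model on the weighted vectors, then classify unseen text. This is a conceptual analogue only, not what the Mahout commands run internally; the IDF formula, the smoothing parameter `alpha`, the function names, and the toy documents are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Step 1 (cf. seq2sparse): turn tokenized docs into TF-IDF weight dicts."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def train_nb(vectors, labels, alpha=1.0):
    """Step 2 (cf. trainnb): sum term weights per class, with Laplace smoothing."""
    weights = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        weights[label].update(vec)  # adds the TF-IDF weights term by term
    vocab = {t for vec in vectors for t in vec}
    priors = {c: math.log(labels.count(c) / len(labels)) for c in weights}
    log_prob = {}
    for c, w in weights.items():
        total = sum(w.values()) + alpha * len(vocab)
        log_prob[c] = {t: math.log((w[t] + alpha) / total) for t in vocab}
    return priors, log_prob

def classify(tokens, idf, priors, log_prob):
    """Step 3 (cf. testnb): pick the class with the highest log-posterior score."""
    counts = Counter(t for t in tokens if t in idf)  # ignore unseen terms
    scores = {
        c: priors[c] + sum(tf * idf[t] * log_prob[c][t] for t, tf in counts.items())
        for c in priors
    }
    return max(scores, key=scores.get)

# Made-up training documents and labels.
train_docs = [
    "cheap pills buy now".split(),
    "buy cheap watches now".split(),
    "meeting agenda for monday".split(),
    "project meeting notes attached".split(),
]
train_labels = ["spam", "spam", "ham", "ham"]

vectors, idf = tfidf_vectors(train_docs)
priors, log_prob = train_nb(vectors, train_labels)
print(classify("buy cheap pills".split(), idf, priors, log_prob))
print(classify("monday project meeting".split(), idf, priors, log_prob))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.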

Installation of Mahout
Download the tar files of both the apache-mahout and
apache-maven projects.
Extract the tar files into a directory.
Set the PATH variable for Maven.
Change the working directory to Mahout's core
folder.
Compile the project with 'mvn compile'.
Build the project with 'mvn install'.

Mahout Vs Weka
Criterion      Mahout    WEKA
Scalability    More      Less
Algorithms     Fewer     More
GUI            No        Yes
License        Apache    GPL
MAHOUT COMMERCIAL USERS

Adobe: Uses clustering algorithms to increase video
consumption through better user targeting.
Amazon: Uses Mahout for its personalization platform.
AOL: Uses Mahout for shopping recommendations.
Twitter: Uses Mahout's LDA implementation for user interest
modeling.
Yahoo! Mail: Uses Mahout's Frequent Pattern Set Mining.
Drupal: Uses Mahout to provide open-source content
recommendation solutions.
Evolv: Uses Mahout for its Workforce Predictive Analytics
platform.
Foursquare: Uses Mahout for its recommendation engine.
Idealo: Uses Mahout's recommendation engine.


References
Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen, "Scalable
Sentiment Classification for Big Data Analysis Using Naïve Bayes Classifier",
2013 IEEE International Conference on Big Data.

Rui Máximo Esteves and Chunming Rong, "Using Mahout for Clustering
Wikipedia's Latest Articles", 2011 Third IEEE International Conference on Cloud
Computing Technology and Science.

Kathleen Ericson and Shrideep Pallickara, "On the Performance of Distributed
Data Clustering Algorithms in File and Streaming Processing Systems", 2011
Fourth IEEE International Conference on Utility and Cloud Computing.

https://mahout.apache.org/

Sean Owen and Robin Anil, Mahout in Action, Manning Publications.



THANK YOU