



3. Data science problems come in various shapes and sizes. I am saying "data
science" because you asked for "useful applications". Handling a data science problem
involves the following stages:

Define your business/research question.

Collect and munge suitable data.
Clean the data.
Model the data using machine learning.
Integrate the machine learning part into production.
Communicate the results.

4. Instead of asking just WHAT, I think it is also important to know WHY.

WHAT: Linear Algebra WHY: Most of the machine learning we do deals
with scalars, vectors and matrices: vectors of features, matrices of weights, etc.
You do vector-matrix multiplication in, say, logistic regression and neural networks. Or
you transpose a matrix first and then multiply (for example, in error back-propagation
in neural networks). Sometimes you need to cluster the input, maybe using spectral
clustering techniques, which requires you to know what eigenvalues and
eigenvectors are. Sometimes you need to take inverses of matrices, say when computing
the inverse of a covariance matrix to fit a Gaussian distribution. So you now know WHY
you need Linear Algebra.
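
As a rough illustration of the matrix-vector product and the transpose step mentioned above, here is a minimal pure-Python sketch. The weight matrix W, the feature vector x and the error signal err are made-up values, and the helper names matvec and transpose are hypothetical, not taken from any library.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def transpose(W):
    """Swap the rows and columns of W."""
    return [list(col) for col in zip(*W)]

def sigmoid(z):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumed example values: a 2x3 weight matrix and a 3-dimensional feature vector.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
x = [2.0, 1.0, 2.0]

z = matvec(W, x)                  # forward pass: W x
probs = [sigmoid(v) for v in z]   # one logistic output per row of W

# Backward direction, as in back-propagation: multiply by the transpose.
err = [1.0, -1.0]                 # assumed error signal on the two outputs
back = matvec(transpose(W), err)  # W^T err has one entry per input feature
```

In practice a library such as NumPy would do this far faster; the point here is only to show what the operations compute.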
WHAT: Optimization Theory WHY: How do you train the weights of your model so
that the training error is minimized? Answer: optimization. You may need to know how
to take derivatives of a loss function with respect to some parameter so that you can
carry out gradient descent optimization. You may need to know
what gradients mean, and what Hessians are if you are doing second-order (or
quasi-Newton) optimization such as L-BFGS. You may need to learn what Newton steps are,
maybe to solve line searches. You will need to understand functional derivatives to better
understand Gradient Boosted Decision Trees. You will need to understand convergence properties
of various optimization methods to get an idea of how fast or slow your algorithm will converge.
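
To make the gradient descent idea above concrete, here is a minimal sketch on a one-parameter loss. The loss function, the learning rate and the number of steps are all assumptions chosen for illustration; real models have many parameters and usually need a tuned step size.

```python
def loss(w):
    """Assumed toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

w_star = gradient_descent(w0=0.0)
```

With these settings the distance to the minimum shrinks by a constant factor of 0.8 per step, which is a small example of the convergence behaviour the paragraph refers to.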
WHAT: Probability and Statistics WHY: When you are doing machine learning, you
are primarily after some kind of distribution. What is the probability of an output given
my input? Why do I need this? When your machine learning model assigns
high enough probabilities to known observations, you know you have a good model at
hand. It is a goodness criterion. Statistics help you to count well, normalize well, obtain
distributions, and find the mean of your input feature and its standard deviation. Why
do you need these things? You need means and variances to better normalize your input
data before you feed it into your machine learning system. This helps in faster
convergence (an optimization theory concept).
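
The normalization step described above can be sketched with the standard library's statistics module. The heights_cm values are invented sample data, and standardize is a hypothetical helper name.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Assumed sample feature: heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardize(heights_cm)
```

After this step every feature lives on a comparable scale, which is what tends to help gradient-based training converge faster.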
WHAT: Signal Processing WHY: You usually do not feed raw input to your machine
learning systems. You do some kind of preprocessing. For instance, you would like to
extract some features from the input speech signal, or an image. Extracting these
features requires you to know the properties of the underlying signals. Digital signal
processing or image processing will help you gain that expertise. You would be in a better
position to know which feature extraction works and which does not. You would want to
learn what a Fourier transform is because maybe you would like to apply it to a
speech signal, or maybe apply a discrete cosine transform to images before using them
as features to your machine learning system.
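
As a small illustration of the Fourier transform mentioned above, here is a naive discrete Fourier transform in pure Python. The test signal (a cosine completing 2 cycles over 8 samples) is an assumption, and real code would normally use a fast FFT routine from a numerical library rather than this O(n^2) loop.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform of a list of samples."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Assumed signal: a cosine that completes 2 cycles over 8 samples.
n = 8
signal = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]

magnitudes = [abs(c) for c in dft(signal)]
# Index of the strongest frequency bin among the non-negative frequencies.
dominant = max(range(n // 2 + 1), key=lambda k: magnitudes[k])
```

The magnitudes per frequency bin are the kind of values that could serve as features for a learning system.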
5. Calculus: You need to know how to calculate derivatives/gradients, because one of the
most common optimization methods used in machine learning, gradient
descent, needs to compute the gradient.
Linear Algebra: You should be comfortable with representations and
computations in terms of vectors and matrices instead of single numbers. Also,
sometimes you may need to apply calculus together with linear algebra, such as to
find the derivatives of a function with respect to a vector (sounds mysterious?).
Concepts like transpose and inverse are also fundamental.
Probability Theory: You need to master some basic concepts like conditional
probability, independence (these two are used in a classic ML algorithm named
naive Bayes).
Statistics: You should be familiar with expectation, variance, standard
deviation and probability distributions (ideally in higher dimensions, not just
one dimension).
Optimization: Well, perhaps you do not need to worry about this since most
ML courses will come together with some optimization methods that you will use,
like gradient descent.
Last but not least, you may first need to decide what style/flavor of ML
you want to learn.
If you want to focus on the mathematical theory of ML (what and why), then it
will require a solid background in maths, as in COS 511 (Spring 2014),
offered by Rob Schapire at Princeton. If you would rather focus on
implementations (how), then it will be less mathy (easier :) ). In fact, you probably
need not understand too much about the mathematical underpinnings of an ML
algorithm to make it work for you. Instead, you just need to remember some tricks.
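
The "derivative of a function with respect to a vector" mentioned above is less mysterious than it sounds: it is just one partial derivative per component. The sketch below computes the gradient of a squared-error function analytically and checks it against a finite-difference approximation. The data X, y and the weight vector w are made-up values, and the function names are hypothetical.

```python
def squared_error(w, X, y):
    """Sum over examples of (w . x - target)^2."""
    return sum(
        (sum(wj * xj for wj, xj in zip(w, row)) - target) ** 2
        for row, target in zip(X, y)
    )

def gradient(w, X, y):
    """Analytic gradient: 2 * sum of residual * x, one entry per weight."""
    g = [0.0] * len(w)
    for row, target in zip(X, y):
        residual = sum(wj * xj for wj, xj in zip(w, row)) - target
        for j, xj in enumerate(row):
            g[j] += 2.0 * residual * xj
    return g

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, perturbing one component at a time."""
    g = []
    for j in range(len(w)):
        up = list(w)
        up[j] += eps
        down = list(w)
        down[j] -= eps
        g.append((f(up) - f(down)) / (2 * eps))
    return g

# Assumed toy data and weights.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
w = [0.5, -0.5]

analytic = gradient(w, X, y)
numeric = numerical_gradient(lambda v: squared_error(v, X, y), w)
```

Comparing the two gradients like this is a common sanity check when deriving gradients by hand.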

6. Things required to learn Machine learning

1. Python/C++/R/Java - you will probably want to learn all of these languages at
some point if you want a job in machine learning. Python's NumPy and SciPy
libraries [2] are awesome because they have similar functionality to MATLAB, but
can be easily integrated into a web service and also used in Hadoop (see below).
C++ will be needed to speed code up. R [3] is great for statistics and plots, and Hadoop
[4] is written in Java, so you may need to implement mappers and reducers in Java
(although you could use a scripting language via Hadoop streaming [5]).
2. Probability and Statistics: A good portion of learning algorithms are based on
this theory: Naive Bayes [6], Gaussian Mixture Models [7], Hidden Markov Models
[8], to name a few. You need to have a firm understanding of probability and stats
to understand these models. Go nuts and study measure theory [9]. Use statistics as
a model evaluation metric: confusion matrices, receiver operating characteristic (ROC) curves, p-values, etc.

3. Applied Math + Algorithms: For discriminative models like SVMs [10], you
need to have a firm understanding of algorithm theory. Even though you will
probably never need to implement an SVM from scratch, it helps to understand how
the algorithm works. You will need to understand subjects like convex optimization
[11], gradient descent [12], quadratic programming [13], Lagrange multipliers [14], partial
differential equations [15], etc. Get used to looking at summations [16].
4. Distributed Computing: Most machine learning jobs require working with
large data sets these days (see Data Science) [17]. You cannot process this data on a
single machine; you will have to distribute it across an entire cluster. Projects like
Apache Hadoop [4] and cloud services like Amazon's EC2 [18] make this very easy
and cost-effective. Although Hadoop abstracts away a lot of the hard-core
distributed computing problems, you still need to have a firm understanding of
map-reduce [22], distributed file systems [19], etc. You will most likely want to check
out Apache Mahout [20] and Apache Whirr [21].
5. Expertise in Unix Tools: Unless you are very fortunate, you are going to need
to modify the format of your data sets so they can be loaded into R, Hadoop, HBase
[23], etc. You can use a scripting language like Python (using re) to do this, but the
best approach is probably just to master all of the awesome Unix tools that were
designed for this: cat [24], grep [25], find [26], awk [27], sed [28], sort [29], cut
[30], tr [31], and many more. Since all of the processing will most likely be on a
Linux-based machine (Hadoop doesn't run on Windows, I believe), you will have access to
these tools. You should learn to love them and use them as much as possible. They
certainly have made my life a lot easier. A great example can be found here [1].
6. Become familiar with the Hadoop sub-projects: HBase, Zookeeper [32],
Hive [33], Mahout, etc. These projects can help you store/access your data, and they
7. Learn about advanced signal processing techniques: feature extraction is
one of the most important parts of machine learning. If your features suck, no
matter which algorithm you choose, you're going to see horrible performance.
Depending on the type of problem you are trying to solve, you may be able to utilize
really cool advanced signal processing algorithms like wavelets [42], shearlets [43],
curvelets [44], contourlets [45], bandlets [46]. Learn about time-frequency analysis
[47], and try to apply it to your problems. If you have not read about Fourier
Analysis [48] and Convolution [49], you will need to learn about this stuff too. The
latter is signal processing 101 stuff though.
Finally, practice and read as much as you can. In your free time, read papers like
Google Map-Reduce [34], Google File System [35], Google Big Table [36], The
Unreasonable Effectiveness of Data [37], etc. There are great free machine learning
books online and you should read those also [38][39][40]. Here is an awesome
course I found and re-posted on GitHub [41]. Instead of using open source packages,
code up your own, and compare the results. If you can code an SVM from scratch,
you will understand the concept of support vectors, gamma, cost, hyperplanes, etc.
It's easy to just load some data up and start training; the hard part is making sense
of it all.
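
To give a feel for the mappers and reducers mentioned in the list above, here is a minimal single-machine sketch of the map-reduce pattern as a word count. It only imitates the three phases (map, shuffle, reduce) in plain Python; it is not Hadoop code, and the function names and the input lines are assumptions.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: combine the values collected for one key."""
    return key, sum(values)

# Assumed input; on a cluster each line could live on a different machine.
lines = ["the quick brown fox", "the lazy dog", "The fox"]

pairs = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster the framework handles the shuffle and runs many mappers and reducers in parallel; the mapper and reducer are the only parts you normally write yourself.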



The four strategies are:

Study a Machine Learning Tool
Study a Machine Learning Dataset
Study a Machine Learning Algorithm
Implement a Machine Learning Algorithm.
In detail:
5. Study a Machine Learning Tool: Select a tool or library that you like and learn
how to use it well.
6. Study a Machine Learning Dataset: Select a dataset and understand it
intimately and discover which algorithm class or type addresses it the best.
7. Study a Machine Learning Algorithm: Select an algorithm and understand it
intimately and discover parameter configurations that are stable across different
8. Implement a Machine Learning Algorithm: Select an algorithm and implement
or port an existing implementation to a language of your choice.

Why tools? Because they can automate parts of the project and give results faster.
When you are working on a large project, machine learning tools can help you to
prototype a solution, figure out the requirements and give you a template for the system
that you may want to implement.
One useful way to think about machine learning tools is to separate them
into Platforms and Libraries. A platform provides all you need to run a project, whereas
a library only provides discrete capabilities or parts of what you need to complete a project.


A machine learning platform provides capabilities to complete a machine learning project
from beginning to end. Namely, some data analysis, data preparation, modeling and
algorithm evaluation and selection.

Examples of machine learning platforms are:

WEKA Machine Learning Workbench.

R Platform.
Subset of the Python SciPy (e.g. Pandas and scikit-learn).

Machine Learning Library

A machine learning library provides capabilities for completing part of a machine learning
project. For example a library may provide a collection of modelling algorithms.
Examples of machine learning libraries are:

scikit-learn in Python.
JSAT in Java.
Accord Framework in .NET

Some examples of machine learning tools with a graphical interface include:


Application Programming Interface

Machine learning tools can provide an application programming interface giving you the
flexibility to decide what elements to use and exactly how to use them within your own programs.
Some examples of machine learning tools with application programming interfaces include:

Pylearn2 for Python

Deeplearning4j for Java

Machine learning algorithms can be divided into 3 broad categories: supervised
learning, unsupervised learning, and reinforcement learning.


Supervised Learning

Input data is called training data and has a known label or result, such as
spam/not-spam or a stock price at a time.
A model is prepared through a training process in which it is required to make
predictions and is corrected when those predictions are wrong. The training process
continues until the model achieves a desired level of accuracy on the training data.
Example problems are classification and regression.
Example algorithms include Logistic Regression and the Back Propagation Neural Network.
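
The "predict, then correct when wrong" training loop described above can be sketched with a perceptron, swapped in here because it is one of the simplest error-driven learners (it is not one of the algorithms named in the text). The training data (logical AND), the learning rate and the epoch limit are assumptions.

```python
def predict(weights, bias, x):
    """Output 1 if the weighted sum is positive, otherwise 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0

def train_perceptron(data, epochs=20, lr=1.0):
    """Adjust weights only when a prediction on the training data is wrong."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in data:
            error = label - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # desired accuracy on the training data reached
            break
    return weights, bias

# Assumed labelled training data: the logical AND of two inputs.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(data)
predictions = [predict(weights, bias, x) for x, _ in data]
```

Training stops as soon as a full pass over the data produces no mistakes, mirroring the stopping rule in the description.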

Unsupervised Learning

Input data is not labeled and does not have a known result.

A model is prepared by deducing structures present in the input data. This may be to
extract general rules. It may be through a mathematical process to systematically reduce
redundancy, or it may be to organize data by similarity.
Example problems are clustering, dimensionality reduction and association rule learning.
Example algorithms include: the Apriori algorithm and k-Means.
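
As a rough sketch of organizing unlabeled data by similarity, here is k-Means on one-dimensional points. The points, the starting centroids and the fixed iteration count are assumptions; practical implementations work in many dimensions and choose starting centroids more carefully.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Assumed unlabeled data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
```

No labels are used anywhere: the grouping comes only from the distances between points.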

Semi-Supervised Learning

Input data is a mixture of labeled and unlabelled examples.

There is a desired prediction problem but the model must learn the structures to
organize the data as well as make predictions.

Example problems are classification and regression.

Example algorithms are extensions to other flexible methods that make assumptions
about how to model the unlabeled data.

When crunching data to model business decisions, you are most typically using
supervised and unsupervised learning methods.
A hot topic at the moment is semi-supervised learning methods in areas such as image
classification where there are large datasets with very few labeled examples.

Algorithms Grouped By Similarity

Algorithms are often grouped by similarity in terms of their function (how they work). For
example, tree-based methods, and neural network inspired methods.
I think this is the most useful way to group algorithms and it is the approach we will use here.
This is a useful grouping method, but it is not perfect. There are still algorithms that could
just as easily fit into multiple categories like Learning Vector Quantization that is both a
neural network inspired method and an instance-based method. There are also
categories that have the same name that describe the problem and the class of
algorithm such as Regression and Clustering.
We could handle these cases by listing algorithms twice or by selecting the group that
subjectively is the best fit. I like this latter approach of not duplicating algorithms to
keep things simple.

In this section, I list many of the popular machine learning algorithms grouped the way I
think is the most intuitive. The list is not exhaustive in either the groups or the algorithms,
but I think it is representative and will be useful to you to get an idea of the lay of the land.
Please Note: There is a strong bias towards algorithms used for classification and
regression, the two most prevalent supervised machine learning problems you will encounter.
If you know of an algorithm or a group of algorithms not listed, put it in the comments
and share it with us. Let's dive in.

Regression Algorithms

Regression is concerned with modeling the relationship between variables that is
iteratively refined using a measure of error in the predictions
made by the model.
Regression methods are a workhorse of statistics and have been co-opted into statistical
machine learning. This may be confusing because we can use regression to refer to the
class of problem and the class of algorithm. Really, regression is a process.
The most popular regression algorithms are:

Ordinary Least Squares Regression (OLSR)

Linear Regression
Logistic Regression
Stepwise Regression
Multivariate Adaptive Regression Splines (MARS)
Locally Estimated Scatterplot Smoothing (LOESS)
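
As a small example from this group, here is ordinary least squares for a single input variable. Note that this sketch uses the closed-form solution rather than the iterative refinement described above, and the data points are invented to lie on the line y = 2x + 1.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Assumed data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 4.0 + intercept  # predicted value for a new input x = 4
```

The error being minimized is the sum of squared differences between the predictions and the observed values.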

Instance-based Algorithms

An instance-based learning model is a decision problem with instances or examples
of training data that are deemed important or required to the model.
Such methods typically build up a database of example data and compare new data to
the database using a similarity measure in order to find the best match and make a
prediction. For this reason, instance-based methods are also called winner-take-all
methods and memory-based learning. Focus is put on the representation of the stored
instances and similarity measures used between instances.
The most popular instance-based algorithms are:

k-Nearest Neighbor (kNN)

Learning Vector Quantization (LVQ)
Self-Organizing Map (SOM)
Locally Weighted Learning (LWL)
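
The "database of examples plus a similarity measure" idea can be sketched with k-Nearest Neighbor. The stored examples, the choice of Euclidean distance and k = 3 are assumptions made for this illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance used here as the (dis)similarity measure."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, query, k=3):
    """Vote among the labels of the k stored examples closest to the query."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Assumed stored examples: (features, label).
train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((8.0, 8.0), "b"), ((9.0, 8.5), "b"), ((8.5, 9.5), "b"),
]
near_a = knn_predict(train, (2.0, 2.0))
near_b = knn_predict(train, (8.0, 9.0))
```

There is no training step at all: the stored instances are the model, which is why these methods are also called memory-based.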

Regularization Algorithms

An extension made to another method (typically regression methods) that
penalizes models based on their complexity, favoring simpler
models that are also better at generalizing.
I have listed regularization algorithms separately here because they are popular,
powerful and generally simple modifications made to other methods.
The most popular regularization algorithms are:

Ridge Regression
Least Absolute Shrinkage and Selection Operator (LASSO)
Elastic Net
Least-Angle Regression (LARS)
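
To show how a complexity penalty changes a model, here is ridge regression reduced to its simplest case: one feature, no intercept. Minimizing sum((y - w*x)^2) + penalty * w^2 gives w = sum(x*y) / (sum(x^2) + penalty). The data and penalty values are assumptions.

```python
def ridge_weight(xs, ys, penalty):
    """Closed-form ridge solution for a single weight and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

# Assumed data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = ridge_weight(xs, ys, penalty=0.0)    # no penalty: ordinary least squares
w_ridge = ridge_weight(xs, ys, penalty=14.0)   # penalty shrinks the weight toward 0
```

A larger penalty pulls the weight further toward zero, trading some fit on the training data for a simpler model.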

Decision Tree Algorithms

Decision tree methods construct a model of decisions made based on actual values
of attributes in the data.

Decisions fork in tree structures until a prediction decision is made for a given record.
Decision trees are trained on data for classification and regression problems. Decision
trees are often fast and accurate and a big favorite in machine learning.
The most popular decision tree algorithms are:

Classification and Regression Tree (CART)

Iterative Dichotomiser 3 (ID3)
C4.5 and C5.0 (different versions of a powerful approach)
Chi-squared Automatic Interaction Detection (CHAID)
Decision Stump
Conditional Decision Trees
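
The simplest tree in the list above is the decision stump: a single fork on one attribute. The sketch below tries every observed value as a threshold and keeps the one with the fewest training errors. The data and the exhaustive search are assumptions; real tree learners usually pick splits with measures such as information gain or Gini impurity.

```python
def fit_stump(xs, labels):
    """Return (errors, threshold, left_label, right_label) for the best split.
    Points with x <= threshold get left_label, all others right_label."""
    best = None
    for threshold in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= threshold else right) != y
                for x, y in zip(xs, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    return best

def stump_predict(x, threshold, left, right):
    """Follow the single fork to a prediction."""
    return left if x <= threshold else right

# Assumed one-attribute data with a clean split between 3 and 6.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]
errors, threshold, left, right = fit_stump(xs, labels)
```

A full decision tree repeats this kind of split recursively on each side of the fork.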

Bayesian Algorithms

Bayesian methods are those that explicitly apply Bayes' Theorem for problems such
as classification and regression.
The most popular Bayesian algorithms are:

Naive Bayes
Gaussian Naive Bayes
Multinomial Naive Bayes
Averaged One-Dependence Estimators (AODE)
Bayesian Belief Network (BBN)
Bayesian Network (BN)
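
As a minimal worked example of Bayes' Theorem in a classification setting, the sketch below computes the probability that a message is spam given that it contains a particular word. All three input probabilities are invented for illustration.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(class | evidence) via Bayes' Theorem for a two-class problem."""
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

# Assumed numbers: 20% of mail is spam; the word appears in 60% of spam
# and in 5% of non-spam messages.
p_spam_given_word = posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.05)
```

Naive Bayes applies the same rule with many words at once, under the simplifying assumption that the words are independent given the class.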

Clustering Algorithms

Clustering, like regression, describes the class of problem and the class of methods.
Clustering methods are typically organized by the modeling approaches such as
centroid-based and hierarchical. All methods are concerned with using the inherent
structures in the data to best organize the data into groups of maximum commonality.
The most popular clustering algorithms are:

Expectation Maximisation (EM)
Hierarchical Clustering

Association Rule Learning Algorithms

Association rule learning methods extract rules that best explain observed
relationships between variables in data.
These rules can discover important and commercially useful associations in large
multidimensional datasets that can be exploited by an organization.
The most popular association rule learning algorithms are:

Apriori algorithm
Eclat algorithm
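
Algorithms such as Apriori score candidate rules with two basic quantities, support and confidence. The sketch below computes both for a rule "bread -> milk" on a made-up set of shopping baskets; it shows only the scoring, not the full candidate search.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Assumed shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

Rules with both high support and high confidence are the ones usually reported as useful associations.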

Artificial Neural Network Algorithms

Artificial Neural Networks are models that are inspired by the structure and/or
function of biological neural networks.
They are a class of pattern-matching methods commonly used for regression and
classification problems, but are really an enormous subfield comprising hundreds of
algorithms and variations for all manner of problem types.

Note that I have separated out Deep Learning from neural networks because of the
massive growth and popularity in the field. Here we are concerned with the more
classical methods.
The most popular artificial neural network algorithms are:

Hopfield Network
Radial Basis Function Network (RBFN)

Deep Learning Algorithms

Deep Learning methods are a modern update to Artificial Neural Networks that
exploit abundant cheap computation.
They are concerned with building much larger and more complex neural networks and,
as commented on above, many methods are concerned with semi-supervised learning
problems where large datasets contain very little labeled data.
The most popular deep learning algorithms are:

Deep Boltzmann Machine (DBM)

Deep Belief Networks (DBN)
Convolutional Neural Network (CNN)
Stacked Auto-Encoders

Dimensionality Reduction Algorithms

Like clustering methods, dimensionality reduction seeks and exploits the inherent
structure in the data, but in this case in an unsupervised manner
in order to summarize or describe data using less information.
This can be useful to visualize high-dimensional data or to simplify data which can then be
used in a supervised learning method. Many of these methods can be adapted for use in
classification and regression.

Principal Component Analysis (PCA)

Principal Component Regression (PCR)
Partial Least Squares Regression (PLSR)
Sammon Mapping
Multidimensional Scaling (MDS)
Projection Pursuit
Linear Discriminant Analysis (LDA)
Mixture Discriminant Analysis (MDA)
Quadratic Discriminant Analysis (QDA)
Flexible Discriminant Analysis (FDA)
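
To illustrate describing data with less information, here is a rough sketch of the first step of Principal Component Analysis: finding the direction of greatest variance. It uses power iteration on the covariance matrix, which is one simple way to get the leading eigenvector; the data points, the starting vector and the iteration count are assumptions, and library implementations typically use more robust decompositions.

```python
import math

def covariance_matrix(points):
    """Population covariance matrix of a list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dims)]
    centered = [[p[j] - means[j] for j in range(dims)] for p in points]
    return [
        [sum(row[a] * row[b] for row in centered) / n for b in range(dims)]
        for a in range(dims)
    ]

def first_component(cov, iterations=100):
    """Power iteration: repeatedly multiply by cov and renormalize.
    Assumes the start vector is not orthogonal to the leading direction."""
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        v = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Assumed 2-D data lying along the direction (1, 2).
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction = first_component(covariance_matrix(points))

# Each point can now be summarized by a single number: its projection.
projected = [sum(p_j * d_j for p_j, d_j in zip(p, direction)) for p in points]
```

Here two coordinates per point are replaced by one projection value without losing any of the variation, because the toy data lie on a line.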

Ensemble Algorithms

Ensemble methods are models composed of multiple weaker models that are
independently trained and whose predictions are combined in
some way to make the overall prediction.
Much effort is put into what types of weak learners to combine and the ways in which to
combine them. This is a very powerful class of techniques and as such is very popular.

Bootstrapped Aggregation (Bagging)
Stacked Generalization (blending)
Gradient Boosting Machines (GBM)
Gradient Boosted Regression Trees (GBRT)
Random Forest
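
One common way to combine the predictions of several weak models is a majority vote. In the sketch below the three weak models are hand-written rules standing in for independently trained models, which is a simplification; the rules and inputs are assumptions.

```python
from collections import Counter

def majority_vote(models, x):
    """Return the label predicted by most of the models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Assumed weak models: each looks at the input in a different, crude way.
models = [
    lambda x: int(x[0] > 5),
    lambda x: int(x[1] > 5),
    lambda x: int(x[0] + x[1] > 8),
]

vote_low = majority_vote(models, (6, 1))   # individual votes: 1, 0, 0
vote_high = majority_vote(models, (6, 4))  # individual votes: 1, 0, 1
```

Bagging and Random Forest follow the same combining idea, but train each weak model on a different random sample of the data.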


-- Python basics are required first: functions, lists, dictionaries, tuples.
-- Then you need ML algorithms such as linear regression, the kNN method and k-means.
-- After that you need Random Forest and XGBoost methods to make the models
designed above more precise and less error-prone.
-- Underlying basics.
-- Study Python format characters.