This document covers data science in brief and the necessary things to learn. It is written as notes only. Thanks.

© All Rights Reserved


2. SVM - SUPPORT VECTOR MACHINES - learn this

3. Data science problems come in various shapes and sizes. I am saying "data science" because you asked for "useful applications". To handle a data science problem you need to work through the following stages:

Collect and munge suitable data.

Clean the data.

Model the data using machine learning.

Integrate the machine learning part into production.

Communicate the results.

WHAT: Linear Algebra WHY: Most of the machine learning that we do deals with scalars, vectors and matrices: vectors of features, matrices of weights, etc. You do vector-matrix multiplication in, say, logistic regression or neural networks. Or you take a matrix transpose first and then multiply (for example in error backpropagation in neural networks). Sometimes you need to cluster the input, maybe using spectral clustering techniques, which requires you to know what eigenvalues and eigenvectors are. Sometimes you need to take inverses of matrices, say when computing the inverse of a covariance matrix for fitting a Gaussian distribution. So you now know WHY you need Linear Algebra.
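
To make the linear algebra point concrete, here is a minimal sketch in plain Python of the matrix-vector product used in a logistic-regression-style forward pass, plus the transpose used in backpropagation. The weights and feature values are made up for illustration, and the helper names (mat_vec, transpose, sigmoid) are hypothetical, not taken from any library.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def transpose(matrix):
    """Swap the rows and columns of a matrix."""
    return [list(column) for column in zip(*matrix)]

def sigmoid(z):
    """Squash a score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weight matrix: 2 outputs, 3 input features.
weights = [[0.5, -1.0, 0.25],
           [1.0, 0.0, -0.5]]
features = [2.0, 1.0, 4.0]

scores = mat_vec(weights, features)            # one score per output
probabilities = [sigmoid(s) for s in scores]   # logistic-style outputs
weights_t = transpose(weights)                 # shape used when passing errors backwards
```

In practice a library such as NumPy would do this work; the loops are spelled out here only to show what the operations are.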

WHAT: Optimization Theory WHY: How do you train the weights of your model so that the training error is minimized? Answer: optimization. You may need to know how to take derivatives of a loss function with respect to some parameter so that you can carry out gradient descent optimization. You may need to know what gradients mean, and what Hessians are if you are doing second-order (or quasi-Newton) optimization like L-BFGS. You may need to learn what Newton steps are, maybe to solve line searches. You will need to understand functional derivatives to better understand Gradient Boosted Decision Trees. You will need to understand the convergence properties of various optimization methods to get an idea of how fast or slow your algorithm will run.
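
As a rough illustration of gradient descent, the sketch below minimizes a toy one-parameter loss. The loss function, learning rate and step count are assumptions chosen for the example, not values from these notes.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

w_final = gradient_descent(start=0.0)
```

With these settings the distance to the minimum shrinks by a constant factor each step, which is the kind of convergence property the paragraph above refers to.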

WHAT: Probability and Statistics WHY: When you are doing machine learning, you are primarily after some kind of distribution. What is the probability of an output given my input? Why do I need this? When your machine learning model assigns high enough probabilities to known observations, you know you have a good model at hand. It is a goodness criterion. Statistics helps you to count well, normalize well, obtain distributions, and find the mean of your input feature and its standard deviation. Why do you need these things? You need means and variances to better normalize your input data before you feed it into your machine learning system. This helps in faster convergence (an optimization theory concept).
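
The normalization step mentioned above can be sketched with Python's standard statistics module. The feature values are invented, and standardize is a hypothetical helper name.

```python
import statistics

def standardize(values):
    """Shift a feature to mean 0 and scale it to standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical raw feature, e.g. heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardize(heights_cm)
```

After this step every feature lives on a comparable scale, which is what helps gradient-based training converge faster.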

WHAT: Signal Processing WHY: You usually do not feed raw input to your machine learning systems. You do some kind of preprocessing. For instance, you would like to extract some features from an input speech signal or an image. Extracting these features requires you to know the properties of the underlying signals. Digital signal processing or image processing will help you gain this expertise. You would be in a better position to know which feature extraction works and which does not. You would want to learn what a Fourier transform is, because maybe you would like to apply it to a speech signal, or maybe apply a discrete cosine transform to images before using them as features for your machine learning system.
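
For a feel of what a Fourier transform gives you as a feature, here is a naive discrete Fourier transform in plain Python. It is a slow textbook sketch, not how a real library computes it, and the test signal (a sine wave with 2 cycles over 16 samples) is an assumption made for the example.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a list of samples."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Hypothetical signal: a sine wave completing 2 cycles over 16 samples.
samples = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
magnitudes = [abs(c) for c in dft(samples)]

# Strongest frequency bin in the first half of the spectrum.
peak_bin = max(range(8), key=lambda k: magnitudes[k])
```

The magnitudes per frequency bin are the sort of features you might feed to a model instead of the raw samples.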

5. Calculus: You need to know how to calculate derivatives/gradients, because one of the most common optimization methods used in machine learning, gradient descent, needs to compute the gradient.

Linear Algebra: You should be comfortable with representations and computations in terms of vectors and matrices instead of single numbers. Sometimes you may also need to apply calculus together with linear algebra, such as to find the derivatives of a function with respect to a vector (sounds mysterious?). Concepts like transpose and inverse are also fundamental.

Probability Theory: You need to master some basic concepts like conditional probability and independence (these two are used in a classic ML algorithm named naive Bayes).

Statistics: You should be familiar with expectation, variance, standard deviation and probability distributions (preferably in higher dimensions instead of just one dimension).

Optimization: Well, perhaps you do not need to worry about this, since most ML courses come together with some optimization methods that you will use, like gradient descent.

Last but not least, you may first need to decide what style/flavor of ML you want to learn.

If you want to focus on the mathematical theory of ML (what and why), then it will require a solid background in maths, as in the COS 511 (Spring 2014) course offered by Rob Schapire at Princeton. If you would rather focus on implementations (how), then it will be less mathy (easier :) ). In fact, you probably need not understand too much about the mathematical underpinnings of an ML algorithm to make it work for you. Instead, you just need to remember some tricks.

==============

1. Python/C++/R/Java - you will probably want to learn all of these languages at some point if you want a job in machine learning. Python's NumPy and SciPy libraries [2] are awesome because they have similar functionality to MATLAB, but can be easily integrated into a web service and also used in Hadoop (see below). C++ will be needed to speed code up. R [3] is great for statistics and plots, and Hadoop [4] is written in Java, so you may need to implement mappers and reducers in Java (although you could use a scripting language via Hadoop streaming [5]).

2. Probability and Statistics: A good portion of learning algorithms are based on this theory: Naive Bayes [6], Gaussian Mixture Models [7], Hidden Markov Models [8], to name a few. You need to have a firm understanding of probability and statistics to understand these models. Go nuts and study measure theory [9]. Use statistics for model evaluation: confusion matrices, receiver operating characteristic (ROC) curves, p-values, etc.

3. Applied Math + Algorithms: For discriminative models like SVMs [10], you need to have a firm understanding of algorithm theory. Even though you will probably never need to implement an SVM from scratch, it helps to understand how the algorithm works. You will need to understand subjects like convex optimization [11], gradient descent [12], quadratic programming [13], Lagrange multipliers [14], partial differential equations [15], etc. Get used to looking at summations [16].

4. Distributed Computing: Most machine learning jobs require working with large data sets these days (see Data Science) [17]. You cannot process this data on a single machine; you will have to distribute it across an entire cluster. Projects like Apache Hadoop [4] and cloud services like Amazon's EC2 [18] make this very easy and cost-effective. Although Hadoop abstracts away a lot of the hard-core distributed computing problems, you still need to have a firm understanding of map-reduce [22], distributed file systems [19], etc. You will most likely want to check out Apache Mahout [20] and Apache Whirr [21].

5. Expertise in Unix Tools: Unless you are very fortunate, you are going to need to modify the format of your data sets so they can be loaded into R, Hadoop, HBase [23], etc. You can use a scripting language like Python (using re) to do this, but the best approach is probably just to master all of the awesome Unix tools that were designed for this: cat [24], grep [25], find [26], awk [27], sed [28], sort [29], cut [30], tr [31], and many more. Since all of the processing will most likely be on a Linux-based machine (Hadoop doesn't run on Windows, I believe), you will have access to these tools. You should learn to love them and use them as much as possible. They certainly have made my life a lot easier. A great example can be found here [1].

6. Become familiar with the Hadoop sub-projects: HBase, Zookeeper [32],

Hive [33], Mahout, etc. These projects can help you store/access your data, and they

scale.

7. Learn about advanced signal processing techniques: feature extraction is one of the most important parts of machine learning. If your features suck, no matter which algorithm you choose, you are going to see horrible performance. Depending on the type of problem you are trying to solve, you may be able to utilize really cool advanced signal processing algorithms like wavelets [42], shearlets [43], curvelets [44], contourlets [45] and bandlets [46]. Learn about time-frequency analysis [47], and try to apply it to your problems. If you have not read about Fourier Analysis [48] and Convolution [49], you will need to learn about this stuff too. The latter is signal processing 101 stuff though.
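
To give a feel for the map-reduce idea from item 4, here is a single-machine word count written in the mapper / shuffle / reducer shape. It is only a sketch of the programming model in plain Python. It does not use Hadoop, and the function names and input lines are hypothetical.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(mapped).items())
```

On a real cluster the mappers and reducers run on different machines and the shuffle moves data between them, but the three roles stay the same.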

Finally, practice and read as much as you can. In your free time, read papers like Google Map-Reduce [34], Google File System [35], Google Big Table [36], The Unreasonable Effectiveness of Data [37], etc. There are great free machine learning books online and you should read those also [38][39][40]. Here is an awesome course I found and re-posted on GitHub [41]. Instead of using open source packages, code up your own, and compare the results. If you can code an SVM from scratch, you will understand the concept of support vectors, gamma, cost, hyperplanes, etc. It's easy to just load some data up and start training; the hard part is making sense of it all.
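
In the spirit of coding things up yourself, here is a small sketch of the confusion matrix counts mentioned in item 2 as an evaluation tool. The labels and predictions are made-up data, and confusion_counts is a hypothetical helper.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical true labels and model predictions.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)   # how many predicted positives were right
recall = tp / (tp + fn)      # how many actual positives were found
```

Precision and recall fall straight out of the four counts, which is why the confusion matrix is usually the first thing to look at.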

=========================================================

=========================================================

=========================================================

1. Study a Machine Learning Tool

2. Study a Machine Learning Dataset

3. Study a Machine Learning Algorithm

4. Implement a Machine Learning Algorithm

AGAIN IN DETAIL:

5. Study a Machine Learning Tool: Select a tool or library that you like and learn

how to use it well.

6. Study a Machine Learning Dataset: Select a dataset and understand it

intimately and discover which algorithm class or type addresses it the best.

7. Study a Machine Learning Algorithm: Select an algorithm and understand it

intimately and discover parameter configurations that are stable across different

datasets.

8. Implement a Machine Learning Algorithm: Select an algorithm and implement

or port an existing implementation to a language of your choice.

Why tools - because they can automate parts of the project and give results faster.

When you are working on a large project, machine learning tools can help you to

prototype a solution, figure out the requirements and give you a template for the system

that you may want to implement.

One useful way to think about machine learning tools is to separate them into Platforms and Libraries. A platform provides all you need to run a project, whereas a library only provides discrete capabilities or parts of what you need to complete a project.

A machine learning platform provides capabilities to complete a machine learning project from beginning to end, namely some data analysis, data preparation, modeling, and algorithm evaluation and selection.

Examples of machine learning platforms are:

R Platform.

Subset of the Python SciPy stack (e.g. Pandas and scikit-learn).

A machine learning library provides capabilities for completing part of a machine learning

project. For example a library may provide a collection of modelling algorithms.

Examples of machine learning libraries are:

scikit-learn in Python.

JSAT in Java.

Accord Framework in .NET

Examples of machine learning tools with a graphical interface are:

KNIME

RapidMiner

Orange

Machine learning tools can provide an application programming interface giving you the

flexibility to decide what elements to use and exactly how to use them within your own

programs.

Some examples of machine learning tools with application programming interfaces

include:

Deeplearning4j for Java

LIBSVM for C

Categories: supervised learning, unsupervised learning, and reinforcement learning.

-----------------------------------------------

Supervised Learning

Input data is called training data and has a known label or result such as spam/not-spam or a stock price at a time.

A model is prepared through a training process in which it is required to make

predictions and is corrected when those predictions are wrong. The training process

continues until the model achieves a desired level of accuracy on the training data.

Example problems are classification and regression.

Example algorithms include Logistic Regression and the Back Propagation Neural

Network.
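
The "make predictions and get corrected when wrong" training loop can be sketched with a perceptron, which is a swapped-in example here because it is about the simplest model trained this way. The dataset (logical AND), the learning rate and the function names are assumptions made for illustration.

```python
def predict(weights, bias, x):
    """Predict 1 if the weighted sum reaches zero or more, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation >= 0 else 0

def train_perceptron(data, labels, learning_rate=1.0, max_epochs=100):
    """Adjust weights only when a prediction is wrong; stop once none are."""
    weights = [0.0] * len(data[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(data, labels):
            error = y - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                bias += learning_rate * error
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
        if mistakes == 0:
            break
    return weights, bias

# Labeled training data: the logical AND of two inputs.
data = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(data, labels)
predictions = [predict(weights, bias, x) for x in data]
```

Training stops once the model makes no mistakes on the training data, which mirrors the "desired level of accuracy on the training data" stopping rule described above.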

Unsupervised Learning

Input data is not labeled and does not have a known result.

A model is prepared by deducing structures present in the input data. This may be to

extract general rules. It may be through a mathematical process to systematically reduce

redundancy, or it may be to organize data by similarity.

Example problems are clustering, dimensionality reduction and association rule learning.

Example algorithms include: the Apriori algorithm and k-Means.
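
As a small illustration of organizing unlabeled data by similarity, here is a one-dimensional k-Means sketch. The points, starting centroids and iteration count are invented for the example, and k_means_1d is a hypothetical helper rather than a library function.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep empty clusters in place
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabeled data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
```

No labels are used anywhere: the grouping comes only from the structure of the input data.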

Semi-Supervised Learning

Input data is a mixture of labeled and unlabeled examples.

There is a desired prediction problem but the model must learn the structures to

organize the data as well as make predictions.

Example algorithms are extensions to other flexible methods that make assumptions

about how to model the unlabeled data.

Overview

When crunching data to model business decisions, you are most typically using

supervised and unsupervised learning methods.

A hot topic at the moment is semi-supervised learning methods in areas such as image

classification where there are large datasets with very few labeled examples.

Algorithms are often grouped by similarity in terms of their function (how they work). For

example, tree-based methods, and neural network inspired methods.

I think this is the most useful way to group algorithms and it is the approach we will use

here.

This is a useful grouping method, but it is not perfect. There are still algorithms that could

just as easily fit into multiple categories like Learning Vector Quantization that is both a

neural network inspired method and an instance-based method. There are also

categories that have the same name that describe the problem and the class of

algorithm such as Regression and Clustering.

We could handle these cases by listing algorithms twice or by selecting the group that

subjectively is the best fit. I like this latter approach of not duplicating algorithms to

keep things simple.

In this section, I list many of the popular machine learning algorithms grouped the way I

think is the most intuitive. The list is not exhaustive in either the groups or the algorithms,

but I think it is representative and will be useful to you to get an idea of the lay of the

land.

Please Note: There is a strong bias towards algorithms used for classification and

regression, the two most prevalent supervised machine learning problems you will

encounter.

If you know of an algorithm or a group of algorithms not listed, put it in the comments and share it with us. Let's dive in.

Regression Algorithms

Regression is concerned with modeling the relationship between variables that is iteratively refined using a measure of error in the predictions made by the model.

Regression methods are a workhorse of statistics and have been co-opted into statistical

machine learning. This may be confusing because we can use regression to refer to the

class of problem and the class of algorithm. Really, regression is a process.

The most popular regression algorithms are:

Linear Regression

Logistic Regression

Stepwise Regression

Multivariate Adaptive Regression Splines (MARS)

Locally Estimated Scatterplot Smoothing (LOESS)

Instance-based Algorithms

Instance-based learning models a decision problem with instances or examples of training data that are deemed important or required to the model.

Such methods typically build up a database of example data and compare new data to

the database using a similarity measure in order to find the best match and make a

prediction. For this reason, instance-based methods are also called winner-take-all

methods and memory-based learning. Focus is put on the representation of the stored

instances and similarity measures used between instances.

The most popular instance-based algorithms are:

Learning Vector Quantization (LVQ)

Self-Organizing Map (SOM)

Locally Weighted Learning (LWL)
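
The "store examples and compare new data by similarity" idea can be sketched as a 1-nearest-neighbour lookup. This is a swapped-in illustration rather than one of the algorithms listed above, and the stored instances, labels and function names are all hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(stored, query):
    """Return the label of the stored instance closest to the query."""
    _, label = min(stored, key=lambda item: euclidean(item[0], query))
    return label

# The "database" of example data: (features, label) pairs.
stored = [
    ([1.0, 1.0], "small"),
    ([1.5, 2.0], "small"),
    ([8.0, 9.0], "large"),
    ([9.0, 8.5], "large"),
]

first = nearest_neighbor(stored, [2.0, 1.0])
second = nearest_neighbor(stored, [7.5, 8.0])
```

There is no training step at all here. The stored instances are the model, which is why the choice of similarity measure matters so much.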

Regularization Algorithms

An extension made to another method (typically regression methods) that penalizes models based on their complexity, favoring simpler models that are also better at generalizing.

I have listed regularization algorithms separately here because they are popular,

powerful and generally simple modifications made to other methods.

The most popular regularization algorithms are:

Ridge Regression

Least Absolute Shrinkage and Selection Operator (LASSO)

Elastic Net

Least-Angle Regression (LARS)
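
To show what "penalizing complexity" does, here is a sketch of ridge regression for the simplest possible case: one feature, no intercept. Under that assumption the penalized least-squares slope has the closed form sum(x*y) / (sum(x*x) + penalty). The data and the ridge_slope name are made up for the example.

```python
def ridge_slope(xs, ys, penalty):
    """Slope of y = w * x that minimizes squared error plus penalty * w**2."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)

# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

unpenalized = ridge_slope(xs, ys, penalty=0.0)   # ordinary least squares
shrunk = ridge_slope(xs, ys, penalty=30.0)       # weight pulled toward zero
```

A larger penalty shrinks the weight toward zero, trading a little training error for a simpler model.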

Decision Tree Algorithms

Decision tree methods construct a model of decisions made based on actual values of attributes in the data.

Decisions fork in tree structures until a prediction decision is made for a given record.

Decision trees are trained on data for classification and regression problems. Decision

trees are often fast and accurate and a big favorite in machine learning.

The most popular decision tree algorithms are:

Iterative Dichotomiser 3 (ID3)

C4.5 and C5.0 (different versions of a powerful approach)

Chi-squared Automatic Interaction Detection (CHAID)

Decision Stump

M5

Conditional Decision Trees
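
A decision stump, the one-split tree from the list above, is small enough to sketch in full. The version below tries every observed value of a single feature as a threshold and keeps the one with the fewest mistakes. The data and the fit_stump name are hypothetical, and real implementations usually score splits with measures such as information gain rather than raw error counts.

```python
def fit_stump(values, labels):
    """Find the threshold t such that 'predict 1 when value >= t'
    misclassifies the fewest records."""
    best_threshold, best_errors = None, len(labels) + 1
    for threshold in sorted(set(values)):
        errors = sum(
            1 for v, y in zip(values, labels)
            if (1 if v >= threshold else 0) != y
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

# Hypothetical records: customer age and whether they bought (1) or not (0).
ages = [18, 22, 25, 31, 40, 52]
bought = [0, 0, 0, 1, 1, 1]

threshold, errors = fit_stump(ages, bought)
```

A full decision tree repeats this kind of split recursively on each resulting subset of the data.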

Bayesian Algorithms

Bayesian methods are those that explicitly apply Bayes' Theorem for problems such as classification and regression.

The most popular Bayesian algorithms are:

Naive Bayes

Gaussian Naive Bayes

Multinomial Naive Bayes

Averaged One-Dependence Estimators (AODE)

Bayesian Belief Network (BBN)

Bayesian Network (BN)
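
Here is a minimal sketch of how Naive Bayes applies Bayes' Theorem to a spam/not-spam decision. It multiplies per-word likelihoods as if the words were independent (the "naive" assumption). All probabilities are invented numbers, and naive_bayes_posterior is a hypothetical helper.

```python
def naive_bayes_posterior(prior_spam, likelihoods_spam, likelihoods_ham):
    """P(spam | words), treating each word as independent given the class."""
    spam_score = prior_spam
    ham_score = 1.0 - prior_spam
    for p_word_spam, p_word_ham in zip(likelihoods_spam, likelihoods_ham):
        spam_score *= p_word_spam   # P(word | spam)
        ham_score *= p_word_ham     # P(word | not spam)
    # Normalize so the two class probabilities sum to 1.
    return spam_score / (spam_score + ham_score)

# Hypothetical message with two words and their per-class likelihoods.
posterior = naive_bayes_posterior(
    prior_spam=0.5,
    likelihoods_spam=[0.8, 0.5],
    likelihoods_ham=[0.2, 0.5],
)
```

The second word is equally likely in both classes, so it carries no evidence; only the first word moves the posterior away from the prior.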

Clustering Algorithms

Clustering, like regression, describes the class of problem and the class of methods.

Clustering methods are typically organized by the modeling approaches such as centroid-based and hierarchical. All methods are concerned with using the inherent structures in the data to best organize the data into groups of maximum commonality.

The most popular clustering algorithms are:

k-Means

k-Medians

Expectation Maximisation (EM)

Hierarchical Clustering

Association Rule Learning Algorithms

Association rule learning methods extract rules that best explain observed relationships between variables in data.

These rules can discover important and commercially useful associations in large

multidimensional datasets that can be exploited by an organization.

The most popular association rule learning algorithms are:

Apriori algorithm

Eclat algorithm
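
Both algorithms above rely on measures such as support and confidence to decide which rules are worth keeping. The sketch below computes those two measures for a toy set of shopping baskets. It is not an implementation of Apriori or Eclat, and the baskets and function names are made up.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions containing the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

bread_and_milk = support(baskets, {"bread", "milk"})
bread_implies_milk = confidence(baskets, {"bread"}, {"milk"})
```

A rule such as "bread implies milk" is reported only when both numbers clear thresholds chosen by the analyst.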

Artificial Neural Network Algorithms

Artificial neural networks are models that are inspired by the structure and/or function of biological neural networks.

They are a class of pattern matching methods that are commonly used for regression and classification problems but are really an enormous subfield comprising hundreds of algorithms and variations for all manner of problem types.

Note that I have separated out Deep Learning from neural networks because of the

massive growth and popularity in the field. Here we are concerned with the more

classical methods.

The most popular artificial neural network algorithms are:

Perceptron

Back-Propagation

Hopfield Network

Radial Basis Function Network (RBFN)

Deep Learning Algorithms

Deep learning methods are a modern update to artificial neural networks that exploit abundant cheap computation.

They are concerned with building much larger and more complex neural networks and,

as commented on above, many methods are concerned with semi-supervised learning

problems where large datasets contain very little labeled data.

The most popular deep learning algorithms are:

Deep Belief Networks (DBN)

Convolutional Neural Network (CNN)

Stacked Auto-Encoders

Dimensionality Reduction Algorithms

Like clustering methods, dimensionality reduction seeks and exploits the inherent structure in the data, but in this case in an unsupervised manner in order to summarize or describe data using less information.

This can be useful to visualize high-dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.

Principal Component Regression (PCR)

Partial Least Squares Regression (PLSR)

Sammon Mapping

Multidimensional Scaling (MDS)

Projection Pursuit

Linear Discriminant Analysis (LDA)

Mixture Discriminant Analysis (MDA)

Quadratic Discriminant Analysis (QDA)

Flexible Discriminant Analysis (FDA)

Ensemble Algorithms

Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction.

Much effort is put into what types of weak learners to combine and the ways in which to

combine them. This is a very powerful class of techniques and as such is very popular.

Boosting

Bootstrapped Aggregation (Bagging)

AdaBoost

Stacked Generalization (blending)

Gradient Boosting Machines (GBM)

Gradient Boosted Regression Trees (GBRT)

Random Forest
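
One of the simplest ways to combine weak models is a majority vote, sketched below. The three weak classifiers are hand-written rules standing in for independently trained models, so this shows only the combining step, not bagging or boosting themselves. All names and thresholds are hypothetical.

```python
from collections import Counter

def majority_vote(models, x):
    """Return the class predicted by most of the models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Three hypothetical weak classifiers, each looking at the input differently.
weak_models = [
    lambda x: 1 if x[0] > 5 else 0,
    lambda x: 1 if x[1] > 3 else 0,
    lambda x: 1 if x[0] + x[1] > 7 else 0,
]

results = [majority_vote(weak_models, x)
           for x in [(6, 2), (4, 4), (4, 2), (6, 1)]]
```

Each rule on its own is wrong on some inputs, but the vote can still be right as long as most of the models agree.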

--Python is required: basic functions, lists, dictionaries, tuples.

--Then you need ML algorithms such as linear regression, the KNN method and k-means clustering.

---After that you need Random Forest and XGBoost methods to make the models designed above more precise, with less error.

---Underlying basics.

---Study Python format characters.

=============
