Chapter-2 Supervised
Machine Learning
Nisha Panchal
Supervised learning
• Supervised learning is a type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data, machines predict
the output. Labelled data means the input data is already tagged with the correct
output.
• In supervised learning, the training data provided to the machine works as a
supervisor that teaches the machine to predict the output correctly. It applies the
same concept as a student learning under the supervision of a teacher.
• Supervised learning is a process of providing input data as well as correct output data
to the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the output variable(y).
• In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
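The idea above can be sketched in a few lines of code. This is a deliberately minimal, hypothetical illustration (a 1-nearest-neighbour rule on made-up labelled pairs), not one of the algorithms covered later in the chapter: the model is "trained" on labelled examples and then predicts the output for unseen input.

```python
# Labelled training data: (input feature, correct output label).
training_data = [(1.0, "small"), (1.5, "small"), (4.0, "large"), (5.0, "large")]

def predict(x):
    """Predict the label of x from the nearest labelled training example."""
    nearest = min(training_data, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

print(predict(1.2))  # near the "small" examples -> "small"
print(predict(4.6))  # near the "large" examples -> "large"
```

The labelled pairs play the role of the "supervisor": the correct answers are given during training, and the model uses them to answer for new inputs.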
How Does Supervised Learning Work?
Types of Supervised Machine Learning Algorithms:
Regression
• Regression algorithms are used if there is a relationship between the
input variable and the output variable. It is used for the prediction of
continuous variables, such as Weather forecasting, Market Trends, etc.
Below are some popular Regression algorithms which come under
supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Classification
• Classification algorithms are used when the output variable is
categorical, meaning it takes discrete classes such as Yes/No, Male/Female,
or True/False. A typical application is Spam Filtering.
Below are some popular Classification algorithms which come under
supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Linear Regression
Introduction
• In the 1800s, Francis Galton studied the relationship
between parents and their children.
• He investigated the relationship between the heights of fathers and their
sons.
• He discovered that a man's son tends to be roughly as tall as his
father; however, a tall father's son tended to be closer to the overall
average height of all sons.
• Galton called this phenomenon "regression", because a father's son's height
tends to regress (or drift towards) the mean (average) height of
everyone else.
Linear Regression
• Regression is used to study the relationship between two variables.
• We can use simple linear regression if both the dependent variable (DV)
and the independent variable (IV) are numerical.
• If the DV is categorical, it is best to use Logistic
Regression instead.
Linear Regression - Example
• The following are situations where we can use regression.
• It is important to draw a scatter plot first, because it helps us to
see whether the relationship is linear.
Regression Case
• Dataset related to Co2 emissions from different cars.
Regression Case
• Looking at the existing data of different cars, can we estimate the approximate CO2
emission of a car which is not yet manufactured, such as in row 9?
• We can use regression methods to predict a continuous value, such as CO2 Emission,
using some other variables.
1. When a single independent variable is used to predict the dependent variable, the
process is called simple linear regression.
2. When more than one independent variable is present, the process is called multiple
linear regression.
• Ex: predicting CO2 emission using Engine Size and the number of Cylinders in any given car.
• You can also try to predict a salesperson's total yearly sales (sales forecast) from
independent variables such as age, education, and years of experience.
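Multiple linear regression can be sketched with NumPy's least-squares solver. The engine-size, cylinder and CO2 values below are hypothetical stand-ins for the slide's dataset, used only to show the mechanics.

```python
import numpy as np

# Hypothetical training data: engine size (L), cylinders, CO2 emission (g/km).
X = np.array([[2.0, 4], [2.4, 4], [3.5, 6], [3.5, 6], [5.7, 8]])
y = np.array([196, 221, 255, 244, 389])

# Add an intercept column and solve the least-squares problem
# y ~ b0 + b1*engine_size + b2*cylinders.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict CO2 emission for a car not in the training set (3.0 L, 6 cylinders).
new_car = np.array([1.0, 3.0, 6.0])  # leading 1.0 multiplies the intercept
prediction = new_car @ coef
```

The same call handles any number of independent variables; only the columns of `X` change.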
Continuous values
• The question is: Given this dataset, can we predict the
CO2 emission of a car using another field, such as Engine Size?
• Yes!
Scatter Plot
• To understand linear regression, we can plot
our variables here.
• The fitted line helps us to predict the target value, Y, using the independent variable 'Engine Size'
represented on the X axis.
• In a simple regression problem (a single x), the form of the model would be:
ŷ = θ0 + θ1·x1
• Here ŷ is the dependent variable, or the predicted value, and x1 is the independent variable;
θ0 is the intercept and θ1 is the slope.
• This means we must calculate θ0 and θ1 to find the best line to 'fit' the data.
• Let's see how we can adjust these parameters to make the line the best fit for the data.
• Let's assume we have already found the 'best fit' line for our data.
Model Error
• If we have, for instance, a car with engine size x1 = 5.4 and actual CO2 = 250,
the fit line will generally predict a somewhat different value ŷ; the difference
y − ŷ is the model's residual error for that car.
• Mean Squared Error (MSE) measures the overall error, but squares each difference
before summing instead of using the absolute value:
MSE = (1/n) Σi (yi − ŷi)²
Parameter Estimation
• θ0 and θ1 (intercept and slope of the line) are the coefficients of the fit line.
• We need to calculate the means x̄ and ȳ of the independent and dependent (target)
columns from the dataset.
• It can be shown that the intercept and slope can be calculated from these
values:
θ1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
θ0 = ȳ − θ1·x̄
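These closed-form estimates can be sketched directly in code, using the slide's notation (θ0 = intercept, θ1 = slope). The engine-size and CO2 values are hypothetical example numbers.

```python
# Hypothetical data: engine sizes (independent) and CO2 emissions (dependent).
xs = [2.0, 2.4, 3.5, 3.5, 5.7]
ys = [196, 221, 255, 244, 389]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# theta1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
theta1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
         / sum((x - x_mean) ** 2 for x in xs)
# theta0 = y_mean - theta1 * x_mean
theta0 = y_mean - theta1 * x_mean

# The fitted line always passes through the point of means (x_mean, y_mean).
```

A quick sanity check of any implementation is exactly that last property: plugging x̄ into the fitted line must return ȳ.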
b) Plot the given points and the regression line in the same rectangular
system of axes.
Problem -3
The values of x and their corresponding values of y are shown in the table below.
x: 0 1 2 3 4
y: 2 3 5 4 6
a) Find the least squares regression line y = a x + b.
b) Estimate the value of y when x = 10.
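Problem 3 can be checked with the same closed-form estimates:

```python
xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]

x_mean = sum(xs) / len(xs)   # 2.0
y_mean = sum(ys) / len(ys)   # 4.0

# Slope a and intercept b of the least squares line y = a*x + b.
a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
    / sum((x - x_mean) ** 2 for x in xs)   # 9/10 = 0.9
b = y_mean - a * x_mean                    # 4 - 0.9*2 = 2.2

y_at_10 = a * 10 + b                       # ~11.2
```

So the regression line is y = 0.9x + 2.2, and the estimate at x = 10 is y ≈ 11.2.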
Problem- 4
The sales of a company (in million dollars) for each year are shown in the table
below.
x (year):  2005 2006 2007 2008 2009
y (sales): 12   19   29   37   45
Gradient Descent
• Gradient Descent is an iterative optimization algorithm used to find the
minimum of a function. It is widely employed in various machine learning
algorithms, including linear regression, logistic regression, neural networks,
and many others. The primary goal of Gradient Descent is to adjust the
parameters of a model in such a way that the cost function (objective
function) is minimized, thus improving the model's performance.
• In the context of linear regression, the cost function is typically the Mean
Squared Error (MSE), which quantifies the difference between the predicted
values and the actual values in the training dataset. The objective of the
algorithm is to find the values of the slope (m) and the intercept (b) of the
linear regression line that minimize the MSE.
Here's how Gradient Descent works:
1. Initialize Parameters: Start by initializing the model's parameters (weights and biases) with
some arbitrary values. For linear regression, the parameters are the slope (m) and the
intercept (b) of the line.
2. Calculate the Gradient: Compute the gradient of the cost function with respect to each
parameter. The gradient represents the direction and magnitude of the cost function's
steepest ascent (positive gradient) or descent (negative gradient). It tells us how much and
in which direction we should update the parameters to minimize the cost.
3. Update Parameters: Adjust the parameters using the gradients. The update rule for each
parameter is given by:
θ := θ − α · (∂J/∂θ)
i.e., for linear regression, m := m − α·(∂MSE/∂m) and b := b − α·(∂MSE/∂b), where α is the
learning rate.
The learning rate is a hyperparameter that controls the step size during each iteration. It
determines how big a step the algorithm takes in the direction of the negative gradient. A larger
learning rate may lead to faster convergence but could also cause overshooting of the optimal
solution. On the other hand, a smaller learning rate might slow down convergence.
4. Repeat Steps 2 and 3: Continue calculating the gradients and updating the parameters
iteratively until a stopping criterion is met. The stopping criterion could be a fixed
number of iterations or reaching a satisfactory level of cost function value.
5. Convergence: Gradient Descent continues to update the parameters, and with each
iteration, it moves closer to the minimum of the cost function. Ideally, the algorithm
should converge to the optimal values of the parameters, where the cost function
reaches its minimum.
6. Model Evaluation: Once the optimization is complete, the model's parameters will be
optimized for the given data. You can then use these optimized parameters to make
predictions on new data.
• There are variations of Gradient Descent, such as Stochastic Gradient Descent (SGD) and
Mini-batch Gradient Descent, which use subsets of the training data in each iteration to
further optimize the process and handle large datasets efficiently.
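The six steps above can be sketched for simple linear regression. The data here are hypothetical points lying exactly on y = 2x + 1, so we can see the parameters converge to m = 2, b = 1; the learning rate and iteration count are illustrative choices.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1

m, b = 0.0, 0.0                  # step 1: initialize parameters arbitrarily
learning_rate = 0.05
n = len(xs)

for _ in range(5000):            # step 4: repeat until the stopping criterion
    # Step 2: gradients of MSE = (1/n) * sum((m*x + b - y)^2)
    grad_m = (2 / n) * sum((m * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / n) * sum((m * x + b - y) for x, y in zip(xs, ys))
    # Step 3: move the parameters against the gradient
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

# Step 5/6: m and b have converged near 2 and 1; use them for prediction.
```

Raising the learning rate too far (here, beyond roughly 0.15 for this dataset) makes the updates diverge, which is the overshooting behaviour described above.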
Logistic Regression
Categorical Response Variables
Examples:
• Predict the probability of a person having a heart attack within a specified time period, based on our knowledge of the person's age,
sex, and body mass index.
• Predict whether a patient has a given disease, such as diabetes, based on observed characteristics of that patient
• Note: In all of these applications, we not only predict the class of each case, we also measure the probability that a case
belongs to a specific class.
Logistic Regression Model
• Ideally, a logistic regression model ŷ can predict the probability that the class of a
customer is 1, given its features x (the probability of the customer falling in a particular
class):
ŷ = P(y=1 | x)
• The model computes this probability by passing θᵀx through the sigmoid (logistic) function:
σ(θᵀx) = 1 / (1 + e^(−θᵀx))
Sigmoid Function
• The sigmoid output is read as P(y=1|x); for example:
• P(churn=1 | income, age) = 0.7
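The sigmoid function squashes any real-valued θᵀx into the interval (0, 1), which is what lets its output be read as a probability. A small sketch with illustrative inputs:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

sigmoid(0)    # 0.5: the model is undecided between the two classes
sigmoid(4)    # close to 1: strong evidence for class 1
sigmoid(-4)   # close to 0: strong evidence for class 0
```

Note the symmetry σ(−z) = 1 − σ(z): the probabilities of the two classes always sum to 1.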
Model Build?
• The first step towards building such a model is to find θ, which can be found
through the training process (Step 1: initialize θ with some values).
• Step 2: Calculate the model output, σ(θᵀX), for a sample customer.
• X is the feature vector, e.g. the age and income of the customer; assume [2, 5].
• θ is the weight vector that you set in the previous step.
• The probability that the customer belongs to class 1 is ŷ = σ(θᵀX).
• Step 3: Compare the output of our model, ŷ, with the actual label of the customer; assume 1 for churn.
• The difference ŷ − y is the error for only one customer out of all the customers in the training set.
Training Process
• Cost function (error of the model) is the difference between the actual and the model’s
predicted values. Therefore, the lower the cost, the better the model is at estimating the
customer’s labels correctly.
• Step 5: But, because the initial values for θ were chosen randomly, it’s
very likely that the cost function is very high. So, we change the 𝜃 in
such a way to hopefully reduce the total cost.
• There are different ways to change the values of θ, but one of the most
popular ways is gradient descent.
• Usually, the square of the difference (ŷ − y) is used, because the difference may be negative,
and for the sake of simplicity half of this value is taken as the cost of one sample (this
simplifies the derivative):
Cost(ŷ, y) = ½ (σ(θᵀx) − y)²
• Now, we can write the cost function for all the samples in our training set; for example, for all
customers, we can write it as the average of the per-sample costs:
J(θ) = (1/m) Σi ½ (σ(θᵀxⁱ) − yⁱ)²
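The per-sample cost and its average can be sketched as follows. The weight vector θ and the customer feature/label pairs are hypothetical values, standing in for a training set:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and (features, actual label) pairs for three customers.
theta = [0.1, -0.2]
samples = [([2.0, 5.0], 1), ([1.0, 1.0], 0), ([3.0, 2.0], 1)]

def total_cost(theta):
    """Average of 1/2 * (sigmoid(theta . x) - y)^2 over all samples."""
    costs = []
    for x, y in samples:
        z = sum(t * xi for t, xi in zip(theta, x))
        costs.append(0.5 * (sigmoid(z) - y) ** 2)
    return sum(costs) / len(costs)
```

Training then means adjusting `theta` (e.g. by gradient descent) so that `total_cost` decreases.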
Gradient Descent
• Feature space
• Decision Boundary
• Dimension Expansion
• Hyperplane
• Transformation Approach
Support Vector Machine (SVM)
Introduction
• The scatter plot represents a linearly non-separable dataset.
Decision Boundary
Dimension Expansion
Hyperplane
SVM Outcomes
• The output of the algorithm is the values 'w' and 'b' that define the line.
• It is enough to plug the input values of an unknown point into the line
equation w·x + b to calculate whether the point is above or below the line.
• If the equation returns a value greater than 0, the point belongs
to the first class, which is above the line, and vice versa.
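This decision rule can be sketched in a few lines. The values of `w` and `b` below are made-up stand-ins for what the SVM training would output:

```python
# Hypothetical learned parameters of the separating line w . x + b = 0.
w = [1.0, -1.0]
b = -0.5

def classify(point):
    """Classify a 2-D point by the sign of w . x + b."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return "first" if score > 0 else "second"

classify([2.0, 0.0])  # score = 1.5 > 0  -> "first"
classify([0.0, 2.0])  # score = -2.5 < 0 -> "second"
```

The same sign test generalises to higher dimensions, where the line becomes a hyperplane.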
Evaluation Metrics in Classification
• Imagine that we have a historical dataset which shows the customer churn for a
telecommunication company.
• We have trained the model, and now we want to calculate its accuracy using the test set.
• We pass the test set to our model, and we find the predicted labels.
• Basically, we compare the actual values in the test set with the values predicted by the model,
to calculate the accuracy of the model.
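The comparison described above reduces to counting agreements between the two label lists. The churn labels below are hypothetical test-set values:

```python
# Hypothetical actual test-set labels and the model's predicted labels.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: fraction of positions where prediction matches the actual label.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)   # 6 of 8 correct -> 0.75
```

Accuracy is only one such metric; precision, recall, and F1-score refine it by distinguishing the two kinds of error.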
SVM Applications
• Text mining
• Detecting spam
• Text categorization
• Sentiment analysis