Chapter-2 Supervised
Machine Learning
Nisha Panchal
Supervised learning
• Supervised learning is a type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data, machines predict
the output. Labelled data means the input data is already tagged with the correct
output.
• In supervised learning, the training data provided to the machine works as a
supervisor that teaches the machine to predict the output correctly. It applies the
same concept as a student learning under the supervision of a teacher.
• Supervised learning is a process of providing input data as well as correct output data
to the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the output variable(y).
• In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
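The idea above can be sketched in a few lines of code. This is a deliberately minimal, hypothetical illustration (a 1-nearest-neighbour rule on made-up labelled pairs), not one of the algorithms covered later in the chapter: the model is "trained" on labelled examples and then predicts the output for unseen input.

```python
# Labelled training data: (input feature, correct output label).
training_data = [(1.0, "small"), (1.5, "small"), (4.0, "large"), (5.0, "large")]

def predict(x):
    """Predict the label of x from the nearest labelled training example."""
    nearest = min(training_data, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

print(predict(1.2))  # near the "small" examples -> "small"
print(predict(4.6))  # near the "large" examples -> "large"
```

The labelled pairs play the role of the "supervisor": the correct answers are given during training, and the model uses them to answer for new inputs.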
How Does Supervised Learning Work?
Types of Supervised Machine Learning Algorithms:
Regression
• Regression algorithms are used if there is a relationship between the
input variable and the output variable. It is used for the prediction of
continuous variables, such as Weather forecasting, Market Trends, etc.
Below are some popular Regression algorithms which come under
supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Classification
• Classification algorithms are used when the output variable is
categorical, meaning it takes discrete classes such as Yes/No, Male/Female,
or True/False. A typical application is Spam Filtering.
Below are some popular Classification algorithms which come under
supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Linear Regression
Introduction
• In the 1800s, Francis Galton studied the relationship
between parents and their children.
• He investigated the relationship between the heights of fathers and their
sons.
• He discovered that a man's son tends to be roughly as tall as his
father; however, a tall father's son tended to be closer to the overall
average height of all sons.
• Galton called this phenomenon "regression", because a father's son's height
tends to regress (or drift towards) the mean (average) height of
everyone else.
Linear Regression
• Regression is used to study the relationship between two variables.
• We can use simple linear regression if both the dependent variable (DV)
and the independent variable (IV) are numerical.
• If the DV is categorical, it is best to use Logistic
Regression instead.
Linear Regression - Example
• The following are situations where we can use regression.
• It is important to draw a scatter plot first, because it helps us to
see whether the relationship is linear.
Regression Case
• Dataset related to Co2 emissions from different cars.
Regression Case
• Looking at the existing data of different cars, can we estimate the approximate CO2
emission of a car which is not yet manufactured, such as in row 9?
• We can use regression methods to predict a continuous value, such as CO2 Emission,
using some other variables.
1. When a single independent variable is used to predict the dependent variable, the
process is called simple linear regression.
2. When more than one independent variable is present, the process is called multiple
linear regression.
• Ex: predicting CO2 emission using Engine Size and the number of Cylinders in any given car.
• You can also try to predict a salesperson's total yearly sales (sales forecast) from
independent variables such as age, education, and years of experience.
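Multiple linear regression can be sketched with NumPy's least-squares solver. The engine-size, cylinder and CO2 values below are hypothetical stand-ins for the slide's dataset, used only to show the mechanics.

```python
import numpy as np

# Hypothetical training data: engine size (L), cylinders, CO2 emission (g/km).
X = np.array([[2.0, 4], [2.4, 4], [3.5, 6], [3.5, 6], [5.7, 8]])
y = np.array([196, 221, 255, 244, 389])

# Add an intercept column and solve the least-squares problem
# y ~ b0 + b1*engine_size + b2*cylinders.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict CO2 emission for a car not in the training set (3.0 L, 6 cylinders).
new_car = np.array([1.0, 3.0, 6.0])  # leading 1.0 multiplies the intercept
prediction = new_car @ coef
```

The same call handles any number of independent variables; only the columns of `X` change.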
Continuous values
• The question is: Given this dataset, can we predict the
CO2 emission of a car using another field, such as Engine Size?
• Yes!
Scatter Plot
• To understand linear regression, we can plot
our variables here.
• The fitted line helps us to predict the target value, Y, using the independent variable 'Engine Size'
represented on the X axis.
• In a simple regression problem (a single x), the form of the model would be:
ŷ = θ0 + θ1·x1
• Here ŷ is the dependent variable, or the predicted value, and x1 is the independent variable;
θ0 is the intercept and θ1 is the slope.
• This means we must calculate θ0 and θ1 to find the best line to 'fit' the data.
• Let's see how we can adjust these parameters to make the line the best fit for the data.
• Let's assume we have already found the 'best fit' line for our data.
Model Error
• If we have, for instance, a car with engine size x1 = 5.4 and actual CO2 = 250,
the fit line will generally predict a somewhat different value ŷ; the difference
y − ŷ is the model's residual error for that car.
• Mean Squared Error (MSE) measures the overall error, but squares each difference
before summing instead of using the absolute value:
MSE = (1/n) Σi (yi − ŷi)²
Parameter Estimation
• θ0 and θ1 (intercept and slope of the line) are the coefficients of the fit line.
• We need to calculate the means x̄ and ȳ of the independent and dependent (target)
columns from the dataset.
• It can be shown that the intercept and slope can be calculated from these
values:
θ1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
θ0 = ȳ − θ1·x̄
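These closed-form estimates can be sketched directly in code, using the slide's notation (θ0 = intercept, θ1 = slope). The engine-size and CO2 values are hypothetical example numbers.

```python
# Hypothetical data: engine sizes (independent) and CO2 emissions (dependent).
xs = [2.0, 2.4, 3.5, 3.5, 5.7]
ys = [196, 221, 255, 244, 389]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# theta1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
theta1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
         / sum((x - x_mean) ** 2 for x in xs)
# theta0 = y_mean - theta1 * x_mean
theta0 = y_mean - theta1 * x_mean

# The fitted line always passes through the point of means (x_mean, y_mean).
```

A quick sanity check of any implementation is exactly that last property: plugging x̄ into the fitted line must return ȳ.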
b) Plot the given points and the regression line in the same rectangular
system of axes.
Problem -3
The values of x and their corresponding values of y are shown in the table below.
x: 0 1 2 3 4
y: 2 3 5 4 6
a) Find the least squares regression line y = a x + b.
b) Estimate the value of y when x = 10.
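Problem 3 can be checked with the same closed-form estimates:

```python
xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]

x_mean = sum(xs) / len(xs)   # 2.0
y_mean = sum(ys) / len(ys)   # 4.0

# Slope a and intercept b of the least squares line y = a*x + b.
a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
    / sum((x - x_mean) ** 2 for x in xs)   # 9/10 = 0.9
b = y_mean - a * x_mean                    # 4 - 0.9*2 = 2.2

y_at_10 = a * 10 + b                       # ~11.2
```

So the regression line is y = 0.9x + 2.2, and the estimate at x = 10 is y ≈ 11.2.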
Problem- 4
The sales of a company (in million dollars) for each year are shown in the table
below.
x (year):  2005 2006 2007 2008 2009
y (sales): 12   19   29   37   45
Gradient Descent
• Gradient Descent is an iterative optimization algorithm used to find the
minimum of a function. It is widely employed in various machine learning
algorithms, including linear regression, logistic regression, neural networks,
and many others. The primary goal of Gradient Descent is to adjust the
parameters of a model in such a way that the cost function (objective
function) is minimized, thus improving the model's performance.
• In the context of linear regression, the cost function is typically the Mean
Squared Error (MSE), which quantifies the difference between the predicted
values and the actual values in the training dataset. The objective of the
algorithm is to find the values of the slope (m) and the intercept (b) of the
linear regression line that minimize the MSE.
Here's how Gradient Descent works:
1. Initialize Parameters: Start by initializing the model's parameters (weights and biases) with
some arbitrary values. For linear regression, the parameters are the slope (m) and the
intercept (b) of the line.
2. Calculate the Gradient: Compute the gradient of the cost function with respect to each
parameter. The gradient represents the direction and magnitude of the cost function's
steepest ascent (positive gradient) or descent (negative gradient). It tells us how much and
in which direction we should update the parameters to minimize the cost.
3. Update Parameters: Adjust the parameters using the gradients. The update rule for each
parameter is given by:
θ := θ − α · (∂J/∂θ)
i.e., for linear regression, m := m − α·(∂MSE/∂m) and b := b − α·(∂MSE/∂b), where α is the
learning rate.
The learning rate is a hyperparameter that controls the step size during each iteration. It
determines how big a step the algorithm takes in the direction of the negative gradient. A larger
learning rate may lead to faster convergence but could also cause overshooting of the optimal
solution. On the other hand, a smaller learning rate might slow down convergence.
4. Repeat Steps 2 and 3: Continue calculating the gradients and updating the parameters
iteratively until a stopping criterion is met. The stopping criterion could be a fixed
number of iterations or reaching a satisfactory level of cost function value.
5. Convergence: Gradient Descent continues to update the parameters, and with each
iteration, it moves closer to the minimum of the cost function. Ideally, the algorithm
should converge to the optimal values of the parameters, where the cost function
reaches its minimum.
6. Model Evaluation: Once the optimization is complete, the model's parameters will be
optimized for the given data. You can then use these optimized parameters to make
predictions on new data.
• There are variations of Gradient Descent, such as Stochastic Gradient Descent (SGD) and
Mini-batch Gradient Descent, which use subsets of the training data in each iteration to
further optimize the process and handle large datasets efficiently.
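The six steps above can be sketched for simple linear regression. The data here are hypothetical points lying exactly on y = 2x + 1, so we can see the parameters converge to m = 2, b = 1; the learning rate and iteration count are illustrative choices.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1

m, b = 0.0, 0.0                  # step 1: initialize parameters arbitrarily
learning_rate = 0.05
n = len(xs)

for _ in range(5000):            # step 4: repeat until the stopping criterion
    # Step 2: gradients of MSE = (1/n) * sum((m*x + b - y)^2)
    grad_m = (2 / n) * sum((m * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / n) * sum((m * x + b - y) for x, y in zip(xs, ys))
    # Step 3: move the parameters against the gradient
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

# Step 5/6: m and b have converged near 2 and 1; use them for prediction.
```

Raising the learning rate too far (here, beyond roughly 0.15 for this dataset) makes the updates diverge, which is the overshooting behaviour described above.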
Logistic Regression
Categorical Response Variables
Examples:
• Predict the probability of a person having a heart attack within a specified time period, based on our knowledge of the person's age,
sex, and body mass index.
• Predict whether a patient has a given disease, such as diabetes, based on observed characteristics of that patient
• Note: In all of these applications, we not only predict the class of each case, we also measure the probability that a case
belongs to a specific class.
Logistic Regression Model
• Ideally, a logistic regression model ŷ can predict the probability that the class of a
customer is 1, given its features x (the probability of the customer falling in a particular
class):
ŷ = P(y=1 | x)
• The model computes this probability by passing θᵀx through the sigmoid (logistic) function:
σ(θᵀx) = 1 / (1 + e^(−θᵀx))
Sigmoid Function
• The sigmoid output is read as P(y=1|x); for example:
• P(churn=1 | income, age) = 0.7
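The sigmoid function squashes any real-valued θᵀx into the interval (0, 1), which is what lets its output be read as a probability. A small sketch with illustrative inputs:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

sigmoid(0)    # 0.5: the model is undecided between the two classes
sigmoid(4)    # close to 1: strong evidence for class 1
sigmoid(-4)   # close to 0: strong evidence for class 0
```

Note the symmetry σ(−z) = 1 − σ(z): the probabilities of the two classes always sum to 1.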
Model Build?
• The first step towards building such a model is to find θ, which can be found
through the training process (Step 1: initialize θ with some values).
• Step 2: Calculate the model output, σ(θᵀX), for a sample customer.
• X is the feature vector, e.g. the age and income of the customer; assume [2, 5].
• θ is the weight vector that you set in the previous step.
• The probability that the customer belongs to class 1 is ŷ = σ(θᵀX).
• Step 3: Compare the output of our model, ŷ, with the actual label of the customer; assume 1 for churn.
• The difference ŷ − y is the error for only one customer out of all the customers in the training set.
Training Process
• Cost function (error of the model) is the difference between the actual and the model’s
predicted values. Therefore, the lower the cost, the better the model is at estimating the
customer’s labels correctly.
• Step 5: But, because the initial values for θ were chosen randomly, it’s
very likely that the cost function is very high. So, we change the 𝜃 in
such a way to hopefully reduce the total cost.
• There are different ways to change the values of θ, but one of the most
popular ways is gradient descent.
• Usually, the square of the difference (ŷ − y) is used, because the difference may be negative,
and for the sake of simplicity half of this value is taken as the cost of one sample (this
simplifies the derivative):
Cost(ŷ, y) = ½ (σ(θᵀx) − y)²
• Now, we can write the cost function for all the samples in our training set; for example, for all
customers, we can write it as the average of the per-sample costs:
J(θ) = (1/m) Σi ½ (σ(θᵀxⁱ) − yⁱ)²
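The per-sample cost and its average can be sketched as follows. The weight vector θ and the customer feature/label pairs are hypothetical values, standing in for a training set:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and (features, actual label) pairs for three customers.
theta = [0.1, -0.2]
samples = [([2.0, 5.0], 1), ([1.0, 1.0], 0), ([3.0, 2.0], 1)]

def total_cost(theta):
    """Average of 1/2 * (sigmoid(theta . x) - y)^2 over all samples."""
    costs = []
    for x, y in samples:
        z = sum(t * xi for t, xi in zip(theta, x))
        costs.append(0.5 * (sigmoid(z) - y) ** 2)
    return sum(costs) / len(costs)
```

Training then means adjusting `theta` (e.g. by gradient descent) so that `total_cost` decreases.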
Gradient Descent
• Feature space
• Decision Boundary
• Dimension Expansion
• Hyperplane
• Transformation Approach
Support Vector Machine (SVM)
Introduction
• The scatter plot represents a linearly non-separable dataset.
Decision Boundary
Dimension Expansion
Hyperplane
SVM Outcomes
• The output of the algorithm is the values 'w' and 'b' that define the line.
• It is enough to plug the input values of an unknown point into the line
equation w·x + b to calculate whether the point is above or below the line.
• If the equation returns a value greater than 0, the point belongs
to the first class, which is above the line, and vice versa.
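This decision rule can be sketched in a few lines. The values of `w` and `b` below are made-up stand-ins for what the SVM training would output:

```python
# Hypothetical learned parameters of the separating line w . x + b = 0.
w = [1.0, -1.0]
b = -0.5

def classify(point):
    """Classify a 2-D point by the sign of w . x + b."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return "first" if score > 0 else "second"

classify([2.0, 0.0])  # score = 1.5 > 0  -> "first"
classify([0.0, 2.0])  # score = -2.5 < 0 -> "second"
```

The same sign test generalises to higher dimensions, where the line becomes a hyperplane.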
Evaluation Metrics in Classification
• Imagine that we have a historical dataset which shows the customer churn for a
telecommunication company.
• We have trained the model, and now we want to calculate its accuracy using the test set.
• We pass the test set to our model, and we find the predicted labels.
• Basically, we compare the actual values in the test set with the values predicted by the model,
to calculate the accuracy of the model.
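The comparison described above reduces to counting agreements between the two label lists. The churn labels below are hypothetical test-set values:

```python
# Hypothetical actual test-set labels and the model's predicted labels.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: fraction of positions where prediction matches the actual label.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)   # 6 of 8 correct -> 0.75
```

Accuracy is only one such metric; precision, recall, and F1-score refine it by distinguishing the two kinds of error.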
SVM Applications
• Text mining
• Detecting spam
• Text categorization
• Sentiment analysis