
CpE 615: Machine learning

Dr. Mohammad A. Alzubaidi


Department of Computer Engineering
Yarmouk University

Brief Introduction

Dr. Mohammad A. Alzubaidi


Assistant Professor at CE Dept.
Research interests: machine learning, data
mining, image processing and their
applications to bioinformatics

Outline of lecture

Course information

Introduction to Machine Learning (ML)

Tentative Course schedule

Survey

Course Information

Instructor: Dr. Mohammad A. Alzubaidi


Office: Assistant Dean Office, H-205
Phone: 02/7211111 x4440
Email: maalzubaidi@yu.edu.jo
Web: elearning.yu.edu.jo
Time: Wed 5:00pm-8:00pm
Office hours: Sun-Thu 8:00am-4:00pm
Location: HN-401
Course textbook: No textbook is required. (Materials will be
available at the class web page)
Topics: Data types and representation, classification,
evaluation, preprocessing, clustering, semi-supervised
learning, advanced topics etc.

Reference books

Introduction to Data Mining. Tan, et al., 2005.

Pattern Classification. Duda, et al., 2000.

The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Hastie, et al., 2001.

Kernel Methods in Computational Biology. Schölkopf, et
al., editors. 2004.

Kernel Methods for Pattern Analysis. Shawe-Taylor and
Cristianini, 2004.

Grading

Midterm Exam: 30%

Project, class participation, and seminars: 30%.

Two to three students form a group to carry out a
small research project. Possible project types:

A survey of the state of the art in an area related to this course.
Machine learning techniques for specific applications.
A comparative study of several well-known algorithms.
Design of a novel algorithm related to this course.

Students are required to attend the lectures and participate
in the class discussion.
Students might be asked to give a seminar.

Final Exam: 40%.

Programming language

Matlab

Tutorials

http://www.math.ufl.edu/help/matlab-tutorial/
http://www.math.mtu.edu/~msgocken/intro/node1.html

R language

Or other languages

What is machine learning?

Machine learning is the study of computer systems that
improve their performance through experience.

Learn existing and known structures and rules.

Discover new findings and structures.

Face recognition
Bioinformatics

Supervised learning vs. unsupervised learning

Semi-supervised learning

Machine learning versus data mining

Data mining is

the extraction of useful patterns from data
sources, e.g., databases, texts, the web, images.
the analysis of (often large) observational
data sets to find unsuspected relationships
and to summarize the data in novel ways that
are both understandable and useful to the
data owner.

A lot of common topics

Clustering, Classification etc.

Machine learning versus data mining

Different focuses

ML focuses more on theory (statistics)


DM focuses more on applications

In this course I will try to balance
between the two.

Clustering

Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups

Intra-cluster distances are minimized

Inter-cluster distances are maximized
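
The two quantities named on this slide can be computed directly for a small example. The sketch below is only an illustration: the toy points, the Euclidean distance, and the function names are assumptions, not part of the course material.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: two well-separated clusters of 2-D points.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}

def mean_intra_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in the same cluster."""
    d = [dist(p, q)
         for points in clusters.values()
         for p, q in combinations(points, 2)]
    return sum(d) / len(d)

def mean_inter_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in different clusters."""
    d = [dist(p, q)
         for (_, a), (_, b) in combinations(clusters.items(), 2)
         for p in a
         for q in b]
    return sum(d) / len(d)

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"mean intra-cluster distance = {intra:.2f}")
print(f"mean inter-cluster distance = {inter:.2f}")
```

A good clustering of this data gives a small intra-cluster value and a large inter-cluster value; assigning the points to the wrong groups would move the two numbers toward each other.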

Applications of Cluster Analysis

Understanding
Group genes and proteins that have similar
functionality, or group stocks with similar price
fluctuations
Summarization
Reduce the size of large data sets

Figure: Clustering precipitation in Australia

Classification: Definition

Given a collection of records (training set):

Each record contains a set of attributes; one of the
attributes is the class.
Find a model for the class attribute as a function of the
values of the other attributes.
Goal: previously unseen records should be assigned a class
as accurately as possible.
A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and
test sets, with the training set used to build the model and
the test set used to validate it.
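
The train/test procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the records are synthetic, the 70/30 split and the 1-nearest-neighbour model are arbitrary choices made for the example, and all function names are hypothetical.

```python
import random

# Hypothetical labelled records: (attribute vector, class label).
records = [((x, y), "low" if x + y < 10 else "high")
           for x in range(10) for y in range(10)]

def train_test_split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, x):
    """Return the class of the training record closest to x (squared Euclidean distance)."""
    nearest = min(train,
                  key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], x)))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(train, x) == label for x, label in test)
    return correct / len(test)

train_set, test_set = train_test_split(records)
acc = accuracy(train_set, test_set)
print(f"train={len(train_set)} test={len(test_set)} accuracy={acc:.2f}")
```

The model never sees the test records while it is built, so the reported accuracy estimates how well it handles previously unseen records.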

Classification Example
Attribute types: Refund (categorical), Marital Status (categorical),
Taxable Income (continuous), Cheat (class)

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set -> Learn Classifier -> Model
Test Set -> Model

Classification: Application

Fraud Detection

Goal: Predict fraudulent cases in credit card transactions.

Approach:

Use credit card transactions and the information on the
account holder as attributes: when a customer buys, what
the customer buys, how often the customer pays on time, etc.
Label past transactions as fraud or fair transactions. This
forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card
transactions on an account.

Character Recognition

Given a digit representation, what is its class?

Inputs are 28x28 greyscale images.

Researchers have used Neural Networks, Support Vector
Machines, etc.

Other applications

Face recognition

Protein function prediction

Cancer detection

Document categorization

Data representation

Traditional algorithms work on vectors.

Images can be represented as matrices or vectors.
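
As a small illustration of the matrix and vector views of an image, the sketch below flattens a matrix row by row. The 3x3 pixel values and the helper names are made up for the example; by the same procedure, a 28x28 greyscale digit image becomes a vector of 28 * 28 = 784 values.

```python
# Hypothetical 3x3 greyscale image stored as a matrix (a list of rows).
image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 16],
]

def flatten(matrix):
    """Row-major flattening: concatenate the rows into one vector."""
    return [pixel for row in matrix for pixel in row]

def unflatten(vector, n_cols):
    """Inverse of flatten for a known number of columns."""
    return [vector[i:i + n_cols] for i in range(0, len(vector), n_cols)]

vector = flatten(image)
print(len(vector), vector)
```

Flattening loses no pixel values, but the vector no longer records which pixels were vertical neighbours, which is one reason some methods prefer to work on the matrix form.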

Kernel Methods: Basic ideas

Original Space

Feature Space
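
One common way to make the original-space versus feature-space idea concrete is the degree-2 polynomial kernel. The slide does not name a specific kernel, so this choice, the feature map `phi`, and the sample points are assumptions made only for illustration. The sketch checks numerically that the kernel evaluated in the original 2-D space equals the inner product in the 3-D feature space.

```python
from math import isclose, sqrt

def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map from the 2-D original space to a 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2, computed in the original space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
in_original_space = poly2_kernel(x, y)
in_feature_space = dot(phi(x), phi(y))
print(in_original_space, in_feature_space)
assert isclose(in_original_space, in_feature_space)
```

The kernel gives the feature-space inner product without ever building `phi(x)`, which matters when the feature space is very high-dimensional.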

Data integration
mRNA expression data
hydrophobicity data
protein-protein interaction data
sequence data (gene, protein)

Genome-wide data

Curse of dimensionality

Large sample size is required for high-dimensional data.

Query accuracy and efficiency degrade rapidly as the
dimension increases.

Strategies

Feature reduction
Feature selection
Kernel learning
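
Two simple calculations illustrate why high-dimensional data needs a large sample. The sketch assumes data spread uniformly over a unit hypercube, which is an idealization chosen for the example; the function names are hypothetical.

```python
def edge_length_for_fraction(fraction, dim):
    """Edge length of a sub-cube holding `fraction` of the unit hypercube's volume."""
    return fraction ** (1.0 / dim)

def grid_cells(bins_per_axis, dim):
    """Number of cells when every axis is divided into `bins_per_axis` bins."""
    return bins_per_axis ** dim

for dim in (1, 2, 10, 100):
    edge = edge_length_for_fraction(0.1, dim)
    cells = grid_cells(10, dim)
    print(f"dim={dim:3d}  edge covering 10% of volume={edge:.3f}  cells={cells:.3e}")
```

In one dimension a neighbourhood holding 10% of the data spans 10% of the axis, but in ten dimensions it must span roughly 79% of every axis, so "local" neighbourhoods stop being local. The cell count shows the same effect from the sampling side: keeping one sample per cell requires exponentially many samples as the dimension grows.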

Model selection

Choose the best model from a set of different models to
fit to the data.

Support Vector Machines (SVM), Linear Discriminant
Analysis (LDA)

Models are specified by certain parameters.

How to choose the best parameters?

Cross-validation (leave-one-out, k-fold CV)
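
Cross-validation can be sketched as follows. The example is an illustration under stated assumptions: the 1-D data set is invented, a k-nearest-neighbour classifier stands in for the model being tuned (the slide mentions SVM and LDA, which are not implemented here), and the candidate parameter values are arbitrary. Leave-one-out is obtained by setting the number of folds equal to the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def knn_predict(train_pts, x, n_neighbors):
    """Majority vote (labels 0/1) among the n_neighbors closest training points."""
    nearest = sorted(train_pts, key=lambda p: abs(p[0] - x))[:n_neighbors]
    votes = [label for _, label in nearest]
    return 1 if 2 * sum(votes) > len(votes) else 0

def cv_accuracy(data, n_neighbors, k):
    """Average validation accuracy of the classifier over k folds."""
    correct = 0
    for train_idx, val_idx in k_fold_indices(len(data), k):
        train_pts = [data[i] for i in train_idx]
        correct += sum(knn_predict(train_pts, data[i][0], n_neighbors) == data[i][1]
                       for i in val_idx)
    return correct / len(data)

# Hypothetical 1-D data: (feature value, class label 0 or 1), with some overlap.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1),
        (3.0, 0), (3.5, 1), (4.0, 1), (4.5, 1), (5.0, 1)]

# Leave-one-out: as many folds as samples. Odd neighbour counts avoid tied votes.
scores = {n: cv_accuracy(data, n, k=len(data)) for n in (1, 3, 5)}
best = max(scores, key=scores.get)
print(scores, "best n_neighbors:", best)
```

With fewer folds than samples, the data should be shuffled first when it is stored in sorted order, because the folds here are contiguous blocks of indices.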

Machine learning applications

Computer vision,
information retrieval,
image processing,
bioinformatics,
text mining,
web mining
etc.

Course schedule

Weeks 1-6:
Introduction
Data Types
Classification
Evaluation
Preprocessing
Week 7: Midterm Exam
Weeks 8-11:
Clustering
Semi-supervised Learning
Advanced Topics
Weeks 12-14: Presentations
Week 15: Final Exam

Survey

Why are you taking this course?

What would you like to gain from this course?

What topics are you most interested in learning about from this
course?

Any other suggestions?
