8/24/2019

INTRODUCTION TO DATA MINING


UNIT # 1

FALL 2019 Sajjad Haider 1

TODAY’S AGENDA

 Course management
 Brief overview of Data Mining and allied fields
 Summary of a few impactful articles and recent trends

COURSE MANAGEMENT

LEARNING OBJECTIVES

 Learn the art of modeling and interpreting large, complex data sets
via predictive and descriptive data mining methods.
 Get to know several online data repositories and how to participate in
data analytics competitions held at Kaggle.com and other sites.
 Develop advanced expertise in data analytics software and languages
such as KNIME and Python.

COURSE OVERVIEW

 Data Preparation
 Classification Techniques
 Clustering
 Text Analytics
 Regression Analysis
 Principal Component Analysis
 Association Rule Mining

SOFTWARE AND DATA REPOSITORIES

 KNIME
 Python
 Data on Kaggle Website
 http://www.kaggle.com/

BOOKS

 Data Mining and Data Warehousing: Principles and Practical Techniques
(2019)
 Data Mining for Business Analytics: Concepts, Techniques and
Applications in R (2017)
 Learning Data Mining with Python (2017)
 Data Mining: Practical Machine Learning Tools and Techniques by Witten
and Frank (2016)

ACKNOWLEDGEMENT

 Although I am not following the two books below closely, their slides
are still very popular in academia, and I will be using them
occasionally:
 Data Mining: Concepts and Techniques (2011)
 Introduction to Data Mining (2018)

MARKS DISTRIBUTION

 Midterm 25
 Final 40
 Project 10
 Assignments 15

MEETING HOURS

 Office Hours:
 Monday/Wednesday: noon – 1PM and ?
 or by appointment (by e-mailing me at sahaider@iba.edu.pk).

 Note: I DO NOT entertain SMS/WhatsApp messages. E-mail is the
official medium of correspondence.

OVERVIEW OF DATA MINING AND ALLIED FIELDS

APPLICATIONS OF DATA MINING/MACHINE LEARNING

 Traffic Predictions
   Google Maps
 Online Transportation Networks
   Uber/Careem for price prediction
 Video Surveillance
   Crime detection
 Fraud Detection
   Financial institutions

APPLICATIONS OF DATA MINING/MACHINE LEARNING (CONT’D)

 Social Media Services
   Face recognition by Facebook
   Hate speech detection by Facebook/Twitter
   Inappropriate content detection by YouTube
 Emails
 Product Recommendation
   Amazon, YouTube, and others
 Machine Translation
 Autonomous Vehicles

MACHINE LEARNING

“A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E.”
(Tom Mitchell, 1997)
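
As a minimal sketch of this definition, the toy example below treats T as labeling a number as small (0) or large (1), P as accuracy on a fixed test set, and E as the number of labeled examples seen so far. The data, the threshold of 5, and the 1-nearest-neighbour learner are illustrative assumptions, not part of the course material.

```python
# Toy illustration of "learning from experience" (all data is made up).
# Task T: label a number 0 if it is below 5, otherwise 1.
# Performance P: accuracy on a fixed test set.
# Experience E: how many labeled examples the learner has seen.

# Labeled examples in the order the learner receives them (hypothetical).
EXPERIENCE = [(1, 0), (9, 1), (3, 0), (7, 1), (4, 0), (5, 1)]

# Fixed test set: the integers 0..9 with their true labels.
TEST_SET = [(x, int(x >= 5)) for x in range(10)]


def predict(seen, x):
    """1-nearest-neighbour: copy the label of the closest seen example."""
    nearest_x, nearest_label = min(seen, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def accuracy_after(n_examples):
    """Accuracy on TEST_SET after seeing the first n_examples examples."""
    seen = EXPERIENCE[:n_examples]
    correct = sum(predict(seen, x) == label for x, label in TEST_SET)
    return correct / len(TEST_SET)


if __name__ == "__main__":
    for n in (1, 2, 6):
        print(f"E = {n} examples -> P = {accuracy_after(n):.1f}")
```

On this toy data the accuracy rises from 0.5 to 0.9 to 1.0 as E grows from 1 to 2 to 6 examples; real learners rarely improve this cleanly.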

A SIMPLIFIED TAXONOMY

 Data Science > Data Analytics > Data Mining > Machine Learning
 Data Analytics also deals with visualization.
 Data Science also deals with data acquisition and data management.
 Besides machine learning, data mining also makes use of statistical models.

 Because these fields overlap significantly and different communities favor
different terms, the boundaries between them are not as crisp as
shown in this slide.

DATA MINING

 Data mining is the process of automated discovery of previously unknown
patterns in large volumes of data.
 This data is usually the historical data of an organization, stored in
what is known as a data warehouse.
 Data mining deals with large volumes of data: gigabytes or terabytes,
and sometimes as much as zettabytes (in the case of big data).
 Patterns must be valid, novel, useful, and understandable.
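
The idea of automatically discovering patterns can be illustrated with a small sketch of frequent pair counting, the first step in association rule mining (a topic listed in the course overview). The transactions and the `min_count` threshold below are invented for illustration, and the brute-force pair counting is an assumption chosen for brevity, not an algorithm prescribed by these slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (made-up data).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def frequent_pairs(transactions, min_count):
    """Return item pairs that occur together in at least min_count transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort items so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    print(frequent_pairs(TRANSACTIONS, min_count=3))
```

With a threshold of 3 out of 5 transactions, only the pair (bread, milk) survives. Whether such a pattern is also novel and useful still needs human judgment.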

DATA MINING LIFE CYCLE (CRISP-DM)

1. Statistical Models
2. Machine learning

SUMMARY OF A FEW ARTICLES

HBR ARTICLE (CONT’D)

 Data scientists are the people who understand how to fish out answers
to important business questions from today’s tsunami of unstructured
information.
 As companies rush to capitalize on the potential of big data, the largest
constraint many face is the scarcity of this special talent.

BIG DATA: THE NEXT FRONTIER FOR INNOVATION (MCKINSEY 2011)

 Big data refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyze.
 The demand for deep analytical positions in a big data world could exceed
the supply being produced on current trends by 140K to 190K positions.
 There is a need for 1.5 million additional managers and analysts in the US
who can ask the right questions and consume the results of the analysis
of big data effectively.

WHAT IS BIG DATA?

 There is no consensus on how to define big data.

“Big data exceeds the reach of commonly used hardware environments and
software tools to capture, manage, and process it within a tolerable elapsed
time for its user population.” - Teradata Magazine article, 2011

“Big data refers to data sets whose size is beyond the ability of
typical database software tools to capture, store, manage and analyze.”
- The McKinsey Global Institute, 2011

 One reasonable definition is that it’s data which can’t comfortably be
processed on a single machine.

3 V’S

 Doug Laney was the first to talk about the 3 V's of big data management:
 Volume: there is more data than ever before, and its size keeps increasing,
but the percentage of data that our tools can process does not.
 Variety: there are many different types of data, such as text, sensor data,
audio, video, graph, and more.
 Velocity: data arrives continuously as streams, and we are interested in
obtaining useful information from it in real time.
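
To make the velocity point concrete, here is a minimal sketch of processing a stream one value at a time: a running mean that is updated on each arrival without storing the full history. The class name `RunningMean` and the sample readings are hypothetical.

```python
class RunningMean:
    """Keeps the mean of a stream without storing past values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold one new value into the mean and return the updated mean."""
        self.count += 1
        # Incremental update: shift the mean toward the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean


if __name__ == "__main__":
    stream = [2, 4, 6, 8]  # stand-in for sensor readings arriving over time
    tracker = RunningMean()
    for reading in stream:
        print(reading, tracker.update(reading))
```

Because each update uses only the previous mean and a counter, memory use stays constant no matter how long the stream runs.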

4 V’S (IBM 2014)

MOOCS ON DATA SCIENCE

 In the past couple of years, data science courses and specializations
have been extremely popular on MOOC websites:
 Johns Hopkins University (Coursera)
 University of Washington (Coursera)
 Google (Udacity)
 UC Berkeley (edX)
 University of Toronto (Coursera)
 And many others

RECENT TRENDS

DATA SCIENCE AND THE ART OF PERSUASION (HBR 2019)

 Despite the success stories, many companies aren’t getting the value they
could from data science.
 Four of the top seven “barriers faced at work”:
 lack of management/financial support
 lack of clear questions to answer
 results not used by decision makers and
 explaining data science to others

DO YOUR DATA SCIENTISTS KNOW THE ‘WHY’ BEHIND THEIR WORK? (HBR 2019)

 Data science, broadly defined, has been around for a long time. But the failure
rates of big data projects in general and AI projects in particular remain
disturbingly high.
 The following were found to be the two most important reasons:
 Many data scientists are much more interested in pursuing their crafts — namely, finding
interesting nuggets buried in data — than they are in solving business problems.
 From the company’s perspective, the talent is rare and protecting data scientists from the
chaos of everyday work just makes sense. But doing so increases the distance between
data scientists and the company’s most important problems and opportunities.

WHY DATA SCIENCE TEAMS NEED GENERALISTS, NOT SPECIALISTS (HBR 2019)

 A division of labor in data science projects that resembles a pin factory
assembly line, where “One [person] draws out the wire, another straights
it, a third cuts it, a fourth points it, a fifth grinds it,” doesn’t work well.
 Algorithmic products and services like recommendation systems, style
preference classification, seasonal trend detection, and more can’t be
designed up-front.
 With data science, you learn as you go, not before you go.
