
6/26/2014

Big Data: Prelude
I-Jen Chiang

What's Big Data?
No single definition; from Wikipedia:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

How much data?

Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year

"640K ought to be enough for anybody."

A Single View to the Customer
[Figure: the Customer at the center, linked to Banking/Finance, Social Media, Our Known History, Gaming, Entertain, and Purchase.]

Life Cycle of Big Data
[Figure labels: Collect; Accumulate; Store; Query; Cloud Computing; Internet of Things; Mobile Computing; MapReduce; Distributed Storage; Big Data.]

Data, data, and more data
According to IBM, 90% of the data in the world today was created in the past two years (2011~2013). (IBM quote, microscope.co.uk)
According to International Data Corporation, the total amount of global data is expected to grow to 2.7 zettabytes during 2012. (International Data Corporation 2012 prediction, IDC website)
The data is growing exponentially (43% growth rate) and is estimated to be 7.9 zettabytes by 2015. (CenturyLink 2015 prediction, ReadWriteWeb website)
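The two figures on this slide can be checked against each other. The sketch below assumes that the 43% rate is an annual compound rate applied to the 2.7 ZB estimate for 2012; the function name `project` is illustrative only.

```javascript
// Compound growth: value after `years` years at a constant annual rate.
// Assumption: the 43% growth rate quoted on the slide is annual and compounding.
function project(start, rate, years) {
  return start * Math.pow(1 + rate, years);
}

// 2.7 ZB in 2012, projected three years ahead to 2015.
var projected2015 = project(2.7, 0.43, 3);
console.log(projected2015.toFixed(1) + ' ZB');
```

Under that assumption the projection rounds to the 7.9 ZB quoted for 2015, so the two sources are at least mutually consistent.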

Tendency of Data

Tabular or object/relational
Fixed schema
10s-100s GB
Up to 100,000s of transactions per hour
Optimized for transaction processing

Structured and dimensioned
Specialized DBs for BI
TB to low PBs
Batch loads
Optimized for reporting and analysis

Poorly structured, lightly structured, or unstructured
Simple structure and extreme data rate
Hierarchical and/or file oriented
Sparsely attributed
PB and up
Stream, capture, batch
Optimized for distributed, cloud-based processing

Data Growth

Volume (Scale)
Data Volume
  44x increase from 2009 to 2020
  From 0.8 zettabytes to 35 ZB

Data volume is increasing exponentially

Exponential increase in collected/generated data
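As a quick arithmetic check on the volume figures, the sketch below recomputes the growth factor from 0.8 ZB to 35 ZB and the constant annual rate it would imply. Treating 2009 to 2020 as 11 compounding years is an assumption made here, not something the slide states.

```javascript
// Growth factor implied by the slide's endpoints.
var startZb = 0.8;   // 2009
var endZb = 35;      // 2020
var factor = endZb / startZb;

// Assumption: 11 yearly compounding steps between 2009 and 2020.
var years = 2020 - 2009;
var annualRate = Math.pow(factor, 1 / years) - 1;

console.log('growth factor: about ' + Math.round(factor) + 'x');
console.log('implied annual rate: about ' + Math.round(annualRate * 100) + '%');
```

The factor rounds to the 44x on the slide, and the implied annual rate comes out at roughly 41%.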

30 billion RFID tags today (1.3B in 2005)
12+ TBs of tweet data every day
4.6 billion camera phones worldwide
? TBs of data every day
100s of millions of GPS-enabled devices sold annually
25+ TBs of log data every day
2+ billion people on the Web by end 2011
76 million smart meters in 2009, 200M by 2014

Real-time/Fast Data
Mobile devices (tracking all objects all the time)
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

The Model Has Changed
The model of generating/consuming data has changed.
Old model: a few companies are generating data, all others are consuming data.
New model: all of us are generating data, and all of us are consuming data.

What's driving Big Data

Optimizations and predictive analytics
Complex statistical analysis
All types of data, and many sources
Very large datasets
More of a real-time

Ad hoc querying and reporting
Data mining techniques
Structured data, typical sources
Small to mid-size datasets

What is Big Data?
Datasets which are too large, grow too rapidly, or are too varied to handle using traditional techniques
Characteristics:
  Volume: 100s of TBs, petabytes, and beyond
  Velocity: e.g., machine-generated data, medical devices, sensors
  Variety: unstructured data, many formats, varying semantics
Not every data problem is a Big Data problem!!

Internet of Things
[Figure]

Characteristics
[Figure labels: Volume; Data Size; Big Data.]

IBM Definition
[Figure]

A New Era of Computing
Volume: 12 terabytes of Tweets created daily
Velocity: 5 million trade events per second
Variety: 100s of video feeds from surveillance cameras
Veracity: only 1 in 3 decision makers trust their information

http://watalon.com/?p=722

"We have for the first time an economy based on a key resource [Information] that is not only renewable, but self-generating. Running out of it is not a problem, but drowning in it is."
John Naisbitt

Big Data Explained
Achieve Breakthrough Outcomes
  Know Everything about your Customers
  Run Zero-latency Operations
  Innovate new products at Speed and Scale
  Instant Awareness of Fraud and Risk
  Exploit Instrumented Assets
By Analyzing Any Big Data Type
  Transactional/Application Data
  Machine Data
  Social Media Data
  Content

Big Data Stack


Value of Big Data
Unlocking significant value by making information transparent and usable at much higher frequency.
Using data collection and analysis to conduct controlled experiments to make better management decisions.
Sophisticated analytics that substantially improve decision-making.
Improved development of the next generation of products and services.

Why does big data matter?
Big data is not just about storing large datasets
Rather, it is about leveraging datasets
  Mining datasets to find new meaning
  Combining datasets that have never been combined before
  Making more informed decisions
  Offering new products and services
Data is a vital asset, and analytics are the key to unlocking its potential
"We don't have better algorithms than anyone else, we just have more data."
Peter Norvig, Director of Research, Google, spoken in 2010

Big Data: Processing
I-Jen Chiang

Knowledge Pyramid
[Figure: pyramid ordered by semantic level (top) against resources occupied (base), with the Data (Text) Mining area marked. Levels from top to bottom, with the example question shown beside each:]
  Wisdom (Knowledge + experience): How can we improve it?
  Knowledge (Information + rules): What made it that unsuccessful?
  Information (Data + context): What was the lowest selling product?
  Data: How many units were sold of each product line?
  Signals

Value Chain Emerges
[Figure: layers of increasing value, listed here from highest to lowest:]
  Prescriptive Analytics: Automatically Prescribe and Take Action
  Predictive Analytics: Sets of Potential Future Scenarios
  Analytics: Identification of Patterns and Relationships
  Reporting: An Evaluation of what happened in the past
  Processing: Data Prepared for Analysis; Indexed, Organized and Optimized Data
  Big Data: Containers and Feeds of Heterogeneous Data; Access to Structured and Unstructured Data

Michael Porter, Competitive Advantage: Creating and Sustaining Superior Performance

Big Data Processing
[Figure labels:]
  Data sources: Transactional Data; Operational & Partner Data; Machine-to-Machine Data; Event Streams; Social Data; Cloud Services Data
  Interconnect: High-Speed Low-Latency Infra/Band/Ethernet Interconnect
  Storage layers: Working Local Flash Storage Layer as an extension of DRAM; Distributed Shared Flash Storage Layer; Low-cost Distributed Archive & Backup Disk Storage Layer
  Systems: Operational Systems; Business Analytics; Big Data Analytics; Government Systems
  Stores: Databases; Indexes; Indexing & Metadata; Metadata; Cubes
  Active Data Management: Shared Databases; Active Indexes; Shared Metadata
  Archive/Backup Data Management: Archive Data & Metadata

The Traditional Approach
Query-driven (lazy, on-demand)
[Figure: Clients -> Integration System (with Metadata) -> Wrappers -> Sources]

Disadvantages of Query-Driven Approach
Delay in query processing
  Slow or unavailable information sources
  Complex filtering and integration
Inefficient and potentially expensive for frequent queries
Competes with local processing at sources
Hasn't caught on in industry

The Warehousing Approach
Information integrated in advance
Stored in warehouse for direct querying and analysis
[Figure: Clients -> Data Warehouse -> Integration System (with Metadata) -> Extractors/Monitors -> Sources]

Advantages of Warehousing Approach
High query performance
  But not necessarily most current information
Doesn't interfere with local processing at sources
  Complex queries at warehouse
  OLTP at information sources
Information copied at warehouse
  Can modify, annotate, summarize, restructure, etc.
  Can store historical information
  Security, no auditing
Has caught on in industry

Business Intelligence
[Figure: three-tier architecture.]
  Information Sources: Operational DBs; Semistructured Sources (extract, transform, load, refresh, etc.)
  Tier 1, Data Warehouse Server: Data Warehouse; Data Marts
  Tier 2, OLAP Servers: e.g., MOLAP, ROLAP
  Tier 3, Clients: OLAP; Query/Reporting; Data Mining

Not Either-Or Decision
Query-driven approach still better for
  Rapidly changing information
  Rapidly changing information sources
  Truly vast amounts of data from large numbers of sources
  Clients with unpredictable needs

What is a Data Warehouse?
A Practitioner's Viewpoint
"A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context."
Barry Devlin, IBM Consultant

An Alternative Viewpoint
"A DW is a
  subject-oriented,
  integrated,
  time-varying,
  non-volatile
collection of data that is used primarily in organizational decision making."
W.H. Inmon, Building the Data Warehouse, 1992

A Data Warehouse is...
Stored collection of diverse data
  A solution to data integration problem
  Single repository of information
Subject-oriented
  Organized by subject, not by application
  Used for analysis, data mining, etc.
Optimized differently from transaction-oriented db
User interface aimed at executive

Characteristics of a Data Mart
[Figure] KROENKE and AUER, DATABASE CONCEPTS (6th Edition), Copyright 2013 Pearson Education, Inc. Publishing as Prentice Hall

Gaining market intelligence from news feeds
[Figure] Sreekumar Sukumaran and Ashish Sureka

Integrated BI Systems
[Figure labels:]
  Structural Data: RDBMS; DBMS; File System; XML (via ETL)
  Unstructured Data: EA; Legacy; CMS; Scanned Documents; Email (via Text Tagger & Annotator, XML)
  Intermedia Data, loaded by ETL into the Complete Data Warehouse
Sreekumar Sukumaran and Ashish Sureka

Data Warehouse Components
[Figure] SOURCE: Ralph Kimball

Data Warehouse Components, Detailed
[Figure] SOURCE: Ralph Kimball

Linux Adoption
[Figure]

Distributing processing between gateways and cloud
[Figure]

Big Data Processing Techniques
Distributed data stream processing technology for on-the-fly real-time analytics of data/events generated at extremely high rates.
Technologies for reliable distributed data store, high-speed data structure transform to create analytics DB, and quick data placement management.
Scalable data extraction technology to speed up rich querying functionality, such as multi-dimensional queries, over a key-value store.
Scalable distributed parallel processing of a huge amount of stored data for advanced analysis such as machine learning.

Batch and Real-time Platform
[Figure] http://www.nec.com/en/global/rd/research/cl/bdpt.html

[Figure] http://info.aiim.org/digitallandfill/newaiimo/2012/03/15/bigdataandbigcontentjusthypeorarealopportunity

Value Chain of Big Data
[Figure] Fritz Venter (LEFT) and Andrew Stein, Images & videos: really big data, 2012

Big Data Moving Forward
[Figure] Source: Dion Hinchcliffe

Evolution of Data

Evolutionary Step: Data Collection (1960s)
  Business Question: "What was my total revenue in the last five years?"
  Enabling Technologies: Computers, tapes, disks
  Product Providers: IBM, CDC
  Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
  Business Question: "What were unit sales in New England last March?"
  Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
  Business Question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product Providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
  Business Question: "What's likely to happen to Boston unit sales next month? Why?"
  Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
  Product Providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Characteristics: Prospective, proactive information delivery

Data Mining

Big Data Processing: Real Time
Collect and Store
  In-memory data grid
Speed up processing through co-location of business logic with data
  Reduce network hops
  Scaling
Integrate with the big data store to meet volume and cost demands

Real Time Big Data Processing
[Figure labels: Big Data Streams; Data Streaming System; Real-time Analytic System; Metadata; Big Data Systems; Data Warehouses; Systems of Record; Big Data; ETL; Hadoop; Archive & Backup Systems (Objects, geographically distributed).]

[Figure: data acquisition cycle and data access cycle for hospital data. Labels: Knowledge enrichment; Integrate & Aggregate; Pseudonymised Repository; Extract Information; Hazard Monitoring; Ethical oversight committee; Chronicle; Depersonalise; Pseudonymise (In Hospital); Reidentify (By Hospital); Individual Summaries & Queries; Summarise & Formulate Queries; Construct Chronicle; Privacy Enhancement Technologies; Focus on information capture, organisation, and presentation; What was done; What happened; And why.]

[Figure: patient chronicle graph. Nodes: Human:1382; Pain:5735; Ulcer:1945; Breast:1492; Clinic:4096; Clinic:1024; Clinic:2010; Biopsy:1066; Radio:1812; Chemo:6502; Mass:1666; Cancer:1914. Edge labels: locus; attends; reason; finding; plans; target; treats; time.]

KDD Process
Data Warehouse -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Pattern -> Interpretation/Evaluation -> Knowledge

Data Mining Process for Generating Evidence-based Guidelines Computer Interpretable
[Figure]
N. Stolba and A. M. Tjoa, The Relevance of Data Warehousing and Data Mining in the Field of Evidence-based Medicine to Support Healthcare Decision Making. Proceedings of World Academy of Science, Engineering and Technology, Volume 11, February 2006, ISSN 1307-6884

Big Data Analysis Process: Data
[Figure labels: Task Definition and Goal; Data Gathering; Extracting Data; Selection; Data Warehouse; Cleaning Data / Data Cleansing; Preprocess; Transform; Target Data; Preprocessed Data; Usable Data; KD; DM; Pattern; Output; Validation; Completion; Transferring Data; Organizing Data; Loading Data; Database; ETL; OLAP; Query; Reports; Statistics; AI; Visualization; Presentation.]

Big Data Analysis Process: Text
[Figure labels: Task Definition and Goal; Data Gathering; Extracting Data; Selection; Cleansing Data; Document Repository; Preprocess; Preprocessed Data; Language; Feature Extraction; Lexical Analysis; Semantic Analysis; Semantic Evaluation; Document Clustering/Categorization; Mining; Knowledge; Knowledge Based; Text Database; Transferring Data; Organizing Data; Loading Data; Database; Visualization; Browsing; Tools.]

What is Cloud Computing?
Cloud computing is a style of computing in which dynamically scalable and virtualized resources are provided as a service over the Internet.
Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.

Definitions
"A style of computing where massively scalable IT-related capabilities are provided as a service using Internet technologies to multiple external customers."
Gartner

Definitions
One focusing on remote access to services and computing resources provided over the Internet "cloud"
  Ex: CRM and payroll services, as well as vendors that offer access to storage and processing power over the Web (such as Amazon's EC2 service)
The other focusing on the use of technologies such as virtualization and automation that enable the creation and delivery of service-based computing capabilities
  Is an extension of traditional data center approaches and can be applied to entirely internal enterprise systems with no use of external off-premises capabilities provided by a third party

The NIST Cloud Definition Framework
Deployment Models: Private Cloud; Community Cloud; Public Cloud; Hybrid Clouds
Service Models: Software as a Service (SaaS); Platform as a Service (PaaS); Infrastructure as a Service (IaaS)
Essential Characteristics: On-Demand Self-Service; Broad Network Access; Rapid Elasticity; Resource Pooling; Measured Service
Common Characteristics: Massive Scale; Resilient Computing; Homogeneity; Geographic Distribution; Virtualization; Service Orientation; Low-Cost Software; Advanced Security
Based upon original chart created by Alex Dowbor - http://ornot.wordpress.com

Cycle of SOA, Cloud Computing, Web 2.0
[Figure] D. Delen, H. Demirkan / Decision Support Systems (2013)

A lifecycle of Big Data
[Figure labels: Collection/Identification; Repository/Registry; Semantic Intellectualization; Integration; Analytics/Prediction; Visualization; Data; Insight; Action; Decision; Big Data; Data Curation; Data Scientist; Data Engineer; Workflow; Data Quality. Steps are numbered 1 to 4.]

Model for Big Data
A Reference Model for Big Data
[Figure labels, by layer:]
  Service Layer: Analysis & Prediction; Big Data Management
  Service Support Layer: Workflow Management; Data Quality Management; Data Visualization; Data Curation
  Platform Layer: Data Integration; Data Semantic Intellectualization
  Data Layer: Data Identification (Data Mining & Metadata Extraction); Data Collection; Data Registry; Data Repository
  Across layers: Interface; Security

Data feeds
[Figure: Library catalogs, locally held documents, public repositories, commercial data sources, and agency data sources are each reached through a search engine; the public Internet is reached through spiders and filtered content. All feed a Meta-Search Tool with a taxonomy, exposed through a Web portal.]

System Design Preview
Node.js Based
I-Jen Chiang

Scenario
Sensor Network
[Figure]

Distributed Processing
[Figure] http://phys.org/news/201203technologyefficientlydesiredbigstreams.html

Node.js vs Cloud Computing
[Figure: a Node.js Master Process (Web Server & WebSockets Interface, Dispatcher) receives incoming WebSocket requests and static content requests. Static content is served directly; WebSocket requests are dispatched to a Child Process Pool. Each Node.js Child Process hosts a Node.js Application with an Application Module and a Communication component, and handles queries fully asynchronously.]

Event Loop
[Figure]

Event Loop Example
[Figure]

Learning JavaScript
[Figure: four layers of JavaScript OOP.]
  Layer 1: Single Object
  Layer 2: Prototype Chain (objects linked by __proto__)
  Layer 3: Constructor (Constructor, Prototype, Instance created with new)
  Layer 4: Constructor inheritance (SuperConstr, SubConstr)

Create a single object
Objects: atomic building blocks of JavaScript OOP
  Objects: maps from strings to values
  Properties: entries in the map
  Methods: properties whose values are functions
    this refers to the receiver of the method call
// Object literal
var jane = {
    // Property
    name: 'Jane',
    // Method
    describe: function () {
        return 'Person named ' + this.name;
    }
};
Advantage: create objects directly, introduce abstractions later

var jane = {
    name: 'Jane',
    describe: function () {
        return 'Person named ' + this.name;
    }
};
# jane.name
'Jane'
# jane.describe
[Function]
# jane.describe()
'Person named Jane'
# jane.name = 'John'
# jane.describe()
'Person named John'
# jane.unknownProperty
undefined

Objects versus maps
Similar:
  Very dynamic: freely delete and add properties
Different:
  Inheritance (via prototype chains)
  Fast access to properties (via constructors)

Sharing properties: the problem
// Both objects carry their own identical copy of describe()
var jane = {
    name: 'Jane',
    describe: function () {
        return 'Person named ' + this.name;
    }
};
var tarzan = {
    name: 'Tarzan',
    describe: function () {
        return 'Person named ' + this.name;
    }
};

Sharing properties: the solution
[Figure:]
  PersonProto: describe = function () {...}
  jane: __proto__ -> PersonProto; name = 'Jane'
  tarzan: __proto__ -> PersonProto; name = 'Tarzan'
jane and tarzan share the same prototype object.
Both prototype chains work like a single object.

Sharing properties: the code
// The shared method lives once, on the prototype object
var PersonProto = {
    describe: function () {
        return 'Person named ' + this.name;
    }
};
var jane = {
    __proto__: PersonProto,
    name: 'Jane'
};
var tarzan = {
    __proto__: PersonProto,
    name: 'Tarzan'
};

Getting and setting the prototype
ECMAScript 6: __proto__
ECMAScript 5:
  Object.create()
  Object.getPrototypeOf()

Getting and setting the prototype
Object.create(proto)
var PersonProto = {
    describe: function () {
        return 'Person named ' + this.name;
    }
};
// Create an object whose prototype is PersonProto, then add its own property
var jane = Object.create(PersonProto);
jane.name = 'Jane';

Object.getPrototypeOf(obj)
# Object.getPrototypeOf(jane) === PersonProto
true

Sharing methods
// Instance-specific properties
function Person(name) {
    this.name = name;
}
// Shared properties
Person.prototype.describe = function () {
    return 'Person named ' + this.name;
};

Instances created by the constructor
[Figure:]
  Person: prototype -> Person.prototype
  function Person(name) {
      this.name = name;
  }
  Person.prototype: describe = function () {...}
  jane: __proto__ -> Person.prototype; name = 'Jane'
  tarzan: __proto__ -> Person.prototype; name = 'Tarzan'

instanceof
Is value an instance of Constr?
  value instanceof Constr
How does instanceof work?
  Check: is Constr.prototype in the prototype chain of value?
// Equivalent
value instanceof Constr
Constr.prototype.isPrototypeOf(value)
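The prototype-chain check described above can be written out by hand. The following is a minimal sketch for ordinary objects and constructors; the helper name `isInstance` is hypothetical and is not part of the language.

```javascript
function Person(name) {
  this.name = name;
}

// Hypothetical helper: walks the prototype chain of `value` and reports
// whether Constr.prototype appears in it, which is the check that
// `instanceof` performs for ordinary constructors.
function isInstance(value, Constr) {
  var proto = Object.getPrototypeOf(value);
  while (proto !== null) {
    if (proto === Constr.prototype) {
      return true;
    }
    proto = Object.getPrototypeOf(proto);
  }
  return false;
}

var jane = new Person('Jane');
var plain = {};

console.log(isInstance(jane, Person), jane instanceof Person);
console.log(isInstance(plain, Person), plain instanceof Person);
console.log(Person.prototype.isPrototypeOf(jane));
```

For these inputs all three forms agree: the manual walk, `instanceof`, and `isPrototypeOf`.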

Goal: derive Employee from Person
function Person(name) {
    this.name = name;
}
Person.prototype.sayHelloTo = function (otherName) {
    console.log(this.name + ' says hello to ' + otherName);
};
Person.prototype.describe = function () {
    return 'Person named ' + this.name;
};
Employee(name, title) is like Person, except:
  Additional instance property: title
  describe() returns 'Person named <name> (<title>)'

Things we need to do
Employee must
  Inherit Person's instance properties
  Create the instance property title
  Inherit Person's prototype properties
  Override method Person.prototype.describe (and call overridden method)

Employee: the code
function Employee(name, title) {
    Person.call(this, name); // (1)
    this.title = title; // (2)
}
// The new prototype and the override belong to Employee, not Person
Employee.prototype = Object.create(Person.prototype); // (3)
Employee.prototype.describe = function () { // (4)
    return Person.prototype.describe.call(this) // (5)
        + ' (' + this.title + ')';
};

(1) Inherit instance properties
(2) Create the instance property title
(3) Inherit prototype properties
(4) Override method Person.prototype.describe
(5) Call overridden method (a super-call)

Instances created by the constructor
[Figure:]
  Object.prototype
  Person: prototype -> Person.prototype
  Person.prototype: __proto__ -> Object.prototype; sayHelloTo = function () {...}; describe = function () {...}
  Employee: prototype -> Employee.prototype; calls Person
  Employee.prototype: __proto__ -> Person.prototype; describe = function () {...}
  jane: __proto__ -> Employee.prototype; name = 'Jane'; title = 'CTO'

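Built-in constructor hierarchy
[Figure:]
  Object: prototype -> Object.prototype
  Object.prototype: __proto__ = null; toString = function () {...}
  Array: prototype -> Array.prototype
  Array.prototype: __proto__ -> Object.prototype; toString = function () {...}; sort = function () {...}
  ['foo', 'bar']: __proto__ -> Array.prototype; 0 = 'foo'; 1 = 'bar'; length = 2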

Hello World
var http = require('http');
// Answer every request with a plain-text greeting
http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World.');
}).listen(1337, "127.0.0.1");
console.log('Server running at http://127.0.0.1:1337/');

Express: Web Application Framework
var express = require('express'),
    app = express(); // Express 3 and later; Express 2.x used express.createServer()
app.get('/', function (req, res) {
    res.send('Hello World.');
});
app.listen(1337);

Express: Create http services
app.get('/', function (req, res) {
    res.send('hello world');
});
app.get('/test', function (req, res) {
    res.send('test render');
});
app.get('/user/', function (req, res) {
    res.send('user page');
});

Router Identifiers
// Will match /abcd
app.get('/abcd', function (req, res) {
    res.send('abcd');
});
// Will match /acd and /abcd
app.get('/ab?cd', function (req, res) {
    res.send('ab?cd');
});
// Will match /abcd, /abxyzcd, and so on
app.get('/ab*cd', function (req, res) {
    res.send('ab*cd');
});
// Will match /abe and /abcde
app.get('/ab(cd)?e', function (req, res) {
    res.send('ab(cd)?e');
});
// Will match /abcd, /abbcd, and so on
app.get('/ab+cd', function (req, res) {
    res.send('ab+cd');
});
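The route patterns above behave much like regular expressions. The sketch below uses plain `RegExp` objects as hand-written approximations of each pattern, so it runs without Express; the `approximations` table and the `matches` helper are illustrative assumptions, not Express's own path-matching code.

```javascript
// Hand-written regular expressions that approximate the route patterns.
// Assumption: anchored, case-sensitive matching with no trailing slash.
var approximations = {
  '/abcd': /^\/abcd$/,
  '/ab?cd': /^\/ab?cd$/,       // "b" is optional
  '/ab+cd': /^\/ab+cd$/,       // one or more "b"
  '/ab*cd': /^\/ab.*cd$/,      // anything between "ab" and "cd"
  '/ab(cd)?e': /^\/ab(cd)?e$/  // the group "cd" is optional
};

function matches(pattern, path) {
  return approximations[pattern].test(path);
}

console.log(matches('/ab?cd', '/acd'));     // path without the optional "b"
console.log(matches('/ab+cd', '/abbcd'));   // repeated "b"
console.log(matches('/ab*cd', '/abxyzcd')); // arbitrary text in the middle
console.log(matches('/ab(cd)?e', '/abe'));  // optional group omitted
```

Trying a few paths against each approximation is a cheap way to confirm what a pattern is meant to accept before wiring it into a route.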

Express: Get Parameters
// ... Create http server
app.get('/user/:id', function (req, res) {
    res.send('user: ' + req.params.id);
});
app.get('/:number', function (req, res) {
    res.send('number: ' + req.params.number);
});

Connect: Middleware
var connect = require("connect");
var http = require("http");
var app = connect();
app.use(function (request, response) {
    response.writeHead(200, {"Content-Type": "text/plain"});
    response.end("Hello world!\n");
});
http.createServer(app).listen(1337);

Connect: request, response, next
var connect = require("connect");
var http = require("http");
var app = connect();
// log
app.use(function (request, response, next) {
    console.log("In comes a " + request.method + " to " + request.url);
    next();
});
// return "hello world"
app.use(function (request, response, next) {
    response.writeHead(200, {"Content-Type": "text/plain"});
    response.end("Hello World!\n");
});
http.createServer(app).listen(1337);
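To make the role of `next()` concrete, here is a minimal stand-in for the middleware chain. It is a sketch written for illustration, not Connect's implementation: `createApp`, the `stack` array, and the fake request and response objects are all assumptions, and error handling is left out.

```javascript
// Minimal stand-in for a Connect-style app (NOT the real library).
// Each middleware receives (request, response, next); calling next()
// hands control to the following middleware in registration order.
function createApp() {
  var stack = [];
  function app(request, response) {
    var index = 0;
    function next() {
      var layer = stack[index++];
      if (layer) {
        layer(request, response, next);
      }
    }
    next();
  }
  app.use = function (fn) {
    stack.push(fn);
    return app;
  };
  return app;
}

var log = [];
var app = createApp();

// Logging middleware: records the request, then passes it on.
app.use(function (request, response, next) {
  log.push('In comes a ' + request.method + ' to ' + request.url);
  next();
});

// Final middleware: answers and does not call next(), so the chain stops.
app.use(function (request, response, next) {
  response.body = 'Hello World!\n';
});

var fakeResponse = {};
app({ method: 'GET', url: '/' }, fakeResponse);
console.log(log[0]);
```

A middleware that neither responds nor calls `next()` leaves the request hanging, which is why the logging layer must call it.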

Connect: logger
var connect = require("connect");
var http = require("http");
var app = connect();
// connect.logger() is bundled with Connect 2.x;
// Connect 3 and later use the separate morgan package instead
app.use(connect.logger());
app.use(function (request, response) {
    response.writeHead(200, {"Content-Type": "text/plain"});
    response.end("Hello world!\n");
});
http.createServer(app).listen(1337);

Connect Logging
var connect = require("connect");
var http = require("http");
var app = connect();
app.use(connect.logger());
// Homepage
app.use(function (request, response, next) {
    if (request.url == "/") {
        response.writeHead(200, {"Content-Type": "text/plain"});
        response.end("Welcome to the homepage!\n");
        // The middleware stops here.
    } else {
        next();
    }
});
// About page
app.use(function (request, response, next) {
    if (request.url == "/about") {
        response.writeHead(200, {"Content-Type": "text/plain"});
        response.end("Welcome to the about page!\n");
        // The middleware stops here.
    } else {
        next();
    }
});
// 404'd!
app.use(function (request, response) {
    response.writeHead(404, {"Content-Type": "text/plain"});
    response.end("404 error!\n");
});
http.createServer(app).listen(1337);

Big Data: Analysis
I-Jen Chiang

Big data issues
[Figure]

The CRISP-DM reference model
[Figure] Harper, Gavin; Stephen D. Pickett (August 2006)

The Complete Big Data Value Chain
Collection -> Ingestion -> Discovery & Cleansing -> Integration -> Analysis -> Delivery

Collection: structured, unstructured, and semi-structured data from multiple sources
Ingestion: loading vast amounts of data onto a single data store
Discovery & Cleansing: understanding format and content; clean-up and formatting
Integration: linking, entity extraction, entity resolution, indexing, and data fusion
Analysis: intelligence, statistics, predictive and text analytics, machine learning
Delivery: querying, visualization, real-time delivery on enterprise-class availability

Phases and Tasks

Business Understanding
  Determine Business Objectives: Background; Business Objectives; Business Success Criteria
  Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
  Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
  Collect Initial Data: Initial Data Collection Report
  Describe Data: Data Description Report
  Explore Data: Data Exploration Report
  Verify Data Quality: Data Quality Report

Data Preparation
  Data Set; Data Set Description
  Select Data: Rationale for Inclusion/Exclusion
  Clean Data: Data Cleaning Report
  Construct Data: Derived Attributes; Generated Records
  Integrate Data: Merged Data
  Format Data: Reformatted Data

Modeling
  Select Modeling Technique: Modeling Technique; Modeling Assumptions
  Generate Test Design: Test Design
  Build Model: Parameter Settings; Models; Model Description
  Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
  Review Process: Review of Process
  Determine Next Steps: List of Possible Actions; Decision

Deployment
  Plan Deployment: Deployment Plan
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
  Produce Final Report: Final Report; Final Presentation
  Review Project: Experience Documentation

Data Mining Context

Dimension: Application Domain
  Examples: Response Modeling; Churn Prediction
Dimension: Data Mining Problem Type
  Examples: Description and Summarization; Segmentation; Concept Description; Classification; Prediction; Dependency Analysis
Dimension: Technical Aspect
  Examples: Missing Values; Outliers
Dimension: Tool and Technique
  Examples: Clementine; MineSet

What in Big Data
How do you extract value from big data?
  You surely can't glance over every record;
  And it may not even have records
What if you wanted to learn from it?
  Understand trends
  Classify into categories
  Detect similarities
  Predict the future based on the past (No, not like Nostradamus!)
Machine learning is quickly establishing itself as an emerging discipline.
But there are challenges with ML in big data:
  Thousands of features
  Billions of records
  The largest machine that you can get may not be large enough
Get the picture?

Data Accumulation
[Figure]

Service-Oriented DSS
[Figure] D. Delen, H. Demirkan / Decision Support Systems (2013)

Ten common big data problems
Modeling true risk
Customer churn analysis
Recommendation engine
Ad targeting
PoS transaction analysis
Analyzing network data to predict failure
Threat analysis
Trade surveillance
Search quality
Data sandbox

Business Applications
Modeling risk and failure prediction
Analyzing customer churn
Web recommendations (a la Amazon)
Web ad targeting
Point-of-sale transaction analysis
Threat analysis
Compliance and search effectiveness

Dynamics of Data Ecosystems
[Figure] http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf

The big data opportunity
[Figure]

Industries are embracing big data
[Figure]

Business Analytics
[Figure] D. Delen, H. Demirkan / Decision Support Systems (2013)

Multidimensional Concept Analysis
[Figure: a Boolean table of documents D1 to D6 against attribute variables, and the concept lattice derived from it. Lattice nodes between Top and Bottom: ((D1, D2), (a, b)); ((D1, D3, D6), (d)); ((D1), (a, b, d)).]

Analysis of Accesses to State:
  Elements (Documents) + Properties (accesses to attribute variables)
Analysis of Behavior:
  Elements (Documents) + Properties (invocations to other documents)

Three Approaches
[Figure]

Mining Schemes
A fully distributed and extensible set of Machine Learning techniques for Big Data
State-of-the-art algorithms in each of the Machine Learning domains, including supervised and unsupervised learning:
  Correlation
  Classifiers
  Clustering
  Statistics
  Document manipulation
  N-gram extraction
  Histogram computation
  Natural Language Processing
Distributed and parallel underlying linear algebra library

Statistical Approach

Random Sample and Statistics
Population: is used to refer to the set or universe of all entities under study.
However, looking at the entire population may not be feasible, or may be too expensive.
Instead, we draw a random sample from the population, and compute appropriate statistics from the sample, that give estimates of the corresponding population parameters of interest.
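The idea can be sketched in a few lines: draw a sample without replacement and use its mean as an estimate of the population mean. Everything below is an illustrative assumption: the population is simply the integers 1 to 1000, and `makeRng` is a small seeded linear congruential generator, used in place of `Math.random()` only so the run is repeatable.

```javascript
// Seeded linear congruential generator returning values in [0, 1).
// Assumption: adequate for a demonstration, not for serious simulation.
function makeRng(seed) {
  var state = seed >>> 0;
  return function () {
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Draw n items without replacement using a partial Fisher-Yates shuffle.
function sample(population, n, rng) {
  var pool = population.slice();
  for (var i = 0; i < n; i++) {
    var j = i + Math.floor(rng() * (pool.length - i));
    var tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}

function mean(xs) {
  var sum = 0;
  for (var i = 0; i < xs.length; i++) {
    sum += xs[i];
  }
  return sum / xs.length;
}

var population = [];
for (var k = 1; k <= 1000; k++) {
  population.push(k);
}

var drawn = sample(population, 100, makeRng(42));
console.log('population mean:', mean(population));
console.log('sample estimate:', mean(drawn));
```

The sample mean is only an estimate; it varies with the seed and generally tightens around the population mean as the sample size grows.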

61

6/26/2014

Statistic
Let $S_i$ denote the random variable corresponding to data point $x_i$; then a statistic $\hat{\theta}$ is a function $\hat{\theta}\colon (S_1, S_2, \ldots, S_n) \to \mathbb{R}$.
If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
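As a rough illustration of a point estimate, the sketch below draws a simple random sample and uses the sample mean as an estimator of the population mean. The population (the integers 1 to 1000), the sample size, and the function name `point_estimate` are assumptions made only for this example.

```python
import random
import statistics

def point_estimate(population, sample_size, seed=0):
    """Draw a simple random sample (without replacement) and return its mean."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    sample = rng.sample(population, sample_size)
    return statistics.fmean(sample)    # the statistic used as an estimator

population = list(range(1, 1001))            # hypothetical population
true_mean = statistics.fmean(population)     # population parameter
estimate = point_estimate(population, 100)   # point estimate from 100 records
```

The estimate varies with the sample drawn; sampling the whole population recovers the parameter exactly.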

Empirical Cumulative Distribution Function

Where

Inverse Cumulative Distribution Function
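The formulas for these two functions did not survive on the slides, so the following is only a sketch of the usual textbook definitions, stated here as an assumption: the empirical CDF at x is the fraction of sample points less than or equal to x, and the inverse CDF at q is the smallest sample value whose empirical CDF reaches q. The function names are hypothetical.

```python
import math

def ecdf(sample, x):
    """Fraction of sample points that are <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def inverse_ecdf(sample, q):
    """Smallest sample value x with ecdf(sample, x) >= q, for 0 < q <= 1."""
    ordered = sorted(sample)
    rank = math.ceil(q * len(ordered))   # how many points must lie at or below x
    return ordered[max(rank, 1) - 1]

sample = [5, 1, 3, 2, 4]
```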


Example

Measures of Central Tendency (Mean)
Population Mean:

Sample Mean (unbiased, not robust):

Measures of Central Tendency (Median)
Population Median:
or
Sample Median:

Example

Measures of Dispersion (Range)
Range:
Sample Range:

Not robust, sensitive to extreme values

Measures of Dispersion (Inter-Quartile Range)
Inter-Quartile Range (IQR):

Sample IQR:

More robust

Measures of Dispersion (Variance and Standard Deviation)
Variance:

Standard Deviation:

Sample Variance & Standard Deviation:
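The measures named on these slides can be computed with Python's standard `statistics` module. This is a sketch on made-up data; the sample values and the `summarize` helper are assumptions for illustration. It also shows why the range is called not robust while the IQR is more robust.

```python
import statistics

def summarize(sample):
    """Central tendency and dispersion measures for one sample."""
    q1, _, q3 = statistics.quantiles(sample, n=4)   # quartile cut points
    return {
        "mean": statistics.fmean(sample),
        "median": statistics.median(sample),
        "range": max(sample) - min(sample),
        "iqr": q3 - q1,
        "variance": statistics.variance(sample),    # sample variance, divides by n - 1
        "stdev": statistics.stdev(sample),
    }

clean = [2, 4, 4, 5, 5, 7, 9]      # hypothetical sample
with_outlier = clean + [100]       # same sample plus one extreme value
```

Adding the single extreme value moves the range from 7 to 98, while the median stays at 5 and the IQR changes far less.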


Univariate Normal Distribution

Multivariate Normal Distribution

OLAP and Data Mining

Warehouse Architecture
[Figure: clients issue requests to a Query & Analysis layer over the Warehouse and its Metadata; an Integration layer loads the Warehouse from several Sources]

Star Schemas

A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales.
Often "insert-only."
2. Dimension tables: smaller, generally static information about the entities involved in the facts.

Terms
Fact table
Dimension tables
Measures

product(prodId, name, price)
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
store(storeId, city)

Star

product
prodId  name  price
p1      bolt  10
p2      nut   5

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

store
storeId  city
c1       nyc
c2       sfo
c3       la
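To make the star layout concrete, the sketch below loads the four tables into an in-memory SQLite database and joins the fact table to its dimensions. SQLite, the column types, and the `REFERENCES` clauses are assumptions chosen for a self-contained example; the slides do not name a database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- Fact table: one row per sale, pointing at the three dimension tables.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY, date TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER);
INSERT INTO product  VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO customer VALUES (53,'joe','10 main','sfo'), (81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO store    VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                        ('o102','2/7/97',53,'p2','c1',2,11),
                        ('o105','3/8/97',111,'p1','c3',5,50);
""")

# Resolve each fact row against its dimensions.
rows = conn.execute("""
    SELECT c.name, p.name, s.city, f.amt
    FROM sale f
    JOIN customer c ON c.custId  = f.custId
    JOIN product  p ON p.prodId  = f.prodId
    JOIN store    s ON s.storeId = f.storeId
    ORDER BY f.orderId
""").fetchall()
```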

Cube
Fact table view:
sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multidimensional cube:
      c1   c2   c3
p1    12        50
p2    11   8

dimensions = 2

3-D Cube
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multidimensional cube:
day 1
      c1   c2   c3
p1    12        50
p2    11   8

day 2
      c1   c2   c3
p1    44   4
p2

dimensions = 3

ROLAP vs. MOLAP
ROLAP: Relational On-Line Analytical Processing
MOLAP: Multi-Dimensional On-Line Analytical Processing

Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Result: 81

Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

ans
date  sum
1     81
2     48

Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Result
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

Roll-up / drill-down
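The three aggregate queries can be checked against the six-row sale table with an in-memory SQLite database. SQLite and the `ORDER BY` clauses are assumptions added so the sketch is self-contained and its row order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Total for day 1 only.
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Rolled up to one row per day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date").fetchall()

# Drilled down to one row per (day, product).
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId").fetchall()
```

Summing the (day, product) rows for one day gives back that day's total, which is the roll-up direction; splitting a day's total by product is the drill-down direction.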

Aggregates
Operators: sum, count, max, min, median, ave
"Having" clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)

What is Data Mining?
Discovery of useful, possibly unexpected, patterns in data
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining Tasks

Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Collaborative Filter [Predictive]

Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Decision Trees
Example:
Conducted survey to see what customers were interested in new model car
Want to select customers for advertising campaign

sale (training set)
custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no
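One way to picture a decision tree for this training set is the hand-written, two-level tree below. The split points (age under 30, then city or car type) are an assumption chosen to fit the six records; they are not learned by an algorithm and are not taken from the slides.

```python
TRAINING_SET = [
    {"custId": "c1", "car": "taurus", "age": 27, "city": "sf", "newCar": "yes"},
    {"custId": "c2", "car": "van",    "age": 35, "city": "la", "newCar": "yes"},
    {"custId": "c3", "car": "van",    "age": 40, "city": "sf", "newCar": "yes"},
    {"custId": "c4", "car": "taurus", "age": 22, "city": "sf", "newCar": "yes"},
    {"custId": "c5", "car": "merc",   "age": 50, "city": "la", "newCar": "no"},
    {"custId": "c6", "car": "taurus", "age": 25, "city": "la", "newCar": "no"},
]

def predict(record):
    """Hypothetical tree: split on age first, then on city or car."""
    if record["age"] < 30:
        return "yes" if record["city"] == "sf" else "no"
    return "yes" if record["car"] == "van" else "no"

def accuracy(records):
    """Fraction of records whose predicted class equals the recorded class."""
    hits = sum(predict(r) == r["newCar"] for r in records)
    return hits / len(records)

training_accuracy = accuracy(TRAINING_SET)
```

Perfect accuracy on the training set says little about unseen records, which is why a separate test set is used for validation.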


Clustering
[Figure: records plotted along income, education, and age axes, grouped into clusters]

K-Means Clustering
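The slide gives only the name of the method, so the following is a minimal sketch of the standard k-means loop (often called Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its points, and stop when nothing changes. The sample points and starting centroids are made up for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids are stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [
            # Mean of each coordinate; an empty cluster keeps its old centroid.
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centroids; real implementations usually try several initializations.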


Association Rule Mining

sales records (market-basket data):
tran1  cust33  p2, p5, p8
tran2  cust45  p5, p8, p11
tran3  cust12  p1, p9
tran4  cust40  p5, p8, p11
tran5  cust12  p2, p9
tran6  cust12  p9

Trend: Products p5, p8 often bought together
Trend: Customer 12 likes product p9
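The first trend can be quantified with the usual support and confidence measures. These measures are not defined on the slide; the sketch assumes the common definitions (support is the fraction of transactions containing an itemset, confidence is the support of the whole rule divided by the support of its antecedent).

```python
transactions = {
    "tran1": {"p2", "p5", "p8"},
    "tran2": {"p5", "p8", "p11"},
    "tran3": {"p1", "p9"},
    "tran4": {"p5", "p8", "p11"},
    "tran5": {"p2", "p9"},
    "tran6": {"p9"},
}

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)

pair_support = support({"p5", "p8"})            # p5 and p8 together
rule_confidence = confidence({"p5"}, {"p8"})    # rule {p5} -> {p8}
```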

Association Rule Discovery
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, ...} --> {Potato Chips}
Potato Chips as consequent => can be used to determine what should be done to boost its sales.
Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

Supermarket shelf management.
Inventory Management

Collaborative Filtering
Goal: predict what movies/books/... a person may be interested in, on the basis of
Past preferences of the person
Other people with similar past preferences
The preferences of such people for a new movie/book/...

One approach based on repeated clustering
Cluster people on the basis of preferences for movies
Then cluster movies on the basis of being liked by the same clusters of people
Again cluster people based on their preferences for (the newly created clusters of) movies
Repeat above till equilibrium

Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest

Other Types of Mining
Text mining: application of data mining to textual documents
cluster Web pages to find related pages
cluster pages a user has visited to organize their visit history
classify Web pages automatically into a Web directory

Graph Mining:
Deal with graph data

Data Streams
What are Data Streams?
Continuous streams
Huge, Fast, and Changing
Why Data Streams?
The arrival speed of streams and the huge amount of data are beyond our capability to store them.
"Real-time" processing
Window Models
Landmark window (Entire Data Stream)
Sliding Window
Damped Window
Mining Data Stream

A Simple Problem
Finding frequent items
Given a sequence $(x_1, \ldots, x_N)$ where $x_i \in [1, m]$, and a real number $\theta$ between zero and one.
Looking for $x_i$ whose frequency $> \theta$
Naive Algorithm ($m$ counters)
The number of frequent items $\le 1/\theta$
Problem: $N \gg m \gg 1/\theta$

$P \times (\theta N) \le N$

KRP algorithm
Karp, et al. (TODS '03)

$m = 12$
$N = 30$
$\theta = 0.35$
$\lceil 1/\theta \rceil = 3$
$N / \lceil 1/\theta \rceil \le \theta N$
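The slide names the algorithm without showing its steps, so the following is only a sketch of one common counter-based formulation (often credited to Misra and Gries, and closely related to the Karp et al. paper cited above). It keeps fewer than ceil(1/theta) counters and guarantees that every item occurring more than theta * N times survives as a candidate. The stream is made up, with m = 12, N = 30 and theta = 0.35.

```python
import math

def frequent_candidates(stream, theta):
    """One pass, fewer than ceil(1/theta) counters; superset of the frequent items."""
    k = math.ceil(1 / theta)
    counters = {}
    for x in stream:
        counters[x] = counters.get(x, 0) + 1
        if len(counters) == k:
            # k distinct items are tracked: cancel one occurrence of each
            # and drop the counters that reach zero.
            counters = {item: c - 1 for item, c in counters.items() if c > 1}
    return set(counters)

# Hypothetical stream: item 7 occurs 12 times out of N = 30.
others = [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 8]
stream = []
for i, item in enumerate(others):
    stream.append(item)
    if i < 12:
        stream.append(7)

theta = 0.35
candidates = frequent_candidates(stream, theta)
# A second pass removes false positives by exact counting.
frequent = {x for x in candidates if stream.count(x) > theta * len(stream)}
```

Each cancellation step removes k occurrences at once, so it can happen at most N / k times, and N / k is at most theta * N; an item seen more than theta * N times therefore cannot be cancelled out completely.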

Streaming Sample Problem
Scan the dataset once
Sample K records
Each record has an equal probability of being sampled
With N records in total: K/N
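A standard way to meet these requirements is reservoir sampling, sketched below; the slide does not name a method, so this choice is an assumption. The first K records fill the reservoir, and the n-th record afterwards replaces a random slot with probability K/n, which leaves every record with probability K/N of being in the final sample.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Single-pass uniform sample of k records from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, record in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(record)       # fill the reservoir first
        else:
            j = rng.randrange(n)           # uniform integer in [0, n)
            if j < k:                      # happens with probability k / n
                reservoir[j] = record
    return reservoir

sample = reservoir_sample(iter(range(1000)), 10, seed=42)
```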


Introduction to NoSQL
I-Jen Chiang

Big Users

Big Data

Data Storage

Features of the Data Warehouse
"A Data Warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions." (W. H. Inmon)

Data Warehouse Architecture
[Figure: data sources (external sources, operational DBs) pass through Extract / Transform / Load / Refresh into the data warehouse of reconciled data and its data marts, supported by a metadata repository and monitoring & administration; OLAP servers serve the tools: analysis, query/reporting, data mining]

Business Intelligence Loop
[Figure: CRM, Accounting, Finance, and HR systems feed Extraction, Transformation, & Cleansing into the Data Warehouse (data storage); OLAP, Data Mining, and Reports provide decision support to the Business Strategist]

Traditional vs. Big Data

Traditional Data Warehouse
Complete record from transactional system
All data centralized

Big Data Environment
Data from many sources inside and outside of organization, including traditional DW
Data often physically distributed
Need to iterate solution to test/improve models
Large-memory analytics also part of iteration
Every iteration usually requires complete reload of information

NoSQL Database
Wide Column Store / Column Families: Hadoop/HBase, Cassandra, Cloudata, Cloudera, Amazon SimpleDB
Document Store: MongoDB, CouchDB, Citrusleaf
Key Value / Tuple Store: Azure Table Storage, MEMBASE, GenieDB, Tokyo Cabinet/Tyrant, MemcacheDB
Eventually Consistent Key Value Store: Amazon Dynamo, Voldemort
Graph Databases: Neo4J, InfiniteGraph, Bigdata
XML Databases: MarkLogic Server, EMC Documentum, eXist

Why NoSQL
Too Much Data: the database became too large to fit into a single database table on a single machine
Data Volume was growing FAST
Data wasn't all consistent with a specific, well-defined set of fields
Time was critical

CAP Theorem
Three properties of a system: consistency, availability, and partition tolerance
You can have at most two of these three properties for any shared-data system
To scale out, you have to partition. That leaves either consistency or availability to choose from
In almost all cases, you would choose availability over consistency

Theory of NoSQL: CAP
Many nodes
Nodes contain replicas of partitions of data
Consistency: all replicas contain the same version of data
Availability: system remains operational on failing nodes
Partition tolerance: multiple entry points; system remains operational on system split

CAP Theorem: satisfying all three at the same time is impossible

ACID - BASE

BASE:
Basically Available (CP)
Soft state
Eventually consistent (AP)

ACID:
Atomicity
Consistency
Isolation
Durability

Pritchett, D.: BASE: An Acid Alternative (queue.acm.org/detail.cfm?id=1394128)

NoSQL

Key / Value
Column
Graph
Document

Key-Value Store
Pros:
very fast
very scalable
simple model
able to distribute horizontally

Cons:
many data structures (objects) can't be easily modeled as key-value pairs

Column Stores

Row oriented
Id  username  email         Department
    John      john@foo.com  Sales
    Mary      mary@foo.com  Marketing
    Yoda      yoda@foo.com  IT

Column oriented
Id  Username  email         Department
    John      john@foo.com  Sales
    Mary      mary@foo.com  Marketing
    Yoda      yoda@foo.com  IT

Graph Database

An introduction to MongoDB
I-Jen Chiang


Document Stores
The store is a container for documents
Documents are made up of named fields
Fields may or may not have type definitions
e.g. XSDs for XML stores, vs. schema-less JSON stores

Can create "secondary indexes"
These provide the ability to query on any document field(s)
Operations:
Insert and delete documents
Update fields within documents

MongoDB
MongoDB is a scalable, high-performance, open source NoSQL database.

Document-oriented storage
Full Index Support
Replication & High Availability
Auto-Sharding
Querying
Fast In-Place Updates
Map/Reduce
GridFS


Schema-Less
Pros:
Schema-less data model is richer than key/value pairs
eventual consistency
many are distributed
still provide excellent performance and scalability

Cons:
typically no ACID transactions or joins

Common Advantages
Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
Down nodes easily replaced
No single point of failure

Easy to distribute
Don't require a schema
Can scale up and down
Relax the data consistency requirement (CAP)

What is NoSQL giving up?

joins
group by
order by
ACID transactions
SQL as a sometimes frustrating but still powerful query language
easy integration with other applications that support SQL

Cassandra

Originally developed at Facebook
Follows the BigTable data model: column-oriented
Uses the Dynamo Eventual Consistency model
Written in Java
Open-sourced and exists within the Apache family
Uses Apache Thrift as its API

Cassandra

Tunable consistency.
Decentralized.
Writes are faster than reads.
No single point of failure.
Incremental scalability.
Uses consistent hashing (logical partitioning) when clustered.
Hinted handoffs.
Peer-to-peer routing (ring).
Thrift API.
Multi data center support.
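The consistent hashing and ring routing items above can be pictured with the small sketch below. It is an illustration of the general technique only, not Cassandra's implementation: the node names, the use of MD5, and the absence of virtual nodes and replication are all assumptions.

```python
import bisect
import hashlib

def ring_position(key):
    """Map a string to a position on the ring (MD5 chosen only for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node owns the arc of the ring that ends at its own position.
        self._ring = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key):
        positions = [pos for pos, _ in self._ring]
        # First node clockwise from the key, wrapping around at the end.
        i = bisect.bisect_right(positions, ring_position(key)) % len(self._ring)
        return self._ring[i][1]

keys = [f"user{i}" for i in range(200)]          # hypothetical row keys
before = HashRing(["n1", "n2", "n3"])
after = HashRing(["n1", "n2", "n3", "n4"])       # one node added
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

Adding a node only takes keys away from its clockwise neighbour, which is what makes incremental scaling cheap: every key in `moved` now belongs to the new node, and all other keys stay where they were.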

Couchdb
Availability and Partition Tolerance.
Views are used to query. Map/Reduce.
MVCC (multi-version concurrency control): multiple concurrent versions, no locks.
A little overhead with this approach due to garbage collection.
Conflict resolution.

Very simple, REST based. Schema free.
Shared nothing, seamless peer-based bi-directional replication.
Auto compaction. Manual with Mongodb.
Uses B-Trees.
Documents and indexes are kept in memory and flushed to disc periodically.
Documents have states; in case of a failure, recovery can continue from the state the documents were left in.
No built-in auto-sharding; there are open source projects.
You can't define your indexes.

Mongodb
Data types: bool, int, double, string, object (bson), oid, array, null, date.
Database and collections are created automatically.
Lots of language drivers.
Capped collections are fixed-size collections, buffers, very fast, FIFO, good for logs. No indexes.
Object ids are generated by the client: 12 bytes of packed data (4-byte time, 3-byte machine, 2-byte pid, 3-byte counter).
Possible to refer to other documents in different collections, but more efficient to embed documents.
Replication is very easy to set up. You can read from slaves.
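The 12-byte object id layout listed above can be unpacked by hand. The sketch below follows that layout exactly as stated on the slide (newer drivers may fill the middle bytes differently) and uses an id that appears later in this deck; the function name is hypothetical.

```python
from datetime import datetime, timezone

def parse_object_id(hex_id):
    """Split a 24-character hex ObjectId into the four fields named on the slide."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex characters)")
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),   # seconds since the Unix epoch
        "machine": raw[4:7].hex(),                      # 3-byte machine identifier
        "pid": int.from_bytes(raw[7:9], "big"),         # 2-byte process id
        "counter": int.from_bytes(raw[9:12], "big"),    # 3-byte counter
    }

fields = parse_object_id("4c4ba5c0672c685e5e8aabf3")
created = datetime.fromtimestamp(fields["timestamp"], tz=timezone.utc)
```

Because the timestamp comes first, ids generated later sort after ids generated earlier.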

Document store

RDBMS        MongoDB
Database     Database
Table, View  Collection
Row          Document (JSON, BSON)
Column       Field
Index        Index
Join         Embedded Document
Foreign Key  Reference
Partition    Shard

> db.user.findOne({age: 39})
{
  "_id": ObjectId("5114e0bd42"),
  "first": "John",
  "last": "Doe",
  "age": 39,
  "interests": [
    "Reading",
    "Mountain Biking"
  ],
  "favorites": {
    "color": "Blue",
    "sport": "Soccer"
  }
}

Mongodb
Connection pooling is done for you.
Supports aggregation.
Map/Reduce with JavaScript.

You have indexes, B-Trees. Ids are always indexed.
Updates are atomic. Low-contention locks.
Querying mongo is done with a document:
Lazy, returns a cursor.
Reducible to SQL: select, insert, update, limit, sort etc.
There is more: upsert (either inserts or updates)

Several operators:
$ne, $and, $or, $lt, $gt, $inc and so on.

Repository Pattern makes development very easy.

Terminology

RDBMS        MongoDB
Database     Database
Table, View  Collection
Row          Document (JSON, BSON)
Column       Field
Index        Index
Join         Embedded Document
Foreign Key  Reference
Partition    Shard

Features
Document-oriented storage
Full Index Support
Replication & High Availability
Auto-Sharding
Querying
Fast In-Place Updates
Map/Reduce

Agile
Scalable

Mongodb Sharding
[Figure: sharded cluster with replica sets and three config servers C1, C2, C3, each running mongod]

Config servers: keep the mapping
Mongos: routing servers
Mongod: master-slave replicas

Mongodb Data Analysis

CRUD
Create
db.collection.insert(<document>)
db.collection.save(<document>)
db.collection.update(<query>, <update>, {upsert: true})

Read
db.collection.find(<query>, <projection>)
db.collection.findOne(<query>, <projection>)

Update
db.collection.update(<query>, <update>, <options>)

Delete
db.collection.remove(<query>, <justOne>)

SQL to Mongodb

SQL Statement                                 Mongo Statement
CREATE TABLE USERS (a Number, b Number)       db.createCollection("mycoll")
SELECT a, b FROM users                        db.users.find({}, {a: 1, b: 1})
SELECT * FROM users WHERE name LIKE "%Joe%"   db.users.find({name: /Joe/})
SELECT * FROM users WHERE a = 1 AND b = 'q'   db.users.find({a: 1, b: 'q'})
SELECT COUNT(*) FROM users                    db.users.count()
CREATE INDEX myindexname ON users(name)       db.users.ensureIndex({name: 1})
UPDATE users SET a = 1 WHERE b = 'q'          db.users.update({b: 'q'}, {$set: {a: 1}}, false, true)
DELETE FROM users WHERE z = "abc"             db.users.remove({z: 'abc'})

BSON
JSON has a powerful, but limited set of datatypes
arrays, objects, strings, numbers and null

BSON is a binary representation of JSON
Adds extra datatypes: Date, Int types, Id, ...
Optimized for performance and navigational abilities
And compression

MongoDB sends and stores data in BSON
bsonspec.org

Mongo Document

Collection

Query

Modification

Select

Query Stage

CRUD example
Create, Read, Update, Delete

> db.user.insert({
    first: "John",
    last: "Doe",
    icd: [250, 151],
    age: 39
  })

> db.user.find()
{
  "_id": ObjectId("51"),
  "first": "John",
  "last": "Doe",
  "age": 39
}

> db.user.update(
    {"_id": ObjectId("51")},
    {$set: {age: 40, salary: 7000}}
  )

> db.user.remove({"first": /^J/})

Import Excel into Mongodb

mongoimport --db dbname --type csv --headerline --file /directory/file.csv

Ex. mongoimport -d mydb -c things --type csv --file locations.csv --headerline

Export Mongodb to Excel
mongoexport --host localhost --port 27017 --username abc --password 12345 --collection collName --csv --fields id,sex,brithday,icd --out all_patients.csv --db my_db --query "{\"_id\": {\"\$oid\": \"5058ca07b7628c0999000006\"}}"

Blog Post Document
> p = {author: "Chris",
       date: new ISODate(),
       text: "About MongoDB...",
       tags: ["tech", "databases"]}
> db.posts.save(p)

Querying
> db.posts.find()
{_id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
 author: "Chris",
 date: ISODate("2012-02-02T11:52:27.442Z"),
 text: "About MongoDB...",
 tags: ["tech", "databases"]}

Notes:
_id is unique, but can be anything you'd like

Insertion
db.unicorns.insert({name: 'Horny', dob: new Date(1992, 2, 13, 7, 47), loves: ['carrot', 'papaya'], weight: 600, gender: 'm', vampires: 63});
db.unicorns.insert({name: 'Aurora', dob: new Date(1991, 0, 24, 13, 0), loves: ['carrot', 'grape'], weight: 450, gender: 'f', vampires: 43});
db.unicorns.insert({name: 'Unicrom', dob: new Date(1973, 1, 9, 22, 10), loves: ['energon', 'redbull'], weight: 984, gender: 'm', vampires: 182});
db.unicorns.insert({name: 'Roooooodles', dob: new Date(1979, 7, 18, 18, 44), loves: ['apple'], weight: 575, gender: 'm', vampires: 99});
db.unicorns.insert({name: 'Solnara', dob: new Date(1985, 6, 4, 2, 1), loves: ['apple', 'carrot', 'chocolate'], weight: 550, gender: 'f', vampires: 80});
db.unicorns.insert({name: 'Ayna', dob: new Date(1998, 2, 7, 8, 30), loves: ['strawberry', 'lemon'], weight: 733, gender: 'f', vampires: 40});
db.unicorns.insert({name: 'Kenny', dob: new Date(1997, 6, 1, 10, 42), loves: ['grape', 'lemon'], weight: 690, gender: 'm', vampires: 39});
db.unicorns.insert({name: 'Raleigh', dob: new Date(2005, 4, 3, 0, 57), loves: ['apple', 'sugar'], weight: 421, gender: 'm', vampires: 2});
db.unicorns.insert({name: 'Leia', dob: new Date(2001, 9, 8, 14, 53), loves: ['apple', 'watermelon'], weight: 601, gender: 'f', vampires: 33});
db.unicorns.insert({name: 'Pilot', dob: new Date(1997, 2, 1, 5, 3), loves: ['apple', 'watermelon'], weight: 650, gender: 'm', vampires: 54});
db.unicorns.insert({name: 'Nimue', dob: new Date(1999, 11, 20, 16, 15), loves: ['grape', 'carrot'], weight: 540, gender: 'f'});
db.unicorns.insert({name: 'Dunx', dob: new Date(1976, 6, 18, 18, 18), loves: ['grape', 'watermelon'], weight: 704, gender: 'm', vampires: 165});

Master Selector: {field: value}
db.unicorns.find({gender: 'm', weight: {$gt: 700}})
or (not quite the same thing, but for demonstration purposes)
db.unicorns.find({gender: {$ne: 'f'}, weight: {$gte: 701}})
$lt, $lte, $gt, $gte and $ne are used for less than, less than or equal, greater than, greater than or equal and not equal operations

$exists and $or
db.unicorns.find({vampires: {$exists: false}})
db.unicorns.find({gender: 'f', $or: [{loves: 'apple'}, {loves: 'orange'}, {weight: {$lt: 500}}]})
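To show how such selector documents are read, here is a toy in-memory matcher written in Python. It is not MongoDB's implementation and covers only the operators used on these slides; the function names and the five sample documents (a subset of the unicorns) are assumptions for illustration.

```python
OPERATORS = {
    "$lt":  lambda value, arg: value is not None and value < arg,
    "$lte": lambda value, arg: value is not None and value <= arg,
    "$gt":  lambda value, arg: value is not None and value > arg,
    "$gte": lambda value, arg: value is not None and value >= arg,
    "$ne":  lambda value, arg: value != arg,
}

def field_matches(doc, field, condition):
    value = doc.get(field)
    if isinstance(condition, dict):               # operator document, e.g. {"$gt": 700}
        for op, arg in condition.items():
            if op == "$exists":
                if (field in doc) != arg:
                    return False
            elif not OPERATORS[op](value, arg):
                return False
        return True
    if isinstance(value, list):                   # array field: match any element
        return condition in value
    return value == condition                     # plain equality

def matches(doc, query):
    for key, condition in query.items():
        if key == "$or":
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif not field_matches(doc, key, condition):
            return False
    return True

unicorns = [
    {"name": "Aurora", "gender": "f", "weight": 450, "loves": ["carrot", "grape"], "vampires": 43},
    {"name": "Nimue",  "gender": "f", "weight": 540, "loves": ["grape", "carrot"]},
    {"name": "Leia",   "gender": "f", "weight": 601, "loves": ["apple", "watermelon"], "vampires": 33},
    {"name": "Kenny",  "gender": "m", "weight": 690, "loves": ["grape", "lemon"], "vampires": 39},
    {"name": "Dunx",   "gender": "m", "weight": 704, "loves": ["grape", "watermelon"], "vampires": 165},
]

def find(query):
    return [u["name"] for u in unicorns if matches(u, query)]
```

All top-level keys in a selector are combined with AND; `$or` succeeds when at least one of its sub-selectors matches.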


Indexing
// 1 means ascending, -1 means descending
> db.unicorns.ensureIndex({name: 1})
> db.unicorns.findOne({name: 'Kenny'})
{name: 'Kenny', dob: new Date(1997, 6, 1, 10, 42), loves: ['grape', 'lemon'], weight: 690, gender: 'm', vampires: 39}

Indexing on Multiple Fields
// 1 means ascending, -1 means descending
db.posts.ensureIndex({author: 1, ts: -1})
Query:
db.posts.find({author: 'Chris'}).sort({ts: -1})
Return:
[
  {_id: ObjectId("4c4ba5c0672c685e5e8aabf3"), author: "Chris", ...},
  {_id: ObjectId("4f61d325c496820ceba84124"), author: "Chris", ...}
]

GIS
location1 = {
  name: "10gen East Coast",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  tags: ["business", "mongodb"],
  latlong: [40.0, 72.0]
}
db.locations.ensureIndex({latlong: "2d"})
db.locations.find({latlong: {$near: [40, 70]}})

GIS Place
location1 = {
  name: "10gen HQ",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  latlong: [40.0, 72.0],
  tags: ["business", "cool place"],
  tips: [
    {user: "nosh", time: "6/26/2010", tip: "stop by for office hours on Wednesdays from 4-6pm"},
    {.....},
  ]
}

Querying your Places
Creating your indexes
db.locations.ensureIndex({tags: 1})
db.locations.ensureIndex({name: 1})
db.locations.ensureIndex({latlong: "2d"})
Finding places:
db.locations.find({latlong: {$near: [40, 70]}})
With regular expressions:
db.locations.find({name: /^typeaheadstring/})
By tag:
db.locations.find({tags: "business"})

Inserting and updating locations
Initial data load:
db.locations.insert(place)

Using update to add tips:
db.locations.update({name: "10gen HQ"},
  {$push: {tips:
    {user: "nosh", time: "6/26/2010",
     tip: "stop by for office hours on Wednesdays from 4-6"}}})

Requirements
Locations
Need to store locations (Offices, Restaurants etc)
Want to be able to store name, address and tags
Maybe User Generated Content, i.e. tips / small notes?
Want to be able to find other locations nearby

Checkins
User should be able to "check in" to a location
Want to be able to generate statistics

Users
user1 = {
  name: "nosh",
  email: "nosh@10gen.com",
  .
  .
  .
  checkins: [{location: "10gen HQ",
              ts: "9/20/2010 10:12:00",
              ...},
             ...
  ]
}

Simple Stats
db.users.find({"checkins.location": "10gen HQ"})
db.checkins.find({"checkins.location": "10gen HQ"})
  .sort({ts: -1}).limit(10)
db.checkins.find({"checkins.location": "10gen HQ",
  ts: {$gt: midnight}}).count()

Alternative
user1 = {
  name: "nosh",
  email: "nosh@10gen.com",
  .
  .
  .
  checkins: [4b97e62bf1d8c7152c9ccb74, 5a20e62bf1d8c736ab]
}

checkins [] = ObjectId references to the locations collection

User Check in

Check-in = 2 ops
read location to obtain location id
update ($push) location id to user object
Queries: find all locations where a user checked in:
checkin_array = db.users.find({..}, {checkins: true}).checkins
db.location.find({_id: {$in: checkin_array}})

Query Operators
Conditional Operators
$all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type
$lt, $lte, $gt, $gte

// find posts with any tags
> db.posts.find({tags: {$exists: true}})
// find posts matching a regular expression
> db.posts.find({author: /^ro*/i})
// count posts by author
> db.posts.find({author: 'Chris'}).count()

Examine the query plan
Query: db.posts.find({"author": 'Chris'}).explain()
Result:
{
  "cursor" : "BtreeCursor author_1",
  "nscanned" : 1,
  "nscannedObjects" : 1,
  "n" : 1,
  "millis" : 0,
  "indexBounds" : {
    "author" : [
      [
        "Chris",
        "Chris"
      ]
    ]
  }
}

Atomic Operators
$set, $unset, $inc, $push, $pushAll, $pull, $pullAll, $bit
// Create a comment
> new_comment = {author: "Fred", date: new Date(), text: "Best Post Ever!"}
// Add to post
> db.posts.update({_id: "..."}, {"$push": {comments: new_comment}, "$inc": {comments_count: 1}});

Nested Documents
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "Chris",
  date : "Thu Feb 02 2012 11:50:01",
  text : "About MongoDB...",
  tags : [ "tech", "databases" ],
  comments : [{
    author : "Fred",
    date : "Fri Feb 03 2012 13:23:11",
    text : "Best Post Ever!"
  }],
  comment_count : 1
}

Secondary Indexing
// Index nested documents
> db.posts.ensureIndex({"comments.author": 1})
> db.posts.find({"comments.author": "Fred"})
// Index on tags (multi-key index)
> db.posts.ensureIndex({tags: 1})
> db.posts.find({tags: "tech"})

GEO

Geo-spatial queries
Require a geo index
Find points near a given point
Find points within a polygon/sphere
// geo-spatial index
> db.posts.ensureIndex({"author.location": "2d"})
> db.posts.find({"author.location": {$near: [22, 42]}})

Memory Mapped Files
"A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource." [1]
mmap()

[1] http://en.wikipedia.org/wiki/Memory-mapped_file

Replica Sets
Redundancy and failover
Zero downtime for upgrades and maintenance
Master-slave replication
Strong consistency
Delayed consistency
Geospatial features
[Figure: a client connected to replica set replica1 on Host1:10000, Host2:10001, and Host3:10002]

Sharding
Partition your data
Scale write throughput
Increase capacity
Auto-balancing
[Figure: a client connected to shard1 (Host1:10000) and shard2 (Host2:10010), with configdb on Host3:20000 and Host4:30000]

Sharding: horizontal scaling

Unsharded Deployment
[Figure: one primary and two secondaries]
Configure as a replica set for automated failover
Async replication between nodes
Add more secondaries to scale reads

High Throughput
Sharding reduces the number of operations each shard handles. Each shard processes fewer operations as the cluster grows. As a result, a cluster can increase capacity and throughput horizontally. For example, to insert data, the application only needs to access the shard responsible for that record.
Sharding reduces the amount of data that each server needs to store. Each shard stores less data as the cluster grows. For example, if a database has a 1 terabyte data set, and there are 4 shards, then each shard might hold only 256 GB of data. If there are 40 shards, then each shard might hold only 25 GB of data.

Sharded Deployment
[Figure: MongoS router and config server in front of replica sets, each with a primary and a secondary]
Autosharding distributes data among two or more replica sets
Mongo Config Server(s) handle distribution & balancing
Transparent to applications

Sharded Cluster

Main components
Shard
A Shard is a node of the cluster
Each Shard can be a single mongod or a replica set

Config Server (metadata storage)
Stores cluster chunk ranges and locations
Can be only 1 or 3 (production must have 3)
Not a replica set

Mongos
Acts as a router / balancer
No local data (persists to config database)
Can be 1 or many

Relational Database Clustering

h(k) = (1st letter of k) (mod 7), with a=0, b=1, ... z=25

k  1st letter  h(k)
c  2           2
d  3           3
g  6           6
p  15          1

index (this is a secondary hash index)
bucket  keys
0
1       parrot, parakeet
2       cat, cat
3       dog, dog
4
5
6       goat

data file
cat       Natasha
parakeet  Tweety
dog       Buck
dog       Lassie
goat      Sertrude
parrot    Elmer
cat       Mittens
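The hash index in this figure is small enough to rebuild in a few lines. The sketch below uses the same hash function and data file; storing record positions in each bucket and the function names are assumptions about how the index would point into the file.

```python
def h(key, buckets=7):
    """Position of the first letter in the alphabet (a=0 ... z=25), mod bucket count."""
    return (ord(key[0].lower()) - ord("a")) % buckets

records = [("cat", "Natasha"), ("parakeet", "Tweety"), ("dog", "Buck"), ("dog", "Lassie"),
           ("goat", "Sertrude"), ("parrot", "Elmer"), ("cat", "Mittens")]

def build_index(records, buckets=7):
    """Secondary hash index: each bucket lists (key, position in the data file)."""
    index = [[] for _ in range(buckets)]
    for position, (key, _) in enumerate(records):
        index[h(key, buckets)].append((key, position))
    return index

def lookup(index, records, key):
    """Follow the bucket for the key, skipping other keys that share the bucket."""
    return [records[pos][1] for k, pos in index[h(key, len(index))] if k == key]

index = build_index(records)
```

Because parrot and parakeet hash to the same bucket, a lookup still has to compare the full key inside the bucket.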

Deploy a sharded cluster
Start the Config Server Database Instances
mkdir /data/configdb
mongod --configsvr --dbpath <path> --port <port>
mongod --configsvr --dbpath /data/configdb --port 27019

Start the mongos Instances (27017)
mongos --configdb <config server hostnames>
mongos --configdb cfg0.example.net:27019,cfg1.example.net:27019,cfg2.example.net:27019

Deploy a sharded cluster (cont'd)
Add Shards to the Cluster
mongo --host <hostname of machine running mongos> --port <port mongos listens on>
mongo --host mongos0.example.net --port 27017

sh.addShard()
sh.addShard("rs1/mongodb0.example.net:27017")

Enable Sharding for a Database
mongo --host <hostname of machine running mongos> --port <port mongos listens on>
sh.enableSharding("<database>")
~ db.runCommand({enableSharding: <database>})

Before you can shard a collection, you must enable sharding for the collection's database.

Chunk Partitioning

A chunk is a section of the entire range

Chunk splitting

A chunk is split once it exceeds the maximum size
There is no split point if all documents have the same shard key
Chunk split is a logical operation (no data is moved)

Range-based Sharding

Hash-based Sharding

Linear Hashing: Example
h0(x) = x mod N, h1(x) = x mod (2*N)
Insert 15 and 3
[Figure: bucket contents by bucket id before and after the inserts; legible keys: 4, 5, 7, 11, 13, 15, 17]

Linear Hashing: Search
h0(x) = x mod N (for the un-split buckets)
h1(x) = x mod (2*N) (for the split ones)
[Figure: bucket contents by bucket id; legible keys: 4, 5, 7, 11, 13, 15, 17]
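The search rule can be seen in a compact linear hashing table. This is a sketch under stated assumptions: N starts at 4, a bucket holds 2 keys before it overflows, overflowing keys stay in the same list, and each overflow splits exactly one bucket in round-robin order. The bucket figures on the slides are not legible enough to reproduce, so the starting keys are chosen for illustration.

```python
class LinearHashTable:
    def __init__(self, n=4, capacity=2):
        self.n = n                    # N: number of buckets at the start of the round
        self.next = 0                 # next bucket to be split
        self.capacity = capacity      # assumed bucket capacity
        self.buckets = [[] for _ in range(n)]

    def _bucket_id(self, x):
        b = x % self.n                # h0 for the un-split buckets
        if b < self.next:             # this bucket was already split in this round
            b = x % (2 * self.n)      # h1 for the split ones
        return b

    def search(self, x):
        return x in self.buckets[self._bucket_id(x)]

    def insert(self, x):
        bucket = self.buckets[self._bucket_id(x)]
        bucket.append(x)
        if len(bucket) > self.capacity:   # an overflow triggers one split
            self._split()

    def _split(self):
        old = self.buckets[self.next]
        self.buckets.append([])           # image bucket, id = next + N
        self.buckets[self.next] = [v for v in old if v % (2 * self.n) == self.next]
        self.buckets[-1] = [v for v in old if v % (2 * self.n) != self.next]
        self.next += 1
        if self.next == self.n:           # every bucket split: start a new round
            self.n *= 2
            self.next = 0

table = LinearHashTable()
for key in [4, 13, 17, 7, 11]:
    table.insert(key)
for key in [15, 3]:                       # the two inserts named on the slide
    table.insert(key)
```

The bucket that is split is the one the pointer has reached, not necessarily the one that overflowed, which is why bucket 3 stays over capacity here while buckets 0 and 1 are split.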

Enable Sharding for a Collection
Determine what you will use for the shard key. Your selection of the shard key affects the efficiency of sharding.
If the collection already contains data, you must create an index on the shard key using ensureIndex(). If the collection is empty, then MongoDB will create the index as part of the sh.shardCollection() step.
Enable sharding for a collection by issuing the sh.shardCollection() method in the mongo shell. The method uses:
sh.shardCollection("<database>.<collection>", shard-key-pattern)
sh.shardCollection("records.people", {"zipcode": 1, "name": 1})
sh.shardCollection("people.addresses", {"state": 1, "_id": 1})
sh.shardCollection("assets.chairs", {"type": 1, "_id": 1})
sh.shardCollection("events.alerts", {"_id": "hashed"})

Mixed
[Figure: a client connected to shards shard1 ... shardn combined with replication; hosts shown: Host1:10000, Host2:10001, Host3:10002 (replica1), Host4:10010 (shardn), and configdb on Host5:20000, Host6:30000, Host7:30000]

Map/Reduce
db.collection.mapReduce(
  <mapfunction>,
  <reducefunction>,
  {
    out: <collection>,
    query: <>,
    sort: <>,
    limit: <number>,
    finalize: <function>,
    scope: <>,
    jsMode: <boolean>,
    verbose: <boolean>
  }
)
var mapFunction1 = function() { emit(this.cust_id, this.price); };
var reduceFunction1 = function(keyCustId, valuesPrices) { return Array.sum(valuesPrices); };

MapReduce
The caller provides map and reduce functions written in JavaScript
// Emit each tag
> map = function() {
    this['tags'].forEach(function(item) { emit(item, 1); });
  };
// Calculate totals
> reduce = function(key, values) {
    var total = 0;
    var valuesSize = values.length;
    for (var i = 0; i < valuesSize; i++) {
      total += parseInt(values[i], 10);
    }
    return total;
  };

MapReduce
// run the map reduce
> db.posts.mapReduce(map, reduce, {"out": {inline: 1}});

Answer
{
  "results": [
    {"_id": "databases", "value": 1},
    {"_id": "tech", "value": 1}
  ],
  "timeMillis": 1,
  "counts": {
    "input": 1,
    "emit": 2,
    "reduce": 0,
    "output": 2
  },
  "ok": 1
}
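The data flow behind this tag count can be imitated in plain Python without a database. The sketch below is not the MongoDB API; `map_reduce`, `emit_tags`, and `total` are hypothetical names, and the single post mirrors the blog post document used earlier.

```python
from collections import defaultdict

def map_reduce(documents, map_fn, reduce_fn):
    """Group every emitted (key, value) pair by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # A key with a single emitted value is passed through without calling reduce.
    return {k: (v[0] if len(v) == 1 else reduce_fn(k, v)) for k, v in grouped.items()}

def emit_tags(doc):
    """Map step: emit (tag, 1) for each tag of a post."""
    for tag in doc.get("tags", []):
        yield tag, 1

def total(key, values):
    """Reduce step: add up the counts for one tag."""
    return sum(values)

posts = [{"author": "Chris", "tags": ["tech", "databases"]}]
tag_counts = map_reduce(posts, emit_tags, total)
```

With one post, each tag is emitted once, so the reduce step never runs; adding a second post that shares a tag makes the reduce step sum the two counts.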


References
http://docs.mongodb.org/manual/core/inter-process-authentication/
http://api.mongodb.org/python/2.6.2/examples/authentication.html
https://securosis.com/assets/library/reports/SecuringBigData_FINAL.pdf
http://docs.mongodb.org/manual/reference/user-privileges/
http://www.slideshare.net/DefconRussia/firstov-attacking-mongodb

Case: CRM
I-Jen Chiang

Relationship Marketing

Relationship Marketing is a Process
communicating with your customers
listening to their responses

Companies take actions
marketing campaigns
new products
new channels
new packaging

Relationship Marketing (continued)

Customers and prospects respond
most common response is no response

This results in a cycle
data is generated
opportunities to learn from the data and improve the process emerge

An Illustration
A few years ago, UPS went on strike
FedEx saw its volume increase
After the strike, its volume fell
FedEx identified those customers whose FedEx volumes had increased and then decreased
These customers were using UPS again
FedEx made special offers to these customers to get all of their business

The Corporate Memory
Several years ago, Lands' End could not recognize regular Christmas shoppers
  some people generally don't shop from catalogs
  but spend hundreds of dollars every Christmas
  if you only store 6 months of history, you will miss them

Victoria's Secret builds customer loyalty with a no-hassle returns policy
  some "loyal customers" return several expensive outfits each month
  they are really "loyal renters"

CRM Requires Learning and More

Form a learning relationship with your customers
  Notice their needs
    Online Transaction Processing Systems
  Remember their preferences
    Decision Support Data Warehouse
  Learn how to serve them better
    Data Mining
  Act to make customers more profitable

Customer Relationship Management (CRM)

Traditional marketing | CRM
Goal: expand customer base, increase market share by mass marketing | Goal: establish a profitable, long-term, one-to-one relationship with customers; understand their needs, preferences, expectations
Product-oriented view | Customer-oriented view
Mass marketing / mass production | Mass customization, one-to-one marketing
Standardization of customer needs | Customer-supplier relationship
Transactional relationship | Relational approach

What is CRM?
The approach of identifying, establishing, maintaining, and enhancing lasting relationships with customers.

The formation of bonds between a company and its customers.

Strategies in CRM for Mass Customization

Prospecting (of first-time consumers)
Loyalty
Cross-selling / Up-selling
Win-back or Save

Business Processes Organize Around the Customer Lifecycle

[Figure: customer lifecycle. Processes: Acquisition, Activation, Relationship Management, Winback. Customer states: Prospect, New Customer, Established Customer (High Value, High Potential, Low Value), Former Customer (Voluntary Churn, Forced Churn).]

Strategic Customer Relationship

[Figure: segmentation of the entire customer universe, with one action per segment.]
  Satisfy these customers before they defect
  Spend less marketing $ on these segments
  Change higher rates
  Retain these loyal customers!
  Cross-sell and up-sell opportunities (behavioral clustering/segmentation)
  Get these customers back
  Distinct demographic group (demographic clustering/segmentation)
  Market distinct portfolio in sequence
  Product associations, customer basket analysis: market the missing products
Marketing Strategy

[Figure: marketing strategy by customer value.]
  High-valued customer: keep; attrition recognition; clustering of purchasing behaviors
  Mid-valued customer: cross-sell, up-sell; get best benefit with low costs
  Low-valued customer: before attrition, win-back
  Find customers: marketing benefit registration, demography clustering, differential segmentation
  Find products: product design and refinement, product classifications

The Profit/Loss Matrix

Someone who scores in the top 30% is predicted to respond
Those predicted to respond cost $1
  those who actually respond yield a gain of $45
  those who don't respond yield no gain
Those not predicted to respond cost $0 and yield no gain

                 ACTUAL YES   ACTUAL NO
Predicted YES    $44          -$1
Predicted NO     $0           $0
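The matrix above can be turned into a campaign total. The sketch below is only an illustration: the payoffs ($44, -$1, $0, $0) come from the slide, while the function name `campaign_profit` and the customer counts are hypothetical.

```python
# Payoff per customer for each (predicted, actual) cell, taken from the slide:
# a contact costs $1 and a response brings in $45, so a hit nets $44.
PAYOFF = {
    ("yes", "yes"): 44,
    ("yes", "no"): -1,
    ("no", "yes"): 0,
    ("no", "no"): 0,
}

def campaign_profit(counts):
    """Total profit for a dict mapping (predicted, actual) to a customer count."""
    return sum(PAYOFF[cell] * n for cell, n in counts.items())

# Hypothetical campaign: 300 customers contacted, 20 of them respond.
example_counts = {
    ("yes", "yes"): 20,
    ("yes", "no"): 280,
    ("no", "yes"): 10,
    ("no", "no"): 690,
}
profit = campaign_profit(example_counts)
print(profit)
```

Responders who were not contacted contribute nothing here, which matches the $0 cells of the matrix.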

Marketing Plan
Correctly identify the customer requirements
Define promotion for customers
Adjust the flight schedule

Recommendation EC
Multiple and effective merchandising platform
Cross-products (different types) recommendation

[Figure: recommendation flow. Online retailers and a marketplace feed, through an optional anonymizing filter, into "My Library & Preferences" (My Library of all media types, My Preferences); the user profile and recommendations inform each other, producing aggregated recommendations and recommendation-based purchases.]
Information/data integration
A new faculty member wants to find houses with 2 bedrooms priced under 200K
Sources on the Web which provide house listings:
  realestate.com
  homeseekers.com
  homes.com

Architecture of Data Integration System
Simply pose the query in the mediated schema:
  Find houses with 2 bedrooms priced under 200K

[Figure: the mediated schema sits above source schema 1 (realestate.com), source schema 2 (homeseekers.com), and source schema 3 (homes.com).]

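Semantic Matches between Schemas
The schema-matching problem is to find semantic mappings between the elements of the two schemas

Mediated schema: price, agent-name, address
Source schema (homes.com): listed-price, contact-name, city, state
  1-1 match: one mediated element corresponds to one source element
  complex match: one mediated element corresponds to a combination of source elements

Sample rows in the source:
  listed-price   contact-name   city      state
  320K           Jane Brown     Seattle   WA
  240K           Mike Smith     Miami     FL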

Big Data Platforms and Paradigms
Sourangshu Bhattacharya (CSE)

Outline
Big Data
  What is Big Data?
  Challenges with Big Data processing
  Hadoop: HDFS
  MapReduce
  Pig

Analytics
  Basic statistics
  Text analytics
  SQL queries

What is Big Data?
6 billion web queries per day
  ~6 TB per day, ~2.5 PB per year
10 billion display ads per day
  ~15 TB per day, ~5.5 PB per year
30 billion text ads per day
  ~30 TB per day, ~11 PB per year
150 million credit card transactions per day
  ~150 GB per day, ~55 TB per year
100 billion emails per day
  ~1 PB per day, ~360 PB per year

What is Big Data?
CERN Large Hadron Collider
  ~10 PB/year at start
  ~1000 PB in ~10 years
  2500 physicists collaborating

Large Synoptic Survey Telescope (NSF, DOE, and private donors)
  ~5-10 PB/year at start in 2012
  ~100 PB by 2025

Pan-STARRS (Haleakala, Hawaii), US Air Force
  now: 800 TB/year
  soon: 4 PB/year
Courtesy: King et al., IEEE Big Data 2013.

Big Data Challenges

3Vs
Scalability
Cost effective
Ease of use
Flexibility
Fault tolerance

3Vs: Volume, Variety, Velocity.
Fault tolerance:
  Computers:                      1      10     100
  Chance of failure in an hour:   0.01   0.09   0.63
Data locality.
  Computation goes to data
What is Hadoop?

Hadoop MapReduce: batch processing.
Spark: distributed in-memory processing.
Storm: distributed online processing.
GraphLab: distributed processing on graphs.

A scalable, fault-tolerant distributed system for data storage and processing.
Core Hadoop:
  Hadoop Distributed File System (HDFS)
  Hadoop YARN: job scheduling and cluster resource management
  Hadoop MapReduce: framework for distributed data processing

Open-source system with large community support.
https://hadoop.apache.org/

Hadoop Architecture
YARN

Courtesy: http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html

HDFS
Assumptions
  Hardware failure is the norm.
  Streaming data access.
  Write once, read many times.
  High throughput, not low latency.
  Large data sets.

Characteristics:
  Performs best with a modest number of large files
  Optimized for streaming reads
  Layer on top of the native file system.

HDFS
Data is organized into files and directories.
Files are divided into blocks and distributed to nodes.
Block placement is known at the time of read
  Computation moved to the same node.

Replication is used for:
  Speed
  Fault tolerance
  Self-healing.

What is MapReduce?
Method for distributing a task across multiple servers.
Proposed by Dean and Ghemawat, 2004.
Consists of two developer-created phases:
  Map
  Reduce

In between Map and Reduce is the Shuffle and Sort phase.

Example question: what was the max/min temperature for the last century?

Map Phase
User writes the mapper method.
Input is an unstructured record:
  E.g. a row of an RDBMS table,
  a line of a text file, etc.

Output is a set of records of the form <key, value>
  Both key and value can be anything, e.g. text, number, etc.
  E.g. for a row of an RDBMS table: <column id, value>
  Line of a text file: <word, count>

Shuffle/Sort phase
The shuffle phase ensures that all the mapper output records with the same key value go to the same reducer.
Sort ensures that among the records received at each reducer, records with the same key arrive together.

Reduce phase
The reducer is a user-defined function which processes mapper output records with some of the keys output by the mapper.
Input is of the form <key, value>
  All records having the same key arrive together.

Output is a set of records of the form <key, value>
  Key is not important

Parallel picture

Example
WordCount: count the total number of occurrences of each word
Map

Reduce
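The three phases described above can be imitated in a few lines of single-process Python. This is a sketch of the data flow only, not of Hadoop itself; the function names and the sample sentences are made up for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: collect the values of each key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Reduce: sum the counts collected for one word."""
    return (key, sum(values))

def word_count(lines):
    mapped = [pair for line in lines for pair in mapper(line)]
    return [reducer(key, values) for key, values in shuffle(mapped)]

counts = word_count(["big data big clusters", "data moves to data"])
print(counts)
```

In a real cluster the mapper calls run on different nodes and the shuffle moves data over the network; only the mapper and reducer bodies are written by the user.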

Hadoop MapReduce
Provides:
  Automatic parallelization and distribution
  Fault tolerance
  Methods for interfacing with HDFS for colocation of computation and storage of output.
  Status and monitoring tools
  API in Java
  Ability to define the mapper and reducer in many languages through Hadoop streaming.

WordCount

WordCount: Mapper

WordCount: Reducer

WordCount: Main

Hadoop Streaming
Allows the user to specify mappers and reducers as executables.
The mapper launches the mapper executable and pipes the input data to the stdin of the executable.
The mapper executable processes the data and writes the mapper output to stdout, which is collected by the mapper.
Mapper key and value are separated by a tab.
The reducer operates in the same way.
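The streaming contract above (text lines in, tab-separated key and value out, reducer input sorted by key) can be sketched in Python as follows. The functions take file-like objects so the pipeline can be simulated locally with `io.StringIO`; in an actual streaming job they would be separate scripts reading `sys.stdin` and writing `sys.stdout`. All names here are illustrative.

```python
import io

def run_mapper(stdin, stdout):
    """Read raw text lines, write one 'word<TAB>1' record per word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def run_reducer(stdin, stdout):
    """Read key-sorted 'word<TAB>count' records, write one total per word."""
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

# Local simulation: mapper -> sort (stand-in for shuffle/sort) -> reducer.
map_out = io.StringIO()
run_mapper(io.StringIO("a b a\nb a\n"), map_out)
sorted_records = "".join(sorted(map_out.getvalue().splitlines(keepends=True)))
reduce_out = io.StringIO()
run_reducer(io.StringIO(sorted_records), reduce_out)
result = reduce_out.getvalue()
print(result, end="")
```

The reducer relies on equal keys being adjacent, which is exactly what the sort step guarantees.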

Word count in Perl
Main command:

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.6.0.jar \
  -input /tmp/small_svm_data \
  -output /user/sourangshu/output \
  -mapper "/usr/bin/perl wcmap.pl" \
  -reducer "/usr/bin/perl wcreduce.pl" \
  -file wcmap.pl -file wcreduce.pl

Mapper: wcmap.pl

Word count in Perl
Reducer:

Pig
Pig is a system for fast development of big data processing algorithms.
Pig has two components:
  Pig Latin language
  Interpreter for Pig Latin.

A program expressed in Pig Latin is translated into a set of MapReduce jobs.
Users interact with the Pig Latin language.

Pig Latin
A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
  Exception: LOAD and STORE statements, which read from and write to files.

A general Pig program looks like:
  A LOAD statement reads data from the file system.
  A series of "transformation" statements process the data.
  A STORE statement writes output to the file system; or, a DUMP statement displays output to the screen.

Pig Latin Datatypes
A relation is a bag (more specifically, an outer bag).
A bag is a collection of tuples, e.g. {(19,2), (18,1)}
A tuple is an ordered set of fields, e.g. (19,2)
A map is a set of key-value pairs, e.g. [open#apache]
A field is a piece of data
A field can be of type:
  Scalar: int, long, float, double
  Array: chararray, bytearray
  Complex fields: bag, tuple, map.

Example
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
Output:

(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)

Pig Latin
FOREACH .. GENERATE ..: applies an operation to each element of a bag.
Example:
A = LOAD 'data' AS (f1, f2, f3);
B = FOREACH A GENERATE f1 + 5;
C = FOREACH A GENERATE f1 + f2;

Pig Latin
Filtering:

X = FILTER A BY (f1 == 8) OR (NOT (f2 + f3 > f1));
Grouping:

X = GROUP A BY f1;
DUMP X;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
Grouping generates an inner bag.
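For readers who do not know Pig, the grouping step above behaves roughly like the Python below. The rows of `A` are an assumption chosen to reproduce the DUMP output shown; Pig itself runs this as a MapReduce job rather than in memory.

```python
from collections import defaultdict

def group_by_first_field(relation):
    """Mimic 'GROUP A BY f1': one (key, inner bag) pair per distinct f1."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[0]].append(row)
    return sorted(groups.items())

# Assumed contents of relation A, consistent with the output above.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3)]
X = group_by_first_field(A)
for key, bag in X:
    print(key, bag)
```

Each result pairs the group key with a list of the full original tuples, which is the inner bag the slide refers to.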

Pig Latin
Cogrouping:

Pig Latin Operators on Bags
AVG: computes the average of the numeric values in a single-column bag.

Pig Latin Operators
COUNT: computes the number of elements in a bag.
IsEmpty: checks if a bag or map is empty.
MAX: computes the maximum.
MIN: computes the minimum.
SUM: computes the sum of the numeric values.
SIZE: computes the number of elements based on any Pig data type.

Pig Latin
Flatten converts inner bags to multiple tuples:
X = FOREACH C GENERATE group, FLATTEN(A);
DUMP X;
(1,1,2,3)
(4,4,2,1)
(4,4,3,3)
(8,8,3,4)
(8,8,4,3)

Example: Projection
X = FOREACH A GENERATE a1, a2;

DUMP X;
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)

Business Analytics for Decision Making
Ram Babu Roy

Decision Making
Definition
  Selecting the best solution from two or more alternatives
Levels of Managerial Decision Making
  Strategic management (executives and directors): unstructured decisions
  Tactical management (business unit managers and self-directed teams): semistructured decisions
  Operational management (operating managers and self-directed teams): structured decisions
Information characteristics
  Toward the strategic level: ad hoc, unscheduled, summarized, infrequent, forward looking, external, wide scope
  Toward the operational level: prespecified, scheduled, detailed, frequent, historical, internal, narrow focus

Source: Euiho (David) Suh, POSTECH Strategic Management of Information and Technology Laboratory, (http://posmit.postech.ac.kr)

Analysis, Analytics, and BI
Analysis: process of careful study of something to learn about its components, what they do, and how they are related to each other
Analytics: application of scientific methods and techniques for analysis
Business Intelligence and Analytics [1]: techniques, technologies, systems, practices, methodologies, and applications that analyze critical business data to help an enterprise better understand its business and market and make timely business decisions
[1] Source: Business Intelligence and Analytics: From big data to big impact, H. Chen et al., MIS Quarterly, Dec 2012

BI&A Overview

Source: Business Intelligence and Analytics: From big data to big impact, H. Chen et al., MIS Quarterly, Dec 2012

Big Data and Traditional Analytics

Source: Big Data at Work, Davenport, HBS

Which type of analytics is appropriate?
Once you gather data, you build a model of how these data work.
Descriptive: to condense big data into smaller, more useful nuggets of information
Inquisitive: why is something happening; validate/reject hypotheses
Predictive: to forecast what might happen in the future; uses statistical modeling, data mining, and machine learning techniques, e.g. whether sentiment is positive or negative
Prescriptive: recommending one or more courses of action and showing the likely outcome of each decision; requires a predictive model with two additional components: actionable data and a feedback system
Source: http://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-descriptive-vs-predictive-vs-prescriptive/d/d-id/1113279

Evolution of DSS into Business Intelligence
Change in the use of DSS
  Specialist -> Managers -> Whomever, Whenever, Wherever
Emergence of Business Intelligence
  [Figure: Web technology, OLAP, data warehousing, data mining, and intelligent systems converge into Business Intelligence.]
Source: Euiho (David) Suh, POSTECH Strategic Management of Information and Technology Laboratory, (http://posmit.postech.ac.kr)

Business Intelligence (BI)
Definition
  An umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies
Objective
  To enable easy access to data (and models) to provide business managers with the ability to conduct analysis
  [Figure: Business Intelligence turns Data into Information, Knowledge, Decisions, and finally ACTION.]

Source: Euiho (David) Suh, POSTECH Strategic Management of Information and Technology Laboratory, (http://posmit.postech.ac.kr)

Foundational Technologies in Analytics

Source: Business Intelligence and Analytics: From big data to big impact, H. Chen et al., MIS Quarterly, Dec 2012

Taxonomy for Data Mining Tasks

Classification Techniques

Decision tree analysis
Neural networks
Support vector machines
Case-based reasoning
Bayesian classifiers (naïve, belief network)
Rough sets

Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes' theorem.
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes' Theorem: Basics
Let X be a data sample ("evidence"): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X) (posterior probability), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
  E.g., X will buy a computer, regardless of age, income, ...
P(X): probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  E.g., given that X will buy a computer, the probability that X is 31..40 with medium income

Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

Informally, this can be written as
  posterior = likelihood x prior / evidence
Predicts X belongs to $C_i$ iff the probability $P(C_i \mid X)$ is the highest among all the $P(C_k \mid X)$ for all the k classes
Practical difficulty: requires initial knowledge of many probabilities, significant computational cost

Towards the Naïve Bayes Classifier
Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector $X = (x_1, x_2, \dots, x_n)$
Suppose there are m classes $C_1, C_2, \dots, C_m$.
Classification is to derive the maximum posterior, i.e., the maximal $P(C_i \mid X)$
This can be derived from Bayes' theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$

Since P(X) is constant for all classes, only the numerator needs to be maximized:

$$P(C_i \mid X) \propto P(X \mid C_i)\,P(C_i)$$

Derivation of the Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \dots \times P(x_n \mid C_i)$$

This greatly reduces the computation cost: only the class distribution needs to be counted

If $A_k$ is categorical, $P(x_k \mid C_i)$ is the number of tuples in $C_i$ having value $x_k$ for $A_k$, divided by $|C_{i,D}|$ (the number of tuples of $C_i$ in D)

Naïve Bayes Classifier: Training Dataset
Class:
  C1: buys_computer = "yes"
  C2: buys_computer = "no"
Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no

Naïve Bayes Classifier: An Example
(uses the 14-tuple training dataset of the previous slide)

P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
  P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
  P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
Therefore, X belongs to class (buys_computer = "yes")
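The hand calculation above can be checked with a short script. This is a minimal sketch of a categorical naïve Bayes scorer without smoothing; the 14 rows are the training tuples of the slide, and the function name is illustrative.

```python
from collections import Counter

# Training tuples: (age, income, student, credit_rating, buys_computer)
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_scores(data, x):
    """Return P(X|Ci) * P(Ci) for every class label Ci (last field of each row)."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)  # prior P(Ci)
        for k, value in enumerate(x):
            matches = sum(1 for row in data
                          if row[-1] == label and row[k] == value)
            score *= matches / n_label  # conditional P(x_k | Ci)
        scores[label] = score
    return scores

x = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(DATA, x)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Because the scores are not divided by P(X), they are proportional to the posteriors rather than equal to them, which is enough to pick the winning class.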


Avoiding the Zero-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be nonzero. Otherwise, the predicted probability will be zero:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
Use the Laplacian correction (or Laplacian estimator)
  Adding 1 to each case
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  The "corrected" probability estimates are close to their "uncorrected" counterparts
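The correction above amounts to adding one pseudo-count per attribute value before normalizing. A small sketch, with a hypothetical function name and the counts from the slide:

```python
def laplace_estimates(counts, alpha=1):
    """Add alpha to every count, then normalize so the estimates sum to 1."""
    total = sum(counts.values()) + alpha * len(counts)
    return {value: (n + alpha) / total for value, n in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
estimates = laplace_estimates(income_counts)
for value, p in estimates.items():
    print(value, p)
```

With three income values the denominator becomes 1000 + 3 = 1003, and no estimate is exactly zero any more.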

Naïve Bayes Classifier: Comments
Advantages
  Easy to implement
  Good results obtained in most of the cases
Disadvantages
  Assumption: class conditional independence, therefore loss of accuracy
  Practically, dependencies exist among variables
    E.g., hospitals: patients: profile: age, family history, etc.
    Symptoms: fever, cough, etc.; disease: lung cancer, diabetes, etc.
    Dependencies among these cannot be modeled by the naïve Bayes classifier
How to deal with these dependencies? Bayesian belief networks

Support Vector Machines (SVM)
A relatively new classification method for both linear and nonlinear data
It uses a nonlinear mapping to transform the original training data into a higher dimension
Within the new dimension, it searches for the linear optimal separating hyperplane (i.e., "decision boundary")
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

Support Vector Machines
Aim: to find the best classification function to distinguish between members of the two classes in the training data
  a separating hyperplane f(x) that passes through the middle of the two classes
  the best such function is found by maximizing the margin between the two classes
  $x_n$ belongs to the positive class if $f(x_n) > 0$

One of the most robust and accurate methods
Requires only a dozen examples for training
Insensitive to the number of dimensions

SVM: When Data Is Linearly Separable

Let data D be $(X_1, y_1), \dots, (X_{|D|}, y_{|D|})$, where $X_i$ is the set of training tuples associated with the class labels $y_i$
There are infinite lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)

SVM: Linearly Separable

A separating hyperplane can be written as
  $W \cdot X + b = 0$
where $W = \{w_1, w_2, \dots, w_n\}$ is a weight vector and b a scalar (bias)

For 2-D it can be written as
  $w_0 + w_1 x_1 + w_2 x_2 = 0$

The hyperplanes defining the sides of the margin:
  $H_1$: $w_0 + w_1 x_1 + w_2 x_2 \ge 1$ for $y_i = +1$, and
  $H_2$: $w_0 + w_1 x_1 + w_2 x_2 \le -1$ for $y_i = -1$

Any training tuples that fall on hyperplanes $H_1$ or $H_2$ (i.e., the sides defining the margin) are support vectors

This becomes a constrained (convex) quadratic optimization problem:
quadratic objective function and linear constraints -> Quadratic Programming (QP) -> Lagrangian multipliers

SVM: Linearly Inseparable

[Figure: two classes plotted on axes A1 and A2 that no straight line separates.]

Transform the original input data into a higher-dimensional space

Search for a linear separating hyperplane in the new space

SVM: Different Kernel Functions

Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function $K(X_i, X_j)$ to the original data, i.e., $K(X_i, X_j) = \Phi(X_i) \cdot \Phi(X_j)$

Typical Kernel Functions

SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
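The identity K(Xi, Xj) = Φ(Xi)·Φ(Xj) can be checked numerically for one concrete kernel. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D points, whose explicit feature map is Φ(x) = (x1², √2·x1·x2, x2²). The choice of kernel and the sample points are assumptions for illustration; the slide's own list of typical kernels did not survive extraction.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

xi, xj = (1.0, 2.0), (3.0, -1.0)
lhs = poly_kernel(xi, xj)       # computed in the original 2-D space
rhs = dot(phi(xi), phi(xj))     # computed in the 3-D feature space
print(lhs, rhs)
```

The kernel side never builds the higher-dimensional vectors, which is the computational point of the kernel trick.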


Parallel Implementation of SVM
Start with the current w and b, and in parallel do several iterations based on each training example
Average the values from each of the examples to create a new w and b.
If we distribute w and b to each mapper, then the Map tasks can do as many iterations as we wish to do in one round
We need to use the Reduce tasks only to average the results
One iteration of MapReduce is needed for each round.

Model Selection: ROC Curves

ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model

Vertical axis represents the true positive rate
Horizontal axis represents the false positive rate
The plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
168

6/26/2014

Issues Affecting Model Selection
Accuracy
  classifier accuracy: predicting class label
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

What is Cluster Analysis?
Cluster: a collection of data objects
  similar (or related) to one another within the same group
  dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, ...)
  Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms

Cluster Analysis: Applications
Identify natural groupings of customers
Identify rules for assigning new cases to classes for targeting/diagnostic purposes
Provide characterization, definition, labelling of populations
Decrease the size and complexity of problems for other data mining methods
Identify outliers in a specific domain (e.g., rare-event detection)

Cluster Analysis: Applications
Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
Information retrieval: document clustering
Land use: identification of areas of similar land use in an earth observation database
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
City planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continent faults
Climate: understanding earth climate, find patterns of atmospheric and ocean
Economic science: market research

Requirements and Challenges
Scalability
  Clustering all the data instead of only on samples
Ability to deal with different types of attributes
  Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
  User may give inputs on constraints
  Use domain knowledge to determine input parameters
Interpretability and usability
Others
  Discovery of clusters with arbitrary shape
  Ability to deal with noisy data
  Incremental clustering and insensitivity to input order
  High dimensionality

Association Rule Mining
Finds interesting relationships (affinities) between variables (items or events)
Also known as market basket analysis
Representative applications of association rule mining include
  In business: cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration
  In medicine: relationships between symptoms and illnesses; diagnosis and patient characteristics and treatments (to be used in medical DSS); and genes and their functions (to be used in genomics projects)

Association Rule Mining
Are all association rules interesting and useful?
A generic rule: X => Y [S%, C%]
  X, Y: products and/or services
  X: left-hand side (LHS)
  Y: right-hand side (RHS)
  S: support: how often X and Y go together
  C: confidence: how often Y goes together with X
Example: {Laptop Computer, Antivirus Software} => {Extended Service Plan} [30%, 70%]
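Support and confidence as defined above are simple ratios over the transaction list. The following sketch computes both; the five baskets are hypothetical and deliberately tiny, so the resulting percentages differ from the [30%, 70%] of the slide's example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)

# Hypothetical baskets.
transactions = [
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "service_plan"},
    {"laptop"},
    {"mouse"},
]
s = support(transactions, {"laptop", "antivirus", "service_plan"})
c = confidence(transactions, {"laptop", "antivirus"}, {"service_plan"})
print(s, c)
```

Here the rule holds in 2 of 5 baskets (support 40%), and in 2 of the 3 baskets that contain the left-hand side (confidence about 67%).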

Privacy Preserving Data Mining
There is a growing concern among people and organizations about protecting their privacy
Many business and government organizations have strong requirements for privacy-preserving data mining
The primary task in data mining: development of models about aggregated data.
  meeting privacy requirements
  providing valid data mining results
Source: Privacy Preserving Association Rule Mining in Vertically Partitioned Data, Jaideep Vaidya, Chris Clifton, SIGKDD '02, Edmonton, Alberta, Canada

What is Privacy
Ability of an individual or group to reveal themselves selectively, to remain unnoticed or unidentified in the public (related to anonymity)
Privacy laws prohibit unsanctioned invasion of privacy by the government, corporations or individuals
Privacy may be voluntarily sacrificed, normally in exchange for perceived benefits and very often with specific dangers and losses
What is privacy-preserving data mining?
  Study of achieving some data mining goals without sacrificing the privacy of the individuals

Challenges
Privacy considerations seem to conflict with data mining. How do we mine data when we can't even look at it?
Can we have data mining and privacy together?
Can we develop accurate models without access to precise information in individual data records?
Leakage of information is inevitable: how to minimize the leakage of information
PPDM offers a compromise!

Example of Privacy
Alice and Bob are both teaching the same class, and each of them suspects that one specific student is cheating. Neither of them is completely sure about the identity of the cheater, and they would like to compare the names of their two suspects
For the students' privacy
  if they both have the same suspect, then they should learn his or her name
  but if they have different suspects then they should learn nothing beyond that fact.

They therefore have inputs x and y, and wish to compute f(x, y), which is defined as 1 if x = y and 0 otherwise
If f(x, y) = 0 then each party does learn some information, namely that the other party's suspect is different from his/hers, but this is inevitable

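Total count: how many pneumonia deaths under age 65?
[Figure: ten hospitals arranged in a ring pass a running value r from one to the next, starting from r0.]
  hosp1:  r1 = r0 + 12
  hosp2:  r2 = r1 + 1
  hosp3:  r3 = r2 + 8
  hosp4:  r4 = r3 + 0
  hosp5:  r5 = r4 + 11
  hosp6:  r6 = r5 + 7
  hosp7:  r7 = r6 + 14
  hosp8:  r8 = r7 + 3
  hosp9:  r9 = r8 + 0
  hosp10: r9 + 5
Removing r0 from the final value gives Total = 61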
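The ring computation above can be simulated in a single process. The sketch assumes, as in the usual secure-sum protocol, that r0 is a random mask known only to the first hospital and that arithmetic is done modulo a bound larger than any possible total; the modulus and function name are assumptions, and a single-process simulation provides no actual privacy.

```python
import random

def secure_sum(local_counts, modulus=10**6, rng=random):
    """Simulate a ring-based secure sum over the sites' private counts."""
    r0 = rng.randrange(modulus)          # random mask chosen by the first site
    running = r0
    for count in local_counts:           # each site adds its own count in turn
        running = (running + count) % modulus
    return (running - r0) % modulus      # the first site removes its mask

# Per-hospital counts from the slide.
hospital_counts = [12, 1, 8, 0, 11, 7, 14, 3, 0, 5]
total = secure_sum(hospital_counts)
print(total)
```

Each hospital only ever sees a masked running value, so no single intermediate value reveals another hospital's count.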


Distributed Computing Scenario
Two or more parties owning confidential databases wish to run a data mining algorithm on the union of their databases without revealing any unnecessary information
Although the parties realize that combining their data has some mutual benefit, none of them is willing to reveal its database to any other party
Partial leak of information is inevitable

Association rule mining in vertically partitioned data
Privacy concerns can prevent a central database approach for data mining
The transactions may be distributed across sources
Collaborate to mine globally valid data without revealing individual transaction data
Prevent disclosure of individual relationships
  Join key revealed
  Universe of attribute values revealed

Real-life Example
Ford Explorers with Firestone tires from a specific factory had tread separation problems in certain situations, resulting in 800 injuries.
Since the tires did not have problems on other vehicles, and other tires on Ford Explorers did not pose a problem, neither side felt responsible
Delay in identifying the real problem led to a public relations nightmare and the eventual replacement of 14.1 million tires
Both manufacturers had their own data: early generation of association rules based on all of the data may have enabled Ford and Firestone to resolve the safety problem before it became a public relations nightmare.

Problem Definition
To mine association rules across two databases, where the columns in the table are at different sites, splitting each row.
One database is designated the primary, and is the initiator of the protocol. The other database is the responder.
There is a join key present in both databases. The remaining attributes are present in one database or the other, but not both.
The goal is to find association rules involving attributes other than the join key, observing the privacy constraints

Problem Definition
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$.
An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$.
The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $D$ if $s\%$ of transactions in $D$ contain $X \cup Y$.
We consider mining boolean association rules. The absence or presence of an attribute is represented as a 0 or 1. Transactions are strings of 0s and 1s.
To find out if a particular itemset is frequent, we count the number of records where the values for all the attributes in the itemset are 1.
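
The counting rule above can be sketched in a few lines of Python. This is only an illustration on a single, non-partitioned table: the five transactions and the helper names `support_count` and `confidence` are hypothetical and do not come from the slides.

```python
def support_count(rows, itemset):
    """Number of rows in which every attribute index in `itemset` is 1."""
    return sum(all(row[j] == 1 for j in itemset) for row in rows)


def confidence(rows, x, y):
    """Fraction of rows containing X that also contain Y (None if X never occurs)."""
    x_count = support_count(rows, x)
    if x_count == 0:
        return None
    return support_count(rows, x | y) / x_count


# Hypothetical 0/1 transactions over three attributes (columns 0, 1, 2).
rows = [
    (1, 1, 0),
    (1, 1, 1),
    (1, 0, 1),
    (0, 1, 1),
    (1, 1, 0),
]

x, y = frozenset({0}), frozenset({1})
support_xy = support_count(rows, x | y)   # rows where columns 0 and 1 are both 1
conf_xy = confidence(rows, x, y)          # confidence of the rule X => Y
```

Support is kept as an absolute count here so that it can be compared directly with a threshold $k$; dividing by `len(rows)` gives the percentage form used in the definition.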

Mathematical problem formulation
Let the total number of attributes be $l + m$, where A has $l$ attributes $A_1$ through $A_l$, and B has the remaining $m$ attributes $B_1$ through $B_m$. Transactions/records are a sequence of $l + m$ 1s or 0s.
Let $k$ be the support threshold required, and $n$ be the total number of transactions/records.
Let $X$ and $Y$ represent columns in the database, i.e., $x_i = 1$ if row $i$ has value 1 for attribute $X$. The scalar (or dot) product of two cardinality-$n$ vectors $X$ and $Y$ is defined as:
$$X \cdot Y = \sum_{i=1}^{n} x_i \, y_i$$
Determining if the two-itemset $\langle XY \rangle$ is frequent thus reduces to testing if $X \cdot Y \ge k$.


Example
Find out if itemset {A1, B1} is frequent (i.e., if the support of {A1, B1} ≥ k)
Site A holds the columns (Key, A1) and site B holds the columns (Key, B1), each with one row per key k1 through k5.
Support of an itemset is defined as the number of transactions in which all attributes of the itemset are present
$$\text{Support} = \sum_{i=1}^{n} A_i \times B_i$$

Basic idea
$$\text{Support} = \sum_{i=1}^{n} A_i \times B_i$$
This is the scalar (dot) product of two vectors
To find out if an arbitrary (shared) itemset is frequent, create a vector on each side consisting of the component-wise multiplication of all attribute vectors on that side (contained in the itemset)
To find out if {A1, A3, A5, B2, B3} is frequent:
A forms the vector $X = A_1 \times A_3 \times A_5$ (component-wise)
B forms the vector $Y = B_2 \times B_3$ (component-wise)
Securely compute the dot product of X and Y
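
A minimal sketch of the arithmetic behind this idea follows. It computes the dot product in the clear, so it shows only what the secure protocol has to compute, not the protocol itself. The column values, the threshold `k`, and the helper names `combine` and `dot` are hypothetical.

```python
from functools import reduce


def combine(columns):
    """Component-wise product of 0/1 attribute columns held at one site."""
    return [reduce(lambda a, b: a * b, values) for values in zip(*columns)]


def dot(x, y):
    """Plain (non-private) scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


# Hypothetical columns, one entry per join key, in the same key order at both sites.
site_a = {"A1": [1, 1, 0, 1, 1], "A3": [1, 1, 1, 0, 1], "A5": [1, 0, 1, 1, 1]}
site_b = {"B2": [1, 1, 1, 1, 0], "B3": [1, 1, 0, 1, 1]}

x = combine([site_a["A1"], site_a["A3"], site_a["A5"]])  # computed locally at site A
y = combine([site_b["B2"], site_b["B3"]])                # computed locally at site B

support = dot(x, y)   # the one step the real protocol would compute securely
k = 1                 # hypothetical support threshold
is_frequent = support >= k
```

Each site only ever multiplies its own columns; the single cross-site operation is the final dot product, which is why an efficient private scalar-product protocol is the key building block.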


Conclusion
Privacy-preserving association rule mining algorithm using an efficient protocol for computing the scalar product while preserving privacy of the individual values
Communication cost is comparable to that required to build a centralized data warehouse
Although secure solutions exist, achieving efficient secure solutions for privacy-preserving distributed data mining is still an open problem
Handling the multi-party case and avoiding collusion is challenging.
Non-categorical attributes and quantitative association rule mining are significantly more complex problems

References

Privacy Preserving Association Rule Mining in Vertically Partitioned Data, Jaideep Vaidya, Chris Clifton, SIGKDD '02, Edmonton, Alberta, Canada

Privacy-preserving Data Mining. R. Agrawal and R. Srikant, ACM SIGMOD Conference on Management of Data, 2000.

Cryptographic techniques for privacy-preserving data mining. B. Pinkas, SIGKDD Explorations, Vol. 4, Issue 2.

KNOWLEDGE VALUATION: Building blocks to a knowledge valuation system (KVS), Annie Green, The journal of information and knowledge management systems, Vol. 36, No. 2, 2006, pp. 146-154


Business Applications
Ram Babu Roy

"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

From Big Data to Big Impact

Source: "Business Intelligence and Analytics: From big data to big impact", H. Chen et al., MIS Quarterly, Dec 2012


Big Impacts by Big Data
Radical transparency with data widely available
Experimentation for anticipating business decisions; change the way we compete
Impact of real-time customization on business
Augmenting management and strategy: better risk management
Information-driven business model innovations
Leveraging valuable exhaust data generated by business transactions
Data aggregator as an entrepreneurial opportunity
Source: "Are you ready for the era of big data?", Brad Brown et al., McKinsey Quarterly, Oct 2011

Source:http://www.forbes.com/sites/danmunro/2013/04/28/bigproblemwithlittledata/


Magic Quadrant for Advanced Analytics Platforms
[Figure: quadrants labelled Leaders, Challengers, Visionaries, and Niche Players]
Source: Gartner (February 2014)

Sources of Competitive Advantage

Big Data: a new type of corporate asset
Effective use of data at scale
Data-driven decision making
Radical customization
Gaining market share

Constant Experimentation
Exploring with real-world experiments
Adjust prices in real time
Bundling, synthesizing, and making information available across the organization

Novel Business Models


Growing Business Interest in Big Data
Hundreds of articles published in technology/industry journals and the general business press (Forbes, Fortune, Bloomberg BusinessWeek, The Wall Street Journal, The Economist, etc.)
3Vs: use of disparate data sets, including social media
The speed of analysis, near-real-time deployment requirements
The advancement of the fields of machine learning and visualization

Role of Data Scientist
We often do not know what question to ask: it requires domain expertise to identify the important problems to solve in a given area
Which aspect of big data makes more sense
How to apply it to the business
How one can achieve success by implementing a big data project
New challenges
Lack of structure
What technology one must use to manage it
Challenging to convert it into insights, innovation, and business value
But new opportunities


Variation of Potential Value across Sectors

Source: "Are you ready for the era of big data?", Brad Brown et al., McKinsey Quarterly, Oct 2011

Sources of Big Data

Various business units: Govt./Private
Partners
Customers
Internet of Things
Social Media
Transaction data
Web pages


Major industries that benefit

Financial services
IT and Telecommunication
Healthcare
Manufacturing
Real Estate
Government and defence
Travel
Media and Entertainment
Retailing

Why is Big Data so Important?
Potential to radically transform businesses and industries
Reinventing business processes
Operational efficiencies
Customer service innovations.

"You can't manage what you don't measure"
More knowledge about the business
Improved decision making and performance
Enables managers to decide on the basis of evidence rather than intuition
More effective interventions

But need to change the decision-making culture
Source: "Big Data: The Management Revolution", Andrew McAfee, Erik Brynjolfsson, HBR, Oct 2012


How are managers using Big Data?
To create new businesses
Better predictions yield better decisions
Airlines had a gap of at least 5 to 10 minutes between the estimated time of arrival (ETA, given by pilots) and the actual time of arrival (ATA).
Improved ETA by combining weather data, flight schedules, information about other planes in the sky, and other factors

To drive more sales
To tailor promotions and other offerings to customers
To personalize the offers to take advantage of local conditions

Integrating Big Data into Business
How to utilise unstructured data within your organisation
The latest technical changes related to using Hadoop in the enterprise
Why big data solutions can enhance your ROI and deliver value
The risks related to increasing volumes of data
The future of big data and the Internet of Things
How cloud computing is changing the enterprise's use of data


Applications of Big Data analytics

Disaster management
Big data and sensor data fusion for smart cities
Healthcare
Spatio-temporal data analytics
Time-series-based long-term analysis (e.g. climate change)
Short-term real-time (traffic management, disaster management)
Benchmarking across the world
Preserving heritage: creation of e-repositories

Big Data in Manufacturing
Manufacturing generates about a third of all data today
Digitally oriented businesses see higher returns
Detecting product and design flaws with Big Data

Forecasting, cost and price modeling, supply chain connectivity
Warranty data analytics, text mining for product development
Visualization and dashboards, fault detection, failure prediction, in-process verification tools
Machine system/sensor health, process monitoring and correction, product data mining
Big Data generation and management, and the Internet of Things


Big Data in Manufacturing
Integrating data from multiple systems
Collaboration among functional units
Information from external suppliers and customers to co-create products
Collaboration during design phase to reduce costs
Implement changes in product to improve quality and prevent future problems
Identify anomalies in production systems
Schedule pre-emptive repairs before failures
Dispatch service representatives for repair

Retailing and Logistics
Optimize inventory levels at different locations
Improve the store layout and sales promotions
Optimize logistics by predicting seasonal effects
Minimize losses due to limited shelf life


Financial Applications
Banking and Other Financial
Automate the loan application process
Detecting fraudulent transactions
Optimizing cash reserves with forecasting

Brokerage and Securities Trading

Predict changes on certain bond prices
Forecast the direction of stock fluctuations
Assess the effect of events on market movements
Identify and prevent fraudulent activities in trading

Insurance

Forecast claim costs for better business planning
Determine optimal rate plans
Optimize marketing to specific customers
Identify and prevent fraudulent claim activities

Understanding Customer Online Behaviour
Drawing insight from the online customers
Sentiment Analysis: to gauge responses to new marketing campaigns and adjust strategies
Customer Relationship Management
Maximize return on marketing campaigns
Improve customer retention (churn analysis)
Maximize customer value (cross-selling, up-selling): revenue streams
Identify and treat most valued customers


Case Study: Quantcast
How do advertisers reach their target audiences online?
Consumers spend more time in personalized media environments
It's harder to reach large numbers of relevant consumers
Advertisers need to use the web to choose their audiences more selectively
Decision: which ad to show to whom
Source: Too Big to Ignore: The Business Case for Big Data, Phil Simon, Wiley, 2013

Quantcast: A Small Big Data Company
Web measurement and targeting company; founded in 2006; focus on online audience
Models marketers' prospects and finds lookalike audiences across the world
Analyses more than 300 billion observations of media consumption per month
Detailed demographic, geographic, and lifestyle information
Created a massive data-processing infrastructure: Quantcast File System (QFS)
Incorporates data generated from mobile devices


Quantcast: A Small Big Data Company
Big data allows organizations to drill down to reach specific audiences
Different businesses have different data requirements, challenges, and goals
Quantcast provides integration between its product and third-party data and applications
An organization doesn't need to be big to benefit from Big Data

Promise of Big Data in Healthcare

Predictive and prescriptive analytics
Public health
Disease management
Drug discovery
Personalized medicine
Continuously scan and intervene in the healthcare practices


What is healthcare?
Broader than just practicing medicine
Role of physicians, pharmaceutical companies, hospitals, diagnostics services
Co-creation of health value
Role of patients and their family
Objective: to deliver high-quality and cost-effective care to patients

Uniqueness of Healthcare
Every patient is unique and needs personalized care
All the medical professionals are unique
Can we match the core competency and unique style of medical professionals to specific patients?
Can we use service operations management principles to improve the overall efficiency?
Need for engaging medical professionals in the tasks they are best at doing


Healthcare Business Innovation
Needs exchange of knowledge between business and medicine
Entrepreneurship in healthcare
Availability of financial support to engage in value creation
Development and analysis of healthcare business model

Healthcare Analytics
Data collected through mobile devices, health workers, individuals, other data sources
Crucial for understanding population health trends or stopping outbreaks
Individual electronic health records: not only improve continuity of care for the individual, but also create massive datasets
Identification of at-risk patients
Trends in service line utilization
Improvement opportunities in the revenue cycle

Treatments and outcomes can be compared in an efficient and cost-effective manner.


Sources of Information

Government officials
Industry representatives
Information technology experts
Healthcare professionals

Enabler
Increase of processing power and storage capacities
Radical advancement of information and communication technology
Reduced cost of storage and processing of big data

Availability of data

Digitization of data
Increase in adoption of digital gadgets by users
Popularity of social media
Mobile and internet population and penetration

Awareness about benefits of having knowledge
Requirement of data-driven insights by various stakeholders


Explorys: The Human Case for Big Data
January 2011, US healthcare spending: $3 trillion
Behavioural, operational, and clinical wastes
Long-term economic implications
Opportunity to improve healthcare delivery and reduce expenses
Integrating clinical, financial, and operational data
Volume, velocity, and variety of healthcare data

Big data can have a big impact

Better Healthcare using Big Data
Real-time exploration, performance, and predictive analytics
Vast user base: major integrated delivery networks in the US
Users can view key performance metrics across providers, groups, care venues, and locations
Identify ways to improve outcomes and reduce unnecessary costs


Better Healthcare using Big Data
Why do people go to the emergency room rather than a primary physician?
Analytics to decide whether the care is available in the neighbourhoods
Service providers can reach out to patients to guide them through the treatment processes
Patients can receive the right care at the right venue at the right time
Privacy concerns

Mobilising Data to Deal with an Epidemic
Case of Haiti's devastating 2010 earthquake
Mobile data patterns could be used to understand the movement of refugees and the consequent health risks
Accurately analyse the destination of over 600,000 people displaced from Port-au-Prince
They made this information available to government and humanitarian organisations dealing with the crisis


Cholera outbreak in Haiti
Mobile data to track the movement of people from affected zones
Aid organisations used this data to prepare for new outbreaks
Demonstrates how mobile data analysis could revolutionise disaster and emergency responses.

Cost-effective Technology
DataGrid platform on Cloudera's enterprise-ready Hadoop solution
Platform needs to support
The complex healthcare data
Evolving transformation of healthcare delivery system

Need for a solution that can evolve with time
Flexibility, scalability, and speed necessary to answer complex questions on the fly
Explorys could focus on the core competencies of its business
Imperative to learn, adapt, and be creative


Benefits of Information Technology
Improved access to patient data can help clinicians as they diagnose and treat patients
Patients might have more control over their health
Monitoring of public health patterns and trends
Enhanced ability to conduct clinical trials of new diagnostic methods and treatments
Creation of new high-technology markets and jobs
Support a range of healthcare-related economic reforms

Potential Benefits
Provide a solid scientific and economic basis for investment recommendations
Establish foundation for the healthcare policy decisions
Improved patient outcomes
Cost savings
Faster development of treatments and medical breakthroughs


Action Plan
Government should facilitate the nationwide adoption of a universal exchange language for healthcare information
Creation of a digital infrastructure for locating patient records while strictly ensuring patient privacy.
Facilitate a transition from traditional electronic health records to the use of healthcare data tagged with privacy and security specifications

NASA: Open Innovation
Effects of crowdsourcing on the economy: collaboration and innovation
Wikipedia

What is the relationship between Big Data, collaboration, and open innovation?
Democratize big data to benefit from the wisdom of crowds
Groups can often use information to make better decisions than any individual
TopCoder brings together a diverse community of software developers, data scientists, and statisticians
Risk prevention contest
Predict the riskiest locations for major emergencies and crimes


NASA's Real-world Challenges
Recorded more than 100 TB of space images and more
Encourage exploration and analysis of Planetary Data System (PDS) databases
Image processing solutions to categorize data from missions to the moon: types and numbers of craters
Software to handle complex operations of satellites and data analysis
Vehicle recognition
Crater detection

Real-time information to contestants and collecting their comments

Lessons Learnt
To effectively harness big data, an organization need not offer big rewards
State the problem in a way that allows a diverse set of potential solvers to apply their knowledge
Create a compelling Big Data and open innovation challenge
An organization's size doesn't matter when it comes to reaping the rewards of Big Data


Crowd Management: Cricket Match
Directional flow density during a cricket match
Availability of public transport for commuters from different destinations
Mobile data can be used to predict the direction of movement of spectators post-match

A study by Uttam K Sarkar et al., IIM Calcutta

Potential Applications
Managing resources to pursue most promising customers and markets
Technology platform for managing the risk and value of network of different stakeholders
Meeting regulatory compliance
Development of patient referral system to reduce patient churn
Life science industries such as diagnostics, pharmaceuticals, and medical devices
Development of mobile apps for referral
Genetic data mining for prediction of disease and effective care


Simple Techniques for Big Data Transition
Which problem to tackle? Need for domain experts to decide the right questions
Habit of asking key questions
What do the data say?
Where did the data come from?
What kinds of analyses were conducted?
How confident are we in the results?

Computers can only give answers

Need for Change Management
Leadership: set clear goals, define success, spot great opportunity, and understand how a market is developing
Talent management: organizing big data, visualization tools and techniques, design of experiments
Technology: to integrate all internal and external sources of data, adopt new technologies
Decision Making: maximize cross-functional cooperation
Company culture: shift from asking "What do we think?" to "What do we know?"


Challenges in Getting Returns
Collecting, processing, analyzing, and using Big Data in their businesses
Handling the volume, variety, and velocity of the data.
Finding data scientists: by 2018, the demand for analytics and Big Data people in the U.S. alone will exceed supply by as much as 190,000 (McKinsey Global Institute)
Sharing data across organizational silos
Driving data-driven business decisions rather than those based on intuition

Confronting Complications
Shortage of adequately skilled workforce
Shortage of managers and analysts with sharp understanding of big data applications
Tension between privacy and convenience
Consumer captures larger part of economic surplus at the cost of privacy concerns

Data/IP security concerns: cloud computing and open access to information


What Big Data Can't Answer
How do we know what's important in a complex world?
E.g. to predict and explain market crashes, food prices, the Arab Spring, ethnic violence, and other complex biological and social systems.
Determining which information is pivotal (and ignoring the rest) is the key to solving the world's increasingly complex challenges.
Yaneer Bar-Yam and Maya Bialik, Beyond Big Data: Identifying important information for real world challenges (December 17, 2013), arXiv, in press

Complexity Profile
The amount of information that is required to describe a system as a function of the scale of description.
The most important information about a system for informing action on that system is the behavior at the largest scale.


References
Holdren, John P., Eric Lander, and Harold Varmus. "Report to the President Realizing the Full Potential of Health Information Technology to Improve Healthcare for Americans: The Path Forward." President's Council of Advisors on Science and Technology. Executive Office of the President, Dec. 2010.
The Emerging Big Returns on Big Data, A TCS 2013 Global Trend Study, 2013
www.ikanow.com

Mining Social Network Graphs
Prof. Ram Babu Roy


Types of Networks
Technological: man-made, consciously created (Kolaczyk, 2009)
Social: interactions among social entities
Biological: interaction among biological elements
Informational: elements of information

What is a Social Network?
Collection of entities that participate in the network
There is at least one relationship between entities of the network.
There is an assumption of nonrandomness or locality
Examples

Friends networks: Facebook, Twitter, Google+
Telephone Networks
Email Networks
Collaboration Networks

Ref: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, 2014


Social Networks as Graphs
Not every graph is a suitable representation of a social network.
Locality: the property of social networks that says nodes and edges of the graph tend to cluster in communities.
Important questions
how to identify communities: subsets of the nodes with unusually strong connections
communities usually overlap: you may belong to several communities

Example of a Small Social Network
Nine edges out of the $\binom{7}{2} = 21$ pairs of nodes
Suppose X, Y, and Z are nodes with edges between X and Y and also between X and Z. What would we expect the probability of an edge between Y and Z to be?
If the graph were large, that probability would be very close to the fraction of the pairs of nodes that have edges between them, i.e., 9/21 = .429
If the graph is small, the edges (X,Y) and (X,Z) are already accounted for, so only seven edges remain among the other 19 pairs. Thus, the probability of an edge (Y,Z) is 7/19 = .368.

By actually counting the pairs of nodes, the fraction of times the third edge exists is 9/16 = .563
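
The three figures above can be checked with a short Python sketch. The figure itself is not reproduced in this text, so the edge list below is an assumption: it is the seven-node, nine-edge example graph as given in Mining of Massive Datasets (Figure 10.1). If the slide used a different graph, only the `edges` list needs to change.

```python
from itertools import combinations

# Assumed edge list (nodes A..G, nine edges); not taken from the slide text itself.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]

neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

n = len(neighbors)
pairs = n * (n - 1) // 2                        # 7 choose 2
large_graph = len(edges) / pairs                # 9/21
small_graph = (len(edges) - 2) / (pairs - 2)    # 7/19

# For every node X and every pair of its neighbours (Y, Z),
# check whether the third edge (Y, Z) is present.
closed = total = 0
for x, nbrs in neighbors.items():
    for y, z in combinations(sorted(nbrs), 2):
        total += 1
        closed += z in neighbors[y]
observed = closed / total
```

The observed fraction is well above both baselines, which is the locality (clustering) property the slide is illustrating.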


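Social Network Analysis
An actor exists in a fabric of relations (Wasserman & Faust, 1994)
Degree: importance due to direct linkage with nodes (Freeman, 1979; Friedkin, 1991)
Closeness: ability to reach other actors at ease
Betweenness: relative importance of a node in linking two nodes
Eigenvector (Bonacich, 1987): entire pattern of connections
Network centralization (Freeman, 1979)
$$C_A = \frac{\sum_{x \in U} \left( \max C_A - C_A(x) \right)}{\operatorname{Max} \sum_{x \in U} \left( \max C_A - C_A(x) \right)}$$
where $C_A(x)$ is the centrality of node $x$, $\max C_A$ is the largest centrality observed in the network, and the denominator is the largest value the sum can take over all networks on the same node set $U$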
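
As a hedged illustration of network centralization, the sketch below instantiates Freeman's index for degree centrality only. The assumption that degree is the centrality of interest, the function name `degree_centralization`, and the two toy graphs are mine, not the slides'. For raw degree on $n$ nodes the denominator is attained by a star graph and equals $(n-1)(n-2)$.

```python
def degree_centralization(adjacency):
    """Freeman centralization for degree centrality on an undirected graph.

    `adjacency` maps each node to the set of its neighbours (n >= 3 nodes).
    """
    n = len(adjacency)
    degrees = [len(nbrs) for nbrs in adjacency.values()]
    c_max = max(degrees)
    observed = sum(c_max - d for d in degrees)
    star_max = (n - 1) * (n - 2)   # value of the same sum for a star graph
    return observed / star_max


# Hypothetical graphs: a star is maximally centralized, a ring not at all.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
ring = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}

star_score = degree_centralization(star)
ring_score = degree_centralization(ring)
```

Other centralities (closeness, betweenness, eigenvector) follow the same pattern but need their own normalizing denominator.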

Link Analysis
Link Analysis Algorithms
PageRank
HITS (Hypertext Induced Topic Selection)

A page is more important if it has more links
Incoming links
Outgoing links
Think of in-links as votes
Are all in-links equal?
Links from important pages count more
Recursive question!


PageRank Scores

Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets

Simple Recursive Formulation
Each link's vote is proportional to the importance of its source page
If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
Page $j$'s own importance is the sum of the votes on its in-links
$r_j = r_i/3 + r_k/4$
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets


The Flow Model
A vote from an important page is worth more
A page is important if it is pointed to by other important pages
Define a rank $r_j$ for page $j$:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
where $d_i$ is the number of out-links of page $i$

Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets

Matrix Formulation
Stochastic adjacency matrix $M$
Let page $i$ have $d_i$ out-links
If $i \to j$, then $M_{ji} = 1/d_i$, else $M_{ji} = 0$
$M$ is a column-stochastic matrix
Columns sum to 1
Rank vector $r$: vector with an entry per page
$r_i$ is the importance score of page $i$
$\sum_i r_i = 1$
The flow equations can be written $r = M \cdot r$
The rank vector $r$ is an eigenvector of the stochastic web matrix $M$
We can now solve for $r$ using the method called power iteration
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets


Power Iteration Method
Given a web graph with $N$ nodes, where the nodes are pages and edges are hyperlinks
Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
Iterate: $r^{(t+1)} = M \cdot r^{(t)}$
Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm
Can use any other vector norm, e.g., Euclidean
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets
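
The iteration above can be sketched without any matrix library by pushing each page's rank along its out-links, which is equivalent to multiplying by the column-stochastic matrix $M$. The three-page graph, the function name `power_iteration`, and the default tolerance are hypothetical choices for illustration. The sketch assumes every page has at least one out-link (no dead ends) and does not include teleportation.

```python
def power_iteration(out_links, eps=1e-10, max_iter=1000):
    """Basic PageRank flow model: r(t+1) = M r(t), stopped on the L1 norm."""
    pages = sorted(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}               # r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        nxt = {p: 0.0 for p in pages}
        for i in pages:
            share = r[i] / len(out_links[i])      # each out-link carries r_i / d_i
            for j in out_links[i]:
                nxt[j] += share
        if sum(abs(nxt[p] - r[p]) for p in pages) < eps:
            return nxt
        r = nxt
    return r


# Hypothetical web graph: y links to itself and a, a links to y and m, m links to a.
graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
ranks = power_iteration(graph)
```

For this graph the flow equations can also be solved by hand ($r_y = r_a = 2/5$, $r_m = 1/5$), which gives a convenient check on the iteration.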

How to Solve?

Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets


HITS (Hypertext Induced Topic Selection)
Is a measure of importance of pages or documents, similar to PageRank
Proposed at around the same time as PageRank ('98)

Goal: Say we want to find good newspapers
Don't just find newspapers. Find "experts": people who link in a coordinated way to good newspapers

Idea: Links as votes
A page is more important if it has more links
Incoming links? Outgoing links?

Hubs and Authorities
Each page has 2 scores:
Quality as an expert (hub):
Total sum of votes of authorities pointed to
Hubs are pages that link to authorities
List of newspapers

Quality as a content (authority):
Total sum of votes coming from experts
Authorities are pages containing useful information
Newspaper home pages


Hubs and Authorities

Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets

Example

Under reasonable assumptions about the adjacency matrix $A$, HITS converges to vectors $h^*$ and $a^*$:
$h^*$ is the principal eigenvector of matrix $A A^T$
$a^*$ is the principal eigenvector of matrix $A^T A$
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets
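
A minimal sketch of the mutual hub/authority update is given below. The tiny link structure (two hypothetical "list" pages pointing at two hypothetical newspaper pages), the fixed iteration count, and the Euclidean normalization are illustrative assumptions; the slides do not prescribe them.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / length for p, v in scores.items()}


def hits(links, iterations=50):
    """Alternate a = A^T h and h = A a, normalizing after each step."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 0.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        auth = normalize(auth)
        # Hub: sum of authority scores of the pages it links to.
        hub = normalize({p: sum(auth[d] for d in links.get(p, ())) for p in pages})
    return hub, auth


# Hypothetical pages: two link lists pointing at two newspaper home pages.
links = {"list1": ["paperA", "paperB"], "list2": ["paperA"]}
hub, auth = hits(links)
```

In this toy graph `paperA` ends up as the stronger authority because both lists point to it, and `list1` as the stronger hub because it points to both papers.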


Financial Market as CAS
Complex Systems: interconnectedness, hierarchy of subsystems, decentralized decisions, non-linearity, self-organisation, evolution, uncertainty, self-organized criticality
Complex Adaptive Systems: adaptation
Many socio-economic systems behave as CAS (Mauboussin 2002; Markose 2005)
Complex: networks of interconnections
Adaptive: optimizing agents, capable of learning
Complexity and homogeneity: both robust and fragile

Structural vulnerabilities build up over time (Haldane, 2009)

Network-based Modeling
Modeling complex systems: SD, ABM, VSM, Econophysics
Structure and the evolution of networks are inseparable (Barabási, 1999)
Visual representation of the interdependence (Newman 2008): knowledge discovery
Dynamics of networks may reveal underlying mechanism (Barabási, 2009)
Recent works using network approach (Boginski et al. 2006; Tse, C. K., 2010)


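Market graph
Logarithm of return of the instrument $i$ over the one-day period from $(t-1)$ to $t$:
$$R_i(t) = \ln \frac{P_i(t)}{P_i(t-1)}$$
$P_i(t)$ denotes the price of the instrument $i$ on day $t$
Correlation coefficient between instruments $i$ and $j$ is
$$C_{ij} = \frac{\langle R_i R_j \rangle - \langle R_i \rangle \langle R_j \rangle}{\sqrt{\left( \langle R_i^2 \rangle - \langle R_i \rangle^2 \right) \left( \langle R_j^2 \rangle - \langle R_j \rangle^2 \right)}}$$
where $\langle \cdot \rangle$ denotes the average over the period considered
An edge connecting stocks $i$ and $j$ is added to the graph if $C_{ij} \ge$ threshold
prices of these two stocks behave similarly over time
degree of similarity is defined by the chosen value of threshold

Studying the pattern of connections in the market graph would provide helpful information about the internal structure of the stock market
Source: "Mining market data: A network approach", Vladimir Boginski, Sergiy Butenko, Panos M. Pardalos, Computers & Operations Research 33 (2006) 3171-3184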
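
The market-graph construction can be sketched in plain Python as follows. The three price series, the threshold of 0.5, and the function names are hypothetical; real studies would use long daily price histories and a threshold chosen by the analyst.

```python
import math


def log_returns(prices):
    """R(t) = ln(P(t) / P(t-1)) for consecutive days."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]


def correlation(x, y):
    """Correlation coefficient written with period averages, as in the slide."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    var_x = sum(a * a for a in x) / n - mean_x ** 2
    var_y = sum(b * b for b in y) / n - mean_y ** 2
    return (mean_xy - mean_x * mean_y) / math.sqrt(var_x * var_y)


def market_graph(prices, threshold):
    """Edges between instruments whose return correlation is >= threshold."""
    returns = {name: log_returns(series) for name, series in prices.items()}
    names = sorted(returns)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if correlation(returns[a], returns[b]) >= threshold]


# Hypothetical daily closing prices.
prices = {
    "X": [100, 102, 101, 105, 107],
    "Y": [50, 51, 50.5, 52.5, 53.5],   # moves in step with X
    "Z": [80, 79, 81, 78, 77],         # moves against X
}
edges = market_graph(prices, threshold=0.5)
```

Raising the threshold makes the graph sparser and keeps only the most strongly co-moving pairs.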

State of the art
Regional behaviour of stock markets, relatively small sample size: Korean (Jung et al. 2006; Kim et al. 2007), Indian (Pan & Sinha 2007), and Brazilian (Tabak et al. 2009)
Evolution of interdependence and characterizing the dynamics (Garas, 2007; Huang et al. 2009)
Not much work on identifying dominant stock indices (Eryigit, 2009)
Little work on the impact of the recent financial crisis during 2008
Limited business application of SNA (Bonchi et al., 2011)


Research Gap
Characterising the global market dynamics is a complex problem
Lack of system-level analysis of the global stock market
The network-based methodologies are inadequately explored
Adaptation of SNA methodologies to other domains

Research Objective
Understanding the underlying network structure of markets
Methods to capture the interdependence structure of complex systems
Methods to characterize evolutionary behavior
Methods for change detection: the impact of events on the topology


Research Questions
Is there any regional influence on the evolutionary behavior?
Is the network topology during the crisis phase different from that of the normal phase?
How to capture the macroscopic interdependence structure among stock markets and economic sectors?
How to identify dominant stock markets/economic sectors?
What is the response of the network to an extreme event?

Application: Change Detection in the Interdependence Structure of Global Stock Markets

Source: "A Social Network Approach to Change Detection in the Interdependence Structure of Global Stock Markets" by Ram Babu Roy, Uttam Kumar Sarkar, Social Network Analysis and Mining, Springer, Vol. 3, Number 3 (2013)


Methodology
Secondary data on major stock markets from across the globe obtained from Bloomberg
Network models of stock markets and simplification using a minimum spanning tree (MST)
Characterization and pattern mining to investigate the structural and statistical properties and behavior
Statistical control chart to detect anomaly in evolution
Graph-theoretic methods and algorithms, network visualization tools (Pajek, Matlab, and MS Excel)
Non-parametric methods for analysis and change detection

Data Description

The daily closing prices for 85 stock indices from 36 countries from across the world from January 2006 to December 2010, obtained from Bloomberg

In addition to these stock indices from various countries, 8 other indices, namely SX5E, SX5P, SXXE, SXXP, E100, E300, SPEURO, and SPEU, from the European region were included to investigate whether the regional indices have any influence on the network structure

Choice of the period: to study the behaviour of the stock market network before and after the collapse of Lehman Brothers in the USA.

Restricted our samples to only those indices existing for a longer period and having data available on Bloomberg (say from 1990), giving us 93 such indices


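Computations
Logarithmic return $R_i(t)$ of the instrument $i$ over the one-day period from $(t-1)$ to $t$:
$$R_i(t) = \ln \frac{P_i(t)}{P_i(t-1)}$$
Correlation coefficient between returns of instruments $i$ and $j$:
$$C_{ij} = \frac{\langle R_i R_j \rangle - \langle R_i \rangle \langle R_j \rangle}{\sqrt{\left( \langle R_i^2 \rangle - \langle R_i \rangle^2 \right) \left( \langle R_j^2 \rangle - \langle R_j \rangle^2 \right)}}$$
An edge connecting stock indices $i$ and $j$ is added to the graph if $C_{ij} \ge$ threshold
returns of these two stock indices behave similarly over time
degree of similarity is defined by the chosen value of threshold
MST is used for obtaining a simplified connected network, with the distance between indices $i$ and $j$ given by
$$d_{ij} = \sqrt{2\,(1 - C_{ij})}, \qquad 0 \le d_{ij} \le 2$$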
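
One way to build the simplified network is Kruskal's algorithm on the distances $d_{ij} = \sqrt{2(1 - C_{ij})}$, sketched below. The choice of Kruskal's algorithm, the four index names, and the correlation values are assumptions for illustration; the slides do not say which MST algorithm or tool was used.

```python
import math


def mst_from_correlations(names, corr):
    """Minimum spanning tree (Kruskal) over distances d = sqrt(2 * (1 - C))."""
    edges = sorted(
        (math.sqrt(2.0 * (1.0 - c)), a, b) for (a, b), c in corr.items()
    )
    parent = {name: name for name in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for d, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # keep the edge only if it joins two components
            parent[root_a] = root_b
            tree.append((a, b, round(d, 3)))
    return tree


# Hypothetical pairwise correlations between four indices.
names = ["I1", "I2", "I3", "I4"]
corr = {
    ("I1", "I2"): 0.9, ("I1", "I3"): 0.5, ("I1", "I4"): 0.1,
    ("I2", "I3"): 0.6, ("I2", "I4"): 0.2, ("I3", "I4"): 0.7,
}
tree = mst_from_correlations(names, corr)
```

Highly correlated pairs have small distances, so the tree keeps the $N - 1$ strongest links that still connect every index.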

Illustration of MST creation


Empirical Findings

Period No | Start Date | End Date
          | 1/11/2006  | 4/23/2008
          | 5/31/2006  | 9/10/2008
          | 6/28/2006  | 10/8/2008
36        | 9/17/2008  | 12/29/2010
37        | 1/4/2006   | 12/29/2010

Pre-LB


Post-LB

Indices cluster with the regional hubs
Relatively more decentralized network
European stock indices emerge more central

Application: Identifying Dominant Economic Sectors and Stock Markets
Source: "Identifying dominant economic sectors and stock markets: A social network mining approach" by Ram Babu Roy and Uttam K Sarkar, PAKDD 2013 DMApps, Gold Coast, Australia, Springer, LNCS 7867, pp. 59-70 (2013)


Data Description

Stock Index | Number of Stocks
AS52 | 136
CNX500 | 303
FSSTI | 19
HDAX | 38
HSI | 25
IBOV | 27
KRX100 | 65
MEXBOL | 24
NKY | 192
Total | 829

Stock Index | Number of Stocks
NMX | 192
NZSE50FG | 26
SBF250 | 140
SET | 285
SHCOMP | 382
SPTSX | 147
SPX | 384
TWSE | 313
Total | 1869

GICS Economic Sector | Number of Stocks
Consumer Discretionary | 486
Consumer Staples | 229
Energy | 128
Financials | 408
Health Care | 146
Industrials | 512
Information Technology | 234
Materials | 425
Telecommunication Services | 36
Utilities | 94
Grand Total | 2698

Data for 13 years from January 1998 to January 2011
GICS: Global Industry Classification Standard

Identification of Dominant Economic Sectors and Stock Markets

The normalized intra-sectoral edge density (in percent) is the ratio of the number of edges between the stocks of the particular sector to the maximum number of possible edges between the stocks of that particular sector (i.e. $n - 1$, where $n$ is the number of stocks of that particular sector)
The normalized inter-sectoral edge density is the ratio of the number of edges between the stocks of the two different sectors to the maximum number of possible edges between the stocks of those two sectors (i.e. $\min(n_1, n_2)$, where $n_1$ and $n_2$ are the numbers of stocks belonging to the two sectors)
A similar procedure has been followed to identify dominant stock markets after computing the normalized inter-index and intra-index edge densities.
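
The two density measures can be sketched as follows, using the normalizers stated above ($n - 1$ within a sector, $\min(n_1, n_2)$ between two sectors). The five-stock tree, the sector labels, and the function names are hypothetical, and the inter-sectoral value is also reported in percent here for symmetry.

```python
def sector_size(sector_of, sector):
    """Number of stocks labelled with the given sector."""
    return sum(1 for s in sector_of.values() if s == sector)


def intra_density(edges, sector_of, sector):
    """Edges inside one sector, as a percentage of the maximum n - 1."""
    n = sector_size(sector_of, sector)
    inside = sum(1 for a, b in edges if sector_of[a] == sector_of[b] == sector)
    return 100.0 * inside / (n - 1)


def inter_density(edges, sector_of, s1, s2):
    """Edges between two different sectors, as a percentage of min(n1, n2)."""
    n1, n2 = sector_size(sector_of, s1), sector_size(sector_of, s2)
    across = sum(1 for a, b in edges if {sector_of[a], sector_of[b]} == {s1, s2})
    return 100.0 * across / min(n1, n2)


# Hypothetical tree over five stocks in two sectors.
sector_of = {"F1": "Financials", "F2": "Financials", "F3": "Financials",
             "E1": "Energy", "E2": "Energy"}
edges = [("F1", "F2"), ("F2", "F3"), ("F3", "E1"), ("E1", "E2")]

fin = intra_density(edges, sector_of, "Financials")
eng = intra_density(edges, sector_of, "Energy")
fin_eng = inter_density(edges, sector_of, "Financials", "Energy")
```

Replacing the sector label with the stock's index gives the corresponding intra-index and inter-index densities.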


Index | Color
AS52 | Pink
SPTSX | Black
NMX | Green
CNX500 | Green
SPX | Red
HDAX | Magenta
SBF250 | Brown
HSI | Purple
SET | Brown
IBOV | Yellow
FSSTI | Cyan
KRX100 | Magenta
TWSE | Orange
MEXBOL | Blue
SHCOMP | Red
NKY | Black
NZSE50FG | Green

Continent | Node Shape
Australia | Diamond
Zealandia | Diamond
Asia | Triangle
Europe | Box
North America | Circle
South America | Ellipse

GICS Sector | Color | No. of Stocks
Materials | Brown | 425
Industrials | Cyan | 512
Health Care | Orange | 146
Energy | Red | 128
Consumer Discretionary | Yellow | 486
Financials | Magenta | 408
Information Technology | Blue | 234
Consumer Staples | Black | 229
Telecommunication Services | Green | 36
Utilities | Purple | 94


Interdependence structure of economic sectors (Weighted edge)

Rank    Economic Sector               Eigenvector Centrality
1       Financials                    0.4785
2       Industrials                   0.4042
3       Materials                     0.3795
4       Consumer Discretionary        0.3449
5       Information Technology        0.2962
6       Consumer Staples              0.2702
7       Telecommunication Services    0.2694
8       Utilities                     0.2002
9       Energy                        0.1957
10      Health Care                   0.1817

Network of Stock Indices (Weighted edge)

Rank    Index       Eigenvector Centrality
1       SBF250      0.6381
2       HDAX        0.5938
3       NMX         0.393
4       SPX         0.1523
5       AS52        0.1517
6       NZSE50FG    0.1248
7       SPTSX       0.1112
8       HSI         0.071
9       SET         0.0465
10      MEXBOL      0.0423
11      CNX500      0.0353
12      NKY         0.0289
13      FSSTI       0.0208
14      SHCOMP      0.0048
15      IBOV        0.0043
16      TWSE        0.0004
17      KRX100      0
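
Eigenvector centrality, the measure used in these rankings, scores a node highly when it is strongly connected to other high-scoring nodes. A minimal sketch using power iteration on a weighted adjacency matrix is given below. The function name, the toy matrix, and the unit-length normalization are assumptions made for illustration (the reported scores do have squares summing to roughly 1, which is consistent with that normalization); the slides do not say which software computed the published values.

```python
def eigenvector_centrality(weights, iterations=1000, tol=1e-12):
    """Power iteration on (W + I); the identity shift avoids oscillation on bipartite graphs."""
    n = len(weights)
    x = [1.0 / n ** 0.5] * n  # start from a uniform unit-length vector
    for _ in range(iterations):
        # Multiply by (W + I): each node keeps its own score and adds its neighbours' scores.
        y = [x[i] + sum(weights[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Hypothetical 3-node path graph: the middle node is linked to both ends.
toy_weights = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
scores = eigenvector_centrality(toy_weights)
```

In the toy graph the middle node receives the highest score, in the same way that the most interconnected sectors or indices rank first in the tables.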


Inter-sectoral interdependence linking cross-country stock markets

Rank    Economic Sector               EV Centrality
1       Industrials                   0.4738
2       Financials                    0.4575
3       Materials                     0.3105
4       Utilities                     0.2921
5       Health Care                   0.2886
6       Consumer Staples              0.2747
7       Consumer Discretionary        0.2685
8       Information Technology        0.2306
9       Telecommunication Services    0.225
10      Energy                        0.2232

Findings and Conclusions
Presence of distinct regional and sectoral clusters
Regional influence dominates economic sectors
Stock indices from Europe and US emerge as dominant
Financial, Industrial, Materials, Consumer Discretionary sectors dominate
Social position of stocks: portfolio management
System-level understanding of the structure and behavior


Scope for Future Research
Potential application for classification of stocks in portfolio management
Detecting epicenter of turbulence in near real time: development of EWS
Modeling shock propagation through the network
Capturing causal relationships among stock returns
Whether the self-organizing network provides in-built resilience

References

Bessler, D.A., Yang, J., The structure of interdependence in international stock markets. Journal of International Money and Finance 22, 261-287, 2003.
Eryiğit, M. and R. Eryiğit, Network structure of cross-correlations among the world market indices. Physica A: Statistical Mechanics and its Applications 388(17): 3551-3562, 2009.
Adams, J., K. Faust, et al., Capturing context: Integrating spatial and social network analyses. Social Networks 34(1), 1-5, 2012.
Markose, Computability and evolutionary complexity: Markets as CAS. The Economic Journal 115, 2005.
Tse, C.K., Liu, J., Lau, F.C.M., A network perspective of the stock market. Journal of Empirical Finance, doi:10.1016/j.jempfin.2010.04.008, 2010.
Boginski, V., Butenko, S., Pardalos, P.M., Mining market data: A network approach. Computers & Operations Research 33, 3171-3184, 2006.
Wasserman, S. and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press: 461-502, 1994.
Freeman, L.C., Centrality in social networks: Conceptual clarification. Social Networks 1, 215-239, 1979.
Bonacich, P., Power and Centrality: A Family of Measures. The American Journal of Sociology 92, 1987.
Roy, R.B. and U.K. Sarkar, A social network approach to examine the role of influential stocks in shaping interdependence structure in global stock markets. International Conference on Advances in Social Network Analysis and Mining (ASONAM), Kaohsiung, Taiwan, DOI: 10.1109/ASONAM.2011.87, 2011.
Roy, R.B. and U.K. Sarkar, A social network approach to change detection in the interdependence structure of global stock markets. Social Network Analysis and Mining, DOI: 10.1007/s13278-012-0063-y, 2012.


Classification using Neural Networks
Uttam K Sarkar
Indian Institute of Management Calcutta

Session Plan
The need for neural networks
Signature recognition problem
Whether the signature on the cheque is the same as the signature the bank has against the account in its database

Concept of a Neural Network
Demonstration of how it works using Excel
Potential business applications
Issues in using neural networks


Artificial Neural Network (ANN)
An ANN is a computational paradigm inspired by the structure of biological neural networks and their way of encoding and solving problems

Biological inspirations
Some numbers
The human brain contains about 10 billion nerve cells (neurons)
Each neuron is connected to the others through about 10,000 synapses

Properties of the brain
It can learn and reorganize itself from experience
It adapts to the environment
It is robust and fault tolerant


Biological neural networks
The human brain contains several billion nerve cells called neurons
A neuron receives electrical signals through its dendrites
The accumulated effect of several signals received simultaneously is linearly additive
The output is a nonlinear (all-or-none) type of signal
Connectivity (number of neurons connected to a neuron) varies from 1 to 10^5. For the cerebral cortex it is about 10^3.

Biological Neuron
[Figure: biological neuron with labels synapse, axon, nucleus, cell body, dendrites]

Dendrites sense input
Axons transmit output
Information flows from dendrites to axons via the cell body
Axon connects to other dendrites via synapses
Interactions of neurons
Synapses can be excitatory or inhibitory
Synapses vary in strength

How can the above biological characteristics be modeled in an artificial system?


Artificial implementation using a computer?
Input
Accepting external input is simple and commonplace

Axons transmit output
Output mechanisms too are well known

Information flows from dendrites to axons via the cell body
Information flow is doable

Axon connects to other dendrites via synapses
Interactions of neurons (how? What kind of graph or network?)
Synapses can be excitatory or inhibitory (1/0? Continuous?)
Synapses vary in strength (weighted average?)

Typical Excitement or Activation Function at a Neuron (Sigmoid or Logistic curve)

Logistic: $y = f(x) = \frac{1}{1 + e^{-x}}$
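
As a quick illustration of the logistic curve, the sketch below evaluates it in Python. The function name `logistic` and the sample inputs are chosen here for illustration; the branch on the sign of x is only a common way to avoid floating-point overflow for large negative inputs and is not part of the slide.

```python
import math


def logistic(x):
    """Logistic (sigmoid) activation: 1 / (1 + e^(-x)), computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the algebraically equivalent form e^x / (1 + e^x).
    z = math.exp(x)
    return z / (1.0 + z)


# The curve passes through 0.5 at x = 0 and saturates towards 0 and 1 at the extremes.
samples = [logistic(x) for x in (-5.0, 0.0, 5.0)]
```

The output rises smoothly from near 0 to near 1, which gives a continuous counterpart to the all-or-none firing of a biological neuron.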


Interconnections?
Feed-Forward Neural Network
Information is fed at the input
Computations are done at the hidden layers
Deviations of computed results from desired goals retune the computations
The network thus learns
Computation is terminated once the learning is assumed acceptable or the resources earmarked for computation get exhausted

[Figure: feed-forward network with an input layer (x1, x2, ..., xn), a 1st hidden layer, a 2nd hidden layer, and an output layer]
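
The forward flow of information through such a network can be sketched as follows. The layer sizes, weights, and biases are made-up values for illustration only, and the use of the logistic activation at every layer is an assumption carried over from the activation slide.

```python
import math


def logistic(x):
    """Logistic activation applied at every neuron."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights, biases):
    """Each neuron takes a weighted sum of the inputs plus a bias, then applies the activation."""
    return [
        logistic(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Pass the inputs through the layers in order: hidden layers first, output layer last."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_output(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 neurons -> 2 neurons -> 1 output.
network = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # 1st hidden layer
    ([[0.7, -0.5, 0.2], [0.3, 0.3, -0.6]], [0.05, -0.05]),       # 2nd hidden layer
    ([[1.0, -1.0]], [0.0]),                                      # output layer
]
prediction = feed_forward([1.0, 2.0], network)
```

Learning, which this sketch omits, would consist of adjusting the weights and biases so that the prediction moves closer to the desired output.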

Supervised Learning
Inputs and outputs are both known.
The network tunes its weights to transform the inputs to the outputs without trying to discover the mapping in an explicit form
One may provide examples and teach the neural network to arrive at a solution without knowing exactly how!
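
A minimal sketch of supervised learning with a single logistic neuron is shown below. It learns the logical OR function from four labelled examples; the training data, learning rate, epoch count, and the simple update rule (weights nudged in proportion to the output error) are all assumptions chosen to keep the illustration short, not the method used in the Excel demonstration.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(weights, bias, inputs):
    return logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(examples, epochs=2000, rate=0.5):
    """Tune weights and bias so the neuron's outputs approach the known targets."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = neuron_output(weights, bias, inputs) - target
            # Move each weight against the error, scaled by the input it multiplies.
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


# Known inputs with known outputs (logical OR), as supervised learning requires.
or_examples = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
learned_weights, learned_bias = train_neuron(or_examples)
predictions = [round(neuron_output(learned_weights, learned_bias, x)) for x, _ in or_examples]
```

The neuron is never told the rule "output 1 if any input is 1"; it arrives at weights that reproduce the examples.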


Characteristics of ANN
Supervised networks are good approximators
Bounded functions can be approximated by an ANN to any precision
Does self-learning by adapting weights to environmental needs
Can work with incomplete data
The information is distributed across the network. If one part gets damaged the overall performance may not degrade drastically

What do we need to use NN?
Determination of pertinent inputs
Collection of data for the learning and testing phases of the neural network
Finding the optimum number of hidden nodes
Estimating the parameters (learning)
Evaluating the performance of the network
If the performance is not satisfactory, then review all the preceding points
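
The separation of data into learning and testing phases, followed by an evaluation step, can be sketched as below. The function names, the 75/25 split, and the toy data are hypothetical choices for illustration; any classifier that maps inputs to a label could be passed as `predict`.

```python
import random


def split_data(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples reproducibly and hold back a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1.0 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, samples):
    """Fraction of samples whose predicted label equals the known target."""
    if not samples:
        return 0.0
    correct = sum(1 for inputs, target in samples if predict(inputs) == target)
    return correct / len(samples)


# Hypothetical labelled data: the target is 1 when the single input is odd.
data = [([i], i % 2) for i in range(20)]
learning_set, testing_set = split_data(data)

# A stand-in for a trained network; a real one would be fitted on learning_set only.
test_accuracy = accuracy(lambda inputs: inputs[0] % 2, testing_set)
```

If the accuracy on the testing set is unsatisfactory, the earlier choices (inputs, data, number of hidden nodes) are revisited, as the list above suggests.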


Applications of ANN
Dares to address difficult problems where the cause-effect relationship of input-output is very hard to quantify
Stock market predictions
Face recognition
Time series prediction
Process control
Optical character recognition
Optimization

Concluding remarks on Neural Networks
Neural networks are utilized as statistical tools
Adjust nonlinear functions to fulfill a task
Need multiple and representative examples, but fewer than in other methods
Neural networks make it possible to model complex phenomena
NN are good classifiers BUT
Good representations of data have to be formulated
Training vectors must be statistically representative of the entire input space
Effective use of NN needs a good comprehension of the problem and a good grip on the underlying mathematics


Business Data Mining: Promises and Reality

Uttam K Sarkar
Indian Institute of Management Calcutta

Background
"War is too important to be left to the Generals" (Georges Benjamin Clemenceau)
Decision making now requires navigating beyond transactional data

Need for exploring an ocean of data

How to filter?
How to impute?
How about outliers?
How to summarize?
How to analyze?
How to apply?


Challenges (Opportunities?)
The real world does not have consistent behaviour
What model is to be extracted by analytics then? The future, by definition, is uncertain

Captured data are error prone and involve uncertainty
You are dead by the time you know the ultimate truth

Often there is no concrete clarity on the goal as you optimize
What are the strengths and weaknesses of the underlying assumptions?

Data are voluminous, have high variety, and high velocity
How to capture, store, transmit, share, sample, analyze?

Interpretation of findings is nontrivial
Sufficiently tortured and abused, statistics would confess to almost anything

Analytics
Promises, Myths, and Reality
Too much jargon
Business intelligence, Data mining, Data warehousing, Predictive analytics, Prescriptive analytics, Big data, ...

Too many questions
What on earth do they mean? Where did they emerge from? Why are they getting so popular? Is analytics the panacea? Is it yet another overhyped buzzword?


Analytics in action
Instead of struggling for academic definitions, let us look into what these are intended to achieve and look at some examples from the business world where these are being used

Netflix
Disruptive innovation riding on the analytics dimension
1997: Reed Hastings, Computer Science Masters from Stanford, pays a $40 late fee for a VHS cassette of the movie Apollo 13
The recommender system
The transportation logistics fine tuning
The silent battle of the nondescript entrant Netflix against the incumbent billion-dollar Blockbuster
Blockbuster initially ignored Netflix
Recognition (and virtual surrender) by Blockbuster: it was too late!


Harrah's
Knowing customers better using analytics
The magic of CRM

Wal-Mart
The ever-annoying problem of stockout, shrinkage, and inventory management, and the success of Radio Frequency Identification (RFID) based analytics
Reduced stockout!

Examples Galore

HP
Which employee is likely to quit?
Target
Which customers are expectant mothers?
Telenor
Which customers can be persuaded to stay back?
Latest US presidential election
Which voter will be positively persuaded by political campaign contact such as a call, door knock, flier, or TV ad?
IBM (Watson)
Automated Jeopardy! Analytics software challenged human opponents!


Analytics defined?
Analytics is (Ref: Davenport)
Extensive use of data, statistical, quantitative, and computer-based analysis, explanatory and predictive models, and a fact-based rather than gut-feeling-based approach to arrive at business decisions and to drive actions

Analytics can help arrive at intelligent decisions
What is intelligence in this context?
How can a machine help take intelligent decisions?

Changing nature of competition in business world: How to get competitive advantage?


Competitive advantage
Operational business processes aren't much different from anybody else's
Thanks to R&D on best practices
Products aren't much differentiated any more
Thanks to comparable technology accessible to all
Unique geographical advantage doesn't matter much
Thanks to improved transportation/communication logistics
Protective regulations are largely gone
Thanks to globalization
Proprietary technology gets rapidly copied
Thanks to technological innovations
What is left is to improve efficiency and effectiveness by taking the smartest business decisions possible out of data which may as well be available to competitors

Stock Market Analogy
It's too well known that the stock market would cease to exist if one could find a predictive theory of price movements!
The market exists because no such ideal model is possible. Players try disparate models: some gain while others lose

Ocean of data: "best model" or "best interpretation" is meaningless
The smarter guy discovers more meaningful patterns for his business by using analytics!


Emergence of analytics: the contributors
Old wine in new bottle?
Yes and No

Pillars of business analytics

Data capture and storage in bulk
Statistical principles revisited
Developments in machine learning principles
Availability of easy-to-use software tools
Managerial involvement with a changed mindset

Panacea?
Too many ready-to-use tools and techniques are available!
A fool with a tool is still a fool

Road to Analytics
Uttam K Sarkar
Indian Institute of Management Calcutta


Attributes of organizations thriving on analytics
Identification of a distinctive strategic capability for analytics to make a difference
Highly organization specific
May be customer loyalty for Harrah's, revenue management for Marriott, supply chain performance for Wal-Mart, customers' movie preferences for Netflix, ...

Enterprise-level approach to analytics
Gary Loveman of Harrah's broke the fiefdoms of marketing and customer service islands into a cross-department chorus on shared data and thoughts

Senior management capability and commitment
Reed Hastings of Netflix (Computer Science graduate from Stanford)
Gary Loveman of Harrah's (PhD in Economics from MIT)
Bezos of Amazon.com (Computer Science graduate from Princeton)

Stages of analytics preparedness of an organization
Analytically impaired
Groping for data to improve operations

Explorer of internal analytics
Using limited analytics to improve a functional activity

Analytics Experimenter
Exploring analytics to improve a distinctive capability

Analytics Mindset
Using analytics to differentiate its offerings

Competing on analytics
Staying ahead on strength of analytics


Investment in analytics without due diligence may not yield results
United Airlines had been the world's largest airline in terms of fleet size and number of destinations, operating over 1200 aircraft with service to 400-odd destinations
United Airlines had invested heavily in analytics
The company filed for Chapter 11 bankruptcy protection in 2002
Despite the business downturn, most other airlines were not as adversely affected
What went wrong with its analytics?

United Airlines analytics post-mortem
UA pioneered only yield management
Other smaller airlines were doing cost cutting and offering seats at lower fares
UA was developing complex route planning optimization analytics with multiple plane types
Competitors like Southwest used only one kind of plane and had a far simpler and cheaper system to run
UA pioneered a loyalty programme based on analytics
Their customer service was so pathetic that frequent flyers hardly had any loyalty to it

UA spent a fortune developing analytics
Other airlines could buy the Sabre system at a far cheaper price for pretty similar analysis


Questions to ask when evaluating an analytics initiative
How will this investment make us more profitable?
How does this initiative improve enterprise-wide capabilities?
What complementary changes need to be made to take advantage of the capabilities?
More IT? More training? Redesign jobs? Hire new skills?

Do we have access to the right data?
Are data timely, accurate, complete, consistent?

Do we have access to the right technology?
Is the technology workable, scalable, reliable, and cost effective?

Missteps to avoid when getting into an analytics initiative
Focusing excessively on one dimension of the capability (say, only on technology)
Attempting to do everything at once
Any complex system is best handled incrementally
Investing too much or too little on analytics without matching impact on and demand of business
Choosing the wrong problem
Wrong formulation, wrong assumptions, wrong data, wrong software tool, wrong method
Making the wrong interpretation
Tool + data + few mouse clicks = model
Model + input = output
Who assesses whether the output is garbage?


Characteristics of executives promoting analytics
They should be passionate about fact-driven decision making
They should have skills and appreciation of analytical tools and methods
They should be willing to act on the findings from analytics
They should be competent to assess and manage meritocracy

References & Acknowledgment
Predictive Analytics
Eric Siegel, Wiley, 2013

Competing on Analytics
Davenport and Harris, Harvard Business School Press, 2007

Analytics at Work
Davenport, Harris, and Morrison, Harvard Business Press, 2010
