Big Data: Prelude
I-Jen Chiang
What's Big Data?
No single definition; from Wikipedia:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
6/26/2014
How much data?
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year
"640K ought to be enough for anybody."
A Single View to the Customer
[Diagram: the Customer at the center, linked to Banking, Finance, Social Media, Our Known History, Gaming, Entertain, and Purchase]
Life Cycle of Big Data
[Diagram: Collect, Accumulate, Store, Query; surrounding technologies: Cloud Computing, Internet of Things, Mobile Computing, MapReduce, Distributed Storage, Big Data]
Data, data, and more data
According to IBM, 90% of the data in the world today was created in the past two years (2011~2013). (IBM quote, microscope.co.uk)
According to International Data Corporation, the total amount of global data is expected to grow to 2.7 zettabytes during 2012. (International Data Corporation 2012 prediction, IDC website)
The data is growing exponentially (43% growth rate) and is estimated to be 7.9 zettabytes by 2015. (CenturyLink 2015 prediction, ReadWriteWeb website)
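The two figures quoted above are consistent with each other if the 43% rate is read as annual compound growth starting from 2.7 zettabytes in 2012. A minimal sketch of that arithmetic follows; the function name projectVolume and the three-year horizon are illustrative assumptions, not part of the cited forecasts.

```javascript
// Compound growth: volume after `years` at a fixed annual growth rate.
// projectVolume is a hypothetical helper name used only for this sketch.
function projectVolume(startZB, annualRate, years) {
  return startZB * Math.pow(1 + annualRate, years);
}

// 2.7 ZB in 2012, growing 43% per year, projected three years ahead to 2015.
const projected2015 = projectVolume(2.7, 0.43, 3);
console.log(projected2015.toFixed(1) + " ZB"); // prints "7.9 ZB"
```

The result (about 7.9 ZB) matches the 2015 estimate, which suggests the two sources describe the same growth curve.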
Tendency of Data
Transaction-oriented systems:
Tabular or object/relational
Fixed schema
10s-100s GB
Up to 100,000s of transactions per hour
Optimized for transaction processing

Reporting and analysis systems:
Structured and dimensioned
Specialized DBs for BI
TB to low PBs
Batch loads
Optimized for reporting and analysis

Big data systems:
Poorly structured, lightly structured, or unstructured
Simple structure and extreme data rate
Hierarchical and/or file oriented
Sparsely attributed
PB and up
Stream, capture, batch
Optimized for distributed, cloud-based processing
Data Growth
Volume (Scale)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 ZB
Data volume is increasing exponentially
Exponential increase in collected/generated data
30 billion RFID tags today (1.3B in 2005)
12+ TBs of tweet data every day
4.6 billion camera phones worldwide
? TBs of data every day
100s of millions of GPS-enabled devices sold annually
25+ TBs of log data every day
2+ billion people on the Web by end 2011
76 million smart meters in 2009, 200M by 2014
Real-time/Fast Data
Mobile devices (tracking all objects all the time)
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
The Model Has Changed
The model of generating/consuming data has changed
Old model: a few companies generate data, all others consume data
New model: all of us generate data, and all of us consume data
What's driving Big Data
Optimizations and predictive analytics
Complex statistical analysis
All types of data, and many sources
Very large datasets
More of a real-time

Ad-hoc querying and reporting
Data mining techniques
Structured data, typical sources
Small to mid-size datasets
What is Big Data?
Datasets which are too large, grow too rapidly, or are too varied to handle using traditional techniques
Characteristics:
Volume: 100s of TBs, petabytes, and beyond
Velocity: e.g., machine-generated data, medical devices, sensors
Variety: unstructured data, many formats, varying semantics
Not every data problem is a Big Data problem!
Internet of Things
Characteristics
[Figure: Big Data characteristics; Volume (data size)]
IBM Definition
A New Era of Computing
Volume: 12 terabytes of Tweets created daily
Velocity: 5 million trade events per second
Variety: 100s of video feeds from surveillance cameras
Veracity: only 1 in 3 decision makers trust their information
http://watalon.com/?p=722
"We have for the first time an economy based on a key resource [Information] that is not only renewable, but self-generating. Running out of it is not a problem, but drowning in it is."
John Naisbitt
Big Data Explained
Achieve Breakthrough Outcomes:
Know Everything about your Customers
Run Zero-latency Operations
Innovate new products at Speed and Scale
Instant Awareness of Fraud and Risk
Exploit Instrumented Assets
By Analyzing Any Big Data Type:
Transactional/Application Data
Machine Data
Social Media Data
Content
Value of Big Data
Unlocking significant value by making information transparent and usable at much higher frequency.
Using data collection and analysis to conduct controlled experiments to make better management decisions.
Sophisticated analytics that substantially improve decision-making.
Improved development of the next generation of products and services.
Why does big data matter?
Big data is not just about storing large datasets
Rather, it is about leveraging datasets
Mining datasets to find new meaning
Combining datasets that have never been combined before
Making more informed decisions
Offering new products and services
Data is a vital asset, and analytics are the key to unlocking its potential
"We don't have better algorithms than anyone else, we just have more data."
Peter Norvig, Director of Research, Google, spoken in 2010
Big Data: Processing
I-Jen Chiang
Knowledge Pyramid
Wisdom (Knowledge + experience)
Knowledge (Information + rules)
Information (Data + context)
Data
Signals
Example questions, from the top of the pyramid down:
What made it that unsuccessful?
What was the lowest selling product?
How many units were sold of each product line?
[Diagram labels: Data (Text) Mining area; Semantic level; Resources occupied]
Value Chain Emerges
[Diagram: layers of increasing value, listed from highest to lowest]
Prescriptive Analytics: automatically prescribe and take action
Predictive Analytics: sets of potential future scenarios
Analytics: identification of patterns and relationships
Reporting: an evaluation of what happened in the past
Processing: data prepared for analysis; indexed, organized and optimized data
Big Data: containers and feeds of heterogeneous data; access to structured and unstructured data
Michael Porter, Competitive Advantage: Creating and Sustaining Superior Performance
Big Data Processing
[Architecture diagram]
Data sources: Transactional Data; Operational & Partner Data; Machine-to-Machine Data; Event Streams; Social Data; Cloud Services Data
Interconnect: High-Speed Low-Latency InfiniBand/Ethernet Interconnect
Storage layers: Working Local Flash Storage Layer as an extension of DRAM; Distributed Shared Flash Storage Layer; Low-cost Distributed Archive & Backup Disk Storage Layer
Active Data Management: Operational Systems; Business Analytics; Government Systems; Big Data Analytics; Databases; Indexes; Indexing & Metadata; Metadata Cubes; Shared Databases; Active Indexes; Shared Metadata
Archive/Backup Data Management: Archive Data & Metadata
The Traditional Approach
Query-driven (lazy, on-demand)
[Diagram: Clients query an Integration System (with Metadata), which reaches each Source through a Wrapper]
Disadvantages of Query-Driven Approach
Delay in query processing
Slow or unavailable information sources
Complex filtering and integration
Inefficient and potentially expensive for frequent queries
Competes with local processing at sources
Hasn't caught on in industry
The Warehousing Approach
Information integrated in advance
Stored in warehouse for direct querying and analysis
[Diagram: Clients query the Data Warehouse; an Integration System (with Metadata) loads it from each Source through an Extractor/Monitor]
Advantages of Warehousing Approach
High query performance
But not necessarily most current information
Doesn't interfere with local processing at sources
Complex queries at warehouse
OLTP at information sources
Information copied at warehouse
Can modify, annotate, summarize, restructure, etc.
Can store historical information
Security, no auditing
Has caught on in industry
Business Intelligence
[Diagram: three-tier architecture]
Information Sources: Operational DBs; Semistructured Sources (extract, transform, load, refresh, etc.)
Data Warehouse Server (Tier 1): Data Warehouse; Data Marts
OLAP Servers (Tier 2): e.g., MOLAP, ROLAP
Clients (Tier 3): OLAP; Query/Reporting; Data Mining
Not Either-Or Decision
Query-driven approach still better for
Rapidly changing information
Rapidly changing information sources
Truly vast amounts of data from large numbers of sources
Clients with unpredictable needs
What is a Data Warehouse?
A Practitioner's Viewpoint
"A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context."
Barry Devlin, IBM Consultant
An Alternative Viewpoint
"A DW is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making."
W.H. Inmon, Building the Data Warehouse, 1992
A Data Warehouse is...
Stored collection of diverse data
A solution to data integration problem
Single repository of information
Subject-oriented
Organized by subject, not by application
Used for analysis, data mining, etc.
Optimized differently from transaction-oriented db
User interface aimed at executive
Characteristics of a Data Mart
KROENKE and AUER, DATABASE CONCEPTS (6th Edition)
Copyright 2013 Pearson Education, Inc. Publishing as Prentice Hall
Gaining market intelligence from news feeds
Sreekumar Sukumaran and Ashish Sureka
Integrated BI Systems
[Diagram: structured data (RDBMS, DBMS, file system, XML) and unstructured data (EA, legacy, CMS, scanned documents) pass through ETL and a text tagger & annotator into intermediate data (XML), then through ETL into the complete data warehouse]
Data Warehouse Components
SOURCE: Ralph Kimball
Data Warehouse Components, Detailed
SOURCE: Ralph Kimball
Linux Adoption
Distributing processing between gateways and cloud
Big Data Processing Techniques
Distributed data stream processing technology for on-the-fly real-time analytics of data/events generated at extremely high rates.
Technologies for reliable distributed data store, high-speed data structure transform to create analytics DB, and quick data placement management.
Scalable data extraction technology to speed up rich querying functionality, such as multi-dimensional queries, over a key-value store.
Scalable distributed parallel processing of a huge amount of stored data for advanced analysis such as machine learning.
Batch and Real-time Platform
http://www.nec.com/en/global/rd/research/cl/bdpt.html
http://info.aiim.org/digitallandfill/newaiimo/2012/03/15/bigdataandbigcontentjusthypeorarealopportunity
Value Chain of Big Data
Fritz Venter (LEFT) and Andrew Stein, Images & videos: really big data, 2012
Big Data Moving Forward
Source: Dion Hinchcliffe
Evolution of Data
Table columns: Evolutionary Step; Business Question; Enabling Technologies; Product Providers; Characteristics

Data Collection (1960s)
  Product providers: IBM, CDC
  Characteristics: retrospective, static data delivery

Data Access (1980s)
  Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
  Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
  Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
  Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Characteristics: prospective, proactive information delivery
Big Data Processing: Real Time
Collect and Store
In-memory data grid
Speed up processing through co-location of business logic with data
Reduce network hops
Scaling
Integrate with the big data store to meet volume and cost demands
Real-Time Big Data Processing
[Diagram: Big Data Streams feed a Streaming System and a Real-time Analytic System (with Metadata); Big Data Systems: Data Warehouses, Systems of Record; Big Data ETL; Hadoop; Archive & Backup Systems (objects, geographically distributed)]
[Diagram: Data Acquisition Cycle in hospital (focus on information capture, organisation, and presentation) and Data Access Cycle (individual summaries & queries). Steps: Extract Information; Construct Chronicle (what was done, what happened and why); Depersonalise; Pseudonymise; Integrate & Aggregate into a Pseudonymised Repository; Knowledge enrichment; Summarise & Formulate Queries; Hazard Monitoring; Reidentify by Hospital. Safeguards: Ethical oversight committee; Privacy Enhancement Technologies]
[Diagram: example patient chronicle drawn as a graph. Nodes: Human:1382, Pain:5735, Ulcer:1945, Breast:1492, Clinic:4096, Clinic:1024, Biopsy:1066, Clinic:2010, Radio:1812, Chemo:6502, Mass:1666, Cancer:1914. Edge labels: locus, attends, reason, finding, plans, target, treats, time]
KDD Process
[Diagram: Data Warehouse -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Pattern -> Interpretation/Evaluation -> Knowledge]
N. Stolba and A. M. Tjoa, The Relevance of Data Warehousing and Data Mining in the Field of Evidence-based Medicine to Support Healthcare Decision Making. Proceedings of World Academy of Science, Engineering and Technology, Volume 11, February 2006, ISSN 1307-6884
Big Data Analysis Process: Data
[Diagram: Task Definition and Goal; Data Gathering, Selection, Extracting, Cleansing, Transferring, Organizing, and Loading (ETL) into the Database and Data Warehouse; Cleaning, Preprocess, Transform (Target Data, Preprocessed Data, Usable Data); DM and KD produce Pattern and Output; OLAP, Query, Reports; Validation and Completion]
Big Data Analysis Process: Text
[Diagram: Task Definition and Goal; Data Gathering, Selection, Extracting, Cleansing, Transferring, Organizing, and Loading into the Document Repository and Database; Text Preprocess: Language Feature Extraction, Lexical Analysis, Semantic Analysis (Preprocessed Data); Text Mining: Document Clustering/Categorization, Semantic Evaluation; Knowledge; Knowledge-Based Visualization and Browsing Tools]
What is Cloud Computing?
Cloud computing is a style of computing in which dynamically scalable and virtualized resources are provided as a service over the Internet.
Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.
Definitions
A style of computing where massively scalable IT-related capabilities are provided "as a service" using Internet technologies to multiple external customers
Gartner
Definitions
One focusing on remote access to services and computing resources provided over the Internet "cloud"
Ex: CRM and payroll services, as well as vendors that offer access to storage and processing power over the Web (such as Amazon's EC2 service).
The other focusing on the use of technologies such as virtualization and automation that enable the creation and delivery of service-based computing capabilities.
It is an extension of traditional data center approaches and can be applied to entirely internal enterprise systems with no use of external off-premises capabilities provided by a third party
The NIST Cloud Definition Framework
Deployment Models: Private Cloud; Community Cloud; Public Cloud; Hybrid Clouds
Service Models: Software as a Service (SaaS); Platform as a Service (PaaS); Infrastructure as a Service (IaaS)
Essential Characteristics: On-Demand Self-Service; Broad Network Access; Rapid Elasticity; Resource Pooling; Measured Service
Common Characteristics: Massive Scale; Resilient Computing; Homogeneity; Geographic Distribution; Virtualization; Service Orientation; Low-Cost Software; Advanced Security
Based upon original chart created by Alex Dowbor - http://ornot.wordpress.com
Cycle of SOA, Cloud Computing, Web 2.0
[Diagram: numbered stages 1 to 4 covering Collection/Identification, Repository/Registry, Semantic Intellectualization, Integration, Analytics/Prediction, Data Visualization, Insight, Action, and Decision around Big Data; supporting roles and functions: Data Curation, Data Scientist, Data Engineer, Workflow, Data Quality]
Model for Big Data
A Reference Model for Big Data
Layers: Service Layer; Service Support Layer; Platform Layer; Data Layer
Components: Analysis & Prediction; Data Visualization; Workflow Management; Data Quality Management; Data Curation; Data Integration; Data Semantic Intellectualization; Data Identification (Data Mining & Metadata Extraction); Data Collection; Data Registry; Data Repository
Interfaces: Big Data Management Interface; Security Interface
[Diagram: data feeds, library catalogs, locally held documents, public repositories, commercial data sources, agency data sources, and the public INTERNET (via spiders) are each indexed by a search engine; a Meta-Search Tool combines the search engines, and filtered content is organized by a TAXONOMY and presented through a Web portal]
System Design Preview
Node.js Based
I-Jen Chiang
Scenario
Sensor Network
Distributed Processing
http://phys.org/news/201203technologyefficientlydesiredbigstreams.html
Node.js vs Cloud Computing
[Diagram: a Node.js Master Process runs the Web Server & WebSockets Interface and a Dispatcher. Incoming static content requests are served from Static Content; incoming WebSocket requests are dispatched to a Child Process Pool. Each Node.js Child Process hosts a fully asynchronous Node.js Application with an Application Module, a Query module, and a Communication module]
Event Loop
Event Loop Example
Learning JavaScript
Layer 1: Single Object
Layer 2: Prototype Chain
Layer 3: Constructor (Constructor, new, Instance)
Layer 4: Constructor inheritance (SuperConstr, SubConstr)
Create a single object
Objects: atomic building blocks of JavaScript OOP
Objects: maps from strings to values
Properties: entries in the map
Methods: properties whose values are functions
this refers to the receiver of the method call
// Object literal
var jane = {
    // Property
    name: 'Jane',
    // Method
    describe: function () {
        return 'Person named ' + this.name;
    }
};
Advantage: create objects directly, introduce abstractions later
var jane = {
    name: 'Jane',
    describe: function () {
        return 'Person named ' + this.name;
    }
};
# jane.name
'Jane'
# jane.describe
[Function]
# jane.describe()
'Person named Jane'
# jane.name = 'John'
# jane.describe()
'Person named John'
# jane.unknownProperty
undefined
Objects versus maps
Similar:
Very dynamic: freely delete and add properties
Different:
Inheritance (via prototype chains)
Fast access to properties (via constructors)
Sharing properties: the problem
var jane = {
    name: 'Jane',
    describe: function () {
        return 'Person named ' + this.name;
    }
};
var tarzan = {
    name: 'Tarzan',
    describe: function () {
        return 'Person named ' + this.name;
    }
};
Sharing properties: the solution
PersonProto
    describe: function () {...}
jane
    __proto__: PersonProto
    name: 'Jane'
tarzan
    __proto__: PersonProto
    name: 'Tarzan'
jane and tarzan share the same prototype object.
Both prototype chains work like single objects.
Sharing properties: the code
var PersonProto = {
    describe: function () {
        return 'Person named ' + this.name;
    }
};
var jane = {
    __proto__: PersonProto,
    name: 'Jane'
};
var tarzan = {
    __proto__: PersonProto,
    name: 'Tarzan'
};
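To check that a method really lives on the shared prototype rather than on each object, a short sketch like the following may help. It uses Object.create() instead of the __proto__ literal (a substitution made here so the sketch runs on ECMAScript 5 engines); the object names follow the slides.

```javascript
// One prototype object holds the shared method.
const PersonProto = {
  describe: function () {
    return "Person named " + this.name;
  }
};

// Each object only stores its own data; the method is found via the prototype chain.
const jane = Object.create(PersonProto);
jane.name = "Jane";
const tarzan = Object.create(PersonProto);
tarzan.name = "Tarzan";

console.log(jane.describe());                                // Person named Jane
console.log(tarzan.describe());                              // Person named Tarzan
console.log(jane.describe === tarzan.describe);              // true: same function object
console.log(jane.hasOwnProperty("describe"));                // false: inherited, not own
console.log(Object.getPrototypeOf(tarzan) === PersonProto);  // true
```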
Getting and setting the prototype
ECMAScript 6: __proto__
ECMAScript 5:
Object.create()
Object.getPrototypeOf()
Getting and setting the prototype
Object.create(proto)
var PersonProto = {
    describe: function () {
        return 'Person named ' + this.name;
    }
};
var jane = Object.create(PersonProto);
jane.name = 'Jane';
Object.getPrototypeOf(obj)
# Object.getPrototypeOf(jane) === PersonProto
true
Sharing methods
// Instance-specific properties
function Person(name) {
    this.name = name;
}
// Shared properties
Person.prototype.describe = function () {
    return 'Person named ' + this.name;
};
Instances created by the constructor
function Person(name) {
    this.name = name;
}
Person
    prototype: Person.prototype
Person.prototype
    describe: function () {...}
jane
    __proto__: Person.prototype
    name: 'Jane'
tarzan
    __proto__: Person.prototype
    name: 'Tarzan'
instanceof
Is value an instance of Constr?
value instanceof Constr
How does instanceof work?
Check: is Constr.prototype in the prototype chain of value?
// Equivalent
value instanceof Constr
Constr.prototype.isPrototypeOf(value)
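The equivalence between instanceof and isPrototypeOf() can be observed directly. The sketch below reuses the Person constructor from the earlier slides; the variable bare is an illustrative addition showing an object with no prototype chain at all.

```javascript
function Person(name) {
  this.name = name;
}
Person.prototype.describe = function () {
  return "Person named " + this.name;
};

const jane = new Person("Jane");

// Both checks walk jane's prototype chain looking for Person.prototype.
console.log(jane instanceof Person);                // true
console.log(Person.prototype.isPrototypeOf(jane));  // true

// Object.prototype sits at the end of the same chain.
console.log(jane instanceof Object);                // true

// An object created without a prototype is an instance of nothing.
const bare = Object.create(null);
console.log(bare instanceof Object);                // false
```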
Goal: derive Employee from Person
function Person(name) {
    this.name = name;
}
Person.prototype.sayHelloTo = function (otherName) {
    console.log(this.name + ' says hello to ' + otherName);
};
Person.prototype.describe = function () {
    return 'Person named ' + this.name;
};
Employee(name, title) is like Person, except:
Additional instance property: title
describe() returns 'Person named <name> (<title>)'
Things we need to do
Employee must
Inherit Person's instance properties
Create the instance property title
Inherit Person's prototype properties
Override method Person.prototype.describe (and call overridden method)
Employee: the code
function Employee(name, title) {
    Person.call(this, name); // (1)
    this.title = title; // (2)
}
Employee.prototype = Object.create(Person.prototype); // (3)
Employee.prototype.describe = function () { // (4)
    return Person.prototype.describe.call(this) // (5)
        + ' (' + this.title + ')';
};
(1) Inherit instance properties
(2) Create the instance property title
(3) Inherit prototype properties
(4) Override method Person.prototype.describe
(5) Call overridden method (a super-call)
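Putting the five steps together, a runnable sketch of the Person/Employee derivation might look as follows. Resetting Employee.prototype.constructor is an extra step not listed on the slide; it is included here as a common convention, not as part of the original code.

```javascript
function Person(name) {
  this.name = name;
}
Person.prototype.describe = function () {
  return "Person named " + this.name;
};

function Employee(name, title) {
  Person.call(this, name);  // (1) inherit instance properties
  this.title = title;       // (2) create the instance property title
}
// (3) inherit prototype properties
Employee.prototype = Object.create(Person.prototype);
// Assumption: restore the constructor link that step (3) replaced.
Employee.prototype.constructor = Employee;
// (4) override the method
Employee.prototype.describe = function () {
  // (5) super-call to the overridden method
  return Person.prototype.describe.call(this) + " (" + this.title + ")";
};

const jane = new Employee("Jane", "CTO");
console.log(jane.describe());         // Person named Jane (CTO)
console.log(jane instanceof Person);  // true
```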
Instances created by the constructor
Object.prototype
Person
    prototype: Person.prototype
Person.prototype
    __proto__: Object.prototype
    sayHelloTo: function () {...}
    describe: function () {...}
Employee (calls Person)
    prototype: Employee.prototype
Employee.prototype
    __proto__: Person.prototype
    describe: function () {...}
jane
    __proto__: Employee.prototype
    name: 'Jane'
    title: 'CTO'
Built-in constructor hierarchy
Object
    prototype: Object.prototype
Object.prototype
    __proto__: null
    toString: function () {...}
Array
    prototype: Array.prototype
Array.prototype
    __proto__: Object.prototype
    toString: function () {...}
    sort: function () {...}
['foo', 'bar']
    __proto__: Array.prototype
    0: 'foo'
    1: 'bar'
    length: 2
Hello World
var http = require('http');
http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World.');
}).listen(1337, "127.0.0.1");
console.log('Server running at http://127.0.0.1:1337/');
Express Web Application Framework
var express = require('express'),
    app = express();
app.get('/', function (req, res) {
    res.send('Hello World.');
});
app.listen(1337);
Express: Create HTTP services
app.get('/', function (req, res) {
    res.send('hello world');
});
app.get('/test', function (req, res) {
    res.send('test render');
});
app.get('/user/', function (req, res) {
    res.send('user page');
});
Router Identifiers
// Will match /abcd
app.get('/abcd', function (req, res) {
    res.send('abcd');
});
// Will match /acd and /abcd
app.get('/ab?cd', function (req, res) {
    res.send('ab?cd');
});
// Will match /abcd, /abxcd, /abxyzcd, and so on
app.get('/ab*cd', function (req, res) {
    res.send('ab*cd');
});
// Will match /abe and /abcde
app.get('/ab(cd)?e', function (req, res) {
    res.send('ab(cd)?e');
});
// Will match /abcd, /abbcd, /abbbcd, and so on
app.get('/ab+cd', function (req, res) {
    res.send('ab+cd');
});
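The route patterns above behave much like anchored regular expressions. The sketch below approximates each pattern with a hand-written RegExp so the matching rules can be tried without a server; the routes table and the matchingPatterns helper are hypothetical, and the translation (in particular `*` as `.*`) is an assumption modelled on Express 4-style string patterns, not Express's own implementation.

```javascript
// Hand-written regular expressions approximating each string pattern.
const routes = {
  "/abcd": /^\/abcd$/,
  "/ab?cd": /^\/ab?cd$/,         // "b" is optional
  "/ab+cd": /^\/ab+cd$/,         // one or more "b"
  "/ab*cd": /^\/ab.*cd$/,        // anything between "ab" and "cd"
  "/ab(cd)?e": /^\/ab(cd)?e$/    // the group "cd" is optional
};

// Return every pattern whose regular expression accepts the path.
function matchingPatterns(path) {
  return Object.keys(routes).filter(function (pattern) {
    return routes[pattern].test(path);
  });
}

console.log(matchingPatterns("/acd"));     // [ '/ab?cd' ]
console.log(matchingPatterns("/abbcd"));   // [ '/ab+cd', '/ab*cd' ]
console.log(matchingPatterns("/abxyzcd")); // [ '/ab*cd' ]
console.log(matchingPatterns("/abe"));     // [ '/ab(cd)?e' ]
```

A plain path such as /abcd is accepted by four of the five patterns, which is why the order in which routes are registered matters.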
Express: Get Parameters
// ... Create http server
app.get('/user/:id', function (req, res) {
    res.send('user: ' + req.params.id);
});
app.get('/:number', function (req, res) {
    res.send('number: ' + req.params.number);
});
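To see where req.params comes from, it can help to extract named parameters by hand. The sketch below is a simplified illustration under stated assumptions: compileRoute and matchRoute are hypothetical helpers, and real routers handle many more cases (optional segments, escaping, URL decoding).

```javascript
// Turn '/user/:id' into a RegExp plus the list of parameter names.
function compileRoute(pattern) {
  const keys = [];
  const source = pattern.replace(/:([A-Za-z_]\w*)/g, function (_, key) {
    keys.push(key);
    return "([^/]+)";  // a parameter matches one path segment
  });
  return { regexp: new RegExp("^" + source + "$"), keys: keys };
}

// Return an object of named parameters, or null if the path does not match.
function matchRoute(pattern, path) {
  const route = compileRoute(pattern);
  const match = route.regexp.exec(path);
  if (!match) {
    return null;
  }
  const params = {};
  route.keys.forEach(function (key, i) {
    params[key] = match[i + 1];
  });
  return params;
}

console.log(matchRoute("/user/:id", "/user/42"));  // { id: '42' }
console.log(matchRoute("/:number", "/7"));         // { number: '7' }
console.log(matchRoute("/user/:id", "/about"));    // null
```

Parameter values arrive as strings, so a numeric id needs an explicit conversion such as Number(params.id).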
Connect Middleware
var connect = require("connect");
var http = require("http");
var app = connect();
app.use(function (request, response) {
    response.writeHead(200, {"Content-Type": "text/plain"});
    response.end("Hello world!\n");
});
http.createServer(app).listen(1337);
Connect: request, response, next
var connect = require("connect");
var http = require("http");
var app = connect();
// log
app.use(function (request, response, next) {
    console.log("In comes a " + request.method + " to " + request.url);
    next();
});
// return "hello world"
app.use(function (request, response, next) {
    response.writeHead(200, {"Content-Type": "text/plain"});
    response.end("Hello World!\n");
});
http.createServer(app).listen(1337);
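The role of next() is easier to see in a stripped-down dispatcher. The sketch below is not Connect's implementation; createApp is a hypothetical stand-in that keeps a stack of functions and lets each one decide whether the chain continues, and the request and response objects are plain stubs.

```javascript
// Minimal middleware dispatcher: each layer receives next() and may call it or not.
function createApp() {
  const stack = [];
  function app(request, response) {
    let index = 0;
    function next() {
      const layer = stack[index++];
      if (layer) {
        layer(request, response, next);
      }
    }
    next();
  }
  app.use = function (fn) {
    stack.push(fn);
    return app;
  };
  return app;
}

const log = [];
const app = createApp();

// Logger: records the request, then hands control to the next layer.
app.use(function (request, response, next) {
  log.push("In comes a " + request.method + " to " + request.url);
  next();
});
// Responder: answers and does not call next(), so the chain stops here.
app.use(function (request, response, next) {
  response.body = "Hello World!\n";
});
// Never reached, because the previous layer did not call next().
app.use(function (request, response) {
  log.push("unreachable");
});

const response = {};
app({ method: "GET", url: "/" }, response);
console.log(log);            // [ 'In comes a GET to /' ]
console.log(response.body);  // Hello World!
```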
Connect: logger
var connect = require("connect");
var http = require("http");
var app = connect();
app.use(connect.logger());
app.use(function (request, response) {
    response.writeHead(200, {"Content-Type": "text/plain"});
    response.end("Hello world!\n");
});
http.createServer(app).listen(1337);
Connect Logging
var connect = require("connect");
var http = require("http");
var app = connect();
app.use(connect.logger());
// Homepage
app.use(function (request, response, next) {
    if (request.url == "/") {
        response.writeHead(200, {"Content-Type": "text/plain"});
        response.end("Welcome to the homepage!\n");
        // The middleware stops here.
    } else {
        next();
    }
});
// About page
app.use(function (request, response, next) {
    if (request.url == "/about") {
        response.writeHead(200, {"Content-Type": "text/plain"});
        response.end("Welcome to the about page!\n");
        // The middleware stops here.
    } else {
        next();
    }
});
// 404'd!
app.use(function (request, response) {
    response.writeHead(404, {"Content-Type": "text/plain"});
    response.end("404 error!\n");
});
http.createServer(app).listen(1337);
Big Data: Analysis
I-Jen Chiang
Big data issues
The CRISP-DM reference model
Harper, Gavin; Stephen D. Pickett (August 2006)
The Complete Big Data Value Chain
Collection -> Ingestion -> Discovery & Cleansing -> Integration -> Analysis -> Delivery
Collection: structured, unstructured and semi-structured data from multiple sources
Ingestion: loading vast amounts of data onto a single data store
Discovery & Cleansing: understanding format and content; clean-up and formatting
Integration: linking, entity extraction, entity resolution, indexing and data fusion
Analysis: intelligence, statistics, predictive and text analytics, machine learning
Delivery: querying, visualization, real-time delivery on enterprise-class availability
Phases and Tasks

Business Understanding
  Determine Business Objectives: Background; Business Objectives; Business Success Criteria
  Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria

Data Understanding
  Explore Data: Data Exploration Report

Data Preparation
  Data Set; Data Set Description
  Select Data: Rationale for Inclusion/Exclusion
  Clean Data: Data Cleaning Report
  Construct Data: Derived Attributes; Generated Records
  Integrate Data: Merged Data
  Format Data: Reformatted Data

Modeling
  Select Modeling Technique: Modeling Technique; Modeling Assumptions
  Generate Test Design: Test Design
  Build Model: Parameter Settings; Models; Model Description
  Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
  Review Process: Review of Process
  Determine Next Steps: List of Possible Actions; Decision

Deployment
  Plan Deployment: Deployment Plan
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
  Produce Final Report: Final Report; Final Presentation
  Review Project: Experience Documentation
Examples
[Diagram: examples relating technical aspects to tools and techniques]
Response Modeling
Churn Prediction
Segmentation
Outliers
Concept Description
Classification
Prediction
Dependency Analysis
Tools: Clementine, MineSet
What in Big Data
How do you extract value from big data?
You surely can't glance over every record;
And it may not even have records
What if you wanted to learn from it?
Understand trends
Classify into categories
Detect similarities
Predict the future based on the past (No, not like Nostradamus!)
Machine learning is quickly establishing itself as an emerging discipline.
But there are challenges with ML in big data:
Thousands of features
Billions of records
The largest machine that you can get may not be large enough
Get the picture?
Data Accumulation
Service-Oriented DSS
Ten common big data problems
Modeling true risk
Customer churn analysis
Recommendation engine
Ad targeting
PoS transaction analysis
Analyzing network data to predict failure
Threat analysis
Trade surveillance
Search quality
Data sandbox
Business Applications
Modeling risk and failure prediction
Analyzing customer churn
Web recommendations (a la Amazon)
Web ad targeting
Point-of-sale transaction analysis
Threat analysis
Compliance and search effectiveness
Dynamics of Data Ecosystems
http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf
The big data opportunity
Industries are embracing big data
Business Analytics
Multidimensional Concept Analysis
[Figure: document-by-attribute table (documents D1 to D6, attribute variables marked True) and the derived concept lattice with nodes Top, ((D1, D2), (a, b)), ((D1, D3, D6), (d)), ((D1), (a, b, d)), and Bottom]
Three Approaches
Mining Schemes
A fully distributed and extensible set of Machine Learning techniques for Big Data
State-of-the-art algorithms in each of the Machine Learning domains, including supervised and unsupervised learning:
Correlation
Classifiers
Clustering
Statistics
Document manipulation
N-gram extraction
Histogram computation
Natural Language Processing
Distributed and parallel underlying linear algebra library
Statistical Approach
Random Sample and Statistics
Population: the set or universe of all entities under study.
However, looking at the entire population may not be feasible, or may be too expensive.
Instead, we draw a random sample from the population, and compute appropriate statistics from the sample that give estimates of the corresponding population parameters of interest.
Statistic
Let $S_i$ denote the random variable corresponding to data point $x_i$; then a statistic $\hat{\theta}$ is a function $\hat{\theta} : (S_1, S_2, \ldots, S_n) \to \mathbb{R}$.
If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
Empirical Cumulative Distribution Function
Where
Inverse Cumulative Distribution Function
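As a hedged illustration of these two functions, the sketch below uses the usual textbook definitions: the empirical CDF at x is the fraction of sample points less than or equal to x, and the inverse CDF at q is the smallest sample value whose empirical CDF is at least q. The helper names ecdf and inverseCdf and the sample values are assumptions made for this example.

```javascript
// Empirical CDF: returns a function F where F(x) = (number of points <= x) / n.
function ecdf(sample) {
  const sorted = sample.slice().sort(function (a, b) { return a - b; });
  return function (x) {
    let count = 0;
    while (count < sorted.length && sorted[count] <= x) {
      count++;
    }
    return count / sorted.length;
  };
}

// Inverse CDF (quantile): smallest sample value x with F(x) >= q, for q in [0, 1].
function inverseCdf(sample, q) {
  const sorted = sample.slice().sort(function (a, b) { return a - b; });
  const F = ecdf(sample);
  for (const x of sorted) {
    if (F(x) >= q) {
      return x;
    }
  }
  return sorted[sorted.length - 1];
}

const sample = [2, 4, 4, 4, 5, 5, 7, 9];  // hypothetical data
const F = ecdf(sample);
console.log(F(1), F(4), F(9));            // 0 0.5 1
console.log(inverseCdf(sample, 0.5));     // 4
console.log(inverseCdf(sample, 0.75));    // 5
```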
Example
Measures of Central Tendency (Mean)
Population Mean:
Sample Mean (unbiased, not robust):
Measures of Central Tendency (Median)
Population Median:
or
Sample Median:
Example
Measures of Dispersion (Range)
Range:
Sample Range:
Not robust, sensitive to extreme values
Measures of Dispersion (Inter-Quartile Range)
Inter-Quartile Range (IQR):
Sample IQR:
More robust
Measures of Dispersion (Variance and Standard Deviation)
Variance:
Standard Deviation:
Sample Variance & Standard Deviation:
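The sample versions of these measures are short to compute. The sketch below is illustrative only: the data are hypothetical, and the variance uses the n - 1 denominator (the unbiased estimator), which is an assumption since some texts divide by n instead.

```javascript
function mean(xs) {
  return xs.reduce(function (sum, x) { return sum + x; }, 0) / xs.length;
}

function median(xs) {
  const sorted = xs.slice().sort(function (a, b) { return a - b; });
  const mid = Math.floor(sorted.length / 2);
  // Even-sized samples: average the two middle values.
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function range(xs) {
  return Math.max.apply(null, xs) - Math.min.apply(null, xs);
}

// Assumption: n - 1 denominator (unbiased sample variance).
function sampleVariance(xs) {
  const m = mean(xs);
  const squares = xs.reduce(function (sum, x) { return sum + (x - m) * (x - m); }, 0);
  return squares / (xs.length - 1);
}

function sampleStdDev(xs) {
  return Math.sqrt(sampleVariance(xs));
}

const data = [2, 4, 4, 4, 5, 5, 7, 9];  // hypothetical sample
console.log(mean(data));                 // 5
console.log(median(data));               // 4.5
console.log(range(data));                // 7
console.log(sampleVariance(data));       // 32 / 7, about 4.571
```

Replacing the 9 with 900 moves the mean and range dramatically while the median stays at 4.5, which is the sense in which the median is robust and the mean and range are not.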
Univariate Normal Distribution
Multivariate Normal Distribution
OLAP and Data Mining
Warehouse Architecture
[Diagram: Clients issue Query & Analysis requests to the Warehouse (with Metadata); an Integration layer loads the Warehouse from the Sources]
Star Schemas
A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales. Often "insert-only."
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Terms
Fact table
Dimension tables
Measures
sale (fact table): orderId, date, custId, prodId, storeId, qty, amt
product (dimension table): prodId, name, price
customer (dimension table): custId, name, address, city
store (dimension table): storeId, city
Star

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
custId  prodId  storeId  qty  amt
53      p1      c1       1    12
53      p2      c1       2    11
111     p1      c3       5    50

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
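A query over a star schema typically joins the fact table to its dimension tables by key and then aggregates a measure. The sketch below does this in memory with the rows from the example tables; indexBy, joined, and amtByCity are hypothetical names, and a warehouse would express the same thing as a SQL join.

```javascript
// Dimension tables and fact table from the example.
const product = [
  { prodId: "p1", name: "bolt", price: 10 },
  { prodId: "p2", name: "nut", price: 5 }
];
const store = [
  { storeId: "c1", city: "nyc" },
  { storeId: "c2", city: "sfo" },
  { storeId: "c3", city: "la" }
];
const sale = [
  { custId: 53, prodId: "p1", storeId: "c1", qty: 1, amt: 12 },
  { custId: 53, prodId: "p2", storeId: "c1", qty: 2, amt: 11 },
  { custId: 111, prodId: "p1", storeId: "c3", qty: 5, amt: 50 }
];

// Build a lookup from key value to row, so each fact row joins in constant time.
function indexBy(rows, key) {
  const index = {};
  rows.forEach(function (row) { index[row[key]] = row; });
  return index;
}
const productById = indexBy(product, "prodId");
const storeById = indexBy(store, "storeId");

// Star join: replace the foreign keys in each fact with dimension attributes.
const joined = sale.map(function (s) {
  return {
    product: productById[s.prodId].name,
    city: storeById[s.storeId].city,
    qty: s.qty,
    amt: s.amt
  };
});

// Aggregate the measure `amt` by a dimension attribute (store city).
const amtByCity = {};
joined.forEach(function (row) {
  amtByCity[row.city] = (amtByCity[row.city] || 0) + row.amt;
});
console.log(amtByCity);  // { nyc: 23, la: 50 }
```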
Cube
Fact table view:
sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multidimensional cube:
      c1   c2   c3
p1    12        50
p2    11   8
dimensions = 2
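Moving from the fact table view to the cube view is a pivot: one dimension becomes the rows, another the columns, and the measure fills the cells. A minimal sketch under that reading follows; buildCube is a hypothetical helper and nested objects stand in for the cube.

```javascript
// Fact table from the example: one row per (product, store) sale amount.
const sale = [
  { prodId: "p1", storeId: "c1", amt: 12 },
  { prodId: "p2", storeId: "c1", amt: 11 },
  { prodId: "p1", storeId: "c3", amt: 50 },
  { prodId: "p2", storeId: "c2", amt: 8 }
];

// Pivot facts into cube[row][column] = summed measure.
function buildCube(facts, rowDim, colDim, measure) {
  const cube = {};
  facts.forEach(function (fact) {
    const rowKey = fact[rowDim];
    const colKey = fact[colDim];
    if (!cube[rowKey]) {
      cube[rowKey] = {};
    }
    cube[rowKey][colKey] = (cube[rowKey][colKey] || 0) + fact[measure];
  });
  return cube;
}

const cube = buildCube(sale, "prodId", "storeId", "amt");
console.log(cube.p1.c1);  // 12
console.log(cube.p2.c2);  // 8
console.log(cube.p1.c2);  // undefined: an empty cell, no such sale
```

Empty cells are simply absent here, which mirrors how sparse the cube becomes as dimensions are added.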
3-D Cube
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Multidimensional cube:
day 1
      c1   c2   c3
p1    12        50
p2    11   8
day 2
      c1   c2   c3
p1    44   4
p2
dimensions = 3
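The step from a fact table to a cube can be sketched in a few lines of Python. The rows are the six sale facts from the slide; the function names `build_cube` and `roll_up` are illustrative and do not come from any OLAP product.

```python
from collections import defaultdict

# Fact rows copied from the slide: (prodId, storeId, date, amt)
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def build_cube(rows):
    """Map each (prodId, storeId, date) cell to the summed amt."""
    cube = defaultdict(int)
    for prod, store, date, amt in rows:
        cube[(prod, store, date)] += amt
    return dict(cube)

def roll_up(cube, keep):
    """Aggregate away every dimension whose index is not listed in `keep`."""
    out = defaultdict(int)
    for cell, amt in cube.items():
        out[tuple(cell[i] for i in keep)] += amt
    return dict(out)

cube = build_cube(sale)
by_prod_store = roll_up(cube, keep=(0, 1))   # collapses the date dimension
by_date = roll_up(cube, keep=(2,))           # collapses product and store
```

Only non-empty cells are stored here, which is one common way to keep a sparse cube small; a dense array would be another choice.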
ROLAP vs. MOLAP
ROLAP: Relational On-Line Analytical Processing
MOLAP: Multi-Dimensional On-Line Analytical Processing
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Result: 81
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
ans
date  sum
1     81
2     48
Another Example
Add up amounts by day, product
In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId
Result:
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48
Going from this result to the per-day totals is a roll-up; going from the per-day totals back to this result is a drill-down.
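The three aggregate queries can be checked against the slide's numbers with an in-memory SQLite database. This is a sketch that assumes SQLite as a stand-in for the warehouse; the table and column names follow the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# Total for day 1
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Totals by day (the rolled-up view)
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Totals by day and product (the drilled-down view)
by_day_prod = conn.execute(
    "SELECT date, prodId, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(day1_total, by_day, by_day_prod)
```

The ORDER BY clauses are added only to make the output order predictable; GROUP BY alone does not guarantee one.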
Aggregates
Operators: sum, count, max, min, median, ave
"Having" clause
Using dimension hierarchy
  average by region (within store)
  maximum by month (within date)
What is Data Mining?
Discovery of useful, possibly unexpected, patterns in data
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Collaborative Filter [Predictive]
Classification: Definition
Given a collection of records (training set)
  Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
  A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Decision Trees
Example:
  Conducted a survey to see which customers were interested in a new model car
  Want to select customers for an advertising campaign
sale (training set)
custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no
Clustering
[Figure: records plotted along three axes: income, education, age.]
K-Means Clustering
Association Rule Mining
sales records (market-basket data):
tran1  cust33  p2, p5, p8
tran2  cust45  p5, p8, p11
tran3  cust12  p1, p9
tran4  cust40  p5, p8, p11
tran5  cust12  p2, p9
tran6  cust12  p9
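Two standard measures for ranking candidate rules over this kind of market-basket data are support and confidence. The sketch below computes both for the six transactions on the slide; the function names are illustrative, and the slide itself does not say which measures the course used.

```python
# Market-basket data from the slide: transaction id -> set of products
baskets = {
    "tran1": {"p2", "p5", "p8"},
    "tran2": {"p5", "p8", "p11"},
    "tran3": {"p1", "p9"},
    "tran4": {"p5", "p8", "p11"},
    "tran5": {"p2", "p9"},
    "tran6": {"p9"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"p5", "p8"}))          # how often p5 and p8 occur together
print(confidence({"p5"}, {"p8"}))     # how often p8 appears given p5
```

A rule such as {p5} --> {p8} would typically be reported only if both numbers clear user-chosen thresholds.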
Association Rule Discovery
Marketing and Sales Promotion:
  Let the rule discovered be {Bagels, ...} --> {Potato Chips}
  Potato Chips as consequent => can be used to determine what should be done to boost its sales.
  Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
  Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
Supermarket shelf management.
Inventory management
Collaborative Filtering
Goal: predict what movies/books/... a person may be interested in, on the basis of
  Past preferences of the person
  Other people with similar past preferences
  The preferences of such people for a new movie/book/...
One approach based on repeated clustering
  Cluster people on the basis of preferences for movies
  Then cluster movies on the basis of being liked by the same clusters of people
  Again cluster people based on their preferences for (the newly created clusters of) movies
  Repeat above till equilibrium
Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
Other Types of Mining
Text mining: application of data mining to textual documents
  cluster Web pages to find related pages
  cluster pages a user has visited to organize their visit history
  classify Web pages automatically into a Web directory
Graph mining:
  Deal with graph data
Data Streams
What are data streams?
  Continuous streams
  Huge, fast, and changing
Why data streams?
  The arrival speed of streams and the huge amount of data are beyond our capability to store them.
  "Real-time" processing
Window models
  Landmark window (entire data stream)
  Sliding window
  Damped window
Mining data streams
A Simple Problem
Finding frequent items
  Given a sequence $(x_1, \ldots, x_N)$ where $x_i \in [1, m]$, and a real number $\theta$ between zero and one.
  Looking for the $x_i$ whose frequency exceeds $\theta$
  Naive algorithm ($m$ counters)
The number of frequent items $P \le 1/\theta$, since $P \times (N\theta) \le N$
Problem: $N \gg m \gg 1/\theta$
KRP algorithm
Karp et al. (TODS '03)
$m = 12$, $N = 30$
$\theta = 0.35$, $\lceil 1/\theta \rceil = 3$
$N / \lceil 1/\theta \rceil \le N\theta$
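A counter-based one-pass scheme in the spirit of the algorithm cited above can be sketched as follows. This is an illustration under assumptions, not a transcription of the paper: it keeps at most ceil(1/theta) - 1 counters (the Misra-Gries formulation), and the function name is made up.

```python
import math

def frequent_candidates(stream, theta):
    """One pass over the stream with at most ceil(1/theta) - 1 counters.

    Returns a superset of the items whose count exceeds theta * N.
    A second pass over the data is needed to discard false positives.
    """
    k = math.ceil(1 / theta)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

stream = [1, 2, 1, 3, 1, 2, 1, 4, 2, 2]      # N = 10; items 1 and 2 occur 4 times
print(frequent_candidates(stream, 0.35))      # candidates for frequency > 3.5
```

Memory depends only on theta, not on m or N, which is the point when N >> m >> 1/theta.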
Streaming Sample Problem
Scan the data set once
Sample K records
  Each record has an equal probability of being sampled
  With N records in total, that probability is K/N
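Reservoir sampling is the classic way to meet these requirements when N is not known in advance. The slide does not name the technique, so treat this as one possible answer; the function and parameter names are illustrative.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items drawn uniformly from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(n)          # uniform integer in [0, n)
            if j < k:                     # true with probability k / n
                reservoir[j] = item       # evict a uniformly chosen slot
    return reservoir

picked = reservoir_sample(range(1000), 10, random.Random(0))
print(picked)
```

After the n-th record has been processed, each record seen so far is in the reservoir with probability k/n, so at the end of the scan the probability is K/N as required.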
Introduction to NoSQL
I-Jen Chiang
Big Users
Big Data
Data Storage
Features of the Data Warehouse
"A Data Warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions." (W.H. Inmon)
Data Warehouse Architecture
[Figure: data sources (operational DBs, external sources) are extracted, transformed, loaded and refreshed into reconciled data in the warehouse, with a metadata repository and monitoring & administration; OLAP servers serve data marts to tools for analysis, query/reporting and data mining.]
Business Intelligence Loop
[Figure: CRM, accounting, finance and HR data pass through extraction, transformation & cleansing into data storage (the data warehouse); OLAP, data mining and reports provide decision support to the business strategist.]
Traditional vs. Big Data
Traditional Data Warehouse
  Complete record from transactional system
  All data centralized
Big Data Environment
  Data from many sources inside and outside of organization, including traditional DW
  Data often physically distributed
  Need to iterate solution to test/improve models
  Large-memory analytics also part of iteration
  Every iteration usually requires complete reload of information
NoSQL Database
Wide Column Store / Column Families: Hadoop/HBase, Cassandra, Cloudata, Cloudera, Amazon SimpleDB
Document Store: MongoDB, CouchDB, Citrusleaf
Key-Value / Tuple Store: Azure Table Storage, MEMBASE, GenieDB, Tokyo Cabinet/Tyrant, MemcacheDB
Eventually Consistent Key-Value Store: Amazon Dynamo, Voldemort
Graph Databases: Neo4J, InfiniteGraph, Bigdata
XML Databases: MarkLogic Server, EMC Documentum, eXist
Why NoSQL
Too much data: the database became too large to fit into a single database table on a single machine
Data volume was growing fast
Data wasn't all consistent with a specific, well-defined set of fields
Time was critical
CAP Theorem
Three properties of a system: consistency, availability and partition tolerance
You can have at most two of these three properties for any shared-data system
To scale out, you have to partition. That leaves either consistency or availability to choose from
Many nodes
Nodes contain replicas of partitions of data
Consistency
  all replicas contain the same version of data
Availability
  system remains operational on failing nodes
Partition tolerance
  multiple entry points
  system remains operational on system split
CAP Theorem: satisfying all three at the same time is impossible
ACID - BASE
ACID
  Atomicity
  Consistency
  Isolation
  Durability
BASE
  Basically Available (CP)
  Soft state
  Eventually consistent (AP)
Pritchett, D.: BASE: An Acid Alternative (queue.acm.org/detail.cfm?id=1394128)
NoSQL
Key / Value
Column
Graph
Document
Key-Value Store
Pros:
  very fast
  very scalable
  simple model
  able to distribute horizontally
Cons:
  many data structures (objects) can't be easily modeled as key-value pairs
Column Stores
Row oriented
Id  username  Department
John  john@foo.com  Sales
Mary  mary@foo.com  Marketing
Yoda  yoda@foo.com  IT
Column oriented
Id  Username  Department
John  john@foo.com  Sales
Mary  mary@foo.com  Marketing
Yoda  yoda@foo.com  IT
Graph Database
An Introduction to MongoDB
I-Jen Chiang
DocumentStores
The store is a container for documents
Documents are made up of named fields
Fields may or may not have type definitions
e.g. XSDs for XML stores, vs. schema-less JSON stores
MongoDB
MongoDB is a scalable, high-performance, open source NoSQL database.
Document-oriented storage
Full index support
Replication & high availability
Auto-sharding
Querying
Fast in-place updates
Map/Reduce
GridFS
Schema-Less
Pros:
  Schema-less data model is richer than key/value pairs
  eventual consistency
  many are distributed
  still provide excellent performance and scalability
Cons:
  typically no ACID transactions or joins
Common Advantages
Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
  Down nodes easily replaced
  No single point of failure
Easy to distribute
Don't require a schema
Can scale up and down
Relax the data consistency requirement (CAP)
What is NoSQL giving up?
joins
group by
order by
ACID transactions
SQL as a sometimes frustrating but still powerful query language
easy integration with other applications that support SQL
Cassandra
Originally developed at Facebook
Follows the BigTable data model: column-oriented
Uses the Dynamo eventual consistency model
Written in Java
Open-sourced and exists within the Apache family
Uses Apache Thrift as its API
Tunable consistency.
Decentralized.
Writes are faster than reads.
No single point of failure.
Incremental scalability.
Uses consistent hashing (logical partitioning) when clustered.
Hinted handoffs.
Peer-to-peer routing (ring).
Thrift API.
Multi-data-center support.
CouchDB
Availability and Partition Tolerance.
Views are used to query. Map/Reduce.
MVCC: multiple concurrent versions. No locks.
  A little overhead with this approach due to garbage collection.
  Conflict resolution.
Very simple, REST-based. Schema-free.
Shared-nothing, seamless peer-based bi-directional replication.
Auto-compaction. Manual with MongoDB.
Uses B-Trees.
Documents and indexes are kept in memory and flushed to disk periodically.
Documents have states; in case of a failure, recovery can continue from the state the documents were left in.
No built-in auto-sharding; there are open source projects.
You can't define your indexes.
MongoDB
Data types: bool, int, double, string, object (BSON), oid, array, null, date.
Databases and collections are created automatically.
Lots of language drivers.
Capped collections are fixed-size collections, buffers, very fast, FIFO, good for logs. No indexes.
Object ids are generated by the client: 12 bytes of packed data (4-byte time, 3-byte machine, 2-byte pid, 3-byte counter).
Possible to refer to other documents in different collections, but more efficient to embed documents.
Replication is very easy to set up. You can read from slaves.
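The 12-byte object id layout described above can be unpacked by hand. The sketch below follows the 4/3/2/3 split stated on the slide, which is the historical layout and an assumption for any given server version (newer drivers fill the middle bytes differently). The id value is the one that appears in the blog-post examples of this deck; `split_object_id` is a made-up helper, not a driver function.

```python
from datetime import datetime, timezone

def split_object_id(hex_id):
    """Split a 24-hex-digit ObjectId using the 4/3/2/3 byte layout."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex digits)")
    seconds = int.from_bytes(raw[0:4], "big")          # creation time
    return {
        "time": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),                     # 3-byte machine id
        "pid": int.from_bytes(raw[7:9], "big"),        # 2-byte process id
        "counter": int.from_bytes(raw[9:12], "big"),   # 3-byte counter
    }

parts = split_object_id("4c4ba5c0672c685e5e8aabf3")
print(parts)
```

Because the leading bytes are a timestamp, ids generated later sort after ids generated earlier, which is why sorting on _id roughly follows insertion order.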
Document store
RDBMS         MongoDB
Database      Database
Table, View   Collection
Row           Document (JSON, BSON)
Column        Field
Index         Index
Join          Embedded Document
Foreign Key   Reference
Partition     Shard
> db.user.findOne({age: 39})
{
  "_id": ObjectId("5114e0bd42"),
  "first": "John",
  "last": "Doe",
  "age": 39,
  "interests": [
    "Reading",
    "Mountain Biking"
  ],
  "favorites": {
    "color": "Blue",
    "sport": "Soccer"
  }
}
MongoDB
Connection pooling is done for you.
Supports aggregation.
  Map/Reduce with JavaScript.
You have indexes, B-Trees. Ids are always indexed.
Updates are atomic. Low-contention locks.
Querying Mongo is done with a document:
  Lazy, returns a cursor.
  Reducible to SQL: select, insert, update, limit, sort, etc.
    There is more: upsert (either inserts or updates)
  Several operators:
    $ne, $and, $or, $lt, $gt, $inc and so on.
Repository pattern makes development very easy.
Terminology
RDBMS         MongoDB
Database      Database
Table, View   Collection
Row           Document (JSON, BSON)
Column        Field
Index         Index
Join          Embedded Document
Foreign Key   Reference
Partition     Shard
Features
Document-oriented storage
Full index support
Replication & high availability
Auto-sharding
Querying
Fast in-place updates
Map/Reduce
Agile
Scalable
MongoDB Sharding
[Figure: shards built from replica sets, three config servers (C1, C2, C3, each a mongod) and mongos routers.]
Config servers: keep the mapping
mongos: routing servers
mongod: master-slave replicas
MongoDB Data Analysis
CRUD
Create
  db.collection.insert(<document>)
  db.collection.save(<document>)
  db.collection.update(<query>, <update>, {upsert: true})
Read
  db.collection.find(<query>, <projection>)
  db.collection.findOne(<query>, <projection>)
Update
  db.collection.update(<query>, <update>, <options>)
Delete
  db.collection.remove(<query>, <justOne>)
SQL to MongoDB
SQL:   SELECT a, b FROM users
Mongo: db.users.find({}, {a: 1, b: 1})
SQL:   SELECT * FROM users WHERE name LIKE "%Joe%"
Mongo: db.users.find({name: /Joe/})
SQL:   SELECT * FROM users WHERE a = 1 AND b = 'q'
Mongo: db.users.find({a: 1, b: 'q'})
SQL:   SELECT COUNT(*) FROM users
Mongo: db.users.count()
SQL:   CREATE INDEX myindexname ON users(name)
Mongo: db.users.ensureIndex({name: 1})
SQL:   UPDATE users SET a = 1 WHERE b = 'q'
Mongo: db.users.update({b: 'q'}, {$set: {a: 1}}, false, true)
SQL:   DELETE FROM users WHERE z = 'abc'
Mongo: db.users.remove({z: 'abc'})
BSON
JSON has a powerful but limited set of data types
  arrays, objects, strings, numbers and null
BSON is a binary representation of JSON
  Adds extra data types such as Date, Int types, Id, ...
  Optimized for performance and navigational abilities
  And compression
MongoDB sends and stores data in BSON
  bsonspec.org
[Figures: Mongo Document; Collection; Query; Modification; Select; Query Stage]
CRUD example
Create, Read, Update, Delete
> db.user.insert({
    first: "John",
    last: "Doe",
    icd: [250, 151],
    age: 39
  })
> db.user.find()
{
  "_id": ObjectId("51"),
  "first": "John",
  "last": "Doe",
  "age": 39
}
> db.user.update(
    {"_id": ObjectId("51")},
    {
      $set: {
        age: 40,
        salary: 7000
      }
    }
  )
> db.user.remove({
    "first": /^J/
  })
Import Excel into MongoDB
Export MongoDB to Excel
mongoexport --host localhost --port 27017 \
  --username abc --password 12345 \
  --db my_db --collection collName \
  --csv --fields id,sex,brithday,icd \
  --out all_patients.csv \
  --query "{\"_id\": {\"\$oid\": \"5058ca07b7628c0999000006\"}}"
Blog Post Document
> p = {author: "Chris",
       date: new ISODate(),
       text: "About MongoDB...",
       tags: ["tech", "databases"]}
> db.posts.save(p)
Querying
> db.posts.find()
{_id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
 author: "Chris",
 date: ISODate("2012-02-02T11:52:27.442Z"),
 text: "About MongoDB...",
 tags: ["tech", "databases"]}
Notes:
  _id is unique, but can be anything you'd like
Insertion
db.unicorns.insert({name: 'Horny', dob: new Date(1992, 2, 13, 7, 47), loves: ['carrot', 'papaya'], weight: 600, gender: 'm', vampires: 63});
db.unicorns.insert({name: 'Aurora', dob: new Date(1991, 0, 24, 13, 0), loves: ['carrot', 'grape'], weight: 450, gender: 'f', vampires: 43});
db.unicorns.insert({name: 'Unicrom', dob: new Date(1973, 1, 9, 22, 10), loves: ['energon', 'redbull'], weight: 984, gender: 'm', vampires: 182});
db.unicorns.insert({name: 'Roooooodles', dob: new Date(1979, 7, 18, 18, 44), loves: ['apple'], weight: 575, gender: 'm', vampires: 99});
db.unicorns.insert({name: 'Solnara', dob: new Date(1985, 6, 4, 2, 1), loves: ['apple', 'carrot', 'chocolate'], weight: 550, gender: 'f', vampires: 80});
db.unicorns.insert({name: 'Ayna', dob: new Date(1998, 2, 7, 8, 30), loves: ['strawberry', 'lemon'], weight: 733, gender: 'f', vampires: 40});
db.unicorns.insert({name: 'Kenny', dob: new Date(1997, 6, 1, 10, 42), loves: ['grape', 'lemon'], weight: 690, gender: 'm', vampires: 39});
db.unicorns.insert({name: 'Raleigh', dob: new Date(2005, 4, 3, 0, 57), loves: ['apple', 'sugar'], weight: 421, gender: 'm', vampires: 2});
db.unicorns.insert({name: 'Leia', dob: new Date(2001, 9, 8, 14, 53), loves: ['apple', 'watermelon'], weight: 601, gender: 'f', vampires: 33});
db.unicorns.insert({name: 'Pilot', dob: new Date(1997, 2, 1, 5, 3), loves: ['apple', 'watermelon'], weight: 650, gender: 'm', vampires: 54});
db.unicorns.insert({name: 'Nimue', dob: new Date(1999, 11, 20, 16, 15), loves: ['grape', 'carrot'], weight: 540, gender: 'f'});
db.unicorns.insert({name: 'Dunx', dob: new Date(1976, 6, 18, 18, 18), loves: ['grape', 'watermelon'], weight: 704, gender: 'm', vampires: 165});
Master Selector: {field: value}
db.unicorns.find({gender: 'm', weight: {$gt: 700}})
or (not quite the same thing, but for demonstration purposes)
db.unicorns.find({gender: {$ne: 'f'}, weight: {$gte: 701}})
$lt, $lte, $gt, $gte and $ne are used for less than, less than or equal, greater than, greater than or equal and not equal operations
$exists and $or
db.unicorns.find({vampires: {$exists: false}})
db.unicorns.find({gender: 'f', $or: [{loves: 'apple'}, {loves: 'orange'}, {weight: {$lt: 500}}]})
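To make the selector semantics concrete, here is a small in-memory matcher written in Python. It is a teaching sketch covering only the operators shown on these slides ($lt, $lte, $gt, $gte, $ne, $exists, $or) and is not how the server evaluates queries; `matches` and the sample documents (a subset of the unicorn fields) are illustrative.

```python
import operator

COMPARATORS = {
    "$lt": operator.lt, "$lte": operator.le,
    "$gt": operator.gt, "$gte": operator.ge,
}

def matches(doc, query):
    """Return True if `doc` satisfies a small subset of the query language."""
    for field, cond in query.items():
        if field == "$or":
            if not any(matches(doc, sub) for sub in cond):
                return False
        elif isinstance(cond, dict):
            for op, arg in cond.items():
                if op == "$exists":
                    if (field in doc) != arg:
                        return False
                elif op == "$ne":
                    if doc.get(field) == arg:
                        return False
                elif field not in doc or not COMPARATORS[op](doc[field], arg):
                    return False
        else:
            value = doc.get(field)
            # An array field matches when any element equals the value.
            if value != cond and not (isinstance(value, list) and cond in value):
                return False
    return True

dunx = {"name": "Dunx", "gender": "m", "weight": 704, "loves": ["grape", "watermelon"], "vampires": 165}
kenny = {"name": "Kenny", "gender": "m", "weight": 690, "loves": ["grape", "lemon"], "vampires": 39}
nimue = {"name": "Nimue", "gender": "f", "weight": 540, "loves": ["grape", "carrot"]}
aurora = {"name": "Aurora", "gender": "f", "weight": 450, "loves": ["carrot", "grape"], "vampires": 43}

heavy_males = [d["name"] for d in (dunx, kenny, nimue, aurora)
               if matches(d, {"gender": "m", "weight": {"$gt": 700}})]
print(heavy_males)
```

All top-level fields in a selector are combined with AND; $or is the explicit way to ask for alternatives.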
Indexing
// 1 means ascending, -1 means descending
> db.unicorns.ensureIndex({name: 1})
> db.unicorns.findOne({name: 'Kenny'})
{name: 'Kenny', dob: new Date(1997, 6, 1, 10, 42), loves: ['grape', 'lemon'], weight: 690, gender: 'm', vampires: 39}
Indexing on Multiple Fields
// 1 means ascending, -1 means descending
db.posts.ensureIndex({author: 1, ts: 1})
Query:
db.posts.find({author: 'Chris'}).sort({ts: 1})
Return:
[
  {_id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
   author: "Chris", ...},
  {_id: ObjectId("4f61d325c496820ceba84124"),
   author: "Chris", ...}
]
GIS
location1 = {
  name: "10gen East Coast",
  address: "17 West 18th Street 8th Floor",
  city: "New York",
  zip: "10011",
  tags: ["business", "mongodb"],
  latlong: [40.0, 72.0]
}
db.locations.ensureIndex({latlong: "2d"})
db.locations.find({latlong: {$near: [40, 70]}})
GIS Place
location1 = {
  name: "10gen HQ",
  address: "17 West 18th Street 8th Floor",
  ...
}
Querying your Places
Creating your indexes
  db.locations.ensureIndex({tags: 1})
  db.locations.ensureIndex({name: 1})
  db.locations.ensureIndex({latlong: "2d"})
Finding places:
  db.locations.find({latlong: {$near: [40, 70]}})
With regular expressions:
  db.locations.find({name: /^typeaheadstring/})
By tag:
  db.locations.find({tags: "business"})
Inserting and updating locations
Initial data load:
  db.locations.insert(place)
Using update to add tips:
  db.locations.update({name: "10gen HQ"},
    {$push: {tips:
      {user: "nosh", time: "6/26/2010",
       tip: "stop by for office hours on Wednesdays from 4-6"}}})
Requirements
Locations
  Need to store locations (offices, restaurants, etc.)
    Want to be able to store name, address and tags
    Maybe user-generated content, i.e. tips / small notes?
  Want to be able to find other locations nearby
Check-ins
  User should be able to "check in" to a location
  Want to be able to generate statistics
Users
user1 = {
  name: "nosh",
  email: "nosh@10gen.com",
  ...
  checkins: [{location: "10gen HQ",
              ts: "9/20/2010 10:12:00"}]
}
Simple Stats
db.users.find({"checkins.location": "10gen HQ"})
db.checkins.find({"checkins.location": "10gen HQ"})
  .sort({ts: -1}).limit(10)
db.checkins.find({"checkins.location": "10gen HQ",
  ts: {$gt: midnight}}).count()
Alternative
user1 = {
  name: "nosh",
  email: "nosh@10gen.com",
  ...
  checkins: ["4b97e62bf1d8c7152c9ccb74", "5a20e62bf1d8c736ab"]
}
User Check-in
Check-in = 2 ops
  read location to obtain location id
  update ($push) location id to user object
Queries: find all locations where a user checked in:
  checkin_array = db.users.findOne({..}, {checkins: true}).checkins
  db.location.find({_id: {$in: checkin_array}})
Query Operators
Conditional operators
  $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type
  $lt, $lte, $gt, $gte
// find posts with any tags
> db.posts.find({tags: {$exists: true}})
// find posts matching a regular expression
> db.posts.find({author: /^ro*/i})
// count posts by author
> db.posts.find({author: 'Chris'}).count()
Examine the query plan
Query: db.posts.find({"author": 'Chris'}).explain()
Result:
{
  "cursor": "BtreeCursor author_1",
  "nscanned": 1,
  "nscannedObjects": 1,
  "n": 1,
  "millis": 0,
  "indexBounds": {
    "author": [
      [
        "Chris",
        "Chris"
      ]
    ]
  }
}
Atomic Operators
$set, $unset, $inc, $push, $pushAll, $pull, $pullAll, $bit
// Create a comment
new_comment = {author: "Fred", date: new Date(), text: "Best Post Ever!"}
// Add to post
db.posts.update({_id: "..."}, {"$push": {comments: new_comment}, "$inc": {comments_count: 1}});
Nested Documents
{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "Chris",
  date: "Thu Feb 02 2012 11:50:01",
  text: "About MongoDB...",
  tags: ["tech", "databases"],
  comments: [{
    author: "Fred",
    date: "Fri Feb 03 2012 13:23:11",
    text: "Best Post Ever!"
  }],
  comment_count: 1
}
Secondary Indexing
// Index nested documents
> db.posts.ensureIndex({"comments.author": 1})
> db.posts.find({"comments.author": "Fred"})
// Index on tags (multi-key index)
> db.posts.ensureIndex({tags: 1})
> db.posts.find({tags: "tech"})
GEO
Geo-spatial queries
  Require a geo index
  Find points near a given point
  Find points within a polygon/sphere
// geospatial index
> db.posts.ensureIndex({"author.location": "2d"})
> db.posts.find({"author.location": {$near: [22, 42]}})
Memory Mapped Files
"A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource." [1]
mmap()
1: http://en.wikipedia.org/wiki/Memory-mapped_file
Replica Sets
Redundancy and failover
Zero downtime for upgrades and maintenance
Master-slave replication
  Strong consistency
  Delayed consistency
Geospatial features
[Figure: a client connected to replica set "replica1" on Host1:10000, Host2:10001 and Host3:10002.]
Sharding
Partition your data
Scale write throughput
Increase capacity
Auto-balancing
[Figure: client, shard1 (Host1:10000), shard2 (Host2:10010), configdb (Host3:20000), Host4:30000.]
Sharding: horizontal scaling
Unsharded Deployment
Configure as a replica set for automated failover
Async replication between nodes
Add more secondaries to scale reads
[Figure: one primary and two secondaries.]
High Throughput
Sharding reduces the number of operations each shard handles. Each shard processes fewer operations as the cluster grows. As a result, a cluster can increase capacity and throughput horizontally. For example, to insert data, the application only needs to access the shard responsible for that record.
Sharding reduces the amount of data that each server needs to store. Each shard stores less data as the cluster grows. For example, if a database has a 1 terabyte data set, and there are 4 shards, then each shard might hold only 256 GB of data. If there are 40 shards, then each shard might hold only 25 GB of data.
Sharded Deployment
Auto-sharding distributes data among two or more replica sets
Mongo config server(s) handle distribution & balancing
Transparent to applications
[Figure: mongos and config servers in front of replica sets, each with a primary and a secondary.]
Sharded Cluster
Main components
Shard
  A shard is a node of the cluster
  Each shard can be a single mongod or a replica set
Config server (metadata storage)
  Stores cluster chunk ranges and locations
  Can be only 1 or 3 (production must have 3)
  Not a replica set
mongos
  Acts as a router / balancer
  No local data (persists to config database)
  Can be 1 or many
Relational Database Clustering
[Figure: a secondary hash index over a data file. Data file records: (cat, Natasha), (parakeet, Tweety), (dog, Buck), (dog, Lassie), (goat, Sertrude), (parrot, Elmer), (cat, Mittens). Hash function h(k): c = 22, d = 3, g = 6, p = 15.]
Deploy a sharded cluster
Start the config server database instances
  mkdir /data/configdb
  mongod --configsvr --dbpath <path> --port <port>
  mongod --configsvr --dbpath /data/configdb --port 27019
Start the mongos instances (27017)
  mongos --configdb <config server hostnames>
  mongos --configdb cfg0.example.net:27019,cfg1.example.net:27019,cfg2.example.net:27019
Deploy a sharded cluster (cont'd)
Add shards to the cluster
  mongo --host <hostname of machine running mongos> --port <port mongos listens on>
  mongo --host mongos0.example.net --port 27017
  sh.addShard()
  sh.addShard("rs1/mongodb0.example.net:27017")
Enable sharding for a database
  mongo --host <hostname of machine running mongos> --port <port mongos listens on>
  sh.enableSharding("<database>")
  equivalent to: db.runCommand({enableSharding: <database>})
Before you can shard a collection, you must enable sharding for the collection's database.
Chunk Partitioning
A chunk is a section of the entire range
Chunk splitting
  A chunk is split once it exceeds the maximum size
  There is no split point if all documents have the same shard key
  A chunk split is a logical operation (no data is moved)
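The idea that a split only changes metadata can be illustrated with a toy chunk table. This is a sketch under simplifying assumptions (numeric shard keys, a single in-memory table); the class `ChunkMap` and its methods are invented for illustration and are not MongoDB internals.

```python
import bisect

class ChunkMap:
    """Toy range-based chunk table: chunk i covers [bounds[i], bounds[i+1])."""

    def __init__(self, first_shard):
        self.bounds = [float("-inf")]    # lower bound of each chunk
        self.owner = [first_shard]       # shard that holds each chunk

    def chunk_for(self, key):
        """Index of the chunk whose range contains `key`."""
        return bisect.bisect_right(self.bounds, key) - 1

    def shard_for(self, key):
        return self.owner[self.chunk_for(key)]

    def split(self, key_at):
        """Split the chunk containing key_at; metadata only, no data moves."""
        i = self.chunk_for(key_at)
        if self.bounds[i] == key_at:
            return False                 # already a boundary, nothing to split
        self.bounds.insert(i + 1, key_at)
        self.owner.insert(i + 1, self.owner[i])
        return True

    def move(self, chunk_index, shard):
        """What a balancer would do: hand a chunk to another shard."""
        self.owner[chunk_index] = shard

chunks = ChunkMap("shard1")
chunks.split(50)                         # now (-inf, 50) and [50, +inf)
owner_after_split = chunks.shard_for(70)
chunks.move(1, "shard2")                 # migration is a separate step
```

Right after the split both chunks still live on the same shard; data only moves when a chunk is later migrated.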
Range-based sharding
Hash-based sharding
Linear Hashing: Example
h0(x) = x mod N, h1(x) = x mod (2*N)
Insert 15 and 3
[Figure: bucket ids and bucket contents before and after the inserts; keys shown include 4, 5, 7, 11, 13, 15 and 17.]
Linear Hashing: Search
h0(x) = x mod N (for the un-split buckets)
h1(x) = x mod (2*N) (for the split ones)
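The search rule can be written as a short address function. This sketch assumes the usual linear-hashing bookkeeping, which the slides do not spell out: a current `level` and a `next_split` pointer marking how many buckets have already been split in this round. N = 4 in the usage lines is an arbitrary choice, not a value taken from the figure.

```python
def bucket_for(key, n, level, next_split):
    """Linear-hashing address of `key`.

    n          -- initial number of buckets (N on the slide)
    level      -- number of completed doubling rounds (0 uses h0 and h1)
    next_split -- buckets with an id below this were already split this round
    """
    b = key % (n * 2 ** level)               # h_level, for un-split buckets
    if b < next_split:                       # bucket b has been split
        b = key % (n * 2 ** (level + 1))     # h_{level+1}, for split buckets
    return b

# With N = 4 and only bucket 0 split so far:
print(bucket_for(4, n=4, level=0, next_split=1))    # bucket 0 was split, use h1
print(bucket_for(13, n=4, level=0, next_split=1))   # bucket 1 not split, use h0
```

Because only one bucket is split at a time, the table grows gradually instead of doubling all at once.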
Enable Sharding for a Collection
Determine what you will use for the shard key. Your selection of the shard key affects the efficiency of sharding.
If the collection already contains data you must create an index on the shard key using ensureIndex(). If the collection is empty then MongoDB will create the index as part of the sh.shardCollection() step.
Enable sharding for a collection by issuing the sh.shardCollection() method in the mongo shell. The method uses:
  sh.shardCollection("<database>.<collection>", shard-key-pattern)
  sh.shardCollection("records.people", {"zipcode": 1, "name": 1})
  sh.shardCollection("people.addresses", {"state": 1, "_id": 1})
  sh.shardCollection("assets.chairs", {"type": 1, "_id": 1})
  sh.shardCollection("events.alerts", {"_id": "hashed"})
Mixed
[Figure: a client reaching shard1 ... shardn, with shards built from replica sets (replica1 on Host1:10000, Host2:10001, Host3:10002; Host4:10010), configdb on Host5:20000, and Host6:30000 and Host7:30000.]
Map/Reduce
db.collection.mapReduce(
  <map function>,
  <reduce function>,
  {
    out: <collection>,
    query: <>,
    sort: <>,
    limit: <number>,
    finalize: <function>,
    scope: <>,
    jsMode: <boolean>,
    verbose: <boolean>
  }
)
var mapFunction1 = function() { emit(this.cust_id, this.price); };
var reduceFunction1 = function(keyCustId, valuesPrices) { return Array.sum(valuesPrices); };
MapReduce
The caller provides map and reduce functions written in JavaScript
// Emit each tag
> map = "this['tags'].forEach(
    function(item) { emit(item, 1); }
  );"
// Calculate totals
> reduce = "function(key, values) {
    var total = 0;
    var valuesSize = values.length;
    for (var i = 0; i < valuesSize; i++) {
      total += parseInt(values[i], 10);
    }
    return total;
  };"
MapReduce
// run the map reduce
> db.posts.mapReduce(map, reduce, {"out": {inline: 1}});
Answer
{
  "results": [
    {"_id": "databases", "value": 1},
    {"_id": "tech", "value": 1}
  ],
  "timeMillis": 1,
  "counts": {
    "input": 1,
    "emit": 2,
    "reduce": 0,
    "output": 2
  },
  "ok": 1
}
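The flow of the tag-count job above can be imitated in plain Python to see where the numbers in the answer come from. This is a single-process sketch, not the database's implementation; `map_reduce`, `emit_tags` and `total` are illustrative names, and the input is the one blog post from the earlier slides.

```python
from collections import defaultdict

def map_reduce(docs, map_fn, reduce_fn):
    """Run map over every document, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Like MongoDB, skip reduce for keys that received only one value.
    return {
        key: values[0] if len(values) == 1 else reduce_fn(key, values)
        for key, values in groups.items()
    }

def emit_tags(doc):
    """Map step: emit (tag, 1) for each tag of a post."""
    for tag in doc.get("tags", []):
        yield tag, 1

def total(key, values):
    """Reduce step: add up the counts emitted for one tag."""
    return sum(values)

posts = [{"author": "Chris", "tags": ["tech", "databases"]}]
result = map_reduce(posts, emit_tags, total)
print(result)
```

With a single post every tag is emitted once, so reduce is never called, which matches the "reduce": 0 counter in the answer above.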
Case: CRM
I-Jen Chiang
Relationship Marketing
Relationship Marketing is a process
  communicating with your customers
  listening to their responses
Companies take actions
  marketing campaigns
  new products
  new channels
  new packaging
Relationship Marketing (continued)
Customers and prospects respond
  most common response is no response
This results in a cycle
  data is generated
  opportunities to learn from the data and improve the process emerge
An Illustration
A few years ago, UPS went on strike
FedEx saw its volume increase
After the strike, its volume fell
FedEx identified those customers whose FedEx volumes had increased and then decreased
These customers were using UPS again
FedEx made special offers to these customers to get all of their business
The Corporate Memory
Several years ago, Lands' End could not recognize regular Christmas shoppers
  some people generally don't shop from catalogs but spend hundreds of dollars every Christmas
  if you only store 6 months of history, you will miss them
Victoria's Secret builds customer loyalty with a no-hassle returns policy
  some "loyal customers" return several expensive outfits each month
  they are really "loyal renters"
CRM Requires Learning and More
Form a learning relationship with your customers
  Notice their needs
    On-line Transaction Processing Systems
  Remember their preferences
    Decision Support Data Warehouse
  Learn how to serve them better
    Data Mining
  Act to make customers more profitable
Customer Relationship Management (CRM)
Goal: Establish a profitable, long-term, one-to-one relationship with customers; understanding their needs, preferences, expectations
Traditional Marketing                  CRM
Product-oriented view                  Customer-oriented view
Mass marketing / mass production       Mass customization, one-to-one marketing
Standardization of customer needs      Customer-supplier relationship
Transactional relationship             Relational approach
What is CRM?
The approach of identifying, establishing, maintaining, and enhancing lasting relationships with customers.
Strategies in CRM for Mass Customization
Prospecting (of first-time consumers)
Loyalty
Cross-selling / Up-selling
Win back or Save
Business Processes Organize Around the Customer Lifecycle
[Figure: lifecycle stages Acquisition, Activation, Relationship Management and Winback; a prospect becomes a new customer, then an established customer (high value, high potential or low value), and finally a former customer through voluntary or forced churn.]
Strategic Customer Relationship
[Figure labels: Entire universe; Retain these loyal customers!; Satisfy these customers before they defect; Get these customers back; Market the missing products; Behavioral clustering/segmentation; Demographic clustering/segmentation; Product associations; Customer basket analysis]
Marketing Strategy
[Figure labels: High-valued customer: keep (clustering purchasing behaviors, attrition recognition); Mid-valued customer: cross-sell, up-sell (get best benefit with low costs); Low-valued customer: win-back before attrition. Find customers: demography clustering, differential segmentation. Find products: product classifications, product design and refinement.]
The Profit/Loss Matrix
Someone who scores in the top 30% is predicted to respond
Those predicted to respond cost $1
  those who actually respond yield a gain of $45
  those who don't respond yield no gain
Those not predicted to respond cost $0 and yield no gain
                  ACTUAL
Predicted      YES      NO
YES            $44      -$1
NO             $0       $0
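The matrix translates directly into a profit calculation for a campaign. The $1 cost and $45 gain come from the slide; the function names and the customer counts in the usage line are made up for illustration.

```python
COST_PER_CONTACT = 1      # dollars spent on everyone predicted to respond
GAIN_PER_RESPONSE = 45    # dollars gained from each actual responder

def campaign_profit(true_pos, false_pos):
    """Net profit when only customers predicted to respond are contacted.

    true_pos  -- predicted to respond and did respond   (+$44 each)
    false_pos -- predicted to respond but did not       (-$1 each)
    """
    contacted = true_pos + false_pos
    return GAIN_PER_RESPONSE * true_pos - COST_PER_CONTACT * contacted

def break_even_rate():
    """Smallest response rate among contacted customers that avoids a loss."""
    return COST_PER_CONTACT / GAIN_PER_RESPONSE

# Hypothetical campaign: 300 customers contacted, 10 of them respond.
print(campaign_profit(true_pos=10, false_pos=290))
```

Customers who are not contacted contribute $0 either way, so only the "predicted YES" row of the matrix affects the total.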
Marketing Plan
Correctly identify the customer requirements
Define promotion for customers
Adjust the flight schedule
Recommendation EC
Multiple and effective merchandising platform
Cross-products (different types) recommendation
[Figure labels: Online retailers; Anonymizing filter (optional); Marketplace; Aggregated Recommendations; Recommendation-Based Purchases; My Library (all media types); My Preferences]
Information/data integration
A new faculty member wants to find houses with 2 bedrooms priced under 200K
Sources on the Web which provide house listings: realestate.com, homeseekers.com, homes.com
Architecture of Data Integration System
Simply pose the query in the mediated schema: "Find houses with 2 bedrooms priced under 200K"
mediated schema
source schema 1: realestate.com
source schema 2: homeseekers.com
source schema 3: homes.com
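A minimal sketch of the mediator idea follows: each source record is translated into the mediated schema, and the query is evaluated once against the translated records. The per-source field names in `SOURCE_MAPPINGS` are hypothetical; the slides do not show the actual source schemas.

```python
# Hypothetical mappings from each source's field names to the mediated schema.
SOURCE_MAPPINGS = {
    "realestate.com": {"beds": "bedrooms", "cost": "price"},
    "homeseekers.com": {"num_bedrooms": "bedrooms", "list_price": "price"},
    "homes.com": {"br": "bedrooms", "amount": "price"},
}

def to_mediated(source, record):
    """Rename a source record's fields to the mediated schema's names."""
    mapping = SOURCE_MAPPINGS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def find_houses(listings_by_source, bedrooms, max_price):
    """Answer the query over all sources through the mediated schema."""
    results = []
    for source, records in listings_by_source.items():
        for record in records:
            house = to_mediated(source, record)
            if house["bedrooms"] == bedrooms and house["price"] < max_price:
                results.append((source, house))
    return results

# Made-up listings, one list per source.
listings = {
    "realestate.com": [{"beds": 2, "cost": 180000}, {"beds": 3, "cost": 150000}],
    "homeseekers.com": [{"num_bedrooms": 2, "list_price": 250000}],
    "homes.com": [{"br": 2, "amount": 199000}],
}
matches = find_houses(listings, bedrooms=2, max_price=200000)
```

The user never sees the source field names; only the mappings know them, which is the point of posing the query in the mediated schema.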
Semantic Matches between Schemas
The schema matching problem is to find semantic mappings between the elements of the two schemas.
Mediated schema: price, agent-name, address
1-1 match
Source schema: homes.com
Complex match
Seattle, WA
Miami, FL
Big Data Platforms and Paradigms
Sourangshu Bhattacharya (CSE)
Outline
Big Data
What is Big Data?
Challenges with Big Data Processing.
Hadoop HDFS
MapReduce
PIG
Analytics
Basic Statistics
Text Analytics
SQL Queries
What is Big Data?
6 billion web queries per day.
~6 TB per day, ~2.5 PB per year
10 billion display ads per day.
~15 TB per day, ~5.5 PB per year
30 billion text ads per day.
~30 TB per day, ~11 PB per year
150 million credit card transactions per day.
~150 GB per day, ~55 TB per year
100 billion emails per day.
~1 PB per day, ~360 PB per year
What is Big Data?
CERN Large Hadron Collider
~10 PB/year at start
~1000 PB in ~10 years
2500 physicists collaborating
Large Synoptic Survey Telescope (NSF, DOE, and private donors)
~5-10 PB/year at start in 2012
~100 PB by 2025
Pan-STARRS (Haleakala, Hawaii), US Air Force
now: 800 TB/year
soon: 4 PB/year
Courtesy: King et al., IEEE Big Data 2013.
Big Data Challenges
3 Vs
Scalability
Cost effective
Ease of use
Flexibility
Fault tolerance
3 Vs: Volume, Variety, Velocity.
Fault tolerance:

Computers                       1      10     100
Chance of failure in an hour    0.01   0.09   0.63

Data locality.
Computation goes to data
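The fault-tolerance figures are consistent with machines failing independently, each with a 1% chance per hour; that independence assumption is inferred here rather than stated on the slide. Under it, the chance that at least one of n machines fails is 1 - (1 - p)^n, sketched below.

```python
def prob_any_failure(p_single, n_computers):
    """Probability that at least one of n independent machines fails."""
    return 1.0 - (1.0 - p_single) ** n_computers

# Per-machine hourly failure probability of 0.01, as in the one-computer column.
table = {n: prob_any_failure(0.01, n) for n in (1, 10, 100)}
for n, p in table.items():
    print(n, p)
```

The failure probability climbs quickly with cluster size, which is why Hadoop treats hardware failure as the normal case rather than the exception.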
What is Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing.
Core Hadoop:
Hadoop Distributed File System (HDFS)
Hadoop YARN: Job Scheduling and Cluster Resource Management
Hadoop MapReduce: Framework for distributed data processing.
Open Source system with large community support.
https://hadoop.apache.org/
Hadoop Architecture
YARN
Courtesy: http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html
HDFS
Assumptions
Hardware failure is the norm.
Streaming data access.
Write once, read many times.
High throughput, not low latency.
Large data sets.
Characteristics:
Performs best with a modest number of large files
Optimized for streaming reads
Layer on top of the native file system.
HDFS
Data is organized into files and directories.
Files are divided into blocks and distributed to nodes.
Block placement is known at the time of read
Computation is moved to the same node.
Replication is used for:
Speed
Fault tolerance
Self-healing.
What is MapReduce?
Method for distributing a task across multiple servers.
Proposed by Dean and Ghemawat, 2004.
Consists of two developer-created phases:
Map
Reduce
In between Map and Reduce is the Shuffle and Sort phase.
What was the max/min temperature for the last century?
Map Phase
User writes the mapper method.
Input is an unstructured record:
E.g. a row of an RDBMS table,
a line of a text file, etc.
Output is a set of records of the form: <key, value>
Both key and value can be anything, e.g. text, number, etc.
E.g. for a row of an RDBMS table: <column id, value>
Line of a text file: <word, count>
Shuffle/Sort phase
The shuffle phase ensures that all the mapper output records with the same key value go to the same reducer.
Sort ensures that among the records received at each reducer, records with the same key arrive together.
Reduce phase
The reducer is a user-defined function which processes mapper output records with some of the keys output by the mapper.
Input is of the form <key, value>
All records having the same key arrive together.
Output is a set of records of the form <key, value>
Key is not important
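The three phases can be imitated in memory to see how they fit together. The sketch below is not Hadoop code: `run_mapreduce` is a hypothetical single-process stand-in, and the `year,temperature` record format is an assumption chosen to match the max/min temperature question.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of the Map, Shuffle/Sort and Reduce phases."""
    # Map: every input record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle: all values with the same key end up in the same group.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Sort + Reduce: keys are visited in order, one reducer call per key.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

def temperature_mapper(line):
    # Assumed record format: "year,temperature".
    year, temp = line.split(",")
    yield year, float(temp)

def temperature_reducer(year, temps):
    yield year, (min(temps), max(temps))

readings = ["1901,12.5", "1901,-3.0", "1902,20.1", "1902,7.4"]
result = run_mapreduce(readings, temperature_mapper, temperature_reducer)
print(result)
```

In a real cluster the groups are spread over many reducers, but each reducer still sees all values for the keys it owns, exactly as in the loop above.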
Parallel picture
Example
Word Count: Count the total number of occurrences of each word
Map
Reduce
Hadoop MapReduce
Provides:
Automatic parallelization and distribution
Fault tolerance
Methods for interfacing with HDFS for colocation of computation and storage of output.
Status and monitoring tools
API in Java
Ability to define the mapper and reducer in many languages through Hadoop streaming.
Word Count
Word Count: Mapper
Word Count: Reducer
Word Count: Main
Hadoop Streaming
Allows the user to specify mappers and reducers as executables.
The mapper launches the mapper executable and pipes the input data to the stdin of the executable.
The mapper executable processes the data and writes the mapper output to stdout, which is collected by the mapper.
Mapper key and value are separated by a tab.
The reducer operates in the same way.
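The streaming contract (lines in, tab-separated key and value out, reducer input sorted by key) can be sketched in Python as two generator functions. This is an illustrative word count, not the Perl scripts used in the slides; `map_lines` and `reduce_lines` are hypothetical names, and in a real job each would read from stdin and print to stdout.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: input lines arrive sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation: sorted() stands in for Hadoop's shuffle/sort step.
mapped = sorted(map_lines(["a b", "a"]))
reduced = list(reduce_lines(mapped))
print(reduced)
```

Because the reducer only relies on equal keys being adjacent, it never has to hold more than one key's running total in memory.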
Word count in Perl
Main command:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.6.0.jar \
  -input /tmp/small_svm_data \
  -output /user/sourangshu/output \
  -mapper "/usr/bin/perl wcmap.pl" \
  -reducer "/usr/bin/perl wcreduce.pl" \
  -file wcmap.pl -file wcreduce.pl
Mapper: wcmap.pl
Word count in Perl
Reducer:
Pig
Pig is a system for fast development of big data processing algorithms.
Pig has two components:
Pig Latin language
Interpreter for Pig Latin.
A program expressed in Pig Latin is translated into a set of MapReduce jobs.
Users interact with the Pig Latin language.
Pig Latin
A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
Exception: LOAD and STORE statements, which read from and write to files.
A general Pig program looks like:
A LOAD statement reads data from the file system.
A series of "transformation" statements process the data.
A STORE statement writes output to the file system; or, a DUMP statement displays output to the screen.
Pig Latin Datatypes
A relation is a bag (more specifically, an outer bag).
A bag is a collection of tuples, e.g. {(19,2), (18,1)}
A tuple is an ordered set of fields, e.g. (19,2)
A map is a set of key-value pairs, e.g. [open#apache]
A field is a piece of data
A field can be of type:
Scalar: int, long, float, double
Array: chararray, bytearray
Complex fields: bag, tuple, map.
Example
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DUMP A;
Output:
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Pig Latin
FOREACH .. GENERATE ..: applies an operation to each element of a bag.
Example:
A = LOAD 'data' AS (f1, f2, f3);
B = FOREACH A GENERATE f1 + 5;
C = FOREACH A GENERATE f1 + f2;
Pig Latin
Filtering:
X = FILTER A BY (f1 == 8) OR (NOT (f2 + f3 > f1));
Grouping:
X = GROUP A BY f1;
DUMP X;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
Grouping generates an inner bag.
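For readers more familiar with general-purpose languages, GROUP and FLATTEN can be approximated on plain Python tuples. This is only an analogy, not how Pig executes them; the relation `A` below contains just the tuples visible in the DUMP output above.

```python
from collections import defaultdict

# The tuples of relation A that appear in the grouped output.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3)]

def group_by_first_field(relation):
    """Rough analogue of `GROUP A BY f1`: one (key, inner bag) pair per key."""
    groups = defaultdict(list)
    for t in relation:
        groups[t[0]].append(t)
    return sorted(groups.items())

def flatten(grouped):
    """Rough analogue of `GENERATE group, FLATTEN(A)`: un-nest the inner bags."""
    return [(key,) + t for key, bag in grouped for t in bag]

grouped = group_by_first_field(A)
flat = flatten(grouped)
print(grouped)
print(flat)
```

Each grouped entry pairs a key with a list playing the role of the inner bag, and flattening turns every element of that list back into its own output tuple.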
Pig Latin
Cogrouping:
Pig Latin Operators on Bags
AVG: Computes the average of the numeric values in a single-column bag.
COUNT: Computes the number of elements in a bag.
IsEmpty: Checks if a bag or map is empty
MAX: Computes the maximum.
MIN: Computes the minimum.
SUM: Computes the sum of the numeric values.
SIZE: Computes the number of elements based on any Pig datatype.
Pig Latin
Flatten converts inner bags to multiple tuples:
X = FOREACH C GENERATE group, FLATTEN(A);
DUMP X;
(1,1,2,3)
(4,4,2,1)
(4,4,3,3)
(8,8,3,4)
(8,8,4,3)
Example: Projection
X = FOREACH A GENERATE a1, a2;
DUMP X;
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)
Business Analytics for Decision Making
RamBabu Roy
Decision Making
Definition
Selecting the best solution from two or more alternatives
Levels of Managerial Decision Making

Management level          Decision makers                                   Decision structure
Strategic Management      Executives and Directors                          Unstructured
Tactical Management       Business Unit Managers and Self-Directed Teams    Semistructured
Operational Management    Operating Managers and Self-Directed Teams        Structured

Information characteristics toward the strategic level: Ad Hoc, Unscheduled, Summarized, Infrequent, Forward Looking, External, Wide Scope
Information characteristics toward the operational level: Prespecified, Scheduled, Detailed, Frequent, Historical, Internal, Narrow Focus
Source: Euiho (David) Suh, POSTECH Strategic Management of Information and Technology Laboratory, (http://posmit.postech.ac.kr)
Analysis, Analytics, and BI
Analysis: Process of careful study of something to learn about its components, what they do, and how they are related to each other
Analytics: Application of scientific methods and techniques for analysis
Business Intelligence and Analytics [1]: techniques, technologies, systems, practices, methodologies, and applications that analyze critical business data to help an enterprise better understand its business and market and make timely business decisions
[1] Source: Business Intelligence and Analytics: From big data to big impact, H. Chen et al., MIS Quarterly, Dec 2012
BI&A Overview
Source: Business Intelligence and Analytics: From big data to big impact, H. Chen et al., MIS Quarterly, Dec 2012
Big Data and Traditional Analytics
Source: Big data at work, Davenport, HBS
Which type of analytics is appropriate?
Once you gather data, you build a model of how these data work.
Descriptive: to condense big data into smaller, more useful nuggets of information
Inquisitive: why is something happening; validate/reject hypotheses
Predictive: to forecast what might happen in the future; uses statistical modeling, data mining, and machine learning techniques, e.g. whether sentiment is positive or negative
Prescriptive: recommending one or more courses of action and showing the likely outcome of each decision; requires a predictive model with two additional components: actionable data and a feedback system
Source: http://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-descriptive-vs-predictive-vs-prescriptive/d/d-id/1113279
Evolution of DSS into Business Intelligence
Change in the Use of DSS
Specialist -> Managers -> Whomever, Whenever, Wherever
Emergence of Business Intelligence
[Figure: Web Technology, OLAP, Data Warehousing, Data Mining, and Intelligent Systems feeding into Business Intelligence]
Source: Euiho (David) Suh, POSTECH Strategic Management of Information and Technology Laboratory, (http://posmit.postech.ac.kr)
Business Intelligence (BI)
Definition
An umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies
Objective
To enable easy access to data (and models) to provide business managers with the ability to conduct analysis
[Figure: Business Intelligence, with labels Data, Information, Knowledge, Decisions, ACTION]
Source: Euiho (David) Suh, POSTECH Strategic Management of Information and Technology Laboratory, (http://posmit.postech.ac.kr)
Foundational Technologies in Analytics
Source: Business Intelligence and Analytics: From big data to big impact, H. Chen et al., MIS Quarterly, Dec 2012
Taxonomy for Data Mining Tasks
Classification Techniques
Decision tree analysis
Neural networks
Support vector machines
Case-based reasoning
Bayesian classifiers (naive, Belief Network)
Rough sets
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: Based on Bayes' Theorem.
Performance: A simple Bayesian classifier, the naive Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayes' Theorem: Basics
Let X be a data sample ("evidence"): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X) (posteriori probability), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability), the initial probability
E.g., X will buy computer, regardless of age, income, ...
P(X): probability that sample data is observed
P(X|H) (likelihood), the probability of observing the sample X, given that the hypothesis holds
E.g., given that X will buy computer, the prob. that X is 31..40, medium income
Bayes' Theorem
Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
Informally, this can be written as
posteriori = likelihood x prior / evidence
Predicts X belongs to $C_i$ iff the probability $P(C_i \mid X)$ is the highest among all the $P(C_k \mid X)$ for all the k classes
Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
Towards Naive Bayes Classifier
Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector $X = (x_1, x_2, \ldots, x_n)$
Suppose there are m classes $C_1, C_2, \ldots, C_m$.
Classification is to derive the maximum posteriori, i.e., the maximal $P(C_i \mid X)$
This can be derived from Bayes' theorem
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
Since P(X) is constant for all classes, only $P(X \mid C_i)\,P(C_i)$ needs to be maximized
Derivation of Naive Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)$$
This greatly reduces the computation cost: only counts the class distribution
If $A_k$ is categorical, $P(x_k \mid C_i)$ is the # of tuples in $C_i$ having value $x_k$ for $A_k$ divided by $|C_{i,D}|$ (# of tuples of $C_i$ in D)
Naive Bayes Classifier: Training Dataset
Class:
C1: buys_computer = yes
C2: buys_computer = no
Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no
Naive Bayes Classifier: An Example
P(Ci):
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
Compute P(X|Ci) for each class
P(age = <=30 | buys_computer = yes) = 2/9 = 0.222
P(age = <=30 | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
P(X | buys_computer = yes) * P(buys_computer = yes) = 0.028
P(X | buys_computer = no) * P(buys_computer = no) = 0.007
Therefore, X belongs to class (buys_computer = yes)
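The worked example can be reproduced by counting directly over the 14 training tuples. The sketch below hard-codes that training set; the function name `naive_bayes_scores` is hypothetical, and no smoothing is applied, matching the hand calculation.

```python
from collections import Counter

# Training tuples: (age, income, student, credit_rating, buys_computer).
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_scores(rows, x):
    """Return P(X|Ci) * P(Ci) for every class Ci, using raw counts."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(Ci)
        for k, value in enumerate(x):
            matches = sum(1 for r in rows if r[-1] == label and r[k] == value)
            score *= matches / n_label  # P(x_k | Ci)
        scores[label] = score
    return scores

X = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(ROWS, X)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The scores are not normalized by P(X), since that term is the same for both classes and does not change which one is larger.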
Avoiding the Zero-Probability Problem
Naive Bayesian prediction requires each conditional prob. to be non-zero. Otherwise, the predicted prob. will be zero
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
Ex. Suppose a dataset with 1000 tuples, income = low (0), income = medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The "corrected" prob. estimates are close to their "uncorrected" counterparts
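The correction amounts to adding one to every count and adding the number of distinct values to the denominator. A small sketch with the income counts from the example follows; `laplace_estimates` is a hypothetical helper name.

```python
from fractions import Fraction

def laplace_estimates(counts):
    """Add 1 to each count; the total grows by the number of distinct values."""
    total = sum(counts.values()) + len(counts)
    return {value: Fraction(count + 1, total) for value, count in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
estimates = laplace_estimates(income_counts)
for value, prob in estimates.items():
    print(value, prob)
```

Using exact fractions makes it easy to see that the three corrected estimates still sum to one.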
Naive Bayes Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough, etc. Disease: lung cancer, diabetes, etc.
Dependencies among these cannot be modeled by the Naive Bayes Classifier
How to deal with these dependencies? Bayesian Belief Networks
Support Vector Machines (SVM)
A relatively new classification method for both linear and nonlinear data
It uses a nonlinear mapping to transform the original training data into a higher dimension
With the new dimension, it searches for the linear optimal separating hyperplane (i.e., "decision boundary")
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)
Support Vector Machines
Aim: To find the best classification function to distinguish between members of the two classes in the training data
a separating hyperplane f(x) that passes through the middle of the two classes
the best such function is found by maximizing the margin between the two classes
$x_n$ belongs to the positive class if $f(x_n) > 0$
One of the most robust and accurate methods
Requires only a dozen examples for training
Insensitive to the number of dimensions
SVM: When Data Is Linearly Separable
Let data D be $(X_1, y_1), \ldots, (X_{|D|}, y_{|D|})$, where $X_i$ is the set of training tuples associated with the class labels $y_i$
There are infinite lines (hyperplanes) separating the two classes but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane (MMH)
SVM: Linearly Separable
A separating hyperplane can be written as
$$W \cdot X + b = 0$$
where $W = \{w_1, w_2, \ldots, w_n\}$ is a weight vector and b a scalar (bias)
For 2-D it can be written as
$$w_0 + w_1 x_1 + w_2 x_2 = 0$$
The hyperplanes defining the sides of the margin:
$$H_1: w_0 + w_1 x_1 + w_2 x_2 \geq 1 \quad \text{for } y_i = +1$$
$$H_2: w_0 + w_1 x_1 + w_2 x_2 \leq -1 \quad \text{for } y_i = -1$$
This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints, solved by Quadratic Programming (QP) with Lagrangian multipliers
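Once the weights are known, classifying a point only requires evaluating the 2-D hyperplane expression and checking its sign. The sketch below uses made-up weights (w0 = -3, w1 = w2 = 1) purely for illustration; it does not solve the QP. The margin width 2/||w|| is the standard distance between H1 and H2.

```python
import math

def decision_value(w0, w, x):
    """Evaluate w0 + w1*x1 + w2*x2 for a 2-D point x."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def classify(w0, w, x):
    """+1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if decision_value(w0, w, x) > 0 else -1

def margin_width(w):
    """Distance between H1 and H2, which is 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical trained weights, chosen only for illustration.
w0, w = -3.0, (1.0, 1.0)
print(classify(w0, w, (3.0, 2.0)), classify(w0, w, (0.0, 0.0)), margin_width(w))
```

A point with decision value of at least 1 lies on or beyond H1, and a point with value of at most -1 lies on or beyond H2; maximizing the margin is the same as minimizing ||w||.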
SVM: Linearly Inseparable
[Figure: linearly inseparable data plotted on attributes A1 and A2]
Transform the original input data into a higher dimensional space
Search for a linear separating hyperplane in the new space
SVM: Different Kernel Functions
Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function $K(X_i, X_j)$ to the original data, i.e., $K(X_i, X_j) = \Phi(X_i) \cdot \Phi(X_j)$
Typical Kernel Functions
SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
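The equivalence between a kernel and a dot product in the transformed space can be checked numerically. The sketch below uses a degree-2 homogeneous polynomial kernel on 2-D inputs, together with its explicit feature map; this particular kernel is chosen here for illustration and is not taken from the slides.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly2_kernel(x, y):
    """K(x, y) = (x . y)^2, computed entirely in the original 2-D space."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit map to 3-D whose dot product reproduces poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly2_kernel(x, y)
mapped_value = dot(phi(x), phi(y))
print(kernel_value, mapped_value)
```

Both computations give the same number, but the kernel never builds the higher dimensional vectors, which is what keeps nonlinear SVMs tractable.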
Parallel Implementation of SVM
Start with the current w and b, and in parallel do several iterations based on each training example
Average the values from each of the examples to create a new w and b.
If we distribute w and b to each mapper, then the Map tasks can do as many iterations as we wish to do in one round
We need to use the Reduce tasks only to average the results
One iteration of MapReduce is needed for each round.
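One way to read the scheme is as parameter averaging: each Map task refines its own copy of w and b on its partition, and the Reduce step averages the copies. The sketch below assumes a hinge-loss subgradient update with a small regularization term; the slides do not specify the update rule, so the learning rate, the regularization constant, and the toy data are all illustrative assumptions.

```python
def local_update(w, b, partition, lr=0.1, lam=0.01):
    """Map task: one pass of hinge-loss subgradient steps over a partition."""
    w = list(w)
    for x, y in partition:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin < 1:
            # Margin violated: move toward classifying this example correctly.
            w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
            b += lr * y
        else:
            # Only the regularization term contributes.
            w = [wi - lr * lam * wi for wi in w]
    return w, b

def average(results):
    """Reduce task: average the (w, b) pairs returned by the mappers."""
    n = len(results)
    dim = len(results[0][0])
    w = [sum(r[0][j] for r in results) / n for j in range(dim)]
    b = sum(r[1] for r in results) / n
    return w, b

def train(partitions, dim, rounds=3):
    w, b = [0.0] * dim, 0.0
    for _ in range(rounds):  # one MapReduce iteration per round
        w, b = average([local_update(w, b, p) for p in partitions])
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy, linearly separable data split across two hypothetical mappers.
partitions = [
    [((2.0, 2.0), 1), ((-2.0, -2.0), -1), ((3.0, 3.0), 1)],
    [((-3.0, -3.0), -1), ((2.0, 3.0), 1), ((-2.0, -3.0), -1)],
]
w, b = train(partitions, dim=2)
```

Only w and b cross the network in each round, so the communication cost is independent of the number of training examples.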
Model Selection: ROC Curves
ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
Vertical axis represents the true positive rate
Horizontal axis represents the false positive rate
The plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
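The ranking procedure translates directly into code: sort the test tuples by decreasing score, walk down the list, and record the false and true positive rates after each tuple. The sketch below assumes distinct scores and labels coded as 1 (positive) and 0 (negative); the example scores are made up.

```python
def roc_points(labels, scores):
    """ROC curve as a list of (false positive rate, true positive rate)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6]
points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points, auc)
```

Here three of the four positive-negative pairs are ranked in the right order, which is why the area comes out between the diagonal's 0.5 and the perfect 1.0.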
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, ...)
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Cluster Analysis: Applications
Identify natural groupings of customers
Identify rules for assigning new cases to classes for targeting/diagnostic purposes
Provide characterization, definition, labelling of populations
Decrease the size and complexity of problems for other data mining methods
Identify outliers in a specific domain (e.g., rare-event detection)
Cluster Analysis: Applications
Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth observation database
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
City planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Climate: understanding earth climate, find patterns of atmospheric and ocean
Economic Science: market research
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Association Rule Mining
Finds interesting relationships (affinities) between variables (items or events)
Also known as market basket analysis
Representative applications of association rule mining include
In business: cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration
In medicine: relationships between symptoms and illnesses; diagnosis and patient characteristics and treatments (to be used in medical DSS); and genes and their functions (to be used in genomics projects)
Association Rule Mining
Are all association rules interesting and useful?
A Generic Rule: $X \Rightarrow Y$ [S%, C%]
X, Y: products and/or services
X: Left-hand side (LHS)
Y: Right-hand side (RHS)
S: Support: how often X and Y go together
C: Confidence: how often Y goes together with X
Example: {Laptop Computer, Antivirus Software} $\Rightarrow$ {Extended Service Plan} [30%, 70%]
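Support and confidence are both simple counts over the transaction list. The sketch below uses five made-up transactions (so the resulting percentages differ from the 30% and 70% in the example rule); the item names are placeholders.

```python
def support_and_confidence(transactions, lhs, rhs):
    """Support = share of transactions with LHS and RHS together;
    confidence = share of LHS transactions that also contain RHS."""
    with_lhs = [t for t in transactions if lhs <= t]
    with_both = [t for t in with_lhs if rhs <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_lhs)
    return support, confidence

# Hypothetical market baskets.
transactions = [
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus"},
    {"laptop"},
    {"antivirus", "service_plan"},
    {"laptop", "antivirus", "service_plan"},
]
support, confidence = support_and_confidence(
    transactions, lhs={"laptop", "antivirus"}, rhs={"service_plan"}
)
print(support, confidence)
```

A rule is usually kept only if both numbers exceed user-chosen minimum thresholds, which is how uninteresting rules are filtered out.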
Privacy Preserving Data Mining
There is a growing concern among people and organizations about protecting their privacy
Many business and government organizations have strong requirements for privacy-preserving data mining
The primary task in data mining: development of models about aggregated data.
meeting privacy requirements
providing valid data mining results
Source: Privacy Preserving Association Rule Mining in Vertically Partitioned Data, Jaideep Vaidya, Chris Clifton, SIGKDD '02, Edmonton, Alberta, Canada
What is Privacy?
Ability of an individual or group to reveal themselves selectively, or to remain unnoticed or unidentified in public (related to anonymity)
Privacy laws prohibit unsanctioned invasion of privacy by the government, corporations or individuals
Privacy may be voluntarily sacrificed, normally in exchange for perceived benefits and very often with specific dangers and losses
What is privacy-preserving data mining?
Study of achieving some data mining goals without sacrificing the privacy of the individuals
Challenges
Privacy considerations seem to conflict with data mining. How do we mine data when we can't even look at it?
Can we have data mining and privacy together?
Can we develop accurate models without access to precise information in individual data records?
Leakage of information is inevitable: how to minimize the leakage of information
PPDM offers a compromise!
Example of Privacy
Alice and Bob are both teaching the same class, and each of them suspects that one specific student is cheating. Neither of them is completely sure about the identity of the cheater, and they would like to compare the names of their two suspects
For the students' privacy
if they both have the same suspect, then they should learn his or her name
but if they have different suspects then they should learn nothing beyond that fact.
They therefore have inputs x and y, and wish to compute f(x, y), which is defined as 1 if x = y and 0 otherwise
If f(x, y) = 0 then each party does learn some information, namely that the other party's suspect is different from his/hers, but this is inevitable
Total count: How many pneumonia deaths under age 65?
[Figure: ten hospitals (hosp1 to hosp10) arranged in a ring pass a running total, starting from r0]
r1 = r0 + 12
r2 = r1 + 1
r3 = r2 + 8
r4 = r3 + 0
r5 = r4 + 11
r6 = r5 + 7
r7 = r6 + 14
r8 = r7 + 3
r9 = r8 + 0
r9 + 5
Total = 61
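The ring in the figure resembles a secure-sum protocol: the initiator starts the running total at a random value r0, every site adds its own count, and the initiator subtracts r0 at the end. The sketch below simulates this in one process; the modulus and the function name `secure_sum` are assumptions, since the figure shows neither.

```python
import random

def secure_sum(local_counts, modulus=10**6, rng=None):
    """Simulate the ring: each site only ever sees a masked running total."""
    rng = rng or random.Random()
    r0 = rng.randrange(modulus)      # random mask known only to the initiator
    running = r0
    for count in local_counts:       # one hop around the ring per hospital
        running = (running + count) % modulus
    return (running - r0) % modulus  # initiator removes the mask

# Per-hospital counts read off the figure, in ring order.
hospital_counts = [12, 1, 8, 0, 11, 7, 14, 3, 0, 5]
total = secure_sum(hospital_counts)
print(total)
```

Because r0 is random, an intermediate value such as r3 tells a hospital nothing about the counts added before it, provided neighbouring sites do not collude.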
Distributed Computing Scenario
Two or more parties owning confidential databases wish to run a data mining algorithm on the union of their databases without revealing any unnecessary information
Although the parties realize that combining their data has some mutual benefit, none of them is willing to reveal its database to any other party
Partial leak of information is inevitable
Association rule mining in vertically partitioned data
Privacy concerns can prevent a central database approach for data mining
The transactions may be distributed across sources
Collaborate to mine globally valid data without revealing individual transaction data
Prevent disclosure of individual relationships
Join key revealed
Universe of attribute values revealed
Real-life Example
Ford Explorers with Firestone tires from a specific factory had tread separation problems in certain situations, resulting in 800 injuries.
Since the tires did not have problems on other vehicles, and other tires on Ford Explorers did not pose a problem, neither side felt responsible
Delay in identifying the real problem led to a public relations nightmare and the eventual replacement of 14.1 million tires
Both manufacturers had their own data: early generation of association rules based on all of the data may have enabled Ford and Firestone to resolve the safety problem before it became a public relations nightmare.
Problem Definition
To mine association rules across two databases, where the columns in the table are at different sites, splitting each row.
One database is designated the primary, and is the initiator of the protocol. The other database is the responder.
There is a join key present in both databases. The remaining attributes are present in one database or the other, but not both.
The goal is to find association rules involving attributes other than the join key, observing the privacy constraints
Problem Definition
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that $T \subseteq I$.
An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$
The rule $X \Rightarrow Y$ holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule $X \Rightarrow Y$ has support s in the transaction set D if s% of transactions in D contain $X \cup Y$
We consider mining boolean association rules. The absence or presence of an attribute is represented as a 0 or 1. Transactions are strings of 0s and 1s.
To find out if a particular itemset is frequent, we count the number of records where the values for all the attributes in the itemset are 1.
Mathematical problem formulation
Let the total number of attributes be l + m, where A has l attributes A1 through Al, and B has the remaining m attributes B1 through Bm. Transactions/records are a sequence of l + m 1s or 0s.
Let k be the support threshold required, and n be the total number of transactions/records
Let X and Y represent columns in the database, i.e., $x_i = 1$ if row i has value 1 for attribute X. The scalar (or dot) product of two cardinality-n vectors X and Y is defined as:
$$X \cdot Y = \sum_{i=1}^{n} x_i \, y_i$$
Determining if the two-itemset $\langle X Y \rangle$ is frequent thus reduces to testing if $X \cdot Y \geq k$
Example
Find out if itemset {A1, B1} is frequent (i.e., if support of {A1, B1} $\geq$ k)
Site A holds the columns (Key, A1) and site B holds the columns (Key, B1), for keys k1, k2, k3, k4, k5
Support of an itemset is defined as the number of transactions in which all attributes of the itemset are present
$$\text{Support} = \sum_{i=1}^{n} A_i \times B_i$$
Basic idea
$$\text{Support} = \sum_{i=1}^{n} A_i \times B_i$$
This is the scalar (dot) product of two vectors
To find out if an arbitrary (shared) itemset is frequent, create a vector on each side consisting of the component multiplication of all attribute vectors on that side (contained in the itemset)
To find out if {A1, A3, A5, B2, B3} is frequent
A forms the vector $X = \prod A_1 A_3 A_5$
B forms the vector $Y = \prod B_2 B_3$
Securely compute the dot product of X and Y
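The arithmetic behind the test, leaving the security aside, can be sketched in a few lines. The code below computes everything in the clear, which a real deployment must not do: the protocol's whole purpose is to obtain the same dot product without either site seeing the other's vector. The column values are made up.

```python
def componentwise_product(columns):
    """Multiply the 0/1 attribute columns held by one site, row by row."""
    return [int(all(bits)) for bits in zip(*columns)]

def is_frequent(a_columns, b_columns, k):
    """Plaintext version of the support test X . Y >= k."""
    x = componentwise_product(a_columns)
    y = componentwise_product(b_columns)
    support = sum(xi * yi for xi, yi in zip(x, y))
    return support >= k, support

# Hypothetical columns for five joined rows.
A1, A3, A5 = [1, 0, 1, 1, 1], [1, 1, 1, 0, 1], [1, 1, 1, 1, 0]
B2, B3 = [1, 1, 1, 0, 1], [1, 0, 1, 1, 1]
frequent, support = is_frequent([A1, A3, A5], [B2, B3], k=2)
print(frequent, support)
```

Each site can build its own product vector locally, so the only step that needs a cryptographic protocol is the final dot product.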
Conclusion
Privacy-preserving association rule mining algorithm using an efficient protocol for computing scalar product while preserving privacy of the individual values
Communication cost is comparable to that required to build a centralized data warehouse
Although secure solutions exist, achieving efficient secure solutions for privacy-preserving distributed data mining is still an open problem
Handling the multi-party case and avoiding collusion is challenging.
Non-categorical attributes and quantitative association rule mining are significantly more complex problems
References
Privacy Preserving Association Rule Mining in Vertically Partitioned Data, Jaideep Vaidya, Chris Clifton, SIGKDD '02, Edmonton, Alberta, Canada
Privacy-preserving Data Mining. R. Agrawal and R. Srikant, ACM SIGMOD Conference on Management of Data, 2000.
Cryptographic techniques for privacy-preserving data mining. B. Pinkas, SIGKDD Explorations, Vol. 4, Issue 2.
KNOWLEDGE VALUATION: Building blocks to a knowledge valuation system (KVS), Annie Green, The journal of information and knowledge management systems, Vol. 36 No. 2, 2006, pp. 146-154
Business Applications
RamBabu Roy
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin
From Big Data to Big Impact
Source: Business Intelligence and Analytics: From big data to big impact, H. Chen et al., MIS Quarterly, Dec 2012
Big Impacts by Big Data
Radical transparency with data widely available
Experimentation for anticipating business decisions
change the way we compete
Impact of real-time customization on business
Augmenting management and strategy: better risk management
Information-driven business model innovations
Leveraging valuable exhaust data by business transactions
Data aggregator as an entrepreneurial opportunity
Source: Are you ready for the era of big data?, Brad Brown et al., McKinsey Quarterly, Oct 2011
Source: http://www.forbes.com/sites/danmunro/2013/04/28/bigproblemwithlittledata/
Magic Quadrant for Advanced Analytics Platforms
[Figure: quadrants LEADERS, CHALLENGERS, VISIONARIES, NICHE PLAYERS]
Source: Gartner (February 2014)
Sources of Competitive Advantage
Big Data: a new type of corporate asset
Effective use of data at scale
Data-driven decision making
Radical customization
Gaining market share
Constant experimentation
Exploring with real-world experiments
Adjust prices in real time
Bundling, synthesizing and making information available across the organization
Novel business models
Growing Business Interest in Big Data
Hundreds of articles published in technology/industry journals and general business press (Forbes, Fortune, Bloomberg BusinessWeek, The Wall Street Journal, The Economist, etc.)
3 Vs: use of disparate data sets including social media
The speed of analysis, near-real-time deployment requirements
The advancement of the fields of machine learning and visualization
Role of Data Scientist
We often do not know what question to ask
requires domain expertise to identify the important problems to solve in a given area
Which aspect of big data makes more sense
How to apply it to the business
How one can achieve success by implementing a big data project
New challenges
Lack of structure
What technology one must use to manage it
Challenging to convert it into insights, innovation and business value
But new opportunities
VariationofPotentialValueacrossSectors
Source:Areyoureadyfortheeraofbigdata?,BradBrownet.al,McKinseyQuarterly,Oct2011
Sources of Big Data
Various business units: Govt./Private
Partners
Customers
Internet of Things
Social media
Transaction data
Web pages
Major Industries That Benefit
Financial services
IT and telecommunication
Healthcare
Manufacturing
Real estate
Government and defence
Travel
Media and entertainment
Retailing
Why is Big Data so Important?
Potential to radically transform businesses and industries
Reinventing business processes
Operational efficiencies
Customer service innovations
"You can't manage what you don't measure"
More knowledge about the business
Improved decision making and performance
Enables managers to decide on the basis of evidence rather than intuition
More effective interventions
But need to change the decision-making culture
Source: "Big Data: The Management Revolution", Andrew McAfee, Erik Brynjolfsson, HBR, Oct 2012
How are Managers Using Big Data?
To create new businesses
Better prediction yields better decisions
Airlines had a gap of at least 5 to 10 minutes between the ETA (estimated by pilots) and the ATA (actual time of arrival).
Improved ETA by combining weather data, flight schedules, information about other planes in the sky, and other factors
To drive more sales
To tailor promotions and other offerings to customers
To personalize the offers to take advantage of local conditions
Integrating Big Data into Business
How to utilise unstructured data within your organisation
The latest technical changes related to using Hadoop in the enterprise
Why big data solutions can enhance your ROI and deliver value
The risks related to increasing volumes of data
The future of big data and the Internet of Things
How cloud computing is changing the enterprise's use of data
Applications of Big Data Analytics
Disaster management
Big data and sensor data fusion for smart cities
Healthcare
Spatio-temporal data analytics
Time-series-based long-term analysis (e.g. climate change)
Short-term real-time (traffic management, disaster management)
Benchmarking across the world
Preserving heritage: creation of e-repositories
Big Data in Manufacturing
Manufacturing generates about a third of all data today
Digitally oriented businesses see higher returns
Detecting product and design flaws with Big Data
Forecasting, cost and price modeling, supply chain connectivity
Warranty data analytics, text mining for product development
Visualization and dashboards, fault detection, failure prediction, in-process verification tools
Machine system/sensor health, process monitoring and correction, product data mining
Big Data generation and management, and the Internet of Things
Big Data in Manufacturing
Integrating data from multiple systems
Collaboration among functional units
Information from external suppliers and customers to co-create products
Collaboration during design phase to reduce costs
Implement changes in product to improve quality and prevent future problems
Identify anomalies in production systems
Schedule pre-emptive repairs before failures
Dispatch service representatives for repair
Retailing and Logistics
Optimize inventory levels at different locations
Improve the store layout and sales promotions
Optimize logistics by predicting seasonal effects
Minimize losses due to limited shelf life
Financial Applications
Banking and Other Financial
Automate the loan application process
Detecting fraudulent transactions
Optimizing cash reserves with forecasting
Brokerage and Securities Trading
Predict changes on certain bond prices
Forecast the direction of stock fluctuations
Assess the effect of events on market movements
Identify and prevent fraudulent activities in trading
Insurance
Forecast claim costs for better business planning
Determine optimal rate plans
Optimize marketing to specific customers
Identify and prevent fraudulent claim activities
Understanding Customer Online Behaviour
Drawing insight from the online customers
Sentiment analysis to gauge responses to new marketing campaigns and adjust strategies
Customer Relationship Management
Maximize return on marketing campaigns
Improve customer retention (churn analysis)
Maximize customer value (cross-selling, up-selling): revenue streams
Identify and treat most valued customers
Case Study: Quantcast
How do advertisers reach their target audiences online?
Consumers spend more time in personalized media environments
It's harder to reach a large number of relevant consumers
Advertisers need to use the web to choose their audiences more selectively
Decision: which ad to show to whom
Source: "Too Big to Ignore: The Business Case for Big Data", Phil Simon, Wiley, 2013
Quantcast: A Small Big Data Company
Web measurement and targeting company, founded in 2006, with a focus on online audience
Models marketers' prospects and finds lookalike audiences across the world
Analyses more than 300 billion observations of media consumption per month
Detailed demographic, geographic and lifestyle information
Created a massive data processing infrastructure: Quantcast File System (QFS)
Incorporates data generated from mobile devices
Quantcast: A Small Big Data Company
Big data allows organizations to drill down to reach specific audiences
Different businesses have different data requirements, challenges and goals
Quantcast provides integration between its product and third-party data and applications
An organization doesn't need to be big to benefit from Big Data
Promise of Big Data in Healthcare
Predictive and prescriptive analytics
Public health
Disease management
Drug discovery
Personalized medicine
Continuously scan and intervene in the healthcare practices
What is Healthcare?
Broader than just practicing medicine
Role of physician, pharmaceutical companies, hospitals, diagnostics services
Co-creation of health value
Role of patients and their family
Objective: to deliver high-quality and cost-effective care to patients
Uniqueness of Healthcare
Every patient is unique and needs personalized care
All the medical professionals are unique
Can we match the core competency and unique style of medical professionals to specific patients?
Can we use service operations management principles to improve the overall efficiency?
Need for engaging medical professionals in the tasks they are best at doing
Healthcare Business Innovation
Needs exchange of knowledge between business and medicine
Entrepreneurship in healthcare
Availability of financial support to engage in value creation
Development and analysis of healthcare business model
Healthcare Analytics
Data collected through mobile devices, health workers, individuals, other data sources
Crucial for understanding population health trends or stopping outbreaks
Individual electronic health records not only improve continuity of care for the individual, but also create massive datasets
Identification of at-risk patients
Trends in service line utilization
Improvement opportunities in the revenue cycle
Treatments and outcomes can be compared in an efficient and cost-effective manner.
Sources of Information
Government officials
Industry representatives
Information technology experts
Healthcare professionals
Enabler
Increase of processing power and storage capacities
Radical advancement of information and communication technology
Reduced cost of storage and processing of big data
Availability of data
Digitization of data
Increase in adoption of digital gadgets by users
Popularity of social media
Mobile and internet population and penetration
Awareness about benefits of having knowledge
Requirement of data-driven insights by various stakeholders
Explorys: The Human Case for Big Data
January 2011, US healthcare spending $3 trillion
Behavioural, operational and clinical wastes
Long-term economic implications
Opportunity to improve healthcare delivery and reduce expenses
Integrating clinical, financial and operational data
Volume, velocity and variety of healthcare data
Big data can have a big impact
Better Healthcare using Big Data
Real-time exploration, performance and predictive analytics
Vast user base: major integrated delivery networks in the US
Users can view key performance metrics across providers, groups, care venues, and locations
Identify ways to improve outcomes and reduce unnecessary costs
Better Healthcare using Big Data
Why do people go to an emergency room rather than a primary physician?
Analytics to decide whether the care is available in the neighbourhoods
Service providers can reach out to patients to guide them through the treatment processes
Patients can receive the right care at the right venue at the right time
Privacy concerns
Mobilising Data to Deal with an Epidemic
Case of Haiti's devastating 2010 earthquake
Mobile data patterns could be used to understand the movement of refugees and the consequent health risks
Accurately analyse the destination of over 600,000 people displaced from Port-au-Prince
They made this information available to government and humanitarian organisations dealing with the crisis
Cholera Outbreak in Haiti
Mobile data to track the movement of people from affected zones
Aid organisations used this data to prepare for new outbreaks
Demonstrates how mobile data analysis could revolutionise disaster and emergency responses.
Cost-effective Technology
DataGrid platform on Cloudera's enterprise-ready Hadoop solution
Platform needs to support
The complex healthcare data
Evolving transformation of healthcare delivery system
Need for a solution that can evolve with time
Flexibility, scalability and speed necessary to answer complex questions on the fly
Explorys could focus on the core competencies of its business
Imperative to learn, adapt and be creative
Benefits of Information Technology
Improved access to patient data can help clinicians as they diagnose and treat patients
Patients might have more control over their health
Monitoring of public health patterns and trends
Enhanced ability to conduct clinical trials of new diagnostic methods and treatments
Creation of new high-technology markets and jobs
Support a range of healthcare-related economic reforms
Potential Benefits
Provide a solid scientific and economic basis for investment recommendations
Establish foundation for the healthcare policy decisions
Improved patient outcomes
Cost savings
Faster development of treatments and medical breakthroughs
Action Plan
Government should facilitate the nationwide adoption of a universal exchange language for healthcare information
Creation of a digital infrastructure for locating patient records while strictly ensuring patient privacy.
Facilitate a transition from traditional electronic health records to the use of healthcare data tagged with privacy and security specifications
NASA: Open Innovation
Effects of crowdsourcing on the economy: collaboration and innovation
Wikipedia
What is the relationship between Big Data, collaboration and open innovation?
Democratize big data to benefit from the wisdom of crowds
Groups can often use information to make better decisions than any individual
TopCoder brings together a diverse community of software developers, data scientists, and statisticians
Risk prevention contest
Predict the riskiest locations for major emergencies and crimes
NASA's Real-world Challenges
Recorded more than 100 TB of space images and more
Encourage exploration and analysis of Planetary Data System (PDS) databases
Image processing solutions to categorize data from missions to the moon: types and numbers of craters
Software to handle complex operations of satellites and data analysis
Vehicle recognition
Crater detection
Real-time information to contestants and collecting their comments
Lessons Learnt
To effectively harness big data, an organization need not offer big rewards
State the problem in a way that allows a diverse set of potential solvers to apply their knowledge
Create a compelling Big Data and open innovation challenge
An organization's size doesn't matter to reap rewards of Big Data
Crowd Management: Cricket Match
Directional flow density during cricket match
Availability of public transport for commuters from different destinations
Mobile data can be used to predict the direction of movement of spectators post-match
Potential Applications
Managing resources to pursue most promising customers and markets
Technology platform for managing the risk and value of network of different stakeholders
Meeting regulatory compliance
Development of patient referral system to reduce patient churn
Life science industries such as diagnostics, pharmaceuticals, and medical devices
Development of mobile apps for referral
Genetic data mining for prediction of disease and effective care
Simple Techniques for Big Data Transition
Which problem to tackle? Need for domain experts to decide right questions
Habit of asking key questions
What do the data say?
Where did the data come from?
What kinds of analyses were conducted?
How confident are we in results?
Computers can only give answers
Need for Change Management
Leadership: set clear goals, define success, spot great opportunity, and understand how a market is developing
Talent management: organizing big data, visualization tools and techniques, design of experiments
Technology: to integrate all internal and external sources of data, adopt new technologies
Decision making: maximize cross-functional cooperation
Company culture: shift from asking "What do we think?" to "What do we know?"
Challenges in Getting Returns
Collecting, processing, analyzing and using Big Data in their businesses
Handling the volume, variety and velocity of the data.
Finding data scientists: by 2018, the demand for analytics and Big Data people in the U.S. alone will exceed supply by as much as 190,000 (McKinsey Global Institute)
Sharing data across organizational silos
Driving data-driven business decisions rather than those based on intuition
Confronting Complications
Shortage of adequately skilled workforce
Shortage of managers and analysts with sharp understanding of big data applications
Tension between privacy and convenience
Consumer captures larger part of economic surplus at the cost of privacy concerns
Data/IP security concerns: cloud computing and open access to information
What Big Data Can't Answer
How do we know what's important in a complex world?
E.g. to predict and explain market crashes, food prices, the Arab Spring, ethnic violence, and other complex biological and social systems.
Determining which information is pivotal (and ignoring the rest) is the key to solving the world's increasingly complex challenges.
Yaneer Bar-Yam and Maya Bialik, Beyond Big Data: Identifying important information for real world challenges (December 17, 2013), arXiv, in press
Complexity Profile
The amount of information that is required to describe a system as a function of the scale of description.
The most important information about a system for informing action on that system is the behavior at the largest scale.
References
Holdren, John P., Eric Lander, and Harold Varmus. "Report to the President: Realizing the Full Potential of Health Information Technology to Improve Healthcare for Americans: The Path Forward." President's Council of Advisors on Science and Technology, Executive Office of the President, Dec. 2010.
The Emerging Big Returns on Big Data, A TCS 2013 Global Trend Study, 2013
www.ikanow.com
Mining Social Network Graphs
Prof. Ram Babu Roy
Types of Networks
Technological: man-made, consciously created (Kolaczyk, 2009)
Social: interactions among social entities
Biological: interaction among biological elements
Informational: elements of information
What is a Social Network?
Collection of entities that participate in the network
There is at least one relationship between entities of the network.
There is an assumption of nonrandomness or locality
Examples
Friends networks: Facebook, Twitter, Google+
Telephone networks
Email networks
Collaboration networks
Ref: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, 2014
Social Networks as Graphs
Not every graph is a suitable representation of a social network.
Locality: the property of social networks that says nodes and edges of the graph tend to cluster in communities.
Important questions
How to identify communities: subsets of the nodes with unusually strong connections
Communities usually overlap: you may belong to several communities
Example of a Small Social Network
Nine edges out of the $\binom{7}{2} = 21$ pairs of nodes
Suppose X, Y, and Z are nodes with edges between X and Y and also between X and Z. What would we expect the probability of an edge between Y and Z to be?
If the graph were large, that probability would be very close to the fraction of the pairs of nodes that have edges between them, i.e., 9/21 = 0.429
If the graph is small, we already know that edges (X, Y) and (X, Z) exist, so only seven of the nine edges remain for the other 19 pairs. Thus, the probability of an edge (Y, Z) is 7/19 = 0.368.
By actually counting the pairs of nodes, the fraction of times the third edge exists is 9/16 = 0.563
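The count behind 9/16 can be checked with a short script. This is a minimal sketch: the edge list below is an assumption, since the slide's figure did not survive; it mirrors the seven-node, nine-edge example graph in the cited textbook.

```python
from itertools import combinations

# Assumed edge list (not in the slide text): 7 nodes, 9 edges,
# following the small example graph of the cited textbook.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def locality_fraction(adj):
    """Count triples with edges X-Y and X-Z, and how many also have Y-Z."""
    closed = total = 0
    for x, neighbours in adj.items():
        for y, z in combinations(sorted(neighbours), 2):
            total += 1
            if z in adj[y]:
                closed += 1
    return closed, total


adj = build_adjacency(EDGES)
n = len(adj)
pairs = n * (n - 1) // 2
closed, total = locality_fraction(adj)
print(f"edge density: {len(EDGES)}/{pairs} = {len(EDGES) / pairs:.3f}")
print(f"third edge present: {closed}/{total} = {closed / total:.4f}")
```

The observed fraction is well above both 9/21 and 7/19, which is the sense in which this small graph exhibits locality.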
Social Network Analysis
An actor exists in a fabric of relations (Wasserman & Faust, 1994)
Degree: importance due to direct linkage with nodes (Freeman, 1979; Friedkin, 1991)
Closeness: ability to reach other actors with ease
Betweenness: relative importance of a node in linking two nodes
Eigenvector (Bonacich, 1987): entire pattern of connections
Network centralization (Freeman, 1979):
$$C_A = \frac{\sum_{x \in U} \left( C_A^{\max} - C_A(x) \right)}{\operatorname{Max} \sum_{x \in U} \left( C_A^{\max} - C_A(x) \right)}$$
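As an illustration of the first two measures, the sketch below computes normalized degree and closeness centrality with a breadth-first search. The graph is a hypothetical seven-node example, not data from the slides, and the normalizations (dividing degree by n - 1, and n - 1 by the sum of distances) are one common convention among several.

```python
from collections import deque

# Hypothetical undirected graph used only for illustration.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E"),
         ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(adj)
    return {v: len(nb) / (n - 1) for v, nb in adj.items()}


def closeness_centrality(adj):
    """(n - 1) over the sum of shortest-path distances; assumes a connected graph."""
    n = len(adj)
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        result[source] = (n - 1) / sum(dist.values())
    return result


adj = build_adjacency(EDGES)
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
for node in sorted(adj):
    print(node, round(degree[node], 3), round(closeness[node], 3))
```

In this example node D scores highest on both measures, since it has the most direct links and the shortest total distance to all other nodes.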
Link Analysis
Link analysis algorithms
PageRank
HITS (Hypertext-Induced Topic Selection)
Page is more important if it has more links
Incoming links
Outgoing links
Think of in-links as votes
Are all in-links equal?
Links from important pages count more
Recursive question!
PageRank Scores
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets
Simple Recursive Formulation
Each link's vote is proportional to the importance of its source page
If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
Page $j$'s own importance is the sum of the votes on its in-links
$r_j = r_i/3 + r_k/4$
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets
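The Flow Model
A vote from an important page is worth more
A page is important if it is pointed to by other important pages
Define a rank $r_j$ for page $j$: $r_j = \sum_{i \to j} r_i / d_i$, where $d_i$ is the number of out-links of page $i$
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets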
Matrix Formulation
Stochastic adjacency matrix $M$
Let page $i$ have $d_i$ out-links
If $i \to j$, then $M_{ji} = 1/d_i$, else $M_{ji} = 0$
$M$ is a column-stochastic matrix
Columns sum to 1: $\sum_j M_{ji} = 1$
The flow equations can be written $r = M \cdot r$
The rank vector $r$ is an eigenvector of the stochastic web matrix $M$
We can now solve for $r$ using the method called power iteration
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets
Power Iteration Method
Given a web graph with $N$ nodes, where the nodes are pages and edges are hyperlinks
Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
Iterate: $r^{(t+1)} = M \cdot r^{(t)}$
Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm
Can use any other vector norm, e.g., Euclidean
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets
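The iteration above can be sketched in a few lines of plain Python. The three-page web graph, the helper names and the tolerance are assumptions chosen for illustration; the sketch implements only the basic flow model on these slides, with no damping or teleportation.

```python
def column_stochastic(out_links, n):
    """Build M with M[j][i] = 1/d_i if page i links to page j, else 0."""
    M = [[0.0] * n for _ in range(n)]
    for i, targets in out_links.items():
        for j in targets:
            M[j][i] = 1.0 / len(targets)
    return M


def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r


# Hypothetical web: page 0 links to itself and page 1,
# page 1 links to pages 0 and 2, page 2 links to page 1.
OUT_LINKS = {0: [0, 1], 1: [0, 2], 2: [1]}
M = column_stochastic(OUT_LINKS, 3)
r = power_iteration(M)
print([round(x, 4) for x in r])
```

For this graph the flow equations can also be solved by hand, giving r = (2/5, 2/5, 1/5), which is what the iteration converges to.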
How to Solve?
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets
HITS (Hypertext-Induced Topic Selection)
Is a measure of importance of pages or documents, similar to PageRank
Proposed at around the same time as PageRank ('98)
Goal: say we want to find good newspapers
Don't just find newspapers. Find "experts": people who link in a coordinated way to good newspapers
Idea: links as votes
Page is more important if it has more links
Incoming links? Outgoing links?
Hubs and Authorities
Each page has 2 scores:
Quality as an expert (hub):
Total sum of votes of authorities pointed to
Hubs are pages that link to authorities
List of newspapers
Quality as a content (authority):
Total sum of votes coming from experts
Authorities are pages containing useful information
Newspaper home pages
Hubs and Authorities
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets
Example
Under reasonable assumptions about the adjacency matrix $A$, HITS converges to vectors $h^*$ and $a^*$:
$h^*$ is the principal eigenvector of matrix $A A^T$
$a^*$ is the principal eigenvector of matrix $A^T A$
Source: Jure Leskovec, Stanford CS246: Mining Massive Datasets
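The mutual reinforcement between hubs and authorities can be sketched as a simple alternating update. The three-page adjacency matrix is hypothetical, and rescaling by the largest score after each step is one common normalization choice, assumed here for illustration.

```python
def hits(adj_matrix, iterations=50):
    """Alternate a = A^T h and h = A a, rescaling so the largest score is 1."""
    n = len(adj_matrix)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority of page j: sum of hub scores of the pages linking to j.
        auths = [sum(adj_matrix[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        top = max(auths) or 1.0
        auths = [a / top for a in auths]
        # Hub score of page i: sum of authority scores of the pages i links to.
        hubs = [sum(adj_matrix[i][j] * auths[j] for j in range(n)) for i in range(n)]
        top = max(hubs) or 1.0
        hubs = [h / top for h in hubs]
    return hubs, auths


# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
hubs, auths = hits(A)
print([round(h, 3) for h in hubs], [round(a, 3) for a in auths])
```

Page 0 ends up as the best hub and page 2 as the best authority; the resulting vectors are proportional to the principal eigenvectors of $A A^T$ and $A^T A$ for this matrix.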
Financial Market as CAS
Complex systems: interconnectedness, hierarchy of subsystems, decentralized decisions, non-linearity, self-organisation, evolution, uncertainty, self-organized criticality
Complex adaptive systems (CAS): adaptation
Many socio-economic systems behave as CAS (Mauboussin 2002; Markose 2005)
Complex: networks of interconnections
Adaptive: optimizing agents, capable of learning
Complexity and homogeneity: both robust and fragile
Structural vulnerabilities build up over time (Haldane, 2009)
Network-based Modeling
Modeling complex systems: SD, ABM, VSM, Econophysics
Structure and the evolution of networks are inseparable (Barabási, 1999)
Visual representation of the interdependence (Newman 2008): knowledge discovery
Dynamics of networks may reveal underlying mechanism (Barabási, 2009)
Recent works using network approach (Boginski et al. 2006; Tse, C.K., 2010)
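Market Graph
Logarithm of return of the instrument $i$ over the one-day period from $(t-1)$ to $t$:
$R_i(t) = \ln \frac{P_i(t)}{P_i(t-1)}$
$P_i(t)$ denotes the price of the instrument $i$ on day $t$
Correlation coefficient between instruments $i$ and $j$ is
$C_{ij} = \frac{\langle R_i R_j \rangle - \langle R_i \rangle \langle R_j \rangle}{\sqrt{\left( \langle R_i^2 \rangle - \langle R_i \rangle^2 \right) \left( \langle R_j^2 \rangle - \langle R_j \rangle^2 \right)}}$
Studying the pattern of connections in the market graph would provide helpful information about the internal structure of the stock market
Source: Mining market data: A network approach, Vladimir Boginski, Sergiy Butenko, Panos M. Pardalos, Computers & Operations Research 33 (2006) 3171-3184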
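A minimal sketch of how a market graph could be built from price series follows. The instrument names, prices and the 0.5 threshold are made up for illustration; the rule of adding an edge when the correlation reaches a threshold follows the Computations slide later in the deck.

```python
import math


def log_returns(prices):
    """R(t) = ln(P(t) / P(t-1)) for consecutive daily prices."""
    return [math.log(prices[t] / prices[t - 1]) for t in range(1, len(prices))]


def correlation(x, y):
    """Correlation coefficient of two equally long return series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y
    var_x = sum(a * a for a in x) / n - mean_x * mean_x
    var_y = sum(b * b for b in y) / n - mean_y * mean_y
    return cov / math.sqrt(var_x * var_y)


def market_graph(price_series, threshold):
    """Return edges (a, b, C_ab) for every pair whose correlation is >= threshold."""
    returns = {name: log_returns(p) for name, p in price_series.items()}
    names = sorted(returns)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            c = correlation(returns[a], returns[b])
            if c >= threshold:
                edges.append((a, b, c))
    return edges


# Hypothetical closing prices: Y always trades at half of X, Z moves against both.
PRICES = {
    "X": [100.0, 102.0, 101.0, 104.0, 103.0],
    "Y": [50.0, 51.0, 50.5, 52.0, 51.5],
    "Z": [80.0, 79.0, 80.5, 78.0, 79.5],
}
edges = market_graph(PRICES, threshold=0.5)
print(edges)
```

Only the pair X and Y is connected here: their log returns coincide, while Z is negatively correlated with both and so falls below the threshold.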
State of the Art
Regional behaviour of stock markets, relatively small sample size
Korean (Jung et al. 2006; Kim et al. 2007), Indian (Pan & Sinha 2007), and Brazilian (Tabak et al. 2009)
Evolution of interdependence and characterizing the dynamics (Garas, 2007; Huang et al. 2009)
Not much work on identifying dominant stock indices (Eryigit, 2009)
Little work on the impact of the recent financial crisis during 2008
Limited business application of SNA (Bonchi et al., 2011)
Research Gap
Characterising the global market dynamics is a complex problem
Lack of system-level analysis of global stock market
The network-based methodologies inadequately explored
Adaptation of SNA methodologies to other domains
Research Objective
Understanding underlying network structure of markets
Methods to capture interdependence structure of complex systems
Methods to characterize evolutionary behavior
Methods for change detection: the impact of events on the topology
Research Questions
Is there any regional influence on the evolutionary behavior?
Is the network topology during crisis phase different from that of the normal phase?
How to capture the macroscopic interdependence structure among stock markets and economic sectors?
How to identify dominant stock markets/economic sectors?
What is the response of the network to an extreme event?
Application: Change Detection in the Interdependence Structure of Global Stock Markets
Source: "A Social Network Approach to Change Detection in the Interdependence Structure of Global Stock Markets" by Ram Babu Roy, Uttam Kumar Sarkar, Social Network Analysis and Mining, Springer, Vol. 3, Number 3 (2013)
Methodology
Secondary data on major stock markets from across the globe obtained from Bloomberg
Network models of stock markets and simplification using MST (minimum spanning tree)
Characterization and pattern mining to investigate the structural and statistical properties and behavior
Statistical control chart to detect anomaly in evolution
Graph-theoretic methods and algorithms, network visualization tools (Pajek, Matlab and MS Excel)
Nonparametric methods for analysis and change detection
Data Description
The daily closing prices for 85 stock indices from 36 countries from across the world from January 2006 to December 2010, obtained from Bloomberg
In addition to these stock indices from various countries, 8 other indices, namely SX5E, SX5P, SXXE, SXXP, E100, E300, SPEURO, SPEU from the European region, were included to investigate whether the regional indices have any influence on the network structure
Choice of the period: to study the behaviour of the stock market network before and after the collapse of Lehman Brothers in the USA.
Restricted our samples to only those indices existing for a longer period and having data available on Bloomberg (say from 1990), giving us 93 such indices
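Computations
Logarithmic return $R_i(t)$ of the instrument $i$ over the one-day period from $(t-1)$ to $t$:
$R_i(t) = \ln \{ P_i(t) / P_i(t-1) \}$
Correlation coefficient between returns of instruments $i$ and $j$:
$C_{ij} = \frac{\langle R_i R_j \rangle - \langle R_i \rangle \langle R_j \rangle}{\sqrt{\left( \langle R_i^2 \rangle - \langle R_i \rangle^2 \right) \left( \langle R_j^2 \rangle - \langle R_j \rangle^2 \right)}}$
An edge connecting stock indices $i$ and $j$ is added to the graph if $C_{ij} \ge$ threshold
Returns of these two stock indices behave similarly over time
Degree of similarity is defined by the chosen value of threshold
MST is used for obtaining a simplified connected network:
$d_{ij} = \sqrt{2 (1 - C_{ij})}$, $0 \le d_{ij} \le 2$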
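The MST simplification can be sketched with Kruskal's algorithm applied to the distances $d_{ij}$. The four index names and their correlation values are hypothetical, and Kruskal's algorithm is one standard way to obtain an MST; the slides do not say which algorithm the authors used.

```python
import math


def correlation_to_distance(c):
    """d = sqrt(2 (1 - C)): 0 for perfectly correlated series, 2 for anti-correlated."""
    return math.sqrt(2.0 * (1.0 - c))


def minimum_spanning_tree(nodes, corr):
    """Kruskal's algorithm; corr maps (a, b) pairs to correlation coefficients."""
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    weighted = sorted((correlation_to_distance(c), a, b) for (a, b), c in corr.items())
    tree = []
    for d, a, b in weighted:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so keep it
            parent[root_a] = root_b
            tree.append((a, b, d))
    return tree


# Hypothetical correlation coefficients among four indices.
CORR = {
    ("A", "B"): 0.9, ("A", "C"): 0.5, ("A", "D"): 0.1,
    ("B", "C"): 0.6, ("B", "D"): 0.2, ("C", "D"): 0.7,
}
tree = minimum_spanning_tree(["A", "B", "C", "D"], CORR)
for a, b, d in tree:
    print(a, b, round(d, 3))
```

Because the distance decreases as the correlation increases, the tree keeps the most strongly correlated links that still connect all nodes: here A-B, C-D and B-C, leaving n - 1 = 3 edges.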
Illustration of MST Creation
Empirical Findings
Period No    Start Date    End Date
...          1/11/2006     4/23/2008
...          5/31/2006     9/10/2008
...          6/28/2006     10/8/2008
36           9/17/2008     12/29/2010
37           1/4/2006      12/29/2010
[Figure: stock market network, Pre-LB (before the Lehman Brothers collapse)]
[Figure: stock market network, Post-LB (after the Lehman Brothers collapse)]
Indices cluster with the regional hubs
Relatively more decentralized network
European stock indices emerge more central
Application: Identifying Dominant Economic Sectors and Stock Markets
Source: "Identifying dominant economic sectors and stock markets: A social network mining approach" by Ram Babu Roy and Uttam K Sarkar, PAKDD 2013 DMApps, Gold Coast, Australia, Springer, LNCS 7867, pp. 59-70 (2013)
Data Description
Stock Index    Number of Stocks
AS52           136
CNX500         303
FSSTI          19
HDAX           38
HSI            25
IBOV           27
KRX100         65
MEXBOL         24
NKY            192
Total          829

Stock Index    Number of Stocks
NMX            192
NZSE50FG       26
SBF250         140
SET            285
SHCOMP         382
SPTSX          147
SPX            384
TWSE           313
Total          1869

GICS Economic Sector          Number of Stocks
Consumer Discretionary        486
Consumer Staples              229
Energy                        128
Financials                    408
Health Care                   146
Industrials                   512
Information Technology        234
Materials                     425
Telecommunication Services    36
Utilities                     94
Grand Total                   2698

Data for 13 years from January 1998 to January 2011
GICS: Global Industry Classification Standard
Identification of Dominant Economic Sectors and Stock Markets
The normalized intra-sectoral edge density (in percent) is the ratio of the number of edges between the stocks of a particular sector to the maximum number of possible edges between the stocks of that sector (i.e. $n - 1$, where $n$ is the number of stocks of that sector)
The normalized inter-sectoral edge density is the ratio of the number of edges between the stocks of two different sectors to the maximum number of possible edges between the stocks of those two sectors (i.e. $\min(n_1, n_2)$, where $n_1$ and $n_2$ are the numbers of stocks belonging to the two sectors)
A similar procedure has been followed to identify dominant stock markets after computing the normalized inter-index and intra-index edge densities.
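The two densities defined above can be computed directly from an edge list and a stock-to-sector mapping. This is a sketch under stated assumptions: the stock labels, sector names and edges are invented, and the edges are taken to come from a tree such as the MST, which is presumably why $n - 1$ is the maximum within one sector.

```python
def edge_densities(edges, sector_of):
    """Normalized intra- and inter-sectoral edge densities, in percent."""
    sizes = {}
    for sector in sector_of.values():
        sizes[sector] = sizes.get(sector, 0) + 1

    # Count edges per unordered pair of sectors.
    counts = {}
    for a, b in edges:
        key = tuple(sorted((sector_of[a], sector_of[b])))
        counts[key] = counts.get(key, 0) + 1

    densities = {}
    for (s1, s2), k in counts.items():
        if s1 == s2:
            denom = sizes[s1] - 1             # intra-sectoral: n - 1
        else:
            denom = min(sizes[s1], sizes[s2])  # inter-sectoral: min(n1, n2)
        densities[(s1, s2)] = 100.0 * k / denom if denom else 0.0
    return densities


# Hypothetical stocks, sectors and tree edges.
SECTOR_OF = {"s1": "Financials", "s2": "Financials", "s3": "Financials",
             "s4": "Energy", "s5": "Energy"}
TREE_EDGES = [("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s4", "s5")]
densities = edge_densities(TREE_EDGES, SECTOR_OF)
for pair, value in sorted(densities.items()):
    print(pair, value)
```

In this toy tree both sectors are internally fully connected (100 percent), while the single link between them gives an inter-sectoral density of 1 / min(3, 2) = 50 percent.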
[Figure: network of stocks; node color encodes the stock index, node shape encodes the continent]
Continent        Node Shape
Australia        Diamond
Zealandia        Diamond
Asia             Triangle
Europe           Box
North America    Circle
South America    Ellipse
[Figure: network of stocks; node color encodes the GICS sector, node shape encodes the continent]
Continent        Node Shape
Australia        Diamond
Zealandia        Diamond
Asia             Triangle
Europe           Box
North America    Circle
South America    Ellipse

GICS Sector                   Color      No. of Stocks
Materials                     Brown      425
Industrials                   Cyan       512
Health Care                   Orange     146
Energy                        Red        128
Consumer Discretionary        Yellow     486
Financials                    Magenta    408
Information Technology        Blue       234
Consumer Staples              Black      229
Telecommunication Services    Green      36
Utilities                     Purple     94
Interdependence Structure of Economic Sectors (Weighted Edge)
Rank    Economic Sector               Eigenvector Centrality
1       Financials                    0.4785
2       Industrials                   0.4042
3       Materials                     0.3795
4       Consumer Discretionary        0.3449
5       Information Technology        0.2962
6       Consumer Staples              0.2702
7       Telecommunication Services    0.2694
8       Utilities                     0.2002
9       Energy                        0.1957
10      Health Care                   0.1817
Network of Stock Indices (Weighted Edge)
Rank    Index       Eigenvector Centrality
1       SBF250      0.6381
2       HDAX        0.5938
3       NMX         0.393
4       SPX         0.1523
5       AS52        0.1517
6       NZSE50FG    0.1248
7       SPTSX       0.1112
8       HSI         0.071
9       SET         0.0465
10      MEXBOL      0.0423
11      CNX500      0.0353
12      NKY         0.0289
13      FSSTI       0.0208
14      SHCOMP      0.0048
15      IBOV        0.0043
16      TWSE        0.0004
17      KRX100      0
Inter-sectoral Interdependence Linking Cross-country Stock Markets
Rank    Economic Sector               EV Centrality
1       Industrials                   0.4738
2       Financials                    0.4575
3       Materials                     0.3105
4       Utilities                     0.2921
5       Health Care                   0.2886
6       Consumer Staples              0.2747
7       Consumer Discretionary        0.2685
8       Information Technology        0.2306
9       Telecommunication Services    0.225
10      Energy                        0.2232
Findings and Conclusions
Presence of distinct regional and sectoral clusters
Regional influence dominates economic sectors
Stock indices from Europe and US emerge as dominant
Financial, Industrial, Materials, Consumer Discretionary sectors dominate
Social position of stocks: portfolio management
System-level understanding of the structure and behavior
Scope for Future Research
Potential application for classification of stocks in portfolio management
Detecting epicenter of turbulence in near real time: development of EWS
Modeling shock propagation through the network
Capturing causal relationships among stock returns
Whether the self-organizing network provides in-built resilience
References
Bessler, D.A., Yang, J., The structure of interdependence in international stock markets. Journal of International Money and Finance 22, 261-287, 2003.
Eryiğit, M. and R. Eryiğit, Network structure of cross-correlations among the world market indices. Physica A: Statistical Mechanics and its Applications 388(17): 3551-3562, 2009.
Adams, J., K. Faust, et al., Capturing context: Integrating spatial and social network analyses. Social Networks 34(1), 1-5, 2012.
Markose, Computability and evolutionary complexity: Markets as CAS. The Economic Journal, 115, 2005.
Tse, C.K., Liu, J., Lau, F.C.M., A network perspective of the stock market. Journal of Empirical Finance, doi:10.1016/j.jempfin.2010.04.008, 2010.
Boginski, V., Butenko, S., Pardalos, P.M., Mining market data: A network approach. Computers & Operations Research 33, 3171-3184, 2006.
Wasserman, S. and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press: 461-502, 1994.
Freeman, L.C., Centrality in social networks: Conceptual clarification. Social Networks 1, 215-239, 1979.
Bonacich, P., Power and Centrality: A Family of Measures. The American Journal of Sociology 92, 1987.
Roy, R.B. and U.K. Sarkar, A social network approach to examine the role of influential stocks in shaping interdependence structure in global stock markets. International Conference on Advances in Social Network Analysis and Mining (ASONAM), Kaohsiung, Taiwan, DOI: 10.1109/ASONAM.2011.87, 2011.
Roy, R.B. and U.K. Sarkar, A social network approach to change detection in the interdependence structure of global stock markets. Social Network Analysis and Mining, DOI 10.1007/s13278-012-0063-y, 2012.
Classification using Neural Networks
Uttam K Sarkar
Indian Institute of Management Calcutta
Session Plan
The need for neural networks
Signature recognition problem
Whether the signature on the cheque is the same as the signature the bank has against the account in its database
Concept of a neural network
Demonstration of how it works using Excel
Potential business applications
Issues in using neural networks
Artificial Neural Network (ANN)
An ANN is a computational paradigm inspired by the structure of biological neural networks and their way of encoding and solving problems
Biological Inspirations
Some numbers
The human brain contains about 10 billion nerve cells (neurons)
Each neuron is connected to the others through about 10000 synapses
Properties of the brain
It can learn, reorganize itself from experience
It adapts to the environment
It is robust and fault tolerant
Biological Neural Networks
Human brain contains several billion nerve cells called neurons
A neuron receives electrical signals through its dendrites
The accumulated effect of several signals received simultaneously is linearly additive
Output is a nonlinear (all-or-none) type of signal
Connectivity (number of neurons connected to a neuron) varies from 1 to $10^5$. For the cerebral cortex it's about $10^3$.
Biological Neuron
[Figure: biological neuron with labels synapse, axon, nucleus, cell body, dendrites]
Dendrites sense input
Axons transmit output
Information flows from dendrites to axons via the cell body
An axon connects to other dendrites via synapses
Interactions of neurons
Synapses can be excitatory or inhibitory
Synapses vary in strength
How can the above biological characteristics be modeled in an artificial system?
Artificial implementation using a computer?
Input
Accepting external input is simple and commonplace
Axons transmit output
Output mechanisms too are well known
Information flows from dendrites to axons via the cell body
Information flow is doable
An axon connects to other dendrites via synapses
Interactions of neurons (how? What kind of graph or network?)
Synapses can be excitatory or inhibitory (1/0? Continuous?)
Synapses vary in strength (weighted average?)
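One possible way to answer these questions in code is a simple threshold unit. The sketch below is an illustration only, not the model used in the session: it assumes positive weights stand for excitatory synapses, negative weights for inhibitory ones, and a hypothetical threshold of 1.0 for the all-or-none output.

```python
def threshold_neuron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    # Signals arriving together are accumulated additively, scaled by synapse strength.
    total = sum(w * x for w, x in zip(weights, inputs))
    # All-or-none output: the neuron either fires or stays silent.
    return 1 if total >= threshold else 0


# Hypothetical strengths: two excitatory synapses and one inhibitory synapse.
weights = [0.6, 0.6, -1.0]
threshold = 1.0

fires = threshold_neuron([1, 1, 0], weights, threshold)      # both excitatory inputs active
inhibited = threshold_neuron([1, 1, 1], weights, threshold)  # inhibitory input also active
print(fires, inhibited)  # prints "1 0"
```

The inhibitory input is enough to pull the sum back below the threshold, so the same excitatory signals no longer make the unit fire.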
Typical Excitement or Activation Function at a Neuron (Sigmoid or Logistic curve)
Logistic: $y = f(x) = \frac{1}{1 + e^{-x}}$
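As a small illustration, the logistic curve y = 1 / (1 + e^(-x)) can be evaluated as follows. The function name `logistic` and the branch for negative inputs (an algebraically equivalent form that avoids overflow) are choices made for this sketch, not something taken from the slides.

```python
import math


def logistic(x):
    """Logistic (sigmoid) activation: 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form e^x / (1 + e^x); avoids overflow when x is very negative.
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-5, 0, 5):
    print(x, round(logistic(x), 4))
```

The output stays strictly between 0 and 1 for moderate inputs, equals 0.5 at x = 0, and saturates toward 0 or 1 as x moves far from zero, which gives a smooth version of the all-or-none response.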
Interconnections?
Feed Forward Neural Network
Information is fed at the input
Computations are done at the hidden layers
Deviations of computed results from desired goals retune the computations
The network thus learns
Computation is terminated once the learning is assumed acceptable or the resources earmarked for computation get exhausted
[Figure: feed-forward network. Inputs x1, x2, ..., xn enter the input layer, followed by the 1st hidden layer, the 2nd hidden layer, and the output layer]
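The forward flow through such a network can be sketched in a few lines. This is a minimal illustration under stated assumptions: every node uses the logistic activation, each layer is fully connected to the previous one, and all weights and biases below are made-up numbers rather than trained values.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """One layer: each node sums its weighted inputs, adds a bias, and applies the logistic."""
    return [
        logistic(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Pass the inputs through the layers in order and return the output layer's values."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden nodes -> 2 hidden nodes -> 1 output.
layers = [
    ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.5, 0.7], [0.4, 0.4, -0.3]], [0.05, -0.05]),
    ([[1.0, -1.0]], [0.0]),
]
output = feed_forward([1.0, 2.0], layers)
print(output)
```

Only the forward computation is shown here; the retuning of weights from the deviations is a separate step.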
Supervised Learning
Inputs and outputs are both known.
The network tunes its weights to transform the inputs to the outputs without trying to discover the mapping in an explicit form
One may provide examples and teach the neural network to arrive at a solution without knowing exactly how!
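To make the idea of tuning weights from examples concrete, here is a minimal sketch that trains a single logistic neuron on the logical OR function. It is deliberately simpler than a multi-layer network, and the update rule (nudging each weight in proportion to the deviation), the learning rate, the epoch limit, and the tolerance are all assumptions chosen for illustration.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    return logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(examples, rate=0.5, max_epochs=5000, tolerance=0.1):
    """Tune weights from (inputs, target) pairs.

    Stops when the largest deviation seen in an epoch is below the tolerance,
    or when the epoch budget is used up.
    """
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        worst = 0.0
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # Nudge each weight in the direction that reduces the deviation.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
            worst = max(worst, abs(error))
        if worst < tolerance:
            break
    return weights, bias


# Known inputs and known outputs: the OR function.
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(examples)
predictions = [predict(weights, bias, inputs) for inputs, _ in examples]
print([round(p, 2) for p in predictions])
```

The neuron is never told the rule "output 1 if any input is 1"; it only sees examples, yet its rounded predictions end up matching the targets.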
Characteristics of ANN
Supervised networks are good approximators
Continuous functions on a bounded input range can be approximated by an ANN to any precision, given enough hidden nodes
Does self-learning by adapting weights to environmental needs
Can work with incomplete data
The information is distributed across the network. If one part gets damaged, the overall performance may not degrade drastically
What do we need to use NN?
Determination of pertinent inputs
Collection of data for the learning and testing phase of the neural network
Finding the optimum number of hidden nodes
Estimate the parameters (Learning)
Evaluate the performances of the network
If performances are not satisfactory then review all the preceding points
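The split between learning and testing data, and the evaluation step, can be sketched as below. The helper names `split_data` and `accuracy`, the 25% test fraction, and the toy threshold rule standing in for a trained network are all hypothetical choices for this illustration.

```python
import random


def split_data(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold back a fraction for the testing phase."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, records):
    """Share of records for which the predicted label equals the known target."""
    correct = sum(1 for inputs, target in records if predict(inputs) == target)
    return correct / len(records)


# Toy data: one input per record, target is 1 when the input is at least 10.
records = [([x], 1 if x >= 10 else 0) for x in range(20)]
learning_set, testing_set = split_data(records)


# Stand-in for a trained network (an assumption made only to keep the sketch short).
def toy_model(inputs):
    return 1 if inputs[0] >= 10 else 0


print(len(learning_set), len(testing_set), accuracy(toy_model, testing_set))
```

Performance is measured on the held-back testing set, so a network that merely memorized the learning examples would be exposed at this step.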
Applications of ANN
Dares to address difficult problems where the cause-effect relationship of input-output is very hard to quantify
Stock market predictions
Face recognition
Time series prediction
Process control
Optical character recognition
Optimization
Concluding remarks on Neural Networks
Neural networks are utilized as statistical tools
Adjust nonlinear functions to fulfill a task
Need multiple and representative examples, but fewer than in other methods
Neural networks make it possible to model complex phenomena
NN are good classifiers BUT
Good representations of data have to be formulated
Training vectors must be statistically representative of the entire input space
Effective use of NN needs a good comprehension of the problem and a good grip on the underlying mathematics
Uttam K Sarkar
Indian Institute of Management Calcutta
Background
"War is too important to be left to the Generals" (Georges Benjamin Clemenceau)
Decision making now requires navigating beyond transactional data
Need for exploring an ocean of data
How to filter?
How to impute?
How about outliers?
How to summarize?
How to analyze?
How to apply?
Challenges (Opportunities?)
The real world does not behave consistently
What model is to be extracted by analytics then? The future, by definition, is uncertain
Captured data are error prone and involve uncertainty
You are dead by the time you know the ultimate truth
Often there is no concrete clarity on the goal as you optimize
What are the strengths and weaknesses of the underlying assumptions?
Data are voluminous, have high variety, and high velocity
How to capture, store, transmit, share, sample, analyze?
Interpretation of findings is nontrivial
Sufficiently tortured and abused, statistics would confess to almost anything
Analytics
Promises, Myths, and Reality
Too much jargon
Business intelligence, Data mining, Data warehousing, Predictive analytics, Prescriptive analytics, Big data, ...
Too many questions
What on earth do they mean? Where did they emerge from? Why are they getting so popular? Is analytics the panacea? Is it yet another overhyped buzzword?
Analytics in action
Instead of struggling for academic definitions, let us look into what these are intended to achieve and look at some examples from the business world where these are being used
Netflix
Disruptive innovation riding on the analytics dimension
1997: Reed Hastings, Computer Science Masters from Stanford, pays $40 late fee for a VHS cassette of the movie Apollo 13
The recommender system
The transportation logistics fine-tuning
The silent battle of the nondescript entrant Netflix against the incumbent billion-dollar Blockbuster
Blockbuster initially ignored Netflix
Recognition (and virtual surrender) by Blockbuster: it was too late!
Harrah's
Knowing customers better using analytics
The magic of CRM
Wal-Mart
The ever-annoying problem of stockout, shrinkage, and inventory management, and the success of Radio Frequency Identification (RFID) based analytics
Reduced stockout!
Examples Galore
HP
Which employee is likely to quit?
Target
Which customers are expectant mothers?
Telenor
Which customers can be persuaded to stay back?
Latest US presidential election
Which voter will be positively persuaded by a political campaign contact such as a call, door knock, flier, or TV ad?
IBM (Watson)
Automated Jeopardy! Analytics software challenged human opponents!
Analytics defined?
Analytics is (Ref: Davenport)
Extensive use of data, statistical, quantitative, and computer-based analysis, explanatory and predictive models, and a fact-based rather than gut-feeling-based approach to arrive at business decisions and to drive actions
Analytics can help arrive at intelligent decisions
What is intelligence in this context?
How can a machine help take intelligent decisions?
Changing nature of competition in the business world: How to get competitive advantage?
Operational business processes aren't much different from anybody else's
Thanks to R&D on best practices
Competitive advantage
Operational business processes aren't much different from anybody else's
Thanks to R&D on best practices
Products aren't much differentiated any more
Thanks to comparable technology accessible to all
Unique geographical advantage doesn't matter much
Thanks to improved transportation/communication logistics
Protective regulations are largely gone
Thanks to globalization
Proprietary technology gets rapidly copied
Thanks to technological innovations
What is left is to improve efficiency and effectiveness by taking the smartest business decisions possible out of data which may as well be available to competitors
Stock Market Analogy
It's too well known that the stock market would cease to exist if one could find a predictive theory of price movements!
The market exists because no such ideal model is possible. Players try disparate models: some gain while others lose
Ocean of data: "best model" or "best interpretation" is meaningless
The smarter guy discovers more meaningful patterns for his business by using analytics!
Emergence of analytics: the contributors
Old wine in new bottle?
Yes and No
Pillars of business analytics
Data capture and storage in bulk
Statistical principles revisited
Developments in machine learning principles
Availability of easy-to-use software tools
Managerial involvement with a changed mindset
Panacea?
Too many ready-to-use tools and techniques are available!
A fool with a tool is still a fool
Road to Analytics
Uttam K Sarkar
Indian Institute of Management Calcutta
Attributes of organizations thriving on analytics
Identification of distinctive strategic capability for analytics to make a difference
Highly organization specific
May be customer loyalty for Harrah's, revenue management for Marriott, supply chain performance for Wal-Mart, customers' movie preference for Netflix
Enterprise-level approach to analytics
Gary Loveman of Harrah's broke the fiefdom of marketing and customer service islands into a cross-department chorus on shared data and thoughts
Senior management capability and commitment
Reed Hastings of Netflix
Computer Science graduate from Stanford
Gary Loveman of Harrah's
PhD in Economics from MIT
Bezos of Amazon.com
Computer Science graduate from Princeton
Stages of analytics preparedness of an organization
Analytically impaired
Groping for data to improve operations
Explorer of internal analytics
Using limited analytics to improve a functional activity
Analytics Experimenter
Exploring analytics to improve a distinctive capability
Analytics Mindset
Using analytics to differentiate its offerings
Competing on analytics
Staying ahead on strength of analytics
Investment in analytics without due diligence may not yield results
United Airlines had been the world's largest airline in terms of fleet size and number of destinations, operating over 1200 aircraft with service to 400-odd destinations
United Airlines had invested heavily in analytics
The company filed for Chapter 11 bankruptcy protection in 2002
Despite the business downturn, most other airlines were not as adversely affected
What went wrong with its analytics?
United Airlines analytics postmortem
UA pioneered only yield management
Other smaller airlines were doing cost cutting and offering seats at lower fares
UA was developing complex route planning optimization analytics with multiple plane types
Competitors like Southwest used only one kind of plane and had a far simpler and cheaper system of running
UA pioneered a loyalty programme based on analytics
Their customer service was so pathetic that frequent flyers hardly had any loyalty to it
UA spent a fortune developing analytics
Other airlines could buy the Sabre system at a far cheaper price for pretty similar analysis
Questions to ask when evaluating an analytics initiative
How will this investment make us more profitable?
How does this initiative improve enterprise-wide capabilities?
What complementary changes need to be made to take advantage of the capabilities?
More IT? More training? Redesign jobs? Hire new skills?
Do we have access to the right data?
Are data timely, accurate, complete, consistent?
Do we have access to the right technology?
Is the technology workable, scalable, reliable, and cost effective?
Missteps to avoid when getting into an analytics initiative
Focusing excessively on one dimension of the capability (say, only on technology)
Attempting to do everything at once
Any complex system is best handled incrementally
Investing too much or too little on analytics without matching impact on and demand of business
Choosing the wrong problem
Wrong formulation, wrong assumptions, wrong data, wrong software tool, wrong method
Making the wrong interpretation
Tool + data + few mouse clicks = model
Model + input = output
Who assesses whether the output is garbage?
Characteristics of executives promoting analytics
They should be passionate about fact-driven decision making
They should have skills and appreciation of analytical tools and methods
They should be willing to act on the findings from analytics
They should be competent to assess and manage meritocracy
References & Acknowledgment
Predictive Analytics
Eric Siegel, Wiley, 2013
Competing on Analytics
Davenport and Harris, Harvard Business School Press, 2007
Analytics at Work
Davenport, Harris, and Morrison, Harvard Business Press, 2010