
Big Data Concepts
Professor Mouwafac Sidaoui
11/18/2013

Defining Our Terms: What do you think Big Data means?
Based on your experience and knowledge, what is Big Data?
Is it more than just a buzzword or hype?
Why is Big Data significant right now?
Why are some calling the emergence of Big Data and advanced analytics the new Industrial Revolution?

What Do Others Think?

"Data is a precious thing and will last longer than the systems themselves."
Tim Berners-Lee, inventor of the World Wide Web

"Data is the new oil."
Clive Humby (a Sheffield mathematician who with his wife, Edwina Dunn, made £90m helping Tesco with its Clubcard system)

"Data Science is not voodoo. We are not building fancy math models for their own sake. We are trying to listen to what the customer is telling us through their behavior."
Kevin Geraghty, Head of Analytics, 360i

"Big Data is about having an understanding of what your relationship is with the people who are most important to you and an awareness of the potential in that relationship."
Joe Rospars, Chief Digital Strategist, Obama for America

"If we have data, let's look at data. If all we have are opinions, let's go with mine."
Jim Barksdale, former Netscape CEO

"Data Scientist: The Sexiest Job of the 21st Century"
Thomas H. Davenport and D.J. Patil

How Much Data?

Every minute:
  Google processes more than 2 million queries
  Facebook processes 350 GB of data
  Twitter users send out 277,000 tweets
  Sprint processes 250,000 phone calls
  Amazon sold over 18,000 items at its peak (on Cyber Monday) in 2012
  Wal-Mart processes almost 17,000 transactions
  72 hours of new videos are uploaded to YouTube
  More than 100 million emails are generated
  Individuals and organizations launch 571 new websites
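
To get a feel for the scale of the per-minute figures above, the short sketch below scales a few of them to a full day. It assumes, purely for illustration, that each rate holds constant around the clock; the slide's "more than" and "almost" figures are treated as exact, and the Amazon peak figure is left out because it is not a steady rate. The names `PER_MINUTE` and `per_day` are made up for this example.

```python
# Per-minute figures copied from the slide; treated as exact for illustration.
PER_MINUTE = {
    "Google queries": 2_000_000,
    "Tweets": 277_000,
    "Sprint phone calls": 250_000,
    "Wal-Mart transactions": 17_000,
    "Emails": 100_000_000,
    "New websites": 571,
}

MINUTES_PER_DAY = 24 * 60  # 1,440 minutes


def per_day(per_minute: int) -> int:
    """Scale a per-minute count to one day, assuming a constant rate."""
    return per_minute * MINUTES_PER_DAY


if __name__ == "__main__":
    for name, rate in PER_MINUTE.items():
        print(f"{name}: {per_day(rate):,} per day")
```

Under that constant-rate assumption, 2 million queries per minute works out to roughly 2.9 billion queries per day.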

What Can We Do With Data?

Google sells advertisements based on your interests
Facebook places ads targeted at you based on your profile
Twitter studies what you talk about and what you pay attention to
Sprint maximizes its opportunities based on your call patterns
Amazon suggests items to purchase based on what you buy over time
Wal-Mart knows what items you buy in combination and offers them together
YouTube has a profile that describes you perfectly based on what you watch
The NSA knows if you are a likely terrorist based on your emails
The people/organizations who launch websites are observing your behavior

How Do We Treat Our Data?

Data is an ASSET!
GOLD!


What Does the Data Asset Become?

Data: A collection of random facts
Information: Data placed in context
Knowledge: What can be learned from the information
Wisdom: Knowledge combined with experience and intuition

Traditional vs. Big Data

Data Systems: Relational vs. NoSQL
Data Science: Business Intelligence vs. MapReduce

The Relational Model

Has been with us since 1970 (Edgar F. Codd)
First viable commercial RDBMSs in 1979 (Oracle)
Hasn't evolved much
The predominant database model today

The Relational Model

Predictable, structured data
  Tables with defined columns
  Comprised of integers, booleans, characters, floating point numbers, alphanumeric strings
Expected, planned variability
  The model shifted slowly to incorporate changes in data (not many, though)
Anticipated, measured volume
  Data typically belonged to one entity and was not shared
  Volume tracked the growth of the business entity
Normalized, largely inflexible design
  A database's logical model did not change much, but when it did, it was a big deal
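
The "tables with defined columns" idea above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table `customer_order`, its columns, and the helper functions are hypothetical, and SQLite stands in for whatever RDBMS a business would actually run. The sketch shows typed columns, a constraint the database enforces, and a batch of inserts that is rolled back as a unit when one row breaks the rules.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_order (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT    NOT NULL,
        quantity   INTEGER NOT NULL CHECK (quantity > 0),
        unit_price REAL    NOT NULL
    )
    """
)


def place_orders(rows):
    """Insert all rows in one transaction; True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO customer_order (customer, quantity, unit_price) "
                "VALUES (?, ?, ?)",
                rows,
            )
        return True
    except sqlite3.IntegrityError:
        return False


def order_count():
    """Return the number of rows currently stored."""
    return conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]


if __name__ == "__main__":
    print(place_orders([("Ann", 2, 9.5)]))                    # valid batch
    print(place_orders([("Bob", 1, 3.0), ("Cy", 0, 1.0)]))    # quantity 0 violates CHECK
    print(order_count())
```

In the second call the first row is valid, but the whole batch is undone because the second row violates the `CHECK` constraint, which is the all-or-nothing behavior a relational design relies on.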

The Relational Model

Logical and physical database
  The two are inextricably tied together
  Objects in the logical representation of the data in the RDBMS were tied to the physical structures in secondary storage
Controlled system
  The RDBMS derived much of its value from this approach
Careful, regulated management
  Highly technical, specialized talent to design, build, and maintain an RDBMS
Mostly transactional data
  Financials, customer orders, inventory
ACID
  Atomic, Consistent, Isolated, Durable

After 30 Years, What Happened?

1950s and 60s: Large monolithic systems
  Consoles networked to a mainframe
1970s: Birth of the microcomputer
  Not networked to anything
1980s: LANs
  Ethernet
  Microcomputers communicated with each other locally
  Servers, printers


After 30 Years, What Happened?

1990s: The Internet became publicly available
  Global network connecting computing resources
  World Wide Web
2000s: Social networks and the smartphone
  Entire platforms devoted to communicating and collaborating
  Bring your communications platform in your shirt pocket
  Everybody is creating data everywhere they go
  ALL of that data is being captured

How Big Data Broke the Data Model

Volume: ingesting the contents of the entire Internet
Velocity: data is being created faster than we can keep up
Variety: text, numbers, character large objects (CLOBs) like documents, binary large objects (BLOBs) like images, audio, video, as well as metadata (data about other data)
Some add veracity/validity, variability/volatility, and value to the three Vs
Other aspects?
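
The "variety" point above can be made concrete with a few records that share almost no structure. The sketch below is an illustration only: the record contents and the helper `shared_and_all_fields` are invented for this example, and plain Python dictionaries stand in for whatever document store a real system would use.

```python
import json

# Three hypothetical records of different kinds; only "type" is common to all.
records = [
    {"type": "tweet", "text": "Data is the new oil", "user": "example_user"},
    {"type": "call", "duration_seconds": 312},
    {
        "type": "photo",
        "blob_hex": b"\x89PNG".hex(),  # stand-in for binary image data
        "metadata": {"width": 640, "height": 480},  # data about other data
    },
]


def shared_and_all_fields(items):
    """Return (fields present in every record, fields present in any record)."""
    key_sets = [set(item) for item in items]
    return set.intersection(*key_sets), set.union(*key_sets)


if __name__ == "__main__":
    shared, everything = shared_and_all_fields(records)
    print("shared fields:", sorted(shared))
    print("all fields:", sorted(everything))
    print(json.dumps(records[2]))
```

A single table with defined columns would need a column for every field that appears in any record, most of them empty for most rows, which is one way to see why this kind of data strains a fixed relational design.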

The Laundry Metaphor

How many of you do your own laundry?
Your family's laundry?
Kids?
What's your process?
  Sort
  Wash
  Dry
  Re-sort
  Fold
  Put away
Doing your family's laundry is like managing a business's data

Google Does the World's Laundry

Platforms & Tools

Google File System (GFS)
  http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf
MapReduce
  http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/mapreduce-osdi04.pdf

GFS and MapReduce go Open Source

Apache Hadoop & HDFS
  http://hadoop.apache.org
Twitter's Storm
  http://stormproject.net
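
The MapReduce idea named above can be sketched in a few lines of plain Python with the classic word-count example. This is a single-process illustration, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are chosen for this example, and a real cluster would run the map and reduce steps in parallel across many machines with the data held in a distributed file system.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big laundry"]))
```

In terms of the laundry metaphor, the map step handles each pile independently, the shuffle step is the sorting, and the reduce step folds each sorted group into one result.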


What Do We Do with the Data Then?

"There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns; there are things we do not know we don't know."
United States Secretary of Defense, Donald Rumsfeld

Data Mining: Putting Big Data to Work

The Scientific Method:
  Ask a Question
  Do Background Research
  Construct a Hypothesis
  Test Your Hypothesis by Doing an Experiment
  Analyze Your Data and Draw a Conclusion
  Communicate Your Results
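
The "test your hypothesis" and "analyze your data" steps above can be sketched with a simple permutation test. This is one possible technique chosen for illustration, not a method the slides prescribe: the function `permutation_test`, its parameters, and the sample numbers are all hypothetical. The question it answers is whether the difference in means between two groups is larger than random relabeling of the data would usually produce.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly relabel which values belong to which group
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


if __name__ == "__main__":
    # Hypothetical measurements for two groups of customers.
    treated = [10, 11, 12, 13, 14, 15, 16, 17]
    control = [1, 2, 3, 4, 5, 6, 7, 8]
    print("estimated p-value:", permutation_test(treated, control))
```

A small p-value suggests the observed difference would rarely arise by chance alone; with only a handful of observations per group the estimate is coarse, and larger samples make it more reliable.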

Data Mining: Putting Big Data to Work

Supervised vs. Unsupervised Learning: A brief explanation
  Bayes Classifier (supervised): Predicting fraudulent financial reporting
  K-Nearest Neighbor (supervised): Classifying customers in other than traditional ways
  Logistic Regression (supervised): Spam filtering
  Neural Networks (unsupervised): Facial recognition
  Time Series Analysis (supervised): In-game analysis at Nickelodeon

Big Data = Human Behavior

Data mining and statistics require sufficient populations of data to disprove/prove a hypothesis
We now have so much data about the behavior (choices) that people make, we can analyze it sufficiently

Big Data Ethics

What is and is not ethical?
How should firms, governments, and all manner of organizations use Big Data?
What is legal and illegal?
What should be legal?
What will you do?

Your Questions

Q&A


Hadoop Demo
Let's look at Cloudera's Hadoop Distribution
