11/18/2013
Big Data Concepts
Professor Mouwafac Sidaoui

Defining Our Terms: What do you think Big Data means?
Based on your experience and knowledge, what is Big Data?
Is it more than just a buzzword or hype?
Why is Big Data significant right now?
Why are some calling the emergence of Big Data and advanced analytics the new Industrial Revolution?
What Do Others Think?
"Data is a precious thing and will last longer than the systems themselves."
  Tim Berners-Lee, inventor of the World Wide Web
"Data is the new oil."
  Clive Humby (a Sheffield mathematician who with his wife, Edwina Dunn, made £90m helping Tesco with its Clubcard system)
"Data Science is not voodoo. We are not building fancy math models for their own sake. We are trying to listen to what the customer is telling us through their behavior."
  Kevin Geraghty, Head of Analytics, 360i
"Big Data is about having an understanding of what your relationship is with the people who are most important to you and an awareness of the potential in that relationship."
  Joe Rospars, Chief Digital Strategist, Obama for America
"If we have data, let's look at data. If all we have are opinions, let's go with mine."
  Jim Barksdale, formerly of Netscape
"Data Scientist: The Sexiest Job of the 21st Century"
  Thomas H. Davenport and D.J. Patil

How Much Data?
Every minute:
  Google processes more than 2 million queries
  Facebook processes 350 GB of data
  Twitter users send out 277,000 tweets
  Sprint processes 250,000 phone calls
  Amazon sold over 18,000 items at its peak (on Cyber Monday) in 2012
  Wal-Mart processes almost 17,000 transactions
  72 hours of new videos are uploaded to YouTube
  More than 100 million emails are generated
  Individuals and organizations launch 571 new websites
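To put the per-minute figures from the "How Much Data?" slide on a daily scale, the minimal Python sketch below multiplies each one by the 1,440 minutes in a day. It assumes a constant rate around the clock, which the slide does not claim, and it treats rounded figures such as "more than 2 million" as exact. The dictionary and function names are illustrative only.

```python
# Per-minute figures as quoted on the "How Much Data?" slide.
# Rounded phrases ("more than", "almost") are treated as exact here.
PER_MINUTE = {
    "Google queries": 2_000_000,
    "Tweets sent": 277_000,
    "Sprint phone calls": 250_000,
    "Wal-Mart transactions": 17_000,
    "Emails generated": 100_000_000,
    "New websites launched": 571,
}

MINUTES_PER_DAY = 60 * 24  # 1,440


def per_day(per_minute: int) -> int:
    """Scale a per-minute count to a per-day count, assuming a constant rate."""
    return per_minute * MINUTES_PER_DAY


if __name__ == "__main__":
    for label, rate in PER_MINUTE.items():
        print(f"{label}: {per_day(rate):,} per day")
```

Under that constant-rate assumption, 2 million queries a minute works out to roughly 2.9 billion queries a day.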
What Can We Do With Data?
Google sells advertisements based on your interests
Facebook places ads targeted at you based on your profile
Twitter studies what you talk about and what you pay attention to
Sprint maximizes its opportunities based on your call patterns
Amazon suggests items to purchase based on what you buy over time
Wal-Mart knows what items you buy in combination and offers them together
YouTube has a profile that describes you perfectly based on what you watch
The NSA knows if you are a likely terrorist based on your emails
The people/organizations who launch websites are observing your behavior

How Do We Treat Our Data?
Data is an ASSET!
GOLD!
What Does the Data Asset Become?
Data: A collection of random facts
Information: Data placed in context
Knowledge: What can be learned from the information
Wisdom: Knowledge combined with experience and intuition

Traditional vs. Big Data
Data Systems: Relational vs. NoSQL
Data Science: Business Intelligence vs. MapReduce
The Relational Model
Has been with us since 1970 (Edgar F. Codd)
First viable commercial RDBMSs in 1979 (Oracle)
Hasn't evolved much
The predominant database model today

The Relational Model
Predictable, structured data
  Tables with defined columns
  Comprised of integers, booleans, characters, floating-point numbers, alphanumeric strings
Expected, planned variability
  The model shifted slowly to incorporate changes in data (not many, though)
Anticipated, measured volume
  Data typically belonged to one entity and was not shared
  Volume tracked the growth of the business entity
Normalized, largely inflexible design
  A database's logical model did not change much, but when it did, it was a big deal
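As a rough illustration of "tables with defined columns" and a normalized design, the sketch below uses Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are hypothetical and are not taken from the slides. The last function shows why a change to the logical model is a deliberate event: it is an explicit schema operation.

```python
import sqlite3


def build_database() -> sqlite3.Connection:
    """Create two related tables whose columns and types are fixed up front."""
    conn = sqlite3.connect(":memory:")  # in-memory, nothing is written to disk
    conn.executescript(
        """
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT    NOT NULL,
            is_active   INTEGER NOT NULL   -- boolean stored as 0 or 1
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            total       REAL    NOT NULL
        );
        """
    )
    conn.execute("INSERT INTO customers VALUES (1, 'Ada', 1)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 20.0), (11, 1, 5.5)],
    )
    conn.commit()
    return conn


def order_summary(conn: sqlite3.Connection) -> list:
    """Join the normalized tables: (name, order count, total spent) per customer."""
    return conn.execute(
        """
        SELECT c.name, COUNT(o.order_id), SUM(o.total)
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id
        ORDER BY c.customer_id
        """
    ).fetchall()


def add_email_column(conn: sqlite3.Connection) -> list:
    """Change the logical model with an explicit schema migration."""
    conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")
    return [row[1] for row in conn.execute("PRAGMA table_info(customers)")]


if __name__ == "__main__":
    db = build_database()
    print(order_summary(db))
    print(add_email_column(db))
```

Customer details live in one table and orders in another, so each fact is stored once and recombined with a join, which is the essence of the normalized design named on the slide.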
The Relational Model
Logical and physical database
  The two are inextricably tied together
  Objects in the logical representation of the data in the RDBMS were tied to the physical structures in secondary storage
Controlled system
  The RDBMS derived much of its value from this approach
Careful, regulated management
  Highly technical, specialized talent to design, build, and maintain an RDBMS
Mostly transactional data
  Financials, customer orders, inventory
ACID
  Atomic, Consistent, Isolated, Durable

After 30 Years, What Happened?
1950s and 60s: Large monolithic systems
  Consoles networked to a mainframe
1970s: Birth of the microcomputer
  Not networked to anything
1980s: LANs
  Ethernet
  Microcomputers communicated with each other locally
  Servers, printers
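The "Atomic" in the ACID list on the relational model slide means that a transaction either happens completely or not at all. The sketch below is one way to see that with Python's built-in sqlite3 module. The accounts table, the account names, and the transfer function are hypothetical and exist only for illustration.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst as one transaction; True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failed debit leaves something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back.
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
        )
    accepted = transfer(conn, "alice", "bob", 30)   # fits within alice's balance
    rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    return accepted, rejected, balances


if __name__ == "__main__":
    print(demo())
```

In the second transfer the credit to bob succeeds before the debit from alice fails, yet bob's balance is unchanged afterwards because the whole transaction is rolled back.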
After 30 Years, What Happened?
1990s: The Internet became publicly available
  Global network connecting computing resources
  World Wide Web
2000s: Social networks and the smartphone
  Entire platforms devoted to communicating and collaborating
  Bring your communications platform in your shirt pocket
  Everybody is creating data everywhere they go
  ALL of that data is being captured

How Big Data Broke the Data Model
Volume: ingesting the contents of the entire Internet
Velocity: data is being created faster than we can keep up
Variety: text, numbers, character large objects (CLOBs) like documents, binary large objects (BLOBs) like images, audio, video, as well as metadata (data about other data)
Some add veracity/validity, variability/volatility, and value to the three Vs
Other aspects?
The Laundry Metaphor
How many of you do your own laundry?
  Your family's laundry?
  Kids?
What's your process?
  Sort
  Wash
  Dry
  Re-sort
  Fold
  Put away
Doing your family's laundry is like managing a business's data

Google Does the World's Laundry
Google File System (GFS)
  http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf
MapReduce
  http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/mapreduce-osdi04.pdf

Platforms & Tools
GFS and MapReduce go Open Source
  Apache Hadoop & HDFS
    http://hadoop.apache.org
  Twitter's Storm
    http://storm-project.net
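The MapReduce programming model named above can be sketched in a few lines of plain Python, using word counting as the example. This is a single-process illustration only: it is not Hadoop's API, and the function names map_phase, shuffle, and reduce_phase are chosen here for clarity. In a real cluster the map and reduce calls would run in parallel on many machines, with the framework handling the grouping step in between.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the list of values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over a collection of documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big laundry"]))
```

In the laundry metaphor, the map step is the sorting of each load, the shuffle gathers like items together, and the reduce step folds each pile into one result.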
What Do We Do with the Data Then?
"There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns; there are things we do not know we don't know."
  United States Secretary of Defense, Donald Rumsfeld

Data Mining: Putting Big Data to Work
The Scientific Method:
  Ask a Question
  Do Background Research
  Construct a Hypothesis
  Test Your Hypothesis by Doing an Experiment
  Analyze Your Data and Draw a Conclusion
  Communicate Your Results
Data Mining: Putting Big Data to Work
Supervised vs. Unsupervised Learning: A brief explanation
  Bayes Classifier (supervised): Predicting fraudulent financial reporting
  K-Nearest Neighbor (supervised): Classifying customers in other than traditional ways
  Logistic Regression (supervised): Spam filtering
  Neural Networks (unsupervised): Facial recognition
  Time Series Analysis (supervised): In-game analysis at Nickelodeon

Big Data = Human Behavior
Data mining and statistics require sufficient populations of data to disprove/prove a hypothesis
We now have so much data about the behavior (choices) that people make, we can analyze it sufficiently

Big Data Ethics
What is and is not ethical?
How should firms, governments, and all manner of organizations use Big Data?
What is legal and illegal?
What should be legal?
What will you do?

Your Questions
Q&A
Hadoop Demo
Let's look at Cloudera's Hadoop Distribution