As per New Syllabus of VTU, 2015 Scheme
Choice Based Credit System (CBCS)

ALL IN ONE
SUNSTAR EXAM SCANNER
(Authored by a Team of Experts)
3. Network Management
► CBCS Model Question Paper - 1
► CBCS Model Question Paper

(VIII Sem B.E. CSE/ISE)
SYLLABUS
BIG DATA ANALYTICS
[AS PER CHOICE BASED CREDIT SYSTEM (CBCS) SCHEME]
(EFFECTIVE FROM THE ACADEMIC YEAR 2016 - 2017)

Subject Code: 15CS82                          IA Marks: 20
Number of Lecture Hours/Week: 04              Exam Marks: 80
Total Number of Lecture Hours: [illegible]    Exam Hours: 03

MODULE 1
Hadoop Distributed File System Basics, Running Example Programs and Benchmarks, Hadoop MapReduce Framework, MapReduce Programming.

MODULE 2
Essential Hadoop Tools, Hadoop YARN Applications, Managing Hadoop with Apache Ambari, Basic Hadoop Administration Procedures.

MODULE 3
Business Intelligence Concepts and Application, Data Warehousing, Data Mining, Data Visualization.

MODULE 4
Decision Trees, Regression, Artificial Neural Networks, Cluster Analysis, Association Rule Mining.

MODULE 5
Text Mining, Naive-Bayes Analysis, Support Vector Machines, Web Mining, Social Network Analysis.

Eighth Semester B.E. Degree Examination
CBCS - Model Question Paper - 1
BIG DATA ANALYTICS
Time: 3 hrs.                                  Max. Marks: 80
Note: Answer any FIVE full questions, selecting ONE full question from each module.

Module - 1
b. With a neat diagram, explain the various system roles in an HDFS deployment. (12 Marks)
Ans. HDFS Components
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes. In a basic design, a single NameNode manages all the metadata needed to store and retrieve the actual data from the DataNodes. No data is actually stored on the NameNode, however. For a minimal Hadoop installation, there needs to be a single NameNode daemon and a single DataNode daemon running on at least one machine.

The design is a master/slave architecture in which the master (NameNode) manages the file system namespace and regulates access to files by clients. File system namespace operations such as opening, closing, and renaming files and directories are all managed by the NameNode. The NameNode also determines the mapping of blocks to DataNodes and handles DataNode failures.

The slaves (DataNodes) are responsible for serving read and write requests from the file system to the clients. The NameNode manages block creation, deletion, and replication.

An example of the client/NameNode/DataNode interaction is provided in Figure 1.1. When a client writes data, it first communicates with the NameNode and requests to create a file. The NameNode determines how many blocks are needed and provides the client with the DataNodes that will store the data. As part of the storage process, the data blocks are replicated after they are written to the assigned node. Depending on how many nodes are in the cluster, the NameNode will attempt to write replicas of the data blocks on nodes that are in other separate racks (if possible). If there is only one rack, then the replicated blocks are written to other servers in the same rack. After the DataNode acknowledges that the file block replication is complete, the client closes the file and informs the NameNode that the operation is complete. Note that the NameNode does not write any data directly to the DataNodes. It does, however, give the client a limited amount of time to complete the operation. If it does not complete in the time period, the operation is canceled.

Reading data happens in a similar fashion. The client requests a file from the NameNode, which returns the best DataNodes from which to read the data. The client then accesses the data directly from the DataNodes.

Thus, once the metadata has been delivered to the client, the NameNode steps back and lets the conversation between the client and the DataNodes proceed. While data transfer is progressing, the NameNode also monitors the DataNodes by listening for heartbeats sent from DataNodes. The lack of a heartbeat signal indicates a potential node failure. In such a case, the NameNode will route around the failed DataNode and begin re-replicating the now-missing blocks. Because the file system is redundant, DataNodes can be taken offline (decommissioned) for maintenance by informing the NameNode of the DataNodes to exclude from the HDFS pool.

The mappings between data blocks and the physical DataNodes are not kept in persistent storage on the NameNode. For performance reasons, the NameNode stores all metadata in memory. Upon startup, each DataNode provides a block report (which it keeps in persistent storage) to the NameNode. The block reports are sent every 10 heartbeats. (The interval between reports is a configurable property.) The reports enable the NameNode to keep an up-to-date account of all data blocks in the cluster.

In almost all Hadoop deployments, there is a SecondaryNameNode. While not explicitly required by a NameNode, it is highly recommended. The term "SecondaryNameNode" (now called CheckPointNode) is somewhat misleading. It is not an active failover node and cannot replace the primary NameNode in case of its failure.

The purpose of the SecondaryNameNode is to perform periodic checkpoints that evaluate the status of the NameNode. Recall that the NameNode keeps all system metadata in memory for fast access. It also has two disk files that track changes to the metadata:
• An image of the file system state when the NameNode was started. This file begins with fsimage_* and is used only at startup by the NameNode.
• A series of modifications done to the file system after starting the NameNode. These files begin with edits_* and reflect the changes made after the fsimage_* file was read.
The location of these files is set by the dfs.namenode.name.dir property in the hdfs-site.xml file.

The SecondaryNameNode periodically merges the fsimage and edits files into a new fsimage, so that a restarting NameNode needs to apply only the edits made since the last checkpoint. If the SecondaryNameNode were not running, a restart of the NameNode could take a prohibitively long time due to the number of changes to the file system.

Figure 1.1 Various system roles in an HDFS deployment

OR

2. a. Explain the MapReduce model with a simple mapper script and a simple reducer script. (08 Marks)
Ans. The MapReduce Model: Apache Hadoop is often associated with MapReduce computing. Prior to Hadoop version 2, this assumption was certainly true. Hadoop version 2 maintained the MapReduce capability and also made other processing models available to users. Virtually all the tools developed for Hadoop, such as Pig and Hive, will work seamlessly on top of the Hadoop version 2 MapReduce.

The MapReduce computation model provides a very powerful tool for many applications and is more common than most users realize. Its underlying idea is very simple. There are two stages: a mapping stage and a reducing stage. In the mapping stage, a mapping procedure is applied to input data. The map is usually some kind of filter or sorting process.
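The two stages can be sketched in a few lines of Python. This is only a single-process illustration, not how Hadoop executes a job. The function names `map_stage` and `reduce_stage` and the sample lines are invented here and do not come from the source.

```python
from collections import Counter


def map_stage(lines, term):
    """Mapping stage: filter the input, emitting (term, 1) for every line that contains term."""
    for line in lines:
        if term in line:
            yield (term, 1)


def reduce_stage(pairs):
    """Reducing stage: collapse the emitted pairs into one total per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical sample input, invented for illustration only.
sample_lines = [
    "Hadoop stores files in HDFS",
    "MapReduce jobs run on Hadoop",
    "Pig and Hive build on both",
]
result = reduce_stage(map_stage(sample_lines, "Hadoop"))
print(result)  # {'Hadoop': 2}
```

The map step only filters and tags; all counting is deferred to the reduce step. That separation is what allows the map work to be split across many machines.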
For instance, assume you need to count how many times the name "Kutuzov" appears in the novel War and Peace. One solution is to gather 20 friends and give them each a section of the book to search. This step is the map stage. The reduce phase happens when everyone is done counting and you sum the total as your friends tell you their counts.

Now consider how this same process could be accomplished using simple *nix command-line tools. The following grep command applies a specific map to a text file:

    $ grep " Kutuzov " war-and-peace.txt

This command searches for the word Kutuzov (with leading and trailing spaces) in a text file called war-and-peace.txt. Each match is reported as a single line of text that contains the search term. The actual text file is a 3.2MB text dump of the novel War and Peace and is available from the book download page. The search term, Kutuzov, is a character in the book. If we ignore the grep count (-c) option for the moment, we can reduce the number of instances to a single number (257) by sending (piping) the results of grep into wc -l. (wc -l or "word count" reports the number of lines it receives.)

    $ grep " Kutuzov " war-and-peace.txt | wc -l
    257

Though not strictly a MapReduce process, this idea is quite similar to and much faster than the manual process of counting the instances of Kutuzov in the printed book. The analogy can be taken a bit further by using the two simple (and naive) shell scripts shown in Listing 1.1 and Listing 1.2. Piping the text file through the two scripts performs the same operation (much more slowly) and tokenizes both the Kutuzov and Petersburg strings in the text.

Notice that more instances of Kutuzov have been found (the first grep command ignored instances like "Kutuzov." or "Kutuzov,"). The mapper inputs a text file and then outputs data in a (key, value) pair (token-name, count) format. Strictly speaking, the input to the script is the file and the keys are Kutuzov and Petersburg. The reducer script takes these key-value pairs, combines the similar tokens, and counts the total number of instances. The result is a new key-value pair (token-name, sum).

Listing 1.1 Simple Mapper Script

    #!/bin/bash
    while read line ; do
      for token in $line; do
        if [ "$token" = "Kutuzov" ] ; then
          echo "Kutuzov,1"
        elif [ "$token" = "Petersburg" ] ; then
          echo "Petersburg,1"
        fi
      done
    done

Listing 1.2 Simple Reducer Script

    #!/bin/bash
    kcount=0
    pcount=0
    while read line ; do
      if [ "$line" = "Kutuzov,1" ] ; then
        let kcount=kcount+1
      elif [ "$line" = "Petersburg,1" ] ; then
        let pcount=pcount+1
      fi
    done
    echo "Kutuzov,$kcount"
    echo "Petersburg,$pcount"

Formally, the MapReduce process can be described as follows. The mapper and reducer functions are both defined with respect to data structured in (key, value) pairs. The mapper takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

    Map(key1, value1) -> list(key2, value2)

The reducer function is then applied to each key-value pair, which in turn produces a collection of values in the same domain:

    Reduce(key2, list(value2)) -> list(value3)

Each reducer call typically produces either one value (value3) or an empty response. Thus, the MapReduce framework transforms a list of (key, value) pairs into a list of values.

The MapReduce model is inspired by the map and reduce functions commonly used in many functional programming languages. The functional nature of MapReduce has some important properties:
i. Data flow is in one direction (map to reduce). It is possible to use the output of a reduce step as the input to another MapReduce process.
ii. As with functional programming, the input data are not changed. By applying the mapping and reduction functions to the input data, new data are produced. In effect, the original state of the Hadoop data lake is always preserved.
iii. Because there is no dependency on how the mapping and reducing functions are applied to the data, the mapper and reducer data flow can be implemented in any number of ways to provide better performance.

Distributed (parallel) implementations of MapReduce enable large amounts of data to be analyzed quickly. In general, the mapper process is fully scalable and can be applied to any subset of the input data. Results from multiple parallel mapping functions are then combined in the reducer phase.

b. Explain the compiling and running process of the Hadoop word count example program. (... Marks)
Ans. WordCount is a simple application that counts the number of occurrences of each word in a given input set. The MapReduce framework operates exclusively on key-value pairs; that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pairs of different types. The MapReduce job proceeds as follows:

    (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

The mapper implementation, via the map method, processes one line at a time as provided by the specified TextInputFormat class. It then splits the line into tokens separated by whitespace using the StringTokenizer and emits a key-value pair of <word, 1>. The relevant code section is as follows:

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }

Given two input files with contents Hello World Bye World and Hello Hadoop Goodbye Hadoop, the WordCount mapper will produce two maps:

    <Hello, 1> <World, 1> <Bye, 1> <World, 1>
    <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
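The map -> combine -> reduce flow can be traced by hand with a small single-process Python sketch. It imitates only the data flow; it does not use the Hadoop API. The helper names `tokenizer_map` and `sum_values` are made up for this illustration.

```python
from collections import Counter
from itertools import chain


def tokenizer_map(line):
    """Map: split a line on whitespace and emit one (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def sum_values(pairs):
    """Combine/reduce: sum the counts per word (the same logic serves both stages)."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


inputs = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
mapped = [tokenizer_map(line) for line in inputs]     # one list of pairs per input
combined = [sum_values(pairs) for pairs in mapped]    # local aggregation, per map
reduced = sum_values(chain.from_iterable(combined))   # final aggregation across maps
print(reduced)
```

The same summing function serves as both combiner and reducer. This works because addition is associative, so pre-aggregating inside each map does not change the final totals.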
WordCount sets a mapper

    job.setMapperClass(TokenizerMapper.class);

a combiner

    job.setCombinerClass(IntSumReducer.class);

and a reducer

    job.setReducerClass(IntSumReducer.class);

Hence, the output of each map is passed through the local combiner (which sums the values in the same way as the reducer) for local aggregation, and the data are then sent on to the final reducer. Thus, each map above the combiner performs the following pre-reductions:

    <Bye, 1> <Hello, 1> <World, 2>
    <Goodbye, 1> <Hadoop, 2> <Hello, 1>

4. To run the example, create an input directory in HDFS and place a text file in the new directory. For this example, we will use the war-and-peace.txt file:

    $ hdfs dfs -mkdir war-and-peace-input
    $ hdfs dfs -put war-and-peace.txt war-and-peace-input

5. Run the WordCount application using the following command:

    $ hadoop jar wordcount.jar WordCount war-and-peace-input war-and-peace-output

If everything is working correctly, Hadoop messages for the job should look like the following (abbreviated version):

    15/05/24 18:13:26 INFO impl.TimelineClientImpl: Timeline service address: http://limulus:8188/ws/v1/timeline/
    15/05/24 18:13:26 INFO client.RMProxy: Connecting to ResourceManager at limulus/10.0.0.1:8050
    15/05/24 18:13:26 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    15/05/24 18:13:26 INFO input.FileInputFormat: Total input paths to process : 1
    15/05/24 18:13:27 INFO mapreduce.JobSubmitter: number of splits:1
    [...]

There are also interactive and batch modes available; they enable Pig applications to be developed locally in interactive modes, using small amounts of data, and then run at scale on the cluster in a production mode. The modes are summarized in Table 2.1.

Table 2.1 Apache Pig Usage Modes

                        Local Mode    Tez Local Mode    MapReduce Mode    Tez Mode
    Interactive Mode    Yes           Experimental      Yes               Yes
    Batch Mode          Yes           Experimental      Yes               Yes

Pig Example Walk-Through
For this example, the following software environment is assumed. Other environments should work in a similar fashion.
• OS: Linux
• Platform: RHEL 6.6
• Hortonworks HDP 2.2 with Hadoop version: 2.6
• Pig version: 0.14.0
If the pseudo-distributed installation is used, instructions for installing Pig are given in "Installation Recipes."

In this simple example, Pig is used to extract user names from the /etc/passwd file. A full description of the Pig Latin language is beyond the scope of this introduction, but more information about Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html. The following example assumes the user is hdfs, but any valid user with access to HDFS can run the example.

To begin the example, copy the passwd file to a working directory for local Pig operation:

    $ cp /etc/passwd .

Next, copy the data file into HDFS for Hadoop MapReduce operation:

    $ hdfs dfs -put passwd passwd

You can confirm the file is in HDFS by entering the following command:

    $ hdfs dfs -ls passwd
    -rw-r--r--   2 hdfs hdfs   2526 2015-03-17 11:08 passwd

In the following example of local Pig operation, all processing is done on the local machine (Hadoop is not used). First, the interactive command line is started:

    $ pig -x local

If Pig starts correctly, you will see a grunt> prompt. You may also see a bunch of INFO messages, which you can ignore. Next, enter the following commands to load the passwd file and then grab the user name and dump it to the terminal. Note that Pig commands must end with a semicolon (;).

    grunt> A = load 'passwd' using PigStorage(':');
    grunt> B = foreach A generate $0 as id;
    grunt> dump B;

The processing will start and a list of user names will be printed to the screen. To exit the interactive session, enter the command quit. To use Hadoop MapReduce, start Pig as follows:

    $ pig -x mapreduce

The same sequence of commands can be entered at the grunt> prompt. You may wish to change the $0 argument to pull out other items in the passwd file. In the case of this simple script, you will notice that the MapReduce version takes much longer. Also, because we are running this application under Hadoop, make sure the file is placed in HDFS.

If you are using the Hortonworks HDP distribution with tez installed, the tez engine can be used as follows:

    $ pig -x tez

Pig can also be run from a script. An example script (id.pig) is available from the example code download (see Appendix A, "Book Webpage and Code Download"). This script is designed to do the same things as the interactive version.

Comments are delineated by /* */ and by -- at the end of a line. The script will create a directory called id.out for the results. First, ensure that the id.out directory is not in your local directory, and then start Pig with the script on the command line:

    $ /bin/rm -r id.out/
    $ pig -x local id.pig

If the script worked correctly, you should see at least one data file with the results and a zero-length file with the name _SUCCESS. To run the MapReduce version, use the same procedure; the only difference is that now all reading and writing takes place in HDFS.

    $ hdfs dfs -rm -r id.out
    $ pig id.pig

If Apache tez is installed, you can run the example script using the -x tez option. You can learn more about writing Pig scripts at http://pig.apache.org/docs/r0.14.0/start.html.

Using Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, ad hoc queries, and the analysis of large data sets using a SQL-like language called HiveQL. Hive is considered the de facto standard for interactive SQL queries over petabytes of data using Hadoop and offers the following features:
• Tools to enable easy data extraction, transformation, and loading (ETL)
• A mechanism to impose structure on a variety of data formats
• Access to files stored either directly in HDFS or in other data storage systems such as HBase
• Query execution via MapReduce and Tez (optimized MapReduce)

Hive provides users who are already familiar with SQL the capability to query the data on Hadoop clusters. At the same time, Hive makes it possible for programmers who are familiar with the MapReduce framework to add their custom mappers and reducers to Hive queries. Hive queries can also be dramatically accelerated using the Apache Tez framework under YARN in Hadoop version 2.

Hive Example Walk-Through
For this example, the following software environment is assumed. Other environments should work in a similar fashion.
• OS: Linux
• Platform: RHEL 6.6
• Hortonworks HDP 2.2 with Hadoop version: 2.6
• Hive version: 0.14.0
Although the following example assumes the user is hdfs, any valid user with access to HDFS can run the example.

To start Hive, simply enter the hive command. If Hive starts correctly, you should get a hive> prompt.

    $ hive
    (some messages may show up here)
    hive>

As a simple test, create and drop a table. Note that Hive commands must end with a semicolon (;).
    OK
    pokes
    Time taken: 0.174 seconds, Fetched: 1 row(s)
    hive> DROP TABLE pokes;
    OK
    Time taken: 4.038 seconds

A more detailed example can be developed using a web server log file to summarize message types. First, create a table using the following command:

    hive> CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string,
          t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS
          TERMINATED BY ' ';
    OK
    Time taken: 0.129 seconds

Next, load the data, in this case from the sample.log file. This file is available from the example code download. Note that the file is found in the local directory and not in HDFS.

    hive> LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;
    Loading data to table default.logs
    Table default.logs stats: [numFiles=1, numRows=0, totalSize=99271, rawDataSize=0]
    OK
    Time taken: 0.953 seconds

Finally, apply the select step to the file. Note that this invokes a Hadoop MapReduce operation. The results appear at the end of the output (e.g., totals for the message types DEBUG, ERROR, and so on).

    hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;
    Query ID = hdfs_20150327130000_d1e1a265-a5d7-4ed8-b785-2c6569791368
    Total jobs = 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1427397392757_0001, Tracking URL = http://norbert:8088/proxy/application_1427397392757_0001/
    Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill job_1427397392757_0001
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2015-03-27 13:00:17,399 Stage-1 map = 0%, reduce = 0%
    2015-03-27 13:00:26,100 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.14 sec
    2015-03-27 13:00:34,979 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.07 sec
    MapReduce Total cumulative CPU time: 4 seconds 70 msec
    Ended Job = job_1427397392757_0001
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.07 sec HDFS Read: 106384 HDFS Write: 63 SUCCESS
    Total MapReduce CPU Time Spent: 4 seconds 70 msec
    OK
    [DEBUG] 434
    [ERROR] 3
    [FATAL] 1
    [INFO] 96
    [TRACE] 816
    [WARN] 4
    Time taken: 32.624 seconds, Fetched: 6 row(s)

To exit Hive, simply type exit;:

    hive> exit;

b. Explain with the following commands in the HBase data model. (... Marks)
1) Create the database
2) Inspect the database
3) Create row
4) Delete a row
5) Remove a table
6) Adding data in bulk

1) Create the Database
The next step is to create the database in HBase using the following command:

    hbase(main):006:0> create 'apple', 'price', 'volume'
    0 row(s) in 0.8150 seconds

In this case, the table name is apple, and two columns are defined. The date will be used as the row key. The price column is a family of four values (open, close, low, high). The put command is used to add data to the database from within the shell. For instance, the preceding data can be entered by using the following commands:

    put 'apple', '6-May-15', 'price:open', '126.56'
    put 'apple', '6-May-15', 'price:high', '126.75'
    put 'apple', '6-May-15', 'price:low', '123.36'
    put 'apple', '6-May-15', 'price:close', '125.01'
    put 'apple', '6-May-15', 'volume', '71820387'

Note that these commands can be copied and pasted into the HBase shell and are available from the book download files. The shell also keeps a history for the session, and previous commands can be retrieved and edited for resubmission.

2) Inspect the Database
The entire database can be listed using the scan command. Be careful when using this command with large databases. This example is for one row.

    hbase(main):006:0> scan 'apple'
    ROW          COLUMN+CELL
    [...]
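One way to picture the table built above is as a nested map: row key, then "family:qualifier" column, then value. The toy Python class below is only a mental model under that assumption. `ToyTable` is a made-up name, it is not the HBase client API, and it ignores the timestamps and cell versions that real HBase stores with every cell.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory stand-in for an HBase table: row key -> {"family:qualifier": value}."""

    def __init__(self, name, *families):
        self.name = name
        self.families = set(families)   # column families are fixed when the table is created
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell; the part before ':' must be a declared column family."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def scan(self):
        """Yield every cell as (row key, column, value), ordered by row key then column."""
        for row_key in sorted(self.rows):
            for column in sorted(self.rows[row_key]):
                yield row_key, column, self.rows[row_key][column]


apple = ToyTable("apple", "price", "volume")
apple.put("6-May-15", "price:open", "126.56")
apple.put("6-May-15", "price:high", "126.75")
apple.put("6-May-15", "volume", "71820387")
cells = list(apple.scan())
print(cells)
```

Only the column families (price, volume) are declared up front. Qualifiers such as open or high come into existence when a value is first put, which is why the create command above lists just two names.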
3) Get a Row
You can use the row key to access an individual row. In the stock price database, the date is the row key.
    hbase(main):008:0> get 'apple', '6-May-15'
    COLUMN          CELL
    price:close     timestamp=1430955128359, value=125.01
    price:high      timestamp=1430955126024, value=126.75
    price:low       timestamp=1430955126053, value=123.36
    price:open      timestamp=1430955125977, value=126.56
    volume:         timestamp=1430955141440, value=71820387
    5 row(s) in 0.130 seconds

4) Delete a Row
You can delete an entire row by giving the deleteall command as follows:

    hbase(main):009:0> deleteall 'apple', '6-May-15'

5) Remove a Table
To remove (drop) a table, you must first disable it. The following two commands remove the apple table from HBase:

    hbase(main):009:0> disable 'apple'
    hbase(main):010:0> drop 'apple'

6) Adding Data in Bulk
There are several ways to efficiently load bulk data into HBase. Covering all of these methods is beyond the scope of this chapter. Instead, we will focus on the ImportTsv utility, which loads data in tab-separated values (tsv) format into HBase. It has two distinct usage modes:
• Loading data from a tsv-format file in HDFS into HBase via the put command
• Preparing StoreFiles to be loaded via the completebulkload utility

The following example shows how to use ImportTsv for the first option, loading the tsv-format file using the put command. The second option works in a two-step fashion and can be explored by consulting http://hbase.apache.org/book.html#importtsv.

The first step is to convert the Apple-stock.csv file to tsv format. The following script, which is included in the book software, will remove the first line and do the conversion. In doing so, it creates a file named Apple-stock.tsv.

    $ convert-to-tsv.sh Apple-stock.csv

Next, the new file is copied to HDFS as follows:

    $ hdfs dfs -put Apple-stock.tsv /tmp

Finally, ImportTsv is run using the following command line. Note the column designation in the -Dimporttsv.columns option. In the example, the HBASE_ROW_KEY is set as the first column, that is, the date for the data.

    $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,price:open,price:high,price:low,price:close,volume apple /tmp/Apple-stock.tsv

The ImportTsv command will use MapReduce to load the data into HBase. To verify that the command works, drop and re-create the apple database, as described previously, before running the import command.

c. What is YARN? Explain any five commands. (04 Marks)
Ans. The Hadoop YARN project includes the Distributed-Shell application, which is an example of a non-MapReduce application built on top of YARN.

Using the YARN Distributed-Shell
For the purpose of the examples presented in the remainder of this chapter, we assume and assign the following installation path, based on Hortonworks HDP 2.2, to the Distributed-Shell application:

    $ export YARN_DS=/usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar

For the pseudo-distributed install using Apache Hadoop version 2.6.0, the following path will run the Distributed-Shell application (assuming $HADOOP_HOME is defined to reflect the location of Hadoop):

    $ export YARN_DS=$HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar

If another distribution is used, search for the file hadoop-yarn-applications-distributedshell*.jar and set $YARN_DS based on its location. Distributed-Shell exposes various options that can be found by running the following command:

    $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar $YARN_DS -help

The output of this command follows:

    usage: Client
     -appname <arg>                              Application Name. Default value - DistributedShell
     -attempt_failures_validity_interval <arg>   when attempt_failures_validity_interval in milliseconds is set to > 0, the failure number will not take failures which happen out of the validityInterval into failure count. If failure count reaches to maxAppAttempts, the application will be failed.
     -container_memory <arg>                     Amount of memory in MB to be requested to run the shell command
     -container_vcores <arg>                     Amount of virtual cores to be requested to run the shell command
     -create                                     Flag to indicate whether to create the domain specified with -domain.
     -debug                                      Dump out debug information
     -domain <arg>                               ID of the timeline domain where the timeline entities will be put
     -help                                       Print usage
     -jar <arg>                                  Jar file containing the application master
     -keep_containers_across_...                 Flag to indicate whether to keep containers across ...
ijt,[\] . of-a Hadoop non-MapReduce·· applicaii~n built. on top of YARN: Distdbuted: Shell is a · applic;;iion:._attem_pis · · _application atter(lpts. lfLie flag is true. running containers ·
will not be retrieved by the new applicafion attempt. -·., · .
!1ii1
1
;/l! . simple mechanism for 11inning shell commands and scripts in cont~inc~s· on multiple nodes
\igyroperties <arg> .. · log4j,properties fil~ . . . . .
!,il_·•.'t::•::i_::
_·,·
·.·
:·•;.; ~t
:.,!;:_.-J;
;·',·,·•.
:
:. !:~•
·.,·. . . · . in a Hado~ir
rather cluster. This
a demonstiation ?.pplication
of.the is nbl mea,nt
non~MnpRcduce to be a production
c·apability tha(can be administratiori tool.
impli:~cn.ted on topbut
of' ·. -master_meineory <qrg>' . :Amount or'me~ory ,i~ MB 10 be· requeste'd 'io fUU th'e, ,
YARN. Tliere.are multiple mature implementations of a distribu1ed shcli°lhat administrators _app/ication _m~st~r. : .. . ' .
< • ,,,. · typically· use to mannge a cluster ofmachines: _1n addition, Distributed-Shell cart be used ~s .. vcorcs
... · ' . .<arg>
· Amo1mt of v __
irtual cores to b·e requC,S:ied___: to run the
• -master~
;J \ :r -~--- a stanirig poirit for explor:ng and building Hadoop YARN _applicatiC>~~~TJi.i1_c.!i!!ptet olfe.rs ,. . . ., -r·aP.plica:ion niaster"· .
. . !,
l\l!/ . -·-~4 =~:~~~:; 1Ww lli< Dis<ab"kd~•ll em be """ " ' " ~ " • ~ ; ~ : : : : : ~ : L. 15
-modify_acls <arg>    Users and groups that allowed to modify the timeline entities in the given domain
-node_label_expression <arg>    Node label expression to determine the nodes where all the containers of this application will be allocated, "" means containers can be allocated anywhere, if you don't specify the option, default node_label_expression of queue will be used.
-num_containers <arg>    No. of containers on which the shell command needs to be executed
-priority <arg>    Application Priority. Default 0
-queue <arg>    RM Queue in which this application is to be submitted
-shell_args <arg>    Command line args for the shell script. Multiple args can be separated by empty space.
-shell_cmd_priority <arg>    Priority for the shell command containers
-shell_command <arg>    Shell command to be executed by the Application Master. Can only specify either --shell_command or --shell_script
-shell_env <arg>    Environment for shell script. Specified as env_key=env_val pairs
-shell_script <arg>    Location of the shell script to be executed. Can only specify either --shell_command or --shell_script
-view_acls <arg>    Users and groups that allowed to view the timeline entities in the given domain

4. a. Explain the views of Apache Ambari.
Ans. After completing the initial installation and logging into Ambari, a dashboard similar to that shown in Figure 4.1 is presented. The same four-node cluster created earlier will be used to explore Ambari. If you need to reopen the Ambari dashboard interface, simply enter the following command (which assumes you are using the Firefox browser, although other browsers may also be used):
$ firefox localhost:8080
The default login and password are admin and admin, respectively. Before continuing any further, you should change the default password. To change the password, select Manage Ambari from the Admin pull-down menu in the upper-right corner. In the management window, click Users under User + Group Management, and then click the admin user name. Select Change Password and enter a new password. When you are finished, click the Go To Dashboard link on the left side of the window to return to the dashboard view.
To leave the Ambari interface, select Sign out from the Admin pull-down menu. The dashboard view summarizes the installed services. A glance at the dashboard should allow you to get a sense of how the cluster is performing.
The top navigation menu bar, shown in Figure 4.1, provides access to the Dashboard, Services, Hosts, Admin, and Views features (the 3x3 cube is the Views menu). The status (up/down) of the various Hadoop services is displayed on the left using green/orange dots. Note that two of the services managed by Ambari are Nagios and Ganglia. These are the standard cluster management services managed by Ambari, and they are used to provide cluster monitoring (Nagios) and metrics (Ganglia).
Dashboard View
The Dashboard view provides small status widgets for many of the services running on the cluster. The actual services are listed on the left-side vertical menu. You can move, edit, remove, or add these widgets as follows:
• Moving: Click and hold a widget while it is moved about the grid.
• Edit: Place the mouse on the widget and click the gray edit symbol in the upper-right corner of the widget. You can change several different aspects (including thresholds) of the widget.
• Remove: Place the mouse on the widget and click the X in the upper-left corner.
• Add: Click the small triangle next to the Metrics tab and select Add. The available widgets will be displayed. Select the widgets you want to add and click Apply.
Some widgets provide additional information when you move the mouse over them. For instance, the DataNodes widget displays the number of live and dead hosts. A widget can also be enlarged for a closer view. For instance, Figure 4.2 provides a detailed view of the CPU Usage widget from Figure 4.1.
The Dashboard view also includes a heatmap view of the cluster. Cluster heatmaps physically map selected metrics across the cluster. When you click the Heatmaps tab, a heatmap for the cluster will be displayed. To select the metric used for the heatmap, choose the desired option from the Select Metric pull-down menu. The heatmap and the scale it uses are displayed in Figure 4.3.
Figure 4.1: Apache Ambari dashboard view of a Hadoop cluster
Configuration History is the final tab in the dashboard window. This view provides a list of configuration changes made to the cluster. As shown in Figure 4.4, Ambari enables configurations to be sorted by service, configuration group, date, and author. To find the specific configuration settings, click the service name. More information on configuration settings is provided later in the chapter.
Services View
The Services menu provides a detailed look at each service running on the cluster. It also provides a graphical method for configuring each service (i.e., instead of hand-editing the /etc/hadoop/conf XML files). The summary tab provides a current Summary view of important service metrics and an Alerts and Health Checks sub-window.
Similar to the Dashboard view, the currently installed services are listed on the left-side menu. To select a service, click the service name in the menu. When applicable, each service will have its own Summary, Alerts and Health Monitoring, and Service Metrics windows. For example, Figure 4.5 shows the service view for HDFS. Important information such as the status of NameNode, SecondaryNameNode, DataNodes, uptime, and available disk space is displayed in the Summary window. The Alerts and Health Checks window provides the latest status of the service and its component systems. Finally, several important real-time service metrics are displayed as widgets at the bottom of the screen.
As on the dashboard, these widgets can be expanded to display a more detailed view. Clicking the Configs tab will open an options form, shown in Figure 4.6, for the service. The options (properties) are the same ones that are set in the Hadoop XML files. The user should manage them only through the Ambari interface; that is, the user should not edit the files by hand.
The current settings available for each service are shown in the form. The administrator can set each of these properties by changing the values in the form. Placing the mouse in the input box of the property displays a short description of each property. Where possible, properties are grouped by functionality. The form also has provisions for adding properties that are not listed. An example of changing service properties and restarting the service components is provided in the "Managing Hadoop Services" section.
If a service provides its own graphical interface (e.g., HDFS, YARN, Oozie), then that interface can be opened in a separate browser tab by using the Quick Links pull-down menu located in the top middle of the window.
Finally, the Service Action pull-down menu in the upper-left corner provides a method for starting and stopping each service and/or its component daemons across the cluster. Some services may have a set of unique actions (such as rebalancing HDFS) that apply to only certain situations. Finally, every service has a Service Check option to make sure the service is working properly. The service check is initially run as part of the installation process and can be valuable when diagnosing problems.
Figure 4.6: Ambari service options for HDFS
Hosts View
Selecting the Hosts menu item provides the information shown in Figure 4.7. The remaining options in the Actions pull-down menu provide control over the various service components running on the hosts.
Further details for a particular host can be found by clicking the host name in the left column. As shown in Figure 4.8, the individual host view provides three sub-windows: Components, Host Metrics, and Summary information. The Components window lists the services that are currently running on the host. Each service can be stopped, restarted, decommissioned, or placed in maintenance mode. The Metrics window displays widgets that provide important metrics (e.g., CPU, memory, disk, and network usage). Clicking the widget displays a larger version of the graphic. The Summary window provides basic information about the host, including the last time a heartbeat was received.
Figure 4.8: Ambari cluster host detail view
Admin View
Explain the basic Hadoop YARN administration. (04 Marks)
Ans. YARN has several administrative features and commands. To find out more about them, examine the YARN commands documentation at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#Administration_Commands. The main administration command is yarn rmadmin (resource manager administration). Enter yarn rmadmin -help to learn more about the various options.
If a NodeManager host/node needs to be removed from the cluster, it should be decommissioned first. Assuming the node is responding, you can easily decommission it from the Ambari web UI. Simply go to the Hosts view, click on the host, and select Decommission from the pull-down menu next to the NodeManager component. Note that the host may also be acting as an HDFS DataNode. Use the Ambari Hosts view to decommission the HDFS host in a similar fashion.
Container memory and cores are governed by the following properties:
• yarn.scheduler.minimum-allocation-mb is the smallest container allowed by the ResourceManager. A requested container smaller than this value will result in an allocated container of this size (default 1024 MB).
• yarn.scheduler.maximum-allocation-mb is the largest container allowed by the ResourceManager.
• yarn.scheduler.minimum-allocation-vcores: The minimum allocation for every container request at the ResourceManager, in terms of virtual CPU cores. Requests smaller than this allocation will not take effect, and the specified value will be allocated the minimum number of cores. The default is 1 core.
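The minimum and maximum allocation settings described above can be illustrated with a short sketch. This is only an illustration under stated assumptions, not YARN's scheduler code: the function name effective_container_mb is hypothetical, the 1024 MB floor is the default quoted above, the 8192 MB ceiling is an assumed value for yarn.scheduler.maximum-allocation-mb, and raising an error for an oversized request is a simplification. Real schedulers may also round requests up to an increment, which is omitted here.

```python
def effective_container_mb(requested_mb, min_mb=1024, max_mb=8192):
    """Return the memory (MB) granted for a request under min/max limits.

    min_mb mirrors the 1024 MB default quoted in the text; max_mb is an
    assumed ceiling chosen only for this illustration.
    """
    if requested_mb > max_mb:
        # Simplification: treat a request above the ceiling as an error.
        raise ValueError("request exceeds the largest allowed container")
    # A request below the floor is raised to the floor.
    return max(requested_mb, min_mb)


if __name__ == "__main__":
    for request in (256, 1024, 3000):
        print(request, "->", effective_container_mb(request))
```

A request for 256 MB is raised to the 1024 MB floor, while a 3000 MB request is granted as asked because it lies between the two limits.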
Business Intelligence: BI is a concept that usually involves the delivery and integration of relevant and useful business information in an organization. Its major components are data warehousing, data mining, querying, and reporting (Figure 5.1).
Figure 5.1: Business intelligence and data mining cycle
The nature of life and businesses is to grow. Information is the lifeblood of business. Businesses use many techniques for understanding their environment and predicting the future for their own benefit and growth. Decisions are made from facts and feelings. Data-based decisions are more effective than those based on feelings alone. Actions based on accurate data, information, knowledge, experimentation, and testing, using fresh insights, are more likely to succeed and lead to sustained growth.
The organization should invest in business intelligence (BI) solutions. Companies use BI to detect significant events and identify/monitor business trends in order to adapt quickly to their changing environment and scenario. If effective business intelligence training is used in the organization, both decision-making processes at all levels of management and tactical strategic management processes can be improved.
BI for better decisions: There are two main kinds of decisions:
• Strategic decisions
• Operational decisions
An unending process of generating fresh new insights in real time can help make better decisions, and thus can be a significant competitive advantage.
BI tools are more important than IT security solutions. BI tools include data warehousing, online analytical processing, social media analytics, reporting, dashboards, querying, and data mining.
BI tools can range from very simple tools that could be considered end-user tools, to very sophisticated tools that offer a very broad and complex set of functionality. Thus, even executives can be their own BI experts, or they can rely on BI specialists to set up the BI mechanisms for them. Thus, large organizations invest in expensive, sophisticated BI solutions that provide good information in real time.
A spreadsheet tool, such as Microsoft Excel, can act as an easy but effective BI tool by itself. Data can be downloaded and stored in the spreadsheet, then analyzed to produce insights, then presented in the form of graphs and tables. This system offers limited automation using macros and other features. The analytical features include basic statistical and financial functions. Pivot tables help do sophisticated what-if analysis. Add-on modules can be installed to enable moderately sophisticated statistical analysis.
A dashboarding system, such as Tableau, can offer a sophisticated set of tools for gathering, analyzing, and presenting data. At the user end, modular dashboards can be designed and redesigned easily with a graphical user interface. The back-end data analytical capabilities include many statistical functions. The dashboards are linked to data warehouses at the back end to ensure that the tables, graphs, and other elements of the dashboard are updated in real time.
Data mining systems, such as IBM SPSS Modeler, are industrial-strength systems.
[Figure: data mining techniques. Legible labels: Supervised Learning, Unsupervised Machine Learning, Cluster Analysis]
passed within the layers of neurons may not make intuitive sense to an observer. Thus, the neural networks are considered a black-box system.
At some point, the neural network will have learned enough and begin to match the predictive accuracy of a human expert or alternative classification techniques. The predictions of some ANNs that have been trained over a long period of time with a large amount of data have become decisively more accurate than human experts. At that point, the ANNs can begin to be seriously considered for deployment, in real situations in real time.
ANNs are popular because they are eventually able to reach a high predictive accuracy. ANNs are also relatively simple to implement and do not have any issues with data quality. ANNs require a lot of data to train to develop good predictive ability.
Cluster analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. There can be any number of clusters that could be produced by the data. The K-means technique is a popular technique and allows the user guidance in selecting the right number (K) of clusters from the data.
Clustering is also known as the segmentation technique. The technique shows the clusters of things from past data. The output is the centroids for each cluster and the allocation of data points to their cluster. The centroid definition is used to assign new data instances to their cluster homes. Clustering is also a part of the artificial intelligence family of techniques.
Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps in answering questions about cross-selling opportunities. This is the heart of the personalization engine used by e-commerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X ⇒ Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers. There are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beers.

b. Why is data preparation so important and time consuming? (04 Marks)
Ans. Data cleansing and preparation is a labor-intensive or semiautomated activity that can take up to 60 to 70 percent of the time needed for a data mining project.
1. Duplicate data needs to be removed. The same data may be received from multiple sources. When merging the data sets, data must be de-duped.
2. Missing values need to be filled in, or those rows should be removed from analysis. Missing values can be filled in with average or modal or default values.
3. Data elements may need to be transformed from one unit to another. For example, total costs of health care and the total number of patients may need to be reduced to cost/patient to allow comparability of that value.
4. Continuous values may need to be binned into a few buckets to help with some analyses. For example, work experience could be binned as low, medium, and high.
5. Data elements may need to be adjusted to make them comparable over time. For example, currency values may need to be adjusted for inflation; they would need to be converted to the same base year for comparability. They may need to be converted to a common currency.
6. Outlier data elements need to be removed after careful review, to avoid the skewing of results. For example, one big donor could skew the analysis of alumni donors in an educational setting.
7. Any biases in the selection of data should be corrected to ensure the data is representative of the phenomena under analysis. If the data includes many more members of one gender than is typical of the population of interest, then adjustments need to be applied to the data.
8. Data should be brought to the same granularity to ensure comparability. Sales data may be available daily, but the salesperson compensation data may only be available monthly. To relate these variables, the data must be brought to the lowest common denominator, in this case, monthly.
9. Data may need to be selected to increase information density. Some data may not show much variability, because it was not properly recorded or for any other reasons. This data may dull the effects of other differences in the data and should be removed to improve the information density of the data.

c. What is data visualization? How would you judge the quality of data visualization? (04 Marks)
Ans. Data visualization is the art and science of making data easy to understand and consume for the end user. Ideal visualization shows the right amount of data, in the right order, in the right visual form, to convey the high-priority information. The right visualization requires an understanding of the consumer's needs, the nature of the data, and the many tools and techniques available to present data. The right visualization arises from a complete understanding of the totality of the situation. One should use visuals to tell a true, complete, and fast-paced story.
Data visualization is the last step in the data life cycle. This is where the data is processed for presentation in an easy-to-consume manner to the right audience for the right purpose. The data should be converted into a language and format that is best preferred and understood by the consumer of data. The presentation should aim to highlight the insights from the data in an actionable manner. If the data is presented in too much detail, then the consumer of that data might lose interest and the insight.
The quality of a data visualization can be judged against the objectives for excellence in visualization.
Data can be presented in the form of rectangular tables, or it can be presented in colorful graphs of various types. "Small, non-comparative, highly-labeled data sets usually belong in tables." However, as the amount of data grows, graphs are preferable. Graphics help give shape to data. Tufte, a pioneering expert on data visualization, presents the following objectives for graphical excellence:
1. Show, and even reveal, the data: The data should tell a story, especially a story hidden in large masses of data. However, reveal the data in context, so the story is correctly told.
2. Induce the viewer to think of the substance of the data: The format of the graph should be so natural to the data that it hides itself and lets the data shine.
3. Avoid distorting what the data have to say: Statistics can be used to lie. In the name of simplifying, some crucial context could be removed, leading to distorted communication.
4. Make large data sets coherent: By giving shape to data, visualizations can help bring the data together to tell a comprehensive story.
5. Encourage the eyes to compare different pieces of data: Organize the chart in ways the eyes would naturally move to derive insights from the graph.
6. Reveal the data at several levels of detail: Graphs lead to insights, which raise further curiosity, and thus presentations should help get to the root cause.
There should be no separation of charts and text in presentation. Each mode should tell a complete story. Intersperse text with the map/graphic to highlight the main insights.
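Returning to the data-preparation steps listed under question b above, three of them (de-duplication, filling missing values with an average, and binning a continuous value) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the record layout, the field names name and experience, and the bin cut-offs of 3 and 10 years are all hypothetical and chosen only for the example.

```python
from statistics import mean


def dedupe(rows):
    """Drop exact duplicate rows while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing(rows, field):
    """Replace None in `field` with the average of the known values."""
    known = [row[field] for row in rows if row[field] is not None]
    average = mean(known)
    return [
        {**row, field: average if row[field] is None else row[field]}
        for row in rows
    ]


def bin_experience(years, low_below=3, high_from=10):
    """Bin years of experience; the cut-offs are illustrative assumptions."""
    if years < low_below:
        return "low"
    if years < high_from:
        return "medium"
    return "high"


def prepare(rows):
    """Apply de-duplication, mean imputation, and binning in sequence."""
    cleaned = fill_missing(dedupe(rows), "experience")
    return [
        {**row, "experience_band": bin_experience(row["experience"])}
        for row in cleaned
    ]


if __name__ == "__main__":
    sample = [
        {"name": "A", "experience": 2},
        {"name": "A", "experience": 2},     # duplicate record
        {"name": "B", "experience": None},  # missing value
        {"name": "C", "experience": 12},
    ]
    for row in prepare(sample):
        print(row)
```

De-duplication runs before imputation so that repeated records do not pull the average toward their value. In practice a modal or default value may suit some fields better than the mean, as step 2 notes.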
Module -4
(.l 1Cu,711/•"•' frt,...·...... ,..;.,..
7. a. What is a decision tree? Why are decision trees the most popular classification technique? (02 Marks)
Ans. A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule) and each leaf represents an outcome (categorical or continuous value). The whole idea is to create a tree like this for the entire data and process a single outcome at every leaf (or minimize the error in every leaf).
Decision trees are a simple way to guide one's path to a decision. The decision may be a simple binary one, whether to approve a loan or not. Or it may be a complex multi-valued decision, as to what may be the diagnosis for a particular sickness. Decision trees are hierarchically branched structures that help one come to a decision based on asking certain questions in a particular sequence.
Decision trees are one of the most widely used techniques for classification. A good decision tree should be short and ask only a few meaningful questions. They are very efficient to use, easy to explain, and their classification accuracy is competitive with other methods. Decision trees can generate knowledge from a few test instances that can then be applied to a broad population. Decision trees are used mostly to answer relatively simple binary decisions.
b. What is a regression model? What is a scatter plot? How does it help? (06 Marks)
Ans. Regression is a well-known statistical technique to model the predictive relationship between several independent variables (IVs) and one dependent variable. The objective is to find the best-fitting curve for a dependent variable in a multidimensional space, with each independent variable being a dimension. The curve could be a straight line, or it could be a nonlinear curve.
The quality of fit of the curve to the data can be measured by a coefficient of correlation (r), which is the square root of the amount of variance explained by the curve.
The key steps for regression are simple:
1. List all the variables available for making the model.
2. Establish a Dependent Variable (DV) of interest.
3. Examine visual (if possible) relationships between variables of interest.
4. Find a way to predict DV using the other variables.
A scatter plot (or scatter diagram) is a simple exercise for plotting all data points between two variables on a two-dimensional graph. It provides a visual layout of where all the data points are placed in that two-dimensional space.
The scatter plot can be useful for graphically intuiting the relationship between two variables. Here is a picture (Figure 7.1) that shows many possible patterns in scatter diagrams.
Figure 7.1: Scatter plots showing types of relationships among two variables (Source: Groebner et al. 2013)
Chart (a) shows a very strong linear relationship between the variables x and y. That means the value of y increases proportionally with x. Chart (b) also shows a strong linear relationship between the variables x and y. Here it is an inverse relationship. That means the value of y decreases proportionally with x.
Chart (c) shows a curvilinear relationship. It is an inverse relationship, which means that the value of y decreases with x. However, it seems a relatively well-defined relationship, like an arc of a circle, which can be represented by a simple quadratic equation (quadratic means the power of two, that is, using terms like x^2 and y^2). Chart (d) shows a positive curvilinear relationship. However, it does not seem to resemble a regular shape, and thus would not be a strong relationship. Charts (e) and (f) show no relationship. That means variables x and y are independent of each other.
Charts (a) and (b) are good candidates for a simple linear regression model (the terms regression model and regression equation can be used interchangeably). Chart (c) could be modeled with a little more complex, quadratic regression equation. Chart (d) might require an even higher order polynomial regression equation to represent the data. Charts (e) and (f) have no relationship, thus they cannot be modeled together, by regression or using any other modeling tool.
c. Examine the steps in developing a neural network for predicting stock prices. What kind of objective function and what kind of data would be required for a good stock price prediction system using ANN?
Ans. It takes resources, training data, skill, and time to develop a neural network. Most data mining platforms offer at least the Multi-Layer Perceptron (MLP) algorithm to implement a neural network. The steps required to build an ANN are as follows:
1. Gather data. Divide into training data and test data. The training data needs to be further divided into training data and validation data.
2. Select the network architecture, such as feedforward network.
3. Select the algorithm, such as Multi-Layer Perceptron.
4. Set network parameters.
5. Train the ANN with training data.
6. Validate the model with validation data.
7. Freeze the weights and other parameters.
8. Test the trained network with test data.
9. Deploy the ANN when it achieves good predictive accuracy.
Other neural network architectures include probabilistic networks and self-organizing feature maps.
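The data-gathering step of the list above (test data held out, the remainder divided into training and validation data) can be sketched as follows. This is only an illustration under assumptions: the function name `chronological_split`, the 20% fractions and the ten placeholder rows are not from the text, and the split is kept in time order on the assumption that price data should not be shuffled.

```python
def chronological_split(rows, validation_fraction=0.2, test_fraction=0.2):
    """Split time-ordered rows into training, validation and test parts.

    The fractions are illustrative assumptions, not values from the text.
    """
    n_test = int(len(rows) * test_fraction)
    n_validation = int(len(rows) * validation_fraction)
    train_end = len(rows) - n_validation - n_test
    training = rows[:train_end]                              # oldest rows
    validation = rows[train_end:train_end + n_validation]    # used to tune the model
    test = rows[train_end + n_validation:]                   # newest rows, held out
    return training, validation, test


rows = list(range(10))  # stand-in for ten trading days of data
training, validation, test = chronological_split(rows)
print(len(training), len(validation), len(test))
```

The test part is touched only once, after the weights are frozen, which mirrors the order of the validation and testing steps in the list.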
This is the neuron that you must be familiar with; well, if you aren't, you should now be grateful that you can understand this because there are billions of neurons in your brain. There are three components to a neuron: the dendrites, the axon and the main body of the neuron. The dendrites are the receivers of the signal and the axon is the transmitter. Alone, a neuron is not of much use, but when it is connected to other neurons, it does several complicated computations and helps operate the most complicated machine on our planet, the human body.
[Figure: neural network diagram with an input layer and a hidden layer]
The neural network will be given the dataset, which consists of the OHLCV data as the input. As the output, we would also give the model the Close price of the next day; this is the value that we want our model to learn to predict. The actual value of the output will be represented by y and the predicted value will be represented by ŷ (y hat). The training of the model involves adjusting the weights of the variables for all the different neurons present in the neural network. This is done by minimizing the 'Cost Function'. The cost function, as the name suggests, is the cost of making a prediction using the neural network. It is a measure of how far off the predicted value, ŷ, is from the actual or observed value, y. There are many cost functions that are used in practice; the most popular one is computed as half of the sum of squared differences of the actual and predicted values for the training dataset.
$$C = \sum \frac{1}{2}\,(\hat{y} - y)^2$$
The way the neural network trains itself is by first computing the cost function for the training dataset for a given set of weights for the neurons. Then it goes back and adjusts the weights, followed by computing the cost function for the training dataset based on the new weights. The process of sending the errors back to the network for adjusting the weights is called backpropagation. This is repeated several times till the cost function has been minimized. We will look at how the weights are adjusted and the cost function is minimized in more detail next.
Gradient Descent
The weights are adjusted to minimize the cost function. One way to do this is through brute force. Suppose we take 1000 values for the weights, and evaluate the cost function for these values. When we plot the graph of the cost function, we will arrive at a graph as shown below. The best value for the weights would be the one corresponding to the minimum of this graph.
[Figure: cost function plotted against the weight values]
Gradient descent instead proceeds by moving in the direction of the steepest slope, in order to reach the minimum in the shortest duration. With this approach, we do not have to do many computations.
Gradient descent can be done in three possible ways: batch gradient descent, stochastic gradient descent and mini-batch gradient descent. In batch gradient descent, the cost function is computed by summing all the individual cost functions in the training dataset and then computing the slope and adjusting the weights. In stochastic gradient descent, the slope of the cost function and the adjustments of weights are done after each data entry in the training dataset. This is extremely useful to avoid getting stuck at a local minimum if the curve of the cost function is not strictly convex. Each time you run the stochastic gradient descent, the process to arrive at the global minimum will be different. Batch gradient descent may result in getting stuck with a suboptimal result if it stops at a local minimum. The third type is the mini-batch gradient descent, which is a combination of the batch and stochastic methods. Here, we create different batches by clubbing together multiple data entries in one batch. This essentially results in implementing the stochastic gradient descent on bigger batches of data entries in the training dataset. Next, let us understand how backpropagation works to adjust the weights according to the error which had been generated.
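To make the cost function and the difference between brute force, batch and stochastic gradient descent concrete, here is a minimal sketch for a model with a single weight, ŷ = w·x. Everything specific in it is an assumption made for illustration: the three data points, the learning rate, the number of steps and the function names are not from the text.

```python
def cost(weight, data):
    """Half of the sum of squared differences between predicted and actual values."""
    return sum(0.5 * (weight * x - y) ** 2 for x, y in data)


def cost_slope(weight, data):
    """Slope (derivative) of the cost with respect to the weight."""
    return sum((weight * x - y) * x for x, y in data)


def brute_force(data, candidates):
    """Evaluate the cost for every candidate weight and keep the cheapest one."""
    return min(candidates, key=lambda w: cost(w, data))


def batch_gradient_descent(data, weight=0.0, learning_rate=0.01, steps=200):
    """Adjust the weight once per pass, using the slope summed over all entries."""
    for _ in range(steps):
        weight -= learning_rate * cost_slope(weight, data)
    return weight


def stochastic_gradient_descent(data, weight=0.0, learning_rate=0.01, epochs=200):
    """Adjust the weight after every single data entry."""
    for _ in range(epochs):
        for x, y in data:
            weight -= learning_rate * (weight * x - y) * x
    return weight


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # made-up points lying on y = 2x
candidates = [i / 100 for i in range(1000)]    # 1000 candidate weights, 0.00 to 9.99
best_by_search = brute_force(data, candidates)
best_by_descent = batch_gradient_descent(data)
best_by_sgd = stochastic_gradient_descent(data)
print(best_by_search, round(best_by_descent, 6), round(best_by_sgd, 6))
```

With one weight the brute-force search is still cheap; the number of candidates grows quickly once several weights have to be tried together, which is where following the slope pays off.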
Backpropagation
Backpropagation is an advanced algorithm which enables us to update all the weights in the neural network simultaneously. This drastically reduces the complexity of the process to adjust weights. If we were not using this algorithm, we would have to adjust each weight individually by figuring out what impact that particular weight has on the error in the prediction. Let us look at the steps involved in training the neural network with Stochastic Gradient Descent:
• Initialize the weights to small numbers very close to 0 (but not 0).
• Forward propagation: the neurons are activated from left to right, by using the first data entry in our training dataset, until we arrive at the predicted result ŷ.
• Measure the error which will be generated.
• Backpropagation: the error generated will be back propagated from right to left, and the weights will be adjusted according to the learning rate.
• Repeat the previous three steps, forward propagation, error computation and backpropagation, on the entire training dataset.
• This would mark the end of the first epoch. The successive epochs will begin with the weight values of the previous epochs. We can stop this process when the cost function converges within a certain acceptable limit.
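The training steps listed above can be sketched on the smallest possible network: one input, one hidden sigmoid neuron and one linear output neuron. This is a hedged illustration rather than the method of any particular library; the network shape, the made-up data, the learning rate, the tolerance and the name `train_sgd` are all assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(data, learning_rate=0.1, max_epochs=2000, tolerance=1e-6, seed=0):
    rng = random.Random(seed)
    # Initialize the weights to small numbers close to 0 (but not 0).
    w_hidden, b_hidden, w_out, b_out = (rng.uniform(0.01, 0.1) for _ in range(4))
    cost_per_epoch = []
    for _ in range(max_epochs):
        epoch_cost = 0.0
        for x, y in data:
            # Forward propagation, from left to right.
            hidden = sigmoid(w_hidden * x + b_hidden)
            y_hat = w_out * hidden + b_out
            # Measure the error for this data entry.
            error = y_hat - y
            epoch_cost += 0.5 * error ** 2
            # Backpropagation, from right to left, using the chain rule.
            grad_hidden_input = error * w_out * hidden * (1.0 - hidden)
            w_out -= learning_rate * error * hidden
            b_out -= learning_rate * error
            w_hidden -= learning_rate * grad_hidden_input * x
            b_hidden -= learning_rate * grad_hidden_input
        # One pass over the whole training dataset is one epoch.
        cost_per_epoch.append(epoch_cost)
        # Stop when the cost changes by less than the tolerance between epochs.
        if len(cost_per_epoch) > 1 and abs(cost_per_epoch[-2] - cost_per_epoch[-1]) < tolerance:
            break
    return (w_hidden, b_hidden, w_out, b_out), cost_per_epoch


data = [(0.0, 0.1), (0.5, 0.4), (1.0, 0.9)]  # made-up (input, target) pairs
weights, cost_per_epoch = train_sgd(data)
print(f"first epoch cost {cost_per_epoch[0]:.4f}, last epoch cost {cost_per_epoch[-1]:.4f}")
```

Each epoch starts from the weights left by the previous one, and the loop ends either at the epoch limit or when the cost stops changing by more than the chosen tolerance.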
ANNs are composed of a large number of highly interconnected processing elements (neurons) working in multi-layered structures that receive inputs, process the inputs, and produce an output. An ANN is designed for a specific application, such as pattern recognition or data classification, and trained through a learning process. Just like in biological systems, ANNs make adjustments to the synaptic connections with each learning instance.
ANNs are like a black box trained into solving a particular type of problem, and they can develop high predictive powers. Their intermediate synaptic parameter values evolve as the system obtains feedback on its predictions, and thus an ANN learns from more training data (Figure 8.1).
Figure: Model for a multi-layer ANN
The processing logic of each neuron may assign different weights to the various incoming input streams. The processing logic may also use a nonlinear transformation, such as a sigmoid function, from the processed values to the output value. This processing logic and the intermediate weight and processing functions are just what works for the system as a whole, in its objective of solving a problem collectively. Thus, neural networks are considered to be an opaque and a black-box system.
The neural network can be trained by making similar decisions over and over again with many training cases. It will continue to learn by adjusting its internal computation and communication based on feedback about its previous decisions. Thus, the neural networks become better at making a decision as they handle more and more decisions.
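The processing logic of a single neuron described above, weighted inputs followed by a sigmoid transformation, can be written in a few lines. The input values, the weights and the optional `bias` term are assumptions chosen for illustration and do not come from the text.

```python
import math


def neuron_output(inputs, weights, bias=0.0):
    """Weight each incoming input, sum them, and apply a sigmoid transformation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))


# 0.5 * 1.0 + (-0.25) * 2.0 = 0.0, and the sigmoid of 0 is 0.5.
print(neuron_output([1.0, 2.0], [0.5, -0.25]))
```

Training changes only the weights (and the bias, if one is used); the sigmoid itself stays fixed and keeps the output between 0 and 1.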
Depending upon the nature of the problem and the availability of good training data, at some point the neural network will learn enough and begin to match the predictive accuracy of a human expert. In many practical situations, the predictions of ANN, trained over a long period of time with a large number of training data, have begun to decisively become more accurate than human experts. At that point ANN can begin to be seriously considered for deployment in real situations in real time.
b. What is unsupervised learning? When is it used? (04 Marks)
Ans. Unsupervised learning, by contrast, does not begin with a target variable. Instead the objective is to find groups of similar records in the data. One can think of unsupervised learning as a form of data compression: we search for a moderate number of representative records to summarize or stand in for the original database. Consider a mobile telecommunications company with 20 million customers. The company database will likely contain various categories of information including customer characteristics such as age and postal code, product information describing the customer's mobile handset, features of the plans the subscriber has selected, details of the subscriber's use of plan features, and billing and payment information. Although it is almost certain that no two subscribers will be identical on every detail in their customer records, we would expect to find groups of customers that are very similar in their overall pattern of demographics, selected equipment, plan use, and spending and payment behavior. If we could find say 30 representative customer types such that the bulk of customers are well described as belonging to their "type", this information could be very useful for marketing, planning, and new product development. We cannot promise that we can find clusters or groupings in data that you will find useful. But we include a method quite distinct from that found in other statistical or data mining software. CART and other Salford data mining modules now include an approach to cluster analysis, density estimation and unsupervised learning using ideas that we trace to Leo Breiman, but which may have been known informally among statisticians at Stanford and elsewhere for some time. The method detects structure in data by contrasting original data with randomized variants of that data. Analysts use this method implicitly when viewing data graphically to identify clusters or other structure in data visually. Take for example customer ages and handsets owned. If there is a pattern in the data then we expect to see certain handsets owned by people in their early 20's, and rather different handsets owned by customers in their early 30's. If every handset is just as likely to be owned in every age group then there is no structure relating these two data dimensions. The method we use generalizes this everyday detection idea to high dimensions.
The method consists of these steps:
Make a copy of the original data, and then randomly scramble each column of data separately. As an example, starting with data typical of a mobile phone company, suppose we randomly exchanged date of birth information in our copy of the database. Each customer record would likely contain age information belonging to another customer. We now repeat this process in every column of the data. Breiman uses a variant in which each column of original data is replaced with a bootstrap resample of the column, and you can use either method in Salford software.
Note that all we have done is moved information about in the database, but other than moving data we have not changed anything. So aggregates such as averages and totals will not have changed. Any one customer record is now a "Frankenstein" record, with items of information having been obtained from a different customer. Thus, date of birth might be from customer ID 1135, the service plan taken from customer 456779 and the spend data from 987001.
Now append the scrambled data set to the original data. We therefore now have the same number of columns as before but twice as many rows. The top portion of the data is the original data and the bottom portion will be the scrambled copy. Add a new column to the data to label records by their data source ("Original" vs. "Copy").
Generate a predictive model to attempt to discriminate between the Original and Copy data sets. If it is impossible to tell, after the fact, which records are original and which are random artifacts, then there is no structure in the data. If it is easy to tell the difference then there is strong structure in the data.
In the CART model separating the Original from the Copy records, nodes with a high fraction of Original records define regions of high density and qualify as potential "clusters". Such nodes reveal patterns of data values which appear frequently in the real data but not in the randomized artifact.
We do not expect the optimal sized tree for cluster detection to be the most accurate separator of Original from Copy records. We recommend that you prune back to a tree size that reveals interesting data groupings.
This approach to unsupervised learning represents an important advance in clustering technology because:
• Variable selection is not necessary and different clusters may be defined on different groups of variables.
• Preprocessing or rescaling of the data is unnecessary as these clustering methods are not influenced by how data is scaled.
• Missing values present no challenges as the methods automatically manage missing data.
• The CART-based clustering gives easy control over the number of clusters and helps ...
c. ... (04 Marks)
Ans. Association rule mining is a popular, unsupervised learning technique, used in business to help identify shopping patterns. It is also known as market basket analysis. It helps find interesting relationships (affinities) between variables (items or events). Thus, it can help cross-sell related items and increase the size of a sale.
All data used in this technique is categorical. There is no dependent variable. It uses machine-learning algorithms. The fascinating "relationship between sales of diapers and beers" is how it is often explained in popular literature. This technique accepts as input the raw point-of-sale transaction data. The output produced is the description of the most frequent affinities among items. An example of an association rule would be, "a customer who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time."
In business environments, a pattern or knowledge can be used for many purposes. In sales and marketing, it is used for cross-marketing and cross-selling, catalog design, e-commerce site design, online advertising optimization, product pricing, and sales/promotion configurations. This analysis can suggest not to put one item on sale at a time, and instead to create a bundle of products promoted as a package to sell other non-selling items.
In retail environments, it can be used for store design. Strongly associated items can be kept close together for customer convenience. Or they could be placed far from each other so that the customer has to walk the aisles and by doing so is potentially exposed to other items.
In medicine, this technique can be used for relationships between symptoms and illnesses; diagnosis and patient characteristics/treatments; genes and their functions; and so on.
Representing Association Rules
A generic rule is represented between a set X and Y: X ⇒ Y [S%, C%]
X, Y: products and/or services
X: Left-hand-side (LHS or Antecedent)
Y: Right-hand-side (RHS or Consequent)
S: Support: how often X and Y go together in the total transaction set
C: Confidence: how often Y goes together with X
Example: {Laptop Computer, Antivirus Software} ⇒ {Extended Service Plan} [30%, 70%]
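Support and confidence for a rule X ⇒ Y can be computed directly from a list of transactions. The sketch below uses a made-up set of 70 baskets, built so that the numbers match the [30%, 70%] example above; the basket contents, the item names and the function name `support_and_confidence` are assumptions for illustration only.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions that contain the left-hand side.
    with_lhs = [t for t in transactions if antecedent <= set(t)]
    # Of those, the ones that also contain the right-hand side.
    with_both = [t for t in with_lhs if consequent <= set(t)]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_lhs) if with_lhs else 0.0
    return support, confidence


# Made-up point-of-sale data: 21 baskets with all three items,
# 9 with only the laptop and antivirus, and 40 unrelated baskets.
transactions = (
    [["laptop", "antivirus", "service plan"]] * 21
    + [["laptop", "antivirus"]] * 9
    + [["printer"]] * 40
)
support, confidence = support_and_confidence(
    transactions, ["laptop", "antivirus"], ["service plan"]
)
print(f"support {support:.0%}, confidence {confidence:.0%}")
```

Support is measured against all transactions, while confidence is measured only against those that already contain X, which is why the two percentages differ.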
Module-5
9. a. Why is text mining useful in the age of social media? (04 Marks)
Ans. Text mining is the art and science of discovering knowledge, insights and patterns from an organized collection of textual databases. Textual mining can help with frequency analysis of important terms, and their semantic relationships.
Text is an important part of the growing data in the world. Social media technologies have enabled users to become producers of text and images and other kinds of information. Text mining can be applied to large-scale social media data for gathering preferences, and measuring emotional sentiments. It can also be applied to societal, organizational and individual scales.
Text mining works on texts from practically any kind of sources, from any business or non-business domains, in any formats including Word documents, PDF files, XML files, text messages, etc. Here are some representative examples:
1. In the legal profession, text sources would include law, court deliberations, court orders, etc.
2. ...
3. The world of finance will include statutory reports, internal reports, CFO statements, and more.
4. In medicine, it would include medical journals, patient histories, discharge summaries, etc.
5. In marketing, it would include advertisements, customer comments, etc.
6. In the world of technology and search, it would include patent applications, the whole of information on the world-wide web, and more.
b. What is a Naive Bayes technique? What do Naive and Bayes stand for?
Ans: Naive Bayes algorithm: Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
Let's understand it using an example. Below I have a training data set of weather and the corresponding target variable 'Play' (suggesting possibilities of playing). Now, we need to classify whether players will play or not based on the weather condition. Let's follow the below steps to perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, like Overcast probability = 0.29 and probability of playing = 0.64.
[Tables: weather data set, frequency table and likelihood table]
Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the above discussed method of posterior probability.
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
Naive Bayes stand for:
... also on the basis of prior experience.
The word Naive represents the strong assumption that all the parameters of the instances are independent variables with little or no correlation. Thus if people are identified by their height, weight, age, gender, all these variables are assumed to be independent of each other.
c. ...
Ans. A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing a plane in two parts, where each class lies on either side.
Confusing? Don't worry, we shall learn in layman's terms.
Suppose you are given a plot of two label classes on a graph as shown in image (A). Can you decide a separating line for the classes?
Image A: Draw a line that separates black circles and blue squares.
You might have come up with something similar to the following image (image B). It fairly separates the two classes. Any point that is left of the line falls into the black circle class and on right
falls into blue square class. Separation of classes. That's what SVM does. It finds out a line/hyperplane (in multidimensional space) that separates out classes. Shortly, we shall discuss why I wrote multidimensional space.
Image B: Sample cut to divide into two classes.
OR
10. a. What are the three types of web mining? (10 Marks)
Ans: The web could be analyzed for its structure as well as content. The usage pattern of web pages could also be analyzed. Depending upon objectives, web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining (Figure 10.1).
[Figure 10.1: Web Mining structure. Web Content Mining (using HTML pages), Web Structure Mining (using URL links), Web Usage Mining (using visits, clicks, etc.)]
Web content mining
A website is designed in the form of pages with a distinct URL (universal resource locator). A large website may contain thousands of pages. These pages and their content are managed using specialized software systems called Content Management Systems. Every page can have text, graphics, audio, video, forms, applications, and more kinds of content including user generated content.
The websites keep a record of all requests received for their pages/URLs, including the requester information using 'cookies'. The log of these requests could be analyzed to gauge the popularity of those pages among different segments of the population. The text and ...
... pages. There are two basic strategic models for successful websites: Hubs and Authorities.
1. Hubs: These are pages with a large number of interesting links. They serve as a hub, or a gathering-point, where people visit to access a variety of information. Media sites like Yahoo.com, or government sites, would serve that purpose. More focused sites like Traveladvisor.com and yelp.com could aspire to becoming hubs for new emerging areas.
2. Authorities: Ultimately, people would gravitate towards pages that provide the most complete and authoritative information on a particular subject. This could be factual information, news, advice, user reviews etc. These websites would have the most number of inbound links from other websites. Thus Mayoclinic.com would serve as an authoritative page for expert medical opinion. NYtimes.com would serve as an authoritative page for daily news.
Web usage mining
As a user clicks anywhere on a webpage or application, the action is recorded by many entities in many locations. The browser at the client machine will record the click, and the web server providing the content would also make a record of the pages served and the user activity on those pages. The entities between the client and the server, such as the router, proxy server, or ad server, too would record that click.
The goal of web usage mining is to extract useful information and patterns from data generated through Web page visits and transactions. The activity data comes from data stored in server access logs, referrer logs, agent logs, and client-side cookies. The user characteristics and usage profiles are also gathered directly, or indirectly, through syndicated data. Further, metadata, such as page attributes, content attributes, and usage data are also gathered.
The web content could be analyzed at multiple levels (Figure 10.2):
1. The server side analysis would show the relative popularity of the web pages accessed. Those websites could be hubs and authorities.
2. The client side analysis could focus on the usage pattern or the actual content consumed and created by users.
1. Usage pattern could be analyzed using 'clickstream' analysis, i.e. analyzing web activity for patterns of sequence of clicks, and the location and duration of visits on websites. Clickstream analysis can be useful for web activity analysis, software testing, market research, and analyzing employee productivity.
2. Textual information accessed on the pages retrieved by users could be analyzed using text mining techniques. The text would be gathered and structured using the bag-of-words technique to build a Term-document matrix. This matrix could then be mined using cluster analysis and association rules for patterns such as popular topics, user segmentation, and
.
apPli.C{'ti(Hl conte11t'(m the pages ~ould be analyzed for ils usage· by" visit counts .. lltc pages .
sentiment ana_lysis: ..
,-----
______ ·
, , . _. , . . _ _ _lia.lhttJ,
. _·..... .:·•. ·:, ·.
· · .:_:,,··. ·.
.. : · ~
;~:~~r:,~:~_s
7;,e;;:1~,:3. . . . ',, .
~:1I{', ' ·. Oil a 1vebsiie tltc1i1se·lves could be analyzed for quality. of content that attracts most users. ' -:r;.~~:~~ '_:
. Web logs, ' · •ld,nttlv uscn ·--•Web'p•i•;·: .
,'ri~;t'.·,, , ,_!
:,['·,~
,•.•,;_:1._! TI1us the tuiwantcd or unpopular-pages could be weeded out, or they can be transformed with
_· Wcb,ho Users, . cit,1<s1rcoms . :::::;;;;';:; ~,:~••..·.. .
< ,: • , . diJTet'ent content and style. 'similarly; 11rcire_resources ,could° be assigned _to keep _ the more· ·,i•'i · ..._cu,t omcr• ' · views · : ~••!_m1,-,1ori_.
Ji i: ·
. .. .
popu_lar pages more fresh and inviting. · · ' · .: . ;,#j
:Jfff,
-----
· .. ·
,...c__;___ __,,
. •
~~;;,~•;~,:
. .. .
!,!,:_. l~
,:.i_\.·.
I Web slrurturc mining . · _ . ,.. . . . . . :
The Web works through a system ofhyperhnks using the hypertext protocol (http). Any page ·-:):?1/l . . .. . . Figurl!: 10.2 IVeb Usag~ Mi;li11g nrd1ill'cl11t1! . , . . . .
1~':
1- j 1 ··cM create a ltyp~rlink to ·any other pJtge; it can be·_linked_to by' another page'. The inlertwined-.•_:;,f::~ /:· \1/eb usage mi.~ing has,ri1any.busines_s appiicalions: It cari help predict user behavior b.ased on, . ·_- . .· .
~.[fl .···._
or self-refcri-al n'aturc ofweli lends it~elfto sonfe.uniq1ie rietwork ai1alyticai algorithms. The:·<11 ?;[/. ' . previo~sly learn'ed rules·and i(sers'. profiles, and can help determine lifetime v~lu? of clients.:· _. ·, ' . ·.
4
:;-· - ~ - ~ ---. .~_. tructure of Web 11agcs could al.so .be analyzed to.
e,xamine1fie,iattem-ofhyperlinkninong\':/!i J)l( -' -. · It"can alsci~1elp des.i~n'·cross-=itiarkctrng-stra;egim!cross products, ~y o_bsetvin,g as~ociaHon ;-:- .- _.-..- .,
'"'h '44 . ""''"' "'"' s...,,.,, ,: , ~-:.' ~11~+~( e.;,";,;. &...11_ii~ I . 45 . ,~,-- _. -
]Tf~:1: . .
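The bag-of-words step mentioned under web usage mining can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and two made-up page texts; the function name `term_document_matrix` is hypothetical and not taken from any library.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Naive bag-of-words tokenization: lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct term, sorted for a stable column order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Two made-up page texts, used only for illustration.
pages = [
    "hadoop stores big data",
    "big data needs big storage",
]
vocab, tdm = term_document_matrix(pages)
print(vocab)
for row in tdm:
    print(row)
```

Each row is one document and each column one term, so the matrix can be handed to a clustering or association-rule routine as the text describes.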
VIII Sem (CSE/ISE)
To list files in your home directory, enter the following command:
$ hdfs dfs -ls
Found 10 items
drwx------   - hdfs hdfs          0 2015-05-27 20:00 .Trash
drwx------   - hdfs hdfs          0 2015-05-26 15:43 .staging
drwxr-xr-x   - hdfs hdfs          0 2015-05-28 13:03 DistributedShell
drwxr-xr-x   - hdfs hdfs          0 2015-05-14 09:19 TeraGen-50GB
drwxr-xr-x   - hdfs hdfs          0 2015-05-14 10:11 TeraSort-50GB
drwxr-xr-x   - hdfs hdfs          0 2015-04-29 16:52 examples
...

Moved: 'hdfs://limulus:8020/user/hdfs/stuff/test' to trash at: hdfs://limulus:8020/user/hdfs/.Trash/Current
Note that when the fs.trash.interval option is set to a non-zero value in core-site.xml, all deleted files are moved to the user's .Trash directory. This can be avoided by including the -skipTrash option.
$ hdfs dfs -rm -skipTrash stuff/test
Deleted stuff/test

Delete a Directory in HDFS
The following command will delete the HDFS directory stuff and all its contents:
$ hdfs dfs -rm -r -skipTrash stuff
Deleted stuff

Get an HDFS Status Report
Regular users can get an abbreviated HDFS status report using the following command. Those with HDFS administrator privileges will get a full (and potentially long) report. Also, this command uses dfsadmin instead of dfs to invoke administrative commands. The status
report is similar to the data presented in the HDFS web GUI.
$ hdfs dfsadmin -report
Configured Capacity: 1503409881088 (1.37 TB)
Present Capacity: 1407945981952 (1.28 TB)
DFS Remaining: 1255510564864 (1.14 TB)
DFS Used: 152435417088 (141.97 GB)
DFS Used%: 10.83%
Under replicated blocks: 54
Blocks with corrupt replicas: 0
Missing blocks: 0
report: Access denied for user deadline. Superuser privilege is required

b. Write a short note on "running MapReduce examples" and also explain the existing available examples. (08 Marks)
Ans. Running MapReduce Examples
All Hadoop releases come with MapReduce example applications. Running the existing MapReduce examples is a simple process once the example files are located. For example, if you installed Hadoop version 2.6.0 from the Apache sources under /opt, the examples will be in the following directory:
/opt/hadoop-2.6.0/share/hadoop/mapreduce/
In other versions, the examples may be in /usr/lib/hadoop-mapreduce/ or some other location. The exact location of the example jar file can be found using the find command:
$ find / -name "hadoop-mapreduce-examples*.jar" -print
Consider the following software environment:
• OS: Linux
• Platform: RHEL 6.6
• Hortonworks HDP 2.2 with Hadoop Version: 2.6
In this environment, the location of the examples is /usr/hdp/2.2.4.2-2/hadoop-mapreduce. For the purposes of this example, an environment variable called HADOOP_EXAMPLES can be defined as follows:
$ export HADOOP_EXAMPLES=/usr/hdp/2.2.4.2-2/hadoop-mapreduce
Once you define the examples path, you can run the Hadoop examples using the commands discussed in the following sections.

Listing Available Examples
A list of the available examples can be found by running the following command. In some cases, the version number may be part of the jar file (e.g., in the version 2.6 Apache sources, the file is named hadoop-mapreduce-examples-2.6.0.jar).
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar
Note: In previous versions of Hadoop, the command hadoop jar ... was used to run MapReduce programs. Newer versions provide the yarn command, which offers more capabilities. Both commands will work for these examples.
The possible examples are as follows:
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that counts the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets.
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort.
terasort: Run the terasort.
teravalidate: Checking results of terasort.
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

OR

Explain with neat diagram Apache Hadoop parallel MapReduce data flow (or) Explain basic steps of MapReduce parallel data flow with the example of word count program (diagram). (08 Marks)
Ans. MapReduce Parallel Data Flow: From a programmer's perspective, the MapReduce algorithm is fairly simple. The programmer must provide a mapping function and a reducing function. Operationally, however, the Apache Hadoop parallel MapReduce data flow can be quite complex. Parallel execution of MapReduce requires other steps in addition to the mapper and reducer processes.
The basic steps are as follows:
1. Input Splits. HDFS distributes and replicates data over multiple servers. The data are broken into chunks, or blocks, of a default size and written to different machines in the cluster. The data are also replicated on multiple machines (typically three machines). These data slices are physical boundaries determined by HDFS and have nothing to do with the data in the file. Also, while not considered part of the MapReduce process, the time required to load and distribute data throughout HDFS servers can be considered part of the total processing time.
The input splits used by MapReduce are logical boundaries based on the input data. For example, the split size can be based on the number of records in a file (if the data exist as records) or an actual size in bytes. Splits are almost always smaller than the HDFS block size. The number of splits corresponds to the number of mapping processes used in the map stage.
2. Map Step. The mapping process is where the parallel nature of Hadoop comes into play. For large amounts of data, many mappers can be operating at the same time. The user provides the specific mapping process. MapReduce will try to execute the mapper on the machines where the block resides. Because the file is replicated in HDFS, the least busy node holding the data will be chosen. If every such node is too busy, MapReduce will try to pick a node that is closest to the node that hosts the data block (a characteristic called rack awareness). The last choice is any node in the cluster that has access to HDFS.
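The word count data flow named in the question can be mimicked in a single Python process. The sketch below is only a teaching model under simplifying assumptions: one split per line, one mapper call per split, and an in-memory grouping standing in for the shuffle. The helper names (`make_splits`, `map_split`, `shuffle`, `reduce_group`) are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict


def make_splits(text):
    # Simplifying assumption: every non-empty line is one input split.
    return [line for line in text.splitlines() if line.strip()]


def map_split(split):
    # Mapper: emit a (word, 1) pair for every word in the split.
    return [(word, 1) for word in split.split()]


def shuffle(mapped):
    # Shuffle: collect the values of identical keys, sorted by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_group(key, values):
    # Reducer: sum the counts collected for one word.
    return key, sum(values)


text = "see spot run\nrun spot run\nsee the cat\n"
splits = make_splits(text)
mapped = [map_split(s) for s in splits]  # one mapper per split
grouped = shuffle(mapped)
counts = dict(reduce_group(k, v) for k, v in grouped.items())
print(counts)
```

On a real cluster each stage runs on different machines and the shuffle moves data over the network, which is why the operational flow is more involved than the two user-supplied functions suggest.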
Figure 1.2 Adding a combiner process to the map step in MapReduce.

see spot run
run spot run
see the cat

The first thing MapReduce will do is create the data splits. For simplicity, each line will be one split. Since each split will require a map task, there are three mapper processes that count the number of words in the split. On a cluster, the results of each map task are written to local disk and not to HDFS. Next, similar keys need to be collected and sent to a reducer process. The shuffle step requires data movement and can be expensive in terms of processing time. Depending on the nature of the application, the amount of data that must be shuffled throughout the cluster can vary from small to large.
Once the data have been collected and sorted by key, the reduction step can begin (even if only partial results are available). It is not necessary, and not normally recommended, to have a reducer for each key-value pair as shown in Figure 1.1. In some cases, a single reducer will provide adequate performance; in other cases, multiple reducers may be required to speed up the reduce phase. The number of reducers is a tunable option for many applications. The final

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

Listing 1.2 Python Reducer Script (reducer.py)
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

The operation of the mapper.py script can be observed by running the command as shown in the following:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
Piping the results of the map into the sort command can create a simulated shuffle phase:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1
bar 1
foo 1
foo 1
foo 1
labs 1
quux 1
quux 1
Finally, the full MapReduce process can be simulated by adding the reducer.py script to the following command pipeline:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
bar 1
foo 3
labs 1
quux 2
To run this application using a Hadoop installation, create, if needed, a directory and move the war-and-peace.txt input file into HDFS:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
Make sure the output directory is removed from any previous test runs:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
Locate the hadoop-streaming.jar file in your distribution. The location may vary, and it may contain a version tag. In this example, the Hortonworks HDP 2.2 distribution was used. The following command line will use the mapper.py and reducer.py to do a word count on the input file.
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -file ./mapper.py \
    -mapper ./mapper.py \
    -file ./reducer.py -reducer ./reducer.py \
    -input war-and-peace-input/war-and-peace.txt \
    -output war-and-peace-output
The output will be the familiar (_SUCCESS and part-00000) in the war-and-peace-output directory. The actual file name may be slightly different depending on your Hadoop version. Also note that the Python scripts used in this example could be Bash, Perl, Tcl, Awk, compiled C code, or any language that can read and write from stdin and stdout.
Although the streaming interface is rather simple, it does have some disadvantages over using Java directly. In particular, not all applications are string and character based, and binary data are awkward to pass through stdin and stdout. Another disadvantage is that many tuning parameters available through the full Java Hadoop API are not available in streaming.

Module - 2

3. a. Explain how to acquire data streams using Apache Flume?
Ans. Apache Flume is an independent agent designed to collect, transport, and store data into HDFS. Often data transport involves a number of Flume agents that may traverse a series of machines and locations. Flume is often used for log files, social media-generated data, email messages, and just about any continuous data source.
As shown in Figure 3.1, a Flume agent is composed of three components.
• Source. The source component receives data and sends it to a channel. It can send the data to more than one channel. The input data can be from a real-time source (e.g., weblog) or another Flume agent.
• Channel. A channel is a data queue that forwards the source data to the sink destination. It can be thought of as a buffer that manages input (source) and output (sink) flow rates.
• Sink. The sink delivers data to a destination such as HDFS, a local file, or another Flume agent.
A Flume agent must have all three of these components defined. A Flume agent can have several sources, channels, and sinks. Sources can write to multiple channels, but a sink can take data from only a single channel. Data written to a channel remain in the channel until a sink removes the data. By default, the data in a channel are kept in memory but may be optionally stored on disk to prevent data loss in the event of a network failure.

Figure 3.1 Flume agent with source, channel, and sink (Adapted from Apache Flume Documentation)
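The source, channel, and sink roles described above can be modelled in plain Python. This is a rough sketch rather than Flume code: the `Channel` class and the `run_source` and `run_sink` functions are hypothetical names, the channel is an in-memory queue, and the "destinations" are ordinary lists standing in for HDFS or a local file.

```python
import queue


class Channel:
    """Bounded in-memory buffer between a source and a sink."""

    def __init__(self, capacity=100):
        self._events = queue.Queue(maxsize=capacity)

    def put(self, event):
        self._events.put(event)

    def take(self):
        # Return the next event, or None when the channel is empty.
        try:
            return self._events.get_nowait()
        except queue.Empty:
            return None


def run_source(lines, channels):
    # A source may write each event to more than one channel.
    for line in lines:
        for channel in channels:
            channel.put(line)


def run_sink(channel, destination):
    # A sink drains exactly one channel; events stay queued until taken.
    while True:
        event = channel.take()
        if event is None:
            break
        destination.append(event)


log_lines = ["GET /index.html", "GET /about.html"]  # made-up weblog events
to_hdfs, to_file = Channel(), Channel()
run_source(log_lines, [to_hdfs, to_file])

hdfs_store, local_store = [], []
run_sink(to_hdfs, hdfs_store)
run_sink(to_file, local_store)
print(hdfs_store, local_store)
```

Keeping the channel as a separate object is what lets the source and sink run at different rates; a disk-backed channel would replace only that class.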
As shown in Figure 3.2, Flume agents may be placed in a pipeline, possibly to traverse several machines or domains. This configuration is normally used when data are collected on one machine (e.g., a web server) and sent to another machine that has access to HDFS.

Figure 3.2 Pipeline created by connecting Flume agents (Adapted from Apache Flume Documentation)

Figure 3.3 A Flume consolidation network (Adapted from Apache Flume Documentation)

In a Flume pipeline, the sink from one agent is connected to the source of another. The data transfer format normally used by Flume, which is called Apache Avro, provides several useful features. First, Avro is a data serialization/deserialization system that uses a compact binary format. The schema is sent as part of the data exchange and is defined using JSON (JavaScript Object Notation). Avro also uses remote procedure calls (RPCs) to send data. That is, an Avro sink will contact an Avro source to send data. Another useful Flume configuration is shown in Figure 3.3. In this configuration, Flume is used to consolidate several data sources before committing them to HDFS. There are many possible ways to construct Flume transport networks. In addition, other Flume features not described in depth here include plug-ins and interceptors that can enhance Flume pipelines.

b. Explain briefly basic HDFS administration? (12 Marks)
Ans. The NameNode User Interface
Monitoring HDFS can be done in several ways. One of the more convenient ways to get a quick view of HDFS status is through the NameNode user interface. This web-based tool provides essential information about HDFS and offers the capability to browse the HDFS namespace and logs.
The web-based UI can be started from within Ambari or from a web browser connected to the NameNode. In Ambari, simply select the HDFS service window and click on the Quick Links pull-down menu in the top middle of the page. Select NameNode UI. A new browser tab will open with the UI shown in Figure 3.4. You can also start the UI directly by entering the following command:
$ firefox http://localhost:50070
There are five tabs on the UI: Overview, Datanodes, Snapshot, Startup Progress, and Utilities. The Overview page provides much of the essential information that the command-line tools also offer, but in a much easier-to-read format. The Datanodes tab displays node information like that shown in Figure 3.5.
The Snapshot window lists the "snapshottable" directories and the snapshots. Further information on snapshots can be found in the "HDFS Snapshots" section.
Figure 3.6 provides a NameNode startup progress view. When the NameNode starts, it reads the previous file system image file (fsimage); applies any new edits to the file system image, thereby creating a new file system image; and drops into safe mode until enough DataNodes come online. This progress is shown in real time in the UI as the NameNode starts. Completed phases are displayed in bold text. The currently running phase is displayed in italics. Phases that have not yet begun are displayed in gray text. In Figure 3.6, all the phases have been completed, and as indicated in the overview window in Figure 3.4, the system is out of safe mode.
The Utilities menu offers two options. The first, as shown in Figure 3.7, is a file system browser. From this window, you can easily explore the HDFS namespace. The second option, which is not shown, links to the various NameNode logs.
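The NameNode startup sequence described above (load the previous image, replay the edits, wait in safe mode) can be pictured with a toy model. Everything here is an assumption made for illustration: the image is a set of paths, edits are (operation, path) pairs, and the 0.999 threshold is just an example value, not a statement about any particular cluster's configuration.

```python
def apply_edits(fsimage, edits):
    """Replay logged edits on a copy of the image and return the new image."""
    image = set(fsimage)
    for op, path in edits:
        if op == "create":
            image.add(path)
        elif op == "delete":
            image.discard(path)
    return image


def in_safe_mode(reported_blocks, total_blocks, threshold=0.999):
    # Stay read-only until the reported fraction of blocks reaches the threshold.
    if total_blocks == 0:
        return False
    return reported_blocks / total_blocks < threshold


old_image = {"/user/hdfs/a.txt", "/user/hdfs/b.txt"}
edits = [("create", "/user/hdfs/c.txt"), ("delete", "/user/hdfs/a.txt")]
new_image = apply_edits(old_image, edits)
print(sorted(new_image))

# Only 90% of blocks reported: still in safe mode.
print(in_safe_mode(reported_blocks=900, total_blocks=1000))
# All blocks reported: safe mode can end.
print(in_safe_mode(reported_blocks=1000, total_blocks=1000))
```

The three phases shown in the startup progress view correspond loosely to the three steps here: reading the image, applying edits, and waiting for block reports.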
[Screenshot: NameNode UI, Datanode Information (In operation, Decommissioning)]
niter the Dnta NoJcs have reported that most file system blocks ar.e available.
Mis-replicated blocks 0 (0.0 o/,)
The adminis:rntor can place IIDFS in Safe Mode by giving the following command:
Dcfr.ull r~plication factor : i S hdfs ,dfsadmin -safcrnotle cn:cr ·
Avcrg;;c block replication : 1.nso 144 Co:rnpt blocks : o· . Entering the fo!lowin g co:nr:ianJ t11:ns off Sifo Mc~c: •
Missing re;,!ic.1s: O(0 .0%) Num~er or darn. nodes: 4 Numb~r or ,~cks: l S hdfs dfs~dr:1in -safcmot.!c lc~,c .
!'~CK ended al Fri M~y 29 14 : 48: 03 ED'l'2015 in 1~53 milliseconds l·IDF.S may drop into Safe Mode if a ·,najor issue arise~ within the.file s~stem (e.g., a full
The. filesyslcm under path'/' is HEALTHY . . . DataNode). The lile system will not leave Safe Mode unti_lthe situation is resolved. To chec_k
Other options provide more detail, include_snapshots and open· fil~~. and management of whether HDr:S'·,s in Safe Mode, cnw the followir.g command: ·
corrupted ffles. · '· $ hdfs dfsadmin -safcmode gel ·
• move moves corrupted files to /lost+ fo1md Dccommlsslo11lng HDF:S Nodes
• delete deletes corrupted files · If .vou need to ·remove ·a DataNode host/node from the' clusler you should decom·-· mission
• files priiits out files being checked it first. Assuming the node is respond_ing. ii c:in be.easily dccommissjoned from ·the Ambnri ·
· ~ o~cnforwrilc prints oi1i files opened for.writes during·tlock . . .- web _UI.°Simply go to the Hosts view, click on the hosfand s~l~ed Deconimissioii from th'e
• lnchidcSnnpshots , includes s1mpshot data, The path indjs;ates .\he existence _of ·a ·.,. pull-di>,vn menu next.to the DataNode component. .- . . . . _, . . . . . . ·
: snapshottnbte directoiy or the presence of snapshottable directories under it. . , · . ; .l Nole th~t' the host may also be ·acting ns ll Yarn NodeManager.' Use- !hi ~~ba.rl H 10
,; llst-corr111itfilcblocks prin:s_out a lis: ·or missi1)g bio'cks and the_files to which theybelong. 1 decommission the YARN host in_a similar fashion. · .· • : · >':·· .. .: .
• b\ocks"prinls out a blo.ck repo11., .. ·. , ·. · · · · The restoration pi-ocess is basically ll slmple copy-froni'lhe snapshot'~ the previau.s dfrectory ·
.• · 1ocatior1s prints Olli locations for every blqck : (or anywhere else). Note the LlSC o(lhe .../. snapshot/wapi-sriap~l.path _lo restore the fiie: . . .
· • racks prints out network topology for data-nodc.locaitQns. ·$ hdfs . dfs-~cp /usedhdfs°iwar-and-peace'.input/.snaps!iot/wapi-sll?.J)-l/war0illif peace·. txt/ ..' ·.
Dahincing_HOFS . .. · , · ::'. . , · .user/hdfs/wa1·-and-pcnce-inp11t •· · .· . · · _: : : · . .
. Based. ori tisage patterns and DataNode availability, the number.of.data blocks across.tho . Confirmation that t111:' filc"h.as. bei:.n restored can be obtained by·issuing the following ·
DataNodesrnay b~com~·u1:blahced. To avoi(I ove'r-utili~d Datai'-/.odes, the HbFS ~alaucer ; • . command: · ·· · · · ·. ·
·· tool rebalances data blocks across ·the availabl~ DataNodes. Data blocks are moved froni:
0
., • 0Ver-utiiiied to'undeMtlilizcd nodes'I() ivithiiia ccrt~irrpei'.cent lhteshold. Reb;ila11cing ciiif t!~---~-1 .: ·~_-i"'!.~ .';~·-·,· :_~j"~-~~~--
be'done-wlie~ new.Dat.iNodes are added ·or.when a,_DaiaNodc is .removed from servji:e. This', ♦""J_~~~!. ~:~·~~:,1 ~:rf~i~.i·~~~-~-.... . •...•...,. • -
, step·do~s i1ot create inllr.c spm:e in l{QFS, bu'traiher. i1h~roves ufliciency. . . . • · .•
_•.=:,-: •:•·:
· The HDFS superuser must run tiie balancer. The simplesi way iq run the ii~!ancei is jo•ei;teri
the followilig command: . ·· · · · · · · · ·
· $ h'ufs balMcer . . .. . . . . .
. Snapshot Summary:
.: :_ ·:.·~- . /' .
By ·defaul1, tbe balancer will coiliii1ue lo rebalance the nodes until the number _of data block
on all.Data Nodes a;·e within ·i 0% of each other. TI1e balancer cai1 be stopped witlt'ou! liarming
. HDFS, ai ~ny l1ine by elllcring. a Ctrl-C, Lo_wer or hfglicr_.thresholds car be set,_usi~_g the
• ,threshold argiiment. For examp_le, gmng tl1e following COIJ'!man_d sets a 5%_thres!iold:_: ..
· · $ hdfs balaricer-threshold 5 .. · .· . . · . ... .·. l
The l~w'er the'threshokl; the longer the bala/\_cer ~viii nm. To ensure'tlie·_ b~lancer' do.es noi ·_..-:
a
. swa1i1p ·1he-cluslerne1works, you can :set _bandwidth limit before runningU1c balancer, as ', ·,
.
...,-,.,.t ......
. .
.
. 00
~--·
1/'1.,;!0ll..-JUJ,_
.· ·+-~.....
···•· -~~:~~--
- -- ~ lofll; .
- ~~ . . . . . .
, . .1,J·-...~-~1-,.loi:.-"""".M-,.....;....M.Nt,-l ·-
. .....~":"~;
~ · , 1~;1Lltl~11''!'
. The 'ncwbandwidth option is the max:inium amount ' of rictw~rk ba1\dwidth, fo bytes per ·, ' ... . ·, __ :,:- '•
·second, that eachDataNodc ·can use during the balancing operation:.. ·. _:; · : . . ; . . ·. · • ..
Balancing data blocks can also break HBase locality. When HBase regions are moved, some data locality is lost, and the RegionServers will then request the data over the network from remote DataNode(s). This condition will persist until a major HBase compaction event takes place (which may either occur at regular intervals or be initiated by the administrator).
HDFS Safe Mode
When the NameNode starts, it loads the file system state from the fsimage and then applies the edits log file. It then waits for DataNodes to report their blocks. During this time, the NameNode stays in a read-only Safe Mode. The NameNode leaves Safe Mode automatically [...]
Figure 3.8 Apache Hadoop NameNode web interface showing snapshot information
$ hdfs dfs -ls /user/hdfs/war-and-peace-input/
Found 1 items
-rw-r--r-- 2 hdfs hdfs 3288746 2015-06-24 21:12 /user/hdfs/war-and-peace-input/war-and-peace.txt
The NameNode UI provides a listing of snapshottable directories and the snapshots that have been taken. Figure 3.8 shows the results of creating the previous snapshot. To delete a snapshot, give the following command:
$ hdfs dfs -deleteSnapshot /user/hdfs/war-and-peace-input wapi-snap-1
To make a directory "un-snapshottable" (or go back to the default state), use the following command:
$ hdfs dfsadmin -disallowSnapshot /user/hdfs/war-and-peace-input
Disallowing snapshot on /user/hdfs/war-and-peace-input succeeded
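The snapshot commands fit together as one life cycle. The Python sketch below only builds the command lines in order and prints them; it does not run anything. The helper name is an assumption for illustration, and the -allowSnapshot and -createSnapshot steps are included on the assumption that they were used to create the snapshot being deleted above.

```python
def snapshot_commands(directory, snapshot_name):
    """Return the argv lists for one HDFS snapshot life cycle, in order."""
    return [
        ["hdfs", "dfsadmin", "-allowSnapshot", directory],
        ["hdfs", "dfs", "-createSnapshot", directory, snapshot_name],
        ["hdfs", "dfs", "-deleteSnapshot", directory, snapshot_name],
        ["hdfs", "dfsadmin", "-disallowSnapshot", directory],
    ]


commands = snapshot_commands("/user/hdfs/war-and-peace-input", "wapi-snap-1")
for argv in commands:
    print(" ".join(argv))
```

The order matters: a directory can be made un-snapshottable only after all of its snapshots have been deleted.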
OR
4. a. How to manage Hadoop services? (08 Marks)
Ans. During the course of normal Hadoop cluster operations, services may fail for any number of reasons. Ambari monitors all of the Hadoop services and reports any service interruption to the dashboard. In addition, when the system was installed, an administrative email for the Nagios monitoring system was required. All service interruption notifications are sent to this email address.
Figure 4.1 shows the Ambari dashboard reporting a down DataNode. The service error indicator numbers next to the HDFS service and Hosts menu item indicate this condition. The DataNode widget also has turned red and indicates that 3/4 DataNodes are operating.
Clicking the HDFS service link in the left vertical menu will bring up the service summary screen shown in Figure 4.2. The Alerts and Health Checks window confirms that a DataNode is down.
The specific host (or hosts) with an issue can be found by examining the Hosts window. As shown in Figure 4.3, the status of host n1 has changed from a green dot with a check mark inside to a yellow dot with a dash inside. An orange dot with a question mark inside indicates the host is not responding and is probably down. Other service interruption indicators may also be set as a result of the unresponsive node.
Figure 4.2 Ambari HDFS service summary window indicating a down DataNode
Figure 4.4 Window for host n1 indicating the DataNode/HDFS service has stopped
When a service daemon is started or stopped, a progress window similar to Figure 4.5 is opened. The progress bar indicates the status of each action. Note that previous actions are part of this window. If something goes wrong during the action, the progress bar will turn red. If the system generates a warning about the action, the progress bar will turn orange.
When these background operations are running, the small ops (operations) bubble on the top menu bar will indicate how many operations are running.
Figure 4.5 Ambari progress window for DataNode restart
Once the DataNode has been restarted successfully, the dashboard will reflect the new status (e.g., 4/4 DataNodes are Live). As shown in Figure 4.6, all four DataNodes are now working and the service error indicators are beginning to slowly disappear. The service error indicators may lag behind the real-time widget updates for several minutes.
Figure 4.6 Ambari dashboard indicating all DataNodes are running (the service error indicators will slowly drop off the screen)
[...] machine and acts as the central authority for allocating resources to the various competing applications in the cluster. The ResourceManager has a central and global view of all cluster resources and, therefore, can ensure fairness, capacity, and locality are shared across all users. Depending on the application demand, scheduling priorities, and resource availability, the ResourceManager dynamically allocates resource containers to applications to run on particular nodes. A container is a logical bundle of resources (e.g., memory, cores) bound to a particular cluster node. To enforce and track such assignments, the ResourceManager interacts with a special system daemon running on each node called the NodeManager. Communications between the ResourceManager and NodeManagers are heartbeat based for scalability. NodeManagers are responsible for local monitoring of resource availability, [...] state. Aside from internal bookkeeping, this process involves allocating a container for the single ApplicationMaster and spawning it on a node in the cluster. Often called container 0, the ApplicationMaster does not have any additional resources at this point, but rather must request additional resources from the ResourceManager.
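The container idea described above can be pictured with a toy model. The Python sketch below is not the YARN API: the class names, the first-fit placement rule, and the memory and core figures are all assumptions chosen for illustration. It only shows that a container is a bundle of memory and cores bound to one node, and that container 0 goes to the ApplicationMaster.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Free resources a (hypothetical) NodeManager last reported."""
    name: str
    free_memory_mb: int
    free_cores: int


@dataclass
class Container:
    """A logical bundle of resources bound to one node."""
    container_id: int
    node: str
    memory_mb: int
    cores: int


@dataclass
class ToyResourceManager:
    nodes: list
    next_id: int = 0
    granted: list = field(default_factory=list)

    def allocate(self, memory_mb, cores):
        """Grant a container on the first node with enough free resources,
        or return None if no node can satisfy the request."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_cores >= cores:
                node.free_memory_mb -= memory_mb
                node.free_cores -= cores
                container = Container(self.next_id, node.name, memory_mb, cores)
                self.next_id += 1
                self.granted.append(container)
                return container
        return None


rm = ToyResourceManager([Node("n1", 4096, 2), Node("n2", 8192, 4)])
am = rm.allocate(1024, 1)        # container 0 hosts the ApplicationMaster
worker = rm.allocate(4096, 2)    # n1 has only 3072 MB left, so n2 is used
too_big = rm.allocate(16384, 1)  # no node can hold this request
print(am, worker, too_big)
```

A real scheduler weighs locality, queues, and fairness rather than taking the first node that fits; the sketch leaves those out on purpose.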
The ApplicationMaster is the "master" user job that manages all application life-cycle aspects, including dynamically increasing and decreasing resource consumption (i.e., containers), managing the flow of execution (e.g., in case of MapReduce jobs, running reducers against the output of maps), handling faults and computation skew, and performing other optimizations. The ApplicationMaster is designed to run arbitrary user code that can be written in any programming language, as all communication with the ResourceManager and NodeManager is encoded using extensible network protocols.
YARN makes few assumptions about the ApplicationMaster, although in practice it expects most jobs will use a higher-level programming framework. By delegating all these functions to ApplicationMasters, YARN's architecture gains a great deal of scalability, programming model flexibility, and improved user agility. For example, upgrading and testing a new MapReduce framework can be done independently of other running MapReduce frameworks.
Typically, an ApplicationMaster will need to harness the processing power of multiple servers to complete a job. To achieve this, the ApplicationMaster issues resource requests to the ResourceManager. The form of these requests includes specification of locality preferences (e.g., to accommodate HDFS use) and properties of the containers. The ResourceManager will attempt to satisfy the resource requests coming from each application according to availability and scheduling policies. When a resource is scheduled on behalf of an ApplicationMaster, the ResourceManager generates a lease for the resource, which is acquired by a subsequent ApplicationMaster heartbeat.
[...] was designed and integrated around managing only MapReduce tasks.
Figure 4.7 illustrates the relationship between the application and YARN components. The YARN components appear as the large outer boxes (ResourceManager and NodeManagers), and the two applications appear as smaller boxes (containers), one dark and one light. Each application uses a different ApplicationMaster; the darker client is running a Message Passing Interface (MPI) application and the lighter client is running a traditional MapReduce application.
The darker client (MPI AM2) is running an MPI application, and the lighter client (MR AM1) is running a MapReduce application.
c. Explain capacity scheduler background. (04 Marks)
Ans. Capacity Scheduler Background
The Capacity scheduler is the default scheduler for YARN that enables multiple groups to securely share a large Hadoop cluster. Developed by the original Hadoop team at Yahoo!, the Capacity scheduler has successfully run many of the largest Hadoop clusters.
To use the Capacity scheduler, one or more queues are configured with a predetermined fraction of the total slot (or processor) capacity. This assignment guarantees a minimum amount of resources for each queue. Administrators can configure soft limits and optional hard limits on the capacity allocated to each queue. Each queue has strict ACLs (Access Control Lists) that control which users can submit applications to individual queues. Also, safeguards are in place to ensure that users cannot view or modify applications from other users.
[...] of demand (i.e., a group is always guaranteed a minimum number of resources is available). Excess slots are given to the most starved queues, based on the number of running tasks divided by the queue capacity. Thus, the fullest queues as defined by their initial minimum capacity guarantee get the most needed resources. Idle capacity can be assigned and provides elasticity for the users in a cost-effective manner.
Administrators can change queue definitions and properties, such as capacity and ACLs, at [...] that while existing applications run to completion, no new applications can be submitted.
The Capacity scheduler currently supports memory-intensive applications, where an application can optionally specify higher memory resource requirements than the default. Using information from the NodeManagers, the Capacity scheduler can then place containers
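The rule for excess slots (most starved queue first, judged by running tasks divided by queue capacity) can be sketched in a few lines of Python. This is a simplified illustration and not the scheduler's actual code: the function names, the queue names, and the numbers are assumptions, and capacity is taken here as a percentage of the cluster.

```python
def most_starved_queue(queues):
    """Pick the queue with the lowest running-tasks-to-capacity ratio.

    queues maps a queue name to (running_tasks, capacity), where capacity
    is the queue's configured share of the cluster.
    """
    return min(queues, key=lambda name: queues[name][0] / queues[name][1])


def hand_out_excess(queues, excess_slots):
    """Give excess slots one at a time to whichever queue is most starved."""
    state = {name: list(values) for name, values in queues.items()}
    grants = {name: 0 for name in queues}
    for _ in range(excess_slots):
        name = most_starved_queue(state)
        state[name][0] += 1   # the granted slot starts running a task
        grants[name] += 1
    return grants


# Hypothetical queues: (running tasks, capacity as a percentage of the cluster).
example_queues = {"research": (10, 50), "marketing": (3, 30), "ops": (4, 20)}
grants = hand_out_excess(example_queues, 3)
print(grants)
```

In this example the marketing queue starts at a ratio of 0.1 against 0.2 for the other two, so it receives all three excess slots before the ratios even out.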
[...] Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2.
In addition to the Capacity scheduler, Hadoop YARN offers a Fair scheduler. More information can be found on the Hadoop website.
[...] internal metrics coming from different company departments, and external data extracted from third-party systems, social media channels, emails, or even macroeconomic data. Ultimately, business intelligence software helps companies gain insight on their overall growth, sales trends, and customer behavior.
1. Sisense
2. Actuate Business Intelligence and Reporting Tools (BIRT)
3. icCube
4. [...]
5. Board Management Intelligence Toolkit
6. Clear Analytics
7. Ducen
8. Gooddata
9. IBM Cognos Intelligence
10. InsightSquared
11. JasperSoft
12. Looker
13. Microsoft BI platform
14. MicroStrategy
15. MITS
16. OpenI
17. Oracle BI
18. Oracle Enterprise BI Server
19. Oracle Hyperion System
20. Palo OLAP Server
21. Pentaho
22. Profitbase
23. QlikView
24. Rapid Insight
25. SAP Business Intelligence
26. SAP BusinessObjects
27. SAP NetWeaver BW
28. SAS BI
29. Silvon
30. Solver
31. SpagoBI
32. SQL Server Analysis Services
33. Style Intelligence
34. Syntell Solutions
35. Targit
36. Vismatica
37. WebFOCUS
38. Yellowfin BI
The BI tool used in our organization [...]
Education
As higher education becomes more expensive and competitive, it is a great user of data-based [...]
[...]
Ans. DW has four key elements (Figure 5.1).
Figure 5.1 Data warehousing architecture. Data sources (operations data, ERP systems, legacy systems, point of sale; external data from suppliers, customers, and government) feed data transformation (extract, cleanse, compute, integrate, and load data), which populates the data warehouse (one data mart for each department, or a warehouse for the whole enterprise), which is used by accessing users and applications (OLAP tools, reporting tools, dashboards, data mining, custom apps).
The first element is the data sources that provide the raw data. The second element is the process of transforming that data to meet the decision needs. The third element is the methods of regularly and accurately loading that data into the EDW or data marts. The fourth element is the data access and analysis part, where devices and applications use the data from DW to deliver insights and other benefits to users.
Data Sources
DWs are created from structured data sources. Unstructured data, such as text data, would need to be structured before being inserted into DW.
1. Operations data include data from all business applications, including from ERP systems that form the backbone of an organization's IT systems. The data to be extracted will depend upon the subject matter of DW. For example, for a sales/marketing DW, only the data about customers, orders, customer service, and so on would be extracted.
2. Other applications, such as point-of-sale (POS) terminals and e-commerce applications, provide customer-facing data. Supplier data could come from supply chain management systems. Planning and budget data should also be added as needed for making comparisons against targets.
3. External syndicated data, such as weather or economic activity data, could also be added to DW, as needed, to provide good contextual information to decision makers.
Figure 5.2 Data warehousing architecture
Data Transformation Processes
The heart of a useful DW is the processes to populate the DW with good quality data. This is called the extract-transform-load (ETL) cycle.
1. Data should be extracted from many operational (transactional) database sources on a regular basis.
2. Extracted data should be aligned together by key fields. It should be cleansed of any irregularities or missing values. It should be rolled up together to the same level of granularity. Desired fields, such as daily sales totals, should be computed. The entire data should then be brought to the same format as the central table of DW.
3. The transformed data should then be uploaded into DW. This ETL process should be run at a regular frequency. Daily transaction data can be extracted from ERPs, transformed, and uploaded to the database the same night. Thus, DW is up to date the next morning. If DW is needed for near-real-time information access, then the ETL processes would need to be executed more frequently. ETL work is usually automated using programming scripts that are written, tested, and then deployed for periodic updating of DW.
DW Design
Star schema is the preferred data architecture for most DWs. There is a central fact table that provides most of the information of interest. There are lookup tables that provide detailed values for codes used in the central table. For example, the central table may use digits to represent a salesperson. The lookup table will help provide the name for that salesperson code. Here is an example of a star schema for a data mart for monitoring sales performance (Figure 5.2).
[...] database management system and the right set of data management tools. There are a few big and reliable providers of DW systems. The provider of the operational DBMS may be chosen for DW also. Alternatively, a best-of-breed DW vendor could be used. There are also a variety of tools out there for data migration, data upload, data retrieval, and data analysis.
DW Access
Data from DW could be accessed for many purposes, through many devices.
1. A primary use of DW is to produce routine management and monitoring reports. For example, a sales performance report would show sales by many dimensions, compared with plan. A dashboarding system will use data from the warehouse and present analysis to users. The data from DW can be used to populate customized performance dashboards for executives. The dashboard could include drill-down capabilities to analyze the performance data for root cause analysis.
2. The data from the warehouse could be used for ad hoc queries and any other applications that make use of the internal data.
3. Data from DW is used to provide data for mining purposes. Parts of the data would be extracted, and then combined with other relevant data, for data mining.
OR
Describe the key steps in the data mining process. Why is it important to follow these processes? (08 Marks)
Effective and successful use of data mining activity requires both business and technology skills. The business aspects help understand the domain and the key questions. It also helps one imagine possible relationships in the data and create hypotheses to test. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform.
An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. The data mining industry has proposed a Cross-Industry Standard Process for Data Mining (CRISP-DM). It has six essential steps (Figure 6.1):
[...] is successful. There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy.
2. A second important step is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of a proposed model as well as the data sets available and required.
3. The data should be clean and of high quality. It is important to assemble a team that has a mix of technical and business skills, who understand the domain and the data. Data cleaning can take 60 to 70 percent of the time in a data mining project. It may be desirable to add new data elements from external sources of data that could help improve predictive accuracy.
4. Patience is required in continuously engaging with the data until the data yields some good insights. A host of modeling tools and algorithms should be used. A tool could be tried with [...] confidence in the solution. Evaluate the model's predictive accuracy with more test data.
6. The dissemination and rollout of the solution is the key to project success. Otherwise the project will be a waste of time and will be a setback for establishing and supporting a data-based decision-process culture in the organization. The model should be embedded in the organization's business processes.
b. What are the data visualization techniques? When would you use tables or graphs?
Ans. Data visualization techniques are:
2. Geometric projection visualization techniques
A drawback of pixel-oriented visualization techniques is that they cannot help us much in understanding the distribution of data in a multidimensional space. Geometric projection techniques help users find interesting projections of multidimensional [...]
A scatter plot displays 2-D data points using Cartesian coordinates. A third dimension can be added using different colors or shapes to represent different data points.
E.g., where x and y are two spatial attributes and the third dimension is represented by different shapes. Through this visualization, we can see that points of types "+" and "X" tend to be collocated.
[...] from All Electronics; there is no clear correlation between income and age.
The subspaces are visualized in a hierarchical manner.
[...] multiple variables are stacked one on top of the other to tell an interesting story. Bars can also be normalized such that the total height of every bar is equal, so it can show the relative composition of each bar.
5. Histograms: These are like bar graphs, except that they are useful in showing data frequencies or data values on classes (or ranges) of a numerical variable.
6. Pie charts: These are very popular to show the distribution of a variable, such as sales by region. The size of a slice is representative of the relative strengths of each value.
7. Box charts: These are a special form of charts to show the distribution of variables. The box shows the middle half of the values, while whiskers on both sides extend to the extreme values in either direction.
8. Bubble graph: This is an interesting way of displaying multiple dimensions in one chart. It is a variant of a scatter plot with many data points marked on two dimensions. Now imagine that each data point on the graph is a bubble (or a circle): the size of the circle and the color fill in the circle could represent two additional dimensions.
11. Pictographs: One can use pictures to represent data. E.g., Figure 6.7 shows the number of liters of water needed to produce one pound of each of the products, where images are used to show the product for easy reference. Each droplet of water also represents 50 liters of water.
Graphs and tables serve different purposes. Choose the appropriate data display to fit your purpose.
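The numbers behind two of the charts described above (histograms and box charts) are simple to compute. The Python sketch below, using only the standard library, derives class frequencies for a histogram and the five values a box chart draws; the function names and the sales figures are made up for illustration, and the "inclusive" quartile method is just one of several conventions.

```python
import statistics
from collections import Counter


def box_chart_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a box chart.

    The box spans Q1 to Q3 (the middle half of the values); the whiskers
    reach out to the minimum and maximum.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def histogram_counts(values, bin_width):
    """Count values per class; bin k covers [k*bin_width, (k+1)*bin_width)."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Made-up daily sales figures, used only to exercise the functions.
sales = [12, 15, 18, 21, 22, 25, 29, 31, 38, 47, 55]
summary = box_chart_summary(sales)
bins = histogram_counts(sales, 10)
print(summary)
print(bins)
```

A table would list the eleven raw values exactly; the histogram and box chart trade that precision for a quick view of where the values concentrate.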
Decision Trees
Should be faster once trained (although both algorithms can train slowly depending on the exact algorithm and the amount/dimensionality of the data). This is because a decision tree inherently "throws away" the input features that it doesn't find useful, whereas a neural net will use them all unless you do some feature selection as a pre-processing step.
If it is important to understand what the model is doing, the trees are very interpretable.
Only model functions which are axis-parallel splits of the data, which may not be the case.
You probably want to be sure to prune the tree to avoid over-fitting.
Neural Nets
Slower (both for training and classification), and less interpretable.
If your data arrives in a stream, you can do incremental updates with stochastic gradient descent (unlike decision trees, which use inherently batch-learning algorithms).
Can model more arbitrary functions (nonlinear interactions, etc.) and therefore might be more accurate, provided there is enough training data. But it can be prone to over-fitting as well.
There are many advantages of using ANN.
1. ANNs impose very few restrictions on their use. ANN can deal with (identify/model) highly nonlinear relationships on their own, without much work from the user or analyst. They help find practical data-driven solutions where algorithmic solutions are nonexistent or too complicated.
2. There is no need to program ANN neural networks, as they learn from examples. They get better with use, without much programming effort.
3. ANN can handle a variety of problem types, including classification, clustering, associations, and so on.
4. ANNs are tolerant of data quality issues, and they do not restrict the data to follow strict normality and/or independence assumptions.
5. ANN can handle both numerical and categorical variables.
6. ANNs can be much faster than other techniques.
7. Most importantly, ANN usually provide better results (prediction and/or clustering) compared to statistical counterparts, once they have been trained enough.
The key disadvantages arise from the fact that they are not easy to interpret or explain or compute.
1. They are deemed to be black box solutions, lacking explainability.
2. Optimal design of ANN is still an art: it requires expertise and extensive experimentation.
3. It can be difficult to handle a large number of variables (especially the rich nominal attributes) with an ANN.
4. It takes large data sets to train an ANN.
b. Define: Cluster. Describe three business applications in your industry where cluster analysis will be useful. (08 Marks)
Ans. Definition of a Cluster: An operational definition of a cluster is that, given a representation of n objects, find K groups based on a measure of similarity, such that objects within the same group are alike but the objects in different groups are not alike.
However, the notion of similarity can be interpreted in many ways. Clusters can differ in terms of their shape, size, and density. Clusters are patterns, and there can be many kinds of patterns. Some clusters are the traditional types, such as data points hanging together. However, there are other clusters, such as all points representing the circumference of a circle. There may be concentric circles with points of different circles representing different clusters. The presence of noise in the data makes the detection of clusters even more difficult.
An ideal cluster can be defined as a set of points that is compact and isolated. In reality, a cluster is a subjective entity whose significance and interpretation requires domain knowledge. In the sample data below (Figure 8.1), how many clusters can one visualize?
Figure 8.1: Visual cluster example
It seems like there are two clusters of approximately equal sizes. However, they can be seen as three clusters, depending on how we draw the dividing lines. There is not a truly optimal way to calculate it. Heuristics are often used to define the number of clusters.
Three business applications:
Cluster analysis is used in almost every field where there is a large variety of transactions. It helps provide characterization, definition, and labels for populations. It can help identify natural groupings of customers, products, patients, and so on. It can also help identify outliers in a specific domain and thus decrease the size and complexity of problems. A prominent business application of cluster analysis is in market research. Customers are segmented into clusters based on their characteristics: wants and needs, geography, price sensitivity, and so on. Here are some examples of clustering:
1. Market segmentation: Categorizing customers according to their similarities, for instance by their common wants and needs, and propensity to pay, can help with targeted marketing.
2. Product portfolio: People of similar sizes can be grouped together to make small, medium, and large sizes for clothing items.
3. Text mining: Clustering can help organize a given collection of text documents according to their content similarities into clusters of related topics.
Module 5
Give the comparison between Text Mining and Data Mining. (08 Marks)
Text mining is a form of data mining. There are many common elements between text and data mining. However, there are some key differences (table below). The key difference is that text mining requires conversion of text data into frequency data before data mining techniques can be applied.
Dimension | Text Mining | Data Mining
Nature of data | Unstructured data: words, phrases, sentences | Numbers; alphabetical and logical values
Language used | Many languages and dialects used in the world; many languages are extinct, new documents are discovered | [...]
Clarity and precision | [...] | [...]
Sentiment | Text may present a clear and consistent or mixed sentiment, across a continuum. Spoken words add further sentiment | N/A
Quality | Spelling errors. Differing values of proper nouns, such as names. Varying quality of [...] | Issues with missing values, outliers, and so on
[...] the computations involved and, hence, is called "naive". This classifier is also called idiot Bayes, simple Bayes, or independent Bayes.
The advantages of Naive Bayes are:
It uses a very intuitive technique. Bayes classifiers, unlike neural networks, do not have several free parameters that must be set. This greatly simplifies the design process.
Since the classifier returns probabilities, it is simpler to apply these results to a wide variety
langu"llc 1rnnslation .
of tasks thnn lf nn arbitrary scale was used. :.
A l\lll 1vi<lc rnhgc ol'staiisticnl
Natui:r o'f K9wonl-basc<l search; co.:.xistcncc of · It does not require large nmou~ts.of dntn btfore learning can begin,, ·
nnd machinc-icnrning nnalysis for
:inHlysis · themes, sentiment mining relationships nml dilfocnccs · · · Nai~e .Bay~s classifiers Me compulational(y fast when making decisiol13.
·o R · . ,
ll. In whut ways is Nnin-:-Uaycs bcltcr._lhau olhc~ clnssificnlion tcdnliqu·cs? Co.mparc wi\i/; \ ' .•I
.' / .,
. ,:so- - ~----
vm Se-1r11 (CSE/LSI)
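The points above about Naive Bayes (few parameters, probability outputs, fast decisions) can be illustrated with a minimal Python sketch. The toy weather data, the function names `train` and `predict_proba`, and the use of add-one (Laplace) smoothing are all assumptions made for illustration; none of them come from the guide.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    vocab = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def predict_proba(model, row):
    """Return normalized class probabilities under the naive independence assumption."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                       # prior P(class)
        for i, value in enumerate(row):
            seen = value_counts[(i, label)][value]
            # Add-one smoothing (an assumption here) avoids zero probabilities.
            score *= (seen + 1) / (count + len(vocab[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical toy data: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("rainy", "no"), ("sunny", "no")]
labels = ["yes", "no", "no", "yes", "yes"]

model = train(rows, labels)
probs = predict_proba(model, ("sunny", "no"))
print(probs)
```

Training is a single counting pass and prediction is a handful of multiplications per class, which is why the classifier is fast and needs little tuning.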
Fig: A network with distinct subnetworks

Computing the Importance of Nodes: When the connections between nodes in the network have a direction to them, the nodes can be compared for their relative influence or rank. This is done using the 'Influence Flow Model'. Every outbound link from a node can be considered an outflow of influence. Every incoming link is similarly an inflow of influence. More in-links to a node means greater importance. Thus there will be many direct and indirect flows of influence between any two nodes in the network.
Computing the relative influence of each node is done on the basis of an input-output matrix of flows of influence among the nodes. Assume each node has an influence value. The computational task is to identify a set of rank values that satisfies the set of links between the nodes. It is an iterative task where we begin with some initial values and continue to iterate till the rank values stabilize.
Consider the following simple network with 4 nodes (A, B, C, D) and 6 directed links between them, as shown in the figure (10.2). Note that there is a bidirectional link. Here are the links:
Node A links into B
Node B links into C
Node C links into D
Node D links into A
Node A links into C
Node B links into A

Fig 10.2

The goal is to find the relative importance, or rank, of every node in the network. This will help identify the most important node(s) in the network.
We begin by assigning the variables for the influence (or rank) value of each node, as Ra, Rb, Rc and Rd. The goal is to find the relative values of these variables.
There are two outbound links from node A, to nodes B and C. Thus, both B and C receive half of node A's influence. Similarly, there are two outbound links from node B, to nodes C and A, so both C and A receive half of node B's influence.
There is only one outbound link from node D, to node A. Thus, node A gets all the influence of node D. There is only one outbound link from node C, to node D, and hence node D gets all the influence of node C.
Node A gets all of the influence of node D and half the influence of node B.
Thus, Ra = 0.5 × Rb + Rd.
Node B gets half the influence of node A.
Thus, Rb = 0.5 × Ra.
Node C gets half the influence of node A and half the influence of node B.
Thus, Rc = 0.5 × Ra + 0.5 × Rb.
Node D gets all of the influence of node C.
Thus, Rd = Rc.
We have 4 equations using 4 variables. These can be solved mathematically.
We can represent the coefficients of these 4 equations in matrix form, as shown in Data Set 10.1 below. This is the Influence Matrix. A zero value means that the term is not represented in the equation.

Data Set 10.1
        Ra      Rb      Rc      Rd
Ra      0       0.50    0       1.00
Rb      0.50    0       0       0
Rc      0.50    0.50    0       0
Rd      0       0       1.00    0

For simplification, let us also state that all the rank values add up to 1. Thus, each node has a fraction as its rank value. Let us start with an initial set of rank values and then iteratively compute new rank values till they stabilize. One can start with any initial rank values, such as 1/n or 1/4 for each of the nodes.

Variable    Initial Value
Ra          0.250
Rb          0.250
Rc          0.250
Rd          0.250

Computing the revised values using the equations stated earlier, we get a revised set of values, shown as Iteration 1.

Variable    Initial Value    Iteration 1
Ra          0.250            0.375
Rb          0.250            0.125
Rc          0.250            0.250
Rd          0.250            0.250

Using the rank values from Iteration 1 as the new starting values, we can compute new values for these variables, shown as Iteration 2. Rank values will continue to change.
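The iterative procedure above can be sketched in a few lines of Python. This is only an illustration: the matrix is the influence matrix of Data Set 10.1, while the function names, the stopping tolerance and the iteration cap are assumptions not taken from the guide.

```python
# Influence matrix from Data Set 10.1.
# Row = node receiving influence, column = node giving influence (order A, B, C, D).
INFLUENCE = [
    [0.0, 0.5, 0.0, 1.0],   # Ra = 0.5*Rb + Rd
    [0.5, 0.0, 0.0, 0.0],   # Rb = 0.5*Ra
    [0.5, 0.5, 0.0, 0.0],   # Rc = 0.5*Ra + 0.5*Rb
    [0.0, 0.0, 1.0, 0.0],   # Rd = Rc
]

def iterate_once(matrix, ranks):
    """Apply the four rank equations once to the current rank values."""
    return [sum(w * r for w, r in zip(row, ranks)) for row in matrix]

def compute_ranks(matrix, ranks, tolerance=1e-6, max_iterations=1000):
    """Repeat the update until no rank changes by more than `tolerance`."""
    for iteration in range(1, max_iterations + 1):
        new_ranks = iterate_once(matrix, ranks)
        if max(abs(a - b) for a, b in zip(new_ranks, ranks)) < tolerance:
            return new_ranks, iteration
        ranks = new_ranks
    return ranks, max_iterations

initial = [0.25, 0.25, 0.25, 0.25]
first = iterate_once(INFLUENCE, initial)    # Iteration 1
second = iterate_once(INFLUENCE, first)     # Iteration 2
final, steps = compute_ranks(INFLUENCE, initial)

print("Iteration 1:", first)
print("Iteration 2:", second)
print("Stabilized :", [round(r, 3) for r in final])
```

Because every column of the matrix sums to 1, each update redistributes influence without creating or losing any, so the ranks keep adding up to 1 throughout.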
Variable    Initial Value    Iteration 1    Iteration 2
Ra          0.250            0.375          0.3125
Rb          0.250            0.125          0.1875
Rc          0.250            0.250          0.250
Rd          0.250            0.250          0.250

Working from the values of Iteration 2 and so on, we can do a few more iterations till the values stabilize. Data Set 10.2 shows the final values after the 8th iteration.

Data Set 10.2
Variable    Initial Value    Iteration 1    Iteration 2    ...    Iteration 8
Ra          0.250            0.375          0.313          ...    0.333
Rb          0.250            0.125          0.188          ...    0.167
Rc          0.250            0.250          0.250          ...    0.250
Rd          0.250            0.250          0.250          ...    0.250

The final ranks show that the rank of node A is the highest, at 0.333. Thus, the most important node is A. The lowest rank is 0.167, for Rb. Thus, B is the least important node. Nodes C and D are in the middle. In this case, their ranks did not change at all.
The relative scores of the nodes in this network would have been the same irrespective of the initial values chosen for the computations. It may take a larger or smaller number of iterations for the results to stabilize for different sets of initial values.

PAGERANK: PageRank is a particular application of the social network analysis technique above to compute the relative importance of websites in the overall World Wide Web. The data on websites and their links is gathered through web crawler bots that traverse the web pages at frequent intervals. Every web page is a node in the social network, and all the hyperlinks from that page become directed links to other web pages. Every outbound link from a web page is considered an outflow of influence of that web page. An iterative computational technique is applied to compute a relative importance for each page. That score is called PageRank, according to an eponymous algorithm invented by the founders of Google, the web search company.
PageRank is used by Google for ordering the display of websites in response to search queries. To be shown higher in the search results, many website owners try to artificially boost their PageRank by creating many dummy websites whose ranks can be made to flow into their desired websites. Also, many websites can be designed to have cyclical sets of links from which the web crawler may not be able to break out. These are called spider traps.
To overcome these and other challenges, Google includes a Teleporting factor in computing the PageRank. Teleporting assumes that there is a potential link from any node to any other node, irrespective of whether it actually exists. Thus, the influence matrix is multiplied by a weighting factor called Beta, with a typical value of 0.85 (85%). The remaining weight of 0.15 (15%) is given to teleportation. In the Teleportation matrix, each cell is given a rank of 1/n, where n is the number of nodes in the web. The two matrices are added to compute the final influence matrix. This matrix can be used to iteratively compute the PageRank of all the nodes.

Eighth Semester B.E. Degree Examination,
CBCS - Model Question Paper - 3
BIG DATA ANALYTICS
Time: 3 hrs.                                                    Max. Marks: 80
Note: Answer any FIVE full questions, selecting ONE full question from each module.

Module-1
1. a. Write note on following: (08 Marks)
(1) Rack awareness
(2) HDFS snapshots
(3) HDFS NameNode Federation

(1) Rack Awareness
Rack awareness deals with data locality. Recall that one of the main design goals of Hadoop MapReduce is to move the computation to the data. Assuming that most data centre networks do not offer bisection bandwidth, a typical Hadoop cluster will exhibit three levels of data locality:
1. Data resides on the local machine (best).
2. Data resides in the same rack (better).
3. Data resides in a different rack (good).
When the YARN scheduler is assigning MapReduce containers to work as mappers, it will try to place the container first on the local machine, then on the same rack, and finally on another rack.
In addition, the NameNode tries to place replicated data blocks on multiple racks for improved fault tolerance. In such a case, an entire rack failure will not cause data loss or stop HDFS from working. Performance may be degraded, however.
HDFS can be made rack-aware by using a user-derived script that enables the master node to map the network topology of the cluster. A default Hadoop installation assumes all the nodes belong to the same (large) rack.

(2) HDFS Snapshots
[...] -snapshot command. HDFS snapshots are read-only point-in-time copies of the file system. [...] Blocks on the DataNodes are not copied, because the snapshot files record the block list and the file size. There is no data copying, although it appears to the user that there are [...]

(3) HDFS NameNode Federation
Another important feature of HDFS is NameNode Federation. Older versions of HDFS [...]
The two benchmarks discussed here, terasort and TestDFSIO, provide a good sense of how well your Hadoop installation is operating and can be compared with public data published for other Hadoop systems. The results, however, should not be taken as a single indicator for system-wide performance on all applications.
The following benchmarks are designed for full Hadoop cluster installations. These tests assume a multi-disk HDFS environment. Running these benchmarks in the Hortonworks Sandbox or in the pseudo-distributed single-node install is not recommended because all input and output (I/O) are done using a single system disk drive.

Running the Terasort Test
The terasort benchmark sorts a specified amount of randomly generated data. This benchmark provides combined testing of the HDFS and MapReduce layers of a Hadoop cluster. A full terasort benchmark run consists of the following three steps:
1. Generating the input data via the teragen program.
[...]

[...] 16 files of size 1 GB are specified. Note that the TestDFSIO benchmark is part of the hadoop-mapreduce-client-jobclient.jar. Other benchmarks are also available as part of this jar file. Running it with no arguments will yield a list. In addition to TestDFSIO, NNBench (load testing the NameNode) and MRBench (load testing the MapReduce framework) are commonly used Hadoop benchmarks. Nevertheless, TestDFSIO is perhaps the most widely reported of these benchmarks. The steps to run TestDFSIO are as follows:
1. Run TestDFSIO in write mode and create data.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 16 -fileSize 1000
Example results are as follows (date and time prefix removed).
fs.TestDFSIO: ----- TestDFSIO ----- : write
fs.TestDFSIO: Date & time: Thu May 14 10:39:33 EDT 2015
fs.TestDFSIO: Number of files: 16
Example results are as follows (date and time prefix removed). The large standard deviation is due to the placement of tasks in the cluster on a small four-node cluster.
fs.TestDFSIO: ----- TestDFSIO ----- : read
fs.TestDFSIO: Date & time: Thu May 14 10:44:09 EDT 2015
fs.TestDFSIO: Number of files: 16
fs.TestDFSIO: Total MBytes processed: 16000.0
fs.TestDFSIO: Throughput mb/sec: 32.38643494172466
fs.TestDFSIO: Average IO rate mb/sec: 58.72880554199219
fs.TestDFSIO: IO rate std deviation: 64.60017624360337
fs.TestDFSIO: Test exec time sec: [...]
3. Clean up the TestDFSIO data.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean
Running the TestDFSIO and terasort benchmarks helps you gain confidence in a Hadoop installation and detect any potential problems. It is also instructive to view the Ambari dashboard and the YARN web GUI (as described previously) as the tests run.

Managing Hadoop MapReduce Jobs
Hadoop MapReduce jobs can be managed using the mapred job command. The most important options for this command in terms of the examples and benchmarks are -list, -kill, and -status. In particular, if you need to kill one of the examples or benchmarks, you can use the mapred job -list command to find the job-id and then use mapred job -kill <job-id> to kill the job across the cluster. MapReduce jobs can also be controlled at the application level with the yarn application command. The possible options for mapred job are as follows:
$ mapred job
Usage: CLI <command> <args>
  [-submit <job-file>]
  [-status <job-id>]
  [-counter <job-id> <group-name> <counter-name>]
  [-kill <job-id>]
  [-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW

[...] archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

OR
2. a. Write a short note on following: (08 Marks)
i. Speculative execution
ii. Hadoop MapReduce hardware
Ans. i. Speculative Execution: One of the challenges with many large clusters is the inability to predict or manage unexpected system bottlenecks or failures. In theory, it is possible to control and monitor resources so that network traffic and processor load can be evenly balanced; in practice, however, this problem represents a difficult challenge for large systems. Thus, it is possible that a congested network, slow disk controller, failing disk, high processor load, or some other similar problem might lead to slow performance without anyone noticing.
When one part of a MapReduce process runs slowly, it ultimately slows down everything else because the application cannot complete until all processes are finished. The nature of the parallel MapReduce model provides an interesting solution to this problem. Recall that input data are immutable in the MapReduce process. Therefore, it is possible to start a copy of a running map process without disturbing any other running mapper processes. For example, suppose that as most of the map tasks are coming to a close, the ApplicationMaster notices that some are still running and schedules redundant copies of the remaining jobs on less busy or free servers. Should the secondary processes finish first, the other first processes are then terminated (or vice versa). This process is known as speculative execution. The same approach can be applied to reducer processes that seem to be taking a long time. Speculative execution can reduce cluster efficiency because redundant resources are assigned to applications that seem to have a slow spot. It can also be turned off and on in the mapred-site.xml configuration file.
ii. Hadoop MapReduce Hardware
The capability of Hadoop MapReduce and HDFS to tolerate server or even whole-rack failures can influence hardware designs. The use of commodity (typically x86_64) servers for [...]
import org.apache.hadoop.util.ToolRunner;

/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {}  // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      return 2;
    }

    // Temporary directory that holds the output of the first (search) job.
    Path tempDir =
        new Path("grep-temp-" +
            Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    // Pass the regular expression (and optional group) to the mapper.
    Configuration conf = getConf();
    conf.set(RegexMapper.PATTERN, args[2]);
    if (args.length == 4)
      conf.set(RegexMapper.GROUP, args[3]);

    Job grepJob = new Job(conf);

    try {
      grepJob.setJobName("grep-search");
      FileInputFormat.setInputPaths(grepJob, args[0]);
      grepJob.setMapperClass(RegexMapper.class);
      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);
      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
[...]

[...] task and extracts text matching using the given regular expression. The matching strings are output as <matching string, 1> pairs. As in the WordCount example, each reducer sums up the number of matching strings and employs a combiner to do local sums. The actual reducer uses the LongSumReducer class that outputs the sum of long values per reducer input key.
The second job takes the output of the first job as its input. The mapper is an inverse map that reverses (or swaps) its input <key, value> pairs into <value, key>. There is no reduction step, so the IdentityReducer class is used by default. All input is simply passed to the output. The number of reducers is set to 1, so the output is stored in one file, sorted by count in descending order. The output text file contains a count and a string per line.
The example also demonstrates how to pass a command-line parameter to a mapper or a reducer.
1. Create a directory for the application classes:
$ mkdir Grep_classes
2. Compile the Grep.java program using the following line:
$ javac -cp `hadoop classpath` -d Grep_classes Grep.java
3. Create a Java archive using the following command:
$ jar -cvf Grep.jar -C Grep_classes/ .
If needed, create a directory and move the war-and-peace.txt file into HDFS:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
As always, make sure the output directory has been removed by issuing the following command:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
Entering the following command will run the Grep program:
$ hadoop jar Grep.jar org.apache.hadoop.examples.Grep war-and-peace-input war-and-peace-output Kutuzov
As the example runs, two stages will be evident. Each stage is easily recognizable in the program output. The results can be found by examining the resultant output file.
$ hdfs dfs -cat war-and-peace-output/part-r-00000
530 Kutuzov

c. Explain command line log viewing. (04 Marks)
Ans. MapReduce logs can also be viewed from the command line. The yarn logs command enables the logs to be easily viewed together without having to hunt for individual log files on the cluster nodes. As before, log aggregation is required for use. The options to yarn logs are as follows:
$ yarn logs
Retrieve logs for completed YARN applications.
usage: yarn logs -applicationId <application ID> [OPTIONS]
general options are:
-appOwner <Application Owner>    AppOwner (assumed to be current user if not specified)
-containerId <Container ID>      ContainerId (must be specified if node address is specified)
-nodeAddress <Node Address>      NodeAddress in the format nodename:port (must be specified if container id is specified)
For example, after running the pi example program (discussed in Chapter 4), the logs can be examined as follows:
$ hadoop jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 100000
After the pi example completes, note the applicationId, which can be found either from the application output or by using the yarn application command. The applicationId will start with application_ and appear under the Application-Id column.
$ yarn application -list -appStates FINISHED
Next, run the following command to produce a dump of all the logs for that application. Note that the output can be long and is best saved to a file.
$ yarn logs -applicationId application_1432667013445_0001 > AppOut
The AppOut file can be inspected using a text editor. Note that for each container, stdout, stderr, and syslog are provided. The list of actual containers can be found by using the following command:
$ grep -B 1 === AppOut
For example (output truncated):
[...]
Container: container_1432667013445_0001_01_000008 on limulus_45454
==================================================================
Container: container_1432667013445_0001_01_000010 on limulus_45454
==================================================================
Container: container_1432667013445_0001_01_000001 on n0_45454
==================================================================
Container: container_1432667013445_0001_01_000023 on n1_45454
==================================================================
[...]
A specific container can be examined by using the containerId and the nodeAddress from the preceding output. For example, container_1432667013445_0001_01_000023 can be examined by entering the command following this paragraph. Note that the node name (n1) and port number are written as n1_45454 in the command output. To get the nodeAddress, simply replace the _ with a : (i.e., -nodeAddress n1:45454). Thus, the results for a single container can be found by entering this line:
$ yarn logs -applicationId application_1432667013445_0001 -containerId container_1432667013445_0001_01_000023 -nodeAddress n1:45454 | more

Module-2
Explain, with diagrams, DAG workflows. (10 Marks)
Oozie is a workflow director system designed to run and manage multiple related Apache Hadoop jobs. For instance, complete data input and analysis may require several discrete Hadoop jobs to be run as a workflow in which the output of one job serves as the input for a successive job. Oozie is not a substitute for the YARN scheduler. That is, YARN manages resources for individual Hadoop jobs, and Oozie provides a way to connect and control Hadoop jobs on the cluster.
Oozie workflow jobs are represented as directed acyclic graphs (DAGs) of actions. (DAGs are basically graphs that cannot have directed loops.) Three types of Oozie jobs are permitted:
• Workflow: a specified sequence of Hadoop jobs with outcome-based decision points and control dependency. Progress from one action to another cannot happen until the first action is complete.
• Coordinator: a scheduled workflow job that can run at various time intervals or when data become available.
• Bundle: a higher-level Oozie abstraction that will batch a set of coordinator jobs.
Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (e.g., Java MapReduce, Streaming MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (e.g., Java programs and shell scripts). Oozie also provides a CLI and a web UI for monitoring jobs.
Figure 3.1 depicts a simple Oozie workflow. In this case, Oozie runs a basic MapReduce operation. If the application was successful, the job ends; if an error occurred, the job is killed.
Oozie workflow definitions are written in hPDL (an XML Process Definition Language). Such workflows contain several types of nodes:
• Control flow nodes define the beginning and the end of a workflow. They include start, end, and optional fail nodes.
• Action nodes are where the actual processing tasks are defined. When an action node finishes, the remote systems notify Oozie and the next node in the workflow is executed. Action nodes can also include HDFS commands.
• Fork/join nodes enable parallel execution of tasks in the workflow. The fork node enables two or more tasks to run at the same time. A join node represents a rendezvous point that must wait until all forked tasks complete.
VIII Se.1111 (CSr/ISE)
• Conlrof now 11odcs enable decisions to made about the p~evious ta~k. _control deci~ions '.'un o~ standard Hadoop VI : using the MapRcducc framework, but,that app;~~1;-'i,;o~~d .
. :! are based on the results of the previous action (e.g., file Slze or file existence). Dccislon mcfficie_nt nnd totally unnatural for vnrious rel!sons. The native Girnph impleinenttition un'der
nodes are essentially switch-case statements that use JSP . YARN provide~ the user witli nn lterntive processing model_that is not direclty 'available with
I ';__ EL (Java Server pages-Expression Language) that evaluate to either true or Cal_sc. Figure 3.2 MapRcduce. Support for YARN has been present in Giraphsince·its own version LO release.
depicts a more complex workfolw th~t uses ,di of these r,odc ty;ic_s_- In adfiition, us ing the flcxi l>ility of YARN, the <iiraph developers plan ·on impl~me~ting their
own wet, in_terface to monit9.r job progress . · .. ,, · · ·
OK
Iii) Hamster: Hadoop and Ml,'I on the SRmc Cluster
The Mes~age Pnss ing Interface (MPI) is Widely used in high-performance co~puting (HPC). ·
MP! is primarily.a set of optimized message-passing library calls for C,.C++, and Fortran that
operate over, popular server interconnects such as Ethernet and lrifiiiiBand. Betausc users ·
have full control o_ver their YARN containers, there is no reason why MP!_applications ~~n~ot
tun with iii n Hadoop cluster. The Hamster effort is a ~ork-in-progress that P(avides a good-
·.' ., discussion of the issues involved in•mapping MP! to a YARN cluster·. Currently, an:alpha ·
version ofMPICH2 is·available for:YARN that.can be used to run MP! applic:atloris:: '.'-: ·;.
. iv)Apachc Spark ' · : · '. . ' · ' .· • · ;;',__·
·. Spark was initially developed f~r a·pplications in whfch keeping data f~ mertidrflitiproves
>·
· performance, such .as .iteraiive algorithms, which* common . fn m·ac~ine'. (eaming;'arid
inieractiv'e data mining: Spn~k dilfe~ from classic- MapReduce ii1 t1vo imp◊rt~ni wayfFirst, .
Spark holds intermediritc results In ineinory, rather tlui_ti 111riting them to disk,;S~cdricl;Spark
holds supports more than jLiSI MapReduce f~nctio.is; that is, it greatly ex'p~rtd~j ~~ s~t
o(.
-possible analyses t~af c~-~ be ex~cut~d ayer HDFS d.ttli,_sl~res. II also),riivld~wls ~.Sc~i~/
.,oftri;porting
~ ~Odl~m;:k h~ feri r~~ni~g ~~- prnducii?~ ~;~~A1ste~,a~.r~~~l;:iu(~~~~\~~.~· :_ :
and ru~nlng ·Spark.:on top· of Y~RN 1~-the common
r~outce tnanug~i)\ent a11d a,:·.
's(~gleiuid~rly,ing_'fifesysiem ;_ .. · '. :·,(); .·: · . ·.: ,. ,,. , ' •· ·:;:,fr,:·· ..
Changing Hadoop properties
One of the challenges of managing a Hadoop cluster is managing changes to cluster-wide configuration properties. In addition to modifying a large number of properties, making changes to a property often requires restarting daemons (and dependent daemons) across the entire cluster. This process is tedious and time consuming. Fortunately, Ambari provides an easy way to manage this process. Each service provides a Configs tab that opens a form displaying all of the possible service properties. Any service property can be changed (or added) using this interface. As an example, consider the configuration properties for the YARN scheduler, as shown in Figure 4.1.

(i) Apache Tez
(ii) Apache Giraph
(iii) Hamster: Hadoop and MPI on the same cluster
(iv) Apache Spark
Ans. (i) Apache Tez
One great example of a new YARN framework is Apache Tez. Many Hadoop jobs involve the execution of a complex directed acyclic graph (DAG) of tasks using separate MapReduce stages. Apache Tez generalizes this process and enables these tasks to be spread across stages so that they can be run as a single, all-encompassing job. Tez can be used as a MapReduce replacement for projects such as Apache Hive and Apache Pig. No changes are needed to the Hive or Pig applications.
figmc 4.4, is presented. This window confirms 1ha1 lhe properties have teen saved. Once the
new properly is changed, an ornngc Rcslart bunon will appear al 1he lop Jeti of1hc window.
The new properly w:11 r. ;it 1::kc ~f:cc\ i:01 :I lhc required ;erviccs are 1csla11cJ. As shown in
·--·· : - ·--·· -~-- ..
< r ;.-;;i" • ·-----. ·--- • ,-.:·,.
'
figure 4.5, lhc Reslarl butlor. provides lwo oplions: Restart All and Restart Nodc!\llanagers. .
To be safe, the Restart AU sfiould be used. Note Iha! Restart All docs no: mean nil the Hadoop·
service will be rcsta11ed ; rather, only those lhal use the ·new property will be restarted. ·
I :. 15 After the user clicks Restart All, a confirmation window, shown in Figure 4.6 will be
displayed. Click Confirm Restart All to begin the cluster-wide restart.
.. ~ ; . Save Configuration Changes
·- ·· ··"'·· .
m
. .-.... • _ , . . _ . . , , _ . . . .. "II •
; __ ~
... ~~-· . .
..... _;_.,~
.
--
Y~~ie~11o·ies~rt~AF,N
._';i~~~~:~~L:~~tfr;;i~}~°:J:~~:~cn~; ~
,, :
·-- ~-· . . ·"' .• -'. ·.•· .
.F.~~~..... :.
j~"'~.1:1ya . .; · , . •·
ht&plf7" •' ·• .
.. o ;_~_·: ij· : .: ~ , ·. ..
_I\~_..;~ ■ - . - ---~ --::-::::~.~~- : ___:~c~.J.!.,i!J ..
~~nad : f :~ ~=--=--=-- -- .·. Figure-4.6 tfl11barl conjimiatioi1 box/or selvice r~tart-'.
)'In\~~~·. :: ~·! .J~;;\;J .
to
. ··. .Similar tlle-DaiaNode restart, exan1pte; _a progress windOIV win" be displ"aycd. Again; the
_ _· . .Figure 4.2 }'ARN~raperties wilh log 11ggreg{l(io1;- /1,;ue,i off ,. ._. : progress bads for_ihe entire YARN restm1. Details from.the fogs can be found:t,yclick_ ing·the .
. · Changes
· do _n·o) -become
· · pcrmanent'untl
•- ·. ·1:tie · C1·IC·ks tlie. .Save. button
I user . ·, ' A save. /. notes
tt,e"-:· . ari·ow to the.right of the bar•(see Figure 0)." . . . . . .· _- .. . ' . .
wi~ilow will t1ien be ,lisp(ayed. It is highly re~ommended that h1stoncalnotes.conc~1nmg _· . . O~c~ the restart is complete, run a simple example and attempt to view ihe. logs u~ing the ..
· . chan;•e be added to this wi1ido1v. . · ·. . . ' ' · .. · · · . ' . . YARN ResourceMana~erApplicaiion u1: (You can access :the UI fr~m the Qui~kl.iriks put~~:. ·.. . . , ,·
·· :' . ; , . · · ve Conliguraticin · ·x · down inenu iirthe 1tiidi!le_of the YARf':I series·windo\v.) A message_similar tq that'ill'.figurc · .
.· •' . ' • . . . ·: 4.8 wHl be displayed : . . . . . . .. . .. ._ . . .
·· .:· /\~~ .;i~·rn ~lf-~a~,~~I~~· :·. . I
~. .. . .' . Anibari"iracks al( chaoges·made to system' propertfo·s,,J scan be seen in Figuce , . :'-, ·, · : :
.. ·. __ ____,/. . 4.1 ·and in more detail in Figure 4.9; each time a corifigL,ration is ch~nged, a11ew· \lers10n is . .
' ·-··· . .
cieated:..1teve11ing back to a previous version results in a new vcrs.fon, You can reduce ihe .
potC:~tiil_ilor-,1,'.CCSion.:confusioci_by p~o•(i•JiR~~~ngfui--commcntHor--fl'ith-<:harig~::(e.g-:,·_.-. - - -
II. .' ·.. F/gltfl!.4:'3 A111iwi ,·01ijig11mtlo11 si1~diioiesw i1ilo1P:~-)- -" ' ' . .. .
. Figure 4:3 and Figure 4.Uf Iii ihe_.prec~ding .exah1pfe, we created, versio_n .12_(Vl2). ;The . :
· . 1 L;9;_.-_-
,. .
~~ .
-- ·-· - - · · - - - · - · - ' - ' < - ,-
, . . , ----· ->
i'fi!!..-•-- _· --,--c-.,. .:.96
- -->i~!t,.;; . . . .. . 5"~~+;.,( t;.;......
----- -- -- -- - -- - - ·. - -:~ ·· .
VIII Se-t~ (CSf(ISE)
current vcision is indicated by a green Current label in the horizontal version boxes or in the
dark horizontal bar. Scrollin11 thouah the version boxes · . . . . . : . i'~±~-,
1 Background Operations Running a,~ 1---~~~~~~}.:i.: .i ,~Corl.c,Gto..c- ''"'' ·_· ·-·::· ·1·i·-
s1ar,; .~ t '."°'.: · AA (IOI .. , J l:ZIJ ..., III -"· Cl - .. , Cl -" ~ -•-
1•dlrl • ,• . . , . . ' .,.,...,. '''""'- ' !t••·
100'"AI .' ►
.: Figi,re '4.8 YARN ReJourcellin11agcr i11ierjnce .with log-tigg;egaiitm lltmetl off ·' _,..() .
Or'pulling down thl menu on the-left-ha11d side'cif the datk horizcinta_l bar will display t_he ?
.. previous configuration versions·. ··__·- ·-; -.- ·- ·.· ' . . . . . : : .· . ·. . ·. . . ' . '
To revert to a previous version,'simply·selec tthe version·from th'.c version box~ or the pull~·. '.
. down m:enu. In .Figure 4. i 0, the user has selected the prevfous version by clicking.the Make, ':
. C~rie~( button-'iii flic_'. infprmation box:. :This 'configuration w(II ret~rn to.the previous sta~ ';,_·,
. wherf foii;ggr'~~~lion is enabled. ·.. . ,· " · . · · : . . . - . ·. . · . · ·,·
-~ .,;, _ . ; .
·: ·: .. ' ' Fig~re -I.JO ReYei-ti11g io previous YARNco~,jiguraiio.11 'iJ'JJ)'witl,'A1i1b1i_';f :' ',: ;
' .. ;. As. sh_oyin [n· Figure4.11;a confiimation i not.es w(ndow opiiri before fue;iicw \:orifigutl!tioh .-
. , .' 'is-sav,ecl, Again; it'is suggested that you prrividc -noie 'about ihc .cliange:1n'-ihe:Noies:text_:
-·. ·-.·
: ,
·, . . . .. :.-:'·~ .·.•.·E:~t!i1i:!:!~;~;t:%~~::;:;it;~~[:~
.. . 98
. . : ..
,.. . .
•. . .
.• .~! •.
3.i:_,:·:.~ ~: ~:.'~~,:~:- ~
{-~.j. :-~\ -
\/III Se,,n, (CSf/ISE)
Figure 4.11 Ambari confirmation window for a new configuration

There are several important points to remember about the Ambari versioning tool:
• Every time you change the configuration, a new version is created. Reverting to a previous version creates a new version.
• You can view or compare a version to other versions without having to change or restart services. (See the buttons in the V11 box in Figure 4.10.)
• Each service has its own version record.
• Every time you change the properties, you must restart the service by using the Restart button. When in doubt, restart all services.

b. Define the capabilities and configuration steps of an NFSv3 Gateway to HDFS. (04 Marks)
Ans. Configuring an NFSv3 Gateway to HDFS
HDFS supports an NFS version 3 (NFSv3) gateway. This feature enables files to be easily moved between HDFS and client systems. The NFS gateway supports NFSv3 and allows HDFS to be mounted as part of the client's local file system. Currently the NFSv3 gateway supports the following capabilities:
• Users can browse the HDFS file system through their local file system using an NFSv3 client-compatible operating system.
• Users can download files from the HDFS file system to their local file system.
• Users can upload files from their local file system directly to the HDFS file system.
• Users can stream data directly to HDFS through the mount point. File append is supported, but random write is not supported.
The gateway must be run on the same host as a DataNode, NameNode, or any HDFS client. More information about the NFSv3 gateway can be found at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html.
In the following example, a simple four-node cluster is used to demonstrate the steps for enabling the NFSv3 gateway. Other potential options, including those related to security, are not addressed in this example. A DataNode is used as the gateway node in this example, and HDFS is mounted on the main (login) cluster node.

Step 1: Set Configuration Files
Several Hadoop configuration files need to be changed. In this example, the Ambari GUI will be used to alter the HDFS configuration files. Do not save the changes or restart HDFS until all the following changes are made. If you are not using Ambari, you must change these files by hand and then restart the appropriate services across the cluster. The following environment is assumed:
• RHEL 6.6
• Hortonworks HDP 2.2 with Hadoop version 2.6
Several properties need to be added to the /etc/hadoop/config/core-site.xml file. Using Ambari, go to the HDFS service window and select the Configs tab. Toward the bottom of the screen, select the Add Property link in the Custom core-site.xml section. Add the following two properties (the item used for the key field in Ambari is the name field included in this code):
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
The name of the user who will start the Hadoop NFSv3 gateway is placed in the name field. In the previous example, root is used for this purpose. This setting can be any user who starts the gateway. If, for instance, user nfsadmin starts the gateway, then the two names would be hadoop.proxyuser.nfsadmin.groups and hadoop.proxyuser.nfsadmin.hosts. The * value, entered in the preceding lines, opens the gateway to all groups and allows it to run on any host. Access is restricted by entering groups (comma separated) in the groups property. Entering a host name for the hosts property can restrict the host running the gateway.
Next, move to the Advanced hdfs-site.xml section and set the following property:
<property>
  <name>dfs.nfs3.dump.dir</name>
  <value>/tmp/.hdfs-nfs</value>
</property>
The NFSv3 dump directory is needed because the NFS client often reorders writes. Sequential writes can arrive at the NFS gateway in random order. This directory is used to temporarily save out-of-order writes before writing to HDFS. Make sure the dump directory has enough space. For example, if the application uploads 10 files, each of size 100MB, it is recommended that this directory have 1GB of space to cover a worst-case write reorder for every file.
Once all the changes have been made, click the green Save button and note the changes you made to the Notes box in the Save confirmation dialog. Then restart all of HDFS by clicking the Restart button.

Step 2: Start the Gateway
Log into a DataNode and make sure all NFS services are stopped. In this example, DataNode n0 is used as the gateway.
Next, start the HDFS gateway by using the hadoop-daemon script to start portmap and nfs3 as follows:
# /usr/hdp/2.2.4.2-2/hadoop/sbin/hadoop-daemon.sh start portmap
# /usr/hdp/2.2.4.2-2/hadoop/sbin/hadoop-daemon.sh start nfs3
The portmap and nfs3 daemons write their logs under /var/log/hadoop/root/; the nfs3 log in this example is /var/log/hadoop/root/hadoop-root-nfs3-n0.log.
To confirm the gateway is working, issue the following command. The output should look like the following:
# rpcinfo -p n0
program  vers  proto  port  service
100005   2     tcp    4242  mountd
100000   2     udp    111   portmapper
100000   2     tcp    111   portmapper
100005   1     tcp    4242  mountd
100003   3     tcp    2049  nfs
100005   1     udp    4242  mountd
100005   3     udp    4242  mountd
100005   3     tcp    4242  mountd
100005   2     udp    4242  mountd
Finally, make sure the mount is available by issuing the following command:
# showmount -e n0
Export list for n0:
/ *
If the rpcinfo or showmount command does not work correctly, check the previously mentioned log files for problems.

Step 3: Mount HDFS
The final step is to mount HDFS on a client node. In this example, the main login node is used. To mount the HDFS files, exit from the gateway node and create the following directory:
# mkdir /mnt/hdfs
The mount command is as follows. Note that the name of the gateway node will be different on other clusters, and an IP address can be used instead of the node name.
# mount -t nfs -o vers=3,proto=tcp,nolock n0:/ /mnt/hdfs/
Once the file system is mounted, the files will be visible to the client users. The following command will list the mounted file system:
# ls /mnt/hdfs
app-logs apps benchmarks hdp mapred mr-history system tmp user var
The gateway in the current Hadoop release uses AUTH_UNIX-style authentication and requires that the login user name on the client match the user name that NFS passes to HDFS. For example, if the NFS client is user admin, the NFS gateway will access HDFS as user admin and existing HDFS permissions will prevail.
The system administrator must ensure that the user on the NFS client machine has the same user name and user ID as that on the NFS gateway machine. This is usually not a problem if you use the same user management system, such as LDAP/NIS, to create and deploy users.

Ans. According to the list of best business intelligence tools prepared by experts from FinancesOnline, the leading solutions in this category comprise systems designed to capture, categorize, and analyze corporate data and extract best practices for improved decision making. The more advanced the system is, the more data sources it will combine, including internal metrics coming from different company departments, and external data extracted from third-party systems, social media channels, emails, or even macroeconomic data. Ultimately, business intelligence software helps companies gain insight on their overall growth, sales trends, and customer behavior.
1. Sisense
2. Actuate Business Intelligence and Reporting Tools (BIRT)
3. icCube
4. Domo
5. Board Management Intelligence Toolkit
6. Clear Analytics
7. Ducen
8. GoodData
9. IBM Cognos Intelligence
10. InsightSquared
11. JasperSoft
12. Looker
13. Microsoft BI platform
14. MicroStrategy
15. MITS
16. OpenI
17. Oracle BI
18. Oracle Enterprise BI Server
19. Oracle Hyperion System
20. Palo OLAP Server
21. Pentaho
22. Profitbase
23. QlikView
24. Rapid Insight
25. SAP Business Intelligence
26. SAP BusinessObjects
27. SAP NetWeaver BW
28. SAS BI
29. Silvon
30. Solver
33. Style Intelligence
35. Targit
36. Vismatica
38. Yellowfin BI

The BI tool used in our organization: Education
As higher education becomes more expensive and competitive, it is a great user of data-based decision-making. There is a strong need for efficiency, increasing revenue, and improving the quality of student experience at all levels of education.
1. Student enrolment (recruitment and retention): Marketing to new potential students requires schools to develop profiles of the students that are most likely to attend. Schools can develop models of what kinds of students are attracted to the school, and then reach out to those students.
... to pledge financial support to the school. Schools can create a profile for alumni more likely to pledge donations to the school. This could lead to a reduction in the cost of mailings and ...
... that form the backbone of an organization's IT systems. The data to be extracted will depend upon the subject matter of the DW. For example, for a sales/marketing DW, only the data about customers, orders, customer service, and so on would be extracted.
Product  Revenue  Orders  SalesPers  Rev/Order  Rev/SalesPers  Orders/SalesPers
KK       3833     122     50         31.4       76.7           2.4
GG       1411     128     13         11.0       108.5          9.8
LL       1348     15      7          89.9       192.6          2.1
MM       1201     28      13         42.9       92.4           2.2
CC       992      32      6          31.0       165.3          5.3
EE       933      30      7          31.1       133.3          4.3
FF       676      35      6          19.3       112.7          5.8
BB       355      43      8          8.3        44.4           5.4
JJ       215      7       2          30.7       107.5          3.5
DD       125      31      4          4.0        31.3           7.8
Total    25936    734     177        35.3       146.5          4.1
Table 6.2: Sorted data, with additional ratios

There are too many numbers on this table to visualize any trends in them. The numbers are in different scales, so plotting them on the same chart would not be easy. E.g., the Revenue numbers are in thousands, while the SalesPers and Orders/SalesPers numbers are in single or double digits.
One could start by visualizing the revenue as a pie chart. The revenue proportion drops significantly from the first product to the next (Figure 6.1).
It is interesting to note that the top 3 products produce almost 75% of the revenue.

Figure 6.2: Orders by Products

Therefore, the orders data could be investigated further to see order patterns. Suppose additional data is made available for orders by their size. Suppose the orders are chunked into 4 sizes: Tiny, Small, Medium, and Large. Additional data is shown in Table 6.3.

Product  Total Orders  Tiny  Small  Medium  Large
Total    734           189   329    185     31
Table 6.3: Additional data on order sizes

Figure 6.3 is a stacked bar graph that shows the percentage of orders by size for each product. This chart (Figure 6.3) brings a different set of insights. It shows that the product HH has
a larger proportion of tiny orders. The products at the far right have a large number of tiny orders and very few large orders.

Figure 6.3: Product Orders by Order Size

Visualization Example phase-2
The executive wants to understand the productivity of salespersons. This analysis could be done both in terms of the number of orders, or revenue, per salesperson. There could be two separate graphs, one for the number of orders per salesperson, and the other for the revenue per salesperson. However, an interesting way is to plot both measures on the same graph to give a more complete picture. This can be done even when the two data series have different scales. The data is here re-sorted by number of orders per salesperson.

Figure: Orders and revenue per salesperson

3. Choose an appropriate method to present the data. The data could be presented as a table, or it could be presented as any of the graph types.
4. The data set could be pruned to include only the more significant elements. More data is not necessarily better, unless it makes the most significant impact on the situation.
5. The visualization could show an additional dimension for reference, such as the expectations or targets with which to compare the results.
6. The numerical data may need to be binned into a few categories. E.g., the orders per person were plotted as actual values, while the order sizes were binned into 4 categorical choices.
7. High-level visualization could be backed by more detailed analysis. For the most significant results, a drill-down may be required.
8. There may be a need to present additional textual information to tell the whole story. For example, one may require notes to explain some extraordinary results.

Module-4
What is pruning? What are pre-pruning and post-pruning? Why choose one over the other? (08 Marks)
Pruning: The tree could be trimmed to make it more balanced and more easily usable. The pruning is often done after the tree is constructed, to balance out the tree and improve usability. The symptoms of an overfitted tree are a tree that is too deep, with too many branches, some of which may reflect anomalies due to noise or outliers. Thus, the tree should be pruned. There are two approaches to avoid over-fitting.
• Pre-pruning: Halt the tree construction early, when certain criteria are met. The downside is that it is difficult to decide what criteria to use for halting the construction, because we do not know what may happen subsequently, if we keep growing the tree.
• Post-pruning: Remove branches or sub-trees from a "fully grown" tree. This method is commonly used. The C4.5 algorithm uses a statistical method to estimate the errors at each node for pruning. A validation set may be used for pruning as well.
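Post-pruning with a validation set can be illustrated with a short Python sketch. This is only one simple variant (reduced-error pruning), not the statistical error estimate that C4.5 uses. The tree layout (nested dicts with `feature`, `threshold`, `majority`, and `label` keys), the sample tree, and the validation rows are all assumptions made for the illustration.

```python
def predict(node, row):
    """Follow the tree until a leaf (a node without 'feature') is reached."""
    while "feature" in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


def errors(node, rows):
    """Count the validation rows (features, label) that the subtree misclassifies."""
    return sum(predict(node, x) != y for x, y in rows)


def post_prune(node, rows):
    """Bottom-up pruning: replace a subtree by a leaf whenever the leaf does not
    make more mistakes on the validation rows that reach this node."""
    if "feature" not in node:
        return node
    f, t = node["feature"], node["threshold"]
    left_rows = [(x, y) for x, y in rows if x[f] <= t]
    right_rows = [(x, y) for x, y in rows if x[f] > t]
    # Build a pruned copy; the original tree is left untouched.
    node = dict(node,
                left=post_prune(node["left"], left_rows),
                right=post_prune(node["right"], right_rows))
    leaf = {"label": node["majority"]}  # 'majority' = majority training class (assumed stored)
    return leaf if errors(leaf, rows) <= errors(node, rows) else node


# Hypothetical fully grown tree; the deepest split is assumed to fit noise.
TREE = {
    "feature": 0, "threshold": 5, "majority": "no",
    "left": {"label": "no"},
    "right": {
        "feature": 1, "threshold": 2, "majority": "yes",
        "left": {"label": "yes"},
        "right": {"label": "no"},
    },
}

# Hypothetical validation rows: ((feature0, feature1), label).
VALIDATION = [((3, 1), "no"), ((7, 1), "yes"), ((8, 3), "yes"),
              ((6, 4), "yes"), ((2, 5), "no")]

pruned = post_prune(TREE, VALIDATION)
```

In this sketch the root split is kept because it separates the validation rows well, while the deeper split is collapsed into a single leaf. Pre-pruning would instead stop the split from being created during tree construction.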
only:
• Cluster significance and labeling. The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes.
Self-organizing neural networks learn using an unsupervised learning algorithm to identify hidden patterns in unlabelled input data. Unsupervised here refers to the ability to learn and organize information without providing an error signal to evaluate the potential solution. The lack of direction for the learning algorithm in unsupervised learning can sometimes be advantageous, since it lets the algorithm look for patterns that have not been previously considered. The main characteristics of Self-Organizing Maps (SOM) are:
1. It transforms an incoming signal pattern of arbitrary dimension into a one- or two-dimensional map and performs this transformation adaptively.
2. The network has a feedforward structure with a single computational layer consisting of neurons arranged in rows and columns.
3. At each stage of representation, each input signal is kept in its proper context.
4. Neurons dealing with closely related pieces of information are close together and they communicate through synaptic connections.
The computational layer is also called the competitive layer, since the neurons in the layer compete with each other to become active. Hence, this learning algorithm is called a competitive algorithm. The unsupervised algorithm in SOM works in three phases:
Competition phase: For each input pattern x presented to the network, the inner product with the synaptic weight w is calculated, and the neurons in the competitive layer find a discriminant function that induces competition among the neurons. The neuron whose synaptic weight vector is closest to the input vector in Euclidean distance is announced as the winner of the competition. That neuron is called the best matching neuron, i.e. $i(x) = \arg\min_j \lVert x - w_j \rVert$.
Cooperative phase: The winning neuron determines the center of a topological neighborhood h of cooperating neurons. This is performed by the lateral interaction among the cooperative neurons. This topological neighborhood reduces its size over a time period.
Adaptive phase: Enables the winning neuron and its neighborhood neurons to increase their individual values of the discriminant function in relation to the input pattern through suitable synaptic weight adjustments, $\Delta w_j = \eta\, h_{j,i(x)}\,(x - w_j)$.
The Self-Organizing Model naturally represents neuro-biological behavior, and hence is used in many real-world applications such as clustering, speech recognition, and texture segmentation.

Ans. A scatter plot of 10 data points in 2 dimensions shows them distributed fairly randomly (Figure 8.1). As a bottom-up technique, the number of clusters and their centroids can be intuited.
The points are distributed randomly enough that it could be considered as one cluster. The circle would represent the central point (centroid) of these points. However, there is a big distance between the points (2,6) and (8,3). So, this data could be broken into two clusters. The three points at the bottom right could form one cluster and the other seven could form the other cluster. The two clusters would look like this (Figure 8.2). The circles will be the new centroids.
The bigger cluster seems too far apart. So, it seems like the four points on the top will form a separate cluster. The three clusters could look like this (Figure 8.3).
This solution has three clusters. The cluster on the right is far from the other two clusters. However, its centroid is not too close to all the data points. The cluster at the top looks very tight-fitting, with a nice centroid. The third cluster, at the left, is spread out and may not be of much usefulness.
... their centroids. It is a top-down approach to clustering. Start with a given number of K clusters, say 3 clusters; three random centroids will then be created as starting points for the centers of the three clusters (Figure 8.4). The circles are the initial cluster centroids.

Figure 8.4 Randomly assigning three centroids for three data clusters

Step 1: For a data point, distance values will be computed from each of the three centroids. The data point will be assigned to the cluster with the shortest distance to the centroid. All data points will thus be assigned to one centroid or the other. The arrows from each data element show the centroid that the point is assigned to (Figure 8.5).
Step 2: The centroid for each cluster will now be recalculated such that it is closest to all the data points allocated to that cluster. The dashed arrows show the centroids being moved from their old (shaded) values to the revised new values (Figure 8.6).
Step 3: Once again, data points are assigned to the closest of the three centroids (Figure 8.7). The new centroids will be computed from the data points in each cluster, and this is repeated until finally the centroids stabilize in their locations. These are the three clusters computed by this algorithm (Figure 8.8).

Figure 8.7 Assigning data points to recomputed centroids

The three clusters shown are a 3-datapoint cluster with centroid (6.5,4.5), a 2-datapoint cluster with centroid (4.5,3), and a 5-datapoint cluster with centroid (3.5,3).
These cluster definitions are different from the ones derived visually. This is a function of the random starting centroid values. The centroid points used earlier in the visual exercise were different from those chosen with the K-means clustering algorithm. The K-means clustering exercise should, therefore, be run again with this data, but with new random centroid starting values. With many runs, the cluster definitions are likely to stabilize. If the cluster definitions do not stabilize, that may be a sign that the number of clusters chosen is too high or too low. The algorithm should also be run with different values of K.
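The three K-means steps described above can be sketched in plain Python. This is a minimal illustration, not the exact procedure behind the figures: the ten data points and the three starting centroids below are hypothetical, and the function names are chosen only for this example.

```python
import math


def assign(points, centroids):
    """Step 1: for every point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]


def recompute(points, labels, centroids):
    """Step 2: move each centroid to the mean of the points assigned to it."""
    new = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:  # an empty cluster keeps its old centroid
            new.append(old)
            continue
        new.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new


def kmeans(points, centroids, max_iter=100):
    """Step 3: repeat steps 1 and 2 until the centroids stop moving."""
    centroids = [tuple(map(float, c)) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        updated = recompute(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
        labels = assign(points, centroids)
    return labels, centroids


# Hypothetical data set and starting centroids (not the points from the figures).
POINTS = [(2, 4), (2, 6), (5, 6), (4, 7), (8, 3),
          (6, 5), (5, 2), (5, 7), (6, 3), (4, 4)]
labels, centroids = kmeans(POINTS, [(1, 5), (4, 1), (8, 4)])
```

Calling `kmeans` again with different starting centroids, or with a different number of centroids, is a simple way to check whether the cluster definitions stabilize, as the text recommends.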
transformation occurs implicitly on a robust theoretical basis and human expertise judgement beforehand is not needed.
3. SVMs provide a good out-of-sample generalization, if the parameters C and r (in the case of a Gaussian kernel) are appropriately chosen. This means that, by choosing an appropriate generalization grade, SVMs can be robust, even when the training sample has some bias.
4. SVMs deliver a unique solution, since the optimality problem is convex. This is an advantage compared to Neural Networks, which have multiple solutions associated with local minima and for this reason may not be robust over different samples.
5. With the choice of an appropriate kernel, such as the Gaussian kernel, one can put more stress on the similarity between companies, because the more similar the financial structure of two companies is, the higher is the value of the kernel. Thus, when classifying a new company, the values of its financial ratios are compared with the ones of the support vectors of the training sample which are more similar to this new company. This company is then classified according to the group with which it has the greatest similarity.
Herc are ~ome .examples where the ..SVM can help coping ·with non-linearity and non- .
monotonicity, Orie case. is, when the coellicfents of some tinatlciat ratios in .equation (I),
_estimated with ·a,.linear parametric model, show a sign that does not :cormpo~d to the
expected one according to_theoretical.economic reason.ing. . . .. . .
.The reason for that may be.that these financial ratiqs have a nori-monotone.telalioil to the
PD and to the score.The ·unexpecied sign of the coefficien!S depends oil !be fact, that data .
dominate or cover the pa11_of the range, where the relation to !be Po_·has the opposite sign. .
· 'One _o f these financial ratios is typically" the growth·rate:ofa company, 3$ pointed out by ..
Categor~g ~e~s. em"ail spam ~election, face recognition, sentiment arialy~is, medic~i' "Also leverage may sho.w non-inonotonicity,:sim:e.if a compa.'ly ;,rimaiy.wor!a witii. its o.wn ·
. ... diagnosis; digit recognition arid weather prediction are just fow._of the popular use c~es:,o, .· capital; it may riot ,exp!o.it all its e_xterna! financing oppor:nmi~es properly. Anodi~r exmiple
•· · Naive Bayes·algorithni. · ·. ,. ·. ·• · · . · · ·, · ·. •·· . _.. ·.. · • •.· · ). may be the si~ ofa ·company:. small companiei,'are~xpccted to be more firtantially ins~ble; : .
·. . Machine Leaniing explores lh~ ;tu(,ly 'and co,istru~tion of alg~ritliins.'that c~ ,learn fr~n.i: b11tJf a CO!llpany has'grown too fast or if it h·as become ioo-static because of its dimension, the .
·· · and make ·predictions· on data ,_ Among-Classification ·Algorithms, Na_ive Bayes _along wit~: .. big size'niay become a <lis_advantage: B~cause of these chai-.,cteristics, the above n"ie~tioried
Regression is one of the most popular and powerful algorithms. . : ' f.11~ncia1 ratios hrc:: open sorted out; ivhen selecting-the risk zsscssment model according to .
Naive Bayes classifiers is a machine learning algorithm. If you wonder, how Googl~ mar!is.,__ .ii l_inear class:ficatio~·teccinique. Alterna\ively al\ appropriate evaluation of this _information ·
_ _ _ -some of the· mails .as spam in your inbox; a machine lea!:!!_~g_alg~rit,hrn ~m
.be used lo ·:,:. \n .line~qechniqu·es_requjr_es_~_tran~for,mation of the input variables, iii ordetto make ttiem .
- ~ -- ~ classify iin incomin{email as spal)l or.iiotspam . .-:.,-~ •. - - . .. .. . .. .-,,~": . rnon_otorie and Hneirfyseparab)e:-itA-'colllin'<!ndisadvantage'of non-parametric techniques
..-::,. ... ·. _:such as·sVMs is thefack oftranspa.rency of results. · · . ." . :· • · · ·,
·. : b; .What Is ti;c Point {;iUsin·g SVMs a as Cl;ssific~tio~ Technl~ue? ·. · · ·._-.·.' · jq ~1arks) · ::
SyMs camiot repl·~sent !he score of al_l.companies as a simple paraincii"ic ·functiof! ofthe
"An·s• . Alf classification techniques have .advantages and "disadvantages, whic~ are more or les.s .
-: . financial ratios;sinte its dimension.may be yery high. It is neither.i liaear'"combinillioli of
. : important according to the data which a~e being an~lysed, ·and thus hiiv~ a rela~ive_'.releva11ce, . . .· single firtandal ratios nor has it anot~er simple fur,ctional foim~the welgl\iioftlfe firiancial
. SVMs can be a useful tool for m
insolvency analysis, in tbe case of rion-regulanty ~e data, ·r~tios are·not constant; Thus the marginal contribution cir each financilil raUo to t"be score:·
· :for example whe11 the dat~ are not r~guiarly° distributed· or have an _un_know1i distribution:· Is variable. IJsing ii Gaussian kernel each company has its. own weights according to the .•
It c~,i help ev·aluate. informatioi1, i.e. financial ratios which. should ._be transformed prior IQ difference between .\he value·oftheir own:fin~ncial ratios'and those <if the support vectors o(
·entering the score·of classical classificatio_n techniques. · .• ·..., . , · . . . . , : · . · . . ,·· · ·' ' . . ·: thetr:iiningdata~amp(e. -'. .: ·· '. ·. _· .··: :· ·. · · · . · ·. .::· ,.: . \ . _:
The advantages of the SVM technique can be sunimariscd as _foUows: . , . ·. .. · ·,·. . . .· :
L. s ·y introducing the kemel, SVMs gain flexibility in the choice of the· form. of the threshold·
as
. : lnt~rpr~iation ofresults is hoivever po~ible and can rely on giapliical visualizati~n; wen:.
.as on ri focal linear approximation of the scor~. ,The SVM thres!i_old can be represenied witl\i!f
. separating solvent from insolvent companies, ~tiicli" needs_not lie linear and -~,;en ·needs 1191 •. cuts
a'._bi'dimen~ional graph for.each pair of fina~cial ratios. This visualization technique and
· have the sa·me functional form for all data, since irs function;isnonsparametnc .and qperaJ~s -., projectqhe multii:!imcnsional feature .space as \veil as the inuitivnriate .tftresflold'function •
. locally. As a consequence.tlicy can work with financi~I 1atio{ ~hich show a riori:monot9ne:,
relation to. the score and to the probability of default, or which are non-linearly _depel)dent,:
ihe
s¢p_arating s·olverit .ind insolvent companies on a bi-dimensional one; by fixing ~alucs of : !
the' other financial ratios equal to the values of tlie company, \Vhich has io b~ classified. By" · .
. - and this without needing a11y specific~orkon each _non~monofone varia?fe·•· _' : \ _·: . .., ·. . :this way, different compimies -~ill have dilferent,threshold projections: · . . · · · · .
;;=======~,-=·.u h · n i~1 .licitl containsii"no1i-lincarfransforiiiatio1i,11oassumptio11saboutthe
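The way a Gaussian kernel stresses similarity between a new company and the support vectors can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the kernel form exp(-d^2 / (2 r^2)), the two financial ratios per company, the support vectors, their signed weights, and the names `gaussian_kernel` and `svm_score`. A real SVM would obtain the support vectors and weights by solving the convex optimization problem on training data.

```python
import math

def gaussian_kernel(x, z, r=1.0):
    """Gaussian kernel: 1.0 for identical inputs, close to 0 for very different ones."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2 * r ** 2))

def svm_score(x, support_vectors, weights, bias=0.0, r=1.0):
    """Kernel-weighted vote of the support vectors; each weight carries its class sign."""
    return bias + sum(
        w * gaussian_kernel(x, sv, r) for sv, w in zip(support_vectors, weights)
    )

# Hypothetical support vectors (two financial ratios each) with signed weights:
# positive weight = solvent group, negative weight = insolvent group.
support_vectors = [(0.30, 0.10), (0.25, 0.15), (0.05, 0.80)]
weights = [1.0, 1.0, -1.0]

new_company = (0.28, 0.12)
score = svm_score(new_company, support_vectors, weights, r=0.2)
label = "solvent" if score > 0 else "insolvent"
print(label, round(score, 3))
```

Because the kernel value falls off quickly with distance, the score of the new company is dominated by the support vectors whose financial ratios resemble its own, which is the behaviour described in advantage 5.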
-H0\V(Vef,all"1l~lym-of tht-~·~pl.s gi,cs Jil impu,Wi.rinpilni~i~?ll'.~
a larger number of links, and/or more links from higher-quality websites, will be ranked higher. It works in a similar way to determining the status of a person in a society of people: those with relations to more people and/or relations to people of higher status will be accorded a higher status. PageRank is the algorithm that helps determine the order of pages listed upon a Google Search query. The original PageRank algorithm formulation has been updated in many ways, and the latest algorithm is kept a secret so that other websites cannot take advantage of the algorithm and manipulate their website according to it. However, there are many standard elements that remain unchanged. These elements lead to the principles for a good website. This process is also called Search Engine Optimization (SEO).
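The idea that a page gains rank from many inbound links, and from links coming from highly ranked pages, can be sketched with the originally published form of PageRank. This is only an illustration: the four-page link graph, the damping factor of 0.85, the iteration count and the function name `pagerank` are assumptions, and the current production algorithm is, as noted, not public.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute ranks for a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal status
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its current rank equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web of four pages: C is linked to by A, B and D; C links to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ranks highest because it has the most inbound links, and A ranks above B and D because its single inbound link comes from the high-status page C.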
The grey line corresponds to the linear approximation of the score or PD function projection
for company j. One interesting result of this graphical analysis is that successful companies with a low PD often lie in a closed space. This implies that there exists an optimal combination area for the financial ratios being considered, outside of which the PD gets higher. If we consider the net income change, we notice that its influence on the PD is non-monotone. Both too low and too high growth rates imply a higher PD. This may indicate the existence of an optimal growth rate and suggest that above a certain rate a company may get into trouble, especially if the cost structure of the company is not optimal, i.e. the net interest ratio is too high. But if a company lies in the optimal growth zone, it can also afford a higher net interest ratio.

c. Explain the Practical Considerations of Social Network Analysis. Give the difference between Social Network Analysis vs Traditional Data Analytics.
PRACTICAL CONSIDERATIONS:
Network Size: Most SNA research is done using small networks. Collecting data about large networks can be very challenging. This is because the number of links is of the order of the square of the number of nodes. Thus, in a network of 1000 nodes there are potentially 1 million possible pairs of links.
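The Network Size point, that the number of possible links grows with the square of the number of nodes, can be checked with a short calculation. The function name `possible_links` and the directed/undirected switch are assumptions added for illustration.

```python
def possible_links(n, directed=True):
    """Count the possible links between n nodes, excluding self-links."""
    ordered_pairs = n * (n - 1)  # every node can point to every other node
    return ordered_pairs if directed else ordered_pairs // 2

for n in (10, 100, 1000):
    print(n, possible_links(n), possible_links(n, directed=False))
```

For 1000 nodes this gives 999,000 possible directed links, which is the "1 million" order of magnitude quoted above, and about half that number if links have no direction.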
Gathering Data: Electronic communication records (email, chat, etc.) can be harnessed to gather social network data more easily. Data on the nature and quality of relationships need to be collected using survey documents. Capturing, cleansing and organizing the data can take a lot of time and effort, just like in a typical analytics project.
Computation and Visualization: Modeling large networks can be computationally challenging, and visualizing them would also require special skills. Big data analytical tools may be needed to compute large networks.
Dynamic Networks: Relationships between nodes in a social network can be fluid. They can change in strength and functional nature. For example, there could be multiple relationships between two people: they could simultaneously be coworkers, coauthors, and spouses. The network should be modeled frequently to see the dynamics of the network.
Table 10.1: Social Network Analysis vs Traditional Data Analytics

Dimension             Social Network Analysis                   Traditional Data Mining
Nature of learning    Unsupervised learning                     Supervised and unsupervised learning
Analysis of goals     Hub nodes, important nodes,               Key decision rules, cluster centroids
                      and sub-networks
Dataset structure     A graph of nodes and (directed) links     Rectangular data of variables and instances
Analysis techniques   Visualization with statistics;
                      iterative graphical computation
Quality measurement   Usefulness is key criterion
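The "graph of nodes and (directed) links" dataset structure, and the goal of finding hub nodes, can be illustrated with a small sketch. The people, the relationship types and the function name `hub_nodes` are hypothetical, and counting distinct inbound neighbours is just one simple way to rank hubs.

```python
# Hypothetical directed links: (source, target, relationship type).
# Two people can be connected by more than one relationship at once.
links = [
    ("Ann", "Bob", "coworker"),
    ("Ann", "Bob", "coauthor"),
    ("Cid", "Bob", "coworker"),
    ("Dee", "Bob", "friend"),
    ("Bob", "Ann", "coworker"),
]

def hub_nodes(links):
    """Return nodes ordered by how many distinct nodes link to them."""
    incoming = {}
    for source, target, _kind in links:
        incoming.setdefault(target, set()).add(source)
    return sorted(incoming, key=lambda node: len(incoming[node]), reverse=True)

print(hub_nodes(links))
```

Unlike a rectangular table of instances and variables, each record here describes a relationship between two nodes, so the same pair can appear several times with different relationship types.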