As Per New Syllabus of VTU, 2015 Scheme
Choice Based Credit System (CBCS)

ALL IN ONE
SUNSTAR EXAM SCANNER

B.E.

(Authored by a Team of Experts)

SUNSTAR PUBLISHER
#4/1, Kuppaswamy Building, 19th Cross,
Cubbonpet, Bangalore - 560002.
Phone: 080 22224143
E-mail: sunstar884@gmail.com
CONTENTS

1. Internet of Things Technology
► CBCS Model Question Paper - 1
► CBCS Model Question Paper - 2

2. Big Data Analytics
► CBCS Model Question Paper - 1
► CBCS Model Question Paper - 2
► CBCS Model Question Paper - 3

3. Network Management
► CBCS Model Question Paper - 1
► CBCS Model Question Paper - 2

4. System Modeling and Simulation
► CBCS Model Question Paper - 1
► CBCS Model Question Paper - 2

As Per New VTU Syllabus w.e.f. 2015-16
Choice Based Credit System (CBCS)

SUNSTAR EXAM SCANNER

(VIII Sem B.E. CSE/ISE)
SYLLABUS
BIG DATA ANALYTICS
[As per Choice Based Credit System (CBCS) scheme]
(Effective from the academic year 2016-2017)
Subject Code: 15CS82
IA Marks: 20
Number of Lecture Hours/Week: 04
Exam Marks: 80
Total Number of Lecture Hours: (not legible)
Exam Hours: 03

MODULE 1
Hadoop Distributed File System Basics, Running Example Programs and Benchmarks, Hadoop MapReduce Framework, MapReduce Programming.

MODULE 2
Essential Hadoop Tools, Hadoop YARN Applications, Managing Hadoop with Apache Ambari, Basic Hadoop Administration Procedures.

MODULE 3
Business Intelligence Concepts and Application, Data Warehousing, Data Mining, Data Visualization.

MODULE 4
Decision Trees, Regression, Artificial Neural Networks, Cluster Analysis, Association Rule Mining.

MODULE 5
Text Mining, Naive Bayes Analysis, Support Vector Machines, Web Mining, Social Network Analysis.

Eighth Semester B.E. Degree Examination
CBCS - Model Question Paper - 1
BIG DATA ANALYTICS
Time: 3 hrs. Max. Marks: 80
Note: Answer any FIVE full questions, selecting ONE full question from each module.

Module - 1

b. With a neat diagram, explain the various system roles in an HDFS deployment. (12 Marks)
Ans. HDFS Components
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes. In a basic design, a single NameNode manages all the metadata needed to store and retrieve the actual data from the DataNodes. No data is actually stored on the NameNode, however. For a minimal Hadoop installation, there needs to be a single NameNode daemon and a single DataNode daemon running on at least one machine.
The design is a master/slave architecture in which the master (NameNode) manages the file system namespace and regulates access to files by clients. File system namespace operations such as opening, closing, and renaming files and directories are all managed by the NameNode. The NameNode also determines the mapping of blocks to DataNodes and handles DataNode failures.
The slaves (DataNodes) are responsible for serving read and write requests from the file system to the clients. The NameNode manages block creation, deletion, and replication.
An example of the client/NameNode/DataNode interaction is provided in Figure 1.1. When a client writes data, it first communicates with the NameNode and requests to create a file. The NameNode determines how many blocks are needed and provides the client with the DataNodes that will store the data. As part of the storage process, the data blocks are replicated after they are written to the assigned node. Depending on how many nodes are in the cluster, the NameNode will attempt to write replicas of the data blocks on nodes that are in other separate racks (if possible). If there is only one rack, then the replicated blocks are written to other servers in the same rack. After the DataNode acknowledges that the file block replication is complete, the client closes the file and informs the NameNode that the operation is complete. Note that the NameNode does not write any data directly to the DataNodes. It does, however, give the client a limited amount of time to complete the operation. If it does not complete in the time period, the operation is canceled.

[Figure 1.1 Various system roles in an HDFS deployment]

Reading data happens in a similar fashion. The client requests a file from the NameNode, which returns the best DataNodes from which to read the data. The client then accesses the data directly from the DataNodes.
Thus, once the metadata has been delivered to the client, the NameNode steps back and lets the conversation between the client and the DataNodes proceed. While data transfer is progressing, the NameNode also monitors the DataNodes by listening for heartbeats sent from DataNodes. The lack of a heartbeat signal indicates a potential node failure. In such a case, the NameNode will route around the failed DataNode and begin re-replicating the now-missing blocks. Because the file system is redundant, DataNodes can be taken offline (decommissioned) for maintenance by informing the NameNode of the DataNodes to exclude from the HDFS pool.
The mappings between data blocks and the physical DataNodes are not kept in persistent storage on the NameNode. For performance reasons, the NameNode stores all metadata in memory. Upon startup, each DataNode provides a block report (which it keeps in persistent storage) to the NameNode. The block reports are sent every 10 heartbeats. (The interval between reports is a configurable property.) The reports enable the NameNode to keep an up-to-date account of all data blocks in the cluster.
In almost all Hadoop deployments, there is a SecondaryNameNode. While not explicitly required by a NameNode, it is highly recommended. Despite the term "SecondaryNameNode" (now called CheckPointNode), it is not an active failover node and cannot replace the primary NameNode in case of its failure.
The purpose of the SecondaryNameNode is to perform periodic checkpoints that evaluate the status of the NameNode. Recall that the NameNode keeps all system metadata in memory for fast access. It also has two disk files that track changes to the metadata:
• An image of the file system state when the NameNode was started. This file begins with fsimage_* and is used only at startup by the NameNode.
• A series of modifications done to the file system after starting the NameNode. These files begin with edit_* and reflect the changes made after the fsimage_* file was read.
The location of these files is set by the dfs.namenode.name.dir property in the hdfs-site.xml file.
The SecondaryNameNode periodically downloads the fsimage and edits files, joins them into a new fsimage, and uploads the new fsimage file to the NameNode. Thus, when the NameNode restarts, the fsimage file is reasonably up-to-date and requires only the edit logs to be applied since the last checkpoint. If the SecondaryNameNode were not running, a restart of the NameNode could take a prohibitively long time due to the number of changes to the file system.

OR

2. a. Explain the MapReduce model with a simple mapper script and a simple reducer script. (08 Marks)
Ans. The MapReduce Model: Apache Hadoop is often associated with MapReduce computing. Prior to Hadoop version 2, this assumption was certainly true. Hadoop version 2 maintained the MapReduce capability and also made other processing models available to users. Virtually all the tools developed for Hadoop, such as Pig and Hive, will work seamlessly on top of the Hadoop version 2 MapReduce.
The MapReduce computation model provides a very powerful tool for many applications and is more common than most users realize. Its underlying idea is very simple.
There are two stages: a mapping stage and a reducing stage.
In the mapping stage, a mapping procedure is applied to input data. The map is usually some kind of filter or sorting process.
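The two stages can be illustrated with a minimal, single-machine Java sketch. It is not Hadoop code: the class name MapReduceSketch and the methods mapChunk, reduceCounts, and countWord are hypothetical names chosen for illustration, and the input is assumed to be a small list of text chunks held in memory. The map stage filters each chunk for a search word, and the reduce stage sums the partial counts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: hypothetical names, no Hadoop APIs involved.
class MapReduceSketch {

    // Map stage: scan one chunk and count the whitespace-separated tokens
    // that exactly equal the search word (punctuation is not stripped).
    static int mapChunk(String chunk, String word) {
        return (int) Arrays.stream(chunk.split("\\s+"))
                .filter(token -> token.equals(word))
                .count();
    }

    // Reduce stage: combine the per-chunk counts into a single total.
    static int reduceCounts(List<Integer> partialCounts) {
        return partialCounts.stream().mapToInt(Integer::intValue).sum();
    }

    // Apply the map stage to every chunk, then reduce the results.
    static int countWord(List<String> chunks, String word) {
        List<Integer> partialCounts = chunks.stream()
                .map(chunk -> mapChunk(chunk, word))
                .collect(Collectors.toList());
        return reduceCounts(partialCounts);
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList(
                "hadoop stores data",
                "map then reduce",
                "hadoop runs hadoop jobs");
        System.out.println(countWord(chunks, "hadoop")); // prints 3
    }
}
```

Because mapChunk looks only at its own chunk, the chunks could in principle be processed in any order or in parallel before the reduce step, which is the property the MapReduce model relies on.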
For instance, assume you need to count how many times the name "Kutuzov" appears in the novel War and Peace. One solution is to gather 20 friends and give them each a section of the book to search. This step is the map stage. The reduce phase happens when everyone is done counting and you sum the total as your friends tell you their counts.
Now consider how this same process could be accomplished using simple *nix command-line tools. The following grep command applies a specific map to a text file:
$ grep " Kutuzov " war-and-peace.txt
This command searches for the word Kutuzov (with leading and trailing spaces) in a text file called war-and-peace.txt. Each match is reported as a single line of text that contains the search term. The actual text file is a 3.2MB text dump of the novel War and Peace and is available from the book download page. The search term, Kutuzov, is a character in the book. If we ignore the grep count (-c) option for the moment, we can reduce the number of instances to a single number (257) by sending (piping) the results of grep into wc -l. (wc -l or "word count" reports the number of lines it receives.)
$ grep " Kutuzov " war-and-peace.txt | wc -l
257
Though not strictly a MapReduce process, this idea is quite similar to, and much faster than, the manual process of counting the instances of Kutuzov in the printed book. The analogy can be taken a bit further by using the two simple (and naive) shell scripts shown in Listing 1.1 and Listing 1.2. The shell scripts perform the same operation (much more slowly) and tokenize both the Kutuzov and Petersburg strings in the text.
Notice that more instances of Kutuzov are found this way (the first grep command ignored instances like "Kutuzov." or "Kutuzov,"). The mapper inputs a text file and then outputs data in a (key, value) pair (token-name, count) format. Strictly speaking, the input to the script is the file and the keys are Kutuzov and Petersburg. The reducer script takes these key-value pairs and combines the similar tokens and counts the total number of instances. The result is a new key-value pair (token-name, sum).
Listing 1.1 Simple Mapper Script
#!/bin/bash
while read line ; do
  for token in $line; do
    if [ "$token" = "Kutuzov" ] ; then
      echo "Kutuzov,1"
    elif [ "$token" = "Petersburg" ] ; then
      echo "Petersburg,1"
    fi
  done
done
Listing 1.2 Simple Reducer Script
#!/bin/bash
kcount=0
pcount=0
while read line ; do
  if [ "$line" = "Kutuzov,1" ] ; then
    let kcount=kcount+1
  elif [ "$line" = "Petersburg,1" ] ; then
    let pcount=pcount+1
  fi
done
echo "Kutuzov,$kcount"
echo "Petersburg,$pcount"
Formally, the MapReduce process can be described as follows. The mapper and reducer functions are both defined with respect to data structured in (key, value) pairs. The mapper takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:
Map(key1, value1) -> list(key2, value2)
The reducer function is then applied to each key-value pair, which in turn produces a collection of values in the same domain:
Reduce(key2, list(value2)) -> list(value3)
Each reducer call typically produces either one value (value3) or an empty response. Thus, the MapReduce framework transforms a list of (key, value) pairs into a list of values.
The MapReduce model is inspired by the map and reduce functions commonly used in many functional programming languages. The functional nature of MapReduce has some important properties:
i. Data flow is in one direction (map to reduce). It is possible to use the output of a reduce step as the input to another MapReduce process.
ii. As with functional programming, the input data are not changed. By applying the mapping and reduction functions to the input data, new data are produced. In effect, the original state of the Hadoop data lake is always preserved.
iii. Because there is no dependency on how the mapping and reducing functions are applied to the data, the mapper and reducer data flow can be implemented in any number of ways to provide better performance.
Distributed (parallel) implementations of MapReduce enable large amounts of data to be analyzed quickly. In general, the mapper process is fully scalable and can be applied to any subset of the input data. Results from multiple parallel mapping functions are then combined in the reducer phase.

b. Explain the compiling and running process of the Hadoop word count example program. (8 Marks)
Ans. WordCount is a simple application that counts the number of occurrences of each word in a given input set. The MapReduce framework operates exclusively on key-value pairs; that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pairs of different types. The MapReduce job proceeds as follows:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
The mapper implementation, via the map method, processes one line at a time as provided by the specified TextInputFormat class. It then splits the line into tokens separated by whitespaces using the StringTokenizer and emits a key-value pair of <word, 1>. The relevant code section is as follows:
public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}
Given two input files with contents Hello World Bye World and Hello Hadoop Goodbye Hadoop, the WordCount mapper will produce two maps:
<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
<Hello, 1>
<Hadoop, 1>
<Goodbye, 1>
<Hadoop, 1>
WordCount sets a mapper
job.setMapperClass(TokenizerMapper.class);
a combiner
job.setCombinerClass(IntSumReducer.class);
and a reducer
job.setReducerClass(IntSumReducer.class);
Hence, the output of each map is passed through the local combiner (which sums the values in the same way as the reducer) for local aggregation and is then sent on to the final reducer. Thus, each map above the combiner performs the following pre-reductions:
<Bye, 1>
<Hello, 1>
<World, 2>
<Goodbye, 1>
<Hadoop, 2>
<Hello, 1>
The reducer implementation, via the reduce method, simply sums the values, which are the occurrence counts for each key. The relevant code section is as follows:
public void reduce(Text key, Iterable<IntWritable> values,
                   Context context
                   ) throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);
  context.write(key, result);
}
The final output of the reducer is the following:
<Bye, 1>
<Goodbye, 1>
<Hadoop, 2>
<Hello, 2>
<World, 2>
To compile and run the program from the command line, perform the following steps:
1. Make a local wordcount_classes directory.
$ mkdir wordcount_classes
2. Compile the WordCount.java program using the `hadoop classpath` command to include all the available Hadoop class paths.
$ javac -cp `hadoop classpath` -d wordcount_classes WordCount.java
3. The jar file can be created using the following command:
$ jar -cvf wordcount.jar -C wordcount_classes/ .
4. To run the example, create an input directory in HDFS and place a text file in the new directory. For this example, we will use the war-and-peace.txt file:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
5. Run the WordCount application using the following command:
$ hadoop jar wordcount.jar WordCount war-and-peace-input war-and-peace-output
If everything is working correctly, Hadoop messages for the job should look like the following (abbreviated version):
15/05/24 18:13:26 INFO impl.TimelineClientImpl: Timeline service address: http://limulus:8188/ws/v1/timeline/
15/05/24 18:13:26 INFO client.RMProxy: Connecting to ResourceManager at limulus/10.0.0.1:8050
15/05/24 18:13:26 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/24 18:13:26 INFO input.FileInputFormat: Total input paths to process : 1
15/05/24 18:13:27 INFO mapreduce.JobSubmitter: number of splits:1
[...]
File Input Format Counters
Bytes Read=3288746
File Output Format Counters
Bytes Written=467839
In addition, the following files should be in the war-and-peace-output directory. The actual file name may be slightly different depending on your Hadoop version.
$ hdfs dfs -ls war-and-peace-output
Found 2 items
-rw-r--r-- 2 hdfs hdfs 0 2015-05-24 11:14 war-and-peace-output/_SUCCESS
-rw-r--r-- 2 hdfs hdfs 467839 2015-05-24 11:14 war-and-peace-output/part-r-00000
The complete list of word counts can be copied from HDFS to the working directory with the following command:
$ hdfs dfs -get war-and-peace-output/part-r-00000
If the WordCount program is run again using the same outputs, it will fail when it tries to overwrite the war-and-peace-output directory. The output directory and all its contents can be removed with the following command:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output

Module - 2
3. a. Explain with examples Apache Pig and Apache Hive. (10 Marks)
Ans. Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language. Pig Latin (the actual language) defines a set of transformations on a data set such as aggregate, join, and sort. Pig is often used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.
Apache Pig has several usage modes. The first is a local mode in which all processing is done on the local machine. The non-local (cluster) modes are MapReduce and Tez. These modes execute the job on the cluster using either the MapReduce engine or the optimized Tez engine.
There are also interactive and batch modes. They enable Pig applications to be developed locally in interactive modes, using small amounts of data, and then run at scale on the cluster in a production mode. The modes are summarized in Table 2.1.
Table 2.1 Apache Pig Usage Modes
                   Local Mode   Tez Local Mode   MapReduce Mode   Tez Mode
Interactive Mode   Yes          Experimental     Yes              Yes
Batch Mode         Yes          Experimental     Yes              Yes
Pig Example Walk-Through:
For this example, the following software environment is assumed. Other environments should work in a similar fashion.
• OS: Linux
• Platform: RHEL 6.6
• Hortonworks HDP 2.2 with Hadoop version: 2.6
• Pig version: 0.14.0
If the pseudo-distributed installation is used, the "Installation Recipes" instructions cover installing Pig.
In this simple example, Pig is used to extract user names from the /etc/passwd file. A full description of the Pig Latin language is beyond the scope of this introduction, but more information about Pig can be found at http://pig.apache.org/docs/r0.14.0/start.html. The following example assumes the user is hdfs, but any valid user with access to HDFS can run the example.
To begin the example, copy the passwd file to a working directory for local Pig operation:
$ cp /etc/passwd .
Next, copy the data file into HDFS for Hadoop MapReduce operation:
$ hdfs dfs -put passwd passwd
You can confirm the file is in HDFS by entering the following command:
$ hdfs dfs -ls passwd
-rw-r--r-- 2 hdfs hdfs 2526 2015-03-19 11:08 passwd
In the following example of local Pig operation, all processing is done on the local machine (Hadoop is not used). First, the interactive command line is started:
$ pig -x local
If Pig starts correctly, you will see a grunt> prompt. You may also see a bunch of INFO messages, which you can ignore. Next, enter the following commands to load the passwd file and then grab the user name and dump it to the terminal. Note that Pig commands must end with a semicolon (;).
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
The processing will start and a list of user names will be printed to the screen. To exit the interactive session, enter the command quit.
To run the same example on the cluster with Hadoop MapReduce, start Pig as follows:
$ pig -x mapreduce
The same sequence of commands can be entered at the grunt> prompt. You may wish to change the $0 argument to pull out other items in the passwd file. In the case of this simple script, you will notice that the MapReduce version takes much longer. Also, because we are running this application under Hadoop, make sure the file is placed in HDFS.
If you are using the Hortonworks HDP distribution with Tez installed, the Tez engine can be used as follows:
$ pig -x tez
Pig can also be run from a script. An example script (id.pig) is available from the example code download (see Appendix A, "Book Webpage and Code Download"). This script, which is repeated here, is designed to do the same things as the interactive version:
/* id.pig */
A = load 'passwd' using PigStorage(':'); -- load the passwd file
B = foreach A generate $0 as id; -- extract the user IDs
dump B;
store B into 'id.out'; -- write the results to a directory named id.out
Comments are delineated by /* */ and by -- at the end of a line. The script will create a directory called id.out for the results. First, ensure that the id.out directory is not in your local directory, and then start Pig with the script on the command line:
$ /bin/rm -r id.out/
$ pig -x local id.pig
If the script worked correctly, you should see at least one data file with the results and a zero-length file with the name _SUCCESS. To run the MapReduce version, use the same procedure; the only difference is that now all reading and writing takes place in HDFS.
$ hdfs dfs -rm -r id.out
$ pig id.pig
If Apache Tez is installed, you can run the example script using the -x tez option. You can learn more about writing Pig scripts at http://pig.apache.org/docs/r0.14.0/start.html.

Using Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, ad hoc queries, and the analysis of large data sets using a SQL-like language called HiveQL. Hive is considered the de facto standard for interactive SQL queries over petabytes of data using Hadoop and offers the following features:
• Tools to enable easy data extraction, transformation, and loading (ETL)
• A mechanism to impose structure on a variety of data formats
• Access to files stored either directly in HDFS or in other data storage systems such as HBase
• Query execution via MapReduce and Tez (optimized MapReduce)
Hive provides users who are already familiar with SQL the capability to query the data on Hadoop clusters. At the same time, Hive makes it possible for programmers who are familiar with the MapReduce framework to add their custom mappers and reducers to Hive queries. Hive queries can also be dramatically accelerated using the Apache Tez framework under YARN in Hadoop version 2.
Hive Example Walk-Through
For this example, the following software environment is assumed. Other environments should work in a similar fashion.
• OS: Linux
• Platform: RHEL 6.6
• Hortonworks HDP 2.2 with Hadoop version: 2.6
• Hive version: 0.14.0
Although the following example assumes the user is hdfs, any valid user with access to HDFS can run the example.
To start Hive, simply enter the hive command. If Hive starts correctly, you should get a hive> prompt.
$ hive
(some messages may show up here)
hive>
As a simple test, create and drop a table. Note that Hive commands must end with a semicolon (;).
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 1.705 seconds
hive> SHOW TABLES;
OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)
hive> DROP TABLE pokes;
OK
Time taken: 4.038 seconds
A more detailed example can be developed using a web server log file to summarize message types. First, create a table using the following command:
hive> CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
OK
Time taken: 0.129 seconds
Next, load the data (in this case, from the sample.log file). This file is available from the example code download. Note that the file is found in the local directory and not in HDFS.
hive> LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;
Loading data to table default.logs
Table default.logs stats: [numFiles=1, numRows=0, totalSize=99271, rawDataSize=0]
OK
Time taken: 0.953 seconds
Finally, apply the select step to the file. Note that this invokes a Hadoop MapReduce operation. The results appear at the end of the output (e.g., totals for the message types DEBUG, ERROR, and so on).
hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;
Query ID = hdfs_20150327130000_d1e1a265-a5d7-4ed8-b785-2c6569791368
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1427397392757_0001, Tracking URL = http://norbert:8088/proxy/application_1427397392757_0001/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill job_1427397392757_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-03-27 13:00:17,399 Stage-1 map = 0%, reduce = 0%
2015-03-27 13:00:26,100 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.14 sec
2015-03-27 13:00:34,979 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.07 sec
[DEBUG] 434
[ERROR] 3
[FATAL] 1
[INFO] 96
[TRACE] 816
[WARN] 4
Time taken: 32.624 seconds, Fetched: 6 row(s)
To exit Hive, simply type exit;
hive> exit;

b. Explain the following commands in the HBase data model. (06 Marks)
1) Create the database
2) Inspect the database
3) Create row
4) Delete a row
5) Remove a table
6) Adding data in bulk
Ans.
1) Create the Database
The next step is to create the database in HBase using the following command:
hbase(main):006:0> create 'apple', 'price', 'volume'
0 row(s) in 0.8150 seconds
In this case, the table name is apple, and two columns are defined. The date will be used as the row key. The price column is a family of four values (open, close, low, high). The put command is used to add data to the database from within the shell. For instance, the preceding data can be entered by using the following commands:
put 'apple','6-May-15','price:open','126.56'
put 'apple','6-May-15','price:high','126.75'
put 'apple','6-May-15','price:low','123.36'
put 'apple','6-May-15','price:close','125.01'
put 'apple','6-May-15','volume','71820387'
Note that these commands can be copied and pasted into the HBase shell and are available from the book download files. The shell also keeps a history for the session, and previous commands can be retrieved and edited for resubmission.
2) Inspect the Database
The entire database can be listed using the scan command. Be careful when using this command with large databases. This example is for one row.
hbase(main):006:0> scan 'apple'
ROW COLUMN+CELL
MapR·educe Total cumulativ·c-Cl'U tinie: '4·secoitds'-70:msec'.' .
Ended Job =job~l4273~7392757 oooi·., ·.. ·: ... . C • . ·· ·
Map~edncdobs La1indied: . · -~: - ' · · · . . .

:;;j Stagc-Sta~e-1 :·Map: I Reduce: I Cumulative CPU; f07 sec HDfS Read: i06384 .'
HDFS Write; 63 -SUCCESS . . .
Total MapReduce CPU Time Spent: 4 seconds 70 msec·
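The create, put, and scan steps above can be pictured as operations on a nested map: table, then row key, then column (family:qualifier), then value. The sketch below is a plain in-memory Python model written only for illustration. It is not the HBase client API, the names TinyTable, put, get, scan, and deleteall are hypothetical, and it ignores timestamps, versions, and storage.

```python
class TinyTable:
    """Toy in-memory model of one HBase table (illustration only)."""

    def __init__(self, name, *families):
        self.name = name
        self.families = set(families)  # column families are fixed at create time
        self.rows = {}                 # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        # The family is the part before ':'; a bare name such as "volume"
        # is treated as a family with an empty qualifier.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        # Return a copy of all cells stored under one row key.
        return dict(self.rows.get(row_key, {}))

    def scan(self):
        # Yield every cell, ordered by row key and then by column name.
        for row_key in sorted(self.rows):
            for column in sorted(self.rows[row_key]):
                yield row_key, column, self.rows[row_key][column]

    def deleteall(self, row_key):
        # Remove an entire row; a missing row is silently ignored.
        self.rows.pop(row_key, None)


# Mirror the shell session: create 'apple' with families 'price' and 'volume'.
apple = TinyTable("apple", "price", "volume")
apple.put("6-May-15", "price:open", "126.56")
apple.put("6-May-15", "price:high", "126.75")
apple.put("6-May-15", "price:low", "123.36")
apple.put("6-May-15", "price:close", "125.01")
apple.put("6-May-15", "volume", "71820387")

for row_key, column, value in apple.scan():
    print(row_key, column, value)
```

In this model the row key (the date) selects a row and the column name selects a cell inside it, which is why the shell commands for reading and deleting a row only need the table name and the row key.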
3) Get a Row
You can use the row key to access an individual row. In the stock price database, the date is the row key.
hbase(main):008:0> get 'apple', '6-May-15'
COLUMN CELL
price:close timestamp=1430955128359, value=125.01
price:high timestamp=1430955126024, value=126.75
price:low timestamp=1430955126053, value=123.36
price:open timestamp=1430955125977, value=126.56
volume: timestamp=1430955141440, value=71820387
5 row(s) in 0.130 seconds
4) Delete a Row
You can delete an entire row by giving the deleteall command as follows:
hbase(main):009:0> deleteall 'apple', '6-May-15'
5) Remove a Table
To remove (drop) a table, you must first disable it. The following two commands remove the apple table from HBase:
hbase(main):009:0> disable 'apple'
hbase(main):010:0> drop 'apple'
6) Adding Data in Bulk
There are several ways to efficiently load bulk data into HBase. Covering all of these methods is beyond the scope of this chapter. Instead, we will focus on the ImportTsv utility, which loads data in tab-separated values (tsv) format into HBase. It has two distinct usage modes:
• Loading data from a tsv-format file in HDFS into HBase via the put command
• Preparing StoreFiles to be loaded via the completebulkload utility
The following example shows how to use ImportTsv for the first option, loading the tsv-format file using the put command. The second option works in a two-step fashion and can be explored by consulting http://hbase.apache.org/book.html#importtsv.
The first step is to convert the Apple-stock.csv file to tsv format. The following script, which is included in the book software, will remove the first line and do the conversion. In doing so, it creates a file named Apple-stock.tsv.
$ convert-to-tsv.sh Apple-stock.tsv /tmp
Finally, ImportTsv is run using the following command line. Note the column designation in the -Dimporttsv.columns option. In the example, the HBASE_ROW_KEY is set as the first column, that is, the date for the data.
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,price:open,price:high,price:low,price:close,volume apple /tmp/Apple-stock.tsv
The ImportTsv command will use MapReduce to load the data into HBase. To verify that the command works, drop and re-create the apple database, as described previously, before running the import command.

c. What is YARN? Explain any five commands? (04 Marks)
Ans. The Hadoop YARN project includes the Distributed-Shell application, which is an example of a Hadoop non-MapReduce application built on top of YARN. Distributed-Shell is a simple mechanism for running shell commands and scripts in containers on multiple nodes in a Hadoop cluster. This application is not meant to be a production administration tool, but rather a demonstration of the non-MapReduce capability that can be implemented on top of YARN. There are multiple mature implementations of a distributed shell that administrators typically use to manage a cluster of machines. In addition, Distributed-Shell can be used as a starting point for exploring and building Hadoop YARN applications. This chapter offers guidance on how the Distributed-Shell can be used to understand the operation of YARN applications.
Using the YARN Distributed-Shell
For the purpose of the examples presented in the remainder of this chapter, we assume and assign the following installation path, based on Hortonworks HDP 2.2, to the Distributed-Shell application:
$ export YARN_DS=/usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar
For the pseudo-distributed install using Apache Hadoop version 2.6.0, the following path will run the Distributed-Shell application (assuming $HADOOP_HOME is defined to reflect the location of Hadoop):
$ export YARN_DS=$HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar
If another distribution is used, search for the file hadoop-yarn-applications-distributedshell*.jar and set $YARN_DS based on its location. Distributed-Shell exposes various options that can be found by running the following command:
$ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar $YARN_DS -help
The output of this command follows:
usage: Client
-appname <arg>    Application Name. Default value - DistributedShell
-attempt_failures_validity_interval <arg>    When attempt_failures_validity_interval in milliseconds is set to > 0, the failure number will not take failures which happen out of the validityInterval into failure count. If failure count reaches to maxAppAttempts, the application will be failed.
-container_memory <arg>    Amount of memory in MB to be requested to run the shell command
-container_vcores <arg>    Amount of virtual cores to be requested to run the shell command
-create    Flag to indicate whether to create the domain specified with -domain.
-debug    Dump out debug information
-domain <arg>    ID of the timeline domain where the timeline entities will be put
-help    Print usage
-jar <arg>    Jar file containing the application master
-keep_containers_across_application_attempts    Flag to indicate whether to keep containers across application attempts. If the flag is true, running containers will not be killed when the application attempt fails, and these containers will be retrieved by the new application attempt.
-log_properties <arg>    log4j.properties file
-master_memory <arg>    Amount of memory in MB to be requested to run the application master
-master_vcores <arg>    Amount of virtual cores to be requested to run the application master
-modify_acls <arg>    Users and groups that are allowed to modify the timeline entities in the given domain
-node_label_expression <arg>    Node label expression to determine the nodes where all the containers of this application will be allocated; "" means containers can be allocated anywhere. If you don't specify the option, the default node_label_expression of the queue will be used.
-num_containers <arg>    No. of containers on which the shell command needs to be executed
-priority <arg>    Application Priority. Default 0
-queue <arg>    RM Queue in which this application is to be submitted
-shell_args <arg>    Command line args for the shell script. Multiple args can be separated by empty space.
-shell_cmd_priority <arg>    Priority for the shell command containers
-shell_command <arg>    Shell command to be executed by the Application Master. Can only specify either --shell_command or --shell_script
-shell_env <arg>    Environment for shell script. Specified as env_key=env_val pairs
-shell_script <arg>    Location of the shell script to be executed. Can only specify either --shell_command or --shell_script
-timeout <arg>    Application timeout in milliseconds
-view_acls <arg>    Users and groups that are allowed to view the timeline entities in the given domain.

4. a. Explain views of Apache Ambari? (__ Marks)
Ans. After completing the initial installation and logging into Ambari, a dashboard similar to that shown in Figure 4.1 is presented. The same four-node cluster that was created is used to explore Ambari. If you need to reopen the Ambari dashboard interface, simply enter the following command (which assumes you are using the Firefox browser, although other browsers may also be used):
$ firefox localhost:8080
The default login and password are admin and admin, respectively. Before continuing any further, you should change the default password. To change the password, select Manage Ambari from the Admin pull-down menu in the upper-right corner. In the management window, click Users under User + Group Management, and then click the admin user name. Select Change Password and enter a new password. When you are finished, click the Go To Dashboard link on the left side of the window to return to the dashboard view.
To leave the Ambari interface, use the Admin pull-down menu. The dashboard reports on the installed services.
A glance at the dashboard should allow you to get a sense of how the cluster is performing.
The top navigation menu bar, shown in Figure 4.1, provides access to the Dashboard, Services, Hosts, Admin, and Views features (the 3x3 cube is the Views menu). The status (up/down) of various Hadoop services is displayed on the left using green/orange dots. Note that two of the services managed by Ambari are Nagios and Ganglia. These are the standard cluster management services managed by Ambari; they are used to provide cluster monitoring (Nagios) and metrics (Ganglia).
Dashboard View
The Dashboard view provides small status widgets for many of the services running on the cluster. The actual services are listed on the left-side vertical menu. You can move, edit, remove, or add these widgets as follows:
• Moving: Click and hold a widget while it is moved about the grid.
• Edit: Place the mouse on the widget and click the gray edit symbol in the upper-right corner of the widget. You can change several different aspects (including thresholds) of the widget.
• Remove: Place the mouse on the widget and click the X in the upper-left corner.
• Add: Click the small triangle next to the Metrics tab and select Add. The available widgets will be displayed. Select the widgets you want to add and click Apply.
Some widgets provide additional information when you move the mouse over them. For instance, the DataNodes widget displays the number of live, dead, and decommissioning hosts. Clicking a widget provides an enlarged view. For instance, Figure 4.2 provides a detailed view of the CPU Usage widget from Figure 4.1.
The Dashboard view also includes a heatmap view of the cluster. Cluster heatmaps physically map selected metrics across the cluster. When you click the Heatmaps tab, a heatmap for the cluster will be displayed. To select the metric used for the heatmap, choose the desired option from the Select Metric pull-down menu. Note the scale used, as displayed in Figure 4.3.
Figure 4.1 Apache Ambari dashboard view of a Hadoop cluster
(Figure: CPU Usage widget)
Configuration history is the final tab in the dashboard window. This view provides a list of configuration changes made to the cluster. As shown in Figure 4.4, Ambari enables configurations to be sorted by service, configuration group, date, and author. To find the specific configuration settings, click the service name. More information on configuration settings is provided later in the chapter.
Figure 4.4 Ambari master configuration changes list
Service View
The Services menu provides a detailed look at each service running on the cluster. It also provides a graphical method for configuring each service (i.e., instead of hand-editing the /etc/hadoop/conf XML files). The Summary tab provides a current Summary view of important service metrics and an Alerts and Health Checks sub-window.
Similar to the Dashboard view, the currently installed services are listed on the left-side menu. To select a service, click the service name in the menu. When applicable, each service will have its own Summary, Alerts and Health Monitoring, and Service Metrics windows. For example, Figure 4.5 shows the Service view for HDFS. Important information such as the status of NameNode, SecondaryNameNode, DataNodes, uptime, and available disk space is displayed in the Summary window. The Alerts and Health Checks window provides the latest status of the service and its component systems. Finally, several important real-time service metrics are displayed as widgets at the bottom of the screen. As on the dashboard, these widgets can be expanded to display a more detailed view.
Clicking the Configs tab will open an options form, shown in Figure 4.6, for the service. The options (properties) are the same ones that are set in the Hadoop XML files. The user should manage them only through the Ambari interface; that is, the user should not edit the files by hand.
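The service options (properties) shown on the Configs tab are stored as name/value pairs in the Hadoop XML files. As a rough, read-only illustration of that layout, the sketch below parses a small configuration snippet with Python's standard library. The snippet contents and the helper name read_properties are assumptions made for this example; on an Ambari-managed cluster the files themselves should still be changed only through Ambari.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the usual Hadoop *-site.xml layout:
# a <configuration> root holding <property> elements with <name> and <value>.
SAMPLE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a dict of property name -> value from a Hadoop-style XML string."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:  # skip malformed entries that have no <name>
            props[name] = value
    return props


props = read_properties(SAMPLE)
for name, value in sorted(props.items()):
    print(f"{name} = {value}")
```

Values are kept as strings here, since the XML files store every setting as text; any conversion to numbers is left to the caller.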
The current settings available for each service are shown in the form. The administrator can set each of these properties by changing the values in the form. Placing the mouse in the input box of the property displays a short description of each property. Where possible, properties are grouped by functionality. The form also has provisions for adding properties that are not listed. An example of changing service properties and restarting the service components is provided in the "Managing Hadoop Services" section.
If a service provides its own graphical interface (e.g., HDFS, YARN, Oozie), then that interface can be opened in a separate browser tab by using the Quick Links pull-down menu located in the top middle of the window.
Finally, the Service Action pull-down menu in the upper-left corner provides a method for starting and stopping each service and/or its component daemons across the cluster. Some services may have a set of unique actions (such as rebalancing HDFS) that apply to only certain situations. Finally, every service has a Service Check option to make sure the service is working properly. The service check is initially run as part of the installation process and can be valuable when diagnosing problems.
Figure 4.6 Ambari service options for HDFS
Hosts View
Selecting the Hosts menu item provides the information shown in Figure 4.7. The remaining options in the Actions pull-down menu provide control over the various service components running on the hosts.
Figure 4.7 Ambari main Hosts screen
Further details for a particular host can be found by clicking the host name in the left column. As shown in Figure 4.8, the individual host view provides three sub-windows: Components, Host Metrics, and Summary information. The Components window lists the services that are currently running on the host. Each service can be stopped, restarted, decommissioned, or placed in maintenance mode. The Metrics window displays widgets that provide important metrics (e.g., CPU, memory, disk, and network usage). Clicking the widget displays a larger version of the graphic. The Summary window provides basic information about the host, including the last time a heartbeat was received.
Figure 4.8 Ambari cluster host detail view
Admin View
The Administration (Admin) view provides three options. The first, as shown in Figure 4.9, displays a list of installed software. This repositories listing generally reflects the version of
Hortonworks Data Platform (HDP) used during the installation process. The Service Accounts option lists the service accounts added when the system was installed. These accounts are used to run various services and tests for Ambari. The third option, Security, sets the security on the cluster. A fully secured Hadoop cluster is important in many instances and should be explored if a secure environment is needed.
Views View
Ambari Views is a framework offering a systematic way to plug in user interface capabilities that provide for custom visualization, management, and monitoring features in Ambari. Views allows you to extend and customize Ambari to meet your specific needs.
Figure 4.9 Ambari installed packages with versions, numbers, and descriptions

b. Explain the basic Hadoop YARN administration? (04 Marks)
Ans. YARN has several administrative features and commands. To find out more about them, examine the YARN commands documentation at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#Administration_Commands. The main administration command is yarn rmadmin (resource manager administration). Enter yarn rmadmin -help to learn more about the various options.
If a NodeManager host/node needs to be removed from the cluster, it should be decommissioned first. Assuming the node is responding, you can easily decommission it from the Ambari web UI. Simply go to the Hosts view, click on the host, and select Decommission from the pull-down menu next to the NodeManager component. Note that the host may also be acting as an HDFS DataNode. Use the Ambari Hosts view to decommission the HDFS host in a similar fashion.
YARN WebProxy
The Web Application Proxy is a separate proxy server in YARN that addresses security issues with the cluster web interface on ApplicationMasters. By default, the proxy runs as part of the ResourceManager itself, but it can be configured to run in a stand-alone mode by adding the configuration property yarn.web-proxy.address to yarn-site.xml. (Using Ambari, go to the YARN Configs view, scroll to the bottom, and select Custom yarn-site.xml/Add property.) In stand-alone mode, yarn.web-proxy.principal and yarn.web-proxy.keytab control the Kerberos principal name and the corresponding keytab, respectively, for use in secure mode. These elements can be added to yarn-site.xml if required.
Using the JobHistoryServer
The removal of the JobTracker and migration of MapReduce from a system to an application-level framework necessitated creation of a place to store MapReduce job history. The JobHistoryServer provides all YARN MapReduce applications with a central location in which to aggregate completed jobs for historical reference and debugging. The settings for the JobHistoryServer can be found in the mapred-site.xml file.
Managing YARN Jobs
YARN jobs can be managed using the yarn application command. The following options, including -kill, -list, and -status, are available to the administrator with this command. MapReduce jobs can also be controlled with the mapred job command.
usage: application
-appTypes <comma-separated list of application types>    Works with --list to filter applications based on their type.
-help    Displays help for all commands.
-kill <Application ID>    Kills the application.
-list    Lists applications from the RM. Supports optional use of -appTypes to filter applications based on application type.
-status <Application ID>    Prints the status of the application.
Neither the YARN ResourceManager UI nor the Ambari UI can be used to kill YARN applications. If a job needs to be killed, give the yarn application command to find the Application ID and then use the -kill argument.
Setting Container Memory
YARN manages application resource containers over the entire cluster. Controlling the amount of container memory takes place through three important values in the yarn-site.xml file:
• yarn.nodemanager.resource.memory-mb is the amount of memory the NodeManager can use for containers.
• yarn.scheduler.minimum-allocation-mb is the smallest container allowed by the ResourceManager. A requested container smaller than this value will result in an allocated container of this size (default 1024MB).
• yarn.scheduler.maximum-allocation-mb is the largest container allowed by the ResourceManager (default 8192MB).
Setting Container Cores
• yarn.scheduler.minimum-allocation-vcores: The minimum allocation for every container request at the ResourceManager, in terms of virtual CPU cores. Requests smaller than this allocation will not take effect, and the specified value will be allocated the minimum number of cores. The default is 1 core.
• yarn.scheduler.maximum-allocation-vcores: The maximum allocation for every container request at the ResourceManager, in terms of virtual CPU cores. Requests larger than this allocation will not take effect, and the number of cores will be capped at this value. The default is 32.
• yarn.nodemanager.resource.cpu-vcores: The number of CPU cores that can be allocated for containers.
Setting MapReduce Properties
MapReduce runs as a YARN application. Consequently, it may be necessary to adjust some of the mapred-site.xml properties as they relate to the map and reduce containers. The following properties are used to set some Java arguments and memory size for both the map and reduce containers:
• mapred.child.java.opts provides a larger or smaller heap size for child JVMs of maps (e.g., -Xmx2048m).
• mapreduce.map.memory.mb provides a larger or smaller resource limit for maps (default = 1536MB).
• mapreduce.reduce.memory.mb provides a larger or smaller resource limit for reduces (default = 3072MB).
• mapreduce.reduce.java.opts provides a larger or smaller heap size for child reducers.

Module - 3
5. a. Why should organizations invest in business intelligence (BI) solutions? Are BI tools more important than IT security solutions? (08 Marks)
Ans. Business intelligence (BI) is an umbrella term that includes a variety of IT applications that are used to analyze an organization's data and communicate the information to relevant users.
Business intelligence is a concept that usually involves the delivery and integration of relevant and useful business information in an organization. Its major components are data warehousing, data mining, querying, and reporting (Figure 5.1).
Figure 5.1 Business intelligence and data mining cycle
The nature of life and businesses is to grow. Information is the lifeblood of business. Businesses use many techniques for understanding their environment and predicting the future for their own benefit and growth. Decisions are made from facts and feelings. Data-based decisions are more effective than those based on feelings alone. Actions based on accurate data, information, knowledge, experimentation, and testing, using fresh insights, are more likely to succeed and lead to sustained growth.
The organization should invest in business intelligence (BI) solutions:
Companies use BI to detect significant events and identify/monitor business trends in order to adapt quickly to their changing environment and scenario. If effective business intelligence training is used in the organization, decision-making processes at all levels can improve. There are two main kinds of decisions, strategic and operational, and BI can help make both better.
Strategic decisions are those that impact the direction of the company. The decision to reach out to a new customer set would be a strategic decision.
Operational decisions are more routine and tactical decisions, focused on developing greater efficiency. Updating an old website with new features will be an operational decision.
In strategic decision-making, the goal itself may or may not be clear, and the same is true for the path to reach the goal. The consequences of the decision would be apparent some time later. Thus, one is constantly scanning for new possibilities and new paths to achieve the goals. BI can help with what-if analysis of many possible scenarios. BI can also help create new ideas based on new patterns found from data mining.
Operational decisions can be made more efficient using an analysis of past data. A classification system can be created and modeled using the data of past instances to develop a good model of the domain. This model can help improve operational decisions in the future. BI can help automate operations-level decision-making and improve efficiency by making millions of micro-level operational decisions in a model-driven way.
For example, a bank might want to make decisions about making financial loans in a more scientific way using data-based models. A decision-tree-based model could provide consistently accurate loan decisions. Developing such decision tree models is one of the main applications of data mining techniques.
Effective BI has an evolutionary component, as business models evolve. When people and organizations act, new facts (data) are generated. Current business models can be tested against the new data, and it is possible that those models will not hold up well. In that case, decision models should be revised and new insights should be incorporated. An unending process of generating fresh new insights in real time can help make better decisions, and thus can be a significant competitive advantage.
BI tools more important than IT security solutions:
BI tools include data warehousing, online analytical processing, social media analytics, reporting, dashboards, querying, and data mining.
BI tools can range from very simple tools that could be considered end-user tools, to very sophisticated tools that offer a very broad and complex set of functionality. Thus, even executives can be their own BI experts, or they can rely on BI specialists to set up the BI mechanisms for them. Thus, large organizations invest in expensive sophisticated BI solutions that provide good information in real time.
A spreadsheet tool, such as Microsoft Excel, can act as an easy but effective BI tool by itself. Data can be downloaded and stored in the spreadsheet, then analyzed to produce insights, then presented in the form of graphs and tables. This system offers limited automation using macros and other features. The analytical features include basic statistical and financial functions. Pivot tables help do sophisticated what-if analysis. Add-on modules can be installed to enable moderately sophisticated statistical analysis.
A dashboarding system, such as Tableau, can offer a sophisticated set of tools for gathering, analyzing, and presenting data. At the user end, modular dashboards can be designed and redesigned easily with a graphical user interface. The back-end data analytical capabilities include many statistical functions. The dashboards are linked to data warehouses at the back-
· of manag~1{1cnt nm! _tactical s!rritcgic managemen!.processes -caii be in1proved , - . ..
e'nd 10 ensure th:1i the tablmmd graphs and oth~r elc111e'nts. of."thc dashtio'ar~ ar¢'iipdated in
·_. ' . D'i.for llc'ttcr O'cl'i$ionst Them arc t1vo·main ki,ids -ofdecisioiis: . . -. real tinic: . . : · ·" i.· ' . . - . : ' · ' ·· ·• . .. .
_ :.__:: __,_ _._-•-Strateglc:dccisi~nsiui ·, < --·-
Data mini11g systems, such as 113M SPSS:Mooerer;-are- i1idustriol ·strength systems that
. • .Operational ikcisions. · ' ,_. ,-·
provide capabilities to apply a wide range of analytical models on large data sets. Open source systems, such as Weka, are popular platforms designed to help mine large amounts of data to discover patterns.

b. What is the purpose of a data warehouse? (04 Marks)
Ans. The purpose of a data warehouse lies in its definition: it is a database created by combining data gathered from various sources, which can be of different types and formats (e.g. text, SQL, XML etc.).
Once such a huge amount of data from different sources is stored in a single database, you can analyse the accumulated data and try to answer queries which were not possible, or were performance intensive, earlier.
In a nutshell, data warehousing is a process of collecting data, transforming it, loading it into a single database and then using a BI (Business Intelligence) tool to answer your analytical queries and to anticipate further questions that may arise in your domain or business.
Below are a few reasons:
Improving visibility of data: An organization registers data in different systems, which support the various business processes. In order to create an overall picture of business operations, customers and suppliers, thus creating a single version of the truth, the data must come together in one place and be made compatible. Both external data (from the environment) and internal data (from ERP, CRM and financial systems) should merge into the data warehouse and then be grouped. There is then a single source to answer all your queries.
Improved performance: One could use an already existing operational database as the single destination for all the data, yet there are constraints such as performance, which degrades for both operational processes and reporting processes. Therefore we create a separate database, tuned and optimized to answer queries that require retrieving and analysing huge amounts of data.
Increased data quality: Stakeholders and users frequently overestimate the quality of data in the source systems. Unfortunately, source systems quite often contain data of poor quality. When we use a data warehouse, we can greatly improve the data quality, either by correcting the data whilst loading, where possible, or by tackling the problem at its source.
Faster and more advanced reporting: The structure of data warehouses enables end users to report in a flexible manner and to quickly perform interactive analysis on the basis of various predefined angles. They may, for example, with a single mouse click jump from year level to quarter to month level, and quickly switch between the customer data and the product data while the indicator remains fixed. In this way, end users can actually juggle with the data and thus quickly gain knowledge about business operations and KPIs (Key Performance Indicators).
A data warehouse (DW) is an organized collection of integrated, subject-oriented databases designed to support decision support functions. DW is organized at the right level of granularity to provide clean enterprise-wide data in a standardized format for reports, queries, and analysis. DW is physically and functionally separate from an operational and transactional database. Creating a DW for analysis and queries represents significant investment in time and effort. It has to be constantly kept up-to-date for it to be useful. DW offers many business and technical benefits.
DW supports business reporting and data mining activities. It can facilitate distributed access to up-to-date business knowledge for departments and functions, thus improving business efficiency and customer service. DW can help gain competitive advantage by facilitating decision making and helping reform business processes.
DW enables a consolidated view of corporate data, all cleaned and organized. Thus, the entire organization can see an integrated view of itself. DW thus provides better and timely information. It simplifies data access and allows end users to perform extensive analysis. It enhances overall IT performance by not burdening the operational databases used by Enterprise Resource Planning (ERP) and other systems.

c. Businesses need a "two-second advantage" to succeed. What does that mean to you? (04 Marks)
Ans. Some of the examples cited for the "Two Second Advantage" are:
The airlines have all the data about your bags. Why is it then that you have to wait for an eternity until all the bags have arrived at the baggage carousel to discover that your bag is missing, and then report it to their customer service? Why can't airlines be proactive and let passengers know upfront that their bags will be arriving later?
Power companies have the data at hand on grid failures. Why do they only respond several hours after dozens of customers call and complain? Wouldn't it be better if they used the data ahead of time to prevent failures in the first place?
The Fed has all the data to take fiscal, economic and monetary decisions in real time. Why is it still clinging on to an obsolete model of meeting only a few times a year to review the data and adjust the policies and rates in hindsight? Why can't the Fed be replaced by a much smarter, real time computer algorithm?
Tibco's products bring that valuable two second advantage to the enterprise for structured data. The most common form of structured data is a database, where data is stored in rows and columns.
Documents, on the other hand, represent the world of unstructured data. There is a wealth of information contained in enterprise documents. However, this information cannot be analyzed easily since the content is not organized in a structured way. Imagine the potential an enterprise could unleash if it were able to analyze the information scattered across thousands of documents to obtain the two second advantage.
Infinote brings structure to the unstructured world of documents. It provides a platform to build tools that can give corporations the ability to extract information from their documents and become proactive instead of being reactive. In our future columns, I will explain how Infinote can help your corporation get the two second advantage by mining information from documents.

Liberty Stores Case Exercise
Liberty Stores Inc is a specialized global retail chain that sells organic food, organic clothing, wellness products, and education products to enlightened LOHAS (Lifestyles of the Healthy and Sustainable) citizens worldwide. The company is 20 years old and is growing rapidly. It now operates in 5 continents, 50 countries, 150 cities, and has 500 stores. It sells 20,000 products and has 10,000 employees. The company has revenues of over $5 billion and has a profit of about 5 percent of revenue. The company pays special attention to the conditions under which the products are grown and produced. It donates about one-fifth (20 percent) of its pretax profits to global and local charitable causes.
1. Create a comprehensive dashboard for the CEO of the company.
2. Create another dashboard for a country head.
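The collect, transform, load and drill-down ideas from answer 5(b) can be made concrete with a small sketch, which may also serve as a starting point for the dashboard exercise. It is only an illustration under stated assumptions: the two source extracts, their field names and date formats, and the `load_warehouse` and `roll_up` helpers are all hypothetical and do not come from any particular BI tool.

```python
from collections import defaultdict
from datetime import date

# Hypothetical extracts from two source systems with different formats.
erp_sales = [
    {"day": "2016-01-15", "country": "India", "amount_usd": "1200.50"},
    {"day": "2016-02-03", "country": "India", "amount_usd": "800.00"},
]
web_sales = [
    ("03/04/2016", "USA", 450.25),    # (MM/DD/YYYY, country, amount)
    ("03/04/2016", "USA", 450.25),    # duplicate row from a resent feed
    ("07/19/2016", "India", 300.00),
]

def load_warehouse(erp_rows, web_rows):
    """Transform both feeds into one standard format and de-duplicate.

    Simplification: identical rows are treated as duplicates, so two
    genuinely identical sales would be merged.
    """
    rows = set()
    for r in erp_rows:
        y, m, d = (int(p) for p in r["day"].split("-"))
        rows.add((date(y, m, d), r["country"], float(r["amount_usd"])))
    for day, country, amount in web_rows:
        m, d, y = (int(p) for p in day.split("/"))
        rows.add((date(y, m, d), country, float(amount)))
    return sorted(rows)

def roll_up(rows, level):
    """Aggregate sales per country at 'year', 'quarter' or 'month' level."""
    totals = defaultdict(float)
    for day, country, amount in rows:
        if level == "year":
            period = str(day.year)
        elif level == "quarter":
            period = f"{day.year}-Q{(day.month - 1) // 3 + 1}"
        elif level == "month":
            period = f"{day.year}-{day.month:02d}"
        else:
            raise ValueError(f"unknown level: {level}")
        totals[(country, period)] += amount
    return dict(totals)

warehouse = load_warehouse(erp_sales, web_sales)
for level in ("year", "quarter", "month"):
    print(level, roll_up(warehouse, level))
```

The same cleaned rows answer the question at every level of detail, which is the "jump from year to quarter to month" behaviour described in the answer; a real warehouse would do this in a database rather than in memory.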
OR

6. a. What is data mining? What are supervised and unsupervised learning techniques? (08 Marks)
Ans. Data mining is the art and science of discovering knowledge, insights, and patterns in data. It is the act of extracting useful patterns from an organized collection of data. Patterns must be valid, novel, potentially useful, and understandable. The implicit assumption is that data about the past can reveal patterns of activity that can be projected into the future.
Data mining is a multidisciplinary field that borrows techniques from a variety of fields. It utilizes the knowledge of data quality and data organizing from the databases area. It draws modeling and analytical techniques from statistics and computer science (artificial intelligence) areas. It also draws the knowledge of decision-making from the field of business management.
The field of data mining emerged in the context of pattern recognition in defense, such as identifying a friend-or-foe on a battlefield. Like many other defense-inspired technologies, it has evolved to help gain a competitive advantage in business.
For example, "customers who buy cheese and milk also buy bread 90 percent of the time" would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, "people with blood pressure greater than 160 and an age greater than 65 were at a high risk of dying from a heart stroke" is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity.
Past data can be of predictive value in many complex situations, especially where the pattern may not be so easily visible without the modeling technique. Here is a dramatic case of a data-driven decision-making system that beats the best of human experts. Using past data, a decision tree model was developed to predict votes for Justice Sandra Day O'Connor, who had a swing vote in a 5-4 divided US Supreme Court. All her previous decisions were coded on a few variables. What emerged from data mining was a simple four-step decision tree that was able to accurately predict her votes 71 percent of the time. In contrast, the legal analysts could at best predict correctly 59 percent of the time. (Source: Martin et al. 2004)

Figure 6.1: Important data mining techniques

Data may be mined to help make more efficient decisions in the future. Or it may be used to explore the data to find interesting associative patterns. The right technique depends upon the kind of problem being solved (Figure 6.1).
The most important class of problems solved using data mining are classification problems. These are problems where data from past decisions is mined to extract the few rules and patterns that would improve the accuracy of the decision-making process in the future. The data of past decisions is organized and mined for decision rules or equations, which are then codified to produce more accurate decisions. Classification techniques are called supervised learning as there is a way to supervise whether the model's prediction is right or wrong.
A decision tree is a hierarchically organized branched structure that helps make decisions in an easy and logical manner. Decision trees are the most popular data mining technique, for many reasons.
1. Decision trees are easy to understand and easy to use, by analysts as well as executives. They also show a high predictive accuracy.
2. They select the most relevant variables automatically out of all the available variables for decision-making.
3. Decision trees are tolerant of data quality issues and do not require much data preparation from the users.
4. Even nonlinear relationships can be handled well by decision trees.
There are many algorithms to implement decision trees. Some of the popular ones are C5, CART, and CHAID.
Regression is a relatively simple and the most popular statistical data mining technique. The goal is to fit a smooth well-defined curve to the data. Regression analysis techniques, for example, can be used to model and predict the energy consumption as a function of daily temperature. Simply plotting the data shows a nonlinear curve. Applying a nonlinear regression equation will fit the data very well with high accuracy. Thus, the energy consumption on any future day can be predicted using this equation.
Artificial neural network (ANN) is a sophisticated data mining technique from the Artificial Intelligence stream in Computer Science. It mimics the behavior of human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron and the result may be communicated soon. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. The neural network can be trained by making a decision over and over again with many data points. It will continue to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. The intermediate values
passed within the layers of neurons may not make intuitive sense to an observer. Thus, the neural networks are considered a black-box system.
At some point, the neural network will have learned enough and begin to match the predictive accuracy of a human expert or alternative classification techniques. The predictions of some ANNs that have been trained over a long period of time with a large amount of data have become decisively more accurate than human experts. At that point, the ANNs can begin to be seriously considered for deployment, in real situations in real time.
ANNs are popular because they are eventually able to reach a high predictive accuracy. ANNs are also relatively simple to implement and do not have any issues with data quality. ANNs require a lot of data to train to develop good predictive ability.
Cluster analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. There can be any number of clusters that could be produced by the data. The K-means technique is a popular technique and allows the user guidance in selecting the right number (K) of clusters from the data.
Clustering is also known as the segmentation technique. The technique shows the clusters of things from past data. The output is the centroids for each cluster and the allocation of data points to their cluster. The centroid definition is used to assign new data instances to their cluster homes. Clustering is also a part of the artificial intelligence family of techniques.
Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps in answering questions about cross-selling opportunities. This is the heart of the personalization engine used by e-commerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X ⇒ Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers. There are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beers.

b. Why is data preparation so important and time consuming? (04 Marks)
Ans. Data cleansing and preparation is a labor-intensive or semiautomated activity that can take up to 60 to 70 percent of the time needed for a data mining project.
1. Duplicate data needs to be removed. The same data may be received from multiple sources. When merging the data sets, data must be de-duped.
2. Missing values need to be filled in, or those rows should be removed from analysis. Missing values can be filled in with average or modal or default values.
3. Data elements may need to be transformed from one unit to another. For example, total costs of health care and the total number of patients may need to be reduced to cost/patient to allow comparability of that value.
4. Continuous values may need to be binned into a few buckets to help with some analyses. For example, work experience could be binned as low, medium, and high.
5. Data elements may need to be adjusted to make them comparable over time. For example, currency values may need to be adjusted for inflation; they would need to be converted to the same base year for comparability. They may need to be converted to a common currency.
6. Outlier data elements need to be removed after careful review, to avoid the skewing of results. For example, one big donor could skew the analysis of alumni donors in an educational setting.
7. Any biases in the selection of data should be corrected to ensure the data is representative of the phenomena under analysis. If the data includes many more members of one gender than is typical of the population of interest, then adjustments need to be applied to the data.
8. Data should be brought to the same granularity to ensure comparability. Sales data may be available daily, but the sales person compensation data may only be available monthly. To relate these variables, the data must be brought to the lowest common denominator, in this case, monthly.
9. Data may need to be selected to increase information density. Some data may not show much variability, because it was not properly recorded or for any other reasons. This data may dull the effects of other differences in the data and should be removed to improve the information density of the data.

c. What is data visualization? How would you judge the quality of data visualization? (04 Marks)
Ans. Data visualization is the art and science of making data easy to understand and consume, for the end user. Ideal visualization shows the right amount of data, in the right order, in the right visual form, to convey the high priority information. The right visualization requires an understanding of the consumer's needs, nature of the data, and the many tools and techniques available to present data. The right visualization arises from a complete understanding of the totality of the situation. One should use visuals to tell a true, complete and fast-paced story.
Data visualization is the last step in the data life cycle. This is where the data is processed for presentation in an easy-to-consume manner to the right audience for the right purpose. The data should be converted into a language and format that is best preferred and understood by the consumer of data. The presentation should aim to highlight the insights from the data in an actionable manner. If the data is presented in too much detail, then the consumer of that data might lose interest and the insight.
The quality of data visualizations can be judged by the criteria for excellence in visualization.
Data can be presented in the form of rectangular tables, or it can be presented in colorful graphs of various types. "Small, non-comparative, highly-labeled data sets usually belong in tables." However, as the amount of data grows, graphs are preferable. Graphics help give shape to data. Tufte, a pioneering expert on data visualization, presents the following objectives for graphical excellence:
1. Show, and even reveal, the data: The data should tell a story, especially a story hidden in large masses of data. However, reveal the data in context, so the story is correctly told.
2. Induce the viewer to think of the substance of the data: The format of the graph should be so natural to the data that it hides itself and lets data shine.
3. Avoid distorting what the data have to say: Statistics can be used to lie. In the name of simplifying, some crucial context could be removed, leading to distorted communication.
4. Make large data sets coherent: By giving shape to data, visualizations can help bring the data together to tell a comprehensive story.
5. Encourage the eyes to compare different pieces of data: Organize the chart in ways the eyes would naturally move to derive insights from the graph.
6. Reveal the data at several levels of detail: Graphs lead to insights, which raise further curiosities, and thus presentations should help get to the root cause.
7. Serve a reasonably clear purpose: informing or decision-making.
8. Closely integrate with the statistical and verbal descriptions of the dataset: There should be no separation of charts and text in presentation. Each mode should tell a complete story. Intersperse text with the map/graphic to highlight the main insights.
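A few of the data preparation steps listed in answer 6(b), namely de-duplication (step 1), filling in missing values (step 2) and binning (step 4), can be sketched as follows. This is a minimal illustration only: the records, the field names, and the bin thresholds of 3 and 8 years are made up for the example and are not taken from the text.

```python
from statistics import mean

# Hypothetical employee records merged from two sources.
records = [
    {"id": 1, "experience_years": 2.0},
    {"id": 2, "experience_years": None},   # missing value
    {"id": 3, "experience_years": 12.0},
    {"id": 1, "experience_years": 2.0},    # duplicate of id 1
    {"id": 4, "experience_years": 7.0},
]

def dedupe(rows, key="id"):
    """Step 1: keep the first occurrence of each key (rows are copied)."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(dict(row))
    return unique

def fill_missing(rows, field):
    """Step 2: replace missing values with the average of the known values."""
    avg = mean(r[field] for r in rows if r[field] is not None)
    for r in rows:
        if r[field] is None:
            r[field] = avg
    return rows

def bin_experience(years, low_max=3, medium_max=8):
    """Step 4: bin a continuous value into low / medium / high."""
    if years <= low_max:
        return "low"
    if years <= medium_max:
        return "medium"
    return "high"

clean = fill_missing(dedupe(records), "experience_years")
for r in clean:
    r["experience_band"] = bin_experience(r["experience_years"])
    print(r)
```

Filling with the average is only one of the options the answer mentions; a modal or default value, or dropping the row, may suit other data better.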
Module -4
(.l 1Cu,711/•"•' frt,...·...... ,..;.,..
7. n. What Is a dcdsion tree? Wily ;1rc decision trees the .most popular classifica_tion
tcc~niquc? . · (02 Murks) 'i . ( .• ,· ~ ~ µ_
Ans. A decision :ree 1s a tree where each node represents a feattire(attribute), each,link(branch)
; I L
represents a decision(1~1ic) and each lenf represents nn outcome(catcgorical qr continues ,' i
•1
value).TI1c whole ide~ is to crcMe a tree like this for the entire data and process. a single , · . . :;; . ··i: ' . .. ·. .. ·, . ' ·.. . : ·. :~- ~-~ :_·::\:)_
outcome at cve~y leaf(or minimize the error iit every leof) · \ . Figure,7. r: Scaffl!tJJIOIS .show/11g type, oire/ntlo,iship111uwng two vifrlubltJ
De~ision trees arc a si1nplc ·;,,ay to guide one's path ton decision.1l1e decision iiiay ·be a simple · (Sou'rce:Grocbneretal.2013) ~ .-. . .... -.. · - .... ·· . · / . ,; ·,:';, ~:··:: · ,
binary one, whctlier to approve~ loan or not. 9rit may be n com.plex !llulii•~ahied decision, · : .Chai~ (a) shows n very strong lin~ar, r.clat1o~h1p ~et\l'ecn the_v.:lriabl~. x and,Y:,'!l;i,I means
as to 1vhat may be tl1e diagnosis for a pmticulai sickness. Decis.ion trees are hierarchical_ly the value ofy im;rc3SCS propcitt.io.nally with'i Chart (b) also s~s.~ Slrolig linearrilatiooship
br.1nci1cd stnicturcs that help one come to Ii decision based on asking certain question"s inn . between the vai·iables X and. y.,Herc il is an inverse re[ations!iip'. That means 'tbe'vilue or y
· . particular sequence.. · · _ . . . .,-. · ' . ··.
dec;eampr~p:ortionaliy.ivit~~: ., . · · · • • ' . , :·.', " ·.·,: ·:•~.:: . .
Decision. ti-ees ire one ofihe most'widety·used techniqi:es for classification. A go9d decision ·
tree should tc short and P.sk only a few meaningful questions ..They are veiy-.efficient to.us·e;
Chart (c) shows a curvilinear: rclaticnship. It is an inveise' relatio:tship, _:~~ns that~ ,r.~i~~
the value of y decren$Cs propo!tionally with x. However, it seems a rela:1v~ly_'\!~11:-defined
easy 10 explain, anq their da~sific:ttion i ccuracy is co.mpcFtiveJibi ether methods. Decision relationship, like a:1 arc ·of a drck, which can be represented by a siciple quadraiic ~uaJion .
frees can gcr.crntc ~nowlcdgdrom.a: few tei,t instances that -can _then be applied to n broad _.;: (quadratic. qienns·tile pow~r of two, ihat is, using terms like x2 and.y2), Cn.:irt..(d)_shciws_11
population. Decision trees are used mostly to answer relatively simple binary decisions.

b. What is a regression model? What is a scatter plot? How does it help? (6 Marks)
Ans. Regression is a well-known statistical technique to model the predictive relationship between several independent variables (IVs) and one dependent variable. The objective is to find the best-fitting curve for a dependent variable in a multidimensional space, with each independent variable being a dimension. The curve could be a straight line, or it could be a nonlinear curve.
The quality of fit of the curve to the data can be measured by a coefficient of correlation (r), which is the square root of the amount of variance explained by the curve.
The key steps for regression are simple:
1. List all the variables available for making the model.
2. Establish a Dependent Variable (DV) of interest.
3. Examine visual (if possible) relationships between variables of interest.
4. Find a way to predict DV using the other variables.
A scatter plot (or scatter diagram) is a simple exercise for plotting all data points between two variables on a two-dimensional graph. It provides a visual layout of where all the data points are placed in that two-dimensional space.
The scatter plot can be useful for graphically intuiting the relationship between two variables. Here is a picture (Figure 7.1) that shows many possible patterns in scatter diagrams.
... positive curvilinear relationship. However, it does not seem to resemble a regular shape, and thus would not be a strong relationship. Charts (e) and (f) show no relationship. That means variables x and y are independent of each other. Charts (a) and (b) are good candidates that model a simple linear regression model (the terms regression model and regression equation can be used interchangeably). Chart (c) could be modeled with a little more complex, quadratic regression equation. Chart (d) might require an even higher order polynomial regression equation to represent the data.
Charts (e) and (f) have no relationship; thus, they cannot be modeled together, by regression or using any other modeling tool.

c. Examine the steps in developing a neural network for predicting stock prices. What kind of objective function and what kind of data would be required for a good stock price prediction system using ANN?
Ans. It takes resources, training data, skill and time to develop a neural network. Most data mining platforms offer at least the MLP algorithm to implement a neural network. The steps required to build an ANN are as follows:
1. Gather data: Divide into training data and test data. The training data needs to be further divided into training data and validation data.
2. Select the network architecture, such as feedforward network.
3. Select the algorithm, such as Multilayer Perceptron.
4. Set network parameters.
5. Train the ANN with training data.
6. Validate the model with validation data.
7. Freeze the weights and other parameters.
8. Test the trained network with test data.
9. Deploy the ANN when it achieves good predictive accuracy.
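Step 1 of the list above (dividing the gathered data into training, validation and test sets) can be sketched as follows. This is only an illustration: the function name `split_dataset`, the default fractions, the seed, and the stand-in `records` list are all assumptions made for the example, not something the answer prescribes.

```python
import random


def split_dataset(records, train_frac=0.6, validation_frac=0.2, seed=42):
    """Shuffle the records and cut them into training, validation and test sets.

    The fractions and seed are illustrative assumptions; whatever is left
    after the training and validation cuts becomes the test set.
    """
    if not 0 < train_frac + validation_frac < 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Hypothetical stand-in for 100 gathered data records.
records = list(range(100))
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))
```

For time-ordered data such as stock prices, a chronological cut (oldest records for training, newest for testing) is often preferred over shuffling, so that the model is never trained on data from after the period it is tested on.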
Other neural network architectures include probabilistic networks and self-organizing feature
maps.
Training an ANN: Training data is split into three parts.
Training set: This data set is used to adjust the weights on the neural network (~60%).
Validation set: This data set is used to minimize overfitting and to verify accuracy (~20%).
Testing set: This data set is used only for testing the final solution in order to confirm the actual predictive power of the network (~20%).
k-fold cross-validation: This approach means that the data is divided into k equal pieces, and the learning process is repeated k times, with each piece in turn being held out for validation while the remaining pieces form the training set. This process leads to less bias and more accuracy, but is more time consuming.
Machine learning has proved to improve efficiencies significantly, and there are many jobs which have been replaced by smarter and faster machines using artificial intelligence or machine learning. The stock markets are no exception to this. Today, there are several machine learning algorithms running in the live markets. These algorithms often provide greater returns than other alternate algorithms, or sometimes even higher returns than experienced traders. In this article, I will talk about the concepts involved in a neural network and how it can be applied to predict stock prices in the live markets. Let us start by understanding what a neuron is.
Neuron
This is the neuron that you must be familiar with; if you aren't, you should now be grateful that you can understand this, because there are billions of neurons in your brain. There are three components to a neuron: the dendrites, the axon and the main body of the neuron. The dendrites are the receivers of the signal and the axon is the transmitter. Alone, a neuron is not of much use, but when it is connected to other neurons, it does several complicated computations and helps operate the most complicated machine on our planet, the human
A computer neuron is built in a similar manner, as shown in the diagram. There are inputs to the neuron marked with yellow circles, and the neuron emits an output signal after some computation. The input layer resembles the dendrites of the neuron and the output signal is the axon. Each input signal is assigned a weight, w_i. This weight is multiplied by the input value and the neuron stores the weighted sum of all the input variables. These weights are computed in the training phase of the neural network through concepts called gradient descent and backpropagation; we will cover these topics in our subsequent blog posts on Neural Networks. An activation function is then applied to the weighted sum, which results in the output signal of the neuron. The input signals are generated by other neurons, i.e., they are the outputs of other neurons, and the network is built to make predictions/computations in this manner. This is the basic idea of a neural network. We will look at each of these concepts in more detail in this article.
Working of Neural Networks
We will look at an example to understand the working of neural networks. The input layer consists of the parameters that will help us arrive at an output value or make a prediction. Our brains essentially have five basic input parameters, which are our senses to touch, hear, see, smell and taste. The neurons in our brain create more complicated parameters, such as emotions and feelings, from these basic input parameters. And our emotions and feelings make us act or take decisions, which is basically the output of the neural network of our brains. Therefore, there are two layers of computations in this case before making a decision. The first layer takes in the five senses as inputs and results in emotions and feelings, which are the inputs to the next layer of computations, where the output is a decision or an action. Hence, in this extremely simplistic model of the working of the human brain, we have one input layer, two hidden layers, and one output layer. Of course, from our experiences, we all know that the brain is much more complicated than this, but essentially this is how the

There are five input parameters as shown in the diagram, the hidden layer consists of 3 neurons and the resultant in the output layer is the prediction for the stock price. The 3 neurons in the hidden layer will have different weights for each of the five input parameters and might have different activation functions, which will activate the input parameters according to various combinations of the inputs. For example, the first neuron might be looking at the volume and the difference between the Close and the Open price and might be ignoring the High and Low prices. In this case, the weights for High and Low prices will be zero. Based on the weights that the model has trained itself to attain, an activation function will be applied to the weighted sum in the neuron; this will result in an output value for that particular neuron. Similarly, the other two neurons will result in an output value based on their individual activation functions and weights. Finally, the output value or the predicted value of the stock price will be the sum of the three output values of each neuron. This is how the neural network will work to predict stock prices.
Conclusion
There are two ways to code a program for performing a specific task. One is to define all the rules required by the program to compute the result given some input to the program. The other way is to develop the framework upon which the code will learn to perform the specific task by training itself on a dataset, adjusting the result it computes to be as close as possible to the actual results which have been observed. This process is called training the model. We will now look at how our neural network will train itself to predict stock prices.
The neural network will be given the dataset, which consists of the OHLCV data as the input, and as the output we would also give the model the Close price of the next day; this is the value that we want our model to learn to predict. The actual value of the output will be represented by y and the predicted value will be represented by ŷ (y hat). The training of the model involves adjusting the weights of the variables for all the different neurons present in the neural network. This is done by minimizing the 'Cost Function'. The cost function, as the name suggests, is the cost of making a prediction using the neural network. It is a measure of how far off the predicted value, ŷ, is from the actual or observed value, y. There are many cost functions that are used in practice; the most popular one is computed as half of the sum of squared differences of the actual and predicted values for the training dataset.
\[ C = \sum \frac{1}{2} (\hat{y} - y)^2 \]
The way the neural network trains itself is by first computing the cost function for the training dataset for a given set of weights for the neurons. Then it goes back and adjusts the weights, followed by computing the cost function for the training dataset based on the new weights. The process of sending the errors back to the network for adjusting the weights is called backpropagation. This is repeated several times till the cost function has been minimized. We will look at how the weights are adjusted and the cost function is minimized in more detail next.
Gradient Descent
The weights are adjusted to minimize the cost function. One way to do this is through brute force. Suppose we take 1000 values for the weights, and evaluate the cost function for these values. When we plot the graph of the cost function, we will arrive at a graph as shown below. The best value for the weights would be the one corresponding to the minimum of the cost function in this graph.
This approach could be successful for a neural network involving a single weight which needs to be optimized. However, as the number of weights to be adjusted and the number of hidden layers increases, the number of computations required will increase drastically. The time it will require to train such a model will be extremely large even on the world's fastest supercomputer. For this reason, it is essential to develop a better, faster methodology for computing the weights of the neural network. This process is called Gradient Descent.
Gradient descent involves analyzing the slope of the curve of the cost function. Based on the slope we adjust the weights, to minimize the cost function in steps rather than computing the values for all possible combinations. The visualization of gradient descent is shown in the diagrams below. The first plot is a single value of weights and hence is two dimensional. It can be seen that the red ball moves in a zig-zag pattern to arrive at the minimum of the cost function. In the second diagram, we have to adjust two weights in order to minimize the cost function. Therefore, we can visualize it as a contour, as shown in the graph, where we are moving in the direction of the steepest slope, in order to reach the minima in the shortest duration. With this approach, we do not have to do many computations and as a result, the
Gradient descent can be done in three possible ways: batch gradient descent, stochastic gradient descent and mini-batch gradient descent. In batch gradient descent, the cost function is computed by summing all the individual cost functions in the training dataset and then computing the slope and adjusting the weights. In stochastic gradient descent, the slope of the cost function and the adjustments of weights are done after each data entry in the training dataset. This is extremely useful to avoid getting stuck at a local minimum if the curve of the cost function is not strictly convex. Each time you run the stochastic gradient descent, the process to arrive at the global minimum will be different. Batch gradient descent may result in getting stuck with a suboptimal result if it stops at a local minimum. The third type is the mini-batch gradient descent, which is a combination of the batch and stochastic methods. Here, we create different batches by clubbing together multiple data entries in one batch. This essentially results in implementing the stochastic gradient descent on bigger batches of data entries in the training dataset. Next, let us understand how backpropagation works to adjust the weights according to the error which had been generated.
Backpropagation
Backpropagation is an advanced algorithm which enables us to update all the weights in the neural network simultaneously. This drastically reduces the complexity of the process to adjust weights. If we were not using this algorithm, we would have to adjust each weight individually by figuring out what impact that particular weight has on the error in the prediction. Let us look at the steps involved in training the neural network with Stochastic Gradient Descent:
• Initialize the weights to small numbers very close to 0 (but not 0).
• Forward propagation: the neurons are activated from left to right, by using the first data entry in our training dataset, until we arrive at the predicted result ŷ.
• Measure the error which will be generated.
• Backpropagation: the error generated will be back propagated from right to left, and the weights will be adjusted according to the learning rate.
• Repeat the previous three steps (forward propagation, error computation and backpropagation) on the entire training dataset.
• This would mark the end of the first epoch. The successive epochs will begin with the weight values of the previous epochs. We can stop this process when the cost function converges within a certain acceptable limit.
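The training steps listed above can be sketched for the simplest possible case: a single linear neuron trained with stochastic gradient descent on the half squared error cost. This is a minimal illustration under stated assumptions, not the stock-prediction network itself: the toy data, the learning rate, the stopping tolerance and all function names are hypothetical choices made for the example.

```python
import random


def predict(weights, bias, x):
    """Forward propagation for one linear neuron: weighted sum of inputs plus bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def cost(weights, bias, data):
    """Half the sum of squared differences between predicted and actual values."""
    return sum(0.5 * (predict(weights, bias, x) - y) ** 2 for x, y in data)


def train_sgd(data, learning_rate=0.1, max_epochs=2000, tolerance=1e-6, seed=0):
    rng = random.Random(seed)
    n_inputs = len(data[0][0])
    # Initialize the weights to small numbers close to (but not exactly) 0.
    weights = [rng.uniform(0.01, 0.1) for _ in range(n_inputs)]
    bias = 0.0
    history = [cost(weights, bias, data)]
    for _ in range(max_epochs):
        samples = data[:]
        rng.shuffle(samples)
        for x, y in samples:
            y_hat = predict(weights, bias, x)   # forward propagation
            error = y_hat - y                   # measure the error
            # Send the error back: d(cost)/d(w_i) = error * x_i for this entry,
            # and the step size is scaled by the learning rate.
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
        # One full pass over the training dataset is one epoch.
        history.append(cost(weights, bias, data))
        if abs(history[-2] - history[-1]) < tolerance:
            break  # the cost has converged within the chosen limit
    return weights, bias, history


# Hypothetical toy data generated from y = 2*x1 + 3*x2 + 1.
toy_data = [((x1 / 10, x2 / 10), 2 * x1 / 10 + 3 * x2 / 10 + 1)
            for x1 in range(5) for x2 in range(5)]
weights, bias, history = train_sgd(toy_data)
print(weights, bias, history[-1])
```

In a multi-layer network the same loop applies, except that the error is passed back through each layer in turn, so that every weight receives its own share of the gradient.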
a. What is ANN? How does it work? Explain the design principles.


Ans. Artificial Neural Networks (ANN) are inspired by the information processing model of the mind/brain. The human brain consists of billions of neurons that link with one another in an intricate pattern. Every neuron receives information from many other neurons, processes it, gets excited or not, and passes its state information to other neurons.
Just like the brain is a multipurpose system, so also the ANNs are very versatile systems. They can be used for many kinds of pattern recognition and prediction. They are also used for classification, regression, clustering, association, and optimization activities. They are used in finance, marketing, manufacturing, operations, information systems applications, and so on.
ANNs are composed of a large number of highly interconnected processing elements (neurons) working in multi-layered structures that receive inputs, process the inputs, and produce an output. An ANN is designed for a specific application, such as pattern recognition or data classification, and trained through a learning process. Just like in biological systems, ANNs make adjustments to the synaptic connections with each learning instance.
ANNs are like a black box trained into solving a particular type of problem, and they can develop high predictive powers. Their intermediate synaptic parameter values evolve as the system obtains feedback on its predictions, and thus an ANN learns from more training data (Figure 8.1).
Figure: Model for a multi-layer ANN
The processing logic of each neuron may assign different weights to the various incoming input streams. The processing logic may also use a nonlinear transformation, such as a sigmoid function, from the processed values to the output value. This processing logic and the intermediate weight and processing functions are just what works for the system as a whole, in its objective of solving a problem collectively. Thus, neural networks are considered to be a black-box system.
The neural network can be trained by making similar decisions over and over again with many training cases. It will continue to learn by adjusting its internal computation and communication based on feedback about its previous decisions. Thus, the neural networks become better at making a decision as they handle more and more decisions.
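The processing logic described in this answer (weights on the incoming input streams, followed by a nonlinear transformation such as a sigmoid) can be sketched as below. The input values, weights, biases and the two-neuron hidden layer are hypothetical numbers chosen only to make the example run; in a trained ANN they would be learned from data.

```python
import math


def sigmoid(z):
    """Nonlinear transformation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0):
    """One neuron: weight each incoming input, sum them, then apply the sigmoid."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)


def layer_output(inputs, weight_rows, biases):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical values: three inputs, a hidden layer of two neurons, one output neuron.
inputs = [0.5, -1.0, 2.0]
hidden = layer_output(inputs, [[0.2, 0.4, -0.1], [-0.3, 0.1, 0.5]], [0.0, 0.1])
output = neuron_output(hidden, [1.0, -1.0])
print(hidden, output)
```

The outputs of the hidden layer become the inputs of the next layer, which is what makes the structure multi-layered.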
Depending upon the nature of the problem and the availability of good training data, at some point the neural network will learn enough and begin to match the predictive accuracy of a human expert. In many practical situations, the predictions of ANN, trained over a long period of time with a large number of training data, have begun to decisively become more accurate than human experts. At that point ANN can begin to be seriously considered for deployment in real situations in real time.

b. What is unsupervised learning? When is it used? (04 Marks)
Ans. Unsupervised learning, by contrast, does not begin with a target variable. Instead the objective is to find groups of similar records in the data. One can think of unsupervised learning as a form of data compression: we search for a moderate number of representative records to summarize or stand in for the original database. Consider a mobile telecommunications company with 20 million customers. The company database will likely contain various categories of information including customer characteristics such as age and postal code, product information describing the customer's mobile handset, features of the plans the subscriber has selected, details of the subscriber's use of plan features, and billing and payment information. Although it is almost certain that no two subscribers will be identical on every detail in their customer records, we would expect to find groups of customers that are very similar in their overall pattern of demographics, selected equipment, plan use, and spending and payment behavior. If we could find, say, 30 representative customer types such that the bulk of customers are well described as belonging to their "type", this information could be very useful for marketing, planning, and new product development. We cannot promise that we can find clusters or groupings in data that you will find useful. But we include a method quite distinct from that found in other statistical or data mining software. CART and other Salford data mining modules now include an approach to cluster analysis, density estimation and unsupervised learning using ideas that we trace to Leo Breiman, but which may have been known informally among statisticians at Stanford and elsewhere for some time. The method detects structure in data by contrasting original data with randomized variants of that data. Analysts use this method implicitly when viewing data graphically to identify clusters or other structure in data visually. Take for example customer ages and handsets owned. If there is a pattern in the data then we expect to see certain handsets owned by people in their early 20's, and rather different handsets owned by customers in their early 30's. If every handset is just as likely to be owned in every age group then there is no structure relating these two data dimensions. The method we use generalizes this everyday detection idea to high dimensions.
The method consists of these steps:
Make a copy of the original data, and then randomly scramble each column of data separately. As an example, starting with data typical of a mobile phone company, suppose we randomly exchanged date of birth information in our copy of the database. Each customer record would likely contain age information belonging to another customer. We now repeat this process in every column of the data. Breiman uses a variant in which each column of original data is replaced with a bootstrap resample of the column, and you can use either method in Salford software.
Note that all we have done is move information about in the database, but other than moving data we have not changed anything. So aggregates such as averages and totals will not have changed. Any one customer record is now a "Frankenstein" record, with each item of information having been obtained from a different customer. Thus, date of birth might be from customer ID 1135, the service plan taken from customer 456779 and the spend data from 987001.
Now append the scrambled data set to the original data. We therefore now have the same number of columns as before but twice as many rows. The top portion of the data is the original data and the bottom portion will be the scrambled copy. Add a new column to the data to label records by their data source ("Original" vs. "Copy").
Generate a predictive model to attempt to discriminate between the Original and Copy data sets. If it is impossible to tell, after the fact, which records are original and which are random artifacts, then there is no structure in the data. If it is easy to tell the difference then there is strong structure in the data.
In the CART model separating the Original from the Copy records, nodes with a high fraction of Original records define regions of high density and qualify as potential "clusters". Such nodes reveal patterns of data values which appear frequently in the real data but not in the randomized artifact.
We do not expect the optimal sized tree for cluster detection to be the most accurate separator of Original from Copy records. We recommend that you prune back to a tree size that reveals interesting data groupings.
This approach to unsupervised learning represents an important advance in clustering technology because:
• Variable selection is not necessary, and different clusters may be defined on different groups of variables.
• Preprocessing or rescaling of the data is unnecessary, as these clustering methods are not influenced by how data is scaled.
• Missing values present no challenges, as the methods automatically manage missing data.
• The CART-based clustering gives easy control over the number of clusters and helps

c. ... (04 Marks)
Ans. Association rule mining is a popular, unsupervised learning technique, used in business to help identify shopping patterns. It is also known as market basket analysis. It helps find interesting relationships (affinities) between variables (items or events). Thus, it can help cross-sell related items and increase the size of a sale.
All data used in this technique is categorical. There is no dependent variable. It uses machine-learning algorithms. The fascinating "relationship between sales of diapers and beers" is how it is often explained in popular literature. This technique accepts as input the raw point-of-sale transaction data. The output produced is the description of the most frequent affinities among items. An example of an association rule would be, "a customer who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time."
In business environments, a pattern or knowledge can be used for many purposes. In sales and marketing, it is used for cross-marketing and cross-selling, catalog design, e-commerce site design, online advertising optimization, product pricing, and sales/promotion configurations. This analysis can suggest not to put one item on sale at a time, and instead to create a bundle of products promoted as a package to sell other nonselling items.
In retail environments, it can be used for store design. Strongly associated items can be kept close together for customer convenience. Or they could be placed far from each other so that the customer has to walk the aisles and by doing so is potentially exposed to other items.
In medicine, this technique can be used for relationships between symptoms and illnesses; diagnosis and patient characteristics/treatments; genes and their functions; and so on.
Representing Association Rules

A generic rule is represented between a set X and Y: X ⇒ Y [S%, C%]
X, Y: products and/or services
X: Left-hand-side (LHS or Antecedent)
Y: Right-hand-side (RHS or Consequent)
S: Support: how often X and Y go together in the total transaction set
C: Confidence: how often Y goes together with X
Example: {Laptop Computer, Antivirus Software} ⇒ {Extended Service Plan} [30%, 70%]
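To make the two measures concrete, here is a small sketch that computes support and confidence for a rule X ⇒ Y from a list of transactions. The ten baskets and the item names are invented for illustration (they give 30% support and 75% confidence, not the 70% quoted in the example), and the helper names `count_containing`, `support` and `confidence` are assumptions of this sketch.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= set(basket))


def support(transactions, lhs, rhs):
    """Share of all transactions in which X and Y occur together."""
    both = set(lhs) | set(rhs)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Share of the transactions containing X that also contain Y."""
    lhs_count = count_containing(transactions, lhs)
    if lhs_count == 0:
        return 0.0
    both = set(lhs) | set(rhs)
    return count_containing(transactions, both) / lhs_count


# Hypothetical point-of-sale baskets, invented for this example.
transactions = [
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus", "service_plan"},
    {"laptop", "antivirus"},
    {"laptop"},
    {"antivirus"},
    {"mouse"},
    {"laptop", "mouse"},
    {"service_plan"},
    {"mouse", "antivirus"},
]
lhs, rhs = {"laptop", "antivirus"}, {"service_plan"}
rule_support = support(transactions, lhs, rhs)
rule_confidence = confidence(transactions, lhs, rhs)
print(rule_support, rule_confidence)
```

Here three of the ten baskets contain all three items, and four baskets contain the left-hand side, so the rule would be written as {laptop, antivirus} ⇒ {service_plan} [30%, 75%].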
Module-5
9. a. Why is text mining useful in the age of social media? (04 Marks)
Ans. Text mining is the art and science of discovering knowledge, insights and patterns from an organized collection of textual databases. Textual mining can help with frequency analysis of important terms, and their semantic relationships.
Text is an important part of the growing data in the world. Social media technologies have enabled users to become producers of text and images and other kinds of information. Text mining can be applied to large-scale social media data for gathering preferences and measuring emotional sentiments. It can also be applied to societal, organizational and individual scales.
Text mining works on texts from practically any kind of source, from any business or non-business domain, in any format including Word documents, PDF files, XML files, text messages, etc. Here are some representative examples:
1. In the legal profession, text sources would include law, court deliberations, court orders, etc.
2. In academic research, it would include texts of interviews, published research articles, etc.
3. The world of finance will include statutory reports, internal reports, CFO statements, and more.
4. In medicine, it would include medical journals, patient histories, discharge summaries, etc.
5. In marketing, it would include advertisements, customer comments, etc.
6. In the world of technology and search, it would include patent applications, the whole of information on the world-wide web, and more.

b. What is a Naive Bayes technique? What do Naive and Bayes stand for?
Ans. Naive Bayes algorithm: Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
Let's understand it using an example. Below I have a training data set of weather and corresponding target variable 'Play' (suggesting possibilities of playing). Now, we need to classify whether players will play or not based on weather condition. Let's follow the below steps to perform it.
Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.
Problem: Players will play if weather is sunny. Is this statement correct?
We can solve it using the above discussed method of posterior probability.
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
Naive Bayes stand for:
The word Bayes refers to Bayesian analysis (based on the work of the mathematician Thomas Bayes), which computes the probability of a new occurrence not only on the recent record, but also on the basis of prior experience.
The word Naive represents the strong assumption that all the parameters of the instances are independent variables with little or no correlation. Thus if people are identified by their height, weight, age and gender, all these variables are assumed to be independent of each other.

... hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two dimensional space this hyperplane is a line dividing a plane in two parts, with each class lying on either side.
Confusing? Don't worry, we shall learn in layman's terms.
Suppose you are given a plot of two labeled classes on a graph as shown in image (A). Can you decide a separating line for the classes?

\f~t} · · , Step:f: Con_vcri,thc_data set into.a fr~quency table .. ; ''.x ..·:, ..'.:,[I)~~~ I\}o.ni:~ ~Hne th!itsep~rot~sbl~ck ~ircles ati~~l.~~.t~~~I:;;'..1\
~->·.•
_. . . . . . . . . . . . .. .. . .' .
'{:lt'f- . .. St~p 2: Create Likelihood table by findii1g ti1c probabili(ies like Cvetcast prdbability ;= 0:29 ', .. '. YQ4 might h'~V~ CQOlQ lip witll -~on1eihing siaiUilr (0 f\lUo,ving irtl)lg_~,(i~a?.e,: ~)dl fa_irly . . .:·"
arulp,ibabil_ilyof PfaYiog,iso.64.C · · · "i-. : sep~r~\~s th~ t.1vq 9l'asscs, Any f!OinHhat,is !en.of lin~ frills intQ black cu,;leql_~ :and on right ;,,/ .·· ·
. .. . . . .··'-,-·.:-:- - ·r-~-· ' .
!. :-' \. ·, ;43
42
~•it~W
. - ~ . ~--~ - ~- . :· ,.. ,..::,-;:.. :::.- :i i-·
...~

. .I,", - :.·
VIII Se.mt (CSE/IS'£)
falls into the blue square class. Separation of classes: that's what SVM does. It finds a line/hyperplane (in multidimensional space) that separates out the classes. Shortly, we shall discuss why I wrote multidimensional space.
Image B: Sample cut to divide into two classes.

OR

10. a. What are the three types of web mining? (10 Marks)
Ans: The web could be analyzed for its structure as well as content. The usage pattern of web pages could also be analyzed. Depending upon objectives, web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining (Figure 10.1).
Figure 10.1 Web Mining structure: Web Content Mining (using HTML pages), Web Structure Mining (using URL links), Web Usage Mining (using visits, clicks, etc.)

Web content mining
A website is designed in the form of pages with a distinct URL (universal resource locator). A large website may contain thousands of pages. These pages and their content are managed using specialized software systems called Content Management Systems. Every page can have text, graphics, audio, video, forms, applications, and more kinds of content including user generated content.
The websites keep a record of all requests received for their pages/URLs, including the requester information using 'cookies'. The log of these requests could be analyzed to gauge the popularity of those pages among different segments of the population. The text and application content on the pages could be analyzed for its usage by visit counts. The pages on a website themselves could be analyzed for quality of content that attracts most users. Thus the unwanted or unpopular pages could be weeded out, or they can be transformed with different content and style. Similarly, more resources could be assigned to keep the more popular pages fresh and inviting.

Web structure mining
The Web works through a system of hyperlinks using the hypertext protocol (http). Any page can create a hyperlink to any other page; it can be linked to by another page. The intertwined or self-referral nature of the web lends itself to some unique network analytical algorithms. The structure of Web pages could also be analyzed to examine the pattern of hyperlinks among pages. There are two basic strategic models for successful websites: Hubs and Authorities.
1. Hubs: These are pages with a large number of interesting links. They serve as a hub, or a gathering point, where people visit to access a variety of information. Media sites like Yahoo.com, or government sites, would serve that purpose. More focused sites like Traveladvisor.com and yelp.com could aspire to becoming hubs for new emerging areas.
2. Authorities: Ultimately, people would gravitate towards pages that provide the most complete and authoritative information on a particular subject. This could be factual information, news, advice, user reviews etc. These websites would have the largest number of inbound links from other websites. Thus Mayoclinic.com would serve as an authoritative page for expert medical opinion. NYtimes.com would serve as an authoritative page for daily news.

Web usage mining
As a user clicks anywhere on a webpage or application, the action is recorded by many entities in many locations. The browser at the client machine will record the click, and the web server providing the content would also make a record of the pages served and the user activity on those pages. The entities between the client and the server, such as the router, proxy server, or ad server, too would record that click.
The goal of web usage mining is to extract useful information and patterns from data generated through Web page visits and transactions. The activity data comes from data stored in server access logs, referrer logs, agent logs, and client-side cookies. The user characteristics and usage profiles are also gathered directly, or indirectly, through syndicated data. Further, metadata, such as page attributes, content attributes, and usage data are also gathered.
The web content could be analyzed at multiple levels (Figure 10.2).
1. The server side analysis would show the relative popularity of the web pages accessed. Those websites could be hubs and authorities.
2. The client side analysis could focus on the usage pattern or the actual content consumed and created by users.
   1. Usage pattern could be analyzed using 'clickstream' analysis, i.e. analyzing web activity for patterns of sequence of clicks, and the location and duration of visits on websites. Clickstream analysis can be useful for web activity analysis, software testing, market research, and analyzing employee productivity.
   2. Textual information accessed on the pages retrieved by users could be analyzed using text mining techniques. The text would be gathered and structured using the bag-of-words technique to build a Term-document matrix. This matrix could then be mined using cluster analysis and association rules for patterns such as popular topics, user segmentation, and sentiment analysis.
Figure 10.2 Web Usage Mining architecture
Web usage mining has many business applications. It can help predict user behavior based on previously learned rules and users' profiles, and can help determine lifetime value of clients. It can also help design cross-marketing strategies across products, by observing association
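As a rough illustration of the clickstream analysis described under web usage mining, the following Python sketch counts page popularity and the most common two-page click sequences per user. The log records, their field layout, and the function name summarize_clicks are hypothetical, invented for this example; a real server access log would first have to be parsed into this shape.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream records: (user_id, timestamp_in_seconds, page)
CLICKS = [
    ("u1", 10, "/home"),
    ("u1", 25, "/products"),
    ("u1", 70, "/cart"),
    ("u2", 12, "/home"),
    ("u2", 40, "/products"),
    ("u3", 15, "/products"),
]

def summarize_clicks(clicks):
    """Return (page popularity, counts of consecutive page pairs per user)."""
    # Popularity: how many times each page was requested overall.
    popularity = Counter(page for _, _, page in clicks)
    # Rebuild each user's click sequence in time order.
    by_user = defaultdict(list)
    for user, _, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_user[user].append(page)
    # Count transitions between consecutive pages within a user's sequence.
    transitions = Counter()
    for pages in by_user.values():
        transitions.update(zip(pages, pages[1:]))
    return popularity, transitions

popularity, transitions = summarize_clicks(CLICKS)
print(popularity.most_common(1))
print(transitions.most_common(1))
```

The popularity counts correspond to the server-side view (which pages are accessed most), while the transition counts are a minimal form of the click-sequence patterns mentioned for the client-side view.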
Eighth Semester B.E. Degree Examination,
CBCS - Model Question Paper - 2
BIG DATA ANALYTICS
Time: 3 hrs.                                        Max. Marks: 80
Note: Answer any FIVE full questions, selecting ONE full question from each module.
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]

Generic options supported are
-conf <configuration file>                      specify an application configuration file
-D <property=value>                             use value for given property
-fs <local|namenode:port>                       specify a namenode
-jt <local|resourcemanager:port>                specify a ResourceManager
-files <comma separated list of files>          specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>         specify comma separated jar files to include in the classpath
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

List Files in HDFS
To list the files in the root HDFS directory, enter the following command:
$ hdfs dfs -ls /
drwxrwxrwx   - yarn    hadoop   0 2015-04-28 16:52 /app-logs
drwxr-xr-x   - hdfs    hdfs     0 2015-04-21 14:28 /apps
drwxr-xr-x   - hdfs    hdfs     0 2015-04-21 10:53 /benchmarks
drwxr-xr-x   - hdfs    hdfs     0 2015-04-21 15:18 /hdp
drwxr-xr-x   - mapred  hdfs     0 2015-04-21 14:26 /mapred
drwxr-xr-x   - hdfs    hdfs     0 2015-04-21 14:26 /mr-history
drwxr-xr-x   - hdfs    hdfs     0 2015-04-21 14:27 /system
drwxrwxrwx   - hdfs    hdfs     0 2015-05-07 13:29 /tmp
drwxr-xr-x   - hdfs    hdfs     0 2015-04-27 16:00 /user
drwx-wx-wx   - hdfs    hdfs     0 2015-05-27 09:01 /var
To list files in your home directory, enter the following command:
$ hdfs dfs -ls
Found 11 items
drwx------   - hdfs    hdfs     0 2015-05-27 20:00 .Trash
drwx------   - hdfs    hdfs     0 2015-05-26 15:43 .staging
drwxr-xr-x   - hdfs    hdfs     0 2015-05-28 13:03 DistributedShell
drwxr-xr-x   - hdfs    hdfs     0 2015-05-14 09:19 TeraGen-50GB
drwxr-xr-x   - hdfs    hdfs     0 2015-05-14 10:11 TeraSort-50GB
drwxr-xr-x   - hdfs    hdfs     0 2015-04-29 16:52 examples
drwxr-xr-x   - hdfs    hdfs     0 2015-04-27 16:00 flume-channel
drwxr-xr-x   - hdfs    hdfs     0 2015-04-29 14:33 oozie-4.1.0
drwxr-xr-x   - hdfs    hdfs     0 2015-04-30 10:35 oozie-examples
drwxr-xr-x   - hdfs    hdfs     0 2015-04-29 20:35 oozie-oozi
drwxr-xr-x   - hdfs    hdfs     0 2015-05-24 18:11 war-and-peace-input
drwxr-xr-x   - hdfs    hdfs     0 2015-05-25 15:22 war-and-peace-output
The same result can be obtained by issuing the following command.
$ hdfs dfs -ls /user/hdfs

Make a Directory in HDFS
To make a directory in HDFS, use the following command. As with the -ls command, when no path is supplied, the user's home directory is used (e.g., /user/hdfs).
$ hdfs dfs -mkdir stuff

Copy Files to HDFS
To copy a file from your current local directory into HDFS, use the following command. If a full path is not supplied, your home directory is assumed. In this case, the file test is placed in the directory stuff that was created previously.
$ hdfs dfs -put test stuff
The file transfer can be confirmed by using the -ls command:
$ hdfs dfs -ls stuff
Found 1 items
-rw-r--r--   2 hdfs hdfs   12857 2015-05-29 13:12 stuff/test

Copy Files from HDFS
Files can be copied back to your local file system using the following command. In this case, the file we copied into HDFS, test, will be copied back to the current local directory with the name test-local.
$ hdfs dfs -get stuff/test test-local

Copy Files within HDFS
The following command will copy a file in HDFS:
$ hdfs dfs -cp stuff/test test.hdfs

Delete a File within HDFS
The following command will delete the HDFS file test.hdfs that was created previously:
$ hdfs dfs -rm test.hdfs
Moved: 'hdfs://limulus:8020/user/hdfs/stuff/test' to trash at: hdfs://limulus:8020/user/hdfs/.Trash/Current
Note that when the fs.trash.interval option is set to a non-zero value in core-site.xml, all deleted files are moved to the user's .Trash directory. This can be avoided by including the -skipTrash option.
$ hdfs dfs -rm -skipTrash stuff/test
Deleted stuff/test

Delete a Directory in HDFS
The following command will delete the HDFS directory stuff and all its contents:
$ hdfs dfs -rm -r -skipTrash stuff
Deleted stuff

Get an HDFS Status Report
Regular users can get an abbreviated HDFS status report using the following command. Those with HDFS administrator privileges will get a full (and potentially long) report. Also, this command uses dfsadmin instead of dfs to invoke administrative commands. The status
report is similar to the data presented in the HDFS web GUI.
$ hdfs dfsadmin -report
Configured Capacity: 1503409881088 (1.37 TB)
Present Capacity: 1407945981952 (1.28 TB)
DFS Remaining: 1255510564864 (1.14 TB)
DFS Used: 152435417088 (141.97 GB)
DFS Used%: 10.83%
Under replicated blocks: 54
Blocks with corrupt replicas: 0
Missing blocks: 0
report: Access denied for user deadline. Superuser privilege is required

b. Write a short note on "running MapReduce examples" and also explain the existing available examples. (08 Marks)
Ans. Running MapReduce Examples
All Hadoop releases come with MapReduce example applications. Running the existing MapReduce examples is a simple process once the example files are located. For example, if you installed Hadoop version 2.6.0 from the Apache sources under /opt, the examples will be in the following directory:
/opt/hadoop-2.6.0/share/hadoop/mapreduce/
In other versions, the examples may be in /usr/lib/hadoop-mapreduce/ or some other location. The exact location of the example jar file can be found using the find command:
$ find / -name "hadoop-mapreduce-examples*.jar" -print
Consider the following software environment:
• OS: Linux
• Platform: RHEL 6.6
• Hortonworks HDP 2.2 with Hadoop Version: 2.6
In this environment, the location of the examples is /usr/hdp/2.2.4.2-2/hadoop-mapreduce. For the purpose of this example, an environment variable called HADOOP_EXAMPLES can be defined as follows:
$ export HADOOP_EXAMPLES=/usr/hdp/2.2.4.2-2/hadoop-mapreduce
Once you define the examples path, you can run the Hadoop examples using the commands discussed in the following sections.

Listing Available Examples
A list of the available examples can be found by running the following command. In some cases, the version number may be part of the jar file (e.g., in the version 2.6 Apache sources, the file is named hadoop-mapreduce-examples-2.6.0.jar).
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar
Note: In previous versions of Hadoop, the command hadoop jar ... was used to run MapReduce programs. Newer versions provide the yarn command, which offers more capabilities. Both commands will work for these examples.
The possible examples are as follows:
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that counts the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets.
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort.
terasort: Run the terasort.
teravalidate: Checking results of terasort.
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

OR

Explain with a neat diagram the Apache Hadoop parallel MapReduce data flow (or) Explain the basic steps of MapReduce parallel data flow with the example of a word count program (diagram). (08 Marks)
Ans. MapReduce Parallel Data Flow: From a programmer's perspective, the MapReduce algorithm is fairly simple. The programmer must provide a mapping function and a reducing function. Operationally, however, the Apache Hadoop parallel MapReduce data flow can be quite complex. Parallel execution of MapReduce requires other steps in addition to the mapper and reducer processes.
The basic steps are as follows:
1. Input Splits. HDFS distributes and replicates data over multiple servers. The data are divided into chunks, or blocks, and written to different machines in the cluster. The data are also replicated on multiple machines (typically three machines). These data slices are physical boundaries determined by HDFS and have nothing to do with the data in the file. Also, while not considered part of the MapReduce process, the time required to load and distribute data throughout HDFS servers can be considered part of the total processing time.
The input splits used by MapReduce are logical boundaries based on the input data. For example, the split size can be based on the number of records in a file (if the data exist as records) or an actual size in bytes. Splits are almost always smaller than the HDFS block size. The number of splits corresponds to the number of mapping processes used in the map stage.
2. Map Step. The mapping process is where the parallel nature of Hadoop comes into play. For large amounts of data, many mappers can be operating at the same time. The user provides the specific mapping process. MapReduce will try to execute the mapper on the machines where the block resides. Because the file is replicated in HDFS, the least busy node holding a copy of the block is preferred. Failing that, MapReduce will try to pick a node that is closest to the node that hosts the data block (a characteristic called rack awareness). The last choice is any node in the cluster that has access to HDFS.
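The two steps above can be sketched locally in plain Python, with no Hadoop involved. This is only an illustration under stated assumptions: the helper names (make_splits, map_split, run_word_count), the record-count split size, the thread pool standing in for parallel mapper processes, and the sample lines are all invented here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def make_splits(records, records_per_split):
    """Logical input splits: group records by count, not by physical block."""
    return [records[i:i + records_per_split]
            for i in range(0, len(records), records_per_split)]

def map_split(split):
    """Mapper: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for line in split for word in line.split()]

def run_word_count(records, records_per_split=1):
    splits = make_splits(records, records_per_split)
    # One mapper task per split; threads stand in for cluster nodes.
    with ThreadPoolExecutor(max_workers=max(1, len(splits))) as pool:
        mapped = list(pool.map(map_split, splits))
    # Stand-in for the later shuffle and reduce steps: sum counts per key.
    totals = Counter()
    for pairs in mapped:
        for word, count in pairs:
            totals[word] += count
    return len(splits), dict(totals)

# Hypothetical input: three records, one record per split.
lines = ["big data big ideas", "data flows", "big clusters"]
num_mappers, counts = run_word_count(lines)
print(num_mappers, counts)
```

With one record per split, the number of mapper tasks equals the number of splits, which mirrors the statement that the number of splits corresponds to the number of mapping processes. Changing records_per_split changes how many mappers run without changing the final counts.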
Figure 1.1 Apache Hadoop parallel MapReduce data flow
The input to the MapReduce application is the following file in HDFS with three lines of text. The goal is to count the number of times each word is used.
see spot run
run spot run
see the cat
The first thing MapReduce will do is create the data splits. For simplicity, each line will be one split. Since each split will require a map task, there are three mapper processes that count the number of words in the split. On a cluster, the results of each map task are written to local disk and not to HDFS. Next, similar keys need to be collected and sent to a reducer process. The shuffle step requires data movement and can be expensive in terms of processing time. Depending on the nature of the application, the amount of data that must be shuffled throughout the cluster can vary from small to large.
Once the data have been collected and sorted by key, the reduction step can begin (even if only partial results are available). It is not necessary, and not normally recommended, to have a reducer for each key-value pair as shown in Figure 1.1. In some cases, a single reducer will provide adequate performance; in other cases, multiple reducers may be required to speed up the reduce phase. The number of reducers is a tunable option for many applications. The final step is to write the output to HDFS.
As mentioned, a combiner step enables some pre-reduction of the map output data. For instance, in the previous example, one map produced the following counts:
(run,1)
(spot,1)
(run,1)
As shown in Figure 1.2, the count for run can be combined into (run,2) before the shuffle. This optimization can help minimize the amount of data transfer needed for the shuffle phase.
Figure 1.2 Adding a combiner process to the map step in MapReduce

Using the Streaming Interface
The Apache Hadoop streaming interface enables almost any program to use the MapReduce engine. The streams interface will work with any program that can read and write to stdin and stdout.
When working in the Hadoop streaming mode, only the mapper and the reducer are created by the user. This approach does have the advantage that the mapper and the reducer can be easily tested from the command line. In this example, a Python mapper and reducer, shown in Listings 1.1 and 1.2, will be used.

Listing 1.1 Python Mapper Script (mapper.py)
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

Listing 1.2 Python Reducer Script (reducer.py)
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
VIII Sem, ( CSE/IS[)
fl parse the inpltt we got from mapper .p)' word, count= line .split(' /I', I)
µ conve,·: count (currently a string) to int
Locate lhc lrndoop-s1rcmning.)ar file in your distribiition. The location may va;,
and it ·may '
conlnh'. n v~1·sio11 lag. l11 this example, the Ho11tinworks HDJ72:2 distrib111ion wa{used.,The
try: ~ollowmg command 1ii1e will use 1hc muppc1· .py nnd reducer .py to do n word count on the
count= inl(count) input file. · •
except Vnh:cError: ~ 1~doop jnr /11sr/hdr,/currc1n/hndoop-maprcduce-client/hac:oop-strcaming.jnr ·
II count ,~as not n number, so si lently# ignore/ discard this line.
contin(1c . _. . . • -file .lm.tpper .py
# tl1is lF-switd1 only woi·ks because lfadoop smts map output# by key (here: wo11I) before -mripper ./niapper .py
it is passed lo the reducer · . . . . · -file J~ed11cer, .py -.reduce ./reducer .py ·
if c·uri:cnt_word ==word, currcnt;_CQ11nt += count else: -input \var-nhd:peace-inputlwnr-nnd-pcace .txt
if current word: ·. -output war-and-peace-output · . . . . .
II wi:itc re;ult to STUDOUT · . The output will be the familfor (_SUCCESS and part -00000) in the ~ar-a~d-petic~ ouiput
print '%s/t¾s' 5 (current_word, c11n-ei1t_count) current_cotint = count • directory. Th~ actual file naine may be·slightly difference depen,ding on youd-la~oop:version:
, current word ,;word • .· . · . Also note th~t the Python scripts used irt this eJmmple could be Bash, Perl, Tel; A\yk, compiled·.
# do not forget to output the last word if needed! if current_ivord ==word: C codc.,or miy language th«! can read and write from std in and stdout. ·: ·: · • : : •. . · ·
. print '%s/t¾s' ¾ (currcn(..word, currcnt_count) . . . ~ . Ailhough 'tl1e_streaming interface is rnther simplei.it does have some disadv.antages·~ver
: The operation of the mapper .py script can be observed by rnn.ning the.commands as shown ·- using _Java directly. In pa1tic11lar, not all ·applications are string-atid character •bfo~iy dnta,'
.in the folloiving: .· · .'·· : . · · · - ·· . Ariotl1er disadvantage is thM ri\ariy tiinirig par.i~eters ~vailable through tl1c ftiH )avaHadoop _
$ echo "foo foo quux labs foo bar quux" | ./mapper.py
Foo 1
Foo 1
Quux 1
Labs 1
Foo 1
Bar 1
Quux 1
Piping the result of the map into the sort command can create a simulated shuffle phase:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1
Bar 1
Foo 1
Foo 1
Foo 1
Labs 1
Quux 1
Quux 1
Finally, the full MapReduce process can be simulated by adding the reducer.py script to the following command pipeline:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
Bar 1
Foo 3
Labs 1
Quux 2
To run this application using a Hadoop installation, create, if needed, a directory and move the war-and-peace.txt input file into HDFS:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
Make sure the output directory is removed from any previous test runs:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output

... API are not available in streaming.

Module-2
3. a. Explain how to acquire data streams using Apache Flume? (04 Marks)
Apache Flume is an independent agent designed to collect, transport, and store data into HDFS. Often data transport involves a number of Flume agents that may traverse a series of machines and locations. Flume is often used for log files, social media-generated data, email messages, and just about any continuous data source.
As shown in Figure 3.1, a Flume agent is composed of three components.
• Source. The source component receives data and sends it to a channel. It can send the data to more than one channel. The input data can be from a real-time source (e.g., weblog) or another Flume agent.
• Channel. A channel is a data queue that forwards the source data to the sink destination. It can be thought of as a buffer that manages input (source) and output (sink) flow rates.
• Sink. The sink delivers data to a destination such as HDFS, a local file, or another Flume agent.
A Flume agent must have all three of these components defined. A Flume agent can have several sources, channels, and sinks. Sources can write to multiple channels, but a sink can take data from only a single channel. Data written to a channel remain in the channel until a sink removes the data. By default, the data in a channel are kept in memory but may be optionally stored on disk to prevent data loss in the event of a network failure.
Figure 3.1 Flume agent with source, channel, and sink (Adapted from Apache Flume Documentation)
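The source, channel, and sink roles described above can be sketched in a few lines of Python. This is only an illustration and not Flume code or the Flume API: the class name ToyAgent, its methods, the capacity limit, and the in-memory deque used as the channel are all assumptions made for this example.

```python
from collections import deque


class ToyAgent:
    """Illustrative stand-in for a Flume agent (hypothetical, not the Flume API)."""

    def __init__(self, capacity=100):
        self.channel = deque()    # channel: in-memory queue between source and sink
        self.capacity = capacity  # assumed limit on the number of queued events
        self.delivered = []       # stands in for the destination (e.g., HDFS)

    def source_put(self, event):
        """Source side: accept an event only if the channel has room."""
        if len(self.channel) >= self.capacity:
            return False          # channel full; the source would have to retry
        self.channel.append(event)
        return True

    def sink_drain(self, batch_size=10):
        """Sink side: remove up to batch_size events and deliver them."""
        taken = 0
        while self.channel and taken < batch_size:
            self.delivered.append(self.channel.popleft())
            taken += 1
        return taken


agent = ToyAgent(capacity=3)
accepted = [agent.source_put(f"log line {i}") for i in range(5)]
drained = agent.sink_drain(batch_size=2)
print(accepted, drained, list(agent.channel), agent.delivered)
```

The example shows the buffering role of the channel: events stay queued until the sink removes them, and a full channel pushes back on the source.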

As shown in Figure 3.2, Flume agents may be placed in a pipeline, possibly to traverse several machines or domains. This configuration is normally used when data are collected on one machine (e.g., a web server) and sent to another machine that has access to HDFS.
Figure 3.2 Pipeline created by connecting Flume agents (Adapted from Apache Flume Documentation)
In a Flume pipeline, the sink from one agent is connected to the source of another. The data transfer format normally used by Flume, which is called Apache Avro, provides several useful features. First, Avro is a data serialization/deserialization system that uses a compact binary format. The schema is sent as part of the data exchange and is defined using JSON (JavaScript Object Notation). Avro also uses remote procedure calls (RPCs) to send data. That is, an Avro sink will contact an Avro source to send data.
Another useful Flume configuration is shown in Figure 3.3. In this configuration, Flume is used to consolidate several data sources before committing them to HDFS. There are many possible ways to construct Flume transport networks. In addition, other Flume features not described in depth here include plug-ins and interceptors that can enhance Flume pipelines.
Figure 3.3 A Flume consolidation network (Adapted from Apache Flume Documentation)

b. Explain briefly basic HDFS administration? (12 Marks)
Ans. The NameNode User Interface
Monitoring HDFS can be done in several ways. One of the more convenient ways to get a quick view of HDFS status is through the NameNode user interface. This web-based tool provides essential information about HDFS and offers the capability to browse the HDFS namespace and logs.
The web-based UI can be started from within Ambari or from a web browser connected to the NameNode. In Ambari, simply select the HDFS service window and click on the Quick Links pull-down menu in the top middle of the page. Select NameNode UI. A new browser tab will open with the UI shown in Figure 3.4. You can also start the UI directly by entering the following command:
$ firefox http://localhost:50070
There are five tabs on the UI: Overview, Datanodes, Snapshot, Startup Progress, and Utilities. The Overview page provides much of the essential information that the command-line tools also offer, but in a much easier-to-read format. The Datanodes tab displays node information like that shown in Figure 3.5.
The Snapshot window lists the "snapshottable" directories and the snapshots. Further information on snapshots can be found in the "HDFS Snapshots" section.
Figure 3.6 provides a NameNode startup progress view. When the NameNode starts, it reads the previous file system image file (fsimage); applies any new edits to the file system image, thereby creating a new file system image; and drops into safe mode until enough DataNodes come online. This progress is shown in real time in the UI as the NameNode starts. Completed phases are displayed in bold text. The currently running phase is displayed in italics. Phases that have not yet begun are displayed in gray text. In Figure 3.6, all the phases have been completed, and as indicated in the overview window in Figure 3.4, the system is out of safe mode.
The Utilities menu offers two options. The first, as shown in Figure 3.7, is a file system browser. From this window, you can easily explore the HDFS namespace. The second option, which is not shown, links to the various NameNode logs.
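The NameNode startup sequence described above (read the fsimage, apply the logged edits, remain in safe mode until enough blocks are reported) can be illustrated with a small Python model. This is a simplified sketch and not HDFS code: the function name, the tuple format of the edits, and the 0.999 reporting threshold used as the default are assumptions chosen for illustration.

```python
def start_namenode(fsimage, edits, reported_blocks, total_blocks, threshold=0.999):
    """Replay edits onto a copy of the image, then decide whether safe mode stays on.

    fsimage: dict mapping path -> number of blocks (the previous image)
    edits:   list of (operation, path, value) tuples, in logged order
    """
    namespace = dict(fsimage)             # start from the previous image
    for op, path, value in edits:         # apply each logged edit in order
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
    # Safe mode persists until a large enough fraction of blocks is reported.
    fraction = reported_blocks / total_blocks if total_blocks else 1.0
    safe_mode = fraction < threshold
    return namespace, safe_mode


image = {"/user/hdfs/a.txt": 2}
edits = [("create", "/user/hdfs/b.txt", 3), ("delete", "/user/hdfs/a.txt", None)]

ns_early, safe_early = start_namenode(image, edits, reported_blocks=2, total_blocks=3)
ns_late, safe_late = start_namenode(image, edits, reported_blocks=3, total_blocks=3)
print(ns_early, safe_early, safe_late)
```

The first call models a NameNode that has replayed its edits but is still waiting for DataNode block reports; the second models the point at which it can leave safe mode.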



[Screenshot: NameNode web interface, Datanode Information tab, listing nodes under "In operation" and "Decommissioning"]

Figure 3.6 NameNode web interface showing startup progress

Adding Users to HDFS
Keep in mind that errors that crop up while Hadoop applications are running are often due to file permissions.
To quickly create user accounts manually on a Linux-based system, perform the following steps:
1. Add the user to the group for your operating system on the HDFS client system. In most cases, the group name should be that of the HDFS superuser, which is often hadoop or hdfs.
useradd -G <groupname> <username>


Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 1.7850144
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Fri May 29 14:48:03 EDT 2015 in 1853 milliseconds
The filesystem under path '/' is HEALTHY
Other options provide more detail, include snapshots and open files, and allow management of corrupted files.
• -move moves corrupted files to /lost+found
• -delete deletes corrupted files
• -files prints out files being checked
• -openforwrite prints out files opened for writes during check
• -includeSnapshots includes snapshot data. The path indicates the existence of a snapshottable directory or the presence of snapshottable directories under it.
• -list-corruptfileblocks prints out a list of missing blocks and the files to which they belong
• -blocks prints out a block report
• -locations prints out locations for every block
• -racks prints out network topology for data-node locations

Balancing HDFS
Based on usage patterns and DataNode availability, the number of data blocks across the DataNodes may become unbalanced. To avoid over-utilized DataNodes, the HDFS balancer tool rebalances data blocks across the available DataNodes. Data blocks are moved from over-utilized to under-utilized nodes to within a certain percent threshold. Rebalancing can be done when new DataNodes are added or when a DataNode is removed from service. This step does not create more space in HDFS, but rather improves efficiency.
The HDFS superuser must run the balancer. The simplest way to run the balancer is to enter the following command:
$ hdfs balancer
By default, the balancer will continue to rebalance the nodes until the storage utilization of every DataNode is within 10% of the cluster average. The balancer can be stopped, without harming HDFS, at any time by entering Ctrl-C. Lower or higher thresholds can be set using the -threshold argument. For example, giving the following command sets a 5% threshold:
$ hdfs balancer -threshold 5
The lower the threshold, the longer the balancer will run. To ensure the balancer does not swamp the cluster networks, you can set a bandwidth limit before running the balancer, as follows:
$ hdfs dfsadmin -setBalancerBandwidth newbandwidth
The newbandwidth option is the maximum amount of network bandwidth, in bytes per second, that each DataNode can use during the balancing operation.
Balancing data blocks can also break HBase locality. When HBase regions are moved, some data locality is lost, and the RegionServers will then request the data over the network from remote DataNode(s). This condition will persist until a major HBase compaction event takes place (which may either occur at regular intervals or be initiated by the administrator).

HDFS Safe Mode
When the NameNode starts, it loads the file system state from the fsimage and then applies the edits log file. It then waits for DataNodes to report their blocks. During this time, the NameNode stays in a read-only Safe Mode. The NameNode leaves Safe Mode automatically after the DataNodes have reported that most file system blocks are available.
The administrator can place HDFS in Safe Mode by giving the following command:
$ hdfs dfsadmin -safemode enter
Entering the following command turns off Safe Mode:
$ hdfs dfsadmin -safemode leave
HDFS may drop into Safe Mode if a major issue arises within the file system (e.g., a full DataNode). The file system will not leave Safe Mode until the situation is resolved. To check whether HDFS is in Safe Mode, enter the following command:
$ hdfs dfsadmin -safemode get

Decommissioning HDFS Nodes
If you need to remove a DataNode host/node from the cluster, you should decommission it first. Assuming the node is responding, it can be easily decommissioned from the Ambari web UI. Simply go to the Hosts view, click on the host, and select Decommission from the pull-down menu next to the DataNode component.
Note that the host may also be acting as a YARN NodeManager. Use the Ambari Hosts view to decommission the YARN host in a similar fashion.

The restoration process is basically a simple copy from the snapshot to the previous directory (or anywhere else). Note the use of the .snapshot/wapi-snap-1 path to restore the file:
$ hdfs dfs -cp /user/hdfs/war-and-peace-input/.snapshot/wapi-snap-1/war-and-peace.txt /user/hdfs/war-and-peace-input
Confirmation that the file has been restored can be obtained by issuing the following command:
$ hdfs dfs -ls /user/hdfs/war-and-peace-input/
Found 1 items
-rw-r--r-- 2 hdfs hdfs 3288746 2015-06-24 21:12 /user/hdfs/war-and-peace-input/war-and-peace.txt
The NameNode UI provides a listing of snapshottable directories and the snapshots that have been taken. Figure 3.8 shows the results of creating the previous snapshot.
Figure 3.8 Apache NameNode web interface showing snapshot information
To delete a snapshot, give the following command:
$ hdfs dfs -deleteSnapshot /user/hdfs/war-and-peace-input wapi-snap-1
To make a directory "un-snapshottable" (or go back to the default state), use the following command:
$ hdfs dfsadmin -disallowSnapshot /user/hdfs/war-and-peace-input
Disallowing snapshot on /user/hdfs/war-and-peace-input succeeded
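The snapshot commands used in this answer follow a fixed pattern, so they can be assembled by a small helper. The Python sketch below only builds and prints the command lines; it does not run them. The function name snapshot_commands and its parameters are assumptions introduced for this example, while the hdfs subcommands are the ones shown in the text.

```python
import shlex


def snapshot_commands(directory, snapshot, filename):
    """Return the hdfs command lines for restoring a file and cleaning up a snapshot."""
    snapshot_file = f"{directory}/.snapshot/{snapshot}/{filename}"
    return [
        ["hdfs", "dfs", "-cp", snapshot_file, directory],          # restore the file
        ["hdfs", "dfs", "-ls", directory],                         # confirm the restore
        ["hdfs", "dfs", "-deleteSnapshot", directory, snapshot],   # delete the snapshot
        ["hdfs", "dfsadmin", "-disallowSnapshot", directory],      # back to the default state
    ]


commands = snapshot_commands(
    "/user/hdfs/war-and-peace-input", "wapi-snap-1", "war-and-peace.txt"
)
for command in commands:
    print("$ " + shlex.join(command))
```

The order matters in practice: a directory can be made un-snapshottable only after its snapshots have been deleted, which is why the -deleteSnapshot line comes before -disallowSnapshot.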
OR
4. a. How to manage Hadoop services? (08 Marks)
Ans. During the course of normal Hadoop cluster operations, services may fail for any number of reasons. Ambari monitors all of the Hadoop services and reports any service interruption to the dashboard. In addition, when the system was installed, an administrative email for the Nagios monitoring system was required. All service interruption notifications are sent to this email address.
Figure 4.1 shows the Ambari dashboard reporting a down DataNode. The service error indicator numbers next to the HDFS service and Hosts menu item indicate this condition. The DataNode widget also has turned red and indicates that 3/4 DataNodes are operating.
Clicking the HDFS service link in the left vertical menu will bring up the service summary screen shown in Figure 4.2. The Alerts and Health Checks window confirms that a DataNode is down.
The specific host (or hosts) with an issue can be found by examining the Hosts window. As shown in Figure 4.3, the status of host n1 has changed from a green dot with a check mark inside to a yellow dot with a dash inside. An orange dot with a question mark inside indicates the host is not responding and is probably down. Other service interruption indicators may also be set as a result of the unresponsive node.
Figure 4.1 Ambari main dashboard indicating a DataNode issue
Figure 4.2 Ambari HDFS service summary window indicating a down DataNode
Figure 4.3 Ambari Hosts screen indicating an issue with host n1
Clicking on the n1 host link opens the view in Figure 4.4. Inspecting the Components sub-window reveals that the DataNode daemon has stopped on the host. At this point, checking the DataNode logs on host n1 will help identify the actual cause of the failure. Assuming the failure is resolved, the DataNode daemon can be started using the Start option in the pull-down menu next to the service name.


Figure 4.4 Ambari window for host n1 indicating the DataNode/HDFS service has stopped
When a service daemon is started or stopped, a progress window similar to Figure 4.5 is opened. The progress bar indicates the status of each action. Note that previous actions are part of this window. If something goes wrong during the action, the progress bar will turn red. If the system generates a warning about the action, the progress bar will turn orange.
When these background operations are running, the small ops (operations) bubble on the top menu bar will indicate how many operations are running.
Figure 4.5 Ambari progress window for DataNode restart
Once the DataNode has been restarted successfully, the dashboard will reflect the new status (e.g., 4/4 DataNodes are Live). As shown in Figure 4.6, all four DataNodes are now working and the service error indicators are beginning to slowly disappear. The service error indicators may lag behind the real-time widget updates for several minutes.
Figure 4.6 Ambari dashboard indicating all DataNodes are running (the service error indicators will slowly drop off the screen)

Ans. The ResourceManager runs as a scheduling daemon on a dedicated machine and acts as the central authority for allocating resources to the various competing applications in the cluster. The ResourceManager has a central and global view of all cluster resources and, therefore, can ensure fairness, capacity, and locality are shared across all users. Depending on the application demand, scheduling priorities, and resource availability, the ResourceManager dynamically allocates resource containers to applications to run on particular nodes. A container is a logical bundle of resources (e.g., memory, cores) bound to a particular cluster node. To enforce and track such assignments, the ResourceManager interacts with a special system daemon running on each node called the NodeManager. Communications between the ResourceManager and NodeManagers are heartbeat based for scalability. NodeManagers are responsible for local monitoring of resource availability, fault reporting, and container life-cycle management (e.g., starting and killing jobs). The ResourceManager depends on the NodeManagers for its "global view" of the cluster.
User applications are submitted to the ResourceManager via a public protocol and go through an admission control phase during which security credentials are validated and various operational and administrative checks are performed. Those applications that are accepted pass to the scheduler and are allowed to run. Once the scheduler has enough resources to satisfy the request, the application is moved from an accepted state to a running state. Aside from internal bookkeeping, this process involves allocating a container for the single ApplicationMaster and spawning it on a node in the cluster. Often called container 0, the ApplicationMaster does not have any additional resources at this point, but rather must request additional resources from the ResourceManager.
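The admission, scheduling, and container 0 steps described above can be modelled with a toy ResourceManager in Python. This is a teaching sketch, not YARN code: the class ToyResourceManager, its method names, the state strings, and the 512 MB ApplicationMaster size are assumptions made for illustration, and only memory is tracked.

```python
class ToyResourceManager:
    """Hypothetical model of admission control and container allocation."""

    def __init__(self, nodes):
        self.free = dict(nodes)   # node name -> free memory in MB (the "global view")
        self.state = {}           # application id -> state string
        self.containers = {}      # application id -> list of (node, MB) containers

    def submit(self, app_id, credentials_ok=True):
        """Admission control: reject applications that fail the checks."""
        self.state[app_id] = "ACCEPTED" if credentials_ok else "REJECTED"
        return credentials_ok

    def allocate(self, app_id, mb):
        """Bind a container of the requested size to the first node with room."""
        for node, free in self.free.items():
            if free >= mb:
                self.free[node] = free - mb
                self.containers.setdefault(app_id, []).append((node, mb))
                return node
        return None               # no node can satisfy the request right now

    def schedule(self, app_id, am_mb=512):
        """Move an accepted application to running by starting container 0."""
        if self.state.get(app_id) != "ACCEPTED":
            return False
        if self.allocate(app_id, am_mb) is None:
            return False          # stays ACCEPTED until resources free up
        self.state[app_id] = "RUNNING"
        return True


rm = ToyResourceManager({"n0": 1024, "n1": 2048})
rm.submit("app-1")
started = rm.schedule("app-1")           # container 0 for the ApplicationMaster
extra_node = rm.allocate("app-1", 1024)  # a later request made by the ApplicationMaster
rm.submit("app-2", credentials_ok=False)
print(started, extra_node, rm.state, rm.free)
```

In the example, the ApplicationMaster container lands on n0, while the later 1024 MB request no longer fits there and is placed on n1.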


The ApplicationMaster is the "master" user job that manages all application life-cycle aspects, including dynamically increasing and decreasing resource consumption (i.e., containers), managing the flow of execution (e.g., in case of MapReduce jobs, running reducers against the output of maps), handling faults and computation skew, and performing other optimizations. The ApplicationMaster is designed to run arbitrary user code that can be written in any programming language, as all communication with the ResourceManager and NodeManager is encoded using extensible network protocols.
YARN makes few assumptions about the ApplicationMaster, although in practice it expects most jobs will use a higher-level programming framework. By delegating all these functions to ApplicationMasters, YARN's architecture gains a great deal of scalability, programming model flexibility, and improved user agility. For example, upgrading and testing a new MapReduce framework can be done independently of other running MapReduce frameworks.
Typically, an ApplicationMaster will need to harness the processing power of multiple servers to complete a job. To achieve this, the ApplicationMaster issues resource requests to the ResourceManager. The form of these requests includes specification of locality preferences (e.g., to accommodate HDFS use) and properties of the containers. The ResourceManager will attempt to satisfy the resource requests coming from each application according to availability and scheduling policies. When a resource is scheduled on behalf of an ApplicationMaster, the ResourceManager generates a lease for the resource, which is acquired by a subsequent ApplicationMaster heartbeat.
Figure 4.7: YARN architecture with two clients (MapReduce and MPI)
The ApplicationMaster then works with the NodeManagers to start the resource. A token-based security mechanism guarantees its authenticity when the ApplicationMaster presents the container lease to the NodeManager. In a typical situation, running containers will communicate with the ApplicationMaster through an application-specific protocol to report status and health information and to receive framework-specific commands. In this way, YARN provides basic infrastructure for monitoring and life-cycle management of containers, while each framework manages application-specific semantics independently. This contrasts with the original Hadoop version 1 design, in which scheduling was designed and integrated around managing only MapReduce tasks.
Figure 4.7 illustrates the relationship between the application and YARN components. The YARN components appear as the large outer boxes (ResourceManager and NodeManagers), and the two applications appear as smaller boxes (containers), one dark and one light. Each application uses a different ApplicationMaster: the darker client (MPI AM2) is running a Message Passing Interface (MPI) application, and the lighter client (MR AM1) is running a traditional MapReduce application.

c. Explain capacity scheduler background. (04 Marks)
Ans. Capacity Scheduler Background
The Capacity scheduler is the default scheduler for YARN that enables multiple groups to securely share a large Hadoop cluster. Developed by the original Hadoop team at Yahoo!, the Capacity scheduler has successfully run many of the largest Hadoop clusters.
To use the Capacity scheduler, one or more queues are configured with a predetermined fraction of the total slot (or processor) capacity. This assignment guarantees a minimum amount of resources for each queue. Administrators can configure soft limits and optional hard limits on the capacity allocated to each queue. Each queue has strict ACLs (Access Control Lists) that control which users can submit applications to individual queues. Also, safeguards are in place to ensure that users cannot view or modify applications from other users.
The Capacity scheduler permits sharing a cluster while giving each user or group certain minimum capacity guarantees. These minimum amounts are not given away in the absence of demand (i.e., a group is always guaranteed a minimum number of resources is available). Excess slots are given to the most starved queues, based on the number of running tasks divided by the queue capacity. Thus, the fullest queues, as defined by their initial minimum capacity guarantee, get the most needed resources. Idle capacity can be assigned and provides elasticity for the users in a cost-effective manner.
Administrators can change queue definitions and properties, such as capacity and ACLs, at run time without disrupting users. They can also add more queues at run time, but cannot delete queues at run time. In addition, administrators can stop queues at run time to ensure that while existing applications run to completion, no new applications can be submitted.
The Capacity scheduler currently supports memory-intensive applications, where an application can optionally specify higher memory resource requirements than the default. Using information from the NodeManagers, the Capacity scheduler can then place containers on the best-suited nodes.
The Capacity scheduler works best when the workloads are well known, which helps in assigning the minimum capacity. For this scheduler to work most effectively, each queue should be assigned a minimal capacity that is less than the maximal expected workload. Within each queue, multiple jobs are scheduled using hierarchical FIFO queues, similar to the approach used with the stand-alone FIFO scheduler. If there are no queues configured, all jobs are placed in the default queue.
To select the scheduler view in the ResourceManager web interface, click the Scheduler option at the bottom of the left-side vertical menu. Information on configuring the Capacity scheduler can be found at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html.
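The rule that excess slots go to the most starved queue, measured as running tasks divided by queue capacity, can be illustrated with a short Python sketch. It is a simplified model under stated assumptions and not the Capacity scheduler implementation: the function name assign_excess, the queue dictionary layout, and the slot-by-slot loop are invented for this example, and ACLs, hard limits, and memory requests are ignored.

```python
def assign_excess(queues, excess_slots):
    """Hand out spare slots one at a time to the most starved queue with pending work.

    queues: name -> {"capacity": guaranteed slots, "running": int, "pending": int}
    Returns a dict of name -> number of excess slots granted.
    """
    running = {name: q["running"] for name, q in queues.items()}
    pending = {name: q["pending"] for name, q in queues.items()}
    granted = {name: 0 for name in queues}
    for _ in range(excess_slots):
        waiting = [name for name in queues if pending[name] > 0]
        if not waiting:
            break                 # idle capacity stays unassigned
        # Most starved = lowest ratio of running tasks to guaranteed capacity.
        starved = min(waiting, key=lambda n: running[n] / queues[n]["capacity"])
        running[starved] += 1
        pending[starved] -= 1
        granted[starved] += 1
    return granted


queues = {
    "prod": {"capacity": 60, "running": 60, "pending": 10},
    "dev": {"capacity": 40, "running": 10, "pending": 5},
}
granted = assign_excess(queues, excess_slots=3)
print(granted)
```

Here the dev queue runs far below its guaranteed share, so it receives all three spare slots; once its pending work is exhausted, further spare slots would flow to prod.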

Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2.
In addition to the capacity scheduler, Hadoop YARN offers a Fair scheduler. More information can be found on the Hadoop website.

Figure 4.7: Apache YARN resource manager web interface showing capacity scheduler information

Module-3

5. a. Describe the list of business intelligence tools used in the organization. Explain any 2 of them used in your organization. (10 Marks)
Ans. According to the list of best business intelligence tools prepared by experts from FinancesOnline, the leading solutions in this category comprise systems designed to capture, categorize, and analyze corporate data and extract best practices for improved decision making. The more advanced the system is, the more data sources it will combine, including internal metrics coming from different company departments, and external data extracted from third-party systems, social media channels, emails, or even macroeconomic data. Ultimately, business intelligence software helps companies gain insight on their overall growth, sales trends, and customer behavior.

1. Sisense
2. Actuate Business Intelligence and Reporting Tools (BIRT)
3. icCube
5. Board Management Intelligence Toolkit
6. Clear Analytics
7. Ducen
8. GoodData
9. IBM Cognos Intelligence
10. InsightSquared
11. JasperSoft
12. Looker
13. Microsoft BI platform
14. MicroStrategy
15. MITS
16. OpenI
17. Oracle BI
18. Oracle Enterprise BI Server
19. Oracle Hyperion System
20. Palo OLAP Server
21. Pentaho
22. Profitbase
23. QlikView
24. Rapid Insight
25. SAP Business Intelligence
26. SAP BusinessObjects
27. SAP NetWeaver BW
28. SAS BI
29. Silvon
30. Solver
31. SpagoBI
32. SQL Server Analysis Services
33. Style Intelligence
34. Syntell solutions
35. Targit
36. Vismatica
37. WebFOCUS
38. Yellowfin BI

The BI tool used in our organization: Education
As higher education becomes more expensive and competitive, it is a great user of data-based decision making. There is a strong need for efficiency, increasing revenue, and improving the quality of student experience at all levels of education.
1. Student enrolment (recruitment and retention): Marketing to new potential students requires schools to develop profiles of the students that are most likely to attend. Schools can develop models of what kinds of students are attracted to the school, and then reach out to those students. The students at risk of not returning can be flagged, and corrective measures can be taken in time.
2. Course offerings: Schools can use the class enrolment data to develop models of which new courses are likely to be more popular with students. This can help increase class size, reduce costs, and improve student satisfaction.
3. Alumni pledges: Schools can develop predictive models of which alumni are most likely to pledge financial support to the school. Schools can create a profile for alumni more likely to pledge donations to the school. This could lead to a reduction in the cost of mailings and

b. ... (06 Marks)
Ans. DW has four key elements (Figure 5.1).

Figure 5.1 Data warehousing architecture
[Data sources (operations applications, ERP systems, legacy systems, point of sale; external suppliers, customers, government) -> Data transformation (select, extract, cleanse, compute, integrate, and load data) -> Data mart or warehouse (one data mart for each department; a warehouse for the whole enterprise) -> Accessing users (OLAP tools, reporting tools, dashboards, web services, data mining, custom apps)]

The first element is the data sources that provide the raw data. The second element is the process of transforming that data to meet the decision needs. The third element is the methods of regularly and accurately loading that data into the EDW or data marts. The fourth element is the data access and analysis part, where devices and applications use the data from DW to deliver insights and other benefits to users.

Data Sources
DWs are created from structured data sources. Unstructured data, such as text data, would need to be structured before being inserted into DW.
1. Operations data include data from all business applications, including from ERP systems that form the backbone of an organization's IT systems. The data to be extracted will depend upon the subject matter of DW. For example, for a sales/marketing DW, only the data about customers, orders, customer service, and so on would be extracted.
2. Other applications, such as point-of-sale (POS) terminals and e-commerce applications, provide customer-facing data. Supplier data could come from supply chain management systems. Planning and budget data should also be added as needed for making comparisons against targets.
3. External syndicated data, such as weather or economic activity data, could also be added to DW, as needed, to provide good contextual information to decision makers.

Figure 5.2 Data warehousing architecture

Data Transformation Processes
The heart of a useful DW is the processes to populate the DW with good quality data. This is called the extract-transform-load (ETL) cycle.
1. Data should be extracted from many operational (transactional) database sources on a regular basis.
2. Extracted data should be aligned together by key fields. It should be cleansed of any irregularities or missing values. It should be rolled up together to the same level of granularity. Desired fields, such as daily sales totals, should be computed. The entire data should then be brought to the same format as the central table of DW.
3. The transformed data should then be uploaded into DW. This ETL process should be run at a regular frequency. Daily transaction data can be extracted from ERPs, transformed, and uploaded to the database the same night. Thus, DW is up to date the next morning. If DW is needed for near-real-time information access, then the ETL processes would need to be executed more frequently. ETL work is usually automated using programming scripts that are written, tested, and then deployed for periodically updating DW.

DW Design
Star schema is the preferred data architecture for most DWs. There is a central fact table that provides most of the information of interest. There are lookup tables that provide detailed values for codes used in the central table. For example, the central table may use digits to represent a salesperson. The lookup table will help provide the name for that salesperson code. Here is an example of a star schema for a data mart for monitoring sales performance (Figure 5.2).

Figure 5.2 Star schema architecture

Other schemas include the snowflake architecture. The difference between a star and a snowflake is that in the latter, the lookup tables can have their own further lookup tables.
There are many technology choices for developing DW. This includes selecting the right database management system and the right set of data management tools. There are a few big and reliable providers of DW systems. The provider of the operational DBMS may be chosen for DW also. Alternatively, a best-of-breed DW vendor could be used. There are also a variety of tools out there for data migration, data upload, data retrieval, and data analysis.

DW Access
Data from DW could be accessed for many purposes, through many devices.
1. A primary use of DW is to produce routine management and monitoring reports. For example, a sales performance report would show sales by many dimensions, compared with plan. A dashboarding system will use data from the warehouse and present analysis to users. The data from DW can be used to populate customized performance dashboards for executives. The dashboard could include drill-down capabilities to analyze the performance data for root cause analysis.
2. The data from the warehouse could be used for ad hoc queries and any other applications that make use of the internal data.
3. Data from DW is used to provide data for mining purposes. Parts of the data would be extracted, and then combined with other relevant data, for data mining.

OR

6. a. Describe the key steps in the data mining process. Why is it important to follow these processes? (08 Marks)
Ans. Effective and successful use of data mining activity requires both business and technology skills. The business aspects help understand the domain and the key questions. They also help one imagine possible relationships in the data and create hypotheses to test. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform.
An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. The data mining industry has proposed a Cross-Industry Standard Process for Data Mining (CRISP-DM). It has six essential steps (Figure 6.1):

Figure 6.1 CRISP-DM data mining cycle

1. The first and most important step in data mining is business understanding, that is, asking the right business questions. A question is a good one if answering it would lead to large payoffs for the organization, financially and otherwise. In other words, selecting a data mining project is like any other project, in which it should show strong payoffs if the project
is successful. There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy.
2. A second important step is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of a proposed model as well as in the data sets available and required.
3. The data should be clean and of high quality. It is important to assemble a team that has a mix of technical and business skills, who understand the domain and the data. Data cleaning can take 60 to 70 percent of the time in a data mining project. It may be desirable to add new data elements from external sources of data that could help improve predictive accuracy.
4. Patience is required in continuously engaging with the data until the data yields some good insights. A host of modeling tools and algorithms should be used. A tool could be tried with different options, such as running different decision tree algorithms.
5. One should not accept what the data says at first. It is better to triangulate the analysis by applying multiple data mining techniques and conducting many what-if scenarios, to build confidence in the solution. Evaluate the model's predictive accuracy with more test data.
6. The dissemination and rollout of the solution is the key to project success. Otherwise the project will be a waste of time and will be a setback for establishing and supporting a data-based decision-process culture in the organization. The model should be embedded in the organization's business processes.

b. What are the data visualization techniques? When would you use tables or graphs?
Ans. Data visualization techniques are:
1. Pixel oriented visualization techniques
A simple way to visualize the value of a dimension is to use a pixel, where the color of the pixel reflects the dimension's value.
For a data set of m dimensions, pixel oriented techniques create m windows on the screen, one for each dimension.
The m dimension values of a record are mapped to m pixels at the corresponding position in the windows. The color of each pixel reflects the corresponding value.
Inside a window, the data values are arranged in some global order shared by all windows.
Eg: AllElectronics maintains a customer information table, which consists of 4 dimensions: income, credit_limit, transaction_volume and age. We analyze the correlation between income and the other attributes by visualization.
We sort all customers by income in ascending order and use this order to lay out the customer data in the 4 visualization windows as shown in the figure.
The pixel colors are chosen so that the smaller the value, the lighter the shading.
Using pixel based visualization we can easily observe that credit_limit increases as income increases; customers whose income is in the middle range are more likely to purchase more from AllElectronics; there is no clear correlation between income and age.

Fig 6.2: Pixel-oriented visualization of four attributes (income, credit_limit, transaction_volume, age) by sorting all customers in income ascending order

2. Geometric projection visualization techniques
A drawback of pixel-oriented visualization techniques is that they cannot help us much in understanding the distribution of data in a multidimensional space.
Geometric projection techniques help users find interesting projections of multidimensional data sets.
A scatter plot displays 2-D data points using Cartesian co-ordinates. A third dimension can be added using different colors or shapes to represent different data points.
Eg. x and y are two spatial attributes and the third dimension is represented by different shapes.
Through this visualization, we can see that points of types "+" and "X" tend to be collocated.

Fig 6.3: Visualization of 2D data set using scatter plot

3. Icon based visualization techniques
These use small icons to represent multidimensional data values.
2 popular icon based techniques:
3.1 Chernoff faces: They display multidimensional data of up to 18 variables as a cartoon human face.
3.2 Stick figures: They map multidimensional data to five-piece stick figures, where each figure has 4 limbs and a body.
2 dimensions are mapped to the display axes and the remaining dimensions are mapped to the angle and/or length of the limbs.
4. Hierarchical visualization techniques (i.e. subspaces)
The subspaces are visualized in a hierarchical manner.

There are many kinds of data as seen in the caselet above. Time series data is the most popular form of data. It helps reveal patterns over time. However, data could be organized around an alphabetical list of things, such as countries or products or salespeople. Figure 6.1 shows some of the popular chart types and their usage.
1. Line graph: This is a basic and most popular type of displaying information. It shows data as a series of points connected by straight line segments. If mining with time-series data, time is usually shown on the x-axis. Multiple variables can be represented on the same scale on the y-axis to compare the line graphs of all the variables.


2. Scatter plot: This is another very basic and useful graphic form. It helps reveal the relationship between two variables. In the above caselet, it shows two dimensions: Life Expectancy and Fertility Rate. Unlike in a line graph, there are no line segments connecting the points.
3. Bar graph: A bar graph shows thin colorful rectangular bars with their lengths being proportional to the values represented. The bars can be plotted vertically or horizontally. Bar graphs use a lot more ink than line graphs and should be used when line graphs are inadequate.
4. Stacked bar graphs: These are a particular method of doing bar graphs. Values of multiple variables are stacked one on top of the other to tell an interesting story. Bars can also be normalized such that the total height of every bar is equal, so it can show the relative composition of each bar.
5. Histograms: These are like bar graphs, except that they are useful in showing data frequencies or data values on classes (or ranges) of a numerical variable.
6. Pie charts: These are very popular to show the distribution of a variable, such as sales by region. The size of a slice is representative of the relative strength of each value.
7. Box charts: These are a special form of charts to show the distribution of variables. The box shows the middle half of the values, while whiskers on both sides extend to the extreme values in either direction.
8. Bubble graph: This is an interesting way of displaying multiple dimensions in one chart. It is a variant of a scatter plot with many data points marked on two dimensions. Now imagine that each data point on the graph is a bubble (or a circle); the size of the circle and the color fill in the circle could represent two additional dimensions.

Figure 6.5: Many types of graphs

9. Dials: These are charts like the speed dial in the car, that show whether the variable value (such as sales number) is in the low range, medium range, or high range. These ranges could be colored red, yellow and green to give an instant view of the data.
10. Geographical data maps are particularly useful maps to denote statistics. Figure 6.6 shows a tweet density map of the US. It shows where the tweets emerge from in the US.
11. Pictographs: One can use pictures to represent data. E.g., Figure 6.7 shows the number of liters of water needed to produce one pound of each of the products, where images are used to show the product for easy reference. Each droplet of water also represents 50 liters of water.

Figure 6.7: Pictograph of water footprint (source: waterfootprint.org)

A table is best when:
• You need to look up specific values
• Users need precise values
• You need to precisely compare related values
• You have multiple data sets with different units of measure
A graph is best when:
• The message is contained in the shape of the values
• You want to reveal relationships among multiple values (similarities and differences)
Graphs and tables serve differing purposes. Choose the appropriate data display to fit your purpose.
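The histogram and box chart items in the list above both reduce to small computations on a numerical variable. The following is a minimal sketch, not taken from the source: the function names, the bin width, and the sample sales figures are all assumptions chosen for illustration. It counts values per class (range) for a histogram and computes the quartiles and extremes that a box chart would draw.

```python
from collections import Counter
from statistics import quantiles


def histogram(values, bin_width):
    """Count how many values fall in each class [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def box_summary(values):
    """Return the numbers a box chart draws: the box covers q1..q3
    (the middle half of the values) and the whiskers reach the extremes."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"low": min(values), "q1": q1, "median": median,
            "q3": q3, "high": max(values)}


if __name__ == "__main__":
    # Hypothetical daily sales figures, used only for illustration.
    sales = [12, 15, 21, 29, 34]
    print(histogram(sales, bin_width=10))
    print(box_summary(sales))
```

A table of these numbers suits a reader who needs the precise values; plotting them suits a reader who needs the shape of the distribution.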
Module-4
7. a. What is a splitting variable? Describe three criteria for choosing a splitting variable. (08 Marks)
Ans. Splitting the Tree: From the root node, the decision tree will be split into three branches or sub-trees, one for each of the three values of outlook. Data for the root node (the entire data) will be divided into the three segments, one for each value of outlook. The sunny branch will inherit the data for the instances that had sunny as the value of outlook. These will be used for further building of that sub-tree. Similarly, the rainy branch will inherit data for the instances that had rainy as the value of outlook. These will be used for further building that sub-tree. The overcast branch will inherit the data for the instances that had overcast as the outlook. However, there will be no need to build further on that branch. There is a clear decision, yes, for all instances when the outlook value is overcast.
The decision tree will look like this after the first level of splitting.

[Figure: the tree after the first split on outlook, with the sunny and rainy branches each inheriting their rows of data and the overcast branch ending in the decision yes.]

Decision trees employ the divide and conquer method. The data is branched at each node according to certain criteria until all the data is assigned to leaf nodes. It recursively divides a training set until each division consists of examples from one class.
The following is a pseudo code for making decision trees:
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute according to certain criteria.
3. Add a branch to the root node for each value of the split.
4. Split the data into mutually exclusive subsets along the lines of the specific split.
5. Repeat steps 2 and 3 for each and every leaf node until a stopping criterion is reached.
There are many algorithms for making decision trees. The most popular ones are C5, CART, and CHAID. They differ on three key elements:
1. Splitting criteria
a. Which variable to use for the first split? How should one determine the most important variable for the first branch, and subsequently, for each subtree? There are many measures like least errors, information gain, and Gini coefficient.
b. What values to use for the split? If the variables have continuous values, such as for age or BP, what value ranges should be used to make bins?
c. How many branches should be allowed for each node? There could be binary trees with just two branches at each node. Or there could be more branches allowed.
2. Stopping criteria
a. When to stop building the tree? There are two major ways to make that determination. The tree building could be stopped when a certain depth of the branches has been reached and the tree becomes unreadable after that. The tree could also be stopped when the error level at any node is within predefined tolerable levels.
3. Pruning: The tree could be trimmed to make it more balanced and more easily usable. The pruning is often done after the tree is constructed, to balance out the tree and improve usability. The symptoms of an overfitted tree are a tree too deep, with too many branches, some of which may reflect anomalies due to noise or outliers. Thus, the tree should be pruned. There are two approaches to avoid over-fitting.

b. Compare and contrast decision trees with regression models. (08 Marks)
Ans. Advantages and Disadvantages of Regression Models
Regression models are very popular because they offer many advantages:
1. Regression models are easy to understand as they are built upon basic statistical principles, such as correlation and least square error.
2. Regression models provide simple algebraic equations that are easy to understand and use.
3. The strength (or the goodness of fit) of the regression model is measured in terms of the correlation coefficients, and other related statistical parameters that are well understood.
4. Regression models can match and beat the predictive power of other modeling techniques.
5. Regression models can include all the variables that one wants to include in the model.
6. Regression modeling tools are pervasive. They are found in statistical packages as well as data mining packages. MS Excel spreadsheets can also provide simple regression modeling capabilities.
Regression models can however prove inadequate under many circumstances.
1. Regression models cannot cover for poor data quality issues. If the data is not prepared well to remove missing values, or is not well-behaved in terms of a normal distribution, the validity of the model suffers.
2. Regression models suffer from collinearity problems (meaning strong linear correlations among some independent variables). If the independent variables have strong correlations among themselves, then they will eat into each other's predictive power and the regression coefficients will lose their ruggedness.
3. Regression models will not automatically choose between highly collinear variables, although some packages attempt to do that. Regression models can be unwieldy and unreliable if a large number of variables are included in the model. All variables entered into the model will be reflected in the regression equation, irrespective of their contribution to the predictive power of the model. There is no concept of automatic pruning of the model.
4. Regression models do not automatically take care of nonlinearity. The user needs to imagine the kind of additional terms that might need to be added to the regression model to improve its fit.
5. Regression models work only with numeric data and not with categorical variables. There are ways to deal with categorical variables, though, by creating multiple new variables with a yes/no value.

OR
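Two points from the regression answer above can be made concrete with a short sketch: the "simple algebraic equation" with its correlation coefficient, and the yes/no variables used to represent a categorical variable. This is an illustration under stated assumptions rather than the method of any particular package: the function names `fit_line` and `yes_no_columns` and the sample numbers are hypothetical, and only one independent variable is handled.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x for one independent variable.
    Returns (a, b, r), where r is the correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope that minimizes the squared error
    a = mean_y - b * mean_x    # intercept
    r = sxy / sqrt(sxx * syy)  # strength of the linear relationship
    return a, b, r


def yes_no_columns(values):
    """Turn one categorical variable into one 1/0 (yes/no) column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


if __name__ == "__main__":
    # Hypothetical data, used only for illustration.
    a, b, r = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(f"y = {a:.2f} + {b:.2f} * x, r = {r:.2f}")
    print(yes_no_columns(["sunny", "rainy", "sunny"]))
```

The fitted equation is easy to read, but as the answer notes, nothing here checks data quality, collinearity, or nonlinearity; those remain the analyst's responsibility.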

difficult.
Decision Trees
Should be faster once trained (although both algorithms can train slowly depending on the exact algorithm and the amount/dimensionality of the data). This is because a decision tree inherently "throws away" the input features that it doesn't find useful, whereas a neural net will use them all unless you do some feature selection as a pre-processing step.
If it is important to understand what the model is doing, the trees are very interpretable.
Only model functions which are axis-parallel splits of the data, which may not be the case.
You probably want to be sure to prune the tree to avoid over-fitting.

Neural Nets
Slower (both for training and classification), and less interpretable.
If your data arrives in a stream, you can do incremental updates with stochastic gradient descent (unlike decision trees, which use inherently batch-learning algorithms).
Can model more arbitrary functions (nonlinear interactions, etc.) and therefore might be more accurate, provided there is enough training data. But it can be prone to over-fitting as well.
There are many advantages of using ANN.
1. ANNs impose very few restrictions on their use. ANN can deal with (identify/model) highly nonlinear relationships on their own, without much work from the user or analyst. They help find practical data-driven solutions where algorithmic solutions are nonexistent or too complicated.
2. There is no need to program ANN neural networks, as they learn from examples. They get better with use, without much programming effort.
3. ANN can handle a variety of problem types, including classification, clustering, associations, and so on.
4. ANNs are tolerant of data quality issues, and they do not restrict the data to follow strict normality and/or independence assumptions.
5. ANN can handle both numerical and categorical variables.
6. ANNs can be much faster than other techniques.
7. Most importantly, ANN usually provide better results (prediction and/or clustering) compared to statistical counterparts, once they have been trained enough.
The key disadvantages arise from the fact that they are not easy to interpret or explain or compute.
1. They are deemed to be black box solutions, lacking explainability.
2. Optimal design of ANN is still an art: it requires expertise and extensive experimentation.
3. It can be difficult to handle a large number of variables (especially the rich nominal attributes) with an ANN.
4. It takes large data sets to train an ANN.

b. Define Cluster. Describe three business applications in your industry where cluster analysis will be useful. (08 Marks)
Ans. Definition of a Cluster: An operational definition of a cluster is that, given a representation of n objects, find K groups based on a measure of similarity, such that objects within the same group are alike but the objects in different groups are not alike.
An ideal cluster can be defined as a set of points that is compact and isolated. In reality, a cluster is a subjective entity whose significance and interpretation requires domain knowledge. In the sample data below (Figure 8.1), how many clusters can one visualize?

Figure 8.1: Visual cluster example

It seems like there are two clusters of approximately equal sizes. However, they can be seen as three clusters, depending on how we draw the dividing lines. There is not a truly optimal way to calculate it. Heuristics are often used to define the number of clusters.
Three business applications:
Cluster analysis is used in almost every field where there is a large variety of transactions. It helps provide characterization, definition, and labels for populations. It can help identify natural groupings of customers, products, patients, and so on. It can also help identify outliers in a specific domain and thus decrease the size and complexity of problems. A prominent business application of cluster analysis is in market research. Customers are segmented into clusters based on their characteristics: wants and needs, geography, price sensitivity, and so on. Here are some examples of clustering:
1. Market segmentation: Categorizing customers according to their similarities, for instance by their common wants and needs, and propensity to pay, can help with targeted marketing.
2. Product portfolio: People of similar sizes can be grouped together to make small, medium and large sizes for clothing items.
3. Text mining: Clustering can help organize a given collection of text documents according to their content similarities into clusters of related topics.

Module-5
9. a. Give the comparison between Text Mining and Data Mining. (08 Marks)
Ans. Text mining is a form of data mining. There are many common elements between text and data mining. However, there are some key differences (table below). The key difference is that text mining requires conversion of text data into frequency data, before data mining techniques can be applied.

Dimension | Text Mining | Data Mining
Nature of data | Unstructured data: words, phrases, sentences | Numbers; alphabetical and logical values
Language used | Many languages and dialects used in the world; many languages are extinct, new
Lijs .i'.',s, ~...c.·.r ~ :
- ~~.m
docii!Jleri!S ~re 4,scove~d· ·: ' ..
However, the notion·o'f sl'milariiy can bd i~terpi·eted :ill iriany ways, .dusters can differ in);
terms of their shape, size, a1id density. Clusters are pattems,. and the.re cari be many kin~s. ' (:larity. anci
· !lrt(jsio~· ·
.· ,of patterns. Soiltc clusters ~re ·11;e traditional types, such as 'data points h~iiging together.:
. t\'01.v'ever, tliere a'ri: other clusters, such as all points,repres,e1iti1ig the_circtiinferen.ce g(J
· ·,. · :circle; Tlicre-may be concentri_c circles witli points o( different ciicles represe1iting different.
'==-:-c--:-,---,:ltisters,-'fhe-presen ce-of-noise'in-lhe-data-inakes the _tle!eGti 011-ef-t~ust~rs-eve'n---mor~·
Sentiment | Text may present a clear and consistent or mixed sentiment, across a continuum. Spoken words add further sentiment | N/A
Quality | Spelling errors. Differing values of proper nouns, such as names. Varying quality of language translation | Issues with missing values, outliers, and so on
Nature of analysis | Keyword-based search; co-existence of themes; sentiment mining | A full wide range of statistical and machine-learning analysis for relationships and differences

b. In what ways is Naive Bayes better than other classification techniques? Compare with decision tree. (08 Marks)
Ans: Classification is the separation or ordering of objects into classes. There are two phases in a classification algorithm: first, the algorithm tries to find a model for the class attribute as a function of the other variables of the dataset. Next, it applies the previously designed model on new and unseen datasets to determine the related class of each record.
Classification has been applied in many fields such as medicine, astronomy, commerce, biology, media, etc. There are many classification techniques, such as Decision Tree, Naive Bayes, k-Nearest Neighbor, Neural Networks, Support Vector Machine, and Genetic Algorithm. The comparison here concerns Naive Bayes and Decision Tree.
They are both supervised learning algorithms used for classification tasks. Which one is better strongly depends on the data you have and what you are trying to learn.
Although it depends on the problem you are solving, some general advantages are the following:
Naive Bayes:
1. Works well with a small dataset, compared to a decision tree (DT), which needs more data.
2. Lesser overfitting.
3. Smaller in size and faster in processing.
Decision Tree:
1. Decision trees are very flexible, easy to understand, and easy to debug.
2. No preprocessing or transformation of features is required.
3. Prone to overfitting, but you can use pruning or Random Forests to avoid that.
In Brief: Decision Tree
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The popular decision tree algorithms are ID3, C4.5 and CART. The ID3 algorithm is considered a very simple decision tree algorithm. It uses information gain as the splitting criterion. C4.5 is an evolution of ID3. It uses gain ratio as the splitting criterion. The CART algorithm uses the Gini coefficient as the test attribute selection criterion, and each time selects the attribute with the smallest Gini coefficient as the test attribute for a given set.
The advantage of using decision trees in classifying the data is that they are simple to understand and interpret. However, decision trees have such disadvantages as:
1) Most of the algorithms (like ID3 and C4.5) require that the target attribute will have only discrete values.
2) As decision trees use the "divide and conquer" method, they tend to perform well if a few highly relevant attributes exist, but less so if many complex interactions are present.
Naive Bayes: Naive Bayesian classifiers assume that there are no dependencies amongst attributes. This assumption is called class conditional independence. It is made to simplify the computations involved and, hence, is called "naive". This classifier is also called idiot Bayes, simple Bayes, or independent Bayes.
The advantages of Naive Bayes are:
· It uses a very intuitive technique. Bayes classifiers, unlike neural networks, do not have several free parameters that must be set. This greatly simplifies the design process.
· Since the classifier returns probabilities, it is simpler to apply these results to a wide variety of tasks than if an arbitrary scale was used.
· It does not require large amounts of data before learning can begin.
· Naive Bayes classifiers are computationally fast when making decisions.

OR
10. a. What is clickstream analysis?
Ans. On a Web site, clickstream analysis (also called clickstream analytics) is the process of collecting, analyzing and reporting aggregate data about which pages a website visitor visits and in what order. The path the visitor takes through a website is called the clickstream.
There are two levels of clickstream analysis: traffic analytics and e-commerce analytics. Traffic analytics operates at the server level and tracks how many pages are served to the user, how long it takes each page to load, how often the user hits the browser's back or stop button and how much data is transmitted before the user moves on. E-commerce-based analysis uses clickstream data to determine the effectiveness of the site as a channel-to-market. It is concerned with what pages the shopper lingers on, what the shopper puts in or takes out of a shopping cart, what items the shopper purchases, whether or not the shopper belongs to a loyalty program and uses a coupon code, and the shopper's preferred method of payment.
Because an extremely large volume of data can be gathered through clickstream analysis, many e-businesses rely on big data analytics and related tools such as Hadoop to help interpret the data and generate reports for specific areas of interest. Clickstream analysis is considered to be most effective when used in conjunction with other, more traditional market evaluation resources.

b. Explain briefly the techniques and algorithms of social network analysis.
Ans. TECHNIQUES AND ALGORITHM: There are two major levels of social network analysis: discovering sub-networks within the network, and ranking the nodes to find more important nodes or hubs.
Finding Sub-networks: A large network could be better analyzed and managed if it can be seen as an interconnected set of distinct sub-networks, each with its own distinct identity and unique characteristics. This is like doing a cluster analysis of nodes. Nodes with strong ties between them would belong to the same sub-network, while those with weak or no ties would belong to separate sub-networks. This is an unsupervised learning technique, as a priori there is no correct number of sub-networks in a network. The usefulness of the network structure for decision-making is the main criterion for adopting a particular structure.
Visual representation of networks can help identify sub-networks. Use of color can help differentiate the types of nodes. Representing strong ties with thicker or bolder lines could help visually identify the stronger relationships. A sub-network could be a collection of strong relationships around a hub node. In this case the hub node could represent a distinct sub-network. A sub-network could also be a subset of nodes with dense relationships between them. In this case one or more nodes will act as gateways to the rest of the network.
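The idea that strongly tied nodes fall into the same sub-network can be sketched in a few lines of Python. This is only an illustrative sketch, not an algorithm given in the text: it assumes the ties are available as (node, node, strength) triples, treats a tie as strong when its strength reaches a chosen threshold, and takes every connected group of strongly tied nodes as one sub-network. The function name find_subnetworks, the sample ties and the threshold value are all hypothetical.

```python
from collections import defaultdict


def find_subnetworks(ties, min_strength):
    """Group nodes into sub-networks, joining only ties >= min_strength."""
    parent = {}

    def find(node):
        # Register unseen nodes, then walk up to the group representative.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # shorten the path as we go
            node = parent[node]
        return node

    for a, b, strength in ties:
        root_a, root_b = find(a), find(b)  # also registers weakly tied nodes
        if strength >= min_strength and root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical ties: (node, node, strength of the tie)
ties = [
    ("A", "B", 5), ("B", "C", 4),   # strong ties
    ("C", "D", 1),                  # weak tie between two groups
    ("D", "E", 6), ("E", "F", 5),   # strong ties
    ("F", "G", 0),                  # G has no strong tie to anyone
]
groups = find_subnetworks(ties, min_strength=3)
print(groups)
```

Raising or lowering min_strength changes how many sub-networks appear, which mirrors the point that there is no single correct number of sub-networks.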
Fig: A network with distinct sub-networks

Computing the Importance of Nodes: When the connections between nodes in the network have a direction to them, then the nodes can be compared for their relative influence or rank. This is done using the 'Influence Flow Model'. Every outbound link from a node can be considered an outflow of influence. Every incoming link is similarly an inflow of influence. More in-links to a node means greater importance. Thus there will be many direct and indirect flows of influence between any two nodes in the network.
Computing the relative influence of each node is done on the basis of an input-output matrix of flows of influence among the nodes. Assume each node has an influence value. The computational task is to identify a set of rank values that satisfies the set of links between the nodes. It is an iterative task where we begin with some initial values and continue to iterate till the rank values stabilize.
Consider the following simple network with 4 nodes (A, B, C, D) and 6 directed links between them, as shown in the figure (10.2). Note that there is a bidirectional link. Here are the links:
Node A links into B
Node B links into C
Node C links into D
Node D links into A
Node A links into C
Node B links into A

Fig. 10.2

The goal is to find the relative importance, or rank, of every node in the network. This will help identify the most important node(s) in the network.
We begin by assigning the variables for the influence (or rank) value of each node, as Ra, Rb, Rc and Rd. The goal is to find the relative values of these variables.
There are two outbound links from node A, to nodes B and C. Thus, both B and C receive half of node A's influence. Similarly, there are two outbound links from node B, to nodes C and A, so both C and A receive half of node B's influence.
There is only one outbound link from node D, to node A. Thus, node A gets all the influence of node D. There is only one outbound link from node C, to node D, and hence node D gets all the influence of node C.
Node A gets all of the influence of node D and half the influence of node B.
Thus, Ra = 0.5 × Rb + Rd
Node B gets half the influence of node A.
Thus, Rb = 0.5 × Ra
Node C gets half the influence of node A and half the influence of node B.
Thus, Rc = 0.5 × Ra + 0.5 × Rb
Node D gets all of the influence of node C.
Thus, Rd = Rc
We have 4 equations using 4 variables. These can be solved mathematically.
We can represent the coefficients of these 4 equations in a matrix form as shown in the Dataset (10.1) given below. This is the Influence Matrix. A zero value means that the term is not present in the equation.

Data Set 10.1
   | Ra   | Rb   | Rc   | Rd
Ra | 0    | 0.50 | 0    | 1.00
Rb | 0.50 | 0    | 0    | 0
Rc | 0.50 | 0.50 | 0    | 0
Rd | 0    | 0    | 1.00 | 0

For simplification, let us also state that all the rank values add up to 1. Thus, each node has a fraction as the rank value. Let us start with an initial set of rank values and then iteratively compute new rank values till they stabilize. One can start with any initial rank values, such as 1/n or 1/4 for each of the nodes.

Variable | Initial Value
Ra | 0.250
Rb | 0.250
Rc | 0.250
Rd | 0.250

Computing the revised values using the equations stated earlier, we get a revised set of values, shown as Iteration 1.

Variable | Initial Value | Iteration 1
Ra | 0.250 | 0.375
Rb | 0.250 | 0.125
Rc | 0.250 | 0.250
Rd | 0.250 | 0.250

Using the rank values from Iteration 1 as the new starting values, we can compute new values for these variables, shown as Iteration 2. Rank values will continue to change.
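The iterative computation can be sketched in Python as follows. This is a minimal sketch, assuming the influence matrix of the four-node example (each row lists how much of every node's rank flows into the row's node) and equal starting values of 0.25; the names INFLUENCE and iterate_ranks are chosen here for illustration only.

```python
# Influence matrix of the example: row = receiving node, column = giving node.
NODES = ["A", "B", "C", "D"]
INFLUENCE = [
    [0.0, 0.5, 0.0, 1.0],  # Ra = 0.5 x Rb + Rd
    [0.5, 0.0, 0.0, 0.0],  # Rb = 0.5 x Ra
    [0.5, 0.5, 0.0, 0.0],  # Rc = 0.5 x Ra + 0.5 x Rb
    [0.0, 0.0, 1.0, 0.0],  # Rd = Rc
]


def iterate_ranks(matrix, ranks, iterations):
    """Apply the influence equations repeatedly; return every set of ranks."""
    history = [list(ranks)]
    for _ in range(iterations):
        # Each new rank is the weighted sum of the previous ranks.
        ranks = [sum(w * r for w, r in zip(row, ranks)) for row in matrix]
        history.append(ranks)
    return history


history = iterate_ranks(INFLUENCE, [0.25, 0.25, 0.25, 0.25], iterations=8)
for step in (1, 2, 8):
    values = ", ".join(f"R{n.lower()}={v:.4f}" for n, v in zip(NODES, history[step]))
    print(f"Iteration {step}: {values}")
```

Starting from other initial values that add up to 1 and running more iterations is a quick way to check how fast the ranks settle.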


Variable | Initial Value | Iteration 1 | Iteration 2
Ra | 0.250 | 0.375 | 0.3125
Rb | 0.250 | 0.125 | 0.1875
Rc | 0.250 | 0.250 | 0.250
Rd | 0.250 | 0.250 | 0.250

Working from the values of Iteration 2 and so on, we can do a few more iterations till the values stabilize. Dataset (10.2) shows the final values after the 8th iteration.

Data Set 10.2
Variable | Initial Value | Iteration 1 | Iteration 2 | ... | Iteration 8
Ra | 0.250 | 0.375 | 0.313 | ... | 0.333
Rb | 0.250 | 0.125 | 0.188 | ... | 0.167
Rc | 0.250 | 0.250 | 0.250 | ... | 0.250
Rd | 0.250 | 0.250 | 0.250 | ... | 0.250

The final rank shows that the rank of node A is the highest at 0.333. Thus, the most important node is A. The lowest rank is 0.167, of Rb. Thus, B is the least important node. Nodes C and D are in the middle. In this case, their ranks did not change at all.
The relative scores of the nodes in this network would have been the same irrespective of the initial values chosen for the computations. It may take a larger or smaller number of iterations for the results to stabilize for different sets of initial values.
PAGERANK: PageRank is a particular application of the social network analysis techniques above to compute the relative importance of websites in the overall World Wide Web. The data on websites and their links is gathered through web crawler bots that traverse the web pages at frequent intervals. Every web page is a node in the social network, and the hyperlinks from that page become directed links to other web pages. Every outbound link from a web page is considered an outflow of influence of that web page. An iterative computational technique is applied to compute a relative importance for each page. That score is called PageRank, according to an eponymous algorithm invented by the founders of Google, the web search company.
PageRank is used by Google for ordering the display of websites in response to search queries. To be shown higher in the search results, many website owners try to artificially boost their PageRank by creating many dummy websites whose ranks can be made to flow into their desired websites. Also, many websites can be designed with cyclical sets of links from which the web crawler may not be able to break out. These are called spider traps.
To overcome these and other challenges, Google includes a Teleporting factor in computing the PageRank. Teleporting assumes that there is a potential link from any node to any other node, irrespective of whether it actually exists. Thus, the influence matrix is multiplied by a weighting factor called Beta, with a typical value of 0.85 or 85%. The remaining weight of 0.15 or 15% is given to teleportation. In the Teleportation matrix, each cell is given a rank of 1/n, where n is the number of nodes in the web. The two matrices are added to compute the final influence matrix. This matrix can be used to iteratively compute the PageRank of all the nodes.

Eighth Semester B.E. Degree Examination,
CBCS - Model Question Paper - 3
BIG DATA ANALYTICS
Time: 3 hrs. Max. Marks: 80
Note: Answer any FIVE full questions, selecting ONE full question from each module.

Module-1
1. a. Write a note on the following: (08 Marks)
(1) Rack awareness
(2) HDFS snapshots
(3) HDFS NameNode Federation
Ans. (1) Rack Awareness
Rack awareness deals with data locality. Recall that one of the main design goals of Hadoop MapReduce is to move the computation to the data. Assuming that most data centre networks do not offer full bisection bandwidth, a typical Hadoop cluster will exhibit three levels of data locality:
1. Data resides on the local machine (best).
2. Data resides in the same rack (better).
3. Data resides in a different rack (good).
When the YARN scheduler is assigning MapReduce containers to work as mappers, it will try to place the container first on the local machine, then on the same rack, and finally on another rack.
In addition, the NameNode tries to place replicated data blocks on multiple racks for improved fault tolerance. In such a case, an entire rack failure will not cause data loss or stop HDFS from working. Performance may be degraded, however.
HDFS can be made rack-aware by using a user-derived script that enables the master node to map the network topology of the cluster. A default Hadoop installation assumes all the nodes belong to the same rack.
(2) HDFS Snapshots
HDFS snapshots are similar to backups, but are created by administrators using the hdfs dfs -snapshot command. HDFS snapshots are read-only point-in-time copies of the file system. Blocks on the DataNodes are not copied, because the snapshot files record the block list and the file size. There is no data copying, although it appears to the user that there are duplicate files.
(3) HDFS NameNode Federation
Another important feature of HDFS is NameNode Federation. Older versions of HDFS



provided a single namespace for the entire cluster, managed by a single NameNode. Federation addresses this limitation by adding support for multiple NameNodes/namespaces to the HDFS file system. The key benefits are as follows:
• Namespace scalability. HDFS cluster storage scales horizontally without placing a burden on the NameNode.
• Better performance. Adding more NameNodes to the cluster scales the file system read/write operations throughput by separating the total namespace.
• System isolation. Multiple NameNodes enable different categories of applications to be distinguished, and users can be isolated to different namespaces.
Figure 1.1 illustrates how HDFS NameNode Federation is accomplished. NameNode1 manages the /research and /marketing namespaces, and NameNode2 manages the /data and /project namespaces. The NameNodes do not communicate with each other, and the DataNodes "just store data blocks" as directed by either NameNode.

Figure 1.1 HDFS NameNode Federation example

b. Explain the concept of running basic Hadoop benchmarks.
Ans. Running Basic Hadoop Benchmarks: Many Hadoop benchmarks can provide insight into cluster performance. The best benchmarks are always those that reflect real application performance. The two benchmarks discussed here, terasort and TestDFSIO, provide a good sense of how well your Hadoop installation is operating and can be compared with public data published for other Hadoop systems. The results, however, should not be taken as a single indicator for system-wide performance on all applications.
The following benchmarks are designed for full Hadoop cluster installations. These tests assume a multi-disk HDFS environment. Running these benchmarks in the Hortonworks Sandbox or in the pseudo-distributed single-node install is not recommended because all input and output (I/O) are done using a single system disk drive.
Running the Terasort Test
The terasort benchmark sorts a specified amount of randomly generated data. This benchmark provides combined testing of the HDFS and MapReduce layers of a Hadoop cluster. A full terasort benchmark run consists of the following three steps:
1. Generating the input data via the teragen program.
2. Running the actual terasort benchmark on the input data.
3. Validating the sorted output data via the teravalidate program.
In general, each row is 100 bytes long; thus the total amount of data written is 100 times the number of rows specified as part of the benchmark (i.e., to write 100GB of data, use 1 billion rows). The input and output directories need to be specified in HDFS. The following sequence of commands will run the benchmark for 50GB of data as user hdfs. Make sure the /user/hdfs directory exists in HDFS before running the benchmarks.
1. Run teragen to generate rows of random data to sort.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 /user/hdfs/TeraGen-50GB
2. Run terasort to sort the database.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
3. Run teravalidate to validate the sort.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB
To report results, the time for the actual sort (terasort) is measured and the benchmark rate in megabytes/second (MB/s) is calculated. For best performance, the actual terasort benchmark should be run with a replication factor of 1. In addition, the default number of terasort reducer tasks is set to 1. Increasing the number of reducers often helps with benchmark performance. For example, the following command will instruct terasort to use four reducer tasks:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
Also, do not forget to clean up the terasort data between runs (and after testing is finished). The following command will perform the cleanup for the previous example:
$ hdfs dfs -rm -r -skipTrash Tera*
Running the TestDFSIO Benchmark
Hadoop also includes an HDFS benchmark application called TestDFSIO. The TestDFSIO benchmark is a read and write test for HDFS. That is, it will write or read a number of files to and from HDFS and is designed in such a way that it will use one map task per file. The file size and number of files are specified as command-line arguments. Similar to the terasort benchmark, you should run this test as user hdfs. Similar to terasort, TestDFSIO has several steps. In the following example, 16 files of size 1 GB are specified. Note that the TestDFSIO benchmark is part of the hadoop-mapreduce-client-jobclient.jar. Other benchmarks are also available as part of this jar file. Running it with no arguments will yield a list. In addition to TestDFSIO, NNBench (load testing the NameNode) and MRBench (load testing the MapReduce framework) are commonly used Hadoop benchmarks. Nevertheless, TestDFSIO is perhaps the most widely reported of these benchmarks. The steps to run TestDFSIO are as follows:
1. Run TestDFSIO in write mode and create data.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 16 -fileSize 1000
Example results are as follows (date and time prefix removed).
fs.TestDFSIO: ----- TestDFSIO ----- : write
fs.TestDFSIO: Date & time: Thu May 14 10:39:33 EDT 2015
fs.TestDFSIO: Number of files: 16
fs.TestDFSIO: Total MBytes processed: 16000.0
fs.TestDFSIO: Throughput mb/sec: 14.890106361891005
fs.TestDFSIO: Average IO rate mb/sec: 15.690713882446289
fs.TestDFSIO: IO rate std deviation: 4.0227035201665595
fs.TestDFSIO: Test exec time sec: 105.631
2. Run TestDFSIO in read mode.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 16 -fileSize 1000
Example results are as follows (date and time prefix removed). The large standard deviation is due to the placement of tasks in the cluster on a small four-node cluster.
fs.TestDFSIO: ----- TestDFSIO ----- : read
fs.TestDFSIO: Date & time: Thu May 14 10:44:09 EDT 2015
fs.TestDFSIO: Number of files: 16
fs.TestDFSIO: Total MBytes processed: 16000.0
fs.TestDFSIO: Throughput mb/sec: 32.38643494172466
fs.TestDFSIO: Average IO rate mb/sec: 58.72880554199219
fs.TestDFSIO: IO rate std deviation: 64.60017624360337
fs.TestDFSIO: Test exec time sec:
3. Clean up the TestDFSIO data.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean
Running the TestDFSIO and terasort benchmarks helps you gain confidence in a Hadoop installation and detect any potential problems. It is also instructive to view the Ambari dashboard and the YARN web GUI (as described previously) as the tests run.
Managing Hadoop MapReduce Jobs
Hadoop MapReduce jobs can be managed using the mapred job command. The most important options for this command in terms of the examples and benchmarks are -list, -kill, and -status. In particular, if you need to kill one of the examples or benchmarks, you can use the mapred job -list command to find the job-id and then use mapred job -kill <job-id> to kill the job across the cluster. MapReduce jobs can also be controlled at the application level with the yarn application command. The possible options for mapred job are as follows:
$ mapred job
Usage: CLI <command> <args>
[-submit <job-file>]
[-status <job-id>]
[-counter <job-id> <group-name> <counter-name>]
[-kill <job-id>]
[-set-priority <job-id> <priority>]. Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW VERY_LOW
[-events <job-id> <from-event-#> <#-of-events>]
[-history <jobHistoryFile>]
[-list [all]]
[-list-active-trackers]
[-list-blacklisted-trackers]
[-list-attempt-ids <job-id> <task-type> <task-state>]. Valid values for <task-type> are REDUCE MAP. Valid values for <task-state> are running, completed
[-kill-task <task-attempt-id>]
[-fail-task <task-attempt-id>]
[-logs <job-id> <task-attempt-id>]
Generic options supported are
-conf <configuration file> specify an application configuration file
archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

OR
2. a. Write a short note on the following:
i. Speculative execution
ii. Hadoop MapReduce hardware
Ans. i. Speculative Execution: One of the challenges with many large clusters is the inability to predict or manage unexpected system bottlenecks or failures. In theory, it is possible to control and monitor resources so that network traffic and processor load can be evenly balanced; in practice, however, this problem represents a difficult challenge for large systems. Thus, it is possible that a congested network, slow disk controller, failing disk, high processor load, or some other similar problem might lead to slow performance without anyone noticing.
When one part of a MapReduce process runs slowly, it ultimately slows down everything else because the application cannot complete until all processes are finished. The nature of the parallel MapReduce model provides an interesting solution to this problem. Recall that input data are immutable in the MapReduce process. Therefore, it is possible to start a copy of a running map process without disturbing any other running mapper processes. For example, suppose that as most of the map tasks are coming to a close, the ApplicationMaster notices that some are still running and schedules redundant copies of the remaining jobs on less busy or free servers. Should the secondary processes finish first, the other first processes are then terminated (or vice versa). This process is known as speculative execution. The same approach can be applied to reducer processes that seem to be taking a long time. Speculative execution can reduce cluster efficiency because redundant resources are assigned to applications that seem to have a slow spot. It can also be turned off and on in the mapred-site.xml configuration file.
ii. Hadoop MapReduce Hardware
The capability of Hadoop MapReduce and HDFS to tolerate server or even whole rack failures can influence hardware designs. The use of commodity (typically x86_64) servers for Hadoop clusters has made low-cost, high-availability implementations of Hadoop possible for many data centres. Indeed, the Apache Hadoop philosophy seems to assume servers will always fail and takes steps to keep failure from stopping application progress on a cluster.
The use of server nodes for both storage (HDFS) and processing (mappers, reducers) is somewhat different from the traditional separation of these two tasks in the data centre. It is possible to build Hadoop systems and separate the roles (discrete storage and processing nodes). However, a majority of Hadoop systems use the general approach where servers enact both roles. Another interesting feature of dynamic MapReduce execution is the capability to tolerate dissimilar servers. That is, old and new hardware can be used together. Of course, large disparities in performance will limit the faster systems, but the dynamic nature of
:D <pl'ciperty;values> use value for 1£iven· propet1)' · · · · · · .Map Reduce execution will still work_e[ectively on such syste_nis:. - · ·
•fs <!ocal I nameno~c:p0l1> specifyil ·namenode-... _ ·'. _.· 'b. :Explain COlilpiling a'nd runi1i11g iii~ Had~op Gri:p ch~i~i~g cx~mple ;¥it11'p~~gri1in,;.:,
_,-j t <:locaq re.~mircemlinagi:r;poit> ~pee ify-n- J{c,iour~eManager -, .· :,-i ·. ·' _
-_ : _,:-: · ,· . . · ·. . i . ··: · .-~:· ,: '; (OS"Mitrlis)
·--files <co,nrrtil s~paraicd )i~t of files> spe°i:ify comma sepa~led fileiiW : Ans •. The Hadoop Gn:p.java example cxti,icts,matchingst,ri~g,froiri text files a_ridcoiin,tsh0\Y many
becopiedtothe ·mnprcduce'cluste~- -, ·,. - - - -. _- . ::" . • _- '· ·.' · times tl1ey occtirrrd. Tlie c9minand works differently from the •nix grep co'ni'n1aiid·in ihat if
-libjnrs <co,mma separated list:ofjal~> specify comm~ ~eparatedj~r: does not display th~ comple/e matchlng lilte, ·only the matching string._lf riiatching lines' are ·
-files to include in the classpatl1. _· . _ ·. _ · ·: · · · ~- _ . . ,.,._...._ __ _l),ecded for the string foo, us~ . •foo.' as a regular expression. ' '. · · ·. · · . _
· · · ,.. · ~--Urch i:e~:~~omqi~ s_ Cparrlted._Jist of-archi veS?, sp~cify c9rn~~~ sepaJ:at~e:~~-~-_. :-~~-, __ _ ·The program -runs _t_wo 1riap ; ·redii'ce_ 1obs· in ·sequence a_ns · is·_arr· example.of MapReduce --

chaining. The first job counts how many times a matching string occurs in the input, and the second job sorts matching strings by their frequency and stores the output in a single file. Listing 1.1 displays the source code for Grep.java.
Note that all the Hadoop example source files can be extracted by locating the hadoop-mapreduce-examples-*-sources.jar either from a Hadoop distribution or from the Apache Hadoop website (as part of a full Hadoop package) and then extracting the files using the following command (your version tag may be different):
$ jar xf hadoop-mapreduce-examples-2.6.0-sources.jar

Listing 1.1 Hadoop Grep.java Example

    package org.apache.hadoop.examples;

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
    import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /* Extracts matching regexs from input files and counts them. */
    public class Grep extends Configured implements Tool {
        private Grep() {}                                   // singleton

        public int run(String[] args) throws Exception {
            if (args.length < 3) {
                System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
                ToolRunner.printGenericCommandUsage(System.out);
                return 2;
            }

            // Temporary directory that carries the output of job 1 into job 2
            Path tempDir =
                new Path("grep-temp-" +
                    Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

            // Pass the command-line regex (and optional group) to the mappers
            Configuration conf = getConf();
            conf.set(RegexMapper.PATTERN, args[2]);
            if (args.length == 4)
                conf.set(RegexMapper.GROUP, args[3]);

            Job grepJob = new Job(conf);
            try {
                // Job 1: count the occurrences of each matching string
                grepJob.setJobName("grep-search");
                FileInputFormat.setInputPaths(grepJob, args[0]);
                grepJob.setMapperClass(RegexMapper.class);
                grepJob.setCombinerClass(LongSumReducer.class);
                grepJob.setReducerClass(LongSumReducer.class);
                FileOutputFormat.setOutputPath(grepJob, tempDir);
                grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
                grepJob.setOutputKeyClass(Text.class);
                grepJob.setOutputValueClass(LongWritable.class);
                grepJob.waitForCompletion(true);

                // Job 2: swap <string, count> to <count, string> and sort
                Job sortJob = new Job(conf);
                sortJob.setJobName("grep-sort");
                FileInputFormat.setInputPaths(sortJob, tempDir);
                sortJob.setInputFormatClass(SequenceFileInputFormat.class);
                sortJob.setMapperClass(InverseMapper.class);
                sortJob.setNumReduceTasks(1);               // write a single file
                FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
                sortJob.setSortComparatorClass(             // sort by decreasing frequency
                    LongWritable.DecreasingComparator.class);
                sortJob.waitForCompletion(true);
            }
            finally {
                FileSystem.get(conf).delete(tempDir, true);
            }
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new Grep(), args);
            System.exit(res);
        }
    }

In the preceding code, each mapper of the first job takes a line as input and matches the user-provided regular expression against the line. The RegexMapper class is used to perform this task and extracts the text matching the given regular expression. The matching strings are output as <matching string, 1> pairs. Each reducer sums up the number of matching strings, and a combiner is employed to do local sums. The actual reducer uses the LongSumReducer class, which outputs the sum of long values per reducer input key.
The second job takes the output of the first job as its input. The mapper is an inverse map that reverses (or swaps) its input <key, value> pairs into <value, key>. There is no reduction step, so the IdentityReducer class is used by default. All input is simply passed to the output. The number of reducers is set to 1, so the output is stored in a single file, sorted by count in descending order.
The example also demonstrates how to pass a command-line parameter to a mapper.

The following steps describe how to compile and run the Grep.java example.
1. Create a directory for the application classes as follows:
$ mkdir Grep_classes
2. Compile the Grep.java program using the following line:
$ javac -cp `hadoop classpath` -d Grep_classes Grep.java
3. Create a Java archive using the following command:
$ jar -cvf Grep.jar -C Grep_classes/ .
If needed, create a directory and move the war-and-peace.txt file into HDFS:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
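The chaining idea in Listing 1.1 can be tried without a cluster. The following is only a minimal sketch in plain Java, not Hadoop code: the class GrepChainSketch and its two methods are hypothetical names introduced here, with countMatches standing in for RegexMapper plus LongSumReducer and sortByCount standing in for InverseMapper plus the decreasing sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the two chained jobs; no Hadoop classes are used.
public class GrepChainSketch {

    // Stage 1: emit <match, 1> for every regex match and sum the ones per match.
    public static Map<String, Long> countMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1L, Long::sum);
            }
        }
        return counts;
    }

    // Stage 2: swap <match, count> into "count<TAB>match" rows, largest count first.
    public static List<String> sortByCount(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            rows.add(e.getValue() + "\t" + e.getKey());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Kutuzov rode past; Kutuzov nodded.",
            "Bagration saluted Kutuzov.");
        // Chaining: the output of stage 1 is the input of stage 2.
        for (String row : sortByCount(countMatches(lines, "Kutuzov|Bagration"))) {
            System.out.println(row);
        }
    }
}
```

As in the real example, only the matching strings are reported, not the complete matching lines.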
As always, make sure the output directory has been removed by issuing the following command:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
Entering the following command will run the Grep program:
$ hadoop jar Grep.jar org.apache.hadoop.examples.Grep war-and-peace-input war-and-peace-output Kutuzov
As the example runs, two stages will be evident. Each stage is easily recognizable in the program output. The results can be found by examining the resultant output file.
$ hdfs dfs -cat war-and-peace-output/part-r-00000
530 Kutuzov

c. Explain command line log viewing. (04 Marks)
Ans. MapReduce logs can also be viewed from the command line. The yarn logs command enables the logs to be easily viewed together without having to hunt for individual log files on the cluster nodes. As before, log aggregation is required for use. The options to yarn logs are as follows:
$ yarn logs
Retrieve logs for completed YARN applications.
usage: yarn logs -applicationId <application ID> [OPTIONS]
general options are:
    -appOwner <Application Owner>    AppOwner (assumed to be current user if not specified)
    -containerId <Container ID>      ContainerId (must be specified if node address is specified)
    -nodeAddress <Node Address>      NodeAddress in the format nodename:port (must be specified if container id is specified)
For example, after running the pi example program (discussed in Chapter 4), the logs can be examined as follows:
$ hadoop jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 100000
After the pi example completes, note the applicationId, which can be found either from the application output or by using the yarn application command. The applicationId will start with application_ and appear under the Application-Id column.
$ yarn application -list -appStates FINISHED
Next, run the following command to produce a dump of all the logs for that application. Note that the output can be long and is best saved to a file.
$ yarn logs -applicationId application_1432667013445_0001 > AppOut
The AppOut file can be inspected using a text editor. Note that for each container, stdout, stderr, and syslog are provided. The list of actual containers can be found by using the following command:
$ grep -B 1 === AppOut
For example (output truncated):
[...]
Container: container_1432667013445_0001_01_000008 on limulus_45454
====================================================================
--
Container: container_1432667013445_0001_01_000010 on limulus_45454
====================================================================
--
Container: container_1432667013445_0001_01_000001 on n0_45454
===============================================================
--
Container: container_1432667013445_0001_01_000023 on n1_45454
===============================================================
[...]
A specific container can be examined by using the containerId and the nodeAddress from the preceding output. For example, container_1432667013445_0001_01_000023 can be examined by entering the command following this paragraph. Note that the node name (n1) and port number are written as n1_45454 in the command output. To get the nodeAddress, simply replace the _ with a : (i.e., -nodeAddress n1:45454). Thus, the results for a single container can be found by entering this line:
$ yarn logs -applicationId application_1432667013445_0001 -containerId container_1432667013445_0001_01_000023 -nodeAddress n1:45454 | more

Module-2
Explain with diagrams DAG workflows. (10 Marks)
Oozie is a workflow director system designed to run and manage multiple related Apache Hadoop jobs. For instance, complete data input and analysis may require several discrete Hadoop jobs to be run as a workflow in which the output of one job serves as the input for a successive job. Oozie is not a substitute for the YARN scheduler. That is, YARN manages resources for individual Hadoop jobs, and Oozie provides a way to connect and control Hadoop jobs on the cluster.
Oozie workflow jobs are represented as directed acyclic graphs (DAGs) of actions. (DAGs are basically graphs that cannot have directed loops.) Three types of Oozie jobs are permitted:
• Workflow: a specified sequence of Hadoop jobs with outcome-based decision points and control dependency. Progress from one action to another cannot happen until the first action is complete.
• Coordinator: a scheduled workflow job that can run at various time intervals or when data become available.
• Bundle: a higher-level Oozie abstraction that will batch a set of coordinator jobs.
Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (e.g., Java MapReduce, Streaming MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (e.g., Java programs and shell scripts). Oozie also provides a CLI and a web UI for monitoring jobs.
Figure 3.1 depicts a simple Oozie workflow. In this case, Oozie runs a basic MapReduce operation. If the application was successful, the job ends; if an error occurred, the job is killed.
Oozie workflow definitions are written in hPDL (an XML Process Definition Language). Such workflows contain several types of nodes:
• Control flow nodes define the beginning and the end of a workflow. They include start, end, and optional fail nodes.
• Action nodes are where the actual processing tasks are defined. When an action node finishes, the remote systems notify Oozie and the next node in the workflow is executed. Action nodes can also include HDFS commands.
• Fork/join nodes enable parallel execution of tasks in the workflow. The fork node enables two or more tasks to run at the same time. A join node represents a rendezvous point that must wait until all forked tasks complete.
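The ordering rule behind such a workflow (an action cannot start until every action it depends on has completed, and a join waits for all forked branches) can be illustrated with a small sketch. This is not Oozie code or hPDL: the class WorkflowDagSketch and the node names pig-job and hive-job are hypothetical, and the ordering is computed with a standard topological sort (Kahn's algorithm), which is an assumption about how to model the behavior rather than a description of Oozie's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a workflow DAG; not part of the Oozie API.
public class WorkflowDagSketch {
    // node name -> nodes that may start once this node completes
    private final Map<String, List<String>> next = new LinkedHashMap<>();
    // node name -> number of direct predecessors
    private final Map<String, Integer> pending = new LinkedHashMap<>();

    public void addNode(String name) {
        next.putIfAbsent(name, new ArrayList<>());
        pending.putIfAbsent(name, 0);
    }

    public void addTransition(String from, String to) {
        addNode(from);
        addNode(to);
        next.get(from).add(to);
        pending.merge(to, 1, Integer::sum);
    }

    // Returns one valid execution order; throws if the graph has a directed loop.
    public List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(pending);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.poll();
            order.add(node);
            for (String successor : next.get(node)) {
                // A successor becomes ready only when all its predecessors are done.
                if (remaining.merge(successor, -1, Integer::sum) == 0) {
                    ready.add(successor);
                }
            }
        }
        if (order.size() != next.size()) {
            throw new IllegalStateException("workflow contains a directed loop");
        }
        return order;
    }

    public static void main(String[] args) {
        WorkflowDagSketch wf = new WorkflowDagSketch();
        wf.addTransition("start", "fork");
        wf.addTransition("fork", "pig-job");
        wf.addTransition("fork", "hive-job");
        wf.addTransition("pig-job", "join");
        wf.addTransition("hive-job", "join");
        wf.addTransition("join", "end");
        System.out.println(wf.executionOrder());
    }
}
```

The sketch lists the two forked actions one after the other only because it is sequential; in a real workflow they could run at the same time, and the join would still come after both.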
Control flow nodes also enable decisions to be made about the previous task. Control decisions are based on the results of the previous action (e.g., file size or file existence). Decision nodes are essentially switch-case statements that use JSP EL (Java Server Pages Expression Language) expressions that evaluate to either true or false. Figure 3.2 depicts a more complex workflow that uses all of these node types.

[Figure 3.2: a more complex Oozie DAG workflow]

(i) Apache Tez
(ii) Apache Giraph
(iii) Hamster: Hadoop and MPI on the same cluster
(iv) Apache Spark

Ans. (i) Apache Tez
One great example of a new YARN framework is Apache Tez. Many Hadoop jobs involve the execution of a complex directed acyclic graph (DAG) of tasks using separate MapReduce stages. Apache Tez generalizes this process and enables these tasks to be spread across stages so that they can be run as a single, all-encompassing job. Tez can be used as a MapReduce replacement for projects such as Apache Hive and Apache Pig. No changes are needed to the Hive or Pig applications.

(ii) Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability. Facebook, Twitter, and LinkedIn use it to create social graphs of users. Giraph was originally written to run on standard Hadoop V1 using the MapReduce framework, but that approach proved inefficient and totally unnatural for various reasons. The native Giraph implementation under YARN provides the user with an iterative processing model that is not directly available with MapReduce. Support for YARN has been present in Giraph since its own version 1.0 release. In addition, using the flexibility of YARN, the Giraph developers plan on implementing their own web interface to monitor job progress.

(iii) Hamster: Hadoop and MPI on the Same Cluster
The Message Passing Interface (MPI) is widely used in high-performance computing (HPC). MPI is primarily a set of optimized message-passing library calls for C, C++, and Fortran that operate over popular server interconnects such as Ethernet and InfiniBand. Because users have full control over their YARN containers, there is no reason why MPI applications cannot run within a Hadoop cluster. The Hamster effort is a work in progress that provides a good discussion of the issues involved in mapping MPI to a YARN cluster. Currently, an alpha version of MPICH2 is available for YARN that can be used to run MPI applications.

(iv) Apache Spark
Spark was initially developed for applications in which keeping data in memory improves performance, such as iterative algorithms, which are common in machine learning, and interactive data mining. Spark differs from classic MapReduce in two important ways. First, Spark holds intermediate results in memory, rather than writing them to disk. Second, Spark supports more than just MapReduce functions; that is, it greatly expands the set of possible analyses that can be executed over HDFS data stores. It also provides APIs in Scala, Java, and Python.
Spark has been running on production YARN clusters. The advantage of porting and running Spark on top of YARN is the common resource management and a single underlying file system.

How to change Hadoop properties?
One of the challenges of managing a Hadoop cluster is managing changes to cluster-wide configuration properties. In addition to modifying a large number of properties, making changes to a property often requires restarting daemons (and dependent daemons) across the entire cluster. This process is tedious and time consuming. Fortunately, Ambari provides an easy way to manage this process. As described previously, each service provides a Configs tab that opens a form displaying all the possible service properties. Any service property can be changed (or added) using this interface. As an example, the configuration properties for the YARN scheduler are shown in Figure 4.1.
Once the user adds any notes and clicks the Save button, another window, shown in Figure 4.4, is presented. This window confirms that the properties have been saved. Once the new property is changed, an orange Restart button will appear at the top left of the window. The new property will not take effect until the required services are restarted. As shown in Figure 4.5, the Restart button provides two options: Restart All and Restart NodeManagers. To be safe, Restart All should be used. Note that Restart All does not mean all the Hadoop services will be restarted; rather, only those that use the new property will be restarted.
After the user clicks Restart All, a confirmation window, shown in Figure 4.6, will be displayed. Click Confirm Restart All to begin the cluster-wide restart.
[Figure: Save Configuration Changes window]

Figure 4.5 Ambari Restart function

[Figure: Confirmation window]
Figure 4.2 YARN properties with log aggregation turned off

Changes do not become permanent until the user clicks the Save button. A save/notes window will then be displayed. It is highly recommended that historical notes concerning the change be added to this window.

Figure 4.3 Ambari configuration save/notes window

Figure 4.6 Ambari confirmation box for service restart

Similar to the DataNode restart example, a progress window will be displayed. Again, the progress bar is for the entire YARN restart. Details from the logs can be found by clicking the arrow to the right of the bar (see Figure 4.7).

Figure 4.7 Ambari progress window for cluster-wide YARN restart

Once the restart is complete, run a simple example and attempt to view the logs using the YARN ResourceManager Applications UI. (You can access the UI from the Quick Links pull-down menu in the middle of the YARN services window.) A message similar to that in Figure 4.8 will be displayed.

Figure 4.8 YARN ResourceManager interface with log aggregation turned off

Ambari tracks all changes made to system properties. As can be seen in Figure 4.1 and in more detail in Figure 4.9, each time a configuration is changed, a new version is created. Reverting back to a previous version results in a new version. You can reduce the potential for version confusion by providing meaningful comments for each change (e.g., Figure 4.3 and Figure 4.11). In the preceding example, we created version 12 (V12). The current version is indicated by a green Current label in the horizontal version boxes or in the dark horizontal bar. Scrolling through the version boxes or pulling down the menu on the left-hand side of the dark horizontal bar will display the previous configuration versions.

Figure 4.9 Ambari configuration change management for YARN service (Version V12 is current)

To revert to a previous version, simply select the version from the version boxes or the pull-down menu. In Figure 4.10, the user has selected the previous version by clicking the Make Current button in the information box. This configuration will return to the previous state, where log aggregation is enabled.

Figure 4.10 Reverting to previous YARN configuration (V11) with Ambari

As shown in Figure 4.11, a confirmation/notes window opens before the new configuration is saved. Again, it is suggested that you provide notes about the change in the Notes text box.


[Figure: Make Current Confirmation window]

Figure 4.11 Ambari confirmation window for a new configuration

There are several important points to remember about the Ambari versioning tool:
• Every time you change the configuration, a new version is created. Reverting to a previous version creates a new version.
• You can view or compare a version to other versions without having to change or restart a service. (See the buttons in the V11 box in Figure 4.10.)
• Each service has its own version record.
• Every time you change the properties, you must restart the service by using the Restart button. When in doubt, restart all services.

Ambari, go to the HDFS service window and select the Configs tab. Toward the bottom of the screen, select the Add Property link in the Custom core-site.xml section. Add the following two properties (the item used for the key field in Ambari is the name field included in this code):
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
The name of the user who will start the Hadoop NFSv3 gateway is placed in the name field. In the previous example, root is used for this purpose. This setting can be any user who starts the gateway. If, for instance, user nfsadmin starts the gateway, then the two names would be hadoop.proxyuser.nfsadmin.groups and hadoop.proxyuser.nfsadmin.hosts. The * value, entered in the preceding lines, opens the gateway to all groups and allows it to run on any host. Access is restricted by entering groups (comma separated) in the groups property. Entering a host name for the hosts property can restrict the host running the gateway.
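Because the two property names depend only on the name of the user who starts the gateway, they can be generated mechanically. The sketch below is a hypothetical helper (ProxyUserPropsSketch is not part of Hadoop or Ambari) that prints the two entries for a given user; it assumes the values need no XML escaping.

```java
// Hypothetical helper that prints the two proxy-user entries for core-site.xml.
public class ProxyUserPropsSketch {

    // Builds one <property> element in the same layout as the entries above.
    public static String property(String name, String value) {
        return "<property>\n"
             + "  <name>" + name + "</name>\n"
             + "  <value>" + value + "</value>\n"
             + "</property>\n";
    }

    // Returns the groups and hosts entries for the user who starts the gateway.
    public static String proxyUserProperties(String user, String groups, String hosts) {
        return property("hadoop.proxyuser." + user + ".groups", groups)
             + property("hadoop.proxyuser." + user + ".hosts", hosts);
    }

    public static void main(String[] args) {
        // nfsadmin is the example user from the text; "*" leaves groups and hosts unrestricted.
        System.out.print(proxyUserProperties("nfsadmin", "*", "*"));
    }
}
```

Passing a comma-separated group list or a single host name instead of * gives the restricted form described above.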
b. · Define tl;c cap~billti~-and,configu~ntion step~ of'ari NFS VJ Ga:teway to HDFS _-.. · .·. . Entering a host name forthe host's property caii restrael the host'running°:ilie' gatew#', · . ·
. · ·· .·. -·_- . · · ·. . -. · . . · -·. · · · · (04Mnrks)
Next, move to the Advan·ced 'hdis-site.xml :section and setihe following 'property: ·,;r9perty>,
Aris, C~nfiguring a~ NFSvJ Gateway to HDFS · ·· · . \.:.'. · · . .. · · . . · ':: <name;,dfs.~fs3.di1mp.dir</name> .. .. ·, . ' .. .
HDFS suppo11s an NFS version 3 (NFSv3) gateway. This _featur~ enables files to be e~silr,:;-. <value>/tmp/.hdfs-nfs</value> -· ·. . ,.
moved between I·fDFS and client systems. The NFS gateway. sup11oris NFSv3 and allo1~~( <ipropci1y> . . .. ,, ·: . ' • _ . - . . · . · : . . •. . ~- ., · , : · : . .
HDFS to be mo.uritcd as pti1t of the client's local file system. Currently the N_FSv3 gatew .. · Thc'°NFSv3'dump directoty is · needed be~.iuse \he NFS ·cli~nt often
recorde~ writes . .
· suppci11s the followi_ng capabi_litics::· . .· , - _' - · ; :
- • Users can browse the HDFS. file systeip duoug_h their local ·_Ii le system,_usmg a~. NFSy
. . · .·. . . · Se~1ientiaLwrites caniafriy~ at _U1e N~S _giit~w~y in r.ind~1:r1 cird~.r.-This dir~jo({is l_Z to __
. tempo1'arily S-ave out'of-or~er_ wi'iies b_cfore 1vriting IQ H~FS:,Mak~ ~ure th~ ~UillP,:directory : •
... client-compatible operating system.
• Users can download files from the HDFS file system to their local file system.
• Users can upload files from their local file system directly to the HDFS file system.
• Users can stream data directly to HDFS through the mount point. File append is supported, but random write is not supported.
The gateway must be run on the same host as a DataNode, NameNode, or any HDFS client. More information about the NFSv3 gateway can be found at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html.
In the following example, a simple four-node cluster is used to demonstrate the steps for enabling the NFSv3 gateway. Other potential options, including those related to security, are not addressed in this example. A DataNode is used as the gateway node in this example, and HDFS is mounted on the main (login) cluster node.

Step 1: Set Configuration Files
Several Hadoop configuration files need to be changed. In this example, the Ambari GUI will be used to alter the HDFS configuration files. Do not save the changes or restart HDFS until all the following changes are made. If you are not using Ambari, you must change these files by hand and then restart the appropriate services across the cluster. The following environment is assumed:
• Platform: RHEL 6.6
• Hortonworks HDP 2.2 with Hadoop version 2.6
Several properties need to be added to the /etc/hadoop/config/core-site.xml file. ...
... has enough space. For example, if the application uploads 10 files, each of size 100 MB, it is recommended that this directory have 1 GB of space to cover a worst-case write reorder for every file.
Once all the changes have been made, click the green Save button and note the changes you made in the Notes box of the Save confirmation dialog. Then restart all of HDFS.

Step 2: Start the Gateway
Log into a DataNode and make sure all NFS services are stopped. In this example, the gateway node is n0.
Next, start the HDFS gateway by using the hadoop-daemon script to start portmap and nfs3 as follows:
# /usr/hdp/2.2.4.2-2/hadoop/sbin/hadoop-daemon.sh start portmap
# /usr/hdp/2.2.4.2-2/hadoop/sbin/hadoop-daemon.sh start nfs3
The daemons write their logs under /var/log/hadoop/root/; the nfs3 log, for example, is /var/log/hadoop/root/hadoop-root-nfs3-n0.log.
To confirm the gateway is working, issue the following command. The output should look like the following:
# rpcinfo -p n0
   program vers proto   port  service
    100000    2   udp    111  portmapper
    100000    2   tcp    111  portmapper
    100005    1   tcp   4242  mountd
    100003    3   tcp   2049  nfs
    100005    1   udp   4242  mountd
    100005    3   udp   4242  mountd
    100005    3   tcp   4242  mountd
    100005    2   udp   4242  mountd
Finally, make sure the mount is available by issuing the following command:
# showmount -e n0
Export list for n0:
/ *
If the rpcinfo or showmount command does not work correctly, check the previously mentioned log files for problems.
The final step is to mount HDFS on a client node. In this example, the main login node is used. To mount the HDFS files, exit from the gateway node and create the following directory:
# mkdir /mnt/hdfs
The mount command is as follows. Note that the name of the gateway node will be different on other clusters, and an IP address can be used instead of the node name.
# mount -t nfs -o vers=3,proto=tcp,nolock n0:/ /mnt/hdfs/
Once the file system is mounted, the files will be visible to the client users. The following command lists the mounted file system:
# ls /mnt/hdfs
app-logs apps benchmarks hdp mapred mr-history system tmp user var
The gateway in the current Hadoop release uses AUTH_UNIX-style authentication and requires that the login user name on the client match the user name that NFS passes to HDFS. For example, if the NFS client is user admin, the NFS gateway will access HDFS as user admin and existing HDFS permissions will prevail.
The system administrator must ensure that the user on the NFS client machine has the same user name and user ID as that on the NFS gateway machine. This is usually not a problem if you use the same user management system, such as LDAP/NIS, to create and deploy users.

Ans. According to the list of best business intelligence tools prepared by experts from FinancesOnline, the leading solutions in this category comprise systems designed to capture, categorize, and analyze corporate data and extract best practices for improved decision making. The more advanced the system is, the more data sources it will combine, including internal metrics coming from different company departments and external data extracted from third-party systems, social media channels, emails, or even macroeconomic data. Ultimately, business intelligence software helps companies gain insight into their overall growth, sales trends, and customer behavior.
1. Sisense
2. Actuate Business Intelligence and Reporting Tools (BIRT)
3. icCube
4. Domo
5. Board Management Intelligence Toolkit
6. Clear Analytics
7. Ducen
8. GoodData
9. IBM Cognos Intelligence
10. InsightSquared
11. JasperSoft
12. Looker
13. Microsoft BI platform
14. MicroStrategy
15. MITS
16. OpenI
17. Oracle BI
18. Oracle Enterprise BI Server
19. Oracle Hyperion System
20. Palo OLAP Server
21. Pentaho
22. Profitbase
23. QlikView
24. Rapid Insight
25. SAP Business Intelligence
26. SAP BusinessObjects
27. SAP NetWeaver BW
28. SAS BI
29. Silvon
30. Solver
33. Style Intelligence
35. Targit
36. Vismatica
38. Yellowfin BI

The BI tool used in our organization: Education
As higher education becomes more expensive and competitive, it is a great user of data-based decision-making. There is a strong need for efficiency, increasing revenue, and improving the quality of student experience at all levels of education.
1. Student enrolment (recruitment and retention): Marketing to new potential students requires schools to develop profiles of the students that are most likely to attend. Schools can develop models of what kinds of students are attracted to the school, and then reach out to those students.
... to pledge financial support to the school. Schools can create a profile for alumni more likely to pledge donations to the school. This could lead to a reduction in the cost of mailings and ...

... that form the backbone of an organization's IT systems. The data to be extracted will depend upon the subject matter of the DW. For example, for a sales/marketing DW, only the data about customers, orders, customer service, and so on would be extracted.
2. Other applications, such as point-of-sale (POS) terminals and e-commerce applications, provide customer-facing data. Supplier data could come from supply chain management systems. Planning and budget data should also be added as needed for making comparisons against targets.
3. External syndicated data, such as weather or economic activity data, could also be added to the DW, as needed, to provide good contextual information to decision makers.

Three main types of Data Warehouses are:
1. Enterprise Data Warehouse:
Enterprise Data Warehouse is a centralized warehouse. It provides decision support service across the enterprise. It offers a unified approach for organizing and representing data. It also provides the ability to classify data according to the subject and give access according to those divisions.
2. Operational Data Store:
Operational Data Store, also called ODS, is a data store required when neither the data warehouse nor the OLTP systems support an organization's reporting needs. In ODS, the data warehouse is refreshed in real time. Hence, it is widely preferred for routine activities like storing records of the employees.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from sources.

A DW project reflects a significant investment into IT. All of the best practices in implementing any IT project should be followed.
1. The DW project should align with the corporate strategy. Top management should be consulted for setting objectives. Financial viability (return on investment, ROI) should be established. The project must be managed by both IT and business professionals. The DW design should be carefully tested before beginning development work. It is often much more expensive to redesign after development work has begun.
2. It is important to manage user expectations. The DW should be built incrementally. Users should be trained in using the system, and absorb the many features of the system.
3. Quality and adaptability should be built in from the start. Only cleansed and high-quality data should be loaded. The system should be able to adapt to new access tools. As business needs change, new data marts can be created for new needs.

6. a. Why is data preparation so important and time consuming? (04 Marks)
Ans. Data cleansing and preparation is a labor-intensive or semiautomated activity that can take up to 60 to 70 percent of the time needed for a data mining project.
1. Duplicate data needs to be removed. The same data may be received from multiple sources. When merging the data sets, data must be de-duped.
2. Missing values need to be filled in, or those rows should be removed from analysis. Missing values can be filled in with average or modal or default values.
3. Data elements may need to be transformed from one unit to another. For example, total costs of health care and the total number of patients may need to be reduced to cost per patient.
4. Continuous values may need to be binned into a few buckets to help with some analyses. For example, work experience could be binned as low, medium, and high.
5. Data elements may need to be adjusted to make them comparable over time. For example, currency values may need to be adjusted for inflation; they would need to be converted to the same base year for comparability. They may need to be converted to a common currency.
6. Outlier data elements need to be removed after careful review, to avoid the skewing of results. For example, one big donor could skew the analysis of alumni donors in an educational setting.
7. Any biases in the selection of data should be corrected to ensure the data is representative of the phenomena under analysis. If the data includes many more members of one gender than is typical of the population of interest, then adjustments need to be applied to the data.
8. Data should be brought to the same granularity to ensure comparability. Sales data may be available daily, but the salesperson compensation data may only be available monthly. To relate these variables, the data must be brought to the lowest common denominator, in this case, monthly.
9. Data may need to be selected to increase information density. Some data may not show much variability, because it was not properly recorded or for any other reasons. This data may dull the effects of other differences in the data and should be removed to improve the information density of the data.

b. Describe some key steps in data visualization. (08 Marks)
Data has been described as the new raw material for business and "the oil of the 21st century". The volume of data used in business, research and technological development is massive, and continues to grow. For instance, at Elsevier there are about 700 million articles per year downloaded from ScienceDirect, 80,000 institution profiles on Scopus, 13 million researcher profiles on Scopus and 3 million researcher profiles on Mendeley. It becomes harder and harder for a user to grab a key message from this universe of data.
That's where data visualization comes in: summarizing and presenting large data in simple and easy-to-understand visualizations to give readers insightful information.
There are many advanced visualizations (e.g., networks, 3D models and map overlays) used for specialized purposes such as 3D medical imaging and urban transportation simulation. The aim is that readers need not read tedious descriptions such as: "A's profit was more than B by 2.9% in 2000, and despite a profit growth of 25% in 2001, A's profit became less than B by 3.5% in 2001." A good visualization summarizes information and organizes it in a way that enables the reader to focus on the points that are relevant to the key message being conveyed.
An analysis is clearly explained with tables, graphs, charts and diagrams, keeping in mind that data visualization is an iterative process.
To demonstrate how each of the visualization tools could be used, imagine an executive for a company who wants to analyze the sales performance of his division. Table 6.1 shows the important raw sales data for the current year, alphabetically sorted by Product names.

Product  Revenue  Orders  SalesPers
AA       9731     131     23
BB       355      43      8
CC       991      32      6
DD       125      31      4
EE       933      30      7
FF       676      35      6
GG       1411     128     13
HH       5116     132     38
JJ       215      7       2
KK       3833     122     50
LL       1348     15      7
MM       1201     28      13
Table 6.1: Raw Performance Data
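The sort-and-ratio step applied to Table 6.1 can be sketched in a few lines of Python. This is only an illustration, assuming the Table 6.1 figures are held as plain tuples; the names `RAW_SALES` and `add_ratios` are hypothetical and do not come from the source.

```python
# Rows from Table 6.1: (product, revenue, orders, salespersons).
RAW_SALES = [
    ("AA", 9731, 131, 23), ("BB", 355, 43, 8), ("CC", 991, 32, 6),
    ("DD", 125, 31, 4), ("EE", 933, 30, 7), ("FF", 676, 35, 6),
    ("GG", 1411, 128, 13), ("HH", 5116, 132, 38), ("JJ", 215, 7, 2),
    ("KK", 3833, 122, 50), ("LL", 1348, 15, 7), ("MM", 1201, 28, 13),
]


def add_ratios(rows):
    """Sort by revenue (highest first) and attach three ratios per product."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    result = []
    for product, revenue, orders, sales_pers in ranked:
        result.append({
            "product": product,
            "revenue": revenue,
            "orders": orders,
            "sales_pers": sales_pers,
            "rev_per_order": round(revenue / orders, 1),
            "rev_per_sales_p": round(revenue / sales_pers, 1),
            "orders_per_sales_p": round(orders / sales_pers, 1),
        })
    return result


sorted_sales = add_ratios(RAW_SALES)
for row in sorted_sales:
    print(row)
```

Keeping the ratios next to the raw columns makes it easy to pick one measure at a time for a chart, which matters here because the columns sit on very different scales.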
To reveal some meaningful pattern, a good first step would be to sort the table by Product revenue, with highest revenue first. We could total up the values of Revenue, Orders, and SalesPers for all products. We can also add some important ratios to the right of the table (Table 6.2).

Product  Revenue  Orders  SalesPers  Rev/Order  Rev/SalesP  Orders/SalesP
AA       9731     131     23         74.3       423.1       5.7
HH       5116     132     38         38.8       134.6       3.5
KK       3833     122     50         31.4       76.7        2.4
GG       1411     128     13         11.0       108.5       9.8
LL       1348     15      7          89.9       192.6       2.1
MM       1201     28      13         42.9       92.4        2.2
CC       991      32      6          31.0       165.2       5.3
EE       933      30      7          31.1       133.3       4.3
FF       676      35      6          19.3       112.7       5.8
BB       355      43      8          8.3        44.4        5.4
JJ       215      7       2          30.7       107.5       3.5
DD       125      31      4          4.0        31.3        7.8
Total    25936    734     177        35.3       146.5       4.1
Table 6.2: Sorted data, with additional ratios

There are too many numbers on this table to visualize any trends in them. The numbers are in different scales, so plotting them on the same chart would not be easy. E.g., the Revenue numbers are in thousands while the SalesPers numbers and Orders/SalesP are in single or double digits.
One could start by visualizing the revenue as a pie chart. The revenue proportion drops significantly from the first product to the next (Figure 6.1). It is interesting to note that the top 3 products produce almost 75% of the revenue.
Figure 6.1: Revenue Share by Product
The number of orders for each product can be plotted as a bar graph. This shows that while the revenue is widely different for the top four products, they have approximately the same number of orders.
Figure 6.2: Orders by Products
Therefore, the orders data could be investigated further to see order patterns. Suppose additional data is made available for Orders by their size. Suppose the orders are chunked into 4 sizes: Tiny, Small, Medium, and Large. Additional data is shown in Table 6.3.

Product  Total Orders  Tiny  Small  Medium  Large
AA       131           5     44     70      12
...
DD       31            21    10     0       0
Total    734           189   329    185     31
Table 6.3: Additional data on order sizes

Figure 6.3 is a stacked bar graph that shows the percentage of Orders by size for each product. This chart (Figure 6.3) brings a different set of insights. It shows that the product HH has
a larger proportion of tiny orders. The products at the far right have a large number of tiny orders and very few large orders.
Figure 6.3: Product Orders by Order Size (stacked bars: Tiny, Small, Medium, Large)

Visualization Example Phase 2
The executive wants to understand the productivity of salespersons. This analysis could be done in terms of either the number of orders or the revenue per salesperson. There could be two separate graphs, one for the number of orders per salesperson and the other for the revenue per salesperson. However, an interesting way is to plot both measures on the same graph to give a more complete picture. This can be done even when the two measures have different scales. The data is here re-sorted by number of orders per salesperson.
Figure 6.4: Salesperson productivity by product (orders and revenue per salesperson)
Figure 6.4 shows two line graphs superimposed upon each other. One line shows the revenue per salesperson, while the other shows the number of orders per salesperson. The orders line runs from the highest productivity of 5.3 orders per salesperson down to 2.1 orders per salesperson. The second line, the blue line, shows the revenue per salesperson for each of the products. The revenue per salesperson is highest at 630, while it is lowest at just 30. Additional layers of data visualization can go on in this way for this data set.

c. What are some key requirements for good visualization? (04 Marks)
Ans. To help the client in understanding the situation, the following are some key requirements for good visualization:
1. Fetch appropriate and correct data for analysis. This requires some understanding of the domain of the client and what is important for the client. E.g., in a business setting, one may need to understand the many measures of profitability and productivity.
2. Sort the data in the most appropriate manner. It could be sorted by numerical variables.
3. Choose an appropriate method to present the data. The data could be presented as a table, or it could be presented as any of the graph types.
4. The data set could be pruned to include only the more significant elements. More data is not necessarily better, unless it makes the most significant impact on the situation.
5. The visualization could show an additional dimension for reference, such as the expectations or targets with which to compare the results.
6. The numerical data may need to be binned into a few categories. E.g., the orders per person were plotted as actual values, while the order sizes were binned into 4 categorical choices.
7. High-level visualization could be backed by more detailed analysis. For the most significant results, a drill-down may be required.
8. There may be a need to present additional textual information to tell the whole story. For example, one may require notes to explain some extraordinary results.

Module-4
7. a. What is pruning? What are pre-pruning and post-pruning? Why choose one over the other? (08 Marks)
Pruning: The tree could be trimmed to make it more balanced and more easily usable. Pruning is often done after the tree is constructed, to balance out the tree and improve usability. The symptoms of an overfitted tree are a tree that is too deep, with too many branches, some of which may reflect anomalies due to noise or outliers. Thus, the tree should be pruned.
Pre-pruning: Halt the tree construction early. It is difficult to decide when to stop, because we do not know what may happen subsequently if we keep growing the tree.
Post-pruning: Remove branches or sub-trees from a "fully grown" tree. This method is commonly used. The C4.5 algorithm uses a statistical method to estimate the errors at each node for pruning. A validation set may be used for pruning as well.
The most popular decision tree algorithms are C5, CART, and CHAID (Table 7.1).

Table 7.1: Comparing popular Decision Tree algorithms
Decision-Tree | C4.5 | CART | CHAID
Full name | ... | Classification and Regression Trees | Chi-square Automatic Interaction Detector
Basic algorithm | ... | ... | Adjusted significance testing
Developer | Ross Quinlan | Breiman | Gordon Kass
When developed | 1986 | 1984 | 1980
Types of trees | Classification | Classification and regression trees | Classification and regression
Serial implementation | Tree growth and tree pruning | Tree growth and tree pruning | Tree growth and tree pruning
Type of data | Discrete and continuous; incomplete data | Discrete and continuous | Non-normal data also accepted
Types of splits | Multiway splits | Binary splits only; clever surrogate splits to reduce tree depth | Multiway splits as default
Splitting criteria | Information gain | Gini coefficient, and others | Chi-square test
Pruning criteria | Clever bottom-up technique avoids overfitting | Remove weakest links first | Trees can become very large
Implementation | Publicly available | Publicly available in most packages | Popular in market research, for segmentation
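Two of the splitting criteria named in Table 7.1, information gain and the Gini measure, can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the class counts are made up for illustration, and the function names `entropy`, `gini` and `information_gain` are hypothetical, not taken from any particular decision tree package.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a node, given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node: 5 "yes" and 5 "no" records, split two different ways.
parent = [5, 5]
perfect_split = [[5, 0], [0, 5]]   # each child is pure
weak_split = [[3, 2], [2, 3]]      # children almost as mixed as the parent

print(information_gain(parent, perfect_split))
print(information_gain(parent, weak_split))
print(gini(parent), gini([5, 0]))
```

A tree-growing algorithm would evaluate every candidate split this way and keep the one with the highest gain (or the lowest weighted impurity). Pruning then removes splits whose gain does not hold up on validation data.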
b. Using the data that follows, create a regression model to predict a house price from the size of the house. Here are sample house data. (08 Marks)

House Price  Size (sqft)
$229,500     1,850
$273,300     2,190
$247,000     2,100
$195,100     1,930
$261,000     2,300
$179,700     1,710
$168,500     1,550
$234,400     1,920
$168,500     1,840
$180,400     1,720
$156,200     1,660
$288,350     2,405
$156,750     1,525
$202,100     2,030
$256,800     2,240

The two dimensions of (one predictor, one outcome variable) data can be plotted on a scatter diagram. A scatter plot with a best-fitting line looks like the graph that follows (Figure 7.1).
Figure 7.1: Scatter plot and regression equation between house price and house size (x-axis: Size (sqft); series: House Price, Linear (House Price))
Visually, one can see a positive correlation between house price and size (sqft). However, the relationship is not perfect. Running a regression model between the two variables produces the following output (truncated).

Regression Statistics
Multiple r    0.891
r²            0.794
Coefficients
Intercept     -54,191
Size (sqft)   139.48

It shows that the coefficient of correlation is 0.891. r², the measure of total variance explained by the equation, is 0.794, or 79 percent. That means the two variables are strongly and positively correlated.
Regression coefficients help create the following equation for predicting house prices.
House Price ($) = 139.48 × Size (sqft) − 54,191
This equation explains only 79 percent of the variance in house prices.
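To make the use of the equation concrete, the following Python sketch applies the quoted coefficients to a few house sizes and compares the predictions with the listed prices. It is only an illustration: the coefficients are copied from the regression output above, while the names `predict_price` and `SAMPLE_HOUSES` are hypothetical.

```python
SLOPE = 139.48         # dollars per additional square foot (from the output above)
INTERCEPT = -54191.0   # regression intercept in dollars


def predict_price(size_sqft):
    """Predicted house price for a given size, using the fitted line."""
    return SLOPE * size_sqft + INTERCEPT


# First three rows of the sample data: (actual price, size in sqft).
SAMPLE_HOUSES = [(229500, 1850), (273300, 2190), (247000, 2100)]

for actual, size in SAMPLE_HOUSES:
    predicted = predict_price(size)
    residual = actual - predicted   # positive: the house sold above the line
    print(size, round(predicted), round(residual))
```

The residuals show how far individual houses sit from the fitted line, which is the unexplained part of the variance that the 79 percent figure refers to.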
Suppose other predictor variables are made available, such as the number of rooms in the house. It might help improve the regression model. The house data now looks like this:

House Price  Size (sqft)  # Rooms
$229,500     1,850        4
$273,300     2,190        5
$247,000     2,100        4
$195,100     1,930        3
$261,000     2,300        4
$179,700     1,710        2
$168,500     1,550        2
$234,400     1,920        4
$168,500     1,840        2
$180,400     1,720        2
$156,200     1,660        2
$288,350     2,405        5
$156,750     1,525        3
$202,100     2,030        2
$256,800     2,240        4

While it is possible to make a three-dimensional scatter plot, one can alternatively examine the correlation matrix among the variables.

              House Price  Size (sqft)  # Rooms
House Price   1
Size (sqft)   0.891        1
# Rooms       0.944        0.748        1

It shows that the house price has a strong correlation with number of rooms (0.944) as well. Thus, it is likely that adding this variable to the regression model will add to the strength of the model. Running a regression model between these three variables produces the following output.

Coefficients
Intercept     12,923
Size (sqft)   65.60
...

The predicted values should be compared to the actual values to see how close the model is able to predict the actual value. As new data points become available, there are opportunities to fine-tune and improve the model.

OR
8. a. What makes a neural network versatile enough for supervised as well as non-supervised learning tasks? (08 Marks)
Ans. Supervised Learning
• Training data includes both the input and the desired results.
• For some examples the correct results (targets) are known and are given as input to the model during the learning process.
• The construction of a proper training, validation and test set (Bok) is crucial.
• These methods are usually fast and accurate.
• They have to be able to generalize: give the correct results when new data are given as input without knowing the target a priori.
Supervised learning is based on training on a data sample from a data source with the correct classification already assigned. Such techniques are utilized in feedforward or MultiLayer Perceptron (MLP) models. An MLP has three distinctive characteristics:
1. One or more layers of hidden neurons that are not part of the input or output layers of the network and that enable the network to learn and solve complex problems.
2. The nonlinearity reflected in the neuronal activity is differentiable, and
3. The interconnection model of the network exhibits a high degree of connectivity.
These characteristics, along with learning through training, solve difficult and diverse problems. Learning through training in a supervised ANN model is also called the error backpropagation algorithm. The error-correction learning algorithm trains the network based on the input-output samples and finds the error signal, which is the difference between the calculated output and the desired output. It then adjusts the synaptic weights of the neurons in proportion to the product of the error signal and the input instance of the synaptic weight. Based on this principle, error backpropagation learning occurs in two passes:
Forward Pass: Here, the input vector is presented to the network. This input signal propagates forward, neuron by neuron, through the network and emerges at the output end of the network as the output signal $y(n) = \varphi(v(n))$, where $v(n)$ is the induced local field of a neuron defined by $v(n) = \sum w(n)\,y(n)$. The output $o(n)$ calculated at the output layer is compared with the desired response $d(n)$ to find the error $e(n)$ for that neuron. The synaptic weights of the network remain the same during this pass.
Backward Pass: The error signal that originates at the output neuron of that layer is propagated backward through the network. This calculates the local gradient for each neuron in each layer and allows the synaptic weights of the network to undergo changes in accordance with the delta rule:
$\Delta w(n) = \eta \, \delta(n) \, y(n)$
This recursive computation is continued, with the forward pass followed by the backward pass for each input pattern, till the network has converged.
The supervised learning paradigm of an ANN is efficient and finds solutions to several linear and nonlinear problems such as classification, plant control, forecasting, prediction, robotics, etc.
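The forward pass and the delta-rule update described above can be sketched for the simplest possible case, a single sigmoid neuron. This is a simplified illustration and not a full multilayer backpropagation implementation: it assumes a squared-error loss, a sigmoid activation, and a toy OR data set, and the names `forward`, `train` and `OR_SAMPLES` are hypothetical.

```python
import math


def sigmoid(v):
    """Differentiable activation function, phi(v)."""
    return 1.0 / (1.0 + math.exp(-v))


def forward(weights, bias, inputs):
    """Forward pass: induced local field v, then output y = phi(v)."""
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(v)


def train(samples, eta=0.5, epochs=5000):
    """Repeat forward pass and delta-rule weight update for every sample."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, desired in samples:
            out = forward(weights, bias, inputs)
            error = desired - out                # e(n) = d(n) - o(n)
            delta = error * out * (1.0 - out)    # local gradient for a sigmoid unit
            # Delta rule: change in weight = eta * delta * input on that weight.
            weights = [w + eta * delta * x for w, x in zip(weights, inputs)]
            bias += eta * delta                  # bias acts as a weight on a constant input of 1
    return weights, bias


# Toy supervised data set: logical OR (inputs, desired output).
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train(OR_SAMPLES)
predictions = [round(forward(weights, bias, x)) for x, _ in OR_SAMPLES]
print(predictions)
```

In a multilayer network the same update is applied layer by layer, with the local gradient of each hidden neuron computed from the gradients of the layer after it; that propagation step is what the single-neuron sketch leaves out.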
Unsupervised Learning / ·.. . , ; .:. · : . . . · · . . .
• The model is not pro11ided with the ~oircct;results during the trainiiig:' . . . . .. . • ·. ..
=.,.,,.,-_~·-=~ed.i~usteuhdn\,~Ld~ta:m.d~scs. on tbe basis of tbefr ~t,•tiitii:al puiperties .
'~ ,. .. , . . . ,. :

·tu3
VIIL-Se.w(CSE/ISt)

only: ~
• Cli1sler significnnce and labeling. ·
• The labeling can be carried out even if the labels are only available for a small number of objects representative of the desired classes.
Self-organizing neural networks learn using an unsupervised learning algorithm to identify hidden patterns in unlabelled input data. Here unsupervised refers to the ability to learn and organize information without providing an error signal to evaluate the potential solution. The lack of direction for the learning algorithm in unsupervised learning can sometimes be advantageous, since it lets the algorithm look for patterns that have not been previously considered. The main characteristics of Self-Organizing Maps (SOM) are:
1. It transforms an incoming signal pattern of arbitrary dimension into a one or two dimensional map and performs this transformation adaptively.
2. The network represents a feedforward structure with a single computational layer consisting of neurons arranged in rows and columns.
3. At each stage of representation, each input signal is kept in its proper context.
4. Neurons dealing with closely related pieces of information are close together and they communicate through synaptic connections.
The computational layer is also called the competitive layer, since the neurons in the layer compete with each other to become active. Hence, this learning algorithm is called a competitive algorithm. The unsupervised algorithm in SOM works in three phases:
Competition phase: For each input pattern x presented to the network, the inner product with the synaptic weight w is calculated, and the neurons in the competitive layer find a discriminant function that induces competition among the neurons. The neuron whose synaptic weight vector is closest to the input vector in Euclidean distance is announced as the winner of the competition. That neuron is called the best matching neuron, i.e. i(x) = arg min ||x - w||.
Cooperative phase: The winning neuron determines the center of a topological neighborhood h of cooperating neurons. This is performed by the lateral interaction among the cooperating neurons. This topological neighborhood reduces its size over a time period.
Adaptive phase: This enables the winning neuron and its neighborhood neurons to increase their individual values of the discriminant function in relation to the input pattern through suitable synaptic weight adjustments, Δw = η h(x)(x - w).
Upon repeated presentation of the training patterns, the synaptic weight vectors tend to follow the distribution of the input patterns due to the neighborhood updating, and thus the ANN learns without a supervisor [2].
The Self-Organizing Model naturally represents neuro-biological behavior, and hence is used in many real world applications such as clustering, speech recognition and texture segmentation.

b. The data are shown in the Dataset table below; determine the number of clusters and the centroids of those clusters. (08 Marks)
[Dataset table: columns x and y]
Ans. A scatter plot of 10 data points in 2 dimensions shows them distributed fairly randomly (Figure 8.1). As a bottom-up technique, the number of clusters and their centroids can be intuited.
[Figure 8.1: Initial data points and the centroid (shown as thick dot)]
The points are distributed randomly enough that they could be considered as one cluster. The circle would represent the central point (centroid) of these points. However, there is a big distance between the points (2,6) and (8,3). So, this data could be broken into two clusters. The three points at the bottom right could form one cluster and the other seven could form the other cluster. The two clusters would look like this (Figure 8.2). The circles will be the new centroids.
The bigger cluster seems too far apart. So, it seems like the four points on the top will form a separate cluster. The three clusters could look like this (Figure 8.3).
This solution has three clusters. The cluster on the right is far from the other two clusters. However, its centroid is not too close to all the data points. The cluster at the top looks very tight-fitting, with a nice centroid. The third cluster, at the left, is spread out and may not be of much usefulness.

[Figure 8.3: Dividing into three clusters (centroids shown as thick dots)]
[Figure 8.4: Randomly assigning three centroids for three data clusters]
[Figure 8.5: Assigning data points to closest centroid]
[Figure 8.6: Centroids moved from their old (shaded) values to the revised new values]
This was an exercise in producing three best-fitting cluster definitions from the given data. The right number of clusters will depend on the data and the application for which the data would be used.
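The visual exercise judges distance and tightness by eye. The same judgement can be put into numbers, as in the sketch below. Only the points (2,6) and (8,3) come from the answer; the two small clusters are hypothetical points chosen to resemble the "top" and "left" clusters described, and the mean distance to the centroid is just one possible measure of spread.

```python
import math

def centroid(points):
    """Central point of a cluster: the mean of the x and y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def mean_spread(points):
    """Average Euclidean distance of the points from their centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

# Distance between the two points singled out in the answer.
gap = math.dist((2, 6), (8, 3))

# Hypothetical clusters, not taken from the question's dataset table.
top_cluster = [(5, 6), (4, 7), (6, 6), (5, 7)]
left_cluster = [(2, 4), (2, 6), (4, 4)]

top_spread = mean_spread(top_cluster)
left_spread = mean_spread(left_cluster)
```

A smaller spread value corresponds to what the answer calls a tight-fitting cluster; a larger one to a spread-out cluster.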
K-Means Algorithm for Clustering
K-means is the most popular clustering algorithm. It iteratively computes the clusters and their centroids. It is a top-down approach to clustering. Starting with a given number of K clusters, say 3 clusters, three random centroids will be created as starting points of the centers of three clusters (Figure 8.4). The circles are the initial cluster centroids.
Step 1: For a data point, distance values will be computed from each of the three centroids. The data point will be assigned to the cluster with the shortest distance to the centroid. All data points will thus be assigned to one centroid or the other. The arrows from each data element show the centroid that the point is assigned to (Figure 8.5).
Step 2: The centroid for each cluster will now be recalculated such that it is closest to all the data points allocated to that cluster. The dashed arrows show the centroids being moved from their old (shaded) values to the revised new values (Figure 8.6).
Step 3: Once again, data points are assigned to the centroid closest to them (Figure 8.7). The new centroids will be computed from the data points in each cluster, until finally the centroids stabilize in their locations. These are the three clusters computed by this algorithm (Figure 8.8).
The three clusters shown are a 3-datapoint cluster with centroid (6.5,4.5), a 2-datapoint cluster with centroid (4.5,3), and a 5-datapoint cluster with centroid (3.5,3).
These cluster definitions are different from the ones derived visually. This is a function of the random starting centroid values. The centroid points used earlier in the visual exercise were different from those chosen with the K-means clustering algorithm. The K-means clustering exercise should, therefore, be run again with this data, but with new random centroid starting values. With many runs, the cluster definitions are likely to stabilize. If the cluster definitions do not stabilize, that may be a sign that the number of clusters chosen is too high or too low. The algorithm should also be run with different values of K.
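Steps 1 to 3 above can be sketched in a few lines of Python. This is a minimal sketch, not the textbook's implementation: the ten points are illustrative values (the question's dataset table is not reproduced here), and the starting centroids are fixed rather than random so that the run is repeatable.

```python
import math

def closest(point, centroids):
    """Step 1: index of the centroid with the shortest distance to the point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def mean_point(points):
    """Step 2: the recalculated centroid is the mean of the cluster's points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def k_means(points, centroids, max_iter=100):
    """Step 3: repeat assignment and recalculation until assignments stabilize."""
    labels = None
    for _ in range(max_iter):
        new_labels = [closest(p, centroids) for p in points]
        if new_labels == labels:      # nothing moved, the centroids are stable
            break
        labels = new_labels
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            # An empty cluster keeps its old centroid (an assumed convention).
            updated.append(mean_point(members) if members else old)
        centroids = updated
    return labels, centroids

# Illustrative data points and assumed starting centroids (K = 3).
points = [(2, 4), (2, 6), (5, 6), (4, 7), (8, 3),
          (6, 6), (5, 2), (5, 7), (6, 3), (4, 4)]
start = [(2, 4), (5, 7), (8, 3)]
labels, centroids = k_means(points, start)
```

To follow the advice in the answer, the call can be repeated with different starting centroids, for example `random.sample(points, 3)`, and with different values of K, comparing the resulting cluster definitions across runs.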

[Figure 8.7: Assigning data points to recomputed centroids]
Categorizing news, email spam detection, face recognition, sentiment analysis, medical diagnosis, digit recognition and weather prediction are just a few of the popular use cases of the Naive Bayes algorithm.
Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. Among classification algorithms, Naive Bayes along with Regression is one of the most popular and powerful algorithms.
The Naive Bayes classifier is a machine learning algorithm. If you wonder how Google marks some of the mails as spam in your inbox, a machine learning algorithm can be used to classify an incoming email as spam or not spam.

b. What is the point of using SVMs as a classification technique? (10 Marks)
Ans. All classification techniques have advantages and disadvantages, which are more or less important according to the data which are being analysed, and thus have a relative relevance. SVMs can be a useful tool for insolvency analysis in the case of non-regularity in the data, for example when the data are not regularly distributed or have an unknown distribution. They can help evaluate information, i.e. financial ratios, which should be transformed prior to entering the score of classical classification techniques.
The advantages of the SVM technique can be summarised as follows:
1. By introducing the kernel, SVMs gain flexibility in the choice of the form of the threshold separating solvent from insolvent companies, which need not be linear and need not even have the same functional form for all data, since its function is non-parametric and operates locally. As a consequence they can work with financial ratios which show a non-monotone relation to the score and to the probability of default, or which are non-linearly dependent, and this without needing any specific work on each non-monotone variable.
2. Since the kernel implicitly contains a non-linear transformation, no assumptions about the functional form of the transformation, which makes data linearly separable, are necessary. The transformation occurs implicitly on a robust theoretical basis, and human expertise judgement beforehand is not needed.
3. SVMs provide a good out-of-sample generalization, if the parameters C and r (in the case of a Gaussian kernel) are appropriately chosen. This means that, by choosing an appropriate generalization grade, SVMs can be robust, even when the training sample has some bias.
4. SVMs deliver a unique solution, since the optimality problem is convex. This is an advantage compared to Neural Networks, which have multiple solutions associated with local minima and for this reason may not be robust over different samples.
5. With the choice of an appropriate kernel, such as the Gaussian kernel, one can put more stress on the similarity between companies, because the more similar the financial structure of two companies is, the higher is the value of the kernel. Thus when classifying a new company, the values of its financial ratios are compared with the ones of the support vectors of the training sample which are more similar to this new company. This company is then classified according to the group with which it has the greatest similarity.
Here are some examples where the SVM can help in coping with non-linearity and non-monotonicity. One case is when the coefficients of some financial ratios in equation (1), estimated with a linear parametric model, show a sign that does not correspond to the expected one according to theoretical economic reasoning. The reason for that may be that these financial ratios have a non-monotone relation to the PD and to the score. The unexpected sign of the coefficients depends on the fact that data dominate or cover the part of the range where the relation to the PD has the opposite sign. One of these financial ratios is typically the growth rate of a company. Leverage may also show non-monotonicity, since if a company primarily works with its own capital, it may not exploit all its external financing opportunities properly. Another example may be the size of a company: small companies are expected to be more financially unstable, but if a company has grown too fast or if it has become too static because of its dimension, the big size may become a disadvantage. Because of these characteristics, the above mentioned financial ratios are often sorted out when selecting the risk assessment model according to a linear classification technique. Alternatively, an appropriate evaluation of this information in linear techniques requires a transformation of the input variables, in order to make them monotone and linearly separable.
A common disadvantage of non-parametric techniques such as SVMs is the lack of transparency of results. SVMs cannot represent the score of all companies as a simple parametric function of the financial ratios, since its dimension may be very high. It is neither a linear combination of single financial ratios nor has it another simple functional form. The weights of the financial ratios are not constant. Thus the marginal contribution of each financial ratio to the score is variable. Using a Gaussian kernel, each company has its own weights according to the difference between the value of its own financial ratios and those of the support vectors of the training data sample.
Interpretation of results is however possible and can rely on graphical visualization, as well as on a local linear approximation of the score. The SVM threshold can be represented within a bi-dimensional graph for each pair of financial ratios. This visualization technique cuts and projects the multidimensional feature space, as well as the multivariate threshold function separating solvent and insolvent companies, on a bi-dimensional one, by fixing the values of the other financial ratios equal to the values of the company which has to be classified. In this way, different companies will have different threshold projections. However, an analysis of these graphs gives an important input about the direction towards which the financial ratios of non-eligible companies should change, in order to reach eligibility.
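Advantage 5 above rests on the idea that the Gaussian kernel is high for companies with similar financial ratios and low for dissimilar ones. A minimal sketch of that idea follows. The kernel form exp(-||a - b||^2 / (2 r^2)) is one common parameterization and is assumed here, and the ratio vectors are hypothetical numbers, not data from the answer.

```python
import math

def gaussian_kernel(a, b, r=1.0):
    """Gaussian (RBF) kernel between two vectors of financial ratios.

    r is the kernel radius; the exact parameterization is an assumption.
    """
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * r ** 2))

# Hypothetical financial-ratio vectors for three companies.
new_company = [0.10, 0.40]
similar_company = [0.12, 0.38]
different_company = [0.90, 1.50]

k_same = gaussian_kernel(new_company, new_company)
k_similar = gaussian_kernel(new_company, similar_company)
k_different = gaussian_kernel(new_company, different_company)
```

The kernel value is 1 for identical ratios and falls towards 0 as the ratios diverge, which is why the support vectors most similar to a new company dominate its classification.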
The PD can represent a third dimension of the graph, by means of isoquants and colour coding. The approach chosen for the estimation of the PD can be based on empirical estimates or on a theoretical model.
Since the relation between score and PD is monotone, a local linearization of the PD can be calculated for single companies by estimating the tangent curve to the isoquant of the score. For single companies this can offer interesting information about the factors influencing their financial solidity.
In the figure below the PD is estimated by means of a Gaussian kernel on data belonging to the trade sector and then smoothed and monotonized by means of a Pool Adjacent Violator algorithm. The pink curve represents the projection of the SVM threshold on a binary space with the two variables K21 (net income change) and K24 (net interest ratio), whereas all other variables are fixed at the level of company j. The blue curve represents the isoquant for the PD of company j, whose coordinates are marked by a triangle.
[Figure: Graphical Visualization of the SVM Threshold and of a Local Linearization of the Score Function: Example of a Projection on a Bi-dimensional Graph with PD Colour Coding]
The green line represents the linear approximation of the score or PD function projection for company j. One interesting result of this graphical analysis is that successful companies with a low PD often lie in a closed space. This implies that there exists an optimal combination area for the financial ratios being considered, outside of which the PD gets higher. If we consider the net income change, we notice that its influence on the PD is non-monotone. Both too low and too high growth rates imply a higher PD. This may indicate the existence of an optimal growth rate and suggest that above a certain rate a company may get into trouble, especially if the cost structure of the company is not optimal, i.e. the net interest ratio is too high. But if a company lies in the optimal growth zone, it can also afford a higher net interest ratio.

OR
10. a. What are the two major ways that a website can become popular? (04 Marks)
Ans. The Web works through a system of hyperlinks using the hypertext protocol (http). Any page can create a hyperlink to any other page, and it can be linked to by another page. The intertwined or self-referral nature of the web lends itself to some unique network analytical algorithms. The structure of Web pages could also be analyzed to examine the pattern of hyperlinks among pages. There are two basic strategic models for successful websites: Hubs and Authorities.
1. Hubs: These are pages with a large number of interesting links. They serve as a hub or a gathering point, where people visit to access a variety of information. Media sites like Yahoo.com, or government sites, would serve that purpose. More focused sites like Traveladvisor.com and yelp.com could aspire to becoming hubs for new emerging areas.
2. Authorities: Ultimately, people would gravitate towards pages that provide the most complete and authoritative information on a particular subject. This could be factual information, news, advice, user reviews etc. These websites would have the highest number of inbound links from other websites. Thus Mayoclinic.com would serve as an authoritative page.

b. Explain the web mining algorithm. (04 Marks)
Ans. Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates web pages as being hubs or authorities. Many other HITS-based algorithms have also been published. The most famous and powerful of these algorithms is the PageRank algorithm. Invented by Google co-founder Larry Page, this algorithm is used by Google to organize the results of its search function. This algorithm helps determine the relative importance of any particular web page by counting the number and quality of links to a page. The websites with a greater number of links, and/or more links from higher-quality websites, will be ranked higher. It works in a similar way to determining the status of a person in a society of people. Those with relations to more people and/or relations to people of higher status will be accorded a higher status. PageRank is the algorithm that helps determine the order of pages listed upon a Google Search query. The original PageRank algorithm formulation has been updated in many ways and the latest algorithm is kept a secret so that other websites cannot take advantage of the algorithm and manipulate their website according to it. However, there are many standard elements that remain unchanged. These elements lead to the principles for a good website. This process is also called Search Engine Optimization (SEO).

c. Explain the practical considerations of social network analysis. Give the difference between Social Network Analysis v/s Traditional Data Analytics. (08 Marks)
Ans. PRACTICAL CONSIDERATIONS:
Network Size: Most SNA research is done using small networks. Collecting data about large networks can be very challenging. This is because the number of links is of the order of the square of the number of nodes. Thus, in a network of 1000 nodes there are potentially 1 million possible pairs of links.
Gathering Data: Electronic communication records (email, chat, etc.) can be harnessed to gather social network data more easily. Data on the nature and quality of relationships need to be collected using survey documents. Capturing, cleansing and organizing the data can take a lot of time and effort, just like in a typical analytics project.
Computation and Visualization: Modeling large networks can be computationally challenging, and visualizing them also would require special skills. Big data analytical tools may be needed to compute large networks.
Dynamic Networks: Relationships between nodes in a social network can be fluid. They can change in strength and functional nature. For example, there could be multiple relationships between two people; they could simultaneously be coworkers, coauthors, and spouses. The network should be modeled frequently to see the dynamics of the network.
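The Network Size consideration above says the number of possible links grows with the square of the number of nodes. A small sketch of that arithmetic is given below; the function name is illustrative, and self-links are assumed to be excluded.

```python
def possible_links(nodes, directed=True):
    """Number of possible links among the given number of nodes.

    Directed: every ordered pair of distinct nodes, n * (n - 1).
    Undirected: every unordered pair, n * (n - 1) / 2.
    """
    ordered_pairs = nodes * (nodes - 1)
    return ordered_pairs if directed else ordered_pairs // 2

directed_links = possible_links(1000)                    # 999000
undirected_links = possible_links(1000, directed=False)  # 499500
```

For 1000 nodes the directed count is 999,000, which is the "1 million" of the answer in round figures; doubling the nodes roughly quadruples the count.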
Table 10.1 Social Network Analysis v/s Traditional Data Analytics
Dimension | Social Network Analysis | Traditional Data Mining
Nature of learning | Unsupervised learning | Supervised and unsupervised learning
Analysis of goals | Hub nodes, important nodes, and sub-networks | Key decision rules, cluster centroids
Dataset structure | A graph of nodes and (directed) links | Rectangular data of variables and instances
Analysis techniques | Visualization with statistics; iterative graphical computation |
Quality measurement | Usefulness is key criterion |
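Table 10.1 lists "a graph of nodes and (directed) links" as the dataset structure and "iterative graphical computation" as the analysis technique for SNA. The sketch below shows both on a tiny hypothetical web graph, using a simplified PageRank-style iteration of the kind described in answer 10 (b). The damping factor of 0.85, the number of iterations and the four pages are assumptions for illustration; this is not Google's actual algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    links maps every page to the list of pages it links to; each target
    must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal shares to its targets.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: C receives links from A, B and D; A receives one from C.
web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
top_page = max(ranks, key=ranks.get)
```

Page C, which has the most inbound links, ends up with the highest rank, and A benefits from being linked to by the highly ranked C, mirroring the "number and quality of links" idea.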