DWDM Lab Manual: Data Warehousing and Data Mining

LabManual:DWDM
JawaharlalNehruEngineeringCollege
LaboratoryManual
DATA WAREHOUSING AND DATA MINING

For Final Year Students CSE Dept: Computer Science & Engineering (NBA Accredited)
Author JNEC, Aurangabad
DepartmentOfComputerScience&Engineering 200910
LabManual:DWDM
FOREWORD
It is my great pleasure to present this laboratory manual for FINAL YEAR COMPUTER SCIENCE engineering students for the subject of Data warehousing and data mining. As a student, many of you may be wondering with some of the questions in your mind regarding the subject and exactly what has been tried is to answer through this manual.
As you may be aware that MGM has already been awarded with ISO 9001:2000 certification and it is our endure to technically equip our students taking the advantage of the procedural aspects of ISO 9001:2000 Certification.
Faculty members are also advised that covering these aspects in initial stage itself, will greatly relived them in future as much of the load will be taken care by the enthusiasm energies of the students once they are conceptually clear.
Dr. S.D.Deshmukh Principal
LabManual:DWDM
LABORATORY MANUAL CONTENTS
This manual is intended for FIANL YEAR COMPUTER SECINCE students for the subject of Data Warehousing and Data Mining. This manual typically contains practical/Lab Sessions related Data warehousing and data mining covering various aspects related the subject to enhanced understanding.
Students are advised to thoroughly go though this manual rather than only topics mentioned in the syllabus as practical aspects are the key to understanding and conceptual visualization of theoretical aspects covered in the books.
Good Luck for your Enjoyable Laboratory Sessions
Prof. D.S.Deshpande HOD, CSE
Ms. S.B.Jaiwal Lecturer, CSE Dept.
LabManual:DWDM
DOs and DONTs in Laboratory:
1. Make entry in the Log Book as soon as you enter the Laboratory.
2. All the students should sit according to their roll numbers starting from to right.
their left
3. All the students are supposed to enter the terminal number in the log book.
4. Do not change the terminal on which you are working.
5. All the students are expected to get at least the algorithm of the program/concept to be implemented.
6. Strictly observe the instructions given by the teacher/Lab Instructor.
Instruction for Laboratory Teachers:
1.
Submission related to whatever lab work has been completed should be done
during the next lab session. The immediate arrangements for printouts related to submission on the day of practical assignments.
2. Students should be taught for taking the printouts under the observation of lab teacher.
3. The promptness of submission should be encouraged by way of marking and evaluation patterns that will benefit the sincere students.
LabManual:DWDM
SUBJECT INDEX
1. Evolution of data management technologies, introduction to data warehousing concepts.
2. Develop an application to implement defining subject area, design of fact dimension table, data mart. 3. Develop an application to implement OLAP, roll up, drill down, slice and dice operation 4. Develop an application to construct a multidimensional data. 5. Develop an application to implement data generalization and summarization technique. 6. Develop an application to extract association rule of data mining. 7. Develop an application for classification of data. 8. Develop an application for one clustering technique
9. Develop an application for Nave Bayes classifier.
10. Develop an application for decision tree.
LabManual:DWDM
LabExercises1
Aim: Evolution of data management technologies, introduction to Data Warehousingconcepts.
Theory: Datamanagementcomprisesallthedisciplinesrelatedtomanagingdataas avaluableresource. Topics in Data Management, grouped by the DAMA DMBOK Framework, include: 1. DataGovernance
o o o
Dataasset Datagovernance Datasteward
2. DataArchitecture,AnalysisandDesign
a. b. c.
Dataanalysis Dataarchitecture Datamodeling
3. DatabaseManagement
LabManual:DWDM
o o o
Datamaintenance Databaseadministration Databasemanagementsystem
4. DataSecurityManagement
o o o o
Dataaccess Dataerasure Dataprivacy Datasecurity
5. DataQualityManagement
o o o o
Datacleansing Dataintegrity Dataquality Dataqualityassurance
6. ReferenceandMasterDataManagement
o o o
Dataintegration MasterDataManagement Referencedata
7. DataWarehousingandBusinessIntelligenceManagement
o
Businessintelligence
LabManual:DWDM
o o o o
Datamart Datamining Datamovement(extract,transformandload) Datawarehousing
8. Document,RecordandContentManagement
o o
Documentmanagementsystem Recordsmanagement
9. MetaDataManagement
o o o o o
Metadatamanagement Metadata Metadatadiscovery Metadatapublishing Metadataregistry
DataWarehouse: A data warehouse is a subjectoriented, integrated, timevariant, and nonvolatile collectionofdatainsupportofmanagementsdecisionmakingprocess DATAWAREHOUSING: Data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates
LabManual:DWDM
analysis workload from transaction workload and enables an organization to consolidatedatafromseveralsources. Acommonwayofintroducingdatawarehousingistorefertothecharacteristicsof adatawarehouseassetforthbyWilliamInmon:
SubjectOriented Integrated Nonvolatile TimeVariant
1. SubjectOriented Datawarehousesaredesignedtohelpyouanalyzedata.Forexample,tolearnmore about yourcompany'ssalesdata, youcan buildawarehousethatconcentrateson sales. Using this warehouse, you can answer questions like "Who was our best customerforthisitemlastyear?"Thisabilitytodefineadatawarehousebysubject matter,salesinthiscase,makesthedatawarehousesubjectoriented. 2. Integrated Integrationiscloselyrelatedtosubjectorientation.Datawarehousesmustputdata fromdisparatesourcesintoaconsistentformat.Theymustresolvesuchproblems as naming conflicts and inconsistencies among units of measure. When they achievethis,theyaresaidtobeintegrated. 3. Nonvolatile
LabManual:DWDM
Nonvolatile meansthat,onceenteredintothewarehouse,datashouldnotchange. Thisislogicalbecausethepurposeofawarehouseistoenableyoutoanalyzewhat hasoccurred. 4. TimeVariant Inordertodiscovertrendsinbusiness,analystsneedlargeamountsofdata.Thisis very much in contrast to online transaction processing (OLTP) systems, where performancerequirementsdemandthat historicaldata be movedtoanarchive. A data warehouse's focus on change over time is what is meant by the term time variant.
10
LabManual:DWDM
ArchitectureofDataWarehouse
11
LabManual:DWDM
LabExercises2
Aim: Develop an application to implement defining subject area, design of fact dimensiontable,datamart.
S/wRequirement:ORACLE,DB2
Theory:
A data mart is a subset of an organizational data store, usually orientedtoaspecificpurposeormajordatasubject,that maybedistributedto supportbusinessneeds.Datamartsareanalyticaldatastoresdesignedtofocus onspecificbusinessfunctionsforaspecificcommunitywithinanorganization. Datamartsareoftenderivedfromsubsetsofdatainadatawarehouse,though in the bottomup data warehouse design methodology the data warehouse is createdfromtheunionoforganizationaldatamarts
Designschemas
star schema or dimensional model is a fairly popular design choice, as it enables a relational database to emulate the analytical functionality of a multidimensionaldatabase.
snowflakeschema
12
LabManual:DWDM
Reasonsforcreatingadatamart

Easyaccesstofrequentlyneededdata Createscollectiveviewbyagroupofusers Improvesenduserresponsetime Easeofcreation LowercostthanimplementingafullDatawarehouse PotentialusersaremoreclearlydefinedthaninafullDatawarehouse
Starschemaarchitecture Starschemaarchitecture isthesimplestdatawarehousedesign. The main feature of a star schema is a table at the center, called the fact table and the dimension tableswhichallowbrowsingofspecificcategories,summarizing,drilldownsand specifyingcriteria. Typically, most of the fact tables in a star schema are in database third normal form, while dimensional tables are denormalized (second normal form). Despitethefactthatthestarschemaisthesimplestdatawarehousearchitecture,it is most commonly used in the data warehouse implementations across the world today(about9095%cases).
13
LabManual:DWDM
Facttable The fact table is not a typical relational database table as it is denormalized on purpose to enhance query response times. The fact table typically contains recordsthatarereadytoexplore,usuallywithadhocqueries.Recordsinthefact table are often referred to as events, due to the timevariant nature of a data warehouseenvironment.
Theprimarykeyforthefacttableisacompositeofallthecolumnsexceptnumeric values / scores (like QUANTITY, TURNOVER, exact invoice date and time).
Typicalfacttablesinaglobalenterprisedatawarehouseare(usuallytheremaybe additionalcompanyorbusinessspecificfacttables): 1.salesfacttablecontainsalldetailsregardingsales 2.orders facttable insomecasesthetablecanbesplit intoopenordersand historicalorders.Sometimesthevaluesforhistoricalordersarestoredinasales facttable. 3.budgetfacttableusuallygroupedbymonthandloadedonceattheendofa year. 4. forecast fact table usually grouped by month and loaded daily, weekly or monthly. 5. inventoryfacttablereportstocks,usuallyrefresheddaily
14
LabManual:DWDM
Dimensiontable Nearlyalloftheinformationinatypicalfacttableisalsopresentinoneormore dimensiontables.ThemainpurposeofmaintainingDimensionTablesistoallow browsing thecategoriesquicklyandeasily. Theprimarykeysofeachofthedimensiontablesare linkedtogetherto formthe compositeprimarykeyofthefacttable.Inastarschemadesign,thereisonlyone denormalizedtablefora given dimension. Typicaldimension tablesinadatawarehouseare: timedimensiontable customersdimensiontable productsdimensiontable keyaccountmanagers(KAM)dimensiontable salesofficedimensiontable
Starschemaexample
15
LabManual:DWDM
FigA.1
Snowflakeschemaarchitecture: Snowflake schema architecture is a more complex variation of a star schema design.The maindifference isthatdimensionaltablesinasnowflakeschemaare normalized, so they have a typical relational database design.
Snowflakeschemasaregenerallyusedwhenadimensionaltablebecomesverybig and when a star schema cant represent the complexity of a data structure. For example if a PRODUCT dimension table contains millions of rows, the use of snowflakeschemasshouldsignificantlyimproveperformancebymovingoutsome datatoothertable(withBRANDSforinstance).
16
LabManual:DWDM
The problem is that the more normalized the dimension table is, the more complicatedSQLjoinsmustbeissuedtoquerythem.Thisisbecauseinorderfora querytobeanswered,manytablesneedtobejoinedandaggregatesgenerated. Anexampleofasnowflakeschemaarchitectureisdepictedbelow
FigA.2
17
LabManual:DWDM
LabExercises3
Aim: Develop an application to implement OLAP, roll up, drill down, slice and diceoperation S/wRequirement:ORACLE,DB2. Theory: OLAP is an acronym for On Line Analytical Processing.Online Analytical Processing: An OLAP system manages large amount of historical data, provides facilities forsummarizationandaggregation,andstoresand manages information atdifferentlevelsofgranularity. MultidimensionalData: Salesvolumeasafunctionofproduct,month,andregion
Month FigA.1
18
LabManual:DWDM
Dimensions:Product,Location,Time Hierarchicalsummarizationpaths IndustryRegionYear
CategoryCountry Quarter
ProductCityMonthWeek
Office OLAPoperations:
Day
The analyst can understand the meaning contained in the databases using multi dimensionalanalysis.Byaligningthedatacontentwiththeanalyst'smentalmodel, the chances of confusion and erroneous interpretations are reduced. The analyst can navigate through the database and screen for a particular subset of the data, changing the data's orientations and defining analytical calculations. The user initiated process of navigating by calling for page displays interactively, through thespecificationofslicesviarotationsanddrilldown/upissometimescalled"slice and dice". Common operations include slice and dice, drill down, roll up, and pivot. Slice: A slice is a subset of a multidimensional array corresponding to a single valueforoneormoremembersofthedimensionsnotinthesubset.
19
LabManual:DWDM
Dice:Thediceoperationisasliceonmorethantwodimensionsofadatacube(or morethantwoconsecutiveslices). DrillDown/Up:Drillingdownorupisaspecificanalyticaltechniquewherebythe usernavigatesamonglevelsofdatarangingfromthemostsummarized(up)tothe mostdetailed(down). Rollup:Arollupinvolvescomputingallofthedatarelationshipsforoneormore dimensions.Todothis,acomputationalrelationshiporformulamightbedefined. Pivot:Tochangethedimensionalorientationofareportorpagedisplay. Otheroperations Drillthrough:throughthebottomlevelofthecubetoitsbackendrelationaltables (usingSQL)
20
LabManual:DWDM
FigA.2
21
LabManual:DWDM
CubeOperation: CubedefinitionandcomputationinOLAP 1. definecubesales[item,city,year]:sum(sales_in_dollars) 2. computecubesales TransformitintoaSQLlikelanguage(withanewoperatorcubeby) SELECTitem,city,year,SUM(amount) FROMSALES CUBEBYitem,city,year NeedcomputethefollowingGroupBys (date,product,customer), (date,product),(date,customer),(product,customer), (date),(product),(customer) () RollupandDrilldown Therollupoperationperformsaggregationonadatacube,eitherbyclimbingupa concepthierarchyforadimensionorbydimensionreductionsuchthatoneormore dimensionsareremovedfromthegivencube. Drilldown is the reverse of rollup. It navigates from less detailed data to more detailed data. Drilldown can be realized by either stepping down a concept hierarchyforadimensionorintroducingadditionaldimensions
22
LabManual:DWDM
Sliceanddice The slice operation performs a selection on one dimension of the given cube, resultinginasub_cube. Thediceoperationdefinesasub_cubeby performingaselectionontwoor more dimensions. DrillDown:
FoodLine OutdoorLine CATEGORY_total 59,728 151,174 210,902 DrillDown
Asia
FoodLine OutdoorLine CATEGORY_total Malaysia China India Japan Singapore Belgium 618 33,198.5 6,918 13,871.5 5,122 7797.5 9,418 74,165 0 34,965 32,626 21,125 10,036 107,363.5 6,918 48,836.5 37,748 28,922.5
23
LabManual:DWDM
Rollup:
24
LabManual:DWDM
Slice:
25
LabManual:DWDM
Dice:
26
LabManual:DWDM
LabExercises4
Aim:Developanapplicationtoconstructamultidimensionaldata.
S/wRequirement:ORACLE,DB2.
Theory:
MultidimensionalDataModel:
Multidimensionaldatamodelistoviewitasacube.Thecableattheleftcontains detailedsalesdata byproduct, marketandtime. Thecubeon the rightassociates sales number (unitsold)withdimensionsproducttype, marketandtimewiththe unitvariablesorganizedascellinanarray. Thiscubecanbeexpendedtoincludeanotherarraypricewhichcanbeassociates withalloronlysomedimensions. As number of dimensions increases number of cubes cell increase exponentially. Dimensionsarehierarchicalinnaturei.e.timedimensionmaycontainhierarchies for years, quarters, months, weak and day. GEOGRAPHY may contain country, state,cityetc.
27
LabManual:DWDM
FigA.0
TheMultidimensionalDataModel Thischapterdescribesthemultidimensionaldatamodelandhowitisimplemented in relational tables and standard form analytic workspaces. It consists of the followingtopics:
TheLogicalMultidimensionalDataModel
28
LabManual:DWDM
TheRelationalImplementationoftheModel TheAnalyticWorkspaceImplementationoftheModel
1.TheLogicalMultidimensionalDataModel The multidimensional data model is an integral part of OnLine Analytical Processing,orOLAP.BecauseOLAPisonline,itmustprovideanswersquickly analystsposeiterativequeriesduringinteractivesessions,notinbatchjobsthatrun overnight. And because OLAP is also analytic, the queries are complex. The multidimensionaldatamodelisdesignedtosolvecomplexqueriesinrealtime. The multidimensional data model is important because it enforces simplicity. As RalphKimballstatesinhislandmarkbook,TheDataWarehouseToolkit: "The central attraction of the dimensional model of a business is its simplicity.... that simplicity is the fundamental key that allows users to understand databases, andallowssoftwaretonavigatedatabasesefficiently." The multidimensional data model is composed of logical cubes, measures, dimensions, hierarchies, levels, and attributes. The simplicity of the model is inherent because it defines objects that represent realworld business entities. Analysts know which business measures they are interested in examining, which dimensions and attributes make the data meaningful, and how the dimensions of theirbusinessareorganizedintolevelsandhierarchies. FigA.1showstherelationshipsamongthelogicalobjects.
29
LabManual:DWDM
DiagramoftheLogicalMultidimensionalModel
Fig.A.1
a)LogicalCubes Logicalcubesprovidea meansoforganizing measuresthat havethesameshape, thatis,theyhavetheexactsamedimensions.Measuresinthesamecubehavethe same relationships to other logical objects and can easily be analyzed and displayedtogether. b)LogicalMeasures Measures populate the cells of a logical cube with the facts collected about business operations. Measures are organized by dimensions, which typically includeaTimedimension.
30
LabManual:DWDM
Ananalyticdatabasecontainssnapshotsof historicaldata,derived fromdata ina legacy system, transactional database, syndicated sources, or other data sources. Threeyearsofhistoricaldataisgenerallyconsideredtobeappropriateforanalytic applications. Measures are static and consistent while analysts are using them to inform their decisions.Theyareupdatedinabatchwindowatregularintervals:weekly,daily, orperiodicallythroughouttheday.Manyapplicationsrefreshtheirdatabyadding periodstothetimedimensionofameasure,andmayalsorolloffanequalnumber of the oldest time periods. Each update provides a fixed historical record of a particularbusinessactivityforthatinterval.Otherapplicationsdoafullrebuildof theirdataratherthanperformingincrementalupdates. A critical decision in defining a measure is the lowest level of detail (sometimes calledthegrain).Usersmayneverviewthisbaseleveldata,butitdeterminesthe typesofanalysisthatcanbeperformed.Forexample,marketanalysts(unlikeorder entry personnel) do not need to know that Beth Miller in Ann Arbor, Michigan, placedanorderforasize10bluepolkadotdressonJuly6,2002,at2:34p.m.But theymightwanttofindoutwhichcolorofdresswasmostpopularinthesummer of2002intheMidwesternUnitedStates. Thebaseleveldetermineswhetheranalystscangetananswertothisquestion.For thisparticularquestion,Timecouldbe rolled up into months,Customercouldbe rolledupintoregions,andProductcouldberolledupintoitems(suchasdresses) withanattributeofcolor.However,thislevelofaggregatedatacouldnotanswer the question: At what time of day are women most likely to place an order? An importantdecisionistheextenttowhichthedatahasbeenpreaggregatedbefore beingloadedintoadatawarehouse.
31
LabManual:DWDM
c)LogicalDimensions Dimensionscontainasetofuniquevaluesthatidentifyandcategorizedata.They form the edges of a logical cube, and thus of the measures within the cube. Becausemeasuresaretypicallymultidimensional,asinglevalueinameasuremust be qualified by a member of each dimension to be meaningful. For example, the Sales measure has four dimensions: Time, Customer, Product, and Channel. A particular Sales value (43,613.50) only has meaning when it is qualified by a specific time period (Feb01), a customer (Warren Systems), a product (Portable PCs),andachannel(Catalog).
d)LogicalHierarchiesandLevels Ahierarchyisawaytoorganizedataatdifferentlevelsofaggregation.Inviewing data,analystsusedimensionhierarchiestorecognizetrendsatonelevel,drilldown tolowerlevelstoidentifyreasonsforthesetrends,androlluptohigherlevelsto seewhataffectthesetrendshaveonalargersectorofthebusiness. Each level represents a position in the hierarchy. Each level above the base (or mostdetailed)levelcontainsaggregatevaluesforthelevelsbelowit.Themembers at different levels have a onetomany parentchild relation. For example, Q102 andQ202arethechildrenof 2002,thus2002istheparentof Q102andQ202. Supposeadatawarehousecontainssnapshotsofdatatakenthreetimesaday,that is, every 8 hours. Analysts might normally prefer to view the data that has been aggregatedintodays,weeks,quarters,oryears.Thus,theTimedimensionneedsa hierarchywithatleastfivelevels.
32
LabManual:DWDM
Similarly, a sales manager with a particular target for the upcoming year might wanttoallocatethattargetamountamongthesalesrepresentativesinhisterritory the allocation requires a dimension hierarchy in which individual sales representativesarethechildvaluesofaparticularterritory. Hierarchies and levels have a manytomany relationship. A hierarchy typically contains several levels, and a single level can be included in more than one hierarchy.
e)LogicalAttributes An attribute provides additional information about the data. Some attributes are usedfordisplay.Forexample,youmighthaveaproductdimensionthatusesStock KeepingUnits(SKUs)fordimensionmembers.TheSKUsareanexcellentwayof uniquelyidentifyingthousandsofproducts,butare meaninglesstomostpeopleif theyareusedtolabelthedatainareportorgraph.Youwoulddefineattributesfor thedescriptivelabels. You mightalsohaveattributeslikecolors,flavors,orsizes.Thistypeofattribute canbeusedfordataselectionandansweringquestionssuchas:Whichcolorswere the most popular in women's dresses in the summer of 2002? How does this comparewiththeprevioussummer? Time attributes can provide information about the Time dimension that may be usefulinsometypesofanalysis,suchasidentifyingthelastdayorthenumberof daysineachtimeperiod.
33
LabManual:DWDM
2TheRelationalImplementationoftheModel The relational implementation of the multidimensional data model is typically a star schema, as shown in Figure b, or a snowflake schema. A star schema is a convention for organizing the data into dimension tables, fact tables, and materializedviews.Ultimately,allofthedataisstoredincolumns,andmetadatais requiredtoidentifythecolumnsthatfunctionasmultidimensionalobjects. InOracleDatabase,youcandefinealogicalmultidimensionalmodelforrelational tables using the OLAP Catalog or AWXML. The metadata distinguishes level columns from attribute columns in the dimension tables and specifies the hierarchicalrelationshipsamongthe levels.Itidentifiesthe various measuresthat arestoredincolumnsofthefacttablesandaggregationmethodsforthemeasures. Anditprovidesdisplaynamesforalloftheselogicalobjects.
34
LabManual:DWDM
FigA.2
a)DimensionTables A star schema stores all of the information about a dimension in a single table. Each level of a hierarchy is represented by a column or column set in the dimension table. A dimension object can be used to define the hierarchical relationshipbetweentwocolumns(orcolumnsets)thatrepresenttwo levelsof a hierarchy without a dimension object, the hierarchical relationships are defined onlyinmetadata.Attributesarestoredincolumnsof thedimensiontables.
35
LabManual:DWDM
Asnowflakeschemanormalizesthedimensionmembersbystoringeachlevelina separatetable. b)FactTables Measures are stored in fact tables. Fact tables contain a composite primary key, which is composed of several foreign keys (one for each dimension table) and a columnforeachmeasurethatusesthesedimensions. c)MaterializedViews Aggregatedata iscalculated onthebasis ofthe hierarchical relationshipsdefined in the dimension tables. These aggregates are stored in separate tables, called summary tables or materialized views. Oracle provides extensive support for materializedviews,includingautomaticrefreshandqueryrewrite. Queriescanbewritteneitheragainstafacttableoragainstamaterializedview.Ifa queryiswrittenagainstthefacttablethatrequiresaggregatedataforitsresultset, thequeryiseitherredirectedbyqueryrewritetoanexistingmaterializedview,or thedataisaggregatedonthefly. EachmaterializedviewisspecifictoaparticularcombinationoflevelsinfigA.2, only two materialized views are shown of a possible 27 (3 dimensions with 3 levelshave3**3possiblelevelcombinations).
3TheAnalyticWorkspaceImplementationoftheModel Analytic workspaces have several different types of data containers, such as dimensions, variables, and relations. Each type of container can be used in a varietyofwaystostoredifferenttypesof information. Forexample,adimension candefineanedgeofameasure,orstorethenamesofallthelanguagessupported
36
LabManual:DWDM
bytheanalyticworkspace,oralloftheacceptablevaluesofarelation.Dimension objects are themselves one dimensional lists of values, while variables and relations are designed specifically to support the efficient storage, retrieval, and manipulationofmultidimensionaldata a)MultidimensionalDataStorageinAnalyticWorkspaces In the logical multidimensional model, a cube represents all measures with the same shape, that is, the exact same dimensions. In a cube shape, each edge represents a dimension. The dimension members are aligned on the edges and dividethecubeshapeintocellsinwhichdatavaluesarestored. In an analytic workspace, the cube shape also represents the physical storage of multidimensionalmeasures,incontrastwithtwodimensionalrelationaltables.An advantageofthecubeshapeisthatitcanberotated:thereisnoonerightwayto manipulate or view the data. This is an important part of multidimensional data storage,calculation,anddisplay,because differentanalysts needto viewthedata indifferentways.Forexample,ifyouaretheSalesManagerforthePacificRim, thenyouneedtolookatthedatadifferentlyfromaproductmanagerorafinancial analyst. Assumethatacompanycollectsdataonsales.Thecompanymaintainsrecordsthat quantifyhow manyofeachproductwassoldinaparticularsalesregionduringa specifictimeperiod.Youcanvisualizethesalesmeasureasthecubeshowninfig A.3.
37
LabManual:DWDM
FigureA.3ComparisonofProductSalesByCity
Descriptionoftheillustrationcube1.gif FigA.3comparesthesalesofvariousproductsindifferentcitiesforJanuary2001 (shown) and February 2001 (not shown). This view of the data might be used to identify products that are performing poorly in certain markets. Fib A.4 shows salesofvariousproductsduringafourmonthperiodinRome(shown)andTokyo (notshown).Thisviewofthedataisthebasisfortrendanalysis.
38
LabManual:DWDM
FibA.4ComparisonofProductSalesByMonth
Descriptionoftheillustrationcube2.gif Acubeshapeisthreedimensional.Ofcourse,measurescanhavemanymorethan three dimensions, but three dimensions are the maximum number that can be represented pictorially. Additional dimensions are pictured with additional cube shapes.
b)DatabaseStandardFormAnalyticWorkspaces FibA.5showshowdimension,variable,formula,andrelationobjectsinastandard form analytic workspace are used to implement the multidimensional model. Measureswithidenticaldimensionscomposealogicalcube.Alldimensionshave attributes, and all hierarchical dimensions have level relations and selfrelations
39
LabManual:DWDM
for clarity, these objects are shown only once in the diagram. Variables and formulascanhaveanynumberofdimensionsthreeareshownhere. FibA.5DiagramofaStandardFormAnalyticWorkspace
Descriptionoftheillustrationawschema.gif
c)AnalyticWorkspaceDimensions A dimension in an analytic workspace is a highly optimized, onedimensional indexofvaluesthatservesasakeytable.Variables,relations,formulas(whichare storedequations)areamongtheobjectsthatcanhavedimensions.
40
LabManual:DWDM
Dimensions have several intrinsic characteristics that are important for data analysis:
Referential integrity.Eachdimension member is uniqueandcannotbeNA (thatis,null).Ifameasurehasthreedimensions,theneachdatavalueofthat measuremustbequalifiedbyamemberofeachdimension.Likewise,each combinationofdimensionmembershasavalue,evenifitisNA.
Consistency. Dimensions are maintained as separate containers and are shared by measures. Measures with the same dimensionality can be manipulatedtogethereasily.Forexample,ifthesalesandexpensemeasures are dimensioned by time and line, then you can create equations such as profit=salesexpense.
Preservedorderofmembers.Eachdimensionhasadefaultstatus,whichisa listofallofitsmembersintheordertheyarestored.Thedefaultstatuslistis always the same unless it is purposefully altered by adding, deleting, or moving members. Within a session, a user can change the selection and orderofthestatuslistthisiscalledthecurrentstatuslist.Thecurrentstatus list remains the same until the user purposefully alters it by adding, removing,orchangingtheorderofitsmembers. Because the order of dimension members is consistent and known, the selection of members can be relative. For example, this function call comparesthesalesvaluesofallcurrentlyselectedtimeperiodsinthecurrent statuslistagainstsalesfromthepriorperiod. lagdif(sales,1,time)
41
LabManual:DWDM
Highlydenormalized.Adimensiontypicallycontainsmembersatalllevels ofallhierarchies.Thistypeofdimensionissometimescalledanembedded total dimension.
In addition to simple dimensions, there are several special types of dimensions used in a standard form analytic workspace, such as composites and concat dimensions.Thesedimensionsarediscussedlaterinthisguide.
c.1)UseofDimensionsinStandardFormAnalyticWorkspaces Inananalyticworkspace,datadimensionsarestructuredhierarchicallysothatdata atdifferent levelscanbe manipulated for aggregation,allocation,and navigation. However, all dimension members at all levels for all hierarchies are stored in a singledatadimensioncontainer. Forexample, months,quarters,and yearsareall stored in a single dimension for Time. The hierarchical relationships among dimension members are defined by a parent relation, described in "Analytic WorkspaceRelations". Notalldataishierarchicalinnature,however,andyoucancreatedatadimensions that do not have levels. A line item dimension is one such dimension, and the relationships among its members require a model rather than a multilevel hierarchy. The extensive data modeling subsystem available in analytic workspacesenablesyoutocreatebothsimpleandcomplexmodels,whichcanbe solvedaloneorinconjunctionwithaggregationmethods. As a onedimensional index, a dimension container has many uses in an analytic workspace in addition to dimensioning measures. A standard form analytic workspace uses dimensions to store various types of metadata, such as lists of hierarchies,levels,andthedimensionscomposingalogicalcube
42
LabManual:DWDM
d)2.3.4AnalyticWorkspaceVariables A variable is a data value table, that is, an array with a particular data type and indexedbyaspecificlistofdimensions.Thedimensionsthemselvesarenotstored withthevariable. Eachcombinationofdimensionmembersdefinesadatacell,regardlessofwhether a value exists for that cell or not. Thus, the absence of data can be purposefully included or excluded from the analysis. For example, if a particular product was not available before a certain date, then the analysis may exclude null values (calledNAs)inthepriorperiods.However,iftheproductwasavailablebutdidnot sellinsomemarkets,thentheanalysismayincludetheNAs. No special physical relationship exists among variables that share the same dimensions.However,alogicalrelationshipexistsbecause,eventhoughtheystore different data that may be a different data type, they are identical containers. Variablesthathaveidenticaldimensionscomposealogicalcube. If you change a dimension, such as adding new time periods to the Time dimension, then all variables dimensioned by Time are automatically changed to includethesenewtimeperiods,eveniftheothervariableshavenodataforthem. Variablesthatsharedimensions(andthusarecontainedbythesamelogicalcube) can also be manipulated together in a variety of ways, such as aggregation, allocation,modeling,andnumericcalculations.Thistypeofcalculationiseasyand fast in an analytic workspace, while the equivalent singlerow calculation in a relationalschemacanbequitedifficult.
43
LabManual:DWDM
d.1)UseofVariablestoStoreMeasures In an analytic workspace, facts are stored in variables, typically with a numeric datatype.Eachtypeofdataisstoredinitsownvariable,sothatwhilesalesdata and expenses data might have the same dimensions and the same data type, they are stored in two distinct variables. The containers are identical, but the contents areunique. An analytic workspace provides a valuable alternative to materialized views for creating,storing,andmaintainingsummarydata.Averysophisticatedaggregation system supports modeling in addition to an extensive number of aggregation methods. Moreover, finely grained aggregation rules enable you to decide precisely which data within a single measure is preaggregated, and which data withinthesamemeasurewillbecalculatedatruntime. Preaggregated data is stored in a compact format in the same container as the baselevel data, and the performance impact of aggregating data on the fly is negligiblewhentheaggregationruleshavebeendefinedaccordingtoknowngood methods.Ifaggregatedataneededfortheresultsetisstoredinthevariable,thenit issimplyretrieved.Iftheaggregatedatadoesnotexist,thenitiscalculatedonthe fly. d.2)UseofVariablestoStoreAttributes Like measures, attributes are stored in variables. However, there are significant differences between attributes and measures. While attributes are often multidimensional,onlyonedimensionisadatadimension.Ahierarchydimension, which lists the data dimension hierarchies, and a language dimension, which providessupportformultiplelanguages,aretypicaloftheotherdimensions.
44
LabManual:DWDM
Attributes provide supplementary information about each dimension member, regardless of its level in a dimension hierarchy. For example, a Time dimension might have three attribute variables, one for descriptive names, another for the periodenddates,andathirdforperiodtimespans.TheseattributesprovideTime member OCT02 with a descriptive name of October 2002, an end date of 31 OCT02,andatimespanof31.Alloftheotherdays,months,quarters,andyears in the Time dimension have similar information stored in these three attribute variables. e)AnalyticWorkspaceFormulas Aformulaisastoredequation.AcalltoanyfunctionintheOLAPDMLortoany customprogramcanbestored ina formula.Inthisway,a formula inananalytic workspaceislikearelationalview. In a standard form analytic workspace, one of the uses of a formula object is to providetheinterfacebetweenaggregationrulesandavariablethatholdsthedata. The name of a measure is always the name of a formula, not the underlying variable. While the variable only contains stored data (the baselevel data and precalculated aggregates), the formula returns a fully solved measure containing datathatisbothstoredandcalculatedonthefly.Thismethodenablesallqueries against a particular measure to be written against the same column of the same relational view for the analytic workspace, regardless of whether the data is calculated or simply retrieved. That column presents data acquired from the formula. Formulas can also be used to calculated other results like ratios, differences, movingtotals,andaveragesonthefly.
45
LabManual:DWDM
f)AnalyticWorkspaceRelations Relations are identical to variables except in their data type. A variable has a generaldatatype,suchasDECIMALorTEXT,whilearelationhasadimension as its data type. For example, a relation with a data type of PRODUCT only acceptsvaluesthataremembersofthePRODUCTdimensionanattempttostore any other value causes an error. The dimension members provide a finite list of acceptablevaluesforarelation.Arelationcontainerislikeaforeignkeycolumn, exceptthatitcanbemultidimensional. In a standard form analytic workspace, two relations support the hierarchical contentofthedatadimensions:aparentrelationandalevelrelation. Aparent relation is a selfrelation that identifies the parent of each member of a dimension hierarchy. This relation defines the hierarchical structure of the dimension.
46
LabManual:DWDM
LabExercises5
Aim: Oracle 9i Develop an application to implement data generalization and summarizationtechniques.
S/wRequirement:ORACLE,DB2
Theory:
Specialization and Generalization define a containment relationship between a higherlevelentitysetandonemorelowerlevelentitysets Specialization:resulttakingasubsetofahigherlevelentitytoformalowerlevel entityset. Generalization: result of taking the union of two or more disjoint entity sets to produce a higherlevel entity set. The attributes of higherlevel entity sets are inheritedbylowerlevelentitysets.
Specialization:Anentitysetmayincludesubgroupingsofentitiesthataredistinct insomewayfromotherentitiesintheset.Forinstance,asubsetofentitieswithin anentitysetmayhaveattributesthatarenotsharedbyalltheentitiesintheentity set. The ER model provides a means for representing these distinctive entity groupings.Consideranentitysetperson,withattributesname,street,andcity. A personmaybefurtherclassifiedasoneofthefollowing:
customer employee
47
LabManual:DWDM
Eachofthesepersontypes isdescribedbyasetofattributesthat includesallthe attributes of entity set person plus possibly additional attributes. For example, customer entities may be described further by the attribute customerid, whereas employee entities may be described further by the attributes employeeid and salary. The process of designating sub groupings within an entity set is called specialization.Thespecializationofpersonallowsustodistinguishamongpersons accordingtowhethertheyareemployeesorcustomers. Generalization:Thedesign process may alsoproceed inabottomup manner, in whichmultipleentitysetsaresynthesizedintoahigherlevelentitysetonthebasis of common features. The database designer may have first identified a customer entitysetwiththeattributesname,street,city,andcustomerid,andanemployee entitysetwiththeattributesname,street,city,employeeid,andsalary.Thereare similaritiesbetweenthecustomerentitysetandtheemployeeentitysetinthesense that they have several attributes in common. This commonality can be expressed bygeneralization,whichisacontainmentrelationshipthatexistsbetweenahigher levelentitysetandoneormorelowerlevelentitysets.Inourexample,personis the higherlevelentitysetandcustomerandemployeeare lowerlevelentitysets. Higherandlowerlevelentitysetsalsomaybedesignatedbythetermssuperclass andsubclass,respectively.Thepersonentitysetisthesuperclassofthecustomer and employee sub classes. For all practical purposes, generalization is a simple inversion of specialization. We will apply both processes, in combination, in the courseofdesigningtheERschemaforanenterprise.IntermsoftheERdiagram itself,wedonotdistinguishbetweenspecializationandgeneralization.Newlevels of entity representation will be distinguished (specialization) or synthesized (generalization) as the design schema comes to express fully the database application and the user requirements of the database. Differences in the two
48
LabManual:DWDM
approaches may be characterized by their starting point and overall goal. Generalization proceeds from the recognition that a number of entity sets share some common features (namely, they are described by the same attributes and participateinthesamerelationshipsets.
49
LabManual:DWDM
LabExercises6
Aim:Developanapplicationtoextractassociationminingrule.
S/wRequirement:C,C++
Theory:
Association rule mining is defined as: Let be a set of n binary attributes called items.Letbeasetoftransactionscalledthedatabase.EachtransactioninDhasa uniquetransactionIDandcontainsasubsetoftheitemsinI.Aruleisdefinedas animplicationoftheformX=>YwhereX,YCIandXY=.Thesetsofitems (for short itemsets) X and Y are called antecedent (lefthandside or LHS) and consequent(righthandsideorRHS)oftherulerespectively. Toillustratetheconcepts,weuseasmallexamplefromthesupermarketdomain. ThesetofitemsisI={milk,bread,butter,beer}andasmalldatabasecontainingthe items(1codespresenceand0absenceofaniteminatransaction)isshowninthe table to the right. An example rule for the supermarket could be meaning that if milkandbreadisbought,customersalsobuybutter. Note: this example is extremely small. In practical applications, a rule needs a support of several hundred transactions before it can be considered statistically significant,anddatasetsoftencontainthousandsormillionsoftransactions. Toselectinterestingrulesfromthesetofallpossiblerules,constraintsonvarious measuresofsignificanceandinterestcanbeused.Thebestknownconstraintsare
50
LabManual:DWDM
minimumthresholdsonsupportandconfidence.Thesupportsupp(X)ofanitemset X is defined as the proportion of transactions in the data set which contain the itemset.Intheexampledatabase,theitemset{milk,bread}hasasupportof2/5= 0.4sinceitoccursin40%ofalltransactions(2outof5transactions). Theconfidenceofaruleisdefined.Forexample,therulehasaconfidenceof0.2/ 0.4=0.5inthedatabase,whichmeansthatfor50%ofthetransactionscontaining milkandbreadtheruleiscorrect.Confidencecanbeinterpretedasanestimateof the probability P(Y | X), the probability of finding the RHS of the rule in transactionsundertheconditionthatthesetransactionsalsocontaintheLHS Exampledatabasewith4itemsand5transactions
transactionID
milk bread butter beer
0 FigA.1
51
LabManual:DWDM
The lift of a rule is defined as or the ratio of the observed confidence to that expectedbychance.Therulehasaliftof. The conviction of a rule is defined as . The rule has a conviction of , and be interpretedastheratiooftheexpectedfrequencythatXoccurswithoutY(thatis tosay,thefrequencythattherulemakesanincorrectprediction)ifXandYwere independent divided by the observed frequency of incorrect predictions. In this example,theconviction valueof1.2showsthat therulewouldbe incorrect20% more often (1.2 times as often) if the association between X and Y was purely randomchance. Association rules are required to satisfy a userspecified minimum support and a userspecifiedminimumconfidenceatthesametime.Toachievethis,association rulegenerationisatwostepprocess.First,minimumsupportisappliedtofindall frequent itemsets in a database. In a second step, these frequent itemsets and the minimum confidence constraint are used to form rules. While the second step is straightforward,thefirststepneedsmoreattention Manyalgorithmsforgeneratingassociationruleswerepresentedovertime. SomewellknownalgorithmsareApriori,EclatandFPGrowth,buttheyonlydo halfthejob,sincetheyarealgorithms for mining frequent itemsets.Anotherstep needstobedoneaftertogeneraterulesfromfrequentitemsetsfoundinadatabase.
ALGORITHM: Associationruleminingistofindoutassociationrulesthatsatisfythepredefined minimum support and confidence from a given database. The problem is usually decomposed into two subproblems. One is to find those itemsets whose
52
LabManual:DWDM
occurrences exceed a predefined threshold in the database those itemsets are called frequent or large itemsets. The second problem is to generate association rules from those large itemsets with the constraints of minimal confidence. Suppose one of the large itemsets is Lk, Lk = {I1, I2, , Ik}, association rules withthisitemsetsaregeneratedinthefollowingway:thefirstruleis{I1,I2,, Ik1}and {Ik}, by checking the confidence this rule can be determined as interesting or not. Then other rule are generated by deleting the last items in the antecedent and inserting it to the consequent, further the confidences of the new rules are checked to determine the interestingness of them. Those processes iterateduntiltheantecedentbecomesempty.Sincethesecondsubproblemisquite straightforward,mostoftheresearchesfocusonthefirstsubproblem.TheApriori algorithmfindsthefrequentsetsLInDatabaseD.

FindfrequentsetLk1. JoinStep.
o
Ck isgeneratedbyjoiningLk1withitself
PruneStep.
o
Any (k 1) itemset that is not frequent cannot be a subset of a frequentkitemset,henceshouldberemoved.
where

(Ck:Candidateitemsetofsizek) (Lk:frequentitemsetof sizek)
AprioriPseudocode
53
LabManual:DWDM
Apriori(T,) L<{Large1itemsetsthatappearinmorethantransactions} K<2 whileL(k1) C(k)<Generate(Lk1) fortransactionstT C(t)Subset(Ck,t) forcandidatescC(t) count[c]<count[c]+1 L(k)<{cC(k)|count[c] K<K+1 returnL(k) k
EXAMPLE: A large supermarket tracks sales data by SKU (item), and thus is able to know whatitemsaretypicallypurchasedtogether.Aprioriisamoderatelyefficientway tobuildalistoffrequentpurchaseditempairsfromthisdata.Letthedatabaseof transactionsconsistofthesetsare T1:{1,2,3,4}, T2:{2,3,4}, T3:{2,3}, T4:{1,2,4}, T5:{1,2,3,4},and T6:{2,4}.
54
LabManual:DWDM
Eachnumbercorrespondstoaproductsuchas"butter"or"water".Thefirststep of Apriori to count up the frequencies, called the supports, of each member item separately: ItemSupport 1 2 3 4 3 6 4 5
Wecandefineaminimumsupportleveltoqualifyas"frequent,"whichdependson thecontext.Forthiscase,letminsupport=3.Therefore,allarefrequent.Thenext stepistogeneratealistofall2pairsofthefrequentitems.Hadanyoftheabove itemsnotbeenfrequent,theywouldn'thavebeenincludedasapossiblememberof possible2itempairs Inthisway,Apriori prunesthetreeofallpossiblesets.. Item Support {1,2}3 {1,3}2 {1,4}3 {2,3}4 {2,4}5 {3,4}3
This is counting up the occurrences of each of those pairs in the database. Since
55
LabManual:DWDM
minsup=3,wedon'tneedtogenerate3setsinvolving{1,3}.Thisisduetothefact thatsincethey'renotfrequent,nosupersetsofthemcanpossiblybefrequent.Keep going
Item
Support
{1,2,4}3 {2,3,4}3
56
LabManual:DWDM
LabExercises7
Aim:Developanapplicationforclassificationofdata.
Theory: Classificationisadataminingfunctionthatassignsitemsinacollectiontotarget categories or classes. The goal of classification is to accurately predict the target classforeachcaseinthedata.Forexample,aclassificationmodelcouldbeusedto identifyloanapplicantsaslow,medium,orhighcreditrisks. A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developedbasedonobserveddataformanyloanapplicantsoveraperiodoftime. Inadditiontothehistoricalcreditrating,thedatamighttrackemploymenthistory, homeownershiporrental,yearsofresidence,numberandtypeofinvestments,and so on. Credit rating would be the target, the other attributes would be the predictors,andthedataforeachcustomerwouldconstituteacase. Classifications are discrete and do not imply order. Continuous, floatingpoint values would indicate a numerical, rather than a categorical, target. A predictive model with a numerical target uses a regression algorithm, not a classification algorithm. The simplest type of classification problem is binary classification. In binary classification,thetargetattribute hasonly twopossible values: forexample, high
57
LabManual:DWDM
creditratingorlowcreditrating.Multiclasstargetshavemorethantwovalues:for example,low,medium,high,orunknowncreditrating. Inthemodelbuild(training)process,aclassificationalgorithmfindsrelationships between the values of the predictors and the values of the target. Different classification algorithms use different techniques for finding relationships. These relationshipsaresummarizedinamodel,whichcanthenbeappliedtoadifferent datasetinwhichtheclassassignmentsareunknown. Classificationmodelsaretestedbycomparingthepredictedvaluestoknowntarget values in a set of test data. The historical data for a classification project is typically divided into two data sets: one for building the model the other for testingthemodel. Scoring a classification model results in class assignments and probabilities for eachcase.Forexample,amodelthatclassifiescustomersaslow,medium,orhigh valuewouldalsopredicttheprobabilityofeachclassificationforeachcustomer. Classificationhasmanyapplicationsincustomersegmentation,businessmodeling, marketing,creditanalysis,andbiomedicalanddrugresponsemodeling.
DifferentClassificationAlgorithms OracleDataMiningprovidesthefollowingalgorithmsforclassification:
DecisionTree Decisiontreesautomaticallygeneraterules,whichareconditionalstatements thatrevealthelogicusedtobuildthetree.
58
LabManual:DWDM
NaiveBayes NaiveBayesusesBayes'Theorem,aformulathatcalculatesaprobabilityby countingthefrequencyofvaluesandcombinationsofvaluesinthehistorical data.
GeneralizedLinearModels(GLM) GLM is a popular statistical technique for linear modeling. Oracle Data MiningimplementsGLMforbinaryclassificationandforregression. GLMprovidesextensivecoefficientstatisticsandmodelstatistics,aswellas rowdiagnostics.GLMalsosupportsconfidencebounds.
SupportVectorMachine Support Vector Machine (SVM) is a powerful, stateoftheart algorithm based on linear and nonlinear regression. Oracle Data Mining implements SVMforbinaryandmulticlassclassification.
EXAMPLE: In classification classes are predefine. So the temperature is divided into 3 classesthatarelow,medium,highcategory. If you enter the temperature of the day and you will get output as the temperaturebelongstowhichclass.
59
LabManual:DWDM
LabExercises8
Aim:Developanapplicationforoneclusteringtechnique
S/wRequirement: matlab,C,C++.
Theory: Clusteranalysisorclusteringistheassignmentofasetofobservationsintosubsets (calledclusters)sothatobservationsinthesameclusteraresimilarinsomesense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining,patternrecognition,imageanalysisandbioinformatics.
Typesofclustering Dataclusteringalgorithmscanbehierarchical.Hierarchicalalgorithmsfind successiveclustersusingpreviouslyestablishedclusters.Thesealgorithmscanbe eitheragglomerative("bottomup")ordivisive("topdown").Agglomerative algorithmsbeginwitheachelementasaseparateclusterandmergetheminto successivelylargerclusters.Divisivealgorithmsbeginwiththewholesetand proceedtodivideitintosuccessivelysmallerclusters. Partitional algorithmstypicallydetermineallclustersatonce,butcanalsobeused asdivisivealgorithmsinthehierarchical clustering. Densitybasedclusteringalgorithmsaredevisedtodiscoverarbitraryshaped clusters.Inthisapproach,aclusterisregardedasaregioninwhichthedensityof
60
LabManual:DWDM
dataobjectsexceedsathreshold.DBSCANandOPTICSaretwotypical algorithmsofthiskind. Twowayclustering,coclustering orbiclustering areclusteringmethodswherenot onlytheobjectsareclusteredbutalsothefeaturesoftheobjects,i.e.,ifthedatais representedinadatamatrix,therowsandcolumnsareclusteredsimultaneously. Manyclusteringalgorithmsrequirespecificationofthenumberofclustersto produceintheinputdataset,priortoexecutionofthealgorithm.Barring knowledgeofthepropervaluebeforehand,theappropriatevaluemustbe determined,aproblemforwhichanumberoftechniqueshavebeendeveloped.
kmeansclustering:
In statistics and machine learning, kmeans clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observationbelongstotheclusterwiththenearestmean.
ALGORITHM: Regardingcomputationalcomplexity,thekmeansclusteringproblemis:

NPhardingeneralEuclideanspacedevenfor2clusters NPhardforageneralnumberofclusterskevenintheplane
dk+1 If kanddarefixed,theproblemcanbeexactlysolvedintimeO(n logn),
wherenisthenumberofentitiestobeclustered
61
LabManual:DWDM
Standardalgorithm The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the kmeans algorithm it is also referred to as Lloyd's algorithm,particularlyinthecomputersciencecommunity. Givenaninitialsetofkmeansm1,,mk,which maybespecifiedrandomlyorby someheuristic,thealgorithmproceedsbyalternatingbetweentwosteps: Assignment step: Assign each observation to the cluster with the closest mean (i.e. partition the observations according to the Voronoi diagram generatedbythemeans).
Updatestep:Calculatethenewmeanstobethecentroidoftheobservations inthecluster.
KMeansClusteringExample
We recall from the previous lecture, that clustering allows for unsupervised learning. That is, the machine / software will learn on its own, using the data (learningset),andwillclassifytheobjectsintoaparticularclassforexample,if ourclass(decision)attribute istumor Typeand itsvaluesare: malignant,benign, etc.thesewillbetheclasses.Theywillberepresentedbycluster1,cluster2,etc. However, the class information is never provided to the algorithm. The class information can be used later on, to evaluate how accurately the algorithm classifiedtheobjects.
62
LabManual:DWDM
Curvature Texture Blood
Tumor
Consump Type 0.8 0.75 0.23 0.23 1.2 1.4 0.4 0.5 A B D D Benign Benign Malignant Malignant Curvature Texture Blood Tumor
Consump Type x1 0.8 x2 0.75 x3 0.23 x4 0.23 . . L learningset) Thewaywedothat,isbyplottingtheobjectsfromthedatabaseintospace.Each attributeisonedimension 1.2 1.4 0.4 0.5 A B D D Benign Benign Malignant Malignant
Texture 1.2
.
0.23
x1
D 0.4 B A
Blood Consump
0.8 Curvature
63
LabManual:DWDM
Afteralltheobjectsareplotted,wewillcalculatethedistancebetweenthem,and theonesthatareclosetoeachotherwewillgroupthemtogether,i.e.placethem inthesamecluster.
Texture
1.2 benign
0.4
. . . . . ..
Cluster2 malignant Cluster1
Blood Consump
A 0.23
0.8 Curvature
EXAMPLE:
Problem:Clusterthefollowingeightpoints(with(x,y)representinglocations)into threeclustersA1(2,10)A2(2,5)A3(8,4)A4(5,8)A5(7,5)A6(6,4)A7(1,2) A8(4, 9). Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2). The distance functionbetween two points a=(x1, y1) and b=(x2, y2) is defined as: (a,b)=|x2x1|+|y2 y1|.
Usekmeansalgorithmtofindthethreeclustercentersaftertheseconditeration.
64
LabManual:DWDM
Iteration1 (2,10) Point A1 (2,10) A2 (2,5) A3 (8,4) A4 (5,8) A5 (7,5) A6 (6,4) A7 (1,2) A8 (4,9) (5,8) (1,2)
DistMean1 DistMean2 DistMean3 Cluster
First we list all points in the first column of the table above. The initial cluster centersmeans,are(2,10),(5,8) and(1,2)chosenrandomly.Next,wewill calculate the distance from the first point (2, 10) to each of the three means, by usingthedistancefunction:
point x1,y1 (2,10)
mean1 x2,y2 (2,10) (a,b)=|x2x1|+|y2 y1|
(point,mean1)=|x2x1|+|y2y1| =|22|+|1010| =0+0 =0
65
LabManual:DWDM
point x1,y1 (2,10)
mean2 x2,y2 (5,8)
(a,b)=|x2x1|+|y2 y1| (point,mean2)=|x2x1|+|y2y1| =|52|+|810| =3+ 2 =5
point x1,y1 (2,10)
mean3 x2,y2 (1,2)
(a,b)=|x2x1|+|y2 y1| (point,mean2)=|x2x1|+|y2y1| =|12|+|210| =1+ 8 =9
66
LabManual:DWDM
So,wefillinthesevaluesinthetable: (2,10) Point (5,8) (1,2)
Dist Mean Dist Mean Dist Mean Cluster 1 2 5 3 9 1
A1 (2,10) A2 (2,5) A3 (8,4) A4 (5,8) A5 (7,5) A6 (6,4) A7 (1,2) A8 (4,9)
So,whichclustershouldthepoint(2,10)beplacedin?Theone,wherethepoint hastheshortestdistancetothemeanthatismean1(cluster1),sincethedistance is0. Cluster1 (2,10) Cluster2 Cluster3
So,wegotothesecondpoint(2,5)andwewillcalculatethedistancetoeachof thethreemeans,byusingthedistancefunction: point x1,y1 (2,5) mean1 x2,y2 (2,10)
67
LabManual:DWDM
(a,b)=|x2x1|+|y2 y1| (point,mean1)=|x2x1|+|y2y1| =|22|+|105| =0+5 =5 point x1,y1 (2,5) mean2 x2,y2 (5,8)
(a,b)=|x2x1|+|y2 y1| (point,mean2)=|x2x1|+|y2y1| =|52|+|85| =3+3 =6 point x1,y1 (2,5) mean3 x2,y2 (1,2)
(a,b)=|x2x1|+|y2 y1| (point,mean2)=|x2x1|+|y2y1| =|12|+|25| =1+3 =4
68
LabManual:DWDM
So,wefillinthesevaluesinthetable: Iteration1 (2,10) Point (5,8) (1,2)
Dist Mean Dist Mean Dist Mean Cluster 1 2 5 6 3 9 4 1 3
A1 (2,10) A2 (2,5) A3 (8,4) A4 (5,8) A5 (7,5) A6 (6,4) A7 (1,2) A8 (4,9)
0 5
So,whichclustershouldthepoint (2,5)beplaced in?Theone,wherethepoint hastheshortestdistancetothemeanthatismean3(cluster3),sincethedistance is0.
Cluster1 (2,10)
Cluster2
Cluster3 (2,5)
Analogically, we fill in the rest of the table, and place each point in one of the clusters:
69
LabManual:DWDM
Iteration1 (2,10) Point (5,8) (1,2)
Dist Mean Dist Mean Dist Mean Cluster 1 2 5 6 7 0 5 5 10 2 3 9 4 9 10 9 7 0 10 1 3 2 2 2 2 3 2
A1 (2,10) A2 (2,5) A3 (8,4) A4 (5,8) A5 (7,5) A6 (6,4) A7 (1,2) A8 (4,9)
0 5 12 5 10 10 9 3
Cluster1 (2,10)
Cluster2 (8,4) (5,8) (7,5) (6,4) (4,9)
Cluster3 (2,5) (1,2)
Next,weneedtorecomputethenewclustercenters(means).Wedoso,bytaking themeanofallpointsineachcluster.
ForCluster1,weonlyhaveonepointA1(2,10),whichwastheoldmean,sothe clustercenterremainsthesame. ForCluster2,wehave((8+5+7+6+4)/5,(4+8+5+4+9)/5)=(6,6) ForCluster3,wehave((2+1)/2,(5+2)/2)=(1.5,3.5)

70
LabManual:DWDM
Theinitialclustercentersareshowninreddot.Thenewclustercentersareshownin redx. ThatwasIteration1(epoch1).Next,wegotoIteration2(epoch2),Iteration3,andso onuntilthemeansdonotchangeanymore.

71
LabManual:DWDM
In Iteration2, we basically repeat the process from Iteration1 this time using the newmeanswecomputed.
Kmeans:ProblemsandLimitations Based on minimizing within cluster error a criterion that is not appropriate for many situations.Unsuitable when clusters have widely different sizes or have convexshapes.
72
LabManual:DWDM
Restricted to data in Euclidean spaces, but variants of Kmeans can be used for othertypesofdata.
Kmediodmethod: Thekmedoidsalgorithmisaclusteringalgorithmrelatedtothekmeansalgorithm and the medoidshift algorithm. Both the kmeans and kmedoids algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize squared error, the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the kmeans algorithm k medoidschoosesdatapointsascenters(medoidsorexemplars).
ALGORITHM: The most common realization of kmedoid clustering is the Partitioning Around Medoids(PAM)algorithmandisasfollows: 1. Initialize:randomlyselectkofthendatapointsasthemediods 2. Associate each data point to the closest medoid. ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, ManhattandistanceorMinkowskidistance) 3. Foreachmediodm 1. Foreachnonmedioddatapointo 1. Swapmandoandcomputethetotalcostoftheconfiguration 4. Selecttheconfigurationwiththelowestcost. 5. Repeatsteps2to5untilthereisnochangeinthemedoid.
73
LabManual:DWDM
EXAMPLE: Clusterthefollowingdatasetoftenobjectsintotwoclustersi.ek=2. Consideradatasetoftenobjectsasfollows:

X1 2 6
X2
X3
X4
X5
X6
X7
X8
X9
X10
74
LabManual:DWDM
FigureAdistributionofthedata
Step1 Initialisekcentre Letusassumec1 =(3,4)andc2 =(7,4) Soherec1 andc2 areselectedasmedoid. Calculatingdistancesoastoassociateeachdataobjecttoitsnearestmedoid.Cost iscalculatedusing Minkowskidistancemetricwith r=1.
c1
Dataobjects (Xi)
Cost (distance)
75
LabManual:DWDM
c2
Dataobjects (Xi)
Cost (distance)
76
LabManual:DWDM
Thensotheclustersbecome: Cluster1 ={(3,4)(2,6)(3,8)(4,7)} Cluster2 ={(7,4)(6,2)(6,4)(7,3)(8,5)(7,6)} Since the points (2,6) (3,8) and (4,7) are close to c1 hence they form one cluster whilstremainingpointsformanothercluster. Sothetotalcostinvolvedis20. Wherecostbetweenanytwopointsisfoundusingformula cost(x,c)=|xc|fromi=1tod where x is any data object, c is the medoid, and d is the dimension of the object whichinthiscaseis2. Totalcostisthesummationofthecostofdataobjectfromitsmedoidinitscluster sohere: Totaolcost={cost((3,4),(2,6))+cost((3,4),(3,8))+cost((3,4),(4,7))} +cost((7,4),(6,2))+cost((7,4),(6,4))+cost((7,4)(7,3)) +cost((7,4),(8,5))+cost((7,4),(7,6))}
=(3+4+4)+(3+1+1+2+2) =20
77
LabManual:DWDM
Step2 SelectionofnonmedoidO randomly LetusassumeO=(7,3) Sonowthemedoidsarec1(3,4)andO(7,3) If c1andO arenewmedoids,calculatethetotalcostinvolved Byusingtheformulainthestep1
c1
Dataobjects (Xi)
Cost (distance)
5 78
LabManual:DWDM
Dataobjects (Xi)
Cost (distance)
79
LabManual:DWDM
Figure1.3clustersafterstep2
totalcost=3+4+4+2+2+1+3+3 =22 Socostofswappingmedoidfrom c2 toO is S=currenttotalcostpasttotalcost =2220 =2 So moving to O would be bad idea, so the previous choice was good and algorithmterminateshere(i.ethereisnochangeinthemedoids). It may happen some data points may shift from one cluster to another cluster dependingupontheirclosenesstomedoid
80
LabManual:DWDM
LabExercises9
Aim:DevelopanapplicationforimplementingNaveBayesclassifier.
Theory:
Naive Bayes classifier assumes that the presence (or absence) of a particular featureofaclassisunrelatedtothepresence(orabsence)ofanyotherfeature.For example,afruitmaybeconsideredtobeanappleifitisred,round,andabout4"in diameter.Eventhoughthesefeaturesdependontheexistenceoftheotherfeatures, a naive Bayes classifier considers all of these properties to independently contributetotheprobabilitythatthisfruitisanapple. An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary forclassification.Because independent variablesareassumed,onlythe variancesofthe variables foreachclass needtobedeterminedandnottheentire covariancematrix ThenaiveBayesprobabilisticmodel: Theprobabilitymodelforaclassifierisaconditionalmodel P(C|F1.................Fn) over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. The problem is that if the number of features n is large or when a feature can take on a large number of
81
LabManual:DWDM
values,thenbasingsuchamodelonprobabilitytablesisinfeasible.Wetherefore reformulatethemodeltomakeitmoretractable. UsingBayes'theorem,wewrite P(C|F1...............Fn)=[{p(C)p(F1..................Fn|C)}/p(F1,........Fn)] InplainEnglishtheaboveequationcanbewrittenas Posterior=[(prior*likehood)/evidence] In practice we are only interested in the numerator of that fraction, since the denominatordoesnotdependonCandthevaluesofthefeaturesFi aregiven,so that the denominator is effectively constant. The numerator is equivalent to the jointprobabilitymodel p(C,F1........Fn) whichcanberewrittenasfollows,usingrepeatedapplicationsofthedefinitionof conditionalprobability: p(C,F1........Fn) =p(C)p(F1............Fn|C) =p(C)p(F1|C)p(F2.........Fn|C,F1,F2) =p(C)p(F1|C)p(F2|C,F1)p(F3.........Fn|C,F1,F2) =p(C)p(F1|C)p(F2|C,F1)p(F3.........Fn|C,F1,F2)......p(Fn|C,F1,F2,F3.........Fn1)
82
LabManual:DWDM
Now the "naive" conditional independence assumptions come into play: assume thateachfeatureFi isconditionallyindependentofeveryotherfeatureFj forji. Thismeansthat p(Fi|C,Fj)=p(Fi|C) andsothejointmodelcanbeexpressedas p(C,F1,.......Fn)=p(C)p(F1|C)p(F2|C)........... =p(C)p(Fi|C) This means that under the above independence assumptions, the conditional distributionovertheclassvariableCcanbeexpressedlikethis: p(C|F1..........Fn)=p(C) p(Fi|C) Z where Z is a scaling factor dependent only on F1.........Fn, i.e., a constant if the valuesofthefeaturevariablesareknown. Modelsofthisformaremuchmoremanageable,sincetheyfactorintoasocalled class prior p(C) and independent probability distributions p(Fi|C). If there are k classesandifamodelforeachp(Fi|C=c)canbeexpressedintermsofrparameters, then the corresponding naive Bayes model has (k 1) + n r k parameters. In practice, often k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes modelis2n+1,wherenisthenumberofbinaryfeaturesusedforprediction
P(h/D)=P(D/h)P(h)
83
LabManual:DWDM
P(D) P(h):Priorprobabilityofhypothesish P(D):PriorprobabilityoftrainingdataD P(h/D):ProbabilityofhgivenD P(D/h):ProbabilityofDgivenh
NaveBayesClassifier:Derivation D:Setoftuples EachTupleisanndimensionalattributevector X:(x1,x2,x3,.xn) LettherememClasses:C1,C2,C3Cm NBclassifierpredictsXbelongstoClassCiiff P(Ci/X)>P(Cj/X)for1<=j<=m,j<>i MaximumPosterioriHypothesis P(Ci/X)=P(X/Ci)P(Ci)/P(X) MaximizeP(X/Ci)P(Ci)asP(X)isconstant NaveBayesClassifier:Derivation Withmanyattributes,itiscomputationallyexpensivetoevaluateP(X/Ci) NaveAssumptionofclassconditionalindependence
84
LabManual:DWDM
P(X/Ci)=n P(xk/Ci) k=1 P(X/Ci)=P(x1/Ci)*P(x2/Ci)**P(xn/Ci)
EXAMPLE!:
D:35yearoldcustomerwithanincomeof$50,000PA h:Hypothesisthatourcustomerwillbuyourcomputer P(h/D) : Probability that customer D will buy our computer given that we knowhisageandincome P(h):Probabilitythatanycustomerwillbuyourcomputerregardlessofage (PriorProbability)
85
LabManual:DWDM
P(D/h):Probabilitythatthecustomeris35yrsoldandearns$50,000,given thathehasboughtourcomputer(PosteriorProbability) P(D):Probabilitythatapersonfromoursetofcustomersis35yrsoldand earns$50,000 MaximumAPosteriori(MAP)Hypothesis: Generallywewantthemostprobablehypothesisgiventhetrainingdata
hMAP=argmaxP(h/D)(wherehbelongstoHandHisthe hypothesisspace)
hMAP=argmax P(D/h)P(h) P(D)
hMAP=argmaxP(D/h)P(h) Generallywewantthemostprobablehypothesisgiventhetrainingdata hMAP=argmaxP(h/D)(wherehbelongstoHandHisthe hypothesisspace)
hMAP=argmax P(D/h)P(h) P(D)
86
LabManual:DWDM
hMAP=argmaxP(D/h)P(h) h1:Customerbuysacomputer=Yes h2:Customerbuysacomputer=No Whereh1andh2aresubsetsofourHypothesisSpaceH P(h/D)(FinalOutcome)=argmax{P(D/h1)P(h1),P(D/h2)P(h2)} P(D)canbeignoredasitisthesameforboththeterms MaximumLikelihood(ML)Hypothesis:
IfweassumeP(hi)=P(hj) Wherethecalculatedprobabilitiesamounttothesame Furthersimplificationleadsto hML=argmaxP(D/hi)(wherehibelongstoH)
87
LabManual:DWDM
Sr.No
Age
Income
Student
Creditrating
Buys computer
1 2 3 4 5 6 7 8 9 10
35 30 40 35 45 35 35 25 28 35
Medium High Low Medium Low High Medium Low High Medium
Yes No Yes No No No No No No Yes
Fair Average Good Fair Fair Excellent Good Good Average Average
Yes No No Yes Yes Yes No No No Yes
P(buyscomputer=yes)=5/10=0.5 P(buyscomputer=no)=5/10=0.5 P(customeris35yrs&earns$50,000)=4/10=0.4

88
LabManual:DWDM
P(customeris35yrs&earns$50,000/buyscomputer=yes)=3/5=0.6 P(customeris35yrs&earns$50,000/buyscomputer=no)=1/5=0.2 Findwhetherthecustomerbuysacomputerornot? CustomerbuysacomputerP(h1/D) =P(h1)*P(D/h1)/P(D) =0.5*0.6/0.4 CustomerdoesnotbuyacomputerP(h2/D) =P(h2)*P(D/h2)/P(D) =0.5*0.2/0.4 FinalOutcome=argmax{P(h1/D),P(h2/D)} =max(0.6,0.2)
Customerbuysacomputer BasedontheBayesiantheorem Particularlysuitedwhenthedimensionalityoftheinputsishigh ParameterestimationfornaiveBayesmodelsusesthemethodofmaximum likelihood In spite oversimplified assumptions, it often performs better in many complexrealworldsituations
89
LabManual:DWDM
Advantage : Requires a small amount of training data to estimate the parameters EXAMPLE2:
X=(age=youth,income=medium,student=yes,credit_rating=fair) P(C1)=P(buys_computer=yes)=9/14=0.643 P(C2)=P(buys_computer=no)=5/14=0.357

90
LabManual:DWDM
P(age=youth/buys_computer=yes)=2/9=0.222 P(age=youth/buys_computer=no)=3/5=0.600 P(income=medium/buys_computer=yes)=4/9=0.444 P(income=medium/buys_computer=no)=2/5=0.400 P(student=yes/buys_computer=yes)=6/9=0.667 P(student=yes/buys_computer=no)=1/5=0.200 P(creditrating=fair/buys_computer=yes)=6/9=0.667 P(creditrating=fair/buys_computer=no)=2/5=0.400
P(X/Buysacomputer=yes) =P(age=youth/buys_computer=yes)*P(income=medium/buys_computer= yes)*P(student=yes/buys_computer=yes)*P(creditrating=fair/buys_computer =yes) =0.222*0.444*0.667*0.667=0.044 P(X/Buysacomputer=No) =0.600*0.400*0.200*0.400=0.019 FindclassCithatMaximizesP(X/Ci)*P(Ci) P(X/Buysacomputer=yes)*P(buys_computer=yes)=0.028 P(X/Buysacomputer=No)*P(buys_computer=no)=0.007 Prediction:BuysacomputerforTupleX
91
LabManual:DWDM
ADVANTAGES: The advantage of Bayesian spam filtering is that it can be trained on a peruser basis. The spam that a user receives is often related to the online user's activities. For example, a user may have been subscribed to an online newsletter that the user considers to be spam. This online newsletter is likely to contain words that are commontoall newsletters,suchasthe nameof the newsletterand itsoriginating email address. A Bayesian spam filter will eventually assign a higher probability basedontheuser'sspecificpatterns. Thelegitimateemailsauserreceiveswilltendtobedifferent.Forexample,ina corporateenvironment,thecompany nameandthe namesofclientsorcustomers willbementionedoften.Thefilterwillassignalowerspamprobabilitytoemails containingthosenames. The word probabilities are unique to each user and can evolve over time with corrective trainingwheneverthe filter incorrectlyclassifiesanemail. Asaresult, Bayesian spam filtering accuracy after training is often superior to predefined rules. Itcanperformparticularlywellinavoidingfalsepositives,wherelegitimateemail is incorrectly classified as spam. For example, if the email contains the word "Nigeria",whichisfrequentlyusedinAdvancefeefraudspam,apredefinedrules filtermightrejectitoutright.ABayesianfilterwouldmarktheword"Nigeria"asa probable spam word, but would take into account other important words that usuallyindicatelegitimateemail.Forexample,thenameofaspousemaystrongly
92
LabManual:DWDM
indicate the email is not spam, which could overcome the use of the word "Nigeria." DISADVANTAGES: BayesianspamfilteringissusceptibletoBayesianpoisoning,atechniqueusedby spammers in an attempt to degrade the effectiveness of spam filters that rely on Bayesianfiltering.AspammerpracticingBayesianpoisoningwillsendoutemails with large amounts of legitimate text (gathered from legitimate news or literary sources). Spammer tactics include insertion of random innocuous words that are not normally associated with spam, thereby decreasing the email's spam score, makingitmorelikelytoslippastaBayesianspamfilter. AnothertechniqueusedtotrytodefeatBayesianspamfiltersistoreplacetextwith pictures,eitherdirectlyincludedorlinked.Thewholetextofthemessage,orsome partofit,isreplacedwithapicturewherethesametextis"drawn".Thespamfilter isusuallyunabletoanalyzethispicture,whichwouldcontainthesensitivewords like "Viagra". However, since many mail clients disable the display of linked pictures forsecurityreasons,thespammersending linkstodistant pictures might reach fewer targets. Also, a picture's size in bytes is bigger than the equivalent text's size, so the spammer needs more bandwidth to send messages directly includingpictures.Finally,somefiltersaremoreinclinedtodecidethatamessage isspamifithasmostlygraphicalcontents. AprobablymoreefficientsolutionhasbeenproposedbyGoogleandisusedbyits Gmailemailsystem,performinganOCR(OpticalCharacterRecognition)toevery midtolargesizeimage,analyzingthetextinside.
93
LabManual:DWDM
LabExercises10
Aim:Developanapplicationfordecisiontree.
S/wRequirement:.C,C++.
Theory:
Decisiontreelearning,usedindataminingandmachinelearning,usesadecision treeasapredictive modelwhich mapsobservationsaboutan itemtoconclusions abouttheitem'stargetvalueInthesetreestructures,leavesrepresentclassifications andbranchesrepresentconjunctionsoffeaturesthatleadtothoseclassifications.In decision analysis, a decision tree can be used to visually and explicitly represent decisionsanddecision making.Indata mining,adecisiontreedescribesdata but not decisions rather the resulting classification tree can be an input for decision making.Thispagedealswithdecisiontreesindatamining.
Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables there are edgestochildren foreachofthepossible valuesofthat input variable.Each leaf represents a value of the target variable given the values of the input variables representedbythepathfromtheroottotheleaf.
94
LabManual:DWDM
Atreecanbe"learned"bysplittingthesourcesetintosubsetsbasedonanattribute value test. This process is repeated on each derived subset in a recursive manner calledrecursivepartitioning.Therecursioniscompletedwhenthesubsetatanode allhasthesamevalueofthetargetvariable,orwhensplittingnolongeraddsvalue tothepredictions. Indatamining,treescanbedescribedalsoasthecombinationofmathematicaland computational techniques to aid the description, categorisation and generalisation ofagivensetofdata. Datacomesinrecordsoftheform: (x,y)=(x1,x2,x3...,xk,y) Thedependentvariable,Y,isthetargetvariablethatwearetryingtounderstand, classifyor generalise. The vector xiscomprised of the input variables, x1, x2, x3 etc.,thatareusedforthattask.
Typesoftree: Indatamining,treeshaveadditionalcategories:
Classification tree analysis is when the predicted outcome is the class to whichthedatabelongs.
Regressiontreeanalysisiswhenthepredictedoutcomecanbeconsidereda real number (e.g. the price of a house, or a patients length of stay in a hospital).
ClassificationAndRegressionTree(CART)analysisisusedtorefertoboth oftheaboveprocedures,firstintroducedbyBreimanetal.
95
LabManual:DWDM
CHisquaredAutomaticInteractionDetector(CHAID).Performsmultilevel splitswhencomputingclassificationtrees.
A Random Forest classifier uses a number of decision trees, in order to improvetheclassificationrate.
Boosting Trees can be used for regressiontype and classificationtype problems
96
LabManual:DWDM
Weatherdata
OUTLOOK (INDEPENDENT VARIABLE) TEMPERATURE (INDEPENDENT VARIABLE) HUMIDITY (INDEPENDENT VARIABLE) WINDY (INDEPENDENT VARIABLE) PLAY (DEPENDENT VARIABLE)
Sunny
85
85
FALSE
DONTPLAY
Sunny
80
90
TRUE
DONTPLAY
overcast
83
78
FALSE
PLAY
rain
70
96
FALSE
PLAY
rain
68
80
FALSE
PLAY
rain
65
70
TRUE
DONTPLAY
overcast
64
65
TRUE
PLAY
Sunny
72
95
FALSE
DONTPLAY
Sunny
69
70
FALSE
PLAY
rain
75
80
FALSE
PLAY
Sunny
75
70
TRUE
PLAY
overcast
72
90
TRUE
PLAY
overcast
81
75
FALSE
PLAY
rain
71
80
TRUE
DONTPLAY
97
LabManual:DWDM
FigA.1decisiontree
98
LabManual:DWDM
EXAXMPLE: Selectattributethathasthemostpurenodes.
Foreachtest:
Maximalpurity:allvaluesarethesame. Minimalpurity: equalnumberofeachvalues. Find a scale between maximal and minimal, and then merge across all of the attributetests.
Anfunctiontocalculatethisiscalledtheentropyfunction. Entropy(p1,p2pn)=p1*log(p1)+p2*log(p2)+pn*log(pn)
p1pn are the number of instances of each class,expressed as a fraction of the totalnumberofinstancesatthatpointinthetree.logisbase2.
Foroutlooktherearethreetests:
Sunny:info(2,3)=2/5log2(2/5) 3/5log2(3/5)=0.5287+0.4421=0.971 Overcast:info(4,0)=4/4log2(4/4)0=0 Rainy:info(3,2)=0.971 Wehave14instancestodividedownthosepaths.Sothetotalforoutlookis: (5/14*.971)+(4/14*.971)+(5/14*.971)=0.693 To calculate the gain we work out the entropy for the top node and subtract the
99
LabManual:DWDM
entropyforoutlook: Info(9,5)=0.940 Gain(outlook)=0.9400.693=0.247 Similarlycalculatingthegainfortheothers: Gain(windy)=0.048 Gain(temperature)=0.029 Gain(humidity)=0.152 Selectingthemaximumwhichisoutlook.Thisiscalledinformationgain.
Gainratioofdecisionvariables
Outlook 1. Info=0.693 2. Gain=0.9400.693=0.247 3. Splitinfo=info[5,4,5]=1.577 4. Gainratio=0.247/1.577=0.156 Info=5/14[(2/5log2(2/5)3/5log2(3/5)]+4/14[4/4log2(4/4)0log20]+5/14[ 3/5log2(3/5)2/5log2(2/5)]=0.693 Gain=.940.693=.247 Splitinfo=1.577 Gainratio=.247/1.577=.156 Theattributewiththehighestinformationgainisselectedasthesplittingattribute
100
LabManual:DWDM
ADVANTAGES: Amongstotherdataminingmethods,decisiontreeshavevariousadvantages:
Simple to understand and interpret. People are able to understand decision treemodelsafterabriefexplanation.
Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed.
Able to handle both numerical and categorical data. Other techniques are usuallyspecialisedinanalysingdatasetsthathaveonlyonetypeofvariable. Ex: relation rules can be used only with nominal variables while neural networkscanbeusedonlywithnumericalvariables.
Use a white box model. If a given situation is observable in a model the explanation for the condition is easily explained by boolean logic. An example of a black box model is an artificial neural network since the explanationfortheresultsisdifficulttounderstand.
Possibletovalidateamodelusingstatisticaltests.Thatmakesitpossibleto accountforthereliabilityofthemodel.
Robust.Performswellevenifitsassumptionsaresomewhatviolatedbythe truemodelfromwhichthedataweregenerated.
Performwellwithlargedatainashorttime.Largeamountsofdatacanbe analysed using personal computers in a time short enough to enable stakeholderstotakedecisionsbasedonitsanalysis.
101
LabManual:DWDM
DISADVANTAGES:
The problem of learning an optimal decision tree is known to be NP complete. Consequently, practical decisiontree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimaldecisionsaremadeateachnode.Suchalgorithmscannotguarantee toreturnthegloballyoptimaldecisiontree.
Decisiontree learners create overcomplex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning are necessarytoavoidthisproblem.
There are concepts that are hard to learn because decision trees do not expressthemeasily,suchasXOR,parity or multiplexerproblems. Insuch cases,thedecisiontreebecomesprohibitivelylarge.Approachestosolvethe problem involve either changing the representation of the problem domain (knownaspropositionalisation)orusinglearningalgorithmsbasedonmore expressiverepresentations(suchasstatisticalrelationallearningorinductive logicprogramming).
102
LabManual:DWDM
WARMUPEXCERCISES: Whatisdatabase? WhatisDBMS? Advantageofdatabase Whatisdatamodel? Whatisobjectorientedmodel? Whatisan entity? Whatisanentitytype? Whatisanentityset? Whatisweakentityset? Whatisrelationship? WhatisDDL? WhatisDML? Whatisnormalization? Whatisfunctionaldependency?
st Whatis1 NF? nd Whatis2 NF? rd Whatis3 NF?
WhatisBCNF? Whatisfullyfunctionaldependency? Whatdoyoumeanofaggregation,atomicity? Whataredifferentphasesoftransaction? Whatdoyoumeanofflatfiledatabase? Whatisquery? Whatisthenameofbufferwhereallcommandsarestored? AretheresultingofrelationPRODUCTandJOINoperationissame?

103
LabManual:DWDM
Isdata,transientorperiodicindatamart? CanaDataWarehousealsobeusedforTransactionDatabase? HowdoestheusageofDatasetimprovetheperformanceduringlookups? WhatareStagingandOperationalData Storesandwhat istheir difference?Why doesoneneedstagingandwhydoesoneneedODS?Whatwillhappeniftheyare notthere? WhatarethevariousOLAP&MOLAPoperations? Explainwhydatabasedesignisimportant? WhenistheneedtodoOOAD(Objectorientedanalysis&design) HowyoudifferentiatebetweenSOA&WebServices? WhatistheuseoftheUserdefinedEnvironmentvariables? Describethestagesofextract/transformprocess,includingwhatshouldhappenat eachstageandinwhatorder? Whataresometypicalconsiderationsduringthedatacleansing phase? WhatarethedifferenttoolsavailableforDataExtraction? WhyisitnecessarytocleandatabeforeloadingitintotheWarehouse? WhataretherolesandresponsibilitiesofanETLdeveloper/designer/teamlead? IsthereanyoptionotherthanSurrogatekeyorconcatenatedkey? HowdoyoumanagethedatabasetriggersinDWH? Whichapproachisbetterandwhy?Loadingdatafrom datamartstodata warehouseorviceversa? Howdoesreplicationtakesplacebetweentwosites? HowcanyousayRelationalisNormalizedoneandDataWarehouse? Whatarethestepstobuildthedatawarehouse? WhatisthedifferencebetweenODSandStaging? WhatisthemainFUNCTIONALdifferencebetweenROLAP,MOLAP,HOLAP?
104
LabManual:DWDM
Whatisslicinganddicing?Explainwith realtimeusageandbusinessreasonsof itsuse? ExplaintheflowofdatastartingwithOLTPtoOLAPincludingstaging,summary tables,Factsanddimensions.? WhatisthedifferencebetweenOLAPandOLTP? Whatarethedifferenttracingoptionsavailable?Cantracingoptionsbesetfor individualtransformations?
105
LabManual:DWDM
9.Quizonthesubject:
Quiz should be conducted on tips in the laboratory, recent trends and subject knowledge of subject. The quiz questions should be formulated such that questions are normally are from scope outside of the books. However twisted questions and self formulated questions by faculty can be asked but correctness of it is necessarily to be thoroughly checked before conduction of the quiz. the the the the
10.ConductionofVivaVoceExaminations:
Teacher should oral exams of the students with full preparation. Normally, the objective questions with guess are to be avoided. To make it meaningful, the questions should be such that depth of the students in the subject is tested Oral examinations are to be conducted in co-cordial environment amongst the teachers taking the examination. Teachers taking such examinations should not have ill thoughts about each other and courtesies should be offered to each other in case of difference of opinion, which should be critically suppressed in front of the students.
11.Evaluationandmarkingsystem:
Basic honesty in the evaluation and marking system is absolutely essential and in the process impartial nature of the evaluator is required in the examination system to become popular amongst the students. It is a wrong approach or concept to award the students by way of easy marking to get cheap popularity among the students to which they do not deserve. It is a primary responsibility of the teacher that right students who are really putting up lot of hard work with right kind of intelligence are correctly awarded. The marking patterns should be justifiable to the students without any ambiguity and teacher should see that students are faced with unjust circumstances. The assessment is done according to the directives of the Principal/ Vice-Principal/ Dean Academics.
106

DWDM Lab Manual: Data Warehousing and Data Mining

Transféré par

Informations du document

Description originale:

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

DWDM Lab Manual: Data Warehousing and Data Mining

Transféré par

Droits d'auteur :

Formats disponibles

LabManual:DWDM

DATA WAREHOUSING AND DATA MINING

Author JNEC, Aurangabad

Dr. S.D.Deshmukh Principal

LABORATORY MANUAL CONTENTS

Good Luck for your Enjoyable Laboratory Sessions

Prof. D.S.Deshpande HOD, CSE

Ms. S.B.Jaiwal Lecturer, CSE Dept.

DOs and DONTs in Laboratory:

4. Do not change the terminal on which you are working.

6. Strictly observe the instructions given by the teacher/Lab Instructor.

Instruction for Laboratory Teachers:

10. Develop an application for decision tree.

Dataasset Datagovernance Datasteward

Dataanalysis Dataarchitecture Datamodeling

Datamaintenance Databaseadministration Databasemanagementsystem

Dataaccess Dataerasure Dataprivacy Datasecurity

Datacleansing Dataintegrity Dataquality Dataqualityassurance

Dataintegration MasterDataManagement Referencedata

Datamart Datamining Datamovement(extract,transformandload) Datawarehousing

Metadatamanagement Metadata Metadatadiscovery Metadatapublishing Metadataregistry

SubjectOriented Integrated Nonvolatile TimeVariant

Easyaccesstofrequentlyneededdata Createscollectiveviewbyagroupofusers Improvesenduserresponsetime Easeofcreation LowercostthanimplementingafullDatawarehouse PotentialusersaremoreclearlydefinedthaninafullDatawarehouse

Dimensions:Product,Location,Time Hierarchicalsummarizationpaths IndustryRegionYear

c)AnalyticWorkspaceDimensions A dimension in an analytic workspace is a highly optimized, onedimensional indexofvaluesthatservesasakeytable.Variables,relations,formulas(whichare storedequations)areamongtheobjectsthatcanhavedimensions.

Referential integrity.Eachdimension member is uniqueandcannotbeNA (thatis,null).Ifameasurehasthreedimensions,theneachdatavalueofthat measuremustbequalifiedbyamemberofeachdimension.Likewise,each combinationofdimensionmembershasavalue,evenifitisNA.

Highlydenormalized.Adimensiontypicallycontainsmembersatalllevels ofallhierarchies.Thistypeofdimensionissometimescalledanembedded total dimension.

Aim: Oracle 9i Develop an application to implement data generalization and summarizationtechniques.

milk bread butter beer

Any (k 1) itemset that is not frequent cannot be a subset of a frequentkitemset,henceshouldberemoved.

(Ck:Candidateitemsetofsizek) (Lk:frequentitemsetof sizek)

minsup=3,wedon'tneedtogenerate3setsinvolving{1,3}.Thisisduetothefact thatsincethey'renotfrequent,nosupersetsofthemcanpossiblybefrequent.Keep going

DecisionTree Decisiontreesautomaticallygeneraterules,whichareconditionalstatements thatrevealthelogicusedtobuildthetree.

NaiveBayes NaiveBayesusesBayes'Theorem,aformulathatcalculatesaprobabilityby countingthefrequencyofvaluesandcombinationsofvaluesinthehistorical data.

Curvature Texture Blood

Afteralltheobjectsareplotted,wewillcalculatethedistancebetweenthem,and theonesthatareclosetoeachotherwewillgroupthemtogether,i.e.placethem inthesamecluster.

DistMean1 DistMean2 DistMean3 Cluster

point x1,y1 (2,10)

mean1 x2,y2 (2,10) (a,b)=|x2x1|+|y2 y1|

(point,mean1)=|x2x1|+|y2y1| =|22|+|1010| =0+0 =0

point x1,y1 (2,10)

mean2 x2,y2 (5,8)

(a,b)=|x2x1|+|y2 y1| (point,mean2)=|x2x1|+|y2y1| =|52|+|810| =3+ 2 =5

point x1,y1 (2,10)

mean3 x2,y2 (1,2)

(a,b)=|x2x1|+|y2 y1| (point,mean2)=|x2x1|+|y2y1| =|12|+|210| =1+ 8 =9

So,wefillinthesevaluesinthetable: (2,10) Point (5,8) (1,2)

Dist Mean Dist Mean Dist Mean Cluster 1 2 5 3 9 1

A1 (2,10) A2 (2,5) A3 (8,4) A4 (5,8) A5 (7,5) A6 (6,4) A7 (1,2) A8 (4,9)

So,whichclustershouldthepoint(2,10)beplacedin?Theone,wherethepoint hastheshortestdistancetothemeanthatismean1(cluster1),sincethedistance is0. Cluster1 (2,10) Cluster2 Cluster3

So,wegotothesecondpoint(2,5)andwewillcalculatethedistancetoeachof thethreemeans,byusingthedistancefunction: point x1,y1 (2,5) mean1 x2,y2 (2,10)

(a,b)=|x2x1|+|y2 y1| (point,mean2)=|x2x1|+|y2y1| =|12|+|25| =1+3 =4

So,wefillinthesevaluesinthetable: Iteration1 (2,10) Point (5,8) (1,2)

Dist Mean Dist Mean Dist Mean Cluster 1 2 5 6 3 9 4 1 3

A1 (2,10) A2 (2,5) A3 (8,4) A4 (5,8) A5 (7,5) A6 (6,4) A7 (1,2) A8 (4,9)

So,whichclustershouldthepoint (2,5)beplaced in?Theone,wherethepoint hastheshortestdistancetothemeanthatismean3(cluster3),sincethedistance is0.

Iteration1 (2,10) Point (5,8) (1,2)

Dist Mean Dist Mean Dist Mean Cluster 1 2 5 6 7 0 5 5 10 2 3 9 4 9 10 9 7 0 10 1 3 2 2 2 2 3 2

A1 (2,10) A2 (2,5) A3 (8,4) A4 (5,8) A5 (7,5) A6 (6,4) A7 (1,2) A8 (4,9)

Cluster2 (8,4) (5,8) (7,5) (6,4) (4,9)

Cluster3 (2,5) (1,2)

ForCluster1,weonlyhaveonepointA1(2,10),whichwastheoldmean,sothe clustercenterremainsthesame. ForCluster2,wehave((8+5+7+6+4)/5,(4+8+5+4+9)/5)=(6,6) ForCluster3,wehave((2+1)/2,(5+2)/2)=(1.5,3.5)