
Chapter 13

Multivariate Analysis
Factor Analysis
We can use factor analysis to summarize the information contained in a large number of variables.
Factor analysis creates a smaller number of dimensions or factors to represent the larger number of
variables. Because this summary is based on correlations among variables, it tends to be a meaningful summary.

What is factor analysis? Factor analysis is a technique that is used to reduce a large number of variables into a smaller set of underlying dimensions. Consider a survey in which 1,000 consumers are asked to rate the importance of 20 attributes in buying a car. Factor analysis can be applied to the data to reduce these 20 attributes to a few underlying dimensions such as price, safety and quality. These underlying dimensions are called factors and are derived on the basis of correlations among the initial set of variables.

Factor analysis is a generic term that includes two groups of techniques: principal components analysis (PCA) and common factor analysis (FA). These two techniques are statistically different and are based on different theoretical assumptions. However, in applied disciplines such as marketing research, the distinction between the two techniques is usually blurred. PCA and FA are often considered complementary rather than alternative techniques. In fact, in most packaged computer programs such as SPSS and SYSTAT, principal components analysis is found as a part of the factor analysis programs. Following this convention, we treat factor analysis as a general term that includes both principal components analysis and common factor analysis.

I. WHY FACTOR ANALYSIS?

We just received the results of a survey in which consumers rated four stores on several variables using a 10-point scale (10 = Completely satisfied; 1 = Completely dissatisfied). The results are shown in Exhibit 13.1.

Exhibit 13.1. Rating four stores

Attributes                 Store A   Store B   Store C   Store D
Friendly staff               8.5       8.2       7.4       7.2
Knowledgeable staff          6.7       6.2       6.6       6.5
Helpful staff                7.1       7.3       7.8       7.1
Competent staff              8.3       3.6       4.4       7.3
Help staff available         7.6       5.1       3.7       6.6
Items always in stock        8.1       6.4       4.2       7.1
Stocks specialty items       6.2       5.8       3.1       6.0
Variety of items             7.9       6.9       7.3       6.2
Attractive layout            8.8       9.1       7.5       8.9
Instead of looking at each item individually, we may want to summarize the data so that it can be understood quickly.

The intuitive approach

Looking through the items, we see that the first five items have to do with the staff, while the remaining 4 items have to do with satisfaction with the items sold. We can average the first 5 items and call the result satisfaction with the staff, and average the next 4 items and call that satisfaction with the merchandising effort (see Exhibit 13.2).

Exhibit 13.2. Rating four stores

Variables                  Store A   Store B   Store C   Store D
Staff
  Friendly staff             8.1       8.2       7.4       7.2
  Knowledgeable staff        6.4       6.2       6.6       6.5
  Helpful staff              7.0       7.3       7.8       7.1
  Competent staff            8.0       3.6       4.4       7.3
  Help staff available       7.5       5.1       3.7       6.6
  Average (mean)             7.3       6.1       6.0       6.9

Merchandising
  Items always in stock      8.1       6.4       4.2       7.1
  Stocks specialty items     6.2       5.8       3.1       6.0
  Variety of items           7.9       6.9       7.3       6.2
  Attractive layout          8.8       9.1       7.5       8.9
  Average (mean)             7.8       7.1       5.5       7.1

If we want to quickly summarize the information for the four stores, we can say that:

Satisfaction with      Store A   Store B   Store C   Store D
Staff                    7.3       6.1       6.0       6.9
Merchandising            7.8       7.1       5.5       7.1

Creating the two new dimensions by averaging related variables makes intuitive sense and it makes the data more comprehensible. For instance, it is easy to see that store B is seen as superior to C in terms of its merchandising, but not in terms of its staff.
Why the intuitive approach is not adequate

Yet, averaging apparently related variables is not the best way to create new dimensions and summarize your data. It has several pitfalls: intuitive dimensions can be arbitrary, they can be of unknown validity, and they can obscure the underlying relationships.

Intuitive dimensions are arbitrary. In using the intuitive approach we decided on the dimensions solely on the basis of the meaning of the variables involved. Do we really know, for instance, whether knowledgeable staff and helpful staff are related items in terms of how consumers see them? In fact, we have no evidence that the way we grouped the items corresponds to the way a consumer would perceive them. Depending on who groups the items, we can end up with one, two, three or more dimensions. We do not have any objective way of knowing how many dimensions we need to meaningfully summarize the data. Is a single dimension enough? Can two dimensions adequately summarize the data? Do we need more dimensions?
Intuitive dimensions are of unknown validity. The second problem is that we do not know how well the new dimensions represent the original data. Whenever we create new dimensions we lose some of the information contained in the original data. For example, both Store B and Store D get a mean rating of 7.1 on merchandising. Yet if we examine Exhibit 13.1, it is obvious that their ratings on the individual variables are not identical. For example, for items always in stock, Store B gets a rating of 6.4 while Store D gets a rating of 7.1. Such information is lost when we average different items. But how much information have we lost? Is it significant enough for us to be concerned about? Simple averages do not have any inherent measure that will answer this question.
Intuitive dimensions can obscure the underlying relationships. The third problem is that, when we average related items, we assume that all items are of equal importance. For example, the first dimension has five variables and the total was divided by 5. This is the same as multiplying each item by 1/5, or applying a weight of 0.20 to each item. The second dimension has four variables and the total was divided by 4. This is the same as multiplying each item by 1/4, or assigning each variable a weight of 0.25. Thus we created two dimensions by assigning equal weights to all variables within a dimension, as shown in Exhibit 13.3.

Exhibit 13.3. Implied weights in our grouping

                                        Assigned weight for Dimension
                                        I (staff)   II (merchandising)
Friendly staff                            0.20            0.00
Knowledgeable staff                       0.20            0.00
Helpful staff                             0.20            0.00
Competent staff                           0.20            0.00
Staff available when you need them        0.20            0.00
Has a variety of items                    0.00            0.25
Desired item always in stock              0.00            0.25
Advertised items always available         0.00            0.25
Items are of good quality                 0.00            0.25

The intuitively created dimensions make apparent sense. However, there are many potential problems with this approach. Assigning equal weights to all variables within a dimension can distort the results. Consider friendly staff. Most stores get similar ratings on this variable; the ratings do not vary very much and the stores are not well distinguished. But when we consider competent staff, we find that stores get dissimilar ratings; the ratings vary a great deal and the stores are well distinguished.

Variables           Store A   Store B   Store C   Store D
Friendly staff        6.4       6.2       6.6       6.5
Competent staff       8.0       3.6       4.4       7.3

It is more logical to assign greater weight to competent staff than to friendly staff, since the former contributes more to distinguishing the stores. Simple averaging techniques do not accomplish this, since they assign equal weight to all related variables, as shown in Exhibit 13.3.
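To make the point concrete, here is a minimal sketch (using NumPy, with the ratings taken from the small table above) showing how much more the competent-staff ratings vary across the four stores than the friendly-staff ratings do; it is this variability that variance-based weighting rewards:

```python
import numpy as np

# Ratings of the four stores (A, B, C, D) on the two attributes above
friendly = np.array([6.4, 6.2, 6.6, 6.5])
competent = np.array([8.0, 3.6, 4.4, 7.3])

# Variance measures how well an attribute distinguishes the stores
print("Variance of friendly-staff ratings:", friendly.var().round(2))    # ~0.02
print("Variance of competent-staff ratings:", competent.var().round(2))  # ~3.47
```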
Factor analysis provides a logical alternative

Therefore, we need a way of creating dimensions that overcomes the above problems. The technique ideally should:
1. Assign different weights to different variables based on their variability. We would expect variables that vary more to be given greater weights, because of their higher potential to distinguish.
2. Group variables that are similar on empirical grounds (using a measure such as correlation) rather than on arbitrary grounds.
3. Create dimensions in decreasing order of importance (in terms of variance explained) so we can present the data using the fewest possible dimensions, without losing too much information.
4. Provide a measure that will indicate how well the new dimensions summarize the underlying variables.
The principal components analysis (PCA) technique addresses these issues. Often we may want to extend these objectives. For instance, we may also want to:
5. Create fewer dimensions than there are variables.
6. Be able to express variables as a combination of factors.
7. Understand factors in terms of their intercorrelations.
When we add these objectives to our initial objectives, we need to use the common factor analysis technique, more simply known as just factor analysis. In this chapter we will deal with both of these techniques.
Factors as hidden dimensions

In principal components analysis, we treat factors as weighted combinations of variables. In factor analysis, we treat factors as latent dimensions that are manifested through measured variables. For instance, we may ask our customers the importance they attach to different service variables. Let us say that we collect such information on 20 different variables, such as the phone being answered promptly, staff being very knowledgeable and staff being very friendly. Yet there may not be 20 distinct dimensions on which customers naturally evaluate our service. Our customers may simply be judging our service on just 4 or 5 underlying (latent) dimensions. The 20 variables we use can be seen as imperfect measures of these fewer latent underlying dimensions. We can gain a better understanding of customer needs if we can identify these latent dimensions. Such latent dimensions are often called factors, and the analysis associated with deriving them is called factor analysis.

II. HOW FACTOR ANALYSIS WORKS

How do we identify factors? Because factors are derived from measured variables, it stands to reason that each factor is based on a correlated set of variables. For instance, if a consumer rates a brand on 15 variables, we may be able to summarize the bulk of the information contained in those 15 variables in, say, just 4 factors. These four factors are then deemed to be the latent variables that gave rise to the 15 variables that we measured originally. If a variable is not fully explained by these four factors, the unexplained portion is assumed to be the result of error (error variance) or of aspects associated with that variable only (specific variance), not shared by other variables.
How are factors created? An intuitive explanation

Because factors are derived from the measured variables, correlations among them form the basis for deriving factors. Correlated variables can be displayed visually: variables can be represented as vectors (straight lines with an arrowhead), with the angles between them representing the correlations. Exhibit 13.3a visually depicts 6 variables based on their intercorrelations. We can create two hypothetical dimensions to represent these 6 variables (Exhibit 13.3b). The size of the weights we assign to each variable will depend on how closely they are correlated to the hypothetical dimension.

Exhibit 13.3a. Correlations among six variables

[Figure: six variable vectors, A through F, drawn from a common origin; the angles between the vectors represent their intercorrelations.]

You will note that the first dimension is close to the largest number of variables that cluster together. The second hypothetical dimension is created such that it best represents whatever is not represented by the first dimension, in this case variables C, D, E, and F. This is the most basic and intuitive explanation of how a factor is created.

In reality, we can create factors through the use of many techniques. One popular technique for creating factors is principal components analysis or PCA (which can be used as an independent technique as well), which we will discuss here.

Exhibit 13.3b. Factors based on correlations

[Figure: the same six variable vectors, A through F, with two factors superimposed: Factor 1 drawn through the main cluster of correlated variables and Factor 2 drawn at right angles to it.]

Principal components analysis creates new dimensions, known as components or factors, by combining the original variables. The aim of the analysis is to summarize the maximum amount of information in the first few uncorrelated dimensions. While the variables themselves may be (and generally are) correlated with each other, the new dimensions are created in such a way as to be mathematically uncorrelated with one another. For instance, in Exhibit 13.3b, Factors 1 and 2 are at right angles (90°) to each other, meaning that they have a correlation of 0.

Principal components analysis starts with the correlations(1) among all variables under consideration. It then assigns weights to variables such that the first component explains the maximum possible variance in the data. The technique then creates the next component, which explains the maximum amount of the remaining variance. Using the same procedure, PCA creates as many components as there are variables, as shown in Exhibit 13.4. In practice, however, we will be interested only in the first few components.
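To make the mechanics concrete, here is a minimal sketch of principal components analysis in Python with NumPy and scikit-learn. The data matrix X is hypothetical (any respondents-by-variables array would do), and the variables are standardized first so that the components are derived from the correlation matrix, as described above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 500 respondents rating 6 variables on a 10-point scale
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(500, 6)).astype(float)

# Standardize so that components are based on correlations, not raw scales
Z = StandardScaler().fit_transform(X)

# Extract as many components as there are variables
pca = PCA(n_components=6).fit(Z)

print("Variance explained by each component:", pca.explained_variance_ratio_.round(3))

# Loadings: (approximate) correlation of each variable with each component
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print("Loadings (variables x components):\n", loadings.round(2))
```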

(1) Or, more precisely, with the covariances among variables. Covariance is unstandardized correlation.


A simple example

In a research study, 500 respondents rated the importance of a number of variables on a 5-point scale in choosing instant coffee. For simplicity, let us confine ourselves to just two variables, Taste and Flavor, which have a correlation of 0.85. This can be represented visually as two vectors connected at the origin, with a 32° angle separating them, as shown in Exhibit 13.5.

Exhibit 13.5. Two variable vectors (correlation = .85, corresponding to a 32° angle)

[Figure: the Taste and Flavor vectors drawn from a common origin, separated by a 32° angle.]

The question is, can a single dimension replace the two variables?(2) Given that the two variables are highly correlated, our answer should be yes. As we have been discussing, this new dimension or factor is some weighted combination of the correlated variables. Since we are dealing with only two variables, the most logical place for the new vector would be in the middle, as illustrated in Exhibit 13.6.

Exhibit 13.6. Two variables and the new factor

[Figure: Factor I drawn midway between the Taste and Flavor vectors, at a 16° angle to each.]

The relationship between a factor and a variable (factor loading)

How does this newly created factor relate to each original variable? If we measure the angle between the factor and the Flavor vector, the angle is 16° (32°/2 = 16°), which translates into a correlation of 0.96; the same relationship holds for the Taste vector. In other words, the newly created factor has a correlation of .96 with each of the original variables, meaning the new factor is a good representation of the variables from which it is created. The correlation between a factor and a variable is called the factor loading. As we shall see later, factor loadings can also be used to interpret the meaning of a factor.

(2) Please note that we are using just two variables only to demonstrate how the technique works. In practice, we would expect to work with many more variables. We would also expect a factor to be related to a number of variables.

Variance explained by each variable (common factor variance)

Let us consider Taste, which has a correlation of .96 with the newly created factor. The square of this correlation (.96², or the variance explained) is .92, or 92%. This is the amount of variance shared by the factor and the variable, known as the common factor variance. In Exhibit 13.7, the overlap between the square and the circle represents the common factor variance.

Exhibit 13.7. Common factor variance

[Figure: a square (Variable 1) overlapping a circle (Factor 1); the area of overlap is the common factor variance.]

Variance explained by each factor (eigenvalue)

The common factor variance for Taste is .92 and for Flavor it is also .92. Hence the total variance explained by this factor is .92 + .92 = 1.84. The extracted variance is also known as the eigenvalue or the latent root. Since the maximum amount of variance that can be explained by the analysis is 2 (one for each variable), the percentage of variance extracted by this factor is 1.84/2 = 92%.
Completing the analysis

Now that we have extracted 92% of the variance, it is unlikely that we would want to extract another factor. However, in PCA, we can create as many factors as there are variables. In this example, we can create another factor at right angles to the first factor. (When two factors are at right angles they are uncorrelated with each other, a requirement of PCA. Uncorrelated factors are said to be orthogonal to each other.) Exhibit 13.8 shows the new factor. The correlations between this new factor and the two variables are shown as angles a and b.


Exhibit 13.8. Creation of Factor II

[Figure: Factor II drawn at right angles to Factor I; the angles a and b between Factor II and the Flavor and Taste vectors correspond to its loadings on them.]

The analysis that we just carried out is called principal components analysis. The results are summarized in Exhibit 13.9.

Exhibit 13.9. Table of two components

                                     Component I   Component II   Communality
Taste                                    0.96           0.28          1.0
Flavor                                   0.96           0.28          1.0
Variance extracted (eigenvalue)          1.84           0.16          2.0
Variance explained (%)                     92              8          100

Exhibit 13.9 has one more term that needs explaining: communality. Communality is the amount of variance shared by a variable when all factors are taken into consideration. It is the sum of a variable's common factor variance across all factors. Communalities are obtained by squaring the factor loadings of a variable on all factors and summing them. For example, the communality for Taste = (.96²) + (.28²) = 1.0. This is visually represented in Exhibit 13.10.

Since we extracted as many dimensions as there are variables (two factors and two variables), all the variance is fully explained by the two factors. As a general rule, in principal components analysis, where we extract as many factors as there are variables, we will have explained all the variance in the data in terms of factors; therefore the communality will be 1.00. In other techniques of factor analysis, where there are fewer factors than there are variables and provisions for error and specific variance exist, not all the variance in a variable will be explained by the factors. This means the communality estimate for each variable will be less than 1.00.
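The numbers in Exhibit 13.9 can be reproduced, up to rounding (the chapter works through the 16° angle, which rounds slightly differently), directly from the 2 x 2 correlation matrix. This is a minimal sketch with NumPy; nothing here is specific to the coffee study beyond the 0.85 correlation quoted in the text:

```python
import numpy as np

# Correlation matrix for Taste and Flavor (r = 0.85, as in the example)
R = np.array([[1.0, 0.85],
              [0.85, 1.0]])

# Eigendecomposition: eigenvalues are the variance extracted by each component
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]            # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loadings = eigenvector * sqrt(eigenvalue); communality = row sum of squared loadings
loadings = eigenvectors * np.sqrt(eigenvalues)
print("Eigenvalues:", eigenvalues.round(2))            # ~[1.85, 0.15], cf. Exhibit 13.9
print("Loadings:\n", np.abs(loadings).round(2))        # ~0.96 and ~0.27 per variable
print("Communalities:", (loadings ** 2).sum(axis=1))   # 1.0 for each variable
print("% variance:", (eigenvalues / eigenvalues.sum() * 100).round(1))
```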


Exhibit 13.10. Communality

[Figure: two panels, high communality and low communality. In each, a variable (square) overlaps three factors, Factor 1, Factor 2 and Factor 3 (circles); the overlaps are large in the high-communality panel and small in the low-communality panel.]

The square in the center represents a variable. Each circle is a factor. The overlap between a factor and the variable is the common factor variance between the variable and that factor. When we add up the common factor variance across all factors for a given variable (the parts of the circles that overlap the square), we obtain the communality.

Exhibit 13.10a. Variance explained

[Figure: one factor (circle) overlapping three variables, Variable 1, Variable 2 and Variable 3 (squares); the total overlap between the factor and the variables is the variance explained.]

The overlap between a factor and all the variables is the variance explained. Variance explained can be seen as the converse of communality.

III. HOW TO DESIGN A FACTOR ANALYSIS STUDY

Is your problem suited for factor analysis?

Factor analysis is the right type of technique if your objective is to reduce a large number of variables to a smaller number of new dimensions. If you use the PCA method, the new dimensions will be derived such that the maximum amount of variability in the data is captured by the smallest number of new dimensions. Here are some typical problems that can be solved by factor analysis:
- You are a marketer. You have surveyed your customers to assess their satisfaction on 35 aspects of your service. You would like to know if you could reduce the 35 variables to a smaller number of new dimensions.
- You are an economist. You have the prices of various items in different parts of the country over the past several years. You want to create an index such that the cost of living can be compared across different parts of the country. You want to assign weights to different food items depending on how much variability the index explains.
- You are the president of a company. You would like to understand the dimensions that underlie your corporate image.
Is your data suited for factor analysis?

Factor analysis is a metric technique. This means your data should be at least interval scaled. Variables such as rating scales, income, and years of schooling are suitable items for factor analysis. However, you can use data collected on different scales (such as 5-point or 10-point scales, income and age) in the same analysis. This is because data obtained on different scales can be standardized and expressed in comparable standard deviation units.
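As a quick illustration of that standardization step, here is a minimal sketch (the variable names and values are hypothetical) showing how variables measured on very different scales are converted to comparable standard-deviation units before factoring:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical respondents measured on very different scales
satisfaction = [8, 6, 9, 7, 5]                   # 10-point rating scale
income = [42000, 75000, 31000, 58000, 66000]     # dollars per year
schooling = [12, 16, 11, 14, 18]                 # years

X = np.column_stack([satisfaction, income, schooling]).astype(float)

# z-scores: each column now has mean 0 and standard deviation 1
Z = StandardScaler().fit_transform(X)
print(Z.round(2))
```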
Are the variables well chosen?

Factor analysis can create factors based only on the variables that we measure. Therefore, the variables chosen for research should cover all relevant aspects and be representative of the subject chosen for the study. Also, factor creation will depend on how many variables are highly intercorrelated. Suppose we include 10 correlated variables that deal with product quality, 10 correlated variables that deal with service quality, and only one variable that deals with price. In this case, price may not show up as a factor at all. This may not be because consumers do not consider price to be a separate dimension. Rather, it is a consequence of variable selection. Therefore, in choosing variables for inclusion in a factor analysis study:
- Choose variables that cover the complete spectrum of what you wish to study.
- Where possible, strive for some balance among the different areas (potential factors) covered.

Is your sample size adequate?

As a general guideline, your sample size should be at least 10 times the number of variables. If you are performing a factor analysis on 11 variables, you need a minimum sample of 110 (10 x 11). However, this is only a mathematical consideration. It may not necessarily provide precise enough results for marketing purposes. For instance, a sample size of 110 will have a margin of error of about 10 percentage points. Since factor analysis combines variables, the measurement of which is subject to margins of error, the factor scores and loadings may not have a high degree of stability. The mathematical criterion is a necessary but not a sufficient condition. A better way to determine the sample size is to ask yourself whether you believe that the sample size is large enough to represent the population from a marketing perspective. If your answer is no, then your sample size is almost certainly inadequate.
Do you have the right program?

Practically any major computer package (such as SPSS, SAS, or SYSTAT) will carry out factor analysis. Factor analysis is a defined mathematical technique and you should get the same results no matter which program you choose to use. Please note that in many packages, principal components analysis is treated as a part of factor analysis. While all programs include widely used procedures (such as varimax rotation), different programs include different less widely used procedures.
IV. HOW TO PERFORM FACTOR ANALYSIS

When you want to carry out factor analysis on your data, you need to decide on a number of things. This section deals with issues that arise when you are about to perform a factor analysis. What methods are available to create a factor? What techniques are available to rotate them? When do you stop factoring? These issues are dealt with below.
Do you have the input data in the right form?

What goes into the computer? What input data do you need to perform factor analysis? To perform factor analysis your input data should be at the individual level. For instance, if you want to factor analyze 20 variable ratings of 1,000 consumers, then your input data would be the 20 variable ratings given by each of the 1,000 consumers. Your input data will be in the format shown in Exhibit 13.11.


Exhibit 13.11. Typical data format for factor analysis

ID       Ratings on variables 1, 2, 3, ... 20
0001     (20 ratings given by respondent 0001)
0002     (20 ratings given by respondent 0002)
0003     (20 ratings given by respondent 0003)
0004     (20 ratings given by respondent 0004)
...
1000     (20 ratings given by respondent 1000)

How to create preliminary factors

When we submit our data to a factor analysis program, the primary task of the program is to create factors. Where do we get the factors hypothesized by the factor analytic model? As we noted earlier, there are many methods by which we can derive the factors, one of the most popular being principal components analysis, which we described in some detail in this chapter. There are other methods as well. Here are some of the more popular techniques:

Principal axis factors method. This method is identical to principal components analysis with one exception. In the diagonal cells of the correlation matrix (which usually contain 1.0), the program inserts an estimate of the communality for each variable. Communality is, as explained before, the extent to which a given variable is explained by all the factors extracted in the analysis. This is one of the most commonly used methods of extracting factors.
Minimum residual method. This method is also similar to PCA, except that the analysis is performed ignoring the diagonal entries of the correlation matrix. Factors are extracted such that the sum of the squared off-diagonal residuals is minimized. This method can potentially produce communalities that are greater than one and can provide misleading results. If you are not well versed in factor analytic procedures, you may want to avoid this procedure.
Maximum likelihood method. The maximum likelihood method attempts to maximize the amount of variance explained in terms of the population (as opposed to the sample, which is the focus of principal components analysis). When communalities are high, the differences between the principal components method and the maximum likelihood method tend to be trivial. The main advantage of the maximum likelihood method is that it has statistical tests of significance for each factor extracted.
Image analysis. Image analysis assumes that the common core of a variable is what can be predicted by the other variables. Prediction is measured by multiple regression analysis of each variable against all other variables. These analyses produce a covariance(3) matrix, which in turn is subjected to factor analysis. The results produced through this analysis are somewhat difficult to understand and interpret. This method tends to give results that are different from other methods of factoring.
Alpha factor analysis. Alpha factor analysis attempts to create factors that have the highest reliability. Reliability refers to internal consistency and consistency over time. No significant advantage is attached to this method compared to other methods of factor extraction.
Which factor analysis method to use

If you have no special reasons for using a specific technique, either principal components or principal axis will be quite adequate in most applied contexts. Many analysts prefer to use principal components exclusively because it is based on fewer assumptions compared to other techniques.
How factors are rotated

When factors are extracted, the first factor will tend to correlate with most variables. The remaining factors may not show a clear pattern. This makes the factors difficult to interpret. So we need to readjust the factors such that each factor correlates highly with only a few variables. This will make the meaning of the factors clearer. The process of readjusting the variance is known as factor rotation. Here is a brief description of factor rotation:
1. We create initial factors using, say, principal components analysis.
2. In order to maximize the variance extracted, each factor is created such that it maximizes the variance extracted at each stage of factor creation.
3. This procedure does not necessarily explain observed correlation patterns.
4. The factors are mathematically adjusted (rotated) such that the factors conform more closely to the observed correlations. This process changes the meaning of individual factors but does not alter the overall variance extracted.
These steps are visually illustrated in the following exhibit.

(3) Covariance is correlation expressed in unstandardized form.

FACTOR ROTATION

[Panel 1: Six variables represented as vectors based on their correlations.]

[Panel 2: The first factor is created where there is the highest concentration of correlated variables. The next, uncorrelated factor is created orthogonal to the first factor.]

[Panel 3: This solution can be improved such that each factor loads highly on only a few variables. This is done by rotating the factors.]

[Panel 4: Note how the variables load clearly on each factor after the rotation. This makes interpretation easier. Compare this with the unrotated factors in Panel 2.]

In general, we need to rotate the factors before we can interpret them. Rotating the factors changes their loadings and therefore their meaning. However, the rotated and unrotated solutions are mathematically equivalent. When you use a computer program, you need to specify how you would like the factors rotated. There are many ways of rotating factors. They can be summarized into two basic types of rotations: orthogonal and oblique rotations.

Orthogonal rotation

In orthogonal rotation, factors are rotated in such a way that they will continue to be uncorrelated with one another, even after the rotation. This means that the factors are rotated at right angles (90°) to each other.

Varimax rotation. This form of orthogonal rotation is by far the most popular. In the varimax procedure, factors are rotated such that each variable loads highly on one factor and only one factor. Thus each factor will represent a distinct construct.

Quartimax rotation. Factors are rotated such that all variables load highly on one factor. In addition, each variable will have a high loading on one other factor but near-0 loadings on the remaining factors. This results in one overall factor followed by a number of specific factors.
Oblique rotation

In oblique rotation, the constraint that factors be rotated such that they are at right angles to each other does not exist. We can rotate the factors such that they are at any angle to each other, and therefore the factors can be correlated with each other. Quartimin, Oblimin, Covarimin, Binormamin, Orthoblique, Doblimin, Dquart and Promax are some of the oblique rotation procedures.
Which rotation to use

Oblique rotation is appealing to some users because it is less constrained. However, oblique solutions are less widely used than orthogonal ones, possibly because of the potential problems of interpretation. When we use oblique solutions, the sum of squared loadings in each row will not equal the communality (h²), except by chance. As a result, we cannot clearly determine the proportion of variance in a variable explained by the factors. Neither can we determine the variance explained by each factor. So, it is generally preferable to use an orthogonal method such as varimax. However, if you choose to use an oblique solution, you should weigh the statistical losses against possible gains in meaning.
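In practice the rotation is requested from the software rather than done by hand. As a hedged sketch, recent versions of scikit-learn expose a varimax option on their common factor analysis routine; the data matrix X below is hypothetical, and option names may differ in other packages such as SPSS or SAS:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Hypothetical standardized ratings: 500 respondents on 6 variables
rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.normal(size=(500, 6)))

# Unrotated vs. varimax-rotated two-factor solutions
unrotated = FactorAnalysis(n_components=2).fit(X)
rotated = FactorAnalysis(n_components=2, rotation="varimax").fit(X)

# components_ holds the loadings (factors x variables); rotation changes the
# individual loadings but not the total variance the two factors account for
print("Unrotated loadings:\n", unrotated.components_.round(2))
print("Varimax loadings:\n", rotated.components_.round(2))
```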
How many factors to extract?

It is for users to determine how many factors should be retained. Some researchers accept as many factors as would explain a predetermined amount of variance. For instance, you may want the factor analysis to explain at least 75% of the variance and therefore accept as many factors as necessary to achieve this criterion. This is one of the weakest ways of deciding the number of factors, since here our prior decision rather than the patterns in the data determines the number of factors. The following are some of the common methods used to decide the number of factors.

Eigenvalue criterion. One guide we use for this purpose is to accept all factors which have an eigenvalue of 1 or more. The logic behind this approach is that it is intrinsically unappealing to create a weighted combination of variables that explains less variance than a single variable would explain on its own. This is the eigenvalue criterion.

Exhibit 13.12. The scree plot

[Figure: a scree plot with the factor number on the horizontal axis and the eigenvalue (0.0 to 3.0) on the vertical axis; the eigenvalues drop steeply for the first few factors and then level off.]

Scree criterion. To use the scree criterion, plot the eigenvalue of each component on a graph and connect the points (see Exhibit 13.12). We accept all factors until the slope of the curve starts changing (the elbow effect). The idea behind the scree criterion is that at the tail end of the curve all we have are mostly random variations.
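A minimal sketch of producing such a scree plot with NumPy and Matplotlib follows; the data matrix is hypothetical, and any respondents-by-variables array would work in its place:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 8))        # hypothetical data: 500 respondents, 8 variables

# Eigenvalues of the correlation matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1.0, linestyle="--")     # eigenvalue-1 criterion, for reference
plt.xlabel("Factor")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```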
Parallel procedure. The scree criterion is difficult or even impossible to use when the curve is smooth with no sudden slope changes. The parallel procedure is a technique designed to overcome this problem when standardized data are used.
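As a hedged sketch of the basic idea behind the parallel procedure (often attributed to Horn): compare the observed eigenvalues with the eigenvalues obtained from many random data sets of the same size, and keep the factors whose observed eigenvalue exceeds the corresponding random average. The implementation below is illustrative only, with hypothetical data:

```python
import numpy as np

def parallel_analysis(X, n_simulations=100, seed=0):
    """Count the factors whose eigenvalues exceed those of comparable random data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

    random_eigs = np.zeros((n_simulations, p))
    for i in range(n_simulations):
        R = rng.normal(size=(n, p))                     # random data, same dimensions
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(R, rowvar=False))[::-1]

    threshold = random_eigs.mean(axis=0)                # average random eigenvalues
    return int(np.sum(observed > threshold))

# Example with a hypothetical data matrix
X = np.random.default_rng(3).normal(size=(500, 8))
print("Factors to retain:", parallel_analysis(X))
```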
Which stopping criterion to use

For most purposes, the eigenvalue criterion followed by a logical adjustment of the number of factors should be adequate. However, if you are looking for a more objective criterion, the parallel procedure is a good choice.


No matter which criterion we use, we should not use it mechanically. Suppose we decide to use the eigenvalue criterion and find that we have 4 factors. It is quite possible that the fourth factor is not very meaningful, or that it loads on just one variable that may not be particularly important from our point of view. Then it is much more meaningful to work with three factors rather than with four.

Generally speaking, we need to determine how many factors to extract before rotating them. However, it is also necessary to understand what these factors mean before deciding how many factors to accept. It is difficult to understand unrotated factors. So, where necessary, you may want to rotate the factors initially extracted using a mechanical criterion. You can rotate them again if you decide to accept more or fewer than the original number of factors.

HOW TO DO FACTOR ANALYSIS: A step-by-step guide

1. Collect all the dimensions that are important to your problem.
Example. Suppose you are planning to understand the factors of customer satisfaction for a telephone utility; you may want to think of all the dimensions pertaining to this, such as installation service, repair service, price tariff and responsiveness. Make sure that the dimensions you have chosen are fully representative of what you want to factor analyze.
2. Develop different attribute statements pertaining to each dimension.
Example. To represent the price dimension, we can include statements such as "monthly charges are reasonable" and "long distance charges are competitive". Although not absolutely necessary, it is preferable to balance each dimension by creating approximately the same number of statements.
3. Conduct the survey and analyze the results.
In analyzing the results, you need to make a number of decisions, such as the factor analytic method, the rotation method and the stopping criterion. See text for details.
4. Interpret the results.
See text.
5. Present the results.
What you would present or include in your report will depend on the audience. At a minimum, you should report the number of factors, their interpretation, the attributes included in each factor, the factor loadings and the variance explained by each factor.

DOING THE ANALYSIS


For a complete example with annotation please see the PowerPoint presentation, Chapter 14.

Cluster Analysis

We often have a need to group similar items, such as grouping customers who have similar needs or people who have similar lifestyle characteristics. Techniques of grouping have several applications. In marketing, cluster analysis is very frequently used to segment the market into homogeneous parts.

One of the common problems in many areas of inquiry is that of grouping similar items together: How do we group consumers who seek similar benefits from a product so we can communicate with them better? How do we group people with similar social attitudes that might lead to a predictable political affiliation? How do we identify the group of physical symptoms that lead up to some illness? How do we group animals and plants on the basis of their similarity so we can understand them better? How do we group the financial characteristics of companies so we may be able to relate them to their stock market performance? Questions like these require that we group items on the basis of their similarity. Cluster analysis is a technique that aims to group similar items. In marketing, cluster analysis is typically used to segment the market. The analyst provides the characteristics on the basis of which the clustering is to be carried out.

I. WHY CLUSTER ANALYSIS?

An automobile company is interested in marketing a new sports car. The company can target everyone, but that would be expensive. Obviously not every driver is equally likely to buy a sports car. How do we go about identifying groups of consumers who are more likely to buy the new sports car?
Traditional approach

Looking at the problem from a common sense perspective, we can exclude groups that are unlikely to be our best prospects. If the car is an expensive one, its price would preclude many low income consumers from buying it. For the same reason, drivers who are too young and unlikely to have the resources to buy an expensive car will also be excluded from the target audience. If the car is performance oriented, it might exclude some elderly consumers who might be concerned more with safety than with speed. We may also have research data that show that the current owners of sports cars are predominantly from the higher income groups (say, $75,000+ per year) and are between 35 and 45 years of age. We can collect such information, exclude low purchase probability groups and define high purchase probability groups as our target audience. Such analysis is often based on logic and past empirical observations.

Why the traditional approach is not adequate

The traditional approach is useful but stops short of identifying the most profitable segments. In the above example, if we identify drivers between the ages of 35 and 45 who earn more than $75,000 per year as our target audience, then perhaps we will have a better chance of marketing to them. However, obviously not every driver between the ages of 35 and 45 with an income above $75,000 may be equally interested in buying an expensive sports car. The decision may depend on a set of complex characteristics. Even within the group we have chosen as our target audience, there might be subgroups that have much greater potential. For example, the probability of buying a sports car could be much higher among those who are without children, those who have a tendency to seek thrills, those who are conscious of their status and those whose occupations are people-oriented. In fact, the presence of a combination of several traits may define the best prospect for the new product. The problem becomes even more complex if the target audience shares certain lifestyle and attitudinal characteristics that are not easily apparent. For instance, simple analysis will not identify segments such as thrill-seeking drivers, since thrill seeking may be a combination of other lifestyle attitudes a driver may have. Traditional methods of analysis have no means of identifying complex attitudinal or demographic segments.
Cluster analysis provides a quicker solution

Because it is nearly impossible to group items manually on a large number of attributes, we need a technique which would (1) identify attribute combinations that distinguish one group from another; and (2) keep each group as homogeneous as possible.

While the criteria are simple, as we will soon see, there are any number of solutions that would at least in part satisfy the criteria. As a result, we can have a number of cluster analysis solutions to the same problem. Solutions produced by different methods can be very different. Different solutions have different strengths and different weaknesses.

In this chapter we will study some clustering techniques, the types of assumptions involved in creating these clusters and why we get different solutions when we use different clustering procedures.

II. HOW CLUSTER ANALYSIS WORKS

We define how similar two customers are on the basis of some measured variables such as their lifestyle characteristics, demographics or attitudinal responses to a number of statements. Naturally we would expect those who belong to a cluster to share similar characteristics in terms of the variables on which they are clustered. For instance, if people are clustered on 30 lifestyle attributes, we may be able to identify a group which enjoys outdoor activities (and therefore may be consumers of outdoor sporting and recreational equipment), another group which enjoys home-centered activities (possible consumers of in-home entertainment equipment such as a VCR or a barbecue grill) and so on.

There are two basic methods through which clusters are generally created: hierarchical and iterative partitioning. Within each group there are many techniques through which we can create clusters.
Hierarchical clustering methods

In hierarchical methods, each cluster is formed on the basis of the clusters existing at the previous stage. For instance, the four-cluster solution is formed by subdividing one of the clusters of a three-cluster solution. The three-cluster solution, in turn, is formed by subdividing one of the clusters of a two-cluster solution, and so on. The procedure is best understood through an example, using a clustering technique known as the nearest neighbour technique. In the following illustration, let's assume that we have two clusters: A, B and C belong to Cluster I, and D, E and F belong to Cluster II.

[Figure: points A, B and C grouped as Cluster I; points D, E and F grouped as Cluster II.]

Now we need to assign the next individual, G, to one of the two clusters.

[Figure: the same two clusters with a new point G lying between them.]

i. Nearest Neighbour Method (Single Linkage)

There are many ways in which we can decide whether G belongs to Cluster I or Cluster II. One way is to calculate the distance between G and the nearest point in Cluster I (C) and the nearest point in Cluster II (D) and assign the new individual to the closer cluster. In this case we will assign G to Cluster II since G is closer to D than to C. This technique is called the nearest neighbour method. Once we find the nearest neighbour of a new point, we are not concerned at all about the distance between the new point and all other points in a cluster. In our example, our only concern is to understand how far G is from D, not how far G is from E or from F. If a point is not the nearest neighbour, it plays no role in the formation of a cluster.

Identifying clusters. Once we have a distance measure (Euclidean distance) and an assignment rule (the nearest neighbour method), we are ready to carry out cluster analysis. Exhibit 13.13 shows the Euclidean distances between five individuals.
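For reference, the Euclidean distance between two respondents x and y measured on p variables is

\[
d(x, y) = \sqrt{\sum_{k=1}^{p} (x_k - y_k)^2}
\]

that is, the square root of the sum of their squared differences across all the clustering variables.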
Exhibit 13.13
Euclidean distances between five individuals

       A      B      C      D
B     3.7
C     1.4    3.5
D     5.4    3.3    6.2
E     3.0    4.6    7.8    7.1

First we look for the two respondents who are closest. We note that of all respondents, A and C are
closest to each other with a distance of 1.4, the smallest in the table. So let us combine them into
cluster 1.
A ──┐
    ├──
C ──┘
The next pair that is closest to each other is E and A (distance 3.0). (Note that A and C have already
formed a cluster.) A is the nearest neighbour for the next point E. So we combine it with A and C as
shown below.
A ──┐
    ├──┐
C ──┘  ├──
E ─────┘

We look for the next closest pair: it is B and D (distance 3.3). Since neither B nor D is already in a cluster, we create a new cluster that consists of B and D.
A ──┐
    ├──┐
C ──┘  ├──
E ─────┘
B ──┐
    ├──
D ──┘

The next closest pair is B and C (distance 3.5). B and C are in different clusters, but they are now the nearest pair, so we join the two clusters.
A ──┐
    ├──┐
C ──┘  ├──┐
E ─────┘  ├──
B ──┐     │
    ├─────┘
D ──┘

The above representation is called a dendrogram. From this you will note several things. First, we start with everyone being separate and gradually combine individuals into larger and larger clusters until we end up with one single cluster. Second, the dendrogram preserves the original distances of the matrix to the extent possible. Third, since each cluster is formed by annexing more individuals, we can have as many or as few clusters as desired. For instance, if we want to group the respondents into 2 groups, we can start from the left of the dendrogram, move right and stop at the point where there are two branches. In our case the clusters will be [A, C, E] and [B, D]. If we need three clusters, we move further to the right until we encounter three branches: [A, C], [E] and [B, D], and so on.
Nearest neighbour method is also known as single linkage clustering.
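To connect the hand-worked example with software, here is a minimal sketch using SciPy's hierarchical clustering routines on the condensed form of the Exhibit 13.13 distances; single linkage corresponds to the nearest neighbour rule, and the other rules discussed below (complete linkage, Ward) are selected through the method argument:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Condensed distance vector for A, B, C, D, E in the order
# (A,B), (A,C), (A,D), (A,E), (B,C), (B,D), (B,E), (C,D), (C,E), (D,E)
distances = np.array([3.7, 1.4, 5.4, 3.0, 3.5, 3.3, 4.6, 6.2, 7.8, 7.1])

# Single linkage = nearest neighbour method
Z = linkage(distances, method="single")
print(Z)   # each row shows the two clusters merged and the distance at which they merge

# Cutting the tree into two groups recovers [A, C, E] and [B, D]
print(fcluster(Z, t=2, criterion="maxclust"))
```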
This method is not problem-free. Let's return to our earlier example.

[Figure: Clusters I and II, with G assigned to Cluster II because it is closest to D.]

Here we combined G with Cluster II because it was closer to D than to C. Now watch what happens when we have another new point, H.

[Figure: a new point H appears next to G and is therefore assigned to Cluster II.]

The nearest neighbour logic would have H combine with G, since H is closer to G than it is to C. Now assume we have another point, I, as shown below:

[Figure: another new point I appears next to H and is likewise assigned to Cluster II.]

The point I will be combined with Cluster II because it is closer to H than it is to C. By now the problem should be obvious.

[Figure: a chain of points G, H and I, all attached to Cluster II even though they lie closer to Cluster I.]

If you examine the original clusters (A, B, C vs. D, E, F), clearly many points such as I and H are closer to A, B and C than to D, E and F. Again, if point G were not in the sample, I and H would have been combined with Cluster I rather than with Cluster II. Yet, because the nearest neighbour method concentrates only on the closest point, we may end up getting clusters that do not represent reality very well. This problem is known as the linking or chaining effect.
ii. The Farthest Neighbour Method (Complete Linkage)

To solve the problem of the linking effect, we could measure closeness from the farthest point in each cluster rather than from the nearest point. If we apply this logic to the problem, we will get the following configuration, because we will be assigning new points depending on how close they are to the farthest point in each cluster (A for Cluster I and F for Cluster II).

[Figure: the two clusters as formed by the farthest neighbour rule; new points are assigned according to their distance from the farthest point in each cluster (A for Cluster I, F for Cluster II).]

While this seems to be an improvement over the nearest neighbour method, it is not very satisfactory either. Assigning (A, B, C) to one cluster and (I, H, G, D, E, F) to another cluster might have been more logical, at least from a visual perspective, as shown below:

[Figure: the visually more natural grouping, with A, B and C in Cluster I and I, H, G, D, E and F in Cluster II.]

Since, as before, a single point (the farthest point) decides how the groups are formed, a single point to the left of A could change the way the two groups are formed.
Centroid clustering

To avoid depending on a single point, we can use a third method, known as average linkage or centroid clustering. Here we combine points with the cluster whose centroid is closest to the point to be joined. The centroid is the average point of a cluster: it is the hypothetical point created by averaging all points in the cluster. Each time we include a point in a cluster, the centroid is recalculated to account for the new member of the cluster. Because of this, it is somewhat less likely to be influenced by a single point in a cluster. Centroid clustering has some disadvantages as well. When clusters are of unequal sizes, if one of the clusters is large, the new point to be fused is likely to be closer to the larger cluster than to the smaller ones. We might miss the characteristics of the smaller cluster.
Ward's method

Ward proposed a solution to the problems associated with the techniques we have discussed so far. His solution is to combine new points with the cluster for which the combination creates the least amount of information loss. Information loss is defined as the error sum of squares. Ward's method does not compute distances between clusters. Rather, it aims to minimize the total within-cluster sum of squares. If we were to assign the new point to any cluster, assignment to which cluster would result in the minimum increase in the total within-cluster sum of squares? It is to this cluster that the new point will be assigned.

These are not the only hierarchical clustering procedures in use, although they are probably the most frequently used.
Hierarchical vs. partitioning methods

So far we have discussed hierarchical clustering methods, in which the clusters at each level are dependent on the clusters at the previous level. In the example discussed earlier, the two-group solution consisted of [A, C, E] and [B, D]. The three-group solution was created by splitting the [A, C, E] cluster into two, giving [A, C], [E] and [B, D]. The four-cluster solution was created by further splitting [B, D] into [B] and [D]. When we use hierarchical clustering, individuals comprising different clusters do not fundamentally regroup. They simply combine with other individuals to form larger clusters.

Hierarchical methods are commonly used in marketing to understand how brands group and as a way to better understand the relationships found in correspondence analysis. However, they are generally not favoured for analysing a large number of records, as is the case when we attempt to segment the market. Firstly, hierarchical methods take up a large amount of computer power, since they require storing the similarities between each and every pair of points. If we try to segment the data based on 2,000 respondents, the computer has to generate a matrix containing approximately two million cells. In fact, the number grows with the square of the sample size: if we cluster 10,000 records (as is possible in database projects), the matrix will consist of some 50 million cells. These cells not only have to be stored, they also have to be repeatedly searched to identify the cluster to which each respondent should belong. Besides, when we have that many respondents, a dendrogram is neither feasible nor of any specific value.
Iterative partitioning clustering (k-means clustering)

For the reasons mentioned in the previous section, to solve large clustering problems such as market segmentation we use non-hierarchical techniques. They are known more specifically as iterative partitioning methods (the most frequently used versions are often referred to as k-means clustering). Iterative partitioning methods create clusters that are independent at each stage. For instance, when a partitioning method creates a 4-cluster solution, the solution is not dependent on the way the technique creates 3 or 5 clusters.

How do iterative partitioning techniques work? Suppose you have 1,000 respondents who responded to 30 lifestyle questions. You would like to segment the market based on these responses. First you need to decide how many clusters or segments you want.(4) Once you have decided, this is what the technique does:
1. Suppose you specify that you need four (k = 4) clusters. The technique chooses 4 respondents from the sample. These are called seeds. There are many ways to select seeds, such as:
   - selecting the centroids of each group;
   - selecting the first respondent as the seed for the first cluster, then selecting the respondent who is at a predefined distance from the first respondent as the second seed, and so on; and
   - selecting k respondents on a random basis.
   There are indeed many more ways of selecting initial seeds.(5)
2. Each subsequent respondent is assigned to the cluster to which it is closest. Note that, unlike in hierarchical clustering, the centroids are not recomputed after each assignment.
3. Each respondent is reassigned to the cluster that conforms to a pre-determined stopping rule. Reassignment can be carried out in many ways:
   - Compare the centroids of each cluster after assignment is completed with those that were obtained initially. If they diverge by more than a pre-set convergence criterion, make another attempt at reassigning members to different clusters. Continue this process until the differences between the initial and the final centroids are within the limits allowed by the convergence criterion.
   - Recompute the centroids of the cluster from which the member was drawn and the cluster to which the member was assigned. Continue the reassignment procedure until the changes in the centroids are less than the convergence criterion.

(4) At this stage we do not know what the segments might be or how many would be best suited for our purposes. Neither does the computer. So, in practice, you may want to ask for several alternative solutions: 2, 3, 4, 5, 6 or more, depending on your experience and expectations.
(5) Please note that the seed respondents should not have missing data.
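As a minimal sketch of how this looks in software (scikit-learn's k-means is one widely used implementation; the matrix of lifestyle ratings below is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical ratings: 1,000 respondents on 30 lifestyle statements (5-point scale)
rng = np.random.default_rng(4)
X = StandardScaler().fit_transform(rng.integers(1, 6, size=(1000, 30)).astype(float))

# Ask for k = 4 segments; n_init controls how many times new seeds are drawn
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print("Segment sizes:", np.bincount(kmeans.labels_))
print("Centroid matrix shape (segments x variables):", kmeans.cluster_centers_.shape)
```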

III. HOW TO DESIGN A CLUSTER ANALYSIS STUDY

Is your problem suited for cluster analysis?

Cluster analysis is the right type of technique if your objective is to reduce a large number of objects to a smaller number of homogeneous groups. Problems such as the following can be tackled by cluster analysis:
- You market computers. You wonder if there are distinct segments in the marketplace, each emphasizing different benefits, e.g. one segment concerned with multimedia capabilities, another emphasizing high-speed scientific computations and so on.
- You are a long distance telephone service provider. You would like to know how different brands (service providers) are grouped by customers.
- You are a political pollster. You would like to identify groups of voters who hold similar views on a variety of issues.
- You are a database analyst. You would like to identify demographic segments that might serve as target markets for your product.
Is your data suited for cluster analysis?

Cluster analysis can handle both metric and non-metric data. This means your data can be on any scale. However, clustering in marketing research is mostly carried out using metric data, especially when clustering is used for segmenting the market. If your data are non-metric, please make sure that the computer program you use has provisions to handle non-metric data.
Are the variables well chosen?

Cluster analysis can only cluster on the variables we specify. It can only segment on attributes we think are important. So, if we are doing a benefit segmentation study, it is up to us to make sure that we have covered all the benefits. If, in an automobile study, we include only benefits such as speed, gasoline efficiency, trouble-free operation, safety and long life, obviously we will not get a segment whose main motivation for owning a brand of car is prestige. So it is important that the items we include be representative of the area we intend to cover. The second important thing to remember is that we should not overweight any group of items. Recall that cluster analysis derives the distance between two members by combining their differences on all attributes. If we include a large number of items on automobile operation and only one item on price, the items on automobile operation will automatically decide the structure of the clusters, since these attributes will contribute most to the distances among members. Therefore, in choosing variables for inclusion in a cluster analysis study:
- Choose variables that cover the complete spectrum of what you wish to study.
- Do not over- or under-weight any particular area. Where possible, keep the variables as uncorrelated as possible.
Is your sample size adequate?

As with factor analysis, the general guideline is that your sample size should be at least 10 times the number of variables. If you are performing a cluster analysis on 30 variables, you need a sample of about 300 (10 x 30). However, when cluster analysis is used for segmenting the market, you might find that this primarily mathematical criterion is not really adequate. Why? Consider the above example where we have 300 respondents. If we deem a 4-cluster solution to be satisfactory, then we will have about 75 respondents per segment. This would make subgroup analyses very difficult. Suppose we want to compare two segments on income, which in turn is grouped into high, medium and low. It is difficult to make any meaningful comparison of income distribution between two groups because we will be dealing with very small sample sizes: 75 respondents subdivided into 3 groups.

So the mathematical consideration plays a secondary role to practical requirements in cluster analysis when our aim is to understand subgroups, as is the case when we carry out market segmentation. The best way then is to work backwards.
1. Make sure that you have about 10 times as many objects (respondents) as you have attributes on which you intend to cluster them. This satisfies the mathematical criterion.
2. Estimate the number of segments you are likely to end up with at the end. (Although we do not know this at the beginning, typically we tend to work within some range. In most instances in the products and services sector we derive between 3 and 6 clusters. Part of the reason for this has to do with marketing strategies: having too many segments may be too difficult to work with for non-niche products or services.) Let's assume that you expect to extract four segments.
3. Decide how many respondents you need to have in each segment to make comparisons on important variables across the segments. Let's assume that you need about 200 respondents per segment. In this case, you would need a sample of 800 (200 x 4). At the end of the analysis, suppose you end up with 5 groups instead of the four you anticipated. In this case, you will have an average of 160 respondents per group, not too far from 200. However, if you derive 8 segments instead of 4, then your average sample size per group may be just 100, only about half as big as you would ideally like. For this reason, you might want to decide on the sample size based on the upper end of the range of the expected number of segments.
Do you have the right program?

All major statistical computer packages (such as SPSS, SAS, BMDP or SYSTAT) have cluster analysis routines. However, different packages may offer different choices of clustering algorithms. Unlike with many other statistical procedures, you may get different results from different packages depending on the type of clustering procedure and algorithms used. Initially you may want to experiment with the standard program you have access to and then try other programs until you find a program that is suited to your needs. In general, no clustering procedure or algorithm can be considered necessarily superior to any other. What works better usually depends on the context and the nature of the underlying relationships. For this reason, it is better to have access to a variety of clustering routines than to just a single program.

As a general rule, if you intend to cluster brands and understand the interrelationships among a small number of objects, you should have access to hierarchical clustering procedures. These are readily available in all major statistical packages. For market segmentation procedures, you will need iterative partitioning (k-means) clustering routines. Although these are also available in major packages, the choice of procedures and algorithms may depend on the package.

IV. HOW TO PERFORM CLUSTER ANALYSIS

Doing cluster analysis involves a series of decisions: Which technique to use, hierarchical or k-means clustering? If hierarchical, which kind: nearest neighbour, farthest neighbour, Ward or some other? If k-means, how to select the seeds? How to reassign cluster members? How many clusters to extract? All such decisions can have a major impact on the type of clustering you get at the end.

Do you have the input data in the right form?

To perform cluster analysis your input data should be at the individual level. For instance, if you want to cluster analyze 20 lifestyle statements for 1,000 consumers, then your input data would be the ratings on the 20 lifestyle statements given by each of the 1,000 consumers. Your input data will be in the format shown in Exhibit 13.14.

Exhibit 13.14. Typical data format for cluster analysis

ID       Ratings on lifestyle attributes 1, 2, 3, ... 20
0001     (ratings given by respondent 0001)
0002     (ratings given by respondent 0002)
0003     (ratings given by respondent 0003)
0004     (ratings given by respondent 0004)
...
1000     (ratings given by respondent 1000)

How to decide on the basic technique

The big advantage of hierarchical techniques is that you do not have to specify in advance how many clusters you want to extract. The disadvantage is that hierarchical techniques do not allow reassigning a cluster member once it has been assigned. Hierarchical techniques are subject to the linking or chaining effect, while partitioning methods are very sensitive to the initial solution. So there is no clear winner between the two.

Consequently the choice between the two is often made on pragmatic grounds. When the items to be grouped are many (say 30 or more), hierarchical clustering becomes fairly cumbersome for several reasons: (1) the dendrogram becomes too complex to provide a quick visual representation of the clusters; (2) if there are no natural clusters in the data, different hierarchical methods will give different solutions, and such differences in grouping become more and more difficult to reconcile as the number of objects grows; and (3) when the objects become really too many (1,000 or so), hierarchical clustering requires heavy computing power for no apparent advantage. So, our first guideline is that whenever the objects to be clustered exceed 30 or so,(6) you should have some specific reason for considering the use of a hierarchical technique. Our second guideline is that, whenever the sample size is large, partition clustering has a natural advantage over hierarchical clustering.
If hierarchical, which clustering rule to use

As we discussed in detail earlier in this chapter, there are at least four major methods by which
hierarchical clustering can be performed: nearest neighbor, farthest neighbor, centroid and the Ward
method. None of these methods is likely to be consistently superior to the others. As a result, you
may want to try alternative methods to test the consistency of the solutions and accept the one that is
most reasonable and consistent with your knowledge. The nearest neighbor method is probably the most
susceptible to the influence of outliers (observations that are exceptional and different from others in
the group), followed by farthest neighbor, centroid and the Ward method. In common practice, nearest
neighbor is probably the most frequently used method (perhaps because of its intuitive appeal),
followed by farthest neighbor, centroid and the Ward method.
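As a hedged sketch of trying the alternative rules side by side (scipy is assumed here; the text does not tie the discussion to any package, and the ratings are made up):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

ratings = np.random.default_rng(0).uniform(1, 10, size=(15, 6))  # 15 brands x 6 attributes

# 'single' = nearest neighbor, 'complete' = farthest neighbor
for method in ("single", "complete", "centroid", "ward"):
    tree = linkage(ratings, method=method)
    labels = fcluster(tree, t=4, criterion="maxclust")   # cut the tree at 4 clusters
    print(method, labels)

Comparing the label vectors across methods is one quick way to check whether the grouping is consistent.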
If partitioning, which clustering rule to use?

It is difficult to decide on the superiority of any of the numerous algorithms currently available to select
initial seeds and to assign and reassign objects to clusters. From a practical point of view, most analysts'
choices are confined to the ones available in the analytic package they use. If one were to consider all
the choices available from different clustering programs, there are in fact far too many. Which of the
programs you have access to works well for your data is best ascertained by trial and error. While this
is not a reassuring method (especially if you are new to these techniques), cluster analysis, it should be
noted, is a heuristic and not a mathematical technique. This means that there are no real validity
checks - only whether the technique works in practice and whether the results are reproducible.

6. This is not meant to be a cut-off or a suggested cut-off number. One should rethink the use of hierarchical clustering once the hierarchical
grouping becomes too cluttered and difficult to interpret.

What does this mean in practice for someone new to the technique? It means that you may
want to start with the default options, analyze the results, then use another technique and compare the
results. If you find an algorithm that works well for your data, you may want to split your data into
two sub-samples and cluster them separately to assess whether the results are stable and
reproducible.
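A minimal sketch of that split-sample check, assuming scikit-learn (the data and the choice of four clusters are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
data = rng.uniform(1, 10, size=(1000, 20))        # 1000 respondents x 20 statements
order = rng.permutation(len(data))
half_a, half_b = data[order[:500]], data[order[500:]]

km_a = KMeans(n_clusters=4, n_init=10, random_state=0).fit(half_a)
km_b = KMeans(n_clusters=4, n_init=10, random_state=0).fit(half_b)

# Assign half B's respondents with both models; high agreement suggests a stable solution.
agreement = adjusted_rand_score(km_a.predict(half_b), km_b.labels_)
print(round(agreement, 2))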
How to deal with outliers

In discussing hierarchical clustering we noted how an extreme value - an outlier - could potentially
alter the structure of the clusters derived. The problem of a single extreme value altering the structure
of clusters is somewhat less serious in large-scale segmentation studies, especially when the items are
based on rating scales. Outliers have a less significant effect on larger samples than they do on smaller
samples, and when responses are confined to a limited set of choices (such as points on a rating scale),
outliers again tend to be a less significant problem.
Notwithstanding the above, it may be worthwhile to identify and minimize the effect of outliers
on clustering. Many clustering programs identify outliers. What does one do with outliers? The most
commonly used procedure is to eliminate the outliers from the analysis. The reasoning for this is that
outliers are rare, unusual observations, and including them simply distorts the general cluster structure.
However, caution should be exercised in eliminating outliers, especially if there are many of them.
Too many outliers can be caused by problems such as misunderstanding of the survey instrument by a
small group of respondents or the presence of respondents with distinct response patterns. Therefore,
when outliers exist in large numbers, the analyst may want to look more closely at the responses in an
effort to understand why there are so many outliers, rather than simply discarding them.
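One simple screening sketch - only one of many possible approaches, with made-up data - flags respondents whose ratings lie far from the attribute means:

import numpy as np
from scipy import stats

data = np.random.default_rng(2).normal(5, 1.5, size=(1000, 20))
z = np.abs(stats.zscore(data, axis=0))          # standardize each attribute
suspect = (z > 3).any(axis=1)                   # any rating more than 3 SDs out
print(int(suspect.sum()), "respondents flagged for closer inspection")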
How many clusters to retain

Since clusters are not defined entities when we start the analysis, we do not know in advance how
many clusters there might be in our data. In fact, we do not know how many clusters there might be in
our data even after we have performed cluster analysis. So we have to use some criterion to decide how
many clusters to retain.

Hierarchical clusters. If you use hierarchical procedures, deciding on the number of clusters is not
really a major problem. Since we get the dendrogram, we can go through different levels of
classification and stop at any level of aggregation that is sufficient for our analysis purposes. Suppose
we carry out cluster analysis on 15 brands of soft drinks. If your aim is to understand how the brands
group at a generalized level, you may find that brands divide into diet and non-diet drinks and, within
each category, into cola and non-cola drinks. This may be all you are interested in. If your aim is not
only to find how your particular brand fits into this overall scheme but also to assess who your nearest
competitors are at the brand level and at the subcategory level, then you may want to use the complete
dendrogram.
Iterative partitioning clusters. When we use iterative partitioning clustering programs, many criteria
are available to decide on the number of clusters to retain. Some of these criteria are very complex
while others are less so. One of the simpler and intuitively appealing criteria is based on comparing the
within-group sum of squares (how much members within a segment differ from one another) with the
between-group sum of squares (how much the groups vary between themselves). Since our aim is to
create clusters whose members are similar to each other but dissimilar to members of other clusters, we
compare the between-group and within-group variation to decide on the number of clusters to be
retained, as sketched below.
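A minimal sketch of that comparison using k-means (scikit-learn's inertia_ is the within-group sum of squares; the package and the data are assumptions, not part of the text):

import numpy as np
from sklearn.cluster import KMeans

data = np.random.default_rng(3).uniform(1, 10, size=(800, 20))
total_ss = ((data - data.mean(axis=0)) ** 2).sum()

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    within = km.inertia_                 # within-group sum of squares
    between = total_ss - within          # between-group sum of squares
    print(k, round(within), round(between))
# A common heuristic is to retain the k beyond which the within-group
# sum of squares stops dropping sharply.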
Yet in practice, most criteria used to determine the number of clusters are not very useful in
many contexts, such as market segmentation, as the following examples demonstrate:
1. A statistical criterion suggests that the optimal solution is 6 clusters. However, if the 4-cluster solution
identifies segments that exhibit high market potential for your product, the fact that the 6-cluster
solution is statistically superior is of little practical importance to the analyst.
2. A statistical criterion identifies the 8-cluster solution as the best. If this happens to be too many
clusters to implement your marketing strategy effectively, then you may want to use a smaller number
of clusters.
3. A statistical criterion identifies the 4-cluster solution as the best. If you are looking for some highly
specialized segments, you may have to extract more clusters before you can identify the segments
that are of interest to you.
As the examples illustrate, in determining the number of segments the compelling criteria are far more
often practical than statistical. Consequently, statistical criteria for deciding the number of clusters are
of limited use in marketing analyses.

So how do we use practical criteria to decide on the number of clusters? This depends on your
objectives, as discussed below.
Specialized segments. Consider a situation in which your aim is to identify the segment that will be
interested in a very expensive perfume for women. Even after excluding the obvious non-target
audience, such as low-income groups, we may be dealing with a very small group. Broadly segmenting
high-income women may not isolate our target audience. We may have to extract many segments
before we can identify the segments that are of interest to us.
Generalized segments. Generalized segmentation aims to understand the market better so that our
product can be marketed (or slanted) to a large part of the market, such as the price-sensitive segment.
If you are interested in identifying generalized segments, then obviously you do not want to create too
many segments - say no more than 5 or 6. To achieve this, you may want to create a number of
solutions - say up to 8 clusters - compare the solutions and accept the one that best fits your marketing
objectives. Suppose you find that the 5-, 6- and 7-cluster solutions are all equally usable from a
marketing point of view. Which one should you accept? One way to answer this question is to use a
number of marketing variables to identify the solution that best distinguishes the segments. For
instance, if variables such as brand usage and demographics are clearly distinguished from segment to
segment in the 5-cluster solution but are somewhat more diffuse in the other solutions, then the
5-cluster solution may be your best choice.
Aggregated micro-clusters. William Neal7 suggests an alternative. His approach is to initially
segment the respondents in different ways using different criteria (e.g. demographic, attitudinal, usage).
By cross-tabulating these segmentations with one another, we create a large number of small clusters
called micro-clusters. Because they are small, these clusters tend to be highly homogeneous. To make
them useful for practical purposes you can combine similar micro-clusters into larger segments. Note that
this procedure is similar to hierarchical segmentation except that (1) the segments are derived using
partitioning methods, and (2) the combining of segments is based not on a mathematical rule (such as
nearest neighbor or farthest neighbor) but on marketing objectives. When we use aggregated micro-
clustering, the number of clusters does not depend on outside criteria but on aggregating all the
micro-clusters we are interested in. A sketch of the cross-tabulation step follows.
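A minimal sketch of forming micro-clusters by cross-tabulating two independently derived segmentations (pandas assumed; the two segment labels here are random stand-ins for clusterings derived separately):

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
segments = pd.DataFrame({
    "attitudinal_segment": rng.integers(1, 5, size=800),
    "usage_segment": rng.integers(1, 4, size=800),
})
# Each non-empty cell of the cross-tab is a micro-cluster; small, similar cells
# can then be merged on marketing grounds rather than by a distance rule.
micro = pd.crosstab(segments["attitudinal_segment"], segments["usage_segment"])
print(micro)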

7. Canadian Journal of Marketing Research, Vol. ***, 1998.

What distance measure to choose

Distance measures calculate how far apart two individuals are. Earlier we described the Euclidean
distance, but it is only one of many ways of determining how close two individuals are to each other.
Not all distance measures will give you the same results, and no single measure can be considered
definitively superior to all others.
Euclidean distance

This is probably the most frequently used of all distance measures. We have already described how it
works. To review briefly:

1. Find the difference on each attribute between the pair of respondents,
2. Square the differences,
3. Add them together, and
4. Take the square root of the sum.
Ratings of respondents A and B on three attributes

                 (A - B)     (A - B)2
Attribute 1        -2            4
Attribute 2        -3            9
Attribute 3        -4           16
                        Total = 29
Euclidean distance [sqrt(29)] = 5.4

Although this is a frequently used measure, please note that it is by no means an ideal measure. Like
most other distance measures, it ignores the correlations among the items. For instance, if your
questionnaire battery consists of 4 highly correlated items and 6 not-so-highly correlated items, your
clusters will be heavily influenced by the 4 highly correlated items, since the concept represented by
these items is counted four times in different guises. This can indeed be a problem, since the clusters
are then determined not by all the concepts represented in the analysis, but mostly by the concept that
happens to be represented by a number of correlated variables.
City block distance

City block distance, also known as the Manhattan metric, is simply the total of all the absolute
differences (i.e., ignoring the direction of the difference) between two individuals. For example,

Ratings of respondents A and B on three attributes

                 (A - B)     |A - B|
Attribute 1        -2            2
Attribute 2        -3            3
Attribute 3        -4            4
City block distance between A and B = 9

City block distance is intuitively appealing: it is simply the total of all the differences between two
individuals. It does not overcome the problem of correlated variables any more than Euclidean distance
does. Furthermore, city block distance assigns the same weight to all differences, while Euclidean
distance assigns greater weight to larger differences than to smaller differences8.
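A minimal sketch of both measures on the three-attribute example (scipy assumed; the two rating vectors are illustrative values whose differences are -2, -3 and -4):

from scipy.spatial import distance

a = [8, 6, 3]    # respondent A's ratings (illustrative)
b = [10, 9, 7]   # respondent B's ratings, so A - B = -2, -3, -4
print(round(distance.euclidean(a, b), 1))   # 5.4  (square, sum, square root)
print(distance.cityblock(a, b))             # 9    (sum of absolute differences)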

8. Note that Euclidean distance first squares all differences before summing and taking the square root. This process of squaring the differences
implicitly assigns larger weights to larger differences.

HOW TO DO ANALYSIS: A step-by-step guide


6. Collect all the dimensions that are important to your problem. For instance, if you are planning
to understand the factors of customer satisfaction for a telephone utility, you may want to think
of all the dimensions pertaining to this, such as installation service, repair service, price tariff
and responsiveness. Make sure that the dimensions you have chosen are fully representative of
what you want to factor analyze.
7. Develop different attribute statements pertaining to each dimension. For instance, pricing can
include statements such as "monthly charges are reasonable" and "long distance charges are
competitive". Although not absolutely necessary, it is preferable to balance the dimensions by
creating approximately the same number of statements for each.
8. Conduct the survey and analyze the results. In analyzing the results, you need to make a number
of decisions, such as the factor analytic method, the rotation method and the stopping criterion.
See text for details, and see the brief sketch after this list.
9. Interpret the results. See text.
10. Present the results. What you would present or include in your report will depend on the
audience. At a minimum, you should report the number of factors, their interpretation, the attributes
included in each factor, the factor loadings and the variance explained by each factor.
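A minimal sketch of step 8 (scikit-learn is assumed; its FactorAnalysis supports varimax rotation in recent versions, though the text does not prescribe any particular program, and the data here are simulated):

import numpy as np
from sklearn.decomposition import FactorAnalysis

ratings = np.random.default_rng(5).normal(5, 2, size=(1000, 20))   # 1000 respondents x 20 statements

fa = FactorAnalysis(n_components=4, rotation="varimax").fit(ratings)
loadings = fa.components_.T          # one row per attribute, one column per factor
print(np.round(loadings[:5], 2))     # loadings for the first five attributes
print(np.round((loadings ** 2).sum(axis=0), 2))   # rough variance attributable to each factor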

DOING THE ANALYSIS


For a complete example with annotation please see the PowerPoint presentation, Chapter 14.

Perceptual Mapping
Perceptual mapping techniques enable us to provide a visual representation of two sets of variables,
such as brands and attributes, in a way that makes their inter-relationships easy to understand. Such
techniques include correspondence analysis and biplots. Techniques such as non-metric
multidimensional scaling are used to plot just one set of variables. These plots enable us to understand
and communicate complex relationships among brands and attributes.

What is perceptual mapping? Perceptual mapping is a group of techniques that provides a visual
representation of the relationship between two sets of attributes. The sets that are typically mapped
include brands and evaluative attributes, or demographics and brand usage. Such visual representation
can enable us to understand and communicate the data better.
Factor analysis gave us tools that enable us (1) to reduce a number of variables into fewer
dimensions; and (2) to depict these dimensions in a visual format. In research we often need to go a
step further. We not only want to know what the relationships are among different variables, but also
how one set of variables is related to another set of variables. For instance, a researcher may want to
know not only how different product attributes are related to one another but also how they relate to the
demographic profile of consumers. Or a marketer may want to know the relationship between
product benefit attributes and consumers' lifestyle characteristics. In such instances correspondence
analysis can be used to understand the inter-relationships between two sets of variables and depict
them visually, so they are simple to understand and communicate.
Techniques that represent one or more sets of variables on a two-dimensional plane are
generally referred to as perceptual maps or, more narrowly, as brand maps. Such techniques include
correspondence analysis, biplots, discriminant analysis and multidimensional scaling. In this chapter
we discuss correspondence analysis because it is currently a widely used method.

Correspondence analysis can be used to solve these and similar problems. A typical contingency table
to which correspondence analysis may be applied is given in Exhibit 13.15.
Exhibit 13.15
Cereals and attributes

(Contingency table. Rows are eleven image statements: Comes back to, Tastes nice, Popular with all
the family, Very easy to digest, Nourishing, Natural flavour, Reasonably priced, A lot of food value,
Stays crispy in milk, Helps to keep you fit, Fun for children to eat. Columns are eight cereal brands:
Corn Flakes, Weetabix, Rice Krispies, Shredded Wheat, Sugar Puffs, Special K, Frosties, All Bran.
Each cell contains the number of times respondents associated that statement with that brand.)

In this exhibit, the figure at the intersection of an attribute and a brand stands for the number of times
that attribute was found to be associated with that brand in this particular survey. Correspondence
analysis answers questions such as: Which cereals are perceived similarly? Which attributes are closely
related to one another? How do attributes relate to specific brands?

CORRESPONDENCE ANALYSIS MODEL


Correspondence analysis terminology
Correspondence analysis in its current form was developed and popularised by French data analysts
such as Jean-Paul Benzécri and his associates. During the course of its development, the developers
of the technique used their own special terminology, which is commonly used in the literature, to
describe well-known concepts. It is important to be familiar with this terminology not only to follow
published research in the field but also to use correspondence analysis software, which may describe
the results in these terms.
Profiles. Profiles are proportions or relative frequencies. A profile is a set of frequencies (non-negative
quantities) divided by their sum, as shown in Exhibits 13.16a, 13.16b and 13.16c.
Exhibit 13.16a
Raw data
                          High income    Low income    Total
Men                           100            300        400
Women                         100            100        200
Total                         200            400        600

Exhibit 13.16b
Row profiles (Gender profiles across income)
                          High income    Low income
Men                           .25            .75
Women                         .50            .50
Profile of marginal row       .33            .67

Exhibit 13.16c
Column profiles (Income profiles across gender)
                          High income    Low income    Profile of marginal col.
Men                           .50            .75                .67
Women                         .50            .25                .33

Mass. The mass of a row (the row proportion) is the row total divided by the grand total. The mass of a
column (the column proportion) is the column total divided by the grand total. The average row profile is
the centroid of the row profiles, where each profile is weighted by its mass. Column masses are
derived similarly, as shown in Exhibit 13.17. (The centroid, or weighted average, is less commonly
known as the barycentre.)

Exhibit 13.17
Row (gender) and column (income) masses
                 High income    Low income    Total    Row masses
Men                  100            300        400        .67
Women                100            100        200        .33
Total                200            400        600
Column masses        .33            .67
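A minimal numpy sketch of the profiles and masses computed from the raw table above:

import numpy as np

counts = np.array([[100, 300],    # Men:   high income, low income
                   [100, 100]])   # Women: high income, low income
grand = counts.sum()

row_profiles = counts / counts.sum(axis=1, keepdims=True)   # .25 .75 / .50 .50
col_profiles = counts / counts.sum(axis=0, keepdims=True)   # .50 .75 / .50 .25
row_masses = counts.sum(axis=1) / grand                     # .67, .33
col_masses = counts.sum(axis=0) / grand                     # .33, .67
print(row_profiles, col_profiles, row_masses, col_masses, sep="\n")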
Exhibit 13.18
Row and column masses: general notation

(Schematic contingency table with rows R1 ... Ri ... Rn and columns C1 ... Cj ... Cp. The cell frequency
is kij, the row totals are ki., the column totals are k.j, and k is the grand total. Row masses are ki./k and
column masses are k.j/k.)

Inertia and display error. The accuracy of the correspondence analysis map is measured by the
percentage of inertia (analogous to the expression "percentage of variance explained" commonly
used in multivariate analysis). The percentage of inertia not accounted for (the unexplained variance) in
the map is considered display error. The total inertia is the sum of the individual inertias of the rows
when we are considering the set of row profiles, and the sum of the individual inertias of the columns
when we are considering the set of column profiles. Total inertia can also be split into components
along the principal axes.
Contribution. Contribution refers to the inertia contributed by each row and each column to the total
inertia. The contributions let the analyst know the impact of the different variables, thereby facilitating
the interpretation of the axes.

The correspondence analysis model


In the correspondence analysis model, the initial data matrix - the raw numbers of the contingency
table - is transformed into two tables of profiles: a table of row profiles and a table of column profiles.
With each of the profiles in each of these tables an appropriate mass is associated, and their centres of
gravity are computed. To measure the distances between the profiles, a modified Euclidean distance
called the distributional distance (or chi-squared distance) is used. The distributional distance is similar
to the usual Euclidean distance except that, in computing chi-squared distances, we divide each squared
difference between co-ordinates by the corresponding element of the average profile.
The dispersion of the profiles with respect to their centre of gravity is measured by inertia. Principal
components analysis is applied to each of these tables to obtain the principal axes, and the axes of
one analysis are superimposed on the corresponding axes of the other analysis, two by two, to get
perceptual maps in which both sets of variables in the data are plotted. These maps are interpreted
using the inertia components accounted for by the principal axes, and the absolute and relative
contributions (explained below) printed in the computer output, which are indispensable for
understanding the structure of the data. A sketch of the chi-squared distance follows.
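A minimal sketch of the chi-squared (distributional) distance between the two row profiles of the income example, following the definition above (numpy assumed):

import numpy as np

counts = np.array([[100, 300], [100, 100]])
row_profiles = counts / counts.sum(axis=1, keepdims=True)
average_profile = counts.sum(axis=0) / counts.sum()          # .33, .67

# Each squared difference between the profiles is divided by the corresponding
# element of the average profile before summing and taking the square root.
d2 = ((row_profiles[0] - row_profiles[1]) ** 2 / average_profile).sum()
print(round(float(np.sqrt(d2)), 2))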
Although the central operation of a correspondence analysis is the determination of principal axes,
correspondence analysis is not a simple variant of principal components analysis. Correspondence
analysis can be applied to many other types of data besides contingency tables. One of the most
useful extensions of correspondence analysis is to questionnaire data that are not in the form of a

contingency table, but can be usefully analysed after an appropriate coding. The analysis of
questionnaire data where each individual is described by his answers to many questions (variables) is
known as multiple correspondence analysis.
Perceptual maps

The usual perceptual map of a correspondence analysis shows the axes two by two. For example, if only
the first two axes are considered for interpretation, we can have a perceptual map showing axis 1 as
the horizontal axis and axis 2 as the vertical axis. If three axes are used, then we can have a 1x2 map,
a 1x3 map and a 2x3 map, and so on. Both sets of variables are plotted on these maps.
Quality. The quality of representation of a row or a column on a principal axis is measured by its relative
contribution (COR) with respect to that axis. If COR = 1000 (which is really equal to 1, since the printed
values are per mille), the quality of representation of that row or column on that axis is perfect. If
COR = 0, then the row or column is not represented on that axis at all. Any value in between shows the
degree to which the row or column is represented on that axis. (Some programs report relative
contributions as proportions or percentages rather than per mille.)
In perceptual maps which take into consideration a certain number of axes, say 2 or 3, the quality of
representation is given by adding the COR values for those axes. If only two axes are considered, then
Qual = COR1 + COR2; if three are considered, as in our hospital data later in this exposition,
Qual = COR1 + COR2 + COR3.

Active and supplementary points

Sometimes, after an analysis has been completed, we may come across new data that could have
been included in the analysis if we had had them earlier. For instance, suppose two distinct new
brands of automobiles are introduced into the market, after we have just completed an analysis
showing the perceptual space of the consumers with respect to brands. It is then not necessary to redo
the entire analysis in order to see how the new brands fit into the perceptual map already obtained.
The transition formula is used to calculate the coordinates of these new brands and they are plotted on
the same perceptual map. Their position in this map is an indicator of their relative place vis a vis the

other brands. However, the relative contribution COR for these points is likely to be poor, and CTR has
no meaning for a supplementary element.
When a particular row or a column has a very large mass it is likely to overshadow the relationships
that exist between the other rows or columns in the analysis. In such a case the analysis can be
repeated by treating this element as supplementary. The analysis will then reveal more clearly the
relationships between the other elements.

HOW TO PERFORM CORRESPONDENCE ANALYSIS


Correspondence analysis is a straightforward procedure and the derivations are fairly standard.
Although you need to decide on a number of things, such decisions are far fewer than for many other
techniques such as factor or cluster analysis.
Do you have the input data in the right form?

To perform correspondence analysis your input data should be in the form of a contingency table.
Contingency table is the technical term for what we generally call cross-tabs. For instance, if
you want to study the relationship between demographic characteristics and brands used, you may
have demographic characteristics as the column variable and brand usage as the row variable. Note that:
the cells of the table typically contain the number of observations (e.g. respondents) rather than
percentages or averages; and
the cells within any given variable are mutually exclusive, i.e., an observation may not appear in
more than one cell for any given variable.
A typical input data table is shown in Exhibit 13.19.

Exhibit 13.19. Typical data format for correspondence analysis

Brand    Very satisfied    Satisfied    Dissatisfied    Very dissatisfied
AA             40              16            24                18
BB             18              14            34                20
CC            115              58            60                17
DD             73             127           164                64
EE             54              33            35                14

How to decide on row and column variables

In the above example we used satisfaction levels as the column variable and brands as the row variable.
How do we decide on this? Will our interpretation be different if we use satisfaction levels as the row
variable and brands as the column variable? In correspondence analysis, it does not matter whether a
given set of variables is treated as the column set or the row set; the interpretation of the final output
does not change.
However, if you are using asymmetric maps, you should know that the maps will look different
depending on which set is specified as the column variables and which as the row variables.
How many dimensions to extract and plot?

Many correspondence analysis programs do not ask the user to decide how many dimensions should
be extracted but let the data decide the number of dimensions. Since correspondence analysis is
generally used as a visual mapping technique, the first two dimensions are adequate in most instances.
However, extracting more dimensions may allow us to understand and interpret the data better.
Many programs have a provision to let the analyst specify the number of dimensions to be
plotted. Usually, plots of the first two dimensions are adequate.
Is your map symmetric or asymmetric?

Correspondence analysis maps can be either symmetric or asymmetric. As mentioned earlier, in

symmetric maps both the row and the column analyses are produced using the same technique -
principal co-ordinates analysis. In asymmetric analysis the row analysis is through standard co-ordinates
analysis while the column analysis is through principal co-ordinates analysis. Because of this, asymmetric
analysis will produce maps that look different depending on whether a set is specified as columns or as
rows.
Which map should you use?

Asymmetric maps have a definable relationship between the row and column variables. In symmetric
maps the relationship between row and column variables (e.g. brands and demographics) is only
directional. So it would seem that asymmetric maps are preferable to symmetric maps. Unfortunately,
asymmetric maps tend to cluster most variables (e.g. brands) in a small area around the middle, making
them less useful as a clear visual representation of the data. This is a particularly serious limitation when
we use the map to communicate the findings to others. For this reason, most researchers use symmetric
maps far more frequently than asymmetric maps.

HOW TO DO CORRESPONDENCE ANALYSIS


A step-by-step guide
1. Decide on the attributes that are to be submitted to correspondence analysis. For instance, if you
are planning to understand how age groups are related to the usage of investment products, (a) decide
on the age groups so they can be logically related to the usage of investment products; and (b)
decide on the investment products that are likely to be influenced by the stages of one's life.
2. Make sure that your data is not percentages or means but is in the form of a cross-tabulation.
3. Analyze the results. In analyzing the results, you need to decide on supplementary points
and on the analytic technique. See text for details.
4. Assess the validity of the results. Check the inertia accounted for by the first two dimensions. The
higher the percentage of inertia accounted for by the first two dimensions, the better the map's
representation.
5. Interpret the results. Make sure that you interpret each dimension separately. See text for details.
6. Present the results. The primary advantage of correspondence analysis is its ability to present the
results in a visual format. When presenting the results, construct your explanations around the
map, rather than around statistical details.

V. DOING THE ANALYSIS


For a complete example with annotation please see the PowerPoint presentation, Chapter 14.

Regression analysis
We are frequently interested in predictions. What will our market share be if our customers are more
satisfied with our service? What will our sales be if we lower our prices? What features of a product
result in increased purchase intent? Regression analysis attempts to answer questions of this nature.

A marketer is often faced with the problem of identifying how marketing variables are related to one
another. How important are different service attributes in deciding whether a customer will be loyal to
us? How do variables like inflation, discretionary income, and advertising expenditures affect our
sales? Which of the product attributes influence the overall evaluation of that product? What is the
contribution of different service attributes to customer satisfaction? What are the key drivers of
purchase behavior? Answers to questions like these enable the marketer to deploy resources
effectively and predict, however weakly, the course of future events and the potential effects of certain
marketing actions.

I. WHY REGRESSION ANALYSIS


The intuitive approach

The problem of predicting one thing based on another is common in marketing, as in everyday life.
Here are some more examples of relationships a marketer might be interested in.
Predicting the effects of a proposed price increase on sales
Predicting the effects of different service attributes on customer retention
Predicting the effects of customer satisfaction on customer loyalty
Predicting next years sales given various assumptions about economic variables affecting
sales

Identifying the key drivers of sales among high income groups


The simplest way to identify these relationships is perhaps through graphs. When we use graphical
techniques, certain relationships can look obvious. A scatter plot, in which each dot can represent a
person, can show if there is a straight relationship between two variables such as satisfaction and
loyalty. But what appears obvious in a graph is only the beginning. We need to know more. What
exactly is the nature of this relationship? How do we summarize and communicate this relationship?
How do we use this in further work?
Why intuitive approaches are not adequate
The intuitive approach does not identify the precise nature of the relationship. In the intuitive approach
we identified the direction of the relationships. However, in the examples cited, we are not simply looking
for the existence or absence of a relationship. We would also like to know the nature, strength and
direction of the relationship. We would like to state this relationship precisely so anyone can use it and
assess what would happen under hypothetical scenarios.
The intuitive approach does not show the relative influence of different predictor variables. There is one
more area in which intuitive understanding may fall short. If two variables are related to usage, which of
the two is the stronger? Are they interchangeable or are they complementary?
To answer these questions, we need to go beyond an intuitive understanding of the
relationships. Regression analysis is a technique of relevance here. It aims to clarify issues like the
following:
1. Are the variables assumed to be related to a criterion variable really related to it?
2. Of the variables that are identified to be related to a criterion variable, which one is the strongest,
which one is the weakest?
3. What exactly is the strength of the relationship between each independent variable and the
criterion variable?
4. How much of the variability in the criterion variable is explained by the variability in predictor
variables?
5. What is the most economical set of variables we can use to predict the value of the criterion
variable?

The regression analysis solution

Regression analysis provides formal solutions to the issues raised above. It enables us to predict
outcomes and to explain how variables are related. While, as with any multivariate technique, the
solutions come with several shortcomings, it is still an extremely useful technique. In fact, of all the
statistical techniques, regression is probably the most frequently used. Its use is implied in several
other techniques such as principal components analysis, path analysis, experimental design and
structural equation modeling, to name a few.
For predictions to be possible, the criterion variable should be related to the independent variables. If
these two sets of variables are not related, no prediction is possible. As we noted earlier, the correlation
measure r tells us whether two variables are related.
We first create a scatter plot and put a line through it; we can then see whether the line looks like a
fair representation of the relationship. Such a line often does not go through the center (called the
origin) of the axes. This is not unexpected. We do not expect loyalty to be 0 when satisfaction is 0:
we do expect some customers who are thoroughly dissatisfied with our product to be repeat buyers
for several reasons - it may be the cheapest available product, it may be the only product available
where these customers shop, some customers cannot make the effort to investigate other brands, and
so on. The place at which the line crosses the y-axis is called the y-intercept; it is the value of y when
x = 0. The line may or may not go through the center, or origin. The second feature is that the line has
a slope. The slope indicates by how many units y (the loyalty score) changes when x (the satisfaction
score) changes by one unit. This relationship can be represented by:
y = a + b(x)
Loyalty = a + b(satisfaction)

where
a = y-intercept; b = slope of the line

If we know the values of a and b, we can predict the value of y for any value of x. Let us assume
that the slope of the line is 0.8 and the y-intercept is 10. Then
a = 10; b = 0.8
If John's satisfaction index is 80, then our prediction of his loyalty score is 74, as shown below:
y = a + bx
  = 10 + 0.8 (80) = 74
If Jane's satisfaction index is 90, our prediction of her loyalty score is 82:
y = 10 + 0.8 (90) = 82
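The same predictions expressed as a tiny sketch (the intercept and slope are the assumed values from the example):

def predict_loyalty(satisfaction, a=10.0, b=0.8):
    return a + b * satisfaction      # y = a + b(x)

print(predict_loyalty(80))   # 74.0 (John)
print(predict_loyalty(90))   # 82.0 (Jane)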
In the next section we will see how regression analysis actually works.

II. HOW REGRESSION ANALYSIS WORKS


In the last section we noted that predictions are possible if

1. We can fit a line through the scatter plot;
2. We know where this line intersects the y-axis; and
3. We know the slope of the line.
In addition, it would also be helpful if

4. We have a measure of how well the line represents the scatter plot. This will tell us how
accurate our predictions are.
Here we will see how these objectives are achieved.
Fitting a line through the scatter plot

Once we plot a line through the scatter plot, by definition, we can easily derive where the line intersects
the y-axis and what the slope of the line is.
In our earlier example, we drew a line through the scatter plot based solely on our judgement that it
was representative of the scatter plot. As you might have guessed, this is too crude to be accurate in
real-life situations. First, literally an infinite number of lines can be drawn through the scatter plot, and
only one (or a few) can be considered a true representation of the data. Second, because an infinite
number of lines can be drawn, different people might come up with different lines for the same set of
data. Obviously, we need (1) a criterion by which to decide which of the many possible lines is the best
one; and (2) a procedure to identify this line easily.
Let's begin at the beginning.
Suppose you know that the average loyalty index for customers of firm A is 78 and for firm B it is
85. If we know only that a given customer deals with A, then our best prediction of that person's
loyalty score would be 78. If we make the same prediction for every customer of A, we would be right
sometimes and wrong at other times. The actual score of each customer will vary from our average
prediction of 78. This variability can be summarized in a single number by measures such as the
variance (or the standard deviation).
Can we improve upon this prediction? Surely not every customer of A is likely to have a score of
78. Some will have scores higher than 78 and others lower. If we can attribute some of the
variability to a cause, then part of the variability we noted earlier will be explained away and our
predictions will become more precise.
This is where regression analysis comes in. Suppose we hypothesize that, among customers of A,
satisfaction is related to loyalty - those with a higher level of satisfaction will have higher loyalty. So let
us plot the satisfaction score on the x-axis and the loyalty score on the y-axis. Now we are faced with
the problem of drawing a line through the scatter plot.
You will recall that when we used the mean as our predicted value for customers of A, we had
many customers whose scores deviated from the predicted score. Here we are attempting to use
satisfaction scores to account for these deviations, so that our predicted scores move closer to the actual
scores. We want to minimize the deviations between our actual and predicted scores. For technical
reasons, we do not minimize the actual deviations. Instead, we minimize the squared deviations from
our predicted scores. Consequently, we are said to use the least-squares method, and the line is drawn
through the scatter plot using this method.
It is not important for the user to know how the technique actually finds the line that minimizes the
squared deviations. What is important is that a technique exists which will quickly find this line. This is
known as least-squares regression.
You will recall that the regression line is of the form y = a + bx and that our aim is to predict the value
of y given the value of x. To do this we need the values of a and b. Regression analysis provides these
values. Although in practice you are unlikely to calculate these values without the use of a computer,
the calculations themselves are straightforward.
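A minimal least-squares sketch (numpy assumed; the satisfaction and loyalty values are invented for illustration):

import numpy as np

satisfaction = np.array([55, 60, 70, 75, 80, 85, 90, 95])
loyalty = np.array([54, 57, 66, 73, 72, 80, 82, 88])

b, a = np.polyfit(satisfaction, loyalty, deg=1)    # slope, then intercept
predicted = a + b * satisfaction
r2 = 1 - ((loyalty - predicted) ** 2).sum() / ((loyalty - loyalty.mean()) ** 2).sum()
print(round(a, 1), round(b, 2), round(r2, 2))      # intercept, slope, fit quality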
Multiple regression analysis

So far, what we have talked about is known as simple regression. Simple regression has one dependent
variable and only one independent variable. In real life, things are a little more complicated. For instance,
customer loyalty (the dependent variable) is not affected just by customer satisfaction - it might also be
affected by factors such as price and availability. Similarly, purchase intent (the dependent variable, y)
can be influenced by variables such as price (x1), quality (x2), availability (x3), after-sales service (x4),
and ease of use (x5). So the relationship is no longer y = a + bx, but
y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5

where
y  = purchase intent (the dependent variable)
b1 = the weight assigned to x1 in predicting y (purchase intent)
x1 = the value of x1 (price, in our example)
b2 = the weight assigned to x2 in predicting y (purchase intent)
x2 = the value of x2 (quality, in our example)
... and so on.
The purpose of multiple regression, then, is to find weights for the different independent variables such
that we can predict the value of the dependent variable as accurately as possible (i.e. with the lowest
possible deviation between the actual and predicted values, as measured by the squared deviations). In
addition to providing the weights for the different independent variables, multiple regression also tells
us (a minimal fitting sketch appears after this list):
1. how effective the selected variables are in predicting the value of the dependent variable; and
2. how effective each variable is, relative to the other variables, in predicting the value of the
dependent variable.
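A minimal multiple regression sketch (statsmodels assumed; the data are simulated, so the recovered weights and R-squared are illustrative only):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
price, quality, availability = rng.normal(size=(3, 300))
intent = 2 - 0.4 * price + 0.5 * quality + 0.3 * availability + rng.normal(scale=0.5, size=300)

X = sm.add_constant(np.column_stack([price, quality, availability]))
model = sm.OLS(intent, X).fit()
print(model.params.round(2))      # intercept and the weights for price, quality, availability
print(round(model.rsquared, 2))   # share of variance in intent explained by the predictors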

III. HOW TO DESIGN A REGRESSION ANALYSIS STUDY


Is your problem suited for regression analysis?

Regression analysis is the right technique if you are trying to understand the relationship between one
dependent variable and one or more independent variables. The purpose of the technique is to
establish the nature and strength of the relationships between the dependent and the independent
variables. Some typical problems solved by regression analysis are:
You are an economist. You want to predict next year's gross domestic product.

Is your data suited for regression analysis?

Regression analysis is a metric technique. This means that both the dependent and independent
variables are expected to be metric. However, in regression analysis some independent variables can be
nonmetric; when we use nonmetric independent variables, we are said to be using dummy variables.
In regression analysis, all variables do not need to be on the same scale. For instance, you can use
both height (measured in inches) and weight (measured in kilos) as independent variables in the same
analysis.
Are the variables well-chosen?

The basic principle in the selection of variables for regression analysis is to choose independent
variables such that:
1. They are expected to explain most of the variance in the dependent variable. The reason for this is
obvious: if we exclude some important variables that affect the dependent variable, then our
equation is not likely to be very effective.
2. They are well correlated with the dependent variable. In practice, since we are not likely to know
the correlations beforehand, we choose the variables that we expect to be correlated with the
dependent variable.
3. It is highly desirable that the independent variables are not highly correlated among themselves. The
reason for this is explained later.
We may also choose to do a correlation analysis before actually doing the regression analysis. We can
examine the correlation matrix to identify the variables that are highly correlated with the dependent
variable. If all the proposed independent variables are weakly (or not at all) correlated with the
dependent variable, regression analysis is unlikely to be useful. If this happens to be the case, we need
to examine whether there are other candidate variables that we might have overlooked.
The problem of multicollinearity

What happens if the independent variables chosen are correlated among themselves? Let us consider
an example of attributes that contribute to the overall favorability rating of a fast food chain. Suppose
"quick service" and "efficient service" are highly correlated with each other and also with the dependent
variable, overall favorability.

Let us assume that we are performing stepwise regression analysis. Suppose the model considers
the effect of the first variable (say, quick service) and then evaluates whether each of the remaining
variables adds anything significant to the model. If quick service and efficient service are highly
correlated, then there will not be much variance left for efficient service to explain; whatever variance
would have been explained by efficient service has already been explained away by quick service.
As a result, the regression model will attribute a very weak relationship (or even no relationship at all)
between efficient service and overall favorability. If we use this model for interpretative purposes we
can easily arrive at seriously erroneous conclusions. For instance, because efficient service and overall
favorability are only weakly, or not at all, related in the model, it is easy to conclude that efficient
service has no bearing on the overall favorability evaluation. It is not hard to imagine the creation of a
service quality program that ignores efficient service, based on such an erroneous interpretation.
You can see why this problem, known as multicollinearity, is a very serious one. Its effects are
hidden. Unless one understands the consequences of multicollinearity, one can easily mislead oneself
and others into concluding that certain variables have little effect on the dependent variable, when in
fact they may have a strong impact.
Even more importantly, as the correlation between two independent variables increases, for
mathematical reasons, instability is created in the model estimation. Briefly stated, the mathematics
behind the technique has no way of attributing the effect correctly to either variable.
Multiple regression analysis is designed to handle correlated variables - that is the reason why
we do multiple regression analysis instead of a series of simple regressions. The problem of
collinearity arises only when two independent variables are highly correlated (say, 0.8 or above). So, to
the extent possible, you should choose independent variables that are not likely to be too highly
correlated among themselves.
Is your sample size adequate?

The sample size chosen for regression analysis (1) should be large enough to draw reliable
conclusions; and (2) should be, at a minimum, a large multiple of the number of variables.

Consider this relationship9:

% variance explained by chance = [number of variables / (sample size - 1)] x 100

Assume that you have 20 variables and 100 respondents. In this case,
% variance explained by chance = (20/99) x 100 = 20%.

Given this rather startling result, we should be extremely wary of using a sample size that is small. A
suggested rule of thumb is that, at a minimum, we should attempt to have a sample that is 10 or even
20 times the number of independent variables10.
Do you have a suitable program?

Regression analysis, being the most widely used of all statistical models, finds a place in every statistical
package, such as SPSS, SAS, BMDP, or Systat. Standard regression and several of its variations are
usually included in all major packages. Unless you have special requirements or need the output in a
certain way, practically any regression program will perform adequately.

IV. HOW TO PERFORM REGRESSION ANALYSIS


As with other multivariate techniques, you are called upon to make a number of decisions while
performing regression analysis. What method should you use? How many variables should you
include? How to specify dummy variables? Many such decisions have to be made.
Do you have the data in the right form?

Modern computer programs do not require the data to be in any rigid format. As long as the format is
clearly specified, the program can read the data and perform the required analysis. In its most basic
form, regression analysis looks for data specified as in Exhibit 13.20. For each respondent we have the
value of the dependent variable (DV) and the values of several independent variables (IV). In the Exhibit,
the value of the dependent variable for each of the 500 respondents is followed by the values of 19
independent variables.

9. This relationship holds under conditions in which (1) the x-variables may be random or fixed; (2) the value of y does not depend on the
x-variables; and (3) the distribution of y is essentially symmetric.
10. There are also other rules of thumb suggested for this purpose (e.g. Green, S.B., How many subjects does it take to do a regression analysis?
Multivariate Behavioral Research, 1991, 26, 499-510). For instance, the suggested minimum sample size for testing individual predictors is
104 + the number of independent variables; with 20 independent variables we need a minimum sample of 124. Similarly, the minimum sample
needed for testing the multiple correlation is (number of independent variables x 8) + 50; with 20 independent variables, we need a minimum
sample of (20 x 8) + 50 = 210. Given the ease with which we can obtain significant results in multiple regression analysis, I suggest that you
err on the conservative side by comfortably exceeding the suggested minimum.

Exhibit 13.20. Typical data format for regression analysis

ID        DV      IV (independent variables) ...
0001      ...     ...
0002      ...     ...
0003      ...     ...
0004      ...     ...
...
0500      ...     ...

(One row per respondent: the value of the dependent variable followed by the values of the 19 independent variables.)

How to choose a regression model

You can perform regression in three different ways: standard, stepwise and sequential (hierarchical).
The standard model

In the standard form, we require the technique to consider all the variables submitted simultaneously
and fit a regression equation. This method is particularly useful if you built a model a few years ago (or
in a different context) and would like to see how things have changed or remained constant across time
or across different contexts.
The stepwise model

In the stepwise model, also known as statistical regression, we require the technique to follow
some rules before entering or removing a variable in the regression model. You can choose one of the
many methods available to perform stepwise regression.

In the backward elimination method, we begin with all the independent variables. At the next
stage, the variable with the least significant predictive power is eliminated from the model. Following
this, the next least significant variable is removed. This process is continued until no non-significant
variables are left in the model.
In the forward selection method, the process is reversed. We start with the most significant predictor
variable and keep adding (or deleting) one variable at a time until none can significantly improve the
model fit. A sketch of backward elimination follows.
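A minimal backward-elimination sketch using p-values (statsmodels assumed; packaged stepwise routines use more refined entry and removal criteria):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 1 + 0.8 * X["x1"] + 0.5 * X["x2"] + rng.normal(scale=0.5, size=200)

cols = list(X.columns)
while cols:
    fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
    pvals = fit.pvalues.drop("const")        # significance of each remaining predictor
    worst = pvals.idxmax()
    if pvals[worst] < 0.05:                  # stop once every remaining variable is significant
        break
    cols.remove(worst)                       # drop the least significant variable
print(cols)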
The sequential model

The sequential model, also known as hierarchical regression, is a variation of the stepwise model.
However, unlike the stepwise model, the sequential model is user defined. At each step, the user
decides which variable should be added to or removed from the model. Although this model looks less
scientific, it has the advantage that the user decides which variable is more logical to enter or
eliminate, given what is already in the model at any given stage.
Which regression model to use

The regression model you use will depend on the context and your objectives. You would use the
standard model if you are confirming a theory or trying to assess changes in the model over a period of
time, across geographic regions or across different subgroups in a market. The stepwise procedure is
useful if you do not have any serious hypotheses or hunches as to which variables are good predictors
and would like an automatic search procedure to decide for you, at least in the initial stages. The
sequential model is ideal for experienced researchers who are skilled in combining prior knowledge
with statistical results.
If you are new to regression analysis, you may want to start with an automatic procedure such
as the stepwise procedure rather than with the sequential procedure, since the latter requires greater
skill. Once you understand how the automatic procedure works, you may want to look at the variables
that are included and excluded and decide whether you want to sequence the analysis differently.
What stopping criteria to use

Not all the independent variables we start with may be useful in predicting the dependent variable. For
instance, price may be a predictor of purchase behavior in general. But for a given product, in a given
market segment, price may play no part. In such cases, we would want to exclude it from our model.
So we need to specify the conditions under which the inclusion of a variable will be unacceptable to
us. Standard computer programs allow the user to specify conditions under which a variable should be
excluded from the model. Here are some of these conditions:
No statistically significant improvement to the model

If adding or deleting a variable from the regression model makes no difference to the predictive power
of the model, then we may want to exclude that variable. For instance, we may want to exclude all
variables that are not significant at a given level of significance. Computer programs allow the user to
specify the required significance level.
The model is unstable

When independent variables are highly correlated, the regression model can become unstable.
Tolerance measures this condition. It is given by (1 - R2), where R2 stands for the squared multiple
correlation between the predictor variable under consideration and the other predictors included in the
model. When we specify a minimum tolerance, we prevent a variable that is highly correlated with the
variables already in the model from being included, as sketched below. Computer programs allow the
user to specify the required tolerance level.
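A minimal tolerance sketch (statsmodels assumed; its variance inflation factor is the reciprocal of tolerance, and the data are simulated with x1 and x2 deliberately correlated):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=300)   # highly correlated with x1
x3 = rng.normal(size=300)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, name in enumerate(X.columns[1:], start=1):  # skip the constant
    tolerance = 1 / variance_inflation_factor(X.values, i)
    print(name, round(tolerance, 3))                # low tolerance signals collinearity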
Need only a certain number of variables

In some cases, the researcher may want to use only a certain number of variables in the model. In such
cases, some computer programs will allow you to specify that number.
Which stopping criteria to use

As I suggested earlier, multiple regression is built using a combination of prior knowledge and
statistical results. Mechanical criteria are of limited value in building the model. So you might want to
start with the default stopping criteria that are built into your program and refine them as you gain more
experience.

V. DOING THE ANALYSIS


For a complete example with annotation please see the PowerPoint presentation, Chapter 14.

Discriminant analysis and


Logistic Regression
What determines whether a consumer will buy our product or not? What influences a person's voting
behavior? Discriminant analysis attempts to solve problems like these. Whenever we need to
understand what contributes to a person's belonging to one group as opposed to another, we can use
discriminant analysis.

Marketers are interested in knowing whether a consumer will buy their product or not. Similarly,
politicians are interested in knowing whether a given person would vote for them or against them.
Psychiatrists may be interested in predicting whether a person is susceptible to mental illness or not.
In all these cases, we are interested in predicting the group (such as buyers and non-buyers,
Democrats and Republicans) to which a person would belong. To achieve this, we start with some
information on the person about whom we need to make the prediction. Here are some examples of
problems that can be answered through the use of discriminant analysis:
Given a person's demographic characteristics, can we determine the probability of a person being
a reader of The New York Times?
Given a consumer's ratings of variable importance, can we predict whether he will use Brand X?
What is the relative importance of different variable ratings in determining whether the consumer will
prefer Brand A to Brand B?
What demographic traits distinguish poor credit risks from good credit risks?
What social views influence a voter to vote Republican?
The purpose of discriminant analysis is to predict, as accurately as possible, the group to which a person
belongs, given the characteristics of that person.

I. WHY DISCRIMINANT ANALYSIS


Suppose a credit card company is interested in predicting whether a customer is likely to be a good
credit risk or not. To solve a problem like this, we will probably attempt to identify the variable (or
variables) that most effectively predicts a customer's being a good or a bad credit risk.
The intuitive approach

Let's start with a customer's debt load. The percentage of customers who default on their credit at
different debt levels is our concern, and the data show that customers whose debt level is 25% or more of
their annual income tend to default more on their debts than those whose debt load is less than 25% of
their income. We can make a simple prediction that those whose debt load is under 25% of their
income are good credit risks and those whose debt load is over 25% are poor credit risks. This may be
largely correct. A person's debt level can help us predict whether that person is a good credit risk or
not.
By declining to extend credit to those whose debt load is over 25% of their income, the company
can reduce its bad debts. But this may not be the ideal solution. Why? In the process of weeding out
bad credit risks, we have also weeded out many good credit risks. Although the company has reduced its
credit risk, it has done so by letting go of many potential customers (i.e., good credit risks whose debt load is
over 25%).
So we can look for other predictors of bad credit risks. How about income? After all, many people
with high income carry a high debt load because they have assets such as large homes or large
investment portfolios. So we could plot income against credit risk. We find that income does predict
good credit risks as well. We can make another simple prediction: those whose household income is
over $40,000 are good credit risks and those whose household income is lower are not. A person's
income level might also help us predict whether that person is a good credit risk or not.
Perhaps income is a better predictor of credit risk than debt load. However, it still has the same
problem. When the company accepts only those with a certain level of income, it reduces its bad credit
risks at the expense of letting go of some good credit risks whose income is not high. So while both
these variables are related to a customer's creditworthiness, the classification of customers is not
entirely satisfactory.

Why the intuitive approach is not adequate

It is reasonable to assume that, no matter what variable we choose, there is going to be some trade-off.
Under most real-life conditions, we cannot identify only the bad credit risks with absolute
certainty.
Yet the problem remains. Can we do better? Maybe we cannot completely eliminate all bad credit
risks without, at the same time, letting go of some of our good credit risks. But how about maximizing
the exclusion of as many bad credit risks as possible while, at the same time, minimizing the loss of
good credit risks? How do we optimally distinguish one group from another such that we misclassify as
few customers as possible?
What if we take both debt level and income into account together? Maybe low income and a
large debt load taken together can predict credit risk better than either of these variables by itself.
Now a clearer picture begins to emerge. The highest proportion of credit risks comes from low-income
earners whose debt loads are high.
In our example, income and debt load might not be equally important in predicting creditworthiness.
In fact, further investigation might make it obvious that debt load has a greater influence on credit
default than income does. Those with higher income and high debt loads are poorer credit risks than
those with lower income and lower debt loads. Debt load, then, is the better predictor. To create an
actionable formula, we need to assign a greater weight to debt load than to income. But how do we
decide what weight we should assign to each variable?
The problem becomes even more complicated if we add some more variables to the list such as
family size, occupation, net worth and so on. We saw how we were able to make better predictions by
using two variables rather than just one. Perhaps we can add more variables and possibly increase our
predictive accuracy. Obviously, not all variables will increase our predictive accuracy. Even those that
do will not increase the predictive accuracy to the same degree. Generally, these relationships can be
complex.
Therefore, we need a technique that will identify the influential variables and the weights we need
to assign to them to make the best possible predictions. The technique that is designed to achieve
these objectives is discriminant analysis.

II. HOW DISCRIMINANT ANALYSIS WORKS


Discriminant analysis is best understood in terms of regression analysis. Regression analysis assigns a
weight to each independent variable such that it explains the maximum amount of variance in the
dependent variable. Discriminant analysis similarly assigns a weight to each independent variable but
here the objective is to distinguish groups. In regression analysis, the dependent variable is metric
while in discriminant analysis it is nonmetric (see Exhibit 13.21). In most other ways both techniques
are very similar.

Exhibit 13.21. Discriminant vs. regression analysis

Technique       Dependent variable    Independent variables     Purpose of weights
Regression      Metric                Metric (& non-metric)     Assign weights to independent variables to explain the variance in the dependent variable
Discriminant    Non-metric            Metric (& non-metric)     Assign weights to independent variables to distinguish groups

Let's return to our example of whether a person is a good or a bad credit risk. As we noted before, the
characteristics of good and bad credit risks overlap. For instance, two customers may have identical
debt load. But one of them may be a good credit risk while the other may not. Exhibit 13.22 shows the
distribution of good and bad credit risks.
How would we assign a customer to one of these two groups? One logical way to do this is to
assign the customer to the group whose average (mean) is closer to that of the customer. For instance,
assume that the mean debt load of good credit risks is 20% of the annual income and that of the bad
credit risks is 32%. A customer whose debt load is 28% will be classified as a bad credit risk since his
or her debt load is closer to the mean of the bad credit risks than to that of the good credit risks. This is the
same as dividing the overlapping area in the middle. The accuracy of our predictions will depend on
how much these two groups (good and bad credit risks) overlap on the independent variable (debt
load), as shown in Exhibit 13.23.

Exhibit 13.22. Overlap between good and bad credit risks

Exhibit 13.23. The problem of classifying a person as a good or bad risk

When we have a single predictor variable, the solution is simple. But what happens when we
have more than one predictor variable? Suppose we add income as a predictor variable. We need to
know what weights to assign to each of the two variables. These are given by the following equation (known as
Fisher's linear discriminant function):
Z = b1X1 + b2X2

where X1 = variable 1 (debt load)
      b1 = weight assigned to variable 1
      X2 = variable 2 (income)
      b2 = weight assigned to variable 2
(You may have noticed that the equation is very similar to that of the regression equation.) How are the
weights b1 and b2 found? They are calculated by mathematical methods subject to the following
criterion: the resulting function Z should maximally distinguish between the two groups. In other
words, the weights assigned will result in the least number of misclassifications. When we use the
function Z, the good and bad credit risks are separated as clearly as possible. A statistic known as
Mahalanobis distance is used for this purpose when we use the Z function. The distance is not
computed simply as the linear distance (as we did when we had only one independent variable).
Rather, it also takes into account the concentration of respondents in the two groups.
The same logic holds when we increase the number of independent variables.

III. HOW TO DESIGN A DISCRIMINANT ANALYSIS STUDY


Is your problem suited for discriminant analysis?

Discriminant analysis is suited for a variety of problems. The problems should have the following
structure:
They should have a non-metric dependent variable.
They should have metric independent variables.
Our objective is to identify which of the independent variables best distinguish the groups.
Discriminant analysis may be useful when you want to identify the variables that best discriminate
among groups. For instance:
What demographic characteristics identify spenders and savers?
What lifestyle characteristics distinguish luxury car buyers from utility automobile buyers?
What service attributes distinguish highly profitable firms from the rest?
You want to classify a new person as belonging to an already defined group. For instance:
You have identified four market segments that are relevant to your market.

You want to place newly identified people in the appropriate segments.

THE DISCRIMINANT ANALYSIS MODEL


We have a number of independent metric variables (such as customer ratings of different product
attributes) and a single non-metric dependent variable (such as whether the consumer is a Buyer or a
Non-Buyer of a brand). How are a consumer's ratings on product attributes related to buying or non-buying? To answer this question we can use discriminant analysis to find the weights for each of the
independent variables that best predict whether the customer is a buyer or a non-buyer. We start
our discussion with a two-group discriminant analysis (i.e., a dependent variable with two groups).

The discriminant function


If we let Z stand for the discriminant function that best differentiates the two groups (buyers and non-buyers), we have a discriminant function of the form
Z = a1x1 + a2x2 + a3x3 + a4x4 + ... + apxp
where x1, x2, x3, ..., xp are the values of the independent variables and a1, a2, a3, a4, ..., ap are the weights
associated with each of the independent variables. One way to find the discriminant coefficients is to choose
values that maximize the ratio of the between-group variance to the within-group variance, i.e., the
coefficients that maximize the F ratio. If we designate the two group means on Z as Z1 and Z2, then we can compute D2 (the Mahalanobis distance), which indicates how far apart these two
groups are:
D2 = (Z1 - Z2)2/S2z
The values of a1, a2, a3, a4, ..., ap are selected so as to maximize D2, so that the groups are as clearly
distinguished as possible.

Cut-off score
The value of Z in the discriminant equation is similar to the y value in the regression equation. To use
this value to group respondents we need a cut-off score. Each respondent is assigned to a group
depending on whether the respondent's Z score is above or below this cut-off score. The cut-off score
is chosen such that it minimizes the number of misclassifications. To obtain the cut-off score for two-group discriminant analysis:
1. Apply the discriminant function to the means of each group separately to obtain Z1 and Z2;
2. Use the following formula to identify the cut-off score (C):
If both groups have the same sample size: C = (Z1 + Z2)/2
If the groups have different sample sizes: C = (n1Z1 + n2Z2)/(n1 + n2)
Quadratic classification rule. There is a more complicated quadratic rule for assigning respondents to
groups. However, Huberty (1984) points out that results obtained by a linear rule yield greater stability
than those by a quadratic rule when small samples are used and when the normality conditions are not
met.
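As an illustration only, and not the exact algorithm of any particular package, the following Python sketch computes two-group discriminant weights from the pooled within-group covariance matrix, derives the cut-off score, classifies a hypothetical new customer, and converts the Z score into a probability of group membership using the logistic formula discussed under "Probability of group membership" below. All of the figures are made up.

import numpy as np

# Hypothetical data: columns are debt load (% of income) and income ($000)
good = np.array([[18, 55], [22, 48], [15, 62], [20, 51], [24, 45]])  # group 1: good credit risks
bad  = np.array([[30, 28], [35, 24], [28, 33], [33, 27], [36, 22]])  # group 2: bad credit risks

m1, m2 = good.mean(axis=0), bad.mean(axis=0)
# Pooled within-group covariance matrix
S = (np.cov(good, rowvar=False) * (len(good) - 1) +
     np.cov(bad, rowvar=False) * (len(bad) - 1)) / (len(good) + len(bad) - 2)

# Discriminant weights that maximize group separation: b = S^-1 (m1 - m2)
b = np.linalg.solve(S, m1 - m2)

Z1, Z2 = b @ m1, b @ m2                                           # group means on the Z scale
C = (len(good) * Z1 + len(bad) * Z2) / (len(good) + len(bad))     # cut-off score (weighted formula above)

new_customer = np.array([28, 40])                                 # debt load 28%, income $40,000
Z = b @ new_customer
group = "good credit risk" if abs(Z - Z1) < abs(Z - Z2) else "bad credit risk"
p_group1 = 1 / (1 + np.exp(-(Z - C)))                             # probability of belonging to group 1
print(b, round(Z, 2), round(C, 2), group, round(p_group1, 3))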

Standardized coefficients
As with regression coefficients, the values of a1, a2, a3, a4, ..., ap are not directly comparable. We can
obtain the relative importance of the discriminant coefficients by standardizing them. Standardized
discriminant coefficients can be obtained by multiplying each ai by the corresponding pooled standard
deviation. From the size of the standardized discriminant coefficients we can infer the ordinal importance
of the variables x1, x2, x3, ..., xp.

Probability of group membership


Although so far we have assigned respondents to one or the other group, there is a chance of
misclassification. To assess the chance of misclassification, we can compute the actual probability of a
respondent belonging to a group.
Probability of belonging to group 1 = 1/[1 + exp(-(Z - C))]
For instance, if respondent A has a probability of 0.51 of belonging to group 1 and respondent B has a
probability of 0.87, it would mean that B is a much stronger candidate for group membership than A.

In most computer programs, these probabilities are referred to as posterior probabilities. They can be
valuable to marketers, who may be more interested in consumers who have a high probability of
belonging to a group (such as those who have a high propensity to buy a brand) than in those who have
only a marginal tendency to do so.

How good is the discriminant function?


To validate the obtained discriminant function, we can apply it to each respondent and predict his or her
group membership. We can then compare this with the actual group membership. Most computer
programs will provide a table something like this:
                  Predicted group 1    Predicted group 2
Actual group 1    a                    b
Actual group 2    c                    d

Cell a shows the number of respondents whom the discriminant function correctly predicted to be
members of group 1; cell d shows the number of respondents whom it correctly predicted
to be members of group 2; cells b and c show respondents who were misclassified. This procedure
tends to inflate the apparent validity of the function, since we have used the same sample to develop the
equation and to validate it. Ideally, we would like to derive the equation from one set of data
and apply it to another set of data.
Holdout sample. One way to validate the results is to use what is known as a holdout sample.
Suppose we have a sample of 1,000 respondents; we use 500 to develop the equation and the
remaining 500 (the holdout sample) to validate it. This is a commonly used procedure.
Jackknife procedure. If the sample is small, the holdout sample may not be a good procedure to use,
since splitting the sample further reduces the sample size and can potentially make the estimates less
stable. In such cases, one can use the jackknife procedure. Here we exclude one observation at a
time from the sample, estimate the equation, then apply the equation to the excluded observation. We
repeat this for all respondents. Some computer programs execute jackknife procedures.
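A minimal sketch of the jackknife (leave-one-out) idea, using scikit-learn's LinearDiscriminantAnalysis purely as a stand-in for whatever discriminant program you use; the data are simulated and hypothetical.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([20, 55], 5, size=(30, 2)),     # simulated good credit risks
               rng.normal([33, 27], 5, size=(30, 2))])    # simulated bad credit risks
y = np.array([0] * 30 + [1] * 30)

correct = 0
for i in range(len(y)):                  # leave one observation out at a time
    keep = np.arange(len(y)) != i
    model = LinearDiscriminantAnalysis().fit(X[keep], y[keep])
    correct += int(model.predict(X[i:i + 1])[0] == y[i])  # score the excluded observation

print("Jackknifed classification accuracy:", correct / len(y))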

Selecting the best set of predictors


The procedures given for regression analysis (hierarchical and non-hierarchical methods) apply here
as well. Whereas in regression analysis we tested to see whether the R2 was altered when we added
or deleted a predictor, in discriminant analysis we test whether the value of D2 is altered by deleting or
adding a variable.
The appropriate test statistic is
F = [(n1 + n2 - p - 2)(n1n2)(D2p+1 - D2p)] / [(n1 + n2)(n1 + n2 - 2) + n1n2D2p]
with 1 and (n1 + n2 - p - 2) degrees of freedom, where p is the number of variables already in the equation. The P
value is the tail area to the right of the computed statistic.
As with regression analysis, the user may specify a criterion for adding or deleting a variable (such as
F-to-enter or F-to-remove) in computer programs using hierarchical methods.
The overall significance of the equation can be tested with the null hypothesis that none of the
variables improves classification beyond what would be expected by chance alone. This can be tested by the F statistic
F = [(n1 + n2 - p - 1)/(p(n1 + n2 - 2))] * [(n1n2)/(n1 + n2)] * D2
with p and (n1 + n2 - p - 1) degrees of freedom.
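The two F formulas above can be written out directly as code. This is a sketch with hypothetical sample sizes and D2 values; the scipy package is assumed for the tail areas.

from scipy.stats import f as f_dist

n1, n2, p = 60, 60, 3          # hypothetical group sizes and number of variables already in the equation
D2_p, D2_p1 = 1.8, 2.1         # hypothetical Mahalanobis D2 before and after adding one variable

# Test of the improvement from adding one variable (1 and n1+n2-p-2 df)
F_add = ((n1 + n2 - p - 2) * n1 * n2 * (D2_p1 - D2_p)
         / ((n1 + n2) * (n1 + n2 - 2) + n1 * n2 * D2_p))
p_add = f_dist.sf(F_add, 1, n1 + n2 - p - 2)

# Overall test of the equation with p+1 variables (p+1 and n1+n2-p-2 df)
p_tot = p + 1
F_all = ((n1 + n2 - p_tot - 1) / (p_tot * (n1 + n2 - 2))) * (n1 * n2 / (n1 + n2)) * D2_p1
p_all = f_dist.sf(F_all, p_tot, n1 + n2 - p_tot - 1)

print(round(F_add, 2), round(p_add, 4), round(F_all, 2), round(p_all, 6))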

Using categorical independent variables


Discriminant analysis accommodates categorical independent variables. They are specified in the
same way as in regression analysis.

Discriminating among more than two groups


Although multi-group discriminant analysis is conceptually an extension of two-group discriminant
analysis, it can be complex, both mathematically and in its interpretation.
Here we provide only a very brief overview of multi-group discriminant analysis.

In multiple discriminant analysis, classification functions are computed for each group. The
approximate F statistic tests the null hypothesis that the means of all groups are equal for all variables
simultaneously. F statistics are also computed for pairs of groups to test the equality of the means of the
two groups. The analyst may use this statistic to decide whether to combine some groups if they are
fairly close.
To classify an individual into one of the groups, we evaluate the individual's scores on the different variables
using the classification functions corresponding to each group. The individual is assigned to the group for
which the computed classification function is the highest.

Prior probabilities
Consider a case in which the probability of an individual falling in group A is 0.9 and in group B 0.1. In
this case, we can achieve 90% accuracy by simply allocating everyone to group A. To avoid this
ineffective solution, the allocation procedure that assigns an individual to a group based on
Mahalanobis distance should be modified. Many computer programs allow the user to enter prior
probabilities for this purpose.

Assumptions
Most methods discussed here are based on certain assumptions. The first assumption is that the
within-group covariance matrix is the same for all groups. The second assumption is, for tests of
significance, the data should be normally distributed within groups. However, as Manly (1994) points
out, even if both assumptions are violated we might still be able to discriminate among groups,
although it may not be simple to establish the statistical significance of group differences.

ALTERNATIVE TECHNIQUE: LOGISTIC REGRESSION


Logistic regression
The objectives of logistic regression are similar to those of two-group discriminant analysis. It is a good
alternative to discriminant analysis when we cannot assume a multivariate normal distribution. Logistic
regression can be used with any combination of metric and non-metric independent variables. The
dependent variable takes on a value of 1 or 0 (e.g., a respondent is a Buyer or a Non-buyer). What
logistic regression attempts to do is predict the probability of a person belonging to a group. For
example, logistic regression enables us to state (on the basis of the values of the independent
variables) that the probability of a consumer buying our product is 0.8. However, probability (p) cannot
be used directly as a dependent variable, for several reasons. Consider this: if we express p as a
linear combination of independent variables in an ordinary regression, the predicted value of p could exceed
1.0. To avoid this problem, we make a logistic transformation of p (the logit of p). Logit(p) is the log (to
base e) of the odds, or likelihood ratio, that the dependent variable is 1.
logit(p)=log(p/(1-p))

The value of logit(p) can range from negative infinity to positive infinity. The logit of 0.5 is 0, and the
scale is symmetrical around this value, as shown in Exhibit 13.24.
Exhibit 13.24. The relationship between probability (p) of group membership and the corresponding logit(p)

p          .3      .4      .5     .6     .7     .8      .9      .95     .99
logit(p)   -.847   -.405   0.0    .405   .847   1.386   2.197   2.944   4.595

The logit scale is approximately linear in the middle range and logarithmic at extreme values: the
difference in logits between p's of .95 and .99 is bigger than that between .5 and .8, even though the
change in p is far smaller. Logistic
regression is of the form
logit(p)= b0 + b1x1 + b2x2 + b3x3 + ...

To enter the independent variables, we can use stepwise (forward or backward elimination) or
simultaneous methods. However, unlike linear regression, which uses the least-squares criterion
for the best fit, logistic regression uses a maximum likelihood method that maximizes the
probability of getting the observed results given the fitted regression coefficients. Consequently, the
goodness-of-fit and overall significance statistics used in logistic regression are different from those
used in linear regression. We can also use a classification table showing the percentage of correct classifications (as in discriminant
analysis) to assess the effectiveness of the model.
When we calculate the likelihood of observing exactly the data we did observe, under the null
hypothesis (all coefficients are 0) and under the alternative hypothesis (the model is correct), the resulting
numbers tend to be very small in magnitude. Since such small numbers are awkward to work with, we
take the natural logarithm (i.e., the log to base e). This gives us the log likelihood. Because
probabilities are always less than one, log likelihoods are always negative. Again for convenience,
we usually work with negative log likelihoods, which are positive numbers.
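A minimal sketch, assuming the Python statsmodels package and simulated data, of fitting a two-group logistic regression by maximum likelihood and inspecting the coefficients, the log likelihood, and the predicted probabilities of group membership:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)                                 # e.g., an importance rating
x2 = rng.normal(size=n)                                 # e.g., a price-sensitivity rating
true_logit = 0.5 + 1.2 * x1 - 0.8 * x2
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))      # 1 = Buyer, 0 = Non-buyer

X = sm.add_constant(np.column_stack([x1, x2]))
result = sm.Logit(y, X).fit(disp=False)                 # maximum likelihood estimation
print(result.params)                                    # b0, b1, b2 on the logit scale
print(result.llf)                                       # log likelihood (a negative number)
print(result.predict(X)[:5])                            # predicted probabilities of group membership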

Interpreting logistic regression coefficients


As with multiple regression, logistic regression holds that, when other variables in the equation are held
constant, there is a change of bi in logit(p) for every unit increase in xi. Since the logit transformation is
non-linear, the change in p associated with a unit change in xi will vary with the value of xi. However,
there is a straightforward way to interpret a constant increase in the value of xi. It involves a constant
multiplication [by exp(bi)] of the odds that the dependent variable takes the value 1 rather than 0. For
example, if bi takes the value 2.30, exp(2.30) equals about 10. This means that if xi increases by 1, the odds that
the dependent variable takes the value 1 increase tenfold. So, when xi takes the value 0 (and the constant and
the other terms are zero), logit(p) is 0; this means that there is an even chance of the dependent variable taking the value 1. If xi increases
to 1, the odds that the dependent variable takes the value 1 rise by a factor of ten, from 1:1 to 10:1, corresponding to
a p of 0.909. If xi increases to 2, then the odds will be 100:1, corresponding to a p value of 0.990. We can represent
the results of logistic regression through a plot that shows the odds change produced by unit changes
in different independent variables.
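The arithmetic in the preceding paragraph, written out as a small sketch (the intercept and all other terms are assumed to be zero):

import numpy as np

b = 2.30
for x in [0, 1, 2]:
    logit_p = b * x              # contribution of xi alone
    odds = np.exp(logit_p)       # 1:1, then about 10:1, then about 100:1
    p = odds / (1 + odds)        # 0.500, 0.909, 0.990
    print(x, round(odds, 1), round(p, 3))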

A frequently used statistic to assess the statistical significance of each independent variable is the
Wald statistic. The Wald statistic has a chi-squared distribution. It is interpreted in the same way as the
t values for each independent variable in multiple regression analysis.
The relative importance of the independent variables can be assessed by multiplying each coefficient
by the standard deviation of the corresponding variable. The ranking of the resultant values will reflect
relative importance of the independent variables.

V. DOING THE ANALYSIS


For a complete example with annotation please see the PowerPoint presentation, Chapter 14.

Path Analysis and Structural Equation Modeling


(Contributed by Terry Grapentine)

The following discussion provides a basic introduction to the differences and similarities among
multiple-regression, path analysis, and structural equation models (SEM).
All methods share a common motivation: they seek to describe and test theoretical relationships among variables; for example, what factors are associated with promoting perceived brand quality? As you move from multiple regression to path analysis to SEM, the methods give you the opportunity to develop more complex and psychometrically valid models.
Multiple regression has only one dependent variable. In contrast, path analysis and SEM can have multiple dependent-like variables. But
there are many more differences among these three methods, some of which we will briefly explore
below.
Multiple-Regression Example

We will compare and contrast the three methods by way of the following illustration11:

The Acme Printing Machine Company's marketing research department wants to understand the relative
influence selected factors have on repeat purchase, or Brand Loyalty.

Acme's research department is going to build multiple-regression, path analysis, and structural equation
models to examine this issue.

11
We thank Dawn Iacobucci, Ph.D., Department of Marketing, Owen Graduate School of Management, Vanderbilt University, for providing
the simulated data set for this example, which is drawn from her article (2009), "Everything you always wanted to know about SEM
(structural equation modeling) but were afraid to ask," Journal of Consumer Psychology, 19, pp. 673-680.

The Base Case multiple-regression model takes the following form:


Y = a + b1(Product Quality) + b2(Cost) + error
where:
Y = the Brand Loyalty measure
a = the intercept term
b1 and b2 = the partial regression coefficients for each predictor variable, and
error = the equation's error term.

The survey items on the questionnaire are given in Exhibit 13.25, below.
Exhibit 13.25. Example scale items for the multiple-regression based Brand Loyalty study
(Each item is rated on a scale from Strongly Disagree to Strongly Agree)

Construct          Item on survey
Brand Loyalty      I would buy another Acme printer if I had to buy another printer
Product Quality    The quality of the Acme printer I bought is excellent
Cost               The Acme printer was not reasonably priced

Results from the Base Case model are as follows:


Brand Loyalty = 4.0 + 0.32(Product Quality) - 0.37(Cost) + error

Thus, if the Product Quality rating increases by 1 rating point on the 7-point agree/disagree scale, the
Brand Loyalty measure will increase by 0.32 points. If the Cost rating increases by one rating point,
Brand Loyalty decreases by 0.37 points.12

A Path Analysis Example

Typically, path analysis is used when there is more than one dependent-type variable relationship in
the model. This is depicted in Exhibit 13.26, which contrasts the previous multiple-regression model
with a path analysis model.

Exhibit 13.26. Multiple regression vs. path analysis model
[Left panel: the basic Brand Loyalty multiple-regression model discussed earlier in this module, with Product Quality and Cost pointing to Brand Loyalty, plus an error term. Right panel: a path model that allows us to model more complex relationships, such as factors influencing Value, Satisfaction, and Brand Loyalty.]

Clearly, we fibbed when we implied earlier that the three models would have the same independent
and dependent variables. But that is one of the differences, and strengths, of path analysis and SEM:
they can model more complicated relationships than can multiple-regression analysis.
To understand path analysis, its advantages over multiple-regression, and how path analysis is related to SEM, we first need to answer the following four questions about Exhibit 13.26:

Why use path analysis?
What do the arrows in Exhibit 13.26 represent?
What is the basis for specifying the model?
How are the coefficients in a path model estimated?

12 We will dispense with discussing model r2 at this point in our discussion. We will revisit this issue when we discuss SEM.

Why use path analysis? Think of the path model as allowing the researcher to model more
complicated functional relationships among the variables. In the above example, Value, Satisfaction,
and Brand Loyalty serve as dependent-like variables, since their values are affected by other variables
in the model. For example, Value is directly related to Cost and Product Quality, and indirectly related
to Brand Loyalty through Satisfaction.
What do the arrows mean? Each of the arrows in the regression and path models indicates the
functional relationship between the variables. The strength of these relationships is reflected in the
model's regression coefficients; that is, each arrow has a coefficient attached to it. Path analysis is
related to multiple-regression because multiple-regression can be used to estimate these coefficients in
the path model (to be explained shortly).
What is the basis for specifying a path model? Answer: a theory that the researcher has developed.13 A
theory reflects the marketing researcher's best thinking about what influences a phenomenon, or, in
our example, repeat purchase of the Acme printer. In our example, theory specifies the following:

Conceptual definitions of one or more dependent and independent variables (e.g., what do we mean by the
terms, Brand Loyalty, Product Quality, and so on?)

The functional relationships among the constructs (e.g., the arrows in Exhibit 13.26)

Operational definitions of the constructs that appear on the questionnaire. An operational definition
translates the conceptual definition into the actual questions or rating items on a survey, and the method of
survey administration (e.g., telephone interview, Internet survey, and so on)

The method by which the functional relationships among the constructs will be measured (e.g., Will the
researcher use multiple regression, path analysis, or SEM to estimate the coefficients depicted by the
arrows in Exhibit 13.26?)

13 Theory development in any physical or social science is an extremely complex topic and is beyond the scope of this introductory module to
discuss at length. Consequently, our discussion of theory in this introduction to SEM is extremely basic. For more information, see
Theory-Based Data Analysis for the Social Sciences, by Carol S. Aneshensel, 2002, Sage Publications, easily found on Amazon.com.

In the field of applied marketing research, typical sources of information that are used to build theories
are the following:

Exploratory research such as IDIs or focus groups

Past quantitative research

Secondary research such as journal or magazine articles that discuss the area being investigated

Interviews with industry experts such as senior executives who have extensive market experience in the
area being investigated.

How are the coefficients estimated in path analysis? This is done by multiple-regression, which we
explain in detail shortly. In Exhibit 13.26, you have to estimate three separate regression equations
that are represented by the arrows pointing to the three dependent-like variables, Value, Satisfaction,
and Brand Loyalty.
The independent variables may have either a direct, indirect, or direct + indirect effect on Brand
Loyalty. For example, Exhibit 13.27 shows these effects for Cost on Brand Loyalty.14
Exhibit 13.27. Example of direct and indirect effects of Cost on Brand Loyalty
[Path diagram. The letters denote regression coefficients from the multiple-regression equations: a is the direct path from Cost to Brand Loyalty; b the path from Cost to Value; c the path from Value to Satisfaction; d the path from Satisfaction to Brand Loyalty; e the path from Product Quality to Satisfaction; and f the path from Product Quality to Value.]

Cost's direct effect on Brand Loyalty = a
Cost's indirect effect on Brand Loyalty = b x c x d
Cost's total effect on Brand Loyalty = a + (b x c x d)

Coefficient values d and a are derived from the following regression equation:

Brand Loyalty = 2.99 + 0.48(Satisfaction) - 0.27(Cost) + error

e and c are derived from the following regression equation:
Satisfaction = 1.39 + 0.54(Product Quality) + 0.08(Value) + error

f and b are derived from the following regression equation:
Value = 3.72 + 0.57(Product Quality) - 0.47(Cost) + error

14 The single-item measure for Satisfaction is "I am very satisfied with my newly purchased Acme printer"; and, for Value, "I feel like I got good value for this purchase."
Therefore, the total effect that Cost has on Brand Loyalty is calculated as follows:
Total effect of Cost on Brand Loyalty = -0.27 + (-0.47 x 0.08 x 0.48) = -0.29
If Cost increases by one scale point, Brand Loyalty declines by 0.29. In the Base Case regression
model, this value was -0.37. If our path model is indeed more theoretically sound than the initial
multiple-regression model, then the Base Case model over-estimates the effect Cost has on Brand
Loyalty. Why this is so is explained shortly.
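The direct, indirect, and total effect arithmetic can be written out directly from the three regression equations above; this short sketch uses the coefficient values reported in the text.

# Coefficients taken from the three regression equations in the text
a = -0.27   # Cost -> Brand Loyalty (direct)
b = -0.47   # Cost -> Value
c = 0.08    # Value -> Satisfaction
d = 0.48    # Satisfaction -> Brand Loyalty
e = 0.54    # Product Quality -> Satisfaction
f = 0.57    # Product Quality -> Value

total_cost = a + b * c * d                 # direct plus indirect effect of Cost
total_quality = e * d + f * c * d          # Product Quality works only indirectly
print(round(total_cost, 2), round(total_quality, 2))   # -0.29 and 0.28, as in Exhibits 13.28 and 13.29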
Exhibit 13.28 gives all variables' effects on Brand Loyalty. Exhibit 13.29 gives the actual coefficient values
and compares Product Quality's and Cost's effects on Brand Loyalty for the path vs. Base Case
regression models.

Exhibit 13.28. Direct, indirect and total effects of model variables on Brand Loyalty

Variable           Direct effect    Indirect effect            Total effect
Product Quality    None             (e x d) + (f x c x d)      (e x d) + (f x c x d)
Cost               a                b x c x d                  a + (b x c x d)
Value              None             c x d                      c x d
Satisfaction       d                None                       d

Exhibit 13.29. Unstandardized coefficient values for the path and Base Case models
(Variables' effects on Brand Loyalty)

                    Path Analysis Model                                                    Base Case
Variable            Direct effect    Indirect effect                        Total effect   Multiple-Regression
Product Quality     None             (0.54 x 0.48) + (0.57 x 0.08 x 0.48)   0.28           0.32
Cost                -0.27            -0.47 x 0.08 x 0.48                    -0.29          -0.37
Value               None             0.08 x 0.48                            0.04           Not in model
Satisfaction        0.48             None                                   0.48           Not in model

The total effect of Cost on Brand Loyalty is somewhat less in the path analysis model (-0.29) vs. the
Base Case model (-0.37) because there are more variables in the path model that can account for
some of the variance in Brand Loyalty.
Which model do we believe? Which is closer to the truth? We always place more faith in the more
theoretically sound model, which, at this point in our discussion, is the path model.

In summary, the table below identifies a few issues on which multiple-regression and path analysis are
similar and different.
Exhibit 13.30. Comparing and contrasting multiple-regression vs. path analysis

Differences
Issue                                             Multiple-Regression    Path Analysis
Number of dependent variables                     Only has one           Can have one or more
Direct and indirect effects of independent
variables on a dependent variable                 Only direct effects    Direct and indirect effects

Similarities
Method by which coefficients are estimated        Both use ordinary least squares (OLS)
Test theoretical relationships among variables    Both do this, although path analysis can test more complicated models

Preface to Structural Equation Modeling

Overview: You are now familiar with the relationships between multiple-regression and path analysis.
However, before we can complete our story and describe how these two methods relate to SEM, we
first need to introduce you to two important termslatent constructs and multi-item summated scales.

Latent constructs are a fundamental characteristic of SEM

Although multi-item summated scales are not used in SEM, an understanding of them will help you
understand what indicators of latent constructs are and why they are important. Additionally, multi-item
constructs can be used in multiple-regression and path analysis models, of which we will give examples at
the conclusion of this document.

Latent constructs: Consider the following definition of a latent construct.

Latent Constructs
A latent construct is a representation of an idea that the researcher feels is important in understanding
consumer behavior. This idea is not something that the researcher can directly observe; however, in
our example, the researcher feels that the item ratings in Exhibit 13.25 will serve as useful indicators of the
constructs Brand Loyalty, Product Quality, and Cost. Often, latent constructs are simply called
constructs.

You will note that each of the constructs in Exhibit 13.25 is measured with a single item on the survey.
However, in marketing research, most of the concepts we use in multiple-regression, path analysis, or
SEM are not validly measured by one item. Consequently, where appropriate, one should use multi-item measures of these constructs, as described below. In multiple-regression and path analysis, these
multi-item measures take the form of summated scales.
Multi-item summated scales: For example, a 6th grade vocabulary skills test in which a student defines
only one word would not be a valid measure of a student's vocabulary. Rather, a test which has
students, say, define 100 words would be a better indicator of their vocabulary skills.
The same is true for many marketing concepts. Consequently, researchers will sometimes use more
than one item to measure a construct. This is done by using multi-item summated scales to measure
the constructs of interest.

Multi-item Summated Scales


As shown in Exhibit 13.31 below, a multi-item summated scale incorporates multiple scale items on a
survey to measure a given construct. In the Exhibit 13.31 example, Brand Loyalty is measured with
three items. For ease of interpretation, for a given construct, the items that serve as indicators of the
construct, for each respondent, are typically summed and divided by the number of items.

The summated scale of Brand Loyalty, which will be used in subsequent examples of multiple-regression and path analysis, is given in Exhibit 13.31.

Exhibit 13.31. An example Brand Loyalty summated scale

Multiple items that measure the Brand Loyalty construct, with a hypothetical respondent's rating of their Acme printer on each item, using a 1 (Strongly Disagree) to 7 (Strongly Agree) scale:

a. I would buy another Acme if I had to buy another printer     6
b. I would buy other Acme products                              5
c. I would tell my friends and co-workers to buy Acme           6

Summated scale measure of Brand Loyalty: (6 + 5 + 6) / 3 = 5.7
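The computation in Exhibit 13.31, expressed as a small helper function (a sketch; the item values are the hypothetical respondent's ratings):

def summated_scale(ratings):
    # Average of the items that indicate one construct, as in Exhibit 13.31
    return sum(ratings) / len(ratings)

brand_loyalty_items = [6, 5, 6]   # ratings of items (a), (b), and (c)
print(round(summated_scale(brand_loyalty_items), 1))   # 5.7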

Validity of multi-item measures: Consider the following example illustrating the increased validity of
summated scales over single item measures.
Respondents A and B both give the same rating of 6 to statement (c), "I would tell my friends and co-workers to buy Acme," but they may differ in how they rate the remaining two statements, for the
reasons given below.
Exhibit 13.32. Respondents A's and B's ratings of the three Brand Loyalty items (ratings and rationale)

a. I would buy another Acme if I had to buy another printer.
   Respondent A: Rating = 6 ("Because my needs are changing, I may have to look at other brands.")
   Respondent B: Rating = 7 ("My Acme printer meets my needs.")
b. I would buy other Acme products.
   Respondent A: Rating = 5 ("Their other products are not reliable.")
   Respondent B: Rating = 7 ("All of Acme's products are reliable.")
c. I would tell my friends and co-workers to buy Acme.
   Respondent A: Rating = 6 ("They make good products.")
   Respondent B: Rating = 6 ("They make good products.")
Summated scale loyalty rating
   Respondent A: (6 + 5 + 6) / 3 = 5.7
   Respondent B: (7 + 7 + 6) / 3 = 6.7

These respondents have different levels of loyalty toward their Acme printer, but the survey would not
detect this if the respondents were only asked item (c)!
Consequently, you can increase the construct validity of your measures if you use multi-item summated
scales where they are appropriate, and you can use them for independent and dependent variables,
in multiple-regression as well as path analysis models. Example summated
scale items for all the variables used in our examples are given below.

Exhibit 13.33. Example multi-item measures of Product Quality, Cost, Value, Satisfaction and Brand Loyalty

Construct          Items comprising the multi-item construct measure
Product Quality    The quality of the Acme printer I bought is excellent
                   Acme printers are known to be high quality
                   I'm sure my Acme printer will last a long time
Cost               The Acme printer was not reasonably priced
                   Acme does not set fair prices for its products
                   Acme printers are more expensive than others
Value              I feel like I got good value for this purchase
                   The quality of the printer is worth its cost
                   I could tell my boss this purchase was a good value
Satisfaction       I am very satisfied with my newly purchased Acme printer
                   My printer is better than I expected it would be
                   I have no regrets about having bought this printer
Brand Loyalty      I would buy another Acme if I had to buy another printer
                   I would buy other Acme products
                   I would tell my friends and co-workers to buy Acme

Now, we will use your knowledge of latent constructs and multi-item measures of constructs to learn
about SEM.

Structural Equation Models15


Exhibit 13.34 depicts a SEM model that uses multi-item measures for each construct (i.e., represented
by the small rectangles above the ovals).
Exhibit 13.34. SEM with standardized regression coefficients
[Path diagram: the latent constructs Product Quality, Cost, Value, Satisfaction, and Brand Loyalty appear as ovals, each measured by three indicator variables (Q1-Q3, COST1-COST3, VALUE1-VALUE3, SAT1-SAT3, REPEAT1-REPEAT3) shown as rectangles. Small circles pointing to the indicator variables represent measurement error in those indicators; small circles pointing to the endogenous constructs represent unexplained variance in those constructs. Standardized coefficients appear on each path; the key values are discussed in the text below.]

Definitions: Before we go further, we need to explain the ovals, rectangles, and circles in Exhibit 13.34,
and introduce you to the nomenclature of SEM.

The ovals are the latent constructs. If one or more single-tipped arrows point to an oval, the oval is called
an endogenous construct. In our model, we have three: Value, Satisfaction, and Brand Loyalty. They are
unobserved (another term in the SEM lexicon) because we don't directly measure them. They are
endogenous because their values are determined by other variables in the model; that's why the single-tipped arrows from other variables point to them. For example, Brand Loyalty is determined by Satisfaction
and Cost. The error term for an endogenous variable is denoted by a small circle pointing to the
endogenous construct (e.g., above Brand Loyalty, below Satisfaction, and above Value).

15 We used the AMOS software package to estimate the SEM models discussed in this White Paper.

The exogenous constructs are Product Quality and Cost. These variables are exogenous because there
are no single-tipped arrows pointing to them. Their values are not determined by other variables in the
model. These are the SEM equivalents of independent variables in a multiple-regression model.

The small rectangles above/below/to the side of the ovals are multi-item measures of our constructs as
identified in Exhibit 13.34. These measures also go by the term indicator variables. They are observed
because they appear on our questionnaire. However, their variances are being caused by the constructs
(i.e., the ovals) that they are presumed to represent, together with an error term (i.e., the small circles
pointing to them), since the constructs do not perfectly account for all the variance in these indicator
measures. What cannot be accounted for by the construct is called measurement error. (This term is
discussed at more length shortly.)

The curved, two-headed arrow represents the correlation between our two exogenous variables, Cost and
Product Quality. Generally, these two-headed arrows are used to show the correlation between or among
the model's exogenous variables.

What do the numbers in Exhibit 13.34 signify? First, SEM will display two sets of numbers in these
diagrams: either standardized or unstandardized coefficients. We will explain both, starting with
the standardized coefficients in Exhibit 13.34.16

You often will want to examine the standardized coefficients because they provide information on the
proportion of the variance explained by the model, which is similar to r2 in a multiple regression model (our
Base Case).
For example, the proportion of the variance in Brand Loyalty explained by the SEM model is 0.41
(which is just above and to the right of the Brand Loyalty oval). This compares to 0.25 in the Base
Case model. One reason the SEM model's r2 is larger is that it has more variables in it.
The proportion of the variance in Value explained by Product Quality and Cost is 0.60.
The proportion of the variance in Satisfaction explained by Product Quality and Value is 0.39.

A reflection of the reliability of the multi-item indicators of the latent constructs is also given in Exhibit 13.34.
For example, look at the three indicators of Product Quality, for which arrows point from Product Quality to
the rectangles Q1, Q2, and Q3. Think of their respective coefficients (0.97, 0.99, and 0.97) as standardized beta coefficients.17

16 Sometimes a model will use variables rated on different scales or metrics. Using standardized data to calculate the coefficients facilitates
comparing and contrasting these beta coefficients.
As this value approaches 1.0, an item becomes a better indicator of the latent construct.
Squaring these coefficients tells you what percentage of the variance in the indicator is explained by the latent construct.
So, for the Q1 indicator of Product Quality, 95% of its variance is being explained by the Product Quality latent construct (0.97² ≈ 0.95). (Note: these coefficients are taken to the second decimal place. That's why, when you square the indicator coefficient, it does not always equal the percentage-of-variance-explained coefficient above it.)
Therefore, 5% is explained by random error, denoted by the small circle above the Q1 rectangle.
Typically, we want at least 50% of the variance in an indicator variable to be explained by its latent construct; thus, the standardized coefficient should be 0.71 or higher.

The correlation between Cost and Product Quality is -0.04.

The standardized coefficient of the path linking Product Quality to Satisfaction is 0.60; of Value to
Satisfaction, 0.04; Cost to Value, -0.51; Cost to Brand Loyalty, -0.33; and, Satisfaction to Brand Loyalty,
0.54.

Now let's compare and contrast the unstandardized coefficients for the Base Case, path, and SEM
models when we use multi-item summated scales in the Base Case and path models; this makes the
three models more comparable. Refer to Exhibit 13.35.

Exhibit 13.35. Model coefficients (the multiple-regression and path models use the multi-item summated scales from Exhibit 13.33; SEM uses the same items as indicators of latent constructs)

Unstandardized model coefficients predicting Brand Loyalty

                    Base Case                Combined direct and indirect effects
Variable            Multiple-regression      Path Analysis      SEM
Product Quality     0.37                     0.30               0.24
Cost                -0.36                    -0.28              -0.27
Value               Not in model             0.03               0.02
Satisfaction        Not in model             0.51               0.39
r2                  0.31                     0.42*              0.41

*For the equation Brand Loyalty = Intercept + b1(Satisfaction) + b2(Cost) + error

17 These coefficients are extremely high (e.g., >0.95) because these data were generated from a simulator. In the real world, you will often see these coefficients range from 0.75 to 0.85.

Some observations on Exhibit 13.35:18
The model with the fewest predictor variables, the Base Case model, has the lowest r2.
There is much similarity between the path and SEM models:
Value has the lowest coefficient of all the variables in both models.
Cost takes on a negative coefficient, which is what theory would predict.
Satisfaction has the largest coefficient in both models.
Product Quality has an effect on Brand Loyalty comparable to that of Cost, but in the reverse direction.

How might the research department interpret the SEM model for its management? Consider:

Product Quality and Cost nearly equally impact Brand Loyalty.
This suggests a few strategies for management to consider:
Make improvements in quality with minimal impact on the product's cost.
Seek to reduce manufacturing and/or supply-chain costs while keeping Product Quality constant.
If Product Quality and Cost both increase, they may not have any net effect on Brand Loyalty.

SEM and Path Analysis Similarities: Structurally, the SEM (Exhibit 13.34) and the path model (Exhibit
13.26) look identical in our example. SEM, however, can specify even more complicated relationships
than a path analysis model can.
How is SEM different? SEM differs from path analysis and multiple-regression analysis on several
points. For now, you should be aware of the following difference:

18 Generally, for the same data set and variables, SEM coefficients are slightly larger than their path analysis counterparts. This is because SEM
takes into account random measurement error in the items that serve as indicators of the constructs. We don't see this in this data set
because it is a simulated data set in which the indicators of the constructs have extremely high reliability.

SEM accounts for measurement error in the indicators of the constructs. This needs brief clarification,
because we have talked about two kinds of error: (1) the error associated with the dependent-variable
components in a multiple-regression, path analysis, or SEM model; and (2) the measurement error component
in the scale items (Exhibit 13.34) that serve as indicators of the latent constructs.
Error (1) is simply the difference between the actual value of the dependent variable, Y, and the
model's predicted value of Y. It is caused by variables not accounted for in our model.19
Measurement error (2) refers to the error inherent in our attribute ratings that serve as the
indicators of the latent constructs.

If our SEM model is theoretically sound, the variance in our attribute ratings is being
affected, in part, by the latent constructs.

But these constructs cannot account for 100% of the variance in these ratings. Why?
Because some of the variance in these ratings is going to be caused by factors outside
the researcher's control, such as the method of survey administration (asking the same
question over the telephone vs. the Internet will give you slightly different answers); the
mood the respondent is in at the time of the interview; and the fact that there is always
some ambiguity and vagueness in our survey questions. Different respondents will
interpret our questions differently. All these factors will introduce random measurement
error into the multi-item indicators of our constructs.

This kind of random variance in our indicator measures is something that SEM takes into
account when it estimates the model's coefficients. Consequently, SEM will generate
more reliable coefficient estimates than will multiple-regression or path analysis.

Summary
Multiple-regression, path analysis, and SEM share a similar traitthey seek to specify and estimate
relationships among variables.
Path analysis and SEM are more flexible than multiple-regression models because they can model
more complicated relationships such as models with multiple dependent-like
variables.

19 In social science research, we can never attain r2's of 1.0. We simply cannot measure all the variables that might explain the variance in our
dependent variables.

Although both multiple-regression and path analysis can use multi-item measures of latent constructs
via summated scales, only SEM can take into account measurement error in the indicator measures
(i.e., the attribute ratings), which produces more reliable coefficient estimates.

V. DOING THE ANALYSIS


For a complete example with annotation please see the PowerPoint presentation, Chapter 14.

Conjoint analysis
Real life decisions involve trading off some advantages to gain some others. For instance, if a
customer wants a cheaper car, he or she may have to forego some features like air-conditioning or air
bags. A customer who wants a large house that is inexpensive may have to move to a remote suburb.
What would a customer trade off to gain a given benefit? This is the type of question that is answered
by conjoint analysis.

How important are different features in a VCR? What features would a consumer exchange in return
for a desired feature? Would an investor give up a broker's recommendations in return for low
commissions? For an airline passenger, is price more important than service? What is the combination
of features that would be most appealing in a computer? Questions like these are answered by trade-off techniques such as conjoint analysis and discrete choice models.

WHY CONJOINT ANALYSIS


The problem of knowing what is really important to customers is a recurring one in marketing. Suppose
you are a manufacturer of an automobile. You want to know what is important to a customer in an
automobile, so you can build an optimal product.
The direct approach

If we use traditional methods, we would perhaps ask potential customers to rate the importance of
different product features such as roominess, gasoline economy, safety features and price. This direct
approach can work in some cases, but it has its limitations.

Why the direct approach is not adequate

When we ask the customers what they would ideally like, we make the situation somewhat unrealistic.
We imply that there are no constraints on the choices. There is nothing that prevents a respondent
from saying that every single feature presented is equally important.
If consumers state that all features are equally important, the data provide very little input to the
marketer.
Some requirements may, in practice, be incompatible. For instance, cars that provide both
roominess and good gas mileage may not be able to offer good acceleration. Yet consumers may
rate all these variables equally highly.
Even if all physical requirements are compatible, fulfilling all of them will raise the
price beyond what a customer may be willing to pay.
Real life situations are not like this. For instance, a car that is roomy may be more expensive than what
we plan to spend on it. A car that has good acceleration may lack safety
features. When we make decisions, we often make choices among
alternatives that are less than perfect. A customer who wants good


acceleration may have to decide how many of the safety features he or she would forego to get a car with
good acceleration. A customer who wants a longer warranty period has to decide whether he or she
would forego the money needed to get it.
Such information is difficult to get at by direct questioning. One main reason for this is that most
people are not consciously aware of the exact importance they attach to different features, even though
it may play a critical role in their decision-making.
We need a technique that models the decision-making process

The above-mentioned limitations of the direct approach lead us to look for a technique that fulfils the
following criteria.
1. The technique should be similar to the way a consumer makes decisions in real life.
2. The technique should avoid asking consumers the importance they attach to different features (e.g.
the relative importance of price versus durability of a product).

3. The technique should express the desirability of each feature in quantitative terms so we can
compare the importance of different features in making decisions.
4. The technique should be able to estimate the importance a person attaches to different levels of a
feature (e.g. relative importance of different price levels).
Conjoint analysis is a family of techniques designed, to the extent possible, to fulfil these
criteria.

HOW CONJOINT ANALYSIS WORKS


Conjoint analysis aims to estimate the importance a person attaches to different features of a product
or service, without direct questioning. For instance, a builder may be
interested in knowing the degree of importance a buyer attaches to features
such as the number of bedrooms, closeness to schools, the type of
neighborhood, and price. Respondents are not directly asked to specify the


importance they attach to these features but are asked to indicate only their overall preferences for
different bundles of features. From these overall preferences for different bundles of features, we can
infer the importance respondents attach to individual features.
Assumptions of the analysis

Conjoint analysis, in its simplest form, assumes the following:


People ascribe a measurable amount of importance (called utilities or partworths) to different
features of a product or service; and
When called upon to choose between alternatives, people simply add up the utilities for each
feature of a given alternative.
They would favor the alternative with the highest total utility.
These points can be illustrated by an example. John is in the market for a house; let us
assume that he has the utilities (importance weights) shown in Exhibit 13.36.

Exhibit 13.36. John's utilities for price and number of bedrooms

Price        Utility for price      Bedrooms      Utility for bedrooms
$100,000     7                      2 bedrooms    4
$125,000     3                      3 bedrooms    6
$150,000     2
If John has to choose between a 3-bedroom house at $150,000 and a 2-bedroom house at $125,000,
which one would he choose? Would he be willing to pay an additional $25,000 for the additional
bedroom? We can answer this question if we know what importance weight, or utility, John attaches to
price and bedrooms.
From the exhibit we know that the utility for $125,000 is 3 and the utility for a 2-bedroom house is
4. The total utility for a 2-bedroom house at $125,000 is therefore 7 (3+4 = 7). The utility for $150,000 is 2 and
the utility for a 3-bedroom house is 6. The total utility for a 3-bedroom house at $150,000 is 8 (2+6 = 8).
So John will be willing to pay the additional $25,000 for the extra bedroom, since that alternative has the higher total
utility.
Exhibit 13.37 shows the utilities and their rank order.

Exhibit 13.37. Utilities of different alternatives and ranks

Choice                               Utility       Rank
Two-bedroom house at $100,000        11 (4+7)      2
Two-bedroom house at $125,000        7 (4+3)       5
Two-bedroom house at $150,000        6 (4+2)       6
Three-bedroom house at $100,000      13 (6+7)      1
Three-bedroom house at $125,000      9 (6+3)       3
Three-bedroom house at $150,000      8 (6+2)       4

Without knowing anything about John's utilities, if we simply ask him to rank the 6 possible choices, he
would have chosen the 3-bedroom house at $100,000 as his first choice, since this has the highest total
utility for him. His second choice would have been the two-bedroom house at $100,000, and so on. The
full set of ranks is shown in Exhibit 13.38.

Exhibit 13.38. Conjoint choices of John

                    2 bedrooms [4]      3 bedrooms [6]
$100,000 [7]        (11)  rank 2        (13)  rank 1
$125,000 [3]        ( 7)  rank 5        ( 9)  rank 3
$150,000 [2]        ( 6)  rank 6        ( 8)  rank 4

The numbers in square brackets are partworths or utilities. The numbers in parentheses
are total worths. For example, the partworth for 2 bedrooms is 4 and the partworth for
$100,000 is 7. Therefore, the total worth for a 2-bedroom house priced at $100,000 is 11 (4+7).
Since this has the second-highest utility for John (after the 3-bedroom house at $100,000),
he would assign the second rank to this choice. The rank of each alternative is shown next to its total worth.

Given a series of alternatives like the ones shown above, most people will be able to indicate their
preferences. They can usually rank the alternatives from the most preferred to the least preferred. Here is
the problem that conjoint analysis attempts to solve: how do we estimate a person's utilities for various
features, given only that person's overall preferences for different alternatives? The task of conjoint analysis is
to derive the utilities. In our example it means finding the five numbers (the numbers in square brackets in
Exhibit 13.38) such that when they are added together and ranked (as is done in the body of the exhibit),
they will match the ranks given by the respondent.
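As an illustration of what a solution looks like, the short Python sketch below takes a candidate set of partworths (the bracketed numbers from Exhibit 13.38, used here purely as an example) and checks that the ranking implied by their sums reproduces the ranks the respondent gave; an estimation algorithm searches for numbers with exactly this property.

```python
# A sketch of the goal: verify that a candidate set of partworths reproduces
# the ranks John gave. The partworths are the bracketed numbers in Exhibit 13.38.
partworths = {"2br": 4, "3br": 6, "100k": 7, "125k": 3, "150k": 2}

observed_ranks = {            # 1 = most preferred
    ("3br", "100k"): 1, ("2br", "100k"): 2, ("3br", "125k"): 3,
    ("3br", "150k"): 4, ("2br", "125k"): 5, ("2br", "150k"): 6,
}

# Total worth of each alternative under the candidate partworths ...
totals = {alt: partworths[alt[0]] + partworths[alt[1]] for alt in observed_ranks}

# ... and the ranking those totals imply (highest total = rank 1).
implied_ranks = {alt: rank for rank, alt in
                 enumerate(sorted(totals, key=totals.get, reverse=True), start=1)}

print(implied_ranks == observed_ranks)   # True: these partworths fit John's ranking
```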
Some terminology

In conjoint terminology, the basic variables under consideration are called attributes or factors; the
choices within an attribute are called levels. Thus, in our example, number of bedrooms and price are
called attributes. $100,000, $125,000 and $150,000 for price, and 2-bedroom and 3-bedroom for
number of bedrooms, are called levels of the attributes. We are said to be dealing with conjoint
analysis with two attributes; price has 3 levels and number of bedrooms has 2 levels. The importance
weights we derive for each level of each attribute using conjoint analysis are called utilities or
partworths. (Sometimes, the term variable is used instead of the more common term attribute.)

Different Types of Conjoint Analysis


Conjoint analytic techniques broadly fall into three major categories:
1. Choice-based;
2. Ratings-based; and
3. Hybrid.
In Choice-based Conjoint, respondents are presented with a set of products (generally two to five).
Respondents can pick any of the available alternatives, or none if all the choices presented prove to be
unacceptable. This technique closely parallels buying environments in the real world. Ratings-based Conjoint
involves rating each individual product alternative or pairwise rating two product alternatives
simultaneously. Ratings-based Conjoint is likely to be more suited to non-competitive markets, such as
oligopolies or newly emerging categories.
Hybrid techniques combine self-explicated scaling with either Ratings-based Conjoint
or Choice-based Conjoint. Hybrid techniques are more commonly used when a large number of attributes
must be included. Adaptive Conjoint Analysis is an example of a hybrid technique.
There are also other types of conjoint analyses, such as Ordinary Least Squares-based techniques, menu-based
techniques, self-explicated conjoint analysis and so on. Currently, probably the most commonly used
technique is Choice-based Conjoint Analysis.
The first step is to decide which type of conjoint analysis suits the problem you have on hand. In most cases, it is likely to be
Choice-based Conjoint (CBC). CBC offers respondents a series of choice sets, generally of two to five
alternative products. Respondents can choose any of the available alternatives, or none if the
alternatives in the choice set are not at all acceptable to them. This method closely imitates what
happens when a consumer goes shopping.

Ratings-based Conjoint involves rating each product alternative individually. Pairwise rating
involves comparing two product alternatives simultaneously. No-buy options cannot easily be incorporated
in Ratings-based Conjoint. As a result, this technique is more suited to non-competitive or nearly
non-competitive markets. Both these techniques can be used in full-profile or partial-profile studies. The
full-profile approach involves one level from every attribute in the study. If there are six attributes in your
full-profile study, then each product alternative will be defined by six attribute levels. The partial-profile
approach involves only a subset of the total set of attributes. If there are eight attributes in your study, then
each product alternative may be defined by only two or three attribute levels.
Hybrid techniques are approaches that combine self-explicated scaling with either Ratings-based
Conjoint or Choice-based Conjoint. The hybrid method is often used when a large number of attributes needs to be
included. ACA (Adaptive Conjoint Analysis) is probably the best-known and most widely used example of a
hybrid technique.
It is critical to remember that we need to define attributes that are simple enough to be understood by
respondents. With complex or unfamiliar alternatives, six attributes may be too many, while with simpler and more familiar
ones we may be able to include more than six. Partial-profile designs, which can include up to 50 attributes, are a
relatively recent development and are considered effective alternatives to hybrid techniques. Full-profile
designs are generally preferred over partial-profile designs if the number of attributes is sufficiently small.
This is because full-profile designs can use smaller samples, can identify interaction terms and are intuitively more
understandable. Full-profile designs are also often preferred over hybrid designs if the number of attributes
is sufficiently small, because hybrid designs are somewhat more artificial and often cannot accommodate
interaction effects.
Attributes and Levels
Attributes and levels must be specified with the study objectives in mind. For example, if the objective is to
assess the impact of the introduction of a new brand into your category, it is important to include brand
as an attribute in the study, with the new brand as a level within that attribute.
Experimental Design, Conjoint Tasks and Sample Size
Conjoint studies require an experimental design to determine the attribute/level sets to be used in testing.
This is not a major issue, since all major conjoint programs incorporate this function and are easy to use.
Design software can also provide diagnostic information for evaluating the design. If you use a complex
design, you may want to test it with synthetic (or other) data prior to fielding it.
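As a simple illustration of what a design is, the Python sketch below (with hypothetical attributes and levels) enumerates the full-factorial set of profiles; in practice the conjoint software would select a smaller fractional design from this set.

```python
from itertools import product

# Hypothetical attributes and levels for a housing study.
attributes = {
    "price":    ["$100,000", "$125,000", "$150,000"],
    "bedrooms": ["2-bedroom", "3-bedroom"],
    "schools":  ["near schools", "far from schools"],
}

# The full factorial: every combination of one level per attribute
# (3 x 2 x 2 = 12 profiles). Real studies usually field only a fractional
# subset of these, selected by the conjoint design software.
profiles = [dict(zip(attributes, combo)) for combo in product(*attributes.values())]

print(len(profiles))   # 12
print(profiles[0])     # {'price': '$100,000', 'bedrooms': '2-bedroom', 'schools': 'near schools'}
```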
Numerical attributes, such as size or price, can be defined as part-worth attributes or vector
attributes. Vector attributes assume linearity; part-worth attributes do not. If defined as a part-worth
attribute, each level within the attribute would receive its own utility weight. If defined as a vector attribute,
one utility weight would be calculated for the attribute as a whole and would then be multiplied by each
level value to determine the utility weight by level. Researchers often define all attributes as part-worth
attributes so they are free to model non-linear relationships.
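The difference between the two codings can be illustrated with a small sketch; the weights below are hypothetical and chosen only to show the contrast.

```python
# Two ways of coding a numeric attribute such as price; the weights are hypothetical.
price_levels = [100_000, 125_000, 150_000]

# Part-worth coding: each level gets its own utility, so the relationship
# with price can be non-linear.
partworth_utility = {100_000: 7, 125_000: 3, 150_000: 2}

# Vector coding: one weight for the whole attribute, multiplied by the level
# value (here scaled to $000s), which forces a strictly linear relationship.
vector_weight = -0.10
vector_utility = {p: vector_weight * (p / 1000) for p in price_levels}

print(partworth_utility[125_000])   # 3
print(vector_utility[125_000])      # -12.5 (exactly on the straight line)
```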
In general, it takes a while for a respondent to get used to the conjoint tasks, so it is best to use a few warm-up tasks before getting into the main study. It is also a good idea to randomize the task order whenever
possible. Another good practice is to use holdout tasks, which are tasks that will not be included in the utility
estimation process but are used for validating the study. Holdout tasks should be designed such that
responses are not constant across alternatives. This will make validating the model easier. Although you
can use as many as 20 tasks in a choice-based conjoint study, if you want a conjoint study that works well,
limit the number of tasks to perhaps around 10, with 2 or 3 warm-up tasks.
Although sample size considerations don't receive much attention in the literature, available evidence suggests
that models can be reliably estimated with samples as small as 75, regardless of the type of conjoint technique
employed. However, if you had a market with four regions and you modeled each region separately, you
would need a sample of 300 (75 x 4). If you also wanted to model males and females separately within each
region, your minimum sample size would double to 600.
The two most common errors in marketing research conjoint studies are these: (1) asking respondents
questions they are unable to answer accurately; and (2) including attributes that cannot be meaningfully
understood (e.g. a razor that shaves 10% more smoothly). You should try to avoid both.
Utility Estimation Models
Traditionally, Ratings-based Conjoint utilities have been estimated using Ordinary Least Squares (OLS)
regression at the individual respondent level, and Choice-based Conjoint utilities have been estimated using logit
regression at the aggregate (total sample) level.
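For ratings-based data, the OLS approach amounts to regressing each respondent's ratings on dummy-coded attribute levels. A minimal sketch, using made-up ratings for the house example and NumPy's least-squares solver:

```python
import numpy as np

# One respondent rates the six house profiles on a 0-10 scale (made-up data).
# Design matrix columns: intercept, 3-bedroom dummy, $125k dummy, $150k dummy
# (2 bedrooms and $100,000 are the reference levels).
X = np.array([
    [1, 0, 0, 0],   # 2 bedrooms, $100,000
    [1, 0, 1, 0],   # 2 bedrooms, $125,000
    [1, 0, 0, 1],   # 2 bedrooms, $150,000
    [1, 1, 0, 0],   # 3 bedrooms, $100,000
    [1, 1, 1, 0],   # 3 bedrooms, $125,000
    [1, 1, 0, 1],   # 3 bedrooms, $150,000
])
ratings = np.array([8.0, 5.0, 4.0, 10.0, 7.0, 6.0])

coefs, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print(coefs)   # [ 8.  2. -3. -4.]: +2 for the third bedroom,
               # -3 for $125,000 and -4 for $150,000 relative to $100,000
```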

Generally speaking, disaggregate models are better than aggregate models. The main reason for this is
that aggregate models don't capture heterogeneity. For example, suppose half the sample prefers an iPhone over a
Blackberry and the other half prefers a Blackberry over an iPhone. An aggregate model will show the total
sample as indifferent between the two smartphones, since preferences for the Blackberry are cancelled out by
preferences for the iPhone. In a disaggregate model, the brand preferences will be clearly recognized, since all
the iPhone lovers will have large utilities for the iPhone and all the Blackberry lovers will have large utilities for
the Blackberry.
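A tiny numerical sketch (with hypothetical individual utilities) makes the point:

```python
import statistics

# Hypothetical individual-level utilities for "iPhone relative to Blackberry":
# positive means the respondent prefers the iPhone, negative the Blackberry.
iphone_vs_blackberry = [2.0, 2.5, 1.5, -2.0, -2.5, -1.5]

print(statistics.mean(iphone_vs_blackberry))    # 0.0: the aggregate model looks indifferent
print([u > 0 for u in iphone_vs_blackberry])    # the disaggregate view shows two clear segments
```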
Choice-based Conjoint is generally preferred to Ratings-based Conjoint because CBC replicates the
buying environment, can handle interaction effects and can accommodate the no-buy option. The main
disadvantage of CBC is that, on its own, it cannot identify utilities for individuals (i.e., at the disaggregate level). However, the recent
introduction of the Hierarchical Bayesian (HB) model has rectified the situation. HB allows individual-level
utility estimation from Choice-based Conjoint data. There is even some evidence to show that HB estimates
are superior to OLS regression estimates for Ratings-based Conjoint. On the other hand, HB estimation is
computationally intensive. In general, however, this may be a small price to pay for the advantage of
deriving a disaggregate model.
Some software packages allow for constraints and can force the utilities of certain attribute levels to always be equal
to or higher than those of other levels. As an example, suppose your product is price oriented and consumers
would prefer to buy your product at a lower price. In this case, the utility of the lowest price level should be
greater than or equal to that of every higher price level. To accommodate this, you can constrain your utility
estimates to follow this logic. While this can be very efficient when we know certain things to be true, extensive
use of constraints is likely to simply confirm what we already believe rather than tell us how the market actually
behaves.
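As a simple illustration of the idea, the sketch below (with hypothetical estimates) forces price utilities to be non-increasing as price rises; actual packages impose such constraints during estimation rather than after it.

```python
import numpy as np

# A simple sketch of a monotonicity constraint: force estimated price utilities
# to be non-increasing as price rises. The estimates below are hypothetical.
estimated = np.array([6.8, 7.1, 2.0])            # utilities for $100k, $125k, $150k
                                                  # (note the small reversal at $125k)
constrained = np.minimum.accumulate(estimated)    # cap each level by the one before it

print(constrained)   # [6.8 6.8 2.0]
```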
Once the utilities are estimated, the next step is to apply a simulation model to the derived utilities. There
are many methods of simulation. Here are some of the more commonly used ones.
First Choice models are only available for disaggregate data and follow the winner-takes-all logic. They
assume that each individual will choose the product for which he or she has the highest utility. The
disadvantage of this is that minor changes in product configurations can shift an unrealistically high share of
choices to the marginally preferred product, with the second product losing out entirely to the first. So minor
differences in utility between nearly equally preferred products can have an exaggerated effect.
Share of Preference models can be either aggregate or disaggregate. These models distribute preferences
in proportion to each product's total utility. If, for example, in an aggregate model of two brands,
brand X had a total utility of 12 and brand Y had a total utility of 6, brand X would have a 67% share of
preference (12/(12+6)) and brand Y would have a 33% share of preference (6/(12+6)). This overcomes one
of the limitations of the First Choice model in that the strength of preference between alternatives is
preserved. However, Share of Preference models are subject to the Independence of Irrelevant
Alternatives (IIA) bias: if two products are very similar, their net share is over-estimated. This is akin to
double counting. Correction factors can be applied to remedy this problem.
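The sketch below (with hypothetical individual-level utilities) contrasts the two rules; note that this version computes shares from raw utilities, whereas logit-based variants use exponentiated utilities.

```python
import numpy as np

# Hypothetical individual-level total utilities: rows are respondents,
# columns are three simulated products X, Y and Z.
utilities = np.array([
    [12.0,  6.0, 11.5],
    [ 5.0,  9.0,  4.0],
    [ 8.0,  7.5,  8.2],
])

# First Choice: each respondent "buys" only his or her highest-utility product.
winners = utilities.argmax(axis=1)
first_choice_share = np.bincount(winners, minlength=3) / len(utilities)

# Share of Preference: each respondent's preference is split in proportion to
# utility. (Shown here on raw utilities; logit versions use exp(utility).)
share_of_preference = (utilities / utilities.sum(axis=1, keepdims=True)).mean(axis=0)

print(first_choice_share)     # each product gets 1/3 -- strength of preference is lost
print(share_of_preference)    # roughly [0.34, 0.34, 0.32] -- strength of preference is preserved
```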
Randomized First Choice models exhibit less IIA bias than Share of Preference models and are less
subject to the influence of minor differences in utilities among alternatives than First Choice models.
They also offer several ways to fine-tune the model for increased accuracy.
No matter which method you use, the model's predicted choices should be compared to the actual
choices made by respondents in order to validate the model.
For disaggregate models, we can use two measures to assess the model: Hit Rates and Mean
Absolute Error (MAE); for aggregate models, only MAE can be used.

Hit Rates are calculated by comparing the choice predicted for an individual respondent by the
model (using the maximum utility rule) to the actual choice made by the respondent. Every correct
prediction is a "hit". The total number of hits divided by the sample size is called the Hit Rate.

MAE is the sum of the absolute differences between predicted and actual shares of preference for all
products in a holdout task, divided by the number of products in the holdout task.
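Both measures are straightforward to compute; here is a minimal sketch with hypothetical predicted and actual values.

```python
import numpy as np

# Hit Rate: share of respondents whose predicted (maximum-utility) holdout
# choice matches their actual holdout choice. All values are hypothetical.
predicted_choice = np.array([0, 1, 1, 2, 0, 2, 1, 0])
actual_choice    = np.array([0, 1, 0, 2, 0, 2, 2, 0])
hit_rate = (predicted_choice == actual_choice).mean()
print(hit_rate)   # 0.75, i.e. a 75% hit rate

# MAE: mean absolute difference between predicted and actual shares of
# preference across the products in a holdout task (in share points).
predicted_share = np.array([45.0, 35.0, 20.0])
actual_share    = np.array([50.0, 30.0, 20.0])
mae = np.abs(predicted_share - actual_share).mean()
print(mae)        # about 3.3 points
```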

Initial hit rates and MAE (prior to model refinement) can be compared to hit rates and MAE from a random
model to assess how successfully the model has been able to capture and model respondent choices. For
example, if there are 3 choices available in a holdout task, such as two products and a no-buy, a random
model could be expected to have a hit rate of 33% (1/3). If your initial model has a hit rate of 66%, then
clearly the model is not random. The MAE for a random model can be calculated by subtracting 33% from the
percent of respondents who picked each of the three options, summing the absolute values of the
differences and dividing by three. If the random model has an MAE of 9 and your model has an MAE of 3,
this can be taken as evidence that the model is valid. It is for this reason that holdout tasks work
better if there are unequal preferences across alternatives. In general, hit rates above 60% and MAEs
below 5 points reflect a reasonably good-fitting model.
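Following the arithmetic described above, a random-model baseline can be computed as in this sketch (the pick rates are hypothetical):

```python
import numpy as np

# Random-model baseline for a holdout task with three options (two products
# and a no-buy), following the arithmetic described above. Shares are hypothetical.
actual_pick_rates = np.array([45.0, 35.0, 20.0])    # % of respondents picking each option
random_prediction = np.full(3, 100 / 3)              # a random model predicts 33.3% each

random_hit_rate = 1 / 3                               # one chance in three of guessing right
random_mae = np.abs(actual_pick_rates - random_prediction).mean()

print(round(random_hit_rate, 2))   # 0.33
print(round(random_mae, 1))        # 8.9 points -- your model should beat this comfortably
```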

DOING THE ANALYSIS


For a complete example with annotation please see the PowerPoint presentation, Chapter 14.
