
Optimization Conjoint Models for Consumer Heterogeneity

Theodoros Evgeniou Technology Management, INSEAD, Bd de Constance, Fontainebleau 77300, France Tel: +33 (0)1 60 72 45 46 Fax: +33 (0)1 60 74 55 01 theodoros.evgeniou@insead.edu

Massimiliano Pontil University College London, London UK pontil@cs.ucl.ac.uk



Abstract

A novel optimization-based approach for modeling consumer heterogeneity in conjoint analysis is presented. Based on this approach we develop novel methods for aggregate conjoint estimation. The methods developed are experimentally compared to Hierarchical Bayes (HB) estimation with both simulated and real data. The experiments show that the proposed methods often significantly outperform HB in terms of model accuracy. At the same time estimation is computationally very efficient. Moreover, the methods can be used, like the existing individual-specific preference models they are based on, for large datasets, with products with many attributes, and for estimating interactions among product attributes, situations that are often computationally and statistically challenging for existing methods such as HB. Finally, the approach for handling heterogeneity presented here can also be used to extend other individual-specific optimization-based conjoint estimation methods proposed in the past to the case of aggregate conjoint estimation.

Keywords: Choice Models, Marketing Research, Data Mining, Regression And Other Statistical Techniques, Marketing Tools.

Introduction

As the amount of data capturing preferences of consumers increases, for example due to the plethora of consumer data on the internet (Cooley et al 1997; Kohavi 2001) and the possibility to conduct large scale market research projects online, there is a need for methods for consumer preference modeling that are computationally very efficient without sacrificing statistical performance. Conjoint analysis has offered a number of methods for modeling consumer preferences (Ben-Akiva and Lerman 1985; Carroll and Green 1995; Green and Srinivasan 1990; Louviere et al 2000; Sawtooth Software) that have been used for many applications (Wittink and Cattin 1989) both by researchers and marketing practitioners. However, many of the existing methods have been developed over the years based foremost on statistical arguments and then on computational efficiency ones. As a result, some of the most widely used conjoint analysis methods today, such as Hierarchical Bayes (HB) (Allenby et al 1998; Allenby and Rossi 1999; Arora et al 1998; DeSarbo et al 1997; Lenk et al 1996), are often computationally challenging for large scale conjoint analysis problems (Sawtooth Software). In this paper we develop a novel approach for conjoint analysis that fills this need: it is both computationally very efficient, and it leads to accurate estimations of preference models. In recent years there have been, among others, two key developments in the area of conjoint analysis:

First, mainly through the development of HB (Allenby et al 1998; Allenby and Rossi 1999; Arora et al 1998; DeSarbo et al 1997; Lenk et al 1996), it is now possible to estimate preference models simultaneously for many individual consumers that are significantly more accurate than the models estimated for each consumer independently. Handling consumer heterogeneity has been a central issue in conjoint analysis and a lot of work has been done on this topic, particularly since the introduction of HB. However, one of the most common practical problems with HB is computational: using HB with very large datasets and/or with products with many attribute levels is often out of reach for the current state of the art HB software tools (see for example (Sawtooth Software)).

Second, more recently an approach to conjoint analysis that is based on optimization theory (Bertsimas and Tsitsiklis 1997), such as polyhedral optimization, and statistical learning theory, as used for example for the method of Support Vector Machines (SVM) (Vapnik 1998), has been proposed as complementary to the Bayesian approach used for example for HB (Srinivasan and Shocker 1973; Srinivasan 1998; Toubia et al 2003; Toubia et al 2004; Cui and Curry 2004; Evgeniou et al 2004). The methods developed in this direction are very promising both in terms of statistical accuracy and, equally important, in terms of computational efficiency, as these methods are developed also from an optimization theory viewpoint and not only based on statistical arguments. However, although very promising, so far this approach has been developed only for individual-specific preference modeling.

The main goal of this paper is to extend the latter methods so that they can handle consumer heterogeneity. Essentially, to phrase it simply, this paper aims at offering to the optimization-based approach to conjoint analysis (i.e. (Srinivasan and Shocker 1973; Srinivasan 1998; Toubia et al 2003; Cui and Curry 2004; Evgeniou et al 2004)) what HB has offered to traditional multinomial-logit-type individual-specific conjoint methods (Ben-Akiva and Lerman 1985; Carroll and Green 1995; Green and Srinivasan 1990; Louviere et al 2000; McFadden 1974; McFadden 1986): a way to model consumer heterogeneity and estimate preference models for many individuals simultaneously using aggregate data from all the individuals. Therefore this paper presents an optimization theory and statistical learning theory based approach for handling consumer heterogeneity in conjoint analysis. As this is, to the best of our knowledge, a first attempt at doing so, we also discuss various directions for future research motivated by the work presented here. We believe that starting from the methods and ideas we develop here there are many possibilities for extensions to other novel conjoint estimation methods and theory. For example, although we focus here on extending the more recently proposed methods of (Cui and Curry 2004; Evgeniou et al 2004) for handling heterogeneity, the ideas presented here can be used for such an extension of other optimization-based conjoint methods such as those of (Srinivasan and Shocker 1973; Srinivasan 1998; Toubia et al 2003; Toubia et al 2004). We compare experimentally the proposed approach with HB using simulations, as in (Arora and Huber 2001; Evgeniou et al 2004; Toubia et al 2003), as well as real data. The experiments show that the proposed methods often significantly outperform HB in terms of accuracy.

Moreover, estimation is computationally very efficient; therefore the proposed approach can be used for very large datasets. So our experiments indicate that the proposed approach is more accurate and computationally faster than HB in many situations. Moreover, although we do not show such experiments here and refer the reader to (Cui and Curry 2004; Evgeniou et al 2004), the proposed approach can also be used, like the existing individual preference models it is based on (Cui and Curry 2004; Evgeniou et al 2004), for modeling interactions among product attributes better than HB. Finally, as a cautious note, we believe that, much like other optimization-based work on conjoint analysis, the methods presented here do not aim to replace existing methods for preference modeling, such as HB, but instead to contribute to the field new tools and methods that can complement existing ones.

In this paper we do not address the issue of designing questionnaires as typically done in conjoint analysis (Arora and Huber, 2001; Kuhfeld et al 1994; Segal 1982; Toubia et al 2004). We focus only on the utility estimation problem. We also deal only with full-profile preference data - full product comparisons - for the case of choice-based conjoint analysis (CBC) (Louviere et al 2000) instead of other metric-based ones. Extensions to the latter are possible as in the case of SVM regression (Vapnik 1998). The paper is organized as follows. In section 2 the new approach for handling consumer heterogeneity is presented. For simplicity only the basic linear model is shown. We refer the reader to (Cui and Curry, 2004; Evgeniou et al 2004) on how to extend the presented methods to the case of non-linear utility functions, for example in order to capture interactions among product features. In section 3 experiments comparing the proposed approach with HB are presented. Finally, section 4 concludes.

An Optimization Based Method for Modeling Consumer Heterogeneity

We consider the standard (i.e. (Louviere et al 2000)) problem of estimating a utility function from a set of examples of past choices coming from T individual consumers. Formally, for each individual t we have data from n choices where, without loss of generality, the ith choice is among two products {x^1_ti, x^2_ti}. To simplify notation we assume that for each i the first product x^1_ti is the preferred one - we can rename the products otherwise. All products are fully characterized by m-dimensional vectors, where m is the number of attributes describing the products. So the ith choice for each individual is among a pair of m-dimensional vectors. We are now looking for T utility functions that are in agreement with the data, namely T functions that assign higher utility value to the first product - the preferred one - for each pair of choices for each individual. This is the standard setup of choice-based conjoint analysis (Louviere et al 2000). Variations of this setup (i.e. cases where we know pairwise relative preferences with intensities) can be modeled in a similar way.

For simplicity we make the standard assumption (Ben-Akiva and Lerman 1985; Srinivasan and Shocker 1973) that the utility function of each individual is a linear function of the values (or logarithms of the values, without loss of generality) of the product attributes: the utility of a product x for individual t is U_t(x) = w_t · x. For a generalization to nonlinear utility models that, for example, also capture interactions among attributes, we refer the reader to (Cui and Curry, 2004; Evgeniou et al 2004). Our method for handling heterogeneity is based on the individual-specific preference modeling methods proposed by Cui and Curry (2004) and Evgeniou et al (2004). We briefly review the individual-specific case here and then extend it to handle consumer heterogeneity.
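To make the paired-comparison setup concrete, here is a minimal numerical illustration (not from the paper; all names are ours): with a linear utility U_t(x) = w_t · x, the statement "product x^1 is preferred to product x^2" is consistent with w_t exactly when w_t · (x^1 − x^2) > 0, which is the form in which the data enter all the optimization problems below.

```python
import numpy as np

# Minimal illustration (hypothetical data): a linear utility U_t(x) = w_t . x
# and the paired-comparison view used throughout the paper.
m = 4                                    # number of product attributes
rng = np.random.default_rng(0)
w_t = rng.normal(size=m)                 # partworths of one individual t

x1 = rng.random(m)                       # attribute vector of the preferred product
x2 = rng.random(m)                       # attribute vector of the rejected product

# The choice "x1 preferred to x2" is consistent with w_t exactly when the
# utility difference is positive, i.e. w_t . (x1 - x2) > 0.
z = x1 - x2
print(w_t @ x1 > w_t @ x2, (w_t @ z) > 0)   # the two tests agree (up to rounding)
```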

2.1 Individual-Specific Preference Models

In (Cui and Curry, 2004; Evgeniou et al 2004) (also see for example (Srinivasan and Shocker 1973; Srinivasan 1998; Toubia et al 2003; Toubia et al 2004) for similar approaches) the utility function w_t of a given individual t (so t is fixed in this case) was estimated as the solution of the following optimization problem:

Problem 2.1

$$\min_{w_t,\ \theta_{ti}}\ \sum_{i=1}^{n} \theta_{ti} \;+\; \gamma\,\|w_t\|^2$$

subject to, for i ∈ {1, . . . , n}:

$$w_t \cdot x^1_{ti} \;\ge\; w_t \cdot x^2_{ti} + 1 - \theta_{ti}, \qquad \theta_{ti} \ge 0. \tag{1}$$

The method is based on the idea of simultaneously minimizing the error made on the example data, via minimizing the slack variables θ_ti, and maximizing the confidence margin (which is inversely related to ‖w_t‖) with which the solution satisfies the constraints (Cortes and Vapnik 1995; Vapnik 1998; Cui and Curry 2004; Evgeniou et al 2004). Parameter γ controls the trade-off between fitting the data (the term Σ_i θ_ti) and the confidence of the model (the term ‖w_t‖²); the latter is also a measure of complexity of the estimated model (Vapnik 1998) and we will refer to it below as the complexity of the model. Optimizing simultaneously the error on the data and the complexity of the estimated model is crucial for avoiding important issues such as sensitivity to noise or the curse of dimensionality and for obtaining highly accurate estimates (Vapnik 1998). There are a number of ways to choose parameter γ: for example it can be chosen so that the prediction error on a small validation set is minimized, or through cross-validation (also called leave-one-out error) (Vapnik 1998). We refer the reader to (Cui and Curry 2004; Evgeniou et al 2004) for more information about method 2.1.
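As an illustration only, the following sketch solves Problem 2.1 for a single individual with the off-the-shelf convex solver cvxpy; the authors' own implementation is not described here (their experiments used C code), so the function name, data layout, and choice of solver are our assumptions.

```python
import cvxpy as cp

def estimate_individual_utility(X1, X2, gamma=1.0):
    """Sketch of Problem 2.1 for one individual.

    X1[i], X2[i] are the attribute vectors of the preferred and rejected
    product in the i-th paired comparison (n x m arrays).  Returns w_t.
    """
    n, m = X1.shape
    w = cp.Variable(m)                      # partworth vector w_t
    theta = cp.Variable(n, nonneg=True)     # slack variables theta_ti
    constraints = [X1 @ w >= X2 @ w + 1 - theta]
    objective = cp.Minimize(cp.sum(theta) + gamma * cp.sum_squares(w))
    cp.Problem(objective, constraints).solve()
    return w.value
```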

2.2 Aggregate Conjoint Estimation

We now show how to estimate all T utility functions (parameters w_t for all t ∈ {1, . . . , T}) by extending method 2.1. We note that similar extensions using the ideas presented here are possible for other optimization-based conjoint methods such as those of (Srinivasan and Shocker 1973; Srinivasan 1998; Toubia et al 2003; Toubia et al 2004). We do not discuss this here for simplicity and we focus on the method of (Cui and Curry 2004; Evgeniou et al 2004) which has shown promising results and relies on the theory of SVM (Vapnik 1998). We first develop a simple version of the proposed method to also give some intuition.

2.2.1 A Simple Model

We start by assuming that the utility functions of all individuals are not very different from each other. We can therefore assume that each w_t can be written, for every t ∈ {1, . . . , T}, as

$$w_t = w_0 + v_t \tag{2}$$

where the vectors v_t are smaller when the consumers are more similar to each other. In other words, we assume that the preferences of the consumers are related in a way that they are all close to some model w_0. We then estimate all v_t as well as the common part w_0 simultaneously. To this end we solve the following optimization problem, which is analogous to the individual-specific conjoint estimation method 2.1.

Problem 2.2

$$\min_{w_0,\,v_t,\,\theta_{ti}}\ \sum_{t=1}^{T}\sum_{i=1}^{n} \theta_{ti} \;+\; \gamma\left[\frac{1}{\lambda T}\sum_{t=1}^{T}\|v_t\|^2 \;+\; \frac{1}{1-\lambda}\,\|w_0\|^2\right] \tag{3}$$

subject, for all i ∈ {1, 2, . . . , n} and t ∈ {1, 2, . . . , T}, to the constraints that

$$(w_0 + v_t)\cdot x^1_{ti} \;\ge\; (w_0 + v_t)\cdot x^2_{ti} + 1 - \theta_{ti}, \qquad \theta_{ti} \ge 0. \tag{4}$$

In this problem:

- The θ_ti are slack variables measuring the error that each of the final models w_t makes on its corresponding data.

- Parameter γ is a user-defined parameter which, as in the individual-specific method 2.1, defines a trade-off between the error made on the data (the term Σ_{t=1}^T Σ_{i=1}^n θ_ti) and the complexity of the estimated functions, modeled as the term (1/(λT)) Σ_{t=1}^T ‖v_t‖² + (1/(1−λ)) ‖w_0‖². We penalize the complexity of the common part of the functions, ‖w_0‖², just like in the case of method 2.1, and we control how much the solutions w_t differ from each other by controlling the sizes of the ‖v_t‖².

- Parameter λ ∈ (0, 1) is a second user-defined parameter that controls the similarity between the individuals. Intuitively, when parameter λ is close to 0, problem 2.2 reduces to estimating one utility function, namely w_0, for all individuals, as if we consider all consumers to be one consumer: we force the v_t's to be zero. On the other hand, when λ is close to 1, problem 2.2 reduces to estimating one utility function, namely v_t, for each individual independently, just like in (Cui and Curry 2004; Evgeniou et al 2004): we force w_0 to be zero and let the v_t's be independent.

We note that we chose to model the population heterogeneity with parameter λ in the form of 1/(λT) and 1/(1−λ) so that the dual optimization problem of problem 2.2 has a simpler form w.r.t. λ, as we discuss below. As for the individual-specific method 2.1, parameters γ and λ can be chosen using a small validation set or cross-validation (see below).
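A corresponding sketch of the primal form of Problem 2.2, again using cvxpy as a stand-in solver (the paper itself works with the dual, Problem 2.4 below); the function name, data layout, and default parameter values are illustrative assumptions.

```python
import cvxpy as cp

def estimate_heterogeneous_utilities(X1, X2, gamma=10.0, lam=0.2):
    """Sketch of Problem 2.2.  X1, X2 have shape (T, n, m): for individual t and
    question i, X1[t, i] is the preferred product and X2[t, i] the rejected one.
    lam must be strictly between 0 and 1.  Returns the T vectors w_t = w0 + v_t."""
    T, n, m = X1.shape
    w0 = cp.Variable(m)                       # common part of all utilities
    V = cp.Variable((T, m))                   # individual deviations v_t
    theta = cp.Variable((T, n), nonneg=True)  # slack variables theta_ti

    constraints = []
    for t in range(T):
        wt = w0 + V[t]
        constraints.append(X1[t] @ wt >= X2[t] @ wt + 1 - theta[t])

    complexity = (cp.sum_squares(V) / (lam * T)
                  + cp.sum_squares(w0) / (1.0 - lam))
    objective = cp.Minimize(cp.sum(theta) + gamma * complexity)
    cp.Problem(objective, constraints).solve()
    return w0.value + V.value                 # array of shape (T, m): the w_t's
```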
It turns out that the optimal solution w_0* of problem 2.2 is a scaled average of the optimal individual w_t*. Formally (for simplicity all proofs are in the appendix):

Lemma 2.1 The optimal solution to the optimization method 2.2 satisfies the equation

$$w_0^{*} \;=\; (1-\lambda)\,\frac{1}{T}\sum_{t=1}^{T} w_t^{*}. \tag{5}$$

This lemma suggests that we can replace w_0 in problem 2.2 with an expression involving the w_t's and obtain an optimization problem which involves only the w_t's. This leads to the following lemma, whose proof is shown in the appendix.

Lemma 2.2 Problem 2.2 is equivalent to solving the following optimization problem:

Problem 2.3

$$\min_{w_t,\,\theta_{ti}}\ \sum_{t=1}^{T}\sum_{i=1}^{n} \theta_{ti} \;+\; \frac{\gamma}{T}\left[\sum_{t=1}^{T}\|w_t\|^2 \;+\; \frac{1-\lambda}{\lambda}\sum_{t=1}^{T}\Big\|w_t - \frac{1}{T}\sum_{s=1}^{T} w_s\Big\|^2\right] \tag{6}$$

subject, for all i ∈ {1, 2, . . . , n}, t ∈ {1, 2, . . . , T}, to the constraints that

$$w_t \cdot x^1_{ti} \;\ge\; w_t \cdot x^2_{ti} + 1 - \theta_{ti}, \qquad \theta_{ti} \ge 0.$$

This result provides a different interpretation of our original problem 2.2 in that it implements a trade-off between the complexity of the utility functions w_t and the closeness of each w_t to their average. This trade-off is controlled by λ. As before, if λ is small (close to 0) the utility functions are very related and all close to their average, whereas if λ is close to 1 the utility functions w_t are all computed independently. Optimization problem 2.3, like the individual-specific model 2.1 of (Cui and Curry 2004; Evgeniou et al 2004) based on SVM (Vapnik 1998), is a quadratic programming problem which has a unique optimal solution and can be solved as fast as 2.1.

The quadratic optimization problem that we solve in practice - the one we used for the experiments - is the dual optimization problem of 2.3 (Bertsimas and Tsitsiklis 1997). In particular, following the standard process for deriving the dual optimization problem (Bertsimas and Tsitsiklis 1997; Cortes and Vapnik 1995) (we show the proof in the appendix), the dual formulation of problem 2.3, as well as the optimal estimates of the w_t's, are as follows:

Theorem 2.1 The dual optimization problem of 2.2 is given by

Problem 2.4

$$\max_{\alpha_{ti}}\ \sum_{t=1}^{T}\sum_{i=1}^{n} \alpha_{ti} \;-\; \frac{1}{4\gamma}\sum_{i=1}^{n}\sum_{s=1}^{T}\sum_{j=1}^{n}\sum_{t=1}^{T} \alpha_{si}\,\alpha_{tj}\,(x^1_{si}-x^2_{si})\cdot(x^1_{tj}-x^2_{tj})\,\big((1-\lambda) + \lambda T\,\delta_{st}\big) \tag{7}$$

subject, for all i ∈ {1, 2, . . . , n} and t ∈ {1, 2, . . . , T}, to the constraints that

$$0 \le \alpha_{ti} \le 1,$$

where δ_st = 1 if s = t and 0 otherwise. In addition, if α*_ti is a solution to the above optimization problem, the optimal utility functions w_t* are given by

$$w_t^{*} \;=\; \frac{1}{2\gamma}\sum_{i=1}^{n}\sum_{s=1}^{T} \alpha^{*}_{si}\,(x^1_{si}-x^2_{si})\,\big((1-\lambda) + \lambda T\,\delta_{st}\big).$$

This is the optimization problem we solve in practice. Optimization problem 2.4 is a quadratic programming problem with all constraints being of the form 0 ≤ α_ti ≤ 1, which are box constraints, hence the problem is fast to solve (Cortes and Vapnik 1995).
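As an illustration of Theorem 2.1, the following sketch (our own, with hypothetical names) recovers all utility functions from a dual solution α, making explicit that each w_t mixes the individual's own data, weighted by (1 − λ) + λT, with everyone else's data, weighted by (1 − λ).

```python
import numpy as np

def utilities_from_dual(alpha, X1, X2, gamma, lam):
    """Sketch of the reconstruction in Theorem 2.1: given a solution alpha of the
    dual problem 2.4 (shape (T, n)), recover all w_t.  X1, X2 have shape (T, n, m)."""
    T, n, m = X1.shape
    Z = X1 - X2                                    # difference vectors x^1_si - x^2_si
    pooled = np.tensordot(alpha, Z, axes=([0, 1], [0, 1]))   # sum_{s,i} alpha_si z_si
    W = np.zeros((T, m))
    for t in range(T):
        own = alpha[t] @ Z[t]                      # sum_i alpha_ti z_ti
        # own data enters with weight (1 - lam) + lam * T, all other data with (1 - lam)
        W[t] = ((1.0 - lam) * pooled + lam * T * own) / (2.0 * gamma)
    return W
```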
Notice that all utility functions w_t are expressed using the same T × n parameters α_ti, for t ∈ {1, 2, . . . , T} and i ∈ {1, 2, . . . , n}. Moreover, all utility functions w_t are expressed as linear combinations of the questionnaire data from all individuals, (x^1_si − x^2_si), s = 1, . . . , T. However, for each individual t the corresponding utility function w_t is expressed using that particular individual's data (x^1_ti − x^2_ti) with weight ((1 − λ) + λT), and using the data from all other individuals with weight only (1 − λ). Obtaining this simple form w.r.t. parameter λ is the reason we used λ in the form of 1/(λT) and 1/(1 − λ) in problem 2.2. Finally, we note here that there have been variations of problem 2.1 within the SVM literature that are even more efficient computationally (e.g. based on linear instead of quadratic programming), see for example (Bennett, 1999). We believe that exploring this direction in the future for model 2.3 may lead to further significant computational improvements without significant sacrifices of statistical accuracy. We now build on the simple method 2.2 and the intuition above to develop other methods for aggregate conjoint estimation.

2.2.2 More Advanced Methods

Starting from problem 2.3, we can now develop other methods for modeling consumer heterogeneity. We discuss a particular method here, and then briefly mention other possibilities. Notice that in method 2.3 we eventually estimate the utility function of each individual consumer so that it is close to an average utility function of the population. This is also the intuition a priori for HB: in the case of HB all individual utility functions are assumed a priori to be samples from a Gaussian distribution, and the covariance matrix of this Gaussian controls how close the individual utility functions are to the mean utility function of the population. A key difference between problem 2.3 and HB is that in 2.3 we do consider the population mean but not the covariance of the individual estimates. We can therefore extend problem 2.3 to also consider this covariance. We therefore propose the following method:

Problem 2.5

$$\min_{w_t,\,\theta_{ti}}\ \sum_{t=1}^{T}\sum_{i=1}^{n} \theta_{ti} \;+\; \frac{\gamma}{T}\left[\sum_{t=1}^{T} w_t\cdot(A w_t) \;+\; \frac{1-\lambda}{\lambda}\sum_{t=1}^{T}\Big(w_t - \frac{1}{T}\sum_{s=1}^{T} w_s\Big)\cdot\Big(A\big(w_t - \frac{1}{T}\sum_{s=1}^{T} w_s\big)\Big)\right]$$

subject, for all i ∈ {1, 2, . . . , n}, t ∈ {1, 2, . . . , T}, to the constraints that

$$w_t \cdot x^1_{ti} \;\ge\; w_t \cdot x^2_{ti} + 1 - \theta_{ti}, \qquad \theta_{ti} \ge 0,$$

where we simply added matrix A to method 2.3, which plays the role of the covariance matrix. Notice that when matrix A is the identity matrix, method 2.5 reduces to the simple method 2.3. For a fixed matrix A, problem 2.5 is still a quadratic optimization one, hence there is a unique optimal solution and it is computationally fast to solve. However, if matrix A is also to be estimated, then the problem is no longer a quadratic optimization one. It can be shown (we do not show this here for simplicity) that 2.5 is still a convex problem, hence there is a unique optimal solution, and using convex optimization methods (for example, (Boyd and Vandenberghe 2003)) may lead to a fast solution of problem 2.5, but this is an open problem. We propose here a particular iterative heuristic for estimating matrix A and solving problem 2.5 which, however, does not necessarily lead to an optimal estimate of matrix A. We emphasize that developing better methods for solving problem 2.5 is an open research direction which can lead to better aggregate conjoint estimation methods along the lines of the approach we propose here. In particular we propose the following iterative (heuristic) method:

PCA-based Heterogeneity Method:

Choose parameters γ > 0, λ ∈ (0, 1), and p ∈ (0, 100), and initialize matrix A to be the identity matrix. Then:

1. Step 1: Solve problem 2.5 with matrix A being the current estimate.

2. Step 2: Do a principal component analysis (PCA) (Bishop, 1996) of the estimated w_t's. Keep only the principal components that capture p percent of the total energy, that is, the sum of the largest eigenvalues kept does not exceed p percent of the sum of all the eigenvalues of the covariance matrix of the estimated w_t's in this iteration. If all eigenvalues need to be used for this purpose, then stop. Otherwise go to step 3.

3. Step 3: Compute the covariance matrix of the principal components selected, and use this as the new matrix A. Go to step 1.

As we remove some of the eigenvectors of the covariance matrix of the w_t's at each iteration (the w_t's lie in a lower dimensional space after each iteration), only a few iterations are done; for the experiments below this was 4 or 5 iterations. A sketch of this iterative loop is given below. Notice that variations of this method can be developed. For example, we may do only one iteration where, instead of considering only a few principal components of the estimated w_t's, we use the actual covariance of the estimated w_t's (no PCA performed) for the matrix A in 2.5, re-optimizing 2.5 using this covariance matrix. This approach led to poor performance in the experiments: removing the least significant eigenvectors of the covariance matrix of the estimated w_t's may be advantageous because this operation removes the noise, as typically argued for PCA (Bishop 1996). Another variation is to simply keep a few of the principal components and re-estimate the w_t's doing only one iteration. Here we chose to eliminate iteratively some of the principal components (the least significant ones at each iteration) so that more accurate estimates of the w_t's, and therefore of the principal components, can be achieved at each iteration. Moreover, at each iteration we can measure the performance of the estimated w_t's, for example using a validation set as we do for selecting parameters γ, λ, p (see below), and choose the w_t's estimated at the iteration with the best performance.
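A sketch of this loop follows. The step of "computing the covariance matrix of the principal components selected" is interpreted here as reconstructing the covariance from the retained eigenvectors and eigenvalues; this reading, the stopping rule, and the helper solve_problem_2_5 (assumed to return the w_t's for a fixed A) are our assumptions.

```python
import numpy as np

def pca_heterogeneity_method(X1, X2, gamma, lam, p, solve_problem_2_5, max_iter=10):
    """Sketch of the PCA-based heuristic.  `solve_problem_2_5(X1, X2, gamma, lam, A)`
    is assumed to return the (T, m) array of estimated w_t's for a fixed matrix A."""
    T, n, m = X1.shape
    A = np.eye(m)                                   # initialize A to the identity
    W = None
    for _ in range(max_iter):
        W = solve_problem_2_5(X1, X2, gamma, lam, A)          # Step 1
        C = np.cov(W, rowvar=False)                           # covariance of the w_t's
        eigvals, eigvecs = np.linalg.eigh(C)                  # Step 2: PCA
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        ratios = np.cumsum(eigvals) / eigvals.sum()
        # smallest number of components capturing p percent of the total energy
        k = int(np.searchsorted(ratios, p / 100.0)) + 1
        if k >= m:                                            # all components needed: stop
            break
        V = eigvecs[:, :k]                                    # Step 3: covariance of the
        A = V @ np.diag(eigvals[:k]) @ V.T                    # selected components as new A
    return W
```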

Choosing Parameters γ, λ, and p:


Parameters γ, λ, and p can be chosen using a validation set that can consist, for example, of a few (e.g. one) questions per individual from all or only a few of the individuals. Since the data are typically from many people, these parameters can practically be chosen using only a few validation data, proportionally to all the data available, without losing many data for parameter selection. In the experiments below we used one question from each individual (50, 100, 200, or 500 individuals) to choose these parameters. Parameter γ was chosen among the values 0.1, 1 and 10, λ among the values (0, 0.2, 0.4, 0.6, 0.8) (1 corresponding to individual-specific estimations), and p among the values 90% and 95% (for a total of 30 different triplets of parameters). After selecting the three parameters we re-estimated the w_t's using all the data, including the one question per individual we used for parameter selection. Notice that, theoretically, the optimal parameters chosen when estimating the functions w_t without using the question(s) left out as a validation set may not be the same as the optimal parameters when these questions are included. However, we expect the effects of this to be small as we typically have data from many individuals; therefore the proportion of data left out for parameter (γ, λ, p) selection is small. Generally, studying theoretically and experimentally the effects of these parameters on the performance of the proposed methods can shed light on how to choose them. This is beyond the scope of this work and can be a direction for future research.
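The following sketch illustrates this validation-based selection as a plain grid search over the 30 parameter triplets; the estimator interface and the hit-error criterion on held-out questions are our assumptions (the paper does not prescribe a specific implementation).

```python
import numpy as np
from itertools import product

def select_parameters(X1_train, X2_train, X1_val, X2_val, estimator,
                      gammas=(0.1, 1.0, 10.0),
                      lams=(0.0, 0.2, 0.4, 0.6, 0.8),
                      ps=(90.0, 95.0)):
    """Sketch of validation-based selection of (gamma, lambda, p).  `estimator`
    is assumed to return the (T, m) array of w_t's for the given parameters
    (limiting values of lambda are assumed to be handled by the estimator)."""
    def hit_error(W, X1, X2):
        # fraction of held-out paired comparisons where the preferred product
        # does not receive the higher estimated utility
        scores = np.einsum('tm,tim->ti', W, X1 - X2)
        return float(np.mean(scores <= 0.0))

    best = None
    for gamma, lam, p in product(gammas, lams, ps):
        W = estimator(X1_train, X2_train, gamma, lam, p)
        err = hit_error(W, X1_val, X2_val)
        if best is None or err < best[0]:
            best = (err, (gamma, lam, p))
    return best[1]
```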

Other Possibilities


Other methods can be developed starting from 2.3. For example, one can model the population as a set of clusters of individuals and modify optimization problem 2.3 so that what is penalized is the variation of the utility functions only within clusters. For example, if we group the population in two clusters - for simplicity let us say one cluster is people 1 to T_1 and the other is T_1 + 1 to T, for some T_1 < T - then we can modify 2.3 as follows:

Problem 2.6

$$\min_{w_t,\,\theta_{ti}}\ \sum_{t=1}^{T}\sum_{i=1}^{n} \theta_{ti} \;+\; \frac{\gamma}{T}\left[\sum_{t=1}^{T}\|w_t\|^2 \;+\; \frac{1-\lambda}{\lambda}\left(\sum_{t=1}^{T_1}\Big\|w_t - \frac{1}{T_1}\sum_{s=1}^{T_1} w_s\Big\|^2 \;+\; \sum_{t=T_1+1}^{T}\Big\|w_t - \frac{1}{T-T_1}\sum_{s=T_1+1}^{T} w_s\Big\|^2\right)\right]$$

subject, for all i ∈ {1, 2, . . . , n}, t ∈ {1, 2, . . . , T}, to the constraints that

$$w_t \cdot x^1_{ti} \;\ge\; w_t \cdot x^2_{ti} + 1 - \theta_{ti}, \qquad \theta_{ti} \ge 0.$$

Notice that this is equivalent to solving problem 2.3 independently for each of the two clusters. However, it may be possible to iteratively both define the clusters and solve problem 2.6. Moreover, there are other directions to explore, such as having soft clustering of the individuals (Bishop 1996) and adding the covariances of the clusters in the optimization problem as we did in 2.5. We believe that future work in these directions may lead to better methods within the optimization framework we developed here, but this is an open research problem.
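For illustration, the equivalence noted above means problem 2.6 can be handled by calling a problem 2.3 solver once per cluster; the helper name and interface below are hypothetical.

```python
import numpy as np

def estimate_by_cluster(X1, X2, cluster_labels, gamma, lam, solve_problem_2_3):
    """Sketch of Problem 2.6: solve problem 2.3 separately within each cluster.
    `cluster_labels[t]` gives the cluster of individual t; `solve_problem_2_3`
    is assumed to return the w_t's (one row per individual) for the data passed in."""
    T, n, m = X1.shape
    W = np.zeros((T, m))
    for label in np.unique(cluster_labels):
        members = np.where(cluster_labels == label)[0]
        W[members] = solve_problem_2_3(X1[members], X2[members], gamma, lam)
    return W
```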

Experiments

We ran two types of experiments. We first ran Monte Carlo simulations (Andrews et al 2002; Carmone and Jain 1978; Cui and Curry 2004; Evgeniou et al 2004; Toubia et al 2004) in order to test the proposed approach under varying conditions of population heterogeneity and response error, two basic parameters capturing the conditions of a real situation. In particular, in this experiment we tested how the proposed approach works when there is high and low response error, as well as when there is high or low population heterogeneity. For simplicity, for this experiment we compared only the simple method 2.2 with HB and the individual-specific SVM method 2.1 of (Cui and Curry 2004; Evgeniou et al., 2004). For the second experiment we used real data provided to us by Research International.¹ For this experiment we used both the simple method 2.2 and the PCA-based method 2.5 proposed above.

¹ The data are proprietary and are available from the authors and Research International upon request.


3.1 Simulation Experiments

We followed the basic simulation design used by other researchers in the past. In particular, we simply replicated the experimental setup of Toubia et al (2004) and Evgeniou et al (2004), which in turn was based on the simulation studies of Arora and Huber (2001). For completeness we briefly describe that setup here. We generated data describing products with 4 attributes, each attribute having 4 levels. Each question consisted of 4 products to choose from. The question design we used was either orthogonal or randomly generated. For the orthogonal design to be well-defined we used 16 questions per individual. We used the same orthogonal design as in (Evgeniou et al 2004; Toubia et al 2004). We simulated 100 individuals. The partworths for each individual were generated randomly from a Gaussian with mean (−β, −β/3, β/3, β) for each attribute. Parameter β is the magnitude that controls the noise (response accuracy). As in (Toubia et al 2004) we used β = 2 for high magnitude (low noise) and β = 0.5 for low magnitude (high noise). We modeled heterogeneity among the 100 individuals by varying the variance σ²_β of the Gaussian from which the partworths were generated. The covariance matrix of the Gaussian was a diagonal matrix with all diagonal elements equal to σ²_β. We modeled high heterogeneity using σ²_β = 2β, and low heterogeneity using σ²_β = 0.5β, as in (Toubia et al 2004). As discussed in (Arora and Huber, 2001) and (Toubia et al, 2004), these parameters are chosen so that the range of average partworths and heterogeneity found in practice is covered.
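A sketch of this data-generating process is below. The partworth distribution follows the description above; the choice-generation step (logit response with extreme-value noise, in the spirit of Arora and Huber 2001) is our assumption since the error model is not restated here, and all function names are ours.

```python
import numpy as np

def simulate_partworths(n_individuals=100, beta=2.0, het="high", seed=0):
    """Sketch of the simulated partworths: 4 attributes with 4 levels each, attribute
    means (-beta, -beta/3, beta/3, beta), diagonal covariance sigma^2 * I with
    sigma^2 = 2*beta (high heterogeneity) or 0.5*beta (low heterogeneity)."""
    rng = np.random.default_rng(seed)
    mean = np.tile([-beta, -beta / 3.0, beta / 3.0, beta], 4)   # 16 partworths
    sigma2 = 2.0 * beta if het == "high" else 0.5 * beta
    return rng.normal(loc=mean, scale=np.sqrt(sigma2), size=(n_individuals, 16))

def simulate_choice(w, question, rng):
    """Assumed logit response model: choose the product with the highest utility
    plus extreme-value (Gumbel) noise.  `question` is a (4, 16) binary design
    matrix, one row per product profile."""
    utilities = question @ w + rng.gumbel(size=question.shape[0])
    return int(np.argmax(utilities))
```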


Therefore in this experiment we tested all 2 × 2 scenarios in terms of: a) the amount of response error (low versus high), and b) the population heterogeneity (low versus high). All experiments were repeated five times, so a total of 500 individual utility functions w_t were estimated, and the average performance is reported. We used two measures of performance: a) the Root Mean Square Error (RMSE) of the estimated utility functions relative to the true (supposedly unknown) utility functions, as in (Arora and Huber 2001; Evgeniou et al 2004; Toubia et al 2004). Both estimated and true partworths were always normalized for comparability. In particular we followed (Toubia et al 2004): each attribute is shifted such that the sum of its levels is 0, and the utility vector is then normalized such that the sum of the absolute values for each attribute is 1; b) the average hit errors of the estimated functions on a test set of 16 questions per individual. We note that the conclusions are qualitatively the same for both error measures, as also observed by (Evgeniou et al 2004; Toubia et al 2004).
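The following sketch shows one way to implement the normalization and the two performance measures described above; the attribute layout (4 attributes × 4 levels, flattened) and function names are our assumptions.

```python
import numpy as np

def normalize_partworths(w, n_attributes=4, n_levels=4):
    """Center each attribute's levels to sum to zero, then scale so the absolute
    values within each attribute sum to 1 (the normalization described above)."""
    w = np.asarray(w, dtype=float).reshape(n_attributes, n_levels)
    w = w - w.mean(axis=1, keepdims=True)
    w = w / np.abs(w).sum(axis=1, keepdims=True)
    return w.ravel()

def rmse(w_est, w_true, **kw):
    a, b = normalize_partworths(w_est, **kw), normalize_partworths(w_true, **kw)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def hit_error(w_est, questions, chosen):
    """Fraction of hold-out questions whose observed choice is not the product
    with the highest estimated utility.  `questions` is (n_q, n_alt, m)."""
    predicted = np.argmax(questions @ w_est, axis=1)
    return float(np.mean(predicted != np.asarray(chosen)))
```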

3.2 Experimental Results

We show the results in Table 1. We denote by "1 SVM" (last column) the individual-specific method 2.1 of (Cui and Curry 2004; Evgeniou et al 2004) when we assume that all data come from the same individual, corresponding to using parameter λ = 0 in 2.2. We denote by "T SVMs" the individual-specific method 2.1 of (Cui and Curry 2004; Evgeniou et al 2004) when we estimate the utility of each individual independently, corresponding to using parameter λ = 1 in 2.2. The column "Hetero SVM" is the proposed method 2.2. We used one of the 16 estimation questions from each individual to select parameters γ and λ, but then included this question for the final estimates, as discussed above. Parameter γ was selected to be 10 (among 0.1, 1, and 10), and λ to be 0.2 (among 0, 0.2, 0.4, 0.6, and 0.8) in all cases.

We also show the effects of parameter λ in Figure 1 for the random questionnaire in all four scenarios: high/low noise and high/low heterogeneity. In each case we show the RMSE of our method 2.2 as parameter λ changes from 0 (which corresponds to using the one-SVM-based method 2.1 of (Cui and Curry 2004; Evgeniou et al., 2004) for all individuals together) to 1 (which corresponds to using the individual-specific SVM-based method 2.1 of (Cui and Curry 2004; Evgeniou et al., 2004) for each individual independently). We also show the performance of HB as a constant line, as this does not change when we change parameter λ.

From the results we observe the following:

- For the random design the proposed method is never worse than HB. Moreover, the proposed method is significantly better than HB in the case of low magnitude and high heterogeneity.

- The proposed method is weak when an orthogonal design is used. In this case HB always significantly outperforms the proposed method. This issue is also discussed in (Evgeniou et al 2004) as a limitation of the individual-specific method 2.1, as also shown by the results here.

- The advantage of the proposed method relative to the individual-specific method of (Cui and Curry 2004; Evgeniou et al 2004) is larger when heterogeneity is low. This is also the case when the advantage of HB relative to the method of (Cui and Curry 2004; Evgeniou et al 2004) is larger.

- As the parameter λ of the proposed method increases, the performance first improves and then decreases and becomes the same as that of the individual-specific method 2.1 of (Cui and Curry 2004; Evgeniou et al 2004). Moreover, the proposed method 2.2 is not very sensitive to parameter λ: as we see from the plots, performance is almost constant for quite a large range of λ. Therefore in practice one may not need to tune this parameter accurately.

Finally, it is important to note that the proposed method is computationally very efficient. Solving 2.2 requires the optimization of a quadratic programming problem, which can be done fast (Cortes and Vapnik 1995). For example, for this experiment optimization problem 2.2 took only a few seconds to solve (for all 100 individuals) while HB would take hours. We note, however, that we implemented our method in C while we used an existing HB implementation in Matlab.² Clearly one needs to compare equally efficient (and ideally commercial quality) implementations of both HB and our method 2.2 on a number of datasets to draw final conclusions about this issue, which is beyond the scope of the work here. We now turn to the second experiment we did with real data.
² We used an HB implementation developed and made available online by Toubia et al (2004).


3.3 Real Data Experiment

We tested the proposed methods using a real data set capturing choices among products made by many individuals. We have data from 200 individuals. For each individual we have answers to 20 questions, where each question consisted of choosing among 7 products. This leads to 120 paired comparisons per individual (for each question we consider the comparisons of the chosen product with each of the other 6 non-chosen ones, to create 6 pairs (x^1_ti, x^2_ti) as we noted in section 2). Therefore we have a total of 24000 paired comparison data. The products have three attributes (such as color, price, size, etc.): two of the attributes have seven levels each, while the third one has six levels. So the data describing the products (the data points x_ti above) were 20-dimensional binary vectors. For this experiment we also tested how our methods 2.2 and 2.5 compare to HB and to method 2.1 of (Cui and Curry 2004; Evgeniou et al 2004) as the number of data per individual and/or the number of individuals changes. For this purpose, we ran experiments with varying numbers of questions per individual and numbers of individuals in the population. In particular we considered 50, 100, and 200 individuals: we split the 200 people into 4 groups of 50 (or 2 groups of 100, or one group of 200), and we took the average performance among the 4 groups (or the 2 groups, or the 1 group). For each individual we also split the 120 paired choice questions into 20, 30, 60, or 90 estimation data (corresponding, roughly, to 3, 5, 10, 15 questions of choosing among 7 products for each individual), and 100, 90, 60, or 30 test data respectively. As we do not have the actual utility functions in this case, we measured performance using the hit rate of the estimated utility functions on the test data (100, 90, 60, or 30 paired comparisons per individual, respectively).
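As an illustration of this preprocessing, the sketch below turns one observed choice among 7 products into the 6 paired comparisons used for estimation and computes the hit-rate error on held-out pairs; names and data layout are our assumptions.

```python
import numpy as np

def choice_to_pairs(products, chosen_index):
    """A choice among several products becomes paired comparisons (chosen product
    vs. each non-chosen product), as described above.  `products` is a (7, m)
    array of binary attribute-level vectors; returns two (6, m) arrays."""
    preferred, rejected = [], []
    for j in range(products.shape[0]):
        if j != chosen_index:
            preferred.append(products[chosen_index])
            rejected.append(products[j])
    return np.array(preferred), np.array(rejected)

def pair_hit_error(w, X1_test, X2_test):
    # share of held-out paired comparisons where the chosen product does not
    # receive the higher estimated utility
    return float(np.mean((X1_test - X2_test) @ w <= 0.0))
```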

All parameters of methods 2.2 and 2.5 were chosen using one estimation question per individual, which was then incorporated back for the final estimation of the utility functions, as discussed above. We selected γ = 10 (among 0.1, 1 and 10) and λ = 0.6 (among 0, 0.2, 0.4, 0.6, 0.8) for this dataset. Parameter p for the PCA-based method proposed above was selected to be 90% (chosen among 90% and 95%). We show all the results in Table 2. Notice that the performance of the individual-specific method 2.1 does not change as the population size increases, as expected. We also note that when we use one SVM 2.1 for all the individuals, treating the data as if they come from the same person, we get very poor performance: between 38 and 42 percent test error for the (questions × people) cases considered.

From the results we observe the following:

- The proposed approach almost always outperforms HB: when there are few questions per individual the simple method 2.2 is the best one, while with many questions per individual the PCA-based method 2.5 is the best one. HB is the best method only in one case: in this experiment, when there are 200 individuals and 30 question data per person.

- When there are few questions per individual (3 questions leading to 20 data points per person in this case) both proposed methods 2.2 and 2.5 significantly outperform both HB and the individual-specific SVM method of (Cui and Curry 2004; Evgeniou et al 2004).

- As the number of question data per person increases, HB outperforms the simple method 2.2. However, the PCA-based method 2.5 significantly outperforms all other methods when the number of data per person becomes large (60 and 90 in this case).

- When there are few data per person, the PCA-based method performs poorly. It may be that in this case the PCA method overfits the data.

- If we choose the best performance among method 2.2 (corresponding to zero iterations for the PCA-based method) and the PCA-based method (where we choose the best performance - using the validation data - after the best iteration, but not including the zeroth iteration corresponding to method 2.2), then the combined proposed method (the best among columns 5 and 6 in Table 2) is always the best, except for the case of 200 individuals and 30 question data per person, when HB is the best.

In summary, the proposed method is very promising either when there are very few data per individual, in which case it is better to use the simple method 2.2, or when there are a lot of data per individual, in which case it is better to use the proposed PCA-based method 2.5. This indicates that a promising direction of research is to develop better methods for estimating matrix A in 2.5 that work well both for few and many data. As we discussed above, this may be possible for example using convex optimization methods, and we leave this for future work.

Conclusions

We developed a novel optimization-based approach for handling consumer heterogeneity in conjoint estimation. Based on this approach we developed specific methods for aggregate conjoint estimation that handle heterogeneity, and tested two of them with simulated and real data. The methods developed are based on recently proposed individual-specific optimization-based conjoint estimation methods (Cui and Curry 2004; Evgeniou et al 2004), but other optimization-based conjoint estimation methods, such as those of (Srinivasan and Shocker 1973; Srinivasan 1998; Toubia et al 2003; Toubia et al 2004), can be extended using the ideas presented here in order to handle consumer heterogeneity for aggregate conjoint estimation. Experiments showed that the proposed approach is very promising: it outperforms HB in many cases, although the latter is better when an orthogonal design is used. The limitation of the proposed methods in the case of the orthogonal design indicates that there may be a lot to gain if the proposed approach is coupled with a questionnaire design for a population of consumers developed along the lines of the questionnaire design method of Toubia et al (2004).

A number of extensions can be developed starting from the methods outlined in this paper. Once consumer heterogeneity is handled using an optimization approach like the one proposed here, changes in the form of the optimization cost function can lead to other methods for modeling consumer heterogeneity for aggregate conjoint estimation. We discussed here such extensions where, for example, the proposed approach can be coupled with clustering methods in an iterative manner. Exploring these forms and studying their computational and statistical characteristics can lead to further improvements. In this paper we only developed the foundations for modeling consumer heterogeneity using optimization methods, which we plan to extend in the future.

References
[1] Allenby, Greg M., Neeraj Arora, and James L. Ginter (1998), On the Heterogeneity of Demand, Journal of Marketing Research, 35 (August), p. 384-89.

[2] Allenby, Greg M. and Peter E. Rossi (1999), Marketing Models of Consumer Heterogeneity, Journal of Econometrics, 89 (March/April), p. 57-78.

[3] Andrews, Rick, Asim Ansari, and Imran Currim (2002), Hierarchical Bayes versus finite mixture conjoint analysis models: a comparison of fit, prediction, and partworth recovery, Journal of Marketing Research, 39 (February), p. 87-98.

[4] Arora, Neeraj and Joel Huber (2001), Improving parameter estimates and model prediction by aggregate customization in choice experiments, Journal of Consumer Research, 28 (September).

[5] Arora, Neeraj, Greg Allenby, and James Ginter (1998), A Hierarchical Bayes Model of Primary and Secondary Demand, Marketing Science, 17, 1, p. 29-44.

[6] Ben-Akiva, Moshe and Steven R. Lerman (1985), Discrete Choice Analysis: Theory and Application to Travel Demand, MIT Press, Cambridge, MA.

[7] Bennett, Kristin P. (1999), On Mathematical Programming Methods and Support Vector Machines, in Advances in Kernel Methods - Support Vector Learning (Schoelkopf et al., Eds.), MIT Press, Cambridge, MA, p. 307-326.

[8] Bertsimas, Dimitris and John Tsitsiklis (1997), Introduction to Linear Optimization, Athena Scientific, Belmont, MA.

[9] Bishop, Christopher M. (1996), Neural Networks for Pattern Recognition, Oxford University Press.

[10] Boyd, Stephen and Lieven Vandenberghe (2003), Convex Optimization, Cambridge University Press.

[11] Carmone, Frank and Arun Jain (1978), Robustness of Conjoint Analysis: Some Monte Carlo Results, Journal of Marketing Research, 15, p. 300-303.

[12] Carroll, Douglas and Paul Green (1995), Psychometric Methods in Marketing Research: Part I, Conjoint Analysis, Journal of Marketing Research, 32 (November), p. 385-391.

[13] Cooley, R., J. Srivastava, and B. Mobasher (1997), Web Mining: Information and Pattern Discovery on the World Wide Web, in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.

[14] Cortes, Corinna and Vladimir Vapnik (1995), Support Vector Networks, Machine Learning, 20, p. 273-297.

[15] Cui, Dapeng and David Curry (2004), Predicting Consumer Choice Using Support Vector Machines with Benchmark Comparisons to Multinomial Logit, Marketing Science, (forthcoming).

[16] DeSarbo, Wayne and Asim Ansari (1997), Representing Heterogeneity in Consumer Response Models, Marketing Letters, 8:3, p. 335-348.

[17] Evgeniou, Theodoros, Constantinos Boussios, and Giorgos Zacharia (2004), Generalized Robust Conjoint Estimation, Marketing Science, (forthcoming).

[18] Green, Paul and V. Srinivasan (1990), Conjoint Analysis in Marketing: New Developments with Implications for Research and Practice, Journal of Marketing, 54, 4, p. 3-19.

[19] Kohavi, Ron (2001), Mining E-Commerce Data: The Good, the Bad, and the Ugly, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 8-13.

[20] Kuhfeld, Warren F., Randall D. Tobias, and Mark Garratt (1994), Efficient Experimental Design with Marketing Research Applications, Journal of Marketing Research, 31, 4, p. 545-557.

[21] Lenk, Peter J., Wayne S. DeSarbo, Paul E. Green, and Martin R. Young (1996), Hierarchical Bayes Conjoint Analysis: Recovery of Partworth Heterogeneity from Reduced Experimental Designs, Marketing Science, 15, p. 173-91.

[22] Louviere, Jordan J., David A. Hensher, and Joffre D. Swait (2000), Stated Choice Methods: Analysis and Applications, New York, NY: Cambridge University Press.

[23] McFadden, Daniel (1974), Conditional Logit Analysis of Qualitative Choice Behavior, in Frontiers in Econometrics, New York: Academic Press, p. 105-142.

[24] McFadden, Daniel (1986), The Choice Theory Approach to Marketing Research, Marketing Science, 5 (4), p. 275-297.

[25] Sawtooth Software, Inc., HB-Reg: Hierarchical Bayes Regression, URL: http://www.sawtoothsoftware.com/hbreg.shtml

[26] Segal, Madhav N. (1982), Reliability of Conjoint Analysis: Contrasting Data Collection Procedures, Journal of Marketing Research, 19, p. 139-143.

[27] Srinivasan, V. (1998), A Strict Paired Comparison Linear Programming Approach to Nonmetric Conjoint Analysis, in Operations Research: Methods, Models and Applications, Jay E. Aronson and Stanley Zionts (eds.), Westport, CT: Quorum Books, p. 97-111.

[28] Srinivasan, V. and Allan D. Shocker (1973), Linear Programming Techniques for Multidimensional Analysis of Preferences, Psychometrika, 38, 3, p. 337-369.

[29] Toubia, Olivier, Duncan I. Simester, John R. Hauser, and Ely Dahan (2003), Fast Polyhedral Adaptive Conjoint Estimation, Marketing Science, 22 (3), p. 273-303.

[30] Toubia, Olivier, John R. Hauser, and Duncan I. Simester (2004), Polyhedral methods for adaptive choice-based conjoint analysis, Journal of Marketing Research, (forthcoming).

[31] Vapnik, Vladimir (1998), Statistical Learning Theory, New York: Wiley.

[32] Wittink, Dick R. and Philippe Cattin (1989), Commercial Use of Conjoint Analysis: An Update, Journal of Marketing, 53, 3, p. 91-96.

Mag  Het  Design   HB             Hetero SVM     T SVMs         1 SVM
L    H    R        0.81 / 24.65   0.79 / 24.24   0.82 / 24.98   1.31 / 38.91
L    H    O        0.79 / 23.98   0.86 / 26.17   0.84 / 25.37   1.24 / 37.32
L    L    R        0.90 / 31.49   0.90 / 31.48   1.01 / 33.13   0.96 / 33.02
L    L    O        0.79 / 30.07   1.00 / 33.82   1.02 / 33.94   1.01 / 33.92
H    H    R        0.59 / 13.97   0.58 / 14.02   0.66 / 15.57   0.98 / 23.93
H    H    O        0.68 / 15.38   0.79 / 18.60   0.91 / 21.13   1.12 / 24.13
H    L    R        0.47 / 13.05   0.46 / 13.28   0.66 / 16.98   0.57 / 15.81
H    L    O        0.54 / 13.83   0.85 / 20.62   1.11 / 27.57   1.02 / 24.36

Table 1: Comparison of methods using RMSE and hit rate errors (the first and second number in each cell, respectively). Mag is the magnitude (L = low, i.e. high noise; H = high, i.e. low noise), Het is the population heterogeneity (L = low, H = high), and Design is the questionnaire design (R = random, O = orthogonal). There are 500 estimated individual utility functions (100 individuals, 5 replications). HB is Hierarchical Bayes. Hetero SVM is the proposed method 2.2. We indicate with 1 SVM the use of the method of (Cui and Curry 2004; Evgeniou et al 2004) when we treat all people as one individual, and with T SVMs the use of the method of (Cui and Curry 2004; Evgeniou et al 2004) for each individual independently, as done by (Cui and Curry 2004; Evgeniou et al 2004). Bold indicates best or not significantly different from best at p < 0.05 among all four methods.

Proof of Lemma 2.1

Lemma 2.1 The optimal solution to the optimization method 2.2 satisfies the equation

$$w_0^{*} \;=\; (1-\lambda)\,\frac{1}{T}\sum_{t=1}^{T} w_t^{*}. \tag{8}$$


People   Q    1 SVM   T SVMs   Id het SVM   PCA het SVM   HB
50       20   41.97   29.86    28.72        29.16         31.92
100      20   41.41   29.86    28.30        29.26         31.59
200      20   40.08   29.86    27.79        28.53         28.49
50       30   40.73   26.84    25.53        25.65         26.21
100      30   40.66   26.84    25.25        24.79         24.77
200      30   39.43   26.84    25.16        24.13         23.55
50       60   40.33   22.84    22.06        21.08         21.76
100      60   40.02   22.84    22.27        20.79         21.26
200      60   39.74   22.84    21.86        20.00         20.66
50       90   38.51   19.84    19.68        18.45         19.21
100      90   38.97   19.84    19.34        18.08         18.64
200      90   38.77   19.84    19.27        17.53         18.27

Table 2: Comparison of methods as the number of questions per person and the size of the population change. We show the hit rate errors (percentage errors) on test data. The first column is the number of people in the population, the second column (Q) is the number of question data per person used. 1 SVM stands for estimating one SVM with all the data from all the individuals together, T SVMs stands for modeling each individual independently using an SVM as in (Cui and Curry 2004; Evgeniou et al 2004), Id het SVM stands for our simple method 2.2, PCA het SVM stands for our method 2.5 with matrix A estimated iteratively using the PCA approach, and HB is Hierarchical Bayes. Best performance is in bold.


Proof: This result follows by inspecting the Lagrangian function for problem 2.2, which is given by the formula

$$L(w_0, v_t, \theta_{ti}, \alpha_{ti}, \eta_{ti}) \;=\; \sum_{t=1}^{T}\sum_{i=1}^{n}\theta_{ti} \;+\; \gamma\left[\frac{1}{\lambda T}\sum_{t=1}^{T}\|v_t\|^2 + \frac{1}{1-\lambda}\|w_0\|^2\right] \;-\; \sum_{t=1}^{T}\sum_{i=1}^{n}\alpha_{ti}\big((w_0+v_t)\cdot(x^1_{ti}-x^2_{ti}) - 1 + \theta_{ti}\big) \;-\; \sum_{t=1}^{T}\sum_{i=1}^{n}\eta_{ti}\,\theta_{ti} \tag{9}$$

where α_ti and η_ti are nonnegative Lagrange multipliers. Setting the derivative of L with respect to w_0 to zero gives the equation

$$w_0 \;=\; \frac{1-\lambda}{2\gamma}\sum_{t=1}^{T}\sum_{i=1}^{n}\alpha_{ti}\,(x^1_{ti}-x^2_{ti}). \tag{10}$$

The same operation for v_t gives, for every t ∈ {1, . . . , T}, the equation

$$v_t \;=\; \frac{\lambda T}{2\gamma}\sum_{i=1}^{n}\alpha_{ti}\,(x^1_{ti}-x^2_{ti}). \tag{11}$$

By combining these equations we obtain that

$$w_0 \;=\; \frac{1-\lambda}{\lambda T}\sum_{t=1}^{T} v_t.$$

The result now follows from this equation and equation (2).


Proof of Lemma 2.2

Lemma 2.2 Problem 2.2 is equivalent to solving the following optimization problem:

Problem 2.3

$$\min_{w_t,\,\theta_{ti}}\ \sum_{t=1}^{T}\sum_{i=1}^{n} \theta_{ti} \;+\; \frac{\gamma}{T}\left[\sum_{t=1}^{T}\|w_t\|^2 \;+\; \frac{1-\lambda}{\lambda}\sum_{t=1}^{T}\Big\|w_t - \frac{1}{T}\sum_{s=1}^{T} w_s\Big\|^2\right] \tag{12}$$

subject, for all i ∈ {1, 2, . . . , n}, t ∈ {1, 2, . . . , T}, to the constraints that

$$w_t \cdot x^1_{ti} \;\ge\; w_t \cdot x^2_{ti} + 1 - \theta_{ti}, \qquad \theta_{ti} \ge 0.$$

Proof: Let us define the quantity

$$J(v_t, w_0) \;=\; \frac{1}{\lambda T}\sum_{t=1}^{T}\|v_t\|^2 \;+\; \frac{1}{1-\lambda}\|w_0\|^2 .$$

Using equations (2) and (8), that is v_t = w_t − w_0 and w_0 = (1−λ)(1/T)Σ_s w_s, and writing w̄ = (1/T)Σ_{s=1}^T w_s, we rewrite the second part of the objective function in equation (3) as

$$J(v_t, w_0) \;=\; \frac{1}{\lambda T}\sum_{t=1}^{T}\big\|w_t - (1-\lambda)\bar{w}\big\|^2 \;+\; (1-\lambda)\,\|\bar{w}\|^2 . \tag{13}$$

On the other hand, we note that, since Σ_{t=1}^T (w_t − w̄) = 0,

$$\sum_{t=1}^{T}\big\|w_t - (1-\lambda)\bar{w}\big\|^2 \;=\; \sum_{t=1}^{T}\|w_t - \bar{w}\|^2 \;+\; \lambda^2 T\,\|\bar{w}\|^2 , \tag{14}$$

and, so, substituting equation (14) in the right hand side of equation (13), that

$$J(v_t, w_0) \;=\; \frac{1}{\lambda T}\sum_{t=1}^{T}\|w_t - \bar{w}\|^2 \;+\; \|\bar{w}\|^2 .$$

Finally, using the identity ‖w̄‖² = (1/T)Σ_{t=1}^T ‖w_t‖² − (1/T)Σ_{t=1}^T ‖w_t − w̄‖², we obtain

$$J(v_t, w_0) \;=\; \frac{1}{T}\sum_{t=1}^{T}\|w_t\|^2 \;+\; \frac{1-\lambda}{\lambda T}\sum_{t=1}^{T}\|w_t - \bar{w}\|^2 ,$$

from which the result follows.

Proof of Theorem 2.1

Theorem 2.1 The dual optimization problem of 2.2 is given by

Problem 2.4

$$\max_{\alpha_{ti}}\ \sum_{t=1}^{T}\sum_{i=1}^{n} \alpha_{ti} \;-\; \frac{1}{4\gamma}\sum_{i=1}^{n}\sum_{s=1}^{T}\sum_{j=1}^{n}\sum_{t=1}^{T} \alpha_{si}\,\alpha_{tj}\,(x^1_{si}-x^2_{si})\cdot(x^1_{tj}-x^2_{tj})\,\big((1-\lambda) + \lambda T\,\delta_{st}\big) \tag{15}$$

subject, for all i ∈ {1, 2, . . . , n} and t ∈ {1, 2, . . . , T}, to the constraints that

$$0 \le \alpha_{ti} \le 1, \tag{16}$$

where δ_st = 1 if s = t and 0 otherwise. In addition, if α*_ti is a solution to the above optimization problem, the optimal utility functions w_t* are given by

$$w_t^{*} \;=\; \frac{1}{2\gamma}\sum_{i=1}^{n}\sum_{s=1}^{T} \alpha^{*}_{si}\,\big((1-\lambda) + \lambda T\,\delta_{st}\big)\,(x^1_{si}-x^2_{si}).$$

Proof: Taking the derivative of the Lagrangian function in equation (9) with respect to the variable θ_ti we obtain the equation

$$1 - \alpha_{ti} - \eta_{ti} = 0. \tag{17}$$

Since η_ti must be nonnegative, this equation gives the constraint (16). Equation (15) is obtained by replacing equations (10), (11) and (17) in the right hand side of equation (9). Finally, by using equations (10) and (11) we obtain that

$$w_t \;=\; w_0 + v_t \;=\; \frac{1-\lambda}{2\gamma}\sum_{s=1}^{T}\sum_{i=1}^{n}\alpha_{si}\,(x^1_{si}-x^2_{si}) \;+\; \frac{\lambda T}{2\gamma}\sum_{i=1}^{n}\alpha_{ti}\,(x^1_{ti}-x^2_{ti}) \;=\; \frac{1}{2\gamma}\sum_{s=1}^{T}\sum_{i=1}^{n}\alpha_{si}\,\big((1-\lambda) + \lambda T\,\delta_{st}\big)\,(x^1_{si}-x^2_{si}),$$

which completes the proof.

Figure 1: The horizontal axis is parameter λ of our method 2.2. Value λ = 0 corresponds to using the SVM method of (Cui and Curry 2004; Evgeniou et al 2004) for all individuals together, while λ = 1 corresponds to using one SVM for each individual independently. The vertical axis is the average RMSE of the estimated utility functions. The dashed straight line is the RMSE of HB. There are 500 individuals. Top: noise is low (magnitude is high). Bottom: noise is high (magnitude is low). Left: heterogeneity is low. Right: heterogeneity is high.

