Bayesian Strategies For Dynamic Pricing in Ecommerce

Bayesian Strategies for Dynamic Pricing in E-Commerce
Eric Cope
Sauder School of Business, University of British Columbia, 2053 Main Mall, Vancouver,
British Columbia V6T 1Z2, Canada
Received 22 November 2004; revised 2 August 2006; accepted 12 September 2006
DOI 10.1002/nav.20204
Published online 22 December 2006 in Wiley InterScience (www.interscience.wiley.com).
Abstract: E-commerce platforms afford retailers unprecedented visibility into customer purchase behavior and
provide an environment in which prices can be updated quickly and cheaply in response to changing market
conditions. This study investigates dynamic pricing strategies for maximizing revenue in an Internet retail
channel by actively learning customers' demand response to price. A general methodology is proposed for
dynamically pricing information goods, as well as other nonperishable products for which inventory levels are
not an essential consideration in pricing. A Bayesian model of demand uncertainty involving the Dirichlet
distribution or a mixture of such distributions as a prior captures a wide range of beliefs about customer
demand. We provide both analytic formulas and efficient approximation methods for updating these prior
distributions after sales data have been observed. We then investigate several strategies for sequential
pricing based on index functions that consider both the potential revenue and the information value of
selecting prices. These strategies require a manageable amount of computation, are robust to many types of
prior misspecification, and yield high revenues compared to static pricing and passive learning approaches.
© 2006 Wiley Periodicals, Inc. Naval Research Logistics 54: 265–281, 2007
Keywords: dynamic pricing; e-commerce; uncertain demand; nonparametric Bayesian models; Dirichlet prior; reinforcement
learning
1. INTRODUCTION
An improved understanding of customers' sensitivity to product price can have a dramatic impact on the
operating profits of a firm. For example, Marn and Rosiello [28] found that firms with typical cost
structures that adjusted their prices by as little as 2–3% could frequently increase their operating
profits by up to 35%. In addition, they found that profits are usually much more sensitive to small
adjustments in price than to analogous adjustments in sales volumes, fixed costs, or variable costs. In the
past, many firms have been reluctant to consider strategic pricing, in part because price sensitivity
research typically requires hundreds of thousands of dollars and many months of effort to complete [3].
However, the rise of the Internet as a retailing channel has now provided firms with an ideal environment
for testing the effects of various prices on online sales. Not only can posted prices on a Web site be
changed instantly and at little cost to the seller, but e-commerce sites can also be designed so that
customer access to price information is carefully tracked and controlled. As a result, online sellers have
access to unprecedented amounts of data on customer purchase behavior, and several leading firms are now
successfully using Internet-deployed dynamic pricing strategies to increase revenues through demand
learning [14].
Correspondence to: E. Cope (eric.cope@sauder.ubc.ca)
This paper considers methodological issues in demand learning and its impact on vendor revenues in
e-commerce markets. In particular, we shall focus on dynamic pricing in markets of nonperishable goods
where the marginal costs of production are low and sellers have some market power in terms of their ability
to adjust prices. For example, consider the market for information goods, including digital media,
software, and access to database content. Goods of this type are closely associated with the Internet as a
distribution channel and are characterized by having near-zero marginal production costs and no significant
inventory or storage issues. The low marginal cost of information goods usually prohibits sellers of such
goods from entering into direct competition with each other, since prices would collapse as a result.
Therefore, producers of information goods tend to differentiate their products, so that to some degree they
may be considered as monopolists [37]. In addition, the heterogeneity of the markets for information goods,
as well as the global reach of the Internet, makes it difficult for a firm to do the advance market research
necessary to set prices accurately. Finally, because the life cycles of digital products are often very
short, retailers often simply do not have the time or resources to conduct extensive market research on
product demand. Therefore, it may be advantageous for the firm to dynamically adjust the price of such goods
in light of sales data.
This paper addresses two major issues related to demand learning in an online retail environment. The first
issue is how the seller should best represent uncertainty about demand, as well as how to update that
uncertainty in the presence of sales data. We propose a Bayesian model using Dirichlet distributions as
priors for the demand function. These semiparametric priors have an intuitive specification and capture a
wide range of possible demand behavior. Exact analytical formulas are derived for updating these
distributions based on customer purchase data that are likely to be available in an e-commerce setting.
Because these formulas may be difficult to apply in many situations, we shall also present efficient methods
for approximating posterior distributions in a useful form.
The second issue regards the design of an efficient sequential price testing strategy that achieves high
total revenues over the sales period. An efficient strategy will be able to balance the competing
objectives of exploring untested prices and setting prices to levels that have already proven to yield high
revenues. We develop several price-testing strategies that effectively trade off these objectives using
reasonable amounts of computational effort. In particular, we shall apply and investigate strategies that
select among various candidate prices on the basis of a figure of merit computed from the posterior marginal
distributions of demand at each price level. The Dirichlet prior is convenient in this regard because these
marginal distributions can be readily computed or approximated.
2. LITERATURE REVIEW
Much recent work in dynamic pricing has concentrated on pricing perishable products, such as are typically
sold by the airline, hotel, car rental, and fashion industries. Inventory levels are a critical
consideration in pricing such products. A large literature has addressed dynamic pricing in the presence of
inventory considerations, where it is usually referred to as "revenue management" or "yield management." A
recent comprehensive review of such practices is given in [14]; also see [29]. In this paper, however, we
shall consider markets where inventories are not as important to pricing and focus on the revenue
implications of learning demand functions online.
Early work in dynamic pricing for demand learning focused on the long-run learning characteristics of
optimal pricing strategies. In a classic paper, Rothschild [35] investigated the nature of optimal
sequential pricing strategies that experiment with two prices. He showed that under a general class of prior
distributions on demand, there is a positive probability that a pricing policy that maximizes expected total
discounted reward over an infinite horizon will select the revenue-maximizing price only finitely many
times. Therefore, the policy may converge to the suboptimal price. This phenomenon was studied more
generally by Easley and Kiefer [12, 13], as well as McLennan [30] and Aghion et al. [1].
Several other prior studies have also described the qualitative behavior of optimal price-adjustment
strategies using parametric classes of demand functions, such as those by Grossman, Kihlstrom, and Mirman
[19], Mirman, Samuelson, and Urbano [31], Balvers and Cosimano [5], and Trefler [38]. Carvalho and Puterman
[6, 7] and Lobo and Boyd [27] address the same problem from a more methodological viewpoint, although their
formulations also make use of parametric assumptions about the shape of demand functions. The Dirichlet
prior that we employ in this paper does not require such restrictive assumptions. Leloup and Deveaux [25]
consider pricing strategies in a Bayesian framework similar to the one we develop here, but their
formulation ignores key dependencies among demand levels at various prices by treating them as independent.
By taking a Bayesian approach, our methods contrast with some line search-type methods previously proposed
for this problem in the literature, such as the derivative following methods proposed by Greenwald and
Kephart [18], Dasgupta and Das [10], and DiMicco, Greenwald, and Maes [11]. These methods resemble
stochastic approximation algorithms for locating the revenue-maximizing price. While simple to implement,
these methods do not incorporate prior information, nor do they make full use of the data that have been
observed.
Some authors have considered these sequential online selling problems in relation to auction mechanisms.
Baliga and Vohra [4] consider an adaptive monopoly pricing mechanism involving customer bidding, whose per
capita expected profit converges to the optimal profit for a fixed-price mechanism as the number of bidders
grows large. Segal [36] compares the performance of multiunit auctions versus sequential pricing mechanisms
when the distribution of demand is unknown. He shows that although auction mechanisms are frequently more
efficient than sequential pricing mechanisms, if nonparametric estimation is used to estimate the demand (as
is done in this paper), then both of these mechanisms may achieve the same rate of convergence. Kleinberg
and Leighton [22] characterize this optimal rate of convergence for nonparametric estimation in a dynamic
posted-price setting similar to the one studied in this paper; these results are similar to general results
derived by Cope [8].
Other studies by Aviv and Pazgal [2] and Lin [26] have considered pricing policies that actively learn about
the size of the market, rather than the shape of the demand curve, as we do in this paper. Market size is
not relevant to our problem formulation because the vendor's goal is to locate the price that maximizes
expected per-customer revenue. However, these studies are excellent complements to the analysis considered
here and suggest the development of sequential pricing methods that actively learn about both the shape and
the scale of the demand function.
3. MODEL AND PROBLEM FORMULATION
Consider a retailer (vendor) operating a shop where a single good is for sale in unit quantities. Here, a
"shop" is understood to be any space, either real or virtual, where prices are displayed to arriving
customers, who then decide whether to purchase the product. For example, a shop may be a web page or
collection of web pages that display the price for a particular product, such as a software product or
digital media document, and provide visitors with links to purchase the product. We assume that information
about present and past prices is not available outside of the shop, as would be good practice if the vendor
wishes to strictly control the price of the good. In each of $N + 1$ consecutive time periods
$0, 1, \ldots, N$ of equal length, a random number of customers arrive at the shop, observe the posted
price, and individually decide to either purchase one unit of the good or exit the shop without purchasing.
We shall further assume that the numbers of customers arriving in each period form an identically
distributed sequence that is independent of past arrivals as well as both past and current prices. This
assumption may be warranted in a finite time horizon if we assume that the customer base is composed, for
example, of a large number of distributed Internet users, each of whom desires the good in a given period
with a small probability, independent of other users and independent of past demand. We also assume that
$N$ represents a deterministic time horizon (which may be infinite if no definite end period is
anticipated).
The vendor may change the price at the start of any time period, always choosing from a predetermined set
$\mathcal{P}$ of prices. Given that the price is only displayed in the shop, we assume there is no cost to
changing the price. A customer will buy the good if the current price is less than or equal to his or her
privately held reservation price $\rho$. We shall assume that the reservation prices of arriving customers
also form an identically distributed sequence that is independent of past prices, the reservation prices of
past customers, and past numbers of customer arrivals. Let
$$\theta(p) = P(\rho \ge p)$$
denote the probability that a random customer purchases the good at price $p$. Ideally, the vendor would
set the price $p$ to maximize the per-customer expected revenue
$$R(p) = p\,\theta(p),$$
for $p \in \mathcal{P}$. We shall refer to $\theta(\cdot)$ as the demand function and stipulate that
$\theta(\cdot)$ be a nonincreasing function on $\mathcal{P}$.
Suppose that $\mathcal{P}$ contains $k$ distinct prices $z_1 < \cdots < z_k$. For convenience, also define
$z_0 = -\infty$ and $z_{k+1} = \infty$. Let $\theta_i$ denote the probability that the reservation price of
a random customer falls in the half-open interval $[z_i, z_{i+1})$, for any $i = 0, \ldots, k$. Since
$$\theta(z_i) = \sum_{j=i}^{k} \theta_j$$
for any $i$, the vendor can represent initial uncertainty about the demand function at prices in
$\mathcal{P}$ according to a prior probability measure $P_0$ defined over the $k$-simplex:
$$\Bigl\{ (\theta_0, \ldots, \theta_k) \ge 0 : \sum_{i=0}^{k} \theta_i = 1 \Bigr\}. \tag{1}$$
Because prices are constrained to be in the set $\mathcal{P}$, $P_0$ contains all the prior demand
information the vendor needs to make pricing decisions.
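As a small illustration of how the interval probabilities determine demand and per-customer revenue on the price grid, the following sketch computes the tail sums and picks the revenue-maximizing price. The price grid, the probabilities, and the function names are made-up values for illustration only and do not come from the paper.

```python
from itertools import accumulate

def demand_from_interval_probs(theta):
    """Tail sums: entry i is theta(z_i) = theta_i + ... + theta_k."""
    tails = list(accumulate(reversed(theta)))
    return tails[::-1]

def expected_revenue(prices, theta):
    """Per-customer expected revenue z_i * theta(z_i) for the listed prices z_1..z_k."""
    demand = demand_from_interval_probs(theta)
    # demand[0] belongs to z_0 = -infinity, so the listed prices pair with demand[1:].
    return [z * d for z, d in zip(prices, demand[1:])]

# Hypothetical grid with k = 3 prices and k + 1 = 4 interval probabilities.
prices = [5.0, 10.0, 15.0]
theta = [0.5, 0.2, 0.2, 0.1]  # theta[0]: share of customers who buy at none of the prices
demand = demand_from_interval_probs(theta)
revenue = expected_revenue(prices, theta)
best_price = prices[revenue.index(max(revenue))]
```

In practice the vendor does not know the interval probabilities, which is why the remainder of the formulation places a prior distribution on them.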
In time period $n$, the vendor sets a price $p_n$ and receives a random amount of revenue $R_n(p_n)$
according to the number of customers arriving at the shop in that period whose reservation prices are at
least as great as $p_n$. The vendor can only observe whether arriving customers purchase the product and not
the actual reservation prices of those customers. Therefore, the vendor's data in period $n$ can be
represented by the sequence
$$H_{n-1} = (p_0, a_0, s_0, \ldots, p_{n-1}, a_{n-1}, s_{n-1}), \tag{2}$$
where $a_i$ denotes the number of customers arriving in period $i$ and $s_i$ denotes the number of purchases
made in that period. We shall assume that the vendor can identify new and returning customers, so that
$a_i$ represents the number of unique customer arrivals in a given time period. This is technically feasible
on most e-commerce web servers, which can identify customers by tracking requesting IP addresses or
requiring customers to log in to the site. We also assume that time periods are long enough (on the order of
days or weeks) so that the failure of a given customer to purchase within a time period may be interpreted
as a refusal of that customer to purchase. Customers who return in periods subsequent to their first visit
will simply be treated as new customers.
Let $P_n$ denote the posterior measure over the set (1) that results from applying Bayes' rule to the
measure $P_0$ using the data $H_{n-1}$. We call a pricing strategy admissible if the time-$n$ price $p_n$
is determined on the basis of the posterior demand distribution $P_n$ alone. Under these assumptions, the
vendor's goal is to find an admissible pricing strategy that maximizes the total expected discounted revenue
$$E_0 \Bigl[ \sum_{n=0}^{N} \delta^n R_n(p_n) \Bigr], \tag{3}$$
where $0 < \delta \le 1$ is a discount factor strictly less than 1 if $N = \infty$, and the expectation is
taken with respect to the prior distribution $P_0$. Since we assume that the number of customer arrivals in
each time period forms an independent and identically distributed (iid) sequence that is independent of
current and past prices, it is equivalent for the vendor to find a strategy that maximizes total expected
discounted per-customer revenue
$$E_0 \Bigl[ \sum_{n=0}^{N} \delta^n p_n \theta(p_n) \Bigr]. \tag{4}$$
We shall use the objective functions (3) or (4) according to whichever is most convenient in any given
context.
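To make the per-customer objective concrete, the sketch below evaluates the discounted sum inside the expectation for a fixed price path when the demand function is known. The demand values, price paths, and the name `discounted_revenue` are illustrative assumptions, as is the symbol `delta` for the discount factor.

```python
def discounted_revenue(price_path, demand, delta):
    """Sum over n of delta**n * p_n * theta(p_n) for a known demand function."""
    return sum(delta ** n * p * demand[p] for n, p in enumerate(price_path))

# Hypothetical demand function on two prices: theta(5) = 0.5 and theta(10) = 0.3.
demand = {5.0: 0.5, 10.0: 0.3}

# Always posting the better price versus spending the first period on the other price.
static_value = discounted_revenue([10.0, 10.0, 10.0], demand, delta=0.9)
testing_value = discounted_revenue([5.0, 10.0, 10.0], demand, delta=0.9)
```

With demand known, testing the inferior price only costs revenue. The pricing problem becomes interesting precisely because the demand values are uncertain, so a test can also reveal a better price.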
4. BAYESIAN MODELS FOR DEMAND UNCERTAINTY
This section outlines a Bayesian framework for capturing demand uncertainty using the Dirichlet
distribution, which is easy to specify and can capture a wide range of prior beliefs about demand. We shall
first introduce the Dirichlet distribution, then describe how its parameters can be specified in this
context, and finally discuss how to compute or approximate posterior distributions under different
scenarios, depending on the volume of data collected in each period.
4.1. Dirichlet Priors
Let $\alpha = (\alpha_0, \ldots, \alpha_k)$ denote a vector of $k + 1$ nonnegative parameters such that
$\sum_i \alpha_i = 1$, and let $c$ be a positive real number. Under a Dirichlet prior with parameters
$\alpha$ and $c$, the probability density function of a random vector
$\theta = (\theta_0, \ldots, \theta_k)$ at the value $x = (x_0, \ldots, x_k)$, where $x \ge 0$ and
$\sum_i x_i = 1$, is given by
$$f_0(x) = \frac{\Gamma(c)}{\prod_{i=0}^{k} \Gamma(c\alpha_i)}\, x_0^{c\alpha_0 - 1} x_1^{c\alpha_1 - 1} \cdots x_k^{c\alpha_k - 1}, \tag{5}$$
where $\Gamma(\cdot)$ denotes the gamma function. We henceforth use $D(c\alpha_0, \ldots, c\alpha_k)$ to
denote this distribution.
The parameters of the Dirichlet prior have intuitive specifications in terms of reservation prices and the
demand function. The parameters $\alpha_i$ represent the vendor's prior expectation of the probability that
a given customer has a reservation price in the range $[z_i, z_{i+1})$. In addition, the variance of
$\theta_i$ under the $D(c\alpha_0, \ldots, c\alpha_k)$ prior is given by
$$\mathrm{Var}_0(\theta_i) = \frac{\alpha_i (1 - \alpha_i)}{c + 1}.$$
Thus, the larger the value of the parameter $c$, the smaller the corresponding variance of each component
probability $\theta_i$.
Define the function
$$m(z) = P_0(\rho \ge z)$$
to represent the prior expectation of demand $E_0\,\theta(z)$ at the price $z \in \mathcal{P}$. Under the
$D(c\alpha_0, \ldots, c\alpha_k)$ prior, the unknown demand value $\theta(z_i)$ has a Beta distribution with
parameters $c\,m(z_i)$ and $c(1 - m(z_i))$ with expectation $m(z_i) = \sum_{j=i}^{k} \alpha_j$. Also, using
this function, it is possible to define a coherent prior distribution for demand even when $\mathcal{P}$
ranges over a continuous set of prices, as long as $m : \mathcal{P} \to [0, 1]$ is a nonincreasing,
left-continuous function with right limits. In this case, the resulting prior distribution is then known as
a Dirichlet process [15]. The Dirichlet process provides a natural extension to infinite-dimensional priors
that shall prove convenient in many parts of the exposition to follow.
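The Beta marginal of demand at a grid price can be written down directly from the prior parameters. The sketch below does so for hypothetical values of the prior mean vector and of c, using the standard Beta(a, b) parameterization with mean a / (a + b); the function names are illustrative.

```python
def beta_marginal(alpha, c, i):
    """Beta parameters (a, b) of the prior on theta(z_i) under D(c*alpha_0, ..., c*alpha_k)."""
    m = sum(alpha[i:])  # prior mean demand m(z_i), the tail sum of alpha
    return c * m, c * (1.0 - m)

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) random variable."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var

# Hypothetical prior: four reservation-price intervals and prior "sample size" c = 4.
alpha = [0.5, 0.2, 0.2, 0.1]
c = 4.0
a, b = beta_marginal(alpha, c, 2)  # marginal for theta(z_2)
mean, var = beta_mean_var(a, b)
```

Here the variance equals m(1 - m) / (c + 1) with m = m(z_2), which has the same form as the component variance given earlier. For i = 0 the second parameter is zero and the marginal degenerates to a point mass at 1, so only i >= 1 gives a proper Beta distribution.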
We also note that we can gain additional generality by using a mixture of Dirichlet distributions for our
prior; it is well known that any prior distribution ranging on the set (1) can be approximated by mixtures
of Dirichlet distributions to an arbitrary degree of accuracy [9, 34]. While all of the methods proposed in
this paper can be extended in a straightforward way to this more general class of priors, we shall focus on
pure Dirichlet priors to simplify the exposition.
4.2. Specifying Dirichlet Priors
It is well known that with probability one, the sample demand functions under the Dirichlet prior are
piecewise constant functions with a countable number of downward jumps [15]. The size of a downward jump in
the demand function at any price $z$ indicates the proportion of customers in the population whose
reservation price equals $z$. Variability in the size of these downward jumps is strongly related to the
value of $c$. When $c$ is large, the jump sizes all tend to be relatively small, and the shape of the demand
functions appears smoother, corresponding to what one might expect if the size of the market is large and
reservation prices are well dispersed in the customer population. Conversely, when $c$ is small, there tends
to be wide variation in the jump sizes, with a few very large jumps dominating the demand distribution. Such
demand behavior might be observed in a market dominated by a few large customers or when customers tend to
be insensitive to price variation in certain ranges [3].
These various interpretations of the parameter $c$ of the Dirichlet prior may lead to conflicting directives
in specifying a value for this parameter. If no single value for $c$ captures prior uncertainty about
demand, then a Dirichlet mixture might be considered a better approximation. However, the likely shape
properties of sample demand functions under the prior are largely immaterial to the dynamic pricing
strategies discussed in the next section. These strategies are based solely on the posterior marginal
distributions of demand at individual prices and not upon the joint dependencies among various demand values
that induce these shape properties. In addition, when $c$ is not too large, these strategies are observed to
be robust when the shape of the true demand functions does not correspond to the shape of demand functions
that are likely to be sampled from the Dirichlet prior. (We consider the question of robustness explicitly
in Section 7.) Therefore, we argue that for the purposes of dynamic pricing, $c$ should be selected
primarily according to the level of accuracy assigned to the prior mean $m(\cdot)$ as an estimate of the
true demand.
A possible difficulty that may arise in selecting a value for the parameter $c$ is that it may be difficult
to capture the variability in the prior demand distribution at every price in $\mathcal{P}$ using the single
parameter $c$. Additional flexibility may be gained by supposing that $\theta_0 = 1 - t$ for some
$t \in (0, 1)$ and that $(\theta_1, \ldots, \theta_k)$ has a Dirichlet distribution when scaled by $1/t$,
i.e.,
$$\Bigl( \frac{\theta_1}{t}, \ldots, \frac{\theta_k}{t} \Bigr) \sim D(c\alpha_1, \ldots, c\alpha_k).$$
Introducing the parameter $t$ can be helpful in accounting for a fixed fraction of customers who are solely
interested in the informational content of a website and may be assumed to never buy the product regardless
of the posted price [32]. All the methods introduced in this paper can be easily adapted to handle this
modification.
4.3. Computing and Approximating Posterior Distributions
Updating the Dirichlet prior on the basis of the binomial purchase count data observed by the vendor can be
a challenging computational task. Usually, the Dirichlet prior is used in statistical applications with
multinomial data, since it is conjugate to the multinomial distribution. For example, suppose the vendor
could observe the reservation prices of arriving customers directly; then if the prior distribution is
$D(c\alpha_0, \ldots, c\alpha_k)$ and the vendor has observed $n_i$ customers with reservation prices
falling in the range $[z_i, z_{i+1})$, $i = 0, \ldots, k$, the posterior distribution would be
$D(c\alpha_0 + n_0, \ldots, c\alpha_k + n_k)$. (For this reason, the parameter $c$ in this context is often
interpreted as a "prior sample size.") However, in the model we are considering, the vendor cannot observe
the reservation prices directly, but rather only the reservation prices censored at the prevailing price at
the time the customer entered the shop. When the reservation price data are censored in this way, the
Dirichlet distribution is no longer a conjugate prior, and the posterior distribution has a more complex
form. An overview of Dirichlet updating under different forms of censoring can be found in [20].
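For reference, the conjugate update that would apply if reservation prices were observed directly takes only a few lines. The sketch below covers that uncensored case only, not the censored purchase data the vendor actually sees; the counts and the function name are hypothetical.

```python
def dirichlet_update(alpha, c, counts):
    """Uncensored conjugate update: D(c*alpha_i) becomes D(c*alpha_i + n_i).

    Returns the posterior in the same (mean vector, total weight) form as the prior.
    """
    weights = [c * a + n for a, n in zip(alpha, counts)]
    c_post = sum(weights)
    return [w / c_post for w in weights], c_post

# Hypothetical prior and directly observed interval counts n_0, ..., n_3.
alpha = [0.5, 0.2, 0.2, 0.1]
c = 4.0
counts = [3, 1, 0, 2]
alpha_post, c_post = dirichlet_update(alpha, c, counts)
```

The total weight grows from 4 to 10 after six observations, which matches the reading of c as a prior sample size.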
In Appendix A, exact analytical forms for the posterior marginal distributions of demand are derived. These
forms are particular to the method of censoring we are considering here and as such appear to be new. The
results of these computations indicate that the posterior density of the demand function at a price
$z \in \mathcal{P}$ is expressed as a finite mixture of densities of Beta random variables. The weights
assigned to each of these element densities in the mixture can be difficult to compute, however, and there
do not appear to be any simple recursive formulas available for updating the posterior distributions given
an additional data point. Kuo and Smith [23] showed how to efficiently approximate the posterior marginal
distributions for censored data using the Gibbs sampler, a version of which is outlined in Appendix B for
use with this particular application. The Gibbs sampler generates samples of parameter values $(a, b)$ of
the Beta densities in the mixture approximately according to their mixture weights in the posterior
distribution. Thus, if $M$ parameter samples $(a_1, b_1), \ldots, (a_M, b_M)$ are generated, then the
posterior marginal density $f(x)$ of demand at a given price level can be approximated as
$$f(x) \approx \frac{1}{M} \sum_{m=1}^{M} B(x; a_m, b_m), \tag{6}$$
where $B(x; a, b)$ represents the Beta density at $x$ with parameters $a$ and $b$. In computed examples
where fewer than 100 customers arrive per time period, the Gibbs sampler approximation is observed to be
very accurate when $M = 100$.
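Once parameter samples are available, evaluating the averaged Beta density is straightforward. The sketch below assumes the sampled (a, b) pairs are already given (the sampler itself is not reproduced here) and uses made-up pairs in the example; `beta_pdf` and `mixture_density` are illustrative names.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, computed on the log scale for numerical stability."""
    if not 0.0 < x < 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))

def mixture_density(x, params):
    """Equal-weight average of Beta densities over sampled (a, b) pairs."""
    return sum(beta_pdf(x, a, b) for a, b in params) / len(params)

# Made-up parameter samples standing in for the output of a sampler.
params = [(2.0, 5.0), (3.0, 4.0), (2.5, 4.5)]
density_at_03 = mixture_density(0.3, params)
```

Because each component integrates to one, the equal-weight average is itself a proper density on the unit interval.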
In the case where hundreds or thousands of customers arrive at the shop per period, the Gibbs sampler
approximation of the posterior distribution requires much more computational effort to achieve the same
accuracy, as the values of both $M$ and $K$ must increase according to the amount of data. In such cases,
other approximations to the posterior marginal distributions of demand are available. A crude approximation
can be obtained by treating the observed proportions of customers who purchase at a given price level as the
true demand values, so that the posterior marginal distribution at a tested price is treated as a point
mass. In this case, the Dirichlet distribution is split into two "halves," each of which is itself a
Dirichlet distribution. For example, suppose that $\theta(p_1)$ and $\theta(p_2)$, $p_1, p_2 \in \mathcal{P}$,
$p_1 < p_2$, have already been observed with high precision. Then the posterior distribution of
$\theta(p)$, for $p_1 < p < p_2$, can be approximated according to
$$\frac{\theta(p) - \theta(p_2)}{\theta(p_1) - \theta(p_2)} \sim \mathrm{Beta}\bigl( c\,m(p) - c\,m(p_2),\; c\,m(p_1) - c\,m(p) \bigr). \tag{7}$$
(Note that $\theta(p_1)$ and $\theta(p_2)$ are treated as deterministic quantities in this expression.) We
shall term this approximation to the posterior marginal distributions of demand the "exact observation
approximation."
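A draw of demand at an untested price lying between two tested prices can be sketched as follows. The code assumes the standard Beta(a, b) parameterization with shape parameters c(m(p) - m(p_2)) and c(m(p_1) - m(p)), so that the scaled demand has mean (m(p) - m(p_2)) / (m(p_1) - m(p_2)); all numerical values and the function name are hypothetical.

```python
import random

def sample_between(theta1, theta2, m_p, m1, m2, c, rng):
    """Draw theta(p) for p1 < p < p2, treating theta(p1) and theta(p2) as known.

    theta1, theta2: observed demand at p1 and p2 (theta1 > theta2)
    m_p, m1, m2:    prior mean demand at p, p1, and p2 (m1 > m_p > m2)
    """
    a = c * (m_p - m2)
    b = c * (m1 - m_p)
    ratio = rng.betavariate(a, b)  # position of theta(p) between the two known values
    return theta2 + ratio * (theta1 - theta2)

rng = random.Random(0)
# Hypothetical inputs: prior means 0.6, 0.4, 0.3 and observed demand 0.55 and 0.25.
draws = [sample_between(0.55, 0.25, 0.4, 0.6, 0.3, 10.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
```

Every draw lies between the two observed demand values, so monotonicity of the demand function is preserved automatically.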
The exact observation approximation is most accurate when large numbers of customers (numbering in the
thousands) arrive in each time period, and there is a high probability (greater than 0.2) that an arriving
customer will buy the good. In more realistic scenarios where only a small percentage (less than 10%) of
customers are willing to buy, the exact observation approximation is less accurate. Using the example from
the previous paragraph, an improved procedure is to assume that the posterior distributions of
$\theta(p_1)$ and $\theta(p_2)$ are Binomial$(n_i, q_i)$, $i = 1, 2$, where $n_i$ is the number of customers
who arrived while the price $p_i$ was posted, and $q_i$ is the observed proportion of those customers who
purchased the product at that price. The posterior distribution of $\theta(p)$ can then be approximated by
drawing $M$ random samples each from the respective Binomial distributions for $\theta(p_1)$ and
$\theta(p_2)$ and applying the exact observation approximation (7) to compute a density for $\theta(p)$ for
each pair of sampled values. The "modified exact observation approximation" is then computed by averaging
these $M$ sample densities. This procedure can be far less cumbersome than using the Gibbs sampler
approximation and has been observed to produce quite accurate results even for large amounts of customer
data.
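The averaging step can be sketched as below. Two assumptions are made that the text does not spell out: each Binomial count is divided by the number of arrivals to obtain a demand value, and sampled pairs that violate monotonicity are discarded. The Beta shape parameters, arrival counts, and purchase proportions are hypothetical.

```python
import math
import random

def beta_pdf(u, a, b):
    """Density of Beta(a, b) at u (zero outside the open unit interval)."""
    if not 0.0 < u < 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(u) + (b - 1.0) * math.log(1.0 - u))

def binomial_fraction(n, q, rng):
    """Binomial(n, q) count divided by n (assumed scaling to a demand value)."""
    return sum(rng.random() < q for _ in range(n)) / n

def sample_pairs(n1, q1, n2, q2, M, rng):
    """Draw M pairs (theta(p1), theta(p2)); keep only pairs with theta(p1) > theta(p2)."""
    pairs = [(binomial_fraction(n1, q1, rng), binomial_fraction(n2, q2, rng)) for _ in range(M)]
    return [(t1, t2) for t1, t2 in pairs if t1 > t2]

def averaged_density(x, pairs, a, b):
    """Average over pairs of the density of theta(p) implied by the scaled Beta(a, b)."""
    total = 0.0
    for t1, t2 in pairs:
        width = t1 - t2
        total += beta_pdf((x - t2) / width, a, b) / width  # change of variables
    return total / len(pairs)

rng = random.Random(1)
a, b = 2.0, 2.0  # hypothetical shape parameters for the scaled demand
pairs = sample_pairs(2000, 0.08, 2000, 0.03, 100, rng)
density_mid = averaged_density(0.055, pairs, a, b)
```

Drawing the pairs once and reusing them for every evaluation point keeps the resulting density smooth across the grid of demand values.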
5. STRATEGIES FOR DYNAMIC PRICING
Having now defined a Bayesian framework in which the prior distributions are easily specified and the
posterior distributions can be readily approximated, we next turn to the question of finding price testing
strategies that make intelligent use of the demand information summarized by the posterior distributions. In
principle, an optimal $(N + 1)$-period pricing strategy can be computed by solving for the value functions
$V_0, \ldots, V_N$, defined for distributions $D$ over demand function values at the prices in
$\mathcal{P}$, using the recursion
$$V_n(D) = \max_{p \in \mathcal{P}} E_D \bigl[ p\,\theta(p) + \delta V_{n+1}(\pi_p D) \bigr], \quad n = 0, 1, \ldots, N, \tag{8}$$
along with the boundary condition $V_{N+1}(D) = 0$. Here, $\pi_p D$ denotes the (random) posterior
distribution after data from one time period are observed, given that price $p$ is selected. Considering,
however, that the state space is multidimensional and continuous, exact solution of these equations is
practically impossible.
5.1. One-Step Lookahead with Recourse Strategies
Rather than try to solve (8) directly, we will instead look for approximations to this formula that will
yield nearly optimal solutions. In particular, we develop approximations to the reward-to-go term
$V_{n+1}(\pi_p D)$ in (8). These approximations are motivated on the basis of a few simplifying assumptions
regarding the decision problem that allow $V_{n+1}(\pi_p D)$ to be directly estimated. These assumptions may
be stated as follows:
1. When any price $p$ is set for the upcoming time period, the exact value of $\theta(p)$ will become known
at the end of the period.
2. For every possible choice of price $p \in \mathcal{P}$, there exists a best alternative that attains an
average revenue value $r(p)$ over all subsequent time periods. This value may depend on $p$ and is assumed
to hold constant over time.
Assumption 1 replaces the notion that only a limited amount of information about customer demand is learned
in each period (since only a finite number of customers arrive each period) with the assumption that perfect
information becomes available, as if an infinite sequence of customers were observed. Assumption 2 will
allow us to directly approximate the reward-to-go term in (8) by specifying for each choice of $p$ the
potential value of not choosing $p$. The function $r(p)$ denotes an average revenue recourse value that
represents a possible reward alternative to choosing price $p$. In principle, this function can be freely
chosen by the practitioner, although we shall provide specific suggestions on how this function may be
chosen in the next subsection. Note also that $r(p)$ is distinct from the random per-period revenue function
$R(p)$ used in previous sections.
Under these assumptions, the vendor's decision problem may be recharacterized as follows: If price $p$ is
chosen in the current time period and $\theta(p)$ is known at the start of the next time period, then the
vendor again has a choice of either price $p$ or the best alternative price in the next time period. Since
$\theta(p)$ is already known at this point, to choose $p$ again would result in no new information. Thus, if
the vendor prefers to choose $p$ again in the next time period (i.e., if $p\,\theta(p) > r(p)$), then she
should prefer to do so in all subsequent time periods. If $p$ is not chosen, then the vendor can assume that
expected revenues of $r(p)$ would be attained in each subsequent time period. This decision process is
depicted in the flowchart in Figure 1.
From this characterization of the problem under the new assumptions, the value associated with choosing any
price $p$ in the current time period $n$ can then be reformulated as
$$v(p) = E_n \bigl[ p\,\theta(p) + \delta\, d_{n+1} \max\{ p\,\theta(p),\, r(p) \} \bigr]. \tag{9}$$
Here $E_n$ denotes expectation with regard to the posterior distribution of demand that is available at time
$n$, and we define $d_n = \sum_{t=0}^{N-n} \delta^t$ for $n \ge 1$. A pricing strategy based on Assumptions
1 and 2 would choose the price $p$ in each period that maximizes (9). We term this strategy a one-step
lookahead with recourse (OSLR) strategy because we choose prices by looking ahead only to the next decision
point, and we assume that the recourse reward value $r(p)$ is always available as an alternative. OSLR
policies approximate the reward-to-go term $V_{n+1}(\pi_p D)$ in (8) by the quantity
$d_{n+1} \max\{ p\,\theta(p),\, r(p) \}$, as in (9).
Figure 1. Flowchart representing the decision tree for selecting a given price $p$ at time $n$ under the
OSLR assumptions. Values in the rounded rectangles represent reward values. The expected value of the
decision to test $p$ is the figure of merit assigned by the OSLR policy to price $p$.
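The OSLR figure of merit can be estimated by simple Monte Carlo when the posterior marginal of demand at a price is (approximately) a Beta distribution. The sketch below assumes such a Beta marginal, a discount factor written `delta` that multiplies the lookahead term, and hypothetical numerical inputs; the function names are illustrative.

```python
import random

def horizon_weight(n, N, delta):
    """d_n = sum over t = 0..N-n of delta**t (zero once n exceeds N)."""
    return sum(delta ** t for t in range(N - n + 1))

def oslr_value(p, a, b, r, n, N, delta, rng, draws=20000):
    """Monte Carlo estimate of E_n[p*theta + delta * d_{n+1} * max(p*theta, r)].

    theta is drawn from an assumed Beta(a, b) posterior marginal of demand at price p.
    """
    d_next = horizon_weight(n + 1, N, delta)
    total = 0.0
    for _ in range(draws):
        rev = p * rng.betavariate(a, b)  # per-customer revenue if theta were revealed
        total += rev + delta * d_next * max(rev, r)
    return total / draws

rng = random.Random(0)
# Hypothetical inputs: price 10, Beta(3, 7) posterior, period n = 2 of horizon N = 5.
value_no_recourse = oslr_value(10.0, 3.0, 7.0, r=0.0, n=2, N=5, delta=0.9, rng=rng)
value_with_recourse = oslr_value(10.0, 3.0, 7.0, r=3.5, n=2, N=5, delta=0.9, rng=rng)
```

With a recourse value of zero the estimate reduces to the posterior mean revenue times d_n, while a positive recourse value raises the index by protecting against low realizations of demand.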
5.2. Two OSLR Policies
The question remains as to what values should be chosen for $r(p)$, $p \in P$. We shall consider two candidate values, denoted $r_1(p)$ and $r_2(p)$, which are easily computed in each time step, and yield near-optimal results, as the simulation results of the next section shall establish. Furthermore, these strategies are also robust against many possible misspecifications of the prior, as we shall explore in Section 7.
The first search strategy assigns to $r(p)$ the maximal expected revenue value (as measured under the current posterior distribution) over all prices in $P$. That is, let $r(p) = r_1$ for all $p \in P$ in period $n$, where
\[ r_1 = \max_{p \in P} E_n\, p\lambda(p). \]
This assignment to $r(p)$ rather conservatively estimates the expected per-period revenue to be the expected revenue associated with the optimal static pricing policy, which would choose the same price in every period regardless of the outcomes. We shall refer to a pricing strategy based on this choice of $r(p)$ as an OSLR-1 policy.
The second search strategy approaches the estimation of $r(p)$ using a more aggressive estimate of the recourse value. This strategy sets $r(p)$ to be the value at which the vendor would be indifferent between choosing $p$ and choosing the best alternative to $p$ in the current time period. This value can be computed as follows: the expected value of choosing $p$ in the current period is given in (9). The expected value of choosing the alternative in all subsequent time periods is $d_n r(p)$. The value of $r(p)$ at which the vendor is indifferent between the two alternatives is thus given by the solution $r = r_2(p)$ of the following equation:
\[ d_n r = E_n\left[ p\lambda(p) + \delta\, d_{n+1} \max\{p\lambda(p), r\} \right]. \tag{10} \]
If there is more than one solution to (10), choose $r_2(p)$ to be the largest of these. Solutions to (10) can be found numerically. A search strategy based on the assignment $r(p) = r_2(p)$ will choose in each time period the price $p$ that maximizes $r_2(p)$. We label this second policy an OSLR-2 policy.
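The two index computations can be sketched numerically as follows. This is only an illustrative sketch, not the paper's implementation: it assumes an infinite horizon (so that $d_n = 1/(1-\delta)$ and $\delta d_{n+1} = \delta/(1-\delta)$), represents each marginal posterior of demand by a list of Monte Carlo draws, and solves (10) by bisection on $[0, p]$. The function names and the Beta posteriors in the example are hypothetical.

```python
import random

def oslr_value(price, lam_samples, r, delta):
    """Monte Carlo estimate of (9) for an infinite horizon (assumption),
    where delta * d_{n+1} = delta / (1 - delta)."""
    tail = delta / (1.0 - delta)
    revenues = [price * lam for lam in lam_samples]
    mean_rev = sum(revenues) / len(revenues)
    mean_max = sum(max(x, r) for x in revenues) / len(revenues)
    return mean_rev + tail * mean_max

def oslr2_index(price, lam_samples, delta, tol=1e-10):
    """Solve (10), d_n * r = value of (9) at r, by bisection on [0, price]."""
    d = 1.0 / (1.0 - delta)
    lo, hi = 0.0, float(price)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The left side minus the right side of (10) is decreasing in r here.
        if oslr_value(price, lam_samples, mid, delta) > d * mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def choose_prices(posterior_samples, delta):
    """posterior_samples maps price -> draws of demand at that price.
    Returns the (OSLR-1 choice, OSLR-2 choice)."""
    r1 = max(p * sum(s) / len(s) for p, s in posterior_samples.items())
    oslr1 = max(posterior_samples,
                key=lambda p: oslr_value(p, posterior_samples[p], r1, delta))
    oslr2 = max(posterior_samples,
                key=lambda p: oslr2_index(p, posterior_samples[p], delta))
    return oslr1, oslr2

rng = random.Random(0)
samples = {
    10.0: [rng.betavariate(50, 50) for _ in range(2000)],  # well-explored price
    14.0: [rng.betavariate(2, 4) for _ in range(2000)],    # uncertain price
}
print(choose_prices(samples, delta=0.95))
```

When demand at a price is known exactly, the index returned by `oslr2_index` equals the known revenue; with uncertainty it exceeds the mean revenue, which is the exploration bonus described in the text.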
(Those familiar with the theory of optimal stochastic sequential control may recognize that $r_2(p)$ in the OSLR-2 policy coincides with Gittins' dynamic allocation index under the conditions imposed in Assumption 1. Gittins showed that for the case of the multi-armed bandit problem, the use of the dynamic allocation index as a figure of merit in selecting among alternatives forms an optimal policy [16]. This problem has a similar structure and objective to the price selection problem we are considering here; however, in our problem the independence assumption does not hold since the demand functions are nonincreasing.)
In what follows, we shall fix the value of the time index $n$ and denote the value of (9) with $r(p) = r_1$ or $r(p) = r_2(p)$ as $v_1(p)$ and $v_2(p)$, respectively. The following proposition establishes some relationships among various values of $v_1$, $v_2$, $r_1$, and $r_2$ that are helpful in comparing the behavior of the two OSLR policies:
PROPOSITION 5.1: Suppose that $\lambda(p)$ has a posterior density that is positive and continuous on (0, 1) at a given time $n$, and let $p^*$ denote the value in $P$ that maximizes $E_n(p\lambda(p))$ over all $p \in P$. Then the following assertions hold:
(a) If $d_n r \le E_n\left[ p\lambda(p) + \delta\, d_{n+1} \max\{p\lambda(p), r\} \right]$ for some value of $r$, then $r \le r_2(p)$.
(b) $v_2(p^*) \ge v_1(p^*)$.
(c) If $r_1 > r_2(p)$ for any $p$, then $v_1(p^*) > v_1(p) > v_2(p)$.
(d) If Assumption 1 holds and the price $p^*$ has already been tested prior to time $n$, and if $v_1(p) > v_1(p^*)$, then $v_2(p) > v_2(p^*)$.
The proof is given in Appendix C.
Proposition 5.1 suggests that the OSLR-2 policy will typically be more aggressive in its experimentation with new prices than the OSLR-1 policy. To see this, suppose that Assumption 1 holds true. If the set of prices $P$ is finite, then eventually any OSLR policy will choose some price $p$ for the second time. When this occurs, no new information will be gained, since according to the assumption $\lambda(p)$ is already known. If the posterior distribution does not change, then the OSLR policy will continue to choose price $p$ ever after, and at any time $n$ after this occurs, $v(p) = d_n p\lambda(p)$. However, $v(p^*) \ge v(p)$ since $E_n p^*\lambda(p^*) + \delta\, d_{n+1} E_n \max\{p^*\lambda(p^*), r\} \ge d_n E_n p^*\lambda(p^*) \ge d_n p\lambda(p)$; therefore, $p = p^*$. This implies that the price $p^*$ (defined as the maximizer of $E_n p\lambda(p)$ at time $n$) will already have been tested by time $n$, for $n$ large enough. Suppose we select any such large $n$. Part (d) of the proposition states that the set of prices $p$ such that $v_1(p) > v_1(p^*)$ is a subset of the prices where $v_2(p) > v_2(p^*)$. In particular, if at some time there are no more prices that the OSLR-1 policy is willing to explore because $v_1(p) \le v_1(p^*)$ for all $p \in P$, then there still may be prices $p \ne p^*$ such that $v_2(p) > v_2(p^*)$, but not vice versa. This indicates that the OSLR-2 policy will tend to explore more prices than the OSLR-1 policy. This hypothesis is supported in the simulation trials discussed below.
5.3. Discussion of OSLR Strategies
Strategies such as OSLR, which are computed based on marginals rather than the full joint distribution of demand, usually offer a large computational savings over optimal strategies. Given the nature of the semiparametric prior distributions we have defined for this problem, it is also reasonable to expect that the OSLR heuristics will perform near optimality, because values of the revenue function at prices that are not close to each other are not highly dependent under the semiparametric Dirichlet prior. Inspecting the form of (9), OSLR strategies tend to favor either prices that are known to yield high revenues (when $E_n p\lambda(p)$ is large) or under-explored prices with large demand variance that have the potential to yield high revenues (when $E_n \max\{p\lambda(p), r(p)\}$ is large). Given the dependence structure of the Dirichlet prior, testing a price $p$ will reduce the demand variance not only at that price, but also at nearby prices where the variance may also be high. Thus, strategies such as OSLR that are based only on the marginals of the posterior demand distribution are typically able to perform efficient explorations of the most promising regions of $P$.
(The situation can be quite different when the demand
function is supposed to come from a parametric set of func-
tions, for then global demand information may frequently
be inferred on the basis of a few observations. For exam-
ple, if the demand function is assumed to be linear, then the
entire demand function will be assumed known after demand
levels are learned at only two prices. Such an assumption is
usually unwarranted and may result in a search strategy that
underexplores the price set P.)
Finally, another advantage of using OSLR policies is that they do not require the user to specify a tuning parameter to be effective. Many general reinforcement learning and global optimization strategies that could be adapted to this context (cf. [17, 21, 24, 33] for examples of search strategies that also rely on functions of marginal distributions to make sequential decisions) require the choice of an exogenous parameter that is specified according to the judgment and expertise of the practitioner. Frequently, the performance of these strategies can be very sensitive to the choice of the parameter value. Simulation experiments (not reported here) have shown that the two OSLR policies we have introduced are competitive with or better than methods requiring tuning under their best parameter settings.
6. COMPARISON OF THE PRICING STRATEGIES
We now examine the performance of the various policies
described in the previous section under various assignments
of the prior distribution and the discount factor. Performance
will be evaluated based on simulation trials under a represen-
tative spectrum of possible prior distributions. Throughout
this section, we shall assume that the random demand functions are generated in our simulation according to the prior distribution; we shall investigate cases where the prior is misspecified in the next section.
6.1. Results under Exact Observation
We first investigate the performance of the OSLR policies when Assumption 1 is assumed to hold, not only because this captures the limiting behavior of the search strategies as the frequency of customer arrivals grows, but also because it allows us to efficiently compare the relative performance of policies without the added randomness of the customer reservation price sequence. We shall lift this restriction in the set of simulation results reported in the next subsection.
Four price-testing strategies were compared in these simulations: a static policy that always chooses the price that maximizes the expected revenue function under the prior distribution; a passive learning (certainty equivalent) policy that always chooses the price that maximizes expected per-period revenue under the current posterior distribution; and the OSLR-1 and OSLR-2 policies. The experimental set-up was as follows: a discrete set of 50 candidate prices $(p_1, \ldots, p_{50})$ was selected so that the prior expected demand equaled $E_0\lambda(p_i) = 1 - i/50$ and the prior expected revenue was given by
\[ E_0\, p_i\lambda(p_i) = 100 - a(i - 25)^2, \tag{11} \]
with the parameter $a$ ranging in the set [0, 0.03]. This selection was made in order that:
(1) The prior expected difference in demand between consecutive prices was uniform, so prices were equispaced in terms of the frequencies of customers with reservation levels between each price.
(2) In the case where $a = 0$, the prior expected revenue function is flat, thus allowing us to compare the policies when the prior expected revenue of each price is similar.
(3) In the cases where $a > 0$, the prior expected revenue has a concave shape. The efficiency of many random search algorithms has been shown to depend on the degree of curvature at the maximum; we therefore chose different values of $a$ to compare the performance of the policies as the curvature of the expected revenue function increased.

Figure 2. Average percentage difference between revenues attainable under perfect information and the revenue achieved by each policy. Shorter columns correspond to better performance. The revenue functions were normalized as described in the text. Data are shown with approximate 95% confidence intervals.
We also tested the algorithms under various values of the certainty parameter $c$ (which ranged in the set {10, 20, 100}) and the discount parameter $\delta$ (which ranged in the set {0.9, 0.95, 0.99}). For each combination of parameters $(a, c, \delta)$, the policies were tested using 1,000 sample demand functions using an infinite time horizon, recording in each trial the total discounted revenues attained by each policy, as well as the revenues that would be attained by the policies that selected the revenue-maximizing price and the revenue-minimizing price in each period. The latter two policies were tested so as to provide a means of comparison with the best and worst possible policies. Note that these policies assume perfect information of the demand function, so the best possible policy will outscore even the Bayes-optimal policy.
The results of our simulations are displayed graphically in the charts in Figures 2 and 3. We only show the results for $a \in \{0, 0.01, 0.02\}$ for illustration. The first set of charts compares the revenues attained by each policy relative to the best and worst price strategies. These figures were computed as follows: suppose during a trial the total discounted revenue obtained by the best price strategy is $R_1$ and that of the worst price strategy is $R_0$. If one of the strategies of interest attained a total discounted revenue of $R$, then the value $(R - R_0)/(R_1 - R_0)$ was recorded. Thus, the total revenues attained by each strategy were rescaled to lie between 0 and 1, with 0 corresponding to the worst possible policy and 1 corresponding to the best possible policy (under perfect information). This rescaling eliminated the under- or overestimation of the relative efficiency of the algorithms due to random fluctuations in the range and the arbitrary choice of scale of the sample revenue functions. (In the graphs displaying the results of these trials, the percentage value of information $100 \times (1 - (R - R_0)/(R_1 - R_0))$ is displayed for better visual comparison among policies.)

Figure 3. Number of different prices tested on average by each index function. Data are shown with approximate 95% confidence intervals.
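The normalization described above is simple enough to state as code. The following is a minimal sketch with a hypothetical function name; it assumes the best-price and worst-price revenues differ within a trial.

```python
def percent_value_of_information(revenue, worst, best):
    """Return 100 * (1 - (R - R0) / (R1 - R0)) for one simulation trial.

    revenue: total discounted revenue R of the policy under study
    worst:   total discounted revenue R0 of the worst-price strategy
    best:    total discounted revenue R1 of the best-price strategy
    """
    if best <= worst:
        raise ValueError("best-price revenue must exceed worst-price revenue")
    normalized = (revenue - worst) / (best - worst)  # 0 = worst, 1 = best
    return 100.0 * (1.0 - normalized)

# A policy earning 90 when the worst and best strategies earn 50 and 100
print(percent_value_of_information(90.0, 50.0, 100.0))
```

Lower values are better: a policy matching the perfect-information strategy scores 0, and one matching the worst-price strategy scores 100.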
The data in Figure 2 show in many cases that the losses due to demand uncertainty can be substantially reduced by following an active learning strategy, such as the OSLR policies, rather than a static pricing or passive learning strategy. These differences in performance are most pronounced for high values of the discount factor $\delta$ and low values of the certainty parameter $c$. Also, the value of $\delta$ primarily affects the performance of the OSLR policies, as these policies explicitly rely on $\delta$. When $a > 0$, the value of $c$ affects the performance of all the strategies; we expect this because the more precise our knowledge of the demand function is, the better the policies should perform. Finally, we can see that for the larger value of $a$, the performance of all algorithms improved. This is because if the prior expected revenue has greater curvature then the range of prices at which the maximum is likely to be attained is smaller, and hence there are effectively fewer prices from which to select.
Figure 3 shows the number of different prices explored on average by each of the algorithms. As we expect, the active learning strategies explore more prices than the passive strategies, and the OSLR-2 policy explores slightly more prices than the OSLR-1 policy, as suggested by Proposition 5.1.
6.2. Results Using the Gibbs Sampler Approximation
Using the Gibbs sampler approximation, simulations of the performance of the two OSLR policies were performed. For the simulation trials we report here, the set $P$ contained 20 candidate price levels $(p_1, \ldots, p_{20})$, the prior mean demand satisfied $m(p_i) = 0.5 + (i - 10) \times 0.01$, and the prior mean revenue function $R_i = p_i m(p_i)$ was set according to (11) with $a = 0.02$; the actual price levels were determined from the equation $p_i = R_i/m(p_i)$ for $i = 1, \ldots, 20$. In each trial, a demand function $(\lambda(1), \ldots, \lambda(20))$ was generated according to a $D(c(1 - m(1)), c(m(1) - m(2)), \ldots, c\,m(20))$ distribution with $c = 20$. In each of 30 periods, 100 customers with reservation prices randomly generated according to this demand function arrived at the shop. In each period, the posterior distribution at each price level was approximated using the Gibbs sampler algorithm given in Appendix B, with parameter values of $K = 10$ and $M = 100$. A discount factor of $\delta = 0.99$ was assumed.
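The data-generating step of such a trial can be sketched as follows. This is a minimal illustration rather than the code used for the reported experiments: the function names are hypothetical, the mean demand vector in the example is an arbitrary strictly decreasing one (not the paper's), and the Dirichlet draw is obtained by normalizing Gamma variates.

```python
import random

def sample_demand(mean_demand, c, rng):
    """Draw demand values at prices z_1 < ... < z_k from a
    D(c(1 - m_1), c(m_1 - m_2), ..., c m_k) distribution.

    mean_demand must be strictly decreasing with values in (0, 1).
    """
    m = list(mean_demand)
    increments = [1.0 - m[0]] + [m[j] - m[j + 1] for j in range(len(m) - 1)] + [m[-1]]
    gammas = [rng.gammavariate(c * w, 1.0) for w in increments]
    total = sum(gammas)
    theta = [g / total for g in gammas]  # reservation-price cell probabilities
    # Demand at z_i is the probability that the reservation price is >= z_i,
    # i.e. the total mass of cells i, ..., k.
    demand, tail = [], 0.0
    for cell_mass in reversed(theta[1:]):
        tail += cell_mass
        demand.append(tail)
    return demand[::-1]

def simulate_period(purchase_prob, customers, rng):
    """Number of purchases among `customers` independent arrivals."""
    return sum(1 for _ in range(customers) if rng.random() < purchase_prob)

rng = random.Random(42)
demand = sample_demand([0.9, 0.7, 0.5, 0.3, 0.1], c=20, rng=rng)
sales = simulate_period(demand[2], customers=100, rng=rng)  # post the third price
print(demand, sales)
```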
The averaged results of 200 simulation trials are shown in Figure 4. This graph shows that in the early periods, the passive learning and static pricing methods perform best, but over time the OSLR policies collect more revenue. In particular, the OSLR-1 strategy appears to dominate after only a few periods, and the OSLR-2 policy makes slower but steadier progress. By period 30, the OSLR-2 policy is making as much revenue per period as the OSLR-1 policy. If we compute the total revenues gained by each policy over an infinite horizon by adding to the total revenues achieved by period 30 the total revenue that would have been achieved on average using the prices selected at period 30 going forward, the percentage values of information of the static, passive, OSLR-1, and OSLR-2 policies are 46.3, 26.0, 19.9, and 20.5, respectively (with a standard error of less than 2% in all cases). These simulations suggest that for the high value of the discount factor chosen, the OSLR policies begin to outperform the static pricing policies after only 2–3 periods. Both OSLR policies outperform the passive policy in the long run.
7. ROBUSTNESS OF THE HEURISTICS
To address the robustness concerns raised in Section 4.2, we show in this section that the heuristics are still reliable even if the prior mean demand function only approximates the true expected demand. We shall approach the question of robustness in two ways. First, we prove that an OSLR policy will come arbitrarily close to finding the revenue-maximizing price as the discount factor nears 1, independent of the chosen prior distribution. Second, we show in several simulation settings that the heuristics continue to perform competitively even if the prior is misspecified.

Figure 4. (a) shows the total revenues achieved by each of the index functions by each time period. (b) displays expected per-period revenues achieved by the price that appears the best to each index strategy in each time period.
7.1. Convergence of OSLR Policies
One indication of the robustness of a search strategy is if
the price points selected converge to the revenue-maximizing
price when the prior distribution is poorly specied. How-
ever, when future revenues are discounted, such convergence
typically does not occur, even if one were to use the Bayes-
optimal search policy. This is due to the fact that under
discounting, present rewards associated with known price
levels will outweigh the potential future discounted rev-
enues that may be obtained at other prices, and hence an
optimal algorithm may eventually settle on a nonoptimal
price [12]. However, we can show that as the discounting
of future revenues vanishes, the sequence of prices selected
by OSLR policies will converge with probability one to a
price sequence whose limit is the revenue-maximizing price.
The convergence theorem presented in this section assumes that the prior demand function is a mixture of Dirichlet processes defined on a finite interval $P = [\underline{p}, \overline{p}]$, such that the marginal distribution of $\lambda(p)$ at any given value of $p$ has positive support over the range [0, 1] and such that these marginal distributions are left-continuous with right limits (LCRL) on $p \in P$. Let $(\Omega, \mathcal{F}, P)$ denote a probability space such that each outcome $\omega \in \Omega$ determines the demand function as well as the sequence of reservation prices of arriving customers. In stating this theorem, we denote by $\lambda(p; \omega)$ a fixed sample demand function, which is nonincreasing and left-continuous with right limits; $R(p; \omega) = p\lambda(p; \omega)$ is the corresponding sample revenue function. We let $p^*(\omega)$ be the price at which $R(p; \omega)$ attains its maximal value of $R^*(\omega)$ on $P$. Since $p^*(\omega)$ is unique except on a set of probability zero for a Dirichlet process as specified above, for simplicity we assume $p^*(\omega)$ is unique. We shall take the time horizon $N$ to be infinite and the discount factor $\delta$ to be less than one. Furthermore, let $D_\delta(\omega)$ denote the sample sequence of prices selected by a given OSLR policy when $\lambda(p; \omega)$ is the demand function and $\delta$ is the discount rate. Let $A_\delta(\omega)$ denote the set of accumulation points within $D_\delta(\omega)$. For $\epsilon > 0$ let $S_{\delta,\epsilon}$ denote the number of times that a given OSLR policy selects prices in the half-open interval $(p^*(\omega) - \epsilon, p^*(\omega)]$ when the discount factor is $\delta$.
The following theorem asserts that as the discount factor $\delta \to 1$, the only accumulation point of the selected prices in the limit can be $p^*$.
THEOREM 7.1: Suppose the conditions of the preceding paragraph hold. Then for all $\omega$ in a set of probability one, under either the OSLR-1 or the OSLR-2 policies, the sets $A_\delta(\omega)$ converge to $p^*(\omega)$, in the sense that $\max\{|a - p^*(\omega)| : a \in A_\delta(\omega)\} \to 0$ as $\delta \to 1$. Furthermore, for any $\epsilon > 0$,
\[ \liminf_{\delta \to 1} S_{\delta,\epsilon} = \infty. \]
The proof is contained in Appendix C.
7.2. Simulation Results for Misspecified Priors
The simulations described in the preceding sections all assumed that the specified prior distribution was accurate; that is, the sample demands were generated according to the prior distribution in each trial run. We now investigate the performance of the algorithms when the assumed priors do not match the actual sampling distribution. We look at three variations of prior mismatch: (a) misspecification of the mean function $m(p)$; (b) misspecification of the certainty parameter $c$; and (c) misspecification of the demand distribution (but properly specifying the mean and variance of the distribution).
To test the effect of misspecification of the mean function $m(p)$, we compared the effects of assuming a uniform (high-entropy) distribution of demand (as pictured by the function $f_0$ in Figure 5) when the true distribution of demand was more concentrated (low-entropy) in one region of the prices (as pictured by $f_1$ to $f_3$ in Figure 5) and vice versa. The performance of each OSLR policy was tested under settings of $c = 10$ and 100 for the certainty parameter; the value of the discount factor $\delta$ was set in each trial to 0.95. The results are displayed in Table 1. The methods are more robust when $m(p)$ corresponds to a higher entropy distribution than the true generating distribution. Performance degrades if $m(p)$ is a lower entropy distribution than the true distribution.
To test the effect of the choice of the certainty parameter $c$ on the performance of the search strategies, we ran simulations where the assumed value of $c$ again ranged in {10, 100}, $m(p) = f_0(p)$ as pictured in Figure 5, and the true sampling distribution was Dirichlet with parameters $c^* \in \{10, 100\}$ and mean function $m^*(p) = f_0(p)$. The prior mean revenue function was (11) with $a = 0.02$. The results are given in Table 2. Performance degrades in each case where the prior is misspecified (especially when the certainty parameter is set too high rather than too low), although the OSLR policies continue to outperform the static and passive policies.

Figure 5. Functions used in misspecification trials.

Table 1. Simulation results for misspecified mean demand m(p). Entries are % VoI ± 1.96 SE.

              m = f0,       m = f0,       m = f0,       m = f2,
              m* = f1       m* = f2       m* = f3       m* = f0
c = 10
  Passive     8.7 ± 0.1     15.8 ± 0.5    9.5 ± 0.2     29.3 ± 1.2
  OSLR-1      3.8 ± 0.1      6.4 ± 0.1    6.5 ± 0.0     24.5 ± 1.1
  OSLR-2      5.0 ± 0.2      6.4 ± 0.0    7.5 ± 0.1     26.4 ± 1.1
c = 100
  Passive     6.5 ± 0.1     11.3 ± 0.3    7.1 ± 0.1     24.5 ± 0.9
  OSLR-1      0.5 ± 0.0      7.3 ± 0.1    5.3 ± 0.0     23.5 ± 0.9
  OSLR-2      0.5 ± 0.0      7.1 ± 0.1    5.2 ± 0.0     24.4 ± 0.9

Note. Results are reported in terms of the percentage value of information (% VoI). m denotes the assumed prior mean demand; the value of m* is the mean demand actually used in generating demand functions.
Finally, we also tested what happens when the marginal means and variances of the prior roughly matched that of the true sampling distribution, but the shape of the sample demands deviated from that of the true demand. We let $P = \{1, \ldots, 50\}$ and assumed that the prior was Dirichlet with parameters $m(p) = f_0(p)$ and $c \in \{10, 100\}$; the true demand functions were instead sampled according to
\[ \lambda(p) = 1 - \Phi((p - U)/v), \tag{12} \]
where $\Phi$ is the cdf of the standard Normal distribution. Here, $U$ was generated according to the uniform distribution on $P$ and the parameter $v$ was chosen so that the variance of $\lambda(p)$ roughly matched that of the prior distribution. We picture 100 realizations of the prior Dirichlet distribution and the demand realizations as sampled according to (12) in Figure 6 for the case where $c = 10$ and $v = 250$. Under various parameter settings, these trials showed that the performance of the passive, OSLR-1, and OSLR-2 strategies all actually improved, with all three strategies performing equally well.

Table 2. Simulation results for misspecified c. Entries are % VoI ± 1.96 SE.

              c* = 10       c* = 100
c = 10
  Passive     26.5 ± 1.0    13.8 ± 0.6
  OSLR-1       8.3 ± 0.3     8.7 ± 0.2
  OSLR-2       9.4 ± 0.3    10.0 ± 0.2
c = 100
  Passive     25.5 ± 1.0    15.0 ± 0.7
  OSLR-1      11.1 ± 0.6     5.2 ± 0.3
  OSLR-2      10.4 ± 0.5     5.1 ± 0.2

Note. The value of c denotes the certainty parameter of the prior distribution; the value of c* is the certainty parameter used in generating priors.
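For concreteness, drawing one demand function from the alternative model (12) might look like the sketch below. The function name is hypothetical and the scale value in the example is chosen only for illustration; $U$ is drawn uniformly from the discrete price set, which is one reading of "the uniform distribution on P".

```python
import random
from statistics import NormalDist

def sample_probit_demand(prices, v, rng):
    """Draw one demand function lambda(p) = 1 - Phi((p - U) / v),
    with U uniform on the (discrete) price set."""
    u = rng.choice(prices)
    cdf = NormalDist().cdf  # standard Normal cdf
    return [1.0 - cdf((p - u) / v) for p in prices]

prices = list(range(1, 51))
demand = sample_probit_demand(prices, v=10.0, rng=random.Random(7))
print(demand[:5])
```

Each realization is a smooth, nonincreasing curve that passes through 0.5 at the sampled location $U$, in contrast to the step-function realizations of a Dirichlet prior.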
8. CONCLUSION
This paper has introduced a novel Bayesian model for demand modeling involving the Dirichlet distribution that is flexible, intuitive, and easy to specify and that can be tractably updated for a variety of data formats. A key property of these distributions is that the posterior marginal distributions of demand at each price level are readily available from these updates. Following a tradition of practice in operations research and reinforcement learning, we developed search strategies that rely on figures of merit computed from these marginal distributions. These figures of merit were motivated by a direct approximation of the optimal reward-to-go function in the Bellman equation corresponding to a partially observed Markov decision process. The use of marginal distributions rather than the full joint distribution of demand greatly simplifies computation and can be partially justified by the localized nature of correlations among various demand values in the Dirichlet model. The OSLR pricing strategies we developed in particular were shown in a variety of representative simulation trials to achieve high revenues relative to a known upper bound on performance and achieve large improvements over both static and passive pricing strategies. Moreover, these strategies were shown to be robust in the sense that they converged to the optimal price as the effects of discounting were removed, and they continued to achieve high performance even when the prior distribution was misspecified.

Figure 6. Top: 100 demand distributions sampled according to the prior (a) and according to the true demand distribution (b). Bottom: Histograms of the distribution of $\lambda(p)$ at $p = 25$ under the Dirichlet prior (a) and the true sampling distribution (b).
APPENDIX
A. Exact Formulas for Posterior Distributions
We now present analytical expressions for the posterior distribution that results from updating a $D(c_0, \ldots, c_k)$ prior based on a history $H$ (see (2)). Because the reservation prices of customers are assumed to form an iid sequence and are independent of their arrival process, it is sufficient to write the history $H$ as
\[ H = (s_1, f_1, \ldots, s_k, f_k), \]
where $s_i$ is the number of purchases (successes) and $f_i$ is the number of refusals (failures) that have been observed when the posted price was $z_i$, for each $z_i \in P$. This representation will be more useful for expressing the formulas we derive below. In addition, let $A = (A_1, \ldots, A_k)$ represent the random number of customer arrivals corresponding to each price in $P$, given the sequence of chosen prices $p_1, \ldots, p_n$. Finally, let $t = (t_1, \ldots, t_k)$, where $t_i = s_i + f_i$ for each $i$.
In order to compute the joint posterior density of $\lambda = (\lambda(z_1), \ldots, \lambda(z_k))$, let $x = (x_1, \ldots, x_k)$ with $1 \ge x_1 \ge \cdots \ge x_k \ge 0$, and let $a_j = c_j$, $j = 0, \ldots, k$. Then by Bayes' rule and the model assumptions regarding independence of customer arrivals and reservation prices,
\[ P\{\lambda \in dx \mid H, A = t\} = \frac{P\{H \mid A = t, \lambda \in dx\}\, P\{\lambda \in dx\}}{P\{H \mid A = t\}}, \tag{13} \]
where
\[ P\{H \mid A = t, \lambda \in dx\} = x_1^{s_1}(1 - x_1)^{f_1} \cdots x_k^{s_k}(1 - x_k)^{f_k}, \]
\[ P\{\lambda \in dx\} = \frac{\Gamma(c)}{\Gamma(a_0) \cdots \Gamma(a_k)} (1 - x_1)^{a_0 - 1}(x_1 - x_2)^{a_1 - 1} \cdots x_k^{a_k - 1}\, dx_1 \cdots dx_k, \]
\[ P\{H \mid A = t\} = \frac{\Gamma(c)}{\Gamma(a_0) \cdots \Gamma(a_k)} \int_0^1 x_1^{s_1}(1 - x_1)^{a_0 + f_1 - 1} \int_0^{x_1} (x_1 - x_2)^{a_1 - 1} x_2^{s_2}(1 - x_2)^{f_2} \cdots \int_0^{x_{k-1}} (x_{k-1} - x_k)^{a_{k-1} - 1} x_k^{a_k + s_k - 1}(1 - x_k)^{f_k}\, dx_k \cdots dx_1. \tag{14} \]
The innermost integral of (14) can be evaluated as
\[ \int_0^{x_{k-1}} (x_{k-1} - x_k)^{a_{k-1} - 1} x_k^{a_k + s_k - 1}(1 - x_k)^{f_k}\, dx_k = \sum_{n_k = 0}^{f_k} \binom{f_k}{n_k} \beta(a_k + s_k, a_{k-1} + n_k)\, x_{k-1}^{a_{k-1} + a_k + s_k + n_k - 1}(1 - x_{k-1})^{f_k - n_k}, \tag{15} \]
where $\beta(\cdot, \cdot)$ is the Beta function. Now, letting $s^+_j = s_j + \sum_{i=j+1}^{k} (a_i + s_i + n_i)$ and $f^+_j = f_j + \sum_{i=j+1}^{k} (f_i - n_i)$, we may rewrite (15) as
\[ \sum_{n_k = 0}^{f^+_k} \binom{f^+_k}{n_k} \beta(a_k + s^+_k, a_{k-1} + n_k)\, x_{k-1}^{a_{k-1} + s^+_{k-1} - s_{k-1} - 1}(1 - x_{k-1})^{f^+_{k-1} - f_{k-1}}. \tag{16} \]
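The identity (15) follows from expanding $(1 - x_k)^{f_k}$ binomially around $x_{k-1}$ and recognizing a Beta integral in each term. As a sanity check only, the sketch below compares a midpoint-rule approximation of the left-hand side with the finite sum on the right for one arbitrarily chosen set of parameter values; the function names and the numerical values are hypothetical.

```python
import math

def beta_fn(a, b):
    """Beta function via the Gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def lhs_integral(y, a_prev, b, f, steps=20000):
    """Midpoint rule for int_0^y (y - x)^(a_prev - 1) x^(b - 1) (1 - x)^f dx,
    where y = x_{k-1}, a_prev = a_{k-1}, b = a_k + s_k and f = f_k."""
    h = y / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += (y - x) ** (a_prev - 1) * x ** (b - 1) * (1 - x) ** f
    return total * h

def rhs_sum(y, a_prev, b, f):
    """Finite sum on the right-hand side of (15)."""
    return sum(
        math.comb(f, n) * beta_fn(b, a_prev + n)
        * y ** (a_prev + b + n - 1) * (1 - y) ** (f - n)
        for n in range(f + 1)
    )

# Illustrative values: a_{k-1} = 2, a_k = 1.5, s_k = 3, f_k = 2, x_{k-1} = 0.6
left = lhs_integral(0.6, 2.0, 1.5 + 3, 2)
right = rhs_sum(0.6, 2.0, 1.5 + 3, 2)
print(left, right)
```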
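The next-to-innermost integral in (14) can be evaluated as
\[ \sum_{n_k = 0}^{f^+_k} \binom{f^+_k}{n_k} \beta(a_k + s^+_k, a_{k-1} + n_k) \int_0^{x_{k-2}} (x_{k-2} - x_{k-1})^{a_{k-2} - 1} x_{k-1}^{a_{k-1} + s^+_{k-1} - 1}(1 - x_{k-1})^{f^+_{k-1}}\, dx_{k-1} \]
\[ = \sum_{n_k = 0}^{f^+_k} \binom{f^+_k}{n_k} \beta(a_k + s^+_k, a_{k-1} + n_k) \sum_{n_{k-1} = 0}^{f^+_{k-1}} \binom{f^+_{k-1}}{n_{k-1}} \beta(a_{k-1} + s^+_{k-1}, a_{k-2} + n_{k-1})\, x_{k-2}^{a_{k-2} + s^+_{k-2} - s_{k-2} - 1}(1 - x_{k-2})^{f^+_{k-2} - f_{k-2}}. \tag{17} \]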
Comparing the terms in (17) with (16), we see that we can proceed inductively to find that
\[ P\{H\} = \frac{\Gamma(c)}{\Gamma(a_0) \cdots \Gamma(a_k)} \sum_{n_k = 0}^{f^+_k} C^+_k(n_k) \sum_{n_{k-1} = 0}^{f^+_{k-1}} C^+_{k-1}(n_{k-1}, n_k) \cdots \sum_{n_1 = 0}^{f^+_1} C^+_1(n_1, \ldots, n_k), \tag{18} \]
where
\[ C^+_j(n_j, \ldots, n_k) = \binom{f^+_j}{n_j} \beta(a_j + s^+_j, a_{j-1} + n_j), \quad j = 1, \ldots, k. \]
In order to obtain the marginal density at the point $z_i$, we integrate the numerator of (13) over $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k$. The derivation of this quantity is similar to that of (18). Let
\[ s^-_j = s_j + \sum_{i=1}^{j-1} (s_i - n_i), \]
\[ f^-_j = f_j + \sum_{i=1}^{j-1} (a_{i-1} + f_i + n_i), \]
\[ C^-_j(n_1, \ldots, n_j) = \binom{s^-_j}{n_j} \beta(a_{j-1} + f^-_j, a_j + n_j), \quad j = 1, \ldots, k. \]
Then the posterior marginal distribution is of the form
\[ P(\lambda(z_i) \in dx_i \mid H, A = t) = \frac{N^-(x_i)\, N^+(x_i)}{D}\, dx_i, \tag{19} \]
where
\[ N^-(x_i) = \sum_{n_1 = 0}^{s^-_1} C^-_1(n_1) \cdots \sum_{n_{i-1} = 0}^{s^-_{i-1}} C^-_{i-1}(n_1, \ldots, n_{i-1})\, x_i^{s^-_i - s_i/2}(1 - x_i)^{a_{i-1} + f^-_i - f_i/2 - 1}, \]
\[ N^+(x_i) = \sum_{n_k = 0}^{f^+_k} C^+_k(n_k) \cdots \sum_{n_{i+1} = 0}^{f^+_{i+1}} C^+_{i+1}(n_{i+1}, \ldots, n_k)\, x_i^{a_i + s^+_i - s_i/2 - 1}(1 - x_i)^{f^+_i - f_i/2}, \]
\[ D = \sum_{n_k = 0}^{f^+_k} C^+_k(n_k) \cdots \sum_{n_1 = 0}^{f^+_1} C^+_1(n_1, \ldots, n_k). \]
B. Gibbs Sampler for Approximating the Posterior
Demand Distribution
In this Appendix, we shall suppose that $N$ customers have entered the shop by the current time, and let $p_n$ denote the price that was posted when the $n$th customer entered the shop, $n = 1, \ldots, N$. Furthermore, let $a_n = 1$ or $0$ according to whether the $n$th customer purchased the good. This algorithm is directly adapted from the Gibbs sampler described by Kuo and Smith [23].
1. Let $M$ be a positive integer. For $m = 1, \ldots, M$, do:
(a) Initialize the vector $\theta^{(0)} = (\theta^{(0)}_0, \ldots, \theta^{(0)}_k)$ such that $\theta^{(0)}_i = \alpha_i$ for $i = 1, \ldots, k$. Set $K \ge 0$ to be an iteration limit, and set the current iteration number $t = 0$.
(b) For $n = 1, \ldots, N$, assign a reservation price $r_n$ to customer $n$ randomly according to the following probabilities:
\[ P(r_n \in [z_i, z_{i+1})) = \begin{cases} \theta^{(t)}_i \Big/ \sum_{j : z_j \ge p_n} \theta^{(t)}_j & \text{if } z_i \ge p_n \text{ and } a_n = 1, \\ \theta^{(t)}_i \Big/ \sum_{j : z_j < p_n} \theta^{(t)}_j & \text{if } z_i < p_n \text{ and } a_n = 0, \\ 0 & \text{otherwise.} \end{cases} \]
(c) Let $n_i$ be the number of customers whose reservation price was imputed to be in the range $[z_i, z_{i+1})$ in the previous step, for $i = 0, \ldots, k$. Randomly generate a vector $\theta^{(t+1)}$ according to a Dirichlet distribution with parameter vector $(\gamma^{(m,t)}_0, \ldots, \gamma^{(m,t)}_k)$, where for $i = 0, \ldots, k$, $\gamma^{(m,t)}_i = c\,\theta^{(t)}_i + n_i$.
(d) Let $t = t + 1$. If $t \le K$, then return to step (b). Otherwise, return the vector $(\gamma^{(m,K)}_0, \ldots, \gamma^{(m,K)}_k)$.
2. Estimate the posterior density function of $\lambda(z_i)$ for any $z_i \in P$ according to
\[ \frac{1}{M} \sum_{m=1}^{M} B\Big(x;\ \sum_{j=i}^{k} \gamma^{(m,K)}_j,\ \sum_{j=0}^{i-1} \gamma^{(m,K)}_j\Big), \]
where $B(x; a, b)$ denotes the density at $x$ of a Beta random variable with parameters $(a, b)$.
The estimated marginal posterior density function is therefore a mixture of $M$ Beta densities. The mean and the variance of the demand $\lambda(z_i)$ under this mixture distribution are easy to compute. Let $a_m = \sum_{j=i}^{k} \gamma^{(m,K)}_j$ and $b_m = \sum_{j=0}^{i-1} \gamma^{(m,K)}_j$ denote the parameters of the $m$th Beta density function in the mixture, and let $c_m = a_m + b_m$. Then
\[ E\lambda(z_i) = \frac{1}{M} \sum_{m=1}^{M} \frac{a_m}{c_m}, \]
\[ \mathrm{Var}(\lambda(z_i)) = \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{a_m b_m}{c_m^2 (c_m + 1)} + \left( \frac{a_m}{c_m} - E\lambda(z_i) \right)^2 \right]. \]
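The following sketch shows one way the sampler above could be coded. It is an illustration under stated assumptions, not the implementation used for the reported experiments: all names are hypothetical, posted prices are given as grid indices, the Dirichlet draw in step (c) uses the standard conjugate parameters (prior parameter plus imputed count), which is an assumption about the exact form of that step, and each chain runs K + 1 imputation sweeps.

```python
import random

def dirichlet(rng, params):
    """Draw from a Dirichlet distribution via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(g)
    return [x / total for x in g]

def gibbs_beta_mixture(prior, posted, bought, i, K=10, M=100, seed=0):
    """Return M Beta parameter pairs (a_m, b_m) for demand at price z_i.

    prior  : Dirichlet parameters for cells 0..k; cell j holds reservation
             prices in [z_j, z_{j+1})
    posted : grid index q in 1..k of the price z_q shown to each customer
    bought : 1 if that customer purchased, else 0
    """
    rng = random.Random(seed)
    k = len(prior) - 1
    total = sum(prior)
    components = []
    for _ in range(M):
        theta = [a / total for a in prior]  # start each chain at the prior mean
        for _ in range(K + 1):
            counts = [0] * (k + 1)
            for q, sale in zip(posted, bought):
                # A purchase implies a reservation price >= z_q, a refusal < z_q.
                cells = list(range(q, k + 1)) if sale else list(range(q))
                j = rng.choices(cells, weights=[theta[c] for c in cells])[0]
                counts[j] += 1
            # Assumed conjugate update: prior parameter plus imputed count.
            gamma = [prior[j] + counts[j] for j in range(k + 1)]
            theta = dirichlet(rng, gamma)
        components.append((sum(gamma[i:]), sum(gamma[:i])))
    return components

def mixture_mean_var(components):
    """Mean and variance of an equally weighted mixture of Beta(a_m, b_m)."""
    M = len(components)
    means = [a / (a + b) for a, b in components]
    mean = sum(means) / M
    var = sum(a * b / ((a + b) ** 2 * (a + b + 1)) + (mu - mean) ** 2
              for (a, b), mu in zip(components, means)) / M
    return mean, var

# 50 customers all saw price z_1: 40 bought, 10 refused (illustrative data)
comps = gibbs_beta_mixture([1.0, 1.0, 1.0], [1] * 50, [1] * 40 + [0] * 10, i=1, K=2, M=5)
print(mixture_mean_var(comps))
```

In the example every customer saw the lowest grid price, so the marginal at that price does not depend on the imputations and each mixture component is Beta(42, 11); this makes the output easy to check by hand.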
C. Proofs
C.1. Proof of Proposition 5.1
PROOF: (a) Under the assumption of a continuous positive posterior den-
sity for (p), v(p) is continuous and strictly increasing in r. In
addition, for values of r > p,
d
n
r > E
n
_
p(p) + d
n+1
max{p(p), r}
_
,
and the reverse inequality holds for r < 0. Therefore, 0
r
2
(p) p. As r
2
(p) is dened as the largest value for which
equality holds in (10), it follows that if, for a given r, d
n
r
E
n
_
p(p) + d
n+1
max{p(p), r}
_
, then r r
2
(p).
(b) Note rst that r
1
= p
(p
) by denition. Then
v
1
(p
) = r
1
+
N(n+1)
t =0
t
E
n
max{p
(p
), r
1
}
Nn
t =0
t
r
1
= d
n
r
1
.
From the result of part (a), it follows that r
1
r
2
(p
). By the
monotonicity of v(p) in r, v
1
(p
) v
2
(p
).
(c) Suppose for some price p, r
1
> r
2
(p). From the proof of part (b),
v
1
(p
) d
n
r
1
, and from the result of part (a), it follows that
d
n
r
1
> E
n
_
p(p) + d
n+1
max{p(p), r
1
}
_
= v
1
(p).
Therefore, v
1
(p
) > v
1
(p). Finally, v
1
(p) > v
2
(p) by the
monotonicity of v(p) in r.
(d) If price $p^*$ has been observed by time $n$ and Assumption 1 holds, then the posterior distribution of $\theta(p^*)$ is a point mass. Therefore, $r_1 = r_2(p^*)$ and $v_1(p^*) = v_2(p^*)$. Suppose now that $v_1(p) > v_1(p^*)$. According to part (c), $r_2(p) \ge r_1$, which implies that $v_2(p) \ge v_1(p)$ by the monotonicity of $v(p)$ in $r$. Therefore $v_2(p) > v_2(p^*)$.
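As a numerical illustration of part (a), the following Python sketch locates the value $r_2(p)$ at which $d_n r$ equals $E_n[p\theta(p) + \delta d_{n+1}\max\{p\theta(p), r\}]$. It rests on assumptions that are not part of the proof: the posterior of $\theta(p)$ is taken to be uniform on $[0,1]$, for which $E\max\{p\theta, r\} = p/2 + r^2/(2p)$ when $0 \le r \le p$, and $d_n$ is taken to be $\sum_{t=0}^{N-n} \delta^t$. All function names are hypothetical.

```python
def discount_sum(delta, steps):
    """Return sum_{t=0}^{steps} delta**t."""
    return sum(delta ** t for t in range(steps + 1))

def index_gap(r, p, delta, remaining):
    """d_n * r minus E_n[p*theta + delta*d_{n+1}*max(p*theta, r)].

    Assumes theta ~ Uniform(0, 1) and 0 <= r <= p; `remaining` plays the role of N - n >= 1.
    """
    d_n = discount_sum(delta, remaining)
    d_next = discount_sum(delta, remaining - 1)
    expected_max = p / 2 + r * r / (2 * p)  # closed form for the uniform case
    return d_n * r - (p / 2 + delta * d_next * expected_max)

def recourse_index(p, delta, remaining, tol=1e-12):
    """Bisection on [0, p]; the gap is negative at 0 and positive at p."""
    lo, hi = 0.0, p
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if index_gap(mid, p, delta, remaining) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical inputs: unit price, discount factor 0.9, ten periods remaining.
r2 = recourse_index(p=1.0, delta=0.9, remaining=10)
```

In this uniform case the gap is concave in $r$ on $[0, p]$, so the root found by bisection is the only one there. It lies strictly between the posterior mean revenue $p/2$ and $p$, in line with the bounds $0 \le r_2(p) \le p$ and with $r_1 \le r_2(p^*)$ from part (b).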
C.2. Proof of Theorem 7.1

In the following proof, we shall assume that $\omega$ is fixed and omit it from the notation except when needed for clarity. In particular, we shall use the convention that $\theta(p; \omega)$ represents the true value of the demand function at $p$, but $\theta(p)$ is a random variable distributed according to the marginal posterior distribution at a given time. We also use the convention that $\theta(a^+; \omega) \equiv \lim_{p \to a^+} \theta(p; \omega)$.
PROOF: We begin by proving the second statement of the theorem, that $\liminf_{\delta \to 1} S_{\delta,\epsilon} = \infty$. Suppose there exists a number $M$ and a positive sequence $\delta_i \to 1$ such that $S_{\delta_i,\epsilon} \le M$ for all $i$. We fix $\epsilon$ and let $S_i = S_{\delta_i,\epsilon}$ for convenience and let $I$ denote the interval $(p^* - \epsilon, p^*]$. Define $D_i = (p_{in} : n \ge 1)$ as the sequence of prices selected by the given OSLR policy when the discount factor is $\delta_i$.
Rather than use (9) to express the figure of merit associated with the given OSLR strategy, we shall find it more convenient to use the equivalent figure
\[ v_{in}(p) = (1 - \delta_i)\, E_{in}\, p\theta(p) + \delta_i\, E_{in} \max\{p\theta(p), r_{in}(p)\}, \qquad (20) \]
where $r_{in}(p)$ denotes the value assigned to the recourse reward for price $p$ by the given OSLR policy and $E_{in}$ denotes expectation with respect to the posterior demand distribution at time $n$ when discounting at rate $\delta_i$. Because $v_{in}(p)$ equals (9) times the positive constant $(1 - \delta_i)$, a policy that chooses prices at each time step $n$ on the basis of $v_{in}(p)$ is equivalent to the given OSLR policy.
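The figure of merit (20) is straightforward to evaluate numerically once a marginal posterior for the demand at $p$ is fixed. The sketch below assumes, purely for illustration, a single Beta$(a, b)$ posterior with $a, b \ge 1$ and approximates both expectations with a midpoint rule. The function names, the `grid` parameter, and the choice of quadrature are assumptions, not part of the original text.

```python
import math

def beta_pdf(x, a, b):
    """Density at x (0 < x < 1) of a Beta(a, b) random variable."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def oslr_value(p, r, delta, a, b, grid=20000):
    """(1 - delta) * E[p*theta] + delta * E[max(p*theta, r)] for theta ~ Beta(a, b)."""
    h = 1.0 / grid
    mean_revenue = 0.0
    mean_with_recourse = 0.0
    for j in range(grid):
        x = (j + 0.5) * h               # midpoint of the j-th cell
        weight = beta_pdf(x, a, b) * h  # approximate probability mass of the cell
        mean_revenue += weight * p * x
        mean_with_recourse += weight * max(p * x, r)
    return (1 - delta) * mean_revenue + delta * mean_with_recourse

# Hypothetical inputs: price 1, recourse reward 0.5, discount 0.9, uniform posterior.
v = oslr_value(p=1.0, r=0.5, delta=0.9, a=1.0, b=1.0)
```

Under an OSLR-type rule, a pricing step would evaluate this quantity for each candidate price and select the maximizer; with a recourse reward of zero the value reduces to the posterior mean revenue.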
As we assume only finitely many prices within $I$ are selected for each $\delta_i$, the sequence of prices $D_i$ must accumulate at some point $a_i \in I^c \cup \{p^*\}$. Let $(n_k : k \ge 1)$ be a sequence of times such that $p_{in_k} \to a_i$ as $k \to \infty$ (note that the sequence $n_k$ may depend on $i$). According to our hypotheses, we may assume that $p_{in_k} \notin I$ for all $k$, so that, e.g., if $a_i = p^*$, the sequence $p_{in_k}$ approaches $p^*$ from the right.

The Dirichlet process is a pure jump process [15]; therefore, $\theta(p; \omega)$ decreases only at a countable number of discontinuity points in $[\underline{p}, \overline{p}]$. Because $\theta(p; \omega)$ is left-continuous with right limits (LCRL), at a point of discontinuity $p$ we have $\theta(p^+; \omega) < \theta(p; \omega)$. It is easy to see that under these conditions, the point $p^*$ will be a point of discontinuity, and the supremum of the function $R(p; \omega)$ on $P$ is attained at $p^*$.
As we assume that the prior marginal distributions of $\theta(p)$ on $P$ and the sample demands are both LCRL, the posterior marginal distributions of $\theta(p)$ at any given time $n$ will also be LCRL in $p$. (This follows from the continuity of the Beta distributions appearing in the posterior distribution in Appendix A.) Also, in the limit as $n \to \infty$, the posterior marginal distributions of $\theta(p)$ may be discontinuous whenever an accumulation point of the price sequence is a point of discontinuity in $\theta(p; \omega)$. If the price sequence $p_{in_k}$ approaches the accumulation point $a_i$ from the right, then the right limit of the limiting posterior distribution of $\theta(a_i)$ is concentrated at $\theta(a_i^+; \omega)$. On the other hand, if $p_{in_k}$ approaches $a_i$ from the left, then the left limit of the limiting posterior of $\theta(a_i)$ is $\theta(a_i; \omega)$.
We now confine our attention to the OSLR-1 policy. In this case, $r_{in_k} \equiv r_{in_k}(p)$ is constant over $P$. We begin by showing that $\limsup_k r_{in_k} \le R(a_i; \omega)$ and $\liminf_k r_{in_k} \ge R(a_i^+; \omega)$. To see this, suppose first that $\liminf_k r_{in_k} < R(a_i^+; \omega)$. Since $\liminf_k E_{in_k}\, p_{in_k}\theta(p_{in_k}) \ge R(a_i^+; \omega)$, then for large enough $k$, $E_{in_k}\, p_{in_k}\theta(p_{in_k}) > r_{in_k}$, contradicting the definition of $r_{in_k}$ for the OSLR-1 policy. Now suppose $\limsup_k r_{in_k} > R(a_i; \omega)$. Let $p^*_{in_k}$ be the price at time $n_k$ that maximizes $E_{in_k} R(p)$ over $p \in P$; by definition $r_{in_k} = E_{in_k} R(p^*_{in_k})$. Then $\limsup_k r_{in_k} > R(a_i; \omega)$ implies $p^*_{in_k} \ne p_{in_k}$ for infinitely many $k$. Since the limiting posterior distribution of $\theta(p_{in_k})$ has support on a set none of whose members exceed $\theta(a_i; \omega)$, then for large enough $k$, $v_{in_k}(p^*_{in_k}) > v_{in_k}(p_{in_k})$. Under the OSLR-1 policy, this cannot be the case if $p_{in_k}$ is selected at time $n_k$, and so we have established that $\limsup_k r_{in_k} \le R(a_i; \omega)$ and $\liminf_k r_{in_k} \ge R(a_i^+; \omega)$. Also, as a consequence, $\liminf_k v_{in_k}(p_{in_k}) \ge R(a_i^+; \omega)$ and $\limsup_k v_{in_k}(p_{in_k}) \le R(a_i; \omega)$.
Define $q < p^*$ to be a point in $I$ such that $R(q; \omega) > \sup_{p \notin I} R(p; \omega)$ and $\theta(\cdot\,; \omega)$ is continuous at $q$. (There is always such a point regardless of the choice of $\epsilon$.) According to our hypothesis that no more than $M$ samples are taken within $I$ for all $i$, the support of the limiting distribution of $q\theta(q)$ will be $[0, q]$, and since $R(q; \omega) > R^*(a_i) \ge \limsup_k R(p_{in_k}; \omega)$, $r_{in_k}$ must eventually be in the interior of this support for large enough $k$. Therefore, there will also be a positive number $\eta > 0$ such that $E_{in_k} \max\{q\theta(q), r_{in_k}\} \ge r_{in_k} + \eta$ almost surely for all $i$ and large enough $k$. However, note that
\[ \limsup_k v_{in_k}(q) = \limsup_k \left[ (1 - \delta_i)\, E_{in_k}\, q\theta(q) + \delta_i\, E_{in_k} \max\{q\theta(q), r_{in_k}\} \right] \ \ge\ \left[ \limsup_k\, (1 - \delta_i)\, E_{in_k}\, q\theta(q) \right] + \delta_i \left( R^*(a_i) + \eta \right). \qquad (21) \]
Since $\eta$ does not depend on $i$, we can choose $i$ large enough so that (21) is larger than $R^*(a_i)$. This implies that
\[ \limsup_k v_{in_k}(q) > \limsup_k v_{in_k}(p_{in_k}), \]
and thus $p_{in_k}$ was selected (when $k$ was large) even though there was another price $q$ with a higher value of $v_{in_k}(q)$. This contradicts the premise that prices were selected according to the OSLR-1 policy, and we conclude that for this policy, $S_i \to \infty$ as $i \to \infty$.
Similar logic can be used to show that the OSLR-2 policy also satisfies the second statement of the theorem. From (10), the recourse reward for the OSLR-2 policy satisfies the equation
\[ r_{in_k}(p) = v_{in_k}(p) \qquad (22) \]
and for a fixed posterior marginal distribution of $\theta(p)$ with positive support on $[0,1]$, as $\delta_i \to 1$, the value of $r_{in_k}$ satisfying (22) approaches $p$, which is the upper limit of the support of the distribution of $R(p)$. For any $\eta > 0$ we can find $i$ large enough such that $r_{in_k}(q) > q - \eta$ for all $k$ large enough. Since $\limsup_k r_{in_k}(p_{in_k}) = R^*(a_i)$, we can therefore find $i$ large enough such that $r_{in_k}(q) > r_{in_k}(p_{in_k})$ for all $k$ large enough. This again leads to a contradiction, since we assume that in following an OSLR-2 policy we select, at all times $n_k$, the price that maximizes $r_{in_k}(p)$. This establishes the second statement of the theorem.
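The limiting behavior of the solution to (22) can be checked in a special case. Assuming, for illustration only, that the posterior of the demand at $p$ is uniform on $[0,1]$, equation (22) becomes $r = p/2 + \delta r^2/(2p)$, whose root in $[0, p]$ is $r = p/(1 + \sqrt{1 - \delta})$. The Python sketch below (function names hypothetical) evaluates this root and the residual of (22) as $\delta$ approaches 1.

```python
import math

def oslr2_recourse_uniform(p, delta):
    """Root in [0, p] of r = v(p; r) when theta ~ Uniform(0, 1) (illustrative assumption)."""
    return p / (1.0 + math.sqrt(1.0 - delta))

def fixed_point_residual(r, p, delta):
    """r minus the right-hand side of (22), using E max{p*theta, r} = p/2 + r^2/(2p)."""
    value = (1 - delta) * p / 2 + delta * (p / 2 + r * r / (2 * p))
    return r - value

# Recourse reward at unit price for discount factors approaching 1.
deltas = [0.5, 0.9, 0.99, 0.999999]
rewards = [oslr2_recourse_uniform(1.0, d) for d in deltas]
residuals = [fixed_point_residual(r, 1.0, d) for r, d in zip(rewards, deltas)]
```

The rewards increase with the discount factor and approach the price itself, the upper limit of the support of $R(p)$, which is the property used in the argument above.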
We now prove the first statement of the theorem. Suppose that under either the OSLR-1 or OSLR-2 policy, there is a subsequence $(i_j : j \ge 1)$ and a value $\epsilon > 0$ and interval $I = (p^* - \epsilon, p^*]$ such that there exists for all $j$ an $a \in A_{i_j}$ with $a \notin I$. Since we have established that $\liminf_{\delta \to 1} S_{\delta,\epsilon} = \infty$ for any choice of $\epsilon$, then $\lim_{j \to \infty} \limsup_{n \to \infty} v_{i_j n}(p^*) = R^*$ by the left-continuity of the limiting posterior distribution. Let $\gamma = \min\{R^* - R(p; \omega) : p \notin I\} > 0$, and select $j$ large enough so that $\limsup_n v_{i_j n}(p^*) > R^* - \gamma$. For $a \in A_{i_j}$ such that $a \notin I$, let $(n_k : k \ge 1)$ be a subsequence such that $\lim_{k \to \infty} p_{i_j n_k} = a$ and $p_{i_j n_k} \notin I$ for all $k$. Then $v_{i_j n_k}(p_{i_j n_k}) \to R^*(a; \omega) \le R^* - \gamma$ almost surely. This implies that for large $k$, $p_{i_j n_k}$ is selected by the OSLR policy even though $v(p^*)$ is larger. This contradiction establishes the theorem.
ACKNOWLEDGMENTS
The author thanks IBM T.J. Watson Research Center in
Yorktown Heights, New York, for support in conducting this
research. Also, the helpful comments of Martin Puterman are
gratefully acknowledged.
REFERENCES
[1] P. Aghion et al., Optimal learning by experimentation, Rev Econ Stud 58 (1991), 621-654.
[2] Y. Aviv and A. Pazgal, Pricing of short life-cycle products through active learning, Working paper, The John M. Olin School of Business, Washington University, St. Louis, MO, 2002.
[3] W.L. Baker et al., Getting prices right on the Web, McKinsey Q 2 (2001).
[4] S. Baliga and R. Vohra, Market research and market design, Adv Theoret Econ 3 (2003), article 5.
[5] R.J. Balvers and T.F. Cosimano, Actively learning about demand and the dynamics of price adjustment, Econ J 100 (1990), 882-898.
[6] A.X. Carvalho and M.L. Puterman, Dynamic pricing and learning over short time horizons, Working paper, Department of Statistics, University of British Columbia, Vancouver, BC, 2003.
[7] A.X. Carvalho and M.L. Puterman, Dynamic pricing for demand learning in e-commerce, Working paper, Department of Statistics, University of British Columbia, Vancouver, BC, 2004.
[8] E.W. Cope, Regret and convergence bounds for immediate-reward reinforcement learning problems with continuous action spaces, Working paper, University of British Columbia, Vancouver, BC, Canada, 2004.
[9] S.L. Dalal and G.J. Hall, Jr., On approximating parametric Bayes models with nonparametric Bayes models, Ann Stat 8 (1980), 664-672.
[10] P. Dasgupta and R. Das, Dynamic pricing with limited competitor information in a multi-agent economy, Coordination Languages and Models: 4th International Conference (COORDINATION 2000), New York, Springer-Verlag, 2000.
[11] J.M. DiMicco, A. Greenwald, and P. Maes, Dynamic pricing strategies under a finite time horizon, Proceedings of the 3rd ACM Conference on Electronic Commerce (EC'01), New York, ACM Press, 2001.
[12] D. Easley and N.M. Kiefer, Controlling a stochastic process with unknown parameters, Econometrica 56 (1988), 1045-1064.
[13] D. Easley and N.M. Kiefer, Optimal learning with endogenous data, Int Econ Rev 30 (1989), 963-978.
[14] W. Elmaghraby and P. Keskinocak, Dynamic pricing in the presence of inventory considerations: Research overview, current practices and future directions, Manage Sci 49 (2003), 1287-1309.
[15] T.S. Ferguson, A Bayesian analysis of some nonparametric problems, Ann Stat 1 (1973), 209-230.
[16] J.C. Gittins, Multi-Armed Bandit Allocation Indices, New York, Wiley, 1989.
[17] J. Ginebra and M.K. Clayton, Response surface bandits, J R Stat Soc B 57 (1995), 771-784.
[18] A.R. Greenwald and J.O. Kephart, Shopbots and pricebots, Proceedings of the 5th International Conference on Artificial Intelligence (IJCAI'99), San Mateo, CA, Morgan Kaufmann, 1999, pp. 506-511.
[19] S.J. Grossman, R.E. Kihlstrom, and L.J. Mirman, A Bayesian approach to the production of information and learning by doing, Rev Econ Stud 44 (1977), 533-547.
[20] J.G. Ibrahim, M.-H. Chen, and D. Sinha, Bayesian Survival Analysis, New York, Springer-Verlag, 2001.
[21] L.P. Kaelbling, Learning in Embedded Systems, Cambridge, MA, MIT Press, 1993.
[22] R. Kleinberg and F.T. Leighton, The value of knowing a demand curve: Bounds on regret for online posted-price auctions, Working paper, Department of Mathematics, MIT, 2004.
[23] L. Kuo and A.F.M. Smith, Bayesian computations in survival models via the Gibbs sampler, Survival Analysis: State of the Art, J.P. Klein and P.K. Goel (Editors), Boston, Kluwer, 1992, pp. 11-24.
[24] H. Kushner, A new method of locating the maximum of an arbitrary multipeak curve in the presence of noise, J Basic Eng 86 (1963), 97-106.
[25] B. Leloup and L. Deveaux, Dynamic pricing on the Internet: Theory and simulation, J Electron Comm Res 1 (2001), 53-64.
[26] K. Lin, Dynamic pricing with real-time demand learning, Eur J Operations Res, forthcoming.
[27] M.S. Lobo and S. Boyd, Pricing and learning with uncertain demand, Working paper, The Fuqua School of Business, Duke University, Durham, NC, 2003.
[28] M.V. Marn and R.L. Rosiello, Managing price, gaining profit, Harv Bus Rev (Sept/Oct 1992), 84-93.
[29] J.I. McGill and G.J. van Ryzin, Revenue management: Research overview and prospects, Transport Sci 33 (1999), 233-256.
[30] A. McLennan, Price dispersion and incomplete learning in the long run, J Econ Dynam Contr 7 (1984), 331-347.
[31] L.J. Mirman, L. Samuelson, and A. Urbano, Monopoly experimentation, Int Econ Rev 34 (1993), 549-563.
[32] W.W. Moe and P.S. Fader, Dynamic conversion behavior at e-commerce sites, Manage Sci 50 (2004), 326-335.
[33] J. Mockus, Bayesian approach to global optimization: Theory and applications, Dordrecht, Kluwer, 1989.
[34] E. Regazzini and V.V. Sazonov, Approximations of laws of multinomial parameters by mixtures of Dirichlet distributions with applications to Bayesian inference, Acta Appl Math 58 (1999), 247-264.
[35] M. Rothschild, A two-armed bandit theory of market pricing, J Econ Theo 9 (1974), 185-202.
[36] I. Segal, Optimal pricing mechanisms with unknown demand, Am Econ Rev 93 (2003), 509-529.
[37] C. Shapiro and H. Varian, Information rules: A strategic guide to the network economy, Cambridge, MA, Harvard Business School Press, 1998.
[38] D. Trefler, The ignorant monopolist: Optimal learning with endogenous information, Int Econ Rev 34 (1993), 565-581.
