Discovering the Effects of Online Purchasing
Habits
by
Matt Schroeder
Research Topic:
purchasing value, using zip code income in the United States as a proxy for
Abstract:
This empirical study was conducted to see if zip code income is a good
explainer for order value when shopping online. I am using Shelly Cove, LLC
as the proxy for all online purchases. I am also including the variables:
population, number of orders within a zip code, and percent of the population
that is dependent. For the second half of the study, I will be testing the
percent of people who earn over $75,000 a year, rather than average
income, in an attempt to remove any outlier bias that can result from mean
measures. Last, I will run a simple test to see if income has any effect during
holiday times. I will be splitting zip codes into quartiles based on income,
and creating a time series line chart, to see if order behavior converges
Two equations were created from this study that attempt to describe
order value. One with the primary independent variable of mean income,
and the other equation with the primary variable of percent of income over
Table of Contents
Abstract
Appreciations
Chapter 1: Introduction
Chapter 2: Literature Review
3.1 Data
3.1.1 Data Source
3.1.2 Data Limitations
3.1.3 Data Manipulation
3.2 Theory
Chapter 4: Results
5.1 Conclusion
5.2 Improvements
Works Cited
I would like to thank my parents, who showed me the importance of
hard work and following your passions. Without you, I would not have been
able to attend Covenant, and receive this wonderful education. I would also
like to thank my readers Oliver Beers and Hunter Davis, as well as my
professors Dr. Lance Wescher and Dr. John Rush, who have all helped me
design my experiment.
Chapter 1:
Introduction
retailers across the globe. With the rise of Amazon (revenue of $107 billion
in 2015 alone)[i], and more and more retailers creating online companions to
their stores, eCommerce is obviously something that is not going away any
time soon. In fact, in the United States alone, eCommerce is expected to rise
56% from $335 billion (2015) to $523 billion by 2020[ii]. One major difference
that online shopping creates. Not only are customers using a credit card
instead of physical cash, but sites like Amazon and other major retailers
have credit card info saved, so we don't even have to enter it at checkout.
Less time spent thinking about our decisions means easier consumption on
specifically, does one's income make a difference in how much they spend
online, or does income not matter at all? You could argue that necessary
spending (insurance, mortgage, food, gas, etc.) is not spent online, where
that their higher level of disposable income is spent through online shopping,
versus someone with a lower income who cannot afford as many luxuries
dependents, order activity within that zip code, and population. I also
believe that orders placed around holiday periods will be of higher value, and
that mean income will have less of an effect on the order value from a zip
a household and their spending. I will discuss this theory later, but the
income, noting the average order value within the zip code, average number
of dependents, number of orders from that zip code, and the population
within the zip code. I also plan on accounting for mean bias by running two
regressions, one for the mean, and one structured around quartile analysis.
Chapter 2:
Literature Review and Theoretical Analysis
spending would increase. However, we also know that people who have less
money are often less wise with their money. I believe that the disposable
income theory will outweigh people's lack of self-control with their money,
especially given the nature of the goods that Shelly Cove, LLC sells on their
online store.
2011 to see if age, gender, and income had an effect on whether or not
people did their primary shopping online[iii]. They found, contrary to their
most of their buying online vs. in a brick and mortar shop. This is beneficial
market we are studying, which is anybody who shops online. If the study
found that only people above an income level did their shopping online, then
concluded that advertisers who can reach the top income bracket
(specifically men) would see a larger increase in revenue than from any
other demographic. This implies that the spending in the higher income
bracket, while maybe on par with other income levels in the aggregate, is
more volatile as a whole: people will spend more money, just not as often[iv].
from 18-34 have much lower incomes than older adults, they make up a
shoppers tend to live in households with above average incomes, with 55%
of shoppers living with incomes above $75,000[vi]. It also states that while
that have higher incomes are often more conscious about how they spend
make sense that they would take advantage. A study by Girish Punj verifies
this theory, and found that people with higher incomes are more inclined to
People with college degrees especially were inclined to shop online. The
A study in 2004 shows that recommendations from friends have a large
effect on buying in women specifically. The risk of trying a new site is neutral
friend to try a site, they are much more likely to purchase from it than men
are[vii]. Since Shelly Cove is more geared towards women's apparel, this is
something to consider.
Chapter 3:
3.1 Data
This paper will include two data sets. The first data set is from the IRS
Individual Income Tax Statistics from 2014[viii]. This data set gives numerous
income levels. My second data set is from the online store Shelly Cove, LLC.
This data set includes online purchases from July 2015 - September 2016.
Over 16,000 customers' data (zip code, date, and order value) will be
cross-referenced with the IRS zip code data (mean income, number of dependents,
etc.) to see how order value, mean income, and percentage of dependents
are correlated. I will be running two variations of this test. One test will be a
variable. The other test will be measuring the percent of people in the zip
suitable proxy for household income, the accuracy of this generalization may
be lacking. Also, while the Shelly Cove data set with 16,000 customers is
relatively large in itself, it is divided amongst individual zip codes, which may
introduce bias from zip codes with only 1 or 2 orders. I will attempt to
control for this issue by only regressing zip codes with more than 2 orders.
However, excluding zip codes with few orders also reduces the size of
the overall data set, so a real tradeoff exists. Last,
there are other variables I simply do not have access to that could be helpful
order to properly run tests. First, the zip code data set included 4 rows for
each zip code (one for each quartile of income). This was consolidated into
one row, allowing me to match the zip code data with the purchasing data.
Before this was done, I needed to average the purchase values for the zip
codes into a new data set, as I was not interested in individual purchases,
but rather the mean of the zip code as a whole. Once this was complete, I
combined the zip code data and purchase data using a matching function,
and removed zip codes that did not have any purchases, and also zip codes
that only had one purchase. I also changed the number of dependents to a
new variable % over 75k, which gave the percent of people who earned
over $75,000 a year, as an attempt to remove outlier bias (i.e., people who
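The manipulation steps described in this section can be sketched in Python. This is only an illustration under stated assumptions, not the procedure actually used in the study: the zip codes, dollar amounts, bracket cutoffs, and the names `consolidate_irs`, `build_zip_table`, and `numorders` are all hypothetical, and the real tax data has its own column layout.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stand-ins for the two sources: (zip, order value) pairs from the
# store, and four income-bracket rows per zip code from the tax data.
orders = [("11111", 40.0), ("11111", 50.0), ("22222", 25.0),
          ("33333", 30.0), ("33333", 60.0), ("33333", 45.0)]
# (zip, lower bound of bracket in dollars, number of returns, total income)
irs_rows = [
    ("11111", 0, 300, 6_000_000), ("11111", 25_000, 400, 20_000_000),
    ("11111", 75_000, 200, 18_000_000), ("11111", 100_000, 100, 16_000_000),
    ("22222", 0, 250, 5_000_000), ("22222", 25_000, 250, 12_000_000),
    ("22222", 75_000, 0, 0), ("22222", 100_000, 0, 0),
    ("33333", 0, 100, 2_000_000), ("33333", 25_000, 100, 5_000_000),
    ("33333", 75_000, 100, 9_000_000), ("33333", 100_000, 200, 34_000_000),
]

def consolidate_irs(rows):
    """Collapse the bracket rows into one record per zip code."""
    acc = defaultdict(lambda: {"returns": 0, "income": 0, "over75k_returns": 0})
    for zip_code, lower, returns, income in rows:
        rec = acc[zip_code]
        rec["returns"] += returns
        rec["income"] += income
        if lower >= 75_000:  # assumed cutoff matching the "% over 75k" variable
            rec["over75k_returns"] += returns
    return {z: {"avincome": r["income"] / r["returns"],
                "over75k": r["over75k_returns"] / r["returns"]}
            for z, r in acc.items()}

def build_zip_table(orders, irs_rows, min_orders=2):
    """Average order value per zip code, matched to the consolidated tax data."""
    values = defaultdict(list)
    for zip_code, value in orders:
        values[zip_code].append(value)
    irs = consolidate_irs(irs_rows)
    # Keep only zip codes with enough orders that also appear in the tax data.
    return {z: {"avordervalue": mean(v), "numorders": len(v), **irs[z]}
            for z, v in values.items()
            if len(v) >= min_orders and z in irs}

zip_table = build_zip_table(orders, irs_rows)
```

Raising `min_orders` makes the tradeoff discussed under Data Limitations visible: each increase drops more zip codes from `zip_table`.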
3.2 Theory
analyze different variables and how they affect one's online order value.
That being said, the dependent variable in the model is going to be average
order value, and the independent variables are mean income (or percent of
from that zip code, and percent of the population that are dependents. Time
this analysis. The average order value, number of orders per zip code, and
time of year will come from the Shelly Cove data set, while the mean
2014 IRS data. The mean order value from the Shelly Cove data set is
$43.11, ranging from $2.99 to $306. All 50 states are represented, and over
3,000 zip codes will be tested from the 50 states. The average number of
orders from a zip code (excluding the zip codes with only one order), is 5.1
orders/zip code.
positive correlation with the zip code's average order value online. For the
Shelly Cove data, I have removed zip codes that only have 1 order in an
attempt to get a better aggregate on what the average person within the zip
code will purchase. For time of year, I plan on visualizing this variable,
making a line chart, separating zip codes into 4 levels of income, and seeing
if average order value converges as holiday periods roll around. For the
couple of ways to view this variable. The more children you have, the less
disposable income you have, meaning you will spend less. But on the other
hand, the more children you have, the more people you need to buy for, so
the order value would go up since you are buying for 4 people, for instance,
hypothesize that as population goes up, the average order value will also go
up. Finally, the number of orders per zip code. If there is a large influx of
orders from a zip code, I would expect that there may be some sort of trend
happening within that zip code for Shelly Cove apparel. Therefore, if the
number of orders within a zip code increases, I will expect the average order
time, I would expect the quantity of orders to increase, but due to sales
offsetting how much people are buying, the average order value to increase,
dependents. It does
not represent
households.
OVER75K This is another income H: > 0
0
measure, that
describes the percent H: 0
A
December,
H: 0
representing holiday
A
percent of the
population that is H: = 0
A
dependents.
POPULATION Population represents H: > 0
0
zip code.
Equation 1:
Average Order Value = f(Mean Income (+), Time of Year (+), Number of Dependents (-),
Population (+), Number of Zip Code Orders (+))
Equation 2:
Average Order Value = f(Over75k (+), Time of Year (+), Number of Dependents (-),
Population (+), Number of Zip Code Orders (+))
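Equations 1 and 2 state only which variables enter the model and their expected signs. One possible concrete form, assuming a linear model estimated across zip codes z, is sketched below. The coefficient symbols, the error terms, and the variable name numorders are assumptions introduced for illustration; avordervalue, avincome, percofdep, and over75k follow the variable names used in the regression output. Time of Year is left out of this sketch because the study examines it graphically.

```latex
\begin{align}
\text{avordervalue}_z &= \beta_0 + \beta_1\,\text{avincome}_z + \beta_2\,\text{percofdep}_z
  + \beta_3\,\text{population}_z + \beta_4\,\text{numorders}_z + \varepsilon_z \\
\text{avordervalue}_z &= \gamma_0 + \gamma_1\,\text{over75k}_z + \gamma_2\,\text{percofdep}_z
  + \gamma_3\,\text{population}_z + \gamma_4\,\text{numorders}_z + u_z
\end{align}
```

Under the signs given in Equations 1 and 2, the expectation would be that the income, population, and order-count coefficients are positive and the dependents coefficient is negative.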
Chapter 4:
Results
The following are the results from the series of regressions I ran, in an
attempt to see if the variables I selected were good predictors for average
order value. I ran two series of regressions to test the mean. The first was
on zip codes with at least two orders, and the second was on zip codes with
at least five orders. This was an attempt to remove bias from a single large
order (or small order) skewing the data too much. There is a tradeoff when
performing this kind of test, which is trading a smaller sample size for less
the regression on zip codes that have at least two orders, and the second
column is the regression zip codes with at least five orders. As you can see,
avincome and the constant term are the only statistically significant
significance for that variable was significantly lower than all other variables,
so I left it out (a 0.0741 coefficient with a standard error of over 1). I did not
include the results from this regression in the table, but I removed the
percofdep variable from the first regression data, and it hardly altered the
results, so I left only my original theory in the report. You'll also notice that
the R^2 values in these regressions are very low. This is expected, as we are
clearly missing variables from our data set that could help explain the
and $.039. Clearly, this offers very little marketing incentive, as an order
increase of only pennies makes the extra cost of targeted marketing a wash,
and maybe even a loss. The following variables will be taken with a grain of
results will still be described. As the population in the zip code increases by
1,000 people, the predicted order value will decrease by $.029 and $.032.
the order value. As the percent of dependents increases by 1%, the average
order value will increase less than a penny (since the coefficient needs to be
number of orders within a zip code has a positive correlation with the order
value. For every extra order a zip code receives, it is expected to have a
(At least 2 Orders) (At least 5 Orders)
VARIABLES avordervalue avordervalue
percofdep 0.0741
(1.062)
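For readers who want to see how coefficients like these are produced, here is a minimal sketch of ordinary least squares in plain Python. It assumes the regressions are standard OLS fits, which the text does not state explicitly, and the function name `ols` and the sample numbers are hypothetical. It returns coefficients only, not standard errors or R^2.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows without the constant; a column of ones is added here.
    Returns [constant, b1, b2, ...].
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Made-up data generated from y = 2 + 3*x1 - 1*x2, so the fit should recover
# those three numbers.
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [3, 7, 6, 11, 9]
coefficients = ols(X, y)
```

In practice a statistics package would be used instead, since it also reports the standard errors and significance levels shown in the tables.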
population who earn over $75,000, rather than mean income. This was an
attempt to control for extreme cases like millionaires living in an area (and
with zip codes typically containing only a few thousand people, extremely
The table below shows my regression results. The first column is all zip
codes with at least two orders, and the second column is all zip codes with at
least five orders. I have left out the variable percofdep (percent of
again, that the income measure and the constant are the only variables with
seeing a positive correlation between income and order value. As the percent of
people who earn over $75,000 a year increases by 1%, we can expect that
the order value within that zip code will increase by $.055 and $.068
their results are still noted. Population again has a negative effect on order
Last, as the number of orders within a zip code increases, we see another
positive correlation with order value. For every additional order within a zip
code, we can expect a $.09 or $.042 increase in the order value respectively.
Our R^2 became even lower, and our statistical significance also became
lower, so this leads me to believe that there was not much rich income bias
             (1)            (2)
VARIABLES    avordervalue   avordervalue
over75k      5.586**        6.785*
             (2.239)        (3.845)
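The dollar effects quoted in the text follow from the over75k coefficients in this table by simple arithmetic. The short sketch below assumes over75k is stored as a share between 0 and 1, so a one-point rise in the percentage is a change of 0.01; the helper name `dollar_effect` is hypothetical.

```python
def dollar_effect(coefficient, change):
    """Predicted change in average order value for a given change in the regressor."""
    return coefficient * change

# over75k coefficients from columns (1) and (2) of the table.
one_point = 0.01  # one percentage point, assuming a 0-1 share
for label, coef in [("at least 2 orders", 5.586), ("at least 5 orders", 6.785)]:
    print(f"{label}: ${dollar_effect(coef, one_point):.3f}")
```

The products are 0.05586 and 0.06785, close to the $.055 and $.068 given in the text.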
For this next test, I wanted to see if holiday times affected income-based
buying habits. In other words, does it matter what your income is when the
holidays roll around? I separated each zip code into quartiles based on
income. Next, I divided the sales data into monthly totals, and tabulated
them based on what quartile the sale came from. My theory was that there
convergence, as people spend more freely during those two months, and it
doesn't matter as much what your income is. In the first graph, the X-axis is
time, and the Y-axis is percent of monthly sales. As we can see, a
relatively large separation emerges close to the holiday season, where the
top quartile becomes separated from the others, and the bottom quartile
dips below the middle 50%. Oddly enough, from May to October, the
monthly spends become inverted, where the poorer zip codes spend more
rather than a convergence. However, these findings are still notable and
significant.
[Figure: percent of monthly sales by income quartile, January through December; vertical axis from 15% to 30%]
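The quartile tabulation behind this first graph can be sketched as follows. This is an illustration under assumptions rather than the procedure actually used: the zip labels, incomes, and sales amounts are made up, and the function names `income_quartile` and `monthly_quartile_shares` are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles

def income_quartile(income, cuts):
    """Return 1 (lowest) through 4 (highest) given the three quartile cut points."""
    return 1 + sum(income > c for c in cuts)

def monthly_quartile_shares(zip_income, sales):
    """Share of each month's sales dollars that comes from each income quartile."""
    cuts = quantiles(list(zip_income.values()), n=4)  # three cut points
    totals = defaultdict(lambda: defaultdict(float))  # month -> quartile -> dollars
    for zip_code, month, amount in sales:
        quartile = income_quartile(zip_income[zip_code], cuts)
        totals[month][quartile] += amount
    shares = {}
    for month, by_quartile in totals.items():
        month_total = sum(by_quartile.values())
        shares[month] = {q: by_quartile.get(q, 0.0) / month_total
                         for q in (1, 2, 3, 4)}
    return shares

# Hypothetical zip codes z1..z8 with mean incomes of $10,000 through $80,000.
zip_income = {f"z{i}": 10_000 * i for i in range(1, 9)}
sales = [("z1", "Nov", 100.0), ("z3", "Nov", 100.0),
         ("z5", "Nov", 100.0), ("z7", "Nov", 200.0)]
shares = monthly_quartile_shares(zip_income, sales)
```

Plotting `shares[month][q]` for each quartile across the twelve months gives one line per quartile; convergence would show up as the four lines moving toward 25% in November and December.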
a total dollar value to try and see another side of this picture. In the
quartiles. In fact, the poorest zip codes had a relatively high purchase
rate in August, higher than all the other categories of zip codes.
holiday months, where the richest zip codes end up spending more
than the poorer ones. Again, although my hypothesis was that the zip
code spending would converge in November and December, I still find
similar spending habits throughout the year, but treat the holidays as a
time to stretch the wallet a little farther than other folks. Maybe
Cove.
[Figure: total monthly sales in dollars by income quartile, January through December; vertical axis from 0 to 35,000]
deduced. One is that rich people don't spend as much of their disposable
spends about the same regardless of income). The second conclusion is that
rich areas do tend to spend more during holiday times (or at least more
quantities rich people buy, since we are measuring total spend in the zip
codes rather than the average order value). While this is the opposite of
test.
Chapter 5:
5.1 Conclusion
variables on the purchasing habits within a zip code. Data that describes the
information for marketers and business owners, but not easy to nail down
and define without expansive data sets that span a large number of
variables. Using variables that were able to be derived from my data set, I
was able to run tests for income levels, number of dependents, number of
orders from a zip code, and population to see if they accurately predicted
checkout totals for the store Shelly Cove, LLC. After conducting this study, I
would say that the hypothesis could go either way. The rest of the variables I
believe would be useful in a larger scale study, with more variables included
and a much larger data set. The statistical significance on many of these
data set, broken even smaller into zip code segments. Population had a very
Number of orders received from a zip code had a decently sized positive
effect on order value, but the statistical significance was low. Again, I believe
this problem could be fixed with a larger data set. The most significant
variable I found was income levels. In the two ways of testing this variable, I
have come to the conclusion that measuring mean income is a better way to
predict order value, rather than the $75,000 measure. While income had a
positive effect on order value, it did not seem to be the biggest deciding
omitted variable bias. Even with the small data set, income levels were
significance with a larger data set in most of the variables I chose, I do not
believe this is the whole story, and to create a better model with a more
and cultural data, and a company's purchases that appeal to a wider and
5.2 Improvements
As stated above, I have a series of improvements that I would attempt
obtain a larger data set so that there are at least 10 orders from all the zip
codes you are studying. Second, attempt to obtain variables such as gender,
personal household income (which would negate the need for more zip code
data), a company with a wider range of products that has been around
longer, and finally, more variables. Gender, age, source of where the
customer heard of the company, who they are buying for, etc. would be
Appendix:
Regression Tables
Table 1: The first regression for average income on the zip codes with over 2 orders.
Table 2: I regressed the average zip code order value on average income to see if there was a noticeable correlation with just
these two variables involved. There was nothing more significant than before.
Table 3: For the last regression in the first "half" of the study, I regressed the same variables as the first regression (minus
number of dependents), on the zip codes that contained more than 5 orders.
Table 4: I ran two regressions below, with the percent over $75k as the new income descriptor. In the second regression, I left out
the number of dependents measure, as it was extremely statistically insignificant.
Table 5: Finally, I ran the same regression as above (omitting the dependents measure) on the zip codes with at least 5 orders.
Table 6: A summary of descriptive statistics for my main variables in the regression analysis
Works Cited:
[i] Statista. Feb. 2016. Accessed September 28, 2016.
https://www.statista.com/statistics/266282/annual-net-revenue-of-amazoncom/
[ii] Ecommerce Sales, Internet Retailer. Matt Linder, Jan 29, 2016. Accessed Sept 28, 2016.
https://www.internetretailer.com/2016/01/29/online-sales-will-reach-523-billion-2020-us
[iii] Blanca Hernández, Julio Jiménez, M. José Martín, "Age, gender and
income: do they really moderate online shopping behaviour?", Online
Information Review, Vol. 35 Iss: 1, pp. 113-133
[vi] http://www.businessinsider.com/the-surprising-demographics-of-who-shops-online-and-onmobile-2014-6
[vii] Ellen Garbarino & Michael Strahilevitz, 2004, "Gender differences in the
perceived risk of buying online and the effects of receiving a site
recommendation", Journal of Business Research 57, pp. 768-775