
Deriving the Effects of Online Purchasing Habits

A Senior Integration Project

Submitted to the Economics Department of Covenant College

In Partial Fulfillment of the Requirements

for the Degree of Bachelor of Arts

By Matt Schroeder

Discovering the Effects of Online Purchasing Habits

Abstract

by

Matt Schroeder

Research Topic:

This paper will attempt to analyze what variables affect online

purchasing value, using zip code income in the United States as a proxy for

individual household income.

Abstract:

This empirical study was conducted to see if zip code income is a good explainer of order value when shopping online. I use Shelly Cove, LLC as the proxy for all online purchases. I also include the variables population, number of orders within a zip code, and percent of the population that is dependent. For the second half of the study, I test the percent of people who earn over $75,000 a year, rather than average income, in an attempt to remove any outlier bias that can result from mean measures. Last, I run a simple test to see if income has any effect during holiday times: I split zip codes into quartiles based on income and create a time series line chart to see if order behavior converges closer to holiday times.

Two equations were created from this study that attempt to describe order value: one with mean income as the primary independent variable, and the other with the percent of people earning over $75,000 as the primary independent variable. All regressions were run using simple OLS models.

Table of Contents

Abstract
Appreciations
Chapter 1: Introduction
Chapter 2: Literature Review
Chapter 3: Data, Theory, and Equations
    3.1 Data
        3.1.1 Data Source
        3.1.2 Data Limitations
        3.1.3 Data Manipulation
    3.2 Theory
    3.3 Equations and Variables
Chapter 4: Results
    4.1 Mean Income Regression
    4.2 Percent of Income over $75,000 Regression
    4.3 Seasonal Convergence Test
Chapter 5: Conclusion & Improvements
    5.1 Conclusion
    5.2 Improvements
Appendix: Regression Tables
Works Cited

Appreciations

I would like to thank my parents, who showed me the importance of
hard work and following your passions. Without you, I would not have been
able to attend Covenant, and receive this wonderful education. I would also
like to thank my readers Oliver Beers and Hunter Davis, as well as my
professors Dr. Lance Wescher and Dr. John Rush, who have all helped me
design my experiment.

Chapter 1:

Introduction

Online shopping has clearly become a new beast in the realm of retailers across the globe. With the rise of Amazon (revenue of $107 billion in 2015 alone) [i], and more and more retailers creating online companions to their stores, eCommerce is not going away any time soon. In fact, in the United States alone, eCommerce sales are expected to rise 56%, from $335 billion in 2015 to $523 billion by 2020 [ii]. One major difference between online shopping and in-store shopping is the intangibility of money that online shopping creates. Not only are customers using a credit card instead of physical cash, but sites like Amazon and other major retailers keep credit card information saved, so we don't even have to enter it at checkout. Less time spent thinking about our decisions means easier consumption on our end. Purchasing from home is quite literally a click away.

Does this simplicity cause irresponsibility in online shopping? More specifically, does one's income make a difference in how much one spends online, or does income not matter at all? You could argue that necessary spending (insurance, mortgage, food, gas, etc.) mostly happens offline, whereas luxury spending (clothes, electronics, accessories, etc.) mainly happens through online purchases. Therefore, if one's income is higher, it would make sense that the higher level of disposable income is spent through online shopping, compared with someone with a lower income who cannot afford as many of the luxuries that online shopping may provide.

This paper tests the theory that a customer's online purchasing value can be described by zip code income, number of dependents, order activity within that zip code, and population. I also believe that orders placed around holiday periods will be of higher value, and that mean income will have less of an effect on the order value from a zip code during those periods. There should also be some correlation between the number of dependents in a household and its spending. I will discuss this theory later, but the correlation could feasibly go either way: my hypothesis is that there is a correlation between order value and dependents, which could be either positive or negative. I will attempt to structure a model around mean income, noting the average order value within the zip code, average number of dependents, number of orders from that zip code, and the population within the zip code. I also plan on accounting for mean bias by running two regressions, one for the mean, and one structured around quartile analysis.

Chapter 2:

Literature Review and Theoretical Analysis

Based on basic economic theory, it would make sense that as mean income increases, disposable income would increase, and therefore online spending would increase. However, it is also often argued that people who have less money are less wise with their money. I believe that the disposable income effect will outweigh people's lack of self-control with their money, especially given the nature of the goods that Shelly Cove, LLC sells in its online store.

Blanca Hernández, Julio Jiménez, and M. José Martín ran a study in 2011 to see if age, gender, and income had an effect on whether or not people did their primary shopping online [iii]. They found, contrary to their hypothesis, that these variables have no effect on whether people do most of their buying online or in a brick-and-mortar shop. This is beneficial to my study, as it shows that there is no inherent bias in the market we are studying, which is anybody who shops online. If the study had found that only people above a certain income level did their shopping online, then we would be faced with a bias issue.

Brendan Hannah and Kristina M. Lybecker, however, ran a study in 2009 about advertising and online spending across income brackets. They concluded that advertisers who can get hold of the top income bracket (specifically men) would see revenue increase more than from any other demographic. This implies that spending in the higher income bracket, while maybe on par with other income levels in the aggregate, is more volatile as a whole: people will spend more money, just not as often [iv].

Another similar study in the International Journal of Humanities and Social Sciences found that there is a significant difference in attitudes toward online shopping across income brackets. Wealthier people are more open to online shopping. Age group makes little to no difference, and neither does occupational group [v].

A study from BI Intelligence in 2014 found that although millennials aged 18-34 have much lower incomes than older adults, they make up a larger proportion of online spending. However, in the aggregate, online shoppers tend to live in households with above-average incomes, with 55% of shoppers living in households with incomes above $75,000 [vi]. It also states that while women account for a majority of a household's spending, men are more likely to purchase online and on mobile devices.

We can also think of online shopping in terms of convenience. People who have higher incomes are often more conscious about how they spend their time, and if saving time by purchasing online is an option, it would make sense that they would take advantage of it. A study by Girish Punj supports this theory, finding that people with higher incomes are more inclined to buy online because they view online shopping as a time-saving mechanism. People with college degrees were especially inclined to shop online. The explanation is that college-educated people are often more conscious of maintaining efficiency, and online shopping can provide this efficiency.

A study in 2004 shows that recommendations from friends have a large effect on buying, for women specifically. The risk of trying a new site is similar for men and women; however, if women get a recommendation from a friend to try a site, they are much more likely than men to purchase from it [vii]. Since Shelly Cove is geared more towards women's apparel, this is something to consider.

Chapter 3:

Data, Theory, and Equations

3.1 Data

3.1.1 Data Source

This paper uses two data sets. The first is the IRS Individual Income Tax Statistics for 2014 [viii]. This data set gives numerous variables describing zip codes in the United States, notably including average income, population, total number of dependents, and categorical income levels. The second data set is from the online store Shelly Cove, LLC, and includes online purchases from July 2015 through September 2016. Data on over 16,000 customers (zip code, date, and order value) is cross-referenced with the IRS zip code data (mean income, number of dependents, etc.) to see how order value, mean income, and percentage of dependents are correlated. I run two variations of this test. One is a simple regression with the average income value as the primary independent variable. The other uses the percent of people in the zip code who earn over $75,000 a year.

3.1.2 Data Limitations

This pair of data sets contains a number of limitations, which I attempt to control for. First, there is an issue with individual household income. While zip code average income may be a suitable proxy for household income, the accuracy of this generalization may be lacking. Second, while the Shelly Cove data set of 16,000 customers is relatively large in itself, it is divided among individual zip codes, which may bias the results for zip codes with only 1 or 2 orders. I attempt to control for this issue by only regressing zip codes with more than 2 orders. However, excluding zip codes with few orders also reduces the size of the overall data set, so a real tradeoff exists. Last, there are other variables I simply do not have access to that could be helpful in describing online purchasing value. Age and gender specifically would be interesting to test this theory on.

3.1.3 Data Manipulation

These raw data sets needed to be manipulated in order to properly run tests. First, the zip code data set included 4 rows for each zip code (one for each quartile of income). These were consolidated into one row, allowing me to match the zip code data with the purchasing data. Before this was done, I needed to average the purchase values for each zip code into a new data set, as I was not interested in individual purchases, but rather in the mean of the zip code as a whole. Once this was complete, I combined the zip code data and purchase data using a matching function, and removed zip codes that did not have any purchases, as well as zip codes that had only one purchase. I also changed the number of dependents to a percentage to normalize the variable across zip codes. Finally, I created a new variable, "% over 75k", which gives the percent of people who earned over $75,000 a year, as an attempt to remove outlier bias (i.e., from people who earn millions of dollars a year).
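
The consolidation steps described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the procedure actually used for the study. The function name `build_zip_table`, the dictionary keys, and the toy records (including the made-up zip codes) are all assumptions introduced for the example.

```python
from collections import defaultdict


def build_zip_table(orders, zip_stats, min_orders=2):
    """Average order values per zip code and join them to zip-level statistics.

    orders    -- iterable of (zip_code, order_value) pairs (hypothetical layout)
    zip_stats -- dict: zip_code -> dict with 'avincome', 'population',
                 'dependents' and 'over_75k' keys (hypothetical layout)
    """
    values_by_zip = defaultdict(list)
    for zip_code, value in orders:
        values_by_zip[zip_code].append(value)

    table = {}
    for zip_code, values in values_by_zip.items():
        stats = zip_stats.get(zip_code)
        # Drop zip codes with too few orders or with no matching zip-level row.
        if len(values) < min_orders or stats is None:
            continue
        table[zip_code] = {
            "avordervalue": sum(values) / len(values),
            "numorders": len(values),
            "avincome": stats["avincome"],
            "population": stats["population"],
            # Express dependents and high earners as shares of the population.
            "percofdep": stats["dependents"] / stats["population"],
            "over75k": stats["over_75k"] / stats["population"],
        }
    return table


# Toy records with made-up zip codes and values, for illustration only.
orders = [("11111", 40.0), ("11111", 50.0), ("22222", 25.0)]
zip_stats = {
    "11111": {"avincome": 90000.0, "population": 2000, "dependents": 500, "over_75k": 800},
    "22222": {"avincome": 60000.0, "population": 1000, "dependents": 100, "over_75k": 200},
}
table = build_zip_table(orders, zip_stats)
```

In this toy run, the second zip code is dropped because it has only one order, which mirrors the filtering step described in the text.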

3.2 Theory

Model and Equations:

The model in this paper uses a series of simple multivariate regressions to analyze how different variables affect one's online order value. The dependent variable in the model is average order value, and the independent variables are mean income (or percent of people with income over $75,000), time of year, population, number of orders from that zip code, and percent of the population that are dependents. Time of year is analyzed by itself, as the formatting is different for that analysis. The average order value, number of orders per zip code, and time of year come from the Shelly Cove data set, while the mean income, population, and number of dependents come from the 2014 IRS data. The mean order value in the Shelly Cove data set is $43.11, with values ranging from $2.99 to $306. All 50 states are represented, and over 3,000 zip codes will be tested from the 50 states. The average number of orders from a zip code (excluding the zip codes with only one order) is 5.1.

If the mean income in a zip code is larger, I would expect a positive effect on the zip code's average online order value. For the Shelly Cove data, I have removed zip codes that have only 1 order in an attempt to get a better aggregate of what the average person within the zip code will purchase. For time of year, I plan on visualizing this variable with a line chart, separating zip codes into 4 levels of income and seeing if average order value converges as holiday periods roll around. The average number of dependents in a household could be a tossup. I am hypothesizing a negative effect on average order value, but there are a couple of ways to view this variable. The more children you have, the less disposable income you have, meaning you will spend less. On the other hand, the more children you have, the more people you need to buy for, so the order value would go up since you are buying for 4 people, for instance, instead of 2. Population could also be a tossup. The Shelly Cove brand appeals more to preppy individuals, who often reside in cities, so I hypothesize that as population goes up, the average order value will also go up. Finally, consider the number of orders per zip code. If there is a large influx of orders from a zip code, I would expect that there may be some sort of trend happening within that zip code for Shelly Cove apparel. Therefore, if the number of orders within a zip code increases, I expect the average order value to also increase.

I am predicting that mean income will have the strongest correlation among all of the variables, followed by holiday times. During Christmas time, I would expect the quantity of orders to increase, but because sales discounts offset how much people are buying, I expect the average order value to increase only slightly.

3.3 Equations and Variables

Variable        Description                                          Hypothesis

AVINCOME        This represents the mean income in the zip code      H0: β > 0
                for dependents. It does not represent households.    HA: β ≤ 0

OVER75K         This is another income measure that describes the    H0: β > 0
                percent of the zip code population that earns over   HA: β ≤ 0
                $75,000 a year.

TIME OF YEAR    Time of year is a dummy variable where (1) is from   H0: β > 0
                October to December, representing holiday times,     HA: β ≤ 0
                and (0) is from January to September, representing
                off-holiday times.

PERCOFDEP       This represents the percent of the population that   H0: β ≠ 0
                is dependents.                                       HA: β = 0

POPULATION      Population represents the population of the zip      H0: β > 0
                code. This is both dependents and non-dependents.    HA: β ≤ 0

NUMORDERS       This represents the number of orders that have come  H0: β > 0
                from the zip code.                                   HA: β ≤ 0

Equation 1:

Average Order Value = f(Mean Income (+), Time of Year (+), Number of Dependents (-), Population (+), Number of Zip Code Orders (+))

Equation 2:

Average Order Value = f(Over75k (+), Time of Year (+), Number of Dependents (-), Population (+), Number of Zip Code Orders (+))
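
Written out as a linear specification, the first functional form above would look roughly as follows. This is a sketch under stated assumptions: the subscript z (indexing zip codes), the coefficient symbols, and the error term are notation introduced here for illustration, and the time-of-year dummy is left out because the text says it is analyzed separately.

```latex
\begin{equation*}
\text{AVORDERVALUE}_z = \beta_0 + \beta_1\,\text{AVINCOME}_z + \beta_2\,\text{PERCOFDEP}_z
  + \beta_3\,\text{POPULATION}_z + \beta_4\,\text{NUMORDERS}_z + \varepsilon_z
\end{equation*}
```

Under the sign annotations in the equations, the expected directions would be \beta_1 > 0, \beta_2 < 0, \beta_3 > 0, and \beta_4 > 0. The second functional form would replace AVINCOME with OVER75K and keep the other terms unchanged.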

Chapter 4:

Results

4.1 Mean Income Regressions

The following are the results from the series of regressions I ran to see if the variables I selected were good predictors of average order value. I ran two series of regressions to test the mean. The first was on zip codes with at least two orders, and the second was on zip codes with at least five orders. This was an attempt to keep a single large (or small) order from skewing the data too much. There is a tradeoff when performing this kind of test: requiring more orders per zip code leaves fewer zip codes in the regression, while requiring fewer orders leaves each zip code's average resting on a smaller sample.
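
To make the procedure concrete, the following is a minimal sketch of an OLS fit run on two order-count thresholds. It is not the software used for the study (a statistics package would normally do this); it solves the normal equations directly with the standard library. The helper names `ols` and `fit_with_threshold`, the key `avincome_k` (income in thousands of dollars), and the toy data, which are generated from a known linear rule, are all assumptions made for illustration.

```python
def ols(rows, y):
    """Return OLS coefficients [intercept, b1, ..., bk] via the normal equations."""
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend the constant term
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("regressors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def fit_with_threshold(zips, min_orders):
    """Keep zip codes with at least min_orders orders, then regress."""
    kept = [z for z in zips if z["numorders"] >= min_orders]
    rows = [[z["avincome_k"], z["numorders"]] for z in kept]
    y = [z["avordervalue"] for z in kept]
    return len(kept), ols(rows, y)


# Toy zip codes generated from a known rule: 40 + 0.02 * income_k + 0.1 * orders.
zips = [
    {"numorders": n, "avincome_k": inc, "avordervalue": 40 + 0.02 * inc + 0.1 * n}
    for n, inc in [(2, 50.0), (3, 80.0), (5, 60.0), (6, 100.0), (8, 70.0), (9, 120.0)]
]
n_all, beta_all = fit_with_threshold(zips, min_orders=2)
n_five, beta_five = fit_with_threshold(zips, min_orders=5)
```

Because the toy data contain no noise, both fits recover the same coefficients; the point of the sketch is only that the stricter threshold leaves fewer observations (4 instead of 6), which is the tradeoff described above.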

The table below shows my regression results. The first column is the regression on zip codes that have at least two orders, and the second column is the regression on zip codes with at least five orders. As you can see, avincome and the constant term are the only statistically significant variables in these regressions. Also, you will notice that percofdep is removed from the second regression. This is because the statistical significance of that variable was far lower than that of all other variables (a 0.0741 coefficient with a standard error of over 1), so I left it out. I also removed the percofdep variable from the first regression, and it hardly altered the results; I did not include the results from that regression in the table, leaving only my original theory in the report. You'll also notice that the R^2 values in these regressions are very low. This is expected, as we are clearly missing variables from our data set that could help explain the average order value. What I am more concerned about is statistical significance. My main theory was that average income would be a good explainer of average order value. It appears that as income increases by $1,000, the average order value is expected to increase by $.022 and $.039, respectively. Clearly, this has very little marketing value, as an order increase of only pennies makes the extra cost of targeted marketing a wash, and maybe even a loss. The remaining variables should be taken with a grain of salt, as their statistical significance is not notable, but the results are still described. As the population in the zip code increases by 1,000 people, the predicted order value decreases by $.029 and $.032. The percent of the population that is dependent has a positive correlation with the order value: as the percent of dependents increases by 1%, the average order value increases by less than a penny (the coefficient needs to be divided by 100, since the data is in terms of a proportion). Lastly, the number of orders within a zip code has a positive correlation with the order value. For every extra order a zip code receives, the order value is expected to increase by $.11 and $.048, respectively.
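
The unit conversions behind these interpretations can be reproduced directly from the reported coefficients. The short sketch below only rescales numbers taken from the mean-income regression table; the variable and function names are assumptions chosen for the example.

```python
# Coefficients as reported in the mean-income regression table
# (first value: at least two orders, second value: at least five orders).
coefficients = {
    "avincome": (2.17e-05, 3.90e-05),      # dollars of order value per $1 of mean income
    "population": (-2.86e-05, -3.19e-05),  # dollars of order value per additional resident
}
percofdep = 0.0741  # per unit change in the proportion (first regression only)


def rescale(coef, units):
    """Effect of a change of `units` in the regressor, holding the others fixed."""
    return coef * units


# Effect of $1,000 more mean income, and of 1,000 more residents.
per_1000_income = [round(rescale(c, 1000), 3) for c in coefficients["avincome"]]
per_1000_people = [round(rescale(c, 1000), 3) for c in coefficients["population"]]

# A one-percentage-point change is 0.01 on the proportion scale.
per_point_dep = rescale(percofdep, 0.01)
```

Here `per_1000_income` and `per_1000_people` give the dollar changes quoted in the text, and `per_point_dep` comes to well under one cent.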

                 (≥2 Orders)      (≥5 Orders)
VARIABLES        avordervalue     avordervalue

avincome         2.17e-05***      3.90e-05***
                 (6.21e-06)       (1.23e-05)

population       -2.86e-05        -3.19e-05
                 (3.63e-05)       (6.08e-05)

percofdep        0.0741
                 (1.062)

numorders        0.108            0.0484
                 (0.112)          (0.128)

Constant         41.18***         40.32***
                 (0.898)          (1.553)

Observations     2,923            549
R-squared        0.005            0.019

Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

4.2 Percent of Income over $75,000 Regressions

Next, I ran a series of regressions that included the percent of the population who earn over $75,000, rather than mean income. This was an attempt to control for extreme cases like millionaires living in an area (with zip codes typically containing only a few thousand people, extremely rich individuals could easily skew the data).

The table below shows my regression results. The first column is all zip codes with at least two orders, and the second column is all zip codes with at least five orders. I have left out the variable percofdep (percent of population that is dependent) due to its extreme lack of significance. We see again that the income measure and the constant are the only variables with any statistical significance. However, the income measure is less statistically significant than before. As with the average income measure, we see a positive correlation between income and order value. As the percent of people who earn over $75,000 a year increases by 1%, we can expect the order value within that zip code to increase by $.056 and $.068, respectively. The remaining variables have no statistical significance, but their results are still noted. Population again has a negative effect on order value: as the population increases by 1,000 people, the order value is predicted to decrease by $.03 to $.04, similar to the previous set of regressions. Last, as the number of orders within a zip code increases, we see another positive correlation with order value. For every additional order within a zip code, we can expect a $.09 and $.042 increase in the order value, respectively. Our R^2 became even lower, and our statistical significance also became lower, which leads me to believe that there was not much rich-income bias in the average income measure to begin with.
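
The drop in significance noted above can be checked from the reported coefficients and standard errors. The sketch below computes t-statistics and assigns stars using large-sample two-sided normal critical values (2.576, 1.960, 1.645), which is an approximation assumed here rather than the exact procedure of the regression software; the function name `stars` and the dictionary labels are also introduced only for illustration.

```python
def stars(coef, se):
    """Return (t, stars) using large-sample two-sided normal critical values."""
    t = coef / se
    for cutoff, mark in ((2.576, "***"), (1.960, "**"), (1.645, "*")):
        if abs(t) >= cutoff:
            return t, mark
    return t, ""


# (coefficient, standard error) pairs copied from the two regression tables.
reported = {
    "avincome, at least 2 orders": (2.17e-05, 6.21e-06),
    "avincome, at least 5 orders": (3.90e-05, 1.23e-05),
    "over75k, at least 2 orders": (5.586, 2.239),
    "over75k, at least 5 orders": (6.785, 3.845),
}

results = {name: stars(coef, se) for name, (coef, se) in reported.items()}
for name, (t, mark) in results.items():
    print(f"{name}: t = {t:.2f} {mark}")
```

Under this approximation the mean-income coefficients clear the 1% cutoff in both samples, while the over-$75,000 coefficients clear only the 5% and 10% cutoffs, which agrees with the stars shown in the tables.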

                 (1)              (2)
VARIABLES        avordervalue     avordervalue

over75k          5.586**          6.785*
                 (2.239)          (3.845)

population       -3.01e-05        -3.95e-05
                 (3.62e-05)       (6.14e-05)

numorders        0.0897           0.0417
                 (0.113)          (0.130)

Constant         41.23***         41.32***
                 (0.801)          (1.672)

Observations     2,923            549
R-squared        0.003            0.006

Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

4.3 Seasonal Convergence Test

For this next test, I wanted to see if holiday times affected income-based buying habits. In other words, does it matter what your income is when the holidays roll around? I separated the zip codes into quartiles based on income. Next, I divided the sales data into monthly totals, and tabulated them based on which quartile each sale came from. My theory was that there would be a noticeable separation between quartiles during the off-season (between holiday times), and that in November and December there would be a convergence, as people spend more freely during those two months and income does not matter as much. In the first graph, the X-axis is time and the Y-axis is percent of monthly sales. As we can see, a relatively large separation appears close to the holiday season, where the top quartile separates from the others and the bottom quartile dips below the middle 50%. Oddly enough, from May to October the monthly spends become inverted, with the poorer zip codes spending more than the richest ones. Contrary to my hypothesis, there is a noticeable divergence in the rich zip codes in November and December, rather than a convergence. Even so, these findings are notable.
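
The quartile tabulation described above can be sketched as follows. This is a minimal illustration under stated assumptions: zip codes are ranked by income and cut into four equal-sized groups, and "percent of monthly sales" is read as each quartile's share of that month's total sales. The function names, the record layout, and the toy data are hypothetical.

```python
from collections import defaultdict


def assign_quartiles(income_by_zip):
    """Map each zip code to a quartile, 0 (poorest) to 3 (richest), by income rank."""
    ranked = sorted(income_by_zip, key=income_by_zip.get)
    n = len(ranked)
    return {zip_code: 4 * i // n for i, zip_code in enumerate(ranked)}


def monthly_shares(orders, quartile_of):
    """Return {month: [share of that month's sales from quartile 0..3]}."""
    totals = defaultdict(lambda: [0.0] * 4)
    for zip_code, month, value in orders:
        if zip_code in quartile_of:
            totals[month][quartile_of[zip_code]] += value
    return {m: [v / sum(vals) for v in vals] for m, vals in totals.items()}


# Toy data: four made-up zip codes, equal spending in July, unequal in December.
income_by_zip = {"A": 30000, "B": 50000, "C": 70000, "D": 90000}
orders = [
    ("A", 7, 10.0), ("B", 7, 10.0), ("C", 7, 10.0), ("D", 7, 10.0),
    ("A", 12, 10.0), ("B", 12, 20.0), ("C", 12, 30.0), ("D", 12, 40.0),
]
quartile_of = assign_quartiles(income_by_zip)
shares = monthly_shares(orders, quartile_of)
```

In the toy data the four shares are equal in July and spread apart in December, which is the kind of divergence the chart is meant to reveal. Replacing the share calculation with the raw totals would give the dollar-value version of the same chart.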

[Figure: Seasonality Convergence Test. Y-axis: percent of monthly sales (15% to 35%). X-axis: month (Jan to Dec). Series: Poorest 25%, Lower Middle 25%, Upper Middle 25%, Richest 25%.]

In the graph below, I changed the Y-axis from a percent value to a total dollar value to try to see another side of this picture. In the months of April through October, we see little to no separation between the quartiles. In fact, the poorest zip codes had a relatively high purchase rate in August, higher than all the other categories of zip codes. Similar to above, however, we see a noticeable divergence in the holiday months, where the richest zip codes end up spending more than the poorer ones. Again, although my hypothesis was that zip code spending would converge in November and December, I still find these findings extremely interesting. Rich people appear to have spending habits similar to everyone else's throughout the year, but treat the holidays as a time to stretch the wallet a little farther than other folks. Maybe holiday spending is proportional to your income, while spending throughout the year is simply a fixed cost on a site such as Shelly Cove.

[Figure: Seasonality Convergence Test. Y-axis: total dollar value of monthly sales (0 to 40,000). X-axis: month (Jan to Dec). Series: Poorest 25%, Lower Middle 25%, Upper Middle 25%, Richest 25%.]

So what does this mean? Two possible conclusions can be drawn. The first is that rich people don't spend as much of their disposable income during the year (or, put differently, in off-season times everyone spends about the same regardless of income). The second is that rich areas do tend to spend more during holiday times (or at least that rich people buy larger quantities, since we are measuring total spend in the zip codes rather than the average order value). While this is the opposite of my hypothesis, it is still encouraging to find notable results in this test.

Chapter 5:

Conclusion & Improvements

5.1 Conclusion

In this paper, I aimed to test the statistical significance of a series of variables on the purchasing habits within a zip code. Data that describes the characteristics of the highest-spending customers is extremely valuable information for marketers and business owners, but it is not easy to nail down without expansive data sets that span a large number of variables. Using variables that could be derived from my data set, I was able to run tests on income levels, number of dependents, number of orders from a zip code, and population to see if they accurately predicted checkout totals for the store Shelly Cove, LLC. After conducting this study, I believe that the number of dependents does not have an effect on purchasing habits. The statistical significance is severely lacking, and theory says the effect could go either way. The rest of the variables, I believe, would be useful in a larger-scale study with more variables included and a much larger data set. The statistical significance of many of these variables is low, but I hypothesize that this is due to an already small data set being broken into even smaller zip code segments. Population had a very small negative effect on order value, which contradicted my hypothesis. Number of orders received from a zip code had a decently sized positive effect on order value, but the statistical significance was low. Again, I believe this problem could be fixed with a larger data set.

The most significant variable I found was income. Having tested this variable in two ways, I have come to the conclusion that mean income is a better predictor of order value than the $75,000 measure. While income had a positive effect on order value, it did not seem to be the biggest deciding factor, which further leads me to believe there were numerous cases of omitted variable bias. Even with the small data set, mean income was statistically significant at the 1% level, which is encouraging for possible future study on this topic. Although I believe most of the variables I chose would be statistically significant with a larger data set, I do not believe this is the whole story. Creating a better model with a more precise specification would require a data set that captures geographical and cultural data, and purchase data from a company that appeals to a wider and more diverse population.

5.2 Improvements

As stated above, I have a series of improvements that I would attempt to implement if this study were to be conducted again in the future. First, obtain a larger data set so that there are at least 10 orders from every zip code being studied. Second, obtain personal household income, which would negate the need for more zip code data. Third, use data from a company that has a wider range of products and has been around longer. Finally, collect more variables: gender, age, where the customer heard of the company, who they are buying for, etc. would be extremely valuable information to a marketing team, and could cut down on advertising costs by enabling precisely targeted advertising campaigns.

Appendix:

Regression Tables

Table 1: The first regression for average income on the zip codes with over 2 orders.

Table 2: I regressed the average zip code order value on average income to see if there was a noticeable correlation with just
these two variables involved. There was nothing more significant than before.

Table 3: For the last regression in the first "half" of the study, I regressed the same variables as the first regression (minus
number of dependents), on the zip codes that contained more than 5 orders.

Table 4: I ran two regressions below, with the percent over $75k as the new income descriptor. In the second regression, I left out
the number of dependents measure, as it was extremely statistically insignificant.

Table 5: Finally, I ran the same regression as above (omitting the dependents measure) on the zip codes with at least 5 orders.

Table 6: A summary of descriptive statistics for my main variables in the regression analysis

Works Cited:

[i] Statista. Feb 2016. Accessed September 28, 2016. https://www.statista.com/statistics/266282/annual-net-revenue-of-amazoncom/

[ii] Ecommerce Sales, Internet Retailer. Matt Linder, Jan 29, 2016. Accessed Sept 28, 2016. https://www.internetretailer.com/2016/01/29/online-sales-will-reach-523-billion-2020-us

[iii] Blanca Hernández, Julio Jiménez, M. José Martín, "Age, gender and income: do they really moderate online shopping behaviour?", Online Information Review, Vol. 35 Iss: 1, pp. 113-133

[iv] Hannah, Brendan and Lybecker, Kristina M., Determinants of Recent Online Purchasing and the Percentage of Income Spent Online (May 29, 2009). Colorado College Working Paper No. 2009-02. Available at SSRN: http://ssrn.com/abstract=1413983 or http://dx.doi.org/10.2139/ssrn.1413983

[v] Zuroni Md Jusoh, Goh Hai Ling, International Journal of Humanities and Social Sciences, Vol. 2 No. 4, "Factors Influencing Consumers' Attitude Towards E-Commerce Purchases Through Online Shopping"

[vi] http://www.businessinsider.com/the-surprising-demographics-of-who-shops-online-and-onmobile-2014-6

[vii] Ellen Gabarino & Michael Strahilevitz, 2004, Journal of Business Research 57, pp. 768-775, "Gender differences in the perceived risk of buying online and the effects of receiving a site recommendation"

[viii] IRS 2014. Accessed Aug 25, 2016. https://www.irs.gov/uac/soi-tax-stats-individual-income-tax-statistics-2014-zip-code-data-soi
