
Anne Monet LaPine

Stats 1040
April 29th 2016
Prof. Ping

The Rainbow Report


Report Introduction

As a class, our elementary statistics course took part in a project to statistically analyze the colors
of Skittles found in a bag of candies. Each student bought their own bag of Skittles and recorded the
count of each candy color in that bag. These numbers were then compiled by our professor, who returned
a spreadsheet with the total count of each candy color, the total count of all candies, and the sample
size (the number of bags of candy). With this data, we got the chance as students to look critically at
what we found as a class about the color distribution in each random bag of candies.
Organizing and Displaying Categorical Data: Colors
Pie Chart: Overall Combined Class Skittle Colors

[Figure: Skittle Pie Chart Class Totals. Green 0.215, Orange 0.212, Yellow 0.196, Red 0.194, Purple 0.182]

Pareto Chart: Overall Combined Class Skittle Colors

Colors    Frequency   Cum. Freq.   Cum %
GREEN     352         352          21.542%
ORANGE    346         698          42.717%
YELLOW    321         1019         62.362%
RED       317         1336         81.763%
PURPLE    298         1634         100.000%
Total     1634

[Figure: By Color Frequency Pareto Chart. x-axis: Skittle Color; y-axis: Skittle QTY. Bars: Green 352, Orange 346, Yellow 321, Red 317, Purple 298]

Pictured above are two graphs that help put this data into perspective. In the pie chart, the candy
colors appear fairly evenly distributed: the proportions differ by only a few percentage points from one
color to the next (from 0.182 to 0.215). The Pareto chart shows the same data in descending order of
frequency. For our class project, green was the most common candy color and purple was the least common.
Together, these graphs don't quite show what I expected to see within the bags of Skittles collectively.
I assumed the distribution would be more even, with the same number of candies of each color in every
bag, because the packaging of Skittles is likely highly automated.
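The cumulative columns of the Pareto table can be recomputed from the five color totals. The following is a minimal sketch in Python, assuming the class totals listed in the frequency table above; the function name `pareto_rows` is only illustrative.

```python
from itertools import accumulate

# Class totals per color, copied from the frequency table in this report.
counts = {"Green": 352, "Orange": 346, "Yellow": 321, "Red": 317, "Purple": 298}


def pareto_rows(counts):
    """Return (color, frequency, cumulative frequency, cumulative percent)
    rows, sorted from the most frequent color to the least frequent."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(freq for _, freq in ordered)
    running = accumulate(freq for _, freq in ordered)
    return [
        (color, freq, cum, 100 * cum / total)
        for (color, freq), cum in zip(ordered, running)
    ]


rows = pareto_rows(counts)
total = sum(counts.values())
for color, freq, cum, cum_pct in rows:
    # freq / total is the proportion shown on the pie chart.
    print(f"{color:<7}{freq:>5}{cum:>6}{cum_pct:>9.3f}%  share={freq / total:.3f}")
```

Sorting before accumulating is what makes this a Pareto layout: the cumulative percentage climbs fastest at the start and reaches 100% at the last color.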

[Figure: Personal Bag Pie Chart. Orange 0.250, Green 0.233, Red 0.200, Yellow 0.183, Purple 0.133]

Personal Pie Chart: Overall Skittle Colors

The data for my personal bag of Skittles is shown in the pie chart (above) and the histogram (below).
The most common color in my bag was orange, with green a close second. The least frequent color was
purple, just as in the class totals. My data broadly matches the class data, which also shows more green
and orange Skittles than purple Skittles.

[Figure: Personal Bag Histogram. x-axis: Skittle color (Purple, Green, Yellow, Orange, Red); y-axis: Skittle QTY (0 to 15)]

Collective data: See shaded row for personal data.

Personal Histogram Chart: Overall Skittle Colors

Organizing and Displaying Quantitative Data: the number of candies per bag

[Figure: Skittle Frequency Histogram. x-axis: Skittle Color; y-axis: Skittle QTY. Bars: Green 352, Orange 346, Yellow 321, Red 317, Purple 298]

Frequency Histogram: Overall Combined Class Skittle Colors

[Figure: Frequency of Bag Contents. x-axis: Number of Skittles Per Bag (54 to 64); y-axis: Number of Bags]

Frequency Histogram: Overall Frequency of Bag Contents

Let us now discuss the number of Skittles per bag submitted by the students of our class. Our sample
size for this project was 27 bags of Skittles (one submitted by each student). The mean number of
candies per bag was 60.5. The median for our sample was 61, the modes were 61 and 62, and the range was
10 (from a minimum of 54 to a maximum of 64). The sample standard deviation of our data is 2.5. Pictured
below is a box plot with a five-number summary of the sample data set.
Five-number summary (number of candies per bag)
MIN         54
Q1 (25%)    59
MED         61
Q3 (75%)    62
MAX         64

My Personal bag count: 60 Candies


Class Sample Size: 27 bags

[Figure: Five Number Summary Box Plot. Axis range: 53 to 65 candies per bag]

Box Plot: Five Number Summary Box Plot

The mean number of candies per bag is 60.5. This is consistent with my personal bag, which contained 60
candies. If we look at the five-number summary plot, we can see that our data is slightly skewed to the
left: bags with more candies occur more often than bags with fewer. We can see this by observing that
the median (Quartile 2) of 61 sits closer to the maximum of 64 than to the minimum of 54. The graph does
not match what I expected to see, given that the Skittles packaging process is likely highly automated.
I assumed that there would be little variation in the number of Skittles per bag, since each candy should
weigh approximately the same and machine packaging should fill each bag to nearly the same weight (in
ounces). Within our data set, I feel that 54 may be an outlier, and that the majority of students' bags
contained between 57 and 64 candies. By the 1.5 x IQR rule, the lower fence is 59 - 1.5(62 - 59) = 54.5,
so a bag of 54 does fall just below it.
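The summary statistics quoted in this section can be checked with Python's standard `statistics` module. This is only a sketch: the 27 bag sizes are an assumed reading of the By Student Bags Pareto Chart later in this report (they are the values whose running total reproduces its cumulative frequencies up to 1634), and the quartiles use the module's default "exclusive" method, which may differ from a textbook's rule on other data sets.

```python
import statistics

# Assumed per-bag counts: {candies in bag: number of bags with that count},
# read off the By Student Bags Pareto Chart in this report.
bag_sizes = {64: 4, 63: 1, 62: 5, 61: 5, 60: 4, 59: 2, 58: 3, 57: 2, 54: 1}
bags = sorted(size for size, n_bags in bag_sizes.items() for _ in range(n_bags))

sample_size = len(bags)
mean = statistics.mean(bags)
median = statistics.median(bags)
modes = statistics.multimode(bags)        # every value tied for most frequent
std_dev = statistics.stdev(bags)          # sample standard deviation (n - 1)
q1, q2, q3 = statistics.quantiles(bags, n=4)
five_number = (min(bags), q1, q2, q3, max(bags))

# 1.5 x IQR rule for flagging a possible low outlier.
lower_fence = q1 - 1.5 * (q3 - q1)
low_outliers = [b for b in bags if b < lower_fence]

print(f"n={sample_size} mean={mean:.1f} median={median} modes={modes}")
print(f"s={std_dev:.1f} five-number summary={five_number}")
print(f"lower fence={lower_fence} possible low outliers={low_outliers}")
```

Under these assumptions the results agree with the figures reported above: 27 bags, a mean of about 60.5, a median of 61, modes of 61 and 62, and a sample standard deviation of about 2.5.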
When observing the data pictured in the Frequency Histogram of bag contents, we can see that a nearly
normal distribution of candies per bag emerges. The box plot is also a great visual for the distribution
that occurred within our data set; however, I believe mine needs further corrections. (The max numbers
are incorrect at this time; I will correct them for a future submission.)
Reflection

The differences between categorical and quantitative data are important to understand. Not all data is
created equal. As we learned in the first few lectures of our class, some data cannot be meaningfully
used in calculations. These types of data include things like gender, yes and no, or pass and fail. They
are considered categorical data. These types of data generally cannot be ranked, but they can be put
into categories. Another common piece of categorical data is color, which generally cannot be ranked
either. However, in our term project we were able to compare colors by counting how often each color
occurred in our data. The counts themselves are quantitative data, even though color remains a
categorical variable. Quantitative data can be observed and ranked by its numerical properties. Common
quantitative data sets are heights in inches, times in seconds, and weights in pounds. These values can
easily be ranked, organized, and used in further calculations.

The graphs used to display these two categories of data are not created equal either. Certain graphs
and charts are better than others for the job at hand. For instance, with categorical data, pie charts
work wonders as a visual aid representing percentages of the data. Other simple charts also work well,
depending on the data in question. A data set of preferred car manufacturers in the state of Utah could
be displayed in a bar graph, where each bar represents a car make and the vertical axis displays the
percentage of Utah drivers in the data set who prefer that make. Quantitative data can be displayed well
on a variety of charts and graphs. Favorable formats include boxplots, stem-and-leaf plots, and frequency
histograms. The important thing to note here is that, depending on what you want to emphasize, you can
pick a graph or chart accordingly. You should pick the chart that makes your data easiest to read and
comprehend. Of the charts I made for our term project, my favorite is the frequency histogram showing
the colors of candies and their frequency distribution.

[Figure: By Student Bags Pareto Chart. x-axis: Student Bag QTY (one bar per student bag, labeled by student initials); y-axis: Skittle QTY. Series: # Each Bag, Cum. Freq. Bag sizes in descending order (number of bags): 64 (4), 63 (1), 62 (5), 61 (5), 60 (4), 59 (2), 58 (3), 57 (2), 54 (1). The cumulative frequency reaches 1634.]

Confidence Interval Estimates

A confidence interval gives a range of values, calculated from sample data, that is likely to contain
the true value of a population parameter. Within inferential statistics, 95% and 99% are the most
commonly used confidence levels. A good way to understand the confidence level is to imagine many
samples of data, all collected using the same methods. For a 95% confidence level, we would calculate a
lower and an upper limit from each sample, and we would expect about 95% of the resulting intervals to
contain the population parameter. These calculations can be performed with the formulas shown below.
Construct a 99% Confidence Interval Estimate for the true proportion of yellow candies.
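Yellow candies: x = 321. Total candies: n = 1634.

\[ \hat{p} = \frac{x}{n} = \frac{321}{1634} = 0.196 \]

\[ z_{\alpha/2} = 2.575 \quad \text{(critical value)} \]

\[ E = z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 2.575 \sqrt{\frac{(0.196)(0.804)}{1634}} = 0.025 \]

\[ \hat{p} - E < p < \hat{p} + E \quad\Rightarrow\quad 0.171 < p < 0.221 \]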

First we were asked to construct a 99% confidence interval estimate for the true proportion of yellow
candies. Our "p hat" represents the proportion of yellow candies in our sample, 0.196. To find the margin
of error, we looked up the critical value for a 99% confidence level in z-score Table A-2, which is
2.575. The margin of error for this sample and confidence level was 0.025. The interval equation is
simple: subtract the margin of error from p hat and add it to p hat to find the lower and upper limits
of the interval. This gave the interval 0.171 < p < 0.221. In other words, based on our sample we are
99% confident that the true proportion of yellow Skittles is between about 17% and 22%.
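The proportion interval above can be reproduced in a few lines of Python. This is a sketch rather than the method used in the report: instead of reading Table A-2, it takes the critical value from `statistics.NormalDist`, and the function name `proportion_ci` is only illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(x, n, confidence):
    """Normal-approximation confidence interval for a population proportion.

    x: number of successes (here, yellow candies)
    n: sample size (here, all candies)
    confidence: confidence level as a fraction, e.g. 0.99
    """
    p_hat = x / n
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)       # critical value z_{alpha/2}
    margin = z * sqrt(p_hat * (1 - p_hat) / n)    # margin of error E
    return p_hat, z, margin, (p_hat - margin, p_hat + margin)


p_hat, z, margin, (low, high) = proportion_ci(321, 1634, 0.99)
print(f"p_hat={p_hat:.3f} z={z:.3f} E={margin:.3f} interval=({low:.3f}, {high:.3f})")
```

Because the code does not round p hat or the margin of error before combining them, the limits can differ slightly in the third decimal place from the hand calculation.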
Construct a 95% Confidence Interval Estimate for the true mean number of candies per bag.
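\[ n = 27, \quad df = 26, \quad s = 2.5, \quad \bar{x} = 60.5 \]

\[ t_{\alpha/2} = 2.056 \quad \text{(critical value)} \]

\[ E = t_{\alpha/2} \frac{s}{\sqrt{n}} = 2.056 \cdot \frac{2.5}{\sqrt{27}} = 0.989 \]

\[ \bar{x} - E < \mu < \bar{x} + E \quad\Rightarrow\quad 59.5 < \mu < 61.5 \]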

We were then asked to construct a 95% confidence interval estimate for the true mean number of candies
per bag. To find the t critical value, I consulted Table A-3 of t critical values with 26 degrees of
freedom and found 2.056. The margin of error found using this critical value was 0.989. The interval
equation is similar to the one for a proportion (shown above) in that we find the lower and upper limits
by subtracting and adding the margin of error, this time around our sample mean, x bar. X bar = 60.5, so
our interval for the true mean is 59.5 < μ < 61.5. This tells us that we are 95% confident that the true
mean number of candies per bag of Skittles is somewhere between 59.5 and 61.5.

Construct a 98% Confidence Interval Estimate for the standard deviation of the numbers of candies per bag.
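\[ n = 27, \quad df = 26, \quad s = 2.5 \]

\[ 98\% \text{ confidence: } \alpha = 0.02, \quad \alpha/2 = 0.01 \]

\[ \chi^2_R = 45.642, \quad \chi^2_L = 12.198 \quad \text{(critical values)} \]

\[ \sqrt{\frac{(n - 1)s^2}{\chi^2_R}} < \sigma < \sqrt{\frac{(n - 1)s^2}{\chi^2_L}} \]

\[ \sqrt{\frac{(26)(2.5)^2}{45.642}} < \sigma < \sqrt{\frac{(26)(2.5)^2}{12.198}} \]

\[ 1.89 < \sigma < 3.65 \]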

The final step in calculating interval estimates for our project was to construct a 98% confidence
interval estimate for the standard deviation of the number of candies per bag. Our sample standard
deviation was 2.5. Referring to Table A-4 in our text with 26 degrees of freedom, we found the chi-square
critical values to be 45.642 (right) and 12.198 (left). Using these critical values in the formula for
the standard deviation interval gives 1.89 < σ < 3.65. This tells us, with 98% confidence, the lower and
upper limits for the population standard deviation.
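A minimal Python sketch of the standard deviation interval follows. It assumes the two chi-square critical values from Table A-4 quoted above, since the standard library has no chi-square distribution; the function name `sigma_ci` is illustrative.

```python
from math import sqrt


def sigma_ci(s, n, chi2_left, chi2_right):
    """Confidence interval for a population standard deviation.

    chi2_left and chi2_right are the chi-square critical values for
    n - 1 degrees of freedom, taken from a chi-square table.
    """
    df = n - 1
    # Dividing by the larger (right) critical value gives the lower limit.
    low = sqrt(df * s**2 / chi2_right)
    high = sqrt(df * s**2 / chi2_left)
    return low, high


low, high = sigma_ci(s=2.5, n=27, chi2_left=12.198, chi2_right=45.642)
print(f"{low:.3f} < sigma < {high:.3f}")
```

Note that the interval is not symmetric around s = 2.5, because the chi-square distribution itself is not symmetric.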

Hypothesis Testing

Hypothesis testing is a valuable tool used by statisticians and researchers around the world. The basic
principles are straightforward, and a test can be set up and solved with the following steps:
1. Identify the null hypothesis and alternative hypothesis from a given claim, and express both in
   symbolic form.
2. Calculate the value of the test statistic, given a claim and sample data.
3. Choose the sampling distribution that is relevant.
4. Either find the P-value or identify the critical value(s).
5. State the conclusion about the claim (based on the original claim) in simple and nontechnical terms.
*Conditions found on pg 382 of text

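\[ H_0: p = 0.20 \qquad H_1: p \neq 0.20 \qquad \text{(p is the proportion of red Skittles)} \]

Red candies: x = 317. Total candies: n = 1634.

\[ \hat{p} = \frac{317}{1634} = 0.194 \]

\[ z = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{n}}} = \frac{0.194 - 0.20}{\sqrt{\frac{(0.20)(0.80)}{1634}}} = -0.61 \]

\[ \text{Critical values: } z = \pm 1.96 \]

There is not sufficient evidence to warrant rejection of the claim that 20% of Skittles candies are red.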


In our first round of hypothesis testing, we were asked to test the claim that the proportion of red
Skittles in a bag of candies is 20%, using a 0.05 significance level. The first step is to state our
null and alternative hypotheses. Next we solved for p-hat = 0.194, the proportion of red Skittles in our
class sample. Using our p-hat we can solve for the test statistic, which is -0.61. The critical value
that corresponds to our significance level is 1.96, and that is the boundary we use to judge our test
statistic of -0.61. Since this is a two-tailed test, we can see that -0.61 falls between the critical
values of -1.96 and +1.96, outside the rejection region. In this instance we fail to reject the null
hypothesis: there is not sufficient evidence to warrant rejection of the claim that 20% of Skittles
candies are red.
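The two-tailed test for the red proportion can be sketched in Python as follows. The P-value is an addition for illustration (the report uses the critical-value method), the critical value comes from `statistics.NormalDist` rather than a table, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0, alpha):
    """Two-tailed z test of H0: p = p0 against H1: p != p0."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)    # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    p_value = 2 * NormalDist().cdf(-abs(z))       # area in both tails
    reject = abs(z) > z_crit
    return z, z_crit, p_value, reject


z, z_crit, p_value, reject = one_proportion_z_test(x=317, n=1634, p0=0.20, alpha=0.05)
print(f"z={z:.2f} critical=+/-{z_crit:.2f} p-value={p_value:.3f} reject H0: {reject}")
```

A test statistic inside the critical values means we fail to reject the null hypothesis. That is weaker than proving the claim: the sample is simply consistent with 20% of the candies being red.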
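\[ H_0: \mu = 55 \qquad H_1: \mu \neq 55 \qquad \text{(}\mu\text{ is the mean number of candies per bag)} \]

\[ n = 27, \quad df = 26, \quad \bar{x} = 60.5, \quad s = 2.5, \quad \alpha = 0.01 \]

\[ \text{Critical values: } t = \pm 2.779 \]

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{60.5 - 55}{2.5 / \sqrt{27}} = 11.432 \]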

Test statistic fell within rejection region. There is sufficient evidence to warrant rejection of the claim that the
mean number of candies in a bag of Skittles is 55.
Our second hypothesis test examined the claim that the mean number of candies in a bag of Skittles is
55, using a significance level of 0.01. Here we stated our null and alternative hypotheses, and then
located our critical value in Table A-3, which is 2.779. Solving for our test statistic, we found a
value of 11.432. This result was startling to me, for it fell much farther out on the distribution than
any I had seen previously. This test statistic was well within the rejection region. Therefore there is
sufficient evidence to warrant rejection of the claim that the mean number of candies in a bag of
Skittles is 55. After consulting our original class data, it became clear why this test statistic was so
far into the rejection region. The majority of our class bags contained more than 57 candies, with the
mean falling at 60.5. However, it was interesting to test a hypothesis of a much lower mean to see the
results that it provided.
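The t test statistic can be checked with a short Python sketch. As before, the critical value 2.779 is the Table A-3 value quoted above and is passed in as a constant; the function name `one_sample_t` is illustrative.

```python
from math import sqrt


def one_sample_t(x_bar, mu0, s, n):
    """Test statistic for H0: mu = mu0 when sigma is unknown."""
    standard_error = s / sqrt(n)
    return (x_bar - mu0) / standard_error


t_stat = one_sample_t(x_bar=60.5, mu0=55, s=2.5, n=27)
t_crit = 2.779                      # two-tailed, alpha = 0.01, 26 degrees of freedom
reject = abs(t_stat) > t_crit
print(f"t={t_stat:.3f} critical=+/-{t_crit} reject H0: {reject}")
```

The statistic is large because the standard error, 2.5 divided by the square root of 27, is under half a candy, while the sample mean sits 5.5 candies above the claimed mean.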
I felt it was a cool opportunity to apply the concepts we have been studying in class to something
tangible, like data we collected ourselves. Performing the calculations on our own sample has been
challenging and interesting, and a good change of pace from our textbook. I felt that the confidence
interval calculations were appropriate to our sample. For example, the question about the proportion of
yellow candies produced an interval of 17% to 22%. When I reviewed the data gathered by my individual
classmates, it matched up fairly well, even better than I expected. This observation makes me feel
confident about the concept of using confidence intervals within inferential statistics.
The hypothesis testing was also fun and challenging, and even had me stuck on a few occasions. Solving
the test statistics based on our class data provided some insight that allowed us to make conclusions about our
data. When the test statistic value was so off the chart in part two, it was a great representation for me about
how a hypothesis can be a good place to start and actually lead to some solid inferences. I also really enjoyed
the visual representation of how far gone our test statistic was in that example.
I think errors can easily be made in equations and problems like these. For one thing, in the beginning
I felt that the student's bag with only 54 candies was most certainly an outlier. An outlier can affect
the class mean and standard deviation. It might have been possible to exclude that bag from our sample,
seeing as the rest of the class data looked approximately normal. In addition, I made many calculation
errors throughout the project. I think human error is common with almost anything, and it is something
to take into account, especially with statistics. I was fortunately able to work with my peers and my
professor to iron out flaws in my rounding and even data entry errors in my calculations. This project
taught me a lot about teamwork, and I benefited greatly from connecting with my classmates and working
together. I also feel that we all did much better on the test after practicing these skills in this
project.
To conclude what the statistical analysis of Skittles has revealed, I believe there may be cause enough
to state my own hypothesis. When counting out the candies in my personal bag, a thought occurred to me:
what if the frequency of the colors in the bags is affected by what it costs the Skittles manufacturer
to make candies of each color? Sure enough, after reviewing the class-wide collection of data, it is
clear to see that not all colors are created equal. Purple and red had the lowest class totals, while
green and orange had the highest. Could it be rational to assume that orange and green are cheaper
colors to manufacture? I feel we would need a larger sample of Skittles to test this hypothesis. In
addition, I feel we would also need wholesale price information about some of the ingredients used in
manufacturing the candies.

Reflection

This project has proven itself to be a fun and intuitive process. I have learned a lot about how to better
apply the concepts I have learned from this class, and I have surely learned a lot more about skittle color and
quantity frequency distributions!
I have admittedly struggled with math for what has seemed to be my entire life. Yet I feel very
fortunate to have taken statistics this semester. Statistics has been the most rewarding and intuitive
experience I can recall having within mathematics. The concepts were easy to grasp, and the ability to
use visual aids truly helped me process and comprehend the statistics being found. I really enjoyed the
majority of the learning processes and lessons. I am looking forward to the lessons I have learned being
tools in my tool belt that I can use later in my schooling career. In spring of 2017, I will transfer to
the University of Utah to complete my bachelor's degree in Psychology. After completing my undergraduate
degree, I plan to apply to the Master of Occupational Therapy program also offered at the U. I have
learned throughout my schooling, especially in the sciences involved with Psychology, that statistics
are heavily present. I acknowledge that as I dive more fully into research and methods within my
undergrad, the careful, keen eye and knowledge base I have gained here in my statistics class will help
me in phenomenal ways.
In this project my classmates and I learned more about how to engage with hypothesis testing. I think
that it is a pivotal part of inferential statistics, and is truly relevant to where I want to be in my
future. Hypothesis testing still doesn't feel exactly like my strong suit, so to speak, yet I have
gained valuable insight from using the techniques practically. My highest praise and thanks go out to
the small study group I have worked with this semester. As a team, every Saturday we would hash out new
material and explain it to one another so we could gain a better understanding. After missing a lecture
while out of town, I felt completely lost with the concepts involved in hypothesis testing: the set-up,
the test statistics, critical values, literally everything. If it weren't for the time spent with my
teammates, I would have done poorly on the final section of this term project as well as our final
section test. One skill that will endure well past my schooling is the ability to work with my cohort
and build strong interpersonal relationships. Many of my most successful school, work, and life moments
have stemmed from my ability to connect and network with a team of people around me. I will hold this
value in high esteem as I push forward in schooling, and in life.
Problem solving was a huge component of this project. At times the information and set-up seemed
relatively straightforward and intuitive. At other times, I found it challenging to extract the correct
data from our sample in order to get the proper proportions or test statistics. This became frustrating
and overwhelming to me. However, it gave me a wonderful chance to dive in deeper and read with more
careful study and comprehension. I felt I gained more from the set-up of these problems than I did from
the standard textbook questions. The more visual and practical application provided a sturdy foundation
from which to spring into better understanding. The other huge element of critical thinking and problem
solving was the presentation of this project. I have almost zero experience with Excel, and I remember
feeling completely overwhelmed by the task at hand. How on Earth were all of these graphs and charts
going to happen?! Magically?! It was daunting, to say the least. Fortunately for us these days, YouTube
had all of the answers. I learned through a lot of trial and error that Excel is actually a fairly
intuitive program, with far more robust possibilities than I originally expected. I learned a ton while
performing this project in both Excel and Word.
After this class, I have already noticed that I more critically observe and assess the statistics I see
in my day-to-day life. I believe that a fuller understanding of statistical methods can prepare the
everyday consumer to be mindful. In our current day and age it is so common for us to be bombarded by
statistics, but they are often so heavily biased that it is clear the only intention is to sell a
product. This must be the price of a consumer and capitalist society, but I do not think ignorance is
bliss. Having even a basic understanding of how statistics are found can help us to stay away from hoax
products, or even just to look twice before believing something we read online. I think it is important
to have statistics that are as unbiased as possible, and I will continue to honor that concept as I push
into research methods of my own in the next few years of my continued education.