MATH-1040
11/29/2018
of flavors and record the number of candies of each of the five colors in the bag. Each of us
recorded the Skittles counts for our own bag, then submitted them in an online poll so that the
instructor could compile the data for all the students in this course. The instructor then gave us that data
to use for the remainder of this term project. Our goal is to use that data to compute statistics so
that we can learn how sample data is used to approximate population values and to draw
conclusions about the population. To compute these statistics, we were assigned to separate groups of at least
6 people to produce a variety of results for that data, such as graphs and confidence intervals. This paper
will contain parts 2 to 5 of the team project, as the first part didn’t include any assignments to submit
other than the individual data. Before including the fifth part of the project, this paper will include the summary
Part 2
After we put in our data for the bag of Skittles, we focused on the second part of the team
project, using spreadsheet software to compute the statistics. The group portion of the project
showcases the pie and bar graphs of the distribution of each color across the total number of candies in all
the bags. In each of these graphs, the totals for the colors of candies appear to be very close to each
other, as if they were distributed almost evenly. The relative frequencies for
the colors in the data table support this approximate balance. Our group concluded
that even though individual bags may vary in terms of the distribution of the colors, the proportions
1. We would expect the proportions of the different colors of Skittles to be equal in a large sample.
We would assume so because the manufacturer likely produces the same number of each color.
Individual bags may vary, but on the whole the proportions should be quite similar.
2.
Skittles Data (pie chart of class totals)
Color    Count   Share of total
Yellow   1410    21.11%
Orange   1356    20.30%
Red      1340    20.06%
Purple   1329    19.90%
Green    1245    18.64%
Jared Hunter
3. a. The data set is treated as a random sample, as it is drawn from the bags bought by the entire Statistics class. The
sampling frame consists of the 111 verified bags of Skittles, one for each student. For each bag, the frame
records the number of candies of each color and the total number of candies.
This is a sample because the data from 111 students in a statistics class may not represent the
entire population. The sample is considered random because the data was entered by the statistics students
themselves, not chosen by the researcher who organized the data.
b. The population would be all Skittles. For this research, we are trying to figure out the
proportion of candies of each color in a 2.17-ounce bag. It is often assumed that the
Skittles factory puts an equal number of candies of each color in the bags, so we need evidence. However,
because these bags are sold in so many locations, we couldn’t count every bag
individually, so we had to use a sample of 111 entries.
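The relative frequencies in the pie chart for item 2 can be rechecked from the color counts. The following is a minimal sketch, not the spreadsheet method the group used; the pairing of each count with its color is inferred from the percentages on the chart, and the variable names are made up for illustration.

```python
# Color counts read from the class pie chart (pairing of count to color
# is inferred from the percentages shown on the chart).
counts = {"Yellow": 1410, "Orange": 1356, "Red": 1340, "Purple": 1329, "Green": 1245}

total = sum(counts.values())  # total candies in all bags

# Relative frequency of each color = count / total
rel_freq = {color: n / total for color, n in counts.items()}

for color, p in rel_freq.items():
    print(f"{color:<7}{counts[color]:>6}{p:>9.2%}")

# If the colors were perfectly balanced, each would have a 1/5 share.
equal_share = 1 / len(counts)
largest_gap = max(abs(p - equal_share) for p in rel_freq.values())
print(f"Largest gap from an equal share: {largest_gap:.2%}")
```

Every color lands within about one and a half percentage points of an equal 20% share, which is the balance described in the group portion.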
For the individual portion of this part, I was assigned to create a table comparing individual data
to the entire class data, and answer three questions regarding the graphs and data in at least one
paragraph. The table should contain the counts of candies by color for the individual data and the total
sample data. For the essay section with three prompts, I was asked whether the graphs we made reflected
what I expected to see. I was also asked whether any observations are outliers, a
topic that the class would cover in later chapters. For the final prompt, I was
asked to compare my distribution of the colors in a single bag with the total distribution of the colors for
For the graphs, I didn’t expect the categories to look so even. In the pie chart,
the categories look like they have almost the same percentages, which makes the
graph look nearly perfectly balanced. The Pareto chart has the bars descending from one category to the next, but with the y axis starting
at 0 (as it should, to avoid a misleading graph), the bars are closer to each other in height than
expected.
According to the file for the Skittles data set for this course, a few entries appear to be
highlighted as outliers. One of them is much lower than the mean of 60 candies, with a total of 35
candies. The other few entries have totals that are above 90. The outliers don’t affect the graphics
and summary statistics much, as most of the bags have totals close to 60 candies.
Since there are more than a hundred entries in the data set, it’s safe to say that a few outliers can’t
shift the mean to an unusually higher or lower number than 60 total candies. This reminds
me of the law of large numbers, which states that the average of the results should be close to the
expected value when the same experiment is performed a large number of times.
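The claim that single outliers barely move the mean can be checked with the figures reported in this paper: 6680 candies in 111 bags, a lowest bag of 35, and a highest bag of 97 (the maximum from Part 3). The sketch below is only an illustration; the helper name `mean_without` is hypothetical.

```python
# Figures taken from the paper: 6680 candies in 111 bags.
total_candies = 6680
bags = 111

mean_all = total_candies / bags  # mean with every bag included


def mean_without(bag_total):
    """Mean of the remaining bags after dropping one bag with the given total."""
    return (total_candies - bag_total) / (bags - 1)


mean_without_min = mean_without(35)  # drop the lowest bag
mean_without_max = mean_without(97)  # drop the highest bag

print(f"All bags:            {mean_all:.3f}")
print(f"Without the 35 bag:  {mean_without_min:.3f}")
print(f"Without the 97 bag:  {mean_without_max:.3f}")
```

Dropping either extreme bag changes the mean by less than half a candy, so the mean stays near 60 either way.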
Since the mean of this data set is about 60 candies per bag, I would say that the total for my bag
matches that of most of the data. However, for the distribution of colors, my counts for the
categories differ from those of many class entries. For example, my count for the red candies is 13, but
some entries have numbers that are lower or higher than that. Another example is that much of the
class data has numbers that are higher than my counts. Some entries have at least 17 candies in at
least one of the categories, which my data doesn’t have. I can hardly find one entry that has the
exact same distribution as my data.
Part 3
The third part of the team project is part of the module on summary statistics. For the group
portion, our group was assigned to calculate the mean, standard deviation, and the 5-number summary
for the total candies in each bag. The 5-number summary consists of the minimum, the first quartile, the
median, the third quartile, and the maximum of the data. Next, we were assigned to make a frequency
histogram and box plot for the total candies in each bag. According to the histogram, the
typical total number of candies per bag seems to be around 60, which agrees with the calculated mean.
According to the boxplot, the data for the total candies has ten outliers: four on the low
side, below the rest of the data, and six on the high side.
1.
a. The mean number of candies per bag is 60.2.
b. The standard deviation (sample) is 7 candies per bag.
c. 5 number summary
Minimum 35
Quartile 1 58
Median 59
Quartile 3 61
Maximum 97
2. Frequency Histogram
3. Boxplot
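One common way to decide which points a boxplot marks as outliers is the 1.5 × IQR rule. This paper does not say which rule the spreadsheet applied, so the sketch below is an assumption; it uses only the 5-number summary reported above.

```python
# 5-number summary reported for total candies per bag.
minimum, q1, median, q3, maximum = 35, 58, 59, 61, 97

# Interquartile range: the width of the box in the boxplot.
iqr = q3 - q1

# ASSUMPTION: the common 1.5 * IQR rule; the paper does not name the rule used.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("Minimum is a low outlier: ", minimum < lower_fence)
print("Maximum is a high outlier:", maximum > upper_fence)
```

Under this rule, any bag with fewer than 53.5 or more than 65.5 candies is flagged, so both the minimum of 35 and the maximum of 97 count as outliers.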
For the individual portion of the third part, I was assigned to write two essays:
one paragraph on my findings about the variable “Total candies in each bag”, and half a
page on the difference between categorical and quantitative data. In
the first essay, I was supposed to analyze the shape of the distribution and reflect on the graph. I was
also supposed to compare the overall data with my own data in terms of the total number of candies
per bag. In the second essay, I was supposed to write half a page on the types of graphs that work for
categorical and quantitative data. I was also supposed to write which graphs won’t work for each of
1. According to the boxplot and the frequency histogram, the shape of the distribution
appears to be skewed right. I compared the median with the first and third quartiles, as well as
with the minimum and maximum values. The distance between the median and the first quartile
is 1 candy. For the third quartile, minimum, and maximum values, the distances are 2, 24, and 38
respectively. Sure enough, the distances from the median to the maximum and third
quartile are larger than those to the minimum and first quartile. I expected the graph
to be symmetric, with the outliers not affecting the middle of the graph, where the median
and the mean would coincide. As it turns out, the median sits to the left of the mean because the higher outliers pull the mean up.
However, the gap between the median and the mean is not large; it is just enough to be
noticeable. In conclusion, the graphs don’t reflect what I expected to see. The
total number of candies in my bag is 61. Because the mean of the overall class data is 60.2 and
the median is 59, I can say that among the 6680 candies in 111 bags, my bag is
typical. The mean and median are both close to my number of candies in a bag.
2.
a. Categorical data is the type of data that is described by qualitative traits. Examples of
variables for categorical data include race, gender, age group, and educational level. The
order of the categories may or may not matter, as categories can be nominal or ordinal. In the
second part of this team project, we listed the number of candies in each 2.17-
ounce bag of regular flavor Skittles by color.
b. Quantitative data is the type of data that consists of measurable information, whether discrete
or continuous. The order of the data does matter, as the data can be on an interval or ratio scale, and
both scales share the ordering of ordinal variables. Examples of measurable information include
temperature, length, mass, and tax rates. In the third part of this team project, we
measured the total number of candies per bag for each student in the data, and sorted
the totals into separate classes, which are values grouped by intervals of numbers for a frequency
distribution. The similarity between categorical data and quantitative data is that they both group
values for a frequency distribution.
c. One of the differences between the two types of data is that categorical data lists the
frequencies in categories that consist of specific qualitative traits, while quantitative data
lists the data by classes, which consist of numerical values grouped in intervals. Another
difference is that the frequencies for each category in categorical data are easy to count,
while values from quantitative data may vary to the extent that they have to be placed in
intervals of numbers depending on the size and distribution of the data.
d. For categorical data, the types of graphs that make sense are pie charts, bar graphs, and
Pareto charts. These graphs compare the categories of the data to each other by frequency
distribution. Types of graphs that wouldn’t make sense include box plots and stem-and-leaf
plots. Box plots wouldn’t make sense for categorical data because it would be impossible to
compute the 5-number summary for qualitative categories. Stem-and-leaf plots wouldn’t
make sense because categories are not numerical values that can be split into a stem (the digits to the left of
the rightmost digit) and a leaf.
e. For quantitative data, the types of graphs that would make sense include box plots, bar
graphs, stem-and-leaf plots, dot plots, and time-series graphs. They can work with classes of
measurable data because they show the frequency distribution for each separate interval of
the data. Types of graphs that wouldn’t make sense for quantitative data are the pie chart
and the Pareto chart. A pie chart wouldn’t make sense for quantitative data because the
spread and shape of the distribution would be hard to describe. A Pareto chart wouldn’t make
sense because it sorts the bars by frequency, so the classes wouldn’t stay in numerical order as they should.
Part 4
For the group portion of the fourth part of this project, this would officially be the last time that
we were supposed to work together as a group. This part of the project belongs to the ninth module,
on estimates and sample sizes, and it was the first time we worked on the project as a group
since the third module. We were assigned to compute the confidence interval estimates for the
population proportion of yellow candies and the population mean number of candies per bag. The
confidence levels for these estimates are 99% and 95% respectively. After that, we were
supposed to interpret the results of these interval estimates in conclusions. The calculations were done
using a TI-84 calculator, following the directions for calculating confidence interval estimates on that
calculator. For the proportion of yellow Skittles, we used the 1-PropZInt tool in the calculator. For the
mean number of Skittles, we used the TInterval tool. We also verified that the sample data met the requirements for calculating the
estimates.
Yellow Skittles
Stat->Tests->A. 1-PropZInt
1-PropZInt
(0.19822, 0.22394)
p-hat 0.21108
x 1410
n 6680
c-level 0.99
Verify: n≤0.05N. We can assume there are more than 133,600 Skittles in the population, so yes.
** We are 99% confident that the population proportion of yellow Skittles is between 0.198 and 0.224.
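The 1-PropZInt result can be reproduced by hand from the standard one-proportion z-interval formula, p-hat ± z* × sqrt(p-hat(1 − p-hat)/n). The sketch below uses only Python's standard library and the inputs shown in the calculator output; it is a cross-check, not a description of how the TI-84 computes the interval internally.

```python
from math import sqrt
from statistics import NormalDist

# Inputs from the calculator screen.
x, n, c_level = 1410, 6680, 0.99

p_hat = x / n  # sample proportion of yellow candies

# Critical value z* that leaves (1 - c_level) / 2 in each tail.
z_star = NormalDist().inv_cdf(1 - (1 - c_level) / 2)

# Margin of error and interval endpoints.
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

# The check n <= 0.05N is the same as N >= 20n.
min_population = 20 * n

print(f"p-hat = {p_hat:.5f}, z* = {z_star:.4f}")
print(f"99% interval: ({lower:.5f}, {upper:.5f})")
print(f"Population must be at least {min_population:,} candies")
```

The endpoints agree with the calculator's (0.19822, 0.22394) after rounding, and the population check gives the same 133,600 figure.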
Stat->Tests->8. TInterval
TInterval Inpt:Stats
(58.863, 61.497)
Mean no. 60.18018
St dev 7.000257
n 111
c-level 0.95
** We are 95% confident that the population mean number of Skittles per bag is between 58.863 and 61.497.
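The TInterval result can be cross-checked with the formula x-bar ± t* × s/sqrt(n). Python's standard library has no t-distribution, so the critical value below is an assumption taken from a t-table (about 1.9818 for 95% confidence with 110 degrees of freedom), not something stated in this paper.

```python
from math import sqrt

# Inputs from the calculator screen.
x_bar = 60.18018   # sample mean candies per bag
s = 7.000257       # sample standard deviation
n = 111            # number of bags

# ASSUMPTION: t* for 95% confidence and n - 1 = 110 degrees of freedom,
# taken from a t-table rather than computed here.
t_star = 1.9818

# Margin of error and interval endpoints.
margin = t_star * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"Margin of error: {margin:.3f}")
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
```

With that table value, the endpoints agree with the calculator's (58.863, 61.497) to three decimal places.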
For the individual portion, I was only supposed to write a paragraph explaining the purpose
and meaning of a confidence interval. This portion explains that the confidence interval is used to
approximate the proportion or mean of the population with a range of values, using data from a
random sample. The explanation includes the origin of the term “confidence interval”. I explained how
taking data from a random sample is the quickest way to get data to compute the confidence interval. I
also explained that the population data approximated from the sample data may not be accurate from
The confidence interval is computed from a random sample to estimate
the proportion or mean of the population as a range of possible values. The range extends
a margin of error on either side of the sample proportion or mean. The range contains the values
that we are confident include the population proportion or mean, hence the term “confidence interval”.
The reason why we calculate the confidence interval is to estimate the proportion or mean of the population
without having to get data from the entire population. Taking data from a random sample is the
quickest way to get data. The proportion (for data with two outcomes) or mean of the sample may not
match that of the entire population exactly. The picture we get from the statistics on the sample may not
match that of the entire population, even though it can lead us to believe that whatever holds for the
sample is true of the population. With the range of the interval, we must be sure that one of these
Summary
After using the data from the entire class to calculate the statistics, I learned about how the
distribution for the colors for each bag is presented visually, using a pie graph and a bar graph. I saw that
the total numbers of candies for each color are very close to each other. I also learned how to analyze
the average total number of candies per bag by using frequency histograms, boxplots, and 5-number
summaries. Looking at my total number of candies in my bag, I saw that it was very close to the
calculated mean number in the sample data. I learned how to approximate population values
using confidence intervals, so that we can be confident that the population value is somewhere in a range of
numbers. I explained that a confidence interval may not pin down the population value exactly, but it is the
best way to estimate a population value using data from a sample.
Part 5 (Reflection)
Doing this team project with many partners in a group for an online class, I learned how to apply
statistics to anything that involves data from a sample. I’ve been reading news articles that use data
from a sample but interpret it as if it came from an entire population. I think that people tend to believe
them, even though they know the data is not really from a population. Taking what I learned from
this team project and the rest of the class, I realize that it would take a lot of time to collect enough
sample data to describe the population with complete accuracy. That is the reason why stating a level of confidence
about a population value, using statistics from a sample, is the best idea. A confidence statement about the
population may not be exact, but it gives the best way to be reasonably sure of the parameter. I haven’t
seen confidence intervals in news articles, so if I can calculate the confidence intervals for them using
sample data, then I can be confident that the population value is somewhere in that interval.
During the calculation process of the team project, I regained knowledge of Microsoft
Excel that I first gained back in high school. I must have forgotten all about using Excel since I passed that
high school class about the basics of Computer Science, and I was unable to use Excel for the rest of my
high school years. Working on this team project may be the first time I’ve worked in Excel since then,
and I’ve been able to get used to it. I’ve also used Excel in homework assignments that would require
more work by hand or on a TI-84. For instance, if I want to find the frequency of a specific value in a big list
of sampled data, I use the “COUNTIF” function in a new cell in the spreadsheet, giving it the
range of data and the criteria for the value I want to count. This has helped
me solve homework problems much faster than I would by hand. Excel has played a huge role in
working on the team project. The entire data set for this class on the colors in a regular bag of Skittles is
available in Excel spreadsheet format, so it is easy for me to add graphs and calculate
statistics in one spreadsheet, even if I’m not online. I credit Excel for giving me the best scores on the