
Jared Hunter

MATH-1040
11/29/2018

Team Project Part 6 (ePortfolio)


Introduction
For this course, each student was asked to purchase one 2.17-ounce bag of original-flavor Skittles
and record the number of candies of each of the five colors in the bag. Each of us recorded our
counts and then submitted them in an online poll so that the instructor could compile the data for
all the students in the course. The instructor then gave us that data to use for the remainder of
this term project. Our goal is to use the data to compute statistics so that we can learn how sample
data is used to approximate population values and to draw conclusions about them. To compute these
statistics, we were assigned to groups of at least 6 people and asked to produce a variety of results
for the data, such as graphs and confidence intervals. This paper contains parts 2 to 5 of the team
project, as the first part required nothing to submit except the individual data. Before the fifth
part, this paper includes a summary of what I learned while working on the project.

Part 2
After we entered our data for the bag of Skittles, we moved on to the second part of the team
project, using spreadsheet software to compute the statistics. The group portion of the project
presents pie and bar graphs of the distribution of each color across the total number of candies in
all the bags. In each of these graphs, the totals for the colors are very close to each other, as if
the colors were distributed almost evenly. The relative frequencies for the colors in the data table
confirm this approximate balance. Our group reasoned that even though individual bags may vary in
their distribution of colors, the proportions should be about the same across the entire data set.



1. We would expect the proportions of the different colors of Skittles to be equal in a large sample.
We would assume so because the manufacturer likely produces the same amount of each color.
Individual bags may vary, but on the whole the proportions should be quite similar.

Colors              Total Count    Relative Frequency
Red                 1340           0.201 (20.1%)
Orange              1356           0.203 (20.3%)
Yellow              1410           0.211 (21.1%)
Green               1245           0.186 (18.6%)
Purple              1329           0.199 (19.9%)
Cumulative Total    6680           1 (100%)

2. [Pie chart "Skittles Data": Yellow 1410 (21.11%), Orange 1356 (20.3%), Red 1340 (20.06%),
Purple 1329 (19.9%), Green 1245 (18.64%)]

3. a. The data set is a random sample, as it is taken from a sample of the entire Statistics class. The
sample contains a frame consisting of the 111 verified bags of Skittles, one for each student. The
frame contains the number of candies of each color and the total number of candies for each bag.
This is a sample because the data from 111 students in a statistics class may not represent the
entire population. The sample is random because the data is entered by the statistics students
themselves, not by the researcher who organized the data.

b. The population would be all Skittles. For this research, we are trying to figure out the
proportion of candies of each color in a 2.17-ounce bag. It is often assumed that the Skittles
factory puts an equal number of candies of each color into the bags, so we need evidence to check
that. However, because these bags are sold in many locations, we couldn’t count the whole
population, so we had to use a sample of 111 entries.
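
The relative frequencies in the class table above can be recomputed directly from the color totals. The following is a minimal Python sketch, not the spreadsheet method our group used; the variable names are my own, and the counts are the class totals reported in the table.

```python
# Class totals for each color, copied from the relative frequency table.
counts = {"Red": 1340, "Orange": 1356, "Yellow": 1410, "Green": 1245, "Purple": 1329}

# The cumulative total is the sum of all color counts.
total = sum(counts.values())

# Relative frequency = count for one color / total count.
rel_freq = {color: n / total for color, n in counts.items()}

for color, n in counts.items():
    print(f"{color:<8}{n:>6}{rel_freq[color]:>8.3f}")
print(f"{'Total':<8}{total:>6}{sum(rel_freq.values()):>8.3f}")
```

Rounded to three decimal places, the values agree with the table, and the relative frequencies add up to 1.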

For the individual portion of this part, I was assigned to create a table comparing my individual
data to the entire class data, and to answer three questions about the graphs and data in at least
one paragraph. The table should contain the counts of candies by color for my individual data and
for the total sample data. For the essay section with three prompts, I was first asked whether the
graphs we made reflected what I expected to see. I was also asked whether any observations are
outliers, a topic that the class would study in later chapters. For the final prompt, I was asked to
compare the distribution of colors in my single bag with the total distribution of colors for the
entire class.

Counts    Red     Orange   Yellow   Green   Purple   Total
My Bag    13      15       15       12      6        61
Class     1340    1356     1410     1245    1329     6680

For the graphs, I didn’t expect the categories to look so nearly constant. In the pie chart, the
categories look as if they have almost the same percentages, so the slices look almost perfectly
even. The Pareto chart has bars that decrease from one category to the next, but with the y-axis
starting at 0 (as it should, to avoid a misleading graph), the bars are closer to each other in
height than expected.

According to the file for the Skittles data set for this course, a few entries appear to be
highlighted as outliers. One of them, with a total of 35 candies, is much lower than the mean of 60
candies. The other few entries have totals above 90. The outliers don’t affect the graphs and summary
statistics much, as most of the bags have totals close to 60 candies. Since there are more than a
hundred entries in the data set, it’s safe to say that a few outliers can’t shift the mean to an
unusually higher or lower number than 60 total candies. This reminds me of the law of large numbers,
which states that the average of the results should be close to the expected value when the same
experiment is performed a large number of times.

Since the mean of this data set is about 60 candies per bag, I would say that my total of 61 matches
most of the data. However, my distribution of colors differs from that of many class entries. For
example, my count for the red candies is 13, but some entries have counts that are lower or higher
than that. Another example is that much of the class data has counts that are higher than mine. Some
entries have at least 17 candies in at least one of the categories, which my data doesn’t have. I can
hardly find one entry that has exactly the same distribution as mine.
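
To make the comparison between my bag and the class more concrete, the counts can be converted to proportions side by side. This is only an illustrative Python sketch using the counts from the table above; the helper name `relative_frequencies` is my own and was not part of the assignment.

```python
colors = ["Red", "Orange", "Yellow", "Green", "Purple"]
my_bag = [13, 15, 15, 12, 6]                    # counts from my single bag
class_counts = [1340, 1356, 1410, 1245, 1329]   # class totals for 111 bags


def relative_frequencies(counts):
    """Return each count divided by the sum of all counts."""
    total = sum(counts)
    return [c / total for c in counts]


my_rel = relative_frequencies(my_bag)
class_rel = relative_frequencies(class_counts)

# Print my proportion, the class proportion, and the gap between them.
for color, mine, cls in zip(colors, my_rel, class_rel):
    print(f"{color:<7} my bag {mine:6.1%}   class {cls:6.1%}   difference {mine - cls:+6.1%}")
```

The largest gap is for purple, where my bag has only 6 of 61 candies while the class proportion is close to one fifth.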

Part 3
The third part of the team project belongs to the module on summary statistics. For the group
portion, our group was assigned to calculate the mean, standard deviation, and 5-number summary for
the total candies in each bag. The 5-number summary consists of the minimum, the first quartile, the
median, the third quartile, and the maximum of the data. Next, we were assigned to make a frequency
histogram and a boxplot for the total candies in each bag. According to the histogram, the total
number of candies per bag centers around 60, which agrees with the calculated mean. According to the
boxplot, the data for the total candies has ten outliers: four on the low side of the box and six on
the high side.

1.
a. The mean number of candies per bag is 60.2.
b. The standard deviation (sample) is 7 candies per bag.
c. 5-number summary:
Minimum       35
Quartile 1    58
Median        59
Quartile 3    61
Maximum       97

2. Frequency Histogram

3. Boxplot
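
The 5-number summary above is enough to see why the boxplot flags outliers and why the distribution leans to the right. The sketch below is a rough Python illustration; it assumes the common 1.5 × IQR rule for outlier fences, which is my assumption, since the paper does not state which rule the boxplot software used.

```python
# Five-number summary for total candies per bag, from the group results.
minimum, q1, median, q3, maximum = 35, 58, 59, 61, 97

# Interquartile range and outlier fences (assumed 1.5 * IQR rule).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")

# Under the assumed rule, a value is an outlier if it lies outside the fences.
print("minimum is an outlier:", minimum < lower_fence)
print("maximum is an outlier:", maximum > upper_fence)

# Distances from the median hint at the direction of skew.
distances = {
    "median - Q1": median - q1,
    "Q3 - median": q3 - median,
    "median - min": median - minimum,
    "max - median": maximum - median,
}
for label, d in distances.items():
    print(f"{label}: {d}")
```

Under this assumption, both the minimum of 35 and the maximum of 97 fall outside the fences, and the distances above the median are larger than those below it.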

For the individual portion of the third part, I was assigned to write two essays: one paragraph on
my findings about the variable “Total candies in each bag”, and half a page on the difference between
categorical and quantitative data. In the first essay, I was supposed to analyze the shape of the
distribution and reflect on the graphs. I was also supposed to compare the overall data with my own
data in terms of the total number of candies per bag. In the second essay, I was supposed to describe
the types of graphs that work for categorical and quantitative data. I was also supposed to explain
which graphs won’t work for each of these types of data and why.

1. According to the boxplot and the frequency histogram, the shape of the distribution appears to be
skewed right. I compared the median with the first and third quartiles, as well as with the minimum
and maximum values. The distance between the median and the first quartile is 1 candy. The distances
from the median to the third quartile, minimum, and maximum are 2, 24, and 38 candies respectively.
Sure enough, the distances from the median to the maximum and third quartile are larger than those
to the minimum and first quartile. I expected the graph to be symmetric, with the outliers not
affecting the middle of the graph, where the median and the mean would coincide. As it turns out,
the median (59) sits to the left of the mean (60.2) because the higher outliers pull the mean
upward. However, the gap is not very large; it is just enough to be noticeable. In conclusion, the
graphs don’t match what I expected to see. The total number of candies in my bag is 61. Because the
mean of the overall class data is 60.2 and the median is 59, I can say that my bag is typical of the
111 bags and 6680 candies in the class data. The mean and median are both close to my number of
candies.
2.
a. Categorical data is data described by qualitative traits. Examples of categorical variables
include race, gender, age group, and educational level. The order of the categories may or may
not matter, as categories can be nominal or ordinal. In the second part of this team project, we
listed the number of candies in each 2.17-ounce bag of regular-flavor Skittles by color.
b. Quantitative data is data that consists of measurable information, whether discrete or
continuous. The order of the data does matter, as the data can be at the interval or ratio level,
both of which share the ordering property of ordinal variables. Examples of measurable
information include temperature, length, mass, and tax rates. In the third part of this team
project, we measured the total number of candies per bag for each student in the data and sorted
the totals into classes, which are intervals of values used for a frequency distribution. The
similarity between categorical data and quantitative data is that both group values for a
frequency distribution.

c. One difference between the two types of data is that categorical data lists frequencies in
categories defined by specific qualitative traits, while quantitative data lists frequencies by
classes, which are intervals of numerical values. Another difference is that the frequencies for
each category of categorical data are easy to count, while values of quantitative data may vary to
an extent where they have to be placed in intervals whose width depends on the size and
distribution of the data.
d. For categorical data, the types of graphs that make sense are pie charts, bar graphs, and
Pareto charts. These graphs compare the categories of the data to each other by frequency. Types
of graphs that wouldn’t make sense include box plots and stem-and-leaf plots. Box plots wouldn’t
make sense for categorical data because it is impossible to compute a 5-number summary for
qualitative categories. Stem-and-leaf plots wouldn’t make sense because categories are not
numerical values, so they have no leading digits to form a stem and no rightmost digit to form a
leaf.
e. For quantitative data, the types of graphs that make sense include box plots, bar graphs,
stem-and-leaf plots, dot plots, and time-series graphs. They work with classes of measurable data
because they show the frequency for each separate interval of the data. Types of graphs that
wouldn’t make sense for quantitative data are the pie chart and the Pareto chart. A pie chart
wouldn’t make sense because the spread and shape of the distribution would be hard to describe. A
Pareto chart wouldn’t make sense because it reorders the bars by frequency, so the classes would
no longer appear in numerical order.

Part 4
The group portion of the fourth part of this project was officially the last time that my group was
supposed to work together. This part belongs to the ninth module, on estimates and sample sizes, and
it was the first time we worked on the project as a group since the third module. We were assigned
to compute confidence interval estimates for the population proportion of yellow candies and the
population mean number of candies per bag. The confidence levels for these estimates are 99% and 95%
respectively. After that, we were supposed to interpret the results of these interval estimates in
conclusions. The calculations were done on a TI-84 calculator, following the directions for
calculating confidence interval estimates on that calculator. For the proportion of yellow Skittles,
we used the 1-PropZInt tool. For the mean number of Skittles, we used the TInterval tool. We also
verified that the sample data met the requirements for calculating the estimates.

Yellow Skittles

Stat->Tests->A. 1-PropZInt
1-PropZInt
(0.19822, 0.22394)
p-hat 0.21108
x 1410
n 6680
c-level 0.99

Verify: n·phat·(1-phat) ≥ 10: 6680(0.2110)(1-0.2110) = 1112.08 ≥ 10, yes

Verify: n ≤ 0.05N: we can assume there are more than 133,600 Skittles in the population, so yes

** We are 99% confident that the proportion of yellow skittles is between 0.198 and 0.224.
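
The 1-PropZInt result above can be checked by hand with the usual one-proportion z-interval formula, p-hat ± z* · sqrt(p-hat(1 − p-hat)/n). The Python sketch below is only a cross-check under that assumption about what the calculator computes; it uses `statistics.NormalDist` from the standard library for the critical value, and the variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

x, n = 1410, 6680        # yellow candies and total candies in the class data
confidence = 0.99

p_hat = x / n

# Two-sided critical value z* for the chosen confidence level.
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Margin of error and interval endpoints.
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"p-hat = {p_hat:.5f}")
print(f"{confidence:.0%} interval = ({lower:.5f}, {upper:.5f})")

# Requirement checks from the group work.
print("n * p-hat * (1 - p-hat) >= 10:", n * p_hat * (1 - p_hat) >= 10)
print("population must exceed:", 20 * n)   # n <= 0.05N means N >= 20n
```

The endpoints agree with the calculator output to the five decimal places shown.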

Mean Number of Skittles

Stat->Tests->8. Tinterval
TInterval Inpt:Stats
(58.863, 61.497)
Mean no. 60.18018
St dev 7.000257
n 111
c-level 0.95

Verify n≥30 yes

** We are 95% confident that the population mean number of Skittles per bag is between 58.863 and 61.497.
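
The TInterval result can be checked the same way with the formula mean ± t* · s/sqrt(n). Python's standard library has no Student's t distribution, so this sketch hard-codes the critical value; the value 1.9818 for 110 degrees of freedom is my assumption, read from a t-table, and is not something the calculator displays.

```python
from math import sqrt

mean = 60.18018    # sample mean of total candies per bag
s = 7.000257       # sample standard deviation
n = 111            # number of bags

# Assumption: two-sided 95% critical value of Student's t with
# n - 1 = 110 degrees of freedom, taken from a t-table.
t_star = 1.9818

# Margin of error and interval endpoints.
margin = t_star * s / sqrt(n)
lower, upper = mean - margin, mean + margin

print(f"margin of error = {margin:.3f}")
print(f"95% interval = ({lower:.3f}, {upper:.3f})")
```

With that critical value, the interval matches the calculator's (58.863, 61.497) to three decimal places.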

For the individual portion, I was only supposed to write a paragraph explaining the purpose and
meaning of a confidence interval. This portion explains that a confidence interval is used to
approximate the proportion or mean of the population with a range of values, using data from a
random sample. The explanation includes the origin of the term “confidence interval”. I explained
how taking data from a random sample is the quickest way to get data to compute the confidence
interval. I also explained that the population value approximated from the sample data may not match
the actual population value.

Team Project Part 4 Individual Portion

A confidence interval is used when a random sample is taken and then used to estimate the proportion
or mean of the population as a range of possible values. The range extends a margin of error on
either side of the sample proportion or mean. These are the values between which we are confident the
population proportion or mean lies, hence the term “confidence interval”. The reason we calculate a
confidence interval is to estimate the proportion or mean of the population without having to get
data from the entire population. Taking data from a random sample is the quickest way to get data.
However, the proportion (for data with two outcomes) or mean of the sample may not equal that of the
entire population. The picture we get from the sample statistics may not match the entire
population, even though it can lead us to believe that whatever holds for the sample also holds for
the population. With the range of the interval, we can be confident, to the stated level, that one
of its values is the value for the population, rather than just the sample.

Summary
After using the data from the entire class to calculate the statistics, I learned how the
distribution of colors across the bags is presented visually, using a pie graph and a bar graph. I
saw that the total numbers of candies for each color are very close to each other. I also learned how
to analyze the total number of candies per bag by using frequency histograms, boxplots, and 5-number
summaries. Looking at the total number of candies in my bag, I saw that it was very close to the
calculated mean of the sample data. I learned how to approximate population values using confidence
intervals, so that we can be confident that the population value is somewhere in a range of numbers.
I explained that the estimate from a confidence interval may not be exact, but it is the best way to
learn about a population using data from a sample.

Part 5 (Reflection)
Doing this team project with many partners in a group for an online class, I learned how to apply
statistics to anything that involves data from a sample. I’ve been reading news articles that use
data from a sample but interpret it as if it came from an entire population. I think that people tend
to believe them, even though they know the data is not really from a population. Drawing on what I
learned from this team project and the rest of the class, I realize that it would take a lot of time
to collect enough data for the sample results to be completely accurate for the population. That is
why stating a level of confidence about a population value, using statistics from a sample, is the
best idea. A confidence interval won’t be exactly accurate, but it gives the best way to be
reasonably sure of the parameter. The news articles I read didn’t report confidence intervals, so if
I can calculate the confidence intervals for them using the sample data, then I can be confident that
the population value is somewhere in that interval.

For the calculation process of the team project, I regained knowledge of Microsoft Excel that I first
gained back in high school. I must have forgotten all about using Excel after I passed that high
school class on the basics of Computer Science, and I was unable to use Excel for the rest of my high
school years. Working on this team project may be the first time I’ve worked in Excel since then, and
I’ve been able to get used to it. I’ve also used Excel in homework assignments that would require
more work by hand or on a TI-84. For instance, if I want to find the frequency of a specific value in
a big list of sampled data, I use the “COUNTIF” function in a new cell of the spreadsheet, giving it
the range of data and the criterion that a value must meet to be counted. This has helped me solve
homework problems much faster than I would by hand. Excel has played a huge role in my work on the
team project. The entire data set for this class on the colors in a regular bag of Skittles is
available in Excel spreadsheet format, so it is easy for me to add graphs and calculate statistics in
one spreadsheet, even if I’m not online. I credit Excel for giving me the best scores on the group
portions of the team project.
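
For readers without Excel, the idea behind the “COUNTIF” function mentioned above can be sketched in a few lines of Python. This is only an analogy, not Excel's implementation; the helper `count_if` is a hypothetical name of my own, and the list of bag totals is made up for illustration rather than taken from the class data.

```python
# Hypothetical bag totals, made up for illustration (not the class data).
totals = [59, 61, 58, 60, 61, 59, 61, 62]


def count_if(values, criterion):
    """Count the values for which criterion(value) is true, like COUNTIF."""
    return sum(1 for v in values if criterion(v))


# Frequency of one specific value, and of values meeting a condition.
equal_to_61 = count_if(totals, lambda v: v == 61)
at_least_60 = count_if(totals, lambda v: v >= 60)

print("bags with exactly 61 candies:", equal_to_61)
print("bags with at least 60 candies:", at_least_60)
```

As in the spreadsheet, the two inputs are the range of data and the criterion that a value must meet to be counted.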
