
Lab 0: A Statistical Analysis of Color Distribution in Candy

Jonathan Langton May 4, 2010

PHYS 283

Lab 0 Report

Spring 2010

Abstract
In this experiment, we undertake a statistical analysis of the distribution of colors in two easily available types of candy: M&Ms and Skittles. For each, we determine the frequency with which each color appears. Our results for Skittles are consistent with an even distribution among the five colors; for M&Ms, however, we can reject an even distribution at the $7\sigma$ level. Our results for M&Ms are consistent with an even distribution of green, yellow, orange and brown M&Ms, each with a frequency of 20%, with red and blue each getting 10%. Our 95% confidence intervals for both Skittles and M&Ms were about ±3% or less for each color; we used a sample of n = 733 Skittles and n = 906 M&Ms. Additionally, we present a smaller-scale experiment in which we check that the distribution of our results is consistent with the binomial distribution, which underlies the generalization to a normal distribution at larger sample sizes. We count the number of green M&Ms in a total of N = 10 samples, each containing n = 20 M&Ms; we compare our results to the predictions of a binomial distribution and find no significant discrepancies.

Introduction
The rapid expansion of scientific knowledge since the days of Galileo and Newton was enabled by the philosophical shift that demanded that explanations of natural phenomena be systematically tested against observations; in other words, that theory be tested by experiment. Despite the vast improvements in technology which have increased the precision of scientific experiments since the 17th century, both systematic and random errors continue to affect the results of scientific inquiry. Therefore, the development of mathematical methods to quantify the effects of error and to determine the uncertainty inherent in any data set has been as essential to scientific progress as the development of the scientific method itself (a measurement is virtually useless without a knowledge of its uncertainty), and a proper understanding of these mathematical methods is an essential tool of any scientist.

In this lab, then, we intend to develop our understanding of these methods by applying them to the simple problem of determining what fraction of M&Ms (and, later, Skittles) are a given color. While this can be solved to good approximation simply by taking a large enough sample, in order to know just how precise the results are, one must consider a number of statistical issues. The concepts of mean and variance, the distinction between the mean and variance of a sample and those of the entire population, probability distributions, and confidence intervals are all essential components of a complete and rigorous understanding of the errors which are inherent in any experiment. The intention here is to develop this understanding in the context of an extremely simple experimental procedure, so that it may be applied more easily and with greater effectiveness during later labs where the procedures become more complex.

The overall organization of this report is as follows: in the Theory section, we develop the mathematical tools necessary to undertake a rigorous error analysis of any experiment; in the Experimental Method section, we outline our procedure. In the Results and Discussion section, we present and discuss our results, and we close with the Conclusion.


Theory
In any experiment in which the same quantity is to be measured multiple times, it is desirable to have some method by which we can reconcile the different results of each measurement. Assuming the procedure is the same for each measurement, a simple arithmetic mean is the traditional tool of choice. If we take $N$ different measurements $x_j$, then the mean value $\bar{x}$ (alternatively written as $\langle x \rangle$) is

$$\bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j. \tag{1}$$

The virtue of taking repeated measurements is twofold: it both reduces the uncertainty of the measurement and allows us to determine statistically what that uncertainty is. In particular, the expected variability within individual measurements can be quantified by the (sample) variance

$$s^2 = \frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{x})^2. \tag{2}$$

Here $s$ is the (sample) standard deviation, which is simply the square root of the variance. By increasing the number of measurements $N$, the uncertainty of the mean $u$ can be improved:

$$u = \frac{s}{\sqrt{N}}. \tag{3}$$

It is important to be able to interpret these quantities correctly. The basic idea is that the population mean $\mu$ (i.e., the true mean) is likely to lie within $u$ of the measured mean $\bar{x}$; that is,

$$(\mu - u) \le \bar{x} \le (\mu + u), \tag{4}$$

$$(\bar{x} - u) \le \mu \le (\bar{x} + u). \tag{5}$$
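The sample mean, sample standard deviation, and uncertainty of the mean defined above can be illustrated numerically. The following is a minimal Python sketch, not part of the original analysis, that applies them to the ten green-M&M counts listed in Table 4. Treating those counts as the repeated measurements $x_j$ is an assumption made purely for illustration.

```python
import math
import statistics

# Green-M&M counts per sample of 20, taken from Table 4 of this report.
# Using them as the repeated measurements x_j is an illustrative choice.
counts = [5, 6, 4, 5, 2, 3, 2, 4, 5, 6]

N = len(counts)
mean = statistics.mean(counts)   # arithmetic mean, equation (1)
s = statistics.stdev(counts)     # sample standard deviation (N - 1 in the denominator)
u = s / math.sqrt(N)             # uncertainty of the mean, equation (3)

print(f"mean = {mean:.2f}, s = {s:.2f}, u = {u:.2f}")
```

Note that `statistics.stdev` uses the $N - 1$ denominator of equation (2), whereas `statistics.pstdev` would divide by $N$ and describe a full population instead.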

In order to quantify more precisely what we mean by "likely," it is necessary to consider the specific probability distribution which describes the distribution of our measurements. Often this will be a normal distribution, and in the absence of information to the contrary, the assumption of normality is generally reasonable. In this experiment, however, we expect the results of individual samples to follow a binomial distribution. Consider a sample of $n$ M&Ms, and suppose we are attempting to determine which fraction of those M&Ms are green. For each individual M&M there are only two relevant outcomes: green or not-green. If the (true) probability of any individual M&M being green is $p$, and therefore the probability of not being green is $q = 1 - p$, then the probability of a sample of $n$ M&Ms total containing $k$ greens is

$$P_k = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}. \tag{6}$$

This distribution has a well-defined mean and standard deviation; the mean is

$$\mu_k = pn \tag{7}$$


and the standard deviation is

$$\sigma_k = \sqrt{pqn}. \tag{8}$$
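As a quick numerical check of the binomial formulas above, the following Python sketch evaluates the binomial probabilities directly and compares the resulting mean and standard deviation with $pn$ and $\sqrt{pqn}$. The values n = 20 and p = 0.21 are taken from the small-sample test described later in this report; the function name `binomial_pmf` is a hypothetical helper introduced here for illustration.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials, equation (6)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.21  # sample size and green fraction used elsewhere in this report
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)
mean = sum(k * P for k, P in enumerate(pmf))
variance = sum((k - mean) ** 2 * P for k, P in enumerate(pmf))

print(f"sum of P_k = {total:.6f}")
print(f"mean = {mean:.3f}  (pn = {p * n:.3f})")
print(f"std  = {math.sqrt(variance):.3f}  (sqrt(pqn) = {math.sqrt(p * (1 - p) * n):.3f})")
```

The probabilities should sum to one, and the directly computed mean and standard deviation should agree with the closed-form expressions to within floating-point round-off.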

If $n$ is sufficiently large ($n \gtrsim 10$ for $p \approx 0.5$, $n \gtrsim 50$ for $p \approx 0.2$), then the binomial distribution is well approximated by a Gaussian:

$$P_k \approx \frac{1}{\sigma_k \sqrt{2\pi}}\, e^{-(k - \mu_k)^2 / 2\sigma_k^2}, \tag{9}$$

with $\mu_k = pn$ and $\sigma_k = \sqrt{pqn}$, as for the binomial distribution. Since the statistical properties of the Gaussian are so well understood, this allows us to determine the probability of a measurement lying within a given distance of the mean: 68.3% of measurements will lie within $1\sigma$ and 95.5% will lie within $2\sigma$. More complete results are given in Table 1. The main point here is that, given a sufficiently large sample size, we can be 95(.45)% confident that the experimental value is within $2\sigma$ of the true value. This result is not necessarily confined to this particular experiment; it is valid in any situation where the distribution of errors is normal or approximately normal.
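The coverage probabilities quoted above, and tabulated in Table 1, follow from the error function. The sketch below is one way to reproduce them in Python; the function names `p_inside` and `p_outside` are hypothetical and not taken from the report.

```python
import math

def p_inside(m: float) -> float:
    """Probability that a normal variate lies within m standard deviations of the mean."""
    return math.erf(m / math.sqrt(2))

def p_outside(m: float) -> float:
    """Complementary probability; erfc avoids round-off when the tail is tiny."""
    return math.erfc(m / math.sqrt(2))

for m in range(1, 9):
    print(f"{m} sigma: P_in = {p_inside(m):.6f}, P_out = {p_outside(m):.3g}")
```

Using `math.erfc` for the outside probability, rather than `1 - math.erf(...)`, matters beyond about 6 standard deviations, where the subtraction would lose all significant digits in double precision.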

Experimental Method
Our experimental procedure for this lab is extremely simple. For the basic M&M test, we simply count the number of M&Ms of each color in a sample containing n = 906 M&Ms; these results are presented in Table 2, along with the results of the statistical analysis discussed in the Results and Discussion section.

A similar procedure was applied to the Skittles. Here, we used a sample containing n = 733 Skittles in total. However, in order to examine differences in the distribution between bags, we tracked separately the totals from each of the two bags we used for the sample, with the first bag containing $n_1 = 355$ Skittles and the second containing $n_2 = 378$. The results of the Skittles tests are presented in Table 3.

In the tests with Skittles, no extra effort was necessary to avoid systematic error, since the entire contents of both bags were counted. In the M&M tests, we used a smaller subset of the full bag, and therefore extra care had to be taken to avoid preferential inclusion of particular colors. This was generally avoided by selecting the M&Ms blindly, so that the color was not seen until an M&M had already been included in the sample. Nevertheless, it is possible that this was not perfectly accomplished, in which case an unintentional and unconscious preference for one particular color over another on the part of the researchers could have introduced some systematic error into our results. However, we do not believe this effect to be significant.

We also tested whether our distributions were consistent with the binomial distribution by counting the number of green M&Ms in each of N = 10 samples containing n = 20 M&Ms each; those results are tabulated in Table 4.

Results and Discussion


Our statistical analysis is predicated on the assumption that the distribution of colors within a sample (i.e., green and not-green, repeated for each color) is described by the binomial distribution, so our first task is to make sure that this is indeed the case. As described in the Experimental Method section, we determined the number of green M&Ms out of a total of 20 for each of 10 samples. We can therefore determine an observed distribution and compare it to the predicted binomial distribution. The true value of $p$ is not known, of course, but we assume that our observed $p_{\mathrm{obs}} = 0.210$ is sufficiently close to the true value to give a good approximation. The results are shown in Figure ??. The observed $p_{k,\mathrm{obs}}$ is calculated according to

$$p_{k,\mathrm{obs}} = \frac{N_k}{N}, \tag{10}$$

where $N_k$ is the number of samples containing $k$ green M&Ms and $N = 10$ is the total number of samples. (The standard deviation of the fraction of green M&Ms observed per sample, in a total of $N$ samples, used to produce the error bars in Figure ??, is $\sigma_{P_k} = \sqrt{P_k (1 - P_k)/N}$; the calculation is not particularly difficult, but it's a bit fiddly and we will not go into it here.) Clearly, there is no statistically significant difference between our observed distribution of green M&Ms and that predicted by the binomial distribution; however, due to the small number of trials, the error bars are too large to draw any firm conclusions. Since $\sigma_{P_k}$ shrinks as $1/\sqrt{N}$, we would need to use a very large number of samples (say, $N \approx 1000$, at least) before we obtained a good fit to the binomial distribution. Since this would require $2 \times 10^4$ pieces of candy, we chose not to pursue this line of inquiry further.

With n = 906 for the M&M tests and n = 733 for the Skittles test, we were able to reduce the $2\sigma$ confidence intervals for each trial to about ±3% or less. The percentages were calculated simply by dividing the observed number of each color by the total sample size; that is,

$$p_{\mathrm{obs}} = \frac{k_{\mathrm{obs}}}{n}; \tag{11}$$

this calculation is, of course, repeated for each color. Since the standard deviation in the number of M&Ms (or Skittles) of a given color in the sample is

$$\sigma_k = \sqrt{pqn}, \tag{12}$$

we can divide this by $n$ to find the standard deviation of the fraction of M&Ms (or Skittles) of a particular color:

$$\sigma_p = \sqrt{\frac{pq}{n}}. \tag{13}$$

For each color, we calculate $\sigma_k$, $p_{\mathrm{obs}}$, and $\sigma_p$ according to equations (11) through (13) and tabulate the results in Tables 2 and 3. For the M&Ms, we find that 21.0% ± 2.7% were green, 19.0% ± 2.6% yellow, 9.5% ± 1.9% red, 20.4% ± 2.7% orange, 9.3% ± 1.9% blue, and 20.9% ± 2.7% brown, where we give $2\sigma$ uncertainty intervals. This is consistent with a distribution wherein green, yellow, orange, and brown each receive an equal 20% fraction of the total, with the remaining 20% split evenly between red and blue; that is, a 20/20/20/20/10/10 distribution. For the Skittles, we find 21.4% ± 3.0% green, 19.9% ± 3.0% yellow, 18.3% ± 2.9% orange, 19.0% ± 2.9% red, and 21.4% ± 3.0% purple. This shows no statistically significant difference from an even distribution of 20% for each of the five colors. These results are shown graphically in Figures ?? and ??.
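The per-color calculation for the M&Ms can be sketched in a few lines of Python using the counts from Table 2. The report does not spell out how the $7\sigma$ exclusion of an even split was obtained; measuring the distance of each observed fraction from 1/6 in units of the observed $\sigma_p$ is one plausible reading and is assumed here. The helper name `summarize` is hypothetical.

```python
import math

# Observed M&M counts from Table 2 of this report.
k_obs = {"Green": 190, "Yellow": 172, "Red": 86,
         "Orange": 185, "Blue": 84, "Brown": 189}
# Hypothesized fractions discussed in the text (20/20/20/20/10/10).
p_hyp = {"Green": 0.20, "Yellow": 0.20, "Red": 0.10,
         "Orange": 0.20, "Blue": 0.10, "Brown": 0.20}

n = sum(k_obs.values())

def summarize(k: int, n: int) -> tuple[float, float]:
    """Return (p_obs, sigma_p): observed fraction and its standard deviation."""
    p = k / n
    return p, math.sqrt(p * (1 - p) / n)

results = {color: summarize(k, n) for color, k in k_obs.items()}

for color, (p, sigma_p) in results.items():
    within = abs(p - p_hyp[color]) <= 2 * sigma_p
    # Assumed reading of the exclusion level: distance from an even
    # six-way split, in units of the observed sigma_p.
    z_even = abs(p - 1 / 6) / sigma_p
    print(f"{color:7s} {100 * p:5.1f}% +/- {200 * sigma_p:.1f}%  "
          f"within 2 sigma of hypothesis: {within}  |z| vs even: {z_even:.1f}")
```

Under this reading, red and blue are the colors that drive the rejection of an even split, while every color lies within its $2\sigma$ interval of the 20/20/20/20/10/10 hypothesis.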


We also considered the possibility that the two bags of Skittles used in our sample may have had different color distributions. With each bag containing approximately 360 Skittles, the $2\sigma$ margin of error would be approximately $2\sqrt{0.2 \times 0.8/360} \approx 4\%$, which means that differences of less than 8% in color distribution would not be statistically significant. (For example, bag 1 could have 24% red, and bag 2 could have 16% red, and both results would be consistent with an overall multi-bag distribution of 20% reds, so that there would be no statistically significant difference between the two bags.) As can be seen from the results shown in Table 3, the inter-bag differences are below this level of variation, and so no statistically significant difference in color distribution between bags of Skittles is observed.
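The bag-to-bag comparison described above can be checked with the per-bag counts from Table 3. This is a minimal sketch that follows the simple criterion stated in the text (a difference is not significant if it is smaller than twice the per-bag $2\sigma$ margin); it is not a formal two-sample test, and the variable names are illustrative.

```python
import math

# Per-bag Skittles counts from Table 3 of this report.
bag1 = {"Green": 76, "Yellow": 62, "Red": 74, "Orange": 70, "Purple": 73}
bag2 = {"Green": 81, "Yellow": 84, "Red": 65, "Orange": 64, "Purple": 84}

n1, n2 = sum(bag1.values()), sum(bag2.values())

# 2-sigma margin for one bag of about 360 candies with p = 0.2, as in the text.
margin = 2 * math.sqrt(0.2 * 0.8 / 360)
# The text's criterion: differences below about twice the margin are not significant.
threshold = 2 * margin

diffs = {c: abs(bag1[c] / n1 - bag2[c] / n2) for c in bag1}

print(f"per-bag 2-sigma margin = {100 * margin:.1f}%")
for color, d in diffs.items():
    print(f"{color:7s} difference = {100 * d:4.1f}%  below threshold: {d < threshold}")
```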

Conclusion
In this experiment, we have examined the statistical properties of the distribution of color in two well-known types of candy, namely, M&Ms and Skittles. With sample sizes of n = 906 M&Ms and n = 733 Skittles, we were able to obtain 95% confidence intervals of about ±3% or less for each color. Our results are consistent with an even distribution among the five colors found in a bag of Skittles. However, we are able to rule out an even distribution among the six colors of M&Ms; the even distribution is excluded at the $7\sigma$ level. Instead, the results appear to be consistent with green, yellow, orange, and brown each receiving 20% of the total, and red and blue each receiving 10%.

We also considered the possibility that the distribution of color could vary from one bag to another. We examined separately the distributions of two bags of Skittles and found no statistically significant differences. However, due to the relatively small size of these samples, this result cannot be considered robust.

We also tested the ability of the binomial distribution to predict the number of M&Ms of a particular color within a given sample. This was accomplished by tracking the number of green M&Ms in a total of 10 samples, each containing 20 M&Ms. No statistically significant deviations from a binomial distribution were observed; once again, the error bars are sufficiently large that this should not be considered conclusive.

This lab has provided an effective means of applying the principles of statistical analysis to the problem of random sampling. Because the experimental method is simple, greater attention can be focused on the mathematical issues involved. It is therefore a useful introduction to the practice of data analysis and the quantification of uncertainty. As more data are gathered through repetitions of this experiment, we expect the uncertainties to shrink. Nevertheless, due to the very large number of samples required, it is unlikely that we will be able to fully and reliably test the predictive ability of the binomial distribution in this context. Continued study will also allow us to determine the nature of any temporal variations in the color distribution: is there an optimal division between the colors which the manufacturers are continuously seeking, or is the assignment of color effectively random? While these questions are unlikely to keep us up at night, we hope that future research may prove illuminating.


Interval   P_in       P_out
1σ         0.682689   0.317311
2σ         0.954500   0.045500
3σ         0.997300   0.002700
4σ         0.999937   6.33 × 10^-5
5σ         0.999999   5.73 × 10^-7
6σ         1          1.97 × 10^-9
7σ         1          2.55 × 10^-12
8σ         1          1.22 × 10^-15

Table 1: Table showing confidence intervals of the Gaussian distribution. The first column gives the number of standard deviations away from the mean; the second gives the probability that a given measurement will lie within that interval; the third gives the probability that it will lie outside that interval.

Color    k_obs   σ_k    p_obs   σ_p
Green    190     12.3   21.0%   1.4%
Yellow   172     11.8   19.0%   1.3%
Red       86      8.8    9.5%   1.0%
Orange   185     12.1   20.4%   1.3%
Blue      84      8.7    9.3%   1.0%
Brown    189     12.2   20.9%   1.3%

Table 2: Results of the M&M color sampling. The number of M&Ms of a particular color in the sample is given in the second column, with the standard deviation of this number given in the third column. Columns four and five show the percentage of M&Ms for each color and the standard deviation of this percentage, respectively.


Color    k_1   k_2   k_obs   σ_k    p_obs   σ_p
Green    76    81    157     11.1   21.4%   1.5%
Yellow   62    84    146     10.8   19.9%   1.5%
Red      74    65    139     10.6   19.0%   1.4%
Orange   70    64    134     10.5   18.3%   1.4%
Purple   73    84    157     11.1   21.4%   1.5%

Table 3: Results of the Skittles color sampling. The organization of the table is similar to Table 2, except for the insertion of two more columns (k_1 and k_2) showing the number of Skittles of each color in each bag we used for the test. Uncertainty intervals and percentages are given for the combined sample.

Sample   1   2   3   4   5   6   7   8   9   10
k        5   6   4   5   2   3   2   4   5   6

Table 4: Results of the binomial distribution consistency test; the table shows the number k of green M&Ms out of a sample of n = 20 M&Ms total.
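The consistency test can be reproduced from the counts in Table 4. The sketch below, offered as an illustration and not as the authors' actual code, tallies the fraction of samples containing k greens, evaluates the binomial prediction at the pooled green fraction, and attaches the error bar $\sqrt{P_k(1 - P_k)/N}$ quoted in the text. Evaluating that error bar at the predicted $P_k$ is an assumption; the helper name `binomial_pmf` is hypothetical.

```python
import math
from collections import Counter

# Number of green M&Ms in each of the N = 10 samples of n = 20 (Table 4).
greens = [5, 6, 4, 5, 2, 3, 2, 4, 5, 6]
N, n = len(greens), 20

p_obs = sum(greens) / (N * n)  # pooled green fraction
tally = Counter(greens)        # N_k: number of samples containing k greens

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k greens in a sample of n."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

for k in range(0, 9):
    observed = tally.get(k, 0) / N                      # fraction of samples with k greens
    predicted = binomial_pmf(k, n, p_obs)               # binomial prediction
    error = math.sqrt(predicted * (1 - predicted) / N)  # assumed form of the error bar
    print(f"k = {k}: observed {observed:.2f}, predicted {predicted:.3f} +/- {error:.3f}")
```

With only N = 10 samples the error bars are comparable in size to the predicted probabilities themselves, which is why the test cannot be considered conclusive.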
