
Edexcel Notes S1

Statistics 1
Mathematical Model

A mathematical model is a simplification of a real world problem.

1. A real world problem is observed.
2. A mathematical model is thought up.
3. The model is used to make predictions, "What happens if...?"
4. Real world data is collected.
5. Predicted results are obtained.
6. The predicted and observed results are compared using statistical tests.
7. Models are refined as required and then it's back to stage 3...

Advantages of using mathematical models are:

- They simplify a real world problem.
- They improve our understanding of a real world problem.
- They are quicker and cheaper.
- They can be used to predict future outcomes.

Disadvantages of using mathematical models are:

- They only give a partial description of the real problem.
- They only work for a restricted range of values.

Stem and Leaf

One of the simplest ways of ordering and presenting data is to place it in a stem and leaf diagram. For example, take the following data:
Person   Weight (lb)   Height (cm)
1        166           161
2        164           160
3        143           160
4        189           199
5        191           167
6        178           178
7        165           169
8        159           174
9        189           172
10       191           178
11       176           167

Unordered Stem and Leaf
Height in cm. 3 | 4 represents 34 cm.

15 |
16 | 1 0 0 7 9
17 | 8 4 2 8 6
18 |
19 | 9

Ordered Stem and Leaf
Height in cm. 3 | 4 represents 34 cm.

15 |
16 | 0 0 1 7 9
17 | 2 4 6 8 8
18 |
19 | 9

As you can see from the key, the | divides tens from units. Stem and leaf diagrams can also be back to back, if you have two sets of data to display. Using the data above:

Weight in pounds               Height in cm
4 | 3 represents 34 lb.        3 | 4 represents 34 cm.

      3 | 14 |
      9 | 15 |
7 6 5 4 | 16 | 0 0 1 7 9
      8 | 17 | 2 4 6 8 8
    9 9 | 18 |
    1 1 | 19 | 9

Stem and leaf diagrams can give us an indication of distribution. In this example, the weights are spread much more widely than the heights. If we were comparing something like scores on two exams, we could compare the medians.

Frequency Tables

Amount (x)       Frequency (f)
0 ≤ x < 20       5
20 ≤ x < 40      9
40 ≤ x < 60      20
60 ≤ x < 80      25
80 ≤ x < 100     9

Cumulative Frequency

One way we can interpret the data is by working out the cumulative frequency. This simply means adding up the frequencies as you go along. Cumulative frequency is plotted against the upper class boundary. From the above example, we get:

Amount (x)       Frequency (f)   Upper class boundary   Cumulative frequency
0 ≤ x < 20       5               20                     5
20 ≤ x < 40      9               40                     14 (5+9)
40 ≤ x < 60      20              60                     34 (5+9+20)
60 ≤ x < 80      25              80                     59 (5+9+20+25)
80 ≤ x < 100     9               100                    68 (5+9+20+25+9)
Total            68

To check your cumulative frequencies, add up the frequency column: the total should equal the final cumulative frequency. Or the question will probably say something like, "a survey of 68 people..." and that's an even easier check. When we have our cumulative frequency column, we can draw a cumulative frequency curve.
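The running total described above can be sketched in a few lines of Python. This is only an illustration using the figures from the table; the variable names are made up for the example.

```python
from itertools import accumulate

# Upper class boundaries and frequencies copied from the table above.
upper_boundaries = [20, 40, 60, 80, 100]
frequencies = [5, 9, 20, 25, 9]

# A running total of the frequencies gives the cumulative frequencies.
cumulative = list(accumulate(frequencies))

# Points to plot: (upper class boundary, cumulative frequency).
points = list(zip(upper_boundaries, cumulative))

for boundary, cf in points:
    print(f"x < {boundary}: cumulative frequency = {cf}")

# The check from the notes: the last cumulative frequency is the total.
total = sum(frequencies)
```

The final entry of `cumulative` should match `total`, which is the same check as adding up the frequency column.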

Using the cumulative frequency curve, we can also create a box plot. Read across from the quartile positions on the y-axis to the curve, then down to find the corresponding x-values.

Box plots are useful because they give you lots of information: they show the median and the spread of the IQR, whether there are any outliers, and whether the data is symmetrical, positively skewed or negatively skewed. The IQR is a measure of spread.

$\text{IQR} = Q_3 - Q_1$

Outliers are extreme values. They are usually represented as a cross. They can be either too low or too high and are usually worked out by the equations:

$Q_1 - 1.5 \times \text{IQR}$ (Anything less than this figure will be an outlier.)

$Q_3 + 1.5 \times \text{IQR}$ (Anything greater than this figure will be an outlier.)

The exam question will always state how to work out the outliers though, so this is one thing you don't have to worry about remembering (just as long as you know how to use the formula). Once you've identified the outliers, where does the end of the box plot occur? You can either use the next highest/lowest data value after the outlier, or use the value worked out from the formula.

Linear Interpolation

To work out the median, find the $\frac{n+1}{2}$th value. For Q1 work out the $\frac{n+1}{4}$th value and for Q3 find the $\frac{3(n+1)}{4}$th value.

Percentiles (such as P12) mean a percentage of the CF. To work out P12 for example, work out the $\frac{12(n+1)}{100}$th value.

For a grouped frequency, it can be difficult to calculate the median and quartiles. There is a way of estimating an answer, however, and this is called linear interpolation.
Time (sec)        Frequency   Cumulative Frequency   Class width
0 ≤ x < 10        0           0                      10
10 ≤ x < 15       8           8                      5
15 ≤ x < 17.5     3           11                     2.5
17.5 ≤ x < 20     7           18                     2.5
20 ≤ x < 24       12          30                     4
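As a rough sketch, linear interpolation on this table can be written as a small Python function: find the class that contains the required position, then move through that class in proportion to how far the position sits within its frequency. The function name `interpolate` is invented for this example, and the median position is taken as (n + 1)/2, which is an assumption chosen to match the 15.5 used in these notes.

```python
def interpolate(position, lower_bounds, widths, freqs):
    """Estimate the value at a given position (e.g. the 15.5th value)."""
    cumulative = 0
    for lower, width, f in zip(lower_bounds, widths, freqs):
        if f > 0 and position <= cumulative + f:
            # Fraction of the way through this class, scaled by its width,
            # added on to the lower class boundary.
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("position is beyond the total frequency")


# Values copied from the table above.
lower_bounds = [0, 10, 15, 17.5, 20]
widths = [10, 5, 2.5, 2.5, 4]
freqs = [0, 8, 3, 7, 12]

n = sum(freqs)
median = interpolate((n + 1) / 2, lower_bounds, widths, freqs)
print(round(median, 1))  # 19.1
```

Passing a different position to the same function gives an estimate for a quartile or percentile.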

The first step is to find the $\frac{n+1}{2}$th value. In this example, it is 15.5. We take away 11 (the cumulative frequency before the class that contains the 15.5th value) and then divide by 7 (the frequency of that class). Next we multiply by 2.5 (the width of that class). Finally add on 17.5 (the lower class boundary of that class) and the answer appears, 19.1.

$$\text{Median} = 17.5 + \frac{15.5 - 11}{7} \times 2.5 = 19.1 \text{ (3 s.f.)}$$

The only difference for the percentiles and other quartiles is replacing $\frac{n+1}{2}$ by whatever you want to find.

Mean from frequency table

It's easy enough to work out the mean from normal data, using the simple formula:

$$\bar{x} = \frac{\sum x}{n}$$

(In other words, add them all up and divide by how many there are.)
Time (sec)   Frequency (f)
0 - 9        0
10 - 14      8
15 - 17      3
18 - 20      7
21 - 24      12

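For a grouped frequency table, you'll need to work out the midpoint of each class of the x variable:

$$\text{Midpoint} = \frac{\text{lower class boundary} + \text{upper class boundary}}{2}$$

The formula for the mean is:

$$\bar{x} = \frac{\sum fx}{\sum f}$$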

Therefore, once you have the midpoint, you need to multiply f and x:
Time (sec)   Frequency (f)   Midpoint (x)   fx
0 - 9        0               4.75           0
10 - 14      8               12             96
15 - 17      3               16             48
18 - 20      7               19             133
21 - 24      12              23             276

Add up the fx column and then divide by the total of the frequency column to find the mean:

$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{553}{30} = 18.4 \text{ (3 s.f.)}$$

Standard Deviation

For an ordinary set of data, the standard deviation is found by the following:

$$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$

(Variance is the same formula, but without the square root.) For a frequency table, or grouped frequency table, we again have a slightly different formula:

$$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$$

Taking the above as an example, we need to add an fx² column. Be careful with this. Notice only the x is squared: it is f × x², not (fx)².

Time (sec)   Frequency (f)   Midpoint (x)   fx     fx²
0 - 9        0               4.75           0      0
10 - 14      8               12             96     1152
15 - 17      3               16             48     768
18 - 20      7               19             133    2527
21 - 24      12              23             276    6348

Now add up the fx² and f columns, and write in the mean squared:

$$\sigma = \sqrt{\frac{10795}{30} - \left(\frac{553}{30}\right)^2}$$
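The same calculation can be sketched in Python as a check on the column totals. The midpoints and frequencies are copied from the table above; the variable names are just for illustration.

```python
from math import sqrt

# Midpoints (x) and frequencies (f) copied from the table above.
midpoints = [4.75, 12, 16, 19, 23]
freqs = [0, 8, 3, 7, 12]

total_f = sum(freqs)                                        # sum of f
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))       # sum of fx
sum_fx2 = sum(f * x**2 for f, x in zip(freqs, midpoints))   # sum of f * x^2

mean = sum_fx / total_f
variance = sum_fx2 / total_f - mean**2   # mean of the squares minus square of the mean
std_dev = sqrt(variance)

print(f"mean = {mean:.3g}, standard deviation = {std_dev:.3g}")
```

Note that `x**2` is multiplied by `f` inside the sum, which mirrors the warning that only the x is squared.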

Put those values into your calculator and you'll get the answer: 4.48 (3 s.f.)

Coding

When the numbers are too large to be reasonably worked with, there is another option for finding the mean: coding. This replaces x (the midpoint) with y, which is connected to x by a formula that makes it a smaller number.

Use the code $y = \frac{x - 45.5}{10}$ to calculate the mean and standard deviation of the following frequency table:

x      Frequency f
15.5   8
25.5   12
35.5   15
45.5   16
55.5   11
65.5   6
75.5   2

We need to add the code column, working out y, and then add columns for fy and fy² rather than fx and fx²:

x       Frequency f   y     fy     fy²
15.5    8             -3    -24    72
25.5    12            -2    -24    48
35.5    15            -1    -15    15
45.5    16            0     0      0
55.5    11            1     11     11
65.5    6             2     12     24
75.5    2             3     6      18
Total   70                  -34    188

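Next, work out the mean of y, using the formula:

$$\bar{y} = \frac{\sum fy}{\sum f} = \frac{-34}{70} = -0.486 \text{ (3 s.f.)}$$

We think back to the original code:

$$y = \frac{x - 45.5}{10}$$

If we replace y with $\bar{y}$ here, we can replace x with $\bar{x}$:

$$\bar{y} = \frac{\bar{x} - 45.5}{10}$$

Add the numbers, and rearrange to make $\bar{x}$ the subject of the formula:

$$\bar{x} = 10\bar{y} + 45.5 = 40.6 \text{ (3 s.f.)}$$

and that's your answer!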
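The coding steps above can be sketched in Python. The code $y = (x - 45.5)/10$ is inferred from the y column of the table, and the helper name `code` is invented for this example; the sketch also decodes the standard deviation, using the fact that only the division by 10 changes the spread.

```python
from math import sqrt

# x values and frequencies copied from the table above.
x_values = [15.5, 25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
freqs = [8, 12, 15, 16, 11, 6, 2]


def code(x):
    """Coding assumed from the y column: y = (x - 45.5) / 10."""
    return (x - 45.5) / 10


y_values = [code(x) for x in x_values]
total_f = sum(freqs)

mean_y = sum(f * y for f, y in zip(freqs, y_values)) / total_f
var_y = sum(f * y**2 for f, y in zip(freqs, y_values)) / total_f - mean_y**2
sd_y = sqrt(var_y)

# Decode: undo the coding for the mean; only the scale factor affects the spread.
mean_x = 10 * mean_y + 45.5
sd_x = 10 * sd_y

print(f"mean of x = {mean_x:.3g}, standard deviation of x = {sd_x:.3g}")
```

Subtracting 45.5 shifts every value by the same amount, so it has no effect on `sd_x`.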

For the standard deviation it's almost exactly the same: work out the standard deviation of y first. Thinking about dispersion, adding and subtracting won't affect the standard deviation, but dividing and multiplying will. So here the standard deviation of x is 10 times the standard deviation of y.

Histograms

Histograms are used to represent continuous data that has been summarised in a grouped frequency distribution. There are no gaps between the bars. The area of each bar is proportional to the frequency.

Example: The heights of twenty children (to the nearest cm) were recorded in the following frequency table. Draw a histogram to represent the data.

Height     Frequency f
120-124    1
125-129    5
130-134    7
135-139    4
140-149    3

There are two columns that we need to add: the class width and the frequency density.

Class width is the width of each group. Be careful to work it out from the lower class boundary and the upper class boundary. For example, 120-124 is actually 119.5 to 124.5, and so the class width is 5. Frequency density is the frequency divided by the class width.

Height     Frequency f   Class Width   Frequency Density
120-124    1             5             0.2
125-129    5             5             1
130-134    7             5             1.4
135-139    4             5             0.8
140-149    3             10            0.3

When we have these values, we plot the lower class and upper class boundaries on the x axis and the frequency density on the y axis.
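As a small sketch, the class boundaries, class widths and frequency densities in the table can be produced in Python. It assumes, as the example does, that heights are recorded to the nearest cm, so each boundary sits 0.5 either side of the stated class limits.

```python
# Class limits (to the nearest cm) and frequencies copied from the table above.
classes = [(120, 124), (125, 129), (130, 134), (135, 139), (140, 149)]
freqs = [1, 5, 7, 4, 3]

rows = []
for (low, high), f in zip(classes, freqs):
    lower_boundary = low - 0.5            # e.g. 120 becomes 119.5
    upper_boundary = high + 0.5           # e.g. 124 becomes 124.5
    width = upper_boundary - lower_boundary
    density = f / width                   # frequency density = frequency / class width
    rows.append((lower_boundary, upper_boundary, width, density))

widths = [row[2] for row in rows]
densities = [round(row[3], 2) for row in rows]

for lower, upper, width, density in rows:
    print(f"{lower} to {upper}: width {width}, frequency density {density:.1f}")
```

Multiplying each density by its width recovers the frequency, which is why the area of each bar is proportional to the frequency.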

Skewness

From the histogram above, we see a slight positive skew: most of the values are bunched towards the lower end, with a longer tail towards the higher values. There are three types of distribution shape: positive skew, negative skew and symmetrical (no skew). The following tests differentiate between them:

Positive Skew            Symmetrical              Negative Skew
Mean > Median > Mode     Mean = Median = Mode     Mean < Median < Mode
Q2 - Q1 < Q3 - Q2        Q2 - Q1 = Q3 - Q2        Q2 - Q1 > Q3 - Q2

Correlation

Correlation is a measure of the relationship between two or more variables. When we have two sets of data, we can draw a scatter diagram to see if there is any correlation between them.

Data: The marks of 10 candidates in Maths and Physics are shown below:

Candidate   Physics (x)   Maths (y)
1           18            42
2           20            54
3           30            60
4           40            54
5           46            62
6           54            68
7           60            80
8           80            66
9           88            80
10          92            100

From the data, we can plot each x value against its corresponding y value, as on any graph. The only difference is that we don't join the crosses with a line. We can already see that the data is positively correlated. A way to test this is to divide the graph into four quadrants, and then look at where the majority of the points lie:

If most points lie in the 1st and 3rd quadrants, we have a positive correlation.

If most points lie in the 2nd and 4th quadrants, we have a negative correlation.

If points lie in all four quadrants randomly, we have no correlation.

However, just looking at the scatter diagram is a bit inaccurate. It's much better to calculate the strength of the correlation. There's a formula for this called the PMCC (Product Moment Correlation Coefficient):

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

How to calculate Sxy, Sxx and Syy:

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$$

$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$

$$S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$

From the above information, we complete the following table:
x     y      x²      y²       xy
18    42     324     1764     756
20    54     400     2916     1080
30    60     900     3600     1800
40    54     1600    2916     2160
46    62     2116    3844     2852
54    68     2916    4624     3672
60    80     3600    6400     4800
80    66     6400    4356     5280
88    80     7744    6400     7040
92    100    8464    10000    9200

Σx = 528, Σy = 666, Σx² = 34464, Σy² = 46820, Σxy = 38640

If you're lucky the question will already give you these figures, and all you'll be asked to do is use them.

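Now using the PMCC formula:

$$S_{xy} = 38640 - \frac{528 \times 666}{10} = 3475.2$$

$$S_{xx} = 34464 - \frac{528^2}{10} = 6585.6$$

$$S_{yy} = 46820 - \frac{666^2}{10} = 2464.4$$

$$r = \frac{3475.2}{\sqrt{6585.6 \times 2464.4}} = 0.863 \text{ (3 s.f.)}$$

PMCC works so that $-1 \le r \le 1$, with -1 being perfect negative correlation, 0 being no correlation and +1 being perfect positive correlation. 0.863 is strong positive. Even if we code the data, the PMCC remains the same.

Least squares regression line

The regression line of y on x is $y = a + bx$, where

$$b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}$$

We can work out b easily enough from the data above:

$$b = \frac{3475.2}{6585.6} = 0.528 \text{ (3 s.f.)}$$

$$\bar{y} = 66.6, \qquad \bar{x} = 52.8$$

$$a = 66.6 - 0.5277 \times 52.8 = 38.7 \text{ (3 s.f.)}$$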
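The PMCC and regression calculations for the Physics and Maths marks can be checked with a short Python sketch. It uses the standard summary-statistic formulas for Sxy, Sxx and Syy; the variable names are chosen for this example only.

```python
from math import sqrt

# Marks copied from the data table.
physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]    # x
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]     # y
n = len(physics)

sum_x, sum_y = sum(physics), sum(maths)
s_xy = sum(x * y for x, y in zip(physics, maths)) - sum_x * sum_y / n
s_xx = sum(x**2 for x in physics) - sum_x**2 / n
s_yy = sum(y**2 for y in maths) - sum_y**2 / n

# Product moment correlation coefficient.
r = s_xy / sqrt(s_xx * s_yy)

# Least squares regression line y = a + b x.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n

print(f"r = {r:.3f}")
print(f"y = {a:.3g} + {b:.3g}x")
```

Because `a` is defined from the means, the point (52.8, 66.6) lies on the fitted line by construction.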

If the question asks you to draw the regression line, an easy way is to plot the $(\bar{x}, \bar{y})$ point on the scatter diagram, and then draw the line from the y-axis intercept through this point. The mean point always lies on the line. If the data is coded, we need to decode when finding the mean.

An independent (explanatory) variable is one that is set independently of the other variable. (Plotted on the x-axis.) A dependent (response) variable is one whose values are determined by the values of the independent variable. (Plotted on the y-axis.)

Interpolation is when you estimate the value of a dependent variable within the range of the data. Extrapolation is when you estimate a value outside the range of the data. Values estimated by extrapolation can be unreliable.

Probability

If A is an event, the probability of it occurring is the number of ways A can occur, divided by the size of the sample space S (the total number of equally likely outcomes).

$$P(A) = \frac{n(A)}{n(S)}$$

Probability is always $0 \le p \le 1$. If you have a probability, p(A), the probability of not getting A is written as p(A'). To find p(A'), we merely take p(A) away from 1.

A ∩ B means A "intersection" B: all elements that are in A and in B. We can see this on a Venn diagram:

A ∪ B means A "union" B: all elements that are in A or in B (or both). On a Venn diagram this is:

Addition Rule

This is the addition rule for finding P(A ∪ B):

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

We can rearrange this to get:

$$P(A \cap B) = P(A) + P(B) - P(A \cup B)$$

Example: There are 15 books on a bookshelf. 10 of these are fiction, 4 of which are hardback. 6, in total, are hardback and the remaining 9 are paperback. Find the probability that a book chosen at random is a hardback fiction book. The first stage is to draw a Venn diagram and write in all the numbers:

We're looking for p(H ∩ F), so where is it both H and F? Where the two circles overlap, so 4/15.

Find the probability that a hardback is chosen but is not fiction. We want p(H ∩ F'), which is 2/15.
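The bookshelf example can be checked with exact fractions in Python. This is only a sketch: the counts come from the example, and the union probability at the end is an extra illustration of the addition rule rather than something the question asks for.

```python
from fractions import Fraction

# Counts taken from the bookshelf example.
total = 15
fiction = 10
hardback = 6
hardback_fiction = 4

p_f = Fraction(fiction, total)
p_h = Fraction(hardback, total)
p_h_and_f = Fraction(hardback_fiction, total)   # p(H and F)

# Hardback but not fiction: the part of H outside the overlap.
p_h_not_f = p_h - p_h_and_f

# Addition rule: p(H or F) = p(H) + p(F) - p(H and F).
p_h_or_f = p_h + p_f - p_h_and_f

print(p_h_and_f, p_h_not_f, p_h_or_f)
```

Using `Fraction` keeps the answers in the same form as the notes instead of rounding them to decimals.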

Conditional Probability

This occurs when the probability of A is conditional upon B having already occurred: given B, find the probability of A. It's written as p(A|B).

We use tree diagrams to solve conditional probability. Example: A bag contains 6 red and 4 blue balls. 2 balls are picked at random and retained. Find the probability that both balls are red. First, draw out a tree diagram.

We want p(R ∩ R), so we just follow the tree diagram along:

6/10 x 5/9 = 30/90 = 1/3.

Find the probability that the balls are different colours. We want p(R ∩ B) and p(B ∩ R). Multiply along both branches and then add these together:

p(R ∩ B) = 6/10 x 4/9 = 24/90
p(B ∩ R) = 4/10 x 6/9 = 24/90

24/90 + 24/90 = 48/90 = 8/15.

Find the probability that the second ball is red, given the first is blue. We want p(R|B), so we use the formula:

$$P(R \mid B) = \frac{P(B \cap R)}{P(B)} = \frac{24/90}{4/10} = \frac{2}{3}$$

Independent Events

Two events are independent when one occurring doesn't affect the probability of the other. For example, if balls are taken from a bag and replaced, the probability of a red ball is the same no matter how many times you pick from the bag. This means:

$$P(A \cap B) = P(A) \times P(B)$$

If events are mutually exclusive, they cannot occur at the same time and so p(A ∩ B) is 0. This means that:

$$P(A \cup B) = P(A) + P(B)$$
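The two-ball tree diagram example can be followed branch by branch in Python. This sketch assumes, as in the example, that the first ball is not put back, so the second draw is out of 9 balls; the variable names are illustrative.

```python
from fractions import Fraction

red, blue = 6, 4
total = red + blue

# Multiply along each branch; the second draw is out of total - 1 balls.
p_rr = Fraction(red, total) * Fraction(red - 1, total - 1)    # red then red
p_rb = Fraction(red, total) * Fraction(blue, total - 1)       # red then blue
p_br = Fraction(blue, total) * Fraction(red, total - 1)       # blue then red

# Different colours: add the two branches that give one of each.
p_different = p_rb + p_br

# Conditional probability: p(second red | first blue) = p(B then R) / p(first blue).
p_first_blue = Fraction(blue, total)
p_red_given_blue = p_br / p_first_blue

print(p_rr, p_different, p_red_given_blue)
```

The conditional answer matches the second-stage branch 6/9 directly, which is a useful sanity check on the formula.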

Sample Space Diagram

Example: A dice is thrown twice and the scores obtained are added together. Find the probability that the total score is 6.

There are 36 equally likely outcomes. 5 of the outcomes result in a total of 6, so the probability is 5/36.

                         First Throw
                    1    2    3    4    5    6
Second Throw   6    7    8    9    10   11   12
               5    6    7    8    9    10   11
               4    5    6    7    8    9    10
               3    4    5    6    7    8    9
               2    3    4    5    6    7    8
               1    2    3    4    5    6    7
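The sample space for two throws can also be listed by brute force. The following Python sketch enumerates all 36 outcomes and counts those with a total of 6; it assumes a fair six-sided dice, so that every outcome is equally likely.

```python
from fractions import Fraction
from itertools import product

# Every (first throw, second throw) pair for a six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes whose scores add up to 6.
favourable = [(first, second) for first, second in outcomes if first + second == 6]

probability = Fraction(len(favourable), len(outcomes))
print(favourable)
print(probability)
```

Changing the target total in the list comprehension gives the probability of any other score from the same table.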

Discrete Random Variables

A discrete random variable is a variable that takes separate numerical values, each with its own probability, such as the "number on a fair die". The probability that the random variable X takes the value x is written as P(X = x).

Example: A tetrahedral die has the numbers 1, 2, 3, 4 on its faces. The die is biased in such a way that:

P(X = x) = k for x = 1, 2, 3
P(X = x) = 3k for x = 4

If we draw this out in a probability distribution table we get:

x          1    2    3    4
P(X = x)   k    k    k    3k

All the probabilities added together = 1:

k + k + k + 3k = 1
6k = 1
k = 1/6

Therefore, we can write out the probability distribution:

x          1      2      3      4
P(X = x)   1/6    1/6    1/6    1/2

We can also find the cumulative distribution, F(x):

x          1      2      3      4
P(X = x)   1/6    1/6    1/6    1/2
F(x)       1/6    1/3    1/2    1
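As a sketch, the biased tetrahedral die can be set up in Python with exact fractions. The weights 1, 1, 1, 3 stand for the multiples of the unknown constant (called k here), and `F` is a hypothetical helper that adds up the probabilities of every outcome not exceeding its argument.

```python
from fractions import Fraction
from itertools import accumulate

# Relative weights of each face: P(X = x) is weight * k.
weights = {1: 1, 2: 1, 3: 1, 4: 3}

# The probabilities must add up to 1, so k = 1 / (sum of the weights).
k = Fraction(1, sum(weights.values()))
pmf = {x: w * k for x, w in weights.items()}

# Cumulative distribution at each listed value: running total of the probabilities.
cdf = dict(zip(pmf, accumulate(pmf.values())))


def F(value):
    """P(X <= value): add the probabilities of all outcomes not exceeding value."""
    return sum((p for x, p in pmf.items() if x <= value), Fraction(0))


print(k, pmf, cdf)
print(F(2), F(3.5))
```

Because X only takes the values 1 to 4, `F(3.5)` picks up exactly the same outcomes as `F(3)`.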

The cumulative probabilities always build up to 1 at the final value. P(X ≤ 2) means the probability of getting an X value less than or equal to 2. We add up the probabilities we have, and so, in the above example, P(X ≤ 2) = 1/6 + 1/6 = 1/3. F(x) means P(X ≤ x), so F(2) = 1/3.

If a question asks you something like F(3.5), in our example X cannot take the value 3.5. Therefore, we do F(3) instead, which would be 1/2.

Mean and Variance

Finding the mean and variance is almost identical to finding them from a frequency table. The formula for the mean:

$$E(X) = \sum x\,P(X = x)$$

For the variance, we have the formula:

$$\text{Var}(X) = E(X^2) - [E(X)]^2$$

To find $E(X^2)$:

$$E(X^2) = \sum x^2\,P(X = x)$$

Example: X is a discrete random variable with the following distribution.

x       P(X = x)   xP(X = x)   x²P(X = x)
0       0.4        0           0
1       0.5        0.5         0.5
2       0.1        0.2         0.4
Total              0.7         0.9

Therefore,

$$E(X) = 0.7$$

$$\text{Var}(X) = 0.9 - 0.7^2 = 0.41$$

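Suppose Y is the random variable given by Y = 3X - 2, obtained by coding the above table. The table would now look like this:

y       P(Y = y)   yP(Y = y)   y²P(Y = y)
-2      0.4        -0.8        1.6
1       0.5        0.5         0.5
4       0.1        0.4         1.6
Total              0.1         3.7

So E(Y) = 0.1 and Var(Y) = 3.7 - 0.1² = 3.69.

Remember the code: Y = 3X - 2. To decode back:

$$E(X) = \frac{E(Y) + 2}{3} = 0.7, \qquad \text{Var}(X) = \frac{\text{Var}(Y)}{3^2} = 0.41$$

In general:

$$E(aX + b) = aE(X) + b$$

$$\text{Var}(aX + b) = a^2\,\text{Var}(X)$$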
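The example and the general coding results can be checked with a short Python sketch. The function name `mean_and_variance` is invented for this illustration, and the probabilities are entered as exact fractions so that the comparison is not blurred by rounding.

```python
from fractions import Fraction


def mean_and_variance(dist):
    """Return (E(X), Var(X)) for a dict mapping each value x to P(X = x)."""
    mean = sum(x * p for x, p in dist.items())
    mean_of_squares = sum(x**2 * p for x, p in dist.items())
    return mean, mean_of_squares - mean**2


# Distribution of X from the example: P(0) = 0.4, P(1) = 0.5, P(2) = 0.1.
dist_x = {0: Fraction(4, 10), 1: Fraction(5, 10), 2: Fraction(1, 10)}

# Coded variable Y = 3X - 2 keeps the same probabilities on new values.
a, b = 3, -2
dist_y = {a * x + b: p for x, p in dist_x.items()}

e_x, var_x = mean_and_variance(dist_x)
e_y, var_y = mean_and_variance(dist_y)

print(float(e_x), float(var_x))
print(float(e_y), float(var_y))
```

Comparing `e_y` with `a * e_x + b` and `var_y` with `a**2 * var_x` confirms the two general rules for this table.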

A discrete uniform distribution is one where each value of the random variable has the same probability. For example, when X is the score on a fair 6-sided die, each probability would be 1/6. For a discrete uniform distribution over the values 1, 2, 3, ..., n:

$$E(X) = \frac{n + 1}{2}, \qquad \text{Var}(X) = \frac{(n + 1)(n - 1)}{12}$$

Example: A tetrahedral dice has its faces numbered 1, 2, 3 and 4. X is the score obtained when the dice is rolled.

X therefore has a discrete uniform distribution, so

$$E(X) = \frac{4 + 1}{2} = 2.5, \qquad \text{Var}(X) = \frac{(4 + 1)(4 - 1)}{12} = 1.25$$
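As a quick check, the mean and variance of a discrete uniform distribution over 1, 2, ..., n can be computed directly from the definitions and compared with the standard results (n + 1)/2 and (n + 1)(n - 1)/12. The function name below is made up for this sketch.

```python
from fractions import Fraction


def uniform_mean_variance(n):
    """Mean and variance of a discrete uniform distribution on 1, 2, ..., n."""
    p = Fraction(1, n)                      # every value is equally likely
    values = range(1, n + 1)
    mean = sum(x * p for x in values)
    variance = sum(x**2 * p for x in values) - mean**2
    return mean, variance


mean_4, var_4 = uniform_mean_variance(4)    # tetrahedral dice
mean_6, var_6 = uniform_mean_variance(6)    # ordinary six-sided dice

print(float(mean_4), float(var_4))
print(float(mean_6), float(var_6))
```

For the tetrahedral dice this reproduces the mean of 2.5 given in the example.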

The Normal Distribution

- Symmetrical about the mean.
- Total area under the curve = 1.
- Probabilities correspond to the area.
- A continuous distribution (therefore there is no difference between P(X < a) and P(X ≤ a)).
- 68% of the distribution lies within 1 standard deviation of the mean.
- 95% of the distribution lies within 2 standard deviations of the mean.
- 99.7% of the distribution lies within 3 standard deviations of the mean.

Examples:

- The masses of new born babies.
- IQ of school students.
- Hand span of adult females.
- Height of plants growing in a field.
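The 68%, 95% and 99.7% figures can be reproduced with Python's standard library, which provides a `NormalDist` class in the `statistics` module (Python 3.8 or later). This sketch uses the standard normal distribution; the dictionary name `within` is just for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area between -k and +k standard deviations is cdf(k) - cdf(-k).
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```

The printed areas round to the percentages quoted in the list above.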

Working out Probabilities using tables

If P(Z < a) is greater than 0.5 then a will be greater than 0. If P(Z < a) is less than 0.5, then a is less than 0. If P(Z > a) is less than 0.5 then a will be greater than 0. If P(Z > a) is more than 0.5 then a will be less than 0.

Standardizing

If $X \sim N(\mu, \sigma^2)$ and $Z \sim N(0, 1)$ then:

$$Z = \frac{X - \mu}{\sigma}$$
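A minimal sketch of standardizing in Python is shown below. The distribution X ~ N(50, 4²) and the value 53 are hypothetical numbers chosen purely for illustration; they are not taken from these notes. The sketch uses `statistics.NormalDist` in place of printed tables.

```python
from statistics import NormalDist

# Hypothetical example values (not from the notes): X ~ N(50, 4^2), find P(X < 53).
mu, sigma = 50, 4
x = 53

# Standardize: z = (x - mu) / sigma.
z_value = (x - mu) / sigma

# Look the z value up in the standard normal distribution.
p_via_z = NormalDist().cdf(z_value)

# The same probability worked out directly from X, as a check.
p_direct = NormalDist(mu, sigma).cdf(x)

# Working backwards: the z value with P(Z < z) = 0.975, then un-standardize.
z_back = NormalDist().inv_cdf(0.975)
x_back = mu + z_back * sigma

print(z_value, round(p_via_z, 4))
print(round(z_back, 2), round(x_back, 1))
```

The two probabilities agree because standardizing only rescales the horizontal axis; the area under the curve is unchanged.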

Example: If

find

The first step is to standardize:

Working Backwards Example: If , find the value of if . .

To find x, we start by finding the standardised value such that From tables we see that .

We therefore need to find the value that standardises to make

by rearranging the formula.

Examination style question: A machine is designed to fill jars of coffee so that the contents, X, follow a normal distribution with mean $\mu$ grams and standard deviation $\sigma$ grams. If and , find $\mu$ and $\sigma$ correct to 3 significant figures.

Firstly : + 1.96

Secondly, we are told that - 1.75

The two equations are: + 1.96 - 1.75 Subtract to eliminate :

This gives

So the solutions to 3sf are

and

g.
