
1. BASIC STATS

Types of variables:
Count: number of bedrooms/children, manufacture year
Ordinal: categories, income brackets
Nominal: gender, yes/no, manufacture model
Continuous: distance, time, age, best cruising speed

Graphs
Scatterplots: show if there is a linear relationship (positive = increasing left to right, cov > 0); they don't measure strength
Histogram: modality, skewness (positive = long tail to right), modal class, symmetry
Box Plot: skewness (short/long whisker, short bottom = positive), symmetry
Empirical CDFs

Location of Percentiles
$L_p = (n + 1)\,\dfrac{p}{100}$ = ___th observation (then read off what figure sits at that position)
n = number of data points; $L_p$ is the location of the pth percentile.

Other measures of centre (continuous/interval data)
Arithmetic mean - average. If observations are labelled $X_1, X_2, \ldots, X_n$ then the sample average is called $\bar{X}$, calculated as
\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i \]

Mean vs Median
If symmetric, mean ≈ median
If positive skew, mean > median
If negative skew, mean < median

Variance
Measures spread
Larger SD - ↑ risk - ↑ rate of return
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right] \]
E.g.
\[ s^2 = \frac{1}{4}\left[(25 + 49 + 1 + 4 + 16) - 5(3.8)^2\right] = \frac{1}{4}\left[95 - 5(14.44)\right] = \frac{1}{4}(22.8) = 5.7 \]

Standard Deviation
$s = \sqrt{s^2}$

Coefficient of variation
Measures spread relative to the mean: $cv = s / \bar{X}$

Covariance
Measures the strength (and direction) of linear relationship between 2 variables.
\[ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}\right] \]
If cov > 0, then as X increases, Y increases (positive slope). If cov < 0, the opposite. If cov = 0, not linearly related.

Coefficient of correlation
\[ \rho = \frac{\operatorname{COV}(X, Y)}{\sigma_X \sigma_Y}, \qquad r = \frac{\operatorname{cov}(X, Y)}{s_X s_Y} \]
If r = -1, perfect negative linear relationship
If r = +1, perfect positive linear relationship
If r = 0, no LINEAR relationship
From Minitab: |r| = √R-sq

E.g.
  xi    yi    xi-x̄   (xi-x̄)²   yi-ȳ   (yi-ȳ)²   (xi-x̄)(yi-ȳ)
   1     7    -2.5     6.25       3       9         -7.5
   2     5    -1.5     2.25       1       1         -1.5
   3     5    -0.5     0.25       1       1         -0.5
   4     4     0.5     0.25       0       0          0
   5     2     1.5     2.25      -2       4         -3
   6     1     2.5     6.25      -3       9         -7.5
  21    24     0      17.5        0      24        -20     (column totals)
X mean = 3.5, Y mean = 4
Minitab covariance output:
          x           y
  x    3.50000
  y   -4.00000     4.80000
Here -4.00000 = covariance of x, y (= -20/5); the diagonal entries are the sample variances (17.5/5 and 24/5).

2. PROBABILITY

Outcomes must be mutually exclusive (no two outcomes can both occur on any one trial) and collectively exhaustive (each trial must result in one of the outcomes in the sample space).
$P(A) = P(A \cap B) + P(A \cap B^c)$
$P(A \text{ or } B) = P(A) + P(B) - P(A \cap B)$
$P(A \mid B) = P(A \cap B) / P(B)$ [Conditional probability]
$P(A) = 1 - P(\bar{A})$
If A and B are independent, then $P(A \cap B) = P(A) \times P(B)$, ∴ $P(A \mid B) = P(A)$
$P(A \text{ and } B) = P(A \mid B) \times P(B) = P(B \mid A) \times P(A)$ [Multiplication rule]

Expected Value / Mean
\[ E(X) = \sum_{\text{all } x_i} x_i\, p(x_i) = \mu \]
E.g. $0 \cdot \tfrac{1}{8} + 1 \cdot \tfrac{3}{8} + 2 \cdot \tfrac{3}{8} + 3 \cdot \tfrac{1}{8} = \tfrac{12}{8} = 1.5$

Expected Values/Mean             Variance
$E(c) = c$                       $V(c) = 0$
$E(cX) = cE(X)$                  $V(cX) = c^2 V(X)$
$E(X + Y) = E(X) + E(Y)$         $V(X + c) = V(X)$
$E(X - Y) = E(X) - E(Y)$         $V(X + Y) = V(X) + V(Y)$ [If X and Y are independent]
$E(XY) = E(X) \times E(Y)$       $V(X - Y) = V(X) + V(Y)$ [If X and Y are independent]
[If X and Y are independent]

Variance
\[ V(X) = E\left[(X - \mu)^2\right] = \sum_{\text{all } x_i} (x_i - \mu)^2 p(x_i) = \sum_{\text{all } x_i} x_i^2\, p(x_i) - \mu^2 = \sigma^2 \]
E.g. $0^2 \cdot \tfrac{1}{8} + 1^2 \cdot \tfrac{3}{8} + 2^2 \cdot \tfrac{3}{8} + 3^2 \cdot \tfrac{1}{8} - 1.5^2 = 0.75$
Std Dev(X) $= \sqrt{0.75} = 0.866$ (to 3dp)

Marginal Distribution of X
\[ p(x) = P(X = x) = \sum_{\text{all } y_i} p(x, y_i) \]

Covariance
\[ \sigma_{xy} = \operatorname{cov}(X, Y) = E\left[(X - \mu_x)(Y - \mu_y)\right] = \sum_{i=1}^{m}\sum_{j=1}^{n} x_i y_j\, p(x_i, y_j) - \mu_x \mu_y \]

Coefficient of Correlation
Measures strength of linear relationship between variables X & Y (-1 to +1).
\[ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y}, \qquad -1 \le \rho \le 1 \]

Linear Combinations - Portfolio Variance
$E(aX + bY) = aE(X) + bE(Y)$
$V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2ab\operatorname{cov}(X, Y) = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\rho\sigma_x\sigma_y$

Expected Return
$E(R_P) = E(R_A) \times$ (investment share in A) $+\ E(R_B) \times$ (investment share in B)

Variance of Return
$V(R_P) =$ (share in A)$^2 \times V(R_A) +$ (share in B)$^2 \times V(R_B) + 2 \times$ (share in A) $\times$ (share in B) $\times \operatorname{cov}(R_A, R_B)$

3. CONTINUOUS PROBABILITY DISTRIBUTIONS

Types of random variables:
Uniform
- Parameters: end points
Normal: requires probabilities P(X<a) or P(a<X<b)
- Parameters: mean, variance
Binomial: the number of "successes" in the n trials
1. A fixed number of trials, n.
2. 2 possible outcomes for each trial: success and failure.
3. Probability of success is p, probability of failure is (1-p).
4. Trials are independent.

Binomial
Notation: If X is a binomial random variable with n trials and p is the probability of success in each trial, then we write X ~ Bin(n, p).
Mean and variance: If X ~ Bin(n, p),
$\mu_x = E(X) = np$
$\sigma_x^2 = np(1 - p)$
Binomial formula: $P(X = k) = {}^{n}C_k\, p^k (1 - p)^{n-k}$

Uniform Distribution
$f(x) = \dfrac{1}{b - a}, \quad a \le x \le b$
$E(X) = \dfrac{a + b}{2}$
$V(X) = \dfrac{(b - a)^2}{12}$
Probability = (width of interval) x f(x), i.e. base x height of the rectangle

Normal Distribution
Different means - shift curve up and down x-axis. Different variances - curve becomes more peaked or more squashed.
We require probabilities P(X<a) or P(a<X<b).
Standardising - produces a Z-score.
If $X \sim N(\mu, \sigma^2)$, then $Z = \dfrac{X - \mu}{\sigma} \sim N(0, 1)$.

When area is > something
E.g. P(Z>1.08): use 1 - P(Z<1.08)
When area is in middle of graph
E.g. P(1.08<Z<1.51) = P(Z<1.51) - P(Z<1.08)
Symmetry → P(Z<-a) = P(Z>a)

Suppose we require P(X<a). Know that $Z = \dfrac{X - \mu}{\sigma} \sim N(0, 1)$. So,
\[ P(X < a) = P\left(\frac{X - \mu}{\sigma} < \frac{a - \mu}{\sigma}\right) = P\left(Z < \frac{a - \mu}{\sigma}\right), \quad \text{where } Z \sim N(0, 1). \]
Or suppose we require P(a<X<b). Then
\[ P(a < X < b) = P\left(\frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma}\right). \]

When asking max or min amount
E.g. rackets last an average of 16 months, SD of 3 months. Only wants to replace a max of 4% under warranty. How long should warranty period be?
Find w with P(X < w) = 0.04: the tables give z ≈ -1.75, so w = 16 - 1.75(3) ≈ 10.75 months.

4. SAMPLING DISTRIBUTIONS

As sample size ↑, distribution of the sample mean becomes closer to the normal distribution.
The SD of the sample mean is the model standard deviation, $\sigma$ (the theoretical SD), divided by $\sqrt{n}$, that is, $\sigma/\sqrt{n}$.

Central Limit Theorem
If X is a random variable with a mean µ and variance σ², then
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \text{ approximately}, \qquad \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \to Z \sim N(0, 1) \text{ as } n \to \infty. \]

SEMean → Usually don't know the theoretical value $\sigma$, therefore estimate it using the sample SD, s, and $s/\sqrt{n}$ is called the standard error of the mean.

Mentions 'sample' - E.g. The probability that a sample of 10 students will have an average mark over 78 if mean is 72 and SD is 9.
\[ \bar{X} \sim N\left(\mu = 72,\ \frac{\sigma^2}{n} = \frac{9^2}{10}\right) \]
\[ P(\bar{X} > 78) = P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} > \frac{78 - 72}{9/\sqrt{10}}\right) = P\left(Z > \frac{6\sqrt{10}}{9}\right) = P(Z > 2.11) = 1 - P(Z < 2.11) = 1 - 0.9826 = 0.0174 \]

E.g. 2: a question that says "will differ from the population mean by less than __ units": use $E(\bar{X}) = \mu$ and $\operatorname{var}(\bar{X}) = \sigma^2/n$ to standardise $P(|\bar{X} - \mu| < \_\_)$.

5. ESTIMATION

Point estimate: a single value or point, i.e. sample mean = 4 is a point estimate of the population mean, µ.
Interval estimate: Draws inferences about a population by estimating a parameter using an interval (range).
E.g. We are 95% confident that the unknown mean score lies between 56 and 78.

Unbiased and consistent
$E(\bar{X}) = \mu$, so $\bar{X}$ is an unbiased estimator of $\mu$.
An unbiased estimator is consistent if the difference between estimator and the parameter gets smaller as the sample gets larger.

Interval Estimate
$\bar{X} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$. So, $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.

Confidence Interval
\[ P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha, \text{ i.e. } 100(1 - \alpha)\% \]
Find the LCL and UCL.
$100(1 - \alpha)\%$ CI: $\bar{x} \pm z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$
E.g. Average height of a sample of 25 men is found to be 178cm. SD of male heights is 10cm, and heights follow a normal distribution. Find a 95% confidence interval.
\[ P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95 \]
\[ P\left(178 - 1.96\frac{10}{\sqrt{25}} < \mu < 178 + 1.96\frac{10}{\sqrt{25}}\right) = 0.95 \]
\[ P(174.08 < \mu < 181.92) = 0.95 \]

Minimum sample size
Use the CLT to find the min sample size if the SD of the population is known.
E.g. Estimating the length of bus trip in the morning. St.dev. of the trip length is 5 mins. Estimate the true population mean length to within 3 minutes, with 99% certainty.
Step 1: set up the equation needed.
\[ P\left(|\bar{X} - \mu| < 3\right) = 0.99 \]
Step 2: standardise.
\[ P\left(\frac{|\bar{X} - \mu|}{\sigma/\sqrt{n}} < \frac{3}{\sigma/\sqrt{n}}\right) = 0.99 \quad\Rightarrow\quad P\left(|Z| < \frac{3\sqrt{n}}{5}\right) = 0.99 \]
Step 3: solve for n.
\[ P(|Z| < 2.575) = 0.99 \quad\Rightarrow\quad \frac{3\sqrt{n}}{5} = 2.575 \quad\Rightarrow\quad \sqrt{n} = \frac{2.575 \times 5}{3} \approx 4.29 \quad\Rightarrow\quad n \approx 18.4, \text{ so round up to } n = 19. \]

Example: Proportions
A random sample is taken over a week. 625 of the 945 users are found to be cyclists. Find a 90% confidence interval for the true proportion of users who are cyclists.
\[ P\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) = 1 - \alpha \]
$P(Z > 1.645) = 0.05$, so
\[ P\left(\hat{p} - 1.645\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \le p \le \hat{p} + 1.645\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) = 0.90 \]
$\hat{p} = \dfrac{625}{945} = 0.6614$ (to 4dp), so the interval is $0.6614 \pm 1.645\sqrt{\dfrac{0.6614(1 - 0.6614)}{945}} = 0.6614 \pm 0.0253$, i.e. approximately (0.6361, 0.6867).

6. HYPOTHESIS TESTING

Null hypothesis: something WILL happen
Alternative hypothesis: something WILL NOT happen
Errors
Type 1 error → Reject H0 when it is true
Type 2 error → Accept false H0 (fail to reject H0 when it is false)

Example: σ is unknown - t distribution
Government claims that mean noise level is no more than 60dB. A test is conducted, measuring noise levels on 18 occasions and obtaining an average of 72dB with a standard deviation of 10dB.
$H_0: \mu = 60$
$H_A: \mu > 60$
\[ T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{72 - 60}{10/\sqrt{18}} = \frac{12}{2.357} = 5.091 \]
Compare to t distribution with n-1 = 17 df.
For 5% error, want to find upper 5% tail - marked "one sided (area to the left) = 0.95". Rejection region is t > 1.7396.
Since 5.091 > 1.7396, reject H0.
E.g. with CI: Find a 99% confidence interval for the population mean noise level given the residents' data (that is, from a sample of size 18, an average of 72dB and standard deviation of 10dB).
$\bar{X} = 72,\ s = 10,\ n = 18$
$t_{\alpha/2,\, n-1} = t_{0.005,\, 17} = 2.8982$
\[ \text{CI: } \bar{X} \pm t_{\alpha/2,\, n-1}\frac{s}{\sqrt{n}} = 72 \pm 2.8982 \times \frac{10}{\sqrt{18}} = 72 \pm 6.8311 = (65.1689,\ 78.8311) \]

Example: 2 proportions - z tables
E.g. A random sample of 80 voters was surveyed in one region, and 70 voters in another. 45% of the voters surveyed in the first region said they would vote for candidate A, and 40% of the voters surveyed in the second region said they would vote for A.
Do these surveys indicate that the level of support for A is the same in both regions?
$H_0: p_1 = p_2$
$H_A: p_1 \ne p_2$
Test Statistic:
\[ \hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{80 \times 0.45 + 70 \times 0.4}{80 + 70} = \frac{64}{150} \]
\[ Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.45 - 0.4}{\sqrt{\dfrac{64}{150}\left(1 - \dfrac{64}{150}\right)\left(\dfrac{1}{80} + \dfrac{1}{70}\right)}} = 0.6177 \text{ to 4dp.} \]
Decision Rule:
P-value: P(|Z|>0.6177) = 0.5352.
Rejection Region: For 5% significance and two-sided alternative, reject null hypothesis if test statistic is less than -1.96 or greater than +1.96.
Conclusion
p-value is large and the test statistic does not lie in the rejection region, so we fail to reject H0: the surveys are consistent with the same level of support in both regions.

Example: one proportion
E.g. In the same data set as the cyclist example in section 5, 204 of the 945 people are found to be walking. It is supposed that overall, 20% of users are walkers. Test this supposition using the data.
$H_0: p = 0.2$
$H_A: p \ne 0.2$
Test Statistic:
\[ Z = \frac{\hat{p} - c}{\sqrt{c(1 - c)/n}} = \frac{\dfrac{204}{945} - 0.2}{\sqrt{0.2(1 - 0.2)/945}} = 1.2199 \text{ to 4 decimal places.} \]
Decision Rule
Rejection Region: For 5% significance, two tailed test, reject H0 if test statistic is less than -1.96 or greater than +1.96.
OR
P-value = P(|Z|>1.2199) = 2*P(Z>1.22)
= 2*[0.5 - P(0<Z<1.22)]
= 0.2224
Conclusion
We fail to reject the null hypothesis: the test statistic did not lie in the rejection region, and the p-value is large. That is, the data are consistent with the claim that 20% of users are walking.

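The interval and sample-size examples in section 5 (heights, bus trips, cyclists), together with the 99% t interval for the noise data, follow one pattern: estimate ± critical value × standard error. The sketch below assumes Python's standard-library `statistics.NormalDist` for the z critical values. The function names are hypothetical, and the t critical value 2.8982 is copied from the sheet because the standard library has no t distribution.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_crit(conf):
    """z_{alpha/2} for a two-sided interval at confidence level conf (e.g. 0.95)."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)

def mean_ci_known_sigma(xbar, sigma, n, conf):
    """CI for the mean when the population SD is known: xbar +/- z * sigma / sqrt(n)."""
    half = z_crit(conf) * sigma / sqrt(n)
    return xbar - half, xbar + half

def min_sample_size(sigma, margin, conf):
    """Smallest n with z * sigma / sqrt(n) <= margin (always rounded up)."""
    return ceil((z_crit(conf) * sigma / margin) ** 2)

def proportion_ci(successes, n, conf):
    """CI for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    half = z_crit(conf) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def mean_ci_t(xbar, s, n, t_crit):
    """CI for the mean when sigma is unknown; t_crit must come from a t table."""
    half = t_crit * s / sqrt(n)
    return xbar - half, xbar + half

heights = mean_ci_known_sigma(178, 10, 25, 0.95)   # 25 men, SD 10 cm
bus_n = min_sample_size(5, 3, 0.99)                # within 3 minutes, SD 5 minutes
cyclists = proportion_ci(625, 945, 0.90)           # 625 of 945 users
noise = mean_ci_t(72, 10, 18, 2.8982)              # t_{0.005, 17} from the sheet

print("heights 95% CI :", tuple(round(v, 2) for v in heights))
print("bus trip min n :", bus_n)
print("cyclists 90% CI:", tuple(round(v, 4) for v in cyclists))
print("noise 99% CI   :", tuple(round(v, 4) for v in noise))
```

`inv_cdf` returns unrounded critical values (about 1.95996 and 2.5758 rather than 1.96 and 2.575), so results can differ from the hand calculations in the last decimal place.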