1. BASIC STATS

Types of variables:
Count: number of bedrooms/children, manufacture year
Ordinal: categories, income brackets
Nominal: gender, yes/no, manufacture model
Continuous: distance, time, age, best cruising speed

Graphs
Scatterplots: show whether there is a linear relationship (positive = increasing left to right, cov > 0); do not measure strength
Histogram (continuous/interval data): modality, skewness (positive = long tail to right), modal class, symmetry
Box Plot: skewness (short/long whisker; short bottom whisker = positive skew), symmetry
Empirical CDFs

Location of percentiles
$L_P = (n+1)\frac{P}{100}$ = ___th observation = what figure?
n = number of data points; $L_P$ is the location of the Pth percentile.

Other measures of centre
Arithmetic mean - average
If observations are labelled $X_1, X_2, \ldots, X_n$ then the sample average is called $\bar{X}$.
Calculated as $\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Mean vs Median
If symmetric, mean ≈ median
If positive skew, mean > median
If negative skew, mean < median

Variance
Measures spread
Larger SD - ↑ risk - ↑ rate of return
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right]$
E.g. $s^2 = \frac{1}{4}\left[(25 + 49 + 1 + 4 + 16) - 5 \times 3.8^2\right] = \frac{1}{4}\left[95 - 5 \times 14.44\right] = \frac{1}{4} \times 22.8 = 5.7$

Standard Deviation
$s = \sqrt{s^2}$

Coefficient of variation
Measures spread
$cv = \frac{s}{\bar{X}}$

Covariance
Measures the strength (and direction) of linear relationship between 2 variables.
$\mathrm{cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}\right]$
If cov > 0, then as X increases, Y increases (positive slope). If cov < 0, the opposite. If cov = 0, not linearly related.

Coefficient of correlation
$\rho = \frac{\mathrm{COV}(X,Y)}{\sigma_X \sigma_Y}$ (population), $r = \frac{\mathrm{cov}(X,Y)}{s_X s_Y}$ (sample), $-1 \le r \le 1$
If r = -1, perfect negative linear relationship
If r = +1, perfect positive linear relationship
If r = 0, no LINEAR relationship
From Minitab: |r| = √R-sq

xi    yi    xi - x̄   (xi - x̄)²   yi - ȳ   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
1     7     -2.5     6.25        3        9           -7.5
2     5     -1.5     2.25        1        1           -1.5
3     5     -0.5     0.25        1        1           -0.5
4     4      0.5     0.25        0        0            0
5     2      1.5     2.25       -2        4           -3
6     1      2.5     6.25       -3        9           -7.5
Sum   21    24       0          17.5      0    24     -20

X mean = 3.5, Y mean = 4
Minitab covariance output:
       x          y
x    3.50000
y   -4.00000   4.80000
The off-diagonal entry (-4.00000) is the covariance of x and y.

2. PROBABILITY

Outcomes must be mutually exclusive (no two outcomes can both occur on any one trial) and collectively exhaustive (each trial must result in one of the outcomes in the sample space).
P(A) = P(A∩B) + P(A∩Bc)
P(A or B) = P(A) + P(B) - P(A∩B)
P(A|B) = P(A∩B) / P(B) [Conditional Probability]
P(A) = 1 - P(Ā)
If A and B are independent, then
P(A∩B) = P(A) x P(B)
∴ P(A|B) = P(A)
P(A and B) = P(A|B) x P(B) = P(B|A) x P(A) [Multiplication rule]

Expected Value / Mean
$E(X) = \mu = \sum_{\text{all } x_i} x_i \, p(x_i)$
E.g. $0 \times \frac{1}{8} + 1 \times \frac{3}{8} + 2 \times \frac{3}{8} + 3 \times \frac{1}{8} = \frac{12}{8} = 1.5$

Variance of X
$V(X) = \sigma^2 = \sum_{\text{all } x_i} (x_i - \mu)^2 p(x_i) = \sum_{\text{all } x_i} x_i^2 \, p(x_i) - \mu^2 = E(X^2) - \mu^2$
E.g. $0^2 \times \frac{1}{8} + 1^2 \times \frac{3}{8} + 2^2 \times \frac{3}{8} + 3^2 \times \frac{1}{8} - 1.5^2 = 0.75$
Std Dev: $\sigma = \sqrt{0.75} = 0.866$ (to 3dp)

Expected Values/Mean          Variance
E(c) = c                      V(c) = 0
E(cX) = cE(X)                 V(cX) = c²V(X)
E(X+Y) = E(X) + E(Y)          V(X+c) = V(X)
E(X-Y) = E(X) - E(Y)          V(X+Y) = V(X) + V(Y) [if X and Y are independent]
E(XY) = E(X) x E(Y)           V(X-Y) = V(X) + V(Y) [if X and Y are independent]
  [if X and Y are independent]

Marginal Distribution of X
$p(x) = P(X = x) = \sum_{\text{all } y_i} p(x, y_i)$

Covariance
$\sigma_{xy} = \mathrm{cov}(X,Y) = E[(X - \mu_x)(Y - \mu_y)] = \sum_{i=1}^{m}\sum_{j=1}^{n} x_i y_j \, p(x_i, y_j) - \mu_x \mu_y$

Coefficient of correlation
Measures strength of linear relationship between variables X & Y (-1 to +1)
$\rho = \frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}$; $-1 \le \rho \le 1$

Linear Combinations - Portfolio Variance
$E(aX + bY) = aE(X) + bE(Y)$
$V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2ab\,\mathrm{cov}(X,Y) = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,\rho\,\sigma_x\sigma_y$

Expected Return
RP = E(RA) x investment + E(RB) x investment
Variance of Return
V(RP) = (investment)² x V(RA) + (investment)² x V(RB) [if RA and RB are uncorrelated; otherwise add the 2ab cov term]

3. CONTINUOUS PROBABILITY DISTRIBUTIONS

Types of random variables:
Uniform
- Parameters: End points
Normal: requires probabilities P(X<a) or P(a<X<b)
- Parameters: Variance, mean
Binomial: the number of "successes" in the n trials
1. A fixed number of trials, n.
2. 2 possible outcomes for each trial; success and failure.
3. Probability of success is p, probability of failure is (1-p).
4. Trials are independent.

Binomial
Notation: If x is a binomial random variable with n trials and p is the probability of success in each trial, then we write X~Bin(n,p)
Mean and variance: If x~Bin(n,p)
$\mu_x = E(x) = np$
$\sigma_x^2 = np(1-p)$
Binomial formula:
$P(x = k) = {}^{n}C_k \, p^k (1-p)^{n-k}$

Uniform Distribution
$f(x) = \frac{1}{b-a}, \quad a \le x \le b$
$E(X) = \frac{a+b}{2}$ and $V(X) = \frac{(b-a)^2}{12}$
Probability = width of the interval x f(x)

Normal Distribution
Different means - shift curve left or right along the x-axis. Different variances - curve becomes more peaked or more squashed.
We require probabilities P(X<a) or P(a<X<b)
Standardising - produces a Z-score.
If X~N(μ,σ²), then $Z = \frac{X - \mu}{\sigma} \sim N(0,1)$

When area is > something
E.g. P(Z>1.08): use 1 - P(Z<1.08)
When area is in middle of graph
E.g. P(1.08<Z<1.51) = P(Z<1.51) - P(Z<1.08)
Symmetry → P(Z<-a) = P(Z>a)

Suppose we require P(X<a).
Know that $Z = \frac{X - \mu}{\sigma} \sim N(0,1)$.
So, $P(X < a) = P\left(\frac{X-\mu}{\sigma} < \frac{a-\mu}{\sigma}\right) = P\left(Z < \frac{a-\mu}{\sigma}\right)$ where Z ~ N(0,1).
Or suppose we require P(a<X<b): standardise both end points in the same way.

When asking max or min amount
E.g. rackets last an average of 16 months, SD of 3 months. Only wants to replace a max of 4% under warranty. How long should warranty period be?

4. SAMPLING DISTRIBUTIONS

As sample size ↑, distribution of the sample mean becomes closer to the normal distribution.
The SD of the sample mean is the model standard deviation, σ (the theoretical SD), divided by √n, that is, σ/√n.

Central Limit Theorem
If X is a random variable with a mean µ and variance σ², then
$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ approximately, and $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \to Z \sim N(0,1)$ as $n \to \infty$.

SE Mean → Usually don't know the theoretical value σ, therefore estimate it using the sample SD, s, and s/√n is called the standard error of the mean.

Mentions 'sample' - E.g. the probability that a sample of 10 students will have an average mark over 78 if mean is 72 and SD is 9.
$\bar{X} \sim N\left(72, \frac{9^2}{10}\right)$
$P(\bar{X} > 78) = P\left(\frac{\bar{X} - 72}{9/\sqrt{10}} > \frac{78 - 72}{9/\sqrt{10}}\right) = P\left(Z > \frac{6\sqrt{10}}{9}\right) = P(Z > 2.11) = 1 - P(Z < 2.11) = 1 - 0.9826 = 0.0174$
E.g. 2: question that says "will differ from the population mean by less than __ units", i.e. $P(|\bar{X} - \mu| < \_\_)$

5. ESTIMATION

Point estimate: a single value or point, i.e. sample mean = 4 is a point estimate of the population mean, µ.
Interval estimate: draws inferences about a population by estimating a parameter using an interval (range).
E.g. We are 95% confident that the unknown mean score lies between 56 and 78.

Unbiased and consistent
$E(\bar{X}) = \mu$, so $\bar{X}$ is an unbiased estimator of µ.
An unbiased estimator is consistent if the difference between estimator and the parameter gets smaller as the sample gets larger.

Confidence Interval
$P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$
100(1-α)% CI: $\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
Find the LCL and UCL.
E.g. Average height of a sample of 25 men is found to be 178cm. SD of male heights is 10cm, and heights follow a normal distribution. Find a 95% confidence interval.
$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$
$P\left(178 - 1.96\frac{10}{\sqrt{25}} < \mu < 178 + 1.96\frac{10}{\sqrt{25}}\right) = 0.95$
$P(174.08 < \mu < 181.92) = 0.95$

Minimum sample size
Use the CLT to find the min sample size if the SD of the population is known.
E.g. Estimating the length of bus trip in the morning. St.dev. of the trip length is 5 mins. Estimate the true population mean length to within 3 minutes, with 99% certainty.
Step 1: set up the equation needed.
$P(|\bar{X} - \mu| < 3) = 0.99$
Step 2: standardise.
$P\left(\frac{|\bar{X} - \mu|}{\sigma/\sqrt{n}} < \frac{3}{\sigma/\sqrt{n}}\right) = 0.99$, so $P\left(|Z| < \frac{3\sqrt{n}}{5}\right) = 0.99$
Step 3: solve for n.
$P(|Z| < 2.575) = 0.99$
$\frac{3\sqrt{n}}{5} = 2.575$, so $\sqrt{n} = \frac{2.575 \times 5}{3}$ and $n = \left(\frac{2.575 \times 5}{3}\right)^2 \approx 18.42$; round up to n = 19.

Example: Proportions
A random sample is taken over a week. 625 of the 945 users are found to be cyclists. Find a 90% confidence interval for the true proportion of users who are cyclists.
$P\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) = 1 - \alpha$
$P(Z > 1.645) = 0.05$
$P\left(\hat{p} - 1.645\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + 1.645\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) = 0.90$
$\hat{p} = \frac{625}{945} = 0.6614$ (to 4dp)

6. HYPOTHESIS TESTING

Null hypothesis (H0): the claim holds (something WILL happen)
Alternative hypothesis (HA): the claim does not hold (something WILL NOT happen)
Errors
Type 1 error: reject H0 when it is true
Type 2 error: accept (fail to reject) a false H0

Example: σ is unknown - t distribution
Government claims that mean noise level is no more than 60dB. A test is conducted, measuring noise levels on 18 occasions, and obtains an average of 72dB with a standard deviation of 10dB.
H0: µ = 60
HA: µ > 60
$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{72 - 60}{10/\sqrt{18}} = \frac{12}{2.357} = 5.091$
Compare to t distribution with n-1 df.
For 5% error, want to find upper 5% tail - marked "one sided"; (area to the left) = 0.95. Rejection region is t > 1.7396.

E.g. with CI: Find a 99% confidence interval for the population mean noise level given the residents' data (that is, from a sample of size 18, an average of 72dB and standard deviation of 10dB).
$\bar{X} = 72, \; s = 10, \; n = 18$
$t_{\alpha/2,\,n-1} = t_{0.005,\,17} = 2.8982$
CI: $\bar{X} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} = 72 \pm 2.8982 \times \frac{10}{\sqrt{18}} = 72 \pm 6.8311 = (65.1689, 78.8311)$

Example: 2 proportions - z tables
E.g. A random sample of 80 voters were surveyed in one region, and 70 voters in another. 45% of the voters surveyed in the first region said they would vote for candidate A, and 40% of the voters surveyed in the second region said they would vote for A. Do these surveys indicate that the level of support for A is the same in both regions?
$H_0: p_1 = p_2$
$H_A: p_1 \ne p_2$
Test Statistic:
$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{80 \times 0.45 + 70 \times 0.4}{80 + 70} = \frac{64}{150}$
$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.45 - 0.4}{\sqrt{\frac{64}{150}\left(1 - \frac{64}{150}\right)\left(\frac{1}{80} + \frac{1}{70}\right)}} = 0.6177$ to 4dp.
Decision Rule:
P-value: P(|Z|>0.6177) = 0.5352.
Rejection Region: For 5% significance and two-sided alternative, reject null hypothesis if test statistic is less than -1.96 or greater than +1.96.
Conclusion
p-value is large and test statistic does not lie in the rejection region, so we fail to reject H0.

E.g. In the same data set, 204 of the 945 people are found to be walking. It is supposed that overall, 20% of users are walkers. Test this supposition using the data.
$H_0: p = 0.2$
$H_A: p \ne 0.2$
Test Statistic: $Z = \frac{\hat{p} - c}{\sqrt{c(1-c)/n}} = \frac{\frac{204}{945} - 0.2}{\sqrt{0.2(1 - 0.2)/945}} = 1.2199$ to 4 decimal places.
Decision Rule
Rejection Region: For 5% significance, two tailed test, reject H0 if test statistic is less than -1.96 or greater than +1.96.
OR
P-value = P(|Z|>1.2199) = 2*P(Z>1.22)
= 2*[0.5 - P(0<Z<1.22)]
= 0.2224
Conclusion
We fail to reject the null hypothesis - test statistic did not lie in the rejection region; p-value is large. That is, the data are consistent with the claim that 20% of users are walking.
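
The sample variance, covariance and correlation formulas from BASIC STATS can be checked with a short Python sketch. The function names are illustrative, and the data 5, 7, 1, 2, 4 for the variance example is an assumption inferred from the squares 25, 49, 1, 4, 16 shown in that example.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    # Shortcut form: (sum of squares - n * mean^2) / (n - 1)
    n = len(values)
    m = mean(values)
    return (sum(v * v for v in values) - n * m * m) / (n - 1)

def sample_covariance(xs, ys):
    # Shortcut form: (sum of products - n * mean_x * mean_y) / (n - 1)
    n = len(xs)
    return (sum(x * y for x, y in zip(xs, ys)) - n * mean(xs) * mean(ys)) / (n - 1)

def correlation(xs, ys):
    # r = cov(X, Y) / (s_X * s_Y)
    return sample_covariance(xs, ys) / sqrt(sample_variance(xs) * sample_variance(ys))

# Variance example (data inferred from the squares, an assumption)
s2 = sample_variance([5, 7, 1, 2, 4])

# x/y table from the covariance example
x = [1, 2, 3, 4, 5, 6]
y = [7, 5, 5, 4, 2, 1]
cov_xy = sample_covariance(x, y)
r = correlation(x, y)

print(round(s2, 4), round(cov_xy, 4), round(r, 4))
```

The covariance matches the Minitab off-diagonal entry, and r comes out strongly negative, in line with the downward pattern in the table.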
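
The discrete distribution in the expected value example (probabilities 1/8, 3/8, 3/8, 1/8 on 0 to 3) happens to coincide with Bin(3, 0.5), so it can serve as a check on the binomial formula and on the np and np(1-p) shortcuts. A minimal sketch; the helper names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = nCk * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def discrete_mean_variance(values, probs):
    # E(X) = sum of x * p(x); V(X) = E(X^2) - mu^2
    mu = sum(v * q for v, q in zip(values, probs))
    var = sum(v * v * q for v, q in zip(values, probs)) - mu**2
    return mu, var

n, p = 3, 0.5
values = list(range(n + 1))
probs = [binomial_pmf(k, n, p) for k in values]
mu, var = discrete_mean_variance(values, probs)

print(probs)    # probabilities for 0, 1, 2, 3 successes
print(mu, var)  # compare with n*p and n*p*(1-p)
```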
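
The normal-table lookups and the racket warranty question can be sketched with `statistics.NormalDist` from the Python standard library instead of printed z tables. The function name `warranty_length` is illustrative, and the computed values may differ from table answers in the third or fourth decimal place.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, SD 1

# Area to the right: P(Z > 1.08) = 1 - P(Z < 1.08)
p_upper = 1 - Z.cdf(1.08)

# Area in the middle: P(1.08 < Z < 1.51) = P(Z < 1.51) - P(Z < 1.08)
p_middle = Z.cdf(1.51) - Z.cdf(1.08)

def warranty_length(mu, sigma, max_fraction):
    # Find w with P(X < w) = max_fraction for X ~ N(mu, sigma^2):
    # look up the z-score for that lower-tail area, then un-standardise.
    z = Z.inv_cdf(max_fraction)
    return mu + z * sigma

# Rackets: mean life 16 months, SD 3 months, replace at most 4%
warranty = warranty_length(16, 3, 0.04)

print(round(p_upper, 4), round(p_middle, 4), round(warranty, 2))
```

Under these assumptions the warranty works out to a little under 11 months, since the 4% lower tail sits about 1.75 standard deviations below the mean.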
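
The three ESTIMATION examples (height CI, bus-trip sample size, cyclist proportion CI) follow the same pattern and can be sketched as below. The function names are hypothetical, and the critical values come from `NormalDist.inv_cdf` rather than the rounded table values 1.96, 2.575 and 1.645, so results may differ slightly in the last decimal place.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_critical(confidence):
    # z_{alpha/2}: leaves alpha/2 in each tail
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def mean_ci(xbar, sigma, n, confidence):
    # xbar +/- z * sigma / sqrt(n), population SD known
    half = z_critical(confidence) * sigma / sqrt(n)
    return xbar - half, xbar + half

def min_sample_size(sigma, margin, confidence):
    # Solve z * sigma / sqrt(n) <= margin for n, rounding up
    return ceil((z_critical(confidence) * sigma / margin) ** 2)

def proportion_ci(successes, n, confidence):
    # p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
    p_hat = successes / n
    half = z_critical(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

height_ci = mean_ci(178, 10, 25, 0.95)    # heights example
n_bus = min_sample_size(5, 3, 0.99)       # bus trip example
cyclist_ci = proportion_ci(625, 945, 0.90)  # cyclists example

print(height_ci, n_bus, cyclist_ci)
```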
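
The test statistics in HYPOTHESIS TESTING can be reproduced with the sketch below. Function names are illustrative. The t critical value 1.7396 is taken from the worked example rather than computed, because the standard library has no t distribution. P-values here use the unrounded z, so they differ slightly from the table-based 0.5352 and 0.2224, which round z to two decimals first.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(z):
    # P(|Z| > |z|) for a standard normal Z
    return 2 * (1 - NormalDist().cdf(abs(z)))

def one_sample_t(xbar, mu0, s, n):
    # T = (xbar - mu0) / (s / sqrt(n)), compared with t on n - 1 df
    return (xbar - mu0) / (s / sqrt(n))

def one_proportion_z(successes, n, p0):
    # Z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return (successes / n - p0) / sqrt(p0 * (1 - p0) / n)

def two_proportion_z(p1, n1, p2, n2):
    # Pooled estimate of the common proportion under H0: p1 = p2
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

t_noise = one_sample_t(72, 60, 10, 18)
reject_noise = t_noise > 1.7396  # critical value from the worked example

z_votes = two_proportion_z(0.45, 80, 0.40, 70)
z_walk = one_proportion_z(204, 945, 0.2)

print(round(t_noise, 3), reject_noise)
print(round(z_votes, 4), round(two_sided_p_value(z_votes), 4))
print(round(z_walk, 4), round(two_sided_p_value(z_walk), 4))
```

For the two z tests, the two-sided decision at 5% significance is simply whether the absolute value of the statistic exceeds 1.96.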