1 Introduction to Probability and Counting


1.1 Heuristic Probabilities
Idea: Assign a value between 0 and 1 to an event; the magnitude gives the likelihood that the event will occur:
near 0: unlikely
near 1: likely
near 1/2: may or may not occur (either equally likely)

How to assign probabilities?

Personal approach: guess! Requires experience; often used when we have no previous data.
ex: estimating the probability that a totally new aircraft design will crash on its first flight.

Relative frequency approach: conduct the experiment many times; then P = m/n, where
n = total number of times the experiment is conducted
m = number of times the desired phenomenon occurs
Requires the ability to repeat the experiment.
ex: weather forecast. If it rains on 15 out of 50 days with identical meteorological conditions, then the probability of precipitation for a day with those conditions is P = 15/50 = .30 = 30%.

Classical approach:
- compute the total number of possible outcomes, n(S)
- compute the number of outcomes with the desired result A, n(A)
- then the probability is P = n(A)/n(S)
Valid only if the outcomes are equally likely!
ex: roll 1 die; what's the probability of getting an even number?
n(S) = number of possible values = 6
n(A) = number of values with the desired property (even) = 3
probability P = n(A)/n(S) = 3/6 = 1/2.
Venn diagrams can help!
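A quick check of the classical approach in code (a minimal Python sketch; the "even number" event is just the die example above):

# classical approach: P(A) = n(A)/n(S), valid when outcomes are equally likely
sample_space = [1, 2, 3, 4, 5, 6]                  # the six faces of one die
event = [x for x in sample_space if x % 2 == 0]    # outcomes that are even
print(len(event) / len(sample_space))              # 0.5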

1.2 Sample spaces & Events


Def: Sample space: the set of all possible outcomes of an experiment; its elements are called sample points.
usually denoted by S
must include all possible outcomes
sometimes there is more than one possibility for S, depending on how outcomes are specified
ex: flip a coin 3 times; the sample space S is
S = {hhh, hht, hth, htt, thh, tht, tth, ttt} (8 possible outcomes)
or
S = {3 heads, 2 heads & 1 tail, 1 head & 2 tails, 3 tails} (4 possible outcomes)
Either is acceptable as the sample space; which one is used might depend on what we're interested in investigating. (The first has a very nice property not shared by the second: each of its outcomes is equally likely to occur! Because of this, we'll usually use the first as our sample space.)

ex: Have 4 stages of a rocket; any one can fail, at which point mission is over. A logical sample space representing all possible outcomes would be S = {f, sf, ssf, sssf, ssss}, where ssf represents the outcome in which the first two stages succeed but the third fails. (Hopefully, outcomes not equally likely!!)

Def: An event is any subset of sample space (i.e., any set of possible outcomes) - can consist of a single element

ex: (rocket) The event that the rocket fails at some stage is subset A = {f, sf, ssf, sssf} The event that rocket goes through 2nd stage is subset B = {ssf, sssf, ssss}

Notes: The empty set is a subset, hence an event; called the impossible event The entire sample space S is a subset, hence an event; called the certain event When the actual outcome of the experiment is a member of the subset, we say the event has occurred ex: (rocket) if rocket blows up during 2nd stage; then event A above has occurred, event B hasn't

2 Probability Laws

2.1 Axioms of Probability

Given a sample space S, we will assign probability values to events (subsets) which obey the following axioms:
1. P(A) >= 0 for every event A
2. P(S) = 1
3. If A1, A2, ... are mutually exclusive events, then P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ... (addition rule)

Note: When we assign probabilities to all of the subsets of a sample space, we create what is called a probability measure on the space; behaves very similarly to area.

Can think of as Venn diagram, where total area is 1 (and so areas of subsets are <= 1); then probability of subset is just its area.

ex: Of the M&M's manufactured, 20% are red, 10% orange, 10% green, 10% blue, 20% yellow, and 30% brown. Thus if our experiment consists of selecting one M&M at random and considering its color, our sample space could be S = {R, Or, G, Bl, Y, Br}. The logical assignment of probabilities to the single-element subsets of the sample space is
P({R}) = .20   P({Or}) = .10   P({G}) = .10   P({Bl}) = .10   P({Y}) = .20   P({Br}) = .30
(We would usually write these as just P(R) = .20, without the set brackets, when we're dealing with single elements of the sample space, even though strictly speaking probability values are associated with subsets.) We would then get probabilities for other subsets by using the addition rule! So the probability of getting a red or a green M&M would be
P({R, G}) = P({R}) + P({G}) = .20 + .10 = .30
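The same bookkeeping in a minimal Python sketch (the color dictionary mirrors the assignment above; prob is just a helper name used here):

# probabilities assigned to the single-element subsets of S = {R, Or, G, Bl, Y, Br}
p = {"R": .20, "Or": .10, "G": .10, "Bl": .10, "Y": .20, "Br": .30}

def prob(event):
    # addition rule: the probability of a subset is the sum over its elements
    return sum(p[color] for color in event)

print(prob({"R", "G"}))    # 0.30
print(prob(p.keys()))      # 1.0 = P(S)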

Note: when a sample space is discrete, we usually assign probabilities to individual elements, then find probabilities of other subsets from these, as in the above example Note: if the outcomes in our sample space are equally likely, and there are N possible outcomes, then the probability of each outcome is 1/N and thus the probability of any event A is P(A) = n/N, where n = number of outcomes for which A occurs N = total number of outcomes this is just the classical approach to assigning probabilities!

Some results about probabilities:

Theorem 2.1.2: P(A') = 1 - P(A)
proof: Since A ∪ A' = S, P(A ∪ A') = P(S) = 1. Since A ∩ A' = ∅, we can use the addition rule to get P(A ∪ A') = P(A) + P(A'). Thus P(A) + P(A') = 1, or P(A') = 1 - P(A).

Theorem 2.1.1: P(∅) = 0
proof: ∅ = S', so P(∅) = P(S') = 1 - P(S) = 1 - 1 = 0.

Note: sometimes it's much easier to find P(A') than P(A)!

From a Venn diagram, we can get an idea of how to find P(A1 ∪ A2) when A1 & A2 aren't disjoint:

P(A1) + P(A2) counts the overlap twice, so subtract P(A1 ∩ A2). This suggests the General Addition Rule:
P(A1 ∪ A2) = P(A1) + P(A2) - P(A1 ∩ A2)
This can be used when events A1 and A2 aren't mutually exclusive (disjoint), in which case the previous addition rule wouldn't apply.
ex: Suppose that the probability that a child will have blue eyes is .25, the probability that a child will have blonde hair is .30, and the probability that a child will have both is .13.

What's the probability that a child selected at random will have either blue eyes or blond hair (or both)? Let E = the event a child has blue eyes, H = the event a child has blond hair. We want P(E ∪ H). The general addition rule gives us
P(E ∪ H) = P(E) + P(H) - P(E ∩ H) = .25 + .30 - .13 = .42

ex: On a particular football team, the probability that a player plays offense is .60 and the probability that a player plays defense is .65. Everybody plays one or the other, so P(A ∪ B) = 1. What's the probability a player selected at random plays both? Let A = the event a player plays offense, B = the event a player plays defense. We want the probability of A ∩ B. We can solve the general addition rule for P(A ∩ B) to get
P(A ∩ B) = P(A) + P(B) - P(A ∪ B) = .60 + .65 - 1 = .25

Note: Sometimes it's easiest to use a Venn diagram to compute probabilities; just find the probability associated with each separate region.
ex: From the example above, what's the probability a child will have blonde hair but not blue eyes? With E and H as above, E' = the event a child doesn't have blue eyes; then we want P(E' ∩ H). The Venn diagram for the problem has overlapping regions for E and H, with the pertinent regions labelled.

The shaded region (the part of H outside E) is what we want; it's everything that's in H but not in E. From the diagram, we can see that
P(E' ∩ H) = P(H) - P(E ∩ H) = .30 - .13 = .17
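These general-addition-rule manipulations are easy to check numerically (a tiny sketch; p_E, p_H, p_EH are just shorthand for the probabilities above):

p_E, p_H, p_EH = .25, .30, .13    # blue eyes, blond hair, both
print(p_E + p_H - p_EH)           # P(E or H)      = 0.42
print(p_H - p_EH)                 # P(H and not E) = 0.17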

2.2 Conditional Probability


Def: If A & B are events, then P(A|B) denotes the conditional probability of A, given B, and is defined as
P(A|B) = P(A ∩ B) / P(B)    (provided P(B) > 0).

measures probability that A has occurred, given that we know B has occurred. in Venn diagram, since we know B has occurred, we know we're in the region for B, and we're essentially looking for the fraction of B's that are also A's. This is given by the "area" of the overlap divided by the "area" of B.

ex: Selecting a child at random; let H = event the child has blonde hair, E = event the child has blue eyes. Suppose P(H) = .30, P(E) = .25, P(E ∩ H) = .13. What's the probability the child has blue eyes, given that we know he has blond hair? We want
P(E|H) = P(E ∩ H) / P(H) = .13/.30 ≈ .43

Note: in general, the knowledge that event B has occurred will change the value of the probability that event A has also occurred. In the above example, knowing that a child has blond hair increases the chance that he/she has blue eyes, as might be expected.

ex: You're dealt 2 cards from a standard deck of 52. What's the probability that the 2nd card is an ace, given that the 1st card is an ace? Given that the first card selected is known to be an ace, there are only 3 aces left in the 51 remaining cards, and thus the probability of getting an ace as the second card is P = 3/51.
ex: Extreme example: roll one die; let A = event the number on top is a 6, B = event the number on bottom is a 1. Then P(A|B) = 1! On a standard die, the numbers are arranged so that the values on opposite faces sum to 7 - thus the 3 is opposite the 4, the 5 is opposite the 2, and the 6 is opposite the 1. Thus if we know that the 6 is on top, we know for certain that the 1 is on the bottom.
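A quick simulation sketch of the two-card example (standard library only; with enough trials the estimate settles near 3/51 ≈ .059):

import random

deck = ["A"] * 4 + ["x"] * 48          # 4 aces and 48 other cards
first_ace = both = 0
for _ in range(200_000):
    a, b = random.sample(deck, 2)      # deal two cards without replacement
    if a == "A":
        first_ace += 1
        if b == "A":
            both += 1
print(both / first_ace)                # estimates P(2nd is ace | 1st is ace)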

2.3 Independent Events


Def: Two events A, B are independent if and only if P(A ∩ B) = P(A) P(B).
called the Multiplication Rule for Independent Events
makes it easy to compute P(A ∩ B) if we know just P(A) and P(B)
What does independence mean? Look at the conditional probability:

P(A|B) = P(A ∩ B)/P(B) = P(A) P(B)/P(B) = P(A) for A and B independent, or P(A|B) = P(A).
the fact that B has occurred doesn't change the probability that A will occur
knowledge that B has occurred doesn't give us any additional information as to whether or not A might also have occurred
A and B are independent if they "don't affect one another"
ex: Flip a fair coin twice; use sample space S = {HH, HT, TH, TT} to denote the result of the experiment. (A fair coin is one for which the probability of getting a head or a tail on each flip is 1/2.) Compute the probability of each outcome in the sample space. Clearly, the results of the flips are independent; whether or not a head occurs on the first flip has no bearing on whether or not a head will occur on the second flip. Let
H1 = event we get a head on the 1st flip
H2 = event we get a head on the 2nd flip
Then P(two heads) = P(H1 ∩ H2) = P(H1) P(H2), since H1 and H2 are independent, = (1/2)(1/2) = 1/4. Thus P(HH) = 1/4. The same reasoning applies to all other elements of the sample space; thus P(HT) = 1/4, P(TH) = 1/4, P(TT) = 1/4. Thus all of the elements of the sample space are equally likely.
ex:

Consider now a weighted coin, for which the probability of getting a head is P(H) = .7 and the probability of getting a tail is thus P(T) = .3. Flip this coin twice, and use sample space S = {HH, HT, TH, TT} to denote possible outcomes. Compute the probability of each outcome in the sample space. As above, the results of the two flips are independent, so compute as above using the multiplication rule for independent events:

P(HH) = P(H) P(H) = (.7) (.7) = .49, P(HT) = P(H) P(T) = (.7) (.3) = .21, P(TH) = P(T) P(H) = (.3) (.7) = .21, P(TT) = P(T) P(T) = (.3) (.3) = .09. Notice that now the elements of the sample space are not equally likely!

When it's not clear whether or not two events are independent, use the definition (the multiplication rule) to see if they are, i.e., check to see if P(A ∩ B) = P(A) P(B).
ex: Let R = event it rains on a given day, and W = event it's windy. Suppose P(R) = .30, P(W) = .20, P(R ∩ W) = .15 (= the probability that it's both rainy and windy). Are these events independent? P(R) P(W) = .06, which is not equal to P(R ∩ W). Thus the events are not independent, as we might expect: if it's rainy, it's also likely to be windy.
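In code the independence check is a single comparison (sketch with the numbers from the example):

p_R, p_W, p_RW = .30, .20, .15
# independence would require P(R and W) == P(R) * P(W)
print(abs(p_RW - p_R * p_W) < 1e-9)    # False: .15 vs .06, so not independent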

General Multiplication Rule
If A, B are independent, and we know P(A) and P(B), it's easy to find P(A ∩ B) from the definition of independence: P(A ∩ B) = P(A) P(B). If A, B are not independent, we can't find P(A ∩ B) from just P(A) and P(B). But we can always use the definition of conditional probability, as follows:
P(A|B) = P(A ∩ B)/P(B), or P(A ∩ B) = P(B) P(A|B)
Thus we take the probability that B will occur times the modified probability for A, given that B has occurred!

ex: Urn model. Have 4 white balls and 2 black ones in an urn. Draw 2 balls in succession, without replacing the 1st before drawing the 2nd. Let the sample space be {WW, WB, BW, BB}. Find the probability of each of the outcomes.
Note: we can't use the multiplication rule for independent events to find the results, i.e., we can't use P(WW) = P(W) P(W), since drawing a white ball on the first draw does affect the probability we'll get a white ball on the second draw (since there will then be fewer white balls in the urn) - these events aren't independent! Let

W1 = event the 1st ball drawn is white
W2 = event the 2nd ball drawn is white
B1 = event the 1st ball drawn is black
B2 = event the 2nd ball drawn is black
Since these events aren't independent, use the general multiplication rule:
P(WW) = P(W1 ∩ W2) = P(W1) P(W2 | W1) = (4/6)(3/5) = 12/30
In this computation, P(W1) = 4/6, since initially there are 4 white balls out of 6 total in the urn; but then P(W2|W1) = 3/5: the probability of getting a white ball on the second draw, given that we got a white ball on the first draw, is 3/5, since there are only 5 balls left in the urn, of which 3 are white. Similarly,
P(WB) = P(W1 ∩ B2) = P(W1) P(B2 | W1) = (4/6)(2/5) = 8/30
P(BW) = P(B1 ∩ W2) = P(B1) P(W2 | B1) = (2/6)(4/5) = 8/30
P(BB) = P(B1 ∩ B2) = P(B1) P(B2 | B1) = (2/6)(1/5) = 2/30
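A short simulation of the urn model (standard library only); the estimated frequencies should land near 12/30, 8/30, 8/30, and 2/30:

import random
from collections import Counter

urn = ["W"] * 4 + ["B"] * 2
counts = Counter()
trials = 100_000
for _ in range(trials):
    first, second = random.sample(urn, 2)   # draw 2 balls without replacement
    counts[first + second] += 1
for outcome in ("WW", "WB", "BW", "BB"):
    print(outcome, counts[outcome] / trials)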

3 Random Variables and Discrete Distributions

3.1 Random Variables


Def: A random variable X is a variable whose value depends on chance, i.e., whose value depends on the outcome of some experiment.
We use capital letters to denote random variables.
ex: Roll 2 dice, and let X = sum of the values on the faces. X is a random variable: its value depends on the outcome of the roll of the dice. The values that X can take on are 2, 3, 4, ..., 12. Since this is a discrete set of values, X is called a discrete random variable.
ex: Let R = number of inches of rainfall received at Allentown airport on a given day. R can take on any value in the interval [0, 10] (for example, 3.0", 1.257", etc.). Since there is a continuum of possible values, R is called a continuous random variable.
ex: Keep flipping a fair coin until you get a tail; let N = number of flips. N can take on any of the values 1, 2, 3, 4, 5, ... N is a discrete random variable.
More formally: A random variable is a function whose domain is the sample space of some random experiment: the value the random variable takes on is determined by the outcome of the experiment. A random variable is discrete if its range (the set of values which it can take on) is countable, i.e., either finite or countably infinite, and is continuous otherwise.

3.2 Discrete Probability Densities


Given a discrete random variable X; if x is one of its possible values, want to know the probability that X takes on the value x, i.e., want P(X=x).

Def: Let X be a discrete random variable. The function f defined by f(x) = P(X=x) is called the probability density function (p.d.f.) of the random variable X

Usually, the value of X depends on some underlying sample space S. Find P(X = x) as follows: determine the subset A of S on which X = x; then P(X=x) = P(A).
ex: Flip 3 coins; the sample space is S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. All outcomes are equally likely (multiplication rule for independent events); each has probability 1/8. Let the random variable X equal the number of heads obtained; its possible values are thus 0, 1, 2, 3. Then
P(X=0) = P({TTT}) = 1/8
P(X=1) = P({HTT, THT, TTH}) = 3/8
P(X=2) = P({HHT, HTH, THH}) = 3/8
P(X=3) = P({HHH}) = 1/8
Thus the probability density function f is given in the following table:
x      0     1     2     3
f(x)   1/8   3/8   3/8   1/8
ex: On the AP Calculus exam, the possible scores are 1, 2, 3, 4, 5. Let Z = score of a student chosen at random; then Z is a discrete random variable. Suppose its p.d.f. is given by the following table:
z      1     2     3     4     5
f(z)   .15   .20   .40   .15   .10

What's the probability that a randomly selected student's score is 4 or higher?


P(Z >= 4) = P(Z=4) + P(Z=5) = f(4) + f(5) = .15 + .10 = .25
ex: Flip a fair coin until you get a tail; let N = total number of flips.
sample space:   { T,   HT,   HHT,   HHHT,   HHHHT, ...}
probabilities:    1/2   1/4    1/8    1/16     1/32
value of N:        1     2      3      4        5
Thus the probability that it takes n flips to get a tail is (1/2)^n, and the density function is
f(n) = (1/2)^n,   n = 1, 2, 3, ...

In tabular form:
n      1     2     3     4      ...
f(n)   1/2   1/4   1/8   1/16   ...
What's the probability you'll flip fewer than 4 times before getting a tail?
P(N<4) = P(N=1) + P(N=2) + P(N=3) = f(1) + f(2) + f(3) = 1/2 + 1/4 + 1/8 = 7/8.
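A sketch that tabulates this density and checks P(N < 4); the cutoff of 30 flips is just a hypothetical truncation of the infinite support:

f = {n: 0.5 ** n for n in range(1, 31)}   # f(n) = (1/2)^n, truncated at n = 30
print(sum(f[n] for n in (1, 2, 3)))       # P(N < 4) = 0.875
print(sum(f.values()))                    # very close to 1 (the tail past n = 30 is tiny)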

Properties of probability density functions
Let f be the p.d.f. of random variable X. Then
1. f(x) >= 0 for all x
2. Σ f(x) = 1, where the sum is over all values that X can take on.
These properties follow directly from the definition f(x) = P(X = x): f(x) = the probability that X takes on the value x. Thus the first property follows from the fact that all probabilities must be nonnegative, and the second from the fact that the sum of the probabilities over all possible outcomes must be 1.

ex: Consider the r.v. N above, with p.d.f. f(n) = (1/2)^n, n = 1, 2, 3, ...

Show that this satisfies the above two properties.
1. Clearly, f(n) >= 0 for all n.
2. To verify, compute the value of the infinite sum: 1/2 + 1/4 + 1/8 + 1/16 + ... is a geometric sum, with first term a = 1/2 and ratio r = 1/2; the value of such a geometric sum is
a/(1 - r) = (1/2)/(1 - 1/2) = 1.

Note: the p.d.f. tells us everything we need to know about random variable X; we don't need to use the underlying sample space once we have the p.d.f.

There is an alternate way to characterize random variables, closely related to the probability density function:

Def: Given discrete random variable X. The function F defined by F(x) = P(X <= x) is called the cumulative distribution function of X. Thus F(x) gives the probability that X will take on a value less than or equal to x.

The density function f and distribution function F are closely related:
F(x0) = P(X <= x0) = Σ f(x), where the sum is over all values x <= x0.

ex: Consider the random variable N above, where N = the number of times a coin is flipped before a tail appears. Find F(n) as both a table and a formula. F(n) = P(N <= n), so
F(1) = P(N<=1) = P(N=1) = f(1) = 1/2
F(2) = P(N<=2) = P(N=1) + P(N=2) = f(1) + f(2) = 1/2 + 1/4 = 3/4
F(3) = P(N<=3) = P(N=1) + P(N=2) + P(N=3) = f(1) + f(2) + f(3) = 1/2 + 1/4 + 1/8 = 7/8
etc. Thus it's clear that the value of F(n0) is obtained by summing the values of f(n) for n <= n0. The following table gives both f(n) and F(n):
n      1     2     3     4      ...
f(n)   1/2   1/4   1/8   1/16   ...
F(n)   1/2   3/4   7/8   15/16  ...

From the table, a formula for F(n) can be inferred: F(n) = 1 - (1/2)^n, n = 1, 2, 3, ...
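The "sum the density up to n0" relation translates directly into code (a sketch using the same truncated table):

from itertools import accumulate

ns = list(range(1, 11))
f = [0.5 ** n for n in ns]           # f(n) = (1/2)^n
F = list(accumulate(f))              # F(n) = f(1) + ... + f(n)
for n, Fn in zip(ns, F):
    print(n, Fn)                     # matches F(n) = 1 - (1/2)^n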

3.3 Expectation, Mean & Variance


Def: The expected value or expectation of a discrete random variable X is denoted E(X) and is defined as the weighted average of all possible values of X, with each value weighted by the probability that the value will occur:

i.e., E(X) = Σ x f(x), where the sum is over all values x that X can take on.

E(X) is also called the mean of X, denoted μ.

ex: On the Advanced Placement (AP) Calculus exam, the possible scores are 1, 2, 3, 4, 5. Let Z = score of a student chosen at random be the discrete random variable with p.d.f. given by the following table:
z      1     2     3     4     5
f(z)   .15   .20   .40   .15   .10

(Thus the probability that a student scores a 3 on the exam is .40.) What's the expected value of Z?
E(Z) = 1·f(1) + 2·f(2) + 3·f(3) + 4·f(4) + 5·f(5) = 1(.15) + 2(.20) + 3(.40) + 4(.15) + 5(.10) = 2.85
What does this mean? How can this be the expected or average score, when a student can't ever get a score of 2.85 (the scores must be whole values)? The expected value gives the value that would result if we sampled a large number of students and found the average of their scores on the test.
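The weighted-average definition in a couple of lines of Python (sketch; the pdf dictionary is the table above):

pdf = {1: .15, 2: .20, 3: .40, 4: .15, 5: .10}
mean = sum(z * p for z, p in pdf.items())
print(mean)    # 2.85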

ex: Consider a lottery like the Pennsylvania Daily Number, in which a 3-digit number is chosen at random. Suppose it costs $1 to play, and if you pick the correct 3-digit number, you win $500. (This is simpler than the real lottery, which has more different ways in which you can win.) Let W represent your net winnings in one play of the lottery; then W is a discrete random variable, which takes on the values -1 and 499 (if you don't win, you lose the $1 it cost to play; if you do win, you get $500 back, but you're out the dollar it cost to play). The probability of winning is 1 out of 1000, or .001: there are 1000 possible 3-digit numbers, out of

which yours is one. The probability density function for W is thus given in the following table:
w      -1     499
f(w)   .999   .001

What is the expected value of W?
E(W) = (-1)f(-1) + (499)f(499) = (-1)(.999) + (499)(.001) = -.50
What does this mean? This is the long-term average winnings per trial: if you play many times, you should expect to lose on average $.50 each time you play - if you play 1000 times, you should expect to be down $500; if you play 4000 times, you should expect to be down $2000, etc.
If random variable Y is a function of random variable X, Y = H(X), the expected value of Y is defined to be
E(Y) = E(H(X)) = Σ H(x) f(x), where the sum is over all values x that X can take on.

The most important functions are the powers X², X³, ... The expected values of these are called the moments of X:
E(X)  = first moment
E(X²) = second moment
E(X³) = third moment
etc.
ex: In the AP example above, compute the second & third moments of Z.
E(Z²) = 1²f(1) + 2²f(2) + 3²f(3) + 4²f(4) + 5²f(5) = 1(.15) + 4(.20) + 9(.40) + 16(.15) + 25(.10) = 9.45
E(Z³) = 1³f(1) + 2³f(2) + 3³f(3) + 4³f(4) + 5³f(5) = 1(.15) + 8(.20) + 27(.40) + 64(.15) + 125(.10) = 34.65

ex: Consider the random variable N above, where N = the number of times a coin is flipped before a tail appears. Compute the mean and second moment.
mean: μ = E(N) = Σ n (1/2)^n = 1(1/2) + 2(1/4) + 3(1/8) + 4(1/16) + ...

glitch: an infinite sum! While techniques exist to find its value, it's still difficult. The second moment is even worse! Luckily, we'll look at another technique for finding the moments in the next section (using the moment generating function). We'll revisit this computation then!

Note: If we know all of the moments, we can reconstruct the density function of X - in other words, the moments completely determine the random variable.

Rules for Expectation
Given random variables X, Y and constant c. Then
1. E(c) = c
2. E(cX) = cE(X)
3. E(X + Y) = E(X) + E(Y)

Note: E(XY) ≠ E(X) E(Y), in general!
ex: AP scores; pick 200 students at random; their scores are Z1, Z2, Z3, ..., Z200. Look at the average of the 200 scores, i.e., let
Y = (Z1 + Z2 + ... + Z200)/200
What's the expected value of Y?
E(Y) = E((Z1 + Z2 + ... + Z200)/200)
     = (1/200) E(Z1 + Z2 + ... + Z200)           (by rule 2)
     = (1/200) [E(Z1) + E(Z2) + ... + E(Z200)]   (by rule 3)
     = (1/200)(200)(2.85)
     = 2.85
Thus the expected value for the average of a set of scores is the same as the expected value of a single score! Note that the number of scores used (200 here) is immaterial; the same conclusion would result. This will be an important result when we look at sampling and statistics.

Def: The variance of random variable X is defined as
var(X) = E((X - μ)²)
The standard deviation (s.d.), σ, is σ = √var(X).

the variance calculates the expected value of the squared deviation of X from its mean
the variance and s.d. measure the variability of the values X takes on; the larger the variance or s.d., the more likely the values will vary widely from the mean
much of the time, the values X takes on will lie within 1 s.d. of the mean!

Simplified formula for computing var(X):
var(X) = E(X²) - μ²   (or var(X) = E(X²) - E(X)²)
This follows from the rules for expectation:
var(X) = E((X - μ)²) = E(X² - 2μX + μ²) = E(X²) - 2μE(X) + E(μ²) = E(X²) - 2μ² + μ² = E(X²) - μ²
(note that E(X) = μ)

ex: AP example; compute the variance and standard deviation.
var(Z) = E(Z²) - μ²; but we computed above that E(Z²) = 9.45 and E(Z) (= μ) = 2.85, so
var(Z) = 9.45 - (2.85)² = 1.3275, and thus σ = √1.3275 ≈ 1.15
What does this tell us? The standard deviation tells us the range into which our scores will fall most of the time: most of the time the scores will be within 1.15 units of the mean (2.85).
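The simplified formula in code for the AP scores (sketch):

import math

pdf = {1: .15, 2: .20, 3: .40, 4: .15, 5: .10}
mu  = sum(z * p for z, p in pdf.items())        # 2.85
m2  = sum(z * z * p for z, p in pdf.items())    # 9.45
var = m2 - mu ** 2                              # 1.3275
print(var, math.sqrt(var))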

Rules for variances
Given random variables X, Y and constant c.
1. var(c) = 0
2. var(cX) = c² var(X)
3. If X & Y are independent random variables, var(X + Y) = var(X) + var(Y)
Note: X, Y are independent if the value obtained for X doesn't influence the value obtained for Y, i.e., the value of Y doesn't depend on the value of X; more on this later!
ex: independent vs. dependent random variables

Roll 2 dice; X = value on the 1st, Y = value on the 2nd. Then X and Y are clearly independent: the value on the first die doesn't affect in any way the value that will appear on the other.
Roll 1 die; X = value on top, Y = value on bottom. Then X and Y are clearly dependent: the value on the bottom is completely determined by the value on the top.
Roll 2 dice; X = value on the 1st, Y = sum of the values on the two dice. Then X and Y are dependent: the value of Y, the sum, is not independent of the value on the first die. For example, if the value on the first die is a 6, then the sum can't take on any value less than 7.

3.4 Moment Generating Function

Recall: the moments of a random variable are useful to know, but not so easy to find. In cases where we know a formula for the p.d.f., we can often find all the moments at once in a convenient way!
Def: Let X be a discrete random variable. Then the moment generating function of X is the function of the variable t defined as
mX(t) = E(e^(tX))
the moment generating function is the expected value of the function e^(tX)
the variable t is just a parameter (auxiliary variable), whose use will become clear
the moments of X are hidden inside the function mX(t)!
ex: AP example; recall that the p.d.f. for the score Z of a randomly selected student was given by the table above. What's the moment generating function of the random variable Z? Well,
mZ(t) = E(e^(tZ)) = Σ e^(tz) f(z) = e^t (.15) + e^(2t) (.20) + e^(3t) (.40) + e^(4t) (.15) + e^(5t) (.10)
Notice that this is a function of the variable t.

ex: Flip a coin until you get a tail; let N = the number of flips. Then we found previously that the p.d.f. was f(n) = (1/2)^n. Find the moment generating function. As above,
mN(t) = E(e^(tN)) = Σ e^(tn) (1/2)^n = e^t (1/2) + e^(2t) (1/2)² + e^(3t) (1/2)³ + ... = (e^t/2) + (e^t/2)² + (e^t/2)³ + ...
But this is another geometric sum, with first term a = e^t/2 and ratio r = e^t/2, whose value is thus a/(1 - r). So
mN(t) = (e^t/2)/(1 - e^t/2) = e^t/(2 - e^t)
The above are functions of t; where are the moments of the random variables??

Theorem If mX(t) is the m.g.f. of X, then the moments of X can be found as
E(X^k) = the kth derivative of mX(t), evaluated at t = 0

i.e., to find the kth moment, take the kth derivative of the moment generating function and evaluate it at t = 0.
ex: Example above; by the theorem,
E(N) = m'N(0). Since mN(t) = e^t/(2 - e^t), the quotient rule gives m'N(t) = 2e^t/(2 - e^t)², so E(N) = m'N(0) = 2/1 = 2. This is the mean of N.
E(N²) = m''N(0). Differentiating again (by the quotient rule) and evaluating at t = 0 gives E(N²) = 6.

From this we can find the variance of N: var(N) = E(N²) - E(N)² = 6 - 2² = 2.
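If sympy is available, the differentiate-and-evaluate recipe can be checked symbolically (a sketch; mN(t) = e^t/(2 - e^t) as derived above):

import sympy as sp

t = sp.symbols('t')
m = sp.exp(t) / (2 - sp.exp(t))           # m.g.f. of N
EN  = sp.diff(m, t, 1).subs(t, 0)         # first moment  -> 2
EN2 = sp.diff(m, t, 2).subs(t, 0)         # second moment -> 6
print(EN, EN2, sp.simplify(EN2 - EN**2))  # variance = 2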

Why do we care so much about finding moments?
gives an easy way to compute the mean & variance
we can identify the type of a random variable by looking at its moment generating function, as stated by the following:
Theorem If two random variables have the same moment generating function, then they have the same p.d.f.
This will be useful to us in the future, when we're trying to determine what type of random variable we get when we look at specific combinations of other random variables!

Properties of moment generating functions
Let X be a random variable, mX(t) its moment generating function, and let c be a constant. Then the moment generating functions of certain modifications of X are related to the moment generating function of X as follows:
1. the moment generating function of the random variable cX is mcX(t) = mX(ct)
2. the moment generating function of the random variable X + c is mX+c(t) = e^(ct) mX(t)
3. Let X1 and X2 be independent random variables with moment generating functions

mX1(t) and mX2(t). Then the moment generating function of the random variable X1 + X2 is mX1+X2(t) = mX1(t) * mX2(t) These results will be of use to us later. The proofs of these follow from the properties of expectation discussed earlier.

The Geometric Distribution


Def: A random variable X is a geometric random variable if it arises as the result of the following type of process:
1. we have an infinite series of trials; on each trial, the result is either success (s) or failure (f). (Such a trial is called a Bernoulli experiment.)
2. the trials are independent, and the probability of success is the same on each trial. (The probability of success on each trial will be denoted p, and the probability of failure will be denoted q; thus q = 1 - p.)
3. X represents the number of trials until the first success.
(In short, a random variable is geometric if it "counts the number of trials until the first success.")

ex: Consider the "flip a coin until you get a tail" experiment above; then N (= # of flips until a tail occurs) is a geometric random variable, with p = 1/2: N counts the number of trials until the first success.

The sample space of such a process can be written as {s, fs, ffs, fffs, ...}; the values of the random variable X associated with these outcomes are 1, 2, 3, 4, ..., and the probabilities of the outcomes are p, qp, q²p, q³p, ... (The probabilities just come from the multiplication rule for independent events.)

Thus the probability density function of a geometric random variable X with probability of success p is:
f(x) = p q^(x-1) = p (1 - p)^(x-1),   x = 1, 2, 3, ...

Its moment generating function can be shown to be
mX(t) = p e^t / (1 - q e^t)
(using a technique identical to that used in the "flip a coin until a tail appears" example above). We can thus use the moment generating function to find the moments, and hence the mean and variance and

standard deviation:
E(X) = m'X(0) = 1/p   (from the quotient rule, using the fact that p = 1 - q)
E(X²) = m''X(0) = (1 + q)/p²

var(X) = E(X²) - E(X)² = (1 + q)/p² - 1/p² = q/p²
So the mean and variance are
μ = 1/p    var(X) = q/p²    (so the standard deviation is σ = √q / p)
ex: Dice game; pick a number from 1 to 6, then keep rolling until you get that value. Let X = total number of rolls needed to achieve this. Then X is a geometric random variable: it counts the number of trials until the first success. The probability of success on each trial here is p = 1/6. Thus from the above, the expected number of rolls until the desired number appears is E(X) = 1/p = 1/(1/6) = 6. (This is pretty much what we would have anticipated!) The variance is var(X) = q/p² = (5/6)/(1/6)² = 30, so the standard deviation is σ ≈ 5.48, which indicates that the value of X will usually fall in the range 6 ± 5.48; thus we should not be surprised if the number of rolls needed to get our number is as few as 1 or as many as 12.
ex: Consider the Pennsylvania daily number lottery, discussed before, in which the probability of winning on any given day is 1/1000. Now let N be the number of times you play before winning the first time. Then N is a geometric random variable, since it's counting the number of trials until the first success, with p = 1/1000. Thus the expected number of plays is E(N) = 1/p = 1/(1/1000) = 1000; thus you should expect to have to play 1000 times before winning! Since you would then be down $1000 (since it costs $1 to play), and would only recoup $500 for winning, this isn't such a great situation. Notice that this agrees with our previous results, in which we determined that, on average, you should expect to lose $.50 each time you play.
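A simulation sketch of the dice game (standard library only); the sample mean should land near 6 and the sample variance near 30:

import random, statistics

def rolls_until(target=3):
    # roll a die until the chosen number appears; return the number of rolls needed
    n = 0
    while True:
        n += 1
        if random.randint(1, 6) == target:
            return n

data = [rolls_until() for _ in range(50_000)]
print(statistics.mean(data), statistics.variance(data))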

Cumulative probability function The cumulative probability function F(x) of a geometric random variable X with probability of success p is

F(x) = 1 - q^x = 1 - (1 - p)^x
(This follows by summing the values of the p.d.f. and using the formula for the value of a finite geometric sum.)

ex: For the lottery example above, what's the probability that you'll win in a year (312 days, not counting Sundays) or less? We want
P(N <= 312) = F(312) = 1 - (1 - .001)^312 = 1 - .999^312 = 1 - .732 = .268
Thus there's only about a 1 in 4 chance that you'll win within a year if you play every day!

3.5 The Binomial Distribution

Def: A random variable X is a binomial random variable if it arises as the result of the following type of process:

1. we have a fixed number n of Bernoulli trials (success (s) or failure (f) on each)
2. the trials are independent, and the probability of success on each trial is the same (as before, denote the probability of success by p and the probability of failure by q)
3. X = total number of successes in the n trials (possible values are 0 to n)
(In short, X is binomial if it "counts the number of successes in n trials.")

ex: A test with 3 questions; you have probability .8 of getting any one of the questions correct. Let X = the number of questions you get correct. Then X is a binomial random variable, with n = 3 (3 trials) and p = .8 (probability of success on each trial), since it counts the total number of successes in a fixed number of trials. The sample space giving the 8 possible outcomes of your exam is {ccc, ccw, cwc, cww, wcc, wcw, wwc, www}; the corresponding values of X are 3, 2, 2, 1, 2, 1, 1, 0; and the probability of each outcome is found using the multiplication rule for independent events (for example, P(ccw) = (.8)(.8)(.2) = .128).

From the above we can get the probability density function, given in the following table:
x      0      1      2      3
f(x)   .008   .096   .384   .512

The density for a binomial random variable X with parameters n and p is
f(x) = C(n, x) p^x q^(n-x),   x = 0, 1, 2, ..., n
where C(n, x) = n!/(x!(n-x)!) is the number of ways to choose which x of the n trials are successes.

Its moment generating function is mX(t) = (p e^t + q)^n (see text for the derivation). From this, we can compute the first and second moments, and use these to compute the mean and variance:
E(X) = m'X(0) = n (p e^t + q)^(n-1) (p e^t) evaluated at t = 0   (by the chain rule)
     = n (p + q)^(n-1) p = np   (since p + q = 1)
E(X²) = m''X(0) = n(n-1)p² + np
(when the smoke clears, using the product and chain rules to differentiate; do this as an exercise!) These give us
var(X) = E(X²) - E(X)² = n(n-1)p² + np - (np)² = np - np² = np(1-p)   (exercise)
Thus the mean and variance are
μ = np    var(X) = npq = np(1-p)    (so the standard deviation is σ = √(np(1-p)) )

ex: Suppose you take a test like the one above, in which the probability of your getting any one of the questions correct is .8, but suppose now the test has 20 questions.
What's the expected number of questions you'll get correct? Let X = the number of questions out of the 20 you get correct; then X is a binomial random variable, with n = 20 and p = .8. Then E(X) = np = (20)(.8) = 16, i.e., you'd expect to get 16 of the 20 correct.
What's the probability you'll get exactly 1 wrong (i.e., 19 correct)?
P(X = 19) = f(19) = C(20, 19)(.8)^19 (.2)^1 = .058
What's the probability you'll get 14 or fewer correct? We want P(X <= 14) = F(14). There's no convenient formula for F(x) for a binomial random variable. We could compute F(14) by summing f(0) + f(1) + ... + f(14), but this is tedious. Fortunately, values of F(x) for binomial random variables have been tabulated for various values of n and p; your text has tables in the back (pp. 720-724) giving values for n = 1 through 20 and p = .1 through .9. Using the table on p. 724 for n = 20 and p = .8, we find P(X <= 14) = F(14) = .1958; thus there's only about a 20% chance you'd get 14 or fewer correct.
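If scipy is available, the same numbers can be reproduced without the printed tables (a sketch; scipy.stats.binom stands in for the table lookup):

from scipy.stats import binom

n, p = 20, 0.8
print(binom.pmf(19, n, p))   # about 0.058
print(binom.cdf(14, n, p))   # about 0.196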

Note: Though the tables only go up to n = 20, we'll find there's a quick way for us to approximate the value of probabilities for larger values of n using the normal distribution (to be discussed shortly).

ex: A binary packet consisting of 64 bits is sent over a communications line. Because of noise, some of the bits will be corrupted when they are sent (i.e., they'll be sent as 1's or 0's, but will be received as the opposite). Suppose the probability that any 1 bit is corrupted during transmission is p = .02.
What's the expected number of bad bits in a packet? Let X = the number of corrupted bits received out of the 64 sent. Then X is a binomial random variable, with n = 64 and p = .02 (here, "success" is interpreted as a bit being corrupted). Thus E(X) = np = 64(.02) = 1.28. On average, we would expect 1.28 bits to be corrupted per packet.

What's the probability the message comes through correctly (i.e., the number of bad bits = 0)?
P(X=0) = (.98)^64 = .27
So only about 1/4 of the 64-bit packets sent would be expected to come through without any errors. In this situation, since most of the packets will be corrupted, but usually only by about 1 bit, it would make sense to use an error-correcting code to transmit the data which can correct for errors in 1 or 2 bits.
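The packet computations in code (sketch; just the formulas above):

n, p = 64, 0.02
print(n * p)           # expected number of corrupted bits: 1.28
print((1 - p) ** n)    # P(no corrupted bits), about 0.27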

Chapter 4 Continuous Distributions


4.1 Continuous Random Variables and Densities
Recall: X is a continuous random variable if the possible values for X form a continuous range or interval.

ex: Let L = length of leaf picked from tree; could take on any value in interval [0, 10]

Important note: It doesn't make much sense to ask "what's the probability L = 2.5 inches?". The chance that L = exactly 2.500000000000... inches should be 0. For a continuous random variable, the probability that X takes on any one specific value is zero: P(X=x) = 0 for any x. Instead, we'll ask for the probability that the value of X will lie in some range of values, usually an interval: we want to know the probability that the value of X lies between two specified values a and b, P(a <= X <= b).

Can characterise continuous random variables with density function f(x); however, it will have a very different meaning from its interpretation in the discrete case!

Def: Let X be a continuous random variable. Then the density function f of X is the function for which
P(a <= X <= b) = ∫[a, b] f(x) dx

i.e., the probability that X lies between a & b is the area under the graph of f(x) from a to b.

The density function for a continuous random variable must satisfy 2 properties:
1. f(x) >= 0 for all x - this says the height of the graph must be >= 0 for all x
2. ∫[-∞, ∞] f(x) dx = 1

this says the probability that X lies between -∞ and ∞ must equal 1: X has to take on some value!

Note: The probability of X taking on one specific value should be 0. Indeed, by the above definition of the density function, P(X=a) = P(a <= X <= a) = ∫[a, a] f(x) dx = 0.

ex: Let X = weight of cereal in a randomly selected box; suppose the density function of X is
f(x) = 2(15 - x) for 14 <= x <= 15, and f(x) = 0 otherwise.
The graph of f is a line segment falling from height 2 at x = 14 to height 0 at x = 15 (and 0 elsewhere).

What's the probability X lies between 14.5 and 15.5 ounces?
P(14.5 <= X <= 15.5) = ∫[14.5, 15] 2(15 - x) dx = -(15 - x)² evaluated from 14.5 to 15 = 0 - (-(.5)²) = .25
(the integral stops at 15 because f(x) = 0 for x > 15)
What's the probability X <= 14.2?
P(X <= 14.2) = P(-∞ < X <= 14.2) = ∫[14, 14.2] 2(15 - x) dx   (since f(x) = 0 for x <= 14)
             = -(15 - x)² evaluated from 14 to 14.2 = -(.8)² - (-1²) = 1 - .64 = .36
What's the probability X >= 16?
P(X >= 16) = P(16 <= X < ∞) = 0, since f(x) = 0 for x > 15; the area under the curve from x = 16 to x = ∞ is 0.
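A numerical check of these areas (a sketch using a plain midpoint Riemann sum rather than any particular library; f is the density just defined):

def f(x):
    # cereal density: 2*(15 - x) on [14, 15], 0 elsewhere
    return 2 * (15 - x) if 14 <= x <= 15 else 0.0

def integrate(a, b, steps=100_000):
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

print(integrate(14.5, 15.5))   # about 0.25
print(integrate(14.0, 14.2))   # about 0.36
print(integrate(16.0, 20.0))   # 0.0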

Can characterize a continuous random variable by its cumulative distribution function:
Def: The cumulative distribution function F of a continuous random variable X is defined to be F(x) = P(X <= x)
notice that this is the same definition used in the discrete case
also called the cumulative probability function
If the density function of X is f(x), then F(x) = ∫[-∞, x] f(t) dt, since P(X <= x) equals the area under the graph of f to the left of x.

ex: For the example involving the weight of cereal boxes above, the cumulative distribution function is found as follows:
for x <= 14, F(x) = 0; this follows because P(X <= x) = 0 for any x less than 14, since the density function f(x) = 0 to the left of 14 and hence the area under the curve to the left of x is 0.
for x >= 15, F(x) = 1; this follows because P(X <= x) = 1 for any x greater than 15, since the entire "hump" of the density function will lie to the left of x and hence the area under the curve to the left of x will be 1.
for x between 14 and 15, we have
F(x) = P(-∞ < X <= x) = ∫[14, x] 2(15 - t) dt   (since f(x) = 0 for x less than 14)
     = -(15 - t)² evaluated from 14 to x = 1 - (15 - x)²
Thus we have
F(x) = 0 for x <= 14,   F(x) = 1 - (15 - x)² for 14 <= x <= 15,   F(x) = 1 for x >= 15.

The graph of F rises from 0 at x = 14 to 1 at x = 15, and is constant outside that interval.

The distribution function in the example above illustrates certain features shared by all cumulative distribution functions:
Properties of cumulative distribution functions
1. 0 <= F(x) <= 1 for all x
2. F is nondecreasing
3. F(x) → 0 as x → -∞ and F(x) → 1 as x → ∞

Note: For a continuous random variable, F(x) will be continuous!

4.2 Expectation, Variance, Moments, Moment Generating Function

Consider: in the discrete case, the expected value of a random variable X is defined as
E(X) = Σ x f(x) = Σ x P(X=x)
i.e., it gives the weighted average of all possible values of X. We can't quite do this in the continuous case, since P(X=x) = 0 for every x! Still, we want E(X) = weighted average of all possible x values.

Try instead: approximate the continuous distribution by a discrete one, as follows. Suppose the possible values for X lie within some interval [a,b].
break the range of possible values into n subintervals of equal width; let the points of subdivision be a = x0, x1, x2, ..., xn = b; the width of each subinterval will be Δx = (b - a)/n
from the definition of the density function, the probability that X lies between xi and xi+1 = area under the density curve from xi to xi+1, i.e.,
P(xi <= X <= xi+1) ≈ f(xi) Δx
as long as Δx is small.
approximate X by a discrete random variable which takes on the values x0, x1, ..., xn-1 - i.e., if the value of X lies between xi and xi+1, round it down to xi. Then the expected value for this "discretized" approximation to X is given by the sum of the products of the x values and the probabilities from above:
E(X) ≈ Σ xi f(xi) Δx
get a closer approximation to X by using more closely spaced points; we should get the "true" value for the expected value of X in the limit as the number of subintervals approaches infinity, and thus should have
E(X) = ∫[a, b] x f(x) dx
by the definition of the definite integral. We use the above to motivate the definition of the expected value of a continuous random variable (extending it to the case where the possible values lie in the infinite range from -∞ to ∞):

Def: If X is a continuous random variable, then its expected value is defined as
E(X) = ∫[-∞, ∞] x f(x) dx

As before, we also call this the mean of X, denoted μ.

ex: Let X be the weight of cereal in a randomly selected box, as in the example from the previous section, with density function f(x) = 2(15 - x) for 14 <= x <= 15 (and 0 otherwise). Then the expected value of X is
E(X) = ∫[-∞, ∞] x f(x) dx = ∫[14, 15] x · 2(15 - x) dx   (note that f(x) = 0 for x < 14 and x > 15)
     = 43/3 ≈ 14.33

Note that the value of the mean is less than the midpoint of the interval [14, 15] of possible values; this follows because the density function is higher on the left end of the interval, indicating that X is more likely to lie close to 14 than close to 15.

In general, E(X) gives the balance point of the density function: if the region between the density curve and the x-axis were cut out of a piece of wood, the location of the mean would be the point on which the piece would balance.

For any function H(X) of X, we define the expected value of H as
E(H(X)) = ∫[-∞, ∞] H(x) f(x) dx

Thus the moments are E(X^k) = ∫[-∞, ∞] x^k f(x) dx

The variance is var(X) = E((X - μ)²) = ∫[-∞, ∞] (x - μ)² f(x) dx

We have the same properties for expectation as before:

1. E(cX) = c E(X) 2. E(X+Y) = E (X) + E (Y) As before, these give an alternate formula for computing the variance: var(X) = E(X2) - E(X)2

ex: For the above "cereal density", = = So var(X) = E(X2) - E(X)2 = 205.5 - (14.33)2 = .0565 = .24 = 205.5 =

Note: the standard deviation measures the expected deviation from the mean, as before; thus it measures the spread of the density function, i.e., how widely spread the values tend to fall from the location of the mean.

The moment generating function is again defined as
mX(t) = E(e^(tX)) = ∫[-∞, ∞] e^(tx) f(x) dx

and is used as before to find the values of the moments by differentiation: E(X^k) = the kth derivative of mX(t), evaluated at t = 0.

4.4 The Normal Distribution

Def: Let X be a continuous random variable. If its density is
f(x) = (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)),   -∞ < x < ∞
then X has the normal distribution with parameters μ, σ.
X is called a normal random variable
the parameters μ and σ are the mean and standard deviation (hard to show directly; use the moment generating function (below))
graph: "bell-shaped curve", with maximum at x = μ and inflection points at x = μ ± σ (use calculus to show this (prob. #43)):

It is true that ∫[-∞, ∞] f(x) dx = 1, but showing this requires some multivariable calculus tricks.

Moment generating function:
mX(t) = e^(μt + σ²t²/2)
(see text for the derivation)

Can use this to find the mean and variance:
E(X) = m'X(0) = μ
Can similarly find the second moment using the second derivative; we get E(X²) = σ² + μ², giving var(X) = E(X²) - E(X)² = σ². Thus the parameters μ, σ in the density function are in fact the mean and s.d.!

ex: IQ scores are assigned in such a way that they are normally distributed with mean 100 and standard deviation 15. Let X be the IQ score of a person selected at random. What's the density function of X?
Since μ = 100 and σ = 15, we get
f(x) = (1/(15√(2π))) e^(-(x-100)²/(2·15²))

The graph is a bell-shaped curve centered at x = 100.

Note: there are lots of normal distributions, one for each pair of values μ, σ
μ determines where the center of the distribution will be
the larger σ is, the broader the distribution

Def: The normal distribution with μ = 0, σ = 1 is called the standard normal distribution; we use the variable Z to denote it. The density function for Z is
f(z) = (1/√(2π)) e^(-z²/2)

Finding probabilities for normal distributions

ex: Consider IQ scores, as above. What's the probability that a randomly selected person will have an IQ score less than or equal to 100? Easy to find using symmetry: P(X <= 100) = area under the density curve to the left of 100 = 1/2

What's the probability that a randomly selected person will have an IQ score between 110 and 130? P(110 <= X <= 130) = area under density curve from 110 to 130

This is hard to find! We can't find an antiderivative to use to evaluate the integral. Could use numerical methods (such as the trapezoid rule or Simpson's rule) to find the value. Instead, use tables for the standard normal distribution

Approach:
tabulate values for the standard normal distribution; see table V, appendix A, p. 637; the table gives values of P(Z <= z) for various values of z
use the standardization theorem to transform a question about a non-standard normal random variable X into one about the standard normal random variable Z:
Theorem If X is normal, with mean μ and s.d. σ, then Z = (X - μ)/σ is a standard normal random variable.
(Proof: find the moment generating function of (X - μ)/σ from the moment generating function of X using our rules for moment generating functions discussed earlier, and see that it is the moment generating function of a standard normal random variable.)

ex: Consider IQ scores, as above. What's the probability that a randomly selected person will have an IQ score of 80 or lower? We want P(X <= 80). Let Z = (X - 100)/15. Then when X = 80, Z = (80 - 100)/15 = -1.33, so
P(X <= 80) = P(Z <= -1.33) = .0918 from the table.
Thus about 9% of people have IQ scores less than 80.
What's the probability that a randomly selected person will have an IQ score between 110 and 130? We want P(110 <= X <= 130);

again, let Z = (X - 100)/15. Then when X = 110, Z = (110 - 100)/15 = .67, and when X = 130, Z = (130 - 100)/15 = 2.00, so
P(110 <= X <= 130) = P(.67 <= Z <= 2.00) = P(Z <= 2.00) - P(Z < .67)
(i.e., the area to the left of 2.00 minus the area to the left of .67)
= .9772 - .7486 from the table = .2286
So about 23% of people have IQ scores between 110 and 130.
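If scipy is available, the same probabilities can be computed without the printed Z table (a sketch; small differences from the table come from rounding z to two decimals):

from scipy.stats import norm

iq = norm(loc=100, scale=15)
print(iq.cdf(80))                    # about 0.091
print(iq.cdf(130) - iq.cdf(110))     # about 0.23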

4.5 Normal Probability Rule & Chebyshev's Inequality

Theorem (Normal Probability Rule) Let X be a normal random variable, with mean μ and standard deviation σ. Then
The probability that X lies within 1 s.d. of the mean is .68
The probability that X lies within 2 s.d. of the mean is .95
The probability that X lies within 3 s.d. of the mean is .997

Useful for quick & dirty estimates
Can be used only with normal random variables
Follows from the fact that: the probability that X lies within 1 s.d. of the mean = P(μ - σ <= X <= μ + σ); using the standard normal random variable to compute this, let Z = (X - μ)/σ;
when X = μ - σ, Z = -1
when X = μ + σ, Z = 1
so
P(μ - σ <= X <= μ + σ) = P(-1 <= Z <= 1) = .68 using the table for Z.
The other two parts follow in the same way.

ex: Consider IQ scores; as discussed in the previous section, these are normally distributed with mean 100, s.d. 15. Then the Normal Probability Rule gives the following information: 68% of people have IQ scores within 1 standard deviation of the mean, i.e., between 85 and 115 95% of people have IQ scores within 2 standard deviations of the mean, i.e., between 70 and 130 99.7% of people have IQ scores within 3 standard deviations of the mean, i.e., between 55 and 145

Chebyshev's Inequality
Chebyshev's inequality gives similar estimates which are applicable to any random variable (not just normal distributions).

Theorem (Chebyshev's Inequality) Let X be a random variable with mean μ and standard deviation σ.


Then P(|X - μ| < kσ) >= 1 - 1/k², i.e., the probability that X lies within k standard deviations of the mean is at least 1 - 1/k². Specific values of k give specific information:
k=1: the probability that X lies within 1 s.d. of the mean is at least 1 - 1/1² = 0 - gives no info!
k=2: the probability that X lies within 2 s.d. of the mean is at least 1 - 1/2² = .75; thus for any distribution, at least 75% of the values will lie within 2 standard deviations of the mean
k=3: the probability that X lies within 3 s.d. of the mean is at least 1 - 1/3² ≈ .89; for any distribution, at least 89% of the values will lie within 3 standard deviations of the mean
Note that these give lower bounds on the probability; for a specific distribution, it is certainly possible that the actual probability that X will lie within 2 standard deviations of the mean is greater than .75 (in fact, if X is normal, then we know from the above that the actual probability that X lies within 2 standard deviations of the mean is .95).
Notes:
these results are consistent with the results for normal random variables; they just give less precise info!
often useful for estimates when we don't know the exact distribution and suspect a normal distribution is not appropriate!
graphically, these say that the area under the density curve from x = μ - 2σ to x = μ + 2σ is at least .75

ex: Suppose that the length of 20 years' worth of baseball games has been investigated, and that it has been found that the average (mean) length of a game is 165 minutes and the standard deviation is 32 minutes. Since we don't know whether or not the distribution of game times is normal, we can't use the normal probability rule to get information about how likely it is that a game will last a particular length of time; however, we can use Chebyshev's inequality:
the probability that a randomly selected game will have a length within 2 standard deviations of the mean is at least .75, i.e., at least 75% of games will last between 165 - 2(32) = 101 minutes and 165 + 2(32) = 229 minutes.
the probability that a randomly selected game will have a length within 3 standard deviations of the mean is at least .89, i.e., at least 89% of games will last between 165 - 3(32) = 69 minutes and 165 + 3(32) = 261 minutes.
This information might be useful if we had to estimate the number of hours that security personnel would be on duty: about 90% of the time we'd expect to have to pay them for between roughly 1 and 4.5 hours of work.
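A tiny sketch of the Chebyshev interval as a helper function, applied to the baseball numbers (the function name is just illustrative):

def chebyshev_interval(mu, sigma, k):
    # at least 1 - 1/k^2 of the probability lies in [mu - k*sigma, mu + k*sigma]
    return (mu - k * sigma, mu + k * sigma, 1 - 1 / k**2)

print(chebyshev_interval(165, 32, 2))   # (101, 229, 0.75)
print(chebyshev_interval(165, 32, 3))   # (69, 261, 0.888...)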

4.6 Normal Approximation to the Binomial Distribution


recall: Binomial distribution: the distribution of a random variable X which counts the number of successes in n independent trials with probability of success p on each trial (coin flips).
X is a discrete random variable
density function: f(x) = C(n, x) p^x (1-p)^(n-x)
mean: μ = np
variance: var(X) = np(1 - p)

Recall that in the discrete case, the density function gives the probability that particular outcomes will occur: f(x) = P(X = x). We can present the density function as a table of values.
ex: binomial distribution with n=5, p=.30; then the density function f(x) is given in the table below:
x      0      1      2      3      4      5
f(x)   .168   .360   .309   .132   .028   .002

Can represent the table as a histogram (bar graph):

(It is customary to center the bars over the values they represent.) Then: probability that particular value x will occur = height of bar over this value: P(X = x) =height of bar. with widths of bars equal to 1, area of each bar = height * width = height; thus P(X = x) = area of bar. Thus can use areas to find probabilities, as with continuous random variables; P(X = x) = P(x - .5 <= X <= x + .5) = area of bar between x - .5 and x + .5 In example above, to compute P(2 <= X <= 3), could approach as follows P(2 <= X <= 3) = P(1.5 <= X <= 3.5) = sum of areas of bars lying between x = 1.5 and x = 3.5 = .309 + .132 = .441

Of course, this gives the same result we'd get just by using the density function table. When n is large, the tops of the rectangles seem to form a smooth curve; if we knew what this curve was, we could use it to find areas & hence probabilities with integrals (instead of summing areas of bars).
Theorem Let X be binomial with parameters n & p. Then for n large, X is approximately normally distributed with mean μ = np and variance σ² = np(1-p).

i.e., the tops of the rectangles in the histogram form approximately a normal curve with the same mean and variance. How large must n be for the approximation to be good? The approximation is good if np(1 - p) > 5.

Application To use this result, compute probabilities for binomial random variables by finding the area under the appropriate normal curve.
ex: In an experiment 80 trees are grown under stressful conditions. Suppose the probability of any one tree surviving is .35; what's the probability that between 15 and 25 trees survive out of the 80? Let X = the number which survive; then X is binomial, with n=80, p=.35; so
mean μ = np = 80(.35) = 28
variance σ² = np(1 - p) = 80(.35)(.65) = 18.2
standard deviation σ ≈ 4.3
Want P(15 <= X <= 25). Glitch: the tables don't go up to n=80. We could compute P(15 <= X <= 25) = f(15) + f(16) + ... + f(25), using the formula for the density function, but this is time-consuming!
Approach: Use a normal distribution to approximate the probability. Let Y be a normal random variable, with mean μ = 28, s.d. σ = 4.3. Then X and Y have approximately the same distribution, in the sense that if we drew the histogram corresponding to X, the tops of the bars would be very closely approximated by the density function for Y.

We want
P(15 <= X <= 25) = area of the bars for x = 15 to x = 25; since the bars of the histogram are centered over their respective values, this is approximately the area under the density curve of Y for x between 14.5 and 25.5 (this is called the half-unit correction):
P(15 <= X <= 25) ≈ P(14.5 <= Y <= 25.5)
To compute the probability for Y, use the usual standard normal random variable technique: Let Z = (Y - 28)/4.3;
when Y = 14.5, Z = (14.5 - 28)/4.3 = -3.14
when Y = 25.5, Z = (25.5 - 28)/4.3 = -.58
so P(14.5 <= Y <= 25.5) = P(-3.14 <= Z <= -.58) = P(Z <= -.58) - P(Z <= -3.14) = .2810 - .0008 = .2802
Thus P(15 <= X <= 25) ≈ .28, or there's a 28% chance that between 15 and 25 trees will survive.
ex: Flip a coin 200 times; what's the probability we get more than 120 heads? Let X = the number of heads that occur in 200 flips; then X is binomial, with n = 200 and p = .5, mean μ = np = 200(.5) = 100, variance σ² = np(1 - p) = 200(.5)(.5) = 50, standard deviation σ ≈ 7.1.
Want: P(X > 120). Calculate using the normal approximation: let Y be normal, with mean 100 and s.d. 7.1; then P(X > 120) ≈ P(Y > 120.5). Use the standard normal r.v. Z to compute the probability for the normal r.v. Y:
Z = (Y - 100)/7.1; when Y = 120.5, Z = 2.89, so
P(Y > 120.5) = P(Z > 2.89) = 1 - P(Z <= 2.89) = 1 - .9981 = .0019
Thus P(X > 120) ≈ .0019, i.e., there's only about a .2% chance that we'll get more than 120 heads in 200 flips of a (fair) coin.
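If scipy is available, the approximation can be compared with the exact binomial answer (a sketch using the half-unit correction described above):

from scipy.stats import binom, norm

n, p = 80, 0.35
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5
approx = norm.cdf(25.5, mu, sigma) - norm.cdf(14.5, mu, sigma)
exact  = binom.cdf(25, n, p) - binom.cdf(14, n, p)
print(approx, exact)    # both roughly 0.28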

Chapter 5 Joint Distributions

5.1 Joint Densities & Independence


Goal: Look at pairs of random variables (X, Y); result of experiment will give pair of values (one value for each) ex: Roll 2 dice; X = value on 1st die, Y = value on 2nd. Then the set of possible results of the experiment are the pairs of values { (1,1), (1,2), (1,3), ..., (1,6), (2,1), ....., (6,6) } Note: we'll consider only the case where the random variables are discrete (not continuous)

Def: Let X, Y be discrete r.v.s. Then their joint probability density function is defined to be
f(x,y) = P(X = x and Y = y)
f(x,y) gives the probability that the outcome of the experiment will be the pair of values (x,y)
we can represent the joint density function as a table
ex: Plants are grown in a greenhouse; suppose the number of stems and the number of blooms on each plant varies, with the number of stems being 1, 2, or 3 and the number of blooms being 0, 1, or 2.
Let X = the number of stems on a randomly selected plant; then the possible values are x = 1, 2, 3
Let Y = the number of blooms on a randomly selected plant; then the possible values are y = 0, 1, 2
Suppose that the joint density function of X and Y is given by the following table (rows are values of x, columns are values of y):
        y=0    y=1    y=2
x=1     .22    .12    .00
x=2     .09    .25    .15
x=3     .01    .07    .09

What's the probability a randomly selected plant will have 2 stems and 1 bloom? We want P(X=2 and Y=1) = f(2,1) = .25
What's the probability a randomly selected plant will have more stems than blooms? We want P(X > Y); thus we need to consider all the pairs (x,y) where x > y, i.e.,
P(X > Y) = P(X=1 and Y=0) + P(X=2 and Y=0) + P(X=2 and Y=1) + P(X=3 and Y=0) + P(X=3 and Y=1) + P(X=3 and Y=2)
= f(1,0) + f(2,0) + f(2,1) + f(3,0) + f(3,1) + f(3,2) = .22 + .09 + .25 + .01 + .07 + .09 = .73
What's the probability a randomly selected plant will have exactly 1 bloom?
We want P(Y = 1); thus need to consider all the pairs (x,y) where y = 1, i.e., P(Y = 1) = P(X=1 and Y=1) + P(X=2 and Y=1) + P(X=3 and Y=1) = f(1,1) + f(2,1) + f(3,1) = .12 + .25 + .07 = .44 (Note that this just amounts to summing the entries in a particular column)
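As a quick check, the joint density can be stored as a small dictionary and the two probabilities computed by summing the appropriate entries. This is our own illustrative Python, not part of the text:

f = {(1, 0): .22, (1, 1): .12, (1, 2): .00,
     (2, 0): .09, (2, 1): .25, (2, 2): .15,
     (3, 0): .01, (3, 1): .07, (3, 2): .09}

p_more_stems = sum(p for (x, y), p in f.items() if x > y)    # P(X > Y)
p_one_bloom  = sum(p for (x, y), p in f.items() if y == 1)   # P(Y = 1)
print(round(p_more_stems, 2), round(p_one_bloom, 2))         # 0.73 0.44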

Properties: Let f(x,y) be the joint density function of discrete random variables X and Y. Then 1. f(x,y) >= 0 for all x, y 2. ∑x ∑y f(x,y) = 1 (summing the joint density over all pairs (x,y) gives 1)

Can consider X or Y alone: Def: The marginal density for X is defined as fX(x) = P(X = x) = ∑y f(x,y), i.e., sum the joint density over all possible values of y.

The marginal density for Y is similarly defined as fY(y) = P(Y = y) = ∑x f(x,y), i.e., sum the joint density over all possible values of x.

thus the marginal densities are obtained by summing either the rows or the columns of the table for the joint density function called marginal densities because the natural place to write them is in the margins of the joint density table (see next example) ex: Consider the plant example above; find the marginal densities for X and Y X: fX(x) = P(X = x), so fX(1) = P(X = 1) = P(X=1 and Y=0) + P(X=1 and Y=1) + P(X=1 and Y=2) = f(1,0) + f(1,1) + f(1,2) = .22 + .12 + 0 = .34 fX(2) = f(2,0) + f(2,1) + f(2,2) = .09 + .25 + .15 = .49 fX(3) = f(3,0) + f(3,1) + f(3,2) = .01 + .07 + .09 = .17
Notice that we're just summing each row of the table! Y: fY(y) = P(Y = y), so fY(0) = P(Y = 0) = P(X=1 and Y=0) + P(X=2 and Y=0) + P(X=3 and Y=0) = f(1,0) + f(2,0) + f(3,0) = .22 + .09 + .01 = .32 fY(1) = f(1,1) + f(2,1) + f(3,1) = .12 + .25 + .07 = .44 fY(2) = f(1,2) + f(2,2) + f(3,2) = 0 + .15 + .09 = .24 Just summing each column of the table!

These can be conveniently written right on the original table, in the margins!
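Continuing the small Python sketch above (f is the joint-density dictionary defined there), the marginal densities are just row and column sums:

fX = {x: sum(p for (xx, y), p in f.items() if xx == x) for x in (1, 2, 3)}
fY = {y: sum(p for (x, yy), p in f.items() if yy == y) for y in (0, 1, 2)}
print(fX)   # approximately {1: .34, 2: .49, 3: .17}
print(fY)   # approximately {0: .32, 1: .44, 2: .24}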

Independent Random Variables


Q: do the values for X and Y depend on one another, i.e., does the value obtained for X influence in any way the value we get for Y? In other words, is the event X = x independent of the event Y = y ? Recall: two events A, B were defined to be independent iff P(A ∩ B) = P(A) P(B)

So, letting A be the event that X = x and B be the event that Y = y, we want to see if A and B are independent: is P(A ∩ B) = P(A) P(B), i.e., is P(X=x and Y=y) = P(X=x) * P(Y=y), i.e., is f(x,y) = fX(x) * fY(y) for all values of x and y? We use this condition as our definition:

Def: discrete r.v.s X and Y are independent iff the joint density is the product of the marginal densities, i.e., iff f(x,y) = fX(x) * fY(y) for all values of x and y

random variables X and Y are independent if the value obtained for X doesn't influence the value obtained for Y, and vice-versa

ex: In the above plant example, the number of stems X and the number of blooms Y are not independent. This follows because the joint density isn't equal to the product of the marginal densities for all values of x and y. For example, for x = 2 and y = 0, f(2,0) = .09 but fX(2) * fY(0) = (.49)(.32) = .1568 ≈ .16 (to 2 decimal places).

ex: Roll 2 dice; let X = value on first die, Y = value on second die. Then X and Y are clearly independent (the value on the first die can't influence the value that appears on the second), and we can use this to find the joint density function from the marginal densities using f(x,y) = fX(x) * fY(y). Since fX(x) = 1/6 for x = 1, 2,..., 6 and fY(y) = 1/6 for y = 1, 2,..., 6, we get f(x,y) = 1/36 x, y = 1, 2,..., 6. But this is exactly what we'd expect: for example, f(3,5) = probability that we get a 3 on the first die and a 5 on the second one, which equals 1/6 times 1/6! Thus we expect f(x,y) = 1/36 for all x, y.

ex: Roll 2 dice; let X = value on first die, Y = sum of values on the two dice. Then X and Y are not independent; the value on the first die definitely influences the value that the sum can take on. To show that the definition isn't satisfied, we need to find a pair of values for x and y for which f(x,y) = fX(x) * fY(y) doesn't hold. Look at the case x = 1 and y = 12, i.e., the case where the value on the first die is 1 and the sum of the values on the two dice is 12. But this clearly can't ever happen: if the value on the first die is 1, the sum could be at most 7. Thus the value of the joint probability function is 0: f(1,12) = 0 ( P(X=1 and Y=12) = 0 ).

Now look at the marginal density functions. fX(1) = 1/6; the probability that we get a 1 on the first die (ignoring the value of the sum) is 1/6. fY(12) = 1/36; the probability that the sum of the two dice is 12 is 1/36, since the only way to get a 12 is to get a 6 on each die, giving 1/6*1/6 = 1/36 for the probability. Thus fX(1) * fY(12) = 1/6 * 1/36 = 1/216

Thus f(1,12) = 0 is not equal to fX(1) * fY(12) = 1/216, so we've shown formally that X and Y aren't independent.
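For the plant example, the check can also be done mechanically: continuing the sketch above (f, fX, fY as defined there), compare every cell of the joint table with the product of the marginals; one failing cell is enough to rule out independence.

independent = all(abs(f[(x, y)] - fX[x] * fY[y]) < 1e-9
                  for x in (1, 2, 3) for y in (0, 1, 2))
print(independent)   # False -- e.g. f(2,0) = .09 while fX(2)*fY(0) is about .157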

5.2 Expectation & Covariance


Def: Let X and Y be discrete random variables with joint density function f(x,y), and let H(X,Y) be any function of X & Y (or either alone). Then the expected value of H is E(H(X,Y)) = ∑x ∑y H(x,y) f(x,y), where the sum is taken over all pairs (x,y).

ex: Consider the plant example from previous section, where X = number of stems on a plant and Y = # of blooms. Then

E(XY) = (1)(0)f(1,0) + (1)(1)f(1,1) + (1)(2)f(1,2) + (2)(0)f(2,0) + (2)(1)f(2,1) + (2)(2)f(2,2) + (3)(0)f(3,0) + (3)(1)f(3,1) + (3)(2)f(3,2) = (0)(.22) + (1)(.12) + (2)(0) + (0)(.09) + (2)(.25) + (4)(.15) + (0)(.01) + (3)(.07) + (6)(.09) = 1.97, so the product of the number of stems and number of blooms averages 1.97.

E(X) = (1)f(1,0) + (1)f(1,1) + (1)f(1,2) + (2)f(2,0) + (2)f(2,1) + (2)f(2,2) + (3)f(3,0) + (3)f(3,1) + (3)f(3,2) = (1)(.22) + (1)(.12) + (1)(0) + (2)(.09) + (2)(.25) + (2)(.15) + (3)(.01) + (3)(.07) + (3)(.09) = 1.83, so there are on average 1.83 stems per plant.

Could have computed E(X) using just the marginal density for X, since X doesn't involve Y: E(X) = (1) fX(1) + (2) fX(2) + (3) fX(3) = (1)(.34) + (2)(.49) + (3)(.17) = 1.83, as before.

Similarly, E(Y) = .92, computed either using the joint density function or more simply using the marginal density for Y. Thus there are on average .92 blooms per plant. E(X+Y) = E(X) + E(Y) = 1.83 + .92 = 2.75, from properties of expectation. However, notice that E(XY) is not equal to E(X) * E(Y): 1.97 is not (1.83)(.92) ≈ 1.68.
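These expected values are straightforward to compute by summing over the joint table. A small sketch (our own code, restating the table from section 5.1):

f = {(1, 0): .22, (1, 1): .12, (1, 2): .00,
     (2, 0): .09, (2, 1): .25, (2, 2): .15,
     (3, 0): .01, (3, 1): .07, (3, 2): .09}

E_XY = sum(x * y * p for (x, y), p in f.items())   # 1.97
E_X  = sum(x * p for (x, y), p in f.items())       # 1.83
E_Y  = sum(y * p for (x, y), p in f.items())       # 0.92
print(E_XY, E_X, E_Y)   # note E(XY) is not E(X)*E(Y) here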

Note: denote E(X) by μX, and E(Y) by μY.

Q: when is it the case that E(XY) = E(X) E(Y)?

Theorem: If X,Y are independent, then E(XY) = E(X) E(Y)

Def: The covariance of X and Y is defined to be cov(X,Y) = E((X - μX)(Y - μY)). What does this measure? Consider: X - μX measures how far X is from the mean of X; it's positive if X is above the mean, negative if X is below the mean. Y - μY measures how far Y is from the mean of Y; positive if Y is above the mean, negative if Y is below the mean. (X - μX)(Y - μY) will be positive if X and Y are both above or both below their means; it will be negative if X is above average and Y is below, or vice-versa. E((X - μX)(Y - μY)) is the average value of the product: it will be positive if above-average values of X tend to occur with above-average values of Y, and negative if above-average values of X tend to occur with below-average values of Y. cov(X,Y) thus measures whether X & Y tend to "vary together".

Computational formula for covariance: cov(X,Y) = E(XY) - E(X)E(Y).

ex: previous plant stuff: cov(X,Y) = E(XY) - E(X)E(Y) = 1.97 - (1.83)(.92) = .2864 since the covariance is positive, X and Y tend to vary together: when X is above average, Y tends to also be above average, i.e., a plant with an above-average number of stems will also tend to have an above-average number of blooms. note: the magnitude of the covariance, .2864, is not directly meaningful - just whether it's positive or negative.

Note: If X & Y are independent, then cov(X,Y) = 0 follows because then E(XY) = E(X) E(Y) makes sense: if X and Y are independent, whether X is above or below average should have no influence on the value of Y; thus X and Y wouldn't tend to vary together Converse not true; just because cov(X,Y) = 0, doesn't mean X, Y are independent!
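Continuing the sketch above, the computational formula gives the covariance in one line:

cov_XY = E_XY - E_X * E_Y
print(round(cov_XY, 4))   # 0.2864 -- positive, so X and Y tend to vary together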

5.3 Correlation Coefficient


Def: Given discrete random variables X, Y, their correlation coefficient is defined as ρ = cov(X,Y) / (σX σY), where σX and σY are the standard deviations of X and Y.

Gives a "normalized" value of covariance; always have -1 <= ρ <= 1.

ρ measures the strength of the linear relationship between X & Y. If the values of X and Y are recorded for a large number of experiments, and the points (X,Y) are plotted (generating a scatter plot), then: if ρ is near 1 or -1, the points (X,Y) will tend to fall near a line; the slope of the line will be positive if ρ is positive, negative if ρ is negative; if ρ is near 0, the points (X,Y) will show no clear linear trend when plotted.

In fact: ρ = 1 or -1 if and only if X and Y are directly linearly related, Y = a + bX for some constants a, b. If X, Y are independent, then ρ = 0 (although the converse is not true); this follows since cov(X,Y) = 0. Note: even if ρ = 0, X and Y may not be independent!! They may be directly related, but by a non-linear relationship!! ex: plant example from previous sections: we found before that cov(X,Y) = .2864, E(X) = 1.83, E(Y) = .92. We also need E(X²) and E(Y²):

E(X²) = (1²) fX(1) + (2²) fX(2) + (3²) fX(3) = (1)(.34) + (4)(.49) + (9)(.17) = 3.83,

E(Y²) = 1.40 (computed in a similar fashion), so var(X) = E(X²) - E(X)² = 3.83 - 1.83² = .4811 and var(Y) = E(Y²) - E(Y)² = 1.40 - .92² = .5536. Thus ρ = cov(X,Y) / sqrt(var(X) var(Y)) = .2864 / sqrt((.4811)(.5536)) ≈ .55. The value of ρ is positive, about halfway between 0 and 1; thus the number of stems and number of blooms will tend to vary together, both being above or below average, but the trend is not a particularly strong one: it won't be the case that every plant with an above-average number of stems will have an above-average number of blooms.
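Continuing the same sketch, the correlation coefficient follows from the variances of the marginals:

from math import sqrt

E_X2 = sum(x * x * p for (x, y), p in f.items())   # 3.83
E_Y2 = sum(y * y * p for (x, y), p in f.items())   # 1.40
var_X, var_Y = E_X2 - E_X ** 2, E_Y2 - E_Y ** 2    # 0.4811, 0.5536
rho = cov_XY / sqrt(var_X * var_Y)
print(round(rho, 2))                               # about 0.55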

6.1 Random Samples


Idea: have some population with an unknown distribution of some quantity, say, heights of students in class; the (unknown) mean, variance, and moments of the quantity are called the population parameters. Note: If the population is finite, and you could measure values for the whole population, you could compute these parameters directly. Suppose we have m individuals in the population, and their heights are x1, x2, ..., xm.

finite population mean: μ = (x1 + x2 + ... + xm)/m; finite population variance: σ² = [(x1 - μ)² + (x2 - μ)² + ... + (xm - μ)²]/m, the average squared deviation from the mean! Glitch: usually it's hard to get values for the whole population (too hard to collect all the information); if the population is infinite, e.g., the distribution is continuous, we clearly can't measure the whole population! Approach: choose a sample of n objects from the population; use the values for the items in the sample to estimate the values of the parameters for the whole population. Pick at random n objects from the population; the value of interest for each depends on which object is selected, i.e., it is a random variable whose distribution is that of the whole population; thus we have n random variables X1, X2, ..., Xn with identical (but unknown) distribution. Assume the values of the random variables are independent, i.e., that the value of one doesn't affect the value of another. Thus we get the following definition: Def: A random sample of size n from a particular distribution is a set of n independent random variables X1, X2, ..., Xn, each of which has this same distribution. When we choose a particular sample, we get an observed value for each of the random variables; denote the observed values for the sample by x1, x2, ..., xn (small x's); use these values to estimate parameters for the population.

6.2 Picturing the Distribution


Given a set of n values x1, x2, ..., xn from a random sample, how can we graphically picture their distribution? Histogram: bar graph for data from random sample divide values into categories (ranges), usually of equal width determine the number of values which lie in each category (frequencies) or the percentage of the total which lie in each category (relative frequencies) histogram is bar graph of frequencies or relative frequencies example: Heights of students; suppose a random sample of 25 students is selected from the student population at large, and their heights are recorded, giving following values: 70, 68, 65, 69, 77, 62, 70, 70, 61, 72, 64, 62, 69, 72, 73, 69, 63, 72, 69, 71, 70, 64, 68, 75, 61 Create a histogram with category widths of 2 inches: the number of values in each category, and the percentages these represent, are given in the following table:

Categories      number   fraction
61-62 inches       4        .16
63-64              3        .12
65-66              1        .04
67-68              2        .08
69-70              8        .32
71-72              4        .16
73-74              1        .04
75-76              1        .04
77-78              1        .04

the percentages in this table can be represented as a bar graph, giving the histogram:

Create a histogram with category widths of 5 inches: the number of values in each category, and the fractions these represent, are given in the following table:

Categories   number   fraction
59-63           5        .20
64-68           5        .20
69-73          13        .52
74-78           2        .08

representing the table as a bar graph, get the histogram:

Note that the appearance of the histogram depends on the number of categories used! Q: how many categories should be used? A: depends; textbook gives some guidelines

Notes: histogram gives an approximation to density function f(x) for the entire population can use to estimate a few probabilities (those involving our categories) example: Use the first histogram above to estimate the probability that a student selected at random will have a height of 65 to 68 inches. Just use the percentages from the bars in the histogram for 65-66 and 67-68 inches; get P(65 <= X <= 68) will equal approximately .04 + .08 = .12
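The tallying can be done mechanically; here is a short sketch (our own code, not from the text) that reproduces the 2-inch-category counts and the probability estimate above:

heights = [70, 68, 65, 69, 77, 62, 70, 70, 61, 72, 64, 62, 69,
           72, 73, 69, 63, 72, 69, 71, 70, 64, 68, 75, 61]
bins = [(lo, lo + 1) for lo in range(61, 78, 2)]   # 61-62, 63-64, ..., 77-78
counts = {(lo, hi): sum(lo <= h <= hi for h in heights) for (lo, hi) in bins}
rel_freq = {b: c / len(heights) for b, c in counts.items()}
print(counts)                                      # {(61, 62): 4, (63, 64): 3, ...}
print(rel_freq[(65, 66)] + rel_freq[(67, 68)])     # about .12, estimate of P(65 <= X <= 68)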

6.3

Sample Statistics

Def: A statistic is a function of the random variables X1, X2, ..., Xn of a random sample used to estimate the values of population parameters; since it's a function of random variables, it's a random variable itself. The most important statistics are: Def: The sample mean is defined as X̄ = (X1 + X2 + ... + Xn)/n.

X̄ is a random variable; its value will vary from sample to sample. Its value for a given sample is just the usual average of the observed values of X1, X2, ..., Xn; it is used to estimate the population mean μ. ex: Heights of students; suppose a random sample of 25 students is selected from the student population at large, and their heights are recorded, giving the following values: 70, 68, 65, 69, 77, 62, 70, 70, 61, 72, 64, 62, 69, 72, 73, 69, 63, 72, 69, 71, 70, 64, 68, 75, 61. Then the value of the sample mean for this sample is x̄ = (70 + 68 + ... + 61)/25 = 1706/25 = 68.24 inches. Note: this value will be computed by SPSS or any other statistics software package.

Def: The sample median X̃ of a random sample is defined as follows: the observed values from the sample are put in increasing order; X̃ is the middle value if n is odd, or halfway between the two middle values if n is even. ex: For the student heights above, if the values are put in increasing order, we get 61, 61, 62, 62, 63, 64, 64, 65, 68, 68, 69, 69, 69, 69, 70, 70, 70, 70, 71, 72, 72, 72, 73, 75, 77. The sample median is given by the middle (13th) value, which is 69, so X̃ = 69.

Def: The sample variance S² is defined as S² = [(X1 - X̄)² + (X2 - X̄)² + ... + (Xn - X̄)²] / (n - 1).

this is (apart from the n - 1 divisor) the average of the squared deviations of the sample values from the sample mean; used to estimate the population variance σ²
the sample standard deviation, S, is the square root of the sample variance. Q: Why use n - 1 instead of n in the formula? A: S² is a random variable used to estimate the value of the population variance σ². Though its value will vary from sample to sample, being sometimes a little greater than σ² and sometimes a little less, we would like its value to be a good estimate of the value of the population variance for most samples. In particular, we wouldn't want it to give high values more often than it gives low values, or vice versa. As such, we would like the expected value of S² to be σ², i.e., the average value of the sample variances from a large number of samples should be the population variance. We'll see that the n - 1 is needed for this to be true; if we used n in the quotient, the values of the sample variance would tend to consistently underestimate the value of the population variance.

ex: For the student heights above, the sample variance and standard deviation are S² = 466.56/24 ≈ 19.44 and S ≈ 4.41 inches.
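Python's standard statistics module computes these directly (variance and stdev use the n - 1 divisor); a small sketch, not part of the text, using the same 25 heights:

from statistics import mean, median, variance, stdev

heights = [70, 68, 65, 69, 77, 62, 70, 70, 61, 72, 64, 62, 69,
           72, 73, 69, 63, 72, 69, 71, 70, 64, 68, 75, 61]
print(mean(heights))       # 68.24
print(median(heights))     # 69
print(variance(heights))   # about 19.44 (sample variance, n - 1 in the denominator)
print(stdev(heights))      # about 4.41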

Chapter 7 Estimation
want to estimate a parameter of a population (the mean, say) using some statistic computed from a sample (the sample mean, say); want the estimators to be good in the sense that, for most samples, the value given by the estimator is close to the true value of the parameter it's estimating.

7.1

Unbiased Estimators and Variability

Def: Let p be the value of some population parameter, and let the statistic p̂, computed from a random sample, be an estimator for p. (Note that p̂ is a random variable: its value depends on the particular sample selected.) Then p̂ is said to be an unbiased estimator of p if E(p̂) = p, i.e., the expected value of the estimator is the value of the population parameter. This is a desirable property for an estimator to possess; it indicates that the values given by the estimator from sample to sample will tend to be centered around the true value of the population parameter, rather than being consistently too high or too low. In addition to having the values given by an estimator centered around the true value of the population parameter it's estimating, we'd also like the values to have a narrow spread, i.e., we'd like them on average not to vary too far on either side of the expected value. To measure this, we'll look at the standard deviation (variance) of the estimator; this will tell us how far on average the values of the estimator will vary from the expected value.

The most important estimators are: the sample mean X̄, estimating the population mean μ; and the sample variance S², estimating the population variance σ². Consider the expected value and standard deviation (variance) of these two estimators.

Sample mean X̄ 1. Expected value (unbiasedness): E(X̄) = E((X1 + X2 + ... + Xn)/n) = (1/n)[E(X1) + E(X2) + ... + E(Xn)] = (1/n)(nμ) = μ.

(The above steps follow from properties of expectation, and from the fact that since each of the
random variables Xi in the sample comes from the population being considered, the expected value of each is μ.) Thus the expected value of X̄ is μ, i.e., the observed values for X̄ over many samples would be centered at the true population mean μ. Note that this doesn't depend on the type of distribution of the population!

2. Variance (variability): Look at the variance of X̄ to see how widely the values of X̄ can be expected to vary from sample to sample: Var(X̄) = Var((X1 + X2 + ... + Xn)/n) = (1/n²)[Var(X1) + Var(X2) + ... + Var(Xn)] = (1/n²)(nσ²) = σ²/n.

Thus the variance of the sample mean is the variance of the population divided by n, the sample size; the values of the sample mean from sample to sample will tend to vary less than the values of single individuals selected from the population. The larger the sample size, the smaller the variance of X̄, and thus the less the observed values of X̄ will tend to vary from the value of μ. Thus larger samples will tend to give more accurate estimates than smaller samples. Note that again these results don't depend on the type of distribution of the population. In fact: as n approaches infinity, the variance of X̄ will approach 0, and thus the probability that the value given by X̄ will differ (by any fixed amount) from μ goes to zero! This is known as the Law of Large Numbers.

Sample Variance 1. Expected value (unbiasedness)

It can be shown that the expected value of S² is σ², i.e., that E(S²) = σ². See the text for the derivation; it is a little gory algebraically, but uses properties of expectation as in the above derivations. However, it's worth noting that it becomes clear in the derivation that we must have n - 1 in the denominator of the expression for S² in order for it to be unbiased.

2. Variance (variability): Alas! The expression for the variance of S² depends on the type of distribution of the population! However, under some general assumptions, it can be shown that the variance of S² decreases as the size of the sample
increases, as was the case for the sample mean. Thus the larger the sample, the more accurate the value given by the sample variance. We'll consider the special case when the population has a normal distribution later!
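A simulation makes both claims concrete: over many samples the sample mean averages out to μ, and S² (with the n - 1 divisor) averages out to σ², while dividing by n is biased low. This is our own illustration, with an arbitrary normal population (μ = 10, σ = 2) chosen for the sketch:

import random

mu, sigma, n, reps = 10.0, 2.0, 5, 100_000
random.seed(1)
xbars, s2_n1, s2_n = [], [], []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    xbars.append(xbar)
    s2_n1.append(ss / (n - 1))   # the usual sample variance
    s2_n.append(ss / n)          # what we'd get dividing by n instead

print(sum(xbars) / reps)   # about 10.0  (E(Xbar) = mu)
print(sum(s2_n1) / reps)   # about 4.0   (E(S^2) = sigma^2 with n - 1)
print(sum(s2_n) / reps)    # about 3.2   (biased low when dividing by n)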

7.3

Distribution of the Sample Mean; Central Limit Theorem

In section 7.1, we looked at the expected value and variance of the sample mean X̄. Now, we'll look at the type of distribution that X̄ has.

We'll use two important results about normal random variables. 1. If X is a normal r.v. with mean μ and variance σ², and c is a constant, then Y = cX is normal, with mean cμ and variance c²σ². 2. If X1 and X2 are independent normal random variables with means μ1 and μ2 and variances σ1² and σ2², then X1 + X2 is normally distributed, with mean μ1 + μ2 and variance σ1² + σ2². -- the info about means and variances is old hat; what's new is the fact that cX & X1 + X2 are normal. To summarize: a constant times a normal r.v. is normal; the sum of normals is normal. The proofs of these assertions use properties of moment generating functions, in particular the following: if X & Y have the same moment generating function, mX(t) = mY(t), then X & Y have the same distribution (density function).

Distribution of the sample mean: From the above two properties it follows that if the independent random variables X1, X2, ..., Xn are a random sample from a normal distribution with mean μ and variance σ², then the sample mean X̄ is normally distributed with mean μ and variance σ²/n. Thus if the original population has a normal distribution, then the sample means from samples of some (fixed) size n will also be normally distributed, with the same mean but a smaller variance (and standard deviation). The density curve for the sample means will thus be bell-shaped, and centered at the same location as the density curve for the population, but will be narrower.

Using the information about the distribution of X̄: ex:

Raising trout in a fish hatchery; suppose lengths (of 2 year olds) are normally distributed, with mean μ = 7.3 inches, standard deviation σ = 1.6 inches. Take samples of size 70, and compute the sample mean for each sample; then the sample means will be normally distributed, with mean 7.3 inches and standard deviation 1.6/sqrt(70) = .19 inches. Graphs of densities: the density graph for the sample means from the various samples will be a normal curve, centered at the same location (7.3 inches) as the population density curve, but will be much narrower: the standard deviation of the sample means is just .19, while that for the population is 1.6. Thus the values of the sample mean for various samples will be centered at the population mean, but will tend to vary much less on either side than individuals from the population. By the normal probability rule, since X̄ is normally distributed, the probability that X̄ will lie within 2 standard deviations of its mean is .95; i.e., the probability that X̄ will lie within .38 units of the population mean μ is .95 (since the standard deviation of X̄ is .19, and the mean of X̄ is the same as the population mean). Thus: For 95% of the samples we draw, X̄ will lie within .38" of the value of μ. This gives us an idea of how reliable the value of X̄ from a sample will be as an estimate of the value of μ: 95% of the time, the population mean μ will be within .38 units of the value found for the sample mean X̄. This is the fundamental idea behind confidence intervals (discussed in the next section).

Central Limit Theorem: Let X1, X2, ..., Xn be a random sample from a population having any distribution (with mean μ, variance σ²), not necessarily a normal distribution; then for large n, the sample mean X̄ will be approximately normally distributed (with mean μ, variance σ²/n). This is surprising: regardless of the distribution of the population, the distribution of the sample mean will be approximately normal for large n!! Thus we can use normal calculations with the sample mean even when the distribution of the population isn't normal. The proof involves looking at moment generating functions.
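A quick simulation sketch (our own, with a uniform(0,1) population, which is certainly not normal) illustrates the theorem: the sample means center at the population mean .5 with standard deviation close to σ/sqrt(n):

import random
from statistics import mean, stdev
from math import sqrt

random.seed(2)
n, reps = 30, 20_000
sample_means = [mean(random.random() for _ in range(n)) for _ in range(reps)]
print(mean(sample_means))       # about 0.5, the population mean
print(stdev(sample_means))      # about 0.0527
print(sqrt(1 / 12) / sqrt(n))   # 0.0527, the predicted sigma / sqrt(n)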

7.4

Confidence Intervals

Idea: Use the value of X̄ from a sample to try to find an interval in which the true population mean μ is likely to lie. Consider: Suppose the original population is normal, with mean μ, standard deviation σ.

From the results in the previous section, X̄ is normally distributed, with mean μ and standard deviation σ/sqrt(n). Since X̄ is normally distributed, we can use the normal probability rule: the probability that X̄ lies within 2 standard deviations of its mean is .95; since the mean of X̄ is μ and its standard deviation is σ/sqrt(n), this says P(|X̄ - μ| < 2σ/sqrt(n)) = .95; i.e., 95% of samples will have X̄ lying within 2σ/sqrt(n) of the population mean μ, or equivalently, for 95% of samples, μ will lie within 2σ/sqrt(n) units of the sample mean X̄, or, for 95% of samples, X̄ - 2σ/sqrt(n) < μ < X̄ + 2σ/sqrt(n). This gives an interval for μ such that for 95% of samples, μ will lie in the interval! Called a 95% confidence interval. Actually, slightly more than 95% of values lie within 2 standard deviations of the mean; to get exactly 95%, we need to use those values that are within 1.96 standard deviations of the mean. Thus a slightly refined 95% confidence interval is X̄ - 1.96σ/sqrt(n) < μ < X̄ + 1.96σ/sqrt(n).

Glitch: to use this, we need to know σ for the population! ex: Have a machine filling bags of popcorn; the weight of the bags is known to be normally distributed, and the machine is such that the mean weight is adjustable, but the s.d. is a built-in tolerance for the machine: σ = .3 oz. Take a sample of 40 bags; the average weight for the sample is X̄ = 14.1 oz. What's a 95% confidence interval for the mean μ? From the above, we know that for 95% of samples, X̄ - 1.96σ/sqrt(n) < μ < X̄ + 1.96σ/sqrt(n), or 14.1 - 1.96(.3)/sqrt(40) < μ < 14.1 + 1.96(.3)/sqrt(40), or 14.1 - .093 < μ < 14.1 + .093.

Thus, assuming ours is one of the 95% of "good" samples, the true value of the population mean μ will lie in the interval 14.007 < μ < 14.193, i.e., roughly 14.01 to 14.19 ounces.
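The interval can be computed with a few lines of code; the helper below is our own sketch (it recovers the 1.96 critical value from the standard normal inverse cdf rather than a table):

from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    # z-based interval for mu when sigma is known
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # e.g. 1.96 for 95%
    half = z * sigma / sqrt(n)
    return xbar - half, xbar + half

print(z_interval(14.1, 0.3, 40))         # about (14.007, 14.193)
print(z_interval(14.1, 0.3, 40, 0.99))   # about (13.978, 14.222)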

Why are we 95% confident? Because we could have gotten a bad sample! In fact, only for 95% of the samples we could choose will the value of the population mean lie in the specified interval; for 1 in 20 samples, the "bad" or nonrepresentative samples, the true mean will lie outside of specified interval, and we'll draw an incorrect conclusion by assuming it is in the specified range!

99% Confidence Interval
Goal: find an interval such that for 99% of samples, the true value of μ will lie in the specified range!
Approach: X̄ is normally distributed, with mean μ and standard deviation σ/sqrt(n); thus Z = (X̄ - μ)/(σ/sqrt(n)) is a standard normal random variable. Let z.005 be the value such that P(Z >= z.005) = .005, i.e., the area under the density curve for Z to the right of z.005 is .005; z.005 is called the upper .005 critical value. Then P(-z.005 <= Z <= z.005) = .99 (i.e., the probability that Z will lie between -z.005 and +z.005 is .99), so

P(-z.005 <= (X̄ - μ)/(σ/sqrt(n)) <= z.005) = .99,

i.e., for 99% of samples the value of (X̄ - μ)/(σ/sqrt(n)) will lie in the range -z.005 to +z.005.

Solving the inequality for μ, we get that for 99% of samples, μ will lie in the range

X̄ - z.005(σ/sqrt(n)) < μ < X̄ + z.005(σ/sqrt(n)).

This is our 99% confidence interval for μ. The .005 critical value can be found from (accurate) tables, and has the value z.005 = 2.576. This gives the interval

X̄ - 2.576(σ/sqrt(n)) < μ < X̄ + 2.576(σ/sqrt(n)).

ex: Popcorn bags: using the data from the example above, we can construct a 99% confidence interval for the mean μ: 14.1 - 2.576(.3)/sqrt(40) < μ < 14.1 + 2.576(.3)/sqrt(40), or 14.1 - .122 < μ < 14.1 + .122,
or 13.98 < μ < 14.22 (approximately). Note: we can never be 100% confident that the true value of the population mean lies in the specified interval! The higher the confidence level, the wider the interval must be! Using the above argument, we can derive confidence intervals for any desired level of confidence; we'd get: A 100(1 - α)% confidence interval for μ (given that σ is known) is given by

X̄ - zα/2(σ/sqrt(n)) < μ < X̄ + zα/2(σ/sqrt(n)),

where zα/2 = the upper α/2 critical value. (Note that the probability associated with the critical value is half that of the "uncertainty" associated with the confidence interval - we use zα/2 for the 100(1 - α)% confidence interval. For example, for the 95% confidence interval (where we are going to draw the wrong conclusion 5% of the time, i.e., the probability of making a mistake is α = .05), the critical value used is z.025!) Usual confidence levels and associated critical values: 90%: z.05 = 1.645; 95%: z.025 = 1.960; 99%: z.005 = 2.576. ex: The 90% confidence interval for μ for the popcorn example would be 14.1 ± 1.645(.3)/sqrt(40), or 14.02 < μ < 14.18 (approximately).

Sample size vs. Accuracy: A 100(1 - α)% confidence interval for μ is X̄ - zα/2(σ/sqrt(n)) < μ < X̄ + zα/2(σ/sqrt(n)).

the width of the interval is 2·zα/2(σ/sqrt(n)); the larger the sample size n, the narrower the interval! Often, we choose the sample size to give a desired accuracy at a specified confidence level. ex: Popcorn example; want an estimate for μ with accuracy of plus or minus .01 ounces at the 95% confidence level; how large of a sample is needed to get this accuracy?
95% confidence; use critical value z.025 = 1.960. The width of the interval will be 2(1.960)(.3)/sqrt(n).

We want the width to be .02 ounces, so set 2(1.960)(.3)/sqrt(n) equal to .02 and solve for n: sqrt(n) = 2(1.960)(.3)/.02 = 58.8, or n = (58.8)² ≈ 3457.4.

Thus we'd need a sample size of almost 3500 bags of popcorn for the 95% confidence interval to give us an accuracy of .01 ounce.
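The sample-size calculation can be wrapped in a small helper; this is our own sketch of solving zα/2·σ/sqrt(n) <= desired half-width for n (the function name and rounding-up convention are our choices):

from math import ceil, sqrt
from statistics import NormalDist

def required_n(sigma, half_width, confidence=0.95):
    # smallest n so the confidence interval is xbar +/- half_width (or tighter)
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return ceil((z * sigma / half_width) ** 2)

print(required_n(0.3, 0.01))   # 3458 bags for +/- .01 oz at 95% confidence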

8.2

Confidence Intervals When σ Is Unknown: the T-Distribution

Goal: find a confidence interval for μ when σ is unknown (the usual case). Approach: Proceed as before, using S (the sample standard deviation) in place of σ. Previously, we used the fact that the distribution of (X̄ - μ)/(σ/sqrt(n)) was standard normal to find a confidence interval; now, look at (X̄ - μ)/(S/sqrt(n)) to derive the interval. Glitch: the distribution of (X̄ - μ)/(S/sqrt(n)) isn't a normal distribution!

Def: Let X̄, S be the sample mean & standard deviation from a sample of size n from a normal population. Then the random variable T = (X̄ - μ)/(S/sqrt(n))

has a distribution called the T distribution with n-1 degrees of freedom

have a family of distributions, one for each degree of freedom graphs of densities:

symmetric & have the general shape of the standard normal density, but are broader; as n increases, the density of Tn approaches that of Z; use info about the distribution to find confidence intervals

95% confidence interval

Look at T = (X̄ - μ)/(S/sqrt(n)); this quantity has the T-distribution with n-1 degrees of freedom. Find an interval such that the middle 95% of T-values lie in the range: let t.025 be the value such that P( T >= t.025) = .025, i.e., the area under the density curve for T to the right of t.025 is .025; t.025 is called the upper .025 critical value

Then P(-t.025 <= T <= t.025) = .95 (i.e., the probability that T will lie between -t.025 and +t.025 is .95), so

P(-t.025 <= (X̄ - μ)/(S/sqrt(n)) <= t.025) = .95,

i.e., for 95% of samples the value of (X̄ - μ)/(S/sqrt(n)) will lie in the range -t.025 to +t.025.

Solving the inequality for μ, we get that for 95% of samples, μ will lie in the range

X̄ - t.025(S/sqrt(n)) < μ < X̄ + t.025(S/sqrt(n)).

This is our 95% confidence interval for μ.

The critical value can be found from (accurate) tables; note that its value depends on n, the number of elements in the sample. Note: the number of degrees of freedom for a sample of size n is n-1!! ex: Sample heights of 40 students; find X̄ = 67.3", S = 3.6". Then the 95% confidence interval is X̄ - t.025(S/sqrt(40)) < μ < X̄ + t.025(S/sqrt(40)).

From tables (p. 732 in text), the .025 critical value for the T distribution with 39 degrees of freedom is t.025 = 2.023; this gives 67.3 - 2.023(3.6)/sqrt(40) < μ < 67.3 + 2.023(3.6)/sqrt(40), or 67.3 - 1.15 < μ < 67.3 + 1.15, so our 95% confidence interval is 66.15 < μ < 68.45 (approximately 66.2" to 68.5").

The same approach can be used to find confidence intervals for other confidence levels; just use the appropriate critical values. Other commonly used ones are 90% and 99% confidence intervals.

ex: 90% & 99% confidence intervals for sample of student heights above 90% confidence interval: use the .05 critical value t.05

from tables, the .05 critical value for the T distribution with 39 degrees of freedom is t.05 = 1.685; this gives 67.3 ± 1.685(3.6)/sqrt(40), or 66.34 < μ < 68.26 (approximately).

99% confidence interval: use the .005 critical value t.005. From tables, the .005 critical value for the T distribution with 39 degrees of freedom is t.005 = 2.708; this gives 67.3 ± 2.708(3.6)/sqrt(40), or 65.76 < μ < 68.84 (approximately). Note: when n is large, the density function for the T distribution with n-1 degrees of freedom approaches that of the standard normal distribution; thus for large n (n > 100), we can use the critical values from the Z distribution as a good approximation to the values for the T distribution. (In fact, most tables will only give critical values for the T distribution for n up to about 100.)
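If SciPy is available, the t critical values (and hence the intervals above) can be computed instead of read from tables; the helper below is our own sketch, under the assumption that SciPy is installed, and the numbers agree with the table values used above:

from math import sqrt
from scipy.stats import t

def t_interval(xbar, s, n, confidence=0.95):
    # t-based interval for mu when sigma is unknown
    crit = t.ppf((1 + confidence) / 2, df=n - 1)
    half = crit * s / sqrt(n)
    return xbar - half, xbar + half

for level in (0.90, 0.95, 0.99):
    print(level, t_interval(67.3, 3.6, 40, level))
# 0.90 -> about (66.34, 68.26), 0.95 -> about (66.15, 68.45), 0.99 -> about (65.76, 68.84)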

ex: Take a new sample of student heights; sample 200 students, and find X̄ = 67.7", S = 4.1"; find a 95% confidence interval for μ using this new data. The interval is X̄ - t.025(S/sqrt(200)) < μ < X̄ + t.025(S/sqrt(200)).

the table of critical values in the text (p. 732) gives critical values only for n up to 100; the next listed value is for n = ∞, which gives the critical values for the Z distribution; these can be used for values of n larger than 100. Thus use t.025 = 1.960, giving 67.7 ± 1.960(4.1)/sqrt(200), or 67.13 < μ < 68.27 (approximately).


8.3

Hypothesis Testing

Idea: have some suspicion as to value of (unknown) population parameter; use a sample to test if hypothesis is true

ex: Let p = the fraction of students at Allentown College who are registered Democratic I think fewer than 40% of students are registered Democratic (p < .40).

Setup: use two complementary hypotheses: H0 = null hypothesis; what we suspect isn't true. H1 = alternate hypothesis; what we suspect is true.

ex: In example above, we'd use null hypothesis H0 : p >= .40 alternate hypothesis H1 : p < .40

Approach: play devil's advocate: assume the null hypothesis is true; take a sample, and compute a test statistic from the sample whose value will (hopefully) refute the assumption that the null hypothesis is true and allow us to reject it. In hypothesis testing, we choose a set of outcomes for the test statistic that will be used to reject the null hypothesis; this is called the critical region or rejection region.

ex: Assume the null hypothesis, that p >= .40, and sample 20 students. Let X = number in sample registered Democratic. What outcomes would suggest that the assumption p >= .40 is false? Well, if p >= .40, we expect 8 or more to be registered Democratic; thus if our sample has fewer than
8 Democrats, the assumption that 40% or more of the students are registered Democratic would seem incorrect. However: even if 40% of the students are Democrats, we certainly won't always get exactly 8 Democrats in every sample of 20 students; we'd expect the number to vary somewhat from sample to sample. For example, it wouldn't be all that unlikely that just due to random chance, we'd get a sample of 20 students in which only 7 are registered Democratic; thus this result wouldn't give strong evidence that there must be fewer than 40% Democrats in the student population. Thus we really only get strong evidence that we should reject the null hypothesis if the number of Democrats in the sample is far less than the expected number of 8. We'll use as our rejection region X <= 4; if p >= .40, it is unlikely we'd get a sample with 4 or fewer Democrats in it. In fact, we can quantify just how unlikely this is using probabilities: If the null hypothesis is true (p >= .40), what's the probability that in a sample of size 20 we would have X <= 4? Suppose p = .40; then the distribution of X will be the binomial distribution with n = 20, p = .40. Then the probability that X <= 4 is .0510, from the table of cumulative probabilities for the binomial distribution. If p > .40, there's an even smaller chance that X <= 4. Thus we conclude that if the null hypothesis is true, there would be only a 5% chance we'd get a sample with 4 or fewer Democrats in it due just to sampling variation. While this could happen, it's quite unlikely, and thus it seems more likely that the null hypothesis is false, and that there are in fact fewer than 40% Democrats in the student population at large.
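The cumulative binomial probabilities quoted in this section can be computed directly rather than read from tables; a standard-library sketch (our own code):

from math import comb

def binom_cdf(k, n, p):
    # P(X <= k) for a binomial(n, p) random variable
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

print(round(binom_cdf(4, 20, 0.40), 4))   # 0.0510, the significance level of the X <= 4 region
print(round(binom_cdf(3, 20, 0.40), 4))   # 0.0160
print(round(binom_cdf(2, 20, 0.40), 4))   # 0.0036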

The value of .0510 is called the significance level α of the test; it's the probability that the test statistic will fall into the rejection region when the null hypothesis is true (causing us to erroneously reject the null hypothesis). Usually we choose a desired value for α, and then find the rejection region corresponding to it.

ex: If we want the significance level of the test of Democratic registration to be .01, what should the rejection region be? Well, assuming the null hypothesis is true, that p = .40 or greater, we can see from the table for the binomial distribution with n = 20 and p = .40 that P(X <= 3) = .0160 and P(X <= 2) = .0036. Thus the rejection region X <= 3 doesn't quite give us the significance level desired; there's a 1.6%

chance we'll erroneously reject the null hypothesis when it is in fact true. The rejection region X <= 2 is a little too strong, since using it there would only be a .3% chance of erroneously rejecting the null hypothesis. We'd have to decide which of the two regions to use (depending on whether a 1.6% chance of erroneously rejecting the null hypothesis is low enough, or if we really want to have this probability below 1%.)

Errors: There are 4 possible outcomes of a hypothesis test: Null hypothesis is true, and we erroneously reject it; called a Type I error. Null hypothesis is false, and we correctly reject it. Null hypothesis is true, and we correctly refuse to reject it. Null hypothesis is false, and we incorrectly refuse to reject it; called a Type II error. Table:

              H0 true            H0 false
reject H0     Type I error       correct decision
accept H0     correct decision   Type II error

Note: for a hypothesis test, we specify the desired level of significance α, and use this to determine the rejection region; this specifies the probability of making a Type I error. This is usually the worse error to make; thus we want to limit the chance that we'll make it. ex: Surgical study: determine if a new surgical technique increases the lifespan of patients suffering from a particular condition. In this case the null and alternate hypotheses would be as follows: H0 : surgery has no effect on longevity; H1 : surgery increases longevity. Because of the dangers associated with surgery, we would definitely not want to erroneously conclude that the surgery is beneficial if it in fact offers no real benefit to the patients. Thus we'd want to choose our rejection region to be such that there would be only a small chance that sampling variation would give us a sample showing improvement if there is in fact no improvement in the population at large.

8.4

Significance Testing

Differs from hypothesis testing in only one way: in both, assume hypothesis true; then: Hypothesis Test: specify a rejection region (for desired significance level) at outset; see if our test statistic falls into this Significance Test: compute the probability the value of our test statistic will be as extreme as (or worse than) the observed value if the null hypothesis were true; called the P-value. Reject the null hypothesis if the P-value is small enough. ex: Democratic example from previous section; suppose that we take our sample of 20 students, and find that X = 5 of them are Democrats. The P-value for this sample is obtained by computing the probability that we'd get a sample with this few Democrats in it, or even fewer, if the null hypothesis were true. Thus we assume that p >= .40 (that 40% or more of students at large are Democrats), and find the probability that X <= 5. Using p = .40, the distribution of X would be a binomial distribution with n = 20 and p = .4; then we find that in this case, P(X <= 5) = .1256. If we used any value of p greater than .40, the probability would be even less than this. Thus we use as our P-value P = .1256: this is the probability that we'd get a sample "this bad, or even worse," just due to sampling variation. Thus there's a 13% chance we'd get a sample with 5 or fewer Democrats in it if in fact the percentage of Democrats is 40%. We now have to decide if this probability is small enough for us to be able to justify rejecting the null hypothesis.

What's really the difference between a hypothesis test and a significance test? In the hypothesis test, we decide on the criterion for rejecting the null hypothesis before the sample is taken; if the results of the sample don't meet our criteria, we don't reject the hypothesis. In a significance test, we sample first, and look at the result, and decide if this is unlikely enough for us to conclude that the null hypothesis is incorrect. Which is better? Both are used, but statisticians prefer the approach used in the hypothesis test, as a way to "keep themselves honest." If we choose a significance level of .01 in designing a hypothesis test, and then compute our rejection region based on this, we will then only reject the null hypothesis if the sample we get is one of the "1% of worst samples" that would occur if the null hypothesis is true. If we get a sample that is close to our rejection region, but not in it, we don't reject! In the significance test, we look at the probability that a sample would be "as bad as or worse than" the one we got; if this probability is low, we reject the null hypothesis. But since we haven't specified a clear-cut criterion for rejection, we can sometimes "talk ourselves into" rejecting the hypothesis. For example, the P-value for a sample might come out to be .02; thus there's only a 2% chance we'd get a sample this bad if the null hypothesis is true, and we might be inclined to reject the hypothesis. However, this sample would not have fallen into the rejection region for the hypothesis test with significance level .01, since its P-value is a little larger than the significance level. Thus we would not have permitted ourselves to reject, having previously decided on our criterion.

Hypothesis and Significance Tests on the Population Mean


Frequently, our hypothesis will concern the value of the mean of a population. ex: Consider the machine filling popcorn boxes discussed in an example in section 7.4, in which the mean fill (in ounces) was adjustable but the standard deviation arose from a built-in tolerance and was known to be σ = .3 ounces. Suppose the machine is supposed to be set so that the mean fill is at least 14.0 ounces of popcorn per box, but we suspect that it has gone out of adjustment and the mean fill is now less than 14 ounces. Our hypotheses would be: H0 : μ = 14.0, H1 : μ < 14.0. Hypothesis Test: Suppose we want to test our hypothesis with a hypothesis test at significance level α = .05 (so that there will only be a 5% chance we'll conclude that the machine is out of adjustment when it is in fact OK). To test, we'll take a sample of 50 boxes, and look at the sample mean X̄; if X̄ is sufficiently less than 14.0, we'll conclude that the machine is out of adjustment and that the mean fill is indeed less than 14.0 ounces. Rejection region: how far below 14.0 should X̄ be for us to reject H0? Consider:

Assuming that the weights of boxes are normally distributed, X̄ will also be normally distributed. (Since n is large (n = 50), X̄ will be approximately normally distributed even if the weights aren't, by the Central Limit Theorem.) Thus Z = (X̄ - μ)/(σ/sqrt(n)) will have the standard normal distribution; using the null hypothesis, that μ = 14.0 ounces, and the known value of σ = .3 ounces, this becomes Z = (X̄ - 14.0)/(.3/sqrt(50)). We want the Z value such that only 5% of Z-values would be below this value just due to random chance; this value is the critical value -z.05 = -1.645. Thus if the null hypothesis is true and μ = 14.0, only 5% of samples will have (X̄ - 14.0)/(.3/sqrt(50)) <= -1.645. Solving for X̄, we find that if H0 is true, for only 5% of samples will X̄ <= 14.0 - 1.645(.3)/sqrt(50) = 13.93, just due to sampling variation. If we get a sample with X̄ at or below this level, it's more likely that the null hypothesis isn't true and that the machine is out of adjustment. This gives us our rejection region; reject H0 if our sample yields X̄ <= 13.93. Suppose we now take our sample of 50 popcorn boxes, and find that X̄ = 13.88 ounces. Then by the above, since the value of X̄ lies in our rejection region, we would reject the null hypothesis that the machine is adjusted properly to give a mean fill of 14.0 ounces, and accept instead the alternate hypothesis that the mean fill is set below this level.

Significance Test: To test the hypothesis via a significance test, we would not bother to figure out a rejection region; we'd just take a sample, look at the value of X̄ obtained, and compute its P-value to see if it's low enough for us to reject the null hypothesis. To compute the P-value for the sample above, with X̄ = 13.88 ounces, we need to find the probability that we'd get a sample with a mean this low or lower just by chance if the null hypothesis is in fact true. Thus we want to find P(X̄ <= 13.88), assuming that the population mean is μ = 14.0. Using the fact that X̄ is normally distributed, we use the Z distribution to compute this probability: P(X̄ <= 13.88) = P(Z <= (13.88 - 14.0)/(.3/sqrt(50))) = P(Z <= -2.83) = .0023.

Thus there's only about a .2% chance we'd get a sample with a mean this low or lower if the null hypothesis were in fact true; since this is quite unlikely, we would reject the null hypothesis.
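Both the hypothesis-test cutoff and the significance-test P-value for this example can be computed with the standard library; the sketch below uses our own variable names:

from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar = 14.0, 0.3, 50, 13.88
se = sigma / sqrt(n)
cutoff = mu0 + NormalDist().inv_cdf(0.05) * se     # about 13.93; reject H0 if xbar <= cutoff
p_value = NormalDist().cdf((xbar - mu0) / se)      # about 0.0023
print(round(cutoff, 2), round(p_value, 4))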
