
Bivariate Data

Big Picture

Probability & Statistics

Study Guides

Bivariate data is often represented with scatterplots and line plots. The main reason to display bivariate data in such a way is to find a relationship between the two variables. The relationship is described through the correlation coefficient. Often, transformations on the data must be done so that the correlation coefficient can be used.

Key Terms
Bivariate: Two variables.
Scatterplot: A graph where each point represents a pair of measurements (two variables).
Correlation: The relationship between bivariate data.
Correlation Coefficient: A number that describes the correlation (relation) between bivariate data.
Linear Regression: Using data to calculate a line that best fits that data. The line can be used to make predictions.
Residual: The distance between the observed value and the expected value.

Displaying Bivariate Data

Bivariate data is primarily examined to show some sort of relationship between two variables. Bivariate data usually has an independent variable and a dependent variable. The independent variable influences the dependent variable. Time is often the independent variable. We are often interested in the change a variable exhibits over time. We can see if there is any relationship between the variables by showing the data in a scatterplot. You may often see scatterplots as a series of disconnected points. They are a good way of representing bivariate data. The independent variable is on the x-axis while the dependent variable is on the y-axis. Line plots are also used to show change over time. A line plot is basically a scatterplot where the dots are connected chronologically (in order by time).

Three important characteristics of bivariate data:

shape (linear, exponential, etc.)
direction
strength

We are usually most interested in finding whether there is any correlation in the data. The correlation describes the direction of the relationship. One way to visualize the correlation is with a scatterplot. We can describe the correlation as:

positive correlation: positive slope

negative correlation: negative slope

perfect correlation: points on a scatterplot lie on a straight line - can be positive or negative

zero correlation: points do not have a linear trend

Disclaimer: this study guide was not created to replace your textbook and is for classroom or individual use only.

The more linear the data is, the stronger the linear correlation. Another way to view the strength of this correlation is to draw an ellipse (oval) around all of the data. The narrower or skinnier the ellipse is, the stronger the linear correlation.

This guide was created by Lizhi Fan and Jin Yu. To learn more about the student authors, visit http://www.ck12.org/about/about-us/team/interns.

Correlation Coefficient

The correlation coefficient (r) can be used to express correlation.

It can take values between -1 and +1.
The sign indicates a negative (-) or positive (+) correlation.
The closer the absolute value of the coefficient (|r|) is to 1, the stronger the relationship.
A perfect negative correlation is -1; a perfect positive correlation is +1.
It only describes linear relationships, so r is not a reliable measure of a nonlinear relationship.
Two variables can still have a strong (nonlinear) relationship even if the correlation coefficient is low.

One statistic that measures the strength and direction of a linear correlation is the Pearson product-moment correlation coefficient. To calculate the correlation r of two variables X and Y, use the formula:

$$r = \frac{\sum z_X z_Y}{n - 1}$$

where z is the z-score (computed with the sample standard deviation) and n is the sample size. If we have the raw scores and not the standardized scores, we can use this formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

Transformations to Achieve Linearity

Curvilinear relationships are nonlinear relationships. Just because they are nonlinear does not mean they don't have a strong correlation. However, the correlation coefficient r by itself cannot tell us about the strength of a nonlinear relationship. We can often transform the data points to make a nonlinear relationship linear, and then use the correlation coefficient to describe the strength of the relationship. For example, if we were dealing with a power relationship:

$$y = ax^b$$

By taking the log of both sides, we can change the data to become a linear relationship:

$$\log y = \log(ax^b)$$
$$\log y = \log a + \log x^b$$
$$\log y = \log a + b \log x$$

We can define two new variables:

$$Y = \log y, \qquad X = \log x$$

The new relationship is $Y = \log a + bX$. Since $\log a$ is a constant, we have transformed the power relationship into a linear one, with slope $b$ and intercept $\log a$. After doing this, we can describe the relationship with a correlation coefficient.

Correlation only describes linearity. It does not tell us if one variable caused the other.
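
As a rough illustration of the correlation coefficient and of a log transformation, the sketch below computes Pearson's r with the standard raw-score form of the formula, first on made-up data that follow a curve exactly and then on the logs of the same data. The function name `pearson_r` and the data set are assumptions chosen for this example; they do not come from the guide.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from raw scores (hypothetical helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up data that follow the curve y = 2 * x**3 exactly (an assumption).
xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x ** 3 for x in xs]

# r on the raw data: the relationship is perfect but not linear, so r < 1.
r_raw = pearson_r(xs, ys)

# r after taking the log of both variables: the points now lie on a line.
log_xs = [math.log10(x) for x in xs]
log_ys = [math.log10(y) for y in ys]
r_log = pearson_r(log_xs, log_ys)

print(f"r on raw data:     {r_raw:.4f}")
print(f"r on log-log data: {r_log:.4f}")
```

On the raw data r falls short of 1 even though y is completely determined by x; after the transformation r is 1 up to rounding, because log y is a linear function of log x.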

Least-Squares Regression Line

Linear regression is a mathematical way to determine the best fit line through a set of data.

The least-squares regression line (also known as a linear regression line) is the line that minimizes the sum of the squared vertical distances from the data points to the line. Each of these distances is known as a residual.

Residual = Observed - Expected

Generally, the smaller the residuals, the better the least-squares regression line fits the data. If all the residuals of the least-squares line were added together, the sum would be zero.

The regression line is a straight line that represents the change in one variable associated with a change in the other. It is often used to predict values of future data points. This is done simply by substituting a value of the predictor variable (X) into the equation to find the outcome variable (Y). (The predictor variable predicts the outcome.)

The regression line is a straight line with the form: Y = bX + a

Y is what we are trying to predict
b is the slope of the line (regression coefficient)
a is the value of Y when X = 0 (regression constant)
X is the predictor variable

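To calculate the line, we need to find b and a:

$$b = r\,\frac{s_Y}{s_X}, \qquad a = \bar{Y} - b\bar{X}$$

r is the correlation between X and Y
sY is the standard deviation of Y
sX is the standard deviation of X
$\bar{X}$ and $\bar{Y}$ are the means of X and Y

Plotting Residuals and Testing for Linearity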
We can plot the residuals by plotting the x-value for each data pair on the x-axis and the residual on the y-axis.

If the relationship is linear and there are no outliers, the residual plot appears to have no correlation: the points scatter randomly around zero with no pattern. If the residual plot has an obvious pattern, you may want to try other models of the data, such as power or exponential functions, to see if they are a better fit.
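
The regression line and its residuals can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the function name `least_squares_line` and the data are made up, and the slope and intercept use the standard relations b = r * sY / sX and a = mean(Y) - b * mean(X).

```python
import statistics

def least_squares_line(xs, ys):
    """Return (b, a) for the line Y = bX + a (hypothetical helper)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    # Pearson r as the sum of z-score products divided by n - 1.
    r = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x          # slope (regression coefficient)
    a = mean_y - b * mean_x    # intercept (regression constant)
    return b, a

# Made-up data, roughly linear (an assumption for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b, a = least_squares_line(xs, ys)

# Residual = Observed - Expected, one per data pair.
residuals = [y - (b * x + a) for x, y in zip(xs, ys)]

print(f"Y = {b:.2f}X + {a:.2f}")
for x, res in zip(xs, residuals):
    print(f"x = {x}: residual = {res:+.2f}")
print(f"sum of residuals = {sum(residuals):.10f}")
```

Plotting each x against its residual gives the residual plot; here the residuals alternate in sign with no obvious trend, and their sum is zero up to rounding.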

Hypothesis Testing


The least-squares line y = a + bx is for samples. To describe the line for the entire population, we use y = α + βx, where β is the population regression coefficient. CAREFUL: Here α and β are not the level of significance and the probability of a Type II error used elsewhere in hypothesis testing.

Make sure that the set of data is from a random sample.
Make sure the y values have a normal distribution.

If these are true, we can use hypothesis testing. The null hypothesis (H0) is that the population regression coefficient β equals some given number.

The alternative hypothesis (Ha) is that β does NOT equal the given number (≠, or > or <).
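Use the test statistic

$$t = \frac{b - \beta_0}{s_b}, \qquad s_b = \frac{\sqrt{SSE/(n-2)}}{\sqrt{\sum (x - \bar{x})^2}}$$

with n - 2 degrees of freedom, where

b = slope of the sample least-squares line
β0 = value of the regression coefficient stated in the null hypothesis
SSE = sum of the squared residuals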
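
A test statistic for the slope can be sketched as follows. This example assumes the usual t statistic for a regression slope, t = (b - β0) / s_b with n - 2 degrees of freedom; the function name `slope_t_statistic`, the parameter `beta_0`, and the data are made up for illustration.

```python
import math

def slope_t_statistic(xs, ys, beta_0=0.0):
    """Return (t, degrees of freedom) for H0: slope = beta_0 (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    # Least-squares slope and intercept of the sample line y = a + bx.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / ss_x
    a = mean_y - b * mean_x
    # SSE: sum of the squared residuals.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope.
    s_b = math.sqrt(sse / (n - 2)) / math.sqrt(ss_x)
    t = (b - beta_0) / s_b
    return t, n - 2

# Made-up sample (an assumption for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

t, df = slope_t_statistic(xs, ys, beta_0=0.0)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The resulting t value would then be compared against a t distribution with n - 2 degrees of freedom, using a table or statistical software, to decide whether to reject the null hypothesis.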

