
Bivariate Statistical Analysis: Measures of Association - Chi-Square, Correlation and Regression

Descriptive Analysis
Descriptive analysis is the elementary transformation of data in a way that describes the basic characteristics such as central tendency, distribution, and variability. For example, consider the business researcher who takes responses from 1,000 American consumers and tabulates their favorite soft drink brand and the price they expect to pay for a six-pack of that product. The mean, median, and mode for favorite soft drink and the average price across all 1,000 consumers would be descriptive statistics that describe central tendency in three different ways. Means, medians, modes, variance, range, and standard deviation typify widely applied descriptive statistics.

Levels of Scale Measurement and Suggested Descriptive Statistics

Tabulation and Cross Tabulation


Tabulation refers to the orderly arrangement of data in a table or other summary format. When this tabulation process is done by hand, the term tallying is used. Counting the different ways respondents answered a question and arranging them in a simple tabular form yields a frequency table. The actual number of responses to each category is a variable's frequency distribution. A simple tabulation of this type is sometimes called a marginal tabulation.

Cross Tabulation
As long as a question deals with only one categorical variable, tabulation is probably the best approach. Cross-tabulation is the appropriate technique for addressing research questions involving relationships among multiple less-than-interval variables (nominal or ordinal). Cross-tabs allow the inspection and comparison of differences among groups based on nominal or ordinal categories. In cross-tabs, the frequency table displays one variable in rows and another in columns. Example: the following cross-tab summarizes several cross-tabulations from responses to a questionnaire on bonuses paid to American International Group (AIG) executives and federal government bailouts in general. Panel A presents results regarding how closely the respondents followed the news stories about AIG executives receiving bonuses from the 2009 federal government bailout money. The cross-tab suggests this may vary with basic demographic variables.

From the results, we can see that more men (60 percent) than women (51 percent) reported that they very closely followed these news reports. Further, it appears that how closely one followed these news stories increases with age (from 41 percent of those aged 18-29 to 68 percent of those over 65). Panel B provides another example of a cross-tabulation table. The question asks whether respondents feel that most of the bailout money is going to those who created the crisis. In this case, we see very little difference between men (68 percent agree) and women (69 percent agree).

However, before reaching any conclusions based on this survey, one must carefully scrutinize this finding for possible extraneous variables.
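
A minimal sketch of how such tables can be produced with pandas (assumed available); the survey responses and category labels below are hypothetical, not the actual AIG data.

```python
# Sketch: frequency table and cross-tabulation with pandas (hypothetical responses).
import pandas as pd

survey = pd.DataFrame({
    "gender":   ["Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"],
    "followed": ["Very closely", "Not closely", "Very closely", "Very closely",
                 "Not closely", "Not closely", "Very closely", "Not closely"],
})

# Simple (marginal) tabulation of one variable
print(survey["followed"].value_counts())

# Cross-tabulation: gender in rows, response in columns, shown as row percentages
crosstab = pd.crosstab(survey["gender"], survey["followed"], normalize="index") * 100
print(crosstab.round(1))
```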

The Chi-Square Test for Goodness of Fit


A chi-square (χ²) test is appropriate for testing hypotheses about frequencies arranged in a frequency or cross-tabulation table. Univariate tests involving nominal or ordinal variables are examined with a χ². More generally, the χ² test is associated with goodness-of-fit (GOF). GOF can be thought of as how well some matrix (table) of numbers matches or fits another matrix of the same size. Most often, the test is between a table of observed frequency counts and another table of expected values (central tendency) for those counts. The actual χ² value is computed using the following formula:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

where χ² = chi-square statistic; Oᵢ = observed frequency in the ith cell; Eᵢ = expected frequency in the ith cell,

with (R − 1)(C − 1) degrees of freedom, where R = number of rows and C = number of columns.
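
As an illustration, the sketch below runs a chi-square test on a small 2 x 2 cross-tab with scipy (assumed available); the observed counts are hypothetical.

```python
# Sketch: chi-square test of independence on a 2x2 cross-tab (hypothetical counts).
import numpy as np
from scipy import stats

# Rows = gender (men, women); columns = followed the news (very closely, not closely)
observed = np.array([[120, 80],
                     [102, 98]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, d.f. = {dof}, p-value = {p_value:.4f}")
print("expected counts:\n", expected.round(1))
```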

Measure of Association
Measure of association is a general term that refers to a number of bivariate statistical techniques used to measure the strength of a relationship between two variables. Correlation analysis is most appropriate for interval or ratio variables. Regression can accommodate either interval or less-than-interval independent variables, but the dependent variable must be continuous. The chi-square (χ²) test provides information about whether two or more less-than-interval variables are interrelated.

Common Procedures for Testing Association

Simple Correlation Coefficient


Correlation is the most popular technique for indicating the relationship of one variable to another. A correlation coefficient is a statistical measure of covariation, or association, between two variables. Covariance is the extent to which a change in one variable corresponds systematically to a change in another. Correlation can be thought of as a standardized covariance. Covariance coefficients retain information about the absolute scale ranges, so the strength of association for scales with different ranges of possible values cannot be compared directly. Researchers find the correlation coefficient useful because they can compare two correlations without regard for the amount of variance exhibited by each variable separately. When correlations estimate relationships between continuous variables, the Pearson product-moment correlation is appropriate.

The correlation coefficient, r, ranges from −1.0 to +1.0.

If the value of r = +1.0, a perfect positive relationship exists. Perhaps the two variables are one and the same!

If the value of r = −1.0, a perfect negative relationship exists. The implication is that one variable is a mirror image of the other: as one goes up, the other goes down in proportion, and vice versa. No correlation is indicated if r = 0. A correlation coefficient indicates both the magnitude of the linear relationship and the direction of that relationship. For example, if we find that r = −0.92, we know we have a very strong inverse relationship; that is, the greater the value measured by variable X, the lower the value measured by variable Y.

Coefficient of Determination (R²)


If we wish to know the proportion of variance in Y that is explained by X (or vice versa), we can calculate the coefficient of determination (R²) by squaring the correlation coefficient: R² = Explained variance / Total variance.

The coefficient of determination, R², measures that part of the total variance of Y that is accounted for by knowing the value of X. If the correlation between unemployment and hours worked is r = −0.635, then R² = 0.403. About 40 percent of the variance in unemployment can be explained by the variance in hours worked, and vice versa. Thus, R-squared really is just r squared!
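
A short sketch of this calculation with scipy; the hours-worked and unemployment figures are made up, so the resulting r will not match −0.635.

```python
# Sketch: Pearson correlation and the coefficient of determination (R^2 = r^2).
from scipy import stats

hours_worked = [40, 38, 35, 30, 28, 25, 22, 20]          # hypothetical data
unemployment = [4.1, 4.5, 5.0, 5.8, 6.2, 6.9, 7.5, 8.0]  # hypothetical data

r, p_value = stats.pearsonr(hours_worked, unemployment)
print(f"r = {r:.3f}, R^2 = {r**2:.3f}, p-value = {p_value:.4f}")
```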

Correlation Matrix
A correlation matrix is the standard form for reporting observed correlations among multiple variables. Each entry represents the bivariate relationship between a pair of variables.

The main diagonal consists of correlations of 1.00. Also remember that correlations should always be considered together with their significance levels (p-values); if a correlation is not significant, it is of little use.
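
A brief sketch of building a correlation matrix with pandas; the variable names and values are hypothetical.

```python
# Sketch: correlation matrix for three variables (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "sales":       [200, 220, 250, 270, 300, 310],
    "advertising": [10, 12, 15, 14, 18, 20],
    "price":       [9.5, 9.4, 9.0, 9.1, 8.8, 8.7],
})

corr_matrix = df.corr()   # Pearson correlations; the main diagonal is 1.00
print(corr_matrix.round(2))
```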

Regression Analysis
Regression analysis is a technique for measuring the linear association between a dependent and an independent variable. Regression is a dependence technique, whereas correlation is an interdependence technique. A dependence technique makes a distinction between dependent and independent variables; an interdependence technique does not make this distinction and is simply concerned with how variables relate to one another.

With simple regression, a dependent (or criterion) variable, Y, is linked to an independent (or predictor) variable, X. Regression analysis attempts to predict the values of a continuous, interval-scaled dependent variable from specific values of the independent variable.

Simple Linear Regression


Suppose there exists one independent variable (x) and one dependent variable (y); then the relationship between y and x can be denoted as y = f(x). This can be a deterministic relationship or a stochastic/statistical/random relationship.

If we now assume that f(x) is linear in x, then f(x) = α + βx.


Thus we can write y = α + βx + u, where α + βx is the deterministic component of y and u is the stochastic or random component. α and β are called regression coefficients, which we estimate from the data on y and x.

If we have n observations on y and x, then we can write the simple linear regression as yᵢ = α + βxᵢ + uᵢ, where i = 1, 2, 3, …, n. The objective is to obtain estimated values for the unknown parameters α and β, given that there are n observations on y and x. In this simple linear regression equation, α represents the Y intercept (where the line crosses the Y-axis) and β is the slope coefficient. The slope is the change in Y associated with a change of one unit in X.
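
A minimal sketch of estimating α and β by ordinary least squares with statsmodels (assumed available); the x and y values are hypothetical.

```python
# Sketch: estimating alpha (intercept) and beta (slope) of a simple linear regression.
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)      # hypothetical predictor
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])   # hypothetical response

X = sm.add_constant(x)        # adds the column of 1s that carries the intercept (alpha)
model = sm.OLS(y, X).fit()

alpha, beta = model.params    # params[0] = intercept estimate, params[1] = slope estimate
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```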

Assumptions of Regression
1) Zero mean: E(uᵢ) = 0 for all i.
2) Common variance: Var(uᵢ) = σ² for all i.
3) Independence: uᵢ and uⱼ are independent for all i ≠ j.
4) Independence of xⱼ: uᵢ and xⱼ are independent for all i and j.
5) Normality: uᵢ are normally distributed for all i.

Assumptions 1, 2, 3 and 5 are in combination written as uᵢ ~ IN(0, σ²), i.e., the errors are independently normally distributed with zero mean and common variance.

Sources of Error Term (u)


1) Unpredictable element of randomness in human response: if y = consumption expenditure of a household and x = disposable income, there is an unpredictable element of randomness in each household's consumption. The household does not behave like a machine; in one month the people in the household may be on a spending spree, while they may be tightfisted the next month.
2) Effect of the large number of variables that have been omitted: disposable income is not the only variable influencing consumption expenditure in the above example. Family size, tastes of the family, and spending habits may also influence consumption expenditure.
3) Measurement error in y: in this example the measurement error is in consumption expenditure, i.e., we cannot measure consumption expenditure accurately.

Parameter Estimate Choices


The estimates for α and β are the key to regression analysis.

In most business research, the estimate of β is the most important. The explanatory power of regression rests with β, because this is where the direction and strength of the relationship between the independent and dependent variable is explained.
An intercept term (α) is sometimes referred to as a constant because α represents a fixed point. Parameter estimates can be presented in either raw or standardized form.

One potential problem with raw parameter estimates is that they reflect the measurement scale range. So, if a simple regression involved distance measured in miles, very small parameter estimates may indicate a strong relationship. In contrast, if the very same distance were measured in centimeters, a very large parameter estimate would be needed to indicate a strong relationship.

Standardized Coefficient
Standardized coefficients are estimated coefficients indicating the strength of a relationship expressed on a standardized scale, where higher absolute values indicate stronger relationships (the range is from −1 to +1). A standardized regression coefficient provides a common metric, allowing regression results to be compared to one another no matter what the original scale range may have been. Raw regression weights (β) have the advantage of retaining the scale metric. Standardized coefficients should be used when the researcher is testing an explanation rather than making a prediction; unstandardized weights are used when the researcher makes predictions.
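
A short sketch, assuming statsmodels and hypothetical data, showing that z-scoring y and x before fitting yields the standardized coefficient while the raw fit retains the original scale metric.

```python
# Sketch: raw vs. standardized regression coefficients (hypothetical data).
import numpy as np
import statsmodels.api as sm

income = np.array([12, 15, 18, 22, 25, 30, 31, 35], dtype=float)  # hypothetical
cards  = np.array([1, 2, 2, 3, 3, 4, 4, 5], dtype=float)          # hypothetical

# z-score both variables (same convention for each)
z_income = (income - income.mean()) / income.std()
z_cards  = (cards - cards.mean()) / cards.std()

raw = sm.OLS(cards, sm.add_constant(income)).fit()
std = sm.OLS(z_cards, sm.add_constant(z_income)).fit()

print("raw slope (cards per unit of income):", round(raw.params[1], 3))
print("standardized slope (beta weight):    ", round(std.params[1], 3))
```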

Multiple Regression
Y = β0 + β1X1 + β2X2 + . . . + βnXn + u
Y = β0 + 0.326X1 + 0.612X2 + u

On average, a one-unit change in income will change the number of credit cards by 0.612 units. Here Y = dependent variable = number of credit cards; β0 = intercept (constant) = the number of credit cards independent of family size and income; β1 = change in the number of credit cards associated with a unit change in family size (regression coefficient); β2 = change in the number of credit cards associated with a unit change in income (regression coefficient); X1 = family size; X2 = income; u = prediction error (residual).
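
A sketch of fitting this kind of model with statsmodels; the family-size, income, and credit-card values are hypothetical, so the estimated coefficients will not reproduce 0.326 and 0.612.

```python
# Sketch: multiple regression of credit cards on family size and income (hypothetical data).
import numpy as np
import statsmodels.api as sm

family_size = np.array([2, 2, 4, 4, 5, 5, 6, 6], dtype=float)
income      = np.array([14, 16, 14, 17, 18, 21, 25, 29], dtype=float)  # in $000s
cards       = np.array([4, 6, 6, 7, 8, 7, 8, 10], dtype=float)

X = sm.add_constant(np.column_stack([family_size, income]))
model = sm.OLS(cards, X).fit()

b0, b1, b2 = model.params
print(f"intercept = {b0:.3f}, family-size coef = {b1:.3f}, income coef = {b2:.3f}")
print(f"R-squared = {model.rsquared:.3f}, adjusted R-squared = {model.rsquared_adj:.3f}")
```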

Sample Size Considerations

Simple regression can be effective with a sample size of 75, but maintaining power at .80 in multiple regression requires a minimum sample of 150, and preferably 200 observations, for most research situations.

The minimum ratio of observations to variables is 10 to 1, but the preferred ratio is 15 or 20 to 1, and this should increase when stepwise estimation is used.

Regression Analysis Terms

Explained variance = R² (coefficient of determination).

Unexplained variance = residuals (error).


Adjusted R-square = reduces the R² by taking into account the sample size and the number of independent variables in the regression model (it becomes smaller as we have fewer observations per independent variable).

Standard Error of the Estimate (SEE) = a measure of the accuracy of the regression predictions. It estimates the variation of the dependent variable values around the regression line. It should get smaller as we add more independent variables, if they predict well. SEE is simply the standard deviation of the Y values about the estimated regression line and is often used as a summary measure of the goodness of fit of the estimated regression line.
Standard error is nothing but the standard deviation of the sampling distribution of the estimator, and the sampling distribution of an estimator is simply a probability or frequency distribution of the estimator, that is, a distribution of the set of values of the estimator obtained from all possible samples of the same size drawn from a given population.

Total Sum of Squares (SST) = the total amount of variation that exists to be explained by the independent variables; SST is the sum of SSE and SSR. Sum of Squared Errors (SSE) = the variance in the dependent variable not accounted for by the regression model (the residual variation); the objective is to obtain the smallest possible sum of squared errors as a measure of prediction accuracy. Sum of Squares Regression (SSR) = the amount of improvement in explanation of the dependent variable attributable to the independent variables. Outliers are observations that have large residual values and can be identified only with respect to a specific regression model.
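
The sketch below, using hypothetical data, computes SST, SSR, SSE, R² and SEE by hand so the definitions above can be seen at work (SST = SSR + SSE).

```python
# Sketch: decomposing total variation and computing R^2 and SEE (hypothetical data).
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.0, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9, 9.2])

model = sm.OLS(y, sm.add_constant(x)).fit()
y_hat = model.fittedvalues

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
sse = np.sum((y - y_hat) ** 2)         # sum of squared errors (residual variation)
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares

r_squared = ssr / sst                  # explained variance / total variance
see = np.sqrt(sse / (len(y) - 2))      # standard error of the estimate (n - 2 d.f. here)

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"R^2 = {r_squared:.3f}, SEE = {see:.3f}")
```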

Least Squares Regression Line

[Figure: the least squares regression line. For each observation, the total deviation of Y from the average of Y splits into the deviation explained by the regression and the deviation not explained by the regression.]

Statistical vs. Practical Significance?


The F statistic is used to determine if the overall regression model is statistically significant.

If the R² is statistically significant, we then evaluate the strength of the linear association between the dependent variable and the several independent variables. A large R² indicates the straight line works well, while a small R² indicates it does not work well.
Even though an R² is statistically significant, it does not mean it is practically significant. We also must ask whether the results are meaningful. For example, is the value of knowing you have explained 4 percent of the variation worth the cost of collecting and analyzing the data?

Violations of Regression Assumptions

Heteroskedasticity
The second assumption of regression is common variance, i.e., Var(uᵢ) = σ² for all i. This assumption is also known as homoskedasticity [equal (homo) spread (skedasticity)].

Violation of this second assumption of regression is called heteroskedasticity, i.e., the errors don't have a constant/ common variance.

In Fig 1, the conditional variance of Yᵢ (which is equal to that of uᵢ), given Xᵢ, remains the same regardless of the values taken by the variable X. In contrast, Fig 2 shows that the conditional variance of Yᵢ increases as X increases. Here, the variances of Yᵢ are not the same; hence, there is heteroskedasticity.

Sources of Heteroskedasticity
Error-learning models: as people learn, their errors of behavior become smaller over time, so σᵢ² is expected to decrease. Growth in income: as incomes grow, people have more disposable income and hence more scope for choice about the disposition of their income; hence, σᵢ² is likely to increase with income.

As data-collecting techniques improve, σᵢ² is likely to decrease.


Heteroskedasticity can also arise as a result of the presence of outliers. An outlier is an observation from a different population than the one generating the remaining sample observations. The inclusion or exclusion of such an observation, especially if the sample size is small, can substantially alter the results of the regression and give rise to heteroskedasticity.

Consequences of Heteroskedasticity
Heteroskedasticity does not result in biased parameter estimates, i.e., beta is not biased. However, the ordinary least squares (OLS) estimates are no longer BLUE (best linear unbiased estimates); that is, among all estimators, OLS does not provide the estimate with the smallest variance, i.e., the standard error of beta is biased. Bias in the standard errors under heteroskedasticity leads to bias in test statistics, so the usual t, F, and χ² tests may no longer be valid.

Graphical Methods of Testing for Heteroskedasticity

In these plots (typically of the residuals, or squared residuals, against an explanatory variable or the fitted values): no systematic pattern between the two variables implies homoskedasticity; a linear pattern between the two variables implies heteroskedasticity; and a quadratic pattern between the two variables likewise implies heteroskedasticity.

Removing Heteroskedasticity
Re-specify the model or transform the variables.

Use robust standard errors: OLS assumes that errors are both independent and identically distributed; robust standard errors relax these assumptions. Alternatively, use weighted least squares, as sketched below.
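
A rough sketch of both remedies with statsmodels, using simulated data whose error variance grows with x; the HC3 robust option and the 1/x² weights are illustrative choices, not the only ones.

```python
# Sketch: robust (HC3) standard errors and weighted least squares under heteroskedasticity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 100)
u = rng.normal(scale=0.5 * x)          # error spread increases with x (heteroskedastic)
y = 2.0 + 0.8 * x + u

X = sm.add_constant(x)
ols    = sm.OLS(y, X).fit()                      # usual OLS standard errors
robust = sm.OLS(y, X).fit(cov_type="HC3")        # same coefficients, robust standard errors
wls    = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weights assume Var(u) proportional to x^2

print("OLS standard errors:   ", ols.bse.round(3))
print("Robust standard errors:", robust.bse.round(3))
print("WLS coefficients:      ", wls.params.round(3))
```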

Autocorrelation
The term autocorrelation may be defined as correlation between members of a series of observations ordered in time [as in time-series data] or space [as in cross-sectional data]. The third assumption of regression implies independence, i.e., uᵢ and uⱼ are independent for all i ≠ j, i.e., E(uᵢuⱼ) = 0 for all i ≠ j.

However, under autocorrelation, E(uᵢuⱼ) ≠ 0 for some i ≠ j.


Difference between autocorrelation and serial correlation: correlation between two series such as u1, u2, ..., u10 and u2, u3, ..., u11, where the former is the latter series lagged by one time period, is autocorrelation, whereas correlation between two different time series such as u1, u2, ..., u10 and v1, v2, ..., v10, where u and v are two different series, is called serial correlation.

Sources of Autocorrelation
Inertia- A salient feature of most economic time series is inertia, or sluggishness. Example- Time series such as GNP, price indexes, production, employment, and unemployment exhibit (business) cycles. Starting at the bottom of the recession, when economic recovery starts, most of these series start moving upward. In this upswing, the value of a series at one point in time is greater than its previous value. There is a momentum built into them, and it continues until something happens (e.g., increase in interest rate or taxes or both) to slow them down. Therefore, in regressions involving time series data, successive observations are likely to be interdependent.

Specification bias (excluded variables case): in empirical analysis the researcher often starts with a plausible regression model that may not be the most perfect one. After the regression analysis, the researcher does the postmortem to find out whether the results accord with a priori expectations. The residuals may suggest that some variables that were originally candidates, but were not included in the model for a variety of reasons, should have been included. This is the case of excluded-variable specification bias. A related source is specification bias due to an incorrect functional form.

Lags: in a time-series regression of consumption expenditure on income, it is not uncommon to find that the consumption expenditure in the current period depends, among other things, on the consumption expenditure of the previous period. Regression models that incorporate such lagged values are called autoregressive models. The rationale for such a model is simple: consumers do not change their consumption habits readily, for psychological, technological, or institutional reasons. If we neglect the lagged term, the resulting error term will reflect a systematic pattern due to the influence of lagged consumption on current consumption.

Data transformation: transformations such as taking lagged values (level form) or first differences (first-difference form) may themselves lead to autocorrelation.

Figures a to d show that there is a discernible pattern among the u's: Figure a shows a cyclical pattern; Figures b and c suggest an upward or downward linear trend in the disturbances; and Figure d indicates that both linear and quadratic trend terms are present in the disturbances. Only Figure e indicates no systematic pattern, supporting the assumption of no autocorrelation.

Consequences of Autocorrelation
The estimated variance (σ̂²) is likely to underestimate the true σ².

We are likely to overestimate R². The usual t and F tests of significance are no longer valid and, if applied, are likely to give seriously misleading conclusions about the statistical significance of the estimated regression coefficients.

Detection of Autocorrelation
The most celebrated test for detecting autocorrelation or serial correlation is popularly known as the Durbin-Watson d statistic.

The d statistic is given as follows:

d = Σₜ₌₂ⁿ (ûₜ − ûₜ₋₁)² / Σₜ₌₁ⁿ ûₜ²

Assuming that the estimated first-order correlation of the residuals is ρ̂ = Σₜ₌₂ⁿ ûₜûₜ₋₁ / Σₜ₌₁ⁿ ûₜ², we can write the d statistic approximately as d ≈ 2(1 − ρ̂).

Also, since −1 ≤ ρ̂ ≤ 1, this implies that 0 ≤ d ≤ 4.


If ρ̂ = 0, d = 2; that is, if there is no serial correlation (of the first order), d is expected to be about 2. Therefore, as a rule of thumb, if d is found to be 2 in an application, one may assume that there is no first-order autocorrelation, either positive or negative. If ρ̂ = +1, indicating perfect positive correlation in the residuals, d ≈ 0; therefore, the closer d is to 0, the greater the evidence of positive serial correlation. If ρ̂ = −1, that is, there is perfect negative correlation among successive residuals, d ≈ 4; hence, the closer d is to 4, the greater the evidence of negative serial correlation.
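
A short sketch computing d with statsmodels on simulated residuals that follow an AR(1) scheme (the value ρ = 0.7 is an arbitrary choice for illustration).

```python
# Sketch: Durbin-Watson d statistic on OLS residuals from autocorrelated errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 100
x = np.arange(n, dtype=float)

# AR(1) disturbances: each error carries over part of the previous one (rho = 0.7)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
d = durbin_watson(resid)
print(f"Durbin-Watson d = {d:.3f}  (values near 2 suggest no first-order autocorrelation)")
```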

Correcting For Autocorrelation


Try to find out if the autocorrelation is pure autocorrelation and not the result of misspecification of the model. If it is pure autocorrelation, one can use appropriate transformation of the original model so that in the transformed model we do not have the problem of autocorrelation.

In large samples, we can use the Newey-West method to obtain standard errors of OLS estimators that are corrected for autocorrelation.

Multicollinearity
The situation where the explanatory variables are highly intercorrelated is referred to as multicollinearity. When the explanatory variables are highly correlated, it becomes difficult to disentangle the separate effects of each of the explanatory variables on the explained variable.

If multicollinearity is perfect, the regression coefficients of the X variables are indeterminate and their standard errors are infinite.
If multicollinearity is less than perfect, the regression coefficients, although determinate, possess large standard errors (in relation to the coefficients themselves), which means the coefficients cannot be estimated with great precision or accuracy.

In Figure a there is no overlap between X2 and X3, and hence no collinearity. In Figures b through e there is a low to high degree of collinearity: the greater the overlap between X2 and X3 (i.e., the larger the shaded area), the higher the degree of collinearity. In the extreme, if X2 and X3 were to overlap completely (or if X2 were completely inside X3, or vice versa), collinearity would be perfect.

Sources of Multicollinearity
The data collection method employed, for example, sampling over a limited range of the values. Constraints on the model or in the population being sampled. For example, in the regression of electricity consumption on income (X2) and house size (X3) there is a physical constraint in the population in that families with higher incomes generally have larger homes than families with lower incomes. Model specification error- adding polynomial terms to a regression model, especially when the range of the X variable is small.

An over-determined model. This happens when the model has more explanatory variables than the number of observations. This could happen in medical research where there may be a small number of patients about whom information is collected on a large number of variables.

Consequences of Multicollinearity
Although BLUE, the OLS estimators have large variances and covariances, making precise estimation difficult. Because of this, the confidence intervals tend to be much wider, leading to acceptance of the zero null hypothesis (i.e., that the true population coefficient is zero) more readily; also, the t ratio of one or more coefficients tends to be statistically insignificant.

Although the t ratio of one or more coefficients is statistically insignificant, R², the overall measure of goodness of fit, can be very high. The OLS estimators and their standard errors can be sensitive to small changes in the data.

Multicollinearity Diagnostic

The Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is inflated by multicollinearity problems. If VIF equals 1, there is essentially no correlation between the independent measures; values somewhat above 1 indicate some association between predictor variables, but generally not enough to cause problems. A maximum acceptable VIF value would be 10; anything higher would indicate a problem with multicollinearity. Tolerance is the amount of variance in an independent variable that is not explained by the other independent variables. If the other variables explain a lot of the variance of a particular independent variable, we have a problem with multicollinearity; thus, small values for tolerance indicate problems of multicollinearity. The minimum cutoff value for tolerance is typically .10; that is, a tolerance value smaller than .10 indicates a problem of multicollinearity. (Tolerance is the reciprocal of VIF.)
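
A sketch of computing VIF and tolerance with statsmodels on simulated predictors, where house size is deliberately constructed to be strongly correlated with income.

```python
# Sketch: VIF and tolerance for each independent variable (simulated predictors).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
income      = rng.normal(50, 10, 200)
house_size  = 20 + 0.9 * income + rng.normal(0, 3, 200)   # strongly related to income
family_size = rng.integers(1, 7, 200).astype(float)       # roughly unrelated to the others

X = sm.add_constant(np.column_stack([income, house_size, family_size]))
names = ["income", "house_size", "family_size"]

for i, name in enumerate(names, start=1):   # column 0 is the constant, so skip it
    vif = variance_inflation_factor(X, i)
    print(f"{name:12s} VIF = {vif:7.2f}   tolerance = {1.0 / vif:.3f}")
```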

Addressing the problem of Multicollinearity


A priori information: it could come from previous empirical work in which the collinearity problem happens to be less serious, or from the relevant theory underlying the field of study. Combining cross-sectional and time-series data. Dropping a variable (or variables), at the risk of introducing specification bias. Transformation of variables. Addition of new data.
