
May 31, 2011

© All Rights Reserved

Finding a model that fits a set of data is one of the most common goals in data analysis. Least

squares regression is the most commonly used tool for achieving this goal. It’s a relatively

simple concept, it’s easy to do, and there’s a lot of readily available software to do the

calculations. It’s even taught in many Statistics 101 courses. Everybody uses it … and therein

lies the problem. Even if there is no intention to mislead anyone, it does happen.

Here are some of the most common reasons to doubt a regression model.

1. Not Enough Samples

Accuracy is a critical component for evaluating a model. The coefficient of determination, also known as R-squared or R², is the most often cited measure of accuracy. Now obviously, the more accurate a model is the better, so data analysts look for large values of R-squared.

R-squared is designed to estimate the maximum relationship between the dependent and

independent variables based on a set of samples (cases, observations, records, or whatever you

want to call them). If there aren’t enough samples compared to the number of independent

variables in the model, the estimate of R-squared will be especially unstable. The effect is

greatest when the R-squared value is small, the number of samples is small, and the number of

independent variables is large, as shown in this figure.

The inflation in the value of R-squared can be adjusted by calculating the shrunken R-squared. The figure shows that for an R-squared value above 0.8 with 30 cases per variable, there isn’t much shrinkage. Lower estimates of R-squared, however, experience considerable shrinkage.
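As a rough illustration of shrinkage, here is a minimal sketch that assumes the shrunken R-squared is the commonly used adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of samples and k is the number of independent variables. The article does not say which shrinkage formula was used for the figure, so the formula and the function name `adjusted_r_squared` are assumptions.

```python
def adjusted_r_squared(r_squared: float, n_samples: int, n_predictors: int) -> float:
    """Shrink R-squared for the number of samples and independent variables.

    Assumed formula (the usual adjusted R-squared), not necessarily the one
    behind the article's figure.
    """
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more samples than independent variables plus one")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


# A high R-squared with 30 cases per variable barely shrinks ...
high = adjusted_r_squared(0.8, n_samples=30, n_predictors=1)
# ... while a low R-squared with few cases per variable shrinks below zero.
low = adjusted_r_squared(0.2, n_samples=20, n_predictors=5)
print(round(high, 3), round(low, 3))  # prints 0.793 -0.086
```

A negative result simply means the model explains less than would be expected by chance with that many variables and that few samples.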

You can’t control the magnitude of the relationship between a dependent variable and a set of

independent variables, and often, you won’t have total control of the number of samples and

variables either. So, you have to be aware that R-squared will be overestimated and treat your

regression models with some skepticism.

2. No Intercept

Almost all software that performs regression analysis provides an option to not include an intercept term in the model. This sounds convenient, especially for relationships that presume a one-to-one relationship between the dependent and independent variables. But when an intercept is excluded from the model, it’s not omitted from the analysis; it is set to zero. Look at any regression model with “no intercept” and you’ll see that the regression line goes through the origin of the axes.

With the regression line nailed down on one end at the origin, you might expect that the value of

R-squared would be diminished because the line wouldn’t necessarily travel through the data in a

way that minimizes the differences between the data points and the regression line, called the

errors or residuals. Instead, R-squared is artificially inflated because when the correction

provided by the intercept is removed, the total variation in the model increases. The ratio of the

variability attributable to the model compared to the total variability increases, hence the increase

in R-squared.

The solution is simple. Always have an intercept term in the model unless there is a compelling theoretical reason not to include it. In that case, don’t put all your trust in R-squared (or the F-tests).
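The inflation can be seen in a small sketch. It assumes the software computes R-squared for a no-intercept model against the uncentered total sum of squares (the sum of y²), which many packages do; the data values and function names are made up for illustration.

```python
def fit_with_intercept(x, y):
    """Ordinary least squares line; R-squared uses the centered total variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst, sse


def fit_through_origin(x, y):
    """Line forced through the origin; R-squared uses the uncentered total (assumption)."""
    slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    sse = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst_uncentered = sum(yi * yi for yi in y)
    return 1.0 - sse / sst_uncentered, sse


# Hypothetical data: y sits near 50 and rises only slowly with x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [51.5, 50.0, 53.5, 50.0, 54.0, 51.5, 54.0, 53.5, 55.5, 54.0]

r2_intercept, sse_intercept = fit_with_intercept(x, y)
r2_origin, sse_origin = fit_through_origin(x, y)
print(f"with intercept: R2={r2_intercept:.2f}, SSE={sse_intercept:.1f}")
print(f"through origin: R2={r2_origin:.2f}, SSE={sse_origin:.1f}")
```

On these numbers the through-origin line has far larger residuals yet reports the larger R-squared, because the total variation in the denominator grew even more.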

3. Stepwise Regression

Stepwise regression is a data analyst’s dream. Throw all the variables into a hopper, grab a cup

of coffee, and the silicon chips will tell you which variables to use to get the best model. That

irritates hard-core statisticians who don’t like amateurs messing around with numbers. You can

bet, though, that at least some of them go home at night, throw all the food in their cupboard into

a crock pot, and expect to get a meal out of it.

The cause of some statisticians’ consternation is that stepwise regression will select the variables that are best for the dataset, but not necessarily the population. Model test probabilities are optimistic because they don’t account for the ability of the stepwise procedure to capitalize on chance. Moreover, adding new variables will never decrease R-squared and almost always increases it, so you have to have some good ways to decide how many variables is too many. There are ways to do this. So using stepwise regression alone isn’t a fatal flaw. Like with guns, drugs, and fast food, you have to be careful how you use it.

If you use stepwise regression, be sure to look at those diagnostic statistics. Also, verify your

results using a different data set either by splitting the original data set before you do any

analysis, by extracting observations randomly from the original data set to create new data sets,

or by collecting new samples.
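Splitting the original data set before doing any analysis can be sketched in a few lines. This is a minimal example, not a prescribed procedure; the function name `split_for_validation`, the 30% holdout fraction, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def split_for_validation(rows, holdout_fraction=0.3, seed=0):
    """Randomly split rows into a model-building set and a holdout set."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical data set of 100 observations, identified here by row number.
rows = list(range(100))
build_rows, holdout_rows = split_for_validation(rows)
print(len(build_rows), len(holdout_rows))  # prints 70 30
```

The stepwise procedure would then be run on `build_rows` only, and the selected model checked against `holdout_rows`, which it has never seen.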

4. Outliers

Outliers are a special irritant for data analysts. They’re not really that tough to identify but they

do cause a variety of problems that data analysts have to deal with. The first problem is

convincing reviewers not familiar with the data that the outliers are in fact outliers. Second, they

have to convince all reviewers that what they want to do with the outliers, delete or include or

whatever, is the appropriate thing to do. But one way or another, outliers will wreak havoc with

R-squared.

Consider this figure, which comes from an analysis of slug tests to estimate the hydraulic conductivity of an aquifer. The red circles show the relationship between rising-head and falling-head slug tests performed on groundwater monitoring wells. The model for this relationship has an R-squared of 0.90. The blue diamond is an outlier along the trend (same regression equation) about 60% greater than the next highest value. The R-squared of this equation is 0.95. The green square is an outlier perpendicular to the trend. The R-squared of this equation is 0.42. Those are fairly sizable differences to have been caused by a single data point.
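The same effect can be reproduced with made-up numbers. The sketch below does not use the slug-test data from the figure; the ten base points and the two added outliers are hypothetical, chosen only to show how one point along the trend raises R-squared and one point off the trend lowers it.

```python
def r_squared(x, y):
    """R-squared of a straight-line fit (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy * sxy / (sxx * syy)


# Hypothetical base data: y follows x with some scatter.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scatter = [0.8, -0.9, 1.1, -1.0, 0.7, -0.6, 0.9, -1.2, 1.0, -0.8]
y = [xi + si for xi, si in zip(x, scatter)]

r2_base = r_squared(x, y)
# One extra point on the trend, 60% beyond the largest x value.
r2_along = r_squared(x + [16], y + [16.0])
# One extra point far off the trend.
r2_off = r_squared(x + [2], y + [12.0])

print(f"base {r2_base:.2f}, along trend {r2_along:.2f}, off trend {r2_off:.2f}")
```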

How should you deal with outliers? I usually delete them because I’m usually looking to model

trends and other patterns. But outliers are great thought provokers. Sometimes they tell you

things the patterns don’t. If you’re not comfortable deciding what to do with an outlier, run the

analysis both with and without outliers, a time consuming and expensive approach. The other

approach would be to get the reviewer, an interested stakeholder, or an independent expert

involved in the decision. That approach is time consuming and expensive too. Pick your poison.

5. Non-linear relationships

Linear regression assumes that the relationship between a dependent variable and a set of independent variables is additive, or linear. If the relationship is actually nonlinear, the R-squared for the linear model will be lower than it would be for a better fitting nonlinear model.

This figure shows the relationship between the number of employed individuals and the number

of individuals not in the U.S. work force between 1980 and 2009. The linear model has a

respectable R-squared value of 0.84, but the polynomial model fits the data much better with an

R-squared value of 0.95.

Non-linear relationships are a relatively simple problem to fix, or at least acknowledge, once you

know what to look for. Graph your data and go from there.
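A quick numerical check can accompany the graph. The sketch below fits a straight line and a second-degree polynomial to the same hypothetical curved data (not the employment data in the figure) and compares the R-squared values; it assumes NumPy is available.

```python
import numpy as np


def polynomial_r_squared(x, y, degree):
    """R-squared of a least squares polynomial fit of the given degree."""
    coefficients = np.polyfit(x, y, degree)
    fitted = np.polyval(coefficients, x)
    ss_residual = float(np.sum((y - fitted) ** 2))
    ss_total = float(np.sum((y - np.mean(y)) ** 2))
    return 1.0 - ss_residual / ss_total


# Hypothetical data that follow a curve rather than a straight line.
x = np.arange(0.0, 11.0)
y = 2.0 + 0.5 * x**2

r2_linear = polynomial_r_squared(x, y, 1)
r2_quadratic = polynomial_r_squared(x, y, 2)
print(f"linear {r2_linear:.3f}, quadratic {r2_quadratic:.3f}")
```

A higher-degree polynomial can never fit the same data worse than a lower-degree one, so a small gain in R-squared is not by itself evidence of a nonlinear relationship; the plot should show the curvature too.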

6. Overfitting

Overfitting involves building a statistical model solely by optimizing statistical parameters, and

usually involves using a large number of variables and transformations of the variables. The

resulting model may fit the data almost perfectly but will produce erroneous results when applied

to another sample from the population.

The concern about overfitting may be somewhat overstated. Overfitting is like becoming too

muscular from weight training. It doesn’t happen suddenly or simply. It’s not something that

happens in a keystroke. It takes a lot of work fine tuning variables and what not. If you know

what overfitting is, you’re not likely to become a victim. It’s also usually easy to identify

overfitting in other people’s models. Simply look for a conglomeration of manual numerical

adjustments, mathematical functions, and variable combinations.

7. Misspecification

Misspecification involves including terms in a model that make the model look great statistically

even though the model is problematical. Often, misspecification involves placing the same or

very similar variable on both sides of the equation.

Consider this example from economics. A model for the U.S. Gross Domestic Product (GDP)

was developed using data on government spending and unemployment from 1947 to 1997. The

model:

GDP = (121*Spending) - (3.5*Spending²) + (136*Time) - (61*Unemployment) - 566

had an R-squared value of 0.9994. Such a high R-squared value is a signal that something is amiss. R-squared values that high are usually only seen in models involving equipment calibration, and certainly not anything involving capricious human behavior. A closer look at the study indicated that the model terms involving spending were an index of the government’s outlays relative to the economy. Usually, indexing a variable to a baseline or standard is a good thing to do. In this case, though, the spending index was the ratio of government outlays to GDP. Thus, the model was:

GDP = (121*Outlays/GDP) - (3.5*(Outlays/GDP)²) + (136*Time) - (61*Unemployment) - 566

GDP appears on both sides of the equation, thus accounting for the near perfect correlation. This

is a case in which an index, at least one involving the dependent variable, should not have been

used.
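A small simulation shows how much an index of this kind can manufacture on its own. This is only a sketch with randomly generated numbers, not the GDP study’s data: the variables `gdp` and `outlays` are hypothetical and deliberately unrelated to each other, and the helper `r_squared` is an assumed name. It requires NumPy.

```python
import numpy as np


def r_squared(predictors, y):
    """R-squared of a least squares fit of y on the predictors plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefficients
    return 1.0 - float(residuals @ residuals) / float(np.sum((y - y.mean()) ** 2))


rng = np.random.default_rng(0)
gdp = rng.uniform(100.0, 1000.0, size=200)     # hypothetical dependent variable
outlays = rng.uniform(90.0, 110.0, size=200)   # generated independently of gdp

# Raw outlays carry no information about gdp ...
r2_raw = r_squared([outlays, outlays**2], gdp)
# ... but dividing by gdp puts the dependent variable on both sides.
index = outlays / gdp
r2_index = r_squared([index, index**2], gdp)

print(f"raw outlays R2={r2_raw:.2f}, outlays/gdp index R2={r2_index:.2f}")
```

The index model looks far better than the raw model even though the underlying predictor is pure noise, which is the pattern to watch for when a dependent variable is hidden inside an independent variable.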

Another misspecification involves creating a prediction model having independent variables that

are more difficult, time consuming, or expensive to generate than the dependent variable. You

might as well just measure the dependent variable when you need to know its value. Similarly

with forecasting (prediction of the future) models, if you need to forecast something a year in

advance, don’t use predictors that are measured less than a year in advance.

8. Multicollinearity

Multicollinearity occurs when a model has two or more independent variables that are highly correlated with each other. The consequences are that the model will look fine, but predictions from the model will be erratic. It’s like a football team. The players perform well together but you can’t necessarily tell how good individual players are. The team wins, yet in some situations, the cornerback or offensive tackle will get beaten on almost every play.

If you ever tried to use independent variables that add to a constant, you’ve seen

multicollinearity in action. In the case of perfect correlations, such as these, statistical software

will crash because it won’t be able to perform the matrix mathemagics of regression. Most

instances of multicollinearity involve weaker correlations that allow statistical software to

function, yet the predictions of the model will still be erratic.

Multicollinearity occurs often in the social sciences and other fields of study in which many variables are measured in the process of model building. Diagnosis of the problem is simple if you have access to the data. Look at correlations between the independent variables. You can also look at the variance inflation factors (VIFs). The VIF for an independent variable is the reciprocal of one minus the R-squared value from a regression of that variable on all the other independent variables. VIFs are measures of how much the variances of the model’s coefficients are inflated because of multicollinearity. The VIF for a variable should be less than 10 and ideally near 1.
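A VIF can be computed directly from its definition, 1/(1 - R²), where R² comes from regressing one independent variable on all the others. The sketch below is one way to do that with NumPy; the function name and the simulated variables are assumptions for illustration.

```python
import numpy as np


def variance_inflation_factors(predictors):
    """VIF for each column: 1 / (1 - R2) from regressing it on the other columns."""
    predictors = np.asarray(predictors, dtype=float)
    n_rows, n_cols = predictors.shape
    vifs = []
    for j in range(n_cols):
        target = predictors[:, j]
        others = np.delete(predictors, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coefficients, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coefficients
        r2 = 1.0 - float(residuals @ residuals) / float(np.sum((target - target.mean()) ** 2))
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs


rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3_independent = rng.normal(size=200)
x3_collinear = x1 + x2 + 0.1 * rng.normal(size=200)  # nearly a sum of x1 and x2

vifs_independent = variance_inflation_factors(np.column_stack([x1, x2, x3_independent]))
vifs_collinear = variance_inflation_factors(np.column_stack([x1, x2, x3_collinear]))
print([round(v, 1) for v in vifs_independent])
print([round(v, 1) for v in vifs_collinear])
```

Note that all three variables in the collinear set get large VIFs, not just the one that was constructed from the others.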

If you suspect multicollinearity, don’t worry about the model but don’t believe any of the

predictions.

9. Heteroscedasticity

Regression, and practically all parametric statistics, requires that the variances in the model

residuals be equal at every value of the dependent variable. This assumption is called equal

variances, homogeneity of variances, or coolest of all, homoscedasticity. Violate the assumption

and you have heteroscedasticity.

Heteroscedasticity is assessed much more commonly in analysis of variance models than in regression models. This is probably because the groups in ANOVA are defined by categorical variables, so there is a natural set of groups to compare variances across, while the variables in regression are measured on continuous scales.

The solution to this is fairly simple. Break the dependent variable scale into intervals, like in a

histogram, and calculate the variance for each interval. The variances don’t have to be precisely

equal, but variances different by a factor of five are problematical. Unequal variances will wreak

havoc on any tests or confidence limits calculated for model predictions.
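The interval check can be sketched as follows. This example assumes the variances being compared are those of the model residuals, grouped into equal-width intervals of whatever scale is passed in (the dependent variable, as described above, or the model’s predicted values). The function name, the four intervals, and the data are hypothetical.

```python
from statistics import variance


def variance_ratio_by_interval(scale_values, residuals, n_intervals=4):
    """Ratio of the largest to the smallest residual variance across intervals."""
    low, high = min(scale_values), max(scale_values)
    if high == low:
        raise ValueError("scale_values must span a range")
    width = (high - low) / n_intervals
    groups = [[] for _ in range(n_intervals)]
    for value, residual in zip(scale_values, residuals):
        index = min(int((value - low) / width), n_intervals - 1)
        groups[index].append(residual)
    # Intervals with fewer than two points have no sample variance; skip them.
    variances = [variance(group) for group in groups if len(group) >= 2]
    return max(variances) / min(variances)


scale = list(range(1, 41))
# Residuals with the same spread everywhere.
steady = [(-1) ** i * 1.0 for i in range(40)]
# Residuals whose spread grows with the scale.
growing = [(-1) ** i * float(value) for i, value in enumerate(scale)]

ratio_steady = variance_ratio_by_interval(scale, steady)
ratio_growing = variance_ratio_by_interval(scale, growing)
print(f"steady {ratio_steady:.1f}, growing {ratio_growing:.1f}")
```

A ratio well above five, as in the second case, is the signal described above that tests and confidence limits should not be trusted.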

10. Autocorrelation

Autocorrelation involves a variable being correlated with itself. It is the correlation between data

points with the previously listed data points (termed a lag). Usually, autocorrelation involves

time-series data or spatial data, but it can also involve the order in which data are collected. The

terms autocorrelation and serial correlation are often used interchangeably. If the data points are

collected at a constant time interval, the term autocorrelation is more typically used.

If the residuals of a model are autocorrelated, it’s a sure bet that the variances will also be

unequal. That means, again, that tests or confidence limits calculated from variances should be

suspect.

To check a variable or residuals from a model for autocorrelation, you can conduct a Durbin-Watson test. The Durbin-Watson test statistic ranges from 0 to 4. If the statistic is close to 2.0, then serial correlation is not a problem. Values toward 0 suggest positive serial correlation and values toward 4 suggest negative serial correlation. Most statistical software will allow you to conduct this test as part of a regression analysis.
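If your software doesn’t report it, the statistic itself is easy to compute: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes only the statistic, not the significance bounds of the test, and the residual sequences are made up.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in collection order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals that stay positive, then stay negative (positive serial correlation).
runs = [1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0]
# Hypothetical residuals that flip sign every step (negative serial correlation).
flips = [1.0, -1.0] * 10

print(durbin_watson(runs), durbin_watson(flips))  # prints 0.4 3.8
```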

11. Weighting

Most software that calculates regression parameters also allows you to weight the data points. You might want to do this for several reasons. Weighting is used to make more reliable or relevant data points more important in model building. It’s also used when each data point represents more than one value. The issue with weighting is that it will change the degrees of freedom, and hence, the results of statistical tests. Usually this is OK, a necessary change to accommodate the realities of the model. However, if you ever come upon a weighted least squares regression model in which the weightings are arbitrary, perhaps done by an analyst who doesn’t understand the consequences, don’t believe the test results.
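One way to see what a set of weights is doing is to fit the same line with and without them. The sketch below uses a weighted least squares line for a single independent variable; the data, the weights, and the function name are hypothetical, and it shows only the change in the fitted line, not the change in degrees of freedom or test results.

```python
def weighted_line(x, y, weights):
    """Slope and intercept of a weighted least squares straight line."""
    total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y) for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: y = 2x except for a last point considered less reliable.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 20.0]

slope_unweighted, _ = weighted_line(x, y, [1.0] * 6)
slope_weighted, _ = weighted_line(x, y, [1.0, 1.0, 1.0, 1.0, 1.0, 0.1])
print(f"unweighted slope {slope_unweighted:.2f}, weighted slope {slope_weighted:.2f}")
```

If the two fits differ this much, the justification for the weights deserves as much scrutiny as the model itself.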

Is Your Regression Model Telling the Truth?

There are many technologies we use in our lives without really understanding how they work.

Television. Computers. Cell phones. Microwave ovens. Cars. Even many things about the

human body are not well understood. But I don’t mean how to use these mechanisms. Everyone

knows how to use these things. I mean understanding them well enough to fix them when they

break. Regression analysis is like that too. Only with regression analysis, sometimes you can’t

even tell if there’s something wrong without consulting an expert.

Here are some tips for troubleshooting regression models.

Diagnosis

You may know how to use regression analysis, but unless you’re an expert, you may not know

about some of the more subtle pitfalls you may encounter. The biggest red flag that something is

amiss is the TGTBT, too good to be true. If you encounter an R-squared value above 0.9,

especially unexpectedly, there’s probably something wrong. Another red flag is inconsistency. If

estimates of the model’s parameters change between data sets, there’s probably something

wrong. And if predictions from the model are less accurate or precise than you expected, there’s

probably something wrong.

Here are some guidelines for troubleshooting a model you developed.

Not Enough Samples
Identification: If you have fewer than 10 observations for each independent variable you want to put in a model, you don’t have enough samples.
Correction: Collect more samples. 100 observations per variable is a good target to shoot for although more is usually better.

No Intercept
Identification: You’ll know it if you do it.
Correction: Put in an intercept and see if the model changes.

Stepwise Regression
Identification: You’ll know it if you do it.
Correction: Don’t abdicate model building decisions to software alone.

Outliers
Identification: Plot the dependent variable against each independent variable. If more than about 5% of the data pairs plot noticeably apart from the rest of the data points, you may have outliers.
Correction: Conduct a test on the aberrant data points to determine if they are statistical anomalies. Use diagnostic statistics like leverage to evaluate the effects of suspected outliers. Evaluate the metadata of the samples to determine if they are representative of the population being modeled. If so, retain the outlier as an influential observation (AKA leverage point).

Non-linear relationships
Identification: Plot the dependent variable against each independent variable. Look for nonlinear patterns in the data.
Correction: Find an appropriate transformation of the independent variable.

Overfitting
Identification: If you have a large number of independent variables, especially if they use a variety of transformations and don’t contribute much to the accuracy and precision of the model, you may have overfit the model.
Correction: Keep the model as simple as possible. Make sure the ratio of observations to independent variables is large. Use diagnostic statistics like AIC and BIC to help select an appropriate number of variables.

Misspecification
Identification: Look for any variants of the dependent variable in the independent variables. Assess whether the model meets the objectives of the effort.
Correction: Remove any elements of the dependent variable from the independent variables. Remove at least one component of variables describing mixtures. Ensure the model meets the objectives of the effort with the desired accuracy and precision.

Multicollinearity
Identification: Calculate correlation coefficients and plot the relationships between all the independent variables in the model. Look for high correlations.
Correction: Use diagnostic statistics like VIF to evaluate the effects of suspected multicollinearity. Remove intercorrelated independent variables from the model.

Heteroscedasticity
Identification: Plot the variance at each level of an ordinal-scale dependent variable or appropriate ranges of a continuous-scale dependent variable. Look for any differences in the variances of more than about five times.
Correction: Try to find an appropriate Box-Cox transformation or consider nonparametric regression or data mining methods.

Autocorrelation
Identification: Plot the data over time, location or the order of sample collection. Calculate a Durbin–Watson statistic for serial correlation.
Correction: If the autocorrelation is related to time, develop a correlogram and a partial correlogram. If the autocorrelation is spatial, develop a variogram. If the autocorrelation is related to the order of sample collection, examine metadata to try to identify a cause.

Weighting
Identification: You’ll know it if you do it.
Correction: Compare the weighted model with the corresponding unweighted model to assess the effects of weighting. Consider the validity of weighting; seek expert advice if needed.

Sometimes the model you are skeptical about isn’t one you developed; it’s one developed by another data analyst. The major difference is that with other analysts’ models, you won’t have access to all their diagnostic statistics and plots, let alone their data. If you have been retained to review another analyst’s work, you can always ask for the information you need. If, however, you’re reading about a model in a journal article, book, or website, you’ve probably got all the information you’re ever going to get. You have to be a statistical detective. Here are some clues you might look for.

Another Analyst’s Model: Identification

Not Enough Samples: If the analyst reported the number of samples used, look for at least 10 observations for each independent variable in the model.

No Intercept: If the analyst reported the actual model (some don’t), look for a constant term.

Stepwise Regression: Unless another approach is reported, assume the analyst used some form of stepwise regression.

Outliers: Assuming the analyst did not provide plots of the dependent variable versus the independent variables, look for R-squared values that are much higher or lower than expected.

Non-linear relationships: Assuming the analyst did not provide plots of the dependent variable versus the independent variables, look for a lower-than-expected R-squared value from a linear model. If there are non-linear terms in the model, this is probably not an issue.

Overfitting: Look for a large number of independent variables in the model, especially if they use different types of transformations.

Misspecification: Look for any variants of the dependent variable in the independent variables. Assess whether the model meets the objectives of the effort.

Multicollinearity: Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify multicollinearity.

Heteroscedasticity: Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify heteroscedasticity.

Autocorrelation: Assuming relevant plots and diagnostic statistics are not available, there may not be any way to identify serial correlation.

Weighting: Compare the reported number of samples to the degrees of freedom. Any differences may be attributable to weighting.

Have No Doubts

So there are eleven reasons for doubting a regression model. Some are easy to identify, others are more subtle. But if you’re grasping for flaws in a regression model, these are good places to start looking. Just remember when evaluating other analysts’ models that not everyone is an expert and that even experts make mistakes. Try to be helpful in your critiques, but at a minimum, be professional.

Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other

Breeds of Data Analysis at Wheatmark. Stats with Cats is also available at amazon.com,

barnesandnoble.com, and other online booksellers. Read the blogs at Stats with Cats blog.
