Data Screening

Saeed Pahlevan Sharif 1/09/2013
Data Screening and CFA

STRUCTURAL EQUATION MODELING (SEM) & AMOS WORKSHOP
1ST & 8TH SEPTEMBER 2013
SAEED PAHLEVAN SHARIF

WWW.SAEEDSHARIF.COM
www.saeedsharif.com Taylor’s Graduate School
Data Screening
 Data analysis
 Summarization
 Model fitting
 Testing hypotheses
 Data screening
 Exposure
 Preparation for modeling
 Checking the adequacy of assumptions.
 Your data should be “clean”

 Reliable and valid
Necessary Data Screening To Do:

 Handle Missing Data

 Address outliers and influential cases.
 Meet multivariate statistical assumptions for
alternative tests
Problems Resulting from Missing Data

 Loss of Information
 Bias
 Power Loss
Statistical Problems with Missing Data

 Missing much of your data
 The model cannot be estimated.
 EFA, CFA, and path models require a certain minimum number of observations.
 Greater model complexity and improved power require larger samples.
Logical Problem with Missing Data

 Systematic bias due to a common cause (poor formulation, sensitivity, etc.).
 Gender Moderator
 Salary
 Etc.
Detecting Missing Values
Handling Missing Data

Hair et al.’s (2009) Rules of Thumb:

 Missing data under 10% for an individual case or
observation can generally be ignored, except when the
missing data occurs in a specific nonrandom fashion.
 The number of cases with no missing data must be
sufficient for the selected analysis technique if
replacement values will not be substituted (imputed)
for the missing data.
• DV is missing
• Impute and run models with and without missing data
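The 10% rule of thumb can be checked with a short script. The following is a minimal sketch in plain Python, not part of the workshop material: the response data, the variable names, and the use of None for a missing answer are all assumptions made for illustration.

```python
# Hypothetical survey responses; None marks a missing answer.
responses = [
    {"q1": 4, "q2": 5, "q3": 3, "q4": 4},
    {"q1": 2, "q2": None, "q3": 3, "q4": 1},
    {"q1": None, "q2": None, "q3": None, "q4": 5},
    {"q1": 5, "q2": 4, "q3": 4, "q4": 4},
]

def missing_share(values):
    """Return the fraction of entries that are missing (None)."""
    values = list(values)
    return sum(v is None for v in values) / len(values)

# Share of missing answers per case (row).
per_case = [missing_share(row.values()) for row in responses]

# Share of missing answers per variable (column).
variables = list(responses[0])
per_variable = {v: missing_share(row[v] for row in responses) for v in variables}

# Cases above the 10% rule of thumb are candidates for inspection.
THRESHOLD = 0.10
flagged_cases = [i for i, share in enumerate(per_case) if share > THRESHOLD]

# Cases with no missing data at all, to compare against the sample size
# required by the chosen analysis technique.
complete_cases = [row for row in responses if None not in row.values()]

print(per_case, flagged_cases, len(complete_cases))
```

Flagged cases still need a look at whether the missingness follows a nonrandom pattern, which a simple count cannot show.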
Imputation Methods
(Hair et al. (2009), table 2-2)
 Use only valid data

 No imputation, just use valid cases or variables
 In SPSS: Exclude Pairwise (variable), Listwise (case)
 Use known replacement values

 Match missing value with similar case’s value
 Use calculated replacement values

 Use variable mean, median, or mode
 Regression based on known relationships
 Model based methods

 Iterative two step estimation of value and descriptives to find
most appropriate replacement value
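As a rough illustration of the first and third options above, the sketch below compares "use only valid data" with mean substitution on a single item. The scores are invented, and None is assumed to mark a missing value.

```python
from statistics import mean, variance

# Hypothetical scores on one item; None marks a missing value.
scores = [2, 4, None, 6, 8, None]

# "Use only valid data": drop the missing entries.
valid = [s for s in scores if s is not None]

# "Use calculated replacement values": substitute the mean of the valid data.
item_mean = mean(valid)
imputed = [item_mean if s is None else s for s in scores]

var_valid = variance(valid)      # sample variance of the observed values
var_imputed = variance(imputed)  # sample variance after mean substitution

print(item_mean, var_valid, var_imputed)
```

In this example the mean is unchanged by the substitution, but the variance drops from about 6.67 to 4, because every imputed value sits exactly at the mean.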
Imputation in SPSS
2. Include each variable that has values that need imputing.
3. For each variable, you can choose the new name (for the imputed column) and the type of imputation.
Imputation Method: Advantages, Disadvantages, Best Used When

Mean Substitution
Advantages:
• Easily implemented
• Provides all cases with complete information
Disadvantages:
• Reduces variance of the distribution
• Distorts distribution of the data
• Depresses observed correlations
Best used when:
• Relatively low levels of missing data
• Relatively strong relationships among variables

Regression Imputation
Advantages:
• Employs actual relationships among the variables
• Replacement values calculated based on an observation’s own values on other variables
• Unique set of predictors can be used for each variable with missing data
Disadvantages:
• Reinforces existing relationships and reduces generalizability
• Must have sufficient relationships among variables to generate valid predicted values
• Understates variance unless error term added to replacement value
• Replacement values may be “out of range”
Best used when:
• Moderate to high levels of missing data
• Relationships sufficiently established so as to not impact generalizability

Model-Based Methods
Advantages:
• Accommodates both nonrandom and random missing data processes
• Best representation of original distribution of values with least bias
Disadvantages:
• Complex model specification by researcher
• Requires specialized software
• Typically not available in software programs (except EM method in SPSS)
Best used when:
• Only method that can accommodate nonrandom missing data process
• High levels of missing data require least biased method to ensure generalizability
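Regression imputation can be sketched with a single predictor. This is a simplified illustration under assumed data (x fully observed, y missing for the last two cases). It fits an ordinary least squares line by hand and adds no error term, so it understates variance exactly as the comparison above warns.

```python
from statistics import mean

# Hypothetical paired observations; y is missing (None) for the last two cases.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.0, None, None]

# Fit y = intercept + slope * x on the complete cases only.
pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
x_bar = mean(p[0] for p in pairs)
y_bar = mean(p[1] for p in pairs)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs) / sum(
    (xi - x_bar) ** 2 for xi, _ in pairs
)
intercept = y_bar - slope * x_bar

# Replace each missing y with its predicted value (no error term added).
y_imputed = [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

print(round(slope, 2), round(intercept, 2), [round(v, 2) for v in y_imputed])
```

If y were measured on a scale with a maximum of 10, the last replacement value (about 11.93) would be "out of range", which is another of the listed disadvantages.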
Best Method – Prevention!

 Short surveys (pre testing critical!)

 Easy to understand and answer survey items
 Force completion (incentives, technology)
 Bribe/motivate (iPad drawing)
 Digital surveys (rather than paper)
 Put dependent variables at the beginning of
the survey!
Outliers and Influentials

 Outliers can influence your results, pulling the mean away from the median.
 Outliers also affect distributional assumptions and often reflect false or mistaken responses.
 Two types of outliers:
 Outliers for individual variables (univariate)
 Extreme values for a single variable
 Outliers for the model (multivariate)
 Extreme (uncommon) values for a correlation
Detecting Univariate Outliers
[Figure: box plot, with the center marked “Mean”. Annotations: 50% of cases should fall within the box; 99% should fall within the range of the whiskers; points beyond that range are outliers.]
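The box plot logic can be approximated numerically. The sketch below assumes the common 1.5 × IQR whisker rule, which the slides do not state explicitly, and uses invented values. For roughly normal data that rule covers about 99% of cases.

```python
from statistics import quantiles

# Hypothetical values of a single variable.
values = [10, 12, 12, 13, 13, 14, 14, 15, 16, 40]

# Quartiles: the box spans Q1 to Q3 and holds the middle 50% of cases.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Whisker limits under the assumed 1.5 * IQR rule.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values beyond the whiskers are candidates for case-by-case inspection.
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(outliers)
```

Different software packages compute quartiles slightly differently, so borderline cases may be flagged in one tool and not in another.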
Handling Univariate Outliers

 Should be examined on a case by case basis.

 If the outlier is truly abnormal, and not
representative of your population, then it is okay to
remove. But this requires careful examination of
the data points
 e.g., you are studying dogs, but somehow a cat got ahold of
your survey
 e.g., someone answered “1” for all 75 questions on the survey
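The second example (the same answer to every question) is easy to screen for automatically. A minimal sketch, assuming hypothetical respondent IDs and answers stored as one list per respondent:

```python
# Hypothetical responses: one list of item answers per respondent.
answers = {
    "r1": [4, 5, 3, 4, 2],
    "r2": [1, 1, 1, 1, 1],  # same answer to every item
    "r3": [3, 3, 4, 3, 3],
}

# A respondent whose answers contain only one distinct value shows no variation at all.
straight_liners = [rid for rid, items in answers.items() if len(set(items)) == 1]

print(straight_liners)
```

A flagged respondent is only a candidate for removal. The case still deserves a look before it is dropped.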
Detecting Multivariate Outliers

 Multivariate outliers refer to sets of data points that do not fit the standard sets of correlations exhibited by the other data points in the dataset with regard to your causal model.
 Exercise and Weight loss
 Mahalanobis d-squared.
[Figure: table of Mahalanobis d-squared values with a p1 column. The observation numbers are row numbers from SPSS. Anything less than .05 in the p1 column is abnormal, and is a candidate for inspection.]
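To make the d-squared idea concrete, here is a sketch for the two-variable case (exercise and weight loss). The data are invented. The p-value is taken as the chi-square upper-tail probability, which for two variables has the closed form exp(-d²/2). It is assumed here to play the role of the p1 column, and the software may compute its values somewhat differently.

```python
import math
from statistics import mean

# Hypothetical data: weekly exercise hours and weight loss (kg) per case.
exercise = [1, 2, 3, 4, 5, 6, 7, 8, 9, 2]
weight_loss = [0.5, 1.0, 1.4, 2.1, 2.4, 3.1, 3.4, 4.0, 4.6, 4.5]

n = len(exercise)
mx, my = mean(exercise), mean(weight_loss)

# Sample covariance matrix entries (divisor n - 1).
sxx = sum((x - mx) ** 2 for x in exercise) / (n - 1)
syy = sum((y - my) ** 2 for y in weight_loss) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in zip(exercise, weight_loss)) / (n - 1)
det = sxx * syy - sxy ** 2

def d_squared(x, y):
    """Mahalanobis d-squared of one case, using the inverse of the 2x2 covariance matrix."""
    dx, dy = x - mx, y - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

results = []
for row, (x, y) in enumerate(zip(exercise, weight_loss), start=1):
    d2 = d_squared(x, y)
    p = math.exp(-d2 / 2)          # chi-square upper tail, df = 2
    flag = 1 if p < 0.05 else 0    # 1 = candidate multivariate outlier
    results.append((row, d2, p, flag))

flagged_rows = [row for row, d2, p, flag in results if flag == 1]
print(flagged_rows)
```

Row 10 (little exercise, large weight loss) is unremarkable on either variable alone but breaks the pattern that the other cases share, which is what makes it a multivariate outlier.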
Handling Multivariate Outliers

 Create a new variable in SPSS called “Outlier”
 Code 0 for cases with a Mahalanobis p1 value > .05
 Code 1 for cases with a Mahalanobis p1 value < .05
 AMOS: use “Outlier” as a grouping variable
 This then runs your model with only non-outliers
Before and after removing outliers

Before: N = 340. After: N = 295.
Even after you remove outliers, the Mahalanobis will come up with a whole new set of outliers, so these should be checked on a case by case basis, using the Mahalanobis as a guide for inspection.
“Best Practice” for outliers

 It is a bad idea to remove outliers, unless they are truly “abnormal” and do not represent accurate observations from the population.
 Removing outliers is risky
 Generalizability
Normality
 PLS or binomial regressions do not require such assumptions
 t tests and F tests assume normal distributions
 Normality is assessed in many ways: shape,
skewness, and kurtosis (flat/peaked).
 Normality issues affect small sample sizes (<50)
much more than large sample sizes (>200)
[Figures: examples of distribution shape (bimodal, flat), skewness, and kurtosis.]
Fixing Normality Issues

 Fix a flat distribution with:
 Inverse: 1/X
 Fix a negatively skewed distribution with:
 Squared: X*X
 Cubed: X*X*X
 Fix a positively skewed distribution with:
 Square root: SQRT(X)
 Logarithm: LG10(X)
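The effect of the positive-skew transformations can be checked on a small example. This sketch uses invented data and a simple moment-based skewness formula, which may differ slightly from the one SPSS or AMOS reports. Both transformations require positive values of X.

```python
import math
from statistics import mean

def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m3 = mean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

# Hypothetical positively skewed variable (all values > 0).
x = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]

x_sqrt = [math.sqrt(v) for v in x]    # SQRT(X)
x_log = [math.log10(v) for v in x]    # LG10(X)

skew_raw = skewness(x)
skew_sqrt = skewness(x_sqrt)
skew_log = skewness(x_log)

print(round(skew_raw, 2), round(skew_sqrt, 2), round(skew_log, 2))
```

For this data the logarithm pulls in the long right tail more strongly than the square root does, so it would be the choice for the more severe skew.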
Normality in AMOS
– Refer to “Assessment of normality” in the Text View output
– Data are considered normal if:
:: Skewness is between -3 and +3
:: Kurtosis is between -7 and +7
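The same cut-offs can be applied outside AMOS. The following is a sketch under two assumptions: the moment-based formulas below are close enough to the ones AMOS uses, and kurtosis is reported as excess kurtosis (0 for a normal distribution). The item scores are invented.

```python
from statistics import mean

def skewness(values):
    """Third central moment divided by the variance to the power 1.5."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m3 = mean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m4 = mean((v - m) ** 4 for v in values)
    return m4 / m2 ** 2 - 3

def within_cutoffs(values):
    """Apply the -3..+3 skewness and -7..+7 kurtosis rule to one item."""
    return -3 < skewness(values) < 3 and -7 < excess_kurtosis(values) < 7

# Hypothetical item scores.
item = [1, 2, 2, 3, 3, 3, 4, 4, 5]

print(skewness(item), excess_kurtosis(item), within_cutoffs(item))
```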
What is Structural Equation Modeling (SEM)?

 Two components:
 Measurement model (CFA) = A visual representation that specifies
the model’s constructs, indicator variables, and interrelationships.
CFA provides quantitative measures of the reliability and validity of
the constructs.
 Structural model (SEM) = A set of dependence relationships linking
the hypothesized model’s constructs. SEM determines whether
relationships exist between the constructs – and along with CFA
enables you to accept or reject your theory.
 Developing CFA and SEM models and developing hypotheses:
 Theory
 Prior experience
What is the Difference between EFA and CFA?

 EFA (Exploratory Factor Analysis):

 Use the data to determine the underlying structure.
 CFA (Confirmatory Factor Analysis):

1) Specify the factor structure on the basis of a ‘good’ theory
2) Use CFA to determine whether there is empirical support for
the proposed theoretical factor structure.
CFA
 The major objective in CFA is determining if the relationships between the variables in the hypothesized model resemble the relationships between the variables in the observed data set.
 More formally: the analysis determines the extent to which the proposed covariance matches the observed covariance.
 CFA assesses how well the predicted interrelationships between the variables match the actual or observed interrelationships. If the two matrices (the proposed and the actual) are consistent with one another, then the model can be considered a credible explanation for the hypothesized relationships.
 CFA provides quantitative measures that assess the validity and reliability of the theoretical model.
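The comparison of the two matrices can be illustrated with a one-factor model. This is a sketch with hypothetical loadings, error variances, and observed covariances, not output from AMOS. The implied covariance is computed as loading × factor variance × loading, plus the error variance on the diagonal.

```python
# Hypothetical one-factor model with three indicators.
loadings = [0.8, 0.7, 0.6]             # factor loadings
factor_variance = 1.0                  # variance of the latent factor
error_variances = [0.36, 0.51, 0.64]   # unique (error) variances

def implied_covariance(lam, phi, theta):
    """Model-implied covariance matrix for a single factor."""
    k = len(lam)
    return [
        [lam[i] * phi * lam[j] + (theta[i] if i == j else 0.0) for j in range(k)]
        for i in range(k)
    ]

# Hypothetical observed covariance matrix.
observed = [
    [1.00, 0.58, 0.45],
    [0.58, 1.00, 0.44],
    [0.45, 0.44, 1.00],
]

implied = implied_covariance(loadings, factor_variance, error_variances)

# Raw (unstandardized) residuals: observed minus implied.
residuals = [[observed[i][j] - implied[i][j] for j in range(3)] for i in range(3)]
largest = max(abs(r) for row in residuals for r in row)

print([[round(r, 2) for r in row] for row in residuals], round(largest, 2))
```

Small residuals everywhere mean the proposed and observed matrices are consistent. Fit indices summarize this discrepancy in a single number.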
Practice
Recommended Criteria for Fit Indices
Which Fit Measures to Report?

 An often-cited recommendation is that of Jaccard and Wan (1996): report at least three fit tests (one absolute, one relative, and one parsimonious) to reflect diverse criteria.
 More recently, Kline (2005) and Thompson (2004) recommend fit measures without reference to their classification.
 Meyers et al.: report chi-square, NFI, CFI, and RMSEA. Although chi-square is less informative as an assessment of a single model, it is useful in comparing nested models, where the model with the lower chi-square value is considered preferable.
Model Fit
 Factor Loading:
Some researchers believe that factor loadings must be greater than 0.7; otherwise the items must be excluded from the model and reported as poor indicators of the factor.
Based on Garson, we accept factor loadings greater than 0.5.
• How many indicators per factor?
2 is the minimum
3 is safer, especially if factor correlations are weak
4 provides safety
5 or more is more than enough (if there are too many indicators, combine indicators into sets)
 Normality Test:
Based on Barbara’s book, -3 < Skewness < 3 and -7 < Kurtosis < 7 are acceptable and we consider the data normal. Any item that does not meet these conditions is removed from the model.
Model Fit
 Model Fit:
According to Robert Ho’s book, at least three indices must meet their thresholds to claim that the model fits.
GFI, CFI, … > 0.9 are OK (near 0.9 is acceptable as well).
A p-value > 0.05 in the CMIN (chi-square) table is OK, because here we do not want to reject the null hypothesis.
Robert Ho, page 285: RMSEA < 0.05 is excellent, 0.05 < RMSEA < 0.08 is good, 0.08 < RMSEA < 0.1 is moderate, and RMSEA > 0.1 is weak.
We should report three satisfied indices and also RMSEA and chi-square (CMIN), even if these two are not satisfied.
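The RMSEA bands can be applied to a model's chi-square as follows. This sketch uses one commonly used point-estimate formula, sqrt(max(χ² − df, 0) / (df × (N − 1))). Some programs divide by N rather than N − 1, so reported values may differ slightly. The upper band is assumed to end at 0.10, and the chi-square, df, and N values are invented.

```python
import math

def rmsea(chi_square, df, n):
    """Point estimate of RMSEA from the model chi-square, its df (> 0), and sample size n."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

def rmsea_label(value):
    """Classify an RMSEA value using the bands described above (upper cut-off assumed 0.10)."""
    if value < 0.05:
        return "excellent"
    if value < 0.08:
        return "good"
    if value < 0.10:
        return "moderate"
    return "weak"

# Hypothetical model results.
value = rmsea(chi_square=85.0, df=50, n=295)
label = rmsea_label(value)

print(round(value, 4), label)
```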
 The correlation between latent variables must be less than 0.9; otherwise we combine the two highly correlated latent variables, because they are actually measuring the same thing!
So, based on Barbara, we take them to the second order (a second-order factor).
Modification Indices
Residuals
 A significant standardized residual is one with an absolute value greater than 4.0. Significant residuals substantially decrease your model fit. Fixing model fit per the residuals matrix is similar to fixing model fit per the modification indices. The same rules apply.
Construct Validity
 If you have convergent validity issues, then your variables do not correlate
well with each other within their parent factor; i.e. the latent factor is not well
explained by its observed variables.
 If you have discriminant validity issues, then your variables correlate more
highly with variables outside their parent factor than with the variables within
their parent factor; i.e., the latent factor is better explained by some other
variables (from a different factor), than by its own observed variables.
Validity and Reliability

 It is absolutely necessary to establish convergent and discriminant validity, as
well as reliability, when doing a CFA. If your factors do not demonstrate
adequate validity and reliability, moving on to test a causal model will be
useless - garbage in, garbage out!
 There are a few measures that are useful for establishing validity and reliability:
Reliability
 CR > 0.7
Convergent Validity
 CR > AVE
 AVE > 0.5
Discriminant Validity
 MSV < AVE
 ASV < AVE
CR: Composite Reliability
AVE: Average Variance Extracted
MSV: Maximum Shared Squared Variance
ASV: Average Shared Squared Variance
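These measures can be computed by hand from standardized CFA output. The sketch below uses the usual formulas: CR = (Σλ)² / ((Σλ)² + Σ(1 − λ²)), AVE = mean of λ², and MSV = the largest squared correlation with another factor. The loadings, the factor names A and B, and their correlation are hypothetical. With only two factors, ASV equals MSV, so it is not computed separately.

```python
# Hypothetical standardized loadings per factor and factor correlations.
loadings = {"A": [0.82, 0.78, 0.75], "B": [0.80, 0.72, 0.70, 0.68]}
factor_correlations = {("A", "B"): 0.55}

def composite_reliability(lams):
    """CR from standardized loadings, with error variance 1 - loading^2 per item."""
    s = sum(lams)
    error = sum(1 - l ** 2 for l in lams)
    return s ** 2 / (s ** 2 + error)

def average_variance_extracted(lams):
    """AVE: mean of the squared standardized loadings."""
    return sum(l ** 2 for l in lams) / len(lams)

def max_shared_variance(factor):
    """MSV: largest squared correlation between this factor and any other factor."""
    return max(r ** 2 for pair, r in factor_correlations.items() if factor in pair)

report = {}
for name, lams in loadings.items():
    cr = composite_reliability(lams)
    ave = average_variance_extracted(lams)
    msv = max_shared_variance(name)
    report[name] = {
        "CR": cr,
        "AVE": ave,
        "MSV": msv,
        "reliable": cr > 0.7,
        "convergent": cr > ave and ave > 0.5,
        "discriminant": msv < ave,
    }

for name, row in report.items():
    print(name, round(row["CR"], 3), round(row["AVE"], 3), round(row["MSV"], 3))
```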
For more information visit www.SaeedSharif.com
 Andrew Hayes
 Andy Field
 Bahaman Abu Samah
 James Gaskin
 Joseph Hair et al.
 Lawrence S. Meyers et al
 Robert Ho
 Saeed Pahlevan Sharif
