Standard Scores
Standard score: a raw score that has been converted from one scale to another scale, where the latter has an arbitrarily set mean and standard deviation
- used for comparison
Z-score
Conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution (see the sketch below).
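A quick numeric sketch of these conversions (hypothetical raw scores; the mean-50, SD-10 "T-score" scale is just one common example of an arbitrarily set mean and standard deviation):

```python
# Convert raw scores to z-scores, then re-express them on a chosen standard scale.
raw_scores = [62, 75, 88, 91, 70]   # hypothetical raw scores

n = len(raw_scores)
mean = sum(raw_scores) / n
sd = (sum((x - mean) ** 2 for x in raw_scores) / n) ** 0.5

for x in raw_scores:
    z = (x - mean) / sd     # SD units below/above the mean of the distribution
    t = 50 + 10 * z         # same score on an arbitrarily set mean-50, SD-10 scale
    print(f"raw={x}  z={z:+.2f}  T={t:.1f}")
```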
CHAPTER 5: RELIABILITY
RELIABILITY
- Dependability and consistency
- Error implies that there will always be some inaccuracy in our measurements
- Tests that are relatively free of measurement error are deemed to be reliable
- Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research
- Reliability coefficient: an index that indicates the ratio between the true score variance on a test and the total variance
- HISTORY OF RELIABILITY:
  o Charles Spearman (1904): The Proof and Measurement of Association between Two Things
  o Then Thorndike
  o Item response theory has taken advantage of computer technology to advance psychological measurement significantly
  o Based on Spearman's ideas
- CLASSICAL TEST THEORY: X = T + E
  o Assumes that each person has a true score that would be obtained if there were no errors in measurement
  o The difference between the true score and the observed score results from measurement error
  o The assumption here is that errors of measurement are random
  o Basic sampling theory tells us that the distribution of random errors is bell-shaped
    ▪ The center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors
  o Classical test theory assumes that the true score for an individual will not change with repeated applications of the same test
  o Variance: the standard deviation squared. It is useful because it can be broken into components (simulated in the sketch below):
    ▪ True variance: variance from true differences, assumed to be stable
    ▪ Error variance: variance from random, irrelevant sources
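To make X = T + E concrete, here is a minimal simulation sketch (all parameters hypothetical): each person gets a stable true score plus fresh random error on each of two administrations, and the correlation between the two administrations approximates the ratio of true variance to total variance described above.

```python
import random

random.seed(1)

N = 10_000
TRUE_SD, ERROR_SD = 10, 5   # hypothetical spread of true scores and of random error

true_scores = [random.gauss(100, TRUE_SD) for _ in range(N)]
# Two administrations: same stable T, fresh random E each time (X = T + E)
x1 = [t + random.gauss(0, ERROR_SD) for t in true_scores]
x2 = [t + random.gauss(0, ERROR_SD) for t in true_scores]

def pearson_r(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

ratio = TRUE_SD**2 / (TRUE_SD**2 + ERROR_SD**2)   # true variance / total variance
print(f"true variance / total variance: {ratio:.3f}")                   # 0.800
print(f"correlation between administrations: {pearson_r(x1, x2):.3f}")  # ~0.80
```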
- Standard error of measurement: because we assume that the distribution of random errors is the same for all people, classical test theory uses the standard deviation of errors as the basic measure of error
  o The standard error of measurement tells us, on average, how much a score varies from the true score
  o The standard deviation of the observed scores and the reliability of the test are used to estimate the standard error of measurement
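The estimate implied by the last bullet is SEM = s * sqrt(1 - r), where s is the standard deviation of observed scores and r is the reliability coefficient; a worked example with hypothetical values:

```python
# SEM = s * sqrt(1 - r): observed-score SD discounted by the test's reliability.
s, r = 15, 0.89          # hypothetical: an IQ-style scale with reliability .89
sem = s * (1 - r) ** 0.5
print(f"SEM = {sem:.2f}")   # ~4.97 score points

# Rough 95% band around an observed score of 110, using the SEM:
lo, hi = 110 - 1.96 * sem, 110 + 1.96 * sem
print(f"95% band: {lo:.1f} to {hi:.1f}")
```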
- Reliability: the proportion of the total variance attributed to true variance
  o The greater the proportion of total variance attributed to true variance, the more reliable the test
- Measurement error: refers, collectively, to all of the factors associated with the process of measuring some variable, other than the variable being measured
  o Random error: a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
    ▪ This source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores
  o Systematic error: a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
    ▪ Error is predictable and fixable
    ▪ Does not affect score consistency

SOURCES OF ERROR VARIANCE
- TEST CONSTRUCTION
  o Item sampling or content sampling: refers to variation among items within a test as well as to variation among items between tests
    ▪ The extent to which a test taker's score is affected by the content sampled on a test, and by the way the content is sampled (that is, the way in which the item is constructed), is a source of error variance
- TEST ADMINISTRATION
  o May influence the test taker's attention or motivation
  o Environment variables, test taker's variables, examiner variables, level of professionalism
- TEST SCORING AND INTERPRETATION
  o Computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences
  o However, other tools of assessment still require scoring by trained personnel
  o If subjectivity is involved in scoring, then the scorer can be a source of error variance
  o Despite the rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiners occasionally are confronted by situations where an examinee's response lies in a gray area

TEST-RETEST RELIABILITY
- Also known as time-sampling reliability
- Correlating pairs of scores from the same group on two different administrations of the same test
- Measures something that is relatively stable over time
- Sources of error variance:
  o Passage of time: the longer the time that passes, the greater the likelihood that the reliability coefficient will be lower
  o Coefficient of stability: the estimate obtained when the interval between testings is greater than 6 months
- Consider the possibility of a carryover effect: occurs when the first testing session influences scores from the second session
- If something affects all the test takers equally, then the results are uniformly affected and no net error occurs
- Practice may produce a carryover effect
- Practice can also affect tests of manual dexterity
- The time interval between testing sessions must be selected and evaluated carefully
- Poor test-retest correlations do not always mean that a test is unreliable; they may suggest that the characteristic under study has changed

PARALLEL-FORMS OR ALTERNATE-FORMS RELIABILITY
- Compares two equivalent forms of a test that measure the same attribute
- The two forms should be equally constructed (format, etc.)
- When two forms of the test are available, one can compare performance on one form versus the other: equivalent-forms or parallel-forms reliability
- Coefficient of equivalence: the degree of relationship between various forms of a test, evaluated by means of an alternate-forms correlation
- Parallel forms: for each form of the test, the means and variances of observed test scores are equal
- Alternate forms: different versions of a test that have been constructed so as to be parallel
- Drawbacks: (1) two test administrations with the same group are required; (2) test scores may be affected by factors such as motivation, etc.
- Problem: developing a new version of a test

INTERNAL CONSISTENCY
- How well does each item measure the content/construct under consideration?
- How consistent are the items with one another?
- Used when tests are administered once
- If all items on a test measure the same construct, then the test has good internal consistency
- Methods: split-half reliability, KR-20, Cronbach's alpha
SPLIT-HALF RELIABILITY
- Correlating two pairs of scores obtained from equivalent halves of a single test administered once
- Useful when it is impractical to assess reliability with two tests or to administer a test twice
- The results of one half of the test are then compared with the results of the other
- Rules in splitting a test into halves:
  o Do not divide the test in the middle, because that would lower the reliability
  o Different amounts of anxiety and differences in item difficulty shall also be considered
  o Randomly assign items to one or the other half of the test
  o Use the odd-even system: one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items
- To correct for half-length, apply the Spearman-Brown formula, which allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test (see the sketch below)
  o Also used if a test user wishes to shorten a test
  o Used to determine the number of items needed to attain a desired level of reliability
- Reliability increases as the test length increases
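The Spearman-Brown formula is r_SB = n * r / (1 + (n - 1) * r), where n is the factor by which the test length changes; a small sketch of both uses named above:

```python
def spearman_brown(r, n):
    """Estimated reliability when test length is multiplied by a factor n."""
    return n * r / (1 + (n - 1) * r)

# Split-half correction: the halves correlate .70, so estimate the
# full-length reliability with n = 2 (each half doubled in length).
print(f"{spearman_brown(0.70, 2):.3f}")    # ~0.824

# Shortening a test: a 100-item test with r = .90 cut to 50 items (n = 0.5).
print(f"{spearman_brown(0.90, 0.5):.3f}")  # ~0.818
```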
KUDER-RICHARDSON FORMULAS OR KR20/KR21
- The Kuder-Richardson technique simultaneously considers all possible ways of splitting the items
- The formula for calculating the reliability of a test in which the items are dichotomous, scored 0 or 1, is the Kuder-Richardson 20 (see p. 114)
- KR21 was introduced later; it uses an approximation of the sum of the pq products based on the mean test score
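A from-scratch sketch of KR-20, r = [k / (k - 1)] * (1 - sum(pq) / var_total), on hypothetical dichotomous item data (KR-21 would replace sum(pq) with an approximation based on the mean test score):

```python
def kr20(item_matrix):
    """KR-20 for dichotomous (0/1) items; rows = examinees, columns = items."""
    n, k = len(item_matrix), len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n   # total-score variance
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n       # proportion passing item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_t)

scores = [        # hypothetical: 5 examinees x 4 items, scored 0 or 1
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"KR-20 = {kr20(scores):.3f}")   # 0.800 for these data
```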
CRONBACH ALPHA
- Cronbach developed a formula that estimates the internal consistency of tests in which the items are not scored as 0 or 1: a more general reliability estimate, which he called coefficient alpha
- Sums the individual item variances (see the sketch below)
  o The most general method of finding estimates of reliability through internal consistency
- Domain sampling: define a domain that represents a single trait or characteristic, and each item is an individual sample of this general characteristic
- Factor analysis deals with the situation in which a test apparently measures several different characteristics
  o Good for the process of test construction
- Alpha is the most widely used measure of reliability because it requires only one administration of the test
- Ranges from 0 to 1; "bigger is always better"
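A from-scratch sketch of coefficient alpha, alpha = [k / (k - 1)] * (1 - sum(item variances) / var_total), on hypothetical Likert-style ratings (unlike KR-20, the items need not be scored 0/1):

```python
def cronbach_alpha(item_matrix):
    """Coefficient alpha; rows = respondents, columns = items (any numeric scale)."""
    n, k = len(item_matrix), len(item_matrix[0])

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [variance([row[j] for row in item_matrix]) for j in range(k)]
    total_var = variance([sum(row) for row in item_matrix])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

ratings = [   # hypothetical 1-5 ratings: 4 respondents x 3 items
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [1, 2, 2],
]
print(f"alpha = {cronbach_alpha(ratings):.3f}")
```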
Other Methods of Estimating Internal Consistency
- Inter-item consistency: refers to the degree of correlation among all the items on a scale
  o A measure of inter-item consistency is calculated from a single administration of a single form of a test
  o An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test
  o Tests are said to be homogeneous if they contain items that measure a single trait
  o Definition: the degree to which a test measures a single factor
  o Heterogeneity: the degree to which a test measures different factors
  o Ex: a homogeneous test assesses knowledge only of television repair skills, vs. a general electronics repair test (heterogeneous)
  o The more homogeneous a test is, the more inter-item consistency it can be expected to have
  o Test homogeneity is desirable because it allows relatively straightforward test-score interpretation
  o Test takers with the same score on a homogeneous test probably have similar abilities in the area tested
  o Test takers with the same score on a heterogeneous test may have quite different abilities
  o However, homogeneous testing is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality

Measures of Inter-Scorer Reliability
- In some types of tests, under some conditions, the score may be more a function of the scorer than of anything else
- Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
- Coefficient of inter-scorer reliability: a coefficient of correlation used to determine the degree of consistency among scorers in the scoring of a test
- The kappa statistic is the best method for assessing the level of agreement among several observers (see the sketch below)
  o Indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement
  o Cohen's kappa: 2 raters
  o Fleiss' kappa: 3 or more raters
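A sketch of Cohen's kappa for two raters, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the chance agreement computed from the raters' marginal proportions (the ratings are hypothetical; Fleiss' kappa would be used for 3 or more raters):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each category's marginal proportions, summed.
    expected = sum(freq_a[c] / n * freq_b[c] / n
                   for c in freq_a.keys() | freq_b.keys())
    return (observed - expected) / (1 - expected)

# Hypothetical yes/no scoring decisions by two scorers over 10 protocols
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
b = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
print(f"kappa = {cohens_kappa(a, b):.3f}")   # ~0.57 for these ratings
```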
HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS
- Homogeneous items yield a high degree of reliability

DYNAMIC VS. STATIC CHARACTERISTICS
- Dynamic: a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences
- Static: a trait, state, or ability that is relatively unchanging

RESTRICTION OR INFLATION OF RANGE
- If the range of scores is restricted, reliability tends to be lower
- If it is inflated, reliability tends to be higher

SPEED TESTS VS. POWER TESTS
- Speed test: homogeneous and easy items, but a short time limit
- Power test: few items, but more complex

CRITERION-REFERENCED TESTS
- Provide an indication of where a testtaker stands with respect to some variable or criterion
- Tend to contain material that has been mastered in hierarchical fashion
- Scores here tend to be interpreted in pass-fail terms
- A measure of reliability depends on the variability of the test scores: how different the scores are from one another

The Domain Sampling Model
- This model considers the problems created by using a limited number of items to represent a larger and more complicated construct
- Our task in reliability analysis is to estimate how much error we would make by using the score from the shorter test as an estimate of true ability
- Conceptualizes reliability as the ratio of the variance of the observed score on the shorter test and the variance of the long-run true score
- Reliability can be estimated from the correlation of the observed test score with the true score

Item Response Theory
- Classical test theory requires that exactly the same test items be administered to each person: a drawback
- Item response theory (IRT) is newer: the computer is used to focus on the range of item difficulty that helps assess an individual's ability level
  o A more reliable estimate of ability is obtained using a shorter test with fewer items
  o Takes a lot of items and effort

Generalizability Theory
- Based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation
- Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score
- This universe is described in terms of its facets, which include things like
the number of items in the test, the amount of training the test scorers
have had, and the purpose of the test administration
- According to generalizability theory, given the exact same conditions of all
the facets in the universe, the exact same test score should be obtained
- Universe score: the test score obtained; it is analogous to a true score in the true score model
- Cronbach suggested that tests be developed with the aid of a
generalizability study followed by a decision study
- Generalizability study: examines how generalizable scores from a
particular test are if the test is administered in different situations
- How much of an impact different facets of the universe have on the test
score
- Ex: is the test score affected by group as opposed to individual
administration
- Coefficients of generalizability: the influence of particular facets on the test score is represented by these coefficients, which are similar to reliability coefficients in the true score model
- Decision study: developers examine the usefulness of test scores in helping the test user make decisions
- The decision study is designed to tell the test user how test scores should
be used and how dependable those scores are as a basis for decisions,
depending on the context of their use
CHAPTER 6: VALIDITY
The Concept of Validity
- Validity: as applied to a test, a judgment or estimate of how well a test measures what it purports to measure in a particular context
  o A judgment based on evidence about the appropriateness of inferences drawn from test scores
  o The validity of a test must be shown from time to time to account for culture and advancement
- Inference: a logical result or deduction
- "Acceptable" or "weak" describe the validity of tests and test scores
- Validation: the process of gathering and evaluating evidence about validity
  o Test user and testtaker both have roles in the validation of a test
  o Test users may conduct their own validation studies: these may yield insights regarding a particular population of testtakers as compared to the norming sample (in the manual)
  o Local validation studies: absolutely necessary when a test user plans to alter in some way the format, instructions, language, or content of the test
- Types of Validity (Trinitarian view) *not mutually exclusive: all contribute to a unified picture of a test's validity; critics call this approach fragmented and incomplete
  o Content validity: a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test
  o Criterion-related validity: a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures
  o Construct validity: a measure of validity arrived at by executing a comprehensive analysis of (an umbrella validity: every other variety of validity falls under it):
    ▪ How scores on the test relate to other test scores and measures
    ▪ How scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
- Strategies: ways of approaching the process of test validation
  o Content validation strategies
  o Criterion-related validation strategies
  o Construct validation strategies
- Face Validity
  o Face validity: relates more to what a test appears to measure to the person being tested than to what the test actually measures
  o A judgment concerning how relevant the test items appear to be, usually from the testtaker, not the test user
  o Lack of face validity = lack of confidence in the perceived effectiveness of the test, which decreases the testtaker's motivation/cooperation *the test may still be useful
- Content validity
  o Content validity: a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
    ▪ Ideally, test developers have a clear vision of the construct being measured; that clarity is reflected in the content validity of the test
  o Test blueprint: the structure of the evaluation; a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.
    ▪ Behavior observation is a technique frequently used in test blueprinting
  o The quantification of content validity
    ▪ Important in employment settings: tests used to hire and promote
    ▪ One method: a method for gauging agreement among raters or judges regarding how essential a particular item is (C. H. Lawshe)
      "Is the skill or knowledge measured by this item...
        o Essential
        o Useful but not essential
        o Not necessary
      ...to the performance of the job?"
    ▪ Content validity ratio (CVR): CVR = (ne - N/2) / (N/2), calculated for each item (computed in the sketch below)
      o CVR = content validity ratio
      o ne = number of panelists stating "essential"
      o N = total number of panelists
  o Culture and the relativity of content validity
    ▪ Tests are thought of as either valid or invalid
    ▪ What constitutes historical fact depends to some extent on who is writing the history
    ▪ Culture relativity
    ▪ Politics (political correctness)
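A minimal implementation of Lawshe's CVR as defined above (the panel counts are hypothetical):

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR = (ne - N/2) / (N/2); ranges from -1 to +1.
    CVR = 0 means exactly half the panel rated the item 'essential'."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical item: 9 of 12 panelists rate the item "essential"
print(f"CVR = {content_validity_ratio(9, 12):.2f}")   # 0.50
```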
Criterion-Related Validity
- Criterion-related validity: a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest (the measure of interest being the criterion)
- 2 types:
  o Concurrent validity: an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)
  o Predictive validity: an index of the degree to which a test score predicts some criterion measure
- What Is a Criterion?
  o Criterion: a standard on which a judgment or decision may be based; the standard against which a test or a test score is evaluated (criterion-related validity)
  o Characteristics of a criterion:
    ▪ Relevant: pertinent or applicable to the matter at hand
    ▪ Valid (for the purpose for which it is being used)
    ▪ Uncontaminated. Criterion contamination: the term applied to a criterion measure that has been based, at least in part, on predictor measures
- Concurrent Validity
  o Test scores are obtained at about the same time as the criterion measures; measures of the relationship between the test scores and the criterion provide evidence of concurrent validity
  o Indicates the extent to which test scores may be used to estimate an individual's present standing on a criterion
  o Once the validity of inferences from test scores is established, the test provides a faster, less expensive way to offer a diagnosis or a classification decision
  o The concurrent validity of a test can be explored with respect to another test
    ▪ Prior research must have satisfactorily demonstrated the first test's validity
    ▪ The first test = the validating criterion
- Predictive Validity
  o Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place
    ▪ Intervening event: training, experience, therapy, medication, etc.
    ▪ Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of predictive validity (how accurately scores on the test predict some criterion measure)
  o Ex: SAT test score and freshman GPA
  o Judgments of criterion-related validity are based on 2 types of statistical evidence:
    ▪ The validity coefficient
      o Validity coefficient: a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure
      o Ex: the Pearson correlation coefficient (r) used to determine the validity between 2 measures
      o Affected by restriction or inflation of range
      o Is the range of scores employed appropriate to the objective of the correlational analysis?
      o There are no rules regarding the validity coefficient (how high or low it should/could be for a test to be valid)
      o Incremental validity: used when there is more than one predictor
        ▪ Incremental validity: the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use
    ▪ Expectancy data
      o Expectancy data: provide information that can be used in evaluating the criterion-related validity of a test
      o A score obtained on the test → the likelihood that the testtaker will score within some interval of scores on a criterion measure ("passing," "acceptable," etc.)
      o Expectancy table: shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion
        ▪ May be created from a scatterplot
        ▪ Shows relationships
      o Expectancy chart: a graphic representation of an expectancy table
        ▪ The higher the initial rating, the greater the probability of job/academic success
      o Taylor-Russell Tables: provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection
        ▪ Selection ratio: the relationship between the number of people to be hired and the number of people available to be hired
        ▪ Base rate: the percentage of people hired under the existing system for a particular position
        ▪ The relationship between predictor and criterion must be linear
      o Naylor-Shine Tables: use the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
  o Decision theory and test utility
    ▪ Base rate: the extent to which a particular trait, behavior, characteristic, or attribute exists in the population
    ▪ Hit rate: the proportion of people a test accurately identifies as possessing or exhibiting a particular trait
    ▪ Miss rate: the proportion of people the test fails to identify as having or not having a particular attribute
    ▪ False positive (Type I error): the test predicts that the testtaker possesses a particular attribute when he or she actually does not; ex: scored above the cutoff score and was hired, but failed on the job
    ▪ False negative (Type II error): the test predicts that the testtaker does not possess a particular attribute when he or she actually does; ex: scored below the cutoff score and was not hired, but could have been successful in the job (see the tally sketch below)
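A minimal tally of these decision outcomes (the scores and cutoff are hypothetical; "hit" is read here as any correct classification, whether a correct acceptance or a correct rejection):

```python
# Test predicts "hire" when score >= cutoff; the criterion records
# whether the person actually succeeded on the job.
cutoff = 60
records = [   # (test score, succeeded on the criterion?)
    (72, True), (65, False), (58, True), (81, True),
    (45, False), (63, True), (55, False), (90, False),
]

hits = sum((s >= cutoff) == ok for s, ok in records)           # correct decisions
false_pos = sum(s >= cutoff and not ok for s, ok in records)   # hired but failed (Type I)
false_neg = sum(s < cutoff and ok for s, ok in records)        # rejected but would have succeeded (Type II)

n = len(records)
print(f"hit rate = {hits/n:.2f}, false positives = {false_pos/n:.2f}, "
      f"false negatives = {false_neg/n:.2f}")
```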
- Construct Validity
  o Construct validity: a judgment about the appropriateness of inferences drawn from test scores regarding individual standing on a variable called a construct
    ▪ Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior
      o Ex: intelligence, depression, motivation, personality, etc.
      o Unobservable, presupposed (underlying) traits that a test developer invokes to describe test behavior or criterion performance
    ▪ Viewed as the unifying concept for all validity evidence
  o Evidence of Construct Validity
    ▪ Various techniques of construct validation provide evidence that:
      o The test is homogeneous, measuring a single construct
      o Test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation (as theoretically predicted)
      o Test scores obtained after some event or the passage of time differ from pretest scores (as theoretically predicted)
      o Test scores obtained by people from distinct groups vary (as theoretically predicted)
      o Test scores correlate with scores on other tests (as theoretically predicted)
    ▪ Evidence of homogeneity
      o Homogeneity: refers to how uniform a test is in measuring a single concept
      o Evidence: correlations between subtest scores and total test scores
      o Item-analysis procedures have been used in the quest for test homogeneity
      o Desirable but not necessary
      o Contributes no information about how the construct being measured relates to other constructs
    ▪ Evidence of changes with age
      o If a test purports to measure a construct that changes over time, then the test scores, too, should show progressive changes to be considered a valid measurement of the construct
      o Does not in itself provide information about how the construct relates to other constructs
    ▪ Evidence of pretest-posttest changes
      o Can be evidence of construct validity
      o Some of the more typical intervening experiences responsible for changes in test scores are:
        ▪ Formal education
        ▪ Therapy/medication
        ▪ Any life experience
    ▪ Evidence from distinct groups / method of contrasted groups
      o Method of contrasted groups: one way of providing evidence for the validity of a test is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group
      o Rationale: if a test is a valid measure of a particular construct, groups of people presumed to differ with respect to that construct should have correspondingly different test scores
    ▪ Convergent evidence
      o Evidence for the construct validity of a particular test may converge from a number of sources, such as other tests or measures designed to assess the same or a similar construct
      o Convergent evidence: scores on a test undergoing construct validation correlate
        highly in the predicted direction with scores on older, more established, already validated tests designed to measure the same or a similar construct
    ▪ Discriminant evidence
      o Discriminant evidence: a validity coefficient showing little relationship between test scores and/or other variables with which scores on the test being construct-validated should not theoretically be correlated
      o Provides evidence of construct validity
      o Multitrait-multimethod matrix: "two or more traits," "two or more methods"; the matrix/table that results from correlating variables (traits) within and between methods
    ▪ Factor analysis
      o Factor analysis: a shorthand term for a class of mathematical procedures designed to identify factors, or specific variables that are typically attributes, characteristics, or dimensions on which people may differ
      o Frequently used as a data reduction method in which several sets of scores and the correlations between them are analyzed
      o Confirmatory factor analysis: researchers test the degree to which a hypothetical model fits the actual data
        ▪ Factor loading: conveys information about the extent to which the factor determines the test score or scores
        ▪ Complex procedures
- Validity, Bias, and Fairness
  o Test Bias
    ▪ Bias: a factor inherent in a test that systematically prevents accurate, impartial measurement
    ▪ There are technical (mathematical) means to identify and remedy bias
    ▪ Bias implies systematic variation
    ▪ Rating error
      o Rating: a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptions, known as a rating scale
      o Rating error: a judgment resulting from the intentional or unintentional misuse of a rating scale
      o Leniency error/generosity error: an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading
      o Severity error: the rater exhibits a general and systematic tendency to be overly severe in scoring, marking, and/or grading
      o Central tendency error: the rater exhibits a general and systematic reluctance to give ratings at either the positive or negative extreme
    ▪ One way to overcome restriction-of-range rating errors is to use rankings: a procedure that requires the rater to measure individuals against one another instead of against an absolute scale
      o The rater is forced to select a 1st, 2nd, 3rd, etc.
    ▪ Halo effect: the fact that, for some raters, some ratees can do no wrong
      o The tendency to give a particular ratee a higher rating than he or she objectively deserves
      o Criterion data may be influenced by the rater's knowledge of the ratee's race, gender, etc.
  o Test fairness
    ▪ Issues of fairness tend to be more difficult and involve values
    ▪ Fairness: the extent to which a test is used in an impartial, just, and equitable way
    ▪ Sources of misunderstanding:
      o Discrimination
      o Group not included in the standardization sample
      o Performance differences between identified groups

Relationship Between Reliability and Validity
- A test should not correlate more highly with any other variable than it correlates with itself
- A modest correlation between the true scores on two traits may be missed if the test for each of the traits is not highly reliable
- We can have reliability without validity
  o It is impossible to demonstrate that an unreliable test is valid
CHAPTER 7: UTILITY
Utility: the usefulness or practical value of testing to improve efficiency

Factors that Affect a Test's Utility
- Psychometric Soundness
  o The reliability and validity of a test
  o Gives us the practical value of both the scores (reliability and validity)
  o They tell us whether decisions are cost-effective
  o A valid test is not always a useful test
    ▪ Especially if testtakers do not follow the test directions
- Costs
  o Economic and noneconomic
  o Ex.) using a less expensive, and therefore less stringent, application process for airline personnel
- Benefits
  o Profits, gains, advantages
  o Ex.) a more stringent hiring policy → more productive employees
  o Ex.) maintaining a successful academic environment at a university

- Relative cut score: based on norm-related considerations rather than on the relationship of test scores to a criterion
  o Also called a norm-referenced cut score
  o Ex.) the top 10% of test scores get A's
- Fixed cut score: set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification
  o Also called an absolute cut score
- Multiple cut scores: using two or more cut scores with reference to one predictor for the purpose of categorizing testtakers
  o Ex.) having cut scores that mark an A, B, C, etc., all measuring the same predictor
- Multiple hurdles: for success, requires an individual to complete many tasks, with elimination at each level
  o Ex.) written application → group interview → personal interview, etc.
- Compensatory model of selection: the assumption is made that high scores on one attribute can compensate for low scores on another attribute (see the sketch below)
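A sketch contrasting multiple hurdles with the compensatory model on the same applicants (the names, weights, and cutoffs are all invented for illustration):

```python
# Each applicant has (interview, exam) scores; both models decide pass/fail.
applicants = {"Ana": (55, 90), "Ben": (75, 72), "Cai": (40, 95)}

HURDLE_INTERVIEW, HURDLE_EXAM = 50, 70   # multiple hurdles: must clear every one
W_INTERVIEW, W_EXAM = 0.4, 0.6           # compensatory: weighted composite
COMPOSITE_CUT = 70

for name, (interview, exam) in applicants.items():
    hurdles = interview >= HURDLE_INTERVIEW and exam >= HURDLE_EXAM
    composite = W_INTERVIEW * interview + W_EXAM * exam   # high exam can offset low interview
    print(f"{name}: hurdles={'pass' if hurdles else 'fail'}, "
          f"compensatory={composite:.1f} ({'pass' if composite >= COMPOSITE_CUT else 'fail'})")

# Cai fails the interview hurdle but passes under the compensatory model,
# because the high exam score compensates for the low interview score.
```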
o Many subjects tested at a time
o Subjects record their own responses
o Subjects are not praised for responding
o Low scores on group tests are often difficult to interpret
o No safeguards

Advantages of Individual Tests
- Provide information beyond the test score
- Allow the examiner to observe behavior in a standard setting
- Allow individualized interpretation of test scores

Advantages of Group Tests
- Are cost-efficient
- Minimize professional time for administration and scoring
- Require less examiner skill and training
- Have more objective and more reliable scoring procedures
- Have especially broad application

Overview of Group Tests
Characteristics of Group Tests
- Characterized as paper-and-pencil or booklet-and-pencil tests, because the only materials needed are a printed booklet of test items, a test manual, a scoring key, an answer sheet, and a pencil
- Computerized group testing is becoming more popular
- Most group tests are multiple choice; some are free response
- Group tests outnumber individual tests
  o One major difference is whether the test is primarily verbal, nonverbal, or a combination
- Group test scores can be converted to a variety of units

Selecting Group Tests
- The test user need never settle for anything but well-documented and psychometrically sound tests

Using Group Tests
- Use tests that are as reliable and well standardized as the best individual tests
- Validity data for some group tests are weak, meager, or contradictory

Use Results with Caution
- Never consider scores in isolation or as absolutes
- Be careful using tests for prediction
- Avoid overinterpreting test scores

Be Especially Suspicious of Low Scores
- Assume that subjects understand the purpose of testing, want to succeed, and are equally rested and free of stress

Consider Wide Discrepancies a Warning Signal
- May reflect emotional problems or severe stress

When in Doubt, Refer
- With low scores, discrepancies, etc., refer the subject for individual testing
- Get a trained professional

Group Tests in the Schools: Kindergarten Through 12th Grade
- The purpose of these tests is to measure educational achievement in schoolchildren

Achievement Tests versus Aptitude Tests
- Achievement tests attempt to assess what a person has learned following a specific course of instruction
  o Evaluate the product of a course of training
  o Validity is determined primarily by content-related evidence
- Aptitude tests attempt to evaluate a student's potential for learning rather than how much a student has already learned
  o Evaluate effects of unknown and uncontrolled experiences
  o Validity is judged primarily on the ability to predict future performance
- Intelligence tests measure general ability
- These three test types are highly interrelated

Group Achievement Tests
- The Stanford Achievement Test is one of the oldest of the standardized achievement tests widely used in school systems
- Well-normed and criterion-referenced, with psychometric documentation
- Another is the Metropolitan Achievement Test, which measures achievement in reading by evaluating vocabulary, word recognition, and reading comprehension
- Both of these are reliable and normed on big samples

Group Tests of Mental Abilities (Intelligence)
Kuhlmann-Anderson Test (KAT) – 8th Edition
- The KAT is a group intelligence test with 8 separate levels covering kindergarten through 12th grade
- Items are primarily nonverbal at lower levels, requiring minimal reading and language ability
- Suited to young children and those who might be handicapped in following procedures
- Scores can be expressed in verbal, quantitative, and total scores
- Scores at other levels can be expressed as percentile bands: like a confidence interval; provides the range of percentiles that most likely represents a subject's true score
- Good construction, standardization, and other excellent psychometric qualities
- Good validity and reliability
- Potential for use and adaptation with non-English-speaking individuals, or even other countries, needs to be explored

Henmon-Nelson Test (H-NT)
- A test of mental abilities
- 2 sets of norms available: one based on raw-score distributions by age, the other on raw-score distributions by grade
- Reliabilities in the .90s
- Helps predict future academic success quickly
- Does NOT consider multiple intelligences

Cognitive Abilities Test (COGAT)
- Good reliability
- Provides three separate scores: verbal, quantitative, and nonverbal
- Item selection is superior to the H-NT in terms of selecting minority, culturally diverse, and economically disadvantaged children
- Can be adapted for use outside the US
- No cultural bias
- Each of the subtests requires 32-34 minutes of actual working time, which the manual recommends be spread out over 2-3 days
- Standard age scores averaged some 15 points lower for African American students on the verbal and quantitative batteries

Summary of K-12 Group Tests
- All are sound, viable instruments

College Entrance Tests
- SAT Reasoning Test, Cooperative School and College Ability Tests, and American College Test

SAT Reasoning Test
- Most widely used college entrance test
- Used by 1000+ private and public institutions
- Renorming of the SAT did not alter the standing of test takers relative to one another in terms of percentile rank
- New scoring (2400) is likely to reduce interpretation errors, as interpreters can no longer rely on comparisons with older versions
- 45 minutes longer: 3 hours and 45 minutes to administer
- May disadvantage students with disabilities such as ADD
- Verbal section now called "critical reading": focus on reading comprehension
- Math section eliminated much of the basic grammar-school math questions
- Weakness: poor predictive power regarding the grades of students who score in the middle ranges
- Little doubt that the SAT predicts first-year college GPA
  o But, African Americans and Latinos