Cohen-Based Summary of Psychological Testing & Assessment
Bachelor of Science in Psychology (University of San Jose-Recoletos)

CHAPTER 1: PSYCHOLOGICAL TESTING AND ASSESSMENT


TESTING AND ASSESSMENT
 Roots can be found in early twentieth-century France, in 1905
 Alfred Binet published a test designed to help place Paris schoolchildren
 In WWI, the military used tests to screen large numbers of recruits quickly for intellectual and emotional problems
 In WWII, the military depended even more on tests to screen recruits for service

PSYCHOLOGICAL TESTING VS. PSYCHOLOGICAL ASSESSMENT
 DEFINITION
o Testing: the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior
o Assessment: the gathering and integration of psychology-related data for the purpose of making a psychological evaluation, accomplished through the use of tools of evaluation
 OBJECTIVE
o Testing: to obtain some gauge, usually numerical in nature
o Assessment: to answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation
 PROCESS
o Testing: may be individualized or group
o Assessment: typically individualized
 ROLE OF EVALUATOR
o Testing: the tester is not key to the process and may be substituted
o Assessment: the assessor is key in the process of selecting tests as well as in drawing conclusions
 SKILL OF EVALUATOR
o Testing: requires technician-like skills
o Assessment: typically requires an educated selection of tools and skill in evaluation
 OUTCOME
o Testing: typically yields a test score
o Assessment: entails a logical problem-solving approach to answer the referral question

3 FORMS OF ASSESSMENT:
1. COLLABORATIVE PSYCHOLOGICAL ASSESSMENT – assessor and assessee work as partners from initial contact through final feedback
2. THERAPEUTIC PSYCHOLOGICAL ASSESSMENT – self-discovery and new understandings are encouraged throughout the assessment process
3. DYNAMIC PSYCHOLOGICAL ASSESSMENT – follows a model of (1) evaluation, (2) intervention, (3) evaluation; provides a means for evaluating how the assessee processes or benefits from some type of intervention during the course of evaluation

Tools of Psychological Assessment
A. The Test (a measuring device or procedure)
1. psychological test: a device or procedure designed to measure variables related to psychology (intelligence, personality, aptitude, interests, attitudes, or values)
2. format: refers to the form, plan, structure, arrangement, and layout of test items as well as to related considerations such as time limits
a) also referred to as the form in which a test is administered (pen and paper, computer, etc.); computers can generate scenarios
b) the term is also used to denote the form or structure of other evaluative tools and processes, such as the guidelines for creating a portfolio work sample
3. Ways that tests differ from one another:
a) administration procedures
(1) some test administrators have an active role
(a) some test administration involves demonstration of tasks
(b) usually one-on-one
(c) trained observation of the assessee's performance
(2) some test administrators don't even have to be present
(a) usually administered to larger groups
(b) test takers complete tasks independently
b) scoring and interpretation procedures
(1) score: a code or summary statement, usually (but not necessarily) numerical in nature, that reflects an evaluation of performance on a test, task, interview, or some other sample of behavior
(2) scoring: the process of assigning such evaluative codes/statements to performance on tests, tasks, interviews, or other behavior samples
(3) different types of score:
(a) cut score: a reference point, usually numerical, derived by judgment and used to divide a set of data into two or more classifications
(i) sometimes reached without any formal method, in "eyeball" fashion: teachers who decide what is passing and what is failing
(4) who scores it
(a) self-scored by the testtaker
(b) computer
(c) trained examiner
c) psychometric soundness/technical quality
(1) psychometrics: the science of psychological measurement
(a) soundness refers to how consistently and how accurately a psychological test measures what it purports to measure
(2) utility: refers to the usefulness or practical value that a test or other tool of assessment has for a particular purpose
B. The Interview: a method of gathering information through direct communication involving reciprocal exchange
1. an interviewer in a face-to-face interview takes note of
a) verbal language
b) nonverbal language
(1) body language movements
(2) facial expressions in response to the interviewer
(3) the extent of eye contact
(4) apparent willingness to cooperate
c) how the interviewee is dressed
(1) neat vs. sloppy vs. inappropriate
2. an interviewer over the phone takes note of
a) changes in the interviewee's voice pitch
b) long pauses
c) signs of emotion in response
3. ways that interviews differ:
a) length, purpose, and nature
b) in order to help make diagnostic, treatment, selection, and other decisions
4. panel interview
a) an interview conducted with one interviewee and more than one interviewer
C. The Portfolio
1. files of work products: paper, canvas, film, video, audio, etc.
2. samples of one's abilities and accomplishments
D. Case History Data: records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee
1. sheds light on an individual's past and current adjustment as well as on events and circumstances that may have contributed to any changes in adjustment
2. provides information about neuropsychological functioning prior to the occurrence of a trauma or other event that results in a deficit
3. gives insight into current academic and behavioral standing
4. useful in making judgments for future class placements
5. case history study: a report or illustrative account concerning a person or an event that was compiled on the basis of case history data
a) might shed light on how one individual's personality and particular set of environmental conditions combined to produce a successful world leader
b) groupthink: work on a social psychological phenomenon; contains rich case history material on collective decision making that did not always result in the best decisions
E. Behavioral Observation: monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions
1. often used as a diagnostic aid in various settings: inpatient facilities, behavioral research laboratories, classrooms
2. naturalistic observation: behavioral observation that takes place in a naturally occurring setting (as opposed to a research laboratory) for the purpose of evaluation and information-gathering
3. in practice, tends to be used most frequently by researchers in settings such as classrooms, clinics, prisons, etc.
F. Role-Play Tests
1. role play: acting an improvised or partially improvised part in a simulated situation
2. role-play test: a tool of assessment wherein assessees are directed to act as if they were in a particular situation; assessees are then evaluated with regard to their expressed thoughts, behaviors, abilities, etc.
G. Computers as Tools
1. local processing: on-site computerized scoring, interpretation, or other conversion of raw test data; contrast with central processing and teleprocessing
2. central processing: computerized scoring, interpretation, or other conversion of raw test data that is physically transported to a central location from the same or other test sites; contrast with local processing and teleprocessing
3. teleprocessing: computerized scoring, interpretation, or other conversion of raw test data sent over telephone lines by modem from a test site to a central location for computer processing; contrast with central and local processing
4. simple score report: a type of scoring report that provides only a listing of scores
5. extended scoring report: a type of scoring report that provides a listing of scores AND statistical data
6. interpretive report: a formal or official computer-generated account of test performance presented in both numeric and narrative form and including an explanation of the findings
a) the three varieties of interpretive report are
(1) descriptive
(2) screening
(3) consultative
b) some contain relatively little interpretation and simply call attention to certain high, low, or unusual scores that need to be focused on
c) consultative report: a type of interpretive report designed to provide expert and detailed analysis of test data that mimics the work of an expert consultant
d) integrative report: a form of interpretive report of psychological assessment, usually computer-generated, in which data from behavioral, medical, administrative, and/or other sources are integrated
7. CAPA: computer-assisted psychological assessment (assistance to the test user, not the test taker)
a) enables test developers to create psychometrically sound tests using complex mathematical procedures and calculations
b) enables test users to construct tailor-made tests with built-in scoring and interpretive capabilities
c) Pros:
(1) test administrators have greater access to potential test users because of the global reach of the internet
(2) scoring and interpretation of test data tend to be quicker than for paper-and-pencil tests
(3) costs associated with internet testing tend to be lower than costs associated with paper-and-pencil tests
(4) the internet facilitates the testing of otherwise isolated populations, as well as people with disabilities for whom getting to a test center might prove a hardship
(5) greener: conserves paper, shipping materials, etc.
d) Cons:
(1) test client integrity
(a) refers to the verification of the identity of the test taker when a test is administered online
(b) also refers to the sometimes conflicting interests of the test taker vs. those of the test administrator; the test taker might have access to notes, aids, internet resources, etc.
(c) internet testing is only testing, not assessment
8. CAT: computerized adaptive testing: an interactive, computer-administered test-taking process wherein items presented to the test taker are based in part on the test taker's performance on previous items (see the sketch below)
a) EX: on a computerized test of academic abilities, the computer might be programmed to switch from testing math skills to English skills after three consecutive failures on math items
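The adaptive logic in the CAT entry above can be made concrete with a minimal Python sketch. This is a toy illustration, not the procedure of any real CAT system: the item pool, the fixed-step ability update, and the stopping rule are all invented simplifications (real CATs rely on item response theory).

# Toy CAT loop: each item is chosen to match the current ability
# estimate; the estimate moves up after a correct answer and down
# after an incorrect one, by smaller and smaller steps.
def run_cat(item_pool, get_response, n_items=10, start=0.0, step=1.0):
    ability = start
    remaining = list(item_pool)        # items are (item_id, difficulty) pairs
    for _ in range(min(n_items, len(remaining))):
        # pick the unadministered item closest in difficulty to the estimate
        item = min(remaining, key=lambda it: abs(it[1] - ability))
        remaining.remove(item)
        correct = get_response(item)   # True if the testtaker answers correctly
        ability += step if correct else -step
        step *= 0.7                    # settle toward a stable estimate
    return ability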
H. Other Tools
1. DVD: how would you respond to the events that take place in the video?
a) sexual harassment in the workplace
b) responding to various types of emergencies
c) diagnosis/treatment plan for clients on videotape
2. thermometers, biofeedback, etc.

TEST DEVELOPER
 The ones who create tests.
 They conceive, prepare, and develop tests. They also find a way to disseminate their tests, by publishing them either commercially or through professional publications such as books or periodicals.
TEST USER
 They select or decide to take a specific test off the shelf and use it for some purpose. They may also participate in other roles, e.g., as examiners or scorers.
TEST TAKER
 Anyone who is the subject of an assessment
 Test takers may vary on a continuum with respect to numerous variables, including:
o The amount of test anxiety they experience and the degree to which that anxiety might affect the results
o The extent to which they understand and agree with the rationale of the assessment
o Their capacity and willingness to cooperate
o The amount of physical pain or emotional distress they are experiencing
o The amount of physical discomfort
o The extent to which they are alert and wide awake
o The extent to which they are predisposed to agreeing or disagreeing when presented with stimulus statements
o The extent to which they have received prior coaching
o The importance they may attribute to portraying themselves in a good light
 Psychological autopsy – reconstruction of a deceased individual's psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee

TYPES OF SETTINGS
 EDUCATIONAL SETTING
o achievement test: evaluation of accomplishment or the degree of learning that has taken place, usually with regard to an academic area
o diagnosis: a description or conclusion reached on the basis of evidence and opinion through a process of distinguishing the nature of something and ruling out alternative conclusions
o diagnostic test: a tool used to make a diagnosis, usually to identify areas of deficit to be targeted for intervention
o informal evaluation: a typically nonsystematic, relatively brief, and "off the record" assessment leading to the formation of an opinion or attitude, conducted by any person in any way for any reason, in an unofficial context and not subject to the same ethics or standards as evaluation by a professional
 CLINICAL SETTING
o these tools are used to help screen for or diagnose behavior problems
o group testing is used primarily for screening: identifying those individuals who require further diagnostic evaluation
 COUNSELING SETTING
o schools, prisons, and governmental or privately owned institutions
o ultimate objective: the improvement of the assessee in terms of adjustment, productivity, or some related variable
 GERIATRIC SETTING
o quality of life: in psychological assessment, an evaluation of variables such as perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social support
 BUSINESS AND MILITARY SETTINGS
 GOVERNMENTAL AND ORGANIZATIONAL CREDENTIALING

How Are Assessments Conducted?
 protocol: the form or sheet or booklet on which a testtaker's responses are entered
o the term might also be used to refer to a description of a set of test- or assessment-related procedures, as in the sentence, "the examiner dutifully followed the complete protocol for the stress interview"
 rapport: the working relationship between the examiner and the examinee

ASSESSMENT OF PEOPLE WITH DISABILITIES
 Define who requires alternate assessment, how such assessments are to be conducted, and how meaningful inferences are to be drawn from the data derived from such assessments
 Accommodation – adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs
o Ex.) translating a test into Braille and administering it in that form
 Alternate assessment – an evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods
 Consider these four variables when deciding which of many different types of accommodation should be employed:
o The capabilities of the assessee
o The purpose of the assessment
o The meaning attached to test scores
o The capabilities of the assessor

REFERENCE SOURCES
 TEST CATALOGUES – contain brief descriptions of tests
 TEST MANUALS – detailed information
 REFERENCE VOLUMES – one-stop shopping; provide detailed information for each test listed, including test publisher, author, purpose, intended test population, and test administration time
 JOURNAL ARTICLES – contain reviews of tests
 ONLINE DATABASES – most widely used bibliographic databases

TYPES OF TESTS
 INDIVIDUAL TEST – given to only one person at a time
 GROUP TEST – administered to more than one person at a time by a single examiner
 ABILITY TESTS:
o ACHIEVEMENT TESTS – refer to previous learning (ex. spelling)
o APTITUDE/PROGNOSTIC TESTS – refer to the potential for learning or acquiring a specific skill
o INTELLIGENCE TESTS – refer to a person's general potential to solve problems
 PERSONALITY TESTS: refer to overt and covert dispositions
o OBJECTIVE/STRUCTURED TESTS – usually self-report; require the subject to choose between two or more alternative responses
o PROJECTIVE/UNSTRUCTURED TESTS – present an ambiguous stimulus with unclear response requirements, onto which the subject is assumed to project his or her own dispositions
o INTEREST TESTS – measure preferences for certain activities
CHAPTER 2: HISTORICAL, CULTURAL AND LEGAL/ETHICAL CONSIDERATIONS


A HISTORICAL PERSPECTIVE
19TH CENTURY
 Tests and testing programs first came into being in China
 Testing was instituted as a means of selecting which of many applicants would obtain government jobs (civil service)
 Job applicants were tested on proficiency in endeavors such as music, archery, knowledge, and skill
GRECO-ROMAN WRITINGS (Middle Ages)
 The world seen in terms of evil
 Deficiency in some bodily fluid as a factor believed to influence personality
 Hippocrates and Galen
RENAISSANCE
 Christian von Wolff – anticipated psychology as a science and psychological measurement as a specialty within that science
CHARLES DARWIN AND INDIVIDUAL DIFFERENCES
 Inspired tests designed to measure individual differences in ability and personality among people
 "Origin of Species": chance variation in species would be selected or rejected by nature according to adaptivity and survival value; "survival of the fittest"
FRANCIS GALTON
 Explored and quantified individual differences between people
 Classified people "according to their natural gifts"
 Displayed the first anthropometric laboratory
KARL PEARSON
 Developed the product-moment correlation technique
 His work can be traced directly to Galton
WILHELM MAX WUNDT
 Founded the first experimental psychology laboratory, at the University of Leipzig
 Focused more on how people were similar, not different from each other
JAMES MCKEEN CATTELL
 Studied individual differences in reaction time
 Coined the term "mental test"
CHARLES SPEARMAN
 Originated the concept of test reliability and built the mathematical framework for the statistical technique of factor analysis
VICTOR HENRI
 Frenchman who collaborated with Binet on papers suggesting how mental tests could be used to measure higher mental processes
EMIL KRAEPELIN
 Early experimenter with the word association technique as a formal test
LIGHTNER WITMER
 "Little-known founder of clinical psychology"
 Founded the first psychological clinic in the United States
PSYCHE CATTELL
 Daughter of James McKeen Cattell
 Cattell Infant Intelligence Scale (CIIS) & The Measurement of Intelligence in Infants and Young Children
RAYMOND CATTELL
 Believed in the lexical approach to defining personality, which examines human languages for descriptors of personality dimensions
20TH CENTURY
- Birth of the first formal tests of intelligence
- Testing shifted to be of more understandable relevance and meaning
A. THE MEASUREMENT OF INTELLIGENCE
o Binet created the first intelligence test to identify mentally retarded schoolchildren in Paris (an individual test)
o The Binet-Simon Test has been revised repeatedly
o Group intelligence tests emerged with the need to screen the intellect of WWI recruits
o David Wechsler – designed a test to measure adult intelligence
 For him, intelligence is the global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment
 Wechsler-Bellevue Intelligence Scale → Wechsler Adult Intelligence Scale – revised several times; extended the age range of testtakers from young children through senior adulthood
B. THE MEASUREMENT OF PERSONALITY
o The field of psychology was criticized as being too test oriented
o Clinical psychology was synonymous with mental testing
o ROBERT WOODWORTH – developed a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits
 To disguise the true purpose of the test, the questionnaire was labeled the Personal Data Sheet
 Later called the Woodworth Psychoneurotic Inventory – the first widely used self-report test of personality
o Self-report test:
 Advantages:
 Respondents are the best qualified to report on themselves
 Disadvantages:
 Poor insight into self
 One might honestly believe something about oneself that isn't true
 Unwillingness to report seemingly negative qualities
o Projective test: the individual is assumed to project onto some ambiguous stimulus (inkblot, photo, etc.) his or her own unique needs, fears, hopes, and motivations
 Ex.) Rorschach inkblot test
C. THE ACADEMIC AND APPLIED TRADITIONS

Culture and Assessment

Culture: "the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people"

Evolving Interest in Culture-Related Issues
Goddard tested immigrants and found most to be feebleminded
- invalid; overestimated mental deficiency, even in native English speakers
Led to the nature-nurture debate about what intelligence tests actually measure
Needed to "isolate" the cultural variable
Culture-specific tests: tests designed for use with people from one culture but not from another
- minorities still scored abnormally low
- ex.) loaf of bread vs. tortillas
Today tests undergo many steps to ensure they are suitable for the intended nation
- take testtakers' reactions into account

Some Issues Regarding Culture and Assessment
 Verbal Communication
o Examiner and examinee must speak the same language
o Especially tricky when infrequently used vocabulary or unusual idioms are employed
o A translator may lose nuances of translation or give unintentional hints toward the more desirable answer
o Also requires understanding of the culture
 Nonverbal Communication and Behavior
o Differs between cultures
o Ex.) the meaning of not making eye contact
o Body movement could even have a physical cause
o Psychoanalysis: Freud's theory of personality and psychological treatment, which stated that symbolic significance is assigned to many nonverbal acts
o Timing of tests in cultures not obsessed with speed
o Lack of speaking could be reverence for elders
 Standards of Evaluation
o Acceptable roles for women differ across cultures
o "judgments as to who might be the best employee, manager, or leader may differ as a function of culture, as might judgments regarding intelligence, wisdom, courage, and other psychological variables"
o Must ask "how appropriate are the norms or other standards that will be used to make this evaluation"

Tests and Group Membership
 ex.) a requirement to be 5'4" to be a police officer excludes cultural groups of shorter average stature
 ex.) Jewish lifestyle not well suited to corporate America
 affirmative action: voluntary and mandatory efforts to combat discrimination and promote equal opportunity in education and employment for all
 Psychology, tests, and public policy

Legal and Ethical Considerations
Code of professional ethics: defines the standard of care expected of members of a given profession.

The Concerns of the Public
 Beginning in World War I, fear that tests were only testing the ability to take tests
 Legislation
o Minimum competency testing programs: formal testing programs designed to be used in decisions regarding various aspects of students' educations
o Truth-in-testing legislation: state laws that provide testtakers with a means of learning the criteria by which they are being judged
 Litigation
o The Daubert ruling made federal judges the gatekeepers in determining what expert testimony is admitted
o This overrode the Frye policy, which admitted only scientific testimony that had won general acceptance in the scientific community

The Concerns of the Profession
 Test-user qualifications
o Who should be allowed to use psychological tests?
o Level A: tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working
o Level B: tests or aids that require some technical knowledge of test construction and use and of supporting psychological and educational fields
o Level C: tests and aids requiring substantial understanding of testing and supporting psychological fields, together with supervised experience
 Testing people with disabilities
o Difficulty in transforming the test into a form that can be taken by the testtaker
o Difficulty in transforming responses so that they are scorable
o Difficulty in meaningfully interpreting the test data
 Computerized test administration, scoring, and interpretation
o simple, convenient
o easily copied, duplicated
o insufficient research comparing it to pencil-and-paper versions
o value of computer interpretation is questionable
o unprofessional, unregulated "psychological testing" online

The Rights of Testtakers
 The right of informed consent
o the right to know why they are being evaluated, how the test data will be used, and what information will be released to whom
o consent may be obtained from a parent or legal representative
o must be in written form, disclosing:
 the general purpose of the testing
 the specific reason it is being undertaken
 the general type of instruments to be administered
o where revealing this information before the test would contaminate the results, deception may be used, but only if absolutely necessary
o don't use deception if it will cause emotional distress
o fully debrief participants
 The right to be informed of test findings
o Under older guidelines, test administrators were advised to give testtakers only positive information and to tell them as little as possible about the nature of their performance on a particular test, so that the examinee would leave the test session feeling pleased and satisfied; no realistic information was required
o Testtakers also have the right to know what recommendations are being made as a consequence of the test data
 The right to privacy and confidentiality
o Privacy right: "recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behaviors, and opinions"
o Privileged information: information protected by law from being disclosed in a legal proceeding; protects clients from disclosure in judicial proceedings; the privilege belongs to the client, not the psychologist
o Confidentiality: concerns matters of communication outside the courtroom
 Safekeeping of test data: it is not good policy to maintain all records in perpetuity
 The right to the least stigmatizing label
o The standards advise that the least stigmatizing labels should always be assigned when reporting test results

CHAPTER 3: A STATISTICS REFRESHER


Why We Need Statistics

- Statistics are important for purposes of education
o Numbers provide convenient summaries and allow us to evaluate some observations relative to others
- We use statistics to make inferences, which are logical deductions about events that cannot be observed directly
o Detective work of gathering and displaying clues – exploratory data analysis
o Then confirmatory data analysis
- Descriptive statistics are methods used to provide a concise description of a collection of quantitative information
- Inferential statistics are methods used to make inferences from observations of a small group of people known as a sample to a larger group of individuals known as a population

SCALES OF MEASUREMENT
 MEASUREMENT – the act of assigning numbers or symbols to characteristics of things according to rules. The rules serve as a guideline for representing the magnitude of the characteristic. Measurement always involves error.
 SCALE – a set of numbers whose properties model empirical properties of the objects to which the numbers are assigned
 CONTINUOUS SCALE – interval/ratio; a scale used to measure a continuous variable; always involves error
 DISCRETE SCALE – nominal/ordinal; used to measure a discrete variable (ex. female or male)
 ERROR – the collective influence of all factors on a test score or measurement beyond those the test is intended to measure

PROPERTIES OF SCALES
- Magnitude, equal intervals, and an absolute 0
Magnitude
- The property of "moreness"
- A scale has the property of magnitude if we can say that a particular instance of the attribute represents more, less, or an equal amount of the given quantity than does another instance
Equal Intervals
- A scale has the property of equal intervals if the difference between two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units
- A psychological test rarely has the property of equal intervals
- When a scale has the property of equal intervals, the relationship between the measured units and some outcome can be described by a straight line or a linear equation in the form Y = a + bX
o Shows that an increase in equal units on a given scale reflects equal increases in the meaningful correlates of units
Absolute 0
- An absolute 0 is obtained when nothing of the property being measured exists
- This is extremely difficult or impossible to define for many psychological qualities

NOMINAL SCALE
 Simplest form of measurement
 Classification or categorization
 Arithmetic operations cannot meaningfully be performed on nominal data
 Ex.) male or female
 Also includes test items
o Ex.) yes/no responses
ORDINAL SCALE
 Classifies in some kind of ranking order
 Individuals are compared to others and assigned a rank
 Implies nothing about how much greater one ranking is than another
 Numbers/ranks do not indicate units of measurement
 No absolute zero point
 Binet believed that data derived from intelligence tests are ordinal in nature
INTERVAL SCALE
 In addition to the features of nominal and ordinal scales, contains equal intervals between numbers
 No absolute zero point
 Can take an average
RATIO SCALE
 In addition to all the properties of nominal, ordinal, and interval measurement, a ratio scale has a true zero point
 Equal intervals between numbers
 Ex.) measuring the amount of pressure a hand can exert
 A true zero doesn't mean someone will receive a score of 0, but means that 0 has meaning

NOTE:
Permissible Operations
- Level of measurement is important because it defines which mathematical operations we can apply to numerical data
- For nominal data, each observation can be placed in only one mutually exclusive category
- Ordinal measurements can be manipulated using arithmetic
- With interval data, one can apply any arithmetic operation to the differences between scores
o These cannot be used to make statements about ratios

DESCRIBING DATA
 Distribution: a set of scores arrayed for recording or study
 Raw score: a straightforward, unmodified accounting of performance, usually numerical

Frequency Distributions
 Frequency distribution: all scores listed alongside the number of times each score occurred
 Grouped frequency distribution: test-score intervals (class intervals) replace the actual test scores
o Highest and lowest class intervals = upper and lower limits of the distribution
 Histogram: a graph with vertical lines drawn at the true limits of each test score (or class interval), forming TOUCHING rectangles, with the midpoint in the center of each bar
 Bar graph: rectangles DON'T touch
 Frequency polygon: data illustrated with a continuous line connecting the points where test scores or class intervals meet frequencies
 A single test score means more if one relates it to other test scores
 A distribution of scores summarizes the scores for a group of individuals
 A frequency distribution displays scores on a variable or a measure to reflect how frequently each value was obtained
o One defines all the possible scores and determines how many people obtained each of those scores
 Income is an example of a variable that has a positive skew
 Whenever you draw a frequency distribution or a frequency polygon, you must decide on the width of the class interval
 The class interval for inches of rainfall is the unit on the horizontal axis

Measures of Central Tendency
 Measure of central tendency: a statistic that indicates the average or midmost score between the extreme scores in a distribution
 The Arithmetic Mean
o "X bar"
o sum of observations divided by the number of observations: mean = Σ(X)/n
o Used for interval or ratio data when distributions are relatively normal
 The Median
o The middle score
o Used for ordinal, interval, and ratio data
o Especially useful when few scores fall at the extremes
 The Mode
o The most frequently occurring score
o Bimodal distribution: 2 scores both have the highest frequency
o Commonly the only usable measure for nominal data
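A minimal Python sketch of the three central-tendency statistics defined above; the scores are invented for illustration:

from collections import Counter

scores = [85, 90, 75, 90, 80, 70, 95]

mean = sum(scores) / len(scores)             # sum of observations / n

ordered = sorted(scores)
mid = len(ordered) // 2
# middle score; with an even n, average the two middle scores
median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2

mode = Counter(scores).most_common(1)[0][0]  # most frequently occurring score

print(mean, median, mode)                    # 83.57..., 85, 90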
Measures of Variability
 Variability: an indication of how scores in a distribution are scattered or dispersed
 The Range
o Difference between the highest and lowest scores
o Quick but gross description of the spread of scores
 The Interquartile and Semi-Interquartile Range
o A distribution is split by 3 quartiles into 4 quarters, each representing 25% of the scores
o Q2 = median
o Interquartile range: a measure of variability equal to the difference between Q3 and Q1
o Semi-interquartile range: the interquartile range divided by 2
 Quartiles and Deciles
o Quartiles are points that divide the frequency distribution into equal fourths
o The first quartile is the 25th percentile; the second quartile is the median, or 50th percentile; the third quartile is the 75th percentile
o The interquartile range is bounded by the range of scores that represents the middle 50% of the distribution
o Deciles are similar but use points that mark 10% rather than 25% intervals
o Stanine system: converts any set of scores into a transformed scale that ranges from 1 to 9
 The Average Deviation
o deviation score: x = X - mean
o Average deviation = (sum of the absolute values of all deviation scores) / total number of scores
o Tells us on average how far scores are from the mean
 The Standard Deviation
o Similar to the average deviation
o But to overcome the (+/-) problem, each deviation is squared
o Standard deviation: a measure of variability equal to the square root of the average squared deviation about the mean
o It is the square root of the variance
o Variance: the mean of the squares of the differences between the scores in a distribution and their mean
 Found by squaring and summing all the deviation scores and then dividing by the total number of scores
o s = sample standard deviation
o σ (sigma) = population standard deviation
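The deviation-based statistics above can be worked through in a short Python sketch (scores invented). Note that the average deviation uses the absolute values of the deviations, since the raw deviations always sum to 0:

scores = [4, 6, 8, 10, 12]
n = len(scores)
mean = sum(scores) / n                               # 8.0

deviations = [x - mean for x in scores]              # sum to 0 by definition
avg_dev = sum(abs(d) for d in deviations) / n        # average deviation = 2.4
variance = sum(d ** 2 for d in deviations) / n       # mean squared deviation = 8.0
std_dev = variance ** 0.5                            # square root of variance ~ 2.83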
Skewness
 skewness: the nature and extent to which symmetry is absent from a distribution
 POSITIVE SKEW: ex.) the test was too hard
 NEGATIVE SKEW: ex.) the test was too easy
 can be gauged by examining the relative distances of the quartiles from the median
Kurtosis
 the steepness of a distribution
 platykurtic: relatively flat
 leptokurtic: relatively peaked
 mesokurtic: somewhere in the middle

The Normal Curve

Normal curve: a bell-shaped, smooth, mathematically defined curve, highest at its center; both sides taper as the curve approaches the x-axis asymptotically
- symmetrical, and thus the mean, median, and mode are the same

Area under the Normal Curve

Tails and body

Standard Scores
Standard score: a raw score that has been converted from one scale to another scale, where the latter has an arbitrarily set mean and standard deviation
- used for comparison (see the sketch below)

Z-score
 conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution
 the difference between a particular raw score and the mean, divided by the standard deviation
 used to compare test scores from tests with different scales

T-score
 a standard score system composed of a scale that ranges from 5 standard deviations below the mean to 5 standard deviations above the mean
 no negatives

Other Standard Scores
 SAT
 GRE
 Linear transformation: when a standard score retains a direct numerical relationship to the original raw score
 Nonlinear transformation: required when data are not normally distributed, yet comparisons with normal distributions need to be made
o Normalized Standard Scores
 used when scores don't fall on a normal distribution
 "normalizing a distribution involves 'stretching' the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale called a normalized standard score scale"
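The sketch referred to above: a brief Python illustration of the standard-score conversions defined in this section. The raw score, mean, and SD are illustrative numbers, and the T-score formula uses the conventional mean of 50 and SD of 10:

def z_score(raw, mean, sd):
    # deviation from the mean in standard deviation units
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    # T scale: mean 50, SD 10, so +/-5 SD spans 0-100 and negatives vanish
    return 50 + 10 * z_score(raw, mean, sd)

print(z_score(65, 50, 10))   # 1.5 -> 1.5 SD above the mean
print(t_score(65, 50, 10))   # 65.0 -> no negative values on the T scale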
CHAPTER 4: OF TESTS AND TESTING


Some Assumptions About Psychological Testing and Assessment
- Assumption 1: Psychological Traits and States Exist
o Trait: any distinguishable, relatively enduring way in which one individual varies from another
o States: also distinguish one person from another but are relatively less enduring
 A trait is a term an observer applies; the strength or magnitude of the trait presumed present is based on observing a sample of behavior
o Trait and state definitions also refer to individual variation; we make comparisons with respect to the hypothetical average person
o Samples of behavior:
 Direct observation
 Analysis of self-report statements
 Paper-and-pencil test answers
o "Psychological trait" covers a wide range of possible characteristics; ex:
 Intelligence
 Specific intellectual abilities
 Cognitive style
 Psychopathology
o Controversy regarding how psychological traits exist
 Psychological traits exist only as constructs: an informed, scientific concept developed or constructed to describe or explain behavior
 We can't see, hear, or touch a construct; we infer its existence from overt behavior: an observable action or the product of an observable action, including test- or assessment-related responses
o Traits are not expected to be manifested in behavior 100% of the time
 There seems to be rank-order stability in personality traits: relatively high correlations between trait scores at different time points
o Whether and to what degree a trait manifests itself is dependent on the strength and nature of the situation
- Assumption 2: Psychological Traits and States Can Be Quantified and Measured
o Once it is acknowledged that psychological traits and states exist, the specific traits and states to be measured need to be defined
 What types of behaviors are assumed to be indicative of the trait?
 The test developer has to provide test users with a clear operational definition of the construct under study
o Once the construct is defined, the test developer considers the types of item content that would provide insight into it
 Ex: behaviors that are indicative of a particular trait
o Should all questions be weighted the same?
 Weighting the comparative value of a test's items comes about as the result of a complex interplay among many factors:
 Technical considerations
 The way a construct has been defined (for a particular test)
 The value society (and the test developer) attaches to the behaviors evaluated
o Need to find appropriate ways to score the test and interpret the results (see the sketch below)
 Cumulative scoring: a test score is presumed to represent the strength of the targeted ability, trait, or state
 The more the testtaker responds in a particular direction (as keyed by the test manual), the higher the testtaker is presumed to be on the targeted trait or ability
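The sketch referred to above: a minimal Python illustration of cumulative scoring, in which each response matching the keyed direction adds a point and the total is read as the strength of the targeted trait. The answer key and responses are invented:

# Cumulative (additive) scoring: one point per response matching the key
answer_key = {1: "yes", 2: "no", 3: "yes", 4: "yes"}   # hypothetical key
responses  = {1: "yes", 2: "no", 3: "no",  4: "yes"}   # one testtaker's answers

raw_score = sum(1 for item, keyed in answer_key.items()
                if responses.get(item) == keyed)
print(raw_score)   # 3 -> higher totals imply more of the targeted trait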
- Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
o The objective of the test is to provide some indication of some aspect of the examinee's behavior
 Tasks on some tests mimic the actual behaviors that the test user is attempting to understand
o The obtained behavior is usually used to predict future behavior
o It could also be used to postdict behavior: to aid in the understanding of behavior that has already taken place
o Tools of assessment, such as a diary or case history data, might be of great value in such an evaluation
- Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
o Competent test users understand a lot about the tests they use:
 How the test was developed
 The circumstances under which it is appropriate to administer the test
 How the test should be administered and to whom
 How results should be interpreted
o They understand and appreciate the limitations of the tests they use
- Assumption 5: Various Sources of Error Are Part of the Assessment Process
o Everyday error = mistakes and miscalculations
o Assessment error = a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
o Error variance: the component of a test score attributable to sources other than the trait or ability measured
 Assessees themselves are sources of error variance
o Classical test theory (CTT)/true score theory: the assumption is made that each testtaker has a true score on a test that would be obtained but for the action of measurement error
- Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
o Court challenges to various tests and testing programs have sensitized test developers and users to the societal demand for fair tests used in a fair manner
 Publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual
o Fairness-related problems/questions arise when:
 The culture of the testtaker differs from the culture of the people for whom the test was intended
 Politics enter the picture
- Assumption 7: Testing and Assessment Benefit Society
o Many critical decisions are based on testing and assessment procedures

WHAT'S A "GOOD TEST"?
- Criteria
o Clear instructions for administration, scoring, and interpretation
- Reliability
o A "good test"/measuring tool is reliable
 Reliability involves consistency: the precision with which the test measures and the extent to which error is present in measurements
 Unreliable measurement needs to be avoided
- Validity
o A test is considered valid if it does indeed measure what it purports to measure
o If there is controversy over the definition of a construct, then the validity of any test measuring it is sure to be criticized as well
o Questions regarding validity focus on the items that collectively make up the test
 Do they adequately sample the range of areas that must be sampled to measure the construct?
 Individual items contribute to or take away from the test's validity
o Validity may also be questioned on grounds related to the interpretation of test results
- Other Considerations
o A "good test" is one that trained examiners can administer, score, and interpret with minimum difficulty
 It is useful
 It yields actionable results that will ultimately benefit individual testtakers or society at large
o If the purpose of a test is to compare the performance of the testtaker with the performance of other testtakers, a "good test" contains adequate norms (normative data)
 Normative data provide a standard with which the results measured can be compared

NORMS
- Norm-referenced testing and assessment: a method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker's score and comparing it to the scores of a group of testtakers
- The meaning of an individual score is relative to other scores on the same test
- Norms (scholarly context): usual, average, normal, standard, expected, or typical
- Norms (psychometric context): the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores
- Normative sample: a group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers
o Yields a distribution of scores
- Norming: refers to the process of deriving norms or to a particular type of norm derivation
o Race norming: the controversial practice of norming on the basis of race or ethnic background
- Norming a test can be very expensive; user norms/program norms consist of descriptive statistics based on a group of testtakers in a given period of time, rather than norms obtained by formal sampling methods
- Sampling to Develop Norms
- Standardization: the process of administering a test to a representative sample of testtakers for the purpose of establishing norms
o A test is standardized when it has clear, specified procedures
- Sampling
o The developer targets a defined group as the population the test is designed for
 All members have at least one common, observable characteristic
o To obtain a distribution of scores, either:
 The test is administered to everyone in the targeted population, or
 The test is administered to a sample of the population
 Sample: a portion of the universe of people deemed to be representative of the whole population
 Sampling: the process of selecting the portion of the universe deemed to be representative of the whole
o Subgroups within a defined population may differ with respect to some characteristics, and it is sometimes essential to have these differences proportionately represented in the sample
 Stratified sampling: the sample reflects the statistics of the whole population; helps prevent sampling bias and ultimately aids in the interpretation of findings
 Purposive sampling: arbitrarily selecting a sample we believe to be representative of the population
 Incidental/convenience sampling: a sample that is convenient or available for use
 May be very exclusive (contain exclusionary criteria)
- TYPES OF STANDARD ERROR:
o STANDARD ERROR OF MEASUREMENT – estimates the extent to which an observed score deviates from a true score
o STANDARD ERROR OF ESTIMATE – in regression, an estimate of the degree of error involved in predicting the value of one variable from another
o STANDARD ERROR OF THE MEAN – a measure of sampling error
o STANDARD ERROR OF THE DIFFERENCE – estimates how large a difference between two scores should be before the difference is considered statistically significant
- Developing norms for a standardized test
o Establishing a standard set of instructions and conditions under which the test is given makes the scores of the normative sample more comparable with the scores of future testtakers
o After all data are collected and analyzed, the test developer summarizes the data using descriptive statistics (measures of central tendency and variability)
 The test developer needs to provide a precise description of the standardization sample itself
 Descriptions of normative samples vary widely in detail

Tracking
- Comparisons are usually with people of the same age
- Children at the same age level tend to go through different growth patterns
- Pediatricians must know the child's percentile within a given age group
- This tendency to stay at about the same level relative to one's peers is known as tracking (e.g., height and weight)
- Diets may alter this "track"
- Faults: some believe there is an analogy between the rates of physical growth and the rates of intellectual growth
o Some say that children learn at different rates
o This system discriminates against some children

TYPES OF NORMS
o Classification of norms, ex: age, grade, national, local, percentile, etc.
o PERCENTILES (see the sketch below)
 Median = 2nd quartile: the point at or below which 50% of the scores fall and above which the remaining 50% fall
 One might wish to divide a distribution of scores into deciles (instead of quartiles): 10 equal parts
 The Xth percentile is equal to the score at or below which X% of scores fall
 Percentile: an expression of the percentage of people whose score on a test or measure falls below a particular raw score
 Percentage correct: refers to the distribution of raw scores (the number of items answered correctly, multiplied by 100 and divided by the total number of items); *not the same as percentile
 A percentile is a converted score that refers to a percentage of testtakers
 Percentiles are easily calculated and are a popular way of organizing test-related data
 When percentiles are used with a normal distribution, real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle (this worsens with highly skewed data)
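The sketch referred to above: a small Python illustration (scores invented) keeping the percentile vs. percentage-correct distinction concrete:

def percentile_rank(score, all_scores):
    # percentage of testtakers whose scores fall below the given raw score
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

group = [55, 60, 62, 70, 71, 75, 80, 88, 90, 95]
print(percentile_rank(75, group))   # 50.0 -> a raw score of 75 beats half the group

# contrast: percentage correct is about items, not people
print(100 * 30 / 40)                # 75.0 -> 30 of 40 items answered correctly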
o AGE NORMS
 Age-equivalent scores/age norms: indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered
 Age norm tables for physical characteristics
 "Mental" age vs. physical age (need to identify mental age)
o GRADE NORMS
 Grade norms: designed to indicate the average test performance of testtakers in a given school grade
 Developed by administering the test to representative samples of children over a range of consecutive grades
 The mean or median score for children at each grade level is calculated
 Great intuitive appeal
 Do not provide info as to the content or type of items that a student could or could not answer correctly
 Developmental norms (ex: grade norms and age norms): a term applied broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life
o NATIONAL NORMS
 National norms: derived from a normative sample that was nationally representative of the population at the time the norming study was conducted
o NATIONAL ANCHOR NORMS
 Many different tests purport to measure the same human characteristics or abilities
 National anchor norms: equivalency tables for scores on tests that purport to measure the same thing
 Could provide the tool for comparisons
 Provide stability to test scores by anchoring them to other test scores
 Begin with the computation of percentile norms for each test to be compared
 Equipercentile method: the equivalency of scores on different tests is calculated with reference to corresponding percentile scores
o SUBGROUP NORMS
 A normative sample can be segmented by any criteria initially used in selecting subjects for the sample
 Subgroup norms: the result of such segmentation; more narrowly defined
o LOCAL NORMS
 Local norms: provide normative info with respect to the local population's performance on some test
 Typically developed by test users themselves
- Fixed Reference Group Scoring Systems
o Norms provide context for interpreting the meaning of a test score
o Fixed reference group scoring system: the distribution of scores obtained on the test from one group of testtakers (the fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test
 Ex: the SAT, which for many years calculated scores against a fixed reference group of testtakers from 1941

NORM-REFERENCED VERSUS CRITERION-REFERENCED EVALUATION
- One way to derive meaning from a test score is to evaluate the test score in relation to other scores on the same test (norm-referenced)
- Criterion-referenced: derive meaning from a test score by evaluating it on the basis of whether or not some criterion has been met
o Criterion: a standard on which a judgment or decision may be based
- Criterion-referenced testing and assessment: a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard (ex: to drive, one must pass a driving test)
o Derives from the values and standards of an individual or organization
o Also called domain- or content-referenced testing and assessment
o Critique: if followed strictly, important info about an individual's performance relative to others can potentially be lost

Culture and Inference
- Culture is a factor in test administration, scoring, and interpretation
- The test user should research in advance the test's available norms to check how appropriate they are for the targeted testtaker population
o It is helpful to know about the culture of the testtaker

CORRELATION AND INFERENCE

CORRELATION
 The degree and direction of correspondence between two things
 Correlation coefficient (r) – expresses a linear relationship between two continuous variables
o A numerical index that tells us the extent to which X and Y are "co-related"
 Positive correlation: high scores on Y are associated with high scores on X, and low scores on Y correspond to low scores on X
 Negative correlation: higher scores on Y are associated with lower scores on X, and vice versa
 No correlation: the variables are not related
 r ranges from -1 to +1
 Correlation does not imply causation
o e.g., weight, height, intelligence

PEARSON r
 Pearson product-moment correlation coefficient
 Devised by Karl Pearson
 Used when the relationship between the two variables is linear and both are continuous
 Coefficient of determination (r²) – an indication of how much variance is shared by the X and Y variables
SPEARMAN RHO
 Rank-order correlation coefficient
 Developed by Charles Spearman
 Used when the sample size is small and when both sets of measurements are in ordinal (ranking) form
BISERIAL CORRELATION
 Expresses the relationship between a continuous variable and an artificial dichotomous variable
o If the dichotomous variable had been true, we would use the point-biserial correlation
o When both variables are dichotomous and at least one of the dichotomies is true, the association between them can be estimated using the phi coefficient
o If both dichotomous variables are artificial, we might use a special correlation coefficient: the tetrachoric correlation
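A compact Python sketch of the Pearson product-moment correlation and the coefficient of determination described above, computed from the definitional formula on invented paired scores:

def pearson_r(xs, ys):
    # product-moment r: covariance term over the product of the SD terms
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
r = pearson_r(x, y)
print(r, r ** 2)   # r ~ 0.85; r^2 ~ 0.73 = variance shared by X and Y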
REGRESSION
 The analysis of relationships among variables for the purpose of understanding how one variable may predict another
 SIMPLE REGRESSION: one IV (X) and one DV (Y)
- Regression line: the best-fitting straight line through a set of points in a scatter diagram
o Found by using the principle of least squares, which minimizes the squared deviations around the regression line
 Primary use: to predict one score or variable from another
 Standard error of estimate: the higher the correlation between X and Y, the greater the accuracy of the prediction and the smaller the standard error of estimate
 MULTIPLE REGRESSION: the use of more than one score to predict Y
 Regression coefficient (b): the slope of the regression line
o The ratio of the sum of squares for the covariance to the sum of squares for X
o The sum of squares is defined as the sum of the squared deviations around the mean
o Covariance is used to express how much two measures covary, or vary together
 The slope describes how much change is expected in Y each time X increases by one unit
 The intercept (a) is the value of Y when X is 0
o The point at which the regression line crosses the Y axis

THE BEST-FITTING LINE
 The difference between the observed and predicted score (Y - Y') is called the residual
 The best-fitting line is most appropriately found by squaring each residual
 The best-fitting line is obtained by keeping these squared residuals as small as possible (the principle of least squares)
 Correlation is a special case of regression in which the scores for both variables are in standardized, or Z, units
 In correlation, the intercept is always 0
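A small Python sketch of simple least-squares regression as defined above: the slope b is the covariance sum of squares over the sum of squares for X, and the intercept a is the value of Y' when X is 0. The data are invented:

def least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # covariance SS
    ss_x = sum((x - mx) ** 2 for x in xs)                      # SS for X
    b = ss_xy / ss_x      # slope: expected change in Y per unit increase in X
    a = my - b * mx       # intercept: the predicted Y when X = 0
    return a, b

a, b = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
predict = lambda x: a + b * x     # the regression equation Y' = a + bX
print(a, b, predict(6))           # a = 1.8, b = 0.8, Y'(6) = 6.6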
 Pearson product moment correlation coefficient is a ratio used to - External influence is the third variable
determine the degree of variation in one variable that can be Restricted Range
estimated from knowledge about variation in the other variable - Correlation and regression use variability on one variable to explain
Testing the Statistical Significance of a Correlation Coefficient variability on a second variable
- Begin with the null hypothesis that there is no relationship between - Restricted range problem: correlation requires variability; if the
variables variability is restricted, then significant correlations are difficult to
- Null hypothesis rejected is there is evidence that the association find
between two variables is significantly different from 0 Mulvariate Analysis
- t distribution is not a single distribution, but a family of distributions, - Multivariate analysis considers the relationship among combinations
each with its own degrees of freedom of three of more variables
- Degrees of freedom are defined as the sample size minus 2, or N-2 General Approach
- Two-tailed test - Linear combination of variables is a weighted composite of the
original variables
How to Interpret a Regression Plot - Y’ = a+b1X1 + … bkXk
- Regression plots are pictures that show the relationship between
variables
- Common use of correlation is to determine the criterion validity
evidence for a test, or the relationship between a test score and
some well-defined criterion
- Middle level of enjoyableness because it is the one observed most
frequently – normative because it uses info gained from
representative groups
- Using the test as a predictor is not as good as perfect prediction, but
it is still better than using the normative info
- A regression line such as in 3.9 shows that the test score tells us
nothing about the criterion beyond the normative info

TERMS AND ISSUES IN THE USE OF CORRELATION


Residual
- Difference between the predicted and the observed values is called
the residual
o Y-Y’
- Important property of residual is that the sum of the residuals always
equals 0
- Sum of the squared residuals is the smallest value according to the
principle of least squares
Standard Error of Estimate
- Standard deviation of the residuals is the standard error of estimate
- A measure of the accuracy of prediction
- Prediction is most accurate when the standard error of estimate is
relatively small
Coefficient of Determination
- Correlation coefficient squared is known as the coefficient of
determination
- Tells us the proportion of the total variation in scores on Y that we
know as a function of information about X
Coefficient of Alienation
- Coefficient of alienation is a measure of nonassociation between two variables
- Computed as the square root of 1 - r², where r² is the coefficient of determination
- A high value means there is a high degree of nonassociation between the two variables
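A small sketch tying these terms together (illustrative only; the standard error here uses divisor n for simplicity, though texts often use N - 2):

```python
import math

def summary_stats(residuals, r):
    n = len(residuals)
    see = math.sqrt(sum(e ** 2 for e in residuals) / n)  # SD of residuals
    determination = r ** 2              # proportion of Y variance known from X
    alienation = math.sqrt(1 - r ** 2)  # degree of nonassociation
    return see, determination, alienation
```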
Shrinkage
- Tendency to overestimate the relationship, particularly if the sample
of subjects is small
- Shrinkage is the amount of decrease observed when a regression
equation is created for one population and then applied to another
Cross Validation
- Use the regression equation to predict performance in a group of subjects other than the ones from which the equation was derived
- The standard error of estimate is obtained for the relationship between the values predicted by the equation and the values actually observed – this is called cross validation
The Correlation-Causation Problem
- Experiments are required to determine whether manipulation of one
variable causes changes in another variable
- A correlation alone does not prove causality, although it might lead to
other research that is designed to establish the causal relationships
between variables
Third Variable Explanation
- A third variable (e.g., poor social adjustment) may cause both TV viewing and aggression
- The external influence is the third variable
Restricted Range
- Correlation and regression use variability on one variable to explain variability on a second variable
- Restricted range problem: correlation requires variability; if the variability is restricted, then significant correlations are difficult to find
Multivariate Analysis
- Multivariate analysis considers the relationship among combinations of three or more variables
General Approach
- A linear combination of variables is a weighted composite of the original variables
- Y' = a + b1X1 + … + bkXk
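A tiny sketch of such a weighted composite (made-up weights and values):

```python
# Sketch: Y' = a + b1*X1 + ... + bk*Xk
def linear_combination(a, weights, xs):
    return a + sum(b * x for b, x in zip(weights, xs))

# e.g., linear_combination(0.5, [0.3, 1.2], [10, 4]) -> 0.5 + 3.0 + 4.8 = 8.3
```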
CHAPTER 5: RELIABILITY
RELIABILITY
- Dependability and consistency
- Error implies that there will always be some inaccuracy in our measurements
- Tests that are relatively free of measurement error are deemed to be reliable
- Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research
- Reliability coefficient: an index that indicates the ratio between the true score variance on a test and the total variance
- HISTORY OF RELIABILITY:
  o Charles Spearman (1904): The Proof and Measurement of Association between Two Things
  o Then Thorndike
  o Item response theory has taken advantage of computer technology to advance psychological measurement significantly
  o Based on Spearman's ideas
- X = T + E → CLASSICAL TEST THEORY
  o assumes that each person has a true score that would be obtained if there were no errors in measurement
  o The difference between the true score and the observed score results from measurement error
  o The assumption here is that errors of measurement are random
  o Basic sampling theory tells us that the distribution of random errors is bell-shaped
    - The center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors
  o Classical test theory assumes that the true score for an individual will not change with repeated applications of the same test
  o Variance: standard deviation squared. It is useful because it can be broken into components:
    - True variance: variance from true differences → assumed to be stable
    - Error variance: variance from random, irrelevant sources
- Standard error of measurement: because we assume that the distribution of random errors will be the same for all people, classical test theory uses the standard deviation of errors as the basic measure of error
  o The standard error of measurement tells us, on the average, how much a score varies from the true score
  o The standard deviation of the observed scores and the reliability of the test are used to estimate the standard error of measurement
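An illustrative sketch (not from the notes) of these classical test theory quantities:

```python
import math

def reliability(true_var, error_var):
    # Reliability as the proportion of total variance that is true variance
    return true_var / (true_var + error_var)

def sem(sd_observed, rel):
    # Standard error of measurement from observed SD and reliability
    return sd_observed * math.sqrt(1 - rel)

# e.g., sem(15, 0.89) is about 5 points on an IQ-type scale
```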
- Reliability: the proportion of the total variance attributed to true variance
  o the greater the portion of total variance attributed to true variance, the more reliable the test
- Measurement error: refers, collectively, to all of the factors associated with the process of measuring some variable, other than the variable being measured
  o Random error: a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
    - This source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores
  o Systematic error: a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
    - The error is predictable and fixable
    - Does not affect score consistency

SOURCES OF ERROR VARIANCE
- TEST CONSTRUCTION
  o Item sampling or content sampling – refers to variation among items within a test as well as to variation among items between tests
    - The extent to which a testtaker's score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance
- TEST ADMINISTRATION
  o may influence the testtaker's attention or motivation
  o Environment variables, testtaker variables, examiner variables, level of professionalism
- TEST SCORING AND INTERPRETATION
  o Computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences
  o However, other tools of assessment still require scoring by trained personnel
  o If subjectivity is involved in scoring, then the scorer can be a source of error variance
  o Despite the rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiners occasionally are confronted by situations where an examinee's response lies in a gray area

TEST-RETEST RELIABILITY
- Also known as time-sampling reliability
- Correlating pairs of scores from the same group on two different administrations of the same test
- Appropriate when measuring something that is relatively stable over time
- Sources of error variance:
  o Passage of time: the longer the time that passes, the greater the likelihood that the reliability coefficient will be lower
  o Coefficient of stability: the estimate when the interval between testings is greater than 6 months
- Consider the possibility of a carryover effect: occurs when the first testing session influences scores from the second session
- If something affects all the testtakers equally, then the results are uniformly affected and no net error occurs
- Practice tests may make this effect happen
- Practice can also affect tests of manual dexterity
- The time interval between testing sessions must be selected and evaluated carefully
- Poor test-retest correlations do not always mean that a test is unreliable – they may suggest that the characteristic under study has changed

PARALLEL-FORMS OR ALTERNATE-FORMS RELIABILITY
- compares two equivalent forms of a test that measure the same attribute
- The two forms should be equally constructed: format, etc.
- When two forms of the test are available, one can compare performance on one form versus the other – equivalent-forms or parallel-forms reliability
- Coefficient of equivalence: the degree of relationship between various forms of a test, evaluated by means of an alternate-forms correlation
- Parallel forms: for each form of the test, the means and variances of observed test scores are equal
- Alternate forms: different versions of a test that have been constructed so as to be parallel
- (1) two test administrations with the same group are required
- (2) test scores may be affected by factors such as motivation, etc.
- Problem: a new version of the test must be developed

INTERNAL CONSISTENCY
- How well does each item measure the content/construct under consideration?
- How consistent are the items with one another?
- Used when tests are administered once
- If all items on a test measure the same construct, then the test has good internal consistency
- Split-half reliability, KR20, Cronbach alpha
SPLIT-HALF RELIABILITY
- Correlating two pairs of scores obtained from equivalent halves of a single test administered once
- Useful when it is impractical to assess reliability with two tests or to administer a test twice
- The results of one half of the test are then compared with the results of the other
- Rules in splitting a test into halves:
  o Do not divide the test in the middle, because it would lower the reliability
  o Different amounts of anxiety and differences in item difficulty shall also be considered
  o Randomly assign items to one or the other half of the test
  o use the odd-even system: one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items
- To correct for half-length, apply the Spearman-Brown formula, which allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test
  o Use this if the test user wishes to shorten a test
  o Used to determine the number of items needed to attain a desired level of reliability
- Reliability increases as the test length increases

KUDER-RICHARDSON FORMULAS (KR20/KR21)
- The Kuder-Richardson technique simultaneously considers all possible ways of splitting the items
- The formula for calculating the reliability of a test in which the items are dichotomous, scored 0 or 1, is the Kuder-Richardson 20 (see p. 114)
- KR21 – uses an approximation of the sum of the pq products based on the mean test score
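A minimal sketch (invented names, dichotomous 0/1 data) of the split-half correction and KR-20:

```python
def spearman_brown(r_half):
    # Estimate full-length reliability from the half-test correlation
    return 2 * r_half / (1 + r_half)

def kr20(item_scores):
    # item_scores: one inner list of 0/1 responses per item, across testtakers
    k = len(item_scores)
    n = len(item_scores[0])
    totals = [sum(person) for person in zip(*item_scores)]  # total scores
    mean_t = sum(totals) / n
    var_total = sum((t - mean_t) ** 2 for t in totals) / n
    sum_pq = sum((sum(item) / n) * (1 - sum(item) / n) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_pq / var_total)
```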
CRONBACH ALPHA
- Cronbach developed a formula that estimates the internal consistency of tests in which the items are not scored as 0 or 1 – a more general reliability estimate, which he called coefficient alpha
- Sum the individual item variances
  o The most general method of finding estimates of reliability through internal consistency
- Domain sampling: define a domain that represents a single trait or characteristic; each item is an individual sample of this general characteristic
- Factor analysis deals with the situation in which a test apparently measures several different characteristics
  o Good for the process of test construction
- The most widely used measure of reliability, because it requires only one administration of the test
- Ranges from 0 to 1: "bigger is always better"
Other Methods of Estimating Internal Consistency
- Inter-item consistency: refers to the degree of correlation among all the items on a scale
  o A measure of inter-item consistency is calculated from a single administration of a single form of a test
  o An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test
  o Tests are said to be homogeneous if they contain items that measure a single trait
  o Homogeneity: the degree to which a test measures a single factor
  o Heterogeneity: the degree to which a test measures different factors
  o Ex: a test that assesses knowledge only of television repair skills (homogeneous) vs. a general electronics repair test (heterogeneous)
  o The more homogeneous a test is, the more inter-item consistency it can be expected to have
  o Test homogeneity is desirable because it allows relatively straightforward test-score interpretation
  o Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested
  o Testtakers with the same score on a heterogeneous test may have quite different abilities
  o However, homogeneous testing is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality

Measures of Inter-Scorer Reliability
- In some types of tests under some conditions, the score may be more a function of the scorer than of anything else
- Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
- Coefficient of inter-scorer reliability: a coefficient of correlation used to determine the degree of consistency among scorers in the scoring of a test
- The kappa statistic is the best method for assessing the level of agreement among several observers
  o Indicates the actual agreement as a proportion of the potential agreement following the correction for chance agreement
  o Cohen's kappa – 2 raters
  o Fleiss' kappa – 3 or more raters
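A sketch (invented names) of coefficient alpha and Cohen's kappa for two raters:

```python
def cronbach_alpha(item_scores):
    # item_scores: one inner list per item (not limited to 0/1 scoring)
    k = len(item_scores)
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    totals = [sum(person) for person in zip(*item_scores)]
    return (k / (k - 1)) * (1 - sum(var(i) for i in item_scores) / var(totals))

def cohen_kappa(ratings_a, ratings_b):
    n = len(ratings_a)
    po = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n  # observed
    cats = set(ratings_a) | set(ratings_b)
    pe = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
             for c in cats)                                      # chance
    return (po - pe) / (1 - pe)  # agreement beyond chance
```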
HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS
- Homogeneous items yield a higher degree of reliability

DYNAMIC VS. STATIC CHARACTERISTICS
- Dynamic: a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences
- Static: a trait, state, or ability that is relatively unchanging

RESTRICTION OR INFLATION OF RANGE
- If the range is restricted, reliability tends to be lower.
- If the range is inflated, reliability tends to be higher.

SPEED TESTS VS. POWER TESTS
- Speed test: homogeneous, easy items, but a short time limit
- Power test: fewer but more complex items

CRITERION-REFERENCED TESTS
- Provide an indication of where a testtaker stands with respect to some variable or criterion.
- Tend to contain material that has been mastered in hierarchical fashion.
- Scores here tend to be interpreted in pass-fail terms.
- Measures of reliability depend on the variability of the test scores: how different the scores are from one another.

The Domain Sampling Model
- This model considers the problems created by using a limited number of items to represent a larger and more complicated construct
- Our task in reliability analysis is to estimate how much error we would make by using the score from the shorter test as an estimate of true ability
- Conceptualizes reliability as the ratio of the variance of the observed score on the shorter test and the variance of the long-run true score
- Reliability can be estimated from the correlation of the observed test score with the true score

Item Response Theory
- Classical test theory requires that exactly the same test items be administered to each person – a drawback
- Item response theory (IRT) is newer – a computer is used to focus on the range of item difficulty that helps assess an individual's ability level
  o A more reliable estimate of ability is obtained using a shorter test with fewer items
  o Requires a large pool of items and considerable effort

Generalizability theory
- based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation
- Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score
- This universe is described in terms of its facets: which include things like
the number of items in the test, the amount of training the test scorers
have had, and the purpose of the test administration
- According to generalizability theory, given the exact same conditions of all
the facets in the universe, the exact same test score should be obtained
- Universe score: the test score obtained; it is analogous to a true score in the true score model
- Cronbach suggested that tests be developed with the aid of a
generalizability study followed by a decision study
- Generalizability study: examines how generalizable scores from a
particular test are if the test is administered in different situations
- How much of an impact different facets of the universe have on the test
score
- Ex: is the test score affected by group as opposed to individual
administration
- Coefficients of generalizability: the influence of particular facets on the test score is represented by these coefficients. They are similar to reliability coefficients in the true score model
- Decision study: developers examine the usefulness of test scores in helping the test user make decisions
- The decision study is designed to tell the test user how test scores should
be used and how dependable those scores are as a basis for decisions,
depending on the context of their use
What to Do About Low Reliability
- Two common approaches are to increase the length of the test and to
throw out items that run down the reliability
- Another procedure is to estimate what the true correlation would
have been if the test did not have measurement error
Increase the Number of Items
- The larger the sample, the more likely that the test will represent the
true characteristic
o This could entail a long and costly process however
- Prophecy formula: estimates how much longer a test must be made to reach a desired level of reliability (the Spearman-Brown prophecy formula)
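A small sketch of that estimate (names invented):

```python
def lengthening_factor(r_observed, r_desired):
    # How many times longer the test must be to reach the desired reliability
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))

# e.g., to move a 20-item test from .70 to .90:
# lengthening_factor(0.70, 0.90) ≈ 3.86, i.e., about 77 items
```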
Factor and Item Analysis
- Reliability of a test depends on the extent to which all of the items
measure one common characteristic
- Factor analysis
o Tests are most reliable if they are unidimensional: one
factor should account for considerably more of the
variance than any other factor
- Or examine the correlation between each item and the total score for
the test
o Called discriminability analysis: when the correlation
between the performance on a single item and the total
test score is low, the item is probably measuring
something different from the other items on the test
Correction for Attenuation
- Potential correlations are attenuated, or diminished, by measurement error
- Correction for attenuation estimates what the correlation would have been if the measures were free of measurement error
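A sketch of the standard correction-for-attenuation formula (names invented):

```python
import math

def corrected_r(r_xy, rel_x, rel_y):
    # Estimated correlation between true scores, given each measure's reliability
    return r_xy / math.sqrt(rel_x * rel_y)

# e.g., corrected_r(0.30, 0.60, 0.70) ≈ 0.46
```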
CHAPTER 6: VALIDITY
The Concept of Validity
- Validity: as applied to a test, a judgment or estimate of how well a test measures what it purports to measure in a particular context
  o A judgment based on evidence about the appropriateness of inferences drawn from test scores
  o The validity of a test must be shown from time to time to account for culture and advancement
- Inference: a logical result or deduction
- The validity of tests and test scores is described as "acceptable" or "weak"
- Validation: the process of gathering and evaluating evidence about validity
  o Test user and testtaker both have roles in the validation of a test
  o Test users may conduct their own validation studies: these may yield insights regarding a particular population of testtakers as compared to the norming sample (in the manual)
  o Local validation studies: absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test
- Types of Validity (Trinitarian view) *not mutually exclusive; all contribute to a unified picture of a test's validity; critique: the approach is fragmented and incomplete
  o Content validity: a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test
  o Criterion-related validity: a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures
  o Construct validity: a measure of validity arrived at by executing a comprehensive analysis of (umbrella validity: every other variety of validity falls under it):
    - How scores on the test relate to other test scores and measures
    - How scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
- Strategies: ways of approaching the process of test validation
  o Content validation strategies
  o Criterion-related validation strategies
  o Construct validation strategies
- Face Validity
  o Face validity: relates more to what a test appears to measure to the person being tested than to what the test actually measures
  o A judgment concerning how relevant the test items appear to be, usually from the testtaker, not the test user
  o Lack of face validity = lack of confidence in the perceived effectiveness of the test, which decreases testtaker motivation/cooperation *the test may still be useful
- Content Validity
  o Content validity: a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
    - Ideally, test developers have a clear vision of the construct being measured; that clarity is reflected in the content validity of the test
  o Test blueprint: the structure of the evaluation; a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.
    - Behavior observation is a technique frequently used in test blueprinting
  o The quantification of content validity
    - Important in employment settings → tests used to hire and promote
    - One method: gauging agreement among raters or judges regarding how essential a particular item is (C.H. Lawshe)
      - "Is the skill or knowledge measured by this item… (a) essential, (b) useful but not essential, or (c) not necessary … to the performance of the job?"
    - Content validity ratio (CVR): CVR = (ne – N/2) / (N/2)
      o CVR = content validity ratio
      o ne = number of panelists stating "essential"
      o N = total number of panelists
    - CVR is calculated for each item
  o Culture and the relativity of content validity
    - Tests are thought of as either valid or invalid
    - What constitutes historical fact depends to some extent on who is writing the history
    - Cultural relativity
    - Politics (political correctness)

Criterion-Related Validity
- Criterion-related validity: a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest (the measure of interest being the criterion)
- 2 types:
  o Concurrent validity: an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)
  o Predictive validity: an index of the degree to which a test score predicts some criterion measure
- What Is a Criterion?
  o Criterion: a standard on which a judgment or decision may be based; the standard against which a test or a test score is evaluated (in criterion-related validity)
  o Characteristics of a criterion:
    - Relevant: pertinent or applicable to the matter at hand
    - Valid (for the purpose for which it is being used)
    - Uncontaminated. Criterion contamination: a term applied to a criterion measure that has been based, at least in part, on predictor measures
- Concurrent Validity
  o Test scores are obtained at about the same time as the criterion measures; measures of the relationship between the test scores and the criterion provide evidence of concurrent validity
  o Indicates the extent to which test scores may be used to estimate an individual's present standing on a criterion
  o Once the validity of the inference from the test scores is established, the test provides a faster, less expensive way to offer a diagnosis or a classification decision
  o The concurrent validity of a test can be explored with respect to another test
    - Prior research must have satisfactorily demonstrated the first test's validity
    - The first test serves as the validating criterion
- Predictive Validity
  o Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place
    - Intervening event: training, experience, therapy, medication, etc.
    - Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test (how accurately scores on the test predict some criterion measure)
  o Ex: SAT test score and freshman GPA
  o Judgments of criterion validity are based on 2 types of statistical evidence:
    - The validity coefficient
      o Validity coefficient: a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure
      o Ex: the Pearson correlation coefficient (r) used to determine the validity between 2 measures
      o Affected by restriction or inflation of range
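A tiny sketch of Lawshe's CVR for one item (names invented):

```python
def cvr(n_essential, n_panelists):
    # CVR = (ne - N/2) / (N/2)
    return (n_essential - n_panelists / 2) / (n_panelists / 2)

# e.g., 8 of 10 panelists rate an item "essential": cvr(8, 10) = 0.6
```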
      o Is the range of scores employed appropriate to the objective of the correlational analysis?
      o No rules regarding the validity coefficient (how high or low it should/could be for a test to be valid)
    - Incremental validity
      o Relevant when more than one predictor is used
      o Incremental validity: the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use
    - Expectancy data
      o Expectancy data: provide info that can be used in evaluating the criterion-related validity of a test
      o A score obtained on an expectancy test/table → the likelihood the testtaker will score within some interval of scores on a criterion measure ("passing", "acceptable", etc.)
      o Expectancy table: shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion
        - May be created from a scatterplot
        - Shows relationships
      o Expectancy chart: a graphic representation of an expectancy table
        - The higher the initial rating, the greater the probability of job/academic success
    - Taylor-Russell tables – provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection
      o Selection ratio – the relationship between the number of people to be hired and the number of people available to be hired
      o Base rate – the percentage of people hired under the existing system for a particular position
      o The relationship between predictor and criterion must be linear
    - Naylor-Shine tables – use the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
  o Decision theory and test utility
    - Base rate – the extent to which a particular trait, behavior, characteristic, or attribute exists in the population
    - Hit rate – the proportion of people a test accurately identifies as possessing or exhibiting a particular trait
    - Miss rate – the proportion of people the test fails to identify as having or not having a particular attribute
      - False positive (Type I error) – the test predicts the testtaker possesses the attribute when he or she actually does not. Ex: scored above the cutoff score and was hired, but failed at the job.
      - False negative (Type II error) – the test predicts the testtaker does not possess the attribute when he or she actually does. Ex: scored below the cutoff score and was not hired, but could have been successful at the job.
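A rough sketch of these decision-theory outcomes; `predicted` and `actual` are invented boolean lists (test says qualified / truly qualified):

```python
def decision_rates(predicted, actual):
    n = len(predicted)
    hits = sum(p == a for p, a in zip(predicted, actual)) / n
    false_pos = sum(p and not a for p, a in zip(predicted, actual)) / n
    false_neg = sum((not p) and a for p, a in zip(predicted, actual)) / n
    return {"hit rate": hits, "miss rate": 1 - hits,
            "false positive": false_pos, "false negative": false_neg}
```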
- Construct Validity
  o Construct validity: a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct
    - Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior
      - Ex: intelligence, depression, motivation, personality, etc.
      - Unobservable, presupposed (underlying) traits that a test developer invokes to describe test behavior or criterion performance
    - Viewed as the unifying concept for all validity evidence
  o Evidence of Construct Validity
    - Various techniques of construct validation provide evidence that:
      - The test is homogeneous, measuring a single construct
      - Test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation (as theoretically predicted)
      - Test scores obtained after some event or the passage of time differ from pretest scores (as theoretically predicted)
      - Test scores obtained by people from distinct groups vary (as theoretically predicted)
      - Test scores correlate with scores on other tests (as theoretically predicted)
    - Evidence of homogeneity
      - Homogeneity: refers to how uniform a test is in measuring a single concept
      - Evidence: correlations between subtest scores and total test scores
      - Item-analysis procedures have been used in the quest for test homogeneity
      - Desirable but not necessary
      - Contributes no info about how the construct being measured relates to other constructs
    - Evidence of changes with age
      - If a test purports to measure a construct that changes over time, then the test scores, too, should show progressive changes to be considered a valid measurement of the construct
      - Does not in itself provide info about how the construct relates to other constructs
    - Evidence of pretest-posttest changes
      - Can be evidence of construct validity
      - Some typical intervening experiences responsible for changes in test scores are:
        o Formal education
        o Therapy/medication
        o Any life experience
    - Evidence from distinct groups / method of contrasted groups
      - Method of contrasted groups: one way of providing evidence for the validity of a test is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group
      - Rationale: if a test is a valid measure of a particular construct, groups of people presumed to differ with respect to that construct should have correspondingly different test scores
    - Convergent evidence
      - Evidence for the construct validity of a particular test may converge from a number of sources, such as tests or measures designed to assess the same or a similar construct
      - Convergent evidence: scores on a test undergoing construct validation correlate
        highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same or a similar construct
    - Discriminant evidence
      - Discriminant evidence: a validity coefficient showing little relationship between test scores and/or other variables with which scores on the test being construct-validated should not theoretically be correlated
      - Provides evidence of construct validity
      - Multitrait-multimethod matrix: "two or more traits, two or more methods"; a matrix/table that results from correlating variables (traits) within and between methods
    - Factor analysis
      - Factor analysis: a shorthand term for a class of mathematical procedures designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ
      - Frequently used as a data reduction method in which several sets of scores and the correlations between them are analyzed
      - Confirmatory factor analysis: researchers test the degree to which a hypothetical model fits the actual data
        o Factor loading: conveys information about the extent to which the factor determines the test score or scores
        o Complex procedures
- Validity, Bias, and Fairness
  o Test Bias
    - Bias: a factor inherent in a test that systematically prevents accurate, impartial measurement
    - There are technical means to identify and remedy bias (mathematically)
    - Bias implies systematic variation
    - Rating error
      - Rating: a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptions, known as a rating scale
      - Rating error: a judgment resulting from the intentional or unintentional misuse of a rating scale
      - Leniency error/generosity error: an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading
      - Severity error: the rater is overly critical, rating systematically toward the negative extreme
      - Central tendency error: the rater exhibits a general and systematic reluctance to give ratings at either the positive or negative extreme
    - One way to overcome restriction-of-range rating errors is to use rankings: a procedure that requires the rater to measure individuals against one another instead of against an absolute scale
      - The rater is forced to select 1st, 2nd, 3rd, etc.
    - Halo effect: for some raters, some ratees can do no wrong
      - The tendency to give a particular ratee a higher rating than he or she objectively deserves
      - Criterion data may be influenced by the rater's knowledge of ratee race, gender, etc.
  o Test fairness
    - Issues of fairness tend to be more difficult and involve values
    - Fairness: the extent to which a test is used in an impartial, just, and equitable way
    - Sources of misunderstanding:
      - Discrimination
      - Group not included in the standardization sample
      - Performance differences between identified groups

Relationship Between Reliability and Validity
- A test should not correlate more highly with any other variable than it correlates with itself
- A modest correlation between the true scores on two traits may be missed if the test for each of the traits is not highly reliable
- We can have reliability without validity
  o It is impossible to demonstrate that an unreliable test is valid
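A sketch of that reliability ceiling on validity (a standard psychometric bound, not stated in these notes in formula form):

```python
import math

def max_validity(rel_test, rel_criterion):
    # A validity coefficient cannot exceed sqrt(r_xx * r_yy)
    return math.sqrt(rel_test * rel_criterion)

# e.g., max_validity(0.81, 1.0) = 0.9 — even against a perfectly reliable
# criterion, validity is capped by the test's own reliability
```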
CHAPTER 7: UTILITY
Utility: the usefulness or practical value of testing to improve efficiency

Factors that Affect a Test's Utility
- Psychometric Soundness
  o The reliability and validity of a test
  o Give us the practical value of the scores
  o They tell us whether decisions are cost-effective
  o A valid test is not always a useful test
    - especially if testtakers do not follow test directions
- Costs
  o Economic and noneconomic
  o Ex.) using a less expensive and therefore less stringent application process for airline personnel
- Benefits
  o Profits, gains, advantages
  o Ex.) a more stringent hiring policy → more productive employees
  o Ex.) maintaining a successful academic environment at a university

Utility Analysis
What Is a Utility Analysis?
- a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment

Utility Analysis: An Illustration
What is the company's goal?
- Limit the cost of selection
  o Don't use the FERT
- Ensure that qualified candidates are not rejected
  o Set a cut score that yields the lowest false negative rate
- Ensure that all candidates selected will prove to be qualified
  o Set a cut score that yields the lowest false positive rate
- Ensure, to the extent possible, that qualified candidates will be selected and unqualified candidates will be rejected
  o False positives are no better or worse than false negatives
  o Seek the highest hit rate and lowest miss rate

How Is a Utility Analysis Conducted?
- The objective dictates what sort of information will be required as well as the specific methods to be used
- Expectancy Data
  o An expectancy table provides an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure
  o Used to weigh costs vs. benefits
- Brogden-Cronbach-Gleser formula
  o Utility gain: an estimate of the benefit of using a particular test or selection method
  o Most simply, benefits minus costs
  o Productivity gain: the estimated increase in work output
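A hedged sketch of the Brogden-Cronbach-Gleser utility gain; the parameter names follow the usual textbook presentation and are not spelled out in these notes:

```python
def utility_gain(n_selected, tenure_years, validity, sd_y,
                 mean_z_selected, cost_per_applicant, n_applicants):
    # Benefit: N * T * r_xy * SD_y * mean standardized score of those selected
    benefit = n_selected * tenure_years * validity * sd_y * mean_z_selected
    return benefit - cost_per_applicant * n_applicants  # benefits - costs
```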
Some Practical Considerations
- The Pool of Job Applicants
  o There is rarely a limitless supply of potential employees
  o Dependent on many factors, including the economic environment
  o We assume that top-scoring individuals will accept the job, but those individuals are more likely to be the ones being offered higher positions
- The Complexity of the Job
  o It is questionable whether the same utility analysis methods can be used to measure the eligibility of jobs of varying complexity
- The Cut Score in Use
  o Relative cut score: may be defined as a reference point based on norm-related considerations rather than on the relationship of test scores to a criterion
    - Also called a norm-referenced cut score
    - Ex.) the top 10% of test scores get A's
  o Fixed cut score: set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification
    - Also called an absolute cut score
  o Multiple cut scores: using two or more cut scores with reference to one predictor for the purpose of categorizing testtakers
    - Ex.) having cut scores that mark an A, B, C, etc., all measuring the same predictor
  o Multiple hurdles: for success, requires an individual to complete many tasks, with elimination at each level
    - Ex.) written application → group interview → personal interview → etc.
  o Compensatory model of selection: the assumption is made that high scores on one attribute can compensate for low scores on another attribute

Methods for Setting Cut Scores
The Angoff Method
- Judgments of experts are averaged

The Known Groups Method
- Collection of data on the predictor of interest from groups known to possess and not to possess a trait, attribute, or ability
- The cut score is based on where the test best discriminates the two groups' performance

IRT-Based Methods
- Based on the testtaker's performance across all items on a test
- Some portion of the test items must be answered correctly
- Item-mapping method: determining the difficulty level reflected by the cut score
- Bookmark method: test items are listed, one per page, in ascending level of difficulty. An expert places a bookmark to mark the divide that separates testtakers who have acquired minimal knowledge, skills, or abilities from those who have not.
- Problems include the training of experts, possible floor and ceiling effects, and the optimal length of item booklets

Other Methods
- Discriminant analysis: a family of statistical techniques used to shed light on the relationship between certain variables and two or more naturally occurring groups
  o Ex.) the relationship between scores on tests and people judged to be successful or unsuccessful at a job
CHAPTER 8: TEST DEVELOPMENT

STEPS:
1. TEST CONCEPTUALIZATION
2. TEST CONSTRUCTION
3. TEST TRYOUT
4. ITEM ANALYSIS
5. TEST REVISION

TEST CONCEPTUALIZATION
- Begins with a thought or stimulus, which could be almost anything
- An emerging social phenomenon or pattern of behavior might serve as the stimulus for the development of a new test
- Norm-referenced: a good item is one that high scorers on the test respond to correctly and low scorers respond to incorrectly
- Criterion-referenced: a good item is one that testtakers who have mastered the material get right, whereas those who have not get it wrong
- Pilot work: pilot study or pilot research, conducted to learn whether some items should be included in the final form of the instrument
  o the test developer typically attempts to determine how best to measure a targeted construct

TEST CONSTRUCTION
- Scaling: the process of setting rules for assigning numbers in measurement
- L.L. Thurstone: credited for being at the forefront of efforts to develop methodologically sound scaling methods
TYPES OF SCALES:
- Nominal, ordinal, interval, or ratio
- Age-based scale
- Grade-based scale
- Stanine scale (raw scores converted to 1-9)
- Unidimensional vs. multidimensional
  o Unidimensional: measuring one construct
  o Multidimensional: measuring more than one construct
- Comparative vs. categorical
  o Comparative scaling: entails judgments of a stimulus in comparison with every other stimulus on the scale
  o Categorical scaling: stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
- Rating scale: a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
- Summative scale: the final score is obtained by summing the ratings across all the items
- Likert scale: each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum
- Method of paired comparisons: the testtaker is presented with pairs of stimuli and asked to compare them
- Guttman scale (scalogram analysis): items range from sequentially weaker to stronger expressions of an attitude, belief, or feeling; a testtaker who agrees with the stronger statement is assumed to also agree with the milder statements
- Equal-appearing intervals (Thurstone): a direct estimation method; there is no need to transform the testtaker's responses to another scale

WRITING ITEMS
- 3 questions for the test developer:
  o What range of content should the items cover?
  o Which of the many different types of item formats should be employed?
  o How many items should be written in total and for each content area covered?
- Item pool: the reservoir from which items will or will not be drawn for the final version of the test (should contain about double the number of items the final version will have)
- Item format
  o Item format: variables such as the form, plan, structure, arrangement, and layout of individual test items
  o 2 types:
    - 1.) Selected-response format: the testtaker selects a response from a set of alternative responses
      o includes multiple-choice, true-false, and matching
    - 2.) Constructed-response format: the testtaker supplies or creates the correct answer
      o includes completion items, short answer, and essay
- Writing items for computer administration
  o Item bank: a relatively large and easily accessible collection of test questions
  o Computerized Adaptive Testing (CAT): an interactive, computer-administered testtaking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items
  o Floor effect: the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured
  o Ceiling effect: the diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or attribute being measured
  o Item branching: the ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items
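A rough sketch of the item-branching idea (invented item data and helper name):

```python
# A correct answer routes the testtaker to a harder item; an incorrect
# one routes to an easier item.
def next_item(items, current_difficulty, answered_correctly):
    # items: list of (difficulty, question) pairs sorted by difficulty
    step = 1 if answered_correctly else -1
    target = current_difficulty + step
    # choose the item whose difficulty is closest to the new target level
    return min(items, key=lambda it: abs(it[0] - target))
```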
SCORING ITEMS
- Cumulative scoring: testtakers earn cumulative credit with regard to a particular construct
- Class/category scoring: testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way
- Ipsative scoring: comparing a testtaker's score on one scale within a test to another scale within that same test
  o ex.) "John's need for achievement is higher than his need for affiliation"

ITEM WRITING (KAPLAN BOOK)
Item Writing
- Personality and intelligence tests require different sorts of responses
- Guidelines for item writing:
  o Define clearly what you want to measure
  o Generate an item pool
  o Avoid exceptionally long items
  o Keep the level of reading difficulty appropriate for those who will complete the scale
  o Avoid "double-barreled" items that convey two or more ideas at the same time
  o Consider mixing positively and negatively worded items
- Must be sensitive to ethnic and cultural differences
- Items that retain their reliability are more likely to focus on skills, while those that lose reliability focus on more abstract concepts
Item Formats
- The simplest test uses a dichotomous format
The Dichotomous Format
- The dichotomous format offers two alternatives for each item
  o i.e., a true-false examination
- Advantages:
  o Simplicity
  o True-false items require absolute judgment
- Disadvantages:
  o True-false items encourage students to memorize material
  o "truth" often comes in shades of gray
  o the mere chance of getting any item correct is 50%
- A yes-no format is used on personality tests
- Multiple-choice = polytomous
The Polytomous Format
- The polytomous format resembles the dichotomous format except that each item has more than two alternatives
  o Multiple-choice exams
- Advantage:
  o Takes little time for testtakers to respond to a particular item because they do not have to write
- Incorrect choices are called distractors
- Disadvantages:
  o How many distractors should a test have? → usually 3 or 4
  o Poor distractors can hurt the reliability and validity of the test
  o Three-alternative multiple-choice items may be better than five-alternative items because they retain the psychometric value but take less time to develop and administer
  o Scoring of MC exams? → by guessing alone, a testtaker would get some items correct
  o Correcting for guessing, though, the expected score is 0 – as getting a question wrong loses you a point
- Guessing can be good if you can narrow down a couple of answers
- Students are more likely to guess when they anticipate a lower grade on a test than when they are more confident
- The guessing threshold describes the chances that a low-ability testtaker will obtain each score
- True-false and MC tests are common in educational and achievement testing
- The Likert format, category scale, and Q-sort are used for personality-attitude tests
Likert Format
- Likert format: requires that a respondent indicate the degree of agreement with a particular attitudinal question
  o Strongly disagree ... Strongly agree
  o For measurements of attitude
- Used to create Likert scales: scales require assessment of item discriminability
- Familiar and easy → likely to remain popular in personality and attitude tests
Category Format
- Category format: uses more choices than Likert; e.g., a 10-point rating scale
- Disadvantage: responses to items on 10-point scales are affected by the groupings of the people or things being rated
- People change their ratings depending on context
  o This problem can be avoided if the endpoints of the scale are clearly defined and the subjects are frequently reminded of the definitions of the endpoints
- Optimal number of points is 7?
  o The number depends on the fineness of the discrimination that subjects are willing to make
  o When people are highly involved with some issue, they will tend to respond best to a greater number of categories
- Increasing the number of response categories may not increase reliability and validity
- Visual analogue scale: the respondent is given a 100-millimeter line and asked to place a mark between two well-defined endpoints
  o e.g., used to measure self-rated health
Checklists and Q-Sorts
- Adjective Checklist: the subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself
  o Requires subjects either to endorse such adjectives or not, thus allowing only two choices for each item
- Q-Sort: increases the number of categories
  o Used to describe oneself or to provide ratings of others
Other Possibilities
- Forced-choice and Likert formats are clearly the most popular in contemporary tests and measures
- Checklists have fallen out of favor because they are more prone to error than are formats that require responses to every item
- Frequent advice is not to use "all of the above" as a response option

TEST TRYOUT
What is a good item?
o Reliable and valid
o Helps to discriminate between testtakers

ITEM ANALYSIS
The Item-Difficulty Index
o Obtained by calculating the proportion of the total number of testtakers who answered the item correctly: "p"
o Higher p = easier item
o "Difficulty" can be replaced with "endorsement" in non-achievement tests
o The midpoint representing the optimal difficulty is obtained by summing the chance-of-success proportion and 1.00 and then dividing the sum by 2 (for a four-option multiple-choice item: (.25 + 1.00)/2 ≈ .63)
Item Reliability Index
o An indication of the internal consistency of a test
o Equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score
o Factor analysis and inter-item consistency
  - Factor analysis determines whether items on a test appear to be measuring the same thing
The Item-Validity Index
o A statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure
o Requires: the item-score standard deviation and the correlation between the item score and the criterion score
The Item-Discrimination Index
o Measures how adequately an item separates or discriminates between high scorers and low scorers: "d"
o Compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores
o A higher d means a greater number of high scorers answered the item correctly
o A negative d means low-scoring examinees are more likely to answer the item correctly than high-scoring examinees
o Analysis of item alternatives (distractors) can also be informative
Item-Characteristic Curves
o A graphic representation of item difficulty and discrimination

Other Considerations in Item Analysis
o Guessing
  - Usually in some direction
  - Depends on the individual's willingness to take risks
o Item fairness
  - Bias
o Speed tests
  - The last items will appear to be more difficult because not everyone got to them

Qualitative Item Analysis
- Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
- Qualitative item analysis: various nonstatistical procedures designed to explore how individual test items work
  o Through means like interviews and group discussions
- "Think aloud" test administration
  o An approach to cognitive assessment that entails respondents vocalizing thoughts as they occur
  o Used to shed light on the testtaker's thought processes during the administration of a test
- Expert panels
  o Sensitivity review: a study of test items in which they are examined for fairness to all prospective testtakers as well as for the presence of offensive language, stereotypes, or situations

ITEM ANALYSIS (KAPLAN BASED)
The Extreme Group Method
- Compares people who have done well with those who have done poorly on a test
- The difference between these proportions is called the discrimination index
The Point Biserial Method
- Find the correlation between performance on the item and performance on the total test
- The correlation between a dichotomous variable and a continuous variable is called a point biserial correlation
- On tests with only a few items, using this is problematic because performance on the item contributes to the total test score
Pictures of Item Characteristics
- A valuable way to learn about items is to graph their characteristics, which you can do with the item characteristic curve
o Total test score is used as an estimate of the amount of a objectives by writing clear and precise statements about what the
‘trait’ possessed by individuals learning program is attempting to achieve
- Relationship between performance on the item and performance on - To evaluate the items: one should give the test to two groups of
the test gives some info about how well the item is tapping the info students – one that has been exposed to the learning unit and one
we want that has not
Drawing the Item Characteristic Curve - Bottom of the V is the antimode – the least frequent score
- To draw this, we need to define discrete categories of test - This point divides those who have been exposed to the unit from
performance those who have not been exposed and is usually taken as the cutting
- If the test has been given to many people, we might choose to make score or point, or what marks the point of decision
each test score a single category - When people get scores higher than the antimode, we assume that
- Gradual positive slope of the line demonstrates that the proportion of they have met the objective of the test
people who pass the item gradually increases as test scores increase Limitations of Item Analysis
o This means that the item successfully discriminates at all - Main Problem: though statistical methods for item analysis tell the
levels of test performance test constructor which items do a good job of separating students,
- Ranges in which the curve changes suggest that the item is sensitive, they do not help the students learn
while flat ranges suggest areas of low sensitivity - Although the data are available to give the child feedback on the
- Item analysis breaks the general rule the increasing the number of “bug” in their thinking, nothing in the testing procedure initiates this
items makes a test more reliable guidance
- When bad items are eliminated, the effects of chance responding can TEST REVISION
be eliminated and the test can become more efficient, reliable, and Test Revision in the Life Cycle of an Existing Test
valid  Tests get old and need revision
Item Response Theory  Questions arise over equivalence of two tests
- According to classical test theory, a score is derived from the sum of  Cross-validation and Co-validation
an individual’s responses to various items, which are sampled from a o Cross-validation: revalidation of a test on a sample of
larger domain that represents a specific trait or ability testtakers other than those on whom test performance
- New approaches consider the chances of getting particular items right was originally found to be a valid predictor of some
or wrong – item response theory – make extensive use of item criterion
analysis o Validity shrinkage: decrease in item validities that
o With this, each item on a test has its own item inevitably occurs after cross-validation of finding
characteristic curve that describes the probability of o Co-validation: test validation process conducted on two or
getting each particular item right or wrong given the ability more tests using the same sample of testtakers
level of each test taker o Co-norming: when co-validation is used in conjunction
o Testers can make an ability judgment without subjecting with the creation of norms or the revision of existing
the test taker to all of the test items norms
- Technical adv: builds on traditional models of item analysis and can o Quality assurance during test revision
provide info on item functioning, the value of specific items, and the  test givers must have some degree of
reliability of a scale qualification, training, and testing
- Two dimensions used are difficulty and discriminability  anchor protocol: test protocol scored by a
- Most attractive adv. Is that one can easily adapt the IRT tests for highly authoritative scorer that is designed as a
computer administration model for scoring and a mechanism for
o Computer can rapidly identify the specific items that are resolving scoring discrepancies
required to assess a particular ability level  scoring drift: a discrepancy between scoring in
- “peaked conventional” an anchor protocol and the scoring of another
- “rectangular conventional” – requires that test items be selected to protocol
create a wide range in level of difficulty The Use of IRT in Building and Revising Tests
o problem: only a few items of the test are appropriate for (item response theory)
individuals at each ability level; many test takers spend  Evaluating the properties of existing tests and guiding test revision
much of their time responding to items either considerably  Determining measurement equivalence across testtaker populations
below their ability level or too difficult to solve o Differential item functioning (DIF): phenomenon, wherein
- IRT addresses traditional problems in test construction well an item functions differently in one group of testtakers as
- IRT can identify respondents with unusual response patterns and offer compared to another group of testtakers known to have
insights into cognitive processes of the test taker the same level of the underlying trait
- May also reduce the biases against the people whoa re slow in  Developing item banks
completing test problems o Items from other instruments item pool  scrutiny
External Criteria preliminary item bank psychometric testingitem bank
- Item analysis has been persistently plagued by researchers’ continued
dependence on internal criteria, or total test score, for evaluating
items
Linking Uncommon Measures
- One challenge in test applications is how to determine linkages
between two different measures
Items for Criterion-Referenced Tests
- Traditional use of tests requires that we determine how well
someone has done on a test by comparing the person’s performance
to that of others
- Criterion-referenced tests compares performance with some clearly
defined criterion for learning
o Popular approach in individualized instruction programs
o Regarded as diagnostic instruments
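A minimal sketch of an item characteristic curve under a simple one-parameter (Rasch-type) logistic model — one common IRT formulation, chosen here for illustration:

```python
import math

def icc(theta, difficulty):
    # Probability of a correct response rises with ability (theta)
    # relative to the item's difficulty
    return 1 / (1 + math.exp(-(theta - difficulty)))

# e.g., icc(0.0, 0.0) = 0.5 — a testtaker whose ability matches the
# item's difficulty has a 50% chance of answering correctly
```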
CHAPTER 9: INTELLIGENCE AND ITS MEASUREMENT
What is Intelligence?
Intelligence: a multifaceted capacity that manifests itself in different ways across the lifespan. It usually includes the abilities to:
- Acquire and apply knowledge
- Reason logically
- Plan effectively
- Infer perceptively
- Make judgments and solve problems
- Grasp and visualize concepts
- Pay attention
- Be intuitive
- Find the right words and thoughts with facility
- Cope with, adjust to, and make the most of new situations
Intelligence Defined: Views of the Lay Public
- Both social and academic
Intelligence Defined: Views of Scholars and Test Professionals
- Francis Galton
  o First to publish on the heritability of intelligence
  o The most intelligent persons were those with the best sensory abilities
- Alfred Binet
  o Made tests of intelligence, but didn't define it
  o Components of intelligence: reasoning, judgment, memory, abstraction
  o Added that the definition is complex; intelligence requires the interaction of components
  o He argued that when one solves a particular problem, the abilities used cannot be separated because they interact to produce the solution
- David Wechsler
  o The best way to measure this global ability is by measuring aspects of several "qualitatively differentiable" abilities
  o Complexity of intelligence
  o Conceptualization as an "aggregate" or "global" capacity
- Jean Piaget
  o Studied children
  o Believed the order of maturation to be unchangeable
  o With age, increased schema: an organized action or mental structure that, when applied to the world, leads to knowing or understanding
  o Learning occurs through assimilation (actively organizing new information so that it fits in with what already is perceived and thought) and accommodation (changing what is already perceived or thought so that it fits with new information)
  o Stages: Sensorimotor (0-2), Preoperational (2-6), Concrete Operational (7-12), Formal Operational (12 and older)
- All share interactionism: the complex concept by which heredity and environment are presumed to interact and influence the development of one's intelligence
- Factor-analytic theories: the focus is squarely on identifying the ability or abilities deemed to constitute intelligence
- Information-processing theories: the focus is on identifying the specific mental processes that constitute intelligence

Factor-Analytic Theories of Intelligence
- Spearman: proposed a general intellectual ability factor (g) plus group factors
  o ex.) linguistic, mechanical, arithmetical abilities
- Guilford: multiple-factor model of intelligence
  o Explained mental activities by deemphasizing any reference to g
- Thurstone: conceived intelligence as being composed of 7 primary abilities
- Gardner: developed the theory of multiple intelligences
  o Question over whether emotional intelligence exists
  o Logical-mathematical, bodily-kinesthetic, linguistic, musical, spatial, interpersonal, and intrapersonal
- Raymond Cattell: fluid vs. crystallized intelligence
  o Crystallized intelligence: acquired skills and knowledge and their retrieval; retrieval of information and application of general knowledge
  o Fluid intelligence: nonverbal, relatively culture-free, and independent of specific instruction
- Horn: added more factors to the model
  o Vulnerable abilities: decline with age and tend not to return to preinjury levels following brain damage
  o Maintained abilities: tend not to decline with age and may return to preinjury levels following brain damage
- Carroll:
  o Three-stratum theory of cognitive abilities: layered, like geology
  o Hierarchical model: all of the abilities listed in a stratum are subsumed by or incorporated in the strata above
  o Those in the first stratum are narrow abilities
- CHC model (Cattell-Horn-Carroll)
  o Some overlap, some differences between the source models
  o Doesn't use g
  o Has broader abilities than Carroll's theory
- McGrew: integrated the Cattell-Horn and Carroll models
- McGrew and Flanagan: the integrated McGrew-Flanagan CHC model
  o Features 10 broad-stratum abilities
  o and 70 narrow-stratum abilities
  o Makes no provision for the general intellectual ability factor (g)
  o g was omitted because it has little practical relevance to cross-battery assessment and interpretation
The Information-Processing View
- Aleksandr Luria
  o How (not what) information is processed
  o Simultaneous/parallel processing: information is integrated all at once
  o Successive/sequential processing: each bit of information is individually processed
- PASS model: (Planning, Attention, Simultaneous, Successive) model of assessing intelligence
- Sternberg: "The essence of intelligence is that it provides a means to govern ourselves so that our thoughts and actions are organized, coherent, and responsive to both our internally driven needs and to the needs of the environment"

Measuring Intelligence
Types of Tasks Used in Intelligence Tests
- Infants: sensorimotor tests, interviews with parents
- Older children: tests of verbal and performance abilities
- Mental age: an index that refers to the chronological age equivalent of one's test performance
- Adults: retention of general information, quantitative reasoning, expressive language and memory, and social judgment
expressive language and memory, and social judgment
Factor-Analytic Theories of Intelligence: Theory in Intelligence Test Development and Interpretation
 Charles Spearman: pioneered new techniques to measure  Weschler made a dichotomous test (Performance and Verbal), but
intercorrelations between tests. advocated multifaceted definition
o Existence of a general intellectual ability factor (g) that  Thorndike: intelligence = social, concrete, abstract
tapped by all other mental abilities.  Putting theories into test are extremely hard
 g representing the portion of the variance that all intelligence tests
have in common and the remaining portions of the variance being Intelligence: Some Issues:
accounted for either by specific components (s) or by error Nature vs. Nurture
components (e)  Currently believed to be mix of two
 greater g = better test was thought to predict overall intelligence  Performationism: all structures, including intelligence are had at birth
 group factors: neither as general as g nor as specific as s and can’t be improved upon

Downloaded by Abdul Jabbar (icq50112@bcaoo.com)


lOMoARcPSD|3728912

CHAPTER 9: INTELLIGENCE AND ITS MEASUREMENT


 Led to predeterminism: one's abilities are predetermined by genetic inheritance, and no learning or intervention can enhance them
 Interactionist: people inherit a certain intellectual potential
o There is a limit set by genetic endowment (e.g., one can't ever have x-ray vision)
The Stability of Intelligence
 IQ is fairly stable throughout one's adult life
 Some cognitive abilities nonetheless seem to decline with age
The Construct Validity of Tests of Intelligence
 Having construct validity requires a unified understanding of what intelligence is
 Very difficult: Spearman says it's one thing, Guilford says it's many
 Thorndike's approach is a sort of compromise
o Look for one central factor with three additional factors
representing social, concrete, and abstract intelligences
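A toy numeric illustration (invented correlations, not from the text) of the factor-analytic reasoning behind the "one central factor" idea: when subtests intercorrelate positively, the first eigenvalue of the correlation matrix absorbs most of the shared variance, which is the statistical footprint of a g-like factor.

```python
# Toy illustration (hypothetical numbers) of the logic behind g:
# positively intercorrelated subtests yield one dominant factor.
import numpy as np

# Invented correlation matrix for three subtests (verbal, numerical, spatial).
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

w, _ = np.linalg.eigh(R)   # eigenvalues in ascending order
share = w[-1] / w.sum()    # variance carried by the largest factor
print(f"first factor explains {share:.0%} of the total variance")  # ~2/3
```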
Other Issues
 Flynn effect: measured IQ scores have risen steadily from one generation to the next (roughly three points per decade), without a matching rise in "true intelligence"
 Personality
o High IQ: Need for achievement, competition, curiosity,
confidence, emotional stability etc.
o Low IQ: passivity, dependence, maladjustment
o Temperament (used to describe infants)
 Gender
o Men usually outscore on visual-spatialization tasks and on overall intelligence scores
o Women tend to outscore on language-skill tasks
o But the differences can be bridged
 Family Environment
o Divorce can have negative effects
o Begins with “maternal effects” in womb
 Culture
o Provides specific models for thinking, acting and feeling
o Assumed that if cultural factors can be controlled then
differences between cultural groups will be lessened
o Assumed that culture can be removed by the reliance on
exclusively nonverbal tasks
 Tend not to be very good at predicting success
in various academic and business settings
o Culture loading: the extent to which a test incorporates
the vocabulary, concepts, traditions, knowledge and
feelings associated with a particular culture
o No test can be culture free
o Culture-fair intelligence test: test/assessment process
designed to minimize the influence of culture with regard
to various aspects of evaluation procedure
o Another approach called for culture-specific intelligence tests
 Ex.) the BITCH (Black Intelligence Test of Cultural Homogeneity) measured street-wiseness
 Lacked predictive validity and useful, practical information
CHAPTER 10: TESTS OF INTELLIGENCE
The Stanford-Binet Intelligence Scales
 First to have detailed administration and scoring instructions
 First American test to test IQ
 First to use alternate items (an item that can be used in place of another)
 Lacked minority group representation
 Ratio IQ = (mental age / chronological age) x 100 (see the worked sketch at the end of this chapter's notes)
 Deviation IQ/test composite: performance of one individual compared to the performance of others of the same age. Has a mean of 100 and a standard deviation of 16
 Age scale: items grouped by age
 Point scale: items organized by category

The Stanford-Binet Intelligence Scales: Fifth Edition
 Measures fluid intelligence, crystallized knowledge, quantitative knowledge, visual-processing, and short-term (working) memory
 Utilizes adaptive testing: testing individually tailored to testtakers to ensure that items are neither too difficult (frustrating) nor too easy (false hope)
 Examiner establishes rapport with the testtaker, then administers a routing test to direct (route) the examinee to the test items most likely at an optimal level of difficulty
 Teaching items: show the testtaker what is expected and how to do it
o Can be used for qualitative assessment, but not scoring
 Subtests for verbal and nonverbal tests share the same name, but involve different tasks
 Floor: lowest level of items on a subtest
 Ceiling: highest-level item of a subtest
 Basal level: base-level criterion that must be met for testing on the subtest to continue
 Ceiling level is met when the testtaker fails a certain number of items in a row; the test discontinues here
 Scores: raw → standard → composite
 Extra-test behavior: behavioral observation

The Wechsler Tests
- Commonality between all versions: all yield deviation IQs with a mean of 100 and a standard deviation of 15

Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV)
 Core subtest: administered to obtain a composite score
 Supplemental/Optional subtest: provides additional clinical information or extends the number of abilities or processes sampled
 Yields four index scores: a Verbal Comprehension Index, a Working Memory Index, a Perceptual Reasoning Index, and a Processing Speed Index

The Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV)
 Process score: index designed to help understand how testtakers process various kinds of information
 WISC-IV compared to the SB5

The Wechsler Preschool and Primary Scale of Intelligence – Third Edition (WPPSI-III)
 New scale for children under 6
 First major intelligence test which adequately sampled the total population of the United States
 Subtests labeled core, supplemental, or optional

Wechsler, Binet, and the Short Form
 Short form: test that has been abbreviated in length to reduce the time needed to administer, score, and interpret
 Used with caution, only for screening
 Provide only estimates
 Reducing the number of items usually reduces reliability and thus validity
 Wechsler Abbreviated Scale of Intelligence

The Wechsler Test in Perspective
 Factor Analysis
o Exploratory factor analysis: summarizing data when we are not sure how many factors are present in our data
o Confirmatory factor analysis: used to test highly specific hypotheses about factor structure

Other Measures of Intelligence

Tests Designed for Individual Administration
 Kaufman Adolescent and Adult Intelligence Test
 Kaufman Brief Intelligence Test
 Kaufman Assessment Battery for Children
 Away from information processing and towards a distinction between sequential and simultaneous processing

Tests Designed for Group Administration
 Group Testing in the Military
o WWI: need for the government to test intelligence as a means of differentiating the "unfit" from those of "exceptionally superior ability"
o Army Alpha Test: for army recruits who could read. Included general information questions, analogies, and scrambled sentences to reassemble
o Army Beta Test: for foreign-born or illiterate recruits; included mazes, coding, and picture completion
o After the war, the Alpha and Beta tests were used rampantly, and oftentimes misused
o Screening tools: instrument or procedure used to identify a particular trait or constellation of traits
o ASVAB (Armed Services Vocational Aptitude Battery): administered to prospective recruits or high school students looking for career guidance
 5 career areas: clerical, electronics, mechanical, skill-technical, and combat operations
 Group Testing in Schools
o Useful in developing a child's profile, but cannot be the sole indicator
o Groups of 10-15
o Starting in Kindergarten
o Also called traditional group testing, because more modern forms can utilize computers; those are more aptly called individual testing

Measures of Specific Intellectual Abilities
 Widely used intelligence tests test only a sampling of the many attributable factors aiding in intelligence
 Ex.) Creativity
o Commonly thought to be composed of originality, fluency, flexibility, and elaboration
o If the focus is too heavily on whether an answer is correct, doesn't allow for creativity
o Achievement tests require convergent thinking: a deductive reasoning process that entails recall and consideration of facts as well as a series of logical judgments to narrow down solutions and eventually arrive at one solution
o Divergent thinking: a reasoning process in which thought is free to move in many different directions, making several solutions possible
 Associated words, uses of a rubber band, etc.
 Test-retest reliability for some of these tests is near unacceptable
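A worked sketch (invented numbers, not from the text) of the two IQ metrics defined earlier in this chapter's notes: the early Stanford-Binet ratio IQ, and the deviation IQ that the Wechsler tests report on a scale with mean 100 and SD 15 (SD 16 for the Binet composite, per the notes above). The deviation IQ simply re-expresses a z score against same-age norms.

```python
# Hypothetical worked example of ratio IQ versus deviation IQ.

def ratio_iq(mental_age, chronological_age):
    return 100 * mental_age / chronological_age

def deviation_iq(raw_score, norm_mean, norm_sd, mean=100, sd=15):
    z = (raw_score - norm_mean) / norm_sd   # standing relative to same-age peers
    return mean + sd * z

print(ratio_iq(12, 10))                           # mental age 12 at age 10 -> 120.0
print(deviation_iq(34, norm_mean=30, norm_sd=4))  # 1 SD above peers -> 115.0
```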
CHAP. 11: Other Individual Tests of Ability in Education and Special Education

Alternative Individual Ability Tests Compared with the Binet and Wechsler Scales
- None of these are clearly superior from a psychometric standpoint
- Some are less stable, and most are more limited in their documented validity
- Compare poorly to the Binet and Wechsler on all accounts
- They don't rely on a verbal response as much as the Binet and Wechsler
- Just use pointing or Yes/No responses, thus do not depend on the complex integration of visual and motor functioning
- Contain a performance scale or subscale
- Their specificity often limits the range of functions or abilities that they can measure
- Because they are designed for special populations, some alternatives can be administered totally without verbal instructions

Specific Individual Ability Tests
- Earliest individual tests were typically designed for specific purposes or special populations
- One of the first, the Seguin Form Board Test (1800s), produced only a single score
o Used primarily to evaluate mentally retarded adults and emphasized speed and performance
- After, the Healy-Fernald Test was developed as an exclusively nonverbal test for adolescent delinquents
- Knox developed a battery of performance tests for non-English-speaking adult immigrants to the US – administered without language; speed not emphasized
- These early individual tests were designed for specific populations, produced a single score, and had nonverbal performance scales
- Could be administered without visual instructions and used with children as well as adults

Infant Scales
- Where mental retardation or developmental delays are suspected, these tests can supplement observation, genetic testing, and other medical procedures

Brazelton Neonatal Assessment Scale (BNAS)
- Individual test for infants between 3 days and 4 weeks
- Purportedly provides an index of a newborn's competence
- Favorable reviews
- Considerable research base
- Wide use as a research tool and as a diagnostic tool for special purposes
- Commonly used scale for the assessment of neonates
- Drawbacks:
o No norms are available
o More research is needed concerning the meaning and implication of scores
o Poorly documented predictive and construct validity
o Test-retest reliability leaves much to be desired

Gesell Developmental Schedules (GDS)
- Infant intelligence measures
- Used as a research tool by those interested in assessing infant intellectual development after exposure to mercury, diagnoses of abnormal brain formation in utero, and assessing infants with autism
- Children of 2.3 months to 6.3 years
- Obtains normative data concerning various stages in maturation
- Individual's developmental quotient (DQ) is determined according to a test score, which is evaluated by assessing the presence or absence of behavior associated with maturation
- Provides an intelligence quotient like that of the Binet (see the sketch following these infant-scale notes)
o (developmental quotient / chronological age) x 100
- But falls short of acceptable psychometric standards
- No reliability or validity
- Does appear to help uncover subtle deficits in infants

Bayley Scales of Infant and Toddler Development – Third Edition (BSID-III)
- Bases assessments on normative maturational developmental data
- Designed for infants between 1 and 42 months
- Assesses development across 5 domains: cognitive, language, motor, socioemotional, and adaptive
- Motor scale: assumes that later mental functions depend on motor development
- Excellent standardization
- Generally positive reviews
- Strong internal consistency
- More validity studies needed
- Widely used in research – children with Down syndrome, pervasive developmental disorders, cerebral palsy, language impairment, etc.
- Most psychometrically sound test of its kind
- Predictive validity, though, remains an open question

Cattell Infant Intelligence Scale (CIIS)
- Based on normative developmental data
- Downward extension of the Stanford-Binet scale for 2-30 month olds
- Similar to the Gesell scale
- Rarely used today
- Sample is primarily based on children of parents from lower and middle classes and therefore does not represent the general population
- Unchanged for 60 years
- Psychometrically unsatisfactory
- Standardization sample not representative of the population
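A tiny worked example (invented numbers, not from the text) of the Gesell-style quotient given above; structurally it is the same arithmetic as the early Binet ratio IQ, with a developmental-age score in place of mental age.

```python
# Hypothetical developmental quotient (DQ) computation:
# developmental age relative to chronological age, scaled to 100.

def developmental_quotient(developmental_age_months, chronological_age_months):
    return 100 * developmental_age_months / chronological_age_months

# An infant showing 18-month-level behavior at 24 months of age:
print(developmental_quotient(18, 24))   # -> 75.0
```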
Major Tests for Young Children

McCarthy Scales of Children's Abilities (MSCA)
- Measure ability in children between 2-8 years
- Present a carefully constructed individual test of human ability
- Meager validity
- Produces a pattern of scores as well as a variety of composite scores
- General cognitive index (GCI): standard score with a mean of 100 and a standard deviation of 16
o Index reflects how well the child has integrated prior learning experiences and adapted them to the demands of the scales
- Relatively good psychometric properties
- Reliability coefficients in the low .90s
- Used in research studies
- Good validity? Good assessment tool for …

Kaufman Assessment Battery for Children – Second Edition (KABC-II)
- Individual ability test for children between 3-18 years
- 18 subtests in 5 global scales called sequential processing, simultaneous processing, learning, planning, and knowledge
- Intended for psychological, clinical, minority-group, preschool, and neuropsychological assessment as well as research
- Sequential-simultaneous distinction
o Sequential processing refers to a child's ability to solve problems by mentally arranging input in sequential or serial order
o Simultaneous processing refers to a child's ability to synthesize info from mental wholes in order to solve a problem
- Nonverbal measure of ability too
- Well constructed and psychometrically sound
- Not much evidence of (good) validity
- Poorer predictive validity for school achievement – smaller differences between whites and minorities
- Test suffers from a noncorrespondence between its definition and its measurement of intelligence

General Individual Ability Tests for Handicapped and Special Populations

Columbia Mental Maturity Scale – Third Edition (CMMS)
- Purports to evaluate ability in normal and variously handicapped children from 3-12 years
- Requires neither a verbal response nor fine motor skills
- Requires the subject to discriminate similarities and differences by indicating which drawing does not belong on a 6-by-9-inch card containing 3-5 drawings
- Multiple choice
- Standardization sample is impressive
- Vulnerable to random error
- Reliable instrument that is useful in assessing ability in many people with sensory, physical, or language handicaps
- Good screening device

Peabody Picture Vocabulary Test – Fourth Edition (PPVT-IV)
- 2-90 years
- Multiple-choice tests that require the subject to indicate Yes/No in some manner
- Instructions administered aloud (not for the deaf)
- Purports to measure hearing or receptive vocabulary, presumably providing a nonverbal estimate of verbal intelligence
- Can be done in 15 minutes, requires no reading ability
- Good reliability and validity
- Should never be used as a substitute for a Wechsler or Binet IQ
- Important component in a test battery, or used as a screening device
- Easy to administer and useful for a variety of groups
- BUT: tendency to underestimate IQ scores, plus the problems inherent in the multiple-choice format

Leiter International Performance Scale – Revised (LIPS-R)
- Strictly a performance scale
- Aims at providing a nonverbal alternative to the Stanford-Binet scale for 2-18 year olds
- For research and clinical settings, where it is still widely utilized to assess the intellectual function of children with pervasive developmental disorders
- Purports to provide a nonverbal measure of general intelligence by sampling a wide variety of functions from memory to nonverbal reasoning
- Can be applied to the deaf and language-disabled
- Untimed
- Good validity

Porteus Maze Test (PMT)
- Popular but poorly standardized nonverbal performance measure of intelligence
- Individual ability test
- Consists of maze problems (12)
- Administered without verbal instruction, thus used for a variety of special populations
- Needs restandardization

Testing Learning Disabilities
- Major concept is that a child average in intelligence may fail in school because of a specific deficit or disability that prevents learning
- Federal law entitles every eligible child with a disability to a free appropriate public education and emphasizes special education and related services designed to meet his or her unique needs and prepare them for further education, employment, and independent living
- To qualify, a child must have a disability and educational performance affected by it
- Educators today can find other ways to determine when a child needs extra help
- Process called Response to Intervention (RTI): premise is that early intervening services can prevent academic failure for many students with learning difficulties
- Signs of a learning problem:
o Disorganization
o Careless effort
o Forgetfulness
o Refusal to do schoolwork or homework
o Slow performance
o Poor attention
o Moodiness

Illinois Test of Psycholinguistic Abilities (ITPA-3)
- Assumes that failure to respond correctly to a stimulus can result not only from a defective output system but also from a defective input or information-processing system
- Stage 1: info must first be received by the senses before it can be analyzed
- Stage 2: info is analyzed or processed
- Stage 3: with processed info, the individual must make a response
- Theorizes that the child may be impaired in one or more specific sensory modalities
- 12 subtests that measure the individual's ability to receive visual, auditory, or tactile input independently of processing and output factors
- Purports to help isolate the specific site of a learning disability
- For children 2-10 years
- Early versions were hard to administer and had no reliability or validity
- Now, with revisions, the ITPA-3 is a psychometrically sound measure of children's psycholinguistic abilities

Woodcock-Johnson III
- Evaluates learning disabilities
- Designed as a broad-range individually administered test to be used in educational settings
- Assesses general intellectual ability, specific cognitive abilities, scholastic aptitude, oral language, and achievement
- Based on the CHC three-stratum theory of intelligence
- Compares a child's score on cognitive ability with score on achievement – can evaluate possible learning problems
- Relatively good psychometric properties
- For learning disability tests, three conclusions seem warranted:
o 1. Test constructors appear to be responding to the same criticisms that led to changes in the Binet and Wechsler scales and ultimately to the development of the KABC
o 2. Much more empirical documentation of theoretical research is needed
o 3. Users of learning disability tests should take great pains to understand the weaknesses of these procedures and not overinterpret results

Visiographic Tests
- Require a subject to copy various designs

Benton Visual Retention Test – Fifth Edition (BVRT-V)
- Tests for brain damage are based on the concept of psychological deficit, in which a poor performance on a specific task is related to or caused by some underlying deficit
- Assumes that brain damage easily impairs visual memory ability
- For individuals 8 years and older
- Consists of geometric designs briefly presented and then removed
- Computerized version developed

Bender Visual Motor Gestalt Test (BVMGT)
- Consists of 9 geometric figures that the subject is asked to copy
- By age 9, any child of normal intelligence can copy the figures with only one or two errors
- Errors occur for people whose mental age is less than 9, brain damage, nonverbal learning disabilities, emotional problems
- Questionable reliability

Memory-for-Designs (MFD) Test
- Drawing test that involves perceptual-motor coordination
- Used for people 8-60 years
- Good split-half reliability (see the Spearman-Brown sketch following these notes)
- Needs more validity documentation
- All these tests are criticized because of their limitations in reliability and validity documentation
- Good as screening devices though

Creativity: Torrance Tests of Creative Thinking (TTCT)
- Measurement of creativity is underdeveloped in psychological testing
- Creativity: ability to be original, to combine known facts in new ways, or to find new relationships between known facts
- Evaluating this is a possible alternative to IQ
- Creativity tests are in early stages of development
- Torrance tests separately measure aspects of creative thinking such as fluency, originality, and flexibility
- Does not meet the Binet and Wechsler scales in terms of standardization, reliability, or validity
- Unbiased indicator of giftedness
- Inconsistent tests, but available data reflect the tests' merit and potential

Individual Achievement Tests: Wide Range Achievement Test-4 (WRAT-4)
- Achievement tests measure what the person has actually acquired or done with that potential
- Discrepancies between IQ and achievement have traditionally been the main defining feature of a learning disability
- Most achievement tests are group tests
- WRAT-4 purportedly permits an estimate of grade-level functioning in word reading, spelling, math computation, and sentence comprehension
- Used for children 5 years and older
- Easy to administer
- Problems:
o Inaccuracy in evaluating grade-level reading ability
o Not proven as psychometrically sound

CHAP. 12: Standardized Tests in Education, Civil Service, and the Military

- When justifying the use of group standardized tests, test users often have problems defining what exactly they are trying to predict, or what the test criterion is

Comparison of Group and Individual Ability Tests
- Individual tests require a single examiner for a single subject
o Examiner provides instructions
o Subject responds, examiner records response
o Examiner evaluates response
o Examiner takes responsibility for eliciting a maximum performance
o Scoring requires considerable skill
- Those who use the results of group tests must assume that the subject was cooperative and motivated
o Many subjects tested at a time
o Subjects record own responses
o Subjects not praised for responding
o Low scores on group tests often difficult to interpret
o No safeguards
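Reliability claims recur throughout these notes: the MFD's good split-half reliability above, and the earlier short-form caveat that cutting items cuts reliability. Both rest on the Spearman-Brown relation. Here is a hypothetical sketch with invented coefficients (not from the text):

```python
# Hypothetical sketch of the Spearman-Brown formula,
# r_new = n * r / (1 + (n - 1) * r), which projects reliability when a test
# is lengthened or shortened by a factor n.

def spearman_brown(r, n):
    return n * r / (1 + (n - 1) * r)

# Split-half use: a half-test correlation of .70, stepped up to full length (n=2):
print(round(spearman_brown(0.70, 2), 2))    # -> 0.82

# Short-form caveat: halving a test with reliability .90 (n=0.5):
print(round(spearman_brown(0.90, 0.5), 2))  # -> 0.82
```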
Advantages of Individual Tests
- Provide info beyond the test score
- Allow the examiner to observe behavior in a standard setting
- Allow individualized interpretation of test scores

Advantages of Group Tests
- Are cost-efficient
- Minimize professional time for administration and scoring
- Require less examiner skill and training
- Have more objective and more reliable scoring procedures
- Have especially broad application

Overview of Group Tests
Characteristics of Group Tests
- Characterized as paper-and-pencil or booklet-and-pencil tests because the only materials needed are a printed booklet of test items, a test manual, scoring key, answer sheet, and pencil
- Computerized group testing becoming more popular
- Most group tests are multiple choice – some free response
- Group tests outnumber individual tests
o One major difference is whether the test is primarily verbal, nonverbal, or a combination
- Group test scores can be converted to a variety of units

Selecting Group Tests
- A test user need never settle for anything but well-documented and psychometrically sound tests

Using Group Tests
- As reliable and well standardized as the best individual tests
- Validity data for some group tests are weak/meager/contradictory

Use Results with Caution
- Never consider scores in isolation or as absolutes
- Be careful using tests for prediction
- Avoid overinterpreting test scores

Be Especially Suspicious of Low Scores
- Assume that subjects understand the purpose of testing, want to succeed, and are equally rested/free of stress

Consider Wide Discrepancies a Warning Signal
- May reflect emotional problems or severe stress

When in Doubt, Refer
- With low scores, discrepancies, etc., refer the subject for individual testing
- Get a trained professional

Group Tests in the Schools: Kindergarten Through 12th Grade
- Purpose of tests is to measure educational achievement in schoolchildren

Achievement Tests versus Aptitude Tests
- Achievement tests attempt to assess what a person has learned following a specific course of instruction
o Evaluate the product of a course of training
o Validity is determined primarily by content-related evidence
- Aptitude tests attempt to evaluate a student's potential for learning rather than how much a student has already learned
o Evaluate effects of unknown and uncontrolled experiences
o Validity is judged primarily on its ability to predict future performance
- Intelligence test measures general ability
- These three tests are highly interrelated

Group Achievement Tests
- Stanford Achievement Test: one of the oldest of the standardized achievement tests widely used in school systems
- Well-normed and criterion-referenced, with psychometric documentation
- Another one is the Metropolitan Achievement Test, which measures achievement in reading by evaluating vocab, word recognition, and reading comprehension
- Both of these are reliable and normed on big samples

Group Tests of Mental Abilities (Intelligence)

Kuhlmann-Anderson Test (KAT) – 8th Edition
- KAT is a group intelligence test with 8 separate levels covering kindergarten through 12th grade
- Items are primarily nonverbal at lower levels, requiring minimal reading and language ability
- Suited to young children and those who might be handicapped in following procedures
- Scores can be expressed in verbal, quantitative, and total scores
- Scores at other levels can be expressed as percentile bands: like a confidence interval; provides the range of percentiles that most likely represents a subject's true score (a worked sketch follows this section)
- Good construction, standardization, and other excellent psychometric qualities
- Good validity and reliability
- Potential for use and adaptation for non-English-speaking individuals, or even countries, needs to be explored

Henmon-Nelson Test (H-NT)
- Of mental abilities
- 2 sets of norms available:
o one based on raw score distributions by age, the other on raw score distributions by grade
- Reliabilities in the .90s
- Helps predict future academic success quickly
- Does NOT consider multiple intelligences

Cognitive Abilities Test (COGAT)
- Good reliability
- Provides three separate scores: verbal, quantitative, and nonverbal
- Item selection is superior to the H-NT in terms of selecting minority, culturally diverse, and economically disadvantaged children
- Can be adopted for use outside the US
- No cultural bias
- Each of the subtests requires 32-34 minutes of actual working time, which the manual recommends be spread out over 2-3 days
- Standard age scores averaged some 15 points lower for African American students on the verbal and quantitative batteries

Summary of K-12 Group Tests
- All are sound, viable instruments

College Entrance Tests
- SAT Reasoning Test, Cooperative School and College Ability Tests, and American College Test

SAT Reasoning Test
- Most widely used college entrance test
- Used by 1000+ private and public institutions
- Renorming of the SAT did not alter the standing of test takers relative to one another in terms of percentile rank
- New scoring (2400) is likely to reduce interpretation errors, as interpreters can no longer rely on comparisons with older versions
- 45 minutes longer – 3 hours and 45 minutes to administer
- May disadvantage students with disabilities such as ADD
- Verbal section now called "critical reading" – focus on reading comprehension
- Math section eliminated much of the basic grammar-school math questions
- Weakness: poor predictive power regarding the grades of students who score in the middle ranges
- Little doubt that the SAT predicts first-year college GPA
o But, African Americans and Latinos tend to obtain lower scores on average
o Women score lower on the SAT but higher in GPA
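The KAT percentile bands mentioned above work like confidence intervals around an observed score. A hypothetical sketch (invented norm values and reliability, not from the text) of one common way such a band can be built, from the standard error of measurement under a normal model:

```python
# Hypothetical percentile band: observed score +/- one standard error of
# measurement (SEM), endpoints converted to percentile ranks.
from statistics import NormalDist

norm_mean, norm_sd, reliability = 100, 15, 0.90
sem = norm_sd * (1 - reliability) ** 0.5          # ~4.74 score points

def percentile(score):
    return 100 * NormalDist(norm_mean, norm_sd).cdf(score)

observed = 110
low, high = observed - sem, observed + sem
print(f"band: {percentile(low):.0f}th to {percentile(high):.0f}th percentile")
```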
Cooperative School and College Ability Tests
- Falling out of favor
- Developed in 1955, has not been updated
- Purports to measure school-learned abilities as well as an individual's potential to undertake additional schooling
- Psychometric documentation not strong
- Little empirical data support its major assumption – that previous success in acquiring school-learned abilities can predict future success in acquiring such abilities

American College Test
- Updated in 2005, particularly useful for non-native speakers of English
- Produces specific content scores and a composite
- Makes use of the Iowa Test of Educational Development Scale
- Compares with the SAT in terms of predicting college GPA, alone or in conjunction with high-school GPA
- Internal consistency coefficients are not as strong in the ACT

Graduate and Professional School Entrance Tests

Graduate Record Examination Aptitude Test
- GRE purports to measure general scholastic ability
- Most frequently used in conjunction with GPA, letters of rec, and other academic factors
- General section with verbal and quantitative scores
- Third section evaluates analytical reasoning – now in essay format
- Contains an advanced section that measures achievement in at least 20 majors
- New 130-170 scoring scale
- Standard mean score of 500, and SD of 100
- Normative sample is relatively small
- Extreme time pressure
- Psychometric adequacy is less than that of the SAT – validity and reliability
- Predictive validity not great
- Overpredicts the achievement of younger students while underpredicting the performance of older students
- Many schools have developed their own norms and psychometric documentation and can use the GRE to predict success in their programs
- By looking at a GRE score in conjunction with GPA, graduate success can be predicted with greater accuracy than without the GRE
- Graduate schools also frequently complain that grades no longer predict scholastic ability well because of grade inflation – the phenomenon of rising average college grades despite declines in average SAT scores
o Led to a corresponding restriction in the range of grades
- As the validity of grades and letters of rec becomes more questionable, reliance on test scores increases
- Definite overall decline in verbal scores while quantitative and analytical scores are gradually rising

Miller Analogies Test
- Designed to measure scholastic aptitude for graduate studies
- Strictly verbal
- 60 minutes
- Knowledge of specific content and a wide vocab are very useful
- Most important factors appear to be the ability to see relationships and a knowledge of the various ways analogies can be formed
- Psychometric adequacy is reasonable
- Does not predict research ability, creativity, and other factors important to grad school

The Law School Admission Test
- LSAT problems require almost no specific knowledge
- Three types of problems: reading comprehension, logical reasoning (~half), and analytical reasoning
- Weight given to the LSAT score is openly published for each school approved by the American Bar Association
- Entrance into schools based on a weighted sum of score and GPA
- Psychometrically sound, reliability coefficients in the .90s
- Predicts first-year GPA in law school
- Content validity is exceptional
- Bias for minority group members, as well as women

Nonverbal Group Ability Tests

Raven Progressive Matrices
- RPM: one of the best known and most popular nonverbal group tests
- Suitable anytime one needs an estimate of an individual's general intelligence
- Groups or individuals, 5 years to adults
- Used throughout the modern world
- Uses matrices – nonverbal; with or without a time limit
- Research supports RPM as a measure of general intelligence, or Spearman's g
- Appears to minimize the effects of language and culture
- Tends to cut in half the selection bias that occurs with the Binet or Wechsler

Goodenough-Harris Drawing Test (G-HDT)
- Nonverbal intelligence test, group or individual
- Quick, easy, and inexpensive
- Subject instructed to draw a picture of a whole man and to do the best job possible
- Details get points
- One can determine mental ages by comparing scores with those of the normative sample
- Raw scores can be converted to standard scores with a mean of 100 and SD of 15
- Used extensively in test batteries

The Culture Fair Intelligence Test
- Designed to provide an estimate of intelligence relatively free of cultural and language influences
- Paper-and-pencil procedure that covers three age groups
- Two parallel forms are available
- Acceptable measure of fluid intelligence

Standardized Tests Used in the US Civil Service System
- General Aptitude Test Battery (GATB): a reading ability test that purportedly measures aptitude for a variety of occupations
o Used to make employment decisions in govt agencies
o Attempts to measure a wide range of aptitudes, from general intelligence to manual dexterity
- Controversial because it used within-group norming prior to the passage of the Civil Rights Act of 1991
- Today, any kind of score adjustment through within-group norming in employment practices is strictly forbidden by law

Standardized Tests in the US Military: The Armed Services Vocational Aptitude Battery
- ASVAB administered to more than 1.3 million people a year
- Designed for students in grades 11 and 12 and in postsecondary schools
- Yields scores used in both education and military settings
- Results can help identify students who potentially qualify for entry into the military and can recommend assignment to various military occupational training programs
- Great psychometric qualities
- Reliability coefficients are excellent
- Through a computerized format, subjects can be tested adaptively, meaning that the questions given each person can be based on his or her unique ability
- This cuts testing time in half