
Running head: RESEARCH ANALYSIS EXAM

Research Analysis Exam: A Critique of Evidence in Teacher Education: The Performance Assessment for California Teachers (PACT)

Tyler Rinker
University at Buffalo
Department of Learning and Instruction


Research Analysis Exam: A Critique of Evidence in Teacher Education: The Performance Assessment for California Teachers (PACT)

A particularly vexing problem facing educational leadership is that of ensuring that the teachers providing students with educational opportunities are of the highest caliber. We know that many factors affect student performance and achievement; however, mixed-model analysis conducted by Wright, Horn, and Sanders (1997) indicates that, of all of these, the most important factor affecting student learning is the teacher (p. 63). Furthermore, this effect is similar across all achievement levels, regardless of ability, and is additive and cumulative over grade levels with little evidence of compensatory effects. Thus, "students in classrooms of very effective teachers, following relatively ineffective teachers, make excellent academic gains but not enough to offset previous evidence of less than expected gains" (Wright, Horn, & Sanders, 1997, p. 63). The stakes for making the wrong hiring decision are high, and it is our students who pay the price for an incorrect choice.

In my own experience as an educational leader I often found myself wondering whether the candidates I selected would be successful at increasing student achievement. Entire practitioner books are dedicated to isolating and providing educators with an understanding of effective teacher qualities (Danielson, 1996; Stronge, 2002). These effective-teacher manuals often guided the interview process as hiring committees weighed the attributes we believed would allow the most qualified candidate to rise to the top of the pool of interviewees. Time revealed that some of those selections were correct and some were deleterious. I am sure that many educational leaders wish they possessed some sort of crystal ball that would reveal a potential hire's future performance. The process of teacher certification holds the hope of increasing the probability that teacher candidates seeking first-time employment will be highly qualified and will ultimately produce increased student achievement gains (Darling-Hammond, 2000). In this critique I examine a study that attempted to provide educators with a more effective, authentic, and integrated assessment for making credentialing decisions.


The study under critique, titled Evidence in Teacher Education: The Performance Assessment for California Teachers (PACT) and authored by Pecheone and Chung (2006), explored the use of the PACT as a viable alternative to California's more generic performance assessment for assessing teaching knowledge and skills. The push for exploring certification requirements as a means of increasing student achievement gains has been a growing interest since the first administration of the National Assessment of Educational Progress (NAEP) in 1969, gathering more steam in the 1980s after the National Commission on Excellence in Education's 1983 release of A Nation at Risk, and generating renewed interest in the wake of No Child Left Behind (Arends, 2001; Darling-Hammond, 2000). Although the notion of using teacher certification as a means of improving student outcomes is not new, Darling-Hammond (2000) cites the lack of investigation into the effects of large-scale, state-level policy on student achievement. This article critique will provide (a) a summary, (b) a critique evaluating structure and clarity, theoretical framework, methods, and analysis and discussion, and (c) concluding remarks on Pecheone and Chung's (2006) article.

Précis

Pecheone and Chung (2006) undertake a study to examine an alternative teacher certification exam designed to be more authentic and integrated than California's standard performance assessment. The implied purpose of the study was to provide evidence that "performance assessments have emerged not only as useful measures of teacher performance but also as a way to evaluate the quality of credential programs for state accountability systems and program accreditation" (p. 23). I must note that the study's purpose was not clearly stated and at times sends mixed messages about its intention. Pecheone and Chung (2006) describe the PACT as having two parts: (a) the formative development of prospective teachers through embedded signature assessments that occur throughout teacher preparation and (b) a summative assessment of teaching knowledge and skills during student teaching (the TE) (p. 24). Pecheone and Chung (2006) further describe the PACT project as the collaborative effort of 12 colleges and universities "to identify and share exemplary curriculum-embedded assessments across programs" (p. 24).


The PACT is a two-part, subject-specific performance assessment in which teacher candidates (a) are given formative feedback, termed the embedded signature assessment, and (b) provide standards-based evidence of their teaching ability, termed the teaching episode (TE). The second portion of the PACT, the TE, is scored by an expert committee of teacher educators and is used in conjunction with the signature assessment to make credentialing decisions (Pecheone & Chung, 2006, p. 29). The majority of the article focused on the TE portion of the assessment, with the embedded signature piece mentioned only four times. The TE is characterized as a collection of multiple data sources including "teacher plans, teacher artifacts, student work samples, video clips of teaching, and personal reflections and commentaries" (p. 23). These candidate-supplied evidences are divided into four categories (planning, instruction, assessment, and reflection) that are used to compute a total mean item score (MIS) for each of the four categories (as well as a fifth category, academic language, based on three items drawn from the other four categories). After judges calculated the total MIS, they were asked to "step back...and holistically evaluate the candidate's performance based on the...question and [4 point] rating scale" (p. 30). Candidates meeting the cut score of two were considered to meet the standard at the minimum level.

Pecheone and Chung (2006) provide an overview of the scoring profiles for years one and two. They note that the mean scores, disaggregated by the five categories and seven subject areas, indicate that candidates in both years performed significantly higher on planning and significantly lower on academic language (p. 25). Scores in year two were higher for both the total MIS and the four category subscores. Pecheone and Chung (2006) indicate that this information was used to improve program structure and instruction, and they augment these findings with results from the Teacher Reflection Survey showing that candidates in year two felt better prepared for the TE than candidates in year one.

Pecheone and Chung (2006) lay out the steps taken to ensure reliability and validity. A variety of descriptive and inferential statistical procedures were utilized to confirm that the TE was valid in terms of content validity, bias and fairness, construct validity, concurrent validity, and criterion validity.


Pecheone and Chung (2006) find that the PACT, after undergoing extensive development and design by an expert committee of teacher educators, maintains "a strong linkage between the [Teaching Performance Expectations] TPE standards, the TE tasks, and the skills and abilities that are needed for safe and competent professional practice" (p. 29). ANOVA results from bias detection show that there were no significant differences in scores by race/ethnicity of candidates, percentage of English language learner students in candidates' classrooms, or socioeconomic status for year two; however, significant differences were found by gender (p. 29). Pecheone and Chung (2006) state that the difference in scoring by gender will continue to be monitored. Construct validity was confirmed using factor analysis, which revealed that a three-factor model best fit the data. This result indicated that "the assessment tasks (PIAR) are generally supported by the factor analysis and appear to be meaningful categories that represent significant domains of teaching skill" (p. 30). Reliability was determined using percent interrater agreement between judges; these results showed 90% and 91% interrater agreement in years one and two, respectively. Pecheone and Chung (2006) report that criterion validity was assessed by asking supervisors and faculty members familiar with the candidate "to what extent (on a 5-point Likert-type scale) they agreed with the scores candidates received on the subject-specific rubrics" (p. 31). Results from the survey show that 90% of those surveyed agreed with most of the TE scorers' ratings and 72% agreed with a large majority of the ratings (p. 31). Overall, Pecheone and Chung (2006) find that the teaching knowledge and skills captured by the TE appear to be "a credible and valid measure of a teacher's competence in relationship to similar judgments of competency by faculty/supervisors that were most familiar with the candidate's teaching" (p. 31).

Pecheone and Chung's (2006) work investigates an important and crucial void in the quest to improve student achievement. Their findings indicate that the PACT may provide a viable alternative to California's traditional performance assessments. This article could potentially serve as a building block for a developing body of literature around supporting student teacher development and certification.
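To make the scoring arithmetic summarized above concrete, the brief R sketch below computes category mean item scores and applies the minimum cut score of two. The data frame, item names, and category-to-item mapping are invented for illustration (the article does not publish item-level data), so this reflects only one reading of the scoring description, not the PACT's actual scoring code.

    ## Minimal sketch with hypothetical data: category MIS values and a cut score of 2.
    ## Item names and the category-to-item mapping are invented; only the arithmetic
    ## follows the article's description.
    set.seed(1)
    scores <- data.frame(matrix(sample(1:4, 10 * 15, replace = TRUE), nrow = 10))
    names(scores) <- c(paste0("P", 1:5), paste0("I", 1:4), paste0("A", 1:3), paste0("R", 1:3))

    categories <- list(
      planning          = paste0("P", 1:5),
      instruction       = paste0("I", 1:4),
      assessment        = paste0("A", 1:3),
      reflection        = paste0("R", 1:3),
      academic_language = c("P5", "I3", "A2")  # overlaps other categories, as described on p. 26
    )

    mis <- sapply(categories, function(items) rowMeans(scores[, items]))
    total_mis <- rowMeans(mis[, c("planning", "instruction", "assessment", "reflection")])
    meets_standard <- total_mis >= 2  # cut score of two = minimum passing level
    cbind(round(mis, 2), total_mis = round(total_mis, 2), meets_standard)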


Critique

Structure and Clarity

Recommendations for framing and writing structure made by the American Psychological Association (2010) are not adhered to in this article. The literature review, theoretical framework, methods, results, and discussion are fused together without retaining the distinctive qualities of each. This unconventional writing style makes the piece less accessible and more difficult to evaluate clearly. No clear hypothesis is stated, though the word hypothesis is mentioned three times; each usage is either specific to a minor portion of the study or an inductive use of the term. Without the guidance of a hypothesis, the results and findings appear as a series of disjointed statements rather than a carefully crafted argument that supports specific claims. An example of this lack of coherent structure is found in the middle of the paper under the subheadings Study 1: Comparing Analytic and Holistic Scorer Ratings and Study 2: Criterion Validity (pp. 30-31). Each of these small sections deals with a specific aspect of validation rather than supporting a comprehensive study that allows for broader interpretation. A lack of explanation or context for phrases particular to this study (e.g., embedded signature) adds further confusion and leads to an incomplete understanding of the problem and proposed solution. The overall flow of the piece is inadequate and detracts from the interpretations that could be made.

Theoretical Framework

Pecheone and Chung (2006) build the theoretical framework for the article from, at the time, recent work, including three pieces by Linda Darling-Hammond, a major contributor to the body of literature on teacher evaluation and certification. Pecheone and Chung (2006) point to concerns with the predictive ability of the standard teacher assessments as well as a rigid format that does not allow for multiple measures or differentiation by content area. Per California's alternative teaching assessment criteria, the PACT aligns with state teaching standards.


Pecheone and Chung (2006), at the very least, open the door to alternative teacher evaluation measures. Although the references are extensive for the teacher evaluation and performance assessment aspects of the study, references for psychometric methods (a major component of this study) are noticeably absent from the bibliography. The oversight of psychometric literature is manifested in the authors' misunderstandings, errors in analysis, and confusion in interpretation. One notable example of this misunderstanding comes from Pecheone and Chung's (2006) use of the term validity. The Standards for Educational and Psychological Testing, cited by the authors, treat validity as a unitary concept for which researchers provide evidences toward one concept of validity (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Osterlind, 2010). Validity should not be seen as separate types (i.e., construct validity, content validity, etc.) but as contributing evidences toward the valid use of an assessment (Osterlind, 2010, p. 92). Pecheone and Chung (2006) should have been familiar with the proper use of the term, as they cite both the 1985 and 1999 editions of the Standards for Educational and Psychological Testing that contain this definition. Osterlind (2010) captures the essence of validity:

[V]alidity is not a concern of an instrument per se...but of decisions based on yielded scores of appraisal activity...validity refers to the interpretations of test scores in a particular assessment and not to features of a given instrument...validity is about making supported decisions...there are no distinct kinds or types of validity, such as content validity, criterion validity, or construct validity. There is only validity...This perspective places the focus for evidence where it should be-on the decision at hand. (pp. 90-92)

Initially, Pecheone and Chung (2006) appear to be familiar with the field of psychometrics' understanding of validity:


...evaluating the validity of the instrument for its ability to accurately and fairly measure the teaching skills of prospective teachers has been a critical activity of the PACT consortium. Validity in this instance refers to the appropriateness, meaningfulness, and usefulness of evidence that is used to support the decisions involved in granting an initial license to prospective teachers. (p. 28)

Despite this statement, and despite citing literature that provides the accepted approach to validity, Pecheone and Chung (2006) demonstrate three misuses of the concept: (a) they provide separate sections for each "type" of validity, (b) they refer to the PACT, the assessment rather than the decision, as being valid, and (c) they cite a historical understanding of validity from the educational leadership literature on teacher evaluation rather than relying on known and available psychometric principles to guide policy changes. These oversights (as well as additional analysis and design flaws I will discuss) are negligent at best. It is incumbent on a researcher to understand the methods she uses or to employ the services of those who do understand them, particularly when the work affects public policy.

A second piece of evidence supporting the authors' lack of grounding in psychometric theory is the confusion surrounding the term performance assessment. In one statement,

In an era in which teacher education has been challenged to demonstrate its effectiveness, performance assessments have emerged not only as useful measures of teacher performance but also as a way to evaluate the quality of credential programs for state accountability systems and program accreditation. (p. 23)

Pecheone and Chung (2006) advocate for the use of performance assessments in teacher evaluation (the PACT is a type of performance assessment). However, California's existing teacher certification exam was already a performance assessment, as the authors state a page earlier: "Many teacher educators at these campuses were dissatisfied by the content and format of the state's teacher performance assessment, which was designed as a generic assessment that applies across all grade levels and subject areas" (p. 22). This inconsistent use of terminology, an understanding of which is a psychometric necessity, further indicates that the authors did not possess the knowledge base necessary to analyze and interpret the outcomes of the study.


Performance assessments (PAs), particularly those that are multidimensional, have the potential to provide a richer understanding of higher order thinking, but they also carry more difficulty in garnering validity evidence to support interpretation (Osterlind, 2010, p. 253). The PACT is an authentic assessment, a specific type of PA that attempts to assess performance or ability within an authentic context; this added dimensionality causes authentic assessments to be considered complex PAs (Osterlind, 2010). Pecheone and Chung's (2006) confusion with the terms validity and performance assessment calls into question their knowledge of psychometrics at the level of sophistication required for the complex analysis that evaluation of these assessments depends upon.

Methods

Participants. Pecheone and Chung (2006) furnish basic descriptive statistics for the TEs, including sample sizes for each year disaggregated by content area. They do not provide candidate demographic statistics, such as race, gender, socioeconomic status (SES), and English language learner (ELL) classification, which the authors used in bias detection. The article also indicates that the 11 higher education institutions that participated in year one and the 13 that participated in year two opted into the study. The only information provided about the judge-scorers is that they were trained "at five regional sites throughout the state" (p. 25). The participant characteristics disclosed in the write-up do not afford the reader an understanding of "the nature of the sample and the degree to which the results can be generalized" (American Psychological Association, 2010, p. 30).

Data Collection, Measures, and Research Design. Pecheone and Chung (2006) make a number of commendable methods decisions. The shift to an assessment instrument that includes multiple measures in an authentic context has the potential to provide greater information about teacher candidates and may lead to better discrimination in selecting those student teachers who are more likely to be successful in the classroom (Osterlind, 2010). A second advancement over California's traditional standardized teacher assessments was to make the tests specific to content areas rather than "a generic assessment that applies across all grade levels and subject areas" (p. 22).


At its core, the spirit of this study is well intentioned; unfortunately, flaws in measurement and design leave the results in a compromised position. One of the study's strengths also makes it more susceptible to weakness. The authentic, multiple-measures approach, what Osterlind (2010) terms a complex performance assessment, makes such instruments "harder to construct, more difficult to administer, and, for psychometric analysis, more tenuous to interpret" (p. 253). He further cautions:

[G]arnering validity evidence to support interpretation of complex PAs is exceedingly difficult...Because of these scoring conditions, complex PAs are generally ill suited to large-scale assessment programs. Still, when employed on a much smaller scale, such as with a single teacher who has only a few students, such an approach may be exceedingly useful for individual diagnosis. (p. 253)

This difficulty is likely compounded by a contradictory tension within the PACT. On the one hand, the assessment is designed to be evaluative and summative, used to make decisions regarding certification of candidates; at the same time, the designers intended the assessment to be formative. While these two forms of assessment complement each other, they carry very different intentions. Osterlind (2010) is clear that the intent of a performance assessment, particularly a complex PA, must be explicit and linked to a grounding theory.

The design of a study informing decisions of this magnitude must be sound and the methods of the creation process transparent. The design process of the PACT, as described by Pecheone and Chung (2006), at times lacks this robustness and pellucidity. Pecheone and Chung (2006) mention in the discussion of scoring that "the total number of GQs [guiding questions] for each subject area varied from 15 questions to 18 questions" (p. 25). These guiding questions are fundamental in assisting scorers in scoring a candidate's TE. The reason for the discrepancy in the number of GQs across subjects is neither addressed nor mentioned again, though it could affect the findings. The rationale for the number of GQs per PIAR category is also unclear, with the planning category containing five questions and the academic language category containing only one. This leads to an unjustifiably disproportionate test blueprint, the plan test developers construct to ensure proper weighting of the process and content of items (Izard, 2005; Osterlind, 2010).
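The weighting consequence can be shown with a quick back-of-the-envelope sketch. The GQ counts below are partly hypothetical (the article states only the planning and academic language counts), so the numbers illustrate the blueprint concern rather than reproduce the PACT's actual allocation.

    ## Hypothetical GQ counts per category; only planning (5) and academic language (1)
    ## are stated in the article, the rest are invented so the total falls in the 15-18 range.
    gq <- c(planning = 5, instruction = 4, assessment = 3, reflection = 3)

    ## Implicit weight of each category if item scores are simply pooled and averaged,
    ## versus the equal weighting a balanced blueprint would target.
    rbind(pooled_weight = round(gq / sum(gq), 2),
          equal_weight  = rep(0.25, 4))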


Also troubling for the academic language category is that, in another portion of the article, Pecheone and Chung (2006) explicitly state, "[t]he academic language MIS is the average MIS for Planning Guiding Question 5, Instruction Guiding Question 3, and Assessment Guiding Question 2, therefore, there is some overlap between the academic language MIS and other PIAR" (p. 26). This leads us to believe that the academic language guiding question, "How does the candidate's planning, instruction, and assessment support academic language development?," may not have been asked of the judges at all (p. 27). These two descriptions of the academic language scoring are inconsistent. Under the first, the academic language score depends on a combination of scores from other dimensions, violating the independence of dimensions; under the second, the dimension contains no other items with which its single item can correlate. The first problem degrades the factor analysis, whereas the second yields poor reliability.

A final measurement and design concern is the decision to base credentialing on a holistic judgment made after the MIS scores are calculated. We have just discussed that the PACT development contains design flaws and that the scoring method contains weaknesses; asking judges for a holistic pass-fail determination in addition to the multiple item-scoring decisions may introduce additional judge bias, including central tendency bias (Edmondson, 2005; Edwards & Kenney, 1946; James, Demaree, & Wolf, 1984; Likert, Roslow, & Murphy, 1993). Pecheone and Chung (2006) provide a box plot (p. 31) comparing the MIS scores to the holistic credential recommendation score (both on a one to four point scale). The visual reveals a tendency for judges to rate candidates with MIS medians around 1.5 as a failing holistic score of one, while candidates with MIS medians around 3.5 tend to be scored as fours.


This raises questions about the rationale for the holistic rating: (a) Were too few candidates scoring in the one and four ranges? (b) Is the PACT sensitive enough for decision making? (c) Does the cut score need to be reexamined? Osterlind (2010) warns about the difficulty of scoring performance assessments, particularly on a large scale, and also admonishes those who may attempt to carry the enthusiasm for performance assessment beyond its information-providing capabilities. These overlooked design flaws and unscrupulous adjustments are evidence that Pecheone and Chung's (2006) enthusiasm for the PACT may be clouding sound psychometric judgment (American Educational Research Association et al., 1999; Osterlind, 2010; Postlethwaite, 2005).

Data Analysis and Results

Pecheone and Chung (2006) have undertaken a massive problem that requires sophisticated analysis (Abell, Springer, & Kamata, 2009; Osterlind, 2010; Traub & Rowley, 1991). Although Pecheone and Chung (2006) attempted to be thorough, the analysis is extremely weak on multiple facets. The American Psychological Association (APA; 2001, 2010) dictates reporting cell sample sizes, means and standard deviations per subgroup cell, statistical test results (z, t, F, χ2, R2, etc.), p-values, degrees of freedom, and confidence intervals, and also strongly suggests reporting effect sizes (the 6th edition of the manual requires effect sizes, though the 5th edition did not). Sample sizes, means, and standard deviations are reported by subject-area subgroup; however, descriptive statistics for the demographics (i.e., race, SES, ELL) of the student teachers were not provided, a piece of information vital to interpreting the results of bias detection (Pecheone & Chung, 2006). Another major oversight in Pecheone and Chung's (2006) article is the failure to present statistical test results, degrees of freedom, confidence intervals, p-values, and effect sizes. Confidence levels were given sporadically, and findings were discussed in terms of being "significantly higher," "significantly lower," and "marginally significant (.065)" (Pecheone & Chung, 2006, pp. 25-29). P-values and tests of significance do not by themselves convey the magnitude of results, only that there is a difference between groups. To make claims of magnitude, as Pecheone and Chung (2006) do, requires effect sizes, which were not provided or discussed.


Pecheone and Chung's (2006) use of the phrase "marginally significant (.065)" indicates that no alpha level was set in advance, an oversight that may increase Type I error rates (p. 29). Against APA (2010) recommendations, Pecheone and Chung (2006) fail to mention which statistical program was used in the analysis, which hinders the interpretation and reproducibility of the study. No handling of missing data, testing of assumptions, or treatment of outliers is mentioned. Though it is not certain that there were missing data in the study, any researcher who has conducted a human subjects study, particularly over multiple years, would expect some unaccounted-for missing data. Similarly, it cannot be said with certainty that the data failed to meet the assumptions of the statistical analyses (at least three inferential procedures were conducted: (a) ANOVA on the score profiles, (b) factor analysis in validity testing, and (c) ANOVA in bias testing), but it is again unlikely that all assumptions were met. This is particularly true for the ANOVA assumption of normally distributed error terms, given that the PACT contained some dimensions with only one question scored on only four levels (forced choice). It is known that as the number of response categories on an odd-numbered response scale increases and/or the number of items increases, the distribution of summed scores more closely approximates a normal distribution, a condition not met by this study (Likert, 1932).

Beyond the missed APA expectations for quantitative reporting, Pecheone and Chung's (2006) article contains psychometric flaws that detract from the credibility of the PACT study. The article's lack of explicit psychometric language leaves the results open to misunderstanding and misinterpretation. Measurement theory depends heavily on the object being measured. In psychometric terms this object, the ability level, is termed a latent trait, a term in common use in the field as the modus operandi for conveying an explicit understanding of what is being measured (Osterlind, 2010). Pecheone and Chung's (2006) omission of this term, in conjunction with the lack of a clear research hypothesis, makes it difficult to surmise what exactly is being measured, a serious problem in a study dealing with sophisticated measurement.
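As a concrete illustration of the reporting and assumption checking called for above, the sketch below runs a one-way ANOVA on simulated four-level scores, checks the normality of the residuals, and reports an eta-squared effect size alongside the F test. The data, the group labels, and the choice of eta squared are my own assumptions for illustration; nothing here is drawn from the PACT analyses.

    ## Simulated example only: four-point scores for two hypothetical candidate groups.
    set.seed(42)
    d <- data.frame(
      score = sample(1:4, 200, replace = TRUE, prob = c(.1, .3, .4, .2)),
      group = rep(c("female", "male"), each = 100)
    )

    fit <- aov(score ~ group, data = d)
    summary(fit)                   # F statistic, degrees of freedom, p-value

    shapiro.test(residuals(fit))   # normality check on the error terms

    ## Eta-squared effect size: SS_between / SS_total
    ss <- summary(fit)[[1]][["Sum Sq"]]
    eta_sq <- ss[1] / sum(ss)
    eta_sq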


Table 1 illustrates this lack of intentionality in Pecheone and Chung's (2006) descriptions of the uses of the PACT. It would not be uncommon to see all of these associated outcomes described in a study; in this particular study, however, the failure to explicitly convey the intentions of the study and the assessment reduces interpretability, as the starting point of measurement, the latent trait, is difficult to surmise.

Table 1
Quotes Demonstrating the Differing Intended Uses of the PACT Assessment

"the focus of the PACT assessments is on candidates' application of subject-specific pedagogical knowledge that research finds to be associated with successful teaching" (p. 23)

"the PACT assessment system also uses a multiple measures approach to assessing teacher competence through the use of course-embedded signature assessments" (p. 23)

"performance assessments have emerged not only as useful measures of teacher performance but also as a way to evaluate the quality of credential programs for state accountability systems and program accreditation" (p. 23)

"there is some preliminary evidence that the implementation of the PACT assessment can be a catalyst for program change. There is also evidence that preservice teachers' learning experiences and development as teachers were enhanced by their participation in the pilot" (p. 24)

"The TEs are designed to measure and promote candidates' abilities to integrate their knowledge of content, students, and instructional context in making instructional decisions and to stimulate teacher reflection on practice." (p. 24)

"A driving principle in the design and development of the PACT assessment was that the act of putting together a TE would significantly influence the way candidates think about and reflect on their teaching because of the design and formative nature of the assessment." (p. 31)

Visualizations in a journal article require space and additional ink; therefore, their usage should convey meaning beyond what can be displayed in text or table format. Pecheone and Chung's (2006) two bar plots, Figures 1 and 2, representing the score distributions for the planning and academic language categories, instantly draw the reader's attention. These two visualizations take up 25% of a page yet convey very little additional understanding of the data, depicting only two of the score categories and only year two of the study.


In some situations the use of a bar plot may be justified; however, this relatively slim data set could have been presented just as easily in table or textual form (eight pieces of information are conveyed). Cleveland and McGill's (1984) seminal piece on data visualization warns of the deception of underrepresenting the data in a graphical rendering and advocates for the dot plot, a much better means of displaying comparison data that allows better human discrimination in this type of frequency-distribution display. The dot plot, in combination with Sarkar's (2008) faceted plotting scheme, would allow easy visualization of all distributions, across all categories, for both years of the study, within half a page of journal space. This contextualized visualization would provide the reader with a more meaningful graphical representation of the data, worthy of the additional journal space. A second visualization suggestion is to combine the information in Tables 1 and 2 (MIS scores by subject area) into one table, allowing easier comparisons across years. A final concern regarding the visual and tabular display of results is the inconsistency between the information presented in Table 3 and the information presented in the article text: 13 guiding questions (GQs) are displayed in the table, yet two pages earlier Pecheone and Chung (2006) state that the GQs number from 15 to 18 per subject area.
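A minimal sketch of the faceted dot plot recommended above is given below, using the lattice package that Sarkar (2008) describes. The data frame is simulated; the category labels and four-point score levels follow the article, but the counts and everything else are invented for illustration rather than taken from the PACT data.

    ## Sketch of a faceted dot plot (lattice; Sarkar, 2008) for score distributions.
    ## The frequencies below are simulated, not the PACT data.
    library(lattice)

    set.seed(7)
    d <- expand.grid(
      score    = 1:4,
      category = c("Planning", "Instruction", "Assessment", "Reflection", "Academic Language"),
      year     = c("Year 1", "Year 2")
    )
    d$freq <- rpois(nrow(d), lambda = 20)

    ## One panel per category, both years overlaid, score level on the vertical axis.
    dotplot(factor(score) ~ freq | category, groups = year, data = d,
            auto.key = list(columns = 2), xlab = "Number of candidates",
            ylab = "Score level", layout = c(5, 1))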


Pecheone and Chung's (2006) inappropriate use of the term validity has already been discussed and will not be evaluated further. While different forms of validity are marginally addressed by Pecheone and Chung (2006), improper use of three of the statistical tests or procedures diminishes the evidence for validity. In evidencing construct validity, Pecheone and Chung (2006) state that factor analysis was conducted to determine the psychological or pedagogical constructs that underlie the assessment. It is not stated whether exploratory factor analysis (EFA) or confirmatory factor analysis (CFA) was used; however, it becomes apparent that Pecheone and Chung (2006) believe they used EFA, as they state that in year one "three factors emerged" (p. 30). Their description of the second year confirms this understanding of factor analysis in relation to validity evidence:

In the 2003-2004 pilot year, two factors emerged from the elementary literacy and the common rubric item scores, with the first factor composed of planning and instruction items and the second factor composed of assessment and reflection items. These results suggest that the assessment tasks (PIAR) are generally supported by the factor analyses and appear to be meaningful categories that represent significant domains of teaching skill. (p. 30)

This use of the phrase "factors emerged" is reminiscent of qualitative discourse analysis and its "emerging themes" (p. 30). More troubling is the authors' conclusion that, because factor analysis was done and two or three factors emerged, the PIAR categories are supported. In a quantitative sense this is puzzling. In this case there is a construct (potential teaching ability) that the authors theorize comprises five dimensions (the PIAR categories), not two or three. It is sensible to use EFA with part of the data set to explore the structure of the data (given a sufficient sample size); however, CFA should then be employed to confirm the five-dimension structure the PACT is theorized to comprise (Abell et al., 2009; Osterlind, 2010). The specific type of factor analysis is not discussed, nor is a table of results displayed; the language used nonetheless indicates that an EFA, not the CFA recommended for confirming the dimensions of a trait, was employed for construct evidence of validity (Abell et al., 2009). The sole use of EFA, rather than a combined EFA-CFA approach on a split data set, would be troubling in its own right; moreover, a reading of Pecheone and Chung's (2007) technical report of the PACT study provides additional insight into the erroneous application of psychometric procedures. In the technical report, Pecheone and Chung (2007) alter their description of the five PIAR categories to five "components," while still using the term factor analysis to describe the testing applied to them. The term components is synonymous with principal component analysis (PCA), not factor analysis (p. 31), and Table 6 of the report confirms that PCA, not factor analysis, was the procedure used (p. 31). PCA, like the EFA-only approach, is problematic for this study in that PCA and CFA, though related, are utilized psychometrically in very different ways and for different purposes (Abell et al., 2009; Kabacoff, 2011; Osterlind, 2010).


PCA is generally used to reduce a set of correlated variables, whereas factor analysis (FA), both EFA and CFA, is used to model an underlying theory of a latent trait (Fabrigar, Wegener, MacCallum, & Strahan, 1999; Suhr, 2009). Given the underlying theory that the five PIAR categories are dimensions of the latent trait of teaching competency, the PACT's design calls for the latter usage rather than the former. The failure to properly test the PACT's underlying structure indicates that either Pecheone and Chung (2006, 2007) did not believe the test to have an underlying structure or they were unfamiliar with methods such as item response theory (IRT) models, the multitrait-multimethod matrix (MTMM), and CFA for examining internal structure. Pecheone and Chung (2007) adamantly state their theoretical rationale for the five categories aligned to California's teaching standards, which leads me to assume the authors were unfamiliar with the appropriate statistical procedures. This oversight is likely rooted in two possibilities: (a) more complex structural equation modeling (SEM) techniques are often required for FA, making it more difficult to compute than PCA (many popular statistical programs at the time did not support CFA), and/or (b) it is the product of a classic blunder carried over from an early computer program, Little Jiffy, which erroneously set the default "factor analysis" to PCA, an error that has been replicated by numerous textbooks and popular statistical packages (e.g., SPSS), leading researchers to incorrectly believe that EFA is being conducted (Fox, Nie, & Byrnes, 2012; Kabacoff, 2011). Without knowing the statistical program used in the PACT study it is not possible to pinpoint the source of Pecheone and Chung's (2007) mistake; in any event, the results contain a fatal error.
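The distinction argued above can be made concrete in a few lines of R. The data, the item names, and the two-factor specification below are simulated and purely illustrative, and lavaan is used only as one readily available CFA tool (the article names no software); this is a sketch of the recommended workflow, not a reanalysis of the PACT.

    ## Simulated item scores with a built-in two-factor structure (names are invented).
    set.seed(123)
    n <- 300
    f1 <- rnorm(n); f2 <- rnorm(n)
    items <- data.frame(
      P1 = f1 + rnorm(n), P2 = f1 + rnorm(n), P3 = f1 + rnorm(n),
      R1 = f2 + rnorm(n), R2 = f2 + rnorm(n), R3 = f2 + rnorm(n)
    )

    ## PCA: reduces correlated variables to components; no latent-trait model is tested.
    prcomp(items, scale. = TRUE)

    ## EFA: explores how many factors might underlie the items.
    factanal(items, factors = 2, rotation = "varimax")

    ## CFA: tests the theorized dimensional structure directly (lavaan is one option).
    library(lavaan)
    model <- '
      planning   =~ P1 + P2 + P3
      reflection =~ R1 + R2 + R3
    '
    fit <- cfa(model, data = items)
    summary(fit, fit.measures = TRUE)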


Pecheone and Chung's (2006) approach to identifying and confronting bias includes both favorable and unfavorable analysis techniques. The authors' initial review of the test, examining the assessment's language for word choices that might lead to confusion or unequal scoring across subgroups, is consistent with the guidelines set forth in the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999; Osterlind, 2010). However, Pecheone and Chung's (2006) use of ANOVA for testing test bias, coupled with the statement that "the performance of candidates on the PACT assessment was examined to determine if candidates performed differentially with respect to specific demographic characteristics," implies a theoretical misunderstanding of bias (p. 29). An ANOVA tests for differences between groups and on the surface seems suited to testing for bias. The ANOVA approach does isolate systematic differences, yet it ought not be assumed that all subgroups should score similarly. For example, the ESL candidates in the study are not as likely to score at the same level of proficiency as native English speakers on the English writing portions of the assessment (Huang, 2009; Huang & Foote, 2010). It is unrealistic to assume that no group differences exist on a particular test item. Systematic group differences are a necessary, yet not sufficient, indicator of bias. For true bias to occur, individuals of matched ability must systematically perform differently on an item; in other words, the analysis must consider ability (it is plausible that most members of some subgroups actually do have lower ability on a particular task). Psychometrics uses the term differential item functioning (DIF) rather than the more politically loaded word bias, approaching bias at the item and test level based on ability and subgroup characteristics. To maintain the field's common language, the term DIF will be used in this critique as well. It must be noted that ANOVA (or linear regression) could have been utilized to examine DIF under the condition that individuals were matched on relevant characteristics (Osterlind, 2010, p. 64); however, this unduly complicates the analysis considering the array of IRT-based procedures available for detecting DIF. Acceptable approaches to DIF detection include (a) the Mantel-Haenszel test, (b) logistic regression modeling, and (c) inspection of item characteristic curves (ICCs), all of which account for the ability level of individuals as well as the subgroups they belong to, providing more accurate and a greater range of information about item and test functioning. Pecheone and Chung's (2006) failure to use these available methods may have resulted in incorrect items being retained or excluded, thus decreasing or eliminating the generalizability of the finding that females performed significantly better than males.
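As an illustration of the logistic regression approach named above, the sketch below tests whether group membership predicts success on an item after conditioning on a matching criterion (a total-score proxy for ability). The data, the dichotomized item, and the variable names are all invented; a real DIF analysis of polytomous PACT rubric scores would need an appropriately extended model (e.g., ordinal logistic regression).

    ## Simulated uniform-DIF check with logistic regression: invented data and names.
    set.seed(99)
    n <- 400
    d <- data.frame(
      total  = rnorm(n),                               # matching criterion (overall ability estimate)
      gender = factor(rep(c("female", "male"), each = n / 2))
    )
    ## Item response depends only on ability here, so no DIF is built into the simulation.
    d$item_pass <- rbinom(n, 1, plogis(0.8 * d$total))

    ## Does group membership add to the prediction once ability is controlled?
    base_fit <- glm(item_pass ~ total, family = binomial, data = d)
    dif_fit  <- glm(item_pass ~ total + gender, family = binomial, data = d)
    anova(base_fit, dif_fit, test = "Chisq")   # a significant gain would flag potential DIF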


An additional point undermining the generalizability of Pecheone and Chung's (2006) DIF detection can be extracted from the notes section of the paper. Pecheone and Chung (2006) state, "No significant differences were found between scores received by White, Asian, and Hispanic teacher candidates. Other ethnicities had low sample sizes and, thus, score data for these groups could not be validly analyzed" (p. 35). This statement is not strictly true, in that bootstrapping and permutation methods have the potential to extend statistical tests to small samples (Kabacoff, 2011). This lapse in design, and the explanation offered for it, is unsatisfactory in light of the intent to use the PACT across California and potentially beyond.

The analysis of reliability is necessary, though not sufficient, evidence for the validity of a particular test use (Osterlind, 2010). Problematic for the term reliability are its multiple uses and misuses. Reliability is not strictly defined but broadly refers to precision in mental appraisal. Pecheone and Chung's (2007, p. 35) narrow definition of reliability, based on the even then outdated 1985 Standards for Educational and Psychological Testing, "the degree to which test scores are free from errors of measurement," stands in contrast to the modern psychometric view: "reliability is not a universal or absolute depiction of the absence of measurement error; rather, its meaning may be properly interpreted only in the framework of a particular assessment" (Osterlind, 2010, p. 123). Given the PACT's goal of assessing teacher competence, the most important reliability question is, How consistently do the items of the test identify the level of the trait given an individual's level of teaching competence?, a form of reliability termed internal consistency. Pecheone and Chung (2006) address only interrater reliability, the consistency from scorer to scorer, and while this form of reliability is important, the complexity of the PACT and its intended credentialing use make a more thorough analysis mandatory.

Even within the frame of interrater reliability there is a lack of evidence to support Pecheone and Chung's (2006) claims. Specifically, (a) using rudimentary percentage comparisons and (b) accepting a wide margin of error weaken the finding that the PACT is a reliable assessment. In assessing the agreement between raters and between the holistic and MIS scores, simple percentages were employed. Although this approach may be warranted in some research studies, a study impacting public policy, including certification decisions, demands more rigorous statistics of interrater reliability, such as Cohen's kappa, which corrects observed agreement for the agreement expected by chance given each rater's overall scoring tendencies.
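The difference between raw percent agreement and a chance-corrected index can be sketched in a few lines. The ratings below are simulated, and the use of the irr package is my own choice for illustration; the article reports only percent agreement, so this is a sketch of the recommended alternative, not a recomputation of the PACT results.

    ## Simulated ratings from two judges on a four-point scale (invented data).
    set.seed(5)
    rater1 <- sample(1:4, 100, replace = TRUE, prob = c(.1, .4, .4, .1))
    rater2 <- ifelse(runif(100) < .7, rater1, sample(1:4, 100, replace = TRUE))

    ## Raw percent agreement (the statistic reported in the article).
    mean(rater1 == rater2)

    ## Chance-corrected agreement; weighted kappa credits near-misses on an ordinal scale.
    library(irr)
    kappa2(cbind(rater1, rater2))                      # unweighted Cohen's kappa
    kappa2(cbind(rater1, rater2), weight = "squared")  # quadratic-weighted kappa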


The finding that there was an acceptable level of difference between raters is itself questionable, in that Pecheone and Chung's (2006) acceptable margin of error is "within 1 point" (p. 30). In the context of a four-point scale, a one-point margin of error is unacceptable.

Discussion

Interpretations and Implications. Pecheone and Chung (2006) are sparing in their interpretation of results, and these explanations are often intermixed with findings. It is difficult to make a fair critique of this aspect of the study because of the inconsistent and at times convoluted means of disseminating interpretations. The problem is compounded by the lack of a clear research hypothesis; therefore, I will address this aspect of the study only briefly. Pecheone and Chung (2006) rightfully propose a one-year follow-up of teacher candidates, measuring the achievement gains of their students, to address the success of the PACT in predicting teaching success. I wholly support a follow-up to measure the PACT's success in predicting teacher success; however, the proposed single-year follow-up is unrealistic for actually capturing the success of a teacher. This approach implies that we can determine how successful a teacher will be by observing their first year in the profession; I certainly would hate to have been predictively evaluated on my first year as a teacher (I sense this will elicit smiles from readers who are educators). Also problematic is the use of a single measure, achievement, for determining the success of a teacher (Baker et al., 2010). Pecheone and Chung (2006) find that there is a gender difference in candidate performance and respond:

As a result of these findings, PACT will continue to monitor and reexamine the scorer training process as well as the design features of the TE such as GQs, rubrics, and benchmarks. If sources of bias are identified because of socioeconomic status or other variables, then modifications will be made to the assessment system to address the problem areas. (p. 29)


This is not the proper response to detected bias in a test that carries this much weight. If proper DIF detection is utilized, an item that exceeds a critical limit is discarded (Osterlind, 2010). This wait-and-see approach is reckless when the stakes are so high. Pecheone and Chung (2006) discuss how the PACT has raised awareness of what colleges and universities are teaching. The formative aspect of the PACT has yielded encouraging responses in altering and adjusting pedagogy to better support the learning outcomes encompassed in the TE portion of the PACT. It seems that the PACT may be useful for informing pedagogy, yet the unresolved issue of predicting teacher success remains. Pecheone and Chung (2006) bemoan the lack of "predictive validity...predicting effective teaching in the classroom" of California's traditional standardized certification exams, citing it as a major motivator for the development of the alternative PACT assessment (p. 23). Pecheone and Chung (2006) did not include analysis of the longitudinal effects of the PACT, and though they promise future predictive studies of teacher effectiveness, they have not established predictive evidence for the PACT, nor do they propose a credible means of addressing this concern. I must ask two questions of Pecheone and Chung: (a) If the PACT has not delivered on its promise to predict teacher effectiveness, is it worth the cost to develop, administer, and score? and (b) How does the PACT enable educational leaders to hire successful candidates?

Limitations. Pecheone and Chung (2006) do acknowledge that the study is limited to only two years rather than the multiple repeated measures and revisions that a performance assessment of this magnitude requires (Osterlind, 2010). In light of this limitation it seems unwise to make the claims regarding validity and reliability that the authors make. The use of judges to score a latent trait is a well documented potential source of additional bias in the psychometric literature, yet the authors do not address how this may have affected results (Edmondson, 2005; Edwards & Kenney, 1946). The authors also address the expensive and labor-intensive nature of having judges score performance assessments.


Pecheone and Chung (2006) acknowledge that development of the PACT "could not have been accomplished without member contributions and financial support from the University of California Office of the President and private foundations" (p. 33). However, possible sources of funding for this financial burden in a statewide or national application of the PACT are not furnished. Instead the authors retort, "[d]espite legitimate concerns about the costs of implementing performance-based assessments, the alternative is to judge teacher competence solely on the basis of standardized multiple-choice tests of content and/or pedagogical knowledge" (p. 33). In stating this, Pecheone and Chung (2006) assume three things: (a) the PACT is a better predictor of teacher success, (b) the traditional assessment is a poor predictor of teacher success, and (c) there is no viable third option to explore. With recent financial reductions imposed on the educational system, it is careless to dismiss the cheaper standardized assessment or a third alternative, or to assume that taxpayers, students, and educational institutions can bear the additional financial strain the PACT imposes. Pecheone and Chung (2006) consider it a limitation that "we have yet to see the policy impact of PACT implementation on teacher credentialing because it has not yet been fully implemented in California as a high-stakes assessment for teacher licensure" (p. 33). This statement does not indicate that the authors are aware of the rigorous evidence a test must display before being widely adopted. It would be unethical for California simply to grant large-scale use of the PACT without proper development and evidence of effectiveness. This is a case of the skinny runt who hasn't hit the ball all year begging, "Put me in coach, I can do it." The attitude displays a lack of sensitivity to the consequences of a wrong decision. Osterlind (2010) warns of the over-enthusiastic PA advocate who merely follows the zeitgeist of anything but multiple choice in a sort of oniomania, a description that Pecheone and Chung fit.

Conclusion

The intention of the PACT and of Pecheone and Chung's (2006) study is admirable, and as an educational leader I can support the effort to make credentialing decisions that increase the likelihood of a new teacher hire being successful.


Based on Pecheone and Chung's (2006) article, however, I cannot throw support behind the PACT, as the study does not provide the kind of scholarly, scientific methods required to justify the sweeping policy changes the authors advocate. I question the wisdom of developing a measure of this magnitude while attempting to use it to ascertain and report results as it is being constructed; this is akin to building an airplane while attempting to fly it. The convoluted and disjointed writing style makes the piece difficult to interpret. The missing statistical results, such as p-values, t statistics, and χ2 values, undermine generalizability and diminish the judgments that can be made about the findings. The absence of rigor, the ignorance of psychometric theory, the failure to adhere to testing standards, and the failure to utilize known modern inferential techniques render the study merely a good idea at best. In evaluating Pecheone and Chung's (2006) article I must apply the same standard and critique they applied to the traditional test:

...there is little evidence regarding the technical soundness of traditional teacher licensure tests in the published literature and little research documenting the validity of such licensure tests for identifying competent teachers or for predicting effective teaching in the classroom (p. 23)

As an educational leader and researcher, in evaluating the merit of this study I must ask whether it succeeded in providing evidence for the PACT as an alternative to California's traditional standardized teacher certification exam. Unfortunately, I must answer no. Pecheone and Chung (2006) present an interesting idea for addressing the problem of hiring qualified teachers, a problem that, as an educational leader, is very near to my heart. Pecheone and Chung (2006) make some reasonable critiques of student teacher assessment practices. However, the mixed intent (both summative and formative) of the PACT as a performance assessment, which we know to be unwieldy in large-scale usage, may be too broad to completely accomplish any of the developers' goals. It may have been prudent to separate the teaching episode and the embedded signature assessment as independent assessments with distinct, yet complementary, purposes and intended uses.


References

Abell, N., Springer, D. W., & Kamata, A. (2009). Factor analysis. In Developing and validating rapid assessment instruments. Oxford University Press.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.

American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.

Arends, R. I. (2001). Performance assessment in perspective: History, opportunities, and challenges. In S. Castle & B. S. Shaklee (Eds.), Performance-based assessment in teacher education. Lanham, MD: Rowman & Littlefield Education.

Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., . . . Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, DC: Economic Policy Institute.

Cleveland, W. S., & McGill, R. (1984). Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79(387), 531-554. doi:10.2307/2288400

Danielson, C. (1996). Enhancing professional practice: A framework for teaching. Association for Supervision and Curriculum Development.

Darling-Hammond, L. (2000). Teacher quality and student achievement: A review of state policy evidence. Education Policy Analysis Archives, 8(1), 1-44. Retrieved from http://epaa.asu.edu/epaa

Edmondson, D. R. (2005). Likert scales: A history. CHARM, 12, 127-133. Retrieved from http://faculty.quinnipiac.edu/charm/CHARM%20proceedings/CHARM%20article%20archive%20pdf%20format/Volume%2012%202005/127%20edmondson.pdf


Edwards, A. L., & Kenney, K. C. (1946). A comparison of the Thurstone and Likert techniques of attitude scale construction. Journal of Applied Psychology, 30(1), 72-83. doi:10.1037/h0062418

Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4(3), 272-299. doi:10.1037/1082-989X.4.3.272

Fox, J., Nie, J., & Byrnes, J. (2012). sem: Structural equation models (R package version 3.0-0). Retrieved from http://CRAN.R-project.org/package=sem

Huang, J. (2009). Factors affecting the assessment of ESL students' writing. International Journal of Applied Educational Studies, 5(1), 1-17.

Huang, J., & Foote, C. J. (2010). Grading between the lines: What really impacts professors' holistic evaluation of ESL graduate student writing? Language Assessment Quarterly, 7(3), 219-233. doi:10.1080/15434300903540894

Izard, J. (2005). Trial testing and item analysis in test construction (K. N. Ross, Ed.). Paris: UNESCO International Institute for Educational Planning. Retrieved from http://www.sacmeq.org/downloads/modules/module7.pdf

James, L. R., Demaree, R., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69(1), 85-98.

Kabacoff, R. I. (2011). R in action: Data analysis and graphics with R. Shelter Island, NY: Manning.

Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22(140), 1-55.


Likert, R., Roslow, S., & Murphy, G. (1993). A simple and reliable method of scoring the Thurstone attitude scales. Personnel Psychology, 46(3), 689-690. doi:10.1111/j.1744-6570.1993.tb00893.x

Osterlind, S. J. (2010). Modern measurement: Theory, principles, and applications of mental appraisal (2nd ed.). Boston, MA: Pearson Education.

Pecheone, R. L., & Chung, R. R. (2006). Evidence in teacher education: The Performance Assessment for California Teachers (PACT). Journal of Teacher Education, 57(1), 22-36. doi:10.1177/0022487105284045

Pecheone, R. L., & Chung Wei, R. R. (2007). Technical report of the Performance Assessment for California Teachers (PACT): Summary of validity and reliability studies for the 2003-04 pilot year. Stanford, CA: Stanford University. Retrieved from http://www.pacttpa.org/files/Publications and Presentations/PACT Technical Report March07.pdf

Postlethwaite, T. N. (2005). Educational research: Some basic concepts and terminology (K. N. Ross, Ed.). Paris: UNESCO International Institute for Educational Planning. Retrieved from http://www.sacmeq.org/downloads/modules/module1.pdf

Sarkar, D. (2008). Lattice: Multivariate data visualization with R. New York, NY: Springer. Retrieved from http://lmdvr.r-forge.r-project.org

Stronge, J. H. (2002). Qualities of effective teachers. Association for Supervision and Curriculum Development.

Suhr, D. (2009). Principal component analysis vs. exploratory factor analysis. Paper presented at SUGI 30, Philadelphia, PA. Retrieved from http://www2.sas.com/proceedings/sugi30/203-30.pdf

Traub, R. E., & Rowley, G. L. (1991). NCME instructional module: Understanding reliability. Educational Measurement: Issues and Practice, 10(1), 37-45. Retrieved from http://ncme.org/linkservid/65F3B451-1320-5CAE-6E5A1C4257CFDA23/showMeta/0/

Wright, S. P., Horn, S. P., & Sanders, W. L. (1997). Teacher and classroom context effects on student achievement: Implications for teacher evaluation. Journal of Personnel Evaluation in Education, 11, 57-67. doi:10.1023/A:1007999204543
