

Empirical Keying of Situational Judgment Tests: Rationale and Some Examples

Kelley J. Krokos, American Institutes for Research. Ph: 202.403.5259, Fx: 202.403.5033, kkrokos@air.org
Adam W. Meade, North Carolina State University. Ph: 919.513.4857, Fx: 919.515.1716, adam_meade@ncsu.edu
April R. Cantwell, North Carolina State University. Ph: 919.515.2251, Fx: 919.515.1716, arcantwe@unity.ncsu.edu
Samuel B. Pond, III, North Carolina State University. Ph: 919.515.2251, Fx: 919.515.1716, sbpond@ncsu.edu
Mark A. Wilson, North Carolina State University. Ph: 919.515.2251, Fx: 919.515.1716, Mark_Wilson@ncsu.edu

Press Paragraph

Recently there has been increased interest in the use of situational judgment tests (SJTs) for employee selection and promotion. SJTs have respectable validity coefficients with performance criteria, though these coefficients vary from study to study. We propose the use of empirical keying in order to help maximize the utility of SJTs. Though others have used such methods, we provide a much-needed theoretical rationale for such scoring procedures by illustrating the distinctions among SJTs, cognitive ability tests, and biodata. Results indicate that some empirical keying approaches are advantageous for predicting a leadership criterion compared to traditional subject matter expert SJT scoring.

Abstract

There has been increased interest in the use of situational judgment tests (SJTs) for employee selection and promotion. We provide a much-needed theoretical rationale for empirical keying of SJTs. Empirical results indicate that some empirical keying approaches are more advantageous than subject matter expert SJT scoring.

Empirical Keying of Situational Judgment Tests: Rationale and Some Examples

SJTs are becoming increasingly popular as selection, promotion, and developmental tools (Clevenger, Pereira, Wiechmann, Schmitt, & Harvey, 2001; Hanson & Ramos, 1996; McDaniel, Finnegan, Morgeson, Campion, & Braverman, 1997), and with good reason: several researchers have had considerable success in predicting performance with SJTs (McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001; Phillips, 1993; Weekley & Jones, 1999), with less adverse impact than is typically found in measures of cognitive ability (Hanson & Ramos, 1996; Motowidlo & Tippins, 1993; Weekley & Jones, 1997).

Despite these promising findings, one persistent problem with the use of SJTs is that validity coefficients often vary widely. For example, some authors have found no significant correlation between SJT scores and employee performance (Smiderle, Perry, & Cronshaw, 1994), while others have found validity coefficients of .45 (Phillips, 1993) and .56 (Stevens & Campion, 1999). Still others have found widely divergent results for men and women (Phillips, 1992) or by construct examined (Motowidlo, Dunnette, & Carter, 1990). Undoubtedly, there are many unpublished studies of SJTs showing varying or non-significant results as well.

One possible explanation for these problems may lie in the way SJTs are scored. Traditionally, subject matter experts (SMEs) have determined the correct responses to SJT items. However, we propose that an empirical approach to item scoring has several theoretical and practical advantages over the SME approach. Though we are not the first to suggest the use of empirical keys for SJTs, we provide a much-needed theoretical rationale for their use that has not previously been discussed. In this study, we discuss the advantages of empirical approaches and illustrate their use with an SJT predicting a leadership criterion.

Scoring of SJTs

In a recent review, McDaniel and Nguyen (2001) describe approaches to scoring SJTs. The first and most common approach is to ask SMEs to decide which response alternative is best for each item. With this approach, items with little or no SME agreement are deleted or rewritten. Results with the SME scoring approach vary, though they are generally positive. A second scoring approach identified by McDaniel and Nguyen (2001) involves pilot testing an SJT and identifying the correct responses based on central tendency statistics, though no example or explanation of how this should be implemented was given. The last approach discussed by McDaniel and Nguyen (2001) is the use of empirical methods to determine the scoring key.

Although empirical scoring approaches are rarely used for SJTs, some research evidence suggests that SJTs scored in this way can yield moderate validity coefficients. Dalessio (1994) successfully used an empirical keying technique for an SJT to predict turnover among insurance agents. Weekley and Jones (1997) used empirical scoring based on mean criterion performance of service workers and found a cross-validity coefficient of .22. Finally, although the relationships between SJT scores and performance criteria were not assessed, Lievens (2000) developed an empirical scoring key for an SJT using correspondence analysis and discriminant analysis.

In contrast to the paucity of studies examining empirical scoring procedures for SJTs, biodata measures have a long history of empirical scoring, and these procedures are easily adaptable to SJT items. However, Hogan (1994) reviewed the history of empirical keying methods and found that few studies had compared different empirical keying procedures. In one of the few studies comparing multiple empirical keying techniques, Devlin et al. (1992)

found that the vertical percent method (the application blank method; England, 1961) was among the best at predicting academic performance for college freshmen, with cross-validities typically in the .4-.5 range. The horizontal percent (Stead & Shartle, 1940) and phi coefficient methods (cf. the Lecznar & Dailey, 1950, correlational method) also proved useful in their study, with validities only slightly lower than those of the vertical percent methods they investigated (Devlin et al., 1992). The mean criterion method had greater variation in cross-validities across different time spans, though the cross-validities were between .2 and .5. However, there was more shrinkage in cross-validation for this method than for most others.

Rationale for Empirical Keying of SJTs

Though the use of empirical keying for biodata has been criticized as dust-bowl empiricism (Dunnette, 1962; Mumford & Owens, 1987; Owens, 1976), we believe that it may actually be preferable to SME-based scoring procedures for some SJTs. On the surface, SJTs seem to be closely related to both cognitive ability tests and biodata. However, we contend that SJTs are unique measurement methods (Hanson, Horgen, & Borman, 1998) and thus have unique properties that make them particularly well suited for empirical keying.

First, we explicitly reject the notion of correct and incorrect answers for most SJTs. The notion that SJTs should have correct and incorrect responses likely stems at least in part from the relationships between SJTs and cognitive ability tests, which generally do have correct and incorrect answers. For one, research suggests that SJT scores are highly related to scores on tests of general cognitive ability (McDaniel et al., 1997, 2001). In addition, SJTs are used in ways and contexts that are typical for cognitive ability tests, such as personnel selection. However, despite these relationships and shared contexts of use, SJT items, unlike

typical academic or cognitive ability test items, are not designed to have a single, irrefutable correct answer. Instead, SJT items are typically designed to capture the more complex, social, or practical aspects of performance in work situations. McDaniel et al. (1997) suggest that SJTs are indistinguishable from tests of tacit knowledge. To the extent that this is true, SJTs measure something different from general cognitive or academic intelligence (Sternberg, Wagner, Williams, & Horvath, 1995). To capture this type of knowledge, test items pose problems that are not well defined and may have more than one correct response (Sternberg & Wagner, 1993; Sternberg, Wagner, & Okagaki, 1993). Finally, an examination of typical SJT items reveals that there is generally no clear right or wrong answer. This is actually a desirable feature of SJTs, as transparent items would quickly lead to ceiling effects that would fail to discriminate between high and low performers. Note, however, that this limitation is not an issue in many biodata scales, where items can be based on external, objective, and verifiable previous life experiences (Mael, 1991).

We believe that all response options for an SJT item vary along a continuum from best to worst. The exact location of an option on this continuum is difficult to determine and will vary by item and perhaps also by the job for which the applicant is applying. Some items may be written with one clearly best option, while others may be written with less distinct response alternatives. Transparent items lead to ceiling effects, while ambiguous items make it exceedingly difficult for SMEs to achieve consensus about the appropriateness of each option.

When SJT scores are based on a scoring key developed by SMEs, an SJT score represents the extent to which each respondent agrees with the judgments of the SMEs. By requiring a high degree of consensus among SMEs, researchers can increase the likelihood that

answers will not be too specific to the opinions of the particular group of SMEs. Unfortunately, however, this procedure also increases the likelihood that correct answers will be the most transparent options. Thus, more transparent items rather than less transparent ones are likely to be retained when SMEs are employed to determine the keyed answer. In addition, the option ultimately determined to be best by the SMEs will depend to some extent upon the unique perspective of a particular SME group and the group dynamics involved in obtaining consensus.

Deciding between an SME-based key and an empirical key is really a question of who will serve as the SMEs. When traditional SME scoring is used, an SJT score is an index of agreement between respondents and SMEs. The extent to which these scores are construct valid depends upon both the validity of the SMEs' conceptualization of the construct and the validity of the SMEs' assessment of the relationship between the response options and the construct. As such, low validity coefficients for SME-scored SJTs could be due to differences in perceptions of the construct between respondents (e.g., job applicants) and SMEs (e.g., a small group of supervisors); poor SME judgment as to which response option is most indicative of the construct; or overly transparent best answers chosen not only by SMEs, but also by both high- and low-performing respondents.

In contrast, when empirical keying is used, the de facto SMEs are the high-performing respondents as measured by the criterion of interest. More specifically, response options that best differentiate between high- and low-performing incumbents are given more weight than other options, even though those other options may in many ways seem to be better responses. Using empirical keying, the most transparent option (and seemingly the best option) may be endorsed by a majority of respondents; however, if high- and low-performing respondents equally endorse

the response option, it will not differentiate between criterion groups and consequently will not, in effect, be weighted. In contrast, a response option that is not endorsed frequently but is endorsed much more often by high-performing respondents than by low-performing respondents will be weighted much more heavily under many of the empirical scoring methods. In general, this will be desirable so long as the number of respondents endorsing the response option is not so small as to engender severe shrinkage in cross-validation. We should point out, though, that if a criterion does not fully capture the performance domain, it might be preferable to use SME judgment to determine the correct answers to SJT items. In such cases, however, attention to better criterion development would be a pressing concern.

While much of the criticism leveled at empirical keying methods of biodata scoring concerns the lack of theory behind the choice of predictors (Mumford & Owens, 1987; Owens, 1976), such criticism is not necessarily relevant to SJTs. SJT items are typically written based on job analysis data and are thus believed to be related to job-relevant behaviors and criteria from their inception. As a result, empirical keying of these items merely serves to define the optimal relationship between those items and the criterion. In sum, we believe that there may be some utility in investigating empirical keying as an alternative for SJT scoring.

In this research, we investigated the use of empirical keying as an alternative to the traditional SME-based scoring procedure for an SJT developed to select students receiving a highly competitive four-year scholarship at a major university.

Method

Participants

Participants were 219 undergraduate students (scholars) from a large university who were

recipients of a highly competitive four-year academic scholarship. Roughly 55% were female and 45% were male, while approximately 36% were freshmen, 26% sophomores, 22% juniors, and 16% seniors. Note that while the sample is composed of students, this was not a lab study or a sample of convenience; the students were the target population for the SJT.

Measures

Phase one of the project involved criterion development, during which the appropriate behaviors associated with four performance dimensions (Leadership, Scholarship, Service, and Character) were identified. For example, the Leadership dimension included behaviors such as knowing when to take a leadership versus a support role, being comfortable in ambiguous situations, developing cooperative relationships, and handling conflict appropriately. Results for the Leadership dimension are reported in this study in order to simplify the presentation of results and because leadership is most readily generalizable to other organizational settings.

Phase two involved the development of the SJT item stems and response options. The SJT item stems were developed by the program research team using the data gathered in the criterion development phase. The response options were developed by a group of SMEs including university faculty and the scholarship program directorate. The items and response options subsequently underwent additional rigorous review and modification by SMEs and the research team. The final SJT was composed of three detailed scenarios describing situations that scholarship recipients may encounter. Each scenario comprised several multiple-choice items. Respondents were instructed to indicate which of the five response options they would most likely do and which they would least likely do.

Phase three involved developing the SME-based scoring key. SMEs who were both

intimately familiar with the scholarship program and who had advanced training in assessment methodology determined the most effective answers for each item. For the most part, only response options with more than 70% agreement among the SMEs were retained as the correct option. However, in some cases there was less agreement among SMEs, and in these cases preferential weighting was given to a core group of SMEs (i.e., the program's director and one key faculty advisor). In this study, we analyzed responses to the most likely questions in order to simplify the analyses and presentation of results.

Performance criteria. Performance rating content and materials were developed in phase four based on the data gathered during the criterion development phase. Performance ratings were made primarily by the scholarship program director. When clarification was needed, a mentor or another program director was consulted for further information. Two dimensions of leadership were rated independently: effectiveness of leadership skills and actively seeking a leadership role. Initial analyses indicated that these two ratings correlated highly (r = .79, p < .01); thus, they were combined into a single index.

Procedure

SJT scores for the leadership dimension were calculated using the traditional SME scoring approach and several empirical keying methods shown previously to be of some utility in either SJT or biodata research. Because small calibration sample size is the biggest determinant of shrinkage in cross-validation (Hough & Paullin, 1994), two-thirds of the total sample was randomly assigned to the calibration sample, while the remaining one-third was retained for the cross-validation sample. More specifically, six empirical keying techniques were investigated. Each technique

results in a numeric value, or weight, for each response option. When combined with an individual's score on each option (0 if the option was not selected, 1 if it was), these weights were used in a regression equation that sought to predict performance on the leadership criterion from the weighted options. The empirical techniques employed are described below.

Vertical and horizontal percent methods. In order to compute weights via the vertical and horizontal percent methods, the calibration sample was first divided into high and low performance groups with respect to the criterion. The sample was split into thirds based on criterion scores, and only the lowest and highest thirds of the sample were used for weighting. Vertical percent weights were computed by taking the percentage of persons in the high group choosing each option and subtracting the percentage of persons in the low-performing group choosing that option. Horizontal weights were computed by taking the number of persons in the high performance group choosing each response option and dividing this number by the total number of people in the sample choosing that option. We then multiplied this number by ten to derive the final horizontal weights (see Devlin et al., 1992).

Correlational Methods 1 and 2. The dichotomously scored response options were correlated with the leadership performance criterion. The resulting zero-order correlation was treated as the weight. For this study, we chose two alpha levels for retaining response options as predictors. In Correlational Method 1, we used the α = .25 level, which corresponded to zero-order correlations of roughly r = .10 in magnitude. For Correlational Method 2, we keyed only item responses significant at the α = .10 level (roughly r = .14).

Mean criterion method. In order to generate the empirical scoring key for the mean criterion method, we computed mean criterion performance scores associated with each response

option. These mean scores were then used as the empirical weights in computing predictor scores for persons choosing each response option.

Unit weighting method. With the unit weighting method, response options associated with the highest mean criterion scores were assigned a value of 1.0, while other responses were assigned a value of 0. Options associated with the highest mean criterion were subject to the restriction that at least 10% of the sample must have chosen the option, in order to reduce the risk of significant results occurring by chance alone (see Weekley & Jones, 1999).

Each of the empirical keying techniques thus yields a numeric value for each response option, and this value was used as the weight in a regression equation predicting performance on the leadership criterion from the weighted response options.

Results

Descriptive statistics for the SME and empirical keying methods of scoring, as well as for the criterion measure of performance, are presented in Table 1. Table 2 contains correlations between the predictor scores and criterion ratings. As can be seen in Table 2, the SME-based scoring of the leadership dimension left much to be desired. The SME-based leadership scores had only a marginally significant relationship with performance in the calibration sample, and the SME-based predictor was not significantly related to performance in the cross-validation sample. The results of the empirical keying approaches were decidedly mixed. Though all empirical keying approaches had large, significant correlations with performance in the calibration sample, only the correlational methods were significantly related to performance in the cross-validation sample.

Discussion

In this study, we found that the predictive validity of an SJT could be improved by utilizing certain empirical keying procedures. In addition, we have detailed several theoretical reasons why empirical keying may be preferable to SME scoring for some SJTs. However, we also found that empirical keying is not a panacea for all that ails a predictor. Instead, we found many techniques shown to be predictive of performance in biodata contexts to be of little use for our SJT measure. Our study also illustrates the most pervasive problem of empirical scoring procedures: a general lack of cross-validation. Validities in our sample shrank considerably between calibration and cross-validation despite our best efforts to split the sample so that the majority of the data was used to derive stable empirical keying weights.

Though previous authors have discussed some advantages of the correlational method (Lecznar & Dailey, 1950; Weekley & Jones, 1999), we were somewhat surprised by the clearly superior performance of this type of empirical keying in our study. Perhaps this is because fewer (but higher quality) predictors were used with the correlational method. The more selective of the two correlational methods (Correlational Method 2) enjoyed considerably higher cross-validities than did the less restrictive of the two. This is to be expected to some extent; however, because item responses with weaker relationships to the criterion also receive smaller weights, the size of the difference was somewhat surprising.

Though the empirical keying approaches examined in this study often fared no better than the SME approach, we stress some of the positive aspects of empirical keying. First, empirical keying can serve as a validity check on SME ratings of the correct response option.
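This validity-check idea can be made concrete with a small sketch. The function and data below are purely illustrative, not the study's scoring code: a point-biserial correlation near zero between endorsing the SME-keyed option and the criterion flags a keyed option that carries no criterion-related information.

```python
import numpy as np

def keyed_option_check(endorsed, criterion):
    """Point-biserial correlation between endorsing the SME-keyed option
    (coded 0/1) and the performance criterion. A value near zero flags a
    keyed option that fails to separate high and low performers."""
    endorsed = np.asarray(endorsed, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    return np.corrcoef(endorsed, criterion)[0, 1]

# Hypothetical check: here the keyed option is endorsed by high and low
# performers alike, so it does no predictive work despite the SME key.
r = keyed_option_check([1, 1, 0, 0], [2, 4, 2, 4])
print(round(r, 2))  # 0.0
```

In practice one would run this check for every SME-keyed option before finalizing the key.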
If the response option chosen by SMEs as the correct response does not distinguish between high and low performers with respect to the criterion measure, then perhaps

that option is not correct after all. The unit weighting scoring procedure used in this study exemplifies this function. Researchers who reject the use of external keying procedures on philosophical grounds may still derive benefit from their use as a validity check and as part of the SJT development process. Requiring 75% agreement among SMEs to keep an item is a stringent but common criterion (Legree, 1994; Lievens, 2000). Using the unit weighting approach in combination with the SME approach may allow researchers to relax the agreement criterion slightly, given the empirical information offered by the unit weighting procedure.

Second, we believe that the empirical keying approaches significantly improve upon the SME scoring approach because they counter a number of its weaknesses and introduce important information into the scoring process. Empirical keying approaches inherently reject the notion of right and wrong answers to SJT items. That is, (most) empirical keying approaches award partial credit of sorts to a person choosing a response option that differentiates between higher and lower performance on the criterion. For example, two options that relate highly with the criterion would both be weighted strongly, rather than just one as in traditional correct/incorrect scoring. Conversely, negative weighting penalizes choices associated with poorer performance.

Also, practitioners not entirely comfortable with pure empirical keying could consider using a hybrid approach in which only items written to measure specific competencies are used to predict criteria deemed relevant. With this approach, practitioners can maintain a theoretical link between competencies and criteria because only items written to measure those competencies are used as predictors, rather than all SJT items.
This approach is not purely empirical but instead is more akin to the family of approaches in biodata research known as construct-based rational scoring (Hough & Paullin, 1994). With this type of scoring, the exact

nature of the relationship between the items and the construct (i.e., the scoring) is determined empirically, though the theoretical link between predictor and criteria remains.

As with any research, there are some potential limitations associated with this study. One limitation is the use of a student sample and an SJT designed for use in this sample. However, note that the SJT, while designed for use with a student sample, was rigorously developed. In addition, we attempted to choose the construct most relevant to organizations in our investigation. Though it is somewhat unlikely that an organization would hire an employee based on an SJT designed to measure leadership, it is entirely possible that such an SJT might be used as one factor in promotion decisions or for personal/career development purposes. Also, although a student sample was used in this study, these students are among the best in the country, with remarkable standardized test scores, clear leadership in extracurricular activities, and great promise, and indeed expectations, for future leadership positions.

A further limitation of the study was the relatively small sample used to derive the external weights. When deriving weights that reflect the relationship between item response options and the criterion for the population as a whole, the larger the sample used to derive these weights, the better (Hough & Paullin, 1994). Also, the SJT contained a relatively small number of items. This was a function of both pragmatic concerns over test length and other design considerations outside the control of the researchers. In general, large samples and a large number of items will lead to the best and most stable prediction of performance. Another limitation was the low initial validity coefficient associated with the SME-based approach; this low coefficient does not set a very high bar over which the empirical keying approaches must excel.
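To make the weighting schemes compared in this study concrete, the vertical and horizontal percent computations described in the Procedure section can be sketched as follows. This is a minimal illustration, not the study's actual code; the function name and data are hypothetical, and we assume only the top and bottom thirds on the criterion enter the horizontal denominator.

```python
import numpy as np

def percent_weights(chose_option, criterion, frac=1/3):
    """Vertical and horizontal percent weights for one response option.

    chose_option : 0/1 array, 1 if the respondent endorsed the option.
    criterion    : continuous performance scores (same length).
    The top and bottom thirds on the criterion form the high and low
    groups, following the study's procedure (ties at the cutoffs are
    handled crudely here; this is a sketch, not production code).
    """
    chose_option = np.asarray(chose_option, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    lo_cut, hi_cut = np.quantile(criterion, [frac, 1 - frac])
    high = criterion >= hi_cut
    low = criterion <= lo_cut

    # Vertical percent: % of the high group endorsing the option minus
    # % of the low group endorsing it.
    vertical = 100 * (chose_option[high].mean() - chose_option[low].mean())

    # Horizontal percent: high-group endorsers divided by all endorsers
    # in the high + low groups, times ten (cf. Devlin et al., 1992).
    endorsers = chose_option[high].sum() + chose_option[low].sum()
    horizontal = 10 * chose_option[high].sum() / endorsers if endorsers else 0.0
    return vertical, horizontal

# Nine hypothetical respondents: all three high performers endorse the
# option, two of three low performers do, the middle third is unused.
v, h = percent_weights([1, 1, 0, 0, 0, 0, 1, 1, 1],
                       [10, 20, 30, 40, 50, 60, 70, 80, 90])
print(round(v, 1), round(h, 1))  # 33.3 6.0
```

The other schemes (mean criterion, unit weighting, correlational) differ only in how the per-option weight is computed from the calibration data.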

Despite these limitations, we feel that this study provides promising results. By combining well-developed, content-valid items with externally derived empirical scoring of the item response options, we believe an optimal balance can be struck for scoring an SJT.
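The overall workflow, deriving correlational weights on a calibration sample and scoring a holdout to gauge shrinkage, can be sketched end to end. The data below are simulated, not the study's, and the |r| >= .14 cutoff merely stands in for the study's α = .10 retention rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: rows are respondents, columns are dichotomous
# response-option indicators; y is a leadership criterion rating in
# which options 0-4 are (by construction) the criterion-relevant ones.
n, k = 219, 30
X = rng.integers(0, 2, size=(n, k)).astype(float)
y = X[:, :5].sum(axis=1) + rng.normal(0.0, 1.0, n)

# Two-thirds calibration, one-third holdout, as in the study.
idx = rng.permutation(n)
cal, hold = idx[: 2 * n // 3], idx[2 * n // 3 :]

# Correlational keying: each option's zero-order correlation with the
# criterion in the calibration sample is its weight; options below the
# magnitude cutoff are dropped from the key.
r = np.array([np.corrcoef(X[cal, j], y[cal])[0, 1] for j in range(k)])
w = np.where(np.abs(r) >= 0.14, r, 0.0)

score_cal = X[cal] @ w    # weighted-option composite, calibration
score_hold = X[hold] @ w  # same key applied to the holdout

valid = np.corrcoef(score_cal, y[cal])[0, 1]
cross = np.corrcoef(score_hold, y[hold])[0, 1]
print(f"calibration r = {valid:.2f}, cross-validation r = {cross:.2f}")
```

Because the weights are fit to the calibration sample, the holdout correlation is the honest estimate; the gap between the two numbers is the shrinkage discussed above.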

References

Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143-159.
Clevenger, J., Pereira, G. M., Wiechmann, D., Schmitt, N., & Harvey, V. S. (2001). Incremental validity of situational judgment tests. Journal of Applied Psychology, 86, 410-417.
Dalessio, A. T. (1994). Predicting insurance agent turnover using a video-based situational judgment test. Journal of Business and Psychology, 9, 23-32.
Devlin, S. E., Abrahams, N. M., & Edwards, J. E. (1992). Empirical keying of biographical data: Cross-validity as a function of scaling procedure and sample size. Military Psychology, 4, 119-136.
Dunnette, M. D. (1962). Personnel management. Annual Review of Psychology, 13, 285-314.
England, G. W. (1961). Development and use of weighted application blanks. Dubuque, IA: Brown.
Hanson, M. A., Horgen, K. E., & Borman, W. C. (1998). Situational judgment tests as measures of knowledge/expertise. Paper presented at the annual conference of the Society for Industrial and Organizational Psychology, Dallas, TX.
Hanson, M. A., & Ramos, R. A. (1996). Situational judgment tests. In R. S. Barrett (Ed.), Fair employment strategies in human resource management (pp. 119-124). Westport, CT: Quorum Books/Greenwood Publishing Group.
Hogan, J. B. (1994). Empirical keying of background data measures. In G. S. Stokes & M. D. Mumford (Eds.), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 69-107). Palo Alto, CA: CPP Books.
Hough, L., & Paullin, C. (1994). Construct-oriented scale construction: The rational approach. In G. S. Stokes & M. D. Mumford (Eds.), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 109-145). Palo Alto, CA: CPP Books.
Lecznar, W. B., & Dailey, J. T. (1950). Keying biographical inventories in classification test batteries. American Psychologist, 5, 279.
Legree, P. J. (1994). The effect of response format on reliability estimates for tacit knowledge scales (ARI Research Note 94-25). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Lievens, F. (2000). Development of an empirical scoring scheme for situational inventories. European Review of Applied Psychology/Revue Europeenne de Psychologie Appliquee, 50, 117-125.
Mael, F. A. (1991). A conceptual rationale for the domain and attributes of biodata items. Personnel Psychology, 44, 763-792.
McDaniel, M. A., Finnegan, E. B., Morgeson, F. P., Campion, M. A., & Braverman, E. P. (1997). Predicting job performance from common sense. Paper presented at the 12th annual conference of the Society for Industrial and Organizational Psychology, St. Louis, MO.
McDaniel, M. A., Morgeson, F. P., Finnegan, E. B., Campion, M. A., & Braverman, E. P. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86, 730-740.
McDaniel, M. A., & Nguyen, N. T. (2001). Situational judgment tests: A review of practice and constructs assessed. International Journal of Selection and Assessment, 9, 103-113.
Mead, A. D., & Drasgow, F. (2003). Examination of a resampling procedure for empirical keying. Paper presented at the 18th Annual Meeting of the Society for Industrial and Organizational Psychology, Orlando, FL.
Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.
Motowidlo, S. J., & Tippins, N. (1993). Further studies of the low-fidelity simulation in the form of a situational inventory. Journal of Occupational and Organizational Psychology, 66, 337-344.
Mumford, M. D., & Owens, W. A. (1987). Methodology review: Principles, procedures, and findings in the application of background data measures. Applied Psychological Measurement, 11, 1-31.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Owens, W. A. (1976). Background data. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (1st ed., pp. 609-644). Chicago: Rand McNally.
Phillips, J. F. (1992). Predicting sales skills. Journal of Business and Psychology, 7, 151-160.
Phillips, J. F. (1993). Predicting negotiation skills. Journal of Business and Psychology, 7, 403-411.
Russell, C. J., & Klein, S. R. (2003). Toward optimization and insight: Bootstrapping a situational judgment empirical key. Paper presented at the 18th Annual Meeting of the Society for Industrial and Organizational Psychology, Orlando, FL.
Smiderle, D., Perry, B. A., & Cronshaw, S. F. (1994). Evaluation of video-based assessment in transit operator selection. Journal of Business and Psychology, 9, 3-22.
Stead, N. H., & Shartle, C. L. (1940). Occupational counseling techniques. New York: American Book.
Sternberg, R. J., & Wagner, R. K. (1993). The g-ocentric view of intelligence and job performance is wrong. Current Directions in Psychological Science, 2, 1-5.
Sternberg, R. J., Wagner, R. K., & Okagaki, L. (1993). Practical intelligence: The nature and role of tacit knowledge in work and at school. In J. M. Puckett (Ed.), Mechanisms of everyday cognition (pp. 205-227). Hillsdale, NJ: Lawrence Erlbaum Associates.
Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common sense. American Psychologist, 50, 912-927.
Stevens, M. J., & Campion, M. A. (1999). Staffing work teams: Development and validation of a selection test for teamwork settings. Journal of Management, 25, 207-228.
Weekley, J. A., & Jones, C. (1997). Video-based situational testing. Personnel Psychology, 50, 25-49.
Weekley, J. A., & Jones, C. (1999). Further studies of situational tests. Personnel Psychology, 52, 679-700.

Table 1
Descriptive Statistics for Predictors and Leadership Criteria Performance Ratings

                                 Calibration (N=144)    Cross-Validation (N=75)
Variable                           Mean       SD           Mean       SD
SME Method                         4.28      1.39          3.99      1.24
Correlational Method 1             0.20      0.26          0.14      0.22
Correlational Method 2           -42.61     41.06         53.79    128.33
Vertical %                        92.72     16.49         93.31     16.03
Horizontal %                       4.42      0.37          4.58      0.30
Mean Criterion                     0.27     58.55          0.22     36.67
Unit Weighting                     4.87      1.03          6.65      1.46
Leadership Performance Rating      4.30      0.93          4.66      0.98

Note: Correlational Method 1 used predictors significant at the p<.25 level. Correlational Method 2 used predictors significant at the p<.10 level.

Table 2
Correlations between Predictors and Leadership Criteria Performance Ratings

                                 Calibration (N=144)    Cross-Validation (N=75)
Predictor
SME Method                           .15*                    -.15
Correlational Method 1               .52**                    .21*
Correlational Method 2               .49**                    .28**
Vertical %                           .42**                    .06
Horizontal %                         .51**                    .12
Mean Criterion                       .61**                    .07
Unit Weighting                       .43**                    .05

Note: * p<.10. ** p<.05. Correlational Method 1 used predictors significant at the p<.25 level. Correlational Method 2 used predictors significant at the p<.10 level.