20 similar articles found (search time: 31 ms)
The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard setting judgments and empirical conditional probabilities. This relationship has been used as a measure of internal consistency by previous researchers. Results indicated that judges varied in the degree to which they were able to produce internally consistent ratings; some judges produced ratings that were highly correlated with empirical conditional probabilities and other judges’ ratings had essentially no correlation with the conditional probabilities. The results also showed that weighting procedures applied to individual judgments both increased panel-level internal consistency and produced convergence across panels.

Evidence to support the credibility of standard setting procedures is a critical part of the validity argument for decisions made based on tests that are used for classification. One area in which there has been limited empirical study is the impact of standard setting judge selection on the resulting cut score. One important issue related to judge selection is whether the extent of judges’ content knowledge impacts their perceptions of the probability that a minimally proficient examinee will answer the item correctly. The present article reports on two studies conducted in the context of Angoff‐style standard setting for medical licensing examinations. In the first study, content experts answered and subsequently provided Angoff judgments for a set of test items. After accounting for perceived item difficulty and judge stringency, answering the item correctly accounted for a significant (and potentially important) impact on expert judgment. The second study examined whether providing the correct answer to the judges would result in a similar effect to that associated with knowing the correct answer. The results suggested that providing the correct answer did not impact judgments. These results have important implications for the validity of standard setting outcomes in general and on judge recruitment specifically.

The credibility of standard‐setting cut scores depends in part on two sources of consistency evidence: intrajudge and interjudge consistency. Although intrajudge consistency feedback has often been provided to Angoff judges in practice, more evidence is needed to determine whether it achieves its intended effect. In this randomized experiment with 36 judges, non‐numeric item‐level intrajudge consistency feedback was provided to treatment‐group judges after the first and second rounds of Angoff ratings. Compared to the judges in the control condition, those receiving the feedback significantly improved their intrajudge consistency, with the effect being stronger after the first round than after the second round. To examine whether this feedback has deleterious effects on between‐judge consistency, I also examined interjudge consistency at the cut score level and the item level using generalizability theory. The results showed that without the feedback, cut score variability worsened; with the feedback, idiosyncratic item‐level variability improved. These results suggest that non‐numeric intrajudge consistency feedback achieves its intended effect and potentially improves interjudge consistency. The findings contribute to standard‐setting feedback research and provide empirical evidence for practitioners planning Angoff procedures.

Competency examinations in a variety of domains require setting a minimum standard of performance. This study examines the issue of whether judges using the two most popular methods for setting cut scores (Angoff and Nedelsky methods) use different sources of information when making their judgments. Thirty-one judges were assigned randomly to the two methods to set cut scores for a high school graduation test in reading comprehension. These ratings were then related to characteristics of the items as well as to empirically obtained p values. Results indicate that judges using the Angoff method use a wider variety of information and yield estimates closer to the actual p values. The characteristics of items used in the study were effective predictors of judges' ratings, but were far less effective in predicting p values.

Despite being widely used and frequently studied, the Angoff standard setting procedure has received little attention with respect to an integral part of the process: how judges incorporate examinee performance data in the decision‐making process. Without performance data, subject matter experts have considerable difficulty accurately making the required judgments. Providing data introduces the very real possibility that judges will turn their content‐based judgments into norm‐referenced judgments. This article reports on three Angoff standard setting panels for which some items were randomly assigned to have incorrect performance data. Judges were informed that some of the items were accompanied by inaccurate data, but were not told which items they were. The purpose of the manipulation was to assess the extent to which changing the instructions given to the judges would impact the extent to which they relied on the performance data. The modified instructions resulted in the judges making less use of the performance data than judges participating in recent parallel studies. The relative extent of the change judges made did not appear to be substantially influenced by the accuracy of the data.

Who should make judgments about test standards? Who is an expert? How many judges should be used in a standard-setting study? What is the relationship between the number of judges and the standard error of the test?

Standard setting methods such as the Angoff method rely on judgments of item characteristics; item response theory empirically estimates item characteristics and displays them in item characteristic curves (ICCs). This study evaluated several indexes of rater fit to ICCs as a method for judging rater accuracy in their estimates of expected item performance for target groups of test-takers. Simulated data were used to compare adequately fitting ratings to poorly fitting ratings at various target competence levels in a simulated two stage standard setting study. The indexes were then applied to a set of real ratings on 66 items evaluated at 4 competence thresholds to demonstrate their relative usefulness for gaining insight into rater “fit.” Based on analysis of both the simulated and real data, it is recommended that fit indexes based on the absolute deviations of ratings from the ICCs be used, and those based on the standard errors of ratings should be avoided. Suggestions are provided for using these indexes in future research and practice.

Judging readability (total citations: 2; self-citations: 0; citations by others: 2)
Individuals are frequently called upon to judge the readability of written text. The accuracy of such judgments, studies show, ranges from high to low. This paper provides another look at the problem, based upon the judgments of 56 professional writers on five passages of text taken from a reading test. The judges were asked to rank the five passages from most readable to least readable. The results showed wide variability in the judgments. Only a few of the judges were able individually to put the passages in the tested order of readability, but the consensus of the entire group put them in exactly that order. Further examination of the results suggested that a relatively small number of gross errors in judgment were made. Accuracy of judgments, it appeared, might greatly increase with selection and/or training of judges, a procedure followed in certain studies where highly accurate judgments had been found. A readability formula was suggested as an accurate and convenient way of getting readability scores under most circumstances. Use of a formula might also, it was suggested, help a judge to increase his accuracy, but human interpretation of the scores was still felt to be needed.

In the USA, student ratings of their instructors are routinely used by administrators in higher education in making decisions regarding instructors' salary adjustments, tenure and promotion. However, when the rating qualifications of amateur student raters and novice public school teachers who have received training that should have enabled them to become qualified raters are examined closely, there are good reasons for believing that both groups of raters are not qualified to give reliable ratings on most high‐inference questionnaire items.

Cut‐scores were set by expert judges on assessments of reading and listening comprehension of English as a foreign language (EFL), using the bookmark standard‐setting method to differentiate proficiency levels defined by the Common European Framework of Reference (CEFR). Assessments contained stratified item samples drawn from extensive item pools, calibrated using Rasch models on the basis of examinee responses of a German nationwide assessment of secondary school language performance. The results suggest significant effects of item sampling strategies for the bookmark method on cut‐score recommendations, as well as significant cut‐score judgment revision over cut‐score placement rounds. Results are discussed within a framework of establishing validity evidence supporting cut‐score recommendations using the widely employed bookmark method.

Students’ judgments about “what counts” as mathematics in and out of school have important consequences for problem solving and transfer, yet our understanding of the source and nature of these judgments remains incomplete. Thirty-five sixth grade students participated in a study focused on what activities students judge as mathematical, and how they make their judgments. Students completed a photo sorting activity; took, viewed, and captioned their own photos of mathematics; viewed and commented on classmates’ photos; and participated in a small group discussion. Across multiple sources of data, findings showed that students attended to two major features of photos and activities when making judgments: surface cues present in the photos, such as numbers and money, and the possibility for mathematical action. Some students looked for the possibility of mathematics, while others asked if mathematics was necessary. Students also gave higher ratings to activities with which they had personal experience. The article concludes with possible implications for practice.

A look at real data shows that Reckase's psychometric theory for standard setting is not applicable to bookmark and that his simulations cannot explain actual differences between methods. It is suggested that exclusively test-centered, criterion-referenced approaches are too idealized and that a psychophysics paradigm and a theory of group behavior could be more useful in thinking about the standard setting process. In this view, item mapping methods such as bookmark are reasonable adaptations to fundamental limitations in human judgments of item difficulty. They make item ratings unnecessary and have unique potential for integrating external validity data and student performance data more fully into the standard setting process.

Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.

An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item judgments directly reflect empirical item difficulties, the variability in Angoff judgments over items would not add error to the cut score, but to the extent that the mean item judgments do not correspond to the empirical item difficulties, variability in mean judgments over items would add error to the cut score. In this article, we present two generalizability-theory–based analyses of the proportion of the item variance that contributes to error in the cut score. For one approach, variance components are estimated on the probability (or proportion-correct) scale of the Angoff judgments, and for the other, the judgments are transferred to the theta scale of an item response theory model before estimating the variance components. The two analyses yield somewhat different results but both indicate that it is not appropriate to simply ignore the item variance component in estimating the error variance.

One common phenomenon in Angoff standard setting is that panelists regress their ratings in toward the middle of the probability scale. This study describes two indices based on taking ratios of standard deviations that can be utilized with a scatterplot of item ratings versus expected probabilities of success to identify whether ratings are regressed in toward the middle of the probability scale. Results from a simulation study show that the standard deviation ratio indices can successfully detect ratings for hard and easy items that are regressed in toward the middle of the probability scale in Angoff standard‐setting data, where previously proposed indices often do not work as well to detect these effects. Results from a real data set show that, while virtually all raters improve from Round 1 to Round 2 as measured by previously developed indices, the standard deviation ratios in conjunction with a scatterplot of item ratings versus expected probabilities of success can identify individuals who may still be regressing their ratings in toward the middle of the probability scale even after receiving feedback. The authors suggest using the scatterplot along with the standard deviation ratio indices and other statistics for measuring the quality of Angoff standard‐setting data.

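The quality of judges in China has attracted widespread attention from all sectors of society. How to make Chinese judges the elite of the legal profession is a very important question. A necessary and regular system for evaluating judges should be gradually improved; such a system helps motivate judges to make full use of their abilities in adjudication and gives individual judges or groups of judges important information about what needs to be improved or changed. A sound judge evaluation mechanism will help advance the professionalization of judges.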

Validating performance standards is challenging and complex. Because of the difficulties associated with collecting evidence related to external criteria, validity arguments rely heavily on evidence related to internal criteria—especially evidence that expert judgments are internally consistent. Given its importance, it is somewhat surprising that evidence of this kind has rarely been published in the context of the widely used bookmark standard‐setting procedure. In this article we examined the effect of ordered item booklet difficulty on content experts’ bookmark judgments. If panelists make internally consistent judgments, their resultant cut scores should be unaffected by the difficulty of their respective booklets. This internal consistency was not observed: the results suggest that substantial systematic differences in the resultant cut scores can arise when the difficulty of the ordered item booklets varies. These findings raise questions about the ability of content experts to make the judgments required by the bookmark procedure.

Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent.

Survey vignette methodology was employed to investigate student beliefs about what constitutes abusive behaviors in dating relationships. A packet of 15 unique vignettes depicting incidents that might be considered to be violent was distributed to randomly selected graduate and undergraduate students who were asked to rate physical abusiveness. Based on multiple regression analysis, both contextual and student demographic characteristics were found to influence abusiveness ratings. Significant predictors of abuse judgments were nature of the aggressive act and victim's gender and sexual orientation. More severe acts of aggression, female victims, gay and lesbian victims, a history of violence in the relationship, injurious outcome, male perpetrator, and alcohol consumption significantly increased abusiveness ratings. More advanced students and female students tended to make higher abuse ratings, whereas being in a relationship was associated with lower ratings. Although both contextual and demographic factors affected student judgments of abusiveness, student characteristics explained relatively little beyond what was accounted for by situational variables in the scenarios depicted.

This study investigated the accuracy of classroom teachers' judgments of the reading progress of their low‐performing students. Participants were 36 second grade teachers and students in their lowest reading groups (n = 150). Student progress was monitored weekly using reading‐curriculum‐based measurement (R‐CBM) procedures. After 6 weeks, teachers were asked to rate their students' progress. Expert judges later reviewed the teachers' R‐CBM graphs and rated the individual and group progress based on the graphs. Teacher ratings did not correlate with expert ratings or the R‐CBM slope estimates. Expert ratings correlated highly with slope estimates. Teachers' estimates of progress were significantly higher than expert judges' ratings, indicating that teachers may overestimate student progress. Implications for practice and future research are discussed. © 2008 Wiley Periodicals, Inc.
