1.

Innovations in Measuring Rater Accuracy in Standard Setting: Assessing “Fit” to Item Characteristic Curves

Gregory M. Hurtz J. Patrick Jones 《Applied Measurement in Education》2013,26(2):120-143

Standard setting methods such as the Angoff method rely on judgments of item characteristics; item response theory empirically estimates item characteristics and displays them in item characteristic curves (ICCs). This study evaluated several indexes of rater fit to ICCs as a method for judging rater accuracy in their estimates of expected item performance for target groups of test-takers. Simulated data were used to compare adequately fitting ratings to poorly fitting ratings at various target competence levels in a simulated two-stage standard setting study. The indexes were then applied to a set of real ratings on 66 items evaluated at 4 competence thresholds to demonstrate their relative usefulness for gaining insight into rater “fit.” Based on analysis of both the simulated and real data, it is recommended that fit indexes based on the absolute deviations of ratings from the ICCs be used, and those based on the standard errors of ratings should be avoided. Suggestions are provided for using these indexes in future research and practice.
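As a rough illustration of the kind of absolute-deviation fit index the abstract recommends, the sketch below compares one rater's Angoff ratings with the probabilities that an ICC predicts at a target competence level. The two-parameter logistic form, the item parameters, the function names, and the use of a simple mean absolute deviation are all assumptions made for illustration; the abstract does not specify the article's exact indexes.

```python
import math


def icc_2pl(theta, a, b):
    """Probability of a correct response under a 2PL model (assumed form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def mean_abs_deviation(ratings, items, theta):
    """Mean absolute deviation of a rater's ratings from ICC values at theta.

    ratings: the rater's estimated probabilities of a correct answer, one per item
    items:   (a, b) parameter pairs, one per item (hypothetical values)
    theta:   target competence level on the IRT scale
    """
    if len(ratings) != len(items):
        raise ValueError("ratings and items must have the same length")
    deviations = [
        abs(r - icc_2pl(theta, a, b)) for r, (a, b) in zip(ratings, items)
    ]
    return sum(deviations) / len(deviations)


# Hypothetical item parameters and ratings, for illustration only.
items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.7)]
exact = [icc_2pl(0.0, a, b) for a, b in items]

# A rater whose ratings match the ICCs, then one who ignores item difficulty.
print(mean_abs_deviation(exact, items, theta=0.0))
print(mean_abs_deviation([0.9, 0.9, 0.9], items, theta=0.0))
```

Under these assumptions a smaller value indicates ratings that track the ICCs more closely at the chosen competence level.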

2.

An Experimental Study of the Internal Consistency of Judgments Made in Bookmark Standard Setting

Brian E. Clauser Peter Baldwin Melissa J. Margolis Janet Mee Marcia Winward 《Journal of Educational Measurement》2017,54(4):481-497

Validating performance standards is challenging and complex. Because of the difficulties associated with collecting evidence related to external criteria, validity arguments rely heavily on evidence related to internal criteria, especially evidence that expert judgments are internally consistent. Given its importance, it is somewhat surprising that evidence of this kind has rarely been published in the context of the widely used bookmark standard‐setting procedure. In this article we examined the effect of ordered item booklet difficulty on content experts’ bookmark judgments. If panelists make internally consistent judgments, their resultant cut scores should be unaffected by the difficulty of their respective booklets. This internal consistency was not observed: the results suggest that substantial systematic differences in the resultant cut scores can arise when the difficulty of the ordered item booklets varies. These findings raise questions about the ability of content experts to make the judgments required by the bookmark procedure.

3.

Examining How Professional Roles and Test Development Experiences Impact Angoff Ratings

Adam E. Wyse 《Applied Measurement in Education》2018,31(4):324-334

An important consideration in standard setting is recruiting a group of panelists with different experiences and backgrounds to serve on the standard-setting panel. This study uses data from 14 different Angoff standard settings from a variety of medical imaging credentialing programs to examine whether people with different professional roles and test development experiences tended to recommend higher or lower cut scores or were more or less accurate in their standard-setting judgments. Results suggested that there were not any statistically significant differences for different types of panelists in terms of the cut scores they recommended or the accuracy of their judgments. Discussion of what these results may mean for panelist selection and recruitment is provided.

4.

Diagnostic Profiles: A Standard Setting Method for Use With a Cognitive Diagnostic Model

Gary Skaggs Serge F. Hein Jesse L. M. Wilkins 《Journal of Educational Measurement》2016,53(4):448-458

This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not‐mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard.

5.

Commentary: A Response to Reckase's Conceptual Framework and Examples for Evaluating Standard Setting Methods (total citations: 1; self-citations: 0; citations by others: 1)

E. Matthew Schulz 《Educational Measurement》2006,25(3):4-13

A look at real data shows that Reckase's psychometric theory for standard setting is not applicable to bookmark and that his simulations cannot explain actual differences between methods. It is suggested that exclusively test-centered, criterion-referenced approaches are too idealized and that a psychophysics paradigm and a theory of group behavior could be more useful in thinking about the standard setting process. In this view, item mapping methods such as bookmark are reasonable adaptations to fundamental limitations in human judgments of item difficulty. They make item ratings unnecessary and have unique potential for integrating external validity data and student performance data more fully into the standard setting process.

6.

Differential Use of Item Information by Judges Using Angoff and Nedelsky Procedures

Robert L. Smith Jeffrey K. Smith 《Journal of Educational Measurement》1988,25(4):259-274

Competency examinations in a variety of domains require setting a minimum standard of performance. This study examines the issue of whether judges using the two most popular methods for setting cut scores (Angoff and Nedelsky methods) use different sources of information when making their judgments. Thirty-one judges were assigned randomly to the two methods to set cut scores for a high school graduation test in reading comprehension. These ratings were then related to characteristics of the items as well as to empirically obtained p values. Results indicate that judges using the Angoff method use a wider variety of information and yield estimates closer to the actual p values. The characteristics of items used in the study were effective predictors of judges' ratings, but were far less effective in predicting p values.

7.

Effect of Content Knowledge on Angoff‐Style Standard Setting Judgments

Melissa J. Margolis Janet Mee Brian E. Clauser Marcia Winward Jerome C. Clauser 《Educational Measurement》2016,35(1):29-37

Evidence to support the credibility of standard setting procedures is a critical part of the validity argument for decisions made based on tests that are used for classification. One area in which there has been limited empirical study is the impact of standard setting judge selection on the resulting cut score. One important issue related to judge selection is whether the extent of judges’ content knowledge impacts their perceptions of the probability that a minimally proficient examinee will answer the item correctly. The present article reports on two studies conducted in the context of Angoff‐style standard setting for medical licensing examinations. In the first study, content experts answered and subsequently provided Angoff judgments for a set of test items. After accounting for perceived item difficulty and judge stringency, answering the item correctly accounted for a significant (and potentially important) impact on expert judgment. The second study examined whether providing the correct answer to the judges would result in a similar effect to that associated with knowing the correct answer. The results suggested that providing the correct answer did not impact judgments. These results have important implications for the validity of standard setting outcomes in general and on judge recruitment specifically.

8.

Comparing global judgments and specific judgments of teachers about students' knowledge: Is the whole the sum of its parts?

《Teaching and Teacher Education》2018

Teachers' judgments about students' knowledge and skills can be global or specific depending on the diagnostic situation during teaching. We test the relationship between these judgments, their accuracy, and whether global judgment (GJ) accuracy can be measured by aggregating specific judgments (SJ). Judgments of 52 primary school teachers about their students' achievement in a standardized mathematics test were assessed. SJs and GJs correlated highly. However, SJs were slightly more accurate than GJs. Additionally, teachers' GJ accuracy is not similar to the accuracy of aggregated SJs. We conclude that teachers use different judgment strategies for GJs and SJs.

9.

Equivalent Pass/Fail Decisions

John J. Norcini 《Journal of Educational Measurement》1990,27(1):59-66

In competency testing, it is sometimes difficult to properly equate scores of different forms of a test and thereby assure equivalent cutting scores. Under such circumstances, it is possible to set standards separately for each test form and then scale the judgments of the standard setters to achieve equivalent pass/fail decisions. Data from standard setters and examinees for a medical certifying examination were reanalyzed. Cutting score equivalents were derived by applying a linear procedure to the standard-setting results. These were compared against criteria along with the cutting score equivalents derived from typical examination equating procedures. Results indicated that the cutting score equivalents produced by the experts were closer to the criteria than standards derived from examinee performance, especially when the number of examinees used in equating was small. The root mean square error estimate was about 1 item on a 189-item test.

10.

Judges' Use of Examinee Performance Data in an Angoff Standard‐Setting Exercise for a Medical Licensing Examination: An Experimental Study

Brian E. Clauser Janet Mee Su G. Baldwin Melissa J. Margolis Gerard F. Dillon 《Journal of Educational Measurement》2009,46(4):390-407

Although the Angoff procedure is among the most widely used standard setting procedures for tests comprising multiple‐choice items, research has shown that subject matter experts have considerable difficulty accurately making the required judgments in the absence of examinee performance data. Some authors have viewed the need to provide performance data as a fatal flaw for the procedure; others have considered it appropriate for experts to integrate performance data into their judgments but have been concerned that experts may rely too heavily on the data. There have, however, been relatively few studies examining how experts use the data. This article reports on two studies that examine how experts modify their judgments after reviewing data. In both studies, data for some items were accurate and data for other items had been manipulated. Judges in both studies substantially modified their judgments whether the data were accurate or not.

11.

The Effects of Mastery and Competitive Conditions on Self-Assessment at Different Ages (total citations: 1; self-citations: 0; citations by others: 1)

Ruth Butler 《Child development》1990,61(1):201-210

It was hypothesized that self-evaluative accuracy will increase with age in a competitive condition, while even young children will appraise their performance quite accurately in a mastery condition. Children at ages 5, 7, and 10 working in either a match-the-standard or a competitive condition copied a drawing and then evaluated their copies. As hypothesized, competing 5-year-olds overestimated the quality of their copies, and self-assessments became less positive and better correlated with adult judgments with age. There were no age differences in self-evaluative accuracy in the mastery condition. Examination of children's explanations for their ratings and their interest in the task supported the interpretation that young children are guided by a nonnormative concept of ability, which can lead to overoptimistic perceptions of competence under competition. Older children tended to adopt normative goals and criteria for self-assessment in competition and mastery ones in the match-the-standard condition, and were realistic about their performance in both.

12.

A. J. Massey 《Assessment in Education: Principles, Policy & Practice》1995,2(2):187-203

The evolving specification for a series of vertically equated overlapping Key Stage 3 national tests in science in England and Wales sets a series of test development challenges. These include the need to relate standards defined by hierarchically organised ‘level’ criteria to cut‐scores based on total test scores; and the need to allow compensation across the boundaries of sets of items targeted at different levels. A criterion‐related model for test development is described which is governed by a pattern of expectations about the performance of pupils relating to the hierarchical level criteria and builds determination of cut‐scores into the test development process. Some other relevant approaches to standard setting are also discussed.

13.

A Note on the Application of Multiple Matrix Sampling to Standard Setting

John J. Norcini Judy A. Shea James C. Ping 《Journal of Educational Measurement》1988,25(2):159-164

In many of the methods currently proposed for standard setting, all experts are asked to judge all items, and the standard is taken as the mean of their judgments. When resources are limited, gathering the judgments of all experts in a single group can become impractical. Multiple matrix sampling (MMS) provides an alternative. This paper applies MMS to a variation on Angoff's method (1971) of standard setting. A pool of 36 experts and 190 items were divided randomly into 5 groups, and estimates of borderline examinee performance were acquired. Results indicated some variability in the cutting scores produced by the individual groups, but the variance components were reasonably well estimated. The standard error of the cutting score was very small, and the width of the 90% confidence interval around it was only 1.3 items. The reliability of the final cutting score was .98.
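The averaging step described in the abstract can be sketched as follows. This is only a minimal illustration of taking the standard as the mean of expert judgments when each item is rated by a subset of experts. The function name, the data layout, and the ratings are hypothetical, and the sketch does not reproduce the paper's variation on Angoff's method or its variance component estimates.

```python
from statistics import mean


def angoff_cut_score(ratings_by_item):
    """Sum of per-item mean ratings.

    Each rating is an expert's estimate of the probability that a borderline
    examinee answers the item correctly, so the sum is the expected raw score
    of a borderline examinee.
    """
    return sum(mean(ratings) for ratings in ratings_by_item.values())


# Hypothetical ratings: under matrix sampling, each item is judged only by
# the experts in the group to which it was assigned.
ratings_by_item = {
    "item1": [0.50, 0.75],
    "item2": [0.25, 0.75],
    "item3": [1.00, 0.50, 0.75],
}

print(angoff_cut_score(ratings_by_item))
```

Because each item's mean uses only the experts who rated it, no expert has to judge the full item pool.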

14.

Maintaining Equivalent Cut Scores for Small Sample Test Forms

Andrew C. Dwyer 《Journal of Educational Measurement》2016,53(1):3-22

This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common‐item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common‐item equating methodology to standard setting ratings to account for systematic differences between standard setting panels) has received almost no attention in the literature. Identity equating was also examined to provide context. Data from a standard setting form of a large national certification test (N examinees = 4,397; N panelists = 13) were split into content‐equivalent subforms with common items, and resampling methodology was used to investigate the error introduced by each approach. Common‐item equating (circle‐arc and nominal weights mean) was evaluated at samples of size 10, 25, 50, and 100. The standard setting approaches (resetting and rescaling the standard) were evaluated by resampling (N = 8) and by simulating panelists (N = 8, 13, and 20). Results were inconclusive regarding the relative effectiveness of resetting and rescaling the standard. Small‐sample equating, however, consistently produced new form cut scores that were less biased and less prone to random error than new form cut scores based on resetting or rescaling the standard.

15.

Teacher Support Teams for Special Educational Needs in Primary Schools: evaluating a teacher-focused support scheme

Brahm Norwich Harry Daniels 《Educational studies》1997,23(1):5-24

This paper reports on part of an evaluation of teacher support teams (TSTs) as a special education needs (SEN) support strategy in primary schools. Using a mixture of quantitative and qualitative evaluation methods, it focuses on areas derived from a theoretical framework for understanding schools’ approaches to SENs. TSTs were set up and run in six of the eight schools, with meetings of between 30 and 45 minutes, usually during lunchtime or after school. Most of the referrals were about behaviour problems, though many were about learning difficulties. The support included providing emotional encouragement, specific approaches to managing behaviour, teaching strategies and consulting others. Referring teachers reported that their TST experience led to increased confidence and some improvements in the children, while TST members themselves believed that they had gained much from the TST experience. Overall the study showed the feasibility and benefits of setting up TSTs in primary schools. The findings are discussed in terms of the wider benefits of TSTs and their relevance to special needs policies and the implementation of the SENs code of practice.

16.

Standard Setting: Steps, Methods, and Evaluation Indexes (total citations: 1; self-citations: 0; citations by others: 1)

Li Zhen (李珍) Xin Tao (辛涛) Chen Ping (陈平) 《考试研究》2010,(2):83-95

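Standard setting is the process of establishing standards: it determines the cut scores that divide a test score distribution into two or more categories. Through standard setting, examinees can be classified as "pass" or "fail," or sorted into a larger number of ordered performance categories. Standard setting is an essential component of criterion-referenced testing, can provide test decision makers with evidence about test validity, and is currently a research topic of considerable interest in the measurement field. This article first reviews the origins and development of standard setting, then describes in detail the basic steps of standard setting, several major standard setting methods, and the indexes used to evaluate the standard setting process, and finally briefly discusses the need to apply standard setting in various examinations in China.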

17.

The development of uncertainty monitoring in early childhood

Lyons KE Ghetti S 《Child development》2011,82(6):1778-1787

This study examined the development of uncertainty monitoring in early childhood. Specifically, this study tested the prediction that preschoolers can reflect on their sense of certainty about the likely accuracy of their decisions, and it examined whether this ability differs across domains. Three-, 4-, and 5-year-olds (N = 74) completed a perceptual identification and a lexical identification task in which they reported whether they were certain or uncertain about their answers. Results showed that even 3-year-olds provided confidence judgments that discriminated accurate from inaccurate responses, but this discrimination increased with age. Furthermore, results suggest that 3-year-olds primarily rely on response latency to assess certainty, whereas older preschoolers do not. Overall, these findings suggest that uncertainty monitoring emerges and develops during the preschool years.

18.

The Impact of Process Instructions on Judges’ Use of Examinee Performance Data in Angoff Standard Setting Exercises

Janet Mee Brian E. Clauser Melissa J. Margolis 《Educational Measurement》2013,32(3):27-35

Despite being widely used and frequently studied, the Angoff standard setting procedure has received little attention with respect to an integral part of the process: how judges incorporate examinee performance data in the decision‐making process. Without performance data, subject matter experts have considerable difficulty accurately making the required judgments. Providing data introduces the very real possibility that judges will turn their content‐based judgments into norm‐referenced judgments. This article reports on three Angoff standard setting panels for which some items were randomly assigned to have incorrect performance data. Judges were informed that some of the items were accompanied by inaccurate data, but were not told which items they were. The purpose of the manipulation was to assess the extent to which changing the instructions given to the judges would impact the extent to which they relied on the performance data. The modified instructions resulted in the judges making less use of the performance data than judges participating in recent parallel studies. The relative extent of the change judges made did not appear to be substantially influenced by the accuracy of the data.

19.

Clinical data used by pediatric residents to assess parenting

J M Leventhal K Fearn C A Stashwick 《Child abuse & neglect》1986,10(1):71-78

20.

Socialization and Social Judgments among Inner-City African-American Kindergartners

Robert J. Jagers Kathy Bingham Sydney L. Hans 《Child development》1996,67(1):140-150

This study explores the relations between certain socialization experiences and social judgments among poor, inner-city African-American kindergartners. 54 mothers and their children took part in this investigation. Consistent with the domain distinction literature, children made judgments about the seriousness, rule contingency, context contingency, and punishment deserved for familiar moral and social-conventional transgressions. Mothers were queried regarding their child-rearing values and discipline practices and described their children's peer network and social experiences. Results indicated that children distinguished between moral and social-conventional issues when explaining why they were wrong and in terms of rule and home context contingency criteria, but not the other judgment criteria. Mothers placed high value on conformity and most often ignored or talked to children about their misbehavior. More frequent use of talking, less ignoring, and less denial of privileges by mothers predicted children's making the domain distinction. Discussion focuses on methodological limitations and directions for future research.