Similar Literature (20 records found)
1.
In judgmental standard setting procedures (e.g., the Angoff procedure), expert raters establish minimum pass levels (MPLs) for test items, and these MPLs are then combined to generate a passing score for the test. As suggested by Van der Linden (1982), item response theory (IRT) models may be useful in analyzing the results of judgmental standard setting studies. This paper examines three issues relevant to the use of IRT models in analyzing the results of such studies. First, a statistic for examining the fit of MPLs, based on judges' ratings, to an IRT model is suggested. Second, three methods for setting the passing score on a test based on item MPLs are analyzed; these analyses, based on theoretical models rather than empirical comparisons among the three methods, suggest that the traditional approach (i.e., setting the passing score on the test equal to the sum of the item MPLs) does not provide the best results. Third, a simple procedure, based on generalizability theory, for examining the sources of error in estimates of the passing score is discussed.
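A minimal sketch of the traditional aggregation this abstract questions: item MPLs are averaged over judges and summed to a test-level passing score, and treating judges as the error source gives a rough generalizability-style standard error. All data and values here are hypothetical.

```python
import numpy as np

# Hypothetical ratings: 5 judges x 8 items, each entry a judged minimum
# pass level (probability that a minimally competent examinee answers
# the item correctly).
rng = np.random.default_rng(0)
ratings = np.clip(rng.normal(0.6, 0.1, size=(5, 8)), 0.0, 1.0)

item_mpls = ratings.mean(axis=0)     # average over judges
passing_score = item_mpls.sum()      # traditional Angoff cut score

# A rough generalizability-style error estimate: variability of the
# test-level cut score across judges (judges as the error source).
judge_cuts = ratings.sum(axis=1)     # each judge's implied cut score
se_cut = judge_cuts.std(ddof=1) / np.sqrt(ratings.shape[0])

print(f"passing score = {passing_score:.2f} of {ratings.shape[1]} items")
print(f"standard error over judges = {se_cut:.3f}")
```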

2.
This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not-mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard.

3.
Since 1971 there have been a number of studies in which a cut score has been set using a method proposed by Angoff (1971). In this method, each member of a panel of judges estimates for each test question the proportion correct for a specific target group of examinees. Prior and contemporary research suggests that this is a difficult task for judges. Angoff also proposed that judges simply indicate whether or not an examinee from the target group will be able to answer each question correctly (the yes/no method). We report on the results of two studies that compare a yes/no estimation with a proportion correct estimation. The two studies demonstrate that both methods produce essentially equal cut scores and that judges find the yes/no method more comfortable to use than the estimated proportion correct method.
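For contrast, a hypothetical sketch of the two judgment formats compared above: proportion-correct ratings versus binary yes/no judgments, each aggregated to a cut score the same way (all values simulated, not from the study).

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = rng.uniform(0.4, 0.9, size=10)   # hypothetical item MPLs

# Proportion-correct method: judges estimate each item's MPL directly.
prop_ratings = np.clip(true_p + rng.normal(0, 0.08, size=(6, 10)), 0, 1)

# Yes/no method: judges answer "will a borderline examinee get this right?"
yes_no = (rng.random((6, 10)) < true_p).astype(float)

print("proportion-correct cut:", prop_ratings.mean(axis=0).sum().round(2))
print("yes/no cut:            ", yes_no.mean(axis=0).sum().round(2))
```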

4.
A Comparative Study of the Rasch Model and IRT in the Statistical Analysis of Student Achievement Tests
The advent of the Rasch model and item response theory (IRT) has driven a methodological shift in social science research. Most scholars regard the Rasch model as simply a special case of the three-parameter IRT model. In fact, the Rasch model differs from IRT in that the data are required to conform to the model's prior theory. This study used Winsteps and Multilog, software packages built on these two theoretical foundations, to analyze a student achievement test, in order to reveal the similarities and differences in the analysis results under the two models and to explore the application of Winsteps in educational statistics.

5.
Beginning with the 2007 administration, the Hong Kong Certificate of Education Examination adopted standards-referenced reporting to grade candidates in Chinese Language and English Language. Score processing used a Rasch model with structural parameters. This paper introduces the model and its main properties, derives the solution equations for Joint Maximum Likelihood Estimation, and reports the main results of applying the model to standards-referenced grading in the examination.
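The paper derives JMLE solution equations for a Rasch model with structural parameters; as background, a minimal sketch of standard JMLE for the plain dichotomous Rasch model, alternating Newton-Raphson updates for abilities and difficulties (the textbook algorithm, not the paper's extended model):

```python
import numpy as np

def jmle_rasch(X, n_iter=50):
    """Joint maximum likelihood for the dichotomous Rasch model.
    X: persons x items 0/1 matrix (extreme scores assumed absent)."""
    n_persons, n_items = X.shape
    theta = np.zeros(n_persons)   # person abilities
    delta = np.zeros(n_items)     # item difficulties
    for _ in range(n_iter):
        P = 1 / (1 + np.exp(-(theta[:, None] - delta[None, :])))
        info = P * (1 - P)
        # Newton step for persons: (observed - expected score) / information
        theta += (X.sum(1) - P.sum(1)) / info.sum(1)
        P = 1 / (1 + np.exp(-(theta[:, None] - delta[None, :])))
        info = P * (1 - P)
        delta -= (X.sum(0) - P.sum(0)) / info.sum(0)
        delta -= delta.mean()     # identify the scale: mean difficulty 0
    return theta, delta

# Hypothetical usage on simulated data; drop zero/perfect scores first.
rng = np.random.default_rng(3)
true_theta = rng.normal(size=300)
true_delta = np.linspace(-1.5, 1.5, 10)
P = 1 / (1 + np.exp(-(true_theta[:, None] - true_delta[None, :])))
X = (rng.random(P.shape) < P).astype(float)
X = X[(X.sum(1) > 0) & (X.sum(1) < X.shape[1])]
theta_hat, delta_hat = jmle_rasch(X)
```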

6.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
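A minimal sketch of the idea behind the effort-moderated model: responses whose response times fall below an item's rapid-guessing threshold are modeled as pure chance, which is constant in theta and therefore contributes no information to proficiency estimation, while the remaining responses follow a 3PL. The threshold, parameters, and data below are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_loglik(theta, x, rt, a, b, c, rt_threshold, n_options=4):
    """Effort-moderated likelihood: solution-behavior responses follow a
    3PL; rapid guesses (rt below threshold) follow pure chance, which is
    constant in theta and so carries no information about ability."""
    p3pl = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    solution = rt >= rt_threshold
    p = np.where(solution, p3pl, 1.0 / n_options)
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Hypothetical data: 10 items, the last 3 answered with rapid guesses.
a = np.full(10, 1.2); b = np.linspace(-2, 2, 10); c = np.full(10, 0.2)
x  = np.array([1, 1, 1, 1, 0, 1, 0, 0, 1, 0])
rt = np.array([30, 25, 40, 22, 35, 28, 31, 2, 1, 2])   # seconds

fit = minimize_scalar(neg_loglik, bounds=(-4, 4), method="bounded",
                      args=(x, rt, a, b, c, 5.0))
print("effort-moderated theta estimate:", round(fit.x, 3))
```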

7.
Multilevel Rasch models are increasingly used to estimate the relationships between test scores and student and school factors. Response data were generated to follow one-, two-, and three-parameter logistic (1PL, 2PL, 3PL) models, but the Rasch model was used to estimate the latent regression parameters. When the response functions followed 2PL or 3PL models, the proportion of variance explained in test scores by the simulated student or school predictors was estimated accurately with a Rasch model. Proportion of variance within and between schools was also estimated accurately. The regression coefficients were misestimated unless they were rescaled out of logit units. However, item-level parameters, such as DIF effects, were biased when the Rasch model was violated, similar to single-level models.

8.
Essential for the validity of the judgments in a standard-setting study is that they follow the implicit task assumptions. In the Angoff method, judgments are assumed to be inversely related to the difficulty of the items; contrasting-groups judgments are assumed to be positively related to the ability of the students. In the present study, judgments from both procedures were modeled with a random-effects probit regression model. The Angoff judgments showed a weaker link with the position of the items on the latent scale than the contrasting-groups judgments with the position of the students. Hence, in the specific context of the study, the contrasting-groups judgments were more aligned with the underlying assumptions of the method than the Angoff judgments.

9.
Robustness of the School-Level IRT Model
The robustness of the school-level item response theory (IRT) model to violations of distributional assumptions was studied in a computer simulation. Estimated precision of "expected a posteriori" (EAP) estimates of the mean school ability from BILOG 3 was compared with actual precision, varying school size, intraclass correlation, school ability, number of forms comprising the test, and item parameters. Under conditions where the school-level precision might be possibly acceptable for real school comparisons, the EAP estimates of school ability were robust over a wide range of violations and conditions, with the estimated precision being either consistent with the actual precision or somewhat conservative. Some lack of robustness was found, however, under conditions where the precision was inherently poor and the test would presumably not be used for serious school comparisons.
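As background to the EAP estimates evaluated here, a minimal sketch of an EAP ability estimate by fixed-point quadrature with a standard-normal prior (person-level and 2PL for simplicity; the school-level model in the study aggregates over students, but the quadrature idea is the same). All parameters are hypothetical.

```python
import numpy as np

def eap(x, a, b, n_quad=61):
    """Expected a posteriori ability estimate by numerical quadrature,
    with a standard-normal prior on theta. Returns (estimate, posterior SD)."""
    theta = np.linspace(-4, 4, n_quad)
    prior = np.exp(-0.5 * theta**2)
    p = 1 / (1 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))
    lik = np.prod(np.where(x[:, None] == 1, p, 1 - p), axis=0)
    post = lik * prior
    post /= post.sum()
    est = (theta * post).sum()
    psd = np.sqrt(((theta - est) ** 2 * post).sum())
    return est, psd

a = np.full(12, 1.0); b = np.linspace(-1.5, 1.5, 12)
x = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
print(eap(x, a, b))
```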

10.
The present study evaluated the multiple imputation method, a procedure that is similar to the one suggested by Li and Lissitz (2004), and compared the performance of this method with that of the bootstrap method and the delta method in obtaining the standard errors for the estimates of the parameter scale transformation coefficients in item response theory (IRT) equating in the context of the common-item nonequivalent groups design. Two different estimation procedures for the variance-covariance matrix of the IRT item parameter estimates, which were used in both the delta method and the multiple imputation method, were considered: empirical cross-product (XPD) and supplemented expectation maximization (SEM). The results of the analyses with simulated and real data indicate that the multiple imputation method generally produced very similar results to the bootstrap method and the delta method in most of the conditions. The differences between the estimated standard errors obtained by the methods using the XPD matrices and the SEM matrices were very small when the sample size was reasonably large. When the sample size was small, the methods using the XPD matrices appeared to yield slight upward bias for the standard errors of the IRT parameter scale transformation coefficients.
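A minimal sketch of the generic delta method used in the study: the standard error of a smooth function of parameter estimates is sqrt(g' Sigma g), where g is the gradient of the function and Sigma the covariance matrix of the estimates. The function and covariance below are hypothetical stand-ins for the scale transformation coefficients and the XPD/SEM matrices.

```python
import numpy as np

def delta_method_se(f, beta_hat, cov, eps=1e-5):
    """SE of f(beta) via the delta method: sqrt(g' Sigma g), with the
    gradient g approximated by central finite differences."""
    k = len(beta_hat)
    g = np.zeros(k)
    for j in range(k):
        step = np.zeros(k); step[j] = eps
        g[j] = (f(beta_hat + step) - f(beta_hat - step)) / (2 * eps)
    return np.sqrt(g @ cov @ g)

# Hypothetical: a coefficient that is a ratio of two estimated
# parameters, with their estimated covariance matrix.
beta = np.array([1.8, 1.2])
cov = np.array([[0.04, 0.01], [0.01, 0.02]])
print(delta_method_se(lambda b: b[0] / b[1], beta, cov))
```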

11.
Evidence of the internal consistency of standard-setting judgments is a critical part of the validity argument for tests used to make classification decisions. The bookmark standard-setting procedure is a popular approach to establishing performance standards, but there is relatively little research that reflects on the internal consistency of the resulting judgments. This article presents the results of an experiment in which content experts were randomly assigned to one of two response probability conditions: .67 and .80. If the standard-setting judgments collected with the bookmark procedure are internally consistent, both conditions should produce highly similar cut scores. The results showed substantially different cut scores for the two conditions; this calls into question whether content experts can produce the type of internally consistent judgments that are required using the bookmark procedure.

12.
13.
A conceptual framework is proposed for a psychometric theory of standard setting. The framework suggests that participants in a standard setting process (panelists) develop an internal, intended standard as a result of training and the participant's background. The goal of a standard setting process is to convert panelists' intended standards to points on a test's score scale. Psychometrics is involved in this process because the points on the score scale are estimated from ratings provided by participants. The conceptual framework is used to derive three criteria for evaluating standard setting processes. The use of these criteria is demonstrated by applying them to variations of bookmark and modified Angoff standard setting methods.

14.
Mixture Rasch models have been used to study a number of psychometric issues such as goodness of fit, response strategy differences, strategy shifts, and multidimensionality. Although these models offer the potential for improving understanding of the latent variables being measured, under some conditions overextraction of latent classes may occur, potentially leading to misinterpretation of results. In this study, a mixture Rasch model was applied to data from a statewide test that was initially calibrated to conform to a 3-parameter logistic (3PL) model. Results suggested how latent classes could be explained and also suggested that these latent classes might be due to applying a mixture Rasch model to 3PL data. To support this latter conjecture, a simulation study was presented to demonstrate how data generated to fit a one-class 2-parameter logistic (2PL) model required more than one class when fit with a mixture Rasch model.

15.
As an alternative to adaptation, tests may also be developed simultaneously in multiple languages. Although the items on such tests could vary substantially, scores from these tests may be used to make the same types of decisions about different groups of examinees. The ability to make such decisions is contingent upon setting performance standards for each exam that allow for comparable interpretations of test results. This article describes a standard setting process used for a multilingual high school literacy assessment constructed under these conditions. This methodology was designed to address the specific challenges presented by this testing program including maintaining equivalent expectations for performance across different student populations. The validity evidence collected to support the methodology and results is discussed along with recommendations for future practice.

16.
The Angoff (1971) standard setting method requires expert panelists to (a) conceptualize candidates who possess the qualifications of interest (e.g., the minimally qualified) and (b) estimate actual item performance for these candidates. Past and current research (Bejar, 1983; Shepard, 1994) suggests that estimating item performance is difficult for panelists. If panelists cannot perform this task, the validity of the standard based on these estimates is in question. This study tested the ability of 26 classroom teachers to estimate item performance for two groups of their students on a locally developed district-wide science test. Teachers were more accurate in estimating the performance of the total group than of the "borderline group," but in neither case was their accuracy level high. Implications of this finding for the validity of item performance estimates by panelists using the Angoff standard setting method are discussed.

17.
Validating performance standards is challenging and complex. Because of the difficulties associated with collecting evidence related to external criteria, validity arguments rely heavily on evidence related to internal criteria—especially evidence that expert judgments are internally consistent. Given its importance, it is somewhat surprising that evidence of this kind has rarely been published in the context of the widely used bookmark standard-setting procedure. In this article we examined the effect of ordered item booklet difficulty on content experts' bookmark judgments. If panelists make internally consistent judgments, their resultant cut scores should be unaffected by the difficulty of their respective booklets. This internal consistency was not observed: the results suggest that substantial systematic differences in the resultant cut scores can arise when the difficulty of the ordered item booklets varies. These findings raise questions about the ability of content experts to make the judgments required by the bookmark procedure.

18.
Rasch model analysis was applied to data from the 2016 Fujian provincial comprehensive quality-inspection English test for liberal-arts candidates. The results show that the observed data fit the Rasch model well and that the test was a high-quality instrument that discriminated well among candidates' ability levels. However, the responses to a few items fit the Rasch model less well, and the distribution of item difficulty leaves room for optimization. Both points merit reflection by the item-writing team; they can inform instructional adjustment and provide useful psychometric reference for the next item-writing cycle.
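Item-level fit in a Rasch analysis of this kind is usually summarized by outfit and infit mean squares computed from standardized residuals; a minimal sketch with abilities and difficulties assumed already estimated (simulated data, hypothetical flagging rule):

```python
import numpy as np

def rasch_fit(X, theta, delta):
    """Outfit and infit mean squares per item for a 0/1 response matrix."""
    P = 1 / (1 + np.exp(-(theta[:, None] - delta[None, :])))
    W = P * (1 - P)                    # model variance of each response
    z2 = (X - P) ** 2 / W              # squared standardized residuals
    outfit = z2.mean(axis=0)           # unweighted mean square
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)  # information-weighted
    return outfit, infit

# Simulated well-fitting data; flag items outside a common rule of thumb.
rng = np.random.default_rng(2)
theta = rng.normal(size=200); delta = np.linspace(-2, 2, 15)
P = 1 / (1 + np.exp(-(theta[:, None] - delta[None, :])))
X = (rng.random(P.shape) < P).astype(float)
outfit, infit = rasch_fit(X, theta, delta)
print("items with outfit outside (0.7, 1.3):",
      np.where((outfit < 0.7) | (outfit > 1.3))[0])
```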

19.
An IRT method for estimating conditional standard errors of measurement of scale scores is presented, where scale scores are nonlinear transformations of number-correct scores. The standard errors account for measurement error that is introduced due to rounding scale scores to integers. Procedures for estimating the average conditional standard error of measurement for scale scores and reliability of scale scores are also described. An illustration of the use of the methodology is presented, and the results from the IRT method are compared to the results from a previously developed method that is based on strong true-score theory.
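A minimal sketch of the computation the abstract describes: the Lord-Wingersky recursion gives the conditional number-correct distribution at a fixed theta; mapping raw scores through a (hypothetical) raw-to-scale transformation, rounding to integers, and taking the conditional standard deviation yields a conditional standard error that includes rounding error.

```python
import numpy as np

def lord_wingersky(p):
    """Distribution of the number-correct score given per-item
    success probabilities p (the Lord-Wingersky recursion)."""
    dist = np.array([1.0])
    for pi in p:
        dist = np.append(dist * (1 - pi), 0.0) + np.append(0.0, dist * pi)
    return dist

def csem_scale(theta, a, b, to_scale):
    """Conditional SEM of rounded scale scores at a fixed theta (2PL)."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    dist = lord_wingersky(p)
    raw = np.arange(len(dist))
    ss = np.round(to_scale(raw))        # rounding adds measurement error
    mean = (ss * dist).sum()
    return np.sqrt(((ss - mean) ** 2 * dist).sum())

a = np.full(20, 1.0); b = np.linspace(-2, 2, 20)
to_scale = lambda r: 100 + 5 * r        # hypothetical raw-to-scale map
print("CSEM at theta=0:", round(csem_scale(0.0, a, b, to_scale), 2))
```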

20.
At present, the oral and essay components of English examinations in the TV University system mostly take the form of performance-based language testing. Because performance testing introduces raters, scoring becomes more subjective, and controlling the effect of rater differences on candidates' scores is a key step in assuring scoring quality. After comparing three theories commonly used for rating quality control in performance testing, this paper focuses on the contribution of the many-facet Rasch model to improving rating quality and discusses how the model can be used in the TV University system to train raters for English performance tests, in order to control rating quality and improve test reliability.
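In the many-facet Rasch model the article recommends, the log-odds of success decompose into examinee ability minus task difficulty minus rater severity, so an estimated severity can be used to adjust for harsh or lenient raters. A minimal dichotomous sketch with hypothetical values:

```python
import numpy as np

def mfrm_prob(theta, delta, severity):
    """Many-facet Rasch model (dichotomous case): the log-odds of success
    are ability minus task difficulty minus rater severity."""
    return 1 / (1 + np.exp(-(theta - delta - severity)))

# The same examinee and task scored by a lenient and a severe rater:
theta, delta = 0.5, 0.0
for name, sev in [("lenient", -0.8), ("severe", 0.8)]:
    print(name, round(mfrm_prob(theta, delta, sev), 3))
```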
