1.
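Based on the two-parameter logistic model of item response theory, this study examines how examinee ability distributions and sample size affect equating under the common-item nonequivalent groups design. Equating was carried out under separate calibration using the item characteristic curve approach, with the Stocking-Lord and Haebara methods. The equating results were evaluated on three criteria: the standard error of equated scores, the standard error of the equating coefficients, and the stability of the common-item parameters. The results show that the closer the examinee ability distributions and the larger the sample, the smaller the equating error; in addition, the Stocking-Lord method gave more stable equating results than the Haebara method.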
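The two characteristic curve criteria compared in this abstract can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes the 2PL model on the logistic metric (scaling constant D = 1), a fixed theta grid with equal weights, and hypothetical item parameters. The names `p2pl`, `transform`, `haebara_loss`, and `stocking_lord_loss` are introduced here for illustration only.

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response (logistic metric, D = 1 assumed)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def transform(items_x, A, B):
    """Put form-X item parameters on the form-Y scale: a / A and A * b + B."""
    return [(a / A, A * b + B) for a, b in items_x]

def haebara_loss(items_x, items_y, A, B, grid):
    """Sum of squared differences between item characteristic curves."""
    items_t = transform(items_x, A, B)
    return sum(
        (p2pl(t, ay, by) - p2pl(t, at, bt)) ** 2
        for t in grid
        for (ay, by), (at, bt) in zip(items_y, items_t)
    )

def stocking_lord_loss(items_x, items_y, A, B, grid):
    """Sum of squared differences between test characteristic curves."""
    items_t = transform(items_x, A, B)
    total = 0.0
    for t in grid:
        tcc_y = sum(p2pl(t, a, b) for a, b in items_y)
        tcc_t = sum(p2pl(t, a, b) for a, b in items_t)
        total += (tcc_y - tcc_t) ** 2
    return total

# Hypothetical common items on the Y scale, and the same items expressed
# on an X scale that differs by a known linear transformation.
items_y = [(1.0, -0.5), (1.4, 0.2), (0.8, 1.0)]
TRUE_A, TRUE_B = 1.2, 0.5
items_x = [(a * TRUE_A, (b - TRUE_B) / TRUE_A) for a, b in items_y]
grid = [g / 2.0 for g in range(-8, 9)]  # theta from -4 to 4 in steps of 0.5
```

In practice both losses would be minimized over A and B with a numerical optimizer. The Haebara criterion matches curves item by item, while the Stocking-Lord criterion matches only their sum, which is one way to think about why the two can differ in stability.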
2.
Four equating methods (3PL true score equating, 3PL observed score equating, beta 4 true score equating, and beta 4 observed score equating) were compared using four equating criteria: first-order equity (FOE), second-order equity (SOE), conditional-mean-squared-error (CMSE) difference, and the equipercentile equating property. True score equating more closely achieved estimated FOE than observed score equating when the true score distribution was estimated using the psychometric model that was used in the equating. Observed score equating more closely achieved estimated SOE, estimated CMSE difference, and the equipercentile equating property than true score equating. Among the four equating methods, 3PL observed score equating most closely achieved estimated SOE and had the smallest estimated CMSE difference, and beta 4 observed score equating was the method that most closely met the equipercentile equating property.
3.
《中国考试》2017,(9)
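Simulation and empirical studies were used to examine how sample size, number of test booklets, and anchor item type affect the precision of item parameter equating in large-scale assessments. Both studies showed that: (1) under most conditions, the parameters of dichotomously (0/1) scored items were equated more precisely than those of polytomously scored items, although the difference was less pronounced in the empirical study than in the simulation study; (2) increasing the sample size plays an important role in improving the precision of item parameter equating, whereas increasing the number of booklets has very little effect; (3) for both discrimination and difficulty parameters, the combination of 3 booklets and 2,000 examinees already achieved good equating precision, and further improvement requires only raising the sample size of each booklet to 3,000; with polytomous scoring and 5 booklets, 2,000 examinees per booklet was the most suitable combination.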
4.
5.
Effects of test length and sample size on the simulated recovery of parameter estimates under IRT
"Basic Education Teaching Quality Monitoring System" Project Group 《中国考试》2009,(6)
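When building an item bank under item response theory, how many examinees and how many items does a paper-and-pencil test need for the item parameters to be estimated with reasonable accuracy? This paper addresses the question by using Monte Carlo simulation of testing situations, analyzing how changes in test length and sample size affect the simulated recovery of the a and b parameter estimates. (1) With respect to sample size: in simulated tests with dichotomously and polytomously scored items, the mean squared error recovery index of the item parameter estimates decreases steadily as the sample size increases. (2) With respect to test length: for dichotomously scored items, the error curve has an inflection point at about 25 items; below 25 items, RMSE falls substantially as items are added, whereas above 25 items adding more items reduces RMSE only slightly. For polytomously scored items, the error curve has an inflection point at about 15 items; below 15 items, RMSE decreases gradually as items are added, and above 15 items RMSE gradually increases as items are added.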
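The recovery index discussed in this abstract can be illustrated with a short sketch. It assumes a 2PL model on the logistic metric for generating dichotomous responses, and it treats the parameter estimates as given: the calibration step itself is not shown, and the values in `estimated_b` are hypothetical stand-ins for the output of whatever estimation program is used. All function and variable names are introduced here for illustration.

```python
import math
import random

def p2pl(theta, a, b):
    """2PL probability of a correct response (logistic metric assumed)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_responses(thetas, items, rng):
    """Draw a 0/1 response matrix (examinees x items) under the 2PL model."""
    return [
        [1 if rng.random() < p2pl(t, a, b) else 0 for a, b in items]
        for t in thetas
    ]

def rmse(true_values, estimates_by_rep):
    """Root mean squared error pooled over replications and items."""
    squared = [
        (est - true) ** 2
        for estimates in estimates_by_rep
        for est, true in zip(estimates, true_values)
    ]
    return math.sqrt(sum(squared) / len(squared))

rng = random.Random(0)
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]       # hypothetical (a, b) pairs
thetas = [rng.gauss(0.0, 1.0) for _ in range(200)]  # standard normal abilities
data = simulate_responses(thetas, items, rng)

# Hypothetical b estimates from two replications (calibration not shown).
true_b = [b for _, b in items]
estimated_b = [[-0.9, 0.1, 1.2], [-1.1, -0.1, 0.8]]
recovery = rmse(true_b, estimated_b)
```

A full study would repeat the simulate-then-calibrate cycle for each combination of test length and sample size and plot the resulting RMSE values against those factors.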
6.
An example of applying item response theory to equate a test containing multiple item types
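Using a statewide high school examination in one U.S. state as an example, this paper describes the specific procedures for applying item response theory to equate a test containing multiple item types, and also discusses some of the test's other technical aspects.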
7.
An item-preequating design and a random groups design were used to equate forms of the American College Testing (ACT) Assessment Mathematics Test. Equipercentile and 3-parameter logistic model item-response theory (IRT) procedures were used for both designs. Both pretest methods produced inadequate equating results, and the IRT item preequating method resulted in more equating error than had no equating been conducted. Although neither of the item preequating methods performed well, the results from the equipercentile preequating method were more consistent with those from the random groups method than were the results from the IRT item pretest method. Item context and position effects were likely responsible, at least in part, for the inadequate results for item preequating. Such effects need to be either controlled or modeled, and the design further researched before the item preequating design can be recommended for operational use.
8.
Deborah J. Harris 《Journal of Educational Measurement》1991,28(3):221-235
Practical considerations in conducting an equating study often require a trade-off between testing time and sample size. A counterbalanced design (Angoff's Design II) is often selected because, as each examinee is administered both test forms and therefore the errors are correlated, sample sizes can be dramatically reduced over those required by a spiraling design (Angoff's Design I), where each examinee is administered only one test form. However, the counterbalanced design may be subject to fatigue, practice, or context effects. This article investigated these two data collection designs (for a given sample size) with equipercentile and IRT equating methodology in the vertical equating of two mathematics achievement tests. Both designs and both methodologies were judged to adequately meet an equivalent expected score criterion; Design II was found to exhibit more stability over different samples.
9.
Marie‐Anne Mittelhaëuser Anton A. Béguin Klaas Sijtsma 《Journal of Educational Measurement》2015,52(3):339-358
The purpose of this study was to investigate whether simulated differential motivation between the stakes for operational tests and anchor items produces an invalid linking result if the Rasch model is used to link the operational tests. This was done for an external anchor design and a variation of a pretest design. The study also investigated whether a constrained mixture Rasch model could identify latent classes in such a way that one latent class represented high‐stakes responding while the other represented low‐stakes responding. The results indicated that for an external anchor design, the Rasch linking result was only biased when the motivation level differed between the subpopulations to which the anchor items were administered. However, the mixture Rasch model did not identify the classes representing low‐stakes and high‐stakes responding. When a pretest design was used to link the operational tests by means of a Rasch model, the linking result was found to be biased in each condition. Bias increased as percentage of students showing low‐stakes responding to the anchor items increased. The mixture Rasch model only identified the classes representing low‐stakes and high‐stakes responding under a limited number of conditions.
10.
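This paper uses R-2.15.2 to study, by simulation, how the variance of the anchor test's difficulty parameters affects test equating error, comparing three equating methods (chained equipercentile, Levine, and Tucker) across anchor tests with different types of difficulty variance. The results show that when the anchor test's difficulty variance is smaller than that of the total test, the random and systematic equating errors are essentially the same as when the two variances are equal (that is, when the anchor test is a parallel, shortened version of the total test, a minitest). Requiring the anchor test to have the same statistical specifications as the total test may therefore be too strict.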
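Of the three methods named in this abstract, Tucker linear equating is the simplest to write down. The sketch below follows the standard textbook formulation with a synthetic population; it is not taken from the paper, and the moment values, the dictionary keys, and the function name `tucker_linear` are all hypothetical.

```python
import math

def tucker_linear(x, g1, g2, w1=0.5):
    """Equate a form-X score x to the form-Y scale (Tucker linear method).

    g1 holds moments for the group that took form X and the anchor V;
    g2 holds moments for the group that took form Y and the anchor V.
    w1 is the synthetic population weight for group 1 (0.5 is an assumption).
    """
    w2 = 1.0 - w1
    gamma1 = g1["cov"] / g1["var_v"]  # regression slope of X on V in group 1
    gamma2 = g2["cov"] / g2["var_v"]  # regression slope of Y on V in group 2
    d_mean = g1["mean_v"] - g2["mean_v"]
    d_var = g1["var_v"] - g2["var_v"]

    # Synthetic population means and variances for forms X and Y.
    mu_x = g1["mean"] - w2 * gamma1 * d_mean
    mu_y = g2["mean"] + w1 * gamma2 * d_mean
    var_x = g1["var"] - w2 * gamma1**2 * d_var + w1 * w2 * gamma1**2 * d_mean**2
    var_y = g2["var"] + w1 * gamma2**2 * d_var + w1 * w2 * gamma2**2 * d_mean**2

    return math.sqrt(var_y / var_x) * (x - mu_x) + mu_y

# Hypothetical moments: the two groups perform identically on the anchor.
g1 = {"mean": 50.0, "var": 100.0, "mean_v": 10.0, "var_v": 16.0, "cov": 30.0}
g2 = {"mean": 52.0, "var": 81.0, "mean_v": 10.0, "var_v": 16.0, "cov": 27.0}
equated = tucker_linear(60.0, g1, g2)

# Same forms, but group 2 is stronger on the anchor.
g2_stronger = dict(g2, mean_v=12.0)
equated_adjusted = tucker_linear(60.0, g1, g2_stronger)
```

When the anchor moments match, the method reduces to ordinary linear equating of means and standard deviations. When group 2 scores higher on the anchor, part of its higher form-Y mean is attributed to ability rather than form easiness, so the equated score drops.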
11.
This article suggests a method for estimating a test-score equating relationship from small samples of test takers. The method does not require the estimated equating transformation to be linear. Instead, it constrains the estimated equating curve to pass through two pre-specified end points and a middle point determined from the data. In a resampling study with two test forms that differed substantially in difficulty, the proposed method compared favorably with other equating methods, especially for equating scores below the 10th percentile and above the 90th percentile.
12.
13.
In this study, we compared 12 statistical strategies proposed for selecting loglinear models for smoothing univariate test score distributions and for enhancing the stability of equipercentile equating functions. The major focus was on evaluating the effects of the selection strategies on equating function accuracy. Selection strategies' influence on the estimation of cumulative test score distributions was also assessed. The results of this simulation study differentiate the selection strategies and define the situations where their use has the most important implications for equating function accuracy. The recommended strategy for estimating test score distributions and for equating is AIC minimization.
14.
New thoughts on the equating design of the Chinese Proficiency Test (HSK)
ZHANG Jinjun JING Libo 《中国考试》2008,(8)
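The Chinese Proficiency Test (HSK) has maintained equating throughout its many years of operation. In practice, HSK equating has run into new circumstances that exposed limitations in the old equating design and made it hard to adapt. This paper proposes targeted new designs, including pretest equating and cross-national equating, to address the new problems.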
15.
赵云芬 《荆门职业技术学院学报》2002,17(5):26-29
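The judicial examination system is an important part of a country's judicial system. To best achieve the goal of selecting highly qualified legal professionals through the judicial examination, countries under the rule of law that have such a system almost without exception specify the eligibility requirements for candidates explicitly in law. China's eligibility requirements are likewise set out in various laws and regulations, and cover candidates' age, moral character, and educational qualifications, among other matters. This paper analyzes the shortcomings of the moral character and educational requirements and proposes ideas for improving the eligibility requirements for China's judicial examination.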
16.
In this study, eight statistical strategies were evaluated for selecting the parameterizations of loglinear models for smoothing the bivariate test score distributions used in nonequivalent groups with anchor test (NEAT) equating. Four of the strategies were based on significance tests of chi-square statistics (Likelihood Ratio, Pearson, Freeman-Tukey, and Cressie-Read) and four additional strategies were based on different evaluations of the Likelihood Ratio Chi-Square statistic (Akaike Information Criterion, Bayesian Information Criterion, Consistent Akaike Information Criterion, and an index traced to Goodman). The focus was the implications of the selection strategies' selection tendencies for the accuracy of chained and poststratification equating functions. The results differentiated the strategies in terms of their tendencies to select models with particular bivariate parameterizations and the implications of these tendencies for equating bias and variability.
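Three of the information criteria named in this abstract can be sketched as penalized versions of the likelihood ratio chi-square statistic. The sketch assumes the common forms AIC = G² + 2p, BIC = G² + p ln N, and CAIC = G² + p(1 + ln N), where p is the number of fitted parameters and N the sample size. The candidate models and their G² values are hypothetical, and the loglinear fitting itself is not shown.

```python
import math

def select_model(candidates, n, criterion="aic"):
    """Return the label of the candidate with the smallest penalized G^2.

    candidates: list of (label, g2, n_params) tuples from already-fitted
    loglinear models; n is the sample size.
    """
    penalty_per_param = {
        "aic": 2.0,
        "bic": math.log(n),
        "caic": math.log(n) + 1.0,
    }[criterion]
    return min(candidates, key=lambda c: c[1] + penalty_per_param * c[2])[0]

# Hypothetical fits: (label, likelihood ratio chi-square, parameter count).
candidates = [("deg2", 90.0, 2), ("deg3", 30.0, 3), ("deg4", 25.0, 4)]
by_aic = select_model(candidates, n=1000, criterion="aic")
by_bic = select_model(candidates, n=1000, criterion="bic")
by_caic = select_model(candidates, n=1000, criterion="caic")
```

With these illustrative numbers, AIC picks the four-parameter model while BIC and CAIC pick the three-parameter one, since their per-parameter penalty grows with sample size. This is the kind of selection tendency the abstract refers to.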
17.
This study investigated possible explanations for an observed change in Rasch item parameters (b values) obtained from consecutive administrations of a professional licensure examination. Considered in this investigation were variables related to item position, item type, item content, and elapsed time between administrations of the item. An analysis of covariance methodology was used to assess the relations between these variables and change in item b values, with the elapsed time index serving to control for differences that could be attributed to average or pool changes in b values over time. A series of analysis of covariance models were fitted to the data in an attempt to identify item characteristics that were significantly related to the change in b values after the time elapsed between item administrations had been controlled. The findings indicated that the change in item b values was not related either to item position or to item type. A small, positive relationship between this change and elapsed time indicated that the pool b values were increasing over time. A test of simple effects suggested the presence of greater change for one of the content categories analyzed. These findings are interpreted, and suggestions for future research are provided.
18.
The goal of this study was the development of a procedure to predict the equating error associated with the long-term equating method of Tate (2003) for mixed-format tests. An expression for the determination of the error of an equating based on multiple links using the error for the component links was derived and illustrated with simulated data. Expressions relating the equating error for single equating links to relevant factors like the equating design and the history of the examinee population ability distribution were determined based on computer simulation. Use of the resulting procedure for the selection of a long-term equating design was illustrated.
19.
The standardization approach to assessing differential item functioning (DIF), including standardized distractor analysis, is described. The results of studies conducted on Asian Americans, Hispanics (Mexican Americans and Puerto Ricans), and Blacks on the Scholastic Aptitude Test (SAT) are described and then synthesized across studies. Where the groups were limited to include only examinees who spoke English as their best language, very few items across forms and ethnic groups exhibited large DIF. Major findings include evidence of differential speededness (where minority examinees did not complete SAT-Verbal sections at the same rate as White students with comparable SAT-Verbal scores) for Blacks and Hispanics and, when the item content is of special interest, advantages for the relevant ethnic group. In addition, homographs tend to disadvantage all three ethnic groups, but the effect of vertical relationships in analogy items is not as consistent. Although these findings are important in understanding DIF, they do not seem to account for all differences. Other variables related to DIF still need to be identified. Furthermore, these findings are seen as tentative until corroborated by studies using controlled data collection designs.
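The core index of the standardization approach can be sketched in a few lines. The sketch assumes the commonly used form in which the difference in proportion correct between focal and reference groups is computed at each matched score level and averaged with focal-group counts as weights. The data and the function name `std_p_dif` are hypothetical and not taken from the studies summarized above.

```python
def std_p_dif(levels):
    """Standardized difference in proportion correct for one item.

    levels: list of (n_focal, p_focal, p_reference) tuples, one per matched
    total-score level. Focal-group counts serve as weights (an assumption;
    other weighting schemes are possible).
    """
    total_weight = sum(n for n, _, _ in levels)
    weighted_diff = sum(n * (pf - pr) for n, pf, pr in levels)
    return weighted_diff / total_weight

# Hypothetical item data at three matched score levels.
levels = [
    (50, 0.30, 0.40),
    (100, 0.55, 0.60),
    (50, 0.80, 0.80),
]
value = std_p_dif(levels)  # negative values indicate the item disfavors the focal group
```

Because examinees are matched on total score before the comparison, the index reflects item-level differences rather than overall group differences in ability.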