Similar literature
20 similar articles found.
1.
The equating performance of two internal anchor test structures (miditests and minitests) is studied for four IRT equating methods using simulated data. Originally proposed by Sinharay and Holland, miditests are anchors that have the same mean difficulty as the overall test but less variance in item difficulties. Four popular IRT equating methods were tested, and both the means and SDs of the true ability of the group to be equated were varied. We evaluate equating accuracy marginally and conditional on true ability. Our results suggest miditests perform about as well as traditional minitests for most conditions. Findings are discussed in terms of comparability to the typical minitest design and the trade-off between accuracy and flexibility in test construction.

2.
Four equating methods (3PL true score equating, 3PL observed score equating, beta 4 true score equating, and beta 4 observed score equating) were compared using four equating criteria: first-order equity (FOE), second-order equity (SOE), conditional-mean-squared-error (CMSE) difference, and the equipercentile equating property. True score equating more closely achieved estimated FOE than observed score equating when the true score distribution was estimated using the psychometric model that was used in the equating. Observed score equating more closely achieved estimated SOE, estimated CMSE difference, and the equipercentile equating property than true score equating. Among the four equating methods, 3PL observed score equating most closely achieved estimated SOE and had the smallest estimated CMSE difference, and beta 4 observed score equating was the method that most closely met the equipercentile equating property.

3.
An item-preequating design and a random groups design were used to equate forms of the American College Testing (ACT) Assessment Mathematics Test. Equipercentile and 3-parameter logistic model item-response theory (IRT) procedures were used for both designs. Both pretest methods produced inadequate equating results, and the IRT item preequating method resulted in more equating error than had no equating been conducted. Although neither of the item preequating methods performed well, the results from the equipercentile preequating method were more consistent with those from the random groups method than were the results from the IRT item pretest method. Item context and position effects were likely responsible, at least in part, for the inadequate results for item preequating. Such effects need to be either controlled or modeled, and the design further researched before the item preequating design can be recommended for operational use.

4.
Applied Measurement in Education, 2013, 26(4): 383-407
The performance of the item response theory (IRT) true-score equating method is examined under conditions of test multidimensionality. It is argued that a primary concern in applying unidimensional equating methods when multidimensionality is present is the potential decrease in equity (Lord, 1980) attributable to the fact that examinees of different ability are expected to obtain the same test scores. In contrast to equating studies based on real test data, the use of simulation in equating research not only permits assessment of these effects but also enables investigation of hypothetical equating conditions in which multidimensionality can be suspected to be especially problematic for test equating. In this article, I investigate whether the IRT true-score equating method, which explicitly assumes the item response matrix is unidimensional, is more adversely affected by the presence of multidimensionality than 2 conventional equating methods (linear and equipercentile equating), using several recently proposed equity-based criteria (Thomasson, 1993). Results from 2 simulation studies suggest that the IRT method performs at least as well as the conventional methods when the correlation between dimensions is high (≥ 0.7) and may be only slightly inferior to the equipercentile method when the correlation is moderate to low (≤ 0.5).
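
As a rough illustration of what unidimensional IRT true-score equating computes, the sketch below finds the ability at which form X's test characteristic curve equals a given true score and then reads off form Y's curve at that ability. It is only a sketch under stated assumptions: a two-parameter logistic model without the 1.7 scaling constant, made-up item parameters, and a simple bisection search. None of these choices or function names come from the article.

```python
import math


def p2pl(theta, a, b):
    """2PL probability of a correct response (assumed model, no D constant)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p2pl(theta, a, b) for a, b in items)


def theta_for_true_score(score, items, lo=-8.0, hi=8.0, tol=1e-10):
    """Invert the TCC by bisection; the TCC is increasing in theta."""
    if not tcc(lo, items) < score < tcc(hi, items):
        raise ValueError("score is outside the range reachable on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def true_score_equate(score_x, items_x, items_y):
    """Form Y true score at the ability implied by a form X true score."""
    return tcc(theta_for_true_score(score_x, items_x), items_y)


# Hypothetical (a, b) item parameters, assumed to be on a common scale.
items_x = [(1.0, -1.0), (1.2, 0.0), (0.9, 0.5), (1.1, 1.0)]
items_y = [(a, b - 0.5) for a, b in items_x]  # an easier form

same_form = true_score_equate(2.0, items_x, items_x)
easier_form = true_score_equate(2.0, items_x, items_y)
```

With a model that includes a guessing parameter, true scores below the sum of the guessing parameters have no corresponding ability, so a fuller implementation would need to handle that range separately.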

5.
In order to equate tests under Item Response Theory (IRT), one must obtain the slope and intercept coefficients of the appropriate linear transformation. This article compares two methods for computing such equating coefficients: Loyd and Hoover (1980) and Stocking and Lord (1983). The former is based upon summary statistics of the test calibrations; the latter is based upon matching test characteristic curves by minimizing a quadratic loss function. Three types of equating situations (horizontal, vertical, and that inherent in IRT parameter recovery studies) were investigated. The results showed that the two computing procedures generally yielded similar equating coefficients in all three situations. In addition, two sets of SAT data were equated via the two procedures, and little difference in the obtained results was observed. Overall, the results suggest that the Loyd and Hoover procedure usually yields acceptable equating coefficients. The Stocking and Lord procedure improves upon the Loyd and Hoover values and appears to be less sensitive to atypical test characteristics. When the user has reason to suspect that the test calibrations may be associated with data sets that are typically troublesome to calibrate, the Stocking and Lord procedure is to be preferred.
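
The two families of procedures contrasted above can be sketched as follows. The summary-statistic side is shown with the textbook mean/sigma and mean/mean formulas, and the characteristic-curve side with a Stocking-Lord style quadratic loss evaluated on a grid of abilities. This is an illustrative sketch, not the article's computation: the 2PL model, the item parameters, the ability grid, and all function names are assumptions, and the abstract does not say which summary statistics are used.

```python
import math
from statistics import mean, pstdev


def p2pl(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def mean_sigma(b_from, b_to):
    """Slope A and intercept B from the means and SDs of the difficulties."""
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def mean_mean(a_from, a_to, b_from, b_to):
    """Slope from mean discriminations, intercept from mean difficulties."""
    A = mean(a_from) / mean(a_to)
    B = mean(b_to) - A * mean(b_from)
    return A, B


def stocking_lord_loss(A, B, items_from, items_to, thetas):
    """Mean squared gap between the two test characteristic curves
    after rescaling the 'from' items with a -> a / A, b -> A * b + B."""
    total = 0.0
    for t in thetas:
        tcc_to = sum(p2pl(t, a, b) for a, b in items_to)
        tcc_from = sum(p2pl(t, a / A, A * b + B) for a, b in items_from)
        total += (tcc_to - tcc_from) ** 2
    return total / len(thetas)


# Hypothetical common items; the target scale is an exact linear
# transformation of the source scale, so both methods should recover it.
items_from = [(1.0, -1.0), (0.8, 0.0), (1.5, 0.5), (1.2, 1.5)]
TRUE_A, TRUE_B = 1.2, -0.3
items_to = [(a / TRUE_A, TRUE_A * b + TRUE_B) for a, b in items_from]

a_from = [a for a, _ in items_from]
b_from = [b for _, b in items_from]
a_to = [a for a, _ in items_to]
b_to = [b for _, b in items_to]

ms = mean_sigma(b_from, b_to)
mm = mean_mean(a_from, a_to, b_from, b_to)

thetas = [x / 2.0 for x in range(-8, 9)]  # grid from -4 to 4
loss_at_ms = stocking_lord_loss(ms[0], ms[1], items_from, items_to, thetas)
loss_at_identity = stocking_lord_loss(1.0, 0.0, items_from, items_to, thetas)
```

In practice the Stocking-Lord coefficients are found by numerically minimizing this loss over A and B; the sketch only evaluates it, which is enough to show that the loss is essentially zero at the correct coefficients and positive at the identity transformation.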

6.
In this article, linear item response theory (IRT) observed-score equating is compared under a generalized kernel equating framework with Levine observed-score equating for nonequivalent groups with anchor test design. Interestingly, these two equating methods are closely related despite being based on different methodologies. Specifically, when using data from IRT models, linear IRT observed-score equating is virtually identical to Levine observed-score equating. This leads to the conclusion that poststratification equating based on true anchor scores can be viewed as the curvilinear Levine observed-score equating.

7.
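Test equating makes different forms of an examination comparable, thereby ensuring relative stability across tests. IRT-based score equating is a parameter transformation carried out after the parameters have been estimated, and the stability of the equating results is closely tied to the examinee sample size. Focusing on the reading subtest of the Hanyu Shuiping Kaoshi (HSK), this study used real data to simulate a common-group anchor test design, established a reference criterion for equating, and examined how changes in examinee sample size affect the stability of IRT score equating. The results show that the equating results of all schemes are fairly stable when the examinee sample size is about 2,000. When the sample size increases further, equating error rises rather than falls.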

8.
The purpose of this study was to assess the dimensionality of two forms of a large-scale standardized test separately for 3 ethnic groups of examinees and to investigate whether differences in their latent trait composites have any impact on unidimensional item response theory true-score equating functions. Specifically, separate equating functions for African American and Hispanic examinees were compared to those of a Caucasian group as well as the total test taker population. On both forms, a 2-dimensional model adequately accounted for the item responses of Caucasian and African American examinees, whereas a more complex model was required for the Hispanic subgroup. The differences between equating functions for the 3 ethnic groups and the total test taker population were small and tended to be located at the low end of the score scale.

9.
10.
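Simulation and empirical studies were used to examine how sample size, number of test booklets, and anchor item type affect the precision of item parameter equating in large-scale assessment. Both studies showed that: (1) under most conditions, equating precision for the parameters of dichotomously (0/1) scored items was better than for polytomously scored items, although the difference was less pronounced in the empirical study than in the simulation study; (2) increasing the sample size plays an important role in improving item parameter equating precision, whereas increasing the number of booklets has little effect; (3) for both discrimination and difficulty parameters, the combination of 3 booklets and 2,000 examinees already achieves good equating precision, and to improve precision further it suffices to increase the sample size per booklet to 3,000; for polytomous scoring with 5 booklets, 2,000 examinees per booklet is the most suitable combination.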

11.
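Based on the three most commonly used logistic models in IRT, this study examined the cross-sample consistency of equating. The object of study was a Chinese-language test, and the equating method was concurrent calibration. The results show that concurrent calibration under the two-parameter model yielded the best and most stable cross-sample consistency.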

12.
In observed-score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed-score equating. In this study, the asymptotic standard errors of observed-score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
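
The percentile-matching idea in the first sentence of the abstract above can be sketched from two score probability vectors. The version below is a simplification and not the kernel equating framework the study uses: discrete scores are made continuous by spreading each score's probability uniformly over half a point on either side, and the probability vectors and function names are hypothetical.

```python
import math


def percentile_rank(x, probs):
    """Cumulative proportion below x for integer scores 0..K, with each
    score's probability spread uniformly over [k - 0.5, k + 0.5]."""
    top = len(probs) - 1
    if x <= -0.5:
        return 0.0
    if x >= top + 0.5:
        return 1.0
    k = int(math.floor(x + 0.5))  # integer score whose interval holds x
    return sum(probs[:k]) + (x - (k - 0.5)) * probs[k]


def inverse_percentile_rank(p, probs):
    """Score at which the continuized distribution reaches proportion p."""
    cum = 0.0
    for k, pk in enumerate(probs):
        if pk > 0 and cum + pk >= p:
            return (k - 0.5) + (p - cum) / pk
        cum += pk
    return len(probs) - 0.5


def equipercentile(x, probs_x, probs_y):
    """Form Y score with the same percentile rank as x has on form X."""
    return inverse_percentile_rank(percentile_rank(x, probs_x), probs_y)


# Hypothetical score probabilities for scores 0..5; form Y is form X
# shifted up by one score point.
probs_x = [0.1, 0.2, 0.4, 0.2, 0.1, 0.0]
probs_y = [0.0, 0.1, 0.2, 0.4, 0.2, 0.1]

equated_self = equipercentile(2, probs_x, probs_x)
equated_shifted = equipercentile(2, probs_x, probs_y)
```

Equating a form to itself returns the original score, and equating to the shifted form moves the score up by one point, which is a quick sanity check on the two helper functions.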

13.
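Based on the two-parameter logistic model in item response theory, this study examined how examinee ability distribution and sample size affect equating under the common-item nonequivalent groups design. The equating methods, all applied under separate calibration, were the item characteristic curve method, the Stocking-Lord method, and the Haebara method. Equating results were evaluated with three measures: the standard error of equated scores, the standard error of the equating coefficients, and the stability of the common-item parameters. The results show that the closer the examinee ability distributions and the larger the sample size, the smaller the equating error; in addition, the Stocking-Lord method gave more stable equating results than the Haebara method.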

14.
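An Example of Applying Item Response Theory to Equate a Test Containing Multiple Item Types
Using a statewide high school examination in one U.S. state as an example, this article describes the specific procedures for applying item response theory to equate a test containing multiple item types, and also discusses some other technical aspects of the test.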

15.
The 1986 scores from Florida's Statewide Student Assessment Test, Part II (SSAT-II), a minimum-competency test required for high school graduation in Florida, were placed on the scale of the 1984 scores from that test using five different equating procedures. For the highest scoring 84% of the students, four of the five methods yielded results within 1.5 raw-score points of each other. They would be essentially equally satisfactory in this situation, in which the tests were made parallel item by item in difficulty and content and the groups of examinees were population cohorts separated by only 2 years. Also, the results from six different lengths of anchor items were compared. Anchors of 25, 20, 15, or 10 randomly selected items provided equatings as effective as 30 items using the concurrent IRT equating method, but an anchor of 5 randomly selected items did not.

16.
This study applied kernel equating (KE) in two scenarios: equating to a very similar population and equating to a very different population, referred to as a distant population, using SAT® data. The KE results were compared to the results obtained from analogous traditional equating methods in both scenarios. The results indicate that KE results are comparable to the results of other methods. Further, the results show that when the two populations taking the two tests are similar on the anchor score distributions, different equating methods yield the same or very similar results, even though they have different assumptions.

17.
Practical considerations in conducting an equating study often require a trade-off between testing time and sample size. A counterbalanced design (Angoff's Design II) is often selected because, as each examinee is administered both test forms and therefore the errors are correlated, sample sizes can be dramatically reduced over those required by a spiraling design (Angoff's Design I), where each examinee is administered only one test form. However, the counterbalanced design may be subject to fatigue, practice, or context effects. This article investigated these two data collection designs (for a given sample size) with equipercentile and IRT equating methodology in the vertical equating of two mathematics achievement tests. Both designs and both methodologies were judged to adequately meet an equivalent expected score criterion; Design II was found to exhibit more stability over different samples.

18.
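This article discusses several commonly used item parameter estimation methods based on item response theory (IRT), analyzes the strengths, weaknesses, and areas of application of each, and provides a theoretical reference for building an IRT-based item bank system.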

19.
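Equating is of great importance to testing, yet most examinations in China have not implemented it. Among the few examinations that have been equated, most equating is limited to dichotomously scored items, and research on equating polytomously scored items is rare. Focusing on a large-scale domestic language examination that contains polytomously scored items, this study examined the performance of six equating methods under the graded response model: concurrent calibration, fixed common item parameter calibration, and four linking methods for separate calibration (mean/sigma, mean/mean, Haebara, and Stocking-Lord), in order to select the equating method best suited to the examination.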

20.
Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Overall, chained mean equating produced the most accurate results for low scores, but the two circle-arc methods produced the most accurate results, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.
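
Two of the five methods named above, chained linear and chained mean equating, along with the root-mean-squared difference used as the accuracy measure, can be sketched in a few lines. This is a minimal sketch using the standard textbook definitions, not the study's code: the tiny data sets, the use of population standard deviations, and the function names are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def chained_linear(x, new_group, ref_group):
    """Link X to the anchor in the new-form group, then the anchor to Y
    in the reference-form group, matching means and SDs at each step.
    Each group is a list of (form_score, anchor_score) pairs."""
    xs, anchor_new = zip(*new_group)
    ys, anchor_ref = zip(*ref_group)
    v = mean(anchor_new) + pstdev(anchor_new) / pstdev(xs) * (x - mean(xs))
    return mean(ys) + pstdev(ys) / pstdev(anchor_ref) * (v - mean(anchor_ref))


def chained_mean(x, new_group, ref_group):
    """Same chain as above, but matching means only (both slopes are 1)."""
    xs, anchor_new = zip(*new_group)
    ys, anchor_ref = zip(*ref_group)
    v = x - mean(xs) + mean(anchor_new)
    return v - mean(anchor_ref) + mean(ys)


def rmsd(estimates, criterion):
    """Root-mean-squared difference of replicated equated scores from
    the criterion equated score at one score point."""
    return math.sqrt(mean([(e - criterion) ** 2 for e in estimates]))


# Hypothetical (form score, anchor score) pairs for each group.
new_group = [(10, 4), (14, 6), (18, 8), (22, 10)]
ref_group = [(12, 5), (15, 6), (18, 7), (21, 8)]

linear_20 = chained_linear(20, new_group, ref_group)
mean_20 = chained_mean(20, new_group, ref_group)
example_rmsd = rmsd([23.0, 25.0], 24.0)
```

In a resampling study of the kind described, `rmsd` would be applied at each score point to the equated scores from all replications, with the criterion taken from a large-sample equating.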
