Similar Documents
A total of 20 similar documents were retrieved.
1.
In order to equate tests under Item Response Theory (IRT), one must obtain the slope and intercept coefficients of the appropriate linear transformation. This article compares two methods for computing such equating coefficients: that of Loyd and Hoover (1980) and that of Stocking and Lord (1983). The former is based upon summary statistics of the test calibrations; the latter is based upon matching test characteristic curves by minimizing a quadratic loss function. Three types of equating situations were investigated: horizontal, vertical, and the equating inherent in IRT parameter recovery studies. The results showed that the two computing procedures generally yielded similar equating coefficients in all three situations. In addition, two sets of SAT data were equated via the two procedures, and little difference in the obtained results was observed. Overall, the results suggest that the Loyd and Hoover procedure usually yields acceptable equating coefficients. The Stocking and Lord procedure improves upon the Loyd and Hoover values and appears to be less sensitive to atypical test characteristics. When the user has reason to suspect that the test calibrations may be associated with data sets that are typically troublesome to calibrate, the Stocking and Lord procedure is to be preferred.
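For readers unfamiliar with the two procedures compared above, the following Python sketch illustrates the contrast under stated assumptions: the mean/sigma coefficients of Loyd and Hoover are computed from summary statistics of anchor-item difficulties, while the Stocking and Lord coefficients minimize the squared distance between anchor test characteristic curves. The 2PL parameter values, the ability grid, and all function names are illustrative assumptions, not material from the article.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Hypothetical 2PL anchor-item parameters on the new-form (X) and old-form (Y) scales.
a_x = np.array([1.1, 0.8, 1.4, 0.9]); b_x = np.array([-0.5, 0.2, 0.9, -1.2])
a_y = np.array([1.0, 0.9, 1.3, 1.0]); b_y = np.array([-0.3, 0.4, 1.1, -1.0])

# Mean/sigma (Loyd & Hoover): slope and intercept from summary statistics of b.
A_ms = b_y.std(ddof=1) / b_x.std(ddof=1)
B_ms = b_y.mean() - A_ms * b_x.mean()

# Stocking & Lord: choose (A, B) to minimize a quadratic loss between the anchor
# test characteristic curves, evaluated on a grid of ability points.
theta = np.linspace(-4, 4, 41)

def tcc(a, b, th):
    # 2PL test characteristic curve: expected anchor score at each theta.
    return expit(a * (th[:, None] - b)).sum(axis=1)

def sl_loss(coef):
    A, B = coef
    return np.sum((tcc(a_y, b_y, theta) - tcc(a_x / A, A * b_x + B, theta)) ** 2)

A_sl, B_sl = minimize(sl_loss, x0=[A_ms, B_ms]).x
# Either pair (A, B) rescales the new-form calibration: b* = A*b + B, a* = a/A.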

2.
IRT Equating Methods
The purpose of this instructional module is to provide the basis for understanding the process of score equating through the use of item response theory (IRT). A context is provided for addressing the merits of IRT equating methods. The mechanics of IRT equating and the need to place parameter estimates from separate calibration runs on the same scale are discussed. Some procedures for placing parameter estimates on a common scale are presented. In addition, IRT true-score equating is discussed in some detail. A discussion of the practical advantages derived from IRT equating is offered at the end of the module.
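As a point of reference for the scale-linking step described in this module, the linear transformation of the ability metric and its effect on item parameters can be written as follows (standard 3PL notation, reconstructed here rather than quoted from the module):

\theta_Y = A\,\theta_X + B, \qquad a_{Yj} = \frac{a_{Xj}}{A}, \qquad b_{Yj} = A\,b_{Xj} + B, \qquad c_{Yj} = c_{Xj}.

IRT true-score equating then maps a Form X true score \tau_X to the ability \theta^* solving \tau_X = \sum_{j \in X} P_j(\theta^*) and reports its Form Y equivalent \tau_Y = \sum_{k \in Y} P_k(\theta^*).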

3.
The equating performance of two internal anchor test structures—miditests and minitests—is studied for four IRT equating methods using simulated data. Originally proposed by Sinharay and Holland, miditests are anchors that have the same mean difficulty as the overall test but less variance in item difficulties. Four popular IRT equating methods were tested, and both the means and SDs of the true ability of the group to be equated were varied. We evaluate equating accuracy marginally and conditional on true ability. Our results suggest miditests perform about as well as traditional minitests for most conditions. Findings are discussed in terms of comparability to the typical minitest design and the trade‐off between accuracy and flexibility in test construction.

4.
In this article, linear item response theory (IRT) observed‐score equating is compared under a generalized kernel equating framework with Levine observed‐score equating for nonequivalent groups with anchor test design. Interestingly, these two equating methods are closely related despite being based on different methodologies. Specifically, when using data from IRT models, linear IRT observed‐score equating is virtually identical to Levine observed‐score equating. This leads to the conclusion that poststratification equating based on true anchor scores can be viewed as the curvilinear Levine observed‐score equating.

5.
An item-preequating design and a random groups design were used to equate forms of the American College Testing (ACT) Assessment Mathematics Test. Equipercentile and 3-parameter logistic model item-response theory (IRT) procedures were used for both designs. Both pretest methods produced inadequate equating results, and the IRT item preequating method resulted in more equating error than had no equating been conducted. Although neither of the item preequating methods performed well, the results from the equipercentile preequating method were more consistent with those from the random groups method than were the results from the IRT item pretest method. Item context and position effects were likely responsible, at least in part, for the inadequate results for item preequating. Such effects need to be either controlled or modeled, and the design further researched before the item preequating design can be recommended for operational use.

6.
Practical considerations in conducting an equating study often require a trade-off between testing time and sample size. A counterbalanced design (Angoff's Design II) is often selected because, as each examinee is administered both test forms and therefore the errors are correlated, sample sizes can be dramatically reduced over those required by a spiraling design (Angoff's Design I), where each examinee is administered only one test form. However, the counterbalanced design may be subject to fatigue, practice, or context effects. This article investigated these two data collection designs (for a given sample size) with equipercentile and IRT equating methodology in the vertical equating of two mathematics achievement tests. Both designs and both methodologies were judged to adequately meet an equivalent expected score criterion; Design II was found to exhibit more stability over different samples.

7.
Applied Measurement in Education, 2013, 26(4): 383-407
The performance of the item response theory (IRT) true-score equating method is examined under conditions of test multidimensionality. It is argued that a primary concern in applying unidimensional equating methods when multidimensionality is present is the potential decrease in equity (Lord, 1980) attributable to the fact that examinees of different ability are expected to obtain the same test scores. In contrast to equating studies based on real test data, the use of simulation in equating research not only permits assessment of these effects but also enables investigation of hypothetical equating conditions in which multidimensionality can be suspected to be especially problematic for test equating. In this article, I investigate whether the IRT true-score equating method, which explicitly assumes the item response matrix is unidimensional, is more adversely affected by the presence of multidimensionality than 2 conventional equating methods, linear and equipercentile equating, using several recently proposed equity-based criteria (Thomasson, 1993). Results from 2 simulation studies suggest that the IRT method performs at least as well as the conventional methods when the correlation between dimensions is high (≥ 0.7) and may be only slightly inferior to the equipercentile method when the correlation is moderate to low (≤ 0.5).

8.
Simulation and empirical studies were conducted to examine how sample size, the number of test booklets, and the item type of the anchor items affect the accuracy of item parameter equating in large-scale assessment. Both the simulation and the empirical results show that: (1) the equating accuracy of the parameters of dichotomously (0/1) scored items is better than that of polytomously scored items under most conditions, although the differences are less pronounced in the empirical study than in the simulation; (2) increasing the sample size plays an important role in improving the accuracy of item parameter equating, whereas increasing the number of booklets contributes little; (3) for both the discrimination and the difficulty parameters, a combination of 3 booklets and 2,000 examinees already yields good equating accuracy, and accuracy can be improved further simply by raising the sample size per booklet to 3,000; for polytomous scoring with 5 booklets, 2,000 examinees per booklet is the most suitable combination.

9.
Based on the two-parameter logistic model of item response theory, this study examines the effects of examinee ability distribution and sample size on equating under the common-item nonequivalent groups design. The equating methods were characteristic-curve methods under separate calibration, namely the Stocking-Lord method and the Haebara method. The equating results were evaluated with three criteria: the standard error of the equated scores, the standard error of the equating coefficients, and the stability of the common-item parameters. The results show that the closer the examinee ability distributions and the larger the sample size, the smaller the equating error, and that the Stocking-Lord method yields more stable equating results than the Haebara method.

10.
Various applications of item response theory often require linking to achieve a common scale for item parameter estimates obtained from different groups. This article used a simulation to examine the relative performance of four different item response theory (IRT) linking procedures in a random groups equating design: concurrent calibration with multiple groups, separate calibration with the Stocking-Lord method, separate calibration with the Haebara method, and proficiency transformation. The simulation conditions used in this article included three sampling designs, two levels of sample size, and two levels of the number of items. In general, the separate calibration procedures performed better than the concurrent calibration and proficiency transformation procedures, even though some inconsistent results were observed across different simulation conditions. Some advantages and disadvantages of the linking procedures are discussed.
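For readers comparing the two separate-calibration procedures above, their loss functions differ mainly in where the squaring occurs (standard characteristic-curve formulations; the quadrature points \theta_q and weights w_q are implementation details assumed here, not reported in the abstract). Haebara sums squared item-level differences, whereas Stocking-Lord squares the difference between the summed (test) characteristic curves:

H(A, B) = \sum_q w_q \sum_j \left[ P_j\!\left(\theta_q; \hat{a}_{Yj}, \hat{b}_{Yj}\right) - P_j\!\left(\theta_q; \hat{a}_{Xj}/A, \, A\hat{b}_{Xj} + B\right) \right]^2,

SL(A, B) = \sum_q w_q \left[ \sum_j P_j\!\left(\theta_q; \hat{a}_{Yj}, \hat{b}_{Yj}\right) - \sum_j P_j\!\left(\theta_q; \hat{a}_{Xj}/A, \, A\hat{b}_{Xj} + B\right) \right]^2.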

11.
Test equating makes scores from different test forms comparable and thereby keeps the forms relatively stable with respect to one another. IRT-based score equating is a parameter transformation carried out after the parameters have been estimated, so the stability of the equating results is closely tied to the examinee sample size. Focusing on the reading subtest of the Chinese Proficiency Test (HSK), this study used real data to simulate a common-group anchor-test design, established a reference criterion for equating, and examined how changes in examinee sample size affect the stability of IRT score equating. The results show that the equating results of the various schemes are fairly stable when the sample size is around 2,000; when the sample size increases further, the equating error rises rather than falls.

12.
The purpose of this study was to assess the dimensionality of two forms of a large-scale standardized test separately for 3 ethnic groups of examinees and to investigate whether differences in their latent trait composites have any impact on unidimensional item response theory true-score equating functions. Specifically, separate equating functions for African American and Hispanic examinees were compared to those of a Caucasian group as well as the total test taker population. On both forms, a 2-dimensional model adequately accounted for the item responses of Caucasian and African American examinees, whereas a more complex model was required for the Hispanic subgroup. The differences between equating functions for the 3 ethnic groups and the total test taker population were small and tended to be located at the low end of the score scale.

13.
Based on the three most commonly used logistic models in IRT, this study examines the cross-sample consistency of equating for a Chinese-language test, using concurrent calibration as the equating method. The results show that concurrent calibration under the two-parameter model yields the best and most stable cross-sample consistency.

14.
In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
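The score probabilities referred to above can be obtained from polytomous IRT item parameters with the usual recursive algorithm. The Python sketch below assumes GPCM items and a crude normal quadrature; the parameter values and function names are hypothetical and only meant to show the computation, not the article's derivation of the standard errors.

import numpy as np

def gpcm_probs(theta, a, steps):
    # Generalized partial credit category probabilities at a single theta.
    # steps: step parameters for categories 1..m; category scores run 0..m.
    cum = np.concatenate(([0.0], np.cumsum(a * (theta - steps))))
    num = np.exp(cum - cum.max())   # subtract max for numerical stability
    return num / num.sum()

def summed_score_probs(theta, items):
    # Recursive (Lord-Wingersky type) accumulation of the summed-score
    # distribution over polytomous items; items is a list of (a, steps) pairs.
    dist = np.array([1.0])          # P(score = 0) before any item is added
    for a, steps in items:
        p = gpcm_probs(theta, a, steps)
        new = np.zeros(len(dist) + len(p) - 1)
        for k, pk in enumerate(p):
            new[k:k + len(dist)] += pk * dist
        dist = new
    return dist                     # P(summed score = 0, 1, ..., max)

# Marginal score probabilities r_x = sum_q f(x | theta_q) w_q with N(0, 1) quadrature.
items = [(1.2, np.array([-0.8, 0.3])), (0.9, np.array([-0.2, 0.6, 1.1]))]  # illustrative
nodes = np.linspace(-4, 4, 33)
weights = np.exp(-0.5 * nodes ** 2); weights /= weights.sum()
r = sum(w * summed_score_probs(t, items) for t, w in zip(nodes, weights))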

15.
This study applied kernel equating (KE) in two scenarios: equating to a very similar population and equating to a very different population, referred to as a distant population, using SAT® data. The KE results were compared to the results obtained from analogous traditional equating methods in both scenarios. The results indicate that KE results are comparable to the results of other methods. Further, the results show that when the two populations taking the two tests are similar on the anchor score distributions, different equating methods yield the same or very similar results, even though they have different assumptions.
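For background, kernel equating continuizes each discrete score distribution with a Gaussian kernel before equipercentile matching. A commonly cited form of the continuized CDF, following von Davier, Holland, and Thayer (notation reconstructed here, not taken from this study), is

F_{h_X}(x) = \sum_j r_j \, \Phi\!\left( \frac{x - a_X x_j - (1 - a_X)\mu_X}{a_X h_X} \right), \qquad a_X = \sqrt{ \frac{\sigma_X^2}{\sigma_X^2 + h_X^2} },

with equating function e_Y(x) = G_{h_Y}^{-1}\!\bigl(F_{h_X}(x)\bigr); the score probabilities r_j are typically presmoothed, for example with a log-linear model.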

16.
An Example of Applying Item Response Theory to Equate a Test Containing Multiple Item Types
Using a statewide high school examination in the United States as an example, this paper describes in concrete terms how item response theory is applied to equate a test containing multiple item types, and it also discusses some of the other technical aspects of the examination.

17.
In this study we examined procedures for assessing model-data fit of item response theory (IRT) models for mixed format data. The model fit indices used in this study include PARSCALE's G², Orlando and Thissen's S-X² and S-G², and Stone's χ²* and G²*. To investigate the relative performance of the fit statistics at the item level, we conducted two simulation studies: Type I error and power studies. We evaluated the performance of the item fit indices for various conditions of test length, sample size, and IRT models. Among the competing measures, the summed score-based indices S-X² and S-G² were found to be the sensible and efficient choice for assessing model fit for mixed format data. These indices performed well, particularly with short tests. The pseudo-observed score indices, χ²* and G²*, showed inflated Type I error rates in some simulation conditions. Consistent with the findings of current literature, PARSCALE's G² index was rarely useful, although it provided reasonable results for long tests.
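For reference, the summed-score index of Orlando and Thissen that performed well here takes, for a dichotomous item i, the Pearson form (notation reconstructed, not quoted from the study)

S\text{-}X_i^2 = \sum_{k=1}^{n-1} N_k \, \frac{\left( O_{ik} - E_{ik} \right)^2}{E_{ik}\left( 1 - E_{ik} \right)},

where N_k is the number of examinees with summed score k, O_{ik} is the observed proportion correct on item i in that score group, and E_{ik} is the model-implied proportion computed with the recursive summed-score algorithm; S-G² substitutes likelihood-ratio terms, and the mixed-format generalization sums over item categories.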

18.
In this study I compared results of chained linear, Tucker, and Levine observed-score equatings under conditions where the new and old forms samples were similar in ability and also when they were different in ability. The length of the anchor test was also varied to examine its effect on the three different equating methods. The three equating methods were compared to a criterion equating to obtain estimates of random equating error, bias, and root mean squared error (RMSE). Results showed that, for most studied conditions, chained linear equating produced fairly good equating results in terms of low bias and RMSE. Levine equating also produced low bias and RMSE in some conditions. Although the Tucker method always produced the lowest random equating error, it produced a larger bias and RMSE than either of the other equating methods. As noted in the literature, these results also suggest that either chained linear or Levine equating be used when new and old form samples differ on ability and/or when the anchor-to-total correlation is not very high. Finally, by testing the missing data assumptions of the three equating methods, this study also shows empirically why an equating method is more or less accurate under certain conditions.
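As context for the comparison above, all three methods are linear observed-score methods of the generic form (notation here is generic rather than drawn from the study)

l_Y(x) = \mu_s(Y) + \frac{\sigma_s(Y)}{\sigma_s(X)} \bigl( x - \mu_s(X) \bigr),

where the synthetic-population moments \mu_s and \sigma_s must be inferred through the anchor. Tucker and Levine differ in the missing-data assumptions used to impute those moments, while chained linear instead composes a new-form-to-anchor linear link estimated on the new-form group with an anchor-to-old-form link estimated on the old-form group.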

19.
This inquiry is an investigation of item response theory (IRT) proficiency estimators’ accuracy under multistage testing (MST). We chose a two‐stage MST design that includes four modules (one at Stage 1, three at Stage 2) and three difficulty paths (low, middle, high). We assembled various two‐stage MST panels (i.e., forms) by manipulating two assembly conditions in each module, such as difficulty level and module length. For each panel, we investigated the accuracy of examinees’ proficiency levels derived from seven IRT proficiency estimators. The choice of Bayesian (prior) versus non‐Bayesian (no prior) estimators was of more practical significance than the choice of number‐correct versus item‐pattern scoring estimators. The Bayesian estimators were slightly more efficient than the non‐Bayesian estimators, resulting in smaller overall error. Possible score changes caused by the use of different proficiency estimators would be nonnegligible, particularly for low‐ and high‐performing examinees.
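The Bayesian versus non-Bayesian distinction above comes down to whether a prior g(\theta) enters the estimate, for example the expected a posteriori (EAP) estimator versus maximum likelihood (ML):

\hat{\theta}_{\mathrm{EAP}} = \frac{\int \theta \, L(\mathbf{u} \mid \theta) \, g(\theta) \, d\theta}{\int L(\mathbf{u} \mid \theta) \, g(\theta) \, d\theta}, \qquad \hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} L(\mathbf{u} \mid \theta).

Shrinkage of EAP toward the prior mean is one plausible reason the score changes noted above are largest for low- and high-performing examinees.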

20.
The usefulness of item response theory (IRT) models depends, in large part, on the accuracy of item and person parameter estimates. For the standard 3-parameter logistic model, for example, these parameters include the item parameters of difficulty, discrimination, and pseudo-chance, as well as the person ability parameter. Several factors impact traditional marginal maximum likelihood (ML) estimation of IRT model parameters, including sample size, with smaller samples generally being associated with lower parameter estimation accuracy and inflated standard errors for the estimates. Given this deleterious impact of small samples on IRT model performance, estimation becomes difficult when these techniques are used with low-incidence populations, where they might prove to be particularly useful, especially with more complex models. Recently, a Pairwise estimation method for Rasch model parameters has been suggested for use with missing data, and it may also hold promise for parameter estimation with small samples. This simulation study compared the item difficulty parameter estimation accuracy of ML with that of the Pairwise approach to ascertain the benefits of the latter method. The results support the use of the Pairwise method with small samples, particularly for obtaining item location estimates.
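As an illustration of the pairwise idea for readers who have not seen it, the sketch below implements one simple conditional variant for the Rasch model (row-averaged log-odds of the pairwise "i right, j wrong" counts, in the spirit of Choppin). It is a hypothetical sketch, not the specific Pairwise estimator evaluated in the study, and the function name is invented for illustration.

import numpy as np

def pairwise_rasch_difficulties(X):
    # X: persons-by-items matrix of 0/1 responses (np.nan allowed for missing data).
    # For each ordered pair (i, j), count examinees with item i correct and item j
    # wrong; under the Rasch model log(n_ij / n_ji) estimates b_j - b_i.
    k = X.shape[1]
    n = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                n[i, j] = np.nansum((X[:, i] == 1) & (X[:, j] == 0))
    with np.errstate(divide="ignore", invalid="ignore"):
        logodds = np.log(n) - np.log(n.T)    # entry (i, j) approximates b_j - b_i
    np.fill_diagonal(logodds, 0.0)           # zero counts would need smoothing in practice
    b = -logodds.mean(axis=1)                # row average recovers b_i up to a constant
    return b - b.mean()                      # centre the difficulties at zero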
