Similar literature
1.
Equatings were performed on both simulated and real data sets using the common-examinee design and two abilities for each examinee (i.e., two dimensions). Item and ability parameter estimates were found by using the Multidimensional Item Response Theory Estimation (MIRTE) program. The amount of equating error was evaluated by a comparison of the mean difference and the mean absolute difference between the true scores and ability estimates found on both tests for the common examinees used in the equating. The results indicated that effective equating, as measured by comparability of true scores, was possible with the techniques used in this study. When the stability of the ability estimates was examined, unsatisfactory results were found.

2.
Four equating methods (3PL true score equating, 3PL observed score equating, beta 4 true score equating, and beta 4 observed score equating) were compared using four equating criteria: first-order equity (FOE), second-order equity (SOE), conditional-mean-squared-error (CMSE) difference, and the equipercentile equating property. True score equating more closely achieved estimated FOE than observed score equating when the true score distribution was estimated using the psychometric model that was used in the equating. Observed score equating more closely achieved estimated SOE, estimated CMSE difference, and the equipercentile equating property than true score equating. Among the four equating methods, 3PL observed score equating most closely achieved estimated SOE and had the smallest estimated CMSE difference, and beta 4 observed score equating was the method that most closely met the equipercentile equating property.

3.
A resampling study was conducted to compare the statistical bias and standard errors of nonequivalent-groups linear test equating in small samples of examinees. Sample sizes of 15, 25, 50, and 100 were examined. One thousand samples of each size were drawn with replacement from each of 5 archival data files from teacher subject area tests. For each test, data files from 2 parallel forms were used. Results suggest trivial levels of equating bias even with small samples, but substantial increases in standard errors as sample size decreases. Results were interpreted in terms of applications to testing situations in which small numbers of examinees are available.
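
The bias and standard error that a resampling study of this kind reports can be sketched as follows. This is a minimal illustration, not the study's own procedure: the function name `summarize_replications`, the criterion value, and the replicated equated scores are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import mean, pstdev


def summarize_replications(estimates, criterion):
    """Summarize replicated equated scores at one raw-score point.

    estimates: equated scores from repeated small-sample equatings (hypothetical).
    criterion: the equated score treated as correct, e.g. from the full data.
    Returns (bias, standard_error, rmse).
    """
    estimates = list(estimates)
    # Bias: how far the average small-sample result sits from the criterion.
    bias = mean(estimates) - criterion
    # Standard error: spread of the replications around their own mean.
    standard_error = pstdev(estimates)
    # Root-mean-squared error: combines both sources of inaccuracy.
    rmse = sqrt(mean([(e - criterion) ** 2 for e in estimates]))
    return bias, standard_error, rmse


# Made-up replications of the equated score for one raw score.
replications = [21, 23, 22, 24, 20]
bias, standard_error, rmse = summarize_replications(replications, criterion=21)
print(bias, standard_error, rmse)
```

With the population form of the standard deviation used here, the squared root-mean-squared error equals the squared bias plus the squared standard error, which is why small bias and large standard errors can still mean inaccurate equating.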

4.
This instructional module is intended to promote a conceptual understanding of test form equating using traditional methods. The purpose of equating and the context in which equating occurs are described. The process of equating is distinguished from the related process of scaling to achieve comparability. Three equating designs are considered, and three equating methods (mean, linear, and equipercentile) are described and illustrated. Special attention is given to equating with nonequivalent groups, and to sources of equating error.
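
Two of the three methods named above, mean and linear equating, can be sketched in a few lines. The sketch assumes the simplest setting, in which both forms are taken by equivalent (randomly assigned) groups, and uses the standard textbook definitions rather than anything quoted from the module. The function names and the score lists are hypothetical.

```python
from statistics import mean, pstdev


def mean_equate(x, scores_x, scores_y):
    """Mean equating: shift a form X score by the difference in form means."""
    return x - mean(scores_x) + mean(scores_y)


def linear_equate(x, scores_x, scores_y):
    """Linear equating: match both the mean and the standard deviation of form Y."""
    slope = pstdev(scores_y) / pstdev(scores_x)
    return mean(scores_y) + slope * (x - mean(scores_x))


# Made-up raw scores from two equivalent groups (assumption: random groups design).
form_x = [10, 12, 14, 16, 18]
form_y = [13, 16, 19, 22, 25]

print(mean_equate(16, form_x, form_y))    # shift only
print(linear_equate(16, form_x, form_y))  # shift and rescale
```

Mean equating assumes the forms differ by a constant amount along the whole scale, while linear equating also allows the difference to vary with score level. Equipercentile equating, which matches percentile ranks rather than moments, is not shown.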

5.
IRT Equating Methods
The purpose of this instructional module is to provide the basis for understanding the process of score equating through the use of item response theory (IRT). A context is provided for addressing the merits of IRT equating methods. The mechanics of IRT equating and the need to place parameter estimates from separate calibration runs on the same scale are discussed. Some procedures for placing parameter estimates on a common scale are presented. In addition, IRT true-score equating is discussed in some detail. A discussion of the practical advantages derived from IRT equating is offered at the end of the module.

6.
The purpose of this study was to evaluate the use of adjoined and piecewise linear approximations (APLAs) of raw equipercentile equating functions as a postsmoothing equating method. APLAs are less familiar than other postsmoothing equating methods (i.e., cubic splines), but their use has been described in historical equating practices of large‐scale testing programs. This study used simulations to evaluate APLA equating results and compare these results with those from cubic spline postsmoothing and from several presmoothing equating methods. The overall results suggested that APLAs based on four line segments have accuracy advantages similar to or better than cubic splines and can sometimes produce more accurate smoothed equating functions than those produced using presmoothing methods.

7.
This article presents a method for evaluating equating results. Within the kernel equating framework, the percent relative error (PRE) for chained equipercentile equating was computed under the nonequivalent groups with anchor test (NEAT) design. The method was applied to two data sets to obtain the PRE, which can be used to measure equating effectiveness. The study compared the PRE results for chained and poststratification equating. The results indicated that the chained method transformed the new form score distribution to the reference form scale more effectively than the poststratification method. In addition, the study found that in chained equating, the population weight had an impact on score distributions over the target population but not on the equating and PRE results.

8.
9.
Recently, there has been an increasing level of interest in subscores for their potential diagnostic value. Haberman (2008b) suggested reporting an augmented subscore that is a linear combination of a subscore and the total score. Sinharay and Haberman (2008) and Sinharay (2010) showed that augmented subscores often lead to more accurate diagnostic information than subscores. In order to report augmented subscores operationally, they should be comparable across the different forms of a test. One way to achieve comparability is to equate them. We suggest several methods for equating augmented subscores. Results from several operational and simulated data sets show that the error in the equating of augmented subscores appears to be small in most practical situations.

10.
In this article, linear item response theory (IRT) observed‐score equating is compared under a generalized kernel equating framework with Levine observed‐score equating for the nonequivalent groups with anchor test design. Interestingly, these two equating methods are closely related despite being based on different methodologies. Specifically, when using data from IRT models, linear IRT observed‐score equating is virtually identical to Levine observed‐score equating. This leads to the conclusion that poststratification equating based on true anchor scores can be viewed as the curvilinear Levine observed‐score equating.

11.
12.
The equating performance of two internal anchor test structures, miditests and minitests, is studied for four IRT equating methods using simulated data. Originally proposed by Sinharay and Holland, miditests are anchors that have the same mean difficulty as the overall test but less variance in item difficulties. Four popular IRT equating methods were tested, and both the means and SDs of the true ability of the group to be equated were varied. We evaluate equating accuracy marginally and conditional on true ability. Our results suggest miditests perform about as well as traditional minitests for most conditions. Findings are discussed in terms of comparability to the typical minitest design and the trade‐off between accuracy and flexibility in test construction.

13.
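A New Framework for Test Score Equating
Equating test scores is not only an important step in ensuring test reliability and fairness, but also a core step in building item banks and implementing computerized adaptive testing. Educational Measurement, a volume produced jointly by the American Council on Education (ACE) and the National Council on Measurement in Education (NCME), has been called the "bible" of the educational measurement field. The fourth edition of Educational Measurement, published in 2006, proposed a new framework for test score equating. This article introduces that framework and, drawing on the author's many years of practice in test score equating, discusses equating issues.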

14.
Based on Lord's criterion of equity of equating, van der Linden (this issue) revisits the so‐called local equating method and offers alternative as well as new thoughts on several topics, including the types of transformations, symmetry, reliability, and population invariance appropriate for equating. A notable aspect is the definition of equating as a standard statistical inference problem in which the true equating transformation is the parameter of interest, to be estimated and assessed like any estimator of an unknown parameter in statistics. We believe that putting equating methods in a general statistical model framework would be an interesting and useful next step in the area. van der Linden's conceptual article on equating is certainly an important contribution to this task.

15.
This study investigated the extent to which log-linear smoothing could improve the accuracy of common-item equating by the chained equipercentile method in small samples of examinees. Examinee response data from a 100-item test were used to create two overlapping forms of 58 items each, with 24 items in common. The criterion equating was a direct equipercentile equating of the two forms in the full population of 93,283 examinees. Anchor equatings were performed in samples of 25, 50, 100, and 200 examinees, with 50 pairs of samples at each size level. Four equatings were performed with each pair of samples: one based on unsmoothed distributions and three based on varying degrees of smoothing. Smoothing reduced, by at least half, the sample size required for a given degree of accuracy. Smoothing that preserved only two moments of the marginal distributions resulted in equatings that failed to capture the curvilinearity in the population equating.

16.
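The HSK is a national standardized test established to measure the Chinese proficiency of non-native speakers of Chinese (including foreigners and overseas Chinese). The MHK is a national standardized test designed specifically to measure the Chinese proficiency of learners from China's ethnic minorities whose native language is not Chinese. Both the HSK and the MHK are certification tests. If the standard for awarding certificates lacks stability and fairness, so that examinees who take one test form are held to one standard and examinees who take another form are held to a different one, then not only will the reliability and validity of the HSK be seriously affected, but related decisions will be misled and examinees will be treated unfairly. Statistical equating of test scores has been applied consistently throughout the development and administration of the HSK and MHK. For the equating design, we used a mixed design that combines common-group equating, common-item equating, and split-half combination. For the equating data processing, we used linear equating, equipercentile equating, and IRT equating in combination. This article describes the equating methods of the HSK and MHK, and discusses the strengths and weaknesses of each method and the possibilities for further improvement.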

17.
This study addressed the sampling error and linking bias that occur with small samples in a nonequivalent groups anchor test design. We proposed a linking method called the synthetic function, which is a weighted average of the identity function and a traditional equating function (in this case, the chained linear equating function). Specifically, we compared the synthetic, identity, and chained linear functions for various‐sized samples from two types of national assessments. One design used a highly reliable test and an external anchor, and the other used a relatively low‐reliability test and an internal anchor. The results from each of these methods were compared to the criterion equating function derived from the total samples with respect to linking bias and error. The study indicated that the synthetic functions might be a better choice than the chained linear equating method when samples are not large and, as a result, unrepresentative.
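
The synthetic function described above is simple enough to sketch directly. In this minimal illustration the weight `w` is placed on the equating function and `1 - w` on the identity; that convention, the function name, and the example chained linear coefficients are assumptions for illustration, not values from the study.

```python
def synthetic_function(x, equate, w):
    """Weighted average of an equating function and the identity function.

    x: raw score on the new form.
    equate: callable mapping a new-form score to the reference-form scale.
    w: weight on the equating function, between 0 and 1 (assumed convention).
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie between 0 and 1")
    return w * equate(x) + (1.0 - w) * x


def chained_linear(x):
    """Hypothetical chained linear equating estimated from a small sample."""
    return 1.5 * x - 3.0


# w = 0 returns the raw score (identity); w = 1 returns the equated score.
for w in (0.0, 0.5, 1.0):
    print(w, synthetic_function(20, chained_linear, w))
```

Shrinking toward the identity trades sampling error for bias: the identity has no sampling error but is biased whenever the forms differ in difficulty, so smaller weights on the equating function suit smaller samples and more similar forms.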

18.
The synthetic function is a weighted average of the identity (the linking function for forms that are known to be completely parallel) and a traditional equating method. The purpose of the present study was to investigate the benefits of the synthetic function on small-sample equating using various real data sets gathered from different administrations of tests from a licensure testing program. We investigated the chained linear, Tucker, Levine, and mean equating methods, along with the identity and the synthetic functions with small samples (N = 19 to 70). The synthetic function did not perform as well as did other linear equating methods because test forms differed markedly in difficulty; thus, the use of the identity function produced substantial bias. The effectiveness of the synthetic function depended on the forms' similarity in difficulty.

19.
Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Overall, chained mean equating produced the most accurate results for low scores, but the two circle-arc methods produced the most accurate results, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.

20.
This study applied kernel equating (KE) in two scenarios: equating to a very similar population and equating to a very different population, referred to as a distant population, using SAT® data. The KE results were compared to the results obtained from analogous traditional equating methods in both scenarios. The results indicate that KE results are comparable to the results of other methods. Further, the results show that when the two populations taking the two tests are similar on the anchor score distributions, different equating methods yield the same or very similar results, even though they have different assumptions.
