1.

A Comparison Between Linear IRT Observed‐Score Equating and Levine Observed‐Score Equating Under the Generalized Kernel Equating Framework

Haiwen Chen 《Journal of Educational Measurement》2012,49(3):269-284

In this article, linear item response theory (IRT) observed‐score equating is compared under a generalized kernel equating framework with Levine observed‐score equating for nonequivalent groups with anchor test design. Interestingly, these two equating methods are closely related despite being based on different methodologies. Specifically, when using data from IRT models, linear IRT observed‐score equating is virtually identical to Levine observed‐score equating. This leads to the conclusion that poststratification equating based on true anchor scores can be viewed as the curvilinear Levine observed‐score equating.

2.

Asymptotic Standard Errors of Observed‐Score Equating With Polytomous IRT Models

Björn Andersson 《Journal of Educational Measurement》2016,53(4):459-477

In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
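The percentile-matching idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the score probabilities are passed in as plain lists (here they are not derived from a polytomous IRT model), and the discrete scores are continuized with the traditional uniform spreading over [j - 0.5, j + 0.5] rather than with the kernel method used in the article. The function names are hypothetical.

```python
import math

def percentile_rank(probs, x):
    """Proportion of the continuized distribution at or below score x.

    probs[j] is the probability of integer score j. Each integer score is
    spread uniformly over [j - 0.5, j + 0.5] (an assumption of this sketch).
    """
    n = len(probs)
    if x <= -0.5:
        return 0.0
    if x >= n - 0.5:
        return 1.0
    j = int(math.floor(x + 0.5))  # integer score whose interval contains x
    return sum(probs[:j]) + (x - (j - 0.5)) * probs[j]

def inverse_percentile_rank(probs, p):
    """Continuized score whose percentile rank equals p."""
    cum = 0.0
    for j, pj in enumerate(probs):
        if pj > 0 and cum + pj >= p:
            return (j - 0.5) + (p - cum) / pj
        cum += pj
    return len(probs) - 0.5

def equipercentile(probs_x, probs_y):
    """Map each integer score on form X to the form Y score with the same percentile rank."""
    return [
        inverse_percentile_rank(probs_y, percentile_rank(probs_x, x))
        for x in range(len(probs_x))
    ]
```

For example, with `probs_x = [0.25, 0.5, 0.25]` and `probs_y = [0.0, 0.25, 0.5, 0.25]` (form Y shifted up by one point), `equipercentile(probs_x, probs_y)` maps scores 0, 1, 2 to 1.0, 2.0, 3.0.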

3.

Evaluating Equating Accuracy and Assumptions for Groups That Differ in Performance

Sonya Powers, Michael J. Kolen 《Journal of Educational Measurement》2014,51(1):39-56

Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.

4.

Local Observed‐Score Kernel Equating

Marie Wiberg, Wim J. van der Linden, Alina A. von Davier 《Journal of Educational Measurement》2014,51(1):57-74

Three local observed‐score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to such measures as bias (as defined by Lord's criterion of equity) and percent relative error. The local kernel item response theory observed‐score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test design generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had somewhat larger standard error of equating than their kernel method counterparts.

5.

Local Equating Using the Rasch Model, the OPLM, and the 2PL IRT Model—or—What Is It Anyway if the Model Captures Everything There Is to Know About the Test Takers?

Matthias von Davier, Jorge González B., Alina A. von Davier 《Journal of Educational Measurement》2013,50(3):295-303

Local equating (LE) is based on Lord's criterion of equity. It defines a family of true transformations that aim at the ideal of equitable equating. van der Linden (this issue) offers a detailed discussion of common issues in observed‐score equating relative to this local approach. By assuming an underlying item response theory model, one of the main features of LE is that it adjusts the equated raw scores using conditional distributions of raw scores given an estimate of the ability of interest. In this article, we argue that this feature disappears when using a Rasch model for the estimation of the true transformation, while the one‐parameter logistic model and the two‐parameter logistic model do provide a local adjustment of the equated score.

6.

Preequating With Empirical Item Characteristic Curves: An Observed‐Score Preequating Method

Jiyun Zu, Gautam Puhan 《Journal of Educational Measurement》2014,51(3):281-300

Preequating is in demand because it reduces score reporting time. In this article, we evaluated an observed‐score preequating method: the empirical item characteristic curve (EICC) method, which makes preequating without item response theory (IRT) possible. EICC preequating results were compared with a criterion equating and with IRT true‐score preequating conversions. Results suggested that the EICC preequating method worked well under the conditions considered in this study. The difference between the EICC preequating conversion and the criterion equating was smaller than .5 raw‐score points (a practical criterion often used to evaluate equating quality) between the 5th and 95th percentiles of the new form total score distribution. EICC preequating also performed similarly or slightly better than IRT true‐score preequating.
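The half-point criterion mentioned in this abstract can be checked with a few lines of code. The sketch below is an assumption-laden illustration, not the authors' procedure: conversions are given as lists indexed by raw score, the 5th and 95th percentile scores are taken as the smallest scores whose cumulative probability reaches .05 and .95, and all function names are hypothetical.

```python
def score_at_quantile(probs, q):
    """Smallest integer score whose cumulative probability is at least q."""
    cum = 0.0
    for score, p in enumerate(probs):
        cum += p
        if cum >= q:
            return score
    return len(probs) - 1

def within_half_point(conv_a, conv_b, probs, lo=0.05, hi=0.95, tol=0.5):
    """True if two conversions differ by less than tol raw-score points
    at every score between the lo and hi quantiles of the score distribution."""
    first = score_at_quantile(probs, lo)
    last = score_at_quantile(probs, hi)
    return all(abs(conv_a[s] - conv_b[s]) < tol for s in range(first, last + 1))
```

Differences outside the 5th to 95th percentile range are ignored by design, since few examinees score there and conversions at the extremes tend to be unstable.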

7.

Local Linear Observed‐Score Equating

Marie Wiberg, Wim J. van der Linden 《Journal of Educational Measurement》2011,48(3):229-254

Two methods of local linear observed‐score equating for use with anchor‐test and single‐group designs are introduced. In an empirical study, the two methods were compared with the current traditional linear methods for observed‐score equating. As a criterion, the bias in the equated scores relative to true equating based on Lord's (1980) definition of equity was used. The local method for the anchor‐test design yielded minimum bias, even for considerable variation of the relative difficulties of the two test forms and the length of the anchor test. Among the traditional methods, the method of chain equating performed best. The local method for single‐group designs yielded equated scores with bias comparable to the traditional methods. This method, however, appears to be of theoretical interest because it forces us to rethink the relationship between score equating and regression.

8.

Some Conceptual Issues in Observed‐Score Equating

Wim J. van der Linden 《Journal of Educational Measurement》2013,50(3):249-285

In spite of all of the technical progress in observed‐score equating, several of the more conceptual aspects of the process still are not well understood. As a result, the equating literature struggles with rather complex criteria of equating, lack of a test‐theoretic foundation, confusing terminology, and ad hoc analyses. A return to Lord's foundational criterion of equity of equating, a derivation of the true equating transformation from it, and mainstream statistical treatment of the problem of estimating the transformation for various data‐collection designs exist as a solution to the problem.

9.

Situations Where It Is Appropriate to Use Frequency Estimation Equipercentile Equating

Hongwen Guo, Hyeonjoo J. Oh, Daniel Eignor 《Journal of Educational Measurement》2013,50(3):338-354

In operational equating situations, frequency estimation equipercentile equating is considered only when the old and new groups have similar abilities. The frequency estimation assumptions are investigated in this study under various situations at the levels of both theoretical interest and practical use. The results show that frequency estimation equating can be used under circumstances when it is not normally used. To link theoretical results with practice, statistical methods are proposed for checking frequency estimation assumptions based on available data: observed‐score distributions and item difficulty distributions of the forms. In addition to the conventional use of frequency estimation equating when the group abilities are similar, three situations are identified when the group abilities are dissimilar: (a) when the two forms and the observed conditional score distributions are similar (in this situation, the frequency estimation equating assumptions are likely to hold, and frequency estimation equating is appropriate); (b) when forms are similar but the observed conditional score distributions are not (in this situation, frequency estimation equating is not appropriate); and (c) when forms are not similar but the observed conditional score distributions are (frequency estimation equating is not appropriate). Statistical analysis procedures for comparing distributions are provided. Data from a large‐scale test are used to illustrate the use of frequency estimation equating when the group difference in ability is large.

10.

Standard Error of Linear Observed‐Score Equating for the NEAT Design With Nonnormally Distributed Data

Jiyun Zu, Ke‐Hai Yuan 《Journal of Educational Measurement》2012,49(2):190-213

In the nonequivalent groups with anchor test (NEAT) design, the standard error of linear observed‐score equating is commonly estimated by an estimator derived assuming multivariate normality. However, real data are seldom normally distributed, causing this normal estimator to be inconsistent. A general estimator, which does not rely on the normality assumption, would be preferred, because it is asymptotically accurate regardless of the distribution of the data. In this article, an analytical formula for the standard error of linear observed‐score equating, which characterizes the effect of nonnormality, is obtained under elliptical distributions. Using three large‐scale real data sets as the populations, resampling studies are conducted to empirically evaluate the normal and general estimators of the standard error of linear observed‐score equating. The effects of sample size (50, 100, 250, or 500) and equating method (chained linear, Tucker, or Levine observed‐score equating) are examined. Results suggest that the general estimator has smaller bias than the normal estimator in all 36 conditions; it has larger standard error when the sample size is at least 100; and it has smaller root mean squared error in all but one condition. An R program is also provided to facilitate the use of the general estimator.

11.

The Long‐Term Sustainability of IRT Scaling Methods in Mixed‐Format Tests

Lisa A. Keller, Ronald K. Hambleton 《Journal of Educational Measurement》2013,50(4):390-407

Due to recent research in equating methodologies indicating that some methods may be more susceptible to the accumulation of equating error over multiple administrations, the sustainability of several item response theory methods of equating over time was investigated. In particular, the paper is focused on two equating methodologies: fixed common item parameter scaling (with two variations, FCIP‐1 and FCIP‐2) and the Stocking and Lord characteristic curve scaling technique in the presence of nonequivalent groups. Results indicated that the improvements made to fixed common item parameter scaling in the FCIP‐2 method were sustained over time. FCIP‐2 and Stocking and Lord characteristic curve scaling performed similarly in many instances and produced more accurate results than FCIP‐1. The relative performance of FCIP‐2 and Stocking and Lord characteristic curve scaling depended on the nature of the change in the ability distribution: Stocking and Lord characteristic curve scaling captured the change in the distribution more accurately than FCIP‐2 when the change was different across the ability distribution; FCIP‐2 captured the changes more accurately when the change was consistent across the ability distribution.

12.

An Extension of IRT-Based Equating to the Dichotomous Testlet Response Theory Model

Wei Tao, Yi Cao 《Applied Measurement in Education》2013,26(2):108-121

Current procedures for equating number-correct scores using traditional item response theory (IRT) methods assume local independence. However, when tests are constructed using testlets, one concern is the violation of the local item independence assumption. The testlet response theory (TRT) model is one way to accommodate local item dependence. This study proposes methods to extend IRT true score and observed score equating methods to the dichotomous TRT model. We also examine the impact of local item dependence on equating number-correct scores when a traditional IRT model is applied. Results of the study indicate that when local item dependence is at a low level, using the three-parameter logistic model does not substantially affect number-correct equating. When local item dependence is at a moderate or high level, however, using the three-parameter logistic model generates larger equating bias and standard errors of equating compared to the TRT model. That said, observed score equating is more robust to the violation of the local item independence assumption than is true score equating.

13.

Asymptotic Standard Errors for Item Response Theory True Score Equating of Polytomous Items

Cheow Cher Wong 《Journal of Educational Measurement》2015,52(1):106-120

Building on previous works by Lord and Ogasawara for dichotomous items, this article proposes an approach to derive the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach could be used in place of empirical methods like the bootstrap method, to obtain standard errors of equated scores. Formulas are introduced to obtain the derivatives for computing the asymptotic standard errors. The approach was validated using mean‐mean, mean‐sigma, random‐groups, or concurrent calibration equating of simulated samples, for tests modeled using the generalized partial credit model or the graded response model.

14.

A General Linear Method for Equating With Small Samples

Anthony D. Albano 《Journal of Educational Measurement》2015,52(1):55-69

Research on equating with small samples has shown that methods with stronger assumptions and fewer statistical estimates can lead to decreased error in the estimated equating function. This article introduces a new approach to linear observed‐score equating, one which provides flexible control over how form difficulty is assumed versus estimated to change across the score scale. A general linear method is presented as an extension of traditional linear methods. The general method is then compared to other linear and nonlinear methods in terms of accuracy in estimating a criterion equating function. Results from two parametric bootstrapping studies based on real data demonstrate the usefulness of the general linear method.

15.

The Examination of the Classification of Students into Performance Categories by Two Different Equating Methods

Lisa A. Keller, Robert R. Keller, Pauline A. Parker 《Journal of Experimental Education》2013,81(1):30-52

This study investigates the comparability of two item response theory based equating methods: true score equating (TSE), and estimated true equating (ETE). Additionally, six scaling methods were implemented within each equating method: mean-sigma, mean-mean, two versions of fixed common item parameter, Stocking and Lord, and Haebara. Empirical test data were examined to investigate the consistency of scores resulting from the two equating methods, as well as the consistency of the scaling methods both within equating methods and across equating methods. Results indicate that although the degree of correlation among the equated scores was quite high, regardless of equating method/scaling method combination, non-trivial differences in equated scores existed in several cases. These differences would likely accumulate across examinees making group-level differences greater. Systematic differences in the classification of examinees into performance categories were observed across the various conditions: ETE tended to place lower ability examinees into higher performance categories than TSE, while the opposite was observed for high ability examinees. Because the study was based on one set of operational data, the generalizability of the findings is limited and further study is warranted.
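Two of the scaling methods named in this abstract, mean-sigma and mean-mean, have simple closed forms based on the common items' parameter estimates. The sketch below uses the standard textbook definitions for the linear transformation θ_to = A·θ_from + B (so that b_to = A·b_from + B and a_to = a_from / A); it is not taken from the study itself, and the function and argument names are hypothetical.

```python
from statistics import mean, pstdev

def mean_sigma(b_from, b_to):
    """Mean-sigma coefficients from common-item difficulty estimates.

    b_from, b_to: difficulties of the same common items on the two scales.
    """
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B

def mean_mean(a_from, b_from, a_to, b_to):
    """Mean-mean coefficients: slope from mean discriminations,
    intercept from mean difficulties."""
    A = mean(a_from) / mean(a_to)
    B = mean(b_to) - A * mean(b_from)
    return A, B
```

With error-free parameters both methods return the same A and B; with estimated parameters they generally differ, which is one reason studies like this one compare scaling methods.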

16.

Comparison of the One‐ and Bi‐Direction Chained Equipercentile Equating

Hyeonjoo Oh, Tim Moses 《Journal of Educational Measurement》2012,49(4):399-418

This study investigated differences between two approaches to chained equipercentile (CE) equating (one‐ and bi‐direction CE equating) in nearly equal groups and relatively unequal groups. In one‐direction CE equating, the new form is linked to the anchor in one sample of examinees and the anchor is linked to the reference form in the other sample. In bi‐direction CE equating, the anchor is linked to the new form in one sample of examinees and to the reference form in the other sample. The two approaches were evaluated in comparison to a criterion equating function (i.e., equivalent groups equating) using indexes such as root expected squared difference, bias, standard error of equating, root mean squared error, and number of gaps and bumps. The overall results across the equating situations suggested that the two CE equating approaches produced very similar results, whereas the bi‐direction results were slightly less erratic, smoother (i.e., fewer gaps and bumps), usually closer to the criterion function, and also less variable.
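The chaining logic of the one-direction approach (new form to anchor in one sample, then anchor to reference form in the other) can be sketched as a composition of two links. To keep the example short, each link here is a linear (mean and standard deviation matching) link rather than the equipercentile link studied in the article; that substitution, the raw-score lists as input, and all names are assumptions of this sketch.

```python
from statistics import mean, pstdev

def linear_link(from_scores, to_scores):
    """Linear link that matches the mean and standard deviation of two
    score sets observed in the same sample of examinees."""
    m_f, s_f = mean(from_scores), pstdev(from_scores)
    m_t, s_t = mean(to_scores), pstdev(to_scores)
    return lambda score: m_t + (s_t / s_f) * (score - m_f)

def chained_link(x_sample1, anchor_sample1, anchor_sample2, y_sample2):
    """One-direction chain: new form X -> anchor (sample 1),
    then anchor -> reference form Y (sample 2)."""
    x_to_anchor = linear_link(x_sample1, anchor_sample1)
    anchor_to_y = linear_link(anchor_sample2, y_sample2)
    return lambda x: anchor_to_y(x_to_anchor(x))
```

Swapping `linear_link` for an equipercentile link would give a chained equipercentile function with the same two-step structure.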

17.

A Comparison of IRT Equating and Beta 4 Equating

Dong-In Kim, Robert Brennan, Michael Kolen 《Journal of Educational Measurement》2005,42(1):77-99

Four equating methods (3PL true score equating, 3PL observed score equating, beta 4 true score equating, and beta 4 observed score equating) were compared using four equating criteria: first-order equity (FOE), second-order equity (SOE), conditional-mean-squared-error (CMSE) difference, and the equipercentile equating property. True score equating more closely achieved estimated FOE than observed score equating when the true score distribution was estimated using the psychometric model that was used in the equating. Observed score equating more closely achieved estimated SOE, estimated CMSE difference, and the equipercentile equating property than true score equating. Among the four equating methods, 3PL observed score equating most closely achieved estimated SOE and had the smallest estimated CMSE difference, and beta 4 observed score equating was the method that most closely met the equipercentile equating property.

18.

Statistical Models and Inference for the True Equating Transformation in the Context of Local Equating

Jorge González B., Matthias von Davier 《Journal of Educational Measurement》2013,50(3):315-320

Based on Lord's criterion of equity of equating, van der Linden (this issue) revisits the so‐called local equating method and offers alternative as well as new thoughts on several topics including the types of transformations, symmetry, reliability, and population invariance appropriate for equating. A remarkable aspect is to define equating as a standard statistical inference problem in which the true equating transformation is the parameter of interest that has to be estimated and assessed as any standard evaluation of an estimator of an unknown parameter in statistics. We believe that putting equating methods in a general statistical model framework would be an interesting and useful next step in the area. van der Linden's conceptual article on equating is certainly an important contribution to this task.

19.

Statistical Assessment of Estimated Transformations in Observed‐Score Equating

Marie Wiberg, Jorge González 《Journal of Educational Measurement》2016,53(1):106-125

Equating methods make use of an appropriate transformation function to map the scores of one test form into the scale of another so that scores are comparable and can be used interchangeably. The equating literature shows that the ways of judging the success of an equating (i.e., the score transformation) might differ depending on the adopted framework. Rather than targeting different parts of the equating process and aiming to evaluate the process from different aspects, this article views the equating transformation as a standard statistical estimator and discusses how this estimator should be assessed in an equating framework. For the kernel equating framework, a numerical illustration shows the potential of viewing the equating transformation as a statistical estimator as opposed to assessing it using equating‐specific criteria. A discussion on how this approach can be used to compare other equating estimators from different frameworks is also included.

20.

Adjoined Piecewise Linear Approximations (APLAs) for Equating: Accuracy Evaluations of a Postsmoothing Equating Method

Tim Moses 《Journal of Educational Measurement》2013,50(4):427-446

The purpose of this study was to evaluate the use of adjoined and piecewise linear approximations (APLAs) of raw equipercentile equating functions as a postsmoothing equating method. APLAs are less familiar than other postsmoothing equating methods (i.e., cubic splines), but their use has been described in historical equating practices of large‐scale testing programs. This study used simulations to evaluate APLA equating results and compare these results with those from cubic spline postsmoothing and from several presmoothing equating methods. The overall results suggested that APLAs based on four line segments have accuracy advantages similar to or better than cubic splines and can sometimes produce more accurate smoothed equating functions than those produced using presmoothing methods.