1.

John J. Norcini 《Journal of Educational Measurement》1990,27(1):59-66

In competency testing, it is sometimes difficult to properly equate scores of different forms of a test and thereby assure equivalent cutting scores. Under such circumstances, it is possible to set standards separately for each test form and then scale the judgments of the standard setters to achieve equivalent pass/fail decisions. Data from standard setters and examinees for a medical certifying examination were reanalyzed. Cutting score equivalents were derived by applying a linear procedure to the standard-setting results. These were compared against criteria along with the cutting score equivalents derived from typical examination equating procedures. Results indicated that the cutting score equivalents produced by the experts were closer to the criteria than standards derived from examinee performance, especially when the number of examinees used in equating was small. The root mean square error estimate was about 1 item on a 189-item test.

2.

Estimating Average Domain Scores

Mary Pommerich W. Alan Nicewander Bradley A. Hanson 《Journal of Educational Measurement》1999,36(3):199-216

A simulation study was performed to determine whether a group's average percent correct in a content domain could be accurately estimated for groups taking a single test form and not the entire domain of items. Six Item Response Theory based domain score estimation methods were evaluated, under conditions of few items per content area per form taken, small domains, and small group sizes. The methods used item responses to a single form taken to estimate examinee or group ability; domain scores were then computed using the ability estimates and domain item characteristics. The IRT-based domain score estimates typically showed greater accuracy and greater consistency across forms taken than observed performance on the form taken. For the smallest group size and fewest items taken, the accuracy of most IRT-based estimates was questionable; however, a procedure that operates on an estimated distribution of group ability showed promise under most conditions.

3.

SOME RESULTS RELATING TO TEST EQUATING UNDER RELAXED TEST FORM EQUIVALENCE

EDMOND MARKS CARL A. LINDSAY 《Journal of Educational Measurement》1972,9(1):45-56

Educational measurement specialists in undertaking test equating in applied settings have been plagued by the absence of a logically or mathematically compelling rationale for their test equating efforts. Classical test theory and other test theories based on the assumption of identically distributed true scores are tautological in terms of test equating. The present study examined (by means of a Monte Carlo procedure) the effects of four parameters on the accuracy of test equating under a relaxed definition of test form equivalence. The four parameters studied were sample size, test form length, test form reliability, and the correlation between the true scores of the test forms to be equated. Significant interactions involving sample size and the other parameters indicated that smaller samples of observations yielded disproportionately larger errors in test equating for fixed values of the test form parameters. In terms of main effects, sample size emerged as most important in controlling equating error. Taken together, the results suggest that when test equating is carried out on larger samples of observations, errors of equating will tend to be relatively small even though the test forms are not strictly parallel. For arbitrarily small samples, however, errors of equating will tend to be larger regardless of how equivalent the test forms are.

4.

Evaluating Equating Accuracy and Assumptions for Groups That Differ in Performance

Sonya Powers Michael J. Kolen 《Journal of Educational Measurement》2014,51(1):39-56

Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.

5.

Collateral Information for Equating in Small Samples: A Preliminary Investigation

Sooyeon Kim Samuel A. Livingston Charles Lewis 《Applied Measurement in Education》2013,26(4):302-323

This article describes a preliminary investigation of an empirical Bayes (EB) procedure for using collateral information to improve equating of scores on test forms taken by small numbers of examinees. Resampling studies were done on two different forms of the same test. In each study, EB and non-EB versions of two equating methods (chained linear and chained mean) were applied to repeated small samples drawn from a large data set collected for a common-item equating. The criterion equating was the chained linear equating in the large data set. Equatings of other forms of the same test provided the collateral information. New-form sample size was varied from 10 to 200; reference-form sample size was constant at 200. One of the two new forms did not differ greatly in difficulty from its reference form, as was the case for the equatings used as collateral information. For this form, the EB procedure improved the accuracy of equating with new-form samples of 50 or fewer. The other new form was much more difficult than its reference form; for this form, the EB procedure made the equating less accurate.

6.

Accuracy of Random Groups Equating with Very Small Samples

Gary Skaggs 《Journal of Educational Measurement》2005,42(4):309-330

This study investigated the effectiveness of equating with very small samples using the random groups design. Of particular interest was equating accuracy at specific scores where performance standards might be set. Two sets of simulations were carried out, one in which the two forms were identical and one in which they differed by a tenth of a standard deviation in overall difficulty. These forms were equated using mean equating, linear equating, unsmoothed equipercentile equating, and equipercentile equating using two through six moments of log-linear presmoothing with samples of 25, 50, 75, 100, 150, and 200. The results indicated that identity equating was preferable to any equating method when samples were as small as 25. For samples of 50 and above, the choice of an equating method over identity equating depended on the location of the passing score relative to examinee performance. If passing scores were located below the mean, where data were sparser, mean equating produced the smallest percentage of misclassified examinees. For passing scores near the mean, all methods produced similar results with linear equating being the most accurate. For passing scores above the mean, equipercentile equating with 2- and 3-moment presmoothing were the best equating methods. Higher levels of presmoothing did not improve the results.
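As a rough illustration of three of the methods named in this abstract, the sketch below applies identity, mean, and linear equating to a new-form score under a random groups design. It uses the standard textbook definitions rather than the study's own code; the function names and the score lists are hypothetical.

```python
from statistics import mean, pstdev


def identity_equate(x):
    """Identity equating: treat the two forms as interchangeable."""
    return x


def mean_equate(x, new_scores, ref_scores):
    """Mean equating: shift x by the difference between the form means."""
    return x - mean(new_scores) + mean(ref_scores)


def linear_equate(x, new_scores, ref_scores):
    """Linear equating: match both the mean and the standard deviation."""
    slope = pstdev(ref_scores) / pstdev(new_scores)
    return mean(ref_scores) + slope * (x - mean(new_scores))


# Hypothetical raw scores from two randomly equivalent groups (illustration only).
new_form = [10, 12, 14, 16, 18]
ref_form = [13, 14, 15, 16, 17]

for label, value in [
    ("identity", identity_equate(16)),
    ("mean", mean_equate(16, new_form, ref_form)),
    ("linear", linear_equate(16, new_form, ref_form)),
]:
    print(label, round(value, 3))
```

With samples this small, the estimated means and standard deviations are themselves noisy, which is one way to read the finding that identity equating can outperform the estimated conversions.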

7.

The Effect of Changing Content on IRT Scaling Methods

Lisa A. Keller Robert R. Keller 《Applied Measurement in Education》2015,28(2):99-114

Equating test forms is an essential activity in standardized testing, with increased importance with the accountability systems in existence through the mandate of Adequate Yearly Progress. It is through equating that scores from different test forms become comparable, which allows for the tracking of changes in the performance of students from one year to the next. This study compares three different item response theory scaling methods (fixed common item parameter, Stocking & Lord, and Concurrent Calibration) with respect to examinee classification into performance categories, and estimation of the ability parameter, when the content of the test form changes slightly from year to year, and the examinee ability distribution changes. The results indicate that calibration methods, especially concurrent calibration, produced more stable results than the transformation method.

8.

How Well Can We Compare Scores on Test Forms That Are Constructed by Examinees' Choice?

Howard Wainer Xiang-Bo Wang David Thissen 《Journal of Educational Measurement》1994,31(3):183-199

When an exam consists, in whole or in part, of constructed-response items, it is a common practice to allow the examinee to choose a subset of the questions to answer. This procedure is usually adopted so that the limited number of items that can be completed in the allotted time does not unfairly affect the examinee. This results in the de facto administration of several different test forms, where the exact structure of any particular form is determined by the examinee. However, when different forms are administered, a canon of good testing practice requires that those forms be equated to adjust for differences in their difficulty. When the items are chosen by the examinee, traditional equating procedures do not strictly apply due to the nonignorable nature of the missing responses. In this article, we examine the comparability of scores on such tests within an IRT framework. We illustrate the approach with data from the College Board's Advanced Placement Test in Chemistry.

9.

The Performance of a Method for the Long-term Equating of Mixed-Format Assessment

Akihito Kamata Richard Tate 《Journal of Educational Measurement》2005,42(2):193-213

The goal of this study was the development of a procedure to predict the equating error associated with the long-term equating method of Tate (2003) for mixed-format tests. An expression for the determination of the error of an equating based on multiple links using the error for the component links was derived and illustrated with simulated data. Expressions relating the equating error for single equating links to relevant factors like the equating design and the history of the examinee population ability distribution were determined based on computer simulation. Use of the resulting procedure for the selection of a long-term equating design was illustrated.

10.

Rater Comparability Scoring and Equating: Does Choice of Target Population Weights Matter in This Context?

Gautam Puhan 《Journal of Educational Measurement》2013,50(4):374-380

When a constructed‐response test form is reused, raw scores from the two administrations of the form may not be comparable. The solution to this problem requires a rescoring, at the current administration, of examinee responses from the previous administration. The scores from this “rescoring” can be used as an anchor for equating. In this equating, the choice of weights for combining the samples to define the target population can be critical. In rescored data, the anchor usually correlates very strongly with the new form but only moderately with the reference form. This difference has a predictable impact: the equating results are most accurate when the target population is the reference form sample, least accurate when the target population is the new form sample, and somewhere in the middle when the new form and reference form samples are equally weighted in forming the target population.

11.

A Comparison of Angoff's Design I and Design II for Vertical Equating Using Traditional and IRT Methodology

Deborah J. Harris 《Journal of Educational Measurement》1991,28(3):221-235

Practical considerations in conducting an equating study often require a trade-off between testing time and sample size. A counterbalanced design (Angoff's Design II) is often selected because, as each examinee is administered both test forms and therefore the errors are correlated, sample sizes can be dramatically reduced over those required by a spiraling design (Angoff's Design I), where each examinee is administered only one test form. However, the counterbalanced design may be subject to fatigue, practice, or context effects. This article investigated these two data collection designs (for a given sample size) with equipercentile and IRT equating methodology in the vertical equating of two mathematics achievement tests. Both designs and both methodologies were judged to adequately meet an equivalent expected score criterion; Design II was found to exhibit more stability over different samples.

12.

A Note on the Application of Multiple Matrix Sampling to Standard Setting

John J. Norcini Judy A. Shea James C. Ping 《Journal of Educational Measurement》1988,25(2):159-164

In many of the methods currently proposed for standard setting, all experts are asked to judge all items, and the standard is taken as the mean of their judgments. When resources are limited, gathering the judgments of all experts in a single group can become impractical. Multiple matrix sampling (MMS) provides an alternative. This paper applies MMS to a variation on Angoff's method (1971) of standard setting. A pool of 36 experts and 190 items were divided randomly into 5 groups, and estimates of borderline examinee performance were acquired. Results indicated some variability in the cutting scores produced by the individual groups, but the variance components were reasonably well estimated. The standard error of the cutting score was very small, and the width of the 90% confidence interval around it was only 1.3 items. The reliability of the final cutting score was .98.
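The arithmetic behind an Angoff-style cutting score can be sketched as follows. The first function follows the abstract's description (the standard is the mean of the experts' judgments, here summed over items). The second function is only an assumed pooling rule for matrix-sampled groups, in which each group rates a disjoint subset of items and the group-level cuts are added; the abstract does not state how the paper combined groups, and all names and ratings below are hypothetical.

```python
from statistics import mean


def angoff_cut(ratings):
    """Mean over judges of each judge's summed item ratings.

    ratings[j][i] is judge j's estimated probability that a borderline
    examinee answers item i correctly.
    """
    return mean(sum(judge) for judge in ratings)


def mms_cut(groups):
    """Assumed pooling rule (not taken from the paper): each group rated a
    disjoint subset of items, so the group-level cuts are simply added."""
    return sum(angoff_cut(group) for group in groups)


# Hypothetical ratings: 2 groups, 2 judges per group, 2 items per group.
group_a = [[0.6, 0.8], [0.5, 0.7]]
group_b = [[0.9, 0.4], [0.7, 0.6]]

print(round(angoff_cut(group_a), 2))          # cut on group A's items
print(round(mms_cut([group_a, group_b]), 2))  # cut on the full 4-item test
```

Under this assumed rule, the cutting score for the whole test is expressed in raw-score units (number of items), which matches the way the abstract reports the confidence interval width in items.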

13.

Multidimensional Equating

Thomas M. Hirsch 《Journal of Educational Measurement》1989,26(4):337-349

Equatings were performed on both simulated and real data sets using the common-examinee design and two abilities for each examinee (i.e., two dimensions). Item and ability parameter estimates were found by using the Multidimensional Item Response Theory Estimation (MIRTE) program. The amount of equating error was evaluated by a comparison of the mean difference and the mean absolute difference between the true scores and ability estimates found on both tests for the common examinees used in the equating. The results indicated that effective equating, as measured by comparability of true scores, was possible with the techniques used in this study. When the stability of the ability estimates was examined, unsatisfactory results were found.

14.

Comparisons among Small Sample Equating Methods in a Common-Item Design

Sooyeon Kim Samuel A. Livingston 《Journal of Educational Measurement》2010,47(3):286-298

Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Overall, chained mean equating produced the most accurate results for low scores, but the two circle-arc methods produced the most accurate results, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.

15.

Using Diagnostic Profiles to Describe Borderline Performance in Standard Setting

Gary Skaggs Serge F. Hein Jesse L. M. Wilkins 《Educational Measurement》2020,39(1):45-51

In test-centered standard-setting methods, borderline performance can be represented by many different profiles of strengths and weaknesses. As a result, asking panelists to estimate item or test performance for a hypothetical group of borderline examinees, or a typical borderline examinee, may be an extremely difficult task and one that can lead to questionable results in setting cut scores. In this study, data collected from a previous standard-setting study are used to deduce panelists’ conceptions of profiles of borderline performance. These profiles are then used to predict cut scores on a test of algebra readiness. The results indicate that these profiles can predict a very wide range of cut scores both within and between panelists. Modifications are proposed to existing training procedures for test-centered methods that can account for the variation in borderline profiles.

16.

Achieving Form-to-Form Comparability: Fundamental Issues and Proposed Strategies for Equating Performance Assessments of Teachers

《Educational Assessment》2013,18(1):99-110

The purpose of this article is to describe some of the measurement issues encountered in the equating of performance assessments designed for use in making teacher certification decisions. As some teacher certification programs move from sole reliance on multiple-choice items to inclusion of complex performance tasks, difficult measurement issues related to equating may arise. A variety of analytic and judgmental strategies are described in this article that may provide solutions for addressing these equating issues. Analytic strategies are based on examinee data and involve the modification of existing equating procedures, such as linear and equipercentile methods, that have been used successfully in the past with test forms composed of multiple-choice items. Judgmental strategies for equating involve the use of expert judgments to determine the equivalence of scores obtained from alternate forms of an assessment instrument.

17.

THE EFFECTS OF VIOLATIONS OF UNIDIMENSIONALITY ON THE ESTIMATION OF ITEM AND ABILITY PARAMETERS AND ON ITEM RESPONSE THEORY EQUATING OF THE GRE VERBAL SCALE

NEIL J. DORANS NEAL M. KINGSTON 《Journal of Educational Measurement》1985,22(4):249-262

One of the major assumptions of item response theory (IRT) models is that performance on a set of items is unidimensional, that is, the probability of successful performance by examinees on a set of items can be modeled by a mathematical model that has only one ability parameter. In practice, this strong assumption is likely to be violated. An important pragmatic question to consider is: What are the consequences of these violations? In this research, evidence is provided of violations of unidimensionality on the verbal scale of the GRE Aptitude Test, and the impact of these violations on IRT equating is examined. Previous factor analytic research on the GRE Aptitude Test suggested that two verbal dimensions, discrete verbal (analogies, antonyms, and sentence completions) and reading comprehension, existed. Consequently, the present research involved two separate calibrations (homogeneous) of discrete verbal items and reading comprehension items as well as a single calibration (heterogeneous) of all verbal item types. Thus, each verbal item was calibrated twice and each examinee obtained three ability estimates: reading comprehension, discrete verbal, and all verbal. The comparability of ability estimates based on homogeneous calibrations (reading comprehension or discrete verbal) to each other and to the all-verbal ability estimates was examined. The effects of homogeneity of item calibration pool on estimates of item discrimination were also examined. Then the comparability of IRT equatings based on homogeneous and heterogeneous calibrations was assessed. The effects of calibration homogeneity on ability parameter estimates and discrimination parameter estimates are consistent with the existence of two highly correlated verbal dimensions. IRT equating results indicate that although violations of unidimensionality may have an impact on equating, the effect may not be substantial.

18.

A Comparison of Chained Linear and Poststratification Linear Equating Under Different Testing Conditions

Gautam Puhan 《Journal of Educational Measurement》2010,47(1):54-75

In this study I compared results of chained linear, Tucker, and Levine-observed score equatings under conditions where the new and old forms samples were similar in ability and also when they were different in ability. The length of the anchor test was also varied to examine its effect on the three different equating methods. The three equating methods were compared to a criterion equating to obtain estimates of random equating error, bias, and root mean squared error (RMSE). Results showed that, for most studied conditions, chained linear equating produced fairly good equating results in terms of low bias and RMSE. Levine equating also produced low bias and RMSE in some conditions. Although the Tucker method always produced the lowest random equating error, it produced a larger bias and RMSE than either of the other equating methods. As noted in the literature, these results also suggest that either chained linear or Levine equating be used when new and old form samples differ on ability and/or when the anchor-to-total correlation is not very high. Finally, by testing the missing data assumptions of the three equating methods, this study also shows empirically why an equating method is more or less accurate under certain conditions.
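The three accuracy indices named in this abstract are linked by a standard identity: at a single score point, the squared RMSE equals the squared bias plus the squared random error. The sketch below illustrates that relationship with made-up replication results; it assumes the usual definitions (bias as mean estimate minus criterion, random error as the standard deviation of the estimates over replications) and is not the study's own computation. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def error_summary(estimates, criterion):
    """Return (bias, random_error, rmse) for equated scores at one score point.

    estimates: equated scores from repeated samples (replications).
    criterion: the equated score from the criterion equating.
    """
    bias = mean(estimates) - criterion
    random_error = pstdev(estimates)  # spread of the estimates around their own mean
    rmse = sqrt(mean((e - criterion) ** 2 for e in estimates))
    return bias, random_error, rmse


# Hypothetical equated scores from four replications at one raw-score point.
replications = [20.5, 21.5, 20.0, 22.0]
bias, random_error, rmse = error_summary(replications, criterion=20.0)

print(round(bias, 3), round(random_error, 3), round(rmse, 3))
```

This decomposition is why a method can have the lowest random error and still the largest RMSE, as the abstract reports for the Tucker method: a large bias term can dominate the total.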

19.

The Effect of Various Factors on Standard Setting

John J. Norcini Judy A. Shea D. Theresa Kanya 《Journal of Educational Measurement》1988,25(1):57-65

This paper reports two studies of standard setting using Angoff's method. Results of the first study suggest that specialization within broad content areas does not affect an expert's estimates of the performance of the borderline group. This is reassuring because the knowledge base of many professions is so large that no individual can be considered an expert in all aspects of it. Results of the second study support the recommendation that performance data be provided during the standard-setting process. They are frequently used by experts, but will not have an impact on the standard unless the distribution of item difficulties is skewed markedly. It also increases the correspondence between p-values and estimates of borderline group performance, thereby reducing errors in pass/fail decisions. Overall, the results support recommendations often made in standard-setting literature, but they need to be replicated with other groups of experts.

20.

The effects of reducing correlation of external anchors on test equating methods for the equivalent groups and non-equivalent groups designs

《International Journal of Educational Research》1988,12(4):409-425

Six equating methods were compared: a one-parameter Item Response Theory (IRT) method; two equipercentile methods (direct and by frequency estimation); and three linear methods (Tucker, Levine Equally Reliable and Levine Unequally Reliable) in a situation in which different forms were administered to different groups, thus necessitating the use of an anchor test. The groups were simulated as either equivalent groups or groups of variable ability representing the two types of class groupings that can exist in schools (i.e. parallel or streamed classes). The correlation between the ability measured by an external anchor and the tests to be equated was systematically manipulated. A discrepancy index summarised the discrepancy of each equating method from an IRT criterion, an equipercentile criterion, and from each other. Large discrepancies were interpreted with the aid of graphs and discussed in terms of examinee indifference to the alternative transformations. The direct equipercentile and Levine Unequally Reliable methods were the only methods that consistently increased their level of the discrepancy from criterion following reduction in correlation for the two equatings examined in the equivalent groups design. For the non-equivalent groups design, a reduction in correlation resulted in a systematic effect in favour of those taking an easier form (usually the less able) for all equating methods. What was observed, however, was that for small reductions in correlation, the discrepancy of some of the equating methods from the IRT criterion was reduced. The implications of these findings are discussed and recommendations made for further work.