1.
Item parameter drift (IPD) occurs when item parameter values change from their original values over time. IPD may pose a serious threat to the fairness and validity of test score interpretations, especially when the goal of the assessment is to measure growth or improvement. In this study, we examined the effect of multidirectional IPD (i.e., some items become harder while other items become easier) on the linking procedure and rescaled proficiency estimates. The impact of different combinations of linking items with various multidirectional IPD on the test equating procedure was investigated for three scaling methods (mean-mean, mean-sigma, and TCC) via a series of simulation studies. It was observed that multidirectional IPD had a substantive effect on examinees' scores and achievement level classifications under some of the studied conditions. The choice of linking method had a direct effect on the results, as did the pattern of IPD.
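The mean-mean and mean-sigma methods referenced above both estimate a linear transformation θ_old = A·θ_new + B from the common (linking) items' parameter estimates. A minimal sketch with invented estimates; none of these numbers come from the study:

```python
import numpy as np

# Invented common-item parameter estimates on the old and new scales.
a_old = np.array([0.9, 1.1, 1.0, 1.3, 0.8])
b_old = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
a_new = np.array([1.0, 1.2, 0.9, 1.4, 0.9])
b_new = np.array([-1.0, -0.3, 0.3, 0.9, 1.6])

# Mean-sigma: A from the spread of the difficulties, B from their means.
A_ms = b_old.std(ddof=1) / b_new.std(ddof=1)
B_ms = b_old.mean() - A_ms * b_new.mean()

# Mean-mean: A from the ratio of mean discriminations (new over old).
A_mm = a_new.mean() / a_old.mean()
B_mm = b_old.mean() - A_mm * b_new.mean()

# Either (A, B) pair rescales new-form parameters onto the old scale:
# b* = A * b + B, a* = a / A.
b_rescaled = A_ms * b_new + B_ms
a_rescaled = a_new / A_ms
```

The TCC (test characteristic curve) methods differ in using the whole response function rather than parameter moments, which is why the three methods can react differently to multidirectional drift in the linking set.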
2.
Robert L. McKinley 《Journal of Educational Measurement》1988,25(3):233-246
Six procedures for combining sets of IRT item parameter estimates obtained from different samples were evaluated using real and simulated response data. In the simulated data analyses, true item and person parameters were used to generate response data for three different-sized samples. Each sample was calibrated separately to obtain three sets of item parameter estimates for each item. The six procedures for combining multiple estimates were each applied, and the results were evaluated by comparing the true and estimated item characteristic curves. For the real data, the two best methods from the simulated data analyses were applied to three different-sized samples, and the resulting estimated item characteristic curves were compared to the curves obtained when the three samples were combined and calibrated simultaneously. The results support the use of covariance matrix-weighted averaging and a procedure that involves sample-size-weighted averaging of estimated item characteristic curves at the center of the ability distribution.
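The sample-size-weighted averaging of estimated item characteristic curves supported above can be sketched as follows. The (a, b) estimates and sample sizes are invented, and this version averages pointwise over an ability grid rather than only at the center of the distribution:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Three sets of estimates for the same item, one per calibration sample.
estimates = [(0.9, 0.10), (1.1, 0.05), (1.0, -0.05)]  # (a, b) pairs
n = np.array([500, 1000, 1500])                       # sample sizes

theta = np.linspace(-3, 3, 61)

# Weight each sample's ICC by its sample size and average pointwise.
iccs = np.array([icc_2pl(theta, a, b) for a, b in estimates])
icc_avg = (n[:, None] * iccs).sum(axis=0) / n.sum()
```

The averaged curve can then be summarized back into item parameters or used directly, depending on the application.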
3.
Adelaide Ariel Bernard P. Veldkamp Wim J. van der Linden 《Journal of Educational Measurement》2004,41(4):345-359
Preventing items in adaptive testing from being over- or underexposed is one of the main problems in computerized adaptive testing. Though the problem of overexposed items can be solved using a probabilistic item-exposure control method, such methods are unable to deal with the problem of underexposed items. Using a system of rotating item pools, on the other hand, is a method that potentially solves both problems. In this method, a master pool is divided into (possibly overlapping) smaller item pools, which are required to have similar distributions of content and statistical attributes. These pools are rotated among the testing sites to realize desirable exposure rates for the items. A test assembly model, motivated by Gulliksen's matched random subtests method, was explored to help solve the problem of dividing a master pool into a set of smaller pools. Different methods to solve the model are proposed. An item pool from the Law School Admission Test was used to evaluate the performances of computerized adaptive tests from systems of rotating item pools constructed using these methods.
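Gulliksen's matched-subtests idea behind the pool-division model can be approximated by a simple serpentine assignment after sorting on difficulty. This is an illustrative heuristic with invented values, not the test assembly model used in the study:

```python
import numpy as np

# Invented difficulty estimates for a 12-item master pool.
b = np.array([-2.0, -1.5, -1.1, -0.7, -0.3, 0.0,
               0.2,  0.5,  0.9,  1.2,  1.6,  2.1])
n_pools = 3

# Sort items by difficulty, then deal them out in a serpentine
# (0,1,2,2,1,0,...) pattern so each pool gets a similar spread.
order = np.argsort(b)
cycle = list(range(n_pools)) + list(range(n_pools - 1, -1, -1))
assignment = [cycle[i % len(cycle)] for i in range(len(b))]
pools = [order[[i for i, p in enumerate(assignment) if p == k]]
         for k in range(n_pools)]

pool_means = [b[idx].mean() for idx in pools]
```

A real application would also balance content attributes and allow overlapping pools, which is what turns this into the optimization model the article solves.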
4.
5.
《Applied Measurement in Education》2013,26(4):371-383
School-level assessment of student writing ability using a group-level, polytomous item response theory (IRT) model was illustrated in this study. The study supported the viability of an IRT-based school assessment as an alternative to the conventional approach based on aggregation of individual scores. The precision provided by the assumed assessment design varied dramatically depending on school size and school average ability. For small schools and students with low average abilities, differences in average school performance had to be quite large to be trustworthy. In contrast, the design provided greater precision in detecting differences for large schools and students with high average abilities. An operational use of this design would require great care in the reporting of results to ensure that unreliable school comparisons are clearly identified.
6.
Kyong Hee Chon Won-Chan Lee Stephen B. Dunbar 《Journal of Educational Measurement》2010,47(3):318-338
In this study we examined procedures for assessing model-data fit of item response theory (IRT) models for mixed-format data. The model fit indices used in this study include PARSCALE's G², Orlando and Thissen's S−X² and S−G², and Stone's χ²* and G²*. To investigate the relative performance of the fit statistics at the item level, we conducted two simulation studies: Type I error and power studies. We evaluated the performance of the item fit indices for various conditions of test length, sample size, and IRT models. Among the competing measures, the summed-score-based indices S−X² and S−G² were found to be a sensible and efficient choice for assessing model fit for mixed-format data. These indices performed well, particularly with short tests. The pseudo-observed score indices, χ²* and G²*, showed inflated Type I error rates in some simulation conditions. Consistent with the findings of the current literature, PARSCALE's G² index was rarely useful, although it provided reasonable results for long tests.
7.
Walter D. Way 《Educational Measurement》1998,17(4):17-27
What are the issues and techniques involved in protecting the integrity of item pools in computerized testing? How can item exposure be limited? How do security issues differ in computerized testing and paper-and-pencil testing?
8.
Youngsuk Suh 《Journal of Educational Measurement》2016,53(4):403-430
This study adapted an effect size measure used for studying differential item functioning (DIF) in unidimensional tests and extended the measure to multidimensional tests. Two effect size measures were considered in a multidimensional item response theory model: signed weighted P-difference and unsigned weighted P-difference. The performance of the effect size measures was investigated under various simulation conditions including different sample sizes and DIF magnitudes. As another way of studying DIF, the χ² difference test was included to compare the result of statistical significance (statistical tests) with that of practical significance (effect size measures). The adequacy of existing effect size criteria used in unidimensional tests was also evaluated. Both effect size measures worked well in estimating true effect sizes, identifying DIF types, and classifying effect size categories. Finally, a real data analysis was conducted to support the simulation results.
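The signed and unsigned weighted P-difference measures compare group response probabilities weighted over the ability distribution. A unidimensional sketch with invented parameters (the study's measures are defined for multidimensional models):

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL response probability."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Invented reference/focal-group parameters for one item with uniform DIF:
# the item is 0.4 logits harder for the focal group.
theta = np.linspace(-4, 4, 81)
w = np.exp(-0.5 * theta ** 2)
w /= w.sum()                     # assumed N(0, 1) focal-group ability weights

p_ref = p_2pl(theta, 1.0, 0.0)
p_foc = p_2pl(theta, 1.0, 0.4)

swpd = np.sum(w * (p_ref - p_foc))          # signed weighted P-difference
uwpd = np.sum(w * np.abs(p_ref - p_foc))    # unsigned weighted P-difference
```

For uniform DIF like this the two measures coincide; they diverge when the two curves cross (nonuniform DIF), which is what makes reporting both informative.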
9.
10.
Equating test forms is an essential activity in standardized testing, and its importance has increased with the accountability systems mandated under Adequate Yearly Progress. It is through equating that scores from different test forms become comparable, which allows for the tracking of changes in the performance of students from one year to the next. This study compares three different item response theory scaling methods (fixed common item parameter, Stocking and Lord, and concurrent calibration) with respect to examinee classification into performance categories and estimation of the ability parameter, when the content of the test form changes slightly from year to year and the examinee ability distribution changes. The results indicate that the calibration methods, especially concurrent calibration, produced more stable results than the transformation method.
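The Stocking and Lord transformation method mentioned above chooses linking constants A and B that minimize the squared difference between the common items' test characteristic curves after rescaling. A self-contained sketch with invented parameters, using a coarse grid search in place of a numerical optimizer:

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL response probabilities; theta column broadcasts over items."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Invented common-item estimates; the old-scale values are generated from
# the new-scale ones with a known transform (A=1.1, B=0.2) for checking.
a_new = np.array([1.00, 1.20, 0.80, 1.10])
b_new = np.array([-0.50, 0.00, 0.40, 1.00])
A_true, B_true = 1.1, 0.2
a_old = a_new / A_true
b_old = A_true * b_new + B_true

theta = np.linspace(-4, 4, 41)  # quadrature points
tcc_old = p_2pl(theta[:, None], a_old, b_old).sum(axis=1)

def sl_loss(A, B):
    """Stocking-Lord criterion: squared TCC difference after rescaling."""
    tcc_new = p_2pl(theta[:, None], a_new / A, A * b_new + B).sum(axis=1)
    return np.sum((tcc_old - tcc_new) ** 2)

grid_A = np.arange(0.80, 1.41, 0.01)
grid_B = np.arange(-0.50, 0.51, 0.01)
best_loss, A_hat, B_hat = min((sl_loss(A, B), A, B)
                              for A in grid_A for B in grid_B)
```

Concurrent calibration, by contrast, avoids estimating (A, B) at all by calibrating both forms' data in a single run, which is one reason it can be more stable.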
11.
A potential concern for individuals interested in using item response theory (IRT) with achievement test data is that such tests have been specifically designed to measure content areas related to course curriculum, and students taking the tests at different points in their coursework may not constitute samples from the same population. In this study, data were obtained from three administrations of two forms of a Biology achievement test. Data from the newer of the two forms were collected at a spring administration, made up of high school sophomores just completing the Biology course, and at a fall administration, made up mostly of seniors who had completed their instruction in the course 6–18 months prior to the test administration. Data from the older form, already on scale, were collected at only a fall administration, where the sample was comparable to the newer form's fall sample. IRT and conventional item difficulty parameter estimates for the common items across the two forms were compared for each of the two form/sample combinations. In addition, conventional and IRT score equatings were performed between the new and old forms for each of the form/sample combinations. Widely disparate results were obtained between the equatings based on the two form/sample combinations. Conclusions are drawn about the use of both classical test theory and IRT in situations such as that studied, and implications of the results for achievement test validity are also discussed.
12.
Based on item response theory, this paper introduces the significance and models of the test equating problem, analyzes the principles of test equating, and applies least squares estimation to the transformation coefficients involved, thereby realizing item parameter equating and true score equating under item response theory.
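Least-squares estimation of the transformation coefficients amounts to fitting b_old ≈ A·b_new + B over the common items; a minimal sketch with invented difficulty estimates:

```python
import numpy as np

# Invented common-item difficulty estimates on each form's scale.
b_new = np.array([-1.0, -0.2, 0.4, 1.1, 1.8])
b_old = np.array([-0.8, -0.1, 0.6, 1.3, 2.0])

# Ordinary least squares fit of b_old = A * b_new + B.
A, B = np.polyfit(b_new, b_old, 1)

# Rescale new-form difficulties onto the old scale.
b_new_on_old_scale = A * b_new + B
```

Once (A, B) is estimated, item parameter equating follows directly (b* = A·b + B, a* = a/A), and true score equating compares the two forms' test characteristic curves on the common scale.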
13.
Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two- and three-parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL- or 3PL-based vertical scaling in the presence of LPE-induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades.
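Samejima's logistic positive exponent (LPE) model raises a logistic curve to a power ξ, which makes the item response function asymmetric; ξ = 1 recovers the ordinary 2PL. A sketch with invented parameters:

```python
import numpy as np

def p_lpe(theta, a, b, xi):
    """LPE item response function: a 2PL curve raised to the power xi.
    xi = 1 gives the symmetric 2PL; xi > 1 pulls the lower tail down,
    producing the asymmetry standard models cannot absorb."""
    return (1.0 / (1.0 + np.exp(-a * (theta - b)))) ** xi

theta = np.linspace(-3, 3, 7)
p_sym = p_lpe(theta, 1.0, 0.0, 1.0)    # ordinary 2PL
p_asym = p_lpe(theta, 1.0, 0.0, 2.0)   # asymmetric: everywhere <= the 2PL
```

It is this small, systematic departure from the fitted 2PL/3PL shape that, accumulated across grade-level linkings, can distort the vertical scale.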
14.
Wendy M. Yen 《Journal of Educational Measurement》1993,30(3):187-213
Performance assessments appear on a priori grounds to be likely to produce far more local item dependence (LID) than that produced in the use of traditional multiple-choice tests. This article (a) defines local item independence, (b) presents a compendium of causes of LID, (c) discusses some of LID's practical measurement implications, (d) details some empirical results for both performance assessments and multiple-choice tests, and (e) suggests some strategies for managing LID in order to avoid negative measurement consequences.
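A common screen for local item dependence is Yen's Q3 statistic: the correlation between pairs of item residuals after subtracting the model-implied expected score. The sketch below simulates locally independent 2PL data with invented parameters and, for simplicity, uses the true abilities where a real analysis would use estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_2pl(theta, a, b):
    """2PL response probabilities; examinee vector broadcasts over items."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))

# Invented parameters: 4 items, 2000 simulated examinees.
a = np.array([1.0, 1.2, 0.8, 1.1])
b = np.array([-0.5, 0.0, 0.5, 1.0])
theta = rng.standard_normal(2000)

P = p_2pl(theta, a, b)                        # model-expected scores
U = (rng.random(P.shape) < P).astype(float)   # simulated 0/1 responses
D = U - P                                     # residuals

Q3 = np.corrcoef(D, rowvar=False)             # 4 x 4 Q3 matrix
```

Near-zero off-diagonal values are expected here because the data were generated under local independence; item pairs sharing a stimulus or scoring dependency would show noticeably positive Q3 values.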
15.
Peter Baldwin 《Journal of Educational Measurement》2011,48(1):1-11
Growing interest in fully Bayesian item response models begs the question: To what extent can model parameter posterior draws enhance existing practices? One practice that has traditionally relied on model parameter point estimates but may be improved by using posterior draws is the development of a common metric for two independently calibrated test forms. Before parameter estimates from independently calibrated forms can be compared, at least one form's estimates must be adjusted such that both forms share a common metric. Because this adjustment is estimated, there is a propagation-of-error effect when it is applied. This effect is typically ignored, which leads to overconfidence in the adjusted estimates; yet, when model parameter posterior draws are available, it may be accounted for with a simple sampling strategy. In this paper, it is shown using simulated data that the proposed sampling strategy results in adjusted posteriors with coverage properties superior to those obtained using traditional point-estimate-based methods.
16.
Xue Baoshan 《Journal of Guiyang University (Natural Science Edition)》2010,5(1)
This paper discusses several commonly used item parameter estimation methods based on item response theory (IRT), analyzes the strengths and weaknesses of each estimation method and its domain of applicability, and provides a theoretical reference for building an IRT-based item bank system.
17.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
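The effort-moderated idea (responses faster than an item's time threshold are scored as chance-level rapid guesses, the rest by the usual 3PL) can be sketched as a likelihood function. All thresholds, parameters, and data below are invented for illustration:

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL response probability."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def effort_moderated_loglik(theta, u, rt, a, b, c, rt_threshold, k=4):
    """Log-likelihood sketch: responses faster than an item's time
    threshold are treated as rapid guesses correct with chance
    probability 1/k (k = number of options); the rest follow the 3PL."""
    solution = rt >= rt_threshold                     # solution behavior
    p = np.where(solution, p_3pl(theta, a, b, c), 1.0 / k)
    p = np.clip(p, 1e-9, 1.0 - 1e-9)
    return float(np.sum(u * np.log(p) + (1.0 - u) * np.log(1.0 - p)))

# Two items answered faster than their thresholds: both count as rapid
# guesses, so theta does not affect the likelihood of these responses.
u = np.array([1.0, 0.0])
rt = np.array([2.0, 3.0])
ll = effort_moderated_loglik(0.0, u, rt,
                             a=np.array([1.0, 1.0]),
                             b=np.array([0.0, 0.0]),
                             c=np.array([0.2, 0.2]),
                             rt_threshold=np.array([5.0, 5.0]))
```

Because rapid-guessed responses contribute only a constant 1/k term, they carry no information about θ, which is precisely how the model shields proficiency estimates from low-effort responding.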
18.
R. Darrell Bock Eiji Muraki Will Pfeiffenberger 《Journal of Educational Measurement》1988,25(4):275-285
Differential linear drift of item location parameters over a 10-year period is demonstrated in data from the College Board Physics Achievement Test. The relative direction of drift is associated with the content of the items and reflects changing emphases in the physics curricula of American secondary schools. No evidence of drift in the discriminating power parameters was found. Statistical procedures for detecting, estimating, and accounting for item parameter drift in item pools for long-term testing programs are proposed.
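A simple screen for the linear location-parameter drift described above is to regress each item's difficulty estimates on administration year and flag large slopes. The data and flagging threshold below are invented, not from the study:

```python
import numpy as np

years = np.arange(10)  # ten annual administrations

# Invented per-year difficulty estimates: item 0 drifts upward,
# item 1 drifts downward, item 2 is stable (noise omitted for clarity).
b_by_item = np.array([
    0.00 + 0.05 * years,    # +0.05 logits per year
    0.50 - 0.04 * years,    # -0.04 logits per year
    -0.30 + 0.00 * years,   # stable
])

# Linear-trend slope per item, then flag drift above a chosen threshold.
slopes = np.array([np.polyfit(years, b_t, 1)[0] for b_t in b_by_item])
flagged = np.abs(slopes) > 0.03   # illustrative drift criterion
```

With real estimates the regression would weight by the standard errors of the yearly calibrations; opposite-signed slopes across content areas are exactly the differential drift pattern the article reports.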
19.