Similar Documents
 20 similar documents found
1.
In this study I compared results of chained linear, Tucker, and Levine observed-score equatings under conditions where the new- and old-form samples were similar in ability and also when they were different in ability. The length of the anchor test was also varied to examine its effect on the three equating methods. The three methods were compared to a criterion equating to obtain estimates of random equating error, bias, and root mean squared error (RMSE). Results showed that, for most studied conditions, chained linear equating produced fairly good equating results in terms of low bias and RMSE. Levine equating also produced low bias and RMSE in some conditions. Although the Tucker method always produced the lowest random equating error, it produced a larger bias and RMSE than either of the other equating methods. As noted in the literature, these results also suggest that either chained linear or Levine equating be used when new- and old-form samples differ in ability and/or when the anchor-to-total correlation is not very high. Finally, by testing the missing-data assumptions of the three equating methods, this study also shows empirically why an equating method is more or less accurate under certain conditions.
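For readers unfamiliar with the mechanics, the sketch below illustrates chained linear equating in a common-item design: the new-form score is first linked linearly to the anchor scale in the new-form group, then from the anchor to the old-form scale in the old-form group. Function and variable names are illustrative, not taken from the study.

```python
import numpy as np

def linear_link(scores_from, scores_to):
    """Linear function mapping the scale of `scores_from` onto the
    scale of `scores_to` by matching means and standard deviations."""
    mu_f, sd_f = scores_from.mean(), scores_from.std(ddof=1)
    mu_t, sd_t = scores_to.mean(), scores_to.std(ddof=1)
    return lambda x: mu_t + (sd_t / sd_f) * (x - mu_f)

def chained_linear_equate(x, new_total, new_anchor, old_anchor, old_total):
    """Chained linear equating: new form -> anchor (estimated in the
    new-form group), then anchor -> old form (estimated in the
    old-form group)."""
    to_anchor = linear_link(new_total, new_anchor)    # step 1, new-form group
    to_old_form = linear_link(old_anchor, old_total)  # step 2, old-form group
    return to_old_form(to_anchor(x))
```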

2.
This study investigated the extent to which log-linear smoothing could improve the accuracy of common-item equating by the chained equipercentile method in small samples of examinees. Examinee response data from a 100-item test were used to create two overlapping forms of 58 items each, with 24 items in common. The criterion equating was a direct equipercentile equating of the two forms in the full population of 93,283 examinees. Anchor equatings were performed in samples of 25, 50, 100, and 200 examinees, with 50 pairs of samples at each size level. Four equatings were performed with each pair of samples: one based on unsmoothed distributions and three based on varying degrees of smoothing. Smoothing reduced, by at least half, the sample size required for a given degree of accuracy. Smoothing that preserved only two moments of the marginal distributions resulted in equatings that failed to capture the curvilinearity in the population equating.
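The abstract does not specify the smoothing model, but log-linear presmoothing conventionally fits a polynomial log-linear (Poisson) model to the score frequencies; a degree-C fit preserves the first C moments of the observed distribution, which is why a degree-2 fit (mean and variance only) can miss curvilinearity. A minimal sketch, assuming statsmodels and illustrative names:

```python
import numpy as np
import statsmodels.api as sm

def loglinear_smooth(freqs, degree):
    """Fit log f(x) = b0 + b1*x + ... + b_degree*x**degree by Poisson
    regression on the observed score frequencies.  The fitted
    frequencies match the first `degree` moments of the data."""
    scores = np.arange(len(freqs)) / max(len(freqs) - 1, 1)  # rescaled for stability
    X = sm.add_constant(np.column_stack([scores**k for k in range(1, degree + 1)]))
    fit = sm.GLM(np.asarray(freqs, dtype=float), X,
                 family=sm.families.Poisson()).fit()
    return fit.fittedvalues  # smoothed frequencies
```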

3.
Five methods for equating in a random groups design were investigated in a series of resampling studies with samples of 400, 200, 100, and 50 test takers. Six operational test forms, each taken by 9,000 or more test takers, were used as item pools to construct pairs of forms to be equated. The criterion equating was the direct equipercentile equating in the group of all test takers. Equating accuracy was indicated by the root-mean-squared deviation, over 1,000 replications, of the sample equatings from the criterion equating. The methods investigated were equipercentile equating of smoothed distributions, linear equating, mean equating, symmetric circle-arc equating, and simplified circle-arc equating. The circle-arc methods produced the most accurate results for all sample sizes investigated, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.
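The circle-arc methods constrain the equating function to pass through three points: the two ends of the meaningful score range and a middle point obtained from a simpler equating (e.g., mean equating) at the center of the distribution. A geometric sketch of that idea follows, with illustrative names; it is not the authors' exact implementation.

```python
import numpy as np

def circle_arc_equate(x, p_low, p_mid, p_high):
    """Evaluate the circular arc through three (new-score, old-score)
    points at new-form score x.  Assumes the points are not collinear;
    if they are, the arc degenerates to the line through them."""
    (x1, y1), (x2, y2), (x3, y3) = p_low, p_mid, p_high
    # Center (a, b) of the circle through the three points
    A = np.array([[2 * (x2 - x1), 2 * (y2 - y1)],
                  [2 * (x3 - x2), 2 * (y3 - y2)]])
    c = np.array([x2**2 - x1**2 + y2**2 - y1**2,
                  x3**2 - x2**2 + y3**2 - y2**2])
    a, b = np.linalg.solve(A, c)
    r = np.hypot(x1 - a, y1 - b)
    # Take the branch of the circle on which the midpoint lies
    sign = 1.0 if y2 >= b else -1.0
    return b + sign * np.sqrt(np.maximum(r**2 - (x - a)**2, 0.0))
```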

4.
The concept of invariance in equating and linking is traced from the 1950s to the present. A number of research studies that examined population invariance are reviewed. Theory and research suggest that linkings other than equatings are population dependent. Theory also indicates that equatings are population dependent, although when test forms are built to detailed tables of content and statistical specifications and alternate forms are very similar to one another, the research suggests that equatings might be approximately population invariant. Suggestions are made about further research that should be conducted on methodology for examining population invariance and on empirical research to better understand the conditions under which equatings are sufficiently population invariant for practical purposes.

5.
One of the major assumptions of item response theory (IRT) models is that performance on a set of items is unidimensional, that is, the probability of successful performance by examinees on a set of items can be modeled by a mathematical model that has only one ability parameter. In practice, this strong assumption is likely to be violated. An important pragmatic question to consider is: What are the consequences of these violations? In this research, evidence is provided of violations of unidimensionality on the verbal scale of the GRE Aptitude Test, and the impact of these violations on IRT equating is examined. Previous factor analytic research on the GRE Aptitude Test suggested that two verbal dimensions, discrete verbal (analogies, antonyms, and sentence completions) and reading comprehension, existed. Consequently, the present research involved two separate calibrations (homogeneous) of discrete verbal items and reading comprehension items as well as a single calibration (heterogeneous) of all verbal item types. Thus, each verbal item was calibrated twice and each examinee obtained three ability estimates: reading comprehension, discrete verbal, and all verbal. The comparability of ability estimates based on homogeneous calibrations (reading comprehension or discrete verbal) to each other and to the all-verbal ability estimates was examined. The effects of homogeneity of the item calibration pool on estimates of item discrimination were also examined. Then the comparability of IRT equatings based on homogeneous and heterogeneous calibrations was assessed. The effects of calibration homogeneity on ability parameter estimates and discrimination parameter estimates are consistent with the existence of two highly correlated verbal dimensions. IRT equating results indicate that although violations of unidimensionality may have an impact on equating, the effect may not be substantial.
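The "one ability parameter" assumption can be stated concretely. In the three-parameter logistic model, for example, every item's success probability is driven by a single ability θ:

```latex
P_j(\theta) = c_j + (1 - c_j)\,\frac{1}{1 + \exp\!\bigl[-D\,a_j(\theta - b_j)\bigr]}
```

where a_j, b_j, and c_j are the item's discrimination, difficulty, and lower asymptote, and D is a scaling constant (often 1.7). Multidimensionality means no single θ makes this hold for all items at once.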

6.
Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Chained mean equating produced the most accurate results at low scores, but overall the two circle-arc methods were the most accurate, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.
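The accuracy criterion is straightforward to compute; a minimal sketch with illustrative names:

```python
import numpy as np

def rmsd_per_score(sample_equatings, criterion):
    """Root-mean-squared difference of sample equatings from the
    criterion equating, taken over replications at each score point.
    `sample_equatings` has shape (n_replications, n_score_points);
    `criterion` has shape (n_score_points,)."""
    dev = np.asarray(sample_equatings) - np.asarray(criterion)
    return np.sqrt(np.mean(dev**2, axis=0))
```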

7.
The equating accuracy of content-representative anchors versus nonrepresentative but substantially longer anchors is compared. Content representation was defined as a match between the anchor and the total test in the percentage of items in each of several content areas. Through a chain of equatings it was found that content representation in anchors was critical for the testing program studied. The results are explained in terms of differences in the mean profiles (by content area) of the nonrandom groups used for equating.

8.
This paper discusses various issues involved in using the Rasch model with multiple choice tests. A modified test that is much more powerful shows the value of Wright and Panchapakesan's test as evidence of model fit to be questionable. According to the new test, the model failed to fit 68% of the items in the Anchor Test Study. Effects of such misfit on test equating are demonstrated. Results of some past studies purporting to support the Rasch model are shown to be irrelevant, or to yield the conclusion that the Rasch model did not fit the data. Issues like "objectivity" and consistent estimation are shown to be unimportant in the selection of a latent trait model. Thus, available evidence shows the Rasch model to be unsuitable for multiple choice items.
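For context, the dichotomous Rasch model forces a common discrimination across all items; this is the restriction the fit tests at issue are probing:

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

where θ_n is the ability of person n and b_i the difficulty of item i.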

9.
Combinations of five methods of equating test forms and two methods of selecting samples of students for equating were compared for accuracy. The two sampling methods were representative sampling from the population and matching samples on the anchor test score. The equating methods were the Tucker, Levine equally reliable, chained equipercentile, frequency estimation, and item response theory (IRT) 3PL methods. The tests were the Verbal and Mathematical sections of the Scholastic Aptitude Test. The criteria for accuracy were measures of agreement with an equivalent-groups equating based on more than 115,000 students taking each form. Much of the inaccuracy in the equatings could be attributed to overall bias. The results for all equating methods in the matched samples were similar to those for the Tucker and frequency estimation methods in the representative samples; these equatings made too small an adjustment for the difference in the difficulty of the test forms. In the representative samples, the chained equipercentile method showed a much smaller bias. The IRT (3PL) and Levine methods tended to agree with each other and were inconsistent in the direction of their bias.

10.
When good model-data fit is observed, the Many-Facet Rasch (MFR) model acts as a linking and equating model that can be used to estimate student achievement, item difficulties, and rater severity on the same linear continuum. Given sufficient connectivity among the facets, the MFR model provides estimates of student achievement that are equated to control for differences in rater severity. Although several different linking designs are used in practice to establish connectivity, the implications of design differences have not been fully explored. Research on the impact of model-data fit on the quality of MFR model-based adjustments for rater severity is also limited. This study explores the effects of linking designs and model-data fit for raters on the interpretation of student achievement estimates within the context of performance assessments in music. Results indicate that performances cannot be effectively adjusted for rater effects when inadequate linking or model-data fit is present.
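In one common rating-scale form, the MFR model places students, items, and raters on a single logit scale, which is what makes the severity adjustment possible:

```latex
\log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \lambda_j - \tau_k
```

where θ_n is student achievement, δ_i item difficulty, λ_j rater severity, and τ_k the threshold between rating categories k-1 and k.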

11.
This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small, except for the mean/sigma method. The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is also compared to the chained equipercentile equating; we observe that the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating, with an exception at low scores.
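For reference, the mean/sigma method is the simplest of the scale transformation approaches compared here: it derives the linear conversion between the two θ scales from the common items' difficulty estimates alone, whereas the characteristic curve methods use the full item response functions. With the common-item difficulty estimates b̂ on the new and old scales:

```latex
A = \frac{\sigma(\hat b^{\text{old}})}{\sigma(\hat b^{\text{new}})}, \qquad
B = \mu(\hat b^{\text{old}}) - A\,\mu(\hat b^{\text{new}}), \qquad
\theta^\ast = A\theta + B, \quad b_j^\ast = A b_j + B, \quad a_j^\ast = a_j / A
```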

12.
A formal analysis of the effects of item deletion on equating/scaling functions and reported score distributions is presented. There are two components of the present analysis: analytical and empirical. The analytical decomposition demonstrates how the effects of item characteristics, test properties, individual examinee responses, and rounding rules combine to produce the item deletion effect on the equating/scaling function and candidate scores. In addition to demonstrating how the deleted item's psychometric characteristics can affect the equating function, the analytical component of the report examines the effects of not scoring the item versus scoring all options correct, the effects of re-equating versus not re-equating, and the interaction between the decision to re-equate or not and the scoring option chosen for the flawed item. The empirical portion of the report uses data from the May 1982 administration of the SAT, which contained the circles item, to illustrate the effects of item deletion on reported score distributions and equating functions. The empirical data verify what the analytical decomposition predicts.

13.
The Multidimensional School Anger Inventory–Revised (MSAI-R) is a measurement tool for evaluating high school students' anger. Its psychometric features have been tested in the USA, Australia, Japan, Guatemala, and Italy. This study investigates the factor structure and psychometric quality of the Persian version of the MSAI-R using data from an administration of the inventory to 585 Iranian high school students. The study adopted the four-factor underlying structure of high school student anger derived through factor analysis in previous validation studies, which consists of School Hostility, Anger Experience, Positive Coping, and Destructive Expression. Confirmatory factor analysis of this four-factor model indicated that it fit the data better than a one-factor baseline model, although the fit was not perfect. The Rasch model showed a very high internal consistency among items, with no item misfitting; however, our results suggest that some items should be added to Positive Coping and Destructive Expression to represent those constructs sufficiently. This finding is in agreement with Boman, Curtis, Furlong, and Smith's Rasch analysis of the MSAI-R with an Australian sample. Overall, the results from this study support the psychometric features of the Persian MSAI-R. However, results from some test items also point to the dangers inherent in adapting the same test stimuli to widely divergent cultures.

14.
In operational testing programs using item response theory (IRT), item parameter invariance is threatened when an item appears in a different location on the live test than it did when it was field tested. This study utilizes data from a large state's assessments to model change in Rasch item difficulty (RID) as a function of item position change, test level, test content, and item format. As a follow-up to the real data analysis, a simulation study was performed to assess the effect of item position change on equating. Results from this study indicate that item position change significantly affects change in RID. In addition, although the test construction procedures used in the investigated state seem to somewhat mitigate the impact of item position change, equating results might be impacted in testing programs where other test construction practices or equating methods are utilized.

15.
Research has suggested that inappropriate or misfitting response patterns may have detrimental effects on the quality and validity of measurement. It has been suggested that factors like language and ethnic background are related to the generation of misfitting response patterns, but the empirical research on this is scant. This research analyzes data from three testing cycles of the National Curriculum tests in mathematics in England using the Rasch model. It was found that pupils having English as an additional language and pupils belonging to ethnic minorities are significantly more likely to generate aberrant response patterns. However, within the groups of pupils belonging to ethnic minorities, those who speak English as an additional language are not significantly more likely to generate misfitting response patterns. This may indicate that the ethnic background effect is more significant than the effect of the first language spoken. The results suggest that pupils having English as an additional language and pupils belonging to ethnic minorities are mismeasured significantly more than the remaining pupils on the mathematics National Curriculum tests. More research is needed to generalize the results to other subjects and contexts.
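Misfitting response patterns of this kind are usually flagged with residual-based person-fit statistics; the sketch below computes a Rasch outfit mean square for one pupil (illustrative names; the statistic used in the study may differ):

```python
import numpy as np

def person_outfit(responses, theta, b):
    """Outfit mean square for one response pattern under the Rasch
    model: the mean of squared standardized residuals.  Values well
    above 1 flag aberrant (misfitting) patterns."""
    b = np.asarray(b, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(theta - b)))  # Rasch P(correct) per item
    z2 = (np.asarray(responses) - p) ** 2 / (p * (1.0 - p))
    return z2.mean()
```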

16.
This article describes a preliminary investigation of an empirical Bayes (EB) procedure for using collateral information to improve equating of scores on test forms taken by small numbers of examinees. Resampling studies were done on two different forms of the same test. In each study, EB and non-EB versions of two equating methods—chained linear and chained mean—were applied to repeated small samples drawn from a large data set collected for a common-item equating. The criterion equating was the chained linear equating in the large data set. Equatings of other forms of the same test provided the collateral information. New-form sample size was varied from 10 to 200; reference-form sample size was constant at 200. One of the two new forms did not differ greatly in difficulty from its reference form, as was the case for the equatings used as collateral information. For this form, the EB procedure improved the accuracy of equating with new-form samples of 50 or fewer. The other new form was much more difficult than its reference form; for this form, the EB procedure made the equating less accurate.
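The abstract does not spell out the EB machinery, but the general idea is precision-weighted shrinkage of a noisy small-sample quantity toward the collateral information; a generic sketch, not the authors' exact procedure:

```python
def eb_shrink(sample_est, sample_var, prior_mean, prior_var):
    """Empirical Bayes estimate of, e.g., an equated score at one raw
    score point: shrink the small-sample estimate toward the mean of
    the collateral equatings, weighting by relative precision."""
    w = prior_var / (prior_var + sample_var)  # weight on the observed data
    return w * sample_est + (1.0 - w) * prior_mean
```

When the new-form sample is tiny, sample_var dominates and the estimate leans on the collateral information, which helps only if the collateral forms resemble the form being equated; this is consistent with the mixed results reported above.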

17.
The Progressive Matrices items require varying degrees of analytical reasoning. Individuals high on the underlying trait measured by the Raven should score high on the test. Latent trait models applied to Raven-type data provide a useful methodology for examining the tenability of this hypothesis. In this study the Rasch model was applied to investigate the fit of observed performance on Raven items to what was expected by the model for individuals at six different levels of the underlying scale. For the most part the model showed a good fit to the test data. The findings were similar to previous empirical work that has investigated the behavior of Rasch test scores. In three instances, however, the item fit statistic was relatively large. A closer study of the “misfitting” items revealed that two items were of extreme difficulty, which is likely to contribute to the misfit. The study raises issues about the use of the Rasch model with small samples. Other issues related to the application of the Rasch model to Raven-type data are discussed.

18.
Beginning in 2007, the Hong Kong Certificate of Education Examination (HKCEE) adopted standards-referenced reporting to grade candidates' performance in the Chinese Language and English Language subjects. A Rasch model with structural parameters was used in the score processing. This paper introduces the model and some of its main properties, derives the estimating equations for Joint Maximum Likelihood Estimation, and reports the main results of applying the model to standards-referenced grading in the HKCEE.
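The abstract does not reproduce the derivation, but for the ordinary dichotomous Rasch model (setting aside the structural parameters mentioned) the JML estimating equations simply equate each observed raw score to its model expectation:

```latex
r_n = \sum_i \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}, \qquad
s_i = \sum_n \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

where r_n is person n's raw score and s_i the number of correct responses to item i; the equations are solved iteratively for all θ_n and b_i jointly.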

19.
The teaching and learning of mathematics in schools has drawn tremendous attention since the education reform in Taiwan. In addition to assessing cognitive abilities, the Taiwan Assessment of Student Achievement in Mathematics (TASA-MAT) collects background information to help depict average student achievement in schools in an educational context. The purpose of this study was to investigate the relationships between student achievement in mathematics and student background characteristics. The data for this study were derived from the sample for the 2005 TASA-MAT Sixth-Grade Main Survey in Taiwan; the average age of sixth-grade students in Taiwan, including this sample, is 11 years. Student socioeconomic status (SES) and student learning-goal orientation were specified as predictor variables of student performance in mathematics. The results indicate that better performance in mathematics tended to be associated with higher SES and stronger mastery goal orientation. The SES factor accounted for 4.98% of the variance, and student learning-goal orientation accounted for an additional 10.61% of the variance. The major implication of this study is that goal orientation was much more significant than SES in predicting student performance in mathematics. In addition, the Rasch model treatment of the ordinal response-category data is a novel approach to scoring the goal-orientation items, and the corresponding results in this study were satisfactory.

20.
Tucker and chained linear equatings were evaluated in two testing scenarios. In Scenario 1, referred to as rater comparability scoring and equating, the anchor-to-total correlation is often very high for the new form but moderate for the reference form. This may adversely affect the results of Tucker equating, especially if the new and reference form samples differ in ability. In Scenario 2, the new and reference form samples are randomly equivalent but the correlation between the anchor and total scores is low. When the correlation between the anchor and total scores is low, Tucker equating assumes that the new and reference form samples are similar in ability (which, with randomly equivalent groups, is the correct assumption). Thus Tucker equating should produce accurate results. Results indicated that in Scenario 1, the Tucker results were less accurate than the chained linear equating results. However, in Scenario 2, the Tucker results were more accurate than the chained linear equating results. Some implications are discussed.
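The behavior described follows from how the Tucker method forms its synthetic-population means: the between-group adjustment is carried entirely by the regression slopes of total score on anchor score, so when the anchor-to-total correlation is low the slopes shrink toward zero and the method effectively treats the two samples as equivalent in ability. With population 1 taking new form X, population 2 taking reference form Y, anchor V, and synthetic weights w_1 and w_2:

```latex
\mu_s(X) = \mu_1(X) - w_2\,\gamma_1\,[\mu_1(V) - \mu_2(V)], \qquad
\mu_s(Y) = \mu_2(Y) + w_1\,\gamma_2\,[\mu_1(V) - \mu_2(V)]
```

where γ_1 = cov_1(X, V) / σ_1²(V) and γ_2 = cov_2(Y, V) / σ_2²(V).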
