Similar Literature
 Found 20 similar documents; search took 31 ms
1.
The purpose of this research was to recommend an item bias procedure when the number of minority examinees is too small to use preferred three-parameter IRT methods. The chi-square, Angoff delta-plot, and pseudo-IRT indices were compared with both real and simulated data. For the real test data a criterion of known bias had been established by cross-validated IRT-3 results. The findings from the Math Test and the simulated test were consistent. The pseudo-IRT approach was best (measured by both correlations and percent agreement) in detecting criterion bias. The chi-square was close in accuracy to the pseudo-IRT index. The Angoff delta-plot method was found to be inadequate on both heuristic and empirical grounds. In extreme cases it even identified items as biased against whites that were simulated to be biased against blacks. However, a modified Angoff index, where p-value differences were regressed on item point biserials (and the residualized values used as the index), was nearly as good as the chi-square in identifying known bias. A final caution was offered regarding the use of item bias techniques. The statistical flags should never be used mechanically to discard items; rather they should be used to inspect items for possible differences in meaning.
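The modified Angoff index described in this abstract (p-value differences regressed on item point biserials, with the residuals used as the index) can be sketched as follows. This is a minimal illustration assuming ordinary least squares with a single predictor; the function name `residualized_index` and the data values are hypothetical and do not come from the study.

```python
from statistics import mean


def residualized_index(p_diff, point_biserial):
    """Regress p-value differences on point biserials (simple OLS)
    and return one residual per item."""
    x_bar = mean(point_biserial)
    y_bar = mean(p_diff)
    sxx = sum((x - x_bar) ** 2 for x in point_biserial)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(point_biserial, p_diff))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed difference minus the difference predicted
    # from the item's point biserial.
    return [y - (intercept + slope * x)
            for x, y in zip(point_biserial, p_diff)]


# Hypothetical illustration data: five items, the third of which has an
# unusually large group difference for its point biserial.
point_biserial = [0.20, 0.30, 0.40, 0.50, 0.60]
p_diff = [0.04, 0.06, 0.20, 0.10, 0.12]
residuals = residualized_index(p_diff, point_biserial)
flagged = max(range(len(residuals)), key=lambda i: residuals[i])
```

In line with the caution in the abstract, an item with a large residual would be a candidate for content inspection, not for automatic removal.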

2.
Studies of differential item functioning under item response theory require that item parameter estimates be placed on the same metric before comparisons can be made. The present study compared the effects of three methods for linking metrics on detection of differential item functioning: a weighted mean and sigma method (WMS), the test characteristic curve method (TCC), and the minimum chi-square method (MCS). Both iterative and noniterative linking procedures were compared for each method. Results indicated that detection of differentially functioning items following linking via the test characteristic curve method gave the most accurate results when the sample size was small. When the sample size was large, results for the three linking methods were essentially the same. Iterative linking provided an improvement in detection of differentially functioning items over noniterative linking, particularly with the .05 alpha level. The weighted mean and sigma method showed greater improvement with iterative linking than either the test characteristic curve or minimum chi-square method.
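To make the idea of placing item parameters on a common metric concrete, the sketch below implements the plain (unweighted) mean and sigma method. Note that the abstract studies a weighted variant whose weights are not given here, so this is an assumption-laden simplification; the function names and the difficulty values are hypothetical.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_source, b_target):
    """Linking constants A (slope) and B (intercept) from the difficulty
    estimates of the common items on the source and target metrics."""
    A = pstdev(b_target) / pstdev(b_source)
    B = mean(b_target) - A * mean(b_source)
    return A, B


def rescale_item(a, b, A, B):
    """Move one item's discrimination and difficulty onto the target metric."""
    return a / A, A * b + B


# Hypothetical common-item difficulties on the two metrics.
b_source = [-1.0, 0.0, 1.0]
b_target = [-1.5, 0.5, 2.5]
A, B = mean_sigma_constants(b_source, b_target)
a_new, b_new = rescale_item(1.0, 1.0, A, B)
```

Iterative linking, as compared in the abstract, would repeat this step after removing items flagged as differentially functioning from the set used to compute the constants.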

3.
Increasing use of item pools in large-scale educational assessments calls for an appropriate scaling procedure to achieve a common metric among field-tested items. The present study examines scaling procedures for developing a new item pool under a spiraled block linking design. Three scaling procedures are considered: (a) concurrent calibration, (b) separate calibration with one linking, and (c) separate calibration with three sequential linkings. Evaluation across varying sample sizes and item pool sizes suggests that calibrating an item pool simultaneously results in the most stable scaling. The separate calibration with linking procedures produced larger scaling errors as the number of linking steps increased. Haebara's item characteristic curve linking performed better than the test characteristic curve (TCC) linking method. The present article provides an analytic illustration that the test characteristic curve method may fail to find global solutions in polytomous items. Finally, comparison of the single- and mixed-format item pools suggests that the use of polytomous items as the anchor can improve the overall scaling accuracy of the item pools.

4.
Reading and Mathematics tests of multiple-choice items for grades Kindergarten through 9 were vertically scaled using the three-parameter logistic model and two different scaling procedures: concurrent and separate by grade groups. Item parameters were estimated using Markov chain Monte Carlo methodology while fixing the grade 4 population abilities to have a standard normal distribution. For the separate grade-groups scaling, grade groupings were linked using the Stocking and Lord test characteristic curve procedure. Abilities were estimated using the maximum-likelihood method. In either content area, scatterplots of item difficulty, discrimination, and ability estimates from the two methods showed consistently strong linear relationships. However, as grade deviated from the base grade of four, the best-fit line through the pairs of item discriminations started to rotate away from the identity line. This indicated that the discrimination estimates from the separate grade-groups procedure for extreme grades were, on average, higher than those from the concurrent analysis. The study also observed some systematic change in score variability across grades. In general, the two vertical scaling approaches yielded similar results at more grades in Reading than in Mathematics.

5.
AN ITERATIVE ITEM BIAS DETECTION METHOD   Cited by: 1 (self-citations: 0, citations by others: 1)
Two strategies for assessing item bias are discussed: methods that compare (transformed) item difficulties unconditional on ability level and methods that compare the probabilities of correct response conditional on ability level. In the present study, the logit model was used to compare the probabilities of correct response to an item by members of two groups, these probabilities being conditional on the observed score. Here the observed score serves as an indicator of ability level. The logit model was iteratively applied: In the Tth iteration, the T items with the highest value of the bias statistic are excluded from the test, and the observed score indicator of ability for the (T + 1)th iteration is computed from the remaining items. This method was applied to simulated data. The results suggest that the iterative logit method is a substantial improvement on the noniterative one, and that the iterative method is very efficient in detecting biased and unbiased items.

6.
Empirical studies demonstrated Type-I error (TIE) inflation (especially for highly discriminating easy items) of the Mantel-Haenszel chi-square test for differential item functioning (DIF), when data conformed to item response theory (IRT) models more complex than Rasch, and when IRT proficiency distributions differed only in means. However, no published study manipulated proficiency variance ratio (VR). Data were generated with the three-parameter logistic (3PL) IRT model. Proficiency VRs were 1, 2, 3, and 4. The present study suggests inflation may be greater, and may affect all highly discriminating items (low, moderate, and high difficulty), when IRT proficiency distributions of reference and focal groups differ also in variances. Inflation was greatest on the 21-item test (vs. 41) and 2,000 total sample size (vs. 1,000). Previous studies had not systematically examined sample size ratio. Sample size ratio of 1:1 produced greater TIE inflation than 3:1, but primarily for total sample size of 2,000.

7.
This Monte Carlo study examined the effect of complex sampling of items on the measurement of differential item functioning (DIF) using the Mantel-Haenszel procedure. Data were generated using a 3-parameter logistic item response theory model according to the balanced incomplete block (BIB) design used in the National Assessment of Educational Progress (NAEP). The length of each block of items and the number of DIF items in the matching variable were varied, as were the difficulty, discrimination, and presence of DIF in the studied item. Block, booklet, pooled booklet, and extra-information analyses were compared to a complete data analysis using the transformed log-odds on the delta scale. The pooled booklet approach is recommended for use when items are selected for examinees according to a BIB design. This study has implications for DIF analyses of other complex samples of items, such as computer administered testing or another complex assessment design.
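The "transformed log-odds on the delta scale" mentioned in this abstract is commonly computed as -2.35 times the natural log of the Mantel-Haenszel common odds ratio. The sketch below shows that calculation for one studied item, assuming examinees have already been grouped into strata on the matching variable; the function name `mh_d_dif` and the counts are hypothetical and are not taken from the study.

```python
import math


def mh_d_dif(strata):
    """Mantel-Haenszel common odds ratio and its delta-scale transform.

    strata: iterable of (ref_correct, ref_incorrect,
                         focal_correct, focal_incorrect) counts,
            one tuple per level of the matching variable.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        num += a * d / n
        den += b * c / n
    alpha = num / den
    # Negative values indicate the item favors the reference group.
    return alpha, -2.35 * math.log(alpha)


# Hypothetical counts for two score strata.
strata = [(30, 10, 20, 20), (40, 10, 30, 20)]
alpha, delta = mh_d_dif(strata)

# A stratum with identical odds in both groups gives alpha = 1, delta = 0.
alpha_null, delta_null = mh_d_dif([(20, 20, 10, 10)])
```

Under a design such as BIB, the practical question studied in the abstract is which examinees and items define the strata (block, booklet, or pooled booklet); the formula itself is unchanged.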

8.
Studies that have investigated differences in examinee performance on items administered in paper-and-pencil form or on a computer screen have produced equivocal results. Certain item administration procedures were hypothesized to be among the most important variables causing differences in item performance and ultimately in test scores obtained from these different administration media. A study where these item administration procedures were made as identical as possible for each presentation medium is described. In addition, a methodology is presented for studying the difficulty and discrimination of items under each presentation medium as a post hoc procedure.

9.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve-based linking or equating. One example is the item response theory-based true score equating, whose goal is to generate a conversion table to relate number-correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three-parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items, and at the same time keeps a large number of linking items.

10.
A new approach for partitioning test items into dimensionally distinct item clusters is introduced. The core of the approach is a new item-pair conditional-covariance-based proximity measure that can be used with hierarchical cluster analysis. An extensive simulation study designed to test the limits of the approach indicates that when approximate simple structure holds, the procedure can correctly partition the test into dimensionally homogeneous item clusters even for very high correlations between the latent dimensions. In particular, the procedure can correctly classify (on average) over 90% of the items for correlations as high as .9. The cooperative role that the procedure can play when used in conjunction with other dimensionality assessment procedures is discussed.

11.
《教育实用测度》2013,26(4):297-312
Certain potential benefits of using item response theory in test construction are discussed and evaluated using the experience and evidence accumulated during 9 years of using a three-parameter model in the construction of major achievement batteries. We also discuss several cautions and limitations in realizing these benefits as well as issues in need of further research. The potential benefits considered are those of getting "sample-free" item calibrations and "item-free" person measurement, automatically equating various tests, decreasing the standard errors of scores without increasing the number of items used by using item pattern scoring, assessing item bias (or differential item functioning) independently of difficulty in a manner consistent with item selection, being able to determine just how adequate a tryout pool of items may be, setting up computer-generated "ideal" tests drawn from pools as targets for test developers, and controlling the standard error of a selected test at any desired set of score levels.

12.
The purpose of the present study was to develop and evaluate two procedures for flagging consequential item parameter drift (IPD) in an operational testing program. The first procedure was based on flagging items that exhibit a meaningful magnitude of IPD using a critical value that was defined to represent barely tolerable IPD. The second procedure was based on flagging items in which the D2 statistic was more than two standard deviations from the mean. Both procedures were implemented using an iterative purification approach to detect IPD. A simulation study was implemented to evaluate the effectiveness of both detection procedures in flagging non-negligible IPD. Both procedures were able to identify IPD, and the iterative purification method provided useful information regarding the consequences of excluding or including a flagged item. The advantages and disadvantages of both procedures as well as possible modifications intended to improve the procedures' effectiveness are discussed in the article.

13.
Linking item parameters to a base scale   Cited by: 1 (self-citations: 0, citations by others: 1)
This paper compares three methods of item calibration (concurrent calibration, separate calibration with linking, and fixed item parameter calibration) that are frequently used for linking item parameters to a base scale. Concurrent and separate calibrations were implemented using BILOG-MG. The Stocking and Lord (Appl Psychol Measure 7:201–210, 1983) characteristic curve method of parameter linking was used in conjunction with separate calibration. The fixed item parameter calibration (FIPC) method was implemented using both BILOG-MG and PARSCALE because the method is carried out differently by the two programs. Both programs use multiple EM cycles, but BILOG-MG does not update the prior ability distribution during FIPC calibration, whereas PARSCALE updates the prior ability distribution multiple times. The methods were compared using simulations based on actual testing program data, and results were evaluated in terms of recovery of the underlying ability distributions, the item characteristic curves, and the test characteristic curves. Factors manipulated in the simulations were sample size, ability distributions, and numbers of common (or fixed) items. The results for concurrent calibration and separate calibration with linking were comparable, and both methods showed good recovery results for all conditions. Between the two fixed item parameter calibration procedures, only the appropriate use of PARSCALE consistently provided item parameter linking results similar to those of the other two methods.

14.
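This paper investigates whether different test methods (multiple choice and information transfer) produce a test method effect in reading comprehension tests. In addition to analyzing students' test scores, the study further analyzes item difficulty values, which were computed using Item Response Theory. The results show that the test method does affect both item difficulty and examinees' test performance; in terms of item difficulty, information transfer items are harder than multiple-choice items.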

15.
When judgmental and statistical procedures are both used to identify potentially gender-biased items in a test, to what extent do the results agree? In this study, both procedures were used to evaluate the items in a statewide, 78-item, multiple-choice test of science knowledge. Only one item was flagged by the sensitivity reviewers as being potentially biased, but this item was not flagged by the statistical procedure. None of the nine items flagged by the Mantel-Haenszel procedure were flagged by the sensitivity reviewers. Eight of the nine statistically flagged items were differentially easier for males. Four of these eight measured the same category of objectives. The authors conclude that both judgmental and statistical procedures provide useful information and that both should be used in test construction. They caution readers that content-validity issues need to be addressed when making decisions based on the results of either procedure.

16.
17.
Using a technique that controlled exposure of items, the investigator examined the effect on mean test score, item difficulty index, and reliability and validity coefficients of the reordering of items within a power test containing ten letter-series-completion items. The results suggest that effects on test statistics from item rearrangement are, generally, minimal. The implication of these findings for test designs involving an item sampling procedure is that performance on an item is minimally influenced by the context in which it occurs.

18.
This article defines and demonstrates a framework for studying differential item functioning (DIF) and differential test functioning (DTF) for tests that are intended to be multidimensional. The procedure introduced here is an extension of unidimensional differential functioning of items and tests (DFIT) recently developed by Raju, van der Linden, & Fleer (1995). To demonstrate the usefulness of these new indexes in a multidimensional IRT setting, two-dimensional data were simulated with known item parameters and known DIF and DTF. The DIF and DTF indexes were recovered reasonably well under various distributional differences of θs after multidimensional linking was applied to put the two sets of item parameters on a common scale. Further studies are suggested in the area of DIF/DTF for intentionally multidimensional tests.

19.
Item positions in educational assessments are often randomized across students to prevent cheating. However, if altering item positions results in any significant impact on students’ performance, it may threaten the validity of test scores. Two widely used approaches for detecting position effects – logistic regression and hierarchical generalized linear modelling – are often inconvenient for researchers and practitioners due to some technical and practical limitations. Therefore, this study introduced a structural equation modeling (SEM) approach for examining item and testlet position effects. The SEM approach was demonstrated using data from a computer-based alternate assessment designed for students with cognitive disabilities from three grade bands (3–5, 6–8, and high school). Item and testlet position effects were investigated in the field-test (FT) items that were received by each student at different positions. Results indicated that the difficulty of some FT items in grade bands 3–5 and 6–8 differed depending on the positions of the items on the test. Also, the overall difficulty of the field-test task in grade bands 6–8 increased as students responded to the field-test task in later positions. The SEM approach provides a flexible method for examining different types of position effects.

20.
A Monte Carlo simulation technique for generating dichotomous item scores is presented that implements (a) a psychometric model with different explicit assumptions than traditional parametric item response theory (IRT) models, and (b) item characteristic curves without restrictive assumptions concerning mathematical form. The four-parameter beta compound-binomial (4PBCB) strong true score model (with two-term approximation to the compound binomial) is used to estimate and generate the true score distribution. The nonparametric item-true score step functions are estimated by classical item difficulties conditional on proportion-correct total score. The technique performed very well in replicating inter-item correlations, item statistics (point-biserial correlation coefficients and item proportion-correct difficulties), first four moments of total score distribution, and coefficient alpha of three real data sets consisting of educational achievement test scores. The technique replicated real data (including subsamples of differing proficiency) as well as the three-parameter logistic (3PL) IRT model (and much better than the 1PL model) and is therefore a promising alternative simulation technique. This 4PBCB technique may be particularly useful as a more neutral simulation procedure for comparing methods that use different IRT models.
