1.

Simulation Studies of the Effects of Small Sample Size and Studied Item Parameters on SIBTEST and Mantel-Haenszel Type I Error Performance

Louis A. Roussos William F. Stout 《Journal of Educational Measurement》1996,33(2):215-230

Two simulation studies investigated Type I error performance of two statistical procedures for detecting differential item functioning (DIF): SIBTEST and Mantel-Haenszel (MH). Because MH and SIBTEST are based on asymptotic distributions requiring "large" numbers of examinees, the first study examined Type I error for small sample sizes. No significant Type I error inflation occurred for either procedure. Because MH has the potential for Type I error inflation for non-Rasch models, the second study used a markedly non-Rasch test and systematically varied the shape and location of the studied item. When differences in distribution across examinee group of the measured ability were present, both procedures displayed inflated Type I error for certain items; MH displayed the greater inflation. Also, both procedures displayed statistically biased estimation of the zero DIF for certain items, though SIBTEST displayed much less than MH. When no latent distributional differences were present, both procedures performed satisfactorily under all conditions.

2.

Comparing Methods of Assessing Differential Item Functioning in a Computerized Adaptive Testing Environment

Pui-Wa Lei Shu-Ying Chen Lan Yu 《Journal of Educational Measurement》2006,43(3):245-264

Mantel-Haenszel and SIBTEST, which have known difficulty in detecting non-unidirectional differential item functioning (DIF), have been adapted with some success for computerized adaptive testing (CAT). This study adapts logistic regression (LR) and the item-response-theory-likelihood-ratio test (IRT-LRT), capable of detecting both unidirectional and non-unidirectional DIF, to the CAT environment in which pretest items are assumed to be seeded in CATs but not used for trait estimation. The proposed adaptation methods were evaluated with simulated data under different sample size ratios and impact conditions in terms of Type I error, power, and specificity in identifying the form of DIF. The adapted LR and IRT-LRT procedures are more powerful than the CAT version of SIBTEST for non-unidirectional DIF detection. The good Type I error control provided by IRT-LRT under extremely unequal sample sizes and large impact is encouraging. Implications of these and other findings are discussed.

3.

An Investigation of the Power of the Likelihood Ratio Goodness-of-Fit Statistic in Detecting Differential Item Functioning

Robert D. Ankenmann Elizabeth A. Witt Stephen B. Dunbar 《Journal of Educational Measurement》1999,36(4):277-300

The purpose of this study was to investigate the power and Type I error rate of the likelihood ratio goodness-of-fit (LR) statistic in detecting differential item functioning (DIF) under Samejima's (1969, 1972) graded response model. A multiple-replication Monte Carlo study was utilized in which DIF was modeled in simulated data sets which were then calibrated with MULTILOG (Thissen, 1991) using hierarchically nested item response models. In addition, the power and Type I error rate of the Mantel (1963) approach for detecting DIF in ordered response categories were investigated using the same simulated data, for comparative purposes. The power of both the Mantel and LR procedures was affected by sample size, as expected. The LR procedure lacked the power to consistently detect DIF when it existed in reference/focal groups with sample sizes as small as 500/500. The Mantel procedure maintained control of its Type I error rate and was more powerful than the LR procedure when the comparison group ability distributions were identical and there was a constant DIF pattern. On the other hand, the Mantel procedure lost control of its Type I error rate, whereas the LR procedure did not, when the comparison groups differed in mean ability; and the LR procedure demonstrated a profound power advantage over the Mantel procedure under conditions of balanced DIF in which the comparison group ability distributions were identical. The choice and subsequent use of any procedure requires a thorough understanding of the power and Type I error rates of the procedure under varying conditions of DIF pattern, comparison group ability distributions (or, as a surrogate, observed score distributions), and item characteristics.

4.

Transforming SIBTEST to Account for Multilevel Data Structures

Brian F. French W. Holmes Finch 《Journal of Educational Measurement》2015,52(2):159-180

SIBTEST is a differential item functioning (DIF) detection method that is accurate and effective with small samples, in the presence of group mean differences, and for assessment of both uniform and nonuniform DIF. The presence of multilevel data with DIF detection has received increased attention. Ignoring such structure can inflate Type I error. This simulation study examines the performance of newly developed multilevel adaptations of SIBTEST in the presence of multilevel data. Data were simulated in a multilevel framework and both uniform and nonuniform DIF were assessed. Study results demonstrated that naïve SIBTEST and Crossing SIBTEST, ignoring the multilevel data structure, yield inflated Type I error rates, while certain multilevel extensions provided better error and accuracy control.

5.

Examining type I error and power for detection of differential item and testlet functioning

Young-Sun Lee Allan Cohen Maritsa Toro 《Asia Pacific Education Review》2009,10(3):365-375

In this study, the effectiveness of detection of differential item functioning (DIF) and testlet DIF using SIBTEST and Poly-SIBTEST was examined in tests composed of testlets. An example using data from a reading comprehension test showed that results from SIBTEST and Poly-SIBTEST were not completely consistent in the detection of DIF and testlet DIF. Results from a simulation study indicated that SIBTEST appeared to maintain type I error control for most conditions, except in some instances in which the magnitude of simulated DIF tended to increase. This same pattern was present for the Poly-SIBTEST results, although Poly-SIBTEST demonstrated markedly less control of type I errors. Type I error control with Poly-SIBTEST was lower for those conditions for which the ability was unmatched to test difficulty. The power results for SIBTEST were not adversely affected when the size and percent of simulated DIF increased. Although Poly-SIBTEST failed to control type I errors in over 85% of the conditions simulated, in those conditions for which type I error control was maintained, Poly-SIBTEST demonstrated higher power than SIBTEST.

6.

Differential Item Functioning Assessment in Cognitive Diagnostic Modeling: Application of the Wald Test to Investigate DIF in the DINA Model

Likun Hou Jimmy de la Torre Ratna Nandakumar 《Journal of Educational Measurement》2014,51(1):98-125

Analyzing examinees’ responses using cognitive diagnostic models (CDMs) has the advantage of providing diagnostic information. To ensure the validity of the results from these models, differential item functioning (DIF) in CDMs needs to be investigated. In this article, the Wald test is proposed to examine DIF in the context of CDMs. This study explored the effectiveness of the Wald test in detecting both uniform and nonuniform DIF in the DINA model through a simulation study. Results of this study suggest that for relatively discriminating items, the Wald test had Type I error rates close to the nominal level. Moreover, its viability was underscored by the medium to high power rates for most investigated DIF types when DIF size was large. Furthermore, the performance of the Wald test in detecting uniform DIF was compared to that of the traditional Mantel‐Haenszel (MH) and SIBTEST procedures. The results of the comparison study showed that the Wald test was comparable to or outperformed the MH and SIBTEST procedures. Finally, the strengths and limitations of the proposed method and suggestions for future studies are discussed.

7.

A SIBTEST Approach to Testing DIF Hypotheses Using Experimentally Designed Test Items

Daniel M. Bolt 《Journal of Educational Measurement》2000,37(4):307-327

This paper considers a modification of the DIF procedure SIBTEST for investigating the causes of differential item functioning (DIF). One way in which factors believed to be responsible for DIF can be investigated is by systematically manipulating them across multiple versions of an item using a randomized DIF study (Schmitt, Holland, & Dorans, 1993). In this paper, it is shown that the additivity of the index used for testing DIF in SIBTEST motivates a new extension of the method for statistically testing the effects of DIF factors. Because an important consideration is whether or not a studied DIF factor is consistent in its effects across items, a methodology for testing item x factor interactions is also presented. Using data from the mathematical sections of the Scholastic Assessment Test (SAT), the effects of two potential DIF factors, item format (multiple-choice versus open-ended) and problem type (abstract versus concrete), are investigated for gender. Results suggest a small but statistically significant and consistent effect of item format (favoring males for multiple-choice items) across items, and a larger but less consistent effect due to problem type.

8.

Stepwise Analysis of Differential Item Functioning Based on Multiple-Group Partial Credit Model

Eiji Muraki 《Journal of Educational Measurement》1999,36(3):217-232

Bock, Muraki, and Pfeiffenberger (1988) proposed a dichotomous item response theory (IRT) model for the detection of differential item functioning (DIF), and they estimated the IRT parameters and the means and standard deviations of the multiple latent trait distributions. This IRT DIF detection method is extended to the partial credit model (Masters, 1982; Muraki, 1993) and presented as one of the multiple-group IRT models. Uniform and non-uniform DIF items and heterogeneous latent trait distributions were used to generate polytomous responses of multiple groups. The DIF method was applied to this simulated data using a stepwise procedure. The standardized DIF measures for slope and item location parameters successfully detected the non-uniform and uniform DIF items as well as recovered the means and standard deviations of the latent trait distributions. This stepwise DIF analysis based on the multiple-group partial credit model was then applied to the National Assessment of Educational Progress (NAEP) writing trend data.

9.

Simultaneous DIF Amplification and Cancellation: Shealy-Stout's Test for DIF

Ratna Nandakumar 《Journal of Educational Measurement》1993,30(4):293-311

The present study investigates the phenomena of simultaneous DIF amplification and cancellation and SIBTEST's role in detecting such. A variety of simulated test data were generated for this purpose. In addition, real test data from various sources were analyzed. The results from both simulated and real test data, as Shealy and Stout's theory (1993a, 1993b) suggests, show that SIBTEST is effective in assessing DIF amplification and cancellation (partially or fully) at the test score level. Finally, methodological and substantive implications of DIF amplification and cancellation are discussed.

10.

Item-Bundle DIF Hypothesis Testing: Identifying Suspect Bundles and Assessing Their Differential Functioning

Jeffrey A. Douglas Louis A. Roussos William Stout 《Journal of Educational Measurement》1996,33(4):465-484

This article proposes two multidimensional IRT model-based methods of selecting item bundles (clusters of not necessarily adjacent items chosen according to some organizational principle) suspected of displaying DIF amplification. The approach embodied in these two methods is inspired by Shealy and Stout's (1993a, 1993b) multidimensional model for DIF. Each bundle selected by these methods constitutes a DIF amplification hypothesis. When SIBTEST (Shealy & Stout, 1993b) confirms DIF amplification in selected bundles, differential bundle functioning (DBF) is said to occur. Three real data examples illustrate the two methods for suspect bundle selection. The effectiveness of the methods is argued on statistical grounds. A distinction between benign and adverse DIF is made. The decision whether flagged DIF items or DBF bundles display benign or adverse DIF/DBF must depend in part on nonstatistical construct validity arguments. Conducting DBF analyses using these methods should help in the identification of the causes of DIF/DBF.

11.

Data Generation Principles and Software Implementation for DIF Simulation Studies with Dichotomously Scored Data

Zhu Yiyi (朱乙艺) Jiao Liya (焦丽亚) 《考试研究》 (Examination Research) 2012,(6):80-87,19

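Compared with DIF studies based on empirical data, DIF studies based on simulated data not only allow the experimental conditions to be manipulated freely but also yield power and Type I error indices. This article describes in detail the principles for generating dichotomously scored DIF simulation data. The generation process comprises four stages: choosing an approach to producing DIF; choosing an item response theory model; specifying examinee characteristics, item characteristics, and the number of replications; and computing each examinee's probability of a correct response on each item and converting it into dichotomously scored data. The generation process is then demonstrated with the widely used software Excel and the specialized software WinGen3.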
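The four stages listed in this abstract can be sketched in a few lines of Python. This is only an illustration, not the Excel or WinGen3 procedure from the article: it assumes a two-parameter logistic (2PL) model, uniform DIF produced by shifting the studied item's difficulty for the focal group, and standard normal abilities. The function names `p_correct` and `simulate_dif_data` and all parameter values are hypothetical.

```python
import math
import random


def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def simulate_dif_data(n_ref, n_focal, items, dif_item, dif_shift,
                      impact=0.0, seed=0):
    """Return a list of (group, theta, responses) tuples.

    items:     list of (a, b) item parameters
    dif_item:  index of the studied item
    dif_shift: amount added to b for the focal group (uniform DIF)
    impact:    amount subtracted from the focal group's mean ability
    """
    rng = random.Random(seed)
    data = []
    for group, n in (("ref", n_ref), ("focal", n_focal)):
        mean = 0.0 if group == "ref" else -impact
        for _ in range(n):
            theta = rng.gauss(mean, 1.0)  # examinee characteristic
            responses = []
            for j, (a, b) in enumerate(items):
                if group == "focal" and j == dif_item:
                    b = b + dif_shift  # studied item is harder for focal
                p = p_correct(theta, a, b)
                # Convert the probability into a 0/1 score.
                responses.append(1 if rng.random() < p else 0)
            data.append((group, theta, responses))
    return data


# One replication: 4 items, DIF of 0.5 on the second item, no impact.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.0, 1.0)]
data = simulate_dif_data(500, 500, items, dif_item=1, dif_shift=0.5)
```

Repeating the call with different seeds gives the replications over which power or Type I error rates (with `dif_shift=0.0`) would be averaged.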

12.

Testing Features of Graphical DIF: Application of a Regression Correction to Three Nonparametric Statistical Tests

Daniel M. Bolt Mark J. Gierl 《Journal of Educational Measurement》2006,43(4):313-333

Inspection of differential item functioning (DIF) in translated test items can be informed by graphical comparisons of item response functions (IRFs) across translated forms. Due to the many forms of DIF that can emerge in such analyses, it is important to develop statistical tests that can confirm various characteristics of DIF when present. Traditional nonparametric tests of DIF (Mantel-Haenszel, SIBTEST) are not designed to test for the presence of nonuniform or local DIF, while common probability difference (P-DIF) tests (e.g., SIBTEST) do not optimize power in testing for uniform DIF, and thus may be less useful in the context of graphical DIF analyses. In this article, modifications of three alternative nonparametric statistical tests for DIF, Fisher's χ² test, Cochran's Z test, and Goodman's U test (Marascuilo & Slaughter, 1981), are investigated for these purposes. A simulation study demonstrates the effectiveness of a regression correction procedure in improving the statistical performance of the tests when using an internal test score as the matching criterion. Simulation power and real data analyses demonstrate the unique information provided by these alternative methods compared to SIBTEST and Mantel-Haenszel in confirming various forms of DIF in translated tests.

13.

Assessment of Differential Item Functioning for Performance Tasks

Rebecca Zwick John R. Donoghue Angela Grima 《Journal of Educational Measurement》1993,30(3):233-251

14.

Reasons for Gender-Related Differential Item Functioning in a College Admissions Test

Jonathan Wedman 《Scandinavian Journal of Educational Research》2018,62(6):959-970

Gender fairness in testing can be impeded by the presence of differential item functioning (DIF), which potentially causes test bias. In this study, the presence and causes of gender-related DIF were investigated with real data from 800 items answered by 250,000 test takers. DIF was examined using the Mantel–Haenszel and logistic regression procedures. Little DIF was found in the quantitative items and a moderate amount was found in the verbal items. Vocabulary items favored women if sampled from traditionally female domains but generally not vice versa if sampled from male domains. The sentence completion item format in the English reading comprehension subtest favored men regardless of content. The findings, if supported in a cross-validation study, can potentially lead to changes in how vocabulary items are sampled and in the use of the sentence completion format in English reading comprehension, thereby increasing gender fairness in the examined test.

15.

Simulated Tests of Differential Item Functioning Using SIBTEST With and Without Impact

Alan J. Klockars Yoonsun Lee 《Journal of Educational Measurement》2008,45(3):271-285

Monte Carlo simulations with 20,000 replications are reported to estimate the probability of rejecting the null hypothesis regarding DIF using SIBTEST when there is DIF present and/or when impact is present due to differences on the primary dimension to be measured. Sample sizes are varied from 250 to 2000 and test lengths from 10 to 40 items. Results generally support previous findings for Type I error rates and power. Impact is inversely related to test length. The combination of DIF and impact, with the focal group having lower ability on both the primary and secondary dimensions, results in impact partially masking DIF so that items biased toward the reference group are less likely to be detected.

16.

Logistic Regression and Its Use in Detecting Differential Item Functioning in Polytomous Items

Ann W. French Timothy R. Miller 《Journal of Educational Measurement》1996,33(3):315-332

A computer simulation study was conducted to determine the feasibility of using logistic regression procedures to detect differential item functioning (DIF) in polytomous items. One item in a simulated test of 25 items contained DIF; parameters for that item were varied to create three conditions of nonuniform DIF and one of uniform DIF. Item scores were generated using a generalized partial credit model, and the data were recoded into multiple dichotomies in order to use logistic regression procedures. Results indicate that logistic regression is powerful in detecting most forms of DIF; however, it required large amounts of data manipulation, and interpretation of the results was sometimes difficult. Some logistic regression procedures may be useful in the post hoc analysis of DIF for polytomous items.

17.

Assessing Differential Step Functioning in Polytomous Items Using a Common Odds Ratio Estimator

Randall D. Penfield 《Journal of Educational Measurement》2007,44(3):187-210

Many statistics used in the assessment of differential item functioning (DIF) in polytomous items yield a single item-level index of measurement invariance that collapses information across all response options of the polytomous item. Utilizing a single item-level index of DIF can, however, be misleading if the magnitude or direction of the DIF changes across the steps underlying the polytomous response process. A more comprehensive approach to examining measurement invariance in polytomous item formats is to examine invariance at the level of each step of the polytomous item, a framework described in this article as differential step functioning (DSF). This article proposes a nonparametric DSF estimator that is based on the Mantel-Haenszel common odds ratio estimator (Mantel & Haenszel, 1959), which is frequently implemented in the detection of DIF in dichotomous items. A simulation study demonstrated that when the level of DSF varied in magnitude or sign across the steps underlying the polytomous response options, the DSF-based approach typically provided a more powerful and accurate test of measurement invariance than did corresponding item-level DIF estimators.
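For orientation, the Mantel-Haenszel common odds ratio for a dichotomous item, on which the proposed estimator is said to be based, can be sketched as follows. This is the standard dichotomous estimator only, not the DSF estimator from the article; the function names, the record layout, and the use of a raw matching score as the stratifying variable are assumptions made for illustration.

```python
import math
from collections import defaultdict


def mh_common_odds_ratio(records):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    records: iterable of (group, matching_score, response), where group
    is "ref" or "focal" and response is 1 (correct) or 0 (incorrect).
    """
    # Per stratum: [A, B, C, D] = ref correct, ref incorrect,
    #                             focal correct, focal incorrect.
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, y in records:
        idx = (0 if y == 1 else 1) + (0 if group == "ref" else 2)
        tables[score][idx] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("odds ratio undefined: no informative strata")
    return num / den


def mh_d_dif(alpha):
    """Delta-scale transform of the odds ratio, -2.35 * ln(alpha)."""
    return -2.35 * math.log(alpha)


# Tiny hand-made example with a single score stratum.
records = ([("ref", 5, 1)] * 3 + [("ref", 5, 0)]
           + [("focal", 5, 1)] + [("focal", 5, 0)] * 3)
alpha = mh_common_odds_ratio(records)
```

A value of `alpha` above 1 means the item favors the reference group after matching; a step-level analysis would apply the same kind of estimator to each step dichotomy of a polytomous item.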

18.

Detection of Differential Item Functioning with Nonlinear Regression: A Non‐IRT Approach Accounting for Guessing

Adéla Drabinová Patrícia Martinková 《Journal of Educational Measurement》2017,54(4):498-517

In this article we present a general approach not relying on item response theory models (non‐IRT) to detect differential item functioning (DIF) in dichotomous items in the presence of guessing. The proposed nonlinear regression (NLR) procedure for DIF detection is an extension of the method based on logistic regression. As a non‐IRT approach, NLR can be seen as a proxy of detection based on the three‐parameter IRT model which is a standard tool in the study field. Hence, NLR fills a logical gap in DIF detection methodology and as such is important for educational purposes. Moreover, the advantages of the NLR procedure as well as comparison to other commonly used methods are demonstrated in a simulation study. A real data analysis is offered to demonstrate practical use of the method.
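One way to picture a logistic regression DIF model extended with a guessing parameter is sketched below. The notation is an assumption made for illustration and is not taken from the article: $Y_{ij}$ is the response of examinee $i$ to item $j$, $X_i$ is the matching score, $G_i$ is a group indicator, and $c_j$ is a lower asymptote.

```latex
P(Y_{ij} = 1 \mid X_i, G_i)
  = c_j + (1 - c_j)\,
    \frac{\exp\left(\beta_{0j} + \beta_{1j} X_i + \beta_{2j} G_i + \beta_{3j} X_i G_i\right)}
         {1 + \exp\left(\beta_{0j} + \beta_{1j} X_i + \beta_{2j} G_i + \beta_{3j} X_i G_i\right)}
```

Under this sketch, $c_j = 0$ gives back the ordinary logistic regression DIF model, $\beta_{2j} \neq 0$ corresponds to uniform DIF, and $\beta_{3j} \neq 0$ to nonuniform DIF.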

19.

Decisions that make a difference in detecting differential item functioning

Stephen G. Sireci Joseph A. Rios 《Educational Research and Evaluation》2013,19(2-3):170-187

There are numerous statistical procedures for detecting items that function differently across subgroups of examinees that take a test or survey. However, in endeavouring to detect items that may function differentially, selection of the statistical method is only one of many important decisions. In this article, we discuss the important decisions that affect investigations of differential item functioning (DIF) such as choice of method, sample size, effect size criteria, conditioning variable, purification, DIF amplification, DIF cancellation, and research designs for evaluating DIF. Our review highlights the necessity of matching the DIF procedure to the nature of the data analysed, the need to include effect size criteria, the need to consider the direction and balance of items flagged for DIF, and the need to use replication to reduce Type I errors whenever possible. Directions for future research and practice in using DIF to enhance the validity of test scores are provided.

20.

A Generalized DIF Effect Variance Estimator for Measuring Unsigned Differential Test Functioning in Mixed Format Tests

Randall D. Penfield James Algina 《Journal of Educational Measurement》2006,43(4):295-312

One approach to measuring unsigned differential test functioning is to estimate the variance of the differential item functioning (DIF) effect across the items of the test. This article proposes two estimators of the DIF effect variance for tests containing dichotomous and polytomous items. The proposed estimators are direct extensions of the noniterative estimators developed by Camilli and Penfield (1997) for tests composed of dichotomous items. A small simulation study is reported in which the statistical properties of the generalized variance estimators are assessed, and guidelines are proposed for interpreting values of DIF effect variance estimators.