Similar Articles
20 similar articles retrieved.
1.
Lord's Wald test for differential item functioning (DIF) has not been studied extensively in the context of the multidimensional item response theory (MIRT) framework. In this article, Lord's Wald test was implemented using two estimation approaches, marginal maximum likelihood estimation and Bayesian Markov chain Monte Carlo estimation, to detect uniform and nonuniform DIF under MIRT models. The Type I error and power rates for Lord's Wald test were investigated under various simulation conditions, including different DIF types and magnitudes, different means and correlations of two ability parameters, and different sample sizes. Furthermore, English usage data were analyzed to illustrate the use of Lord's Wald test with the two estimation approaches.
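As a rough illustration of the mechanics behind Lord's Wald test (not the authors' implementation), the sketch below computes the Wald statistic from the difference between the reference- and focal-group estimates of an item's parameter vector and the estimated covariance of that difference. The parameter values and covariance matrices are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def lords_wald(est_ref, est_foc, cov_ref, cov_foc):
    """Wald chi-square for the difference between two groups' item-parameter
    estimates (e.g., two slopes and an intercept under a two-dimensional MIRT model)."""
    diff = np.asarray(est_ref, dtype=float) - np.asarray(est_foc, dtype=float)
    cov = np.asarray(cov_ref) + np.asarray(cov_foc)   # independent calibration samples
    stat = float(diff @ np.linalg.inv(cov) @ diff)
    df = diff.size
    return stat, chi2.sf(stat, df)

# Hypothetical estimates for one item: slopes (a1, a2) and intercept d.
ref = [1.10, 0.60, -0.20]
foc = [0.95, 0.85,  0.10]
cov = np.diag([0.01, 0.01, 0.02])   # illustrative covariance matrices
stat, p = lords_wald(ref, foc, cov, cov)
print(f"Wald chi2 = {stat:.2f}, p = {p:.4f}")
```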

2.
Componential IRT models for polytomous items are of particular interest in two contexts: componential research and test development. We assume that there are basic components, such as processes and knowledge structures, involved in solving cognitive tasks. In componential research, the subtask paradigm may be used to isolate such components in subtasks. In test development, items may be composed such that their response alternatives correspond with specific combinations of such components. In both cases the data may be modeled as polytomous items. With Bock's (1972) nominal model as a general framework, transformation matrices can be used to constrain the parameters of the response categories so as to reflect the componential design of the response categories. In this way, both main effects and interaction effects of components can be studied. An application to a spelling task demonstrates this approach.
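For reference, Bock's (1972) nominal model gives the probability of choosing category k of an m-category item as a multinomial logit in the latent trait; in the componential approach described above, the category parameters are further constrained through a transformation (design) matrix. A generic, hedged statement (the specific matrices used in the article may differ) is:

```latex
P(X = k \mid \theta) \;=\;
\frac{\exp\!\left(a_k \theta + c_k\right)}
     {\sum_{h=1}^{m} \exp\!\left(a_h \theta + c_h\right)},
\qquad
\mathbf{a} = T\,\boldsymbol{\alpha}, \quad \mathbf{c} = T\,\boldsymbol{\gamma},
```

where the columns of the transformation matrix T encode the componential design of the response categories and α, γ are the component-level parameters.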

3.
A widely used approach for categorizing the level of differential item functioning (DIF) in dichotomous items is the scheme proposed by Educational Testing Service (ETS) based on a transformation of the Mantel-Haenszel common odds ratio. In this article two classification schemes for DIF in polytomous items (referred to as the P1 and P2 schemes) are proposed that parallel the criteria set forth in the ETS scheme for dichotomous items. The theoretical equivalence of the P1 and P2 schemes to the ETS scheme is described, and the results of a simulation study conducted to examine the empirical equivalence of the P1 and P2 schemes to the ETS scheme are presented.
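For context, the ETS scheme first transforms the Mantel-Haenszel common odds ratio to the delta scale and then assigns each item to category A, B, or C. The sketch below implements the commonly cited size thresholds; the accompanying significance tests are reduced to boolean flags here, so treat it as an approximation rather than an implementation of the article's P1/P2 schemes.

```python
import math

def mh_delta(alpha_mh):
    """Transform the Mantel-Haenszel common odds ratio to the ETS delta scale."""
    return -2.35 * math.log(alpha_mh)

def ets_category(alpha_mh, sig_vs_null=True, sig_vs_one=True):
    """Approximate ETS A/B/C rules for dichotomous items.
    sig_vs_null: delta significantly different from 0.
    sig_vs_one:  |delta| significantly greater than 1.0."""
    d = abs(mh_delta(alpha_mh))
    if d < 1.0 or not sig_vs_null:
        return "A"   # negligible DIF
    if d >= 1.5 and sig_vs_one:
        return "C"   # large DIF
    return "B"       # moderate DIF

print(ets_category(1.2), ets_category(2.2))   # illustrative: 'A' and 'C'
```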

4.
In this article, I address two competing conceptions of differential item functioning (DIF) in polytomously scored items. The first conception, referred to as net DIF, concerns between-group differences in the conditional expected value of the polytomous response variable. The second conception, referred to as global DIF, concerns the conditional dependence of group membership and the polytomous response variable. The distinction between net and global DIF is important because different DIF evaluation methods are appropriate for net and global DIF; no currently available method is universally the best for detecting both net and global DIF. Net and global DIF definitions are presented under two different, yet compatible, modeling frameworks: a traditional item response theory (IRT) framework, and a differential step functioning (DSF) framework. The theoretical relationship between the IRT and DSF frameworks is presented. Available methods for evaluating net and global DIF are described, and an applied example of net and global DIF is presented.
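The two conceptions can be stated compactly. Writing Y for the polytomous response, G for group (reference R, focal F), and θ for the latent trait, a sketch of the definitions consistent with the description above (notation mine) is:

```latex
\text{Net DIF:} \quad
E(Y \mid \theta, G = R) \;\neq\; E(Y \mid \theta, G = F)
\quad \text{for some } \theta;
\qquad
\text{Global DIF:} \quad
P(Y = j \mid \theta, G = R) \;\neq\; P(Y = j \mid \theta, G = F)
\quad \text{for some category } j \text{ and some } \theta .
```

Net DIF thus concerns the conditional expected score, while global DIF concerns any conditional dependence of the response categories on group; offsetting category-level effects can, in principle, produce global DIF with little or no net DIF.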

5.
《教育实用测度》2013,26(4):313-334
The purpose of this study was to compare the IRT-based area method and the Mantel-Haenszel method for investigating differential item functioning (DIF), to determine the degree of agreement between the methods in identifying potentially biased items, and, when the two methods led to different results, to identify possible reasons for the discrepancies. Data for the study were the item responses of Anglo American and Native American students who took the 1982 New Mexico High School Proficiency Exam. Two samples of 1,000 students from each group were studied. The major findings were that (a) the consistency of classifications of items into "biased" and "not-biased" categories across replications was 75% to 80% for both methods and (b) when the unreliability of the statistics was taken into account, the two methods led to very similar results. Discrepancies between methods were due to the presence of nonuniform DIF (the Mantel-Haenszel method could not identify these items) and the choice of interval over which DIF was assessed (the IRT method results depended on the choice of interval). The implications for practitioners seem clear: The Mantel-Haenszel method in general provides an acceptable approximation to the IRT-based methods.
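To make the "choice of interval" issue concrete, the sketch below numerically approximates the unsigned area between two groups' 2PL item characteristic curves over a user-chosen θ interval. The item parameters and intervals are hypothetical, and published area measures (e.g., Raju's closed-form areas) differ in detail from this trapezoidal approximation.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve with the 1.7 scaling constant."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def unsigned_area(a_ref, b_ref, a_foc, b_foc, lo=-3.0, hi=3.0, n=2001):
    """Approximate the unsigned area between reference- and focal-group ICCs
    over [lo, hi]; the result depends on the chosen interval."""
    theta = np.linspace(lo, hi, n)
    gap = np.abs(icc_2pl(theta, a_ref, b_ref) - icc_2pl(theta, a_foc, b_foc))
    return np.trapz(gap, theta)

# Nonuniform DIF example: equal difficulty, different discrimination.
print(unsigned_area(1.2, 0.0, 0.7, 0.0, lo=-3, hi=3))
print(unsigned_area(1.2, 0.0, 0.7, 0.0, lo=-2, hi=2))   # narrower interval, smaller area
```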

6.
A computer simulation study was conducted to determine the feasibility of using logistic regression procedures to detect differential item functioning (DIF) in polytomous items. One item in a simulated test of 25 items contained DIF; parameters for that item were varied to create three conditions of nonuniform DIF and one of uniform DIF. Item scores were generated using a generalized partial credit model, and the data were recoded into multiple dichotomies in order to use logistic regression procedures. Results indicate that logistic regression is powerful in detecting most forms of DIF; however, it required large amounts of data manipulation, and interpretation of the results was sometimes difficult. Some logistic regression procedures may be useful in the post hoc analysis of DIF for polytomous items.
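A minimal sketch of the logistic regression DIF test for a single recoded dichotomy, using likelihood-ratio comparisons of nested models; the data are simulated here and the recoding scheme is only illustrative of the general approach.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                 # 0 = reference, 1 = focal
score = rng.normal(0, 1, n)                   # matching variable (e.g., rest score)
logit = 0.8 * score - 0.5 * group             # uniform DIF built into this dichotomy
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def fit(X):
    return sm.Logit(y, sm.add_constant(X)).fit(disp=0)

m0 = fit(np.column_stack([score]))                          # matching only
m1 = fit(np.column_stack([score, group]))                   # + group (uniform DIF)
m2 = fit(np.column_stack([score, group, score * group]))    # + interaction (nonuniform DIF)

lr_uniform = 2 * (m1.llf - m0.llf)
lr_nonuniform = 2 * (m2.llf - m1.llf)
print("uniform DIF:    LR =", round(lr_uniform, 2), "p =", round(chi2.sf(lr_uniform, 1), 4))
print("nonuniform DIF: LR =", round(lr_nonuniform, 2), "p =", round(chi2.sf(lr_nonuniform, 1), 4))
```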

7.
Liu and Agresti (1996) proposed a Mantel-Haenszel-type (Mantel & Haenszel, 1959) estimator of a common odds ratio for several 2 × J tables, where the J columns are ordinal levels of a response variable. This article applies the Liu-Agresti estimator to the case of assessing differential item functioning (DIF) in items having an ordinal response variable. A simulation study was conducted to investigate the accuracy of the Liu-Agresti estimator in relation to other statistical DIF detection procedures. The results of the simulation study indicate that the Liu-Agresti estimator is a viable alternative to other DIF detection statistics.
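For orientation, the original Mantel-Haenszel estimator pools stratified 2 × 2 tables as shown below; the Liu-Agresti statistic extends this idea to stratified 2 × J tables by working with cumulative counts over the ordered response categories (the exact pooled form used in the article is not reproduced here).

```latex
\hat{\alpha}_{MH} \;=\;
\frac{\sum_{k} A_k D_k / N_k}{\sum_{k} B_k C_k / N_k},
```

where A_k, B_k, C_k, D_k are the cell counts of the group-by-response 2 × 2 table in matching stratum k and N_k is the stratum total.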

8.
Traditional methods for examining differential item functioning (DIF) in polytomously scored test items yield a single item-level index of DIF and thus provide no information concerning which score levels are implicated in the DIF effect. To address this limitation of DIF methodology, the framework of differential step functioning (DSF) has recently been proposed, whereby measurement invariance is examined within each step underlying the polytomous response variable. The examination of DSF can provide valuable information concerning the nature of the DIF effect (i.e., is the DIF an item-level effect or an effect isolated to specific score levels), the location of the DIF effect (i.e., precisely which score levels are manifesting the DIF effect), and the potential causes of a DIF effect (i.e., what properties of the item stem or task are potentially biasing). This article presents a didactic overview of the DSF framework and provides specific guidance and recommendations on how DSF can be used to enhance the examination of DIF in polytomous items. An example with real testing data is presented to illustrate the comprehensive information provided by a DSF analysis.
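The core move in a DSF analysis is to examine each step of the polytomous item separately. As a rough sketch (not the specific estimator used in the article), the code below pools, for each step j, a Mantel-Haenszel-type odds ratio across matching strata using only examinees scoring in the two adjacent categories j-1 and j; the variable names and the adjacent-categories dichotomization are illustrative assumptions.

```python
import numpy as np

def step_log_odds_ratios(scores, group, strata, n_cat):
    """For each step j (1..n_cat-1), pool an MH-type odds ratio across strata
    using only examinees whose item score is j-1 or j."""
    results = {}
    for j in range(1, n_cat):
        num = den = 0.0
        for s in np.unique(strata):
            m = (strata == s) & np.isin(scores, [j - 1, j])
            if not m.any():
                continue
            ref, foc = m & (group == 0), m & (group == 1)
            A = np.sum(ref & (scores == j))      # reference at category j
            B = np.sum(ref & (scores == j - 1))  # reference at category j-1
            C = np.sum(foc & (scores == j))      # focal at category j
            D = np.sum(foc & (scores == j - 1))  # focal at category j-1
            N = A + B + C + D
            if N > 0:
                num += A * D / N
                den += B * C / N
        results[j] = np.log(num / den) if num > 0 and den > 0 else float("nan")
    return results   # values near zero at every step suggest no DSF

# Hypothetical data: item scored 0-3, binary group, five matching strata.
rng = np.random.default_rng(1)
scores = rng.integers(0, 4, 1500)
group = rng.integers(0, 2, 1500)
strata = rng.integers(0, 5, 1500)
print(step_log_odds_ratios(scores, group, strata, n_cat=4))
```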

9.
The assessment of differential item functioning (DIF) in polytomous items addresses between-group differences in measurement properties at the item level, but typically does not inform which score levels may be involved in the DIF effect. The framework of differential step functioning (DSF) addresses this issue by examining between-group differences in the measurement properties at each step underlying the polytomous response variable. The pattern of the DSF effects across the steps of the polytomous response variable can assume several different forms, and the different forms can have different implications for the sensitivity of DIF detection and the final interpretation of the causes of the DIF effect. In this article we propose a taxonomy of DSF forms, establish guidelines for using the form of DSF to help target and guide item content review and item revision, and provide procedural rules for using the frameworks of DSF and DIF in tandem to yield a comprehensive assessment of between-group measurement equivalence in polytomous items.

10.
Shealy and Stout (1993) proposed a DIF detection procedure called SIBTEST and demonstrated its utility with both simulated and real data sets. Current versions of SIBTEST can be used only for dichotomous items. In this article, an extension to handle polytomous items is developed. Two simulation studies are presented which compare the modified SIBTEST procedure with the Mantel and standardized mean difference (SMD) procedures. The first study compares the procedures under conditions in which the Mantel and SMD procedures have been shown to perform well (Zwick, Donoghue, & Grima, 1993). Results of Study 1 suggest that SIBTEST performed reasonably well, but that the Mantel and SMD procedures performed slightly better. The second study uses data simulated under conditions in which observed-score DIF methods for dichotomous items have not performed well. The results of Study 2 indicate that under these conditions the modified SIBTEST procedure provides better control of impact-induced Type I error inflation than the other procedures.
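Of the procedures compared above, the standardized mean difference (SMD) is the simplest to state: the focal-group-weighted difference in conditional item means across levels of the matching variable. The sketch below computes that common form on simulated data; the regression correction that distinguishes SIBTEST from such observed-score indices is not shown.

```python
import numpy as np

def smd(item, group, match):
    """Standardized mean difference for a polytomous item.
    group: 0 = reference, 1 = focal; match: matching (e.g., total) score level."""
    n_focal = np.sum(group == 1)
    total = 0.0
    for k in np.unique(match):
        foc = (group == 1) & (match == k)
        ref = (group == 0) & (match == k)
        if foc.any() and ref.any():
            w = foc.sum() / n_focal                      # focal-group weight at level k
            total += w * (item[foc].mean() - item[ref].mean())
    return total   # negative values indicate the item disfavors the focal group

# Hypothetical data: 0-4 item scores, 20-level matching score, binary group.
rng = np.random.default_rng(2)
match = rng.integers(0, 20, 3000)
group = rng.integers(0, 2, 3000)
item = np.clip((match // 5) - 0.5 * group + rng.integers(0, 2, 3000), 0, 4)
print(round(smd(item, group, match), 3))
```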

11.
With the increasingly widespread use of polytomous scoring in psychological and educational measurement, new challenges arise for methods of detecting differential item functioning (DIF). Previous research has shown that MIMIC is an economical and effective DIF detection method, but no study has systematically examined its effectiveness for polytomous items. Using a Monte Carlo simulation, this study manipulated five factors: the sample sizes of the reference and focal groups, DIF type, item discrimination, between-group ability differences, and the number of DIF items contaminating the anchor set, and examined the Type I error rate and power of the MIMIC method across combinations of these conditions. The findings were as follows: (1) MIMIC is a sensitive method for detecting uniform DIF; even when the focal group is small or markedly smaller than the reference group, it still controls the Type I error rate well. (2) A purification step is necessary for the MIMIC method to control Type I error and improve power, although the method shows some tolerance of anchor contamination. (3) Power is severely reduced by low item discrimination, whereas very high discrimination inflates the Type I error rate. (4) The power of the MIMIC method to detect uniform DIF increases with sample size.
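For readers unfamiliar with the MIMIC approach, a generic single-factor MIMIC DIF model (notation mine, not the exact specification used in the study) regresses both the latent trait and the studied item on the grouping covariate; uniform DIF corresponds to a nonzero direct effect of group on the item:

```latex
\eta \;=\; \gamma\, g + \zeta, \qquad
y_i^{*} \;=\; \lambda_i\, \eta + \beta_i\, g + \varepsilon_i ,
```

where g is the group indicator, y_i* is the latent response underlying the ordered categories of item i, γ captures the between-group ability difference (impact), and testing H0: β_i = 0 screens item i for uniform DIF.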

12.
The use of computerized adaptive testing algorithms for ranking items (e.g., college preferences, career choices) involves two major challenges: unacceptably high computation times (selecting from a large item pool with many dimensions) and biased results (enhanced preferences or intensified examinee responses because of repeated statements across items). To address these issues, we introduce subpool partition strategies for item selection and within-person statement exposure control procedures. Simulations showed that the multinomial method reduces computation time while maintaining measurement precision. Both the freeze and revised Sympson-Hetter online (RSHO) methods controlled the statement exposure rate; RSHO sacrificed some measurement precision but increased pool use. Furthermore, preventing a statement's repetition on consecutive items neither hindered the effectiveness of the freeze or RSHO method nor reduced measurement precision.
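The freeze and RSHO procedures above are variants of Sympson-Hetter exposure control. As a point of reference only, a minimal sketch of the classic Sympson-Hetter acceptance step is shown below; the online, within-person, statement-level controls studied in the article are more involved, and the exposure parameters K here are hypothetical.

```python
import random

def administer_with_exposure_control(ranked_items, K, rng=random.Random(0)):
    """Classic Sympson-Hetter step: walk down the candidate items in order of
    selection preference and administer item i with probability K[i]; items
    that fail the probability check are skipped for this examinee."""
    for i in ranked_items:
        if rng.random() <= K[i]:
            return i
    return ranked_items[-1]   # fall back to the last candidate

K = {"item_a": 0.4, "item_b": 0.8, "item_c": 1.0}   # hypothetical exposure parameters
print(administer_with_exposure_control(["item_a", "item_b", "item_c"], K))
```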

13.
Identifying the Causes of DIF in Translated Verbal Items
Translated tests are being used increasingly for assessing the knowledge and skills of individuals who speak different languages. There is little research exploring why translated items sometimes function differently across languages. If the sources of differential item functioning (DIF) across languages could be predicted, it could have important implications for test development, scoring, and equating. This study focuses on two questions: “Is DIF related to item type?” and “What are the causes of DIF?” The data were taken from the Israeli Psychometric Entrance Test in Hebrew (source) and Russian (translated). The results indicated that 34% of the items functioned differentially across languages. The analogy items were the most problematic, with 65% showing DIF, mostly in favor of the Russian-speaking examinees. The sentence completion items were also a problem (45% DIF). The main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.

14.
This study compared and illustrated four differential distractor functioning (DDF) detection methods for analyzing multiple-choice items. The log-linear approach, two item response theory-model-based approaches with likelihood ratio tests, and the odds ratio approach were compared to examine the congruence among the four DDF detection methods. Data from a college-level mathematics placement test were analyzed to understand the causes of differential functioning. Results indicated some agreement among the four detection methods. To facilitate practical interpretation of the DDF results, several possible effect size measures were also obtained and compared.

15.
Exact nonparametric procedures have been used to identify the level of differential item functioning (DIF) in binary items. This study explored the use of exact DIF procedures with items scored on a Likert scale. The results from an attitude survey suggest that the large-sample Cochran-Mantel-Haenszel (CMH) procedure identifies more items as statistically significant than two comparable exact nonparametric methods. This finding is consistent with previous findings; however, when items are classified in National Assessment of Educational Progress DIF categories, the results show that the CMH and its exact nonparametric counterparts produce almost identical classifications. Since DIF is often evaluated in terms of statistical and practical significance, this study provides evidence that the large-sample CMH procedure may be safely used even when the focal group has as few as 76 cases.

16.
This inquiry is an investigation of item response theory (IRT) proficiency estimators' accuracy under multistage testing (MST). We chose a two-stage MST design that includes four modules (one at Stage 1, three at Stage 2) and three difficulty paths (low, middle, high). We assembled various two-stage MST panels (i.e., forms) by manipulating two assembly conditions in each module, such as difficulty level and module length. For each panel, we investigated the accuracy of examinees' proficiency levels derived from seven IRT proficiency estimators. The choice of Bayesian (prior) versus non-Bayesian (no prior) estimators was of more practical significance than the choice of number-correct versus item-pattern scoring estimators. The Bayesian estimators were slightly more efficient than the non-Bayesian estimators, resulting in smaller overall error. Possible score changes caused by the use of different proficiency estimators would be nonnegligible, particularly for low- and high-performing examinees.
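To illustrate the Bayesian versus non-Bayesian distinction for item-pattern scoring, the sketch below computes an EAP estimate (standard normal prior over a grid) and a simple grid maximum-likelihood estimate for one response pattern under a 2PL model; the item parameters and responses are hypothetical, and number-correct scoring estimators are not shown.

```python
import numpy as np

a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])     # hypothetical 2PL discriminations
b = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])   # hypothetical difficulties
x = np.array([1, 1, 1, 0, 0])               # one examinee's response pattern

theta = np.linspace(-4, 4, 121)             # quadrature / grid points
p = 1 / (1 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))   # items x grid
lik = np.prod(np.where(x[:, None] == 1, p, 1 - p), axis=0)          # L(x | theta)

prior = np.exp(-0.5 * theta**2)             # standard normal prior (unnormalized)
post = lik * prior
eap = np.sum(theta * post) / np.sum(post)   # Bayesian: posterior mean (EAP)
mle = theta[np.argmax(lik)]                 # non-Bayesian: grid maximum likelihood
print(f"EAP = {eap:.3f}, grid MLE = {mle:.3f}")
```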

17.
A rapidly expanding arena for item response theory (IRT) is in attitudinal and health-outcomes survey applications, often with polytomous items. In particular, there is interest in computer adaptive testing (CAT). Meeting model assumptions is necessary to realize the benefits of IRT in this setting, however. Although local item dependence has been studied both for polytomous items in fixed-form settings and for dichotomous items in CAT settings, there have been no publications applying local item dependence detection methodology to polytomous items in CAT despite its central importance to these applications. The current research uses a simulation study to investigate the extension of widely used pairwise statistics, Yen's Q3 statistic and Pearson's X2 statistic, in this context. The simulation design and results are contextualized throughout with a real item bank of this type from the Patient-Reported Outcomes Measurement Information System (PROMIS).
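As background for the pairwise statistics named above, Yen's Q3 for a pair of items is the correlation between the items' score residuals after removing the model-implied expected scores at the estimated trait level. A minimal sketch (dichotomous 2PL items with made-up parameters, not the polytomous CAT setting of the study) is:

```python
import numpy as np

def q3(x_i, x_j, p_i, p_j):
    """Yen's Q3: correlation of the residuals d = x - P(theta_hat) for two items."""
    d_i, d_j = x_i - p_i, x_j - p_j
    return np.corrcoef(d_i, d_j)[0, 1]

# Hypothetical responses and model-implied probabilities for two items.
rng = np.random.default_rng(3)
theta = rng.normal(0, 1, 1000)
p_i = 1 / (1 + np.exp(-1.1 * (theta - 0.2)))
p_j = 1 / (1 + np.exp(-0.9 * (theta + 0.4)))
x_i = rng.binomial(1, p_i)
x_j = rng.binomial(1, p_j)
print(round(q3(x_i, x_j, p_i, p_j), 3))   # near zero when local independence holds
```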

18.
In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered due to various reasons (e.g., lack of test-taking motivation, poor time management, and test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores, thereby threatening the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with those of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Using the traditional scoring approaches, students' responses for not-reached items were treated as either not-administered or incorrect in the scoring process. With the proposed scoring approach, students' nonmissing responses were scored polytomously based on how accurately and rapidly they responded to the items to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared based on several evaluation criteria, such as model fit indices, test information function, and bias. The results indicated that the polytomous scoring approaches outperformed the traditional approaches. The complete case simulation corroborated our empirical findings that the scoring approach in which nonmissing items were scored polytomously and not-reached items were considered not-administered performed the best. Implications of the polytomous scoring approach for low-stakes assessments were discussed.

19.
With a focus on performance assessments, this paper describes procedures for calculating conditional standard error of measurement (CSEM) and reliability of scale scores and classification consistency of performance levels. Scale scores that are transformations of total raw scores are the focus of these procedures, although other types of raw scores are considered as well. Polytomous IRT models provide the psychometric foundation for the procedures that are described. The procedures are applied using test data from ACT's Work Keys Writing Assessment to demonstrate their usefulness. Two polytomous IRT models were compared, as were two different procedures for calculating scores. One simulation study was done using one of the models to evaluate the accuracy of the proposed procedures. The results suggest that the procedures provide quite stable estimates and have the potential to be useful in a variety of performance assessment situations.
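The θ-scale version of the first quantity is standard: under an IRT model, the conditional standard error of measurement is the inverse square root of the test information, summed over the polytomous items' information functions. Mapping this (and the classification-consistency indices) onto nonlinearly transformed scale scores requires the additional machinery described in the article and is not reproduced here.

```latex
I(\theta) \;=\; \sum_{i} I_i(\theta), \qquad
\mathrm{CSEM}(\hat{\theta}) \;\approx\; \frac{1}{\sqrt{I(\theta)}} .
```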

20.
This research examined the effect of scoring items thought to be multidimensional using a unidimensional model and demonstrated the use of multidimensional item response theory (MIRT) as a diagnostic tool. Using real data from a large-scale mathematics test, previously shown to function differentially in favor of proficient writers, the difference in proficiency classifications was explored when a two- versus one-dimensional confirmatory model was fit. The estimate of ability obtained when using the unidimensional model was considered to represent general mathematical ability. Under the two-dimensional model, one of the two dimensions was also considered to represent general mathematical ability. The second dimension was considered to represent the ability to communicate in mathematics. The resulting pattern of mismatched proficiency classifications suggested that examinees found to have less mathematics communication ability were more likely to be placed in a lower general mathematics proficiency classification under the unidimensional than the multidimensional model. Results and implications are discussed.
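For concreteness, a two-dimensional confirmatory model of the kind described above can be written as a compensatory MIRT model in which θ1 represents general mathematical ability and θ2 the ability to communicate in mathematics (notation mine; the exact parameterization and constraints used in the study may differ):

```latex
P(X_i = 1 \mid \theta_1, \theta_2) \;=\;
\frac{1}{1 + \exp\!\left[-(a_{i1}\theta_1 + a_{i2}\theta_2 + d_i)\right]},
```

with, for example, a_{i2} fixed to zero for items assumed to tap only the first dimension.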

