Similar Articles (20 results found)
1.
This article used the multidimensional random coefficients multinomial logit model to examine the construct validity of the Chinese version of the Motivated Strategies for Learning Questionnaire (MSLQ-CV) and to detect substantial differential item functioning (DIF). A total of 1,354 Hong Kong junior high school students were administered the MSLQ-CV. The partial credit model showed better goodness of fit than the rating scale model. Five items with substantial gender or grade DIF were removed from the questionnaire, and the correlations between subscales indicated that cognitive strategy use and self-regulation were so highly correlated that the two factors might be combined. Reliability analysis showed that the test anxiety subscale had lower reliability than the other factors. Finally, item difficulty and step parameters for the modified 39-item questionnaire were reported. For some items, the ordering of the step difficulty estimates suggested that overlapping response categories might need to be collapsed. Based on these findings, directions for future research are discussed.
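For reference, the partial credit and rating scale models compared above differ only in how the step difficulties are parameterized. The following is a standard textbook formulation, not taken from the article itself; the notation (m_i categories for item i) is generic.

```latex
% Partial credit model (PCM): each item i carries its own step difficulties \delta_{ik}.
\[
  P(X_i = x \mid \theta) \;=\;
  \frac{\exp\!\Big(\sum_{k=0}^{x} (\theta - \delta_{ik})\Big)}
       {\sum_{h=0}^{m_i} \exp\!\Big(\sum_{k=0}^{h} (\theta - \delta_{ik})\Big)},
  \qquad \text{with } \sum_{k=0}^{0}(\theta - \delta_{ik}) \equiv 0 .
\]
% Rating scale model (RSM): the steps decompose into an item location plus thresholds
% shared by all items, \delta_{ik} = \delta_i + \tau_k. The PCM fitting better here
% indicates that item-specific step parameters were needed.
```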

2.
Statistics used to detect differential item functioning can also reflect differential strengths and weaknesses in the performance characteristics of population subgroups. In turn, item features associated with the differential performance patterns are likely to reflect some facet of the item task, and hence its difficulty, that might previously have been overlooked. In this study, several item features were identified and coded for a large number of reading comprehension items from two admissions testing programs. Item features included subject matter content, various properties of item structure, cognitive demand indicators, and semantic content (propositional analysis). Differential item functioning was evaluated for males and females and for White and Black examinees. Results showed a number of significant relationships between item features and indicators of differential item functioning, many of which were consistent across testing programs. Implications of the results for related areas of research are discussed.

3.
Identifying the Causes of DIF in Translated Verbal Items
Translated tests are being used increasingly for assessing the knowledge and skills of individuals who speak different languages. There is little research exploring why translated items sometimes function differently across languages. If the sources of differential item functioning (DIF) across languages could be predicted, this would have important implications for test development, scoring, and equating. This study focuses on two questions: Is DIF related to item type? What are the causes of DIF? The data were taken from the Israeli Psychometric Entrance Test in Hebrew (source) and Russian (translated). The results indicated that 34% of the items functioned differentially across languages. The analogy items were the most problematic, with 65% showing DIF, mostly in favor of the Russian-speaking examinees. The sentence completion items were also problematic (45% DIF). The main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.

4.
The purpose of the present study is to examine the language characteristics of several states' large-scale assessments of mathematics and science and to investigate whether the language demands of the items are associated with the degree of differential item functioning (DIF) for English language learner (ELL) students. A total of 542 items from 11 assessments at Grades 4, 5, 7, and 8 from three states were rated for linguistic complexity using a linguistic coding scheme developed for the study. The linguistic ratings were compared to each item's DIF statistics. The results showed a stronger association between the linguistic ratings and DIF statistics for ELL students in the “relatively easy” items than in the “not easy” items. In particular, general academic vocabulary and the amount of language in an item had the strongest association with the degree of DIF, especially for ELL students with low English language proficiency. Furthermore, the items were grouped into four bundles to examine more closely the relationship between varying degrees of language demand and ELL students' performance. Differential bundle functioning (DBF) results indicated that DBF became more substantial as the language demands increased. By disentangling linguistic difficulty from content difficulty, the results provide strong evidence of the impact of linguistic complexity on ELL students' test performance. Implications for test validation and for instruction of ELL students are discussed.

5.
In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) for Black versus matched White examinees on four verbal item types (analogies, antonyms, sentence completions, reading comprehension), using 13 disclosed GRE forms (988 verbal items) and 11 disclosed SAT forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both the GRE and the SAT. The most important finding is that, for hard items, Black examinees perform differentially better than White examinees of matched ability for each of the four item types and on both tests. The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and the differential performance of Black versus matched White examinees. Several hypotheses that might account for this result were explored.

6.
Applied Measurement in Education, 2013, 26(2): 175-199
This study used three different differential item functioning (DIF) detection procedures to examine the extent to which items in a mathematics performance assessment functioned differently for matched gender groups. In addition to examining the appropriateness of individual items in terms of gender DIF, an attempt was made to identify factors (e.g., content, cognitive processes, differences in ability distributions) that may be related to DIF. The QUASAR (Quantitative Understanding: Amplifying Student Achievement and Reasoning) Cognitive Assessment Instrument (QCAI) is designed to measure students' mathematical thinking and reasoning skills and consists of open-ended items that require students to show their solution processes and provide explanations for their answers. In this study, 33 polytomously scored items, distributed across four test forms, were evaluated for gender-related DIF. The data source was sixth- and seventh-grade student responses to each of the four test forms administered in the spring of 1992 at all six school sites participating in the QUASAR project. The sample consisted of 1,782 students, with approximately equal numbers of female and male students. The results indicated that DIF may not be serious for 31 of the 33 items (94%) in the QCAI. For the two items that were detected as functioning differently for male and female students, several plausible sources of DIF were discussed. The results of the secondary analyses, which removed the mutual influence of the two items, indicated that DIF in one item, PPP1, which favored female students over their matched male counterparts, was of particular concern. These secondary analyses suggest that the detection of DIF in the other item in the original analysis may have been due to the influence of item PPP1, because both items appeared in the same test form.

7.
The first generation of computer-based tests depends largely on multiple-choice items and constructed-response questions that can be scored through literal matches with a key. This study evaluated scoring accuracy and item functioning for an open-ended response type whose correct answers, posed as mathematical expressions, can take many different surface forms. Items were administered to 1,864 participants in field trials of a new admissions test for quantitatively oriented graduate programs. Results showed automatic scoring to approximate the accuracy of multiple-choice scanning, with all processing errors stemming from examinees improperly entering responses. In addition, the items functioned similarly to other response types being considered for the measure in terms of difficulty, item-total relations, and male-female performance differences.

8.
Large data sets from a state reading assessment for third and fifth graders were analyzed to examine differential item functioning (DIF), differential distractor functioning (DDF), and differential omission frequency (DOF) between students with particular categories of disabilities (speech/language impairments, learning disabilities, and emotional behavior disorders) and students without disabilities. Multinomial logistic regression was employed to compare response characteristic curves (RCCs) of individual test items. Although no evidence of serious test bias was found for the state assessment examined in this study, the results indicated that students in different disability categories showed different patterns of DIF, DDF, and DOF, and that the use of RCCs helps clarify the implications of DIF and DDF.
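As a rough illustration of the multinomial-regression approach described above, the sketch below fits one item's response categories (key, distractors, omission) against an ability proxy, a group indicator, and their interaction. The data simulation, variable names, and effect sizes are invented for illustration and are not taken from the study.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated responses to a single item: 0 = keyed answer, 1 and 2 = distractors, 3 = omission.
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)        # 0 = comparison group, 1 = focal group (illustrative)
theta = rng.normal(0.0, 1.0, n)      # matching variable, e.g., standardized total score

# Category logits: the keyed answer rises with ability; omission odds shift with group membership.
logits = np.column_stack([
    1.5 * theta,                     # category 0: keyed answer
    np.zeros(n),                     # category 1: distractor A (baseline pattern)
    -0.3 * theta,                    # category 2: distractor B
    -0.5 * theta + 0.6 * group,      # category 3: omission (built-in DOF for the focal group)
])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
resp = np.array([rng.choice(4, p=row) for row in probs])

X = sm.add_constant(pd.DataFrame({
    "theta": theta,
    "group": group,
    "theta_x_group": theta * group,  # interaction captures nonuniform DIF/DDF
}))
fit = sm.MNLogit(resp, X).fit(disp=False)
print(fit.summary())  # group coefficients per category indicate DDF/DOF relative to the keyed answer
```

MNLogit treats the lowest response code (here the keyed answer) as the reference category, so each distractor's and the omission's coefficients are interpreted relative to answering correctly.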

9.
This article examines nonmathematical linguistic complexity as a source of differential item functioning (DIF) in math word problems for English language learners (ELLs). Specifically, the study investigates the relationship between item measures of linguistic complexity, nonlinguistic forms of representation, and DIF measures based on item response theory difficulty parameters in a state fourth-grade math test. The study revealed that the greater an item's nonmathematical lexical and syntactic complexity, the greater the difference in difficulty parameter estimates favoring non-ELLs over ELLs. However, the impact of linguistic complexity on DIF is attenuated when items provide nonlinguistic schematic representations that help ELLs make meaning of the text, suggesting that their inclusion could help mitigate the negative effect of increased linguistic complexity in math word problems.

10.
This Monte Carlo study examined the effect of complex sampling of items on the measurement of differential item functioning (DIF) using the Mantel-Haenszel procedure. Data were generated using a three-parameter logistic item response theory model according to the balanced incomplete block (BIB) design used in the National Assessment of Educational Progress (NAEP). The length of each block of items and the number of DIF items in the matching variable were varied, as were the difficulty, discrimination, and presence of DIF in the studied item. Block, booklet, pooled booklet, and extra-information analyses were compared to a complete-data analysis using the transformed log-odds on the delta scale. The pooled booklet approach is recommended when items are assigned to examinees according to a BIB design. This study has implications for DIF analyses of other complex samples of items, such as computer-administered testing or other complex assessment designs.
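For readers unfamiliar with the statistic the study compares, the Mantel-Haenszel common odds ratio and its transformation to the ETS delta scale can be computed as in the sketch below; the function name and the toy stratum counts are illustrative, not taken from the study.

```python
import numpy as np

def mantel_haenszel_ddif(ref_correct, ref_wrong, foc_correct, foc_wrong):
    """Mantel-Haenszel common odds ratio across matching-score strata, reported as
    MH D-DIF = -2.35 * ln(alpha_MH) on the ETS delta scale.
    Each argument is an array of counts, one entry per stratum."""
    ref_correct = np.asarray(ref_correct, float)
    ref_wrong = np.asarray(ref_wrong, float)
    foc_correct = np.asarray(foc_correct, float)
    foc_wrong = np.asarray(foc_wrong, float)
    n = ref_correct + ref_wrong + foc_correct + foc_wrong
    alpha_mh = np.sum(ref_correct * foc_wrong / n) / np.sum(ref_wrong * foc_correct / n)
    # Negative values flag DIF against the focal group under the ETS convention.
    return -2.35 * np.log(alpha_mh)

# Toy example with three matching-score strata (counts are invented).
print(mantel_haenszel_ddif(ref_correct=[40, 55, 70], ref_wrong=[60, 45, 30],
                           foc_correct=[30, 45, 60], foc_wrong=[70, 55, 40]))
```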

11.
We analyzed a pool of items from an admissions test for differential item functioning (DIF) for groups based on age, socioeconomic status, citizenship, and English language status, using Mantel-Haenszel and item response theory methods. DIF items were systematically examined to identify possible sources of DIF by item type, content, and wording. DIF was found primarily for the citizenship grouping. As suggested by expert reviewers, possible sources of DIF favoring U.S. citizens were often Quantitative Reasoning items containing figures, charts, or tables depicting real-world (as opposed to abstract) contexts. DIF items favoring non-U.S. citizens included “mathematical” items containing few words. DIF on the Verbal Reasoning items involved geocultural references and proper names that may be differentially familiar to non-U.S. citizens. This study is responsive to foundational changes in the fairness section of the Standards for Educational and Psychological Testing, which now considers additional groups in sensitivity analyses, given the increasing demographic diversity of test-taker populations.

12.
Researchers interested in exploring substantive group differences are increasingly attending to bundles of items (or testlets): the aim is to understand how gender differences, for instance, are explained by differential performance on different types or bundles of items, hence differential bundle functioning (DBF). Some previous work has modelled hierarchies in the data in this context or considered item responses within persons; here, we instead model the bundles themselves as item-level explanatory variables that can account for significant intra-class correlation arising from gender differences in item difficulty, and thus explain variation at the second (item) level. In this study, we analyse DBF using single- and two-level models (the latter modelling random item effects, with responses at Level 1 and items at Level 2) in a high-stakes National Mathematics test. The models show comparable regression coefficients, but the two-level models yield weaker statistical significance because of their larger estimated standard errors. We discuss the contrasting relevance of this effect for test developers and gender researchers.

13.
Applied Measurement in Education, 2013, 26(3): 257-275
The purpose of this study was to investigate the technical properties of stem-equivalent mathematics items differing only with respect to response format. Using socioeconomic factors to define the strata, a proportional stratified random sample of 1,366 Connecticut sixth-grade students was administered one of three forms. Classical item analysis, dimensionality assessment, item response theory goodness-of-fit analysis, and an item bias analysis were conducted. Analysis of variance and confirmatory factor analysis were used to examine the functioning of the items presented in the three different formats. After equating forms, the constructed-response formats were found to be somewhat more difficult than the multiple-choice format; however, there was no significant difference across formats with respect to item discrimination. A differential item functioning (DIF) analysis was conducted using both the Mantel-Haenszel procedure and a comparison of item characteristic curves. The DIF analysis indicated that the presence of bias was not greatly affected by item format; that is, items biased in one format tended to be biased in a similar manner when presented in a different format, and unbiased items tended to remain so regardless of format.

14.
The Rasch model is known to be a special case of the two-level hierarchical generalized linear model (HGLM). This article demonstrates that the many-faceted Rasch model (MFRM) is also a special case of the two-level HGLM, with a random intercept representing examinee ability and fixed effects for the test items, judges, and possibly other facets. This perspective suggests useful modeling extensions of the MFRM. For example, in the HGLM framework it is possible to model random effects for items and judges in order to assess their stability across examinees. The MFRM can also be extended so that item difficulty and judge severity are modeled as functions of examinee characteristics (covariates), for the purpose of detecting differential item functioning and differential rater functioning. Practical illustrations of the HGLM are presented through the analysis of simulated and real judge-mediated data sets involving ordinal responses.
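A compact way to see the correspondence described above is to write out the dichotomous case (the abstract itself treats ordinal responses; the simplified Bernoulli form and the covariate notation below are generic, not quoted from the article):

```latex
% MFRM as a two-level HGLM: random examinee intercept theta_j, fixed item and judge effects.
\[
  \operatorname{logit} P(X_{ijk} = 1 \mid \theta_j) \;=\; \theta_j - \beta_i - \lambda_k,
  \qquad \theta_j \sim N(0, \sigma^2),
\]
% where beta_i is the difficulty of item i and lambda_k the severity of judge k.
% DIF/DRF extension: let difficulty (or severity) depend on an examinee covariate z_j;
% a nonzero delta_i signals differential item functioning for item i.
\[
  \operatorname{logit} P(X_{ijk} = 1 \mid \theta_j) \;=\; \theta_j - (\beta_i + \delta_i z_j) - \lambda_k .
\]
```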

15.
The standardization methodology was used to help identify item characteristics that might explain differential item functioning among Hispanic examinees on the Scholastic Aptitude Test. Results indicated that true cognates (words with a common root in English and Spanish) and content of special interest to Hispanic examinees seemed to help Hispanic examinees' performance. The limited occurrence of false cognates (words that appear to be cognates but have different meanings in the two languages) and of homographs (words that are spelled alike but have different meanings in English) restricted their evaluation. Nevertheless, examination of items with false cognates or homographs gave some evidence that their occurrence might make items unexpectedly more difficult for Hispanic examinees.
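The standardization index referred to above is the weighted difference in proportions correct across matching-score levels; a minimal sketch, with invented proportions and weights, is shown below.

```python
import numpy as np

def std_p_dif(p_focal, p_ref, focal_n):
    """Standardized p-difference: sum of focal-weighted (P_focal - P_ref) across
    matching-score levels, divided by the total focal weight."""
    w = np.asarray(focal_n, float)
    diff = np.asarray(p_focal, float) - np.asarray(p_ref, float)
    return np.sum(w * diff) / np.sum(w)

# Negative values mean the item is harder for the focal group than for reference-group
# examinees at the same matching-score level (toy numbers, three score levels).
print(std_p_dif(p_focal=[0.35, 0.55, 0.80], p_ref=[0.40, 0.62, 0.83], focal_n=[120, 200, 90]))
```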

16.
In this study, we examine the degree of construct comparability, and possible sources of incomparability, of the English and French versions of the Programme for International Student Assessment (PISA) 2003 problem-solving measure administered in Canada. Several approaches were used to examine construct comparability at the test level (examination of test data structure, reliability comparisons, and test characteristic curves) and the item level (differential item functioning, item parameter correlations, and linguistic comparisons). Results from the test-level analyses indicate that the two language versions are highly similar, as shown by comparable internal consistency coefficients, test data structure (the same number of factors and similar item factor loadings), and test characteristic curves. However, results of the item-level analyses reveal several differences between the two language versions, as shown by the large proportion of items displaying differential item functioning, differences in item parameter correlations (discrimination parameters), and the number of items found to contain linguistic differences.

17.
Studies of differential item functioning under item response theory require that item parameter estimates be placed on the same metric before comparisons can be made. The present study compared the effects of three linking methods on the detection of differential item functioning: a weighted mean and sigma method (WMS), the test characteristic curve method (TCC), and the minimum chi-square method (MCS). Both iterative and noniterative linking procedures were compared for each method. Results indicated that detection of differentially functioning items following linking via the test characteristic curve method was most accurate when the sample size was small. When the sample size was large, results for the three linking methods were essentially the same. Iterative linking improved the detection of differentially functioning items over noniterative linking, particularly at the .05 alpha level. The weighted mean and sigma method showed greater improvement with iterative linking than either the test characteristic curve or the minimum chi-square method.
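As a point of reference for the first of those methods, an (unweighted) mean/sigma linking simply rescales one calibration's common-item difficulties to match the mean and standard deviation of the other. The sketch below uses invented difficulty estimates; the weighted variant named in the abstract additionally weights items by the precision of their estimates, which is omitted here.

```python
import numpy as np

def mean_sigma_link(b_from, b_to):
    """Return (A, B) such that A * b_from + B places the 'from' metric onto the
    'to' metric; discrimination parameters transform as a_from / A."""
    b_from, b_to = np.asarray(b_from, float), np.asarray(b_to, float)
    A = b_to.std(ddof=1) / b_from.std(ddof=1)
    B = b_to.mean() - A * b_from.mean()
    return A, B

# Toy common-item difficulties from two separate calibrations of the same items.
A, B = mean_sigma_link(b_from=[-1.2, -0.4, 0.1, 0.9, 1.6],
                       b_to=[-1.0, -0.2, 0.3, 1.2, 1.9])
print(f"A = {A:.3f}, B = {B:.3f}")
```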

18.
Educational Assessment, 2013, 18(2): 127-143
This study investigated factors related to score differences on computerized and paper-and-pencil versions of a series of primary K–3 reading tests. Factors studied included item and student characteristics. The results suggest that the score differences were related more to student characteristics than to item characteristics. These student characteristics included response-style variables, especially omission, and socioeconomic status as measured by free-lunch eligibility. In addition, response style and socioeconomic status appear to be relatively independent factors in the score differences. Variables studied but not found to be related to the format score differences included the association of items with a reading passage, item difficulty, and teacher versus computer administration of items. However, because this is the first study of the factors behind such score differences below Grade 3, and because a number of states are expanding computer testing in the primary grades, additional studies are needed to verify the importance of these two factors.

19.
Applied Measurement in Education, 2013, 26(4): 313-334
The purpose of this study was to compare the IRT-based area method and the Mantel-Haenszel method for investigating differential item functioning (DIF), to determine the degree of agreement between the methods in identifying potentially biased items, and, when the two methods led to different results, to identify possible reasons for the discrepancies. Data for the study were the item responses of Anglo American and Native American students who took the 1982 New Mexico High School Proficiency Exam. Two samples of 1,000 students from each group were studied. The major findings were that (a) the consistency of classifying items into "biased" and "not biased" categories across replications was 75% to 80% for both methods, and (b) when the unreliability of the statistics was taken into account, the two methods led to very similar results. Discrepancies between the methods were due to the presence of nonuniform DIF (which the Mantel-Haenszel method could not identify) and to the choice of interval over which DIF was assessed (the IRT results depended on the interval chosen). The implications for practitioners seem clear: the Mantel-Haenszel method in general provides an acceptable approximation to the IRT-based methods.
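The area method contrasted above quantifies DIF as the area between the reference- and focal-group item characteristic curves over a chosen ability interval; because the result depends on that interval (one of the discrepancies noted in the abstract), the sketch below makes the interval explicit. The parameter values and interval are illustrative only.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve (D = 1.7 scaling)."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def unsigned_area_dif(params_ref, params_foc, lo=-3.0, hi=3.0, n=601):
    """Unsigned area between the two groups' ICCs over [lo, hi], via the trapezoid rule."""
    theta = np.linspace(lo, hi, n)
    gap = np.abs(icc_3pl(theta, *params_ref) - icc_3pl(theta, *params_foc))
    return np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(theta))

# An item whose discrimination differs between groups shows nonuniform DIF: the ICCs cross,
# which the area method can detect but the Mantel-Haenszel statistic tends to miss.
print(unsigned_area_dif(params_ref=(1.2, 0.0, 0.2), params_foc=(0.7, 0.3, 0.2)))
```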

20.
Applied Measurement in Education, 2013, 26(4): 297-312
Certain potential benefits of using item response theory in test construction are discussed and evaluated in light of the experience and evidence accumulated during nine years of using a three-parameter model in the construction of major achievement batteries. We also discuss several cautions and limitations in realizing these benefits, as well as issues in need of further research. The potential benefits considered are: obtaining "sample-free" item calibrations and "item-free" person measurement; automatically equating various tests; decreasing the standard errors of scores, without increasing the number of items, by using item pattern scoring; assessing item bias (or differential item functioning) independently of difficulty in a manner consistent with item selection; determining how adequate a tryout pool of items may be; setting up computer-generated "ideal" tests drawn from pools as targets for test developers; and controlling the standard error of a selected test at any desired set of score levels.
