This article presents a study of ethnic differential item functioning (DIF) for 4th-, 7th-, and 10th-grade reading items on a state criterion-referenced achievement test. The tests, administered from 1997 to 2001, were composed of multiple-choice and constructed-response items. Item performance by focal groups (i.e., students of Asian/Pacific Islander, Black/African American, Native American, and Latino/Hispanic origins) was compared with the performance of White students using simultaneous item bias and Rasch procedures. Flagged multiple-choice items generally favored White students, whereas flagged constructed-response items generally favored students of Asian/Pacific Islander, Black/African American, and Latino/Hispanic origins. Content analysis of flagged reading items showed that positively and negatively flagged items typically measured inference, interpretation, or analysis of text in multiple-choice and constructed-response formats. Items that were not flagged for DIF generally measured very easy reading skills (e.g., literal comprehension) and reading skills that require higher level thinking (e.g., developing interpretations across texts and analyzing graphic elements).

As a global measure of precision, item response theory (IRT) estimated reliability is derived for four coefficients (Cronbach's α, Feldt‐Raju, stratified α, and marginal reliability). Models with different underlying assumptions concerning test‐part similarity are discussed. A detailed computational example is presented for the targeted coefficients. A comparison of the IRT model‐derived coefficients is made and the impact of varying ability distributions is evaluated. The advantages of IRT‐derived reliability coefficients for problems such as automated test form assembly and vertical scaling are discussed.
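
For orientation, the conventional observed-score form of Cronbach's α can be sketched as follows. This is only the classical sample computation, not the IRT-derived estimator the abstract describes; the function name `cronbach_alpha` and the row-per-examinee input layout are assumptions made for illustration.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Classical Cronbach's alpha from an examinee-by-item score matrix.

    scores: list of rows, one per examinee, each a list of k item scores.
    (Hypothetical layout chosen for this sketch.)
    """
    k = len(scores[0])
    # Variance of each item across examinees.
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total (summed) score across examinees.
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Two items that always agree: alpha reaches its maximum of 1.
    print(cronbach_alpha([[1, 1], [0, 0], [1, 1], [0, 0]]))
```

Population variances are used for both the item and total scores; using sample variances throughout gives the same ratio.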

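An Example of Applying Item Response Theory to Equate a Test Containing Multiple Item Types   Total citations: 2, self-citations: 2, citations by others: 2
Using a statewide high school examination in one U.S. state as an example, this article describes the specific procedure for applying item response theory to equate a test containing multiple item types, and also discusses several other technical aspects of the examination.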

The purpose of this study was to investigate the technical properties of stem-equivalent mathematics items differing only with respect to response format. Using socioeconomic factors to define the strata, a proportional stratified random sample of 1,366 Connecticut sixth-grade students were administered one of three forms. Classical item analysis, dimensionality assessment, item response theory goodness-of-fit, and an item bias analysis were conducted. Analysis of variance and confirmatory factor analysis were used to examine the functioning of the items presented in the three different formats. It was found that, after equating forms, the constructed-response formats were somewhat more difficult than the multiple-choice format. However, there was no significant difference across formats with respect to item discrimination. A differential item functioning (DIF) analysis was conducted using both the Mantel-Haenszel procedure and the comparison of the item characteristic curves. The DIF analysis indicated that the presence of bias was not greatly affected by item format; that is, items biased in one format tended to be biased in a similar manner when presented in a different format, and unbiased items tended to remain so regardless of format.
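
The Mantel-Haenszel procedure named in this abstract rests on a common odds ratio pooled over matched score levels. A minimal sketch of that statistic follows, assuming each score level is summarized as a 2x2 table of counts; the function names and the tuple layout are hypothetical, and the abstract does not report how the study itself implemented the procedure.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched score levels.

    strata: iterable of (a, b, c, d) counts per score level, where
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect.
    (Hypothetical input layout for this sketch.)
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level carries no information
        num += a * d / n
        den += b * c / n
    return num / den


def mh_d_dif(alpha_mh):
    """ETS delta-scale transform; negative values favor the reference group."""
    return -2.35 * math.log(alpha_mh)


if __name__ == "__main__":
    tables = [(30, 10, 20, 20), (40, 5, 35, 10)]
    alpha = mantel_haenszel_odds_ratio(tables)
    print(alpha, mh_d_dif(alpha))
```

An odds ratio of 1 (a delta value of 0) means that reference and focal examinees matched on total score have equal odds of answering the item correctly.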

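The RCMLM is a general extended model based on Rasch measurement theory. The RCMLM was used to conduct a gender DIF analysis of a regular senior high school mathematics test. The results show that the model can analyze DIF for dichotomously and polytomously scored items simultaneously, which avoids the drawback of earlier approaches that analyzed the two scoring types separately, preserves the integrity of the test, and makes the DIF results more valid.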

While previous research has identified numerous factors that contribute to item difficulty, studies involving large-scale reading tests have provided mixed results. This study examined five selected-response item types used to measure reading comprehension in the Pearson Test of English Academic: a) multiple-choice (choose one answer), b) multiple-choice (choose multiple answers), c) re-order paragraphs, d) reading (fill-in-the-blanks), and e) reading and writing (fill-in-the-blanks). In a multiple regression approach, item difficulty scores for 172 items served as the criterion measure, and 18 passage, passage-question, and response-format variables served as predictors. Overall, four significant predictors were identified for the entire group (i.e., sentence length, falsifiable distractors, number of correct options, and abstractness of information requested) and five variables were found to be significant for high-performing readers (including the four listed above and passage coherence); only the number of falsifiable distractors was a significant predictor for low-performing readers. Implications for assessing reading comprehension are discussed.

Numerous assessments contain a mixture of multiple choice (MC) and constructed response (CR) item types and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example, multidimensional item response theory (MIRT) may have a promising future in subscale score proficiency estimation, leading toward a more diagnostic orientation, which requires the linking of these subscale scores across different forms and populations. Several multidimensional linking studies can be found in the literature; however, none have used a combination of MC and CR item types. Thus, this research explores multidimensional linking accuracy for tests composed of both MC and CR items using a matching test characteristic/response function approach. The two-dimensional simulation study presented here used real data-derived parameters from a large-scale statewide assessment with two subscale scores for diagnostic profiling purposes, under varying conditions of anchor set lengths (6, 8, 16, 32, 60), across 10 population distributions, with a mixture of simple versus complex structured items, using a sample size of 3,000. It was found that for a well-chosen anchor set, the parameters recovered well after equating across all populations, even for anchor sets composed of as few as six items.

Multiple‐choice, short‐answer, and extended‐response item formats were used in the Third International Mathematics and Science Study to assess student achievement in mathematics and science at Grades 7 and 8 in more than 40 countries around the world. Data pertaining to science indicate that the standings of some countries relative to others change when performance is measured via the different item formats. The question addressed in the present article is the following: Can the instability of ranks in this case be attributed principally to item format, or are other important factors at work? It is argued that the findings provide further evidence that comparing student achievement across countries is a very complex undertaking indeed.

Do eighth-grade males and females display the same DIF patterns as older examinees? Are the patterns the same for different content areas in mathematics? Does a DIF test for essential dimensionality yield expected results?

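Mathematical reading that attends to levels of method is an important means of helping students learn mathematics and can train and develop their analytical reasoning ability. Simple mathematical examples are used to demonstrate the methods and effects of preview reading, intensive reading, read-through, and connected reading. Mathematical reading is an important mode of cognition that can stimulate learners' initiative and improve learning effectiveness.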

Applications of traditional unidimensional item response theory models to passage-based reading comprehension assessment data have been criticized based on potential violations of local independence. However, simple rules for determining dependency, such as including all items associated with a particular passage, may overestimate the dependency that actually exists among the items. The current study proposed a more refined method based on cognitive principles and substantive theories to determine those items that pose a threat. Specifically, the use of common necessary information from text was examined as a contributor of local dependence. Cognitively similar item pairs, those with connected necessary information, had higher local dependence values than item pairs with no connected necessary information. Results suggest that focusing on necessary information may be useful to some extent for understanding and managing item dependence for passage-based reading comprehension tests.

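Using a DIF detection method based on the Rasch model, gender DIF was tested for 106 multiple-choice items from the college entrance examination (gaokao) mathematics tests of the past 10 years. Fifty-five items favored male students and 51 favored female students; among the 14 items showing moderate or more severe DIF, 7 favored male students and 7 favored female students, so the items as a whole show no large gender difference. An analysis of the content and abilities assessed by these 14 items indicates that male students have some advantage in spatial visualization and female students have some advantage in computation. Besides item characteristics (the content and abilities assessed), the differences may also be related to psychological factors during testing; multiple factors need to be considered together and verified.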

This article discusses and demonstrates combining scores from multiple-choice (MC) and constructed-response (CR) items to create a common scale using item response theory methodology. Two specific issues addressed are (a) whether MC and CR items can be calibrated together and (b) whether simultaneous calibration of the two item types leads to loss of information. Procedures are discussed and empirical results are provided using a set of tests in the areas of reading, language, mathematics, and science in three grades.

One approach to measuring unsigned differential test functioning is to estimate the variance of the differential item functioning (DIF) effect across the items of the test. This article proposes two estimators of the DIF effect variance for tests containing dichotomous and polytomous items. The proposed estimators are direct extensions of the noniterative estimators developed by Camilli and Penfield (1997) for tests composed of dichotomous items. A small simulation study is reported in which the statistical properties of the generalized variance estimators are assessed, and guidelines are proposed for interpreting values of DIF effect variance estimators.

Monte Carlo simulations with 20,000 replications are reported to estimate the probability of rejecting the null hypothesis regarding DIF using SIBTEST when there is DIF present and/or when impact is present due to differences on the primary dimension to be measured. Sample sizes are varied from 250 to 2000 and test lengths from 10 to 40 items. Results generally support previous findings for Type I error rates and power. Impact is inversely related to test length. The combination of DIF and impact, with the focal group having lower ability on both the primary and secondary dimensions, results in impact partially masking DIF so that items biased toward the reference group are less likely to be detected.

The psychometric literature provides little empirical evaluation of examinee test data to assess essential psychometric properties of innovative items. In this study, examinee responses to conventional (e.g., multiple choice) and innovative item formats in a computer-based testing program were analyzed for IRT information with the three-parameter and graded response models. The innovative item types considered in this study provided more information across all levels of ability than multiple-choice items. In addition, accurate timing data captured via computer administration were analyzed to consider the relative efficiency of the multiple-choice and innovative item types. Consistent with previous research, multiple-choice items provided more information per unit time. Implications for balancing policy, psychometric, and pragmatic factors in selecting item formats are also discussed.

This study examined gender differences in students' scientific literacy as measured by OECD-PISA. In particular, we focused on the 2437 students from 140 Hong Kong schools. Hong Kong boys' and girls' science scores did not differ overall. However, boys scored higher than girls at the higher percentiles (75th and above). Moreover, specific test components showed gender differences. Boys tended to score higher on tests with more earth and physical science items, understanding of scientific knowledge items, and closed items. Meanwhile, girls tended to score higher on recognizing questions and identifying evidence items. These results suggest that a science test assessing diverse content and literacy skills in a variety of response formats provides both a more comprehensive picture of students' capabilities and a more likely gender-equitable assessment.

Progress has been made in developing statistical methods for identifying DIF items, but procedures to aid with the substantive interpretations of these items have lagged behind. To overcome this problem, Roussos and Stout (1996) proposed a multidimensionality-based DIF analysis paradigm. We illustrate and evaluate an application of this framework as applied to the study of gender differences in mathematics. Four characteristics distinguish this study from previous research: the substantive analysis was guided by past research on the content and cognitive-related sources of gender differences in mathematics achievement, as presented in the taxonomy by Gallagher, De Lisi, Holst, McGillicuddy-De Lisi, Morely, and Cahalan (2000); the substantive analysis was conducted by reviewers who were highly knowledgeable about the cognitive strategies students use to solve math problems; three statistical methods were used to test hypotheses about gender differences, including SIBTEST, DIMTEST, and multiple linear regression; and the data were from a curriculum-based achievement test developed with the goal of minimizing obvious, content-related gender differences. We show that the framework can lead to clearly interpretable results and we highlight both the strengths and weaknesses of applying the Roussos and Stout framework to the study of group differences.

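Reading is the most direct and effective way to acquire language knowledge, and reading ability is an important indicator of overall language proficiency. Starting from the nature of reading, this article analyzes the differing views of reading ability currently held in the foreign language field and discusses two key issues in reading tests: text selection and item writing.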

Differential item functioning (DIF) analyses are a routine part of the development of large-scale assessments. Less common are studies to understand the potential sources of DIF. The goals of this study were (a) to identify gender DIF in a large-scale science assessment and (b) to look for trends in the DIF and non-DIF items due to content, cognitive demands, item type, item text, and visual-spatial or reference factors. To facilitate the analyses, DIF studies were conducted at 3 grade levels and for 2 randomly equivalent forms of the science assessment at each grade level (administered in different years). The DIF procedure itself was a variant of the "standardization procedure" of Dorans and Kulick (1986) and was applied to very large sets of data (6 sets of data, each involving 60,000 students). It has the advantages of being easy to understand and to explain to practitioners. Several findings emerged from the study that would be useful to pass on to test development committees. For example, when there was DIF in science items, MC items tended to favor male examinees and OR items tended to favor female examinees. Compiling DIF information across multiple grades and years increases the likelihood that important trends in the data will be identified and that item writing practices will be informed by more than anecdotal reports about DIF.
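
The basic index behind the standardization procedure can be sketched as a weighted difference in proportion correct between focal and reference examinees at each matched score level, with the focal group's counts as weights. This shows only the textbook form of the index; the abstract says the study used a variant, which is not reproduced here, and the function name and input layout are assumptions.

```python
def std_p_dif(levels):
    """Standardized p-difference for one item.

    levels: iterable of (n_focal, p_focal, p_reference) per matched score
    level, where n_focal is the number of focal-group examinees at that
    level and the p values are proportions correct on the item.
    (Hypothetical input layout for this sketch.)
    """
    levels = list(levels)
    total_weight = sum(n for n, _, _ in levels)
    weighted_diff = sum(n * (pf - pr) for n, pf, pr in levels)
    return weighted_diff / total_weight


if __name__ == "__main__":
    # The focal group trails by 0.10 at the lower score level only.
    print(std_p_dif([(100, 0.5, 0.6), (300, 0.8, 0.8)]))
```

Negative values indicate that focal-group examinees do worse on the item than reference-group examinees with the same total score. Part of the ease of explanation the abstract mentions is that the index stays on the proportion-correct scale.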

