Similar Articles
20 similar articles found.
1.
This article demonstrates the utility of restricted item response models for examining item difficulty ordering and slope uniformity for an item set that reflects varying cognitive processes. Twelve sets of paired algebra word problems were developed to systematically reflect various types of cognitive processes required for successful performance, resulting in a total of 24 items reflecting distance-rate-time (DRT), interest, and area problems. Hypotheses concerning difficulty ordering and slope uniformity for the items were tested by constraining item difficulty and discrimination parameters in hierarchical item response models. The first set of model comparisons tested the equality of the discrimination and difficulty parameters for each set of paired items. The second set of model comparisons examined slope uniformity within the complex DRT problems. The third set of model comparisons examined whether the familiarity of the story context affected item difficulty for two types of complex DRT problems. The last set of model comparisons tested the hypothesized difficulty ordering of the items.

2.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most existing methods were designed to detect drift in individual items, which may not be adequate for test characteristic curve-based linking or equating. One example is item response theory-based true score equating, whose goal is to generate a conversion table relating number-correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method that detects item parameter drift iteratively from the test characteristic curves, without requiring any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three-parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items, while at the same time retaining a large number of linking items.
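To make the TCC comparison concrete, here is a minimal NumPy sketch, not the authors' implementation, of the quantity such a stepwise method monitors: the gap between two administrations' test characteristic curves under the 3PL model, recomputed as candidate linking items are dropped. All parameter values and the squared-gap criterion are illustrative assumptions.

```python
import numpy as np

def p3pl(theta, a, b, c):
    """3PL probability of a correct response; theta is a grid, a/b/c are item vectors."""
    return c + (1 - c) / (1 + np.exp(-a * (theta[:, None] - b)))

def tcc(theta, a, b, c):
    """Test characteristic curve: expected number-correct score at each theta."""
    return p3pl(theta, a, b, c).sum(axis=1)

theta = np.linspace(-4, 4, 81)
# Hypothetical parameters for three linking items on two administrations;
# item 1's difficulty has drifted upward on the second form.
a1, b1, c1 = np.array([1.2, 0.8, 1.5]), np.array([-0.5, 0.0, 1.0]), np.full(3, 0.2)
a2, b2, c2 = a1.copy(), np.array([-0.5, 0.6, 1.0]), c1.copy()

gap = np.mean((tcc(theta, a1, b1, c1) - tcc(theta, a2, b2, c2)) ** 2)
for i in range(3):
    keep = np.arange(3) != i  # stepwise idea: drop one item and re-measure the gap
    g = np.mean((tcc(theta, a1[keep], b1[keep], c1[keep])
                 - tcc(theta, a2[keep], b2[keep], c2[keep])) ** 2)
    print(f"drop item {i}: TCC gap {gap:.4f} -> {g:.4f}")
```

Dropping the drifted item shrinks the gap far more than dropping either stable item, which is the signal a stepwise procedure can exploit without a predetermined critical value.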

3.
Construct-irrelevant cognitive complexity in some items on statewide grade-level assessments may impose performance barriers for students with disabilities who are ineligible for alternate assessments based on alternate achievement standards. This has spurred research into whether items can be modified to reduce that complexity without altering the construct being measured. This study uses a generalized linear mixed modeling analysis to investigate the effects of item modifications on improving test accessibility by reducing construct-irrelevant cognitive barriers for persistently low-performing fifth-grade students with cognitive disabilities. The results showed that item scaffolding was an effective modification for both mathematics and reading, whereas other modifications, such as bolding/underlining of key words, hindered test performance for low-performing students. We discuss the findings' potential impact on test development with universal design.

4.
《教育实用测度》2013,26(3):257-275
The purpose of this study was to investigate the technical properties of stem-equivalent mathematics items differing only with respect to response format. Using socioeconomic factors to define the strata, a proportional stratified random sample of 1,366 Connecticut sixth-grade students was administered one of three forms. Classical item analysis, dimensionality assessment, item response theory goodness-of-fit testing, and an item bias analysis were conducted. Analysis of variance and confirmatory factor analysis were used to examine the functioning of the items presented in the three different formats. It was found that, after equating forms, the constructed-response formats were somewhat more difficult than the multiple-choice format. However, there was no significant difference across formats with respect to item discrimination. A differential item functioning (DIF) analysis was conducted using both the Mantel-Haenszel procedure and a comparison of the item characteristic curves. The DIF analysis indicated that the presence of bias was not greatly affected by item format; that is, items biased in one format tended to be biased in a similar manner when presented in a different format, and unbiased items tended to remain so regardless of format.

5.
Item stem formats can alter the cognitive complexity of mathematics items as well as the type of abilities required to solve them. Consequently, item stem formats may affect the dimensional structure of mathematics assessments. This empirical study investigated the relationship between item stem format and the dimensionality of mathematics assessments. A sample of 671 sixth-grade students was given two forms of a mathematics assessment in which mathematical expression (ME) items and word problems (WP) were used to measure the same content. The effects of mathematical language and reading abilities on responses to ME and WP items were explored using unidimensional and multidimensional item response theory models. The results showed that WP and ME items appear to differ in the underlying abilities required to answer them; accordingly, the multidimensional model fit the response data better than the unidimensional model. For accurate assessment of mathematics achievement, students' reading and mathematical language abilities should therefore be considered when mathematics assessments include both ME and WP items.
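One common way to formalize this is a compensatory multidimensional 2PL, in which each item carries a slope on every ability dimension. The sketch below is illustrative only; the loadings, intercepts, and the assumption that WP items load on a second (reading/mathematical-language) dimension are ours, not estimates from the study.

```python
import numpy as np

def mirt_2pl(theta, a, d):
    """Compensatory multidimensional 2PL: P(correct) from one slope per dimension."""
    return 1 / (1 + np.exp(-(np.dot(a, theta) + d)))

theta = [0.5, -0.2]  # hypothetical examinee: math ability, reading/math-language ability
print(mirt_2pl(theta, a=[1.0, 0.8], d=-0.3))  # WP item: loads on both dimensions
print(mirt_2pl(theta, a=[1.2, 0.1], d=-0.3))  # ME item: loads mainly on math
```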

6.
The current study demonstrates the potential usefulness of the computer as a tool for an item writer. A spelling item type was used for this demonstration, as it seemed to have the fewest facets or dimensions. An analysis was then made of the types of misspellings used by writers of spelling items. A set of error-generation rules was developed and a computer program, The MISSPELLER, was written. A sample of words was fed into the computer, and a list of misspelled words, separated into previously defined error categories, was created. The list was then evaluated by spelling-test developers and judged to be a useful resource.

7.
《教育实用测度》2013,26(1):89-97
Research on the use of multiple-choice tests has presented conflicting evidence about the use of statistical item difficulty as a means of ordering items. An alternative method advocated by many texts is the use of cognitive difficulty. This study examined the effect of using both statistical and cognitive item difficulty to determine item order. Results indicated that students who received items in increasing cognitive order, regardless of the order of statistical difficulty, scored higher on hard items. Students who received forms with opposing cognitive and statistical difficulty orders scored highest on medium-level items. The study concludes with a call for more research on the effects of cognitive difficulty and suggests that future studies examine subscores as well as total test results.

8.
Large data sets from a state reading assessment for third and fifth graders were analyzed to examine differential item functioning (DIF), differential distractor functioning (DDF), and differential omission frequency (DOF) between students with particular categories of disabilities (speech/language impairments, learning disabilities, and emotional behavior disorders) and students without disabilities. Multinomial logistic regression was employed to compare response characteristic curves (RCCs) of individual test items. Although no evidence for serious test bias was found for the state assessment examined in this study, the results indicated that students in different disability categories showed different patterns of DIF, DDF, and DOF, and that the use of RCCs helps clarify the implications of DIF and DDF.

9.
The graded response model can be used to describe test-taking behavior when item responses are classified into ordered categories. In this study, parameter recovery in the graded response model was investigated using the MULTILOG computer program under default conditions. Based on items having five response categories, 36 simulated data sets were generated that varied on true θ distribution, true item discrimination distribution, and calibration sample size. The findings suggest, first, that the correlations between the true and estimated parameters were consistently greater than 0.85 with sample sizes of at least 500. Second, the root mean square error differences between true and estimated parameters were comparable with results from binary-data parameter recovery studies. Of special note was the finding that the calibration sample size had little influence on the recovery of the true ability parameter but did influence item-parameter recovery. Thus, item-parameter estimation error due to small calibration samples did not result in poor person-parameter estimation. It was concluded that at least 500 examinees are needed to achieve an adequate calibration under the graded response model.
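For readers unfamiliar with the model, the sketch below computes graded-response-model category probabilities for a single five-category item by differencing the cumulative 2PL curves at each threshold. The parameter values and function name are illustrative assumptions, not the study's design.

```python
import numpy as np

def grm_probs(theta, a, b):
    """Graded response model: probabilities of the len(b)+1 ordered categories.
    b holds increasing thresholds; P(X >= k) is a 2PL curve at threshold b[k-1]."""
    star = 1 / (1 + np.exp(-a * (theta - np.asarray(b))))  # P(X >= 1), ..., P(X >= K-1)
    cum = np.concatenate(([1.0], star, [0.0]))             # P(X >= 0) = 1; P(X >= K) = 0
    return cum[:-1] - cum[1:]                              # category probabilities

# Five response categories, as in the study -> four ordered thresholds.
probs = grm_probs(theta=0.5, a=1.2, b=[-1.5, -0.5, 0.4, 1.3])
print(probs, probs.sum())  # the five probabilities sum to 1
```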

10.
Componential IRT models for polytomous items are of particular interest in two contexts: componential research and test development. We assume that there are basic components, such as processes and knowledge structures, involved in solving cognitive tasks. In componential research, the subtask paradigm may be used to isolate such components in subtasks. In test development, items may be composed such that their response alternatives correspond with specific combinations of such components. In both cases the data may be modeled as polytomous items. With Bock's (1972) nominal model as a general framework, transformation matrices can be used to constrain the parameters of the response categories so as to reflect the componential design of the response categories. In this way, both main effects and interaction effects of components can be studied. An application to a spelling task demonstrates this approach.
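The sketch below illustrates the mechanics under stated assumptions: category probabilities follow Bock's nominal model (a softmax over category logits), and a transformation matrix T maps two hypothetical component parameters onto the four category slopes. The component design and all numbers are invented for illustration.

```python
import numpy as np

def nominal_probs(theta, slopes, intercepts):
    """Bock's nominal model: softmax over category logits a_k * theta + c_k."""
    z = slopes * theta + intercepts
    ez = np.exp(z - z.max())  # subtract max for numerical stability
    return ez / ez.sum()

# Transformation (design) matrix: each response category is scored as a
# combination of two hypothetical components A and B.
T = np.array([[0, 0],   # category 0: neither component
              [1, 0],   # category 1: component A only
              [0, 1],   # category 2: component B only
              [1, 1]])  # category 3: both components
eta = np.array([0.9, 1.4])     # hypothetical component weights
slopes = T @ eta               # constrained category slopes
print(nominal_probs(theta=0.3, slopes=slopes, intercepts=np.zeros(4)))
```

Main effects of each component live in eta; adding a column to T whose entries are the product of the first two columns would let an interaction effect be estimated in the same way.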

11.
The Survey of Young Adult Literacy conducted in 1985 by the National Assessment of Educational Progress included 63 items that elicited skills in acquiring and using information from written documents. These items were analyzed using two different models: (1) a qualitative cognitive model, which characterized items in terms of the processing tasks they required, and (2) an item response theory (IRT) model, which characterized item difficulties and respondents' proficiencies simply by tendencies toward correct response. This paper demonstrates how a generalization of Fischer and Scheiblechner's Linear Logistic Test Model can be used to integrate information from the cognitive analysis into the IRT analysis, providing a foundation for subsequent item construction, test development, and diagnosis of individuals' skill deficiencies.
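In its standard form (standard notation, not taken from this abstract), the Linear Logistic Test Model is a Rasch model whose item difficulties are decomposed into weighted contributions of cognitive operations:

```latex
P(X_{ij}=1 \mid \theta_j) =
  \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)},
\qquad
b_i = \sum_{k=1}^{K} q_{ik}\,\eta_k ,
```

where q_ik records how often cognitive operation k is required by item i and η_k is that operation's contribution to difficulty; the generalization described in the abstract carries this idea over to the document-literacy items.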

12.
This study attempted to pinpoint the causes of differential item difficulty for blind students taking the braille edition of the Scholastic Aptitude Test's Mathematical section (SAT-M). The study method involved reviewing the literature to identify factors that might cause differential item functioning for these examinees, forming item categories based on these factors, identifying categories that functioned differentially, and assessing the functioning of the items comprising the deviant categories to determine whether the differential effect was pervasive. Results showed an association between selected item categories and differential functioning, particularly for items that included figures in the stimulus, items for which spatial estimation was helpful in eliminating at least two of the options, and items that presented figures that were small or medium in size. The precise meaning of this association was unclear, however, because some items from the suspected categories functioned normally, factors other than the hypothesized ones might have caused the observed aberrant item behavior, and the differential difficulty might reflect real population differences in relevant content knowledge.

13.
Simulation and real data studies are used to investigate the value of modeling multiple-choice distractors in item response theory linking. Using the characteristic curve linking procedure for Bock's (1972) nominal response model presented by Kim and Hanson (2002), all-category linking (i.e., linking based on all category characteristic curves of the linking items) is compared against correct-only (CO) linking (i.e., linking based on the correct category characteristic curves only) using a common-item nonequivalent groups design. CO linking is shown to represent an approximation to what occurs when a traditional correct/incorrect item response model is used for linking. Results suggest that the number of linking items needed to achieve an equivalent level of linking precision declines substantially when the distractor categories are incorporated.

14.
Procedural networks and production systems are used to model an individual's performance on certain cognitive tasks having several subtasks. These models form the basis of the Adaptive Diagnostic System (ADS). By presenting selected items and examining the responses to these items, ADS creates a diagnostic profile of the skills and subskills that an individual may be lacking. ADS is described, and two heuristics approximating optimal item selection are developed. Finally, ADS is briefly compared with Brown and Burton's (1978) BUGGY, an alternate diagnostic system.

15.
A primary assumption underlying several common methods for modeling item response data is unidimensionality; that is, test items tap only one latent trait. This assumption can be assessed in several ways, for example with nonlinear factor analysis or with DETECT, a method based on the items' conditional covariances. When multidimensionality is identified, a question of interest concerns the degree to which individual items are related to the latent traits. When an item response is primarily associated with one of the traits, (approximate) simple structure is said to exist; when the item response is related to both traits, the structure is complex. This study investigated the performance of three indices designed to assess the underlying structure present in item response data, two based on factor analysis and one on DETECT. Results of the Monte Carlo simulations show that none of the indices works uniformly well in identifying the structure underlying item responses, although the DETECT r-ratio may be promising for differentiating between approximate simple and complex structures under certain circumstances.

16.
The present study investigates the degree to which item "bias" techniques can lead to interpretable results when groups are defined in terms of specified differences in the cognitive processes involved in students' problem-solving strategies. A large group of junior high school students who took a test on subtraction of fractions was divided into two subgroups judged by the rule-space model to be using different problem-solving strategies. It was confirmed by use of Mantel-Haenszel (MH) statistics that these subgroups showed different performances on items with different underlying cognitive tasks. We note that, in our case, we are far from faulting items that show differential item functioning (DIF) between two groups defined in terms of different solution strategies. Indeed, they are "desirable" items, as explained in the discussion section.
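As a concrete reference, the sketch below computes the Mantel-Haenszel common odds ratio across matched score strata, the quantity underlying the MH DIF statistic. The stratum counts are invented for illustration, and the function name is ours.

```python
import numpy as np

def mantel_haenszel_or(tables):
    """Common odds ratio across 2x2 tables (one per matched score stratum).
    Each table is [[A, B], [C, D]]: rows = group, columns = right/wrong."""
    t = np.asarray(tables, dtype=float)
    A, B, C, D = t[:, 0, 0], t[:, 0, 1], t[:, 1, 0], t[:, 1, 1]
    N = t.sum(axis=(1, 2))
    return (A * D / N).sum() / (B * C / N).sum()

# Hypothetical strata: right/wrong counts for two strategy subgroups on one item.
strata = [[[30, 10], [22, 18]],
          [[25, 5], [20, 10]],
          [[12, 3], [11, 4]]]
print(mantel_haenszel_or(strata))  # values far from 1 signal DIF on this item
```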

17.
The understanding of what makes a question difficult is a crucial concern in assessment. To study the difficulty of test questions, we focus on the case of PISA, which assesses the degree to which 15-year-old students have acquired knowledge and skills essential for full participation in society. Our research question is to identify PISA science item characteristics that could influence an item's proficiency level. It is based on an a priori item analysis and a statistical analysis. Results show that, of the different characteristics of PISA science items determined in our a priori analysis, only cognitive complexity and format have explanatory power for an item's proficiency level. The proficiency level cannot be explained by the dependence/independence of the information provided in the unit and/or item introduction, or by the competence. We conclude that in PISA it appears possible to anticipate a high proficiency level, that is, students' low scores, for items displaying high cognitive complexity. For items of middle or low cognitive complexity, however, the cognitive complexity level is not sufficient to predict item difficulty; other characteristics play a crucial role. We discuss anticipating difficulty in assessment from a broader perspective.

18.
This article investigates the effect of the number of item response categories on chi-square statistics for confirmatory factor analysis, to assess whether a greater number of categories increases the likelihood of identifying spurious factors, as previous research had concluded. Four types of continuous single-factor data were simulated for a 20-item test: (a) uniform for all items, (b) symmetric unimodal for all items, (c) negatively skewed for all items, or (d) negatively skewed for 10 items and positively skewed for 10 items. For each of the four types of distributions, item responses were divided to yield item scores with 2, 4, or 6 categories. The results indicated that the chi-square statistic for evaluating a single-factor model was most inflated (suggesting spurious factors) for 2-category responses and became less inflated as the number of categories increased. However, the Satorra-Bentler scaled chi-square tended not to be inflated even for 2-category responses, except when the continuous item data had both negatively and positively skewed distributions.
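A minimal sketch of the kind of data generation and categorization involved, under our own simplifying assumptions (a single factor with a fixed loading, quantile-based cuts): it produces 2-, 4-, or 6-category item scores from the same continuous responses.

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 1000, 20
f = rng.normal(size=n_persons)                          # single latent factor
x = 0.7 * f[:, None] + rng.normal(scale=0.5, size=(n_persons, n_items))

def categorize(x, k):
    """Cut each item's continuous scores into k categories at its own quantiles."""
    cuts = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1], axis=0)   # (k-1, items)
    return (x[:, None, :] > cuts[None, :, :]).sum(axis=1)           # scores 0..k-1

for k in (2, 4, 6):
    scores = categorize(x, k)
    print(k, "categories; observed score range:", scores.min(), "-", scores.max())
```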

19.
An assumption of item response theory is that a person's score is a function of the item response parameters and the person's ability. In this paper, the effect of variations in instructional coverage on item characteristic functions is examined. Using data from the Second International Mathematics Study (1985), curriculum clusters were formed based on teachers' ratings of their students' opportunities to learn the items on a test. After the curriculum clusters were formed, item response curves were compared using signed and unsigned sums of squared differences. Some of the differences in the item response curves between curriculum clusters were found to be large, but better performance was not necessarily related to greater opportunity to learn. The item response curve differences were much larger than differences reported in prior studies based on comparisons of black and white students. Implications of the findings for applications of item response theory to educational achievement test data are discussed.
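One plausible reading of these indices, sketched below with an illustrative 2PL parameterization and invented cluster parameters (the article's exact formulas may differ): the signed index preserves the direction of the difference between the two clusters' curves, while the unsigned index squares it away.

```python
import numpy as np

def icc(theta, a, b):
    """2PL item characteristic curve (an illustrative model choice)."""
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 61)
p1 = icc(theta, a=1.1, b=0.0)   # hypothetical curve for one curriculum cluster
p2 = icc(theta, a=1.1, b=0.4)   # hypothetical curve for another cluster

d = p1 - p2
signed = np.mean(np.sign(d) * d ** 2)   # keeps direction: which cluster is favored
unsigned = np.mean(d ** 2)              # direction-blind squared difference
print(signed, unsigned)
```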

20.
Michael Scriven has suggested that student rating forms for evaluating college teaching be designed for multiple audiences (instructor, administrator, student) and with a single global item for summative functions (determination of merit, retention, or promotion). This study reviewed approaches to rating form construction, e.g., the factor analytic strategies of Marsh, and recommended Scriven's multiple-audience design. An empirical test of the representativeness of the single global item was reported from an analysis of 1,378 forms collected in a university department of education. The global item correlated most satisfactorily with other items, a computed total of the items, items that represented underlying factors, and various triplets of items selected to represent all possible combinations of items. It was concluded that the multiple-audience rating form showed distinct advantages in design and that the single global item most fairly and highly represented overall teaching performance, as judged by students, for decisions about retention, promotion, and merit made by administrators.
