期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

An Assessment of the Nonparametric Approach for Evaluating the Fit of Item Response Models

Tie Liang Craig S. Wells Ronald K. Hambleton 《Journal of Educational Measurement》2014,51(1):1-17

As item response theory has been more widely applied, investigating the fit of a parametric model becomes an important part of the measurement process. There is a lack of promising solutions to the detection of model misfit in IRT. Douglas and Cohen introduced a general nonparametric approach, RISE (Root Integrated Squared Error), for detecting model misfit. The purposes of this study were to extend the use of RISE to more general and comprehensive applications by manipulating a variety of factors (e.g., test length, sample size, IRT models, ability distribution). The results from the simulation study demonstrated that RISE outperformed G² and S‐X² in that it controlled Type I error rates and provided adequate power under the studied conditions. In the empirical study, RISE detected reasonable numbers of misfitting items compared to G² and S‐X², and RISE gave a much clearer picture of the location and magnitude of misfit for each misfitting item. In addition, there was no practical consequence to classification before and after replacement of misfitting items detected by three fit statistics. 相似文献

2.

The Hierarchy Consistency Index: Evaluating Person Fit for Cognitive Diagnostic Assessment

Ying Cui Jacqueline P. Leighton 《Journal of Educational Measurement》2009,46(4):429-449

In this article, we introduce a person‐fit statistic called the hierarchy consistency index (HCI) to help detect misfitting item response vectors for tests developed and analyzed based on a cognitive model. The HCI ranges from ?1.0 to 1.0, with values close to ?1.0 indicating that students respond unexpectedly or differently from the responses expected under a given cognitive model. A simulation study was conducted to evaluate the power of the HCI in detecting different types of misfitting item response vectors. Simulation results revealed that the detection rate of the HCI was a function of type of misfit, item discriminating power, and test length. The best detection rates were achieved when the HCI was applied to tests that consisted of a large number of highly discriminating items. In addition, whether a misfitting item response vector can be correctly identified depends, to a large degree, on the number of misfits of the item response vector relative to the cognitive model. When misfitting response behavior only affects a small number of item responses, the resulting item response vector will not be substantially different from the expectations under the cognitive model and consequently may not be statistically identified as misfitting. As an item response vector deviates further from the model expectations, misfits are more easily identified and consequently higher detection rates of the HCI are expected. 相似文献

3.

Validation of Cognitive Sensitivity for Item Response Curves

Kikumi K. Tatsuoka 《Journal of Educational Measurement》1987,24(3):233-245

This study sought a scientific way to examine whether item response curves are influenced systematically by the cognitive processes underlying solution of the items in a procedural domain (addition of fractions). Starting from an expert teacher's logical task analysis and prediction of various erroneous rules and sources of misconceptions, an error diagnostic program was developed. This program was used to carry out an error analysis of test performance by three samples of students. After the cognitive structure of the subtasks was validated by a majority of the students, the items were characterized by their underlying subtask patterns. It was found that item response curves for items in the same categories were significantly more homogeneous than those in different categories. In other words, underlying cognitive subtasks appeared to systematically influence the slopes and difficulties of item response curves. 相似文献

4.

Can Item Response Times Provide Insight Into Students’ Motivation and Self‐Efficacy in Math? An Initial Application of Test Metadata to Understand Students’ Social–Emotional Needs

James Soland 《Educational Measurement》2019,38(3):86-96

As computer‐based tests become more common, there is a growing wealth of metadata related to examinees’ response processes, which include solution strategies, concentration, and operating speed. One common type of metadata is item response time. While response times have been used extensively to improve estimates of achievement, little work considers whether these metadata may provide useful information on social–emotional constructs. This study uses an analytic example to explore whether metadata might help illuminate such constructs. Specifically, analyses examine whether the amount of time students spend on test items (after accounting for item difficulty and estimates of true achievement), and difficult items in particular, tell us anything about the student's academic motivation and self‐efficacy. While results do not indicate a strong relationship between mean item durations and these constructs in general, the amount of time students spend on very difficult items is highly correlated with motivation and self‐efficacy. The implications of these findings for using response process metadata to gain information on social–emotional constructs are discussed. 相似文献

5.

Influence of Item Parameter Estimation Errors in Test Development

Ronald K. Hambleton Russell W. Jones H. Jane Rogers 《Journal of Educational Measurement》1993,30(2):143-155

Item response models are finding increasing use in achievement and aptitude test development. Item response theory (IRT) test development involves the selection of test items based on a consideration of their item information functions. But a problem arises because item information functions are determined by their item parameter estimates, which contain error. When the "best" items are selected on the basis of their statistical characteristics, there is a tendency to capitalize on chance due to errors in the item parameter estimates. The resulting test, therefore, falls short of the test that was desired or expected. The purposes of this article are (a) to highlight the problem of item parameter estimation errors in the test development process, (b) to demonstrate the seriousness of the problem with several simulated data sets, and (c) to offer a conservative solution for addressing the problem in IRT-based test development. 相似文献

6.

The Effect of Drag-and-Drop Item Features on Test-Taker Performance and Response Strategies

Burcu Arslan Yang Jiang Madeleine Keehner Tao Gong Irvin R. Katz Fred Yan 《Educational Measurement》2020,39(2):96-106

Computer-based educational assessments often include items that involve drag-and-drop responses. There are different ways that drag-and-drop items can be laid out and different choices that test developers can make when designing these items. Currently, these decisions are based on experts’ professional judgments and design constraints, rather than empirical research, which might threaten the validity of interpretations of test outcomes. To this end, we investigated the effect of drag-and-drop item features on test-taker performance and response strategies with a cognition-centered approach. Four hundred and seventy-six adult participants solved content-equivalent drag-and-drop mathematics items under five design variants. Results showed that: (a) test takers’ performance and response strategies were affected by the experimental manipulations, and (b) test takers mostly used cognitively efficient response strategies regardless of the manipulated item features. Implications of the findings are provided to support test developers’ design decisions. 相似文献

7.

Unidimensional Interpretations for Multidimensional Test Items

Nilufer Kahraman 《Journal of Educational Measurement》2013,50(2):227-246

This article considers potential problems that can arise in estimating a unidimensional item response theory (IRT) model when some test items are multidimensional (i.e., show a complex factorial structure). More specifically, this study examines (1) the consequences of model misfit on IRT item parameter estimates due to unintended minor item‐level multidimensionality, and (2) whether a Projection IRT model can provide a useful remedy. A real‐data example is used to illustrate the problem and also is used as a base model for a simulation study. The results suggest that ignoring item‐level multidimensionality might lead to inflated item discrimination parameter estimates when the proportion of multidimensional test items to unidimensional test items is as low as 1:5. The Projection IRT model appears to be a useful tool for updating unidimensional item parameter estimates of multidimensional test items for a purified unidimensional interpretation. 相似文献

8.

Digital Module 08: Foundations of Operational Item Analysis https://ncme.elevate.commpartners.com

Hanwook Henry Yoo Ronald K. Hambleton 《Educational Measurement》2019,38(3):116-117

Item analysis is an integral part of operational test development and is typically conducted within two popular statistical frameworks: classical test theory (CTT) and item response theory (IRT). In this digital ITEMS module, Hanwook Yoo and Ronald K. Hambleton provide an accessible overview of operational item analysis approaches within these frameworks. They review the different stages of test development and associated item analyses to identify poorly performing items and effective item selection. Moreover, they walk through the computational and interpretational steps for CTT‐ and IRT‐based evaluation statistics using simulated data examples and review various graphical displays such as distractor response curves, item characteristic curves, and item information curves. The digital module contains sample data, Excel sheets with various templates and examples, diagnostic quiz questions, data‐based activities, curated resources, and a glossary. 相似文献

9.

Empirical versus subjective procedures for identifying gender differences in science test items

Richard R. Sudweeks Richard R. Tolman 《科学教学研究杂志》1993,30(1):3-19

When judgmental and statistical procedures are both used to identify potentially gender-biased items in a test, to what extent do the results agree? In this study, both procedures were used to evaluate the items in a statewide, 78-item, multiple-choice test of science knowledge. Only one item was flagged by the sensitivity reviewers as being potentially biased, but this item was not flagged by the statistical procedure. None of the nine items flagged by the Mantel-Haenszel procedure were flagged by the sensitivity reviewers. Eight of the nine statistically flagged items were differentially easier for males. Four of these eight measured the same category of objectives. The authors conclude that both judgmental and statistical procedures provide useful information and that both should be used in test construction. They caution readers that content-validity issues need to be addressed when making decisions based on the results of either procedure. 相似文献

10.

Learning about students’ knowledge and thinking in science through large-scale quantitative studies

Rolf V. Olsen Are Turmo Svein Lie 《European Journal of Psychology of Education - EJPE》2001,16(3):403-420

The main issue addressed in this article is that there is much to learn about students’ knowledge and thinking in science from largescale international quantitative studies beyond overall score measures. Response patterns on individual or groups of items can give valuable diagnostic insight into students’ conceptual understanding, but there is also a danger of drawing conclusions that may be too simple and nonvalid. We discuss how responses to multiple-choice items could be interpreted, and we also show how responses on constructed-response items can be systematised and analysed. Finally, we study, empirically, interactions between item characteristics and student responses. It is demonstrated that even small changes in the item wording and/or the item format may have a substantial influence on the response pattern. Therefore, we argue that interpretations of results from these kinds of studies should be based on a thorough analysis of the actual items used. We further argue that diagnostic information should be an integrated part of the international research aims of such large-scale studies. Examples of items and student responses presented are taken from The Third International Mathematics and Science Study (TIMSS). 相似文献

11.

An Empirical Examination of the IRT Information of Polytomously Scored Reading Items Under the Generalized Partial Credit Model

John R. Donoghue 《Journal of Educational Measurement》1994,31(4):295-311

Using Muraki's (1992) generalized partial credit IRT model, polytomous items (responses to which can be scored as ordered categories) from the 1991 field test of the NAEP Reading Assessment were calibrated simultaneously with multiple-choice and short open-ended items. Expected information of each type of item was computed. On average, four-category polytomous items yielded 2.1 to 3.1 times as much IRT information as dichotomous items. These results provide limited support for the ad hoc rule of weighting k-category polytomous items the same as k - 1 dichotomous items for computing total scores. Polytomous items provided the most information about examinees of moderately high proficiency; the information function peaked at 1.0 to 1.5, and the population distribution mean was 0. When scored dichotomously, information in polytomous items sharply decreased, but they still provided more expected information than did the other response formats. For reference, a derivation of the information function for the generalized partial credit model is included. 相似文献

12.

The Relationship Between Item Parameters and Item Fit

Hamzeh Dodeen 《Journal of Educational Measurement》2004,41(3):261-270

The effect of item parameters (discrimination, difficulty, and level of guessing) on the item-fit statistic was investigated using simulated dichotomous data. Nine tests were simulated using 1,000 persons, 50 items, three levels of item discrimination, three levels of item difficulty, and three levels of guessing. The item fit was estimated using two fit statistics: the likelihood ratio statistic (X²_B), and the standardized residuals (SRs). All the item parameters were simulated to be normally distributed. Results showed that the levels of item discrimination and guessing affected the item-fit values. As the level of item discrimination or guessing increased, item-fit values increased and more items misfit the model. The level of item difficulty did not affect the item-fit statistic. 相似文献

13.

A Stepwise Test Characteristic Curve Method to Detect Item Parameter Drift

下载免费PDF全文

Rui Guo Yi Zheng Hua‐Hua Chang 《Journal of Educational Measurement》2015,52(3):280-300

An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is the item response theory–based true score equating, whose goal is to generate a conversion table to relate number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items and at the same time keeps a large amount of linking items. 相似文献

14.

An NCME Instructional Module on Polytomous Item Response Theory Models

Randall David Penfield 《Educational Measurement》2014,33(1):36-48

A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of polytomous IRT models. The module presents commonly encountered polytomous IRT models, describes their properties, and contrasts their defining principles and assumptions. After completing this module, the reader should have a sound understating of what a polytomous IRT model is, the manner in which the equations of the models are generated from the model's underlying step functions, how widely used polytomous IRT models differ with respect to their definitional properties, and how to interpret the parameters of polytomous IRT models. 相似文献

15.

The Scaling of Mixed-Item-Format Tests With the One-Parameter and Two-Parameter Partial Credit Models

Robert C. Sykes Wendy M. Yen 《Journal of Educational Measurement》2000,37(3):221-244

Item response theory scalings were conducted for six tests with mixed item formats. These tests differed in their proportions of constructed response (c.r.) and multiple choice (m.c.) items and in overall difficulty. The scalings included those based on scores for the c.r. items that had maintained the number of levels as the item rubrics, either produced from single ratings or multiple ratings that were averaged and rounded to the nearest integer, as well as scalings for a single form of c.r. items obtained by summing multiple ratings. A one-parameter (IPPC) or two-parameter (2PPC) partial credit model was used for the c.r. items and the one-parameter logistic (IPL) or three-parameter logistic (3PL) model for the m.c. items, ltem fit was substantially worse with the combination IPL/IPPC model than the 3PL/2PPC model due to the former's restrictive assumptions that there would be no guessing on the m.c. items and equal item discrimination across items and item types. The presence of varying item discriminations resulted in the IPL/IPPC model producing estimates of item information that could be spuriously inflated for c.r. items that had three or more score levels. Information for some items with summed ratings were usually overestimated by 300% or more for the IPL/IPPC model. These inflated information values resulted in under-estbnated standard errors of ability estimates. The constraints posed by the restricted model suggests limitations on the testing contexts in which the IPL/IPPC model can be accurately applied. 相似文献

16.

Evaluating Comparability in Computerized Adaptive Testing: Issues, Criteria and an Example

Tianyou Wang Michael J. Kolen 《Journal of Educational Measurement》2001,38(1):19-49

When a computerized adaptive testing (CAT) version of a test co-exists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from its P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review research literature on CAT comparability issues and synthesize issues specific to these two settings. A framework of criteria for evaluating comparability was developed that contains the following three categories of criteria: validity criterion, psychometric property/reliability criterion, and statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria as well as various algorithms for improving comparability are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of the development of a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of incomparability may be due to the change from number-correct score-based scoring to IRT ability estimation-based scoring. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size were found to result in different levels of comparability in test scores. 相似文献

17.

A Comparison of Item Selection Routines in Linear and Adaptive Tests

Deborah L. Schnipke Bert F. Green 《Journal of Educational Measurement》1995,32(3):227-242

Two item selection algorithms were compared in simulated linear and adaptive tests of cognitive ability. One algorithm selected items that maximally differentiated between examinees. The other used item response theory (IRT) to select items having maximum information for each examinee. Normally distributed populations of 1,000 cases were simulated, using test lengths of 4, 5, 6, and 7 items. Overall, adaptive tests based on maximum information provided the most information over the widest range of ability values and, in general, differentiated among examinees slightly better than the other tests. Although the maximum differentiation technique may be adequate in some circumstances, adaptive tests based on maximum information are clearly superior. 相似文献

18.

IRT Model Misspecification and Measurement of Growth in Vertical Scaling

Daniel M. Bolt Sien Deng Sora Lee 《Journal of Educational Measurement》2014,51(2):141-162

Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two‐ and three‐parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL‐ or 3PL‐based vertical scaling in the presence of LPE‐induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades. 相似文献

19.

Identifying Sources of Differential Item and Bundle Functioning on Translated Achievement Tests: A Confirmatory Analysis

Mark J. Gierl Shameem Nyla Khaliq 《Journal of Educational Measurement》2001,38(2):164-187

Increasingly, tests are being translated and adapted into different languages. Differential item functioning (DIF) analyses are often used to identify non-equivalent items across language groups. However, few studies have focused on understanding why some translated items produce DIF. The purpose of the current study is to identify sources of differential item and bundle functioning on translated achievement tests using substantive and statistical analyses. A substantive analysis of existing DIF items was conducted by an 11-member committee of testing specialists. In their review, four sources of translation DIF were identified. Two certified translators used these four sources to categorize a new set of DIF items from Grade 6 and 9 Mathematics and Social Studies Achievement Tests. Each item was associated with a specific source of translation DIF and each item was anticipated to favor a specific group of examinees. Then, a statistical analysis was conducted on the items in each category using SIBTEST. The translators sorted the mathematics DIF items into three sources, and they correctly predicted the group that would be favored for seven of the eight items or bundles of items across two grade levels. The translators sorted the social studies DIF items into four sources, and they correctly predicted the group that would be favored for eight of the 13 items or bundles of items across two grade levels. The majority of items in mathematics and social studies were associated with differences in the words, expressions, or sentence structure of items that are not inherent to the language and/or culture. By combining substantive and statistical DIF analyses, researchers can study the sources of DIF and create a body of confirmed DIF hypotheses that may be used to develop guidelines and test construction principles for reducing DIF on translated tests. 相似文献

20.

How to Equate Tests With Little or No Data

Robert J. Mislevy Kathleen M. Sheehan Marilyn Wingersky 《Journal of Educational Measurement》1993,30(1):55-78

Standard procedures for equating tests, including those based on item response theory (IRT), require item responses from large numbers of examinees. Such data may not be forthcoming for reasons theoretical, political, or practical. Information about items' operating characteristics may be available from other sources, however, such as content and format specifications, expert opinion, or psychological theories about the skills and strategies required to solve them. This article shows how, in the IRT framework, collateral information about items can be exploited to augment or even replace examinee responses when linking or equating new tests to established scales. The procedures are illustrated with data from the Pre-Professional Skills Test (PPST). 相似文献