Similar Articles (20 results)
1.
In educational assessment, overall scores obtained by simply averaging a number of domain scores are sometimes reported. However, simply averaging the domain scores ignores the fact that different domains have different score points, that scores from those domains are related, and that at different score points the relationship between overall score and domain score may be different. To report reliable and valid overall scores and domain scores, I investigated the performance of four methods using both real and simulation data: (a) the unidimensional IRT model; (b) the higher-order IRT model, which simultaneously estimates the overall ability and domain abilities; (c) the multidimensional IRT (MIRT) model, which estimates domain abilities and uses the maximum information method to obtain the overall ability; and (d) the bifactor general model. My findings suggest that the MIRT model not only provides reliable domain scores, but also produces reliable overall scores. The overall score from the MIRT maximum information method has the smallest standard error of measurement. In addition, unlike the other models, the MIRT approach assumes no linear relationship between overall score and domain scores. Recommendations for sizes of correlations between domains and the number of items needed for reporting purposes are provided.
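The "maximum information" composite in (c) can be pictured as an information-weighted combination of the domain abilities: better-measured domains contribute more to the overall score, which is why its standard error of measurement comes out smallest. The sketch below is a minimal illustration under that weighting assumption (it also treats the domain estimates as independent); it is not the exact estimator studied in the article.

```python
import numpy as np

def composite_score(domain_thetas, domain_infos):
    """Information-weighted composite of MIRT domain ability estimates.

    Weights are proportional to each domain's test information, and each
    domain SEM is approximated as 1/sqrt(information). Illustrative
    assumptions only: correlations between domain estimates are ignored.
    """
    thetas = np.asarray(domain_thetas, dtype=float)
    infos = np.asarray(domain_infos, dtype=float)
    weights = infos / infos.sum()
    theta_overall = weights @ thetas
    sem_overall = np.sqrt(np.sum(weights ** 2 / infos))
    return theta_overall, sem_overall

# Example: three domains with unequal measurement precision
theta, sem = composite_score([0.4, -0.2, 1.1], [12.0, 6.0, 9.0])
print(f"overall theta = {theta:.3f}, SEM = {sem:.3f}")
```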

2.
Many educational and psychological tests are inherently multidimensional, meaning these tests measure two or more dimensions or constructs. The purpose of this module is to illustrate how test practitioners and researchers can apply multidimensional item response theory (MIRT) to understand better what their tests are measuring, how accurately the different composites of ability are being assessed, and how this information can be cycled back into the test development process. Procedures for conducting MIRT analyses (from obtaining evidence that the test is multidimensional, to modeling the test as multidimensional, to illustrating the properties of multidimensional items graphically) are described from both a theoretical and a substantive basis. This module also illustrates these procedures using data from a ninth-grade mathematics achievement test. It concludes with a discussion of future directions in MIRT research.
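For reference, the compensatory MIRT model that typically underlies such analyses (the multidimensional two-parameter logistic, or M2PL) gives the probability of a correct response as a function of an ability vector; this is the standard textbook form, not a formula quoted from the module:

```latex
P(U_{ij} = 1 \mid \boldsymbol{\theta}_i)
  = \frac{1}{1 + \exp\!\left[-\left(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i + d_j\right)\right]}
```

Here θ_i is examinee i's vector of abilities, a_j is item j's vector of discrimination (slope) parameters, and d_j is its intercept; the direction in the ability space that an item measures best is determined by a_j, which is what graphical representations of multidimensional items depict.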

3.
To assess item dimensionality, the following two approaches are described and compared: the hierarchical generalized linear model (HGLM) and the multidimensional item response theory (MIRT) model. Two generating models are used to simulate dichotomous responses to a 17-item test: the unidimensional and compensatory two-dimensional (C2D) models. For C2D data, seven items are modeled to load on the first and second factors, θ1 and θ2, with the remaining 10 items modeled unidimensionally, emulating a mathematics test in which seven items require an additional reading ability dimension. For both types of generated data, the multidimensionality of item responses is investigated using HGLM and MIRT. Comparison of the HGLM and MIRT results is possible through a transformation of items' difficulty estimates into probabilities of a correct response for a hypothetical examinee at the mean on θ1 and θ2. HGLM and MIRT performed similarly. The benefits of HGLM for item dimensionality analyses are discussed.
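The transformation used to put the two models on a common footing can be made concrete: under the compensatory two-dimensional model, evaluate each item's response function for a hypothetical examinee at the mean of both abilities. A minimal sketch, assuming the M2PL parameterization with an intercept:

```python
import math

def p_correct_c2d(a1, a2, d, theta1=0.0, theta2=0.0):
    """P(correct) under a compensatory two-dimensional IRT model.

    Evaluates an item's response surface for a hypothetical examinee,
    by default at the mean on both abilities (theta1 = theta2 = 0),
    mirroring the comparison described above. The M2PL form with an
    intercept d is an illustrative assumption.
    """
    z = a1 * theta1 + a2 * theta2 + d
    return 1.0 / (1.0 + math.exp(-z))

# An item loading on both factors, evaluated at the mean examinee
print(round(p_correct_c2d(a1=1.2, a2=0.8, d=-0.5), 3))
```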

4.
Many researchers have suggested that the main cause of item bias is the misspecification of the latent ability space, where items that measure multiple abilities are scored as though they are measuring a single ability. If two different groups of examinees have different underlying multidimensional ability distributions and the test items are capable of discriminating among levels of abilities on these multiple dimensions, then any unidimensional scoring scheme has the potential to produce item bias. It is the purpose of this article to provide the testing practitioner with insight about the difference between item bias and item impact and how they relate to item validity. These concepts will be explained from a multidimensional item response theory (MIRT) perspective. Two detection procedures, the Mantel-Haenszel (as modified by Holland and Thayer, 1988) and Shealy and Stout's Simultaneous Item Bias (SIB; 1991) strategies, will be used to illustrate how practitioners can detect item bias.
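Of the two detection procedures, the Mantel-Haenszel statistic is the more easily sketched: examinees are stratified on a matching criterion (usually total score), a 2×2 group-by-response table is formed within each stratum, and a common odds ratio is pooled across strata. A minimal sketch of the standard statistic with invented counts, not the modified procedure or data used in the article:

```python
import math

def mantel_haenszel_alpha(tables):
    """Mantel-Haenszel common odds ratio across score-matched strata.

    Each stratum is ((A, B), (C, D)): rows are reference/focal group,
    columns are correct/incorrect. alpha_MH near 1 suggests no uniform
    DIF; results are often reported on the ETS delta scale as
    -2.35 * ln(alpha_MH).
    """
    num = sum(A * D / (A + B + C + D) for (A, B), (C, D) in tables)
    den = sum(B * C / (A + B + C + D) for (A, B), (C, D) in tables)
    return num / den

# Three matched score strata with invented counts
tables = [((40, 10), (35, 15)),
          ((30, 20), (22, 28)),
          ((15, 35), (10, 40))]
alpha = mantel_haenszel_alpha(tables)
print(f"alpha_MH = {alpha:.3f}, MH D-DIF = {-2.35 * math.log(alpha):.3f}")
```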

5.
Most researchers agree that psychological/educational tests are sensitive to multiple traits, implying the need for multidimensional item response theory (MIRT). One limitation of applying MIRT in practice is the difficulty in establishing equivalent scales of multiple traits. In this study, a new MIRT linking method was proposed and evaluated by comparison with two existing methods. The results showed that the new method was more acceptable in transforming item parameters and maintaining dimensional structures. Limitations and cautions in using multidimensional linking techniques were also discussed.
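In the compensatory MIRT framework, "establishing equivalent scales" amounts to finding a linear transformation of the ability space that leaves response probabilities unchanged, and linking methods differ in how they estimate that transformation. A standard statement of this invariance, assuming the M2PL parameterization rather than the article's exact notation:

```latex
\boldsymbol{\theta}^{*} = \mathbf{A}\boldsymbol{\theta} + \boldsymbol{\beta},
\qquad
\mathbf{a}_j^{*} = \left(\mathbf{A}^{-1}\right)^{\top}\mathbf{a}_j,
\qquad
d_j^{*} = d_j - \mathbf{a}_j^{\top}\mathbf{A}^{-1}\boldsymbol{\beta}
```

Under these relations, a_j*ᵀθ* + d_j* = a_jᵀθ + d_j for every examinee, so the rotation matrix A and translation vector β move item and person parameters between scales without changing the model-implied probabilities.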

6.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.
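One of the indices evaluated is easy to make concrete: marginal reliability can be computed from the conditional standard errors of measurement (CSEMs) an IRT model supplies at each score point. A minimal sketch under one common definition (rho = 1 minus mean squared CSEM over score variance), using simulated values rather than the exams analyzed in the article:

```python
import numpy as np

def marginal_reliability(score_est, csem):
    """Marginal reliability from conditional standard errors.

    One common IRT-based definition: rho = 1 - mean(CSEM^2) / var(score).
    An illustrative sketch; the article also reports overall SEM and
    classification consistency/accuracy indices.
    """
    score_est = np.asarray(score_est, dtype=float)
    csem = np.asarray(csem, dtype=float)
    return 1.0 - np.mean(csem ** 2) / np.var(score_est, ddof=1)

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, 500)       # simulated score estimates
csem = 0.3 + 0.1 * np.abs(theta)        # larger errors at extreme scores
print(round(marginal_reliability(theta, csem), 3))
```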

7.
The purpose of this paper is to define and evaluate the categories of cognitive models underlying at least three types of educational tests. We argue that while all educational tests may be based, explicitly or implicitly, on a cognitive model, the categories of cognitive models underlying tests often range in their development and in the psychological evidence gathered to support their value. For researchers and practitioners, awareness of different cognitive models may facilitate the evaluation of educational measures for the purpose of generating diagnostic inferences, especially about examinees' thinking processes, including misconceptions, strengths, and/or abilities. We think a discussion of the types of cognitive models underlying educational measures is useful not only for taxonomic ends, but also for becoming increasingly aware of evidentiary claims in educational assessment and for promoting the explicit identification of cognitive models in test development. We begin our discussion by defining the term cognitive model in educational measurement. Next, we review and evaluate three categories of cognitive models that have been identified for educational testing purposes using examples from the literature. Finally, we highlight the practical implications of "blending" models for the purpose of improving educational measures.

8.
Sometimes, test‐takers may not be able to attempt all items to the best of their ability (with full effort) due to personal factors (e.g., low motivation) or testing conditions (e.g., time limit), resulting in poor performances on certain items, especially those located toward the end of a test. Standard item response theory (IRT) models fail to consider such testing behaviors. In this study, a new class of mixture IRT models was developed to account for such testing behavior in dichotomous and polytomous items, by assuming test‐takers were composed of multiple latent classes and by adding a decrement parameter to each latent class to describe performance decline. Parameter recovery, effect of model misspecification, and robustness of the linearity assumption in performance decline were evaluated using simulations. It was found that the parameters in the new models were recovered fairly well by using the freeware WinBUGS; the failure to account for such behavior by fitting standard IRT models resulted in overestimation of difficulty parameters on items located toward the end of the test and overestimation of test reliability; and the linearity assumption in performance decline was rather robust. An empirical example is provided to illustrate the applications and the implications of the new class of models.
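One way to formalize the decrement idea, offered here as our own schematic rather than the authors' exact parameterization, is to let an item's effective difficulty grow linearly with its serial position, at a rate specific to each latent class:

```latex
P\!\left(X_{pi} = 1 \mid \theta_p,\; p \in c\right)
  = \frac{\exp\!\left[\theta_p - \left(b_i + \delta_c \, t_i\right)\right]}
         {1 + \exp\!\left[\theta_p - \left(b_i + \delta_c \, t_i\right)\right]}
```

Here t_i is the (scaled) position of item i in the test and δ_c ≥ 0 is the decrement for latent class c; setting δ_c = 0 recovers a standard Rasch model for the full-effort class. The sketch also shows why ignoring the decline inflates difficulty estimates for end-of-test items: the position effect gets absorbed into b_i.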

9.
This research examined the effect of scoring items thought to be multidimensional using a unidimensional model and demonstrated the use of multidimensional item response theory (MIRT) as a diagnostic tool. Using real data from a large-scale mathematics test, previously shown to function differentially in favor of proficient writers, the difference in proficiency classifications was explored when a two- versus one-dimensional confirmatory model was fit. The estimate of ability obtained when using the unidimensional model was considered to represent general mathematical ability. Under the two-dimensional model, one of the two dimensions was also considered to represent general mathematical ability. The second dimension was considered to represent the ability to communicate in mathematics. The resulting pattern of mismatched proficiency classifications suggested that examinees found to have less mathematics communication ability were more likely to be placed in a lower general mathematics proficiency classification under the unidimensional than multidimensional model. Results and implications are discussed.

10.
In classical test theory, a test is regarded as a sample of items from a domain defined by generating rules or by content, process, and format specifications. If the items are a random sample of the domain, then the percent-correct score on the test estimates the domain score, that is, the expected percent correct for all items in the domain. When the domain is represented by a large set of calibrated items, as in item banking applications, item response theory (IRT) provides an alternative estimator of the domain score by transformation of the IRT scale score on the test. This estimator has the advantage of not requiring the test items to be a random sample of the domain, and of having a simple standard error. We present here resampling results in real data demonstrating, for uni- and multidimensional models, that the IRT estimator is also a more accurate predictor of the domain score than is the classical percent-correct score. These results have implications for reporting outcomes of educational qualification testing and assessment.
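The IRT estimator of the domain score simply averages the model-implied probabilities of success over every calibrated item in the bank, evaluated at the examinee's scale score. A minimal sketch for the unidimensional 2PL case (an illustrative assumption; the paper's results also cover multidimensional models), with an invented five-item "bank":

```python
import math

def domain_score(theta_hat, bank_params):
    """IRT estimate of the domain score (expected percent correct).

    Transforms an IRT scale score into the expected proportion correct
    over all calibrated items in the domain, here under the 2PL model
    P(correct) = 1 / (1 + exp(-a * (theta - b))).
    """
    p_sum = sum(1.0 / (1.0 + math.exp(-a * (theta_hat - b)))
                for a, b in bank_params)
    return 100.0 * p_sum / len(bank_params)

# A small calibrated bank of (a, b) item parameters
bank = [(1.0, -1.0), (1.2, -0.3), (0.8, 0.2), (1.5, 0.8), (0.9, 1.4)]
print(f"estimated domain score = {domain_score(0.5, bank):.1f}%")
```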

11.
Many academic tests (e.g. short‐answer and multiple‐choice) sample required knowledge with questions scoring 0 or 1 (dichotomous scoring). Few textbooks give useful guidance on the length of test needed to do this reliably. Posey's binomial error model of 1932 provides the best starting point, but allows neither for heterogeneity of question difficulty and discriminatory power nor for students' uneven spread of knowledge. Even with these taken into account, it appears that tests of 30–60 items, as commonly used, must generally be far from adequate. No exact test length can be specified as ‘just sufficient’, but the tests of 300 items that some students take are not extravagantly long. The effects on reliability of some particular test forms and practices are discussed.
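The core of the argument can be reproduced from the binomial error model itself: for n independent dichotomous items and true proportion-correct p, the standard error of the percent-correct score is 100·sqrt(p(1−p)/n). The sketch below ignores the heterogeneity corrections the paper adds, which only make short tests look worse:

```python
import math

def binomial_sem(p, n):
    """SEM of a percent-correct score under a binomial error model.

    Posey-style approximation: n independent 0/1 items with true
    proportion-correct p. An illustrative lower bound on the error of
    short tests; item heterogeneity widens it further.
    """
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

for n in (30, 60, 300):
    print(f"n = {n:3d}: SEM ~= {binomial_sem(0.6, n):.1f} percentage points")
```

At p = 0.6, a 30-item test has a SEM near 9 percentage points, a 60-item test near 6, and a 300-item test near 3, which is the sense in which common 30–60 item tests are far from adequate.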

12.
This paper presents a new method for using certain restricted latent class models, referred to as binary skills models, to determine the skills required by a set of test items. The method is applied to reading achievement data from a nationally representative sample of fourth-grade students and offers useful perspectives on test structure and examinee ability, distinct from those provided by other methods of analysis. Models fitted to small, overlapping sets of items are integrated into a common skill map, and the nature of each skill is then inferred from the characteristics of the items for which it is required. The reading comprehension items examined conform closely to a unidimensional scale with six discrete skill levels that range from an inability to comprehend or match isolated words in a reading passage to the abilities required to integrate passage content with general knowledge and to recognize the main ideas of the most difficult passages on the test.
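A binary skills model can be sketched as a restricted latent class model in which each class c corresponds to a vector of mastered/unmastered skills, and an item's success probability depends only on whether the class holds every skill the item requires. The DINA-style rendering below is our illustration of that restriction, not the paper's exact specification:

```latex
P(\mathbf{x}_p) = \sum_{c} \pi_c \prod_{j} p_{jc}^{\,x_{pj}} \left(1 - p_{jc}\right)^{1 - x_{pj}},
\qquad
p_{jc} =
\begin{cases}
1 - s_j & \text{if class } c \text{ has every skill item } j \text{ requires},\\[2pt]
g_j & \text{otherwise}
\end{cases}
```

with slip (s_j) and guessing (g_j) parameters per item; the "restriction" is that p_{jc} may take only these two values for each item, which is what makes the skill requirements identifiable.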

13.
Several investigators have interpreted method effects associated with negatively worded items in a substantive way. This research extends those studies in two ways: (a) it establishes the presence of method effects in further populations and particular scales, and (b) it examines the possible relations between a method factor associated with negatively worded items and several covariates. Two samples were assessed: 592 high school students from Valencia (Spain) and 285 batterers from the same city. The self-esteem scales used were Rosenberg's Self-Esteem Scale, the State Self-Esteem Scale, and Self-Esteem 17. Anxiety was also assessed with the State-Trait Anxiety Inventory, and gender and educational level were taken into account. The models were conducted using a multiple indicators and multiple causes (MIMIC) model framework. The evidence in this research indicated that method effects were present across the different measures of self-esteem. Moreover, a significant and negative effect of anxiety on method effects was present across scales and samples, whereas no effects of age or educational level were found.
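Schematically, a MIMIC model regresses the latent factors on observed covariates while the observed items load on those factors; in this application the negatively worded items carry an extra loading on a method factor, and anxiety, gender, and educational level enter as causes. A generic rendering in our notation, not the authors' exact specification:

```latex
\boldsymbol{\eta} = \boldsymbol{\Gamma}\mathbf{x} + \boldsymbol{\zeta},
\qquad
\mathbf{y} = \boldsymbol{\Lambda}\boldsymbol{\eta} + \boldsymbol{\varepsilon}
```

Here η contains the substantive self-esteem factor plus a method factor loaded only by the negatively worded items, x holds the covariates, Γ collects the structural effects (e.g., the negative effect of anxiety on the method factor), and Λ the factor loadings.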

14.
Synchronous audiographic conferencing (SAC) refers to a combination of technologies for real-time communication and interaction using multiple media and modes. With an increasing institutional uptake of SAC, users require an understanding of the complex interrelations of multiple media in learning scenarios in order to support pedagogic-driven planning and effective use of the tool. This paper provides a review of recent literature that explores the pedagogic strategies used to underpin practical uses of SAC for the benefit of learners especially in non-standard contexts such as distance education. The paper reports on approaches from practitioner-oriented perspectives as well as approaches based on educational theory, notably the community of inquiry model, task design and multimodal models of cognition, meaning and interaction. The main features of these models were extracted to provide both a synthesis for future work on dedicated pedagogic models for SAC and a resource for practitioners wanting to link SAC with educational theory.

15.
Item stem formats can alter the cognitive complexity as well as the type of abilities required for solving mathematics items. Consequently, it is possible that item stem formats can affect the dimensional structure of mathematics assessments. This empirical study investigated the relationship between item stem format and the dimensionality of mathematics assessments. A sample of 671 sixth-grade students was given two forms of a mathematics assessment in which mathematical expression (ME) items and word problems (WP) were used to measure the same content. The effects of mathematical language and reading abilities in responding to ME and WP items were explored using unidimensional and multidimensional item response theory models. The results showed that WP and ME items appear to differ with regard to the underlying abilities required to answer these items. Hence, the multidimensional model fit the response data better than the unidimensional model. For the accurate assessment of mathematics achievement, students’ reading and mathematical language abilities should also be considered when implementing mathematics assessments with ME and WP items.
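Conclusions such as "the multidimensional model fit better" typically rest on a formal model comparison, for example a likelihood-ratio test when the models are nested (information criteria such as AIC/BIC otherwise). A generic sketch with hypothetical log-likelihoods, not the study's actual values:

```python
from scipy.stats import chi2

def lr_test(loglik_reduced, loglik_full, df_diff):
    """Likelihood-ratio comparison of nested IRT models.

    G2 = 2 * (logL_full - logL_reduced), referred to a chi-square
    distribution with df equal to the difference in parameter counts.
    Assumes the unidimensional model is nested in the multidimensional
    one; otherwise AIC/BIC comparisons are the usual alternative.
    """
    g2 = 2.0 * (loglik_full - loglik_reduced)
    return g2, chi2.sf(g2, df_diff)

# Hypothetical log-likelihoods from fitting both models
g2, p = lr_test(loglik_reduced=-10250.0, loglik_full=-10180.0, df_diff=3)
print(f"G2 = {g2:.1f}, p = {p:.4f}")
```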

16.
Students' mathematical literacy has a multidimensional structure, so literacy-oriented assessments of mathematics achievement need to report examinees' performance on each dimension rather than a single total score. Taking the PISA mathematical literacy framework as the theoretical model and multidimensional item response theory (MIRT) as the measurement model, this study used the MIRT package in R to process and analyze item data from an eighth-grade mathematical literacy assessment in one region, exploring a multidimensional approach to measuring mathematical literacy. The results show that MIRT combines the advantages of unidimensional item response theory and factor analysis: it can be used to analyze the construct validity of a test and the quality of its items, and to provide multidimensional cognitive diagnosis of examinees' abilities.

17.
Lord's Wald test for differential item functioning (DIF) has not been studied extensively in the context of the multidimensional item response theory (MIRT) framework. In this article, Lord's Wald test was implemented using two estimation approaches, marginal maximum likelihood estimation and Bayesian Markov chain Monte Carlo estimation, to detect uniform and nonuniform DIF under MIRT models. The Type I error and power rates for Lord's Wald test were investigated under various simulation conditions, including different DIF types and magnitudes, different means and correlations of two ability parameters, and different sample sizes. Furthermore, English usage data were analyzed to illustrate the use of Lord's Wald test with the two estimation approaches.
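Lord's Wald test compares the vectors of item parameters estimated separately in the reference and focal groups, after the group scales have been linked. A minimal numpy sketch for one item under a two-dimensional model, with invented estimates and covariance matrices (the article implements the test with MML and MCMC estimation):

```python
import numpy as np
from scipy.stats import chi2

def lords_wald(est_ref, est_foc, cov_ref, cov_foc):
    """Lord's Wald statistic for DIF on a single item.

    W = d' (Sigma_ref + Sigma_foc)^(-1) d, where d is the difference
    between group-specific item-parameter vectors; W is referred to a
    chi-square with df equal to the number of parameters tested.
    """
    diff = np.asarray(est_ref, float) - np.asarray(est_foc, float)
    cov = np.asarray(cov_ref, float) + np.asarray(cov_foc, float)
    w = float(diff @ np.linalg.solve(cov, diff))
    return w, chi2.sf(w, df=diff.size)

# Hypothetical (a1, a2, d) estimates and covariances per group
ref, foc = [1.10, 0.70, -0.20], [1.05, 0.95, -0.55]
cov = 0.01 * np.eye(3)
w, p = lords_wald(ref, foc, cov, cov)
print(f"Wald chi2 = {w:.2f}, p = {p:.4f}")
```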

18.
This paper examines aspects of intellectual, linguistic and academic abilities of 71 children with moderate learning difficulties. A profile of these abilities is presented and analysed. The profile provides a rationale for mounting a long-term intervention study designed to develop these children’s communication abilities. It also provides us with a baseline model against which the effects of the intervention can be assessed. In addition, the profile explores the relationships between several aspects of academic achievement and biographical factors such as age, gender, season of birth and IQ. Statistical analysis reveals significant relationships between several of the variables investigated. The implications of this analysis for educational practice are considered.

19.
Rating scale items have been widely used in educational and psychological tests. These items require people to make subjective judgments, and these subjective judgments usually involve randomness. To account for this randomness, Wang, Wilson, and Shih proposed the random‐effect rating scale model in which the threshold parameters are treated as random effects rather than fixed effects. In the present study, the Wang et al. model was further extended to incorporate slope parameters and embed the new model within the framework of multilevel nonlinear mixed‐effect models. This was done so that (1) no efforts are needed to derive parameter estimation procedures, and (2) existing computer programs can be applied directly. A brief simulation study was conducted to ascertain parameter recovery using the SAS NLMIXED procedure. An empirical example regarding students’ interest in learning science is presented to demonstrate the implications and applications of the new model.
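In adjacent-categories form, the random-effect rating scale model lets each person carry their own thresholds, drawn around the fixed threshold values, and the extension described here further multiplies by an item slope. This is our schematic rendering under those assumptions, not the authors' exact notation:

```latex
\log \frac{P_{nik}}{P_{ni(k-1)}} = \alpha_i \left(\theta_n - \beta_i - \tau_{nk}\right),
\qquad
\tau_{nk} \sim N\!\left(\tau_k, \sigma_k^{2}\right)
```

Treating the thresholds τ_{nk} as random over persons n absorbs the randomness in subjective judgments; fixing every σ_k² = 0 recovers the ordinary rating scale model, and casting the whole model as a multilevel nonlinear mixed model is what allows standard software such as SAS NLMIXED to estimate it directly.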

20.
Changes to the design and development of our educational assessments are resulting in the unprecedented demand for a large and continuous supply of content‐specific test items. One way to address this growing demand is with automatic item generation (AIG). AIG is the process of using item models to generate test items with the aid of computer technology. The purpose of this module is to describe and illustrate a template‐based method for generating test items. We outline a three‐step approach where test development specialists first create an item model. An item model is like a mould or rendering that highlights the features in an assessment task that must be manipulated to produce new items. Next, the content used for item generation is identified and structured. Finally, features in the item model are systematically manipulated with computer‐based algorithms to generate new items. Using this template‐based approach, hundreds or even thousands of new items can be generated with a single item model.
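The three-step, template-based idea is easy to sketch: an item model is a stem with manipulable features, structured content pools supply the feature values, and an algorithm crosses them to generate items. The toy item model and pools below are invented for illustration, not taken from the module:

```python
from itertools import product

# Step 1: an item model -- a stem template with manipulable features.
ITEM_MODEL = ("A {vehicle} travels {speed} km/h for {hours} hours. "
              "How far does it travel?")

# Step 2: structured content pools for each feature.
CONTENT = {
    "vehicle": ["car", "train", "bus"],
    "speed": [40, 60, 80],
    "hours": [2, 3],
}

def generate_items(model, content):
    """Step 3: cross all feature values to generate items with keys."""
    keys = list(content)
    for values in product(*(content[k] for k in keys)):
        features = dict(zip(keys, values))
        stem = model.format(**features)
        answer = features["speed"] * features["hours"]  # distance in km
        yield stem, answer

items = list(generate_items(ITEM_MODEL, CONTENT))
print(f"{len(items)} items generated; first: {items[0]}")
```

Crossing the pools above yields 3 × 3 × 2 = 18 items from a single item model; realistic item models add constraints so that only sensible, content-valid combinations are generated, which is how a single model scales to hundreds or thousands of items.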
