期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

DOES THE RASCH MODEL REALLY WORK FOR MULTIPLE CHOICE ITEMS? NOT IF YOU LOOK CLOSELY

D. R. DIVGI 《Journal of Educational Measurement》1986,23(4):283-298

This paper discusses various issues involved in using the Rasch model with multiple choice tests. By presenting a modified test that is much more powerful, the value of Wright and Panchapakesan's test as evidence of model fit is shown to be questionable. According to the new test, the model failed to fit 68% of the items in the Anchor Test Study. Effects of such misfit on test equating are demonstrated. Results of some past studies purporting to support the Rasch model are shown to be irrelevant, or to yield the conclusion that the Rasch model did not fit the data. Issues like "objectivity" and consistent estimation are shown to be unimportant in selection of a latent trait model. Thus, available evidence shows the Rasch model to be unsuitable for multiple choice items. 相似文献

2.

COMPARISON OF STUDENT PERFORMANCE ON M.D. QUALIFYING AND LICENSING EXAMINATIONS

Peter Harasym 《Assessment & Evaluation in Higher Education》1982,7(2):127-143

This study investigated the extent to which an M.D. qualifying examination and a Medical Council of Canada examination produced the same scores, pass levels and educational outcomes. Data from 72 students were analyzed by means of the Rasch latent trait model. The two examinations produced similar results. The findings of this study have important implications for developing diagnostic and screening programmes. With the use of “quality” examinations and latent trait theory, it is possible to place examination results onto a single common scale. This enables a comparison between examinations to determine differences in student abilities, item difficulties, and pass levels. More importantly, the procedure also enables a student's performance to be predicted: this is important in developing an educational programme that will help students pass certifying (for example, M.D. qualifying) or licensing (Medical Council of Canada) examinations. By diagnostic examinations, students’ weaknesses can be identified and appropriate remedial instruction provided. With the employment of screening instruments, students who are highly likely to fail can be identified and appropriately counselled. If the model fits the data, the current technology of latent trait theory promises to improve significantly the accuracy of educational measurement and decision‐making. 相似文献

3.

Assessing the efficacy of the Measure of Understanding of Macroevolution as a valid tool for undergraduate non-science majors

William Lee Romine 《International Journal of Science Education》2013,35(17):2872-2891

Efficacy of the Measure of Understanding of Macroevolution (MUM) as a measurement tool has been a point of contention among scholars needing a valid measure for knowledge of macroevolution. We explored the structure and construct validity of the MUM using Rasch methodologies in the context of a general education biology course designed with an emphasis on macroevolution content. The Rasch model was utilized to quantify item- and test-level characteristics, including dimensionality, reliability, and fit with the Rasch model. Contrary to previous work, we found that the MUM provides a valid, reliable, and unidimensional scale for measuring knowledge of macroevolution in introductory non-science majors, and that its psychometric behavior does not exhibit large changes across time. While we found that all items provide productive measurement information, several depart substantially from ideal behavior, warranting a collective effort to improve these items. Suggestions for improving the measurement characteristics of the MUM at the item and test levels are put forward and discussed. 相似文献

4.

Detecting Measurement Disturbances in Rater‐Mediated Assessments

下载免费PDF全文

Stefanie A. Wind Randall E. Schumacker 《Educational Measurement》2017,36(4):44-51

The term measurement disturbance has been used to describe systematic conditions that affect a measurement process, resulting in a compromised interpretation of person or item estimates. Measurement disturbances have been discussed in relation to systematic response patterns associated with items and persons, such as start‐up, plodding, boredom, or fatigue. An understanding of the different types of measurement disturbances can lead to a more complete understanding of persons or items in terms of the construct being measured. Although measurement disturbances have been explored in several contexts, they have not been explicitly considered in the context of performance assessments. The purpose of this study is to illustrate the use of graphical methods to explore measurement disturbances related to raters within the context of a writing assessment. Graphical displays that illustrate the alignment between expected and empirical rater response functions are considered as they relate to indicators of rating quality based on the Rasch model. Results suggest that graphical displays can be used to identify measurement disturbances for raters related to specific ranges of student achievement that suggest potential rater bias. Further, results highlight the added diagnostic value of graphical displays for detecting measurement disturbances that are not captured using Rasch model–data fit statistics. 相似文献

5.

Rasch测量原理及在高考命题评价中的实证研究 总被引：1，自引：1，他引：1

王蕾《中国考试》2008,(1):32-39

Rasch测量是当前教育与心理测量中具有客观等距量尺的测量。克服了经典测量的测验工具依赖和样本依赖的局限。本文通过介绍Rasch测量原理及其在高考命题评价考生抽样数据分析上的具体应用,为教育决策者和命题者提供了直观的Rasch测量对高考命题评价的量化图形表现形式。希望Rasch测量能在高考抽样数据分析中为命题量化评价提供新的、有价值的思考方式,能被教育决策者和命题者认同和有效使用。相似文献

6.

多面Rasch模型在主观题评分培训中的应用 总被引：7，自引：2，他引：7

李中权孙晓敏张厚粲张立松《中国考试》2008,(1):26-31

主观题的评分受到很多因素的影响,如评分者的知识水平、综合能力和个人偏好等。这些评分者偏差不仅会导致不同评分者之间存在主观差异,也会到导致同一评分者在不同的时间也具有主观不稳定性,最终导致主观题评分信度的降低。本研究将多面Rasch模型运用到某国家级考试论述题的评分培训中。通过分析6名有经验评分者对58份试卷的试评数据,鉴别出四种评分者偏差,然后据此对每个评分者进行个别反馈,从而提高评分的客观性和精确性。相似文献

7.

A Rasch Analysis of Raven Item Data

《Journal of Experimental Education》2012,80(1):27-32

The Progressive Matrices items require varying degrees of analytical reasoning. Individuals high on the underlying trait measured by the Raven should score high on the test. Latent trait models applied to data of the Raven form provide a useful methodology for examining the tenability of the above hypothesis. In this study the Rasch latent model was applied to investigate the fit of observed performance on Raven items to what was expected by the model for individuals at six different levels of the underlying scale. For the most part the model showed a good fit to the test data. The findings were similar to previous empirical work that has investigated the behavior of Rasch test scores. In three instances, however, the item fit statistic was relatively large. A closer study of the “misfitting” items revealed two items were of extreme difficulty, which is likely to contribute to the misfit. The study raises issues about the use of the Rasch model in instances of small samples. Other issues related to the interpretation of the Rasch model to Raven-type data are discussed. 相似文献

8.

Psychometric characteristics of integrated multi-specialty examinations: Ebel ratings and unidimensionality

Matt Homer Jonathan Darling Godfrey Pell 《Assessment & Evaluation in Higher Education》2012,37(7):787-804

Over recent years, UK medical schools have moved to more integrated summative examinations. This paper analyses data from the written assessment of undergraduate medical students to investigate two key psychometric aspects of this type of high-stakes assessment. Firstly, the strength of the relationship between examiner predictions of item performance (as required under the Ebel standard setting method employed) and actual item performance (‘facility’) in the examination is explored. It is found that there is a systematic pattern of difference between these two measures, with examiners tending to underestimate the difficulty of items classified as relatively easy, and overestimating that of items classified harder. The implications of these differences for standard setting are considered. Secondly, the integration of the assessment raises the question as to whether the student total score in the exam can provide a single meaningful measure of student performance across a broad range of medical specialties. Therefore, Rasch measurement theory is employed to evaluate psychometric characteristics of the examination, including its dimensionality. Once adjustment is made for item interdependency, the examination is shown to be unidimensional with fit to the Rasch model implying that a single underlying trait, clinical knowledge, is being measured. 相似文献

9.

Digital Module 10: Rasch Measurement Theory https://ncme.elevate.commpartners.com

Jue Wang George Engelhard 《Educational Measurement》2019,38(4):112-113

In this digital ITEMS module, Dr. Jue Wang and Dr. George Engelhard Jr. describe the Rasch measurement framework for the construction and evaluation of new measures and scales. From a theoretical perspective, they discuss the historical and philosophical perspectives on measurement with a focus on Rasch's concept of specific objectivity and invariant measurement. Specifically, they introduce the origins of Rasch measurement theory, the development of model‐data fit indices, as well as commonly used Rasch measurement models. From an applied perspective, they discuss best practices in constructing, estimating, evaluating, and interpreting a Rasch scale using empirical examples. They provide an overview of a specialized Rasch software program (Winsteps) and an R program embedded within Shiny (Shiny_ERMA) for conducting the Rasch model analyses. The module is designed to be relevant for students, researchers, and data scientists in various disciplines such as psychology, sociology, education, business, health, and other social sciences. It contains audio‐narrated slides, sample data, syntax files, access to Shiny_ERMA program, diagnostic quiz questions, data‐based activities, curated resources, and a glossary. 相似文献

10.

计算机自适应测验中Rasch模型稳健性的模拟研究 总被引：1，自引：0，他引：1

邓远平蔡艳罗照盛《考试研究》2006,(3)

本研究采用模拟数据的方法,在计算机自适应测验(Computer Adaptive Test,简称CAT)中分别采用Rasch及Birnbaum两种模型估计能力,通过比较两者的误差均方根(Root Mean Square Error,简称RMSE)、平均差异(Average Deviation,简称AD)及能力相关,对Rasch模型在CAT中的稳健性进行了研究。结果发现Rasch模型在区分度不等的条件下仍然能较准确地估计被试的能力水平,具有很强的稳健性。相似文献

11.

Evaluating item response theory linking and model fit for data from PISA 2000–2012

Matthias von Davier Kentaro Yamamoto Hyo Jeong Shin Henry Chen Lale Khorramdel Jon Weeks 《Assessment in Education: Principles, Policy & Practice》2019,26(4):466-488

ABSTRACT

Based on concerns about the item response theory (IRT) linking approach used in the Programme for International Student Assessment (PISA) until 2012 as well as the desire to include new, more complex, interactive items with the introduction of computer-based assessments, alternative IRT linking methods were implemented in the 2015 PISA round. The new linking method represents a concurrent calibration using all available data, enabling us to find item parameters that maximize fit across all groups and allowing us to investigate measurement invariance across groups. Apart from the Rasch model that historically has been used in PISA operational analyses, we compared our method against more general IRT models that can incorporate item-by-country interactions. The results suggest that our proposed method holds promise not only to provide a strong linkage across countries and cycles but also to serve as a tool for investigating measurement invariance. 相似文献

12.

A comparison of methods for determining dimensionality in Rasch measurement

Richard M. Smith 《Structural equation modeling》2013,20(1):25-40

This study compares the Rasch item fit approach for detecting multidimensionality in response data with principal component analysis without rotation using simulated data. The data in this study were simulated to represent varying degrees of multidimensionality and varying proportions of items representing each dimension. Because the requirement of unidimensionality is necessary to preserve the desirable measurement properties of Rasch models, useful ways of testing this requirement must be developed. The results of the analyses indicate that both the principal component approach and the Rasch item fit approach work in a variety of multidimensional data structures. However, each technique is unable to detect multidimensionality in certain combinations of the level of correlation between the two variables and the proportion of items loading on the two factors. In cases where the intention is to create a unidimensional structure, one would expect few items to load on the second factor and the correlation between the factors to be high. The Rasch item fit approach detects dimensionality more accurately in these situations. 相似文献

13.

Hierarchical Generalized Linear Models for the Analysis of Judge Ratings

Timothy J. Muckle George Karabatsos 《Journal of Educational Measurement》2009,46(2):198-219

It is known that the Rasch model is a special two-level hierarchical generalized linear model (HGLM). This article demonstrates that the many-faceted Rasch model (MFRM) is also a special case of the two-level HGLM, with a random intercept representing examinee ability on a test, and fixed effects for the test items, judges, and possibly other facets. This perspective suggests useful modeling extensions of the MFRM. For example, in the HGLM framework it is possible to model random effects for items and judges in order to assess their stability across examinees. The MFRM can also be extended so that item difficulty and judge severity are modeled as functions of examinee characteristics (covariates), for the purposes of detecting differential item functioning and differential rater functioning. Practical illustrations of the HGLM are presented through the analysis of simulated and real judge-mediated data sets involving ordinal responses. 相似文献

14.

THE EFFECTS OF THE DELETION OF MISFITTING PERSONS ON VERTICAL EQUATING VIA THE RASCH MODEL

S. E. PHILLIPS 《Journal of Educational Measurement》1986,23(2):107-118

The purpose of the study was to compare Rasch model equatings of multilevel achievement test data before and after the deletion of misfitting persons. The Rasch equatings were also compared with an equating obtained using the equipercentile method. No basis could be found in the results for choosing between the two Rasch equatings. The deletion of misfitting persons produced minor improvements in Rasch model fit to the data. Both Rasch equatings produced results that differed from the results of the equipercentile equating. The Rasch data also indicated that the misfitting persons deleted in the second Rasch equating tended to be from the lower portion of the achievement distribution, suggesting that they may have been guessing. 相似文献

15.

A Multilevel Testlet Model for Dual Local Dependence

Hong Jiao Akihito Kamata Shudong Wang Ying Jin 《Journal of Educational Measurement》2012,49(1):82-100

The applications of item response theory (IRT) models assume local item independence and that examinees are independent of each other. When a representative sample for psychometric analysis is selected using a cluster sampling method in a testlet‐based assessment, both local item dependence and local person dependence are likely to be induced. This study proposed a four‐level IRT model to simultaneously account for dual local dependence due to item clustering and person clustering. Model parameter estimation was explored using the Markov Chain Monte Carlo method. Model parameter recovery was evaluated in a simulation study in comparison with three other related models: the Rasch model, the Rasch testlet model, and the three‐level Rasch model for person clustering. In general, the proposed model recovered the item difficulty and person ability parameters with the least total error. The bias in both item and person parameter estimation was not affected but the standard error (SE) was affected. In some simulation conditions, the difference in classification accuracy between models could go up to 11%. The illustration using the real data generally supported model performance observed in the simulation study. 相似文献

16.

Inter-subject comparability of examination standards in GCSE and GCE in England

Qingping He Ian Stockford Michelle Meadows 《牛津教育评论》2018,44(4):494-513

Results from Rasch analysis of GCSE and GCE A level data over a period of four years suggest that the standards of examinations in different subjects are not consistent in terms of the levels of the latent trait specified in the Rasch model required to achieve the same grades. Variability in statistical standards between subjects exists at both individual grade level and the overall subject level. Findings from this study are generally consistent with those from previous studies using similar statistical models. It has been demonstrated that the alignment of statistical standards between subjects based on the Rasch model would likely result in substantial change in performance standards of the examinations for some subjects evidenced here by significant changes in grade boundary scores and grade outcomes. It is argued that the defined purposes of GCSE and A level qualifications determine how their results should be interpreted and reported and that the existing grading and results reporting procedures are appropriate for supporting these purposes. 相似文献

17.

A multitrait‐multioccasion generalization of the latent trait‐state model: Description and application

Levent Dumenci Michael Windle 《Structural equation modeling》2013,20(4):391-410

相似文献

18.

Evaluating construct validity and internal consistency of early childhood individualized family service plans

《Studies in Educational Evaluation》2015

This study presents evidence regarding the construct validity and internal consistency of the IFSP Rating Scale (McWilliam & Jung, 2001), which was designed to rate individualized family service plans (IFSPs) on 12 indicators of family centered practice. Here, the Rasch measurement model is employed to investigate the scale's functioning and fit for both person and item diagnostics of 120 IFSPs that were previously analyzed with a classical test theory approach. Analyses demonstrated scores on the IFSP Rating Scale fit the model well, though additional items could improve the scale's reliability. Implications for applying the Rasch model to improve special education research and practice are discussed. 相似文献

19.

Rasch analysis for psychometric improvement of science attitude rating scales

Pey-Tee Oon Xitao Fan 《International Journal of Science Education》2013,35(6):683-700

ABSTRACT

Students’ attitude towards science (SAS) is often a subject of investigation in science education research. Survey of rating scale is commonly used in the study of SAS. The present study illustrates how Rasch analysis can be used to provide psychometric information of SAS rating scales. The analyses were conducted on a 20-item SAS scale used in an existing dataset of The Trends in International Mathematics and Science Study (TIMSS) (2011). Data of all the eight-grade participants from Hong Kong and Singapore (N?=?9942) were retrieved for analyses. Additional insights from Rasch analysis that are not commonly available from conventional test and item analyses were discussed, such as invariance measurement of SAS, unidimensionality of SAS construct, optimum utilization of SAS rating categories, and item difficulty hierarchy in the SAS scale. Recommendations on how TIMSS items on the measurement of SAS can be better designed were discussed. The study also highlights the importance of using Rasch estimates for statistical parametric tests (e.g. ANOVA, t-test) that are common in science education research for group comparisons. 相似文献

20.

A psychometric measurement model for adult English language learners: Pearson Test of English Academic

Hye K. Pae 《Educational Research and Evaluation》2013,19(3):211-229

The aim of this study was to apply Rasch modeling to an examination of the psychometric properties of the Pearson Test of English Academic (PTE Academic). Analyzed were 140 test-takers' scores derived from the PTE Academic database. The mean age of the participants was 26.45 (SD = 5.82), ranging from 17 to 46. Conformity of the participants' performance on the 86 items of PTE Academic Form 1 of the field test was evaluated using the partial credit model. The person reliability coefficient was .96, and item reliability was .99. The results showed that no significant differential item functioning was found across subgroups of gender and spoken-language context, indicating that the item data approximated the Rasch model. The findings of this study validated the test stability of PTE Academic as a useful measurement tool for English language learners' academic English assessment. 相似文献