Similar Documents
20 similar documents found
1.
2.
It is known that the Rasch model is a special two-level hierarchical generalized linear model (HGLM). This article demonstrates that the many-faceted Rasch model (MFRM) is also a special case of the two-level HGLM, with a random intercept representing examinee ability on a test, and fixed effects for the test items, judges, and possibly other facets. This perspective suggests useful modeling extensions of the MFRM. For example, in the HGLM framework it is possible to model random effects for items and judges in order to assess their stability across examinees. The MFRM can also be extended so that item difficulty and judge severity are modeled as functions of examinee characteristics (covariates), for the purposes of detecting differential item functioning and differential rater functioning. Practical illustrations of the HGLM are presented through the analysis of simulated and real judge-mediated data sets involving ordinal responses.
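As a hedged sketch of the structure this abstract describes (the notation below is my own, not taken from the article), the MFRM for a dichotomous rating of examinee n on item i by judge j can be written as a two-level HGLM with a random examinee intercept and fixed item and judge effects:

    % Level 1: Bernoulli model with fixed item difficulty \delta_i and judge severity \lambda_j
    \operatorname{logit} P(Y_{nij} = 1) = \theta_n - \delta_i - \lambda_j
    % Level 2: the examinee intercept is random across examinees
    \theta_n = \gamma_0 + u_n, \qquad u_n \sim N(0, \tau^2)

Treating \delta_i or \lambda_j as random, or regressing them on examinee covariates at level 2, gives the random-facet and DIF/DRF extensions mentioned above.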

3.
Taking the women's diving final at an International Olympic Games as an example, this paper applies the three major measurement theories (CTT, GT, and IRT) to analyze rater reliability, revealing differences between raters and within raters from different angles. The results show that the CTT rater reliabilities were 0.981 and 0.78; the GT generalizability coefficient and dependability index were 0.8279 and 0.8271, indicating that the measurement decision used in the competition, in which 7 judges rated each diver's performance across 5 rounds, was reasonably appropriate; and in the IRT analysis, Judge 5 was relatively the most severe of the 7 judges and Judge 2 the most lenient, although differences in severity among judges were not significant, Judges 1 and 4 showed problems with internal consistency, and judges showed bias across divers, degrees of difficulty, and rounds, though not at a significant level. The analysis clarifies the characteristics and respective strengths of the three approaches to rater reliability and provides useful information for rater training and for improving scoring reliability.
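As an illustration of the GT part of such an analysis, the sketch below estimates variance components and the two coefficients for a simplified persons x raters design; the function name and the two-facet simplification are my own assumptions (the diving data also involve a rounds facet), not the procedure used in the paper.

    import numpy as np

    def g_study_p_by_r(X):
        """Variance components and G coefficients for a fully crossed
        persons x raters design (X: persons in rows, raters in columns)."""
        n_p, n_r = X.shape
        grand = X.mean()
        person_means = X.mean(axis=1)
        rater_means = X.mean(axis=0)

        ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
        ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
        resid = X - person_means[:, None] - rater_means[None, :] + grand
        ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

        var_pr_e = ms_pr                      # person x rater interaction plus error
        var_p = max((ms_p - ms_pr) / n_r, 0)  # universe-score (person) variance
        var_r = max((ms_r - ms_pr) / n_p, 0)  # rater (severity) variance

        g_rel = var_p / (var_p + var_pr_e / n_r)          # generalizability coefficient
        phi = var_p / (var_p + (var_r + var_pr_e) / n_r)  # dependability index
        return var_p, var_r, var_pr_e, g_rel, phi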

4.
Application of the Many-Facet Rasch Model in Rater Training for Subjective-Item Scoring
Subjective-item scoring is influenced by many factors, such as raters' knowledge, overall ability, and personal preferences. These rater biases not only create subjective differences between raters but also make the same rater unstable across occasions, ultimately lowering the reliability of subjective-item scores. This study applied the many-facet Rasch model to rater training for the essay questions of a national examination. By analyzing the trial ratings of 58 papers by 6 experienced raters, four types of rater bias were identified, and individual feedback based on these results was then given to each rater, thereby improving the objectivity and precision of the scoring.

5.
This study investigates how experienced and inexperienced raters score essays written by ESL students on two different prompts. The quantitative analysis using multi-faceted Rasch measurement, which provides measurements of rater severity and consistency, showed that the inexperienced raters were more severe than the experienced raters on one prompt but not on the other prompt, and that differences between the two groups of raters were eliminated following rater training. The qualitative analysis, which consisted of analysis of raters' think-aloud protocols while scoring essays, provided insights into reasons for these differences. Differences were related to the ease with which the scoring rubric could be applied to the two prompts and to differences in how the two groups of raters perceived the appropriateness of the prompts.

6.
Taking prospective oral examiners for PETS Level 1 as participants, this study examined the effect of training on oral examiners' scoring, using many-facet Rasch analysis to compare the examiners' ratings before and after training. The results show that, after training, the proportion of ratings in exact agreement with the expert ratings increased, examiners who had rated too severely made appropriate adjustments to their standards, and all examiners' fit statistics fell within the acceptable range; overall, the training was fairly effective and improved rating accuracy. Many-facet Rasch analysis helps identify examiners who rate too leniently or too severely, examiners with poor fit, and other rating anomalies, providing a reliable basis for targeted training.

7.
Vahid Aryadoust, Educational Psychology, 2016, 36(10): 1742-1770
This study sought to examine the development of paragraph writing skills of 116 English as a second language university students over the course of 12 weeks and the relationship between the linguistic features of students' written texts as measured by Coh-Metrix, a computational system for estimating textual features such as cohesion and coherence, and the scores assigned by human raters. The raters' reliability was investigated using many-facet Rasch measurement (MFRM); the growth of students' paragraph writing skills was explored using a factor-of-curves latent growth model (LGM); and the relationships between changes in linguistic features and writing scores across time were examined by path modelling. MFRM analysis indicates that despite several misfits, students' and raters' performances and the scale's functionality conformed to the expectations of MFRM, thus providing evidence of psychometric validity for the assessments. LGM shows that students' paragraph writing skills develop steadily during the course. The Coh-Metrix indices have more predictive power before and after the course than during it, suggesting that Coh-Metrix may struggle to discriminate between some ability levels. Whether a Coh-Metrix index gains or loses predictive power over time is argued to be partly a function of whether raters maintain or lose sensitivity to the linguistic feature measured by that index in their own assessment as the course progresses.

8.
The purpose of this study was to build a Random Forest supervised machine learning model in order to predict musical rater-type classifications based upon a Rasch analysis of raters' differential severity/leniency related to item use. Raw scores (N = 1,704) from 142 raters across nine high school solo and ensemble festivals (grades 9-12) were collected using a 29-item Likert-type rating scale embedded within five domains (tone/intonation, n = 6; balance, n = 5; interpretation, n = 6; rhythm, n = 6; and technical accuracy, n = 6). Data were analyzed using a Many Facets Rasch Partial Credit Model. An a priori k-means cluster analysis of 29 differential rater functioning indices produced a discrete feature vector that classified raters into one of three distinct rater-types: (a) syntactical rater-type, (b) expressive rater-type, or (c) mental representation rater-type. Results of the initial Random Forest model resulted in an out-of-bag error rate of 5.05%, indicating that approximately 95% of the raters were correctly classified. After tuning a set of three hyperparameters (ntree, mtry, and node size), the optimized model demonstrated an improved out-of-bag error rate of 2.02%. Implications for improvements in assessment, research, and rater training in the field of music education are discussed.
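The pipeline described above can be sketched as follows. The study itself presumably used R's randomForest package (the ntree, mtry, and node size names suggest so), so the scikit-learn code below, run on placeholder data, is only an analogue; the grid maps those hyperparameters onto their scikit-learn counterparts.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # X: one row per rater, 29 differential rater functioning indices
    # (placeholder data here; the real indices come from the MFRM analysis)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(142, 29))

    # Step 1: unsupervised labels, three rater types from k-means
    rater_type = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Step 2: random forest with out-of-bag error as the classification check
    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    rf.fit(X, rater_type)
    print("OOB error:", 1 - rf.oob_score_)

    # Step 3: tune the scikit-learn analogues of ntree, mtry, and node size
    grid = {"n_estimators": [250, 500, 1000],   # ntree
            "max_features": [3, 5, 9],          # mtry
            "min_samples_leaf": [1, 3, 5]}      # node size
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
    search.fit(X, rater_type)
    print("best hyperparameters:", search.best_params_)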

9.
This study investigated the usefulness of the many-facet Rasch model (MFRM) in evaluating the quality of performance related to PowerPoint presentations in higher education. The Rasch model is an item response theory model in which the probability of a correct response to a test item/task depends largely on a single parameter, the ability of the person. MFRM extends this one-parameter model to other facets of task difficulty, for example rater severity, rating scale format, and task difficulty levels. This paper specifically investigated presentation ability in terms of item/task difficulty and rater severity/leniency. First-year science education students prepared and used the PowerPoint presentation software program during the autumn semester of the 2005-2006 school year in the 'Introduction to the Teaching Profession' course. The students were divided into six sub-groups and each sub-group was given an instructional topic, based on the content and objectives of the course, to prepare a PowerPoint presentation. Seven judges, including the course instructor, evaluated each group's PowerPoint presentation performance using the 'A+ PowerPoint Rubric'. The results of this study show that the MFRM technique is a powerful tool for handling polytomous data in performance and peer assessment in higher education.

10.
The decision-making behaviors of 8 raters when scoring 39 persuasive and 39 narrative essays written by second language learners were examined, first using Rasch analysis and then through think-aloud protocols. Results based on Rasch analysis and think-aloud protocols recorded by raters as they were scoring holistically and analytically suggested that rater background may have contributed to rater expectations that might explain individual differences in the application of the performance criteria of the rubrics when rating essays. The results further suggested that rater ego engagement with the text and/or author may have helped mitigate rater severity and that self-monitoring behaviors by raters may have had a similar mitigating effect.

11.
A Review of Methods for Estimating the Quality of Subjective-Item Scoring
In psychometric theory, the quality of subjective-item scoring is a topic worth studying. This paper introduces the methods each of the three major measurement theories (classical test theory, generalizability theory, and item response theory) offers for estimating the quality of subjective-item scoring and compares their strengths and weaknesses. Generalizability theory and item response theory have clear advantages in evaluating the quality of subjective-item scoring; how to use the three theories in combination to obtain more valuable information about scoring quality is a question worth exploring further.

12.
Any examination that involves moderate to high stakes implications for examinees should be psychometrically sound and legally defensible. Currently, there are two broad and competing families of test theories that are used to score examination data. The majority of instructors outside the high-stakes testing arena rely on classical test theory (CTT) methods. However, advances in item response theory software have made the application of these techniques much more accessible to classroom instructors. The purpose of this research is to analyze a common medical school anatomy examination using both the traditional CTT scoring method and a Rasch measurement scoring method to determine which technique provides more robust findings, and which set of psychometric indicators will be more meaningful and useful for anatomists looking to improve the psychometric quality and functioning of their examinations. Results produced by the more robust and meaningful methodology will undergo a rigorous psychometric validation process to evaluate construct validity. Implications of these techniques and additional possibilities for advanced applications are also discussed. Anat Sci Educ 7: 450-460. © 2014 American Association of Anatomists.
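For the CTT side of such a comparison, the standard indicators can be computed directly from a 0/1 scored response matrix. The sketch below is generic (not the article's code) and returns item difficulty, corrected item-total discrimination, and Cronbach's alpha, the kind of CTT indicators that the study contrasts with Rasch item and person measures.

    import numpy as np

    def ctt_item_stats(X):
        """Classical test theory item statistics for a 0/1 scored exam
        (X: examinees in rows, items in columns)."""
        n_persons, n_items = X.shape
        total = X.sum(axis=1)
        difficulty = X.mean(axis=0)      # proportion correct per item
        # corrected item-total (point-biserial) discrimination
        discrimination = np.array([np.corrcoef(X[:, i], total - X[:, i])[0, 1]
                                   for i in range(n_items)])
        # Cronbach's alpha for internal consistency of the total score
        alpha = (n_items / (n_items - 1)) * (1 - X.var(axis=0, ddof=1).sum()
                                              / total.var(ddof=1))
        return difficulty, discrimination, alpha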

13.
Evaluating Rater Accuracy in Performance Assessments
A new method for evaluating rater accuracy within the context of performance assessments is described. Accuracy is defined as the match between ratings obtained from operational raters and those obtained from an expert panel on a set of benchmark, exemplar, or anchor performances. An extended Rasch measurement model called the FACETS model is presented for examining rater accuracy. The FACETS model is illustrated with 373 benchmark papers rated by 20 operational raters and an expert panel. The data are from the 1993 field test of the High School Graduation Writing Test in Georgia. The data suggest that there are statistically significant differences in rater accuracy; the data also suggest that it is easier to be accurate on some benchmark papers than on others. A small example is presented to illustrate how the accuracy ordering of raters may not be invariant over different subsets of benchmarks used to evaluate accuracy.
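A hedged sketch of the kind of FACETS specification the abstract refers to, written in the usual many-facet rating scale form (the exact parameterization used in the article may differ): with A_j the accuracy of rater j, D_n the difficulty of rating benchmark paper n accurately, and F_k the step from accuracy category k-1 to k,

    \ln \frac{P_{njk}}{P_{nj(k-1)}} = A_j - D_n - F_k

Comparing the estimated A_j across different subsets of benchmarks D_n is what reveals whether the accuracy ordering of raters is invariant.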

14.
The purpose of this study was to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to ratings of 16 raters on writing performances of 8,285 elementary school students. Each performance was rated by two trained raters over a period of seven rating days. Performances rated on the first day were re-rated at the end of the rating period. Statistically significant differences between raters were found within each day and in all days combined. Daily estimates of the relative severity of individual raters were found to differ significantly from single, on-average estimates for the whole rating period. For 10 raters, severity estimates on the last day were significantly different from estimates on the first day. These findings cast doubt on the practice of using a single calibration of rater severity as the basis for adjustment of person measures.

15.
This study examined rater effects on essay scoring in an operational monitoring system from England's 2008 national curriculum English writing test for 14-year-olds. We fitted two multilevel models and analyzed: (1) drift in rater severity effects over time; (2) rater central tendency effects; and (3) differences in rater severity and central tendency effects by raters' previous rating experience. We found no significant evidence of rater drift and, while raters with less experience appeared more severe than raters with more experience, this result also was not significant. However, we did find that there was a central tendency to raters' scoring. We also found that rater severity was significantly unstable over time. We discuss the theoretical and practical questions that our findings raise.

16.
This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.
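The four rater errors named here can also be screened descriptively before (or alongside) a FACETS analysis. The sketch below uses rough indicator statistics of my own choosing, not the criteria from the article, and assumes a raters x essays x domains array of analytic ratings on a 1-4 scale.

    import numpy as np

    def rater_error_screens(ratings):
        """ratings: array of shape (n_raters, n_essays, n_domains), 1-4 scale."""
        overall = ratings.mean(axis=2)            # collapse analytic domains
        severity = overall.mean(axis=1)           # a low mean suggests a severe rater
        spread = overall.std(axis=1)              # a small spread suggests restriction of range
        middle = np.mean((overall >= 2) & (overall <= 3), axis=1)  # central tendency
        halo = []                                 # high within-rater domain correlations suggest halo
        for r in ratings:                         # r: (n_essays, n_domains)
            corr = np.corrcoef(r, rowvar=False)
            halo.append(corr[np.triu_indices_from(corr, k=1)].mean())
        return severity, spread, middle, np.array(halo)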

17.
Rater-mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater-mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater-mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater-mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater-mediated assessments are suggested.
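To make the cumulative/unfolding contrast concrete, one common parameterization of the two response processes is sketched below for the dichotomous case (notation assumed rather than quoted from the article): the cumulative Rasch model and the unfolding hyperbolic cosine model give, respectively,

    P(X_{ni}=1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
    \qquad \text{and} \qquad
    P(X_{ni}=1) = \frac{\cosh(\lambda_i)}{\cosh(\lambda_i) + \cosh(\theta_n - \delta_i)}

so the cumulative probability keeps rising as \theta_n - \delta_i grows, while the unfolding probability is single-peaked and is highest where the judgment \theta_n is closest to the location \delta_i.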

18.
Machine learning has been frequently employed to automatically score constructed response assessments. However, there is a lack of evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teacher pedagogical content knowledge (PCK); each task contained three video-based constructed response questions. 187 in-service science teachers watched the videos, each of which was set in a given classroom teaching scenario, and then responded to the constructed-response items. Three human experts rated the responses, and the human consensus scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater, along with the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity to the scenarios. Results indicate that variability of scenarios impacts teachers' performance, but the impact significantly depends on the construct of interest; for each assessment task, the machine is always the most severe rater, compared to the three human raters. However, the machine is less sensitive than the human raters to the task scenarios. This means the machine scoring is more consistent and stable across scenarios within each of the two tasks.
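A minimal sketch of the predictive-scoring setup described above, under assumptions of mine (TF-IDF features and logistic regression; the article does not specify its feature extraction or algorithm, and the data here are placeholders). The machine's cross-validated scores could then be stacked next to the three human raters as a fourth rater for the many-facet Rasch analysis.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import make_pipeline

    # responses: constructed-response texts; y: human consensus scores (placeholders)
    responses = [f"example teacher response number {i}" for i in range(20)]
    y = np.random.default_rng(0).integers(0, 3, size=20)

    scorer = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    machine_scores = cross_val_predict(scorer, responses, y, cv=5)  # machine as a fourth "rater"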

19.
In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition examination, employing a multifaceted Rasch approach to determine whether raters exhibited evidence of two types of differential rater functioning over time (i.e., changes in levels of accuracy or scale category use). Some raters showed statistically significant changes in their levels of accuracy as the scoring progressed, while other raters displayed evidence of differential scale category use over time.

20.
The term measurement disturbance has been used to describe systematic conditions that affect a measurement process, resulting in a compromised interpretation of person or item estimates. Measurement disturbances have been discussed in relation to systematic response patterns associated with items and persons, such as start-up, plodding, boredom, or fatigue. An understanding of the different types of measurement disturbances can lead to a more complete understanding of persons or items in terms of the construct being measured. Although measurement disturbances have been explored in several contexts, they have not been explicitly considered in the context of performance assessments. The purpose of this study is to illustrate the use of graphical methods to explore measurement disturbances related to raters within the context of a writing assessment. Graphical displays that illustrate the alignment between expected and empirical rater response functions are considered as they relate to indicators of rating quality based on the Rasch model. Results suggest that graphical displays can be used to identify measurement disturbances for raters related to specific ranges of student achievement that suggest potential rater bias. Further, results highlight the added diagnostic value of graphical displays for detecting measurement disturbances that are not captured using Rasch model-data fit statistics.
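A sketch of one such graphical display, plotting a model-expected rater response function against an empirical, binned mean rating curve. The rating scale parameterization, the function names, and the binning choice are illustrative assumptions of mine, not the article's exact procedure.

    import numpy as np
    import matplotlib.pyplot as plt

    def category_probs(theta, severity, thresholds):
        """Rating scale model probabilities for categories 0..m for one rater
        (assumed parameterization: step k has logit theta - severity - thresholds[k-1])."""
        logits = np.cumsum([0.0] + [theta - severity - t for t in thresholds])
        num = np.exp(logits - logits.max())
        return num / num.sum()

    def plot_rater_disturbance(theta, observed, severity, thresholds, n_bins=8):
        """Expected versus empirical rater response function for one rater."""
        m = len(thresholds)
        grid = np.linspace(theta.min(), theta.max(), 100)
        expected = [np.dot(np.arange(m + 1), category_probs(t, severity, thresholds))
                    for t in grid]

        # empirical curve: mean observed rating within ability bins
        bins = np.linspace(theta.min(), theta.max(), n_bins + 1)
        idx = np.clip(np.digitize(theta, bins) - 1, 0, n_bins - 1)
        centers = 0.5 * (bins[:-1] + bins[1:])
        empirical = [observed[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(n_bins)]

        plt.plot(grid, expected, label="model-expected rating")
        plt.plot(centers, empirical, "o-", label="empirical mean rating")
        plt.xlabel("student achievement (logits)")
        plt.ylabel("rating")
        plt.legend()
        plt.show()

Ranges of achievement where the empirical curve departs systematically from the expected curve are the candidate measurement disturbances (for example, potential rater bias) described above.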
