Similar literature
Found 20 similar documents; search took 125 ms
1.
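In subjective tests, rater variability is an important factor affecting reliability, validity, and fairness. This study uses the many-facet Rasch model to examine how 8 raters scored 60 essays in each of two genres, narrative and argumentative. The results show that raters scored the two genres inconsistently. At the rater level, rater severity was largely unaffected by genre, but rater reliability and internal consistency were better for argumentative essays than for narrative essays. At the rating-scale level, raters were stricter on argumentative than on narrative essays for the language and content criteria and more lenient on the organization criterion, and they used high scores more often for argumentative essays. The paper also discusses the causes of these inconsistencies, with the aim of informing efforts to reduce rating bias.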

2.
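Research on the essay-rating process guided by a socio-cognitive view emphasizes how the rules of testing institutions constrain rating, but neglects the influence of raters' socio-psychological factors. Taking the Test for English Majors Band 8 (TEM-8) as an example, this study investigates the latter using think-aloud protocols and follow-up interviews. The results show six categories of rater socio-psychological factors that affect how raters score texts: awareness of the testing mechanism, knowledge of the marking system, psychological perception of the marking environment, understanding of the test, expectations of test takers, and ethical value judgments. The paper then discusses the mechanisms through which these factors influence the rating process and offers suggestions for improving rater training.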

3.
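Subjective items are an important part of language testing. They can compensate for the shortcomings of standardized items, but their scoring depends on raters' subjective impressions, which leads to instability within individual raters and differences between raters. Drawing on the three major measurement theories and on computer-assisted scoring can improve the quality of subjective-item scoring and increase its precision and validity.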

4.
王显涛 《文教资料》2016,(4):173-174
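The reliability and validity of college English speaking tests have attracted much scholarly attention. However, for tests that take the form of group discussion and are used in ordinary teaching settings, research on rater reliability remains scarce. This paper reports an empirical study of rater reliability in a group-discussion college English speaking test, and describes and discusses the relevant data and findings.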

5.
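In Putonghua proficiency testing, scores given by examiners in the same group often differ considerably. Sometimes they differ by one or two points, sometimes by three or four; when the gap is large, even the assigned grade levels can differ. Scores near a grade boundary in particular are often placed in the wrong band. Why do such large scoring differences arise? First, some examiners lack the necessary theoretical knowledge of Chinese phonetics. Among the examiners, few are trained specialists engaged in Chinese language teaching and research; most come from universities, secondary vocational schools, or even primary schools. Some teach Chinese, some teach mathematics, and some even work in administrative or public institutions. Although these examiners speak standard Putonghua, they lack the necessary theoretical knowledge of Chinese phonetics and strong skills in perceiving, discriminating, and transcribing sounds. …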

6.
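The Putonghua Proficiency Test (PSC) is an important part of the work of promoting Putonghua. In testing practice, however, examiners' scoring errors have proved to be fairly large, which inevitably affects the quality of the PSC. Efforts should therefore be made to reduce scoring error and improve the reliability and validity of the test. Combining PSC theory with practice, the paper gives a comprehensive and specific analysis of the causes of examiner scoring error and the countermeasures to adopt.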

7.
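Using think-aloud protocols supplemented by stimulated recall, this study collected verbal report data from four raters with different personality tendencies as they scored a paired speaking test. Qualitative analysis shows that, in actual scoring, raters differ greatly in how they understand and use the rating scale. Specifically: (1) extroverted raters were more lenient than introverted raters during scoring; (2) introverted raters paid more attention to the specific indicators and criteria in the rating scale, whereas extroverted raters emphasized task completion and the comparison, communication, and interaction between test takers; (3) extroverted raters relied less on the rating scale than introverted raters and drew more on non-linguistic features. The findings have implications for revising scoring criteria and for rater training.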

8.
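Subjective tests rely on raters' subjective scoring. Because scoring consistency is low and reliability is lacking, the measurement community has long sought ways to improve the reliability of subjective scoring. This paper uses the Longford method to empirically test for aberrant raters among those who scored the HSK [Advanced] essay test. The results show that the method is indeed effective for detecting rater differences in large-scale standardized subjective tests.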

9.
张晋军  任杰 《中国考试》2004,(10):27-32
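Following the research approach proposed in 《汉语测试电子评分员研究设想》 ("Research Proposal for an Electronic Rater for Chinese Language Testing", hereafter the Research Proposal), we randomly selected 700 pretest essays from Level 3 of the Chinese Proficiency Test for Ethnic Minorities in China (MHK), which 3 raters scored independently in strict accordance with the MHK (Level 3) essay scoring requirements. We designed and wrote an electronic rater program, which scored the electronic files of these 700 essays. We then calculated the electronic rater's …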

10.
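Causes of and countermeasures for scoring error among Putonghua Proficiency Test examiners   Total citations: 2 (self-citations: 0; citations by others: 2)
The Putonghua Proficiency Test (PSC) is an important part of the work of promoting Putonghua. In testing practice, however, examiners' scoring errors have proved to be fairly large, which inevitably affects the quality of the PSC. Efforts should therefore be made to reduce scoring error and improve the reliability and validity of the test. Combining PSC theory with practice, the paper gives a comprehensive and specific analysis of the causes of examiner scoring error and the countermeasures to adopt.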

11.
Historically, research focusing on rater characteristics and rating contexts that enable the assignment of accurate ratings and research focusing on statistical indicators of accurate ratings have been conducted by separate communities of researchers. This study demonstrates how existing latent trait modeling procedures can identify groups of raters who may be of substantive interest to those studying the experiential, cognitive, and contextual aspects of ratings. We employ two data sources in our demonstration: simulated data and data from a large‐scale state‐wide writing assessment. We apply latent trait models to these data to identify examples of rater leniency, centrality, inaccuracy, and differential dimensionality, and we investigate the association between rater training procedures and the manifestation of rater effects in the real data.

12.
This study evaluated rater accuracy with rater-monitoring data from high-stakes examinations in England. Rater accuracy was estimated with cross-classified multilevel modelling. The data included face-to-face training and monitoring of 567 raters in 110 teams, across 22 examinations, giving a total of 5500 data points. Two rater-monitoring systems (expert consensus scores and supervisor judgement of correct scores) were utilised for all raters. Results showed significant group training (table leader) effects upon rater accuracy, and these were greater in the expert consensus score monitoring system. When supervisor judgement methods of monitoring were used, differences between training teams (table leader effects) were underestimated. Supervisor-based judgements of raters’ accuracies were more widely dispersed than in the expert consensus monitoring system. Supervisors not only influenced their teams’ scoring accuracies, they also overestimated differences between raters’ accuracies, compared with the expert consensus system. Systems using supervisor judgements of correct scores and face-to-face rater training are, therefore, likely to underestimate table leader effects and overestimate rater effects.

13.
Rater training is an important part of developing and conducting large‐scale constructed‐response assessments. As part of this process, candidate raters have to pass a certification test to confirm that they are able to score consistently and accurately before they begin scoring operationally. Moreover, many assessment programs require raters to pass a calibration test before every scoring shift. To support the high‐stakes decisions made on the basis of rater certification tests, a psychometric approach for their development, analysis, and use is proposed. The circumstances and uses of these tests suggest that they are expected to have relatively low reliability. This expectation is supported by empirical data. Implications for the development and use of these tests to ensure their quality are discussed.

14.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call on researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.

15.
16.
In the United Kingdom, the majority of national assessments involve human raters. The processes by which raters determine the scores to award are central to the assessment process and affect the extent to which valid inferences can be made from assessment outcomes. Thus, understanding rater cognition has become a growing area of research in the United Kingdom. This study investigated rater cognition in the context of the assessment of school‐based project work for high‐stakes purposes. Thirteen teachers across three subjects were asked to “think aloud” whilst scoring example projects. Teachers also completed an internal standardization exercise. Nine professional raters across the same three subjects standardized a set of project scores whilst thinking aloud. The behaviors and features attended to were coded. The data provided insights into aspects of rater cognition such as reading strategies, emotional and social influences, evaluations of features of student work (which aligned with scoring criteria), and how overall judgments are reached. The findings can be related to existing theories of judgment. Based on the evidence collected, the cognition of teacher raters did not appear to be substantially different from that of professional raters.

17.
This paper reports findings from a project called “The National Panel of Raters” (NPR) that took place within a writing test programme in Norway (2010–2016). A recent research project found individual differences between the raters in the NPR. This paper reports results from an exploratory follow-up study in which 63 NPR members were surveyed with 23 items that were dilemma-like in the sense that deviating from the NPR rules would follow another, but socially acceptable, rationale. Four NPR members participated in a follow-up interview in which they explained why they had agreed or disagreed with certain items. The results indicate two distinctly different stances toward rating work, with one stance threatening the validity of the scoring process.

18.
19.
Internationally, many assessment systems rely predominantly on human raters to score examinations. Arguably, this facilitates the assessment of multiple sophisticated educational constructs, strengthening assessment validity. It can introduce subjectivity into the scoring process, however, engendering threats to accuracy. The present objectives are to examine some key qualitative data collection methods used internationally to research this potential trade‐off, and to consider some theoretical contexts within which the methods are usable. Self‐report methods such as Kelly's Repertory Grid, think aloud, stimulated recall, and the NASA task load index have yielded important insights into the competencies needed for scoring expertise, as well as the sequences of mental activity that scoring typically involves. Examples of new data and of recent studies are used to illustrate these methods’ strengths and weaknesses. This investigation has significance for assessment designers, developers and administrators. It may inform decisions on the methods’ applicability in American and other rater cognition research contexts.

20.
This article examines the role of reviewer agreement in judgments about alignment between tests and standards. We used case data from three state alignment studies to explore how different approaches to incorporating reviewer agreement change alignment conclusions. The three case studies showed varying degrees of reviewer agreement about correspondences between objectives and test items. Moreover, taking into account reviewer agreement in the analyses sometimes had a marked effect on alignment conclusions. We discuss reasons for differences across case studies and alignment approaches, as well as implications for future alignment efforts.
