Similar Documents
19 similar documents found (search time: 156 ms)
1.
Rater stability in oral examinations bears directly on test validity, reliability, and fairness. This paper presents a longitudinal analysis of ratings from five consecutive administrations of the computer-assisted PRETCO oral test, examining rater stability in three respects (severity, rating accuracy, and central tendency) and exploring the reasons behind the observed patterns.

2.
The Public English Test System (PETS) comprises a written test and an oral test and is divided into five levels. The scoring method of the PETS-3 oral test has both strengths and weaknesses, and it affects the reliability of the test to some degree. Scoring reliability depends on the joint action of three factors: the rating criteria, the rating evidence, and the raters; only together do they ensure that scores are trustworthy and dependable.

3.
The reliability and validity of English oral examinations are affected by many factors, including test format, rating criteria, and examiner competence. Improving them requires unity of form and content in the oral test, together with rating criteria that are scientific, objective, and operational. An oral test with high reliability and validity exerts a positive washback effect on teaching.

4.
In English oral examinations, an examiner's command of the rating criteria, severity in applying them, personal fatigue, and interaction with candidates all affect the reliability and validity of the scores. Taking the Cambridge Young Learners English oral test as an example, this article evaluates examiner competence and rating accuracy through a program-evaluation logic model, a criterion-referenced measurement model, and questionnaire analysis, and usefully explores how to improve the examiner quality-assurance system and raise examiners' rating proficiency.

5.
The Design and Development of Rating Criteria for College English Oral Tests   (Total citations: 1; self-citations: 0; citations by others: 1)
The scoring method is an important and rather thorny link in oral testing; its validity and objectivity directly determine whether an oral test succeeds. This paper discusses the rating criteria developed in the College English oral test reform at Beihang University and their theoretical basis. The criteria draw on the strengths of both criterion-referenced and norm-referenced oral tests: on the one hand, test criteria are established in order to improve validity; on the other, norms are set to control oral-test scoring at the macro level and thereby improve reliability.

6.
The Public English Test System (PETS) is a non-degree English certificate examination open to all members of society, consisting of a written test and an oral test. The oral test measures candidates' spoken language ability and is a subjective test; because of that subjectivity, the greatest scoring difficulties lie in fairness and consistency. It is therefore necessary to study examiner factors and scoring behavior in the PETS oral test. Based on a statistical analysis of the scores from one PETS oral administration, this paper examines the factors that influence examiners' ratings and offers views on examiner competence and training, with the aim of further improving the reliability, validity, and authority of the PETS oral test. I. Research method and statistical results: the subjects were candidates at one test center who, in March 2003, took…

7.
Recorded English oral tests differ from traditional oral tests in format and in the rating process, and offer certain advantages. As a form of subjective testing, their rating process involves many complex factors, so investigating the rating process and model of recorded oral tests matters greatly for improving their reliability and promoting their wider adoption. Given the important role of minority-serving institutions in China's higher English education, this study takes as a case the English retest for non-English-major postgraduates at the School of Foreign Languages, North Minzu University. Centering on the rater, the core of rating theory, it examines how raters handle the rating criteria and how the Milanovic et al. rating model operates in practice, thereby exploring the rating process and model of recorded English oral tests at minority institutions.

8.
The oral and composition sections of English examinations in the Radio and TV University (RTVU) system now mostly take the form of performance-based language use tests. Because such tests introduce human raters, scoring becomes more subjective, and controlling the effect of rater variability on candidates' scores is an essential step in assuring rating quality. After comparing three theories commonly used for rating-quality control in performance testing, this paper focuses on the contribution of the many-facet Rasch model to improving rating quality and discusses how the RTVU system could use the model to train raters in English performance tests, so as to control rating quality and improve test reliability.
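For reference, the many-facet Rasch model invoked here is conventionally written, in Linacre's rating-scale formulation (notation standard in the MFRM literature, not quoted from this article), as:

```latex
% Many-facet Rasch model (rating-scale form): the log-odds that
% examinee n receives category k rather than k-1 from rater j on task i.
\[
  \log\!\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
\]
% B_n: examinee ability;  D_i: task difficulty;
% C_j: rater severity;    F_k: step difficulty of category k.
```

Estimating C_j for each rater is what lets the model separate rater severity from examinee ability, which is the basis of the training and quality-control applications the abstract describes.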

9.
王显涛 《文教资料》2016,(4):173-174
The reliability and validity of college English oral tests have drawn attention from many scholars, but for tests administered as group discussions in ordinary teaching settings, published research on rater reliability remains scarce. This paper reports an empirical study of rater reliability in a group-discussion college English oral test and describes and discusses the relevant data and findings.

10.
Communication is the ultimate goal of foreign language learning. After several stages of development, language testing has shifted from pursuing test reliability alone to emphasizing test validity as well. In written tests, raising validity means increasing the proportion of items that directly measure candidates' language ability; writing ability is typically assessed through such direct testing. This paper discusses scoring methods for English writing tests.

11.
On the Validity of the Public English Test System (PETS) Oral Test   (Total citations: 3; self-citations: 0; citations by others: 3)
PETS is one of the English proficiency tests currently popular in China and is widely recognized and accepted. As part of that test battery, the validity of the oral component is a matter of general concern. Test validity should be strengthened by improving test-construction validity, enforcing standardization, and adhering to holistic rating criteria.

12.
INTRODUCTION
Engvik, H., Kvale, S. & Havik, O. E. (1970). Rater Reliability in Evaluation of Essay and Oral Examinations. Scand. J. educ. Res. 14, 195-220. The rater reliability of the examination system at the Psychological Institute in Oslo was investigated. The essay and oral performances of the candidates are evaluated by an examination committee of three. Significant differences in arithmetic means were found both among and within the committees. When the same essays were rated within a committee, a wide range of reliability coefficients was found, from -.16 to +.90. At the critical boundaries of the scale, such as the Laudabilis boundary for access to further study of psychology, considerable variation between raters was demonstrated. A slight but significant trend was found for female students to improve more than male students at the oral examination. The overall rater reliability found is not satisfactory, either with respect to current standards for psychometric tests or with respect to the importance of the marks for the individual students.
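Reliability coefficients of the kind reported above are, in the simplest case, correlations between two raters' scores on the same performances. A minimal sketch, with entirely invented ratings, using the Pearson correlation from SciPy:

```python
# Minimal sketch: inter-rater reliability as a Pearson correlation
# between two raters' scores on the same essays (hypothetical data).
from scipy.stats import pearsonr

# Scores assigned by two committee members to the same ten essays
# (invented numbers, on a 1-6 scale).
rater_a = [4, 5, 3, 6, 2, 4, 5, 3, 4, 6]
rater_b = [4, 4, 3, 5, 3, 5, 5, 2, 4, 6]

r, p = pearsonr(rater_a, rater_b)
print(f"inter-rater correlation r = {r:.2f} (p = {p:.3f})")
```

A coefficient near zero, or negative as in the -.16 reported above, means the two raters' rank orderings of the same essays barely agree at all.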

13.
The NTID Writing Test was developed to assess the writing ability of postsecondary deaf students entering the National Technical Institute for the Deaf and to determine their appropriate placement into developmental writing courses. While previous research (Albertini et al., 1986; Albertini et al., 1996; Bochner, Albertini, Samar, & Metz, 1992) has shown the test to be reliable across multiple raters and valid as a measure of writing ability for placement into these courses, changes in the curriculum and the rater pool necessitated a new look at interrater reliability and concurrent validity. We evaluated the rating scores for 236 samples from students who entered the college in fall 2001. Using a multipronged approach, we confirmed the interrater reliability and the validity of this direct measure of assessment. The implications of continued use of this and similar tests in light of definitions of validity, local control, and the nature of writing are discussed.

14.
This study examined the effect of rater training on prospective PETS Level 1 oral examiners, using many-facet Rasch analysis to compare their ratings before and after training. After training, the rate of exact agreement between examiners' and experts' scores rose; examiners who had rated too severely adjusted their application of the criteria appropriately; and all examiners' fit values fell within the acceptable range. Overall, the training was fairly effective and improved rating accuracy. Many-facet Rasch analysis helps identify examiners who rate too leniently or too severely, who show poor fit, or whose ratings are otherwise anomalous, providing a reliable basis for targeted training.
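For context, the "fit values" mentioned are conventionally the Rasch infit and outfit mean squares (Wright & Masters's standard definitions, not quoted from this article), which compare observed ratings with their model expectations:

```latex
% Rasch fit statistics for a rater, where x_{ni} is the observed
% rating, E_{ni} its model expectation, and W_{ni} = Var(x_{ni})
% its model variance.
\[
  \text{outfit MnSq} = \frac{1}{N}\sum_{n=1}^{N}
      \frac{(x_{ni}-E_{ni})^2}{W_{ni}},
  \qquad
  \text{infit MnSq} = \frac{\sum_{n}(x_{ni}-E_{ni})^2}{\sum_{n} W_{ni}}
\]
% Values near 1 indicate acceptable fit; common practice treats
% roughly 0.5 to 1.5 as productive for measurement.
```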

15.
Using quantitative and qualitative methods, this article reports an empirical study of remote English oral testing over the Internet voice service Skype, with 159 learners at outer-suburban county branch campuses of Beijing Radio and Television University as subjects, exploring a new format for oral testing. The study found that Skype-based oral testing not only resolves the lack of interactivity and authenticity in computer-assisted oral tests but also reduces administrative costs effectively and balances the validity and reliability of the test, making it a new oral test format well suited to testing English majors.

16.
17.
Numerous researchers have proposed methods for evaluating the quality of rater-mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many-facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater-mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater-mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.
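As a reference point, the numeric indicators at the heart of MSA are Loevinger's scalability coefficients, defined from observed and expected Guttman errors (the standard dichotomous-case definition, not quoted from this article):

```latex
% Loevinger's scalability coefficients used in Mokken scale analysis:
% F_{ij} is the observed count of Guttman errors for item pair (i, j),
% and E_{ij} the count expected under marginal independence.
\[
  H_{ij} = 1 - \frac{F_{ij}}{E_{ij}},
  \qquad
  H = 1 - \frac{\sum_{i<j} F_{ij}}{\sum_{i<j} E_{ij}}
\]
% By convention, a set of items forms at least a weak Mokken scale
% when all H_{ij} > 0 and the total H >= 0.3.
```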

18.
Psychometric models based on the structural equation modeling framework are commonly used in many multiple-choice test settings to assess measurement invariance of test items across examinee subpopulations. The premise of the current article is that they may also be useful in the context of performance assessment tests for testing the measurement invariance of raters. The modeling approach, and how it can be used for performance tests with less-than-optimal rater designs, is illustrated using a data set from a performance test designed to measure medical students' patient management skills. The results suggest that group-specific rater statistics can help spot differences in rater performance that might be due to rater bias, identify specific weaknesses and strengths of individual raters, and enhance decisions related to future task development, rater training, and test scoring processes.
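To make the invariance idea concrete, here is one common-factor formulation (my notation, a generic sketch rather than the article's own model): a rating of an examinee by a given rater loads on the examinee's latent skill, and rater invariance amounts to equality constraints across raters.

```latex
% A rating y_{jr} of examinee j by rater r, modeled as an indicator
% of the examinee's latent proficiency eta_j:
\[
  y_{jr} = \nu_r + \lambda_r\,\eta_j + \varepsilon_{jr}
\]
% Measurement invariance of raters corresponds to the constraints
% (metric and scalar invariance, respectively); violations flag
% possible rater bias.
\[
  \lambda_1 = \lambda_2 = \cdots = \lambda_R,
  \qquad
  \nu_1 = \nu_2 = \cdots = \nu_R
\]
```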

19.
《Educational Assessment》2013,18(3):257-272
Concern about the education system has increasingly focused on achievement outcomes and the role of assessment in school performance. Our research with fifth and eighth graders in California explored several issues regarding student performance and rater reliability on hands-on tasks that were administered as part of a field test of a statewide assessment program in science. This research found that raters can produce reliable scores for hands-on tests of science performance. However, the reliability of performance test scores per hour of testing time is quite low relative to multiple-choice tests. Reliability can be improved substantially by adding more tasks (and testing time). Using more than one rater per task produces only a very small improvement in the reliability of a student's total score across tasks. These results were consistent across both grade levels, and they echo the findings of past research.
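The gain from adding tasks follows the classic Spearman-Brown prophecy formula. A small sketch, with an invented single-task reliability chosen only for illustration:

```python
# Spearman-Brown prophecy: predicted reliability when a test is
# lengthened by a factor k (e.g., k parallel tasks instead of one).
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability of a test k times as long as one
    with reliability r (parallel tasks assumed)."""
    return k * r / (1 + (k - 1) * r)

# Invented example: a single hands-on task with reliability 0.30.
for k in (1, 2, 4, 8):
    print(f"{k} task(s): predicted reliability = {spearman_brown(0.30, k):.2f}")
```

Doubling tasks lifts 0.30 to about 0.46 and eight tasks reach about 0.77, which matches the abstract's point that more tasks, not more raters per task, is where the reliability gain lies.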

