Similar Documents
19 similar documents found.
1.
Reliability and validity are two critically important criteria for evaluating test quality, and validity presupposes reliability. One intact class (24 students) was randomly sampled from the five classes of the 2009 cohort; their scores on mock exams built from the 2006–2010 TEM4 papers and on the official 2011 TEM4 were analyzed statistically, and the reliability of the six years of papers (2006–2011) was examined. The results show that: (1) the papers have fairly high reliability; (2) on the subjective items, especially the passage dictation, student scores diverged considerably, difficulty values varied widely, and reliability was on the low side; (3) on the objective items, student scores remained highly stable and consistent, and reliability was high.

2.
Reliability refers to the degree of consistency, or dependability, of test results. Its main forms include test-retest reliability, split-half reliability, parallel-forms reliability, and inter-rater reliability. Common estimation methods include the Spearman-Brown prophecy formula, Cronbach's α coefficient, and the Kuder-Richardson 20 and Kuder-Richardson 21 formulas. Analyzing these estimation methods, understanding what reliability means, and applying the concept correctly are essential to improving the design and quality of language tests.
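For readers who want to see these estimators side by side, here is a minimal Python sketch using the textbook formulas for Cronbach's α, KR-20, KR-21, and an odd-even split-half stepped up with the Spearman-Brown formula. The response matrix is invented purely for illustration.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def kr20(scores):
    """KR-20 for dichotomous (0/1) items: a special case of alpha."""
    k = scores.shape[1]
    p = scores.mean(axis=0)                      # proportion correct per item
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - (p * (1 - p)).sum() / total_var)

def kr21(scores):
    """KR-21: like KR-20 but assumes all items are equally difficult."""
    k = scores.shape[1]
    totals = scores.sum(axis=1)
    m, v = totals.mean(), totals.var(ddof=1)
    return k / (k - 1) * (1 - m * (k - m) / (k * v))

def split_half_spearman_brown(scores):
    """Odd-even split-half r, stepped up with the Spearman-Brown prophecy formula."""
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

# Invented 0/1 response matrix: 6 examinees x 4 items, for illustration only.
X = np.array([[1,1,1,0],[1,1,0,0],[1,0,1,1],[0,1,1,0],[1,1,1,1],[0,0,1,0]])
print(cronbach_alpha(X), kr20(X), kr21(X), split_half_spearman_brown(X))
```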

3.
Reliability refers to the degree of consistency, or dependability, of test results; its main forms include test-retest, split-half, parallel-forms, and inter-rater reliability. Common estimation methods include the Spearman-Brown prophecy formula, Cronbach's α coefficient, and the Kuder-Richardson 20 and Kuder-Richardson 21 formulas. Analyzing and studying these estimation methods, understanding the meaning of reliability, and applying the concept correctly are of great importance to improving language test design and raising language test quality.

4.
Where there is teaching, there must be assessment and evaluation. As a means of assessment and evaluation, a scientifically sound and well-designed test exerts positive washback on teaching; a poor one brings negative effects. Many studies show that a high-quality test must satisfy four requirements: validity, reliability, discrimination, and practicality. Validity and reliability are the most important and most basic of these.

5.
An Overview of Test Reliability
Reliability is an estimate of the degree of measurement consistency. It falls into four types: test-retest reliability, parallel-forms reliability, internal-consistency (homogeneity) reliability, and inter-rater reliability. Test length and difficulty, together with the variability and ability level of the examinee group, are the main factors affecting reliability. The standard error of measurement is an alternative expression of reliability and can be used to interpret individual scores or differences between scores. Estimating the reliability of speed tests and mastery tests requires special methods.
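The standard error of measurement mentioned above has a simple closed form in classical test theory, SEM = SD × √(1 − r). A small sketch follows; the SD, reliability, and score are illustrative numbers, not values from the article.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Illustrative numbers only: a test with SD = 10 and reliability .84.
s = sem(10, 0.84)   # = 4.0 score points
x = 72              # one examinee's observed score
# Under a normal error model, about 95% of observed scores fall
# within 1.96 SEM of the true score.
print(f"95% band around {x}: {x - 1.96*s:.1f} to {x + 1.96*s:.1f}")
```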

6.
This paper discusses, from the standpoint of language testing theory, the influence of language testing on foreign language teaching, arguing that testing is an effective means of gauging foreign language teaching and learning. A good test should have validity, reliability, and practicality, and should guide students positively. Many studies show that a high-quality test must satisfy five requirements: validity, reliability, discrimination, practicality, and washback. As is well known, where there is teaching, there must be assessment and evaluation; as a means of assessment, a scientifically sound and well-designed test exerts positive washback on teaching, while a poor one brings negative effects.

7.
Liu Xiaohong, Journal of Xiangnan University, 2012, 33(6): 73-76, 85
The scientific rigor and authority of the Putonghua Proficiency Test (PSC) depend on test quality. Reliability is one of the main indicators for evaluating the quality of a language test, and test-paper construction is the foundation and guarantee of that reliability. This paper analyzes the requirements on the composition of PSC test papers, examines the reliability of each item type in some detail, and offers suggestions for addressing problems in paper construction, with the aim of promoting the healthy and orderly development of PSC testing.

8.
From the standpoint of language testing theory, this paper discusses the influence of language testing on foreign language teaching, arguing that testing is an effective means of gauging foreign language teaching and learning; a good test should have validity, reliability, and practicality, and should guide students positively. Many studies show that a high-quality test must satisfy five requirements: validity, reliability, discrimination, practicality, and washback. As is well known, where there is teaching, there must be assessment and evaluation; as a means of assessment, a scientifically sound and well-designed test exerts positive washback on teaching, while a poor one brings negative effects.

9.
The College English Test Band 4 (CET-4) is a proficiency test. For such tests, reliability and validity are the two main criteria by which the CET-4 is evaluated; the success of the test depends to a large extent on how high they are, so every effort should be made to raise them. Analyzing the reliability and validity of the listening section of the reformed CET-4 also offers some guidance for college English listening instruction.

10.
Reliability and validity are two important indicators for evaluating and checking test quality. Through an analysis of the reliability and validity of an in-house college English test at Xihua University, this paper examines the various factors that affect them and, drawing on the data and on practical experience, offers several suggestions for improving test reliability and validity.

11.
I discuss the contribution by Davenport, Davison, Liou, & Love (2015), in which they relate reliability, represented by coefficient α, to formal definitions of internal consistency and unidimensionality, both proposed by Cronbach (1951). I argue that coefficient α is a lower bound to reliability and that concepts of internal consistency and unidimensionality, however defined, belong to the realm of validity, viz. the issue of what the test measures. Internal consistency and unidimensionality may play a role in the construction of tests when the theory of the attribute for which the test is constructed implies that the items be internally consistent or unidimensional. I also offer examples of attributes that do not imply internal consistency or unidimensionality, thus limiting these concepts' usefulness in practical applications.

12.
In criterion-referenced tests (CRTs), the traditional measures of reliability used in norm-referenced tests (NRTs) have often proved problematic because of NRT assumptions of one underlying ability or competency and of variance in the distribution of scores. CRTs, by contrast, are likely to be created when mastery of the skill or knowledge by all or almost all test takers is expected, and thus little variation in the scores is expected. A comprehensive CRT often measures a number of discrete tasks that may not represent a single unifying ability or competence. Hence, CRTs theoretically violate the two most essential assumptions of classic NRT reliability theory, and estimating their reliability has traditionally entailed the logistical problems of multiple test administrations to the same test takers. A review of the literature categorizes approaches to reliability for CRTs into two classes: estimates sensitive to all measures of error, and estimates of consistency in test outcome. For a single test administration of a CRT, Livingston's k² is recommended for estimating all measures of error; Sc is proposed for estimates of consistency in test outcome. Both approaches are compared using data from a CRT exam, and recommendations for interpretation and use are proposed.
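As a point of reference, Livingston's coefficient is commonly stated as k² = (r·s² + (M − C)²) / (s² + (M − C)²), where C is the cut score and r any conventional reliability estimate for the same scores. A minimal sketch with invented scores (not data from the article):

```python
import numpy as np

def livingston_k2(scores, cut, reliability):
    """Livingston's k^2 for criterion-referenced tests:
    (r * var + (mean - cut)^2) / (var + (mean - cut)^2)."""
    var = np.var(scores, ddof=1)
    dev = (np.mean(scores) - cut) ** 2
    return (reliability * var + dev) / (var + dev)

# Invented total scores, cut score, and conventional reliability estimate.
totals = np.array([18, 22, 25, 19, 27, 24, 21, 26])
print(livingston_k2(totals, cut=20, reliability=0.75))
```

Note how the coefficient rises as the group mean moves away from the cut score, which is exactly why it suits mastery tests with little score variance.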

13.
Touch screen tablets are being increasingly used in schools for learning and assessment. However, the validity and reliability of assessments delivered via tablets are largely unknown. The present study tested the psychometric properties of a tablet-based app designed to measure early literacy skills. Tablet-based tests were also compared with traditional paper-based tests. Children aged 2–6 years (N = 99) completed receptive tests delivered via a tablet for letter, word, and numeral skills. The same skills were tested with a traditional paper-based test that used an expressive response format. Children (n = 35) were post-tested 8 weeks later to examine the stability of test scores over time. The tablet test scores showed high internal consistency (all αs > .94), acceptable test-retest reliability (ICC range = .39–.89), and were correlated with child age, family SES, and home literacy teaching, indicating good predictive validity. The agreement between scores for the tablet and traditional tests was high (ICC range = .81–.94). The tablet tests thus provide valid and reliable measures of children's early literacy skills. The strong psychometric properties and ease of use suggest that tablet-based tests of literacy skills have the potential to improve assessment practices for research purposes and classroom use.
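Test-retest stability of the kind reported here is typically quantified with an intraclass correlation. Below is a sketch of one common variant, ICC(2,1) (two-way random effects, absolute agreement, single scores), with invented pre/post scores; the article does not specify which ICC variant was used, so this choice is an assumption.

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single scores.
    `data` is an n_subjects x k_occasions score matrix."""
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()   # between occasions
    ss_err = ((data - grand) ** 2).sum() - ss_rows - ss_cols # residual
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Invented test/retest scores for 5 children, for illustration only.
scores = np.array([[12, 14], [20, 19], [8, 10], [15, 15], [18, 21]])
print(icc_2_1(scores))
```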

14.
Reliability of a criterion-referenced test is often viewed as the consistency with which individuals who have taken two strictly parallel forms of a test are classified as being masters or nonmasters. However, in practice, it is rarely possible to retest students, especially with equivalent forms. For this reason, methods for making conservative approximations of alternate form (or test-retest "without the effects of testing") reliability have been developed. Because these methods are computationally tedious and require some psychometric sophistication, they have rarely been used by teachers and school psychologists. This paper (a) describes one method (Subkoviak's) for estimating alternate-form reliability from one administration of a criterion-referenced test and (b) describes a computer program developed by the authors that will handle tests containing hundreds of items for large numbers of examinees and allow any test user to apply the technique described. The program is a superior alternative to other methods of simplifying this estimation procedure that rely upon tables; a user can check classification consistency estimates for several prospective cut scores directly from a data file, without having to make prior calculations.
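A simplified sketch in the spirit of Subkoviak's single-administration approach (this is my reconstruction under a plain binomial model, not the authors' program): regress each examinee's proportion correct toward the group mean with Kelley's formula, treat the score on a randomly parallel form as binomial, and average the probability that two such forms agree in their master/nonmaster classification. All numbers are invented.

```python
import numpy as np
from scipy.stats import binom

def classification_consistency(scores, n_items, cut, reliability):
    """Single-administration agreement coefficient, Subkoviak-style sketch."""
    p = scores / n_items                                     # observed proportion correct
    tau = reliability * p + (1 - reliability) * p.mean()     # Kelley-regressed true score
    # P(score >= cut) on a randomly parallel form under Binomial(n_items, tau)
    p_master = 1 - binom.cdf(cut - 1, n_items, tau)
    # P(same classification on two parallel forms), averaged over examinees
    return (p_master ** 2 + (1 - p_master) ** 2).mean()

# Invented data: 8 examinees, 30-item test, cut score of 21, KR-20 = .80.
raw = np.array([25, 18, 22, 28, 15, 21, 24, 19])
print(classification_consistency(raw, n_items=30, cut=21, reliability=0.80))
```

Because the function takes the cut score as a parameter, checking consistency for several prospective cut scores, as the paper's program does, is a simple loop.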

15.
A common suggestion made in the psychometric literature for fixed-length classification tests is that one should design tests so that they have maximum information at the cut score. Designing tests in this way is believed to maximize the classification accuracy and consistency of the assessment. This article uses simulated examples to illustrate that one can obtain higher classification accuracy and consistency by designing tests that have maximum test information at locations other than at the cut score. We show that the location where one should maximize the test information is dependent on the length of the test, the mean of the ability distribution in comparison to the cut score, and, to a lesser degree, whether or not one wants to optimize classification accuracy or consistency. Analyses also suggested that the differences in classification performance between designing tests optimally versus maximizing information at the cut score tended to be greatest when tests were short and the mean of the ability distribution was further away from the cut score. Larger differences were also found in the simulated examples that used the 3PL model compared to the examples that used the Rasch model.
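To make "information at the cut score" concrete, here is a sketch that computes 3PL test information for two hypothetical 10-item designs, one with difficulties at the cut and one shifted toward the group mean. All item parameters are invented; this is not the simulation design used in the article.

```python
import numpy as np

def p3pl(theta, a, b, c):
    """3PL item response function."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    """3PL item information: a^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = p3pl(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# Two hypothetical 10-item designs (invented parameters): difficulties
# placed at the cut score (theta = 0) vs. shifted toward a group mean of -0.8.
theta = np.linspace(-3, 3, 601)
a, c = 1.2, 0.2
info_at_cut = sum(item_info_3pl(theta, a, 0.0, c) for _ in range(10))
info_shifted = sum(item_info_3pl(theta, a, -0.8, c) for _ in range(10))
i = np.argmin(np.abs(theta - 0.0))  # grid index of the cut score
print("test information at the cut:", info_at_cut[i], "vs", info_shifted[i])
```

Pairing such information curves with a simulated ability distribution is what lets one compare classification accuracy and consistency across designs, as the article does.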

16.
Background: A recent article published in Educational Research on the reliability of results in National Curriculum testing in England (Newton, The reliability of results from national curriculum testing in England, Educational Research 51, no. 2: 181–212, 2009) suggested that: (1) classification accuracy can be calculated from classification consistency; and (2) classification accuracy on a single test administration is higher than classification consistency across two tests.

Purpose: This article shows that it is not possible to calculate classification accuracy from classification consistency. It then shows that, given reasonable assumptions about the distribution of measurement error, the expected classification accuracy on a single test administration is higher than the expected classification consistency across two tests only in the case of a pass–fail test, but not necessarily for tests that classify test-takers into more than two categories.

Main argument and conclusion: Classification accuracy is defined in terms of a 'true score' specified in a psychometric model. Three things must be known or hypothesised in order to derive a value for classification accuracy: (1) a psychometric model relating observed scores to true scores; (2) the location of the cut-scores on the score scale; and (3) the distribution of true scores in the group of test-takers.
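The distinction can be checked numerically. The Monte Carlo sketch below (my own illustration, with an assumed normal true-score distribution and normal measurement error, not the article's model) estimates accuracy against the true category and consistency across two simulated administrations, for a pass-fail cut and for a four-category classification.

```python
import numpy as np

def accuracy_and_consistency(cuts, sem=0.5, n=200_000, seed=0):
    """Monte Carlo estimates, assuming standard-normal true scores and
    normally distributed measurement error with the given SEM.
    Accuracy: one administration agrees with the true category.
    Consistency: two independent administrations agree with each other."""
    rng = np.random.default_rng(seed)
    t = rng.normal(0.0, 1.0, n)            # true scores
    x1 = t + rng.normal(0.0, sem, n)       # observed scores, form 1
    x2 = t + rng.normal(0.0, sem, n)       # observed scores, form 2
    accuracy = (np.digitize(x1, cuts) == np.digitize(t, cuts)).mean()
    consistency = (np.digitize(x1, cuts) == np.digitize(x2, cuts)).mean()
    return accuracy, consistency

print(accuracy_and_consistency(cuts=[0.0]))              # pass-fail test
print(accuracy_and_consistency(cuts=[-0.5, 0.5, 1.5]))   # four categories
```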

17.
Reliability of Scores From Teacher-Made Tests
Reliability is the property of a set of test scores that indicates the amount of measurement error associated with the scores. Teachers need to know about reliability so that they can use test scores to make appropriate decisions about their students. The level of consistency of a set of scores can be estimated by using the methods of internal analysis to compute a reliability coefficient. This coefficient, which can range between 0.0 and +1.0, usually has values around 0.50 for teacher-made tests and around 0.90 for commercially prepared standardized tests. Its magnitude can be affected by such factors as test length, test-item difficulty and discrimination, time limits, and certain characteristics of the group: the extent of their testwiseness, their level of motivation, and their homogeneity in the ability measured by the test.
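The effect of test length noted here is usually quantified with the Spearman-Brown prophecy formula; a tiny sketch with illustrative numbers (not figures from the article):

```python
def spearman_brown(r: float, factor: float) -> float:
    """Predicted reliability when a test is lengthened by `factor`
    with comparable items (Spearman-Brown prophecy formula)."""
    return factor * r / (1 + (factor - 1) * r)

# A teacher-made test with r = .50, doubled and tripled in length.
print(spearman_brown(0.50, 2))   # 0.667
print(spearman_brown(0.50, 3))   # 0.75
```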

18.
In discussion of the properties of criterion-referenced tests, it is often assumed that traditional reliability indices, particularly those based on internal consistency, are not relevant. However, if the measurement errors involved in using an individual's observed score on a criterion-referenced test to estimate his or her universe score on a domain of items are compared to the errors of an a priori procedure that assigns the same universe score (the mean observed test score) to all persons, the test-based procedure is found to improve the accuracy of universe score estimates only if the test reliability is above 0.5. This suggests that criterion-referenced tests with low reliabilities generally will have limited use in estimating universe scores on domains of items.
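The 0.5 threshold drops out of classical true-score algebra; a sketch of the argument (my derivation, assuming the classical model X = T + E with error uncorrelated with true score and reliability ρ = σ_T²/σ_X²):

```latex
% MSE of assigning everyone the group mean vs. using each observed score:
\mathrm{MSE}(\hat{T} = \bar{X}) \;=\; \sigma_T^2 \;=\; \rho\,\sigma_X^2,
\qquad
\mathrm{MSE}(\hat{T} = X) \;=\; \sigma_E^2 \;=\; (1 - \rho)\,\sigma_X^2 .
```

The observed score therefore gives smaller error than assigning everyone the mean exactly when (1 − ρ)σ_X² < ρσ_X², i.e. when ρ > 0.5, matching the threshold reported above.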
