Similar Documents
20 similar documents found (search time: 171 ms)
1.
This paper evaluates the quality of an Environmental Hygiene final examination paper and students' mastery of the material by analyzing the paper's reliability, difficulty, discrimination, and student scores. The reliability of the objective items was 0.82 and that of the subjective items 0.68; the overall difficulty was 0.575 and the discrimination 0.375. The paper's reliability and discrimination were good, but the multiple-response and short-answer items were too difficult. The pass rate was 50.63%, so item difficulty should be moderately reduced in future examinations.
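The classical indices this abstract reports (reliability via Cronbach's α, difficulty P, discrimination D) can be computed from a persons × items score matrix. A minimal sketch in Python, assuming dichotomous 0/1 items and the common upper-lower 27% grouping rule; the simulated data and grouping fraction are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a persons x items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def difficulty(scores: np.ndarray) -> np.ndarray:
    """Classical difficulty P: proportion correct per item."""
    return scores.mean(axis=0)

def discrimination(scores: np.ndarray, frac: float = 0.27) -> np.ndarray:
    """Upper-lower discrimination D: P(top 27%) - P(bottom 27%) by total score."""
    g = max(1, int(round(frac * scores.shape[0])))
    order = np.argsort(scores.sum(axis=1))
    return scores[order[-g:]].mean(axis=0) - scores[order[:g]].mean(axis=0)

rng = np.random.default_rng(0)
theta = rng.normal(size=(200, 1))                      # simulated abilities
X = (rng.random((200, 40)) < 1 / (1 + np.exp(-theta))).astype(int)
print(cronbach_alpha(X), difficulty(X).mean(), discrimination(X).mean())
```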

2.
Large-scale comprehensive surveys often include a cognitive testing component. To control survey costs and reduce missing-data rates, cognitive tests in large surveys are usually designed to be brief, and under these constraints their reliability and validity may suffer. This paper uses both classical test theory and item response theory to analyze the reliability and validity of the word test in the China Family Panel Studies ("中国家庭追踪调查"). We also compare three scoring methods: raw scoring, hardest-item scoring, and IRT-based scoring. The results show that the word test has high reliability and good construct and criterion-related validity, and that the three scoring methods are highly correlated, with no substantive differences when analyzing cross-sectional data.

3.
The α coefficient is an estimate of a lower bound of the reliability coefficient, so one should not blindly demand that it reach 0.9 or above. This paper derives the functional relationship between the α coefficient and the F-statistic and, based on this relationship, proposes a criterion for deciding how large a test paper's α value needs to be. The appropriate size of α depends both on the number of items and on the significance level, that is, the probability of misjudgment. 1. Introduction. Reliability is an index of how dependable a test paper is. In classical test theory, reliability is defined only as a theoretical construct; in practical applications it must be estimated statistically from a set of observed scores. How to make the estimated value approach the true reliability is clearly an important research direction, and this paper …
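One classical route to the α-F link described here (not necessarily this paper's exact derivation) is Hoyt's ANOVA formulation, in which α = 1 − MS_residual/MS_persons = 1 − 1/F for the persons-by-items two-way layout. A sketch with simulated data:

```python
import numpy as np

def hoyt_alpha(scores: np.ndarray) -> tuple[float, float]:
    """Hoyt's ANOVA reliability: alpha = 1 - MS_resid / MS_persons = 1 - 1/F."""
    n, k = scores.shape
    grand = scores.mean()
    ss_persons = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_resid = ((scores - grand) ** 2).sum() - ss_persons - ss_items
    ms_persons = ss_persons / (n - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))
    F = ms_persons / ms_resid
    return 1 - 1 / F, F

rng = np.random.default_rng(1)
theta = rng.normal(size=(300, 1))
X = (rng.random((300, 25)) < 1 / (1 + np.exp(-theta))).astype(int)
alpha, F = hoyt_alpha(X)
print(f"alpha = {alpha:.3f}, F = {F:.2f}")  # equals Cronbach's alpha
```

Because α is monotone in F, a required significance level for F translates into a minimum acceptable α for a given number of items and examinees, which is the kind of criterion the paper proposes.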

4.
Written examinations are the main means of evaluating teaching effectiveness, and the quality of the test paper directly affects the reliability of the evaluation. Building an automatic test-paper generation system in a local area network environment can draw fully on the collective wisdom of educators, speed up system construction, and guarantee the quality of the generated papers. Such a system also makes it easy to separate teaching from examination in educational evaluation, making the evaluation results more credible.

5.
Starting from the basic considerations of oral English tests for English majors in higher education, this paper designs the procedure, scoring criteria, and scoring method for a "picture description" task, analyzes how this task type affects the reliability and validity of an oral test paper, and concludes that the task can be widely applied in oral English testing.

6.
Test Paper Quality Analysis Based on Mathematical Statistics   (total citations: 1; self-citations: 0; by others: 1)
Test paper quality matters for reflecting teachers' teaching level and students' mastery of knowledge and skills, and choosing a scientific method of assessment is the key to analyzing paper quality effectively. This paper applies mathematical statistics to test paper quality analysis, processing and interpreting the raw scores in order to compare and evaluate paper quality as scientifically as possible. The analysis covers item-level analysis, the paper's difficulty, discrimination, and overall quality, with emphasis on reliability and the frequency distribution of scores.
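A small sketch of the raw-score processing this kind of analysis involves: standardizing scores and tabulating a frequency distribution (the simulated scores and 10-point bins are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
raw = np.clip(rng.normal(72, 12, 120), 0, 100)   # simulated exam scores

z = (raw - raw.mean()) / raw.std(ddof=1)         # standard scores
t = 50 + 10 * z                                  # T-scores

edges = np.arange(0, 101, 10)
freq, _ = np.histogram(raw, bins=edges)
for lo, hi, f in zip(edges[:-1], edges[1:], freq):
    print(f"{lo:3d}-{hi:3d}: {f}")
print(f"mean = {raw.mean():.1f}, sd = {raw.std(ddof=1):.1f}")
```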

7.
In the reading sections of the gaokao Chinese papers set by the National Education Examinations Authority and the provincial item-writing bodies, objective and subjective items are generally combined in a certain proportion. When testing higher-level abilities, item writers favor subjective items over objective ones. In the 2011 national curriculum-standard paper, for example, both classical poetry reading items were subjective, and three of the four items in each of the literary-text and practical-text reading sections were subjective. In the gaokao Chinese papers set independently by the provinces, …

8.
Wang Xing (王星). 《新高考》, 2008, (Z2): 16-17
Subjective items are an indispensable and important way of assessing students' reading ability and their ability to summarize in language. In recent years the weight of subjective items in test papers has kept increasing, and many students lose substantial marks on them. Analyzing each year's gaokao papers, it is not hard to see that much of the mark loss on subjective items is not …

9.
Following the methods and techniques of educational test construction in classical measurement theory, test items for mathematics teachers' core competencies were developed. The construction procedure was: build the assessment framework, clarify the test purpose, draw up the test blueprint, write the test items, revise the items, assemble the test questionnaire, and validate the test paper. The resulting test of senior high school mathematics teachers' core competencies has good reliability and validity and can serve as an instrument for assessing the core competencies of senior high school mathematics teachers in China.

10.
Subjective items are a basic item type in the gaokao. In the examination, many students fill their answer sheets completely yet score poorly, which drags down their total scores. How, then, should students answer subjective items in politics well? Below I share some observations from my own day-to-day teaching.

11.
Currently there is concern among some educators regarding the reliability of criterion-referenced (CR) measures. In this comment, a recent attempt to develop a theory of reliability for CR measures is examined, and some considerations for determining the reliability of CR measures are discussed. Conventional reliability statistics (e.g., coefficient alpha, standard error of measurement) are found appropriate for CR measures satisfying the assumptions of the measurement model underlying classical test theory. For measures with underlying multidimensional traits, conventional reliability statistics may be used at the homogeneous subscale level. When the confidence interval about a student's “below criterion score” includes the criterion, additional evidence about the student should be obtained. Two-stage sequential testing is suggested as one method for acquiring additional evidence.
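A minimal sketch of the conventional statistics endorsed here, under classical-test-theory assumptions (all numbers are illustrative): the SEM yields a confidence interval around an observed score, and an interval that straddles the criterion signals the need for more evidence.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

sd, alpha, score, criterion = 8.0, 0.85, 76.0, 80.0
s = sem(sd, alpha)
lo, hi = score - 1.96 * s, score + 1.96 * s      # ~95% interval
print(f"SEM = {s:.2f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
if lo <= criterion <= hi:
    print("Interval includes the criterion: obtain additional evidence.")
```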

12.
In criterion‐referenced tests (CRTs), the traditional measures of reliability used in norm‐referenced tests (NRTs) have often proved problematic because of NRT assumptions of one underlying ability or competency and of variance in the distribution of scores. CRTs, by contrast, are likely to be created when mastery of the skill or knowledge by all or most all test takers is expected and thus little variation in the scores is expected. A comprehensive CRT often measures a number of discrete tasks that may not represent a single unifying ability or competence. Hence, CRTs theoretically violate the two most essential assumptions of classic NRT reliability theory and they have traditionally required the logistical problems of multiple test administrations to the same test takers to estimate reliability. A review of the literature categorizes approaches to reliability for CRTs into two classes: estimates sensitive to all measures of error and estimates of consistency in test outcome. For single test administration of CRTs Livingston's k² is recommended for estimating all measures of error; Sc is proposed for estimates of consistency in test outcome. Both approaches are compared using data from a CRT exam and recommendations for interpretation and use are proposed.
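Livingston's coefficient is commonly stated as k² = (r·s² + (x̄ − C)²) / (s² + (x̄ − C)²), where r and s² are the conventional reliability and score variance and C is the cut score; a hedged sketch (consult Livingston's original paper for the exact formulation):

```python
def livingston_k2(reliability: float, mean: float, var: float, cut: float) -> float:
    """Livingston's criterion-referenced reliability about a cut score."""
    d2 = (mean - cut) ** 2
    return (reliability * var + d2) / (var + d2)

# Illustrative mastery test: little score variance, mean well above the cut.
print(livingston_k2(reliability=0.55, mean=88.0, var=25.0, cut=80.0))
```

Note that k² exceeds the conventional reliability whenever the mean differs from the cut score, which is why it behaves better than NRT coefficients on low-variance mastery data.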

13.
Criterion‐related profile analysis (CPA) can be used to assess whether subscores of a test or test battery account for more criterion variance than does a single total score. Application of CPA to subscore evaluation is described, compared to alternative procedures, and illustrated using SAT data. Considerations other than validity and reliability are discussed, including broad societal goals (e.g., affirmative action), fairness, and ties in expected criterion predictions. In simulation data, CPA results were sensitive to subscore correlations, sample size, and the proportion of criterion‐related variance accounted for by the subscores. CPA can be a useful component in a thorough subscore evaluation encompassing subscore reliability, validity, distinctiveness, fairness, and broader societal goals.
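The core contrast in CPA can be illustrated by comparing the criterion variance explained by the subscores jointly against the total score alone; a sketch with synthetic data (not the authors' SAT analysis):

```python
import numpy as np

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an OLS regression of y on X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return 1 - ((y - X1 @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(3)
n = 1000
sub = rng.normal(size=(n, 2))                         # two subscores
y = 0.5 * sub[:, 0] + 0.1 * sub[:, 1] + rng.normal(scale=0.8, size=n)

print("total score R^2:", r_squared(sub.sum(axis=1)[:, None], y))
print("subscores  R^2:", r_squared(sub, y))           # higher: subscores add value
```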

14.
The relation between test reliability and statistical power has been a controversial issue, perhaps due in part to a 1975 publication in the Psychological Bulletin by Overall and Woodward, “Unreliability of Difference Scores: A Paradox for the Measurement of Change”, in which they demonstrated that a Student t test based on pretest-posttest differences can attain its greatest power when the difference score reliability is zero. In the present article, the authors attempt to explain this paradox by demonstrating in several ways that power is not a mathematical function of reliability unless either true score variance or error score variance is constant.
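The paradox reproduces in a few lines: give every examinee the same true gain, so that true difference-score variance, and hence difference-score reliability, is zero, yet the paired t test is highly powered because error variance is small. A simulation sketch (illustrative parameters, not the authors' derivation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps, gain, hits = 30, 2000, 2.0, 0
for _ in range(reps):
    true = rng.normal(50, 10, n)              # true scores
    pre = true + rng.normal(0, 2, n)          # small measurement error
    post = true + gain + rng.normal(0, 2, n)  # identical true gain for everyone
    hits += stats.ttest_rel(post, pre).pvalue < 0.05
print("empirical power:", hits / reps)
# Difference-score reliability = var(true gains) / var(observed differences)
# = 0 here, since every examinee's true gain is identical.
```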

15.
The standard error of measurement usefully provides confidence limits for scores in a given test, but is it possible to quantify the reliability of a test with just a single number that allows comparison of tests of different format? Reliability coefficients do not do this, being dependent on the spread of examinee attainment. Better in this regard is a measure produced by dividing the standard error of measurement by the test's ‘reliability length’, the latter defined as the maximum possible score minus the most probable score obtainable by blind guessing alone. This, however, can be unsatisfactory with negative marking (formula scoring), as shown by data on 13 negatively marked true/false tests. In these the examinees displayed considerable misinformation, which correlated negatively with correct knowledge. Negative marking can improve test reliability by penalizing such misinformation as well as by discouraging guessing. Reliability measures can be based on idealized theoretical models instead of on test data. These do not reflect the qualities of the test items, but can be focused on specific test objectives (e.g. in relation to cut‐off scores) and can be expressed as easily communicated statements even before tests are written.
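A sketch of the proposed index, the SEM divided by 'reliability length' (maximum possible score minus the most probable blind-guessing score); the numbers below are illustrative for a 100-item true/false test:

```python
def comparable_error(sem: float, max_score: float, guess_score: float) -> float:
    """SEM / reliability length, comparable across test formats."""
    return sem / (max_score - guess_score)

# Number-right scoring: blind guessing expects 50 of 100.
print(comparable_error(sem=4.0, max_score=100, guess_score=50))  # 0.08
# Formula scoring (+1 right, -1 wrong): blind guessing expects 0.
print(comparable_error(sem=4.0, max_score=100, guess_score=0))   # 0.04
```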

16.
Classification consistency and accuracy are viewed as important indicators for evaluating the reliability and validity of classification results in cognitive diagnostic assessment (CDA). Pattern‐level classification consistency and accuracy indices were introduced by Cui, Gierl, and Chang. However, the indices at the attribute level have not yet been constructed. This study puts forward a simple approach to estimating the indices at both the attribute and the pattern level through one single test administration. Detailed elaboration is made on how the upper and lower bounds for the attribute‐level accuracy can be derived from the variance of error of the attribute mastery probability estimate. In addition, based on Cui's pattern‐level indices, an alternative approach to estimating the attribute‐level indices is also proposed. Comparative analysis of simulation results indicate that the new indices are very desirable for evaluating test‐retest consistency and correct classification rate.

17.
Under the generalizability‐theory (G‐theory) framework, the estimation precision of variance components (VCs) is of significant importance in that they serve as the foundation of estimating reliability. Zhang and Lin advanced the discussion of nonadditivity in data from a theoretical perspective and showed the adverse effects of nonadditivity on the estimation precision of VCs in 2016. Contributing to this line of research, the current article directs the discussion of nonadditivity from a theoretical perspective to a practical application and highlights the importance of detecting nonadditivity in G‐theory applications. To this end, Tukey's test for nonadditivity is the only method to date that is appropriate for the typical single‐facet G‐theory design, in which a single observation is made per element within a facet. The current article evaluates the Type I and Type II error rates of Tukey's test. Results show that Tukey's test is satisfactory in controlling for falsely detecting nonadditivity when the data are actually additive and that it is generally powerful in detecting nonadditivity when it exists. Finally, the article demonstrates an application of Tukey's test in detecting nonadditivity in a judgmental study of educational standards and shows how Tukey's test results can be used to correct imprecision in the estimated VC in the presence of nonadditivity.
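Tukey's one-degree-of-freedom test is straightforward to compute for the single-facet case (one observation per person-item cell); a sketch with simulated multiplicative nonadditivity (the data-generating model is an illustrative assumption):

```python
import numpy as np
from scipy import stats

def tukey_nonadditivity(y: np.ndarray) -> tuple[float, float]:
    """Tukey's 1-df test for nonadditivity in a two-way table, one obs per cell."""
    I, J = y.shape
    grand = y.mean()
    a = y.mean(axis=1) - grand                      # row (person) effects
    b = y.mean(axis=0) - grand                      # column (item) effects
    resid = y - grand - a[:, None] - b[None, :]
    ss_resid = (resid ** 2).sum()
    ss_nonadd = (a[:, None] * b[None, :] * y).sum() ** 2 \
                / ((a ** 2).sum() * (b ** 2).sum())
    df2 = (I - 1) * (J - 1) - 1
    F = ss_nonadd / ((ss_resid - ss_nonadd) / df2)
    return F, stats.f.sf(F, 1, df2)

rng = np.random.default_rng(5)
p, q = rng.normal(size=(50, 1)), rng.normal(size=(1, 10))
y = p + q + 0.5 * p * q + rng.normal(0, 0.5, (50, 10))  # nonadditive data
print(tukey_nonadditivity(y))                            # large F, tiny p
```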

18.
We evaluated the statistical power of single-indicator latent growth curve models to detect individual differences in change (variances of latent slopes) as a function of sample size, number of longitudinal measurement occasions, and growth curve reliability. We recommend the 2 degree-of-freedom generalized test assessing loss of fit when both slope-related random effects, the slope variance and intercept-slope covariance, are fixed to 0. Statistical power to detect individual differences in change is low to moderate unless the residual error variance is low, sample size is large, and there are more than four measurement occasions. The generalized test has greater power than a specific test isolating the hypothesis of zero slope variance, except when the true slope variance is close to 0, and has uniformly superior power to a Wald test based on the estimated slope variance.

19.
This study illustrates how generalizability theory can be used to evaluate the dependability of school-level scores in situations where test forms have been matrix sampled within schools, and to estimate the minimum number of forms required to achieve acceptable levels of score reliability. Data from a statewide performance assessment in reading, writing, and language usage were analyzed in a series of generalizability studies using a person: (school x form) design that provided variance component estimates for four sources: school, form, school x form, and person: (school x form). Six separate scores were examined. The results of the generalizability studies were then used in decision studies to determine the impact on score reliability when the number of forms administered within schools was varied. Results from the decision studies indicated that score generalizability could be improved when the number of forms administered within schools was increased from one to three forms, but that gains in generalizability were small when the number of forms was increased beyond three. The implications of these results for planning large-scale performance assessments are discussed.
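The D-study logic can be sketched with made-up variance components for the person:(school × form) design: the generalizability coefficient for school means rises quickly from one to three forms and then flattens, as the study reports. The components and group sizes below are illustrative assumptions, not the article's estimates.

```python
# Illustrative variance components for a p:(s x f) design.
var_s, var_sf, var_p_sf = 0.50, 0.10, 2.00

def school_g_coef(n_forms: int, n_persons_per_form: int) -> float:
    """Generalizability coefficient for school means in a D study.
    Relative error = var_sf / n_f + var_p:(sf) / (n_f * n_p)."""
    rel_err = var_sf / n_forms + var_p_sf / (n_forms * n_persons_per_form)
    return var_s / (var_s + rel_err)

for nf in (1, 2, 3, 4, 6):
    print(nf, round(school_g_coef(nf, 25), 3))
```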

20.
《教育实用测度》2013,26(4):257-275
Weighting responses to Constructed-Response (CR) items has been proposed as a way to increase the contribution these items make to the test score when there is insufficient testing time to administer additional CR items. The effect of various types of weighting items of an IRT-based mixed-format writing examination was investigated. Constructed-response items were weighted by increasing their representation according to the test blueprint, by increasing their contribution to the test characteristic curve, by summing the ratings of multiple raters, and by applying optimal weights utilized in IRT pattern scoring. Total score and standard errors of the weighted composite forms of CR and Multiple-Choice (MC) items were compared against each other and against a form containing additional rather than weighted items. Weighting resulted in a slight reduction of test reliability but reduced standard error in portions of the ability scale.
