1.

杜文久《数学教育学报》2008,17(1):67-69

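Because test scores are affected by raters and by test difficulty, they cannot evaluate students' academic standing well. An ability score depends only on the examinee and is independent of test difficulty. Ability scores have the following properties: they are consistent estimates, they are "invariant scores", and they are normally distributed. Under the ability-score framework, ability scores obtained from different tests can be compared directly.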

2.

Invariant Scores in Mathematics Tests Revisited: A Reply to Mr. Liu Yaobin (刘耀斌)

杜文久《数学教育学报》2010,(2)

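Evaluating students' academic achievement is an important task in education, and the usual method is to use test scores. Test scores, however, depend strictly on test difficulty, so an examinee's score differs across tests of different difficulty. An invariant score is one that stays unchanged for an examinee within a finite period before and after taking a test; it is a new way of scoring. Invariant scores have the properties of consistency, invariance, and normality.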

3.

The Distribution Shapes of Academic Achievement Test Scores and Their Application

余水《贵阳学院学报(自然科学版)》2016,(1):31-35

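Starting from the skewness and kurtosis coefficients of score distributions, this paper discusses the types of test score distributions and what they imply, and then examines how distribution shapes are currently applied. Findings: regarding the distribution of achievement test scores, there are currently serious misuses, including confusion about the concept of a skewed distribution, simple rejection of skewed distributions, and blind adherence to the normal distribution. Conclusions: blind adherence to normally distributed scores may lead to educational failure. In typical achievement tests designed for pass/fail purposes, positively skewed score distributions should be avoided as far as possible, and reasonable negatively skewed distributions should be rationally accepted. The paper closes with a brief analysis of trends in related research.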
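The skewness and kurtosis coefficients named in this abstract can be illustrated with a minimal sketch. It assumes the common moment-based definitions (third and fourth standardized central moments, with 3 subtracted from the kurtosis), which may differ from the exact estimators the paper uses. The function name and the score list are hypothetical and are not data from the paper.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(scores):
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(scores)
    mean = fmean(scores)
    # Central moments of order 2, 3 and 4 (divided by n, not n - 1).
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    if m2 == 0:
        raise ValueError("all scores are identical; shape is undefined")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical class: most students score high, a few score very low.
hypothetical_scores = [95, 92, 90, 88, 85, 80, 60, 35]
skew, kurt = skewness_and_excess_kurtosis(hypothetical_scores)
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

A negative skewness here reflects a long tail of low scores, the negatively skewed shape the abstract says can be acceptable for pass/fail achievement tests.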

4.

A Diachronic Study of the Stability of Parallel Forms of the Chinese Proficiency Test (HSK)

柴省三《现代语文》2011,(2)

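In large-scale standardized language testing, ensuring that scores on parallel forms are equivalent and stable is one of the key indicators of score reliability, and also an important piece of evidence for the validity of score interpretation and use. After examining the standardized development procedures and equating techniques of the Chinese Proficiency Test (HSK), this paper focuses on a statistical analysis of the stability among scores on 8 parallel forms used in HSK administrations in China over the most recent two years. The results show that HSK parallel-form scores have high cross-sectional stability, and that scores on all test forms are diachronically consistent with the reference form.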

5.

A Comparative Study of Score Systems for Criterion-Referenced Tests in China and Abroad

甘良梅《考试研究》2006,(4)

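Starting from the score systems of several major criterion-referenced tests in China and abroad, this paper discusses their similarities and differences, to provide a reference for future criterion-referenced test score systems.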

6.

Exploring Factors That Affect Equating Under Item Response Theory

丁树良 熊建华 戴海琦《中国考试》2005,(1):25-26

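1 Introduction. Test equating systematically converts test scores across multiple test forms that assess the same psychological trait, so that scores on the different forms become comparable. Because item response theory (IRT) defines item difficulty and the psychological trait (ability) on the same scale, equating in IRT can also be regarded as systematically converting item parameters across multiple test forms that assess the same psychological trait, so that the item parameters of the different forms become comparable.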

7.

Cross-Year Comparison of Score Distributions When All Items Are Released After Testing: Linking Japan's National and Local Tests

石井秀宗《考试研究》2012,(5):3-11

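This study has three aims: (1) to propose a scheme for cross-year comparison of score distributions when all test items are released after testing, that is, a method that combines the designs of Japan's national and regional tests and applies the linking principles of test theory to compare score distributions across years; (2) to discuss the feasibility of the scheme, specifically the possibility of using test data, the form of regional cooperation, and the requirements on the examinee population; and (3) to verify it with real data, presenting a cross-year comparison of Japanese language test scores of third-year junior high school students in the 2006 and 2009 school years, which found that the score distribution of neither test had essentially changed.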

8.

Test Equating

一帆《教育测量与评价(理论版)》2015,(3):54

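Test equating is a measurement technique for converting test scores on different scales to a common scale. Specifically, it converts scores on multiple test forms that measure the same kind of knowledge or psychological trait to scores on a common scale, so that scores on the different forms become comparable. For example, suppose tests A, B, and C all measure English proficiency. If the same student performs equally well on all three and scores 60 on A, 65 on B, and 55 on C, then C is the hardest, A is next, and B is the easiest. To equate the three sets of scores, all can be converted to the score system of one of the tests. If they are converted to the score system of test A, then a 65 on B and a 55 on C both correspond to a 60 on A.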
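The A/B/C example in this abstract can be reproduced with a minimal sketch of linear equating, one standard way to place scores on a common scale. The abstract does not say which equating method it has in mind, so the method is an assumption. The means and standard deviations below are hypothetical values chosen so that the abstract's numbers line up, and the sketch assumes the three forms were taken by equivalent groups.

```python
# Hypothetical (mean, standard deviation) of each form's score distribution.
FORM_STATS = {
    "A": (60.0, 10.0),
    "B": (65.0, 10.0),  # easiest form: highest mean
    "C": (55.0, 10.0),  # hardest form: lowest mean
}


def linear_equate(score, from_mean, from_sd, to_mean, to_sd):
    """Map a score to the target scale by matching standardized deviates."""
    if from_sd <= 0 or to_sd <= 0:
        raise ValueError("standard deviations must be positive")
    z = (score - from_mean) / from_sd
    return to_mean + to_sd * z


def to_a_scale(form, score):
    """Convert a score on the given form to the score system of form A."""
    from_mean, from_sd = FORM_STATS[form]
    to_mean, to_sd = FORM_STATS["A"]
    return linear_equate(score, from_mean, from_sd, to_mean, to_sd)


for form, score in [("A", 60), ("B", 65), ("C", 55)]:
    print(form, score, "->", to_a_scale(form, score))
```

With these hypothetical statistics, a 65 on B and a 55 on C both map to 60 on the A scale, matching the abstract's example.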

9.

New Developments in the Concept of Test Validity

谢小庆《考试研究》2013,(3):56-64

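Before the Standards for Educational and Psychological Testing (5th edition) appeared in 1985, the core concept of validity research was the "criterion", and validity research was seen as a process of using a criterion to verify a test's validity and to give a valid interpretation of test scores. After 1985, the core concept became "evidence", and validity research was seen as a process of accumulating evidence to support a test's validity and to give a reasonable interpretation of test scores. This understanding of validity is most prominent in the Standards for Educational and Psychological Testing (6th edition), published in 1999. Educational Measurement, compiled under the joint sponsorship of the American Council on Education and the National Council on Measurement in Education, is known in the field as "the Bible of educational measurement". After the 4th edition of Educational Measurement was published in 2006, the core concept of validity research evolved into the "warrant", and validity research came to be seen as a process of constructing "systems of warrants" and "networks of warrants" to make an "argument" for validity and to give a plausible interpretation of test scores. Drawing on the author's own testing practice, this paper introduces these new developments in the concept of validity.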

10.

Methods for Augmenting Subscores Under Classical Test Theory

刘育明《教育测量与评价(理论版)》2021,(5):3-10

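In some educational and psychological assessments, besides reporting a total score, testing agencies usually also report two or more subscores to give examinees diagnostic information about their relative strengths in different content areas, so that examinees and teachers can target their learning and teaching. In educational and psychological tests, however, subtests usually contain few items, while correlations among subtest scores and between subtest scores and the total score are rather high, so subtest reliability is generally low. Measurement specialists have therefore proposed using information from the other subtest scores or the total score of the same test to improve subscore reliability; this is subscore augmentation. This paper introduces the basic concepts of augmented subscores proposed by Haberman on the basis of classical test theory, and uses simulated data and R functions to present the methods, including the proportional reduction in mean squared error for estimating subscores from observed subtest scores, total scores, and augmented subscores, subscore added value, and the standard error of measurement of augmented subscores.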

11.

Discrepancies Between Score Trends from NAEP and State Tests: A Scale-Invariant Perspective (cited by: 3; self-citations: 0; citations by others: 3)

Andrew D. Ho 《Educational Measurement》2007,26(4):11-20

State test score trends are widely interpreted as indicators of educational improvement. To validate these interpretations, state test score trends are often compared to trends on other tests such as the National Assessment of Educational Progress (NAEP). These comparisons raise serious technical and substantive concerns. Technically, the most commonly used trend statistics (for example, the change in the percent of proficient students) are misleading in the context of cross-test comparisons. Substantively, it may not be reasonable to expect that NAEP and state test score trends should be similar. This paper motivates then applies a "scale-invariant" framework for cross-test trend comparisons to compare "high-stakes" state test score trends from 2003 to 2005 to NAEP trends over the same period. Results show that state trends are significantly more positive than NAEP trends. The paper concludes with cautions against the positioning of trend discrepancies in a framework where only one trend is considered "true."

12.

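How Language Proficiency Testing Can Adapt to Developments in Language Teaching Methods (cited by: 1; self-citations: 0; citations by others: 1)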

谢小庆《考试研究》2010,(4):29-40

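At present, ideas about language teaching and the ways languages are taught are changing profoundly: from "knowledge transmission" to "ability development", from "teacher-led" to "student-directed", and from "class-based teaching" to "individualized teaching". To adapt to these changes, effort should go into developing new task-based language tests. This requires establishing evaluation standards for language testing, providing proficiency-level descriptions of what learners "can do", and interpreting test scores against those standards. In addition, measurement tools such as the rule space model, the unified model, and the fusion model should be used for cognitive diagnostic analysis of language tests, and on that basis descriptive, diagnostic score reports should be provided to learners, teachers, and parents.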

13.

Same-Form Retest Effects on Credentialing Examinations

Mark R. Raymond Sandra Neustel Dan Anderson 《Educational Measurement》2009,28(2):19-27

Examinees who take high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on their initial attempt. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign a different test form to repeat examinees. The use of multiple forms is expensive and can present psychometric challenges, particularly for low-volume credentialing programs; thus, it is important to determine if unwarranted score gains actually occur. Prior studies provide strong evidence that the same-form advantage is pronounced for aptitude tests. However, the sparse research within the context of achievement and credentialing testing suggests that the same-form advantage is minimal. For the present experiment, 541 examinees who failed a national certification test were randomly assigned to receive either the same test or a different (parallel) test on their second attempt. Although the same-form group had shorter response times on the second administration, score gains for the two groups were indistinguishable. We discuss factors that may limit the generalizability of these findings to other assessment contexts.

14.

Validity and participation: implications for school comparison of Australia’s National Assessment Program

Greg Thompson Lenore Adie Val Klenowski 《Journal of Education Policy》2018,33(6):759-777

The National Assessment Program – Literacy and Numeracy (NAPLAN) in Australia is a series of literacy and numeracy tests that are used for purposes of school comparison. This paper argues that a key question for this use lies in whether or not this is a reasonable, or valid, use of the test data. Using Kane’s argumentative approach to validity, this paper argues that the comparisons of the quality of student achievement made available on the My School Website have low validity due to the lack of regard to rates of participation in schools. In bringing together the literature that addresses the ‘new governance’ of education through testing and an approach to validity that addresses the technical aspects of test score interpretation, with the ethics of how test scores are used and applied, this study identifies validity as an important consideration in comparative analyses of student achievement data. The identification of the need to consider participation in such comparisons through the application of the argumentative approach to validity highlights the contribution of this article not only to the testing field but also to critical policy literature.

15.

On the validity of useless tests

Stephen G. Sireci 《Assessment in Education: Principles, Policy & Practice》2016,23(2):226-235

A misconception exists that validity may refer only to the interpretation of test scores and not to the uses of those scores. The development and evolution of validity theory illustrate test score interpretation was a primary focus in the earliest days of modern testing, and that validating interpretations derived from test scores remains essential today. However, test scores are not interpreted and then ignored; rather, their interpretations lead to actions. Thus, a modern definition of validity needs to describe the validation of test score interpretations as a necessary, but insufficient, step en route to validating the uses of test scores for their intended purposes. To ignore test use in defining validity is tantamount to defining validity for ‘useless’ tests. The current definition of validity stipulated in the 2014 version of the Standards for Educational and Psychological Testing properly describes validity in terms of both interpretations and uses, and provides a sufficient starting point for validation.

16.

NCME 2007 Presidential Address: The Concordance Table: An Invitation to Misuse Test Scores

Daniel R. Eignor 《Educational Measurement》2008,27(4):30-33

This article discusses a particular type of concordance table and the potential for test score misuse that may result from employing such a table. The concordance that is discussed is typically created between scores on different, nonequatable versions of a test that share the same or close to the same test title. These concordance tables often appear in the context of relating scores on computerized adaptive and paper-and-pencil versions of the same test. When such a table is presented in a complete point-by-point fashion, relating each reported score on the scale of the new version of the test to a reported score on the scale of the old version of the test, test score users will typically treat the table as if it represented an equating of scores between the two versions, and directly replace scores on the new version of the test by scores on the old version. This clearly represents a misuse of the test scores. Suggestions for avoiding this misuse of test scores from concordance tables are provided.

17.

The Scoring of Subjective Items in Mathematics Tests

杜文久《数学教育学报》2006,15(3):87-88

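In all kinds of tests, different raters often assign different scores to subjective items, which increases both the measurement error and its uncertainty. To overcome this defect, a new scoring method can be adopted under which different raters assign the same score even on subjective items. The main steps of the new method can be summarized as: identify the nodes of the subjective item, then, following the different scoring steps, classify the item into 6 scoring levels.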

18.

Evaluating Content Alignment in Computerized Adaptive Testing

Steven L. Wise G. Gage Kingsbury Norman L. Webb 《Educational Measurement》2015,34(4):41-48

The alignment between a test and the content domain it measures represents key evidence for the validation of test score inferences. Although procedures have been developed for evaluating the content alignment of linear tests, these procedures are not readily applicable to computerized adaptive tests (CATs), which require large item pools and do not use fixed test forms. This article describes the decisions made in the development of CATs that influence and might threaten content alignment. It outlines a process for evaluating alignment that is sensitive to these threats and gives an empirical example of the process.

19.

A Case Study of Software Testing Using Composite Metamorphic Relations (cited by: 1; self-citations: 0; citations by others: 1)

董国伟 徐宝文 陈林 聂长海 王璐璐《东南大学学报》2008,24(4)

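In metamorphic testing, metamorphic relations often turn out to have poor fault-detection capability. Based on the inference rules of propositional logic, this paper proposes a method for constructing composite metamorphic relations: relations already constructed are composed pairwise in sequence to yield a new metamorphic relation. A composite relation can combine the strengths of the original relations and has stronger fault-detection capability. Moreover, because composition reduces the number of relations, the number of test cases generated when testing a program with them drops substantially. The testing performance of composite metamorphic relations is studied through 2 case studies; the experimental results show that a composite relation's performance depends mainly on the core metamorphic relations it is built from and on the order of composition. Using composite metamorphic relations can greatly improve testing efficiency.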
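The idea of composing two metamorphic relations into one can be illustrated with a simplified sketch. It is not the paper's construction, which rests on propositional-logic inference rules. Here a relation is modeled, as an assumption, by a pair of functions: one transforms the source input, the other predicts the follow-up output from the source output. The class, the helper names, the sine example, and the injected fault are all hypothetical.

```python
import math
from typing import Callable, NamedTuple


class MetamorphicRelation(NamedTuple):
    # program(transform_input(x)) should equal expected_output(program(x)).
    transform_input: Callable[[float], float]
    expected_output: Callable[[float], float]


def compose(first: MetamorphicRelation, second: MetamorphicRelation) -> MetamorphicRelation:
    """Chain two relations: apply `first`, then `second`, as one relation."""
    return MetamorphicRelation(
        transform_input=lambda x: second.transform_input(first.transform_input(x)),
        expected_output=lambda y: second.expected_output(first.expected_output(y)),
    )


def holds(program, relation, x, tol=1e-9):
    """Run one source and one follow-up test case and check the relation."""
    source_output = program(x)
    follow_up_output = program(relation.transform_input(x))
    return math.isclose(follow_up_output, relation.expected_output(source_output), abs_tol=tol)


# Two known relations for sine: sin(pi - x) = sin(x) and sin(-x) = -sin(x).
mr_reflect = MetamorphicRelation(lambda x: math.pi - x, lambda y: y)
mr_odd = MetamorphicRelation(lambda x: -x, lambda y: -y)
# Their composition states sin(x - pi) = -sin(x).
mr_composite = compose(mr_reflect, mr_odd)


def buggy_sin(x):
    """Hypothetical faulty implementation with an injected linear error term."""
    return math.sin(x) + 0.01 * x


print("correct sin satisfies composite:", holds(math.sin, mr_composite, 1.0))
print("fault missed by odd-symmetry relation:", holds(buggy_sin, mr_odd, 1.0))
print("fault caught by composite relation:", not holds(buggy_sin, mr_composite, 1.0))
```

In this sketch the composite relation needs only one follow-up test case per source input, where checking the two original relations separately would need two. That is one way to read the abstract's point about fewer generated test cases.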