Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
This article presents a model of assessment development integrating student characteristics with the conceptualization, design, and implementation of standardized achievement tests. The model extends the assessment triangle proposed by the National Research Council (Pellegrino, Chudowsky, & Glaser, 2001) to consider the needs of students with disabilities and English learners on two dimensions: cognitive interaction and observation interaction. Specific steps in the test development cycle for including students with special needs are proposed following the guidelines provided by Downing (2006). Because this model of test development considers the range of student needs before test development commences, student characteristics are supported by applying the principles of universal design and appropriately aligning accommodations to address student needs. Specific guidelines for test development are presented.

2.
School self-evaluation (SSE) often makes use of questionnaires in order to sketch a picture of the school. How respondents cognitively process questionnaire items determines the validity of SSE results. Still, one readily assumes that respondents interpret and answer items as intended by the instrument developer (referred to as cognitive validity), but it remains unclear whether they do. This study tested an exemplary SSE instrument by focusing on the extent to which SSE results are cognitively valid, and on the extent to which differences in cognitive validity can be attributed to respondents and/or items. Cognitive interviews with 20 participants made respondents’ answering processes manifest. Results show that, overall, fewer than 50% of respondents’ processes of interpreting and elaborating on items are cognitively valid. Cross-classified multilevel analyses indicate that various hierarchical levels, respondents and items, are significant in explaining differences in cognitive validity, but not for all stages of the answering process.

3.
Test Design with Cognition in Mind (cited 2 times: 0 self-citations, 2 by others)
One of the primary themes of the National Research Council's 2001 book Knowing What Students Know was the importance of cognition as a component of assessment design and measurement theory (NRC, 2001). One reaction to the book has been an increased use of sophisticated statistical methods to model cognitive information available in test data. However, the application of these cognitive-psychometric methods is fruitless if the tests to which they are applied lack a formal cognitive structure. If assessments are to provide meaningful information about student ability, then cognition must be incorporated into the test development process much earlier than in data analysis. This paper reviews recent advancements in cognitively-based test development and validation, and suggests various ways practitioners can incorporate similar methods into their own work.

4.
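Test fairness has been defined in many different ways; for test developers, a definition grounded in validity is the most useful. To develop a fair test, a fairness review of the items is an indispensable step. To make the review process less subjective, reviewers should follow a set of review principles. In addition, to better resolve problems that arise during fairness review, standardized review procedures should be established. ETS has accumulated extensive experience in reviewing tests for fairness, and much of it is worth drawing on.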

5.
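In cognitive diagnosis models (CDMs), differential item functioning (DIF) is tested in order to safeguard the quality and effectiveness of a test. Building on previous research, this study examines how six DIF detection methods (MH, LR, CSIBTEST, WObs, WSw, and WXPD) perform within the CDM framework under conditions that vary whether the Q-matrix is correctly specified, along with other factors that influence DIF. The results show that when the Q-matrix is correctly specified, the WObs, WSw, and WXPD statistics outperform the MH, LR, and CSIBTEST methods; when the Q-matrix is misspecified, all six methods show inflated Type I error rates and low statistical power. Comparatively, MH, LR, and CSIBTEST perform more stably and the WObs, WSw, and WXPD statistics vary more, yet the Type I error rates and power of WObs, WSw, and WXPD remain better than those of MH, LR, and CSIBTEST.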

6.
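Difficulty is not an inherent property of a test item but the result of interaction between examinee factors and item characteristics. Many item analysts tend to attribute high item difficulty solely to students not having mastered the relevant knowledge or skills, overlooking the characteristics of the item itself. This study analyzes 60 English items from the National College Entrance Examination with difficulty values below 0.6 to explore the sources of their difficulty. The results show that, besides examinee factors, the difficulty of hard or moderately hard items is also related to item-writing technique, including problems with the uniqueness and acceptability of answers, content that goes beyond the syllabus, and poorly chosen testing points or scoring criteria. The authors therefore propose that testing agencies improve their item writing and strengthen item quality monitoring so that large-scale examinations select candidates on a sound scientific basis.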
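
As a rough illustration of the difficulty threshold mentioned in the abstract above, the sketch below computes a classical difficulty index and flags items that fall below 0.6. It assumes that "difficulty" means the proportion of examinees answering an item correctly, which the abstract does not state explicitly; the function names and the response data are hypothetical.

```python
from typing import List


def item_difficulty(responses: List[List[int]]) -> List[float]:
    """Return the proportion of examinees who answered each item correctly.

    responses[i][j] is 1 if examinee i answered item j correctly, else 0.
    A lower value means a harder item under this (assumed) definition.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees
            for j in range(n_items)]


def flag_hard_items(p_values: List[float], threshold: float = 0.6) -> List[int]:
    """Return the indices of items whose difficulty index is below threshold."""
    return [j for j, p in enumerate(p_values) if p < threshold]


# Hypothetical scored responses: 5 examinees (rows) by 3 items (columns).
scores = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]

p_values = item_difficulty(scores)
hard_items = flag_hard_items(p_values)
print(p_values, hard_items)
```

A flagged index only says that few examinees answered correctly; as the abstract argues, deciding whether the cause lies with the examinees or with the item itself still requires a content review.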

7.
    
Contamination of responses due to extreme and midpoint response style can confound the interpretation of scores, threatening the validity of inferences made from survey responses. This study incorporated person-level covariates in the multidimensional item response tree model to explain heterogeneity in response style. We include an empirical example and two simulation studies to support the use and interpretation of the model: parameter recovery using Markov chain Monte Carlo (MCMC) estimation and performance of the model under conditions with and without response styles present. Mean bias and root mean square error for the item intercepts were small at all sample sizes. Mean bias and root mean square error for the item discriminations were also small but tended to be smaller when covariates were unrelated to, or had a weak relationship with, the latent traits. Item and regression parameters are estimated with sufficient accuracy when sample sizes are greater than approximately 1,000 and MCMC estimation with the Gibbs sampler is used. The empirical example uses the National Longitudinal Study of Adolescent to Adult Health’s sexual knowledge scale. Meaningful predictors associated with high levels of the extreme response latent trait included being non-White, being male, and having high levels of parental support and relationships. Meaningful predictors associated with high levels of the midpoint response latent trait included having low levels of parental support and relationships. Item-level covariates indicate the response style pseudo-items were less easy to endorse for self-oriented items, whereas the trait of interest pseudo-items were easier to endorse for self-oriented items.

8.
    
There are numerous statistical procedures for detecting items that function differently across subgroups of examinees that take a test or survey. However, in endeavouring to detect items that may function differentially, selection of the statistical method is only one of many important decisions. In this article, we discuss the important decisions that affect investigations of differential item functioning (DIF) such as choice of method, sample size, effect size criteria, conditioning variable, purification, DIF amplification, DIF cancellation, and research designs for evaluating DIF. Our review highlights the necessity of matching the DIF procedure to the nature of the data analysed, the need to include effect size criteria, the need to consider the direction and balance of items flagged for DIF, and the need to use replication to reduce Type I errors whenever possible. Directions for future research and practice in using DIF to enhance the validity of test scores are provided.
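
To make the idea of a DIF effect size concrete, the sketch below computes the Mantel-Haenszel common odds ratio for one item and converts it to the delta scale used for MH D-DIF. The abstract above does not prescribe this procedure; it is shown only as one widely used option, and the function names and counts are hypothetical.

```python
import math
from typing import List, Tuple

# One 2x2 table per stratum of the conditioning (matching) variable:
# (reference correct, reference incorrect, focal correct, focal incorrect)
Stratum = Tuple[int, int, int, int]


def mh_odds_ratio(strata: List[Stratum]) -> float:
    """Mantel-Haenszel common odds ratio pooled across strata.

    Values above 1 indicate that, among matched examinees, the item
    favors the reference group. Raises ZeroDivisionError if the pooled
    denominator is zero.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def mh_d_dif(alpha: float) -> float:
    """Convert the odds ratio to the delta metric: -2.35 * ln(alpha)."""
    return -2.35 * math.log(alpha)


# Hypothetical counts for two score strata.
example = [(20, 10, 10, 10), (30, 10, 15, 5)]
alpha = mh_odds_ratio(example)
print(alpha, mh_d_dif(alpha))
```

Reporting the delta value alongside a significance test follows the abstract's point that effect size criteria, not the test statistic alone, should drive decisions about flagged items.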

9.
The purpose of this paper is to define and evaluate the categories of cognitive models underlying at least three types of educational tests. We argue that while all educational tests may be based, explicitly or implicitly, on a cognitive model, the categories of cognitive models underlying tests often range in their development and in the psychological evidence gathered to support their value. For researchers and practitioners, awareness of different cognitive models may facilitate the evaluation of educational measures for the purpose of generating diagnostic inferences, especially about examinees' thinking processes, including misconceptions, strengths, and/or abilities. We think a discussion of the types of cognitive models underlying educational measures is useful not only for taxonomic ends, but also for becoming increasingly aware of evidentiary claims in educational assessment and for promoting the explicit identification of cognitive models in test development. We begin our discussion by defining the term cognitive model in educational measurement. Next, we review and evaluate three categories of cognitive models that have been identified for educational testing purposes using examples from the literature. Finally, we highlight the practical implications of blending models for the purpose of improving educational measures.

10.
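Facet Theory: An Effective Strategy for Test Item Construction and Equating (cited 2 times: 0 self-citations, 2 by others)
Back-translation focuses on "linguistic equivalence," while item response theory emphasizes "equivalence of statistical indices." Item equivalence under facet theory emphasizes that items share the same measurement target; that is, equivalent items should elicit the same response from examinees under the same conditions. Facet theory uses the mapping sentence technique to define each item's measurement target clearly, which makes item equating and item construction more rigorous. Items constructed with facet theory have a clearer dimensional structure, and the construct validity of the test is better assured. Combining facet theory with other psychometric methods can effectively improve the quality of test item construction and equating.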

11.
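This study examines, from unidimensional and multidimensional perspectives, differential item functioning (DIF) in items translated from Chinese into English in a test of children's cognitive development from the International Association for the Evaluation of Educational Achievement (IEA). The data analyzed consist of test responses from 871 Chinese children and 557 American children. The results show that more than half of the items exhibit substantial DIF, meaning that the test is not functionally equivalent for Chinese and American children. Users should be cautious about using results from this translated, cross-language test to compare the cognitive ability levels of examinees in the two countries. Fortunately, about half of the DIF items favor Chinese children and half favor American children, so a scale built from total test scores should not be badly biased. In addition, item fit statistics did not adequately detect the items with DIF, so dedicated DIF analyses are still needed. We discuss three possible causes of DIF; more subject-matter expertise and experiments are needed to truly explain how DIF arises.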

12.
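Developing quality rating standards for kindergarten education is of considerable theoretical and practical importance. The China National Society of Early Childhood Education (中国学前教育研究会) assembled an interdisciplinary research team of experts from six Chinese universities and, following rigorous psychometric development procedures over two years of research, produced Toward High Quality: Quality Rating Standards for Kindergarten Education in China (hereafter the Quality Standards). To validate the standards, the project team used stratified sampling to select 100 urban and rural kindergartens of differing ownership types and grade levels in five provincial-level regions in different parts of China and rated their quality with the Quality Standards. At the same time, 1,670 children (half boys, half girls) were randomly drawn from the 300 sampled classes and assessed on language, mathematical cognition, and social-emotional development. Using the kindergarten quality data and the child development data, the team analyzed the reliability and validity of the Quality Standards. The results show that internal consistency is high for the Quality Standards as a whole and for each of its domains; the great majority of items discriminate well; the Quality Standards have good construct validity, with two latent quality factors, namely curriculum, teaching, and learning environment, and management support and staffing; and kindergarten quality ratings are, overall, significantly correlated with child development outcomes, with the correlations varying across quality domains and developmental domains. Taken together, this evidence indicates that the Quality Standards are a valid assessment instrument suited to the Chinese kindergarten context and that their rating results are reliable and trustworthy.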

13.
In this paper I describe and illustrate the Roussos-Stout (1996) multidimensionality-based DIF analysis paradigm, with emphasis on its implication for the selection of a matching and studied subtest for DIF analyses. Standard DIF practice encourages an exploratory search for matching subtest items based on purely statistical criteria, such as a failure to display DIF. By contrast, the multidimensional DIF paradigm emphasizes a substantively-informed selection of items for both the matching and studied subtest based on the dimensions suspected of underlying the test data. Using two examples, I demonstrate that these two approaches lead to different interpretations about the occurrence of DIF in a test. It is argued that selecting a matching and studied subtest, as identified using the DIF analysis paradigm, can lead to a more informed understanding of why DIF occurs.

14.
This article examines whether the way that PISA models item outcomes in mathematics affects the validity of its country rankings. As an alternative to the PISA methodology, a two-parameter model is applied to PISA mathematics item data from Canada and Finland for the year 2012. In the estimation procedure, item difficulty and dispersion parameters are allowed to differ across the two countries, and samples are restricted to respondents who actually answered items in a mathematics cluster. Different normalizations for identifying the distribution parameters are also considered. The choice of normalization is shown to be crucial in guaranteeing certain invariance properties required by item response models. The ability scores obtained from the methods employed here are significantly higher for Finland, in sharp contrast to PISA results, which gave both countries very similar ranks in mathematics.
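
The sketch below illustrates why a two-parameter model needs a normalization at all. It assumes the common logistic form with a discrimination parameter `a` and a difficulty parameter `b`; the article's own parameterization (it speaks of "dispersion" parameters) may differ, and all names and values here are hypothetical.

```python
import math


def two_pl_probability(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under a logistic two-parameter model.

    theta: examinee ability; a: item discrimination; b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def rescale(theta: float, a: float, b: float, scale: float, shift: float):
    """Apply a linear change of the ability metric to theta, a, and b."""
    return scale * theta + shift, a / scale, scale * b + shift


# Hypothetical ability and item parameters.
theta, a, b = 0.5, 1.2, -0.3

p_original = two_pl_probability(theta, a, b)
p_rescaled = two_pl_probability(*rescale(theta, a, b, scale=2.0, shift=1.0))

# The two probabilities agree: the data cannot distinguish the two metrics,
# so the origin and unit of the ability scale must be fixed by a normalization.
print(p_original, p_rescaled)
```

Because any linear rescaling of the ability metric leaves the response probabilities unchanged, comparisons of ability scores across countries depend on how that origin and unit are fixed, which is the issue the abstract raises.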

15.
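Objective: To analyze and explore the reliability and validity of the Self-Concept Clarity Scale (SCC). Methods: The SCC was administered to 367 high school students and readministered after a one-month interval; the resulting data were examined with correlation analysis and factor analysis. Results: The internal consistency coefficient of the SCC was 0.80 and its test-retest reliability was 0.83; SCC scores correlated significantly and negatively with the CES-D (Center for Epidemiologic Studies Depression Scale), the MASC (an adolescent anxiety symptom questionnaire), and the FFI-N (the Neuroticism subscale of the Five-Factor Inventory); confirmatory factor analysis (CFA) showed that a two-factor model fit this sample well. Conclusion: The SCC is a sound assessment instrument.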
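
The abstract above reports an internal consistency coefficient without naming the statistic. Assuming it is Cronbach's alpha, a minimal sketch of the calculation looks like this; the function name and the response matrix are hypothetical.

```python
from statistics import variance
from typing import List


def cronbach_alpha(scores: List[List[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores)),
    where k is the number of items. Sample variances are used throughout.
    """
    k = len(scores[0])
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical 5-point ratings: 4 respondents (rows) by 3 items (columns).
ratings = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [3, 3, 4],
]
print(cronbach_alpha(ratings))
```

The test-retest figure in the abstract is a different quantity: it would be the correlation between the same respondents' total scores at the two administrations, not a function of the item variances.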

16.
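Building an Examination Item Bank (cited 1 time: 0 self-citations, 1 by others)
Item bank construction is an important task in developing large-scale, regularly administered examinations. This paper discusses how to build a high-quality item bank and covers each key stage in detail, including organizing the item-writing effort, writing items, reviewing items, pilot testing and analysis, and entering items into the bank. The discussion is highly practical.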

17.
18.
    
The cloze test, introduced by W. L. Taylor in 1953, is a quick, economical method of measuring overall language proficiency. It is considered an indispensable part of the College Entrance Examination. It is therefore necessary to study how to better design cloze tests and how to assess their validity in light of the readability of the texts for high school readers.

19.
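Academic achievement assessment is currently one of the focal topics in research on the new curriculum reform. How to design and develop test items scientifically matters greatly for deepening the reform and for monitoring the quality of basic education. PISA is an authoritative international student assessment program with high comparability, reliability, and validity. The PISA 2006 science assessment framework comprises four interrelated aspects: contexts, knowledge, attitudes, and competencies. Its item design and development used a "double-digit coding" scoring design and added attitude items, ensuring alignment between the items and the standards.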

20.
Cited 2 times: 0 self-citations, 2 by others


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号