A process, developed by the author, for judging the alignment between curriculum standards and assessments is presented. This process produces information on the relationship between standards and assessments on four alignment criteria: Categorical Concurrence, Depth of Knowledge Consistency, Range of Knowledge Correspondence, and Balance of Representation. Five issues that have arisen from conducting alignment studies are identified but not resolved. All of these issues relate to deciding what level of alignment is good enough. Pragmatic decisions have been made to specify acceptable levels for each of the alignment criteria, and the assumptions behind those decisions are described. The issues discussed arise from a change in the underlying assumptions and from considering variations in the purpose of an assessment. The existence of such issues reinforces that alignment judgments have an element of subjectivity.
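As a rough illustration of how one of the four criteria can be quantified, the sketch below computes a balance-of-representation index for a single standard. The abstract itself gives no formula; the code assumes the commonly cited form of the index, 1 − (Σ|1/O − I_k/H|)/2, where O is the number of objectives hit, I_k the number of hits on objective k, and H the total number of hits. The function name and input layout are hypothetical.

```python
from collections import Counter


def balance_index(item_objectives):
    """Balance-of-representation index for one standard (assumed formula).

    item_objectives: one objective label per item-to-objective hit,
    e.g. ["obj1", "obj1", "obj2"].
    """
    hits = Counter(item_objectives)
    total_hits = sum(hits.values())   # H: all hits under this standard
    n_objectives = len(hits)          # O: objectives hit at least once
    if total_hits == 0:
        raise ValueError("no hits recorded for this standard")
    # Sum of absolute gaps between an even share (1/O) and the observed share.
    deviation = sum(
        abs(1 / n_objectives - count / total_hits) for count in hits.values()
    )
    return 1 - deviation / 2


if __name__ == "__main__":
    print(balance_index(["a", "b", "c", "d"]))   # hits spread evenly
    print(balance_index(["a", "a", "a", "b"]))   # hits concentrated on "a"
```

An index of 1 means hits are spread evenly over the objectives that were hit; lower values mean the items cluster on a few objectives. Where the acceptable level is set is one of the pragmatic decisions the abstract refers to.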

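The Webb alignment procedure is an important tool for judging the alignment between assessments and curriculum standards. It has four criteria: categorical concurrence, depth-of-knowledge consistency, range-of-knowledge correspondence, and balance of representation, each with multiple acceptable levels. The challenge in alignment research lies in judging "what level of alignment is best"; although this judgment can draw on numerous empirical studies, it also involves a degree of subjectivity.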

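Against the backdrop of educational reform, demand for linking tests and assessments to language proficiency standards has increased markedly, both in monitoring students' progress in language ability and in the use of standardized tests. Linking test content and scores to the different levels of a language proficiency standard can make test scores more interpretable across educational contexts. Through an analysis of research on linking language assessments to the CEFR, this article discusses several important issues in linking tests to language proficiency frameworks and offers recommendations for linking tests to China's Standards of English Language Ability (《中国英语能力等级量表》).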

This article examines the role of reviewer agreement in judgments about alignment between tests and standards. We used case data from three state alignment studies to explore how different approaches to incorporating reviewer agreement change alignment conclusions. The three case studies showed varying degrees of reviewer agreement about correspondences between objectives and test items. Moreover, taking reviewer agreement into account in the analyses sometimes had a marked effect on alignment conclusions. We discuss reasons for differences across case studies and alignment approaches, as well as implications for future alignment efforts.

This article explores the challenge of setting performance standards in a non-Western context. The study is centered on standard-setting practice in the national learning assessments of Trinidad and Tobago. Quantitative and qualitative data from annual evaluations between 2005 and 2009 were compiled, analyzed, and deconstructed. In the mixed methods research design, data were integrated under an evaluation framework for validating performance standards. The quantitative data included panelists’ judgments across standard-setting rounds and methods. The qualitative data included both retrospective comments from open-ended surveys and real-time data from reflective diaries. Findings for procedural and internal validity were mixed, but the evidence for external validity suggested that the final outcomes were reasonable and defensible. Nevertheless, the real-time qualitative data from the reflective diaries highlighted several cognitive challenges experienced by panelists that may have impinged on procedural and internal validity. Additional unique hindrances were lack of resources and wide variation in achievement scores. Ensuring a sustainable system of performance standards requires attention to these deficits.

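The development of curriculum standards in the United States has gone through a process of historical development and evolution. The development process relies mainly on the knowledge and experience of subject-matter experts and on relevant research progress and findings, and the standards are revised repeatedly in response to feedback from other expert panels, interest groups, and the public until their final release. These curriculum standards are oriented toward learning outcomes, are organized in a spiral progression across grades around core content areas, specify in detail the cognitive performance expected at each grade, and use assessment frameworks, alignment analysis, and standard setting to ensure that academic achievement tests correspond to the requirements of the standards. This offers important lessons for the corresponding work in China.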

The success of standards-based education systems depends on two elements: strong standards, and assessments that measure what the standards expect. States that have or adopt test-based accountability programs claim that their tests are aligned to their standards. But until now there has been no independent methodology for checking alignment. This article describes and illustrates such a methodology and reports results on a sample of state tests. In general, although individual items align quite well with some standard, the tests as a whole are not well aligned. With few exceptions, the collections of items that make up the tests that we examined do not do a good job of assessing the full range of standards and objectives that states have laid out for their students. This misalignment can have serious consequences for instruction and for the validity of test results.

This article summarizes the conclusions that the authors have drawn about the measurement quality and potential for linkage with teacher pay of three sets of teacher assessments: those developed or being developed by the Interstate New Teacher Assessment and Support Consortium, the Educational Testing Service, and the National Board for Professional Teaching Standards. To investigate the feasibility of using these assessments as a framework for a knowledge- and skill-based pay system, the Consortium for Policy Research in Education commissioned a set of papers for a conference in September 1997 on the measurement issues involved in assessing teaching practice to standards and linking these assessments to pay for knowledge and skills. The resulting papers, revised and published as articles in this journal, show that this approach is promising but that in some cases additional research on the measurement quality of the assessments is needed.

Like educators in mathematics, science, and reading, history educators around the world have mobilized curricular reform movements toward including complex thinking in history education, advancing historical thinking, developing historical consciousness, and teaching competence in historical sense making. These reform movements, including the Common Core Standards, are beginning to include historical thinking. Despite these developments, inclusion of historical thinking in assessments has been slow: The great majority of history assessments, both large-scale and classroom-based, still focus on fragmented pieces of information. In this article, we discuss the challenges in assessment of historical thinking, describe how these issues were dealt with in a 1-hr test of students' ability to reason about “enemy aliens” in Canada during World War I, and make recommendations for future assessments.

What problems arise in translating a test into other languages? How can performance be compared for students who take different language versions of a test? What designs can be used for linking studies?

The use of alternative assessments has led many researchers to reexamine traditional views of test qualities, especially validity. Because alternative assessments generally aim at measuring complex constructs and employ rich assessment tasks, it becomes more difficult to demonstrate (a) the validity of the inferences we make and (b) that these inferences extrapolate to target domains beyond the assessment itself. An approach to addressing these issues from the perspective of language testing is described. It is then argued that in both language testing and educational assessment we must consider the roles of both language and content knowledge, and that our approach to the design and development of performance assessments must be both construct-based and task-based.

Many states are implementing direct writing assessments to assess student achievement. While much literature has investigated minimizing raters' effects on writing scores, little attention has been given to the type of model used to prepare raters to score direct writing assessments. This study reports on an investigation that occurred in a state-mandated writing program when a scoring anomaly became apparent once assessments were put into operation. The study indicates that using a spiral model for training raters and scoring papers results in higher mean ratings than does using a sequential model for training and scoring. Findings suggest that making decisions about cut-scores based on pilot data has important implications for program implementation.

Being proficient in mathematics involves having rich and connected mathematical knowledge, being a strategic and reflective thinker and problem solver, and having productive mathematical beliefs and dispositions. This broad set of mathematics goals is central to the Common Core State Standards for Mathematics.

High-stakes testing often drives instructional practice. In this article, I discuss test specifications and sample assessment items from the two major national testing consortia and the prospects that their assessments will be positive levers for change.

For more than 20 years, the Mathematics Assessment Project has focused on the development of assessments that emphasize productive mathematical practices, most recently creating formative assessment lessons (FALs) designed to help teachers build up student understandings through focusing on student thinking while engaging in rich mathematical tasks. This article describes our recent work.

Since last year's first full run of standard assessment tasks at Key Stage 1 was reviewed in the September 1991 issue of the journal, some improvements have been made in the assessment arrangements for pupils with special educational needs, but other problems remain. David Bartlett, assessment coordinator, Birmingham Education Department, who wrote last year's review, and Nick Peacey, SENJIT coordinator, report the findings from a questionnaire survey and a series of conferences. They make proposals for changes.

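As an important means of evaluating educational quality, large-scale educational assessments often use multiple-booklet designs. Such designs typically adopt incomplete matrix sampling with common items, and the common items can be arranged in two ways: as common anchors or as cyclic (rotating) anchors. A common-anchor multiple-booklet design must consider the proportion of common items, their content structure, their statistical characteristics, and their position within the booklets. A cyclic-anchor multiple-booklet design, that is, a balanced incomplete matrix design, usually assembles booklets from item blocks and must consider the number of blocks, the internal structure of the blocks, and the ordering of the blocks. Processing test data from multiple-booklet designs involves scale score estimation under item response theory models, scaling methods, and equating techniques. Exploring these issues can provide guidance and suggestions for the design of educational tests.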
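The cyclic-anchor (balanced incomplete) arrangement described above can be sketched with a small, well-known combinatorial example. The code below is an illustration only, not the design used in any particular assessment: it assumes 7 item blocks, 3 blocks per booklet, and the offsets (0, 1, 3), which form a difference set modulo 7, so that every pair of blocks shares exactly one booklet. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cyclic_booklets(n_blocks=7, offsets=(0, 1, 3)):
    """Build one booklet per starting block by rotating the offsets."""
    return [
        tuple((start + off) % n_blocks for off in offsets)
        for start in range(n_blocks)
    ]


def pair_counts(booklets):
    """Count how many booklets contain each unordered pair of blocks."""
    counts = Counter()
    for booklet in booklets:
        for pair in combinations(sorted(booklet), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    booklets = cyclic_booklets()
    for number, blocks in enumerate(booklets, start=1):
        print(f"booklet {number}: blocks {blocks}")
    # In a balanced design every pair of blocks co-occurs equally often.
    print(set(pair_counts(booklets).values()))
```

Because the booklets are generated by rotation, each block also appears once in each position within a booklet, which is one simple way to address the block-ordering concern the abstract raises.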

Setting motor performance standards has long been a process of interest to physical educators. Theoretical advances in the measurement technology appropriate for standard-setting, however, have occurred only in the last decade. The first portion of this paper is devoted to a discussion of issues in setting standards and a brief review of procedures for standard-setting. In the latter section, gender differences in motor performance are examined and the impact of these differences on standard-setting is considered.

For educational technology integration in content disciplines to succeed, teachers and teacher educators need clear standards delineating why, how, where, and how much educational technology they should include in their teaching. This paper examines the visions offered by current science, mathematics, and educational technology standards for educational technology integration in K-12 schools. Since national assessments exert a profound influence on what teachers and students choose to teach and learn, the vision of educational technology use supported by national assessments is also examined. The National Council of Teachers of Mathematics Standards (NCTM, 2000. Principles and Standards for School Mathematics. Retrieved April 6, 2002 from http://standards.nctm.org), the National Science Education Standards (National Research Council (NRC) 1996. National Science Education Standards. Available at http://books.nap.edu/catalog/4962.html), and the National Educational Technology Standards (International Society for Technology in Education (ISTE) 2000. National Educational Technology Standards for Students: Connecting Curriculum and Technology, ISTE, Eugene, Oregon) provide different visions of educational technology use in the classroom. In addition, the current technology use policies for national assessments in science and mathematics, in particular the college admission tests (ACT, SAT I and SAT II subject area tests), Advanced Placement (AP) course assessments, and the Praxis Series assessments indicate that while mathematics assessments often recommend or require the use of educational technology, few science assessments permit the use of educational technology by students. Recommendations are offered for science educators regarding teacher preparation for the technology-rich classrooms of the future.

In large-scale assessments, such as state-wide testing programs, national sample-based assessments, and international comparative studies, there are many steps involved in the measurement and reporting of student achievement. Each of these steps is a potential source of inaccuracy. It is of interest to identify the source and magnitude of the errors in the measurement process that may threaten the validity of the final results. Assessment designers can then improve the assessment quality by focusing on the areas that pose the greatest threats to the results. This paper discusses the relative magnitudes of three main sources of error with reference to the objectives of assessment programs: measurement error, sampling error, and equating error. A number of examples from large-scale assessments are used to illustrate these errors and their impact on the results. The paper concludes with a number of recommendations that could improve the accuracy of large-scale assessment results.
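One way to compare the relative magnitudes of the three error sources named above is to express each as a standard error and look at its share of the total error variance. The sketch below is a simplified illustration, not the paper's procedure: it assumes the three components are independent, so their variances add, and the function names and example values are hypothetical.

```python
import math


def combined_standard_error(measurement_se, sampling_se, equating_se):
    """Total standard error, assuming the three components are independent."""
    return math.sqrt(measurement_se**2 + sampling_se**2 + equating_se**2)


def variance_shares(measurement_se, sampling_se, equating_se):
    """Fraction of total error variance attributable to each source."""
    variances = {
        "measurement": measurement_se**2,
        "sampling": sampling_se**2,
        "equating": equating_se**2,
    }
    total = sum(variances.values())
    if total == 0:
        raise ValueError("all standard errors are zero")
    return {source: value / total for source, value in variances.items()}


if __name__ == "__main__":
    # Hypothetical standard errors on a reporting scale.
    print(combined_standard_error(1.5, 3.0, 2.0))
    print(variance_shares(1.5, 3.0, 2.0))
```

Because the components combine in quadrature under this assumption, the largest one dominates the total, which is why reducing the biggest source usually pays off most.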

The development of alternate assessments for students with disabilities plays a pivotal role in state and national accountability systems. An important assumption in the use of alternate assessments in these accountability systems is that scores are comparable on different test forms across diverse groups of students over time. The use of test equating is a common way that states attempt to establish score comparability on different test forms. However, equating presents many unique, practical, and technical challenges for alternate assessments. This article provides case studies of equating for two alternate assessments in Michigan and an approach to determine whether or not equating would be preferred to not equating on these assessments. This approach is based on examining equated score and performance-level differences and investigating population invariance across subgroups of students with disabilities. Results suggest that using an equating method with these data appeared to have a minimal impact on proficiency classifications. The population invariance assumption was suspect for some subgroups and equating methods, with some large potential differences observed.
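To make the idea of equating concrete, the sketch below shows linear equating, one common textbook method; the abstract does not say which methods the Michigan case studies used, so this is an assumption for illustration. It maps a score on form X to the form Y scale by matching means and standard deviations, and it presumes the two forms were taken by randomly equivalent groups. The names and score lists are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, form_x_scores, form_y_scores):
    """Place a form X score on the form Y scale by matching mean and SD."""
    mu_x, mu_y = mean(form_x_scores), mean(form_y_scores)
    sd_x, sd_y = pstdev(form_x_scores), pstdev(form_y_scores)
    if sd_x == 0:
        raise ValueError("form X scores have no variance")
    return mu_y + (sd_y / sd_x) * (x - mu_x)


if __name__ == "__main__":
    # Hypothetical raw scores from two forms.
    form_x = [10, 20, 30]
    form_y = [15, 25, 35]
    print(linear_equate(20, form_x, form_y))
```

A rough check on population invariance, in the spirit of the approach described above, would be to estimate this function separately within each subgroup and compare the resulting equated scores and proficiency classifications.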

