Similar Articles
20 similar articles found.
1.
This article examines three typical approaches to alternate assessment for students with significant cognitive disabilities—portfolios, performance assessments, and rating scales. A detailed analysis of common and unique design features of these approaches is provided, including features of each approach that influence the psychometric quality of their results. Validity imperatives for alternate assessments are reviewed, and approaches for addressing the need for validity evidence are outlined. The article concludes with an examination of three technical challenges—alignment, scores and scoring, and standard setting—common to all alternate assessments. In light of these challenges, existing methods and professional testing standards are endorsed as necessary guidance for understanding and advancing alternate assessment practices.

2.
Multiple scoring is widely used in large-scale assessments. The use of a single response for making multiple inferences as is done in multiple scoring has implications on the validity of these inferences and interpretations based on assessment results. The purpose of this article is to review two types of multiple scoring practices and discuss how multiple scoring affects inferences.

3.
The present article is concerned with how Swedish students in the last year of comprehensive school make sense and use of educational assessments of their school performance. Based on interviews with 28 students in Year 9, “talk about assessments” is analysed using a discourse analytical approach inspired by Michel Foucault’s theories. The study shows that students find it difficult to both understand and make use of the assessments given by their teachers due to the overly complicated language. To receive the grades they need to apply for upper-secondary school, the students use other strategies to stand out and be perceived as “good students,” thus the assessment discourse is effective in the construction of adjustable subjects.

4.
This article presents findings from two projects designed to improve evaluations of technical quality of alternate assessments for students with the most significant cognitive disabilities. We argue that assessment technical documents should allow for the evaluation of the construct validity of the alternate assessments following the traditions of Cronbach (1971), Messick (1989, 1995), Linn, Baker, and Dunbar (1991), and Shepard (1993). The projects used the work of Knowing What Students Know (Pellegrino, Chudowsky, & Glaser, 2001) to structure and focus the collection and evaluation of assessment information. The heuristic of the assessment triangle (Pellegrino et al., 2001) was particularly useful in emphasizing that the validity evaluation needs to consider the logical connections among the characteristics of the students tested and how they develop domain proficiency (the cognition vertex), the nature of the assessment (the observation vertex), and the ways in which the assessment results are interpreted (the interpretation vertex). This project has shown that in addition to designing more valid assessments, the growing body of knowledge about the psychology of achievement testing can be useful for structuring evaluations of technical quality.

5.
Applied Measurement in Education, 2013, 26(1): 83-102
With increased demands for various types of assessments—from the classroom use of individual student results to international comparisons—has come an expanded desire to use assessments for multiple purposes by linking results from distinct assessments. There is a desire to make comparisons from results on one assessment with those of another (e.g., the results from a state assessment vs. the results on a national or international assessment). The degree to which desired interpretations and inferences are justified, however, depends on the nature of the assessments being compared and the ways in which the linkage occurs. Five different types of linking (equating, calibration, statistical moderation, prediction, and social moderation) are distinguished. The characteristics of these types of linking, their requirements for the assessments being linked, and the comparative inferences they support are described.
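Of the five linking types, equating carries the strictest requirements and supports the strongest comparative inferences. As a minimal sketch of one classical approach, mean-sigma linear equating, on hypothetical score samples (the data and function name are illustrative, not drawn from the article):

```python
from statistics import mean, pstdev

def linear_equate(x, scores_x, scores_y):
    """Mean-sigma linear equating: map score x from Form X onto the
    Form Y scale by matching the two score distributions' means and
    standard deviations."""
    mx, sx = mean(scores_x), pstdev(scores_x)
    my, sy = mean(scores_y), pstdev(scores_y)
    return my + (sy / sx) * (x - mx)

# Hypothetical score samples from two test forms
form_x = [10, 12, 14, 16, 18]
form_y = [20, 24, 28, 32, 36]

# The Form X mean (14) should map onto the Form Y mean (28)
print(linear_equate(14, form_x, form_y))
```

Calibration, statistical moderation, prediction, and social moderation relax these requirements in turn and support correspondingly weaker inferences.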

6.
Design and Development of Performance Assessments
Achievement can be, and often is, measured by means of observation and professional judgment. This form of measurement is called performance assessment. Developers of large-scale assessments of communication skills often rely on performance assessments in which carefully devised exercises elicit performance that is observed and judged by trained raters. Teachers also rely heavily on day-to-day observation and judgment. Like other tests, quality performance assessment must be carefully planned and developed to conform to specific rules of test design. This module presents and illustrates those rules in the form of a step-by-step strategy for designing such assessments, through the specification of (a) reason(s) for assessment, (b) type of performance to be evaluated, (c) exercises that will elicit performance, and (d) systematic rating procedures. General guidelines are presented for maximizing the reliability, validity, and economy of performance assessments.

7.
To date, assessment validity research on non-native English speaking students in the United States has focused exclusively on those who are presently English language learners (ELLs). However, little, if any, research has been conducted on two other sizable groups of language minority students: (a) bilingual or multilingual students who were already English proficient when they entered the school system (IFEPs), and (b) former English language learners, those students who were once classified as ELLs but are now reclassified as being English proficient (RFEPs). This study investigated the validity of several standards-based assessments in mathematics and science for these two student groups and found a very high degree of score comparability, when compared with native English speakers, for the IFEPs, whereas a moderate to high degree of score comparability was observed for the RFEPs. Thus, test scores for these two groups on the assessments we studied appear to be valid indicators of their content knowledge, to a degree similar to that of native English speakers.

8.
Local assessment systems are being marketed as formative, benchmark, predictive, and a host of other terms. Many so-called formative assessments are not at all similar to the types of assessments and strategies studied by Black and Wiliam (1998) but instead are interim assessments. In this article, we clarify the definition and uses of interim assessments and argue that they can be an important piece of a comprehensive assessment system that includes formative, interim, and summative assessments. Interim assessments are given on a larger scale than formative assessments, have less flexibility, and are aggregated to the school or district level to help inform policy. Interim assessments are driven by their purpose, which falls into the categories of instructional, evaluative, or predictive. Our intent is to provide a specific definition for these "interim assessments" and to develop a framework that district and state leaders can use to evaluate these systems for purchase or development. The discussion lays out some concerns with the current state of these assessments as well as hopes for future directions and suggestions for further research.

9.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.

10.
The relationships between ratings on the Idaho Alternate Assessment (IAA) for 116 students with significant disabilities and corresponding ratings for the same students on two norm-referenced teacher rating scales were examined to gain evidence about the validity of resulting IAA scores. To contextualize these findings, another group of 54 students who had disabilities, but were not officially eligible for the alternate assessment, was also assessed. Evidence to support the validity of the inferences about IAA scores was mixed, yet promising. Specifically, the relationships among the reading, language arts, and mathematics achievement level ratings on the IAA and the concurrent scores on the ACES-Academic Skills scales for the eligible students varied across grade clusters, but in general were moderate. These findings provided evidence that IAA scales measure skills indicative of the state's content standards. This point was further reinforced by moderate to high correlations between the IAA and Idaho State Achievement Test (ISAT) for the not-eligible students. Additional evidence concerning the valid use of the IAA was provided by logistic regression results showing that the scores do an excellent job of differentiating students who were eligible from those not eligible to participate in an alternate assessment. The collective evidence for the validity of the IAA scores suggests it is a promising assessment for NCLB accountability of students with significant disabilities. The methods of establishing this evidence have the potential to advance validation efforts of other states' alternate assessments.
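The logistic regression evidence described here can be illustrated with a from-scratch sketch on fabricated ratings; the data and fitting routine are assumptions for illustration, not the IAA analysis itself:

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit a one-predictor logistic regression by stochastic gradient
    ascent on the log-likelihood. xs: rating-scale scores;
    ys: 1 = eligible for the alternate assessment, 0 = not eligible."""
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            b0 += lr * (y - p)
            b1 += lr * (y - p) * x
    return b0, b1

def predict(b0, b1, x):
    """Predicted probability of eligibility at score x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Fabricated scores: lower scores correspond to eligible students
scores   = [1, 2, 2, 3, 6, 7, 8, 9]
eligible = [1, 1, 1, 1, 0, 0, 0, 0]
b0, b1 = fit_logistic(scores, eligible)
print(predict(b0, b1, 2) > 0.5, predict(b0, b1, 8) < 0.5)  # True True
```

A steep slope with well-separated predicted probabilities is what "differentiating eligible from not-eligible students" amounts to in this setting.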

11.
Assessments labeled as formative have been offered as a means to improve student achievement. But labels can be a powerful way to miscommunicate. For an assessment use to be appropriately labeled "formative," both empirical evidence and reasoned arguments must be offered to support the claim that improvements in student achievement can be linked to the use of assessment information. Our goal in this article is to support the construction of such an argument by offering a framework within which to consider evidence-based claims that assessment information can be used to improve student achievement. We describe this framework and then illustrate its use with an example of one-on-one tutoring. Finally, we explore the framework's implications for understanding when the use of assessment information is likely to improve student achievement and for advising test developers on how to develop assessments that are intended to offer information that can be used to improve student achievement.

12.
In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered due to various reasons (e.g., lack of test-taking motivation, poor time management, and test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores, thereby threatening the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with those of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Using the traditional scoring approaches, students’ responses for not-reached items were treated as either not-administered or incorrect in the scoring process. With the proposed scoring approach, students’ nonmissing responses were scored polytomously based on how accurately and rapidly they responded to the items to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared based on several evaluation criteria, such as model fit indices, test information function, and bias. The results indicated that the polytomous scoring approach outperformed the traditional approaches. The complete case simulation corroborated our empirical findings that the scoring approach in which nonmissing items were scored polytomously and not-reached items were considered not-administered performed the best. Implications of the polytomous scoring approach for low-stakes assessments were discussed.
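A scoring rule in the spirit of the proposed approach might look like the following sketch; the 0/1/2 score levels and the speed threshold are assumptions for illustration, not the study's actual parameters:

```python
def polytomous_score(responses, fast_seconds=8):
    """Score answered items polytomously: 2 = correct and fast,
    1 = correct but slow, 0 = incorrect. Not-reached items stay None,
    i.e., they are treated as not-administered rather than wrong.
    Each response is a (correct, seconds) pair, or None if not reached."""
    scores = []
    for r in responses:
        if r is None:            # not reached: leave out of scoring
            scores.append(None)
        else:
            correct, secs = r
            if not correct:
                scores.append(0)
            elif secs <= fast_seconds:
                scores.append(2)
            else:
                scores.append(1)
    return scores

# A student answers three items and never reaches the last two
resp = [(True, 5), (True, 12), (False, 9), None, None]
print(polytomous_score(resp))  # [2, 1, 0, None, None]
```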

13.
14.
In this article, performance assessments are cast within a sampling framework. More specifically, a performance assessment is viewed as a sample of student performance drawn from a complex universe defined by a combination of all possible tasks, occasions, raters, and measurement methods. Using generalizability theory, we present evidence bearing on the generalizability and convergent validity of performance assessments sampled from a range of measurement facets and measurement methods. Results at both the individual and school level indicate that task-sampling variability is the major source of measurement error. Large numbers of tasks are needed to get a reliable measure of mathematics and science achievement at the elementary level. With respect to convergent validity, results suggest that methods do not converge. Students' performance scores, then, are dependent on both the task and method sampled.
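The task-sampling result follows from the generalizability coefficient for a person × task design; a minimal sketch with assumed variance components (not the study's estimates):

```python
def g_coefficient(var_person, var_task_error, n_tasks):
    """Generalizability coefficient for a person x task design:
    E(rho^2) = var_p / (var_p + var_pt,e / n_t). The task-sampling
    error term shrinks only as the number of sampled tasks grows."""
    return var_person / (var_person + var_task_error / n_tasks)

# Assumed variance components in which task sampling dominates error
var_p, var_pt_e = 0.20, 0.60
for n_t in (1, 5, 10, 20):
    print(n_t, g_coefficient(var_p, var_pt_e, n_t))
```

With these components a single task yields a coefficient of only 0.25, and about 20 tasks are needed to approach 0.87, mirroring the article's conclusion that large numbers of tasks are required for a reliable measure.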

15.
Applied Measurement in Education, 2013, 26(4): 413-432
With the increasing use of automated scoring systems in high-stakes testing, it has become essential that test developers assess the validity of the inferences based on scores produced by these systems. In this article, we attempt to place the issues associated with computer-automated scoring within the context of current validity theory. Although it is assumed that the criteria appropriate for evaluating the validity of score interpretations are the same for tests using automated scoring procedures as for other assessments, different aspects of the validity argument may require emphasis as a function of the scoring procedure. We begin the article with a taxonomy of automated scoring procedures. The presentation of this taxonomy provides a framework for discussing threats to validity that may take on increased importance for specific approaches to automated scoring. We then present a general discussion of the process by which test-based inferences are validated, followed by a discussion of the special issues that must be considered when scoring is done by computer.

16.
The purpose of this article is to address a major gap in the instructional sensitivity literature on how to develop instructionally sensitive assessments. We propose an approach to developing and evaluating instructionally sensitive assessments in science and test this approach with one elementary life‐science module. The assessment we developed was administered to 125 students in seven classrooms. The development approach considered three dimensions of instructional sensitivity; that is, assessment items should: represent the curriculum content, reflect the quality of instruction, and have formative value for teaching. Focusing solely on the first dimension, representation of the curriculum content, this study was guided by the following research questions: (1) What science module characteristics can be systematically manipulated to develop items that prove to be instructionally sensitive? and (2) Are the instructionally sensitive assessments developed sufficiently valid to make inferences about the impact of instruction on students' performance? In this article, we describe our item development approach and provide empirical evidence to support validity arguments about the developed instructionally sensitive items. Results indicated that: (1) manipulations of the items at different proximities to vary their sensitivity were aligned with the rules for item development and also corresponded with pre‐to‐post gains; and (2) the items developed at different distances from the science module showed a pattern of pre‐to‐post gain consistent with their instructional sensitivity, that is, the closer the items were to the science module, the larger the observed gains and effect sizes. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 691–712, 2012

17.
In this article we examine empirical evidence on the criterion, predictive, transfer, and fairness aspects of validity of a large-scale language arts performance assessment, referred to as the Performance Assignment (PA). We use multilevel models to avoid biased inferences that might result from the naturally nested data. Specifically, we examine the relationships of the assessment with the Stanford Achievement Test, 9th Edition and the California High School Exit Examination. The results indicate that the measures are related, that students demonstrate a degree of transfer, and that the language arts PA is relatively more fair than comparison assessments. The results are robust to various model specifications and demonstrate that benefits do not accrue to all students equally.

18.
A sample of 293 local district assessments used in the Nebraska STARS (School-based Teacher-led Assessment and Reporting System), 147 from 2004 district mathematics assessment portfolios and 146 from 2003 reading assessment portfolios, was scored with a rubric evaluating their quality. Scorers were Nebraska educators with background and training in assessment. Raters reached an agreement criterion during a training session; however, analysis of a set of 30 assessments double-scored during the main scoring session indicated that the math ratings remained reliable during scoring, while the reading ratings did not. Therefore, this article presents results for the 147 mathematics assessments only. The quality of local mathematics assessments used in the Nebraska STARS was good overall. The majority were of high quality on characteristics that go to validity (alignment with standards, clarity to students, appropriateness of content). Professional development for Nebraska teachers is recommended on aspects of assessment related to reliability (sufficiency of information and scoring procedures).
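Double-scoring agreement of the kind monitored here is often summarized with Cohen's kappa; a stdlib sketch on fabricated rubric ratings (the article does not specify its agreement statistic, so kappa is an assumption):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical ratings:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n       # observed
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)      # expected by chance
    return (p_o - p_e) / (1 - p_e)

# Fabricated double-scored rubric ratings for 10 assessments
rater_a = [3, 2, 3, 1, 2, 3, 2, 1, 3, 2]
rater_b = [3, 2, 3, 1, 2, 2, 2, 1, 3, 3]
print(cohens_kappa(rater_a, rater_b))  # ≈ 0.69
```

Kappa corrects raw percent agreement for agreement expected by chance, which matters when the rubric categories are unevenly used.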

19.
Commentary: Evaluating the Validity of Formative and Interim Assessment
In many school districts, the pressure to raise test scores has created overnight celebrity status for formative assessment. Its powers to raise student achievement have been touted, however, without attending to the research on which these claims were based. Sociocultural learning theory provides theoretical grounding for understanding how formative assessment works to increase student learning. The articles in this special issue bring us back to underlying first principles by offering separate validity frameworks for evaluating formative assessment (Nichols, Meyers, & Burling) and newly-invented interim assessments (Perie, Marion, & Gong). The article by Heritage, Kim, Vendlinski, and Herman then offers the most important insight of all; that is, formative assessment is of little use if teachers don't know what to do when students are unable to grasp an important concept. While it is true that validity investigations are needed, I argue that the validity research that will tell us the most—about how formative assessment can be used to improve student learning—must be embedded in rich curriculum and must at the same time attempt to foster instructional practices consistent with learning research.

20.
The Validity of National Curriculum Assessment
This paper reviews the validity of National Curriculum assessment in England. It works with the concept of 'consequential validity' (Messick, 1989), which incorporates both conventional 'reliability' issues and the use to which any assessment is put. The review uses the eight-stage 'threats to validity' model developed by Crooks, Kane and Cohen (1996). The complexity of National Curriculum assessment makes evaluation difficult. These assessments are used for a variety of purposes, so the 'consequential' aspects are compounded. National Curriculum assessment also involves both Teacher Assessment and tests – each of which has strengths and limitations in relation to validity. The main finding is that the validity of National Curriculum assessment hinges on the balance between Teacher Assessment and testing. Between them they can meet Crooks et al.'s requirements of a valid assessment system. The current emphasis on the use of test results for school accountability and as a measure of national standards has undermined Teacher Assessment to a point at which the validity of the system is in question.
