Similar Articles
20 similar articles found (search time: 15 ms)
1.
Multiple scoring is widely used in large-scale assessments. Using a single response to support multiple inferences, as multiple scoring does, has implications for the validity of those inferences and of the interpretations based on assessment results. The purpose of this article is to review two types of multiple scoring practices and to discuss how multiple scoring affects inferences.

2.
Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of examinee performance onto their representations of the scoring criteria. Validity of score meaning is affected by the accuracy of raters' representations of examinee performance and the scoring criteria, and by the accuracy of the mapping process. Methodological advances and applications that model rater response processes, rater accuracy, and rater consistency inform the design, scoring, interpretations, and uses of rater‐mediated assessments.

3.
Abstract

Teachers' evaluation of pupil learning should be consistent with identified learning outcomes at the intended level of performance. To the extent that curriculum and assessment are aligned, the validity of inferences about pupil knowledge is strengthened. The purpose of this investigation was to evaluate the assessment practices of preservice teachers who had successfully completed coursework in educational measurement. Three hundred and nine lesson plans from 65 preservice (student) teachers were reviewed. The authors found that, during student teaching, preservice teachers do not follow many of the assessment practices recommended in their coursework. Implementing recommended classroom assessment practices thus appears to depend on more than possessing the requisite knowledge.

4.
Student assessment of teaching in higher education
Plans to introduce campus-wide assessments of college or university teaching that depend largely on student ratings are seen as a threat to academic freedom in institutions with little or no experience of this form of evaluation. While regular student evaluations of teaching are very common in North America, their introduction is only now being considered in colleges and universities in a number of other countries. Research on the reliability and validity of student ratings indicates that they can provide valuable information about the quality of teaching. Depending on the survey used, this type of evaluation may provide evidence of teaching ability to staffing committees or suggest ways of improving teaching. The paper concludes with a set of recommendations for higher education institutions that are considering the regular assessment of all teachers by their students.

5.
This article argues that test takers are as integral to establishing the validity of test scores as are defining the target content and conditioning inferences on test use. Principled, sustained attention to how students interact with assessment opportunities is essential, as is principled, sustained evaluation of evidence that confirms the validity of, or calls into question, the inferences made for individual students. Three innovative assessment systems are highlighted to illustrate where and how developers might handle diverse test-taker needs and learning characteristics. ONPAR measures challenging content using multisemiotic methods and novel item types, designing items to handle multiple profiles so they are accessible to most students. Dynamic Learning Maps has built an innovative network of learning maps with multiple pathways designed to model how diverse students acquire knowledge. To support its assessments, the National Center and State Collaborative has built an exemplary web of educator resources, such as content modules and guides, to support differentiated learning.

6.
The growing importance of genomics and bioinformatics methods and paradigms in biology has been accompanied by an explosion of new curricula and pedagogies. An important question to ask about these educational innovations is whether they are having a meaningful impact on students' knowledge, attitudes, or skills. Although assessments are necessary tools for answering this question, the value of their outputs depends on their quality. Our study 1) reviews the central importance of reliability and construct validity evidence in the development and evaluation of science assessments and 2) examines the extent to which published assessments in genomics and bioinformatics education (GBE) have been developed using such evidence. We identified 95 GBE articles (out of 226) that contained claims of knowledge increases, affective changes, or skill acquisition. We found that 1) the purpose of most of these studies was to assess summative learning gains associated with curricular change at the undergraduate level, and 2) a minority (<10%) of studies provided any reliability or validity evidence, and only one study out of the 95 sampled mentioned both validity and reliability. Our findings raise concerns about the quality of evidence derived from these instruments. We end with recommendations for improving assessment quality in GBE.
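As a purely illustrative aside (not code from any instrument the review surveyed), the sketch below computes Cronbach's alpha, one widely reported index of the internal-consistency reliability evidence the authors looked for; the function name and toy data are hypothetical.

```python
# Hypothetical sketch: Cronbach's alpha as one form of reliability evidence.
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: (n_respondents, k_items) matrix of item scores."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Invented data: 200 respondents, 4 items sharing a common factor
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
items = ability + rng.normal(scale=0.8, size=(200, 4))
print(f"alpha = {cronbach_alpha(items):.2f}")
```

Reporting an index like this alongside construct validity evidence is the kind of documentation the review found missing in over 90% of the sampled studies.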

7.
The purpose of this article is to address a major gap in the instructional sensitivity literature: how to develop instructionally sensitive assessments. We propose an approach to developing and evaluating instructionally sensitive assessments in science and test this approach with one elementary life‐science module. The assessment we developed was administered to 125 students in seven classrooms. The development approach considered three dimensions of instructional sensitivity; that is, assessment items should represent the curriculum content, reflect the quality of instruction, and have formative value for teaching. Focusing solely on the first dimension, representation of the curriculum content, this study was guided by the following research questions: (1) What science module characteristics can be systematically manipulated to develop items that prove to be instructionally sensitive? and (2) Are the instructionally sensitive assessments we developed sufficiently valid to support inferences about the impact of instruction on students' performance? In this article, we describe our item development approach and provide empirical evidence to support validity arguments about the developed instructionally sensitive items. Results indicated that: (1) manipulations of the items at different proximities to vary their sensitivity were aligned with the rules for item development and also corresponded with pre‐to‐post gains; and (2) items developed at different distances from the science module showed a pattern of pre‐to‐post gains consistent with their instructional sensitivity; that is, the closer the items were to the science module, the larger the observed gains and effect sizes. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 691–712, 2012
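The reported pattern (larger pre-to-post gains for items closer to the module) is typically summarized with standardized effect sizes. Below is a minimal, hypothetical sketch using Cohen's d computed on paired differences (sometimes written d_z); the numbers are invented and only mimic the direction of the reported pattern.

```python
# Hypothetical sketch of pre-to-post effect sizes for "close" vs. "distal" items.
import numpy as np

def cohens_d_paired(pre: np.ndarray, post: np.ndarray) -> float:
    """Effect size of the pre-to-post change, standardized by the SD of differences."""
    diff = post - pre
    return diff.mean() / diff.std(ddof=1)

rng = np.random.default_rng(1)
pre = rng.normal(50, 10, size=125)              # 125 students, as in the study design
post_close = pre + rng.normal(8, 6, size=125)   # items close to the module: large gain
post_distal = pre + rng.normal(2, 6, size=125)  # items distal from the module: small gain
print(f"close-item d  = {cohens_d_paired(pre, post_close):.2f}")
print(f"distal-item d = {cohens_d_paired(pre, post_distal):.2f}")
```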

8.
Abstract

Background: International large-scale assessments (ILSAs) are a much-debated phenomenon in education. Increasingly, their outcomes attract considerable media attention and influence educational policies in many jurisdictions worldwide. The relevance, uses and consequences of these assessments are often the focus of research scrutiny. Whilst some argue that the assessment outcomes provide an effective basis for informed policy-making, critics claim that the use of international assessment data can result in a range of unintended consequences, such as the shaping and governing of school systems ‘by numbers’.

Purpose: This article explores and analyses the arguments about the uses and consequences of ILSAs. In particular, the discourse about the assessments’ consequential validity will be discussed and evaluated.

Sources of evidence: Literature relating to the uses and consequences of large-scale assessment was analysed, with a focus on research on the consequential aspects of validity.

Main argument: Much research suggests that ILSAs have unintended consequences that affect and influence educational policy. However, the influences on educational policy are complex and interwoven: for example, it is not clear-cut whether effects such as converging curricula are, necessarily, direct consequences of large-scale assessments. Further, it is suggested that a beneficial consequence of large-scale assessment is the infrastructure it provides for studies in the social sciences, although caution must be applied to causal claims, in particular because of the cross-sectional design of the assessments.

Conclusions: The considerable literature discussing the uses and consequences of large-scale assessments tends to point out potential negative aspects of the studies. However, it is also apparent that large-scale international assessments can be a valuable resource for studying global trends and evolving systems in education. Despite the extensive debates around large-scale assessment outcomes both in the media and in educational policy arenas, empirical educational research all too often appears underused in the discussion.

9.
Assessment Validation in the Context of High-Stakes Assessment
Including the perspectives of stakeholder groups (e.g., teachers, parents) can improve the validity of high-stakes assessment interpretations and uses. How stakeholder groups view high-stakes assessments and their uses may differ significantly from how state-level policy officials view them. The views of these stakeholders can contribute to identifying the strengths and weaknesses of the intended assessment interpretations and uses. This article proposes a process approach to validity that addresses assessment validation in the context of high-stakes assessment. The process approach includes a test evaluator, or validator, who considers the perspectives of five stakeholder groups at four stages of assessment maturity in relation to six aspects of construct validity. The tasks of the test evaluator, and how stakeholders' views might be incorporated, are illustrated at each stage of assessment maturity. How the test evaluator might judge the merit of high-stakes assessment interpretations and uses is also discussed.

10.
The Standards for Educational and Psychological Testing identify several strands of validity evidence that may be needed as support for particular interpretations and uses of assessments. Yet assessment validation often does not seem guided by these Standards, with validations lacking a particular strand even when it appears relevant to an assessment. Consequently, the degree to which validity evidence supports the proposed interpretation and use of the assessment may be compromised. Guided by the Standards, this article presents an independent validation of OECD's PISA assessment of mathematical self-efficacy (MSE) as an instructive example of this issue. OECD identifies MSE as one of a number of “factors” explaining student performance in mathematics, thereby serving the “policy orientation” of PISA. However, this independent validation identifies significant shortcomings in the strands of validity evidence available to support this interpretation and use of the assessment. The article therefore demonstrates how the Standards can guide the planning of a validation to ensure it generates the validity evidence relevant to an interpretive argument, particularly for an international large-scale assessment such as PISA. The implication is that assessment validation could yet benefit from the Standards as what Zumbo calls “a global force for testing”.

11.

Previous approaches to teacher testing have been criticized for poorly representing the knowledge base for teaching, for oversimplifying teaching decisions, and for lacking criterion-related validity evidence to support their use. A new generation of teacher assessments has been developed in the United States through the efforts of the National Board for Professional Teaching Standards and a corollary organization of more than 30 states. These performance-based assessments use videotapes of teachers' practice, examples of lessons and assessments, samples of student work, and analyses of classroom events and outcomes to provide evidence about teaching. Early research on the effects of these assessments suggests that they may be more valid measures of teacher knowledge and skill and that they may help teachers improve their practice. The stimulus to teacher learning appears to occur through task structures that require teachers to learn new content and teaching strategies as part of demonstrating their performance, and through required reflection on the relationships between learning and teaching.

12.
Nebraska districts use different strategies for measuring student performance on the state's content standards. District assessments differ in type and technical quality. Six quality criteria were endorsed by the state; they cover content and curricular validity, fairness, and the appropriateness of score interpretations. District assessment portfolios document how well assessments meet these criteria. Districts are rated on how well their assessments meet each criterion and receive an overall rating from Unacceptable to Exemplary. This article presents these technical quality criteria and explains how they are (a) individually rated and (b) combined into the district's overall quality rating.

13.
Applied Measurement in Education, 2013, 26(1): 83–102
With increased demands for various types of assessments, from the classroom use of individual student results to international comparisons, has come an expanded desire to use assessments for multiple purposes by linking results from distinct assessments. There is a desire to compare results from one assessment with those of another (e.g., the results from a state assessment vs. the results on a national or international assessment). The degree to which the desired interpretations and inferences are justified, however, depends on the nature of the assessments being compared and the ways in which the linkage occurs. Five different types of linking (equating, calibration, statistical moderation, prediction, and social moderation) are distinguished. The characteristics of these types of linking, their requirements for the assessments being linked, and the comparative inferences they support are described.
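Of the five linking types, equating imposes the strongest requirements on the assessments being linked. As a hedged illustration only (the article presents no code), here is a minimal mean-sigma linear equating sketch on invented score distributions; operational equating additionally requires parallel forms and a common population or an anchor design.

```python
# Hypothetical sketch: mean-sigma linear equating of form X scores to the form Y scale.
#   y = (sd_y / sd_x) * (x - mean_x) + mean_y
import numpy as np

def linear_equate(x: float, form_x: np.ndarray, form_y: np.ndarray) -> float:
    """Map score x on form X to its form-Y equivalent by matching means and SDs."""
    mx, sx = form_x.mean(), form_x.std(ddof=1)
    my, sy = form_y.mean(), form_y.std(ddof=1)
    return sy / sx * (x - mx) + my

rng = np.random.default_rng(2)
form_x = rng.normal(48, 9, size=1000)    # invented form X score distribution
form_y = rng.normal(52, 11, size=1000)   # invented form Y score distribution
print(f"A score of 60 on form X maps to {linear_equate(60, form_x, form_y):.1f} on form Y")
```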

14.
15.
Validity is a central principle of assessment relating to the appropriateness of the uses and interpretations of test results. Usually, one of the inferences that we wish to make is that the score reflects the extent of a student’s learning in a given domain. Thus, it is important to establish that the assessment tasks elicit performances that reflect the intended constructs. This research explored the use of three methods for evaluating whether there are threats to validity in relation to the constructs elicited in international A level geography examinations: (a) Rasch analysis; (b) analysis of the processes expected and apparent when students answer questions; and (c) qualitative analysis of responses to items identified as potentially problematic. The results provided strong evidence to support validity with regard to the elicitation of constructs, although one question part was identified as a threat to validity. Strengths and weaknesses of each method are also identified.
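For readers unfamiliar with method (a), the Rasch model places persons and items on a common logit scale: the probability of a correct response depends only on the difference between person ability and item difficulty. The sketch below is a generic illustration with invented values, not the examination data analyzed in the study.

```python
# Hypothetical sketch of the Rasch model:
#   P(correct) = exp(theta - b) / (1 + exp(theta - b))
# where theta is person ability and b is item difficulty (both in logits).
import numpy as np

def rasch_prob(theta: float, b: float) -> float:
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# When ability equals difficulty, the success probability is 0.5; systematic
# misfit between observed responses and these model probabilities is what
# flags an item (or question part) as a potential threat to validity.
for theta, b in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5)]:
    print(f"theta={theta:+.1f}, b={b:+.1f} -> P(correct)={rasch_prob(theta, b):.2f}")
```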

16.
The use of alternative assessments has led many researchers to reexamine traditional views of test qualities, especially validity. Because alternative assessments generally aim at measuring complex constructs and employ rich assessment tasks, it becomes more difficult to demonstrate (a) the validity of the inferences we make and (b) that these inferences extrapolate to target domains beyond the assessment itself. An approach to addressing these issues from the perspective of language testing is described. It is then argued that in both language testing and educational assessment we must consider the roles of both language and content knowledge, and that our approach to the design and development of performance assessments must be both construct-based and task-based.

17.
We examine the factor structure of scores from the CLASS‐S protocol obtained from observations of middle school classroom teaching. Factor analysis has been used to support both interpretations of scores from classroom observation protocols, like CLASS‐S, and the theories about teaching that underlie them. However, classroom observations contain multiple sources of error, most prominently rater error. We demonstrate that errors in scores made by two raters on the same lesson have a factor structure that is distinct from the factor structure at the teacher level. Consequently, the “standard” approach of analyzing teacher‐level average dimension scores can yield incorrect inferences about the factor structure at the teacher level and possibly misleading evidence about the validity of scores and theories of teaching. We consider alternative hierarchical estimation approaches designed to prevent the contamination of estimated teacher‐level factors. These alternative approaches find a teacher‐level factor structure for CLASS‐S that consists of strongly correlated support and classroom management factors. Our results have implications for future studies using factor analysis on classroom observation data to develop validity evidence and test theories of teaching, and for practitioners who rely on the results of such studies to support their use and interpretation of classroom observation scores.
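A toy simulation can make the contamination argument concrete. The sketch below is a hypothetical two-dimension example (not the CLASS-S data or the authors' model): teacher-level scores with a true correlation of .8 are observed through two raters whose errors are only weakly correlated, and the correlation computed from rater-averaged scores lands well below the true value.

```python
# Hypothetical sketch: rater error contaminates correlations of averaged scores.
import numpy as np

rng = np.random.default_rng(3)
n_teachers = 500
rho_teacher, rho_rater = 0.8, 0.2   # invented teacher-level and rater-error correlations

teacher_cov = np.array([[1.0, rho_teacher], [rho_teacher, 1.0]])
rater_cov = np.array([[1.0, rho_rater], [rho_rater, 1.0]])

true_scores = rng.multivariate_normal([0, 0], teacher_cov, size=n_teachers)
rater1 = true_scores + rng.multivariate_normal([0, 0], rater_cov, size=n_teachers)
rater2 = true_scores + rng.multivariate_normal([0, 0], rater_cov, size=n_teachers)
averaged = (rater1 + rater2) / 2

print(f"true teacher-level correlation:       {rho_teacher:.2f}")
print(f"correlation of rater-averaged scores: "
      f"{np.corrcoef(averaged[:, 0], averaged[:, 1])[0, 1]:.2f}")
# The averaged-score correlation mixes the two structures (about .60 here),
# which is why hierarchical estimation of the teacher level is recommended.
```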

18.
Educational Assessment, 2013, 18(2): 119–129
Although some educators have suggested authentic tests as a solution to the problem of artificially inflated scores from teaching to paper-and-pencil tests, we argue that teaching to the test under high-stakes conditions could be more problematic with the new forms of assessment. The wide range of methods that can potentially be used in authentic assessments introduces method variance that is not part of the construct to be measured. As a consequence, teaching the specific methods used in the assessment potentially invalidates the uses and interpretations that can be made from the test scores by narrowing the definition of the construct measured.

19.
This article reports on the collaboration of six states to study how simulation‐based science assessments can become transformative components of multi‐level, balanced state science assessment systems. The project studied the psychometric quality, feasibility, and utility of simulation‐based science assessments designed to serve formative purposes during a unit and to provide summative evidence of end‐of‐unit proficiencies. The frameworks of evidence‐centered assessment design and model‐based learning shaped the specifications for the assessments. The simulations provided the three most common forms of accommodation in state testing programs: audio recording of text, screen magnification, and support for extended time. The SimScientists program at WestEd developed simulation‐based, curriculum‐embedded, and unit benchmark assessments for two middle school topics, Ecosystems and Force & Motion, which were field‐tested in three states. Data included student characteristics, responses to the assessments, cognitive labs, classroom observations, and teacher surveys and interviews. UCLA CRESST conducted an evaluation of the implementation. Feasibility and utility were examined through classroom observations, teacher surveys and interviews, and by the six‐state Design Panel. Technical quality data included AAAS reviews of the items' alignment with standards and the quality of the science, cognitive labs, and assessment data. Student data were analyzed using multidimensional Item Response Theory (IRT) methods. IRT analyses demonstrated the high psychometric quality (reliability and validity) of the assessments and their discrimination between content knowledge and inquiry practices. Students performed better on the interactive, simulation‐based assessments than on the static, conventional items in the posttest. Importantly, gaps between the performance of the general population and that of English language learners and students with disabilities were considerably smaller on the simulation‐based assessments than on the posttests. The Design Panel participated in the development of two models for integrating science simulations into a balanced state science assessment system. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 363–393, 2012
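The multidimensional IRT analyses separated content knowledge from inquiry practices. Purely as a hedged illustration of that idea (parameters and names are invented, not the study's calibration), a compensatory two-dimensional 2PL item response function looks like this:

```python
# Hypothetical sketch: compensatory 2-dimensional 2PL MIRT.
#   P(correct) = logistic(a . theta + d), with theta = (content, inquiry)
import numpy as np

def mirt_2pl_prob(theta: np.ndarray, a: np.ndarray, d: float) -> float:
    """a: item discrimination vector; d: item easiness intercept."""
    return 1.0 / (1.0 + np.exp(-(a @ theta + d)))

theta = np.array([0.5, -0.3])        # a student stronger in content than inquiry
content_item = np.array([1.4, 0.2])  # loads mainly on content knowledge
inquiry_item = np.array([0.2, 1.4])  # loads mainly on inquiry practices

print(f"P(content item) = {mirt_2pl_prob(theta, content_item, d=0.0):.2f}")
print(f"P(inquiry item) = {mirt_2pl_prob(theta, inquiry_item, d=0.0):.2f}")
```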

20.
This article presents findings from two projects designed to improve evaluations of the technical quality of alternate assessments for students with the most significant cognitive disabilities. We argue that assessment technical documents should allow for the evaluation of the construct validity of the alternate assessments following the traditions of Cronbach (1971), Messick (1989, 1995), Linn, Baker, and Dunbar (1991), and Shepard (1993). The projects used the work of Knowing What Students Know (Pellegrino, Chudowsky, & Glaser, 2001) to structure and focus the collection and evaluation of assessment information. The heuristic of the assessment triangle (Pellegrino et al., 2001) was particularly useful in emphasizing that a validity evaluation needs to consider the logical connections among the characteristics of the students tested and how they develop domain proficiency (the cognition vertex), the nature of the assessment (the observation vertex), and the ways in which the assessment results are interpreted (the interpretation vertex). The projects have shown that, in addition to informing the design of more valid assessments, the growing body of knowledge about the psychology of achievement testing can be useful for structuring evaluations of technical quality.
