Similar Literature
20 similar documents found.
1.
When good model-data fit is observed, the Many-Facet Rasch (MFR) model acts as a linking and equating model that can be used to estimate student achievement, item difficulties, and rater severity on the same linear continuum. Given sufficient connectivity among the facets, the MFR model provides estimates of student achievement that are equated to control for differences in rater severity. Although several different linking designs are used in practice to establish connectivity, the implications of design differences have not been fully explored. Research on the impact of model-data fit on the quality of MFR model-based adjustments for rater severity is also limited. This study explores the effects of linking designs and model-data fit for raters on the interpretation of student achievement estimates within the context of performance assessments in music. Results indicate that performances cannot be effectively adjusted for rater effects when inadequate linking or model-data fit is present.
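For reference, a common rating-scale formulation of the MFR model for student n, item i, rater j, and rating category k (the study may use a different parameterization) places all facets on a single logit scale:

    \ln\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \lambda_j - \tau_k

where \theta_n is student achievement, \delta_i is item difficulty, \lambda_j is rater severity, and \tau_k is the threshold between categories k-1 and k. Because \lambda_j is estimated on the same scale as \theta_n, adequately linked and well-fitting data allow achievement estimates to be adjusted for the severity of the particular raters each student happened to encounter.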

2.
Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the scoring criteria. Validity of score meaning is affected by the accuracy of raters' representations of examinee performance and the scoring criteria, and the accuracy of the mapping process. Methodological advances and applications that model rater response processes, rater accuracy, and rater consistency inform the design, scoring, interpretations, and uses of rater‐mediated assessments.

3.
Researchers have documented the impact of rater effects, or raters' tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers' achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers' achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call on researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.

4.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.
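To make the cumulative/unfolding contrast concrete (the exact parameterizations used in the article may differ), a dichotomous Rasch model and the Andrich–Luo hyperbolic cosine model can be written as

    Cumulative (Rasch):  P(X_{ni}=1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}

    Unfolding (hyperbolic cosine):  P(X_{ni}=1) = \frac{\exp(\rho_i)}{\exp(\rho_i) + 2\cosh(\theta_n - \delta_i)}

Under the cumulative model the probability of a positive response rises monotonically as the person location \theta_n moves above the item or criterion location \delta_i; under the unfolding model the probability peaks where \theta_n = \delta_i and declines in both directions, with \rho_i acting as a unit (latitude) parameter. The choice between them should follow from the hypothesized rater response process.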

5.
Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.
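For readers unfamiliar with the IRT methods compared, the true-score approach can be sketched as follows (a standard textbook formulation rather than details specific to this study). Number-correct scores on forms X and Y are related through their test characteristic curves,

    T_X(\theta) = \sum_{i \in X} E(X_i \mid \theta), \qquad T_Y(\theta) = \sum_{i \in Y} E(X_i \mid \theta),

so an observed score x on form X is equated by solving T_X(\theta^{*}) = x for \theta^{*} and reporting T_Y(\theta^{*}) as the equated score. IRT observed-score equating instead uses the item parameters to generate model-implied score distributions on both forms and applies equipercentile equating to those distributions, while frequency estimation and chained equipercentile equating rely on an anchor test and on assumptions about group equivalence, which is why large group differences can drive the methods apart.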

6.
In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition examination, employing a multifaceted Rasch approach to determine whether raters exhibited evidence of two types of differential rater functioning over time (i.e., changes in levels of accuracy or scale category use). Some raters showed statistically significant changes in their levels of accuracy as the scoring progressed, while other raters displayed evidence of differential scale category use over time.

7.
How can the contributions of raters and tasks to error variance be estimated? Which source of error variance is usually greater? Are interrater coefficients adequate estimates of reliability? What other facets contribute to unreliability in performance assessments?
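These questions are usually addressed with generalizability theory. As a sketch (standard G-theory results, not estimates from this article), for a fully crossed persons x tasks x raters design the observed-score variance decomposes as

    \sigma^2(X_{ptr}) = \sigma^2_p + \sigma^2_t + \sigma^2_r + \sigma^2_{pt} + \sigma^2_{pr} + \sigma^2_{tr} + \sigma^2_{ptr,e}

and the relative error variance for a measurement based on n_t tasks and n_r raters is \sigma^2_\delta = \sigma^2_{pt}/n_t + \sigma^2_{pr}/n_r + \sigma^2_{ptr,e}/(n_t n_r), giving the generalizability coefficient E\rho^2 = \sigma^2_p / (\sigma^2_p + \sigma^2_\delta). An interrater coefficient reflects only the rater-related terms, so it can substantially overstate reliability when task-related variance dominates.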

8.
Numerous studies have examined performance assessment data using generalizability theory. Typically, these studies have treated raters as randomly sampled from a population, with each rater judging a given performance on a single occasion. This paper presents two studies that focus on aspects of the rating process that are not explicitly accounted for in this typical design. The first study makes explicit the "committee" facet, acknowledging that raters often work within groups. The second study makes explicit the "rating-occasion" facet by having each rater judge each performance on two separate occasions. The results of the first study highlight the importance of clearly specifying the relevant facets of the universe of interest. Failing to include the committee facet led to an overly optimistic estimate of the precision of the measurement procedure. By contrast, failing to include the rating-occasion facet, in the second study, had minimal impact on the estimated error variance.

9.
In this article, performance assessments are cast within a sampling framework. More specifically, a performance assessment is viewed as a sample of student performance drawn from a complex universe defined by a combination of all possible tasks, occasions, raters, and measurement methods. Using generalizability theory, we present evidence bearing on the generalizability and convergent validity of performance assessments sampled from a range of measurement facets and measurement methods. Results at both the individual and school level indicate that task-sampling variability is the major source of measurement error. Large numbers of tasks are needed to get a reliable measure of mathematics and science achievement at the elementary level. With respect to convergent validity, results suggest that methods do not converge. Students' performance scores, then, are dependent on both the task and method sampled.
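A small decision-study sketch in Python (with made-up variance components chosen only to mimic the pattern described here, in which person-by-task variance dominates) illustrates why many tasks are needed before the generalizability coefficient reaches conventional levels:

    # Decision-study sketch: generalizability coefficient for a crossed
    # persons x tasks x raters design as a function of the number of tasks.
    # The variance components are hypothetical, not estimates from this study.

    def g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
        """Relative generalizability coefficient E(rho^2)."""
        rel_error = (var_pt / n_tasks
                     + var_pr / n_raters
                     + var_ptr_e / (n_tasks * n_raters))
        return var_p / (var_p + rel_error)

    # Hypothetical components: task-sampling (person x task) variance dominates.
    components = dict(var_p=0.30, var_pt=0.60, var_pr=0.02, var_ptr_e=0.25)

    for n_tasks in (1, 5, 10, 20):
        g = g_coefficient(n_tasks=n_tasks, n_raters=2, **components)
        print(f"{n_tasks:>2} tasks, 2 raters: E(rho^2) = {g:.2f}")

With these illustrative components the coefficient climbs from roughly .29 with one task to roughly .87 with twenty, echoing the article's point that task sampling, not rater sampling, drives measurement error.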

10.
This study describes three least squares models to control for rater effects in performance evaluation: ordinary least squares (OLS); weighted least squares (WLS); and ordinary least squares, subsequent to applying a logistic transformation to observed ratings (LOG-OLS). The models were applied to ratings obtained from four administrations of an oral examination required for certification in a medical specialty. For any single administration, there were 40 raters and approximately 115 candidates, and each candidate was rated by four raters. The results indicated that raters exhibited significant amounts of leniency error and that application of the least squares models would change the pass-fail status of approximately 7% to 9% of the candidates. Ratings adjusted by the models demonstrated higher reliability and correlated slightly higher than observed ratings with the scores on a written examination.
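A minimal sketch of the OLS idea in Python, using toy data rather than the authors' exact specification: regress observed ratings on candidate and rater indicator variables, then subtract each rater's estimated severity or leniency from the ratings that rater assigned.

    import numpy as np

    # Toy data: ratings[k] was given by rater_id[k] to candidate_id[k].
    candidate_id = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
    rater_id     = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
    ratings      = np.array([6.0, 4.5, 5.5, 8.0, 6.5, 7.5, 5.0, 3.5, 4.5])

    n_c, n_r = candidate_id.max() + 1, rater_id.max() + 1

    # Design matrix: intercept + candidate dummies + rater dummies (first level dropped).
    X = np.zeros((len(ratings), 1 + (n_c - 1) + (n_r - 1)))
    X[:, 0] = 1.0
    for k in range(len(ratings)):
        if candidate_id[k] > 0:
            X[k, candidate_id[k]] = 1.0            # columns 1 .. n_c-1
        if rater_id[k] > 0:
            X[k, n_c - 1 + rater_id[k]] = 1.0      # columns n_c .. n_c+n_r-2

    beta, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    rater_effect = np.concatenate(([0.0], beta[n_c:]))    # effects relative to rater 0

    # Adjusted rating = observed rating minus the effect of the rater who gave it.
    adjusted = ratings - rater_effect[rater_id]
    print(np.round(adjusted, 2))

WLS adds observation weights to the same regression, and LOG-OLS applies it after a logistic transformation of the bounded rating scale, as the abstract describes.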

11.
The use of alternative assessments has led many researchers to reexamine traditional views of test qualities, especially validity. Because alternative assessments generally aim at measuring complex constructs and employ rich assessment tasks, it becomes more difficult to demonstrate (a) the validity of the inferences we make and (b) that these inferences extrapolate to target domains beyond the assessment itself. An approach to addressing these issues from the perspective of language testing is described. It is then argued that in both language testing and educational assessment we must consider the roles of both language and content knowledge, and that our approach to the design and development of performance assessments must be both construct-based and task-based.

12.
Why is comparability of forms important for performance assessments? Can traditional methods of form equating be used? What problems are likely to arise in equating? Can standards generalize across forms?

13.
Design and Development of Performance Assessments
Achievement can be, and often is, measured by means of observation and professional judgment. This form of measurement is called performance assessment. Developers of large-scale assessments of communication skills often rely on performance assessments in which carefully devised exercises elicit performance that is observed and judged by trained raters. Teachers also rely heavily on day-to-day observation and judgment. Like other tests, quality performance assessment must be carefully planned and developed to conform to specific rules of test design. This module presents and illustrates those rules in the form of a step-by-step strategy for designing such assessments, through the specification of (a) reason(s) for assessment, (b) type of performance to be evaluated, (c) exercises that will elicit performance, and (d) systematic rating procedures. General guidelines are presented for maximizing the reliability, validity, and economy of performance assessments.

14.
This study examined the stability of scores on two types of performance assessments, an observed hands-on investigation and a notebook surrogate. Twenty-nine sixth-grade students in a hands-on inquiry-based science curriculum completed three investigations on two occasions separated by 5 months. Results indicated that: (a) the generalizability across occasions for relative decisions was, on average, moderate for the observed investigations (.52) and the notebooks (.50); (b) the generalizability for absolute decisions was only slightly lower; (c) the major source of measurement error was the person by occasion (residual) interaction; and (d) the procedures students used to carry out the investigations tended to change from one occasion to the other.

15.
Rater‐mediated assessments are a common methodology for measuring persons, investigating rater behavior, and/or defining latent constructs. The purpose of this article is to provide a pedagogical framework for examining rater variability in the context of rater‐mediated assessments using three distinct models. The first model is the observation model, which includes ecological/environmental considerations for the evaluation system. The second model is the measurement model, which includes the transformation of observed, rater response data to linear measures using a measurement model with specific requirements of rater‐invariant measurement in order to examine raters' construct‐relevant variability stemming from the evaluative system. The third model is the interaction model, which includes an interaction parameter to allow for the investigation into raters' systematic, construct‐irrelevant variability stemming from the evaluative system. Implications for measurement outcomes and validity are discussed.
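As one concrete, hypothetical instantiation of the second and third models, a Facets-style formulation would contain only main effects in the measurement model and add a rater-by-element term in the interaction model:

    Measurement model:  \ln\left(\frac{P_{njik}}{P_{nji(k-1)}}\right) = \theta_n - \lambda_j - \delta_i - \tau_k

    Interaction model:  \ln\left(\frac{P_{njik}}{P_{nji(k-1)}}\right) = \theta_n - \lambda_j - \delta_i - \phi_{ji} - \tau_k

Here \lambda_j is rater severity (the construct-relevant variability that must satisfy rater-invariant measurement), while \phi_{ji} captures systematic, construct-irrelevant variability of rater j that is specific to element i, such as a domain, task, or examinee subgroup; non-zero interaction estimates flag threats to the intended score interpretation.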

16.
As part of Nebraska's assessment and accountability system, districts' local assessment systems are evaluated for their psychometric quality. This article provides an overview of a two-stage evaluation strategy, discusses how it was applied in Nebraska, and presents results from the first three years of the evaluation process. Benefits of the method include an emphasis on formative evaluation and promotion of improved assessment quality at the local level. A limitation of the model is the inability to make refined comparisons of student performance across districts on the assessments. Results from the first three years suggest that greater specificity in the review criteria and additional reviewer calibration activities are needed.

17.
18.
An Approach for Evaluating the Technical Quality of Interim Assessments
Increasing numbers of schools and districts have expressed interest in interim assessment systems to prepare for summative assessments and to improve teaching and learning. However, with so many commercial interim assessments available, schools and districts are struggling to determine which interim assessment is most appropriate to their needs. Unfortunately, there is little research-based guidance to help schools and districts make the right choice about how to spend their money. Recognizing the urgency of developing criteria that can describe or evaluate the quality of interim assessments, this article presents the results of an initial attempt to create an instrument that school and district educators could use to evaluate the quality and usefulness of an interim assessment. The instrument is designed for use by state and district leaders to help them select an appropriate interim assessment system for their needs, but it could also be used by test vendors looking to evaluate and improve their own systems and by researchers engaged in studies of interim assessment use.

19.
In this article, procedures are described for estimating single-administration classification consistency and accuracy indices for complex assessments using item response theory (IRT). This IRT approach was applied to real test data comprising dichotomous and polytomous items. Several different IRT model combinations were considered. Comparisons were also made between the IRT approach and two non-IRT approaches including the Livingston-Lewis and compound multinomial procedures. Results for various IRT model combinations were not substantially different. The estimated classification consistency and accuracy indices for the non-IRT procedures were almost always lower than those for the IRT procedures.
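A simulation-based sketch in Python of what a single-administration classification consistency index means, assuming a Rasch model with dichotomous items and one cut score (the article's IRT, Livingston-Lewis, and compound multinomial procedures are analytic and accommodate mixed item types):

    import numpy as np

    rng = np.random.default_rng(1)

    def classification_consistency(theta_hat, item_difficulties, cut_raw, n_rep=2000):
        """P(two simulated parallel administrations give the same pass/fail decision | theta_hat)."""
        p = 1.0 / (1.0 + np.exp(-(theta_hat - item_difficulties)))   # Rasch correct-response probabilities
        raw_a = (rng.random((n_rep, p.size)) < p).sum(axis=1)        # simulated raw scores, administration A
        raw_b = (rng.random((n_rep, p.size)) < p).sum(axis=1)        # simulated raw scores, administration B
        return np.mean((raw_a >= cut_raw) == (raw_b >= cut_raw))

    difficulties = np.linspace(-2.0, 2.0, 30)   # hypothetical 30-item calibration
    print(classification_consistency(theta_hat=0.3, item_difficulties=difficulties, cut_raw=18))

Averaging this conditional agreement over the estimated ability distribution yields an overall consistency index; accuracy is obtained analogously by comparing each simulated decision with the decision implied by the model-expected (true) score at \theta.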

20.
Culturally responsive assessments have been proposed as potential tools to ensure equity and fairness for examinees from all backgrounds, including those from traditionally underserved or minoritized groups. However, these assessments are relatively new and, with few exceptions, are yet to be implemented at large scale. Consequently, there is a lack of guidance on how one can compute comparable scores on various versions of these assessments. In this paper, the multigroup multidimensional Rasch model is repurposed for modeling data originating from various versions of a culturally responsive assessment and for analyzing such data to compute comparable scores. Two simulation studies are performed to evaluate the performance of the model for data simulated from hypothetical culturally responsive assessments and to find the conditions under which the computed scores are accurate. Recommendations are made for measurement practitioners interested in culturally responsive assessments.
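As a rough sketch of how such a model can yield comparable scores (the paper's exact specification may differ), consider a between-item multidimensional Rasch model in which examinee n from group g answers item i that loads on dimension d(i):

    P(X_{nig} = 1) = \frac{\exp(\theta_{n,d(i)} - \delta_{ig})}{1 + \exp(\theta_{n,d(i)} - \delta_{ig})}

Items shared across versions would be constrained to a common difficulty (\delta_{ig} = \delta_i for all g), version-specific culturally responsive items would be free to differ by group, and group-specific latent distributions would be estimated jointly; the common items then anchor the metric so that scores from different versions can be reported on a comparable scale, provided the invariance constraints hold.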

