Similar documents
20 similar documents found (search time: 31 ms)
1.
In an essay rating study, multiple ratings may be obtained by having different raters judge essays or by having the same rater(s) repeat the judging of essays. An important question in the analysis of essay ratings is whether multiple ratings, however obtained, may be assumed to represent the same true scores. When different raters judge the same essays only once, it is impossible to answer this question. In this study 16 raters judged 105 essays on two occasions; hence, it was possible to test assumptions about true scores within the framework of linear structural equation models. It emerged that the ratings of a given rater on the two occasions represented the same true scores. However, the ratings of different raters did not represent the same true scores. The estimated intercorrelations of the true scores of different raters ranged from .415 to .910. Parameters of the best-fitting model were used to compute coefficients of reliability, validity, and invalidity. The implications of these coefficients are discussed.
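For orientation, one standard way to formalize this hypothesis is a congeneric measurement model; the sketch below is a generic formulation, not the paper's exact specification. The rating given by rater j on occasion o decomposes as

```latex
X_{jo} = \mu_{jo} + \lambda_{jo}\,\tau_j + \varepsilon_{jo}
```

A single true score per rater across occasions corresponds to one latent variable \(\tau_j\) shared by that rater's two occasions, while the finding that raters differ corresponds to \(\mathrm{corr}(\tau_j, \tau_{j'}) < 1\), here estimated between .415 and .910.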

2.
This study describes three least squares models to control for rater effects in performance evaluation: ordinary least squares (OLS); weighted least squares (WLS); and ordinary least squares applied after a logistic transformation of the observed ratings (LOG-OLS). The models were applied to ratings obtained from four administrations of an oral examination required for certification in a medical specialty. For any single administration, there were 40 raters and approximately 115 candidates, and each candidate was rated by four raters. The results indicated that raters exhibited significant amounts of leniency error and that application of the least squares models would change the pass-fail status of approximately 7% to 9% of the candidates. Ratings adjusted by the models demonstrated higher reliability, and correlated slightly more strongly with scores on a written examination than the observed ratings did.
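As a rough illustration of the OLS variant, the sketch below fits a simple additive model (rating = candidate effect + rater leniency + error) by least squares and removes the estimated leniency effects. The function name and the additive specification are assumptions for illustration; the study's actual OLS, WLS, and LOG-OLS models may differ in detail.

```python
import numpy as np

def ols_adjust(candidate_ids, rater_ids, ratings):
    """Fit rating = candidate effect + rater leniency + error by OLS,
    then return ratings with the estimated leniency effects removed.
    A minimal sketch, not the paper's exact procedure."""
    candidate_ids = np.asarray(candidate_ids)
    rater_ids = np.asarray(rater_ids)
    ratings = np.asarray(ratings, dtype=float)
    n = len(ratings)
    n_c = candidate_ids.max() + 1
    n_r = rater_ids.max() + 1
    # Dummy-coded design matrix: one column per candidate, one per rater.
    X = np.zeros((n, n_c + n_r))
    X[np.arange(n), candidate_ids] = 1.0
    X[np.arange(n), n_c + rater_ids] = 1.0
    # lstsq returns the minimum-norm solution for this rank-deficient design.
    beta, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    rater_effects = beta[n_c:]
    rater_effects -= rater_effects.mean()  # identify effects relative to the average rater
    return ratings - rater_effects[rater_ids]
```

With each candidate rated by only 4 of 40 raters, the design is sparse but connected, which is what allows rater leniency to be separated from candidate ability.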

3.
The problem of measuring instructional effectiveness was examined, and a rationale was offered for employing “student progress on relevant objectives” for this purpose. To assess such progress, it was suggested that instructor ratings of the importance of objectives be combined with student ratings of progress on these objectives. On the basis of this suggestion, data were collected from 708 undergraduate classes at Kansas State University. An analysis of these data resulted in the following conclusions:
  1. Faculty members appeared to make reliable judgments of the relative importance of these objectives.
  2. Student progress ratings were made with acceptable reliability when there were 20–25 raters. Reliability of the overall progress measure was satisfactory when only 10 raters were used.
  3. Students used some discrimination in rating progress on various objectives, but their ratings were also noticeably subject to the halo effect.
  4. An indirect test of the validity of class progress ratings yielded positive results.
The proposed method of evaluating instruction appears generally feasible and useful. Its application would provide a practical approach to judging teaching success. Such an approach is an essential precursor to investigations of how teaching might be improved.

4.
This study of the reliability and validity of scales from the Child's Report of Parental Behavior (CRPBI) presents data on the utility of aggregating the ratings of multiple observers. Subjects were 680 individuals from 170 families. The participants in each family were a college freshman, the mother, the father, and one sibling. The results revealed moderate internal consistency (M = .71) for all rater types on the 18 subscales of the CRPBI, but low interrater agreement (M = .30). The same factor structure was observed across the four rater types; however, aggregation within raters across salient scales to form estimated factor scores did not improve rater convergence appreciably (M = .36). Aggregation of factor scores across two raters yielded much higher convergence (M = .51), and the four-rater aggregates yielded impressive generalizability coefficients (M = .69). These and other analyses suggested that the responses of each family member contained a small proportion of true variance and a substantial proportion of rater-specific systematic error. The latter can be greatly reduced by aggregating scores across multiple raters.
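The gains from aggregation follow the familiar Spearman-Brown pattern. As a rough check (my computation, not the authors'), starting from the single-rater convergence of about .30:

```latex
\rho_k = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}, \qquad
\rho_2 = \frac{2(.30)}{1.30} \approx .46, \qquad
\rho_4 = \frac{4(.30)}{1.90} \approx .63,
```

values close to the observed two-rater (.51) and four-rater (.69) coefficients.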

5.
In this paper, assessments of faculty performance for the determination of salary increases are analyzed to estimate interrater reliability. Using the independent ratings of six elected members of the faculty, correlations between the ratings are calculated and estimates of the reliability of the composite (group) ratings are generated. Average intercorrelations are found to range from 0.603 for teaching to 0.850 for research. The average intercorrelation for the overall faculty ratings is 0.794. Using these correlations, the reliability of the six-person group (the composite reliability) is estimated to be over 0.900 for each of the three areas and 0.959 for the overall faculty rating. Furthermore, little correlation is found between the ratings of performance levels of individual faculty members in the three areas of research, teaching, and service. The high intercorrelations and, consequently, the high composite reliabilities suggest that a reduction in the number of raters would have relatively small effects on reliability. The findings are discussed in terms of their relationship to issues of validity as well as to other questions of faculty assessment.
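The composite reliabilities quoted here can be reproduced with the Spearman-Brown formula applied to the average intercorrelations; the snippet below is an arithmetic check, on the assumption that this is the formula behind the reported values.

```python
def composite_reliability(mean_r, k):
    """Spearman-Brown reliability of the mean of k raters, given the
    average pairwise intercorrelation mean_r among them."""
    return k * mean_r / (1 + (k - 1) * mean_r)

print(round(composite_reliability(0.794, 6), 3))  # 0.959: matches the overall faculty rating
print(round(composite_reliability(0.603, 6), 3))  # 0.901: teaching, consistent with "over 0.900"
```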

6.
This study examined the effectiveness of rater training for prospective oral examiners for the PETS Level 1 test. Many-facet Rasch analysis was used to compare examiners' rating performance before and after training. The results showed that after training, the proportion of ratings in exact agreement with expert ratings increased, overly severe examiners made appropriate adjustments in applying the rating criteria, and all examiners' rating fit statistics fell within the acceptable range. Overall, the rater training was fairly effective and improved rating accuracy. Many-facet Rasch analysis helps to identify examiners who rate too leniently, too severely, or with poor fit, as well as anomalous rating behavior, providing a reliable basis for targeted training.

7.
Despite considerable interest in the topic of instructional quality in research as well as practice, little is known about the quality of its assessment. Using generalizability analysis as well as content analysis, the present study investigates how reliably and validly instructional quality is measured by observer ratings. Twelve trained raters judged 57 videotaped lesson sequences with regard to aspects of domain-independent instructional quality. Additionally, 3 of these sequences were judged by 390 untrained raters (i.e., student teachers and teachers). Depending on scale level and dimension, 16–44% of the variance in ratings could be attributed to instructional quality, whereas rater bias accounted for 12–40% of the variance. Although the trained raters referred more often to aspects considered essential for instructional quality, this was not reflected in the reliability of their ratings. The results indicate that observer ratings should be treated in a more differentiated manner in the future.
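For readers unfamiliar with generalizability analysis, the sketch below estimates the variance components behind percentages like these for a fully crossed sequences x raters design, using standard two-way ANOVA expected mean squares. It is a generic illustration under that assumed design, not the authors' code.

```python
import numpy as np

def variance_components(X):
    """Variance components for a fully crossed (sequences x raters)
    ratings matrix X, via two-way ANOVA expected mean squares.
    Returns (sequence, rater, residual) components."""
    n_s, n_r = X.shape
    grand = X.mean()
    ss_s = n_r * ((X.mean(axis=1) - grand) ** 2).sum()
    ss_r = n_s * ((X.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((X - grand) ** 2).sum() - ss_s - ss_r
    ms_s = ss_s / (n_s - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_s - 1) * (n_r - 1))  # interaction + error, as usual without replication
    var_seq = max((ms_s - ms_res) / n_r, 0.0)    # "instructional quality" variance
    var_rater = max((ms_r - ms_res) / n_s, 0.0)  # rater bias (severity) variance
    return var_seq, var_rater, ms_res
```

Dividing each component by their sum gives percentage attributions of the kind reported above.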

8.
Performance assessments are typically scored by having experts rate individual performances. The cost associated with using expert raters may represent a serious limitation in many large-scale testing programs. The use of raters may also introduce an additional source of error into the assessment. These limitations have motivated the development of automated scoring systems for performance assessments. Preliminary research has shown these systems to have application across a variety of tasks ranging from simple mathematics to architectural problem solving. This study extends research on automated scoring by comparing alternative automated systems for scoring a computer simulation test of physicians' patient management skills; one system uses regression-derived weights for components of the performance, the other uses complex rules to map performances into score levels. The procedures are evaluated by comparing the resulting scores to expert ratings of the same performances.
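The two system types contrasted here can be caricatured in a few lines. Everything below (the features, weights, and cut rules) is invented purely for illustration and stands in for the far more elaborate models in the study.

```python
import numpy as np

# Hypothetical performance features extracted from one simulated case.
features = np.array([0.7, 0.3, 0.9])   # e.g., sequencing, timeliness, thoroughness
weights = np.array([0.5, 0.2, 0.3])    # regression-derived weights (illustrative)

score_regression = features @ weights  # weighted-composite score

def score_by_rules(f):
    """Map a performance to a score level with explicit expert rules:
    a toy stand-in for the complex rule-based mapping in the study."""
    if f[0] >= 0.6 and f[2] >= 0.8:
        return 3  # proficient
    if f.mean() >= 0.5:
        return 2  # borderline
    return 1      # not proficient
```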

9.
Observer ratings are often used to measure instructional quality. They are, however, usually based on observations gathered over short periods of time. Few studies have attempted to determine whether these periods are sufficient to provide reliable measures of instructional quality. Using generalizability theory, this study investigates (a) how three dimensions of instructional quality – classroom management, personal learning support, and cognitive activation of students – vary between the lessons of a specific teacher, and (b) how many lessons per teacher are necessary to establish sufficiently reliable measures of these dimensions. Analyses are based on ratings of five lessons for 38 teachers. Classroom management and personal learning support were stable across lessons, whereas cognitive activation showed high variability. Consequently, one lesson per teacher suffices to measure classroom management and personal learning support, whereas nine lessons would be needed for cognitive activation. The importance of advancing our theoretical understanding of cognitive activation is discussed.
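The "how many lessons" question is a decision (D-)study: project the generalizability coefficient as the lesson-to-lesson component is averaged over more lessons. The sketch below uses hypothetical variance components chosen only to mimic the pattern; the 1-lesson and 9-lesson conclusions themselves come from the study.

```python
def g_coefficient(var_teacher, var_residual, n_lessons):
    """Generalizability coefficient when scores are averaged over n_lessons."""
    return var_teacher / (var_teacher + var_residual / n_lessons)

def lessons_needed(var_teacher, var_residual, target=0.80):
    """Smallest number of lessons reaching the target coefficient."""
    n = 1
    while g_coefficient(var_teacher, var_residual, n) < target:
        n += 1
    return n

# Hypothetical components: a stable dimension vs. a highly lesson-variable one.
print(lessons_needed(var_teacher=0.8, var_residual=0.2))   # 1 (like classroom management)
print(lessons_needed(var_teacher=0.3, var_residual=0.66))  # 9 (like cognitive activation)
```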

10.
Machine learning has been frequently employed to automatically score constructed-response assessments. However, there is a lack of evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teachers' pedagogical content knowledge (PCK); each task contains three video-based constructed-response questions. 187 in-service science teachers watched the videos, each of which presented a given classroom teaching scenario, and then responded to the constructed-response items. Three human experts rated the responses, and the consensus human scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater, along with the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity to the scenarios. Results indicate that variability of scenarios impacts teachers' performance, but the impact depends significantly on the construct of interest; for each assessment task, the machine is always the most severe rater, compared to the three human raters. However, the machine is less sensitive than the human raters to the task scenarios. This means the machine scoring is more consistent and stable across scenarios within each of the two tasks.
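For reference, a common adjacent-category formulation of the many-facet Rasch model used for analyses like this one is shown below; the facet labels are my gloss, chosen to match the study's design.

```latex
\log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k
```

Here \(\theta_n\) is teacher n's proficiency, \(\delta_i\) the difficulty of scenario i, \(\alpha_j\) the severity of rater j (human or machine), and \(\tau_k\) the step from category k-1 to k. Rater-by-scenario sensitivity is examined by adding an interaction facet to this baseline model.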

11.
In the current study, two pools of 250 essays, all written in response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to each essay's true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By eliminating one, two, or three raters at a time and estimating the true scores from the remaining raters, an independent criterion was produced against which to judge the validity of the human raters and of the AES system, as well as the interrater reliability. The results of the study indicated that the automated scores correlate with human scores to the same degree as human raters correlate with each other. However, the findings regarding the validity of the ratings support the claim that the reliability and validity of AES diverge: although AES scoring is, naturally, more consistent than human rating, it is less valid.
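A minimal sketch of the leave-raters-out logic described here, using simulated stand-in data (the real study used two pools of 250 essays and 14–15 raters per group; the noise levels below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
true_score = rng.normal(0, 1, 250)
ratings = true_score[:, None] + rng.normal(0, 1, (250, 15))  # 15 noisy human raters
aes = true_score + rng.normal(0, 0.8, 250)                   # stand-in AES scores

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

human_vs, aes_vs = [], []
for j in range(ratings.shape[1]):
    # Criterion for rater j: the mean of the other raters, a true-score proxy.
    criterion = np.delete(ratings, j, axis=1).mean(axis=1)
    human_vs.append(corr(ratings[:, j], criterion))
    aes_vs.append(corr(aes, criterion))

# Validity of held-out humans vs. the AES against the same criterion.
print(round(np.mean(human_vs), 3), round(np.mean(aes_vs), 3))
```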

12.
OBJECTIVE: There were two aims: first, to determine to what extent four variables (disclosure, doll play, affect, and collateral information) affect the decision-making processes of child sexual abuse experts and lay persons when confronted with an abuse allegation; and second, to see how these two groups of raters might differ from one another. METHOD: A randomized block partially confounded factorial design was used. Participants made abuse-likelihood and confidence ratings in response to six hypothetical cases of child sexual abuse, four of which had varying combinations of the four types of information and two of which were constant across all raters. Participants also completed attitude and knowledge questionnaires. RESULTS: Disclosure and collateral information both had large effects on both rater groups. Doll play and affect had little or no effect on the decisions of either group. Experts were slightly more conservative in their judgments overall than students were. Experts also displayed more knowledge of the sexual abuse literature and more child-believing attitudes than their student counterparts. CONCLUSION: Concrete information such as disclosure statements and collateral information affected abuse decisions, while inferential data such as doll play and affect did not. The goal of these evaluations may be the clarification of such concrete information, with the inferential data used only to guide one's inquiry. This conclusion argues against the concern that experts might jump to conclusions of abuse merely on the basis of suggestive, symbolic material.

13.
This study investigates how art teachers judge the creativity of student art work. Both conceptual and operational definitions given by teachers were studied. Furthermore, cultural exposure, as measured by a teacher's exposure to non-Asian cultures, was studied to see how it might influence art teachers' judgements. Two instruments were developed for the purpose of this study. One instrument was a questionnaire designed to collect qualitative data from the respondents. The second instrument was used to measure art teachers' ratings of creativity. The data revealed that the conceptual definitions given by art teachers varied considerably. However, when asked to rate subjectively the creativity of art products, art teachers reached moderate agreement. Exposure to non-Asian cultures did not seem to have an effect on an art teacher's operational definition of creativity. The results of the study imply that the usefulness of the term creativity needs to be reviewed, both in general usage and in documents such as curricula. It would seem that there is no clear notion among art teachers as to what constitutes creativity and a creative product.

14.
The Third Edition of the ACEI Global Guidelines Assessment (GGA) was evaluated for its effectiveness as an international assessment tool for use by early childhood educators to develop, assess, and improve program quality worldwide. This expanded study was conducted in nine countries [People's Republic of China (2 sites), Guatemala, India, Italy, Mexico (2 sites), Peru (2 sites), Taiwan, Thailand, United States] to continue the investigation of the psychometric properties of the GGA. A total of 346 programs and 678 early care and education professionals participated in this study. Results primarily confirmed the findings of the previous study (Hardin et al. in Early Child Educ J 41(2): 91–101, 2013), indicating that the GGA showed strong to moderate internal consistency and interrater reliability for subscale ratings across this larger number of countries and programs. The congruence of item ratings and written evidence to support ratings was acceptable, although some programs had lower participation in providing evidence. To test the concurrent validity of ratings, external raters also evaluated a subset of programs (n = 44 from Peru and the United States) on both the GGA and the Early Childhood Environment Rating Scale-Revised, which showed moderate positive correlations. Patterns of program practices were also identified within and across the participating sites and countries. Results suggest that the GGA can be used as an onsite evaluation method that can help stakeholder participants (teachers and administrators) increase their awareness of program quality standards and serve as an assessment method for their own programs. In particular, the results suggest the GGA is a reliable and useful instrument that can be used effectively by early childhood stakeholders for assessing and improving program quality worldwide (Bergen and Hardin in Child Educ 91(4): 259–264, 2015).

15.
This study used a mixed-methods approach to analyze how CET-4 essay raters apply the rating criteria. Twenty-six CET-4 essay raters scored 30 simulated CET-4 essays and provided three reasons for each score, ranked by importance. The results showed that: (1) despite differences in severity, agreement among the 26 raters was fairly good, and most raters were also internally consistent; (2) some raters' scoring rationales tended toward a single dimension; and (3) 71.91% of the reasons given reflected the five textual features specified in the CET-4 essay rating criteria, indicating that most raters understood and applied the criteria fairly accurately.

16.
Rating criteria are critically important in writing assessment, and different scoring methods affect raters' scoring behavior. This study shows that although holistic and analytic scoring of English writing are both reliable, rater severity and examinees' writing scores vary considerably between the two methods. Overall, under holistic scoring, rater severity converged and approached the ideal value; under analytic scoring, examinees received higher writing scores, and rater severity also differed significantly across raters. Holistic scoring is therefore preferable for high-stakes examinations that decide examinees' futures.

17.
The Consensual Assessment Technique (CAT), developed by Amabile [Amabile, T.M. (1982). Social psychology of creativity: A consensual assessment technique. Journal of Personality and Social Psychology, 43, 997–1013], is frequently used to evaluate the creativity of productions. Judgments obtained with the CAT are usually reliable and valid. However, notable individual differences in judgment exist. This empirical study shows that creativity judgments for advertisements vary depending on (1) the level of two underlying components of creativity (originality and appropriateness), (2) the creative ability of the judges, i.e., variations in their ability to be original, and finally, (3) instructions or training that they received about the topic of creativity assessment. Effects of advertisements' appropriateness and judges' ability to be original on individual differences in creativity judgments are discussed.

18.
19.
Martin. (2009). Assessing Writing, 14(2), 88–115.
The demand for valid and reliable methods of assessing second and foreign language writing has grown in significance in recent years. One such method is the timed writing test, which has a central place in many testing contexts internationally. The reliability of this test method is heavily influenced by the scoring procedures, including the rating scale to be used and the success with which raters can apply the scale. Reliability is crucial because important decisions and inferences about test takers are often made on the basis of test scores. Determining the reliability of the scoring procedure frequently involves examining the consistency with which raters assign scores. This article presents an analysis of the rating of two sets of timed tests written by intermediate-level learners of German as a foreign language (n = 47) by two independent raters who used a newly developed detailed scoring rubric containing several categories. The article discusses how the rubric was developed to reflect a particular construct of writing proficiency. Implications for the reliability of the scoring procedure are explored, and considerations for more extensive cross-language research are discussed.
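One common index of the consistency with which two raters assign scores on an ordinal rubric is quadratically weighted kappa; the implementation below is a generic illustration, not necessarily a statistic the article itself reports.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_levels):
    """Agreement between two raters' ordinal scores (coded 0..n_levels-1),
    penalizing larger disagreements more heavily."""
    a, b = np.asarray(a), np.asarray(b)
    observed = np.zeros((n_levels, n_levels))
    for i, j in zip(a, b):
        observed[i, j] += 1  # joint distribution of the two raters' scores
    # Quadratic disagreement weights, scaled to [0, 1].
    w = np.subtract.outer(np.arange(n_levels), np.arange(n_levels)) ** 2
    w = w / (n_levels - 1) ** 2
    # Expected agreement under independence of the two raters.
    expected = np.outer(observed.sum(1), observed.sum(0)) / observed.sum()
    return 1 - (w * observed).sum() / (w * expected).sum()
```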

20.
Rater-mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater-mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater-mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater-mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater-mediated assessments are suggested.
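The contrast between the cumulative and unfolding response processes shows up directly in the item response functions. The hyperbolic cosine form below follows one common formulation of the Andrich and Luo model and is given as a hedged sketch rather than the article's exact specification:

```latex
\text{Rasch (cumulative):}\quad
P(X_{ni}=1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
\qquad
\text{HCM (unfolding):}\quad
P(X_{ni}=1) = \frac{\exp(\gamma_i)}{\exp(\gamma_i) + 2\cosh(\theta_n - \delta_i)}
```

Under the cumulative model the probability rises monotonically with \(\theta_n - \delta_i\); under the unfolding model it peaks where the rater's location matches the stimulus (\(\theta_n = \delta_i\)) and falls off in both directions, with \(\gamma_i\) governing how sharply.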
