
As an alternative to rubric scoring, comparative judgment generates essay scores by aggregating decisions about the relative quality of the essays. Comparative judgment eliminates certain scorer biases and potentially reduces training requirements, thereby allowing a large number of judges, including teachers, to participate in essay evaluation. The purpose of this study was to assess the validity, labor costs, and efficiency of comparative judgments as a potential substitute for rubric scoring. An analysis of two essay prompts revealed that comparative judgment measures were comparable to rubric scores at a level similar to that expected of two professional scorers. The comparative judgment measures correlated slightly higher than rubric scores with a multiple-choice writing test. Score reliability exceeding .80 was achieved with approximately nine judgments per response. The average judgment time was 94 seconds, which compared favorably to 119 seconds per rubric score. Practical challenges to future implementation are discussed.
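The abstract above does not say which statistical model turns pairwise decisions into essay scores. As a minimal sketch, assuming a Bradley-Terry model fitted with a simple iterative update (one common choice for comparative judgment, not necessarily the one used in the study), the aggregation step might look like this. The function name `bradley_terry` and the toy judgments are hypothetical.

```python
def bradley_terry(n_items, comparisons, iterations=200):
    """Estimate relative quality from pairwise judgments.

    comparisons is a list of (winner, loser) index pairs.
    Returns one non-negative strength per item, normalized to sum to 1.
    Assumption: a Bradley-Terry model, fitted by a fixed number of
    minorization-maximization updates (no convergence check).
    """
    wins = [0] * n_items
    pair_counts = {}  # unordered pair -> number of times it was judged
    for winner, loser in comparisons:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strengths = [1.0 / n_items] * n_items
    for _ in range(iterations):
        updated = []
        for i in range(n_items):
            denom = 0.0
            for (a, b), n in pair_counts.items():
                if i in (a, b):
                    denom += n / (strengths[a] + strengths[b])
            # Items that were never judged keep their current strength.
            updated.append(wins[i] / denom if denom > 0 else strengths[i])
        total = sum(updated)
        strengths = [s / total for s in updated]
    return strengths


# Hypothetical judgments on three essays: (winner, loser).
judgments = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)] + [(0, 2)] * 2
scores = bradley_terry(3, judgments)
```

Here essay 0 wins most of its comparisons and essay 2 loses most of its own, so the estimated strengths rank the essays in that order. A production analysis would also need a reliability estimate, which this sketch omits.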

The present study was conducted to establish a scoring key for the Guilford Zimmerman Temperament Survey appropriate for predicting academic performance. To maximize reliability of criterion data, academic performance was operationally defined as cumulative college grade point average based on a minimum of four semesters’ course work. The scoring key developed was predictive of academic performance (cross validated r = .39, p < .01). The magnitude of the relationship between scores on this key and cumulative grade point average compares favorably with the validities reported for the widely used academic aptitude tests in predicting the same criterion. Lesser relationships were observed between scores on the ten publisher-supplied scales and college grades. Results point to the utility of non-cognitive measures in predicting academic performance, particularly when keys tailored to the specific situation are empirically derived. Suggestions for future research are advanced.


This study investigated the reliability, validity, and utility of the following three measures of letter-formation quality: (a) a holistic rating system, in which examiners rated letters on a five-point Likert-type scale; (b) a holistic rating system with model letters, in which examiners used model letters that exemplified specific criterion scores to rate letters; and (c) a correct/incorrect procedure, in which examiners used transparent overlays and standard verbal criteria to score letters. Intrarater and interrater reliability coefficients revealed that the two holistic scoring procedures were unreliable, whereas scores obtained by examiners who used the correct/incorrect procedure were consistent over time and across examiners. Although all three of the target measures were sensitive to differences between individual letters, only the scores from the two holistic procedures were associated with other indices of handwriting performance. Furthermore, for each of the target measures, variability in scores was, for the most part, not attributable to the level of experience or sex of the respondents. Findings are discussed with respect to criteria for validating an assessment instrument.

This yearlong study was implemented in seventh-grade life science classes with the students' regular teacher serving as teacher/researcher. In the study, a method of scoring concept maps was developed to assess knowledge and comprehension levels of science achievement. By linking scoring of concept maps to instructional objectives, scores were based upon the correctness of propositions. High correlations between the concept map scores and unit multiple choice tests provided strong evidence of the content validity of the map scores. Similarly, correlations between map scores and state criterion-referenced and national norm-referenced standardized tests were indicators of high concurrent validity. The approach to concept map scoring in the study represents a distinct departure from traditional methods that focus on characteristics such as hierarchy and branching. A large body of research has demonstrated the utility of such methods in the assessment of higher-level learning outcomes. The results of the study suggest that a concept map might be used in assessing declarative and procedural knowledge, both of which have a place in the science classroom. One important implication of these results is that science curriculum and its corresponding assessment need not be dichotomized into knowledge/comprehension versus higher-order outcomes. © 1998 John Wiley & Sons, Inc. J Res Sci Teach 35: 1103–1127, 1998.


The authors lament the fact that there does not seem to be much agreement as to the proper method of scoring tests. The use of the scoring formula is advocated by some and criticized by others. Literature is reviewed showing that the basic assumption behind the scoring formula (namely, that all wrong answers are due to chance guessing) is false. Arguments are presented for and against the continued use of the formula, with the conclusion that its use cannot be justified. A new aspect of this question, that use of the formula may create behavior patterns detrimental to ingenuity and creativity, is also presented.


In each of the 9th (N = 75) and 11th (N = 84) grades and on each subtest of the ITED battery, overachieving and underachieving groups were identified by using the predicted scores plus and minus one SB as the cutting points. When these two groups were compared on tests of creative thinking, the mean scores did not show any consistent trend across the different achievement areas to favor either one of these groups. Because of the low correlations between IQ and creativity (.12 in 9th, -.01 in 11th) and achievement and creativity (-.02 to .21 in 9th, -.16 to .07 in 11th), results of analyses of covariance, controlling for IQ, did not affect the findings.

Multiple discriminant function analysis (MDFA) was conducted with data from 255 Strange Situations conducted and scored by Ainsworth and her colleagues. Cross-validated discriminant functions and classification weights were obtained, allowing attachment classifications (A, B, C) to be assigned directly from scores on interactive behavior and crying during reunion episodes. In the past, classification agreement within laboratories has often been used as a training criterion. Unfortunately, this does not insure that classification criteria agreed upon within a laboratory are comparable across laboratories, nor does it insure that agreed upon criteria will yield the same classifications that would have been assigned by the researchers who developed the scoring system. The present results enable researchers who have mastered the scoring systems for reunion behavior and crying to obtain attachment classifications directly from scores on these variables. Alternatively, this procedure may be used to guide the training of, and validate classification decisions by, local judges.

A framework for evaluation and use of automated scoring of constructed‐response tasks is provided that entails both evaluation of automated scoring and guidelines for implementation and maintenance in the context of constantly evolving technologies. Validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring for tests used for high‐stakes purposes. These guidelines are intended to be generalizable to new automated scoring systems and as existing systems change over time.


Both the California Test of Personality (CTP) and the Mooney Problem Check List (MPCL) were administered to 301 undergraduate university students. The 360 correlations obtained between the various scores of the two tests produced 288 which were significant at the .01 level of confidence and 25 which were significant at the .05 level. The relationships were primarily negative, i.e., those who demonstrated a high degree of adjustment (high score on the CTP) checked fewer problems on the MPCL than those who demonstrated poor adjustment (low score on the CTP). On the basis of this study which is in agreement with similar work by an earlier researcher, the present writer asserts that the MPCL may permit an assessment of the person's adjustment status.


The responses of three groups of teachers to a rating task in which they were asked to indicate how many of their pupils were “bright”, “above average”, “below average” and “dull” were compared. In two of the groups, teachers had been provided with test information based on performance on nationally standardized ability and attainment tests. In the third group tests had been administered but no results were provided.

No differences between the groups were found in terms of the extent to which teacher ratings of ability levels correspond with mean ability test scores. In addition, teachers, irrespective of the group to which they belonged, were found to display a tendency to place more pupils in the above average categories than the below average categories. Finally, no support was found for an hypothesis which suggested that test information would differentially affect the ratings of teachers of classes with pupils who were typical and untypical with respect to age.

The fact that the correlations between mean ratings and mean test scores were found to be fairly high in all three groups (they ranged from .51 to .60) suggests a reason for the failure of test information to impact on teachers’ judgements. The degree of agreement between teachers and tests that the correlations reveal means that there is less scope for a convergence of ratings on tests to occur than might otherwise be the case.


This study examines the internal consistency of Novak and Gowin's scoring scheme and its effect on the prediction validity of concept mapping as an alternative science classroom achievement assessment. Data were collected in three typical situations: very limited concept mapping experience with free‐style concept mapping; some concept mapping experience with questions provided; and extensive concept mapping experience with a list of concepts provided. It was found that Novak's scoring scheme was not internally consistent, and therefore there was generally no significant correlation between students’ scores on concept mapping and students’ scores on conventional classroom achievement assessments. The need for a new scoring scheme when concept mapping is used as an alternative science assessment is discussed.

Formula scoring is a procedure designed to reduce multiple-choice test score irregularities due to guessing. Typically, a formula score is obtained by subtracting a proportion of the number of wrong responses from the number correct. Examinees are instructed to omit items when their answers would be sheer guesses among all choices but otherwise to guess when unsure of an answer. Thus, formula scoring is not intended to discourage guessing when an examinee can rule out one or more of the options within a multiple-choice item. Examinees who, contrary to the instructions, do guess blindly among all choices are not penalized by formula scoring on the average; depending on luck, they may obtain better or worse scores than if they had refrained from this guessing. In contrast, examinees with partial information who refrain from answering tend to obtain lower formula scores than if they had guessed among the remaining choices. (Examinees with misinformation may be exceptions.) Formula scoring is viewed as inappropriate for most classroom testing but may be desirable for speeded tests and for difficult tests with low passing scores. Formula scores do not approximate scores from comparable fill-in-the-blank tests, nor can formula scoring preclude unrealistically high scores for examinees who are very lucky.
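The abstract above describes the formula score only in words. A minimal sketch, assuming the classic correction-for-guessing form R - W/(k - 1), where R is the number right, W the number wrong, and k the number of options per item (the abstract does not state the exact proportion, so this form is an assumption), is given below. The function name `formula_score` is hypothetical.

```python
def formula_score(num_right: int, num_wrong: int, num_options: int) -> float:
    """Correction-for-guessing score: R - W / (k - 1).

    Omitted items count neither as right nor as wrong.
    Assumption: every item has the same number of options, k >= 2.
    """
    if num_options < 2:
        raise ValueError("each item needs at least two options")
    return num_right - num_wrong / (num_options - 1)


# A blind guesser on 40 four-option items expects 10 right and 30 wrong,
# which nets a formula score of zero on average.
blind_guess_expected = formula_score(10, 30, 4)

# An examinee with 30 right, 6 wrong, and 4 omitted items.
partial_omit = formula_score(30, 6, 4)
```

The first case illustrates the abstract's point that blind guessing is not penalized on average: the expected gain from lucky guesses is exactly offset by the deduction for wrong answers.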

The percentage of students retaking college admissions tests is rising. Researchers and college admissions offices currently use a variety of methods for summarizing these multiple scores. Testing organizations such as ACT and the College Board, interested in validity evidence like correlations with first‐year grade point average (FYGPA), often use the most recent test score available. In contrast, institutions report using a variety of composite scoring methods for applicants with multiple test records, including averaging and taking the maximum subtest score across test occasions (“superscoring”). We compare four scoring methods on two criteria. First, we compare correlations between scores and FYGPA by scoring method. We find them similar. Second, we compare the extent to which test scores differentially predict FYGPA by scoring method and number of retakes. We find that retakes account for additional variance beyond standardized achievement and positively predict FYGPA across all scoring methods. Superscoring minimizes this differential prediction: although it may seem that superscoring should inflate scores across retakes, this inflation is “true” in that it accounts for the positive effects of retaking for predicting FYGPA. Future research should identify factors related to retesting and consider how they should be used in college admissions.
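To make the scoring methods named in the abstract above concrete, the following sketch computes three of them (most recent, average, and superscore) for one applicant with several test records. It assumes that a composite is the plain mean of the subtest scores and that records are ordered oldest to newest; both are illustrative assumptions, and the function names and subtest labels are hypothetical.

```python
from statistics import mean

# One test occasion: subtest name -> score.
Record = dict[str, float]


def composite(record: Record) -> float:
    """Composite for one occasion (assumed here: mean of the subtests)."""
    return mean(record.values())


def most_recent(records: list[Record]) -> float:
    """Composite from the latest occasion (records ordered oldest first)."""
    return composite(records[-1])


def average(records: list[Record]) -> float:
    """Mean of the composites across all occasions."""
    return mean(composite(r) for r in records)


def superscore(records: list[Record]) -> float:
    """Composite built from the best score on each subtest across occasions."""
    subtests = records[0].keys()
    return mean(max(r[s] for r in records) for s in subtests)


# Hypothetical applicant who tested twice.
records = [
    {"english": 20.0, "math": 26.0, "reading": 22.0, "science": 24.0},
    {"english": 24.0, "math": 24.0, "reading": 26.0, "science": 22.0},
]
```

For this applicant the superscore exceeds both single-occasion composites, because each subtest contributes its best result regardless of the occasion on which it was earned.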


The relationship between teacher locus of control (A), teacher behavior (B), student behavior (C), and student achievement (D) was investigated. It was predicted that internal teachers would produce higher achieving students by maintaining a controlled learning environment, thereby engaging students in more appropriate on-task behavior. The first part of the study found modest correlations between I-E scores of 44 fourth grade teachers and student achievement in reading, language, and math. In the second part, the behavior of a subsample of 17 teachers and their students was observed. Although the complete A-B-C-D link was not obtained, several parts of the model did relate significantly.

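This study examined the effect of rater training on prospective oral examiners for PETS Level 1. Many-facet Rasch analysis was used to compare the examiners' ratings before and after training. The results showed that, after training, the proportion of ratings in exact agreement with expert ratings increased, examiners who had rated too severely adjusted their standards appropriately, and the fit statistics of all examiners fell within the acceptable range. Overall, the rater training was fairly effective and improved rating accuracy. Many-facet Rasch analysis helps identify examiners who rate too leniently, too severely, or with poor fit, as well as anomalous ratings, and thus provides a reliable basis for targeted training.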


This study investigates the role of automated scoring and feedback in supporting students’ construction of written scientific arguments while learning about factors that affect climate change in the classroom. The automated scoring and feedback technology was integrated into an online module. Students’ written scientific argumentation occurred when they responded to structured argumentation prompts. After submitting the open-ended responses, students received scores generated by a scoring engine and written feedback associated with the scores in real time. Using the log data that recorded argumentation scores as well as argument submission and revision activities, we answer three research questions: first, how students behaved after receiving the feedback; second, whether and how students’ revisions improved their argumentation scores; and third, whether item difficulties shifted with the availability of the automated feedback. Results showed that the majority of students (77%) made revisions after receiving the feedback, and students with higher initial scores were more likely to revise their responses. Students who revised had significantly higher final scores than those who did not, and each revision was associated with an average increase of 0.55 on the final scores. Analysis of item difficulty shifts showed that written scientific argumentation became easier after students used the automated feedback.

By far, the most frequently used method of validating (the interpretation and use of) automated essay scores has been to compare them with scores awarded by human raters. Although this practice is questionable, human-machine agreement is still often regarded as the “gold standard.” Our objective was to refine this model and apply it to data from a major testing program and one system of automated essay scoring. The refinement capitalizes on the fact that essay raters differ in numerous ways (e.g., training and experience), any of which may affect the quality of ratings. We found that automated scores exhibited different correlations with scores awarded by experienced raters (a more compelling criterion) than with those awarded by untrained raters (a less compelling criterion). The results suggest potential for a refined machine-human agreement model that differentiates raters with respect to experience, expertise, and possibly even more salient characteristics.

Parent‐teacher agreement has become an important issue in children's psychological assessment. However, the amount of research available for preschool children is small and mainly based on one index of agreement with samples of modest size/representativeness. This study examined parent‐teacher agreement (correlations) and discrepancies (t tests) on preschoolers' social skills and problem behaviors for the normative Portuguese sample (N = 1,000) of the Preschool and Kindergarten Behavior Scales – 2nd Edition (PKBS‐2). Analyses were replicated according to the child's gender and mothers' educational level. Correlational analyses suggest weak to moderate informant agreement (mean correlation = .32). Parents' and teachers' ratings are significantly different for all PKBS‐2 scores, with parents assigning higher scores both on social skills and problem behaviors. Results highlight the importance of both parents' and teachers' perspectives to achieve a more comprehensive picture of preschoolers' social‐emotional behaviors, and reinforce the evidence of reliability of the PKBS‐2 Portuguese version.


This study tests the hypotheses that (1) grades in high school and college as well as scores on nationally-standardised tests of scholastic aptitude and professional knowledge (National Teacher Examinations, NTE) do not predict rated success in teaching, but that (2) scholastic aptitude and achievement do predict scores on the NTE. In a sample of 280 student teachers, evidence was found to support both of these hypotheses.

The purpose of this study was to examine how different scoring procedures affect interpretation of maze curriculum‐based measurements. Fall and spring data were collected from 199 students receiving supplemental reading instruction. Maze probes were scored first by counting all correct maze choices, followed by four scoring variations designed to reduce the effect of random guessing. Pearson's r correlation coefficients were calculated among scoring procedures and between maze scores and a standardized measure of reading. In addition, t tests were conducted to compare fall to spring growth for each scoring procedure. Results indicated that scores derived from the different procedures are highly correlated, demonstrate criterion‐related validity, and show fall‐to‐spring growth. Educators working with struggling readers may use any of the five scoring procedures to obtain technically sound scores.
