Similar Articles
20 similar articles found
1.
This study investigated the relationship between middle school students' scores for a written assignment (N = 162) and a process that involved students in generating criteria and self-assessing with a rubric. Gender, time spent writing, grade level, prior rubric use, and previous achievement in English were also examined. The treatment involved using a model essay to scaffold the process of generating a list of criteria for an effective essay, reviewing a written rubric, and using the rubric to self-assess first drafts. The comparison condition involved generating a list of criteria and reviewing first drafts. Findings include main effects of treatment, gender, grade level, writing time, and previous achievement on total essay scores, as well as main effects on scores for every criterion on the scoring rubric. The results suggested that reading a model, generating criteria, and using a rubric to self-assess can help middle school students produce more effective writing.

2.
ABSTRACT

In the current study, two pools of 250 essays, all written in response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation of each essay's true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By eliminating one, two, or three raters at a time and calculating an estimate of the true scores from the remaining raters, an independent criterion was produced against which to judge the validity of the human raters, the validity of the AES system, and the interrater reliability. The results of the study indicated that the automated scores correlate with human scores to the same degree as human raters correlate with each other. However, the findings regarding the validity of the ratings support the claim that the reliability and validity of AES diverge: although AES scoring is, naturally, more consistent than human rating, it is less valid.
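As a rough illustration of the evaluation design described above (a sketch, not the authors' code; the array names, the hold-out logic, and the use of Pearson correlation as the validity index are assumptions), the following Python fragment builds a proxy true score from the retained raters and compares how strongly a held-out human rater and the automated scores agree with that criterion.

    import numpy as np

    def validity_estimate(human_ratings, machine_scores, n_holdout=1, seed=0):
        """Correlate a held-out human rater and the AES scores against the mean
        of the remaining raters, used here as a proxy for the essays' true scores.
        human_ratings: (n_essays, n_raters) array; machine_scores: (n_essays,) array."""
        rng = np.random.default_rng(seed)
        n_essays, n_raters = human_ratings.shape
        held_out = rng.choice(n_raters, size=n_holdout, replace=False)
        remaining = np.setdiff1d(np.arange(n_raters), held_out)

        criterion = human_ratings[:, remaining].mean(axis=1)  # proxy true score
        human_validity = np.corrcoef(human_ratings[:, held_out].mean(axis=1), criterion)[0, 1]
        machine_validity = np.corrcoef(machine_scores, criterion)[0, 1]
        return human_validity, machine_validity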

3.
Assessing Writing, 2008, 13(2), 80-92
The scoring of student essays by computer has generated much debate and subsequent research. The majority of the research thus far has focused on validating automated scoring tools by comparing the electronic scores to human scores of writing or other measures of writing skills, and on exploring the predictive validity of the automated scores. However, very little research has investigated possible effects of the essay prompts. This study endeavoured to do so by exploring test scores for three different prompts for the ACCUPLACER® WritePlacer® Plus test, which is scored by the IntelliMetric® automated scoring system. The results indicated that there was no significant difference among the prompts overall, among males, between males and females, by native language, or in comparison to scores generated by human raters. However, there was a significant difference in mean scores by topic for females.

4.
Abstract

Previous research has shown that children and adolescents can progress in the stages of moral judgment. However, in the case of adults, Kohlberg (1973) suggested there might be crystallization after the age of 25.

The purpose of this study was to establish whether the structure of moral judgment of adults could be systematically encouraged toward change. Thirty‐six adults (three groups) enrolled in an adult sexology course were assessed to determine stage level at the beginning of the course, and post‐tested at the completion of the course. Four dilemmas were used: two for general moral judgment, and two for sexual moral judgment. During the 45‐hour course, subjects were systematically introduced to arguments of a higher stage, and discussions focused on the axiological aspects of the adults’ sexual life.

Results show that there was a significant increase in the scores at the post‐test, both in general and in sexual moral judgments; subjects over 25 also increased their scores, thus indicating that the structure of moral judgment is not crystallized after that age. The existence of a differential between general and sexual moral judgments was also corroborated.

Implications with regard to the use of the ‘+1 stage’ technique for adult education, and more particularly for adult sexual education, are discussed.


5.
ABSTRACT

Automated essay scoring is a developing technology that can provide efficient scoring of large numbers of written responses. Its use in higher education admissions testing provides an opportunity to collect validity and fairness evidence to support current uses and inform its emergence in other areas such as K-12 large-scale assessment. In this study, human and automated scores on essays written by college students with and without learning disabilities and/or attention deficit hyperactivity disorder were compared, using a nationwide (U.S.) sample of prospective graduate students taking the revised Graduate Record Examination. The findings are that, on average, human raters and the automated scoring engine assigned similar essay scores for all groups, despite average differences among groups with respect to essay length and spelling errors.

6.
This study investigated rubric-referenced calibration, a measure of the relationship between one's performance and the accuracy of one's judgments, and its interaction with writing achievement. Undergraduate students (N = 596) were assigned to one of three calibration conditions: (a) global, (b) global and general criteria, or (c) global and detailed criteria. Students in all three conditions provided global predictions and postdictions of essay exam scores. Although the calibration conditions did not affect calibration accuracy overall, a statistically significant main effect of prior achievement on criterion-level calibration accuracy was found: high achievers made more-accurate predictions and postdictions by criteria than low achievers. Regardless of achievement level, students in the detailed rubric condition had higher postdictive accuracy for the organization criterion than students in the general rubric condition.
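For readers unfamiliar with how calibration accuracy can be scored, the sketch below computes a simple absolute-accuracy index from predicted and obtained exam scores. This is a generic formulation under an assumed 100-point scale, not necessarily the exact measure used in the study, and the sample values are hypothetical.

    import numpy as np

    def calibration_accuracy(predicted, actual, max_score=100):
        """1 minus the mean normalized gap between predicted and obtained scores;
        a value of 1.0 indicates perfectly calibrated judgments."""
        predicted = np.asarray(predicted, dtype=float)
        actual = np.asarray(actual, dtype=float)
        return 1.0 - np.mean(np.abs(predicted - actual) / max_score)

    # Hypothetical predictions vs. obtained essay-exam scores for five students
    print(calibration_accuracy([80, 65, 90, 70, 55], [72, 60, 88, 75, 40]))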

7.
A nonexperimental design was used to determine whether the verbal scores of low-income gifted fifth graders (n = 38) differed from those of their higher-income peers (n = 83). The Otis-Lennon School Ability Test, Eighth Edition, and the Stanford Achievement Test, Tenth Edition, were used to collect student data. Results of a MANOVA showed a statistically significant difference between the verbal scores of the two groups, with low-income students scoring significantly lower. A large effect size for the multivariate main effect of income level on verbal intelligence and verbal achievement scores was found (η² = .19). The existence of a verbal-nonverbal score discrepancy in low-income students calls into question the practice of using only nonverbal tests, or the nonverbal parts of an IQ test, to identify and place students in gifted programmes. These results also underscore the need to nurture underdeveloped verbal abilities when they occur in low-income students.
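The η² reported above is a multivariate effect size; as a simplified analogue, the sketch below computes the univariate version (SS_between / SS_total) for a one-way design. The group arrays are hypothetical, and this is an illustration rather than the study's analysis.

    import numpy as np

    def eta_squared(groups):
        """Univariate eta-squared: between-group sum of squares over total sum of squares.
        groups: list of 1-D arrays, one per group (e.g., low- vs. higher-income scores)."""
        all_scores = np.concatenate(groups)
        grand_mean = all_scores.mean()
        ss_total = ((all_scores - grand_mean) ** 2).sum()
        ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
        return ss_between / ss_total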

8.
Martin. Assessing Writing, 2009, 14(2), 88-115
The demand for valid and reliable methods of assessing second and foreign language writing has grown significantly in recent years. One such method is the timed writing test, which has a central place in many testing contexts internationally. The reliability of this test method is heavily influenced by the scoring procedures, including the rating scale to be used and the success with which raters can apply the scale. Reliability is crucial because important decisions and inferences about test takers are often made on the basis of test scores. Determining the reliability of the scoring procedure frequently involves examining the consistency with which raters assign scores. This article presents an analysis of the rating of two sets of timed tests written by intermediate-level learners of German as a foreign language (n = 47) by two independent raters who used a newly developed detailed scoring rubric containing several categories. The article discusses how the rubric was developed to reflect a particular construct of writing proficiency. Implications for the reliability of the scoring procedure are explored, and considerations for more extensive cross-language research are discussed.

9.
Abstract

This article investigates the effect of raters' perception of a given topic on students' writing scores. Three raters, TEFL teachers with backgrounds in teaching essay and letter writing in English, scored the compositions. The means of the three raters' sets of scores were compared using a two-way analysis of variance (ANOVA). With respect to the study's hypothesis that "raters' perception has no effect on writers' composition scores," the ANOVA result was nonsignificant, suggesting that other factors, such as students' attitude and cognitive ability, may affect raters' judgments of student writing performance.

10.
11.
Abstract

We developed and tested a behavioral version of the Defining Issues Test-1 revised (DIT-1r), which is a measure of the development of moral judgment. We conducted a behavioral experiment using the behavioral Defining Issues Test (bDIT) to examine the relationship between participants' moral developmental status, moral competence, and reaction time when making moral judgments. We found that when judgments were made based on the preferred moral schema, the reaction time for moral judgments was significantly moderated by moral developmental status. In addition, as participants became more confident in their moral judgments, they differentiated the preferred schema from the other schemas better, particularly when their abilities for moral judgment were more developed.

12.
In this study, we examined the effect of two metacognitive scaffolds on the accuracy of confidence judgments made while diagnosing dermatopathology slides in SlideTutor. Thirty-one (N = 31) first- to fourth-year pathology and dermatology residents were randomly assigned to one of the two scaffolding conditions. The cases used in this study were selected from the domain of nodular and diffuse dermatitides. Both groups worked with a version of SlideTutor that provided immediate feedback on their actions for 2 h before proceeding to solve cases in either the Considering Alternatives or Playback condition. No immediate feedback was provided on actions performed by participants in the scaffolding mode. Measurements included learning gains (pre-test and post-test), as well as metacognitive performance, including Goodman-Kruskal Gamma correlation, bias, and discrimination. Results showed that participants in both conditions improved significantly in terms of their diagnostic scores from pre-test to post-test. More importantly, participants in the Considering Alternatives condition outperformed those in the Playback condition in the accuracy of their confidence judgments and the discrimination of the correctness of their assertions while solving cases. The results suggested that presenting participants with their diagnostic decision paths and highlighting correct and incorrect paths helps them to become more metacognitively accurate in their confidence judgments.
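The three monitoring indices named above (gamma, bias, and discrimination) can be computed roughly as follows. This is a generic sketch under common definitions, with confidence assumed on a 0-100 scale; it is not the SlideTutor implementation, and the function and variable names are illustrative.

    import numpy as np
    from itertools import combinations

    def monitoring_indices(confidence, correct):
        """Goodman-Kruskal gamma, bias, and discrimination for confidence judgments.
        confidence: per-case confidence ratings (assumed 0-100).
        correct:    per-case correctness (1 = correct diagnosis, 0 = incorrect)."""
        confidence = np.asarray(confidence, dtype=float)
        correct = np.asarray(correct, dtype=float)

        # Gamma: relative ordering of confidence vs. correctness across case pairs
        concordant = discordant = 0
        for i, j in combinations(range(len(correct)), 2):
            product = (confidence[i] - confidence[j]) * (correct[i] - correct[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
        pairs = concordant + discordant
        gamma = (concordant - discordant) / pairs if pairs else float("nan")

        conf01 = confidence / 100.0
        bias = conf01.mean() - correct.mean()  # positive values suggest overconfidence
        discrimination = conf01[correct == 1].mean() - conf01[correct == 0].mean()
        return gamma, bias, discrimination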

13.
This paper examines the effects of attended and unattended demonstratives on text processing, comprehension, and writing quality in two studies. In the first study, participants (n = 45) read 64 mini-stories in a self-paced reading task and identified the main referent in the clauses. The sentences varied in the type of demonstrative (i.e., this, that, these, or those) they contained and in whether the referent was followed by a demonstrative determiner and noun (i.e., an attended demonstrative) or a demonstrative pronoun (i.e., an unattended demonstrative). In the second study, 173 persuasive essays written by high school students were rated by expert judges on overall writing quality using a standardized rubric. Expert coders manually counted the number and types of demonstratives (attended and unattended) in each essay. These counts were used to predict the human scores of essay quality. The findings demonstrate that the use of unattended demonstratives as anaphoric references is disadvantageous to both reading time and referent identification. However, these disadvantages become advantages in terms of essay quality, likely because linguistic complexity is a strong indicator of high-proficiency writing. From a text processing and comprehension viewpoint, the findings indicate that anaphoric reference is not always beneficial and does not always create a more cohesive text. In contrast, in a writing context, the use of unattended demonstratives leads to a more linguistically complex text, which generally equates to a higher-quality text.

14.
Abstract

Students in a college course were given written criteria, divided into teams, and asked to score their own essay examination. Their pooled ratings correlated .922 with the instructor's ratings. Agreement between the ratings of students and instructor was not related to grade point or total test score. However, grade point and test scores were related negatively to the ambiguity of the students' answers on the examination. The results support the generalization that subjective scoring standards are readily communicable. Theoretical and practical implications are examined.

15.
This study examined the dimensionality of 10 different calibration measures using confirmatory factor analysis (CFA). The 10 measures were representative of five interpretative families of measures used to assess monitoring accuracy based on a 2 (performance) × 2 (monitoring judgment) contingency table. We computed scores for each of the measures using a common data set and compared one-, two-, and five-factor CFA solutions. We predicted that the two-factor solution corresponding to measures of specificity and sensitivity used to assess diagnostic efficiency would provide the best solution. This hypothesis was confirmed, yielding two orthogonal factors that explained close to 100% of sample variance. The remaining eight measures were intercorrelated significantly with the sensitivity and specificity factors, which explained between 91 and 99 percent of variance in each measure. The two-factor solution was consistent with two different explanations, including the possibility that metacognitive monitoring may utilize two different types of processes that rely on separate judgments of correct and incorrect performance, or may be sufficiently complex that a single measurement statistic fails to capture all of the variance in the monitoring process. Our findings indicated that no single measure explains all the variance in monitoring judgments. Implications for future research are discussed.
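A minimal sketch of the two diagnostic-efficiency measures mentioned above, computed from the 2 × 2 (performance × judgment) table under the usual definitions; the cell labels and function name are illustrative rather than taken from the study.

    def sensitivity_specificity(judged_correct, actually_correct):
        """Sensitivity: proportion of correct responses the learner also judged correct.
        Specificity: proportion of incorrect responses the learner judged incorrect."""
        pairs = list(zip(judged_correct, actually_correct))
        hits = sum(j and c for j, c in pairs)                            # judged correct, was correct
        false_alarms = sum(j and not c for j, c in pairs)                # judged correct, was incorrect
        misses = sum((not j) and c for j, c in pairs)                    # judged incorrect, was correct
        correct_rejections = sum((not j) and (not c) for j, c in pairs)  # judged incorrect, was incorrect

        sensitivity = hits / (hits + misses) if (hits + misses) else float("nan")
        specificity = (correct_rejections / (false_alarms + correct_rejections)
                       if (false_alarms + correct_rejections) else float("nan"))
        return sensitivity, specificity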

16.
Abstract

This study investigated the reliability, validity, and utility of the following three measures of letter-formation quality: (a) a holistic rating system, in which examiners rated letters on a five-point Likert-type scale; (b) a holistic rating system with model letters, in which examiners used model letters that exemplified specific criterion scores to rate letters; and (c) a correct/incorrect procedure, in which examiners used transparent overlays and standard verbal criteria to score letters. Intrarater and interrater reliability coefficients revealed that the two holistic scoring procedures were unreliable, whereas scores obtained by examiners who used the correct/incorrect procedure were consistent over time and across examiners. Although all three of the target measures were sensitive to differences between individual letters, only the scores from the two holistic procedures were associated with other indices of handwriting performance. Furthermore, for each of the target measures, variability in scores was, for the most part, not attributable to the level of experience or sex of the respondents. Findings are discussed with respect to criteria for validating an assessment instrument.

17.
The purpose of this study is to explore the reliability of a potentially more practical approach to direct writing assessment in the context of ESL writing. Traditional rubric rating (RR) is a common yet resource-intensive evaluation practice when performed reliably. This study compared the traditional rubric model of ESL writing assessment and many-facet Rasch modeling (MFRM) to comparative judgment (CJ), the new approach, which shows promising results in terms of reliability. We employed two groups of raters, novice and experienced, and used essays that had been previously double-rated, analyzed with MFRM, and selected with fit statistics. We compared the results of the novice and experienced groups against the initial ratings using raw scores, MFRM, and a modern form of CJ: randomly distributed comparative judgment (RDCJ). Results showed that the CJ approach, though not appropriate for all contexts, can be as reliable as RR while showing promise as a more practical approach. Additionally, CJ is easily transferable to novel assessment tasks while still providing context-specific scores. Results from this study will not only inform future studies but can help guide ESL programs in selecting a rating model best suited to their specific needs.
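Comparative judgment typically converts pairwise win/loss decisions into scale scores with a Bradley-Terry type model. The sketch below uses the standard minorization-maximization update as a generic illustration; the study's RDCJ procedure is not described here, so this is only an assumption about how such scores can be obtained, and it assumes every essay wins at least one comparison.

    import numpy as np

    def bradley_terry(wins, n_iter=200):
        """Estimate essay strengths from a pairwise win matrix.
        wins[i, j] = number of times essay i was judged better than essay j."""
        wins = np.asarray(wins, dtype=float)
        comparisons = wins + wins.T               # total comparisons per pair
        total_wins = wins.sum(axis=1)
        p = np.ones(wins.shape[0])

        for _ in range(n_iter):
            denom = comparisons / (p[:, None] + p[None, :])
            np.fill_diagonal(denom, 0.0)          # ignore self-comparisons
            p = total_wins / denom.sum(axis=1)
            p = p / p.sum()                       # fix the scale at each step

        return np.log(p)                          # log-strengths; rescale as needed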

18.
Using an argument-based approach to validation, this study examines the quality of teacher judgments in the context of a standards-based classroom assessment of English proficiency. Using Bachman's (2005) assessment use argument (AUA) as a framework for the investigation, this paper first articulates the claims, warrants, rebuttals, and backing needed to justify the link between teachers' scores on the English Language Development (ELD) Classroom Assessment and the interpretations made about students' language ability. Then the paper summarizes the findings of two studies—one quantitative and one qualitative—conducted to gather the necessary backing to support the warrants and, in particular, address the rebuttals about teacher judgments in the argument. The quantitative study examined the assessment in relation to another measure of the same ability—the California English Language Development Test—using confirmatory factor analysis of multitrait-multimethod data and provided evidence in support of the warrant that states that the ELD Classroom Assessment measures English proficiency as defined by the California ELD Standards. The qualitative study examined the processes teachers engaged in while scoring the classroom assessment using verbal protocol analysis. The findings of this study serve to support the rebuttals in the validity argument that state that there are inconsistencies in teachers' scoring. The paper concludes by providing an explanation for these seemingly contradictory findings using the AUA as a framework and discusses the implications of the findings for the use of standards-based classroom assessments based on teacher judgments.

19.
The aim of this study was to investigate the influence of perceived student gender on the feedback given to undergraduate student work. Participants (n = 12) were lecturers in higher education and were required to mark two undergraduate student essays. The first student essay that all participants marked was the control essay. Participants were informed that the control essay was written by Samuel Jones (a male student). Participants then marked the target essay. Although participants marked the same essay, half of the participants (n = 6) were informed that the student essay was written by Natasha Brown (a female student), while the remaining participants were informed that it was written by James Smith (a male student). In-text and end-of-text feedback were qualitatively analysed on six dimensions: academic style of writing; criticality; structure, fluency and cohesion; sources used; understanding/knowledge of the subject; and other. Analysis of feedback for both the control and target essay revealed no discernible differences in the number of comments (strengths of the essay, areas for improvement) made and the content and presentation of these comments between the two groups. Pedagogical implications pertaining to the potential impact of anonymous marking on feedback processes are discussed.

20.
ABSTRACT

The aim of this research was to examine the levels of burnout syndrome dimensions in special education teachers and their correlations with socio-demographic characteristics, job characteristics, and levels of assertiveness. The research included 225 special education teachers from Serbia (82% women, 18% men; average age 42.51 ± 9.23 years). Research instruments included the Maslach Burnout Inventory, the Rathus Assertiveness Schedule, and a socio-demographic questionnaire. There were differences in all burnout dimensions in relation to the type of students' special needs. Higher levels of burnout symptoms were observed in teachers working with children with motor skill disorders. The assertiveness scores had a significant negative correlation with emotional exhaustion and depersonalisation, and a positive correlation with a lack of accomplishment. The results obtained may help in planning adequate preventative measures for improving the mental health of these professionals.
