Similar literature
Found 20 similar articles; search took 15 ms
1.
A two-stage process by which a holistic rubric is applied to the assessment of open-ended items, such as writing samples, is defined. The first stage involves scoring a performance by the assignment of an integer rating that is congruent with the proficiency level that is exhibited in the performance. The second stage is the subsequent assignment by the rater of an augmentation that indicates whether or not the writing competency reflected in the paper is a bit higher or lower than the competency level reflected in the benchmark paper for the given proficiency level. If the rater feels that the paper represents benchmark proficiency for the given level, no augmentation is assigned to the rating. The results of this study indicate that the use of rating augmentation can improve the inter-rater reliability of holistic assessments, as indicated by generalizability phi coefficients, correlation coefficients, and percent agreement indices. Implications and suggestions for follow-up research are discussed.

2.
Martin   《Assessing Writing》2009,14(2):88-115
The demand for valid and reliable methods of assessing second and foreign language writing has grown in significance in recent years. One such method is the timed writing test, which has a central place in many testing contexts internationally. The reliability of this test method is heavily influenced by the scoring procedures, including the rating scale to be used and the success with which raters can apply the scale. Reliability is crucial because important decisions and inferences about test takers are often made on the basis of test scores. Determining the reliability of the scoring procedure frequently involves examining the consistency with which raters assign scores. This article presents an analysis of the rating of two sets of timed tests written by intermediate level learners of German as a foreign language (n = 47) by two independent raters who used a newly developed detailed scoring rubric containing several categories. The article discusses how the rubric was developed to reflect a particular construct of writing proficiency. Implications for the reliability of the scoring procedure are explored, and considerations for more extensive cross-language research are discussed.

3.
4.
The discriminant and concurrent validity of the Gordon Diagnostic System (GDS) was investigated in 29 youngsters categorized into “normals” or “ADHDs” based on teacher ratings. The results failed to demonstrate the discriminant validity of any GDS score regardless of the behavior rating used. The Vigilance Correct and Vigilance Omission scores were significantly correlated with ADHD Rating Scale scores completed by teachers. The sample size in the study demands cautious interpretation of these results; however, the authors suggest the continued use of multiple behavior ratings by teachers as the “gold standard” for the classification of youngsters with a suspected Attention-deficit Hyperactivity Disorder.

5.
States participating in the Growth Model Pilot Program reference individual student growth against “proficiency” cut scores that conform with the original No Child Left Behind Act (NCLB). Although achievement results from conventional NCLB models are also cut‐score dependent, the functional relationships between cut‐score location and growth results are more complex and are not currently well described. We apply cut‐score scenarios to longitudinal data to demonstrate the dependence of state‐ and school‐level growth results on cut‐score choice. This dependence is examined along three dimensions: 1) rigor, as states set cut scores largely at their discretion, 2) across‐grade articulation, as the rigor of proficiency standards may vary across grades, and 3) the time horizon chosen for growth to proficiency. Results show that the selection of plausible alternative cut scores within a growth model can change the percentage of students “on track to proficiency” by more than 20 percentage points and reverse accountability decisions for more than 40% of schools. We contribute a framework for predicting these dependencies, and we argue that the cut‐score dependence of large‐scale growth statistics must be made transparent, particularly for comparisons of growth results across states.

6.
《Assessing Writing》2008,13(3):201-218
Using generalizability theory, this study examined both the rating variability and reliability of ESL students’ writing in the provincial English examinations in Canada. Three years’ data were used in order to complete the analyses and examine the stability of the results. The major research question that guided this study was: Are there any differences between the rating variability and reliability of the writing scores assigned to ESL students and to Native English (NE) students in the writing components of the provincial examinations across three years? A series of generalizability studies and decision studies was conducted. Results showed that differences in score variation did exist between ESL and NE students when adjudicated scores were used. First, there was a large effect for both language group and person within language-by-task interaction. Second, the unwanted residual variance component was significantly larger for ESL students than for NE students in all three years. Finally, the desired variance associated with the object of measurement was significantly smaller for ESL students than for NE students in one year. Consequently, the observed generalizability coefficient for ESL students was significantly lower than that for NE students in that year. These findings raise a potential question about the fairness of the writing scores assigned to ESL students.

7.
The inquiry sought to develop a semantic differential (SD) for use in assessing attitude and attitude change among secondary school and college students. It included thirty-five bipolar adjectives, each using a 7-point ordinal scale, for student rating purposes. Three concepts were used in the study: teacher, learning, and student. A Likert-type scoring was accomplished with part scores for each of the separate concepts, and with the total score being the sum of the three part scores. A comparison between pre- and post-college course attitudes was made involving 237 students, which showed significant change only for the concept “student” (Me as a student). Internal reliability indexes were obtained using the Kuder-Richardson (K-R) Formula 20 for part scores ranging from r = .421 to .610; and for the total score ranging from r = .928 to .960. Intercorrelations of part scores for pretest ranged from r = .530 to .584; and for posttest r = .620 to .707. There is evidence of greater homogeneity for post-course concepts than for pre-course concepts used in the evaluation, i.e., teacher, learning, and student.
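
The Kuder-Richardson Formula 20 mentioned in this abstract can be illustrated with a minimal sketch. It assumes dichotomous (0/1) item scores, which is the case KR-20 is defined for; the abstract does not say how the 7-point ratings were prepared for the formula, so the function name `kr20` and the toy data are purely hypothetical and are not the study's procedure or data.

```python
from typing import Sequence


def kr20(responses: Sequence[Sequence[int]]) -> float:
    """Kuder-Richardson Formula 20 for a persons-by-items matrix of 0/1 scores.

    KR-20 = k / (k - 1) * (1 - sum(p_j * q_j) / var_total)
    where k is the number of items, p_j is the proportion scoring 1 on item j,
    q_j = 1 - p_j, and var_total is the population variance of total scores.
    """
    n_persons = len(responses)
    n_items = len(responses[0]) if n_persons else 0
    if n_persons < 2 or n_items < 2:
        raise ValueError("need at least 2 persons and 2 items")

    # Total score per person and its population variance.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances p_j * q_j.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_persons
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)


if __name__ == "__main__":
    # Hypothetical toy data: 4 respondents, 3 dichotomous items.
    toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    print(round(kr20(toy), 3))
```

For ordinal items such as the 7-point scales described here, Cronbach's alpha is the usual generalization; KR-20 is its special case for dichotomous items.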

8.
The Koppitz Human Figure Drawing (HFD) Test was examined for use in early identification of academically not-ready kindergarten children. HFD developmental scores of a group of children who later “passed” the Metropolitan Readiness Test (stanine score ≥4), as required for admission to the first-grade classroom, were compared with the HFD scores of a group of children who later “failed” the Metropolitan Readiness Test (stanine score ≤3). Evaluation of the test data showed that the optimum HFD score cut-off point for prediction of nonreadiness was ≤3; 42% of the nonready and 90% of the ready children, as defined by the Metropolitan Readiness Test (MRT) score criterion, were correctly identified. An overall hit rate of 75% was obtained on the experimental population. The work indicates that HFD developmental scores are useful for early identification of the academically not-ready kindergarten child.

9.
The paper reports and discusses a government‐initiated nationwide assessment of writing proficiency among Norwegian compulsory school students. A sample study of 7th and 10th grade students is reported and discussed with regard to the challenges of measuring writing skills in a valid and reliable manner. For the 7th graders, the results showed that a greater proportion of narrative texts, in contrast to more scientifically oriented texts, was assessed as “lower than expected”; however, for the 10th graders the tendency was the opposite with respect to central linguistic components. Low correlations between the raters were ascertained at both levels, indicating different views among teachers as to what can be expected of students' writing proficiency. The results are discussed in relation to the usefulness of the theoretical model as a basis for assessment of writing proficiency, as well as other obstacles to constructing valid and reliable writing tests.

10.
Analyses of the rating data for papers submitted to the Division of Educational Psychology's program committee for inclusion in the 1974 APA convention are reported. Interjudge reliability was estimated, the rating scale scores were factor analyzed, and the judges’ reasons for non‐acceptance of papers were content analyzed. Results showed that the judges (a) were reliable, (b) tended to rely on an overall rating dimension, and (c) did not accept papers because data or results were lacking, research methods were in error, the topic was inappropriate for the Division of Educational Psychology, or the theory was weak. Avoiding such errors should increase the probability that a paper will be accepted.

11.
Abstract

During the past several decades, there have been numerous published papers in the measurement literature dealing with the psychometric properties of gain scores and related measures of change. Three measures which have received considerable attention are the simple gain score, the residualized difference score, and the base free measure of Tucker, Damarin, and Messick (1966). Most of the arguments appearing in the literature have been mathematical in nature, often depending on fabricated data sets or on the substitution of “reasonable” numbers into well-known measurement equations. However, there has been a sparsity of actual empirical studies designed to investigate the fallibility of any measure of change. The purpose of the present paper is to describe the procedures and results of two studies designed to yield empirical comparisons of the error magnitude in these three measures of change. In both of these studies, residualized scores were found to possess smaller standard errors of measurement than the other two measures of change. An interesting ancillary finding was that the reliability coefficient of the much maligned simple gain score turned out to be 0.96 for one of the studies and 0.82 for the other.

12.
In research and development designed to assess the writing skills of third-year college students, the University of Wisconsin Verbal Assessment Project developed and tested procedures for assessing writing portfolios from students in courses representing each college in the university. Following the work of Britton (1970) and the National Assessment of Educational Progress, we defined expository writing as sustained reflection in which the writer focuses and processes information to various degrees. Basing our work on this construct, we assessed writing samples in each portfolio in terms of both degree of reflection and extent of text elaboration. Results of two studies are presented. In Study 1, raters scored each text from a given portfolio before rating texts in the next portfolio. Reliability estimates were low to moderate for both scores. In follow-up Study 2, involving a comparable group of students, several changes were made to improve reliability: (a) Raters scored all texts written in response to a given prompt or assignment within a class before moving to the next set of texts; and (b) each time readers dealt with a new task, they read several examples together, coming to agreement about how various texts were to be rated. Estimates of reliability for both scores were somewhat higher and suggest that the modifications improved reliabilities. Results demonstrate that adequate reliability should be expected if texts are rated by task across portfolios within classes. Based on these findings, we contend that, because writing normally varies by topic, genre, and other variables, writing portfolios are better characterized by scores for each piece than by a single writing-skill score.

13.
This study examined the influence of a professional development program based around commercially available inquiry science curricula on the teaching practices of 27 beginning elementary school teachers and their teacher mentors over a 2-year period. A quantitative rubric used to score inquiry elements and use of data in videotaped lessons indicated that education students assigned to inquiry-based classrooms during their methods course or student teaching year outperformed students without this experience. There was also a significant positive effect of multi-year access to the kit-based program on mentor teaching practice. Recent inclusion of a “writing in science” program in both preservice and inservice training has been used to address the lesson element that received the lowest scores: evaluation of data and its use in scientific explanation.

14.
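Essay scores from the June 2012 College English Test Band 4 (CET-4) at four higher vocational colleges in Guangzhou, together with a survey of employers, show that college students' English writing ability is still relatively weak and needs further development. Drawing on Krashen's “Input Hypothesis” and the “Output Hypothesis” of Swain and others, the study tries out a college English writing teaching model based on “balancing input and output” to improve students' writing ability. The results of the empirical study indicate that this teaching model produces good teaching outcomes.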

15.
Our objective was to investigate the impact of the Science Writing Heuristic (SWH) on undergraduates’ ability to express logical conclusions and include appropriate evidence in formal writing assignments. Students in three laboratory sections were randomly allocated to the SWH treatment (n = 51 students) with another three sections serving as a control (n = 47 students). All sections received an identical formal writing assignment to report results of laboratory activities. Four blinded raters used a 6-point rating scheme to evaluate the quality of students’ writing performance. Raters’ independent scoring agreement was evaluated using Cronbach's α. Paper scores were compared using a t-test, then papers were combined into low-scoring (≤3.5 of 6 points) or high-scoring (>3.5 of 6 points) sets and SWH and control cohorts were compared using Pearson's chi-square test. Papers from the SWH cohort were significantly (P = 0.02) more likely to receive a high score than those from the control cohort. Overall scores of SWH cohort papers tended to be higher (P = 0.07) than those from the control cohort. Gains in student conceptual understanding elicited by the SWH approach improved student ability to express logical conclusions about their data and include appropriate evidence to support those conclusions in formal research reports. Extending the writing tasks of the SWH to formal writing assignments can improve the ability of undergraduates to argue effectively for their research findings.

16.
《Assessing Writing》1998,5(1):39-70
The Maryland School Performance Assessment Program (MSPAP) tests include an expressive writing task in which students at grades 3, 5, and 8 can choose to write about any topic they wish in the form of either a story, poem, or play. This test design feature provided the opportunity to investigate what factors contribute to students' choice of genre, how scorers apply a single expressive writing rubric to a range of genres, and whether these genres constitute equivalent tasks for measurement and reporting purposes. Our study, which combined analysis of statewide score data, 300 randomly selected student texts, questionnaires given to teacher-scorers, and interviews with students, argues strongly for the validity of this choice task as a measure of expressive writing and demonstrates that choice of genre both increases writers' engagement and enhances the fairness of the assessment by giving all students the best opportunity to demonstrate proficiency in this learning outcome. By highlighting several features of student texts that complicate scoring, the study also suggests that accuracy and consistency might be improved by
  1) providing additional sample papers during training,
  2) attending to scorers' assumptions regarding several key concepts, especially “originality,” and
  3) adjusting the ways that training for focused holistic scoring generally takes place.
The study concludes that the perceptions of students, scorers, and classroom teachers are critical to the ongoing development of writing assessments that offer students increasing control and choice.

17.
The purpose of this study was to compare the effects of two peer assessment methods on university students' academic writing performance and their satisfaction with peer assessment. This study also examined the validity and reliability of student-generated assessment scores. Two hundred and thirty-two predominantly undergraduate students were selected by convenience sampling during the fall semester of 2007. The results indicate that students in the experimental group demonstrated greater improvement in their writing than those in the comparison group, and the findings reveal that students in the experimental group exhibited higher levels of satisfaction with the peer assessment method both in peer assessment structure and peer feedback than those in the comparison group. Additionally, the findings indicate that the validity and reliability of student-generated rating scores were extremely high. Using Wiki interactive software and providing an online collaborative learning environment to facilitate peer assessment added value to peer assessment.

18.
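Essay correction and feedback is an important part of English writing instruction, and its role in improving students' writing ability should not be underestimated. The 句酷批改网 (Juku Pigai) essay-correction website is popular with teachers and students because of its advantages. Its essay scores have high reliability, but they are significantly higher than teachers' scores and do not yet reflect the true level of students' English essays. In terms of validity, the site gives detailed sentence-level evaluation of vocabulary and grammar, but it cannot give students adequate feedback on discourse structure, style and rhetoric, or the logic and coherence of the content. When using the site, we should combine it with other forms of assessment.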

19.
On repeated occasions, observational learning has proved itself to be an effective instruction method. Experimental studies have shown it to be effective for complex tasks such as reading and writing, with both teachers and students as models. The problem when interpreting the results of such research is that, in observation tasks, several mental activities play a simultaneous role. In this study we therefore set out to identify the effective elements of observation tasks. We focused on two elements of the observation tasks, both aimed at stimulating monitoring activities: evaluation of the model’s performance and elaboration on this evaluation. We have also distinguished between elaboration on the observed products (the models’ written answers), and elaboration on the observed processes (the models’ verbalisations of their mental activities). The data were subjected to a LISREL analysis. First of all, it was observed that subjects who performed “evaluation” and “product-elaboration” better, and “process-elaboration” more often in one lesson, also performed these activities better or more often in the subsequent lesson. Next, we observed an effect of aptitude on the learning activities: pre-skill scores influence “evaluation” and “product-elaboration”. The most important finding is that “evaluation” and “product-elaboration” contribute positively to argumentative writing skills. We discuss how these findings confirm the importance of monitoring, evaluative and reflective activities when learning complex tasks such as writing.

20.
Abstract

In the absence of a randomized control trial, regression discontinuity (RD) designs can produce plausible estimates of the treatment effect on an outcome for individuals near a cutoff score. In the standard RD design, individuals with rating scores higher than some exogenously determined cutoff score are assigned to one treatment condition; those with rating scores below the cutoff score are assigned to an alternate treatment condition. Many education policies, however, assign treatment status on the basis of more than one rating-score dimension. We refer to this class of RD designs as “multiple rating score regression discontinuity” (MRSRD) designs. In this paper, we discuss five different approaches to estimating treatment effects using MRSRD designs (response surface RD; frontier RD; fuzzy frontier RD; distance-based RD; and binding-score RD). We discuss differences among them in terms of their estimands, applications, statistical power, and potential extensions for studying heterogeneity of treatment effects.
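
The standard single-rating RD design described in this abstract can be illustrated with a minimal sketch. It is not any of the five MRSRD approaches the paper discusses; it assumes a sharp design, fits a separate straight line on each side of the cutoff within a bandwidth, and takes the difference of the two fitted values at the cutoff. The names `fit_line` and `sharp_rd_estimate`, the bandwidth parameter, and the toy data are all hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def sharp_rd_estimate(rating: Sequence[float], outcome: Sequence[float],
                      cutoff: float, bandwidth: float) -> float:
    """Estimated jump in the outcome at the cutoff (sharp RD, local linear fit).

    Assumption for this sketch: units with rating >= cutoff are treated.
    """
    below_x: List[float] = []
    below_y: List[float] = []
    above_x: List[float] = []
    above_y: List[float] = []
    for r, y in zip(rating, outcome):
        centered = r - cutoff  # center so each intercept is the value at the cutoff
        if -bandwidth <= centered < 0:
            below_x.append(centered)
            below_y.append(y)
        elif 0 <= centered <= bandwidth:
            above_x.append(centered)
            above_y.append(y)
    if len(below_x) < 2 or len(above_x) < 2:
        raise ValueError("need at least 2 observations on each side of the cutoff")
    intercept_below, _ = fit_line(below_x, below_y)
    intercept_above, _ = fit_line(above_x, above_y)
    return intercept_above - intercept_below


if __name__ == "__main__":
    # Hypothetical noiseless data with a built-in jump of 2.0 at rating 0.
    ratings = list(range(-10, 11))
    outcomes = [1.0 + 0.5 * r + (2.0 if r >= 0 else 0.0) for r in ratings]
    print(sharp_rd_estimate(ratings, outcomes, cutoff=0.0, bandwidth=10.0))
```

In practice the bandwidth choice and the functional form on each side strongly affect the estimate, which is one reason the estimate is interpreted only for individuals near the cutoff.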
