Similar Documents
A total of 20 similar documents were found.
1.
How can the contributions of raters and tasks to error variance be estimated? Which source of error variance is usually greater? Are interrater coefficients adequate estimates of reliability? What other facets contribute to unreliability in performance assessments?

2.
"形成性评价"一直是教育者关注的概念。它引导教师可以不断收集对学习产生积极影响的证据,并根据这些证据开展创新性的教学实践。其结果不但在很大程度上改善了教与学的成果,而且使得教师在教学中的角色也有很大转变。本文按照"对话互动"及"学习合作"两个类别对形成性评价在实践中的多种形式进行讨论。在分析形成性评价的基本原则基础上,讨论目前教师在开展形成性评价过程中面临的问题。  相似文献   

3.
Differential item functioning (DIF) analyses are a routine part of the development of large-scale assessments. Less common are studies to understand the potential sources of DIF. The goals of this study were (a) to identify gender DIF in a large-scale science assessment and (b) to look for trends in the DIF and non-DIF items due to content, cognitive demands, item type, item text, and visual-spatial or reference factors. To facilitate the analyses, DIF studies were conducted at 3 grade levels and for 2 randomly equivalent forms of the science assessment at each grade level (administered in different years). The DIF procedure itself was a variant of the "standardization procedure" of Dorans and Kulick (1986) and was applied to very large sets of data (6 sets of data, each involving 60,000 students). It has the advantages of being easy to understand and to explain to practitioners. Several findings emerged from the study that would be useful to pass on to test development committees. For example, when there was DIF in science items, MC items tended to favor male examinees and OR items tended to favor female examinees. Compiling DIF information across multiple grades and years increases the likelihood that important trends in the data will be identified and that item writing practices will be informed by more than anecdotal reports about DIF.
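The "standardization procedure" referred to above is, at its core, a comparison of proportion-correct between focal and reference groups at each matching score level, weighted by the focal group's score distribution. The sketch below illustrates that idea only; it is not the study's code, and the function name and example counts are hypothetical.

```python
# Minimal sketch of a standardization DIF index (STD P-DIF) in the spirit of
# Dorans and Kulick (1986). Illustration only; the counts below are hypothetical.
import numpy as np

def std_p_dif(n_focal, p_focal, p_ref):
    """Weighted difference in proportion-correct, weighted by the focal group's
    frequency at each matching total-score level."""
    n_focal = np.asarray(n_focal, dtype=float)
    diff = np.asarray(p_focal, dtype=float) - np.asarray(p_ref, dtype=float)
    return np.sum(n_focal * diff) / np.sum(n_focal)

# Hypothetical data: focal-group counts and item proportion-correct for the
# focal (e.g., female) and reference (e.g., male) groups at 5 score levels.
n_f = [120, 340, 560, 410, 150]
p_f = [0.21, 0.38, 0.55, 0.71, 0.88]
p_r = [0.25, 0.44, 0.60, 0.74, 0.90]

print(round(std_p_dif(n_f, p_f, p_r), 3))  # negative values favor the reference group
```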

4.
Educational Assessment, 2013, 18(4), 325-338
Concerns about the effects of multiple-choice measures in traditional testing programs have led many educators and policymakers to suggest the use of alternative assessment methods. Some performance-based assessments require students to work in small collaborative groups as part of the test process. This study uses responses to hands-on science tasks at Grades 5 and 8 to examine whether the score a student earns while working with someone else is a truly independent assessment of that student's ability. We also explore whether working in pairs affects an individual's scores on subsequent tasks and whether these results are consistent across grade levels. Our analyses indicate that at Grades 5 and 8, work done with a partner should not be considered as an independent assessment of each student's ability. Some evidence of carry-over effects from working in pairs was found at each grade.

5.
In 1993, we reported in Journal of Educational Measurement that task-sampling variability was the Achilles' heel of science performance assessment. To reduce measurement error, tasks needed to be stratified before sampling, sampled in large number, or possibly both. However, Cronbach, Linn, Brennan, & Haertel (1997) pointed out that a task-sampling interpretation of a large person × task variance component might be incorrect. Task and occasion sampling are confounded because tasks are typically given on only a single occasion. The person × task source of measurement error is then confounded with the pt × occasion source. If pto variability accounts for a substantial part of the commonly observed pt interaction, stratifying tasks into homogeneous subsets (a cost-effective way of addressing task-sampling variability) might not increase accuracy. Stratification would not address the pto source of error. Another conclusion reported in JEM was that only direct observation (DO) and notebook (NB) methods of collecting performance assessment data were exchangeable; computer simulation, short-answer, and multiple-choice methods were not. However, if Cronbach et al. were right, our exchangeability conclusion might be incorrect. After re-examining and re-analyzing data, we found support for Cronbach et al. We concluded that large task-sampling variability was due to both the person × task interaction and the person × task × occasion interaction. Moreover, we found that direct observation, notebook, and computer simulation methods were equally exchangeable, but their exchangeability was limited by the volatility of student performances across tasks and occasions.
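The confounding described above can be stated compactly in generalizability-theory notation (the notation is assumed here, not taken from the paper): in a fully crossed person × task × occasion design the observed-score variance decomposes into seven components, but when every task is given on a single occasion the occasion facet is hidden and the estimable person × task term absorbs the triple interaction.

```latex
% Sketch of the decomposition discussed above; notation is assumed, not the authors'.
\[
\sigma^2(X_{pto}) \;=\; \sigma^2_p + \sigma^2_t + \sigma^2_o
  + \sigma^2_{pt} + \sigma^2_{po} + \sigma^2_{to} + \sigma^2_{pto,e}
\]
% With a single occasion per task, the observable ``person x task'' component is the
% confounded sum, so a large estimate cannot be attributed to task sampling alone:
\[
\hat{\sigma}^2_{pt^{*}} \;=\; \sigma^2_{pt} + \sigma^2_{pto,e}
\]
```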

6.
Educational Assessment, 2013, 18(3), 257-272
Concern about the education system has increasingly focused on achievement outcomes and the role of assessment in school performance. Our research with fifth and eighth graders in California explored several issues regarding student performance and rater reliability on hands-on tasks that were administered as part of a field test of a statewide assessment program in science. This research found that raters can produce reliable scores for hands-on tests of science performance. However, the reliability of performance test scores per hour of testing time is quite low relative to multiple-choice tests. Reliability can be improved substantially by adding more tasks (and testing time). Using more than one rater per task produces only a very small improvement in the reliability of a student's total score across tasks. These results were consistent across both grade levels, and they echo the findings of past research.
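The relationship between the number of tasks and score reliability noted above is commonly projected with the Spearman-Brown prophecy formula. The formula itself is standard; the single-task reliability of 0.30 used below is a hypothetical value, not a figure from this study.

```python
# Why adding tasks raises reliability much more than adding raters: Spearman-Brown
# projection as the number of tasks grows. The starting reliability is hypothetical.
def spearman_brown(rel_one_unit, k):
    """Projected reliability when a measure is lengthened by a factor of k."""
    return k * rel_one_unit / (1 + (k - 1) * rel_one_unit)

single_task_rel = 0.30
for n_tasks in (1, 3, 6, 12):
    print(n_tasks, round(spearman_brown(single_task_rel, n_tasks), 2))
# 1 -> 0.30, 3 -> 0.56, 6 -> 0.72, 12 -> 0.84
```

A parallel projection on the rater facet would show much smaller gains, consistent with the finding above that a second rater adds little to the total-score reliability.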

7.
What are practical and logistical constraints in developing science performance assessments (SPAs)? What are key components in a framework for conceptualizing the process? What are the major steps in SPA development?

8.
Over the past two decades, large-scale student assessments have had a profound impact on educational research, school systems, and education policy. The Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), and the Progress in International Reading Literacy Study (PIRLS) have made student achievement scores comparable across countries to a certain degree, allowing differences in how schools in different countries work to be examined more closely from within the school. Large-scale student assessments also provide the data needed for school development, educational leadership, and the improvement of student achievement. Taking PIRLS as an example, this article aims to offer lessons, from a German perspective, for China's future participation in large-scale international student assessments.

9.
Many states are implementing direct writing assessments to assess student achievement. While much literature has investigated minimizing raters' effects on writing scores, little attention has been given to the type of model used to prepare raters to score direct writing assessments. This study reports on an investigation that occurred in a state-mandated writing program when a scoring anomaly became apparent once assessments were put in operation. The study indicates that using a spiral model for training raters and scoring papers results in higher mean ratings than does using a sequential model for training and scoring. Findings suggest that making decisions about cut-scores based on pilot data has important implications for program implementation.

10.
Many efforts have been made to determine and explain differential gender performance on large-scale mathematics assessments. A well-agreed-on conclusion is that gender differences are contextualized and vary across math domains. This study investigated the pattern of gender differences by item domain (e.g., Space and Shape, Quantity) and item type (e.g., multiple-choice items, open constructed-response items). (Two kinds of multiple-choice items are discussed in the paper: traditional multiple-choice items and complex multiple-choice items; the terms "multiple-choice" and "traditional multiple-choice" are used interchangeably for the former, while "complex multiple-choice" refers to the latter.) The U.S. portion of the Programme for International Student Assessment (PISA) 2000 and 2003 mathematics assessments was analyzed. A multidimensional Rasch model was used to provide student ability estimates for each comparison. Results revealed a slight but consistent male advantage. Students showed the largest gender difference (d = 0.19) in favor of males on complex multiple-choice items, an unconventional item type. Males and females also showed sizable differences on Space and Shape items, a domain well documented for robust male superiority. Contrary to many previous findings reporting male superiority on multiple-choice items, no measurable difference was identified on traditional multiple-choice items in either the PISA 2000 or the 2003 math assessment. Possible reasons for the differential gender performance across math domains and item types are discussed, along with directions for future research.
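The reported gap of d = 0.19 is a standardized mean difference. As a reference point, the sketch below computes Cohen's d from two sets of ability estimates; the data are simulated so that the difference lands near 0.19, and none of the numbers come from PISA.

```python
# Cohen's d for a male-female contrast, the kind of standardized mean difference
# reported above. The ability estimates are simulated, not PISA data.
import numpy as np

def cohens_d(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
male = rng.normal(0.10, 1.0, 2000)     # hypothetical ability estimates
female = rng.normal(-0.09, 1.0, 2000)
print(round(cohens_d(male, female), 2))  # close to 0.19 for this simulated draw
```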

11.
12.
This article proposes that sampling design effects have potentially huge unrecognized impacts on the results reported by large-scale district and state assessments in the United States. When design effects are unrecognized and unaccounted for they lead to underestimating the sampling error in item and test statistics. Underestimating the sampling errors, in turn, results in unanticipated instability in the testing program and an increase in Type I errors in significance tests. This is especially true when the standard error of equating is underestimated. The problem is caused by the typical district and state practice of using nonprobability cluster-sampling procedures, such as convenience, purposeful, and quota sampling, then calculating statistics and standard errors as if the samples were simple random samples.
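For roughly equal-sized clusters, the understatement the article warns about can be approximated with the standard design-effect formula DEFF = 1 + (m - 1)ρ, where m is the average cluster size and ρ the intraclass correlation; treating a cluster sample as simple random understates standard errors by a factor of about √DEFF. The values below are illustrative only and are not taken from any state program.

```python
# Rough illustration of how ignoring clustering understates sampling error.
# DEFF = 1 + (m - 1) * rho is the usual approximation for equal-sized clusters;
# all numbers below are hypothetical.
import math

def design_effect(avg_cluster_size, icc):
    return 1 + (avg_cluster_size - 1) * icc

n_students = 5000
avg_class_size = 25
icc = 0.20                                   # within-school similarity of scores

deff = design_effect(avg_class_size, icc)
n_effective = n_students / deff
srs_se = 1.0 / math.sqrt(n_students)         # SE of a mean (SD units) assuming SRS
clustered_se = 1.0 / math.sqrt(n_effective)  # SE acknowledging the clustering

print(round(deff, 2), round(n_effective), round(srs_se, 4), round(clustered_se, 4))
# DEFF = 5.8: the effective sample is ~862 students, and the true SE is ~2.4x larger
```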

13.
14.
In large-scale assessments, such as state-wide testing programs, national sample-based assessments, and international comparative studies, there are many steps involved in the measurement and reporting of student achievement. There are always sources of inaccuracies in each of the steps. It is of interest to identify the source and magnitude of the errors in the measurement process that may threaten the validity of the final results. Assessment designers can then improve the assessment quality by focusing on areas that pose the highest threats to the results. This paper discusses the relative magnitudes of three main sources of error with reference to the objectives of assessment programs: measurement error, sampling error, and equating error. A number of examples from large-scale assessments are used to illustrate these errors and their impact on the results. The paper concludes by making a number of recommendations that could lead to an improvement of the accuracies of large-scale assessment results.
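If the three error sources are treated as independent, their contributions to a reported statistic combine as variances, which is one simple way to see how a single neglected component can dominate the total. The magnitudes below are hypothetical and chosen purely for illustration.

```python
# Independent error sources add on the variance scale. The component sizes are
# hypothetical, picked only to show how one component (here, equating) can dominate.
import math

se_measurement = 0.4   # contribution of measurement error to a cohort mean (score points)
se_sampling    = 1.0   # sampling error of the cohort mean
se_equating    = 1.5   # error carried over from linking this year's scale to last year's

se_total = math.sqrt(se_measurement**2 + se_sampling**2 + se_equating**2)
print(round(se_total, 2))   # ~1.85, dominated by the equating component
```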

15.
Applied Measurement in Education, 2013, 26(2), 173-185
More attention is being given to evaluating the quality of school-level assessment scores due to their importance for school-based planning and monitoring effectiveness. In this study, cross-year stability is proposed as an indicator of data quality and the degree of stability that is appropriate for large-scale assessments of student performance is explored. Following a search of Internet sites, Year 1 to Year 2 stability coefficients were calculated for assessment data from 21 states and 2 provinces. The median stability coefficient was .78 in mathematics and reading, but coefficients for writing were generally lower. A stability coefficient of .80 is recommended as the standard for large-scale assessments of student performance. A high degree of cross-year stability makes it easier to detect and attribute changes in school-level scores to school improvement efforts. The link between stability and reliability and several factors that may attenuate stability are discussed.
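The stability coefficient described here is, in essence, the correlation between school-level scores in adjacent years. The sketch below simulates school means with a stable component plus year-specific noise calibrated so the correlation lands near the recommended .80; all numbers are made up.

```python
# Cross-year stability as a Pearson correlation of school-level scores in adjacent
# years. Data are simulated; the .80 benchmark comes from the abstract above.
import numpy as np

rng = np.random.default_rng(1)
n_schools = 300
school_effect = rng.normal(0, 1.0, n_schools)            # stable component
year1 = school_effect + rng.normal(0, 0.5, n_schools)     # year-specific noise
year2 = school_effect + rng.normal(0, 0.5, n_schools)

stability = np.corrcoef(year1, year2)[0, 1]
print(round(stability, 2))   # around .80 with this signal-to-noise ratio
```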

16.
The use of alternative assessments has led many researchers to reexamine traditional views of test qualities, especially validity. Because alternative assessments generally aim at measuring complex constructs and employ rich assessment tasks, it becomes more difficult to demonstrate (a) the validity of the inferences we make and (b) that these inferences extrapolate to target domains beyond the assessment itself. An approach to addressing these issues from the perspective of language testing is described. It is then argued that in both language testing and educational assessment we must consider the roles of both language and content knowledge, and that our approach to the design and development of performance assessments must be both construct-based and task-based.

17.
This study investigates the extent to which contextualized and non-contextualized mathematics test items have a differential impact on examinee effort. Mixture item response theory (IRT) models are applied to two subsets of items from a national assessment on mathematics in the second grade of the pre-vocational track in secondary education in Flanders. One subset focused on elementary arithmetic and consisted of non-contextualized items. Another subset of contextualized items focused on the application of arithmetic in authentic problem-solving situations. Results indicate that differential performance on the subsets is to a large extent due to test effort. The non-contextualized items appear to be much more susceptible to low examinee effort in low-stakes testing situations. However, subgroups of students can be found with regard to the extent to which they show low effort. One can distinguish a compliant, an underachieving, and a dropout group. Group membership is also linked to relevant background characteristics.

18.
The QUASAR Cognitive Assessment Instrument (QCAI) is designed to measure program outcomes and growth in mathematics. It consists of a relatively large set of open-ended tasks that assess mathematical problem solving, reasoning, and communication at the middle-school grade levels. This study provides some evidence for the generalizability and validity of the assessment. The results from the generalizability studies indicate that the error due to raters is minimal, whereas there is considerable differential student performance across tasks. The dependability of grade level scores for absolute decision making is encouraging; when the number of students is equal to 350, the coefficients are between .80 and .97 depending on the form and grade level. As expected, there tended to be a higher relationship between the QCAI scores and both the problem solving and conceptual subtest scores from a mathematics achievement multiple-choice test than between the QCAI scores and the mathematics computation subtest scores.
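The "dependability of grade level scores for absolute decision making" refers to a generalizability-theory coefficient of the Φ type, in which universe-score variance is divided by itself plus absolute-error variance, and the error term shrinks as students and tasks are added. The sketch below uses a deliberately simplified design and invented variance components; it is not the QCAI analysis.

```python
# Sketch of a phi-type dependability coefficient for a group-level (e.g., grade-level)
# score. Design simplified to a group effect, persons, tasks, and a residual; the
# variance components are invented for illustration, not QCAI estimates.
def phi_group_mean(var_group, var_person, var_task, var_residual, n_persons, n_tasks):
    absolute_error = (var_person / n_persons
                      + var_task / n_tasks
                      + var_residual / (n_persons * n_tasks))
    return var_group / (var_group + absolute_error)

# Hypothetical components, with 350 students and 9 tasks per form
print(round(phi_group_mean(var_group=0.15, var_person=0.60, var_task=0.30,
                           var_residual=0.90, n_persons=350, n_tasks=9), 2))  # ~0.81
```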

19.
In this study we examined variations of the nonequivalent groups equating design for tests containing both multiple-choice (MC) and constructed-response (CR) items to determine which design was most effective in producing equivalent scores across the two tests to be equated. Using data from a large-scale exam, this study investigated the use of anchor CR item rescoring (known as trend scoring) in the context of classical equating methods. Four linking designs were examined: an anchor with only MC items; a mixed-format anchor test containing both MC and CR items; a mixed-format anchor test incorporating common CR item rescoring; and an equivalent groups (EG) design with CR item rescoring, thereby avoiding the need for an anchor test. Designs using either MC items alone or a mixed anchor without CR item rescoring resulted in much larger bias than the other two designs. The EG design with trend scoring resulted in the smallest bias, leading to the smallest root mean squared error value.
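Bias and root mean squared error, the two criteria used to compare the linking designs, are straightforward to compute once a criterion equating is available. The arrays below are hypothetical equated scores, not data from the exam studied.

```python
# Bias and RMSE of a candidate equating relative to a trusted criterion equating.
# Both arrays are hypothetical equated scores, used only to show the computation.
import numpy as np

criterion = np.array([10.0, 12.5, 15.0, 17.5, 20.0])
candidate = np.array([10.4, 12.8, 15.1, 17.9, 20.6])

diff = candidate - criterion
bias = diff.mean()
rmse = np.sqrt((diff ** 2).mean())
print(round(bias, 2), round(rmse, 2))   # 0.36 and 0.39 for these made-up values
```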

20.
The evaluation of developmental interventions has been hampered by a lack of practical, reliable, and objective developmental assessment systems. This article describes the construction of a domain-general computerized developmental assessment system for texts: the Lexical Abstraction Assessment System (LAAS). The LAAS provides assessments of the order of hierarchical complexity of oral and written texts, employing scoring rules developed with predictive discriminant analysis. The LAAS is made possible by a feature of conceptual structure we call hierarchical order of abstraction, which produces systematic quantifiable changes in lexical composition with development. The LAAS produces scores that agree with human ratings of hierarchical complexity more than 80% of the time within one-third of a complexity order across 6 complexity orders (18 levels), spanning the portion of the lifespan from about 4 years of age through adulthood. This corresponds to a Kendall's tau of .93.
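The two agreement figures quoted (percent of texts scored within one-third of a complexity order, and Kendall's tau) can be reproduced from paired human and machine ratings as in the sketch below; the ratings shown are invented, not LAAS output.

```python
# Agreement within one-third of a complexity order, plus Kendall's tau, for paired
# human and machine ratings. The rating values are hypothetical, not LAAS data.
import numpy as np
from scipy.stats import kendalltau

human   = np.array([7.00, 7.33, 8.00, 8.67, 9.00, 9.33, 10.00, 10.33])
machine = np.array([7.33, 7.33, 8.33, 8.67, 9.33, 10.00, 10.00, 10.33])

within_third = np.mean(np.abs(human - machine) <= 1/3 + 1e-9)
tau, _ = kendalltau(human, machine)
print(round(within_third, 2), round(tau, 2))  # 0.88 and roughly 0.96 for this toy set
```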
