Similar Literature
 20 similar documents found (search time: 93 ms)
1.
A framework for evaluation and use of automated scoring of constructed‐response tasks is provided that entails both evaluation of automated scoring and guidelines for implementation and maintenance in the context of constantly evolving technologies. Validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring for tests used for high‐stakes purposes. These guidelines are intended to be generalizable to new automated scoring systems and to existing systems as they change over time.

2.
The scoring process is critical in the validation of tests that rely on constructed responses. Documenting that readers carry out the scoring in ways consistent with the construct and measurement goals is an important aspect of score validity. In this article, rater cognition is approached as a source of support for a validity argument for scores based on constructed responses, whether such scores are to be used on their own or as the basis for other scoring processes, for example, automated scoring.

3.
Rater training is an important part of developing and conducting large‐scale constructed‐response assessments. As part of this process, candidate raters have to pass a certification test to confirm that they are able to score consistently and accurately before they begin scoring operationally. Moreover, many assessment programs require raters to pass a calibration test before every scoring shift. To support the high‐stakes decisions made on the basis of rater certification tests, a psychometric approach for their development, analysis, and use is proposed. The circumstances and uses of these tests suggest that they are expected to have relatively low reliability. This expectation is supported by empirical data. Implications for the development and use of these tests to ensure their quality are discussed.

4.
Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers though fewer than the number of items.

5.
Many large‐scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically, simple structured tests, such as these rely on multiple multiple‐choice and/or constructed responses sections of items to generate multiple scores. In the current article, we propose an extension of the hierarchical rater model (HRM) to be applied with simple structured tests with constructed response items. In addition to modeling the appropriate trait structure, the multidimensional HRM (M‐HRM) presented here also accounts for rater severity bias and rater variability or inconsistency. We introduce the model formulation, test parameter recovery with a focus on latent traits, and compare the M‐HRM to other scoring approaches (unidimensional HRMs and a traditional multidimensional item response theory model) using simulated and empirical data. Results show more precise scores under the M‐HRM, with a major improvement in scores when incorporating rater effects versus ignoring them in the traditional multidimensional item response theory model.

6.
During the last three decades the constructed response format has gradually gained entry in large‐scale assessments of reading comprehension. In its 1991 Reading Literacy Study, the International Association for the Evaluation of Educational Achievement (IEA) included constructed response items on an exploratory basis. Ten years later, in the Progress in International Reading Literacy Study (PIRLS) 2001, the constructed response format is ascribed special significance as bearer of central insights to the definition of reading literacy. This article focuses on the significance of the scoring guides and the relation between these guides on the one hand, and the text and the items on the other hand. This relation, as it is found in PIRLS 2001, is discussed, showing both examples of success and more problematic aspects in the operationalisation of the intentions expressed in the theoretical framework for the test. Handling the problem of semantic openness is essential in representing depth of understanding and represents a field of possibilities for further research and development.

7.
In signal detection rater models for constructed response (CR) scoring, it is assumed that raters discriminate equally well between different latent classes defined by the scoring rubric. An extended model that relaxes this assumption is introduced; the model recognizes that a rater may not discriminate equally well between some of the scoring classes. The extension recognizes a different type of rater effect and is shown to offer useful tests and diagnostic plots of the equal discrimination assumption, along with ways to assess rater accuracy and various rater effects. The approach is illustrated with an application to a large‐scale language test.

8.
In response to the demand for sound science assessments, this article presents the development of a latent construct called knowledge integration as an effective measure of science inquiry. Knowledge integration assessments ask students to link, distinguish, evaluate, and organize their ideas about complex scientific topics. The article focuses on assessment topics commonly taught in 6th- through 12th-grade classes. Items from both published standardized tests and previous knowledge integration research were examined in 6 subject-area tests. Results from Rasch partial credit analyses revealed that the tests exhibited satisfactory psychometric properties with respect to internal consistency, item fit, weighted likelihood estimates, discrimination, and differential item functioning. Compared with items coded using dichotomous scoring rubrics, those coded with the knowledge integration rubrics yielded significantly higher discrimination indexes. The knowledge integration assessment tasks, analyzed using knowledge integration scoring rubrics, demonstrate strong promise as effective measures of complex science reasoning in varied science domains.

9.
Drawing from multiple theoretical frameworks representing cognitive and educational psychology, we present a writing task and scoring system for measurement of students’ informative writing. Participants in this study were 72 fifth- and sixth-grade students who wrote compositions describing real-world problems and how mathematics, science, and social studies information could be used to solve those problems. Of the 72 students, 69 were able to craft a cohesive response that not only demonstrated planning in writing structure but also elaboration of relevant knowledge in one or more domains. Many-facet Rasch Modeling (MFRM) techniques were used to examine the reliability and validity of scores for the writing rating scale. Additionally, comparison of fifth- and sixth-grade responses supported the validity of scores, as did the results of a correlational analysis with scores from an overall interest measure. Recommendations for improving writing scoring systems based on the findings of this investigation are provided.

10.
The first generation of computer-based tests depends largely on multiple-choice items and constructed-response questions that can be scored through literal matches with a key. This study evaluated scoring accuracy and item functioning for an open-ended response type where correct answers, posed as mathematical expressions, can take many different surface forms. Items were administered to 1,864 participants in field trials of a new admissions test for quantitatively oriented graduate programs. Results showed automatic scoring to approximate the accuracy of multiple-choice scanning, with all processing errors stemming from examinees improperly entering responses. In addition, the items functioned similarly in difficulty, item-total relations, and male-female performance differences to other response types being considered for the measure.

11.
The engagement of teachers as raters to score constructed response items on assessments of student learning is widely claimed to be a valuable vehicle for professional development. This paper examines the evidence behind those claims from several sources, including research and reports over the past two decades, information from a dozen state educational agencies regarding past and ongoing involvement of teachers in scoring‐related activities as of 2001, and interviews with educators who served a decade or more ago for one state's innovative performance assessment program. That evidence reveals that the impact of scoring experience on teachers is more provisional and nuanced than has been suggested. The author identifies possible issues and implications associated with attempts to distill meaningful skills and knowledge from hand‐scoring training and practice, along with other forms of teacher involvement in assessment development and implementation. The paper concludes with a series of research questions that—based on current and proposed practice for the coming decade—seem to the author to require the most immediate attention.

12.
This study explored the use of machine learning to automatically evaluate the accuracy of students’ written explanations of evolutionary change. Performance of the Summarization Integrated Development Environment (SIDE) program was compared to human expert scoring using a corpus of 2,260 evolutionary explanations written by 565 undergraduate students in response to two different evolution instruments (the EGALT-F and EGALT-P) that contained prompts that differed in various surface features (such as species and traits). We tested human-SIDE scoring correspondence under a series of different training and testing conditions, using Kappa inter-rater agreement values of greater than 0.80 as a performance benchmark. In addition, we examined the effects of response length on scoring success; that is, whether SIDE scoring models functioned with comparable success on short and long responses. We found that SIDE performance was most effective when scoring models were built and tested at the individual item level and that performance degraded when suites of items or entire instruments were used to build and test scoring models. Overall, SIDE was found to be a powerful and cost-effective tool for assessing student knowledge and performance in a complex science domain.

13.
Educational Assessment, 2013, 18(2): 163-189
This research examined between-group differences in test-related perceptions, engagement, and performance; and within-group predictors of science performance among groups of high school students characterized by different patterns of science motivation. Patterns of motivation were derived from Dweck's (1986) typology and were used to classify students as mastery oriented, ego oriented, helpless, or "unclassified by such a typology." Groups were then compared on their efficacy for performing successfully on science multiple-choice tests, constructed response tests, and performance assessments; their beliefs about the validity of each test format; and their actual performance on multiple-choice and constructed response items. Group differences in gender composition, test perceptions and engagement, and performance were found. Results are discussed in terms of Snow's (1994) theory of aptitude complexes and their relation to individual differences in performance.

14.
It is not always convenient or appropriate to construct tests in which individual items are fungible. There are situations in which small clusters of items (testlets) are the units that are assembled to create a test. Using data from a test of reading comprehension constructed of four passages with several questions following each passage, we show that local independence fails at the level of the individual questions. The questions following each passage, however, constitute a testlet. We discuss the application to testlet scoring of some multiple-category models originally developed for individual items. In the example examined, the concurrent validity of the testlet scoring equaled or exceeded that of individual-item-level scoring.

15.
As methods for automated scoring of constructed‐response items become more widely adopted in state assessments, and are used in more consequential operational configurations, it is critical that their susceptibility to gaming behavior be investigated and managed. This article provides a review of research relevant to how construct‐irrelevant response behavior may affect automated constructed‐response scoring, and aims to address a gap in that literature: the need to assess the degree of risk before operational launch. A general framework is proposed for evaluating susceptibility to gaming, and an initial empirical demonstration is presented using the open‐source short‐answer scoring engines from the Automated Student Assessment Prize (ASAP) Challenge.

16.

Too often tests are used with clients for whom the validity of the test has not been established. As a case in point we studied the use of the Human Figure Drawing (HFD) test with children living in Curaçao, a small island in the Caribbean. In this community no time and money are available for developing tests and establishing their validity and norms. We suggest that borrowing such information can be a relatively good, inexpensive alternative, provided that clinicians make the best of choices. This paper formulates three requirements, which should be met by the group of clients a clinician is working with. As an example we explored to what extent the requirements are being satisfied by 96 Curaçaoan Grade 4 school children. With regard to these children we conclude that clinicians using the HFD test can best use US representative frequency tables for scoring.

17.
Interpreting and creating graphs plays a critical role in scientific practice. The K-12 Next Generation Science Standards call for students to use graphs for scientific modeling, reasoning, and communication. To measure progress on this dimension, we need valid and reliable measures of graph understanding in science. In this research, we designed items to measure graph comprehension, critique, and construction and developed scoring rubrics based on the knowledge integration (KI) framework. We administered the items to over 460 middle school students. We found that the items formed a coherent scale and had good reliability using both item response theory and classical test theory. The KI scoring rubric showed that most students had difficulty linking graph features to science concepts, especially when asked to critique or construct graphs. In addition, students with limited access to computers as well as those who speak a language other than English at home have less integrated understanding than others. These findings point to the need to increase the integration of graphing into science instruction. The results suggest directions for further research leading to comprehensive assessments of graph understanding.

18.
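Using incentive theory, this paper analyzes the principal-agent relationship between the government and public transit enterprises. From the perspective of motivating transit enterprises to raise their service levels, it constructs a principal-agent model of the government and urban public transit enterprises under asymmetric information, and proves that an appropriate government subsidy mechanism can motivate urban public transit enterprises to optimize their service levels. This provides a theoretical basis for government decisions on subsidy incentives for enterprises.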

19.
Although test scores from similar tests in multiple choice and constructed response formats are highly correlated, equivalence in rankings may mask differences in substantive strategy use. The author used an experimental design and participant think-alouds to explore cognitive processes in mathematical problem solving among undergraduate examinees (N = 64). The study examined the effect of format on mathematics performance and strategy use for male and female examinees given stem-equivalent items. A statistically significant main effect of format on performance was found, with constructed-response items more difficult. The multiple-choice format was associated with more varied strategies, backward strategies, and guessing. Format was found to moderate the effect of problem conceptualization on performance. Results suggest that while for purposes of ranking students on performance, the multiple-choice format may be adequate, for many contemporary educational purposes that seek to provide nuanced information about student cognition, the constructed response format should be preferred.

20.
In probabilistic test and scoring systems the examinee is required to respond to each of the options of a multiple-choice test with a probability which represents the confidence he has in that option. It seems reasonable to assume that for such tests to yield valid information about the examinees, the knowledge they have should be the primary influence on the probabilities they assign. The purpose of this study was to seek the relationship between the degree to which examinees display certainty in their responses and certain personality variables. Proponents of probabilistic testing would expect such correlations to be low. In this study, it was found that individuals do respond to multiple-choice questions with a characteristic certainty that cannot be accounted for on the basis of their knowledge. This certainty was related to scores of both the F Scale and the Kogan and Wallach risk taking measure.
