期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A Comparison of Three Scoring Methods for Tests With Selected-Response and Constructed-Response Items

《Educational Assessment》2013,18(4):317-340

A number of methods for scoring tests with selected-response (SR) and constructed-response (CR) items are available. The selection of a method depends on the requirements of the program, the particular psychometric model and assumptions employed in the analysis of item and score data, and how scores are to be used. This article compares 3 methods: unweighted raw scores, Item Response Theory pattern scores, and weighted raw scores. Student score data from large-scale end-of-course high school tests in Biology and English were used in the comparisons. In the weighted raw score method evaluated in this study, the CR items were weighted so that SR and CR items contributed the same number of points toward the total score. The scoring methods were compared for the total group and for subgroups of students in terms of the resultant scaled score distributions, standard errors of measurement, and proficiency-level classifications. For most of the student ability distribution, the three scoring methods yielded similar results. Some differences in results are noted. Issues to be considered when selecting a scoring method are discussed. 相似文献

2.

试题命制的理论和技术(一) 总被引：1，自引：0，他引：1

雷新勇周群《考试研究》2008,(1):84-97

大规模教育考试试题命制是以心理学的某些理论假设为基础。与这些理论假设一致的试题定义要求试题应该具备三个要素:测量目标、刺激情境和设问,这三个要素缺失了任何一个,都不能构成完整的试题。根据这些理论假设以及试题定义和要素,本文讨论了命制客观题和主观题的基本要求,客观题包括题干的要求、选项设置的要求以及选项数的问题;主观题包括情境材料的选择、设问、赋分和评分标准制定。相似文献

3.

试题命制的理论和技术(二) 总被引：1，自引：0，他引：1

雷新勇周群《考试研究》2008,(2):90-106

大规模教育考试试题命制以心理学的某些理论假设为基础。与这些理论假设一致的试题定义要求试题应该具备三个要素,即测量目标、刺激情境和设问,这三个要素缺失了任何一个,都不能构成完整的试题。根据这些理论假设以及试题定义和要素,本文讨论了命制客观题和主观题的基本要求,客观题包括题干的要求、选项设置的要求以及选项数的问题;主观题包括情境材料的选择、设问、赋分和评分标准制定。相似文献

4.

The Interaction of Reader and Task Factors in the Assessment of Reading Comprehension

《Journal of Experimental Education》2012,80(4):199-206

This study examined the effects of selected reader and task variables on reading comprehension performance. Fifty deaf students and 61 hearing students of comparable reading skill level were blocked by cognitive style scores into field dependent or field independent groups. Subjects read 12 passages and completed selected-response and constructed-response question tasks under both lookback and no-lookback conditions. The passage questions tapped both text-explicit and text-implicit information. Several reader and task interaction effects were found to be significant, particularly for lookback conditions and constructed-response tasks. Moreover, cognitive style interacted with hearing state on tasks involving lookback options. Implications are drawn for further consideration of differential test administration and training in test-taking skills for certain types of readers. 相似文献

5.

The Effects of Test Length and Sample Size on the Reliability and Equating of Tests Composed of Constructed-Response Items

《教育实用测度》2013,26(1):31-57

Examined in this study were the effects of test length and sample size on the alternate forms reliability and the equating of simulated mathematics tests composed of constructed-response items scaled using the 2-parameter partial credit model. Test length was defined in terms of the number of both items and score points per item. Tests with 2, 4, 8, 12, and 20 items were generated, and these items had 2, 4, and 6 score points. Sample sizes of 200, 500, and 1,000 were considered. Precise item parameter estimates were not found when 200 cases were used to scale the items. To obtain acceptable reliabilities and accurate equated scores, the findings suggested that tests should have at least eight 6-point items or at least 12 items with 4 or more score points per item. 相似文献

6.

英语测试中选择题型对学习者的负面影响及相关对策

黄河《培训与研究》2008,(12):112-114

我国大多英语测试题型结构以客观性试题的设置为主,其中选择—反应测验题目占据了最大比例。然而由于以选择—反应测试题目为主的英语考试对学生学习成就测量的准确性不足,这种英语测试题对学生的学习动机、学习方法、学习效果等有较大负面影响,因此英语测试题目设置应当进行针对性的进行改革。相似文献

7.

Comparisons among Designs for Equating Mixed-Format Tests in Large-Scale Assessments

Sooyeon Kim Michael E. Walker Frederick McHale 《Journal of Educational Measurement》2010,47(1):36-53

In this study we examined variations of the nonequivalent groups equating design for tests containing both multiple-choice (MC) and constructed-response (CR) items to determine which design was most effective in producing equivalent scores across the two tests to be equated. Using data from a large-scale exam, this study investigated the use of anchor CR item rescoring (known as trend scoring) in the context of classical equating methods. Four linking designs were examined: an anchor with only MC items, a mixed-format anchor test containing both MC and CR items; a mixed-format anchor test incorporating common CR item rescoring; and an equivalent groups (EG) design with CR item rescoring, thereby avoiding the need for an anchor test. Designs using either MC items alone or a mixed anchor without CR item rescoring resulted in much larger bias than the other two designs. The EG design with trend scoring resulted in the smallest bias, leading to the smallest root mean squared error value. 相似文献

8.

Scoring Constructed Responses Using Expert Systems

Henry I. Braun Randy Elliot Bennett Douglas Frye Elliot Soloway 《Journal of Educational Measurement》1990,27(2):93-108

The use of constructed-response items in large scale standardized testing has been hampered by the costs and difficulties associated with obtaining reliable scores. The advent of expert systems may signal the eventual removal of this impediment. This study investigated the accuracy with which expert systems could score a new, nonmultiple-choice item type. The item type presents a faulty solution to a computer programming problem and asks the student to correct the solution. This item type was administered to a sample of high school seniors enrolled in an Advanced Placement course in Computer Science who also took the Advanced Placement Computer Science (APCS) examination. Results indicated that the expert systems were able to produce scores for between 82% and 95% of the solutions encountered and to display high agreement with a human reader on the correctness of the solutions. Diagnoses of the specific errors produced by students were less accurate. Correlations with scores on the objective and free-response sections of the APCS examination were moderate. Implications for additional research and for testing practice are offered. 相似文献

9.

Calibration and Scoring of Tests With Multiple-Choice and Constructed-Response Item Types

Kadriye Ercikan Richard D. Sehwarz Marc W. Julian George R. Burket Melba M. Weber Valerie Link 《Journal of Educational Measurement》1998,35(2):137-154

This article discusses and demonstrates combining scores from multiple-choice (MC) and constructed-response (CR) items to create a common scale using item response theory methodology. Two specific issues addressed are (a) whether MC and CR items can be calibrated together and (b) whether simultaneous calibration of the two item types leads to loss of information. Procedures are discussed and empirical results are provided using a set of tests in the areas of reading, language, mathematics, and science in three grades. 相似文献

10.

Cognitive Processing requirements of Constructed Figural Response and Multiple-Choice Items in Architecture Assessment

《Educational Assessment》2013,18(1):83-98

Contrasts between constructed-response items and multiple-choice counterparts have yielded but a few weak generalizations. Such contrasts typically have been based on the statistical properties of groups of items, an approach that masks differences in properties at the item level and may lead to inaccurate conclusions. In this article, we examine item-level differences between a certain type of constructed-response item (called figural response) and comparable multiple-choice items in the domain of architecture. Our data show that in comparing two item formats, item-level differences in difficulty correspond to differences in cognitive processing requirements and that relations between processing requirements and psychometric properties are systematic. These findings illuminate one aspect of construct validity that is frequently neglected in comparing item types, namely the cognitive demand of test items. 相似文献

11.

When Is Reading Also Writing: Sources of Individual Differences on the New Reading Performance Assessments

《Scientific Studies of Reading》2013,17(2):125-151

This research examined component processes that contribute to performance on one of the new, standards-based reading tests that have become a staple in many states. Participants were 60 Grade 4 students randomly sampled from 7 classrooms in a rural school district. The particular test we studied employed a mixture of traditional (multiple-choice) and performance assessment approaches (constructed-response items that required written responses). Our findings indicated that multiple-choice and constructed-response items enlisted different cognitive skills. Writing ability emerged as an important source of individual differences in explaining overall reading ability, but its influence was limited to performance on constructed-response items. After controlling for word identification and listening, writing ability accounted for no variance in multiple-choice reading scores. By contrast, writing ability accounted for unique variance in reading ability, even after controlling for word identification and listening skill, and explained more variance in constructed-response reading scores than did either word identification or listening skill. In addition, performance on the multiple-choice reading measure along with writing ability accounted for nearly all of the reliable variance in performance on the constructed-response reading measure. 相似文献

12.

How invariant and accurate are domain ratings in writing assessment?

Stefanie A. Wind George Engelhard 《Assessing Writing》2013,18(4):278-299

The use of evidence to guide policy and practice in education (Cooper, Levin, & Campbell, 2009) has included an increased emphasis on constructed-response items, such as essays and portfolios. Because assessments that go beyond selected-response items and incorporate constructed-response items are rater-mediated (Engelhard, 2002, Engelhard, 2013), it is necessary to develop evidence-based indices of quality for the rating processes used to evaluate student performances. This study proposes a set of criteria for evaluating the quality of ratings based on the concepts of measurement invariance and accuracy within the context of a large-scale writing assessment. Two measurement models are used to explore indices of quality for raters and ratings: the first model provides evidence for the invariance of ratings, and the second model provides evidence for rater accuracy. Rating quality is examined within four writing domains from an analytic rubric. Further, this study explores the alignment between indices of rating quality based on these invariance and accuracy models within each of the four domains of writing. Major findings suggest that rating quality varies across analytic rubric domains, and that there is some correspondence between indices of rating quality based on the invariance and accuracy models. Implications for research and practice are discussed. 相似文献

13.

Investigating the Effect of Different Selected-response Item Formats for Reading Comprehension

Anthony Becker Tatiana Nekrasova-Beker 《Educational Assessment》2018,23(4):296-317

While previous research has identified numerous factors that contribute to item difficulty, studies involving large-scale reading tests have provided mixed results. This study examined five selected-response item types used to measure reading comprehension in the Pearson Test of English Academic: a) multiple-choice (choose one answer), b) multiple-choice (choose multiple answers), c) re-order paragraphs, d) reading (fill-in-the-blanks), and e) reading and writing (fill-in-the-blanks). Utilizing a multiple regression approach, the criterion measure consisted of item difficulty scores for 172 items. 18 passage, passage-question, and response-format variables served as predictors. Overall, four significant predictors were identified for the entire group (i.e., sentence length, falsifiable distractors, number of correct options, and abstractness of information requested) and five variables were found to be significant for high-performing readers (including the four listed above and passage coherence); only the number of falsifiable distractors was a significant predictor for low-performing readers. Implications for assessing reading comprehension are discussed. 相似文献

14.

Gender DIF in Reading and Mathematics Tests With Mixed Item Formats

Catherine S. Taylor Yoonsun Lee 《教育实用测度》2013,26(3):246-280

This was a study of differential item functioning (DIF) for grades 4, 7, and 10 reading and mathematics items from state criterion-referenced tests. The tests were composed of multiple-choice and constructed-response items. Gender DIF was investigated using POLYSIBTEST and a Rasch procedure. The Rasch procedure flagged more items for DIF than did the simultaneous item bias procedure—particularly multiple-choice items. For both reading and mathematics tests, multiple-choice items generally favored males while constructed-response items generally favored females. Content analyses showed that flagged reading items typically measured text interpretations or implied meanings; males tended to benefit from items that asked them to identify reasonable interpretations and analyses of informational text. Most items that favored females asked students to make their own interpretations and analyses, of both literary and informational text, supported by text-based evidence. Content analysis of mathematics items showed that items favoring males measured geometry, probability, and algebra. Mathematics items favoring females measured statistical interpretations, multistep problem solving, and mathematical reasoning. 相似文献

15.

Formula Scoring of Multiple-Choice Tests (Correction for Guessing)

Robert B. Frary 《Educational Measurement》1988,7(2):33-38

Formula scoring is a procedure designed to reduce multiple-choice test score irregularities due to guessing. Typically, a formula score is obtained by subtracting a proportion of the number of wrong responses from the number correct. Examinees are instructed to omit items when their answers would be sheer guesses among all choices but otherwise to guess when unsure of an answer. Thus, formula scoring is not intended to discourage guessing when an examinee can rule out one or more of the options within a multiple-choice item. Examinees who, contrary to the instructions, do guess blindly among all choices are not penalized by formula scoring on the average; depending on luck, they may obtain better or worse scores than if they had refrained from this guessing. In contrast, examinees with partial information who refrain from answering tend to obtain lower formula scores than if they had guessed among the remaining choices. (Examinees with misinformation may be exceptions.) Formula scoring is viewed as inappropriate for most classroom testing but may be desirable for speeded tests and for difficult tests with low passing scores. Formula scores do not approximate scores from comparable fill-in-the-blank tests, nor can formula scoring preclude unrealistically high scores for examinees who are very lucky. 相似文献

16.

The Psychometric Characteristics of Choice Items

Anne R. Fitzpatrick Wendy M. Yen 《Journal of Educational Measurement》1995,32(3):243-259

This study investigated the psychometric characteristics of constructed-response (CR) items referring to choice and non-choice passages administered to students in Grades 3, 5, and 8. The items were scaled using item response theory (IRT) methodology. The results indicated no consistent differences in the difficulty and discrimination of the items referring to the two types of passages. On the average, students' scale scores on the choice and non-choice passages were comparable. Finally, the choice passages differed in terms of overall popularity and in their attractiveness to different gender and ethnic groups 相似文献

17.

Generalizability of Cognitive Interview-Based Measures Across Cultural Groups

Guillermo Solano-Flores Min Li 《Educational Measurement》2009,28(2):9-18

相似文献

18.

Building a Taxonomy for Constructed-Response Test Items

《Educational Assessment》2013,18(2):133-147

This article presents a beginning effort to build a taxonomy for constructed-response test items. The taxonomy defines the categories for various item formats in three distinct dimensions: (a) type of reasoning competency employed, (b) nature of cognitive continuum employed, and (c) kind of response yielded. Each dimension is described, and the reasons for incorporating it into the taxonomy are explained. A theoretical rationale for the taxonomy is developed, and advantages and shortcomings of its use are noted. 相似文献

19.

Evaluating an Automatically Scorable, Open-Ended Response Type for Measuring Mathematical Reasoning in Computer-Adaptive Tests

Randy Elliot Bennett Manfred Steffen Mark Kevin Singley Mary Morley Daniel Jacquemin 《Journal of Educational Measurement》1997,34(2):162-176

The first generation of computer-based tests depends largely on multiple-choice items and constructed-response questions that can be scored through literal matches with a key. This study evaluated scoring accuracy and item functioning for an open-ended response type where correct answers, posed as mathematical expressions, can take many different surface forms. Items were administered to 1,864 participants in field trials of a new admissions test for quantitatively oriented graduate programs. Results showed automatic scoring to approximate the accuracy of multiple-choice scanning, with all processing errors stemming from examinees improperly entering responses. In addition, the items functioned similarly in difficulty, item-total relations, and male-female performance differences to other response types being considered for the measure. 相似文献

20.

Scaling Performance Assessments: A Comparison of One-Parameter and Two-Parameter Partial Credit Models

Anne R. Fitzpatrick Valerie B. Link Wendy M. Yen George R. Burket Kyoko Ito Robert C. Sykes 《Journal of Educational Measurement》1996,33(3):291-314

In one study, parameters were estimated for constructed-response (CR) items in 8 tests from 4 operational testing programs using the l-parameter and 2- parameter partial credit (IPPC and 2PPC) models. Where multiple-choice (MC) items were present, these models were combined with the 1-parameter and 3-parameter logistic (IPL and 3PL) models, respectively. We found that item fit was better when the 2PPC model was used alone or with the 3PL model. Also, the slopes of the CR and MC items were found to differ substantially. In a second study, item parameter estimates produced using the IPL-IPPC and 3PL-2PPC model combinations were evaluated for fit to simulated data generated using true parameters known to fit one model combination or ttle other. The results suggested that the more flexible 3PL-2PPC model combination would produce better item fit than the IPL-1PPC combination. 相似文献