期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Are Tests Comprising Both Multiple-Choice and Free-Response Items Necessarily Less Unidimensional Than Multiple-Choice Tests?An Analysis of Two Tests

David Thissen Howard Wainer Xiang-Bo Wang 《Journal of Educational Measurement》1994,31(2):113-123

We consider the relationship between the multiple-choice and free-response sections on the Computer Science and Chemistry tests of the College Board's Advanced Placement program. Restricted factor analysis shows that the free-response sections measure the same underlying proficiency as the multiple-choice sections for the most part. However, there is also a significant, if relatively small, amount of local dependence among the free-response items that produces a small degree of multidimensionauty for each test 相似文献

2.

MULTIPLE-CHOICE VERSUS FREE-RESPONSE: A SIMULATION STUDY

ROBERT B. FRARY 《Journal of Educational Measurement》1985,22(1):21-31

Responses to a 40-item test were simulated for 150 examinees under free-response and multiple-choice formats. The simulation was replicated three times for each of 30 variations reflecting format and the extent to which examinees were (a) misinformed, (b) successful in guessing free-response answers, and (c) able to recognize with assurance correct multiple-choice options that they could not produce under free-response testing. Internal consistency reliability (KR20) estimates were consistently higher for the free-response score sets, even when the free-response item difficulty indices were augmented to yield mean scores comparable to those from multiple-choice testing. In addition, all test score sets were correlated with four randomly generated sets of unit-normal measures, whose intercorrelations ranged from moderate to strong. These measures served as criteria because one of them had been used as the basic ability measure in the simulation of the test score sets. Again, the free-response score sets yielded superior results even when tests of equal difficulty were compared. The guessing and recognition factors had little or no effect on reliability estimates or correlations with the criteria. The extent of misinformation affected only multiple-choice score KR20's (more misinformation—higher KR20's). Although free-response tests were found to be generally superior, the extent of their advantage over multiple-choice was judged sufficiently small that other considerations might justifiably dictate format choice. 相似文献

3.

Scoring Constructed Responses Using Expert Systems

Henry I. Braun Randy Elliot Bennett Douglas Frye Elliot Soloway 《Journal of Educational Measurement》1990,27(2):93-108

The use of constructed-response items in large scale standardized testing has been hampered by the costs and difficulties associated with obtaining reliable scores. The advent of expert systems may signal the eventual removal of this impediment. This study investigated the accuracy with which expert systems could score a new, nonmultiple-choice item type. The item type presents a faulty solution to a computer programming problem and asks the student to correct the solution. This item type was administered to a sample of high school seniors enrolled in an Advanced Placement course in Computer Science who also took the Advanced Placement Computer Science (APCS) examination. Results indicated that the expert systems were able to produce scores for between 82% and 95% of the solutions encountered and to display high agreement with a human reader on the correctness of the solutions. Diagnoses of the specific errors produced by students were less accurate. Correlations with scores on the objective and free-response sections of the APCS examination were moderate. Implications for additional research and for testing practice are offered. 相似文献

4.

Format Effects of Empirically Derived Multiple-Choice Versus Free-Response Instruments When Assessing Graphing Abilities

Craig Berg Stacy Boote 《International Journal of Science and Mathematics Education》2017,15(1):19-38

相似文献

5.

Method of Measurement and Gender Differences in Scholastic Achievement 总被引：3，自引：0，他引：3

Niall Bolger Thomas Kellaghan 《Journal of Educational Measurement》1990,27(2):165-174

Gender differences in scholastic achievement as a function of method of measurement were examined by comparing the performance of 15-year-old boys (N = 739) and girls (N = 758) in Irish schools on multiple-choice tests and free-response tests (requiring short written answers) of mathematics, Irish, and English achievement. Males performed significantly better than females on multiple-choice tests compared to their performance on free-response examinations. An expectation that the gender difference would be larger for the languages and smaller for mathematics because of the superior verbal skills attributed to females was not fulfilled. 相似文献

6.

Multiple-Choice Models: The Distractors Are Also Part of the Item

David Thissen Lynne Steinberg Anne R. Fitzpatrick 《Journal of Educational Measurement》1989,26(2):161-176

This paper describes an item response model for multiple-choice items and illustrates its application in item analysis. The model provides parametric and graphical summaries of the performance of each alternative associated with a multiple-choice item; the summaries describe each alternative's relationship to the proficiency being measured. The interpretation of the parameters of the multiple-choice model and the use of the model in item analysis are illustrated using data obtained from a pilot test of mathematics achievement items. The use of such item analysis for the detection of flawed items, for item design and development, and for test construction is discussed. 相似文献

7.

The Contribution of a Response-Production Component to a Free-Response Synonym Task

Rianne Janssen Paul De Boeck 《Journal of Educational Measurement》1996,33(4):417-432

A cognitive approach to the study of format differences is illustrated using synonym tasks. By means of a multiple regression analysis with latent variables, it is shown that both a response-production component and an evaluation component are involved in answering a free-response synonym task. Given the results of Janssen, De Boeck, and Vander Steene (1996), the format differences between the multiple-choice evaluation task and the free-response synonym task can be explained in terms of the kinds of verbal abilities measured. The evaluation task is a pure measure of verbal comprehension, while the free-response synonym task is affected by verbal comprehension and verbal fluency, as well. The design used to study format differences controls both for content effects and for the effects of repeating item stems across formats. 相似文献

8.

Reducing the need for guesswork in multiple-choice tests

Martin Bush 《Assessment & Evaluation in Higher Education》2015,40(2):218-231

The humble multiple-choice test is very widely used within education at all levels, but its susceptibility to guesswork makes it a suboptimal assessment tool. The reliability of a multiple-choice test is partly governed by the number of items it contains; however, longer tests are more time consuming to take, and for some subject areas, it can be very hard to create new test items that are sufficiently distinct from previously used items. A number of more sophisticated multiple-choice test formats have been proposed dating back at least 60?years, many of which offer significantly improved test reliability. This paper offers a new way of comparing these alternative test formats, by modelling each one in terms of the range of possible test taker responses it enables. Looking at the test formats in this way leads to the realisation that the need for guesswork is reduced when test takers are given more freedom to express their beliefs. Indeed, guesswork is eliminated entirely when test takers are able to partially order the answer options within each test item. The paper aims to strengthen the argument for using more sophisticated multiple-choice test formats, especially for high-stakes summative assessment. 相似文献

9.

IRT Approaches to Modeling Scores on Mixed-Format Tests

Won-Chan Lee Stella Y. Kim Jiwon Choi Yujin Kang 《Journal of Educational Measurement》2020,57(2):230-254

This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models. 相似文献

10.

Gender Differences in Large-Scale Math Assessments: PISA Trend 2000 and 2003

Ou Lydia Liu Mark Wilson 《教育实用测度》2013,26(2):164-184

Many efforts have been made to determine and explain differential gender performance on large-scale mathematics assessments. A well-agreed-on conclusion is that gender differences are contextualized and vary across math domains. This study investigated the pattern of gender differences by item domain (e.g., Space and Shape, Quantity) and item type (e.g., multiple-choice ⁱ ⁱIn this paper, two kinds of multiple-choice items are discussed: traditional multiple-choice items and complex multiple-choice items. A sample complex multiple choice item is shown in Table 6. The terms “multiple-choice” and “traditional multiple-choice” are used interchangeably to refer to the traditional multiple choice items throughout the paper, while the term “complex multiple-choice” is used to refer to the complex multiple-choice items. Raman K. Grover is now an Independent Psychometrician. items, open constructed-response items). The U.S. portion of the Programme for International Student Assessment (PISA) 2000 and 2003 mathematics assessment was analyzed. A multidimensional Rasch model was used to provide student ability estimates for each comparison. Results revealed a slight but consistent male advantage. Students showed the largest gender difference (d = 0.19) in favor of males on complex multiple-choice items, an unconventional item type. Males and females also showed sizable differences on Space and Shape items, a domain well documented for showing robust male superiority. Contrary to many previous findings reporting male superiority on multiple-choice items, no measurable difference has been identified on multiple-choice items for both the PISA 2000 and the 2003 math assessments. Reasons for the differential gender performance across math domains and item types were speculated, and directions of future research were discussed. 相似文献

11.

Does the Rasch Model Really Work for Multiple-Choice Items? Take Another Look: A Response to Divgi

Grant I-leaning 《Journal of Educational Measurement》1989,26(1):91-97

Divgi's (1986) study concludes, largely on the basis of a proposed test o f fit that is designed to be more powerful than Wright and Panchapakesan's (I 969) test, that the Rasch model is never appropriate for use with multiple-choice type test items. This paper is an attempt to refute the conclusions of Divgi's study. 相似文献

12.

Validating Measurement of Knowledge Integration in Science Using Multiple-Choice and Explanation Items

Hee-Sun Lee Ou Lydia Liu Marcia C. Linn 《教育实用测度》2013,26(2):115-136

This study explores measurement of a construct called knowledge integration in science using multiple-choice and explanation items. We use construct and instructional validity evidence to examine the role multiple-choice and explanation items plays in measuring students' knowledge integration ability. For construct validity, we analyze item properties such as alignment, discrimination, and target range on the knowledge integration scale using a Rasch Partial Credit Model analysis. For instructional validity, we test the sensitivity of multiple-choice and explanation items to knowledge integration instruction using a cohort comparison design. Results show that (1) one third of correct multiple-choice responses are aligned with higher levels of knowledge integration while three quarters of incorrect multiple-choice responses are aligned with lower levels of knowledge integration, (2) explanation items discriminate between high and low knowledge integration ability students much more effectively than multiple-choice items, (3) explanation items measure a wider range of knowledge integration levels than multiple-choice items, and (4) explanation items are more sensitive to knowledge integration instruction than multiple-choice items. 相似文献

13.

Asymmetry in student achievement on multiple-choice and constructed-response items in reversible mathematics processes

Christopher J. Sangwin Ian Jones 《Educational Studies in Mathematics》2017,94(2):205-222

In this paper we report the results of an experiment designed to test the hypothesis that when faced with a question involving the inverse direction of a reversible mathematical process, students solve a multiple-choice version by verifying the answers presented to them by the direct method, not by undertaking the actual inverse calculation. Participants responded to an online test containing equivalent multiple-choice and constructed-response items in two reversible algebraic techniques: factor/expand and solve/verify. The findings supported this hypothesis: Overall scores were higher in the multiple-choice condition compared to the constructed-response condition, but this advantage was significantly greater for items concerning the inverse direction of reversible processes compared to those involving direct processes. 相似文献

14.

Testing formal Reasoning

《教育实用测度》2013,26(2):189-202

相似文献

15.

A Taxonomy of Multiple-Choice Item-Writing Rules

《教育实用测度》2013,26(1):37-50

A taxonomy of 43 multiple-choice item-writing rules is presented and discussed. The taxonomy derives from an analysis of 46 authoritative textbooks and other sources in the educational measurement literature. The analysis also leads to a "validity by consensus" for each rule. The taxonomy is viewed as a complete and authoritative set of guidelines for writing multiple-choice items. 相似文献

16.

Assessing Partial Knowledge in Vocabulary

Richard M. Smith 《Journal of Educational Measurement》1987,24(3):217-231

This study reports an attempt to assess partial knowledge in vocabulary. Fifty multiple-choice vocabulary items were constructed so that the incorrect choices followed the stages of vocabulary acquisition defined by O'Connor (1940). Ability estimates based on Rasch dichotomous and polychotomous models were compared to determine if there were any gains in validity or reliability as a result of using the polychotomous scoring model rather than the dichotomous scoring model. An attempt was also made to determine the appropriateness of O'Connor's stage theory of vocabulary acquisition for predicting the type of errors that examinees of differing ability would make on the test items. The results indicate that the reliability and concurrent validity of the polychotomous scoring of a subset of items that fit the polychotomous scoring model were significantly higher than those for dichotomous scoring of the same subset of items. The results also indicate moderate support for O'Connor's theory of vocabulary acquisition. 相似文献

17.

An Investigation of Explanation Multiple-Choice Items in Science Assessment

Ou Lydia Liu Hee-Sun Lee Marcia C. Linn 《Educational Assessment》2013,18(3):164-184

Both multiple-choice and constructed-response items have known advantages and disadvantages in measuring scientific inquiry. In this article we explore the function of explanation multiple-choice (EMC) items and examine how EMC items differ from traditional multiple-choice and constructed-response items in measuring scientific reasoning. A group of 794 middle school students was randomly assigned to answer either constructed-response or EMC items following regular multiple-choice items. By applying a Rasch partial-credit analysis, we found that there is a consistent alignment between the EMC and multiple-choice items. Also, the EMC items are easier than the constructed-response items but are harder than most of the multiple-choice items. We discuss the potential value of the EMC items as a learning and diagnostic tool. 相似文献

18.

Development and application of a diagnostic instrument to evaluate grade-11 and -12 students' concepts of covalent bonding and structure following a course of instruction

Raymond F. Peterson David F. Treagust Patrick Garnett 《科学教学研究杂志》1989,26(4):301-314

This article initially outlines a procedure used to develop a written diagnostic instrument to identify grade-11 and -12 students' misconceptions and misunderstandings of the chemistry topic covalent bonding and structure. The content to be taught was carefully defined through a concept map and propositional statements. Following instruction, student understanding of the topic was identified from interviews, student-drawn concept maps, and free-response questions. These data were used to produce 15 two-tier multiple-choice items where the first tier examined content knowledge and the second examined understanding of that knowledge in six conceptual areas, namely, bond polarity, molecular shape, polarity of molecules, lattices, intermolecular forces, and the octet rule. The diagnostic instrument was administered to a total of 243 grade-11 and -12 chemistry students and has a Cronbach alpha reliability of 0.73. Item difficulties ranged from 0.13 to 0.60; discrimination values ranged from 0.32 to 0.65. Each item was analyzed to ascertain student understanding of and identify misconceptions related to the concepts and propositional statements underlying covalent bonding and structure. 相似文献

19.

A SIBTEST Approach to Testing DIF Hypotheses Using Experimentally Designed Test Items

Daniel M. Bolt 《Journal of Educational Measurement》2000,37(4):307-327

This paper considers a modification of the DIF procedure SIBTEST for investigating the causes of differential item functioning (DIF). One way in which factors believed to be responsible for DIF can be investigated is by systematically manipulating them across multiple versions of an item using a randomized DIF study (Schmitt, Holland, & Dorans, 1993). In this paper: it is shown that the additivity of the index used for testing DIF in SIBTEST motivates a new extension of the method for statistically testing the effects of DIF factors. Because an important consideration is whether or not a studied DIF factor is consistent in its effects across items, a methodology for testing item x factor interactions is also presented. Using data from the mathematical sections of the Scholastic Assessment Test (SAT), the effects of two potential DIF factors—item format (multiple-choice versus open-ended) and problem type (abstract versus concrete)—are investigated for gender Results suggest a small but statistically significant and consistent effect of item format (favoring males for multiple-choice items) across items, and a larger but less consistent effect due to problem type. 相似文献

20.

Gender DIF in Reading and Mathematics Tests With Mixed Item Formats

Catherine S. Taylor Yoonsun Lee 《教育实用测度》2013,26(3):246-280

This was a study of differential item functioning (DIF) for grades 4, 7, and 10 reading and mathematics items from state criterion-referenced tests. The tests were composed of multiple-choice and constructed-response items. Gender DIF was investigated using POLYSIBTEST and a Rasch procedure. The Rasch procedure flagged more items for DIF than did the simultaneous item bias procedure—particularly multiple-choice items. For both reading and mathematics tests, multiple-choice items generally favored males while constructed-response items generally favored females. Content analyses showed that flagged reading items typically measured text interpretations or implied meanings; males tended to benefit from items that asked them to identify reasonable interpretations and analyses of informational text. Most items that favored females asked students to make their own interpretations and analyses, of both literary and informational text, supported by text-based evidence. Content analysis of mathematics items showed that items favoring males measured geometry, probability, and algebra. Mathematics items favoring females measured statistical interpretations, multistep problem solving, and mathematical reasoning. 相似文献