期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Affordances of Item Formats and Their Effects on Test‐Taker Cognition under Uncertainty

Jung Aa Moon Madeleine Keehner Irvin R. Katz 《Educational Measurement》2019,38(1):54-62

The current study investigated how item formats and their inherent affordances influence test‐takers’ cognition under uncertainty. Adult participants solved content‐equivalent math items in multiple‐selection multiple‐choice and four alternative grid formats. The results indicated that participants’ affirmative response tendency (i.e., judge the given information as True) was affected by the presence of a grid, type of grid options, and their visual layouts. The item formats further affected the test scores obtained from the alternatives keyed True and the alternatives keyed False, and their psychometric properties. The current results suggest that the affordances rendered by item design can lead to markedly different test‐taker behaviors and can potentially influence test outcomes. They emphasize that a better understanding of the cognitive implications of item formats could potentially facilitate item design decisions for large‐scale educational assessments. 相似文献

2.

Measurement Efficiency of Innovative Item Formats in Computer-Based Testing

Michael G. Jodoin 《Journal of Educational Measurement》2003,40(1):1-15

The psychometric literature provides little empirical evaluation of examinee test data to assess essential psychometric properties of innovative items. In this study, examinee responses to conventional (e.g., multiple choice) and innovative item formats in a computer-based testing program were analyzed for IRT information with the three-parameter and graded response models. The innovative item types considered in this study provided more information across all levels of ability than multiple-choice items. In addition, accurate timing data captured via computer administration were analyzed to consider the relative efficiency of the multiple choice and innovative item types. As with previous research, multiple-choice items provide more information per unit time. Implications for balancing policy, psychometric, and pragmatic factors in selecting item formats are also discussed. 相似文献

3.

Students’ assessment preferences, perceptions of assessment and their relationships to study results

Gerard van de Watering David Gijbels Filip Dochy Janine van der Rijt 《Higher Education》2008,56(6):645-658

The purposes of this study are to gain more insight into students’ actual preferences and perceptions of assessment, into the effects of these on their performances when different assessment formats are used, and into the different cognitive process levels assessed. Data were obtained from two sources. The first was the scores on the assessment of learning outcomes, consisting of open ended and multiple choice questions measuring the students’ abilities to recall information, to understand concepts and principles, and to apply knowledge in new situations. The second was the adapted Assessment Preferences Inventory (API) which measured students’ preferences as a pre-test and perceptions as a post-test. Results show that, when participating in a New Learning Environment (NLE), students prefer traditional written assessment and questions which are as closed as possible, assessing a mix of cognitive processes. Some relationships, but not all the expected ones, were found between students’ preferences and their assessment scores. No relationships were found between students’ perceptions of assessment and their assessment scores. Additionally, only forty percent of the students had perceptions of the levels of the cognitive processes assessed that matched those measured by the assessments. Several explanations are discussed. 相似文献

4.

Controlling Bias in Both Constructed Response and Multiple‐Choice Items When Analyzed With the Dichotomous Rasch Model

下载免费PDF全文

David Andrich Ida Marais 《Journal of Educational Measurement》2018,55(2):281-307

Even though guessing biases difficulty estimates as a function of item difficulty in the dichotomous Rasch model, assessment programs with tests which include multiple‐choice items often construct scales using this model. Research has shown that when all items are multiple‐choice, this bias can largely be eliminated. However, many assessments have a combination of multiple‐choice and constructed response items. Using vertically scaled numeracy assessments from a large‐scale assessment program, this article shows that eliminating the bias on estimates of the multiple‐choice items also impacts on the difficulty estimates of the constructed response items. This implies that the original estimates of the constructed response items were biased by the guessing on the multiple‐choice items. This bias has implications for both defining difficulties in item banks for use in adaptive testing composed of both multiple‐choice and constructed response items, and for the construction of proficiency scales. 相似文献

5.

Relationships between learning patterns and attitudes towards two assessment formats

Menucha Birenbaum Rose A. Feldman 《Educational research; a review for teachers and all concerned with progress in education》2013,55(1):90-98

The study examined the relationships between learning patterns and attitudes towards two assessment formats: open‐ended (OE) and multiple‐choice (MC), among students in higher education. Sixteen Semantic Differential scales measuring emotional reactions, intellectual reactions and appraisal of each assessment format, along with measures of learning processes, academic self‐concept and test anxiety, were administered to 58 students. Results indicated two patterns of relationships between the learning‐related variables and the assessment attitudes: high scores on the self‐concept measure and on the three measures of learning processes were related to positive attitudes towards the OE format but negative ones towards the MC format; low scores on the test anxiety measures were related to positive attitudes towards the OE format. In addition, significant gender differences emerged with respect to the MC format, with males having more favourable attitudes than females. Results were discussed in light of an adaptive assessment approach. 相似文献

6.

The challenges and possibilities of aligning large‐scale testing with mathematical reform: the case of Ontario

Alex Lawson Christine Suurtamm 《Assessment in Education: Principles, Policy & Practice》2006,13(3):305-325

In 1997, the Ontario government, like many other jurisdictions, undertook systemic reform of their elementary school mathematics programme, developing a new mathematics curriculum, report card, and province‐wide assessment. The curricular reform embodied a new vision of mathematics learning and instruction that emphasized instruction using challenging problems, the student construction of multiple solution methods, and mathematical communication and defence of ideas. While the design of the original large‐scale assessment incorporated much of the latest research and theory on effective practices at that time, these traditional item development and scoring practices no longer adequately assess mathematics achievement in reform‐inspired classrooms. The difficulties of marrying traditional assessment practices with a reform‐inspired curriculum could be addressed by creating a construct definition from the recent research findings on students’ mathematical development in reform‐inspired classrooms. The importance, challenges and implications of redefining the construct on the basis of existing research on students’ mathematical development, as well as collapsing the traditional content‐by‐process matrix for item development, are explored. 相似文献

7.

Can We Learn From Student Mistakes in a Formative,Reading Comprehension Assessment?

Bowen Liu Patrick C. Kennedy Ben Seipel Sarah E. Carlson Gina Biancarosa Mark L. Davison 《Journal of Educational Measurement》2019,56(4):815-835

This article describes an ongoing project to develop a formative, inferential reading comprehension assessment of causal story comprehension. It has three features to enhance classroom use: equated scale scores for progress monitoring within and across grades, a scale score to distinguish among low‐scoring students based on patterns of mistakes, and a reading efficiency index. Instead of two response types for each multiple‐choice item, correct and incorrect, each item has three response types: correct and two incorrect response types. Prior results on reliability, convergent and discriminant validity, and predictive utility of mistake subscores are briefly described. The three‐response‐type structure of items required rethinking the item response theory (IRT) modeling. IRT‐modeling results are presented, and implications for formative assessments and instructional use are discussed. 相似文献

8.

Scoring methods for multiple choice assessment in higher education – Is it still a matter of number right scoring or negative marking?

Ellen Lesage Martin Valcke Elien Sabbe 《Studies in Educational Evaluation》2013

In higher education, a multiple choice test is a widely known format for measuring student's knowledge. The debate about the two most commonly used scoring methods for multiple choice assessment – number right scoring (NR) and negative marking (NM) – seems to be a never-ending story. Both NR scoring as NM do not seem to meet the expectations. However, available research hardly offers alternative methods. Clearly, there is a growing need to explore these alternative scoring methods in order to inform and support test designers. This review aims to present an overview of (alternative) scoring methods for multiple choice tests, in which strengths and weaknesses of each method are provided. 相似文献

9.

A Multidimensional Scaling Study of College Students' Perception of Test Item Formats

《教育实用测度》2013,26(2):123-136

College students use information about upcoming tests, including the item formats to be used, to guide their study strategies and allocation of effort, but little is known about how students perceive item formats. In this study, college students rated the dissimilarity of pairs of common item formats (true/false, multiple choice, essay, fill-in-the-blank, matching, short answer, analogy, and arrangement). A multidimensional scaling model with individual differences (INDSCAL) was fit to the data of 11 1 students and suggested that they were using two dimensions to distinguish among these formats. One dimension separated supply from selection items, and the formats' positions on the dimension were related to ratings of difficulty, review time allocated, objectivity, and recognition (as opposed to recall) required. The second dimension ordered item formats from those with few options from which to choose (e.g., true/false) or brief responses (e.g., fill-in-the-blank), to those with many options from which to choose (e.g., matching) or long responses (e.g., essay). These student perceptions are likely to mediate the impact of classroom evaluation on student study strategies and allocation of effort. 相似文献

10.

Accounting for Rater Effects With the Hierarchical Rater Model Framework When Scoring Simple Structured Constructed Response Tests

Ricardo Nieto Jodi M. Casabianca 《Journal of Educational Measurement》2019,56(3):547-581

Many large‐scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically, simple structured tests, such as these rely on multiple multiple‐choice and/or constructed responses sections of items to generate multiple scores. In the current article, we propose an extension of the hierarchical rater model (HRM) to be applied with simple structured tests with constructed response items. In addition to modeling the appropriate trait structure, the multidimensional HRM (M‐HRM) presented here also accounts for rater severity bias and rater variability or inconsistency. We introduce the model formulation, test parameter recovery with a focus on latent traits, and compare the M‐HRM to other scoring approaches (unidimensional HRMs and a traditional multidimensional item response theory model) using simulated and empirical data. Results show more precise scores under the M‐HRM, with a major improvement in scores when incorporating rater effects versus ignoring them in the traditional multidimensional item response theory model. 相似文献

11.

Extended matching sets questions for online numeracy assessments: a case study

Alan J. Cann 《Assessment & Evaluation in Higher Education》2005,30(6):633-640

Extended matching sets questions (EMSQs) are a form of multiple‐choice question (MCQ) consisting of a stem (the question or scenario) with an extended number of possible answers. Although there is no consensus on their absolute format, for the purpose of this paper a multiple‐choice question with ten or more alternative answers is considered to be an EMSQ. Faced with the limitations imposed by virtual learning environment software, I have conducted a case study into the use of the EMSQ format in online assessment of numerical and statistical ability which shows that properly constructed questions of this type can play a valuable role in assessment of numeracy. The extended format was found to work well on screen and resulted in an increase in both student marks and student satisfaction when compared with other answer input formats. This case study indicates that the EMSQ format has much more widespread applicability for online assessment that its traditional uses. 相似文献

12.

THE EFFECT OF DIFFERENTIAL WEIGHTING OF INDIVIDUAL ITEM RESPONSES ON THE PREDICTIVE VALIDITY AND RELIABILITY OF AN APTITUDE TEST

DARRELL L. SABERS GORDON W. WHITE 《Journal of Educational Measurement》1969,6(2):93-96

An empirical investigation of the effect of choice weight scoring on predictive validity and reliability. Choice weight scoring refers to the procedure whereby different weights may be assigned to all the options of an item. Four groups of subjects were included in the experiment. Weights derived from each group were used to score tests for another group in order to assess the cross-validity of the weighted scoring. In no case did the increments in reliability and validity due to the weighted scoring exceed .03. 相似文献

13.

Investigating Psychometric Isomorphism for Traditional and Performance‐Based Assessment

下载免费PDF全文

Derek M. Fay Roy Levy Vandhana Mehta 《Journal of Educational Measurement》2018,55(1):52-77

相似文献

14.

Gender Differences in Large-Scale Math Assessments: PISA Trend 2000 and 2003

Ou Lydia Liu Mark Wilson 《教育实用测度》2013,26(2):164-184

Many efforts have been made to determine and explain differential gender performance on large-scale mathematics assessments. A well-agreed-on conclusion is that gender differences are contextualized and vary across math domains. This study investigated the pattern of gender differences by item domain (e.g., Space and Shape, Quantity) and item type (e.g., multiple-choice ⁱ ⁱIn this paper, two kinds of multiple-choice items are discussed: traditional multiple-choice items and complex multiple-choice items. A sample complex multiple choice item is shown in Table 6. The terms “multiple-choice” and “traditional multiple-choice” are used interchangeably to refer to the traditional multiple choice items throughout the paper, while the term “complex multiple-choice” is used to refer to the complex multiple-choice items. Raman K. Grover is now an Independent Psychometrician. items, open constructed-response items). The U.S. portion of the Programme for International Student Assessment (PISA) 2000 and 2003 mathematics assessment was analyzed. A multidimensional Rasch model was used to provide student ability estimates for each comparison. Results revealed a slight but consistent male advantage. Students showed the largest gender difference (d = 0.19) in favor of males on complex multiple-choice items, an unconventional item type. Males and females also showed sizable differences on Space and Shape items, a domain well documented for showing robust male superiority. Contrary to many previous findings reporting male superiority on multiple-choice items, no measurable difference has been identified on multiple-choice items for both the PISA 2000 and the 2003 math assessments. Reasons for the differential gender performance across math domains and item types were speculated, and directions of future research were discussed. 相似文献

15.

Costs of carrying students to a criterion level of competency in physics

Ralph Dressel 《科学教学研究杂志》1983,20(3):231-238

Data are presented which display the performance of students in an introductory physics course that was conducted in a competency-based, multiple-opportunity format. These data reveal features of student behavior that are reproducible. The specific costs, measured in instructor's time per student produced are shown to be related to the character of the examination questions and to the comprehensional modes of the students. Specific costs for single-trial and multiple-trial formats are compared and the evidence clearly shows that the multiple-trial opportunity has lower specific costs and a much higher yield of students at the full competency level. 相似文献

16.

Relationships Among Multiple–Choice and Open–Ended Analytical Questions

Brent Bridgeman Donald A. Rock 《Journal of Educational Measurement》1993,30(4):313-329

Exploratory and confirmatory factor analyses were used to explore relationships among existing item types and three new computer–administered item types for the analytical scale of the Graduate Record Examination General Test. One new item type was an open–ended version of the current multiple–choice analytical reasoning item type. The other new item types had no counterparts on the existing test. The computer tests were administered at four sites to a sample of students who had previously taken the GRE General Test. Scores from the regular GRE and the special computer administration were matched for a sample of 349 students. Factor analyses suggested that the new item types with no counterparts in the existing GRE were reliably assessing unique constructs but the open–ended analytical reasoning items were not measuring anything beyond what is measured by the current multiple–choice version of these items. 相似文献

17.

American Students’ Perspectives on Alternative Assessment: do they know it's different?

Joan L. Herman Davina C.D. Klein Sara T. Wakai 《Assessment in Education: Principles, Policy & Practice》1997,4(3):339-352

This study used the 1993 California Learning Assessment System (CLAS) Middle Grades Mathematics Performance Assessment as a platform to examine alternative assessment in actual practice in the United States. Reported here is information gathered using the CLAS regarding student attitudes and approaches toward this new type of assessment. At issue is whether students find alternative assessments to be more motivating and interesting than traditional types of tests, and whether they appreciate the difference between traditional and alternative tasks. Data were collected in 13 schools across the state of California, involving more than 800 students. Instrumentation used in data collection included student surveys as well as in‐depth student retrospective interviews. Findings suggest that students do indeed understand the differences in approaches necessitated by novel, open‐ended versus more familiar multiple‐choice tasks. In addition, student attitudes toward these two types of tasks are discussed in detail. 相似文献

18.

Effects of Assigning Raters to Items

Robert C. Sykes Kyoko Ito Zhen Wang 《Educational Measurement》2008,27(1):47-55

Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers though fewer than the number of items. 相似文献

19.

INCREASING TEST RELIABILITY THROUGH SELF-SCORING PROCEDURES

DAVID ALAN GILMAN PAULA FERRY 《Journal of Educational Measurement》1972,9(3):205-207

Fifty-four graduate students were administered a 66 item four-response multiple choice test on self-scoring tests forms. Each test was scored by the traditional right-wrong method of scoring tests and was also scored by the self-scoring method of counting the number of responses necessary to respond to all items correctly. Results indicate that scoring tests by the self-scoring method can result in a higher split half reliability than tests scored by the traditional right-wrong method. 相似文献

20.

A discussion of some methodological issues in international assessments

《International Journal of Educational Research》1998,29(6):569-577

Three topics addressed in the previous chapters are identified and discussed from a somewhat different perspective from that of the chapter authors. The topics are: the level of scoring in assessment studies, translation of test items, and sampling of curriculum content. Based on the analysis of these topics, five recommendations are offered. International assessments should be scored and reported at a more specific level that is currently the practice. There is a need of sound statistical checks on the quality of item translations. Rather than sampling the curriculum only once or twice, sampling could be in real time and on a permanent basis. Tests could be administered with open books and a well-chosen time limit per item. Finally, schools could be instructed to prepare their students for the assessment. 相似文献