Similar Articles
20 similar articles found (search time: 46 ms)
1.
Many innovative item formats have been proposed over the past decade, but little empirical research has been conducted on their measurement properties. This study examines the reliability, efficiency, and construct validity of two innovative item formats—the figural response (FR) and constructed response (CR) formats used in a K–12 computerized science test. The item response theory (IRT) information function and confirmatory factor analysis (CFA) were employed to address the research questions. It was found that the FR items were similar to the multiple-choice (MC) items in providing information and efficiency, whereas the CR items provided noticeably more information than the MC items but tended to provide less information per minute. The CFA suggested that the innovative formats and the MC format measure similar constructs. Innovations in computerized item formats are reviewed, and the merits as well as challenges of implementing the innovative formats are discussed.
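For orientation, the IRT information and efficiency comparison described in this abstract is usually built on the following quantities (a generic sketch for a dichotomously scored item; the specific models used in the study and the response-time symbol t_j below are assumptions, since the abstract does not state how "information per minute" was operationalized):

I_j(\theta) = \frac{\bigl[P_j'(\theta)\bigr]^2}{P_j(\theta)\,\bigl[1 - P_j(\theta)\bigr]}, \qquad I(\theta) = \sum_j I_j(\theta), \qquad E_j(\theta) = \frac{I_j(\theta)}{t_j}

where P_j(\theta) is the probability of a correct response to item j at ability \theta, I(\theta) is the test information, and E_j(\theta) is one plausible "information per minute" index with t_j the average response time in minutes.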

2.
Contrasts between constructed-response items and multiple-choice counterparts have yielded but a few weak generalizations. Such contrasts typically have been based on the statistical properties of groups of items, an approach that masks differences in properties at the item level and may lead to inaccurate conclusions. In this article, we examine item-level differences between a certain type of constructed-response item (called figural response) and comparable multiple-choice items in the domain of architecture. Our data show that in comparing two item formats, item-level differences in difficulty correspond to differences in cognitive processing requirements and that relations between processing requirements and psychometric properties are systematic. These findings illuminate one aspect of construct validity that is frequently neglected in comparing item types, namely the cognitive demand of test items.

3.
Problem-solving strategy is frequently cited as mediating the effects of response format (multiple-choice, constructed response) on item difficulty, yet there are few direct investigations of examinee solution procedures. Fifty-five high school students solved parallel constructed-response and multiple-choice items that differed only in the presence of response options. Student performance was videotaped to assess solution strategies. Strategies were categorized as "traditional" (those associated with constructed-response problem solving, e.g., writing and solving algebraic equations) or "nontraditional" (those associated with multiple-choice problem solving, e.g., estimating a potential solution). Surprisingly, participants sometimes adopted nontraditional strategies to solve constructed-response items. Furthermore, differences in difficulty between response formats did not correspond to differences in strategy choice: some items showed a format effect on strategy but no effect on difficulty; other items showed the reverse. We interpret these results in light of the relative comprehension challenges posed by the two groups of items.

4.
The current study investigated how item formats and their inherent affordances influence test‐takers’ cognition under uncertainty. Adult participants solved content‐equivalent math items in multiple‐selection multiple‐choice and four alternative grid formats. The results indicated that participants’ affirmative response tendency (i.e., judging the given information as True) was affected by the presence of a grid, the type of grid options, and their visual layouts. The item formats further affected the test scores obtained from the alternatives keyed True and the alternatives keyed False, and their psychometric properties. The current results suggest that the affordances rendered by item design can lead to markedly different test‐taker behaviors and can potentially influence test outcomes. They emphasize that a better understanding of the cognitive implications of item formats could potentially facilitate item design decisions for large‐scale educational assessments.

5.
This study provides an empirical comparison of the accuracy of item sampling and examinee sampling in estimating norm statistics. Item samples were composed of 3, 6, or 12 items selected from a total test of 50 multiple-choice vocabulary questions. Overall, the study findings provided empirical evidence that item sampling is approximately as effective as examinee sampling for estimating the population mean and standard deviation. Contradictory trends occurred for lower-ability and higher-ability student populations in the accuracy of estimated means and standard deviations as the number of items administered increased from 3 to 6 to 12. The findings from this study indicate that the variation in the sequences of items occurring in item sampling need not have a significant effect on test performance.

6.
Many efforts have been made to determine and explain differential gender performance on large-scale mathematics assessments. A well-agreed-on conclusion is that gender differences are contextualized and vary across math domains. This study investigated the pattern of gender differences by item domain (e.g., Space and Shape, Quantity) and item type (e.g., multiple-choice items, open constructed-response items). Two kinds of multiple-choice items are discussed in this paper: traditional multiple-choice items and complex multiple-choice items (a sample complex multiple-choice item is shown in Table 6); the terms "multiple-choice" and "traditional multiple-choice" are used interchangeably to refer to the traditional items, while "complex multiple-choice" refers to the complex items. The U.S. portions of the Programme for International Student Assessment (PISA) 2000 and 2003 mathematics assessments were analyzed. A multidimensional Rasch model was used to provide student ability estimates for each comparison. Results revealed a slight but consistent male advantage. Students showed the largest gender difference (d = 0.19) in favor of males on complex multiple-choice items, an unconventional item type. Males and females also showed sizable differences on Space and Shape items, a domain well documented for showing robust male superiority. Contrary to many previous findings reporting male superiority on multiple-choice items, no measurable difference was identified on multiple-choice items for either the PISA 2000 or the 2003 math assessment. Possible reasons for the differential gender performance across math domains and item types are discussed, along with directions for future research.

7.
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee‐selected items. The results of a series of simulation studies showed that: (1) the parameters of the new models were recovered well; (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models; (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee‐selected items; and (4) even when the missingness mechanism in examinee‐selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.

8.
Latent class models of decision-making processes related to multiple-choice test items are extremely important and useful in mental test theory. However, building realistic models or studying the robustness of existing models is very difficult. One problem is that there are a limited number of empirical studies that address this issue. The purpose of this paper is to describe and illustrate how latent class models, in conjunction with the answer-until-correct format, can be used to examine the strategies used by examinees for a specific type of task. In particular, suppose an examinee responds to a multiple-choice test item designed to measure spatial ability, and the examinee gets the item wrong. This paper empirically investigates various latent class models of the strategies that might be used to arrive at an incorrect response. The simplest model is a random guessing model, but the results reported here strongly suggest that this model is unsatisfactory. Models for the second attempt of an item, under an answer-until-correct scoring procedure, are proposed and found to give a good fit to data in most situations. Some results on strategies used to arrive at the first choice are also discussed.
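As context for the baseline being rejected here, a pure random-guessing model under the answer-until-correct format implies simple closed-form attempt probabilities (this is only a sketch of the null model, not the latent class models proposed in the paper). For an item with m options answered by blind guessing without replacement:

P(\text{correct on attempt } 1) = \frac{1}{m}, \qquad P(\text{correct on attempt } 2 \mid \text{attempt } 1 \text{ incorrect}) = \frac{1}{m-1}

Observed second-attempt success rates well above 1/(m-1) therefore point to partial knowledge or systematic strategies rather than random guessing.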

9.
《教育实用测度》2013,26(2):151-169
The use of automated scanning of test sheets, beginning in the 1930s, led to widespread use of the multiple-choice format in standardized testing. New forms of automated scoring now hold out the possibility of making a wide range of constructed-response item formats feasible for use on a large-scale basis. We describe new developments in five domains: mathematical reasoning, algebra problem solving, computer science, architecture, and natural language. For each one, we describe the task as presented to the examinee, the methods used to score the response, and the psychometric properties of the item responses. We then highlight general challenges and issues spanning these technologies. We conclude by offering our views on the ways in which such technologies are likely to shape the future of testing.

10.
The humble multiple-choice test is very widely used within education at all levels, but its susceptibility to guesswork makes it a suboptimal assessment tool. The reliability of a multiple-choice test is partly governed by the number of items it contains; however, longer tests are more time consuming to take, and for some subject areas, it can be very hard to create new test items that are sufficiently distinct from previously used items. A number of more sophisticated multiple-choice test formats have been proposed dating back at least 60 years, many of which offer significantly improved test reliability. This paper offers a new way of comparing these alternative test formats, by modelling each one in terms of the range of possible test taker responses it enables. Looking at the test formats in this way leads to the realisation that the need for guesswork is reduced when test takers are given more freedom to express their beliefs. Indeed, guesswork is eliminated entirely when test takers are able to partially order the answer options within each test item. The paper aims to strengthen the argument for using more sophisticated multiple-choice test formats, especially for high-stakes summative assessment.

11.
《教育实用测度》2013,26(3):257-275
The purpose of this study was to investigate the technical properties of stem-equivalent mathematics items differing only with respect to response format. Using socioeconomic factors to define the strata, a proportional stratified random sample of 1,366 Connecticut sixth-grade students was administered one of three forms. Classical item analysis, dimensionality assessment, item response theory goodness-of-fit, and an item bias analysis were conducted. Analysis of variance and confirmatory factor analysis were used to examine the functioning of the items presented in the three different formats. It was found that, after equating forms, the constructed-response formats were somewhat more difficult than the multiple-choice format. However, there was no significant difference across formats with respect to item discrimination. A differential item functioning (DIF) analysis was conducted using both the Mantel-Haenszel procedure and the comparison of the item characteristic curves. The DIF analysis indicated that the presence of bias was not greatly affected by item format; that is, items biased in one format tended to be biased in a similar manner when presented in a different format, and unbiased items tended to remain so regardless of format.
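For reference, the Mantel-Haenszel DIF procedure mentioned above is usually summarized by a common odds ratio pooled over matched total-score groups, together with its delta transform; the standard formulation is shown here for orientation only, not as this study's exact specification:

\hat{\alpha}_{MH} = \frac{\sum_k A_k D_k / N_k}{\sum_k B_k C_k / N_k}, \qquad \Delta_{MH} = -2.35 \,\ln \hat{\alpha}_{MH}

where, within score group k, A_k and B_k are the counts of reference-group examinees answering the item correctly and incorrectly, C_k and D_k are the corresponding focal-group counts, and N_k is the total count in the group; values of \hat{\alpha}_{MH} near 1 (\Delta_{MH} near 0) indicate negligible DIF.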

12.
Item response theory scalings were conducted for six tests with mixed item formats. These tests differed in their proportions of constructed-response (CR) and multiple-choice (MC) items and in overall difficulty. The scalings included those based on scores for the CR items that had maintained the number of levels in the item rubrics, either produced from single ratings or from multiple ratings that were averaged and rounded to the nearest integer, as well as scalings for a single form of CR items obtained by summing multiple ratings. A one-parameter (1PPC) or two-parameter (2PPC) partial credit model was used for the CR items, and the one-parameter logistic (1PL) or three-parameter logistic (3PL) model for the MC items. Item fit was substantially worse with the combined 1PL/1PPC model than with the 3PL/2PPC model, due to the former's restrictive assumptions that there would be no guessing on the MC items and equal item discrimination across items and item types. The presence of varying item discriminations resulted in the 1PL/1PPC model producing estimates of item information that could be spuriously inflated for CR items that had three or more score levels. Information for some items with summed ratings was usually overestimated by 300% or more under the 1PL/1PPC model. These inflated information values resulted in underestimated standard errors of ability estimates. The constraints posed by the restricted model suggest limitations on the testing contexts in which the 1PL/1PPC model can be accurately applied.
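To make the restrictive assumptions concrete, the two MC models contrasted here differ in whether item discrimination and a lower asymptote for guessing are estimated. In their usual forms (a generic sketch; the D = 1.7 scaling constant is omitted, and the exact parameterizations in the study may differ):

\text{1PL:}\quad P_j(\theta) = \frac{1}{1 + e^{-(\theta - b_j)}} \qquad\qquad \text{3PL:}\quad P_j(\theta) = c_j + \frac{1 - c_j}{1 + e^{-a_j(\theta - b_j)}}

The 1PL fixes a common slope and assumes c_j = 0, which is what forces equal discrimination and no guessing; the 1PPC and 2PPC partial credit models impose the analogous contrast on the CR items.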

13.
This study investigated the role of item formats in the performance of 206 nonnative speakers of English on expressive skills (i.e., speaking and writing). Test scores were drawn from the field test of the Pearson Test of English Academic for Chinese, French, Hebrew, and Korean native speakers. Four item formats, including multiple-choice questions asking for a single answer (SAMC), multiple-choice questions allowing for multiple answers (MAMC), gap-filling items, and summarizing items, were examined in relation to expressive skills. The results showed that, although the four groups showed different score distributions, their first language itself did not account for significant variance in the expressive skills. The summarizing item format assessing listening skills accounted for the greatest variance in the test takers' expressive skills. The SAMC format consistently explained less variance in the expressive skills measured than the MAMC format did. Unlike the findings of previous research, no gender difference was found.

14.
Although test scores from similar tests in multiple-choice and constructed-response formats are highly correlated, equivalence in rankings may mask differences in substantive strategy use. The author used an experimental design and participant think-alouds to explore cognitive processes in mathematical problem solving among undergraduate examinees (N = 64). The study examined the effect of format on mathematics performance and strategy use for male and female examinees given stem-equivalent items. A statistically significant main effect of format on performance was found, with constructed-response items being more difficult. The multiple-choice format was associated with more varied strategies, backward strategies, and guessing. Format was found to moderate the effect of problem conceptualization on performance. Results suggest that, while the multiple-choice format may be adequate for ranking students on performance, the constructed-response format should be preferred for many contemporary educational purposes that seek to provide nuanced information about student cognition.

15.
《教育实用测度》2013,26(2):123-136
College students use information about upcoming tests, including the item formats to be used, to guide their study strategies and allocation of effort, but little is known about how students perceive item formats. In this study, college students rated the dissimilarity of pairs of common item formats (true/false, multiple choice, essay, fill-in-the-blank, matching, short answer, analogy, and arrangement). A multidimensional scaling model with individual differences (INDSCAL) was fit to the data of 111 students and suggested that they were using two dimensions to distinguish among these formats. One dimension separated supply from selection items, and the formats' positions on the dimension were related to ratings of difficulty, review time allocated, objectivity, and recognition (as opposed to recall) required. The second dimension ordered item formats from those with few options from which to choose (e.g., true/false) or brief responses (e.g., fill-in-the-blank), to those with many options from which to choose (e.g., matching) or long responses (e.g., essay). These student perceptions are likely to mediate the impact of classroom evaluation on student study strategies and allocation of effort.
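For readers unfamiliar with INDSCAL, the individual differences scaling model referred to here represents each student's dissimilarity judgments with a weighted Euclidean distance in a common group space; its standard form is given below for orientation only, not as the study's exact estimation details:

d_{ij}^{(s)} = \sqrt{\sum_{a=1}^{r} w_{sa}\,\bigl(x_{ia} - x_{ja}\bigr)^2}

where x_{ia} is the coordinate of item format i on dimension a of the group space, w_{sa} \ge 0 is the weight student s places on dimension a, and in this study r = 2 dimensions were retained.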

16.
A thorough search of the literature was conducted to locate empirical studies investigating the trait or construct equivalence of multiple-choice (MC) and constructed-response (CR) items. Of the 67 studies identified, 29 studies included 56 correlations between items in both formats. These 56 correlations were corrected for attenuation and synthesized to establish evidence for a common estimate of correlation (true-score correlations). The 56 disattenuated correlations were highly heterogeneous. A search for moderators to explain this variation uncovered the role of the design characteristics of test items used in the studies. When items are constructed in both formats using the same stem (stem equivalent), the mean correlation between the two formats approaches unity and is significantly higher than when using non-stem-equivalent items (particularly when using essay-type items). Construct equivalence, in part, appears to be a function of the item design method or the item writer's intent.
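The correction for attenuation used to obtain the disattenuated (true-score) correlations mentioned above is the classical Spearman formula, shown here as a sketch; in practice the reliability estimates reported in the individual studies are substituted for the r_{XX'} and r_{YY'} terms:

\hat{\rho}_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}}

where r_{XY} is the observed MC-CR correlation and r_{XX'}, r_{YY'} are the reliabilities of the MC and CR scores, respectively.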

17.
18.
Subscores can be of diagnostic value for tests that cover multiple underlying traits. Some items require knowledge or ability that spans more than a single trait. It is thus natural for such items to be included on more than a single subscore. Subscores only have value if they are reliable enough to justify conclusions drawn from them and if they contain information about the examinee that is distinct from what is in the total test score. In this study we show, for a broad range of conditions of item overlap on subscores, that the value of the subscore is always improved through the removal of such items.

19.
Using Muraki's (1992) generalized partial credit IRT model, polytomous items (responses to which can be scored as ordered categories) from the 1991 field test of the NAEP Reading Assessment were calibrated simultaneously with multiple-choice and short open-ended items. Expected information of each type of item was computed. On average, four-category polytomous items yielded 2.1 to 3.1 times as much IRT information as dichotomous items. These results provide limited support for the ad hoc rule of weighting k-category polytomous items the same as k - 1 dichotomous items for computing total scores. Polytomous items provided the most information about examinees of moderately high proficiency; the information function peaked between ability values of 1.0 and 1.5, whereas the population distribution mean was 0. When scored dichotomously, information in polytomous items sharply decreased, but they still provided more expected information than did the other response formats. For reference, a derivation of the information function for the generalized partial credit model is included.
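In its usual parameterization (a sketch for orientation, not the paper's own derivation), the generalized partial credit model gives, for an item with score categories h = 0, ..., m_j, the category probabilities and item information

P_{jh}(\theta) = \frac{\exp\!\left(\sum_{v=1}^{h} D a_j (\theta - b_{jv})\right)}{\sum_{c=0}^{m_j} \exp\!\left(\sum_{v=1}^{c} D a_j (\theta - b_{jv})\right)}, \qquad I_j(\theta) = D^2 a_j^2 \left[\sum_{h=0}^{m_j} h^2 P_{jh}(\theta) - \left(\sum_{h=0}^{m_j} h P_{jh}(\theta)\right)^2\right]

with empty sums (h = 0) defined as zero. Item information is thus proportional to the conditional variance of the scored response, which is consistent with items having more well-functioning categories carrying the two to three times more information reported above.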

20.
In this contribution we concentrate on the features of a particular item format: items having as the last option "none of the above" (NOTA items). There is considerable dispute about the advisability of using NOTA items in testing. Some authors conclude that NOTA items should be avoided, some reach neutral conclusions, while others argue that NOTA items are optimal test items. In this article, we contribute evidence to this discussion by conducting protocol analysis on written statements of examinees while answering NOTA items. In our investigation, a test containing 30 multiple-choice items was administered to 169 university students. The results show that NOTA options appear to be more attractive than options with specified solutions in those cases where a problem solver fails. A relationship is also found between the quality of (incorrect) problem solving and the choice of NOTA options: the higher the quality of the incorrect problem-solving process, the more likely the student is to choose the NOTA option. Overall, our research supports the statement that the more confidence an examinee has in a worked solution that is inconsistent with the specified options, the more eager he or she seems to be to choose "none of the above."
