
1.

A Stepwise Test Characteristic Curve Method to Detect Item Parameter Drift

Rui Guo Yi Zheng Hua‐Hua Chang 《Journal of Educational Measurement》2015,52(3):280-300

An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is the item response theory–based true score equating, whose goal is to generate a conversion table to relate number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items and at the same time keeps a large amount of linking items.
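As a rough illustration of the quantity this abstract is built around, the sketch below computes a test characteristic curve (the expected number-correct score at a given ability) under the three-parameter logistic model and the largest gap between the curves of two administrations. It is not the article's stepwise procedure. The function names, the D = 1.7 scaling constant, and all item parameters are assumptions made for the example.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (D = 1.7 scaling is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at ability theta."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)

def max_tcc_gap(items_1, items_2, grid):
    """Largest absolute difference between two TCCs over a grid of abilities."""
    return max(abs(tcc(t, items_1) - tcc(t, items_2)) for t in grid)

# Hypothetical (a, b, c) parameters for the same three items on two administrations;
# the second item has become harder (b drifts from 0.0 to 0.4).
administration_1 = [(1.0, -0.5, 0.20), (1.2, 0.0, 0.20), (0.8, 0.7, 0.25)]
administration_2 = [(1.0, -0.5, 0.20), (1.2, 0.4, 0.20), (0.8, 0.7, 0.25)]

grid = [x / 10 for x in range(-40, 41)]  # abilities from -4.0 to 4.0
gap = max_tcc_gap(administration_1, administration_2, grid)
print(f"largest TCC difference: {gap:.3f}")
```

A drift-detection method working at the test level would compare gaps like this one with and without candidate items in the linking set; how the article chooses and removes items is not reproduced here.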

2.

A Comparison of Multiple-Choice and Constructed Figural Response Items

Michael E. Martinez 《Journal of Educational Measurement》1991,28(2):131-145

In contrast to multiple-choice test questions, figural response items call for constructed responses and rely upon figural material, such as illustrations and graphs, as the response medium. Figural response questions in various science domains were created and administered to a sample of 4th-, 8th-, and 12th-grade students. Item and test statistics from parallel sets of figural response and multiple-choice questions were compared. Figural response items were generally more difficult, especially for questions that were difficult (p < .5) in their constructed-response forms. Figural response questions were also slightly more discriminating and reliable than their multiple-choice counterparts, but they had higher omit rates. This article addresses the relevance of guessing to figural response items and the diagnostic value of the item type. Plans for future research on figural response items are discussed.

3.

Does difficulty-based item order matter in multiple-choice exams? (Empirical evidence from university students)

《Studies in Educational Evaluation》2020

This empirical study aimed to investigate the impact of easy first vs. hard first ordering of the same items in a paper-and-pencil multiple-choice exam on the performances of low, moderate, and high achiever examinees, as well as on the item statistics. Data were collected from 554 Turkish university students using two test forms, which included the same multiple-choice items ordered reversely, i.e. easy first vs. hard first. Tests included 26 multiple-choice items about the introductory unit of “Measurement and Assessment” course. The results suggested that sequencing the multiple-choice items in either direction from easy to hard or vice versa did not affect the test performances of the examinees no matter whether they are low, moderate or high achiever examinees. Finally, no statistically significant difference was observed between item statistics of both forms, i.e. the difficulty (p), discrimination (d), point biserial (r), and adjusted point biserial (adj. r) coefficients.
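For readers unfamiliar with the item statistics named above, here is a minimal sketch of three of them under their usual classical definitions: difficulty p as the proportion correct, the point biserial r as the correlation between the item score and the total score, and the adjusted point biserial as the correlation between the item score and the total score with that item removed. The discrimination index d is omitted because the abstract does not say how it was computed. The function names and the response matrix are assumptions for illustration only.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation using population moments."""
    mx, my = mean(x), mean(y)
    cov = mean((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

def item_statistics(responses, item):
    """responses: one row of 0/1 item scores per examinee; item: column index."""
    scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    rest = [t - s for t, s in zip(totals, scores)]  # total score without this item
    return {
        "p": mean(scores),                # difficulty: proportion correct
        "r": pearson(scores, totals),     # point biserial
        "adj_r": pearson(scores, rest),   # adjusted (item-rest) point biserial
    }

# Hypothetical data: four examinees, three items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
stats = item_statistics(responses, item=0)
print(stats)
```

The adjusted coefficient is smaller than the unadjusted one here because the unadjusted total score includes the item itself, which inflates the correlation on short tests.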

4.

Understanding Cohesion—Some Practical Teaching Implications

Ilma Louise Hadley 《Literacy》1987,21(2):106-114

An experiment was conducted with 151 primary school children from three year levels, in a suburban primary school, set in a moderately high socio-economic area. The object of the investigation was to test the understanding of twelve anaphoric pronouns, which were embedded in passages of continuous text. The relationship between the perception of the cohesive items, and general reading comprehension was studied, as was the difference between the performance of girls and boys. A further question related to the understanding of anaphoric items set within direct speech. Results showed a significant relationship between the comprehension of the selected anaphoric personal items and ability in reading, as measured by a standardised test. In the early school years, girls were superior to boys in their perception of the items, but no difference was found at the upper level of the school. Children of all levels found the items set within quoted speech more difficult to comprehend than the items in the rest of the text. Some practical teaching strategies are discussed, and attention is drawn to areas where teachers' awareness of cohesion could prove useful.

5.

Automated Test‐Form Generation

Wim J. van der Linden Qi Diao 《Journal of Educational Measurement》2011,48(2):206-222

In automated test assembly (ATA), the methodology of mixed‐integer programming is used to select test items from an item bank to meet the specifications for a desired test form and optimize its measurement accuracy. The same methodology can be used to automate the formatting of the set of selected items into the actual test form. Three different cases are discussed: (i) computerized test forms in which the items are presented on a screen one at a time and only their optimal order has to be determined; (ii) paper forms in which the items need to be ordered and paginated and the typical goal is to minimize paper use; and (iii) published test forms with the same requirements but a more sophisticated layout (e.g., double‐column print). For each case, a menu of possible test‐form specifications is identified, and it is shown how they can be modeled as linear constraints using 0–1 decision variables. The methodology is demonstrated using two empirical examples.
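The idea of 0–1 decision variables with linear constraints can be illustrated on the item-selection step described in the abstract's first sentence. The sketch below is a toy: it enumerates every selection vector by brute force instead of calling a mixed-integer solver, it uses two-parameter logistic item information as the objective, and it does not attempt the ordering or pagination models the article develops. The bank, the content labels, and the function names are all hypothetical.

```python
import math
from itertools import combinations

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta (model choice is an assumption)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def assemble(bank, length, theta, max_per_content):
    """Pick x_i in {0, 1} maximizing sum_i x_i * info_i subject to
    sum_i x_i == length and, for each content area, sum_i x_i <= max_per_content."""
    best_x, best_info = None, -1.0
    for chosen in combinations(range(len(bank)), length):  # enforces the length constraint
        x = [1 if i in chosen else 0 for i in range(len(bank))]
        counts = {}
        for i, item in enumerate(bank):
            counts[item["content"]] = counts.get(item["content"], 0) + x[i]
        if any(n > max_per_content for n in counts.values()):
            continue  # a content constraint is violated
        total = sum(x[i] * info_2pl(theta, bank[i]["a"], bank[i]["b"])
                    for i in range(len(bank)))
        if total > best_info:
            best_x, best_info = x, total
    return best_x, best_info

# Hypothetical five-item bank.
bank = [
    {"a": 1.5, "b": 0.0, "content": "algebra"},
    {"a": 1.4, "b": 0.1, "content": "algebra"},
    {"a": 1.3, "b": -0.1, "content": "algebra"},
    {"a": 0.9, "b": 0.0, "content": "geometry"},
    {"a": 0.7, "b": 0.5, "content": "geometry"},
]
x, total_info = assemble(bank, length=3, theta=0.0, max_per_content=2)
print(x, round(total_info, 3))
```

Enumeration is only workable for tiny banks; with realistic bank sizes the same objective and constraints would be handed to a mixed-integer programming solver, which is the approach the abstract refers to.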

6.

CONSTRUCTING AN ITEM BANK USING PARTIAL CREDIT SCORING

GEOFFEREY N. MASTERS 《Journal of Educational Measurement》1984,21(1):19-32

A method for banking test items scored in several ordered response categories is described. Each item is seen as an ordered sequence of steps, and test forms are equated using the estimated difficulties of the steps in their shared items. Procedures for analyzing the internal consistency of individual links and for analyzing the coherence of an entire linking structure are described. The methodology is used to link six forms of a mathematics problem solving test.

7.

The Effect of the Most-Attractive-Distractor Location on Multiple-Choice Item Difficulty

Jinnie Shin Okan Bulut Mark J. Gierl 《Journal of Experimental Education》2020,88(4):643-659

Abstract

The arrangement of response options in multiple-choice (MC) items, especially the location of the most attractive distractor, is considered critical in constructing high-quality MC items. In the current study, a sample of 496 undergraduate students taking an educational assessment course was given three test forms consisting of the same items but the positions of the most attractive distractor varied across the forms. Using a multiple-indicators–multiple-causes (MIMIC) approach, the effects of the most attractive distractor's positions on item difficulty were investigated. The results indicated that the relative placement of the most attractive distractor and the distance between the most attractive distractor and the keyed option affected students’ response behaviors. Moreover, low-achieving students were more susceptible to response-position changes than high-achieving students.

8.

Evaluating Statistical Targets for Assembling Parallel Mixed‐Format Test Forms

Dries Debeer Usama S. Ali Peter W. van Rijn 《Journal of Educational Measurement》2017,54(2):218-242

Test assembly is the process of selecting items from an item pool to form one or more new test forms. Often new test forms are constructed to be parallel with an existing (or an ideal) test. Within the context of item response theory, the test information function (TIF) or the test characteristic curve (TCC) are commonly used as statistical targets to obtain this parallelism. In a recent study, Ali and van Rijn proposed combining the TIF and TCC as statistical targets, rather than using only a single statistical target. In this article, we propose two new methods using this combined approach, and compare these methods with single statistical targets for the assembly of mixed‐format tests. In addition, we introduce new criteria to evaluate the parallelism of multiple forms. The results show that single statistical targets can be problematic, while the combined targets perform better, especially in situations with increasing numbers of polytomous items. Implications of using the combined target are discussed.

9.

The Issue of Range Restriction in Bookmark Standard Setting

Adam E. Wyse 《Educational Measurement》2015,34(2):47-54

This article uses data from a large‐scale assessment program to illustrate the potential issue of range restriction with the Bookmark method in the context of trying to set cut scores to closely align with a set of college and career readiness benchmarks. Analyses indicated that range restriction issues existed across different response probability (RP) values and item response theory (IRT) models if one were to apply the Bookmark procedure using intact test forms. Results also suggested that range restriction may still be present if one had access to additional data from an item bank. This demonstration critically highlights challenges that may exist in some practical applications of the Bookmark method due to items not being designed to cover the full range of examinee abilities.

10.

Multidimensional Linking for Tests with Mixed Item Types

Lihua Yao Keith Boughton 《Journal of Educational Measurement》2009,46(2):177-197

Numerous assessments contain a mixture of multiple choice (MC) and constructed response (CR) item types and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example, multidimensional item response theory (MIRT) may have a promising future in subscale score proficiency estimation, leading toward a more diagnostic orientation, which requires the linking of these subscale scores across different forms and populations. Several multidimensional linking studies can be found in the literature; however, none have used a combination of MC and CR item types. Thus, this research explores multidimensional linking accuracy for tests composed of both MC and CR items using a matching test characteristic/response function approach. The two-dimensional simulation study presented here used real data-derived parameters from a large-scale statewide assessment with two subscale scores for diagnostic profiling purposes, under varying conditions of anchor set lengths (6, 8, 16, 32, 60), across 10 population distributions, with a mixture of simple versus complex structured items, using a sample size of 3,000. It was found that for a well chosen anchor set, the parameters recovered well after equating across all populations, even for anchor sets composed of as few as six items.

11.

Effectiveness of Short-Term Group Guidance with a Group of Transfer Students Admitted on Academic Probation

《The Journal of educational research》2012,105(10):463-465

Abstract

To combat problems of cheating arising from testing under crowded classroom conditions, instructors frequently use multiple arrangements of a set of test items. These different arrangements or forms should be nearly equivalent relative to mean total scores. This study reports data from comparisons involving eleven pairs of equivalent tests. There were no significant linear relationships between equivalent test forms on the ordering of item difficulties. Reliabilities differed little within pairs of equivalent tests. Nine of eleven t-tests comparing mean total test scores were insignificant. The bulk of these data supported the assumption that one may construct equivalent power tests by rearranging items, when the ordering of item difficulty is non-systematic on both arrangements.

12.

Test Stakes and Item Format Interactions

《Applied Measurement in Education》2013,26(1):55-77

The effects of test consequences, response formats (multiple choice or constructed response), gender, and ethnicity were studied for the math and science sections of a high school diploma endorsement test. There was an interaction between response format and test consequences: Under both response formats, students performed better under high stakes (diploma endorsement) than under low stakes (pilot test), but the difference was larger for the constructed response items. Gender and ethnicity did not interact with test stakes; the means of all groups increased when the test had high stakes. Gender interacted with format; boys scored higher than girls on multiple-choice items, girls scored higher than boys on constructed-response items.

13.

Assessing the Practical Equivalence of Conversions When Measurement Conditions Change

Jinghua Liu Neil J. Dorans 《Journal of Educational Measurement》2012,49(1):101-115

At times, the same set of test questions is administered under different measurement conditions that might affect the psychometric properties of the test scores enough to warrant different score conversions for the different conditions. We propose a procedure for assessing the practical equivalence of conversions developed for the same set of test questions but administered under different measurement conditions. This procedure assesses whether the use of separate conversions for each condition has a desirable or undesirable effect. We distinguish effects due to differences in difficulty from effects due to rounding conventions. The proposed procedure provides objective empirical information that assists in deciding to report a common conversion for a set of items or a different conversion for the set of items when the set is administered under different measurement conditions. To illustrate the use of the procedure, we consider the case where a scrambled test form is used along with a base test form. If section order effects are detected between the scrambled and base forms, a decision needs to be made whether to report a single common conversion for both forms or to report separate conversions.

14.

IRT Estimation of Domain Scores

R. Darrell Bock David Thissen Michele F. Zimowski 《Journal of Educational Measurement》1997,34(3):197-211

In classical test theory, a test is regarded as a sample of items from a domain defined by generating rules or by content, process, and format specifications. If the items are a random sample of the domain, then the percent-correct score on the test estimates the domain score, that is, the expected percent correct for all items in the domain. When the domain is represented by a large set of calibrated items, as in item banking applications, item response theory (IRT) provides an alternative estimator of the domain score by transformation of the IRT scale score on the test. This estimator has the advantage of not requiring the test items to be a random sample of the domain, and of having a simple standard error. We present here resampling results in real data demonstrating for uni- and multidimensional models that the IRT estimator is also a more accurate predictor of the domain score than is the classical percent-correct score. These results have implications for reporting outcomes of educational qualification testing and assessment.
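One simple way to read "transformation of the IRT scale score" is as a plug-in estimate: evaluate every calibrated domain item's response function at the examinee's estimated ability and average the probabilities. The sketch below does that for a logistic model with an optional guessing parameter. It is a simplified assumption about the form of the transformation and may differ from the estimator the article actually uses (for example, it ignores uncertainty in the ability estimate). The names and parameters are hypothetical.

```python
import math

def p_correct(theta, a, b, c=0.0):
    """Logistic item response function with optional lower asymptote c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def irt_domain_score(theta_hat, domain_items):
    """Expected proportion correct over all calibrated domain items at theta_hat."""
    return sum(p_correct(theta_hat, *item) for item in domain_items) / len(domain_items)

# Hypothetical calibrated domain of (a, b, c) triples; the test taken need not be
# a random sample of these items, because only theta_hat carries over.
domain = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.25), (1.5, 1.0, 0.2)]
for theta_hat in (-1.0, 0.0, 1.0):
    print(theta_hat, round(irt_domain_score(theta_hat, domain), 3))
```

Because the estimate depends on the test only through the ability estimate, it stays interpretable as an expected percent correct on the whole domain even when the administered items are unusually easy or hard.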

15.

EFFECTS OF INCORPORATING HUMOR IN TEST ITEMS

ROBERT F. MCMORRIS SANDRA L. URBACH MICHAEL C. CONNOR 《Journal of Educational Measurement》1985,22(2):147-155

Two matched forms of a 50-item multiple-choice grammar test were developed. Twenty items designed to be humorous were included in one form. Test forms were randomly assigned to 126 eighth graders who received the test plus alternate forms of a questionnaire. Inclusion of humorous items did not affect grammar scores on matched humorous/nonhumorous items nor on common post-treatment items, nor did inclusion affect results of anxiety measures. Students favored inclusion of humor on tests, judged effects of humor positively, and estimated humorous items to be easier. Humor did not lower performance but was sought by the students. Potential for more valid and humane measurement is discussed.

16.

A Comparison of Quantitative Questions in Open-Ended and Multiple-Choice Formats

Brent Bridgeman 《Journal of Educational Measurement》1992,29(3):253-271

Open-ended counterparts to a set of items from the quantitative section of the Graduate Record Examination (GRE-Q) were developed. Examinees responded to these items by gridding a numerical answer on a machine-readable answer sheet or by typing on a computer. The test section with the special answer sheets was administered at the end of a regular GRE administration. Test forms were spiraled so that random groups received either the grid-in questions or the same questions in a multiple-choice format. In a separate data collection effort, 364 paid volunteers who had recently taken the GRE used a computer keyboard to enter answers to the same set of questions. Despite substantial format differences noted for individual items, total scores for the multiple-choice and open-ended tests demonstrated remarkably similar correlational patterns. There were no significant interactions of test format with either gender or ethnicity.

17.

Assessment of Differential Item Functioning for Performance Tasks

Rebecca Zwick John R. Donoghue Angela Grima 《Journal of Educational Measurement》1993,30(3):233-251

18.

A comparison of difficulty and discrimination values of selected true-false item types

Douglas Barker Robert L Ebel 《Contemporary educational psychology》1982,7(1):35-40

Thirty-eight undergraduate students were randomly assigned one of two alternate forms of a 144-item true-false midterm examination. Whenever a statement appeared on one form as true and positively stated, it appeared on the alternate form as false and negatively stated. Similarly, a false and positively stated item on one form was true and negatively stated on the other. The subject matter of the two forms was identical and the four kinds of true-false items were equally represented on each form. Difficulty and discrimination indices were computed for each of the four item types. The statistical results showed negatively stated items were more difficult, but no more discriminating, than positively stated items. Also, false items were not statistically more difficult than true items, but were significantly more discriminating. It was concluded that test constructors should include more false items than true items in their instruments and that all items should be stated positively.

19.

Validation of Group Domain Score Estimates Using a Test of Domain

Mary Pommerich 《Journal of Educational Measurement》2006,43(2):97-111

Domain scores have been proposed as a user-friendly way of providing instructional feedback about examinees' skills. Domain performance typically cannot be measured directly; instead, scores must be estimated using available information. Simulation studies suggest that IRT-based methods yield accurate group domain score estimates. Because simulations can represent best-case scenarios for methodology, it is important to verify results with a real data application. This study administered a domain of elementary algebra (EA) items created from operational test forms. An IRT-based group-level domain score was estimated from responses to a subset of taken items (comprised of EA items from a single operational form) and compared to the actual observed domain score. Domain item parameters were calibrated both using item responses from the special study and from national operational administrations of the items. The accuracy of the domain score estimates was evaluated within schools and across school sizes for each set of parameters. The IRT-based domain score estimates typically were closer to the actual domain score than observed performance on the EA items from the single form. Previously simulated findings for the IRT-based domain score estimation procedure were supported by the results of the real data application.

20.

Item Function Characteristics and Dimensionality for Alternative Response Formats in Mathematics

《Applied Measurement in Education》2013,26(3):257-275

The purpose of this study was to investigate the technical properties of stem-equivalent mathematics items differing only with respect to response format. Using socioeconomic factors to define the strata, a proportional stratified random sample of 1,366 Connecticut sixth-grade students were administered one of three forms. Classical item analysis, dimensionality assessment, item response theory goodness-of-fit, and an item bias analysis were conducted. Analysis of variance and confirmatory factor analysis were used to examine the functioning of the items presented in the three different formats. It was found that, after equating forms, the constructed-response formats were somewhat more difficult than the multiple-choice format. However, there was no significant difference across formats with respect to item discrimination. A differential item functioning (DIF) analysis was conducted using both the Mantel-Haenszel procedure and the comparison of the item characteristic curves. The DIF analysis indicated that the presence of bias was not greatly affected by item format; that is, items biased in one format tended to be biased in a similar manner when presented in a different format, and unbiased items tended to remain so regardless of format.
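The Mantel-Haenszel procedure mentioned above is commonly summarized by a common odds ratio pooled over score strata and by its transformation to the delta scale, MH D-DIF = -2.35 ln(alpha). The sketch below implements those two standard formulas only; the significance test and the item characteristic curve comparison from the study are not shown, and the counts and function names are hypothetical.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """strata: one (A, B, C, D) tuple per matched score level, where
    A = reference correct, B = reference incorrect,
    C = focal correct,     D = focal incorrect."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def mh_d_dif(alpha):
    """Delta-scale DIF index; negative values indicate the item favors the reference group."""
    return -2.35 * math.log(alpha)

# Hypothetical counts for one item at two total-score levels.
strata = [(30, 10, 20, 20), (40, 10, 30, 20)]
alpha = mantel_haenszel_odds_ratio(strata)
print(round(alpha, 3), round(mh_d_dif(alpha), 3))
```

Matching on total score before comparing groups is what separates DIF from a plain difference in group means: an odds ratio near 1 at every score level means examinees of equal overall proficiency have equal chances on the item.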