1.

EFFECTS OF INCORPORATING HUMOR IN TEST ITEMS

ROBERT F. MCMORRIS SANDRA L. URBACH MICHAEL C. CONNOR 《Journal of Educational Measurement》1985,22(2):147-155

Two matched forms of a 50-item multiple-choice grammar test were developed. Twenty items designed to be humorous were included in one form. Test forms were randomly assigned to 126 eighth graders who received the test plus alternate forms of a questionnaire. Inclusion of humorous items did not affect grammar scores on matched humorous/nonhumorous items nor on common post-treatment items, nor did inclusion affect results of anxiety measures. Students favored inclusion of humor on tests, judged effects of humor positively, and estimated humorous items to be easier. Humor did not lower performance but was sought by the students. Potential for more valid and humane measurement is discussed.

2.

True-False Testing on Trial: Guilty as Charged or Falsely Accused?

Jordan Andrew Brabec, Steven C. Pan, Elizabeth Ligon Bjork, Robert A. Bjork 《Educational Psychology Review》2021,33(2):667-692

Although widely used, the true-false test is often regarded as a superficial or even harmful test, one that lacks the pedagogical efficacy of more substantive tests (e.g., cued-recall or short-answer tests). Such charges, however, lack conclusive evidence and may, in some cases, be false. Across four experiments, we investigated how true-false testing of studied passages (e.g., on Yellowstone National Park) might enhance—or be optimized to enhance—performance on subsequent cued-recall tests. In Experiments 1–2, relative to control performance that did not benefit from any additional exposure, we found that (a) the evaluation of true statements enhanced the recall of tested (but not related) content and that (b) the evaluation of false statements enhanced the recall of related (but not tested) content, a differential pattern of benefits that did not depend on the syntactic structure of the test items. Moreover, when competitive clauses were embedded within the true-false items of Experiment 3 (e.g., True or false? Castle Geyser (not Steamboat Geyser) is the tallest geyser), we found that the evaluation of both types of statements enhanced the recall of both types of content. Finally, in Experiment 4, these holistic benefits proved robust to a retention interval of 48 h and were comparable with the benefits of a restudy condition in which learners restudied all of the propositions that could have been retrieved in the evaluation of the true-false items. Accordingly, although it was not uncommon for participants to misremember information as a consequence of true-false practice, our findings broadly indicate that, especially when carefully constructed, true-false tests can elicit beneficial, not superficial, processes that belie their poor reputation.

3.

MULTIPLE CHOICE VERSUS TRUE-FALSE: A COMPARISON OF RELIABILITIES AND CONCURRENT VALIDITIES

DAVID A. FRISBIE 《Journal of Educational Measurement》1973,10(4):297-304

The purpose of this study was to compare the reliabilities of true-false (TF) and multiple choice (MC) tests and to determine the concurrent validities of both. Two methods, judgmental and discrimination, were devised for objectively converting MC items to TF form. The TF items generated by the two methods from 70-item MC natural science and social studies tests were incorporated in eight final forms which were differentiated by subject matter, conversion method, and item form order. A sample of 1018 nonurban high school students each responded to one of the eight forms. Examinees tried three TF items for every pair of MC items attempted. The TF tests were significantly less reliable than the MC tests but did tend to measure the same thing as the corresponding MC tests.

4.

Shifting gears: consequences of including two negatively worded items in the middle of a positively worded questionnaire

Michael J. Roszkowski Margot Soven 《Assessment & Evaluation in Higher Education》2010,35(1):113-130

A questionnaire used in student evaluations of interdisciplinary courses during six semesters contained two Likert items stated in a direct negative mode which were embedded in a questionnaire (14–18 items) in which the remaining items were phrased in a direct positive mode. In the seventh semester and thereafter, the two negative items were restated as direct positive stems. Item analysis demonstrated that in the direct negative mode, the two items had low item-to-total correlations and that the internal consistency reliability of the sum score could be improved by eliminating the two negatively phrased items. Also, the two negatively worded items defined a separate factor. After they were reworded into a direct positive mode, these two items showed markedly improved item-to-total correlations. Moreover, the unique factor disappeared, which suggests that it was a methodological artefact probably attributable to respondent carelessness. Including a few negative items in an otherwise positively stated questionnaire leads to ambiguity of results rather than controlling for response sets. We therefore recommend against the practice.
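
As a rough illustration of the item-to-total correlation mentioned in this abstract, the sketch below computes a corrected item-total correlation (each item against the sum of the remaining items). The function names and the Likert data are hypothetical and are not taken from the study; the abstract does not say which variant of the statistic the authors used.

```python
from statistics import mean


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(responses, item):
    """Correlate one item with the sum of all OTHER items.

    responses: one list of item scores per respondent.
    item: zero-based index of the item to examine.
    """
    item_scores = [r[item] for r in responses]
    rest_scores = [sum(r) - r[item] for r in responses]
    return pearson(item_scores, rest_scores)


# Hypothetical 5-point Likert responses (rows = respondents).
# Items 0-2 move together; item 3 does not track them.
data = [
    [5, 4, 5, 3],
    [4, 4, 4, 1],
    [3, 3, 3, 4],
    [2, 2, 1, 2],
    [1, 2, 2, 3],
]

r_consistent = corrected_item_total(data, 0)  # strongly positive
r_odd_item = corrected_item_total(data, 3)    # close to zero
```

In this made-up data set the first item correlates well with the rest of the scale, while the fourth item barely correlates at all, which is the kind of pattern an item analysis would flag.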

5.

A Comparison of Multiple-Choice and Constructed Figural Response Items

Michael E. Martinez 《Journal of Educational Measurement》1991,28(2):131-145

In contrast to multiple-choice test questions, figural response items call for constructed responses and rely upon figural material, such as illustrations and graphs, as the response medium. Figural response questions in various science domains were created and administered to a sample of 4th-, 8th-, and 12th-grade students. Item and test statistics from parallel sets of figural response and multiple-choice questions were compared. Figural response items were generally more difficult, especially for questions that were difficult (p < .5) in their constructed-response forms. Figural response questions were also slightly more discriminating and reliable than their multiple-choice counterparts, but they had higher omit rates. This article addresses the relevance of guessing to figural response items and the diagnostic value of the item type. Plans for future research on figural response items are discussed.

6.

Multiple True-False Items: A Study of Interitem Correlations, Scoring Alternatives, and Reliability Estimation

Mark A. Albanese Darrell L. Sabers 《Journal of Educational Measurement》1988,25(2):111-123

Intercorrelations among multiple true-false items were examined to determine to what extent each true-false option can be treated as independent. Results from 157 health science students and 170 medical students showed that correlations between true-false options associated with the same stem were from 2.6 to 7.0 times larger than those from different stems. This suggests that results from previous research indicating that each true-false option could be treated as an independent item cannot be generalized to other tests and examinee populations without supporting evidence. Four scoring methods were explored which varied chance success levels and scoring for partial knowledge. The results showed that scoring methods incorporating partial knowledge were more reliable and possessed greater concurrent and predictive validity than those minimizing chance success. Methods for computing reliability estimates were compared and suggestions were offered regarding practical use.

7.

The Generalizability of Ratings of Item Relevance

《Applied Measurement in Education》2013,26(4):301-309

The relevance of test content to practice is essential for credentialing examinations and one way to ensure it is to collect ratings of item relevance from job incumbents. This study analyzed ratings of the 132 single-best-answer items and 117 multiple true-false item sets that formed the pretest books in a single administration of a medical certifying examination. Ratings collected from 57 practitioners were high (an average of more than 4 on a 5-point scale) and correlated with item difficulty (r = .31 to .34). The relationship between ratings and item discrimination is less clear (r = -.04 to .31). Application of generalizability theory to the ratings shows that reasonable estimates of item, stem, and total test relevance can be obtained with about 10 raters.

8.

Language and Cultural Characteristics That Explain Differential Item Functioning for Hispanic Examinees on the Scholastic Aptitude Test

Alicia P. Schmitt 《Journal of Educational Measurement》1988,25(1):1-13

The standardization methodology was used to help identify item characteristics that might explain differential item functioning among Hispanics on the Scholastic Aptitude Test. Results indicated that true cognates or words with a common root in English and Spanish and content of special interest for Hispanics seemed to help Hispanics' performance. Limited occurrence of false cognates (words that appear to be cognates but have different meanings in both languages) and of homographs (words that are spelled alike but have different meanings in English) restricted their evaluation. Nevertheless, examination of items with false cognates or homographs gave some evidence indicating that their occurrence might make items unexpectedly more difficult for Hispanic examinees.

9.

A Comparison of Limited-Information and Full-Information Methods in Mplus for Estimating Item Response Theory Parameters for Nonnormal Populations

Christine E. DeMars 《Structural equation modeling》2013,20(4):610-632

In structural equation modeling software, either limited-information (bivariate proportions) or full-information item parameter estimation routines could be used for the 2-parameter item response theory (IRT) model. Limited-information methods assume the continuous variable underlying an item response is normally distributed. For skewed and platykurtic latent variable distributions, 3 methods were compared in Mplus: limited information, full information integrating over a normal distribution, and full information integrating over the known underlying distribution. Interfactor correlation estimates were similar for all 3 estimation methods. For the platykurtic distribution, estimation method made little difference for the item parameter estimates. When the latent variable was negatively skewed, for the most discriminating easy or difficult items, limited-information estimates of both parameters were considerably biased. Full-information estimates obtained by marginalizing over a normal distribution were somewhat biased. Full-information estimates obtained by integrating over the true latent distribution were essentially unbiased. For the a parameters, standard errors were larger for the limited-information estimates when the bias was positive but smaller when the bias was negative. For the d parameters, standard errors were larger for the limited-information estimates of the easiest, most discriminating items. Otherwise, they were generally similar for the limited- and full-information estimates. Sample size did not substantially impact the differences between the estimation methods; limited information did not gain an advantage for smaller samples.

10.

Item Selection in Computerized Adaptive Testing: Should More Discriminating Items be Used First?

Kit-Tai Hau Hua-Hua Chang 《Journal of Educational Measurement》2001,38(3):249-266

During computerized adaptive testing (CAT), items are selected continuously according to the test-taker's estimated ability. The traditional method of attaining the highest efficiency in ability estimation is to select items of maximum Fisher information at the currently estimated ability. Test security has become a problem because high-discrimination items are more likely to be selected and become overexposed. So, there seems to be a tradeoff between high efficiency in ability estimations and balanced usage of items. This series of four studies with simulated data addressed the dilemma by focusing on the notion of whether more or less discriminating items should be used first in CAT. The first study demonstrated that the common maximum information method with Sympson and Hetter (1985) control resulted in the use of more discriminating items first. The remaining studies showed that using items in the reverse order (i.e., less discriminating items first), as described in Chang and Ying's (1999) stratified method, had potential advantages: (a) a more balanced item usage and (b) a relatively stable resultant item pool structure with easy and inexpensive management. This stratified method may have ability-estimation efficiency better than or close to that of other methods, particularly for operational item pools when retired items cannot be totally replenished with similar highly discriminating items. It is argued that the judicious selection of items, as in the stratified method, is a more active control of item exposure, which can successfully even out the usage of all items.
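
The maximum-information rule described in this abstract can be sketched as follows. This is a minimal illustration that assumes a two-parameter logistic (2PL) model, which the abstract does not name; the item pool, the function names, and the parameter values are all hypothetical, and no exposure control (such as Sympson-Hetter) or stratification is implemented.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Item information under the 2PL: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_max_information(theta, pool, used):
    """Return the index of the unused item with the most information at theta."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: fisher_information(theta, *pool[i]))


# Hypothetical pool of (discrimination a, difficulty b) pairs.
pool = [(0.6, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.5)]

first = select_max_information(0.0, pool, used=set())
second = select_max_information(0.0, pool, used={first})
```

With the ability estimate at 0, the item with the highest discrimination at matching difficulty is picked first, which illustrates why highly discriminating items tend to be overexposed under this rule.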

11.

Gender-Based Differential Item Performance in Mathematics Achievement Items

Allen E. Doolittle T. Anne Cleary 《Journal of Educational Measurement》1987,24(2):157-166

A procedure for the detection of differential item performance (DIP) is used to investigate the relationships between characteristics of mathematics achievement items and gender differences in performance. Eight randomly equivalent samples of high school seniors were each given a unique form of the ACT Assessment Mathematics Usage Test (ACTM). Students without requisite mathematics courses were deleted from the samples to reduce the confounding effects of differences in instruction at the high school level. Signed measures of DIP were obtained for each item in the eight ACTM forms. These DIP estimates were then analyzed in a 6 × 8 (item category by form) experimental design. A significant item category effect was found indicating a relationship between item characteristics and gender-based DIP. Predictions, based on previous research about the categories of items that would contribute to gender-based DIP, were supported: Geometry and mathematics reasoning items were relatively more difficult for female examinees and the more algorithmic, computation-oriented items were relatively easier.

12.

ESTIMATING THE RELIABILITY OF MULTIPLE TRUE-FALSE TESTS

DAVID A. FRISBIE CYNTHIA A. DRUVA 《Journal of Educational Measurement》1986,23(2):99-105

This study was designed to examine the level of dependence within multiple true-false (MTF) test item clusters by computing sets of item intercorrelations with data from a test composed of both MTF and multiple choice (MC) items. It was posited that internal analysis reliability estimates for MTF tests would be spurious due to elevated MTF within-cluster intercorrelations. Results showed that, on the average, MTF within-cluster dependence was no greater than that found between MTF items from different clusters, between MC items, or between MC and MTF items. But item for item, there was greater dependence between items within the same cluster than between items of different clusters.

13.

Item Function Characteristics and Dimensionality for Alternative Response Formats in Mathematics

《Applied Measurement in Education》2013,26(3):257-275

The purpose of this study was to investigate the technical properties of stem-equivalent mathematics items differing only with respect to response format. Using socioeconomic factors to define the strata, a proportional stratified random sample of 1,366 Connecticut sixth-grade students were administered one of three forms. Classical item analysis, dimensionality assessment, item response theory goodness-of-fit, and an item bias analysis were conducted. Analysis of variance and confirmatory factor analysis were used to examine the functioning of the items presented in the three different formats. It was found that, after equating forms, the constructed-response formats were somewhat more difficult than the multiple-choice format. However, there was no significant difference across formats with respect to item discrimination. A differential item functioning (DIF) analysis was conducted using both the Mantel-Haenszel procedure and the comparison of the item characteristic curves. The DIF analysis indicated that the presence of bias was not greatly affected by item format; that is, items biased in one format tended to be biased in a similar manner when presented in a different format, and unbiased items tended to remain so regardless of format.
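
For readers unfamiliar with the Mantel-Haenszel procedure named in this abstract, the sketch below computes the Mantel-Haenszel common odds ratio for a single item across matched score strata, plus the delta-scale transformation (-2.35 times the natural log of the odds ratio) commonly used to report it. The counts are invented for illustration and the function names are hypothetical; the abstract does not describe how the authors formed their strata, and no significance test is included.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across score strata for one item.

    Each stratum is a tuple (a, b, c, d):
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect.
    Assumes every stratum is non-empty and the denominator is non-zero.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


def mh_delta(odds_ratio):
    """Delta-scale DIF index; negative values favor the reference group."""
    return -2.35 * math.log(odds_ratio)


# Hypothetical counts for one item in three matched total-score strata.
strata = [(20, 20, 10, 30), (30, 10, 20, 20), (36, 4, 30, 10)]

alpha = mantel_haenszel_odds_ratio(strata)
delta = mh_delta(alpha)
```

An odds ratio of 1 (delta of 0) would indicate no DIF after matching on total score; in this made-up example the ratio is well above 1, so the item favors the reference group.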

14.

The Effects of Test Length and Sample Size on the Reliability and Equating of Tests Composed of Constructed-Response Items

《Applied Measurement in Education》2013,26(1):31-57

Examined in this study were the effects of test length and sample size on the alternate forms reliability and the equating of simulated mathematics tests composed of constructed-response items scaled using the 2-parameter partial credit model. Test length was defined in terms of the number of both items and score points per item. Tests with 2, 4, 8, 12, and 20 items were generated, and these items had 2, 4, and 6 score points. Sample sizes of 200, 500, and 1,000 were considered. Precise item parameter estimates were not found when 200 cases were used to scale the items. To obtain acceptable reliabilities and accurate equated scores, the findings suggested that tests should have at least eight 6-point items or at least 12 items with 4 or more score points per item.

15.

THE IMPACT OF ITEM PHRASING ON THE VALIDITY OF ATTITUDE SCALES FOR ELEMENTARY SCHOOL CHILDREN

JERI BENSON DENNIS HOCEVAR 《Journal of Educational Measurement》1985,22(3):231-240

The purpose of the study was to examine the effect of item phrasing on the validity of a Likert-type attitude scale. Three content-similar scales were composed of 15 items, either all positive, all negative, or a mixture of positive and negative items. Five hundred twenty-two students in grades 4–6 responded to one of the three forms. Results from the all-positive and all-negative forms indicated that item means, variances, and factor structures differed significantly. Inspection of item means suggested that it was difficult for the students to indicate agreement by disagreeing with a negative statement. Analyses of the mixed phrasing form indicated factors based upon item phrasing, not item content. Taken together, the results suggest that the technique of balancing item phrasing when used with elementary students appears to affect adversely the validity of attitude measurement.

16.

Item Discrimination: When More Is Worse

Geofferey N. Masters 《Journal of Educational Measurement》1988,25(1):15-29

High item discrimination can be a symptom of a special kind of measurement disturbance introduced by an item that gives persons of high ability a special advantage over and above their higher abilities. This type of disturbance, which can be interpreted as a form of item "bias," can be encouraged by methods that routinely interpret highly discriminating items as the "best" items on a test and may be compounded by procedures that weight items by their discrimination. The type of measurement disturbance described and illustrated in this paper occurs when an item is sensitive to individual differences on a second, undesired dimension that is positively correlated with the variable intended to be measured. Possible secondary influences of this type include opportunity to learn, opportunity to answer, and test wiseness.

17.

The Relationship of Content Characteristics of GRE Analytical Reasoning Items to Their Difficulties and Discriminations

Clark L. Chalifour Donald E. Powers 《Journal of Educational Measurement》1989,26(2):120-132

In actual test development practice, the number of test items that must be developed and pretested is typically greater, and sometimes much greater, than the number that is eventually judged suitable for use in operational test forms. This has proven to be especially true for one item type, analytical reasoning, that currently forms the bulk of the analytical ability measure of the GRE General Test. This study involved coding the content characteristics of some 1,400 GRE analytical reasoning items. These characteristics were correlated with indices of item difficulty and discrimination. Several item characteristics were predictive of the difficulty of analytical reasoning items. Generally, these same variables also predicted item discrimination, but to a lesser degree. The results suggest several content characteristics that could be considered in extending the current specifications for analytical reasoning items. The use of these item features may also contribute to greater efficiency in developing such items. Finally, the influence of these various characteristics also provides a better understanding of the construct validity of the analytical reasoning item type.

18.

Item Difficulty of Four Verbal Item Types and an Index of Differential Item Functioning for Black and White Examinees

Roy Freedle Irene Kostin 《Journal of Educational Measurement》1990,27(4):329-343

In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) of Black versus matched White examinees for four verbal item types (analogies, antonyms, sentence completions, reading comprehension) using 13 GRE-disclosed forms (988 verbal items) and 11 SAT-disclosed forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both GRE and SAT. The most important finding indicates that for hard items, Black examinees perform differentially better than matched ability White examinees for each of the four item types and for both the GRE and SAT tests! The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored.

19.

Detecting Calculator Effects on Item Performance

《Applied Measurement in Education》2013,26(4):303-320

Calculator effects were examined using methods taken from research on differential item functioning. Use of a calculator was controlled on two experimental forms of a test assembled from operational items used on a standardized university mathematics placement test. Results indicated that calculator effects were not present based on analysis of test scores and in only two of the three subscores composed from homogeneous item types. Analyses of item-level functioning indicated, however, that a number of items, including several not included in the two significant subscore combinations, also contained calculator effects. For those items identified, use of the calculator appeared to have changed the actual objective being tested. The findings were generally consistent with previous research: Items that were easier when a calculator was used required either simple computations or use of a function key on the calculator; items that were more difficult required knowledge of a procedure either with or without additional computation. Analysis at the item level facilitated clearer understanding of the impact of calculator use on measurement of the underlying objective.

20.

Effect of Two Selected Item-Writing Practices on Test Difficulty, Discrimination, and Reliability

Cynthia B. Schmeiser Douglas R. Whitney 《Journal of Experimental Education》2013,81(3):30-34

In order to investigate the effect of two item-writing practices on test characteristics, examinations were chosen for study in two undergraduate courses (N = 71 and 210). About one-fourth of the items on each examination included a practice generally regarded as undesirable in measurement textbooks and alleged to make test items more difficult. Alternate forms which eliminated the undesirable practice were developed and administered at the same time as the original form. Rewriting item stems so that they formed a complete sentence or question resulted in about 6 percent more students answering items correctly. Eliminating unnecessary material in item stems, however, had little effect on difficulty. KR₂₀ values were not appreciably different for the two versions of either test. Neither flaw was found to affect item discrimination indices noticeably. The absence of any substantial practice-by-achievement level interactions suggested little effect of the practices on the validity of the tests.
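
The KR₂₀ values referred to in this abstract are Kuder-Richardson Formula 20 reliability coefficients for dichotomously scored items. The sketch below shows one way to compute the coefficient from a 0/1 response matrix; the data are invented, the function name is hypothetical, and the choice of the population variance (dividing by n) for the total scores is an assumption, since the abstract gives no computational detail.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson Formula 20 for 0/1 item scores.

    responses: one list of 0/1 item scores per examinee.
    Uses the population variance of total scores (an assumption).
    """
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(r) for r in responses]
    var_total = pvariance(totals)
    # Sum of p * q over items, where p is the proportion correct.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(r[j] for r in responses) / n_people
        sum_pq += p * (1 - p)
    return n_items / (n_items - 1) * (1 - sum_pq / var_total)


# Hypothetical responses of five examinees to four items.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

reliability = kr20(data)
```

Comparing this coefficient between an original form and a rewritten form is the kind of check the abstract reports, although with samples this small the value would be far too unstable for real use.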