
1.

The Effect of Item Response Changes on Scores on an Elementary Reading Achievement Test

《The Journal of Educational Research》2012,105(3):153-156

Abstract

The effect of changing item responses on scores of elementary school children on a standardized achievement test was studied. Previous research, primarily involving non-standardized instruments and adult samples, indicates that changed responses are more likely to be correct than not. Subjects were 165 third-grade students who took the Metropolitan Reading Tests. Students received no special instructions regarding changing responses. Changes were identified visually and were independently verified. While frequency of response changes was low, such changes generally improved scores. Sex differences in number and success of changes were non-significant. The relationship between frequency of response change and test score was minimal. Responses to difficult items were changed more frequently and with less success than responses to easy items. High scorers made more successful changes than did low scorers. Within the limits of the methodology, results clearly indicated that response changes of elementary students on multiple-choice items tend to improve test scores.

2.

Investigating the Effect of Item Position in Computer‐Based Tests

Feiming Li Allan Cohen Linjun Shen 《Journal of Educational Measurement》2012,49(4):362-379

Computer‐based tests (CBTs) often use random ordering of items in order to minimize item exposure and reduce the potential for answer copying. Little research has been done, however, to examine item position effects for these tests. In this study, different versions of a Rasch model and different response time models were examined and applied to data from a CBT administration of a medical licensure examination. The models specifically were used to investigate whether item position affected item difficulty and item intensity estimates. Results indicated that the position effect was negligible.

3.

Item-Response Changes on Multiple-Choice Tests as a Function of Test Anxiety

Kathy Green 《Journal of Experimental Education》2013,81(4):225-228

Item-response changing as a function of test anxiety was investigated. Seventy graduate students completed the Test Anxiety Scale and 73 multiple-choice items during the quarter. The data supported the hypothesis that high test-anxious students make more item-response changes than low test-anxious students. Results also suggested that both high- and low-anxious students profit to a similar extent proportionally from answer changing. It was further found that more responses were changed on difficult than on easy items for both high- and low-anxious students. Test anxiety is suggested as a factor forming test-taking style.

4.

A Paradox in the Study of the Benefits of Test‐Item Review

Wim J. van der Linden Minjeong Jeon Steve Ferrara 《Journal of Educational Measurement》2011,48(4):380-398

According to a popular belief, test takers should trust their initial instinct and retain their initial responses when they have the opportunity to review test items. More than 80 years of empirical research on item review, however, has contradicted this belief and shown minor but consistently positive score gains for test takers who changed answers they found to be incorrect during review. This study reanalyzed the problem of the benefits of answer changes using item response theory modeling of the probability of an answer change as a function of the test taker’s ability level and the properties of items. Our empirical results support the popular belief and reveal substantial losses due to changing initial responses for all ability levels. Both the contradiction of the earlier research and support of the popular belief are explained as a manifestation of Simpson’s paradox in statistics.

5.

Score Gains, Attitudes, and Behavior Changes Due to Answer-Changing Instruction

Catherine P. Prinsell Philip H. Ramsey Patricia P. Ramsey 《Journal of Educational Measurement》1994,31(4):327-337

Six undergraduate and three graduate classes were given multiple-choice tests with subsequent evaluation of answer changes. The 300 students were tested twice, once before and once after instruction on answer changing. After each test, students were asked to complete two forms. The forms evaluated attitude toward answer changing, reasons for changing, and confidence in final answers. Students showed a significant increase in favorability toward answer changing after instruction. No significant change was found in number of answers changed. Psychology students were found to change significantly more items than were business students. Mean gain score did not change significantly after instruction. It was concluded that although instruction does lead to a change in attitude toward answer changing, the number of changes and overall gain due to answer changing do not change. It was also determined that students continue to make significant gains even when their confidence in the final answer is less than 50 on a 100-point scale.

6.

Reading retardation or linguistic deficit? II: test-answering strategies in hearing and hearing-impaired school children

D. J. Wood A. J. Griffiths A. Webster 《Journal of Research in Reading》1981,4(2):148-156

The reading test performances of 60 hearing and 60 hearing-impaired children of similar measured reading ages on the Southgate reading test were analysed. As in an earlier study using the Brimer Wide-span test, it was shown that the performances of the two groups were quite different. Deaf children tackled significantly more test items than the hearing children and made significantly more errors in achieving similar reading scores. A detailed examination of both correct and incorrect answers showed that the deaf children were not simply providing answers to questions at random. Even where they produced incorrect responses they tended, as a group, to select the same answer. Unlike the hearing group, who did not converge on the same incorrect solution to difficult test items, the deaf were systematic in their choices, indicating that they were using a consistent strategy. A post hoc examination of individual test items indicated that the deaf children were selecting answers on the basis of word associations in each test item. On some items these produced a correct response, on others the same (incorrect) response. The implications of these findings are discussed to argue that reading tests based on hearing norms are of little value in the assessment of reading abilities and reading problems in hearing-impaired children.

7.

Consequence of Performance, Test, Motivation, and Mentally Taxing Items

《Applied Measurement in Education》2013,26(4):341-351

The relation between characteristics of test takers and characteristics of items was examined in a quasi-experimental study. High-school sophomores and juniors were administered a mathematics exam that was of consequence to the sophomores but not the juniors. The juniors had more mathematics course work as a group but less motivation to perform well. Items were characterized by item difficulty (from p values), the degree to which they were mentally taxing (how much mental effort was necessary to reach a correct answer), and item position (as an index of the level of fatigue of the test taker). A differential item functioning (DIF) analysis was conducted to look at differences between sophomores and juniors on an item-by-item basis. It was found that all three item characteristic measures were related to the DIF index, with the mental taxation measure showing the strongest relation. Results are interpreted in relation to the expectancy value model of motivation as formulated by Pintrich (1988, 1989).

8.

Analysis of an instructional unit for level of cognitive demand

Ann C. Howe Beulah P. Durr 《Journal of Research in Science Teaching》1982,19(3):217-224

A section of the secondary chemistry curriculum was analyzed to determine the level of cognitive demand of the various aspects of the selected topic. Piagetian levels of thinking of 71 pupils were initially assessed by two group tests, a unit on the mole was taught, and guidelines were used to estimate the level of cognitive operations required by each concept and problem type in the unit. Results of a 23-item test were used to compare the estimated level of cognitive demand of each test item with the Piagetian cognitive level of pupils who were able to answer the item correctly. It was found that pupil cognitive level was positively associated with overall unit test score and with percent success on all test items. Predicted levels of cognitive demand were confirmed for eight items and were within one level for nine additional items.

9.

Effectiveness of Short-Term Group Guidance with a Group of Transfer Students Admitted on Academic Probation

《The Journal of Educational Research》2012,105(10):463-465

Abstract

To combat problems of cheating arising from testing under crowded classroom conditions, instructors frequently use multiple arrangements of a set of test items. These different arrangements or forms should be nearly equivalent relative to mean total scores. This study reports data from comparisons involving eleven pairs of equivalent tests. There were no significant linear relationships between equivalent test forms on the ordering of item difficulties. Reliabilities differed little within pairs of equivalent tests. Nine of eleven t-tests comparing mean total test scores were not significant. The bulk of these data supported the assumption that one may construct equivalent power tests by rearranging items, when the ordering of item difficulty is non-systematic on both arrangements.

10.

Research and Teaching in the Science Department of the University of London Institute of Education

A. D. Turner 《International Journal of Science Education》2013,35(4):457-459

This study was designed (1) to analyse the relationship between the answer profile from multiple‐choice questions on stoichiometric problems and the students’ reasoning patterns and (2) to examine the effect of certain variables on the facility values of test items. The instruments used were mainly paper‐and‐pencil tests. The subjects were 6262 grammar school students from all parts of the Federal Republic of Germany. They were randomly assigned to the test items.

The results indicated that many students arrived at their answers by mixing up amount and reacting mass, or molar mass and reacting mass. It was also found that the variables ‘easy/hard calculations’, and ‘formula given/to be developed’ determined the facility values of test items.

From the results, it was possible to make recommendations to practising teachers as well as to examiners. Knowing students’ ideas, the teacher can think of how to make use of them before entering the classroom. A teaching unit may start off with easy problems, leaving the more difficult ones for later. Examiners developing new tests on stoichiometry should consider two essential preconditions for the formulae of chemical compounds used in the item: formulae of the type AB should be avoided and the molar masses of the elements involved must be clearly different.

11.

Item-Level Comparative Analysis of Online and Paper Administrations of the Texas Assessment of Knowledge and Skills

Leslie Keng Katie Larsen McClarty Laurie Laughlin Davis 《Applied Measurement in Education》2013,26(3):207-226

This article describes a comparative study conducted at the item level for paper and online administrations of a statewide high stakes assessment. The goal was to identify characteristics of items that may have contributed to mode effects. Item-level analyses compared two modes of the Texas Assessment of Knowledge and Skills (TAKS) for up to four subjects at two grade levels. The analyses included significance tests of p-value differences, DIF, and response distributions for each item. Additional analyses investigated item position effects and objective-level mode differences. No evidence of item position effects emerged, but significant differences were found for several items and objectives in all subjects at grade 8 and in mathematics and English language arts (ELA) at grade 11. Differences generally favored the paper group. ELA items that were longer in passage length and math items that required graphing and geometric manipulations or involved scrolling in the online administration tended to be the items showing mode differences.

12.

Using a Computer-Based Error Analysis Approach to Improve Basic Subtraction Skills in the Third Grade

《The Journal of Educational Research》2012,105(6):363-365

Abstract

With the national move toward competency testing, publishers and educators have become increasingly concerned about test validity, item construction, and item readability. While a major effort is usually made by test developers to control the readability level of the test items, there is currently no validated measure of individual item readability.

It is commonly assumed that oral reading of test items by the teacher would ameliorate the readability problem for poor readers. Over 4,000 fifth-grade students were involved in this study aimed at determining the effect of teacher oral reading of test items to good and poor readers. The findings suggested that having teachers read test items aloud during the administration of standardized examinations yielded, overall, higher scores than having students read the items for themselves. However, this intervention did not benefit poor readers more than good readers. Both of these groups reflected similar gains under the influence of this intervention.

13.

The Assessment of Teacher Affective Sensitivity

《The Journal of Educational Research》2012,105(5):257-263

Abstract

The purpose of this investigation was to develop and validate a simulation device to measure a teacher's ability to identify verbal and nonverbal emotions expressed by students (teacher affective sensitivity). The scale consists of videotaped excerpts of teacher-learner interactions and accompanying multiple-choice instrumentation. Respondents select the answer from each multiple-choice item that they believe most accurately describes the affective state of the pupil viewed on the monitor. Previously produced media focusing on classroom interactions were used to obtain the examples of learner affective expressions. Expert judges constructed two multiple-choice items for each simulation episode. Pilot test administrations allowed for numerous scale revisions. Finally, the scale's reliability and its construct, predictive, concurrent, and content validity were assessed.

14.

Reasons for Changing Answers: An Evaluation Using Personal Interviews

Shirley P. Schwarz Robert F. McMorris Lawrence P. DeMers 《Journal of Educational Measurement》1991,28(2):163-171

Researchers investigating answer changing have consistently found the preponderance of changes on objective items to be from wrong to right, but little is understood about the mechanisms involved in this phenomenon. In this study, personal interviews were combined with instruction in answer-changing research to investigate further the processes involved in answer changing. Students changed answers and gained from changing, with those in the upper two thirds of the classes gaining the most. Each test-taking strategy produced a mean gain, but particular strategies were not significantly correlated with percentage of gain or percentage of change. Most students reported changing answers for thoughtful reasons such as rereading, rethinking, or remembering more information; very few changes were due to clerical errors. For each reason, most changes were wrong-to-right. We conclude that reconsideration of test items is probably underestimated in answer-changing studies. The role of memory should be considered in why people change and in how successful they judge their changing to have been.

15.

Effect of Two Selected Item-Writing Practices on Test Difficulty, Discrimination, and Reliability

Cynthia B. Schmeiser Douglas R. Whitney 《Journal of Experimental Education》2013,81(3):30-34

In order to investigate the effect of two item-writing practices on test characteristics, examinations were chosen for study in two undergraduate courses (N = 71 and 210). About one-fourth of the items on each examination included a practice generally regarded as undesirable in measurement textbooks and alleged to make test items more difficult. Alternate forms which eliminated the undesirable practice were developed and administered at the same time as the original form. Rewriting item stems so that they formed a complete sentence or question resulted in about 6 percent more students answering items correctly. Eliminating unnecessary material in item stems, however, had little effect on difficulty. KR₂₀ values were not appreciably different for the two versions of either test. Neither flaw was found to affect item discrimination indices noticeably. The absence of any substantial practice-by-achievement level interactions suggested little effect of the practices on the validity of the tests.

16.

Assessment of the quality and generalizability of the revised UCLA loneliness scale in Chinese and Korean community-dwelling elderly populations using item response theory (IRT)-Rasch modeling and hybrid IRT-logistic regression

In H. Park Arif Rachmatullah In-Sook Park 《Educational Gerontology》2013,39(10):581-599

ABSTRACT

Objectives: This study aims to test the dimensionality, reliability, and item quality of the revised UCLA loneliness scale as well as to investigate the differential item functioning (DIF) of the three dimensions of the revised UCLA loneliness scale in community-dwelling Chinese and Korean elderly individuals.

Method: Data from 493 elderly individuals (287 Chinese and 206 Korean) were used to examine the revised UCLA loneliness scale. The Rasch model based on item response theory (IRT) was used to test dimensionality, reliability, and item fit. The hybrid ordinal logistic regression-IRT test was used to evaluate DIF.

Results: Item separation reliability, person reliability, and Cronbach’s alpha met the benchmarks. The quality of the items in the three-dimension model met the benchmark. Eight items were detected as significant DIF items (at α < .01). The loneliness level of Chinese elderly individuals was significantly higher than that of Koreans in Dimensions 1 and 2, while Korean elderly participants showed significantly higher loneliness levels than Chinese participants in Dimension 3. Several collected demographic characteristics and loneliness levels were more highly correlated in Korean elderly individuals than in Chinese elderly individuals.

Conclusion: Analysis using the three dimensions is reasonable for the revised UCLA loneliness scale. Good item quality and the items of this measure suggest that the revised UCLA loneliness scale can be used to assess the preferred latent traits. Finally, the differences between the levels of loneliness in Chinese and Korean elderly individuals are associated with the factors of loneliness.

17.

Item Response Models for Examinee‐Selected Items

Wen‐Chung Wang Kuan‐Yu Jin Xue‐Lan Qiu Lei Wang 《Journal of Educational Measurement》2012,49(4):419-445

In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee‐selected items. The results of a series of simulation studies showed that (1) the parameters of the new models were recovered well, (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models, (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee‐selected items, and (4) even when the missingness mechanism in examinee‐selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.

18.

NCME 2008 Presidential Address: The Impact of Anchor Test Configuration on Student Proficiency Rates

Anne R. Fitzpatrick 《Educational Measurement》2008,27(4):34-40

Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.

19.

Reducing the need for guesswork in multiple-choice tests

Martin Bush 《Assessment & Evaluation in Higher Education》2015,40(2):218-231

The humble multiple-choice test is very widely used within education at all levels, but its susceptibility to guesswork makes it a suboptimal assessment tool. The reliability of a multiple-choice test is partly governed by the number of items it contains; however, longer tests are more time consuming to take, and for some subject areas, it can be very hard to create new test items that are sufficiently distinct from previously used items. A number of more sophisticated multiple-choice test formats have been proposed dating back at least 60 years, many of which offer significantly improved test reliability. This paper offers a new way of comparing these alternative test formats, by modelling each one in terms of the range of possible test taker responses it enables. Looking at the test formats in this way leads to the realisation that the need for guesswork is reduced when test takers are given more freedom to express their beliefs. Indeed, guesswork is eliminated entirely when test takers are able to partially order the answer options within each test item. The paper aims to strengthen the argument for using more sophisticated multiple-choice test formats, especially for high-stakes summative assessment.

20.

Evaluation of a College Study Habits Course Using Scores on a Q-Sort Test as the Criterion

《The Journal of Educational Research》2012,105(5):272-274

Abstract

An attempt was made to extend and clarify prior research which had demonstrated consistently that changed answers to objective test items tend to be correct. First, results extended the basic effect of profiting from changed answers to Air Force personnel responding to multiple-choice questions regarding technical skills; the profit from changes was very similar to that observed in a university group responding to relatively "academic" items. Second, most individuals in both groups profited from changes. Third, individuals with the highest test scores tended to profit more from changes than those with the lowest test scores. Fourth, neither Airman Qualifying Exam scores (for the military personnel) nor Scholastic Aptitude Test scores (for the university students) were related to profit. Finally, a systematic case against the popular belief that one should not change answers on objective tests was made, based on an integration of the research to date.