Similar literature
 20 similar documents found (search time: 62 ms)
1.
Abstract

The effect of changing item responses on scores of elementary school children on a standardized achievement test was studied. Previous research, primarily involving non-standardized instruments and adult samples, indicates that changed responses are more likely to be correct than not. Subjects were 165 third grade students using the Metropolitan Reading Tests. Students received no special instructions regarding changing responses. Changes were identified visually and were independently verified. While frequency of response changes was low, such changes generally improved scores. Sex differences in number and success of changes were non-significant. The relationship between frequency of response change and test score was minimal. Responses to difficult items were changed more frequently with less success than changes on easy items. High scorers made more successful changes than did low scorers. Within the limits of the methodology, results clearly indicated that response changes of elementary students on multiple-choice items tend to improve test scores.

2.
Computer‐based tests (CBTs) often use random ordering of items in order to minimize item exposure and reduce the potential for answer copying. Little research has been done, however, to examine item position effects for these tests. In this study, different versions of a Rasch model and different response time models were examined and applied to data from a CBT administration of a medical licensure examination. The models specifically were used to investigate whether item position affected item difficulty and item intensity estimates. Results indicated that the position effect was negligible.

3.
Item-response changing as a function of test anxiety was investigated. Seventy graduate students completed the Test Anxiety Scale and 73 multiple-choice items during the quarter. The data supported the hypothesis that high test-anxious students make more item-response changes than low test-anxious students. Results also suggested that both high- and low-anxious students profit to a similar extent proportionally from answer changing. It was further found that more responses were changed on difficult than on easy items for both high- and low-anxious students. Test anxiety is suggested as a factor forming test-taking style.

4.
According to a popular belief, test takers should trust their initial instinct and retain their initial responses when they have the opportunity to review test items. More than 80 years of empirical research on item review, however, has contradicted this belief and shown minor but consistently positive score gains for test takers who changed answers they found to be incorrect during review. This study reanalyzed the problem of the benefits of answer changes using item response theory modeling of the probability of an answer change as a function of the test taker’s ability level and the properties of items. Our empirical results support the popular belief and reveal substantial losses due to changing initial responses for all ability levels. Both the contradiction of the earlier research and support of the popular belief are explained as a manifestation of Simpson’s paradox in statistics.
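
A minimal sketch of how Simpson's paradox can produce the reversal this abstract describes. The counts are invented for illustration and are not data from the study; the two ability groups and the names `counts`, `rate`, and `pooled_rate` are assumptions. Within each ability group, changed answers are correct less often than retained answers, yet the pooled comparison points the other way because most changes come from the high-ability group.

```python
from fractions import Fraction

# Hypothetical (correct, total) counts per ability group; NOT data from the study.
counts = {
    "high ability": {"changed": (80, 100), "retained": (45, 50)},
    "low ability": {"changed": (10, 50), "retained": (30, 100)},
}


def rate(correct, total):
    """Proportion correct as an exact fraction."""
    return Fraction(correct, total)


def pooled_rate(action):
    """Proportion correct for one action, ignoring ability group."""
    correct = sum(group[action][0] for group in counts.values())
    total = sum(group[action][1] for group in counts.values())
    return Fraction(correct, total)


# Within each group: (rate for changed answers, rate for retained answers).
within = {
    name: (rate(*group["changed"]), rate(*group["retained"]))
    for name, group in counts.items()
}

# Pooled over groups: (rate for changed answers, rate for retained answers).
pooled = (pooled_rate("changed"), pooled_rate("retained"))

for name, (changed, retained) in within.items():
    print(f"{name}: changed {float(changed):.2f}, retained {float(retained):.2f}")
print(f"pooled: changed {float(pooled[0]):.2f}, retained {float(pooled[1]):.2f}")
```

With these invented counts, conditioning on ability reverses the sign of the pooled comparison, which is the general mechanism the abstract invokes.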

5.
Six undergraduate and three graduate classes were given multiple-choice tests with subsequent evaluation of answer changes. The 300 students were tested twice, once before and once after instruction on answer changing. After each test, students were asked to complete two forms. The forms evaluated attitude toward answer changing, reasons for changing, and confidence in final answers. Students showed a significant increase in favorability toward answer changing after instruction. No significant change was found in number of answers changed. Psychology students were found to change significantly more items than were business students. Mean gain score did not change significantly after instruction. It was concluded that although instruction does lead to a change in attitude in answer changing, the number of changes and overall gain due to answer changing do not change. It was also determined that students continue to make significant gains even when their confidence in the final answer is less than 50 on a 100-point scale.

6.
The reading test performances of 60 hearing and 60 hearing-impaired children of similar measured reading ages on the Southgate reading test were analysed. As in an earlier study using the Brimer Wide-span test it was shown that the performances of the two groups were quite different. Deaf children tackled significantly more test items than the hearing and made significantly more errors in achieving similar reading scores. A detailed examination of both correct and incorrect answers showed that the deaf children were not simply providing answers to questions at random. Even where they produced incorrect responses they tended, as a group, to select the same answer. Unlike the hearing group, who did not converge on the same incorrect solution to difficult test items, the deaf were systematic in their choices, indicating that they were using a consistent strategy. A post hoc examination of individual test items indicated that the deaf children were selecting answers on the basis of word associations in each test item. On some items these produced a correct response, on others the same (incorrect) response. The implications of these findings are discussed to argue that reading tests based on hearing norms are of little value in the assessment of reading abilities and reading problems in hearing-impaired children.

7.
Applied Measurement in Education, 2013, 26(4): 341-351
The relation between characteristics of test takers and characteristics of items was examined in a quasi-experimental study. High-school sophomores and juniors were administered a mathematics exam that was of consequence to the sophomores but not the juniors. The juniors had more mathematics course work as a group but less motivation to perform well. Items were characterized by item difficulty (from p values), the degree to which they were mentally taxing (how much mental effort was necessary to reach a correct answer), and item position (as an index of the level of fatigue of the test taker). A differential item functioning (DIF) analysis was conducted to look at differences between sophomores and juniors on an item-by-item basis. It was found that all three item characteristic measures were related to the DIF index, with the mental taxation measure showing the strongest relation. Results are interpreted in relation to the expectancy value model of motivation as formulated by Pintrich (1988, 1989).

8.
A section of the secondary chemistry curriculum was analyzed to determine the level of cognitive demand of the various aspects of the selected topic. Piagetian levels of thinking of 71 pupils were initially assessed by two group tests, a unit on the mole was taught, and guidelines were used to estimate the level of cognitive operations required by each concept and problem type in the unit. Results of a 23-item test were used to compare the estimated level of cognitive demand of each test item with the Piagetian cognitive level of pupils who were able to answer the item correctly. It was found that pupil cognitive level was positively associated with overall unit test score and with percent success on all test items. Predicted levels of cognitive demand were confirmed for eight items and were within one level for nine additional items.

9.
Abstract

To combat problems of cheating arising from testing under crowded classroom conditions, instructors frequently use multiple arrangements of a set of test items. These different arrangements or forms should be nearly equivalent relative to mean total scores. This study reports data from comparisons involving eleven pairs of equivalent tests. There were no significant linear relationships between equivalent test forms on the ordering of item difficulties. Reliabilities differed little within pairs of equivalent tests. Nine of eleven t-tests comparing mean total test scores were insignificant. The bulk of these data supported the assumption that one may construct equivalent power tests by rearranging items, when the ordering of item difficulty is non-systematic on both arrangements.

10.

This study was designed (1) to analyse the relationship between the answer profile from multiple‐choice questions on stoichiometric problems and the students’ reasoning patterns and (2) to examine the effect of certain variables on the facility values of test items. The instruments used were mainly paper‐and‐pencil tests. The subjects were 6262 grammar school students from all parts of the Federal Republic of Germany. They were randomly assigned to the test items.

The results indicated that many students arrived at their answers by mixing up amount and reacting mass, or molar mass and reacting mass. It was also found that the variables ‘easy/hard calculations’, and ‘formula given/to be developed’ determined the facility values of test items.

From the results, it was possible to make recommendations to practising teachers as well as to examiners. Knowing students’ ideas, the teacher can think of how to make use of them before entering the classroom. A teaching unit may start off with easy problems leaving the more difficult ones for later. Examiners developing new tests on stoichiometry should consider two essential preconditions for the formulae of chemical compounds used in the item: formulae of the type AB should be avoided and the molar masses of the elements involved must be clearly different.

11.
This article describes a comparative study conducted at the item level for paper and online administrations of a statewide high stakes assessment. The goal was to identify characteristics of items that may have contributed to mode effects. Item-level analyses compared two modes of the Texas Assessment of Knowledge and Skills (TAKS) for up to four subjects at two grade levels. The analyses included significance tests of p-value differences, DIF, and response distributions for each item. Additional analyses investigated item position effects and objective-level mode differences. No evidence of item position effects emerged, but significant differences were found for several items and objectives in all subjects at grade 8 and in mathematics and English language arts (ELA) at grade 11. Differences generally favored the paper group. ELA items that were longer in passage length and math items that required graphing and geometric manipulations or involved scrolling in the online administration tended to be the items showing mode differences.

12.
Abstract

With the national move toward competency testing, publishers and educators have become increasingly concerned about test validity, item construction, and item readability. While a major effort is usually made by test developers to control the readability level of the test items, there is currently no validated measure of individual item readability.

It is commonly assumed that oral reading of test items by the teacher would ameliorate the readability problem for poor readers. Over 4,000 fifth-grade students were involved in this study aimed at determining the effect of teacher oral reading of test items to good and poor readers. The findings suggested that having teachers read test items aloud during the administration of standardized examinations yielded, overall, higher scores than having students read the items for themselves. However, this intervention did not benefit poor readers more than good readers. Both of these groups reflected similar gains under the influence of this intervention.

13.
Abstract

The purpose of this investigation was to develop and validate a simulation device to measure a teacher's ability to identify verbal and nonverbal emotions expressed by students (teacher affective sensitivity). The scale consists of videotaped excerpts of teacher-learner interactions and accompanying multiple-choice instrumentation. Respondents select the answer from each multiple-choice item that they believe most accurately describes the affective state of the pupil viewed on the monitor. Previously produced media focusing on classroom interactions were used to obtain the examples of learner affective expressions. Expert judges constructed two multiple-choice items for each simulation episode. Pilot test administrations allowed for numerous scale revisions. Finally, assessments of scale reliability, and scale construct, predictive, concurrent, and content validity were made.

14.
Reasons for Changing Answers: An Evaluation Using Personal Interviews
Researchers investigating answer changing have consistently found the preponderance of changes on objective items to be from wrong to right, but little is understood about the mechanisms involved in this phenomenon. In this study, personal interviews were combined with instruction in answer-changing research to investigate further the processes involved in answer changing. Students changed answers and gained from changing, with those in the upper two thirds of the classes gaining the most. Each test-taking strategy produced a mean gain, but particular strategies were not significantly correlated with percentage of gain or percentage of change. Most students reported changing answers for thoughtful reasons such as rereading, rethinking, or remembering more information; very few changes were due to clerical errors. For each reason, most changes were wrong-to-right. We conclude that reconsideration of test items is probably underestimated in answer-changing studies. The role of memory should be considered in why people change and in how successful they judge their changing to have been.

15.
In order to investigate the effect of two item-writing practices on test characteristics, examinations were chosen for study in two undergraduate courses (N = 71 and 210). About one-fourth of the items on each examination included a practice generally regarded as undesirable in measurement textbooks and alleged to make test items more difficult. Alternate forms which eliminated the undesirable practice were developed and administered at the same time as the original form. Rewriting item stems so that they formed a complete sentence or question resulted in about 6 percent more students answering items correctly. Eliminating unnecessary material in item stems, however, had little effect on difficulty. KR20 values were not appreciably different for the two versions of either test. Neither flaw was found to affect item discrimination indices noticeably. The absence of any substantial practice-by-achievement level interactions suggested little effect of the practices on the validity of the tests.
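
For readers unfamiliar with the KR20 statistic mentioned in this abstract, here is a minimal sketch of the standard Kuder-Richardson Formula 20, KR20 = k/(k-1) * (1 - sum(p_i * q_i) / var(total)), computed on a small invented 0/1 response matrix. The data, the function name `kr20`, and the use of the population variance of total scores are assumptions for illustration; the abstract does not say how the authors computed it.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 reliability for a matrix of 0/1 item scores.

    responses: one row per examinee, one column per item.
    Uses the population variance of total scores (an assumption).
    """
    n_examinees = len(responses)
    n_items = len(responses[0])

    # Proportion of examinees answering each item correctly.
    p = [sum(row[i] for row in responses) / n_examinees for i in range(n_items)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)

    # Variance of examinees' total scores.
    total_variance = pvariance([sum(row) for row in responses])

    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Hypothetical responses from five examinees to four items.
demo = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(kr20(demo), 3))
```

Comparing two test versions, as the study did, would amount to calling `kr20` once per version on that version's response matrix.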

16.
ABSTRACT

Objectives: This study aims to test the dimensionality, reliability, and item quality of the revised UCLA loneliness scale as well as to investigate the differential item functioning (DIF) of the three dimensions of the revised UCLA loneliness scale in community-dwelling Chinese and Korean elderly individuals.

Method: Data from 493 elderly individuals (287 Chinese and 206 Korean) were used to examine the revised UCLA loneliness scale. The Rasch model based on item response theory (IRT) was used to test dimensionality, reliability, and item fit. The hybrid ordinal logistic regression-IRT test was used to evaluate DIF.

Results: Item separation reliability, person reliability, and Cronbach’s alpha met the benchmarks. The quality of the items in the three-dimension model met the benchmark. Eight items were detected as significant DIF items (at α < .01). The loneliness level of Chinese elderly individuals was significantly higher than that of Koreans in Dimensions 1 and 2, while Korean elderly participants showed significantly higher loneliness levels than Chinese participants in Dimension 3. Several collected demographic characteristics and loneliness levels were more highly correlated in Korean elderly individuals than in Chinese elderly individuals.

Conclusion: Analysis using the three dimensions is reasonable for the revised UCLA loneliness scale. Good item quality and the items of this measure suggest that the revised UCLA loneliness scale can be used to assess the preferred latent traits. Finally, the differences between the levels of loneliness in Chinese and Korean elderly individuals are associated with the factors of loneliness.

17.
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee‐selected items. The results of a series of simulation studies showed: (1) that the parameters of the new models were recovered well, (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models, (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee‐selected items, and (4) even when the missingness mechanism in examinee‐selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.

18.
Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
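
A rough sketch of what equating by mean anchor difficulty can look like under a one-parameter model. It assumes that the "mean b-value method" shifts the new form's difficulties by the difference between the mean anchor-item b-values on the base scale and on the new scale; the abstract does not spell out the procedure, and the function names and numbers below are hypothetical.

```python
from statistics import mean


def mean_b_shift(anchor_b_base, anchor_b_new):
    """Equating constant: mean anchor difficulty on the base scale
    minus mean anchor difficulty on the new scale."""
    return mean(anchor_b_base) - mean(anchor_b_new)


def to_base_scale(b_values, shift):
    """Place new-form difficulties on the base scale by adding the constant."""
    return [b + shift for b in b_values]


# Hypothetical b-values for the same three anchor items in two calibrations.
anchor_base = [-1.0, 0.0, 1.0]
anchor_new = [-0.5, 0.5, 1.5]

shift = mean_b_shift(anchor_base, anchor_new)
rescaled = to_base_scale([0.5, 2.0], shift)
print(shift, rescaled)
```

Because the constant is an average over the anchor items, a shorter anchor set makes it more sensitive to any single item whose relative difficulty drifts between years, which is consistent with the instability the abstract reports.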

19.
The humble multiple-choice test is very widely used within education at all levels, but its susceptibility to guesswork makes it a suboptimal assessment tool. The reliability of a multiple-choice test is partly governed by the number of items it contains; however, longer tests are more time consuming to take, and for some subject areas, it can be very hard to create new test items that are sufficiently distinct from previously used items. A number of more sophisticated multiple-choice test formats have been proposed dating back at least 60 years, many of which offer significantly improved test reliability. This paper offers a new way of comparing these alternative test formats, by modelling each one in terms of the range of possible test taker responses it enables. Looking at the test formats in this way leads to the realisation that the need for guesswork is reduced when test takers are given more freedom to express their beliefs. Indeed, guesswork is eliminated entirely when test takers are able to partially order the answer options within each test item. The paper aims to strengthen the argument for using more sophisticated multiple-choice test formats, especially for high-stakes summative assessment.

20.
Abstract

An attempt was made to extend and clarify prior research which had demonstrated consistently that changed answers to objective test items tend to be correct. Results extended the basic effect of profiting from changed answers to Air Force personnel responding to multiple-choice questions regarding technical skills; the profit from changes was very similar to that observed in a university group responding to relatively "academic" items. Secondly, most individuals in both groups profited from changes. Third, individuals with the highest test scores tended to profit more from changes than those with the lowest test scores. Fourth, neither Airman Qualifying Exam scores (for the military personnel) nor Scholastic Aptitude Test scores (for the university students) were related to profit. Finally, a systematic case against the popular belief that one should not change answers on objective tests was made, based on an integration of the research to date.
