Similar Documents
20 similar documents retrieved.
1.
Formula scoring is a procedure designed to reduce multiple-choice test score irregularities due to guessing. Typically, a formula score is obtained by subtracting a proportion of the number of wrong responses from the number correct. Examinees are instructed to omit items when their answers would be sheer guesses among all choices but otherwise to guess when unsure of an answer. Thus, formula scoring is not intended to discourage guessing when an examinee can rule out one or more of the options within a multiple-choice item. Examinees who, contrary to the instructions, do guess blindly among all choices are not penalized by formula scoring on the average; depending on luck, they may obtain better or worse scores than if they had refrained from this guessing. In contrast, examinees with partial information who refrain from answering tend to obtain lower formula scores than if they had guessed among the remaining choices. (Examinees with misinformation may be exceptions.) Formula scoring is viewed as inappropriate for most classroom testing but may be desirable for speeded tests and for difficult tests with low passing scores. Formula scores do not approximate scores from comparable fill-in-the-blank tests, nor can formula scoring preclude unrealistically high scores for examinees who are very lucky.
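As a concrete illustration of the correction described above, a minimal sketch assuming k response options per item and the conventional 1/(k-1) penalty; the function name and example values are hypothetical:

def formula_score(num_right, num_wrong, options_per_item):
    """Rights minus a fraction of wrongs; omitted items contribute nothing."""
    return num_right - num_wrong / (options_per_item - 1)

# Example: 40 right, 10 wrong, 5 omitted on a 5-option multiple-choice test
print(formula_score(40, 10, 5))  # 40 - 10/4 = 37.5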

2.
Modifications of administration and item arrangement of a conventional test can force a match between item difficulty levels and the ability level of the examinee. Although different examinees take different sets of items, the scoring method provides comparable scores for all. Furthermore, the test is self-scoring. These advantages are obtained without some of the usual disadvantages of tailored testing.

3.
According to item response theory (IRT), examinee ability estimation is independent of the particular set of test items administered from a calibrated pool. Although the most popular application of this feature of IRT is computerized adaptive (CA) testing, a recently proposed alternative is self-adapted (SA) testing, in which examinees choose the difficulty level of each of their test items. This study compared examinee performance under SA and CA tests, finding that examinees taking the SA test (a) obtained significantly higher ability scores and (b) reported significantly lower posttest state anxiety. The results of this study suggest that SA testing is a desirable format for computer-based testing.

4.
Responses to a 40-item test were simulated for 150 examinees under free-response and multiple-choice formats. The simulation was replicated three times for each of 30 variations reflecting format and the extent to which examinees were (a) misinformed, (b) successful in guessing free-response answers, and (c) able to recognize with assurance correct multiple-choice options that they could not produce under free-response testing. Internal consistency reliability (KR20) estimates were consistently higher for the free-response score sets, even when the free-response item difficulty indices were augmented to yield mean scores comparable to those from multiple-choice testing. In addition, all test score sets were correlated with four randomly generated sets of unit-normal measures, whose intercorrelations ranged from moderate to strong. These measures served as criteria because one of them had been used as the basic ability measure in the simulation of the test score sets. Again, the free-response score sets yielded superior results even when tests of equal difficulty were compared. The guessing and recognition factors had little or no effect on reliability estimates or correlations with the criteria. The extent of misinformation affected only multiple-choice score KR20's (more misinformation was associated with higher KR20's). Although free-response tests were found to be generally superior, the extent of their advantage over multiple-choice was judged sufficiently small that other considerations might justifiably dictate format choice.
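A minimal sketch of the KR20 internal-consistency estimate referred to above, computed from a persons-by-items matrix of 0/1 item scores; the simulated data here are illustrative only:

import numpy as np

def kr20(scores):
    """KR20 reliability for a persons-by-items matrix of dichotomous (0/1) scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    p = scores.mean(axis=0)                     # item difficulties (proportion correct)
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of examinee total scores
    return (k / (k - 1)) * (1.0 - (p * (1 - p)).sum() / total_var)

rng = np.random.default_rng(0)
demo = (rng.random((150, 40)) < 0.6).astype(int)  # 150 simulated examinees, 40 items
print(round(kr20(demo), 3))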

5.
This study examines the claim that attempting, or guessing at, more items yields improved formula scores. Two samples of students who had taken a form of the SAT-Verbal consisting of three parallel half-hour sections were used to form the following scores on each of the three sections: the number of attempts, a guessing index, the formula score, and (indirectly) an approximation to an ability score. Correlations were obtained separately for the two samples between the attempts and the guessing index on one section, the formula score on a second section, and ability as measured by the third section. The partial correlations obtained hovered near zero, suggesting, contrary to conventional opinion, that, on average, attempting more items and guessing are not helpful in yielding higher formula scores, and that, therefore, formula scoring is not generally disadvantageous to the student who is less willing to guess and attempt an item that he or she is not sure of. On closer examination, however, it became clear that the advantages of guessing depend, at least in part, on the ability of the examinee. Although the relationship is generally quite weak, it is apparently the case that more able examinees do tend to profit somewhat from guessing, and would therefore be disadvantaged by their reluctance to guess. On the other hand, less able examinees may lower their scores if they guess.
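A sketch of the kind of partial correlation computed above, relating a guessing index on one section to the formula score on a second section with ability (the third section) partialled out; the variable names and simulated data are illustrative:

import numpy as np

def partial_corr(x, y, z):
    """First-order partial correlation between x and y, controlling for z."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

rng = np.random.default_rng(1)
ability = rng.normal(size=500)                              # section-three ability proxy
guessing = rng.normal(size=500)                             # guessing index, section one
formula = 0.8 * ability + rng.normal(scale=0.6, size=500)   # formula score, section two
print(round(partial_corr(guessing, formula, ability), 3))   # hovers near zero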

6.
For the purpose of obtaining data to use in test development, multiple matrix sampling (MMS) plans were compared to examinee sampling plans. Data were simulated for examinees, sampled from a population with a normal distribution of ability, responding to items selected from an item universe. Three item universes were considered: one that would produce a normal distribution of test scores, one a moderately platykurtic distribution, and one a very platykurtic distribution. When comparing sampling plans, total numbers of observations were held constant. No differences were found among plans in estimating item difficulty. Examinee sampling produced better estimates of item discrimination, test reliability, and test validity. As total number of observations increased, estimates improved considerably, especially for those MMS plans with larger subtest sizes. Larger numbers of observations were needed for tests designed to produce a normal distribution of test scores. With an adequate number of observations, MMS is seen as an alternative to examinee sampling in test development.

7.
We evaluated a computer-delivered response type for measuring quantitative skill. "Generating Examples" (GE) presents under-determined problems that can have many right answers. We administered two GE tests that differed in the manipulation of specific item features hypothesized to affect difficulty. Analyses related to internal consistency reliability, external relations, and features contributing to item difficulty, adverse impact, and examinee perceptions. Results showed that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting the two tests might be tapping somewhat different skills. Item features that increased difficulty included asking examinees to supply more than one correct answer and to identify whether an item was solvable. Gender differences were similar to those found on the GRE quantitative and analytical test sections. Finally, examinees were divided on whether GE items were a fairer indicator of ability than multiple-choice items, but still overwhelmingly preferred to take the more conventional questions.

8.
Applied Measurement in Education, 2013, 26(2): 163-183
When low-stakes assessments are administered, the degree to which examinees give their best effort is often unclear, complicating the validity and interpretation of the resulting test scores. This study introduces a new method, based on item response time, for measuring examinee test-taking effort on computer-based test items. This measure, termed response time effort (RTE), is based on the hypothesis that when administered an item, unmotivated examinees will answer too quickly (i.e., before they have time to read and fully consider the item). Psychometric characteristics of RTE scores were empirically investigated and supportive evidence for score reliability and validity was found. Potential applications of RTE scores and their implications are discussed.
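A minimal sketch of a response time effort (RTE) index in the spirit of the measure described above: a response counts as solution behavior when its response time exceeds an item threshold, and RTE is the proportion of an examinee's responses flagged that way. The thresholds and simulated times are assumptions for illustration, not the article's operational rule:

import numpy as np

def response_time_effort(rt_matrix, thresholds):
    """RTE per examinee: proportion of items answered with solution behavior.

    rt_matrix  : persons-by-items array of item response times (seconds)
    thresholds : per-item minimum time treated as genuine solution behavior
    """
    solution_behavior = np.asarray(rt_matrix) >= np.asarray(thresholds)
    return solution_behavior.mean(axis=1)

rng = np.random.default_rng(2)
rt = rng.lognormal(mean=3.0, sigma=0.5, size=(100, 30))  # simulated response times
rt[:10] *= 0.1                                           # ten rapid-guessing examinees
rte = response_time_effort(rt, thresholds=np.full(30, 5.0))
print(round(rte[:10].mean(), 2), round(rte[10:].mean(), 2))  # low effort vs. high effort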

9.
We describe the development and administration of a recently introduced computer-based test of writing skills. This test asks the examinee to edit a writing passage presented on a computer screen. To do this, the examinee moves a cursor to a suspect section of the passage and chooses from a list of alternative ways of rewriting that section. Any or all parts of the passage can be changed, as often as the examinee likes. An able examinee identifies and fixes errors in grammar, organization, and style, whereas a less able examinee may leave errors untouched, replace an error with another error, or even introduce errors where none existed previously. All these response alternatives contrive to present both obvious and subtle scoring difficulties. These difficulties were attacked through the combined use of option weighting and the sequential probability ratio test, the result of which is to classify examinees into several discrete ability groups. Item calibration was enabled by augmenting sparse pretest samples through data meiosis, in which response vectors were randomly recombined to produce offspring that retained much of the character of their parents. These procedures are described, and operational examples are offered.
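A minimal sketch of the sequential probability ratio test used for classifying examinees into ability groups, reduced here to a single master/non-master decision; the success probabilities under the two hypotheses and the error rates are illustrative assumptions:

import math

def sprt_classify(responses, p_master, p_nonmaster, alpha=0.05, beta=0.05):
    """Wald's SPRT over a sequence of dichotomous (0/1) item scores."""
    upper = math.log((1 - beta) / alpha)   # cross above: decide 'master'
    lower = math.log(beta / (1 - alpha))   # cross below: decide 'non-master'
    llr = 0.0
    for x in responses:
        if x:
            llr += math.log(p_master / p_nonmaster)
        else:
            llr += math.log((1 - p_master) / (1 - p_nonmaster))
        if llr >= upper:
            return "master"
        if llr <= lower:
            return "non-master"
    return "undecided"  # more items would be needed

print(sprt_classify([1, 1, 0, 1, 1, 1, 1, 1, 1, 1], p_master=0.8, p_nonmaster=0.5))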

10.
Equatings were performed on both simulated and real data sets using the common-examinee design and two abilities for each examinee (i.e., two dimensions). Item and ability parameter estimates were found by using the Multidimensional Item Response Theory Estimation (MIRTE) program. The amount of equating error was evaluated by a comparison of the mean difference and the mean absolute difference between the true scores and ability estimates found on both tests for the common examinees used in the equating. The results indicated that effective equating, as measured by comparability of true scores, was possible with the techniques used in this study. When the stability of the ability estimates was examined, unsatisfactory results were found.
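A small sketch of the two error summaries used above, the mean difference and mean absolute difference between common examinees' scores on the two forms; the numbers are made up for illustration:

import numpy as np

def equating_error(scores_form_x, scores_form_y):
    """Mean difference and mean absolute difference across common examinees."""
    diff = np.asarray(scores_form_x) - np.asarray(scores_form_y)
    return diff.mean(), np.abs(diff).mean()

form_x = np.array([23.1, 30.4, 18.9, 27.2])
form_y = np.array([22.8, 31.0, 19.5, 26.9])
mean_diff, mean_abs_diff = equating_error(form_x, form_y)
print(round(mean_diff, 2), round(mean_abs_diff, 2))  # -0.15 0.45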

11.
This study attempted to pinpoint the causes of differential item difficulty for blind students taking the braille edition of the Scholastic Aptitude Test's Mathematical section (SAT-M). The study method involved reviewing the literature to identify factors that might cause differential item functioning for these examinees, forming item categories based on these factors, identifying categories that functioned differentially, and assessing the functioning of the items comprising deviant categories to determine if the differential effect was pervasive. Results showed an association between selected item categories and differential functioning, particularly for items that included figures in the stimulus, items for which spatial estimation was helpful in eliminating at least two of the options, and items that presented figures that were small or medium in size. The precise meaning of this association was unclear, however, because some items from the suspected categories functioned normally, factors other than the hypothesized ones might have caused the observed aberrant item behavior, and the differential difficulty might reflect real population differences in relevant content knowledge.

12.
The trustworthiness of low-stakes assessment results largely depends on examinee effort, which can be measured by the amount of time examinees devote to items using solution behavior (SB) indices. Because SB indices are calculated for each item, they can be used to understand how examinee motivation changes across items within a test. Latent class analysis (LCA) was used with the SB indices from three low-stakes assessments to explore patterns of solution behavior across items. Across tests, the favored models consisted of two classes, with Class 1 characterized by high and consistent solution behavior (>90% of examinees) and Class 2 by lower and less consistent solution behavior (<10% of examinees). Additional analyses provided supportive validity evidence for the two-class solution with notable differences between classes in self-reported effort, test scores, gender composition, and testing context. Although results were generally similar across the three assessments, striking differences were found in the nature of the solution behavior pattern for Class 2 and the ability of item characteristics to explain the pattern. The variability in the results suggests motivational changes across items may be unique to aspects of the testing situation (e.g., content of the assessment) for less motivated examinees.
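A minimal sketch of a two-class latent class analysis over dichotomous solution-behavior (SB) flags, fitted with a plain EM algorithm; the starting values, class labels, and simulated data are illustrative assumptions, not the model specification used in the study:

import numpy as np

def two_class_lca(sb, n_iter=200):
    """EM for a two-class Bernoulli mixture over persons-by-items 0/1 SB flags."""
    sb = np.asarray(sb, dtype=float)
    k = sb.shape[1]
    prior = np.array([0.9, 0.1])                  # starting class proportions
    prob = np.array([[0.95] * k, [0.5] * k])      # starting per-item SB probabilities
    for _ in range(n_iter):
        # E-step: posterior class membership for each examinee
        log_lik = (sb[:, None, :] * np.log(prob) +
                   (1 - sb[:, None, :]) * np.log(1 - prob)).sum(axis=2)
        post = prior * np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update class proportions and per-item SB profiles
        prior = post.mean(axis=0)
        prob = ((post.T @ sb) / post.sum(axis=0)[:, None]).clip(1e-4, 1 - 1e-4)
    return prior, prob, post

rng = np.random.default_rng(3)
effortful = (rng.random((90, 20)) < 0.97).astype(int)   # consistent solution behavior
careless = (rng.random((10, 20)) < 0.55).astype(int)    # inconsistent solution behavior
sizes, profiles, posteriors = two_class_lca(np.vstack([effortful, careless]))
print(np.round(sizes, 2))  # roughly [0.9, 0.1]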

13.
Using a computer-based model of an item trace line, a random sampling experiment concerned with comparing item sample estimates to traditional (examinee) sample estimates of the mean and variance of a distribution of test scores was conducted. The results indicated that the optimal method for estimating a test's parameters may depend on several conditions. As expected, item sampling proved superior to traditional sampling in estimating test means under all conditions. However, with certain test lengths, ranges of item difficulty, and discrimination, traditional sampling provided better estimates of test variance than did item sampling.

14.
Recent simulation studies indicate that there are occasions when examinees can use judgments of relative item difficulty to obtain positively biased proficiency estimates on computerized adaptive tests (CATs) that permit item review and answer change. Our purpose in the study reported here was to evaluate examinees' success in using these strategies while taking CATs in a live testing setting. We taught examinees two item difficulty judgment strategies designed to increase proficiency estimates. Examinees who were taught each strategy and examinees who were taught neither strategy were assigned at random to complete vocabulary CATs under conditions in which review was allowed after completing all items and when review was allowed only within successive blocks of items. We found that proficiency estimate changes following review were significantly higher in the regular review conditions than in the strategy conditions. Failure to obtain systematically higher scores in the strategy conditions was due in large part to errors examinees made in judging the relative difficulty of CAT items.

15.
Performance assessments, scenario-based tasks, and other groups of items carry a risk of violating the local item independence assumption made by unidimensional item response theory (IRT) models. Previous studies have identified negative impacts of ignoring such violations, most notably inflated reliability estimates. Still, the influence of this violation on examinee ability estimates has been comparatively neglected. It is known that such item dependencies cause low-ability examinees to have their scores overestimated and high-ability examinees' scores underestimated. However, the impact of these biases on examinee classification decisions has been little examined. In addition, because the influence of these dependencies varies along the underlying ability continuum, whether or not the location of the cut-point is important in regard to correct classifications remains unanswered. This simulation study demonstrates that the strength of item dependencies and the location of an examination system's cut-points both influence the accuracy (i.e., the sensitivity and specificity) of examinee classifications. Practical implications of these results are discussed in terms of false positive and false negative classifications of test takers.
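A small sketch of the accuracy summaries mentioned above: sensitivity and specificity of cut-point classifications based on estimated ability, judged against the true simulated ability. The cut-point, error scale, and names are illustrative assumptions:

import numpy as np

def classification_accuracy(true_theta, est_theta, cut_point):
    """Sensitivity and specificity of pass/fail decisions made on estimated ability."""
    truly_above = np.asarray(true_theta) >= cut_point
    classified_above = np.asarray(est_theta) >= cut_point
    sensitivity = (truly_above & classified_above).sum() / truly_above.sum()
    specificity = (~truly_above & ~classified_above).sum() / (~truly_above).sum()
    return sensitivity, specificity

rng = np.random.default_rng(4)
theta = rng.normal(size=1000)                          # true ability
theta_hat = theta + rng.normal(scale=0.3, size=1000)   # estimate with error
print([round(v, 3) for v in classification_accuracy(theta, theta_hat, cut_point=0.0)])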

16.
Effects of Item Wording on Sex Bias
This study examined the effects of gender-related item-wording changes on the performance of male and female examinees. Mathematics word problems and English language items were created in neuter, male, and female versions. Items were administered to randomly equivalent samples of about 300 high school juniors and seniors. Loglinear analysis was used to assess the impact of item gender and its interaction with examinee sex on the difficulty and discrimination of each item in each context. No items were found to have sex bias in either context. Mathematics items did not have different difficulty or discrimination in the three gender versions. Neither mathematics nor English items had different discrimination levels in the three gender-related versions. Some English items, however, were found to have different difficulty levels in the three gender-related versions. These difficulty differences were not systematic: none of the three gender versions appeared consistently more or less difficult than the others.

17.
Applied Measurement in Education, 2013, 26(4): 331-345
In order to obtain objective measurement for examinations that are graded by judges, an extension of the Rasch model designed to analyze examinations with more than two facets (items/examinees) is used. This extended Rasch model calibrates the elements of each facet of the examination (i.e., examinee performances, items, and judges) on a common log-linear scale. A network for assigning judges to examinations is used to link all facets. Real examination data from the "clinical assessment" part of a certification examination are used to illustrate the application. A range of item difficulties and judge severities were found. Comparison of examinee raw scores with objective linear measures corrected for variations in judge severity shows that judge severity can have a substantial impact on a raw score. Correcting for judge severity improves the fairness of examinee measures and of the subsequent pass-fail decisions because the uncorrected raw scores favor examinee performances graded by lenient judges.
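In the many-facet Rasch formulation sketched above, the log-odds of examinee n being rated in category k rather than k-1 by judge j on item i decompose additively across the facets; a standard way to write this (the notation here is illustrative) is

\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,

where B_n is the examinee's ability, D_i the item's difficulty, C_j the judge's severity, and F_k the difficulty of rating-scale step k, all expressed in logits on the common scale.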

18.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
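A minimal sketch of an effort-moderated item response function in the spirit described above: the standard 3PL applies when the examinee exhibits solution behavior on an item, while rapid guessing is modeled at the chance rate. The parameter values and number of options are assumptions for illustration:

import numpy as np

def three_pl(theta, a, b, c):
    """Standard three-parameter logistic item response function."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def effort_moderated_p(theta, a, b, c, solution_behavior, n_options=4):
    """P(correct): 3PL under solution behavior, chance level under rapid guessing."""
    return np.where(solution_behavior, three_pl(theta, a, b, c), 1.0 / n_options)

theta = np.array([-1.0, 0.0, 1.0])
sb = np.array([True, False, True])   # middle examinee rapid-guessed this item
print(np.round(effort_moderated_p(theta, a=1.2, b=0.0, c=0.2, solution_behavior=sb), 3))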

19.
The hypothesis that some students, when tested under formula directions, omit items about which they have useful partial knowledge implies that such directions are not as fair as rights directions, especially to those students who are less inclined to guess. This hypothesis may be called the differential effects hypothesis. An alternative hypothesis states that examinees would perform no better than chance expectation on items that they would omit under formula directions but would answer under rights directions. This may be called the invariance hypothesis. Experimental data on this question were obtained by conducting special test administrations of College Board SAT-verbal and Chemistry tests and by including experimental tests in a Graduate Management Admission Test administration. The data provide a basis for evaluating the two hypotheses and for assessing the effects of directions on the reliability and parallelism of scores for sophisticated examinees taking professionally developed tests. Results support the invariance hypothesis rather than the differential effects hypothesis.

20.
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee-selected items. The results of a series of simulation studies showed: (1) that the parameters of the new models were recovered well, (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models, (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee-selected items, and (4) even when the missingness mechanism in examinee-selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.
