Similar Articles
1.
Under an answer-until-correct scoring procedure, many measurement problems can be solved when certain cognitive models of examinee behavior can be assumed (Wilcox, 1983). Point estimates of true score under these models are available, but the problem of obtaining a confidence interval has never been addressed. Two simple methods for obtaining a confidence interval are suggested that give good results when the sample size is reasonably large (say, greater than or equal to 20) and true score is not too close to zero or one. A third procedure is suggested that gives slightly better results under the same conditions. For small sample sizes, or situations where true score is close to zero or one, a fourth procedure is described that always gives conservative results.
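To make the large-sample case concrete, here is a hedged sketch of one generic interval of the kind the abstract describes: a Wilson score interval for a proportion-like true-score estimate. It is a textbook method chosen for illustration, not one of the paper's four procedures; the n >= 20 guideline is taken from the abstract.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a proportion.

    Works reasonably well for n >= 20 when the true proportion is
    not too close to 0 or 1, mirroring the abstract's conditions.
    """
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. an examinee answering 14 of 20 items correctly
print(wilson_interval(14, 20))
```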

2.
Guessing correct answers to test items is a statistical concept that has a direct impact on the interpretation of test scores. Many published tests, however, do not account for guessing. This is an important issue in view of recent federal legislation in the United States, and global attention, mandating the identification of at-risk children for educational services. Children may score within a normal range by chance alone, resulting in test scores that are not sensitive. The purpose of this paper, therefore, is: (a) to describe one process, random guessing, for estimating a “true blind guessing score” (range of scores) that, if known, would result in missing fewer at-risk children; and (b) to sensitize test administrators to tests that do not address guessing or that may have suspicious corrections for it.
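The “true blind guessing score” idea can be sketched directly: under pure random guessing on k-option items, the number-correct score follows a Binomial(n, 1/k) distribution. The test length, option count, and percentile range below are illustrative assumptions, not the paper's cutoffs.

```python
from scipy.stats import binom

n_items, k = 40, 4                 # hypothetical test: 40 items, 4 options each
chance = binom(n_items, 1 / k)

# Range of scores a blind guesser would produce 95% of the time.
low, high = chance.ppf(0.025), chance.ppf(0.975)
print(f"blind-guessing score range: {low:.0f}-{high:.0f} out of {n_items}")
print(f"P(blind guesser scores >= 15): {chance.sf(14):.3f}")
```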

3.
Formula scoring is a procedure designed to reduce multiple-choice test score irregularities due to guessing. Typically, a formula score is obtained by subtracting a proportion of the number of wrong responses from the number correct. Examinees are instructed to omit items when their answers would be sheer guesses among all choices but otherwise to guess when unsure of an answer. Thus, formula scoring is not intended to discourage guessing when an examinee can rule out one or more of the options within a multiple-choice item. Examinees who, contrary to the instructions, do guess blindly among all choices are not penalized by formula scoring on the average; depending on luck, they may obtain better or worse scores than if they had refrained from this guessing. In contrast, examinees with partial information who refrain from answering tend to obtain lower formula scores than if they had guessed among the remaining choices. (Examinees with misinformation may be exceptions.) Formula scoring is viewed as inappropriate for most classroom testing but may be desirable for speeded tests and for difficult tests with low passing scores. Formula scores do not approximate scores from comparable fill-in-the-blank tests, nor can formula scoring preclude unrealistically high scores for examinees who are very lucky.
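A small sketch of the arithmetic behind these claims, with illustrative variable names: the conventional formula score, and the expected effect of blind guessing on a k-option item, which is zero on average, exactly as the abstract states.

```python
def formula_score(right: int, wrong: int, k: int) -> float:
    """Number right minus a fraction of the number wrong (omits score zero)."""
    return right - wrong / (k - 1)

print(formula_score(right=30, wrong=10, k=5))   # 27.5

# Expected gain from blindly guessing one k-option item:
# +1 with probability 1/k, -1/(k-1) with probability (k-1)/k.
k = 5
expected_gain = (1 / k) * 1 + ((k - 1) / k) * (-1 / (k - 1))
print(expected_gain)   # 0.0 -> blind guessing is score-neutral on average
```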

4.
Latent class models of decision-making processes related to multiple-choice test items are extremely important and useful in mental test theory. However, building realistic models or studying the robustness of existing models is very difficult. One problem is that there are a limited number of empirical studies that address this issue. The purpose of this paper is to describe and illustrate how latent class models, in conjunction with the answer-until-correct format, can be used to examine the strategies used by examinees for a specific type of task. In particular, suppose an examinee responds to a multiple-choice test item designed to measure spatial ability, and the examinee gets the item wrong. This paper empirically investigates various latent class models of the strategies that might be used to arrive at an incorrect response. The simplest model is a random guessing model, but the results reported here strongly suggest that this model is unsatisfactory. Models for the second attempt at an item, under an answer-until-correct scoring procedure, are proposed and found to give a good fit to data in most situations. Some results on strategies used to arrive at the first choice are also discussed.
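One way to see how such a random-guessing model can be rejected is a simple goodness-of-fit test: after a wrong first choice on a k-option item, pure guessing implies the second attempt is uniform over the k - 1 remaining options. This sketch uses made-up counts and a plain chi-square test, not the paper's latent class machinery.

```python
from scipy.stats import chisquare

k = 4
second_choice_counts = [48, 30, 22]        # hypothetical counts over the 3 remaining options
total = sum(second_choice_counts)
expected = [total / (k - 1)] * (k - 1)     # uniform under random guessing

stat, p = chisquare(second_choice_counts, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")   # a small p rejects random guessing
```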

5.
The standard error of measurement usefully provides confidence limits for scores in a given test, but is it possible to quantify the reliability of a test with just a single number that allows comparison of tests of different formats? Reliability coefficients do not do this, being dependent on the spread of examinee attainment. Better in this regard is a measure produced by dividing the standard error of measurement by the test's ‘reliability length’, the latter defined as the maximum possible score minus the most probable score obtainable by blind guessing alone. This, however, can be unsatisfactory with negative marking (formula scoring), as shown by data on 13 negatively marked true/false tests. In these the examinees displayed considerable misinformation, which correlated negatively with correct knowledge. Negative marking can improve test reliability by penalizing such misinformation as well as by discouraging guessing. Reliability measures can be based on idealized theoretical models instead of on test data. These do not reflect the qualities of the test items, but they can be focused on specific test objectives (e.g. in relation to cut-off scores) and can be expressed as easily communicated statements even before tests are written.
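The proposed measure is easy to compute. In this sketch the most probable blind-guessing score is taken as n/k (the approximate mode of a Binomial(n, 1/k) score distribution); the numbers are hypothetical, not from the paper's 13 tests.

```python
def relative_sem(sem: float, max_score: float, n_items: int, k: int) -> float:
    """Standard error of measurement divided by 'reliability length'."""
    most_probable_guess_score = n_items / k        # approx. mode of Binomial(n, 1/k)
    reliability_length = max_score - most_probable_guess_score
    return sem / reliability_length

# 50 true/false items (k = 2) with an SEM of 3 raw-score points:
print(relative_sem(sem=3.0, max_score=50, n_items=50, k=2))   # 0.12
```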

6.
An IRT-based sequential procedure is developed to monitor items for enhancing test security. The procedure uses a series of statistical hypothesis tests to examine whether the statistical characteristics of each item under inspection have changed significantly during CAT administration. This procedure is compared with a previously developed CTT-based procedure through simulation studies. The results show that when the total number of examinees is fixed, both procedures can control the rate of type I errors at any reasonable significance level by choosing an appropriate cutoff point while maintaining a low rate of type II errors. Further, the IRT-based method has a much lower type II error rate (i.e., more power) than the CTT-based method when the number of compromised items is small (e.g., 5), which can be achieved if the IRT-based procedure is applied in an active mode, in the sense that flagged items can be replaced with new items.
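The flavor of such monitoring can be sketched with a much simpler statistic than the paper's: per batch of CAT administrations, compare an item's observed number correct with the count expected from its calibrated IRT parameters, flagging it when the standardized difference exceeds a cutoff. Parameters, the cutoff, and the compromise scenario are all simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 1.2, 0.0, 0.2                       # calibrated 3PL parameters (assumed)
cutoff = 3.0                                  # flagging threshold (an assumption)

def p3pl(theta):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

for batch in range(1, 6):
    theta = rng.normal(size=200)              # examinees seeing the item in this batch
    p = p3pl(theta)
    # Simulate compromise from batch 3 on: most examinees simply answer correctly.
    observed = rng.random(200) < (np.maximum(p, 0.9) if batch >= 3 else p)
    z = (observed.sum() - p.sum()) / np.sqrt((p * (1 - p)).sum())
    print(f"batch {batch}: z = {z:5.2f}", "FLAG" if z > cutoff else "")
```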

7.
Previous research has shown that rapid-guessing behavior can degrade the validity of test scores from low-stakes proficiency tests. This study examined, using hierarchical generalized linear modeling, examinee and item characteristics for predicting rapid-guessing behavior. Several item characteristics were found significant; items with more text or those occurring later in the test were related to increased rapid guessing, while the inclusion of a graphic in an item was related to decreased rapid guessing. The sole significant examinee predictor was SAT total score. Implications of these results for measurement professionals developing low-stakes tests are discussed.
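The prediction setup can be sketched with a plain logistic regression rather than the paper's hierarchical model; the item features and effect sizes below are simulated to mirror the reported directions, not taken from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 5000
text_len = rng.uniform(20, 300, n)          # words of item text
position = rng.integers(1, 61, n)           # position within a 60-item test
has_graphic = rng.integers(0, 2, n)

# Simulated truth: more text and later position increase rapid guessing,
# a graphic decreases it (matching the abstract's directions).
logit = -4 + 0.008 * text_len + 0.03 * position - 0.8 * has_graphic
rapid = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([text_len, position, has_graphic])
model = LogisticRegression(max_iter=1000).fit(X, rapid)
print(model.coef_)                          # signs recover the simulated effects
```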

8.
This study examines the claim that attempting, or guessing at, more items yields improved formula scores. Two samples of students who had taken a form of the SAT-Verbal consisting of three parallel half-hour sections were used to form the following scores on each of the three sections: the number of attempts, a guessing index, the formula score, and (indirectly) an approximation to an ability score. Correlations were obtained separately for the two samples between the attempts (and the guessing index) on one section, the formula score on a second section, and ability as measured by the third section. The partial correlations obtained hovered near zero, suggesting, contrary to conventional opinion, that, on average, attempting more items and guessing are not helpful in yielding higher formula scores, and that, therefore, formula scoring is not generally disadvantageous to the student who is less willing to guess and attempt an item that he or she is not sure of. On closer examination, however, it became clear that the advantages of guessing depend, at least in part, on the ability of the examinee. Although the relationship is generally quite weak, it is apparently the case that more able examinees do tend to profit somewhat from guessing, and would therefore be disadvantaged by their reluctance to guess. On the other hand, less able examinees may lower their scores if they guess.
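The study's key statistic, a partial correlation controlling for ability measured on a separate section, can be sketched as follows; the data are simulated so that guessing has no true effect, which reproduces the near-zero result.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
ability = rng.normal(size=n)                      # ability from the third section
guessing = rng.normal(size=n)                     # guessing index from section one
# Simulated formula score on section two: driven by ability, not guessing.
formula = 0.8 * ability + rng.normal(scale=0.6, size=n)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

print(partial_corr(guessing, formula, ability))   # hovers near zero
```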

9.
10.
Examiners seeking guidance on multiple-choice and true/false tests are likely to encounter various faulty or questionable ideas. Twelve of these are discussed in detail, having to do mainly with the effects on test reliability of test length, guessing and scoring method (i.e. number-right scoring or negative marking). Some misunderstandings could be based on evidence from tests that were badly written or administered, while others may have arisen through the misinterpretation of reliability coefficients. The usefulness of item response theory in the analysis of academic test items is briefly dismissed.

11.
The attribute hierarchy method (AHM) is a psychometric procedure for classifying examinees' test item responses into a set of structured attribute patterns associated with different components from a cognitive model of task performance. Results from an AHM analysis yield information on examinees' cognitive strengths and weaknesses. Hence, the AHM can be used for cognitive diagnostic assessment. The purpose of this study is to introduce and evaluate a new concept for assessing attribute reliability using the ratio of true score variance to observed score variance on items that probe specific cognitive attributes. This reliability procedure is evaluated and illustrated using both simulated data and student response data from a sample of algebra items taken from the March 2005 administration of the SAT. The reliability of diagnostic scores and the implications for practice are also discussed.
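Since attribute reliability is defined as the ratio of true to observed score variance on the items probing one attribute, a familiar stand-in estimator for that ratio is Cronbach's alpha computed on that item subset. The sketch below uses simulated responses; the paper's own estimator may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(2)
true_attr = rng.normal(size=500)            # examinees' level on one attribute
# Six dichotomous items assumed to probe that attribute:
X = ((true_attr[:, None] + rng.normal(size=(500, 6))) > 0).astype(float)

k = X.shape[1]
item_vars = X.var(axis=0, ddof=1).sum()
total_var = X.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_vars / total_var)
print(f"attribute reliability (alpha): {alpha:.3f}")
```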

12.
Speededness refers to the extent to which time limits affect examinees' test performance, and it is often measured by calculating the proportion of examinees who do not reach a certain percentage of test items. However, when tests are number-right scored (i.e., no points are subtracted for incorrect responses), examinees are likely to rapidly guess on items rather than leave them blank. Therefore, this traditional measure of speededness probably underestimates the true amount of speededness on such tests. A more accurate assessment of speededness should also reflect the tendency of examinees to rapidly guess on items as time expires. This rapid-guessing component of speededness can be estimated by modeling response times with a two-state mixture model, as demonstrated with data from a computer-administered reasoning test. Taking into account the combined effect of unreached items and rapid guessing provides a more complete measure of speededness than has previously been available.
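A hedged sketch of the two-state idea: fit a two-component mixture to log response times and treat the faster component as the rapid-guessing state. The simulated times and the use of a Gaussian mixture on the log scale are assumptions; the paper specifies its own mixture formulation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
solution = rng.lognormal(mean=3.5, sigma=0.5, size=900)   # ~33 s typical responses
rapid = rng.lognormal(mean=0.7, sigma=0.4, size=100)      # ~2 s rapid guesses
log_rt = np.log(np.concatenate([solution, rapid]))[:, None]

gm = GaussianMixture(n_components=2, random_state=0).fit(log_rt)
rapid_comp = int(np.argmin(gm.means_.ravel()))            # faster component = rapid state
rapid_share = (gm.predict(log_rt) == rapid_comp).mean()
print(f"estimated rapid-guessing share: {rapid_share:.1%}")
```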

13.
The identification of rapid guessing is important to promote the validity of achievement test scores, particularly with low-stakes tests. Effective methods for identifying rapid guesses require reliable threshold methods that are also aligned with test taker behavior. Although several common threshold methods are based on rapid-guessing response accuracy or visual inspection of response time distributions, this paper describes a new information-based approach to setting thresholds that does not share the limitations of other methods. A pair of information-based methods is introduced, and an empirical comparison study found that the new methods set thresholds more reliably than methods based on response accuracy or visual inspection.

14.
The indices of item difficulty and discrimination, the coefficients of effective length, and the average item information for both single- and multiple-answer items were computed and compared under six different scoring formulas. These formulas vary in the assignment of partial credit and in the correction for guessing. Results show that items with multiple answers are substantially more discriminating and reliable when partial credit is given. The formulas without correction for guessing perform at least as well as the formulas with correction.

15.
Even though guessing biases difficulty estimates as a function of item difficulty in the dichotomous Rasch model, assessment programs with tests which include multiple-choice items often construct scales using this model. Research has shown that when all items are multiple-choice, this bias can largely be eliminated. However, many assessments have a combination of multiple-choice and constructed response items. Using vertically scaled numeracy assessments from a large-scale assessment program, this article shows that eliminating the bias on estimates of the multiple-choice items also impacts on the difficulty estimates of the constructed response items. This implies that the original estimates of the constructed response items were biased by the guessing on the multiple-choice items. This bias has implications both for defining difficulties in item banks for use in adaptive testing composed of both multiple-choice and constructed response items, and for the construction of proficiency scales.
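The mechanism of this bias can be illustrated by generating responses with a guessing floor and recovering difficulty with a naive no-guessing logit inversion; the estimator here is a deliberate simplification for illustration, not an actual Rasch calibration.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.normal(size=5000)
c = 0.25                                                # guessing floor for 4-option items
for b_true in (-2.0, 0.0, 2.0):
    p = c + (1 - c) / (1 + np.exp(-(theta - b_true)))   # responses include guessing
    x = rng.random(5000) < p
    p_obs = x.mean()
    b_hat = -np.log(p_obs / (1 - p_obs))                # naive logit, assumes no guessing
    print(f"true b = {b_true:+.1f}, naive estimate = {b_hat:+.2f}")
# Hard items look much easier than they are, so the bias varies with difficulty.
```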

16.
Nonparametric item response theory models include the monotone homogeneity model and the double monotonicity model. Applying the monotone homogeneity model to the results of an English listening test showed that, using a sequential selection procedure, 11 of the 16 listening items satisfied the model's requirements and formed a unidimensional scale. Ranking examinees by their total score on these 11 items is equivalent to ranking them by the latent trait. A differential item functioning study of the 11-item unidimensional scale under the double monotonicity model found that five items were ordered differently in the female subgroup than in the male subgroup and in the group as a whole, with female examinees showing a markedly higher probability of a correct response than male examinees. This difference is at least partly attributable to differences in listening ability between the two subgroups.
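Item selection under the monotone homogeneity model is commonly guided by a scalability statistic; the sketch below computes Loevinger's H from Guttman errors on simulated 0/1 listening-item responses. This is a standard Mokken-scaling quantity offered for illustration, not the exact procedure of the study.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = rng.normal(size=400)
difficulties = np.linspace(-1.5, 1.5, 6)
P = 1 / (1 + np.exp(-(theta[:, None] - difficulties)))
X = (rng.random((400, 6)) < P).astype(int)

p = X.mean(axis=0)
obs_err = exp_err = 0.0
for i in range(X.shape[1]):
    for j in range(X.shape[1]):
        if p[i] > p[j]:  # item i is easier than item j
            # Guttman error: correct on the harder item, wrong on the easier one.
            obs_err += np.sum((X[:, j] == 1) & (X[:, i] == 0))
            exp_err += len(X) * p[j] * (1 - p[i])
print(f"Loevinger's H: {1 - obs_err / exp_err:.3f}")   # > 0.3 suggests a usable scale
```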

17.
18.
This study examined whether practice testing with short-answer (SA) items benefits learning over time compared to practice testing with multiple-choice (MC) items and rereading the material. More specifically, the aim was to test the hypotheses of retrieval effort and transfer-appropriate processing by comparing retention tests with respect to practice testing format. To adequately compare SA and MC items, the MC items were corrected for random guessing. With a within-group design, 54 students (mean age = 16 years) first read a short text and then took four practice tests containing all three formats (SA, MC, and statements to read), with feedback provided after each part. The results showed that both MC and SA formats improved short- and long-term memory compared to rereading. More importantly, practice testing with SA items is more beneficial for learning and long-term retention, providing support for the retrieval effort hypothesis. The use of corrections for guessing and educational implications are discussed.

19.
Scoring multiple-choice questions according to the simple scoring system S1 = R, where R is the number of correct answers, produces an upward bias in the scores of poorer students as a result of guessing. The scoring formula conventionally used to adjust for guessing is S2 = R − W/(n − 1), where W is the number of wrong answers and n is the number of choices per question. However, S2 is based on the unrealistic assumption that on each question the student either knows the correct answer or guesses randomly. On the basis of a more realistic assumption an alternative scoring formula is derived, S4 = [nR + (n − 1)Q − Q²/R] / [2(n − 1)], where Q is the number of questions. Compared to S4, the conventional formula (S2) has a downward bias for Q/n < R < Q, and the simple formula (S1) has a downward bias for Q/(n − 2) < R < Q in addition to its upward bias for R < Q/(n − 2).
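The three formulas can be compared directly; with no omitted questions (W = Q − R), the crossover points named in the abstract, Q/n and Q/(n − 2), appear exactly where S2 and S1 cross S4.

```python
def s1(R, W, Q, n): return R
def s2(R, W, Q, n): return R - W / (n - 1)
def s4(R, W, Q, n): return (n * R + (n - 1) * Q - Q**2 / R) / (2 * (n - 1))

Q, n = 40, 4
print(" R    S1     S2     S4")
for R in (10, 20, 30, 40):          # Q/n = 10, Q/(n-2) = 20
    W = Q - R                       # assume no omitted questions
    print(f"{R:2d} {s1(R, W, Q, n):5.1f} {s2(R, W, Q, n):6.2f} {s4(R, W, Q, n):6.2f}")
```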

20.
Item response theory scalings were conducted for six tests with mixed item formats. These tests differed in their proportions of constructed response (c.r.) and multiple choice (m.c.) items and in overall difficulty. The scalings included those based on scores for the c.r. items that maintained the same number of levels as the item rubrics, either produced from single ratings or from multiple ratings that were averaged and rounded to the nearest integer, as well as scalings for a single form of c.r. items obtained by summing multiple ratings. A one-parameter (1PPC) or two-parameter (2PPC) partial credit model was used for the c.r. items and the one-parameter logistic (1PL) or three-parameter logistic (3PL) model for the m.c. items. Item fit was substantially worse with the combination 1PL/1PPC model than the 3PL/2PPC model due to the former's restrictive assumptions that there would be no guessing on the m.c. items and equal item discrimination across items and item types. The presence of varying item discriminations resulted in the 1PL/1PPC model producing estimates of item information that could be spuriously inflated for c.r. items that had three or more score levels. Information for some items with summed ratings was usually overestimated by 300% or more under the 1PL/1PPC model. These inflated information values resulted in underestimated standard errors of ability estimates. The constraints posed by the restricted model suggest limitations on the testing contexts in which the 1PL/1PPC model can be accurately applied.
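The fit problem traced to guessing and unequal discrimination can be made concrete with the 3PL item information function, which collapses to the 1PL form only when a = 1 and c = 0; the parameter values below are illustrative.

```python
import numpy as np

def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

theta = np.linspace(-3, 3, 7)
print(np.round(info_3pl(theta, a=1.5, b=0.0, c=0.2), 3))   # guessing depresses low-theta info
print(np.round(info_3pl(theta, a=1.5, b=0.0, c=0.0), 3))   # the same item without guessing
```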
