Similar literature: 20 similar documents found.
1.
Pass rates are key assessment statistics that are calculated for nearly all high-stakes examinations. In this article, we define the terminal, first attempt, total attempts, and repeat attempts pass rates, and discuss the uses of each statistic. We also explain why in many situations one should expect the terminal pass rate to be the highest, first attempt pass rate to be the second highest, total attempts pass rate to be the third highest, and repeat attempts pass rate to be the lowest when repeat attempts are allowed. Analyses of data from 14 credentialing programs showed that the expected relationship held for 13 of the 14 programs. Additional analyses of pass rates for educational programs in radiography in one state showed that the general relationship held at the state level, but only held for 6 out of 34 educational programs. It is suggested that credentialing programs need to clearly state their pass rate definitions and carefully consider how repeat examinees may influence pass rate statistics. It is also suggested that credentialing programs need to think carefully about the meaning and uses of different pass rate statistics when choosing which pass rates to report to stakeholders.
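As an illustration of the four definitions above, here is a minimal Python sketch that computes all four pass rates from attempt-level records; the (candidate_id, attempt_number, passed) record structure is an assumption made for the example, not a format given in the article.

```python
from collections import defaultdict

def pass_rates(attempts):
    """attempts: list of (candidate_id, attempt_number, passed) tuples."""
    by_candidate = defaultdict(list)
    for cand, num, passed in attempts:
        by_candidate[cand].append((num, passed))

    first = []      # outcome of each candidate's first attempt
    terminal = []   # whether each candidate ever passed
    repeats = []    # outcomes of second-and-later attempts
    total = []      # outcomes of all attempts
    for cand, recs in by_candidate.items():
        recs.sort()
        first.append(recs[0][1])
        terminal.append(any(p for _, p in recs))
        repeats.extend(p for n, p in recs if n > 1)
        total.extend(p for _, p in recs)

    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return {
        "terminal": rate(terminal),        # candidates who eventually pass
        "first_attempt": rate(first),      # first attempts only
        "total_attempts": rate(total),     # all attempts pooled
        "repeat_attempts": rate(repeats),  # second and later attempts only
    }
```

Because repeat attempts pool the candidates who failed earlier, the repeat-attempts rate typically comes out lowest and the terminal rate highest, matching the ordering the article describes.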

2.
When tests are designed to measure dimensionally complex material, DIF analysis with matching based on the total test score may be inappropriate. Previous research has demonstrated that matching can be improved by using multiple internal or both internal and external measures to more completely account for the latent ability space. The present article extends this line of research by examining the potential to improve matching by conditioning simultaneously on test score and a categorical variable representing the educational background of the examinees. The responses of male and female examinees from a test of medical competence were analyzed using a logistic regression procedure. Results show a substantial reduction in the number of items identified as displaying significant DIF when conditioning is based on total test score and a variable representing educational background as opposed to total test score only.
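A hedged sketch of the kind of logistic regression DIF comparison described above, using the statsmodels formula interface; the column names (correct, total_score, edu_background, gender) are placeholders, and the exact model specification in the article may differ.

```python
import statsmodels.formula.api as smf

def dif_likelihood_ratio(df, condition_on_education=True):
    """Compare nested logistic models for one item: does the group variable
    (gender) add predictive power beyond the matching variables?"""
    base = "correct ~ total_score"
    if condition_on_education:
        base += " + C(edu_background)"   # extra conditioning variable
    reduced = smf.logit(base, data=df).fit(disp=0)
    full = smf.logit(base + " + C(gender)", data=df).fit(disp=0)
    # Likelihood-ratio chi-square for the gender (uniform DIF) term
    return 2 * (full.llf - reduced.llf)
```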

3.
The development of statistical methods for detecting test collusion is a new research direction in the area of test security. Test collusion may be described as large-scale sharing of test materials, including answers to test items. Current methods of detecting test collusion are based on statistics also used in answer-copying detection. Therefore, in computerized adaptive testing (CAT) these methods lose power because the actual test varies across examinees. This article addresses that problem by introducing a new approach that works in two stages: in Stage 1, test centers with an unusual distribution of a person-fit statistic are identified via Kullback-Leibler divergence; in Stage 2, examinees from identified test centers are analyzed further using the person-fit statistic, where the critical value is computed without data from the identified test centers. The approach is extremely flexible. One can employ any existing person-fit statistic. The approach can be applied to all major testing programs: paper-and-pencil testing (P&P), computer-based testing (CBT), multiple-stage testing (MST), and CAT. Also, the definition of a test center is not limited to a geographic location (room, class, college) and can be extended to support various relations between examinees (from the same undergraduate college, from the same test-prep center, from the same group on a social network). The suggested approach was found to be effective in CAT for detecting groups of examinees with item pre-knowledge, meaning those with access (possibly unknown to us) to one or more subsets of items prior to the exam.
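A minimal sketch of the Stage 1 screening idea, assuming a histogram approximation of the person-fit distribution; the divergence threshold and binning here are placeholders, not the article's derivation of critical values.

```python
import numpy as np
from scipy.stats import entropy

def flag_centers(person_fit, center_ids, n_bins=20, threshold=0.5):
    """Stage 1 sketch: flag test centers whose person-fit distribution
    diverges from the overall distribution, measured by KL divergence."""
    person_fit = np.asarray(person_fit)
    center_ids = np.asarray(center_ids)
    edges = np.histogram_bin_edges(person_fit, bins=n_bins)
    overall, _ = np.histogram(person_fit, bins=edges)
    overall = (overall + 1) / (overall.sum() + n_bins)   # smoothed reference distribution

    flagged = []
    for center in np.unique(center_ids):
        local, _ = np.histogram(person_fit[center_ids == center], bins=edges)
        local = (local + 1) / (local.sum() + n_bins)      # smoothed center distribution
        if entropy(local, overall) > threshold:           # KL(center || overall)
            flagged.append(center)
    return flagged
```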

4.
马洪超 《考试研究》2012,(1):61-66,85
Parameter estimation is a prerequisite for the application and development of item response theory. This study estimated examinee ability values for six different samples of HSK examinees, using three software packages and several parameter estimation methods. The results show that the ability estimates are related to the distribution of the examinees' latent ability: when the latent ability distribution approaches a normal distribution, the error in the ability estimates is smaller. In addition, the ability estimates produced by the different software packages' estimation methods all differ.
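Since the study's exact estimation methods and software are not detailed here, the following is a generic sketch of one common ability-estimation method (EAP under a 2PL model with a standard-normal prior), intended only to illustrate what estimating an examinee's ability value involves.

```python
import numpy as np

def eap_theta(responses, a, b, n_quad=41):
    """EAP ability estimate under a 2PL model with a standard-normal prior.
    responses: 0/1 item-response vector; a, b: item discriminations/difficulties.
    A generic illustration only; the study compared several estimation methods."""
    nodes = np.linspace(-4, 4, n_quad)
    prior = np.exp(-0.5 * nodes ** 2)
    # P(correct | theta) for every quadrature node and item
    P = 1.0 / (1.0 + np.exp(-a * (nodes[:, None] - b)))
    like = np.prod(np.where(np.asarray(responses) == 1, P, 1 - P), axis=1)
    post = like * prior
    return np.sum(nodes * post) / np.sum(post)
```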

5.
The purpose of this study was to assess the dimensionality of two forms of a large-scale standardized test separately for 3 ethnic groups of examinees and to investigate whether differences in their latent trait composites have any impact on unidimensional item response theory true-score equating functions. Specifically, separate equating functions for African American and Hispanic examinees were compared to those of a Caucasian group as well as the total test taker population. On both forms, a 2-dimensional model adequately accounted for the item responses of Caucasian and African American examinees, whereas a more complex model was required for the Hispanic subgroup. The differences between equating functions for the 3 ethnic groups and the total test taker population were small and tended to be located at the low end of the score scale.

6.
Many researchers have suggested that the main cause of item bias is the misspecification of the latent ability space, where items that measure multiple abilities are scored as though they are measuring a single ability. If two different groups of examinees have different underlying multidimensional ability distributions and the test items are capable of discriminating among levels of abilities on these multiple dimensions, then any unidimensional scoring scheme has the potential to produce item bias. It is the purpose of this article to provide the testing practitioner with insight about the difference between item bias and item impact and how they relate to item validity. These concepts will be explained from a multidimensional item response theory (MIRT) perspective. Two detection procedures, the Mantel-Haenszel (as modified by Holland and Thayer, 1988) and Shealy and Stout's Simultaneous Item Bias (SIB; 1991) strategies, will be used to illustrate how practitioners can detect item bias.
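A bare-bones sketch of the Mantel-Haenszel common odds ratio for a single item, stratified on the matching score; the ETS delta transformation is included as a conventional reporting scale. This is an illustration of the general procedure, not the Holland-Thayer implementation itself.

```python
import numpy as np

def mantel_haenszel_alpha(correct, group, matching_score):
    """MH common odds ratio for one item. group: 1 = reference, 0 = focal."""
    correct = np.asarray(correct)
    group = np.asarray(group)
    score = np.asarray(matching_score)
    num = den = 0.0
    for k in np.unique(score):
        m = score == k
        A = np.sum((group[m] == 1) & (correct[m] == 1))  # reference, right
        B = np.sum((group[m] == 1) & (correct[m] == 0))  # reference, wrong
        C = np.sum((group[m] == 0) & (correct[m] == 1))  # focal, right
        D = np.sum((group[m] == 0) & (correct[m] == 0))  # focal, wrong
        n = A + B + C + D
        if n > 0:
            num += A * D / n
            den += B * C / n
    if den == 0:
        return float("nan"), float("nan")
    alpha = num / den
    delta = -2.35 * np.log(alpha)   # ETS delta-scale convention
    return alpha, delta
```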

7.
In competency testing, it is sometimes difficult to properly equate scores of different forms of a test and thereby assure equivalent cutting scores. Under such circumstances, it is possible to set standards separately for each test form and then scale the judgments of the standard setters to achieve equivalent pass/fail decisions. Data from standard setters and examinees for a medical certifying examination were reanalyzed. Cutting score equivalents were derived by applying a linear procedure to the standard-setting results. These were compared against criteria along with the cutting score equivalents derived from typical examination equating procedures. Results indicated that the cutting score equivalents produced by the experts were closer to the criteria than standards derived from examinee performance, especially when the number of examinees used in equating was small. The root mean square error estimate was about 1 item on a 189-item test.
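A minimal sketch of one linear procedure for mapping a cutting score from one form to another by matching the mean and standard deviation of the standard setters' judgments; whether this is the exact linear procedure used in the article is an assumption.

```python
import numpy as np

def linear_cut_equivalent(cut_a, ratings_a, ratings_b):
    """Map a cutting score set on form A to its form-B equivalent by a
    mean/sigma linear transformation of the standard setters' judgments."""
    ratings_a = np.asarray(ratings_a, dtype=float)
    ratings_b = np.asarray(ratings_b, dtype=float)
    slope = ratings_b.std(ddof=1) / ratings_a.std(ddof=1)
    intercept = ratings_b.mean() - slope * ratings_a.mean()
    return slope * cut_a + intercept
```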

8.
Much recent psychometric literature has focused on cognitive diagnosis models (CDMs), a promising class of instruments used to measure the strengths and weaknesses of examinees. This article introduces a genetic algorithm to perform automated test assembly alongside CDMs. The algorithm is flexible in that it can be applied whether the goal is to minimize the average number of classification errors, minimize the maximum error rate across all attributes being measured, hit a target set of error rates, or optimize any other prescribed objective function. Under multiple simulation conditions, the algorithm compared favorably with a standard method of automated test assembly, successfully finding solutions that were appropriate for each stated goal.
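A toy genetic-algorithm sketch for automated test assembly: candidate forms are binary item-selection vectors, and any prescribed objective (for example, an estimated CDM classification-error rate) can be plugged in as the loss. This is a generic GA with illustrative operators and parameters, not the article's algorithm.

```python
import numpy as np

def assemble_test(n_items, test_length, objective, n_gen=200, pop_size=50,
                  mutation_rate=0.02, rng=None):
    """Return the best fixed-length item selection found; lower objective is better."""
    rng = np.random.default_rng(rng)

    def random_form():
        form = np.zeros(n_items, dtype=bool)
        form[rng.choice(n_items, size=test_length, replace=False)] = True
        return form

    def repair(form):
        # keep exactly `test_length` items selected
        while form.sum() > test_length:
            form[rng.choice(np.flatnonzero(form))] = False
        while form.sum() < test_length:
            form[rng.choice(np.flatnonzero(~form))] = True
        return form

    pop = [random_form() for _ in range(pop_size)]
    for _ in range(n_gen):
        losses = np.array([objective(f) for f in pop])
        parents = [pop[i] for i in np.argsort(losses)[: pop_size // 2]]  # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            i, j = rng.choice(len(parents), size=2, replace=False)
            cut = rng.integers(1, n_items)                      # one-point crossover
            child = np.concatenate([parents[i][:cut], parents[j][cut:]])
            flip = rng.random(n_items) < mutation_rate          # bit-flip mutation
            children.append(repair(np.logical_xor(child, flip)))
        pop = parents + children
    losses = np.array([objective(f) for f in pop])
    return pop[int(np.argmin(losses))]
```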

9.
With the proliferation of computers in test delivery today, adaptive testing has become quite popular, especially when examinees must be classified into two categories (pass/fail, master/nonmaster). Several well-established organisations have provided standards and guidelines for the design and evaluation of educational and psychological testing. The purpose of this paper was not to repeat the guidelines and standards that exist in the literature but to identify and discuss the main evaluation parameters for a computer-adaptive test (CAT). A number of parameters should be taken into account when evaluating a CAT. Key parameters include utility, validity, reliability, satisfaction, usability, reporting, administration, security, and those associated with adaptivity, the item pool, and psychometric theory. These parameters are presented and discussed below and form a proposed evaluation model, the Evaluation Model of Computer-Adaptive Testing.

10.
11.
Permitting item review benefits examinees, who typically increase their test scores when allowed to revise answers. However, testing organizations generally do not allow item review, since it does not follow the logic on which adaptive tests are based and is prone to cheating strategies. Consequently, item review is not permitted in many adaptive tests. This study attempts to provide a solution that would allow examinees to revise their answers without jeopardizing the quality and efficiency of the test. The purpose of this study is to test the efficiency of a “rearrangement procedure” that rearranges and skips certain items in order to better estimate the examinees' abilities, without allowing them to cheat on the test. This was examined through a simulation study. The results show that the rearrangement procedure is effective in reducing the standard error of the Bayesian ability estimates and in increasing the reliability of the same estimates.

12.
In computerized adaptive testing (CAT), ensuring the security of test items is a crucial practical consideration. A common approach to reducing item theft is to define maximum item exposure rates, i.e., to limit the proportion of examinees to whom a given item can be administered. Numerous methods for controlling exposure rates have been proposed for tests employing the unidimensional 3-PL model. The present article explores the issues associated with controlling exposure rates when a multidimensional item response theory (MIRT) model is utilized and exposure rates must be controlled conditional upon ability. This situation is complicated by the exponentially increasing number of possible ability values in multiple dimensions. The article introduces a new procedure, called the generalized Stocking-Lewis method, that controls the exposure rate for students of comparable ability as well as with respect to the overall population. A realistic simulation study compares the new method with three other approaches: Kullback-Leibler information with no exposure control, Kullback-Leibler information with unconditional Sympson-Hetter exposure control, and random item selection.
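For context, here is a minimal sketch of the unconditional Sympson-Hetter filter used as a comparison method: items are considered in order of information, and a pre-calibrated probability decides whether a selected item is actually administered. The fallback rule is a simplification; the generalized Stocking-Lewis method additionally conditions these probabilities on ability.

```python
import random

def administer_with_exposure_control(ranked_items, exposure_params):
    """Sympson-Hetter style filter.  ranked_items: item ids sorted by
    information (most informative first).  exposure_params: item id ->
    administration probability P(A|S), calibrated beforehand so that
    observed exposure rates stay below the target."""
    for item in ranked_items:
        if random.random() <= exposure_params.get(item, 1.0):
            return item
    return ranked_items[-1]   # simplified fallback if every item is rejected
```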

13.
A procedure is presented for obtaining maximum likelihood trait estimates from number-correct (NC) scores for the three-parameter logistic model. The procedure produces an NC score to trait estimate conversion table, which can be used when the hand scoring of tests is desired or when item response pattern (IP) scoring is unacceptable for other (e.g., political) reasons. Simulated data are produced for four 20-item and four 40-item tests of varying difficulties. These data indicate that the NC scoring procedure produces trait estimates that are tau-equivalent to the IP trait estimates (i.e., they are expected to have the same mean for all groups of examinees), but the NC trait estimates have higher standard errors of measurement than IP trait estimates. Data for six real achievement tests verify that the NC trait estimates are quite similar to the IP trait estimates but have higher empirical standard errors than IP trait estimates, particularly for low-scoring examinees. Analyses in the estimated true score metric confirm the conclusions made in the trait metric.
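One common way to build such a conversion table is to invert the test characteristic curve; the sketch below does this numerically for the 3PL. The article's procedure is a full maximum likelihood solution, so treat this as an illustration of the idea rather than a reproduction of it.

```python
import numpy as np
from scipy.optimize import brentq

def three_pl(theta, a, b, c):
    """3PL item response function (with the conventional 1.7 scaling factor)."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def nc_to_theta(a, b, c, lo=-6.0, hi=6.0):
    """Build an NC-score -> theta table by solving T(theta) = NC, where
    T is the test characteristic curve.  Scores at or below the chance
    level sum(c), and the perfect score, have no finite solution and are
    left as NaN."""
    a, b, c = map(np.asarray, (a, b, c))
    tcc = lambda th: three_pl(th, a, b, c).sum()
    table = {}
    for nc in range(len(a) + 1):
        if tcc(lo) < nc < tcc(hi):
            table[nc] = brentq(lambda th: tcc(th) - nc, lo, hi)
        else:
            table[nc] = float("nan")
    return table
```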

14.
Changing a small number of answers to multiple-choice questions reliably improves test scores, although it remains unclear how examinees select which initial answers to change and whether answer-changing behaviour is susceptible to instruction. We tested the effect of an instructional intervention on the number of changes made by examinees on a mock exam in a controlled experimental design. We also examined how examinees' confidence in their initial answers, and their judgement of how difficult each exam question was, predicted their answer-changing behaviour. We found that the number of changes made increased, to a small extent, through instruction, without increasing the rate of errors. The likelihood of changing an initial answer decreased with examinees' feeling of confidence and increased with their feeling of difficulty. This is consistent with the theory that examinees use metacognitive experiences to select which initial answers to change on exams.

15.
We examined change in test-taking effort over the course of a three-hour, five-test, low-stakes testing session. Latent growth modeling results indicated that change in test-taking effort was well represented by a piecewise growth form, wherein effort increased from test 1 to test 4 and then decreased from test 4 to test 5. There was significant variability in effort for each of the five tests, which could be predicted from examinees' conscientiousness, agreeableness, mastery approach goal orientation, and whether the examinee "skipped" or attended the initial testing session. The degree to which examinees perceived a particular test as important was related to effort for the difficult, cognitive test but not for less difficult, noncognitive tests. There was significant variability in the rates of change in effort, which could be predicted from examinees' agreeableness. Interestingly, change in test-taking effort was not related to change in perceived test importance. Implications of these results for assessment practice and directions for future research are discussed.

16.
Subscores Based on Classical Test Theory: To Report or Not to Report
There is an increasing interest in reporting subscores, both at examinee level and at aggregate levels. However, it is important to ensure reasonable subscore performance in terms of high reliability and validity to minimize incorrect instructional and remediation decisions. This article employs a statistical measure based on classical test theory that is conceptually similar to the test reliability measure and can be used to determine when subscores have any added value over total scores. The usefulness of subscores is examined both at the level of the examinees and at the level of the institutions that the examinees belong to. The suggested approach is applied to two data sets from a basic skills test. The results provide little support in favor of reporting subscores for either examinees or institutions for the tests studied here.
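A sketch in the spirit of a Haberman-style PRMSE comparison, one classical-test-theory criterion of this kind; treating it as the article's exact statistic is an assumption. The check asks whether the observed subscore predicts the true subscore better than the observed total score does.

```python
import numpy as np

def subscore_adds_value(sub, total, sub_reliability):
    """Compare PRMSE of the observed subscore vs. the observed total score
    as predictors of the true subscore.  An assumed, simplified stand-in
    for the article's classical-test-theory measure."""
    sub, total = np.asarray(sub, float), np.asarray(total, float)
    var_s = sub.var(ddof=1)
    cov_sx = np.cov(sub, total, ddof=1)[0, 1]
    prmse_sub = sub_reliability                          # subscore as its own predictor
    true_cov = cov_sx - (1 - sub_reliability) * var_s    # Cov(true subscore, total)
    prmse_total = true_cov ** 2 / (sub_reliability * var_s * total.var(ddof=1))
    return prmse_sub > prmse_total                       # True -> subscore has added value
```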

17.
《教育实用测度》2013,26(3):241-261
This simulation study compared two procedures to enable an adaptive test to select items in correspondence with a content blueprint. Trait level estimates obtained from testlet-based and constrained adaptive tests administered to 10,000 simulated examinees under two trait distributions and three item pool sizes were compared to the trait level estimates obtained from traditional adaptive tests in terms of mean absolute error, bias, and information. Results indicate that using constrained adaptive testing requires an increase of 5% to 11% in test length over the traditional adaptive test to reach the same error level, and using testlets requires an increase of 43% to 104% in test length over the traditional adaptive test. Given these results, the use of constrained computerized adaptive testing is recommended for situations in which an adaptive test must adhere to particular content specifications.

18.
The traditional kappa statistic for assessing interrater agreement is not adequate when multiple raters and multiple attributes are involved. In this article, latent trait models are proposed to assess multirater multiattribute (MRMA) agreement. Data from the Third International Mathematics and Science Study (TIMSS) are used to illustrate the application of the latent trait models. Results showed that among four possible latent trait models, the correlated uniqueness model had the best fit to assess the MRMA agreement. Furthermore, in coding a set of different attributes, the coding accuracy within the same rater may differ across attributes. Likewise, when different raters rate the same attribute, the accuracy in rating varies among the raters. Thus, the latent trait models provide a more refined and accurate assessment of interrater agreement. The application of the latent trait models is important in school psychology research and intervention because accurate assessment of children's functioning is fundamental in designing effective intervention strategies. © 2007 Wiley Periodicals, Inc. Psychol Schs 44: 515–525, 2007.
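For reference, the traditional two-rater kappa that the abstract argues against can be computed as below; this is only the baseline statistic for a single attribute, not the proposed latent trait approach.

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Traditional two-rater kappa for one attribute: chance-corrected agreement."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    cats = np.union1d(r1, r2)
    n = len(r1)
    table = np.array([[np.sum((r1 == a) & (r2 == b)) for b in cats] for a in cats])
    p_obs = np.trace(table) / n                                   # observed agreement
    p_exp = np.sum(table.sum(axis=1) * table.sum(axis=0)) / n**2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```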

19.
20.
When tests are administered under fixed time constraints, test performances can be affected by speededness. Among other consequences, speededness can result in inaccurate parameter estimates in item response theory (IRT) models, especially for items located near the end of tests (Oshima, 1994). This article presents an IRT strategy for reducing contamination in item difficulty estimates due to speededness. Ordinal constraints are applied to a mixture Rasch model (Rost, 1990) so as to distinguish two latent classes of examinees: (a) a "speeded" class, comprised of examinees that had insufficient time to adequately answer end-of-test items, and (b) a "nonspeeded" class, comprised of examinees that had sufficient time to answer all items. The parameter estimates obtained for end-of-test items in the nonspeeded class are shown to more accurately approximate their difficulties when the items are administered at earlier locations on a different form of the test. A mixture model can also be used to estimate the class memberships of individual examinees. In this way, it can be determined whether membership in the speeded class is associated with other student characteristics. Results are reported for gender and ethnicity.
