Assessment items are commonly field tested prior to operational use to observe statistical item properties such as difficulty. Item parameter estimates from field testing may be used to assign scores via pre-equating or computer adaptive designs. This study examined differences between item difficulty estimates based on field test and operational data and the relationship of such differences to item position changes and student proficiency estimates. Item position effects were observed for 20 assessments, with items in later positions tending to be more difficult. Moreover, field test estimates of item difficulty were biased slightly upward, which may indicate examinee knowledge of which items were being field tested. Nevertheless, errors in field test item difficulty estimates had negligible impacts on student proficiency estimates for most assessments. Caution is still warranted when using field test statistics for scoring, and testing programs should conduct investigations to determine whether the effects on scoring are inconsequential.

In the presence of test speededness, the parameters of item response theory models can be poorly estimated because of conditional dependencies among items, particularly for end-of-test items (i.e., speeded items). This article conducted a systematic comparison of five item calibration procedures: a two-parameter logistic (2PL) model, a one-dimensional mixture model, a two-step strategy (a combination of the one-dimensional mixture and the 2PL), a two-dimensional mixture model, and a hybrid model. The comparison examined how sample size, percentage of speeded examinees, percentage of missing responses, and way of scoring missing responses (incorrect vs. omitted) affect item parameter estimation in speeded tests. For nonspeeded items, all five procedures showed similar results in recovering item parameters. For speeded items, the one-dimensional mixture model, the two-step strategy, and the two-dimensional mixture model provided largely similar results and performed better than the 2PL model and the hybrid model in calibrating slope parameters. However, those three procedures performed similarly to the hybrid model in estimating intercept parameters. As expected, the 2PL model did not appear to be as accurate as the other models in recovering item parameters, especially when there were large numbers of examinees showing speededness and a high percentage of missing responses with incorrect scoring. Real data analysis further described the similarities and differences between the five procedures.

This article presents a method for estimating the accuracy and consistency of classifications based on test scores. The scores can be produced by any scoring method, including a weighted composite. The estimates use data from a single form. The reliability of the score is used to estimate effective test length in terms of discrete items. The true-score distribution is estimated by fitting a 4-parameter beta model. The conditional distribution of scores on an alternate form, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Agreement between classifications on alternate forms is estimated by assuming conditional independence, given the true score. Evaluation of the method showed estimates to be within 1 percentage point of the actual values in most cases. Estimates of decision accuracy and decision consistency statistics were only slightly affected by changes in specified minimum and maximum possible scores.
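
The effective-test-length step in the abstract above can be illustrated with a minimal sketch. It assumes the relation obtained by inverting the KR-21 reliability formula on a score scale with known minimum and maximum; the abstract itself does not give the formula, so the function name, argument names, and numbers below are all hypothetical.

```python
def effective_test_length(mean, variance, reliability, min_score, max_score):
    """Number of discrete items implied by a score's mean, variance, and reliability.

    Assumed relation (KR-21 solved for the number of items):
        n = ((mean - min) * (max - mean) - r * var) / (var * (1 - r))
    """
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    numerator = (mean - min_score) * (max_score - mean) - reliability * variance
    denominator = variance * (1.0 - reliability)
    return numerator / denominator


# Hypothetical score summary: scale 0-50, mean 30, variance 64, reliability 0.90.
n_eff = effective_test_length(
    mean=30.0, variance=64.0, reliability=0.90, min_score=0.0, max_score=50.0
)
print(round(n_eff, 2))
```

In this sketch the result need not be an integer; a procedure built on it would typically round to a whole number of items before fitting the binomial model.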

The analytically derived asymptotic standard errors (SEs) of maximum likelihood (ML) item estimates can be approximated by a mathematical function without examinees' responses to test items, and the empirically determined SEs of marginal maximum likelihood estimation (MMLE)/Bayesian item estimates can be obtained when the same set of items is repeatedly estimated from simulated (or resampled) test data. The latter method yields stable and accurate SE estimates as the number of replications increases, but requires cumbersome and time-consuming calculations. As an alternative to the empirical method, the adequacy of the analytical method for predicting the SEs of item parameter estimates was examined by comparing results produced by both approaches. The results indicated that the SEs yielded by both approaches were, in most cases, very similar, especially when they were applied to a generalized partial credit model. This finding encourages test practitioners and researchers to apply the analytical asymptotic SEs of item estimates in item-linking studies, as well as in quantifying the SEs of equated scores for the item response theory (IRT) true-score method. Three-dimensional graphical presentations of the analytical SEs of item estimates as a bivariate function of item difficulty and item discrimination were also provided for a better understanding of several frequently used IRT models.

Domain scores have been proposed as a user-friendly way of providing instructional feedback about examinees' skills. Domain performance typically cannot be measured directly; instead, scores must be estimated using available information. Simulation studies suggest that IRT-based methods yield accurate group domain score estimates. Because simulations can represent best-case scenarios for methodology, it is important to verify results with a real data application. This study administered a domain of elementary algebra (EA) items created from operational test forms. An IRT-based group-level domain score was estimated from responses to a subset of taken items (consisting of EA items from a single operational form) and compared to the actual observed domain score. Domain item parameters were calibrated both using item responses from the special study and from national operational administrations of the items. The accuracy of the domain score estimates was evaluated within schools and across school sizes for each set of parameters. The IRT-based domain score estimates typically were closer to the actual domain score than observed performance on the EA items from the single form. Previously simulated findings for the IRT-based domain score estimation procedure were supported by the results of the real data application.

Many large-scale educational surveys have moved from linear form design to multistage testing (MST) design. One advantage of MST is that it can provide more accurate latent trait (θ) estimates using fewer items than required by linear tests. However, MST generates incomplete response data by design; hence, questions remain as to how to calibrate items using the incomplete data from MST design. Further complications arise when there are multiple correlated subscales per test, and when items from different subscales need to be calibrated according to their respective score reporting metric. The current calibration-per-subscale method produced biased item parameters, and there is no available method for resolving the challenge. Drawing on missing data principles, we showed that when all items are calibrated together, Rubin's ignorability assumption is satisfied, so the traditional single-group calibration is sufficient. For calibrating items per subscale, we proposed a simple modification to the current calibration-per-subscale method that helps reinstate the missing-at-random assumption and therefore corrects for the estimation bias that otherwise arises. Three mainstream calibration methods are discussed in the context of MST: marginal maximum likelihood estimation, the expectation maximization method, and fixed parameter calibration. An extensive simulation study is conducted and a real data example from NAEP is analyzed to provide convincing empirical evidence.

The examinee-selected-item (ESI) design, in which examinees are required to respond to a fixed number of items in a given set of items (e.g., choose one item to respond to from a pair of items), always yields incomplete data (i.e., only the selected items are answered and the others have missing data) that are likely nonignorable. Therefore, using standard item response theory models, which assume ignorable missing data, can yield biased parameter estimates, so that examinees who choose different sets of items to answer cannot be compared. To solve this fundamental problem, in this study the researchers utilized the specific objectivity of Rasch models by adopting the conditional maximum likelihood estimation (CMLE) and pairwise estimation (PE) methods to analyze ESI data, and conducted a series of simulations to demonstrate the advantages of the CMLE and PE methods over traditional estimation methods in recovering item parameters in ESI data. An empirical data set obtained from an experiment on the ESI design was analyzed to illustrate the implications and applications of the proposed approach to ESI data.

For the purpose of obtaining data to use in test development, multiple matrix sampling (MMS) plans were compared to examinee sampling plans. Data were simulated for examinees, sampled from a population with a normal distribution of ability, responding to items selected from an item universe. Three item universes were considered: one that would produce a normal distribution of test scores, one a moderately platykurtic distribution, and one a very platykurtic distribution. When comparing sampling plans, total numbers of observations were held constant. No differences were found among plans in estimating item difficulty. Examinee sampling produced better estimates of item discrimination, test reliability, and test validity. As total number of observations increased, estimates improved considerably, especially for those MMS plans with larger subtest sizes. Larger numbers of observations were needed for tests designed to produce a normal distribution of test scores. With an adequate number of observations, MMS is seen as an alternative to examinee sampling in test development.

Technical difficulties occasionally lead to missing item scores and hence to incomplete data on computerized tests. It is not straightforward to report scores to the examinees whose data are incomplete due to technical difficulties. Such reporting essentially involves imputation of missing scores. In this paper, a simulation study based on data from three educational tests is used to compare the performances of six approaches for imputation of missing scores. One of the approaches, based on data mining, is the first application of its kind to the problem of imputation of missing data. The approach based on data mining and a multiple imputation approach based on chained equations led to the most accurate imputation of missing scores, and hence to the most accurate score reporting. A simple approach based on linear regression performed the next best overall. Several recommendations are made regarding the reporting of scores to examinees with incomplete data.

A simulation study was performed to determine whether a group's average percent correct in a content domain could be accurately estimated for groups taking a single test form and not the entire domain of items. Six Item Response Theory based domain score estimation methods were evaluated, under conditions of few items per content area per form taken, small domains, and small group sizes. The methods used item responses to a single form taken to estimate examinee or group ability; domain scores were then computed using the ability estimates and domain item characteristics. The IRT-based domain score estimates typically showed greater accuracy and greater consistency across forms taken than observed performance on the form taken. For the smallest group size and fewest items taken, the accuracy of most IRT-based estimates was questionable; however, a procedure that operates on an estimated distribution of group ability showed promise under most conditions.

In this study we examined variations of the nonequivalent groups equating design for tests containing both multiple-choice (MC) and constructed-response (CR) items to determine which design was most effective in producing equivalent scores across the two tests to be equated. Using data from a large-scale exam, this study investigated the use of anchor CR item rescoring (known as trend scoring) in the context of classical equating methods. Four linking designs were examined: an anchor with only MC items; a mixed-format anchor test containing both MC and CR items; a mixed-format anchor test incorporating common CR item rescoring; and an equivalent groups (EG) design with CR item rescoring, thereby avoiding the need for an anchor test. Designs using either MC items alone or a mixed anchor without CR item rescoring resulted in much larger bias than the other two designs. The EG design with trend scoring resulted in the smallest bias, leading to the smallest root mean squared error value.

Item nonresponses are prevalent in standardized testing. They happen either when students fail to reach the end of a test due to a time limit or quitting, or when students choose to omit some items strategically. Oftentimes, item nonresponses are nonrandom, and hence, the missing data mechanism needs to be properly modeled. In this paper, we proposed to use an innovative item response time model as a cohesive missing data model to account for the two most common item nonresponses: not-reached items and omitted items. In particular, the new model builds on a behavior process interpretation: a person chooses to skip an item if the required effort exceeds the implicit time the person allocates to the item (Lee & Ying, 2015; Wolf, Smith, & Birnbaum, 1995), whereas a person fails to reach the end of the test due to lack of time. This assumption was verified by analyzing the 2015 PISA computer-based mathematics data. Simulation studies were conducted to further evaluate the performance of the proposed Bayesian estimation algorithm for the new model and to compare the new model with a recently proposed “speed-accuracy + omission” model (Ulitzsch, von Davier, & Pohl, 2019). Results revealed that all model parameters could be recovered properly, and that inadequately accounting for missing data caused biased item and person parameter estimates.

Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of approximately 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected in each study, wherein one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items, and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel, and a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.

The purpose of this study was to compare several methods for determining a passing score on an examination from the individual raters' estimates of minimal pass levels for the items. The methods investigated differ in the weighting that the estimates for each item receive in the aggregation process. An IRT-based simulation method was used to model a variety of error components of minimum pass levels. The results indicate little difference in estimated passing scores across the three methods. Less error was present when the ability level of the minimally competent candidates matched the expected difficulty level of the test. No meaningful improvement in passing score estimation was achieved for a 50-item test as opposed to a 25-item test; however, the RMSE values for estimates with 10 raters were smaller than those for 5 raters. The results suggest that the simplest method for aggregating minimum pass levels across the items in a test (adding them up) is the preferred method.
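
The "adding them up" aggregation that the abstract above prefers can be sketched as follows. The abstract does not say how the rater-level totals are combined, so averaging across raters is an assumption here, and the function name and ratings are hypothetical.

```python
from statistics import mean


def passing_score(ratings):
    """Unweighted passing score from rater-by-item minimum pass levels.

    ratings[r][i] is rater r's estimated probability that a minimally
    competent candidate answers item i correctly.
    """
    # Each rater's implied passing score is the plain sum over items.
    rater_totals = [sum(item_levels) for item_levels in ratings]
    # Assumption: the panel's passing score is the mean of the rater totals.
    return mean(rater_totals)


# Hypothetical panel: two raters, three items.
ratings = [
    [0.5, 0.75, 0.25],
    [0.5, 0.5, 0.5],
]
cut = passing_score(ratings)
print(cut)
```

Because every item carries the same weight, this version needs no item statistics beyond the ratings themselves.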

The purpose of this study was to determine if a linear procedure, typically applied to an entire examination when equating scores and rescaling judges' standards, could be used with individual item data gathered through Angoff's standard-setting method (1971). Specifically, experts' estimates of borderline group performance on one form of a test were transformed to be on the same scale as experts' estimates of borderline group performance on another form of the test. The transformations were based on examinees' responses to the items and on judges' estimates of borderline group performance. The transformed values were compared to the actual estimates provided by a group of judges. The equated and rescaled values were reasonably close to those actually assigned by the experts. Bias in the estimates was also relatively small. In general, the rescaling procedure was more accurate than the equating procedure, especially when the examinee sample size for equating was small.
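
A linear procedure of the general kind the abstract above describes can be sketched as a mean-and-standard-deviation matching transformation. This is only an illustration of a generic linear transformation, not the study's exact equating or rescaling procedure; the function name and all values are hypothetical.

```python
from statistics import mean, pstdev


def linear_transform(values, source, target):
    """Map values from the source scale to the target scale.

    The slope and intercept are chosen so that the source distribution's
    mean and standard deviation match those of the target distribution.
    """
    slope = pstdev(target) / pstdev(source)
    intercept = mean(target) - slope * mean(source)
    return [slope * v + intercept for v in values]


# Hypothetical item-level estimates on two forms.
form_x = [0.2, 0.4, 0.6]
form_y = [0.3, 0.5, 0.7]
transformed = linear_transform([0.4], form_x, form_y)
print(transformed)
```

The two reference distributions could come from examinee responses or from judges' estimates; which one is used is what distinguishes the two procedures compared in the abstract.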

Competence data from low-stakes educational large-scale assessment studies allow for evaluating relationships between competencies and other variables. The impact of item-level nonresponse has not been investigated with regard to statistics that determine the size of these relationships (e.g., correlations, regression coefficients). Classical approaches such as ignoring missing values or treating them as incorrect are currently applied in many large-scale studies, while recent model-based approaches that can account for nonignorable nonresponse have been developed. Estimates of item and person parameters have been demonstrated to be biased for classical approaches when missing data are missing not at random (MNAR). In our study, we focus on parameter estimates of the structural model (i.e., the true regression coefficient when regressing competence on an explanatory variable), simulating data according to various missing data mechanisms. We found that model-based approaches and ignoring missing values performed well in retrieving regression coefficients even when we induced missing data that were MNAR. Treating missing values as incorrect responses can lead to substantial bias. We demonstrate the validity of our approach empirically and discuss the relevance of our results.


Educational stakeholders have long known that students might not be fully engaged when taking an achievement test and that such disengagement could undermine the inferences drawn from observed scores. Thanks to the growing prevalence of computer-based tests and the new forms of metadata they produce, researchers have developed and validated procedures for using item response times to identify responses to items that are likely disengaged. In this study, we examine the impact of two techniques to account for test disengagement, namely (a) removing unengaged test takers from the sample and (b) adjusting test scores to remove rapidly guessed items, on estimates of school contributions to student growth, achievement gaps, and summer learning loss. Our results indicate that removing disengaged examinees from the sample will likely induce bias in the estimates, although as a whole accounting for disengagement had minimal impact on the metrics we examined. Last, we provide guidance for policy makers and evaluators on how to account for disengagement in their own work and consider the promise and limitations of using achievement test metadata for related purposes.

In test development, item response theory (IRT) is a method to determine the amount of information that each item (i.e., item information function) and combination of items (i.e., test information function) provide in the estimation of an examinee's ability. Studies investigating the effects of item parameter estimation errors over a range of ability have demonstrated an overestimation of information when the most discriminating items are selected (i.e., item selection based on maximum information). In the present study, the authors examined the influence of item parameter estimation errors across 3 item selection methods (maximum no target, maximum target, and theta maximum) using the 2- and 3-parameter logistic IRT models. Tests created with the maximum no target and maximum target item selection procedures consistently overestimated the test information function. Conversely, tests created using the theta maximum item selection procedure yielded more consistent estimates of the test information function and, at times, underestimated the test information function. Implications for test development are discussed.
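
The item and test information functions mentioned in the abstract above can be sketched with the standard 3-parameter logistic formulas, which reduce to the 2-parameter case when the guessing parameter is zero. The scaling constant is taken as 1, and the function names and item parameters are hypothetical.

```python
import math


def prob_correct(theta, a, b, c=0.0):
    """3PL probability of a correct response (2PL when c == 0)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c=0.0):
    """Fisher information of one item at ability theta under the 3PL model."""
    p = prob_correct(theta, a, b, c)
    q = 1.0 - p
    return a ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def total_information(theta, items):
    """Test information: the sum of item informations at theta."""
    return sum(item_information(theta, *item) for item in items)


# Hypothetical item bank: (a, b, c) triples.
items = [(2.0, 0.0, 0.0), (1.0, 0.5, 0.2), (1.5, -1.0, 0.25)]
info_at_zero = total_information(0.0, items)
print(info_at_zero)
```

Since information grows with the square of the discrimination parameter, selecting items by maximum estimated information favors items whose discrimination is overestimated, which is the mechanism behind the overestimation the abstract reports.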

Administering tests under time constraints may result in poorly estimated item parameters, particularly for items at the end of the test (Douglas, Kim, Habing, & Gao, 1998; Oshima, 1994). Bolt, Cohen, and Wollack (2002) developed an item response theory mixture model to identify a latent group of examinees for whom a test is overly speeded, and found that item parameter estimates for end-of-test items in the nonspeeded group were similar to estimates for those same items when administered earlier in the test. In this study, we used the Bolt et al. (2002) method to study the effect of removing speeded examinees on the stability of a score scale over an 11-year period. Results indicated that using only the nonspeeded examinees for equating and estimating item parameters provided a more unidimensional scale, smaller effects of item parameter drift (including fewer drifting items), and less scale drift (i.e., bias) and variability (i.e., root mean squared errors) when compared to the total group of examinees.
