Found 20 similar articles (search time: 15 ms)
1.
The examinee‐selected‐item (ESI) design, in which examinees are required to respond to a fixed number of items in a given set of items (e.g., choose one item to respond from a pair of items), always yields incomplete data (i.e., only the selected items are answered and the others have missing data) that are likely nonignorable. Therefore, using standard item response theory models, which assume ignorable missing data, can yield biased parameter estimates so that examinees taking different sets of items to answer cannot be compared. To solve this fundamental problem, in this study the researchers utilized the specific objectivity of Rasch models by adopting the conditional maximum likelihood estimation (CMLE) and pairwise estimation (PE) methods to analyze ESI data, and conducted a series of simulations to demonstrate the advantages of the CMLE and PE methods over traditional estimation methods in recovering item parameters in ESI data. An empirical data set obtained from an experiment on the ESI design was analyzed to illustrate the implications and applications of the proposed approach to ESI data.
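The specific objectivity that the abstract above relies on can be illustrated with a minimal sketch. It assumes the dichotomous Rasch model and hypothetical item difficulties; it is not the authors' estimation code. Given that exactly one of two items is answered correctly, the probability that it is item i does not depend on the examinee's ability, which is what allows pairwise and conditional methods to estimate item parameters without modelling ability.

```python
import math

def rasch_prob(theta, b):
    """P(correct) under the Rasch model for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pairwise_conditional(theta, b_i, b_j):
    """P(item i correct | exactly one of items i and j is correct)."""
    p_i, p_j = rasch_prob(theta, b_i), rasch_prob(theta, b_j)
    only_i = p_i * (1.0 - p_j)   # i correct, j incorrect
    only_j = (1.0 - p_i) * p_j   # j correct, i incorrect
    return only_i / (only_i + only_j)

# Hypothetical difficulties, chosen only for illustration.
b_i, b_j = -0.5, 1.0

# The conditional probability is the same at every ability level and
# equals 1 / (1 + exp(b_i - b_j)), a function of the difficulties alone.
for theta in (-2.0, 0.0, 2.0):
    print(theta, pairwise_conditional(theta, b_i, b_j))
```

Because the ability term cancels, the ratio of "only i correct" to "only j correct" counts in a sample estimates exp(b_j - b_i), which is the quantity pairwise estimation works from.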
2.
Wen‐Chung Wang Kuan‐Yu Jin Xue‐Lan Qiu Lei Wang 《Journal of Educational Measurement》2012,49(4):419-445
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee‐selected items. The results of a series of simulation studies showed that (1) the parameters of the new models were recovered well, (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models, (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee‐selected items, and (4) even when the missingness mechanism in examinee‐selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.
3.
R. Philip Chalmers 《Journal of Educational Measurement》2015,52(2):200-222
A mixed‐effects item response theory (IRT) model is presented as a logical extension of the generalized linear mixed‐effects modeling approach to formulating explanatory IRT models. Fixed and random coefficients in the extended model are estimated using a Metropolis‐Hastings Robbins‐Monro (MH‐RM) stochastic imputation algorithm to accommodate the increased dimensionality due to modeling multiple design‐ and trait‐based random effects. As a consequence of using this algorithm, more flexible explanatory IRT models, such as the multidimensional four‐parameter logistic model, are easily organized and efficiently estimated for unidimensional and multidimensional tests. Rasch versions of the linear latent trait and latent regression model, along with their extensions, are presented and discussed, Monte Carlo simulations are conducted to determine the efficiency of parameter recovery of the MH‐RM algorithm, and an empirical example using the extended mixed‐effects IRT model is presented.
4.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters’ ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.
5.
Drawing valid inferences from item response theory (IRT) models is contingent upon a good fit of the data to the model. Violations of model‐data fit have numerous consequences, limiting the usefulness and applicability of the model. This instructional module provides an overview of methods used for evaluating the fit of IRT models. Upon completing this module, the reader will have an understanding of traditional and Bayesian approaches for evaluating model‐data fit of IRT models, the relative advantages of each approach, and the software available to implement each method.
6.
7.
Christoph König Lale Khorramdel Kentaro Yamamoto Andreas Frey 《Educational Measurement》2021,40(1):17-27
Large‐scale assessments such as the Programme for International Student Assessment (PISA) have field trials where new survey features are tested for utility in the main survey. Because of resource constraints, there is a trade‐off between how much of the sample can be used to test new survey features and how much can be used for the initial item response theory (IRT) scaling. Utilizing real assessment data of the PISA 2015 Science assessment, this article demonstrates that using fixed item parameter calibration (FIPC) in the field trial yields stable item parameter estimates in the initial IRT scaling for samples as small as n = 250 per country. Moreover, the results indicate that for the recovery of the country‐specific latent trait distributions, the estimates of the trend items (i.e., the information introduced into the calibration) are crucial. Thus, concerning the country‐level sample size of n = 1,950 currently used in the PISA field trial, FIPC is useful for increasing the number of survey features that can be examined during the field trial without the need to increase the total sample size. This enables international large‐scale assessments such as PISA to keep up with state‐of‐the‐art developments regarding assessment frameworks, psychometric models, and delivery platform capabilities.
8.
Wooyeol Lee 《Applied Measurement in Education》2017,30(2):129-146
Utilizing a longitudinal item response model, this study investigated the effect of item parameter drift (IPD) on item parameters and person scores via a Monte Carlo study. Item parameter recovery was investigated for various IPD patterns in terms of bias and root mean-square error (RMSE), and percentage of time the 95% confidence interval covered the true parameter. The simulation results suggest that item parameters were not recovered well when IPD was ignored, especially if there was a larger number of IPD conditions. In addition, coverage was not accurate in all IPD conditions when IPD was ignored. Also, the results suggest that the accuracy of person scores (measured by bias) is potentially problematic when the larger number of IPD items is ignored. However, the overall accuracy (measured by RMSE) and coverage were unexpectedly acceptable in the presence of IPD as defined in this study.
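The three recovery criteria named in this abstract (bias, RMSE, and 95% confidence-interval coverage) can be sketched as follows. The function name, the normal-theory interval, and the replication values are assumptions made for illustration; they are not taken from the study.

```python
import math

def recovery_summary(true_value, estimates, std_errors, z=1.96):
    """Summarize parameter recovery over Monte Carlo replications.

    Returns (bias, rmse, coverage), where coverage is the proportion of
    replications whose interval estimate +/- z * SE contains the true value.
    """
    n = len(estimates)
    errors = [est - true_value for est in estimates]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= true_value <= est + z * se
    )
    return bias, rmse, covered / n

# Hypothetical replications of one item parameter whose true value is 1.0.
bias, rmse, coverage = recovery_summary(
    true_value=1.0,
    estimates=[0.9, 1.1, 1.4, 1.0],
    std_errors=[0.1, 0.1, 0.1, 0.1],
)
print(bias, rmse, coverage)
```

Bias and RMSE can disagree, as the abstract notes: positive and negative errors cancel in the bias but not in the RMSE, so the two are usually reported together.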
9.
Divgi (1986) demonstrated that the bias of UCON item parameter estimates is not removed by the factor (n − 1)/n. Andrich (1989) argued in this journal that the demonstration was faulty. In this note a complete proof of Divgi's conclusion is presented.
10.
When speaking to infants, mothers often alter their speech compared to how they speak to adults, but findings for fathers are mixed. This study examined interactions (N = 30) between fathers and infants (Mage ± SD = 7.8 ± 4.3 months) in a small‐scale society in Vanuatu and two urban societies in North America. Fundamental frequency (F0) and speech rate were measured in infant‐directed and adult‐directed speech. When speaking to infants, fathers in both groups increased their F0 range, yet only Vanuatu fathers increased their average F0. Conversely, North American fathers slowed down their speech rate to infants, whereas Vanuatu fathers did not. Behavioral traits can vary across distant cultures while still potentially solving similar communicative problems.
11.
Youngsuk Suh 《Journal of Educational Measurement》2016,53(4):403-430
This study adapted an effect size measure used for studying differential item functioning (DIF) in unidimensional tests and extended the measure to multidimensional tests. Two effect size measures were considered in a multidimensional item response theory model: signed weighted P‐difference and unsigned weighted P‐difference. The performance of the effect size measures was investigated under various simulation conditions including different sample sizes and DIF magnitudes. As another way of studying DIF, the χ² difference test was included to compare the result of statistical significance (statistical tests) with that of practical significance (effect size measures). The adequacy of existing effect size criteria used in unidimensional tests was also evaluated. Both effect size measures worked well in estimating true effect sizes, identifying DIF types, and classifying effect size categories. Finally, a real data analysis was conducted to support the simulation results.
12.
A Standardized Generalized Dimensionality Discrepancy Measure and a Standardized Model‐Based Covariance for Dimensionality Assessment for Multidimensional Models
Roy Levy Yuning Xu Nedim Yel Dubravka Svetina 《Journal of Educational Measurement》2015,52(2):144-158
The standardized generalized dimensionality discrepancy measure and the standardized model‐based covariance are introduced as tools to critique dimensionality assumptions in multidimensional item response models. These tools are grounded in a covariance theory perspective and associated connections between dimensionality and local independence. Relative to their precursors, they allow for dimensionality assessment in a more readily interpretable metric of correlations. A simulation study demonstrates the utility of the discrepancy measures’ application at multiple levels of dimensionality analysis, and compares them to factor analytic and item response theoretic approaches. An example illustrates their use in practice.
13.
Computer‐based tests (CBTs) often use random ordering of items in order to minimize item exposure and reduce the potential for answer copying. Little research has been done, however, to examine item position effects for these tests. In this study, different versions of a Rasch model and different response time models were examined and applied to data from a CBT administration of a medical licensure examination. The models specifically were used to investigate whether item position affected item difficulty and item intensity estimates. Results indicated that the position effect was negligible.
14.
Dual‐Objective Item Selection Criteria in Cognitive Diagnostic Computerized Adaptive Testing
The development of cognitive diagnostic‐computerized adaptive testing (CD‐CAT) has provided a new perspective for gaining information about examinees' mastery on a set of cognitive attributes. This study proposes a new item selection method within the framework of dual‐objective CD‐CAT that simultaneously addresses examinees' attribute mastery status and overall test performance. The new procedure is based on the Jensen‐Shannon (JS) divergence, a symmetrized version of the Kullback‐Leibler divergence. We show that the JS divergence resolves the noncomparability problem of the dual information index and has close relationships with Shannon entropy, mutual information, and Fisher information. The performance of the JS divergence is evaluated in simulation studies in comparison with the methods available in the literature. Results suggest that the JS divergence achieves parallel or more precise recovery of latent trait variables compared to the existing methods and maintains practical advantages in computation and item pool usage.
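The description of the JS divergence as a symmetrized Kullback‐Leibler divergence can be made concrete with a minimal sketch. It covers only the standard two‐distribution, equal‐weight form on discrete distributions; the item selection index proposed in the study may use a weighted or multi‐distribution generalization, which is not reproduced here.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions.

    Terms with p_k == 0 contribute zero by convention.
    """
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical response distributions for illustration.
p = [0.9, 0.1]
q = [0.5, 0.5]

print(kl_divergence(p, q), kl_divergence(q, p))  # KL is not symmetric
print(js_divergence(p, q), js_divergence(q, p))  # JS is symmetric
```

Because each distribution is compared with the mixture rather than with the other distribution directly, the JS divergence stays finite even when the supports differ, and with natural logarithms it is bounded above by ln 2.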
15.
Indices of item difficulty and item discrimination were analyzed for the items comprising the Wechsler Intelligence Scale for Children - Revised as obtained from a group of 142 subjects with Full Scale IQs below 96. Item validities were estimated by computing the biserial correlation between dichotomized item responses and the total weight score. Kendall's tau was computed for each item. The item difficulties for each subtest except Information and Vocabulary are roughly in the same rank order as those obtained by the standardization group. Evidence from the study indicates that the increase in the number of items on the WISC-R helped to increase its internal validity. Analysis of the data regarding the internal consistency of the test indicates that the majority of the items operate as significant discriminators. Changes in the order of administration and/or revision of the record form would not seem warranted on the basis of the present study.
16.
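The monotone homogeneity model (MHM) is the most widely used model in nonparametric item response theory; it rests on three basic assumptions and is suitable for analyzing small‐scale tests. This study used the MHM to analyze a test administered at the College of Advanced Chinese Training, Beijing Language and Culture University. The results show that the test satisfies the weak unidimensionality and weak local independence assumptions. Of the 67 items, 9 have scalability coefficients below 0.3 and need to be revised or deleted; after their deletion, the test forms a Mokken scale of medium strength. In addition, 2 items violate the monotonicity assumption and do not meet the requirements of a Mokken scale.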
17.
Wim J. van der Linden Minjeong Jeon Steve Ferrara 《Journal of Educational Measurement》2011,48(4):380-398
According to a popular belief, test takers should trust their initial instinct and retain their initial responses when they have the opportunity to review test items. More than 80 years of empirical research on item review, however, has contradicted this belief and shown minor but consistently positive score gains for test takers who changed answers they found to be incorrect during review. This study reanalyzed the problem of the benefits of answer changes using item response theory modeling of the probability of an answer change as a function of the test taker’s ability level and the properties of items. Our empirical results support the popular belief and reveal substantial losses due to changing initial responses for all ability levels. Both the contradiction of the earlier research and support of the popular belief are explained as a manifestation of Simpson’s paradox in statistics.
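How Simpson's paradox can reconcile the two findings may be easier to see with a small numerical sketch. The counts below are entirely hypothetical, the two ability groups are a simplification, and nothing here reproduces the study's data or its IRT model. Within each group, kept answers are correct more often than changed ones, yet the pooled data point the other way because most changes come from the high‐ability group.

```python
# Hypothetical (correct, total) counts of final answers, by ability group
# and by whether the initial response was changed or kept.
counts = {
    "high ability": {"changed": (90, 100), "kept": (19, 20)},
    "low ability": {"changed": (3, 20), "kept": (20, 100)},
}

def group_rate(group, action):
    """Proportion correct within one ability group."""
    correct, total = counts[group][action]
    return correct / total

def pooled_rate(action):
    """Proportion correct after pooling the ability groups."""
    correct = sum(cells[action][0] for cells in counts.values())
    total = sum(cells[action][1] for cells in counts.values())
    return correct / total

for group in counts:
    print(group, group_rate(group, "changed"), group_rate(group, "kept"))
print("pooled", pooled_rate("changed"), pooled_rate("kept"))
```

Conditioning on ability, as an item response model does, corresponds to reading the within‐group rows rather than the pooled row.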
18.
This study investigates the effect of multidimensionality on extraction of latent classes in mixture Rasch models. In this study, two‐dimensional data were generated under varying conditions. The two‐dimensional data sets were analyzed with one‐ to five‐class mixture Rasch models. Results of the simulation study indicate the mixture Rasch model tended to extract more latent classes than the number of dimensions simulated, particularly when the multidimensional structure of the data was more complex. In addition, the number of extracted latent classes decreased as the dimensions were more highly correlated regardless of multidimensional structure. An analysis of the empirical multidimensional data also shows that the number of latent classes extracted by the mixture Rasch model is larger than the number of dimensions measured by the test.
19.
Robin Tamez 《Performance Improvement》2016,55(6):19-24
The purpose of this article is to highlight theories that support the functions of performance‐based design models and to discuss the implications of integrating divergent models into the system‐oriented human performance technology (HPT) and performance improvement (PI) disciplines. HPT, PI, and instructional systems design (ISD) share a systems framework, along with the influence of common theories such as performance theory, learning theory, adult learning, cognitive psychology, and behavioral psychology (Foshay, Villachica, & Stepich, 2014). This article focuses on the role of theory as a tool in the practitioner's toolbox and as a connection point when working with teams and organizations that have different theoretical orientations. Performance‐based ISD models are discussed, including Robinson and Robinson's (1989) Training for Impact, Brethower and Smalley's (1998) Performance‐Based Instruction, and Bradford and Boler's (2015) Horizon Model. Allen and Sites's (2012) successive approximation model (SAM) retains elements of ADDIE as a process, but the model is iterative rather than systematic in design.
20.
Detecting Differential Item Discrimination (DID) and the Consequences of Ignoring DID in Multilevel Item Response Models
Cross‐level invariance in a multilevel item response model can be investigated by testing whether the within‐level item discriminations are equal to the between‐level item discriminations. Testing the cross‐level invariance assumption is important to understand constructs in multilevel data. However, in most multilevel item response model applications, cross‐level invariance is assumed without being tested. In this study, the detection methods of differential item discrimination (DID) over levels and the consequences of ignoring DID are illustrated and discussed with the use of multilevel item response models. Simulation results showed that the likelihood ratio test (LRT) performed well in detecting global DID at the test level when some portion of the items exhibited DID. At the item level, the Akaike information criterion (AIC), the sample‐size adjusted Bayesian information criterion (saBIC), LRT, and Wald test showed a satisfactory rejection rate (>.8) when some portion of the items exhibited DID and the items had lower intraclass correlations (or higher DID magnitudes). When DID was ignored, the accuracy of the item discrimination estimates and standard errors was mainly problematic. Implications of the findings and limitations are discussed.
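The model comparison criteria named in this abstract can be computed from fitted log‐likelihoods as in the minimal sketch below. The log‐likelihoods, parameter counts, and sample size are hypothetical, and the choice of sample size in a multilevel model (persons or clusters) is an assumption the analyst has to make; the abstract does not specify it.

```python
import math

def lrt_statistic(loglik_restricted, loglik_full):
    """Likelihood ratio statistic for nested models (restricted within full)."""
    return 2.0 * (loglik_full - loglik_restricted)

def aic(loglik, n_params):
    """Akaike information criterion; lower values are preferred."""
    return -2.0 * loglik + 2.0 * n_params

def sabic(loglik, n_params, n_obs):
    """Sample-size adjusted BIC, which replaces n with (n + 2) / 24."""
    return -2.0 * loglik + n_params * math.log((n_obs + 2.0) / 24.0)

# Hypothetical fits: the restricted model assumes cross-level invariance for
# one item; the full model frees that item's between-level discrimination.
ll_restricted, k_restricted = -5230.4, 40
ll_full, k_full = -5221.1, 41
n_obs = 1000

print(lrt_statistic(ll_restricted, ll_full))  # compare with chi-square, df = 1
print(aic(ll_restricted, k_restricted), aic(ll_full, k_full))
print(sabic(ll_restricted, k_restricted, n_obs), sabic(ll_full, k_full, n_obs))
```

The LRT statistic is referred to a chi‐square distribution with degrees of freedom equal to the difference in parameter counts, whereas AIC and saBIC are compared directly between the two models.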