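This study examines the effect of item format on measurement quality. The results show that, with equivalent item stems, the fill-in-the-blank format is generally more difficult than the multiple-choice (single-answer) format. The two formats do not differ significantly in discrimination, although multiple-choice items may discriminate better than fill-in-the-blank items when appropriate options are provided. The two formats also differ little in the dimensionality of the ability measured; however, the multiple-choice format measures lower-ability examinees relatively well, whereas the fill-in-the-blank format measures higher-ability examinees better.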

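Providing large amounts of new material and new information and setting up novel scenarios to test candidates' geographic thinking skills, namely their ability to obtain and interpret geographic information and to analyze and solve problems, has become a distinctive feature of present-day geography ability testing. The process of obtaining and interpreting information is one in which an individual's thinking ability comes to the fore; it includes finding information, judging information, and extracting the useful information. Therefore, only when teachers, in teaching and working through problems, adopt different methods and means of obtaining and interpreting geographic information according to the medium that carries it, and take every opportunity to develop students' ability to obtain and interpret geographic information, can ...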

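I. Characteristics and composition of multiple-choice questions. Multiple-choice questions are objective test items. They cover a wide range of knowledge, carry a large amount of information, and are scored objectively; they are novel in conception, flexible in approach, open to ingenious solutions, and demanding in the amount of thinking required. They can test analytical and integrated judgment skills from many angles and perspectives, and they are a fixed item type in the gaokao (college entrance examination). A multiple-choice question consists of a stem and options. The stem is the given, prompting part; it specifies the content and requirements of the choice. The options are conclusion-type statements that result from judgments made on the information in the stem. Because the gaokao now emphasizes the development of students' abilities, the design of stems is more ...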

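New Developments in Item Response Theory Research   Cited by: 8 (self-citations: 0; citations by others: 8)
Over the past 20 years, research on testing theories, represented by item response theory (IRT), has made considerable progress. This progress is evident in three areas: the emergence of multidimensional IRT, nonparametric IRT, and cognitively diagnostic theory. These new theories have deepened our understanding of item response theory and will have a far-reaching influence on testing practice.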

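Introduction. The author has written a series of three papers on the concepts, procedures, applications, and open problems of test equating and linking. The first paper (published in 《考试研究》, 2011, No. 1) discussed the core issues of validity and the importance of constructing equivalent test forms during item writing and test assembly. It also introduced the main concepts and basic terminology of equating and linking and gave an overview of classical test theory (CTT) and item response theory (IRT). The second paper (published in 《考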

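This paper adopts the two-parameter normal ogive model from item response theory and, using a Markov chain Monte Carlo (MCMC) method, presents a Matlab program that estimates item parameters by Gibbs sampling. The program is applied to the final-examination score data of undergraduates at a university to obtain item parameters, which are then analyzed.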

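A Comparative Study of Classical Test Theory and Item Response Theory   Cited by: 3 (self-citations: 1; citations by others: 3)
This paper compares classical test theory and item response theory in terms of their models and assumptions, main concepts and parameters, and level of measurement. It clarifies the connections and differences between the two theories and identifies their respective strengths and weaknesses, thereby offering guidance to researchers in choosing an appropriate analytical framework according to the requirements of testing practice and the conditions under which each theory applies.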

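Taking 205 second-year non-English-major students in four classes at a normal university as participants, this paper examines the effect of question preview on English listening comprehension along two dimensions: preview type and passage type. The experimental results show that: (1) listening comprehension is affected by preview type and by passage type, but the two factors do not interact significantly; (2) overall, full preview benefits listening comprehension significantly more than stem preview or no preview, and option preview benefits it significantly more than no preview; (3) for conversation passages, accuracy does not differ significantly across preview types, but for lecture passages the differences are highly significant.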

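Test equating makes different forms of a test comparable and thus ensures relative stability across tests. IRT-based score equating is a parameter transformation carried out on estimated parameters, so the stability of the equating results is closely tied to the examinee sample size. For the reading subtest of the Hanyu Shuiping Kaoshi (HSK), this study uses real data to simulate a common-group anchor-test design, establishes a reference criterion for equating, and examines how changes in examinee sample size affect the stability of IRT score equating. The results show that with a sample of about 2,000 examinees, the equating results of all schemes are fairly stable. When the sample size increases further, the equating error rises rather than falls.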

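Parameter estimation for item response theory models generally requires large samples, and the relative merits of parametric and nonparametric IRT models under small samples remain unsettled. Using computer-simulated data, this study compares the bias and root mean square error (RMSE) of the two types of model in estimating item characteristic curves with small samples (n ≤ 200). When the simulated data are generated from the 3PL model, the parametric and nonparametric models do not differ in estimation bias for sample sizes below 200, but the former has a smaller RMSE. At a sample size of 200, the two models yield similar estimates. When the real data follow the 3PL model and the sample size is below 200, the parametric Rasch model is preferable to the nonparametric kernel-smoothing model.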
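
The two accuracy measures named in this abstract, bias and RMSE, can be sketched as follows. This is a minimal illustration that assumes the estimated and true item characteristic curves have already been evaluated at the same ability points; the function name `bias_and_rmse` and the sample values are hypothetical, and the study itself may average over replications or weight the ability points differently.

```python
import math

def bias_and_rmse(estimated, true):
    """Return (bias, RMSE) of estimated curve values against true values.

    Both arguments are equal-length sequences of probabilities evaluated
    at the same ability points (an assumption of this sketch).
    """
    if len(estimated) != len(true) or len(estimated) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [e - t for e, t in zip(estimated, true)]
    # Bias: signed mean error, so over- and under-estimates can cancel.
    bias = sum(errors) / len(errors)
    # RMSE: square root of the mean squared error, so they cannot cancel.
    rmse = math.sqrt(sum(err ** 2 for err in errors) / len(errors))
    return bias, rmse

# Hypothetical values at three ability points.
bias, rmse = bias_and_rmse([0.2, 0.5, 0.9], [0.1, 0.5, 0.8])
```

Because bias keeps the sign of each error and RMSE does not, two estimators can show the same bias while differing in RMSE, which is the pattern the abstract reports for sample sizes below 200.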

Six procedures for combining sets of IRT item parameter estimates obtained from different samples were evaluated using real and simulated response data. In the simulated data analyses, true item and person parameters were used to generate response data for three different-sized samples. Each sample was calibrated separately to obtain three sets of item parameter estimates for each item. The six procedures for combining multiple estimates were each applied, and the results were evaluated by comparing the true and estimated item characteristic curves. For the real data, the two best methods from the simulation data analyses were applied to three different-sized samples and the resulting estimated item characteristic curves were compared to the curves obtained when the three samples were combined and calibrated simultaneously. The results support the use of covariance matrix-weighted averaging and a procedure that involves sample-size-weighted averaging of estimated item characteristic curves at the center of the ability distribution.

Various applications of item response theory often require linking to achieve a common scale for item parameter estimates obtained from different groups. This article used a simulation to examine the relative performance of four different item response theory (IRT) linking procedures in a random groups equating design: concurrent calibration with multiple groups, separate calibration with the Stocking-Lord method, separate calibration with the Haebara method, and proficiency transformation. The simulation conditions used in this article included three sampling designs, two levels of sample size, and two levels of the number of items. In general, the separate calibration procedures performed better than the concurrent calibration and proficiency transformation procedures, even though some inconsistent results were observed across different simulation conditions. Some advantages and disadvantages of the linking procedures are discussed.
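
The two characteristic-curve criteria named in this abstract can be sketched as loss functions of a candidate slope `A` and intercept `B`. This is a simplified illustration, not the article's implementation: it assumes a 2PL model, an unweighted grid of ability points, and a one-directional criterion, and the function names are hypothetical. In practice the loss is minimized numerically over `A` and `B`.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response (assumed model for this sketch)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def stocking_lord_loss(A, B, items_from, items_to, thetas):
    """Squared difference between test characteristic curves, summed over thetas."""
    loss = 0.0
    for t in thetas:
        tcc_to = sum(p_2pl(t, a, b) for a, b in items_to)
        # Rescale the source-scale parameters: a* = a / A, b* = A * b + B.
        tcc_from = sum(p_2pl(t, a / A, A * b + B) for a, b in items_from)
        loss += (tcc_to - tcc_from) ** 2
    return loss

def haebara_loss(A, B, items_from, items_to, thetas):
    """Squared item-level curve differences, summed over items and thetas."""
    loss = 0.0
    for t in thetas:
        for (a_f, b_f), (a_t, b_t) in zip(items_from, items_to):
            diff = p_2pl(t, a_t, b_t) - p_2pl(t, a_f / A, A * b_f + B)
            loss += diff ** 2
    return loss

# Hypothetical common items as (a, b) pairs, linked by A = 1.2 and B = 0.3.
items_from = [(1.0, -0.5), (1.5, 0.0), (0.8, 1.0)]
items_to = [(a / 1.2, 1.2 * b + 0.3) for a, b in items_from]
thetas = [x / 2.0 for x in range(-8, 9)]  # ability grid from -4 to 4
```

The difference between the two criteria is where the squaring happens: the first squares the difference of summed curves, so item-level discrepancies can offset one another, while the second squares each item's discrepancy separately.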

Four equating methods (3PL true score equating, 3PL observed score equating, beta 4 true score equating, and beta 4 observed score equating) were compared using four equating criteria: first-order equity (FOE), second-order equity (SOE), conditional-mean-squared-error (CMSE) difference, and the equipercentile equating property. True score equating more closely achieved estimated FOE than observed score equating when the true score distribution was estimated using the psychometric model that was used in the equating. Observed score equating more closely achieved estimated SOE, estimated CMSE difference, and the equipercentile equating property than true score equating. Among the four equating methods, 3PL observed score equating most closely achieved estimated SOE and had the smallest estimated CMSE difference, and beta 4 observed score equating was the method that most closely met the equipercentile equating property.

Disengaged item responses pose a threat to the validity of the results provided by large-scale assessments. Several procedures for identifying disengaged responses on the basis of observed response times have been suggested, and item response theory (IRT) models for response engagement have been proposed. We outline that response time-based procedures for classifying response engagement and IRT models for response engagement are based on common ideas, and we propose the distinction between independent and dependent latent class IRT models. In all IRT models considered, response engagement is represented by an item-level latent class variable, but the models assume that response times either reflect or predict engagement. We summarize existing IRT models that belong to each group and extend them to increase their flexibility. Furthermore, we propose a flexible multilevel mixture IRT framework in which all IRT models can be estimated by means of marginal maximum likelihood. The framework is based on the widespread Mplus software, thereby making the procedure accessible to a broad audience. The procedures are illustrated on the basis of publicly available large-scale data. Our results show that the different IRT models for response engagement provided slightly different adjustments of item parameters of individuals’ proficiency estimates relative to a conventional IRT model.

In this study we examined procedures for assessing model-data fit of item response theory (IRT) models for mixed format data. The model fit indices used in this study include PARSCALE's G², Orlando and Thissen's S-X² and S-G², and Stone's χ²* and G²*. To investigate the relative performance of the fit statistics at the item level, we conducted two simulation studies: Type I error and power studies. We evaluated the performance of the item fit indices for various conditions of test length, sample size, and IRT models. Among the competing measures, the summed score-based indices S-X² and S-G² were found to be the sensible and efficient choice for assessing model fit for mixed format data. These indices performed well, particularly with short tests. The pseudo-observed score indices, χ²* and G²*, showed inflated Type I error rates in some simulation conditions. Consistent with the findings of current literature, PARSCALE's G² index was rarely useful, although it provided reasonable results for long tests.

In order to equate tests under Item Response Theory (IRT), one must obtain the slope and intercept coefficients of the appropriate linear transformation. This article compares two methods for computing such equating coefficients: Loyd and Hoover (1980) and Stocking and Lord (1983). The former is based upon summary statistics of the test calibrations; the latter is based upon matching test characteristic curves by minimizing a quadratic loss function. Three types of equating situations (horizontal, vertical, and that inherent in IRT parameter recovery studies) were investigated. The results showed that the two computing procedures generally yielded similar equating coefficients in all three situations. In addition, two sets of SAT data were equated via the two procedures, and little difference in the obtained results was observed. Overall, the results suggest that the Loyd and Hoover procedure usually yields acceptable equating coefficients. The Stocking and Lord procedure improves upon the Loyd and Hoover values and appears to be less sensitive to atypical test characteristics. When the user has reason to suspect that the test calibrations may be associated with data sets that are typically troublesome to calibrate, the Stocking and Lord procedure is to be preferred.
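
The slope and intercept of the linear transformation can be obtained from summary statistics of the common items. The sketch below uses the mean/sigma computation on difficulty estimates purely as an illustration of a summary-statistics approach; it is an assumption of this example and is not claimed to reproduce the Loyd and Hoover procedure exactly. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def mean_sigma_coefficients(b_from, b_to):
    """Slope A and intercept B such that b_to is approximately A * b_from + B.

    b_from and b_to are difficulty estimates of the same common items
    from two separate calibrations (assumption of this sketch).
    """
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B

def rescale_item(a, b, A, B):
    """Place one item's discrimination and difficulty on the target scale."""
    # With theta* = A * theta + B, the product a * (theta - b) is unchanged,
    # so response probabilities are preserved by the transformation.
    return a / A, A * b + B

# Hypothetical difficulties of four common items on two scales.
b_from = [-1.0, 0.0, 0.5, 1.5]
b_to = [1.2 * b + 0.3 for b in b_from]
A, B = mean_sigma_coefficients(b_from, b_to)
```

With these noise-free values the computation recovers a slope close to 1.2 and an intercept close to 0.3. With real calibrations the two sets of estimates are not exactly linearly related, which is the situation in which the abstract finds the characteristic-curve approach less sensitive to atypical test characteristics.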

In this article, linear item response theory (IRT) observed‐score equating is compared under a generalized kernel equating framework with Levine observed‐score equating for nonequivalent groups with anchor test design. Interestingly, these two equating methods are closely related despite being based on different methodologies. Specifically, when using data from IRT models, linear IRT observed‐score equating is virtually identical to Levine observed‐score equating. This leads to the conclusion that poststratification equating based on true anchor scores can be viewed as the curvilinear Levine observed‐score equating.

The usefulness of item response theory (IRT) models depends, in large part, on the accuracy of item and person parameter estimates. For the standard 3 parameter logistic model, for example, these parameters include the item parameters of difficulty, discrimination, and pseudo-chance, as well as the person ability parameter. Several factors impact traditional marginal maximum likelihood (ML) estimation of IRT model parameters, including sample size, with smaller samples generally being associated with lower parameter estimation accuracy and inflated standard errors for the estimates. Given this deleterious impact of small samples on IRT model performance, estimation becomes difficult when these techniques are used with low-incidence populations, where they might prove particularly useful, especially with more complex models. Recently, a Pairwise estimation method for Rasch model parameters has been suggested for use with missing data, and may also hold promise for parameter estimation with small samples. This simulation study compared item difficulty parameter estimation accuracy of ML with the Pairwise approach to ascertain the benefits of this latter method. The results support the use of the Pairwise method with small samples, particularly for obtaining item location estimates.
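
The roles of the three item parameters and the person ability parameter can be made concrete with a short sketch of the 3 parameter logistic response function. The function name `icc_3pl` and the optional scaling constant `D` (set to 1.0 here; 1.7 is another common convention) are assumptions of this example, not details taken from the study.

```python
import math

def icc_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct response under the 3PL model.

    theta: person ability; a: discrimination; b: difficulty;
    c: pseudo-chance (lower asymptote); D: optional scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# When ability equals difficulty, the probability is halfway
# between the pseudo-chance level c and 1.
p = icc_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)  # 0.2 + 0.8 / 2 = 0.6
```

Setting `c = 0` gives the two parameter model, and additionally fixing `a` to a common value across items gives a Rasch-type model, which has fewer parameters per item to estimate.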

The present study evaluated the multiple imputation method, a procedure that is similar to the one suggested by Li and Lissitz (2004), and compared the performance of this method with that of the bootstrap method and the delta method in obtaining the standard errors for the estimates of the parameter scale transformation coefficients in item response theory (IRT) equating in the context of the common‐item nonequivalent groups design. Two different estimation procedures for the variance‐covariance matrix of the IRT item parameter estimates, which were used in both the delta method and the multiple imputation method, were considered: empirical cross‐product (XPD) and supplemented expectation maximization (SEM). The results of the analyses with simulated and real data indicate that the multiple imputation method generally produced very similar results to the bootstrap method and the delta method in most of the conditions. The differences between the estimated standard errors obtained by the methods using the XPD matrices and the SEM matrices were very small when the sample size was reasonably large. When the sample size was small, the methods using the XPD matrices appeared to yield slight upward bias for the standard errors of the IRT parameter scale transformation coefficients.

