Similar documents (20 results)
1.
This research examined the effect of scoring items thought to be multidimensional using a unidimensional model and demonstrated the use of multidimensional item response theory (MIRT) as a diagnostic tool. Using real data from a large-scale mathematics test, previously shown to function differentially in favor of proficient writers, the difference in proficiency classifications was explored when a two- versus one-dimensional confirmatory model was fit. The estimate of ability obtained when using the unidimensional model was considered to represent general mathematical ability. Under the two-dimensional model, one of the two dimensions was also considered to represent general mathematical ability. The second dimension was considered to represent the ability to communicate in mathematics. The resulting pattern of mismatched proficiency classifications suggested that examinees found to have less mathematics communication ability were more likely to be placed in a lower general mathematics proficiency classification under the unidimensional than under the multidimensional model. Results and implications are discussed.

2.
This inquiry is an investigation of item response theory (IRT) proficiency estimators' accuracy under multistage testing (MST). We chose a two-stage MST design that includes four modules (one at Stage 1, three at Stage 2) and three difficulty paths (low, middle, high). We assembled various two-stage MST panels (i.e., forms) by manipulating two assembly conditions in each module, namely difficulty level and module length. For each panel, we investigated the accuracy of examinees' proficiency levels derived from seven IRT proficiency estimators. The choice of Bayesian (prior) versus non-Bayesian (no prior) estimators was of more practical significance than the choice of number-correct versus item-pattern scoring estimators. The Bayesian estimators were slightly more efficient than the non-Bayesian estimators, resulting in smaller overall error. Possible score changes caused by the use of different proficiency estimators would be nonnegligible, particularly for low- and high-performing examinees.
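The Bayesian versus non-Bayesian contrast described in this abstract can be illustrated with a minimal sketch: a maximum likelihood estimate found by a grid search over the pattern likelihood versus an expected a posteriori (EAP) estimate that weights the same likelihood by a standard normal prior. The 2PL parameters and response pattern below are hypothetical, and a coarse grid stands in for proper quadrature; this is not the article's estimation setup.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def pattern_likelihood(resp, theta_grid, a, b):
    """Likelihood of one response pattern at every point of a theta grid."""
    p = p_correct(theta_grid[:, None], a, b)            # grid x items
    return np.prod(np.where(resp, p, 1.0 - p), axis=1)

# Hypothetical 2PL item parameters and one examinee's response pattern.
a = np.array([1.2, 0.8, 1.5, 1.0])
b = np.array([-0.5, 0.0, 0.4, 1.1])
resp = np.array([1, 1, 0, 1], dtype=bool)

grid = np.linspace(-4, 4, 81)
like = pattern_likelihood(resp, grid, a, b)
prior = np.exp(-0.5 * grid**2)          # standard normal prior (unnormalized)

theta_ml = grid[np.argmax(like)]        # non-Bayesian: ML by grid search
theta_eap = np.sum(grid * like * prior) / np.sum(like * prior)   # Bayesian: EAP

print(theta_ml, theta_eap)              # the EAP estimate is shrunk toward the prior mean
```

As the abstract notes, the shrinkage induced by the prior matters most for examinees near the extremes of the score scale.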

3.
Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non-Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real-data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth-related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinees to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non-Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.
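Summed-score estimators condition only on the number-correct score. Under an IRT model, the distribution of the summed score at each ability point can be obtained with the Lord-Wingersky recursion, and a summed-score EAP is then the posterior mean of ability given that total. A rough sketch under a 2PL with hypothetical parameters (not the estimators or data of the article):

```python
import numpy as np

def summed_score_dist(p):
    """Lord-Wingersky recursion: distribution of the summed score at one theta,
    given each item's probability of a correct response."""
    dist = np.array([1.0])
    for pj in p:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1 - pj)   # item answered incorrectly
        new[1:] += dist * pj          # item answered correctly
        dist = new
    return dist

# Hypothetical 2PL item parameters, a theta grid, and a normal prior.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.0, -0.3, 0.0, 0.5, 1.2])
grid = np.linspace(-4, 4, 81)
prior = np.exp(-0.5 * grid**2)
prior /= prior.sum()

# P(summed score = s | theta) at every grid point.
p_item = 1 / (1 + np.exp(-a * (grid[:, None] - b)))                  # grid x items
score_probs = np.array([summed_score_dist(row) for row in p_item])   # grid x (items + 1)

# Summed-score EAP: posterior mean of theta given only the total score s.
s = 3
posterior = score_probs[:, s] * prior
eap_summed = np.sum(grid * posterior) / posterior.sum()
print(eap_summed)
```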

4.
Based on the three most commonly used logistic models in IRT, this study examined the cross-sample invariance of equating for a Chinese-language test, using concurrent calibration as the equating method. The results indicate that equating via concurrent calibration under the two-parameter model showed the best and most stable cross-sample consistency.

5.
This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small, except for the mean/sigma method. The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is also compared to the chained equipercentile equating, and we observe that the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating, with the exception of the low scores.

6.
Item parameter drift (IPD) occurs when item parameter values change from their original value over time. IPD may pose a serious threat to the fairness and validity of test score interpretations, especially when the goal of the assessment is to measure growth or improvement. In this study, we examined the effect of multidirectional IPD (i.e., some items become harder while other items become easier) on the linking procedure and rescaled proficiency estimates. The impact of different combinations of linking items with various multidirectional IPD on the test equating procedure was investigated for three scaling methods (mean-mean, mean-sigma, and TCC method) via a series of simulation studies. It was observed that multidirectional IPD had a substantive effect on examinees' scores and achievement level classifications under some of the studied conditions. Choice of linking method had a direct effect on the results, as did the pattern of IPD.
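For reference, a minimal sketch of the mean-mean and mean-sigma transformations named in the abstract (the TCC methods require an additional optimization and are omitted). The anchor-item estimates are hypothetical; the convention assumed is that parameters on the new form's scale are carried to the base scale via b* = A·b + B and a* = a/A:

```python
import numpy as np

# Hypothetical anchor-item estimates on the base (old) scale and on the new form's scale.
a_old = np.array([1.1, 0.9, 1.4]); b_old = np.array([-0.3, 0.2, 0.9])
a_new = np.array([1.0, 0.8, 1.3]); b_new = np.array([-0.6, -0.1, 0.6])

# Mean-sigma: slope from the SDs of the anchor difficulties, intercept from their means.
A_ms = b_old.std(ddof=1) / b_new.std(ddof=1)
B_ms = b_old.mean() - A_ms * b_new.mean()

# Mean-mean: slope from the ratio of mean anchor discriminations (new over old).
A_mm = a_new.mean() / a_old.mean()
B_mm = b_old.mean() - A_mm * b_new.mean()

# Placing new-form parameters (and abilities) on the base scale:
# theta* = A*theta + B, b* = A*b + B, a* = a / A.
b_rescaled = A_ms * b_new + B_ms
a_rescaled = a_new / A_ms
print(A_ms, B_ms, A_mm, B_mm)
```

Because both slopes and intercepts are simple functions of anchor means and SDs, drift in even a few anchor items shifts the constants directly, which is one way to see why multidirectional IPD propagates into rescaled scores.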

7.
Using several data sets, the authors examine the relative performance of the beta binomial model and two other more general strong true score models in estimating several indexes of classification consistency. It is shown that the beta binomial model can provide inadequate fits to raw score distributions compared to more general models. This lack of fit is reflected in differences in decision consistency indexes computed using the beta binomial model and the other models. It is recommended that the adequacy of a model in fitting the data be assessed before the model is used to estimate decision consistency indexes. When the beta binomial model does not fit the data, the more general models discussed here may provide an adequate fit and, in such cases, would be more appropriate for computing decision consistency indexes.
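One simple strong true score fit of the kind compared here is the two-parameter beta-binomial estimated by the method of moments, using KR-21 as the score reliability. The sketch below, with simulated data and a hypothetical 40-item test, also shows the kind of fitted-versus-observed comparison the authors recommend before computing decision consistency indexes. The shortcut used (model reliability = n / (n + a + b)) is an assumption of this illustration, not a description of the paper's procedure.

```python
import numpy as np
from scipy.stats import betabinom

def fit_beta_binomial(scores, n_items):
    """Method-of-moments fit of the two-parameter beta-binomial.

    Uses KR-21 as the reliability of the raw scores; the beta parameters follow
    from reliability = n / (n + a + b) and the observed mean.
    """
    mu, var = scores.mean(), scores.var(ddof=1)
    kr21 = n_items / (n_items - 1) * (1 - mu * (n_items - mu) / (n_items * var))
    ab = n_items * (1 - kr21) / kr21          # a + b
    a = ab * mu / n_items
    return a, ab - a

# Hypothetical raw scores on a 40-item test.
rng = np.random.default_rng(2)
scores = rng.binomial(40, rng.beta(6, 4, size=2000))
a, b = fit_beta_binomial(scores, 40)

# Compare fitted and observed score distributions before using the model
# to estimate decision consistency indexes (the paper's recommendation).
observed = np.bincount(scores, minlength=41) / len(scores)
fitted = betabinom.pmf(np.arange(41), 40, a, b)
print(np.abs(observed - fitted).max())
```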

8.
This article presents a method for estimating the accuracy and consistency of classifications based on test scores. The scores can be produced by any scoring method, including a weighted composite. The estimates use data from a single form. The reliability of the score is used to estimate effective test length in terms of discrete items. The true-score distribution is estimated by fitting a 4-parameter beta model. The conditional distribution of scores on an alternate form, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Agreement between classifications on alternate forms is estimated by assuming conditional independence, given the true score. Evaluation of the method showed estimates to be within 1 percentage point of the actual values in most cases. Estimates of decision accuracy and decision consistency statistics were only slightly affected by changes in specified minimum and maximum possible scores.
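A hedged sketch of the general computation described: given an effective test length implied by the reliability, a fitted four-parameter beta true-score distribution, and a binomial conditional distribution for scores on an alternate form, decision consistency and accuracy follow from the conditional independence assumption. All numeric values below are hypothetical, and the step that converts reliability into an effective test length is not shown.

```python
import numpy as np
from scipy.stats import beta, binom

# Hypothetical inputs: effective test length, fitted four-parameter beta
# true-score distribution, and a proportion-correct cut score.
n_eff = 40                                   # effective number of discrete items
a_par, b_par, lo, hi = 4.0, 3.0, 0.2, 1.0    # beta shapes and range
cut = 0.65                                   # proportion-correct passing standard

# Discretize the true proportion-correct scale and weight by the beta density.
tau = np.linspace(lo + 1e-6, hi - 1e-6, 2001)
dens = beta.pdf((tau - lo) / (hi - lo), a_par, b_par) / (hi - lo)
w = dens / dens.sum()

# Conditional probability of passing an alternate form, given the true score.
cut_raw = int(np.ceil(cut * n_eff))
p_pass = binom.sf(cut_raw - 1, n_eff, tau)   # P(X >= cut_raw | tau)

# Decision consistency: two alternate forms agree (pass-pass or fail-fail),
# assuming conditional independence given the true score.
consistency = np.sum(w * (p_pass**2 + (1 - p_pass)**2))

# Decision accuracy: observed classification matches the true classification.
truly_pass = tau >= cut
accuracy = np.sum(w * np.where(truly_pass, p_pass, 1 - p_pass))

print(round(consistency, 3), round(accuracy, 3))
```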

9.
Because traditional item response theory models cannot handle non-continuous data well, they limit the precision with which traits that have latent categorical attributes can be estimated. Mixture item response theory not only estimates item and ability parameters accurately, but can also automatically classify examinees according to latent traits and behaviors with different categorical attributes. As research has developed, the mixture partial credit model, the mixture logistic model, multilevel item response theory models, and mixture item response theory models with covariates have emerged in succession and have shown excellent performance in educational test analysis and construction, differential item functioning analysis, and other extended practical applications. Developing multidimensional mixture item response theory models, multidimensional mixture cognitive diagnosis models, and mixture testlet models, together with localized research on and application of these models, will be a major research focus and direction for mixture item response theory.

10.
Student growth percentiles (SGPs; Betebenner, 2009) are used to locate a student's current score in a conditional distribution based on the student's past scores. Currently, following Betebenner (2009), quantile regression (QR) is most often used operationally to estimate the SGPs. Alternatively, multidimensional item response theory (MIRT) may also be used to estimate SGPs, as proposed by Lockwood and Castellano (2015). A benefit of using MIRT to estimate SGPs is that techniques and methods already developed for MIRT may readily be applied to the specific context of SGP estimation and inference. This research adopts a MIRT framework to explore the reliability of SGPs. More specifically, we propose a straightforward method for estimating SGP reliability. In addition, we use this measure to study how SGP reliability is affected by two key factors: the correlation between prior and current latent achievement scores, and the number of prior years included in the SGP analysis. These issues are primarily explored via simulated data. In addition, the QR and MIRT approaches are compared in an empirical application.
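As an illustration of the quantile regression route to SGPs, the sketch below fits one conditional-quantile model per percentile and reads off each student's percentile from the highest conditional quantile not exceeding the observed current-year score. Operational SGP implementations condition on prior scores through B-spline bases rather than the linear model assumed here, and the simulated scores are purely illustrative.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

# Hypothetical prior-year and current-year scale scores for a group of students.
rng = np.random.default_rng(0)
prior = rng.normal(500, 50, size=(1000, 2))           # two prior years
current = 0.5 * prior.sum(axis=1) + rng.normal(0, 30, size=1000)

X = sm.add_constant(prior)                            # simple linear conditioning model
percentiles = np.arange(1, 100)

# Fit one quantile regression per percentile and record each student's
# predicted conditional quantile of the current-year score.
cond_q = np.empty((len(current), len(percentiles)))
for j, p in enumerate(percentiles):
    res = QuantReg(current, X).fit(q=p / 100.0)
    cond_q[:, j] = X @ res.params

# A student's SGP is (roughly) the highest percentile whose predicted
# conditional quantile does not exceed the student's observed current score.
sgp = (cond_q <= current[:, None]).sum(axis=1)
print(sgp[:10])
```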

11.
In this study we examined procedures for assessing model-data fit of item response theory (IRT) models for mixed format data. The model fit indices used in this study include PARSCALE's G², Orlando and Thissen's S-X² and S-G², and Stone's χ²* and G²*. To investigate the relative performance of the fit statistics at the item level, we conducted two simulation studies: Type I error and power studies. We evaluated the performance of the item fit indices for various conditions of test length, sample size, and IRT models. Among the competing measures, the summed score-based indices S-X² and S-G² were found to be a sensible and efficient choice for assessing model fit for mixed format data. These indices performed well, particularly with short tests. The pseudo-observed score indices, χ²* and G²*, showed inflated Type I error rates in some simulation conditions. Consistent with the findings of current literature, PARSCALE's G² index was rarely useful, although it provided reasonable results for long tests.
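For orientation, the summed score-based index of Orlando and Thissen is commonly written in the Pearson form below, where O_ik and E_ik are the observed and model-predicted proportions correct on item i among examinees with summed score k, and N_k is the number of such examinees; the expected proportions are typically obtained with the Lord-Wingersky recursion. This is quoted as a standard statement of the statistic, not from the article itself.

```latex
S\text{-}X^2_i \;=\; \sum_{k=1}^{n-1} N_k \,
  \frac{\bigl(O_{ik} - E_{ik}\bigr)^2}{E_{ik}\bigl(1 - E_{ik}\bigr)}
```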

12.
Compared with selective examinations, standards-based academic proficiency examinations place greater emphasis on assessing the learning achievement students have attained, thereby serving the function of monitoring instructional quality; this requires a high degree of alignment between academic proficiency test items and the curriculum standards. Starting from the basic principles of educational statistics, and drawing on a statistical analysis of the physics paper of the 2015 Shanghai senior high school academic proficiency examination, this study attempts, from a quantitative perspective, to establish a simpler and more reasonable model for describing the alignment of academic proficiency test items with the standards, and discusses how this alignment model can be put to use.

13.
The usefulness of item response theory (IRT) models depends, in large part, on the accuracy of item and person parameter estimates. For the standard 3-parameter logistic model, for example, these parameters include the item parameters of difficulty, discrimination, and pseudo-chance, as well as the person ability parameter. Several factors impact traditional marginal maximum likelihood (ML) estimation of IRT model parameters, including sample size, with smaller samples generally being associated with lower parameter estimation accuracy and inflated standard errors for the estimates. Given this deleterious impact of small samples on IRT model performance, estimation becomes difficult with the low-incidence populations for which these techniques might prove particularly useful, especially for more complex models. Recently, a Pairwise estimation method for Rasch model parameters has been suggested for use with missing data, and it may also hold promise for parameter estimation with small samples. This simulation study compared the item difficulty parameter estimation accuracy of ML with the Pairwise approach to ascertain the benefits of this latter method. The results support the use of the Pairwise method with small samples, particularly for obtaining item location estimates.
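A rough Choppin-style sketch of the pairwise idea: under the Rasch model, among persons who answer exactly one item of a pair correctly, the log of the count ratio estimates the difficulty difference, and averaging over comparison items yields centered difficulty estimates; missing responses simply drop out of the affected pairs. This illustrates the general pairwise principle, not the specific estimator studied in the article, and the +0.5 smoothing is an ad hoc guard against zero counts.

```python
import numpy as np

def pairwise_rasch_difficulties(X):
    """Rough pairwise (Choppin-style) estimate of Rasch item difficulties.

    X is a persons-by-items matrix of 0/1 responses; NaN marks a missing
    response, which removes that person from the affected item pairs.
    """
    n_items = X.shape[1]
    logit = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue
            ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            n_ij = np.sum((X[ok, i] == 1) & (X[ok, j] == 0))   # i right, j wrong
            n_ji = np.sum((X[ok, j] == 1) & (X[ok, i] == 0))   # j right, i wrong
            logit[i, j] = np.log((n_ij + 0.5) / (n_ji + 0.5))  # approx. b_j - b_i
    b = logit.mean(axis=0)        # average over comparison items i
    return b - b.mean()           # center the difficulty scale at zero

# Hypothetical small-sample use: 80 persons, 10 Rasch items, some missing data.
rng = np.random.default_rng(4)
theta = rng.normal(size=80)
b_true = np.linspace(-1.5, 1.5, 10)
p = 1 / (1 + np.exp(-(theta[:, None] - b_true[None, :])))
X = (rng.uniform(size=p.shape) < p).astype(float)
X[rng.uniform(size=X.shape) < 0.1] = np.nan      # 10% missing at random
print(pairwise_rasch_difficulties(X))
```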

14.
In judgmental standard setting procedures (e.g., the Angoff procedure), expert raters establish minimum pass levels (MPLs) for test items, and these MPLs are then combined to generate a passing score for the test. As suggested by Van der Linden (1982), item response theory (IRT) models may be useful in analyzing the results of judgmental standard setting studies. This paper examines three issues relevant to the use of IRT models in analyzing the results of such studies. First, a statistic for examining the fit of MPLs, based on judges' ratings, to an IRT model is suggested. Second, three methods for setting the passing score on a test based on item MPLs are analyzed; these analyses, based on theoretical models rather than empirical comparisons among the three methods, suggest that the traditional approach (i.e., setting the passing score on the test equal to the sum of the item MPLs) does not provide the best results. Third, a simple procedure, based on generalizability theory, for examining the sources of error in estimates of the passing score is discussed.
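To make the setup concrete, here is a hypothetical sketch of the traditional approach (passing score = sum of item MPLs, with MPLs averaged over judges) together with one alternative sometimes discussed in the IRT literature: mapping the summed MPLs through the test characteristic curve to a cut on the ability scale. The ratings and 2PL parameters are invented, and the three methods analyzed in the article are not reproduced here.

```python
import numpy as np

# Hypothetical Angoff ratings: judges x items, each entry the judged probability
# that a minimally competent examinee answers the item correctly (the item MPL).
ratings = np.array([
    [0.60, 0.45, 0.70, 0.55],
    [0.65, 0.50, 0.75, 0.50],
    [0.55, 0.40, 0.80, 0.60],
])

item_mpl = ratings.mean(axis=0)              # per-item MPL, averaged over judges
passing_score_sum = item_mpl.sum()           # traditional: sum of the item MPLs

# Alternative illustrated here: find the theta at which the test characteristic
# curve equals the summed MPLs, and treat that theta as the cut point.
a = np.array([1.1, 0.9, 1.3, 1.0])           # hypothetical 2PL item parameters
b = np.array([-0.2, 0.3, -0.6, 0.1])
grid = np.linspace(-4, 4, 801)
tcc = (1 / (1 + np.exp(-a * (grid[:, None] - b)))).sum(axis=1)
theta_cut = grid[np.argmin(np.abs(tcc - passing_score_sum))]

print(passing_score_sum, theta_cut)
```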

15.
Mental models are one way that humans represent knowledge (Markman, 1999). Instructional design (ID) is a conceptual model for developing instruction and typically includes analysis, design, development, implementation, and evaluation (i.e., ADDIE model). ID, however, has been viewed differently by practicing teachers and instructional designers (Kennedy, 1994). In a graduate ID course students constructed their own ID models. This study analyzed student models for (a) what ADDIE components were included (by teacher, nonteacher), and (b) model structural characteristics (by teacher, nonteacher). Participants included 178 students in 12 deliveries of a master's level ID course (115 teachers, 63 nonteachers). Our conceptual ID model is presented, and the ID model task is described. Students most frequently represented design, followed by program evaluation, needs assessment, development, and implementation. In terms of structural characteristics, 76 models were characterized as metaphoric, 61 dynamic, and 35 sequential. Three interrelated conclusions and implications for ID learning are offered.

16.
Disengaged item responses pose a threat to the validity of the results provided by large-scale assessments. Several procedures for identifying disengaged responses on the basis of observed response times have been suggested, and item response theory (IRT) models for response engagement have been proposed. We outline that response time-based procedures for classifying response engagement and IRT models for response engagement are based on common ideas, and we propose the distinction between independent and dependent latent class IRT models. In all IRT models considered, response engagement is represented by an item-level latent class variable, but the models assume that response times either reflect or predict engagement. We summarize existing IRT models that belong to each group and extend them to increase their flexibility. Furthermore, we propose a flexible multilevel mixture IRT framework in which all IRT models can be estimated by means of marginal maximum likelihood. The framework is based on the widespread Mplus software, thereby making the procedure accessible to a broad audience. The procedures are illustrated on the basis of publicly available large-scale data. Our results show that the different IRT models for response engagement provided slightly different adjustments of item parameters and individuals' proficiency estimates relative to a conventional IRT model.
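In their simplest form, the response time-based classification procedures referred to here reduce to a per-item threshold rule. The sketch below uses a fraction-of-median threshold as one common, purely illustrative choice; the latent class IRT models discussed in the article go well beyond this.

```python
import numpy as np

# Hypothetical response-time matrix (persons x items), in seconds.
rng = np.random.default_rng(3)
rt = rng.lognormal(mean=3.0, sigma=0.5, size=(500, 30))

# A simple normative-threshold rule: a response is flagged as disengaged
# (rapid guessing) if it is faster than some fraction of the item's median time.
threshold = 0.10 * np.median(rt, axis=0)     # one threshold per item
disengaged = rt < threshold                  # persons x items boolean flags

# Share of flagged responses per item and per person.
item_rate = disengaged.mean(axis=0)
person_rate = disengaged.mean(axis=1)
print(item_rate.mean(), person_rate.max())
```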

17.
This study sets out to assess the extent to which final-year students following a four year degree in education have a realistic sense of the strengths and weaknesses of an independent research study they are required to undertake. An analysis of the self-assessment grades and those of their tutors reveals a poor match, whilst the criteria students use to assess their studies emphasise lower order criteria such as style and presentation, and largely ignore higher order criteria concerned with theoretical and conceptual understanding and the quality of discussion. The paper considers why students appear not to have theorised their experiences, and why there is only limited evidence of a critical understanding of academic writing.

18.
Parameter estimation for item response theory models generally requires relatively large samples, and the relative advantages of parametric versus nonparametric item response theory models under small-sample conditions remain unsettled. Using computer-simulated data, this study compared the bias and root mean square error of the two classes of models in estimating item characteristic curves with small samples (n ≤ 200). When the simulated data were generated from the 3PL model, the parametric and nonparametric models showed no difference in estimation bias for sample sizes below 200, but the former had a smaller root mean square error. At a sample size of 200, the two models produced similar estimates. When the true data follow a 3PL model and the sample size is below 200, the parametric Rasch model is more worthy of recommendation than the nonparametric kernel-smoothing model.
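A minimal sketch of the nonparametric kernel-smoothing approach compared in the study: each examinee receives a provisional ability from the normal quantile of their total-score rank, and the item characteristic curve is a Nadaraya-Watson kernel average of the 0/1 responses (Ramsay-style). The bandwidth, item parameters, and simulated sample below are hypothetical, and Rasch-type data are used for simplicity rather than the 3PL data of the article.

```python
import numpy as np
from scipy.stats import norm, rankdata

def kernel_icc(X, item, h=0.3, grid=np.linspace(-3, 3, 61)):
    """Nadaraya-Watson kernel estimate of one item characteristic curve.

    Provisional abilities are normal quantiles of total-score ranks; the ICC at
    each grid point is a kernel-weighted average of that item's 0/1 responses.
    """
    total = X.sum(axis=1)
    ranks = rankdata(total) / (len(total) + 1)
    theta_hat = norm.ppf(ranks)
    w = norm.pdf((grid[:, None] - theta_hat[None, :]) / h)
    return (w * X[:, item]).sum(axis=1) / w.sum(axis=1)

# Hypothetical small-sample simulation (n = 150, 20 Rasch-type items).
rng = np.random.default_rng(1)
n, k = 150, 20
theta = rng.normal(size=n)
b_items = np.linspace(-2, 2, k)
p = 1 / (1 + np.exp(-(theta[:, None] - b_items[None, :])))
X = (rng.uniform(size=(n, k)) < p).astype(float)

icc_hat = kernel_icc(X, item=10)     # estimated curve for one mid-difficulty item
print(icc_hat[::10])
```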

19.
A practical concern for many existing tests is that subscore test lengths are too short to provide reliable and meaningful measurement. A possible method of improving the subscale reliability and validity would be to make use of collateral information provided by items from other subscales of the same test. To this end, the purpose of this article is to compare two different formulations of an alternative item response theory (IRT) model developed to parameterize unidimensional projections of multidimensional test items: Analytical and Empirical formulations. Two real data applications are provided to illustrate how the projection IRT model can be used in practice, as well as to further examine how ability estimates from the projection IRT model compare to external examinee measures. The results suggest that collateral information extracted by a projection IRT model can be used to improve the reliability and validity of subscale scores, which in turn can be used to provide diagnostic information about strengths and weaknesses of examinees, helping stakeholders to link instruction or curriculum to assessment results.

20.
Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple-choice tests administered in an annual, large-scale, high-stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one-parameter model and equated using the mean b-value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
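Under the one-parameter model, the mean b-value method mentioned in the abstract amounts to a single additive shift equal to the difference in mean anchor difficulties. The sketch below uses invented values and a five-item anchor to make visible how the shift, and therefore the equating, depends on which anchor items happen to be sampled.

```python
import numpy as np

# Hypothetical Rasch difficulty estimates for the anchor items on the
# previously calibrated (base) form and on the new form.
b_anchor_base = np.array([-0.8, -0.2, 0.1, 0.6, 1.0])
b_anchor_new  = np.array([-1.0, -0.4, 0.0, 0.4, 0.9])

# Mean b-value method: one additive constant places the new form on the base scale.
shift = b_anchor_base.mean() - b_anchor_new.mean()

# Apply the shift to all new-form difficulties; the same constant is added to
# new-form ability estimates, so proficiency rates inherit any anchor-sampling error.
b_new_form = np.array([-1.2, -0.5, 0.2, 0.7, 1.3, 1.8])
b_new_on_base = b_new_form + shift
print(shift, b_new_on_base)
```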
