Found 20 similar documents (search time: 15 ms)
1.
2.
Empirical studies demonstrated Type-I error (TIE) inflation (especially for highly discriminating easy items) of the Mantel-Haenszel chi-square test for differential item functioning (DIF) when data conformed to item response theory (IRT) models more complex than the Rasch model, and when IRT proficiency distributions differed only in means. However, no published study manipulated the proficiency variance ratio (VR). Data were generated with the three-parameter logistic (3PL) IRT model. Proficiency VRs were 1, 2, 3, and 4. The present study suggests inflation may be greater, and may affect all highly discriminating items (low, moderate, and high difficulty), when the IRT proficiency distributions of the reference and focal groups also differ in variances. Inflation was greatest on the 21-item test (vs. 41) and at a total sample size of 2,000 (vs. 1,000). Previous studies had not systematically examined sample size ratio. A sample size ratio of 1:1 produced greater TIE inflation than 3:1, but primarily for a total sample size of 2,000.
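For intuition about the statistic this abstract refers to, the Mantel-Haenszel chi-square for DIF can be sketched as below. This is a minimal illustration with hypothetical seeded data (matching on raw total score), not the simulation code of the study itself.

```python
import numpy as np

def mantel_haenszel_chi2(correct, group, total_score):
    """Continuity-corrected Mantel-Haenszel chi-square for one item.

    correct     : 0/1 item responses
    group       : 0 = reference group, 1 = focal group
    total_score : matching criterion (e.g. raw test score)
    """
    correct = np.asarray(correct)
    group = np.asarray(group)
    total_score = np.asarray(total_score)
    obs, expected, var = 0.0, 0.0, 0.0
    for s in np.unique(total_score):          # one 2x2 table per score stratum
        m = total_score == s
        a = np.sum((correct == 1) & (group == 0) & m)   # reference, correct
        b = np.sum((correct == 0) & (group == 0) & m)   # reference, incorrect
        c = np.sum((correct == 1) & (group == 1) & m)   # focal, correct
        d = np.sum((correct == 0) & (group == 1) & m)   # focal, incorrect
        n = a + b + c + d
        if n < 2:
            continue
        obs += a
        expected += (a + b) * (a + c) / n               # E[a] under no DIF
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n**2 * (n - 1))
    return (abs(obs - expected) - 0.5) ** 2 / var

# Hypothetical data: no DIF built in, so the statistic should be unremarkable.
rng = np.random.default_rng(0)
n = 400
group = rng.integers(0, 2, n)
correct = rng.integers(0, 2, n)
total = rng.integers(0, 5, n)
chi2 = mantel_haenszel_chi2(correct, group, total)
```

Under the null hypothesis the statistic is referred to a chi-square distribution with 1 degree of freedom; the inflation the study reports means this reference distribution rejects too often under certain 3PL conditions.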
3.
《教育实用测度》2013,26(1):11-22
Previous research has provided conflicting findings about whether allowing the use of calculators changes the difficulty of mathematics tests or the time needed to complete the tests. Because the interpretation of results from standardized tests via norm tables depends on standardized conditions, the impact of allowing or not allowing examinees to use calculators while taking such tests would need to be specified as part of the standardizing condition. This article examines four item types that may perform differently under different conditions of calculator use. This article also examines the effect of testing under calculator and noncalculator conditions on testing time, reliability, item difficulty, and item discrimination.
4.
Rater effects and rating scales are central issues in the study of scoring error on constructed-response items. Taking Question 38 (an essay question) from the politics paper of the 2006 college entrance examination (Shanghai) as an example, this paper applies the rater-effects model in ACER ConQuest. The results show that this question exhibited essentially no scoring errors such as ambiguity, central tendency, or restriction of range, and that raters distinguished examinees' different behavioral characteristics reasonably well. Apart from a few raters whose scoring consistency needs improvement, differences in rater severity and leniency were pronounced. The authors therefore propose a method of adjusting test scores according to rater severity.
5.
6.
7.
《Journal of Experimental Education》2012,80(1):9-19
Covariance structure analysis provides a useful methodology for testing hypotheses about competing structural models. The chi-square goodness-of-fit test is the basic test for model evaluation. However, methodologists are particularly concerned about the validity of the test for detecting misspecified models in small samples. At the same time, there is concern about rejecting models with reasonably good fit in large samples. The present Monte Carlo study examined the validity of the chi-square test in different instances of misspecification and sample size. The usefulness of the chi-square difference statistic for comparing competing structures and improvement in fit is also addressed.
8.
Katherine E. Ryan 《Journal of Educational Measurement》1991,28(4):325-337
This study examined the reliability of the Mantel-Haenszel indexes across different samples of test takers as well as across sample sizes and investigated whether these indexes are robust to item context effects. Mathematics data from the Second International Mathematics Study (SIMS; 1985) for U.S. eighth-grade students were analyzed. The results suggest that the MH D-DIF is robust to item context effects. However, larger sample sizes than those used in this investigation (N = 141-167 for the focal group) may be necessary to obtain stable estimates from the Mantel-Haenszel procedure.
9.
Brian Clauser Kathleen M. Mazor Ronald K. Hambleton 《Journal of Educational Measurement》1994,31(1):67-78
Previous research examining the effects of reducing the number of score groups used in the matching criterion of the Mantel-Haenszel procedure, when screening for DIF, has produced ambiguous results. The goal of this study was to resolve the ambiguity by examining the problem with a simulated data set. The main results from this study call into question the preliminary recommendations of several other researchers that four or more score groups are sufficient and produce stable results. Although considerable stability and very little Type I error was noted with equal ability distribution comparisons, with unequal ability distributions the Type I error rate was substantially inflated. These results argue against the appropriateness of implementing the procedure by collapsing score groups. The current data suggest that more than modest reductions in the number of score groups cannot be recommended when the ability distributions of the reference and focal groups differ.
10.
Multilevel models are an increasingly popular method to analyze data that originate from a clustered or hierarchical structure. To effectively utilize multilevel models, one must have an adequately large number of clusters; otherwise, some model parameters will be estimated with bias. The goals of this paper are to (1) raise awareness of the problems associated with a small number of clusters, (2) review previous studies on multilevel models with a small number of clusters, (3) provide an illustrative simulation to demonstrate how a simple model becomes adversely affected by small numbers of clusters, (4) provide researchers with remedies if they encounter clustered data with a small number of clusters, and (5) outline methodological topics that have yet to be addressed in the literature.
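The small-cluster problem the abstract describes can be illustrated with a toy simulation. The sketch below assumes a hypothetical two-level model y_ij = u_j + e_ij with u_j ~ N(0, tau²) and e_ij ~ N(0, sigma²), and uses a simple method-of-moments estimator of the between-cluster variance rather than the likelihood-based estimators reviewed in the paper; the point is only that the estimate becomes far less stable with few clusters.

```python
import numpy as np

def sim_between_var(n_clusters, n_per, tau2=1.0, sigma2=1.0, reps=2000, seed=0):
    """Monte Carlo distribution of a method-of-moments estimate of tau^2.

    Returns (mean, SD) of the estimates across replications; a large SD
    signals an unstable variance-component estimate.
    """
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for r in range(reps):
        u = rng.normal(0.0, np.sqrt(tau2), n_clusters)           # cluster effects
        y = u[:, None] + rng.normal(0.0, np.sqrt(sigma2), (n_clusters, n_per))
        cluster_means = y.mean(axis=1)
        within = y.var(axis=1, ddof=1).mean()                    # estimates sigma^2
        # Var(cluster mean) = tau^2 + sigma^2 / n_per, so subtract the
        # sampling part to isolate the between-cluster component.
        est[r] = cluster_means.var(ddof=1) - within / n_per
    return est.mean(), est.std()

few_mean, few_sd = sim_between_var(n_clusters=5, n_per=20)    # very few clusters
many_mean, many_sd = sim_between_var(n_clusters=50, n_per=20)
```

With only 5 clusters the spread of the tau² estimates is several times larger than with 50 clusters, even though the per-cluster sample size is identical, which is the core of the argument above.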
11.
《教育心理学家》2013,48(1):99-110
Based on reviews by Glass, Cahen, Smith, and Filby (1982) and the Educational Research Service (1978), Cooper (this issue) concludes that substantial reductions in class size can have important effects on low-achieving students in the early grades. This article critiques these reviews and summarizes the findings of experimental studies that compared the achievement levels of elementary school students in larger classes to classes with no more than 20 students. Even in studies that made such substantial reductions, achievement differences were slight, averaging only 13% of a standard deviation. Not until class size approaches one is there evidence of meaningful effects. Based on this and other evidence, it is suggested that Chapter 1 programs provide one-to-one tutoring in reading rather than providing small-group pullouts or reducing overall class size.
12.
The Effects of Team Heterogeneity, Size, Stage, and Type on the Innovation Performance of Discipline-Based Teams
刘惠琴 《清华大学教育研究》2008,29(4):83-90
This paper first proposes theoretical hypotheses about the effects of team heterogeneity, size, developmental stage, and type on the innovation performance of discipline-based teams, and then tests them with a questionnaire survey of 660 faculty members in 86 discipline-based teams at Chinese universities, in order to provide decision support for building innovation teams. The study finds that team heterogeneity, developmental stage, and team type are correlated with team innovation performance, whereas team size shows no significant correlation with it.
13.
Judith A. Spray Terry A. Ackerman Mark D. Reckase James E. Carlson 《Journal of Educational Measurement》1989,26(3):261-271
Studies that have investigated differences in examinee performance on items administered in paper-and-pencil form or on a computer screen have produced equivocal results. Certain item administration procedures were hypothesized to be among the most important variables causing differences in item performance and ultimately in test scores obtained from these different administration media. A study where these item administration procedures were made as identical as possible for each presentation medium is described. In addition, a methodology is presented for studying the difficulty and discrimination of items under each presentation medium as a post hoc procedure.
14.
《教育实用测度》2013,26(4):329-349
The logistic regression (LR) procedure for differential item functioning (DIF) detection is a model-based approach designed to identify both uniform and nonuniform DIF. However, this procedure tends to produce inflated Type I errors. This outcome is problematic because it can result in the inefficient use of testing resources, and it may interfere with the study of the underlying causes of DIF. Recently, an effect size measure was developed for the LR DIF procedure and a classification method was proposed. However, the effect size measure and classification method have not been systematically investigated. In this study, we developed a new classification method based on those established for the Simultaneous Item Bias Test. A simulation study also was conducted to determine if the effect size measure affects the Type I error and power rates for the LR DIF procedure across sample sizes, ability distributions, and percentage of DIF items included on a test. The results indicate that the inclusion of the effect size measure can substantially reduce Type I error rates when large sample sizes are used, although there is also a reduction in power.
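The LR DIF procedure compares nested logistic models for a single item: total score only versus total score plus group and a score-by-group interaction, so the 2-df likelihood-ratio statistic captures uniform and nonuniform DIF together. The sketch below is a self-contained illustration with a hand-rolled Newton-Raphson logistic fit and hypothetical seeded data; the effect-size measure and classification method the study evaluates are not reproduced here.

```python
import numpy as np

def fit_logit(X, y, iters=50):
    """Newton-Raphson logistic regression; returns (coefficients, log-likelihood)."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend intercept
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (y - p)
        H = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        b += np.linalg.solve(H, grad)
    p = 1.0 / (1.0 + np.exp(-X @ b))
    ll = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return b, ll

def lr_dif(y, score, group):
    """LR DIF statistic: 2*(ll_full - ll_base), chi-square(2) under no DIF."""
    _, ll0 = fit_logit(score[:, None], y)                          # matching only
    _, ll1 = fit_logit(np.column_stack([score, group, score * group]), y)
    return 2.0 * (ll1 - ll0)

# Hypothetical DIF-free item: response depends on score but not on group.
rng = np.random.default_rng(2)
n = 500
score = rng.normal(size=n)
group = rng.integers(0, 2, n).astype(float)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-score))).astype(float)
stat = lr_dif(y, score, group)
```

Because no DIF was simulated, the statistic is an ordinary chi-square(2) draw; the study's point is that over many items and replications this null distribution rejects more often than the nominal rate unless the effect-size screen is added.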
15.
Factor mixture models are designed for the analysis of multivariate data obtained from a population consisting of distinct latent classes. A common factor model is assumed to hold within each of the latent classes. Factor mixture modeling involves obtaining estimates of the model parameters, and may also be used to assign subjects to their most likely latent class. This simulation study investigates aspects of model performance such as parameter coverage and correct class membership assignment and focuses on covariate effects, model size, and class-specific versus class-invariant parameters. When fitting true models, parameter coverage is good for most parameters even for the smallest class separation investigated in this study (0.5 SD between 2 classes). The same holds for convergence rates. Correct class assignment is unsatisfactory for the small class separation without covariates, but improves dramatically with increasing separation, covariate effects, or both. Model performance is not influenced by the differences in model size investigated here. Class-specific parameters may improve some aspects of model performance but negatively affect other aspects.
16.
郑华 《宁夏师范学院学报》2010,31(3):36-39
Building on an analysis of the minimum-error-probability blind equalization algorithm proposed by Li Daoben (李道本) and Chen Shaoxia (陈少霞), this paper implements a new minimum-error-probability blind equalization algorithm with a Newton-gradient variable step size. Simulation results show that, compared with the fixed-step blind equalization algorithm, the new algorithm converges faster and achieves a smaller mean squared error, giving it practical value in equalization applications.
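The minimum-error-probability algorithm itself is not reproduced here. As generic intuition for what a blind equalizer does, the sketch below runs a fixed-step constant-modulus (CMA) stochastic-gradient equalizer on a hypothetical BPSK signal through a simple two-tap ISI channel; the variable-step Newton-gradient refinement the paper describes would replace the fixed `mu` with an adaptively chosen step.

```python
import numpy as np

rng = np.random.default_rng(3)
s = rng.choice([-1.0, 1.0], size=4000)        # BPSK symbols (unknown to receiver)
x = np.convolve(s, [1.0, 0.4])[: len(s)]      # two-tap ISI channel

taps = 5
w = np.zeros(taps)
w[taps // 2] = 1.0                            # center-spike initialization
mu = 0.005                                    # fixed step size (the paper varies this)
cost = []
for k in range(taps, len(x)):
    xk = x[k - taps:k][::-1]                  # equalizer input window
    y = w @ xk                                # equalizer output
    cost.append((y * y - 1.0) ** 2)           # constant-modulus cost, R = 1
    w += mu * y * (1.0 - y * y) * xk          # gradient-descent update on the cost

early = float(np.mean(cost[:500]))
late = float(np.mean(cost[-500:]))
```

The cost falls as the taps converge toward the channel inverse, which is the behavior a faster variable-step scheme is meant to accelerate.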
17.
Joe Hauptman 《Teaching Statistics》2004,26(2):46-48
This article describes a simple computer program which graphically demonstrates both Type I and Type II statistical errors.
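The idea behind such a demonstration can be sketched without graphics: simulate a test many times under the null and under an alternative, and report the rejection rates. This is a generic Monte Carlo illustration (one-sided z-test, known sigma), not the program the article describes.

```python
import numpy as np

def rejection_rate(mu_true, n=30, reps=20000, seed=1):
    """Monte Carlo rejection rate of a one-sided z-test of H0: mu = 0
    against H1: mu > 0 at alpha = 0.05, with known sigma = 1.

    When mu_true == 0 this estimates the Type I error rate; otherwise it
    estimates the power, whose complement is the Type II error rate.
    """
    z_crit = 1.6448536269514722               # 95th percentile of N(0, 1)
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_true, 1.0, size=(reps, n))
    z = x.mean(axis=1) * np.sqrt(n)           # z statistic per replication
    return float(np.mean(z > z_crit))

type_i = rejection_rate(0.0)                  # should hover near alpha = 0.05
power = rejection_rate(0.5)                   # rejection rate when H0 is false
type_ii = 1.0 - power
```

A graphical version would simply plot the two sampling distributions of z with the critical value marked, shading the Type I tail under H0 and the Type II region under H1.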
18.
19.
Victor L. Willson 《Journal of Educational Measurement》1989,26(2):103-119
Performance on items in intelligence and achievement tests can be represented in terms of child development and information processes. Research is reviewed on item performance that supports developmental and information processing effects, particularly in children. Some suggestions for item development in intelligence and achievement tests are presented.
20.
《教育实用测度》2013,26(1):31-57
Examined in this study were the effects of test length and sample size on the alternate forms reliability and the equating of simulated mathematics tests composed of constructed-response items scaled using the 2-parameter partial credit model. Test length was defined in terms of the number of both items and score points per item. Tests with 2, 4, 8, 12, and 20 items were generated, and these items had 2, 4, and 6 score points. Sample sizes of 200, 500, and 1,000 were considered. Precise item parameter estimates were not found when 200 cases were used to scale the items. To obtain acceptable reliabilities and accurate equated scores, the findings suggested that tests should have at least eight 6-point items or at least 12 items with 4 or more score points per item. 相似文献