Similar Documents
20 similar documents found (search time: 51 ms)
1.
《Educational Assessment》2013,18(2):125-146
States are implementing statewide assessment programs that classify students into proficiency levels that reflect state-defined performance standards. In an effort to provide support for score interpretations, this study examined the consistency of classifications based on competing item response theory (IRT) models for data from a state assessment program. Classification of students into proficiency levels was compared based on a 1-parameter vs. a 3-parameter IRT model. Despite an overall high level of agreement between classifications based on the 2 models, systematic differences were observed. Under the 1-parameter model, proficiency was underestimated for low proficiency classifications but overestimated for upper proficiency classifications. This resulted in higher "Below Basic" and "Advanced" classifications under 1-parameter vs. 3-parameter IRT applications. Implications of these differences are discussed.
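The divergence this abstract describes can be illustrated with a minimal sketch of the two logistic item response functions (the parameter values below are illustrative, not taken from the study):

```python
import math

def irt_prob(theta, a=1.0, b=0.0, c=0.0):
    """3-parameter logistic (3PL) item response function.
    With a = 1 and c = 0 it reduces to the 1-parameter (Rasch-type) model."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# For a low-ability examinee on a hard item, the 3PL guessing floor c
# keeps the predicted success probability well above the 1PL prediction:
theta = -2.0
p_1pl = irt_prob(theta, b=1.0)                 # no guessing parameter
p_3pl = irt_prob(theta, a=1.2, b=1.0, c=0.2)   # guessing floor c = 0.2
```

Aggregated over many items, differences of this kind are one mechanism by which the two models can disagree about proficiency classifications near the cut scores.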

2.
A practical concern for many existing tests is that subscore test lengths are too short to provide reliable and meaningful measurement. A possible method of improving the subscale reliability and validity would be to make use of collateral information provided by items from other subscales of the same test. To this end, the purpose of this article is to compare two different formulations of an alternative Item Response Theory (IRT) model developed to parameterize unidimensional projections of multidimensional test items: Analytical and Empirical formulations. Two real data applications are provided to illustrate how the projection IRT model can be used in practice, as well as to further examine how ability estimates from the projection IRT model compare to external examinee measures. The results suggest that collateral information extracted by a projection IRT model can be used to improve reliability and validity of subscale scores, which in turn can be used to provide diagnostic information about strengths and weaknesses of examinees, helping stakeholders to link instruction or curriculum to assessment results.

3.
This paper demonstrates, both theoretically and empirically, using both simulated and real test data, that sets of items can be selected that meet the unidimensionality assumption of most item response theory models even though they require more than one ability for a correct response. Sets of items that measure the same composite of abilities as defined by multidimensional item response theory are shown to meet the unidimensionality assumption. A method for identifying such item sets is also presented.

4.
This study adapted an effect size measure used for studying differential item functioning (DIF) in unidimensional tests and extended the measure to multidimensional tests. Two effect size measures were considered in a multidimensional item response theory model: signed weighted P‐difference and unsigned weighted P‐difference. The performance of the effect size measures was investigated under various simulation conditions including different sample sizes and DIF magnitudes. As another way of studying DIF, the χ2 difference test was included to compare the result of statistical significance (statistical tests) with that of practical significance (effect size measures). The adequacy of existing effect size criteria used in unidimensional tests was also evaluated. Both effect size measures worked well in estimating true effect sizes, identifying DIF types, and classifying effect size categories. Finally, a real data analysis was conducted to support the simulation results.
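As a rough sketch of the distinction between the two measures (using a generic unidimensional 2PL item rather than the article's multidimensional model), the signed weighted P-difference lets probability differences of opposite sign cancel, while the unsigned version accumulates their magnitudes:

```python
import math

def p2pl(theta, a, b):
    """2PL item response function (illustrative stand-in for the MIRT model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def weighted_p_difference(thetas, weights, ref, focal, signed=True):
    """Weighted average difference between reference- and focal-group
    response probabilities over a grid of ability points."""
    total = 0.0
    for theta, w in zip(thetas, weights):
        diff = p2pl(theta, *ref) - p2pl(theta, *focal)
        total += w * (diff if signed else abs(diff))
    return total / sum(weights)

# Crossing (nonuniform) DIF: equal difficulty, unequal discrimination.
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
wts = [0.05, 0.24, 0.42, 0.24, 0.05]
signed_es = weighted_p_difference(grid, wts, (1.0, 0.0), (1.5, 0.0), signed=True)
unsigned_es = weighted_p_difference(grid, wts, (1.0, 0.0), (1.5, 0.0), signed=False)
```

Here the signed measure is essentially zero because the curves cross at the center of the ability distribution, while the unsigned measure still flags the DIF; this is why the two versions are informative in combination.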

5.
Unidimensionality and local independence are two common assumptions of item response theory. The former implies that all items measure a common latent trait, while the latter implies that responses are independent, conditional on respondents’ location on the latent trait. Yet, few tests are truly unidimensional. Unmodeled dimensions may result in test items displaying dependencies, which can lead to misestimated parameters and inflated reliability estimates. In this article, we investigate the dimensionality of interim mathematics tests and evaluate the extent to which modeling minor dimensions in the data change model parameter estimates. We found evidence of minor dimensions, but parameter estimates across models were similar. Our results indicate that minor dimensions outside the primary trait have negligible consequences on parameter estimates. This finding was observed despite the ratio of multidimensional to unidimensional items being above previously recommended thresholds.

6.
A comparison of classical test theory (CTT) and item response theory (IRT) across four main areas, namely basic assumptions, quantification of test precision, the standard error of measurement, and test item selection, shows that IRT offers several advantages: ability estimates that do not depend on the particular items administered, a common scale for item difficulty and ability parameters, item parameter estimates that do not depend on the examinee sample, and precise estimation of measurement error. However, some IRT models still present problems to be solved: the unidimensionality assumption is difficult to satisfy, the required testing conditions are strict, and the mathematical models lack parsimony.

7.
The purpose of this study was to investigate multidimensional DIF with a simple and nonsimple structure in the context of the multidimensional Graded Response Model (MGRM). This study examined and compared the performance of the IRT-LR and Wald tests using MML-EM and MHRM estimation approaches with different test factors and test structures, in simulation studies and in applications to real data sets. When the test structure included two dimensions, the IRT-LR (MML-EM) generally performed better than the Wald test and provided higher power rates. If the test included three dimensions, the methods provided similar performance in DIF detection. In contrast to these results, when the number of dimensions in the test was four, MML-EM estimation completely lost precision in estimating the nonuniform DIF, even with large sample sizes. The Wald test with the MHRM estimation approach outperformed the Wald test (MML-EM) and the IRT-LR (MML-EM). The Wald test had higher power rates and acceptable Type I error rates for nonuniform DIF with the MHRM estimation approach. Small and/or unbalanced sample sizes, small DIF magnitudes, unequal ability distributions between groups, number of dimensions, estimation methods, and test structure were evaluated as important test factors for detecting multidimensional DIF.

8.
Item parameter drift (IPD) occurs when item parameter values change from their original value over time. IPD may pose a serious threat to the fairness and validity of test score interpretations, especially when the goal of the assessment is to measure growth or improvement. In this study, we examined the effect of multidirectional IPD (i.e., some items become harder while other items become easier) on the linking procedure and rescaled proficiency estimates. The impact of different combinations of linking items with various multidirectional IPD on the test equating procedure was investigated for three scaling methods (mean-mean, mean-sigma, and TCC method) via a series of simulation studies. It was observed that multidirectional IPD had a substantive effect on examinees' scores and achievement level classifications under some of the studied conditions. Choice of linking method had a direct effect on the results, as did the pattern of IPD.

9.
The results of an exploratory study into measurement of elementary mathematics ability are presented. The focus is on the abilities involved in solving standard computation problems on the one hand and problems presented in a realistic context on the other. The objectives were to assess to what extent these abilities are shared or distinct, and the extent to which students' language level plays a differential role in these abilities. Data from a sample of over 2,000 students from first, second, and third grade in the Netherlands were analyzed in a multidimensional item response theory (IRT) framework. The latent correlation between the two ability dimensions (computational skills and applied mathematics problem solving) ranged from .81 in grade 1 to .87 in grade 3, indicating that the ability dimensions are highly correlated but still distinct. Moreover, students' language level had differential effects on the two mathematical abilities: Effects were larger on applied problem solving than on computational skills. The implications of these findings for measurement practices in the field of elementary mathematics are discussed.

10.
As the paradigm of language assessment shifts from "assessment of teaching" to "assessment for teaching," research on aligning assessment with instruction becomes especially important. Under today's proficiency-oriented view of teaching, the main interface between assessment and instruction is the language proficiency descriptor. After analyzing proficiency descriptors exemplified by the Common European Framework of Reference for Languages (CEFR), this paper argues that, for proficiency descriptors to offer real instructional guidance, they should be combined with proficiency diagnosis to construct diagnostic proficiency descriptors suited to targeted, individualized teaching. Building on the design rationale of the Diagnostic Scale of Primary One Chinese Oral Proficiency in Singapore (《新加坡小学一年级华文口语能力诊断量表》), and drawing on oral-proficiency diagnosis and targeted teaching practice with Singapore Primary One students, the paper presents a set of intervention strategies for developing oral proficiency in the lower primary grades, illustrating how proficiency descriptors can be connected to Chinese language teaching and assessment.

11.
A new procedure for generating instructionally relevant diagnostic feedback is proposed. The approach involves first constructing a strong model of student proficiency and then testing whether individual students' observed item response vectors are consistent with that model. Diagnoses are specified in terms of the combinations of skills needed to score at increasingly higher levels on a test's reported score scale. The approach is applied to the problem of developing diagnostic feedback for the SAT I Verbal Reasoning test. Using a variation of Wright's (1977) person-fit statistic, it is shown that the estimated proficiency model accounts for 91% of the "explainable" variation in students' observed item response vectors.

12.
Testlet effects can be taken into account by incorporating specific dimensions in addition to the general dimension into the item response theory model. Three such multidimensional models are described: the bi-factor model, the testlet model, and a second-order model. It is shown how the second-order model is formally equivalent to the testlet model. In turn, both models are constrained bi-factor models. Therefore, the efficient full maximum likelihood estimation method that has been established for the bi-factor model can be modified to estimate the parameters of the two other models. An application on a testlet-based international English assessment indicated that the bi-factor model was the preferred model for this particular data set.

13.
Although reliability of subscale scores may be suspect, subscale scores are the most common type of diagnostic information included in student score reports. This research compared methods for augmenting the reliability of subscale scores for an 8th-grade mathematics assessment. Yen's Objective Performance Index, Wainer et al.'s augmented scores, and scores based on multidimensional item response theory (IRT) models were compared and found to improve the precision of the subscale scores. However, the augmented subscale scores were found to be more highly correlated and less variable than unaugmented scores. The meaningfulness of reporting such augmented scores as well as the implications for validity and test development are discussed.
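Although the article compares several augmentation methods, the simplest ingredient, Kelley's regressed true-score estimate, already shows why augmented subscores become less variable (a minimal sketch; the subscale reliability and group mean below are made up for illustration):

```python
def kelley_score(observed, group_mean, reliability):
    """Kelley's regressed estimate: shrink an observed subscore toward the
    group mean in proportion to its unreliability. Wainer et al.'s augmentation
    generalizes this by also borrowing strength from correlated subscores."""
    return reliability * observed + (1.0 - reliability) * group_mean

# A short, unreliable subscale (reliability .60) is pulled strongly to the mean:
raw = [10.0, 20.0, 30.0]
augmented = [kelley_score(x, 20.0, 0.60) for x in raw]
```

The augmented values (14, 20, 26) span a narrower range than the raw ones, which mirrors the finding that augmented subscale scores are less variable and more strongly intercorrelated than unaugmented scores.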

14.
A schoolwide language assists the mapping of subject content, supports teachers to discuss teaching and learning issues and enables a shared understanding of school leadership to emerge, so that student learning is improved. This article presents findings from a mixed methods study investigating Leadership for Learning (LfL) in independent schools in the state of New South Wales, Australia. By being intentional about the language used for learning and leadership, schools are more likely to establish LfL as a community-wide activity that is inclusive, collaborative and distributed. These findings also reinforce the critical role played by school principals in leading learning.

15.
Lord's Wald test for differential item functioning (DIF) has not been studied extensively in the context of the multidimensional item response theory (MIRT) framework. In this article, Lord's Wald test was implemented using two estimation approaches, marginal maximum likelihood estimation and Bayesian Markov chain Monte Carlo estimation, to detect uniform and nonuniform DIF under MIRT models. The Type I error and power rates for Lord's Wald test were investigated under various simulation conditions, including different DIF types and magnitudes, different means and correlations of two ability parameters, and different sample sizes. Furthermore, English usage data were analyzed to illustrate the use of Lord's Wald test with the two estimation approaches.
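For a single item parameter, Lord's Wald statistic has a simple closed form (the scalar two-group case is sketched below; the MIRT version in the article replaces scalars with parameter vectors and a covariance matrix):

```python
def lord_wald(est_ref, est_focal, var_ref, var_focal):
    """Lord's Wald chi-square for one item parameter: the squared
    between-group difference divided by the sum of the two groups'
    sampling variances. Under no DIF it is approximately chi-square
    with one degree of freedom."""
    return (est_ref - est_focal) ** 2 / (var_ref + var_focal)

# Hypothetical difficulty estimates: b = 0.50 (reference) vs b = 1.00 (focal)
w = lord_wald(0.50, 1.00, 0.02, 0.03)   # 0.25 / 0.05, approximately 5.0
```

Comparing w to the chi-square critical value (3.84 at the .05 level with one degree of freedom) gives the significance decision; the MML-EM versus MCMC distinction in the article concerns how the estimates and their sampling variances are obtained, not the form of the statistic.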

16.
This research derived information functions and proposed new scalar information indices to examine the quality of multidimensional forced choice (MFC) items based on the RANK model. We also explored how GGUM‐RANK information, latent trait recovery, and reliability varied across three MFC formats: pairs (two response alternatives), triplets (three alternatives), and tetrads (four alternatives). As expected, tetrad and triplet measures provided substantially more information than pairs, and MFC items composed of statements with high discrimination parameters were most informative. The methods and findings of this study will help practitioners to construct better MFC items, make informed projections about reliability with different MFC formats, and facilitate the development of MFC triplet‐ and tetrad‐based computerized adaptive tests.

17.
A contrast is proposed between the two literacy constructs of reading comprehension and locating information in text. A pragmatic approach to identifying important classes of literacy tasks reveals that locating information is prevalent in school and work. This process is a form of strategic reading that differs from reading comprehension or recall of prose by being more goal directed, more selective in the use of text, and less dependent on declarative knowledge. The process of locating text information requires: formulation of a goal, selection of a category of text for inspection, extraction of relevant details, and recycling to obtain solutions. Relationships of locating information in text to models of problem solving, studying, and reading comprehension are discussed and research implications are suggested.

18.
This study examined the extent to which literacy is a unitary construct, the differences between literacy and general language competence, and the relative roles of teachers and students in predicting literacy outcomes. Much of past research failed to make a distinction between variability in outcomes for individual students and variability for outcomes in the classrooms students share (i.e., the classroom level). Utilizing data from 1,342 students in 127 classrooms in Grades 1 to 4 in 17 high-poverty schools, confirmatory factor models were fit with single- and two-factor structures at both student and classroom levels. Results support a unitary literacy factor for reading and spelling, with the role of phonological awareness as an indicator of literacy declining across the grades. Writing was the least related to the literacy factor but the most impacted by teacher effects. Language competence was distinct at the student level but perfectly correlated with literacy at the classroom level. Implications for instruction and assessment of reading comprehension are discussed.

19.
20.
We report a multidimensional test that examines middle grades teachers’ understanding of fraction arithmetic, especially multiplication and division. The test is based on four attributes identified through an analysis of the extensive mathematics education research literature on teachers’ and students’ reasoning in this content area. We administered the test to a national sample of 990 in‐service middle grades teachers and analyzed the item responses using the log‐linear cognitive diagnosis model. We report the diagnostic quality of the test at the item level, mastery classifications for teachers, and attribute relationships. Our results demonstrate that, when a test is grounded in research on cognition and is designed to be multidimensional from the onset, it is possible to use diagnostic classification models to detect distinct patterns of attribute mastery.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司) | 京ICP备09084417号