Similar Documents
20 similar documents found (search time: 15 ms)
1.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.
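Under a unidimensional IRT model, the conditional standard error of measurement at a given ability level is the inverse square root of the test information there. A minimal sketch, assuming a 2PL model with made-up item parameters (not the exams analyzed in the article) and one common definition of marginal reliability:

```python
import numpy as np

# Hypothetical 2PL item parameters (a = discrimination, b = difficulty);
# illustrative values, not estimates from the exams in the article.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

def test_information(theta):
    """Fisher information of the 2PL test at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(a**2 * p * (1.0 - p))

# Conditional SEM = inverse square root of test information.
for th in np.linspace(-3, 3, 7):
    csem = 1.0 / np.sqrt(test_information(th))
    print(f"theta = {th:+.1f}   CSEM = {csem:.3f}")

# Marginal reliability under one common definition:
# true variance / (true variance + mean error variance), with var(theta) = 1.
grid = np.linspace(-4, 4, 161)
w = np.exp(-grid**2 / 2)
w /= w.sum()                                   # N(0,1) quadrature weights
mean_err_var = np.sum(w / np.array([test_information(t) for t in grid]))
print("marginal reliability ~", round(1.0 / (1.0 + mean_err_var), 3))
```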

2.
Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non-Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real-data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth-related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinees to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non-Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.
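The contrast between non-Bayes maximum likelihood and Bayesian estimators can be made concrete with a grid computation for a single response pattern; a sketch under an assumed 2PL model with hypothetical parameters (not the data from these examples), showing the shrinkage that drives such differences in score distributions:

```python
import numpy as np

# Hypothetical 2PL parameters and one observed response pattern
# (illustrative only; not the operational data from the examples).
a = np.array([1.0, 1.3, 0.7, 1.1])
b = np.array([-0.8, 0.0, 0.4, 1.2])
x = np.array([1, 1, 0, 1])

grid = np.linspace(-4, 4, 801)

def loglik(theta):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

ll = np.array([loglik(t) for t in grid])

# Non-Bayes maximum likelihood: the ability that maximizes the likelihood.
theta_ml = grid[np.argmax(ll)]

# Bayesian EAP: posterior mean under a standard normal prior.
posterior = np.exp(ll) * np.exp(-grid**2 / 2)
posterior /= posterior.sum()
theta_eap = np.sum(grid * posterior)

print(f"ML  estimate: {theta_ml:+.3f}")
print(f"EAP estimate: {theta_eap:+.3f}   (shrunk toward the prior mean, 0)")
```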

3.
Angus Duff, Educational Psychology, 2004, 24(5): 699–709
Given the psychometric limitations of existing measures of Kolb's experiential learning model (ELM), two new scales of learning styles have been developed. The validity of these scales has been supported in samples of undergraduate and MBA students in the USA. This paper provides evidence of some psychometric properties of scores yielded by these scales using samples of undergraduate students in the UK. Only limited support is found for the internal consistency reliability and construct validity of scores produced by the scales. However, an item attrition exercise identifies a two‐factor solution providing an acceptable fit to the data. The scales are reported as being positively correlated with academic performance and prior academic achievement. Despite the mixed evidence, we suggest further development of the scales is warranted to create a psychometrically sound measure of the ELM.

4.
Touch screen tablets are increasingly used in schools for learning and assessment. However, the validity and reliability of assessments delivered via tablets are largely unknown. The present study tested the psychometric properties of a tablet-based app designed to measure early literacy skills. Tablet-based tests were also compared with traditional paper-based tests. Children aged 2–6 years (N = 99) completed receptive tests delivered via a tablet for letter, word, and numeral skills. The same skills were tested with a traditional paper-based test that used an expressive response format. Children (n = 35) were post-tested 8 weeks later to examine the stability of test scores over time. The tablet test scores showed high internal consistency (all αs > .94), acceptable test-retest reliability (ICC range = .39–.89), and were correlated with child age, family SES, and home literacy teaching, indicating good predictive validity. The agreement between scores for the tablet and traditional tests was high (ICC range = .81–.94). The tablet tests provide valid and reliable measures of children's early literacy skills. The strong psychometric properties and ease of use suggest that tablet-based tests of literacy skills have the potential to improve assessment practices for research purposes and classroom use.

5.
The answer-until-correct (AUC) method of multiple-choice (MC) testing involves test respondents making selections until the keyed answer is identified. Despite attendant benefits that include improved learning, broad student adoption, and facile administration of partial credit, the use of AUC methods for classroom testing has been extremely limited. This study presents scoring properties and item analysis for 26 AUC university course examinations, administered using a commercial scratch-card response system. Here, we show that beyond the traditional pedagogical advantages of AUC, the availability of partial credit adds psychometric advantages by boosting both mean item discrimination and overall test-score reliability, compared to tests scored dichotomously upon initial response. Furthermore, we find a strong correlation between students' initial-response successes and the likelihood that they obtain partial credit when their initial responses are incorrect. Thus, partial credit is granted based on partial knowledge that remains latent in traditional MC tests. The fact that these advantages are realized in real-life classroom tests may motivate further expansion of the use of AUC MC tests in higher education.
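The discrimination gain from AUC partial credit can be illustrated with a toy simulation; the credit scheme and response model below are assumptions made for illustration, not the scoring rule of the commercial scratch-card system:

```python
import numpy as np

# Toy AUC simulation: higher-ability examinees succeed earlier, and, after
# a miss, recover in fewer extra selections (partial knowledge). The credit
# scheme below is hypothetical, not the scratch-card system's actual rule.
rng = np.random.default_rng(0)
n_examinees, n_items = 200, 26
ability = rng.normal(size=(n_examinees, 1))
difficulty = rng.normal(size=(1, n_items))
p = 1.0 / (1.0 + np.exp(-(ability - difficulty)))  # P(correct on first try)

first_try = rng.random((n_examinees, n_items)) < p
extra = rng.binomial(2, 1.0 - p)                   # extra tries after a miss
attempts = np.where(first_try, 1, 2 + extra)       # 1..4 selections per item

credit = np.array([1.0, 0.5, 0.25, 0.0])           # credit by attempt number
partial = credit[attempts - 1]
dichotomous = first_try.astype(float)

def mean_discrimination(scores):
    """Average corrected item-total correlation across items."""
    total = scores.sum(axis=1)
    rs = [np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
          for j in range(scores.shape[1])]
    return float(np.mean(rs))

print("mean discrimination, dichotomous:", round(mean_discrimination(dichotomous), 3))
print("mean discrimination, AUC partial:", round(mean_discrimination(partial), 3))
```

Because later-attempt success is tied to ability in this simulation, the partial-credit scores carry the latent partial knowledge the authors describe, and the corrected item-total correlations rise accordingly.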

6.
ANGUS DUFF, Educational Psychology, 2003, 23(2): 123–139
This investigation: first, examines some psychometric properties of the scores obtained on a 30-item short form of the Revised Approaches to Studying Inventory (RASI) using samples of postgraduate management (MBA) students (n=75); second, examines the relationship between scores on the three dimensions of the RASI and the background variables of age, gender and prior educational experience; third, tests for any relationship between the background variables and academic performance as measured by four distinct types of assessment; and fourth, examines the relationship between scores on the three dimensions of the RASI and academic performance. No previous published work has examined the approaches to learning of postgraduate business students. Key findings include: the instrument has satisfactory psychometric properties; and scores obtained on the RASI using samples of MBA students are good predictors of academic performance in continuous assessment tasks but poor predictors of performance in examinations and oral presentations.

7.
Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions containing both images and answer lists. Subjects were randomly assigned to one version. Answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross‐sectional images resulted in greater item difficulty and item discrimination, compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores are dependent on the type of image used. Results from the two cross‐sectional images suggest an extra ability is being tested. Data from a scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context‐dependent interaction takes place with the content of questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items.

8.
The study aims to investigate the effects of delivery modalities on psychometric characteristics and student performance on cognitive tests. A first study assessed the inductive reasoning ability of 715 students under the supervision of teachers. A second study examined 731 students' performance on the application of the control-of-variables strategy in basic physics, but without teacher supervision due to the COVID-19 pandemic. Rasch measurement showed that the online format fitted the data better in the unidimensional model across the two conditions. Under teacher supervision, paper-based testing outperformed online testing in terms of reliability and total scores, but the pattern reversed without teacher supervision. Although measurement invariance was confirmed between the two versions at the item level, the differential bundle functioning analysis favored the online groups on item bundles constructed from figure-related materials. Response time was also discussed as an advantage of technology-based assessment for test development.

9.
ADHD is one of the most common referrals to school psychologists and child mental health providers. Although a best practice assessment of ADHD requires more than the use of rating scales, rating scales are one of the primary components in the assessment of ADHD. Therefore, the goal of this paper is to provide the reader with a critical and comparative evaluation of the five most commonly used, narrow‐band, published rating scales for the assessment of ADHD. Reviews were conducted in four main areas: content and use, standardization sample and norms, scores and interpretation, and psychometric properties. It was concluded that the rating scales with the strongest standardization samples and evidence for reliability and validity are the ADDES, the ADHD‐IV, and the CRS‐R. In determining which of these to use, prospective users may want to reflect on their goals for the assessment. The ACTeRS and the ADHDT are not recommended for use because their manuals lack crucial information and their evidence of reliability and validity is less well documented. Conclusions and recommendations for scale usage are discussed.

10.
Using a sample of 908 eleventh-grade science-stream male and female students from schools in similar socioeconomic areas, variance-based psychometric properties of three paper-and-pencil tests of logical thinking (the Longeot test, Lawson's TOFR, and Tobin and Capie's TOLT) are investigated. A sub-sample of 212 students took the three tests in different, randomly allocated orders of presentation, while 696 students took only two tests. Alpha coefficients for each test separately and for the three tests combined, concurrent validity coefficients, measures of item difficulty, item discrimination, item-criterion correlation, and 30-day stability coefficients are calculated. Considering the relative homogeneity of the sample, the reliability coefficients of the tests are judged satisfactory, but the concurrent validity coefficients are quite low, which implies incongruence among decisions made on the basis of the three tests. The need for estimating various psychometric parameters of alternative tests of logical thinking over different grade populations is emphasized.
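The alpha coefficients and classical item statistics reported here are straightforward to compute; a sketch on simulated dichotomous data (for 0/1 items, coefficient alpha reduces to KR-20):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an examinees-by-items score matrix;
    for 0/1 items this is KR-20."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Toy dichotomous responses driven by a latent ability (simulated data,
# not the 908-student sample from the study).
rng = np.random.default_rng(1)
ability = rng.normal(size=(908, 1))
p = 1.0 / (1.0 + np.exp(-(ability - rng.normal(size=20))))
items = (rng.random((908, 20)) < p).astype(int)

print("alpha =", round(cronbach_alpha(items), 3))

# Classical item statistics of the kind reported in the study.
difficulty = items.mean(axis=0)                # proportion correct per item
total = items.sum(axis=1)
discrimination = np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                           for j in range(items.shape[1])])
print("mean item difficulty    :", round(difficulty.mean(), 3))
print("mean item discrimination:", round(discrimination.mean(), 3))
```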

11.
Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.
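Of the methods compared, equipercentile equating is the easiest to sketch. A minimal random-groups version with toy, unsmoothed score distributions; the chained and frequency-estimation variants studied here add an anchor test on top of this basic idea:

```python
import numpy as np

def percentile_ranks(freqs):
    """Mid-percentile ranks at each integer score point."""
    f = freqs / freqs.sum()
    cum = np.cumsum(f)
    return 100.0 * (cum - f / 2)

def equipercentile(freq_x, freq_y):
    """Map each form-X score to the form-Y score with the same percentile
    rank, interpolating linearly between integer score points."""
    pr_x = percentile_ranks(freq_x)
    pr_y = percentile_ranks(freq_y)
    return np.interp(pr_x, pr_y, np.arange(len(freq_y)))

# Toy frequency distributions on two 10-item forms (illustrative only;
# operational equating would use smoothed distributions and, for the
# methods in this study, an anchor-test design).
rng = np.random.default_rng(2)
x = rng.binomial(10, 0.60, 5000)               # form X examinees
y = rng.binomial(10, 0.65, 5000)               # form Y examinees
freq_x = np.bincount(x, minlength=11).astype(float)
freq_y = np.bincount(y, minlength=11).astype(float)
print(np.round(equipercentile(freq_x, freq_y), 2))
```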

12.
When a computerized adaptive testing (CAT) version of a test co-exists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from its P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review research literature on CAT comparability issues and synthesize issues specific to these two settings. A framework of criteria for evaluating comparability was developed that contains the following three categories of criteria: validity criterion, psychometric property/reliability criterion, and statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria as well as various algorithms for improving comparability are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of the development of a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of incomparability may be due to the change from number-correct score-based scoring to IRT ability estimation-based scoring. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size were found to result in different levels of comparability in test scores.
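The scoring change the authors flag, from number-correct scoring to IRT ability estimation, sits inside a simple adaptive loop. A bare-bones CAT sketch with maximum-information selection and EAP scoring on a simulated 2PL pool; the content-balancing and exposure-control constraints the paper manipulates are deliberately omitted:

```python
import numpy as np

# Bare-bones CAT: simulated 2PL pool, maximum-information item selection,
# EAP scoring on a grid. Real systems add content balancing and exposure
# control, which this sketch leaves out.
rng = np.random.default_rng(3)
n_pool = 300
a = rng.uniform(0.5, 2.0, n_pool)              # discriminations
b = rng.normal(0.0, 1.0, n_pool)               # difficulties
grid = np.linspace(-4, 4, 161)
posterior = np.exp(-grid**2 / 2)               # N(0,1) prior (unnormalized)

def prob(theta, j):
    """2PL probability of a correct response to item j at ability theta."""
    return 1.0 / (1.0 + np.exp(-a[j] * (theta - b[j])))

true_theta, used = 0.8, []
for _ in range(20):                            # fixed test length of 20
    theta_hat = np.sum(grid * posterior) / posterior.sum()   # interim EAP
    p_all = prob(theta_hat, np.arange(n_pool))
    info = a**2 * p_all * (1.0 - p_all)        # item information at theta_hat
    info[used] = -np.inf                       # never readminister an item
    j = int(np.argmax(info))                   # maximum-information rule
    used.append(j)
    correct = rng.random() < prob(true_theta, j)             # simulee answers
    posterior *= prob(grid, j) if correct else 1.0 - prob(grid, j)

print("true theta:", true_theta)
print("final EAP :", round(np.sum(grid * posterior) / posterior.sum(), 3))
```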

13.
OBJECTIVE: The goal was to develop a retrospective inventory of parental threatening behavior to facilitate a better understanding of such behavior's role in the etiology of psychological distress. METHOD: Inventory items were developed based on theory and 135 students' responses to a question eliciting examples of threatening parental behavior. Following item development, two additional student samples (n = 200 and n = 603) completed batteries of self-report measures. Responses were used to eliminate unstable or redundant items from the inventory and to examine the inventory's psychometric properties. RESULTS: Factor analysis of the inventory revealed three factors, accounting for 66.2% of variance; this factor structure is compatible with theory, and consistent across maternal behavior scores, paternal behavior scores, and combined maternal and paternal scores. Cronbach's coefficient alphas indicated acceptable internal consistency; Pearson correlation coefficients indicated acceptable 4-week test-retest reliability. Moderate intercorrelations with two retrospective measures of childhood experiences suggested construct validity. Regression analyses demonstrated the ability of the inventory to predict both anxious and depressive symptomatology and lifetime symptoms of anxiety and depressive disorder. Normative data on combined parent scores, maternal scores, and paternal scores are also presented. CONCLUSIONS: Initial psychometric testing of the Parent Threat Inventory (PTI) suggests it is a reliable and valid tool for investigating the developmental antecedents of adult psychological distress. Further research should focus on addressing two limitations: (1) lack of normative and psychometric data on men and women suffering from clinical disorders, and (2) lack of validation by parental reporting.

14.
Instructional sensitivity is the psychometric capacity of tests or single items to capture effects of classroom instruction. Yet current item sensitivity measures' relationship to (a) actual instruction and (b) overall test sensitivity is rather unclear. The present study aims at closing these gaps by investigating test and item sensitivity to teaching quality, reanalyzing data from a quasi-experimental intervention study in primary school science education (1,026 students, 53 classes, mean age = 8.79 years, SD = 0.49, 50% female). We examine (a) the correlation of item sensitivity measures and the potential for cognitive activation in class and (b) consequences for test score interpretation when assembling tests from items varying in their degree of sensitivity to cognitive activation. Our study (a) provides validity evidence that item sensitivity measures may be related to actual classroom instruction and (b) points out that inferences on teaching drawn from test scores may vary due to test composition.

15.
The current study investigated how item formats and their inherent affordances influence test‐takers' cognition under uncertainty. Adult participants solved content‐equivalent math items in multiple‐selection multiple‐choice and four alternative grid formats. The results indicated that participants' affirmative response tendency (i.e., judge the given information as True) was affected by the presence of a grid, type of grid options, and their visual layouts. The item formats further affected the test scores obtained from the alternatives keyed True and the alternatives keyed False, and their psychometric properties. The current results suggest that the affordances rendered by item design can lead to markedly different test‐taker behaviors and can potentially influence test outcomes. They emphasize that a better understanding of the cognitive implications of item formats could potentially facilitate item design decisions for large‐scale educational assessments.

16.
This study investigates the relationships among factor correlations, inter-item correlations, and the reliability estimates of subscores, providing a guideline with respect to psychometric properties of useful subscores. In addition, it compares subscore estimation methods with respect to reliability and distinctness. The subscore estimation methods explored in the current study include augmentation based on classical test theory and multidimensional item response theory (MIRT). The study shows that there is no estimation method that is optimal according to both criteria. Augmented subscores show the most improvement in reliability compared to observed subscores but are the least distinct.
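Augmentation borrows strength from the other subscores via their covariances. A sketch of a Wainer-style linear augmentation on simulated data, using the classical identities Cov(tau1, x1) = rho1 · Var(x1) and Cov(tau1, x2) = Cov(x1, x2); reliabilities are known exactly here only because the true scores are simulated:

```python
import numpy as np

# Simulate two correlated true subscores plus independent measurement error
# (illustrative data, not from the study).
rng = np.random.default_rng(4)
n = 2000
true1, true2 = rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], n).T
x1 = true1 + rng.normal(0, .6, n)              # observed subscore 1
x2 = true2 + rng.normal(0, .6, n)              # observed subscore 2

rho1 = np.var(true1, ddof=1) / np.var(x1, ddof=1)   # reliability of x1
S = np.cov(np.stack([x1, x2]))                 # observed covariance matrix
c = np.array([rho1 * S[0, 0], S[0, 1]])        # Cov(tau1, [x1, x2])
w = np.linalg.solve(S, c)                      # best linear predictor weights
aug1 = x1.mean() + w @ np.stack([x1 - x1.mean(), x2 - x2.mean()])

print("corr(true1, x1)  =", round(np.corrcoef(true1, x1)[0, 1], 3))
print("corr(true1, aug1)=", round(np.corrcoef(true1, aug1)[0, 1], 3))
```

The augmented estimate tracks the true subscore more closely than the observed subscore does, but because it mixes in the other subscore it is also less distinct, matching the trade-off the study reports.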

17.
The comparison of scores from linguistically different tests is a twofold matter: the adaptation of tests and the comparison of scores. These two aspects of measurement invariance intersect at the need to guarantee psychometric equivalence between the original and adapted versions. In this study, the authors examined comparability in two stages. First, they conducted a thorough study of progressive factorial invariance through which they defined an anchor test. Second, they defined an observed-score equating function to establish equivalences between the original test and the adapted test, using a common-item nonequivalent-groups design for this purpose.

18.
The use of surveys, questionnaires, and rating scales to measure important outcomes in higher education is pervasive, but reliability and validity information is often based on problematic Classical Test Theory approaches. Rasch Analysis, based on Item Response Theory, provides a better alternative for examining the psychometric quality of rating scales and informing scale improvements. This paper outlines a six-step process for using Rasch Analysis to review the psychometric properties of a rating scale. The Partial Credit Model and Andrich Rating Scale Model will be described in terms of the psychometric information (i.e., reliability, validity, and item difficulty) and diagnostic indices generated. Further, this approach will be illustrated through the example of authentic data from a university-wide student evaluation of teaching.
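For reference, the Andrich Rating Scale Model assigns category probabilities from cumulative step logits; a sketch with hypothetical threshold values (the Partial Credit Model is the same computation with item-specific rather than shared thresholds):

```python
import numpy as np

def rsm_category_probs(theta, delta, taus):
    """Andrich Rating Scale Model: probabilities of response categories
    0..m for a person at theta on an item at location delta, with category
    thresholds tau_1..tau_m shared by all items on the scale."""
    steps = np.concatenate(([0.0], theta - delta - np.asarray(taus)))
    logits = np.cumsum(steps)                  # cumulative step logits
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

# Hypothetical five-category item (four thresholds), person slightly above
# the item location; all values are made up for illustration.
print(np.round(rsm_category_probs(theta=0.5, delta=0.0,
                                  taus=[-1.5, -0.5, 0.5, 1.5]), 3))
```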

19.
The TEMAS (acronym for Tell‐Me‐a‐Story)—an objectively scored, projective thematic personality instrument for children and adolescents—is analyzed, reviewed, and critiqued with regard to theoretical underpinnings and rationale for development, administration, scoring, psychometric properties, and research to date. The TEMAS appears to be an improvement over existing projective personality measures used by school psychologists. Although it requires more training than other projective techniques, competency in administration, scoring, and interpretation can be achieved within a one semester course in personality assessment. The test has evidence of reliability and validity, and it is a multicultural alternative to the TAT and other thematic apperception instruments. The use of the TEMAS by psychologists may achieve more accurate assessment of Black and Hispanic children. Limitations include geographically limited standardization samples and little research conducted by individuals other than the authors.

20.
Using Rasch analysis, the psychometric properties of a newly developed 35‐item parent‐proxy instrument, the Caregiver Assessment of Movement Participation (CAMP), designed to measure movement participation problems in children with Developmental Coordination Disorder, were examined. The CAMP was administered to 465 school children aged 5–10 years. Thirty of the 35 items were retained, as they had acceptable infit and outfit statistics. Item separation (7.48) and child separation (3.16) were good; moreover, the CAMP had excellent reliability (item reliability index = 0.98; person reliability index = 0.91). Principal components analysis of item residuals confirmed the unidimensionality of the instrument. Based on category probability statistics, the original five‐point scale was collapsed into a four‐point scale. Item threshold calibration of the CAMP against the Movement Assessment Battery for Children Test was computed. The results indicated that a CAMP total score of 75 is the optimal cut‐off point for identifying children at risk of movement problems.
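The infit and outfit statistics used to retain items are mean-square summaries of standardized residuals. A dichotomous-Rasch sketch with simulated data; the CAMP items are polytomous, where the same logic applies with category-level model variances:

```python
import numpy as np

def rasch_fit_statistics(X, theta, b):
    """Outfit and infit mean-squares per item for a dichotomous Rasch model.
    X: persons-by-items 0/1 matrix; theta: abilities; b: item difficulties."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    w = p * (1.0 - p)                          # model variance of each response
    z2 = (X - p) ** 2 / w                      # squared standardized residuals
    outfit = z2.mean(axis=0)                   # unweighted mean-square
    infit = ((X - p) ** 2).sum(axis=0) / w.sum(axis=0)   # information-weighted
    return outfit, infit

# Simulated check: data generated from the model, so both statistics should
# hover near 1 (465 persons and 30-ish items echo the study's dimensions).
rng = np.random.default_rng(5)
theta = rng.normal(size=465)
b = rng.normal(size=30)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random((465, 30)) < p).astype(float)
outfit, infit = rasch_fit_statistics(X, theta, b)
print("outfit range:", np.round([outfit.min(), outfit.max()], 2))
print("infit  range:", np.round([infit.min(), infit.max()], 2))
```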
