首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Subscores Based on Classical Test Theory: To Report or Not to Report   总被引:1,自引:0,他引:1  
There is an increasing interest in reporting subscores, both at examinee level and at aggregate levels. However, it is important to ensure reasonable subscore performance in terms of high reliability and validity to minimize incorrect instructional and remediation decisions. This article employs a statistical measure based on classical test theory that is conceptually similar to the test reliability measure and can be used to determine when subscores have any added value over total scores. The usefulness of subscores is examined both at the level of the examinees and at the level of the institutions that the examinees belong to. The suggested approach is applied to two data sets from a basic skills test. The results provide little support in favor of reporting subscores for either examinees or institutions for the tests studied here.  相似文献   

2.
A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of polytomous IRT models. The module presents commonly encountered polytomous IRT models, describes their properties, and contrasts their defining principles and assumptions. After completing this module, the reader should have a sound understating of what a polytomous IRT model is, the manner in which the equations of the models are generated from the model's underlying step functions, how widely used polytomous IRT models differ with respect to their definitional properties, and how to interpret the parameters of polytomous IRT models.  相似文献   

3.
Recently, interest in test subscore reporting for diagnosis purposes has been growing rapidly. The two simulation studies here examined factors (sample size, number of subscales, correlation between subscales, and three factors affecting subscore reliability: number of items per subscale, item parameter distribution, and data generating model) that affected the value of reporting subscores within the classical test theory framework. Results showed that a higher proportion of subscores of added value was related to lower correlation between subscales, more items per subscale, no guessing in responses, smaller variability in difficulty parameters, and matched average item difficulty and average examinee ability.  相似文献   

4.
The purpose of this ITEMS module is to provide an introduction to differential item functioning (DIF) analysis using mixture item response models. The mixture item response models for DIF analysis involve comparing item profiles across latent groups, instead of manifest groups. First, an overview of DIF analysis based on latent groups, called latent DIF analysis, is provided and its applications in the literature are surveyed. Then, the methodological issues pertaining to latent DIF analysis are described, including mixture item response models, parameter estimation, and latent DIF detection methods. Finally, recommended steps for latent DIF analysis are illustrated using empirical data.  相似文献   

5.
Drawing valid inferences from item response theory (IRT) models is contingent upon a good fit of the data to the model. Violations of model‐data fit have numerous consequences, limiting the usefulness and applicability of the model. This instructional module provides an overview of methods used for evaluating the fit of IRT models. Upon completing this module, the reader will have an understanding of traditional and Bayesian approaches for evaluating model‐data fit of IRT models, the relative advantages of each approach, and the software available to implement each method.  相似文献   

6.
Mokken scale analysis (MSA) is a probabilistic‐nonparametric approach to item response theory (IRT) that can be used to evaluate fundamental measurement properties with less strict assumptions than parametric IRT models. This instructional module provides an introduction to MSA as a probabilistic‐nonparametric framework in which to explore measurement quality, with an emphasis on its application in the context of educational assessment. The module describes both dichotomous and polytomous formulations of the MSA model. Examples of the application of MSA to educational assessment are provided using data from a multiple‐choice physical science assessment and a rater‐mediated writing assessment.  相似文献   

7.
8.
Brennan ( 2012 ) noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman ( 2008 ) suggested a method based on classical test theory (CTT) to determine if subscores have added value over the total score. According to this method, a subscore has added value if the corresponding true subscore is predicted better by the subscore than by the total score. In this note, parallel‐forms scores are considered. It is proved that another way to interpret the method of Haberman is that a subscore has added value if it is in better agreement than the total score with the corresponding subscore on a parallel form. The suggested interpretation promises to make the method of Haberman more accessible because several practitioners find the concept of parallel forms more acceptable or easier to understand than that of a true score. Results are shown for data from two operational tests.  相似文献   

9.
In most large-scale assessments of student achievement, several broad content domains are tested. Because more items are needed to cover the content domains than can be presented in the limited testing time to each individual student, multiple test forms or booklets are utilized to distribute the items to the students. The construction of an appropriate booklet design is a complex and challenging endeavor that has far-reaching implications for data calibration and score reporting. This module describes the construction of booklet designs as the task of allocating items to booklets under context-specific constraints. Several types of experimental designs are presented that can be used as booklet designs. The theoretical properties and construction principles for each type of design are discussed and illustrated with examples. Finally, the evaluation of booklet designs is described and future directions for researching, teaching, and reporting on booklet designs for large-scale assessments of student achievement are identified.  相似文献   

10.
11.
Traditional methods for examining differential item functioning (DIF) in polytomously scored test items yield a single item‐level index of DIF and thus provide no information concerning which score levels are implicated in the DIF effect. To address this limitation of DIF methodology, the framework of differential step functioning (DSF) has recently been proposed, whereby measurement invariance is examined within each step underlying the polytomous response variable. The examination of DSF can provide valuable information concerning the nature of the DIF effect (i.e., is the DIF an item‐level effect or an effect isolated to specific score levels), the location of the DIF effect (i.e., precisely which score levels are manifesting the DIF effect), and the potential causes of a DIF effect (i.e., what properties of the item stem or task are potentially biasing). This article presents a didactic overview of the DSF framework and provides specific guidance and recommendations on how DSF can be used to enhance the examination of DIF in polytomous items. An example with real testing data is presented to illustrate the comprehensive information provided by a DSF analysis.  相似文献   

12.
Multistage tests are those in which sets of items are administered adaptively and are scored as a unit. These tests have all of the advantages of adaptive testing, with more efficient and precise measurement across the proficiency scale as well as time savings, without many of the disadvantages of an item-level adaptive test. As a seemingly balanced compromise between linear paper-and-pencil and item-level adaptive tests, development and use of multistage tests is increasing. This module describes multistage tests, including two-stage and testlet-based tests, and discusses the relative advantages and disadvantages of multistage testing as well as considerations and steps in creating such tests.  相似文献   

13.
This module describes standard setting for achievement measures used i n education, licensure, and certification. On completing the module, readers will be able to: describe what standard setting is, why it is necessary, what some of the purposes of standard setting are, and what professional guidelines apply to the design and conduct of a standard-setting procedure; differentiate among different models of standard setting; calculate a cutting score using various methods; identify appropriate sources of validity evidence and threats to the validity of a standard-setting procedure; and list some elements to be considered when evaluating the success of a standard-setting procedure. A self-test and annotated bibliography are provided at the end of the module. Teaching aids to accompany the module are available through NCME.  相似文献   

14.
计算机信息技术课无纸化考试的研究   总被引:1,自引:0,他引:1  
介绍考试理论从经典测量到项目反应的发展,指出计算机化考试的必然性和优越性。对计算机考试如何在多媒体网络实验室实现,进行了较详细的阐述。  相似文献   

15.
In this digital ITEMS module, Dr. Brian Leventhal and Dr. Allison Ames provide an overview of Monte Carlo simulation studies (MCSS) in item response theory (IRT). MCSS are utilized for a variety of reasons, one of the most compelling being that they can be used when analytic solutions are impractical or nonexistent because they allow researchers to specify and manipulate an array of parameter values and experimental conditions (e.g., sample size, test length, and test characteristics). Dr. Leventhal and Dr. Ames review the conceptual foundation of MCSS in IRT and walk through the processes of simulating total scores as well as item responses using the two-parameter logistic, graded response, and bifactor models. They provide guidance for how to implement MCSS using other item response models and best practices for efficient syntax and executing an MCSS. The digital module contains sample SAS code, diagnostic quiz questions, activities, curated resources, and a glossary.  相似文献   

16.
通过对经典测量理论与项目反应理论在基本假设、测验精度计量、测验的标准误以及测验项目的筛选等四个主要领域的比较,可以发现项目反应理论具有被试能力估计的项目选择独立性、项目难度参数与能力参数的刻度统一性、项目参数估计的样本独立性、估计测量误差的精确性等几个优点;但是在某些模型中存在单维性假设难以满足、测验条件要求严格以及数学模型简约性差等需要解决的问题。  相似文献   

17.
基于项目反应理论的测验编制方法研究   总被引:3,自引:0,他引:3  
本文在简单介绍项目反应理论的基础上,从计量分析的角度,深入探讨了应用项目反应理论编制各种测验的一般步骤;探讨了项目反应理论题库建设方法及基于题库的测验编制方法;探讨了标准参照测验合格分数线的划分方法。  相似文献   

18.
A goal for any linking or equating of two or more tests is that the linking function be invariant to the population used in conducting the linking or equating. Violations of population invariance in linking and equating jeopardize the fairness and validity of test scores, and pose particular problems for test‐based accountability programs that require schools, districts, and states to report annual progress on academic indicators disaggregated by demographic group membership. This instructional module provides a comprehensive overview of population invariance in linking and equating and the relevant methodology developed for evaluating violations of invariance. A numeric example is used to illustrate the comparative properties of available methods, and important considerations for evaluating population invariance in linking and equating are presented.  相似文献   

19.
The purpose of a credentialing examination is to assure the public that individuals who work in an occupation or profession have met certain standards. To be consistent with this purpose, credentialing examinations must be job related, and this requirement is typically met by developing test plans based on an empirical job or practice analysis. The purpose of this module is to describe procedures for developing practice analysis surveys, with emphasis on task inventory questionnaires. Editorial guidelines for writing task statements are presented, followed by a discussion of issues related to the development of scales for rating tasks and job responsibilities. The module also offers guidelines for designing and formatting both mail-out and Internet-based questionnaires. It concludes with a brief overview of the types of data analyses useful for practice analysis questionnaires.  相似文献   

20.
Data mining methods for classification and regression are becoming increasingly popular in various scientific fields. However, these methods have not been explored much in educational measurement. This module first provides a review, which should be accessible to a wide audience in education measurement, of some of these methods. The module then demonstrates using three real‐data examples that these methods may lead to an improvement over traditionally used methods such as linear and logistic regression in educational measurement.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号