Similar Literature
20 similar documents found
1.
Recent advances have enabled diagnostic classification models (DCMs) to accommodate longitudinal data. These longitudinal DCMs were developed to study how examinees change, or transition, between different attribute mastery statuses over time. This study examines the use of longitudinal DCMs as an approach to assessing growth and serves three purposes: (1) to define and evaluate two reliability measures to be used in the application of longitudinal DCMs; (2) to demonstrate, through simulation, that longitudinal DCM growth estimates have increased reliability compared to longitudinal item response theory models; and (3) to illustrate, through an empirical analysis, the practical and interpretive benefits of longitudinal DCMs. A discussion describes how longitudinal DCMs can be used as practical and reliable psychometric models when categorical and criterion-referenced interpretations of growth are desired.
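As a rough sketch of how a reliability index for mastery classification can be computed in this setting (an illustrative consistency measure based on posterior mastery probabilities under a conditional-independence assumption, not necessarily the two measures the authors define), consider:

import numpy as np

def classification_consistency(posterior_mastery):
    # Expected agreement between mastery/non-mastery classifications on two
    # hypothetical parallel administrations, given each examinee's posterior
    # probability of mastering one attribute (illustrative only).
    p = np.asarray(posterior_mastery, dtype=float)
    return float(np.mean(p**2 + (1.0 - p)**2))

# Toy posterior mastery probabilities for five examinees on one attribute:
print(classification_consistency([0.95, 0.10, 0.80, 0.50, 0.99]))

Values near 1 indicate that most examinees would be classified the same way on a repeated administration; posterior probabilities near .5 pull the index down.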

2.
In this ITEMS module, we provide a didactic overview of the specification, estimation, evaluation, and interpretation steps for diagnostic measurement/classification models (DCMs), which are a promising psychometric modeling approach. These models can provide detailed skill- or attribute-specific feedback to respondents along multiple latent dimensions and hold theoretical and practical appeal for a variety of fields. We use a current unified modeling framework—the log-linear cognitive diagnosis model (LCDM)—as well as a series of quality-control checklists for data analysts and scientific users to review the foundational concepts, practical steps, and interpretational principles for these models. We demonstrate how the models and checklists can be applied in real-life data-analysis contexts. A library of macros and supporting files for Excel, SAS, and Mplus is provided along with video tutorials for key practices.
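For readers unfamiliar with the LCDM, its item response function for an item measuring two attributes can be written as follows (standard form with generic subscripts, which may differ from the module's own notation):

P(X_{ri}=1 \mid \boldsymbol{\alpha}_r) = \frac{\exp\left(\lambda_{i,0} + \lambda_{i,1,(1)}\alpha_{r1} + \lambda_{i,1,(2)}\alpha_{r2} + \lambda_{i,2,(1,2)}\alpha_{r1}\alpha_{r2}\right)}{1 + \exp\left(\lambda_{i,0} + \lambda_{i,1,(1)}\alpha_{r1} + \lambda_{i,1,(2)}\alpha_{r2} + \lambda_{i,2,(1,2)}\alpha_{r1}\alpha_{r2}\right)}

where \lambda_{i,0} is the item intercept, the \lambda_{i,1,(\cdot)} terms are attribute main effects, and \lambda_{i,2,(1,2)} is the two-way interaction; constraining these parameters yields familiar special cases such as the DINA and DINO models.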

3.
Diagnostic classification models (aka cognitive or skills diagnosis models) have shown great promise for evaluating mastery on a multidimensional profile of skills as assessed through examinee responses, but continued development and application of these models have been hindered by a lack of readily available software. In this article we demonstrate how diagnostic classification models may be estimated as confirmatory latent class models using Mplus, thus bridging the gap between the technical presentation of these models and their practical use for assessment in research and applied settings. Using a sample English test of three grammatical skills, we describe how diagnostic classification models can be phrased as latent class models within Mplus and how to obtain the syntax and output needed for estimation and interpretation of the model parameters. We also have written a freely available SAS program that can be used to automatically generate the Mplus syntax. We hope this work will ultimately result in greater access to diagnostic classification models throughout the testing community, from researchers to practitioners.
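The core idea behind the latent class formulation (each of the 2^K attribute profiles is one latent class with its own vector of item response probabilities) can be sketched in a few lines; the Q-matrix and DINA-type parameter values below are hypothetical and purely for illustration, not the article's example or its Mplus syntax:

from itertools import product
import numpy as np

# Hypothetical Q-matrix for a toy test measuring K = 3 skills (1 = item requires the skill).
Q = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])

slip, guess = 0.1, 0.2   # illustrative DINA-type item parameters

# Each attribute profile defines one latent class: 2**K classes in total.
profiles = np.array(list(product([0, 1], repeat=Q.shape[1])))

# Conjunctive rule: a class "masters" an item only if it masters every required skill.
eta = np.all(profiles[:, None, :] >= Q[None, :, :], axis=2)

# Class-specific probabilities of a correct response (rows = classes, columns = items).
p_correct = np.where(eta, 1 - slip, guess)
print(p_correct)

Estimating such a model as a confirmatory latent class model then amounts to constraining the class-by-item response probabilities according to the Q-matrix and the chosen DCM.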

4.
The purpose of this study is to apply the attribute hierarchy method (AHM) to a subset of SAT critical reading items and illustrate how the method can be used to promote cognitive diagnostic inferences. The AHM is a psychometric procedure for classifying examinees' test item responses into a set of attribute mastery patterns associated with different components from a cognitive model. The study was conducted in two steps. In step 1, three cognitive models were developed by reviewing selected literature in reading comprehension as well as research related to SAT Critical Reading. Then, the cognitive models were validated by having a sample of students think aloud as they solved each item. In step 2, psychometric analyses were conducted on the SAT critical reading cognitive models by evaluating the model-data fit between the expected and observed response patterns produced from two random samples of 2,000 examinees who wrote the items. The model that provided the best model-data fit was then used to calculate attribute probabilities for 15 examinees to illustrate our diagnostic testing procedure.
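One building block of the AHM, deriving which attribute mastery patterns a hierarchy permits, can be illustrated with a short sketch (the linear three-attribute hierarchy below is hypothetical and far simpler than the SAT critical reading models examined in the study):

from itertools import product
import numpy as np

# Hypothetical linear hierarchy among three attributes: A1 -> A2 -> A3.
# adjacency[i, j] = 1 means attribute i is a direct prerequisite of attribute j.
adjacency = np.array([[0, 1, 0],
                      [0, 0, 1],
                      [0, 0, 0]])

def reachability(adj):
    # Warshall's algorithm: r[i, j] is True if i is a direct or indirect prerequisite of j.
    r = adj.astype(bool).copy()
    for k in range(len(adj)):
        r |= np.outer(r[:, k], r[k, :])
    return r

R = reachability(adjacency)

def permissible(pattern, R):
    # A pattern is permissible if every mastered attribute's prerequisites are also mastered.
    for j, mastered in enumerate(pattern):
        if mastered and not all(pattern[i] for i in range(len(pattern)) if R[i, j]):
            return False
    return True

patterns = [p for p in product([0, 1], repeat=3) if permissible(p, R)]
print(patterns)   # linear hierarchy -> [(0,0,0), (1,0,0), (1,1,0), (1,1,1)]

Expected response patterns for each permissible knowledge state then follow from the Q-matrix, and examinees' observed responses are classified against those expectations.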

5.
Standardizing aspects of assessments has long been recognized as a tactic to help make evaluations of examinees fair. It reduces variation in irrelevant aspects of testing procedures that could advantage some examinees and disadvantage others. However, recent attention to making assessment accessible to a more diverse population of students highlights situations in which making tests identical for all examinees can make a testing procedure less fair: Equivalent surface conditions may not provide equivalent evidence about examinees. Although testing accommodations are by now standard practice in most large-scale testing programmes, for the most part these practices lie outside formal educational measurement theory. This article builds on recent research in universal design for learning (UDL), assessment design, and psychometrics to lay out the rationale for inference that is conditional on matching examinees with principled variations of an assessment so as to reduce construct-irrelevant demands. The present focus is assessment for special populations, but it is argued that the principles apply more broadly.

6.
A Brief Review of Knowledge-State Estimation Methods in Cognitive Diagnostic Assessment
Although there are many methods and models for cognitive diagnostic assessment, their ultimate purpose is the same: to report examinees' strengths and weaknesses in attribute mastery, that is, to estimate or classify examinees' knowledge states. This paper briefly reviews the knowledge-state estimation methods used in the rule space model, the attribute hierarchy method, the deterministic inputs, noisy "and" gate (DINA) model, and several other models, focusing on the basic ideas behind each estimation method, their advantages and disadvantages, and the connections and differences among them. Finally, the factors that influence knowledge-state estimation results are summarized and questions for further research are proposed.
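For reference, the DINA (deterministic inputs, noisy "and" gate) model mentioned above estimates knowledge states through the item response function (standard notation):

\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}}, \qquad P(X_{ij}=1 \mid \boldsymbol{\alpha}_i) = (1 - s_j)^{\eta_{ij}}\, g_j^{\,1-\eta_{ij}}

where \alpha_{ik} indicates examinee i's mastery of attribute k, q_{jk} is the Q-matrix entry for item j, and s_j and g_j are the slip and guessing parameters. The knowledge state \boldsymbol{\alpha}_i is then typically estimated by maximum likelihood or by the posterior mode or mean over the 2^K candidate states.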

7.
Reliability and validity are key indicators of the quality of a measurement instrument, and research on reliability and validity in educational cognitive diagnostic testing has attracted researchers' attention in recent years. The reliability coefficients used for diagnostic tests are largely drawn from attribute reliability coefficients based on coefficient alpha, empirical attribute reliability coefficients, tetrachoric correlation coefficients, simulated test-retest consistency, and classification consistency indices; validity coefficients mainly include simulation-based classification accuracy rates, classification accuracy, and construct validity. Research on the reliability and validity of educational cognitive diagnostic tests is still relatively new: certain shortcomings remain, comprehensive comparative studies are lacking, and a systematic evaluation framework is still missing.
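As an illustration of the simulation-based accuracy indices mentioned above (a minimal sketch; the studies reviewed define these rates more formally), pattern-level and attribute-level agreement between generating and estimated knowledge states can be computed as:

import numpy as np

def pattern_match_rate(true_patterns, estimated_patterns):
    # Proportion of simulees whose entire estimated attribute pattern
    # matches the generating pattern (pattern-level accuracy).
    t, e = np.asarray(true_patterns), np.asarray(estimated_patterns)
    return float(np.mean(np.all(t == e, axis=1)))

def attribute_match_rate(true_patterns, estimated_patterns):
    # Per-attribute agreement rates (marginal, attribute-level accuracy).
    t, e = np.asarray(true_patterns), np.asarray(estimated_patterns)
    return np.mean(t == e, axis=0)

true_p = [[1, 0, 1], [1, 1, 1], [0, 0, 1]]
est_p  = [[1, 0, 1], [1, 0, 1], [0, 0, 1]]
print(pattern_match_rate(true_p, est_p), attribute_match_rate(true_p, est_p))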

8.
The back-propagation (BP) neural network is one of the most widely used artificial neural network models and performs well in classification and recognition tasks, so researchers have applied it to cognitive diagnostic assessment to classify examinees diagnostically. A simulation study examined the effects of five factors on the classification accuracy of BP neural networks in cognitive diagnosis: the number of attributes, the attribute hierarchy, test length, item quality, and the size of the testing sample. The results show that: (1) the classification accuracy of BP-neural-network-based cognitive diagnosis does not depend on the size of the testing sample; (2) item quality and test length have significant positive effects on the diagnostic accuracy of BP neural networks; (3) the number of attributes has a negative effect on classification accuracy; and (4) item quality affects, to some extent, the classification accuracy of the BP diagnostic method across different attribute hierarchical structures.
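A minimal sketch of the kind of classifier involved, a small back-propagation network mapping item response vectors to attribute mastery labels, is given below; the data-generating rule, network size, and training settings are invented for illustration and do not reproduce the simulation design of the study:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: 8-item response vectors (inputs) and 3-attribute mastery patterns (targets).
X = rng.integers(0, 2, size=(200, 8)).astype(float)
# Hypothetical generating rule: an attribute counts as mastered when enough related items are correct.
T = np.column_stack([
    X[:, :3].sum(axis=1) >= 2,
    X[:, 3:6].sum(axis=1) >= 2,
    X[:, 6:].sum(axis=1) >= 1,
]).astype(float)

# One hidden layer; weights are learned by plain gradient descent (back-propagation).
W1 = rng.normal(scale=0.5, size=(8, 10)); b1 = np.zeros(10)
W2 = rng.normal(scale=0.5, size=(10, 3)); b2 = np.zeros(3)
lr = 0.5

for _ in range(2000):
    H = sigmoid(X @ W1 + b1)            # hidden activations
    Y = sigmoid(H @ W2 + b2)            # predicted mastery probabilities
    dY = (Y - T) / len(X)               # gradient of mean cross-entropy at the output pre-activation
    dW2, db2 = H.T @ dY, dY.sum(axis=0)
    dH = (dY @ W2.T) * H * (1 - H)
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

pred = (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
print("pattern-level training accuracy:", np.mean(np.all(pred == T, axis=1)))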

9.
The purpose of this paper is to define and evaluate the categories of cognitive models underlying at least three types of educational tests. We argue that while all educational tests may be based—explicitly or implicitly—on a cognitive model, the categories of cognitive models underlying tests often range in their development and in the psychological evidence gathered to support their value. For researchers and practitioners, awareness of different cognitive models may facilitate the evaluation of educational measures for the purpose of generating diagnostic inferences, especially about examinees' thinking processes, including misconceptions, strengths, and/or abilities. We think a discussion of the types of cognitive models underlying educational measures is useful not only for taxonomic ends, but also for becoming increasingly aware of evidentiary claims in educational assessment and for promoting the explicit identification of cognitive models in test development. We begin our discussion by defining the term cognitive model in educational measurement. Next, we review and evaluate three categories of cognitive models that have been identified for educational testing purposes using examples from the literature. Finally, we highlight the practical implications of "blending" models for the purpose of improving educational measures.

10.
We report a multidimensional test that examines middle grades teachers' understanding of fraction arithmetic, especially multiplication and division. The test is based on four attributes identified through an analysis of the extensive mathematics education research literature on teachers' and students' reasoning in this content area. We administered the test to a national sample of 990 in-service middle grades teachers and analyzed the item responses using the log-linear cognitive diagnosis model. We report the diagnostic quality of the test at the item level, mastery classifications for teachers, and attribute relationships. Our results demonstrate that, when a test is grounded in research on cognition and is designed to be multidimensional from the outset, it is possible to use diagnostic classification models to detect distinct patterns of attribute mastery.

11.
Brennan noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman suggested a method based on classical test theory (CTT) to determine if subscores have added value over the total score. One way to interpret the method is that a subscore has added value only if it has a better agreement than the total score with the corresponding subscore on a parallel form. The focus of this article is on classification of the examinees into "pass" and "fail" (or master and nonmaster) categories based on subscores. A new CTT-based method is suggested to assess whether classification based on a subscore is in better agreement, than classification based on the total score, with classification based on the corresponding subscore on a parallel form. The method can be considered as an assessment of the added value of subscores with respect to classification. The suggested method is applied to data from several operational tests. The added value of subscores with respect to classification is found to be very similar, except at extreme cutscores, to their added value from a value-added analysis of Haberman.
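A schematic of the comparison described here, written as a small helper (raw agreement between pass/fail decisions; the article's CTT-based statistic is more refined than this, and the cut scores below are hypothetical inputs):

import numpy as np

def agreement(a, b):
    # Proportion of examinees classified identically by two decision rules.
    return float(np.mean(np.asarray(a) == np.asarray(b)))

def subscore_adds_value_for_classification(sub, sub_parallel, total, cut_sub, cut_total):
    cls_sub = np.asarray(sub) >= cut_sub            # pass/fail from the subscore
    cls_par = np.asarray(sub_parallel) >= cut_sub   # pass/fail from the parallel-form subscore
    cls_tot = np.asarray(total) >= cut_total        # pass/fail from the total score
    # Added value: the subscore-based decision agrees with the parallel-form decision
    # more often than the total-score-based decision does.
    return agreement(cls_sub, cls_par) > agreement(cls_tot, cls_par)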

12.
13.
The applications of item response theory (IRT) models assume local item independence and that examinees are independent of each other. When a representative sample for psychometric analysis is selected using a cluster sampling method in a testlet-based assessment, both local item dependence and local person dependence are likely to be induced. This study proposed a four-level IRT model to simultaneously account for dual local dependence due to item clustering and person clustering. Model parameter estimation was explored using the Markov Chain Monte Carlo method. Model parameter recovery was evaluated in a simulation study in comparison with three other related models: the Rasch model, the Rasch testlet model, and the three-level Rasch model for person clustering. In general, the proposed model recovered the item difficulty and person ability parameters with the least total error. The bias in both item and person parameter estimation was not affected but the standard error (SE) was affected. In some simulation conditions, the difference in classification accuracy between models could go up to 11%. The illustration using the real data generally supported model performance observed in the simulation study.
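One plausible way such a dual-dependence structure can be written (a Rasch-based schematic, not necessarily the article's exact parameterization) is

\mathrm{logit}\, P(X_{pi}=1) = \theta_p + \gamma_{p\,d(i)} - b_i, \qquad \theta_p = \mu_{c(p)} + \varepsilon_p

with testlet effects \gamma_{p\,d(i)} \sim N(0, \sigma^2_{\mathrm{testlet}}) capturing local item dependence within testlet d(i), person residuals \varepsilon_p \sim N(0, \sigma^2_{\mathrm{person}}), and cluster means \mu_{c(p)} \sim N(0, \sigma^2_{\mathrm{cluster}}) capturing local person dependence within sampled cluster c(p).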

14.
In the presence of test speededness, the parameter estimates of item response theory models can be poorly estimated due to conditional dependencies among items, particularly for end-of-test items (i.e., speeded items). This article conducted a systematic comparison of five item calibration procedures—a two-parameter logistic (2PL) model, a one-dimensional mixture model, a two-step strategy (a combination of the one-dimensional mixture and the 2PL), a two-dimensional mixture model, and a hybrid model—by examining how sample size, percentage of speeded examinees, percentage of missing responses, and way of scoring missing responses (incorrect vs. omitted) affect the item parameter estimation in speeded tests. For nonspeeded items, all five procedures showed similar results in recovering item parameters. For speeded items, the one-dimensional mixture model, the two-step strategy, and the two-dimensional mixture model provided largely similar results and performed better than the 2PL model and the hybrid model in calibrating slope parameters. However, those three procedures performed similarly to the hybrid model in estimating intercept parameters. As expected, the 2PL model did not appear to be as accurate as the other models in recovering item parameters, especially when there were large numbers of examinees showing speededness and a high percentage of missing responses with incorrect scoring. Real data analysis further described the similarities and differences between the five procedures.
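For orientation, a speededness mixture is commonly written as a two-class model in which end-of-test items behave differently for the speeded class; one schematic form (not the exact parameterization of every procedure compared here) is

P(X_{pi}=1) = (1-\pi)\, P_{\mathrm{2PL}}(\theta_p; a_i, b_i) + \pi\, P_{\mathrm{2PL}}(\theta_p; a_i, b_i + \delta_i), \qquad P_{\mathrm{2PL}}(\theta; a, b) = \frac{\exp\{a(\theta - b)\}}{1 + \exp\{a(\theta - b)\}}

where \pi is the proportion of speeded examinees and \delta_i \ge 0 is a difficulty shift that is zero for nonspeeded items and grows toward the end of the test.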

15.
16.
This study aimed to develop an instrument for assessing kindergarteners' mathematics problem solving (MPS) by using cognitive diagnostic assessment (CDA). A total of 747 children were recruited to examine the psychometric properties of the cognitive diagnostic test. The results showed that the classification accuracy of 11 cognitive attributes ranged from .68 to .99, with the average being .84. Both the cognitive diagnostic test score and the average mastery probabilities of the 11 cognitive attributes had moderate correlations with the Applied Problems subtest and the Calculation subtest of the Woodcock–Johnson IV Tests of Achievement. Moreover, the correlation between the cognitive diagnostic test and the Applied Problems subtest was higher than that between the cognitive diagnostic test and the Calculation subtest. The results indicated that the formal cognitive diagnostic test was a reliable instrument for assessing kindergarteners' MPS in the domain of number and operations.

17.
Test-takers' interpretations of validity as related to test constructs and test use have been widely debated in large-scale language assessment. This study contributes further evidence to this debate by examining 59 test-takers' perspectives in writing large-scale English language tests. Participants wrote about their test-taking experiences in 300 to 500 words, focusing on their perceptions of test validity and test use. A standard thematic coding process and logical cross-analysis were used to analyze test-takers' experiences. Codes were deductively generated and related to both experiential (i.e., testing conditions and consequences) and psychometric (i.e., test construction, format, and administration) aspects of testing. These findings offer test-takers' voices on fundamental aspects of language assessment, which bear implications for test developers, test administrators, and test users. The study also demonstrated the need for obtaining additional evidence from test-takers for validating large-scale language tests.

18.
Summary

In this article my purpose has not been to indicate what kinds of things can and can't be assessed appropriately with tests. Rather, I have tried to illuminate how the key ideas of reliability and validity are used by test developers and what this means in practice — not least in terms of the decisions that are made about individual students on the basis of their test results. As I have stressed throughout this article, these limitations are not the fault of test developers. However inconvenient these limitations are for proponents of school testing, they are inherent in the nature of tests of academic achievement, and are as real as rocks. All users of the results of educational tests must understand what a limited technology this is.

19.
In recent years, science education has placed increasing importance on learners' mastery of scientific reasoning. This growing emphasis presents a challenge for both developers and users of assessments. We report on our effort around the conceptualization, development, and validity testing of an assessment of students' ability to reason around physical dynamic models in Earth Science. Building from the research literature on analogical mapping and informed by the current perspectives on learning progressions, we present a three-tiered construct describing the increasing sophistication of students' analogical reasoning around the correspondences and non-correspondences between models and the Earth System: at the level of entities (Level 1), configurations in space or relative motion of entities (Level 2), and the mechanism or cause for observed phenomena (Level 3). Grounded in a construct-centered design approach, we describe our process for developing assessments in order to examine and validate this construct, including how we selected topics and models, designed items, and developed outcome spaces. We present the specific example of one assessment centered on moon phases, which was administered to 164 8th and 9th grade Earth Science students as a pre/post measure. Two hundred ninety-four responses were analyzed using a Rasch modeling approach. Item difficulties and student proficiency scores were calculated and analyzed regarding their relative performance with respect to the three levels of the construct. The analysis results provided initial evidence in support of the construct as conceived, with students displaying a range of analogical reasoning spanning all three construct levels. It also identified problematic items that merit further examination. Overall, the assessment has provided us the opportunity to better describe and frame the cognitive uses of models by students during learning situations in Earth Science. Implications for instruction and future directions for research in this area are discussed. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 713–743, 2012
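For reference, the dichotomous Rasch model at the core of this analysis family gives the probability of success on an item as

P(X = 1 \mid \theta, b) = \frac{\exp(\theta - b)}{1 + \exp(\theta - b)}

with student proficiency \theta and item difficulty b on a common logit scale; when responses are scored into ordered outcome levels such as the three construct levels above, polytomous extensions of the model (e.g., the partial credit model) are commonly used.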

20.
In this ITEMS module, we introduce the generalized deterministic inputs, noisy "and" gate (G-DINA) model, which is a general framework for specifying, estimating, and evaluating a wide variety of cognitive diagnosis models. The module contains a nontechnical introduction to diagnostic measurement, an introductory overview of the G-DINA model, as well as common special cases, and a review of model-data fit evaluation practices within this framework. We use the flexible GDINA R package, which is available for free within the R environment and provides a user-friendly graphical interface in addition to the code-driven layer. The digital module also contains videos of worked examples, solutions to data activity questions, curated resources, a glossary, and quizzes with diagnostic feedback.
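For readers new to the framework, the G-DINA model with the identity link expresses the probability of success on item j as a saturated function of the reduced attribute vector \boldsymbol{\alpha}^{*}_{lj}, which contains only the K^{*}_{j} attributes required by the item:

P(X_{j}=1 \mid \boldsymbol{\alpha}^{*}_{lj}) = \delta_{j0} + \sum_{k=1}^{K^{*}_{j}} \delta_{jk}\alpha_{lk} + \sum_{k=1}^{K^{*}_{j}-1} \sum_{k'=k+1}^{K^{*}_{j}} \delta_{jkk'}\alpha_{lk}\alpha_{lk'} + \cdots + \delta_{j12\cdots K^{*}_{j}} \prod_{k=1}^{K^{*}_{j}} \alpha_{lk}

where \delta_{j0} is the baseline probability, the \delta_{jk} are attribute main effects, and the remaining \delta terms are interactions; constraining these terms (or switching to a logit or log link) yields special cases such as DINA, DINO, and the additive CDM.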
