期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Computerized adaptive testing in instructional settings 总被引：3，自引：0，他引：3

R. Edwin Welch Theodore W. Frick 《Educational technology research and development : ETR & D》1993,41(3):47-62

Item response theory (IRT) has most often been used in research on computerized adaptive testing (CAT). Depending on the model used, IRT requires between 200 and 1,000 examinees for estimating item parameters. Thus, it is not practical for instructional designers to develop their own CAT based on the IRT model. Frick improved Wald's sequential probability ratio test (SPRT) by combining it with normative expert systems reasoning, referred to as an EXSPRT-based CAT. While previous studies were based on re-enactments from historical test data, the present study is the first to examine how well these adaptive methods function in a real-time testing situation. Results indicate that the EXSPRT-I significantly reduced test lengths and was highly accurate in predicting mastery. EXSPRT is apparently a viable and practical alternative to IRT for assessing mastery of instructional objectives. 相似文献

2.

Multilevel Cognitive Diagnosis Models for Assessing Changes in Latent Attributes

下载免费PDF全文

Hung‐Yu Huang 《Journal of Educational Measurement》2017,54(4):440-480

Cognitive diagnosis models (CDMs) have been developed to evaluate the mastery status of individuals with respect to a set of defined attributes or skills that are measured through testing. When individuals are repeatedly administered a cognitive diagnosis test, a new class of multilevel CDMs is required to assess the changes in their attributes and simultaneously estimate the model parameters from the different measurements. In this study, the most general CDM of the generalized deterministic input, noisy “and” gate (G‐DINA) model was extended to a multilevel higher order CDM by embedding a multilevel structure into higher order latent traits. A series of simulations based on diverse factors was conducted to assess the quality of the parameter estimation. The results demonstrate that the model parameters can be recovered fairly well and attribute mastery can be precisely estimated if the sample size is large and the test is sufficiently long. The range of the location parameters had opposing effects on the recovery of the item and person parameters. Ignoring the multilevel structure in the data by fitting a single‐level G‐DINA model decreased the attribute classification accuracy and the precision of latent trait estimation. The number of measurement occasions had a substantial impact on latent trait estimation. Satisfactory model and person parameter recoveries could be achieved even when assumptions of the measurement invariance of the model parameters over time were violated. A longitudinal basic ability assessment is outlined to demonstrate the application of the new models. 相似文献

3.

Generic questioning strategies for linking teaching and testing

Thomas M. Haladyna 《Educational technology research and development : ETR & D》1991,39(1):73-81

Modern instructional theory and research suggest that the content of instruction should be closely linked with testing. The content of an instructional program should not focus solely on memorization of facts but should also include higher level thinking. Three uses of tests within any instructional program are: (1) practice on objectives, (2) feedback about mastery of those objectives, and (3) summative evaluation. The context-dependent item set is proposed as a useful tool for measuring many higher level objectives. A generic method for developing context-dependent test item sets is proposed, and several examples are provided. The procedure is useful for developing a larger number of test items that can be used for any of the three uses of tests. The procedure also seems to apply to a wide variety of subject matter. 相似文献

4.

APPLICATION OF COMPUTERIZED ADAPTIVE TESTING TO EDUCATIONAL PROBLEMS 总被引：1，自引：0，他引：1

DAVID J. WEISS G. GAGE KINGSBURY 《Journal of Educational Measurement》1984,21(4):361-375

Three applications of computerized adaptive testing (CAT) to help solve problems encountered in educational settings are described and discussed. Each of these applications makes use of item response theory to select test questions from an item pool to estimate a student's achievement level and its precision. These estimates may then be used in conjunction with certain testing strategies to facilitate certain educational decisions. The three applications considered are (a) adaptive mastery testing for determining whether or not a student has mastered a particular content area, (b) adaptive grading for assigning grades to students, and (c) adaptive self-referenced testing for estimating change in a student's achievement level. Differences between currently used classroom procedures and these CAT procedures are discussed. For the adaptive mastery testing procedure, evidence from a series of studies comparing conventional and adaptive testing procedures is presented showing that the adaptive procedure results in more accurate mastery classifications than do conventional mastery tests, while using fewer test questions. 相似文献

5.

Diagnosing Teachers’ Understandings of Rational Numbers: Building a Multidimensional Test Within the Diagnostic Classification Framework

Laine Bradshaw Andrew Izsák Jonathan Templin Erik Jacobson 《Educational Measurement》2014,33(1):2-14

We report a multidimensional test that examines middle grades teachers’ understanding of fraction arithmetic, especially multiplication and division. The test is based on four attributes identified through an analysis of the extensive mathematics education research literature on teachers’ and students’ reasoning in this content area. We administered the test to a national sample of 990 in‐service middle grades teachers and analyzed the item responses using the log‐linear cognitive diagnosis model. We report the diagnostic quality of the test at the item level, mastery classifications for teachers, and attribute relationships. Our results demonstrate that, when a test is grounded in research on cognition and is designed to be multidimensional from the onset, it is possible to use diagnostic classification models to detect distinct patterns of attribute mastery. 相似文献

6.

Written feedback: Response certitude and durability

Raymond W. Kulhavy William A. Stock Thomas E. Hancock Linda K. Swindell Penny L. Hammrich 《Contemporary educational psychology》1990,15(4)

This study tested assumptions of a servocontrol model of test item feedback. High school students responded to multiple-choice items and rated their certainty of correctness in each response. Next, learners either received feedback on the items or responded again to the same test. The same items were tested again after 1 and 8 days, with the order to alternatives randomized for half of the subjects in each feedback group. The results generally supported the control model and suggest that response certitude estimates can be treated as an index of comprehension. 相似文献

7.

Using the Attribute Hierarchy Method to Make Diagnostic Inferences about Examinees’ Cognitive Skills in Critical Reading

Changjiang Wang Mark J. Gierl 《Journal of Educational Measurement》2011,48(2):165-187

The purpose of this study is to apply the attribute hierarchy method (AHM) to a subset of SAT critical reading items and illustrate how the method can be used to promote cognitive diagnostic inferences. The AHM is a psychometric procedure for classifying examinees’ test item responses into a set of attribute mastery patterns associated with different components from a cognitive model. The study was conducted in two steps. In step 1, three cognitive models were developed by reviewing selected literature in reading comprehension as well as research related to SAT Critical Reading. Then, the cognitive models were validated by having a sample of students think aloud as they solved each item. In step 2, psychometric analyses were conducted on the SAT critical reading cognitive models by evaluating the model‐data fit between the expected and observed response patterns produced from two random samples of 2,000 examinees who wrote the items. The model that provided best data‐model fit was then used to calculate attribute probabilities for 15 examinees to illustrate our diagnostic testing procedure. 相似文献

8.

Applying an IRT-Based Cognitive Diagnostic Model to Diagnose Students' Knowledge States in Multiplication and Division with Exponents

《教育实用测度》2013,26(4):255-268

The study applied a psychometric model-the rule-space model-to diagnose students' states of knowledge about how the exponents behave in multiplication and division of quantities with exponents. A 38-item test was administered to 431 Grade 10 students. Each item was characterized by a list of task attributes required for answering the item correctly, and each student was classified, based on his or her item-score pattern, into the most likely knowledge, state (i.e., attribute-mastery pattern) corresponding to an ideal item-score pattern. The following outcomes of the rule-space model were presented: (a) the results of the classification of examinees to knowledge states at the group level along with individual examples, (b) the mastery level of the underlying task attributes as evaluated at three different test-score groups, and (c) a tree diagram of the transitional relationships among the knowledge states that can guide the design of effective remediation. Implications for utilizing the feedback, provided by the rule-space model in the context of instruction and assessment are discussed. 相似文献

9.

Dual‐Objective Item Selection Criteria in Cognitive Diagnostic Computerized Adaptive Testing

下载免费PDF全文

Hyeon‐Ah Kang Susu Zhang Hua‐Hua Chang 《Journal of Educational Measurement》2017,54(2):165-183

The development of cognitive diagnostic‐computerized adaptive testing (CD‐CAT) has provided a new perspective for gaining information about examinees' mastery on a set of cognitive attributes. This study proposes a new item selection method within the framework of dual‐objective CD‐CAT that simultaneously addresses examinees' attribute mastery status and overall test performance. The new procedure is based on the Jensen‐Shannon (JS) divergence, a symmetrized version of the Kullback‐Leibler divergence. We show that the JS divergence resolves the noncomparability problem of the dual information index and has close relationships with Shannon entropy, mutual information, and Fisher information. The performance of the JS divergence is evaluated in simulation studies in comparison with the methods available in the literature. Results suggest that the JS divergence achieves parallel or more precise recovery of latent trait variables compared to the existing methods and maintains practical advantages in computation and item pool usage. 相似文献

10.

小学生乘法计算能力的认知诊断及补救教学

王施媛戴海琦《考试研究》2012,(2):59-65

认知诊断通过分析被试的项目作答反应,推断被试的认知属性掌握状态,为学习困难学生设计补救教学提供了非常有价值的信息。本文作者在探讨了小学生多位数乘法计算能力的认知属性、编制了2份相同考核模式的认知诊断测验后,选择江西某小学310名高年级学生为被试,先施测第1份认知诊断测验,采用DINA模型,自编参数估计程序进行诊断,得到了每一个被试的属性掌握模式分类及全体被试在各个属性上的掌握情况。然后设计和实施补救教学,在实施补救教学后再施测第2份认知诊断测验以检验补救效果。研究发现：（1）该小学高年级学生对0XN运算法则、多位数乘以两位数的运算程序、乘法进位认知属性的掌握不太理想,特别是乘法进位。（2）属性掌握模式中属全部掌握模式的被试人数占86．47％,其余被试均分类于存在各种认知不足的掌握模式。（3）比较两份认知诊断测验报告,结果表明在认知诊断指导下的补救教学有针对性,补救后被试正确作答项目增多,属性掌握个数也有所增加,补救效果良好。相似文献

11.

Evaluating the Wald Test for Item‐Level Comparison of Saturated and Reduced Models in Cognitive Diagnosis

Jimmy de la Torre Young‐Sun Lee 《Journal of Educational Measurement》2013,50(4):355-373

This article used the Wald test to evaluate the item‐level fit of a saturated cognitive diagnosis model (CDM) relative to the fits of the reduced models it subsumes. A simulation study was carried out to examine the Type I error and power of the Wald test in the context of the G‐DINA model. Results show that when the sample size is small and a larger number of attributes are required, the Type I error rate of the Wald test for the DINA and DINO models can be higher than the nominal significance levels, while the Type I error rate of the A‐CDM is closer to the nominal significance levels. However, with larger sample sizes, the Type I error rates for the three models are closer to the nominal significance levels. In addition, the Wald test has excellent statistical power to detect when the true underlying model is none of the reduced models examined even for relatively small sample sizes. The performance of the Wald test was also examined with real data. With an increasing number of CDMs from which to choose, this article provides an important contribution toward advancing the use of CDMs in practical educational settings. 相似文献

12.

The Two Most Useful Approaches to Estimating Criterion‐Referenced Test Reliability in a Single Test Administration

William Coscarelli Sharon Shrock 《Performance Improvement Quarterly》2002,15(4):74-85

In criterion‐referenced tests (CRTs), the traditional measures of reliability used in norm‐referenced tests (NRTs) have often proved problematic because of NRT assumptions of one underlying ability or competency and of variance in the distribution of scores. CRTs, by contrast, are likely to be created when mastery of the skill or knowledge by all or most all test takers is expected and thus little variation in the scores is expected. A comprehensive CRT often measures a number of discrete tasks that may not represent a single unifying ability or competence. Hence, CRTs theoretically violate the two most essential assumptions of classic NRT re liability theory and they have traditionally required the logistical problems of multiple test administrations to the same test takers to estimate reliability. A review of the literature categorizes approaches to reliability for CRTs into two classes: estimates sensitive to all measures of error and estimates of consistency in test outcome. For single test administration of CRTs Livingston's k²is recommended for estimating all measures of error, S_c is proposed for estimates of consistency in test outcome. Both approaches compared using data from a CRT exam and recommendations for interpretation and use are proposed. 相似文献

13.

Persistence of Performance‐Approach Individuals in Achievement Situations: An application of the Rasch model

Georgios D. Sideridis 《教育心理学》2007,27(6):753-770

Research in goal theory has often relied on the general linear model, ignoring some central assumptions of the theoretical framework under investigation. The purpose of the two studies reported here was to illustrate how the Rasch model can supplement traditional statistical analyses when evaluating the effects of goal orientations on persistence. In Study 1, 41 adolescents participated in a series of insolvable puzzles. Means analyses failed to reveal differences between mastery‐approach and performance‐approach students. Application of the Rasch model, and in particular the differential item functioning procedure, indicated that mastery students were more severely challenged by the last puzzle, compared to performance approach students, with their probability of persisting being significantly lower. The findings of Study 1 were replicated in Study 2 with a sample of 37 college students who were also given a series of insolvable puzzles. The findings suggest that use of the Rasch model can be particularly fruitful for our understanding of the complex achievement‐goal relationship. 相似文献

14.

Using Efficient Model Based Sum-Scores for Conducting Skills Diagnoses

Robert Henson Jonathan Templin Jeffrey Douglas 《Journal of Educational Measurement》2007,44(4):361-376

Consider test data, a specified set of dichotomous skills measured by the test, and an IRT cognitive diagnosis model (ICDM). Statistical estimation of the data set using the ICDM can provide examinee estimates of mastery for these skills, referred to generally as attributes. With such detailed information about each examinee, future instruction can be tailored specifically for each student, often referred to as formative assessment. However, use of such cognitive diagnosis models to estimate skills in classrooms can require computationally intensive and complicated statistical estimation algorithms, which can diminish the breadth of applications of attribute level diagnosis. We explore the use of sum-scores (each attribute measured by a sum-score) combined with estimated model-based sum-score mastery/nonmastery cutoffs as an easy-to-use and intuitive method to estimate attribute mastery in classrooms and other settings where simple skills diagnostic approaches are desirable. Using a simulation study of skills diagnosis test settings and assuming a test consisting of a model-based calibrated set of items, correct classification rates (CCRs) are compared among four model-based approaches for estimating attribute mastery, namely using full model-based estimation and three different methods of computing sum-scores (simple sum-scores, complex sum-scores, and weighted complex sum-scores) combined with model-based mastery sum-score cutoffs. In summary, the results suggest that model-based sum-scores and mastery cutoffs can be used to estimate examinee attribute mastery with only moderate reductions in CCRs in comparison with the full model-based estimation approach. Certain topics are mentioned that are currently being investigated, especially applications in classroom and textbook settings. 相似文献

15.

Nonparametric Person-Fit Research: Some Theoretical issues an Empirical Example

《教育实用测度》2013,26(1):77-89

In person-fit analysis, it is investigated whether an item score pattern is improbable given the item score patterns of the other persons in the group or given an expected score pattern on the basis of a test model. In this study, several existing group-based statistics are discussed to detect such improbable item score patterns, along with the cut scores that were proposed in the literature to classify an item score pattern as aberrant. By means of a simulation study and an empirical study, the detection rate of these statistics is compared, and the practical use of various cut scores is investigated. It is furthermore demonstrated that person-fit statistics can be used to detect persons with a deficiency of knowledge on an achievement test. 相似文献

16.

Generating Dichotomous Item Scores with the Four-Parameter Beta Compound Binomial Model

Patrick O. Monahan Won-Chan Lee Robert D. Ankenmann 《Journal of Educational Measurement》2007,44(3):211-225

A Monte Carlo simulation technique for generating dichotomous item scores is presented that implements (a) a psychometric model with different explicit assumptions than traditional parametric item response theory (IRT) models, and (b) item characteristic curves without restrictive assumptions concerning mathematical form. The four-parameter beta compound-binomial (4PBCB) strong true score model (with two-term approximation to the compound binomial) is used to estimate and generate the true score distribution. The nonparametric item-true score step functions are estimated by classical item difficulties conditional on proportion-correct total score. The technique performed very well in replicating inter-item correlations, item statistics (point-biserial correlation coefficients and item proportion-correct difficulties), first four moments of total score distribution, and coefficient alpha of three real data sets consisting of educational achievement test scores. The technique replicated real data (including subsamples of differing proficiency) as well as the three-parameter logistic (3PL) IRT model (and much better than the 1PL model) and is therefore a promising alternative simulation technique. This 4PBCB technique may be particularly useful as a more neutral simulation procedure for comparing methods that use different IRT models. 相似文献

17.

DETECTION OF A SKILL DICHOTOMY USING STANDARDIZED ACHIEVEMENT TEST ITEMS

EDWARD HAERTEL 《Journal of Educational Measurement》1984,21(1):59-72

Multiple-choice reading comprehension items from a conventional, norm-referenced reading comprehension test are successfully analyzed using a simple latent class model. A classification rule for assigning respondents to "mastery" or "nonmastery" states is presented which simplifies the scoring procedure of Macready and Dayton (1977). A procedure is also derived for estimating the "true," or "disattenuated," latent cross-classification of masters versus nonmasters for two tests, and illustrated using two sets of items from the same content domain. Results support the use of latent class, state mastery models with more heterogeneous item pools than has been advocated by previous authors. 相似文献

18.

Stratified and Maximum Information Item Selection Procedures in Computer Adaptive Testing

Hui Deng Timothy Ansley Hua-Hua Chang 《Journal of Educational Measurement》2010,47(2):202-226

In this study we evaluated and compared three item selection procedures: the maximum Fisher information procedure (F), the a-stratified multistage computer adaptive testing (CAT) (STR), and a refined stratification procedure that allows more items to be selected from the high a strata and fewer items from the low a strata (USTR), along with completely random item selection (RAN). The comparisons were with respect to error variances, reliability of ability estimates and item usage through CATs simulated under nine test conditions of various practical constraints and item selection space. The results showed that F had an apparent precision advantage over STR and USTR under unconstrained item selection, but with very poor item usage. USTR reduced error variances for STR under various conditions, with small compromises in item usage. Compared to F, USTR enhanced item usage while achieving comparable precision in ability estimates; it achieved a precision level similar to F with improved item usage when items were selected under exposure control and with limited item selection space. The results provide implications for choosing an appropriate item selection procedure in applied settings. 相似文献

19.

A Preliminary Study of Questions Which Adolescents Find Unanswerable

George Lawton 《Journal of Experimental Education》2013,81(2):99-104

In this article, the authors investigated the teacher practices that middle school students attend to when appraising their classroom's mastery goal structure. After students rated each item on the mastery goal structure scale, they wrote what their teacher did or said that led them to make that choice. Students' responses to the open-ended questions were coded thematically. The categories mentioned most often involved the pedagogical and affective nature of teachers' interactions with students. Recognition and evaluation practices and teachers' use of time were also salient to students. There were no differences in the practices that students attended to in classrooms with high, compared with low, mastery goal structure. 相似文献

20.

Evaluating Comparability in Computerized Adaptive Testing: Issues, Criteria and an Example

Tianyou Wang Michael J. Kolen 《Journal of Educational Measurement》2001,38(1):19-49

When a computerized adaptive testing (CAT) version of a test co-exists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from its P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review research literature on CAT comparability issues and synthesize issues specific to these two settings. A framework of criteria for evaluating comparability was developed that contains the following three categories of criteria: validity criterion, psychometric property/reliability criterion, and statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria as well as various algorithms for improving comparability are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of the development of a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of incomparability may be due to the change from number-correct score-based scoring to IRT ability estimation-based scoring. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size were found to result in different levels of comparability in test scores. 相似文献