期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Differences Between Self-Adapted and Computerized Adaptive Tests: A Meta-Analysis

Angela K. Pitkin Walter P. Vispoel 《Journal of Educational Measurement》2001,38(3):235-247

Self-adapted testing has been described as a variation of computerized adaptive testing that reduces test anxiety and thereby enhances test performance. The purpose of this study was to gain a better understanding of these proposed effects of self-adapted tests (SATs); meta-analysis procedures were used to estimate differences between SATs and computerized adaptive tests (CATs) in proficiency estimates and post-test anxiety levels across studies in which these two types of tests have been compared. After controlling for measurement error, the results showed that SATs yielded proficiency estimates that were 0.12 standard deviation units higher and post-test anxiety levels that were 0.19 standard deviation units lower than those yielded by CATs. We speculate about possible reasons for these differences and discuss advantages and disadvantages of using SATs in operational settings. 相似文献

2.

Psychometric Characteristics of Computer-Adaptive and Self-Adaptive Vocabulary Tests: The Role of Answer Feedback and Test Anxiety

Walter P. Vispoel 《Journal of Educational Measurement》1998,35(2):155-167

This study focused on the effects of administration mode (computer-adaptive test [CAT] versus self-adaptive test [SAT]), item-by-item answer feedback (present versus absent), and test anxiety on results obtained from computerized vocabulary tests. Examinees were assigned at random to four testing conditions (CAT with feedback, CAT without feedback, SAT with feedback, SAT without feedback). Examinees completed the Test Anxiety Inventory (Spielberger, 1980) before taking their assigned computerized tests. Results showed that the CATs were more reliable and took less time to complete than the SATs. Administration time for both the CATs and SATs was shorter when feedback was provided than when it was not, and this difference was most pronounced for examinees at medium to high levels of test anxiety. These results replicate prior findings regarding the precision and administrative efficiency of CATs and SATs but point to new possible benefits of including answer feedback on such tests. 相似文献

3.

Rater Agreement in Test‐to‐Curriculum Alignment Reviews

下载免费PDF全文

A. Traynor H. E. Merzdorf 《Educational Measurement》2018,37(3):55-64

During the development of large‐scale curricular achievement tests, recruited panels of independent subject‐matter experts use systematic judgmental methods—often collectively labeled “alignment” methods—to rate the correspondence between a given test's items and the objective statements in a particular curricular standards document. High disagreement among the expert panelists may indicate problems with training, feedback, or other steps of the alignment procedure. Existing procedural recommendations for alignment reviews have been derived largely from single‐panel research studies; support for their use during operational large‐scale test development may be limited. Synthesizing data from more than 1,000 alignment reviews of state achievement tests, this study identifies features of test–standards alignment review procedures that impact agreement about test item content. The researchers then use their meta‐regression results to propose some practical suggestions for alignment review implementation. 相似文献

4.

Aligning Tests with States' Content Standards: Methods and Issues

Dennison S. Bhola James C. Impara Chad W. Buckendahl 《Educational Measurement》2003,22(3):21-29

相似文献

5.

Effects of Differentially Time-Consuming Tests on Computer-Adaptive Test Scores

Brent Bridgeman Frederick Cline 《Journal of Educational Measurement》2004,41(2):137-148

Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees must guess on the final six questions of the analytical section of the Graduate Record Examination if they were to finish before time expires. At the higher-ability levels, even more guessing was required because the questions administered to higher-ability examinees were typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is very complex because of the relationship of time and difficulty, and the multidimensionality of the test. 相似文献

6.

The Relationship Between Item Exposure and Test Overlap in Computerized Adaptive Testing 总被引：1，自引：0，他引：1

Shu-Ying Chen Robert D. Ankenmann Judith A. Spray 《Journal of Educational Measurement》2003,40(2):129-145

The purpose of this article is to present an analytical derivation for the mathematical form of an average between-test overlap index as a function of the item exposure index, for fixed-length computerized adaptive tests (CATs). This algebraic relationship is used to investigate the simultaneous control of item exposure at both the item and test levels. The results indicate that, in fixed-length CATs, control of the average between-test overlap is achieved via the mean and variance of the item exposure rates of the items that constitute the CAT item pool. The mean of the item exposure rates is easily manipulated. Control over the variance of the item exposure rates can be achieved via the maximum item exposure rate (r_max). Therefore, item exposure control methods which implement a specification of r_max (e.g., Sympson & Hetter, 1985) provide the most direct control at both the item and test levels. 相似文献

7.

Refining Methods for Estimating Critical Values for an Alignment Index

《Journal of research on educational effectiveness》2013,6(4):380-395

Abstract

The alignment among standards, assessments, and teachers’ instruction is an essential element of standards-based educational reforms. The Surveys of Enacted Curriculum (SEC) is the only common tool that can be used to measure the alignment among all three of these sources (Martone & Sireci, 2009). Prior SEC alignment work has been limited by not allowing for significance tests. A recent article (Fulmer, 2011) provided a first attempt to address this shortcoming of the SEC, but that work was limited in several ways. We extend Fulmer's simulation approach by accounting for important elements of the SEC procedures, including the proper framework size, number of standards and assessment points, number of raters, rater cell-splitting rates, and rater agreement results. The results indicate that inferences about relative alignment may be heavily influenced by features of the alignment procedures. Thus, our method should be broadly applied to future SEC alignment investigations. 相似文献

8.

Can Examinees Use Judgments of Item Difficulty to Improve Proficiency Estimates on Computerized Adaptive Vocabulary Tests?

Walter P. Vispoel Sara J. Clough Timothy Bleiler Amy B. Hendrickson Damien Ihrig 《Journal of Educational Measurement》2002,39(4):311-330

Recent simulation studies indicate that there are occasions when examinees can use judgments of relative item difficulty to obtain positively biased proficiency estimates on computerized adaptive tests (CATs) that permit item review and answer change. Our purpose in the study reported here was to evaluate examinees' success in using these strategies while taking CATs in a live testing setting. We taught examinees two item difficulty judgment strategies designed to increase proficiency estimates. Examinees who were taught each strategy and examinees who were taught neither strategy were assigned at random to complete vocabulary CATs under conditions in which review was allowed after completing all items and when review was allowed only within successive blocks of items. We found that proficiency estimate changes following review were significantly higher in the regular review conditions than in the strategy conditions. Failure to obtain systematically higher scores in the strategy conditions was due in large part to errors examinees made in judging the relative difficulty of CAT items. 相似文献

9.

A Simulation and Comparison of Flexilevel and Bayesian Computerized Adaptive Testing

R. J. De Ayala Barbara G. Dodd William R. Koch 《Journal of Educational Measurement》1990,27(3):227-239

Computerized adaptive testing (CAT) is a testing procedure that adapts an examination to an examinee's ability by administering only items of appropriate difficulty for the examinee. In this study, the authors compared Lord's flexilevel testing procedure (flexilevel CAT) with an item response theory-based CAT using Bayesian estimation of ability (Bayesian CAT). Three flexilevel CATs, which differed in test length (36, 18, and 11 items), and three Bayesian CATs were simulated; the Bayesian CATs differed from one another in the standard error of estimate (SEE) used for terminating the test (0.25, 0.10, and 0.05). Results showed that the flexilevel 36- and 18-item CATs produced ability estimates that may be considered as accurate as those of the Bayesian CAT with SEE = 0.10 and comparable to the Bayesian CAT with SEE = 0.05. The authors discuss the implications for classroom testing and for item response theory-based CAT. 相似文献

10.

Alignment Between the Science Curriculum and Assessment in Selected NY State Regents Exams

Xiufeng Liu Gavin Fulmer 《Journal of Science Education and Technology》2008,17(4):373-383

This article reports on an analysis of alignment between NY state core curricula and NY Regents tests in physics and chemistry. Both the curriculum and test were represented by a two dimensional table consisting of topics and cognitive demands. The cell values of the table were numbers of major understandings in the curriculum and points of test items in the test. The Porter alignment index was computed for each test. It was found that, overall, there was a high alignment between the NY core curriculum and the NY Regents test, and the alignment remained fairly stable from test to test. However, there were considerable discrepancies in emphases on different cognitive levels and topics between the core curriculum and the test. Issues related to the nature of alignment, and the nature and validity of content standards were raised, and implications for science curriculum and instructions were also discussed. 相似文献

11.

Gauging Item Alignment Through Online Systems While Controlling for Rater Effects

下载免费PDF全文

Daniel Anderson Shawn Irvin Julie Alonzo Gerald A. Tindal 《Educational Measurement》2015,34(1):22-33

The alignment of test items to content standards is critical to the validity of decisions made from standards‐based tests. Generally, alignment is determined based on judgments made by a panel of content experts with either ratings averaged or via a consensus reached through discussion. When the pool of items to be reviewed is large, or the content‐matter experts are broadly distributed geographically, panel methods present significant challenges. This article illustrates the use of an online methodology for gauging item alignment that does not require that raters convene in person, reduces the overall cost of the study, increases time flexibility, and offers an efficient means for reviewing large item banks. Latent trait methods are applied to the data to control for between‐rater severity, evaluate intrarater consistency, and provide item‐level diagnostic statistics. Use of this methodology is illustrated with a large pool (1,345) of interim‐formative mathematics test items. Implications for the field and limitations of this approach are discussed. 相似文献

12.

Assessing the Language Acquisition Progress of Limited English Proficient Students: Problems and New Alternative

《教育实用测度》2013,26(2):85-113

The article discusses the need educators have for measures of linguistic competence for limited-English-proficient (LEP) students. Traditional measurement procedures do not meet these needs because of mismatches between educational experiences and test content, cultural experiences and test con- tent, and linguistic experience and test content. A new type of test -Sentence Verification Technique (SVT) test - that may meet some of the measurement needs of LEP students is described, and the results of a study that examines the reliability and validity of the new tests as measures of the listening and reading comprehension performance in both the native language and English are reported. The results indicate that the tests are reliable and that SVT performance varies as functions of placement in a transitional bilingual education program, teacher judgments of competence, and difficulty of the material. These results are consistent with the interpretation that SVT tests are valid measures of the linguistic competence of LEP students. The article concludes with a discussion of some of the advantages of using SVT tests with LEP populations. 相似文献

13.

A Proposed Framework for Evaluating Alignment Studies

Susan L. Davis‐Becker Chad W. Buckendahl 《Educational Measurement》2013,32(1):23-33

Evaluating the multiple characteristics of alignment has taken a prominent role in educational assessment and accountability systems given its attention in the No Child Left Behind legislation (NCLB). Leading to this rise in popularity, alignment methodologies that examined relationships among curriculum, academic content standards, instruction, and assessments were proposed as strategies to evaluate evidence of the intended uses and interpretations of test scores. In this article, we propose a framework for evaluating alignment studies based on similar concepts that have been recommended for standard setting (Kane). This framework provides guidance to practitioners about how to identify sources of validity evidence for an alignment study and make judgments about the strength of the evidence that may impact the interpretation of the results. 相似文献

14.

An Investigation of the Alignment Method With Polytomous Indicators Under Conditions of Partial Measurement Invariance

Jessica K. Flake D. Betsy McCoach 《Structural equation modeling》2018,25(1):56-70

The alignment method (Asparouhov & Muthén, 2014) is an alternative to multiple-group factor analysis for estimating measurement models and testing for measurement invariance across groups. Simulation studies evaluating the performance of the alignment for estimating measurement models across groups show promising results for continuous indicators. This simulation study builds on previous research by investigating the performance of the alignment method’s measurement models estimates with polytomous indicators under conditions of systematically increasing, partial measurement invariance. We also present an evaluation of the testing procedure, which has not been the focus of previous simulation studies. Results indicate that the alignment adequately recovers parameter estimates under small and moderate amounts of noninvariance, with issues only arising in extreme conditions. In addition, the statistical tests of invariance were fairly conservative, and had less power for items with more extreme skew. We include recommendations for using the alignment method based on these results. 相似文献

15.

A Closer Look at Using Judgments of Item Difficulty to Change Answers on Computerized Adaptive Tests

Walter P. Vispoel Sara J. Clough Timothy Bleiler 《Journal of Educational Measurement》2005,42(4):331-350

Recent studies have shown that restricting review and answer change opportunities on computerized adaptive tests (CATs) to items within successive blocks reduces time spent in review, satisfies most examinees' desires for review, and controls against distortion in proficiency estimates resulting from intentional incorrect answering of items prior to review. However, restricting review opportunities on CATs may not prevent examinees from artificially raising proficiency estimates by using judgments of item difficulty to signal when to change previous answers. We evaluated six strategies for using item difficulty judgments to change answers on CATs and compared the results to those from examinees reviewing and changing answers in the usual manner. The strategy conditions varied in terms of when examinees were prompted to consider changing answers and in the information provided about the consistency of the item selection algorithm. We found that examinees fared best on average when they reviewed and changed answers in the usual manner. The best gaming strategy was one in which the examinees knew something about the consistency of the item selection algorithm and were prompted to change responses only when they were unsure about answer correctness and sure about their item difficulty judgments. However, even this strategy did not produce a mean gain in proficiency estimates. 相似文献

16.

Comparing Methods of Assessing Differential Item Functioning in a Computerized Adaptive Testing Environment

Pui-Wa Lei Shu-Ying Chen Lan Yu 《Journal of Educational Measurement》2006,43(3):245-264

Mantel-Haenszel and SIBTEST, which have known difficulty in detecting non-unidirectional differential item functioning (DIF), have been adapted with some success for computerized adaptive testing (CAT). This study adapts logistic regression (LR) and the item-response-theory-likelihood-ratio test (IRT-LRT), capable of detecting both unidirectional and non-unidirectional DIF, to the CAT environment in which pretest items are assumed to be seeded in CATs but not used for trait estimation. The proposed adaptation methods were evaluated with simulated data under different sample size ratios and impact conditions in terms of Type I error, power, and specificity in identifying the form of DIF. The adapted LR and IRT-LRT procedures are more powerful than the CAT version of SIBTEST for non-unidirectional DIF detection. The good Type I error control provided by IRT-LRT under extremely unequal sample sizes and large impact is encouraging. Implications of these and other findings are discussed. 相似文献

17.

Computer adaptive testing,big data and algorithmic approaches to education 总被引：1，自引：0，他引：1

Greg Thompson 《British Journal of Sociology of Education》2017,38(6):827-840

This article critically considers the promise of computer adaptive testing (CAT) and digital data to provide better and quicker data that will improve the quality, efficiency and effectiveness of schooling. In particular, it uses the case of the Australian NAPLAN test that will become an online, adaptive test from 2016. The article argues that CATs are specific examples of technological ensembles which are producing, and working through, new subjectivities. In particular, CATs leverage opportunities for big data and algorithmic approaches to education that are symptomatic of what Deleuze saw as the shift from disciplinary to control institutions and societies. 相似文献

18.

The Influence of Several Factors on Reliability for Complex Reading Comprehension Tests

Guemin Lee 《Journal of Educational Measurement》2002,39(2):149-164

The purpose of this study was to investigate the effects of items, passages, contents, themes, and types of passages on the reliability and standard errors of measurement for complex reading comprehension tests. Seven different generalizability theory models were used in the analyses. Results indicated that generalizability coefficients estimated using multivariate models incorporating content strata and types of passages were similar in size to reliability estimates based upon a model that did not include these factors. In contrast, incorporating passages and themes within univariate generalizability theory models produced non-negligible differences in the reliability estimates. This suggested that passages and themes be taken into account when evaluating the reliability of test scores for complex reading comprehension tests. 相似文献

19.

Limiting Answer Review and Change on Computerized Adaptive Vocabulary Tests: Psychometric and Attitudinal Results

Walter P. Vispoel Amy B. Hendrickson Timothy Bleiler 《Journal of Educational Measurement》2000,37(1):21-38

Previous simulation studies of computerized adaptive tests (CATs) have revealed that the validity and precision of proficiency estimates can be maintained when review opportunities are limited to items within successive blocks. Our purpose in this study was to evaluate the effectiveness of CATs with such restricted review options in a live testing setting. Vocabulary CATs were compared under four conditions: (a) no item review allowed, (b) review allowed only within successive 5-item blocks, (c) review allowed only within successive lO-item blocks, and (d) review allowed only after answering all 40 items. Results revealed no trust-worthy differences among conditions in vocabulary proficiency estimates, measurement error, or testing time. Within each review condition, ability estimates and number correct scores increased slightly after review, more answers were changed from wrong to right than from right to wrong, most examinees who changed answers improved proficiency estimates by doing so, and nearly all examinees indicated that they had an adequate opportunity to review their previous answers. These results suggest that restricting review opportunities on CATs may provide a viable way to satisfy examinee desires, maintain validity and measurement precision, and keep testing time at acceptable levels. 相似文献

20.

APPLICATION OF COMPUTERIZED ADAPTIVE TESTING TO EDUCATIONAL PROBLEMS 总被引：1，自引：0，他引：1

DAVID J. WEISS G. GAGE KINGSBURY 《Journal of Educational Measurement》1984,21(4):361-375

Three applications of computerized adaptive testing (CAT) to help solve problems encountered in educational settings are described and discussed. Each of these applications makes use of item response theory to select test questions from an item pool to estimate a student's achievement level and its precision. These estimates may then be used in conjunction with certain testing strategies to facilitate certain educational decisions. The three applications considered are (a) adaptive mastery testing for determining whether or not a student has mastered a particular content area, (b) adaptive grading for assigning grades to students, and (c) adaptive self-referenced testing for estimating change in a student's achievement level. Differences between currently used classroom procedures and these CAT procedures are discussed. For the adaptive mastery testing procedure, evidence from a series of studies comparing conventional and adaptive testing procedures is presented showing that the adaptive procedure results in more accurate mastery classifications than do conventional mastery tests, while using fewer test questions. 相似文献