期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Detection of Test Collusion via Kullback–Leibler Divergence

Dmitry I. Belov 《Journal of Educational Measurement》2013,50(2):141-163

The development of statistical methods for detecting test collusion is a new research direction in the area of test security. Test collusion may be described as large‐scale sharing of test materials, including answers to test items. Current methods of detecting test collusion are based on statistics also used in answer‐copying detection. Therefore, in computerized adaptive testing (CAT) these methods lose power because the actual test varies across examinees. This article addresses that problem by introducing a new approach that works in two stages: in Stage 1, test centers with an unusual distribution of a person‐fit statistic are identified via Kullback–Leibler divergence; in Stage 2, examinees from identified test centers are analyzed further using the person‐fit statistic, where the critical value is computed without data from the identified test centers. The approach is extremely flexible. One can employ any existing person‐fit statistic. The approach can be applied to all major testing programs: paper‐and‐pencil testing (P&P), computer‐based testing (CBT), multiple‐stage testing (MST), and CAT. Also, the definition of test center is not limited by the geographic location (room, class, college) and can be extended to support various relations between examinees (from the same undergraduate college, from the same test‐prep center, from the same group at a social network). The suggested approach was found to be effective in CAT for detecting groups of examinees with item pre‐knowledge, meaning those with access (possibly unknown to us) to one or more subsets of items prior to the exam. 相似文献

2.

Detecting Differential Speededness in Multistage Testing

Wim J. van der Linden Krista Breithaupt Siang Chee Chuah Yanwei Zhang 《Journal of Educational Measurement》2007,44(2):117-130

A potential undesirable effect of multistage testing is differential speededness, which happens if some of the test takers run out of time because they receive subtests with items that are more time intensive than others. This article shows how a probabilistic response-time model can be used for estimating differences in time intensities and speed between subtests and test takers and detecting differential speededness. An empirical data set for a multistage test in the computerized CPA Exam was used to demonstrate the procedures. Although the more difficult subtests appeared to have items that were more time intensive than the easier subtests, an analysis of the residual response times did not reveal any significant differential speededness because the time limit appeared to be appropriate. In a separate analysis, within each of the subtests, we found minor but consistent patterns of residual times that are believed to be due to a warm-up effect, that is, use of more time on the initial items than they actually need. 相似文献

3.

The WISC-R and evidence of item bias for native-American Navajos

Shitala P. Mishra 《Psychology in the schools》1982,19(4):458-464

The study investigated cultural bias in the 79 items of the three verbal tests of the Wechsler Intelligence Scale for Children-Revised (WISC-R). The Information, Similarities, and Vocabulary subtests were administered to 40 Anglo and 40 Native- American Navajo subjects matched for grade level. The responses of the two groups of subjects on individual items were analyzed by log-linear technique using the likelihood ratio chi-square statistic. The findings revealed that performance of subjects was homogeneous across groups on most of the items of three verbal subtests of the WISC-R. Only 15 (19%) of the 79 items comprising Information, Similarities, and Vocabulary subtests were found to be biased against the Navajo sample. Five of these items are from the Information, four from the Similarities, and the remaining six items are from the Vocabulary subtest. Implications of these findings for the psychoeducational assessment of minority children were discussed. 相似文献

4.

Limiting Answer Review and Change on Computerized Adaptive Vocabulary Tests: Psychometric and Attitudinal Results

Walter P. Vispoel Amy B. Hendrickson Timothy Bleiler 《Journal of Educational Measurement》2000,37(1):21-38

Previous simulation studies of computerized adaptive tests (CATs) have revealed that the validity and precision of proficiency estimates can be maintained when review opportunities are limited to items within successive blocks. Our purpose in this study was to evaluate the effectiveness of CATs with such restricted review options in a live testing setting. Vocabulary CATs were compared under four conditions: (a) no item review allowed, (b) review allowed only within successive 5-item blocks, (c) review allowed only within successive lO-item blocks, and (d) review allowed only after answering all 40 items. Results revealed no trust-worthy differences among conditions in vocabulary proficiency estimates, measurement error, or testing time. Within each review condition, ability estimates and number correct scores increased slightly after review, more answers were changed from wrong to right than from right to wrong, most examinees who changed answers improved proficiency estimates by doing so, and nearly all examinees indicated that they had an adequate opportunity to review their previous answers. These results suggest that restricting review opportunities on CATs may provide a viable way to satisfy examinee desires, maintain validity and measurement precision, and keep testing time at acceptable levels. 相似文献

5.

DIF Detection Using Multiple‐Group Categorical CFA With Minimum Free Baseline Approach

下载免费PDF全文

Yu‐Wei Chang Wei‐Kang Huang Rung‐Ching Tsai 《Journal of Educational Measurement》2015,52(2):181-199

The aim of this study is to assess the efficiency of using the multiple‐group categorical confirmatory factor analysis (MCCFA) and the robust chi‐square difference test in differential item functioning (DIF) detection for polytomous items under the minimum free baseline strategy. While testing for DIF items, despite the strong assumption that all but the examined item are set to be DIF‐free, MCCFA with such a constrained baseline approach is commonly used in the literature. The present study relaxes this strong assumption and adopts the minimum free baseline approach where, aside from those parameters constrained for identification purpose, parameters of all but the examined item are allowed to differ among groups. Based on the simulation results, the robust chi‐square difference test statistic with the mean and variance adjustment is shown to be efficient in detecting DIF for polytomous items in terms of the empirical power and Type I error rates. To sum up, MCCFA under the minimum free baseline strategy is useful for DIF detection for polytomous items. 相似文献

6.

On the Stability of Students'Rules of Operation for Solving Arithmetic Problems

Kikumi K. Tatsuoka Menucha Birenbaum Jerry Arnold 《Journal of Educational Measurement》1989,26(4):351-361

The purpose o f this study was to examine the consistency with which students applied procedural rules for solving signed-number operations across identical items presented in different orders. A test with 64 open-ended items was administered to 161 eighth graders. The test consisted o f two 32-item subtests containing identical items. The items in each subtest were in random order. Students'responses to each subtest were compared with respect to the identified underlying rules o f operation used to solve each problem set. The results indicated that inconsistent rule application was common among students who had not mastered signed-number arithmetic operations. In contrast, mastery level students, those who use the right rules, show a stable pattern o f rule application in signed-number arithmetic. These results are discussed in light of the hypothesis testing approach to the learning process. 相似文献

7.

Gauging Item Alignment Through Online Systems While Controlling for Rater Effects

下载免费PDF全文

Daniel Anderson Shawn Irvin Julie Alonzo Gerald A. Tindal 《Educational Measurement》2015,34(1):22-33

The alignment of test items to content standards is critical to the validity of decisions made from standards‐based tests. Generally, alignment is determined based on judgments made by a panel of content experts with either ratings averaged or via a consensus reached through discussion. When the pool of items to be reviewed is large, or the content‐matter experts are broadly distributed geographically, panel methods present significant challenges. This article illustrates the use of an online methodology for gauging item alignment that does not require that raters convene in person, reduces the overall cost of the study, increases time flexibility, and offers an efficient means for reviewing large item banks. Latent trait methods are applied to the data to control for between‐rater severity, evaluate intrarater consistency, and provide item‐level diagnostic statistics. Use of this methodology is illustrated with a large pool (1,345) of interim‐formative mathematics test items. Implications for the field and limitations of this approach are discussed. 相似文献

8.

“I never thought of it as freezing”: How students answer questions on large‐scale science tests and what they know about science

Tracy Noble Catherine Suarez Ann Rosebery Mary Catherine O'Connor Beth Warren Josiane Hudicourt‐Barnes 《科学教学研究杂志》2012,49(6):778-803

相似文献

9.

A New Statistic for Detection of Aberrant Answer Changes

下载免费PDF全文

Sandip Sinharay Minh Q. Duong Scott W. Wood 《Journal of Educational Measurement》2017,54(2):200-217

As noted by Fremer and Olson, analysis of answer changes is often used to investigate testing irregularities because the analysis is readily performed and has proven its value in practice. Researchers such as Belov, Sinharay and Johnson, van der Linden and Jeon, van der Linden and Lewis, and Wollack, Cohen, and Eckerly have suggested several statistics for detection of aberrant answer changes. This article suggests a new statistic that is based on the likelihood ratio test. An advantage of the new statistic is that it follows the standard normal distribution under the null hypothesis of no aberrant answer changes. It is demonstrated in a detailed simulation study that the Type I error rate of the new statistic is very close to the nominal level and the power of the new statistic is satisfactory in comparison to those of several existing statistics for detecting aberrant answer changes. The new statistic and several existing statistics were shown to provide useful information for a real data set. Given the increasing interest in analysis of answer changes, the new statistic promises to be useful to measurement practitioners. 相似文献

10.

Instructional Topics in Educational Measurement (ITEMS) Module: Using Automated Processes to Generate Test Items

Mark J. Gierl Hollis Lai 《Educational Measurement》2013,32(3):36-50

Changes to the design and development of our educational assessments are resulting in the unprecedented demand for a large and continuous supply of content‐specific test items. One way to address this growing demand is with automatic item generation (AIG). AIG is the process of using item models to generate test items with the aid of computer technology. The purpose of this module is to describe and illustrate a template‐based method for generating test items. We outline a three‐step approach where test development specialists first create an item model. An item model is like a mould or rendering that highlights the features in an assessment task that must be manipulated to produce new items. Next, the content used for item generation is identified and structured. Finally, features in the item model are systematically manipulated with computer‐based algorithms to generate new items. Using this template‐based approach, hundreds or even thousands of new items can be generated with a single item model. 相似文献

11.

Model‐Free CUSUM Methods for Person Fit

Ronald D. Armstrong Min Shi 《Journal of Educational Measurement》2009,46(4):408-428

This article demonstrates the use of a new class of model‐free cumulative sum (CUSUM) statistics to detect person fit given the responses to a linear test. The fundamental statistic being accumulated is the likelihood ratio of two probabilities. The detection performance of this CUSUM scheme is compared to other model‐free person‐fit statistics found in the literature as well as an adaptation of another CUSUM approach. The study used both simulated responses and real response data from a large‐scale standardized admission test. 相似文献

12.

计算机自适应测试中的连续概率比测试模式

樊军《考试研究》2012,(4):61-67

计算机自适应性测试中的连续概率比例试模式,是一种适用于普通教师利用网络技术在班级教学这样的小规模测试中评估学生语言学习效果的测试模式。其基本原理就是估计被试连续测试时答对与答错的概率,然后与“掌握”和“未掌握”两个相互对立的假设作比较而产生相应的决策。它一方面可以弥补基于IRT测试模式应用范围的不足,_另一方面可以更好地帮助教师完成对于学生语言能力的评估。相似文献

13.

Constructing Rotating Item Pools for Constrained Adaptive Testing

Adelaide Ariel Bernard P. Veldkamp Wim J. van der Linden 《Journal of Educational Measurement》2004,41(4):345-359

Preventing items in adaptive testing from being over- or underexposed is one of the main problems in computerized adaptive testing. Though the problem of overexposed items can be solved using a probabilistic item-exposure control method, such methods are unable to deal with the problem of underexposed items. Using a system of rotating item pools, on the other hand, is a method that potentially solves both problems. In this method, a master pool is divided into (possibly overlapping) smaller item pools, which are required to have similar distributions of content and statistical attributes. These pools are rotated among the testing sites to realize desirable exposure rates for the items. A test assembly model, motivated by Gulliksen's matched random subtests method, was explored to help solve the problem of dividing a master pool into a set of smaller pools. Different methods to solve the model are proposed. An item pool from the Law School Admission Test was used to evaluate the performances of computerized adaptive tests from systems of rotating item pools constructed using these methods. 相似文献

14.

The Relationship Between Item Parameters and Item Fit

Hamzeh Dodeen 《Journal of Educational Measurement》2004,41(3):261-270

The effect of item parameters (discrimination, difficulty, and level of guessing) on the item-fit statistic was investigated using simulated dichotomous data. Nine tests were simulated using 1,000 persons, 50 items, three levels of item discrimination, three levels of item difficulty, and three levels of guessing. The item fit was estimated using two fit statistics: the likelihood ratio statistic (X²_B), and the standardized residuals (SRs). All the item parameters were simulated to be normally distributed. Results showed that the levels of item discrimination and guessing affected the item-fit values. As the level of item discrimination or guessing increased, item-fit values increased and more items misfit the model. The level of item difficulty did not affect the item-fit statistic. 相似文献

15.

Effect of the number of scale points on chi‐square fit indices in confirmatory factor analysis

Samuel B. Green Theresa M. Akey Kandace K. Fleming Scott L. Hershberger Janet G. Marquis 《Structural equation modeling》2013,20(2):108-120

This article investigates the effect of the number of item response categories on chi‐square statistics for confirmatory factor analysis to assess whether a greater number of categories increases the likelihood of identifying spurious factors, as previous research had concluded. Four types of continuous single‐factor data were simulated for a 20‐item test: (a) uniform for all items, (b) symmetric unimodal for all items, (c) negatively skewed for all items, or (d) negatively skewed for 10 items and positively skewed for 10 items. For each of the 4 types of distributions, item responses were divided to yield item scores with 2,4, or 6 categories. The results indicated that the chi‐square statistic for evaluating a single‐factor model was most inflated (suggesting spurious factors) for 2‐category responses and became less inflated as the number of categories increased. However, the Satorra‐Bentler scaled chi‐square tended not to be inflated even for 2‐category responses, except if the continuous item data had both negatively and positively skewed distributions. 相似文献

16.

Relative and Absolute Fit Evaluation in Cognitive Diagnosis Modeling

Jinsong Chen Jimmy de la Torre Zao Zhang 《Journal of Educational Measurement》2013,50(2):123-140

As with any psychometric models, the validity of inferences from cognitive diagnosis models (CDMs) determines the extent to which these models can be useful. For inferences from CDMs to be valid, it is crucial that the fit of the model to the data is ascertained. Based on a simulation study, this study investigated the sensitivity of various fit statistics for absolute or relative fit under different CDM settings. The investigation covered various types of model–data misfit that can occur with the misspecifications of the Q‐matrix, the CDM, or both. Six fit statistics were considered: –2 log likelihood (–2LL), Akaike's information criterion (AIC), Bayesian information criterion (BIC), and residuals based on the proportion correct of individual items (p), the correlations (r), and the log‐odds ratio of item pairs (l). An empirical example involving real data was used to illustrate how the different fit statistics can be employed in conjunction with each other to identify different types of misspecifications. With these statistics and the saturated model serving as the basis, relative and absolute fit evaluation can be integrated to detect misspecification efficiently. 相似文献

17.

AN INVESTIGATION OF AN EXTENSION OF ITEM SAMPLING WHICH YIELDS INDIVIDUAL SCORES1

MARY ANNE BUNDA 《Journal of Educational Measurement》1973,10(2):117-130

The sampling procedures were designed so that the full matrix of item variances and covariances could be estimated. Three subtest sizes were investigated- subtests of size five, nine and sixteen items. In each of these implementations a double cross validation was used yielding two predicted scores for each individual. Discrepancy measures were also computed showing the difference between the observed and the predicted scores. The prediction of individual scores was accomplished within various ranges of error. The correlations between predicted scores and observed scores ranged from the .70′s to the .90′s, depending on the number of predictor variables used. The procedure is applicable in situations in which large numbers of individuals are tested or in situations where multiple measures are taken. 相似文献

18.

Gender Differences in Multiple-Choice Tests: The Role of Differential Guessing Tendencies 总被引：2，自引：0，他引：2

Gershon Ben-Shakhar Yakov Sinai 《Journal of Educational Measurement》1991,28(1):23-35

The present study focused on gender differences in the tendency to omit items and to guess in multiple-choice tests. It was hypothesized that males would show greater guessing tendencies than females and that the use of formula scoring rather than the use of number of correct answers would result in a relative advantage for females. Two samples were examined: ninth graders and applicants to Israeli universities. The teenagers took a battery of five or six aptitude tests used to place them in various high schools, and the adults took a battery of five tests designed to select candidates to the various faculties of the Israeli universities. The results revealed a clear male advantage in most subtests of both batteries. Four measures of item-omission tendencies were computed for each subtest, and a consistent pattern of greater omission rates among females was revealed by all measures in most subtests of the two batteries. This pattern was observed even in the few subtests that did not show male superiority and even when permissive instructions were used. Correcting the raw scores for guessing reduced the male advantage in all cases (and in the few subtests that showed female advantage the difference increased as a result of this correction), but this effect was small. It was concluded that although gender differences in guessing tendencies are robust they account for only a small fraction of the observed gender differences in multiple-choice tests. The results were discussed, focusing on practical implications. 相似文献

19.

Using Response Time to Detect Item Preknowledge in Computer‐Based Licensure Examinations

Hong Qian Dorota Staniewska Mark Reckase Ada Woo 《Educational Measurement》2016,35(1):38-47

This article addresses the issue of how to detect item preknowledge using item response time data in two computer‐based large‐scale licensure examinations. Item preknowledge is indicated by an unexpected short response time and a correct response. Two samples were used for detecting item preknowledge for each examination. The first sample was from the early stage of the operational test and was used for item calibration. The second sample was from the late stage of the operational test, which may feature item preknowledge. The purpose of this research was to explore whether there was evidence of item preknowledge and compromised items in the second sample using the parameters estimated from the first sample. The results showed that for one nonadaptive operational examination, two items (of 111) were potentially exposed, and two candidates (of 1,172) showed some indications of preknowledge on multiple items. For another licensure examination that featured computerized adaptive testing, there was no indication of item preknowledge or compromised items. Implications for detected aberrant examinees and compromised items are discussed in the article. 相似文献

20.

Changing Conceptions in the Relative Sequence of Major Topics in General Science Textbooks (1911-1934)

Maitland P. Simmons 《Journal of Experimental Education》2013,81(4):277-279

The authors examined the relationship between fifth-grade students' verbal ability level and the adaptive nature of the questions that these students asked in attempting to find a correct synonym for vocabulary items. Questions were divided into necessary questions (questions posed after a wrong provisional answer) and unnecessary questions (questions posed after a right provisional answer). Another division of questions was into helpful questions (questions that led to a correction of a wrong provisional answer) and harmful questions (questions that led to a shift from a right to a wrong answer). Also examined were discontinued inquiries (instances in which a student decided to break off an inquiry in favor of inferring the right alternative). The results showed that students with high verbal ability asked more necessary questions and stepped up the number of unnecessary questions for difficult items, signaling that these questions were asked to increase confidence in knowing. Students' verbal ability did not affect the frequency of discontinued inquiries, but a significant effect was found for the helpfulness of these inquiries. A detailed account of the various processes and stages involved in students' questioning is presented. 相似文献