1.

AN EXAMINATION OF THE CONTEXT EFFECT IN ITEM SAMPLING

LEONARD S. FELDT, ROBERT A. FORSYTH 《Journal of Educational Measurement》1974,11(2):73-82

Item sampling and/or multiple matrix sampling techniques have been recommended for a variety of purposes. For some of these purposes, it must be assumed that examinee performance on a set of items is unaffected by the conditions under which the items are taken (i.e., no context effect exists). In this paper factors that may lead to a context effect among high school students are discussed. The net effect of such factors on examinee scores for an English test and a mathematics test is investigated empirically. For the English test there was little support for the existence of a context effect. However, a definite context effect was found for the mathematics test.

2.

How Well Can We Compare Scores on Test Forms That Are Constructed by Examinees' Choice?

Howard Wainer, Xiang-Bo Wang, David Thissen 《Journal of Educational Measurement》1994,31(3):183-199

When an exam consists, in whole or in part, of constructed-response items, it is a common practice to allow the examinee to choose a subset of the questions to answer. This procedure is usually adopted so that the limited number of items that can be completed in the allotted time does not unfairly affect the examinee. This results in the de facto administration of several different test forms, where the exact structure of any particular form is determined by the examinee. However, when different forms are administered, a canon of good testing practice requires that those forms be equated to adjust for differences in their difficulty. When the items are chosen by the examinee, traditional equating procedures do not strictly apply due to the nonignorable nature of the missing responses. In this article, we examine the comparability of scores on such tests within an IRT framework. We illustrate the approach with data from the College Board's Advanced Placement Test in Chemistry.

3.

SCRAMBLING CONTENT IN ACHIEVEMENT TESTING: AN APPLICATION OF MULTIPLE MATRIX SAMPLING IN EXPERIMENTAL DESIGN

KEN SIROTNIK, ROGER WELLINGTON 《Journal of Educational Measurement》1974,11(3):179-188

This study was designed to research the question of scrambling item content in the construction of achievement tests, so that very general implications could be drawn for both examinee and item populations. To achieve this generality, the methodology of multiple matrix sampling was combined with a simple two-group experimental design: a random group of 8th graders responded to mathematics, science, social studies, reading, and language arts achievement items organized in a scrambled (random) test format, while another random group responded to the same items organized in a fixed (segregated by subject matter) test format. The results indicated that scrambling cognitive test items has minimal or no effect on mean examinee test performance or on any of the other parameters included in the analysis.

4.

Hierarchical Generalized Linear Models for the Analysis of Judge Ratings

Timothy J. Muckle, George Karabatsos 《Journal of Educational Measurement》2009,46(2):198-219

It is known that the Rasch model is a special two-level hierarchical generalized linear model (HGLM). This article demonstrates that the many-faceted Rasch model (MFRM) is also a special case of the two-level HGLM, with a random intercept representing examinee ability on a test, and fixed effects for the test items, judges, and possibly other facets. This perspective suggests useful modeling extensions of the MFRM. For example, in the HGLM framework it is possible to model random effects for items and judges in order to assess their stability across examinees. The MFRM can also be extended so that item difficulty and judge severity are modeled as functions of examinee characteristics (covariates), for the purposes of detecting differential item functioning and differential rater functioning. Practical illustrations of the HGLM are presented through the analysis of simulated and real judge-mediated data sets involving ordinal responses.

5.

Models of Decisionmaking Processes for Multiple-Choice Test Items: An Analysis of Spatial Ability

Rand R. Wilcox, Karen Thompson Wilcox 《Journal of Educational Measurement》1988,25(2):125-136

Latent class models of decisionmaking processes related to multiple-choice test items are extremely important and useful in mental test theory. However, building realistic models or studying the robustness of existing models is very difficult. One problem is that there are a limited number of empirical studies that address this issue. The purpose of this paper is to describe and illustrate how latent class models, in conjunction with the answer-until-correct format, can be used to examine the strategies used by examinees for a specific type of task. In particular, suppose an examinee responds to a multiple-choice test item designed to measure spatial ability, and the examinee gets the item wrong. This paper empirically investigates various latent class models of the strategies that might be used to arrive at an incorrect response. The simplest model is a random guessing model, but the results reported here strongly suggest that this model is unsatisfactory. Models for the second attempt of an item, under an answer-until-correct scoring procedure, are proposed and found to give a good fit to data in most situations. Some results on strategies used to arrive at the first choice are also discussed.

6.

Patterns of Solution Behavior across Items in Low-Stakes Assessments

Dena A. Pastor 《Educational Assessment》2019,24(3):189-212

The trustworthiness of low-stakes assessment results largely depends on examinee effort, which can be measured by the amount of time examinees devote to items using solution behavior (SB) indices. Because SB indices are calculated for each item, they can be used to understand how examinee motivation changes across items within a test. Latent class analysis (LCA) was used with the SB indices from three low-stakes assessments to explore patterns of solution behavior across items. Across tests, the favored models consisted of two classes, with Class 1 characterized by high and consistent solution behavior (>90% of examinees) and Class 2 by lower and less consistent solution behavior (<10% of examinees). Additional analyses provided supportive validity evidence for the two-class solution with notable differences between classes in self-reported effort, test scores, gender composition, and testing context. Although results were generally similar across the three assessments, striking differences were found in the nature of the solution behavior pattern for Class 2 and the ability of item characteristics to explain the pattern. The variability in the results suggests motivational changes across items may be unique to aspects of the testing situation (e.g., content of the assessment) for less motivated examinees.

7.

The Effect of Inappropriate Omissions on Formula Scores: A Simulation Study

Robert B. Frary 《Journal of Educational Measurement》1989,26(1):41-53

Responses to a 50-item, four-choice test were simulated for 1,000 examinees under conventional formula-scoring instructions. One hundred ninety-two simulation runs reflected variations in the average level of item difficulty, the extent to which examinees tended to omit inappropriately (when the formula-scoring directions recommended guessing), the extent to which they were misinformed (classified correct answers as distractors), the extent to which they guessed contrary to the formula-scoring instructions, the extent to which examinee ability and tendency to omit inappropriately were correlated, the examinee ability level at which misinformation was most prevalent, and the extent to which item difficulty was related to the probability that an examinee would be misinformed. For each examinee, formula scores and expected formula scores were determined allowing and not allowing inappropriate omissions. Under certain conditions, failure to guess as recommended by the formula-scoring instructions produced nontrivial proportions of examinees with expected score losses of one or more points. These conditions were a test of at least moderate difficulty, a low level for the tendency to be misinformed, and at least a moderate level for the tendency to omit inappropriately.
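
The abstract above does not spell out the scoring rule. A minimal Python sketch, assuming the conventional correction-for-guessing formula R - W/(k - 1), where R is the number right, W the number wrong, and k the number of choices per item (omitted items add nothing). The function names `formula_score` and `expected_gain_from_answering` are hypothetical and are not taken from the study.

```python
def formula_score(num_right: int, num_wrong: int, num_choices: int = 4) -> float:
    """Conventional formula score: R - W / (k - 1).

    Omitted items are neither rewarded nor penalized. The name and
    signature are illustrative assumptions, not the study's code.
    """
    if num_choices < 2:
        raise ValueError("an item needs at least two choices")
    return num_right - num_wrong / (num_choices - 1)


def expected_gain_from_answering(num_omitted: int, p_correct: float,
                                 num_choices: int = 4) -> float:
    """Expected change in formula score if the omitted items were answered,
    each with probability p_correct of being answered correctly."""
    # Each answered item gains 1 point with probability p_correct and
    # loses 1 / (k - 1) points otherwise.
    per_item = p_correct - (1 - p_correct) / (num_choices - 1)
    return num_omitted * per_item


if __name__ == "__main__":
    # 30 right, 12 wrong, 8 omitted on a 50-item, four-choice test.
    print(formula_score(30, 12))
    # Blind guessing (p = 1/4) versus partial knowledge (p = 1/2) on 8 omits.
    print(expected_gain_from_answering(8, 0.25))
    print(expected_gain_from_answering(8, 0.5))
```

Under this formula, blind guessing on a four-choice item has an expected gain of zero, so an omission costs points only when the examinee could have done better than chance. That is the sense in which an omission is "inappropriate" in the abstract.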

8.

Taking the Time to Improve the Validity of Low-Stakes Tests: The Effort-Monitoring CBT

Steven L. Wise, Dennison S. Bhola, Sheng-Ta Yang 《Educational Measurement: Issues and Practice》2006,25(2):21-30

The attractiveness of computer-based tests (CBTs) is due largely to their capability to expand the ways we conduct testing. A relatively unexplored application, however, is actively using the computer to reduce construct-irrelevant variance while a test is being administered. This investigation introduces the effort-monitoring CBT, in which the computer monitors examinee effort (based on item response time) in a low-stakes test and displays warning messages to those exhibiting rapid-guessing behavior. The results of an experimental study are presented, which showed that an effort-monitoring CBT increased examinee effort and yielded more valid test scores than a conventional CBT. Thus, unlike previous research that has focused on identifying rapid-guessing behavior after it has occurred, the effort-monitoring CBT proactively attempts to suppress rapid-guessing behavior. This innovative testing procedure extends the capabilities of measurement practitioners to manage the psychometric challenges posed by unmotivated examinees.

9.

Gender Differences in Performance on Mathematics Achievement Items

《Applied Measurement in Education》2013,26(2):161-177

Gender differences in performance on three types of mathematics test items were investigated using data from students with three different course backgrounds. Eight randomly equivalent samples of high school seniors were each given a unique form of the ACT Assessment Mathematics Usage Test. Only students with three specific profiles of high school mathematics coursework were considered in the analysis. The three background conditions ranged from little mathematics (Algebra I only) to a modest background (two Algebra courses and Geometry) to a full mathematics program including Introductory Calculus. For each background condition, examinee performance was analyzed in a 2 (Gender) x 3 (Item Category) x 8 (Test Form) split-plot factorial design. The results indicated that, at each of the studied background levels, females performed less well than males on geometry (strategic, geometric) and reasoning (strategic, nongeometric) items. On the other hand, females performed as well as males on algorithmic, operations-oriented items.

10.

An Application of Item Response Time: The Effort-Moderated IRT Model

Steven L. Wise, Christine E. DeMars 《Journal of Educational Measurement》2006,43(1):19-38

The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
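
The abstract does not give the model equation. One plausible reading, sketched here in Python under stated assumptions: when the response time indicates solution behavior, the probability of a correct response follows the standard 3PL curve; when it indicates a rapid guess, the probability is taken to be chance (1 divided by the number of options) and does not depend on proficiency. The function names, the logistic scaling without the 1.7 constant, and the parameter values are all hypothetical.

```python
import math


def prob_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Standard 3PL probability of a correct response:
    c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def prob_effort_moderated(theta: float, a: float, b: float, c: float,
                          solution_behavior: bool, num_options: int) -> float:
    """Sketch of an effort-moderated response probability.

    Assumption (not stated in the abstract): a rapid guess is modeled as a
    chance-level response that carries no information about theta.
    """
    if solution_behavior:
        return prob_3pl(theta, a, b, c)
    return 1.0 / num_options


if __name__ == "__main__":
    # Same examinee and item, with and without solution behavior.
    print(prob_effort_moderated(1.0, 1.2, 0.0, 0.2, True, 4))
    print(prob_effort_moderated(1.0, 1.2, 0.0, 0.2, False, 4))
```

Under this reading, rapid guesses contribute a constant to the likelihood, so they neither raise nor lower the proficiency estimate.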

11.

Response Time Effort: A New Measure of Examinee Motivation in Computer-Based Tests

《Applied Measurement in Education》2013,26(2):163-183

When low-stakes assessments are administered, the degree to which examinees give their best effort is often unclear, complicating the validity and interpretation of the resulting test scores. This study introduces a new method, based on item response time, for measuring examinee test-taking effort on computer-based test items. This measure, termed response time effort (RTE), is based on the hypothesis that when administered an item, unmotivated examinees will answer too quickly (i.e., before they have time to read and fully consider the item). Psychometric characteristics of RTE scores were empirically investigated and supportive evidence for score reliability and validity was found. Potential applications of RTE scores and their implications are discussed.
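
A minimal Python sketch of how a measure of this kind could be computed, assuming RTE is taken as the proportion of items on which the examinee's response time meets or exceeds an item-specific threshold (anything faster is treated as a rapid guess). The function name `response_time_effort`, the threshold values, and the sample data are hypothetical. The abstract does not say how thresholds are set.

```python
from typing import Sequence


def response_time_effort(response_times: Sequence[float],
                         thresholds: Sequence[float]) -> float:
    """Proportion of items answered with solution behavior.

    An item counts as solution behavior when its response time (seconds)
    is at least the item's threshold; otherwise it counts as a rapid guess.
    """
    if len(response_times) != len(thresholds):
        raise ValueError("need one threshold per item")
    if not response_times:
        raise ValueError("need at least one item")
    solution_behavior = [
        1 if rt >= th else 0 for rt, th in zip(response_times, thresholds)
    ]
    return sum(solution_behavior) / len(solution_behavior)


if __name__ == "__main__":
    # Four items; the second and fourth were answered faster than their
    # (hypothetical) thresholds.
    times = [12.0, 1.5, 30.0, 2.0]
    limits = [3.0, 3.0, 5.0, 3.0]
    print(response_time_effort(times, limits))
```

A score near 1 suggests the examinee engaged with nearly every item, while a low score flags responses that may not reflect proficiency.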

12.

An Investigation of Examinee Test-Taking Effort on a Large-Scale Assessment

J. Carl Setzer, Steven L. Wise, Jill R. van den Heuvel, Guangming Ling 《Applied Measurement in Education》2013,26(1):34-49

Assessment results collected under low-stakes testing situations are subject to effects of low examinee effort. The use of computer-based testing allows researchers to develop new ways of measuring examinee effort, particularly using response times. At the item level, responses can be classified as exhibiting either rapid-guessing behavior or solution behavior based on the item response time. Most previous research involving the study of response times has been conducted using locally developed instruments. The purpose of the current study was to examine the amount of rapid-guessing behavior within a commercially available, low-stakes instrument. Results indicate that smaller amounts of rapid-guessing behavior exist within the data compared to published results using other instruments. Additionally, rapid-guessing behavior varied by item and was significantly related to item length, item position, and presence of ancillary reading material. The amount of rapid-guessing behavior was consistently very low among various demographic subpopulations. On average, rapid-guessing behavior was observed on only 1% of item responses. Also found was that a small amount of rapid-guessing behavior can impact institutional rankings.

13.

Additive and multiplicative effects of working memory and test anxiety on mathematics performance in grade 3 students

Johan Korhonen, Mikaela Nyroos, Bert Jonsson, Hanna Eklöf 《Educational Psychology》2018,38(5):572-595

The aim of this study was to investigate the interplay between test anxiety and working memory (WM) on mathematics performance in younger children. A sample of 624 grade 3 students completed a test battery consisting of a test anxiety scale, WM tasks and the Swedish national examination in mathematics for grade 3. The main effects of test anxiety and WM, and the two-way interaction between test anxiety and WM on mathematics performance, were modelled with structural equation modelling techniques. Additionally, the effects were also tested separately on tasks with high WM demands (mathematical problem-solving) versus low WM demands (basic arithmetic). As expected, WM positively predicted mathematics performance in all three models (overall mathematics performance, problem-solving tasks, and basic arithmetic). Test anxiety had a negative effect on problem-solving on the whole sample level but concerning basic arithmetic only students with lower WM were affected by the negative effects of test anxiety on performance. Thus, students with low WM are more vulnerable to the negative effects of test anxiety in low WM tasks like basic arithmetic. The results are discussed in relation to the early identification of test anxiety.

14.

Application of Think Aloud Protocols for Examining and Confirming Sources of Differential Item Functioning Identified by Expert Reviews

Kadriye Ercikan, Rubab Arim, Danielle Law, Jose Domene, France Gagnon, Serge Lacroix 《Educational Measurement: Issues and Practice》2010,29(2):24-35

This paper demonstrates and discusses the use of think aloud protocols (TAPs) as an approach for examining and confirming sources of differential item functioning (DIF). The TAPs are used to investigate to what extent surface characteristics of the items that are identified by expert reviews as sources of DIF are supported by empirical evidence from examinee thinking processes in the English and French versions of a Canadian national assessment. In this research, the TAPs confirmed sources of DIF identified by expert reviews for 10 out of 20 DIF items. The moderate agreement between TAPs and expert reviews indicates that evidence from expert reviews cannot be considered sufficient in deciding whether DIF items are biased and such judgments need to include evidence from examinee thinking processes.

15.

Systematic Error Analysis as an Empirical Basis for Analytic Teaching of Young Adults

James A. Dunn 《Educational Psychology》1994,14(1):85-128

16.

The role that mathematics plays in college- and career-readiness: evidence from PISA

Leland S. Cogan, Siwen Guo 《Journal of Curriculum Studies》2019,51(4):530-553

Many studies have found a strong relationship between the mathematics students study in school and their performance on an academic or school mathematics assessment but not on an assessment of mathematics literacy (ML). With many countries, like the USA, placing emphasis on finishing secondary education being mathematically literate and prepared for college or career, this raises the question about the relationship between the mathematics studied in school and any ML students may have. The Programme for International Student Assessment (PISA) ML assessment is embedded in real-world contexts that provide an important window on how ready students are to tackle the situations and problems that await them whether they intend to pursue further education beyond high school or intend to go directly into the labour force. In this paper, we draw upon the PISA 2012 data to investigate the extent to which the cumulative exposure to rigorous mathematics content, such as that embedded in college- and career-ready standards, is associated with ML as assessed in PISA. Results reveal that both exposure to rigorous school mathematics and experiencing the instruction of this mathematics through real-world applications are significantly related to all the real-world contextualized PISA ML scores.

17.

Test-Takers' Judgments of Essay Prompts: Perceptions and Performance

《Educational Assessment》2013,18(1):3-22

This study gathered the judgments of Graduate Record Examination test takers, both actual and prospective, about a sample of essay prompts being considered for possible use in a graduate admissions writing test. Our thesis was that test-takers' views, which have not been frequently considered in any systematic fashion, may provide valuable information to developers of writing assessments. The specific objective was to determine the kinds of prompts and topics on which examinees feel they can write strong essays, as well as those that they perceive as more difficult. The study identified several features that underlie examinee perceptions of essay prompts. Prominent among these features was the extent to which prompts allow writers to draw on their personal experiences. Some study participants also wrote essays on a small subset of the prompts. With these data, the relation of examinee opinions to performance on the prompts was examined. Though apparent, this relation was less dramatic than writers' strong opinions would suggest.

18.

A Simple Model for Diagnostic Testing When There Are Several Types of Misinformation

《Journal of Experimental Education》2012,80(1):57-62

19.

AN EXPERIMENTAL COMPARISON OF ITEM SAMPLING AND EXAMINEE SAMPLING FOR ESTIMATING TEST NORMS

THOMAS R. OWENS, DANIEL L. STUFFLEBEAM 《Journal of Educational Measurement》1969,6(2):75-83

An empirical comparison was made of the accuracy of item sampling and examinee sampling in estimating norm statistics. Item samples were composed of 3, 6, or 12 items selected from a total test of 50 multiple-choice vocabulary questions. Overall, the study findings provided empirical evidence that item sampling is approximately as effective as examinee sampling for estimating the population mean and standard deviation. Contradictory trends occurred for lower ability and higher ability student populations in accuracy of estimated means and standard deviations when the number of items administered increased from 3 to 6 to 12. The findings from this study indicate that the variation of sequences of items occurring in item sampling need not have a significant effect on test performance.

20.

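The UK Selection Examination for Mathematically Talented Students: An Introduction to the Sixth Term Examination Paper in Mathematics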

Chen Ang (陈昂), Ren Zichao (任子朝) 《Journal of Mathematics Education》2012,(4):64-67

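The Sixth Term Examination Paper in Mathematics is one of the UK's selective examinations for talented students. It aims to measure the skills candidates need in order to succeed in university mathematics courses, with an emphasis on students' ability to apply mathematical knowledge to solve problems. Its questions stress innovation in the way thinking is measured, place weight on basic mathematical knowledge and problem-solving ability, and have distinctive design features.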