Similar documents
Found 20 similar documents (search time: 31 ms)
1.
Recent developments of person-fit analysis in computerized adaptive testing (CAT) are discussed. Methods from statistical process control are presented that have been proposed to classify an item score pattern as fitting or misfitting the underlying item response theory model in CAT. Most person-fit research in CAT is restricted to simulated data. In this study, empirical data from a certification test were used. Alternatives are discussed to generate norms so that bounds can be determined to classify an item score pattern as fitting or misfitting. Using bounds determined from a sample of a high-stakes certification test, the empirical analysis showed that different types of misfit can be distinguished. Further applications using statistical process control methods to detect misfitting item score patterns are discussed.
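The statistical process control methods referenced above are typically CUSUM-type charts accumulated over item residuals. A minimal sketch, assuming a 2PL response model; the function names are illustrative, and in practice the alarm bounds would come from norms such as the certification sample described in the abstract:

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def cusum_person_fit(theta, items, responses):
    """One-sided CUSUM charts over item residuals.

    C+ grows when an examinee keeps outperforming the model
    (e.g., possible preknowledge); C- grows when the examinee keeps
    underperforming (e.g., careless responding).  A pattern is flagged
    when either chart crosses a norm-derived bound.
    """
    c_plus = c_minus = 0.0
    path = []
    n = len(items)
    for (a, b), x in zip(items, responses):
        p = p_2pl(theta, a, b)
        t = (x - p) / n  # residual scaled by test length, as in SPC charts
        c_plus = max(0.0, c_plus + t)
        c_minus = min(0.0, c_minus + t)
        path.append((c_plus, c_minus))
    return path
```

A pattern of all-correct responses at items matched to the examinee's ability drives C+ steadily upward, which is the kind of drift these charts are designed to catch.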

2.
Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non-Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real-data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth-related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinees to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non-Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.
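The contrast between non-Bayes maximum likelihood and Bayesian estimators can be illustrated with a small grid-based sketch under an assumed 2PL model and a standard normal prior; the shrinkage of the EAP estimate toward the prior mean is the practical effect the abstract highlights. All names here are illustrative:

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def loglik(theta, items, x):
    """Log-likelihood of response pattern x at ability theta."""
    ll = 0.0
    for (a, b), u in zip(items, x):
        p = p2pl(theta, a, b)
        ll += u * math.log(p) + (1 - u) * math.log(1 - p)
    return ll

def mle(items, x, grid):
    """Grid-search maximum likelihood (non-Bayes) estimate."""
    return max(grid, key=lambda t: loglik(t, items, x))

def eap(items, x, grid):
    """Expected a posteriori (Bayes) estimate with a N(0, 1) prior."""
    post = [math.exp(loglik(t, items, x)) * math.exp(-t * t / 2) for t in grid]
    total = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / total
```

For a short test, the EAP estimate sits noticeably closer to zero than the MLE for the same response pattern, which is exactly why the choice of estimator moves score distributions and percent-proficient figures.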

3.
Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of approximately 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected in each study: one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel, and a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.
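A proportionally stratified subset of the kind used by the second panel can be drawn as follows. The `area` key and item dictionaries are hypothetical, and simple rounding can leave the subset a few items off the target when strata are unbalanced:

```python
import random

def stratified_subset(items, k):
    """Draw a proportionally stratified sample of about k items.

    Each content area keeps (approximately) its share of the full
    test; items are dicts with a hypothetical 'area' key.
    """
    by_area = {}
    for it in items:
        by_area.setdefault(it['area'], []).append(it)
    n = len(items)
    subset = []
    for area, pool in by_area.items():
        take = round(k * len(pool) / n)  # proportional allocation
        subset.extend(random.sample(pool, take))
    return subset
```

With 90 items split evenly across three content areas and k = 45, each area contributes 15 items, mirroring the test's content blueprint.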

4.
The person-fit literature assumes that aberrant response patterns could be a sign of person mismeasurement, but this assumption has rarely, if ever, been empirically investigated before. We explore the validity of test responses and measures of 10-year-old examinees whose response patterns on a commercial standardized paper-and-pencil mathematics test were flagged as aberrant. Validity evidence was collected through postexamination reflective interviews with 31 of the 80 pupils flagged as aberrant and their teachers, and teacher assessment (TA) judgments for the whole examination cohort of 674 examinees. Analysis suggested that interview-adjusted scores were significantly better fitting than expected by chance, but only some adjustments suggest serious mismeasurement. In addition, disagreement between TA and test scores was significantly greater for aberrant examinees, and partially predicted the interview adjustments. We conclude that person misfit statistics when combined with TA might be a useful antidote to mismeasurement, and we discuss the implications for assessment research and practice.

5.
《教育实用测度》2013,26(1):9-26
We present statistical and theoretical issues that arise from assessing person-fit on measures of typical performance. After presenting the status of past and current research issues, we describe three topics of ongoing concern. First, because typical performance measures tend to be short, and because they have low bandwidth, the detection of person-misfit is often attenuated. Second, there is a need for creative methods of identifying the specific sources of response aberrancy, rather than simply identifying person-misfit. Third, the promise of person-fit measures as moderators of trait-criterion relations remains undemonstrated. We offer commentary on potential resolutions to these three current topics. In terms of future research directions, we outline two lines of advancement that are relevant for both educational and personality psychologists. These are (a) the use of person-fit statistics in the assessment of how item response theory measurement models differ across manifest groups (e.g., ethnicity, gender), and (b) the application of person-fit statistics under "external" item response theory model conditions. We summarize the role these advances could play in helping educational testers go beyond the standard task of identifying "invalid" protocols by discussing how person-fit assessment may contribute to our understanding of individual and group differences in trait structure.

6.
Mahalanobis distance (M-distance) case diagnostics are a useful tool for assessing response pattern inconsistency in factor analysis; however, the derivations of these statistics assume continuous variables, which limits their utility in ordinal self- or rater-report data. This research generalizes M-distance diagnostics to categorical factor analysis. We prove that the residual-based M-distance d_r is equivalent to the person-fit index l_co, which motivates the use of the new categorical M-distance d_r* as a person-fit index. d_r* is compared and contrasted with z_h, a commonly used item response theory person-fit index. A simulation study is used to show that a simple transformation of d_r* satisfies established criteria for person-fit measures. A sample of responses to the Rosenberg Self-Esteem Scale is used to determine parameters for a simulation study, and real data are analyzed to contrast the use of d_r and d_r* as indexes of person-fit in continuous and categorical factor analysis.
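For continuous data, the classic case diagnostic is the squared Mahalanobis distance of each response vector from the sample centroid. A rough sketch of that baseline idea (not the paper's residual-based statistics, which operate on factor-model residuals rather than raw scores):

```python
import numpy as np

def mahalanobis_distances(X):
    """Squared M-distance of each row of X from the sample centroid.

    Large values flag response vectors that are inconsistent with the
    sample's mean and covariance structure.  The pseudo-inverse guards
    against a singular sample covariance matrix.
    """
    mu = X.mean(axis=0)
    S_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    d = X - mu
    # d_i' S^{-1} d_i for every row i
    return np.einsum('ij,jk,ik->i', d, S_inv, d)
```

A useful check: with a nonsingular sample covariance, the squared distances sum exactly to p(n - 1) for n cases and p variables, so an outlying case claims a disproportionate share of that fixed total.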

7.
In educational and psychological measurement, a person-fit statistic (PFS) is designed to identify aberrant response patterns. For parametric PFSs, valid inference depends on several assumptions, one of which is that the item response theory (IRT) model is correctly specified. Previous studies have used empirical data sets to explore the effects of model misspecification on PFSs. We further this line of research by using a simulation study, which allows us to explore issues that may be of interest to practitioners. Results show that, depending on the generating and analysis item models, Type I error rates at fixed values of the latent variable may be greatly inflated, even when the aggregate rates are relatively accurate. Results also show that misspecification is most likely to affect PFSs for examinees with extreme latent variable scores. Two empirical data analyses are used to illustrate the importance of model specification.

8.
A new procedure for generating instructionally relevant diagnostic feedback is proposed. The approach involves first constructing a strong model of student proficiency and then testing whether individual students' observed item response vectors are consistent with that model. Diagnoses are specified in terms of the combinations of skills needed to score at increasingly higher levels on a test's reported score scale. The approach is applied to the problem of developing diagnostic feedback for the SAT I Verbal Reasoning test. Using a variation of Wright's (1977) person-fit statistic, it is shown that the estimated proficiency model accounts for 91% of the "explainable" variation in students' observed item response vectors.

9.
Response accuracy and response time data can be analyzed with a joint model to measure ability and speed of working, while accounting for relationships between item and person characteristics. In this study, person‐fit statistics are proposed for joint models to detect aberrant response accuracy and/or response time patterns. The person‐fit tests take the correlation between ability and speed into account, as well as the correlation between item characteristics. They are posited as Bayesian significance tests, which have the advantage that the extremeness of a test statistic value is quantified by a posterior probability. The person‐fit tests can be computed as by‐products of a Markov chain Monte Carlo algorithm. Simulation studies were conducted in order to evaluate their performance. For all person‐fit tests, the simulation studies showed good detection rates in identifying aberrant patterns. A real data example is given to illustrate the person‐fit statistics for the evaluation of the joint model.

10.
《教育实用测度》2013,26(2):163-183
When low-stakes assessments are administered, the degree to which examinees give their best effort is often unclear, complicating the validity and interpretation of the resulting test scores. This study introduces a new method, based on item response time, for measuring examinee test-taking effort on computer-based test items. This measure, termed response time effort (RTE), is based on the hypothesis that when administered an item, unmotivated examinees will answer too quickly (i.e., before they have time to read and fully consider the item). Psychometric characteristics of RTE scores were empirically investigated and supportive evidence for score reliability and validity was found. Potential applications of RTE scores and their implications are discussed.
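Once per-item rapid-guessing thresholds have been chosen, the RTE index is simple to compute: it is the proportion of items answered with solution behavior. A minimal sketch, with the thresholds supplied by the caller (how thresholds are set is a separate modeling decision):

```python
def response_time_effort(rts, thresholds):
    """Response Time Effort: the fraction of items answered with
    'solution behavior', i.e., response time at or above the item's
    rapid-guessing threshold.  RTE near 1.0 indicates full effort;
    low RTE flags a possibly unmotivated examinee.
    """
    assert len(rts) == len(thresholds)
    solution = [rt >= th for rt, th in zip(rts, thresholds)]
    return sum(solution) / len(solution)
```

An examinee who blitzes through half the items faster than the thresholds allow receives an RTE of 0.5, regardless of whether those rapid guesses happened to be correct.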

11.
《教育实用测度》2013,26(1):33-51
The objectives of this study were to examine the impact of different curricula on standardized achievement test scores at item and objective levels and to determine if different curricula generate different patterns of item factor loadings. School buildings from a middle-sized district were rated regarding the degree to which their curricula matched the content of the standardized test, and the actual textbook series used within each building (classroom) was determined. Covariate analyses of objective scores and plots and correlations of item p values indicated very small, nonsignificant differential effects across ratings and textbook series. Factor patterns indicated no curricular effects on large first factors. These findings parallel the results of a previous study conducted at the subtest level. We conclude that educators need not be unduly concerned about the impact of specific and generally small differences in curricular offerings within a district on standardized test scores or inferences to a broad content domain.

12.
This study examined the utility of response time‐based analyses in understanding the behavior of unmotivated test takers. For the data from an adaptive achievement test, patterns of observed rapid‐guessing behavior and item response accuracy were compared to the behavior expected under several types of models that have been proposed to represent unmotivated test taking behavior. Test taker behavior was found to be inconsistent with these models, with the exception of the effort‐moderated model. Effort‐moderated scoring was found to both yield scores that were more accurate than those found under traditional scoring, and exhibit improved person fit statistics. In addition, an effort‐guided adaptive test was proposed and shown by a simulation study to alleviate item difficulty mistargeting caused by unmotivated test taking.
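Effort-moderated scoring drops responses flagged as rapid guesses from the likelihood before estimating ability. A grid-search sketch under an assumed 2PL model; function names and the threshold mechanism are illustrative, not the study's exact implementation:

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def effort_moderated_mle(items, x, rts, thresholds, grid):
    """Effort-moderated ability estimate: responses whose times fall
    below the item's rapid-guessing threshold contribute nothing to
    the likelihood, so theta reflects effortful responses only.
    """
    def ll(theta):
        total = 0.0
        for (a, b), u, rt, th in zip(items, x, rts, thresholds):
            if rt < th:        # rapid guess: exclude from the likelihood
                continue
            p = p2pl(theta, a, b)
            total += u * math.log(p) + (1 - u) * math.log(1 - p)
        return total
    return max(grid, key=ll)
```

When an examinee's rapid guesses are mostly wrong, dropping them raises the estimate relative to traditional scoring, which is the accuracy gain the abstract reports.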

13.
Research has shown that many educators do not understand the terminology or displays used in test score reports and that measurement error is a particularly challenging concept. We investigated graphical and verbal methods of representing measurement error associated with individual student scores. We created four alternative score reports, each constituting an experimental condition, and randomly assigned them to research participants. We then compared comprehension and preferences across the four conditions. In our main study, we collected data from 148 teachers. For comparison, we studied 98 introductory psychology students. Although we did not detect statistically significant differences across conditions, we found that participants who reported greater comfort with statistics tended to have higher comprehension scores and tended to prefer more informative displays that included variable-width confidence bands for scores. Our data also yielded a wealth of information regarding existing misconceptions about measurement error and about score-reporting conventions.

14.
Reliabilities and information functions for percentile ranks and number-right scores were compared in the context of item response theory. The basic results were: (a) The percentile rank is always less informative and reliable than the number-right score; and (b) for easy or difficult tests composed of highly discriminating items, the percentile rank often yields unacceptably low reliability and information relative to the number-right score. These results suggest that standardized scores that are linear transformations of the number-right score (e.g., z scores) are much more reliable and informative indicators of the relative standing of a test score than are percentile ranks. The findings reported here demonstrate that there exist situations in which the percent of items known by examinees can be accurately estimated, but that the percent of persons falling below a given score cannot.
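The coarseness of percentile ranks relative to linear transformations of number-right scores is easy to see: tied number-right scores collapse onto a single rank. A minimal sketch of the usual midpoint percentile-rank definition (percent strictly below, plus half of those tied):

```python
def percentile_ranks(scores):
    """Midpoint percentile rank for each number-right score: the
    percent of examinees scoring strictly below, plus half the
    percent scoring exactly the same.
    """
    n = len(scores)
    return [
        100.0 * (sum(s < x for s in scores) + 0.5 * sum(s == x for s in scores)) / n
        for x in scores
    ]
```

Note how the two examinees with identical number-right scores receive the same rank, while a z score (a linear transformation of number-right) would preserve the full score metric.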

15.
Cut‐scores were set by expert judges on assessments of reading and listening comprehension of English as a foreign language (EFL), using the bookmark standard‐setting method to differentiate proficiency levels defined by the Common European Framework of Reference (CEFR). Assessments contained stratified item samples drawn from extensive item pools, calibrated using Rasch models on the basis of examinee responses of a German nationwide assessment of secondary school language performance. The results suggest significant effects of item sampling strategies for the bookmark method on cut‐score recommendations, as well as significant cut‐score judgment revision over cut‐score placement rounds. Results are discussed within a framework of establishing validity evidence supporting cut‐score recommendations using the widely employed bookmark method.

16.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is the item response theory–based true score equating, whose goal is to generate a conversion table to relate number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items and at the same time keeps a large amount of linking items.
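The quantity such a stepwise method monitors is the gap between the two forms' test characteristic curves. A sketch under a 2PL parameterization (the article works with the 3PL; names and the grid are illustrative):

```python
import math

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in items)

def max_tcc_gap(items_old, items_new, grid):
    """Largest vertical gap between two forms' TCCs over a theta grid.

    A drifted linking item widens this gap; a stepwise procedure can
    drop the item whose removal shrinks it the most and iterate.
    """
    return max(abs(tcc(t, items_old) - tcc(t, items_new)) for t in grid)
```

Shifting one item's difficulty between administrations produces a visible TCC gap even when every other parameter is identical, which is why aggregate TCC agreement is a natural target for equating-oriented drift detection.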

17.
In test-centered standard-setting methods, borderline performance can be represented by many different profiles of strengths and weaknesses. As a result, asking panelists to estimate item or test performance for a hypothetical group of borderline examinees, or a typical borderline examinee, may be an extremely difficult task and one that can lead to questionable results in setting cut scores. In this study, data collected from a previous standard-setting study are used to deduce panelists' conceptions of profiles of borderline performance. These profiles are then used to predict cut scores on a test of algebra readiness. The results indicate that these profiles can predict a very wide range of cut scores both within and between panelists. Modifications are proposed to existing training procedures for test-centered methods that can account for the variation in borderline profiles.

18.
A Monte Carlo simulation technique for generating dichotomous item scores is presented that implements (a) a psychometric model with different explicit assumptions than traditional parametric item response theory (IRT) models, and (b) item characteristic curves without restrictive assumptions concerning mathematical form. The four-parameter beta compound-binomial (4PBCB) strong true score model (with two-term approximation to the compound binomial) is used to estimate and generate the true score distribution. The nonparametric item-true score step functions are estimated by classical item difficulties conditional on proportion-correct total score. The technique performed very well in replicating inter-item correlations, item statistics (point-biserial correlation coefficients and item proportion-correct difficulties), first four moments of total score distribution, and coefficient alpha of three real data sets consisting of educational achievement test scores. The technique replicated real data (including subsamples of differing proficiency) as well as the three-parameter logistic (3PL) IRT model (and much better than the 1PL model) and is therefore a promising alternative simulation technique. This 4PBCB technique may be particularly useful as a more neutral simulation procedure for comparing methods that use different IRT models.

19.
《教育实用测度》2013,26(1):91-109
After analyzing data from the 1990 National Assessment of Educational Progress Trial State Assessment, we question whether person-fit statistics are useful in the analysis and reporting of results from psychometrically strong achievement tests. Using a weighted mean-square person-fit statistic, we examined the distribution of fit across individuals, looked for group and item-type differences, and investigated practical significance. In each analysis, we found that this person-fit statistic did not provide any additional information.

20.
There has been an increased interest in the impact of unmotivated test taking on test performance and score validity. This has led to the development of new ways of measuring test-taking effort based on item response time. In particular, Response Time Effort (RTE) has been shown to provide an assessment of effort down to the level of individual item responses. A limitation of RTE, however, is that it is intended for use with selected response items that must be answered before a test taker can move on to the next item. The current study outlines a general process for measuring item-level effort that can be applied to an expanded set of item types and test-taking behaviors (such as omitted or constructed responses). This process, which is illustrated with data from a large-scale assessment program, should improve our ability to detect non-effortful test taking and perform individual score validation.
