Similar documents
Found 20 similar documents (search time: 31 ms)
1.
Recent developments of person-fit analysis in computerized adaptive testing (CAT) are discussed. Methods from statistical process control are presented that have been proposed to classify an item score pattern as fitting or misfitting the underlying item response theory model in CAT. Most person-fit research in CAT is restricted to simulated data. In this study, empirical data from a certification test were used. Alternatives are discussed to generate norms so that bounds can be determined to classify an item score pattern as fitting or misfitting. Using bounds determined from a sample of a high-stakes certification test, the empirical analysis showed that different types of misfit can be distinguished. Further applications using statistical process control methods to detect misfitting item score patterns are discussed.
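The statistical process control methods referenced above are typically CUSUM-type charts accumulated over item residuals. A minimal sketch, assuming a 2PL response model; the function names are illustrative, and in practice the alarm bounds would come from norms such as the certification sample described in the abstract:

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def cusum_person_fit(theta, items, responses):
    """One-sided CUSUM charts over item residuals.

    C+ grows when an examinee keeps outperforming the model
    (e.g., possible preknowledge); C- grows when the examinee keeps
    underperforming (e.g., careless responding).  A pattern is flagged
    when either chart crosses a norm-derived bound.
    """
    c_plus = c_minus = 0.0
    path = []
    n = len(items)
    for (a, b), x in zip(items, responses):
        p = p_2pl(theta, a, b)
        t = (x - p) / n  # residual scaled by test length, as in SPC charts
        c_plus = max(0.0, c_plus + t)
        c_minus = min(0.0, c_minus + t)
        path.append((c_plus, c_minus))
    return path
```

A pattern of all-correct responses at items matched to the examinee's ability drives C+ steadily upward, which is the kind of drift these charts are designed to catch.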

2.
Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non-Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real-data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth-related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinees to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non-Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.
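The contrast between non-Bayes maximum likelihood and Bayesian estimators can be illustrated with a small grid-based sketch under an assumed 2PL model and a standard normal prior; the shrinkage of the EAP estimate toward the prior mean is the practical effect the abstract highlights. All names here are illustrative:

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def loglik(theta, items, x):
    """Log-likelihood of response pattern x at ability theta."""
    ll = 0.0
    for (a, b), u in zip(items, x):
        p = p2pl(theta, a, b)
        ll += u * math.log(p) + (1 - u) * math.log(1 - p)
    return ll

def mle(items, x, grid):
    """Grid-search maximum likelihood (non-Bayes) estimate."""
    return max(grid, key=lambda t: loglik(t, items, x))

def eap(items, x, grid):
    """Expected a posteriori (Bayes) estimate with a N(0, 1) prior."""
    post = [math.exp(loglik(t, items, x)) * math.exp(-t * t / 2) for t in grid]
    total = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / total
```

For a short test, the EAP estimate sits noticeably closer to zero than the MLE for the same response pattern, which is exactly why the choice of estimator moves score distributions and percent-proficient figures.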

3.
Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of approximately 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected in each study: one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel, and a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.
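A proportionally stratified subset of the kind used by the second panel can be drawn as follows. The `area` key and item dictionaries are hypothetical, and simple rounding can leave the subset a few items off the target when strata are unbalanced:

```python
import random

def stratified_subset(items, k):
    """Draw a proportionally stratified sample of about k items.

    Each content area keeps (approximately) its share of the full
    test; items are dicts with a hypothetical 'area' key.
    """
    by_area = {}
    for it in items:
        by_area.setdefault(it['area'], []).append(it)
    n = len(items)
    subset = []
    for area, pool in by_area.items():
        take = round(k * len(pool) / n)  # proportional allocation
        subset.extend(random.sample(pool, take))
    return subset
```

With 90 items split evenly across three content areas and k = 45, each area contributes 15 items, mirroring the test's content blueprint.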

4.
The person-fit literature assumes that aberrant response patterns could be a sign of person mismeasurement, but this assumption has rarely, if ever, been empirically investigated before. We explore the validity of test responses and measures of 10-year-old examinees whose response patterns on a commercial standardized paper-and-pencil mathematics test were flagged as aberrant. Validity evidence was collected through postexamination reflective interviews with 31 of the 80 pupils flagged as aberrant and their teachers, and teacher assessment (TA) judgments for the whole examination cohort of 674 examinees. Analysis suggested that interview-adjusted scores were significantly better fitting than expected by chance, but only some adjustments suggest serious mismeasurement. In addition, disagreement between TA and test scores was significantly greater for aberrant examinees, and partially predicted the interview adjustments. We conclude that person misfit statistics when combined with TA might be a useful antidote to mismeasurement, and we discuss the implications for assessment research and practice.

5.
《教育实用测度》2013,26(1):9-26
We present statistical and theoretical issues that arise from assessing person-fit on measures of typical performance. After presenting the status of past and current research issues, we describe three topics of ongoing concern. First, because typical performance measures tend to be short, and because they have low bandwidth, the detection of person-misfit is often attenuated. Second, there is a need for creative methods of identifying the specific sources of response aberrancy, rather than simply identifying person-misfit. Third, the promise of person-fit measures as moderators of trait-criterion relations remains undemonstrated. We offer commentary on potential resolutions to these three current topics. In terms of future research directions, we outline two lines of advancement that are relevant for both educational and personality psychologists. These are (a) the use of person-fit statistics in the assessment of how item response theory measurement models differ across manifest groups (e.g., ethnicity, gender), and (b) the application of person-fit statistics under "external" item response theory model conditions. We summarize the role these advances could play in helping educational testers go beyond the standard task of identifying "invalid" protocols by discussing how person-fit assessment may contribute to our understanding of individual and group differences in trait structure.

6.
Mahalanobis distance (M-distance) case diagnostics are a useful tool for assessing response pattern inconsistency in factor analysis; however, the derivations of these statistics assume continuous variables, which limits their utility in ordinal self- or rater-report data. This research generalizes M-distance diagnostics to categorical factor analysis. We prove that the residual-based M-distance d_r is equivalent to the person-fit index l_co, which motivates the use of the new categorical M-distance d_r* as a person-fit index. d_r* is compared and contrasted with z_h, a commonly used item response theory person-fit index. A simulation study is used to show that a simple transformation of d_r* satisfies established criteria for person-fit measures. A sample of responses to the Rosenberg Self-Esteem Scale is used to determine parameters for a simulation study, and real data are analyzed to contrast the use of d_r and d_r* as indexes of person-fit in continuous and categorical factor analysis.
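For continuous data, the classic case diagnostic is the squared Mahalanobis distance of each response vector from the sample centroid. A rough sketch of that baseline idea (not the paper's residual-based statistics, which operate on factor-model residuals rather than raw scores):

```python
import numpy as np

def mahalanobis_distances(X):
    """Squared M-distance of each row of X from the sample centroid.

    Large values flag response vectors that are inconsistent with the
    sample's mean and covariance structure.  The pseudo-inverse guards
    against a singular sample covariance matrix.
    """
    mu = X.mean(axis=0)
    S_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    d = X - mu
    # d_i' S^{-1} d_i for every row i
    return np.einsum('ij,jk,ik->i', d, S_inv, d)
```

A useful check: with a nonsingular sample covariance, the squared distances sum exactly to p(n - 1) for n cases and p variables, so an outlying case claims a disproportionate share of that fixed total.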

7.
In educational and psychological measurement, a person-fit statistic (PFS) is designed to identify aberrant response patterns. For parametric PFSs, valid inference depends on several assumptions, one of which is that the item response theory (IRT) model is correctly specified. Previous studies have used empirical data sets to explore the effects of model misspecification on PFSs. We further this line of research by using a simulation study, which allows us to explore issues that may be of interest to practitioners. Results show that, depending on the generating and analysis item models, Type I error rates at fixed values of the latent variable may be greatly inflated, even when the aggregate rates are relatively accurate. Results also show that misspecification is most likely to affect PFSs for examinees with extreme latent variable scores. Two empirical data analyses are used to illustrate the importance of model specification.

8.
A new procedure for generating instructionally relevant diagnostic feedback is proposed. The approach involves first constructing a strong model of student proficiency and then testing whether individual students' observed item response vectors are consistent with that model. Diagnoses are specified in terms of the combinations of skills needed to score at increasingly higher levels on a test's reported score scale. The approach is applied to the problem of developing diagnostic feedback for the SAT I Verbal Reasoning test. Using a variation of Wright's (1977) person-fit statistic, it is shown that the estimated proficiency model accounts for 91% of the "explainable" variation in students' observed item response vectors.

9.
Response accuracy and response time data can be analyzed with a joint model to measure ability and speed of working, while accounting for relationships between item and person characteristics. In this study, person‐fit statistics are proposed for joint models to detect aberrant response accuracy and/or response time patterns. The person‐fit tests take the correlation between ability and speed into account, as well as the correlation between item characteristics. They are posited as Bayesian significance tests, which have the advantage that the extremeness of a test statistic value is quantified by a posterior probability. The person‐fit tests can be computed as by‐products of a Markov chain Monte Carlo algorithm. Simulation studies were conducted in order to evaluate their performance. For all person‐fit tests, the simulation studies showed good detection rates in identifying aberrant patterns. A real data example is given to illustrate the person‐fit statistics for the evaluation of the joint model.

10.
《教育实用测度》2013,26(2):163-183
When low-stakes assessments are administered, the degree to which examinees give their best effort is often unclear, complicating the validity and interpretation of the resulting test scores. This study introduces a new method, based on item response time, for measuring examinee test-taking effort on computer-based test items. This measure, termed response time effort (RTE), is based on the hypothesis that when administered an item, unmotivated examinees will answer too quickly (i.e., before they have time to read and fully consider the item). Psychometric characteristics of RTE scores were empirically investigated and supportive evidence for score reliability and validity was found. Potential applications of RTE scores and their implications are discussed.
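Once per-item rapid-guessing thresholds have been chosen, the RTE index is simple to compute: it is the proportion of items answered with solution behavior. A minimal sketch, with the thresholds supplied by the caller (how thresholds are set is a separate modeling decision):

```python
def response_time_effort(rts, thresholds):
    """Response Time Effort: the fraction of items answered with
    'solution behavior', i.e., response time at or above the item's
    rapid-guessing threshold.  RTE near 1.0 indicates full effort;
    low RTE flags a possibly unmotivated examinee.
    """
    assert len(rts) == len(thresholds)
    solution = [rt >= th for rt, th in zip(rts, thresholds)]
    return sum(solution) / len(solution)
```

An examinee who blitzes through half the items faster than the thresholds allow receives an RTE of 0.5, regardless of whether those rapid guesses happened to be correct.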

11.
《教育实用测度》2013,26(1):33-51
The objectives of this study were to examine the impact of different curricula on standardized achievement test scores at item and objective levels and to determine if different curricula generate different patterns of item factor loadings. School buildings from a middle-sized district were rated regarding the degree to which their curricula matched the content of the standardized test, and the actual textbook series used within each building (classroom) was determined. Covariate analyses of objective scores and plots and correlations of item p values indicated very small, nonsignificant differential effects across ratings and textbook series. Factor patterns indicated no curricular effects on large first factors. These findings parallel the results of a previous study conducted at the subtest level. We conclude that educators need not be unduly concerned about the impact of specific and generally small differences in curricular offerings within a district on standardized test scores or inferences to a broad content domain.

12.
This study examined the utility of response time‐based analyses in understanding the behavior of unmotivated test takers. For the data from an adaptive achievement test, patterns of observed rapid‐guessing behavior and item response accuracy were compared to the behavior expected under several types of models that have been proposed to represent unmotivated test taking behavior. Test taker behavior was found to be inconsistent with these models, with the exception of the effort‐moderated model. Effort‐moderated scoring was found to both yield scores that were more accurate than those found under traditional scoring, and exhibit improved person fit statistics. In addition, an effort‐guided adaptive test was proposed and shown by a simulation study to alleviate item difficulty mistargeting caused by unmotivated test taking.
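Effort-moderated scoring drops responses flagged as rapid guesses from the likelihood before estimating ability. A grid-search sketch under an assumed 2PL model; function names and the threshold mechanism are illustrative, not the study's exact implementation:

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def effort_moderated_mle(items, x, rts, thresholds, grid):
    """Effort-moderated ability estimate: responses whose times fall
    below the item's rapid-guessing threshold contribute nothing to
    the likelihood, so theta reflects effortful responses only.
    """
    def ll(theta):
        total = 0.0
        for (a, b), u, rt, th in zip(items, x, rts, thresholds):
            if rt < th:        # rapid guess: exclude from the likelihood
                continue
            p = p2pl(theta, a, b)
            total += u * math.log(p) + (1 - u) * math.log(1 - p)
        return total
    return max(grid, key=ll)
```

When an examinee's rapid guesses are mostly wrong, dropping them raises the estimate relative to traditional scoring, which is the accuracy gain the abstract reports.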

13.
Research has shown that many educators do not understand the terminology or displays used in test score reports and that measurement error is a particularly challenging concept. We investigated graphical and verbal methods of representing measurement error associated with individual student scores. We created four alternative score reports, each constituting an experimental condition, and randomly assigned them to research participants. We then compared comprehension and preferences across the four conditions. In our main study, we collected data from 148 teachers. For comparison, we studied 98 introductory psychology students. Although we did not detect statistically significant differences across conditions, we found that participants who reported greater comfort with statistics tended to have higher comprehension scores and tended to prefer more informative displays that included variable-width confidence bands for scores. Our data also yielded a wealth of information regarding existing misconceptions about measurement error and about score-reporting conventions.

14.
Reliabilities and information functions for percentile ranks and number-right scores were compared in the context of item response theory. The basic results were: (a) The percentile rank is always less informative and reliable than the number-right score; and (b) for easy or difficult tests composed of highly discriminating items, the percentile rank often yields unacceptably low reliability and information relative to the number-right score. These results suggest that standardized scores that are linear transformations of the number-right score (e.g., z scores) are much more reliable and informative indicators of the relative standing of a test score than are percentile ranks. The findings reported here demonstrate that there exist situations in which the percent of items known by examinees can be accurately estimated, but that the percent of persons falling below a given score cannot.
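The coarseness of percentile ranks relative to linear transformations of number-right scores is easy to see: tied number-right scores collapse onto a single rank. A minimal sketch of the usual midpoint percentile-rank definition (percent strictly below, plus half of those tied):

```python
def percentile_ranks(scores):
    """Midpoint percentile rank for each number-right score: the
    percent of examinees scoring strictly below, plus half the
    percent scoring exactly the same.
    """
    n = len(scores)
    return [
        100.0 * (sum(s < x for s in scores) + 0.5 * sum(s == x for s in scores)) / n
        for x in scores
    ]
```

Note how the two examinees with identical number-right scores receive the same rank, while a z score (a linear transformation of number-right) would preserve the full score metric.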

15.
Cut‐scores were set by expert judges on assessments of reading and listening comprehension of English as a foreign language (EFL), using the bookmark standard‐setting method to differentiate proficiency levels defined by the Common European Framework of Reference (CEFR). Assessments contained stratified item samples drawn from extensive item pools, calibrated using Rasch models on the basis of examinee responses of a German nationwide assessment of secondary school language performance. The results suggest significant effects of item sampling strategies for the bookmark method on cut‐score recommendations, as well as significant cut‐score judgment revision over cut‐score placement rounds. Results are discussed within a framework of establishing validity evidence supporting cut‐score recommendations using the widely employed bookmark method.

16.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is the item response theory–based true score equating, whose goal is to generate a conversion table to relate number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items and at the same time keeps a large amount of linking items.
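The quantity such a stepwise method monitors is the gap between the two forms' test characteristic curves. A sketch under a 2PL parameterization (the article works with the 3PL; names and the grid are illustrative):

```python
import math

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in items)

def max_tcc_gap(items_old, items_new, grid):
    """Largest vertical gap between two forms' TCCs over a theta grid.

    A drifted linking item widens this gap; a stepwise procedure can
    drop the item whose removal shrinks it the most and iterate.
    """
    return max(abs(tcc(t, items_old) - tcc(t, items_new)) for t in grid)
```

Shifting one item's difficulty between administrations produces a visible TCC gap even when every other parameter is identical, which is why aggregate TCC agreement is a natural target for equating-oriented drift detection.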

17.
In test-centered standard-setting methods, borderline performance can be represented by many different profiles of strengths and weaknesses. As a result, asking panelists to estimate item or test performance for a hypothetical group of borderline examinees, or a typical borderline examinee, may be an extremely difficult task and one that can lead to questionable results in setting cut scores. In this study, data collected from a previous standard-setting study are used to deduce panelists' conceptions of profiles of borderline performance. These profiles are then used to predict cut scores on a test of algebra readiness. The results indicate that these profiles can predict a very wide range of cut scores both within and between panelists. Modifications are proposed to existing training procedures for test-centered methods that can account for the variation in borderline profiles.

18.
A Monte Carlo simulation technique for generating dichotomous item scores is presented that implements (a) a psychometric model with different explicit assumptions than traditional parametric item response theory (IRT) models, and (b) item characteristic curves without restrictive assumptions concerning mathematical form. The four-parameter beta compound-binomial (4PBCB) strong true score model (with two-term approximation to the compound binomial) is used to estimate and generate the true score distribution. The nonparametric item-true score step functions are estimated by classical item difficulties conditional on proportion-correct total score. The technique performed very well in replicating inter-item correlations, item statistics (point-biserial correlation coefficients and item proportion-correct difficulties), first four moments of total score distribution, and coefficient alpha of three real data sets consisting of educational achievement test scores. The technique replicated real data (including subsamples of differing proficiency) as well as the three-parameter logistic (3PL) IRT model (and much better than the 1PL model) and is therefore a promising alternative simulation technique. This 4PBCB technique may be particularly useful as a more neutral simulation procedure for comparing methods that use different IRT models.

19.
《教育实用测度》2013,26(1):91-109
After analyzing data from the 1990 National Assessment of Educational Progress Trial State Assessment, we question whether person-fit statistics are useful in the analysis and reporting of results from psychometrically strong achievement tests. Using a weighted mean-square person-fit statistic, we examined the distribution of fit across individuals, looked for group and item-type differences, and investigated practical significance. In each analysis, we found that this person-fit statistic did not provide any additional information.

20.
There has been an increased interest in the impact of unmotivated test taking on test performance and score validity. This has led to the development of new ways of measuring test-taking effort based on item response time. In particular, Response Time Effort (RTE) has been shown to provide an assessment of effort down to the level of individual item responses. A limitation of RTE, however, is that it is intended for use with selected response items that must be answered before a test taker can move on to the next item. The current study outlines a general process for measuring item-level effort that can be applied to an expanded set of item types and test-taking behaviors (such as omitted or constructed responses). This process, which is illustrated with data from a large-scale assessment program, should improve our ability to detect non-effortful test taking and perform individual score validation.
