Similar Documents
 20 similar documents found; search time: 62 ms
1.
Mokken scale analysis (MSA) is a probabilistic‐nonparametric approach to item response theory (IRT) that can be used to evaluate fundamental measurement properties with less strict assumptions than parametric IRT models. This instructional module provides an introduction to MSA as a probabilistic‐nonparametric framework in which to explore measurement quality, with an emphasis on its application in the context of educational assessment. The module describes both dichotomous and polytomous formulations of the MSA model. Examples of the application of MSA to educational assessment are provided using data from a multiple‐choice physical science assessment and a rater‐mediated writing assessment.  相似文献   
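The abstract does not reproduce the central MSA quantity, so for concreteness: the workhorse statistic for dichotomous items is Loevinger's scalability coefficient H, the ratio of observed inter-item covariance to the maximum covariance allowed by the item marginals. Below is a minimal numpy sketch under assumed variable names and a simulated data matrix; dedicated tools (e.g., the R package mokken) add standard errors, automated item selection, and the polytomous case.

```python
import numpy as np

def scalability_H(X):
    """Loevinger's H for a matrix of dichotomous item scores.

    X: (n_persons, n_items) array of 0/1 responses.
    Returns (H_total, H_per_item). A common MSA rule of thumb treats
    H >= 0.3 as the minimum for a (weak) Mokken scale.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    p = X.mean(axis=0)                        # item proportions correct
    cov = np.cov(X, rowvar=False, bias=True)  # observed covariances
    # Maximum covariance given the marginals: min(p_i, p_j) * (1 - max(p_i, p_j)).
    cov_max = np.minimum.outer(p, p) * (1.0 - np.maximum.outer(p, p))
    off = ~np.eye(k, dtype=bool)              # exclude the diagonal (variances)
    H_total = cov[off].sum() / cov_max[off].sum()
    H_item = cov[off].reshape(k, k - 1).sum(axis=1) / \
             cov_max[off].reshape(k, k - 1).sum(axis=1)
    return H_total, H_item

# Illustrative data simulated from a Rasch-type model (hypothetical difficulties).
rng = np.random.default_rng(0)
theta = rng.normal(size=500)
b = np.linspace(-1.5, 1.5, 8)
prob = 1 / (1 + np.exp(-(theta[:, None] - b)))
X = (rng.random((500, 8)) < prob).astype(int)
print(scalability_H(X))
```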

2.
Numerous researchers have proposed methods for evaluating the quality of rater‐mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many‐facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater‐mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater‐mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.  相似文献   
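One of the graphical MSA indicators referred to above is manifest monotonicity: the mean score on an item (or on a rater's scale) should be nondecreasing across groups formed by the rest score, i.e., the total score with that item removed. The sketch below is a bare-bones version of that check under assumed variable names; operational MSA software additionally merges sparse rest-score groups and tests violations for significance.

```python
import numpy as np

def monotonicity_check(X, item):
    """Mean item score per rest-score group for one item.

    X: (n_persons, n_items) array of item or rating scores.
    item: column index of the item to check.
    Manifest monotonicity requires the returned means to be nondecreasing.
    """
    X = np.asarray(X, dtype=float)
    rest = X.sum(axis=1) - X[:, item]          # total score without the item
    levels = np.unique(rest)
    means = np.array([X[rest == r, item].mean() for r in levels])
    violations = int(np.sum(np.diff(means) < 0))  # number of decreases
    return levels, means, violations
```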

3.
Recent developments of person-fit analysis in computerized adaptive testing (CAT) are discussed. Methods from statistical process control are presented that have been proposed to classify an item score pattern as fitting or misfitting the underlying item response theory model in CAT. Most person-fit research in CAT is restricted to simulated data; in this study, empirical data from a certification test were used. Alternatives are discussed for generating norms so that bounds can be determined to classify an item score pattern as fitting or misfitting. Using bounds determined from a sample of a high-stakes certification test, the empirical analysis showed that different types of misfit can be distinguished. Further applications using statistical process control methods to detect misfitting item score patterns are discussed.  相似文献   
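A typical statistical-process-control statistic for person fit in CAT is a CUSUM of item-level residuals (observed score minus model-implied probability) accumulated in administration order. The sketch below is a generic, illustrative version, not the study's procedure: the reference value k and the decision bounds would in practice be set from a norming sample, and the success probabilities stand in for whatever IRT model drives the CAT.

```python
import numpy as np

def cusum_person_fit(x, p, k=0.1):
    """Upper and lower CUSUM paths for one examinee's CAT record.

    x: 0/1 responses in administration order.
    p: model-implied success probabilities at the examinee's theta.
    k: reference value; a pattern is flagged when a path crosses a bound
       derived from a norming sample (bound not shown here).
    """
    c_plus, c_minus = 0.0, 0.0
    path_plus, path_minus = [], []
    for xi, pi in zip(x, p):
        resid = xi - pi                         # positive: unexpectedly correct
        c_plus = max(0.0, c_plus + resid - k)   # accumulates positive drift
        c_minus = min(0.0, c_minus + resid + k) # accumulates negative drift
        path_plus.append(c_plus)
        path_minus.append(c_minus)
    return np.array(path_plus), np.array(path_minus)
```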

4.
An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item judgments directly reflect empirical item difficulties, the variability in Angoff judgments over items would not add error to the cut score, but to the extent that the mean item judgments do not correspond to the empirical item difficulties, variability in mean judgments over items would add error to the cut score. In this article, we present two generalizability-theory–based analyses of the proportion of the item variance that contributes to error in the cut score. For one approach, variance components are estimated on the probability (or proportion-correct) scale of the Angoff judgments, and for the other, the judgments are transferred to the theta scale of an item response theory model before estimating the variance components. The two analyses yield somewhat different results but both indicate that it is not appropriate to simply ignore the item variance component in estimating the error variance.  相似文献   
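As a rough illustration of the argument (not the authors' exact estimators): suppose each of $n_j$ judges rates each of $n_i$ items and the cut score is the grand mean of the judgments. A generalizability-theory style decomposition of the cut-score error variance might then be written as below, where only the share $k$ of the item variance component that does not track empirical item difficulty counts as error.

```latex
\sigma^2_{\text{error}}(\bar{X})
  \;\approx\;
  \frac{\sigma^2_{j}}{n_j}
  \;+\;
  k\,\frac{\sigma^2_{i}}{n_i}
  \;+\;
  \frac{\sigma^2_{ji,e}}{n_j\, n_i},
  \qquad 0 \le k \le 1 .
```

Setting $k = 0$ (mean judgments perfectly mirror empirical difficulties) or $k = 1$ (no correspondence at all) brackets the two extremes discussed in the abstract; the article's point is that simply dropping the item term, i.e., assuming $k = 0$, is not appropriate.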

5.
This article discusses and demonstrates combining scores from multiple-choice (MC) and constructed-response (CR) items to create a common scale using item response theory methodology. Two specific issues addressed are (a) whether MC and CR items can be calibrated together and (b) whether simultaneous calibration of the two item types leads to loss of information. Procedures are discussed and empirical results are provided using a set of tests in the areas of reading, language, mathematics, and science in three grades.  相似文献   

6.
An approach called generalizability in item response modeling (GIRM) is introduced in this article. The GIRM approach essentially incorporates the sampling model of generalizability theory (GT) into the scaling model of item response theory (IRT) by making distributional assumptions about the relevant measurement facets. By specifying a random effects measurement model, and taking advantage of the flexibility of Markov Chain Monte Carlo (MCMC) estimation methods, it becomes possible to estimate GT variance components simultaneously with traditional IRT parameters. It is shown how GT and IRT can be linked together, in the context of a single-facet measurement design with binary items. Using both simulated and empirical data with the software WinBUGS, the GIRM approach is shown to produce results comparable to those from a standard GT analysis, while also producing results from a random effects IRT model.  相似文献   
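For the single-facet, binary-item case described above, the link between the two frameworks can be sketched as a Rasch-type model with random person and item effects; the specification below is a plausible illustration of the kind of model estimated via MCMC, not the exact WinBUGS code used in the article.

```latex
P(X_{pi}=1 \mid \theta_p, \beta_i)
  = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)},
\qquad
\theta_p \sim N(0, \sigma^2_p), \quad \beta_i \sim N(0, \sigma^2_i).
```

Posterior draws of $\sigma^2_p$ and $\sigma^2_i$ (together with a residual component) then play the role of the GT variance components, while the $\theta_p$ and $\beta_i$ draws are the usual IRT parameters.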

7.
Using random cluster sampling, 505 primary and secondary school teachers were recruited as participants, including 189 male and 271 female teachers, all aged between 25 and 55. The Teaching Efficacy Questionnaire was administered, and the responses were analyzed with item response theory to obtain the discrimination, difficulty, and peak item information of every item. The scale was revised with reference to these item discrimination, difficulty, and peak-information values, and the quality of the revised scale was then examined using structural equation modeling, facet theory techniques, and smallest space analysis. The results show that the revised scale has clearer structural validity, higher reliability, and more precise measurement. Data were managed with SPSS 15.0 and analyzed with Hudap 6.0 and MULTILOG 7.03. The study reached the following five conclusions: 1) the teaching efficacy scale is unidimensional and can therefore be analyzed with item response theory; 2) the discrimination and difficulty of the revised items are more reasonable; 3) the peak test information of the revised scale is slightly lower than that of the original scale; 4) corresponding facet elements of the original and revised scales are highly correlated; and 5) the three-facet content structure of the scale was confirmed, namely students' moral and behavioral education, classroom organization and management, and knowledge transmission.  相似文献   

8.
Most growth models implicitly assume that test scores have been vertically scaled. What may not be widely appreciated are the different choices that must be made when creating a vertical score scale. In this paper empirical patterns of growth in student achievement are compared as a function of different approaches to creating a vertical scale. Longitudinal item‐level data from a standardized reading test are analyzed for two cohorts of students between Grades 3 and 6 and Grades 4 and 7 for the entire state of Colorado from 2003 to 2006. Eight different vertical scales were established on the basis of choices made for three key variables: Item Response Theory modeling approach, linking approach, and ability estimation approach. It is shown that interpretations of empirical growth patterns appear to depend upon the extent to which a vertical scale has been effectively “stretched” or “compressed” by the psychometric decisions made to establish it. While all of the vertical scales considered show patterns of decelerating growth across grade levels, there is little evidence of scale shrinkage.  相似文献   

9.
In educational and psychological measurement, a person-fit statistic (PFS) is designed to identify aberrant response patterns. For parametric PFSs, valid inference depends on several assumptions, one of which is that the item response theory (IRT) model is correctly specified. Previous studies have used empirical data sets to explore the effects of model misspecification on PFSs. We further this line of research by using a simulation study, which allows us to explore issues that may be of interest to practitioners. Results show that, depending on the generating and analysis item models, Type I error rates at fixed values of the latent variable may be greatly inflated, even when the aggregate rates are relatively accurate. Results also show that misspecification is most likely to affect PFSs for examinees with extreme latent variable scores. Two empirical data analyses are used to illustrate the importance of model specification.  相似文献   
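For concreteness, one widely used parametric PFS is the standardized log-likelihood statistic $l_z$; the sketch below computes it under a 2PL whose item parameters and ability estimate are taken as given. It is a generic illustration, not the specific statistics or simulation design of the study, and the usual caveat applies that plugging in an estimated theta distorts the nominal null distribution.

```python
import numpy as np

def lz_statistic(x, theta, a, b):
    """Standardized log-likelihood person-fit statistic under a 2PL.

    x: 0/1 item responses for one examinee.
    theta: (estimated) latent trait value.
    a, b: 2PL discrimination and difficulty parameters.
    Large negative values indicate misfit (unexpected response patterns).
    """
    x, a, b = map(np.asarray, (x, a, b))
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))               # model probabilities
    logit = np.log(p / (1.0 - p))
    l0 = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))     # observed log-likelihood
    e_l0 = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))   # its expectation
    v_l0 = np.sum(p * (1 - p) * logit ** 2)                  # and its variance
    return (l0 - e_l0) / np.sqrt(v_l0)
```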

10.
Many large‐scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically simple structured tests such as these, rely on multiple multiple‐choice and/or constructed‐response sections of items to generate multiple scores. In the current article, we propose an extension of the hierarchical rater model (HRM) to be applied with simple structured tests with constructed response items. In addition to modeling the appropriate trait structure, the multidimensional HRM (M‐HRM) presented here also accounts for rater severity bias and rater variability or inconsistency. We introduce the model formulation, test parameter recovery with a focus on latent traits, and compare the M‐HRM to other scoring approaches (unidimensional HRMs and a traditional multidimensional item response theory model) using simulated and empirical data. Results show more precise scores under the M‐HRM, with a major improvement in scores when incorporating rater effects versus ignoring them in the traditional multidimensional item response theory model.  相似文献   

11.
In this study, repeated measures item response data from an English language proficiency (ELP) test administered over 2 years were analyzed by item response theory-based approaches to model change over time. First, change over time was modeled by an item side change approach where change was parameterized at the item level. Second, more importantly, this study introduced and demonstrated an application of an interaction model for change, which overcomes a limitation of the item side change approach by allowing interaction effects between persons and item side change parameters. The results showed that student changes in the ELP data used in this study should be modeled for the three sub-content areas and that modeling interaction for change was necessary.  相似文献   

12.
Generalizability theory and analysis of variance methods are employed, together with the concept of objective time pressure, to estimate response time distributions and the degree of time pressure in timed tests. By estimating response time variance components due to person, item, and their interaction, and fixed effects due to item types and examinee time pressure, one can predict the distribution (mean and variance) of total response time for a population of examinees and a particular time limit. Furthermore, these variance components and fixed effects can be used in a simulation approach to estimate the distributions of time pressure during the test to help test developers evaluate the appropriateness of specific time limits. I present theoretical considerations and empirical results from two tests.  相似文献   
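To make the prediction step concrete: under a simple person-by-item decomposition of response times with grand mean $\mu$ and variance components $\sigma^2_p$ (persons), $\sigma^2_i$ (items), and $\sigma^2_{pi,e}$ (interaction plus residual), the total time of a randomly sampled examinee on a form of $n_i$ items can be approximated as below. This is an illustrative simplification of the fixed-plus-random-effects model described in the abstract, not the authors' exact equations.

```latex
T_p = \sum_{i=1}^{n_i} t_{pi}, \qquad
E(T_p) \approx n_i\,\mu, \qquad
\operatorname{Var}(T_p) \approx n_i^2\,\sigma^2_p + n_i\,\sigma^2_i + n_i\,\sigma^2_{pi,e}.
```

Comparing this predicted distribution of total time with a proposed time limit gives a first indication of how much of the examinee population would experience time pressure.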

13.
Restricted factor analysis (RFA) can be used to detect item bias (also called differential item functioning). In the RFA method of item bias detection, the common factor model serves as an item response model, but group membership is also included in the model. Two simulation studies are reported, both showing that the RFA method detects bias in 7‐point scale items very well, especially when the sample size is large, the mean trait difference between groups is small, the group sizes are equal, and the amount of bias is large. The first study further shows that the RFA method detects bias in dichotomous items at least as well as an established method based on the one‐parameter logistic item response model. The second study concerns various procedures to evaluate the significance of two item bias indices provided by the RFA method. The results indicate that the RFA method performs best when it is used in an iterative procedure.  相似文献   
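The measurement model underlying this approach can be summarized in one line: each item is regressed on the common trait and, additionally, on group membership, and an item is flagged as biased when the direct group effect is nonzero. The notation below is a schematic rendering of that idea, not the authors' exact parameterization.

```latex
x_{pj} = \tau_j + \lambda_j\,\xi_p + \beta_j\,g_p + \varepsilon_{pj},
\qquad
\text{bias in item } j \iff \beta_j \neq 0,
```

where $\xi_p$ is the trait factor and $g_p$ the (possibly trait-correlated) group variable; the "restricted" part is that the direct effects $\beta_j$ are fixed to zero except for the items currently under test.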

14.
This article proposes procedures for assessing acquiescence in a balanced set of binary personality items. These procedures are based on the bidimensional item-factor analysis model, which is an alternative parameterization of the bidimensional 2-parameter normal-ogive item response theory model. First the rationale and general approach are described. Then methods for specifying and fitting the model and for estimating the individual factor scores are discussed, as well as several validity criteria for supporting the suitability of the proposed solution. Finally, we illustrate the procedures using empirical data and discuss their usefulness for applied personality measurement.  相似文献   

15.
Item stem formats can alter the cognitive complexity as well as the type of abilities required for solving mathematics items. Consequently, it is possible that item stem formats can affect the dimensional structure of mathematics assessments. This empirical study investigated the relationship between item stem format and the dimensionality of mathematics assessments. A sample of 671 sixth-grade students was given two forms of a mathematics assessment in which mathematical expression (ME) items and word problems (WP) were used to measure the same content. The effects of mathematical language and reading abilities in responding to ME and WP items were explored using unidimensional and multidimensional item response theory models. The results showed that WP and ME items appear to differ with regard to the underlying abilities required to answer these items. Hence, the multidimensional model fit the response data better than the unidimensional model. For the accurate assessment of mathematics achievement, students’ reading and mathematical language abilities should also be considered when implementing mathematics assessments with ME and WP items.  相似文献   

16.
The purpose of this ITEMS module is to provide an introduction to differential item functioning (DIF) analysis using mixture item response models. The mixture item response models for DIF analysis involve comparing item profiles across latent groups, instead of manifest groups. First, an overview of DIF analysis based on latent groups, called latent DIF analysis, is provided and its applications in the literature are surveyed. Then, the methodological issues pertaining to latent DIF analysis are described, including mixture item response models, parameter estimation, and latent DIF detection methods. Finally, recommended steps for latent DIF analysis are illustrated using empirical data.  相似文献   
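Formally, latent DIF replaces manifest groups with latent classes: each examinee belongs to an unobserved class, item parameters are class specific, and DIF is present when those parameters differ across classes. The schematic Rasch-type mixture below conveys the structure assumed in this kind of analysis; the module itself covers more general mixture item response models and their estimation.

```latex
P(X_{pi}=1 \mid \theta_p, g_p=g)
  = \frac{\exp(\theta_p - \beta_{ig})}{1 + \exp(\theta_p - \beta_{ig})},
\qquad
g_p \sim \text{Categorical}(\pi_1,\dots,\pi_G),
\qquad
\text{DIF in item } i \iff \beta_{ig} \text{ varies over } g .
```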

17.
A large literature emphasizes the importance of testing for measurement equivalence in scales that may be used as observed variables in structural equation modeling applications. When the same construct is measured across more than one developmental period, as in a longitudinal study, it can be especially critical to establish measurement equivalence, or invariance, across the developmental periods. Similarly, when data from more than one study are combined into a single analysis, it is again important to assess measurement equivalence across the data sources. Yet, how to incorporate nonequivalence when it is discovered is not well described for applied researchers. Here, we present an item response theory approach that can be used to create scale scores from measures while explicitly accounting for nonequivalence. We demonstrate these methods in the context of a latent curve analysis in which data from two separate studies are combined to estimate a single longitudinal model spanning several developmental periods.  相似文献   

18.
We introduce a new comparative response format, suitable for assessing personality and similar constructs. In this “graded-block” format, items measuring different constructs are first organized in blocks of 2 or more; then, pairs are formed from items within blocks. The pairs are presented 1 at a time, enabling respondents to express the extent of their preference for 1 item or the other using several graded categories. We model such data using confirmatory factor analysis (CFA) for ordinal outcomes. We derive Fisher information matrices for the graded pairs, and supply R code to enable computation of standard errors of trait scores. An empirical example illustrates the approach in low-stakes personality assessments and shows that similar results are obtained when using graded blocks of size 3 and a standard Likert format. However, graded-block designs might be superior when insufficient differentiation between items is expected (due to acquiescence, halo, or social desirability).  相似文献   

19.
Recent advances have enabled diagnostic classification models (DCMs) to accommodate longitudinal data. These longitudinal DCMs were developed to study how examinees change, or transition, between different attribute mastery statuses over time. This study examines using longitudinal DCMs as an approach to assessing growth and serves three purposes: (1) to define and evaluate two reliability measures to be used in the application of longitudinal DCMs; (2) through simulation, demonstrate that longitudinal DCM growth estimates have increased reliability compared to longitudinal item response theory models; and (3) through an empirical analysis, illustrate the practical and interpretive benefits of longitudinal DCMs. A discussion describes how longitudinal DCMs can be used as practical and reliable psychometric models when categorical and criterion‐referenced interpretations of growth are desired.  相似文献   

20.
We present a logistic function of a monotonic polynomial with a lower asymptote, allowing additional flexibility beyond the three‐parameter logistic model. We develop a maximum marginal likelihood‐based approach to estimate the item parameters. The new item response model is demonstrated on math assessment data from a state, and a computationally efficient strategy for choosing the order of the polynomial is demonstrated. Finally, our approach is tested through simulations and compared to response function estimation using smoothed isotonic regression. Results indicate that our approach can result in small gains in item response function recovery and latent trait estimation.  相似文献   
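The family of item response functions being described has the general form $P(\theta) = c + (1-c)\,\operatorname{logistic}(m(\theta))$, where $m$ is a polynomial constrained to be monotonically increasing and $c$ is the lower asymptote. The snippet below simply evaluates such a curve for given coefficients; it does not reproduce the authors' parameterization, which builds the monotonicity constraint into the parameters themselves, nor their marginal maximum likelihood estimation.

```python
import numpy as np

def mono_poly_logistic(theta, coefs, c=0.2):
    """Item response function: lower asymptote c plus a logistic of a polynomial.

    theta: array of latent trait values.
    coefs: polynomial coefficients (constant term first); the polynomial is
           assumed to be monotonically increasing on the theta range of interest.
    c:     lower asymptote (guessing level); illustrative value.
    """
    m = np.polynomial.polynomial.polyval(theta, coefs)    # m(theta)
    p = c + (1.0 - c) / (1.0 + np.exp(-m))
    # Sanity check of the monotonicity assumption on this grid of theta values.
    order = np.argsort(theta)
    assert np.all(np.diff(p[order]) >= -1e-12), "m(theta) is not monotone here"
    return p

theta = np.linspace(-4, 4, 9)
print(mono_poly_logistic(theta, coefs=[0.0, 1.2, 0.0, 0.05]))  # odd-order example
```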
