Similar Documents (20 results)
1.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters’ ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.
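The standardized residuals and between-subgroup outfit statistics described in this abstract can be sketched under a dichotomous Rasch model. The data, difficulty values, and subgroup labels below are invented for illustration; the sketch uses the unstandardized outfit mean-square rather than the standardized version reported in the study:

```python
import numpy as np

def rasch_prob(theta, b):
    """Expected probability of a positive rating under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def subgroup_outfit(x, theta, b, groups):
    """Outfit mean-square of one rater's ratings, split by examinee subgroup.

    Values near 1.0 indicate fit; values well above 1.0 flag unexpected ratings.
    """
    p = rasch_prob(theta, b)
    z = (x - p) / np.sqrt(p * (1 - p))   # standardized residuals
    return {g: float(np.mean(z[groups == g] ** 2)) for g in np.unique(groups)}

# toy illustration: the rater scores subgroup "A" in line with model expectation,
# but reverses expectation for subgroup "B" (harsh where leniency is expected)
theta = np.tile(np.linspace(-2, 2, 50), 2)   # examinee measures
b = np.zeros(100)                            # one criterion, difficulty 0
groups = np.array(["A"] * 50 + ["B"] * 50)
p = rasch_prob(theta, b)
x = np.where(groups == "A", p > 0.5, p < 0.5).astype(float)

fit = subgroup_outfit(x, theta, b, groups)   # fit["B"] far above fit["A"]
```

Plotting the residuals `z` separately per subgroup, as the abstract describes, would make the subgroup-"B" pattern visually obvious.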

2.
Karabatsos compared the power of 36 person-fit statistics using receiver operating characteristic curves and found the HT statistic to be the most powerful in identifying aberrant examinees. He found three statistics, C, MCI, and U3, to be the next most powerful. These four statistics, all of which are nonparametric, were found to perform considerably better than each of 25 parametric person-fit statistics. Dimitrov and Smith replicated part of this finding in a similar study. The present article raises some issues with the comparisons performed in Karabatsos and in Dimitrov and Smith, and points to literature suggesting that the comparisons could have been performed in a more traditional and fairer manner. The present article then replicates the simulations of Karabatsos and demonstrates in several ways that the parametric person-fit statistics lz and ECI4z (also considered by Karabatsos) are as powerful as HT and U3 in identifying aberrant examinees under such comparisons. The two parametric person-fit statistics are shown to lead to results similar to those of HT and U3 in a real data example.
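The parametric lz statistic discussed in this abstract standardizes the log-likelihood of a response pattern under the fitted model. A minimal sketch under the Rasch model follows; the item difficulties and the two response patterns are invented for illustration:

```python
import numpy as np

def lz_statistic(x, theta, b):
    """Standardized log-likelihood person-fit statistic lz.

    Large negative values flag aberrant response patterns under the Rasch model.
    """
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    l0 = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))       # observed log-lik
    e = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))        # its expectation
    v = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)         # its variance
    return float((l0 - e) / np.sqrt(v))

b = np.linspace(-2, 2, 40)              # item difficulties
theta = 0.0                             # examinee ability

# model-consistent pattern: correct on easy items, incorrect on hard ones
consistent = (b < 0).astype(float)
# aberrant pattern: incorrect on easy items, correct on hard ones
aberrant = (b > 0).astype(float)

lz_good = lz_statistic(consistent, theta, b)   # positive: better than expected
lz_bad = lz_statistic(aberrant, theta, b)      # strongly negative: aberrant
```

The nonparametric statistics compared in the article (e.g., U3, HT) work from the same response patterns but do not require estimated item parameters.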

3.
Individual person fit analyses provide important information regarding the validity of test score inferences for an individual test taker. In this study, we use data from an undergraduate statistics test (N = 1135) to illustrate a two-step method that researchers and practitioners can use to examine individual person fit. First, person fit is examined numerically with several indices based on the Rasch model (i.e., Infit, Outfit, and Between-Subset statistics). Second, person misfit is presented graphically with person response functions, and these person response functions are interpreted using a heuristic. Individual person fit analysis holds promise for improving score interpretation in that it may detect potential threats to validity of score inferences for some test takers. Individual person fit analysis may also highlight particular subsets of items (on which a test taker performs unexpectedly) that can be used to further contextualize her or his test performance.
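An empirical person response function of the kind this abstract describes can be sketched by stratifying items by difficulty and computing the proportion correct per stratum. All values below are invented for illustration:

```python
import numpy as np

def person_response_function(x, b, n_strata=3):
    """Proportion correct within item-difficulty strata (a simple empirical PRF).

    For a fitting examinee the proportions decline as items get harder; a flat
    or rising PRF flags unexpected responses worth a closer look.
    """
    order = np.argsort(b)                     # easiest to hardest
    strata = np.array_split(order, n_strata)
    return [float(np.mean(x[s])) for s in strata]

b = np.linspace(-2, 2, 30)                    # item difficulties
fitting = (b < 0.5).astype(float)             # correct on easier items only
misfitting = (b > 0.5).astype(float)          # correct on harder items only

prf_fit = person_response_function(fitting, b)       # declines across strata
prf_misfit = person_response_function(misfitting, b) # rises instead
```

The heuristic interpretation step in the study then asks, for each flagged examinee, which item subsets drive the unexpected shape.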

4.
Response accuracy and response time data can be analyzed with a joint model to measure ability and speed of working, while accounting for relationships between item and person characteristics. In this study, person‐fit statistics are proposed for joint models to detect aberrant response accuracy and/or response time patterns. The person‐fit tests take the correlation between ability and speed into account, as well as the correlation between item characteristics. They are posited as Bayesian significance tests, which have the advantage that the extremeness of a test statistic value is quantified by a posterior probability. The person‐fit tests can be computed as by‐products of a Markov chain Monte Carlo algorithm. Simulation studies were conducted in order to evaluate their performance. For all person‐fit tests, the simulation studies showed good detection rates in identifying aberrant patterns. A real data example is given to illustrate the person‐fit statistics for the evaluation of the joint model.

5.
Using Rasch analysis, the psychometric properties of a newly developed 35‐item parent‐proxy instrument, the Caregiver Assessment of Movement Participation (CAMP), designed to measure movement participation problems in children with Developmental Coordination Disorder, were examined. The CAMP was administered to 465 school children aged 5–10 years. Thirty of the 35 items were retained as they had acceptable infit and outfit statistics. Item separation (7.48) and child separation (3.16) were good; moreover, the CAMP had excellent reliability (Reliability Index for item = 0.98; Person = 0.91). Principal components analysis of item residuals confirmed the unidimensionality of the instrument. Based on category probability statistics, the original five‐point scale was collapsed into a four‐point scale. The item threshold calibration of the CAMP with the Movement Assessment Battery for Children Test was computed. The results indicated that a CAMP total score of 75 is the optimal cut‐off point for identifying children at risk of movement problems.

6.
This article demonstrates the use of a new class of model‐free cumulative sum (CUSUM) statistics to detect person fit given the responses to a linear test. The fundamental statistic being accumulated is the likelihood ratio of two probabilities. The detection performance of this CUSUM scheme is compared to other model‐free person‐fit statistics found in the literature as well as an adaptation of another CUSUM approach. The study used both simulated responses and real response data from a large‐scale standardized admission test.
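A CUSUM of Bernoulli likelihood ratios of the general kind the abstract mentions can be sketched as follows. The probabilities, response pattern, and clamping scheme are illustrative assumptions, not the article's exact formulation:

```python
import numpy as np

def cusum_person_fit(x, p0, p1):
    """One-sided CUSUM of Bernoulli log-likelihood ratios over a response string.

    x  : 0/1 item scores in administration order
    p0 : success probability under normal responding
    p1 : success probability under the aberrance being monitored
    A large accumulated value signals a shift from p0-like to p1-like responding.
    """
    llr = np.where(np.asarray(x) == 1,
                   np.log(p1 / p0),
                   np.log((1 - p1) / (1 - p0)))
    c, running = [], 0.0
    for step in llr:
        running = max(0.0, running + step)   # clamp at zero: one-sided chart
        c.append(running)
    return np.array(c)

# hypothetical pattern: mostly correct for 20 items, then mostly incorrect
# (e.g., an examinee whose accuracy collapses midway through the test)
x = [1, 1, 1, 0, 1] * 4 + [0, 0, 0, 1, 0] * 4
c = cusum_person_fit(x, p0=0.7, p1=0.25)
# the chart stays near zero over the first half and climbs steeply afterward
```

Flagging is done by comparing the maximum of the chart to a decision threshold chosen to control the false-alarm rate.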

7.
The landscape of science education is being transformed by the new Framework for Science Education (National Research Council, A framework for K-12 science education: practices, crosscutting concepts, and core ideas. The National Academies Press, Washington, DC, 2012), which emphasizes the centrality of scientific practices—such as explanation, argumentation, and communication—in science teaching, learning, and assessment. A major challenge facing the field of science education is developing assessment tools that are capable of validly and efficiently evaluating these practices. Our study examined the efficacy of a free, open-source machine-learning tool for evaluating the quality of students’ written explanations of the causes of evolutionary change relative to three other approaches: (1) human-scored written explanations, (2) a multiple-choice test, and (3) clinical oral interviews. A large sample of undergraduates (n = 104) exposed to varying amounts of evolution content completed all three assessments: a clinical oral interview, a written open-response assessment, and a multiple-choice test. Rasch analysis was used to compute linear person measures and linear item measures on a single logit scale. We found that the multiple-choice test displayed poor person and item fit (mean square outfit >1.3), while both oral interview measures and computer-generated written response measures exhibited acceptable fit (average mean square outfit for interview: person 0.97, item 0.97; computer: person 1.03, item 1.06). Multiple-choice test measures were more weakly associated with interview measures (r = 0.35) than the computer-scored explanation measures (r = 0.63).
Overall, Rasch analysis indicated that computer-scored written explanation measures (1) have the strongest correspondence to oral interview measures; (2) are capable of capturing students’ normative scientific and naive ideas as accurately as human-scored explanations; and (3) more validly detect understanding than the multiple-choice assessment. These findings demonstrate the great potential of machine-learning tools for assessing key scientific practices highlighted in the new Framework for Science Education.

8.
This study investigated whether aberrant response behaviour is a stable characteristic of high school students taking classroom maths tests, as has been implied in the literature. For the purposes of the study, two maths tests were administered; the first to 25 classes (635 students) and the second to 18 of the original 25 classes (445 students). The tests contained multistep mathematical problems, with partial credit awarded for partially correct answers, together with some multiple choice items. The Rasch Partial Credit Model was used for the analyses, and the infit and outfit mean square statistics with six different cut-off scores were used to identify students with aberrant response behaviour (misfitting students). Six chi-square tests were then performed, one for each cut-off score, leading to a very clear conclusion: contrary to expectations, the same students do not misfit in the two tests administered; aberrance does not seem to be a stable characteristic of students. Explanations for aberrant responses such as carelessness, plodding or guessing need to be reconsidered. They may be valid for particular test situations, but this has yet to be demonstrated, and this investigation calls them into question.
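The per-cut-off chi-square tests of stability can be illustrated with a 2×2 table cross-classifying misfit status on the two tests. The counts below are invented for illustration, not the study's data:

```python
import numpy as np

def chi2_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()       # counts expected under independence
    return float(((table - expected) ** 2 / expected).sum())

# hypothetical counts: rows = misfit on test 1 (yes/no), cols = misfit on test 2
table = [[12, 48],
         [55, 330]]
stat = chi2_2x2(table)
# compare to the 3.84 critical value (df = 1, alpha = .05); a nonsignificant
# result means misfit on one test does not predict misfit on the other
independent = stat < 3.84
```

A finding of independence across all six cut-offs is what leads the authors to conclude that aberrance is not a stable student characteristic.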

9.
10.
The idea that test scores may not be valid representations of what students know, can do, and should learn next is well known. Person fit provides an important aspect of validity evidence. Person fit analyses at the individual student level are not typically conducted and person fit information is not communicated to educational stakeholders. In this study, we focus on a promising method for detecting and conveying person fit for large-scale educational assessments. This method uses multilevel logistic regression (MLR) to model the slopes of the person response functions, a potential source of person misfit for IRT models. We apply the method to a representative sample of students who took the writing section of the SAT (N = 19,341). The findings suggest that the MLR approach is useful for providing supplemental evidence of model–data fit in large-scale educational test settings. MLR can be useful for detecting general misfit at global and individual levels. However, as with other model–data fit indices, the MLR approach is limited in providing information regarding only some types of person misfit.

11.
Due to changes in Dutch mathematics education, teachers are expected to use new teaching methods such as enquiry-based teaching. In this study, we investigate how teachers design, implement and evaluate new methods for statistics teaching for 7th-graders during a professional development trajectory based on peer collaboration. We monitored teachers’ development in a network of four mathematics teachers from the same school. Using a mixed-methods approach that combined data from interviews, concept maps and classroom observations, we describe changes in teachers’ practical knowledge. We found that the nature of these changes depends strongly on the personal concerns teachers develop during the trajectory. Some teachers treated their concerns as challenges that stimulated their learning, while others experienced their concerns as a reason to fall back on previous teaching methods. Based on our results, we give some recommendations for organising teacher networks.

12.
A paucity of research has compared estimation methods within a measurement invariance (MI) framework to determine whether research conclusions based on normal-theory maximum likelihood (ML) generalize to the robust ML (MLR) and weighted least squares means and variance adjusted (WLSMV) estimators. Using ordered categorical data, this simulation study addressed these questions by investigating 342 conditions. When testing for metric and scalar invariance, Δχ2 results revealed that Type I error rates varied across estimators (ML, MLR, and WLSMV) with symmetric and asymmetric data. The power of the Δχ2 test varied substantially based on the estimator selected, the type of noninvariant indicator, the number of noninvariant indicators, and the sample size. Although some of the changes in approximate fit indexes (ΔAFI) are relatively independent of sample size, researchers who use the ΔAFI with WLSMV should use caution, as these statistics do not perform well with misspecified models. As a supplemental analysis, we evaluate and suggest cutoff values based on previous research.

13.
In this study we examined procedures for assessing model-data fit of item response theory (IRT) models for mixed-format data. The model fit indices used in this study include PARSCALE's G2, Orlando and Thissen's SX2 and SG2, and Stone's χ2* and G2*. To investigate the relative performance of the fit statistics at the item level, we conducted two simulation studies: a Type I error study and a power study. We evaluated the performance of the item fit indices under various conditions of test length, sample size, and IRT model. Among the competing measures, the summed-score-based indices SX2 and SG2 were found to be a sensible and efficient choice for assessing model fit for mixed-format data. These indices performed well, particularly with short tests. The pseudo-observed score indices, χ2* and G2*, showed inflated Type I error rates in some simulation conditions. Consistent with the findings of the current literature, PARSCALE's G2 index was rarely useful, although it provided reasonable results for long tests.

14.
Linear factor analysis (FA) models can be reliably tested using test statistics based on residual covariances. We show that the same statistics can be used to reliably test the fit of item response theory (IRT) models for ordinal data (under some conditions). Hence, the fit of an FA model and of an IRT model to the same data set can now be compared. When applied to a binary data set, our experience suggests that IRT and FA models yield similar fits. However, when the data are polytomous ordinal, IRT models yield a better fit because they involve a higher number of parameters. But when fit is assessed using the root mean square error of approximation (RMSEA), similar fits are obtained again. We explain why. These test statistics have little power to distinguish between FA and IRT models; they are unable to detect that linear FA is misspecified when applied to ordinal data generated under an IRT model.
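The RMSEA's per-degree-of-freedom scaling explains why the two model classes can converge on similar values even when their chi-squares differ. A toy computation with invented chi-square values and sample size:

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation from a model chi-square.

    Excess chi-square beyond the degrees of freedom is spread over df*(n-1),
    so a model that spends parameters to lower chi-square also lowers df.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# two hypothetical models fit to the same n = 500 sample: the IRT model uses
# more parameters, so both its chi-square and its df shrink relative to FA
fa_fit = rmsea(chi2=180.0, df=90, n=500)
irt_fit = rmsea(chi2=150.0, df=75, n=500)
# here chi2/df is 2.0 for both models, so the two RMSEA values coincide exactly
```

In this contrived case the excess misfit per degree of freedom is identical, which is the mechanism behind the "similar fits are obtained again" observation.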

15.
The asymptotically distribution free (ADF) method is often used to estimate parameters or test models without a normality assumption on variables, both in covariance structure analysis and in correlation structure analysis. However, little has been done to study the differences in the behavior of the ADF method in covariance versus correlation structure analysis. We compared the behaviors of three test statistics frequently used to evaluate structural equation models with nonnormally distributed variables: the χ2 test statistic TAGLS and its small-sample variants TYB and TF(AGLS). Results showed that the ADF method in correlation structure analysis with test statistic TAGLS performs much better at small sample sizes than the corresponding test for covariance structures. In contrast, test statistics TYB and TF(AGLS) under the same conditions generally perform better with covariance structures than with correlation structures. It is proposed that excessively large and variable condition numbers of the weight matrices are a cause of the poor behavior of ADF test statistics in small samples, and results showed that these condition numbers increase systematically, with a substantial increase in variance, as sample size decreases. Implications for research and practice are discussed.

16.
The assumption of conditional independence between the responses and the response times (RTs) for a given person is common in RT modeling. However, when the speed of a test taker is not constant, this assumption will be violated. In this article we propose a conditional joint model for item responses and RTs, which incorporates a covariance structure to explain the local dependency between speed and accuracy. To obtain information about the population of test takers, the new model was embedded in the hierarchical framework proposed by van der Linden (2007). A fully Bayesian approach using a straightforward Markov chain Monte Carlo (MCMC) sampler was developed to estimate all parameters in the model. The deviance information criterion (DIC) and the Bayes factor (BF) were employed to compare the goodness of fit between the models with two different parameter structures. The Bayesian residual analysis method was also employed to evaluate the fit of the RT model. Based on the simulations, we conclude that (1) the new model noticeably improves the parameter recovery for both the item parameters and the examinees’ latent traits when the assumptions of conditional independence between the item responses and the RTs are relaxed and (2) the proposed MCMC sampler adequately estimates the model parameters. The applicability of our approach is illustrated with an empirical example, and the model fit indices indicated a preference for the new model.

17.
This study examined differential item functioning (DIF), from both unidimensional and multidimensional perspectives, in the Chinese-to-English translated items of the IEA (International Association for the Evaluation of Educational Achievement) test of children's cognitive development. The data we analyzed comprised test records from 871 Chinese children and 557 American children. The results showed that more than half of the items exhibited substantial DIF, meaning that the test is not functionally equivalent for Chinese and American children. Users should therefore be cautious when using results from this cross-language comparative test to compare the cognitive ability levels of Chinese and American examinees. Fortunately, about half of the DIF items favored China and half favored the United States, so a scale built on the test total score should not be too severely biased. In addition, item fit statistics were not sufficient to detect the DIF items; dedicated DIF analyses should still be conducted. We discuss three possible causes of the DIF, but more subject-matter expertise and experimentation are needed to truly explain how the DIF arises.

18.
In this article, we discuss the benefits of Bayesian statistics and how to utilize them in studies of moral education. To demonstrate concrete examples of the applications of Bayesian statistics to studies of moral education, we reanalyzed two data sets previously collected: one small data set collected from a moral educational intervention experiment, and one big data set from a large-scale Defining Issues Test-2 (DIT-2) survey. The results suggest that Bayesian analysis of data sets collected from moral educational studies can provide additional useful statistical information, particularly that associated with the strength of evidence supporting alternative hypotheses, which has not been provided by the classical frequentist approach focusing on p-values. Finally, we introduce several practical guidelines pertaining to how to utilize Bayesian statistics, including the utilization of newly developed free statistical software, Jeffreys’s Amazing Statistics Program (JASP), and thresholding based on Bayes factors (BF), to scholars in the field of moral education.
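One simple way to obtain the kind of Bayes factor the abstract recommends is the BIC approximation; this is a sketch of that shortcut, not necessarily the method the authors used, and the gain scores below are invented for illustration:

```python
import math

def bf10_from_bic(bic_null, bic_alt):
    """Approximate Bayes factor BF10 from two BIC values."""
    return math.exp((bic_null - bic_alt) / 2.0)

def bic_fixed_mean(data, mu):
    """BIC of a normal model with a fixed mean mu; only the variance is free."""
    n = len(data)
    var = sum((x - mu) ** 2 for x in data) / n           # MLE of variance
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return -2 * loglik + 1 * math.log(n)                 # one free parameter

# hypothetical pre/post moral-reasoning gain scores
gains = [0.8, 1.2, 0.3, 0.9, 1.5, 0.7, 1.1, 0.4, 1.0, 0.6]
n = len(gains)
mean = sum(gains) / n

bic0 = bic_fixed_mean(gains, mu=0.0)                     # null: mean gain is 0
var1 = sum((x - mean) ** 2 for x in gains) / n           # alt: mean estimated
loglik1 = -0.5 * n * (math.log(2 * math.pi * var1) + 1)
bic1 = -2 * loglik1 + 2 * math.log(n)                    # two free parameters

bf10 = bf10_from_bic(bic0, bic1)
# BF10 above ~3 is conventionally read as evidence for a nonzero mean gain
```

JASP computes default Bayes factors with proper priors rather than this BIC shortcut, but the interpretation scale (e.g., BF10 > 3 as moderate evidence) is the same.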

19.
The size of a model has been shown to critically affect the goodness of approximation of the model fit statistic T to the asymptotic chi-square distribution in finite samples. It is not clear, however, whether this “model size effect” is a function of the number of manifest variables, the number of free parameters, or both. It is demonstrated by means of 2 Monte Carlo computer simulation studies that neither the number of free parameters to be estimated nor the model degrees of freedom systematically affect the T statistic when the number of manifest variables is held constant. Increasing the number of manifest variables, however, is associated with a severe bias. These results imply that model fit drastically depends on the size of the covariance matrix and that future studies involving goodness-of-fit statistics should always consider the number of manifest variables, but can safely neglect the influence of particular model specifications.

20.
The goal of this study was to investigate the usefulness of person‐fit analysis in validating student score inferences in a cognitive diagnostic assessment. In this study, a two‐stage procedure was used to evaluate person fit for a diagnostic test in the domain of statistical hypothesis testing. In the first stage, the person‐fit statistic, the hierarchy consistency index (HCI; Cui, 2007; Cui & Leighton, 2009), was used to identify the misfitting student item‐score vectors. In the second stage, students’ verbal reports were collected to provide additional information about students’ response processes so as to reveal the actual causes of misfit. This two‐stage procedure helped to identify the misfit of item‐score vectors to the cognitive model used in the design and analysis of the diagnostic test, and to discover the reasons for misfit, so that students’ problem‐solving strategies were better understood and their performances were interpreted in a more meaningful way.
