Similar documents
A total of 20 similar documents were retrieved.
1.
The goal of this study was to investigate the usefulness of person‐fit analysis in validating student score inferences in a cognitive diagnostic assessment. In this study, a two‐stage procedure was used to evaluate person fit for a diagnostic test in the domain of statistical hypothesis testing. In the first stage, the person‐fit statistic, the hierarchy consistency index (HCI; Cui, 2007; Cui & Leighton, 2009), was used to identify the misfitting student item‐score vectors. In the second stage, students’ verbal reports were collected to provide additional information about students’ response processes so as to reveal the actual causes of misfit. This two‐stage procedure helped to identify the misfits of item‐score vectors to the cognitive model used in the design and analysis of the diagnostic test, and to discover the reasons for misfit so that students’ problem‐solving strategies were better understood and their performances were interpreted in a more meaningful way.
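The HCI itself is defined in Cui (2007) and Cui and Leighton (2009); purely to illustrate the idea behind such an index, the Python sketch below computes a simplified hierarchy-consistency check from a hypothetical Q-matrix, flagging examinees who answer an item correctly while missing its prerequisite items. The index form, the subset rule for prerequisites, and the example Q-matrix are assumptions for illustration, not the published statistic.

```python
import numpy as np

def hierarchy_consistency(x, q):
    """Simplified hierarchy-consistency index for one examinee.

    x : 1-D array of 0/1 item scores.
    q : item-by-attribute 0/1 Q-matrix; item k is treated as a
        prerequisite of item j when k's attributes are a proper
        subset of j's attributes.
    Values near 1 indicate responses consistent with the hierarchy;
    low or negative values flag potential misfit.
    """
    n_items = len(x)
    misfits, comparisons = 0, 0
    for j in range(n_items):
        if x[j] != 1:          # only correctly answered items are checked
            continue
        for k in range(n_items):
            if k == j:
                continue
            # item k requires a proper subset of item j's attributes
            subset = np.all(q[k] <= q[j]) and np.any(q[k] < q[j])
            if subset:
                comparisons += 1
                if x[k] == 0:  # prerequisite item answered incorrectly
                    misfits += 1
    return 1.0 if comparisons == 0 else 1 - 2 * misfits / comparisons

# Hypothetical 4-item, 3-attribute example
Q = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 1],
              [0, 1, 0]])
print(hierarchy_consistency(np.array([1, 1, 1, 1]), Q))  # 1.0, perfect fit
print(hierarchy_consistency(np.array([0, 0, 1, 0]), Q))  # negative, misfitting
```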

2.
As item response theory has been more widely applied, investigating the fit of a parametric model becomes an important part of the measurement process. There is a lack of promising solutions to the detection of model misfit in IRT. Douglas and Cohen introduced a general nonparametric approach, RISE (Root Integrated Squared Error), for detecting model misfit. The purposes of this study were to extend the use of RISE to more general and comprehensive applications by manipulating a variety of factors (e.g., test length, sample size, IRT models, ability distribution). The results from the simulation study demonstrated that RISE outperformed G2 and S‐X2 in that it controlled Type I error rates and provided adequate power under the studied conditions. In the empirical study, RISE detected reasonable numbers of misfitting items compared to G2 and S‐X2, and RISE gave a much clearer picture of the location and magnitude of misfit for each misfitting item. In addition, there was no practical consequence to classification before and after replacement of the misfitting items detected by the three fit statistics.
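Douglas and Cohen define RISE through a specific nonparametric estimator; the sketch below is only a loose illustration of the underlying idea for a single dichotomous item: smooth the observed responses over estimated ability, compare the smoothed curve with the parametric 2PL curve, and average the squared difference. The bandwidth, grid, and weighting are illustrative assumptions.

```python
import numpy as np

def rise_index(theta_hat, item_scores, a, b, bandwidth=0.3):
    """Rough sketch of a RISE-type misfit index for one dichotomous item.

    Compares a kernel-smoothed (nonparametric) item characteristic curve
    with a parametric 2PL curve and averages the squared difference over
    an ability grid weighted by a standard normal density. The exact
    estimator in Douglas and Cohen differs in detail.
    """
    grid = np.linspace(-3, 3, 61)
    # Nonparametric ICC: kernel regression of item scores on ability estimates
    k = np.exp(-0.5 * ((grid[:, None] - theta_hat[None, :]) / bandwidth) ** 2)
    p_nonpar = (k * item_scores).sum(axis=1) / k.sum(axis=1)
    # Parametric 2PL ICC for the same item
    p_2pl = 1 / (1 + np.exp(-1.7 * a * (grid - b)))
    w = np.exp(-0.5 * grid ** 2)
    w /= w.sum()
    return np.sqrt(np.sum(w * (p_nonpar - p_2pl) ** 2))

# Hypothetical check: data generated from the 2PL should yield a value near zero
rng = np.random.default_rng(1)
theta = rng.normal(size=2000)
scores = rng.binomial(1, 1 / (1 + np.exp(-1.7 * (theta - 0.2))))
print(rise_index(theta, scores, a=1.0, b=0.2))
```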

3.
Recent developments of person-fit analysis in computerized adaptive testing (CAT) are discussed. Methods from statistical process control are presented that have been proposed to classify an item score pattern as fitting or misfitting the underlying item response theory model in CAT. Most person-fit research in CAT is restricted to simulated data. In this study, empirical data from a certification test were used. Alternatives are discussed to generate norms so that bounds can be determined to classify an item score pattern as fitting or misfitting. Using bounds determined from a sample of a high-stakes certification test, the empirical analysis showed that different types of misfit can be distinguished. Further applications using statistical process control methods to detect misfitting item score patterns are discussed.
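The statistical process control methods referred to here are typically CUSUM-type charts on item-level residuals; as a rough, hypothetical illustration of that logic (not the article's exact procedure), the sketch below accumulates residual sums over the items in administration order and classifies the pattern once a bound is crossed.

```python
import numpy as np

def cusum_person_fit(scores, p_expected, bound=0.5):
    """Sketch of a CUSUM-type check on an item score pattern in CAT.

    Positive and negative sums of the per-item residuals (observed minus
    model-expected score) are accumulated in administration order; the
    pattern is classified as misfitting when either sum crosses a bound.
    In practice the bound is derived from norms, as discussed in the
    article; the value used here is an arbitrary illustration.
    """
    c_plus, c_minus = 0.0, 0.0
    n = len(scores)
    for x, p in zip(scores, p_expected):
        t = (x - p) / n                      # residual contribution of this item
        c_plus = max(0.0, c_plus + t)
        c_minus = min(0.0, c_minus + t)
        if c_plus > bound or c_minus < -bound:
            return "misfitting"
    return "fitting"

# Hypothetical pattern: unexpected failures on items the model expects to be easy
print(cusum_person_fit([0, 0, 0, 1, 1], [0.9, 0.85, 0.8, 0.3, 0.2], bound=0.3))
```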

4.
The purpose of this study is to apply the attribute hierarchy method (AHM) to a subset of SAT critical reading items and illustrate how the method can be used to promote cognitive diagnostic inferences. The AHM is a psychometric procedure for classifying examinees’ test item responses into a set of attribute mastery patterns associated with different components from a cognitive model. The study was conducted in two steps. In step 1, three cognitive models were developed by reviewing selected literature in reading comprehension as well as research related to SAT Critical Reading. Then, the cognitive models were validated by having a sample of students think aloud as they solved each item. In step 2, psychometric analyses were conducted on the SAT critical reading cognitive models by evaluating the model‐data fit between the expected and observed response patterns produced from two random samples of 2,000 examinees who wrote the items. The model that provided the best model‐data fit was then used to calculate attribute probabilities for 15 examinees to illustrate our diagnostic testing procedure.

5.
This study investigated whether aberrant response behaviour is a stable characteristic of high school students taking classroom maths tests, as has been implied in the literature. For the purposes of the study, two maths tests were administered; the first to 25 classes (635 students) and the second to 18 out of the original 25 classes (445 students). The tests contained multistep mathematical problems with partial credit awarded for partially correct answers, together with some multiple choice items. The Rasch Partial Credit Model was used for the analyses and the infit and outfit mean square statistics with six different cut-off scores were used to identify students with aberrant response behaviour (misfitting students). Six Chi-square tests were then performed, one for each cut-off score, leading to a very clear conclusion: contrary to expectations, the same students do not misfit in the two tests administered; aberrance does not seem to be a stable characteristic of students. Explanations for aberrant responses such as carelessness, plodding or guessing need to be reconsidered. They may have validity for particular test situations, but this has yet to be demonstrated, and this investigation calls them into question.
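Infit and outfit mean squares are formed from standardized residuals between observed and model-expected item scores. The sketch below shows the computation for the simpler dichotomous Rasch case with hypothetical values; the Partial Credit Model used in the study follows the same residual logic, only with polytomous expected scores and variances.

```python
import numpy as np

def person_fit_mnsq(scores, theta, b):
    """Person infit/outfit mean squares for the dichotomous Rasch model.

    scores : 0/1 responses of one person to the items
    theta  : the person's ability estimate
    b      : item difficulty estimates
    Values well above 1 are commonly read as misfit, but any cut-off
    (such as the six used in the study) is a convention.
    """
    p = 1 / (1 + np.exp(-(theta - b)))   # expected score per item
    w = p * (1 - p)                      # model variance per item
    z2 = (scores - p) ** 2 / w           # squared standardized residuals
    outfit = z2.mean()                   # unweighted mean square
    infit = np.sum(w * z2) / np.sum(w)   # information-weighted mean square
    return infit, outfit

# Hypothetical example: a mid-ability person who misses the easiest items
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(person_fit_mnsq(np.array([0, 0, 1, 1, 1]), theta=0.0, b=b))
```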

6.
Research has suggested that inappropriate or misfitting response patterns may have detrimental effects on the quality and validity of measurement. It has been suggested that factors like language and ethnic background are related to the generation of misfitting response patterns, but the empirical research on this is rather scarce. This research analyzes data from three testing cycles of the National Curriculum tests in mathematics in England using the Rasch model. It was found that pupils having English as an additional language and pupils belonging to ethnic minorities are significantly more likely to generate aberrant response patterns. However, within the groups of pupils belonging to ethnic minorities, those who speak English as an additional language are not significantly more likely to generate misfitting response patterns. This may indicate that the ethnic background effect is more significant than the effect of the first language spoken. The results suggest that pupils having English as an additional language and pupils belonging to ethnic minorities are mismeasured significantly more than the remainder of pupils when taking the mathematics National Curriculum tests. More research is needed to generalize the results to other subjects and contexts.
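A minimal way to test whether misfit status is associated with group membership is a chi-square test on a contingency table of misfitting versus fitting pupils by group. The counts below are invented for illustration, and the article's actual analysis may well differ (for example, it could use logistic regression or control for ability).

```python
import numpy as np

# Hypothetical counts: rows = EAL pupils vs other pupils, columns = misfit vs fit
table = np.array([[  90,  710],
                  [ 300, 4900]], dtype=float)
row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
expected = row @ col / table.sum()                   # expected counts under independence
chi2 = ((table - expected) ** 2 / expected).sum()    # 1 df for a 2x2 table
print(chi2)
# Compare against the chi-square critical value with 1 df (3.84 at alpha = .05);
# a large value indicates the misfit rate differs between the groups, though it
# does not by itself explain why, which is the substantive question here.
```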

7.
Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers, though fewer than the number of items.

8.
In the presence of test speededness, the parameter estimates of item response theory models can be poorly estimated due to conditional dependencies among items, particularly for end‐of‐test items (i.e., speeded items). This article conducted a systematic comparison of five item calibration procedures—a two‐parameter logistic (2PL) model, a one‐dimensional mixture model, a two‐step strategy (a combination of the one‐dimensional mixture and the 2PL), a two‐dimensional mixture model, and a hybrid model—by examining how sample size, percentage of speeded examinees, percentage of missing responses, and the way of scoring missing responses (incorrect vs. omitted) affect item parameter estimation in speeded tests. For nonspeeded items, all five procedures showed similar results in recovering item parameters. For speeded items, the one‐dimensional mixture model, the two‐step strategy, and the two‐dimensional mixture model provided largely similar results and performed better than the 2PL model and the hybrid model in calibrating slope parameters. However, those three procedures performed similarly to the hybrid model in estimating intercept parameters. As expected, the 2PL model did not appear to be as accurate as the other models in recovering item parameters, especially when there were large numbers of examinees showing speededness and a high percentage of missing responses scored as incorrect. Real data analysis further described the similarities and differences between the five procedures.

9.
Examinees' thinking processes have become an increasingly important concern in testing. The response processes aspect is a major component of validity, and contemporary tests increasingly involve specifications about the cognitive complexity of examinees' response processes. Yet, empirical research findings on examinees' cognitive processes are not often available either to provide evidence for validity or to guide the design or selection of items. In this article, studies and developments from the author's research program are presented to illustrate how empirical studies on examinees' thinking processes can impact item and test design.

10.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters’ ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.
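As a rough sketch of the between-subgroup fit idea, the code below assumes the expected ratings and their model variances are already available from a fitted rater model (e.g., a many-facet Rasch analysis) and simply aggregates squared standardized residuals by rater and examinee subgroup into outfit-style mean squares. The column names, example data, and flagging threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def subgroup_outfit(df, flag_at=1.5):
    """Outfit-style mean squares of rating residuals by rater and subgroup.

    df must contain columns 'rater', 'subgroup', 'observed', 'expected',
    and 'variance' (expected rating and model variance are assumed to
    come from a previously fitted rater model). Cells above `flag_at`
    get a crude flag for follow-up inspection of the residual plots.
    """
    df = df.copy()
    df["z2"] = (df["observed"] - df["expected"]) ** 2 / df["variance"]
    cells = df.groupby(["rater", "subgroup"])["z2"].mean().rename("outfit_mnsq")
    out = cells.reset_index()
    out["flag"] = out["outfit_mnsq"] > flag_at
    return out

# Hypothetical ratings table
ratings = pd.DataFrame({
    "rater":    ["R1", "R1", "R1", "R2", "R2", "R2"],
    "subgroup": ["A",  "B",  "B",  "A",  "A",  "B"],
    "observed": [3,    1,    2,    4,    3,    2],
    "expected": [2.8,  2.6,  2.5,  3.7,  3.1,  2.2],
    "variance": [0.9,  0.8,  0.8,  0.7,  0.9,  0.8],
})
print(subgroup_outfit(ratings))
```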

11.
Traditional item analyses such as classical test theory (CTT) use exam-taker responses to assessment items to approximate their difficulty and discrimination. The increased adoption by educational institutions of electronic assessment platforms (EAPs) provides new avenues for assessment analytics by capturing detailed logs of an exam-taker's journey through their exam. This paper explores how logs created by EAPs can be employed alongside exam-taker responses and CTT to gain deeper insights into exam items. In particular, we propose an approach for deriving features from exam logs for approximating item difficulty and discrimination based on exam-taker behaviour during an exam. Items for which difficulty and discrimination differ significantly between CTT analysis and our approach are flagged through outlier detection for independent academic review. We demonstrate our approach by analysing de-identified exam logs and responses to assessment items of 463 medical students enrolled in a first-year biomedical sciences course. The analysis shows that the number of times an exam-taker visits an item before selecting a final response is a strong indicator of an item's difficulty and discrimination. Scrutiny by the course instructor of the seven items identified as outliers suggests our log-based analysis can provide insights beyond what is captured by traditional item analyses.

Practitioner notes

What is already known about this topic
  • Traditional item analysis is based on exam-taker responses to the items using mathematical and statistical models from classical test theory (CTT). The difficulty and discrimination indices thus calculated can be used to determine the effectiveness of each item and consequently the reliability of the entire exam.
What this paper adds
  • Data extracted from exam logs can be used to identify exam-taker behaviours which complement classical test theory in approximating the difficulty and discrimination of an item and identifying items that may require instructor review.
Implications for practice and/or policy
  • Identifying the behaviours of successful exam-takers may allow us to develop effective exam-taking strategies and personal recommendations for students.
  • Analysing exam logs may also provide an additional tool for identifying struggling students and items in need of revision.
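As a loose sketch of the kind of analysis described above, the code below computes classical difficulty (p-values) and item-rest discrimination from final responses, derives one log-based feature (mean visits per item before the final response), and flags items where the two sources of evidence disagree. The data shapes, feature, and flagging rule are illustrative assumptions, not a reconstruction of the paper's method.

```python
import numpy as np

def ctt_with_log_feature(correct, visits, z_cut=2.0):
    """Classical item statistics plus one exam-log feature, with outlier flags.

    correct : examinee-by-item 0/1 matrix of final responses
    visits  : examinee-by-item counts of how many times each examinee
              visited each item before its final response (from EAP logs)
    The flagging rule standardizes item hardness (1 - p-value) and mean
    visits and marks items where the two disagree by more than z_cut
    standard units; it is an illustrative stand-in for the outlier
    detection used in the paper.
    """
    p_value = correct.mean(axis=0)                       # CTT difficulty
    total = correct.sum(axis=1)
    discrimination = np.array([                          # item-rest correlation
        np.corrcoef(correct[:, j], total - correct[:, j])[0, 1]
        for j in range(correct.shape[1])
    ])
    mean_visits = visits.mean(axis=0)                    # log-based feature

    def standardize(v):
        return (v - v.mean()) / v.std()

    disagreement = standardize(mean_visits) - standardize(1 - p_value)
    flagged = np.where(np.abs(disagreement) > z_cut)[0]
    return p_value, discrimination, mean_visits, flagged

# Hypothetical data shaped like the study (463 examinees, 40 items)
rng = np.random.default_rng(0)
correct = rng.binomial(1, 0.7, size=(463, 40))
visits = rng.poisson(2, size=(463, 40)) + 1
print(ctt_with_log_feature(correct, visits)[3])          # indices of flagged items
```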

12.
The nature of anatomy education has changed substantially in recent decades, though the traditional multiple‐choice written examination remains the cornerstone of assessing students' knowledge. This study sought to measure the quality of a clinical anatomy multiple‐choice final examination using item response theory (IRT) models. One hundred seventy‐six students took a multiple‐choice clinical anatomy examination. One‐ and two‐parameter IRT models (difficulty and discrimination parameters) were used to assess item quality. The two‐parameter IRT model demonstrated a wide range in item difficulty, with a median of −1.0 and a range from −2.0 to 0.0 (25th to 75th percentile). Similar results were seen for discrimination (median 0.6; range 0.4–0.8). The test information curve achieved maximum discrimination for an ability level one standard deviation below the average. There were 15 items with standardized loading less than 0.3, which was due to several factors: two items had two correct responses, one was not well constructed, two were too easy, and the others revealed a lack of detailed knowledge by students. The test used in this study was more effective in discriminating students of lower ability than those of higher ability. Overall, the quality of the examination in clinical anatomy was confirmed by the IRT models. Anat Sci Educ 3:17–24, 2010. © 2009 American Association of Anatomists.
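The test information curve mentioned in the abstract is the sum of the item information functions; for the 2PL model the item information is a²P(θ)(1 − P(θ)). The sketch below computes it on a grid with hypothetical parameters chosen to mimic the pattern reported (low discriminations, mostly easy items), so the peak falls below average ability.

```python
import numpy as np

def two_pl_information(theta, a, b):
    """Item and test information for a two-parameter logistic model.

    theta : grid of ability values
    a, b  : arrays of item discriminations and difficulties
    Returns an items-by-theta matrix of item information and its column
    sum, the test information curve. The parameters below are hypothetical.
    """
    p = 1 / (1 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))
    item_info = a[:, None] ** 2 * p * (1 - p)
    return item_info, item_info.sum(axis=0)

theta = np.linspace(-3, 3, 121)
a = np.array([0.5, 0.6, 0.7, 0.8])        # low discriminations, as in the study
b = np.array([-2.0, -1.5, -1.0, 0.0])     # mostly easy items
_, test_info = two_pl_information(theta, a, b)
print(theta[np.argmax(test_info)])         # information peaks below theta = 0
```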

13.
The presence of nuisance dimensionality is a potential threat to the accuracy of results for tests calibrated using a measurement model such as a factor analytic model or an item response theory model. This article describes a mixture group bifactor model to account for the nuisance dimensionality due to a testlet structure as well as the dimensionality due to differences in patterns of responses. The model can be used for testing whether or not an item functions differently across latent groups in addition to investigating the differential effect of local dependency among items within a testlet. An example is presented comparing test speededness results from a conventional factor mixture model, which ignores the testlet structure, with results from the mixture group bifactor model. Results suggested the 2 models treated the data somewhat differently. Analysis of the item response patterns indicated that the 2-class mixture bifactor model tended to categorize omissions as indicating speededness. With the mixture group bifactor model, more local dependency was present in the speeded than in the nonspeeded class. Evidence from a simulation study indicated the Bayesian estimation method used in this study for the mixture group bifactor model can successfully recover generated model parameters for 1- to 3-group models for tests containing testlets.

14.
This paper presents the item and test information functions of the Rank two-parameter logistic models (Rank-2PLM) for items with two (pair) and three (triplet) statements in forced-choice questionnaires. The Rank-2PLM model for pairs is the MUPP-2PLM (Multi-Unidimensional Pairwise Preference) and, for triplets, is the Triplet-2PLM. Fisher's information and directional information are described, and the test information for Maximum Likelihood (ML), Maximum A Posteriori (MAP), and Expected A Posteriori (EAP) trait score estimates is distinguished. Expected item/test information indexes at various levels are proposed and plotted to provide diagnostic information on items and tests. The expected test information indexes for EAP scores may be difficult to compute due to a typical test's vast number of item response patterns. The relationships of item/test information with discrimination parameters of statements, standard error, and reliability estimates of trait score estimates are discussed and demonstrated using real data. Practical suggestions for checking the various expected item/test information indexes and plots are provided.

15.
A cognitive item response theory model called the attribute hierarchy method (AHM) is introduced and illustrated. This method represents a variation of Tatsuoka's rule-space approach. The AHM is designed explicitly to link cognitive theory and psychometric practice to facilitate the development and analyses of educational and psychological tests. The following are described: cognitive properties of the AHM; psychometric properties of the AHM, as well as a demonstration of how the AHM differs from Tatsuoka's rule-space approach; and application of the AHM to the domain of syllogistic reasoning to illustrate how this approach can be used to evaluate the cognitive competencies required in a higher-level thinking task. Future directions for research are also outlined.

16.
We report some findings of the Longitudinal Proof Project, which investigated patterns in high-attaining students' mathematical reasoning in algebra and in geometry and development in their reasoning, by analyses of students' responses to three annual proof tests. The paper focuses on students' responses to one non-standard geometry item. It reports how the distribution of responses to this item changed over time with some moderate progress that suggests a cognitive shift from perceptual to geometrical reasoning. However, we also note that many students made little or no progress and some regressed. Extracts from student interviews indicate that the source of this variation from the overall trend stems from the shift over the three years of the study to a more formal approach in the school geometry curriculum for high-attaining students, and the effects of this shift on what students interpreted as the didactical demands of the item.

17.
The sample invariance of item discrimination statistics is evaluated in this case study using real data. The hypothesized superiority of the item response model (IRM) is tested against structural equation modeling (SEM) for responses to the Center for Epidemiologic Studies-Depression (CES-D) scale. Responses from 10 random samples of 500 people were drawn from a base sample of 6,621 participants across gender, age, and different health groups. Hierarchical tests of multiple-group structural equation models indicated statistically significant differences exist in item regressions across contrast groups. Although the IRM item discrimination estimates were most stable in all conditions of this case study, additional research on the precision of individual scores and possible item bias is required to support the validity of either model for scoring the CES-D. The SEM approach to examining between-group differences holds promise for any field where heterogeneous populations are assessed and important consequences arise from score interpretations.

18.
《教育实用测度》2013,26(1):47-64
Optimal appropriateness measurement statistically provides the most powerful methods for identifying individuals who are mismeasured by a standardized psychological test or scale. These methods use a likelihood ratio test to compare the hypothesis of normal responding versus the alternative hypothesis that an individual's responses are aberrant in some specified way. According to the Neyman-Pearson Lemma, no other statistic computed from an individual's item responses can achieve a higher rate of detection of the hypothesized measurement anomaly at the same false positive rate. Use of optimal methods requires a psychometric model for normal responding, which can be readily obtained from the item response theory literature, and a model for aberrant responding. In this article, several concerns about measurement anomalies are described and transformed into quantitative models. We then show how to compute the likelihood of a response pattern u* for each of the aberrance models.
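As a hypothetical illustration of the likelihood ratio logic (not the article's specific aberrance models), the sketch below evaluates the log-likelihood of a response pattern under a 2PL model of normal responding and under a simple aberrance model in which the last k items are answered at random, then forms the log-likelihood ratio.

```python
import numpy as np

def loglik_2pl(x, theta, a, b):
    """Log-likelihood of a 0/1 response pattern under a 2PL model."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def loglik_aberrant(x, theta, a, b, k):
    """One simple aberrance hypothesis: normal responding on the first
    items, random (p = .5) responding on the last k items, e.g., due to
    speededness. The article's models are more general."""
    normal_part = loglik_2pl(x[:-k], theta, a[:-k], b[:-k])
    random_part = -k * np.log(2.0)      # each of the last k responses has p = .5
    return normal_part + random_part

# Hypothetical 10-item pattern with weak performance at the end
x = np.array([1, 1, 1, 1, 0, 1, 0, 0, 1, 0])
a = np.full(10, 1.0)
b = np.linspace(-1.5, 1.5, 10)
theta = 0.5
lr = loglik_aberrant(x, theta, a, b, k=4) - loglik_2pl(x, theta, a, b)
print(lr)   # large positive values favour the aberrance hypothesis
```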

19.
The attribute hierarchy method (AHM) is a psychometric procedure for classifying examinees' test item responses into a set of structured attribute patterns associated with different components from a cognitive model of task performance. Results from an AHM analysis yield information on examinees' cognitive strengths and weaknesses. Hence, the AHM can be used for cognitive diagnostic assessment. The purpose of this study is to introduce and evaluate a new concept for assessing attribute reliability using the ratio of true score variance to observed score variance on items that probe specific cognitive attributes. This reliability procedure is evaluated and illustrated using both simulated data and student response data from a sample of algebra items taken from the March 2005 administration of the SAT. The reliability of diagnostic scores and the implications for practice are also discussed.
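A minimal sketch of the quoted definition, assuming the model-based (true) attribute subscores and the observed subscores on the items probing one attribute are already available: the attribute reliability is then simply the ratio of their variances. The simulated inputs below are purely illustrative.

```python
import numpy as np

def attribute_reliability(true_subscores, observed_subscores):
    """Attribute reliability as the ratio of true score variance to
    observed score variance on the items probing one attribute, following
    the definition quoted in the abstract. Both inputs are assumed to be
    available from a prior model-based analysis."""
    return np.var(true_subscores) / np.var(observed_subscores)

# Hypothetical example: observed subscores equal true subscores plus noise
rng = np.random.default_rng(7)
true_sub = rng.binomial(5, 0.6, size=1000).astype(float)
observed = true_sub + rng.normal(0, 0.8, size=1000)
print(attribute_reliability(true_sub, observed))   # below 1; noise lowers reliability
```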

20.
In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
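The delta-method standard errors and kernel equating machinery are beyond a short example, but the first half of the pipeline described here can be sketched: derive score probability vectors from a polytomous IRT model (a generalized partial credit model below) by convolving item category distributions over an ability quadrature, then match percentiles between the two score distributions. The item parameters, quadrature, and mid-point continuization are simplified assumptions.

```python
import numpy as np

def gpcm_category_probs(theta, a, b_steps):
    """Category probabilities for one generalized partial credit item.
    b_steps are the step parameters (length = number of categories - 1)."""
    z = np.concatenate(([0.0], np.cumsum(a * (theta - b_steps))))
    ez = np.exp(z - z.max())
    return ez / ez.sum()

def score_probabilities(items, quad_points, quad_weights):
    """Marginal observed-score distribution for a polytomous IRT test.

    items is a list of (a, b_steps) tuples. For each quadrature point the
    item score distributions are convolved, then averaged over the ability
    distribution; this is a plain convolution version of the usual
    recursive computation."""
    max_score = sum(len(b) for _, b in items)
    marginal = np.zeros(max_score + 1)
    for theta, w in zip(quad_points, quad_weights):
        dist = np.array([1.0])
        for a, b_steps in items:
            dist = np.convolve(dist, gpcm_category_probs(theta, a, b_steps))
        marginal += w * dist
    return marginal / marginal.sum()

def equipercentile(p_x, p_y):
    """Map each score on test X to the Y score with the same percentile
    rank, using a simple mid-point continuization and linear interpolation."""
    pr_x = np.cumsum(p_x) - 0.5 * p_x
    pr_y = np.cumsum(p_y) - 0.5 * p_y
    return np.interp(pr_x, pr_y, np.arange(len(p_y)))

# Hypothetical: two short tests, each with two 3-category items
quad = np.linspace(-4, 4, 41)
wts = np.exp(-0.5 * quad ** 2); wts /= wts.sum()
test_x = [(1.0, np.array([-0.5, 0.5])), (0.8, np.array([0.0, 1.0]))]
test_y = [(1.2, np.array([-1.0, 0.0])), (0.9, np.array([-0.2, 0.8]))]
print(equipercentile(score_probabilities(test_x, quad, wts),
                     score_probabilities(test_y, quad, wts)))
```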
