The early detection of item drift is an important issue for frequently administered testing programs because items are reused over time. Unfortunately, operational data tend to be very sparse and do not lend themselves to frequent monitoring analyses, particularly for on‐demand testing. Building on existing residual analyses, the authors propose an item index that requires only moderate‐to‐small sample sizes to form data for time‐series analysis. Asymptotic results are presented to facilitate statistical significance tests. The authors show that the proposed index combined with time‐series techniques may be useful in detecting and predicting item drift. Most important, this index is related to a well‐known differential item functioning analysis so that a meaningful effect size can be proposed for item drift detection.

There is a large body of research on the effectiveness of rater training methods in the industrial and organizational psychology literature. Less has been reported in the measurement literature on large‐scale writing assessments. This study compared the effectiveness of two widely used rater training methods (self‐paced and collaborative frame‐of‐reference training) in the context of a large‐scale writing assessment. Sixty‐six raters were randomly assigned to the training methods. After training, all raters scored the same 50 representative essays prescored by a group of expert raters. A series of generalized linear mixed models were then fitted to the rating data. Results suggested that the self‐paced method was equivalent in effectiveness to the more time‐intensive and expensive collaborative method. Implications for large‐scale writing assessments and suggestions for further research are discussed.

This study examined rater effects on essay scoring in an operational monitoring system from England's 2008 national curriculum English writing test for 14‐year‐olds. We fitted two multilevel models and analyzed: (1) drift in rater severity effects over time; (2) rater central tendency effects; and (3) differences in rater severity and central tendency effects by raters’ previous rating experience. We found no significant evidence of rater drift and, while raters with less experience appeared more severe than raters with more experience, this result also was not significant. However, we did find that there was a central tendency to raters’ scoring. We also found that rater severity was significantly unstable over time. We discuss the theoretical and practical questions that our findings raise.

The purpose of this study was to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to ratings of 16 raters on writing performances of 8,285 elementary school students. Each performance was rated by two trained raters over a period of seven rating days. Performances rated on the first day were re-rated at the end of the rating period. Statistically significant differences between raters were found within each day and in all days combined. Daily estimates of the relative severity of individual raters were found to differ significantly from single, on-average estimates for the whole rating period. For 10 raters, severity estimates on the last day were significantly different from estimates on the first day. These findings cast doubt on the practice of using a single calibration of rater severity as the basis for adjustment of person measures.

This study pioneers a Rasch scoring approach and compares it to a conventional summative approach for measuring longitudinal gains in student learning. In this methodological note, our proposed methodology is demonstrated using an example of rating scales in a student survey as part of a higher education outcome assessment. Such assessments have become increasingly important worldwide for purposes of institutional accreditation and accountability to stakeholders. Data were collected from a longitudinal study by tracking self-reported learning outcomes of individual students in the same cohort who completed the student learning experience questionnaire (SLEQ) in their first and final years. The Rasch model was employed for item calibration and latent trait estimation, together with a scaling procedure of concurrent calibration incorporating a randomly equivalent group design and a single group design to measure the gains in self-reported learning outcomes as yielded by repeated measures. The sensitivity to change of Rasch scoring relative to the conventional summative scoring method was quantified by a statistical index, namely relative performance (RP). Findings indicated that Rasch scoring captured gains in learning outcomes better than the conventional summative scoring method, with RP values ranging from 3 to 17% in the cognitive, social, and value domains of the SLEQ. The Rasch scoring approach and the scaling procedure presented in the study can be readily generalised to studies using rating scales to measure change in student learning in the higher education context. The methodological innovations and contributions of this study are discussed.

The present article reports results of a real‐world effectiveness trial conducted in Denmark with six thousand four hundred eighty‐three 3‐ to 6‐year‐olds designed to improve children's language and preliteracy skills. Children in 144 child cares were assigned to a control condition or one of three planned variations of a 20‐week storybook‐based intervention: a base intervention and two enhanced versions featuring extended professional development for educators or a home‐based program for parents. Pre‐ to posttest comparisons revealed a significant impact of all three interventions for preliteracy skills (d = .21–.27) but not language skills (d = .04–.16), with little differentiation among the three variations. Fidelity, indexed by number of lessons delivered, was a significant predictor of most outcomes. Implications for real‐world research and practice are considered.

Since the advent of computers, scientists who study how people learn have been utilizing technology to uncover the cognitive and neural mechanisms of learning. Recent technological advances have allowed learning scientists to move their research out of the lab and into the wild, to investigate how students learn in real‐world environments. However, the move from the lab to the classroom involves a significant shift in strategy, requiring consideration of factors varying from the design of mobile (vs. lab‐based) technology to the recruitment of participants, as well as the contextual variables to account for in the less‐controlled environment of schools. Here I discuss the learnings our group has gleaned from a research program involving over a thousand elementary and middle school students in a longitudinal, multi‐year design that involves technologies for assessment and improving learning in schools.

This study investigates the effect of several design and administration choices on item exposure and person/item parameter recovery under a multistage test (MST) design. In a simulation study, we examine whether number‐correct (NC) or item response theory (IRT) methods are differentially effective at routing students to the correct next stage(s) and whether routing choices (optimal versus suboptimal routing) have an impact on achievement precision. Additionally, we examine the impact of testlet length on both person and item recovery. Overall, our results suggest that no single approach works best across the studied conditions. With respect to the mean person parameter recovery, IRT scoring (via either Fisher information or preliminary EAP estimates) outperformed classical NC methods, although differences in bias and root mean squared error were generally small. Item exposure rates were found to be more evenly distributed when suboptimal routing methods were used, and item recovery (both difficulty and discrimination) was most precisely observed for items with moderate difficulties. Based on the results of the simulation study, we draw conclusions and discuss implications for practice in the context of international large‐scale assessments that recently introduced adaptive assessment in the form of MST. Future research directions are also discussed.

Calibration and equating are a quintessential necessity for most large‐scale educational assessments. However, there are instances when no consideration is given to the equating process in terms of context and substantive realization, and the methods used in its execution. In the view of the authors, equating is not merely an exhibit of the statistical methodology, but it is also a reflection of the thought process undertaken in its execution. For example, there is hardly any discussion in the literature of the ideological differences in the selection of an equating method. Furthermore, there is little evidence of modeling cohort growth through an identification and use of construct‐relevant linking items’ drift, using the common item nonequivalent group equating design. In this article, the authors philosophically justify the use of Huynh's statistical method for the identification of construct‐relevant outliers in the linking pool. The article also dispels the perception of scale instability associated with the inclusion of construct‐relevant outliers in the linking item pool and concludes that an appreciation of the rationale used in the selection of the equating method, together with the use of linking items in modeling cohort growth, can be beneficial to practitioners.

This paper attempts to take seriously the claim that we can look for causes in order to understand the reality we live (in), and focuses therefore primarily on ‘the natural world’. It will be argued that even if we were to fully endorse the programme of looking for antecedents, a dominant driver for many educational researchers, this would still not solve the problems they commonly set out to address. It will illustrate the problem of contextualisation using an example of educational research that uses the methodology of the randomised field trial. In these kinds of studies the paradigm of causality and its experimental laboratory approach is modified to incorporate the exigencies of real life situations. The claim that these studies too do not put one in a position to derive straightforward conclusions for policy makers or more generally for educational practitioners will be substantiated. Finally, some concluding remarks will be offered that indicate what may be expected from large‐scale population studies and what their epistemological basis is.

Trend estimation in international comparative large‐scale assessments relies on measurement invariance between countries. However, cross‐national differential item functioning (DIF) has been repeatedly documented. We ran a simulation study using national item parameters, which required trends to be computed separately for each country, to compare trend estimation performances to two linking methods employing international item parameters across several conditions. The trend estimates based on the national item parameters were more accurate than the trend estimates based on the international item parameters when cross‐national DIF was present. Moreover, the use of fixed common item parameter calibrations led to biased trend estimates. The detection and elimination of DIF can reduce this bias but is also likely to increase the total error.

An effect size of about .70 (or .40–.70) is often claimed for the efficacy of formative assessment, but is not supported by the existing research base. More than 300 studies that appeared to address the efficacy of formative assessment in grades K‐12 were reviewed. Many of the studies had severely flawed research designs yielding uninterpretable results. Only 13 of the studies provided sufficient information to calculate relevant effect sizes. A total of 42 independent effect sizes were available. The median observed effect size was .25. Using a random effects model, a weighted mean effect size of .20 was calculated. Moderator analyses suggested that formative assessment might be more effective in English language arts (ELA) than in mathematics or science, with estimated effect sizes of .32, .17, and .09, respectively. Two types of implementation of formative assessment, one based on professional development and the other on the use of computer‐based formative systems, appeared to be more effective than other approaches, yielding mean effect sizes of .30 and .28, respectively. Given the wide use and potential efficacy of good formative assessment practices, the paucity of the current research base is problematic. A call for more high‐quality studies is issued.

This study is the first to employ panel data to examine well‐being outcomes (self‐rated health, happiness, life satisfaction, and school enjoyment) of children in transnational families in an African context. It analyzes data collected in 2013, 2014, and 2015 from secondary schoolchildren and youth (ages 12–21) in Ghana (N = 741). Results indicate that children with fathers, mothers, or both parents away and those cared for by a parent, a family, or a nonfamily member are equally or more likely to have higher levels of well‐being as children in nonmigrant families. Yet, there are certain risk factors (being female, living in a family affected by divorce or by a change in caregiver while parents migrate) that may decrease child well‐being.

Large‐scale assessments such as the Programme for International Student Assessment (PISA) have field trials where new survey features are tested for utility in the main survey. Because of resource constraints, there is a trade‐off between how much of the sample can be used to test new survey features and how much can be used for the initial item response theory (IRT) scaling. Utilizing real assessment data of the PISA 2015 Science assessment, this article demonstrates that using fixed item parameter calibration (FIPC) in the field trial yields stable item parameter estimates in the initial IRT scaling for samples as small as n = 250 per country. Moreover, the results indicate that for the recovery of the country‐specific latent trait distributions, the estimates of the trend items (i.e., the information introduced into the calibration) are crucial. Thus, concerning the country‐level sample size of n = 1,950 currently used in the PISA field trial, FIPC is useful for increasing the number of survey features that can be examined during the field trial without the need to increase the total sample size. This enables international large‐scale assessments such as PISA to keep up with state‐of‐the‐art developments regarding assessment frameworks, psychometric models, and delivery platform capabilities.

Competence data from low‐stakes educational large‐scale assessment studies allow for evaluating relationships between competencies and other variables. The impact of item‐level nonresponse has not been investigated with regard to statistics that determine the size of these relationships (e.g., correlations, regression coefficients). Classical approaches such as ignoring missing values or treating them as incorrect are currently applied in many large‐scale studies, while recent model‐based approaches that can account for nonignorable nonresponse have been developed. Estimates of item and person parameters have been demonstrated to be biased for classical approaches when missing data are missing not at random (MNAR). In our study, we focus on parameter estimates of the structural model (i.e., the true regression coefficient when regressing competence on an explanatory variable), simulating data according to various missing data mechanisms. We found that model‐based approaches and ignoring missing values performed well in retrieving regression coefficients even when we induced missing data that were MNAR. Treating missing values as incorrect responses can lead to substantial bias. We demonstrate the validity of our approach empirically and discuss the relevance of our results.

Extensive research has examined the validity and fairness of standardized tests in academic admissions. However, due to their underrepresentation in higher education, American Indians have gained much less attention in this research. In the present study, we examined for American Indian students (1) group differences on SAT scores, (2) the predictive and incremental validity of SAT over high school grades, (3) the effect of socioeconomic status on SAT validity, (4) differential prediction in the use of SAT scores, and (5) potential omitted variables that could explain differential prediction for American Indian students. Results provided evidence of predictive and incremental validity of SAT scores, and the validity of SAT scores was largely independent of socioeconomic status. Overprediction was found when using SAT scores to predict college performance and it was reduced when including high school grades as an additional predictor. This study provides substantial evidence of the validity and fairness of SAT scores for American Indians.

Two of the major questions prompted by recent research on inhibition are: (1) Should inhibition be considered a trait dimension, or do those who manifest extreme inhibition constitute a discrete personality type? (2) Are there sex differences in stability of inhibition? We addressed these questions using mothers' ratings over 16 years and psychologists' ratings over 6 years of a Swedish longitudinal sample. From the mean of mothers' 18- and 24-month ratings and the mean of psychologists' 18- and 24-month ratings, we predicted later ratings through 16 years. We performed these analyses for children constituting the extreme 10%–15% from each end of the distribution and then for children not rated as extreme. Ratings were more stable for children in the extreme groups than for those in the nonextreme groups through 6 years; however, only for the inhibited girls did early inhibition predict inhibition into adolescence. We conclude that culturally shared notions of gender-appropriate behavior influence the stability of inhibition.

This is among the first longitudinal studies to report student attitudes across 4 yr of a university program. We found that the attitudes of students in biology become significantly more expert-like from the first year to the fourth year of the program, that is, there was a significant positive shift in students’ overall percent favorable scores from 64.5 to 72%, as opposed to the expert response, which averaged 90%. There was a significant positive shift for the real world connection category (78–85%), the enjoyment (personal interest) category (74–82%), and the conceptual connections/memorization category (66–74%). Moreover, there was a significant correlation between students’ overall percent favorable scores and performance (cumulative grade point average) at the end, but not at the beginning, of the fourth year, with high-performing students having significantly more expert-like attitudes than low-performing students. The correlation between percent favorable score and performance was the strongest for the problem solving: synthesis and application category, in which the highest-performing students finished their fourth year with 90% favorable compared with 35% favorable for the lowest-performing students. A comparison of these results with previously reported results and their implications for teaching are discussed.

