Similar Articles
20 similar articles found.
1.
Technical difficulties occasionally lead to missing item scores and hence to incomplete data on computerized tests. It is not straightforward to report scores to the examinees whose data are incomplete due to technical difficulties. Such reporting essentially involves imputation of missing scores. In this paper, a simulation study based on data from three educational tests is used to compare the performances of six approaches for imputation of missing scores. One of the approaches, based on data mining, is the first application of its kind to the problem of imputation of missing data. The approach based on data mining and a multiple imputation approach based on chained equations led to the most accurate imputation of missing scores, and hence to the most accurate score reporting. A simple approach based on linear regression performed the next best overall. Several recommendations are made regarding the reporting of scores to examinees with incomplete data.
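The six approaches themselves are not reproduced here, but two of the simpler ideas the abstract touches on — person-mean imputation and linear-regression imputation — can be sketched on toy data. All data and function names below are illustrative assumptions, not drawn from the three educational tests in the study:

```python
# Illustrative sketch: two simple ways to impute a missing item score
# (None marks the missing cell). Toy data only.

def person_mean_impute(responses):
    """Replace None with the mean of the examinee's observed item scores."""
    observed = [r for r in responses if r is not None]
    mean = sum(observed) / len(observed)
    return [mean if r is None else r for r in responses]

def regression_impute(complete_rows, target_col, partial_row):
    """Predict the missing item from the rest-score (sum of the other items),
    using a least-squares line fit on the complete cases."""
    xs = [sum(r) - r[target_col] for r in complete_rows]
    ys = [r[target_col] for r in complete_rows]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    rest = sum(v for i, v in enumerate(partial_row)
               if i != target_col and v is not None)
    return intercept + slope * rest

complete = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 1]]
partial = [1, None, 1]          # item 1 is missing
print(person_mean_impute(partial))
print(round(regression_impute(complete, 1, partial), 3))
```

A full chained-equations approach iterates regressions of this kind over all incomplete variables; this sketch shows only a single pass.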

2.
The early detection of item drift is an important issue for frequently administered testing programs because items are reused over time. Unfortunately, operational data tend to be very sparse and do not lend themselves to frequent monitoring analyses, particularly for on‐demand testing. Building on existing residual analyses, the authors propose an item index that requires only moderate‐to‐small sample sizes to form data for time‐series analysis. Asymptotic results are presented to facilitate statistical significance tests. The authors show that the proposed index combined with time‐series techniques may be useful in detecting and predicting item drift. Most important, this index is related to a well‐known differential item functioning analysis so that a meaningful effect size can be proposed for item drift detection.

3.
Large‐scale assessments such as the Programme for International Student Assessment (PISA) have field trials where new survey features are tested for utility in the main survey. Because of resource constraints, there is a trade‐off between how much of the sample can be used to test new survey features and how much can be used for the initial item response theory (IRT) scaling. Utilizing real assessment data of the PISA 2015 Science assessment, this article demonstrates that using fixed item parameter calibration (FIPC) in the field trial yields stable item parameter estimates in the initial IRT scaling for samples as small as n = 250 per country. Moreover, the results indicate that for the recovery of the country‐specific latent trait distributions, the estimates of the trend items (i.e., the information introduced into the calibration) are crucial. Thus, concerning the country‐level sample size of n = 1,950 currently used in the PISA field trial, FIPC is useful for increasing the number of survey features that can be examined during the field trial without the need to increase the total sample size. This enables international large‐scale assessments such as PISA to keep up with state‐of‐the‐art developments regarding assessment frameworks, psychometric models, and delivery platform capabilities.

4.
Calibration and equating is the quintessential necessity for most large‐scale educational assessments. However, there are instances when no consideration is given to the equating process in terms of context and substantive realization, and the methods used in its execution. In the view of the authors, equating is not merely an exercise in statistical methodology, but also a reflection of the thought process undertaken in its execution. For example, there is hardly any discussion in the literature of the ideological differences in the selection of an equating method. Furthermore, there is little evidence of modeling cohort growth through an identification and use of construct‐relevant linking items’ drift, using the common item nonequivalent group equating design. In this article, the authors philosophically justify the use of Huynh's statistical method for the identification of construct‐relevant outliers in the linking pool. The article also dispels the perception of scale instability associated with the inclusion of construct‐relevant outliers in the linking item pool and concludes that an appreciation of the rationale used in the selection of the equating method, together with the use of linking items in modeling cohort growth, can be beneficial to practitioners.

5.
This study investigates the effect of several design and administration choices on item exposure and person/item parameter recovery under a multistage test (MST) design. In a simulation study, we examine whether number‐correct (NC) or item response theory (IRT) methods are differentially effective at routing students to the correct next stage(s) and whether routing choices (optimal versus suboptimal routing) have an impact on achievement precision. Additionally, we examine the impact of testlet length on both person and item recovery. Overall, our results suggest that no single approach works best across the studied conditions. With respect to the mean person parameter recovery, IRT scoring (via either Fisher information or preliminary EAP estimates) outperformed classical NC methods, although differences in bias and root mean squared error were generally small. Item exposure rates were found to be more evenly distributed when suboptimal routing methods were used, and item recovery (both difficulty and discrimination) was most precisely observed for items with moderate difficulties. Based on the results of the simulation study, we draw conclusions and discuss implications for practice in the context of international large‐scale assessments that recently introduced adaptive assessment in the form of MST. Future research directions are also discussed.
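As a rough illustration of the number-correct routing the abstract describes, consider a two-stage sketch in which the raw score on the routing stage determines the second-stage module. The cut scores and module labels below are arbitrary assumptions, not values from the simulation:

```python
# Minimal sketch of number-correct (NC) routing in a two-stage MST.
# Cut-offs are illustrative only.

def route_by_number_correct(responses, cuts=(3, 6)):
    """Return the next-stage module for a routing-stage response vector
    (1 = correct, 0 = incorrect)."""
    score = sum(responses)          # number-correct score
    if score < cuts[0]:
        return "easy"
    if score < cuts[1]:
        return "medium"
    return "hard"

print(route_by_number_correct([1, 0, 0, 1, 0, 0, 0, 0]))  # score 2
print(route_by_number_correct([1, 1, 1, 1, 1, 0, 1, 1]))  # score 7
```

IRT-based routing would instead compare a provisional ability estimate (e.g., EAP) against cut-points on the latent scale; the decision structure is the same, only the routing statistic differs.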

6.
Trend estimation in international comparative large‐scale assessments relies on measurement invariance between countries. However, cross‐national differential item functioning (DIF) has been repeatedly documented. We ran a simulation study using national item parameters, which required trends to be computed separately for each country, to compare trend estimation performances to two linking methods employing international item parameters across several conditions. The trend estimates based on the national item parameters were more accurate than the trend estimates based on the international item parameters when cross‐national DIF was present. Moreover, the use of fixed common item parameter calibrations led to biased trend estimates. The detection and elimination of DIF can reduce this bias but is also likely to increase the total error.

7.
The purpose of this study is to investigate the effects of missing data techniques in longitudinal studies under diverse conditions. A Monte Carlo simulation examined the performance of 3 missing data methods in latent growth modeling: listwise deletion (LD), maximum likelihood estimation using the expectation and maximization algorithm with a nonnormality correction (robust ML), and the pairwise asymptotically distribution-free method (pairwise ADF). The effects of 3 independent variables (sample size, missing data mechanism, and distribution shape) were investigated on convergence rate, parameter and standard error estimation, and model fit. The results favored robust ML over LD and pairwise ADF in almost all respects. The exceptions included convergence rates under the most severe nonnormality in the missing not at random (MNAR) condition and recovery of standard error estimates across sample sizes. The results also indicate that nonnormality, small sample size, MNAR, and multicollinearity might adversely affect convergence rate and the validity of statistical inferences concerning parameter estimates and model fit statistics.
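Of the three methods compared, listwise deletion is the simplest to illustrate: any case with a missing value is dropped entirely before analysis. A minimal sketch on made-up data (robust ML and pairwise ADF require estimation machinery not reproduced here):

```python
# Toy illustration of listwise deletion (LD). None marks a missing value;
# a single missing value costs the whole case. Data are invented.

def listwise_delete(rows):
    """Keep only the fully observed cases."""
    return [r for r in rows if all(v is not None for v in r)]

data = [
    [3.1, 2.5, 4.0],
    [2.8, None, 3.7],   # dropped: one missing value loses the case
    [3.5, 2.9, None],   # dropped
    [2.2, 2.0, 3.1],
]
complete = listwise_delete(data)
print(len(complete))    # 2 of 4 cases survive
```

The information loss visible even in this tiny example is one reason the simulation favors likelihood-based methods, which use all observed values.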

8.
Item stem formats can alter the cognitive complexity as well as the type of abilities required for solving mathematics items. Consequently, it is possible that item stem formats can affect the dimensional structure of mathematics assessments. This empirical study investigated the relationship between item stem format and the dimensionality of mathematics assessments. A sample of 671 sixth-grade students was given two forms of a mathematics assessment in which mathematical expression (ME) items and word problems (WP) were used to measure the same content. The effects of mathematical language and reading abilities in responding to ME and WP items were explored using unidimensional and multidimensional item response theory models. The results showed that WP and ME items appear to differ with regard to the underlying abilities required to answer these items. Hence, the multidimensional model fit the response data better than the unidimensional model. For the accurate assessment of mathematics achievement, students’ reading and mathematical language abilities should also be considered when implementing mathematics assessments with ME and WP items.

9.
Although population modeling methods are well established, a paucity of literature appears to exist regarding the effect of missing background data on subpopulation achievement estimates. Using simulated data that follow typical large‐scale assessment designs with known parameters and a number of missing conditions, this paper examines the extent to which missing background data impact subpopulation achievement estimates. In particular, the paper compares achievement estimates under a model with fully observed background data to achievement estimates for a variety of missing background data conditions. The findings suggest that subpopulation differences are preserved under all analyzed conditions, while point estimates for subpopulation achievement values are influenced by missing at random conditions. Implications for cross‐population comparisons are discussed.

10.
11.
The present article reports results of a real‐world effectiveness trial conducted in Denmark with six thousand four hundred eighty‐three 3‐ to 6‐year‐olds designed to improve children's language and preliteracy skills. Children in 144 childcare centers were assigned to a control condition or one of three planned variations of a 20‐week storybook‐based intervention: a base intervention and two enhanced versions featuring extended professional development for educators or a home‐based program for parents. Pre‐ to posttest comparisons revealed a significant impact of all three interventions for preliteracy skills (d = .21–.27) but not language skills (d = .04–.16), with little differentiation among the three variations. Fidelity, indexed by number of lessons delivered, was a significant predictor of most outcomes. Implications for real‐world research and practice are considered.

12.
There is a large body of research on the effectiveness of rater training methods in the industrial and organizational psychology literature. Less has been reported in the measurement literature on large‐scale writing assessments. This study compared the effectiveness of two widely used rater training methods—self‐paced and collaborative frame‐of‐reference training—in the context of a large‐scale writing assessment. Sixty‐six raters were randomly assigned to the training methods. After training, all raters scored the same 50 representative essays prescored by a group of expert raters. A series of generalized linear mixed models were then fitted to the rating data. Results suggested that the self‐paced method was equivalent in effectiveness to the more time‐intensive and expensive collaborative method. Implications for large‐scale writing assessments and suggestions for further research are discussed.

13.
Although much attention has been given to rater effects in rater‐mediated assessment contexts, little research has examined the overall stability of leniency and severity effects over time. This study examined longitudinal scoring data collected during three consecutive administrations of a large‐scale, multi‐state summative assessment program. Multilevel models were used to assess the overall extent of rater leniency/severity during scoring and examine the extent to which leniency/severity effects were stable across the three administrations. Model results were then applied to scaled scores to estimate the impact of the stability of leniency/severity effects on students’ scores. Results showed relative scoring stability across administrations in mathematics. In English language arts, short constructed response items showed evidence of slightly increasing severity across administrations, while essays showed mixed results: evidence of both slightly increasing severity and moderately increasing leniency over time, depending on trait. However, when model results were applied to scaled scores, results revealed rater effects had minimal impact on students’ scores.

14.
15.
Performance assessments, scenario‐based tasks, and other groups of items carry a risk of violating the local item independence assumption made by unidimensional item response theory (IRT) models. Previous studies have identified negative impacts of ignoring such violations, most notably inflated reliability estimates. Still, the influence of this violation on examinee ability estimates has been comparatively neglected. It is known that such item dependencies cause low‐ability examinees to have their scores overestimated and high‐ability examinees' scores underestimated. However, the impact of these biases on examinee classification decisions has been little examined. In addition, because the influence of these dependencies varies along the underlying ability continuum, whether or not the location of the cut‐point is important in regard to correct classifications remains unanswered. This simulation study demonstrates that the strength of item dependencies and the location of an examination system's cut‐points both influence the accuracy (i.e., the sensitivity and specificity) of examinee classifications. Practical implications of these results are discussed in terms of false positive and false negative classifications of test takers.
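The accuracy measures named in the abstract, sensitivity and specificity, are computed from the counts of true/false positive and negative classifications. A minimal sketch with invented pass/fail decisions (1 = "pass"):

```python
# Classification accuracy of pass/fail decisions against the true status.
# The truth/decision vectors below are made up for illustration.

def sensitivity_specificity(truth, decided):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for t, d in zip(truth, decided) if t == 1 and d == 1)
    tn = sum(1 for t, d in zip(truth, decided) if t == 0 and d == 0)
    fn = sum(1 for t, d in zip(truth, decided) if t == 1 and d == 0)
    fp = sum(1 for t, d in zip(truth, decided) if t == 0 and d == 1)
    return tp / (tp + fn), tn / (tn + fp)

truth   = [1, 1, 1, 0, 0, 1, 0, 0]
decided = [1, 0, 1, 0, 1, 1, 0, 0]   # one false negative, one false positive
sens, spec = sensitivity_specificity(truth, decided)
print(round(sens, 2), round(spec, 2))
```

In the study's setting, the bias induced by ignored item dependencies pushes ability estimates toward the center of the scale, so which of these two rates suffers depends on where the cut-point sits.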

16.
Eighty-one adult participants varying in reading ability completed two choice reaction time (RT) tasks (one auditory and one visual) in conjunction with measures of phonological awareness, general cognitive ability, and word recognition ability. Replicating previous work, a significant correlation between RT and reading ability was obtained. However, several different methods of examining overlapping variance (hierarchical regression, path analysis, commonality analysis) indicated that the zero-order correlation between RT and word recognition ability was largely due to variance shared with phonological awareness and general cognitive ability. RT explained little variance in reading ability after phonological sensitivity had been partialed out and almost no unique variance after phonological sensitivity and general cognitive ability had been partialed out. In addition, the overlap in the variance of RT and phonological processing was almost entirely due to variance shared with intelligence.

17.
Implications of the multiple‐use of accountability assessments for the process of validation are examined. Multiple‐use refers to the simultaneous use of results from a single administration of an assessment for its intended use and for one or more additional uses. A theoretical discussion of the issues for validation which emerge from multiple‐use is provided focusing on the increased stakes that result from multiple‐use and the need to consider the interactions that may take place between multiple‐uses. To further explore this practice, an empirical study of the multiple‐use of the Education Quality and Accountability Office Grade 9 Assessment of Mathematics, a mandatory assessment administered in Ontario, Canada, is presented. Drawing on data gathered in an in‐depth case study, practices associated with two of the multiple‐uses of this assessment are considered and evidence of ways these two uses interact is presented. Given these interactions, the limitations of an argument‐based approach to validation for this instance of multiple‐use are demonstrated. Some ways that the process of validation might better address the practice of multiple‐use are suggested and areas for further investigation of this frequently occurring practice are discussed.

18.
Item response theory (IRT) was developed to overcome the limitations of classical test theory, and it offers clear advantages under its assumptions of unidimensionality, local independence, and monotonicity. Grounded in latent trait theory, IRT models items through the item characteristic curve assumption, yielding three fundamental models: the normal ogive model, the Rasch model, and the logistic model. Two areas in which IRT has made substantial practical progress are identified: computerized adaptive testing and cognitive diagnosis.
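Two of the basic models this abstract names can be written down directly. Below is a sketch of the item characteristic curves for the Rasch model and the three-parameter logistic (3PL) model, a common extension of the logistic family; the parameter values are arbitrary illustrations (theta is ability, b difficulty, a discrimination, c a guessing floor):

```python
import math

def rasch(theta, b):
    """Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def three_pl(theta, a, b, c):
    """3PL model: guessing floor c plus a scaled logistic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

print(round(rasch(0.0, 0.0), 2))                 # ability equals difficulty -> 0.5
print(round(three_pl(-10.0, 1.7, 0.0, 0.2), 2))  # very low ability -> near the guessing floor
```

The normal ogive model replaces the logistic function with the standard normal CDF; with the scaling constant 1.7 in the exponent, the two curves are nearly indistinguishable.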

19.
Numerous researchers have proposed methods for evaluating the quality of rater‐mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many‐facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater‐mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater‐mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.
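For contrast with the model-based approaches discussed in the abstract, one widely used member of the kappa family it mentions, Cohen's kappa, is easy to sketch: chance-corrected agreement between two raters. The ratings below are invented:

```python
# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
# Toy ratings from two hypothetical raters on an ordinal 1-3 scale.

def cohens_kappa(r1, r2):
    n = len(r1)
    observed = sum(1 for a, b in zip(r1, r2) if a == b) / n
    categories = set(r1) | set(r2)
    # Chance agreement from each rater's marginal category proportions.
    expected = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

rater1 = [1, 2, 2, 3, 1, 2, 3, 3]
rater2 = [1, 2, 3, 3, 1, 2, 3, 2]
print(round(cohens_kappa(rater1, rater2), 3))
```

As the abstract notes, an index of this kind is not tied to any measurement theory: it summarizes agreement but says nothing about whether the ratings are invariant across raters or examinees, which is what MSA is brought in to assess.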

20.
Data processing is an important part of the computer-aided design of mechanical parts. Three methods for handling tabular data in Visual Basic programming are proposed: building a parameter table inside the program, creating a data file outside the program, and handling the data in a form.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (京ICP备09084417号)