Similar literature
1.
This study deals with the statistical properties of a randomization test applied to an ABAB design in cases where the desirable random assignment of the points of change in phase is not possible. To obtain information about each possible data division, the authors carried out a conditional Monte Carlo simulation with 100,000 samples for each systematically chosen triplet. The authors studied robustness and power under several experimental conditions: different autocorrelation levels and different effect sizes as well as different phase lengths determined by the points of change. Type I error rates were distorted by the presence of autocorrelation for the majority of data divisions. The authors obtained satisfactory Type II error rates only for large treatment effects. The relation between the lengths of the four phases appeared to be an important factor for the robustness and power of the randomization test.

2.
The standard error of measurement (SEM) is the standard deviation of errors of measurement that are associated with test scores from a particular group of examinees. When used to calculate confidence bands around obtained test scores, it can be helpful in expressing the unreliability of individual test scores in an understandable way. Score bands can also be used to interpret intraindividual and interindividual score differences. Interpreters should be wary of over-interpretation when using approximations for correctly calculated score bands. It is recommended that SEMs at various score levels be used in calculating score bands rather than a single SEM value.
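As a minimal sketch of the score bands described in this abstract, the following assumes the usual classical test theory formula SEM = SD * sqrt(1 - reliability) and a single SEM for the whole scale (the abstract itself recommends score-level SEMs). The function names and the example values are hypothetical, chosen only for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed, sem, z=1.96):
    """Symmetric band around the observed score (z=1.96 gives roughly 95%)."""
    return observed - z * sem, observed + z * sem


# Hypothetical scale: SD = 15, reliability = 0.91, observed score = 110.
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)
low, high = score_band(observed=110.0, sem=sem)
print(f"SEM = {sem:.2f}, band = ({low:.2f}, {high:.2f})")
```

To follow the abstract's recommendation, `sem` could be looked up per score level instead of being computed once for the whole scale.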

3.
A potential concern for individuals interested in using item response theory (IRT) with achievement test data is that such tests have been specifically designed to measure content areas related to course curriculum and students taking the tests at different points in their coursework may not constitute samples from the same population. In this study, data were obtained from three administrations of two forms of a Biology achievement test. Data from the newer of the two forms were collected at a spring administration, made up of high school sophomores just completing the Biology course, and at a fall administration, made up mostly of seniors who completed their instruction in the course from 6–18 months prior to the test administration. Data from the older form, already on scale, were collected at only a fall administration, where the sample was comparable to the newer form fall sample. IRT and conventional item difficulty parameter estimates for the common items across the two forms were compared for each of the two form/sample combinations. In addition, conventional and IRT score equatings were performed between the new and old forms for each of the form/sample combinations. Widely disparate results were obtained between the equatings based on the two form/sample combinations. Conclusions are drawn about the use of both classical test theory and IRT in situations such as that studied, and implications of the results for achievement test validity are also discussed.

4.
Test reliability is a concept central to classical test theory and it is commonly stated as a requirement that a test attain a certain level of reliability before it be considered of sufficient quality for practical use. This article discusses the role of reliability in item response theory, and in particular the role of reliability in contexts where matrix sampling designs are used and concern is with the estimation of population parameters rather than the measurement of individuals. The concept of a measurement design effect is introduced. This concept parallels the concept of sampling design effects, in that it describes the impact of measurement error at the individual level (described through a reliability index) on the accuracy with which population parameters are estimated.

5.
Equatings were performed on both simulated and real data sets using the common-examinee design and two abilities for each examinee (i.e., two dimensions). Item and ability parameter estimates were found by using the Multidimensional Item Response Theory Estimation (MIRTE) program. The amount of equating error was evaluated by a comparison of the mean difference and the mean absolute difference between the true scores and ability estimates found on both tests for the common examinees used in the equating. The results indicated that effective equating, as measured by comparability of true scores, was possible with the techniques used in this study. When the stability of the ability estimates was examined, unsatisfactory results were found.

6.
Confidence intervals often are recommended as a means of communicating the extent to which individual test scores may be influenced by measurement error. However, test manuals and assessment texts vary widely in their recommendations about how confidence intervals should be constructed, and several contain misinterpretations of classical test theory. The most widely used procedure for constructing confidence intervals misrepresents the likely distribution of true scores, and confidence intervals constructed with it will be inaccurate, especially when extreme scores are involved. The various procedures for constructing confidence intervals that have been suggested in measurement texts are examined in relation to their approximation to the most accurate procedure that uses the estimated true score as the center of the confidence interval and the standard error of estimate to determine the width. In addition, the problems of applying these procedures to norm-referenced scores are discussed, an issue that has been largely ignored in the assessment literature and that leads to further misinterpretations of confidence intervals.
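The procedure this abstract calls most accurate can be sketched as follows, assuming the standard classical test theory expressions: estimated true score = M + r(X - M) and standard error of estimate = SD * sqrt(r(1 - r)). The abstract does not spell out these formulas, so they are an assumption here, as are the function name and the example values.

```python
import math


def true_score_interval(observed, mean, sd, reliability, z=1.96):
    """Interval centered on the estimated true score.

    Assumes the classical formulas:
      estimated true score       = mean + r * (observed - mean)
      standard error of estimate = sd * sqrt(r * (1 - r))
    """
    est_true = mean + reliability * (observed - mean)
    see = sd * math.sqrt(reliability * (1.0 - reliability))
    return est_true - z * see, est_true + z * see


# Hypothetical extreme score: X = 130 on a scale with M = 100, SD = 15, r = 0.90.
low, high = true_score_interval(observed=130.0, mean=100.0, sd=15.0, reliability=0.90)
print(f"({low:.2f}, {high:.2f})")
```

Because the center is regressed toward the mean, the interval for an extreme observed score is asymmetric around that score, which is the inaccuracy the abstract attributes to bands centered on the observed score.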

7.
Self-adapted testing has been described as a variation of computerized adaptive testing that reduces test anxiety and thereby enhances test performance. The purpose of this study was to gain a better understanding of these proposed effects of self-adapted tests (SATs); meta-analysis procedures were used to estimate differences between SATs and computerized adaptive tests (CATs) in proficiency estimates and post-test anxiety levels across studies in which these two types of tests have been compared. After controlling for measurement error, the results showed that SATs yielded proficiency estimates that were 0.12 standard deviation units higher and post-test anxiety levels that were 0.19 standard deviation units lower than those yielded by CATs. We speculate about possible reasons for these differences and discuss advantages and disadvantages of using SATs in operational settings.

8.
Although extensive research exists on the use of curriculum-based measures for progress monitoring, little is known about using computer adaptive tests (CATs) for progress-monitoring purposes. The purpose of this study was to evaluate the impact of the frequency of data collection on individual and group growth estimates using a CAT. Data were available for 278 fourth- and fifth-grade students. Growth estimates were obtained when five, three, and two data collections were available across 18 weeks. Data were analyzed by grade to evaluate any observed differences in growth. Further, root mean square error values were obtained to evaluate differences in individual student growth estimates across data collection schedules. Group-level estimates of growth did not differ across data collection schedules; however, growth estimates for individual students varied across the different schedules of data collection. Implications for using CATs to monitor student progress at the individual or group level are discussed.

9.
Research on cognitive load theory (CLT) has not yet provided facet-specific measures of cognitive load. The lack of valid methods to measure intrinsic, extraneous and germane cognitive load makes it difficult to empirically test theoretical explanations of effects caused by manipulations of instructional designs. This situation also poses challenges for testing CLT as a theory. This paper critically reflects on the conceptualisation of CLT's core concept and the implications for its operationalisation. In order to address some of the challenges we propose a complexity framework that allows the derivation of a priori estimates of mental load that go beyond CLT's notion of element interactivity. In a study we test hypotheses with regard to effects of the variation of sources for intrinsic cognitive load (increase of complexity within tasks) and the variation of sources for extraneous cognitive load (reduction of extraneous cognitive load between tasks) in three ability groups. Complexity-based estimates prove superior to element interactivity-based estimates of mental load in the prediction of performance outcomes. Results also indicate that individual differences in information-processing capacity determine to what extent complexity is reflected as cognitive load. In this respect the proposed framework extends the focus of CLT beyond the discussion of the role of prior knowledge and acquired levels of expertise.

10.
The outcomes of educational assessments undoubtedly have real implications for students, teachers, schools and education in the widest sense. Assessment results are, for example, used to award qualifications that determine future educational or vocational pathways of students. The results obtained by students in assessments are also used to gauge individual teacher quality, to hold schools to account for the standards achieved by their students, and to compare international education systems. Given the current high-stakes nature of educational assessment, it is imperative that the measurement practices involved have stable philosophical foundations. However, this article casts doubt on the theoretical underpinnings of contemporary educational measurement models. Aspects of Wittgenstein’s later philosophy and Bohr’s philosophy of quantum theory are used to argue that a quantum theoretical rather than a Newtonian model is appropriate for educational measurement, and the associated implications for the concept of validity are elucidated. Whilst it is acknowledged that the transition to a quantum theoretical framework would not lead to the demise of educational assessment, it is argued that, where practical, current high-stakes assessments should be reformed to become as ‘low-stakes’ as possible. This article also undermines some of the pro high-stakes testing rhetoric that has a tendency to afflict education.

11.
An improved method is derived for estimating conditional measurement error variances, that is, error variances specific to individual examinees or specific to each point on the raw score scale of the test. The method involves partitioning the test into short parallel parts, computing for each examinee the unbiased estimate of the variance of part-test scores, and multiplying this variance by a constant dictated by classical test theory. Empirical data are used to corroborate the principal theoretical deductions.
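The steps named in this abstract can be sketched as below. The abstract does not state the multiplying constant; this sketch assumes it is the number of parts k, which is what classical test theory gives for the total-score error variance when the k parts are parallel and their errors are independent. The function name and part scores are hypothetical.

```python
import statistics


def conditional_error_variance(part_scores):
    """Estimate one examinee's total-score error variance from parallel parts.

    Assumption (not stated in the abstract): the constant is k, the number
    of parallel parts, so the estimate is k * unbiased variance of the parts.
    """
    k = len(part_scores)
    if k < 2:
        raise ValueError("at least two part-test scores are required")
    # statistics.variance uses the n - 1 denominator (unbiased estimate).
    return k * statistics.variance(part_scores)


# Hypothetical examinee with scores on four short parallel parts.
print(conditional_error_variance([4, 5, 3, 4]))
```

With two parts the estimate reduces to the squared difference between the half-test scores, so `conditional_error_variance([3, 5])` equals (3 - 5)^2 = 4.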

12.
The purpose of this study is to develop and evaluate unidimensional models that can handle semiordered data within scale items (i.e., items with multiple ordered response categories, and one additional nominal response category). We apply the models to scale data with not applicable (NA) responses to compare the model performance to conditions in which NA responses are treated as missing and ignored. We also conduct a small simulation study based on the operational study to evaluate the parameter recovery of the models under the operational conditions. Findings indicate that the proposed models show promise for (a) reducing standard errors of trait estimates for persons who select NA responses, (b) reducing nonresponse bias in trait estimates for persons who select NA responses, and (c) providing substantive information to practitioners about the nature of the relationship between NA selection and the trait of measurement.

13.
It is well known that measurement error in observable variables induces bias in estimates in standard regression analysis and that structural equation models are a typical solution to this problem. Often, multiple indicator equations are subsumed as part of the structural equation model, allowing for consistent estimation of the relevant regression parameters. In many instances, however, embedding the measurement model into structural equation models is not possible because the model would not be identified. To correct for measurement error one has no other recourse than to provide the exact values of the variances of the measurement error terms of the model, although in practice such variances cannot be ascertained exactly, but only estimated from an independent study. The usual approach so far has been to treat the estimated values of error variances as if they were known exact population values in the subsequent structural equation modeling (SEM) analysis. In this article we show that fixing measurement error variance estimates as if they were true values can make the reported standard errors of the structural parameters of the model smaller than they should be. Inferences about the parameters of interest will be incorrect if the estimated nature of the variances is not taken into account. For general SEM, we derive an explicit expression that provides the terms to be added to the standard errors provided by the standard SEM software that treats the estimated variances as exact population values. Interestingly, we find there is a differential impact of the corrections to be added to the standard errors depending on which parameter of the model is estimated. The theoretical results are illustrated with simulations and also with empirical data on a typical SEM model.

14.
We evaluated the statistical power of single-indicator latent growth curve models to detect individual differences in change (variances of latent slopes) as a function of sample size, number of longitudinal measurement occasions, and growth curve reliability. We recommend the 2 degree-of-freedom generalized test assessing loss of fit when both slope-related random effects, the slope variance and intercept-slope covariance, are fixed to 0. Statistical power to detect individual differences in change is low to moderate unless the residual error variance is low, sample size is large, and there are more than four measurement occasions. The generalized test has greater power than a specific test isolating the hypothesis of zero slope variance, except when the true slope variance is close to 0, and has uniformly superior power to a Wald test based on the estimated slope variance.

15.
This Monte Carlo study investigated the impacts of measurement noninvariance across groups on major parameter estimates in latent growth modeling when researchers test group differences in initial status and latent growth. The average initial status and latent growth and the group effects on initial status and latent growth were investigated in terms of Type I error and bias. The location and magnitude of noninvariance across groups were related to the location and magnitude of bias and Type I error in the parameter estimates. That is, noninvariance in factor loadings and intercepts was associated with Type I error inflation and bias in the parameter estimates of the slope factor (or latent growth) and the intercept factor (or initial status), respectively. As noninvariance became large, the degree of Type I error and bias also increased. On the other hand, a correctly specified second-order latent growth model yielded unbiased parameter estimates and correct statistical inferences. Other findings and implications for future studies are discussed.

16.
Student–teacher interactions are dynamic relationships that change and evolve over the course of a school year. Measuring classroom quality through observations that focus on these interactions presents challenges when observations are conducted throughout the school year. Variability in observed scores could reflect true changes in the quality of student–teacher interaction or simply reflect measurement error. Classroom observation protocols should be designed to minimize measurement error while allowing measurable changes in the construct of interest. Treating occasions as fixed multivariate outcomes allows true changes to be separated from random measurement error. These outcomes may also be summarized through trend score composites to reflect different types of growth over the school year. We demonstrate the use of multivariate generalizability theory to estimate reliability for trend score composites, and we compare the results to traditional methods of analysis. Reliability estimates computed for average, linear, quadratic, and cubic trend scores from 118 classrooms participating in the MyTeachingPartner study indicate that universe scores account for between 57% and 88% of observed score variance.

17.
High item discrimination can be a symptom of a special kind of measurement disturbance introduced by an item that gives persons of high ability a special advantage over and above their higher abilities. This type of disturbance, which can be interpreted as a form of item "bias," can be encouraged by methods that routinely interpret highly discriminating items as the "best" items on a test and may be compounded by procedures that weight items by their discrimination. The type of measurement disturbance described and illustrated in this paper occurs when an item is sensitive to individual differences on a second, undesired dimension that is positively correlated with the variable intended to be measured. Possible secondary influences of this type include opportunity to learn, opportunity to answer, and test wiseness.

18.
The primary purpose of this study was to estimate the amount of variability in the proportions of students in a school district scoring within each of three achievement levels that could be attributed to factors other than random sampling error. The approach taken is based on a general conceptual framework that collectively incorporates five sources of variability: instructional intervention, random sampling error, measurement error, equating error, and systematic error. Statewide school-level assessment data for reading and mathematics in grades four and eight from four consecutive years were used to examine annual grade-group change. The intent was to assess the impact of random sampling error in grade-group change estimates when either single-year proportions or 2-year average proportions are used to report school improvement with achievement levels. Observed variability in change was compared with theoretically derived estimates of change due to random sampling error to determine the relative influence of sampling error and the aggregate of the other four sources of variability. Results indicate that the error variance of estimates of change at the school level is large enough to interfere with interpretations of annual change estimates. Recommendations are offered for establishing annual improvement goals and for reporting results with achievement levels, all in the context of adequate yearly progress (AYP), while taking error estimates into account.

19.
We use quantile treatment effects estimation to examine the consequences of the random-assignment New York City School Choice Scholarship Program across the distribution of student achievement. Our analyses suggest that the program had negligible and statistically insignificant effects across the skill distribution. In addition to contributing to the literature on school choice, the article illustrates several ways in which distributional effects estimation can enrich educational research: First, we demonstrate that moving beyond a focus on mean effects estimation makes it possible to generate and test new hypotheses about the heterogeneity of educational treatment effects that speak to the justification for many interventions. Second, we demonstrate that distributional effects can uncover issues even with well-studied data sets by forcing analysts to view their data in new ways. Finally, such estimates highlight where in the overall national achievement distribution test scores of children exposed to particular interventions lie; this is important for exploring the external validity of the intervention's effects.
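To illustrate the general idea of quantile treatment effects under random assignment, the sketch below takes the difference between the treated and control score quantiles at several points of the distribution. It is not the article's estimator: the linear-interpolation quantile rule, the function names, and the toy scores are all assumptions made for illustration, and the numbers are invented rather than taken from the study.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def quantile_treatment_effects(treated, control, qs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Difference in outcome quantiles, treated minus control, at each q."""
    return {q: quantile(treated, q) - quantile(control, q) for q in qs}


# Invented toy scores, for illustration only.
control = [10, 20, 30, 40, 50]
treated = [12, 22, 30, 44, 60]
qte = quantile_treatment_effects(treated, control)
for q, effect in qte.items():
    print(f"q = {q:.2f}: effect = {effect:.1f}")
```

A mean comparison would collapse these values into a single number, whereas the per-quantile differences show where in the distribution any effect is concentrated.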

20.
In classical test theory, a test is regarded as a sample of items from a domain defined by generating rules or by content, process, and format specifications. If the items are a random sample of the domain, then the percent-correct score on the test estimates the domain score, that is, the expected percent correct for all items in the domain. When the domain is represented by a large set of calibrated items, as in item banking applications, item response theory (IRT) provides an alternative estimator of the domain score by transformation of the IRT scale score on the test. This estimator has the advantage of not requiring the test items to be a random sample of the domain, and of having a simple standard error. We present here resampling results in real data demonstrating for uni- and multidimensional models that the IRT estimator is also a more accurate predictor of the domain score than is the classical percent-correct score. These results have implications for reporting outcomes of educational qualification testing and assessment.
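The transformation from an IRT scale score to a domain score can be sketched as the expected percent correct over all calibrated items in the bank. The abstract does not name the item response model, so this sketch assumes a two-parameter logistic (2PL) model; the function names and the three-item bank are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response (model choice is an assumption)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def domain_score(theta, item_bank):
    """Expected percent correct over every calibrated item in the bank."""
    probs = [p_correct(theta, a, b) for a, b in item_bank]
    return 100.0 * sum(probs) / len(probs)


# Hypothetical bank of (discrimination a, difficulty b) pairs.
bank = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
print(f"{domain_score(0.0, bank):.1f}")
```

Because the sum runs over the whole bank rather than the administered items, the estimate does not depend on the test being a random sample of the domain, which is the advantage the abstract points out.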
