Similar Articles
20 similar articles found.
1.
New confidence intervals for the Wechsler Intelligence Scale for Children-Revised (WISC-R) are provided to improve accuracy over existing tables (which center the confidence interval on the child's actual quotient and construct intervals upon the standard error of measurement). These new tables follow Schulte and Borich (1988) in centering confidence intervals on the estimated true IQ and constructing intervals with either the standard error of estimate or the standard error of prediction. The tables are more conservative and reflect more accurate statistical formulas for the construction of confidence intervals for WISC-R quotients.
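The classical test theory quantities behind such tables can be sketched briefly. The snippet below is a minimal illustration, not a reproduction of the article's tables: it centers the band on the regressed (estimated true) IQ and uses either the standard error of estimate or the standard error of prediction; the reliability of .95 and the WISC-R mean of 100 and SD of 15 are assumed values for the example.

```python
import math

def true_score_interval(obtained_iq, reliability, mean=100.0, sd=15.0,
                        z=1.96, error="estimate"):
    """Confidence interval centered on the estimated true score.

    Illustrative sketch of standard classical-test-theory formulas; the
    reliability and z value are assumptions for the example, not values
    taken from the article's tables.
    """
    # Regress the obtained score toward the mean to estimate the true score.
    estimated_true = mean + reliability * (obtained_iq - mean)

    if error == "estimate":          # standard error of estimate: SD * sqrt(r(1 - r))
        se = sd * math.sqrt(reliability * (1.0 - reliability))
    elif error == "prediction":      # standard error of prediction: SD * sqrt(1 - r^2)
        se = sd * math.sqrt(1.0 - reliability ** 2)
    else:                            # conventional SEM: SD * sqrt(1 - r)
        se = sd * math.sqrt(1.0 - reliability)

    return estimated_true - z * se, estimated_true + z * se

# Example: obtained Full Scale IQ of 120 with an assumed reliability of .95.
print(true_score_interval(120, 0.95))                      # estimate-based band
print(true_score_interval(120, 0.95, error="prediction"))  # wider prediction band
```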

2.
Reporting confidence intervals with test scores helps test users make important decisions about examinees by providing information about the precision of test scores. Although a variety of estimation procedures based on the binomial error model are available for computing intervals for test scores, these procedures assume that items are randomly drawn from an undifferentiated universe of items, and therefore might not be suitable for tests developed according to a table of specifications. To address this issue, four interval estimation procedures that use category subscores for the computation of confidence intervals are presented in this article. All four estimation procedures assume that subscores instead of test scores follow a binomial distribution (i.e., compound binomial error model). The relative performance of the four compound binomial–based interval estimation procedures is compared to each other and to the better-known normal approximation and Wilson score procedures based on the binomial error model.
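For reference, the two binomial-error-model baselines named above can be written down directly. The sketch below shows the normal-approximation (Wald) and Wilson score intervals for a proportion-correct score; the compound binomial procedures the article proposes, which work from category subscores, are not reproduced here, and the 32-of-40 example is invented.

```python
import math

def wald_interval(correct, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion-correct score."""
    p = correct / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return p - half, p + half

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion-correct score."""
    p = correct / n
    denom = 1.0 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Example: 32 of 40 items correct, treating the whole test as one binomial domain.
print(wald_interval(32, 40))
print(wilson_interval(32, 40))
```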

3.
The present article provides a primer on (a) effect sizes, (b) confidence intervals, and (c) confidence intervals for effect sizes. Additionally, various admonitions for reformed statistical practice are presented. For example, a very important implication of the realization that there are dozens of effect size statistics is that authors must explicitly tell readers what effect sizes they are reporting. With respect to confidence intervals, when interpreting a 95% interval, we should never say that we are 95% confident that our interval captures the estimated population parameter. It is explained that effect sizes should be reported even for statistically nonsignificant effects. And, most importantly of all, it is emphasized that effect sizes should not be interpreted using Cohen's benchmarks. Instead, we ought to interpret our effects in direct and explicit comparison against the effects in the related prior literature. © 2007 Wiley Periodicals, Inc. Psychol Schs 44: 423–432, 2007.
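As a concrete illustration of reporting an effect size together with its interval, the sketch below computes Cohen's d with a common large-sample approximation to its standard error. This is a generic approximation, not a method taken from the article; exact noncentral-t intervals would differ slightly, and the group summaries are hypothetical.

```python
import math

def cohens_d_with_ci(mean1, mean2, sd1, sd2, n1, n2, z=1.96):
    """Cohen's d with an approximate large-sample confidence interval.

    Uses a widely cited normal-theory approximation to the standard error
    of d; this is a sketch, not the article's recommended procedure.
    """
    # Pooled standard deviation across the two groups.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    d = (mean1 - mean2) / math.sqrt(pooled_var)

    # Large-sample approximation to the standard error of d.
    se_d = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - z * se_d, d + z * se_d)

# Hypothetical groups: treatment (M = 54, SD = 10, n = 40) vs control (M = 50, SD = 10, n = 40).
print(cohens_d_with_ci(54, 50, 10, 10, 40, 40))
```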

4.
Misconceptions about science are often not corrected during study when they are held with high confidence. However, when corrective feedback co-activates a misconception together with the correct conception, this feedback may surprise the learner and draw attention, especially when the misconceptions are held with high confidence. Therefore, high-confidence misconceptions might be more likely to be corrected than low-confidence misconceptions. The present study investigates whether this hypercorrection effect occurs when students read science texts. Effects of two text formats were compared: standard texts that presented factual information, and refutation texts that explicitly addressed misconceptions and refuted them before presenting factual information. Eighth-grade adolescents (N = 114) took a pre-reading test that included 16 common misconceptions about science concepts, rated their confidence in the correctness of their responses to the pre-reading questions, read 16 texts about the science concepts, and finally took a post-test that included both true/false and open-ended test questions. Analyses of post-test responses show that reading refutation texts causes hypercorrection: learners more often corrected high-confidence misconceptions after reading refutation texts than after reading standard texts, whereas low-confidence misconceptions did not benefit from reading refutation texts. These outcomes suggest that people are more surprised when they find out a confidently held misconception is incorrect, which may encourage them to pay more attention to the feedback and the refutation. Moreover, correction of high-confidence misconceptions was more apparent on the true/false test responses than on the open-ended test, suggesting that additional interventions may be needed to improve learners' accommodation of the correct information.

5.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.
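The core computation can be illustrated in simplified form: if the raw score at a fixed true proportion-correct is treated as binomial, the conditional scale-score SEM is the standard deviation of the transformed raw scores under that conditional distribution. The sketch below uses an invented 20-item test and conversion table; the article's procedure rests on a strong true score model and real ACT conversions, neither of which is reproduced here.

```python
import math
from scipy.stats import binom

def conditional_scale_sem(true_prop, n_items, raw_to_scale):
    """Conditional SEM of a scale score at a fixed true proportion-correct.

    Simplified sketch: the raw score given the examinee's true proportion
    is treated as Binomial(n, p), and the scale-score variance is taken
    over that conditional distribution.
    """
    probs = [binom.pmf(x, n_items, true_prop) for x in range(n_items + 1)]
    scale = [raw_to_scale[x] for x in range(n_items + 1)]
    mean_scale = sum(p * s for p, s in zip(probs, scale))
    var_scale = sum(p * (s - mean_scale) ** 2 for p, s in zip(probs, scale))
    return math.sqrt(var_scale)

# Hypothetical 20-item test with a nonlinear raw-to-scale conversion (1-36 style).
conversion = {x: round(1 + 35 * (x / 20) ** 1.3) for x in range(21)}
for p in (0.3, 0.5, 0.7, 0.9):
    print(p, round(conditional_scale_sem(p, 20, conversion), 2))
```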

6.
Many students find understanding confidence intervals difficult, especially because of the amalgamation of concepts such as confidence levels, standard error, point estimates and sample sizes. An R Shiny application was created to support the learning of confidence intervals using graphics and data from the US National Basketball Association.
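The coverage idea such an application visualises can also be shown with a plain simulation. The sketch below (Python with synthetic normal data, not the authors' R Shiny app or the NBA data set) repeatedly draws samples, builds 95% intervals for the mean, and reports how often they capture the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage_demo(true_mean=25.0, true_sd=6.0, n=30, n_samples=1000, z=1.96):
    """Share of nominal 95% confidence intervals that capture the true mean."""
    hits = 0
    for _ in range(n_samples):
        sample = rng.normal(true_mean, true_sd, size=n)
        se = sample.std(ddof=1) / np.sqrt(n)
        lo, hi = sample.mean() - z * se, sample.mean() + z * se
        hits += lo <= true_mean <= hi
    return hits / n_samples

print(coverage_demo())   # typically close to 0.95
```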

7.
With a focus on performance assessments, this paper describes procedures for calculating conditional standard error of measurement (CSEM) and reliability of scale scores and classification consistency of performance levels. Scale scores that are transformations of total raw scores are the focus of these procedures, although other types of raw scores are considered as well. Polytomous IRT models provide the psychometric foundation for the procedures that are described. The procedures are applied using test data from ACT's Work Keys Writing Assessment to demonstrate their usefulness. Two polytomous IRT models were compared, as were two different procedures for calculating scores. One simulation study was done using one of the models to evaluate the accuracy of the proposed procedures. The results suggest that the procedures provide quite stable estimates and have the potential to be useful in a variety of performance assessment situations.

8.
Histograms are widely used and appear easy to understand. Research nevertheless indicates that students, teachers and researchers often misinterpret these graphical representations. Hence, the research question addressed in this paper is: What are the conceptual difficulties that become manifest in the common misinterpretations people have when constructing or interpreting histograms? To identify these conceptual difficulties, we conducted a narrative systematic literature review and identified 86 publications reporting or containing misinterpretations. The misinterpretations were clustered and, through abduction, connected to difficulties with statistical concepts. The analysis revealed that most of these conceptual difficulties relate to two big ideas in statistics: data (e.g., number of variables and measurement level) and distribution (shape, centre and variability or spread). These big ideas are depicted differently in histograms compared to, for example, case-value plots. Our overview can help teachers and researchers to address common misinterpretations more generally instead of remediating them each individually.

9.
The standard error of measurement usefully provides confidence limits for scores in a given test, but is it possible to quantify the reliability of a test with just a single number that allows comparison of tests of different formats? Reliability coefficients do not do this, being dependent on the spread of examinee attainment. Better in this regard is a measure produced by dividing the standard error of measurement by the test's ‘reliability length’, the latter defined as the maximum possible score minus the most probable score obtainable by blind guessing alone. This, however, can be unsatisfactory with negative marking (formula scoring), as shown by data on 13 negatively marked true/false tests. In these the examinees displayed considerable misinformation, which correlated negatively with correct knowledge. Negative marking can improve test reliability by penalizing such misinformation as well as by discouraging guessing. Reliability measures can be based on idealized theoretical models instead of on test data. These do not reflect the qualities of the test items, but can be focused on specific test objectives (e.g. in relation to cut‐off scores) and can be expressed as easily communicated statements even before tests are written.
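The proposed index is simple arithmetic: divide the SEM by the reliability length, defined as the maximum possible score minus the most probable blind-guessing score. The sketch below uses invented numbers to show how formula scoring, by moving the expected guessing score to zero, lengthens the reliability length and shrinks the index.

```python
def sem_per_reliability_length(sem, max_score, chance_score):
    """SEM expressed as a fraction of the test's 'reliability length'.

    Reliability length = maximum possible score minus the most probable
    score obtainable by blind guessing alone, as defined in the abstract.
    The numbers in the examples are illustrative, not the article's data.
    """
    reliability_length = max_score - chance_score
    return sem / reliability_length

# 100 true/false items, number-right scoring: blind guessing most probably yields 50.
print(sem_per_reliability_length(sem=4.0, max_score=100, chance_score=50))   # 0.08

# Same test with formula scoring (right minus wrong): the expected guessing score is 0,
# so the reliability length doubles and the index halves.
print(sem_per_reliability_length(sem=4.0, max_score=100, chance_score=0))    # 0.04
```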

10.
Introductory statistics texts give extensive coverage to two-sided inferences in hypothesis testing and interval estimation, as well as to one-sided hypothesis tests. Very few discuss the possibility of one-sided interval estimation at all. Even fewer do so in any detail. Two of the business statistics texts we reviewed mentioned the possibility of dividing the risk of a type I error unequally between the tails for a two-sided confidence interval. None of the textbooks that were reviewed even considered the possibility of unequal tails for two-sided hypothesis tests. In this paper, we propose that statistics courses and texts should cover both one-sided tests and confidence intervals. Furthermore, we propose that coverage, at least in two-semester and advanced courses, should also be given to unequal division of the nominal risk of a type I error for both tests and confidence intervals. Examples are provided for both situations.
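An unequal division of the type I risk only changes which quantiles bound the interval. The sketch below builds a two-sided t interval for a mean with 1% of the risk in the lower tail and 4% in the upper tail, alongside the usual equal-tails and one-sided cases; all numbers are illustrative, not taken from the paper's examples.

```python
import math
from scipy import stats

def mean_ci_unequal_tails(xbar, s, n, alpha_lower=0.01, alpha_upper=0.04):
    """Two-sided t interval for a mean with an unequal split of the type I risk.

    Total nominal risk here is 5%, with 1% placed in the lower tail and 4% in
    the upper tail; the usual equal-tails interval is the special case
    alpha_lower = alpha_upper = 0.025. Values are illustrative only.
    """
    se = s / math.sqrt(n)
    df = n - 1
    lower = xbar - stats.t.ppf(1 - alpha_lower, df) * se
    upper = xbar + stats.t.ppf(1 - alpha_upper, df) * se
    return lower, upper

print(mean_ci_unequal_tails(xbar=50.0, s=8.0, n=25))                                # 1% / 4% split
print(mean_ci_unequal_tails(50.0, 8.0, 25, alpha_lower=0.025, alpha_upper=0.025))   # equal tails
print(mean_ci_unequal_tails(50.0, 8.0, 25, alpha_lower=0.05, alpha_upper=0.0))      # one-sided lower bound (upper limit is +inf)
```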

11.
Two methods of constructing equal-interval scales for educational achievement are discussed: Thurstone's absolute scaling method and Item Response Theory (IRT). Alternative criteria for choosing a scale are contrasted. It is argued that clearer criteria are needed for judging the appropriateness and usefulness of alternative scaling procedures, and more information is needed about the qualities of the different scales that are available. In answer to this second need, some examples are presented of how IRT can be used to examine the properties of scales: It is demonstrated that for observed score scales in common use (i.e., any scores that are influenced by measurement error), (a) systematic errors can be introduced when comparing growth at selected percentiles, and (b) normalizing observed scores will not necessarily produce a scale that is linearly related to an underlying normally distributed true trait.

12.
Educational measurement specialists undertaking test equating in applied settings have been plagued by the absence of a logically or mathematically compelling rationale for their test equating efforts. Classical test theory and other test theories based on the assumption of identically distributed true scores are tautological in terms of test equating. The present study examined (by means of a Monte Carlo procedure) the effects of four parameters on the accuracy of test equating under a relaxed definition of test form equivalence. The four parameters studied were sample size, test form length, test form reliability, and the correlation between the true scores of the test forms to be equated. Significant interactions involving sample size and the other parameters indicated that smaller samples of observations yielded disproportionately larger errors in test equating for fixed values of the test form parameters. In terms of main effects, sample size emerged as most important in controlling equating error. Taken together, the results suggest that when test equating is carried out on larger samples of observations, errors of equating will tend to be relatively small even though the test forms are not strictly parallel. For arbitrarily small samples, however, errors of equating will tend to be larger regardless of how equivalent the test forms are.
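The qualitative finding, that equating error shrinks with sample size even when forms are not strictly parallel, can be mimicked with a toy Monte Carlo. The sketch below is a simplified construction (linear mean-sigma equating of two simulated forms sharing a true score), not the study's design, and all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def equating_error(n_examinees, n_reps=500, reliability=0.85):
    """Root-mean-square error of a linear (mean-sigma) equating constant pair.

    Toy simulation: two forms share a common true score but differ in
    difficulty and scale; smaller samples give noisier equating constants.
    """
    error_sd = np.sqrt((1 - reliability) / reliability)  # true-score variance fixed at 1
    a_true, b_true = 1.1, 3.0   # population equating of form X onto form Y: y = a*x + b
    errs = []
    for _ in range(n_reps):
        theta = rng.normal(0, 1, n_examinees)
        x = theta + rng.normal(0, error_sd, n_examinees)
        y = a_true * theta + b_true + rng.normal(0, a_true * error_sd, n_examinees)
        a_hat = y.std(ddof=1) / x.std(ddof=1)     # mean-sigma slope
        b_hat = y.mean() - a_hat * x.mean()       # mean-sigma intercept
        errs.append((a_hat - a_true) ** 2 + (b_hat - b_true) ** 2)
    return np.sqrt(np.mean(errs))

for n in (50, 200, 1000):
    print(n, round(equating_error(n), 3))   # error falls as sample size grows
```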

13.
14.
The standard error of measurement (SEM) is the standard deviation of errors of measurement that are associated with test scores from a particular group of examinees. When used to calculate confidence bands around obtained test scores, it can be helpful in expressing the unreliability of individual test scores in an understandable way. Score bands can also be used to interpret intraindividual and interindividual score differences. Interpreters should be wary of over-interpretation when using approximations for correctly calculated score bands. It is recommended that SEMs at various score levels be used in calculating score bands rather than a single SEM value.
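A minimal sketch of such bands, assuming a hypothetical table of level-specific SEMs rather than a single overall value, is given below; the cut points and SEM values are invented for illustration.

```python
def score_band(obtained, conditional_sems, z=1.0):
    """Confidence band around an obtained score using a level-specific SEM.

    The table of conditional SEMs is hypothetical; in practice it would come
    from the test publisher, and z = 1 gives the familiar 68% band.
    """
    # Use the SEM reported for the score level nearest the obtained score.
    nearest_level = min(conditional_sems, key=lambda cut: abs(cut - obtained))
    sem = conditional_sems[nearest_level]
    return obtained - z * sem, obtained + z * sem

# Hypothetical conditional SEMs: larger near the middle of the scale, smaller at the extremes.
sems_by_level = {70: 2.5, 85: 3.5, 100: 4.0, 115: 3.5, 130: 2.5}
print(score_band(104, sems_by_level))          # uses the SEM for the nearest level (100)
print(score_band(128, sems_by_level, z=1.96))  # 95% band with the SEM at the high end
```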

15.
Applied Measurement in Education, 2013, 26(4): 361–367
The sampling theory for coefficient alpha is well developed and readily accessible in the measurement literature. The theory for the intraclass reliability coefficient, a Spearman-Brown extrapolation of alpha to a single measurement on each examinee, is less widely recognized and less easily cited. This article presents techniques for constructing confidence intervals and testing hypotheses for the intraclass coefficient.
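For readers who want an interval without invoking the sampling theory the article develops, a generic percentile bootstrap is one alternative; it is explicitly not the F-based procedure presented there. The sketch below computes coefficient alpha, steps it down to a single-measurement (intraclass) coefficient via the Spearman-Brown formula, and bootstraps intervals for both on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(2)

def cronbach_alpha(scores):
    """Coefficient alpha for an examinees-by-items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

def intraclass_from_alpha(alpha, k):
    """Spearman-Brown step-down: reliability of a single measurement."""
    return alpha / (k - (k - 1) * alpha)

def bootstrap_ci(scores, stat, n_boot=2000, level=0.95):
    """Percentile bootstrap interval; a generic resampling alternative,
    not the F-based sampling theory the article presents."""
    n = scores.shape[0]
    reps = [stat(scores[rng.integers(0, n, n)]) for _ in range(n_boot)]
    lo, hi = np.percentile(reps, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return lo, hi

# Synthetic data: 200 examinees, 10 items sharing a common factor.
theta = rng.normal(size=(200, 1))
items = 0.7 * theta + rng.normal(scale=0.7, size=(200, 10))
alpha_hat = cronbach_alpha(items)
print(alpha_hat, intraclass_from_alpha(alpha_hat, 10))
print(bootstrap_ci(items, cronbach_alpha))
print(bootstrap_ci(items, lambda s: intraclass_from_alpha(cronbach_alpha(s), s.shape[1])))
```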

16.
The stability of standard score and probability method sociometric group assignments was examined over a 2-year period with an initial group of 334 preadolescents. The popular, neglected, and controversial sociometric groups evidenced low stability of group membership over intervals of approximately 1, 6, 12, 18, and 24 months; the rejected group evidenced slightly higher short-term stability. These findings of limited stability were attributed to measurement error and to the failure of both classification systems to identify groups with homogeneous social reputation profiles. Social role scores contributed to the prediction of stable group membership in the rejected and controversial classifications, although these scores added little to the prediction of stable popular and neglected group membership. Stability over short intervals could be used to enhance the prediction of stability over longer periods; however, this procedure resulted in the classification of numerous false positives and false negatives. The instability of sociometric group assignments completed with the standard score and probability methods indicates that researchers should be cautious about the use of classifications based on only one data collection and that the selection of children for both clinical intervention and further nomothetic research may require alternative assessment procedures.

17.
Previous methods for estimating the conditional standard error of measurement (CSEM) at specific score or ability levels are critically discussed, and a brief summary of prior empirical results is given. A new method is developed that avoids theoretical problems inherent in some prior methods, is easy to implement, and estimates not only a quantity analogous to the CSEM at each score but also the conditional standard error of prediction (CSEP) at each score and the conditional true score standard deviation (CTSSD) at each score. The new method differs from previous methods in that previous methods have concentrated on attempting to estimate error variance conditional on a fixed value of true score, whereas the new method considers the variance of observed scores conditional on a fixed value of an observed parallel measurement and decomposes these conditional observed score variances into true and error parts. The new method and several older methods are applied to a variety of tests, and representative results are graphically displayed. The CSEM-like estimates produced by the new method are called conditional standard error of measurement in prediction (CSEMP) estimates and are similar to those produced by older methods, but the CSEP estimates produced by the new method offer an alternative interpretation of the accuracy of a test at different scores. Finally, evidence is presented that shows that previous methods can produce dissimilar results and that the shape of the score distribution may influence the way in which the CSEM varies across the score scale.

18.
This paper describes four procedures previously developed for estimating conditional standard errors of measurement for scale scores: the IRT procedure (Kolen, Zeng, & Hanson, 1996), the binomial procedure (Brennan & Lee, 1999), the compound binomial procedure (Brennan & Lee, 1999), and the Feldt-Qualls procedure (1998). These four procedures are based on different underlying assumptions. The IRT procedure is based on the unidimensional IRT model assumptions. The binomial and compound binomial procedures employ, as the distribution of errors, the binomial model and compound binomial model, respectively. By contrast, the Feldt-Qualls procedure does not depend on a particular psychometric model, and it simply translates any estimated conditional raw-score SEM to a conditional scale-score SEM. These procedures are compared in a simulation study, which involves two-dimensional data sets. The presence of two category dimensions reflects a violation of the IRT unidimensionality assumption. The relative accuracy of these procedures for estimating conditional scale-score standard errors of measurement is evaluated under various circumstances. The effects of three different types of transformations of raw scores are investigated, including developmental standard scores, grade equivalents, and percentile ranks. All the procedures discussed appear viable. A general recommendation is made that test users select a procedure based on various factors such as the type of scale score of concern, characteristics of the test, assumptions involved in the estimation procedure, and feasibility and practicability of the estimation procedure.
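As a point of reference for the binomial procedure, the classical binomial-error-model estimate of the conditional SEM of a raw (number-correct) score is sqrt(x(n − x)/(n − 1)). The sketch below evaluates it for an invented 40-item test; the translation of such raw-score CSEMs to scale scores, which is the article's focus, is not reproduced.

```python
import math

def binomial_csem(raw_score, n_items):
    """Binomial-error-model conditional SEM of a number-correct raw score.

    Classical raw-score formula underlying the binomial procedure; the
    40-item example below is invented for illustration.
    """
    return math.sqrt(raw_score * (n_items - raw_score) / (n_items - 1))

# Conditional SEMs peak near the middle of the raw-score range (40-item test).
for x in (5, 10, 20, 30, 35):
    print(x, round(binomial_csem(x, 40), 2))
```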

19.
The usability of interactive whiteboards (IWBs) vs. computers was evaluated on three dimensions (visibility, legibility and comprehension) with secondary school pupils. The visibility assessment consisted of detecting a visual stimulus varying in luminance using a staircase procedure, legibility was assessed with a target-search task, and narrative and explanatory texts with or without illustrations were administered to evaluate comprehension. The results of the visibility test showed that pupils found the light signal easier to detect on the IWB. For the legibility test, we observed differences in error rates and discriminability according to medium, font size and congruence between the target and the distractor letters. Performances in the comprehension test were similar for both explanatory and narrative texts. Moreover, the presence of illustrations did not improve comprehension. These results could be related to the hierarchical structure of the texts, which facilitates comprehension.

20.
Student–teacher interactions are dynamic relationships that change and evolve over the course of a school year. Measuring classroom quality through observations that focus on these interactions presents challenges when observations are conducted throughout the school year. Variability in observed scores could reflect true changes in the quality of student–teacher interaction or simply reflect measurement error. Classroom observation protocols should be designed to minimize measurement error while still allowing measurable changes in the construct of interest to be detected. Treating occasions as fixed multivariate outcomes allows true changes to be separated from random measurement error. These outcomes may also be summarized through trend score composites to reflect different types of growth over the school year. We demonstrate the use of multivariate generalizability theory to estimate reliability for trend score composites, and we compare the results to traditional methods of analysis. Reliability estimates computed for average, linear, quadratic, and cubic trend scores from 118 classrooms participating in the MyTeachingPartner study indicate that universe scores account for between 57% and 88% of observed score variance.

