Similar Articles (20 results)
1.
Formulas for the standard error of a parallel-test correlation and for the Kuder-Richardson formula 20 reliability estimate are provided. Given equal values of the two reliabilities in the population, the standard error of the Kuder-Richardson formula 20 is shown to be somewhat smaller than the standard error of a parallel-test correlation for reliability values, sample sizes, and test lengths that are usually encountered in practice.
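For context (a large-sample approximation from standard correlation theory, not the formula given in this article): the standard error of a parallel-test correlation \( r \) is roughly \( (1-\rho^2)/\sqrt{n} \); the article provides the analogous expression for the KR-20 estimate and compares the two.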

2.
The standard error of measurement usefully provides confidence limits for scores in a given test, but is it possible to quantify the reliability of a test with just a single number that allows comparison of tests of different format? Reliability coefficients do not do this, being dependent on the spread of examinee attainment. Better in this regard is a measure produced by dividing the standard error of measurement by the test's 'reliability length', the latter defined as the maximum possible score minus the most probable score obtainable by blind guessing alone. This, however, can be unsatisfactory with negative marking (formula scoring), as shown by data on 13 negatively marked true/false tests. In these the examinees displayed considerable misinformation, which correlated negatively with correct knowledge. Negative marking can improve test reliability by penalizing such misinformation as well as by discouraging guessing. Reliability measures can be based on idealized theoretical models instead of on test data. These do not reflect the qualities of the test items, but can be focused on specific test objectives (e.g. in relation to cut-off scores) and can be expressed as easily communicated statements even before tests are written.
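A small illustrative sketch of the single-number measure described above (standard error of measurement divided by the 'reliability length'); the function names and the 50-item true/false example are assumptions, not values from the article:

```python
# Illustrative sketch: SEM divided by the test's "reliability length",
# defined as the maximum possible score minus the most probable score
# obtainable by blind guessing alone.

def reliability_length(max_score: float, expected_guess_score: float) -> float:
    """Maximum possible score minus the most probable blind-guessing score."""
    return max_score - expected_guess_score

def sem_per_reliability_length(sem: float, max_score: float,
                               expected_guess_score: float) -> float:
    """Standard error of measurement expressed per unit of reliability length."""
    return sem / reliability_length(max_score, expected_guess_score)

# Example: a 50-item true/false test scored +1 for correct, 0 for wrong.
# Blind guessing most probably yields about 25 correct, so the reliability
# length is 50 - 25 = 25. With an SEM of 3 raw-score points:
print(sem_per_reliability_length(sem=3.0, max_score=50, expected_guess_score=25))  # 0.12
```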

3.
Although it has been known for over a half-century that the standard error of measurement is in many respects superior to the reliability coefficient for purposes of evaluating the fallibility of a psychological test, current textbooks and journal literature in tests and measurements still devote far more attention to test reliability than to the standard error. The present paper provides a list of ten salient features of the standard error, contrasting it to the reliability coefficient, and concludes that the standard error of measurement should be regarded as a primary characteristic of a mental test.
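For reference (a standard classical-test-theory identity, not a result specific to this article): the two statistics are linked by \( \sigma_E = \sigma_X \sqrt{1 - \rho_{XX'}} \), where \( \sigma_X \) is the observed-score standard deviation and \( \rho_{XX'} \) the reliability; a high reliability coefficient can therefore coexist with a large standard error of measurement when \( \sigma_X \) is large.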

4.
We contend that generalizability (G) theory allows the design of psychometric approaches to testing English-language learners (ELLs) that are consistent with current thinking in linguistics. We used G theory to estimate the amount of measurement error due to code (language or dialect). Fourth- and fifth-grade ELLs, native speakers of Haitian-Creole from two speech communities, were given the same set of mathematics items in the standard English and standard Haitian-Creole dialects (Sample 1) or in the standard and local dialects of Haitian-Creole (Samples 2 and 3). The largest measurement error observed was produced by the interaction of student, item, and code. Our results indicate that the reliability and dependability of ELL achievement measures are affected by two facts that operate in combination: Each test item poses a unique set of linguistic challenges and each student has a unique set of linguistic strengths and weaknesses. This sensitivity to language appears to take place at the level of dialect. Also, students from different speech communities within the same broad linguistic group may differ considerably in the number of items needed to obtain dependable measures of their academic achievement. Whether students are tested in English or in their first language, dialect variation needs to be considered if language as a source of measurement error is to be effectively addressed.

5.
From concepts which refer only to observed scores and which allow the parameters of the score distribution over repeated measurements on a given person to differ from person to person, necessary and sufficient conditions under which coefficient alpha equals test reliability are derived. The result clarifies the relation of this quantity to the Kuder-Richardson formula 20, to KR-21, to the Spearman-Brown formula, and to Lord's item-sampling model.
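For reference, the quantities named in this abstract (standard formulas, not the article's derivation): coefficient alpha is \( \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_X^2}\right) \); KR-20 is its special case for dichotomous items, with \( \sigma_i^2 = p_i q_i \); KR-21 further replaces \( \sum_i p_i q_i \) with \( k\bar{p}\bar{q} \); and the Spearman-Brown formula for a test lengthened by a factor \( k \) is \( \rho_{kk} = \frac{k\rho}{1 + (k-1)\rho} \).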

6.
Currently there is concern among some educators regarding the reliability of criterion-referenced (CR) measures. In this comment, a recent attempt to develop a theory of reliability for CR measures is examined, and some considerations for determining the reliability of CR measures are discussed. Conventional reliability statistics (e.g., coefficient alpha, standard error of measurement) are found appropriate for CR measures satisfying the assumptions of the measurement model underlying classical test theory. For measures with underlying multidimensional traits, conventional reliability statistics may be used at the homogeneous subscale level. When the confidence interval about a student's "below criterion score" includes the criterion, additional evidence about the student should be obtained. Two-stage sequential testing is suggested as one method for acquiring additional evidence.

7.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.

8.
A level (leveling instrument) obtains a horizontal line of sight by making the collimation axis of its telescope parallel to the axis of its tubular level. If the projections of these two axes onto the vertical plane are not parallel, the angle between the projected lines is called the "i-angle" error. For leveling work, an i-angle error of zero means that once the level-tube axis has been set horizontal, the telescope's collimation axis is horizontal as well, so the instrument provides the horizontal line of sight required by the basic principle of leveling. The i-angle error therefore has a large influence on leveling observations, and its magnitude must be checked quickly and accurately before fieldwork and then corrected. The quick check described in this paper builds on the conventional check: the level is set up twice, at different positions located at specified distances from the leveling staffs, and the height difference between the same two points is measured from each setup. Because the distances are fixed in advance, the formula for the i-angle error simplifies so much that the i-angle, and the correct reading on the staff farthest from the instrument, can be worked out mentally in the field without a calculator. The check and adjustment of the i-angle error can thus be carried out more quickly and conveniently than with the conventional method.
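For context, a sketch of the conventional two-setup check that the quick method above simplifies; the relation used is the standard collimation-error formula, and the readings and distances are invented for illustration, not taken from the paper:

```python
# Illustrative two-setup i-angle check. Setup 1 places the instrument midway
# between staffs A and B, so the collimation error cancels and the true height
# difference is obtained. Setup 2 places the instrument close to one staff, so
# the unequal sight lengths expose the i-angle.

RHO_ARCSEC = 206265.0  # arcseconds per radian

def i_angle_arcsec(dh_true: float, dh_observed: float,
                   dist_far: float, dist_near: float) -> float:
    """i-angle (arcseconds) from the true height difference and the height
    difference observed with sight lengths dist_far and dist_near (metres)."""
    return (dh_observed - dh_true) / (dist_far - dist_near) * RHO_ARCSEC

# Setup 1 (equal sight lengths): dh_true = a1 - b1.
dh_true = 1.524 - 1.388          # 0.136 m
# Setup 2 (instrument near staff B): dh_observed = a2 - b2.
dh_observed = 1.697 - 1.558      # 0.139 m
print(i_angle_arcsec(dh_true, dh_observed, dist_far=60.0, dist_near=3.0))
# about 11 arcseconds; the same numbers give the corrected reading on the far staff
```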

9.
With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number-correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real-number item scores. The generalized algorithm is distinct from the Lord-Wingersky algorithm in that it explicitly incorporates the task of figuring out all possible unique real-number test scores in each recursion. Some applications of the generalized recursive algorithm, such as IRT test score reliability estimation and IRT proficiency estimation based on summed test scores, are illustrated with a short test by varying scoring schemes for its items.
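A minimal sketch of the classical Lord-Wingersky recursion for integer number-correct scores (the generalized real-number-score version described in the abstract is not reproduced here; the 3PL form and the item parameters below are illustrative assumptions):

```python
import math

def p_correct(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def lord_wingersky(probs: list[float]) -> list[float]:
    """Conditional distribution of the number-correct score given theta,
    where probs[i] is the probability of answering item i correctly."""
    dist = [1.0]                             # score distribution after 0 items
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for x, mass in enumerate(dist):
            new[x]     += mass * (1.0 - p)   # item answered incorrectly
            new[x + 1] += mass * p           # item answered correctly
        dist = new
    return dist

items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2)]  # (a, b, c)
theta = 0.3
print(lord_wingersky([p_correct(theta, *item) for item in items]))  # sums to 1
```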

10.
What is the extent of error likely with each of several approximations for the standard deviation, internal consistency reliability, and the standard error of measurement? To help answer this question, approximations were compared with exact statistics obtained on 85 different classroom tests constructed and administered by professors in a variety of fields; means and standard deviations of the resulting differences supported the use of approximations in practical situations. Results of this analysis (1) suggest a greater number of alternative formulas that might be employed, and (2) provide additional information concerning the accuracy of approximations with non-normal distributions.
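An illustrative comparison in the same spirit (not the paper's data or its particular set of approximation formulas): exact coefficient alpha for a small 0/1 item-score matrix versus the KR-21 value, which needs only the number of items, the mean, and the variance of total scores:

```python
import numpy as np

scores = np.array([[1, 1, 1, 1, 1],
                   [1, 1, 1, 1, 0],
                   [1, 1, 1, 0, 0],
                   [1, 1, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0]])   # rows = examinees, columns = items

k = scores.shape[1]
total = scores.sum(axis=1)
var_total = total.var(ddof=1)

# Exact coefficient alpha (equals KR-20 for dichotomous items).
alpha = k / (k - 1) * (1 - scores.var(axis=0, ddof=1).sum() / var_total)

# KR-21 approximation from summary statistics alone.
m = total.mean()
kr21 = k / (k - 1) * (1 - m * (k - m) / (k * var_total))

print(round(alpha, 3), round(kr21, 3))  # e.g. 0.833 vs. 0.804 for this toy matrix
```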

11.
An IRT method for estimating conditional standard errors of measurement of scale scores is presented, where scale scores are nonlinear transformations of number-correct scores. The standard errors account for measurement error that is introduced due to rounding scale scores to integers. Procedures for estimating the average conditional standard error of measurement for scale scores and reliability of scale scores are also described. An illustration of the use of the methodology is presented, and the results from the IRT method are compared to the results from a previously developed method that is based on strong true-score theory.
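One common way to write the quantity described here (a sketch of the general form, not necessarily the article's exact estimator): for a raw-to-scale transformation \( s(x) \), possibly rounded to integers, the conditional standard error of measurement of scale scores at proficiency \( \theta \) is \( \mathrm{CSEM}_s(\theta) = \sqrt{\sum_x \left[ s(x) - \bar{s}(\theta) \right]^2 \Pr(x \mid \theta)} \), where \( \bar{s}(\theta) = \sum_x s(x)\Pr(x \mid \theta) \) and \( \Pr(x \mid \theta) \) is the conditional number-correct distribution (e.g., from the Lord-Wingersky recursion sketched under item 9).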

12.
Test reliability is a concept central to classical test theory and it is commonly stated as a requirement that a test attain a certain level of reliability before it be considered of sufficient quality for practical use. This article discusses the role of reliability in item response theory, and in particular the role of reliability in contexts where matrix sampling designs are used and concern is with the estimation of population parameters rather than the measurement of individuals. The concept of a measurement design effect is introduced. This concept parallels the concept of sampling design effects, in that it describes the impact of measurement error at the individual level (described through a reliability index) on the accuracy with which population parameters are estimated.

13.
An alternative interpretation of Livingston's reliability coefficient is based on the notion of the relation of the size of the reliability coefficient to the range of talent. It is shown that the (generally) larger Livingston coefficient does not imply a smaller standard error of measurement and consequently does not imply a more dependable determination of whether or not a true score falls below (or exceeds) a given criterion value.
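For reference, Livingston's coefficient is usually written in the form (stated here from the standard literature, with \( C \) the criterion score; the article should be consulted for its exact notation) \( k^2(X, T_X) = \frac{\rho_{XX'}\sigma_X^2 + (\mu_X - C)^2}{\sigma_X^2 + (\mu_X - C)^2} \). Because the same term \( (\mu_X - C)^2 \) is added to numerator and denominator, the coefficient rises toward 1 as the group mean moves away from the criterion, while the standard error of measurement \( \sigma_X\sqrt{1-\rho_{XX'}} \) is unchanged, which is the point the abstract makes.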

14.
It is well known that measurement error in observable variables induces bias in estimates in standard regression analysis and that structural equation models are a typical solution to this problem. Often, multiple indicator equations are subsumed as part of the structural equation model, allowing for consistent estimation of the relevant regression parameters. In many instances, however, embedding the measurement model into structural equation models is not possible because the model would not be identified. To correct for measurement error one has no other recourse than to provide the exact values of the variances of the measurement error terms of the model, although in practice such variances cannot be ascertained exactly, but only estimated from an independent study. The usual approach so far has been to treat the estimated values of error variances as if they were known exact population values in the subsequent structural equation modeling (SEM) analysis. In this article we show that fixing measurement error variance estimates as if they were true values can make the reported standard errors of the structural parameters of the model smaller than they should be. Inferences about the parameters of interest will be incorrect if the estimated nature of the variances is not taken into account. For general SEM, we derive an explicit expression that provides the terms to be added to the standard errors provided by the standard SEM software that treats the estimated variances as exact population values. Interestingly, we find there is a differential impact of the corrections to be added to the standard errors depending on which parameter of the model is estimated. The theoretical results are illustrated with simulations and also with empirical data on a typical SEM model.

15.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): item score approach (ISA), testlet score approach (TSA), and item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degrees of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA due to the loss of information with IRT approaches. However, this could not be applied in GT. Specified methods in IRT produced higher reliability estimates than those in GT using the same approach. Relatively smaller magnitudes of error in reliability estimates were observed for ISA and for methods in IRT. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is a relatively large dependence among within-testlet items, INTA should be considered for IRT due to nonnegligible loss of information.

16.
This paper presents the results of a simulation study to compare the performance of the Mann-Whitney U test, Student's t test, and the alternate (separate variance) t test for two mutually independent random samples from normal distributions, with both one-tailed and two-tailed alternatives. The estimated probability of a Type I error was controlled (in the sense of being reasonably close to the attainable level) by all three tests when the variances were equal, regardless of the sample sizes. However, it was controlled only by the alternate t test for unequal variances with unequal sample sizes. With equal sample sizes, the probability was controlled by all three tests regardless of the variances. When it was controlled, we also compared the power of these tests and found very little difference. This means that very little power will be lost if the Mann-Whitney U test is used instead of tests that require the assumption of normal distributions.
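A minimal simulation sketch of the comparison described above; the sample sizes, variances, significance level, and replication count are assumptions, not the paper's design:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2 = 10, 30          # unequal sample sizes
sd1, sd2 = 1.0, 3.0      # unequal variances
alpha, reps = 0.05, 5000
rejections = {"mann-whitney": 0, "student t": 0, "alternate (Welch) t": 0}

for _ in range(reps):
    x = rng.normal(0.0, sd1, n1)   # both samples drawn under H0: equal means
    y = rng.normal(0.0, sd2, n2)
    if stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha:
        rejections["mann-whitney"] += 1
    if stats.ttest_ind(x, y, equal_var=True).pvalue < alpha:
        rejections["student t"] += 1
    if stats.ttest_ind(x, y, equal_var=False).pvalue < alpha:
        rejections["alternate (Welch) t"] += 1

# Estimated Type I error rates; only the separate-variance (Welch) t test
# should stay near 0.05 under this unequal-variance, unequal-n condition.
print({name: count / reps for name, count in rejections.items()})
```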

17.
Laozi's Dao and Einstein's God are, respectively, products of each thinker's understanding, interpretation, and experience of the nature and laws of the world. Einstein's concept of God corresponds to the concept of the "Way of Heaven" (tian zhi dao) in Laozi's philosophy. Laozi's Dao and Einstein's God served not only as the intellectual foundation and the source of scientific intuition and inspiration in scientific and cultural inquiry, but also as the worldview, methodology, aesthetic outlook, and standard of value judgment for those engaged in such inquiry.

18.
A reliability coefficient for criterion-referenced tests is developed from the assumptions of classical test theory. This coefficient is based on deviations of scores from the criterion score, rather than from the mean. The coefficient is shown to have several of the important properties of the conventional norm-referenced reliability coefficient, including its interpretation as a ratio of variances and as a correlation between parallel forms, its relationship to test length, its estimation from a single form of a test, and its use in correcting for attenuation due to measurement error. Norm-referenced measurement is considered as a special case of criterion-referenced measurement.

19.
In discussion of the properties of criterion-referenced tests, it is often assumed that traditional reliability indices, particularly those based on internal consistency, are not relevant. However, if the measurement errors involved in using an individual's observed score on a criterion-referenced test to estimate his or her universe scores on a domain of items are compared to errors of an a priori procedure that assigns the same universe score (the mean observed test score) to all persons, the test-based procedure is found to improve the accuracy of universe score estimates only if the test reliability is above 0.5. This suggests that criterion-referenced tests with low reliabilities generally will have limited use in estimating universe scores on domains of items.
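A sketch of why 0.5 is the threshold, using standard classical-test-theory identities rather than the article's own derivation: estimating each examinee's universe score by the observed score has error variance \( (1-\rho)\sigma_X^2 \), whereas assigning every examinee the mean has error variance equal to the universe-score variance \( \rho\sigma_X^2 \) (ignoring sampling error in the mean); the test-based estimate is more accurate only when \( (1-\rho)\sigma_X^2 < \rho\sigma_X^2 \), that is, when \( \rho > 0.5 \).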

20.
The standard error of measurement (SEM) is the standard deviation of errors of measurement that are associated with test scores from a particular group of examinees. When used to calculate confidence bands around obtained test scores, it can be helpful in expressing the unreliability of individual test scores in an understandable way. Score bands can also be used to interpret intraindividual and interindividual score differences. Interpreters should be wary of over-interpretation when using approximations for correctly calculated score bands. It is recommended that SEMs at various score levels be used in calculating score bands rather than a single SEM value.
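A small sketch of the score-band idea, using conditional SEMs at several score levels as recommended above; the z value, the score levels, and the SEM values are illustrative assumptions:

```python
def score_band(observed: float, sem: float, z: float = 1.0) -> tuple[float, float]:
    """Confidence band of roughly 68% (z = 1) around an obtained score."""
    return observed - z * sem, observed + z * sem

# Conditional SEMs reported at different score levels, rather than one overall SEM.
sem_by_level = {20: 3.1, 30: 2.4, 40: 1.6}

for score, sem in sem_by_level.items():
    low, high = score_band(score, sem)
    print(f"obtained {score}: band {low:.1f} to {high:.1f}")
```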
