Similar Documents
20 similar documents found
1.
In this paper, assessments of faculty performance for the determination of salary increases are analyzed to estimate interrater reliability. Using the independent ratings by six elected members of the faculty, correlations between the ratings are calculated and estimates of the reliability of the composite (group) ratings are generated. Average intercorrelations are found to range from 0.603 for teaching to 0.850 for research. The average intercorrelation for the overall faculty ratings is 0.794. Using these correlations, the reliability of the six-person group (the composite reliability) is estimated to be over 0.900 for each of the three areas and 0.959 for the overall faculty rating. Furthermore, little correlation is found between the ratings of performance levels of individual faculty members in the three areas of research, teaching, and service. The high intercorrelations and, consequently, the high composite reliabilities suggest that a reduction in the number of raters would have relatively small effects on reliability. The findings are discussed in terms of their relationship to issues of validity as well as to other questions of faculty assessment.
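For context, composite figures of this kind follow from the Spearman-Brown prophecy formula applied to the average interrater correlation; a minimal sketch reproducing the reported overall value (function and variable names are ours):

```python
def spearman_brown(avg_r: float, k: int) -> float:
    """Reliability of a composite of k parallel raters with average intercorrelation avg_r."""
    return k * avg_r / (1 + (k - 1) * avg_r)

# Overall faculty rating: average intercorrelation 0.794, six raters.
print(round(spearman_brown(0.794, 6), 3))  # 0.959, matching the reported composite reliability

# Effect of shrinking the panel, relevant to the authors' point about fewer raters:
for k in (6, 5, 4, 3):
    print(k, round(spearman_brown(0.794, k), 3))
```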

2.
Common factor scores were compared to unfactored data-level variables as predictors in terms of the correlation of a criterion with the predicted value in multiple regression equations applied to replication (cross-validation) samples. Data were generated by computer to provide populations with three different degrees of common variance inherent in their predictor variable intercorrelation matrices. Two replication populations differing from the original by specified amounts in their intercorrelation matrices were created for each common variance level. Results indicated that shrinkage was less for factor scores than for data-level variables for all combinations of common variance and difference of replication population. Moreover, the actual correlation describing accuracy of prediction was higher for factor scores than for data-level variables at the extreme conditions of common variance and difference of replication population.

3.
This report is a review of reliability data on the PPVT obtained from 32 research studies published between 1965 and 1974. Much of the research was done on Head Start children. Overall, the median of reliability coefficients reported here (0.72) has remained remarkably close to the original median of 0.77 found in standardizing the test. Unexpectedly, elapsed time between test and retest had only a slight effect on the reliability coefficients. However, as expected, the greater the range in ages and ability levels of subjects, the higher the reliabilities. For average children in the elementary grades, and for retarded people of all ages, PPVT scores remained relatively stable over time and there was close equivalence between alternate forms. Scores were least stable for preschool children, especially those from minority groups. Black preschool girls were more variable in their performance on the PPVT than boys, and preschool girls generally were more responsive than boys to play periods conducted before testing was begun. A number of variables associated with examiners and setting affected the scores on the test. As expected, raw scores tended to yield slightly higher reliabilities than MA scores and considerably higher reliabilities than IQ scores.

4.
Many prominent intelligence tests (e.g., Wechsler Intelligence Scale for Children, Fifth Edition [WISC-V] and Reynolds Intellectual Abilities Scale, Second Edition [RIAS-2]) offer methods for computing subtest- and composite-level difference scores. This study uses data provided in the technical manuals of the WISC-V and RIAS-2 to calculate reliability coefficients for difference scores. Subtest-level difference score reliabilities range from 0.59 to 0.99 for the RIAS-2 and from 0.53 to 0.87 for the WISC-V. Composite-level difference score reliabilities generally range from 0.23 to 0.95 for the RIAS-2 and from 0.36 to 0.87 for the WISC-V. Emphasis is placed on comparisons recommended by test publishers, and a discussion of minimum requirements for interpretation of difference scores is provided.
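For context, difference-score reliabilities such as these are conventionally computed from the classical equal-variance formula (the publishers' exact variant may differ):

$$\rho_{DD'} = \frac{\tfrac{1}{2}\left(\rho_{XX'} + \rho_{YY'}\right) - \rho_{XY}}{1 - \rho_{XY}}$$

where $\rho_{XX'}$ and $\rho_{YY'}$ are the reliabilities of the two scores being compared and $\rho_{XY}$ is their correlation. As $\rho_{XY}$ approaches the average of the two reliabilities, $\rho_{DD'}$ approaches zero, which is why highly correlated composites can yield values as low as the 0.23 reported above.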

5.
Several parts of the STEP Writing Test, Level 1, were administered to 14 different groups of 19 to 52 high school students. In the testing situations, scores were computed using the following scoring functions: (a) probability assigned to the correct answer, (b) the logarithmic function, (c) the spherical function, (d) the Euclidean function, and (e) inferred choice. Reliabilities of the scores obtained by means of each scoring function were computed. Comparisons between the reliabilities showed that the simplest and most intuitive function, the probability assigned to the correct answer, produced the highest reliability in comparison with any of the other functions. The data suggest that in the absence of information about the scoring system, subjects assign their confidence in multiple-choice responses on the basis of the intuitively simplest payoff model, and that reliability decreases as scoring functions generate item scores which are progressively discrepant from scores generated by the simplest model.
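For reference, common definitions of the first four scoring rules are sketched below; the form used for the Euclidean rule is an assumption on our part, the study's exact scalings may differ, and rule (e), inferred choice, is not probability-based and is omitted:

```python
import math

def prob_score(p, correct):
    """(a) Probability assigned to the keyed correct answer."""
    return p[correct]

def log_score(p, correct):
    """(b) Logarithmic scoring function."""
    return math.log(p[correct])

def spherical_score(p, correct):
    """(c) Spherical scoring function."""
    return p[correct] / math.sqrt(sum(q * q for q in p))

def euclidean_score(p, correct):
    """(d) Euclidean rule, taken here as 1 minus the Euclidean distance
    between the stated probabilities and the keyed indicator vector
    (an assumed form, not confirmed by the abstract)."""
    target = [1.0 if i == correct else 0.0 for i in range(len(p))]
    return 1.0 - math.sqrt(sum((q - t) ** 2 for q, t in zip(p, target)))

# A student spreads confidence over three options; option 0 is keyed correct.
p = [0.6, 0.3, 0.1]
for f in (prob_score, log_score, spherical_score, euclidean_score):
    print(f.__name__, round(f(p, 0), 3))
```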

6.
A common practice in the field of learning disabilities is analysis of ability-achievement discrepancies. The reliability of discrepancy scores is an important statistic in such decision making. In this study, selected ability and achievement devices were administered to a sample of low achievers (N = 99), and the reliability of various difference scores was analyzed. In all cases, the reliabilities of difference scores were moderately high. Reliabilities of differences for devices normed on the same population and differences for devices normed on different populations were comparable. These results are discussed in light of current psychometric practices.

7.
Muller, Calhoun, and Orling (1972) conclude that test reliability is dependent on the type of answer document used by elementary pupils. The present study was designed in part to assess the differential effect of two pupil response procedures (answering directly in the test booklet versus on a separate answer folder) on Metropolitan Achievement Tests scores of pupils in grades 3 and 4.
Over 4000 pupils from nine school systems took the Metropolitan, half responding in their booklets and half using answer folders. The two groups were matched by grade in general scholastic aptitude.
Although the separate answer folder group received lower scores than did the group responding in the test booklets, the score reliabilities did not differ significantly for any test. Additionally, these reliabilities did not differ significantly from comparable Metropolitan normative reliabilities. For survey achievement tests such as the Metropolitan, test reliability would not appear to depend on pupil response mode.

8.
One of the most widely used methods for equating multiple parallel forms of a test is to incorporate a common set of anchor items in all its operational forms. Under appropriate assumptions it is possible to derive a linear equation for converting raw scores from one operational form to the others. The present note points out that the single most important determinant of the efficiency of the equating process is the magnitude of the correlation between the anchor test and the unique components of each form. It is suggested to use some monotonic function of this correlation as a measure of the equating efficiency, and a simple model relating the relative length of the anchor test and the test reliability to this measure of efficiency is presented.
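As a rough sketch of that dependence (our notation, under the strong assumption that the anchor and unique components are parallel measures of a single trait): a part-test of relative length $w$ taken from a full form with reliability $\rho$ has, by the Spearman-Brown step-down,

$$\rho_w = \frac{w\rho}{1 + (w - 1)\rho},$$

and the correlation between the anchor $A$ (relative length $w_A$) and a unique component $U$ (relative length $w_U$) is then $\rho_{AU} = \sqrt{\rho_{w_A}\,\rho_{w_U}}$, so lengthening the anchor or raising the full-test reliability both raise this measure of equating efficiency.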

9.
A number of mental-test theorists have called attention to the fact that increasing test reliability beyond an optimal point can actually lead to a decrement in the validity of that test with respect to a criterion. This non-monotonic relation between reliability and validity has been referred to by Loevinger as the “attenuation paradox,” because Spearman’s correction for attenuation leads one to expect that increasing reliability will always increase validity. In this paper a mathematical link between test reliability and test validity is derived which takes into account the correlation between error scores on a test and error scores on a criterion measure the test is designed to predict. It is proved that when the correlation between these two sets of error scores is positive, the non-monotonic relation between test reliability and test validity which has been viewed as a paradox occurs universally.
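For context, Spearman's correction for attenuation is

$$\rho_{T_X T_Y} = \frac{\rho_{XY}}{\sqrt{\rho_{XX'}\,\rho_{YY'}}},$$

so rearranging as $\rho_{XY} = \rho_{T_X T_Y}\sqrt{\rho_{XX'}\rho_{YY'}}$ suggests that validity $\rho_{XY}$ should rise monotonically with reliability $\rho_{XX'}$; the paper's point is that this expectation fails once error scores on the test and on the criterion are allowed to correlate.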

10.
Projecting the changes in the reliability of a difference score (D = X − Y) as a consequence of changes in the reliabilities of X and Y does not represent a straightforward application of the Spearman-Brown formula. Formulas are developed for estimating the changes in the reliability of X − Y under two possible assumptions: (a) X and Y have equal variances both before and after their reliabilities are altered, and (b) X and Y have unequal variances before and after X and Y are modified. The second of these situations, which includes the first as a special case, is probably the more common.
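A sketch of why the projection is not a one-line Spearman-Brown step-up, using standard classical-test-theory identities for the equal-variance case (assumption (a); the numbers are illustrative):

```python
def diff_reliability(rx, ry, rxy):
    """Reliability of D = X - Y for standardized X and Y (equal variances)."""
    return (rx + ry - 2 * rxy) / (2 - 2 * rxy)

def spearman_brown(r, k):
    """Reliability of a test lengthened k times."""
    return k * r / (1 + (k - 1) * r)

def lengthened_corr(rxy, rx, ry, k):
    """Observed correlation between X and Y after both are lengthened k times."""
    return k * rxy / ((1 + (k - 1) * rx) * (1 + (k - 1) * ry)) ** 0.5

rx, ry, rxy, k = 0.90, 0.60, 0.50, 2
rd = diff_reliability(rx, ry, rxy)                 # 0.500 before lengthening
rd_proper = diff_reliability(spearman_brown(rx, k),
                             spearman_brown(ry, k),
                             lengthened_corr(rxy, rx, ry, k))  # ~0.645
rd_naive = spearman_brown(rd, k)                   # ~0.667 -- not the same
print(round(rd, 3), round(rd_proper, 3), round(rd_naive, 3))
```

(When the two component reliabilities are equal, the two routes happen to coincide in this equal-variance case; with unequal reliabilities or variances they do not, which is the situation the paper's formulas address.)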

11.
Formulas for the standard error of a parallel-test correlation and for the Kuder-Richardson formula 20 reliability estimate are provided. Given equal values of the two reliabilities in the population, the standard error of the Kuder-Richardson formula 20 is shown to be somewhat smaller than the standard error of a parallel-test correlation for reliability values, sample sizes, and test lengths that are usually encountered in practice.
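For reference, the Kuder-Richardson formula 20 estimate for $k$ dichotomous items is $\mathrm{KR}_{20} = \frac{k}{k-1}\left(1 - \sum_j p_j q_j / \sigma_X^2\right)$; a minimal sketch (variance conventions differ slightly across sources):

```python
import numpy as np

def kr20(X: np.ndarray) -> float:
    """KR-20 for a persons-by-items matrix of 0/1 item scores."""
    k = X.shape[1]
    p = X.mean(axis=0)                      # item proportion-correct values
    item_var = (p * (1 - p)).sum()          # sum of item variances p*q
    total_var = X.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

# Toy data: 200 examinees, 30 items, ability shifts each person's success rate.
rng = np.random.default_rng(0)
X = (rng.random((200, 30)) < 0.5 + 0.3 * rng.random((200, 1))).astype(int)
print(round(kr20(X), 3))
```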

12.
In this study, we focused on increasing the reliability of ability-achievement difference scores using the Kaufman Assessment Battery for Children (KABC) as an example. Ability-achievement difference scores are often used as indicators of learning disabilities, but when they are derived from traditional equally weighted ability and achievement scores, they have suboptimal psychometric properties because of the high correlations between the scores. As an alternative to equally weighted difference scores, we examined an orthogonal reliable component analysis (RCA) solution and an oblique principal component analysis (PCA) solution for the standardization sample of the KABC (among 5- to 12-year-olds). The components were easily identifiable as the simultaneous processing, sequential processing, and achievement constructs assessed by the KABC. As judged via the score intercorrelations, all three types of scores had adequate convergent validity, while the orthogonal RCA scores had superior discriminant validity, followed by the oblique PCA scores. Differences between the orthogonal RCA scores were more reliable than differences between the oblique PCA scores, which were in turn more reliable than differences between the traditional equally weighted scores. The increased reliability with which the KABC differences are assessed with the orthogonal RCA method has important practical implications, including narrower confidence intervals around difference scores used in individual administrations of the KABC.

13.
In discussions of the properties of criterion-referenced tests, it is often assumed that traditional reliability indices, particularly those based on internal consistency, are not relevant. However, if the measurement errors involved in using an individual's observed score on a criterion-referenced test to estimate his or her universe score on a domain of items are compared to the errors of an a priori procedure that assigns the same universe score (the mean observed test score) to all persons, the test-based procedure is found to improve the accuracy of universe score estimates only if the test reliability is above 0.5. This suggests that criterion-referenced tests with low reliabilities generally will have limited use in estimating universe scores on domains of items.
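The 0.5 threshold follows from a short classical-test-theory argument (a sketch, with all variances on the observed-score scale): estimating the universe score $T$ by the observed score $X$ incurs error variance $\sigma_E^2 = (1-\rho)\sigma_X^2$, while assigning every person the mean incurs error variance $\sigma_T^2 = \rho\,\sigma_X^2$; the test-based estimate wins exactly when $(1-\rho)\sigma_X^2 < \rho\,\sigma_X^2$, that is, when $\rho > 0.5$.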

14.
Applied Measurement in Education, 2013, 26(3), 249-253
A test segment that lacks content validity with respect to a criterion may be deleted for that reason. At issue is the effect on reliability and validity as measured by the coefficients arising from classical test theory. Assuming that the predictor test has some reasonable degree of internal consistency, deleting a segment of meaningful size is certain to reduce reliability. However, Feldt (1997) showed that a concomitant rise in the validity coefficient may occur under certain limited conditions. The present research further characterizes the circumstances under which validity changes may occur as a result of deletion of a predictor test segment. Specifically, for a positive outcome, one seeks a relatively large correlation between the scores from the deleted segment and the remaining items coupled with a relatively low correlation between scores from the deleted segment and the criterion.

15.
This study was an investigation of the relation between the reliability of difference scores, considered as a parameter characterizing a population of examinees, and the reliability estimates obtained from random samples from the population. The parameters in familiar equations for the reliability of difference scores were redefined in such a way that determinants of reliability in both populations and samples become more transparent. Computer simulation was used to find sample values and to plot frequency distributions of various correlations and variance ratios relevant to the reliability of differences. The shape of frequency distributions resulting from the simulations and the means and standard deviations of these distributions reveal the extent to which reliability estimates based on sample data can be expected to meaningfully represent population reliability.
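A minimal version of this kind of simulation (the parameter values, and the choice to treat the component reliabilities as known and estimate only the correlation, are ours for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_diff_reliability(rho_x, rho_y, rho_t, n, reps=2000):
    """Sampling distribution of the estimated reliability of D = X - Y.

    True scores Tx, Ty correlate rho_t; X and Y add error so that their
    reliabilities are rho_x and rho_y (all variables standardized).
    """
    estimates = []
    for _ in range(reps):
        tx = rng.standard_normal(n)
        ty = rho_t * tx + np.sqrt(1 - rho_t**2) * rng.standard_normal(n)
        x = np.sqrt(rho_x) * tx + np.sqrt(1 - rho_x) * rng.standard_normal(n)
        y = np.sqrt(rho_y) * ty + np.sqrt(1 - rho_y) * rng.standard_normal(n)
        rxy = np.corrcoef(x, y)[0, 1]
        # Plug-in estimate of the reliability of X - Y (equal-variance form)
        estimates.append((rho_x + rho_y - 2 * rxy) / (2 - 2 * rxy))
    return np.asarray(estimates)

est = simulate_diff_reliability(rho_x=0.8, rho_y=0.8, rho_t=0.6, n=100)
print(est.mean().round(3), est.std().round(3))
```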

16.
This study addressed the issue of specificity in reading disability by comparing two approaches to defining and selecting children with reading disabilities. One approach defined reading disability according to cutoff scores representing appropriate levels of intelligence and reading deficiency, whereas the other approach adjusted these scores for their intercorrelation through regression procedures. Results revealed clear differences in which children were identified as reading disabled according to the two definitions. However, differences in neuropsychological performance between children whose reading scores were discrepant or not discrepant with IQ were small and nonspecific for both definitions. The results of this study show that children identified as reading disabled vary according to the definition employed; at this point, there is little evidence suggesting any specificity of reading disability according to definition.
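For context, the regression adjustment is conventionally done as sketched below on the usual IQ-metric scale (mean 100, SD 15); the correlation and cutoff values here are illustrative assumptions, not the study's:

```python
import math

R_IQ_READ = 0.60  # assumed IQ-reading correlation; study-specific in practice

def cutoff_definition(iq, reading, iq_min=85, read_max=85):
    """Reading disability via simple cutoff scores."""
    return iq >= iq_min and reading <= read_max

def regression_definition(iq, reading, z_crit=-1.5):
    """Reading disability via IQ-predicted reading, adjusting for the correlation."""
    expected = 100 + R_IQ_READ * (iq - 100)        # regression-predicted reading score
    resid_sd = 15 * math.sqrt(1 - R_IQ_READ**2)    # SD of residuals around the prediction
    return (reading - expected) / resid_sd <= z_crit

# The same child can be flagged by one definition but not the other:
print(cutoff_definition(100, 84), regression_definition(100, 84))  # True, False
```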

17.
The matched pair technique for writing and scoring true-false items was designed to compensate for the acquiescence response set of primary grade children. The claim that this technique increases reliability to an appreciable extent over traditional true-false scoring was investigated by comparing alpha internal consistency coefficients computed for the matched pair true-false, traditional true-false, and three other scoring schemes. Both the total sample coefficients and individual classroom coefficients were computed from the standardization sample of a primary grade economics achievement test (Primary Test of Economic Understanding). Classroom reliability coefficients computed from the matched pair scores were found to be higher than those from scores computed by the other methods. Total sample coefficients obtained from four of the five methods were nearly equal. Evidence of the effects of each scoring technique on concurrent validity is also presented. Contrary to expectations, the correlations of traditional and matched pair scores with Iowa Test of Basic Skills (ITBS) subtests (when adjusted for differing reliabilities) were approximately equal.

18.
This article evaluates a procedure-based scoring system for a performance assessment (an observed paper towels investigation) and a notebook surrogate completed by fifth-grade students varying in hands-on science experience. Results suggested interrater reliability of scores for observed performance and notebooks was adequate (>.80), with the reliability of the former higher. In contrast, interrater agreement on procedures was higher for observed hands-on performance (.92) than for notebooks (.66). Moreover, for the notebooks, the reliability of scores and agreement on procedures varied by student experience, but this was not so for observed performance. Both the observed-performance and notebook measures correlated less with traditional ability than did a multiple-choice science achievement test. The correlation between the two performance assessments and the multiple-choice test was only moderate (mean = .46), suggesting that different aspects of science achievement have been measured. Finally, the correlation between the observed-performance scores and the notebook scores was .83, suggesting that notebooks may provide a reasonable, albeit less reliable, surrogate for the observed hands-on performance of students.

19.
This study examined the reliability and validity of scores on a fluency‐based measure of reading comprehension. The Dynamic Indicators of Basic Early Literacy Skills (DIBELS; 6th ed. revised) Retell Fluency (RTF), Oral Reading Fluency (DORF), and Woodcock Johnson III NU Tests of Achievement (WJ‐III NU ACH) Reading Comprehension measures were administered to fourth‐grade students. Results indicated a large difference between real time and recorded retell fluency scores for each passage. In addition, students' retell fluency scores had a low correlation with their reading comprehension scores. In light of these findings, practitioners may want to exercise caution in using fluency‐based story‐retell scores as a measure of reading comprehension. © 2011 Wiley Periodicals, Inc.

20.
The purpose of this study was to determine in what way Guttman weighting affected the internal consistency and intercorrelation of the subtests of the Scholastic Aptitude Test. The tests were first scored with Guttman weights and then with conventional correction-for-guessing weights. The internal consistency of the tests increased markedly when Guttman weights were used. The correlation of the two verbal subtests increased somewhat when Guttman weights were used, but the correlation of the two mathematics subtests as well as the intercorrelation of all verbal and mathematics subtests decreased. Differences in the factor structure of the Guttman- and conventionally-weighted subtests were used to explain the result.
