Similar Articles
20 similar articles found
1.
Abstract

This study investigated the reliability, validity, and utility of the following three measures of letter-formation quality: (a) a holistic rating system, in which examiners rated letters on a five-point Likert-type scale; (b) a holistic rating system with model letters, in which examiners used model letters that exemplified specific criterion scores to rate letters; and (c) a correct/incorrect procedure, in which examiners used transparent overlays and standard verbal criteria to score letters. Intrarater and interrater reliability coefficients revealed that the two holistic scoring procedures were unreliable, whereas scores obtained by examiners who used the correct/incorrect procedure were consistent over time and across examiners. Although all three of the target measures were sensitive to differences between individual letters, only the scores from the two holistic procedures were associated with other indices of handwriting performance. Furthermore, for each of the target measures, variability in scores was, for the most part, not attributable to the level of experience or sex of the respondents. Findings are discussed with respect to criteria for validating an assessment instrument.

2.
Value-added scores from tests of college learning indicate how score gains compare to those expected from students of similar entering academic ability. Unfortunately, the choice of value-added model can impact results, and this makes it difficult to determine which results to trust. The research presented here demonstrates how value-added models can be compared on three criteria: reliability, year-to-year consistency and information about score precision. To illustrate, the original Collegiate Learning Assessment value-added model is compared to a new model that employs hierarchical linear modelling. Results indicate that scores produced by the two models are similar, but the new model produces scores that are more reliable and more consistent across years. Furthermore, the new approach provides school-specific indicators of value-added score precision. Although the reliability of value-added scores is sufficient to inform discussions about improving general education programmes, reliability is currently inadequate for making dependable, high-stakes comparisons between postsecondary institutions.
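As a hedged sketch only: a value-added score of this kind is, in broad strokes, the difference between a school's observed mean score and the mean predicted from its students' entering academic ability. In a minimal OLS version, with hypothetical symbols that are not the CLA's exact specification,

\[
\mathrm{VA}_j \;=\; \bar{y}_j - \big(\hat\beta_0 + \hat\beta_1\,\bar{x}_j\big),
\]

where \(\bar{y}_j\) is institution \(j\)'s mean outcome score and \(\bar{x}_j\) its mean entering-ability measure. A hierarchical linear modelling variant instead uses empirical-Bayes residuals, which shrink imprecise school means toward the regression line and come with school-specific standard errors, consistent with the precision indicators described above.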

3.
It is widely recognized that the reliability of a difference score depends on the reliabilities of the constituent scores and their intercorrelation. Authors often use a well-known identity to express the reliability of a difference as a function of the reliabilities of the components, assuming that the intercorrelation remains constant. This approach is misleading, because the familiar formula is a composite function in which the correlation between components is a function of reliability. An alternative formula, containing the correlation between true scores instead of the correlation between observed scores, provides more useful information and yields values that are not quite as anomalous as the ones usually obtained.
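The well-known identity in question is presumably the classical formula for the reliability of the difference D = X − Y:

\[
\rho_{DD'} \;=\; \frac{\sigma_X^{2}\rho_{XX'} + \sigma_Y^{2}\rho_{YY'} - 2\sigma_X\sigma_Y\rho_{XY}}{\sigma_X^{2} + \sigma_Y^{2} - 2\sigma_X\sigma_Y\rho_{XY}}.
\]

The composite-function point can be seen from the attenuation relation \(\rho_{XY} = \rho_{T_X T_Y}\sqrt{\rho_{XX'}\rho_{YY'}}\): holding the observed correlation \(\rho_{XY}\) fixed while the component reliabilities vary forces the true-score correlation \(\rho_{T_X T_Y}\) to change, whereas substituting \(\rho_{T_X T_Y}\) directly into the identity avoids this confound.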

4.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.
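As a sketch of the general recipe (not necessarily the article's exact estimator), a strong true-score model yields conditional standard errors for scale scores by conditioning on the proportional true score \(\tau\):

\[
\sigma^{2}(s \mid \tau) \;=\; \sum_{x=0}^{n}\big[s(x)\big]^{2}\Pr(X = x \mid \tau) \;-\; \Big[\sum_{x=0}^{n}s(x)\Pr(X = x \mid \tau)\Big]^{2},
\]

where \(s(x)\) is the raw-to-scale transformation and \(\Pr(X = x \mid \tau)\) comes from the assumed conditional distribution (e.g., binomial or compound binomial). The conditional standard error of measurement is the square root, and averaging these conditional error variances over the fitted true-score distribution gives the marginal error variance used in the scale-score reliability.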

5.
Assessing Writing, 2008, 13(3), 201–218
Using generalizability theory, this study examined both the rating variability and reliability of ESL students’ writing in the provincial English examinations in Canada. Three years’ data were used in order to complete the analyses and examine the stability of the results. The major research question that guided this study was: Are there any differences between the rating variability and reliability of the writing scores assigned to ESL students and to Native English (NE) students in the writing components of the provincial examinations across three years? A series of generalizability studies and decision studies was conducted. Results showed that differences in score variation did exist between ESL and NE students when adjudicated scores were used. First, there was a large effect for both language group and person within language-by-task interaction. Second, the unwanted residual variance component was significantly larger for ESL students than for NE students in all three years. Finally, the desired variance associated with the object of measurement was significantly smaller for ESL students than for NE students in one year. Consequently, the observed generalizability coefficient for ESL students was significantly lower than that for NE students in that year. These findings raise a potential question about the fairness of the writing scores assigned to ESL students.

6.
Scale scores for educational tests can be made more interpretable by incorporating score precision information at the time the score scale is established. Methods for incorporating this information are examined that are applicable to testing situations with number-correct scoring. Both linear and nonlinear methods are described. These methods can be used to construct score scales that discourage the overinterpretation of small differences in scores. The application of the nonlinear methods also results in scale scores that have nearly equal error variability along the score scale and that possess the property that adding a specified number of points to and subtracting the same number of points from any examinee's scale score produces an approximate two-sided confidence interval with a specified coverage. These nonlinear methods use an arcsine transformation to stabilize measurement error variance for transformed scores. The methods are compared through the use of illustrative examples. The effect of rounding on measurement error variability is also considered and illustrated using stanines.
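The variance-stabilizing transformation referred to is presumably the standard arcsine form for a number-correct score \(x\) on \(n\) items,

\[
g(x) \;=\; \sin^{-1}\!\sqrt{x/n},
\]

whose conditional error variance is approximately \(1/(4n)\) at every level of true proportion-correct, in contrast to the level-dependent \(\tau(1-\tau)/n\) of the raw proportion. A linear rescaling of \(g(x)\) to the reporting metric then yields scale scores with nearly equal conditional standard errors, so that adding and subtracting a fixed number of points gives the approximate two-sided confidence interval described above.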

7.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): item score approach (ISA), testlet score approach (TSA), and item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degrees of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA due to the loss of information with IRT approaches; however, this pattern did not appear in GT. For a given approach, the IRT methods produced higher reliability estimates than the GT methods. Relatively smaller magnitudes of error in reliability estimates were observed for ISA and for methods in IRT. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is a relatively large dependence among within-testlet items, INTA should be considered for IRT due to nonnegligible loss of information.

8.
In many of the methods currently proposed for standard setting, all experts are asked to judge all items, and the standard is taken as the mean of their judgments. When resources are limited, gathering the judgments of all experts in a single group can become impractical. Multiple matrix sampling (MMS) provides an alternative. This paper applies MMS to a variation on Angoff's (1971) method of standard setting. A pool of 36 experts and 190 items were divided randomly into 5 groups, and estimates of borderline examinee performance were acquired. Results indicated some variability in the cutting scores produced by the individual groups, but the variance components were reasonably well estimated. The standard error of the cutting score was very small, and the width of the 90% confidence interval around it was only 1.3 items. The reliability of the final cutting score was .98.
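As a toy illustration of the mechanics only (hypothetical ratings; two groups of two experts and three items rather than the study's five groups, 36 experts, and 190 items; the study itself estimated variance components, whereas this sketch uses a simple across-group standard error as a stand-in):

```python
import numpy as np

# Hypothetical Angoff-style judgments: each array holds one matrix-sampled
# group's ratings (rows = that group's experts, columns = its items);
# entries are judged probabilities that a borderline examinee answers
# the item correctly.
group_ratings = [
    np.array([[0.60, 0.45, 0.70],
              [0.55, 0.50, 0.65]]),   # group 1
    np.array([[0.58, 0.42, 0.72],
              [0.60, 0.48, 0.68]]),   # group 2
]

# Each group's cutting score (as a proportion correct) is the mean of its
# judgments; the overall standard is the mean of the group cutting scores.
group_cuts = np.array([ratings.mean() for ratings in group_ratings])
cut = group_cuts.mean()
se = group_cuts.std(ddof=1) / np.sqrt(len(group_cuts))

print(f"cutting score = {cut:.3f} (proportion), SE across groups = {se:.3f}")
```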

9.
New measures of college selectivity
Institutional averages of entering freshman scores on the Scholastic Aptitude Test (SAT) and the American College Test (ACT) were combined and edited to produce a single institutional measure of selectivity for 2,601 institutions. Older scores were adjusted to reflect decreasing performance over time, and ACT scores were converted to SAT equivalents, resulting in a final measure that reflects 1973 performance levels and is expressed as an SAT Verbal plus Mathematical score (range 400–1,600). Actual scores were available for 1,803 schools; the remaining schools with missing values were given an imputed score based upon means from similar institutions among the 1,803. Correlations between scores from different years and between the final measure and 19 institutional attributes indicated substantial reliability and validity for the selectivity measure.

10.
In this study, we focused on increasing the reliability of ability-achievement difference scores using the Kaufman Assessment Battery for Children (KABC) as an example. Ability-achievement difference scores are often used as indicators of learning disabilities, but when they are derived from traditional equally weighted ability and achievement scores, they have suboptimal psychometric properties because of the high correlations between the scores. As an alternative to equally weighted difference scores, we examined an orthogonal reliable component analysis (RCA) solution and an oblique principal component analysis (PCA) solution for the standardization sample of the KABC (among 5- to 12-year-olds). The components were easily identifiable as the simultaneous processing, sequential processing, and achievement constructs assessed by the KABC. As judged via the score intercorrelations, all three types of scores had adequate convergent validity, while the orthogonal RCA scores had superior discriminant validity, followed by the oblique PCA scores. Differences between the orthogonal RCA scores were more reliable than differences between the oblique PCA scores, which were in turn more reliable than differences between the traditional equally weighted scores. The increased reliability with which the KABC differences are assessed with the orthogonal RCA method has important practical implications, including narrower confidence intervals around difference scores used in individual administrations of the KABC.

11.
Two conventional scores and a weighted score on a group test of general intelligence were compared for reliability and predictive validity. One conventional score consisted of the number of correct answers an examinee gave in responding to 69 multiple-choice questions; the other was the formula score obtained by subtracting from the number of correct answers a fraction of the number of wrong answers. A weighted score was obtained by assigning weights to all the response alternatives of all the questions and adding the weights associated with the responses, both correct and incorrect, made by the examinee. The weights were derived from degree-of-correctness judgments of the set of response alternatives to each question. Reliability was estimated using a split-half procedure; predictive validity was estimated from the correlation between test scores and mean school achievement. Both conventional scores were found to be significantly less reliable but significantly more valid than the weighted scores. (The formula scores were neither significantly less reliable nor significantly more valid than number-correct scores.)

12.
《教育实用测度》2013,26(3):221-240
The scores on 2 distinct tests (e.g., essay and objective) are often combined to create a composite score, which is used to make decisions. The validity of the observed composite can sometimes be evaluated relative to an external criterion. However, in cases where no criterion is available, the observed composite has generally been evaluated in terms of its reliability. The analyses in this article are based on a simple, content-based model for the validity of the observed composite as an estimate of a target composite, based on a priori weights for the 2 tests. The results suggest that giving extra weight to the more reliable of the 2 observed scores tends to improve the reliability of the composite, and up to a point tends to improve its validity. Giving too much weight to the more reliable score can decrease the validity of the observed composite as a measure of the target composite.
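The reliability side of this trade-off follows from classical theory. For a composite \(C = w_1X_1 + w_2X_2\) with uncorrelated errors,

\[
\rho_{CC'} \;=\; 1 - \frac{w_1^{2}\sigma_1^{2}(1-\rho_1) + w_2^{2}\sigma_2^{2}(1-\rho_2)}{w_1^{2}\sigma_1^{2} + w_2^{2}\sigma_2^{2} + 2w_1w_2\sigma_1\sigma_2\rho_{12}},
\]

so shifting weight toward the more reliable component shrinks the error variance in the numerator. Validity with respect to the target composite, by contrast, also depends on how far the operational weights depart from the a priori weights, which is why too large a shift eventually lowers validity even as reliability rises.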

13.
Although reliability of subscale scores may be suspect, subscale scores are the most common type of diagnostic information included in student score reports. This research compared methods for augmenting the reliability of subscale scores for an 8th-grade mathematics assessment. Yen's Objective Performance Index, Wainer et al.'s augmented scores, and scores based on multidimensional item response theory (IRT) models were compared and found to improve the precision of the subscale scores. However, the augmented subscale scores were found to be more highly correlated and less variable than unaugmented scores. The meaningfulness of reporting such augmented scores as well as the implications for validity and test development are discussed.

14.
To provide a basis for reforming teaching-evaluation methods and improving teaching quality, this study analysed the examination papers from a preventive-medicine course assessment in an undergraduate nursing programme at one institution. SPSS 17.0 was used to compute statistics on the papers' difficulty, reliability, and validity and on the score distribution. Scores were approximately normally distributed, with a mean of 71.7 ± 7.6; reliability was 0.626, validity 0.478, and difficulty 0.718. The examination showed good reliability, moderate overall difficulty, and a reasonable score distribution, reflecting students' true ability level well.

15.
Reliabilities and information functions for percentile ranks and number-right scores were compared in the context of item response theory. The basic results were: (a) The percentile rank is always less informative and reliable than the number-right score; and (b) for easy or difficult tests composed of highly discriminating items, the percentile rank often yields unacceptably low reliability and information relative to the number-right score. These results suggest that standardized scores that are linear transformations of the number-right score (e.g., z scores) are much more reliable and informative indicators of the relative standing of a test score than are percentile ranks. The findings reported here demonstrate that there exist situations in which the percent of items known by examinees can be accurately estimated, but that the percent of persons falling below a given score cannot.

16.
This article presents a method for estimating the accuracy and consistency of classifications based on test scores. The scores can be produced by any scoring method, including a weighted composite. The estimates use data from a single form. The reliability of the score is used to estimate effective test length in terms of discrete items. The true-score distribution is estimated by fitting a 4-parameter beta model. The conditional distribution of scores on an alternate form, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Agreement between classifications on alternate forms is estimated by assuming conditional independence, given the true score. Evaluation of the method showed estimates to be within 1 percentage point of the actual values in most cases. Estimates of decision accuracy and decision consistency statistics were only slightly affected by changes in specified minimum and maximum possible scores.
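Consistent with the binomial-error logic described here, the effective test length can be backed out of the reliability: if the proportion scores have mean \(\tilde\mu\), variance \(\tilde\sigma^{2}\), and reliability \(\rho\), the decomposition \(\tilde\sigma^{2} = \rho\tilde\sigma^{2} + [\tilde\mu(1-\tilde\mu) - \rho\tilde\sigma^{2}]/\tilde{n}\) solves to

\[
\tilde{n} \;=\; \frac{\tilde\mu(1-\tilde\mu) - \rho\,\tilde\sigma^{2}}{\tilde\sigma^{2}(1-\rho)}.
\]

This is a sketch of the derivation rather than the article's exact estimator; the fitted four-parameter beta then serves as the true-score distribution over which the binomial conditional distributions are integrated to obtain accuracy and consistency.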

17.
A reliability coefficient for criterion-referenced tests is developed from the assumptions of classical test theory. This coefficient is based on deviations of scores from the criterion score, rather than from the mean. The coefficient is shown to have several of the important properties of the conventional norm-referenced reliability coefficient, including its interpretation as a ratio of variances and as a correlation between parallel forms, its relationship to test length, its estimation from a single form of a test, and its use in correcting for attenuation due to measurement error. Norm-referenced measurement is considered as a special case of criterion-referenced measurement.
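The coefficient is presumably of the form

\[
K^{2} \;=\; \frac{\rho_{XX'}\,\sigma_X^{2} + (\mu_X - C)^{2}}{\sigma_X^{2} + (\mu_X - C)^{2}},
\]

where \(C\) is the criterion score and \(\rho_{XX'}\) the conventional reliability; setting \(C = \mu_X\) recovers \(\rho_{XX'}\) exactly, which is the sense in which norm-referenced measurement appears as a special case.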

18.
The standard error of measurement usefully provides confidence limits for scores in a given test, but is it possible to quantify the reliability of a test with just a single number that allows comparison of tests of different format? Reliability coefficients do not do this, being dependent on the spread of examinee attainment. Better in this regard is a measure produced by dividing the standard error of measurement by the test's ‘reliability length’, the latter defined as the maximum possible score minus the most probable score obtainable by blind guessing alone. This, however, can be unsatisfactory with negative marking (formula scoring), as shown by data on 13 negatively marked true/false tests. In these the examinees displayed considerable misinformation, which correlated negatively with correct knowledge. Negative marking can improve test reliability by penalizing such misinformation as well as by discouraging guessing. Reliability measures can be based on idealized theoretical models instead of on test data. These do not reflect the qualities of the test items, but can be focused on specific test objectives (e.g. in relation to cut‐off scores) and can be expressed as easily communicated statements even before tests are written.
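Formalizing the proposed index (the symbol \(Q\) is a hypothetical label, not the article's notation): with SEM the standard error of measurement, \(S_{\max}\) the maximum possible score, and \(S_{\mathrm{guess}}\) the most probable score under blind guessing,

\[
Q \;=\; \frac{\mathrm{SEM}}{S_{\max} - S_{\mathrm{guess}}}.
\]

For example, a 100-item four-option multiple-choice test scored number-right has \(S_{\mathrm{guess}} = 25\) and reliability length 75, whereas under formula scoring \(R - W/(k-1)\) the expected guessing score is 0 and the reliability length is the full 100; this is part of why negative marking complicates the index, as the true/false data above illustrate.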

19.
In discussion of the properties of criterion-referenced tests, it is often assumed that traditional reliability indices, particularly those based on internal consistency, are not relevant. However, if the measurement errors involved in using an individual's observed score on a criterion-referenced test to estimate his or her universe scores on a domain of items are compared to errors of an a priori procedure that assigns the same universe score (the mean observed test score) to all persons, the test-based procedure is found to improve the accuracy of universe score estimates only if the test reliability is above 0.5. This suggests that criterion-referenced tests with low reliabilities generally will have limited use in estimating universe scores on domains of items.
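The 0.5 threshold follows from a one-line comparison of mean squared errors. Estimating each person's universe score by the observed score incurs the error variance \(\sigma_X^{2}(1-\rho)\); assigning everyone the mean observed score incurs the universe-score variance \(\rho\,\sigma_X^{2}\). The test-based estimate is therefore more accurate exactly when

\[
\sigma_X^{2}(1-\rho) \;<\; \rho\,\sigma_X^{2} \quad\Longleftrightarrow\quad \rho > 0.5.
\]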

20.
The psychometric characteristics and practicality of concept mapping as a technique for classroom assessment were evaluated. Subjects received 90 min of training in concept mapping techniques and were given a list of terms and asked to produce a concept map. The list of terms was from a course in which they were enrolled. The maps were scored by pairs of graduate students, each pair using one of six different scoring methods. The score reliability of the six scoring methods ranged from r = .23 to r = .76. The highest score reliability was found for the method based on the evaluation of separate propositions represented. Correlations of map scores with a measure of the concept maps' similarity to a master map provided evidence supporting the validity of five of the six scoring methods. The times required to provide training in concept mapping, produce concept maps, and score concept maps were compatible with the adoption of concept mapping as a classroom assessment technique. © 1999 John Wiley & Sons, Inc. J Res Sci Teach 36: 475–492, 1999
