Similar Documents
 20 similar documents found (search took 78 ms)
1.
《Educational Assessment》2013,18(2):143-164
Large-scale use of performance assessments for individual-level, high-stakes purposes is not widespread, in part because the scores from such assessments are seldom reliable enough for that use. Perhaps the most pervasive cause of this poor reliability is the error variance associated with a Person × Task interaction. This problem seems quite similar to problems of transfer of learning. The literature on these two issues is synthesized to offer an explanation of Person × Task variability, and a research agenda is then proposed based on this literature.

2.
Although it has been known for over a half-century that the standard error of measurement is in many respects superior to the reliability coefficient for purposes of evaluating the fallibility of a psychological test, current textbooks and journal literature in tests and measurements still devote far more attention to test reliability than to the standard error. The present paper provides a list of ten salient features of the standard error, contrasting it with the reliability coefficient, and concludes that the standard error of measurement should be regarded as a primary characteristic of a mental test.
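The classical relation between the two quantities the abstract contrasts can be sketched directly; the numbers below are hypothetical, not taken from the paper:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical test with score SD 10 and reliability .91
sem = standard_error_of_measurement(10.0, 0.91)
# An observed score of 50 then carries a rough 68% band of 50 ± sem.
```

Because the SEM is in the score metric, it supports statements about individual scores that a unit-free reliability coefficient cannot.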

3.
There has been a growing consensus among educational measurement experts and psychometricians that test taker characteristics may unduly affect performance on tests. This may lead to construct-irrelevant variance in the scores and thus render the test biased. Hence, it is incumbent on test developers and users alike to provide evidence that their tests are free of such bias. The present study exploited generalizability theory to examine the presence of gender differential performance on a high-stakes language proficiency test, the University of Tehran English Proficiency Test. An analysis of the performance of 2,343 examinees who had taken the test in 2009 indicated that the relative contributions of different facets to score variance were almost uniform across the gender groups. Further, there was no significant interaction between items and persons, indicating that the relative standings of the persons were uniform across all items. The lambda reliability coefficients were also uniformly high. All in all, the study provides evidence that the test is free of gender bias and enjoys a high level of dependability.
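A minimal person × item G-study of the kind this line of work relies on can be sketched from expected mean squares; the tiny matrix below is illustrative only, not the UTEPT data:

```python
import numpy as np

def pxi_variance_components(X):
    """Variance components for a crossed persons x items design.

    Returns (var_person, var_item, var_residual, g_coefficient), where the
    G coefficient is var_p / (var_p + var_residual / n_items)."""
    n_p, n_i = X.shape
    grand = X.mean()
    p_means = X.mean(axis=1)
    i_means = X.mean(axis=0)
    ss_p = n_i * ((p_means - grand) ** 2).sum()
    ss_i = n_p * ((i_means - grand) ** 2).sum()
    ss_res = ((X - grand) ** 2).sum() - ss_p - ss_i
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_p, 0.0)
    g_coef = var_p / (var_p + var_res / n_i)
    return var_p, var_i, var_res, g_coef

# Illustrative 2-person, 2-item score matrix
vp, vi, vr, g = pxi_variance_components(np.array([[1.0, 2.0], [3.0, 4.0]]))
```

Comparing the estimated components across gender groups, as the study does, amounts to running this decomposition within each group and inspecting the relative sizes of the facets.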

4.
How can the contributions of raters and tasks to error variance be estimated? Which source of error variance is usually greater? Are interrater coefficients adequate estimates of reliability? What other facets contribute to unreliability in performance assessments?

5.
The relation between test reliability and statistical power has been a controversial issue, perhaps due in part to a 1975 publication in the Psychological Bulletin by Overall and Woodward, “Unreliability of Difference Scores: A Paradox for the Measurement of Change”, in which they demonstrated that a Student t test based on pretest-posttest differences can attain its greatest power when the difference score reliability is zero. In the present article, the authors attempt to explain this paradox by demonstrating in several ways that power is not a mathematical function of reliability unless either true score variance or error score variance is constant.

6.
In criterion-referenced tests (CRTs), the traditional measures of reliability used in norm-referenced tests (NRTs) have often proved problematic because of the NRT assumptions of a single underlying ability or competency and of variance in the distribution of scores. CRTs, by contrast, are likely to be created when mastery of the skill or knowledge by all or almost all test takers is expected, so little variation in the scores is expected. A comprehensive CRT often measures a number of discrete tasks that may not represent a single unifying ability or competence. Hence, CRTs theoretically violate the two most essential assumptions of classic NRT reliability theory, and estimating their reliability has traditionally entailed the logistical burden of multiple test administrations to the same test takers. A review of the literature categorizes approaches to reliability for CRTs into two classes: estimates sensitive to all sources of error and estimates of consistency in test outcome. For a single test administration of a CRT, Livingston's k² is recommended for estimating all sources of error, and Sc is proposed for estimates of consistency in test outcome. Both approaches are compared using data from a CRT exam, and recommendations for interpretation and use are proposed.
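Livingston's coefficient is commonly presented as k² = (ρ·σ² + (μ − C)²) / (σ² + (μ − C)²), where C is the cut score; a direct sketch under that formulation, with hypothetical values:

```python
def livingston_k2(mean: float, variance: float,
                  reliability: float, cut: float) -> float:
    """Livingston's criterion-referenced coefficient, as commonly presented:
    squared deviation of the mean from the cut score inflates both the
    'true' and total terms of the classical reliability ratio."""
    d2 = (mean - cut) ** 2
    return (reliability * variance + d2) / (variance + d2)

# Hypothetical CRT: mean 70, variance 25, NRT reliability .80, cut score 65
k2 = livingston_k2(70.0, 25.0, 0.80, 65.0)
```

Note that when the cut score equals the mean, k² reduces to the classical reliability, and it grows toward 1 as the group mean moves away from the cut, which is exactly why it behaves sensibly for mastery tests with restricted score variance.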

7.
We evaluated the statistical power of single-indicator latent growth curve models to detect individual differences in change (variances of latent slopes) as a function of sample size, number of longitudinal measurement occasions, and growth curve reliability. We recommend the 2-degree-of-freedom generalized test assessing loss of fit when both slope-related random effects, the slope variance and the intercept-slope covariance, are fixed to 0. Statistical power to detect individual differences in change is low to moderate unless the residual error variance is low, sample size is large, and there are more than four measurement occasions. The generalized test has greater power than a specific test isolating the hypothesis of zero slope variance, except when the true slope variance is close to 0, and has uniformly superior power to a Wald test based on the estimated slope variance.

8.
The study reported in this paper was designed to examine three reliability characteristics of the Harris revision of the Goodenough ‘Draw-a-man’ Test when used with five-year-old school entrants. The test was individually administered to each of 90 children on two occasions, with an average time-separation of two weeks. Three persons undertook the administration and scoring of the drawings, and the investigation examined the reliability coefficients associated with i) temporal stability (same tester); ii) temporal stability (different testers); iii) marker error. The results indicate that when experienced testers are used, the reliability of the ‘Draw-a-man’ scale is of the same magnitude as that reported in previous studies involving older children as subjects. It is also suggested that with school entrants, the influence of different trained testers on the final rank order of scores is probably quite small. The present study also shows that with the drawings of five-year-old children there is less likelihood of the scorer developing a consistent subjective marking standard than is the case with the drawings of older children. Scoring errors tended to be random rather than systematic, probably owing to the relatively greater number of occasions when uncertainty exists over the interpretation or naming of basic features of immature drawings. It is suggested that the test is more useful for comparing groups than individual school entrants.

9.
Although federal regulations require testing students with severe cognitive disabilities, there is little guidance regarding how technical quality should be established, and documenting the reliability of scores from alternate assessments is known to be challenging. Typical measures of reliability do little to model multiple sources of error, which are characteristic of alternate assessments. Generalizability theory (G-theory), by contrast, allows researchers to identify sources of error and analyze the relative contribution of each. This study demonstrates an application of G-theory to examine reliability for an alternate assessment. A G-study with the facets rater type, assessment attempts, and tasks was examined to determine the relative contribution of each to observed score variance, and the results were used to determine the reliability of scores. The assessment design was then modified to examine how changes might impact reliability. As a final step, designs deemed satisfactory were evaluated for the feasibility of adapting them into a statewide standardized assessment and accountability program.

10.
Despite their widespread use in identifying and evaluating programs for gifted and talented students, the Torrance Tests of Creative Thinking were standardized on samples that excluded gifted children. The interrater reliability of measures like the TTCT has been questioned repeatedly, yet studies with average students have demonstrated high interrater reliability. This study compares the interrater reliability of the TTCT for groups of gifted and nongifted elementary-school-aged students. Results indicated that most interrater reliability coefficients exceeded .90 for both gifted and nongifted groups. However, multivariate analysis of variance indicated significant mean differences across the three self-trained raters for both gifted and nongifted groups. Consequently, use of a single scorer to evaluate TTCT protocols is recommended, especially where specific cutoff scores are used to select students.

11.
Mean or median student growth percentiles (MGPs) are a popular measure of educator performance, but they lack rigorous evaluation. This study investigates the error in MGPs due to test score measurement error (ME). Using analytic derivations, we find that errors in the commonly used MGP are correlated with average prior latent achievement: teachers with low prior-achieving students have MGPs that underestimate true teacher performance, and vice versa for teachers with high-achieving students. We evaluate alternative MGP estimators, showing that aggregates of SGPs that correct for ME contain only errors independent of prior achievement. The alternatives are thus fairer, because they are not biased by prior mean achievement, and they have smaller overall variance and larger marginal reliability than the standard MGP approach. In addition, the mean estimators always outperform their median counterparts.

12.
Instruction cannot really be personalised as long as assessment remains norm-referenced. Whereas psychometrics aims at differentiating the performances of individuals at a given moment, edumetrics aims at differentiating stages of learning for a given individual. The structure of the two projects is the same, and generalisability theory offers symmetrical formulae for estimating the reliability of each of these measurement designs. An example is presented in this paper which shows that satisfactory reliability can be obtained in an edumetric situation, where the between-pupils variance is completely ignored. Even though the absolute error variance is the same in both cases, the relative error variances, and hence the standard errors of measurement, are different. As the true score variances are also different, the edumetric properties of a test should be considered alongside its psychometric ones. Certification of progress by the teacher, supporting a portfolio of achievement, could even have a summative, as well as a formative, function.

13.
There is considerable confusion in the media and the public about healthy behaviors in contrast to “antiaging” behaviors designed to make one look “younger.” As an aid in clarifying the differences between these two types of behaviors, we have developed a questionnaire called the Health Behavior Inventory (HBI). We also wanted to estimate differences in the frequency of different types of behaviors, differences between older and younger respondents, and differences between men and women.

The HBI consists of 10 Healthy Behaviors and 10 Age Denial behaviors. We tested the HBI in a survey mailed to 250 older persons; the questionnaire was returned by 222. We used Cronbach's alpha to test the internal-consistency reliability of the two scales. Healthy Behaviors were reported much more often than Age Denials, but one-third of respondents reported one or more Age Denials. Younger persons and women tended to report substantially more Age Denials than men. The scales appear to have good face validity, and the Age Denials scale has fairly consistent internal reliability. The HBI appears to be a useful tool for research on the frequency and distribution of Healthy Behaviors and Age Denials among different groups of older persons. It may also be used in educational settings and the media to raise awareness of the need for more Healthy Behaviors and for reducing useless and dangerous types of Age Denials.
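The internal-consistency check reported here is Cronbach's alpha, which can be computed from a respondents × items score matrix; the toy matrices below are illustrative, not the HBI data:

```python
import numpy as np

def cronbach_alpha(X) -> float:
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / total-score variance)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_var_sum = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# Perfectly parallel items yield alpha = 1
alpha_hi = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
# Uncorrelated items yield alpha = 0
alpha_lo = cronbach_alpha([[0, 1], [1, 0], [0, 0], [1, 1]])
```

For a 10-item scale like either HBI subscale, the same call applies to the 222 × 10 response matrix.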

14.
In many of the methods currently proposed for standard setting, all experts are asked to judge all items, and the standard is taken as the mean of their judgments. When resources are limited, gathering the judgments of all experts in a single group can become impractical. Multiple matrix sampling (MMS) provides an alternative. This paper applies MMS to a variation on Angoff's (1971) method of standard setting. A pool of 36 experts and 190 items were divided randomly into 5 groups, and estimates of borderline examinee performance were acquired. Results indicated some variability in the cutting scores produced by the individual groups, but the variance components were reasonably well estimated. The standard error of the cutting score was very small, and the width of the 90% confidence interval around it was only 1.3 items. The reliability of the final cutting score was .98.
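Under this design the final cutting score is the mean of the independent group estimates, and its standard error follows from their spread; a sketch with hypothetical group cut scores (not the study's data):

```python
import statistics

def cut_score_and_se(group_cuts):
    """Mean cutting score across independent expert groups and its
    standard error (SD of group estimates / sqrt(number of groups))."""
    mean_cut = statistics.mean(group_cuts)
    se = statistics.stdev(group_cuts) / len(group_cuts) ** 0.5
    return mean_cut, se

# Hypothetical cut-score estimates from five independent expert groups
cut, se = cut_score_and_se([121, 124, 122, 125, 123])
```

A 90% confidence interval of roughly cut ± 1.645 · se follows directly, which is the form of the 1.3-item interval width the abstract reports.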

15.
Formulas for the standard error of a parallel-test correlation and for the Kuder-Richardson formula 20 reliability estimate are provided. Given equal values of the two reliabilities in the population, the standard error of the Kuder-Richardson formula 20 is shown to be somewhat smaller than the standard error of a parallel-test correlation for reliability values, sample sizes, and test lengths that are usually encountered in practice.
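The KR-20 estimate itself, for dichotomous (0/1) items, can be sketched as follows; the toy matrices are illustrative only:

```python
import numpy as np

def kr20(X) -> float:
    """Kuder-Richardson formula 20 for a persons x items matrix of 0/1 scores:
    (k/(k-1)) * (1 - sum of p*q over items / variance of total scores)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    p = X.mean(axis=0)                 # proportion correct per item
    pq_sum = (p * (1.0 - p)).sum()     # sum of item variances for 0/1 items
    total_var = X.sum(axis=1).var()    # population variance, per the classical formula
    return (k / (k - 1)) * (1.0 - pq_sum / total_var)

# Perfectly consistent items give KR-20 = 1; independent items give 0
r_hi = kr20([[1, 1], [1, 1], [0, 0], [0, 0]])
r_lo = kr20([[1, 1], [1, 0], [0, 1], [0, 0]])
```

KR-20 is the special case of Cronbach's alpha for dichotomous items, which is why the two formulas share the (k/(k−1)) correction.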

16.
Three experienced assessment experts were selected to rate the self-evaluation scales of eight key disciplines in Hebei Province, and a mixed-design model from generalizability theory was used to analyze the scale structure and the variance and error reflected in the ratings. The results show that the different experts did not introduce much systematic error into the evaluation of discipline capability, but the variances of the discipline × indicator × expert interaction and of the indicator × expert interaction were large, indicating that the second-level indicators of the evaluation system still have substantial deficiencies.

17.
The dissemination of mental health knowledge is the main form of current mental health education for college students, but quantitative research on it is still lacking; this study therefore set out to develop a preliminary mental health knowledge questionnaire for college students. A pilot questionnaire was first compiled from a broad review of the literature and administered to 800 college students, after which item analysis and reliability and validity checks were conducted. The results show that 15 items constitute the final questionnaire; exploratory factor analysis extracted four factors explaining 47.652% of the total variance; the questionnaire's Cronbach's α coefficient was 0.68 and its test-retest reliability was 0.73; and confirmatory factor analysis indicated that the four-factor model fit well. The study suggests that the questionnaire has acceptable reliability and validity and can be used in further empirical research.

18.
This article concerns the simultaneous assessment of DIF for a collection of test items. Rather than an average or sum, in which positive and negative DIF may cancel, we propose an index that measures the variance of DIF on a test as an indicator of the degree to which different items show DIF in different directions. It is computed from standard Mantel-Haenszel statistics (the log-odds ratio and its variance) and may be conceptually classified as a variance component or variance effect size. Evaluated by simulation under three item response models (1PL, 2PL, and 3PL), the index is shown to be an accurate estimate of the DTF generating parameter in the case of the 1PL and 2PL models with groups of equal ability. For groups of unequal ability, the index is accurate under the 1PL but not the 2PL condition; however, a weighted version of the index provides improved estimates. For the 3PL condition, the DTF generating parameter is underestimated; this latter result is due in part to a mismatch in the scales of the log-odds ratio and IRT difficulty.
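The core idea, a between-item variance of MH log-odds ratios corrected for sampling noise, can be sketched with a simple method-of-moments estimator; this is an illustration of the concept, not necessarily the authors' exact index:

```python
import numpy as np

def dif_variance_index(log_odds, sampling_vars) -> float:
    """Moment estimate of between-item DIF variance: observed variance of
    the MH log-odds ratios minus their average sampling variance,
    truncated at zero so the estimate stays a valid variance."""
    log_odds = np.asarray(log_odds, dtype=float)
    sampling_vars = np.asarray(sampling_vars, dtype=float)
    observed = log_odds.var(ddof=1)
    return max(observed - sampling_vars.mean(), 0.0)

# Two hypothetical items: log-odds ratios 0.0 and 2.0, each with
# sampling variance 0.5, so half the observed spread is noise.
tau2 = dif_variance_index([0.0, 2.0], [0.5, 0.5])
```

Because opposite-signed DIF inflates rather than cancels in a variance, the index flags tests where items favor different groups even when the net DIF is near zero.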

19.
A study was conducted to determine whether analysis of variance techniques are appropriate when the dependent variable has a dichotomous (zero-one) distribution. Several 1-, 2-, and 3-way analysis of variance configurations were investigated with regard to both the size of the Type I error and the power. The findings show the analysis of variance to be an appropriate statistical technique for analyzing dichotomous data in fixed effects models where cell frequencies are equal under the following conditions: (a) the proportion of responses in the smaller response category is equal to or greater than .2 and there are at least 20 degrees of freedom for error, or (b) the proportion of responses in the smaller response category is less than .2 and there are at least 40 degrees of freedom for error.

20.
We examined how raters and tasks influence measurement error in writing evaluation and how many raters and tasks are needed to reach desirable reliabilities of .90 and .80 for children in Grades 3 and 4. A total of 211 children (102 boys) were each administered three tasks in the narrative and expository genres, and their written compositions were evaluated with methods widely used for developing writers: holistic scoring, productivity, and curriculum-based writing scores. Results showed that 54% and 52% of the variance in narrative and expository compositions, respectively, was attributable to true individual differences in writing. Students' scores varied largely by task (30.44% and 28.61% of the variance) but not by rater. To reach a reliability of .90, multiple tasks and raters were needed; for a reliability of .80, a single rater and multiple tasks sufficed. These findings offer important implications for reliably evaluating children's writing skills, given that writing is typically evaluated by a single task and a single rater in classrooms and even in some state accountability systems.
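The rater/task trade-off behind these conclusions follows the standard G-theory decision-study projection: each person-related error component shrinks as its facet is averaged over more conditions. A sketch with hypothetical variance components (chosen only to echo the rough proportions in the abstract):

```python
def projected_reliability(var_p: float, var_pt: float, var_pr: float,
                          var_res: float, n_tasks: int, n_raters: int) -> float:
    """D-study generalizability coefficient for a persons x tasks x raters design:
    person variance over person variance plus averaged interaction/residual error."""
    rel_error = (var_pt / n_tasks
                 + var_pr / n_raters
                 + var_res / (n_tasks * n_raters))
    return var_p / (var_p + rel_error)

# Hypothetical components: person 50, person x task 30,
# person x rater 0 (raters barely matter), residual 20.
g_one = projected_reliability(50.0, 30.0, 0.0, 20.0, n_tasks=1, n_raters=1)
g_five = projected_reliability(50.0, 30.0, 0.0, 20.0, n_tasks=5, n_raters=1)
```

With these made-up numbers, a single task and rater give a coefficient of .50, while five tasks with the same single rater lift it past .80, mirroring the study's finding that adding tasks, not raters, is what buys reliability.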


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号