Similar literature
 20 similar documents found; search took 31 ms
1.
This paper examines whether educational production in secondary school involves joint production among teachers across subjects. In doing so, it also provides insights into the reliability of value-added modeling. Teacher value-added to reading test scores is estimated for four different teacher types: English, math, science and social studies. The initial results indicate that reading output is jointly produced by math and English teachers. However, while falsification tests confirm the English-teacher effects, they cast some doubt on whether the math-teacher effects are free from sorting bias. The results offer a mixed review of the value-added methodology, suggesting that it can be useful but should be implemented cautiously.

2.
A method of determining the reliability coefficient of a test from a formulation which does not employ the concepts of true score and error score, together with assumptions about the process which generates variability in scores, is described. Several well-known reliability formulas as well as some new results are derived from models which hypothesize different sources of variability in scores.

3.
The purpose of this study is to better understand how math teachers’ effectiveness, as measured by value-added scores and student satisfaction with teaching, is influenced by schools’ working conditions. The data for the study were derived from the 2009–2010 Teacher Working Condition Survey and Student Perception Survey in the Measures of Effective Teaching Project. Using structural equation modeling and other related methods, several models of teacher effectiveness were estimated. The findings indicate that, among the examined working condition factors, support for instruction and for student conduct management have significant effects on teachers’ value-added scores in mathematics. Moreover, student satisfaction with teaching seems to have a mediating effect on value-added scores. The findings contribute to a better understanding of the effects of the working environment on math teachers’ effectiveness and of how improvement in working conditions can enhance math teachers’ performance.

4.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.

5.
Although reliability of subscale scores may be suspect, subscale scores are the most common type of diagnostic information included in student score reports. This research compared methods for augmenting the reliability of subscale scores for an 8th-grade mathematics assessment. Yen's Objective Performance Index, Wainer et al.'s augmented scores, and scores based on multidimensional item response theory (IRT) models were compared and found to improve the precision of the subscale scores. However, the augmented subscale scores were found to be more highly correlated and less variable than unaugmented scores. The meaningfulness of reporting such augmented scores as well as the implications for validity and test development are discussed.

6.
Outcome-oriented evaluation of school effectiveness is often based on student test scores in certain critical examinations. This study provides another method of evaluation, value-added, which is based on student achievement progress. This paper introduces the method of estimating the value-added score of schools in multi-level models. Based on longitudinal student achievement data, two measures of school effectiveness in one local education authority in China are compared. It is found that the between-school difference in both test-score and value-added measures is large, comparable with that of Western countries. The results of the two measures of school effectiveness differ markedly. The value-added measures lack consistency across different subject areas within schools, while the test-score measures are highly correlated between subjects. Teachers prefer value-added measures over test-score measures of education quality. It is suggested that value-added measures of school effectiveness should be used as a complement to, rather than a substitute for, test-score measures. The shortcomings of the value-added approach are also discussed.

7.
In the UK, USA and elsewhere, school accountability systems increasingly compare schools using value-added measures of school performance derived from pupil scores in high-stakes standardised tests. Rather than naïvely comparing school average scores, which largely reflect school intake differences in prior attainment, these measures attempt to compare the average progress or improvement pupils make during a year or phase of schooling. Schools, however, also differ in terms of their pupil demographic and socioeconomic characteristics and these factors also predict why some schools subsequently score higher than others. Many therefore argue that value-added measures unadjusted for pupil background are biased in favour of schools with more ‘educationally advantaged’ intakes. But others worry that adjusting for pupil background entrenches socioeconomic inequities and excuses low-performing schools. In this article we explore these theoretical arguments and their practical importance in the context of the ‘Progress 8’ secondary school accountability system in England, which has chosen to ignore pupil background. We reveal how the reported low or high performance of many schools changes dramatically once adjustments are made for pupil background, and these changes also affect the reported differential performances of regions and of different school types. We conclude that accountability systems which choose to ignore pupil background are likely to reward and punish the wrong schools and this will likely have detrimental effects on pupil learning. These findings, especially when coupled with more general concerns surrounding high-stakes testing and school value-added models, raise serious doubts about their use in school accountability systems.

8.
A broad literature seeks to assess the importance of schools, proxies for school quality, and family background on children's achievement growth using the education production function. Using rich data from the Philippines, we introduce and estimate a model that imposes little structure on the relationship between intake achievement and follow-up achievement and evaluate school performance based on this estimated relationship. Our methods nest typical value-added specifications that use test score gains as the outcome variable and models assuming linearity in the relationship between intake and follow-up scores. We find evidence against the use of value-added models for our data and show that such models give very different assessments of school performance in the Philippines. Using a variety of tests we find that schools matter in the production of student achievement, though variation in performance across schools only explains about 4.4–5.3% of the total (conditional) variation in follow-up achievement. Schools providing basic facilities, in particular schools providing electricity, are found to perform much better in the production of achievement growth.

9.
This study utilizes an argument-based approach to validation to examine the implications of reliability in order to further differentiate the concepts of score and decision consistency. In a methodological example, the framework of generalizability theory was used to estimate appropriate indices of score consistency and evaluations of the likelihood of decision errors based on the design of a performance assessment and its intended use. The study illustrates how generalizability theory can be applied to address various claims about consistency when decisions are based on two or more cut scores, and results underscore the importance of considering score and decision consistency separately.

10.
High-stakes standardized student assessments are increasingly used in value-added evaluation models to connect teacher performance to P–12 student learning. These assessments are also being used to evaluate teacher preparation programs, despite validity and reliability threats. A more rational model linking student performance to candidates who actually teach these students is presented. Preliminary findings with three candidate cohorts indicate that the majority of their students met learning objectives and showed substantial pre-to-post learning gains.

11.
Value-added models and growth-based accountability aim to evaluate schools’ performance based on student growth in learning. The current focus is on linking the results from value-added models to those from growth-based accountability systems, including Adequate Yearly Progress decisions mandated by No Child Left Behind. We present a new statistical approach that extends the current value-added modeling possibilities and focuses on using latent longitudinal growth curves to estimate the probabilities of students reaching proficiency. The aim is to utilize time-series measures of student achievement scores to estimate latent growth curves and use them as predictors of a dichotomous outcome, such as proficiency or passing a high-stakes exam, within a single multilevel longitudinal model. We illustrate this method by analyzing a three-year data set of longitudinal achievement scores and California High School Exit Exam scores from a large urban school district. This latent variable growth logistic model is useful for (1) early identification of students at risk of failing or of those who are most in need; (2) validating and/or assessing the adequacy of student growth over the years in relation to distal outcome criteria; and (3) evaluating a longitudinal intervention study.

12.
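Value-added evaluation of teaching quality can improve China's traditional school evaluation methods and make the evaluation of school teaching quality fairer. This paper collects the total scores of students in one region on the senior high school entrance examination and the college entrance examination. After standardizing the data and analyzing their characteristics, five reasonably suitable regression-based value-added evaluation models are identified: simple linear regression, inverse-proportional regression, power regression, exponential regression, and quadratic regression. A "double-difference method" criterion is set to compare the models. The optimal model for Category 1 and Category 2 students is found to be the quadratic regression model, and the optimal model for Category 3 students the exponential regression model. Value-added teaching quality rankings are then calculated for 160 schools in the region. Finally, recommendations on the choice of value-added evaluation model are given.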

13.
Using longitudinal data from a cohort of middle school students from a large school district, we estimate separate "value-added" teacher effects for two subscales of a mathematics assessment under a variety of statistical models varying in form and degree of control for student background characteristics. We find that the variation in estimated effects resulting from the different mathematics achievement measures is large relative to variation resulting from choices about model specification, and that the variation within teachers across achievement measures is larger than the variation across teachers. These results suggest that conclusions about individual teachers' performance based on value-added models can be sensitive to the ways in which student achievement is measured.

14.
Scale scores for educational tests can be made more interpretable by incorporating score precision information at the time the score scale is established. Methods for incorporating this information are examined that are applicable to testing situations with number-correct scoring. Both linear and nonlinear methods are described. These methods can be used to construct score scales that discourage the overinterpretation of small differences in scores. The application of the nonlinear methods also results in scale scores that have nearly equal error variability along the score scale and that possess the property that adding a specified number of points to and subtracting the same number of points from any examinee's scale score produces an approximate two-sided confidence interval with a specified coverage. These nonlinear methods use an arcsine transformation to stabilize measurement error variance for transformed scores. The methods are compared through the use of illustrative examples. The effect of rounding on measurement error variability is also considered and illustrated using stanines.
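The variance-stabilizing effect of an arcsine transformation can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in the abstract: measurement error is modeled as binomial given a true proportion-correct, the transformation is the Freeman-Tukey double-arcsine form, and the test length of 40 items is a made-up value. The function names are hypothetical.

```python
import math

def arcsine_transform(x, k):
    """Freeman-Tukey style arcsine transform of number-correct score x on k items
    (assumed form; the abstract does not specify which arcsine variant is used)."""
    return 0.5 * (math.asin(math.sqrt(x / (k + 1)))
                  + math.asin(math.sqrt((x + 1) / (k + 1))))

def binomial_pmf(x, k, p):
    """Probability of x correct out of k items for true proportion-correct p."""
    return math.comb(k, x) * p**x * (1 - p)**(k - x)

def error_sd(k, p, score=lambda x, k: x):
    """Conditional standard error of `score` under a binomial error model."""
    probs = [binomial_pmf(x, k, p) for x in range(k + 1)]
    values = [score(x, k) for x in range(k + 1)]
    mean = sum(w * v for w, v in zip(probs, values))
    variance = sum(w * (v - mean) ** 2 for w, v in zip(probs, values))
    return math.sqrt(variance)

k = 40  # hypothetical test length
true_proportions = [0.2, 0.35, 0.5, 0.65, 0.8]
raw_sds = [error_sd(k, p) for p in true_proportions]
transformed_sds = [error_sd(k, p, arcsine_transform) for p in true_proportions]

for p, raw, tr in zip(true_proportions, raw_sds, transformed_sds):
    print(f"p={p:.2f}  raw SD={raw:.3f}  transformed SD={tr:.4f}")
```

Under these assumptions the raw-score error SD varies noticeably with the true proportion-correct, while the transformed-score error SD stays close to constant, which is the property that makes a fixed plus-or-minus band on the scale score behave like a confidence interval.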

15.
The purpose of this study was to investigate the methods of estimating the reliability of school-level scores using generalizability theory and multilevel models. Two approaches, ‘students within schools’ and ‘students within schools and subject areas,’ were conceptualized and implemented in this study. Four methods resulting from the combination of these two approaches with generalizability theory and multilevel models were compared for both balanced and unbalanced data. The generalizability theory and multilevel models for the ‘students within schools’ approach produced the same variance components and reliability estimates for the balanced data, while failing to do so for the unbalanced data. The different results from the two models can be explained by the fact that they use different procedures to estimate the variance components that are used, in turn, to estimate reliability. Among the estimation methods investigated in this study, the generalizability theory model with the ‘students nested within schools crossed with subject areas’ design produced the lowest reliability estimates. Fully nested designs such as (students:schools) or (subject areas:students:schools) would not have any significant impact on reliability estimates of school-level scores. Both methods provide very similar reliability estimates of school-level scores.

16.
Contemporary educational accountability systems, including state‐level systems prescribed under No Child Left Behind as well as those envisioned under the “Race to the Top” comprehensive assessment competition, rely on school‐level summaries of student test scores. The precision of these score summaries is almost always evaluated using models that ignore the classroom‐level clustering of students within schools. This paper reports balanced and unbalanced generalizability analyses investigating the consequences of ignoring variation at the level of classrooms within schools when analyzing the reliability of such school‐level accountability measures. Results show that the reliability of school means cannot be determined accurately when classroom‐level effects are ignored. Failure to take between‐classroom variance into account biases generalizability (G) coefficient estimates downward and standard errors (SEs) upward if classroom‐level effects are regarded as fixed, and biases G‐coefficient estimates upward and SEs downward if they are regarded as random. These biases become more severe as the difference between the school‐level intraclass correlation (ICC) and the class‐level ICC increases. School‐accountability systems should be designed so that classroom (or teacher) level variation can be taken into consideration when quantifying the precision of school rankings, and statistical models for school mean score reliability should incorporate this information.

17.
Test administrators are appropriately concerned about the potential for time constraints to impact the validity of score interpretations; psychometric efforts to evaluate the impact of speededness date back more than half a century. The widespread move to computerized test delivery has led to the development of new approaches to evaluating how examinees use testing time and to new metrics designed to provide evidence about the extent to which time limits impact performance. Much of the existing research is based on these types of observational metrics; relatively few studies use randomized experiments to evaluate the impact of time limits on scores. Of those studies that do report on randomized experiments, none directly compare the experimental results to evidence from observational metrics to evaluate the extent to which these metrics are able to sensitively identify conditions in which time constraints actually impact scores. The present study provides such evidence based on data from a medical licensing examination. The results indicate that these observational metrics are useful but provide an imprecise evaluation of the impact of time constraints on test performance.

18.
Teaching quality often is assumed to be a personal and stable characteristic of teachers. Whether this is true has scarcely been investigated empirically. In this study the extent to which value-added scores of teachers teaching German and English as a foreign language (EFL) to the same class remain consistent across subjects was investigated. Then, the consistency of two teaching quality dimensions, classroom management and motivational support, across subjects was explored. A sample consisting of 25 classes with 548 students to whom German and EFL were taught by the same teacher was analyzed using multivariate multilevel models and generalizability theory. The results showed that the value-added scores were highly correlated across subjects. While there was hardly any subject-dependent variance in classroom management, there was substantial subject-dependent variance in motivational support. The results indicate that it is important to conduct further studies on the situational and contextual factors that might influence teaching quality to gain a more comprehensive picture regarding the consistency of teaching quality across various conditions.

19.
Book reviews     
Background: A recent article published in Educational Research on the reliability of results in National Curriculum testing in England (Newton, The reliability of results from national curriculum testing in England, Educational Research 51, no. 2: 181–212, 2009) suggested that: (1) classification accuracy can be calculated from classification consistency; and (2) classification accuracy on a single test administration is higher than classification consistency across two tests.

Purpose: This article shows that it is not possible to calculate classification accuracy from classification consistency. It then shows that, given reasonable assumptions about the distribution of measurement error, the expected classification accuracy on a single test administration is higher than the expected classification consistency across two tests only in the case of a pass–fail test, but not necessarily for tests that classify test-takers into more than two categories.

Main argument and conclusion: Classification accuracy is defined in terms of a ‘true score’ specified in a psychometric model. Three things must be known or hypothesised in order to derive a value for classification accuracy: (1) a psychometric model relating observed scores to true scores; (2) the location of the cut-scores on the score scale; and (3) the distribution of true scores in the group of test-takers.
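The pass–fail case can be illustrated with a small calculation. This is a sketch under assumptions that are not taken from the article: a classical test theory model with normally distributed true scores and errors, a single cut-score placed at the mean, and a made-up reliability of 0.90. With the cut at the mean, the probability that two jointly normal scores with correlation r fall on the same side is 1/2 + arcsin(r)/π, so all three ingredients listed above (model, cut-score location, true-score distribution) are fixed by assumption.

```python
from math import asin, pi, sqrt

def same_side_probability(correlation):
    """P(two standardized bivariate-normal scores fall on the same side of the mean)."""
    return 0.5 + asin(correlation) / pi

reliability = 0.90  # hypothetical value, not taken from the article

# Accuracy: observed and true scores correlate at sqrt(reliability).
accuracy = same_side_probability(sqrt(reliability))

# Consistency: two parallel administrations correlate at the reliability itself.
consistency = same_side_probability(reliability)

print(f"accuracy={accuracy:.3f}, consistency={consistency:.3f}")
```

Because sqrt(reliability) is at least as large as the reliability, accuracy is at least as large as consistency in this two-category setting. The sketch says nothing about tests with more than two categories, where the abstract notes the ordering need not hold.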

20.
Educational Assessment, 2013, 18(4): 255-258
Editor's Introduction. Reliability Versus Accuracy: A Critical Distinction. Test reliability coefficients traditionally have been used to judge the quality of measurement, and reliability coefficients of .90 have often been considered adequate to assure the quality of standardized testing and large-scale assessment programs. However, a test reliability of .90 (or above) does not ensure that individual test scores, such as national percentile ranks, are accurate. Consider, for example, a mathematics test with a reliability of .90 and imagine a student taking that test whose true score is at the 50th percentile; that is, we know that the student's actual capability is at that level. The probability is less than one third (.309) that when the student takes the test, he or she will obtain a score within 5 percentile points of his or her true score, the 50th percentile (Rogosa 1999a, 1999b). The following informal example attempts to explain why high test reliability does not indicate good accuracy for an individual score, without the encumbrances of percentile rank scoring, complex measurement models, and other technical detail. Dedicated to Al Bundy, a man who cares as much about good measurement as he does about his own children.
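The .309 figure can be reproduced with a short calculation. The sketch below is one way to arrive at it, assuming a classical test theory model with normally distributed true scores and errors, and percentile ranks read off the observed-score distribution; these modeling choices and the function name are assumptions for illustration, not a description of how the cited work derived the value.

```python
from math import sqrt
from statistics import NormalDist

def prob_within_band(reliability, true_pct=0.50, band=0.05):
    """P(observed percentile rank lands within `band` of the true percentile)."""
    std_normal = NormalDist()
    # Observed scores are scaled to unit variance, so the true-score variance
    # equals the reliability and the error variance is the remainder.
    true_sd = sqrt(reliability)
    error_sd = sqrt(1.0 - reliability)
    # True score located at the given percentile of the true-score distribution.
    true_score = true_sd * std_normal.inv_cdf(true_pct)
    # Observed-score cut points bounding the percentile band.
    lower = std_normal.inv_cdf(true_pct - band)
    upper = std_normal.inv_cdf(true_pct + band)
    # Given the true score, the observed score is normal around it.
    observed = NormalDist(mu=true_score, sigma=error_sd)
    return observed.cdf(upper) - observed.cdf(lower)

p = prob_within_band(0.90)
print(round(p, 3))  # prints 0.309
```

Raising the reliability argument shows how slowly individual-score accuracy improves: under the same assumptions, the probability stays well below certainty even for very high reliabilities.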
