1.

What Do School‐Level Scores From Large‐Scale Assessments Really Measure?

Fiore Sicoly 《Educational Measurement》2002,21(4):17-26

Although assessments of mathematics, reading, and writing are assumed to measure distinct academic skills, this may be difficult owing to the pervasive influence of general ability on performance. Factor analyses of school-level data from 14 large-scale assessment programs revealed that 80% of the variance in mathematics, reading, and writing scores was due to a common, underlying factor. Multiple regression analyses confirmed that scores contribute little information that is unique to a particular subject (6% or less). Although different assessments may create the illusion of providing unique information, they may be tapping into generic cognitive abilities that cut across content areas. These results raise suspicions about the value and validity of interpretations based on school-level subject area scores.

2.

Using State School Accountability Data to Evaluate Federal Programs: A Long Uphill Road

《Peabody Journal of Education》2013,88(4):122-145

Evaluations of federal programs designed to improve student achievement generally depend on data gathered by the states for school accountability purposes, rather than data specifically designed for program evaluation. In addition, these data are available at the school level but not at the student level. This article first discusses issues related to the quality of school-level data collected as part of state accountability systems, including the reliability and validity of school-level test scores as a measure of the value added by schools to student learning. It then outlines various ways in which school-level data can be usefully analyzed and illustrates the challenges inherent in doing so, including the challenges of aggregating data across states to find an overall program effect. The final section discusses the implications of the arguments presented here for measuring changes in school performance and linking these effects to a specific program. Ultimately, our ability to measure changes in outcomes and link them back to the intervention depends on three factors: (a) identifying a set of activities attributable to the program, (b) measuring the quality of implementation of these activities, and (c) obtaining a valid and reliable measure of the desired outcome. The article makes it clear that none of these is easy to come by.

3.

Examining the Dual Purpose Use of Student Learning Objectives for Classroom Assessment and Teacher Evaluation

Derek C. Briggs Rajendra Chattergoon Amy Burkhardt 《Journal of Educational Measurement》2019,56(4):686-714

The process of setting and evaluating student learning objectives (SLOs) has become increasingly popular as an example where classroom assessment is intended to fulfill the dual purpose use of informing instruction and holding teachers accountable. A concern is that the high‐stakes purpose may lead to distortions in the inferences about students and teachers that SLOs can support. This concern is explored in the present study by contrasting student SLO scores in a large urban school district to performance on a common objective external criterion. This external criterion is used to evaluate the extent to which student growth scores appear to be inflated. Using 2 years of data, growth comparisons are also made at the teacher level for teachers who submit SLOs and have students that take the state‐administered large‐scale assessment. Although they do show similar relationships with demographic covariates and have the same degree of stability across years, the two different measures of growth are weakly correlated.

4.

Are school-SES effects statistical artefacts? Evidence from longitudinal population data

Gary N. Marks 《Oxford Review of Education》2013,39(1):122-144

Schools’ socioeconomic status (SES) has been claimed as an important influence on student performance and there are calls for a policy response. However, there is an extensive literature which for various reasons casts doubt on the veracity of school-SES effects. This paper investigates school-SES effects with population data from a longitudinal cohort of school students which includes achievement measures in Years 3, 5 and 7. Estimates for school-SES are unstable under differing model and measurement specifications. School-SES effects are trivial controlling for student- and school-level prior ability. Inconsistent with theoretical explanations, school-SES effects were stronger with weaker SES measures. Furthermore, school-SES effects differ somewhat by achievement domain. Also contrary to expectations, there were school-SES effects on Year 7 achievement in secondary school for the primary schools students attended in Year 5. In each of five domains of achievement, fixed effect models show a small negative effect for school-SES and a small positive effect for school-level prior ability. The large school-SES effects prominent in some research and policy literatures are statistical artefacts.

5.

On the generalizability of school-level performance assessment scores

《International Journal of Educational Research》1994,21(3):267-278

This study illustrates how generalizability theory can be used to evaluate the dependability of school-level scores in situations where test forms have been matrix sampled within schools, and to estimate the minimum number of forms required to achieve acceptable levels of score reliability. Data from a statewide performance assessment in reading, writing, and language usage were analyzed in a series of generalizability studies using a person: (school x form) design that provided variance component estimates for four sources: school, form, school x form, and person: (school x form). Six separate scores were examined. The results of the generalizability studies were then used in decision studies to determine the impact on score reliability when the number of forms administered within schools was varied. Results from the decision studies indicated that score generalizability could be improved when the number of forms administered within schools was increased from one to three forms, but that gains in generalizability were small when the number of forms was increased beyond three. The implications of these results for planning large-scale performance assessments are discussed.
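
The decision-study logic described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' analysis: it assumes relative decisions about school means, that every school is administered the same forms (so the form main effect drops out of the relative error), and that `n_persons` students take each form in each school. The variance components are hypothetical values chosen only to show the shape of the curve.

```python
def school_g_coefficient(var_school, var_school_form, var_person, n_forms, n_persons):
    """Generalizability coefficient for school means under relative decisions.

    Assumed relative error variance for a person:(school x form) design:
        var_school_form / n_forms + var_person / (n_forms * n_persons)
    """
    relative_error = var_school_form / n_forms + var_person / (n_forms * n_persons)
    return var_school / (var_school + relative_error)


# Hypothetical variance components (NOT values reported in the study).
components = dict(var_school=0.10, var_school_form=0.05, var_person=0.85)

# Vary the number of forms administered within schools, as in a decision study.
for n_forms in (1, 2, 3, 4, 5):
    coef = school_g_coefficient(n_forms=n_forms, n_persons=20, **components)
    print(f"{n_forms} form(s): coefficient = {coef:.3f}")
```

With these illustrative inputs, each added form raises the coefficient by less than the previous one, which is the diminishing-returns pattern the abstract reports beyond three forms.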

6.

Test anxiety and GCSE performance: the effect of gender and socio‐economic background

David William Putwain 《Educational Psychology in Practice》2008,24(4):319-334

Despite a well established body of international literature describing the effect of test anxiety on student performance in a range of assessments, there has been little work conducted on samples of students from the UK. The purpose of this exploratory study is two‐fold. First, to establish the relationship between test anxiety and assessment performance in a group of students in their final year of compulsory secondary schooling, in the politicised educational context of the UK. Second, to establish if this relationship is moderated by gender and socio‐economic background. Data were gathered on trait test anxiety, GCSE examination performance in Mathematics, English Language and Science, gender and socio‐economic background from 557 mixed ability Year 11 students drawn from three secondary schools in the UK. A hierarchical regression analysis was used to establish the moderating influence of gender and socio‐economic background. Results suggest a small, but significant inverse relationship between test anxiety scores and mean examination performance and that the cognitive component of test anxiety accounts for 7% of variance in examination performance. A differential test anxiety–assessment performance relationship was reported for socio‐economic background but not gender. Although the data reported for the test anxiety–assessment performance relationship are similar to those reported in numerous other studies, it is hypothesised that contextualised features associated with secondary education in the UK, particularly efforts to raise attainment, may have influenced these results.

7.

Why Students Answer TIMSS Science Test Items the Way They Do

Harlow Ann Jones Alister 《Research in Science Education》2004,34(2):221-238

The purpose of this study was to explore how Year 8 students answered Third International Mathematics and Science Study (TIMSS) questions and whether the test questions represented the scientific understanding of these students. One hundred and seventy-seven students were tested using written test questions taken from the science test used in the Third International Mathematics and Science Study. The degree to which a sample of 38 children represented their understanding of the topics in a written test compared to the level of understanding that could be elicited by an interview is presented in this paper. In exploring student responses in the interview situation this study hoped to gain some insight into the science knowledge that students held and whether or not the test items had been able to elicit this knowledge successfully. We question the usefulness and quality of data from large-scale summative assessments on their own to represent student scientific understanding and conclude that large scale written test items, such as TIMSS, on their own are not a valid way of exploring students' understanding of scientific concepts. Considerable caution is therefore needed in exploiting the outcomes of international achievement testing when considering educational policy changes or using TIMSS data on their own to represent student understanding.

8.

Relationships among Singaporean secondary teachers’ conceptions of assessment and school and policy contextual factors

Gavin W. Fulmer Kelvin H. K. Tan Iris C. H. Lee 《Assessment in Education: Principles, Policy & Practice》2019,26(2):166-183

This study examines teachers’ conceptions of assessment and related contextual factors at the classroom, school and national levels. A representative survey of Singaporean secondary school teachers resulted in a final sample consisting of 229 teachers from 9 secondary schools. Findings show that teachers endorse views of assessment for school accountability, student accountability and student improvement, but show little endorsement of assessment as irrelevance. Teachers report feeling capable and qualified to use assessments, but are concerned about how much they are trusted as assessors at school and national levels. Follow-up latent class analysis identified groups of teachers based on their responses to the irrelevance of assessment; teachers who found assessment irrelevant were present across all schools and subjects, but showed lower sense of preparation for assessment, school-level support and importance of academic success in society.

9.

The Generalizability of Motivation Filtering in Improving Test Score Validity

《Educational Assessment》2013,18(1):65-83

Accountability for educational quality is a priority at all levels of education. Low-stakes testing is one way to measure the quality of education that students receive and make inferences about what students know and can do. Aggregate test scores from low-stakes testing programs are suspect, however, to the degree that these scores are influenced by low test-taker effort. This study examined the generalizability of a recently developed technique called motivation filtering, whereby scores for students of low motivation are systemically filtered from test data to determine aggregate test scores that more accurately reflect student performance and that can be used for reporting purposes. Across assessment tests in five different content areas, motivation filtering was found to consistently increase mean test performance and convergent validity.
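
The filtering step described in this abstract can be illustrated with a short sketch. The abstract does not say how motivation was measured or where the cut-off was set, so the effort index (a 0 to 1 value per examinee), the `min_effort` threshold, and the data below are all hypothetical.

```python
from statistics import mean


def filtered_mean(records, min_effort):
    """Mean score after dropping examinees whose effort index is below min_effort.

    records: iterable of (score, effort) pairs; effort is a hypothetical 0-1 index.
    """
    kept = [score for score, effort in records if effort >= min_effort]
    if not kept:
        raise ValueError("no examinees meet the effort threshold")
    return mean(kept)


# Made-up (score, effort) pairs for five examinees.
records = [(72, 0.95), (65, 0.90), (31, 0.20), (80, 1.00), (28, 0.35)]

unfiltered = mean(score for score, _ in records)
filtered = filtered_mean(records, min_effort=0.8)
print(f"unfiltered mean: {unfiltered:.1f}, filtered mean: {filtered:.1f}")
```

In this toy data the low-effort examinees also have the lowest scores, so the filtered mean is higher than the unfiltered one; whether that holds in practice is the empirical question the study addresses.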

10.

Thinking beyond the score: Multidimensional analysis of student performance to inform the next generation of science assessments

Lourdes Cardozo-Gaibisso Seohyun Kim Cory Buxton Allan Cohen 《Journal of Research in Science Teaching》2020,57(6):856-878

Conventional assessment analysis of student results, referred to as rubric-based assessments (RBA), has emphasized numeric scores as the primary way of communicating information to teachers about their students’ learning. In this light, rethinking and reflecting on not only how scores are generated but also what analyses are done with them to inform classroom practices is of utmost importance. Informed by Systemic Functional Linguistics and Latent Dirichlet Allocation analyses, this study utilizes an innovative bilingual (Spanish–English) constructed response assessment of science and language practices for middle and high school students to perform a multilayered analysis of student responses. We explore multiple ways of looking at students’ performance through their written assessments and discuss features of student responses that are made visible through these analyses. Findings from this study suggest that science educators would benefit from a multidimensional model which deploys complementary ways in which we can interpret student performance. This understanding leads us to think that researchers and developers in the field of assessment need to promote approaches that analyze student science performance as a multilayered phenomenon.

11.

Using growth models to monitor school performance: comparing the effect of the metric and the assessment

Pete Goldschmidt Kilchan Choi Felipe Martinez John Novak 《School Effectiveness & School Improvement》2013,24(3):337-357

This paper investigates whether inferences about school performance based on longitudinal models are consistent when different assessments and metrics are used as the basis for analysis. Using norm-referenced (NRT) and standards-based (SBT) assessment results from panel data of a large heterogeneous school district, we examine inferences based on vertically equated scale scores, normal curve equivalents (NCEs), and nonvertically equated scale scores. The results indicate that the effect of the metric depends upon the evaluation objective. NCEs significantly underestimate absolute individual growth, but NCEs and scale scores yield highly correlated (r > .90) school-level results based on mean initial status and growth estimates. SBT and NRT results are highly correlated for status but only moderately correlated for growth. We also find that as few as 30 students per school provide consistent results and that mobility tends to affect inferences based on status but not growth – irrespective of the assessment or metric used.
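
For readers unfamiliar with the NCE metric named in this abstract, the conventional conversion from a national percentile rank is sketched below. The formula (mean 50, standard deviation 21.06, so that NCEs of 1, 50 and 99 line up with the same percentile ranks) is the standard definition of the scale, not something taken from the paper; the function name is hypothetical.

```python
from statistics import NormalDist


def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to a normal curve equivalent."""
    z = NormalDist().inv_cdf(percentile_rank / 100)  # standard normal deviate
    return 50 + 21.06 * z


for pr in (1, 25, 50, 75, 99):
    print(f"percentile {pr:>2} -> NCE {percentile_to_nce(pr):.1f}")
```

Because NCEs are norm-referenced, a student who stays at the same NCE from one year to the next has kept pace with the norm group rather than shown zero learning, which is one way to read the abstract's point that NCEs understate absolute growth.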

12.

Content and alignment of state writing standards and assessments as predictors of student writing achievement: an analysis of 2007 National Assessment of Educational Progress data

Gary A. Troia Natalie G. Olinghouse Mingcai Zhang Joshua Wilson Kelly A. Stewart Ya Mo Lisa Hawkins 《Reading and writing》2018,31(4):835-864

We examined the degree to which content of states’ writing standards and assessments (using measures of content range, frequency, balance, and cognitive complexity) and their alignment were related to student writing achievement on the 2007 National Assessment of Educational Progress (NAEP), while controlling for student, school, and state characteristics. We found student demographic characteristics had the largest effect on between-state differences in writing performance, followed by state policy-related variables, then state and school covariates. States with writing tests that exhibited greater alignment with the NAEP writing assessment demonstrated significantly higher writing scores. We discuss plausible implications of these findings.

13.

The Predictive Validity of Interim Assessment Scores Based on the Full-Information Bifactor Model for the Prediction of End-of-Grade Test Performance

Jason C. Immekus Ben Atitya 《Educational Assessment》2016,21(3):176-195

Interim tests are a central component of district-wide assessment systems, yet their technical quality to guide decisions (e.g., instructional) has been repeatedly questioned. In response, the study purpose was to investigate the validity of a series of English Language Arts (ELA) interim assessments in terms of dimensionality and prediction of summative test performance, based on Grade 6 student data (N = 4,651) from a larger, urban district. Factor analytic results supported modeling the interim test data in terms of a bifactor model (Gibbons & Hedeker, 1992), with items reporting moderate to high relationships to the primary dimension (i.e., ELA) and varying estimates on the secondary domains. Hierarchical multiple linear regression results indicated that primary ELA scores were the strongest predictors of summative test performance, with subscale scores not improving predictive accuracy. Findings address issues pertaining to investigating the technical quality of test data widely used in district-wide assessment systems.

14.

Measurement, Sampling, and Equating Errors in Large-Scale Assessments

Margaret Wu 《Educational Measurement》2010,29(4):15-27

In large-scale assessments, such as state-wide testing programs, national sample-based assessments, and international comparative studies, there are many steps involved in the measurement and reporting of student achievement. There are always sources of inaccuracies in each of the steps. It is of interest to identify the source and magnitude of the errors in the measurement process that may threaten the validity of the final results. Assessment designers can then improve the assessment quality by focusing on areas that pose the highest threats to the results. This paper discusses the relative magnitudes of three main sources of error with reference to the objectives of assessment programs: measurement error, sampling error, and equating error. A number of examples from large-scale assessments are used to illustrate these errors and their impact on the results. The paper concludes by making a number of recommendations that could lead to an improvement of the accuracies of large-scale assessment results.

15.

Assessment in an era of accessibility: Evaluating rules for scripting audio representation of test items

Christopher Johnstone Jennifer Higgins Gaye Fedorchak 《British journal of educational technology : journal of the Council for Educational Technology》2019,50(2):806-818

Standardized, large-scale assessment of educational outcomes has become a global phenomenon over the past three decades (Smith, 2016). A key challenge facing assessment designers is that standard formats may be inaccessible or may create barriers to student performance. Schwanke, Smith, and Edyburn's (2001) A3 model describes how advocates have reacted to structural barriers by providing accommodations and, ultimately, accessibility. This paper synthesizes and evaluates three studies that attempted to improve accessibility in assessments for students who struggle with print reading through audio presentation of assessment items. Cross-study implications for policy and practice are considered.

16.

Large‐scale Portfolio Assessments in the US: evidence pertaining to the quality of measurement

Daniel Koretz 《Assessment in Education: Principles, Policy & Practice》1998,5(3):309-334

Portfolio assessment, that is, the evaluation of performance by means of a cumulative collection of student work, has figured prominently in recent US debate about education reform. Proponents hope not only to broaden measurement of performance, but also to use portfolio assessment to encourage improved instruction. Although portfolio assessment has sparked considerable attention and enthusiasm, it has been incorporated into only a few of the nearly ubiquitous large‐scale external assessment programmes in the US. This paper evaluates the quality of the performance data produced by several large‐scale portfolio efforts. Evaluations of reliability, which have focused primarily on the consistency of scoring, have yielded highly variable results. While high levels of consistency have been reached in some cases, scoring has been quite inconsistent in others, to the point of severely limiting the utility of scores.

Information about other aspects of validity is more limited and generally discouraging. For example, scores from portfolio assessments often do not show anticipated relationships with other achievement data, and teachers report practices in the implementation of portfolio assessment that are appropriate for instructional purposes but threaten the validity of inferences from portfolio scores. While other studies show positive effects of portfolio programmes (see Stecher, this issue), these findings suggest that portfolio assessment at its current state of development is problematic for many of the uses to which large‐scale external assessments are now put in the US.

17.

Proximal Versus Distal Validity Coefficients for Teacher Observational Instruments

Robert J. Marzano 《The Teacher Educator》2014,49(2):89-96

This study examined the use of measures of student learning computed using end-of-year assessments (distal measures) versus measures of student learning associated with a single lesson (proximal measures) as criterion scores for the validity of observations of teachers' pedagogical skills. The validity coefficients computed using distal measures were significantly lower than the validity coefficient computed using proximal measures. Assumptions underlying the current emphasis on distal measures were challenged. Possible ways to generate more proximal measures were explored.

18.

Alignment and Implications for Test Takers

Catherine J. Welch Stephen B. Dunbar 《Educational Measurement》2020,39(2):8-17

The use of assessment results to inform school accountability relies on the assumption that the test design appropriately represents the content and cognitive emphasis reflected in the state's standards. Since the passage of the Every Student Succeeds Act and the certification of accountability assessments through federal peer review practices, the content validity arguments supporting accountability have relied almost exclusively on the alignment of statewide assessments to state standards. It is assumed that if alignment does not hold, the scores will not provide valid inferences regarding the degree to which test takers have performed. Although alignment results are commonly used as evidence of test appropriateness, Polikoff (this issue) would argue that given the importance of alignment in policy decisions, research related to alignment is surprisingly limited. Few studies have addressed the adequacy of alignment methodologies and results as support for the inferences to be made (i.e., proficient on state standards). This paper uses an example of test taker performance (and common performance indicators) to investigate to what extent the degree of alignment impacts inferences made about performance (i.e., classification into performance levels, estimates of student ability, and student rank order).

19.

The overall effects of end-of-course assessment on student performance: A comparison between multiple choice testing, peer assessment, case-based assessment and portfolio assessment

Katrien Struyven Filip Dochy Steven Janssens Wouter Schelfhou Sarah Gielen 《Studies in Educational Evaluation》2006,32(3):202

This study investigates the effect of method of assessment on student performance. Five research conditions go together with one of four assessment modes, namely: portfolio, case-based, peer assessment, and multiple choice evaluation. Data collection is done by means of a pre-test/post-test design with the help of two standardised tests (N=816). Results show that assessment method does make a difference: assessments do not produce overall effects on student performance. Moreover, student-activating instruction efforts do not automatically result in more extensive learning gains. Finally, test results show, when compared to other assessments, a statistically significant positive effect of the multiple choice test on students' test scores. However, students' preparation level and the closed book format of the tests might serve explanatory purposes.

20.

Achievement Measures of School Effectiveness: Comparison of Model Stability Across Years

《Applied Measurement in Education》2013,26(4):353-365

The purpose of this study was to determine the feasibility of combining different test types (criterion-referenced and norm-referenced) in a composite school achievement score to be used in a model for school effectiveness classification. The cross-year stability and within-model consistency of the composite was compared to models using subcomposite, overall scores for both the criterion-referenced and norm-referenced tests, subject-area scores (across grades), grade-level scores, and component scores for each grade. Stability of the different models across 2 years was determined by using the agreement ratio, kappa coefficient, and correlation of residuals (N = 361). The same statistical procedures were used to compute consistency across subsamples (N = 264). Results indicated that transforming and combining student-level scores of different test types, grade levels, and subject areas allows for a broader basis for judging schools and provides a school effectiveness model that is both consistent across subsamples and stable across years.
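
Two of the stability statistics named in this abstract, the agreement ratio and the kappa coefficient, can be computed as in the sketch below. It uses the standard unweighted Cohen's kappa; the abstract does not state which variant the authors used, and the category labels and classifications are made up for illustration.

```python
from collections import Counter


def agreement_and_kappa(year1, year2):
    """Return (agreement ratio, Cohen's kappa) for two classifications of the same schools."""
    if len(year1) != len(year2) or not year1:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(year1)
    # Share of schools placed in the same category in both years.
    observed = sum(a == b for a, b in zip(year1, year2)) / n
    # Agreement expected by chance from the marginal category frequencies.
    counts1, counts2 = Counter(year1), Counter(year2)
    expected = sum(counts1[c] * counts2[c] for c in counts1) / (n * n)
    if expected == 1:
        return observed, float("nan")  # kappa is undefined with a single shared category
    return observed, (observed - expected) / (1 - expected)


# Hypothetical effectiveness classifications of eight schools in two years.
year1 = ["high", "high", "avg", "low", "avg", "low", "high", "avg"]
year2 = ["high", "avg", "avg", "low", "avg", "low", "high", "low"]

ratio, kappa = agreement_and_kappa(year1, year2)
print(f"agreement ratio = {ratio:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than the raw agreement ratio because it discounts the agreement that would be expected by chance alone.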