期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Actual Interpretations and Use of Scores as Aspects of Validity

Timothy M. O'Leary John A. C. Hattie Patrick Griffin 《Educational Measurement》2017,36(2):16-23

Validity is the most fundamental consideration in test development. Understandably, much time, effort, and money is spent in its pursuit. Central to the modern conception of validity are the interpretations made, and uses planned, on the basis of test scores. There is, unfortunately, however, evidence that test users have difficulty understanding scores as intended. That is, although the proposed interpretations and use of test scores might be theoretically valid they might never come to be because the meaning of the message is lost in translation. This necessitates pause. It is almost absurd to think that the intended interpretations and uses of test scores might fail because there is a lack of alignment with the actual interpretations made and uses enacted by the audience. Despite this, there has only recently been contributions to the literature regarding the interpretability of score reports, the mechanisms by which scores are communicated to their audience, and their relevance to validity. These contributions have focused upon linking, through evidence, the intended interpretation and use with the actual interpretations being made and actions being planned by score users. This article reviews the current conception of validity, validation, and validity evidence with the goal of positioning the emerging notion of validity of usage within the current paradigm. 相似文献

2.

A General Framework for the Validation of Embedded Formative Assessment

Dorien Hopster‐den Otter Saskia Wools Theo J. H. M. Eggen Bernard P. Veldkamp 《Journal of Educational Measurement》2019,56(4):715-732

In educational practice, test results are used for several purposes. However, validity research is especially focused on the validity of summative assessment. This article aimed to provide a general framework for validating formative assessment. The authors applied the argument‐based approach to validation to the context of formative assessment. This resulted in a proposed interpretation and use argument consisting of a score interpretation and a score use. The former involves inferences linking specific task performance to an interpretation of a student's general performance. The latter involves inferences regarding decisions about actions and educational consequences. The validity argument should focus on critical claims regarding score interpretation and score use, since both are critical to the effectiveness of formative assessment. The proposed framework is illustrated by an operational example including a presentation of evidence that can be collected on the basis of the framework. 相似文献

3.

Using person response functions to investigate areas of person misfit related to item characteristics

A. Adrienne Walker Jeremy Kyle Jennings George Engelhard Jr. 《Educational Assessment》2018,23(1):47-68

Individual person fit analyses provide important information regarding the validity of test score inferences for an individual test taker. In this study, we use data from an undergraduate statistics test (N = 1135) to illustrate a two-step method that researchers and practitioners can use to examine individual person fit. First, person fit is examined numerically with several indices based on the Rasch model (i.e., Infit, Outfit, and Between-Subset statistics). Second, person misfit is presented graphically with person response functions, and these person response functions are interpreted using a heuristic. Individual person fit analysis holds promise for improving score interpretation in that it may detect potential threats to validity of score inferences for some test takers. Individual person fit analysis may also highlight particular subsets of items (on which a test taker performs unexpectedly) that can be used to further contextualize her or his test performance. 相似文献

4.

Explicating validity

Michael T. Kane 《Assessment in Education: Principles, Policy & Practice》2016,23(2):198-211

How we choose to use a term depends on what we want to do with it. If validity is to be used to support a score interpretation, validation would require an analysis of the plausibility of that interpretation. If validity is to be used to support score uses, validation would require an analysis of the appropriateness of the proposed uses, and therefore, would require an analysis of the consequences of the uses. In each case, the evidence need for validation would depend on the specific claims being made. 相似文献

5.

Evaluating Content-Related Validity Evidence Using a Text-Based Machine Learning Procedure

Daniel Anderson Brock Rowley Sondra Stegenga P. Shawn Irvin Joshua M. Rosenberg 《Educational Measurement》2020,39(4):53-64

Validity evidence based on test content is critical to meaningful interpretation of test scores. Within high-stakes testing and accountability frameworks, content-related validity evidence is typically gathered via alignment studies, with panels of experts providing qualitative judgments on the degree to which test items align with the representative content standards. Various summary statistics are then calculated (e.g., categorical concurrence, balance of representation) to aid in decision-making. In this paper, we propose an alternative approach for gathering content-related validity evidence that capitalizes on the overlap in vocabulary used in test items and the corresponding content standards, which we define as textual congruence. We use a text-based, machine learning model, specifically topic modeling, to identify clusters of related content within the standards. This model then serves as the basis from which items are evaluated. We illustrate our method by building a model from the Next Generation Science Standards, with textual congruence evaluated against items within the Oregon statewide alternate assessment. We discuss the utility of this approach as a source of triangulating and diagnostic information and show how visualizations can be used to evaluate the overall coverage of the content standards across the test items. 相似文献

6.

A Framework for Conceptualizing and Evaluating the Validity of Instructionally Relevant Assessments

James W. Pellegrino Louis V. DiBello Susan R. Goldman 《教育心理学家》2016,51(1):59-81

Assessments that function close to classroom teaching and learning can play a powerful role in fostering academic achievement. Unfortunately, however, relatively little attention has been given to discussion of the design and validation of such assessments. The present article presents a framework for conceptualizing and organizing the multiple components of validity applicable to assessments intended for use in the classroom to support ongoing processes of teaching and learning. The conceptual framework builds on existing validity concepts and focuses attention on three components: cognitive validity, instructional validity, and inferential validity. The goal in presenting the framework is to clarify the concept of validity, including key components of the interpretive argument, while considering the types and forms of evidence needed to construct a validity argument for classroom assessments. The framework's utility is illustrated by presenting an application to the analysis of the validity of assessments embedded within an elementary mathematics curriculum. 相似文献

7.

An Experimental Study of the Internal Consistency of Judgments Made in Bookmark Standard Setting

下载免费PDF全文

Brian E. Clauser Peter Baldwin Melissa J. Margolis Janet Mee Marcia Winward 《Journal of Educational Measurement》2017,54(4):481-497

Validating performance standards is challenging and complex. Because of the difficulties associated with collecting evidence related to external criteria, validity arguments rely heavily on evidence related to internal criteria—especially evidence that expert judgments are internally consistent. Given its importance, it is somewhat surprising that evidence of this kind has rarely been published in the context of the widely used bookmark standard‐setting procedure. In this article we examined the effect of ordered item booklet difficulty on content experts’ bookmark judgments. If panelists make internally consistent judgments, their resultant cut scores should be unaffected by the difficulty of their respective booklets. This internal consistency was not observed: the results suggest that substantial systematic differences in the resultant cut scores can arise when the difficulty of the ordered item booklets varies. These findings raise questions about the ability of content experts to make the judgments required by the bookmark procedure. 相似文献

8.

A Proposed Framework for Evaluating Alignment Studies

Susan L. Davis‐Becker Chad W. Buckendahl 《Educational Measurement》2013,32(1):23-33

Evaluating the multiple characteristics of alignment has taken a prominent role in educational assessment and accountability systems given its attention in the No Child Left Behind legislation (NCLB). Leading to this rise in popularity, alignment methodologies that examined relationships among curriculum, academic content standards, instruction, and assessments were proposed as strategies to evaluate evidence of the intended uses and interpretations of test scores. In this article, we propose a framework for evaluating alignment studies based on similar concepts that have been recommended for standard setting (Kane). This framework provides guidance to practitioners about how to identify sources of validity evidence for an alignment study and make judgments about the strength of the evidence that may impact the interpretation of the results. 相似文献

9.

Development of the Physical Educators’ Judgments about Inclusion Instrument for Japanese Physical Education Majors and an Analysis of their Judgments

Samuel R. Hodge Takahiro Sato Takahito Mukoyama Francis M. Kozub 《International Journal of Disability, Development & Education》2013,60(4):332-346

The purpose of this study was to assess the validity and reliability of the Physical Educators’ Judgments about Inclusion (PEJI) survey for analysing the judgments of Japanese (361 male, 170 female) physical education teacher education majors. A secondary purpose was to examine group differences in judgments as a function of gender and past experiences. Data were collected and psychometrical properties of the PEJI were assessed using a maximum-likelihood extraction method. Confirmatory factor analysis resulted in salient loadings for all items on three hypothesised dimensions, resulting in 47% explained variance for measuring judgments about: Inclusion versus Exclusion, Acceptance of Students with Disabilities, and Perceived Training Needs. Supported by the three interpretable factors that emerged, construct validity evidence is presented. Although most of the physical education teacher education majors sampled had yet to teach students with disabilities, they had formed preliminary judgments about doing so. 相似文献

10.

On the validity of useless tests

Stephen G. Sireci 《Assessment in Education: Principles, Policy & Practice》2016,23(2):226-235

A misconception exists that validity may refer only to the interpretation of test scores and not to the uses of those scores. The development and evolution of validity theory illustrate test score interpretation was a primary focus in the earliest days of modern testing, and that validating interpretations derived from test scores remains essential today. However, test scores are not interpreted and then ignored; rather, their interpretations lead to actions. Thus, a modern definition of validity needs to describe the validation of test score interpretations as a necessary, but insufficient, step en route to validating the uses of test scores for their intended purposes. To ignore test use in defining validity is tantamount to defining validity for ‘useless’ tests. The current definition of validity stipulated in the 2014 version of the Standards for Educational and Psychological Testing properly describes validity in terms of both interpretations and uses, and provides a sufficient starting point for validation. 相似文献

11.

测验开发中的公平性审核探微——以美国ETS为例

陈吉《现代教育论丛》2011,(3)

测验公平有很多不同的定义,基于效度对其进行界定对于测验开发者而言是最有用的。要开发一个公平的测验,对试题进行公平性审核是不可缺少的重要一环。为使审核过程不那么主观,应遵循一定的审核原则。此外,为更好地解决公平性审核过程中出现的问题,还应建立规范的审核程序。ETS在对测验进行公平性审核的方面积累了丰富经验,其中有不少值得借鉴之处。相似文献

12.

国家职业汉语能力测试的效度分析 总被引：1，自引：1，他引：1

XIE Xiaoqing REN Jie 《中国考试》2008,(9)

国家职业汉语能力测试(ZHC)由中国劳动和社会保障部职业技能鉴定中心(OSTA)组织国内语言学、语言教学、心理学和教育测量学等方面的专家开发研制。ZHC是测查应试者在职业活动中的汉语能力的国家级职业核心能力测试。效度研究是一个关于考试有效性资料的积累过程,是通过积累证据对考试提供支持的过程,我们需要从多种角度对考试的有效性进行检验,积累资料。在本项研究中,从ZHC成绩与学历的相关、试卷的内部结构分析(不同意型间相关和因素分析)、DIF分析(包含关于性别和文理科专业的分析)、名牌大学与普通大学在校生成绩的比较、用户调查等几个方面对ZHC进行了效度分析。测验分数与学历的相关分析、试卷结构分析、DIF分析、组间比较和用户调查的结果都对ZHC的效度提供了支持,显示ZHC具有较好的效度,可以比较真实准确地反映出受测者的实际语言能力和逻辑思维能力。相似文献

13.

Teaching for the Test: Validity, Fairness, and Moral Action 总被引：1，自引：0，他引：1

Linda Crocker 《Educational Measurement》2003,22(3):5-11

In response to heightened levels of assessment activity at the K-12 level to meet requirements of the No Child Left Behind Act of 2001, measurement professionals are called to focus greater attention on four fundamental areas of measurement research and practice: (a) improving the research infrastructure for validation methods involving judgments of test content; (b) expanding the psychometric definition of fairness in achievement testing; (c) developing guidelines for validation studies of test use consequences; and (d) preparing teachers for new roles in instruction and assessment practice. Illustrative strategies for accomplishing these goals are outlined. 相似文献

14.

Validating test score meaning and defending test score use: different aims,different methods

Gregory J. Cizek 《Assessment in Education: Principles, Policy & Practice》2016,23(2):212-225

Advances in validity theory and alacrity in validation practice have suffered because the term validity has been used to refer to two incompatible concerns: (1) the degree of support for specified interpretations of test scores (i.e. intended score meaning) and (2) the degree of support for specified applications (i.e. intended test uses). This article provides a brief summary of current validity theory, explication of a critical flaw in the current conceptualisation of validity, and a framework that both accommodates and differentiates validation of test score inferences and justification of test use. 相似文献

15.

Building and Supporting a Validity Argument for a Standards‐Based Classroom Assessment of English Proficiency Based on Teacher Judgments

Lorena Llosa 《Educational Measurement》2008,27(3):32-42

Using an argument‐based approach to validation, this study examines the quality of teacher judgments in the context of a standards‐based classroom assessment of English proficiency. Using Bachman's (2005) assessment use argument (AUA) as a framework for the investigation, this paper first articulates the claims, warrants, rebuttals, and backing needed to justify the link between teachers' scores on the English Language Development (ELD) Classroom Assessment and the interpretations made about students' language ability. Then the paper summarizes the findings of two studies—one quantitative and one qualitative—conducted to gather the necessary backing to support the warrants and, in particular, address the rebuttals about teacher judgments in the argument. The quantitative study examined the assessment in relation to another measure of the same ability—the California English Language Development Test—using confirmatory factor analysis of multitrait‐multimethod data and provided evidence in support of the warrant that states that the ELD Classroom Assessment measures English proficiency as defined by the California ELD Standards. The qualitative study examined the processes teachers engaged in while scoring the classroom assessment using verbal protocol analysis. The findings of this study serve to support the rebuttals in the validity argument that state that there are inconsistencies in teachers' scoring. The paper concludes by providing an explanation for these seemingly contradictory findings using the AUA as a framework and discusses the implications of the findings for the use of standards‐based classroom assessments based on teacher judgments. 相似文献

16.

Voices From Test-Takers: Further Evidence for Language Assessment Validation and Use

Liying Cheng Christopher DeLuca 《Educational Assessment》2013,18(2):104-122

Test-takers' interpretations of validity as related to test constructs and test use have been widely debated in large-scale language assessment. This study contributes further evidence to this debate by examining 59 test-takers' perspectives in writing large-scale English language tests. Participants wrote about their test-taking experiences in 300 to 500 words, focusing on their perceptions of test validity and test use. A standard thematic coding process and logical cross-analysis were used to analyze test-takers' experiences. Codes were deductively generated and related to both experiential (i.e., testing conditions and consequences) and psychometric (i.e., test construction, format, and administration) aspects of testing. These findings offer test-takers' voices on fundamental aspects of language assessment, which bear implications for test developers, test administrators, and test users. The study also demonstrated the need for obtaining additional evidence from test-takers for validating large-scale language tests. 相似文献

17.

Predicting student achievement for low stakes tests with effort and task value 总被引：1，自引：0，他引：1

James S. Cole David A. Bergin Tiffany A. Whittaker 《Contemporary educational psychology》2008,33(4):609-624

We investigated motivation for taking low stakes tests. Based on expectancy-value theory, we expected that the effect of student perceptions of three task values (interest, usefulness, and importance) on low stakes test performance would be mediated by the student’s reported effort. We hypothesized that all three task value components would play a significant role in predicting test-taking effort, and that effort would significantly predict test performance. Participants were 1005 undergraduate students enrolled at four midsize public universities. After students took all four subtests of CBASE, a standardized general education exam, they immediately filled out a motivation survey. Path analyses showed that the task value variables usefulness and importance significantly predicted test-taking effort and performance for all four tests. These results provide evidence that students who report trying hard on low stakes tests score higher than those who do not. The results indicate that if students do not perceive importance or usefulness of an exam, their effort suffers and so does their test score. While the data are correlational, they suggest that it might be useful for test administrators and school staff to communicate to students the importance and usefulness of the test that they are being asked to complete. 相似文献

18.

Themes and Variations in Validity Theory

Pamela A. Moss 《Educational Measurement》1995,14(2):5-13

Should the Standards reflect the perspective that construct validity is central to all validation efforts? Is the construct-/content-/criterion-related categorization of validity evidence now obsolete? Should the definition of validity include consideration of the consequences of test use? 相似文献

19.

DIFFERENTIAL WEIGHTING BY JUDGED DEGREE1OF CORRECTNESS

DURGADAS PATNAIK ROSS E. TRAUB 《Journal of Educational Measurement》1973,10(4):281-286

Two conventional scores and a weighted score on a group test of general intelligence were compared for reliability and predictive validity. One conventional score consisted of the number of correct answers an examinee gave in responding to 69 multiple-choice questions; the other was the formula score obtained by subtracting from the number of correct answers a fraction of the number of wrong answers. A weighted score was obtained by assigning weights to all the response alternatives of all the questions and adding the weights associated with the responses, both correct and incorrect, made by the examinee. The weights were derived from degree-of-correctness judgments of the set of response alternatives to each question. Reliability was estimated using a split-half procedure; predictive validity was estimated from the correlation between test scores and mean school achievement. Both conventional scores were found to be significantly less reliable but significantly more valid than the weighted scores. (The formula scores were neither significantly less reliable nor significantly more valid than number-correct scores.) 相似文献

20.

An Examination of English Teachers' Opinions About the Ontario Grade 9 Reading and Writing Test 总被引：1，自引：1，他引：1

Tony C.M. Lam Catherine Bordignon 《Interchange》2001,32(2):131-145

Interest in the use of large-scale achievement testing for accountability purposes and to drive instructional reform has been increasing in Canada. In the 1995 publications in Interchange, several researchers debated the merits and demerits of standardized achievement testing, including among the latter a tendency to reduce the curriculum and overemphasize routine learning (i.e., "teaching to the test"). Almost no studies have found empirical evidence for such testing's purported benefits. We set out to investigate these issues in Ontario: We present findings from a mail survey designed to find out, from Grade 9 and Grade 10 English teachers in Ontario, their perception of the quality of the Ontario Grade 9 literacy testing program and the effects it has had on the teaching and learning processes. Based on the responses of 107 teachers, our results paint a negative picture of teachers' opinions of the Grade 9 test in terms of its quality and its impact on teaching and learning. Three years after the Grade 9 test was first introduced, Grade 9 and 10 English teachers are still not convinced of its value. Our findings (and those from two other similar surveys) appear to suggest, at least based on teachers' self-reporting, that the purposes of the test — improving the quality of education and learning — as envisioned by the Ontario Ministry of Education and Training have not been met. These findings support those of other assessment impact studies in Canada, namely British Columbia and Alberta, regarding the adverse consequences of large-scale standardized testing (either multiple-choice test or performance-based assessment), and the lack of evidence for its purported positive educational influences. We recommend future research to investigate further the validity and the educational impact of the provincial tests and the reasons responsible for the observed impact or lack of it, and to determine resources, such as teacher training and materials, that are necessary to supplement the provincial testing program's effort to improve teaching and learning. 相似文献