1.

Reliable and Valid Procedures to Create an Authentic Listening Test in EFL Context

Wu Ting (吴婷) 《Overseas English》(海外英语) 2012,(22):103-105

Listening testing is a universal social activity, especially in school life, and an indispensable part of language assessment. How test takers perform on these tests may affect their entry into many significant roles in both society and school. This paper explores how to design a reliable and valid listening test for particular purposes in an EFL context.

2.

Valid and Reliable Science Content Assessments for Science Teachers

Thomas R. Tretter Sherri L. Brown William S. Bush Jon C. Saderholm Vicki-Lynn Holmes 《Journal of Science Teacher Education》2013,24(2):269-295

Science teachers’ content knowledge is an important influence on student learning, highlighting an ongoing need for programs, and assessments of those programs, designed to support teacher learning of science. Valid and reliable assessments of teacher science knowledge are needed for direct measurement of this crucial variable. This paper describes multiple sources of validity and reliability (Cronbach’s alpha greater than 0.8) evidence for physical, life, and earth/space science assessments, which are part of the Diagnostic Teacher Assessments of Mathematics and Science (DTAMS) project. Validity was strengthened by systematic synthesis of relevant documents, extensive use of external reviewers, and field tests with 900 teachers during the assessment development process. Subsequent results from 4,400 teachers, analyzed with Rasch IRT modeling techniques, offer construct and concurrent validity evidence.
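For readers unfamiliar with the reliability statistic quoted in this abstract, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the toy score matrix are illustrative assumptions; they are not taken from the DTAMS project.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix (hypothetical helper).

    scores: list of rows, one row per examinee, one column per item.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two examinees")
    # Variance of each item (column) across examinees.
    item_variances = [variance(column) for column in zip(*scores)]
    # Variance of each examinee's total score.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Made-up data: four examinees, three items.
toy_scores = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [5, 5, 4]]
alpha = cronbach_alpha(toy_scores)
```

On this toy matrix the value exceeds the 0.8 level mentioned in the abstract, but the data are invented purely to exercise the formula.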

3.

Developing a Reliable and Valid Assessment Tool for Online Classes

Sahar Bahmani 《Assessment Update》2018,30(2):4-14

4.

Reporting Valid and Reliable Overall Scores and Domain Scores

Lihua Yao 《Journal of Educational Measurement》2010,47(3):339-360

In educational assessment, overall scores obtained by simply averaging a number of domain scores are sometimes reported. However, simply averaging the domain scores ignores the fact that different domains have different score points, that scores from those domains are related, and that at different score points the relationship between overall score and domain score may be different. To report reliable and valid overall scores and domain scores, I investigated the performance of four methods using both real and simulation data: (a) the unidimensional IRT model; (b) the higher-order IRT model, which simultaneously estimates the overall ability and domain abilities; (c) the multidimensional IRT (MIRT) model, which estimates domain abilities and uses the maximum information method to obtain the overall ability; and (d) the bifactor general model. My findings suggest that the MIRT model not only provides reliable domain scores, but also produces reliable overall scores. The overall score from the MIRT maximum information method has the smallest standard error of measurement. In addition, unlike the other models, there is no linear relationship assumed between overall score and domain scores. Recommendations for sizes of correlations between domains and the number of items needed for reporting purposes are provided.

5.

Reasoning About Evidence in Portfolios: Cognitive Foundations for Valid and Reliable Assessment

《Educational Assessment》2013,18(1):5-40

6.

Equating Subscores under the Nonequivalent Anchor Test (NEAT) Design

Gautam Puhan Longjuan Liang 《Educational Measurement》2011,30(1):23-35

The study examined two approaches for equating subscores. They are (1) equating subscores using internal common items as the anchor to conduct the equating, and (2) equating subscores using equated and scaled total scores as the anchor to conduct the equating. Since equated total scores are comparable across the new and old forms, they can be used as an anchor to equate the subscores. Both chained linear and chained equipercentile methods were used. Data from two tests were used to conduct the study and results showed that when more internal common items were available (i.e., 10–12 items), then using common items to equate the subscores is preferable. However, when the number of common items is very small (i.e., five to six items), then using total scaled scores to equate the subscores is preferable. For both tests, not equating (i.e., using raw subscores) is not reasonable as it resulted in a considerable amount of bias.

7.

Exploration of Factors Affecting the Added Value of Test Subscores

Xiaolin Wang Dubravka Svetina Shenghai Dai 《Journal of Experimental Education》2019,87(2):179-192

Recently, interest in test subscore reporting for diagnosis purposes has been growing rapidly. The two simulation studies here examined factors (sample size, number of subscales, correlation between subscales, and three factors affecting subscore reliability: number of items per subscale, item parameter distribution, and data generating model) that affected the value of reporting subscores within the classical test theory framework. Results showed that a higher proportion of subscores of added value was related to lower correlation between subscales, more items per subscale, no guessing in responses, smaller variability in difficulty parameters, and matched average item difficulty and average examinee ability.

8.

Guidelines for Interpreting and Reporting Subscores

Richard A. Feinberg Daniel P. Jurich 《Educational Measurement》2017,36(1):5-13

Recent research has proposed a criterion to evaluate the reportability of subscores. This criterion is a value‐added ratio (VAR), where values greater than 1 suggest that the true subscore is better approximated by the observed subscore than by the total score. This research extends the existing literature by quantifying statistical significance and effect size for using VAR to provide practical guidelines for subscore interpretation and reporting. Findings indicate that subscores with VAR ≥ 1.1 are a minimum requirement for a meaningful contribution to a user's score interpretation; subscores with .9 < VAR < 1.1 are redundant with the total score and subscores with VAR ≤ .9 would be misleading to report. Additionally, we discuss what to do when subscores do not add value, yet must be reported, as well as when VAR ≥ 1.1 may be undesirable.
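The three VAR bands stated in this abstract can be read as a simple decision rule. The sketch below encodes only those cut points (1.1 and .9); the function name `classify_var` and the returned labels are hypothetical, and computing VAR itself is outside the scope of this illustration.

```python
def classify_var(var):
    """Map a value-added ratio (VAR) to the band described in the abstract.

    The labels are illustrative names, not terms from the article.
    """
    if var >= 1.1:
        # Minimum requirement for a meaningful contribution.
        return "meets minimum requirement"
    if var <= 0.9:
        # Reporting such a subscore would be misleading.
        return "misleading to report"
    # 0.9 < VAR < 1.1: redundant with the total score.
    return "redundant with total score"


bands = {v: classify_var(v) for v in (0.5, 0.9, 1.0, 1.1, 1.25)}
```

Note that the abstract describes VAR ≥ 1.1 as a minimum requirement and mentions cases where it may still be undesirable, so the first label should not be read as a recommendation to report.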

9.

A Method of Providing a More Valid Distribution of School Marks

R. W. Edmiston 《Journal of Experimental Education》2013,81(3):194-197

The cognitive thought processes involved in students’ answers to different kinds of teachers’ questions were investigated using data obtained from a previous study. The dimensions examined were (a) the degree of correspondence between the cognitive level of teachers’ questions and the cognitive level of students’ answers, and (b) the relation of that correspondence to the type of cognitive coding system used, grade level, and clarity of the questions and answers. It was found that the chances are about even that there will be a correspondence between the cognitive level of the question asked and the cognitive level of the response that was elicited. The coding system used, grade level of the students, and clarity of the questions each moderated this effect.

10.

Subscores Based on Classical Test Theory: To Report or Not to Report

Sandip Sinharay Shelby Haberman Gautam Puhan 《Educational Measurement》2007,26(4):21-28

There is an increasing interest in reporting subscores, both at examinee level and at aggregate levels. However, it is important to ensure reasonable subscore performance in terms of high reliability and validity to minimize incorrect instructional and remediation decisions. This article employs a statistical measure based on classical test theory that is conceptually similar to the test reliability measure and can be used to determine when subscores have any added value over total scores. The usefulness of subscores is examined both at the level of the examinees and at the level of the institutions that the examinees belong to. The suggested approach is applied to two data sets from a basic skills test. The results provide little support in favor of reporting subscores for either examinees or institutions for the tests studied here.

11.

Alignment and Implications for Test Takers

Catherine J. Welch Stephen B. Dunbar 《Educational Measurement》2020,39(2):8-17

The use of assessment results to inform school accountability relies on the assumption that the test design appropriately represents the content and cognitive emphasis reflected in the state's standards. Since the passage of the Every Student Succeeds Act and the certification of accountability assessments through federal peer review practices, the content validity arguments supporting accountability have relied almost exclusively on the alignment of statewide assessments to state standards. It is assumed that if alignment does not hold, the scores will not provide valid inferences regarding the degree to which test takers have performed. Although alignment results are commonly used as evidence of test appropriateness, Polikoff (this issue) would argue that given the importance of alignment in policy decisions, research related to alignment is surprisingly limited. Few studies have addressed the adequacy of alignment methodologies and results as support for the inferences to be made (i.e., proficient on state standards). This paper uses an example of test taker performance (and common performance indicators) to investigate to what extent the degree of alignment impacts inferences made about performance (i.e., classification into performance levels, estimates of student ability, and student rank order).

12.

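Strategies for Training and Recruiting High-Quality Teachers: International Experience and Implications

Li Sha (李莎) Cheng Jinkuan (程晋宽) 《Comparative Education Research》(比较教育研究) 2015,(4):90-95

Training and recruiting high-quality teachers is an important task in achieving quality education. UNESCO's 2013~2014 report, Teaching and Learning: Achieving Quality for All, treats the recruitment and training of teachers as a key strategy for achieving quality education. The main strategies for training and recruiting high-quality teachers are: selecting the most talented people to enter the education field; improving the quality of teacher education and training inclusive teachers; allocating high-quality teachers to the areas that need them most; and providing reasonable pay for all teachers.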

13.

Multiple Objective Test Assembly Problems

Bernard P. Veldkamp 《Journal of Educational Measurement》1999,36(3):253-266

Mathematical programming techniques for optimal test assembly are discussed. Most methods optimize a single objective: for instance, the amount of information in a test, subject to a number of constraints. However, some test assembly problems have multiple objectives. A recent example in the literature is the problem of assembling tests that measure multiple traits, where the amount of information in the test about each different trait has to be maximized. The present paper proposes methods appropriate for solving multiple objective test assembly problems. An overview of multiple objective optimization methods is given. The impact of the method on the optimality of the solution is shown and the appropriateness of the methods is discussed. The methods are illustrated using an empirical example of a test assembly problem for a two-dimensional mathematics item pool.

14.

Simultaneous Assembly of Multiple Test Forms

Wim J. van der Linden Jos J. Adema 《Journal of Educational Measurement》1998,35(3):185-198

An algorithm for the assembly of multiple test forms is proposed in which the multiple-form problem is reduced to a series of computationally less intensive two-form problems. At each step, one form is assembled to its true specifications; the other form is a dummy assembled only to maintain a balance between the quality of the current form and the remaining forms. It is shown how the method can be implemented using the technique of 0-1 linear programming. Two empirical examples using a former item pool from the LSAT are given: one in which a set of parallel forms is assembled and another in which the targets for the information functions of the forms are shifted systematically.

15.

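Test Administration of the PISA 2009 Shanghai Assessment and Its Implications

Zhou Yun (周云) 《Shanghai Educational Research》(上海教育科研) 2010,(5)

From the standpoint of test administration, different tests may have different administration rules and procedures, but all such rules and procedures aim to have students take the examination or test under equal conditions and in equal settings, thereby reducing measurement error and keeping the results objective, fair, and accurate. From this standpoint, PISA offers not only an international view of testing but also, at the level of concrete administration, much experience that domestic education quality monitoring can draw on.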

16.

Implications of the Golden Rule Settlement for Test Construction

Robert L. Linn Fritz Drasgow 《Educational Measurement》1987,6(2):13-17

The authors present the results of an application of Golden Rule procedures to items of the Scholastic Aptitude Test. Using item response theory, their analyses indicate that the Golden Rule procedures are ineffective in detecting biased items and may undermine the reliability and validity of tests.

17.

Test Standards—Some Implications for the Measurement Curriculum

David A. Frisbie Stephen J. Friedman 《Educational Measurement》1987,6(3):17-23

This paper demonstrates how an analysis of the Test Standards can be used to define the body of knowledge needed by teachers for the effective use of tests in classroom instruction. This knowledge domain is essential for describing a sound measurement curriculum for preservice teachers and for outlining specifications for teacher certification testing. Procedures are described for identifying standards relevant to teachers' classroom role functions and for describing the behavior inherent in those standards.

18.

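Features of the U.S. TOEFL ITP Test and Their Implications for English Assessment in China

《Examinations Research》(考试研究) 2019,(6)

This paper introduces the U.S. TOEFL ITP test. It first describes the test's main features, then analyzes the strengths and limitations of its testing system from the perspective of modern language testing. On that basis, it discusses the implications of the TOEFL ITP test for English assessment in China, with the aim of informing the design, administration, and score reporting of English tests in China.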

19.

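The Computer-Based College English Test Band 4 and Its Implications for English Teaching and Learning

Zhou Yuanyuan (周园园) 《Journal of Ningbo Radio and Television University》(宁波广播电视大学学报) 2011,9(2):85-87

The computer-based College English Test Band 4 turns the traditional paper-based English test into a computer-based language test and greatly increases the weight of listening. This change in the mode and format of the test is a major opportunity to reform the college English teaching model. By introducing the item types and features of the computer-based test, this paper proposes that teachers make full use of computer networks, multimedia, and other teaching resources, adopt flexible and varied teaching models and methods to develop and improve students' listening and speaking skills, and raise students' capacity for autonomous learning by strengthening their interest and confidence in learning English. It also points out the importance of building platforms for online teaching and learning resources.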

20.

Automated Test Assembly for Cognitive Diagnosis Models Using a Genetic Algorithm

Matthew Finkelman Wonsuk Kim Louis A. Roussos 《Journal of Educational Measurement》2009,46(3):273-292

Much recent psychometric literature has focused on cognitive diagnosis models (CDMs), a promising class of instruments used to measure the strengths and weaknesses of examinees. This article introduces a genetic algorithm to perform automated test assembly alongside CDMs. The algorithm is flexible in that it can be applied whether the goal is to minimize the average number of classification errors, minimize the maximum error rate across all attributes being measured, hit a target set of error rates, or optimize any other prescribed objective function. Under multiple simulation conditions, the algorithm compared favorably with a standard method of automated test assembly, successfully finding solutions that were appropriate for each stated goal.