期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

How Can Released State Test Items Support Interim Assessment Purposes in an Educational Crisis?

Emma M. Klugman Andrew D. Ho 《Educational Measurement》2020,39(3):65-69

State testing programs regularly release previously administered test items to the public. We provide an open-source recipe for state, district, and school assessment coordinators to combine these items flexibly to produce scores linked to established state score scales. These would enable estimation of student score distributions and achievement levels. We discuss how educators can use resulting scores to estimate achievement distributions at the classroom and school level. We emphasize that any use of such tests should be tertiary, with no stakes for students, educators, and schools, particularly in the context of a crisis like the COVID-19 pandemic. These tests and their results should also be lower in priority than assessments of physical, mental, and social–emotional health, and lower in priority than classroom and district assessments that may already be in place. We encourage state testing programs to release all the ingredients for this recipe to support low-stakes, aggregate-level assessments. This is particularly urgent during a crisis where scores may be declining and gaps increasing at unknown rates. 相似文献

2.

Test Development with Performance Standards and Achievement Growth in Mind

Steve Ferrara Dubravka Svetina Sylvia Skucha Anne H. Davidson 《Educational Measurement》2011,30(4):3-15

相似文献

3.

An NCME Instructional Module on Booklet Designs in Large-Scale Assessments of Student Achievement: Theory and Practice 总被引：2，自引：0，他引：2

Andreas Frey Johannes Hartig André A. Rupp 《Educational Measurement》2009,28(3):39-53

In most large-scale assessments of student achievement, several broad content domains are tested. Because more items are needed to cover the content domains than can be presented in the limited testing time to each individual student, multiple test forms or booklets are utilized to distribute the items to the students. The construction of an appropriate booklet design is a complex and challenging endeavor that has far-reaching implications for data calibration and score reporting. This module describes the construction of booklet designs as the task of allocating items to booklets under context-specific constraints. Several types of experimental designs are presented that can be used as booklet designs. The theoretical properties and construction principles for each type of design are discussed and illustrated with examples. Finally, the evaluation of booklet designs is described and future directions for researching, teaching, and reporting on booklet designs for large-scale assessments of student achievement are identified. 相似文献

4.

贵州省体育专业高考术科考试调查研究

王有智张东秀李康林单春华《黔南民族师范学院学报》2013,(6):97-101

采用专家访谈、文献资料、数理统计等研究方法,对贵州省体育专业高考术科考试进行了研究.研究表明：贵州省体育专业高考术科考试固定四项身体素质测试不能全面检测考生情况.建议：身体素质测试增加灵敏性素质,由4类4小项增为5类5小项,增加专项运动技能考试,身体素质和专项技能比例应为75∶25,测试总分宜统一定为100分,体育考试和文化课考试成绩都达到分数线的考生,按体育专业成绩由高到低录取. 相似文献

5.

A MODEL FOR INCREASING THE MEANING OF STANDARDIZED TEST SCORES1

RICHARD C. COX BARBARA G. STERRETT 《Journal of Educational Measurement》1970,7(4):227-228

相似文献

6.

Effort Analysis: Individual Score Validation of Achievement Test Data

Steven L. Wise 《教育实用测度》2015,28(3):237-252

Whenever the purpose of measurement is to inform an inference about a student’s achievement level, it is important that we be able to trust that the student’s test score accurately reflects what that student knows and can do. Such trust requires the assumption that a student’s test event is not unduly influenced by construct-irrelevant factors that could distort his score. This article examines one such factor—test-taking motivation—that tends to induce a person-specific, systematic negative bias on test scores. Because current measurement models underlying achievement testing assume students respond effortfully to test items, it is important to identify test scores that have been materially distorted by non-effortful test taking. A method for conducting effort-related individual score validation is presented, and it is recommended that measurement professionals have a responsibility to identify invalid scores to individuals who make inferences about student achievement on the basis of those scores. 相似文献

7.

语言测试中的项目分析方法

张红《民族教育研究》2002,13(4):38-42

地每一次测试的试题进行项目分析的目的，是为了确定试题的科学性，把那些符合测试规则、能够体现测试功能的科学性的试题保存在试题库中。项目分析的方法不仅适合于英语测试，还可以扩大到任何一种选项式的测试。语言测试在教学研究中有着很重要的作用。如何使它更为科学、更为精确，并对学生起到积极的引导作用是一项艰巨且有意义的研究工作。相似文献

8.

Embedded Standard Setting: Aligning Standard-Setting Methodology with Contemporary Assessment Design Principles

Daniel Lewis Robert Cook 《Educational Measurement》2020,39(1):8-21

相似文献

9.

A probe into EFL learners’ emotioncy as a source of test bias: Insights from differential item functioning analysis

《Studies in Educational Evaluation》2019

The development of unbiased tests is crucial in the arena of language testing in order to ensure validity. To date, studies of bias in language testing have mainly focused on factors such as gender, native language, or academic background, inter alia. However, bias may also result from psychological factors. Therefore, the present study investigates the role of English as a Foreign Language (EFL) test takers’ emotioncy, defined as the emotions evoked by senses that one holds for an entity, in their test performance. Specifically, this study aimed to examine emotioncy for the form as well as the meaning of 20 words to find out whether it can lead to differential functioning of the items on a vocabulary test. To this end, two emotioncy scales and a vocabulary test were designed. Then, based on the data collected from 235EFL students, the participants were bisected into the Low-Group and the High-Group, once based on their emotioncy scores for each word form and then based on their emotioncy scores for each word meaning. Subsequently, Rasch model-based Differential Item Functioning (DIF) analysis was performed across the two groups. The results showed that the vocabulary test items functioned differentially across the two groups in both form and meaning classifications, favoring the High-Group. Therefore, the study provides evidence for emotioncy as a psychological source of test bias and discusses implications for language testing stakeholders. 相似文献

10.

高中学业水平考试:功能、命题与成绩使用 总被引：1，自引：0，他引：1

朱宇《考试研究》2008,(2):107-115

高中学业水平考试是新课程改革背景下出现的新型教育考试。它考查中学生的知识掌握水平、学科素养、学习能力,应该成为高校招生的重要评价指标之一,并能为社会用人单位提供有用的信息。高中学业水平考试的命题组卷,应关注的首要指标是试题有无较全面地覆盖课程标准要求的内容,以及试题的难度范围是否足够广泛。作为高考招生的指标之一,高中学业水平考试成绩合格或达到一定等级应该成为一个关键的录用前提条件。至于考生取得哪个等级的学业水平考试成绩可以被录用,则最终取决于招生院校和专业的要求。相似文献

11.

国际教育成效评价协会儿童认知发展状况测验项目功能差异分析 总被引：3，自引：0，他引：3

王蕾黄晓婷《考试研究》2006,(4)

本研究旨在从一维和多维的角度检测国际教育成效评价协会(IEA)儿童认知发展状况测验中中译英考题的项目功能差异(DIF)。我们分析的数据由871名中国儿童和557名美国儿童的测试数据组成。结果显示,有一半以上的题目存在实质的DIF,意味着这个测验对于中美儿童而言,并没有功能等值。使用者应谨慎使用该跨语言翻译的比较测试结果来比较中美两国考生的认知能力水平。所幸约有半数的DIF题目偏向中国,半数偏向美国,因此利用测验总分所建立的量尺,应该不至于有太大的偏误。此外,题目拟合度统计量并不能足够地检测到存在DIF的题目,还是应该进行特定的DIF分析。我们探讨了三种可能导致DIF的原因,尚需更多学科专业知识和实验来真正解释DIF的形成。相似文献

12.

主观题与客观题辨议

GUAN Dandan 《中国考试》2008,(7)

主客观题实际上是一个连续体,"主观题客观化"和"客观题主观化"在这个连续体上向对方无限趋近,"客观题主观化"在教育考试中有借鉴意义。文章以我国高考和研究生入学考试的试卷为例,探讨了主观题与客观题比例设置问题。主观题与客观题的有机结合反映了各国考试理念的融合。题型的设计不仅与考查目标有关,还与学科特点有关,并随着认识的深入而发展。相似文献

13.

Identifying Country-Specific Cultures of Physics Education: A differential item functioning approach

Vanes Mesic 《International Journal of Science Education》2013,35(16):2483-2500

In international large-scale assessments of educational outcomes, student achievement is often represented by unidimensional constructs. This approach allows for drawing general conclusions about country rankings with respect to the given achievement measure, but it typically does not provide specific diagnostic information which is necessary for systematic comparisons and improvements of educational systems. Useful information could be obtained by exploring the differences in national profiles of student achievement between low-achieving and high-achieving countries. In this study, we aimed to identify the relative weaknesses and strengths of eighth graders’ physics achievement in Bosnia and Herzegovina in comparison to the achievement of their peers from Slovenia. For this purpose, we ran a secondary analysis of Trends in International Mathematics and Science Study (TIMSS) 2007 data. The student sample consisted of 4,220 students from Bosnia and Herzegovina and 4,043 students from Slovenia. After analysing the cognitive demands of TIMSS 2007 physics items, the correspondent differential item functioning (DIF)/differential group functioning contrasts were estimated. Approximately 40% of items exhibited large DIF contrasts, indicating significant differences between cultures of physics education in Bosnia and Herzegovina and Slovenia. The relative strength of students from Bosnia and Herzegovina showed to be mainly associated with the topic area ‘Electricity and magnetism’. Classes of items which required the knowledge of experimental method, counterintuitive thinking, proportional reasoning and/or the use of complex knowledge structures proved to be differentially easier for students from Slovenia. In the light of the presented results, the common practice of ranking countries with respect to universally established cognitive categories seems to be potentially misleading. 相似文献

14.

International assessment: A Rasch model and teachers' evaluation of TIMSS science achievement items

Shawn M. Glynn 《科学教学研究杂志》2012,49(10):1321-1344

The Trends in International Mathematics and Science Study (TIMSS) is a comparative assessment of the achievement of students in many countries. In the present study, a rigorous independent evaluation was conducted of a representative sample of TIMSS science test items because item quality influences the validity of the scores used to inform educational policy in those countries. The items had been administered internationally to 16,009 students in their eighth year of formal schooling. The evaluation had three components. First, the Rasch model, which emphasizes high quality items, was used to evaluate the items psychometrically. Second, readability and vocabulary analyses were used to evaluate the wording of the items to ensure they were comprehensible to the students. And third, item development guidelines were used by a focus group of science teachers to evaluate the items in light of the TIMSS assessment framework, which specified the format, content, and cognitive domains of the items. The evaluation components indicated that the majority of the items were of high quality, thereby contributing to the validity of TIMSS scores. These items had good psychometric characteristics, readability, vocabulary, and compliance with the assessment framework. Overall, the items tended to be difficult: constructed response items assessing reasoning or application were the most difficult, and multiple choice items assessing knowledge or application were less difficult. The teachers revised some of the sampled items to improve their clarity of content, conciseness of wording, and fit with format specifications. For TIMSS, the findings imply that some of the non‐sampled items may need revision, too. For researchers and teachers, the findings imply that the TIMSS science items and the Rasch model are valuable resources for assessing the achievement of students. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 1321–1344, 2012 相似文献

15.

Investigating the relationship between instructional practices and science achievement in an inquiry-based learning environment

Dean Cairns 《International Journal of Science Education》2013,35(15):2113-2135

ABSTRACT

This study investigates the discrete effects of inquiry-based instructional practices that described the PISA 2015 construct ‘inquiry-based instruction’ and how each practice, and the frequency of each practice, is related to science achievement across 69 countries. The data for this study were drawn from the PISA 2015 database and analysed using hierarchical linear modelling (HLM). HLMs were estimated to test the contribution of each item to students’ science achievement scores. Some inquiry practices demonstrated a significant, linear, positive relationship to science achievement (particularly items involving contextualising science learning). Two of the negatively associated items (explaining their ideas and doing experiments) were found to have a curvilinear relationship to science achievement. All nine items were dummy coded by the reported frequency of use and an optimum frequency was determined using the categorical model and by calculating the inflection point of the curvilinear associations in the previous model e.g. students that carry out experiments in the lab in some lessons have higher achievement scores than students who perform experiments in all lessons. These findings, accompanied by detailed analyses of the items and their relationships to science outcomes, give stakeholders clear guidance regarding the effective use of inquiry-based approaches in the classroom. 相似文献

16.

Instructional Topics in Educational Measurement (ITEMS) Module: Using Automated Processes to Generate Test Items

Mark J. Gierl Hollis Lai 《Educational Measurement》2013,32(3):36-50

Changes to the design and development of our educational assessments are resulting in the unprecedented demand for a large and continuous supply of content‐specific test items. One way to address this growing demand is with automatic item generation (AIG). AIG is the process of using item models to generate test items with the aid of computer technology. The purpose of this module is to describe and illustrate a template‐based method for generating test items. We outline a three‐step approach where test development specialists first create an item model. An item model is like a mould or rendering that highlights the features in an assessment task that must be manipulated to produce new items. Next, the content used for item generation is identified and structured. Finally, features in the item model are systematically manipulated with computer‐based algorithms to generate new items. Using this template‐based approach, hundreds or even thousands of new items can be generated with a single item model. 相似文献

17.

Effects of repeated testing on short- and long-term memory performance across different test formats

Tova Stenlund Anna Sundström Bert Jonsson 《教育心理学》2016,36(10):1710-1727

This study examined whether practice testing with short-answer (SA) items benefits learning over time compared to practice testing with multiple-choice (MC) items, and rereading the material. More specifically, the aim was to test the hypotheses of retrieval effort and transfer appropriate processing by comparing retention tests with respect to practice testing format. To adequately compare SA and MC items, the MC items were corrected for random guessing. With a within-group design, 54 students (mean age = 16 years) first read a short text, and took four practice tests containing all three formats (SA, MC and statements to read) with feedback provided after each part. The results showed that both MC and SA formats improved short- and long-term memory compared to rereading. More importantly, practice testing with SA items is more beneficial for learning and long-term retention, providing support for retrieval effort hypothesis. Using corrections for guessing and educational implications are discussed. 相似文献

18.

Is the Item-Position Effect in Achievement Measures Induced by Increasing Item Difficulty?

Florian Zeller Siegbert Reiß Karl Schweizer 《Structural equation modeling》2017,24(5):745-754

This work examines the hypothesis that the arrangement of items according to increasing difficulty is the real source of what is considered the item-position effect. A confusion of the 2 effects is possible because in achievement measures the items are arranged according to their difficulty. Two item subsets of Raven’s Advanced Progressive Matrices (APM), one following the original item order, and the other one including randomly ordered items, were applied to a sample of 266 students. Confirmatory factor analysis models including representations of both the item-position effect and a possible effect due to increasing item difficulty were compared. The results provided evidence for both effects. Furthermore, they indicated a substantial relation between the item-position effects of the 2 APM subsets, whereas no relation was found for item difficulty. This indicates that the item-position effect stands on its own and is not due to increasing item difficulty. 相似文献

19.

The interpretation of norm-based scores from individualized testing using the Iowa tests of basic skills

Barbara S. Plake 《Psychology in the schools》1979,16(1):8-13

School teachers and administrators are often faced with the dilemma of deciding what level of an achievement test to assign a child whose developmental rate is atypical of his peers. It is not attractive to mismatch the developmental and achievement test level; however, alternative procedures often call for extra testing time. The ITBS has an out-of-level option which allows for a developmental/achievement test level match that does not require additional testing. Since the procedures used by ITBS to assign grade equivalent scores do not take grade level into account, questions have been raised about the interpretation of grade equivalent scores achieved from out-of-level testing. This research addresses the question of the comparability of equal scores on the same test from children in different grades. The results indicate that the scores are comparable and support the assignment of ITBS levels that match the child's developmental level. 相似文献

20.

Decisions that make a difference in detecting differential item functioning

Stephen G. Sireci Joseph A. Rios 《Educational Research and Evaluation》2013,19(2-3):170-187

There are numerous statistical procedures for detecting items that function differently across subgroups of examinees that take a test or survey. However, in endeavouring to detect items that may function differentially, selection of the statistical method is only one of many important decisions. In this article, we discuss the important decisions that affect investigations of differential item functioning (DIF) such as choice of method, sample size, effect size criteria, conditioning variable, purification, DIF amplification, DIF cancellation, and research designs for evaluating DIF. Our review highlights the necessity of matching the DIF procedure to the nature of the data analysed, the need to include effect size criteria, the need to consider the direction and balance of items flagged for DIF, and the need to use replication to reduce Type I errors whenever possible. Directions for future research and practice in using DIF to enhance the validity of test scores are provided. 相似文献