1.

The Bookmark Standard-Setting Method: A Literature Review (total citations: 1; self-citations: 0; citations by others: 1)

Ana Karantonis Stephen G. Sireci 《Educational Measurement》2006,25(1):4-12

The Bookmark method for setting standards on educational tests is currently one of the most popular standard-setting methods. However, research to support the method is scarce. In this report, we review the published and unpublished literature on this method as well as some seminal work in the area of evaluating standard-setting studies. Our review highlights both strengths and limitations of the method. Strengths include its wide acceptance and panelist confidence in the method. Limitations include a potential bias to produce lower-than-intended standards and problems in selecting the most appropriate response probability value for ordering the items presented to panelists. It is clear that more research on this method is needed to support its wide use. Several areas for future research to better understand the validity of the Bookmark method for setting standards on educational tests are presented.
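
The sensitivity to the response probability (RP) value noted in this abstract can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) IRT model, which the abstract does not specify, and three hypothetical items; the function names are illustrative only. Under these assumptions, an item's location is the ability at which the probability of a correct answer equals the RP value, and items are ordered by that location.

```python
import math

def rp_location(a, b, rp=0.67):
    """Ability at which a 2PL item (assumed model) is answered correctly with probability rp."""
    return b + math.log(rp / (1.0 - rp)) / a

def ordered_item_booklet(items, rp=0.67):
    """Return item ids sorted from lowest to highest RP location."""
    return sorted(items, key=lambda name: rp_location(*items[name], rp=rp))

# Hypothetical items: id -> (discrimination a, difficulty b)
items = {"A": (0.5, 0.0), "B": (2.0, 0.3), "C": (1.0, -1.0)}

print(ordered_item_booklet(items, rp=0.50))  # ordering at RP50
print(ordered_item_booklet(items, rp=0.67))  # ordering at RP67
```

With these made-up parameters, items A and B swap places when the RP value moves from 0.50 to 0.67, because the shift in location is larger for items with lower discrimination.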

2.

A Critical Look into the Beuk Standard-Setting Method

Adam E. Wyse 《Educational Measurement》2020,39(1):52-60

One commonly used compromise standard-setting method is the Beuk (1984) method. A key assumption of the Beuk method is that the emphasis given to the pass rate and the percent correct ratings should be proportional to the extent that the panelists agree on their ratings. However, whether the slope of the Beuk line reflects the emphasis that panelists believe should be assigned to the pass rate and the percentage correct ratings has not been fully tested. In this article, I evaluate this critical assumption of the Beuk method by asking panelists to assign importance weights to their percentage correct and pass rate judgments. I show that in several cases the emphasis suggested by the Beuk slope is noticeably different from what one would expect and is inconsistent with importance weight ratings. I also suggest two ways that the importance weights can be used to calculate alternate cut scores, and I show that one of the ways of calculating cut scores using the importance weights leads to larger potential differences in cut score estimates. I suggest that practitioners should consider collecting importance weights when the Beuk method is used for determining cut scores.

3.

Reliability and Validity of Bookmark‐Based Methods for Standard Setting: Comparisons to Angoff‐Based Methods in the National Assessment of Educational Progress

Christina Hamme Peterson E. Matthew Schulz George Engelhard Jr. 《Educational Measurement》2011,30(2):3-14

Historically, Angoff‐based methods were used to establish cut scores on the National Assessment of Educational Progress (NAEP). In 2005, the National Assessment Governing Board oversaw multiple studies aimed at evaluating the reliability and validity of Bookmark‐based methods via a comparison to Angoff‐based methods. As the Board considered adoption of Bookmark‐based methods, it considered several criteria, including reliability of the cut scores, validity of the cut scores as evidenced by comparability of results to those from Angoff, and procedural validity as evidenced by panelist understanding of the method tasks and instructions and confidence in the results. As a result of their review, a Bookmark‐based method was adopted for NAEP, and has been used since that time. This article goes beyond the Governing Board's initial evaluations to conduct a systematic review of 27 studies in NAEP research conducted over 15 years. This research is used to evaluate Bookmark‐based methods on key criteria originally considered by the Governing Board. Findings suggest that Bookmark‐based methods have comparable reliability, resulting cut scores, and panelist evaluations to Angoff. Given that Bookmark‐based methods are shorter in duration and less costly, Bookmark‐based methods may be preferable to Angoff for NAEP standard setting.

4.

Setting Performance Standards: Contemporary Methods

Gregory J. Cizek Michael B. Bunch Heather Koons 《Educational Measurement》2004,23(4):31-31

This module describes some common standard-setting procedures used to derive performance levels for achievement tests in education, licensure, and certification. Upon completing the module, readers will be able to: describe what standard setting is; understand why standard setting is necessary; recognize some of the purposes of standard setting; calculate cut scores using various methods; and identify elements to be considered when evaluating standard-setting procedures. A self-test and annotated bibliography are provided at the end of the module. Teaching aids to accompany the module are available through NCME.

5.

Adopting Cut Scores: Post-Standard-Setting Panel Considerations for Decision Makers

Kurt F. Geisinger Carina M. McCormick 《Educational Measurement》2010,29(1):38-44

Standard-setting studies utilizing procedures such as the Bookmark or Angoff methods are just one component of the complete standard-setting process. Decision makers ultimately must determine what they believe to be the most appropriate standard or cut score to use, employing the input of the standard-setting panelists as one piece of information among multiple sources. However, guidance for weighing the various components is limited. The current article describes considerations about data that are used to make standard-setting decisions, as previously outlined by Geisinger (1991). The ten points provided by Geisinger have been expanded as they relate to shifts in educational policy and practice in educational measurement. They have been amended with six new components as well. The new considerations addressed are smoothing across grades, raising standards in progression (over grades or over time), opportunity to learn or instructional validity, input from other groups, equating or linking to previous standards, and organizational vision and goals.

6.

An Investigation of Undefined Cut Scores With the Hofstee Standard‐Setting Method

Adam E. Wyse Ben Babcock 《Educational Measurement》2017,36(4):28-34

This article provides an overview of the Hofstee standard‐setting method and illustrates several situations where the Hofstee method will produce undefined cut scores. The situations where the cut scores will be undefined involve cases where the line segment derived from the Hofstee ratings does not intersect the score distribution curve based on actual exam performance data. Data from 15 standard settings performed by a credentialing organization are used to investigate how common undefined cut scores are with the Hofstee method and to compare cut scores derived from the Hofstee method with those from the Beuk method. Results suggest that when Hofstee cut scores exist, the Hofstee and Beuk methods often yield fairly similar results. However, it is shown that undefined Hofstee cut scores did occur in a few situations. When Hofstee cut scores are undefined, it is suggested that one extend the Hofstee line segment so that it intersects the score distribution curve to estimate cut scores. Analyses show that extending the line segment to estimate cut scores often yields similar results to the Beuk method. The article concludes with a discussion of what these results may imply for people who want to employ the Hofstee method.
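
The undefined-cut-score situation described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: it assumes integer scores, takes the Hofstee segment to run from (minimum acceptable cut score, maximum acceptable fail rate) to (maximum acceptable cut score, minimum acceptable fail rate), and searches integer cut scores only. The names `hofstee_cut`, `k_min`, `k_max`, `f_min`, and `f_max`, and the score data, are hypothetical.

```python
def fail_rate(scores, cut):
    """Proportion of examinees scoring below the cut score."""
    return sum(s < cut for s in scores) / len(scores)

def hofstee_cut(scores, k_min, k_max, f_min, f_max, extend=False):
    """Return the first integer cut score at which the Hofstee line meets or
    drops below the fail-rate curve, or None if there is no intersection."""
    slope = (f_min - f_max) / (k_max - k_min)  # line falls as the cut rises

    def gap(cut):
        # Height of the Hofstee line minus the observed fail rate at this cut.
        return f_max + slope * (cut - k_min) - fail_rate(scores, cut)

    lo, hi = k_min, k_max
    if extend:
        # Extend the line beyond the judged segment to cover all scores.
        lo = min(k_min, min(scores))
        hi = max(k_max, max(scores) + 1)
    if gap(lo) < 0 or gap(hi) > 0:
        return None  # the line and the curve do not cross in [lo, hi]
    for cut in range(lo, hi + 1):
        if gap(cut) <= 0:
            return cut

# Hypothetical exam: one examinee at each score from 50 to 99.
scores = list(range(50, 100))

print(hofstee_cut(scores, 60, 70, 0.05, 0.15))               # undefined case
print(hofstee_cut(scores, 60, 70, 0.05, 0.15, extend=True))  # extended line
```

In the first call the fail rate at the minimum acceptable cut score already exceeds the maximum acceptable fail rate, so the segment never meets the curve; extending the line recovers an estimate below the judged range.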

7.

Five Methods for Estimating Angoff Cut Scores with IRT

Adam E. Wyse 《Educational Measurement》2017,36(4):16-27

This article illustrates five different methods for estimating Angoff cut scores using item response theory (IRT) models. These include maximum likelihood (ML), expected a posteriori (EAP), modal a posteriori (MAP), and weighted maximum likelihood (WML) estimators, as well as the most commonly used approach based on translating ratings through the test characteristic curve (i.e., the IRT true‐score (TS) estimator). The five methods are compared using a simulation study and a real data example. Results indicated that the application of different methods can sometimes lead to different estimated cut scores, and that there can be some key differences in impact data when using the IRT TS estimator compared to other methods. It is suggested that one should carefully think about their choice of methods to estimate ability and cut scores because different methods have distinct features and properties. An important consideration in the application of Bayesian methods relates to the choice of the prior and the potential bias that priors may introduce into estimates.
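
As a rough illustration of the true-score approach mentioned in this abstract, the sketch below sums the Angoff ratings to get a raw cut score and then finds the ability at which the test characteristic curve equals that sum. The 2PL model, the item parameters, the ratings, and the bisection search are all assumptions made for the example; the article's own implementation may differ.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected raw score at ability theta."""
    return sum(p_correct(theta, a, b) for a, b in items)

def true_score_cut(ratings, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Find theta where the TCC equals the summed Angoff ratings (bisection)."""
    target = sum(ratings)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tcc(mid, items) < target:
            lo = mid  # TCC is increasing, so the solution lies above mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical items (a, b) and mean Angoff ratings per item.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
ratings = [0.8, 0.6, 0.4]

theta_cut = true_score_cut(ratings, items)
print(round(theta_cut, 3), round(tcc(theta_cut, items), 3))
```

Bisection is used here only because the TCC is monotone in ability; the search bounds of -6 and 6 are arbitrary and would need widening if the summed ratings fell outside the TCC's range on that interval.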

8.

A Study of Methods for Assigning Grade Levels to Student Scores

Song Qinglong 《Journal of Tangshan Teachers College (唐山师范学院学报)》2007,29(2):61-63

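This article presents a feasible method for dividing students' academic scores into grade levels using the normal distribution, and uses a chi-square (χ²) test to check that the method is scientific, reasonable, and effective.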
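
A minimal sketch of this kind of procedure follows. The abstract does not give the grade boundaries or the details of the test, so the z-score boundaries, the five grade labels, the simulated scores, and the function names are all assumptions. The goodness-of-fit statistic compares observed grade counts with the counts expected under a normal distribution; with five categories and two estimated parameters, the usual textbook choice is 2 degrees of freedom, for which the 0.05 critical value is 5.991.

```python
import random
from statistics import NormalDist, mean, pstdev

# Hypothetical z-score boundaries separating grades E | D | C | B | A.
BOUNDS = [-1.5, -0.5, 0.5, 1.5]
GRADES = ["E", "D", "C", "B", "A"]

def assign_grades(scores):
    """Standardize each score and map it to a grade by its z-score band."""
    mu, sigma = mean(scores), pstdev(scores)
    grades = []
    for s in scores:
        z = (s - mu) / sigma
        grades.append(GRADES[sum(z >= b for b in BOUNDS)])
    return grades

def expected_proportions():
    """Share of a normal population falling in each grade band."""
    cdf = [0.0] + [NormalDist().cdf(b) for b in BOUNDS] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

def chi_square_statistic(grades):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    n = len(grades)
    stat = 0.0
    for grade, p in zip(GRADES, expected_proportions()):
        observed = grades.count(grade)
        stat += (observed - n * p) ** 2 / (n * p)
    return stat

# Simulated scores stand in for real class data.
rng = random.Random(0)
sample = [rng.gauss(70, 10) for _ in range(200)]
stat = chi_square_statistic(assign_grades(sample))
print(f"chi-square = {stat:.2f}; compare with 5.991 (alpha = 0.05, df = 2)")
```

A statistic below the critical value would mean the observed grade counts are consistent with the normal-distribution bands at the 0.05 level.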

9.

It's Not Just Angoff: Misperceptions of Hard and Easy Items in Bookmark-Type Ratings

Adam E. Wyse Ben Babcock 《Educational Measurement》2020,39(1):22-29

A common belief is that the Bookmark method is a cognitively simpler standard-setting method than the modified Angoff method. However, a limited amount of research has investigated panelists' ability to perform the Bookmark method well, and whether some of the challenges panelists face with the Angoff method may also be present in the Bookmark method. This article presents results from three experiments where panelists were asked to give Bookmark-type ratings to separate items into groups based on item difficulty data. Results of the experiments showed, consistent with results often observed with the Angoff method, that panelists typically and paradoxically perceived hard items to be too easy and easy items to be too hard. These perceptions were reflected in panelists often placing their Bookmarks too early for hard items and often placing their Bookmarks too late for easy items. The article concludes with a discussion of what these results imply for educators and policymakers using the Bookmark standard-setting method.

10.

The Issue of Range Restriction in Bookmark Standard Setting

Adam E. Wyse 《Educational Measurement》2015,34(2):47-54

This article uses data from a large‐scale assessment program to illustrate the potential issue of range restriction with the Bookmark method in the context of trying to set cut scores to closely align with a set of college and career readiness benchmarks. Analyses indicated that range restriction issues existed across different response probability (RP) values and item response theory (IRT) models if one were to apply the Bookmark procedure using intact test forms. Results also suggested that range restriction may still be present if one had access to additional data from an item bank. This demonstration critically highlights challenges that may exist in some practical applications of the Bookmark method due to items not being designed to cover the full range of examinee abilities.

11.

A Validation Study of Selected HSK Levels (total citations: 1; self-citations: 0; citations by others: 1)

Wu Xiaoyu, Xu Jing 《Examinations Research (考试研究)》2005,(3)

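One function of the Chinese Proficiency Test (HSK) is to define the Chinese language ability that international students should have when they begin departmental study at a Chinese university. Under the relevant regulations, HSK Level 3 and Level 6 are the minimum standards for admission to departmental study at Chinese universities in science, engineering, and Western medicine programs and in liberal arts, history, and traditional Chinese medicine programs, respectively. This article uses three methods, the Angoff, borderline-group, and contrasting-groups methods, to conduct a validation study of these standards.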

12.

The Efficacy of Collaborative Learning Groups in an Undergraduate Statistics Course

《College Teaching》2013,61(2):244-248

Abstract. Assessment of the efficacy of collaborative learning group techniques is frequently subjectively based and often relies on casual comments from students or faculty. Despite this shortcoming, instructors searching for new and effective ways of teaching quantitative courses continue to experiment with collaborative pedagogy. This study examined the relationship between student performance on collaborative learning group assignments and students' examination scores in statistics. The results both challenge and support the efficacy of collaborative learning groups and suggest that faculty modify such techniques when evidence of student achievement cannot be empirically linked to the collaborative experience.

13.

The Dependence of Growth‐Model Results on Proficiency Cut Scores

Andrew D. Ho Daniel M. Lewis Jason L. MacGregor Farris 《Educational Measurement》2009,28(4):15-26

States participating in the Growth Model Pilot Program reference individual student growth against “proficiency” cut scores that conform with the original No Child Left Behind Act (NCLB). Although achievement results from conventional NCLB models are also cut‐score dependent, the functional relationships between cut‐score location and growth results are more complex and are not currently well described. We apply cut‐score scenarios to longitudinal data to demonstrate the dependence of state‐ and school‐level growth results on cut‐score choice. This dependence is examined along three dimensions: 1) rigor, as states set cut scores largely at their discretion, 2) across‐grade articulation, as the rigor of proficiency standards may vary across grades, and 3) the time horizon chosen for growth to proficiency. Results show that the selection of plausible alternative cut scores within a growth model can change the percentage of students “on track to proficiency” by more than 20 percentage points and reverse accountability decisions for more than 40% of schools. We contribute a framework for predicting these dependencies, and we argue that the cut‐score dependence of large‐scale growth statistics must be made transparent, particularly for comparisons of growth results across states.

14.

Setting Standards for English Foreign Language Assessment: Methodology, Validation, and a Degree of Arbitrariness

Simon P. Tiffin‐Richards Hans Anand Pant Olaf Köller 《Educational Measurement》2013,32(2):15-25

Cut‐scores were set by expert judges on assessments of reading and listening comprehension of English as a foreign language (EFL), using the bookmark standard‐setting method to differentiate proficiency levels defined by the Common European Framework of Reference (CEFR). Assessments contained stratified item samples drawn from extensive item pools, calibrated using Rasch models on the basis of examinee responses of a German nationwide assessment of secondary school language performance. The results suggest significant effects of item sampling strategies for the bookmark method on cut‐score recommendations, as well as significant cut‐score judgment revision over cut‐score placement rounds. Results are discussed within a framework of establishing validity evidence supporting cut‐score recommendations using the widely employed bookmark method.

15.

Three Methods for Stabilizing the Difficulty of the English Gaokao Under the “Multiple Sittings per Year” Mechanism

Yang Zhiming 《Educational Measurement and Evaluation, Theory Edition (教育测量与评价(理论版))》2019,(3):15-18

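Piloting “multiple sittings per year” for the English section of the college entrance examination (gaokao) is a remarkable step forward, but fluctuations in difficulty across sittings often cause serious problems when raw scores are used directly for admission decisions. This article discusses three methods for stabilizing test difficulty: the standard practice of the international testing industry, an expert-judgment method that borrows ideas from standard setting, and a small-scale pilot-testing method with a representative sample that uses validity evidence in reverse. It is hoped that these methods will give front-line testing practitioners more options.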

16.

Using Multi-Level Training to Strengthen the Effectiveness of Mental Health Education

Song Libin 《Journal of Mudanjiang College of Education (牡丹江教育学院学报)》2005,(6):68-69

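To make mental health education in primary and secondary schools more scientific and effective, training must fit students' psychological needs and take into account the current state of education, the content of psychology research projects, and teachers' existing level of mental health education. Multi-level training should be carried out purposefully and in a planned, organized, and systematic way, using different methods and different content, so as to achieve good educational results.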

17.

The Impact of Process Instructions on Judges’ Use of Examinee Performance Data in Angoff Standard Setting Exercises

Janet Mee Brian E. Clauser Melissa J. Margolis 《Educational Measurement》2013,32(3):27-35

Despite being widely used and frequently studied, the Angoff standard setting procedure has received little attention with respect to an integral part of the process: how judges incorporate examinee performance data in the decision‐making process. Without performance data, subject matter experts have considerable difficulty accurately making the required judgments. Providing data introduces the very real possibility that judges will turn their content‐based judgments into norm‐referenced judgments. This article reports on three Angoff standard setting panels for which some items were randomly assigned to have incorrect performance data. Judges were informed that some of the items were accompanied by inaccurate data, but were not told which items they were. The purpose of the manipulation was to assess the extent to which changing the instructions given to the judges would impact the extent to which they relied on the performance data. The modified instructions resulted in the judges making less use of the performance data than judges participating in recent parallel studies. The relative extent of the change judges made did not appear to be substantially influenced by the accuracy of the data.

18.

A Memetic Interpretation of Language Norms in Mobile Phone Text Messages

Chen Chunlei 《Journal of Fuyang Teachers College, Social Science Edition (阜阳师范学院学报(社会科学版))》2011,(6):42-44

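The language of mobile phone text messages contains many non-standard pragmatic phenomena, which are closely related to variation in text-message language memes. To standardize text-message language, one should first handle the relationship between language meme variation and language norms properly; second, interfere with the spread of non-standard meme variants so that non-standard usage does not become prominent; and third, use standard language memes to guide the use of text-message language.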

19.

On Tiered and Grouped Teaching in “Computerized Accounting”

Ren Chunying 《Journal of Xinxiang Education College (新乡教育学院学报)》2005,18(4):87-88

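Using teaching examples from the course “Computerized Accounting”, this article discusses the necessity and advantages of tiered and grouped teaching: it helps develop students' capacity for self-study and raises their interest in specialized courses.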

20.

Methods of Interpreting Results in the Cleveland Arithmetic Tests

《The Journal of educational research》2012,105(4):280-292

ABSTRACT

The authors explored how gender and socioeconomic status (SES) predicted physics achievement as mediated by metacognition and physics self-efficacy. Data were collected from 338 high school students. The model designed for exploring how gender and SES-related differences in physics achievement were explained through metacognition and physics self-efficacy was tested. The result showed that metacognition and physics self-efficacy could explain gender- and SES-related differences in physics achievement. In addition, it was observed that physics self-efficacy mediated the relation of metacognition to physics achievement whereas metacognition did not. This finding means that metacognition contributed to physics achievement through physics self-efficacy.