1.

Exploring the Influence of Judge Proficiency on Standard‐Setting Judgments

Michael R. Peabody Stefanie A. Wind 《Journal of Educational Measurement》2019,56(1):101-120

Setting performance standards is a judgmental process involving human opinions and values as well as technical and empirical considerations. Although all cut score decisions are by nature somewhat arbitrary, they should not be capricious. Judges selected for standard‐setting panels should have the proper qualifications to make the judgments asked of them; however, even qualified judges vary in expertise and in some cases, such as highly specialized areas or when members of the public are involved, it may be difficult to ensure that each member of a standard‐setting panel has the requisite expertise to make qualified judgments. Given the subjective nature of these types of judgments, and that a large part of the validity argument for an exam lies in the robustness of its passing standard, an examination of the influence of judge proficiency on the judgments is warranted. This study explores the use of the many‐facet Rasch model as a method for adjusting modified Angoff standard‐setting ratings based on judges’ proficiency levels. The results suggest differences in the severity and quality of standard‐setting judgments across levels of judge proficiency, such that judges who answered easy items incorrectly tended to perceive them as easier, but those who answered correctly tended to provide ratings within normal stochastic limits.

2.

A Study of Rater Reliability Based on CTT, GT, and IRT: The Case of the Women's Diving Final at an Olympic Games

钟晓玲 康春花 陈婧 《考试研究》2013,(5):41-52

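Taking the women's diving final of one Olympic Games as an example, this paper applies three measurement theories, CTT, GT, and IRT, to analyze rater reliability and to reveal differences between and within raters from different perspectives. The results show that the CTT rater reliability coefficients were 0.981 and 078; the GT generalizability coefficient and dependability index were 0.8279 and 0.8271, respectively, so the design used in the competition, in which 7 judges each rated the divers' performances across 5 rounds, was a reasonably appropriate decision. Under IRT, judge 5 was relatively the most severe of the 7 judges and judge 2 the most lenient, but the differences in severity among judges were not significant; judges 1 and 4 showed problems with internal consistency; and judges exhibited bias across different divers, dives of different degrees of difficulty, and different rounds, though not at a significant level. The analysis illustrates the characteristics and respective strengths of the three approaches to rater reliability analysis and provides useful information for rater training and for improving scoring reliability.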

3.

Outlier Detection Based on Entropy Distance and Its Application to Student Evaluations of Teaching

刘祥新 《湖北第二师范学院学报》2012,(2):84-86

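Outlier detection identifies data that are inconsistent with normal data. In student evaluations of teaching, some noisy evaluation data arise for various reasons. Based on the characteristics of this noise, the paper proposes an outlier detection algorithm based on entropy distance, which judges the degree to which a data point is an outlier by comparing the entropy associated with that point with the entropy of the whole data set. Simulation results show that the algorithm filters noisy data in student evaluations of teaching effectively.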

4.

Scoring a Performance-Based Assessment by Modeling the Judgments of Experts

Brian E. Clauser Raja G. Subhiyah Ronald J. Nungester Douglas R. Ripkey Stephen G. Clyman Danette McKinley 《Journal of Educational Measurement》1995,32(4):397-415

Performance assessments typically require expert judges to individually rate each performance. This results in a limitation in the use of such assessments because the rating process may be extremely time consuming. This article describes a scoring algorithm that is based on expert judgments but requires the rating of only a sample of performances. A regression-based policy capturing procedure was implemented to model the judgment policies of experts. The data set was a seven-case performance assessment of physician patient management skills. The assessment used a computer-based simulation of the patient care environment. The results showed a substantial improvement in correspondence between scores produced using the algorithm and actual ratings, when compared to raw scores. Scores based on the algorithm were also shown to be superior to raw scores and equal to expert ratings for making pass/fail decisions which agreed with those made by an independent committee of experts.

5.

Differential Use of Item Information by Judges Using Angoff and Nedelsky Procedures

Robert L. Smith Jeffrey K. Smith 《Journal of Educational Measurement》1988,25(4):259-274

Competency examinations in a variety of domains require setting a minimum standard of performance. This study examines the issue of whether judges using the two most popular methods for setting cut scores (Angoff and Nedelsky methods) use different sources of information when making their judgments. Thirty-one judges were assigned randomly to the two methods to set cut scores for a high school graduation test in reading comprehension. These ratings were then related to characteristics of the items as well as to empirically obtained p values. Results indicate that judges using the Angoff method use a wider variety of information and yield estimates closer to the actual p values. The characteristics of items used in the study were effective predictors of judges' ratings, but were far less effective in predicting p values.

6.

Effect of sex, class standing, and academic performance on perceived importance of teacher behaviors

Hubert S. Feild William H. Holley Achilles A. Armenakis 《Research in Higher Education》1976,5(3):215-222

An implicit assumption made in most teaching evaluation instruments is that teaching behaviors are equally important to students. Using specific teacher behaviors which have appeared in a number of teaching assessment devices, the Importance of Teacher Behaviors questionnaire (ITB) was constructed to measure students' perceptions of the importance of selected teacher behaviors. Data collected from 105 college students were utilized in the present study for the following purposes: (a) to determine if there are differences among students' ratings of teaching behaviors in terms of importance and (b) to determine if the ratings of importance given to selected teacher behaviors vary according to students' sex, class standing, or academic performance. Results of the study indicated that there were significant differences in perceived importance of selected teacher behaviors. Furthermore, it was found that ratings of some of these behaviors tended to vary across sex groups. Implications for utilizing importance information as a weighting component in teaching evaluations are discussed.

7.

Content Specificity of Expert Judgments in a Standard-Setting Study

Barbara S. Plake James C. Impara Maria T. Potenza 《Journal of Educational Measurement》1994,31(4):339-347

This study investigated the comparability of Angoff-based item ratings on a general education test battery made by judges from within-content specialties and across content domains. Judges were from English, mathematics, science, and social studies specialties in teacher education programs in a midwestern state. Cutscores established from the judges' ratings of out-of-content items differed little from the cutscores set using the ratings made by the content specialists. Further, out-of-content ratings by judges were not more influenced by performance data than were the ratings provided by judges rating items within their content specialty. The degree to which these results generalize to other content specialties needs to be investigated.

8.

Are health professional competency assessments transferable across cultures? A preliminary validity study

Diana Ho Sue McAllister 《Assessment & Evaluation in Higher Education》2018,43(7):1069-1083

This study investigated whether professional competency assessments are transferable across cultures using COMPASS®: Competency assessment in speech pathology, a tool developed and validated in Australia. Students in Hong Kong were assessed by clinical educators using COMPASS® and the usual clinical evaluation forms. Analyses compared Hong Kong data with the original Australian field trial data. Rasch analysis was used to evaluate how well the ratings and scores generated represented students’ development of competency. Hong Kong clinical educators’ ratings represented the same seven distinct categories of judgement as Australian clinical educators. The order of item difficulty was very similar for the two samples. However, Hong Kong clinical educators were not rating students in a pattern that reflected increasing competency with experience and very few year 4 students were rated at entry level. It is concluded that an assessment tool validated and developed in one culture may well support valid judgements and yield measures that can be used to judge student competency in another culture. Further evaluation is required to investigate the differences in the judgement of student progress in another culture and strengthen the validity of using its measures to judge students' competency performance.

9.

Measuring the Impact of Judge Severity on Examination Scores

《教育实用测度》2013,26(4):331-345

In order to obtain objective measurement for examinations that are graded by judges, an extension of the Rasch model designed to analyze examinations with more than two facets (items/examinees) is used. This extended Rasch model calibrates the elements of each facet of the examination (i.e., examinee performances, items, and judges) on a common log-linear scale. A network for assigning judges to examinations is used to link all facets. Real examination data from the "clinical assessment" part of a certification examination are used to illustrate the application. A range of item difficulties and judge severities were found. Comparison of examinee raw scores with objective linear measures corrected for variations in judge severity shows that judge severity can have a substantial impact on a raw score. Correcting for judge severity improves the fairness of examinee measures and of the subsequent pass-fail decisions because the uncorrected raw scores favor examinee performances graded by lenient judges.

10.

An Empirical Study of the Validity of Formative Assessment Methods in College English

殷小娟 陈小凤 《洛阳师范学院学报》2013,(12):96-99

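Combining summative and formative assessment is the evaluation approach commonly used in college English courses in China, and the results of formative assessment are usually expressed as regular coursework grades. The data from this empirical study show a positive correlation between coursework grades and final written examination scores, indicating that the formative assessment methods for college English used in the experiment are valid and sound. The data also show large individual differences among teachers' formative assessment methods, and that assessments made by the same assessor of the same subjects at different times change dynamically.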

11.

Exploring the Influence of Range Restrictions on Connectivity in Sparse Assessment Networks: An Illustration and Exploration Within the Context of Classroom Observations

Stefanie A. Wind Eli Jones 《Journal of Educational Measurement》2018,55(2):217-242

Range restrictions, or raters’ tendency to limit their ratings to a subset of available rating scale categories, are well documented in large‐scale teacher evaluation systems based on principal observations. When these restrictions occur, the ratings observed during operational teacher evaluations are limited to a subset of the available categories. However, range restrictions are less common within teacher performances that are used to establish links (anchor ratings) in otherwise disconnected assessment systems. As a result, principals’ category use may be different between anchor ratings and operational ratings. The purpose of this study is to explore the consequences of discrepancies in rating scale category use across operational and anchor ratings within the context of teacher evaluation systems based on principal observations. First, we used real data to illustrate the presence of range restriction in operational ratings, and the effect of this restriction on connectivity. Then, we used simulated data to explore these effects using experimental manipulation. Results suggested that discrepancies in range restriction between anchor and operational ratings do not systematically impact the precision of teacher, principal, and teaching practice estimates. We discuss the implications of these results in terms of research and practice for teacher evaluation systems.

12.

The Impact of Student Perceptions and Characteristics on Teaching Evaluations: A case study in finance education

Andrew C. Worthington 《Assessment & Evaluation in Higher Education》2002,27(1):49-64

This study uses an ordered probit model to examine the impact of student characteristics and perceptions of the teaching evaluation process on student ratings. The results indicate that expected grade, ethnic background, gender and age are a significant influence on student ratings. A primary student-based influence on teaching evaluation performance would appear to be the perceived potential outcome of the evaluation in terms of tenure, promotion and salary decisions, and improvements in teaching and staff allocation. The impact of student perceptions and characteristics is also found to vary across the various dimensions of teaching performance with the potential bias being highest for evaluation questions relating to overall performance, and lowest for questions relating to formative assessment and deep learning outcomes.

13.

Understanding research strategies to improve ERA performance in Australian universities: circumventing secrecy to achieve success

Carmel M. Diezmann 《Journal of Higher Education Policy & Management》2018,40(2):154-174

Many Australian universities have prioritised improving discipline performance on the national research assessment, Excellence in Research for Australia (ERA). However, a culture of secrecy pervades ERA. There are no specified criteria for the assignment of ratings on a 5-point scale ranging from ‘well above world standard’ (5) to ‘well below world standard’ (1). No rationale is provided to institutions for their discipline ratings and university staff on the ERA panels sign confidentiality agreements. However, what is available to universities are the research strategies that each university documents to improve its ERA performance in its Mission-based Compact, a government funding agreement. Thus, the purpose of this paper is to investigate the similarities and differences in the research strategies that universities with different performance profiles employ. Following an analysis of the strategies, substantial commonality was identified in strategy use. However, what was different was how universities employed these strategies and the associated contexts.

14.

Exploration of the DSM-5’s Autism Spectrum Disorder Severity Level Specifier and Prediction of Autism Severity

Kimberly Ellison Jon Gore Dustin Wygant 《Exceptionality》2013,21(4):289-298

With the publication of DSM-5, clinical assessment of Autism Spectrum Disorder (ASD) has begun to follow a new dimensional framework which includes new severity specifiers. Little research has explored these severity ratings in comparison to other previously established severity indicators (e.g. ADOS-2 calibrated severity score). The current study compared parent and teacher ratings using the BASC-2 for 43 children and adolescents diagnosed with DSM-5 ASD, to contribute novel information to the BASC literature and to explore the new DSM-5 severity ratings. Linear regression analyses were conducted to determine the extent to which the DSM-5 Social Communication and Restrictive/Repetitive Behavior Severity Ratings (clinician-rated) predicted autism severity based on ADOS-2 calibrated severity scores. Furthermore, linear regressions were conducted to explore whether teacher ratings on the BASC-2 enhance parent ratings. Implications of these preliminary results for the assessment of children and adolescents with ASD are suggested.

15.

The Effect of College English Writing Scoring Methods on Rater Severity: A Comparative Analysis of Holistic and Analytic Scoring

贺满足 《湖南第一师范学报》2006,6(4):59-61,66

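Scoring criteria are very important in writing tests, and using different scoring methods affects raters' scoring behavior. The study shows that although both holistic and analytic scoring of English writing are reliable, rater severity and examinees' writing scores change considerably between the two. Overall, under holistic scoring, rater severity tends to be consistent and close to the ideal value; under analytic scoring, examinees' writing scores are higher, while rater severity also differs significantly. Holistic scoring is therefore preferred in high-stakes examinations that determine examinees' futures.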

16.

Influence of Playback Techniques on Counselor Performance

Martin J. Markey Ronald H. Fredrickson Richard W. Johnson Mary Alice Julius 《Counselor Education & Supervision》1970,9(3):178-182

This study examined the training impact of different electronic playback techniques on ratings of student counselor performance. Thirty-two upperclass university females were randomly divided into four playback treatment groups: (a) audio-video, (b) audio, (c) video, and (d) no playback. Four underclass university females served as trained clients. All student counselors interviewed two different trained clients in two 20-minute sessions separated by a playback treatment period. All sessions were recorded by television cameras. Trained judges rated the second interview using the Counselor Evaluation Inventory, Nonverbal Behavior Scale, and Audio-Visual Counseling Scale. Two-way analysis of variance was used to compare scores on the criterion instruments. Results indicated no judged differences among the playback treatment groups, nor could discriminant rankings be made among the various playback methods. Several explanations for the limited influence of playback media on early interview performance are discussed.

17.

An Integrative Approach to Portfolio Evaluation for Teacher Licensure

Pamela A. Moss Aaron M. Schutz Kathleen M. Collins 《Journal of Personnel Evaluation in Education》1998,12(2):139-161

The purpose of our overall research agenda is to develop and evaluate a methodology for the assessment of teachers in which experienced teachers, serving as judges, engage in dialogue to integrate multiple sources of evidence about a candidate to reach a sound conclusion. The project that provides the venue for this research agenda is the Interstate New Teacher Assessment and Support Consortium (INTASC), which is developing a portfolio assessment system to assist participating states in making a decision about teacher licensure. To develop the theoretical foundation necessary to support and evaluate such dialogic and integrative assessment practices, we turn, in part, to the tradition of philosophical hermeneutics, as a complement to psychometrics. In this article, we characterize and assess the processes in which judges, trying out an integrative approach to portfolio evaluation for the first time, engage as they collaboratively construct and document their conclusions, and we locate this work in the larger research agenda. The premise of this project, which is being carefully evaluated in the course of inquiry, is that these integrative practices can not only lead to an epistemologically sound evaluation of teaching but also promote an ongoing professional dialogue of critical reflection on teaching practice.

18.

Leadership Perception

Thomas P. Bradley Jeff M. Allen Scott Hamilton Scott K. Filgo 《Performance Improvement Quarterly》2006,19(1):7-23

Multirater feedback, often called 360‐degree feedback, is a popular development and assessment tool, especially for organizational leaders. Raters from different organizational levels, including subordinates, boss, peers, and self, rate the leader's performance. However, there seldom is strong agreement across rater groups. This study used the data from a commercially available 360‐degree leader development feedback instrument and a second‐order confirmatory factor analysis model to try to explain the differences in ratings between the groups. Rather than an explanation of the differences, what was found were two second‐order factors that may be the underlying elements that all raters consider when observing leader performance.

19.

Latent trait modelling of rater accuracy in formative peer assessment of English-Chinese consecutive interpreting

Chao Han 《Assessment & Evaluation in Higher Education》2018,43(6):979-994

Despite the increasing popularity of peer assessment in tertiary-level interpreter education, very little research has been conducted to examine the quality of peer ratings on language interpretation. While previous research on the quality of peer ratings, particularly rating accuracy, mainly relies on correlation and analysis of variance, latent trait modelling emerges as a useful approach to investigate rating accuracy in rater-mediated performance assessment. The present study demonstrates the use of multifaceted Rasch partial credit modelling to explore the accuracy of peer ratings on English-Chinese consecutive interpretation. The analysis shows that there was a relatively wide spread of rater accuracy estimates and that statistically significant differences were found between peer raters regarding rating accuracy. Additionally, it was easier for peer raters to assess some students accurately than others, to peer-assess target language quality accurately than the other rating domains, and to provide accurate ratings to English-to-Chinese interpretation than the other direction. Through these findings, latent trait modelling demonstrates its capability to produce individual-level indices, measure rater accuracy directly, and accommodate sparse data rating designs. It is therefore hoped that substantive inquiries into peer assessment of language interpretation could utilise latent trait modelling to move this line of research forward.

20.

DETERMINING THE QUALITY OF COMPETENCE ASSESSMENT PROGRAMS: A SELF-EVALUATION PROCEDURE

Liesbeth K.J. Frans J. Paul A. Cees P.M. 《Studies in Educational Evaluation》2007,33(3-4):258-281

As assessment methods are changing, the way to determine their quality needs to be changed accordingly. This article argues for the use of Competence Assessment Programs (CAPs), combinations of traditional tests and new assessment methods which involve both formative and summative assessments. To assist schools in evaluating their CAPs, a self-evaluation procedure was developed, based on 12 quality criteria for CAPs developed in earlier studies. A self-evaluation was chosen as it is increasingly used as an alternative to external evaluation. The CAP self-evaluation is carried out by a group of functionaries from the same school and comprises individual self-evaluations and a group interview. The CAP is rated on the 12 quality criteria and a piece of evidence is asked for to support these ratings. In this study, three functionaries from eight schools (N = 24) evaluated their CAP using the self-evaluation procedure. Results show that the group interview was very important as different perspectives on the CAP are assembled here into an overall picture of the CAP's quality. Schools seem to use mainly personal experiences to support their ratings and need to be supported in the process of carrying out a self-evaluation.