Similar Literature
A total of 20 similar records were found (search time: 31 ms).
1.
Two studies of the categorization of justifications for the morality of the actions of others are reported. Justifications were categorized using a scoring scheme not previously reported. Results showed that a reasonable degree of inter‐rater reliability could be achieved and that the developmental trends detected were robust with respect to both interviewers and interview content, although interview content had an expected and comprehensible effect on the frequency of items within content categories. Results were interpreted within the context of a model of the development of moral reasoning that emphasizes the influence of the social focus of the interviewee and the process by which individuation occurs towards either a secular or a religious view of morality. The notion that a more differentiated individuation may also occur within each of these categories was explored, as were shifts from paternalism to autonomous decision‐making in thinking about some areas of social life.

2.
The Triple Jump is a versatile but under‐studied instrument, used both for developing and for assessing problem‐based learning (PBL). This paper evaluates its use in assessing inquiry‐based learning (IBL) in a graduate course, along with a Group Assessment Task. Students' performance on the Triple Jump was not related to their satisfaction with the small‐group discussion held prior to completing a self‐directed learning task. Analysis of the self‐directed learning task in terms of academic or pragmatic focus showed consistent differences between the two markers, suggesting the need for more research into inter‐rater reliability and other characteristics of the Triple Jump exercise. Some simple strategies are recommended to make this instrument cost‐effective for assessing large classes.

3.
Numerous researchers have proposed methods for evaluating the quality of rater‐mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many‐facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory, whereas popular parametric methods are often based on measurement theories such as invariant measurement. However, the parametric methods rest on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without resorting to potentially inappropriate parametric techniques. An illustrative analysis of data from a rater‐mediated writing assessment demonstrates how numeric and graphical indicators from MSA can be used to gather evidence of validity, reliability, and fairness. The results suggest that MSA provides a useful framework for evaluating rater‐mediated assessments, one that can supplement existing popular methods for evaluating ratings.
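The core numeric indicator in Mokken scale analysis is the scalability coefficient H, which for a pair of items (or, as here, raters) is the observed covariance of their scores divided by the maximum covariance their marginal score distributions allow. A minimal Python sketch of that pairwise statistic follows; it illustrates the general MSA quantity, not the article's own analysis (which would typically use the R package mokken), and the toy data are invented:

```python
import numpy as np

def scalability_H(x, y):
    """Mokken/Loevinger pairwise scalability coefficient H: observed
    covariance of two raters' scores divided by the maximum covariance
    attainable given their marginals. That maximum is reached by the
    perfect (Guttman-ordered) pairing, i.e. sorting both score vectors
    and pairing them rank by rank."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov_obs = np.cov(x, y, bias=True)[0, 1]
    cov_max = np.cov(np.sort(x), np.sort(y), bias=True)[0, 1]
    return cov_obs / cov_max

# Two raters scoring the same eight essays on a 0-4 scale (toy data):
r1 = [0, 1, 1, 2, 2, 3, 4, 4]
r2 = [1, 0, 2, 3, 2, 3, 4, 3]
print(round(scalability_H(r1, r2), 2))  # 0.91
```

In MSA practice, H values below roughly .3 are conventionally read as too weak to support a scale.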

4.
A syllabus analysis instrument was developed to assist program evaluators, administrators, and faculty in identifying the skills that students use as they complete their college coursework. While the instrument can be tailored for use with a variety of learning domains, we used it to assess students' use of and exposure to computer technology skills. The reliability and validity of the instrument were examined through an analysis of 88 syllabi from courses within the teacher education program and the core curriculum at a private Midwestern US university. Results indicate that the instrument has good inter‐rater reliability, and that ratings by and interviews with faculty and students provide evidence of construct validity. The use and limitations of the instrument in educational program evaluation are discussed.

5.
A series of studies extended psychometric research on the Adjustment Scales for Preschool Intervention (ASPI), a multidimensional measure of preschool emotional and behavioral adjustment for use within formal early childhood educational programs. These studies used a multiple‐method, multisource approach to provide additional evidence for the reliability and validity of the ASPI. Findings documented inter‐rater reliability of the ASPI across key informants within early childhood educational programs, namely teachers and teacher assistants. Findings also supported the concurrent validity of the ASPI with direct observations of preschool classroom adjustment problems and with the developmentally salient constructs of temperament and emotion regulation. Implications for policy, practice, and future research are discussed. © 2004 Wiley Periodicals, Inc. Psychol Schs 41: 725–736, 2004.

6.
This paper draws on semi‐structured interview data and participant observations of senior secondary Physical Education (PE) teachers and students at two school sites across 20 weeks of the school year. The data indicated that the teachers in this study made progressive judgements about students' levels of achievement across each unit of work without explicit or overt reference to the criteria and standards represented in the schools' work programmes and in the Senior PE syllabus. The teachers' justification for this approach was that the criteria and standards had become sufficiently 'internalised' for them. Determining students' levels of achievement was somewhat 'intuitive' for the teachers, being reliant on their memory of students' performances and influenced by construct‐irrelevant affective characteristics of the students. It is argued that such construct‐irrelevance compromised the construct validity and the possible inter‐rater reliability of the decisions made, advantaging some students and marginalising others on the basis of characteristics not specifically related to the learning expected from following the syllabus. The potential inequities of such an approach are discussed, and suggestions are made for the consolidation of the validity and reliability of teachers' judgements.

7.
Although the rubric has emerged as one of the most popular assessment tools in progressive educational programs, there is an unfortunate dearth of information in the literature quantifying its actual effectiveness as an assessment tool in the hands of students. This study focuses on the validity and reliability of the rubric as a tool for student peer‐group evaluation, in an effort to further explore its use and effectiveness. A total of 1,577 peer‐group ratings using a rubric for an oral presentation were used in this 3‐year study involving 107 college biology students. A quantitative analysis shows that the rubric was applied consistently by both students and the instructor across the study years. Moreover, the rubric appears to be 'gender neutral', and students' academic strength has no significant bearing on the way they employ it. A significant one‐to‐one relationship (slope = 1.0) between the instructor's assessment and the students' ratings is seen across all years. A generalizability study yields moderate estimates of inter‐rater reliability across all years and allows for the estimation of variance components. Taken together, these data indicate that the general form and evaluative criteria of the rubric are clear and that the rubric is a useful tool for peer‐group (and self‐) assessment by students. To our knowledge, these data provide the first statistical documentation of the validity and reliability of the rubric for student peer‐group assessment.
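The generalizability study mentioned here rests on decomposing rating variance into components. The abstract does not give the exact G-study design, so as an illustrative stand-in, the closely related ICC(2,1) of Shrout and Fleiss can be computed from ANOVA mean squares on a complete presentations × raters score matrix; the layout and toy data below are assumptions:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater
    (Shrout & Fleiss), from an n_targets x k_raters score matrix, built
    from the ANOVA mean squares for targets, raters, and residual."""
    r = np.asarray(ratings, float)
    n, k = r.shape
    grand = r.mean()
    ms_t = k * ((r.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # targets
    ms_r = n * ((r.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters
    resid = r - r.mean(axis=1, keepdims=True) - r.mean(axis=0) + grand
    ms_e = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_t - ms_e) / (ms_t + (k - 1) * ms_e + k * (ms_r - ms_e) / n)

scores = np.array([[9, 8], [7, 6], [8, 8], [5, 4], [6, 6]])  # 5 talks x 2 raters
print(round(icc_2_1(scores), 2))  # 0.89
```

A full G-study would additionally report the separate variance components (presenters, raters, residual) so that a decision study can project reliability for other numbers of raters.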

8.
The primary aim of this study was to evaluate the reliability of the University's Masters' level (M‐level) generic assessment criteria when used by lecturers from different disciplines. A further aim was to evaluate whether subject‐specific knowledge was essential to marking these dissertations. Four senior lecturers from diverse disciplines participated. The University of Teesside's generic M‐level assessment criteria were formatted into a grid; the criteria related to the learning outcomes, the depth of understanding, the complexity of analysis and synthesis, and the structure and academic presentation of the work. As well as a quantitative mark, a qualitative statement of the reason behind each judgement was required. Each lecturer provided a dissertation that had previously been marked, and all participants then marked each of the four projects using the M‐level grid and comments sheet. The study found very good inter‐rater reliability: for any one project, the variation from the original mark was no more than 6% on average. The study also found that, in terms of the reliability of marks, subject‐specific knowledge was not essential to marking when using generic assessment criteria. The authors acknowledge the exploratory nature of these results and hope other lecturers will join in testing the robustness of generic assessment criteria across disciplines.

9.
The purpose of this study was to build a Random Forest supervised machine learning model to predict musical rater‐type classifications based upon a Rasch analysis of raters' differential severity/leniency related to item use. Raw scores (N = 1,704) from 142 raters across nine high school solo and ensemble festivals (grades 9–12) were collected using a 29‐item Likert‐type rating scale embedded within five domains (tone/intonation, n = 6; balance, n = 5; interpretation, n = 6; rhythm, n = 6; and technical accuracy, n = 6). Data were analyzed using a Many Facets Rasch Partial Credit Model. An a priori k‐means cluster analysis of 29 differential rater functioning indices produced a discrete feature vector that classified raters into one of three distinct rater‐types: (a) syntactical, (b) expressive, or (c) mental representation. The initial Random Forest model produced an out‐of‐bag error rate of 5.05%, indicating that approximately 95% of the raters were correctly classified. After tuning a set of three hyperparameters (ntree, mtry, and node size), the optimized model demonstrated an improved out‐of‐bag error rate of 2.02%. Implications for improvements in assessment, research, and rater training in the field of music education are discussed.
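The pipeline described in this abstract maps onto standard library calls. Below is a compact sketch with synthetic stand-in data (the real 142 × 29 matrix of differential rater functioning indices is not available here) using scikit-learn, whose n_estimators, max_features, and min_samples_leaf correspond roughly to randomForest's ntree, mtry, and node size:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(142, 29))  # 142 raters x 29 DRF indices (placeholder data)

# Step 1: a priori k-means clustering assigns each rater to one of three types.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: fit a Random Forest and read off its out-of-bag (OOB) error rate.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, labels)
print(f"OOB error: {1 - rf.oob_score_:.4f}")

# Step 3: tune the three hyperparameters named in the study by OOB error.
best = min(
    ((n, m, s) for n in (250, 500, 1000) for m in (3, 5, 9) for s in (1, 3, 5)),
    key=lambda p: 1 - RandomForestClassifier(
        n_estimators=p[0], max_features=p[1], min_samples_leaf=p[2],
        oob_score=True, random_state=0).fit(X, labels).oob_score_,
)
print("best (ntree, mtry, node size):", best)
```

Using OOB error for tuning, as the study does, avoids a separate validation split because each tree is evaluated on the bootstrap samples it never saw.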

10.
Courses preparing students for interviews commonly held in organisations often form part of the curriculum of senior secondary, higher, university, and in‐service education. In these courses, students are prepared for their future work practice. Assessment of student performance after attending such a course requires a different assessment method from the traditional written examination. In this article we describe the construction and evaluation of simulations. The results of an investigation into their quality show that they are reliable in terms of measures of internal consistency and inter‐rater reliability. However, it turned out that a student's score is highly dependent on the content of the interview. We found support for the simulations' construct and content validity. Although the simulation is not an efficient instrument, its benefits are high: students are stimulated to do their best in practising for the interviews, and weaknesses in students' performances will be detected so that remedial teaching can be offered.

11.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. We recommend that the evaluation of ratings be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating those judgments. In rater‐mediated assessments, the underlying constructs and response processes may require different rater judgment models and different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens model for the underlying assessment: for example, one lens model can be developed to emphasize the measurement of student proficiency, while another can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: the Rasch model and the hyperbolic cosine model. Future directions for the development and evaluation of rater‐mediated assessments are suggested.
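The cumulative and unfolding response processes contrasted here have simple canonical forms: under a dominance model such as the dichotomous Rasch model, endorsement probability rises monotonically as the person location passes the item location, whereas under an ideal-point model such as the hyperbolic cosine model it peaks when person and item coincide. A sketch of the two curves follows (dichotomous versions for brevity; the unfolding parameterization follows the commonly cited Andrich and Luo form and should be treated as an assumption here):

```python
import numpy as np

def rasch_prob(theta, delta):
    """Cumulative (dominance) process: the probability of a positive
    response rises monotonically as theta moves past delta."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

def hcm_prob(theta, delta, gamma=1.0):
    """Unfolding (ideal-point) process, hyperbolic cosine form: the
    probability peaks where theta equals delta and falls off in both
    directions; gamma governs the height and width of the peak."""
    return np.exp(gamma) / (np.exp(gamma) + 2.0 * np.cosh(theta - delta))

theta = np.linspace(-4, 4, 9)
print(np.round(rasch_prob(theta, 0.0), 2))  # monotone increasing in theta
print(np.round(hcm_prob(theta, 0.0), 2))    # single-peaked at theta = delta
```

The practical point of the contrast is that fitting a cumulative model to ratings generated by an unfolding judgment process (or vice versa) misreads rater behavior, so the choice of measurement model should follow the hypothesized response process.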

12.
The hierarchical rater model (HRM) recognizes the hierarchical structure of data that arises when raters score constructed‐response items. In this approach, raters' scores are not viewed as direct indicators of examinee proficiency but rather as indicators of essay quality; the (latent categorical) quality of an examinee's essay in turn serves as an indicator of the examinee's proficiency, yielding a hierarchical structure. Here it is shown that a latent class model motivated by signal detection theory (SDT) is a natural candidate for the first level of the HRM, the rater model. The latent class SDT model provides measures of rater precision and various rater effects, above and beyond simple severity or leniency. The HRM‐SDT model is applied to data from a large‐scale assessment and is shown to provide a useful summary of various aspects of the raters' performance.
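As a loose illustration of the signal detection view of a rater (not the article's exact latent class SDT specification; the precision parameter d and the criteria values below are invented for the toy example), one can simulate a rater whose continuous perception of latent essay quality is cut into rating categories by ordered criteria:

```python
import numpy as np

rng = np.random.default_rng(1)

def sdt_rater(quality, d=2.0, criteria=(1.0, 3.0, 5.0)):
    """Toy signal-detection rater: the rater perceives the latent essay
    quality class with normal noise; d indexes rater precision (how well
    perception tracks quality), and the ordered criteria cut the
    continuous perception into observed rating categories 1..4."""
    perception = d * np.asarray(quality, float) + rng.normal(size=np.size(quality))
    return np.digitize(perception, criteria) + 1

quality = rng.integers(1, 4, size=10)  # latent essay quality classes 1-3
print(quality)
print(sdt_rater(quality))  # larger d makes ratings track quality more closely
```

In this framing, shifting all criteria up or down reproduces severity or leniency, while d captures the precision effects that the abstract notes go beyond severity alone.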

13.
Researchers have documented the impact of rater effects, or raters' tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers' achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers' achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call on researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.
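The abstract does not spell out how rm-PRFs are constructed, but the underlying idea, tracking one test-taker's observed-minus-expected ratings as a function of a rater characteristic, can be loosely illustrated. The kernel smoothing and the use of rater severity as the ordering variable below are assumptions made for this sketch, not the authors' method:

```python
import numpy as np

rng = np.random.default_rng(2)
severities = np.sort(rng.normal(size=12))        # 12 raters' severity estimates
expected = 3.0 - 0.5 * severities                # model-expected ratings (toy)
observed = expected + rng.normal(scale=0.4, size=12)

def rm_prf(observed, expected, severities, bandwidth=0.5):
    """Kernel-smoothed person response function for one test-taker across
    raters: average observed-minus-expected rating as a function of rater
    severity. A curve hovering near zero suggests the ratings fit; a
    systematic trend suggests rater effects distort this person's estimate."""
    grid = np.linspace(severities.min(), severities.max(), 50)
    w = np.exp(-0.5 * ((grid[:, None] - severities[None, :]) / bandwidth) ** 2)
    curve = (w * (observed - expected)).sum(axis=1) / w.sum(axis=1)
    return grid, curve

grid, curve = rm_prf(observed, expected, severities)
print(np.round(curve[::10], 2))  # should stay near 0 for a well-fitting person
```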

14.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters' ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.
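The quantities involved have standard forms in Rasch-based rating analysis: a standardized residual z = (x − E)/√Var for each rating, and an outfit statistic that is a mean of squared standardized residuals. A minimal sketch of computing these per subgroup for one rater follows (the further standardization of the mean-square statistics usually applied before flagging misfit is omitted):

```python
import numpy as np

def std_residuals(x, expected, variance):
    """Standardized residuals of observed ratings given the measurement
    model's expected ratings and rating variances: z = (x - E) / sqrt(Var)."""
    x, expected, variance = map(np.asarray, (x, expected, variance))
    return (x - expected) / np.sqrt(variance)

def subgroup_outfit(z, subgroup):
    """Between-subgroup outfit for one rater: mean squared standardized
    residual within each subgroup; values well above 1 flag subgroups the
    rater scored more erratically than the model expects."""
    z, subgroup = np.asarray(z), np.asarray(subgroup)
    return {g: float(np.mean(z[subgroup == g] ** 2)) for g in np.unique(subgroup)}

z = std_residuals([3, 1, 4, 2], [2.6, 2.4, 3.1, 2.2], [0.8, 0.9, 0.7, 0.8])
print(subgroup_outfit(z, ["A", "A", "B", "B"]))
```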

15.
This article examines the reliability of content analyses of state student achievement tests and state content standards. We use data from two states in three grades in mathematics and in English language arts and reading to explore differences by state, content area, grade level, and document type. Using a generalizability framework, we find that reliabilities for four coders are generally greater than .80. The two problematic reliabilities are partly explained by an odd rater out. We conclude that the content analysis procedures, when used with at least five raters, provide reliable information to researchers, policymakers, and practitioners about the content of assessments and standards.
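The generalizability framework answers exactly this kind of staffing question through a decision (D) study: the reliability of a mean over m coders is the object-of-measurement variance divided by itself plus error variance shrunk by m. A sketch with illustrative, made-up variance components (the study's actual estimates are not reported in the abstract):

```python
def d_study_reliability(var_object, var_error, n_coders):
    """Generalizability (D-study) coefficient for a mean over n_coders:
    error variance shrinks in proportion to the number of coders averaged."""
    return var_object / (var_object + var_error / n_coders)

# With these made-up components, four coders already exceed .80, echoing
# the reported pattern; the curve shows the payoff of adding coders.
for n in (1, 2, 4, 5, 8):
    print(n, round(d_study_reliability(1.0, 0.8, n), 3))
```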

16.
Within the realm of school‐based interventions, implementation integrity is important for practical, legal, and ethical purposes. Unfortunately, evidence suggests that proper monitoring of implementation integrity is often absent from both research and practice. School psychology practitioners and researchers have reported that a major barrier to monitoring integrity is a lack of procedural guidance, and currently there is little research that has examined the psychometric reliability of monitoring procedures and materials. Therefore, the purpose of this two‐part study was to examine (a) the extent to which relatively novice educators could self‐learn and successfully use an implementation integrity monitoring system designed to evaluate a structured reading intervention program, and (b) the inter‐observer reliability of two individuals using the system to evaluate the same interventionist. Overall findings suggested that it is feasible for most individuals to learn the implementation integrity monitoring system (and associated materials) and that the system can be used reliably across multiple observers. Implications of these findings are discussed, including how the procedures and materials might be adapted for other intervention programs to assist researchers and practitioners with monitoring implementation integrity.

17.
Three studies compared the common Likert agree/disagree question form to a behavioural observation form in which students report recalled frequencies of described teaching or learning events. The agree/disagree form seemed to prompt global, impressionistic approaches to responding, while the behavioural observation form seemed to prompt more objective approaches. Between‐student response consistency was greater for the behavioural observation form than for the agree/disagree form. Across separate samples of teaching, mean overall ratings derived from behavioural observation form questionnaires spread more broadly than did those from agree/disagree forms. Across separate elements within an individual's teaching, the ratings from the behavioural observation form spread more than those from the agree/disagree form. The conclusions drawn were that using behavioural observation form questions rather than agree/disagree questions in teaching evaluation questionnaires can yield measurable improvements in inter‐rater reliability and in the capability to distinguish amongst levels of teaching quality.

18.
This study describes the development of an instrument to investigate the extent to which student‐centered actions are occurring in science classrooms. The instrument was developed through the following five stages: (1) student action identification, (2) use of both national and international content experts to establish content validity, (3) refinement of the item pool based on reviewer comments, (4) pilot testing of the instrument, and (5) statistical reliability and item analysis leading to additional refinement and finalization of the instrument. In the field test, the instrument consisted of 26 items separated into four categories originally derived from student‐centered instruction literature and used by the authors to sort student actions in previous research. The SACS was administered across 22 Grade 6–8 classrooms by 22 groups of observers, with a total of 67 SACS ratings completed. The finalized instrument was found to be internally consistent, with acceptable estimates from inter‐rater intraclass correlation reliability coefficients at the p < 0.01 level. After the final stage of development, the SACS instrument consisted of 24 items separated into three categories, which aligned with the factor analysis clustering of the items. Additionally, concurrent validity of the SACS was established with the Reformed Teaching Observation Protocol. Based on the analyses completed, the SACS appears to be a useful instrument for inclusion in comprehensive assessment packages for illuminating the extent to which student‐centered actions are occurring in science classrooms.

19.
Rater‐mediated assessments are a common methodology for measuring persons, investigating rater behavior, and/or defining latent constructs. The purpose of this article is to provide a pedagogical framework for examining rater variability in the context of rater‐mediated assessments using three distinct models. The first model is the observation model, which includes ecological/environmental considerations for the evaluation system. The second model is the measurement model, which includes the transformation of observed rater response data to linear measures using a measurement model with specific requirements of rater‐invariant measurement, in order to examine raters' construct‐relevant variability stemming from the evaluative system. The third model is the interaction model, which includes an interaction parameter to allow for investigation into raters' systematic, construct‐irrelevant variability stemming from the evaluative system. Implications for measurement outcomes and validity are discussed.

20.
No matter how much HPT practitioners improve human performance, we are not granted enough invitations to serve as tactical and strategic decision makers. Many "earning" decision makers such as financial experts regard HPT professionals as no more than operational‐level learning professionals. Therefore, HPT practitioners cannot rely on current decision makers to initiate a call for increased respect for the HPT industry. However, there is a contingent of HPT practitioners known as business‐entity performance technologists (BEP Techs) who serve as HPT activists by obtaining influential roles in decision processes. Their goals are to influence decision makers, become decision makers, support decision makers, and challenge decision makers from other industries to respect HPT practitioners as more than just learning professionals. To do this, they focus more on the improvement of business‐entity performance than on human performance. They seek to obtain respect for the HPT industry as both learning and earning professionals.
