20 similar documents found (search time: 15 ms)
Competency examinations in a variety of domains require setting a minimum standard of performance. This study examines the issue of whether judges using the two most popular methods for setting cut scores (Angoff and Nedelsky methods) use different sources of information when making their judgments. Thirty-one judges were assigned randomly to the two methods to set cut scores for a high school graduation test in reading comprehension. These ratings were then related to characteristics of the items as well as to empirically obtained p values. Results indicate that judges using the Angoff method use a wider variety of information and yield estimates closer to the actual p values. The characteristics of items used in the study were effective predictors of judges' ratings, but were far less effective in predicting p values.
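
As a rough illustration of the quantities compared in this abstract (not the study's own analysis): under the Angoff method each judge estimates, for every item, the probability that a minimally competent examinee answers it correctly, and the sum of those estimates is that judge's cut score. The sketch below assumes a hypothetical `ratings` layout (one list of per-item probabilities per judge) and invented numbers; the function names are illustrative only.

```python
def angoff_cut_score(ratings):
    """Panel cut score: mean over judges of each judge's summed item ratings."""
    judge_cuts = [sum(judge) for judge in ratings]
    return sum(judge_cuts) / len(judge_cuts)


def mean_abs_gap(ratings, p_values):
    """Mean absolute gap between panel-mean item ratings and empirical p values."""
    n_judges = len(ratings)
    item_means = [sum(col) / n_judges for col in zip(*ratings)]
    gaps = [abs(m - p) for m, p in zip(item_means, p_values)]
    return sum(gaps) / len(gaps)


# Hypothetical data: 2 judges, 3 items.
ratings = [[0.5, 0.5, 1.0],
           [0.25, 0.75, 1.0]]
p_values = [0.5, 0.5, 0.75]

cut = angoff_cut_score(ratings)
gap = mean_abs_gap(ratings, p_values)
```

A smaller gap would correspond to what the abstract describes as estimates "closer to the actual p values".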

Despite being widely used and frequently studied, the Angoff standard setting procedure has received little attention with respect to an integral part of the process: how judges incorporate examinee performance data in the decision‐making process. Without performance data, subject matter experts have considerable difficulty accurately making the required judgments. Providing data introduces the very real possibility that judges will turn their content‐based judgments into norm‐referenced judgments. This article reports on three Angoff standard setting panels for which some items were randomly assigned to have incorrect performance data. Judges were informed that some of the items were accompanied by inaccurate data, but were not told which items they were. The purpose of the manipulation was to assess the extent to which changing the instructions given to the judges would impact the extent to which they relied on the performance data. The modified instructions resulted in the judges making less use of the performance data than judges participating in recent parallel studies. The relative extent of the change judges made did not appear to be substantially influenced by the accuracy of the data.

College teachers' self-ratings were investigated in this study by comparing them to ratings given by students. The sample consisted of 343 teaching faculty from five colleges; these teachers, as well as the students in one of their classes, responded to a 21-item instructional report questionnaire. Teacher self-ratings had only a modest relationship with the ratings given by students (a median correlation of .21 for the items). In addition to the general lack of agreement between self and student evaluations, there was also a tendency for teachers as a group to give themselves better ratings than their students did.
Discrepancies between individual teacher ratings and ratings given by the class were further analyzed for: (a) sex of the teacher (no difference found); (b) number of years of teaching experience (no difference); and (c) subject area of the course (differences noted for natural science courses vs. those in education and applied areas).

This study examined the effect of matching learners' cognitive styles with science learning activities on science knowledge and attitudes. Fifty-six elementary education majors who were identified as Sensing Feeling types on the Myers-Briggs Type Indicator participated in this study. The Sensing Feeling type is predominant among elementary school educators. The subjects participated in either nine science activities matched to the learning preferences of Sensing Feelers or nine science activities mismatched to their learning preferences. These mismatched activities were geared toward the learning preferences of Intuitive Thinkers, the dominant type among scientists. Results revealed no significant differences between matched and mismatched groups in knowledge of the material presented or overall attitude toward science and toward science teaching. Comparisons made subsequent to the hypothesized analyses did suggest that cognitive style may affect reactions to certain specific learning activities. The immediate reactions of forty non-Sensing Feeling types who also experienced the treatments were compared to those of the 56 Sensing Feeling subjects. Certain activities which were rated by judges prior to the onset of treatment as being particularly well-matched to the Sensing Feeling style did receive significantly more favorable ratings by the Sensing Feeling subjects than by other types. Conversely, the Sensing Feelers gave significantly lower ratings than other types to certain activities which, according to independent judges, were strongly mismatched to the Sensing Feeling style.

Numerous writers have suggested that the discrimination index may be helpful in identifying faulty test items. The purpose of this study was to investigate systematically the validity of the index for this purpose. To attain this objective, two forms of an arithmetic-reasoning test were written. In each form, the items were designed to vary in quality with respect to nine item-writing principles, and on the basis of the responses of 364 examinees, a discrimination index was computed for each item. Next, the items were rated independently for quality by three judges who used a check list of the nine item-writing principles. The average of their ratings for each item was used as the criterion for determining the validity of the indices. The results indicate that the discrimination index is a moderately valid measure of item quality. The implications of this finding are discussed.
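
The abstract does not say which discrimination index was computed. One common choice, assumed here purely for illustration, is the upper-lower index D: the proportion answering the item correctly in the top-scoring group minus the proportion in the bottom-scoring group. The function name, the 27% group size, and the sample data below are all hypothetical.

```python
def discrimination_index(responses, item, fraction=0.27):
    """Upper-lower discrimination index D for one item.

    responses: one list of 0/1 item scores per examinee.
    item:      column index of the item being evaluated.
    fraction:  share of examinees placed in each extreme group.
    """
    totals = [sum(row) for row in responses]
    # Examinee indices ordered from lowest to highest total score.
    order = sorted(range(len(responses)), key=lambda i: totals[i])
    k = max(1, round(len(responses) * fraction))
    low, high = order[:k], order[-k:]
    p_high = sum(responses[i][item] for i in high) / k
    p_low = sum(responses[i][item] for i in low) / k
    return p_high - p_low


# Hypothetical data: 4 examinees, 3 items.
responses = [[1, 1, 1],
             [1, 1, 0],
             [1, 0, 0],
             [0, 0, 0]]
d_item0 = discrimination_index(responses, item=0, fraction=0.5)
d_item1 = discrimination_index(responses, item=1, fraction=0.5)
```

Values near zero or below suggest an item that fails to separate stronger from weaker examinees, which is the sense in which the index might flag a faulty item.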

Forty-eight sets of recommendations identified by their authors as practical heuristics for the evaluation and revision of instructional materials were reviewed and consolidated. These guidelines were extracted from professional journals, book chapters and independent publications. Initially, all items were compiled, and then sorted into three categories according to their specific focus: content (subject matter), design, and presentation. In a second sort, identical items were eliminated and semantically equivalent items were grouped together. Three independent judges performed the same tasks for reliability. The outcome is a comprehensive list of 67 items, representing all of the reviewed guidelines. This instrument could be a successful aid in identifying deficiencies of instructional materials in the areas of content, design and presentation.

This study investigated classroom practices of 38 teachers enrolled in university master's degree programs in educational technology and in other areas of education. The classroom practices related to five key concepts associated with educational technology: (a) learner-centered instruction, (b) instructional design, (c) media and technology, (d) assessment, and (e) instructional alignment. Teachers rated their frequency of use of desirable practices in these five areas on a 30-item Likert-type survey. In addition, one class of students per teacher rated its own teacher's frequency of use of the practices on 20 items parallel to items on the teacher survey. The mean overall rating across all teachers for the classroom practice items was very close to Often, or 4.0, on the 5-point scale. There were few reported differences between the teachers enrolled in educational technology programs and those enrolled in other education programs. Student ratings indicated less frequent teacher use of the desirable practices on 16 of the 20 common items, with significantly lower student ratings on 8 of these items. However, there was strong teacher-student agreement on several other comparisons. The study reported in this article was conducted as a doctoral dissertation at Arizona State University.

The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard setting judgments and empirical conditional probabilities. This relationship has been used as a measure of internal consistency by previous researchers. Results indicated that judges varied in the degree to which they were able to produce internally consistent ratings; some judges produced ratings that were highly correlated with empirical conditional probabilities and other judges’ ratings had essentially no correlation with the conditional probabilities. The results also showed that weighting procedures applied to individual judgments both increased panel-level internal consistency and produced convergence across panels.
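
A minimal sketch of the judge-level check described above, assuming the empirical conditional probabilities are already available as one value per item. It computes a plain Pearson correlation per judge; the study's actual statistic and weighting procedures are not given in the abstract, and the judge labels and numbers are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def judge_consistency(ratings_by_judge, conditional_p):
    """Correlate each judge's item ratings with the empirical conditional probabilities."""
    return {judge: pearson(r, conditional_p)
            for judge, r in ratings_by_judge.items()}


# Hypothetical data: 4 items, 2 judges.
conditional_p = [0.2, 0.4, 0.6, 0.8]
ratings_by_judge = {
    "judge_a": [0.3, 0.5, 0.7, 0.9],   # tracks the empirical ordering
    "judge_b": [0.9, 0.7, 0.5, 0.3],   # reverses it
}
consistency = judge_consistency(ratings_by_judge, conditional_p)
```

Judges with correlations near zero would be the ones the abstract describes as having "essentially no correlation" with the conditional probabilities.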

Setting performance standards is a judgmental process involving human opinions and values as well as technical and empirical considerations. Although all cut score decisions are by nature somewhat arbitrary, they should not be capricious. Judges selected for standard‐setting panels should have the proper qualifications to make the judgments asked of them; however, even qualified judges vary in expertise and in some cases, such as highly specialized areas or when members of the public are involved, it may be difficult to ensure that each member of a standard‐setting panel has the requisite expertise to make qualified judgments. Given the subjective nature of these types of judgments, and that a large part of the validity argument for an exam lies in the robustness of its passing standard, an examination of the influence of judge proficiency on the judgments is warranted. This study explores the use of the many‐facet Rasch model as a method for adjusting modified Angoff standard‐setting ratings based on judges’ proficiency levels. The results suggest differences in the severity and quality of standard‐setting judgments across levels of judge proficiency, such that judges who answered easy items incorrectly tended to perceive them as easier, but those who answered correctly tended to provide ratings within normal stochastic limits.

Minimum standards were established for the National Teacher Examinations (NTE) area examinations in mathematics and in elementary education by independent panels of teacher educators who had been instructed in the use of either the Angoff, Nedelsky, or Jaeger procedures. Of these three procedures, only the Jaeger method requires that normative data be provided to the judges when evaluating the items. However, it was of interest to study the effect such information would have upon the standards obtained using the other two methods. Therefore, the design incorporated three sequential review sessions with the level of normative information different for each. A three-factor ANOVA revealed significant main effects for methods and sessions but not for subject area. None of the interactions was significant. The anticipated failure rates, the psychometric characteristics of the ratings, and other factors suggest that the Angoff procedure, as modified during the second session of this study, yields the most defensible standards for the NTE area examinations.

The alignment of test items to content standards is critical to the validity of decisions made from standards‐based tests. Generally, alignment is determined based on judgments made by a panel of content experts with either ratings averaged or via a consensus reached through discussion. When the pool of items to be reviewed is large, or the content‐matter experts are broadly distributed geographically, panel methods present significant challenges. This article illustrates the use of an online methodology for gauging item alignment that does not require that raters convene in person, reduces the overall cost of the study, increases time flexibility, and offers an efficient means for reviewing large item banks. Latent trait methods are applied to the data to control for between‐rater severity, evaluate intrarater consistency, and provide item‐level diagnostic statistics. Use of this methodology is illustrated with a large pool (1,345) of interim‐formative mathematics test items. Implications for the field and limitations of this approach are discussed.

Extensive research has been done on student ratings of instruction on closed-ended questionnaires, but little research has examined students’ written responses to open-ended questions. This study investigated the written comments of students in 198 classes, focusing on their frequency, content, direction, and consistency with quantitative ratings on closed-ended items. Results indicated that about 45% of the students wrote comments. Comments were more often positive than negative and tended to be general rather than specific. Written comments addressed dimensions similar to those identified in the closed-ended items, but they also related to unique aspects of the courses.

This study investigated the usefulness of the many‐facet Rasch model (MFRM) in evaluating the quality of performance related to PowerPoint presentations in higher education. The Rasch Model utilizes item response theory stating that the probability of a correct response to a test item/task depends largely on a single parameter, the ability of the person. MFRM extends this one‐parameter model to other facets of task difficulty, for example, rater severity, rating scale format, task difficulty levels. This paper specifically investigated presentation ability in terms of items/task difficulty and rater severity/leniency. First‐year science education students prepared and used the PowerPoint presentation software program during the autumn semester of the 2005–2006 school year in the ‘Introduction to the Teaching Profession’ course. The students were divided into six sub‐groups and each sub‐group was given an instructional topic, based on the content and objectives of the course, to prepare a PowerPoint presentation. Seven judges, including the course instructor, evaluated each group’s PowerPoint presentation performance using ‘A+ PowerPoint Rubric’. The results of this study show that the MFRM technique is a powerful tool for handling polytomous data in performance and peer assessment in higher education.
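
For reference, the two models named in this abstract are usually written as follows. The notation is an assumption, not taken from the paper: θ_n is the ability of person n, δ_i the difficulty of item or task i, α_j the severity of rater j, and τ_k the threshold between rating categories k−1 and k. In the dichotomous Rasch model, the probability of success depends on the difference between ability and item difficulty; the many-facet form adds a rater term and a category threshold.

```latex
% Dichotomous Rasch model: person n, item i
P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}

% Many-facet Rasch model (rating-scale form): rater j, category k
\ln\!\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \delta_i - \alpha_j - \tau_k
```

Here P_{nijk} is the probability that rater j awards person n category k on item i, so a more severe rater (larger α_j) lowers the log-odds of the higher category for every examinee.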

Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent.

One of the major problems in assessment and evaluation is that different people rate the same performance with varying degrees of severity. Individual raters vary the severity of their ratings in a manner dependent upon a wide array of factors. Most efforts intended to secure reliable and valid ratings across judges assume that the goal is to obtain identical ratings from different judges for the same performance. In contrast to these approaches, probabilistic conjoint measurement facilitates observation and calibration of differences in judge severity, making it possible to account for these differences in the interpretation of the assigned ratings. This chapter addresses issues in the application of Facets analyses to writing assessment, aesthetic judgment, and the evaluation of public speaking ability.


The purpose of this investigation was to develop and validate a simulation device to measure a teacher's ability to identify verbal and nonverbal emotions expressed by students (teacher affective sensitivity). The scale consists of videotaped excerpts of teacher-learner interactions and accompanying multiple-choice instrumentation. Respondents select the answer from each multiple-choice item that they believe most accurately describes the affective state of the pupil viewed on the monitor. Previously produced media focusing on classroom interactions were used to obtain the examples of learner affective expressions. Expert judges constructed two multiple-choice items for each simulation episode. Pilot test administrations allowed for numerous scale revisions. Finally, assessments of scale reliability, and scale construct, predictive, concurrent, and content validity were made.

Central to the standards-based assessment validation process is an examination of the alignment between state standards and test items. Several alignment analysis systems have emerged recently, but most rely on either traditional rating or matching techniques. Few, if any, analyses have been reported on the degree of consistency between the two methods and on the item and objective characteristics that influence judges' decisions. We randomly assigned judges to either rate item-objective links or match items to objectives while reviewing the 2004 Arizona high school mathematics standards and assessment. Across items we found moderate convergence between methods, and we detected apparent reasons for divergently scored items. We also found that judges relied on item and objective content and intellectual skill features to render decisions. Based on our evidence, we contend that a thorough alignment analysis would involve judges using both rating and matching, while focusing on both content and intellectual skill. The findings have important implications for states when examining the alignment between their standards and assessments.

In order to obtain objective measurement for examinations that are graded by judges, an extension of the Rasch model designed to analyze examinations with more than two facets (items/examinees) is used. This extended Rasch model calibrates the elements of each facet of the examination (i.e., examinee performances, items, and judges) on a common log-linear scale. A network for assigning judges to examinations is used to link all facets. Real examination data from the "clinical assessment" part of a certification examination are used to illustrate the application. A range of item difficulties and judge severities were found. Comparison of examinee raw scores with objective linear measures corrected for variations in judge severity shows that judge severity can have a substantial impact on a raw score. Correcting for judge severity improves the fairness of examinee measures and of the subsequent pass-fail decisions because the uncorrected raw scores favor examinee performances graded by lenient judges.
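
To see why uncorrected raw scores favor performances graded by lenient judges, consider a deliberately simplified sketch. It assumes pass/fail item scores and a logit that subtracts item difficulty and judge severity from examinee ability; the actual examination presumably used rating scales, and all names and values below are hypothetical.

```python
from math import exp


def p_success(ability, item_difficulty, judge_severity):
    """Probability of a successful rating under a three-facet logistic model."""
    logit = ability - item_difficulty - judge_severity
    return 1.0 / (1.0 + exp(-logit))


def expected_raw_score(ability, item_difficulties, judge_severity):
    """Expected number of items passed when one judge grades the whole performance."""
    return sum(p_success(ability, d, judge_severity) for d in item_difficulties)


# Same examinee (ability 0.0 logits), same items, two different judges.
items = [-1.0, 0.0, 1.0]
lenient = expected_raw_score(0.0, items, judge_severity=-0.5)
severe = expected_raw_score(0.0, items, judge_severity=0.5)
```

The examinee's ability is identical in both cases, yet the expected raw score is higher under the lenient judge; estimating severity as its own facet is what lets that difference be removed from the examinee measure.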

This study evaluated the use of Interpersonal Process Recall (IPR) with videotape (as opposed to more traditional methods) in improving the effectiveness of practicum students. The practicum students were randomly assigned to one of three treatment groups: (a) a video-IPR, (b) an audio-IPR, and (c) supervision using an audiotape of a regular counseling session. Three judges were asked to rate videotapes of 54 practicum students conducting their final counseling session with a coached client. The judges rated two time-samplings of the final session by means of a scale consisting of three parts: (a) 33 behavioral and feeling items, (b) a single global evaluation representing the normal curve with the baseline divided into eight equal segments, and (c) a request for the judges to write any adjectives or phrases that they felt described the practicum student. The results were not as convincing as had been anticipated. This article discusses possible reasons that the results were not convincing and implications the results have for future research within this area of counselor education.
