Similar literature
1.
Applied Measurement in Education, 2013, 26(1): 67-78
Sixty-one judges provided recommendations on minimal standards for the Essay portion of the National Teacher Examinations Communication Skills Test, which is used to screen applicants to North Carolina teacher education programs. The standard-setting procedures were described; provision of student performance information to judges and group discussion significantly increased the average recommended standards. Initial differences between the average recommended standards when two sets of essays were used by judges were diminished by those treatments. The recommendations of public school judges were significantly more variable than were those of college and university judges following discussion.

2.
Minimum standards were established for the National Teacher Examinations (NTE) area examinations in mathematics and in elementary education by independent panels of teacher educators who had been instructed in the use of either the Angoff, Nedelsky, or Jaeger procedures. Of these three procedures, only the Jaeger method requires that normative data be provided to the judges when evaluating the items. However, it was of interest to study the effect such information would have upon the standards obtained using the other two methods. Therefore, the design incorporated three sequential review sessions with the level of normative information different for each. A three-factor ANOVA revealed significant main effects for methods and sessions but not for subject area. None of the interactions was significant. The anticipated failure rates, the psychometric characteristics of the ratings, and other factors suggest that the Angoff procedure, as modified during the second session of this study, yields the most defensible standards for the NTE area examinations.

3.
This paper examined the diagnostic utility of subtest variability, as represented by the number of subtests that deviate from examinees' mean IQ scores, for identifying students with a learning disability (LD). Participants consisted of the 2,200 students in the WISC‐III normative sample and 684 students (Mdn grade = 5; M age = 10.8) identified as LD. The number of subtests deviating from examinees' Verbal, Performance, and Full Scale IQ by ±3 points for normative and exceptional samples were contrasted via Receiver Operating Characteristic (ROC) analyses. Results indicated that LD students did not differ from normative sample children at levels above chance. It was concluded that deviation of individual subtest scores from mean IQ scores has no diagnostic utility for hypothesizing about students with learning disabilities. © 2000 John Wiley & Sons, Inc.
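
As an illustration of what an ROC comparison "at levels above chance" means, the sketch below computes the area under the ROC curve (AUC) as the probability that a randomly chosen LD student has more deviating subtests than a randomly chosen normative student, with ties counted as one half. An AUC near 0.5 corresponds to chance-level discrimination. The function name and the example counts are hypothetical and are not taken from the study.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    positive_scores: scores for the clinical group (here, hypothetical
    counts of deviating subtests for LD students).
    negative_scores: scores for the comparison (normative) group.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0   # clinical case ranked above comparison case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical data: identical distributions give chance-level AUC.
chance_level = auc([1, 2], [1, 2])
```

With identical score distributions in both groups, as in the usage line, the statistic equals 0.5, which is the pattern the abstract reports for subtest deviation counts.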

4.
A potential undesirable effect of multistage testing is differential speededness, which happens if some of the test takers run out of time because they receive subtests with items that are more time intensive than others. This article shows how a probabilistic response-time model can be used for estimating differences in time intensities and speed between subtests and test takers and detecting differential speededness. An empirical data set for a multistage test in the computerized CPA Exam was used to demonstrate the procedures. Although the more difficult subtests appeared to have items that were more time intensive than the easier subtests, an analysis of the residual response times did not reveal any significant differential speededness because the time limit appeared to be appropriate. In a separate analysis, within each of the subtests, we found minor but consistent patterns of residual times that are believed to be due to a warm-up effect, that is, use of more time on the initial items than they actually need.

5.
The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard setting judgments and empirical conditional probabilities. This relationship has been used as a measure of internal consistency by previous researchers. Results indicated that judges varied in the degree to which they were able to produce internally consistent ratings; some judges produced ratings that were highly correlated with empirical conditional probabilities and other judges’ ratings had essentially no correlation with the conditional probabilities. The results also showed that weighting procedures applied to individual judgments both increased panel-level internal consistency and produced convergence across panels.
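
The judge-level analysis described above can be sketched roughly as follows. In the usual Angoff setup, each judge rates every item with the probability that a minimally competent examinee answers it correctly; a judge's cut score is the sum of those ratings, and the panel cut score is the mean across judges. Internal consistency is taken here as the Pearson correlation between a judge's ratings and the empirical conditional probabilities. All names and numbers are hypothetical, and the abstract's weighting procedures are not specified, so they are not reproduced.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def judge_summaries(ratings, empirical):
    """Per-judge cut score and internal consistency.

    ratings: one list of item ratings (probabilities) per judge.
    empirical: empirical conditional probabilities for the same items.
    """
    return [
        {"cut_score": sum(r), "consistency": pearson(r, empirical)}
        for r in ratings
    ]


def panel_cut_score(ratings):
    """Panel-level cut score: mean of the judges' summed ratings."""
    return mean(sum(r) for r in ratings)


# Hypothetical data: three items, two judges.
empirical = [0.3, 0.5, 0.7]
ratings = [
    [0.2, 0.5, 0.8],  # tracks the empirical ordering closely
    [0.6, 0.4, 0.6],  # essentially unrelated to the empirical values
]
summaries = judge_summaries(ratings, empirical)
```

In this toy data the first judge's ratings correlate perfectly with the empirical values and the second judge's ratings do not correlate at all, mirroring the spread across judges that the abstract reports.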

6.
An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item judgments directly reflect empirical item difficulties, the variability in Angoff judgments over items would not add error to the cut score, but to the extent that the mean item judgments do not correspond to the empirical item difficulties, variability in mean judgments over items would add error to the cut score. In this article, we present two generalizability-theory–based analyses of the proportion of the item variance that contributes to error in the cut score. For one approach, variance components are estimated on the probability (or proportion-correct) scale of the Angoff judgments, and for the other, the judgments are transferred to the theta scale of an item response theory model before estimating the variance components. The two analyses yield somewhat different results but both indicate that it is not appropriate to simply ignore the item variance component in estimating the error variance.

7.
Assessment practitioners are often encouraged to adopt an “intelligent” approach to the interpretation of intelligence tests. A fundamental assumption of the “intelligent testing” philosophy is that psychometric test information (e.g., subtest g loadings) should be considered during the interpretive process. The relevant psychometric information is provided in the form of sample-based estimates. Unfortunately, the accuracy of these estimates, and the subsequent qualitative classification of intelligence subtests (e.g., good, fair, poor), are influenced to an unknown degree by sampling error. The current study demonstrated how data smoothing procedures, procedures commonly used in the development of continuous test norms, can be used to provide better estimates of the reliability, uniqueness, and general factor characteristics for the WISC-III subtests. © 1997 John Wiley & Sons, Inc.

8.
Selected parameters for a negatively skewed and a normally distributed normative distribution were estimated in a post mortem item-examinee sampling investigation. Manipulated systematically were number of subtests, number of items per subtest, and number of examinees responding to each subtest. Each item-examinee sampling procedure was replicated five times. Defining one observation as the score received by one examinee on one item, the results of this investigation support the conclusion that, in estimating parameters by item-examinee sampling, the variable of importance is not the item-examinee sampling procedure but is instead the number of observations obtained by that procedure. Degree of skewness in the normative distribution and failure to distribute all items among subtests were found to be relatively unimportant variables.

9.
The Northern Ireland Curriculum, like the English National Curriculum, records pupil achievement on a 10‐level scale. The level to which a pupil is ‘assigned’ at the end of a Key Stage is based upon two sources of assessment information: classroom‐based measures provided by the teacher and summative information from Common Assessment Instruments (CAIs), which are pen‐and‐paper tests taken at the end of the Key Stage. CAIs play a central role in confirming the accuracy with which teachers judge the level at which a pupil is working. While the teacher might judge a pupil to have mastered level 7 in Algebra, for example, based upon observation in class, test data and homeworks, the CAI will only confirm this level if the pupil scores above the level 7 cutscore on the CAI. If this cutscore does not accord with a reliable measure of what constitutes level 7 performance in Algebra in the classroom, there is likely to be misclassification of pupils with attendant difficulties for the efficient planning of teaching and learning. Misclassifications can be minimised when examiners and teachers interpret level 7 achievement in Algebra similarly. The Angoff standard‐setting procedure was used to establish level 5 cutscores in the Number and Handling Data tests of the mathematics CAI so that comparisons might be made between the published level 5 cutscores and those which result from a judgemental standard‐setting procedure. The 21 teachers involved in the procedure were offered the opportunity to recommend a level 5 ‘standard’ using the Angoff methodology, and to review their recommendations in the light of test data from the February 1993 CAI administration. A further opportunity was offered following a discussion during which individual teachers articulated their reasons for the standards they recommended. The results confirm that the reliability of recommended standards increases both as a consequence of receiving normative data and of discussion. 
All statistical measures reported in this article indicate that the procedure could command the confidence of examiners, teachers and the public. While the recommended cutscore for Number is in close accord with that published by the examiners, the extent of the mismatch in the Handling Data test is such as might give rise to some misclassification of pupils. It is important to stress that this mismatch had no real consequences since 1993 was a pilot year and no test outcomes were reported. The article concludes with an outline of the contribution which the Angoff methodology can make to the resolution of some of the difficulties faced by English national assessment, as identified in Sir Ron Dearing's interim report “The National Curriculum and its Assessment”.

10.
Several studies concerning scoring difficulties on the Wechsler intelligence scales were reviewed. Since the scoring of responses on the comprehension, similarities and vocabulary subtests of the Wechsler scales demands judgements by the examiner, the possibility of poor interscorer reliability is increased. In fact, research on the scoring of ambiguous responses on these subtests has demonstrated a high percentage of disagreement among scorers. More thorough scoring standards and revision of test items which lend themselves to ambiguous replies are needed.

11.
The credibility of standard‐setting cut scores depends in part on two sources of consistency evidence: intrajudge and interjudge consistency. Although intrajudge consistency feedback has often been provided to Angoff judges in practice, more evidence is needed to determine whether it achieves its intended effect. In this randomized experiment with 36 judges, non‐numeric item‐level intrajudge consistency feedback was provided to treatment‐group judges after the first and second rounds of Angoff ratings. Compared to the judges in the control condition, those receiving the feedback significantly improved their intrajudge consistency, with the effect being stronger after the first round than after the second round. To examine whether this feedback has deleterious effects on between‐judge consistency, I also examined interjudge consistency at the cut score level and the item level using generalizability theory. The results showed that without the feedback, cut score variability worsened; with the feedback, idiosyncratic item‐level variability improved. These results suggest that non‐numeric intrajudge consistency feedback achieves its intended effect and potentially improves interjudge consistency. The findings contribute to standard‐setting feedback research and provide empirical evidence for practitioners planning Angoff procedures.

12.
Changes in the full scale reliability of the WISC-R were computed at three age levels when each subtest was omitted by itself. Reliability was then determined when combinations of the two to five subtests which independently lowered the full scale reliability the most were omitted. The same procedure was followed with those subtests which independently had the smallest effect in lowering full scale reliability. The deletion of any one subtest had a negligible effect on reliability. Only when the combination of the five subtests having the greatest independent effect on full scale reliability was omitted did the reliability drop below .90. Cautions were noted concerning the exclusion of subtests even when reliability remains acceptably high.

13.
This study investigated the reliability, validity, and utility of the following three measures of letter-formation quality: (a) a holistic rating system, in which examiners rated letters on a five-point Likert-type scale; (b) a holistic rating system with model letters, in which examiners used model letters that exemplified specific criterion scores to rate letters; and (c) a correct/incorrect procedure, in which examiners used transparent overlays and standard verbal criteria to score letters. Intrarater and interrater reliability coefficients revealed that the two holistic scoring procedures were unreliable, whereas scores obtained by examiners who used the correct/incorrect procedure were consistent over time and across examiners. Although all three of the target measures were sensitive to differences between individual letters, only the scores from the two holistic procedures were associated with other indices of handwriting performance. Furthermore, for each of the target measures, variability in scores was, for the most part, not attributable to the level of experience or sex of the respondents. Findings are discussed with respect to criteria for validating an assessment instrument.

14.
This content analysis examined how the authors of 114 peer-reviewed journal articles explained their empirical approaches to visual rhetoric scholarship. The authors' content analysis sought to answer the question: how do scholars engage with the material dimensions of visual culture, specifically in terms of artifact selection and reporting data collection procedures? The answers to this question, the authors argue, are needed urgently as visual rhetoric research continues to expand because inconsistent reporting will hinder replicability and the reader’s access to the author’s argument. The authors use the findings of their content analysis to surface the implicit norms of empirical visual rhetoric research and to develop recommendations for reporting visual data collection procedures.

15.
Ten judges scored items from the Comprehension, Similarities, and Vocabulary subtests of the WISC-R. Five were inexperienced undergraduates and five were experienced PhDs. Overall, there were no appreciable differences in the percentages of agreement between the two groups.

16.
A national (USA) student‐led, case‐based CLinician/Administrator Relationship Improvement OrganizatioN (CLARION) competition focuses students in medical and related healthcare programs on the provision of healthcare that is safe, timely, equitable, patient‐centred, effective and efficient. Students work in four‐person, inter‐professional teams to research and analyse a designated case. They then present their findings and recommendations to a panel of independent judges. Students, with support from their faculty advisors, approach the case as they see fit. Following initial participation in this CLARION competition, an inter‐professional team of students from two universities and their advisory faculty developed a two‐semester, pre‐competition course as a model to facilitate transformation in healthcare education. The course is theoretical, empirical and practical. It has multiple levels of learning and is designed to mentor students, develop faculty, measure learning outcomes and stimulate administrators in higher education to think creatively about curriculum development across disciplines. This integrated and inter‐professional approach is pivotal in healthcare education to ensure students learn safe and evidence‐based clinical practice that meets the highest standards for quality care.

17.
Track recommendations provided to students in the final grade of primary education lead the allocation to specific school tracks in secondary education in the Netherlands. Where the results of a standardised test indicate that students are able to go to a higher track level, primary schools are required to reconsider and potentially adjust the track recommendation to a higher level. The current research aimed to (1) investigate trends in the level of track recommendations, double track recommendations and reconsiderations over the years 2014–2015 to 2018–2019, (2) explore the variation in (trends of) track recommendations between Dutch primary schools and their school boards, and (3) assess the association between track recommendations and the school level variables degree of urbanisation and type of primary education. We used multilevel growth curve modelling for continuous and count data based on publicly available school-level population data regarding track recommendations and school leavers tests from 2014–2015 to 2018–2019. The number of double track recommendations has increased over the cohorts, with a slightly decreasing gap between schools in rural and urban areas. The number of reconsiderations first decreased and then increased. The differences in reconsiderations between rural and urban areas are increasing over time. An initial trend towards higher average recommendations stabilising in the later cohorts appeared with no clear pattern for degree of urbanisation. The current study adds to the existing knowledge by assessing longitudinal trends instead of cross-sectional analyses and including multiple stakeholders and factors simultaneously.

18.
Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent.

19.
Confidence intervals often are recommended as a means of communicating the extent to which individual test scores may be influenced by measurement error. However, test manuals and assessment texts vary widely in their recommendations about how confidence intervals should be constructed, and several contain misinterpretations of classical test theory. The most widely used procedure for constructing confidence intervals misrepresents the likely distribution of true scores, and confidence intervals constructed with it will be inaccurate, especially when extreme scores are involved. The various procedures for constructing confidence intervals that have been suggested in measurement texts are examined in relation to their approximation to the most accurate procedure that uses the estimated true score as the center of the confidence interval and the standard error of estimate to determine the width. In addition, the problems of applying these procedures to norm-referenced scores are discussed, an issue that has been largely ignored in the assessment literature and that leads to further misinterpretations of confidence intervals.
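
The procedure the abstract calls most accurate can be sketched with the standard classical test theory formulas, which are assumed here rather than quoted from the article: the estimated true score is the observed score regressed toward the mean, T' = M + r(X - M), and the standard error of estimate is SD * sqrt(r(1 - r)), where r is the reliability coefficient. The function name, the default z value, and the example numbers are hypothetical.

```python
import math


def true_score_interval(observed, mean, sd, reliability, z=1.96):
    """Confidence interval centered on the estimated true score.

    observed: the examinee's obtained score X.
    mean, sd: mean and standard deviation of the score scale.
    reliability: reliability coefficient r, between 0 and 1.
    z: normal deviate for the desired confidence level (1.96 for 95%).
    """
    # Regress the observed score toward the mean.
    estimated_true = mean + reliability * (observed - mean)
    # Standard error of estimate sets the half-width.
    see = sd * math.sqrt(reliability * (1.0 - reliability))
    return estimated_true - z * see, estimated_true + z * see


# Hypothetical example: IQ-style scale (mean 100, SD 15), reliability .90.
lower, upper = true_score_interval(130, 100, 15, 0.90)
```

For an observed score of 130, the interval is centered on 127 rather than 130, so it is asymmetric around the obtained score. That shift grows with distance from the mean, which is why the abstract singles out extreme scores.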

20.
This article discusses the development, field test, and uses of the counseling-orientation scale (COS), a scale for assessing relative preferences for seven major counseling orientations. The procedures used to develop and validate the COS are presented. The COS field test and technical information such as reliability and normative data are then described. Finally, the field test results and uses of the COS are discussed.
