Similar Documents (20 results)
1.
Performance assessments typically require expert judges to individually rate each performance. This limits the use of such assessments because the rating process can be extremely time consuming. This article describes a scoring algorithm that is based on expert judgments but requires the rating of only a sample of performances. A regression-based policy capturing procedure was implemented to model the judgment policies of experts. The data set was a seven-case performance assessment of physicians' patient management skills. The assessment used a computer-based simulation of the patient care environment. The results showed a substantial improvement in correspondence between scores produced using the algorithm and actual ratings, when compared to raw scores. Scores based on the algorithm were also shown to be superior to raw scores and equal to expert ratings for making pass/fail decisions that agreed with those made by an independent committee of experts.
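To make the policy-capturing idea concrete, the sketch below (Python, with entirely hypothetical component counts, weights, and expert ratings; none of the numbers come from the study) regresses a sample of expert ratings on automatically extractable performance components and then applies the captured weights to score the remaining performances:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: each row is one examinee performance described by
    # component counts an engine can extract automatically (e.g., beneficial
    # actions ordered, risky actions ordered, omissions).
    rng = np.random.default_rng(0)
    features = rng.poisson(lam=[6, 2, 3], size=(200, 3)).astype(float)

    # Expert judges rate only a sample of performances (here the first 50).
    rated_idx = np.arange(50)
    expert_ratings = (
        1.5 * features[rated_idx, 0]
        - 2.0 * features[rated_idx, 1]
        - 0.5 * features[rated_idx, 2]
        + rng.normal(0, 1, size=50)
    )

    # Policy capturing: regress the expert ratings on the components to
    # recover the judges' implicit weighting policy.
    policy = LinearRegression().fit(features[rated_idx], expert_ratings)

    # Apply the captured policy to score every performance, rated or not.
    algorithm_scores = policy.predict(features)
    print(policy.coef_, policy.intercept_)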

2.
When performance assessments are delivered and scored by computer, the costs of scoring may be substantially lower than those of scoring the same assessment based on expert review of the individual performances. Computerized scoring algorithms also ensure that the scoring rules are implemented precisely and uniformly. Such computerized algorithms represent an effort to encode the scoring policies of experts. This raises the question: would a different group of experts have produced a meaningfully different algorithm? The research reported in this paper uses generalizability theory to assess the impact of using independent, randomly equivalent groups of experts to develop the scoring algorithms for a set of computer-simulation tasks designed to measure physicians’ patient management skills. The results suggest that the impact of this “expert group” effect may be significant but that it can be controlled with appropriate test development strategies. The appendix presents a multivariate generalizability analysis examining the stability of the assessed proficiency across scores representing the scoring policies of different groups of experts.
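As a minimal illustration of the kind of persons-by-expert-groups generalizability analysis described here, the sketch below simulates scores for a crossed persons x groups design (all variance magnitudes are invented) and estimates variance components from textbook expected mean squares:

    import numpy as np

    rng = np.random.default_rng(4)
    n_persons, n_groups = 200, 4

    # Hypothetical scores: each examinee is scored by algorithms built from
    # four independent groups of experts (persons crossed with groups).
    person = rng.normal(0, 1.0, (n_persons, 1))    # true proficiency
    group = rng.normal(0, 0.3, (1, n_groups))      # expert-group severity
    scores = person + group + rng.normal(0, 0.5, (n_persons, n_groups))

    grand = scores.mean()
    ms_p = n_groups * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_persons - 1)
    ms_g = n_persons * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_groups - 1)
    ss_res = ((scores - scores.mean(axis=1, keepdims=True)
               - scores.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_res = ss_res / ((n_persons - 1) * (n_groups - 1))

    # Variance components for the persons x expert-groups design.
    var_res = ms_res
    var_p = (ms_p - ms_res) / n_groups
    var_g = (ms_g - ms_res) / n_persons

    # Generalizability coefficient for a score based on a single expert group.
    g_coef = var_p / (var_p + var_res)
    print(round(var_p, 3), round(var_g, 3), round(g_coef, 3))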

3.
Performance assessments are typically scored by having experts rate individual performances. The cost associated with using expert raters may represent a serious limitation in many large-scale testing programs. The use of raters may also introduce an additional source of error into the assessment. These limitations have motivated development of automated scoring systems for performance assessments. Preliminary research has shown these systems to have application across a variety of tasks ranging from simple mathematics to architectural problem solving. This study extends research on automated scoring by comparing alternative automated systems for scoring a computer simulation test of physicians' patient management skills; one system uses regression-derived weights for components of the performance, the other uses complex rules to map performances into score levels. The procedures are evaluated by comparing the resulting scores to expert ratings of the same performances.

4.
Formula scoring is a procedure designed to reduce multiple-choice test score irregularities due to guessing. Typically, a formula score is obtained by subtracting a proportion of the number of wrong responses from the number correct. Examinees are instructed to omit items when their answers would be sheer guesses among all choices but otherwise to guess when unsure of an answer. Thus, formula scoring is not intended to discourage guessing when an examinee can rule out one or more of the options within a multiple-choice item. Examinees who, contrary to the instructions, do guess blindly among all choices are not penalized by formula scoring on the average; depending on luck, they may obtain better or worse scores than if they had refrained from this guessing. In contrast, examinees with partial information who refrain from answering tend to obtain lower formula scores than if they had guessed among the remaining choices. (Examinees with misinformation may be exceptions.) Formula scoring is viewed as inappropriate for most classroom testing but may be desirable for speeded tests and for difficult tests with low passing scores. Formula scores do not approximate scores from comparable fill-in-the-blank tests, nor can formula scoring preclude unrealistically high scores for examinees who are very lucky.
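As a worked illustration (the test length and response counts below are invented), the usual correction subtracts a fraction of the wrong responses, W/(k - 1) for k-option items, from the number right:

    def formula_score(num_right, num_wrong, num_options):
        """Classic correction for guessing: rights minus a fraction of wrongs.
        Omitted items are simply not counted."""
        return num_right - num_wrong / (num_options - 1)

    # Example: on a 100-item, five-option test an examinee answers 70 items
    # correctly, 20 incorrectly, and omits 10.
    print(formula_score(70, 20, 5))   # 70 - 20/4 = 65.0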

5.
The use of constructed-response items in large scale standardized testing has been hampered by the costs and difficulties associated with obtaining reliable scores. The advent of expert systems may signal the eventual removal of this impediment. This study investigated the accuracy with which expert systems could score a new, non-multiple-choice item type. The item type presents a faulty solution to a computer programming problem and asks the student to correct the solution. This item type was administered to a sample of high school seniors enrolled in an Advanced Placement course in Computer Science who also took the Advanced Placement Computer Science (APCS) examination. Results indicated that the expert systems were able to produce scores for between 82% and 95% of the solutions encountered and to display high agreement with a human reader on the correctness of the solutions. Diagnoses of the specific errors produced by students were less accurate. Correlations with scores on the objective and free-response sections of the APCS examination were moderate. Implications for additional research and for testing practice are offered.

6.
The Classroom Assessment Scoring System (CLASS; Pianta et al., 2008) is a popular measure of teacher–child interactions. Despite its prominence, CLASS scores have fairly weak relations with various child outcomes (e.g., Zaslow et al., 2010). One potential reason for these findings could be systematic differences in observer severity. As such, the purpose of this study was to explore the scope and impact of rater effects on CLASS scores with a sample of 77 teachers who were rated by 13 observers. Results indicated significant rater effects across all three CLASS domains. Adjusting for these effects, however, did not improve relations between CLASS scores and child outcomes. Implications for the CLASS and related assessments are discussed.

7.
'Mental models' used by automated scoring for the simulation divisions of the computerized Architect Registration Examination are contrasted with those used by experienced human graders. Candidate solutions (N = 3613) received both automated and human holistic scores. Quantitative analyses indicate a high correspondence between automated and human scores, suggesting that similar mental models are implemented. Solutions with discrepancies between automated and human scores were selected for qualitative analysis. The human graders were reconvened to review the human scores and to investigate the source of score discrepancies in light of rationales provided by the automated scoring process. After review, slightly more than half of the score discrepancies were reduced or eliminated. Six sources of discrepancy between original human scores and automated scores were identified: subjective criteria; objective criteria; tolerances/weighting; details; examinee task interpretation; and unjustified. The tendency of the human graders to be compelled by automated score rationales varied with the nature of the original score discrepancy. We determine that, while the automated scores are based on a mental model consistent with that of expert graders, there remain some important differences, both intentional and incidental, that distinguish human from automated scoring. We conclude that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency.

8.
9.
Scientific argumentation is one of the core practices for teachers to implement in science classrooms. We developed a computer-based formative assessment to support students’ construction and revision of scientific arguments. The assessment is built upon automated scoring of students’ arguments and provides feedback to students and teachers. Preliminary validity evidence was collected in this study to support the use of automated scoring in this formative assessment. The results showed satisfactory psychometric properties related to this formative assessment. The automated scores showed satisfactory agreement with human scores, but small discrepancies still existed. Automated scores and feedback encouraged students to revise their answers. Students’ scientific argumentation skills improved during the revision process. These findings provide preliminary evidence to support the use of automated scoring in the formative assessment to diagnose and enhance students’ argumentation skills in the context of climate change in secondary school science classrooms.

10.
As an alternative to rubric scoring, comparative judgment generates essay scores by aggregating decisions about the relative quality of the essays. Comparative judgment eliminates certain scorer biases and potentially reduces training requirements, thereby allowing a large number of judges, including teachers, to participate in essay evaluation. The purpose of this study was to assess the validity, labor costs, and efficiency of comparative judgments as a potential substitute for rubric scoring. An analysis of two essay prompts revealed that comparative judgment measures were comparable to rubric scores at a level similar to that expected of two professional scorers. The comparative judgment measures correlated slightly higher than rubric scores with a multiple-choice writing test. Score reliability exceeding .80 was achieved with approximately nine judgments per response. The average judgment time was 94 seconds, which compared favorably to 119 seconds per rubric score. Practical challenges to future implementation are discussed.
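Comparative judgment measures are usually obtained by fitting a paired-comparison model such as Bradley–Terry to the judges' decisions; the article does not specify its scaling model, so the sketch below is only a generic illustration on invented win counts:

    import numpy as np

    # Hypothetical pairwise decisions: wins[i, j] = number of times
    # essay i was judged better than essay j.
    wins = np.array([
        [0, 3, 4, 5],
        [1, 0, 3, 4],
        [0, 1, 0, 3],
        [0, 0, 1, 0],
    ], dtype=float)

    n_items = wins.shape[0]
    comparisons = wins + wins.T       # total comparisons per pair
    strength = np.ones(n_items)       # Bradley-Terry strength parameters

    # Minorization-maximization updates for the Bradley-Terry model.
    for _ in range(200):
        total_wins = wins.sum(axis=1)
        denom = (comparisons / (strength[:, None] + strength[None, :])).sum(axis=1)
        strength = total_wins / denom
        strength /= strength.sum()    # fix the scale of the parameters

    # Log-strengths serve as interval-like essay measures.
    print(np.log(strength))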

11.
A practical concern for many existing tests is that subscore test lengths are too short to provide reliable and meaningful measurement. A possible method of improving the subscale reliability and validity would be to make use of collateral information provided by items from other subscales of the same test. To this end, the purpose of this article is to compare two different formulations of an alternative Item Response Theory (IRT) model developed to parameterize unidimensional projections of multidimensional test items: analytical and empirical formulations. Two real data applications are provided to illustrate how the projection IRT model can be used in practice, as well as to further examine how ability estimates from the projection IRT model compare to external examinee measures. The results suggest that collateral information extracted by a projection IRT model can be used to improve reliability and validity of subscale scores, which in turn can be used to provide diagnostic information about the strengths and weaknesses of examinees, helping stakeholders to link instruction or curriculum to assessment results.

12.
The attribute hierarchy method (AHM) is a psychometric procedure for classifying examinees' test item responses into a set of structured attribute patterns associated with different components from a cognitive model of task performance. Results from an AHM analysis yield information on examinees' cognitive strengths and weaknesses. Hence, the AHM can be used for cognitive diagnostic assessment. The purpose of this study is to introduce and evaluate a new concept for assessing attribute reliability using the ratio of true score variance to observed score variance on items that probe specific cognitive attributes. This reliability procedure is evaluated and illustrated using both simulated data and student response data from a sample of algebra items taken from the March 2005 administration of the SAT. The reliability of diagnostic scores and the implications for practice are also discussed.
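The proposed attribute reliability is the ratio of true-score variance to observed-score variance on the items probing a given attribute. The simulation below (hypothetical item and examinee parameters, not the SAT data) illustrates that ratio in a setting where the true scores are known by construction:

    import numpy as np

    rng = np.random.default_rng(1)
    n_examinees, n_items = 500, 6      # items probing one hypothetical attribute

    # Simulate attribute propensities and binary item responses.
    theta = rng.normal(0, 1, n_examinees)
    difficulty = rng.normal(0, 1, n_items)
    prob_correct = 1 / (1 + np.exp(-(theta[:, None] - difficulty)))
    responses = rng.binomial(1, prob_correct)

    observed = responses.sum(axis=1)        # observed attribute score
    true_score = prob_correct.sum(axis=1)   # expected (true) attribute score

    # Attribute reliability as true-score variance over observed-score variance.
    attribute_reliability = true_score.var() / observed.var()
    print(round(attribute_reliability, 3))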

13.
This paper examines intermediate results from LAN-based (networked) scoring and finds that the size of the discrepancy threshold affects how often the first and second ratings differ statistically: in general, the smaller the threshold, the more often the first and second ratings show no statistically significant difference. However, the threshold is not the most important determinant of scoring consistency; what matters is the distribution of the differences between the first and second ratings. With a high threshold, the first and second ratings may still show no statistical difference, and with a low threshold, significant differences can still appear. When every single point of an examination score matters, the threshold should be set at 1 point. Within the range allowed by the threshold, a non-significant paired-samples t-test does not necessarily mean that scoring consistency is good, and a significant paired-samples t-test does not necessarily mean that it is poor. The paired-samples t-test is therefore not a valid and reliable method for evaluating scoring consistency, and other methods of evaluating scoring consistency are needed.
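The point that a non-significant paired-samples t-test does not guarantee rating consistency can be shown with a small constructed example (invented scores): large but perfectly symmetric disagreements between the first and second ratings leave the mean difference at zero.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    first = rng.integers(10, 21, size=200).astype(float)   # first rater's scores

    # Large but perfectly balanced disagreements: the mean difference is zero,
    # so the paired-samples t-test is non-significant even though no pair of
    # ratings falls within a 1-point threshold.
    second = first + np.repeat([-3.0, 3.0], 100)

    t_stat, p_value = stats.ttest_rel(first, second)
    within_threshold = np.mean(np.abs(first - second) <= 1)

    print(round(p_value, 3), within_threshold)   # p = 1.0, agreement rate = 0.0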

14.
This article evaluates a procedure-based scoring system for a performance assessment (an observed paper-towels investigation) and a notebook surrogate completed by fifth-grade students varying in hands-on science experience. Results suggested that interrater reliability of scores was adequate (>.80) for both observed performance and notebooks, with higher reliability for the former. In contrast, interrater agreement on procedures was higher for observed hands-on performance (.92) than for notebooks (.66). Moreover, for the notebooks, the reliability of scores and agreement on procedures varied by student experience, but this was not so for observed performance. Both the observed-performance and notebook measures correlated less with traditional ability than did a multiple-choice science achievement test. The correlation between the two performance assessments and the multiple-choice test was only moderate (mean = .46), suggesting that different aspects of science achievement were measured. Finally, the correlation between the observed-performance scores and the notebook scores was .83, suggesting that notebooks may provide a reasonable, albeit less reliable, surrogate for students' observed hands-on performance.

15.
Educational Assessment, 2013, 18(4): 317–340
A number of methods for scoring tests with selected-response (SR) and constructed-response (CR) items are available. The selection of a method depends on the requirements of the program, the particular psychometric model and assumptions employed in the analysis of item and score data, and how scores are to be used. This article compares 3 methods: unweighted raw scores, Item Response Theory pattern scores, and weighted raw scores. Student score data from large-scale end-of-course high school tests in Biology and English were used in the comparisons. In the weighted raw score method evaluated in this study, the CR items were weighted so that SR and CR items contributed the same number of points toward the total score. The scoring methods were compared for the total group and for subgroups of students in terms of the resultant scaled score distributions, standard errors of measurement, and proficiency-level classifications. For most of the student ability distribution, the three scoring methods yielded similar results. Some differences in results are noted. Issues to be considered when selecting a scoring method are discussed.
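A small sketch of the weighted raw-score method evaluated here, on an invented form (40 selected-response items scored 0/1 and four constructed-response items scored 0-4): the constructed-response section is weighted so that each section contributes the same maximum number of points.

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical responses: 40 selected-response (SR) items scored 0/1
    # and 4 constructed-response (CR) items scored 0-4.
    sr = rng.binomial(1, 0.7, size=(1000, 40))
    cr = rng.integers(0, 5, size=(1000, 4))

    # Unweighted raw score: SR contributes up to 40 points, CR only up to 16.
    unweighted = sr.sum(axis=1) + cr.sum(axis=1)

    # Weighted raw score: scale the CR section so both sections can
    # contribute the same maximum number of points (40 each).
    cr_weight = sr.shape[1] / (cr.shape[1] * 4)   # 40 / 16 = 2.5
    weighted = sr.sum(axis=1) + cr_weight * cr.sum(axis=1)

    print(unweighted[:3], weighted[:3])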

16.
In this digital ITEMS module, Dr. Sue Lottridge, Amy Burkhardt, and Dr. Michelle Boyer provide an overview of automated scoring. Automated scoring is the use of computer algorithms to score unconstrained open-ended test items by mimicking human scoring. The use of automated scoring is increasing in educational assessment programs because it allows scores to be returned faster at lower cost. In the module, they discuss automated scoring from a number of perspectives. First, they discuss benefits and weaknesses of automated scoring, and what psychometricians should know about automated scoring. Next, they describe the overall process of automated scoring, moving from data collection to engine training to operational scoring. Then, they describe how automated scoring systems work, including the basic functions around score prediction as well as other flagging methods. Finally, they conclude with a discussion of the specific validity demands around automated scoring and how they align with the larger validity demands around test scores. Two data activities are provided. The first is an interactive activity that allows the user to train and evaluate a simple automated scoring engine. The second is a worked example that examines the impact of rater error on test scores. The digital module contains a link to an interactive web application as well as its R-Shiny code, diagnostic quiz questions, activities, curated resources, and a glossary.
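In the spirit of the module's interactive activity, the sketch below trains a toy automated scoring engine (TF-IDF features plus ridge regression, a deliberately simple stand-in for operational engines) on a handful of invented responses and checks its agreement with the human scores:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical short-answer responses with human scores on a 0-2 scale;
    # a real engine would be trained on thousands of scored responses.
    responses = [
        "plants use sunlight to make food",
        "the plant eats dirt",
        "photosynthesis turns light energy into chemical energy",
        "i do not know",
        "light energy is converted into sugar by the plant",
        "plants grow in soil",
    ]
    human_scores = np.array([2, 0, 2, 0, 2, 1])

    # Engine training: represent responses as TF-IDF features and
    # regress the human scores on them.
    features = TfidfVectorizer().fit_transform(responses)
    engine = Ridge(alpha=0.1).fit(features, human_scores)

    # Operational scoring: predict, then round into the reporting scale.
    predicted = np.clip(np.round(engine.predict(features)), 0, 2).astype(int)

    # Agreement with human ratings (here, on the training data itself).
    print(cohen_kappa_score(human_scores, predicted, weights="quadratic"))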

17.
To address the data sparsity and poor scalability of traditional collaborative filtering recommendation algorithms, a collaborative filtering algorithm with an improved prediction rating matrix is adopted. First, a weighted Slope One algorithm based on linear regression analysis is used, adding a confidence weight to the traditional Slope One algorithm to enlarge the base of common ratings; then the standard online MovieLens dataset is used as test data and top-N recommendation is performed in combination with collaborative filtering. Experimental results show that the collaborative filtering algorithm with the improved prediction rating matrix yields a smaller MAE, reaching 0.74 when the number of neighbors exceeds 25, indicating that the algorithm alleviates the data sparsity and scalability problems of traditional collaborative filtering, reduces recommendation error, and improves the accuracy of the recommender system.
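For reference, here is a minimal sketch of the standard weighted Slope One predictor on a toy rating matrix; the confidence term that the article derives from linear regression analysis is not reproduced here.

    import numpy as np

    # Tiny hypothetical user-item rating matrix (0 = unrated).
    ratings = np.array([
        [5, 3, 0, 4],
        [4, 0, 4, 3],
        [0, 2, 3, 0],
        [5, 4, 4, 0],
    ], dtype=float)
    rated = ratings > 0

    def weighted_slope_one(user, target_item):
        """Predict a missing rating with the classic weighted Slope One scheme:
        average rating deviations between item pairs, weighted by co-rating counts."""
        numer = denom = 0.0
        for other in range(ratings.shape[1]):
            if other == target_item or not rated[user, other]:
                continue
            both = rated[:, target_item] & rated[:, other]
            support = both.sum()          # number of users who rated both items
            if support == 0:
                continue
            dev = (ratings[both, target_item] - ratings[both, other]).mean()
            numer += (dev + ratings[user, other]) * support
            denom += support
        return numer / denom if denom else None

    print(round(weighted_slope_one(2, 0), 2))   # predict user 2's rating of item 0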

18.
《教育实用测度》2013,26(4):413-432
With the increasing use of automated scoring systems in high-stakes testing, it has become essential that test developers assess the validity of the inferences based on scores produced by these systems. In this article, we attempt to place the issues associated with computer-automated scoring within the context of current validity theory. Although it is assumed that the criteria appropriate for evaluating the validity of score interpretations are the same for tests using automated scoring procedures as for other assessments, different aspects of the validity argument may require emphasis as a function of the scoring procedure. We begin the article with a taxonomy of automated scoring procedures. The presentation of this taxonomy provides a framework for discussing threats to validity that may take on increased importance for specific approaches to automated scoring. We then present a general discussion of the process by which test-based inferences are validated, followed by a discussion of the special issues that must be considered when scoring is done by computer.

19.
Intelligence can be divided into academic intelligence and practical intelligence. Academic intelligence is related to academic problem solving and is expressed mainly through non-social cognitive operations, whereas practical intelligence is linked to everyday or occupational problem solving and is expressed more through social cognitive operations. Indicators for assessing intelligence fall into two categories: the total score on cognitive operation tests and cognitive style. The total score on cognitive operation tests is a maximal-performance indicator of intelligence, reflecting an individual's absolute level when processing different cognitive materials with different cognitive operations. Cognitive style is a preference-oriented indicator of intelligence, reflecting an individual's relative strengths when processing different cognitive materials with different cognitive operations.

20.
Addressing the problem of nonlinear time-domain filtering of communication signals, this paper studies the principle and performance of the quantum stochastic filter. A neural network is combined with the nonlinear Schroedinger equation, and the solution of the equation is treated as the time-varying probability density function of the signal, thereby realizing the filtering function. The study finds that adjusting the weight coefficient of the potential field gives the filter markedly different characteristics. Based on this property, a new filtering algorithm is constructed that allows the filter to trade off the degree of nonlinear distortion in the estimated signal waveform against its noise immunity, which greatly extends the applications of the quantum stochastic filter, for example to communication signal processing. Simulation results demonstrate the superior performance of the quantum stochastic filter.
