Similar literature
 20 similar documents found
1.
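Scientific argumentation competence is an important construct in international educational assessment programs. With breakthroughs in artificial intelligence in fields such as natural language processing, automated computer scoring of scientific argumentation has been achieved abroad, and attempts have been made to combine it with real-time feedback systems to improve classroom teaching. This study focuses on recent international progress in the automated scoring of scientific argumentation. Taking the relatively mature automated assessment of text-based scientific argumentation as its object, it uses typical cases to analyze the argumentation framework used in automated scoring, the path by which automated scoring is implemented, and the methods used to ensure the accuracy of the assessment tools. The study finds that: 1) automated assessment generally uses Toulmin's argumentation model as the assessment framework and adopts an item structure that combines constructed-response and selected-response items; 2) the scientific argumentation analysis framework is easy to operate, broadly generalizable, and widely applied; 3) the machine-learning-based automated scoring tool c-rater-ML requires relatively little human labor and achieves relatively high scoring accuracy; 4) quadratic weighted kappa and Pearson correlation are generally used to check the accuracy of automated scoring. Although automated scoring has made satisfactory progress, considerable room for improvement remains.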
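The two agreement statistics named in finding 4 can be illustrated with a short sketch. The code below is not taken from the study; it is a minimal standard-library implementation of quadratic weighted kappa and Pearson correlation, and the function names and the sample scores are hypothetical.

```python
from collections import Counter
from math import sqrt


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Agreement between two integer score lists, penalizing large gaps more."""
    n = len(human)
    if n == 0 or n != len(machine):
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("at least two score categories are required")
    observed = Counter(zip(human, machine))  # joint counts of (human, machine)
    hist_h = Counter(human)                  # marginal counts, human rater
    hist_m = Counter(machine)                # marginal counts, machine rater
    disagreement = 0.0  # weighted observed disagreement
    expected = 0.0      # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            disagreement += weight * observed[(i, j)] / n
            expected += weight * hist_h[i] * hist_m[j] / (n * n)
    if expected == 0.0:
        return 1.0  # both raters gave the same single score throughout
    return 1.0 - disagreement / expected


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores on a 0-4 rubric, for illustration only.
human_scores = [0, 1, 2, 3, 4, 2, 3, 1]
machine_scores = [0, 1, 2, 3, 3, 2, 4, 1]
print(quadratic_weighted_kappa(human_scores, machine_scores, 0, 4))
print(pearson_r(human_scores, machine_scores))
```

Kappa corrects for chance agreement, whereas Pearson correlation only measures linear association, which is one plausible reason the two are reported together.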

2.
In educational practice, test results are used for several purposes. However, validity research is especially focused on the validity of summative assessment. This article aimed to provide a general framework for validating formative assessment. The authors applied the argument-based approach to validation to the context of formative assessment. This resulted in a proposed interpretation and use argument consisting of a score interpretation and a score use. The former involves inferences linking specific task performance to an interpretation of a student's general performance. The latter involves inferences regarding decisions about actions and educational consequences. The validity argument should focus on critical claims regarding score interpretation and score use, since both are critical to the effectiveness of formative assessment. The proposed framework is illustrated by an operational example including a presentation of evidence that can be collected on the basis of the framework.

3.
Journal of Science Education and Technology - We present lessons learned from an ongoing attempt to conceptualize, develop, and refine a way for teachers to gather formative assessment evidence...

4.
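An Inquiry into Formative Assessment Based on the All-Round Development of University Students   Total citations: 1, self-citations: 0, citations by others: 1
Student evaluation is the most basic area of educational evaluation and the aspect of greatest concern to educators. As higher education in China moves rapidly into the stage of mass education, this paper examines the characteristics and trends of student evaluation in modern universities, draws on advanced reform experience, and discusses how to carry out effective theoretical research and practice in formative assessment in China, with the aim of offering some reference for promoting the all-round development of university students and the continued improvement of university evaluation systems.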

5.
Research Findings: This study examined the validity and reliability of the Classroom Assessment Scoring System (CLASS; R. C. Pianta, K. M. La Paro, & B. K. Hamre, 2008) in Finnish kindergartens. A pair of trained observers used the CLASS to observe 49 kindergarten teachers (47 female, 2 male) on two different days. Questionnaires measuring teachers' efficacy beliefs, exhaustion at work, and classroom interactional style (i.e., affection, behavioral control, and psychological control) were completed by the teachers. Confirmatory factor analysis indicated that when the item measuring Negative Climate was excluded, the 3-factor solution assuming three positively correlated latent factors (i.e., Emotional Support, Classroom Organization, and Instructional Support) described classroom quality well. The CLASS also showed high item and scale reliabilities. Evidence for concurrent validity was indicated by the positive association between observed classroom emotional support and teacher-rated affection and self-efficacy. Teacher-rated affection was also associated with observed classroom organization. Practice or Policy: The findings provide support for the CLASS as a valid and reliable measure of classroom quality in kindergartens and in cultural contexts outside the United States.

6.
The potential of computer-based assessments for capturing complex learning outcomes has been discussed; however, relatively little is understood about how to leverage such potential for summative and accountability purposes. The aim of this study is to develop and validate a multimedia-based assessment of scientific inquiry abilities (MASIA) to cover a more comprehensive construct of inquiry abilities and target secondary school students in different grades while this potential is leveraged. We implemented five steps derived from the construct modeling approach to design MASIA. During the implementation, multiple sources of evidence were collected in the steps of pilot testing and Rasch modeling to support the validity of MASIA. Particularly, through the participation of 1,066 8th and 11th graders, MASIA showed satisfactory psychometric properties to discriminate students with different levels of inquiry abilities in 101 items in 29 tasks when Rasch models were applied. Additionally, the Wright map indicated that MASIA offered accurate information about students’ inquiry abilities because of the comparability of the distributions of student abilities and item difficulties. The analysis results also suggested that MASIA offered precise measures of inquiry abilities when the components (questioning, experimenting, analyzing, and explaining) were regarded as a coherent construct. Finally, the increased mean difficulty thresholds of item responses along with three performance levels across all sub-abilities supported the alignment between our scoring rubrics and our inquiry framework. Together with other sources of validity in the pilot testing, the results offered evidence to support the validity of MASIA.
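The Rasch analysis described in this abstract rests on a logistic relation between a student's ability and an item's difficulty, both placed on the same logit scale (which is what a Wright map displays). As a minimal sketch, assuming the dichotomous Rasch model (the abstract's reference to difficulty thresholds suggests the study used a polytomous variant, which is not shown here), the relation can be written as follows. The function name and all numeric values are hypothetical.

```python
from math import exp


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


# Hypothetical abilities (in logits) answering one item of difficulty 0.0.
item_difficulty = 0.0
for label, theta in {"low": -1.0, "medium": 0.0, "high": 1.5}.items():
    print(label, round(rasch_probability(theta, item_difficulty), 3))
```

When ability equals difficulty the probability is exactly 0.5, which is why an instrument measures most precisely when the item difficulties span the same range as the student abilities.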

7.

Argumentation has been emphasized in recent US science education reform efforts (NGSS Lead States 2013; NRC 2012), and while existing studies have investigated approaches to introducing and supporting argumentation (e.g., McNeill and Krajcik in Journal of Research in Science Teaching, 45(1), 53–78, 2008; Kang et al. in Science Education, 98(4), 674–704, 2014), few studies have investigated how game-based approaches may be used to introduce argumentation to students. In this paper, we report findings from a design-based study of a teacher’s use of a computer game intended to introduce the claim, evidence, reasoning (CER) framework (McNeill and Krajcik 2012) for scientific argumentation. We studied the implementation of the game over two iterations of development in a high school biology teacher’s classes. The results of this study include aspects of enactment of the activities and student argument scores. We found the teacher used the game in aspects of explicit instruction of argumentation during both iterations, although the ways in which the game was used differed. Also, students’ scores in the second iteration were significantly higher than the first iteration. These findings support the notion that students can learn argumentation through a game, especially when used in conjunction with explicit instruction and support in student materials. These findings also highlight the importance of analyzing classroom implementation in studies of game-based learning.


8.
This study investigated the impact of anonymizing text on predicted scores made by two kinds of automated scoring engines: one that incorporates elements of natural language processing (NLP) and one that does not. Eight data sets (N = 22,029) were used to form both training and test sets in which the scoring engines had access to both text and human rater scores for training, but only the text for the test set. Machine ratings were applied under three conditions: (a) both the training and test were conducted with the original data, (b) the training was modeled on the anonymized data, but the predictions were made on the original data, and (c) both the training and test were conducted on the anonymized text. The first condition served as the baseline for subsequent comparisons on the mean, standard deviation, and quadratic weighted kappa. With one exception, results on scoring scales in the range of 1–6 were not significantly different. The results on scales that were much wider did show significant differences. The conclusion was that anonymizing text for operational use may have a differential impact on machine score predictions for both NLP and non-NLP applications.

9.
Applied Measurement in Education, 2013, 26(4): 345-358
Performance assessments typically are scored by having experts rate individual performances. In contexts such as medical licensure, where the examinee population is large and the pool of expert raters is limited practically, this approach may be unworkable. This article describes an automated scoring algorithm for a computer simulation-based examination of physicians' patient-management skills. The algorithm is based on the policy used by clinicians in rating case performances. The results show that scores produced using this algorithm are highly correlated with actual clinician ratings. These scores also are shown to be effective in discriminating between case performance judged to be passing or failing by an independent group of clinicians.

10.
This study explored the use of machine learning to automatically evaluate the accuracy of students’ written explanations of evolutionary change. Performance of the Summarization Integrated Development Environment (SIDE) program was compared to human expert scoring using a corpus of 2,260 evolutionary explanations written by 565 undergraduate students in response to two different evolution instruments (the EGALT-F and EGALT-P) that contained prompts that differed in various surface features (such as species and traits). We tested human-SIDE scoring correspondence under a series of different training and testing conditions, using Kappa inter-rater agreement values of greater than 0.80 as a performance benchmark. In addition, we examined the effects of response length on scoring success; that is, whether SIDE scoring models functioned with comparable success on short and long responses. We found that SIDE performance was most effective when scoring models were built and tested at the individual item level and that performance degraded when suites of items or entire instruments were used to build and test scoring models. Overall, SIDE was found to be a powerful and cost-effective tool for assessing student knowledge and performance in a complex science domain.

11.
Formative Assessment: Assessment Is for Self-regulated Learning   Total citations: 1, self-citations: 0, citations by others: 1
The article draws from 199 sources on assessment, learning, and motivation to present a detailed decomposition of the values, theories, and goals of formative assessment. This article will discuss the extent to which formative feedback actualizes and reinforces self-regulated learning (SRL) strategies among students. Theoreticians agree that SRL is predictive of improved academic outcomes and motivation because students acquire the adaptive and autonomous learning characteristics required for an enhanced engagement with the learning process and subsequent successful performance. The theory of formative assessment is found to be a unifying theory of instruction, which guides practice and improves the learning process by developing SRL strategies among learners. In a postmodern era characterized by rapid technical and scientific advance and obsolescence, there is a growing emphasis on the acquisition of learning strategies which people may rely on across the entire span of their life. Research consistently finds that the self-regulation of cognitive and affective states supports the drive for lifelong learning by: enhancing the motivational disposition to learn, enriching reasoning, refining meta-cognitive skills, and improving performance outcomes. The specific purposes of the article are to provide practitioners, administrators and policy-makers with: (a) an account of the very extensive conceptual territory that is the ‘theory of formative assessment’ and (b) how the goals of formative feedback operate to reveal recondite learning processes, thereby reinforcing SRL strategies which support learning, improve outcomes and actualize the drive for lifelong learning.

12.
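The course assessment system is an important means of evaluating students' knowledge, abilities, behavior, and attitudes. Starting from a clearly stated design rationale, this paper discusses the specific content and forms of a formative assessment system for courses in the mechatronics major, as well as how that system is constructed, laying a foundation for implementing formative assessment in mechatronics courses.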

13.
A framework for evaluation and use of automated scoring of constructed-response tasks is provided that entails both evaluation of automated scoring as well as guidelines for implementation and maintenance in the context of constantly evolving technologies. Consideration of validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring for tests used for high-stakes purposes. These guidelines are intended to be generalizable to new automated scoring systems and as existing systems change over time.

14.
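A Study of Strategies for Formative Assessment   Total citations: 6, self-citations: 0, citations by others: 6
This paper analyzes the current state of formative assessment and its constructive function in distance open education, and offers a preliminary discussion of strategic actions for formative assessment.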

15.
Automated computerized scoring systems (ACSSs) are being increasingly used to analyze text in many educational settings. Nevertheless, the impact of misspelled words (MSW) on scoring accuracy remains to be investigated in many domains, particularly jargon-rich disciplines such as the life sciences. Empirical studies confirm that MSW are a pervasive feature of human-generated text and that despite improvements, spell-check and auto-replace programs continue to be characterized by significant errors. Our study explored four research questions relating to MSW and text-based computer assessments: (1) Do English language learners (ELLs) produce equivalent magnitudes and types of spelling errors as non-ELLs? (2) To what degree do MSW impact concept-specific computer scoring rules? (3) What impact do MSW have on computer scoring accuracy? and (4) Are MSW more likely to impact false-positive or false-negative feedback to students? We found that although ELLs produced twice as many MSW as non-ELLs, MSW were relatively uncommon in our corpora. The MSW in the corpora were found to be important features of the computer scoring models. Although MSW did not significantly or meaningfully impact computer scoring efficacy across nine different computer scoring models, MSW had a greater impact on the scoring algorithms for naïve ideas than key concepts. Linguistic and concept redundancy in student responses explains the weak connection between MSW and scoring accuracy. Lastly, we found that MSW tend to have a greater impact on false-positive feedback. We discuss the implications of these findings for the development of next-generation science assessments.

16.
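Guided by the output-driven, input-enabled hypothesis, the process-based evaluation approach for College English follows the characteristics of language courses. Building on language input inside and outside the classroom, learners are required to complete targeted language-output tasks according to given situations; their language output is evaluated immediately, and the evaluation results count toward the overall course grade. This input-output-evaluation teaching sequence prompts students to learn actively and more efficiently, maximizes classroom effectiveness, puts what is learned in College English to practical use, and better meets society's demands for talent development.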

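17.
The reform of formative assessment in the course Macro- and Microeconomics uses networked formative assessment as its platform and emphasizes interaction among teacher guidance, group study, and online discussion. It has achieved clear results in effectively monitoring the learning process and in improving students' ability to apply knowledge and analyze problems. Networked formative assessment places new demands on teaching philosophy, on the scientific soundness and reasonableness of assessment design, on teachers' tutoring skills, and on learners' ways of learning.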

18.
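Formative assessment is an important link in inspecting and monitoring the teaching process and teaching quality in distance open education. Applying action research, an emerging research paradigm, this paper takes teaching practice in Corporate Finance and Macro- and Microeconomics as its starting point, the meaning of formative assessment as its object of study, and the accurate evaluation of formative assessment scores as its aim. Through a cycle of "research, practice, reflection" it constructs a formative assessment model. It then applies the two-sample Kolmogorov-Smirnov test, a nonparametric hypothesis test, to the empirical distribution functions of formative assessment scores and final examination scores, showing that the model can accurately evaluate the ability and academic level of the students in a class.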
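The two-sample Kolmogorov-Smirnov test mentioned in this abstract compares the empirical distribution functions of two samples by their largest vertical gap. The sketch below is not the authors' procedure; it is a minimal standard-library illustration, with hypothetical function names and score lists, and it uses the large-sample approximation for the critical value rather than exact small-sample tables.

```python
from bisect import bisect_right
from math import log, sqrt


def ks_two_sample_statistic(sample_a, sample_b):
    """Largest gap between the two empirical distribution functions."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the gap can only change at observed values
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Asymptotic critical value; an approximation for small samples."""
    c_alpha = sqrt(-log(alpha / 2.0) / 2.0)  # about 1.358 for alpha = 0.05
    return c_alpha * sqrt((n + m) / (n * m))


# Hypothetical scores for one class, for illustration only.
formative_scores = [62, 70, 74, 78, 81, 85, 88, 90, 93, 95]
final_exam_scores = [58, 66, 72, 75, 80, 83, 86, 91, 92, 96]

d = ks_two_sample_statistic(formative_scores, final_exam_scores)
threshold = ks_critical_value(len(formative_scores), len(final_exam_scores))
print(d, threshold, "reject H0" if d > threshold else "fail to reject H0")
```

Failing to reject the null hypothesis means the data give no evidence that the two score distributions differ, which is the sense in which such a test can support the claim that formative scores track final examination performance.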

19.
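Principles and Review of Several Automated English Essay Scoring Systems   Total citations: 8, self-citations: 0, citations by others: 8
This paper introduces the basic principles of several automated essay scoring systems that are currently most popular in large-scale testing and English teaching in the United States, and briefly reviews them. The systems covered include Project Essay Grader (PEG), Intelligent Essay Assessor (IEA), E-rater and Criterion, IntelliMetric and MY Access!, and the Bayesian Essay Test Scoring System (BETSY).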

20.
'Mental models' used by automated scoring for the simulation divisions of the computerized Architect Registration Examination are contrasted with those used by experienced human graders. Candidate solutions (N = 3613) received both automated and human holistic scores. Quantitative analyses suggest high correspondence between automated and human scores, thereby suggesting similar mental models are implemented. Solutions with discrepancies between automated and human scores were selected for qualitative analysis. The human graders were reconvened to review the human scores and to investigate the source of score discrepancies in light of rationales provided by the automated scoring process. After review, slightly more than half of the score discrepancies were reduced or eliminated. Six sources of discrepancy between original human scores and automated scores were identified: subjective criteria; objective criteria; tolerances/weighting; details; examinee task interpretation; and unjustified. The tendency of the human graders to be compelled by automated score rationales varied by the nature of original score discrepancy. We determine that, while the automated scores are based on a mental model consistent with that of expert graders, there remain some important differences, both intentional and incidental, which distinguish between human and automated scoring. We conclude that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency.
