Found 20 similar articles; search took 15 ms.
This article discusses a particular type of concordance table and the potential for test score misuse that may result from employing such a table. The concordance that is discussed is typically created between scores on different, nonequatable versions of a test that share the same or close to the same test title. These concordance tables often appear in the context of relating scores on computerized adaptive and paper‐and‐pencil versions of the same test. When such a table is presented in a complete point‐by‐point fashion, relating each reported score on the scale of the new version of the test to a reported score on the scale of the old version of the test, test score users will typically treat the table as if it represented an equating of scores between the two versions, and directly replace scores on the new version of the test by scores on the old version. This clearly represents a misuse of the test scores. Suggestions for avoiding this misuse of test scores from concordance tables are provided.

Educational tests are standardized so that all examinees are tested on the same material, under the same testing conditions, and with the same scoring protocols. This uniformity is designed to provide a level “playing field” for all examinees so that the test is “the same” for everyone. Thus, standardization is designed to promote fairness in testing. In practice, the material tested, the conditions under which a test is administered, and the scoring processes are often too rigid to provide the intended level playing field. For example, standardized testing conditions may interact with personal characteristics of examinees that affect test performance, but are not construct-relevant. Thus, more flexibility in standardization is needed to account for the diversity of experiences, talents, and handicaps of the incredibly heterogeneous populations of examinees we currently assess. Traditional standardization procedures grew out of experimental psychology and psychophysics laboratories where keeping all conditions constant was crucial. Today, accounting for and measuring what is not constant across examinees is crucial to valid construct interpretations. To meet this need I introduce the concept of understandardization, which refers to ensuring sufficient flexibility in standardized testing conditions to yield the most accurate measurement of proficiency for each examinee.

Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
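The abstract above names the mean b‐value method without spelling it out. As a minimal sketch, assuming the usual one‐parameter (Rasch) reading in which the two scales differ only by a constant, the equating constant is the difference between the mean anchor-item difficulties from the two calibrations. The function names and the difficulty values below are hypothetical illustrations, not data or procedures from the study.

```python
from statistics import mean


def mean_b_shift(anchor_b_base, anchor_b_new):
    """Return the constant that moves new-form values onto the base scale.

    Both lists hold difficulty (b) estimates for the SAME anchor items,
    in the same order, from the base-year and new-year calibrations.
    """
    if not anchor_b_base or len(anchor_b_base) != len(anchor_b_new):
        raise ValueError("anchor lists must be non-empty and the same length")
    return mean(anchor_b_base) - mean(anchor_b_new)


def to_base_scale(values_new, shift):
    """Apply the equating constant to new-form difficulties or abilities."""
    return [v + shift for v in values_new]


# Hypothetical five-item anchor set (the shortest length in the abstract).
base_year_b = [-1.0, 0.0, 1.0, 0.5, -0.5]
new_year_b = [-1.2, -0.2, 0.8, 0.3, -0.7]

shift = mean_b_shift(base_year_b, new_year_b)
rescaled = to_base_scale([0.0, 1.0], shift)
```

Because the constant is an average over only the anchor items, a single item whose relative difficulty drifts between years moves the constant more when the anchor set is short, which is in line with the instability the abstract reports.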

The purpose of this paper is to define and evaluate the categories of cognitive models underlying at least three types of educational tests. We argue that while all educational tests may be based—explicitly or implicitly—on a cognitive model, the categories of cognitive models underlying tests often range in their development and in the psychological evidence gathered to support their value. For researchers and practitioners, awareness of different cognitive models may facilitate the evaluation of educational measures for the purpose of generating diagnostic inferences, especially about examinees' thinking processes, including misconceptions, strengths, and/or abilities. We think a discussion of the types of cognitive models underlying educational measures is useful not only for taxonomic ends, but also for becoming increasingly aware of evidentiary claims in educational assessment and for promoting the explicit identification of cognitive models in test development. We begin our discussion by defining the term cognitive model in educational measurement. Next, we review and evaluate three categories of cognitive models that have been identified for educational testing purposes using examples from the literature. Finally, we highlight the practical implications of "blending" models for the purpose of improving educational measures.

This study analyzed questionnaire and interview data on teachers' practices and perceptions with respect to test preparation. Questionnaire respondents were asked to rate the ethicality of various test-preparation practices and indicate the extent to which they utilize these practices in their instruction. On the basis of questionnaire results, interviews were conducted with a smaller sample of teachers to determine their views on the appropriateness of particular test-preparation practices, and to determine the factors affecting teacher perceptions about a given activity. Contrary to previous empirical work, questionnaire results indicated that neither use of a given practice nor teacher perceptions of the ethicality of the practice vary across levels of student achievement. On the other hand, consistent with previous empirical work, both use and perceptions varied across grade-level configuration. Estimates of the prevalence of particular teacher practices and perceptions were obtained and compared with those from the literature. In addition, dimensions of teacher reasoning were explored, indicating that when considering the appropriateness of a given practice, teachers consider the following factors: score meaning, learning, the potential for raising student scores, professional ethics, equity, and external perceptions.

This article reveals perspectives based on experiences from twentieth-century Danish educational history by outlining contemporary, test-based accountability regime characteristics and their implications for education policy. The article introduces one such characteristic, followed by an empirical analysis of the origins and impacts of test-based accountability measures applying both top-down and bottom-up perspectives.

These historical perspectives offer the opportunity to gain a fuller understanding of this contemporary accountability concept and its potential, appeal and implications for continued use in contemporary educational settings. Accountability measures and practices serve as a way to govern schools; by analysing the history of accountability as the concept has been practised in the education sphere, the article will discuss both pros and cons of such a methodology, particularly as it relates to contemporary education governance.

Test developers and psychometricians have historically examined measurement bias and differential item functioning (DIF) across a single categorical variable (e.g., gender), independently of other variables (e.g., race, age, etc.). This is problematic when more complex forms of measurement bias may adversely affect test responses and, ultimately, bias test scores. Complex forms of measurement bias include conditional effects, interactions, and mediation of background information on test responses. I propose a multidimensional, person-specific perspective of measurement bias to explain how complex sources of bias can manifest in the assessment of human knowledge, skills, and abilities. I also describe a data-driven approach for identifying key sources of bias among many possibilities—namely, a machine learning method commonly known as regularization.
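The abstract above does not say which regularizer is used. As one hedged illustration, assuming an L1 (lasso-style) penalty, the core operation is soft-thresholding: each candidate bias effect is shrunk toward zero by the penalty, and effects smaller than the penalty are dropped. With an orthonormal design this is exactly the lasso solution; in general it is only one step of an iterative fit. The covariate names, effect estimates, and penalty below are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 if |value| <= penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def select_bias_sources(effects, penalty):
    """Keep only the candidate effects that survive L1 shrinkage."""
    shrunk = {name: soft_threshold(est, penalty) for name, est in effects.items()}
    return {name: est for name, est in shrunk.items() if est != 0.0}


# Hypothetical unpenalized DIF effect estimates for one item (logit scale),
# including an interaction term of the kind the abstract calls "complex".
candidate_effects = {
    "gender": 0.05,
    "age": -0.02,
    "gender_x_language": 0.40,
    "ses": -0.30,
}

flagged = select_bias_sources(candidate_effects, penalty=0.10)
```

In this toy case only the interaction and the socioeconomic term survive. This shows how a penalty can pick a few key sources of bias out of many candidates without testing each variable separately.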

The dominant narrative for assessment design seems to reflect a strong, albeit largely implicit, undercurrent of purpose purism, which idealizes the principle that assessment design should be driven by a single assessment purpose. With a particular focus on achievement assessments, the present article questions the tenability of purpose purism, explaining how critical decisions—concerning whether to assess, how to specify an assessment construct, and many other design characteristics—require the coordination of multiple perspectives on assessment purposes. It argues the case for purpose pluralism—which idealizes the principle that assessment design should be driven by a multiplicity of assessment purposes simultaneously—not as an occasional, unavoidable concession, but as an organizing principle. The point of explicitly distinguishing between perspectives is to help assessment designers to establish a full complement of design requirements, representing a full range of stakeholder voices, as well as to manage more effectively the trade‐offs and compromises that inevitably arise.

The use of alternative assessments has led many researchers to reexamine traditional views of test qualities, especially validity. Because alternative assessments generally aim at measuring complex constructs and employ rich assessment tasks, it becomes more difficult to demonstrate (a) the validity of the inferences we make and (b) that these inferences extrapolate to target domains beyond the assessment itself. An approach to addressing these issues from the perspective of language testing is described. It is then argued that in both language testing and educational assessment we must consider the roles of both language and content knowledge, and that our approach to the design and development of performance assessments must be both construct-based and task-based.

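Drawing on the practice of the Putonghua Proficiency Test as its factual basis, and taking into account relevant national laws, regulations, and policies, this paper examines, from the standpoint of the national language development strategy for the new century, the new features of Putonghua promotion work in the new era: scientific rigor and standardization, orderliness and effectiveness.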

Changes to the design and development of our educational assessments are resulting in the unprecedented demand for a large and continuous supply of content‐specific test items. One way to address this growing demand is with automatic item generation (AIG). AIG is the process of using item models to generate test items with the aid of computer technology. The purpose of this module is to describe and illustrate a template‐based method for generating test items. We outline a three‐step approach where test development specialists first create an item model. An item model is like a mould or rendering that highlights the features in an assessment task that must be manipulated to produce new items. Next, the content used for item generation is identified and structured. Finally, features in the item model are systematically manipulated with computer‐based algorithms to generate new items. Using this template‐based approach, hundreds or even thousands of new items can be generated with a single item model.
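The three steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the module's own software: the item model is a string template with placeholders (step 1), the content is a dictionary of allowed values per placeholder (step 2), and generation takes every combination of those values (step 3). The stem, the element values, and the names `ITEM_MODEL`, `ELEMENTS`, `speed_key`, and `generate_items` are all hypothetical.

```python
from itertools import product
from string import Template

# Step 1: an item model whose $placeholders mark the features to manipulate.
ITEM_MODEL = Template(
    "A train travels $distance km in $hours hours. "
    "What is its average speed in km/h?"
)

# Step 2: the structured content that may fill each placeholder.
ELEMENTS = {"distance": [120, 180, 240], "hours": [2, 3]}


def speed_key(values):
    """Compute the answer key for one combination of element values."""
    return values["distance"] / values["hours"]


def generate_items(model, elements, key_fn):
    """Step 3: systematically combine element values into new items."""
    names = list(elements)
    items = []
    for combo in product(*(elements[name] for name in names)):
        values = dict(zip(names, combo))
        items.append({"stem": model.substitute(values), "key": key_fn(values)})
    return items


items = generate_items(ITEM_MODEL, ELEMENTS, speed_key)
```

Three distances and two durations give six items from one model. The count is the product of the number of values per placeholder, which is how a single item model can yield hundreds or thousands of items. A real system would also need constraints to exclude implausible or duplicate combinations.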

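Starting from the distinction between fact and value, Hodgkinson (霍金森) holds that educational administration is a "philosophy in practice" concerned with value conflicts and their reconciliation. Values can in turn be divided into three levels and four different types, and values of different types, as well as values at the same level, are always in conflict with one another. The fundamental task of the educational leader is to rely on the "moral archetype" (道德原式), hold to definite principles, and reconcile and integrate the value conflicts within the organization, thereby constructing a particular "way of life" for the organization while exercising moral leadership over it.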

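Training educational measurement professionals is one of the key elements in deepening educational evaluation reform in the new era. Summarizing a discussion among five prominent U.S. professors of educational measurement on the training of measurement professionals, this article offers the following recommendations for such training in China: (1) universities should train educational measurement professionals jointly; (2) programs should emphasize both foundational knowledge and practical skills; (3) professional measurement organizations and journals should play an active role; and (4) educational testing agencies should strengthen communication with professional measurement organizations.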

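Research on the history of foreign education and the development of courses in that subject are closely linked. The introduction of such courses gave rise to research in the field, and the expansion of the courses' functions (studying, and guiding students to study, the laws of foreign educational development; understanding, planning, and forecasting the development of modern education; diagnosing and guiding reform in education and teaching; and exploring how to build education systems) furthered that research. In turn, the growth of research made it possible to expand and realize the functions of the courses. At present, the discipline of the history of foreign education, which faces the danger of being "marginalized", needs to take relevance to present realities as its foundation, adaptation to the Chinese context as its core, and scientific rigor as its goal; to strengthen its research workforce, institutions, platforms, and organizations; and to integrate research capacity and findings across disciplines, so as to move the discipline into a new stage.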

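Cloze tests are widely used in foreign language teaching and testing because they are objective and convenient to write, administer, score, and analyze. However, most cloze tests currently on the market have low validity, mainly because the items target low-level testing points. Following Li Xiaoju's theory of testing-point levels in cloze tests, this study designed a cloze test and piloted it with students at a university, focusing on the rate of correct responses and the reasons for lost points. From this empirical evidence, it derives a way to improve the validity of cloze testing points by raising their level. Instruction should emphasize students' ability on higher-level testing points so as to improve learners' overall English proficiency.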


The outcomes of educational assessments undoubtedly have real implications for students, teachers, schools and education in the widest sense. Assessment results are, for example, used to award qualifications that determine future educational or vocational pathways of students. The results obtained by students in assessments are also used to gauge individual teacher quality, to hold schools to account for the standards achieved by their students, and to compare international education systems. Given the current high-stakes nature of educational assessment, it is imperative that the measurement practices involved have stable philosophical foundations. However, this article casts doubt on the theoretical underpinnings of contemporary educational measurement models. Aspects of Wittgenstein’s later philosophy and Bohr’s philosophy of quantum theory are used to argue that a quantum theoretical rather than a Newtonian model is appropriate for educational measurement, and the associated implications for the concept of validity are elucidated. Whilst it is acknowledged that the transition to a quantum theoretical framework would not lead to the demise of educational assessment, it is argued that, where practical, current high-stakes assessments should be reformed to become as ‘low-stakes’ as possible. This article also undermines some of the pro high-stakes testing rhetoric that has a tendency to afflict education.

Changes in assessment policy have increased standardized testing at provincial, national, and international levels, introduced testing at more grade levels, increased the reporting of test results, and attached more significance to those results. Advocates claim that testing will result in greater accountability in education. The research demonstrates that standardized testing has a negative impact on students, perpetuating and intensifying educational inequity through test bias and the misuse of test scores. Test results are increasingly being used to analyse policy, program, school, and teacher success, and they are being inappropriately used as "educational gatekeepers" to make important decisions about students, teachers, schools, and school systems. This paper focuses on how standardized testing is becoming the mechanism that facilitates many questionable education practices that contribute to educational inequity.

Neil Dorans has made a career of advocating for the examinee. He continues to do so in his NCME career award address, providing a thought‐provoking commentary on some current trends in educational measurement that could potentially affect the integrity of test scores. Concerns expressed in the address call attention to a conundrum that faces today's measurement practitioners, namely, that technology‐driven assessment, while very appealing, is prone to less controlled conditions of measurement. The commentary given here focuses on the message and implications of Neil Dorans's career award address. It discusses some specific points of note, elaborates on the conundrum, gives a view of the future, and makes a call for a dialogue among test developers and measurement practitioners on how to compensate for the loss in controlled conditions of measurement associated with the use of technology.

There has long been a concern about the lack of representation of ethnic minorities in the field of educational measurement. As previous research has shown that graduate programs primarily rely on their websites for recruiting efforts, the objective of this study was to conduct a content analysis of all U.S. educational measurement program websites to evaluate the availability of college choice information found to be useful for underrepresented ethnic minority applicants. In terms of program climate, results revealed that less than 10% of programs directly encouraged ethnic minorities to apply or included an antidiscrimination statement with regard to application review on their websites. Moreover, only a few program websites indicated the availability of flexible programming—previously found to be important for underrepresented ethnic minority students—such as part‐time options (16%), evening courses (10%), and online course/program availability (8%). Recommendations for how measurement programs can improve their websites to include desirable college choice information for underrepresented ethnic minority applicants are discussed.

Extensive research has examined the validity and fairness of standardized tests in academic admissions. However, due to their underrepresentation in higher education, American Indians have gained much less attention in this research. In the present study, we examined for American Indian students (1) group differences on SAT scores, (2) the predictive and incremental validity of SAT over high school grades, (3) the effect of socioeconomic status on SAT validity, (4) differential prediction in the use of SAT scores, and (5) potential omitted variables that could explain differential prediction for American Indian students. Results provided evidence of predictive and incremental validity of SAT scores, and the validity of SAT scores was largely independent of socioeconomic status. Overprediction was found when using SAT scores to predict college performance and it was reduced when including high school grades as an additional predictor. This study provides substantial evidence of the validity and fairness of SAT scores for American Indians.

