Similar literature
 20 similar articles found
1.
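Using Generalizability Theory to Analyze the Evaluation of University Teachers' Teaching Proficiency   Total citations: 4, self-citations: 1, citations by others: 3
Generalizability theory, a modern measurement theory, was used to evaluate the teaching proficiency of university teachers and to suggest improvements. Using a self-developed teaching evaluation questionnaire, 543 students rated 16 teachers in a university foreign language department, and the collected data were analyzed with a multivariate generalizability analysis under a nested design. The evaluation was fairly reliable overall, although some indicators had low reliability; the questionnaire's original indicator weights were not optimal, and changing the weights can improve the reliability of the evaluation.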

2.
Currently there is concern among some educators regarding the reliability of criterion-referenced (CR) measures. In this comment, a recent attempt to develop a theory of reliability for CR measures is examined, and some considerations for determining the reliability of CR measures are discussed. Conventional reliability statistics (e.g., coefficient alpha, standard error of measurement) are found appropriate for CR measures satisfying the assumptions of the measurement model underlying classical test theory. For measures with underlying multidimensional traits, conventional reliability statistics may be used at the homogeneous subscale level. When the confidence interval about a student's “below criterion score” includes the criterion, additional evidence about the student should be obtained. Two-stage sequential testing is suggested as one method for acquiring additional evidence.

3.
Applied Measurement in Education, 2013, 26(4): 361-367
The sampling theory for coefficient alpha is well developed and readily accessible in the measurement literature. The theory for the intraclass reliability coefficient, a Spearman-Brown extrapolation of alpha to a single measurement on each examinee, is less widely recognized and less easily cited. This article presents techniques for constructing confidence intervals and testing hypotheses for the intraclass coefficient.
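The relation between alpha and the intraclass coefficient described in this abstract can be sketched as follows. This is only a minimal illustration of the point estimates, assuming a complete examinee-by-item score matrix; the function names are hypothetical, and the confidence intervals and hypothesis tests that the article itself presents are not reproduced here.

```python
from statistics import variance


def coefficient_alpha(scores):
    """Coefficient alpha for a list of examinee rows, each holding k item scores."""
    k = len(scores[0])
    # Sample variance of each item (column) and of the total score.
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def intraclass_from_alpha(alpha, k):
    """Spearman-Brown step-down of alpha (k measurements) to a single measurement."""
    return alpha / (k - (k - 1) * alpha)


if __name__ == "__main__":
    # Illustrative data only: three examinees, two perfectly consistent items.
    data = [[1, 2], [2, 3], [3, 4]]
    print(coefficient_alpha(data))
    print(intraclass_from_alpha(0.8, 4))
```

Stepping the single-measurement value back up to k measurements with the Spearman-Brown formula recovers the original alpha, which is a quick way to check the conversion.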

4.
This study develops a theoretical model for the costs of an exam as a function of its duration. Two kinds of costs are distinguished: (1) the costs of measurement errors and (2) the costs of the measurement. Both costs are expressed in time of the student. Based on a classical test theory model, enriched with assumptions on the context, the costs of the exam can be expressed as a function of various parameters, including the duration of the exam. It is shown that these costs can be minimized in time. Applied in a real example with reliability .80, the outcome is that the optimal exam time would be much shorter and would have reliability .675. The consequences of the model are investigated and discussed. One of the consequences is that optimal exam duration depends on the study load of the course, all other things being equal. It is argued that it is worthwhile to investigate empirically how much time students spend on preparing for resits. Six variants of the model are distinguished, which differ in their weights of the errors and in the way grades affect how much time students study for the resit.
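To give a feel for what a drop in reliability from .80 to .675 means for exam length, the sketch below applies the Spearman-Brown formula from classical test theory. It assumes that reliability changes with exam duration exactly as Spearman-Brown predicts for lengthening or shortening a test with comparable items; the abstract does not state that the study uses this relation, and the function names are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_for(reliability_now, reliability_target):
    """Length multiplier needed to move from reliability_now to reliability_target."""
    return (reliability_target * (1 - reliability_now)) / (
        reliability_now * (1 - reliability_target)
    )


if __name__ == "__main__":
    factor = length_factor_for(0.80, 0.675)
    print(f"length factor: {factor:.3f}")
    print(f"check: {spearman_brown(0.80, factor):.3f}")
```

Under this assumption the shorter exam would be roughly half as long as the original, which is consistent with the abstract's description of a "much shorter" optimal exam time.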

5.
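This article introduces the basic concepts of reliability, uses stress-strength interference theory to analyze the relationship between the safety factor and reliability, and illustrates with examples how reliability is applied in mechanical design. It gives methods for the reliability design calculation and the reliability verification calculation of mechanical parts.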
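A common textbook form of the stress-strength interference calculation can be sketched as follows. It assumes that stress and strength are independent and normally distributed, which the abstract does not specify; the function names and the numbers in the usage example are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def interference_reliability(mean_strength, sd_strength, mean_stress, sd_stress):
    """P(strength > stress) for independent, normally distributed stress and strength."""
    z = (mean_strength - mean_stress) / math.sqrt(sd_strength**2 + sd_stress**2)
    return normal_cdf(z)


def mean_safety_factor(mean_strength, mean_stress):
    """Conventional safety factor based on mean values only."""
    return mean_strength / mean_stress


if __name__ == "__main__":
    # Hypothetical part: strength 500 +/- 30 MPa, stress 350 +/- 40 MPa.
    print(mean_safety_factor(500, 350))
    print(interference_reliability(500, 30, 350, 40))
```

The example shows why the two quantities are analyzed together: the safety factor depends only on the means, while the reliability also depends on the scatter of both stress and strength.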

6.
Applied Measurement in Education, 2013, 26(4): 323-342
This study provides empirical evidence about the sampling variability and generalizability (reliability) of a statewide science performance assessment. Results at both individual and school levels indicate that task-sampling variability was the major source of measurement error in the performance assessment; rater-sampling variability was negligible. Adding more tasks improves the generalizability of the measurement. For the school-level assessment, the variation of performance among students within a school was larger than the variation among schools. Increasing the number of students taking a test within a school thus increases the generalizability of the assessment. Finally, the allocation of students in a matrix-sampling design is compared to a students-crossed-with-tasks design. The former would require fewer tasks per student than the latter to build a generalizable measure of school performance.
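The effect of adding tasks can be illustrated with a small decision-study calculation. The sketch assumes a fully crossed persons x tasks x raters random-effects design and relative (rank-order) decisions; the variance components in the usage example are hypothetical and are not taken from the study.

```python
def generalizability_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """Relative generalizability coefficient for a crossed p x t x r design.

    var_p      person (universe-score) variance
    var_pt     person-by-task interaction variance
    var_pr     person-by-rater interaction variance
    var_ptr_e  residual (three-way interaction plus error) variance
    """
    relative_error = (
        var_pt / n_tasks + var_pr / n_raters + var_ptr_e / (n_tasks * n_raters)
    )
    return var_p / (var_p + relative_error)


if __name__ == "__main__":
    # Hypothetical components with large task variability and negligible rater variability.
    for n_tasks in (1, 4, 8, 16):
        coefficient = generalizability_coefficient(1.0, 2.0, 0.0, 1.0, n_tasks, 1)
        print(n_tasks, round(coefficient, 3))
```

With components of this shape, increasing the number of raters changes little, while increasing the number of tasks raises the coefficient substantially, which mirrors the pattern the abstract reports.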

7.
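Modern test theories, represented by generalizability theory and item response theory, arose to overcome the shortcomings of classical test theory. Generalizability theory builds on classical test theory by introducing experimental design and analysis of variance techniques to decompose and control the various sources of error in an assessment situation; its development has passed through two main stages, univariate and multivariate generalizability theory. At present it is applied mainly in three areas: evaluation, examinations, and the construction of rating scales. Item response theory is a modern test theory developed to overcome the variability of item parameters and other indices in classical test theory; its development passed through three stages: early theoretical exploration, initial formation of the theory, and gradual refinement. It is used mainly for score equating, for analyzing test and item parameters and the quality of tests and items, for separating the influence of rater characteristics from test results, and for detecting differential item functioning and constructing adaptive tests.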

8.
We contend that generalizability (G) theory allows the design of psychometric approaches to testing English-language learners (ELLs) that are consistent with current thinking in linguistics. We used G theory to estimate the amount of measurement error due to code (language or dialect). Fourth- and fifth-grade ELLs, native speakers of Haitian-Creole from two speech communities, were given the same set of mathematics items in the standard English and standard Haitian-Creole dialects (Sample 1) or in the standard and local dialects of Haitian-Creole (Samples 2 and 3). The largest measurement error observed was produced by the interaction of student, item, and code. Our results indicate that the reliability and dependability of ELL achievement measures are affected by two facts that operate in combination: Each test item poses a unique set of linguistic challenges and each student has a unique set of linguistic strengths and weaknesses. This sensitivity to language appears to take place at the level of dialect. Also, students from different speech communities within the same broad linguistic group may differ considerably in the number of items needed to obtain dependable measures of their academic achievement. Whether students are tested in English or in their first language, dialect variation needs to be considered if language as a source of measurement error is to be effectively addressed.

9.
The concept of energy is one key component of science education curricula worldwide. While it is still being taught in many science classrooms from a mainly conceptual knowledge perspective, the need to frame the concept of energy as a socioscientific issue and implement it in the context of citizenship education and education for sustainable development, is getting more and more explicit. As we will be faced with limited fossil fuels and the consequences of global climate change in the future, students have to be supported in becoming literate citizens who are able to reach informed energy-related decisions. In this article, we focus on students’ reasoning and decision-making processes about socioscientific energy-related issues. In more detail, we developed a paper-and-pencil measurement instrument to assess secondary school students’ competencies in this domain. The functioning of the measurement instrument was analysed with a sample of 850 students from grades 6, 8, 10 and 12 using item response theory. Findings show that the measurement instrument functions in terms of reliability and validity. Concerning student ability, elaborate reasoning and decision-making was characterised by the use of trade-offs and the ability to weigh arguments and to reflect on the structure of reasoning and decision-making processes. The developed measurement instrument provides a complement for existing test instruments on conceptual knowledge about the concept of energy. It aims to contribute to a change in teaching about energy, especially in physics education in the sense of education for sustainable development.

10.
An approach called generalizability in item response modeling (GIRM) is introduced in this article. The GIRM approach essentially incorporates the sampling model of generalizability theory (GT) into the scaling model of item response theory (IRT) by making distributional assumptions about the relevant measurement facets. By specifying a random effects measurement model, and taking advantage of the flexibility of Markov Chain Monte Carlo (MCMC) estimation methods, it becomes possible to estimate GT variance components simultaneously with traditional IRT parameters. It is shown how GT and IRT can be linked together, in the context of a single-facet measurement design with binary items. Using both simulated and empirical data with the software WinBUGS, the GIRM approach is shown to produce results comparable to those from a standard GT analysis, while also producing results from a random effects IRT model.

11.
There has been a growing consensus among the educational measurement experts and psychometricians that test taker characteristics may unduly affect the performance on tests. This may lead to construct-irrelevant variance in the scores and thus render the test biased. Hence, it is incumbent on test developers and users alike to provide evidence that their tests are free of such bias. The present study exploited generalizability theory to examine the presence of gender differential performance on a high-stakes language proficiency test, the University of Tehran English Proficiency Test. An analysis of the performance of 2,343 examinees who had taken the test in 2009 indicated that the relative contributions of different facets to score variance were almost uniform across the gender groups. Further, there is no significant interaction between items and persons, indicating that the relative standings of the persons were uniform across all items. The lambda reliability coefficients were also uniformly high. All in all, the study provides evidence that the test is free of gender bias and enjoys a high level of dependability.

12.
The top-down approach to designing a multistage test is relatively understudied in the literature and underused in research and practice. This study introduced a route-based top-down design approach that directly sets design parameters at the test level and utilizes the advanced automated test assembly algorithm seeking global optimality. The design process in this approach consists of five sub-processes: (1) route mapping, (2) setting objectives, (3) setting constraints, (4) routing error control, and (5) test assembly. Results from a simulation study confirmed that the assembly, measurement and routing results of the top-down design eclipsed those of the bottom-up design. Additionally, the top-down design approach provided unique insights into design decisions that could be used to refine the test. Regardless of these advantages, applying both top-down and bottom-up approaches in a complementary manner is recommended in practice.

13.
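Test length is one of the important factors affecting the reliability and validity of language tests. Using the fixed-facet s×(i:p) nested design of generalizability theory (GT) and the law of diminishing marginal utility, this article reports an empirical study of the test length of the Chinese proficiency test HSK (Intermediate). The results show that the 130-item HSK (Intermediate) test has quite high reliability, with a generalizability coefficient (Eρ²) of 0.8890; even if the number of items is reduced to 120 or 110, the generalizability coefficient still reaches 0.8856 and 0.8816 (decreases of 0.38% and 0.83%, respectively). Such a reduction in test length not only clearly lowers development costs but also improves testing efficiency, fully meets the demanding error-control requirements of standardized testing, and ensures that the test results and score interpretations retain high reliability and validity.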

14.
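To overcome the test dependence and sample dependence of classical test theory, this study applied the Rasch model to the quality analysis of a science literacy assessment for sixth-grade primary school students, and describes the use of the Rasch model in quality analysis in terms of overall quality checks, unidimensionality checks, the Wright map, single-item quality analysis, and bubble charts. It also reports that the items in the assessment have high reliability and validity and reasonable discrimination, and that the great majority of items met measurement expectations. Applying the Rasch model to assessment design provides measurement-quality data that can serve as a reference for assessment design.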
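For readers unfamiliar with the Rasch model named in this abstract, its core item response function for dichotomous items can be sketched as follows. The function names are hypothetical, and the sketch covers only the basic model, not the fit statistics, Wright map, or bubble charts that the study reports.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1 / (1 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Fisher information an item contributes at a given ability level."""
    p = rasch_probability(ability, difficulty)
    return p * (1 - p)


if __name__ == "__main__":
    # Ability and difficulty are on the same logit scale.
    for ability in (-1.0, 0.0, 1.0):
        print(ability, round(rasch_probability(ability, 0.0), 3))
```

Because the probability depends only on the difference between ability and difficulty, persons and items can be placed on one common scale, which is the property that a Wright map displays.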

15.
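To develop an instrument with which students at higher vocational colleges can evaluate teachers' classroom teaching quality, and in response to problems with existing student evaluations at such colleges, a student evaluation questionnaire for classroom teaching quality at higher vocational colleges (VSEEQ) was designed on the basis of the "University Teacher Teaching Effectiveness Evaluation Questionnaire (Student Version)". The VSEEQ was developed to meet educational measurement standards and to be supported by modern teaching and learning theory; it was administered, and reliability and validity data were collected. The results show that the VSEEQ has a reasonable dimensional structure and good internal consistency reliability, test-retest reliability, content validity, and construct validity.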

16.
Reconsidering Reliability in Classroom Assessment and Grading   Total citations: 1, self-citations: 0, citations by others: 1
It is argued that classroom assessment evolves from a different set of issues and demands from more traditional measurement concerns and that approaches to reliability developed from traditional concerns are not appropriate for most classroom settings. The assessment and grading issues for high school instruction are examined from the perspective of reliability. An alternative conceptualization of reliability, sufficiency of information, is proposed and explored. This conceptualization is based on the argument that at a rudimentary level, reliability theory is based on the notion of having enough information to make decisions or draw inferences.

17.
A reliability coefficient for criterion-referenced tests is developed from the assumptions of classical test theory. This coefficient is based on deviations of scores from the criterion score, rather than from the mean. The coefficient is shown to have several of the important properties of the conventional norm-referenced reliability coefficient, including its interpretation as a ratio of variances and as a correlation between parallel forms, its relationship to test length, its estimation from a single form of a test, and its use in correcting for attenuation due to measurement error. Norm-referenced measurement is considered as a special case of criterion-referenced measurement.
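One well-known coefficient that fits this description is Livingston's criterion-referenced reliability coefficient, sketched below. Treating it as the coefficient developed in this article is an assumption, since the abstract gives no formula; the function name and the numbers in the usage example are hypothetical.

```python
def criterion_referenced_reliability(reliability, score_variance, score_mean, criterion):
    """Livingston-style coefficient based on squared deviations from the criterion score.

    reliability     conventional (norm-referenced) reliability of the test
    score_variance  variance of observed scores
    score_mean      mean of observed scores
    criterion       criterion (cut) score
    """
    squared_distance = (score_mean - criterion) ** 2
    return (reliability * score_variance + squared_distance) / (
        score_variance + squared_distance
    )


if __name__ == "__main__":
    # Hypothetical test: reliability .60, variance 100, mean 50.
    print(criterion_referenced_reliability(0.60, 100.0, 50.0, 50.0))
    print(criterion_referenced_reliability(0.60, 100.0, 50.0, 60.0))
```

When the criterion equals the mean, the coefficient reduces to the conventional reliability, which matches the abstract's remark that norm-referenced measurement is a special case; as the criterion moves away from the mean, the coefficient increases.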

18.
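Test reliability under item response theory (IRT) evaluates the reliability and stability of latent trait estimates. Because it is a global, test-level summary, IRT reliability cannot be replaced by the test information function, and it is an important index for IRT-based tests. Drawing on the Chinese and international literature, this article first presents scholars' views on the role of IRT reliability, then introduces and evaluates several methods for estimating IRT reliability, briefly describes the factors that influence it, and finally discusses directions for future research on IRT reliability.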

19.
Subscores Based on Classical Test Theory: To Report or Not to Report   Total citations: 1, self-citations: 0, citations by others: 1
There is an increasing interest in reporting subscores, both at examinee level and at aggregate levels. However, it is important to ensure reasonable subscore performance in terms of high reliability and validity to minimize incorrect instructional and remediation decisions. This article employs a statistical measure based on classical test theory that is conceptually similar to the test reliability measure and can be used to determine when subscores have any added value over total scores. The usefulness of subscores is examined both at the level of the examinees and at the level of the institutions that the examinees belong to. The suggested approach is applied to two data sets from a basic skills test. The results provide little support in favor of reporting subscores for either examinees or institutions for the tests studied here.

20.
Assessing Writing, 2004, 9(3): 190-207
Specialists in the field of large-scale, high-stakes writing assessment have, over the last forty years, alternately discussed the issue of maximizing either reliability or validity in test design. Factors complicating the debate, such as Messick's (1989) expanded definition of validity and the ethical implications of testing, are explored. An inverse relationship between the loss of reliability and the loss of validity of a test is proffered. The term Quality, in reference to writing assessment, is defined and introduced. Construct complexity is hypothesized as a factor that influences validity, reliability, and quality. It is suggested that the either/or debate concerning emphasis on reliability or validity in test design be put aside in favor of a discussion on how to maximize the quality of an assessment. Insofar as this goal can be achieved, it is necessary in the design of the test to minimize and balance the loss of both validity and reliability. The discussion draws on literature from within the field of writing assessment and from works in the fields of mathematics and information theory.
