Similar literature
1.
Educational tests used for accountability purposes must represent the content domains they purport to measure. When such tests are used to monitor progress over time, the consistency of the test content across years is important for ensuring that observed changes in test scores are due to student achievement rather than to changes in what the test is measuring. In this study, expert science teachers evaluated the content and cognitive characteristics of the items from 2 consecutive annual administrations of a 10th-grade science assessment. The results indicated the content area representation was fairly consistent across years and the proportion of items measuring the different cognitive skill areas was also consistent. However, the experts identified important cognitive distinctions among the test items that were not captured in the test specifications. The implications of this research for the design of science assessments and for appraising the content validity of state-mandated assessments are discussed.

2.
The current study evaluates the performance of three non-IRT procedures (i.e., normal approximation, Livingston-Lewis, and compound multinomial) for estimating classification indices when the observed score distribution shows atypical patterns: (a) bimodality, (b) structural (i.e., systematic) bumpiness, or (c) structural zeros (i.e., no frequencies). Under a bimodal distribution, the normal approximation procedure produced substantial bias. For a distribution with structural bumpiness, the compound multinomial procedure tended to introduce larger bias. Under a distribution with structural zeros, the relative performance of the selected estimation procedures depended on cut score location and the sample-size conditions. In general, the differences in estimation errors among the three procedures were not substantial.

3.
This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small (except for the mean/sigma method). The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is compared to the chained equipercentile equating, and we observe that the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating, except at low scores.

4.
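In educational evaluation activities scored by multiple judges, the consistency of the rank order of the scores affects the credibility of the evaluation. Kendall's coefficient of concordance can be used to test the degree of agreement among the scores, and thereby to judge the credibility of the evaluation data and the validity of the evaluation activity.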
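As a minimal sketch of the statistic this abstract names (not the authors' own computation), Kendall's coefficient of concordance W for m judges ranking n objects without ties can be computed as W = 12S / (m^2 (n^3 - n)), where S is the sum of squared deviations of the per-object rank totals from their mean. The function name `kendalls_w` and the sample rankings are hypothetical, and tied ranks are assumed not to occur.

```python
def kendalls_w(ranks):
    """Kendall's coefficient of concordance for untied ranks.

    ranks[j][i] is the rank (1..n) that judge j assigned to object i.
    Assumption: every judge ranks all n objects and there are no ties.
    """
    m = len(ranks)            # number of judges
    n = len(ranks[0])         # number of objects being ranked
    # Total rank received by each object across all judges.
    totals = [sum(judge[i] for judge in ranks) for i in range(n)]
    # Under untied ranks the mean rank total is always m(n + 1) / 2.
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


if __name__ == "__main__":
    # Hypothetical rankings of four entries by three judges.
    sample = [
        [1, 2, 3, 4],
        [2, 1, 3, 4],
        [1, 3, 2, 4],
    ]
    print(kendalls_w(sample))
```

W ranges from 0 (no agreement among judges) to 1 (identical rankings); a significance test on W would be a separate step not shown here.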

5.
The central role of propensity score analysis (PSA) in observational studies is causal inference; as such, PSA is often used for making causal claims in research articles. However, there are still some issues for researchers to consider when making claims of causality using PSA results. This summary first briefly reviews PSA, then discusses its effectiveness and limitations. Finally, it provides guidance on how to address these concerns so that researchers can make appropriate causal claims using PSA results in their research articles.

6.
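The reform of the administrative licensing system in China's economic sphere is closely tied to the establishment of the socialist market economy and to participation in competition under the WTO regime. WTO rules require that China's administrative licensing conform to the requirements of a market economy and the rules of fair international competition, while leaving room for administrative licensing. In converting administrative licensing from an administrative instrument of the planned economy into an instrument of state regulation in a market economy, and in responding to WTO rules, Chinese economic law bears the task of adjusting, transforming, and regulating licensing conduct. Building a new system of economic administrative licensing requires giving full weight to the specific roles of WTO rules and Chinese economic law. It follows that WTO rules and Chinese economic law are congruent in function, specific content, and legal effect.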

7.
Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the scoring criteria. Validity of score meaning is affected by the accuracy of raters' representations of examinee performance and the scoring criteria, and the accuracy of the mapping process. Methodological advances and applications that model rater response processes, rater accuracy, and rater consistency inform the design, scoring, interpretations, and uses of rater‐mediated assessments.

8.
Domain scores have been proposed as a user-friendly way of providing instructional feedback about examinees' skills. Domain performance typically cannot be measured directly; instead, scores must be estimated using available information. Simulation studies suggest that IRT-based methods yield accurate group domain score estimates. Because simulations can represent best-case scenarios for methodology, it is important to verify results with a real data application. This study administered a domain of elementary algebra (EA) items created from operational test forms. An IRT-based group-level domain score was estimated from responses to a subset of taken items (composed of EA items from a single operational form) and compared to the actual observed domain score. Domain item parameters were calibrated using item responses both from the special study and from national operational administrations of the items. The accuracy of the domain score estimates was evaluated within schools and across school sizes for each set of parameters. The IRT-based domain score estimates typically were closer to the actual domain score than observed performance on the EA items from the single form. Previously simulated findings for the IRT-based domain score estimation procedure were supported by the results of the real data application.

9.
This study investigated the factorial invariance of scores from a 7th-grade state reading assessment across general education students and selected groups of students with disabilities. Confirmatory factor analysis was used to assess the fit of a 2-factor model to each of the 4 groups. In addition to overall fit of this model, 5 levels of constraint, including equal factor loadings, intercepts, error variances, factor variances, and factor covariances, were investigated. Invariance across the factor loadings and intercepts was supported across the groups of students with disabilities and general education students. Invariance for these groups was not supported for the error variances. For the students with mental retardation, the lack of fit of the 2-factor model and the observed score results suggested a mismatch between the difficulty level of this test and the ability level of these students. Although the results generally supported the score comparability of the reading assessment across these groups, further research is needed into the nature of the larger error variances for the groups of students with disabilities and into accommodations and modifications for the students with mental retardation.

10.
DETECT, the acronym for Dimensionality Evaluation To Enumerate Contributing Traits, is an innovative and relatively new nonparametric dimensionality assessment procedure used to identify mutually exclusive, dimensionally homogeneous clusters of items using a genetic algorithm (Zhang & Stout, 1999). Because the clusters of items are mutually exclusive, this procedure is most useful when the data display approximate simple structure. In many testing situations, however, data display a complex multidimensional structure. The purpose of the current study was to evaluate DETECT item classification accuracy and consistency when the data display different degrees of complex structure using both simulated and real data. Three variables were manipulated in the simulation study: the percentage of items displaying complex structure (10%, 30%, and 50%), the correlation between dimensions (.00, .30, .60, .75, and .90), and the sample size (500, 1,000, and 1,500). The results from the simulation study reveal that DETECT can accurately and consistently cluster items according to their true underlying dimension when as many as 30% of the items display complex structure, if the correlation between dimensions is less than or equal to .75 and the sample size is at least 1,000 examinees. If 50% of the items display complex structure, then the correlation between dimensions should be less than or equal to .60 and the sample size should be at least 1,000 examinees. When the correlation between dimensions is .90, DETECT does not work well with any complex dimensional structure or sample size. Implications for practice and directions for future research are discussed.

11.
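New indices for evaluating test quality: decision consistency and decision accuracy   Cited by: 2 (self-citations: 0; citations by others: 2)
In criterion-referenced tests that classify examinees into several score levels, decision consistency and decision accuracy are two important indices of test quality in addition to the traditional reliability coefficient. This article introduces the definitions of decision consistency and decision accuracy, the development of research on them, and several commonly used estimation methods based on classical test theory and item response theory, including the Subkoviak, Huynh, Livingston-Lewis, and Rudner methods.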

12.
The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e., probability judgments) from previously conducted standard setting studies were used first in a re-sampling study, followed by D-studies. For the re-sampling study, proportionally stratified subsets of items were extracted under various sampling and test-length conditions. The mean cut score, variance components, expected standard error (SE) around the mean cut score, and root-mean-squared deviation (RMSD) across 1,000 replications were estimated at each study condition. The SE and the RMSD decreased as the number of items increased, but this reduction tapered off after approximately 45 items. Subsequently, D-studies were performed on the same datasets. The expected SE was computed at various test lengths. Results from both studies are consistent with previous research indicating that between 40 and 50 items are sufficient to make generalizable cut score recommendations.

13.
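Within the framework of the Rasch measurement model and following Rudner's classification method, this study examines the classification consistency and accuracy of the level scores produced by the five-level score assignment scheme used by the eight provinces in the third batch of the comprehensive college entrance examination (gaokao) reform. The study focuses on how item information functions affect the consistency and accuracy of score-level classification. The results show that the item types making up the test, the number of items, the amount of information provided, and the distribution of the information functions are important factors affecting classification consistency and accuracy; of these, the distribution of item information functions and the item types have the more pronounced effect.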

14.
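This paper studies multi-attribute decision-making problems in which the attribute weight information is only partially known, the decision maker has preferences over the alternatives, and all decision information is given as vague values. Based on an improved score function for vague sets, a linear programming model is built to determine the attribute weights, and a multi-attribute decision-making method that minimizes the deviation between subjective and objective vague scores is then given. Finally, an example illustrates the detailed procedure and the effectiveness of the method.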

15.
16.
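This paper introduces the principles for handling construction claims and the methods for calculating them, explains in detail how to determine the calculation period for claims arising from different causes, and raises issues to note when filing a claim.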

17.
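The art of argumentation formed in debate: on the formation and characteristics of Mencius's debating technique   Cited by: 3 (self-citations: 0; citations by others: 3)
This article discusses Mencius's art of argumentation from two aspects. First, it analyzes how that art took shape against the broad background of social change in Mencius's time, explaining the objective reality behind his remark “予岂好辩哉,予不得已也” ("Am I fond of disputing? I simply have no alternative"). Second, it draws historical material from the seven books of the Mencius (《孟子》) to analyze how Mencius applied his methods of argument, and describes the style and characteristics of his art of argumentation.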

18.
19.
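A correlational study of dictation scores and total scores on the Test for English Majors, Band 4   Cited by: 8 (self-citations: 0; citations by others: 8)
Yan Minggui (严明贵), 《台州学院学报》 (Journal of Taizhou University), 2005, 27(2): 58-60, 80
This study took the Test for English Majors, Band 4 (TEM-4) scores of 450 English majors at Taizhou University as its data. Using quantitative methods, with product-moment correlation calculations and statistical analysis in SPSS 10, it investigated the relationship between passage dictation scores and total scores on the test. The results show a certain positive correlation between passage dictation scores and total scores. The study therefore proposes strengthening passage dictation training for English majors to improve their overall English proficiency and, in turn, the TEM-4 pass rate.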
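As a rough illustration of the product-moment (Pearson) correlation this abstract refers to, the sketch below computes r from paired scores. It is not the study's SPSS analysis: the function name `pearson_r` and the sample dictation and total scores are invented for the example.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical scores for six students (not data from the study).
    dictation = [9, 11, 7, 13, 10, 12]
    total = [58, 66, 52, 71, 60, 69]
    print(pearson_r(dictation, total))
```

A value of r near +1 would indicate that students with higher dictation scores also tend to have higher total scores; the abstract itself reports only that the correlation is positive, without giving its size.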

20.
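Construction contract management is an important component of project management. Claims, as part of managing contract performance, bear on the interests and rights of the parties and constitute an extremely important, complex, and systematic body of work. This article discusses the causes of construction claims, the problems that exist, and countermeasures, and explains the importance of claims management.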
