《Educational Assessment》2013,18(4):357-375
A test designed with built-in modifications and covering the same grade-level mathematics content provided more precise measurement of mathematics achievement for lower performing students with disabilities. Fourth-grade students with disabilities took a test based on modified state curricular standards for their mandated statewide mathematics assessment. To link the modified test with the general test, a block of items was administered to students with and without disabilities who took the general mathematics assessment. Item difficulty and student mathematics ability parameters were estimated using item response theory (IRT) methodology. Results support the conclusion that a modified test, based on the same curricular objectives but providing a more targeted measurement of expected outcomes for lower achieving students, could be developed for this special population.

This study examined rater effects on essay scoring in an operational monitoring system from England's 2008 national curriculum English writing test for 14‐year‐olds. We fitted two multilevel models and analyzed: (1) drift in rater severity effects over time; (2) rater central tendency effects; and (3) differences in rater severity and central tendency effects by raters’ previous rating experience. We found no significant evidence of rater drift and, while raters with less experience appeared more severe than raters with more experience, this result also was not significant. However, we did find that there was a central tendency to raters’ scoring. We also found that rater severity was significantly unstable over time. We discuss the theoretical and practical questions that our findings raise.

Numerous studies have examined performance assessment data using generalizability theory. Typically, these studies have treated raters as randomly sampled from a population, with each rater judging a given performance on a single occasion. This paper presents two studies that focus on aspects of the rating process that are not explicitly accounted for in this typical design. The first study makes explicit the "committee" facet, acknowledging that raters often work within groups. The second study makes explicit the "rating-occasion" facet by having each rater judge each performance on two separate occasions. The results of the first study highlight the importance of clearly specifying the relevant facets of the universe of interest. Failing to include the committee facet led to an overly optimistic estimate of the precision of the measurement procedure. By contrast, failing to include the rating-occasion facet, in the second study, had minimal impact on the estimated error variance.

In the United Kingdom, the majority of national assessments involve human raters. The processes by which raters determine the scores to award are central to the assessment process and affect the extent to which valid inferences can be made from assessment outcomes. Thus, understanding rater cognition has become a growing area of research in the United Kingdom. This study investigated rater cognition in the context of the assessment of school‐based project work for high‐stakes purposes. Thirteen teachers across three subjects were asked to “think aloud” whilst scoring example projects. Teachers also completed an internal standardization exercise. Nine professional raters across the same three subjects standardized a set of project scores whilst thinking aloud. The behaviors and features attended to were coded. The data provided insights into aspects of rater cognition such as reading strategies, emotional and social influences, evaluations of features of student work (which aligned with scoring criteria), and how overall judgments are reached. The findings can be related to existing theories of judgment. Based on the evidence collected, the cognition of teacher raters did not appear to be substantially different from that of professional raters.

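Rater training is an important means of ensuring the reliability and validity of performance test scores and has long been a focus of the international language testing community. This paper first reviews the current state of research on rater training in terms of theoretical frameworks, training methods, and training effects. It then identifies two problems in current research: the training process and content are unclear, and the mechanism by which training takes effect is not well understood. Finally, the paper outlines directions for future research, in the hope of drawing the attention of language testing practitioners in China to rater training.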

The Classroom Assessment Scoring System (CLASS; Pianta et al., 2008) is a popular measure of teacher–child interactions. Despite its prominence, CLASS scores have fairly weak relations with various child outcomes (e.g., Zaslow et al., 2010). One potential reason for these findings could be systematic differences in observer severity. As such, the purpose of this study was to explore the scope and impact of rater effects on CLASS scores with a sample of 77 teachers who were rated by 13 observers. Results indicated significant rater effects across all three CLASS domains. Adjusting for these effects, however, did not improve relations between CLASS scores and child outcomes. Implications for the CLASS and related assessments are discussed.

This study describes three least squares models to control for rater effects in performance evaluation: ordinary least squares (OLS); weighted least squares (WLS); and ordinary least squares, subsequent to applying a logistic transformation to observed ratings (LOG-OLS). The models were applied to ratings obtained from four administrations of an oral examination required for certification in a medical specialty. For any single administration, there were 40 raters and approximately 115 candidates, and each candidate was rated by four raters. The results indicated that raters exhibited significant amounts of leniency error and that application of the least squares models would change the pass-fail status of approximately 7% to 9% of the candidates. Ratings adjusted by the models demonstrated higher reliability and correlated slightly higher than observed ratings with the scores on a written examination.
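
As a rough illustration of the OLS idea in the abstract above, the sketch below fits an additive model, score = candidate effect + rater effect, to an incomplete rating design. It is a minimal sketch under assumptions, not the study's actual procedure: the function name `fit_rater_effects`, the alternating-update fitting scheme, the zero-mean constraint on rater effects, and the toy data are all hypothetical.

```python
from collections import defaultdict


def fit_rater_effects(ratings, n_iter=200):
    """Least-squares fit of score = candidate_effect + rater_effect.

    ratings: list of (candidate, rater, score) tuples (hypothetical format).
    The fit alternates between updating candidate and rater effects and
    keeps rater effects centered at zero so the model is identified.
    Assumes the design is connected (raters share candidates).
    """
    by_candidate = defaultdict(list)
    by_rater = defaultdict(list)
    for cand, rater, score in ratings:
        by_candidate[cand].append((rater, score))
        by_rater[rater].append((cand, score))

    cand_eff = {c: 0.0 for c in by_candidate}
    rater_eff = {r: 0.0 for r in by_rater}
    for _ in range(n_iter):
        # Candidate effect: mean of that candidate's scores net of rater effects.
        for c, obs in by_candidate.items():
            cand_eff[c] = sum(s - rater_eff[r] for r, s in obs) / len(obs)
        # Rater effect: mean of that rater's scores net of candidate effects.
        for r, obs in by_rater.items():
            rater_eff[r] = sum(s - cand_eff[c] for c, s in obs) / len(obs)
        # Re-center: move the average rater effect into the candidate effects.
        shift = sum(rater_eff.values()) / len(rater_eff)
        for r in rater_eff:
            rater_eff[r] -= shift
        for c in cand_eff:
            cand_eff[c] += shift
    return cand_eff, rater_eff


# Toy data (invented): rater "A" is severe, rater "C" is lenient.
toy_ratings = [
    ("c1", "A", 4.0), ("c1", "B", 5.0),
    ("c2", "B", 6.0), ("c2", "C", 7.0),
    ("c3", "A", 6.0), ("c3", "C", 8.0),
    ("c4", "A", 7.0), ("c4", "B", 8.0),
]
adjusted_scores, rater_effects = fit_rater_effects(toy_ratings)
```

In this convention a negative rater effect indicates severity and a positive one leniency; the candidate effects play the role of adjusted scores, which could then be compared with a pass mark.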

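姚霞 (Yao Xia). 《考试研究》 (Examinations Research), 2013,(2):53-63
Based on an analysis of the current state of three international assessments of student scientific literacy (PISA, TIMSS, and NEAP), this paper draws implications for scientific literacy assessment in China: (1) clarify the assessment objectives and test framework, and design appropriate items on the basis of in-depth study of curricula and textbooks; (2) adopt sound assessment methods and instruments suited to the assessment objectives; (3) collect information on student academic achievement along multiple dimensions and interpret it reasonably.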

Many researchers assessing the efficacy of educational programs face challenges due to issues with non-randomization and the likelihood of dependence between nested subjects. The purpose of the study was to demonstrate a rigorous research methodology using a hierarchical propensity score matching method that can be utilized in contexts where randomization is not feasible and dependence between subjects is a concern. Although propensity score matching is not new in helping to create quasi-experimental models, many studies limit propensity score matching to student-level variables. To address this limitation in educational research, this study extends propensity score matching to the next level so that hierarchical modeling techniques can be used to help minimize error due to the likelihood of dependence between nested students. A large-scale educational program that targets first-semester freshmen was used to illustrate the utility and value of the methodology. This type of program is typical in higher education where student self-selection creates difficulty in assessing its true effects on student achievement; however, by using a rigorous methodology, administrators can have higher confidence when making programmatic and budgetary decisions.

Research Findings: Forty-five child caregivers and 120 parents participated in this study to examine perceptions of childcare programs in Jordan. The researchers developed a questionnaire that consisted of 6 dimensions: health, education, parent–caregiver relationship, facilities, building/landscape, and playground. Moreover, interviews with 10 child caregivers and 20 parents were conducted. Results indicated that child caregivers expressed moderate satisfaction with the programs. In contrast, parents expressed lower satisfaction with the childcare programs. The results also revealed that caregivers and parents perceived the playground area as effective but found health and the parent–caregiver relationship ineffective. Practice or Policy: This study highlights the need to supervise childcare programs effectively and the importance of fostering a strong partnership between child caregivers and parents.

This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.
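
For orientation, a many-facet Rasch model of the kind named in the abstract above is commonly written in the rating-scale form below. The abstract does not give the exact specification used in the study, so the facets and symbols here are assumptions: $\theta_n$ is the writing competence of student $n$, $\delta_i$ the difficulty of rated domain $i$, $\lambda_j$ the severity of rater $j$, and $\tau_k$ the difficulty of rating category $k$ relative to category $k-1$.

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \delta_i - \lambda_j - \tau_k
```

Here $P_{nijk}$ is the probability that student $n$ receives category $k$ from rater $j$ on domain $i$. Under this form, a larger $\lambda_j$ lowers the log-odds of the higher category, which is how differences in rater severity enter the adjusted estimates of writing competence.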

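Since the 1950s and 1960s, driven by several large international educational assessment programs, matrix sampling has gradually come into widespread use in large-scale educational assessment because it resolves the tension between broad test content and limited testing time. It estimates total test scores by assigning randomly parallel subsets of test items to randomly selected students, and it is a general statistical method for estimating matrix parameters. In practice, unlike the traditional classical approach of testing all students with the same test form, matrix sampling reduces the required testing time by limiting the number of items each student receives, while still maintaining broad coverage of the test content across students. It has two basic types, complete matrix sampling and incomplete matrix sampling. Both target measurement at the group level, but the latter uses a "common items" design to help address the comparison of results between individuals. With an appropriate matrix sampling technique, and once the broad assessment content has been sorted and structured, group-level performance can be assessed accurately and comprehensively without increasing test administration costs. This is of great methodological significance for educational quality monitoring in China.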
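
The core allocation step of matrix sampling can be sketched as follows. This is a minimal illustration under assumptions rather than a procedure taken from the paper: the function name `matrix_sample`, the round-robin allocation, and the item and student identifiers are hypothetical, and the sketch omits the common-item design that the abstract associates with incomplete matrix sampling.

```python
import random


def matrix_sample(item_ids, student_ids, n_forms, seed=0):
    """Split the item pool into random forms and give each student one form.

    Returns (forms, assignment), where forms is a list of item lists and
    assignment maps each student to the form that student will take.
    """
    if n_forms < 1:
        raise ValueError("n_forms must be at least 1")
    rng = random.Random(seed)

    # Randomly partition the item pool into n_forms near-equal subsets.
    items = list(item_ids)
    rng.shuffle(items)
    forms = [items[k::n_forms] for k in range(n_forms)]

    # Randomly order students, then deal forms out in rotation so that
    # every form is taken by roughly the same number of students.
    students = list(student_ids)
    rng.shuffle(students)
    assignment = {s: forms[idx % n_forms] for idx, s in enumerate(students)}
    return forms, assignment


# Hypothetical pool of 30 items and 12 students, split into 3 forms.
forms, assignment = matrix_sample(range(1, 31), [f"s{i}" for i in range(12)], n_forms=3)
```

Each student answers only a third of the item pool here, yet the forms together cover every item, which is the trade-off between testing time and content coverage that the abstract describes.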

More attention is being given to evaluating the quality of school-level assessment scores due to their importance for school-based planning and monitoring effectiveness. In this study, cross-year stability is proposed as an indicator of data quality and the degree of stability that is appropriate for large-scale assessments of student performance is explored. Following a search of Internet sites, Year 1 to Year 2 stability coefficients were calculated for assessment data from 21 states and 2 provinces. The median stability coefficient was .78 in mathematics and reading, but coefficients for writing were generally lower. A stability coefficient of .80 is recommended as the standard for large-scale assessments of student performance. A high degree of cross-year stability makes it easier to detect and attribute changes in school-level scores to school improvement efforts. The link between stability and reliability and several factors that may attenuate stability are discussed.
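
A cross-year stability coefficient of this kind can be illustrated with a short calculation. The sketch below assumes the coefficient is the Pearson correlation between the same schools' scores in Year 1 and Year 2, which the abstract does not state explicitly; the function name `stability_coefficient` and the school scores are invented for illustration.

```python
import math


def stability_coefficient(year1, year2):
    """Pearson correlation between paired school-level scores from two years."""
    if len(year1) != len(year2) or len(year1) < 2:
        raise ValueError("need two equal-length lists with at least two schools")
    n = len(year1)
    mean1 = sum(year1) / n
    mean2 = sum(year2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(year1, year2))
    var1 = sum((a - mean1) ** 2 for a in year1)
    var2 = sum((b - mean2) ** 2 for b in year2)
    if var1 == 0 or var2 == 0:
        raise ValueError("scores must vary across schools in both years")
    return cov / math.sqrt(var1 * var2)


# Invented school mean scores for the same five schools in two years.
year1_scores = [61.0, 70.5, 55.2, 80.1, 66.3]
year2_scores = [63.2, 69.0, 58.1, 78.4, 64.9]

r = stability_coefficient(year1_scores, year2_scores)
meets_standard = r >= 0.80  # the .80 standard recommended in the abstract
```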

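Conducting effective meta-evaluation is both an established international practice and a practical choice for responding to public accountability and improving evaluation practice. Using a meta-evaluation model, the study analyzes the strengths and weaknesses of the Fourth Round of Discipline Evaluation. It finds that the goal of serving multiple stakeholders is appropriate, that the indicators are more closely related to discipline quality, and that the evaluation plan is highly feasible and cost-effective. Its shortcomings are the absence of a meta-evaluation stage and of guidance on how to use the evaluation results. On this basis, the study offers recommendations for meta-evaluation of the Fourth Round of Discipline Evaluation: (1) the China Academic Degrees and Graduate Education Development Center of the Ministry of Education should independently organize a summative meta-evaluation; (2) meta-evaluation should be carried out with reference to a meta-evaluation strategy model; (3) a meta-evaluation report should be provided to demonstrate the validity of the evaluation; (4) multiple parties should participate so that the evaluation can serve its improvement function.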

Based on a completed study of alumni of a master's degree teacher education program at a large northeastern university, this article demonstrates how alumni research can be designed to focus assessment on student outcomes and be responsive to program goals, policy concerns of administrators, instructional values of the faculty, and standards of professional practice. The article presents a conceptual framework, a research design plan, identification of relevant issues, appropriate analytical techniques, and selected findings with substantial relevance to other professional degree programs. Results confirm the importance of satisfaction with courses, perception of professional growth, and level of intellectual challenge on graduates' overall evaluation of the program. The methodological approaches and substantive issues raised in this study potentially enhance researchers' ability to design future assessment studies that will impact the policy development and program planning of other professional degree programs.

Although much attention has been given to rater effects in rater‐mediated assessment contexts, little research has examined the overall stability of leniency and severity effects over time. This study examined longitudinal scoring data collected during three consecutive administrations of a large‐scale, multi‐state summative assessment program. Multilevel models were used to assess the overall extent of rater leniency/severity during scoring and examine the extent to which leniency/severity effects were stable across the three administrations. Model results were then applied to scaled scores to estimate the impact of the stability of leniency/severity effects on students’ scores. Results showed relative scoring stability across administrations in mathematics. In English language arts, short constructed response items showed evidence of slightly increasing severity across administrations, while essays showed mixed results: evidence of both slightly increasing severity and moderately increasing leniency over time, depending on trait. However, when model results were applied to scaled scores, results revealed rater effects had minimal impact on students’ scores.

Evaluating Classroom Assessment Training in Teacher Education Programs
What should training for teachers look like? Do our assessment practices align well with what we would like our students to do? How can you self-assess assessment training at your institution?

State programs of performance funding for public colleges and universities are both popular and volatile. A previous article identified some characteristics of stable programs by comparing the survey responses of state and campus leaders from Tennessee and Missouri about their mature programs with those from four states that later dropped performance funding. This article uses those characteristics to assess the stability of the continuing programs in Florida, Ohio, and South Carolina. This article compares the survey responses of state and campus leaders from each of these three states about their programs with those from Missouri and Tennessee. The findings suggest trouble for these three programs, for they share few of the characteristics common to the stable programs in Missouri and Tennessee.

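Ten teacher raters and ten non-teacher raters scored the oral story retellings of 30 examinees. A comparative analysis using t-tests and FACETS found that, for simple rating tasks, non-teacher raters are as reliable and valid as teacher raters.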

