Similar Documents (20 results)
1.
The alignment of test items to content standards is critical to the validity of decisions made from standards‐based tests. Generally, alignment is determined based on judgments made by a panel of content experts with either ratings averaged or via a consensus reached through discussion. When the pool of items to be reviewed is large, or the content‐matter experts are broadly distributed geographically, panel methods present significant challenges. This article illustrates the use of an online methodology for gauging item alignment that does not require that raters convene in person, reduces the overall cost of the study, increases time flexibility, and offers an efficient means for reviewing large item banks. Latent trait methods are applied to the data to control for between‐rater severity, evaluate intrarater consistency, and provide item‐level diagnostic statistics. Use of this methodology is illustrated with a large pool (1,345) of interim‐formative mathematics test items. Implications for the field and limitations of this approach are discussed.
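One commonly reported intrarater consistency index in this latent trait tradition is the infit mean-square statistic. The sketch below shows the calculation in Python; the observed ratings, model expectations, and model variances are hypothetical inputs (in practice the latter two come from a fitted facets model).

```python
# A minimal sketch of infit mean-square for one rater: the
# information-weighted mean of squared rating residuals.
import numpy as np

observed = np.array([2, 3, 1, 4, 2, 3])              # a rater's ratings
expected = np.array([2.4, 2.8, 1.5, 3.6, 2.1, 3.2])  # model-expected ratings
variance = np.array([0.8, 0.9, 0.7, 0.6, 0.8, 0.9])  # model rating variances

residual = observed - expected
infit_ms = (residual ** 2).sum() / variance.sum()
print(f"infit mean-square: {infit_ms:.2f} (near 1 expected; >1 noisy, <1 overfitting)")
```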

2.
《Educational Assessment》2013,18(4):333-356
Alignment has taken on increased importance given the current high-stakes nature of assessment. To make well-informed decisions about student learning on the basis of test results, assessment items need to be well aligned with standards. Project 2061 of the American Association for the Advancement of Science (AAAS) has developed a procedure for analyzing the content and quality of assessment items. The authors of this study used this alignment procedure to closely examine 2 mathematics assessment items. Student work on these 2 items was analyzed to determine whether the conclusions reached through the use of the alignment procedure could be validated. It was found that the Project 2061 alignment procedure was effective in providing a tool for in-depth analysis of the mathematical content of the item against a set of standards and in identifying the 1 particular content standard that was most closely aligned with the item. Through analyzing student work samples and student interviews, it was also found that students' thinking may not correspond to the standard identified as best aligned with the learning goals of the item. This finding highlights the potential usefulness of analyzing student work to reveal deficiencies of an assessment item not captured by an alignment procedure.

3.
This article describes an alignment study conducted to evaluate the alignment between Indiana's Kindergarten content standards and items on the Indiana Standards Tool for Alternate Reporting (ISTAR). Alignment is the extent to which standards and assessments are in agreement, working together to guide educators' efforts to support children's learning and development. The alignment process in this study represented a modification of Webb's nationally recognized method of alignment analysis for early childhood assessments and standards. The alignment panel (N = 13) consisted of early childhood educators and educational leaders from all geographic regions of the state. Panel members were asked to rate the depth of knowledge (DOK) stage of each objective in the Kindergarten standards, rate the DOK stage of each item on the ISTAR rating scale, and identify the one or two objectives from the standards to which each ISTAR item corresponded. Analysis of the panel's responses suggested that the ISTAR inconsistently conformed to Webb's DOK consistency and range-of-knowledge (ROK) correspondence criteria for alignment. A promising finding was the strong alignment of the ISTAR Level F1 and F2 scales to the Kindergarten standards. This result provided evidence of the developmental continuum of skills and knowledge assessed by the ISTAR items.
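For readers unfamiliar with Webb's criteria, the Python sketch below shows one plausible way to compute the two statistics named above from panel ratings. The record layout, field names, and the 50% acceptability thresholds are stated here as assumptions, not taken from the study.

```python
from collections import defaultdict

# Each rating: (item_id, item_dok, objective_id, objective_dok, standard_id)
ratings = [
    ("i1", 2, "obj1", 2, "std1"),
    ("i2", 1, "obj2", 3, "std1"),
    ("i3", 3, "obj2", 3, "std1"),
    ("i4", 2, "obj3", 1, "std2"),
]

def dok_consistency(ratings, standard_id, threshold=0.5):
    """Share of item-objective matches where the item's DOK is at or
    above the objective's DOK, checked against an assumed 50% criterion."""
    matches = [r for r in ratings if r[4] == standard_id]
    if not matches:
        return None
    at_or_above = sum(1 for (_, idok, _, odok, _) in matches if idok >= odok)
    share = at_or_above / len(matches)
    return share, share >= threshold

def range_of_knowledge(ratings, objectives_by_standard, threshold=0.5):
    """Share of a standard's objectives hit by at least one item."""
    hit = defaultdict(set)
    for (_, _, obj, _, std) in ratings:
        hit[std].add(obj)
    out = {}
    for std, objs in objectives_by_standard.items():
        share = len(hit[std] & set(objs)) / len(objs)
        out[std] = (share, share >= threshold)
    return out

objectives = {"std1": ["obj1", "obj2"], "std2": ["obj3", "obj4"]}
print(dok_consistency(ratings, "std1"))      # (0.667, True)
print(range_of_knowledge(ratings, objectives))
```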

4.
5.
《Applied Measurement in Education》2013,26(4):331-345
In order to obtain objective measurement for examinations that are graded by judges, an extension of the Rasch model designed to analyze examinations with more than the two usual facets (items and examinees) is used. This extended Rasch model calibrates the elements of each facet of the examination (i.e., examinee performances, items, and judges) on a common log-linear scale. A network for assigning judges to examinations is used to link all facets. Real examination data from the "clinical assessment" part of a certification examination are used to illustrate the application. A range of item difficulties and judge severities was found. Comparison of examinee raw scores with objective linear measures corrected for variations in judge severity shows that judge severity can have a substantial impact on a raw score. Correcting for judge severity improves the fairness of examinee measures and of the subsequent pass-fail decisions, because the uncorrected raw scores favor examinee performances graded by lenient judges.
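The extended model described is usually written, in Linacre's many-facet formulation, with an examinee ability term, an item difficulty, a judge severity, and a rating-category threshold. A sketch of the general form (the study's exact parameterization may differ):

```latex
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

where \(P_{nijk}\) is the probability that judge \(j\) awards examinee \(n\) category \(k\) on item \(i\), \(B_n\) is examinee ability, \(D_i\) is item difficulty, \(C_j\) is judge severity, and \(F_k\) is the step threshold for category \(k\). All parameters share one logit scale, which is what allows raw scores to be corrected for judge severity.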

6.
During the development of large‐scale curricular achievement tests, recruited panels of independent subject‐matter experts use systematic judgmental methods—often collectively labeled “alignment” methods—to rate the correspondence between a given test's items and the objective statements in a particular curricular standards document. High disagreement among the expert panelists may indicate problems with training, feedback, or other steps of the alignment procedure. Existing procedural recommendations for alignment reviews have been derived largely from single‐panel research studies; support for their use during operational large‐scale test development may be limited. Synthesizing data from more than 1,000 alignment reviews of state achievement tests, this study identifies features of test–standards alignment review procedures that impact agreement about test item content. The researchers then use their meta‐regression results to propose some practical suggestions for alignment review implementation.  相似文献   

7.
This study investigated the comparability of Angoff-based item ratings on a general education test battery made by judges from within-content specialties and across content domains. Judges were from English, mathematics, science, and social studies specialties in teacher education programs in a midwestern state. Cutscores established from the judges' ratings of out-of-content items differed little from the cutscores set using the ratings made by the content specialists. Further, out-of-content ratings by judges were not more influenced by performance data than were the ratings provided by judges rating items within their content specialty. The degree to which these results generalize to other content specialties needs to be investigated.
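As a concrete illustration of how Angoff ratings become a cutscore: each judge estimates, per item, the probability that a minimally competent examinee answers correctly, and the cutscore is the sum over items of the mean judged probability. A minimal Python sketch with hypothetical ratings:

```python
import numpy as np

# Rows = judges, columns = items; entries = judged probability that a
# minimally competent examinee answers the item correctly.
ratings = np.array([
    [0.60, 0.45, 0.80, 0.55],
    [0.70, 0.40, 0.75, 0.50],
    [0.65, 0.50, 0.85, 0.60],
])

item_means = ratings.mean(axis=0)  # average judgment per item
cut_score = item_means.sum()       # expected raw score of the borderline examinee
print(f"cut score: {cut_score:.2f} of {ratings.shape[1]} points")
```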

8.
An important part of test development is ensuring alignment between test forms and content standards. One common way of measuring alignment is the Webb (1997, 2007) alignment procedure. This article investigates (a) how well item writers understand components of the definition of Depth of Knowledge (DOK) from the Webb alignment procedure and (b) how consistent their DOK ratings are with ratings provided by other committees of educators across grade levels, content areas, and alternate assessment levels in a Midwestern state alternate assessment system. Results indicate that many item writers understand key features of DOK. However, some item writers struggled to articulate what DOK means and had some misconceptions. Additional analyses suggested some lack of consistency between the item writer DOK ratings and the committee DOK ratings. Some notable differences were found across alternate assessment levels and content areas. Implications for future item writing training and alignment studies are provided.
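One plausible way to quantify writer-committee consistency on ordinal DOK levels is a weighted kappa. A minimal sketch with hypothetical ratings; the study's actual consistency analyses are not specified here.

```python
from sklearn.metrics import cohen_kappa_score

writer_dok    = [1, 2, 2, 3, 1, 4, 2, 3]  # hypothetical item-writer DOK ratings
committee_dok = [1, 2, 3, 3, 2, 3, 2, 3]  # hypothetical committee consensus DOK

# Quadratic weights penalize large ordinal disagreements more heavily.
kappa = cohen_kappa_score(writer_dok, committee_dok, weights="quadratic")
print(f"quadratic-weighted kappa: {kappa:.2f}")
```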

9.
The use of content validity as the primary assurance of the measurement accuracy for science assessment examinations is questioned. An alternative accuracy measure, item validity, is proposed. Item validity is based on research using qualitative comparisons between (a) student answers to objective items on the examination, (b) clinical interviews with examinees designed to ascertain their knowledge and understanding of the objective examination items, and (c) student answers to essay examination items prepared as an equivalent to the objective examination items. Calculations of item validity are used to show that selected objective items from the science assessment examination overestimated the actual student understanding of science content. Overestimation occurs when a student correctly answers an examination item, but for a reason other than that needed for an understanding of the content in question. There was little evidence that students incorrectly answered the items studied for the wrong reason, resulting in underestimation of the students' knowledge. The equivalent essay items were found to limit the amount of mismeasurement of the students' knowledge. Specific examples are cited and general suggestions are made on how to improve the measurement accuracy of objective examinations.
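The abstract does not give the exact calculation, but one simple formalization of over- and underestimation is sketched below, pairing each student's objective-item answer with the understanding shown in an interview or essay. The index is an illustrative assumption, not the article's own formula.

```python
# Hypothetical per-student records for one objective item.
records = [
    {"correct": True,  "understands": True},
    {"correct": True,  "understands": False},  # overestimation
    {"correct": False, "understands": False},
    {"correct": False, "understands": True},   # underestimation
    {"correct": True,  "understands": True},
]

n = len(records)
overestimate  = sum(r["correct"] and not r["understands"] for r in records) / n
underestimate = sum(r["understands"] and not r["correct"] for r in records) / n
item_validity = sum(r["correct"] == r["understands"] for r in records) / n

print(f"overestimation: {overestimate:.0%}, "
      f"underestimation: {underestimate:.0%}, "
      f"agreement-based item validity: {item_validity:.0%}")
```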

10.
Variation in test performance among examinees from different regions or national jurisdictions is often partially attributed to differences in the degree of content correspondence between local school or training program curricula, and the test of interest. This posited relationship between test-curriculum correspondence, or "alignment," and test performance is usually inferred from highly distal evidence, rather than directly examined. Utilizing mathematics standards content analysis data and achievement test item data from ten U.S. states, we examine the relationship between topic-specific alignment and test item performance. When a particular item's content type is emphasized by the standards, we find evidence of a positive relationship between the alignment measure and proportion-correct test item difficulty, although this effect is not consistent across samples. Implications of the results for curricular achievement test development and score interpretation are discussed.

11.
12.
Zhou Qun. 《考试研究》 (Examinations Research), 2012(6): 3-19
The definition and structure of subject-specific abilities occupy a central position in academic proficiency testing and lie at the core of the tests' validity. Guided by a theoretical framework model for educational test design, this article proposes establishing the relationship between the student model for senior high school academic proficiency and the curriculum content standards, argues that the definition and structure of the subject abilities measured by academic proficiency tests should be derived from the curriculum content standards, and sets out procedures and methods for determining them, using the Shanghai senior high school academic proficiency test in Ideological and Political Studies as an illustration. Item writing, test assembly, and the development of scoring standards are the central stages of academic proficiency testing. According to the task model and scoring model in the framework, every item written must assess specified subject knowledge and cognitive skills; the cognitive skills an item assesses should be essentially consistent with the cognitive demands that the examination syllabus or the curriculum content standards attach to the corresponding knowledge content; the scoring criteria in each item's rubric should be consistent with the cognitive elements of the skills the item assesses; and the score levels should genuinely reflect substantive differences among examinees on those skills. A high-quality test paper further requires that the cognitive structure and knowledge-content structure it assesses remain essentially consistent with those of the curriculum content standards, and that the consistency between the cognitive abilities assessed by the paper as a whole and the cognitive demands of the standards reach specified benchmarks. Under current testing practice in China, keeping a paper's knowledge-content structure fully consistent with that of the curriculum content standards is difficult. A feasible operational approach at present is to require that the knowledge-content domains (or first-level topics) a paper assesses, and their weights, remain essentially consistent with the curriculum content standards; that each domain cover as many knowledge-content topics (or second-level topics) as possible; and that each domain focus primarily on core knowledge content, with the weight of core content clearly exceeding that of secondary content.
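Where the abstract calls for a paper's content-domain weights to remain essentially consistent with the standards, one widely used summary is Porter's alignment index (1 means identical weight distributions, 0 means disjoint). A minimal sketch with hypothetical domains and weights; the article does not prescribe this particular statistic.

```python
# Compare a test blueprint's content-domain weights against the
# weights implied by the curriculum content standards.
standards_weights = {"Economics": 0.30, "Philosophy": 0.25,
                     "Politics": 0.25, "Culture": 0.20}
test_weights      = {"Economics": 0.35, "Philosophy": 0.20,
                     "Politics": 0.25, "Culture": 0.20}

domains = standards_weights.keys()
porter_index = 1 - sum(abs(standards_weights[d] - test_weights[d])
                       for d in domains) / 2
print(f"Porter alignment index: {porter_index:.2f}")  # 0.95 here
```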

13.
The term measurement disturbance has been used to describe systematic conditions that affect a measurement process, resulting in a compromised interpretation of person or item estimates. Measurement disturbances have been discussed in relation to systematic response patterns associated with items and persons, such as start‐up, plodding, boredom, or fatigue. An understanding of the different types of measurement disturbances can lead to a more complete understanding of persons or items in terms of the construct being measured. Although measurement disturbances have been explored in several contexts, they have not been explicitly considered in the context of performance assessments. The purpose of this study is to illustrate the use of graphical methods to explore measurement disturbances related to raters within the context of a writing assessment. Graphical displays that illustrate the alignment between expected and empirical rater response functions are considered as they relate to indicators of rating quality based on the Rasch model. Results suggest that graphical displays can be used to identify measurement disturbances for raters related to specific ranges of student achievement that suggest potential rater bias. Further, results highlight the added diagnostic value of graphical displays for detecting measurement disturbances that are not captured using Rasch model–data fit statistics.
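A minimal sketch of the kind of display described: the empirical rater response function (binned mean ratings) overlaid on the model-expected function under a rating scale model. The severity value, thresholds, and simulated data below are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
severity = 0.4                           # hypothetical rater severity (logits)
thresholds = np.array([-1.0, 0.0, 1.2])  # hypothetical RSM thresholds, 4 categories

def category_probs(theta):
    """Rating-scale-model category probabilities for this rater."""
    logits = np.concatenate(([0.0], np.cumsum(theta - severity - thresholds)))
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta = rng.normal(size=500)             # examinee measures
observed = np.array([rng.choice(4, p=category_probs(t)) for t in theta])

# Empirical function: mean observed rating within ability bins.
bins = np.linspace(-3, 3, 13)
centers = (bins[:-1] + bins[1:]) / 2
idx = np.digitize(theta, bins) - 1
empirical = [observed[idx == b].mean() if (idx == b).any() else np.nan
             for b in range(len(centers))]

# Expected function: model-expected rating across the measure range.
grid = np.linspace(-3, 3, 200)
expected = [sum(k * p for k, p in enumerate(category_probs(t))) for t in grid]

plt.plot(grid, expected, label="expected (model)")
plt.plot(centers, empirical, "o", label="empirical (binned)")
plt.xlabel("student measure (logits)")
plt.ylabel("rating")
plt.legend()
plt.show()
```

Gaps between the two curves in a specific measure range, rather than overall misfit, are what would flag a localized disturbance such as rater bias against one achievement level.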

14.
It has long been argued that U.S. states’ differential performance on nationwide assessments may reflect differences in students’ opportunity to learn the tested content that is primarily due to variation in curricular content standards, rather than in instructional quality or educational investment. To quantify the effect of differences in states’ intended curricular goals on test item performance in the mid-to-late 2000s, we use fractional logit regression of state-specific mathematics item difficulty values on a measure of content emphasis in state elementary school mathematics curricular standards documents. Finding weak but positive associations between content emphasis in state standards and proportion-correct item difficulty, we conclude that variations in states’ intended curriculum content, alone, appear to have had limited influence on cross-state mathematics test item performance during the time frame examined. Implications for cross-state assessment are discussed.
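Fractional logit regression treats a proportion in [0, 1] as the outcome of a binomial-family GLM with a logit link and robust standard errors (the Papke-Wooldridge estimator). A minimal sketch with simulated data; variable names and the data-generating values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
emphasis = rng.uniform(0, 1, 200)   # standards emphasis on each item's topic
# Simulated proportion-correct difficulty with a weak positive effect.
p_correct = 1 / (1 + np.exp(-(-0.3 + 0.5 * emphasis
                              + rng.normal(0, 0.5, 200))))

X = sm.add_constant(emphasis)
model = sm.GLM(p_correct, X, family=sm.families.Binomial())
result = model.fit(cov_type="HC1")  # robust SEs for the fractional outcome
print(result.summary())
```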

15.
Evidence to support the credibility of standard setting procedures is a critical part of the validity argument for decisions made based on tests that are used for classification. One area in which there has been limited empirical study is the impact of standard setting judge selection on the resulting cut score. One important issue related to judge selection is whether the extent of judges’ content knowledge impacts their perceptions of the probability that a minimally proficient examinee will answer the item correctly. The present article reports on two studies conducted in the context of Angoff‐style standard setting for medical licensing examinations. In the first study, content experts answered and subsequently provided Angoff judgments for a set of test items. After accounting for perceived item difficulty and judge stringency, answering the item correctly accounted for a significant (and potentially important) impact on expert judgment. The second study examined whether providing the correct answer to the judges would result in a similar effect to that associated with knowing the correct answer. The results suggested that providing the correct answer did not impact judgments. These results have important implications for the validity of standard setting outcomes in general and on judge recruitment specifically.

16.
Validity evidence based on test content is critical to meaningful interpretation of test scores. Within high-stakes testing and accountability frameworks, content-related validity evidence is typically gathered via alignment studies, with panels of experts providing qualitative judgments on the degree to which test items align with the representative content standards. Various summary statistics are then calculated (e.g., categorical concurrence, balance of representation) to aid in decision-making. In this paper, we propose an alternative approach for gathering content-related validity evidence that capitalizes on the overlap in vocabulary used in test items and the corresponding content standards, which we define as textual congruence. We use a text-based, machine learning model, specifically topic modeling, to identify clusters of related content within the standards. This model then serves as the basis from which items are evaluated. We illustrate our method by building a model from the Next Generation Science Standards, with textual congruence evaluated against items within the Oregon statewide alternate assessment. We discuss the utility of this approach as a source of triangulating and diagnostic information and show how visualizations can be used to evaluate the overall coverage of the content standards across the test items.
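A compact sketch of the textual-congruence idea: fit a topic model on the standards statements, project items into the same topic space, and score each item by its similarity to the standards' topic profiles. The texts, component count, and the cosine-based congruence score below are illustrative assumptions, not the authors' implementation.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

standards = [
    "analyze and interpret data on natural hazards to forecast events",
    "develop models to describe the cycling of matter in ecosystems",
    "plan an investigation of the forces between interacting objects",
]
items = [
    "Which graph best shows how often earthquakes occur near the coast?",
    "Draw a model showing how carbon moves between plants and the air.",
]

vec = CountVectorizer(stop_words="english")
X_std = vec.fit_transform(standards)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_std)

std_topics = lda.transform(X_std)            # standards in topic space
item_topics = lda.transform(vec.transform(items))  # items in the same space

# Congruence: each item's best cosine match among standards profiles.
sims = cosine_similarity(item_topics, std_topics)
for text, row in zip(items, sims):
    print(f"{row.max():.2f}  {text[:50]}")
```

Low maximum similarity across all standards would flag an item, or a region of the standards, for closer expert review, which is the triangulating use the authors describe.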

17.
This study investigated the effect of complex structure on dimensionality assessment in compensatory multidimensional item response models using DETECT- and NOHARM-based methods. Performance was evaluated via the accuracy of identifying the correct number of dimensions and the ability to accurately recover item groupings using a simple matching similarity (SM) coefficient. The DETECT-based methods yielded a higher proportion correct than the NOHARM-based methods in two- and three-dimensional conditions, especially when correlations were ≤.60, data exhibited ≤30% complexity, and sample size was 1,000. As complexity increased and sample size decreased, the performance of the methods typically diminished. The NOHARM-based methods were either equally successful or better at recovering item groupings than the DETECT-based methods and were affected mostly by complexity levels. The DETECT-based methods were affected largely by test length, such that as the number of items increased, SM coefficients decreased substantially.
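An SM coefficient for item-grouping recovery can be computed over item pairs: the share of pairs on which the generating and recovered partitions agree, either placing the pair together in both or apart in both. A minimal sketch; the study's exact computation may differ.

```python
from itertools import combinations

def simple_matching(true_groups, est_groups):
    """Pairwise agreement between two item partitions."""
    n = len(true_groups)
    agree = sum(
        (true_groups[i] == true_groups[j]) == (est_groups[i] == est_groups[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) / 2)

true_dims = [0, 0, 0, 1, 1, 2, 2, 2]  # generating item-dimension assignment
recovered = [0, 0, 1, 1, 1, 2, 2, 0]  # hypothetical recovered assignment
print(f"SM coefficient: {simple_matching(true_dims, recovered):.2f}")
```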

18.
This study investigated using latent class analysis to set performance standards for assessments comprised of multiple-choice and performance assessment items. Employing this procedure, it is possible to use a sample of student responses to accomplish four goals: (a) determine how well a specified latent structure fits student performance data; (b) determine which latent structure best represents the relationships in the data; (c) obtain estimates of item parameters for each latent class; and (d) identify to which class within that latent structure each response pattern most likely belongs. Comparisons with the Angoff and profile rating methods revealed that the approaches agreed with each other quite well, indicating that both empirical and test-based judgmental approaches may be used for setting performance standards for student achievement.
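A minimal EM sketch for a two-class latent class model on binary item responses, the kind of latent structure that could underpin the approach described; the data, class count, and starting values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulate: "masters" answer correctly with p=0.8, "non-masters" with p=0.3.
z = rng.random(300) < 0.5
Y = (rng.random((300, 6)) < np.where(z[:, None], 0.8, 0.3)).astype(float)

pi = np.array([0.5, 0.5])             # class proportions (starting values)
p = np.array([[0.6] * 6, [0.4] * 6])  # item-correct probabilities per class

for _ in range(200):
    # E-step: posterior class membership for each response pattern.
    like = np.stack([
        (p[k] ** Y * (1 - p[k]) ** (1 - Y)).prod(axis=1) * pi[k]
        for k in range(2)
    ], axis=1)
    post = like / like.sum(axis=1, keepdims=True)
    # M-step: update proportions and class-conditional item probabilities.
    pi = post.mean(axis=0)
    p = (post.T @ Y) / post.sum(axis=0)[:, None]

print("class proportions:", pi.round(2))
print("item-correct probabilities by class:\n", p.round(2))
```

A cut would then be placed between classes, with each response pattern assigned to the class with the highest posterior probability, and the result compared to judgmental cutscores such as Angoff's.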

19.
Michael Scriven has suggested that student rating forms, for the purpose of evaluating college teaching, be designed for multiple audiences (instructor, administrator, student), and with a single global item for summative functions (determination of merit, retention, or promotion). This study reviewed approaches to rating form construction, e.g., factor analytic strategies of Marsh, and recommended the multiple audience design of Scriven. An empirical test of the representativeness of the single global item was reported from an analysis of 1,378 forms collected in a university department of education. The global item correlated most satisfactorily with other items, a computed total of items, items that represented underlying factors, and various triplets of items selected to represent all possible combinations of items. It was concluded that a multiple audience rating form showed distinct advantages in design and that the single global item most fairly and highly represented the overall teaching performance, as judged by students, for decisions about retention, promotion, and merit made by administrators.  相似文献   

20.
The geography examination standards and the technical indicators for test paper structure under the new-curriculum college entrance examination (Gaokao) are grounded in the 《普通高中课程方案(实验)》 (General Senior High School Curriculum Plan, Experimental) and the 《普通高中地理课程标准(实验)》 (General Senior High School Geography Curriculum Standards, Experimental). The technical indicators for the geography paper cover the paper's structural model together with its content, objective, item-type, difficulty, score, time-limit, length, and equating elements. These indicators serve as the standard for writing, reviewing, evaluating, and monitoring new-curriculum Gaokao geography items, provide the basis for quality control of papers and items, and offer a systematic, operational set of standards and methods for evaluating the quality of Gaokao geography papers and items.
