Similar Documents
Found 20 similar documents (search time: 46 ms)
1.
The purpose of this article is to address a major gap in the instructional sensitivity literature on how to develop instructionally sensitive assessments. We propose an approach to developing and evaluating instructionally sensitive assessments in science and test this approach with one elementary life‐science module. The assessment we developed was administered to 125 students in seven classrooms. The development approach considered three dimensions of instructional sensitivity; that is, assessment items should: represent the curriculum content, reflect the quality of instruction, and have formative value for teaching. Focusing solely on the first dimension, representation of the curriculum content, this study was guided by the following research questions: (1) What science module characteristics can be systematically manipulated to develop items that prove to be instructionally sensitive? and (2) Are the instructionally sensitive assessments developed sufficiently valid to make inferences about the impact of instruction on students' performance? In this article, we describe our item development approach and provide empirical evidence to support validity arguments about the developed instructionally sensitive items. Results indicated that: (1) manipulations of the items at different proximities to vary their sensitivity were aligned with the rules for item development and also corresponded with pre‐to‐post gains; and (2) the items developed at different distances from the science module showed a pattern of pre‐to‐post gain consistent with their instructional sensitivity, that is, the closer the items were to the science module, the larger the observed gains and effect sizes. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 691–712, 2012
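As a worked illustration of the gain pattern described above, the Python sketch below computes per-group pre-to-post gains and a paired effect size for item sets at different proximities to a module. The data, group labels, and effect-size convention (mean gain divided by the SD of gains) are our assumptions, not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 125

def gain_and_effect_size(pre, post):
    """Mean pre-to-post gain and a paired effect size (mean gain / SD of gains)."""
    gains = post.mean(axis=1) - pre.mean(axis=1)  # per-student proportion correct
    return gains.mean(), gains.mean() / gains.std(ddof=1)

# Hypothetical scored responses (students x items), grouped by the item's
# "distance" from the module; closer items get larger instruction-driven gains.
for label, gain in [("close", 0.30), ("medium", 0.15), ("far", 0.05)]:
    pre = rng.binomial(1, 0.4, size=(n_students, 10))
    post = rng.binomial(1, 0.4 + gain, size=(n_students, 10))
    g, d = gain_and_effect_size(pre, post)
    print(f"{label:6s}  mean gain = {g:+.2f}   effect size d = {d:.2f}")
```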

2.
Although there is a common understanding of instructional sensitivity, it lacks a common operationalization. Various approaches have been proposed, some focusing on item responses, others on test scores. As approaches often do not produce consistent results, previous research has created the impression that approaches to instructional sensitivity are noticeably fragmented. To counter this impression, we present an item response theory–based framework that can help us to understand similarities and differences between existing approaches. Using empirical data for illustration, this article identifies three perspectives on instructional sensitivity: One perspective views instructional sensitivity as the capacity to detect differences in students' stages of learning across points in time. A second perspective treats instructional sensitivity as the capacity to detect differences between groups that have received different instruction. For a third perspective, the previous two are combined to consider differences between both time points and groups. We discuss linking sensitivity indices to measures of instruction.
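To make the three perspectives concrete, the LaTeX sketch below restates them as shifts in Rasch item difficulty. The notation is ours; it is only one plausible way to instantiate the framework the abstract describes.

```latex
% One plausible formalization (our notation): under a Rasch model,
% instructional sensitivity appears as a shift in item difficulty b_i.
\begin{align*}
  \Pr(X_{pi}=1) &= \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)} \\[4pt]
  \delta_i^{\text{time}}  &= b_i^{(\text{pre})} - b_i^{(\text{post})}
    && \text{(items become easier across time points)} \\
  \delta_i^{\text{group}} &= b_i^{(\text{control})} - b_i^{(\text{treated})}
    && \text{(difference between instructed groups)} \\
  \delta_i^{\text{both}}  &= \bigl(b_i^{(\text{pre,trt})} - b_i^{(\text{post,trt})}\bigr)
                           - \bigl(b_i^{(\text{pre,ctl})} - b_i^{(\text{post,ctl})}\bigr)
    && \text{(time} \times \text{group interaction)}
\end{align*}
```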

3.
Validation of assessments intended to improve instruction and learning should include evidence of instructional sensitivity. This study investigated the instructional sensitivity of a standards-based ninth-grade performance assessment that required students to write an essay about conflict in a literary work. Before administering the assessment, teachers of 886 ninth-grade students were randomly assigned to one of three instructional groups: literary analysis, organization of writing, and teacher-selected instruction. Despite the short duration of instruction (8 class periods), results support the instructional sensitivity of the assessment in two ways: Instruction on literary analysis significantly improved students' ability to analyze and describe conflicts in literature, and instruction on the organization of writing led to significantly higher scores on measures of coherence and organization.

4.
Views on testing—its purpose and uses and how its data are analyzed—are related to one's perspective on test takers. Test takers can be viewed as learners, examinees, or contestants. I briefly discuss the perspective of test takers as learners. I maintain that much of psychometrics views test takers as examinees. I discuss test takers as contestants in some detail. Test takers who are contestants in high‐stakes settings want reliable outcomes obtained via acceptable scoring of tests administered under clear rules. In addition, it is essential to empirically verify interpretations attached to scores. At the very least, item and test scores should exhibit certain invariance properties. I note that the “do no harm” dictum borrowed from the field of medicine is particularly relevant to the perspective of test takers as contestants.

5.
In recent years, students’ test scores have been used to evaluate teachers’ performance. The assumption underlying this practice is that students’ test performance reflects teachers’ instruction. However, this assumption is generally not empirically tested. In this study, we examine the effect of teachers’ instruction on test performance at the item level using a hierarchical differential item functioning approach. The items are from the U.S. TIMSS 2011 4th-grade math test. Specifically, we tested whether students who had received instruction on a given item performed significantly better on that item compared with students who had not received such instruction when their overall math ability was controlled for, both with and without controlling for student-level and class-level covariates. This study provides preliminary findings regarding why some items show instructional sensitivity and sheds light on how to develop instructionally sensitive items. Implications and directions for further research are also discussed.
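The hierarchical DIF logic, testing whether an instruction indicator predicts an item response once overall ability is held constant, can be sketched in a single-level logistic analogue. Everything below (data, variable names, the statsmodels fit) is illustrative, not the study's model or the TIMSS data.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data for one item: scored response (0/1), overall math ability,
# and whether the student's class received instruction covering the item's content.
rng = np.random.default_rng(1)
n = 2000
ability = rng.normal(size=n)
instructed = rng.integers(0, 2, size=n)
# Simulate an instructionally sensitive item: instruction shifts the log-odds.
logit_p = -0.5 + 1.2 * ability + 0.8 * instructed
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Logistic DIF: does 'instructed' predict the response with ability held constant?
X = sm.add_constant(np.column_stack([ability, instructed]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)  # a positive coefficient on 'instructed' flags instructional DIF
```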

6.
This article reports on analyses of the instructional practices of six middle- and high-school science teachers in the United States who participated in a research-practice partnership that aims to support reform science education goals at scale. All six teachers were well qualified, experienced, and locally successful—respected by students, parents, colleagues, and administrators—but they differed in their success in supporting students' three-dimensional learning. Our goal is to understand how the teachers' instructional practices contributed to their similarities in achieving local success and to differences in enabling students' learning, and to consider the implications of these findings for research-practice partnerships. Data sources included classroom videos supplemented by interviews with teachers and focus students and examples of student work. We also compared students' learning gains by teacher using pre–post assessments that elicited three-dimensional performances. Analyses of classroom videos showed how all six teachers achieved local success—they led effectively managed classrooms, covered the curriculum by teaching almost all unit activities, and assessed students' work in fair and efficient ways. There were important differences, however, in how teachers engaged students in science practices. Teachers in classrooms where students achieved lower learning gains followed a pattern of practice we describe as activity-based teaching, in which students completed investigations and hands-on activities with few opportunities for sensemaking discussions or three-dimensional science performances. Teachers whose students achieved higher learning gains combined the social stability characteristic of local classroom success with more demanding instructional practices associated with scientific sensemaking and cognitive apprenticeship. We conclude with a discussion of implications for research-practice partnerships, highlighting how partnerships need to support all teachers in achieving both local and standards-based success.

7.
Students’ performance in assessments is commonly attributed to more or less effective teaching. This implies that students’ responses are significantly affected by instruction. However, the assumption that outcome measures indeed are instructionally sensitive is scarcely investigated empirically. In the present study, we propose a longitudinal multilevel‐differential item functioning (DIF) model that combines two existing yet independent approaches to evaluating items’ instructional sensitivity. The model permits a more informative judgment of instructional sensitivity, allowing a distinction between global and differential sensitivity. As an illustration, the model is applied to two empirical data sets, with classical indices (Pretest–Posttest Difference Index and posttest multilevel‐DIF) computed for comparison. Results suggest that the approach works well when applied to empirical data and may provide important information to test developers.
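Of the classical indices mentioned, the Pretest–Posttest Difference Index is the simplest: an item's post-instruction p-value minus its pre-instruction p-value. A minimal sketch with made-up response matrices:

```python
import numpy as np

def ppdi(pre_responses, post_responses):
    """Pretest-Posttest Difference Index: post p-value minus pre p-value, per item.
    Values near 0 suggest insensitivity; larger positive values suggest sensitivity."""
    return (np.asarray(post_responses).mean(axis=0)
            - np.asarray(pre_responses).mean(axis=0))

# Hypothetical scored response matrices (students x items), 0/1.
rng = np.random.default_rng(2)
pre = rng.binomial(1, [0.30, 0.45, 0.50], size=(200, 3))
post = rng.binomial(1, [0.70, 0.50, 0.52], size=(200, 3))
print(ppdi(pre, post))  # item 1 looks sensitive; items 2-3 barely move
```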

8.
This paper illustrates that the psychometric properties of scores and scales that are used with mixed‐format educational tests can impact the use and interpretation of the scores that are reported to examinees. The psychometric properties considered include reliability and conditional standard errors of measurement. The focus is on mixed‐format tests in situations for which raw scores are integer‐weighted sums of item scores. Four associated real‐data examples include (a) effects of weights associated with each item type on reliability, (b) comparison of psychometric properties of different scale scores, (c) evaluation of the equity property of equating, and (d) comparison of the use of unidimensional and multidimensional procedures for evaluating psychometric properties. Throughout the paper, and especially in the conclusion section, the examples are related to issues associated with test interpretation and test use.
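The paper's own analyses are not reproduced here, but one standard way to estimate the reliability of an integer-weighted mixed-format composite is stratified alpha. The sketch below assumes a hypothetical test with a multiple-choice section (weight 1) and a constructed-response section (weight 3); the data and weights are invented for illustration.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a students x items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum() / total_var)

def stratified_alpha(strata, weights):
    """Stratified alpha for a composite of integer-weighted sections."""
    parts = [w * np.asarray(s, dtype=float).sum(axis=1)
             for s, w in zip(strata, weights)]
    composite_var = np.sum(parts, axis=0).var(ddof=1)
    err = sum(p.var(ddof=1) * (1 - cronbach_alpha(s))
              for p, s in zip(parts, strata))
    return 1 - err / composite_var

# Hypothetical mixed-format test: 20 MC items (weight 1) and 4 CR items (weight 3).
rng = np.random.default_rng(3)
theta = rng.normal(size=500)
mc = (rng.normal(size=(500, 20)) < theta[:, None]).astype(int)
cr = np.clip(np.round(theta[:, None] + rng.normal(0, 1.2, size=(500, 4)) + 2), 0, 4)
print(round(stratified_alpha([mc, cr], weights=[1, 3]), 3))
```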

9.
Over the past few decades, those who take tests in the United States have exhibited increasing diversity with respect to native language. Standard psychometric procedures for ensuring item and test fairness that have existed for some time were developed when test‐taking groups were predominantly native English speakers. A better understanding of the potential influence that insufficient language proficiency may have on the efficacy of these procedures is needed. This paper represents a first step in arriving at this better understanding. We begin by addressing some of the issues that arise in a context in which assessments in a language such as English are taken increasingly by groups that may not possess the language proficiency needed to take the test. For illustrative purposes, we use the first‐language status of a test taker as a surrogate for language proficiency and describe an approach to examining how the results of fairness procedures are affected by inclusion or exclusion of those who report that English is not their first language in the fairness analyses. Furthermore, we explore the sensitivity of the results of these procedures, differential item functioning (DIF) and score equating, to potential shifts in population composition. We employ data from a large‐volume testing program for this illustrative purpose. The equating results were not affected by either inclusion or exclusion of such test takers in the analysis sample, or by shifts in population composition. The effect on DIF results, however, varied across focal groups.
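The abstract does not name a specific DIF statistic; the Mantel–Haenszel common odds ratio is one standard operationalization and serves as an illustrative sketch here. The simulated data, the coarse score strata, and the choice of focal group are all our assumptions.

```python
import numpy as np

def mantel_haenszel_odds_ratio(correct, group, strata):
    """MH common odds ratio across matching-score strata (group=1: focal).
    alpha_MH > 1 means the reference group outperforms the matched focal group."""
    num = den = 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum(correct[m] & (group[m] == 0))   # reference correct
        b = np.sum(~correct[m] & (group[m] == 0))  # reference incorrect
        c = np.sum(correct[m] & (group[m] == 1))   # focal correct
        d = np.sum(~correct[m] & (group[m] == 1))  # focal incorrect
        n = a + b + c + d
        if n:
            num += a * d / n
            den += b * c / n
    return num / den

rng = np.random.default_rng(4)
n = 4000
group = rng.integers(0, 2, size=n)  # 1 = focal (e.g., English-not-first-language)
theta = rng.normal(size=n)
total = np.clip(np.round(theta * 3 + 15), 0, 30).astype(int)  # matching score
p = 1 / (1 + np.exp(-(theta - 0.3 * group)))  # item mildly disadvantages focal group
correct = rng.random(n) < p
alpha_mh = mantel_haenszel_odds_ratio(correct, group, total // 3)  # coarse strata
print(alpha_mh, -2.35 * np.log(alpha_mh))  # ETS delta metric: MH D-DIF
```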

10.
Explicit instructions to “be creative” often are used to estimate the role of task-perception in divergent thinking test performance; however, previous research on this topic has employed only nongifted individuals. The present investigation compared gifted (n = 97), talented (n = 53), and nongifted (n = 90) intermediate school children in terms of divergent thinking fluency, flexibility, and originality scores elicited by standard and explicit instructions. Results indicated that the scores of all groups were significantly different in the two instructional conditions. More importantly, there was a significant interaction between this instructional effect and children's level of ability. The explicit instructions enhanced the originality scores of the talented and nongifted children more than those of the gifted children, and the same instructions inhibited the fluency and flexibility scores of the gifted children more than those of the talented and nongifted children. These results have important implications for testing creativity and for our understanding of giftedness.

11.
This study examined the use of self-report to measure training transfer by comparing three training transfer assessment methods. Applying a framework provided by instructional alignment theory, the study tested the hypothesis that if training transfer is measured by assessment methods that vary in their degree of alignment with the post-training learning assessment, the training transfer scores would be higher as the degree of alignment increased. Instructional alignment is the extent to which stimulus conditions match across instructional components: intended outcomes, instructional processes, and assessment. Three training transfer assessments were administered to 40 telecommunications technicians approximately 60 days after they completed a training course. The mean transfer score for the job performance assessment with high alignment was significantly higher than the means for the two self-report assessments of moderate and low alignment, with effect size differences of 0.96 (p < .01) and 0.87 (p < .01), respectively. The mean scores for the two self-report assessments did not significantly differ. This study has implications for the extensive use of self-report to assess training transfer in both research and training evaluation programs.

12.
Assessments of student learning outcomes (SLO) have been widely used in higher education for accreditation, accountability, and strategic planning purposes. Although important to institutions, the assessment results typically bear no consequence for individual students. It is important to clarify the relationship between motivation and test performance and identify practical strategies to boost students' motivation in test taking. This study designed an experiment to examine the effectiveness of a motivational instruction. The instruction increased examinees' self-reported test-taking motivation by .89 standard deviations (SDs) and test scores by .63 SDs. Students receiving the instruction spent an average of 14 more seconds on an item than students in the control group. The score difference between experimental and control groups narrowed to .23 SDs after unmotivated students, identified by low response time, were removed from the analyses. The findings provide important implications for higher education institutions that administer SLO assessments in a low-stakes setting.
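The response-time filter mentioned above can be sketched directly: flag examinees whose response times suggest rapid, unmotivated responding and recompute the group difference without them. The data, the 10-second threshold, and the pooled-SD effect size are illustrative assumptions, not the study's procedure.

```python
import numpy as np

def effect_size(a, b):
    """Cohen's d with pooled SD."""
    na, nb = len(a), len(b)
    sp = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                 / (na + nb - 2))
    return (a.mean() - b.mean()) / sp

rng = np.random.default_rng(5)
# Hypothetical low-stakes data: scores, median item response time (seconds),
# and whether the examinee received the motivational instruction.
instructed = rng.integers(0, 2, size=600)
rt = rng.gamma(shape=4, scale=8, size=600) + 6 * instructed  # instructed respond slower
score = rng.normal(50 + 6 * instructed, 10)

d_all = effect_size(score[instructed == 1], score[instructed == 0])
motivated = rt >= 10  # simple rapid-responding threshold (our assumption)
d_filt = effect_size(score[(instructed == 1) & motivated],
                     score[(instructed == 0) & motivated])
print(f"d (all) = {d_all:.2f}, d (motivated only) = {d_filt:.2f}")
```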

13.
Research Findings: Using data from a short-term longitudinal study of 343 third-, fourth-, and fifth-grade students, we investigated visual-motor integration (VMI) skills as a predictor of direct assessments of executive functions (EFs) and academic achievement. This is the first study to investigate relations among these three constructs in late elementary school. VMI predicted change in EFs from fall to spring. EFs and VMI were independently associated with math and English/language arts standardized test scores. When controlling for earlier achievement test scores, EFs—but not VMI—remained a significant predictor of later academic achievement. Results indicate that VMI may help support the continued development of EFs in late elementary school, but EFs appear to be comparatively more important as a direct predictor of continued academic development during this age period. Practice or Policy: VMI is a complex ability that combines fine motor coordination (an aspect of school readiness) and visual-spatial reasoning skills. VMI has been identified as an influential predictor of early academic development, but it has been neglected in middle childhood studies. Our results suggest that VMI remains important through the end of elementary school for the continued development of children’s EFs and therefore merits more attention from researchers and educators.

14.
15.
When a computerized adaptive testing (CAT) version of a test co-exists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from its P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review research literature on CAT comparability issues and synthesize issues specific to these two settings. A framework of criteria for evaluating comparability was developed that contains the following three categories of criteria: validity criterion, psychometric property/reliability criterion, and statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria as well as various algorithms for improving comparability are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of the development of a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of incomparability may be due to the change from number-correct score-based scoring to IRT ability estimation-based scoring. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size, were found to result in different levels of comparability in test scores.
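The finding that incomparability stems partly from replacing number-correct scoring with IRT ability estimation can be made concrete: the two scores are different functions of the same response pattern. Below is a minimal Rasch EAP sketch with simulated difficulties and responses; it is not the ACT pool or the paper's scoring procedure.

```python
import numpy as np

def eap_theta(responses, b, nodes=61):
    """EAP ability estimate under a Rasch model with known item difficulties b."""
    theta = np.linspace(-4, 4, nodes)
    prior = np.exp(-0.5 * theta**2)                       # N(0,1) up to a constant
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))  # nodes x items
    like = np.prod(np.where(responses[None, :] == 1, p, 1 - p), axis=1)
    post = prior * like
    return np.sum(theta * post) / np.sum(post)

rng = np.random.default_rng(6)
b = rng.normal(size=30)
resp = (rng.random(30) < 1 / (1 + np.exp(-(0.5 - b)))).astype(int)  # true theta = 0.5
print("number-correct:", resp.sum(), "  EAP theta:", round(eap_theta(resp, b), 2))
```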

16.
Research Findings: The objective of this study was to understand how two dimensions of parent–child book-reading quality—instructional and emotional—interact and relate to learning in a sample of low-income infants and toddlers. Participants included 81 parents and their children from Early Head Start programs in the rural Midwest. Correlation and multiple regression analyses were used to test the hypothesis that parental book-reading qualities interact and relate to children's concurrent cognitive and language scores. Exploratory analyses examined whether patterns of relationships varied for families who had different home languages (i.e., English, Spanish). Results showed that book-reading qualities and home language interacted to predict child scores. Practice or Policy: Findings suggest a need to further explore potentially complex patterns of relationships among parental book-reading behaviors and child learning for diverse families. Understanding these patterns could inform the development of culturally-sensitive intervention approaches designed to support high-quality shared book reading.

17.
Building achievement tests that are sensitive to the instructional effects of school programs concerns both practitioners and researchers in education. To produce such tests, empirical procedures to guide item selection are needed. In this paper, an operational framework and a set of empirical procedures for this task are presented. Within this framework, item sensitivity is linked to instructional implementation. A simple components of variance model has been used to provide actual estimates of instructional sensitivity. These procedures are illustrated using data from a comparative study of alternative item formats for a criterion-referenced test. Even when items were closely matched to instructional content specifications, important differences in instructional sensitivity emerged. These differences were found between the same items presented in different formats as well as between different items presented within the same format. Implications of these results for developing criterion-referenced achievement tests are discussed.
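One reading of the "simple components of variance model" is a one-way random-effects decomposition in which the share of score variance lying between classrooms indexes sensitivity to classroom-level instruction. The sketch below follows that reading; equal class sizes and all data are assumed for illustration.

```python
import numpy as np

def between_class_variance_share(scores_by_class):
    """One-way random-effects variance components for equal-size classes:
    returns the share of score variance lying between classrooms, a simple
    proxy for sensitivity to classroom-level instruction."""
    k = len(scores_by_class)
    n = len(scores_by_class[0])                 # assumes equal class sizes
    class_means = np.array([np.mean(c) for c in scores_by_class])
    grand = class_means.mean()
    msb = n * np.sum((class_means - grand) ** 2) / (k - 1)   # between-class MS
    msw = np.mean([np.var(c, ddof=1) for c in scores_by_class])  # within-class MS
    var_between = max((msb - msw) / n, 0.0)
    return var_between / (var_between + msw)

rng = np.random.default_rng(7)
# 12 hypothetical classrooms of 25 students whose class means vary with instruction.
classes = [rng.normal(loc=mu, scale=1.0, size=25) for mu in rng.normal(0, 0.6, size=12)]
print(round(between_class_variance_share(classes), 2))
```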

18.
The accuracy of achievement test score inferences largely depends on the sensitivity of scores to instruction focused on tested objectives. Sensitivity requirements are particularly challenging for standards-based assessments because a variety of plausible instructional differences across classrooms must be detected. For this study, we developed a new method for capturing the alignment between how teachers bring standards to life in their classrooms and how the standards are defined on a test. Teachers were asked to report the degree to which they emphasized the state's academic standards, and to describe how they taught certain objectives from the standards. Two curriculum experts judged the alignment between how teachers brought the objectives to life in their classrooms and how the objectives were operationalized on the state test. Emphasis alone did not account for achievement differences among classrooms. The best predictors of classroom achievement were the match between how the standards were taught and tested, and the interaction between emphasis and match, indicating that test scores were sensitive to instruction of the standards, but in a narrow sense.

19.
Drawing inferences about the extent to which student performance reflects instructional opportunities relies on the premise that the measure of student performance is reflective of instructional opportunities. An instructional sensitivity framework suggests that some assessments are more sensitive to detecting differences in instructional opportunities than others are. This study applies an instructional sensitivity framework to compare student performance on two different mathematics achievement measures across five states and three grade levels. Results suggest a range of variation in student performance among teachers on the same mathematics achievement measure, variation between the two different mathematics achievement measures, and variation between grade levels within the same state. Findings highlight initial considerations for educators interested in selecting and evaluating measures of student performance that are reflective of instructional opportunities.

20.
Science education needs valid, authentic, and efficient assessments. Many typical science assessments primarily measure recall of isolated information. This paper reports on the validation of assessments that measure knowledge integration ability among middle school and high school students. The assessments were administered to 18,729 students in five states. Rasch analyses of the assessments demonstrated satisfactory item fit, item difficulty, test reliability, and person reliability. The study showed that, when appropriately designed, knowledge integration assessments can be balanced between validity and reliability, authenticity and generalizability, and instructional sensitivity and technical quality. Results also showed that, when paired with multiple‐choice items and scored with an effective scoring rubric, constructed‐response items can achieve high reliabilities. Analyses showed that English language learner status and computer use significantly impacted students' science knowledge integration abilities. Students who took the assessment online, which matched the format of content delivery, performed significantly better than students who took the paper‐and‐pencil version. Implications and future directions of research are noted, including refining curriculum materials to meet the needs of diverse students and expanding the range of topics measured by knowledge integration assessments. © 2011 Wiley Periodicals, Inc. J Res Sci Teach 48: 1079–1107, 2011
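The Rasch item-fit checks reported above are typically summarized with outfit and infit mean squares. The sketch below computes both from scored responses and estimated parameters; the simulated data and the "values near 1.0" rule of thumb are our assumptions, not the study's analysis.

```python
import numpy as np

def rasch_fit(X, theta, b):
    """Outfit (unweighted) and infit (information-weighted) mean squares per item,
    given scored responses X (persons x items) and Rasch parameter estimates."""
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    w = p * (1 - p)                # binomial variance of each response
    z2 = (X - p) ** 2 / w          # squared standardized residuals
    outfit = z2.mean(axis=0)
    infit = (z2 * w).sum(axis=0) / w.sum(axis=0)
    return outfit, infit           # values near 1.0 indicate adequate fit

rng = np.random.default_rng(8)
theta, b = rng.normal(size=300), rng.normal(size=12)
X = (rng.random((300, 12)) < 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))).astype(int)
outfit, infit = rasch_fit(X, theta, b)
print(np.round(outfit, 2), np.round(infit, 2))
```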
