Similar Literature
20 similar documents found.
1.
Examinees who take high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on their initial attempt. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign a different test form to repeat examinees. The use of multiple forms is expensive and can present psychometric challenges, particularly for low-volume credentialing programs; thus, it is important to determine if unwarranted score gains actually occur. Prior studies provide strong evidence that the same-form advantage is pronounced for aptitude tests. However, the sparse research within the context of achievement and credentialing testing suggests that the same-form advantage is minimal. For the present experiment, 541 examinees who failed a national certification test were randomly assigned to receive either the same test or a different (parallel) test on their second attempt. Although the same-form group had shorter response times on the second administration, score gains for the two groups were indistinguishable. We discuss factors that may limit the generalizability of these findings to other assessment contexts.
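The core comparison in a design like this is the difference in first-to-second-attempt score gains between the two randomly assigned groups. A minimal sketch of that analysis; the group sizes, gain means, and spreads below are placeholders, not the study's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder data: score gains (attempt 2 minus attempt 1) for repeat
# examinees randomly assigned to the same form or a parallel form.
gain_same = rng.normal(loc=5.0, scale=8.0, size=270)
gain_parallel = rng.normal(loc=5.0, scale=8.0, size=271)

# Welch's t-test on the gains, plus a standardized effect size.
t, p = stats.ttest_ind(gain_same, gain_parallel, equal_var=False)
pooled_sd = np.sqrt((gain_same.var(ddof=1) + gain_parallel.var(ddof=1)) / 2)
d = (gain_same.mean() - gain_parallel.mean()) / pooled_sd
print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```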

2.
3.
Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies, and failing to account for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test, due either to test speededness or to context dependence (related to passage structure). The results also highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may affect the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.
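One widely used screen for local item dependence is Yen's Q3, the correlation matrix of item residuals after fitting an IRT model. The abstract does not say which detection methods were reviewed, so this Rasch-based sketch is an illustration rather than the study's procedure:

```python
import numpy as np

def q3_matrix(responses, theta, difficulty):
    """Yen's Q3: correlations among item residuals under a Rasch model.

    responses  -- (n_persons, n_items) matrix of 0/1 item scores
    theta      -- (n_persons,) ability estimates
    difficulty -- (n_items,) item difficulty estimates

    Item pairs whose Q3 value sits well above the average off-diagonal
    value are candidates for local item dependence (e.g., items sharing
    a passage) and may be grouped into a testlet for calibration.
    """
    expected = 1.0 / (1.0 + np.exp(-(theta[:, None] - difficulty[None, :])))
    residuals = responses - expected
    return np.corrcoef(residuals, rowvar=False)
```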

4.
Using Multiple Measures to Address Perverse Incentives and Score Inflation
The principle that important decisions should not be based on a single measure is axiomatic, if widely ignored in practice. The traditional rationale is the risk of incorrect decisions from incomplete and error-prone data. The current high-stakes uses of test scores increase the need for multiple measures for two distinct reasons: the risk of score inflation and the potential for perverse incentives for educators and students. Addressing these two issues may require focusing accountability on measures of schooling as well as a much wider range of measures of student outcomes. The difficulties of pursuing this approach are described, and some possible directions for research and development are noted.

5.
The Fundamental Conflict Between Item Security and Psychometric Research in High-Stakes Testing
Using China's National College Entrance Examination (gaokao) as an example, this paper analyzes how the high-stakes nature of an examination constrains the feasibility of psychometric item analysis. A high-stakes examination demands strict item security on the one hand, yet on the other hand optimizing item quality requires field-testing items on prospective examinees; these two needs are fundamentally in conflict. The paper introduces several qualitative research methods and recommends that they serve as the primary methods during the development phase of high-stakes examinations.

6.
Amid the debate over whether Australia’s National Assessment Program: Literacy and Numeracy (NAPLAN) test is high-stakes, children’s own accounts of their experiences remain sparse. This paper describes the findings of a case study which documented the experiences of 105 children across two Catholic primary schools in Queensland serving different socio-economic status (SES) communities. Analysis of the data revealed that the teachers and principals at these schools did not experience NAPLAN as high-stakes. However, the data suggested that the children experienced the tests within a confusing context of contradictions and dissonances emanating from multiple sources, receiving little, if any, clear and consistent information regarding the purpose and significance of NAPLAN. While the children’s responses were varied, many reported NAPLAN as a negative experience, with some constructing the test as high-stakes. These constructions ranged from personal judgement or a sense of letting their families down, to failure and, less commonly, grade retention and school exclusion. Some Year 3 children had also constructed good results as vital to future prosperity. These constructions bring into question the assumption that, because NAPLAN is designed to be a low-stakes test, children will necessarily experience it in this way.

7.
Applied Measurement in Education, 2013, 26(4): 411-418
Seven conclusions for professionals who administer state assessment programs are drawn from the GI Forum v. Texas Education Agency ruling: (a) the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1999) standards are appropriate to use; (b) items showing different p values for subgroups may be used if they are selected as adequate for sound educational reasons; (c) a cut score setting process should be educationally justified; (d) a high-stakes testing program can appropriately address unfair access to education; (e) offering multiple opportunities to pass satisfies the standard that a single test score should not be the sole basis for a high-stakes decision; (f) a conjunctive decision-making model can appropriately motivate both students and schools; and (g) an 80% pass rate criterion applied to eventual, as opposed to initial, success rates for subgroups is a reasonable threshold for adverse impact. Caution is recommended because circumstances in other states may not parallel those in Texas in important ways.
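Conclusion (g) invokes the familiar four-fifths (80%) rule for adverse impact, here applied to eventual rather than initial pass rates. A minimal sketch of that check; the pass rates used below are hypothetical:

```python
def adverse_impact_ratio(pass_rate_focal, pass_rate_reference):
    """Four-fifths (80%) rule: adverse impact is flagged when the focal
    group's pass rate falls below 80% of the reference group's rate."""
    ratio = pass_rate_focal / pass_rate_reference
    return ratio, ratio < 0.80

# Hypothetical eventual pass rates (after multiple opportunities):
ratio, flagged = adverse_impact_ratio(0.83, 0.95)
print(f"ratio = {ratio:.2f}, adverse impact flagged: {flagged}")
```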

8.
Oxford Review of Education, 2013, 39(3-4): 407-419

This paper discusses significant examples of the impact of research studies on practice and policy in educational assessment. The discussion focuses on selected issues in three main areas, concerned respectively with formative assessment, with high-stakes summative tests, and with assessment in large-scale surveys. It is concluded that the effects of research have been uneven and weak, in part because of a lack of co-ordination among studies and the absence of any single strong institutional centre. However, researchers are also faced with the difficult task of changing the understanding of assessment issues, both amongst the general public and amongst policy makers.

9.
Tests convey messages about what to teach and how to assess. Both of these dimensions may either broaden or become more uniform and narrow as a consequence of high-stakes testing. This study aimed to investigate how Swedish science teachers were influenced by national, high-stakes testing in science, focusing specifically on instances where teachers’ pedagogical practices were broadened and/or narrowed. The research design is a qualitative thematic analysis of focus group data from discussions with Swedish science teachers. The total sample consists of six teachers, who participated in 12 focus group discussions over three consecutive years. Findings suggest that the national tests influence teachers’ pedagogical practice by being used as a substitute for the national curriculum. Since the teachers do not want their students to fail the tests, they implement new content that is introduced by the tests and thereby broaden their existing practice. However, when this new content is not seen as a legitimate part of teachers’ established teaching traditions, the interpretation and implementation of this content may replicate the operationalisations made by the test developers, even though these operationalisations are restricted by demands for standardisation and reliable scoring. Consequently, the tests simultaneously broaden and narrow teachers’ pedagogical practices.

10.
In discussion of the properties of criterion-referenced tests, it is often assumed that traditional reliability indices, particularly those based on internal consistency, are not relevant. However, if the measurement errors involved in using an individual's observed score on a criterion-referenced test to estimate his or her universe scores on a domain of items are compared to errors of an a priori procedure that assigns the same universe score (the mean observed test score) to all persons, the test-based procedure is found to improve the accuracy of universe score estimates only if the test reliability is above 0.5. This suggests that criterion-referenced tests with low reliabilities generally will have limited use in estimating universe scores on domains of items.
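The 0.5 threshold follows directly from classical test theory. A short reconstruction of the logic in standard CTT notation (not the article's own algebra): with reliability ρ, the universe-score variance is ρσ²_X and the error variance is (1−ρ)σ²_X.

```latex
% Error of the a priori rule (assign everyone the mean):
E\!\left[(T-\mu_X)^2\right] \;=\; \sigma_T^2 \;=\; \rho\,\sigma_X^2
% Error of using each person's observed score:
E\!\left[(T-X)^2\right] \;=\; \sigma_E^2 \;=\; (1-\rho)\,\sigma_X^2
% The test-based estimate is more accurate only when
(1-\rho)\,\sigma_X^2 \;<\; \rho\,\sigma_X^2
\quad\Longleftrightarrow\quad \rho \;>\; 0.5
```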

11.
Students may not fully demonstrate their knowledge and skills on accountability tests if there are no stakes attached to individual performance. In that case, assessment results may not accurately reflect student achievement, so the validity of score interpretations and uses suffers. For this study, matched samples of students taking state accountability tests under low-stakes and high-stakes conditions were used to estimate the effect of stakes on test performance and subsequent pass rates. Across five assessments, expected performance was greater under high-stakes conditions, with effect sizes ranging from 0.41 to 0.50 standard deviations and with students of lower ability tending to be slightly more affected by stakes. Depending on where cut scores were set, pass rates differed by up to 30% when comparing the low- and high-stakes conditions.
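A normal approximation shows why the pass-rate gap depends on where the cut score sits: a fixed mean shift moves the most probability mass where the score density is highest. An illustration with placeholder values (a 0.45 SD shift, mid-range of the reported effect sizes; the study's actual score distributions need not be normal):

```python
from scipy.stats import norm

def pass_rate(cut_z, shift=0.0):
    """Pass rate when scores are N(shift, 1) and the cut score is
    expressed in low-stakes SD units."""
    return 1.0 - norm.cdf(cut_z, loc=shift)

# A 0.45 SD motivation effect changes the pass rate most when the cut
# sits near the center of the score distribution.
for cut in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0):
    low = pass_rate(cut)
    high = pass_rate(cut, shift=0.45)
    print(f"cut {cut:+.1f} SD: {low:.2f} -> {high:.2f} (diff {high - low:+.2f})")
```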

12.
The purpose of this study was to examine the factor structure of the Family Child Care Environment Rating Scale—Revised (FCCERS-R) in high-stakes contexts. The results of an exploratory factor analysis revealed three dimensions of quality on the FCCERS-R: (1) Activities/Materials, (2) Language/Interaction, and (3) Organization. This study also explored whether abridged versions of the FCCERS-R could serve as a proxy for the full instrument. In addition to subsets of FCCERS-R items created from the factor structure, purposively and randomly chosen item subsets were created. The purposively chosen subsets included 6-, 9-, and 12-item scales comprising the items with the highest factor loadings across the three factors, whereas the randomly chosen subsets consisted of 12 items. Results of a discriminant analysis showed that the factor subsets were poorer proxies for the total FCCERS-R score than the other subsets, which demonstrated internal consistency and discriminant power comparable to the full FCCERS-R when classifying homes into general quality categories. Implications for adopting shorter versions of the FCCERS-R are discussed.
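Comparing an abridged scale with the full instrument typically starts with internal consistency. A minimal Cronbach's alpha helper for scoring candidate item subsets; the commented usage names (`all_items`, `subset_columns`) are hypothetical, and the study's discriminant analysis is a separate step:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_observations, n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    sum_item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - sum_item_var / total_var)

# Compare a candidate subset against the full instrument (hypothetical):
# alpha_full = cronbach_alpha(all_items)
# alpha_sub = cronbach_alpha(all_items[:, subset_columns])
```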

13.
Testing has become an attractive option for policymakers, both because it has the potential to affect the behavior of educators in the educational system and because it is often viewed by the public as a way to guarantee a basic level of quality education. Whatever the reasons, formal testing tied to grade promotion and graduation continues to spread throughout the United States. A qualitative narrative is provided of the preparation of three minority students, Snuffi, Jasmine, and Wanda, for their high-stakes mathematics test. Afterward, a critique is offered of high-stakes tests as a requirement for high school graduation in public school districts in the United States, particularly as they affect minority students. Implications are discussed for further research in mathematics education.

14.
State test score trends are widely interpreted as indicators of educational improvement. To validate these interpretations, state test score trends are often compared to trends on other tests such as the National Assessment of Educational Progress (NAEP). These comparisons raise serious technical and substantive concerns. Technically, the most commonly used trend statistics—for example, the change in the percent of proficient students—are misleading in the context of cross-test comparisons. Substantively, it may not be reasonable to expect that NAEP and state test score trends should be similar. This paper motivates and then applies a "scale-invariant" framework for cross-test trend comparisons to compare "high-stakes" state test score trends from 2003 to 2005 to NAEP trends over the same period. Results show that state trends are significantly more positive than NAEP trends. The paper concludes with cautions against the positioning of trend discrepancies in a framework where only one trend is considered "true."
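One simple scale-invariant statistic expresses each trend as a change in baseline standard-deviation units, sidestepping the cut-score dependence of percent-proficient changes. The numbers below are placeholders rather than NAEP or state data, and the paper's framework may differ in detail:

```python
def standardized_trend(mean_t1, mean_t2, sd_t1):
    """Change in means expressed in baseline SD units; comparable
    across tests with different scales, unlike changes in the
    percent of proficient students (which depend on where the
    proficiency cut score falls)."""
    return (mean_t2 - mean_t1) / sd_t1

# Placeholder values for a 2003-to-2005 comparison:
state_trend = standardized_trend(240.0, 248.0, 32.0)  # +0.25 SD
naep_trend = standardized_trend(276.0, 278.0, 36.0)   # +0.06 SD
print(f"state {state_trend:+.2f} SD vs NAEP {naep_trend:+.2f} SD")
```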

15.
Increasingly, tests are being translated and adapted into different languages. Differential item functioning (DIF) analyses are often used to identify non-equivalent items across language groups. However, few studies have focused on understanding why some translated items produce DIF. The purpose of the current study is to identify sources of differential item and bundle functioning on translated achievement tests using substantive and statistical analyses. A substantive analysis of existing DIF items was conducted by an 11-member committee of testing specialists. In their review, four sources of translation DIF were identified. Two certified translators used these four sources to categorize a new set of DIF items from Grade 6 and 9 Mathematics and Social Studies Achievement Tests. Each item was associated with a specific source of translation DIF and each item was anticipated to favor a specific group of examinees. Then, a statistical analysis was conducted on the items in each category using SIBTEST. The translators sorted the mathematics DIF items into three sources, and they correctly predicted the group that would be favored for seven of the eight items or bundles of items across two grade levels. The translators sorted the social studies DIF items into four sources, and they correctly predicted the group that would be favored for eight of the 13 items or bundles of items across two grade levels. The majority of items in mathematics and social studies were associated with differences in the words, expressions, or sentence structure of items that are not inherent to the language and/or culture. By combining substantive and statistical DIF analyses, researchers can study the sources of DIF and create a body of confirmed DIF hypotheses that may be used to develop guidelines and test construction principles for reducing DIF on translated tests.
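The statistical screen in this study was SIBTEST; as a stand-in, the sketch below implements the closely related Mantel-Haenszel procedure, which likewise conditions on total score and flags items whose odds of success differ between language groups (an illustration, not the study's code):

```python
import numpy as np

def mantel_haenszel_dif(right_ref, wrong_ref, right_foc, wrong_foc):
    """Mantel-Haenszel common odds ratio across matched score strata.

    Each argument is an array of per-stratum counts: examinees in the
    reference/focal group who answered the studied item right/wrong.
    alpha > 1 means the item favors the reference group; the ETS
    delta scale is -2.35 * ln(alpha).
    """
    right_ref, wrong_ref, right_foc, wrong_foc = (
        np.asarray(a, dtype=float)
        for a in (right_ref, wrong_ref, right_foc, wrong_foc))
    n = right_ref + wrong_ref + right_foc + wrong_foc
    alpha = np.sum(right_ref * wrong_foc / n) / np.sum(wrong_ref * right_foc / n)
    return alpha, -2.35 * np.log(alpha)
```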

16.
It is standard practice to arrange items in objective tests in order of increasing difficulty, on the assumption that such an arrangement increases student motivation and produces more reliable tests. The validity of this assumption was investigated in the context of a multiple-choice chemistry test. Fifty items were arranged in three sequences of difficulty: random (R), easy-to-hard (E-H) and hard-to-easy (H-E). The mean test score was significantly higher for the test sequenced E-H than for the test sequenced H-E. Item difficulty index was raised by placement of the easier items toward the beginning of the test and lowered by placement of these items toward the end of the test. Test reliability was largely independent of item sequence.

17.
Deaf students consistently score lower on standardized measures of reading comprehension than their hearing peers. Most of the studies that have been conducted to explain this phenomenon have focused on variables within the reader, and important differences have been found between deaf and hearing readers. More recently, in the face of increasingly high-stakes consequences, researchers are looking "outside" the reader, at the tests themselves, to determine whether there are fairness issues for special populations, such as deaf students. The study reported here, the first of its kind with deaf students, examines the North Carolina (NC) reading comprehension test. The study employs the same method originally used by NC to determine the appropriateness of the test for the general population of NC students. The experts in this article, like those in the original construction of the NC test, are familiar with the content of the reading curriculum in NC; however, the raters in this article bring a special perspective related to teaching and testing the reading of students who are deaf. Findings from this study raise questions about the appropriateness of the NC reading test for deaf students. Implications for future research and instructional practice are discussed.

18.
The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e., probability judgments) from previously conducted standard setting studies were used first in a re-sampling study, followed by D-studies. For the re-sampling study, proportionally stratified subsets of items were extracted under various sampling and test-length conditions. The mean cut score, variance components, expected standard error (SE) around the mean cut score, and root-mean-squared deviation (RMSD) across 1,000 replications were estimated at each study condition. The SE and the RMSD decreased as the number of items increased, but this reduction tapered off after approximately 45 items. Subsequently, D-studies were performed on the same datasets. The expected SE was computed at various test lengths. Results from both studies are consistent with previous research indicating that 40–50 items are sufficient to make generalizable cut score recommendations.
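For a crossed raters-by-items design, the D-study expected SE of the Angoff cut score (the grand mean of all probability judgments) can be written from the estimated variance components. Only the form of the formula is standard; the variance components and panel size below are placeholders:

```python
import numpy as np

def expected_se_cut(var_rater, var_item, var_ri, n_raters, n_items):
    """Expected SE of a cut score defined as the grand mean of an
    n_raters x n_items table of Angoff probability judgments."""
    return np.sqrt(var_rater / n_raters
                   + var_item / n_items
                   + var_ri / (n_raters * n_items))

# Placeholder variance components; note the diminishing returns in
# n_items that echo the 40-50 item finding.
for n_items in (15, 30, 45, 60, 90):
    se = expected_se_cut(0.004, 0.020, 0.010, n_raters=12, n_items=n_items)
    print(f"{n_items:>2} items: expected SE = {se:.3f}")
```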

19.
Recent developments of person-fit analysis in computerized adaptive testing (CAT) are discussed. Methods from statistical process control are presented that have been proposed to classify an item score pattern as fitting or misfitting the underlying item response theory model in CAT. Most person-fit research in CAT is restricted to simulated data. In this study, empirical data from a certification test were used. Alternatives are discussed to generate norms so that bounds can be determined to classify an item score pattern as fitting or misfitting. Using bounds determined from a sample of a high-stakes certification test, the empirical analysis showed that different types of misfit can be distinguished. Further applications using statistical process control methods to detect misfitting item score patterns are discussed.
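A typical statistical-process-control screen in this setting is a CUSUM on item-score residuals accumulated in administration order. A minimal sketch; the reference value and threshold here are placeholders, whereas the article derives its bounds empirically from a certification-test sample:

```python
import numpy as np

def cusum_person_fit(observed, expected, k=0.1, h=1.0):
    """Two-sided CUSUM on item residuals for one examinee in a CAT.

    observed, expected -- per-item observed scores and model-expected
    scores, in the order the items were administered.
    k is the reference (slack) value, h the decision threshold; an
    item score pattern is flagged as misfitting when either cumulative
    sum crosses its bound.
    """
    residuals = np.asarray(observed, dtype=float) - np.asarray(expected, dtype=float)
    c_plus = c_minus = 0.0
    for r in residuals:
        c_plus = max(0.0, c_plus + r - k)
        c_minus = min(0.0, c_minus + r + k)
        if c_plus > h or c_minus < -h:
            return True  # misfitting
    return False  # fitting
```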

20.
This article focuses on a problem described by Anderson, namely, the quality of the construction and reporting of research that contains “homemade” achievement tests. The current status of homemade achievement tests was examined in this study. Research reports in two science education journals were analyzed, using Anderson's eight categories of information that a high-quality research report should include. The journals examined were Journal of Research in Science Teaching and Science Education, from January 1975 to January 1980. The findings indicate that reliability estimates and the procedures for selecting test items were mentioned more frequently in the science education journals than in Anderson's study. However, there was little or no improvement in describing the relation of test items to instruction. These findings should be of interest both to researchers who are utilizing “homemade” achievement tests in their studies and to journal panels who are reviewing studies for publication.
