Similar Documents
20 similar documents found.
1.
Accountability mandates often prompt assessment of student learning gains (e.g., value-added estimates) via achievement tests. The validity of these estimates has been questioned when performance on tests is low stakes for students. To assess the effects of motivation on value-added estimates, we assigned students to one of three test consequence conditions: (a) an aggregate of test scores is used solely for institutional effectiveness purposes, (b) personal test score is reported to the student, or (c) personal test score is reported to faculty. Value-added estimates, operationalized as change in performance between two testing occasions for the same individuals where educational programming was experienced between testing occasions, were examined across conditions, in addition to the effects of test-taking motivation. Test consequences did not impact value-added estimates. Change in test-taking motivation, however, had a substantial effect on value-added estimates. In short, value-added estimates were attenuated due to decreased motivation from pretest to posttest.

2.
Although norm-referenced educational tests are used widely to identify children with learning disabilities, there have been few studies examining how performance on these tests relates to actual classroom functioning. This study examines the criterion-related validity of a battery of six educational diagnostic and achievement tests for actual reading level of a group of 155 learning disabled students enrolled in grades 4 to 7 and a subgroup of 46 learning disabled students enrolled in grades 5 and 6. Stepwise multiple regression analysis revealed that six subtests from the battery accounted for over 70% of the variance in reading for the larger group. When type of reading curriculum used was controlled, five subtests accounted for over 76% of the variance in the criterion. The theoretical and applied implications of these results for diagnostic assessment are discussed.

3.
4.
The Angoff (1971) standard setting method requires expert panelists to (a) conceptualize candidates who possess the qualifications of interest (e.g., the minimally qualified) and (b) estimate actual item performance for these candidates. Past and current research (Bejar, 1983; Shepard, 1994) suggests that estimating item performance is difficult for panelists. If panelists cannot perform this task, the validity of the standard based on these estimates is in question. This study tested the ability of 26 classroom teachers to estimate item performance for two groups of their students on a locally developed district-wide science test. Teachers were more accurate in estimating the performance of the total group than of the "borderline group," but in neither case was their accuracy level high. Implications of this finding for the validity of item performance estimates by panelists using the Angoff standard setting method are discussed.

5.
When we administer educational achievement tests, we want to be confident that the resulting scores validly indicate what the test takers know and can do. However, if the test is perceived as low stakes by the test taker, disengaged test taking sometimes occurs, which poses a serious threat to score validity. When computer-based tests are used, disengagement can be detected through occurrences of rapid-guessing behavior. This empirical study investigated the impact of a new effort monitoring feature that can detect rapid guessing, as it occurs, and notify proctors that a test taker has become disengaged. The results showed that, after a proctor notification was triggered, test-taking engagement tended to increase, test performance improved, and test scores exhibited stronger convergent validity evidence. The findings of this study provide validation evidence that this innovative testing feature can decrease disengaged test taking.

6.
《教育实用测度》2013,26(2):85-113
The article discusses the need educators have for measures of linguistic competence for limited-English-proficient (LEP) students. Traditional measurement procedures do not meet these needs because of mismatches between educational experiences and test content, cultural experiences and test content, and linguistic experience and test content. A new type of test, the Sentence Verification Technique (SVT) test, that may meet some of the measurement needs of LEP students is described, and the results of a study that examines the reliability and validity of the new tests as measures of listening and reading comprehension performance in both the native language and English are reported. The results indicate that the tests are reliable and that SVT performance varies as a function of placement in a transitional bilingual education program, teacher judgments of competence, and difficulty of the material. These results are consistent with the interpretation that SVT tests are valid measures of the linguistic competence of LEP students. The article concludes with a discussion of some of the advantages of using SVT tests with LEP populations.

7.
Accountability for educational quality is a priority at all levels of education. Low-stakes testing is one way to measure the quality of education that students receive and make inferences about what students know and can do. Aggregate test scores from low-stakes testing programs are suspect, however, to the degree that these scores are influenced by low test-taker effort. This study examined the generalizability of a recently developed technique called motivation filtering, whereby scores for students of low motivation are systematically filtered from test data to determine aggregate test scores that more accurately reflect student performance and that can be used for reporting purposes. Across assessment tests in five different content areas, motivation filtering was found to consistently increase mean test performance and convergent validity.
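The core of motivation filtering, as the abstract describes it, is to drop examinees whose reported test-taking motivation falls below a cutoff before computing aggregate scores. A minimal sketch of that idea follows; the function name, data layout, and the cutoff value are illustrative, not taken from the study:

```python
# Motivation filtering: remove examinees whose self-reported motivation
# falls below a cutoff, then recompute the aggregate (mean) test score.
def motivation_filter(records, cutoff=3.0):
    """records: list of (score, motivation) pairs; returns the filtered mean."""
    kept = [score for score, motivation in records if motivation >= cutoff]
    if not kept:
        raise ValueError("no examinees met the motivation cutoff")
    return sum(kept) / len(kept)

# Hypothetical data: two low-motivation examinees are filtered out,
# so the reported mean rises, as the study observed.
data = [(72, 4.5), (55, 1.2), (80, 3.8), (40, 2.0), (65, 3.1)]
print(motivation_filter(data))  # mean of 72, 80, and 65
```

In practice the motivation measure would come from an instrument such as a test-taking effort scale, and the cutoff would be set empirically.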

8.
There has been steady interest in investigating the validity of language tests in recent decades. Despite numerous studies on construct validity in language testing, few studies have examined the construct validity of a reading test. This paper reports on a study that explored the construct validity of the English reading test in the Nepalese school leaving examination. Eight students were asked to take the test and think aloud, followed by retrospective interviews. Additionally, seven experts were asked to make judgments regarding the skills tested by the test. The findings provide grounded insights into students' response behaviors prompted by the reading tasks, and indicate some threats to the construct validity of the test. Additionally, the study reports a low level of agreement among the experts, and a large gap between the skills used by the students and the skills that the experts thought were being examined by the test.

9.
Indirect tests of writing competency are often used at the college level for a variety of educational, programmatic, and research purposes. Although such tests may have been validated on hearing populations, it cannot be assumed that they validly assess the writing competency of deaf and hard-of-hearing students. This study used a direct criterion measure of writing competency to determine the criterion validity of two indirect measures of writing competency. Results suggest that the validity of indirect writing tests for deaf and hard-of-hearing baccalaureate-level students is weak. We recommend that direct writing tests be used with this population to ensure fair and accurate assessment of writing competency.

10.
There are sound educational and examining reasons for the use of coursework assessment and practical assessment of student work by teachers in schools for purposes of reporting examination grades. Coursework and practical work test a range of curriculum goals different from those assessed by final papers, and increase the validity and reliability of the result. However, the use of coursework and practical work in tests and examinations has been a matter of constant political as well as educational debate in England over the last 30 years. The paper reviews these debates and developments and argues that as accountability pressures increase, the evidence base for published results is becoming narrower and less valid as the system moves back to wholly end-of-course testing.

11.
This study applied item response theory to examine, from the perspectives of examinee ability estimates and item difficulty estimates, how the writers of multiple-choice items affect the validity of reading comprehension tests. In the experimental design, two groups of examinees first took the same reading proficiency test; the ability estimates derived from the results showed no significant difference between the two groups. The same two groups were then administered items written by two different item writers. Although the items were constructed from the same reading passages and their difficulty estimates did not differ significantly, the two groups performed significantly differently. Under the Rasch model, examinee performance is jointly determined by examinee ability and item difficulty. It can therefore be inferred that the items produced by the different writers influenced examinee performance and, in turn, the validity of multiple-choice reading comprehension tests.
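The Rasch model invoked in the abstract above has a simple closed form: the probability of a correct response depends only on the difference between examinee ability (theta) and item difficulty (b). A minimal sketch, with illustrative parameter values:

```python
import math

# Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b)),
# i.e., a logistic function of ability minus difficulty.
def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(rasch_p(theta=1.0, b=1.0))            # ability equals difficulty: 0.5
print(round(rasch_p(theta=2.0, b=1.0), 3))  # higher ability: 0.731
```

Because the model fixes performance as a function of ability and difficulty alone, observing different performance on items of equal estimated difficulty, from groups of equal estimated ability, points to an unmodeled source of variation such as the item writer.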

12.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
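The key move in the effort-moderated model described above is to treat a response given in less than some item-specific time threshold as a rapid guess, modeled as random responding, while effortful responses follow the usual 3PL. A minimal sketch of that response probability; the threshold, option count, and parameter values are illustrative, not from the article:

```python
import math

# Effort-moderated response probability (sketch): if response time (rt)
# indicates a rapid guess, model the response as random across k options;
# otherwise use the standard 3PL item response function.
def p_correct(theta, a, b, c, rt, rt_threshold=3.0, k_options=4):
    if rt < rt_threshold:  # rapid guess: solution behavior absent
        return 1.0 / k_options
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))  # standard 3PL

print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.2, rt=10.0))  # effortful: 0.6
print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.2, rt=1.0))   # rapid guess: 0.25
```

In estimation, responses flagged as rapid guesses contribute the random-response probability to the likelihood, so they no longer drag down the proficiency estimate the way they do under the standard 3PL.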

13.
《教育实用测度》2013,26(1):25-51
In this study, we compared the efficiency, reliability, validity, and motivational benefits of computerized-adaptive and self-adapted music-listening tests (referred to hereafter as CAT and SAT, respectively). Junior high school general music students completed a tonal memory CAT, a tonal memory SAT, standardized music aptitude and achievement tests, and questionnaires assessing test anxiety, demographics, and attitudes about the CAT and SAT. Standardized music test scores and music course grades served as criterion measures in the concurrent validity analysis. Results showed that the SAT elicited more favorable attitudes from examinees and yielded ability estimates that were higher and less correlated with test anxiety than did the CAT. The CAT, however, required fewer items and less administration time to match the reliability and concurrent validity of the SAT and yielded higher levels of reliability and concurrent validity than the SAT when test length was held constant. These results reaffirm important tradeoffs between the two administration procedures observed in prior studies of vocabulary and algebra skills, with the SAT providing greater potential motivational benefits and the CAT providing greater efficiency. Implications and questions for future research are discussed.

14.
Parental anxiety in children's education is closely related to children's developmental and educational outcomes. The current study reported the development and validation of a self-report instrument to evaluate the Sources of Parental Anxiety in Children's Education (SPACE). Qualitative analyses suggested that the construct of parental anxiety in children's education was multidimensional, representing learning performance anxiety, educational environment anxiety, educational input anxiety, and educational outcome anxiety as four primary sources. The results from exploratory and confirmatory factor analyses supported this four-factor structure comprising 17 items to capture this multidimensional construct. The scale also demonstrated adequate internal consistency, convergent validity, discriminant validity, criterion-related validity, and test-retest reliability. A series of multi-group tests across age, locality, and children's grades provided evidence of measurement invariance. Overall, the SPACE scale appears to be a reliable and valid tool for measuring educational anxiety in parents in the Chinese context.

15.
Item positions in educational assessments are often randomized across students to prevent cheating. However, if altering item positions results in any significant impact on students' performance, it may threaten the validity of test scores. Two widely used approaches for detecting position effects, logistic regression and hierarchical generalized linear modeling, are often inconvenient for researchers and practitioners due to some technical and practical limitations. Therefore, this study introduced a structural equation modeling (SEM) approach for examining item and testlet position effects. The SEM approach was demonstrated using data from a computer-based alternate assessment designed for students with cognitive disabilities from three grade bands (3–5, 6–8, and high school). Item and testlet position effects were investigated in the field-test (FT) items that were received by each student at different positions. Results indicated that the difficulty of some FT items in grade bands 3–5 and 6–8 differed depending on the positions of the items on the test. Also, the overall difficulty of the field-test task in grade bands 6–8 increased as students responded to the field-test task in later positions. The SEM approach provides a flexible method for examining different types of position effects.

16.
In this study, we examined the reliability and validity of curriculum‐based measures (CBM) in reading for indexing the performance of secondary‐school students. Participants were 236 eighth‐grade students (134 females and 102 males) in the classrooms of 17 English teachers. Students completed 1‐, 2‐, and 3‐minute reading aloud and 2‐, 3‐, and 4‐minute maze selection tasks. The relation between performance on the CBMs and the state reading test were examined. Results revealed that both reading aloud and maze selection were reliable and valid predictors of performance on the state standards tests, with validity coefficients above .70. An exploratory follow‐up study was conducted in which the growth curves produced by the reading‐aloud and maze‐selection measures were compared for a subset of 31 students from the original study. For these 31 students, maze selection reflected change over time whereas reading aloud did not. This pattern of results was found for both lower‐ and higher‐performing students. Results suggest that it is important to consider both performance and progress when examining the technical adequacy of CBMs. Implications for the use of measures with secondary‐level students for progress monitoring are discussed.

17.
Test-takers' interpretations of validity as related to test constructs and test use have been widely debated in large-scale language assessment. This study contributes further evidence to this debate by examining 59 test-takers' written accounts of large-scale English language tests. Participants wrote about their test-taking experiences in 300 to 500 words, focusing on their perceptions of test validity and test use. A standard thematic coding process and logical cross-analysis were used to analyze test-takers' experiences. Codes were deductively generated and related to both experiential (i.e., testing conditions and consequences) and psychometric (i.e., test construction, format, and administration) aspects of testing. These findings offer test-takers' voices on fundamental aspects of language assessment, which bear implications for test developers, test administrators, and test users. The study also demonstrated the need for obtaining additional evidence from test-takers for validating large-scale language tests.

18.
This paper serves as an illustration of the usefulness of structurally incomplete designs as an approach to reduce the length of educational questionnaires. In structurally incomplete test designs, respondents only fill out a subset of the total item set, while all items are still provided to the whole sample. The scores on the unadministered items are subsequently dealt with by using methods for the estimation of missing data. Two structurally incomplete test designs, one recording two thirds and the other recording half of the potentially complete data, were applied to the complete item scores on 8 educational psychology scales. The incomplete item scores were estimated with the missing data method Data Augmentation. Complete and estimated test data were compared in terms of estimates of total scores, reliability, and predictive validity of an external criterion. The reconstructed data yielded estimates that were very close to the values in the complete data. As expected, the statistical uncertainty was higher in the design that recorded fewer item scores. It was concluded that the procedure of applying incomplete test designs and subsequently dealing with the missing values is very fruitful for reducing questionnaire length.
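The "two-thirds" design mentioned above can be made concrete with a small sketch: the item set is split into three blocks, each respondent answers only two of them, and rotation guarantees that every block is still administered across the sample. Block labels and the rotation scheme here are illustrative, not taken from the paper:

```python
# Sketch of a structurally incomplete ("two-thirds") design: items are
# split into three blocks; each respondent receives two of the three,
# so every item is still covered by the sample as a whole.
from itertools import combinations

def assign_blocks(n_respondents, blocks=("A", "B", "C")):
    """Rotate respondents through all pairs of blocks."""
    pairs = list(combinations(blocks, 2))  # ('A','B'), ('A','C'), ('B','C')
    return [pairs[i % len(pairs)] for i in range(n_respondents)]

assignments = assign_blocks(6)
print(assignments)
# Each block is administered to two thirds of respondents, matching the
# design's data coverage; the unadministered third is missing by design.
```

The scores on each respondent's unadministered block would then be estimated with a principled missing-data method (the paper uses Data Augmentation) before computing totals, reliability, and validity.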

19.
The College English Test–Spoken English Test (CET-SET) is supported by good validity, reliability, and positive washback, and has gained wide recognition. The test does have a weakness: in the group-discussion task required by the test syllabus, there is a validity gap between the degree of candidate interaction and candidates' authentic performance. Nevertheless, the test exhibits many positive qualities. As the only large-scale standardized test of spoken English proficiency in China, the CET-SET has influenced educational policy, teaching practice, the education system, and individual stakeholders.

20.
Grades and Test Scores: Accounting for Observed Differences
Why do grades and test scores often differ? A framework of possible differences is proposed in this article. An approximation of the framework was tested with data on 8,454 high school seniors from the National Education Longitudinal Study. Individual and group differences in grade versus test performance were substantially reduced by focusing the two measures on similar academic subjects, correcting for grading variations and unreliability, and adding teacher ratings and other information about students. Concurrent prediction of high school average was thus increased from 0.62 to 0.90; differential prediction in eight subgroups was reduced to 0.02 letter‐grades. Grading variation was a major source of discrepancy between grades and test scores. Other major sources were teacher ratings and Scholastic Engagement, a promising organizing principle for understanding student achievement. Engagement was defined by three types of observable behavior: employing school skills, demonstrating initiative, and avoiding competing activities. While groups varied in average achievement, group performance was generally similar on grades and tests. Major factors in achievement were similarly constituted and similarly related from group to group. Differences between grades and tests give these measures complementary strengths in high‐stakes assessment. If artifactual differences between the two measures are not corrected, common statistical estimates of validity and fairness are unduly conservative.
