Similar articles
20 similar articles found (search time: 31 ms)
1.
Validity is a central principle of assessment relating to the appropriateness of the uses and interpretations of test results. Usually, one of the inferences that we wish to make is that the score reflects the extent of a student's learning in a given domain. Thus, it is important to establish that the assessment tasks elicit performances that reflect the intended constructs. This research explored the use of three methods for evaluating whether there are threats to validity in relation to the constructs elicited in international A level geography examinations: (a) Rasch analysis; (b) analysis of the processes expected and apparent when students answer questions; and (c) qualitative analysis of responses to items identified as potentially problematic. The results provided strong evidence to support validity with regard to the elicitation of constructs, although one question part was identified as a threat to validity. Strengths and weaknesses of each method are identified.
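The Rasch analysis mentioned above models the probability of a correct response as a function of the gap between a person's ability and an item's difficulty; misfit to that model is one signal that an item may be eliciting something other than the intended construct. A minimal sketch of the model (the ability and difficulty values below are made-up illustrations, not data from the study):

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model,
    given person ability theta and item difficulty b (in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the probability is exactly 0.5.
print(rasch_p(0.0, 0.0))

# A misfitting item is one whose observed success rates depart from
# these model-implied probabilities across the ability range.
for theta in (-2.0, 0.0, 2.0):
    print(round(rasch_p(theta, 0.5), 3))
```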

2.
Cognitive interviewing (CI) provides a method of systematically collecting validity evidence of response processes for questionnaire items. CI involves a range of techniques for prompting individuals to verbalise their responses to items. One such technique is concurrent verbalisation, as developed in Think Aloud Protocol (TAP). This article investigates the value of the technique for validating questionnaire items administered to young people in international surveys. To date, the literature on TAP has focused on allaying concerns about reactivity – whether response processes are affected by thinking aloud. This article investigates another concern, namely the completeness of concurrent verbalisations – the extent to which respondents verbalise their response processes. An independent, exploratory validation of the PISA assessment of student self-efficacy in mathematics by a small international team of researchers using CI with concurrent verbalisation in four education systems (England, Estonia, Hong Kong, and the Netherlands) provided the basis for this investigation. The researchers found that students generally thought aloud in response to each of the items, thereby providing validity evidence of response processes varying within and between the education systems, but that practical steps could be taken to increase the completeness of concurrent verbalisations in future validations.

3.
4.
An assumption that is fundamental to the scoring of student-constructed responses (e.g., essays) is the ability of raters to focus on the response characteristics of interest rather than on other features. A common example, and the focus of this study, is the ability of raters to score a response based on the content achievement it demonstrates independent of the quality with which it is expressed. Previously scored responses from a large-scale assessment in which trained scorers rated exclusively constructed-response formats were altered to enhance or degrade the quality of the writing, and scores that resulted from the altered responses were compared with the original scores. Statistically significant differences in favor of the better-writing condition were found in all six content areas. However, the effect sizes were very small in mathematics, reading, science, and social studies items. They were relatively large for items in writing and language usage (mechanics). It was concluded from the last two content areas that the manipulation was successful and from the first four that trained scorers are reasonably well able to differentiate writing quality from other achievement constructs in rating student responses.

5.
Applied Measurement in Education, 2013, 26(3): 185–207
With increasing interest in educational accountability, test results are now expected to meet a diverse set of informational needs. But a norm-referenced test (NRT) cannot be expected to meet the simultaneous demands for both norm-referenced and curriculum-specific information. One possible solution, which is the focus of this article, is to customize the NRT. Customized tests may take several forms. They may (a) add a few curriculum-specific items to the end of the NRT, (b) substitute locally constructed items for a few NRT items, (c) substitute a curriculum-specific test (CST) for the NRT, or (d) use equating methods to obtain predicted NRT scores from the CST scores. In this article, we describe the four main approaches to customized testing, address the validity of the uses and interpretations of customized test scores obtained from the four main approaches, and offer recommendations regarding the use of customized tests and the need for further research. Results indicate that customized testing can yield both valid normative and curriculum-specific information when special conditions exist. However, there are also many threats to the validity of normative interpretations. Cautious application of customized testing is needed to avoid misleading inferences about student achievement.

6.
7.
This study explores measurement of a construct called knowledge integration in science using multiple-choice and explanation items. We use construct and instructional validity evidence to examine the role multiple-choice and explanation items play in measuring students' knowledge integration ability. For construct validity, we analyze item properties such as alignment, discrimination, and target range on the knowledge integration scale using a Rasch Partial Credit Model analysis. For instructional validity, we test the sensitivity of multiple-choice and explanation items to knowledge integration instruction using a cohort comparison design. Results show that (1) one third of correct multiple-choice responses are aligned with higher levels of knowledge integration while three quarters of incorrect multiple-choice responses are aligned with lower levels of knowledge integration, (2) explanation items discriminate between high and low knowledge integration ability students much more effectively than multiple-choice items, (3) explanation items measure a wider range of knowledge integration levels than multiple-choice items, and (4) explanation items are more sensitive to knowledge integration instruction than multiple-choice items.

8.
The study investigated the predictive nature of test anxiety on achievement in the presence of perceived general academic self-concept, study habits, parental involvement in children's learning and socio-economic status. From a population of 2482 Grade 6 students from seven government primary schools of a sub-city in Addis Ababa, 497 participants were randomly selected, namely 248 boys and 249 girls. The mean age of the participants was 12.98 years. An adapted version of Sarason's Test Anxiety Scale (28 items), plus the General Academic Self-Concept Scale (18 items), and Parental Involvement (10 items), Study Habits (10 items) and Socio-Economic Status (10 items) scales developed by the authors were the instruments of the study. The findings of the study indicated: (a) test anxiety was weakly correlated with achievement (r = −0.186); and (b) perceived general academic self-concept and study habits were positively and significantly related to achievement. Stepwise multiple regression on achievement resulted in the selection of general academic self-concept, study habits and parental involvement as significant contributors to achievement, in that order. Test anxiety was found to be a non-predictor of achievement in the presence of the other variables.
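A correlation of −0.186, as reported above, corresponds to shared variance of r² ≈ 3.5%, which helps explain why test anxiety drops out once stronger predictors enter a stepwise model. A small sketch of the Pearson correlation and of this shared-variance calculation (the anxiety and achievement scores below are hypothetical, not the study's data):

```python
import statistics

def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical anxiety and achievement scores (not the study's data):
anxiety = [30, 25, 28, 20, 35, 22]
achievement = [60, 72, 65, 80, 55, 75]
r = pearson(anxiety, achievement)
print(round(r, 3))

# The study's reported r = -0.186 corresponds to only ~3.5% shared variance:
print(round((-0.186) ** 2, 4))   # 0.0346
```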

9.
Indicators of social background belong to the standard set of instruments for empirical research in education. Constructing valid ranking scales and category systems for social background depends on a differentiated survey of the occupation and vocational activity of parents. Coding such details using standard procedures is a complex process. This contribution investigates the intercoder reliability of occupational codes assigned according to ISCO-88, and of the indicators of socio-economic status (ISEI) based on these codes, using a random sample of 300 graduates surveyed about the occupations of their father and mother. To this end, we compared double-coding by professional coders with double-coding by the research team. The results show a match of around 50 percent between the two coding groups. The validity of the resulting index of socio-economic status was, however, very good: the correlation between ISEI values based on the codings from the different coders was very high, and the prediction of family background did not vary between the coding groups.
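The exact-match rate between two codings, as reported above, is simple percent agreement. A minimal sketch (the ISCO-88-style codes below are hypothetical, not the study's data; note that near-miss codes often map to similar ISEI values, which is why ISEI correlations can stay high even at ~50% exact agreement):

```python
def percent_agreement(codes_a, codes_b):
    """Share of cases for which two coders assigned the identical code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

# Hypothetical ISCO-88-style codes from two coding teams:
team1 = ["2310", "5122", "2310", "7231", "9132", "2445"]
team2 = ["2310", "5123", "2320", "7231", "9132", "2446"]
print(percent_agreement(team1, team2))   # 0.5
```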

10.
In large-scale assessments, such as state-wide testing programs, national sample-based assessments, and international comparative studies, there are many steps involved in the measurement and reporting of student achievement. There are always sources of inaccuracy in each of the steps. It is of interest to identify the source and magnitude of the errors in the measurement process that may threaten the validity of the final results. Assessment designers can then improve the assessment quality by focusing on areas that pose the highest threats to the results. This paper discusses the relative magnitudes of three main sources of error with reference to the objectives of assessment programs: measurement error, sampling error, and equating error. A number of examples from large-scale assessments are used to illustrate these errors and their impact on the results. The paper concludes with a number of recommendations that could lead to improvements in the accuracy of large-scale assessment results.
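If the three error sources named above can be treated as independent, their variances add, so the combined standard error is the root sum of squares of the components. A sketch under that independence assumption (the magnitudes below are made-up score points, not figures from the paper):

```python
import math

def total_standard_error(measurement_se, sampling_se, equating_se):
    """Combined standard error, assuming the three error sources are
    independent so that their variances add."""
    return math.sqrt(measurement_se**2 + sampling_se**2 + equating_se**2)

# Illustrative magnitudes in score points (made-up values):
print(total_standard_error(3.0, 4.0, 0.0))   # 5.0

# Adding a small equating error barely moves the total -- the largest
# component dominates, which is why effort should target it first.
print(round(total_standard_error(3.0, 4.0, 1.0), 2))
```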

11.
Cognitive interviews were employed to systematically examine the cognitive validity of self-report survey items extensively used to assess classroom mastery goal structure. In a sample of elementary and middle school students, items were identified that functioned according to their intended meaning and those eliciting less accurate interpretations as conceptually defined by mastery goal structure cognitive validity criteria. Evidence suggested that items framed to focus on students’ teachers (i.e., teacher goals) were more cognitively valid than were items that focused students on their classroom context. Items with abstract terms yielded less accurate interpretations. We discuss implications of determining the cognitive validity of scales used to assess achievement goal structure and related self-report instruments.

12.

Educational stakeholders have long known that students might not be fully engaged when taking an achievement test and that such disengagement could undermine the inferences drawn from observed scores. Thanks to the growing prevalence of computer-based tests and the new forms of metadata they produce, researchers have developed and validated procedures for using item response times to identify responses to items that are likely disengaged. In this study, we examine the impact of two techniques to account for test disengagement—(a) removing disengaged test takers from the sample and (b) adjusting test scores to remove rapidly guessed items—on estimates of school contributions to student growth, achievement gaps, and summer learning loss. Our results indicate that removing disengaged examinees from the sample will likely induce bias in the estimates, although as a whole accounting for disengagement had minimal impact on the metrics we examined. Last, we provide guidance for policy makers and evaluators on how to account for disengagement in their own work and consider the promise and limitations of using achievement test metadata for related purposes.
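A common way to operationalise the second technique above is to flag responses faster than a time threshold as rapid guesses and drop them from scoring. A minimal sketch (the 5-second cutoff and the response data are illustrative assumptions, not the study's actual threshold or data):

```python
def flag_rapid_guesses(response_times, threshold_seconds=5.0):
    """Flag item responses answered faster than a time threshold,
    a common proxy for disengaged rapid guessing."""
    return [t < threshold_seconds for t in response_times]

def engaged_score(correct, response_times, threshold_seconds=5.0):
    """Proportion correct over engaged responses only: rapidly
    guessed items are dropped rather than counted as wrong."""
    flags = flag_rapid_guesses(response_times, threshold_seconds)
    kept = [c for c, rapid in zip(correct, flags) if not rapid]
    return sum(kept) / len(kept) if kept else None

# One hypothetical examinee: 1 = correct, 0 = incorrect.
correct = [1, 0, 1, 1, 0, 1]
times = [22.0, 2.1, 18.5, 3.4, 40.2, 25.0]
print(engaged_score(correct, times))   # 0.75
```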

13.
Predicting item difficulty is highly important in education for both teachers and item writers. Yet although a large number of explanatory variables have been identified, predicting item difficulty remains a challenge in educational assessment, with empirical attempts rarely explaining more than 25% of the variance.

This paper analyses 216 science items from Key Stage 2 tests, which are national sampling assessments administered to 11-year-olds in England. Potential predictors (topic, subtopic, concept, question type, nature of stimulus, depth of knowledge and linguistic variables) were considered in the analysis. Coding frameworks employed in similar studies were adapted and used by two coders to rate the items independently. Linguistic demands were gauged using a computational linguistic facility. The stepwise regression models predicted 23% of the variance, with extended constructed questions and photos being the main predictors of item difficulty.

While a substantial part of the unexplained variance could be attributed to the unpredictable interaction of variables, we argue that progress in this area requires improvement in the theories and the methods employed. Future research needs to be centred on improving coding frameworks as well as developing systematic training protocols for coders. These technical advances would pave the way to improved task design and reduced development costs of assessments.
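The variance-explained figures discussed above come from regressing item difficulty on coded predictors. A minimal single-predictor sketch of the R² calculation (word count as the predictor and the six data points are hypothetical; the study itself used many predictors across 216 items and reached 23%):

```python
def r_squared(x, y):
    """R^2 of a one-predictor least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy ** 2) / (sxx * syy)

# Hypothetical data: item word counts vs. proportion-incorrect difficulty.
word_count = [20, 35, 50, 65, 80, 95]
difficulty = [0.20, 0.30, 0.35, 0.50, 0.55, 0.58]
print(round(r_squared(word_count, difficulty), 2))
```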


14.
In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered for various reasons (e.g., lack of test-taking motivation, poor time management, and test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores, thereby threatening the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with those of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Using the traditional scoring approaches, students' responses for not-reached items were treated as either not-administered or incorrect in the scoring process. With the proposed scoring approach, students' nonmissing responses were scored polytomously, based on how accurately and rapidly they responded to the items, to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared based on several evaluation criteria, such as model fit indices, test information function, and bias. The results indicated that the polytomous scoring approaches outperformed the traditional ones. A complete-case simulation corroborated the empirical findings that the scoring approach in which nonmissing items were scored polytomously and not-reached items were treated as not-administered performed best. Implications of the polytomous scoring approach for low-stakes assessments are discussed.
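One way to realise the accuracy-plus-speed scoring described above is a 0–2 partial-credit rule. The rule and cutoff below are a hypothetical illustration, not necessarily the paper's exact scoring:

```python
def polytomous_score(correct, response_time, fast_cutoff=10.0):
    """Score a single nonmissing response on a 0-2 scale:
    2 = correct and fast, 1 = correct but slow, 0 = incorrect.
    Not-reached items are treated as not administered (scored None).
    Hypothetical rule; the paper's exact scoring may differ."""
    if correct is None:  # not reached
        return None
    if not correct:
        return 0
    return 2 if response_time <= fast_cutoff else 1

# One hypothetical examinee; None marks not-reached items.
responses = [(True, 4.0), (True, 15.0), (False, 9.0), (None, None), (None, None)]
scores = [polytomous_score(c, t) for c, t in responses]
print(scores)   # [2, 1, 0, None, None]
```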

15.
16.
Previous research has shown that people often engage in cultural-deficit thinking when reasoning about racial/ethnic achievement gaps. However, it is unclear whether culture-blaming explanations are best thought of as group-level internal attributions, expressions of prejudice against a lower status group, or self-serving bias. In this study, White and Latino participants (N = 328) responded to items that were written to express either a White-critical or pro–Asian American perspective on the White–Asian American achievement gap. Results indicated that Latinos were more likely to engage in pro-Asian than White-critical culture blaming, whereas expressions of culture blaming did not vary across frames among White participants.

17.
18.
Examinees' thinking processes have become an increasingly important concern in testing. The response processes aspect is a major component of validity, and contemporary tests increasingly involve specifications about the cognitive complexity of examinees' response processes. Yet, empirical research findings on examinees' cognitive processes are not often available either to provide evidence for validity or to guide the design or selection of items. In this article, studies and developments from the author's research program are presented to illustrate how empirical studies on examinees' thinking processes can impact item and test design.

19.
20.
This paper intends to induce a set of properties that unify and distinguish compliance-gaining strategies, and to determine whether coders can reliably classify messages on the basis of the proposed properties. The first goal was accomplished by deriving codified strategies from open-ended responses of subjects to persuasive situations; properties that reflected differences among the strategies were then induced. For the second goal, three coders content-analyzed the original responses in terms of the derived properties. Measures of unitizing and coder reliability and of content validity were assessed. In addition, information concerning representational validity is presented. The approach taken in this paper provides an assessment of the state of affairs in compliance-gaining strategy research.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号