Similar articles (20 results)
1.
Controlled assessment (CA) was introduced as a valid and reliable replacement for coursework in GCSE English and English Literature assessments in 2009. I argue that CA lacks clear definition, typically mimics externally-assessed public examinations and, when interrogated through the Crooks eight-link chain model, is undermined by several threats to validity, reliability and fairness. This is evidenced by the professional experiences of CA stakeholders consulted by Ipsos MORI in their 2011 ‘Evaluation of the Introduction of Controlled Assessment’; the theoretical threats to validity that may arise during the administration, scoring, aggregation, generalisation, extrapolation, evaluation, decision and impact stages of CA events; and problems of perception concerning CA in English and English Literature that derive from competing purposes. I conclude that CA has not yet proved itself a valid, reliable and apposite replacement for coursework, and that further refinement is necessary if CA is to fulfil this purpose.

2.
High-stakes standardized student assessments are increasingly used in value-added evaluation models to connect teacher performance to P–12 student learning. These assessments are also being used to evaluate teacher preparation programs, despite validity and reliability threats. A more rational model linking student performance to candidates who actually teach these students is presented. Preliminary findings with three candidate cohorts indicate that the majority of their students met learning objectives and showed substantial pre-to-post learning gains.

3.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.
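The contrast between cumulative and unfolding response processes can be made concrete with the dichotomous response functions of the two model families the abstract names. The sketch below is illustrative, not taken from the article: the Rasch form is standard, while the unfolding function follows the hyperbolic cosine model as commonly attributed to Andrich and Luo, with `gamma` as a unit/latitude parameter whose default value here is an arbitrary assumption.

```python
import math

def rasch_prob(theta, delta):
    # Cumulative (Rasch) model: endorsement probability rises monotonically
    # as the person/rater location theta exceeds the item location delta.
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def hcm_prob(theta, delta, gamma=1.0):
    # Unfolding (hyperbolic cosine) model: probability is single-peaked,
    # maximal where theta == delta and symmetric on either side.
    # gamma (latitude/unit parameter) defaults to 1.0 purely for illustration.
    return math.exp(gamma) / (math.exp(gamma) + 2.0 * math.cosh(theta - delta))

# Cumulative response: monotone in theta
print([round(rasch_prob(t, 0.0), 3) for t in (-2.0, 0.0, 2.0)])
# Unfolding response: peaks at theta == delta, falls off on both sides
print([round(hcm_prob(t, 0.0), 3) for t in (-2.0, 0.0, 2.0)])
```

The practical distinction for rater data is that a cumulative model expects more of the trait to always mean higher endorsement, whereas an unfolding model expects agreement to drop off once the rater's location passes the item's.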

4.
Rater‐mediated assessments are a common methodology for measuring persons, investigating rater behavior, and/or defining latent constructs. The purpose of this article is to provide a pedagogical framework for examining rater variability in the context of rater‐mediated assessments using three distinct models. The first model is the observation model, which includes ecological/environmental considerations for the evaluation system. The second model is the measurement model, which includes the transformation of observed, rater response data to linear measures using a measurement model with specific requirements of rater‐invariant measurement in order to examine raters’ construct‐relevant variability stemming from the evaluative system. The third model is the interaction model, which includes an interaction parameter to allow for the investigation into raters’ systematic, construct‐irrelevant variability stemming from the evaluative system. Implications for measurement outcomes and validity are discussed.
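One common way to express a measurement model with a rater facet, plus an added interaction parameter, is a many-facet Rasch-style logit. The sketch below is a minimal illustration, not the article's own specification: the sign convention and the additive interaction term are assumptions made here for demonstration.

```python
import math

def mfr_prob(theta, delta, severity, interaction=0.0):
    # Many-facet Rasch sketch: log-odds of a positive rating equal
    # person ability minus item difficulty minus rater severity.
    # The optional interaction term represents a rater-by-item (or
    # rater-by-person) effect, i.e., construct-irrelevant variability
    # tied to that particular pairing (sign convention is an assumption).
    logit = theta - delta - severity + interaction
    return 1.0 / (1.0 + math.exp(-logit))

# A severe rater (severity > 0) lowers the expected rating at fixed ability
print(round(mfr_prob(1.0, 0.0, 0.0), 3), round(mfr_prob(1.0, 0.0, 1.0), 3))
```

In this framing, a nonzero `interaction` for a specific rater-item pair is exactly the kind of systematic, construct-irrelevant variability the third model is designed to detect.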

5.
Assessment of Prior Learning (APL) refers to a process where adults’ prior learning, formal as well as informal, is assessed and acknowledged. In the first section of this paper, APL and current conceptions of validity in assessments and its evaluation are presented. It is argued that participants in the assessment are an important source of information for the validation of the assessment. In the following section participants’ experiences from a particular APL scheme are evaluated using a questionnaire developed for that purpose. The questionnaire provides data on individuals’ perceptions of the procedure and result of the APL scheme. The results are described, analysed and discussed from a validity perspective. Conclusions drawn from the results are that possible threats to validity can exist in the administration of APL procedures, as well as in consequences of APL.

6.
Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the scoring criteria. Validity of score meaning is affected by the accuracy of raters' representations of examinee performance and the scoring criteria, and the accuracy of the mapping process. Methodological advances and applications that model rater response processes, rater accuracy, and rater consistency inform the design, scoring, interpretations, and uses of rater‐mediated assessments.

7.
Recent developments in British higher education have included taking a close look at work‐based learning, in particular its assessment (and its integration within academic programmes of study). However, two questions which are still continuously being asked are (a) to what extent are assessments of work‐based learning valid and reliable, and (b) can they count towards the award of university degrees and diplomas? These questions are becoming increasingly important as there seems to be a growing trend for students to assess their own learning at the workplace (through reflection and analysis and the use of diaries and self‐development journals). This article addresses the above issues by drawing on classical test theory (for an understanding of the fundamentals of validity and reliability) and by examining how the different notions of validity and reliability may be applied in the context of assessments (and self‐assessments) in the workplace. The article concludes that, under certain stated conditions, it is indeed possible to determine whether assessments (and self‐assessments) of work‐based learning are valid, reliable, and comparable.

8.
This research contributes to methodologies for HPT program evaluation and measurement, an area that remains underdeveloped to date. First, a theoretical foundation for a control group is established based on a brief review of control group applications in various fields. Then, four types of control groups applicable to HPT program evaluation and measurement are defined and classified, and threats to internal and external validity in control group applications are explored. Lastly, four evaluation and measurement scenarios are presented for an E‐learning program to demonstrate the applicability of the control group methods for HPT program evaluation and ROI measurement.
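The basic arithmetic behind a control-group evaluation of a program, and of the ROI calculation it feeds, can be sketched in a few lines. This is a generic difference-in-differences style illustration under assumed numbers, not the article's own scenarios or formulas.

```python
def program_effect(treat_pre, treat_post, ctrl_pre, ctrl_post):
    # Difference-in-differences style estimate: the treatment group's gain
    # beyond the control group's gain. Subtracting the control gain guards
    # against history and maturation threats to internal validity.
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

def roi_percent(benefit, cost):
    # Conventional ROI expressed as a percentage of program cost.
    return 100.0 * (benefit - cost) / cost

# Hypothetical pre/post scores for an e-learning program
effect = program_effect(60.0, 80.0, 61.0, 68.0)
print(effect)                       # treatment gained 13.0 points beyond control
print(roi_percent(50_000, 20_000))  # 150.0 percent ROI on assumed figures
```

Without the control-group subtraction, the naive treatment gain (20 points) would overstate the program effect by whatever the control group gained on its own.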

9.
Single‐case research designs are often applied within school psychology. This article provides a critical review of the scientific merit of both concurrent and nonconcurrent multiple baseline (MB) designs, relative to their capacity to assess threats to internal validity and establish experimental control. Distinctions are established between AB replications and nonconcurrent multiple baseline designed studies using the initial conception proposed by P.J. Watson and E.A. Workman (1981). Despite some previously pessimistic evaluations of nonconcurrent multiple baseline designs, the findings of this review suggest that various threats to internal validity can be assessed and ruled out using either concurrent or nonconcurrent MB designs. It seems that nonconcurrent designs can be used to assess the intervening effects of history, but might be more prone to threats of mortality. These and other threats to internal validity are reviewed and recommendations are provided. © 2007 Wiley Periodicals, Inc. Psychol Schs 44: 451–459, 2007.

10.
In large-scale assessments, such as state-wide testing programs, national sample-based assessments, and international comparative studies, there are many steps involved in the measurement and reporting of student achievement. There are always sources of inaccuracies in each of the steps. It is of interest to identify the source and magnitude of the errors in the measurement process that may threaten the validity of the final results. Assessment designers can then improve the assessment quality by focusing on areas that pose the highest threats to the results. This paper discusses the relative magnitudes of three main sources of error with reference to the objectives of assessment programs: measurement error, sampling error, and equating error. A number of examples from large-scale assessments are used to illustrate these errors and their impact on the results. The paper concludes by making a number of recommendations that could lead to an improvement of the accuracies of large-scale assessment results.
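When the three error sources are treated as independent, their variances add, so the total standard error of a reported result is the root of the summed squared components. The sketch below is a generic illustration of that variance-addition logic, not a formula taken from the paper; the independence assumption and the example magnitudes are mine.

```python
import math

def total_standard_error(measurement_se, sampling_se, equating_se):
    # Assuming the three error components are independent, their
    # variances (squared SEs) add; the total SE is the square root
    # of the summed variances.
    return math.sqrt(measurement_se**2 + sampling_se**2 + equating_se**2)

# Illustrative magnitudes: an equating error that looks small relative to
# measurement error still contributes to the total, and it does not shrink
# with sample size the way sampling error does.
print(round(total_standard_error(1.2, 0.8, 0.5), 3))
```

The practical point is diagnostic: computing the components separately shows which step of the measurement process poses the highest threat to a given reporting objective, such as a trend comparison between assessment cycles.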

11.
Applied Measurement in Education, 2013, 26(4): 413–432
With the increasing use of automated scoring systems in high-stakes testing, it has become essential that test developers assess the validity of the inferences based on scores produced by these systems. In this article, we attempt to place the issues associated with computer-automated scoring within the context of current validity theory. Although it is assumed that the criteria appropriate for evaluating the validity of score interpretations are the same for tests using automated scoring procedures as for other assessments, different aspects of the validity argument may require emphasis as a function of the scoring procedure. We begin the article with a taxonomy of automated scoring procedures. The presentation of this taxonomy provides a framework for discussing threats to validity that may take on increased importance for specific approaches to automated scoring. We then present a general discussion of the process by which test-based inferences are validated, followed by a discussion of the special issues that must be considered when scoring is done by computer.

12.
A conceptual model of quality of life and associated instrumentation for collecting data from persons with developmental disabilities are presented. The conceptual model assumes that the components of quality of life for persons with developmental disabilities are the same as for all persons. Additionally, in recognition of the complexity and importance of quality of life assessments, a multi‐method, multi‐source approach was developed. Results from a preliminary study provide evidence for the reliability and validity of the instrumentation associated with the model. The meaning of these preliminary results is examined, and the issues raised by such assessments are discussed.

13.
States use standards‐based English language proficiency (ELP) assessments to inform relatively high‐stakes decisions for English learner (EL) students. Results from these assessments are one of the primary criteria used to determine EL students’ level of ELP and readiness for reclassification. The results are also used to evaluate the effectiveness of and funding allocation to district or school programs that serve EL students. In an effort to provide empirical validity evidence for such important uses of ELP assessments, this study focused on examining the constructs of ELP assessments as a fundamental validity issue. Particularly, the study examined the types of language proficiency measured in three sample states’ ELP assessments and the relationship between each type of language proficiency and content assessment performance. The results revealed notable variation in the presence of academic and social language in the three ELP assessments. A series of hierarchical linear modeling (HLM) analyses also revealed varied relationships among social language proficiency, academic language proficiency, and content assessment performance. The findings highlight the importance of examining the constructs of ELP assessments for making appropriate interpretations and decisions based on the assessment scores for EL students. Implications for policy and practice are discussed.

14.
Peña ED. Child Development, 2007, 78(4): 1255–1264
In cross-cultural child development research there is often a need to translate instruments and instructions to languages other than English. Typically, the translation process focuses on ensuring linguistic equivalence. However, establishment of linguistic equivalence through translation techniques is often not sufficient to guard against validity threats. In addition to linguistic equivalence, functional equivalence, cultural equivalence, and metric equivalence are factors that need to be considered when research methods are translated to other languages. This article first examines cross-cultural threats to validity in research. Next, each of the preceding factors is illustrated with examples from the literature. Finally, suggestions for incorporating each factor into research studies of child development are given.

15.
Assessment of performance in practical science and pupil attributes
Performance assessment in the UK science General Certificate of Secondary Education (GCSE) currently relies on pupil reports of their investigations. These are widely criticized. Written tests of procedural understanding could be used as an alternative, but what exactly do they measure? This paper describes small‐scale research analysing pupils' GCSE scores for substantive ideas, their coursework performance assessment and a novel written evidence test. Results from these different assessments were compared with each other and with baseline data on CAT scores and pupils' attributes. Significant predictors of performance on each of these assessments were determined. The data reported show that a choice could be made between practical coursework that links to ‘behaviour’ and written evidence tests which link, albeit less strongly, with ‘quickness’. There would be differential effects on pupils.

16.
Constituting a metacognitive strategy, system competence or systems thinking can only assume its assigned key function as a basic concept for the school subject of geography in Germany after a theoretical and empirical foundation has been established. A measurement instrument is required which is suitable both for supporting students and for the evaluation of methodical‐didactic measures. Such a tool is theoretically anchored in an empirically validated geography‐didactic and cognition‐psychological competence model, providing a differentiated representation of both the internal structure of a competency and the proficiency levels. The starting point of this foundation was the development of a normative‐theoretically derived model of geographic system competence. Its empirical validation was performed in different phases aimed at operationalising the competence model by means of test problems. In order to analyse the factor structure of the theoretical model, various item response models were estimated. The item levels of difficulty expected in the competence model were related to the empirical levels of difficulty and predicted by means of ordinary least squares regression to verify the model for proficiency levels. The two‐dimensional competence model – with the two dimensions ‘system organisation and behaviour’ and ‘system‐adequate intention to act’ – exhibits a better fit in reference to the model fit criteria than the one‐dimensional and three‐dimensional models. The correlations between the expected and empirical item difficulties are positive. Items that should be more difficult according to the competence model are actually shown to be more difficult. These findings suggest the reliability and validity of this new measurement instrument for diagnosing and promoting geographical system competence. It has to be implemented in practice as the next step.
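The step of predicting empirical item difficulties from the difficulty levels posited by a competence model amounts to a simple ordinary least squares regression; a positive slope supports the hypothesized level ordering. The sketch below uses the closed-form OLS solution with made-up level assignments and logit difficulties, purely to illustrate the check, not the study's data.

```python
def ols_fit(x, y):
    # Closed-form simple OLS: slope and intercept that minimize
    # the sum of squared residuals of y on x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: theoretically expected difficulty level per item (1-3)
# and the item's empirical difficulty in logits from an IRT calibration.
levels = [1, 1, 2, 2, 3, 3]
empirical = [-1.1, -0.8, 0.1, 0.3, 0.9, 1.2]
slope, intercept = ols_fit(levels, empirical)
print(round(slope, 3), round(intercept, 3))
```

A clearly positive slope here is the empirical counterpart of the paper's finding that items expected to be harder under the competence model are in fact harder.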

17.
Multiple threats to validity and reliability exist when value-added models (VAMs) rely wholly on standardized assessments to measure the relationship between teachers and their K–12 students’ learning gains. Research on a curriculum-based VAM, built on evidence-based practices, continues to establish an explicit link between teacher candidates’ instruction and their K–12 students’ learning gains. Statistical tests of association were used to analyze measures of student learning and university supervisors’ ratings during classroom observations with a department instrument, the Narrative Observation Scale. Results from a sample of 23 teacher candidates revealed that (a) two measures of student learning were related and attributed to candidates’ instruction, and (b) 67.6% of the variance in the percentage of K–12 students meeting their specific learning objectives was accounted for by the teacher candidates’ mastery of specific classroom management behaviors. Limitations and directions for future research are discussed regarding continued efforts to refine a rational, curriculum-based VAM.

18.
Numerous researchers have proposed methods for evaluating the quality of rater‐mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many‐facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater‐mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater‐mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.
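A core numeric indicator in Mokken scale analysis is the scalability coefficient H, which for a pair of dichotomous items compares observed Guttman errors with the count expected under independence. The sketch below is a minimal illustration of the pairwise coefficient on binary data; it is not the article's analysis (which concerns polytomous ratings), and the convention of orienting the pair by item popularity is stated here as the usual one.

```python
def pairwise_H(x, y):
    # Item-pair scalability coefficient H_ij = 1 - F/E, where F is the
    # observed count of Guttman errors (passing the harder item while
    # failing the easier one) and E is the count expected if the two
    # items were statistically independent.
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    if px > py:                       # make x the harder (less popular) item
        x, y, px, py = y, x, py, px
    observed = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    expected = n * px * (1.0 - py)
    return 1.0 - observed / expected

# Perfect Guttman pattern: no errors, so H = 1
print(pairwise_H([1, 0, 0, 0], [1, 1, 1, 0]))
```

Values near 1 indicate responses consistent with a single underlying ordering of persons and items (the nonparametric analogue of invariant measurement), while values near 0 indicate ratings behaving as if unrelated.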

19.
Self‐report inventories are commonly administered to measure social‐emotional learning competencies related to college and career readiness. Inattentive responding can negatively impact the validity of interpreting individual results and the accuracy of construct validity evidence. This study applied nine methods of detecting insufficient effort responding (IER) to a social‐emotional learning assessment. Individual methods identified between 0.9% and 20.3% of respondents as potentially exhibiting IER. Removing flagged respondents from the data resulted in negligible or small improvements in criterion‐related validity, coefficient alpha, concurrent validity, and confirmatory factor analysis model‐data fit. Implications for future validity studies and the operational use of IER detection for social–emotional learning assessments are discussed.
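One widely used IER screen of the kind such studies apply is the longstring index: the length of a respondent's longest run of identical consecutive answers, with extreme values suggesting straight-lining. The sketch below is a generic illustration; the cutoff of 8 and the example response vectors are arbitrary assumptions, not values from this study, and cutoffs in practice depend on scale length and content.

```python
def longstring(responses):
    # Longstring index: length of the longest run of identical
    # consecutive responses in a respondent's answer vector.
    best = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def flag_ier(respondents, threshold=8):
    # Return indices of respondents whose longest run meets the cutoff.
    # threshold=8 is purely illustrative.
    return [i for i, r in enumerate(respondents) if longstring(r) >= threshold]

data = [
    [3, 4, 2, 5, 3, 4, 1, 2, 4, 3],   # varied responding
    [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],   # straight-lining
]
print(flag_ier(data))   # only the straight-liner (index 1) is flagged
```

As the abstract notes, different screens flag very different proportions of respondents, so in operational use an index like this is typically combined with other methods rather than applied alone.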

20.

Building on the papers in this special issue, this article uses modern conceptions of validity theory to provide a framework for considering the evaluation of teaching quality. The 3 facets of teaching quality focused on are domain conceptualization, evidence and inferences, and their evaluation. Domain definitions vary in their specificity with tradeoffs in their range of applicability and specificity of inference. Evidence collection can range from highly standardized assessments to observations that must attend to evidence from a myriad of classroom interactions. For all assessments, however, even the most standardized, different interpretations of assessment tasks can threaten the validity of score interpretations. The papers consider a range of processes that are designed to generate, support, and interrogate the validity of inferences based on assessment scores. A fundamental question underlying this type of measurement is whether differences in the quality of teaching that students experience can be causally attributed to the teacher.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号