Similar Literature (20 results)
1.
This article evaluates a procedure-based scoring system for a performance assessment (an observed paper towels investigation) and a notebook surrogate completed by fifth-grade students varying in hands-on science experience. Results suggested that interrater reliability of scores was adequate (>.80) for both observed performance and notebooks, with the reliability of the former higher. In contrast, interrater agreement on procedures was higher for observed hands-on performance (.92) than for notebooks (.66). Moreover, for the notebooks, the reliability of scores and agreement on procedures varied by student experience, but this was not so for observed performance. Both the observed-performance and notebook measures correlated less with traditional ability than did a multiple-choice science achievement test. The correlation between the two performance assessments and the multiple-choice test was only moderate (mean = .46), suggesting that they measure different aspects of science achievement. Finally, the correlation between the observed-performance scores and the notebook scores was .83, suggesting that notebooks may provide a reasonable, albeit less reliable, surrogate for the observed hands-on performance of students.
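As a point of reference, the two interrater indices contrasted in this abstract can be computed very simply; the sketch below is an illustrative example with made-up ratings, not data or code from the study.

# Illustrative sketch (hypothetical data): score reliability as the Pearson
# correlation between two raters' scores, and procedure agreement as the
# proportion of exact matches between the raters.
import numpy as np

rater_a = np.array([4, 3, 5, 2, 4, 3, 5, 1])   # hypothetical scores, rater A
rater_b = np.array([4, 3, 4, 2, 4, 3, 5, 2])   # hypothetical scores, rater B

score_reliability = np.corrcoef(rater_a, rater_b)[0, 1]
exact_agreement = np.mean(rater_a == rater_b)

print(f"interrater reliability (r) = {score_reliability:.2f}")
print(f"exact agreement rate       = {exact_agreement:.2f}")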

2.
Machine learning has been frequently employed to automatically score constructed-response assessments. However, there is a lack of evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teacher pedagogical content knowledge (PCK); each task contains three video-based constructed-response questions. 187 in-service science teachers watched the videos, each set in a given classroom teaching scenario, and then responded to the constructed-response items. Three human experts rated the responses, and the human consensus scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater, along with the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity to the scenarios. Results indicate that variability of scenarios impacts teachers' performance, but the impact significantly depends on the construct of interest; for each assessment task, the machine was consistently the most severe rater compared to the three human raters. However, the machine was less sensitive than the human raters to the task scenarios. This means the machine scoring is more consistent and stable across scenarios within each of the two tasks.
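For context, a common three-facet form of the many-facet Rasch model is given below; this is a standard textbook formulation, and the study's exact parameterization (including how the scenario facet enters) is not reproduced here.

\log \left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \delta_i - \alpha_j - \tau_k

where \theta_n is the ability of teacher n, \delta_i the difficulty of item i, \alpha_j the severity of rater j (human or machine), and \tau_k the threshold for rating category k relative to category k-1.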

3.
Educational Assessment, 2013, 18(4), 325-338.
Concerns about the effects of multiple-choice measures in traditional testing programs have led many educators and policymakers to suggest the use of alternative assessment methods. Some performance-based assessments require students to work in small collaborative groups as part of the test process. This study uses responses to hands-on science tasks at Grades 5 and 8 to examine whether the score a student earns while working with someone else is a truly independent assessment of that student's ability. We also explore whether working in pairs affects an individual's scores on subsequent tasks and whether these results are consistent across grade levels. Our analyses indicate that, at Grades 5 and 8, work done with a partner should not be considered an independent assessment of each student's ability. Some evidence of carry-over effects from working in pairs was found at each grade.

4.
Psychometric models based on the structural equation modeling framework are commonly used in many multiple-choice test settings to assess measurement invariance of test items across examinee subpopulations. The premise of the current article is that they may also be useful in the context of performance assessment tests to test measurement invariance of raters. The modeling approach, and how it can be used for performance tests with less than optimal rater designs, is illustrated using a data set from a performance test designed to measure medical students' patient management skills. The results suggest that group-specific rater statistics can help spot differences in rater performance that might be due to rater bias, identify specific weaknesses and strengths of individual raters, and enhance decisions related to future task development, rater training, and test scoring processes.

5.
This study evaluated the reliability and validity of a performance assessment designed to measure students' thinking and reasoning skills in mathematics. The QUASAR Cognitive Assessment Instrument (QCAI) was administered to over 1,700 sixth- and seventh-grade students of various ethnic backgrounds in six schools participating in the QUASAR project. The consistency of students' responses across tasks and the validity of inferences drawn from the assessment scores to the more broadly defined construct domain were examined. Intertask consistency and the dimensionality of the assessment were assessed through polychoric correlations and confirmatory factor analysis, and the generalizability of the derived scores was examined through generalizability theory. The results from the confirmatory factor analysis indicate that a one-factor model fits the data for each of the four QCAI forms. The major findings from the generalizability studies (person x task and person x rater x task) indicate that, for each of the four forms, the person x task variance component accounts for the largest percentage of the total variability, and the percentage of variance accounted for by the variance components that include the rater effect is negligible. The generalizability and dependability coefficients for the person x task decision studies (n_t = 9) range from .71 to .84. These results indicate that nine tasks may not be adequate for generalizing to the larger domain of mathematics at the individual student level. The QUASAR project, however, is interested in assessing mathematics achievement at the program level, not the student level; therefore, these coefficients are not alarmingly low.
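For reference, in a persons-crossed-with-tasks design the generalizability coefficient referred to above is conventionally computed as follows (a standard G-theory expression, not a formula quoted from the study):

E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pt,e} / n_t}

where \sigma^2_p is the person variance component, \sigma^2_{pt,e} the person-by-task (plus residual error) component, and n_t the number of tasks in the decision study (here n_t = 9). The dependability coefficient \Phi uses absolute rather than relative error in the denominator, which additionally includes the task component \sigma^2_t / n_t.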

6.
The definition of what it means to take a test online continues to evolve with the inclusion of a broader range of item types and a wide array of devices used by students to access test content. To assure the validity and reliability of test scores for all students, device comparability research should be conducted to evaluate the impact of testing device on student test performance. The current study looked at the comparability of test scores across tablets and computers for high school students in three commonly assessed content areas and for a variety of different item types. Results indicate no statistically significant differences across device type for any content area or item type. Student survey results suggest that students may have a preference for taking tests on devices with which they have more experience, but that even limited exposure to tablets in this study increased positive responses for testing on tablets.

7.
Although federal regulations require testing students with severe cognitive disabilities, there is little guidance regarding how technical quality should be established. It is known that challenges exist with documentation of the reliability of scores for alternate assessments. Typical measures of reliability do little to model the multiple sources of error that are characteristic of alternate assessments. Generalizability theory (G-theory), in contrast, allows researchers to identify sources of error and analyze the relative contribution of each source. This study demonstrates an application of G-theory to examine reliability for an alternate assessment. A G-study with the facets rater type, assessment attempts, and tasks was conducted to determine the relative contribution of each facet to observed score variance. Results were used to determine the reliability of scores. The assessment design was then modified to examine how changes might impact reliability. As a final step, designs deemed satisfactory were evaluated regarding the feasibility of adapting them into a statewide standardized assessment and accountability program.
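To make the kind of design modification described above concrete, the sketch below shows a minimal decision-study projection in Python; the simplified persons x raters x tasks design and the variance components are illustrative assumptions, not the study's actual facets or estimates.

# Minimal D-study sketch: project the generalizability coefficient from
# estimated variance components for a fully crossed p x r x t design.
# The variance components below are illustrative placeholders, not the
# estimates reported in the study.

def d_study(var_p, var_pr, var_pt, var_prt_e, n_raters, n_tasks):
    """Return the relative generalizability coefficient E-rho^2 for a
    crossed persons x raters x tasks design."""
    rel_error = (var_pr / n_raters
                 + var_pt / n_tasks
                 + var_prt_e / (n_raters * n_tasks))
    return var_p / (var_p + rel_error)

# Hypothetical components: person, person x rater, person x task, residual.
components = dict(var_p=0.40, var_pr=0.02, var_pt=0.25, var_prt_e=0.15)

for n_t in (3, 6, 9):
    for n_r in (1, 2):
        coef = d_study(**components, n_raters=n_r, n_tasks=n_t)
        print(f"tasks={n_t}, raters={n_r}: E-rho^2 = {coef:.2f}")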

8.
Although much attention has been given to rater effects in rater-mediated assessment contexts, little research has examined the overall stability of leniency and severity effects over time. This study examined longitudinal scoring data collected during three consecutive administrations of a large-scale, multi-state summative assessment program. Multilevel models were used to assess the overall extent of rater leniency/severity during scoring and examine the extent to which leniency/severity effects were stable across the three administrations. Model results were then applied to scaled scores to estimate the impact of the stability of leniency/severity effects on students' scores. Results showed relative scoring stability across administrations in mathematics. In English language arts, short constructed-response items showed evidence of slightly increasing severity across administrations, while essays showed mixed results: evidence of both slightly increasing severity and moderately increasing leniency over time, depending on trait. However, when model results were applied to scaled scores, results revealed rater effects had minimal impact on students' scores.
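A minimal sketch of the kind of multilevel model used for this sort of leniency/severity analysis might look as follows in Python with statsmodels; the variable names, data layout, and model specification are assumptions for illustration, not the study's actual model.

# Hypothetical multilevel model of rater severity across administrations.
# Assumed columns: 'score' (assigned score), 'administration' (1, 2, 3),
# and 'rater' (rater ID). This is an illustrative sketch only.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ratings.csv")  # hypothetical file of scored responses

# A random intercept per rater captures overall leniency/severity; the
# administration fixed effect captures drift in average severity over time.
# Adding re_formula="~C(administration)" would allow rater-specific drift.
model = smf.mixedlm("score ~ C(administration)", data=df, groups="rater")
result = model.fit()
print(result.summary())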

9.
This study investigated HyperCard as a tool for assessment in science education and determined whether a HyperCard assessment instrument could differentiate between expert and novice student performance (balancing stoichiometric equations). Five chemical equations were presented both in a traditional pen-and-paper format and in a HyperCard (Hyperequation) program. Thirty honors (expert) and 30 regular (novice) chemistry students were randomly divided into HyperCard and pen-and-paper groups of 15 students each. Scoring was based on five dependent variables: performance scores, number of attempts, rate of attempts, time on task, and correctness. Correlation results indicated that students with high performance scores correctly balanced more equations, required fewer attempts to balance equations, and required less time per attempt than did students with low performance scores. MANOVA results showed that performance scores and correctness scores for both experts and novices were significantly higher on HyperCard than on pen-and-paper assessment; the novice scores on HyperCard nearly equaled the expert pen-and-paper scores. Significant interactions were found for time on task and for correctness. The results suggest that HyperCard can be a suitable tool for assessment in science education and that such an instrument can differentiate between expert and novice student performance.

10.
Applied Measurement in Education, 2013, 26(4), 323-342.
This study provides empirical evidence about the sampling variability and generalizability (reliability) of a statewide science performance assessment. Results at both the individual and school levels indicate that task-sampling variability was the major source of measurement error in the performance assessment; rater-sampling variability was negligible. Adding more tasks improves the generalizability of the measurement. For the school-level assessment, the variation of performance among students within a school was larger than the variation among schools. Increasing the number of students taking a test within a school thus increases the generalizability of the assessment. Finally, the allocation of students in a matrix-sampling design is compared to a students-crossed-with-tasks design. The former would require fewer tasks per student than the latter to build a generalizable measure of school performance.

11.
The study aims to investigate the effects of delivery modalities on psychometric characteristics and student performance on cognitive tests. A first study assessed the inductive reasoning ability of 715 students under the supervision of teachers. A second study examined 731 students' performance in applying the control-of-variables strategy in basic physics, but without teacher supervision due to the COVID-19 pandemic. Rasch measurement showed that the online format fitted the data better in the unidimensional model across the two conditions. Under teacher supervision, paper-based testing was better than online testing in terms of reliability and total scores, but the opposite pattern was found without teacher supervision. Although measurement invariance was confirmed between the two versions at the item level, the differential bundle functioning analysis favored the online groups on item bundles constructed of figure-related materials. Response time was also discussed as an advantage of technology-based assessment for test development.

12.
Educational tests used for accountability purposes must represent the content domains they purport to measure. When such tests are used to monitor progress over time, the consistency of the test content across years is important for ensuring that observed changes in test scores are due to student achievement rather than to changes in what the test is measuring. In this study, expert science teachers evaluated the content and cognitive characteristics of the items from two consecutive annual administrations of a 10th-grade science assessment. The results indicated that the content area representation was fairly consistent across years and that the proportion of items measuring the different cognitive skill areas was also consistent. However, the experts identified important cognitive distinctions among the test items that were not captured in the test specifications. The implications of this research for the design of science assessments and for appraising the content validity of state-mandated assessments are discussed.

13.
Two parallel versions of a Test of Science Investigation Skills were developed to assess students' application of science investigation skills in biology and physics contexts. Repeated pilot testing and critical appraisal were used to ensure the validity of the tests and their equivalence. Both versions of the test were administered to 112 Year 10 science students. The results indicated a satisfactory level of test reliability; the test set in a physics context proved to be significantly more difficult than the test set in a biology context; and mean scores for male and female students were not significantly different.

14.
A Study of Scoring Standardization and Reliability in Oral Testing
Oral examinations tend to have relatively high validity but comparatively low reliability; yet without reliability, validity cannot truly be guaranteed. How to improve the reliability of oral tests is therefore a question of wide concern among testing researchers. By describing the scoring standardization and rater training for the oral component of the Tsinghua University English Proficiency Test, this article discusses how scoring can be standardized to improve the reliability of oral tests.

15.
A variance analysis of the relation between the amount of time students spent experiencing hands-on science and science achievement was performed. Data collected by the National Education Longitudinal Study of 1988 on a nationally representative sample of eighth-grade students were analyzed. Student achievement in science was measured by a cognitive test battery developed by the Educational Testing Service. Information regarding the frequency of hands-on experience was collected through a self-administered teacher questionnaire, which included a series of questions specific to the science curriculum. From the analysis it was concluded that significant differences existed across the hands-on frequency variable with respect to science achievement. Specifically, students who engaged in hands-on activities every day or once a week scored significantly higher on a standardized test of science achievement than students who engaged in hands-on activities once a month, less than once a month, or never. © 1996 John Wiley & Sons, Inc.

16.
This article describes a four-year project undertaken to develop a set of performance tasks that could be used for assessing hands-on science in Irish primary schools. It begins by considering some of the literature on performance assessment and concludes with a discussion of the potential of the tasks to support teaching and learning in science. The main body of the article is structured to reflect the five phases of the research project itself. In phase one, science assessments used in a variety of educational systems in Australia, Canada, New Zealand, the United Kingdom and the United States were located and catalogued. In phase two, approximately 170 performance tasks were selected and adapted by the authors to suit the requirements of the Irish primary science curriculum. In phase three, a purposive convenience sample of teachers evaluated the extent to which the tasks (a subset of 67) were suitable for use at different grade levels. The teachers' feedback was used to amend tasks. In phase four, the researchers observed 11 different tasks being implemented in classrooms. The eleven teachers involved were interviewed about their experiences immediately afterwards. Again, based on the outcomes of this study, changes were made to the tasks. The fifth phase of the project, due to be completed in 2006, will involve the dissemination of 124 of the tasks to teachers via a booklet and a CD-ROM. Future prospects relating to other elements of the project, such as Web-based resources, professional development courses and exemplars of performance, are also discussed.

17.
Accountability for educational quality is a priority at all levels of education. Low-stakes testing is one way to measure the quality of education that students receive and make inferences about what students know and can do. Aggregate test scores from low-stakes testing programs are suspect, however, to the degree that these scores are influenced by low test-taker effort. This study examined the generalizability of a recently developed technique called motivation filtering, whereby scores for students of low motivation are systematically filtered from test data to determine aggregate test scores that more accurately reflect student performance and that can be used for reporting purposes. Across assessment tests in five different content areas, motivation filtering was found to consistently increase mean test performance and convergent validity.

18.
This report is a review of reliability data on the PPVT obtained from 32 research studies published between 1965 and 1974. Much of the research was done on Head Start children. Overall, the median of the reliability coefficients reported here (0.72) has remained remarkably close to the original median of 0.77 found in standardizing the test. Unexpectedly, elapsed time between test and retest had only a slight effect on the reliability coefficients. However, as expected, the greater the range in ages and ability levels of subjects, the higher the reliabilities. For average children in the elementary grades, and for retarded people of all ages, PPVT scores remained relatively stable over time and there was close equivalence between alternate forms. Scores were least stable for preschool children, especially those from minority groups. Black preschool girls were more variable in their performance on the PPVT than boys, and preschool girls generally were more responsive than boys to play periods conducted before testing began. A number of variables associated with examiners and setting affected scores on the test. As expected, raw scores tended to yield slightly higher reliabilities than MA scores and considerably higher reliabilities than IQ scores.

19.
20.

The German school system employs centrally organized performance assessments (some of which are called "VERA") as a way of promoting lesson development. In recent years, several German federal states introduced a computer-based performance testing system which will replace the paper-pencil testing system in the future. Scores from computer-based testing are required to be equivalent to paper-pencil testing scores so that the new testing medium does not lead to disadvantages for students. Therefore, the current study aimed at investigating the size of the mode effect and the moderating impact of students' gender, academic achievement, and the language mainly spoken in everyday life. In addition, the variance of the mode effect across tasks was investigated. The study was conducted in four German federal states in 2019 using a field-experimental design. The test scores of 5,140 eighth-graders from 165 schools in the subject of German were analysed. The results of multilevel modelling revealed that students' test scores in the computerized version of the VERA test were significantly lower than in the paper-pencil version. Students with lower academic achievement were more disadvantaged by the computerized VERA test. The results were inconsistent regarding the interactions between testing mode and students' gender and the language mainly spoken in everyday life. The variance of the mode effect across tasks was high. Research into different subjects and in other federal states and countries under different testing conditions might yield further evidence about the generalizability of these results.


