20 similar articles found.
In criterion‐referenced tests (CRTs), the traditional measures of reliability used in norm‐referenced tests (NRTs) have often proved problematic because NRTs assume one underlying ability or competency and variance in the distribution of scores. CRTs, by contrast, are likely to be created when mastery of the skill or knowledge by all or almost all test takers is expected, and thus little variation in the scores is expected. A comprehensive CRT often measures a number of discrete tasks that may not represent a single unifying ability or competence. Hence, CRTs theoretically violate the two most essential assumptions of classic NRT reliability theory, and estimating their reliability has traditionally required multiple test administrations to the same test takers, with the logistical problems that entails. A review of the literature categorizes approaches to reliability for CRTs into two classes: estimates sensitive to all measures of error and estimates of consistency in test outcome. For a single test administration of CRTs, Livingston's k² is recommended for estimating all measures of error, and Sc is proposed for estimates of consistency in test outcome. Both approaches are compared using data from a CRT exam, and recommendations for interpretation and use are proposed.
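
As a rough illustration of the first class of estimate, Livingston's coefficient is commonly written as k² = (r·σ² + (μ − C)²) / (σ² + (μ − C)²), where r is a conventional reliability estimate (for example KR-20), σ² and μ are the variance and mean of the observed scores, and C is the cut score. The sketch below simply evaluates that expression; the function name, argument names, and sample values are assumptions for illustration and do not come from the abstract.

```python
def livingston_k2(reliability: float, variance: float, mean: float, cut: float) -> float:
    """Evaluate Livingston's coefficient about a cut score.

    k^2 = (r * var + (mean - cut)^2) / (var + (mean - cut)^2)

    All argument names are illustrative; `reliability` is assumed to be a
    conventional single-administration estimate such as KR-20.
    """
    if variance < 0:
        raise ValueError("variance must be non-negative")
    shift = (mean - cut) ** 2          # squared distance of the mean from the cut score
    denominator = variance + shift
    if denominator == 0:
        raise ValueError("coefficient is undefined when variance is 0 and mean equals cut")
    return (reliability * variance + shift) / denominator


# Hypothetical values: reliability 0.80, score variance 100, mean 70, cut score 60.
k2 = livingston_k2(0.80, 100.0, 70.0, 60.0)
```

When the mean equals the cut score the expression reduces to the conventional reliability estimate, and it rises toward 1 as the mean moves away from the cut, which is why it stays usable even when score variance is small.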

States participating in the Growth Model Pilot Program reference individual student growth against “proficiency” cut scores that conform with the original No Child Left Behind Act (NCLB). Although achievement results from conventional NCLB models are also cut‐score dependent, the functional relationships between cut‐score location and growth results are more complex and are not currently well described. We apply cut‐score scenarios to longitudinal data to demonstrate the dependence of state‐ and school‐level growth results on cut‐score choice. This dependence is examined along three dimensions: 1) rigor, as states set cut scores largely at their discretion, 2) across‐grade articulation, as the rigor of proficiency standards may vary across grades, and 3) the time horizon chosen for growth to proficiency. Results show that the selection of plausible alternative cut scores within a growth model can change the percentage of students “on track to proficiency” by more than 20 percentage points and reverse accountability decisions for more than 40% of schools. We contribute a framework for predicting these dependencies, and we argue that the cut‐score dependence of large‐scale growth statistics must be made transparent, particularly for comparisons of growth results across states.

The purpose of this study was to determine the feasibility of combining different test types (criterion-referenced and norm-referenced) in a composite school achievement score to be used in a model for school effectiveness classification. The cross-year stability and within-model consistency of the composite were compared to models using subcomposite, overall scores for both the criterion-referenced and norm-referenced tests, subject-area scores (across grades), grade-level scores, and component scores for each grade. Stability of the different models across 2 years was determined by using the agreement ratio, kappa coefficient, and correlation of residuals (N = 361). The same statistical procedures were used to compute consistency across subsamples (N = 264). Results indicated that transforming and combining student-level scores of different test types, grade levels, and subject areas allows for a broader basis for judging schools and provides a school effectiveness model that is both consistent across subsamples and stable across years.
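
Two of the stability indices named above, the agreement ratio and the kappa coefficient, can be computed from a cross-classification of schools in two years. The following is a minimal sketch under assumed inputs: the function name and the 2×2 table of counts are hypothetical and are not the study's data.

```python
def agreement_and_kappa(table):
    """Return (agreement ratio, Cohen's kappa) for a square table of counts.

    table[i][j] = number of schools placed in category i in year 1
    and category j in year 2 (illustrative layout).
    """
    size = len(table)
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("table must contain at least one observation")

    # Observed agreement: share of schools on the main diagonal.
    observed = sum(table[i][i] for i in range(size)) / total

    # Chance agreement: product of matching row and column proportions.
    row_props = [sum(row) / total for row in table]
    col_props = [sum(table[i][j] for i in range(size)) / total for j in range(size)]
    expected = sum(r * c for r, c in zip(row_props, col_props))

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts: rows = year-1 classification, columns = year-2 classification.
ratio, kappa = agreement_and_kappa([[40, 10], [10, 40]])
```

The agreement ratio counts every matching classification, while kappa discounts the matches that would be expected by chance given the marginal proportions, so the two can diverge sharply when most schools fall in one category.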

With increasing interest in educational accountability, test results are now expected to meet a diverse set of informational needs. But a norm-referenced test (NRT) cannot be expected to meet the simultaneous demands for both norm-referenced and curriculum-specific information. One possible solution, which is the focus of this article, is to customize the NRT. Customized tests may take several forms. They may (a) add a few curriculum-specific items to the end of the NRT, (b) substitute locally constructed items for a few NRT items, (c) substitute a curriculum-specific test (CST) for the NRT, or (d) use equating methods to obtain predicted NRT scores from the CST scores. In this article, we describe the four main approaches to customized testing, address the validity of the uses and interpretations of customized test scores obtained from the four main approaches, and offer recommendations regarding the use of customized tests and the need for further research. Results indicate that customized testing can yield both valid normative and curriculum-specific information when special conditions exist. But there are also many threats to the validity of normative interpretations. Cautious application of customized testing is needed in order to avoid misleading inferences about student achievement.

Book reviews     
Background: A recent article published in Educational Research on the reliability of results in National Curriculum testing in England (Newton, The reliability of results from national curriculum testing in England, Educational Research 51, no. 2: 181–212, 2009) suggested that: (1) classification accuracy can be calculated from classification consistency; and (2) classification accuracy on a single test administration is higher than classification consistency across two tests.

Purpose: This article shows that it is not possible to calculate classification accuracy from classification consistency. It then shows that, given reasonable assumptions about the distribution of measurement error, the expected classification accuracy on a single test administration is higher than the expected classification consistency across two tests only in the case of a pass–fail test, but not necessarily for tests that classify test-takers into more than two categories.

Main argument and conclusion: Classification accuracy is defined in terms of a ‘true score’ specified in a psychometric model. Three things must be known or hypothesised in order to derive a value for classification accuracy: (1) a psychometric model relating observed scores to true scores; (2) the location of the cut-scores on the score scale; and (3) the distribution of true scores in the group of test-takers.
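
The three ingredients listed above can be made concrete with a small simulation. The sketch below is not the article's derivation: it assumes a classical model (observed score = true score + independent normal error), a single pass–fail cut score, and normally distributed true scores. The function name, parameter values, and seed are all hypothetical choices made for illustration.

```python
import random


def simulate_pass_fail(cut=0.0, true_sd=1.0, error_sd=0.5, n=20000, seed=0):
    """Estimate classification accuracy and consistency by simulation.

    Assumptions (all illustrative):
      1. model: observed = true + normal error with SD `error_sd`
      2. one cut score at `cut` (pass if score >= cut)
      3. true scores ~ Normal(0, true_sd)
    """
    rng = random.Random(seed)
    accurate = 0    # observed classification matches true classification
    consistent = 0  # two independent administrations give the same classification
    for _ in range(n):
        true_score = rng.gauss(0.0, true_sd)
        first = true_score + rng.gauss(0.0, error_sd)
        second = true_score + rng.gauss(0.0, error_sd)
        accurate += (first >= cut) == (true_score >= cut)
        consistent += (first >= cut) == (second >= cut)
    return accurate / n, consistent / n


accuracy, consistency = simulate_pass_fail()
```

Under these assumed settings the estimated accuracy comes out above the estimated consistency, in line with the pass–fail case described in the abstract. Changing any one of the three ingredients changes both figures, which illustrates why consistency alone does not determine accuracy.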

The purpose of the study is to investigate the predictive validity of criterion- and norm-referenced grades and the Swedish Scholastic Aptitude Test (SweSAT) and, in particular, possible differences in the prediction of achievement in higher education across academic programs. The analyses were based on credit points obtained by 164,106 Swedish students during the years 1993 to 2001. Two-level modeling with randomly varying slopes with academic program as cluster variable was used. The results provide means and variances of the slopes across the different programs. Variability in the slopes because of program subject area was also investigated. The results indicate that the validity of grades, irrespective of grading system, is stronger than that of SweSAT scores. The results also indicate considerable differences in predictive power across programs for the SweSAT, whereas there are much smaller differences for norm-referenced grades and relatively modest differences for criterion-referenced grades. The impact of program subject area on the variability of prediction was substantial for SweSAT scores.

This study investigated how performance on reading curriculum‐based measurement (R‐CBM) in Spanish is related to performance on R‐CBM in English. Parallel process growth models and quantile regression analyses were used to examine the relations between initial benchmark scores and growth and the consistency of the relations across student reading skill levels. Initial benchmark scores and growth were strongly related across languages in most grades, and initial scores were less strongly related for students with low and high reading achievement, as measured by curriculum‐based measurement in most grades. Rates of growth were evenly related across performance in fourth and fifth grades, but less strongly related for high‐achieving students in second and third grades. Practical implications and future directions are discussed.

Reliability of a criterion-referenced test is often viewed as the consistency with which individuals who have taken two strictly parallel forms of a test are classified as being masters or nonmasters. However, in practice, it is rarely possible to retest students, especially with equivalent forms. For this reason, methods for making conservative approximations of alternate form (or test-retest “without the effects of testing”) reliability have been developed. Because these methods are computationally tedious and require some psychometric sophistication, they have rarely been used by teachers and school psychologists. This paper (a) describes one method (Subkoviak's) for estimating alternate-form reliability from one administration of a criterion-referenced test and (b) describes a computer program developed by the authors that will handle tests containing hundreds of items for large numbers of examinees and allow any test user to apply the technique described. The program is a superior alternative to other methods of simplifying this estimation procedure that rely upon tables; a user can check classification consistency estimates for several prospective cut scores directly from a data file, without having to make prior calculations.
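
The general idea of a single-administration consistency estimate in the spirit of Subkoviak's method can be sketched as follows. This is not the authors' program, and the details are assumptions: each examinee's proportion-correct is regressed toward the group mean using a supplied reliability estimate, a binomial model gives the probability of reaching the cut score, and the per-examinee consistency p² + (1 − p)² is averaged over the group. The function name, arguments, and sample data are hypothetical.

```python
from math import comb


def subkoviak_consistency(scores, n_items, cut, reliability):
    """Estimate master/nonmaster classification consistency from one administration.

    scores      : list of number-correct scores (illustrative input)
    n_items     : number of items on the test
    cut         : minimum number correct to be classified a master
    reliability : a reliability estimate such as KR-20, assumed to be supplied
    """
    if not scores:
        raise ValueError("scores must not be empty")
    mean_prop = sum(scores) / (len(scores) * n_items)

    total = 0.0
    for score in scores:
        # Regress the observed proportion toward the group mean.
        prop = reliability * (score / n_items) + (1 - reliability) * mean_prop
        # Binomial probability of scoring at or above the cut on a parallel form.
        p_master = sum(
            comb(n_items, k) * prop**k * (1 - prop) ** (n_items - k)
            for k in range(cut, n_items + 1)
        )
        # Probability of the same classification on two independent forms.
        total += p_master**2 + (1 - p_master) ** 2
    return total / len(scores)


# Hypothetical data: four examinees, 10 items, cut score of 6, reliability 0.70.
estimate = subkoviak_consistency([3, 5, 7, 9], n_items=10, cut=6, reliability=0.70)

# Checking several prospective cut scores from the same data, as the abstract describes.
by_cut = {c: subkoviak_consistency([3, 5, 7, 9], 10, c, 0.70) for c in (5, 6, 7)}
```

Because each examinee's contribution p² + (1 − p)² is at least 0.5, the estimate always falls between 0.5 and 1, and it approaches 1 when most examinees score far from the cut.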

Standards‐based progress reports (SBPRs) require teachers to grade students using the performance levels reported by state tests and are an increasingly popular report card format. They may help to increase teacher familiarity with state standards, encourage teachers to exclude nonacademic factors from grades, and/or improve communication with parents. The current study examines the SBPR grade–state test score correspondence observed across 2 years in 125 third and fifth grade classrooms located in one school district to examine the degree of consistency between grades and state test results. It also examines the grading practices of a subset of 37 teachers to determine whether there is an association between teacher appraisal style and convergence rates. A moderate degree of grade–test score convergence was observed using three agreement estimates (coefficient kappa, tau‐b correlations, and classroom‐level mean differences between grades and test scores). In addition, only small amounts of grade–test score convergence were observed between teachers; a much greater proportion of variance lay within classrooms and subjects. Appraisal style correlated weakly with convergence rates, but was most strongly related to assigning students to the same performance level as the test. Therefore, using recommended grading practices may improve the quality of SBPR grades to some extent.

The purposes of this study were to (a) examine the developmental patterns in pseudoword reading and oral reading fluency in Spanish and English for Spanish-speaking English learners (ELs) in grades 1–3, and (b) investigate whether initial status and growth rates in reading fluency in Spanish and English significantly predicted reading comprehension within languages and across languages. Participants were 173 Spanish-speaking ELs in first grade, 156 ELs in second grade, and 142 ELs in third grade across four schools providing a paired bilingual reading program. Results of hierarchical linear modeling indicated different patterns of reading growth in Spanish and English across measures and across grades. ELs at the beginning of first grade had higher scores on pseudoword reading in Spanish than in English and had a higher rate of growth on Spanish pseudoword reading. In second and third grades, initial scores on oral reading fluency were comparable in both languages, but oral reading fluency growth rates were higher in English than in Spanish. Results from regression and path analysis indicated that student initial scores and growth on reading fluency were strong and direct predictors of their reading comprehension within the same language, but not across different languages.

This application study investigates whether the multiple‐choice to composite linking functions that determine Advanced Placement Program exam grades remain invariant over subgroups defined by region. Three years of test data from an AP exam are used to study invariance across regions. The study focuses on two questions: (a) How invariant are grade thresholds across regions? and (b) Do the small sample sizes for some regional groups present particular problems for assessing threshold invariance? The equatability index proposed by Dorans and Holland (2000) is employed to evaluate the invariance of the linking functions, and cross‐classification is used to evaluate the invariance of the composite cut scores. Overall, the linkings across regions seem to hold up reasonably well. Nevertheless, more exams need to be examined.

Many U.S. students must pass a standards-based exit exam to earn a high school diploma. The degree to which exit exams and state standards properly signal to students their preparedness for postsecondary schooling has been questioned. The alignment of test scores with college grades for students at the University of Arizona (n = 2,667) who took the Arizona high school exams was ascertained in this study. The pass/fail signal accuracy of test scores varied depending on subject: The writing cut score was well aligned with collegiate performance, the reading cut score was below expectations, and the mathematics cut score was set quite rigorously. High school content and performance standards might not be as diluted as prior research has suggested.

The purpose of the study was to use multivariate multilevel techniques to investigate whether it was possible to separate different dimensions in grades that relate to subject-matter achievement and to other factors. Data were derived from The Gothenburg Educational Longitudinal Database (GOLD), and the subjects were 99,070 ninth-grade students born in 1987. The analyses were based on subject grades and scores on national tests in Swedish, English, and mathematics. The results showed that, at both individual and school levels, the greatest part of the variance in grades was due to achievement in the different subject areas. At both levels, it was possible to identify a dimension that cut across the grades in all 3 subjects, which suggests that grading is influenced by factors other than achievement. One of the most interesting results concerns the relation between parental education and the common grade dimension at the school level.


We investigated how and when French children in Grades 1–5 acquire orthographic representations for silent letters and double consonants. Linear mixed-effects modeling analyses on the spelling accuracy scores obtained for 2,519 French words were used to test our predictions. As predicted, the presence of a silent letter or double consonant had a unique detrimental effect on spelling accuracy that was not captured by the inconsistency and complexity generated by these letters, and this effect tended to decrease across grades. Importantly, exposure to more frequent silent-letter endings or double consonants had a facilitative effect over and above consistency that did not seem to change across grades. These findings suggest that children implicitly acquire representations for letters with no phonological value. The results obtained for other predictors also suggest a shift from a lower level, phoneme-based processing to a higher level processing at the word and rime levels as children acquire more reading experience.

In this article, I present the results of an analysis of the relationship between teacher evaluation scores and student achievement on district and state tests in reading, mathematics, and science in a large Midwestern U.S. school district. Within a value-added framework, I correlated the difference between predicted and actual student achievement in science, mathematics, and reading for students in Grades 3 through 8 with teacher evaluation ratings. Small to moderate positive correlations were found for most grades in each subject tested. When these correlations were combined across grades within subjects, the average correlations were .27 for science, .32 for reading, and .43 for mathematics. These results show that scores from a rigorous teacher evaluation system can be substantially related to student achievement and provide criterion-related validity evidence for the use of the performance evaluation scores as the basis for a performance-based pay system or other decisions with consequences for teachers.

Growth in the use of testing to determine student eligibility for community college courses has prompted debate and litigation over the equity, access, and legal implications of these practices. In California, this has resulted in state regulations requiring that community colleges provide predictive validity evidence of test-score-based inferences and course prerequisites. In addition, companion measures that supplement placement test scores must be used for placement purposes. However, for both theoretical and technical reasons, the predictive validity coefficients between placement test scores and final grades or retention in a course generally demonstrate a weak relationship. The study discussed in this article examined the predictive validity of placement test scores with course grade and retention in English and mathematics courses. The investigation produced a model to explain variance in course outcomes using test scores, student background data, and instructor differences in grading practices. The model suggests that student dispositional characteristics explain a high proportion of variance in the dependent variables. Including instructor grading practices in the model adds significantly to the explanatory power and suggests that grading variations make accurate placement more problematic. This investigation underscores the importance of academic standards as something imposed on students by an institution and not something determined by the entering abilities of students.

Evidence of the internal consistency of standard-setting judgments is a critical part of the validity argument for tests used to make classification decisions. The bookmark standard-setting procedure is a popular approach to establishing performance standards, but there is relatively little research that reflects on the internal consistency of the resulting judgments. This article presents the results of an experiment in which content experts were randomly assigned to one of two response probability conditions: .67 and .80. If the standard-setting judgments collected with the bookmark procedure are internally consistent, both conditions should produce highly similar cut scores. The results showed substantially different cut scores for the two conditions; this calls into question whether content experts can produce the type of internally consistent judgments that are required using the bookmark procedure.
