Similar Documents
 20 similar documents found (search time: 281 ms)
1.
Value-added scores from tests of college learning indicate how score gains compare to those expected from students of similar entering academic ability. Unfortunately, the choice of value-added model can impact results, and this makes it difficult to determine which results to trust. The research presented here demonstrates how value-added models can be compared on three criteria: reliability, year-to-year consistency and information about score precision. To illustrate, the original Collegiate Learning Assessment value-added model is compared to a new model that employs hierarchical linear modelling. Results indicate that scores produced by the two models are similar, but the new model produces scores that are more reliable and more consistent across years. Furthermore, the new approach provides school-specific indicators of value-added score precision. Although the reliability of value-added scores is sufficient to inform discussions about improving general education programmes, reliability is currently inadequate for making dependable, high-stakes comparisons between postsecondary institutions.
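To make the hierarchical-linear-modelling idea concrete, here is a minimal sketch of one way a school-level value-added score can be obtained as a random intercept from a mixed model. This is an illustration only, not the CLA's published model, and the column names (score, entering_ability, school) are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

def value_added_scores(df: pd.DataFrame) -> pd.Series:
    # Adjust senior test scores for entering academic ability, with a random
    # intercept for each school: score_ij = b0 + b1 * entering_ability_ij + u_j + e_ij
    model = smf.mixedlm("score ~ entering_ability", data=df, groups=df["school"])
    fit = model.fit()
    # The empirical-Bayes estimate of u_j (the only random effect in this model)
    # serves as school j's value-added score.
    return pd.Series({school: re.iloc[0] for school, re in fit.random_effects.items()})

Under a model of this kind, a school-specific standard error for u_j (available from the fitted model's conditional covariances) can play the role of the school-specific precision indicator mentioned in the abstract.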

2.
A common practice in the field of learning disabilities is analysis of ability-achievement discrepancies. The reliability of discrepancy scores is an important statistic in such decision making. In this study, selected ability and achievement devices were administered to a sample of low achievers (N = 99), and the reliability of various difference scores was analyzed. In all cases, the reliabilities of difference scores were moderately high. Reliabilities of differences for devices normed on the same population and differences for devices normed on different populations were comparable. These results are discussed in light of current psychometric practices.

3.
We address the reliability of scores obtained on summative performance assessments during the pilot year of our research. In contrast to classical test theory, we discuss the advantages of using generalizability theory for estimating the reliability of scores on summative performance assessments. Generalizability theory was used as the framework because of the flexibility this approach provides for examining sources of inconsistency within a complex assessment. The two major sources of inconsistency in scores considered in this study were raters and agencies (teachers' ratings vs. researchers' ratings). Overall, results showed that the inconsistency in scores attributable to raters and agencies was relatively small. Suggestions for improving consistency in the subsequent years of the research are provided.
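For readers unfamiliar with the framework, a minimal generalizability-theory expression may help; it assumes a fully crossed person × rater × agency design with random facets (the study's actual design and estimator may differ). The generalizability coefficient for relative decisions is

\[
E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pr}/n_{r} + \sigma^{2}_{pa}/n_{a} + \sigma^{2}_{pra,e}/(n_{r} n_{a})},
\]

where \(\sigma^{2}_{p}\) is the person (universe-score) variance, the remaining variance components are person-by-facet interactions plus residual error, and \(n_{r}\) and \(n_{a}\) are the numbers of raters and agencies averaged over. Small rater- and agency-related components, as reported in this study, keep the denominator close to \(\sigma^{2}_{p}\) and hence the coefficient high.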

4.
A structural equation modeling based method is outlined that accomplishes interval estimation of individual optimal scores resulting from multiple-component measuring instruments evaluating single underlying latent dimensions. The procedure capitalizes on the linear combination of a prespecified set of measures that is associated with maximal reliability and validity. The approach is useful when one is interested in evaluating plausible ranges for subject scores on the composite exhibiting highest measurement consistency and strongest linear relation with a given criterion. The method is illustrated with a numerical example.
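As a reference point, a textbook result (not necessarily the exact formulation used in the article): for congeneric measures \(X_{i} = \lambda_{i}\eta + \varepsilon_{i}\) with uncorrelated errors of variance \(\theta_{i}\) and \(\operatorname{Var}(\eta) = 1\), the linear combination with weights \(w_{i} \propto \lambda_{i}/\theta_{i}\) attains the maximal composite reliability

\[
\rho_{\max} = \frac{\sum_{i} \lambda_{i}^{2}/\theta_{i}}{1 + \sum_{i} \lambda_{i}^{2}/\theta_{i}}.
\]

Interval estimation of individual scores on a composite of this kind is what the structural equation modeling procedure described above adds.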

5.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): the item score approach (ISA), the testlet score approach (TSA), and the item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degree of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA due to the loss of information with IRT approaches; however, this explanation does not apply to GT. For the same approach, methods specified in IRT produced higher reliability estimates than those in GT. Relatively smaller magnitudes of error in reliability estimates were observed for ISA and for the IRT methods. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is relatively large dependence among within-testlet items, INTA should be considered for IRT because the loss of information is then nonnegligible.

6.
It is widely recognized that the reliability of a difference score depends on the reliabilities of the constituent scores and their intercorrelation. Authors often use a well-known identity to express the reliability of a difference as a function of the reliabilities of the components, assuming that the intercorrelation remains constant. This approach is misleading, because the familiar formula is a composite function in which the correlation between components is itself a function of reliability. An alternative formula, containing the correlation between true scores instead of the correlation between observed scores, provides more useful information and yields values that are not quite as anomalous as the ones usually obtained.
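One way to write out the contrast drawn here, assuming for simplicity equal observed variances for X and Y: the familiar identity expresses the reliability of the difference through the observed-score correlation \(\rho_{XY}\),

\[
\rho_{DD'} = \frac{\rho_{XX'} + \rho_{YY'} - 2\rho_{XY}}{2 - 2\rho_{XY}},
\]

but \(\rho_{XY} = \rho_{T_{X}T_{Y}}\sqrt{\rho_{XX'}\rho_{YY'}}\) itself depends on the component reliabilities, so holding it constant while varying \(\rho_{XX'}\) and \(\rho_{YY'}\) can mislead. Substituting gives the alternative form in terms of the true-score correlation:

\[
\rho_{DD'} = \frac{\rho_{XX'} + \rho_{YY'} - 2\rho_{T_{X}T_{Y}}\sqrt{\rho_{XX'}\rho_{YY'}}}{2 - 2\rho_{T_{X}T_{Y}}\sqrt{\rho_{XX'}\rho_{YY'}}}.
\]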

7.
In the IEA PIRLS International Report, a number of indices of reading-related constructs were introduced for the purpose of explaining the variation in reading achievement. These indices, however, raise issues of reliability and validity, due to the way in which they are derived and due to the quality of the data. Targeting these issues, the current study investigates the measurement properties of some of these reading-related factors with a multivariate latent variable modeling approach. The data are drawn from the PIRLS questionnaires administered to fourth graders in six countries. On the basis of a series of confirmatory factor analytic models, individual factor scores for these constructs are estimated. The results indicate that the most significant advantage of the suggested approach is that it makes efficient use of the available data to estimate factor scores with higher reliability and validity than the observed index scores.

8.
Reliability has a long history as one of the key psychometric properties of a test. However, a given test might not measure all people equally reliably: test scores from some individuals might carry considerably greater error than those from others. This study proposed two approaches that use intraindividual variation to estimate test reliability for each person. A simulation study suggested that both the parallel-tests approach and the structural equation modeling approach recovered the simulated reliability coefficients. In an empirical study in which 45 women completed the Positive and Negative Affect Schedule (PANAS) daily for 45 consecutive days, separate reliability estimates were then generated for each person. Results showed that reliability estimates of the PANAS varied substantially from person to person. The methods provided in this article apply to tests measuring changeable attributes and require repeated measures across time for each individual. The article also provides a set of parallel forms of the PANAS.
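A minimal sketch of the parallel-tests idea, assuming each person's daily responses have been split into two parallel half-forms (the data layout and column names are hypothetical, not the authors' code): correlate the two halves across a person's days, then step the half-form correlation up to full length with the Spearman-Brown formula.

import pandas as pd

def person_reliability(daily_scores: pd.DataFrame) -> pd.Series:
    # daily_scores columns: "person", "day", "form_a", "form_b"
    def one_person(g: pd.DataFrame) -> float:
        r_half = g["form_a"].corr(g["form_b"])   # correlation across this person's days
        return 2 * r_half / (1 + r_half)         # Spearman-Brown step-up to full length
    return daily_scores.groupby("person").apply(one_person)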

9.
Although the reliability of subscale scores is often suspect, subscale scores are the most common type of diagnostic information included in student score reports. This research compared methods for augmenting the reliability of subscale scores for an 8th-grade mathematics assessment. Yen's Objective Performance Index, Wainer et al.'s augmented scores, and scores based on multidimensional item response theory (IRT) models were compared and found to improve the precision of the subscale scores. However, the augmented subscale scores were found to be more highly correlated and less variable than unaugmented scores. The meaningfulness of reporting such augmented scores, as well as the implications for validity and test development, is discussed.
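For orientation, the simplest univariate version of this kind of score augmentation is Kelley's regressed estimate of a true subscore (the multivariate procedures compared in the study, such as Wainer et al.'s augmentation, can be viewed as generalizing it by also borrowing strength from the other subscales):

\[
\hat{\tau} = \rho_{XX'}\,x + (1 - \rho_{XX'})\,\mu_{X},
\]

where \(x\) is the observed subscore, \(\mu_{X}\) the group mean, and \(\rho_{XX'}\) the subscore reliability. The shrinkage toward the mean is what raises precision but also makes augmented subscores more alike and less variable.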

10.
A latent variable modeling approach for evaluating scale reliability under realistic conditions in empirical behavioral and social research is discussed. The method provides point and interval estimation of the reliability of multicomponent measuring instruments when several standard assumptions are violated, for example in the presence of missing data (including data that are not missing at random), correlated errors, nonnormality, or lack of unidimensionality. The procedure can be readily used to aid scale construction and development efforts in applied settings, and is illustrated using data from an educational study.
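As a point of reference, under a single-factor model \(Y_{i} = \lambda_{i}\eta + \varepsilon_{i}\) with \(\operatorname{Var}(\eta) = 1\), the latent-variable (composite) reliability of the sum score \(Y = \sum_{i} Y_{i}\) can be written as

\[
\rho_{Y} = \frac{\left(\sum_{i}\lambda_{i}\right)^{2}}{\left(\sum_{i}\lambda_{i}\right)^{2} + \sum_{i}\theta_{ii} + 2\sum_{i<j}\theta_{ij}},
\]

where \(\theta_{ii}\) are error variances and \(\theta_{ij}\) error covariances. The procedure described above can be read as extending interval estimation of a coefficient of this type to data with missing values and the other complications listed.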

11.
A method of determining the reliability coefficient of a test from a formulation which does not employ the concepts of true score and error score, together with assumptions about the process which generates variability in scores, is described. Several well-known reliability formulas as well as some new results are derived from models which hypothesize different sources of variability in scores.

12.
13.
《教育实用测度》2013,26(2):173-185
More attention is being given to evaluating the quality of school-level assessment scores due to their importance for school-based planning and monitoring effectiveness. In this study, cross-year stability is proposed as an indicator of data quality and the degree of stability that is appropriate for large-scale assessments of student performance is explored. Following a search of Internet sites, Year 1 to Year 2 stability coefficients were calculated for assessment data from 21 states and 2 provinces. The median stability coefficient was .78 in mathematics and reading, but coefficients for writing were generally lower. A stability coefficient of .80 is recommended as the standard for large-scale assessments of student performance. A high degree of cross-year stability makes it easier to detect and attribute changes in school-level scores to school improvement efforts. The link between stability and reliability and several factors that may attenuate stability are discussed.
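In this context a cross-year stability coefficient is the correlation of school-level scores between adjacent years; a minimal sketch follows (column names are hypothetical):

import pandas as pd

def stability_coefficient(year1: pd.DataFrame, year2: pd.DataFrame) -> float:
    # year1, year2: one row per school with columns "school_id" and "mean_score"
    merged = year1.merge(year2, on="school_id", suffixes=("_y1", "_y2"))
    return merged["mean_score_y1"].corr(merged["mean_score_y2"])  # Pearson r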

14.
The Social Skills Rating System (SSRS; F.M. Gresham & S.N. Elliott, 1990) is a norm-referenced measure of students' social and problem behaviors. Since its release, much of the published reliability and validity evidence for the SSRS has focused primarily on the Teacher Report Form. The purpose of this study was to explore reliability and validity evidence for scores on the SSRS-Student Elementary Form (SSRS-SEF) for children in Grades 3 to 5. Findings provided support for the use of the Total scale as a measure of student social behavior for initial screening purposes; however, evidence for the subscales was not as strong as predicted. Directions for future research regarding the reliability and validity of scores from the SSRS-SEF are discussed. © 2005 Wiley Periodicals, Inc. Psychol Schs 42: 345–354, 2005.

15.
The topic of test reliability is about the relative consistency of test scores and other educational and psychological measurements. In this module, the idea of consistency is illustrated with reference to two sets of test scores. A mathematical model is developed to explain both relative consistency and relative inconsistency of measurements. A means of indexing reliability is derived using the model. Practical methods of estimating reliability indices are considered, together with factors that influence the reliability index of a set of measurements and the interpretation that can be made of that index.
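The classical model underlying this treatment can be summarized compactly (standard notation; the module develops it in full). Each observed score decomposes as

\[
X = T + E, \qquad \operatorname{Cov}(T, E) = 0, \qquad
\rho_{XX'} = \frac{\sigma^{2}_{T}}{\sigma^{2}_{X}} = \frac{\sigma^{2}_{T}}{\sigma^{2}_{T} + \sigma^{2}_{E}},
\]

so the reliability index is the proportion of observed-score variance attributable to true scores, and any factor that inflates the error variance \(\sigma^{2}_{E}\) lowers the index.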

16.
The NTID Writing Test was developed to assess the writing ability of postsecondary deaf students entering the National Technical Institute for the Deaf and to determine their appropriate placement into developmental writing courses. Although previous research (Albertini et al., 1986; Albertini et al., 1996; Bochner, Albertini, Samar, & Metz, 1992) has shown the test to have adequate agreement across multiple raters and to be a valid measure of writing ability for placement into these courses, changes in the curriculum and the rater pool necessitated a new look at interrater reliability and concurrent validity. We evaluated the rating scores for 236 samples from students who entered the college in fall 2001. Using a multipronged approach, we confirmed the interrater reliability and the validity of this direct measure of assessment. The implications of continued use of this and similar tests in light of definitions of validity, local control, and the nature of writing are discussed.

17.
When a computerized adaptive testing (CAT) version of a test co-exists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from its P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review the research literature on CAT comparability issues and synthesize the issues specific to these two settings. A framework of criteria for evaluating comparability was developed that contains three categories: a validity criterion, a psychometric property/reliability criterion, and a statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria, as well as various algorithms for improving comparability, are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of the development of a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of the incomparability may be due to the change from number-correct scoring to scoring based on IRT ability estimation. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size, were found to result in different levels of comparability in test scores.

18.
This article treats various procedures for examining the reliability of group mean difference scores, with particular emphasis on procedures from univariate and multivariate generalizability theory. Attention is given to both traditional norm-referenced perspectives on reliability as well as criterion-referenced perspectives that focus on error-tolerance ratios and functions of them. The procedures discussed are illustrated using three cohorts of data for third- and fourth-grade students in Iowa who took the Iowa Tests of Basic Skills in recent years. For these data, estimates of reliability for norm-referenced decisions tend to be relatively low. By contrast, for criterion-referenced decisions, estimates of reliability-like coefficients based on error-tolerance ratios tend to be noticeably larger.

19.
The attribute hierarchy method (AHM) is a psychometric procedure for classifying examinees' test item responses into a set of structured attribute patterns associated with different components from a cognitive model of task performance. Results from an AHM analysis yield information on examinees' cognitive strengths and weaknesses. Hence, the AHM can be used for cognitive diagnostic assessment. The purpose of this study is to introduce and evaluate a new concept for assessing attribute reliability using the ratio of true score variance to observed score variance on items that probe specific cognitive attributes. This reliability procedure is evaluated and illustrated using both simulated data and student response data from a sample of algebra items taken from the March 2005 administration of the SAT. The reliability of diagnostic scores and the implications for practice are also discussed.

20.
Emphasis on improving higher-level biology education continues. A new two-step approach to the experimental phases within an outreach gene technology lab, derived from cognitive load theory, is presented. We compared our approach with the conventional one-step mode using a quasi-experimental design. The difference consisted of additional focused discussions combined with students writing down their ideas (step one) prior to starting any experimental procedure (step two). We monitored students' activities during the experimental phases by continuously videotaping 20 work groups within each approach (N = 131). Subsequent classification of students' activities yielded 10 categories (with satisfactory intra- and inter-observer reliability). Based on the students' individual time budgets, we characterized students' roles during experimentation from their prevalent activities, applying two cluster analysis methods independently. Independently of the approach, two common clusters emerged, which we labeled 'all-rounders' and 'passive students', along with two clusters specific to each approach: 'observers' and 'high-experimenters' were identified only within the one-step approach, whereas under the two-step conditions 'managers' and 'scribes' were identified. Potential changes in group-leadership style during experimentation are discussed, and conclusions for optimizing science teaching are drawn.
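A minimal sketch of the kind of time-budget clustering described, using two independent methods as a cross-check (the array layout, number of clusters, and parameter choices are assumptions, not the authors' analysis):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def cluster_time_budgets(time_budgets: np.ndarray, k: int = 4):
    # time_budgets: shape (n_students, n_categories); each row holds the share of
    # observed time a student spent in each activity category.
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(time_budgets)
    hc_labels = fcluster(linkage(time_budgets, method="ward"), t=k, criterion="maxclust")
    return km_labels, hc_labels

Agreement between the two label sets (e.g., via a cross-tabulation) is one way to judge how robust cluster labels such as 'all-rounders' or 'scribes' are.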
