Similar Documents
20 similar documents found (search time: 41 ms)
1.
Student responses to a large number of constructed-response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single-reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple-group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted to the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly higher single-reader mean total scores for three of the tests. The similarity of scores under item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized by using multiple readers, though fewer readers than items.

2.
Most currently accepted approaches for identifying differentially functioning test items compare performance across groups after first matching examinees on the ability of interest. The typical basis for this matching is the total test score. Previous research indicates that when the test is not approximately unidimensional, matching using the total test score may result in an inflated Type I error rate. This study compares the results of differential item functioning (DIF) analysis with matching based on the total test score, matching based on subtest scores, or multivariate matching using multiple subtest scores. Analysis of both actual and simulated data indicates that for the dimensionally complex test examined in this study, using the total test score as the matching criterion is inappropriate. The results suggest that matching on multiple subtest scores simultaneously may be superior to using either the total test score or individual relevant subtest scores.
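A minimal sketch of the score-matched DIF screening this abstract describes, using the standard Mantel-Haenszel common odds ratio (this is a generic illustration, not the authors' analysis; the function name and toy data are invented). Examinees are stratified on the matching criterion, and a 2x2 group-by-correctness table is pooled across strata:

```python
from collections import defaultdict

def mantel_haenszel_dif(item, matching, group):
    """Mantel-Haenszel common odds ratio for one studied item.

    item     -- 0/1 responses to the studied item, one per examinee
    matching -- matching-criterion score per examinee (e.g., total score)
    group    -- 'R' (reference) or 'F' (focal) per examinee
    Returns alpha_MH; values near 1.0 indicate little DIF.
    """
    # One 2x2 table (group rows x correct/incorrect columns) per score level.
    strata = defaultdict(lambda: [[0.0, 0.0], [0.0, 0.0]])
    for u, s, g in zip(item, matching, group):
        strata[s][0 if g == 'R' else 1][0 if u == 1 else 1] += 1
    num = den = 0.0
    for (a, b), (c, d) in strata.values():
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n   # reference-correct x focal-incorrect
        den += b * c / n   # reference-incorrect x focal-correct
    return num / den if den else float('inf')
```

Matching on subtest scores, as the study recommends for dimensionally complex tests, amounts to replacing `matching` with a subtest score (or a tuple of subtest scores for multivariate matching), which changes only how the strata are formed.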

3.
This study examined the stability of scores on two types of performance assessments, an observed hands-on investigation and a notebook surrogate. Twenty-nine sixth-grade students in a hands-on inquiry-based science curriculum completed three investigations on two occasions separated by 5 months. Results indicated that: (a) the generalizability across occasions for relative decisions was, on average, moderate for the observed investigations (.52) and the notebooks (.50); (b) the generalizability for absolute decisions was only slightly lower; (c) the major source of measurement error was the person by occasion (residual) interaction; and (d) the procedures students used to carry out the investigations tended to change from one occasion to the other.
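The relative and absolute coefficients contrasted in this abstract come from standard generalizability theory. A hedged sketch for a persons x occasions design (illustrative only; the variance components below are invented, not the study's estimates):

```python
def g_coefficients(var_p, var_o, var_po, n_o=1):
    """Generalizability coefficients for a persons x occasions design.

    var_p  -- universe-score (person) variance
    var_o  -- occasion main-effect variance
    var_po -- person x occasion (residual) variance
    n_o    -- number of occasions averaged over
    Returns (relative, absolute). Absolute error also counts occasion
    main effects, so the absolute coefficient is never larger.
    """
    rel_error = var_po / n_o
    abs_error = (var_o + var_po) / n_o
    g_relative = var_p / (var_p + rel_error)
    phi_absolute = var_p / (var_p + abs_error)
    return g_relative, phi_absolute
```

With a large person x occasion component relative to the occasion main effect, the two coefficients sit close together, consistent with finding (b) that absolute generalizability was "only slightly lower."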

4.
Educational Assessment, 2013, 18(4), 317–340
A number of methods for scoring tests with selected-response (SR) and constructed-response (CR) items are available. The selection of a method depends on the requirements of the program, the particular psychometric model and assumptions employed in the analysis of item and score data, and how scores are to be used. This article compares three methods: unweighted raw scores, Item Response Theory pattern scores, and weighted raw scores. Student score data from large-scale end-of-course high school tests in Biology and English were used in the comparisons. In the weighted raw score method evaluated in this study, the CR items were weighted so that SR and CR items contributed the same number of points toward the total score. The scoring methods were compared for the total group and for subgroups of students in terms of the resultant scaled score distributions, standard errors of measurement, and proficiency-level classifications. For most of the student ability distribution, the three scoring methods yielded similar results. Some differences in results are noted. Issues to be considered when selecting a scoring method are discussed.

5.
Domain scores have been proposed as a user-friendly way of providing instructional feedback about examinees' skills. Domain performance typically cannot be measured directly; instead, scores must be estimated using available information. Simulation studies suggest that IRT-based methods yield accurate group domain score estimates. Because simulations can represent best-case scenarios for methodology, it is important to verify results with a real data application. This study administered a domain of elementary algebra (EA) items created from operational test forms. An IRT-based group-level domain score was estimated from responses to a subset of taken items (composed of EA items from a single operational form) and compared to the actual observed domain score. Domain item parameters were calibrated both using item responses from the special study and from national operational administrations of the items. The accuracy of the domain score estimates was evaluated within schools and across school sizes for each set of parameters. The IRT-based domain score estimates typically were closer to the actual domain score than observed performance on the EA items from the single form. Previously simulated findings for the IRT-based domain score estimation procedure were supported by the results of the real data application.

6.
Evidence of stable standard setting results over panels or occasions is an important part of the validity argument for an established cut score. Unfortunately, due to the high cost of convening multiple panels of content experts, standards often are based on the recommendation from a single panel of judges. This approach implicitly assumes that the variability across panels will be modest, but little evidence is available to support this assertion. This article examines the stability of Angoff standard setting results across panels. Data were collected for six independent standard setting exercises, with three panels participating in each exercise. The results show that although in some cases the panel effect is negligible, for four of the six data sets the panel facet represented a large portion of the overall error variance. Ignoring the often hidden panel/occasion facet can result in artificially optimistic estimates of the cut score stability. Results based on a single panel should not be viewed as a reasonable estimate of the results that would be found over multiple panels. Instead, the variability seen in a single panel can best be viewed as a lower bound of the expected variability when the exercise is replicated.

7.
All Year 2 children in six randomly selected primary schools within one Local Education Authority (LEA) comprised the sample to which the Lawseq self‐esteem questionnaire was administered. Four years later, when they were Year 6, they completed the Lawseq again. A two‐way analysis of variance with Sex and Occasions was carried out on the 12 individual items of the instrument and the total. There were no significant differences between occasions or sexes on the overall score, but there were significant differences between occasions on seven of the 12 items and between sexes on two items. On only one item was there a significant interaction between sexes and occasions. The mean for the total fell over the 4 years. The means for both occasions were considerably below the mean of 19.00 obtained when Lawrence standardised the test in 1981. Discussion centred on possible reasons for this, such as the appropriateness of the instrument for the age‐groups under study, stability of administration and changes within society and school.

8.
The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e., probability judgments) from previously conducted standard setting studies were used first in a re-sampling study, followed by D-studies. For the re-sampling study, proportionally stratified subsets of items were extracted under various sampling and test-length conditions. The mean cut score, variance components, expected standard error (SE) around the mean cut score, and root-mean-squared deviation (RMSD) across 1,000 replications were estimated at each study condition. The SE and the RMSD decreased as the number of items increased, but this reduction tapered off after approximately 45 items. Subsequently, D-studies were performed on the same datasets. The expected SE was computed at various test lengths. Results from both studies are consistent with previous research indicating that between 40 and 50 items are sufficient to make generalizable cut score recommendations.
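The tapering the study reports falls out of the square-root law behind D-study standard errors. A deliberately simplified single-facet sketch (the study's design has more facets; the 0.09 item-variance component below is an invented illustration, not an estimate from the study):

```python
import math

def expected_se(var_item, n_items):
    """Expected SE of a mean Angoff cut score when items are sampled
    at random: a single-facet (items-only) D-study simplification."""
    return math.sqrt(var_item / n_items)

# Marginal SE reduction from each additional block of 15 items shrinks,
# which is why gains level off in the neighborhood of 45 items.
gains = [expected_se(0.09, n) - expected_se(0.09, n + 15)
         for n in (15, 30, 45, 60)]
```

Because the SE falls as 1/sqrt(n), each added block of items buys less precision than the last, so past roughly 45 items the curve is nearly flat.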

9.
In many educational tests, both multiple‐choice (MC) and constructed‐response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form‐specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long‐standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.

10.
Accountability mandates often prompt assessment of student learning gains (e.g., value-added estimates) via achievement tests. The validity of these estimates has been questioned when performance on tests is low stakes for students. To assess the effects of motivation on value-added estimates, we assigned students to one of three test consequence conditions: (a) an aggregate of test scores is used solely for institutional effectiveness purposes, (b) personal test score is reported to the student, or (c) personal test score is reported to faculty. Value-added estimates, operationalized as change in performance between two testing occasions for the same individuals where educational programming was experienced between testing occasions, were examined across conditions, in addition to the effects of test-taking motivation. Test consequences did not impact value-added estimates. Change in test-taking motivation, however, had a substantial effect on value-added estimates. In short, value-added estimates were attenuated due to decreased motivation from pretest to posttest.

11.
The study examined two approaches for equating subscores. They are (1) equating subscores using internal common items as the anchor to conduct the equating, and (2) equating subscores using equated and scaled total scores as the anchor to conduct the equating. Since equated total scores are comparable across the new and old forms, they can be used as an anchor to equate the subscores. Both chained linear and chained equipercentile methods were used. Data from two tests were used to conduct the study, and results showed that when more internal common items are available (i.e., 10–12 items), using common items to equate the subscores is preferable. However, when the number of common items is very small (i.e., five to six items), using total scaled scores to equate the subscores is preferable. For both tests, not equating (i.e., using raw subscores) is not reasonable, as it resulted in a considerable amount of bias.
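A hedged sketch of the chained linear method named above: a subscore is carried to the anchor scale in the new-form group, then from the anchor scale to the old-form subscore scale in the old-form group (mean/sigma linking; this is a generic textbook construction, not the study's code, and all names and the toy data are invented):

```python
import statistics as st

def linear_link(x, y):
    """Mean/sigma linear function mapping scores on x to the scale of y:
    l(s) = mu_y + (sd_y / sd_x) * (s - mu_x)."""
    mx, my = st.fmean(x), st.fmean(y)
    sx, sy = st.pstdev(x), st.pstdev(y)
    return lambda s: my + (sy / sx) * (s - mx)

def chained_linear(sub_new, anchor_new, anchor_old, sub_old):
    """Chained linear equating of a new-form subscore to the old form:
    new subscore -> anchor (new group), then anchor -> old subscore
    (old group). The anchor can be internal common items or, as in
    approach (2), the equated and scaled total score."""
    to_anchor = linear_link(sub_new, anchor_new)
    anchor_to_old = linear_link(anchor_old, sub_old)
    return lambda s: anchor_to_old(to_anchor(s))
```

Swapping the common-item anchor columns for equated total scaled scores converts approach (1) into approach (2) without changing the chaining logic, which is what makes the two designs directly comparable.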

12.
Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of approximately 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected in each study, wherein one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items, and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel, and a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.

13.
This work examines the hypothesis that the arrangement of items according to increasing difficulty is the real source of what is considered the item-position effect. A confusion of the 2 effects is possible because in achievement measures the items are arranged according to their difficulty. Two item subsets of Raven’s Advanced Progressive Matrices (APM), one following the original item order, and the other one including randomly ordered items, were applied to a sample of 266 students. Confirmatory factor analysis models including representations of both the item-position effect and a possible effect due to increasing item difficulty were compared. The results provided evidence for both effects. Furthermore, they indicated a substantial relation between the item-position effects of the 2 APM subsets, whereas no relation was found for item difficulty. This indicates that the item-position effect stands on its own and is not due to increasing item difficulty.

14.
Numerous assessments contain a mixture of multiple choice (MC) and constructed response (CR) item types and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example, multidimensional item response theory (MIRT) may have a promising future in subscale score proficiency estimation, leading toward a more diagnostic orientation, which requires the linking of these subscale scores across different forms and populations. Several multidimensional linking studies can be found in the literature; however, none have used a combination of MC and CR item types. Accordingly, this research explores multidimensional linking accuracy for tests composed of both MC and CR items using a matching test characteristic/response function approach. The two-dimensional simulation study presented here used real data-derived parameters from a large-scale statewide assessment with two subscale scores for diagnostic profiling purposes, under varying conditions of anchor set lengths (6, 8, 16, 32, 60), across 10 population distributions, with a mixture of simple versus complex structured items, using a sample size of 3,000. It was found that for a well-chosen anchor set, the parameters recovered well after equating across all populations, even for anchor sets composed of as few as six items.

15.
The study of change is based on the idea that the score or index at each measurement occasion has the same meaning and metric across time. In tests or scales with multiple items, such as those common in the social sciences, there are multiple ways to create such scores. Some options include using raw or sum scores (i.e., sum of item responses or linear transformation thereof), using Rasch-scaled scores provided by the test developers, fitting item response models to the observed item responses and estimating ability or aptitude, and jointly estimating the item response and growth models. We illustrate that this choice can have an impact on the substantive conclusions drawn from the change analysis using longitudinal data from the Applied Problems subtest of the Woodcock–Johnson Psycho-Educational Battery–Revised collected as part of the National Institute of Child Health and Human Development's Study of Early Child Care. Assumptions of the different measurement models, their benefits and limitations, and recommendations are discussed.

16.
The matched pair technique for writing and scoring true-false items was designed to compensate for the acquiescence response set of primary grade children. The claim that this technique increases reliability to an appreciable extent over traditional true-false scoring was investigated by comparing alpha internal consistency coefficients computed for the matched pair true-false, traditional true-false, and three other scoring schemes. Both the total sample coefficients and individual classroom coefficients were computed from the standardization sample of a primary grade economics achievement test (Primary Test of Economic Understanding). Classroom reliability coefficients computed from the matched pair scores were found to be higher than those from scores computed by the other methods. Total sample coefficients obtained from four of the five methods were nearly equal. Evidence of the effects of each scoring technique on concurrent validity is also presented. Contrary to expectations, the correlations of traditional and matched pair scores with Iowa Test of Basic Skills (ITBS) subtests (when adjusted for differing reliabilities) were approximately equal.

17.
Noting the wide differences in verbal abilities of middle and lower class children, the investigators proposed that two groups of children, one from the lower class, one from the middle class, who achieve comparable total scores on a group intelligence test, would get their scores by successfully completing different sets of items. In the first study children were placed in social classes based on their fathers' occupations, following guidelines from the Warner scale. Middle class children were matched with lower class children on total Otis scores. No item-social class interaction was found. The study was repeated using the occupational categories of the Dictionary of Occupational Titles as a guide to social class standing. Again no item-social class interaction appeared. If two social class groups are equated on total intelligence scores, one social class sample appears to succeed on essentially the same test items as does the other social class sample. A given score on an intelligence test appears to represent the same skills for one social class as it does for another social class.

18.
In 1993, we reported in Journal of Educational Measurement that task-sampling variability was the Achilles' heel of science performance assessment. To reduce measurement error, tasks needed to be stratified before sampling, sampled in large number, or possibly both. However, Cronbach, Linn, Brennan, & Haertel (1997) pointed out that a task-sampling interpretation of a large person x task variance component might be incorrect. Task and occasion sampling are confounded because tasks are typically given on only a single occasion. The person x task source of measurement error is then confounded with the pt x occasion source. If pto variability accounts for a substantial part of the commonly observed pt interaction, stratifying tasks into homogeneous subsets—a cost-effective way of addressing task sampling variability—might not increase accuracy. Stratification would not address the pto source of error. Another conclusion reported in JEM was that only direct observation (DO) and notebook (NB) methods of collecting performance assessment data were exchangeable; computer simulation, short-answer, and multiple-choice methods were not. However, if Cronbach et al. were right, our exchangeability conclusion might be incorrect. After re-examining and re-analyzing data, we found support for Cronbach et al. We concluded that large task-sampling variability was due to both the person x task interaction and the person x task x occasion interaction. Moreover, we found that direct observation, notebook and computer simulation methods were equally exchangeable, but their exchangeability was limited by the volatility of student performances across tasks and occasions.

19.
The purpose of this study was to compare several methods for determining a passing score on an examination from the individual raters' estimates of minimal pass levels for the items. The methods investigated differ in the weighting that the estimates for each item receive in the aggregation process. An IRT-based simulation method was used to model a variety of error components of minimum pass levels. The results indicate little difference in estimated passing scores across the three methods. Less error was present when the ability level of the minimally competent candidates matched the expected difficulty level of the test. No meaningful improvement in passing score estimation was achieved for a 50-item test as opposed to a 25-item test; however, the RMSE values for estimates with 10 raters were smaller than those for 5 raters. The results suggest that the simplest method for aggregating minimum pass levels across the items in a test, simply adding them up, is the preferred method.
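The preferred "add them up" aggregation is short enough to state directly. A minimal sketch (illustrative naming and toy ratings; averaging raters within items before summing is one common convention, not necessarily the exact scheme the study tested):

```python
def angoff_passing_score(mpl_by_rater):
    """Unweighted aggregation of minimum pass levels (MPLs):
    average the raters' MPLs within each item, then sum over items.

    mpl_by_rater -- list of per-item lists of rater MPLs (probabilities
                    that a minimally competent candidate answers correctly).
    Returns the recommended passing score on the raw-score scale.
    """
    return sum(sum(item) / len(item) for item in mpl_by_rater)
```

Weighted variants replace the plain sum with item-specific weights; the study's finding is that such weighting buys little over this unweighted sum.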

20.
This study compares the equal percentile (EP) and partial credit (PC) equatings for raw scores derived from performance-based assessments composed of free-response (open-ended) items clustered around long reading selections or multistep mathematics problems. Data are from the Maryland School Performance Assessment Program. The results suggest that Masters' (1982; Wright & Masters, 1982) partial credit model may be useful for equating examinations composed of moderately easy (or not too difficult) items sharing a first principal component with at least 25% of the total variance. This conclusion appears to hold even in the presence of some level of response dependency for the items within each cluster. Although visible discrepancies were found between PC and EP equated scores in the skewed tail of the score distributions, the direction of these discrepancies is unpredictable. Therefore, it cannot be concluded from the study that the two methods give equivalent results when the distributions are markedly skewed.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号