Similar Literature
20 similar documents retrieved (search time: 562 ms)
1.
Reliabilities and information functions for percentile ranks and number-right scores were compared in the context of item response theory. The basic results were: (a) The percentile rank is always less informative and reliable than the number-right score; and (b) for easy or difficult tests composed of highly discriminating items, the percentile rank often yields unacceptably low reliability and information relative to the number-right score. These results suggest that standardized scores that are linear transformations of the number-right score (e.g., z scores) are much more reliable and informative indicators of the relative standing of a test score than are percentile ranks. The findings reported here demonstrate that there exist situations in which the percent of items known by examinees can be accurately estimated, but that the percent of persons falling below a given score cannot.
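To make results (a) and (b) concrete, the sketch below simulates two parallel administrations of an easy test of highly discriminating items under a 2PL model and compares the parallel-forms reliability of number-right scores with that of percentile ranks. The item parameters, sample size, and the parallel-forms setup are illustrative assumptions, not the paper's design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_items, n_persons = 30, 5000
a = rng.uniform(1.5, 2.5, n_items)        # highly discriminating items (assumed)
b = rng.normal(-1.5, 0.3, n_items)        # easy items (assumed)
theta = rng.normal(0.0, 1.0, n_persons)

def number_right(theta, a, b):
    """Simulated number-right scores under the 2PL."""
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return (rng.uniform(size=p.shape) < p).sum(axis=1)

x1, x2 = number_right(theta, a, b), number_right(theta, a, b)   # parallel forms

# Percentile ranks are a monotone but nonlinear transformation of number-right.
pr1 = 100.0 * stats.rankdata(x1, method="average") / n_persons
pr2 = 100.0 * stats.rankdata(x2, method="average") / n_persons

# Parallel-forms reliability of each metric
print("number-right:   ", np.corrcoef(x1, x2)[0, 1])
print("percentile rank:", np.corrcoef(pr1, pr2)[0, 1])
```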

2.
This paper investigates whether inferences about school performance based on longitudinal models are consistent when different assessments and metrics are used as the basis for analysis. Using norm-referenced (NRT) and standards-based (SBT) assessment results from panel data of a large heterogeneous school district, we examine inferences based on vertically equated scale scores, normal curve equivalents (NCEs), and nonvertically equated scale scores. The results indicate that the effect of the metric depends upon the evaluation objective. NCEs significantly underestimate absolute individual growth, but NCEs and scale scores yield highly correlated (r >.90) school-level results based on mean initial status and growth estimates. SBT and NRT results are highly correlated for status but only moderately correlated for growth. We also find that as few as 30 students per school provide consistent results and that mobility tends to affect inferences based on status but not growth – irrespective of the assessment or metric used.  相似文献   
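For readers unfamiliar with the NCE metric: normal curve equivalents are normalized scores with a mean of 50 and a standard deviation of about 21.06, chosen so that NCEs of 1, 50, and 99 coincide with percentile ranks of 1, 50, and 99. A minimal conversion sketch (not taken from the paper) follows.

```python
from scipy.stats import norm

def percentile_to_nce(pr):
    """Normal curve equivalent for a percentile rank strictly between 0 and 100.
    NCEs have mean 50 and SD ~21.06, so NCEs 1, 50, and 99 match PRs 1, 50, and 99."""
    return 50 + 21.06 * norm.ppf(pr / 100)

for pr in (1, 10, 25, 50, 75, 90, 99):
    print(f"PR {pr:>2} -> NCE {percentile_to_nce(pr):5.1f}")
```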

3.
This paper examines the limitations of standard scores of achievement tests commonly used in diagnosing learning disabilities. The consideration of these limitations is an important factor in attempting to decide whether a marked discrepancy exists between ability and achievement, a requirement for the diagnosis of learning disabilities under Public Law 94–142. The phrase “standard score scale” is ambiguous because it can refer to both status score scales and developmental score scales. Unfortunately, many school psychologists seem unaware of the distinction between these two types of standard scores and the ramifications of this distinction. Many standardized achievement tests commonly used in the assessment of learning disabilities use status standard scores despite their severe limitations (noncomparability across grade levels and subjects, and failure to reflect changes in variability across grade levels). While developmental standard scores are to be preferred over status standard scores in diagnosing learning disabled children, their value is significantly lowered because they require greater growth for below-average students than for average or above-average students. Moreover, developmental scores are nonequal interval and they assume that subject matter is normally distributed within age or grade groups. Although we recommend the use of developmental standard scores over status standard scores, we urge that they be interpreted cautiously.

4.
The percentage of students retaking college admissions tests is rising. Researchers and college admissions offices currently use a variety of methods for summarizing these multiple scores. Testing organizations such as ACT and the College Board, interested in validity evidence like correlations with first‐year grade point average (FYGPA), often use the most recent test score available. In contrast, institutions report using a variety of composite scoring methods for applicants with multiple test records, including averaging and taking the maximum subtest score across test occasions (“superscoring”). We compare four scoring methods on two criteria. First, we compare correlations between scores and FYGPA by scoring method. We find them similar. Second, we compare the extent to which test scores differentially predict FYGPA by scoring method and number of retakes. We find that retakes account for additional variance beyond standardized achievement and positively predict FYGPA across all scoring methods. Superscoring minimizes this differential prediction—although it may seem that superscoring should inflate scores across retakes, this inflation is “true” in that it accounts for the positive effects of retaking for predicting FYGPA. Future research should identify factors related to retesting and consider how they should be used in college admissions.
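As a rough illustration of the composite scoring rules discussed above, the sketch below applies four common rules (most recent, average, highest single occasion, and superscore) to one hypothetical set of test records. The exact rule set, subtest layout, and numbers are assumptions for illustration, not the paper's data.

```python
import numpy as np

# Hypothetical records: rows = test occasions (oldest to newest), columns = subtests.
attempts = np.array([
    [24, 21, 26, 23],   # occasion 1 (e.g., English, Math, Reading, Science)
    [26, 23, 25, 25],   # occasion 2
    [25, 26, 27, 24],   # occasion 3 (most recent)
])

most_recent = attempts[-1].sum()            # composite from the last occasion
average     = attempts.sum(axis=1).mean()   # mean composite across occasions
highest     = attempts.sum(axis=1).max()    # best single-occasion composite
superscore  = attempts.max(axis=0).sum()    # best subtest scores pooled across occasions

print(most_recent, average, highest, superscore)
```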

5.
Performance on a standardized reading comprehension test reflects the number of correct answers readers select from a list of alternate choices, but fails to provide information about how readers cope with the various cognitive demands of the task. The aim of this study was to determine whether three groups of readers differed in their ability to cope with reading comprehension task demands: normally achieving readers (NA), poor comprehenders with no decoding disability (CD), and reading disabled readers (RD), that is, poor comprehenders with poor decoding skills. Three task variables reflected in the question-answer relations that appear on standardized reading comprehension tests were identified. Passage Independent (PI) questions can be answered with reasonable accuracy based on the reader's prior knowledge of the passage content. Inference (INFER) questions require the reader to generate an inference at the local or global text level. Locating (LOCAT) questions require the reader to match the correct answer choice to a detail explicitly stated in the text, either verbatim or in paraphrase form. The relations among reader characteristics, cognitive task factors, and reading comprehension test scores were analyzed with a structural equation model estimated in LISREL. The three reading groups differed with respect to the underlying relationship between their performance on specific question-answer types and their standardized reading comprehension score. For the NA group, a high score on PI was likely to be accompanied by a low score on INFER, whereas in the CD and RD groups PI and INFER were positively related. The finding of a negative relationship between background knowledge and inference task factors for normally achieving readers suggests that even normal readers may have comprehension difficulties that go undetected on the basis of standardized scores. This study indicates that current comprehension assessments may not be adequate for assessing specific reading difficulties and that more precise diagnostic tools are needed.

6.
Norm-referenced standardized achievement tests are designed, and commonly used, for obtaining group scores. Various methods are used to calculate and express group scores in terms of common derived scores, such as percentile ranks. Publishers' scaled scores are ordinarily used in these procedures, with the result that the group scores can possess anomalous characteristics. The group scores can vary widely, depending on not only the measure of central tendency but also the type of derived score employed. A reason for this situation is hypothesized to be the use of inappropriate statistical procedures to develop publishers' scaled scores. Practitioners need to be aware of this problem and to document their procedures when calculating and reporting group scores. Test publishers are urged to avoid the use of scaling procedures that are seen as responsible for this problem.

7.
At present, most college English examinations in China rely on multiple-choice items; from nationwide unified examinations down to the final examination of a single course, multiple-choice items can account for as much as 85% of a test. Examinations of this form are also labeled "standardized tests" and "objective items" and are credited with features such as "easy scoring," which has led to one-sided and uncritical views of this item type. Practice has shown that the shortcomings and negative effects of such items are real and, in some respects, quite serious. An appropriately balanced appraisal of multiple-choice items is therefore of great significance for steering college English testing in the right direction and for cultivating English-language talent.

8.
It has been seen that children's scores on reading achievement tests vary not only with knowledge of content, but also with the differing formats of test items. Teachers working with learning disabled children or children with attention problems may wish to choose standardized tests with fewer, rather than more, format changes. The present study evaluated the number of format and direction changes across tests and grade levels of the major elementary standardized reading achievement tests. The number of format changes varies from one change every 1.2 minutes on the Metropolitan Achievement Test Level E1 to one change every 21.3 minutes on the P1 level of the Stanford Achievement Test. Teachers may wish to take this evaluation into account when considering use of standardized reading achievement tests for their students.  相似文献   

9.
Previous methods for estimating the conditional standard error of measurement (CSEM) at specific score or ability levels are critically discussed, and a brief summary of prior empirical results is given. A new method is developed that avoids theoretical problems inherent in some prior methods, is easy to implement, and estimates not only a quantity analogous to the CSEM at each score but also the conditional standard error of prediction (CSEP) at each score and the conditional true score standard deviation (CTSSD) at each score. The new method differs from previous methods in that previous methods have concentrated on attempting to estimate error variance conditional on a fixed value of true score, whereas the new method considers the variance of observed scores conditional on a fixed value of an observed parallel measurement and decomposes these conditional observed score variances into true and error parts. The new method and several older methods are applied to a variety of tests, and representative results are graphically displayed. The CSEM-like estimates produced by the new method are called conditional standard error of measurement in prediction (CSEMP) estimates and are similar to those produced by older methods, but the CSEP estimates produced by the new method offer an alternative interpretation of the accuracy of a test at different scores. Finally, evidence is presented that shows that previous methods can produce dissimilar results and that the shape of the score distribution may influence the way in which the CSEM varies across the score scale.
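The starting point of the approach described above can be illustrated on parallel-forms data: for each observed score x on one form, examine the spread of scores on the parallel form among examinees with that x. The sketch below stops at that conditional standard deviation, a quantity in the spirit of the CSEP; it does not reproduce the paper's further decomposition into true and error parts, and the data and cutoffs are invented for illustration.

```python
import numpy as np

def conditional_sd_parallel(x, y, min_n=20):
    """For each observed score on form X, the SD of scores on parallel form Y among
    examinees with that X score (observed-score conditional spread only)."""
    out = {}
    for score in np.unique(x):
        ys = y[x == score]
        if len(ys) >= min_n:                 # skip sparsely populated score points
            out[int(score)] = float(np.std(ys, ddof=1))
    return out

# Illustrative parallel-forms data (simulated; not from the paper)
rng = np.random.default_rng(0)
true_score = rng.normal(30, 5, 10000)
x = np.clip(np.round(true_score + rng.normal(0, 3, 10000)), 0, 50).astype(int)
y = np.clip(np.round(true_score + rng.normal(0, 3, 10000)), 0, 50).astype(int)
print(conditional_sd_parallel(x, y))
```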

10.
The new college entrance examination (gaokao) schemes include elective subjects, and because examinee ability and item difficulty differ across these elective subjects, raw scores are not comparable. If standard scores or proportional rank scoring are used instead, then while they equalize difficulty across subjects, they also erase differences in the overall ability of the examinees taking each elective subject, which creates new problems. After discussing the problems caused by leaving elective-subject scores uncalibrated in several typical new gaokao schemes, this paper argues that one feasible solution is to use scores on the compulsory subjects to calibrate the scores on the elective subjects, and it clarifies and responds to several questions and concerns about such calibration. The paper argues that a scheme that calibrates elective-subject scores with compulsory-subject scores can be designed very simply, for example by only adding to, and never subtracting from, scores based on proportional rank scoring.

11.
Formula scoring is a procedure designed to reduce multiple-choice test score irregularities due to guessing. Typically, a formula score is obtained by subtracting a proportion of the number of wrong responses from the number correct. Examinees are instructed to omit items when their answers would be sheer guesses among all choices but otherwise to guess when unsure of an answer. Thus, formula scoring is not intended to discourage guessing when an examinee can rule out one or more of the options within a multiple-choice item. Examinees who, contrary to the instructions, do guess blindly among all choices are not penalized by formula scoring on the average; depending on luck, they may obtain better or worse scores than if they had refrained from this guessing. In contrast, examinees with partial information who refrain from answering tend to obtain lower formula scores than if they had guessed among the remaining choices. (Examinees with misinformation may be exceptions.) Formula scoring is viewed as inappropriate for most classroom testing but may be desirable for speeded tests and for difficult tests with low passing scores. Formula scores do not approximate scores from comparable fill-in-the-blank tests, nor can formula scoring preclude unrealistically high scores for examinees who are very lucky.  相似文献   
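The "proportion of the number of wrong responses" rule referred to above is the classical correction for guessing: with C answer choices per item, the formula score is the number right minus the number wrong divided by (C - 1), so blind guessing has an expected payoff of zero. A minimal sketch follows; the specific numbers are made up.

```python
def formula_score(n_right, n_wrong, n_choices):
    """Correction-for-guessing ('formula') score: rights minus a fraction of wrongs.
    Omitted items are neither rewarded nor penalized. With C choices, a blind guess
    gains 1 point with probability 1/C and otherwise costs 1/(C-1) points, so its
    expected contribution is zero."""
    return n_right - n_wrong / (n_choices - 1)

# 60-item, 5-option test: 40 right, 15 wrong, 5 omitted
print(formula_score(40, 15, 5))   # 40 - 15/4 = 36.25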

12.
Value-added scores from tests of college learning indicate how score gains compare to those expected from students of similar entering academic ability. Unfortunately, the choice of value-added model can impact results, and this makes it difficult to determine which results to trust. The research presented here demonstrates how value-added models can be compared on three criteria: reliability, year-to-year consistency and information about score precision. To illustrate, the original Collegiate Learning Assessment value-added model is compared to a new model that employs hierarchical linear modelling. Results indicate that scores produced by the two models are similar, but the new model produces scores that are more reliable and more consistent across years. Furthermore, the new approach provides school-specific indicators of value-added score precision. Although the reliability of value-added scores is sufficient to inform discussions about improving general education programmes, reliability is currently inadequate for making dependable, high-stakes comparisons between postsecondary institutions.  相似文献   

13.
This study examined differences between students who qualified for talent search testing via scores on standardized tests and via parent nomination in their performances on the SAT or ACT and some demographic characteristics. Overall, the standardized testing group earned higher scores on the off‐level tests than the parent nominated group. Asian students used parent nomination more than standardized tests for talent search testing, and Hispanic/Latino students in the parent nominated group but not in the standardized testing group were among the top performers on the off‐level tests. Parent nomination as a feasible alternative to standardized achievement tests is suggested for talented students who are not native English speakers or would not be identified as gifted using traditional qualification methods.  相似文献   

14.
Guessing correct answers to test items is a statistical concept that has a direct impact on the interpretation of test scores. Many published tests, however, do not account for guessing. This is an important issue in view of recent federal legislation in the United States, and growing attention worldwide, mandating the identification of at-risk children for educational services. Children may score within a normal range by chance alone, resulting in test scores that are not sensitive. The purpose of this paper, therefore, is: (a) to describe one process, random guessing, for estimating a “true blind guessing score” (range of scores) that, if known, would result in missing fewer at-risk children; and (b) to sensitize test administrators to tests that either do not address guessing or apply questionable corrections for it.
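One straightforward way to frame a "true blind guessing score" range is to model a purely random guesser's raw score as Binomial(number of items, 1/number of choices) and report a central interval of that distribution. The sketch below uses a 95% interval; the paper's own procedure may differ, so treat this only as an illustration of the idea.

```python
from scipy.stats import binom

def blind_guess_range(n_items, n_choices, coverage=0.95):
    """Range of raw scores a purely random guesser would obtain with the given
    coverage probability, under a Binomial(n_items, 1/n_choices) model. The 95%
    central interval is one plausible operationalization, not the paper's."""
    dist = binom(n_items, 1 / n_choices)
    lo = dist.ppf((1 - coverage) / 2)
    hi = dist.ppf(1 - (1 - coverage) / 2)
    return int(lo), int(hi)

# 40 four-option items: raw scores in this range are consistent with blind guessing
print(blind_guess_range(40, 4))
```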

15.
This study was conducted to determine whether Spanish‐enhanced administration of a standardized math assessment would result in improved scores for English Learners who used Spanish as a heritage language. Twenty‐one typically developing second‐graders (English Learners) were administered the traditional KeyMath‐3. If the child made an error on an item, a Spanish version of the item was presented. Difference scores were calculated to determine whether the Spanish‐enhanced version resulted in improved scores. Data were analyzed using paired t‐tests and simple regression. The data results showed that all children significantly benefited from the Spanish‐enhanced administration of items answered incorrectly in English. The amount of benefit was predicted by a child's degree of Spanish dominance. It was concluded that standardized math tests that do not accommodate second‐language learners may be inadvertently testing language skills in addition to math skills. Implications for assessment and interpretations of assessments are discussed.  相似文献   

16.
The use of constructed-response items in large scale standardized testing has been hampered by the costs and difficulties associated with obtaining reliable scores. The advent of expert systems may signal the eventual removal of this impediment. This study investigated the accuracy with which expert systems could score a new, nonmultiple-choice item type. The item type presents a faulty solution to a computer programming problem and asks the student to correct the solution. This item type was administered to a sample of high school seniors enrolled in an Advanced Placement course in Computer Science who also took the Advanced Placement Computer Science (APCS) examination. Results indicated that the expert systems were able to produce scores for between 82% and 95% of the solutions encountered and to display high agreement with a human reader on the correctness of the solutions. Diagnoses of the specific errors produced by students were less accurate. Correlations with scores on the objective and free-response sections of the APCS examination were moderate. Implications for additional research and for testing practice are offered.  相似文献   

17.
This article develops a conceptual framework that addresses score comparability. The intent of the framework is to help identify and organize threats to comparability in a particular assessment situation. Aspects of the testing situations that might threaten score comparability are delineated, procedures for evaluating the degree of score comparability are described, and suggestions are made about how to minimize the effects of potential threats. The situations considered are restricted to those in which test developers intend to (a) be able to use scores on 2 or more tests interchangeably, (b) collect data that allow for the conversion of scores on each of the tests to a common scale, and (c) use the scores to make decisions about individuals. Comparability of scores on alternate forms of performance assessments, adaptive and paper-and-pencil tests, and alternate pools used for computerized adaptive tests are considered within the framework. Aspects of these testing situations that might threaten score comparability and procedures for evaluating the degree of score comparability are described. Suggestions are made about how to minimize the effects of potential threats to comparability.  相似文献   

18.
Applied Measurement in Education, 2013, 26(1): 75-94
In this study we examined whether the measures used in the admission of students to universities in Israel are gender biased. The criterion used to measure bias was performance in the first year of university study; the predictors consisted of an admission score, a high school matriculation score, and a standardized test score as well as its component subtest scores. Statistically, bias was defined according to the boundary conditions given in Linn (1984). No gender bias was detected when using the admission score (which is used for selection) as a predictor of first-year performance in the university. Bias in favor of women was found predominantly when school grades were used as the predictor, whereas bias against women was found predominantly when the standardized test scores were used. It was concluded that the admission score is a valid and unbiased predictor of first-year university performance for both genders.

19.
The purpose of this study was to examine the validity evidence of first-grade spelling scores from a standardized test of nonsense word spellings and their potential value within universal literacy screening. Spelling scores from the Test of Phonological Awareness: Second Edition PLUS for 47 first-grade children were scored using a standardized procedure and an alternative invented spelling procedure. Correlations were examined among spelling and diagnostic word reading and decoding scores, along with scores from the Dynamic Indicators of Basic Early Literacy Skills (DIBELS). Spelling scores were significantly correlated with word reading and decoding scores, as well as DIBELS scores, except Phoneme Segmentation Fluency. Hierarchical multiple regression analyses revealed that spelling scores reliably accounted for significant variance in decoding but not word reading scores, beyond DIBELS scores. Implications are discussed related to the potential value of including early spelling scores within universal literacy screening.  相似文献   
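The hierarchical regression logic described above amounts to comparing R-squared with and without the spelling predictor after the DIBELS scores have been entered. The sketch below shows that comparison on simulated stand-in data; the variable names, sample size, and data-generating model are hypothetical and are not the study's data.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-ins: DIBELS subtests entered at step 1, spelling added at step 2,
# with decoding as the outcome; the increment in R^2 is the quantity of interest.
rng = np.random.default_rng(1)
n = 47
dibels = rng.normal(size=(n, 3))
spelling = 0.5 * dibels[:, 0] + rng.normal(size=n)
decoding = dibels @ [0.4, 0.2, 0.1] + 0.3 * spelling + rng.normal(size=n)

step1 = sm.OLS(decoding, sm.add_constant(dibels)).fit()
step2 = sm.OLS(decoding, sm.add_constant(np.column_stack([dibels, spelling]))).fit()
print("R2 step 1:", round(step1.rsquared, 3))
print("R2 step 2:", round(step2.rsquared, 3))
print("Delta R2 :", round(step2.rsquared - step1.rsquared, 3))
```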

20.
Two new methods have been proposed to determine unexpected sum scores on sub-tests (testlets) both for paper-and-pencil tests and computer adaptive tests. A method based on a conservative bound using the hypergeometric distribution, denoted p, was compared with a method where the probability for each score combination was calculated using a highest density region (HDR). Furthermore, these methods were compared with the standardized log-likelihood statistic with and without a correction for the estimated latent trait value (denoted lz* and lz, respectively). Data were simulated on the basis of the one-parameter logistic model, and both parametric and non-parametric logistic regression were used to obtain estimates of the latent trait. Results showed that it is important to take the trait level into account when comparing subtest scores. In a nonparametric item response theory (IRT) context, an adapted version of the HDR method was a powerful alternative to p. In a parametric IRT context, results showed that lz* had the highest power when the data were simulated conditionally on the estimated latent trait level.
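For reference, the standardized log-likelihood person-fit statistic mentioned above is commonly computed as lz = (l0 - E[l0]) / sqrt(Var[l0]), where l0 is the log-likelihood of the observed response pattern at the trait value. The sketch below implements the uncorrected lz for a single testlet under assumed known item parameters; it does not include the correction that defines lz*, and the item parameters and responses are invented for illustration.

```python
import numpy as np

def lz_statistic(responses, theta, b, a=None):
    """Standardized log-likelihood person-fit statistic lz for one response pattern.
    For subtest-level checks, pass only the items in that testlet. Rasch/2PL
    probabilities; parameters are treated as known, so this is the uncorrected lz."""
    a = np.ones_like(b) if a is None else a
    p = 1 / (1 + np.exp(-a * (theta - b)))           # item success probabilities
    l0   = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    e_l0 = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    v_l0 = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - e_l0) / np.sqrt(v_l0)

# Illustrative testlet of 8 Rasch items; large negative values suggest misfit
b = np.linspace(-1.5, 1.5, 8)
u = np.array([1, 1, 1, 0, 1, 0, 0, 0])
print(lz_statistic(u, theta=0.2, b=b))
```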
