Similar literature
Found 20 similar documents (search time: 31 ms)
1.
Applied Measurement in Education, 2013, 26(3): 185-207
With increasing interest in educational accountability, test results are now expected to meet a diverse set of informational needs. But a norm-referenced test (NRT) cannot be expected to meet the simultaneous demands for both norm-referenced and curriculum-specific information. One possible solution, which is the focus of this article, is to customize the NRT. Customized tests may appear in any form. They may (a) add a few curriculum-specific items to the end of the NRT, (b) substitute locally constructed items for a few NRT items, (c) substitute a curriculum-specific test (CST) for the NRT, or (d) use equating methods to obtain predicted NRT scores from the CST scores. In this article, we describe the four main approaches to customized testing, address the validity of the uses and interpretations of customized test scores obtained from the four main approaches, and offer recommendations regarding the use of customized tests and the need for further research. Results indicate that customized testing can yield both valid normative and curriculum-specific information when special conditions exist. But there are also many threats to the validity of normative interpretations. Cautious application of customized testing is needed in order to avoid misleading inferences about student achievement.

2.
In order to investigate the effect of two item-writing practices on test characteristics, examinations were chosen for study in two undergraduate courses (N = 71 and 210). About one-fourth of the items on each examination included a practice generally regarded as undesirable in measurement textbooks and alleged to make test items more difficult. Alternate forms which eliminated the undesirable practice were developed and administered at the same time as the original form. Rewriting item stems so that they formed a complete sentence or question resulted in about 6 percent more students answering items correctly. Eliminating unnecessary material in item stems, however, had little effect on difficulty. KR20 values were not appreciably different for the two versions of either test. Neither flaw was found to affect item discrimination indices noticeably. The absence of any substantial practice-by-achievement level interactions suggested little effect of the practices on the validity of the tests.
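
The KR20 values mentioned in the abstract above refer to the Kuder-Richardson Formula 20 reliability coefficient for dichotomously scored (0/1) items. A minimal sketch of that computation follows. The function name `kr20` and the small response matrix are illustrative assumptions, not data or code from the study.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson Formula 20 for 0/1 item scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    Uses population variances throughout, so the item variance is p * (1 - p).
    Raises ZeroDivisionError if all examinees have the same total score.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people  # proportion correct
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


# Hypothetical data: 5 examinees, 4 items (made up for illustration).
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(example), 3))  # 0.8
```

Comparing two test versions, as the study did, would amount to computing this coefficient separately for each form.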

3.
Applied Measurement in Education, 2013, 26(1): 15-35
This study examines the effects of using item response theory (IRT) ability estimates based on customized tests that were formed by selecting specific content areas from a nationally standardized achievement test. Subsets of items were selected from four different subtests of the Iowa Tests of Basic Skills (Hieronymus, Hoover, & Lindquist, 1985) on the basis of (a) selected content areas (content-customized tests) and (b) a representative sampling of content areas (representative-customized tests). For three of the four tests examined, ability estimates and estimated national percentile ranks based on the content-customized tests in school samples tended to be systematically higher than those based on the full tests. The results of the study suggested that for certain populations, IRT ability estimates and corresponding normative scores on content-customized versions of standardized achievement tests cannot be expected to be equivalent to scores based on the full-length tests.

4.
Evaluation is an inherent part of education for an increasingly diverse student population. Confidence in one’s test‐taking skills, and the associated testing environment, needs to be examined from a perspective that combines the concept of Bandurian self‐efficacy with the concept of stereotype threat reactions in a diverse student sample. Factors underlying testing reactions and performance on a cognitive ability test in four different testing conditions (high or low stereotype threat and high or low test face validity) were examined in this exploratory study. The stereotype threat manipulation seemed to lower African‐American and Hispanic participants’ test scores. However, the hypothesis that there would be an interaction with face validity was only partially supported. Participants’ highest scores resulted from low stereotype threat and high face validity, as predicted. However, the lowest scores were not in the high stereotype threat/low face validity condition as expected. Instead, most groups tended to score lower when the test was perceived to be more face valid. Stereotype threat manipulation affected Whites as well as non‐Whites, although differently. Specifically, high stereotype threat increased Whites’ cognitive ability test scores in the low face validity condition, but decreased them in the high face validity condition. Implications for testing and classroom environment design are discussed.

5.
The use of content validity as the primary assurance of the measurement accuracy for science assessment examinations is questioned. An alternative accuracy measure, item validity, is proposed. Item validity is based on research using qualitative comparisons between (a) student answers to objective items on the examination, (b) clinical interviews with examinees designed to ascertain their knowledge and understanding of the objective examination items, and (c) student answers to essay examination items prepared as an equivalent to the objective examination items. Calculations of item validity are used to show that selected objective items from the science assessment examination overestimated the actual student understanding of science content. Overestimation occurs when a student correctly answers an examination item, but for a reason other than that needed for an understanding of the content in question. There was little evidence that students incorrectly answered the items studied for the wrong reason, resulting in underestimation of the students' knowledge. The equivalent essay items were found to limit the amount of mismeasurement of the students' knowledge. Specific examples are cited and general suggestions are made on how to improve the measurement accuracy of objective examinations.

6.
Yujing Ni. Educational Psychology, 2000, 20(2): 139-152
This study investigated validity of scores derived from the measurement procedure involving number lines by assessing its unique contributions to performance differences in criterion measures of rational number knowledge and skills, including fraction computation, application and explanation. A total of 413 5th and 6th graders participated in the study. Children's part-whole knowledge was measured with items of regional area representation, measurement knowledge with items involving number lines as well as those of fraction-size comparisons. It was found that when part-whole knowledge was accounted for, measurement knowledge assessed with number line items had no or negligible association with all of the three criterion measures. In contrast, measurement knowledge assessed with fraction-size comparisons demonstrated excellent incremental predictability. The results indicate that scores derived from the number line test items are poor estimators of children's understanding of the measurement aspect of rational number.

7.
Abstract

With the national move toward competency testing, publishers and educators have become increasingly concerned about test validity, item construction, and item readability. While a major effort is usually made by test developers to control the readability level of the test items, there is currently no validated measure of individual item readability.

It is commonly assumed that oral reading of test items by the teacher would ameliorate the readability problem for poor readers. Over 4,000 fifth-grade students were involved in this study aimed at determining the effect of teacher oral reading of test items to good and poor readers. The findings suggested that having teachers read test items aloud during the administration of standardized examinations yielded, overall, higher scores than having students read the items for themselves. However, this intervention did not benefit poor readers more than good readers. Both of these groups reflected similar gains under the influence of this intervention.

8.
Answer Changing on Multiple-Choice Test Items Among Eighth-Grade Readers   Total citations: 1 (self-citations: 1; citations by others: 0)
This study was done to examine the effect of answer changing on multiple-choice test performance among good and poor readers in the eighth grade. Although the gains of poor readers were higher than those of good readers, all subjects profited significantly from changing their answers on items. For all subjects, when a single response was changed, there was a two-to-one chance that the new response would raise rather than lower the final score. Gains from answer changing on test items were slightly higher for poor readers as a group than were those for good readers. However, the result was determined not to be significant. More important, this hypothesis is strengthened by the fact that all subjects profited from answer changing. Therefore, the results were interpreted as lending support to the notion that answer-changing response among young examinees should be encouraged if there is a reasonable doubt about their “first impression.”

9.
Using data from a large-scale exam, in this study we compared various designs for equating constructed-response (CR) tests to determine which design was most effective in producing equivalent scores across the two tests to be equated. In the context of classical equating methods, four linking designs were examined: (a) an anchor set containing common CR items, (b) an anchor set incorporating common CR items rescored, (c) an external multiple-choice (MC) anchor test, and (d) an equivalent groups design incorporating rescored CR items (no anchor test). The use of CR items without rescoring resulted in much larger bias than the other designs. The use of an external MC anchor resulted in the next largest bias. The use of a rescored CR anchor and the equivalent groups design led to similar levels of equating error.

10.
The paper provides (1) a teacher-administered rating instrument for inattention without confounding the rating with hyperactivity and conduct disorder, and (2) evidence that the ratings correlate with the scores obtained from cognitive tests of attention. In Study I, the first objective was to investigate the construct validity and the inter-rater reliability of the Attention Checklist (ACL) by factor analysing the teacher ratings of 110 Grade 4 children, obtained by using the ACL. The second objective was to investigate the predictive validity of the ACL by examining the relationship between the scores obtained for the participants from teachers' ratings using the ACL and the scores obtained by participants in the lab-type attention tests. The results of factor analysis showed that a single factor labelled ‘inattention’ underlies the 12 items in the ACL. Examining the differences in performance on attention tests, the ‘low attention’ children as rated by the teachers on the ACL scored lower than the ‘high attention’ children on the objective tests of attention. These findings were replicated in Study II, which was conducted to test further the construct validity and predictive validity of the ACL. This time, only those two tests (Auditory Attention and Visual Attention) that had shown relatively poor discrimination between the high and low attention groups in Study I were, again, administered to another cohort of 97 Grade 4 children, as it was our intention to further challenge the reliability of the ACL. Overall, the results of both studies suggest that comprehensive assessment of attention skills should include both ACL and objective measures of selective attention.

11.
Applied Measurement in Education, 2013, 26(3): 249-253
A test segment that lacks content validity with respect to a criterion may be deleted for that reason. At issue is the effect on reliability and validity as measured by the coefficients arising from classical test theory. Assuming that the predictor test has some reasonable degree of internal consistency, deleting a segment of meaningful size is certain to reduce reliability. However, Feldt (1997) showed that a concomitant rise in the validity coefficient may occur under certain limited conditions. The present research further characterizes the circumstances under which validity changes may occur as a result of deletion of a predictor test segment. Specifically, for a positive outcome, one seeks a relatively large correlation between the scores from the deleted segment and the remaining items coupled with a relatively low correlation between scores from the deleted segment and the criterion.
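
The condition stated at the end of the abstract above can be illustrated with the standard formula for the correlation between a sum of two parts and a third variable. This is a sketch under stated assumptions, not the derivation used in the article: the retained segment is called A, the deleted segment B, the criterion Y, and the function name and all numeric values are hypothetical.

```python
import math


def composite_validity(sd_a, sd_b, r_ab, r_ay, r_by):
    """Correlation of the full test X = A + B with criterion Y.

    Follows from cov(A + B, Y) = cov(A, Y) + cov(B, Y) and
    var(A + B) = var(A) + var(B) + 2 * cov(A, B).
    """
    numerator = sd_a * r_ay + sd_b * r_by
    denominator = math.sqrt(sd_a**2 + sd_b**2 + 2 * sd_a * sd_b * r_ab)
    return numerator / denominator


# Hypothetical case 1: B correlates highly with A (0.8) but weakly with Y (0.1).
full_1 = composite_validity(1.0, 1.0, r_ab=0.8, r_ay=0.5, r_by=0.1)
print(round(full_1, 3))  # 0.316, below r_ay = 0.5, so deleting B raises validity

# Hypothetical case 2: B correlates weakly with A (0.2) but as well with Y as A does.
full_2 = composite_validity(1.0, 1.0, r_ab=0.2, r_ay=0.5, r_by=0.5)
print(full_2 > 0.5)  # True: here deleting B would lower validity
```

Deleting B leaves the retained segment's validity `r_ay`, so deletion helps exactly when `r_ay` exceeds the full-test value returned by the function.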

12.
In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered due to various reasons (e.g., lack of test-taking motivation, poor time management, and test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores and thereby threaten the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with those of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Using the traditional scoring approaches, students’ responses for not-reached items were treated as either not-administered or incorrect in the scoring process. With the proposed scoring approach, students’ nonmissing responses were scored polytomously based on how accurately and rapidly they responded to the items to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared based on several evaluation criteria, such as model fit indices, test information function, and bias. The results indicated that the polytomous scoring approaches outperformed the traditional approaches. The complete case simulation corroborated our empirical findings that the scoring approach in which nonmissing items were scored polytomously and not-reached items were considered not-administered performed the best. Implications of the polytomous scoring approach for low-stakes assessments were discussed.

13.
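Using expert interviews, literature review, and mathematical statistics, this study examined the practical skills examination for physical education majors in the college entrance examination of Guizhou Province. The results show that the four fixed physical fitness tests in this examination cannot fully assess candidates. Recommendations: add an agility component to the physical fitness tests, increasing them from 4 categories with 4 items to 5 categories with 5 items; add a sport-specific skills examination; weight physical fitness and sport-specific skills at 75:25; set the total test score uniformly at 100 points; and, among candidates whose physical education and academic examination scores both reach the cutoff, admit in descending order of physical education score.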

14.
The purpose of this study was to compare several methods for determining a passing score on an examination from the individual raters' estimates of minimal pass levels for the items. The methods investigated differ in the weighting that the estimates for each item receive in the aggregation process. An IRT-based simulation method was used to model a variety of error components of minimum pass levels. The results indicate little difference in estimated passing scores across the three methods. Less error was present when the ability level of the minimally competent candidates matched the expected difficulty level of the test. No meaningful improvement in passing score estimation was achieved for a 50-item test as opposed to a 25-item test; however, the RMSE values for estimates with 10 raters were smaller than those for 5 raters. The results suggest that the simplest method for aggregating minimum pass levels across the items in a test (adding them up) is the preferred method.
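
The "adding them up" aggregation described above can be sketched as follows. The sketch assumes each rater supplies, for every item, an estimated probability that a minimally competent candidate answers correctly, and that item estimates are averaged across raters before being summed without weights. The function name and ratings are hypothetical, and the weighted alternatives compared in the study are not reproduced here.

```python
from statistics import mean


def passing_score(ratings):
    """Unweighted passing score from raters' minimum pass level estimates.

    ratings: list of rater rows; ratings[r][i] is rater r's estimated
    probability that a minimally competent candidate answers item i correctly.
    Returns the sum over items of the mean estimate across raters.
    """
    n_items = len(ratings[0])
    item_means = [mean(row[i] for row in ratings) for i in range(n_items)]
    return sum(item_means)


# Hypothetical ratings: 2 raters, 3 items (made up for illustration).
ratings = [
    [0.6, 0.8, 0.5],
    [0.4, 0.6, 0.7],
]
print(round(passing_score(ratings), 2))  # 1.8 out of 3 items
```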

15.
Abstract

To combat problems of cheating arising from testing under crowded classroom conditions, instructors frequently use multiple arrangements of a set of test items. These different arrangements or forms should be nearly equivalent relative to mean total scores. This study reports data from comparisons involving eleven pairs of equivalent tests. There were no significant linear relationships between equivalent test forms on the ordering of item difficulties. Reliabilities differed little within pairs of equivalent tests. Nine of eleven t-tests comparing mean total test scores were nonsignificant. The bulk of these data supported the assumption that one may construct equivalent power tests by rearranging items, when the ordering of item difficulty is non-systematic on both arrangements.

16.
This paper serves as an illustration of the usefulness of structurally incomplete designs as an approach to reduce the length of educational questionnaires. In structurally incomplete test designs, respondents only fill out a subset of the total item set, while all items are still provided to the whole sample. The scores on the unadministered items are subsequently dealt with by using methods for the estimation of missing data. Two structurally incomplete test designs (one recording two thirds, and the other recording a half of the potentially complete data) were applied to the complete item scores on 8 educational psychology scales. The incomplete item scores were estimated with the missing-data method Data Augmentation. Complete and estimated test data were compared on estimates of total scores, reliability, and predictive validity of an external criterion. The reconstructed data yielded estimates that were very close to the values in the complete data. As expected, the statistical uncertainty was higher in the design that recorded fewer item scores. It was concluded that the procedure of applying incomplete test designs and subsequently dealing with the missing values is very fruitful for reducing questionnaire length.

17.
An attitudinal scale on population problems is constructed. Although the determination of attitudes of Americans toward population problems is meaningful in itself, an additional effort is made to demonstrate the empirical validity of acknowledged variables. Data were collected in the Tulsa, Oklahoma Standard Metropolitan Statistical Area using a stratified proportionate sample. The 372 respondents representing a 1% sample do not differ significantly from the population of the total Standard Metropolitan Statistical Area. The interview schedule consisted of items designed to elicit standard socioeconomic information on the respondents along with their attitudes toward population problems. Using the Guttman technique of scalogram analysis, a population problems scale containing 6 items was developed. After validation of the original set of attitudinal items by factor analysis, the scale scores of the respondents were compared with selected socioeconomic variables in an attempt to empirically validate the scale. Using the Student's "t" associated with the Spearman rank correlation coefficient value and the Kruskal-Wallis 1-way analysis of variance, only the variables of education, number of children, and occupation proved to be associated with the population problems scale scores. It was learned that these variables were significant in other studies and do help to establish the empirical validity of the scale. The lack of association of variables of marital status, income, religion, race, and age suggest that the empirical validity of such relationships requires additional examination.

18.
Differential weighting of response alternatives and confidence testing have been proposed as ways to assess partial knowledge on multiple-choice tests. A total of 211 students in an educational measurement course took their midterm examination under one of three procedures. Results from those students administered the test under conventional directions provided a baseline for comparing, in terms of reliability and validity, the results from students who took the test under the differential weighting of response alternatives or the confidence testing instructions. Reliability was estimated by the split-half technique. Validity was estimated by correlating midterm test scores with scores on a final examination. This investigation provides some support for the contention that validity can be improved using more sophisticated testing techniques. Suggestions for the conduct of more definitive studies were offered.
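
The split-half technique named above is commonly carried out by correlating scores on two halves of a test and stepping the result up with the Spearman-Brown formula. The sketch below assumes an odd-even split and that step-up correction; the abstract does not say which split or correction the study used. The function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(responses):
    """Odd-even split-half reliability with the Spearman-Brown correction.

    responses: list of examinee rows of item scores.
    Items at even indices (0, 2, ...) form one half, odd indices the other.
    """
    half_a = [sum(row[0::2]) for row in responses]
    half_b = [sum(row[1::2]) for row in responses]
    r = pearson(half_a, half_b)
    return 2 * r / (1 + r)  # estimated reliability of the full-length test


# Hypothetical data: 5 examinees, 4 items (made up for illustration).
scores = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(round(split_half_reliability(scores), 3))  # 0.748
```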

19.
The matched pair technique for writing and scoring true-false items was designed to compensate for the acquiescence response set of primary grade children. The claim that this technique increases reliability to an appreciable extent over traditional true-false scoring was investigated by comparing alpha internal consistency coefficients computed for the matched pair true-false, traditional true-false, and three other scoring schemes. Both the total sample coefficients and individual classroom coefficients were computed from the standardization sample of a primary grade economics achievement test (Primary Test of Economic Understanding). Classroom reliability coefficients computed from the matched pair scores were found to be higher than those from scores computed by the other methods. Total sample coefficients obtained from four of the five methods were nearly equal. Evidence of the effects of each scoring technique on concurrent validity is also presented. Contrary to expectations, the correlations of traditional and matched pair scores with Iowa Test of Basic Skills (ITBS) subtests (when adjusted for differing reliabilities) were approximately equal.
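
The abstract above does not say how the correlations were "adjusted for differing reliabilities". One common adjustment in classical test theory is the correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The sketch below assumes that formula; the function name and all values are hypothetical.

```python
import math


def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Classical correction for attenuation.

    r_xy: observed correlation between two measures.
    rel_x, rel_y: reliability coefficients of the two measures (e.g., alpha).
    Returns the estimated correlation between the underlying true scores.
    """
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical values: observed r = 0.36, reliabilities 0.81 and 0.64.
print(round(disattenuated_correlation(0.36, 0.81, 0.64), 3))  # 0.5
```

Under this adjustment, two scoring methods with different reliabilities can be compared on their estimated true-score correlations with the same external test.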

20.
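The Student Physical Health Standard test was administered at Shaanxi Institute of Education, and the scores and score distributions for three aspects (body shape, physiological function, and physical fitness) were analyzed. The overall physical health of students at the institute was below the national level, and results on the fitness items reflecting lower-limb strength and explosive power were close to "failing", which is closely related to the students' poor physical fitness and lack of physical exercise.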
