Similar literature
20 similar documents found.
1.
The effects of violating four item construction principles were examined to assess the validity of the principles and the importance of students’ test wiseness. While flawed items were significantly less difficult than sound items, differences in item discrimination, test reliability, and concurrent validity were not observed. These results are interpreted to suggest that variance attributable to test wiseness may be relatively small, and item-writing guidelines may be robust to violation.

2.
To investigate the effect of two item-writing practices on test characteristics, examinations were chosen for study in two undergraduate courses (N = 71 and 210). About one-fourth of the items on each examination included a practice generally regarded as undesirable in measurement textbooks and alleged to make test items more difficult. Alternate forms which eliminated the undesirable practice were developed and administered at the same time as the original form. Rewriting item stems so that they formed a complete sentence or question resulted in about 6 percent more students answering items correctly. Eliminating unnecessary material in item stems, however, had little effect on difficulty. KR20 values were not appreciably different for the two versions of either test. Neither flaw was found to affect item discrimination indices noticeably. The absence of any substantial practice-by-achievement level interactions suggested little effect of the practices on the validity of the tests.
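For readers unfamiliar with the KR20 statistic cited in this abstract, the following is a minimal sketch of the standard Kuder-Richardson Formula 20 for dichotomously scored items. It is not taken from the study: the function name `kr20` and the sample response matrix are hypothetical, and the sketch assumes the population variance of total scores is used.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    # Proportion of examinees answering each item correctly.
    p = [sum(row[i] for row in responses) / n_examinees for i in range(n_items)]
    # Sum of item variances p * (1 - p).
    item_var_sum = sum(pi * (1 - pi) for pi in p)
    # Population variance of the total scores (an assumption of this sketch).
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    return (n_items / (n_items - 1)) * (1 - item_var_sum / total_var)


# Hypothetical data: 5 examinees, 4 items.
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(sample), 3))
```

Comparing this value across two forms of a test, as the abstract describes, would mean calling the function once per form on that form's response matrix.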

3.
In an attempt to identify some of the causes of answer-changing behavior, the effects of four test- and item-specific variables were evaluated. Three samples of New Zealand school children of different ages were administered tests of study skills. The number of answer changes per item was compared with the position of each item in a group of items, the position of each item in the test, the discrimination index and the difficulty index of each item. It is shown that answer changes were more likely to be made on items occurring early in a group of items and toward the end of a test. There was also a tendency for difficult items and items with poor discrimination to be changed more frequently. Some implications of answer changing in the design of tests are discussed.

4.
For the purpose of obtaining data to use in test development, multiple matrix sampling (MMS) plans were compared to examinee sampling plans. Data were simulated for examinees, sampled from a population with a normal distribution of ability, responding to items selected from an item universe. Three item universes were considered: one that would produce a normal distribution of test scores, one a moderately platykurtic distribution, and one a very platykurtic distribution. When comparing sampling plans, total numbers of observations were held constant. No differences were found among plans in estimating item difficulty. Examinee sampling produced better estimates of item discrimination, test reliability, and test validity. As total number of observations increased, estimates improved considerably, especially for those MMS plans with larger subtest sizes. Larger numbers of observations were needed for tests designed to produce a normal distribution of test scores. With an adequate number of observations, MMS is seen as an alternative to examinee sampling in test development.

5.
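Research on the High School Academic Proficiency Examination (II): Evaluating Examination Quality   Total citations: 1 (self-citations: 0; citations by others: 1)
Zhou Qun (周群). 《考试研究》, 2012, (6): 20-28
The quality of the items and papers in an academic proficiency examination directly affects the validity and reliability of academic evaluation and diagnostic results. Taking the Shanghai high school academic proficiency examination in ideology and politics as an example, this paper quantitatively evaluates examination quality from four perspectives: differential functioning of items and item bundles, the correlation between item scores and total scores, discrimination index analysis, and classification consistency and accuracy. The aim is to introduce methods for evaluating the quality of academic proficiency examinations.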

6.
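To ensure the quality of language test items and strengthen item bank construction, this paper uses Gitest Ⅲ, within the framework of classical test theory, to conduct an item analysis of the reading section of a college entrance examination paper. The results show that the difficulty and discrimination of the reading items are fairly satisfactory, but the distribution of difficulty is not. The authors recommend pilot-testing papers assembled from an item bank before use, in order to improve the difficulty distribution of the items and the quality of some item options, and thereby raise the reliability and validity of the test.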
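As an illustration of the kind of classical test theory item analysis this abstract refers to, the sketch below computes two common indices for 0/1-scored items: difficulty as the proportion correct, and discrimination as the correlation between an item and the score on the remaining items. This is an assumption about which indices are meant; the abstract does not say how Gitest Ⅲ defines them, and the names `pearson`, `item_analysis`, and the sample data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_analysis(responses):
    """Return a (difficulty, discrimination) pair for each item.

    difficulty: proportion of examinees answering the item correctly.
    discrimination: correlation of the item with the rest score
    (total score excluding the item itself).
    """
    n_items = len(responses[0])
    results = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        results.append((sum(item) / len(item), pearson(item, rest)))
    return results


# Hypothetical data: 5 examinees, 4 items.
sample = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for number, (difficulty, discrimination) in enumerate(item_analysis(sample), 1):
    print(f"item {number}: difficulty={difficulty:.2f}, discrimination={discrimination:.2f}")
```

Correlating each item with the rest score rather than the full total avoids inflating the index by counting the item against itself. The sketch assumes every item and every rest score has nonzero variance.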

7.
The purpose of this research was to recommend an item bias procedure when the number of minority examinees is too small to use preferred three-parameter IRT methods. The chi-square, Angoff delta-plot, and pseudo-IRT indices were compared with both real and simulated data. For the real test data a criterion of known bias had been established by cross-validated IRT-3 results. The findings from the Math Test and the simulated test were consistent. The pseudo-IRT approach was best (measured by both correlations and percent agreement) in detecting criterion bias. The chi-square was close in accuracy to the pseudo-IRT index. The Angoff delta-plot method was found to be inadequate on both heuristic and empirical grounds. In extreme cases it even identified items as biased against whites that were simulated to be biased against blacks. However, a modified Angoff index, where p-value differences were regressed on item point biserials (and the residualized values used as the index), was nearly as good as the chi-square in identifying known bias. A final caution was offered regarding the use of item bias techniques. The statistical flags should never be used mechanically to discard items; rather they should be used to inspect items for possible differences in meaning.
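The modified index described in this abstract (p-value differences regressed on item point biserials, with the residuals used as the index) can be sketched roughly as follows. This is only one reading of the description, using an ordinary least-squares line. The abstract does not specify which group's point biserials are used or whether p-values are transformed first, and the function name, argument names, and numbers are all hypothetical.

```python
def residualized_p_differences(p_reference, p_focal, point_biserials):
    """Residuals from regressing between-group p-value differences
    on item point-biserial correlations (simple least squares)."""
    diffs = [r - f for r, f in zip(p_reference, p_focal)]
    n = len(diffs)
    mean_x = sum(point_biserials) / n
    mean_y = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in point_biserials)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(point_biserials, diffs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # An item's index is how far its difference lies from the fitted line.
    return [y - (intercept + slope * x) for x, y in zip(point_biserials, diffs)]


# Hypothetical values for four items.
p_reference = [0.60, 0.70, 0.80, 0.90]
p_focal = [0.50, 0.55, 0.60, 0.40]
point_biserials = [0.20, 0.30, 0.40, 0.50]
index = residualized_p_differences(p_reference, p_focal, point_biserials)
print([round(value, 3) for value in index])
```

In line with the abstract's closing caution, large residuals from a sketch like this would be a prompt to inspect an item, not a rule for discarding it.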

8.
Item-pool management requires a balancing act between the input of new items into the pool and the output of tests assembled from it. A strategy for optimizing item-pool management is presented that is based on the idea of a periodic update of an optimal blueprint for the item pool to tune item production to test assembly. A simulation study with scenarios involving different levels of quality of the initial item pool, item writing, and management for a previous item pool from the Law School Admission Test (LSAT) showed that good item-pool management had about the same main effects on the item-writing costs and the number of feasible tests as good item writing, but the two factors showed strong interaction effects.

9.
A subset of the items of both forms of the Peabody Picture Vocabulary Test (PPVT) was administered to a sample of 452 fourth-, fifth-, and sixth-grade students. This sample of students was randomly divided into two equal subgroups. Item difficulty indices were calculated for each of the two subsamples for each of the two forms of the test. Data obtained from the first subsample were used to evaluate the published ordering of items of Forms A and B of the PPVT and to reorder the items according to the empirically derived item difficulties. The second subsample was used as a cross-validation sample to evaluate the empirically derived reordering of items. The results of the cross-validation of the reordering indicate a substantial and significant increase in the validity of the item orderings for this subset of items on both forms of the PPVT. Therefore, this new ordering may yield a more accurate estimate of the intelligence of average and above-average students in the fourth, fifth, and sixth grades than the present, published ordering of items.

10.
A recent review of research revealed that much of the advice given for writing multiple-choice test items is based on experience and wisdom rather than empirical research. The present study involved the testing of two common item-writing rules: (1) the phrasing of the stem in the form of a question versus a partial sentence and (2) the use of the inclusive “none of the above” option instead of a specific content option. Limited empirical research suggests that using the partial sentence format and the inclusive “none of these” option may lead to undesirable item and test characteristics, whereas textbook authors essentially are divided on the opinions about the validity of each rule. Results of this experimental study offer no evidence to support the use of either type of stem and limited evidence to caution against the option “none of the above.”

11.
In actual test development practice, the number of test items that must be developed and pretested is typically greater, and sometimes much greater, than the number that is eventually judged suitable for use in operational test forms. This has proven to be especially true for one item type, analytical reasoning, that currently forms the bulk of the analytical ability measure of the GRE General Test. This study involved coding the content characteristics of some 1,400 GRE analytical reasoning items. These characteristics were correlated with indices of item difficulty and discrimination. Several item characteristics were predictive of the difficulty of analytical reasoning items. Generally, these same variables also predicted item discrimination, but to a lesser degree. The results suggest several content characteristics that could be considered in extending the current specifications for analytical reasoning items. The use of these item features may also contribute to greater efficiency in developing such items. Finally, the influence of these various characteristics also provides a better understanding of the construct validity of the analytical reasoning item type.

12.
The van Hiele theory and van Hiele Geometry Test have been extensively used in mathematics assessments across countries. The purpose of this study is to use classical test theory (CTT) and cognitive diagnostic modeling (CDM) frameworks to examine psychometric properties of the van Hiele Geometry Test and to compare how various classification criteria assign van Hiele levels to students. The findings support the hierarchical property of the van Hiele theory and levels. Using conventional and combined criteria to determine mastery of a level, the percentages of students classified into an overall level were relatively high. Although some items had aberrant difficulties and low item discrimination, varied selection of the criteria across levels improved item discrimination power, especially for those items with low item discrimination index (IDI) estimates. Based on the findings, we identify items on the van Hiele Geometry Test that might be revised and we suggest changes to classification criteria to increase the number of students who can be assigned an overall level of geometry thinking according to the theory. As a result, practitioners and researchers may be better positioned to use the van Hiele Geometry Test for classroom assessment.

13.
The purpose of the study was to examine the effect of item phrasing on the validity of a Likert-type attitude scale. Three content-similar scales were composed of 15 items, either all positive, all negative, or a mixture of positive and negative items. Five hundred twenty-two students in grades 4–6 responded to one of the three forms. Results from the all-positive and all-negative forms indicated that item means, variances, and factor structures differed significantly. Inspection of item means suggested that it was difficult for the students to indicate agreement by disagreeing with a negative statement. Analyses of the mixed-phrasing form indicated factors based upon item phrasing, not item content. Taken together, the results suggest that the technique of balancing item phrasing, when used with elementary students, appears to affect adversely the validity of attitude measurement.

14.
Testing organizations need large numbers of high‐quality items due to the proliferation of alternative test administration methods and modern test designs. But the current demand for items far exceeds the supply. Test items, as they are currently written, evoke a process that is both time‐consuming and expensive because each item is written, edited, and reviewed by a subject‐matter expert. One promising approach that may address this challenge is automatic item generation. Automatic item generation combines cognitive and psychometric modeling practices to guide the production of items that are generated with the aid of computer technology. The purpose of this study is to describe and illustrate a process that can be used to review and evaluate the quality of the generated items by focusing on the content and logic specified within the item generation procedure. We illustrate our process using an item development example from mathematics drawn from the Common Core State Standards and from surgical education drawn from the health sciences domain.

15.
Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions containing both images and answer lists. Subjects were randomly assigned to one version. Answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross‐sectional images resulted in greater item difficulty and item discrimination, compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores are dependent on the type of image used. Results from the two cross‐sectional images suggest an extra ability is being tested. Data from a scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context‐dependent interaction takes place with the content of questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items. Anat Sci Educ © 2012 American Association of Anatomists.

16.
The purpose of this study was to identify broad classes of items that behave differentially for handicapped examinees taking special, extended-time administrations of the Scholastic Aptitude Test (SAT). To identify these item classes, the performance of nine handicapped groups and one nonhandicapped group on each of two forms of the SAT was investigated through a two-stage procedure. The first stage centered on the performance of item clusters. Individual items composing clusters showing questionable performance were then examined. This two-stage procedure revealed little indication of differentially functioning item classes. However, some notable instances of differential performance at the item level were detected, the most serious of which affected visually impaired students taking the braille edition of the test.

17.
The purpose of this study was to develop a questionnaire that could measure preservice mathematics teachers' mathematics educational values. Development and validation of the questionnaire involved a sequential inquiry in which design principles were established from the existing literature and a pool of items was constructed then submitted to experts for consideration of the construct validity. Alterations to the items based on their suggestions were made to produce a trial version of the questionnaire. A pilot study involving preservice mathematics teachers explored the validity and usefulness of the questionnaire. The pilot results were used to revise the questionnaire that was administered to a sample of preservice mathematics teachers attending Cumhuriyet University, Sivas, Turkey. Further explorations of the construct and structural validity, item contributions, and reliability were achieved by using a factor analysis and two different item analysis methods. Results revealed that the questionnaire included four factors, satisfactory item contributions, and acceptable internal consistency. One result obtained in this study suggested that some mathematics education values based on Western culture (e.g., accessibility–special) have not been accepted by Turkish preservice mathematics teachers.

18.
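The PISA assessment focuses on students' lifelong development, and its approach to test construction has profoundly changed educational evaluation in many countries. Building on the theory and framework of the PISA reading assessment, this study developed a PISA-style Chinese reading test. The test comprises three reading passages and 18 items in total. The test's difficulty, discrimination, reliability, and validity were examined, and its dimensionality was evaluated with a full-information bifactor model. The results show that the PISA-style Chinese reading test is of moderate difficulty, discriminates well, and has broadly acceptable reliability and validity. It also largely meets PISA's requirements for the ability structure of a reading test: it assesses students' general reading comprehension well, along with the three sub-dimensions of information retrieval, text interpretation, and reflection and evaluation.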

19.
Three multiple-choice tests were developed from judgmental, frequency, and discrimination procedures of selecting item distractors. Scores on each of these tests were correlated with scores on a completion test of parallel numeric and algebraic content. Matched triads, with 558 students in each group, were used. No significant differences in validity were found among the tests.

20.
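Based on design principles and models for vocabulary testing, this study developed a vocabulary size test. Two rounds of pilot testing were conducted, and SPSS 18.0 was used to screen and revise the items, yielding a final vocabulary size test of 104 items. Reliability and validity checks showed that the internal-consistency reliability (Cronbach's coefficient, 0.918), the test-retest reliability (0.644, p = 0.000), the criterion-group validity (t = 6.358, p = 0.000), and the construct validity (correlations among level scores of 0.068 to 0.496, and between level scores and the total score of 0.294 to 0.812) all met psychometric requirements. The test can serve as an effective tool for assessing the vocabulary size of non-English-major undergraduates under the new curriculum reform.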
