Similar literature
1.
Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions containing both images and answer lists. Subjects were randomly assigned to one version. Answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross-sectional images resulted in greater item difficulty and item discrimination, compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores are dependent on the type of image used. Results from the two cross-sectional images suggest an extra ability is being tested. Data from a scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context-dependent interaction takes place with the content of questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items. Anat Sci Educ © 2012 American Association of Anatomists.
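
The two statistics compared in this abstract can be illustrated with a minimal sketch. It assumes the classical test theory definitions (difficulty as the proportion of correct answers, discrimination as the point-biserial correlation between item score and total score); the abstract does not say which indices the authors used, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def item_difficulty(item_scores):
    """Classical difficulty index: proportion of examinees scoring 1 on the item.
    A higher value means an easier item."""
    return mean(item_scores)

def item_discrimination(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item score and the total score.
    Assumption: the total is used as-is (the item is not removed from it)."""
    mi, mt = mean(item_scores), mean(total_scores)
    si, st = pstdev(item_scores), pstdev(total_scores)
    if si == 0 or st == 0:
        return float("nan")  # undefined when either variable has no variance
    cov = mean([(i - mi) * (t - mt) for i, t in zip(item_scores, total_scores)])
    return cov / (si * st)

# Hypothetical data: four examinees, one item, and their total scores.
item = [1, 1, 0, 0]
total = [4, 3, 2, 1]
difficulty = item_difficulty(item)
discrimination = item_discrimination(item, total)
```

Under these definitions, "greater item difficulty" in the abstract corresponds to a lower proportion correct.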

2.
In common-item equating the anchor block is generally built to represent a miniature form of the total test in terms of content and statistical specifications. The statistical properties frequently reflect equal mean and spread of item difficulty. Sinharay and Holland (2007) suggested that the requirement for equal spread of difficulty may be too restrictive. They suggested that an anchor test with representative content coverage and equal mean item difficulty but a smaller spread of item difficulty (miditest) may provide the same or better results for equating while decreasing the pressure to find very hard and very easy items to include in the anchor. Analyses to date have concentrated on the results of equating the scores from one form to another with findings that are supportive of the Sinharay and Holland concept (Sinharay & Holland, 2006a, 2006b, 2007; Liu, Sinharay, Holland, Feigenbaum, & Curley, 2009). These studies do not address longer chains of equating. It is important to monitor the possibility of scale drift over forms. The current research begins to address this issue.

3.
Studies that have investigated differences in examinee performance on items administered in paper-and-pencil form or on a computer screen have produced equivocal results. Certain item administration procedures were hypothesized to be among the most important variables causing differences in item performance and ultimately in test scores obtained from these different administration media. A study where these item administration procedures were made as identical as possible for each presentation medium is described. In addition, a methodology is presented for studying the difficulty and discrimination of items under each presentation medium as a post hoc procedure.

4.
Assessment items are commonly field tested prior to operational use to observe statistical item properties such as difficulty. Item parameter estimates from field testing may be used to assign scores via pre-equating or computer adaptive designs. This study examined differences between item difficulty estimates based on field test and operational data and the relationship of such differences to item position changes and student proficiency estimates. Item position effects were observed for 20 assessments, with items in later positions tending to be more difficult. Moreover, field test estimates of item difficulty were biased slightly upward, which may indicate examinee knowledge of which items were being field tested. Nevertheless, errors in field test item difficulty estimates had negligible impacts on student proficiency estimates for most assessments. Caution is still warranted when using field test statistics for scoring, and testing programs should conduct investigations to determine whether the effects on scoring are inconsequential.

5.
Biased test items were intentionally imbedded within a set of test items, and the resulting instrument was administered to large samples of blacks and whites. Three popular item bias detection procedures were then applied to the data: (1) the three-parameter item characteristic curve procedure, (2) the chi-square method, and (3) the transformed item difficulty approach. The three-parameter item characteristic curve procedure proved most effective at detecting the intentionally biased test items; and the chi-square method was viewed as the best alternative. The transformed item difficulty approach has certain limitations yet represents a practical alternative if sample size, lack of computer facilities, or the like preclude the use of the other two procedures.
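
The transformed item difficulty approach named in this abstract can be sketched as follows. The abstract gives no formulas, so this is an assumption-laden illustration of the delta-plot idea commonly associated with the method: proportions correct are converted to the delta scale (mean 13, standard deviation 4), the two groups' deltas are plotted against each other, and items far from the major axis of the scatter are candidates for bias. The function names and any flagging threshold are hypothetical.

```python
from statistics import NormalDist, mean

def delta(p_correct):
    """Delta-scale difficulty: 13 - 4 * z, where z is the normal deviate of the
    proportion correct. Harder items (lower p) get larger deltas."""
    if not 0.0 < p_correct < 1.0:
        raise ValueError("p_correct must lie strictly between 0 and 1")
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)

def distances_from_major_axis(x, y):
    """Signed perpendicular distance of each (x, y) point from the major
    (principal) axis of the scatter. x and y are per-item deltas for two groups."""
    mx, my = mean(x), mean(y)
    sxx = mean([(a - mx) ** 2 for a in x])
    syy = mean([(b - my) ** 2 for b in y])
    sxy = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    if sxy == 0:
        raise ValueError("major axis slope is undefined when the covariance is zero")
    slope = (syy - sxx + ((syy - sxx) ** 2 + 4 * sxy ** 2) ** 0.5) / (2 * sxy)
    intercept = my - slope * mx
    norm = (slope ** 2 + 1) ** 0.5
    return [(slope * a + intercept - b) / norm for a, b in zip(x, y)]

# Hypothetical proportions correct for three items in two groups.
group_a = [delta(p) for p in (0.80, 0.60, 0.40)]
group_b = [delta(p) for p in (0.75, 0.50, 0.20)]
distances = distances_from_major_axis(group_a, group_b)
```

Items whose absolute distance exceeds a chosen cutoff would be reviewed; the cutoff is a judgment call and is not specified in the abstract.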

6.
The effects of training tests on subsequent achievement were studied using two test item characteristics: item difficulty and item complexity. Ninety Ss were randomly assigned to treatment conditions having easy or difficult items and calling for rote or complex skills. Each S was administered two training tests during the quarter containing only items defined by his treatment condition. The dependent measure was a sixty-item final examination with fifteen items reflecting each of the four treatment condition item types. The results showed greater achievement for those trained with difficult items and with rote items. In addition, two interactions of treatment conditions with type of test items were found. The results are discussed as supporting a hierarchical model rather than a “similarity” transfer model of learning.

7.
Two experiments were conducted to determine if a relationship exists between test item arrangements and student performance on power tests. The primary hypotheses were: item arrangements based upon item difficulty, similarity of content, or order of class presentation do not influence test score or required testing time. In the first experiment 122 subjects were randomly assigned to three item difficulty arrangements of 139 test items with a 0–100% difficulty range, and in the second experiment 156 subjects were randomly assigned to three item content arrangements of 103 items. Results of analyses of variance with test anxiety used as a classification factor supported the hypotheses.

8.
In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) of Black versus matched White examinees for four verbal item types (analogies, antonyms, sentence completions, reading comprehension) using 13 GRE-disclosed forms (988 verbal items) and 11 SAT-disclosed forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both GRE and SAT. The most important finding indicates that for hard items, Black examinees perform differentially better than matched ability White examinees for each of the four item types and for both the GRE and SAT tests! The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored.

9.
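Difficulty is not an inherent property of a test item but the result of an interaction between examinee factors and item characteristics. Many item analysts tend to attribute high item difficulty solely to students' failure to master the relevant knowledge or skills, overlooking the characteristics of the items themselves. This study analyzes 60 English items from the National College Entrance Examination (gaokao) with difficulty values below 0.6 to explore the sources of their difficulty. The results show that, besides examinee factors, the difficulty of hard or moderately hard items is also related to item-writing technique, including problems with the uniqueness and acceptability of the answer, content that goes beyond the syllabus, and inappropriate choice of tested points or scoring criteria. The study therefore proposes that testing agencies improve their item writing, strengthen item quality monitoring, and ensure that large-scale examinations select candidates on a sound scientific basis.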

10.
Some cognitive characteristics of graph comprehension items were studied, and a model comprised of several variables was developed. 132 graph items of the Psychometric Entrance Test were included in the study. By analyzing the actual difficulty of the items, an evaluation of the impact of the cognitive variables on item difficulties could be made. Results indicate that successful prediction of item difficulty can be calculated on the basis of a wide range of item characteristics and task demands. This suggests that items can be screened for processing difficulty prior to being administered to examinees. However, the results also have implications for test validity in that the various processing variables identified involve distinct ability dimensions.

11.
This article used the multidimensional random coefficients multinomial logit model to examine the construct validity and detect the substantial differential item functioning (DIF) of the Chinese version of motivated strategies for learning questionnaire (MSLQ-CV). A total of 1,354 Hong Kong junior high school students were administered the MSLQ-CV. Partial credit model was suggested to have a better goodness of fit than that of the rating scale model. Five items with substantial gender or grade DIF were removed from the questionnaire, and the correlations between the subscales indicated that factors of cognitive strategy use and self-regulation had a very high correlation which resulted in a possible combination of the two factors. The test reliability analysis showed that the subscale of test anxiety had a lower reliability compared with the other factors. Finally, the item difficulty and step parameters for the modified 39-item questionnaire were displayed. The order of the step difficulty estimates for some items implied that some grouping of categories might be required in the case of overlapping. Based on these findings, the directions for future research were discussed.

12.
It is standard practice to arrange items in objective tests in order of increasing difficulty, on the assumption that such an arrangement increases student motivation and produces more reliable tests. The validity of this assumption was investigated in the context of a multiple-choice chemistry test. Fifty items were arranged in three sequences of difficulty: random (R), easy-to-hard (E-H) and hard-to-easy (H-E). The mean test score was significantly higher for the test sequenced E-H than for the test sequenced H-E. Item difficulty index was raised by placement of the easier items toward the beginning of the test and lowered by placement of these items toward the end of the test. Test reliability was largely independent of item sequence.

13.
This work examines the hypothesis that the arrangement of items according to increasing difficulty is the real source of what is considered the item-position effect. A confusion of the 2 effects is possible because in achievement measures the items are arranged according to their difficulty. Two item subsets of Raven’s Advanced Progressive Matrices (APM), one following the original item order, and the other one including randomly ordered items, were applied to a sample of 266 students. Confirmatory factor analysis models including representations of both the item-position effect and a possible effect due to increasing item difficulty were compared. The results provided evidence for both effects. Furthermore, they indicated a substantial relation between the item-position effects of the 2 APM subsets, whereas no relation was found for item difficulty. This indicates that the item-position effect stands on its own and is not due to increasing item difficulty.

14.
Using a technique that controlled exposure of items, the investigator examined the effect on mean test score, item difficulty index, and reliability and validity coefficients of the reordering of items within a power test containing ten letter-series-completion items. The results suggest that effects on test statistics from item rearrangement are, generally, minimal. The implication of these findings for test designs involving an item sampling procedure is that performance on an item is minimally influenced by the context in which it occurs.

15.
The analytically derived asymptotic standard errors (SEs) of maximum likelihood (ML) item estimates can be approximated by a mathematical function without examinees' responses to test items, and the empirically determined SEs of marginal maximum likelihood estimation (MMLE)/Bayesian item estimates can be obtained when the same set of items is repeatedly estimated from the simulation (or resampling) test data. The latter method will result in rather stable and accurate SE estimates as the number of replications increases, but requires cumbersome and time-consuming calculations. Instead of using the empirically determined method, the adequacy of using the analytical-based method in predicting the SEs for item parameter estimates was examined by comparing results produced from both approaches. The results indicated that the SEs yielded from both approaches were, in most cases, very similar, especially when they were applied to a generalized partial credit model. This finding encourages test practitioners and researchers to apply the analytically asymptotic SEs of item estimates to the context of item-linking studies, as well as to the method of quantifying the SEs of equating scores for the item response theory (IRT) true-score method. Three-dimensional graphical presentation for the analytical SEs of item estimates as the bivariate function of item difficulty together with item discrimination was also provided for a better understanding of several frequently used IRT models.

16.
Applied Measurement in Education, 2013, 26(1): 35-48
This study investigated several current coaching practices used in training test-wiseness for analogy items in standardized test batteries. A three-group design was used which included a general test-taking, "encouragement" condition in addition to a no-training control group condition. The specific techniques used in training are described. Scholastic Aptitude Test (SAT) scores were obtained from university admission files to verify that no overall aptitude differences existed in the three conditions. Differences were observed for the coached group relative to the two control groups in terms of overall number of correct responses for the coached item types (analogies). No differences were found for the non-coached item types. Item difficulties for the three groups are also reported which show that several items were indeed made easier for individuals in the coached group. A qualitative analysis of the items made easier by coaching in terms of the training techniques used is given along with an analysis of the items that did not respond to coaching. Finally, a discussion of potentially flawed item types and item characteristics and suggestions for dealing with such flaws are given.

17.
In actual test development practice, the number of test items that must be developed and pretested is typically greater, and sometimes much greater, than the number that is eventually judged suitable for use in operational test forms. This has proven to be especially true for one item type, analytical reasoning, that currently forms the bulk of the analytical ability measure of the GRE General Test. This study involved coding the content characteristics of some 1,400 GRE analytical reasoning items. These characteristics were correlated with indices of item difficulty and discrimination. Several item characteristics were predictive of the difficulty of analytical reasoning items. Generally, these same variables also predicted item discrimination, but to a lesser degree. The results suggest several content characteristics that could be considered in extending the current specifications for analytical reasoning items. The use of these item features may also contribute to greater efficiency in developing such items. Finally, the influence of these various characteristics also provides a better understanding of the construct validity of the analytical reasoning item type.

18.
We investigated students' metacognitive experiences with regard to feelings of difficulty (FD), feelings of satisfaction (FS), and estimate of effort (EE), employing either computerized adaptive testing (CAT) or computerized fixed item testing (FIT). In an experimental approach, 174 students in grades 10 to 13 were tested either with a CAT or a FIT version of a matrices test. Data revealed that metacognitive experiences were not related to the resulting test scores for CAT: test takers who took the matrices test in an adaptive mode were paradoxically more satisfied with their performance the worse they had performed in terms of the resulting ability parameter. They also rated the test as easier the lower they had performed, but their estimates of effort were higher the better they had performed. For test takers who took the FIT version, completely different results were revealed. In line with previous results, test takers were supposed to base these experiences on the subjectively estimated percentage of items solved. This moderated mediation hypothesis was in parts confirmed, as the relation between the percentage of items solved and FD, FS, and EE was revealed to be mediated by the estimated percentage of items solved. Results are discussed with reference to feedback acceptance, errant self-estimations, and test fairness with regard to a possible false regulation of effort in lower ability groups when using CAT.

19.
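This paper examines whether different test methods, namely multiple-choice and information-transfer items, produce a test method effect in reading comprehension tests. In addition to analyzing students' test scores, the study also analyzes item difficulty values, which were estimated using Item Response Theory. The results show that the test method does affect item difficulty and examinee performance; in terms of item difficulty, information-transfer items are harder than multiple-choice items.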

20.
Based on a previously validated cognitive processing model of reading comprehension, this study experimentally examines potential generative components of text-based multiple-choice reading comprehension test questions. Previous research (Embretson & Wetzel, 1987; Gorin & Embretson, 2005; Sheehan & Ginther, 2001) shows text encoding and decision processes account for significant proportions of variance in item difficulties. In the current study, Linear Logistic Latent Trait Model (LLTM; Fischer, 1973) parameter estimates of experimentally manipulated items are examined to further verify the impact of encoding and decision processes on item difficulty. Results show that manipulation of some passage features, such as increased use of negative wording, significantly increases item difficulty in some cases, whereas others, such as altering the order of information presentation in a passage, did not significantly affect item difficulty, but did affect reaction time. These results suggest that reliable changes in difficulty and response time through algorithmic manipulation of certain task features are feasible. However, non-significant results for several manipulations highlight potential challenges to item generation in establishing direct links between theoretically relevant item features and individual item processing. Further examination of these relationships will be informative to item writers as well as test developers interested in the feasibility of item generation as an assessment tool.
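
As a rough illustration of the LLTM idea referred to in this abstract: item difficulty is modeled as a weighted sum of item features, and that difficulty then enters a Rasch-type response function. The sketch below assumes the standard form (difficulty = sum of feature weights plus a normalization constant); the feature names and weight values are hypothetical and are not taken from the study.

```python
import math

def lltm_difficulty(q_row, eta, c=0.0):
    """Item difficulty as a linear combination of feature weights.
    q_row: how often each feature occurs in the item (one row of the Q-matrix).
    eta:   difficulty contribution of each feature (the basic parameters).
    c:     normalization constant."""
    return sum(q * e for q, e in zip(q_row, eta)) + c

def p_correct(theta, beta):
    """Rasch probability that a person with ability theta solves an item of difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# Hypothetical features: [text encoding load, negative wording], with made-up weights.
eta = [0.5, 0.8]
original_item = lltm_difficulty([1, 0], eta)      # no negative wording
manipulated_item = lltm_difficulty([1, 1], eta)   # negative wording added

p_original = p_correct(0.0, original_item)
p_manipulated = p_correct(0.0, manipulated_item)
```

In this toy setup, adding the negatively worded feature raises the modeled difficulty and lowers the probability of a correct response for the same examinee, which is the kind of effect the manipulations in the study were designed to test.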
