Similar Articles
20 similar articles found (search time: 31 ms)
1.
ABSTRACT

The understanding of what makes a question difficult is a crucial concern in assessment. To study the difficulty of test questions, we focus on the case of PISA, which assesses to what degree 15-year-old students have acquired the knowledge and skills essential for full participation in society. Our research question is to identify PISA science item characteristics that could influence an item’s proficiency level. It is based on an a priori item analysis and a statistical analysis. Results show that, of the different characteristics of PISA science items determined in our a priori analysis, only cognitive complexity and item format have explanatory power for an item’s proficiency level. The proficiency level cannot be explained by the competence assessed or by the dependence/independence of the information provided in the unit and/or item introduction. We conclude that in PISA it appears possible to anticipate a high proficiency level, that is, students’ low scores, for items displaying high cognitive complexity. For items of middle or low cognitive complexity, the cognitive complexity level alone is not sufficient to predict item difficulty; other characteristics play a crucial role. We discuss the anticipation of difficulty in assessment from a broader perspective.

2.
Item positions in educational assessments are often randomized across students to prevent cheating. However, if altering item positions significantly affects students’ performance, it may threaten the validity of test scores. Two widely used approaches for detecting position effects – logistic regression and hierarchical generalized linear modeling – are often inconvenient for researchers and practitioners due to technical and practical limitations. Therefore, this study introduced a structural equation modeling (SEM) approach for examining item and testlet position effects. The SEM approach was demonstrated using data from a computer-based alternate assessment designed for students with cognitive disabilities from three grade bands (3–5, 6–8, and high school). Item and testlet position effects were investigated in the field-test (FT) items that each student received at different positions. Results indicated that the difficulty of some FT items in grade bands 3–5 and 6–8 differed depending on the positions of the items on the test. Also, the overall difficulty of the field-test task in grade band 6–8 increased as students responded to the field-test task in later positions. The SEM approach provides a flexible method for examining different types of position effects.
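As background to the position-effect methods compared above, a minimal sketch of the conventional logistic-regression check is shown below. The long-format data layout, column names, and file name are illustrative assumptions; this is the baseline approach the abstract mentions, not the SEM method the study proposes.

```python
# Minimal sketch (assumed data layout): checking whether an item's position
# predicts the probability of a correct response once ability is controlled for.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format file: one row per student-item administration,
# with columns "correct" (0/1), "ability" (e.g., rest score or theta estimate),
# and "position" (where the item appeared on that student's form).
responses = pd.read_csv("responses_long.csv")

# A significant coefficient on "position" suggests the item behaves
# differently when administered earlier vs. later in the test.
model = smf.logit("correct ~ ability + position", data=responses).fit()
print(model.summary())
```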

3.
The Latin Square Task (LST) was developed by Birney, Halford, and Andrews [Birney, D. P., Halford, G. S., & Andrews, G. (2006). Measuring the influence of cognitive complexity on relational reasoning: The development of the Latin Square Task. Educational and Psychological Measurement, 66, 146–171] and represents a non-domain-specific, language-free operationalization of Relational Complexity (RC) Theory. The current study investigates the basic cognitive parameters and structure of the LST as defined by RC Theory, using IRT-based linear logistic test models (LLTM). A total of 850 German school students completed 26 systematically designed LST items. Results support Rasch scalability. LLTM analyses reveal that both operation complexity and the number of operations affect item difficulty. It is shown how the LLTM and its variants can provide substantial insight into cognitive solution processes and the composition of item difficulty in relational reasoning, making item construction more efficient.
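For readers unfamiliar with the LLTM used above, its standard general form (not necessarily this study's exact parameterization) keeps the Rasch response function but decomposes each item's difficulty into a weighted sum of basic cognitive parameters, here the complexity and number of operations:

```latex
% Standard LLTM form: Rasch model with a linear decomposition of item difficulty.
P(X_{vi} = 1 \mid \theta_v) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)},
\qquad
\beta_i = \sum_{k=1}^{K} q_{ik}\,\eta_k + c
% \theta_v: ability of person v;  \beta_i: difficulty of item i;
% q_{ik}: design-matrix weight of cognitive operation k in item i;
% \eta_k: basic parameter (difficulty contribution) of operation k;  c: normalization constant.
```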

4.
Science education needs valid, authentic, and efficient assessments. Many typical science assessments primarily measure recall of isolated information. This paper reports on the validation of assessments that measure knowledge integration ability among middle school and high school students. The assessments were administered to 18,729 students in five states. Rasch analyses of the assessments demonstrated satisfactory item fit, item difficulty, test reliability, and person reliability. The study showed that, when appropriately designed, knowledge integration assessments can be balanced between validity and reliability, authenticity and generalizability, and instructional sensitivity and technical quality. Results also showed that, when paired with multiple-choice items and scored with an effective scoring rubric, constructed-response items can achieve high reliabilities. Analyses showed that English language learner status and computer use significantly impacted students' science knowledge integration abilities. Students who took the assessment online, which matched the format of content delivery, performed significantly better than students who took the paper-and-pencil version. Implications and future directions of research are noted, including refining curriculum materials to meet the needs of diverse students and expanding the range of topics measured by knowledge integration assessments. © 2011 Wiley Periodicals, Inc. J Res Sci Teach 48: 1079–1107, 2011

5.
The present study investigates the degree to which item "bias" techniques can lead to interpretable results when groups are defined in terms of specified differences in the cognitive processes involved in students' problem-solving strategies. A large group of junior high school students who took a test on subtraction of fractions was divided into two subgroups judged by the rule-space model to be using different problem-solving strategies. It was confirmed by use of Mantel-Haenszel (MH) statistics that these subgroups showed different performances on items with different underlying cognitive tasks. We note that, in our case, we are far from faulting items that show differential item functioning (DIF) between two groups defined in terms of different solution strategies. Indeed, they are "desirable" items, as explained in the discussion section.
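For reference, the Mantel-Haenszel procedure mentioned above compares the odds of a correct response for the two subgroups within strata of matched total score. A minimal sketch follows; the data-frame layout, column names, and the ETS delta conversion are standard conventions rather than this study's specific implementation.

```python
# Sketch of the Mantel-Haenszel DIF statistic for one studied item,
# with examinees stratified by a matching total score.
import numpy as np
import pandas as pd

def mantel_haenszel_dif(df: pd.DataFrame):
    """df columns (assumed): 'score_level', 'group' ('ref'/'focal'), 'correct' (0/1)."""
    num, den = 0.0, 0.0
    for _, stratum in df.groupby("score_level"):
        n = len(stratum)
        ref = stratum[stratum.group == "ref"]
        foc = stratum[stratum.group == "focal"]
        a = (ref.correct == 1).sum()   # reference group correct
        b = (ref.correct == 0).sum()   # reference group incorrect
        c = (foc.correct == 1).sum()   # focal group correct
        d = (foc.correct == 0).sum()   # focal group incorrect
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den                        # MH common odds ratio
    return alpha_mh, -2.35 * np.log(alpha_mh)   # MH D-DIF on the ETS delta scale
```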

6.
ABSTRACT

Construct-irrelevant cognitive complexity of some items in statewide grade-level assessments may impose performance barriers for students with disabilities who are ineligible for alternate assessments based on alternate achievement standards. This has spurred research into whether items can be modified to reduce complexity without affecting the construct being measured. This study uses a generalized linear mixed modeling analysis to investigate the effects of item modifications on improving test accessibility by reducing construct-irrelevant cognitive barriers for persistently low-performing fifth-grade students with cognitive disabilities. The results showed that item scaffolding was an effective modification for both mathematics and reading. Other modifications, such as bolding/underlining of key words, hindered test performance for low-performing students. We discuss the findings’ potential impact on test development with universal design.

7.
Some cognitive characteristics of graph comprehension items were studied, and a model comprising several variables was developed. A total of 132 graph items from the Psychometric Entrance Test were included in the study. Analyzing the actual difficulty of the items allowed the impact of the cognitive variables on item difficulty to be evaluated. Results indicate that item difficulty can be successfully predicted on the basis of a wide range of item characteristics and task demands. This suggests that items can be screened for processing difficulty prior to being administered to examinees. However, the results also have implications for test validity, in that the various processing variables identified involve distinct ability dimensions.
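A sketch of how such a difficulty-prediction model might be fit is shown below. The predictor names and the data file are hypothetical placeholders, since the abstract does not list the specific cognitive variables; the point is simply regressing empirical item difficulty on a priori codings of task demands.

```python
# Illustrative sketch (hypothetical feature names): predicting empirical item
# difficulty from coded cognitive characteristics of graph items.
import pandas as pd
import statsmodels.formula.api as smf

# Assumed data: one row per item, with its empirical difficulty
# (e.g., proportion correct) and a priori codings of task demands.
items = pd.read_csv("graph_items.csv")

model = smf.ols(
    "difficulty ~ n_graph_variables + requires_computation + legend_complexity",
    data=items,
).fit()
print(model.rsquared)   # how much of the difficulty variance the codings explain
print(model.params)     # which task demands raise or lower difficulty
```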

8.
In recent years, students’ test scores have been used to evaluate teachers’ performance. The assumption underlying this practice is that students’ test performance reflects teachers’ instruction. However, this assumption is generally not empirically tested. In this study, we examine the effect of teachers’ instruction on test performance at the item level using a hierarchical differential item functioning approach. The items are from the U.S. TIMSS 2011 4th-grade math test. Specifically, we tested whether students who had received instruction on a given item performed significantly better on that item than students who had not received such instruction, controlling for overall math ability, both with and without student-level and class-level covariates. This study provides preliminary findings regarding why some items show instructional sensitivity and sheds light on how to develop instructionally sensitive items. Implications and directions for further research are also discussed.

9.
Contrasts between constructed-response items and their multiple-choice counterparts have yielded only a few weak generalizations. Such contrasts have typically been based on the statistical properties of groups of items, an approach that masks differences in properties at the item level and may lead to inaccurate conclusions. In this article, we examine item-level differences between a certain type of constructed-response item (called figural response) and comparable multiple-choice items in the domain of architecture. Our data show that, in comparing the two item formats, item-level differences in difficulty correspond to differences in cognitive processing requirements, and that relations between processing requirements and psychometric properties are systematic. These findings illuminate one aspect of construct validity that is frequently neglected in comparing item types, namely the cognitive demand of test items.

10.
In today's higher education, high-quality assessments play an important role. Little is known, however, about the degree to which assessments are correctly aimed at the students’ levels of competence in relation to the defined learning goals. This article reviews previous research into teachers’ and students’ perceptions of item difficulty. It focuses on the item difficulty of assessments and on students’ and teachers’ ability to estimate item difficulty correctly. The review indicates that teachers tend to overestimate the difficulty of easy items and underestimate the difficulty of difficult items. Students seem to be better estimators of item difficulty. The accuracy of the estimates can be improved by giving estimators information about the target group and its earlier assessment results; by defining the target group before the estimation process; by allowing discussion of the defined target group of students and the corresponding standards during the estimation process; and by training in item construction and estimation. In the subsequent study, the ability of teachers and students to estimate the difficulty levels of assessment items accurately was examined. Results show that, in higher education, teachers are able to estimate the difficulty level correctly for only a small proportion of the assessment items; they overestimate the difficulty of most items. Students, on the other hand, underestimate their own performance. In addition, the relationships between students’ perceptions of the difficulty levels of the assessment items and their performance on the assessments were investigated. Results provide evidence that the students who performed best on the assessments underestimated their performance the most. Several explanations are discussed and suggestions for additional research are offered.

11.
This article used the multidimensional random coefficients multinomial logit model to examine the construct validity of, and to detect substantial differential item functioning (DIF) in, the Chinese version of the Motivated Strategies for Learning Questionnaire (MSLQ-CV). A total of 1,354 Hong Kong junior high school students were administered the MSLQ-CV. The partial credit model showed better goodness of fit than the rating scale model. Five items with substantial gender or grade DIF were removed from the questionnaire, and the correlations between the subscales indicated that the cognitive strategy use and self-regulation factors were very highly correlated, suggesting that the two factors might be combined. The reliability analysis showed that the test anxiety subscale had lower reliability than the other factors. Finally, the item difficulty and step parameters for the modified 39-item questionnaire were reported. The ordering of the step difficulty estimates for some items implied that overlapping categories might need to be grouped. Based on these findings, directions for future research are discussed.
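As background to the step parameters reported above, the partial credit model that the study retained defines the probability of scoring in category x of a polytomous item as a function of person ability and the item's step difficulties (general unidimensional form, not the multidimensional parameterization used in the article):

```latex
% Partial credit model for item i with categories x = 0, 1, ..., m_i.
P(X_{vi} = x \mid \theta_v) =
  \frac{\exp\!\left(\sum_{j=1}^{x} (\theta_v - \delta_{ij})\right)}
       {\sum_{h=0}^{m_i} \exp\!\left(\sum_{j=1}^{h} (\theta_v - \delta_{ij})\right)},
\qquad \sum_{j=1}^{0}(\cdot) \equiv 0
% \theta_v: ability of person v;  \delta_{ij}: j-th step difficulty of item i.
% Disordered \delta_{ij} estimates are the "ordering" issue that motivates
% collapsing overlapping response categories.
```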

12.
Test items undergo multiple iterations of review before states and vendors deem them acceptable for placement in a live statewide assessment. This article reviews three approaches that can add validity evidence to states' item review processes. The first is a structured sensitivity review process that focuses on universal design considerations for items. The second is a series of statistical analyses intended to increase the limited amount of information that can be derived from analyses of low-incidence populations (such as students who are blind, deaf, or have cognitive disabilities). Finally, think-aloud methods are described as a way of understanding why particular items might be problematic for students.

13.
Based on a previously validated cognitive processing model of reading comprehension, this study experimentally examines potential generative components of text-based multiple-choice reading comprehension test questions. Previous research (Embretson & Wetzel, 1987; Gorin & Embretson, 2005; Sheehan & Ginther, 2001) shows that text encoding and decision processes account for significant proportions of variance in item difficulties. In the current study, Linear Logistic Latent Trait Model (LLTM; Fischer, 1973) parameter estimates of experimentally manipulated items are examined to further verify the impact of encoding and decision processes on item difficulty. Results show that manipulating some passage features, such as increasing the use of negative wording, significantly increased item difficulty in some cases, whereas other manipulations, such as altering the order of information presentation in a passage, did not significantly affect item difficulty but did affect reaction time. These results suggest that reliably changing difficulty and response time through algorithmic manipulation of certain task features is feasible. However, non-significant results for several manipulations highlight potential challenges for item generation in establishing direct links between theoretically relevant item features and individual item processing. Further examination of these relationships will be informative to item writers as well as test developers interested in the feasibility of item generation as an assessment tool.

14.
Although multiple-choice examinations are often used to test anatomical knowledge, they often forgo the use of images in favor of text-based questions and answers. Because anatomy is reliant on visual resources, examinations using images should be used when appropriate. This study was a retrospective analysis of examination items that were text based compared with the same questions when a reference image was included with the question stem. Item difficulty and discrimination were analyzed for 15 multiple-choice items given across two different examinations in two sections of an undergraduate anatomy course. Results showed that there were some differences in item difficulty, but these were not consistently associated with either the text-only items or the items with reference images. Differences in difficulty were mainly attributable to one group of students performing better overall on the examinations. There were no significant differences in item discrimination for any of the analyzed items. This implies that reference images do not significantly alter the item statistics; however, it does not indicate whether these images were helpful to the students when answering the questions. Care should be taken by question writers to analyze item statistics when making changes to multiple-choice questions, including changes made for the perceived benefit of the students. Anat Sci Educ 10: 68–78. © 2016 American Association of Anatomists.
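The item difficulty and discrimination indices analyzed in the study above are standard classical test theory quantities. A minimal sketch of how they are commonly computed is given below; the 0/1 score-matrix layout is an assumption, and the study's exact discrimination index may differ (a corrected point-biserial is used here).

```python
# Classical test theory sketch: item difficulty (proportion correct) and
# item discrimination (corrected point-biserial against the rest score).
import numpy as np

def ctt_item_stats(scores: np.ndarray):
    """scores: examinees x items matrix of 0/1 responses (assumed layout)."""
    difficulty = scores.mean(axis=0)            # p-value of each item
    total = scores.sum(axis=1)
    discrimination = np.empty(scores.shape[1])
    for i in range(scores.shape[1]):
        rest = total - scores[:, i]             # total score excluding the item itself
        discrimination[i] = np.corrcoef(scores[:, i], rest)[0, 1]
    return difficulty, discrimination
```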

15.
Assessment items are commonly field tested prior to operational use to observe statistical item properties such as difficulty. Item parameter estimates from field testing may be used to assign scores via pre-equating or computer adaptive designs. This study examined differences between item difficulty estimates based on field test and operational data and the relationship of such differences to item position changes and student proficiency estimates. Item position effects were observed for 20 assessments, with items in later positions tending to be more difficult. Moreover, field test estimates of item difficulty were biased slightly upward, which may indicate examinee knowledge of which items were being field tested. Nevertheless, errors in field test item difficulty estimates had negligible impacts on student proficiency estimates for most assessments. Caution is still warranted when using field test statistics for scoring, and testing programs should conduct investigations to determine whether the effects on scoring are inconsequential.

16.
The use of accommodations has been widely proposed as a means of including English language learners (ELLs) or limited English proficient (LEP) students in state and districtwide assessments. However, very little experimental research has been done on specific accommodations to determine whether these pose a threat to score comparability. This study examined the effects of linguistic simplification of 4th- and 6th-grade science test items on a state assessment. At each grade level, 4 experimental 10-item testlets were included on operational forms of a statewide science assessment. Two testlets contained regular field-test items, but in a linguistically simplified condition. The testlets were randomly assigned to LEP and non-LEP students through the spiraling of test booklets. For non-LEP students, in 4 t-test analyses of the differences in means for each corresponding testlet, 3 of the mean score comparisons were not significantly different, and the 4th showed the regular version to be slightly easier than the simplified version. Analysis of variance (ANOVA), followed by pairwise comparisons of the testlets, showed no significant differences in the scores of non-LEP students across the 2 item types. Among the 40 items administered in both regular and simplified format, item difficulty did not vary consistently in favor of either format. Qualitative analyses of items that displayed significant differences in p values were not informative, because the differences were typically very small. For LEP students, there was 1 significant difference in student means, and it favored the regular version. However, because the study was conducted in a state with a small number of LEP students, the analyses of LEP student responses lacked statistical power. The results of this study show that linguistic simplification is not helpful to monolingual English-speaking students who receive the accommodation. Therefore, the results provide evidence that linguistic simplification is not a threat to the comparability of scores of LEP and monolingual English-speaking students when offered as an accommodation to LEP students. The study findings may also have implications for the use of linguistic simplification accommodations in science assessments in other states and in content areas other than science.

17.
Traditional item analyses such as classical test theory (CTT) use exam-taker responses to assessment items to approximate their difficulty and discrimination. The increased adoption by educational institutions of electronic assessment platforms (EAPs) provides new avenues for assessment analytics by capturing detailed logs of an exam-taker's journey through their exam. This paper explores how logs created by EAPs can be employed alongside exam-taker responses and CTT to gain deeper insights into exam items. In particular, we propose an approach for deriving features from exam logs for approximating item difficulty and discrimination based on exam-taker behaviour during an exam. Items for which difficulty and discrimination differ significantly between CTT analysis and our approach are flagged through outlier detection for independent academic review. We demonstrate our approach by analysing de-identified exam logs and responses to assessment items of 463 medical students enrolled in a first-year biomedical sciences course. The analysis shows that the number of times an exam-taker visits an item before selecting a final response is a strong indicator of an item's difficulty and discrimination. Scrutiny by the course instructor of the seven items identified as outliers suggests our log-based analysis can provide insights beyond what is captured by traditional item analyses.

Practitioner notes

What is already known about this topic
  • Traditional item analysis is based on exam-taker responses to the items using mathematical and statistical models from classical test theory (CTT). The difficulty and discrimination indices thus calculated can be used to determine the effectiveness of each item and consequently the reliability of the entire exam.
What this paper adds
  • Data extracted from exam logs can be used to identify exam-taker behaviours which complement classical test theory in approximating the difficulty and discrimination of an item and identifying items that may require instructor review.
Implications for practice and/or policy
  • Identifying the behaviours of successful exam-takers may allow us to develop effective exam-taking strategies and personal recommendations for students.
  • Analysing exam logs may also provide an additional tool for identifying struggling students and items in need of revision.
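A rough illustration of the log-based idea described in the abstract above (counting how often an exam-taker visits an item before the final response, then flagging items whose behaviour looks unusual) is sketched below. The event-log schema, column names, and the simple z-score rule are assumptions; the paper flags items by comparing log-based estimates with CTT results rather than by this exact criterion.

```python
# Hypothetical sketch: derive a per-item "mean visits before final response"
# feature from exam logs, then flag items that stand out for instructor review.
import pandas as pd

logs = pd.read_csv("exam_logs.csv")          # assumed de-identified event log

visits = (
    logs[logs.event == "view_item"]
    .groupby(["student", "item"]).size()     # visits per student per item
    .groupby(level="item").mean()            # average visits per item
    .rename("mean_visits")
)

z = (visits - visits.mean()) / visits.std()  # simple outlier rule (illustrative)
flagged = visits[z.abs() > 2]
print(flagged)                               # candidates for independent review
```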

18.
This work examines the hypothesis that the arrangement of items according to increasing difficulty is the real source of what is considered the item-position effect. A confusion of the two effects is possible because, in achievement measures, items are arranged according to their difficulty. Two item subsets of Raven’s Advanced Progressive Matrices (APM), one following the original item order and the other containing randomly ordered items, were administered to a sample of 266 students. Confirmatory factor analysis models including representations of both the item-position effect and a possible effect due to increasing item difficulty were compared. The results provided evidence for both effects. Furthermore, they indicated a substantial relation between the item-position effects of the two APM subsets, whereas no relation was found for item difficulty. This indicates that the item-position effect stands on its own and is not due to increasing item difficulty.

19.
Applied Measurement in Education, 2013, 26(2): 175–199
This study used three different differential item functioning (DIF) detection procedures to examine the extent to which items in a mathematics performance assessment functioned differently for matched gender groups. In addition to examining the appropriateness of individual items in terms of DIF with respect to gender, an attempt was made to identify factors (e.g., content, cognitive processes, differences in ability distributions, etc.) that may be related to DIF. The QUASAR (Quantitative Understanding: Amplifying Student Achievement and Reasoning) Cognitive Assessment Instrument (QCAI) is designed to measure students' mathematical thinking and reasoning skills and consists of open-ended items that require students to show their solution processes and provide explanations for their answers. In this study, 33 polytomously scored items, which were distributed within four test forms, were evaluated with respect to gender-related DIF. The data source was sixth- and seventh-grade student responses to each of the four test forms administered in the spring of 1992 at all six school sites participating in the QUASAR project. The sample consisted of 1,782 students with approximately equal numbers of female and male students. The results indicated that DIF may not be serious for 31 of the 33 items (94%) in the QCAI. For the two items that were detected as functioning differently for male and female students, several plausible factors for DIF were discussed. The results from the secondary analyses, which removed the mutual influence of the two items, indicated that DIF in one item, PPP1, which favored female students rather than their matched male students, was of particular concern. These secondary analyses suggest that the detection of DIF in the other item in the original analysis may have been due to the influence of Item PPP1, because they were both in the same test form.

20.
The van Hiele theory and van Hiele Geometry Test have been extensively used in mathematics assessments across countries. The purpose of this study is to use classical test theory (CTT) and cognitive diagnostic modeling (CDM) frameworks to examine the psychometric properties of the van Hiele Geometry Test and to compare how various classification criteria assign van Hiele levels to students. The findings support the hierarchical property of the van Hiele theory and levels. When conventional and combined criteria were used to determine mastery of a level, the percentages of students classified into an overall level were relatively high. Although some items had aberrant difficulties and low item discrimination, varying the selection of criteria across levels improved item discrimination power, especially for those items with low item discrimination index (IDI) estimates. Based on the findings, we identify items on the van Hiele Geometry Test that might be revised, and we suggest changes to the classification criteria to increase the number of students who can be assigned an overall level of geometric thinking according to the theory. As a result, practitioners and researchers may be better positioned to use the van Hiele Geometry Test for classroom assessment.
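To make the classification criteria discussed above concrete, here is a hypothetical sketch of assigning an overall van Hiele level from per-level subscores. It assumes five items per level and a "3 of 5 correct" mastery rule purely for illustration; the study compares several such criteria, and the actual cutoffs may differ.

```python
# Illustrative sketch (assumed cutoff): assign the highest van Hiele level n such
# that levels 1..n are all mastered; response patterns that skip a level do not
# fit the hierarchy and receive no overall level.
from typing import Optional, Sequence

def van_hiele_level(level_scores: Sequence[int], cutoff: int = 3) -> Optional[int]:
    """level_scores: number correct (0-5) on each level's item block, in order."""
    mastered = [score >= cutoff for score in level_scores]
    level = 0
    for m in mastered:
        if not m:
            break
        level += 1
    if any(mastered[level:]):   # a higher level mastered above a gap
        return None
    return level                # 0 means no level mastered

print(van_hiele_level([5, 4, 3, 1, 0]))  # -> 3
print(van_hiele_level([5, 1, 4, 0, 0]))  # -> None (violates the hierarchy)
```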

