Performance assessments, scenario-based tasks, and other groups of items carry a risk of violating the local item independence assumption made by unidimensional item response theory (IRT) models. Previous studies have identified negative impacts of ignoring such violations, most notably inflated reliability estimates. Still, the influence of this violation on examinee ability estimates has been comparatively neglected. It is known that such item dependencies cause low-ability examinees to have their scores overestimated and high-ability examinees to have their scores underestimated. However, the impact of these biases on examinee classification decisions has been little examined. In addition, because the influence of these dependencies varies along the underlying ability continuum, whether the location of the cut-point matters for correct classification remains unanswered. This simulation study demonstrates that the strength of item dependencies and the location of an examination system's cut-points both influence the accuracy (i.e., the sensitivity and specificity) of examinee classifications. Practical implications of these results are discussed in terms of false positive and false negative classifications of test takers.
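
As an illustration of the accuracy measures named in this abstract, the following minimal sketch computes sensitivity and specificity of pass/fail decisions at a cut-point. The function name, the ability values, and the cut-point are hypothetical and are not taken from the study; the estimated abilities are simply shrunk toward the middle to mimic the reported direction of bias.

```python
def classification_accuracy(true_theta, est_theta, cut):
    """Return (sensitivity, specificity) of pass/fail decisions at `cut`.

    A "positive" is an examinee whose true ability is at or above the cut.
    """
    tp = fn = tn = fp = 0
    for t, e in zip(true_theta, est_theta):
        truly_pass = t >= cut
        est_pass = e >= cut
        if truly_pass and est_pass:
            tp += 1          # correctly passed
        elif truly_pass:
            fn += 1          # false negative: should have passed
        elif est_pass:
            fp += 1          # false positive: should have failed
        else:
            tn += 1          # correctly failed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical data: low abilities are overestimated, high ones underestimated.
true_theta = [-1.5, -0.5, 0.2, 1.0, 2.0]
est_theta = [-1.0, 0.1, 0.1, 0.8, 1.5]
sens, spec = classification_accuracy(true_theta, est_theta, cut=0.0)
print(sens, spec)
```

In this toy case the overestimated low-ability examinee crosses the cut-point and becomes a false positive, so specificity drops while sensitivity is unaffected; moving the cut-point changes which kind of error dominates.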

The Standards for Educational and Psychological Testing indicate that multiple sources of validity evidence should be used to support the interpretation of test scores. In the past decade, examinee response processes, as a source of validity evidence, have received increased attention. However, there have been relatively few methodological studies of the accuracy and consistency of examinee response processes as measured by verbal reports in the context of educational measurement. The objective of the current study was to investigate the accuracy and consistency of examinee response processes, as measured by verbal reports, as a function of varying interviewer and item variables in a think-aloud interview within an educational measurement context. Results indicate that the accuracy of responses may be undermined when students perceive the interviewer to be an expert in the domain. Further, the consistency of response processes may be undermined when items that are too easy or too difficult are used to elicit reports. The implications of these results for conducting think-aloud studies are explored.

Statistics used to detect differential item functioning can also reflect differential strengths and weaknesses in the performance characteristics of population subgroups. In turn, item features associated with the differential performance patterns are likely to reflect some facet of the item task, and hence its difficulty, that might previously have been overlooked. In this study, several item features were identified and coded for a large number of reading comprehension items from two admissions testing programs. Item features included subject matter content, various properties of item structure, cognitive demand indicators, and semantic content (propositional analysis). Differential item functioning was evaluated for males and females and for White and Black examinees. Results showed a number of significant relationships between item features and indicators of differential item functioning, many of which were consistent across testing programs. Implications of the results for related areas of research are discussed.

The investigation sets out to determine whether, in the construction of norm-referenced tests, the effects of sampling errors in the pre-testing procedure outweigh the theoretical advantages to be gained in selecting items with the highest discrimination indices. The procedure for pre-testing and selecting the items is simulated by sampling artificial matrices, each representing the scores of a population of examinees on a domain of items. The results indicate that the sampling errors do not have a significant deleterious effect if the samples in the pre-testing procedure contain more than 50 items and 25 examinees. Moreover, there may be very little to be gained by using larger samples.

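Dance and sports performance have much in common. As society has developed, the influence of dance on sports performance has been multidimensional. The study finds that this influence is mediated mainly by sports performance events such as dance sport, rhythmic gymnastics, synchronized swimming, and figure skating, and that it is characterized by three-dimensionality, multiple cycles, and artistry.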

Computer-based educational assessments often include items that involve drag-and-drop responses. There are different ways that drag-and-drop items can be laid out and different choices that test developers can make when designing these items. Currently, these decisions are based on experts’ professional judgments and design constraints, rather than empirical research, which might threaten the validity of interpretations of test outcomes. To this end, we investigated the effect of drag-and-drop item features on test-taker performance and response strategies with a cognition-centered approach. Four hundred and seventy-six adult participants solved content-equivalent drag-and-drop mathematics items under five design variants. Results showed that: (a) test takers’ performance and response strategies were affected by the experimental manipulations, and (b) test takers mostly used cognitively efficient response strategies regardless of the manipulated item features. Implications of the findings are provided to support test developers’ design decisions.

Loglinear latent class models are used to detect differential item functioning (DIF). These models are formulated in such a manner that the attribute to be assessed may be continuous, as in a Rasch model, or categorical, as in Latent Class Mastery models. Further, an item may exhibit DIF with respect to a manifest grouping variable, a latent grouping variable, or both. Likelihood-ratio tests for assessing the presence of various types of DIF are described, and these methods are illustrated through the analysis of a "real world" data set.

Calculator effects were examined using methods taken from research on differential item functioning. Use of a calculator was controlled on two experimental forms of a test assembled from operational items used on a standardized university mathematics placement test. Results indicated that calculator effects were not present in the analysis of total test scores and were present in only two of the three subscores composed from homogeneous item types. Analyses of item-level functioning indicated, however, that a number of items, including several not included in the two significant subscore combinations, also contained calculator effects. For those items identified, use of the calculator appeared to have changed the actual objective being tested. The findings were generally consistent with previous research: Items that were easier when a calculator was used required either simple computations or use of a function key on the calculator; items that were more difficult required knowledge of a procedure either with or without additional computation. Analysis at the item level facilitated clearer understanding of the impact of calculator use on measurement of the underlying objective.

High-stakes testing, a phenomenon born out of intense accountability across the United States, produces instructional settings that marginalize both curriculum and instruction. Teachers and other school personnel have reduced instruction to drill and practice in an effort to raise standardized and criterion-referenced test scores. This study presents an alternative to current practice that engages students in learning and increases their awareness of the internal aspects of standardized tests. The Test Item Construction Model (TICM) guides students through the process of studying test item stems and subsequently creating items, using a 12-week process that progresses from understanding to creating test items. Students grew in their understanding of the test item stems and the generation of these. An ANOVA did not yield significant differences between random groups of trained and untrained test writers. However, students in the experimental group demonstrated gains in understanding of test items.

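To study systematically the relationship between the sidewall flow effect and the characteristics of the experimental medium, coarse, medium, and fine sand were selected as the experimental media. Based on an analysis of results from single-component homogeneous-medium experiments, spatial combination structures of media with different grain sizes were designed. Through physical experiments, combined with numerical groundwater simulation and mathematical analysis, the degree to which different media influence the sidewall flow effect in seepage experiments was obtained, listed in the order fine sand, coarse sand, medium sand, and the influence of the sidewall flow effect on groundwater seepage and solute transport was clarified. The experiments provide a reference for preventing or reducing the sidewall flow effect in hydrogeological tests of porous media.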

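With the continued development of mass media, regional culture documentaries, which reflect the historical context and the social and humanistic spirit of a particular region, have attracted growing attention. How to let a documentary about a particular region's culture show its distinctive character has become a serious challenge for the makers of such films. An excellent regional culture documentary is necessarily a highly distilled and individualized presentation of the traits of a regional culture, and this requires the audiovisual expression of cultural symbols, the interpretive depth of the narration, and the poetic rendering of the images to work together.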

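The essence of Chinese music lies in its distinctive character: shenyun (神韵, spirit and charm). Shenyun lends vitality to all things, and the art of performance is no exception. Conveying the shenyun of Chinese music in performance requires not only superb technique and broad knowledge but also an individual, cultivated aesthetic sense that brings the performance into the realm of beauty.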

Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent.


The arrangement of response options in multiple-choice (MC) items, especially the location of the most attractive distractor, is considered critical in constructing high-quality MC items. In the current study, a sample of 496 undergraduate students taking an educational assessment course was given three test forms consisting of the same items, with the position of the most attractive distractor varied across the forms. Using a multiple-indicators–multiple-causes (MIMIC) approach, the effects of the most attractive distractor's positions on item difficulty were investigated. The results indicated that the relative placement of the most attractive distractor and the distance between the most attractive distractor and the keyed option affected students’ response behaviors. Moreover, low-achieving students were more susceptible to response-position changes than high-achieving students.

The effect of three item arrangements on state test anxiety was studied using an actual classroom examination administered under power conditions. Examinations were distributed randomly to 128 graduate students in two courses. Separate one-way analyses of variance performed for each course revealed significant effects for item arrangement on anxiety. In one course, anxiety was higher for the hard-to-easy arrangement; in the other course, anxiety was higher for the random arrangement. That highest anxiety levels were associated with different arrangements in the two courses was explained in terms of homogeneity of content and perceived item difficulty. Results suggest that different item arrangements may elicit different levels of anxiety and that item arrangement may introduce a source of variance unrelated to content, thereby reducing the validity of achievement tests.

This study’s general research question was: Given male and female students in an introductory educational psychology course who vary in cognitive entry characteristics and test anxiety, how do three item arrangements (easy to difficult, difficult to easy, and random) located within a 50-item multiple-choice achievement examination influence students’ total test performance? Two hierarchical multiple regression analyses were used to analyze the data. The four predictor variables and their interactions were tested for the amount of variation that they explained in the dependent variable. The main finding within the context of this study is that item arrangements based on item difficulties do not influence achievement examination performance.

A regression analysis was carried out to assess the contributions of passage and no-passage factors to item variance on the Scholastic Aptitude Test reading comprehension task. Unlike earlier regression studies of multiple-choice reading tasks, no-passage factors were experimentally isolated from passage factors, and passage factors from the multiple-choice context. Results showed that no-passage factors play a larger role than do passage factors, accounting for as much as three fourths of systematic variance in item difficulty and more than half of total variance. The task, therefore, appears largely to reflect the systematic influence of factors having nothing to do with the comprehension of reading passages.

The purpose of test directions is to familiarize examinees with a test so that they respond to items in the manner intended. However, changes in educational measurement as well as in the U.S. student population present new challenges to test directions and increase the impact that differential familiarity could have on the validity of test score interpretations. This article reviews the literature on best practices for the development of test directions and documents differences in test familiarity for culturally and linguistically diverse students that could be addressed with test directions and practice. The literature indicates that choice of practice items and feedback are critical in the design of test directions and that more extensive practice opportunities may be required to reduce group differences in test familiarity. As increasingly complex and rich item formats are introduced in next-generation assessments, test directions become a critical part of test design and validity.

Educational program assessment studies often use data from low-stakes tests to provide evidence of program quality. The validity of scores from such tests, however, is potentially threatened by examinee noneffort. This study investigated the extent to which one type of noneffort, rapid-guessing behavior, distorted the results from three types of commonly used program assessment designs. It was found that, for each design, a modest amount of rapid guessing had a pronounced effect on the results. In addition, motivation filtering was found to be successful in mitigating the effects caused by rapid guessing. It is suggested that measurement practitioners routinely apply motivation filtering whenever the data from low-stakes tests are used to support program decisions.
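
The abstract does not say how motivation filtering was carried out, so the following is only a rough sketch of one common way to do it, under stated assumptions: a response is flagged as a rapid guess when its response time falls below a per-item threshold, and examinees whose proportion of non-rapid responses falls below a cutoff are dropped before analysis. The function names, the thresholds, and the 0.9 cutoff are hypothetical.

```python
def response_time_effort(times, thresholds):
    """Proportion of items answered at or above the item's time threshold."""
    flags = [t >= th for t, th in zip(times, thresholds)]
    return sum(flags) / len(flags)


def motivation_filter(records, thresholds, min_effort=0.9):
    """Keep only examinees whose effort index is at least `min_effort`.

    `records` is a list of dicts with an "id" and a list of per-item
    response times in seconds (an assumed data layout).
    """
    return [
        r for r in records
        if response_time_effort(r["times"], thresholds) >= min_effort
    ]


# Hypothetical data: examinee B answers three of four items in under 3 seconds.
thresholds = [3, 3, 3, 3]
records = [
    {"id": "A", "times": [10, 12, 8, 9]},
    {"id": "B", "times": [1, 2, 1, 15]},
]
kept = motivation_filter(records, thresholds)
print([r["id"] for r in kept])
```

Group means or gain scores would then be computed on the retained examinees only; the appropriate thresholds and cutoff depend on the test and are a judgment call.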

