Similar Articles
20 similar articles found.
1.
Applied Measurement in Education, 2013, 26(1): 95-109
To evaluate the effects of calculator use on performance on the SAT I: Reasoning Test in Mathematics, questions about calculator use on the test were inserted into the answer sheets for the November 1996 and November 1997 administrations of the examination. Overall, nearly all examinees indicated that they had brought a calculator to the test, and about two thirds reported using one on a third or more of the math items. Some group differences in calculator use were observed, with girls using calculators more frequently than boys, and Whites and Asian Americans using them more often than other racial or ethnic groups. Calculator use was associated with higher test performance, but the more able students were more likely to have calculators and used them more often. The results were analyzed further using multiple regression and differential item functioning procedures. The degree of speededness at different levels of calculator use was also examined. Overall, the effects of calculator use were found to be small but detectable.
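The abstract does not name the differential item functioning (DIF) procedure used. As an illustration of the general idea, the Mantel-Haenszel statistic, a standard choice for dichotomous items, compares the odds of a correct response for two groups matched on total score:

```latex
\hat{\alpha}_{MH} = \frac{\sum_{k=1}^{K} A_k D_k / N_k}{\sum_{k=1}^{K} B_k C_k / N_k},
\qquad \Delta_{MH} = -2.35 \ln \hat{\alpha}_{MH}
```

Here examinees are stratified into $K$ total-score levels; at level $k$, $A_k$ and $B_k$ are the reference-group counts of correct and incorrect responses, $C_k$ and $D_k$ the corresponding focal-group counts, and $N_k$ the level total. A $\Delta_{MH}$ near zero indicates negligible DIF.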

2.
A sample of college-bound juniors from 275 high schools took a test consisting of 70 math questions from the SAT. A random half of the sample was allowed to use calculators on the test. Both genders and three ethnic groups (White, African American, and Asian American) benefited about equally from being allowed to use calculators; Latinos benefited slightly more than the other groups. Students who routinely used calculators on classroom mathematics tests were relatively advantaged on the calculator test. Test speededness was about the same whether or not students used calculators. Calculator effects on individual items ranged from positive through neutral to negative and could either increase or decrease the validity of an item as a measure of mathematical reasoning skills. Calculator effects could be either present or absent in both difficult and easy items.

3.
This article used the multidimensional random coefficients multinomial logit model to examine the construct validity of the Chinese version of the Motivated Strategies for Learning Questionnaire (MSLQ-CV) and to detect substantial differential item functioning (DIF). A total of 1,354 Hong Kong junior high school students were administered the MSLQ-CV. The partial credit model showed a better goodness of fit than the rating scale model. Five items with substantial gender or grade DIF were removed from the questionnaire, and the correlations between the subscales indicated that the cognitive strategy use and self-regulation factors were so highly correlated that they might be combined into a single factor. The test reliability analysis showed that the test anxiety subscale had lower reliability than the other factors. Finally, the item difficulty and step parameters for the modified 39-item questionnaire were presented. The order of the step difficulty estimates for some items implied that some grouping of categories might be required where categories overlap. Based on these findings, directions for future research were discussed.
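For context, the two competing models differ only in how they parameterize the response-category steps. Under Masters' partial credit model (PCM), the probability that person $i$ scores $x$ on item $j$ is

```latex
P(X_{ij}=x \mid \theta_i)
  = \frac{\exp\!\left[\sum_{h=0}^{x}(\theta_i-\delta_{jh})\right]}
         {\sum_{m=0}^{M_j}\exp\!\left[\sum_{h=0}^{m}(\theta_i-\delta_{jh})\right]},
\qquad (\theta_i-\delta_{j0})\equiv 0,
```

where $\delta_{jh}$ is the $h$-th step difficulty of item $j$ and $M_j$ its maximum score. The rating scale model is the special case $\delta_{jh}=\delta_j+\tau_h$, which forces every item to share one set of thresholds; the better fit of the PCM reported here means the MSLQ-CV items needed item-specific step parameters.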

4.
Studies of differential item functioning under item response theory require that item parameter estimates be placed on the same metric before comparisons can be made. The present study compared the effects on DIF detection of three methods for linking metrics: a weighted mean and sigma method (WMS), the test characteristic curve method (TCC), and the minimum chi-square method (MCS). Both iterative and noniterative linking procedures were compared for each method. Results indicated that detection of differentially functioning items following linking via the test characteristic curve method gave the most accurate results when the sample size was small. When the sample size was large, results for the three linking methods were essentially the same. Iterative linking improved detection of differentially functioning items over noniterative linking, particularly at the .05 alpha level. The weighted mean and sigma method showed greater improvement with iterative linking than either the test characteristic curve or minimum chi-square method.
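The mean and sigma family of methods rests on the fact that IRT metrics are identified only up to a linear transformation. Below is a minimal sketch of the unweighted base transformation, assuming common (anchor) items estimated in both groups; the study's weighted and iterative variants build on this, and all function names are illustrative:

```python
import numpy as np

def mean_sigma_link(b_source, b_target):
    """Solve b_target ~ A * b_source + B from common-item difficulty estimates."""
    A = np.std(b_target, ddof=1) / np.std(b_source, ddof=1)
    B = np.mean(b_target) - A * np.mean(b_source)
    return A, B

def to_target_metric(a, b, A, B):
    """Rescale discrimination (a) and difficulty (b) onto the target metric."""
    return a / A, A * b + B
```

An iterative version would re-estimate A and B after discarding items flagged as functioning differentially, repeating until the flagged set stabilizes.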

5.
Detection of differential item functioning (DIF) on items intentionally constructed to favor one group over another was investigated with item parameter estimates obtained from two item response theory-based computer programs, LOGIST and BILOG. Signed- and unsigned-area measures based on joint maximum likelihood estimation, marginal maximum likelihood estimation, and two marginal maximum a posteriori estimation procedures were compared with each other to determine whether detection of DIF could be improved by using prior distributions. Results indicated that item parameter estimates obtained under either prior condition were less deviant than those obtained when priors were not used. Differences in detection of DIF appeared to be related to the item parameter estimation condition and, to some extent, to sample size.
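The area measures referred to here quantify the gap between the reference- and focal-group item characteristic curves, $P_R(\theta)$ and $P_F(\theta)$:

```latex
\mathrm{SA} = \int_{-\infty}^{\infty}\bigl[P_R(\theta)-P_F(\theta)\bigr]\,d\theta,
\qquad
\mathrm{UA} = \int_{-\infty}^{\infty}\bigl|P_R(\theta)-P_F(\theta)\bigr|\,d\theta .
```

When the two curves share a common lower asymptote $c$, the signed area reduces to the closed form $(1-c)(b_F-b_R)$, so uniform DIF is simply a difference in difficulty; the unsigned area additionally registers nonuniform DIF, where the curves cross.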

6.
This article addresses the issue of how to detect item preknowledge using item response time data in two computer-based large-scale licensure examinations. Item preknowledge is indicated by an unexpectedly short response time together with a correct response. Two samples were used for detecting item preknowledge for each examination. The first sample was from the early stage of the operational test and was used for item calibration. The second sample was from the late stage of the operational test, which may feature item preknowledge. The purpose of this research was to explore whether there was evidence of item preknowledge and compromised items in the second sample, using the parameters estimated from the first sample. The results showed that for one nonadaptive operational examination, two items (of 111) were potentially exposed, and two candidates (of 1,172) showed some indication of preknowledge on multiple items. For the other licensure examination, which featured computerized adaptive testing, there was no indication of item preknowledge or compromised items. Implications for detected aberrant examinees and compromised items are discussed in the article.
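A minimal sketch of the flagging rule described above, assuming a lognormal response-time model (a common choice for this purpose, e.g., van der Linden's; the papers' exact model is not stated in the abstract) with item parameters estimated on the first, calibration sample. All names are hypothetical:

```python
import numpy as np

def flag_fast_correct(log_rt, correct, beta, tau, alpha, z_cut=-1.96):
    """Flag unexpectedly fast *and* correct responses as possible preknowledge.

    log_rt  : (persons x items) log response times from the late-stage sample
    correct : (persons x items) scored 0/1 responses
    beta    : item time intensities; tau : person speeds; alpha : item time
              discriminations, all from a lognormal response-time model
              fit to the calibration sample.
    """
    expected = beta[None, :] - tau[:, None]      # E[ln T] for each person-item pair
    z = alpha[None, :] * (log_rt - expected)     # standardized log-time residual
    return (z < z_cut) & (correct == 1)          # fast and correct -> flag
```

Aggregating these flags by item (many flagged examinees on the same item) would point to a compromised item rather than an aberrant candidate.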

7.
In recent years, students' test scores have been used to evaluate teachers' performance. The assumption underlying this practice is that students' test performance reflects teachers' instruction. However, this assumption is generally not tested empirically. In this study, we examine the effect of teachers' instruction on test performance at the item level using a hierarchical differential item functioning approach. The items are from the U.S. TIMSS 2011 4th-grade math test. Specifically, we tested whether students who had received instruction on a given item performed significantly better on that item than students who had not received such instruction when their overall math ability was controlled for, both with and without controlling for student-level and class-level covariates. This study provides preliminary findings regarding why some items show instructional sensitivity and sheds light on how to develop instructionally sensitive items. Implications and directions for further research are also discussed.
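The abstract does not give the model, but the question it describes maps naturally onto a two-level logistic DIF formulation; the following is one plausible specification, not necessarily the authors' exact model:

```latex
\operatorname{logit} P(Y_{pci}=1)
  = \beta_{0i} + \beta_{1}\,\theta_{pc} + \beta_{2i}\,\mathrm{OTL}_{ci}
    + \mathbf{x}_{pc}^{\top}\boldsymbol{\gamma} + u_c,
\qquad u_c \sim N(0,\sigma_u^2),
```

where $Y_{pci}$ is the response of student $p$ in class $c$ to item $i$, $\theta_{pc}$ is overall math ability, $\mathrm{OTL}_{ci}$ indicates whether class $c$ received instruction on the content of item $i$, $\mathbf{x}_{pc}$ collects the student- and class-level covariates, and $u_c$ is a class random effect. An instructionally sensitive item is one with $\beta_{2i}$ significantly greater than zero.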

8.
Identifying the Causes of DIF in Translated Verbal Items
Translated tests are being used increasingly for assessing the knowledge and skills of individuals who speak different languages. There is little research exploring why translated items sometimes function differently across languages. If the sources of differential item functioning (DIF) across languages could be predicted, this could have important implications for test development, scoring, and equating. This study focuses on two questions: "Is DIF related to item type?" and "What are the causes of DIF?" The data were taken from the Israeli Psychometric Entrance Test in Hebrew (source) and Russian (translated). The results indicated that 34% of the items functioned differentially across languages. The analogy items were the most problematic, with 65% showing DIF, mostly in favor of the Russian-speaking examinees. The sentence completion items were also a problem (45% DIF). The main reasons for DIF were changes in word difficulty, changes in item format, differences in cultural relevance, and changes in content.

9.
Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions containing both images and answer lists. Subjects were randomly assigned to one version. Answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross-sectional images resulted in greater item difficulty and item discrimination compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores depend on the type of image used. Results from the two cross-sectional images suggest an extra ability is being tested. Data from the scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context-dependent interaction takes place with the content of questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items.
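The indices at issue are the classical ones. A minimal sketch of how item difficulty (proportion correct) and item discrimination (corrected point-biserial correlation) are typically computed from a 0/1 score matrix; the study's exact formulas are not stated in the abstract:

```python
import numpy as np

def classical_item_stats(scores):
    """scores: (persons x items) array of 0/1 item scores."""
    p = scores.mean(axis=0)                      # difficulty: proportion correct
    total = scores.sum(axis=1)
    disc = np.empty(scores.shape[1])
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]              # rest score, excluding item j
        disc[j] = np.corrcoef(scores[:, j], rest)[0, 1]  # corrected point-biserial
    return p, disc
```

On these conventions a harder item has a lower $p$, and a more discriminating item has a higher item-rest correlation.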

10.
According to the Educational Policies Commission, the central purpose of education in this country is to lead students to develop the ability to think. No standard way exists to measure whether or not the schools are achieving that purpose. The EPC identified 10 rational powers as constituting the essence of the ability to think. The research reported here was done to ascertain which rational powers are measured by commercially available, standardized tests in science. A universe of standardized tests was defined and 12 specific tests were randomly selected for analysis. All instruments were validated by a panel of experts, as was a training program for the four teacher-evaluators who applied previously evaluated criteria to each test item to determine which rational powers had to be used in responding to the item. Seven of the 12 standardized tests analyzed in the research required that students use only the rational power of recall in responding. In fact, approximately 90% of the items analyzed from all tests required only recall. Students were required to use other rational powers only rarely when responding to a test item, and the use of the rational powers of comparing, imagining, and analyzing was not necessary on any of the test items examined. The conclusion was drawn that the producers of standardized tests are not concerned with measuring student achievement of the rational powers. The purpose which runs through and strengthens all other educational purposes, the common thread of education, is the development of the ability to think.

11.
Increasingly, tests are being translated and adapted into different languages. Differential item functioning (DIF) analyses are often used to identify non-equivalent items across language groups. However, few studies have focused on understanding why some translated items produce DIF. The purpose of the current study is to identify sources of differential item and bundle functioning on translated achievement tests using substantive and statistical analyses. A substantive analysis of existing DIF items was conducted by an 11-member committee of testing specialists. In their review, four sources of translation DIF were identified. Two certified translators used these four sources to categorize a new set of DIF items from Grade 6 and 9 Mathematics and Social Studies Achievement Tests. Each item was associated with a specific source of translation DIF and each item was anticipated to favor a specific group of examinees. Then, a statistical analysis was conducted on the items in each category using SIBTEST. The translators sorted the mathematics DIF items into three sources, and they correctly predicted the group that would be favored for seven of the eight items or bundles of items across two grade levels. The translators sorted the social studies DIF items into four sources, and they correctly predicted the group that would be favored for eight of the 13 items or bundles of items across two grade levels. The majority of items in mathematics and social studies were associated with differences in the words, expressions, or sentence structure of items that are not inherent to the language and/or culture. By combining substantive and statistical DIF analyses, researchers can study the sources of DIF and create a body of confirmed DIF hypotheses that may be used to develop guidelines and test construction principles for reducing DIF on translated tests.
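SIBTEST, the statistical engine here, estimates the amount of DIF or DBF as a weighted difference in regression-corrected item (or bundle) scores between reference and focal examinees matched on the remaining "valid" items:

```latex
\hat{\beta}_{\mathrm{UNI}}
  = \sum_{k} \hat{p}_k\left(\bar{Y}^{*}_{Rk}-\bar{Y}^{*}_{Fk}\right),
\qquad
B = \frac{\hat{\beta}_{\mathrm{UNI}}}{\hat{\sigma}\!\left(\hat{\beta}_{\mathrm{UNI}}\right)},
```

where $k$ indexes matching-subtest score levels, $\hat{p}_k$ is the proportion of focal-group examinees at level $k$, and $\bar{Y}^{*}_{Rk}$, $\bar{Y}^{*}_{Fk}$ are the corrected mean scores on the studied item or bundle. $B$ is referred to the standard normal distribution; its sign shows which group is favored, which is what the translators' predictions were checked against.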

12.
Applied Measurement in Education, 2013, 26(2): 175-199
This study used three different differential item functioning (DIF) detection procedures to examine the extent to which items in a mathematics performance assessment functioned differently for matched gender groups. In addition to examining the appropriateness of individual items in terms of DIF with respect to gender, an attempt was made to identify factors (e.g., content, cognitive processes, differences in ability distributions) that may be related to DIF. The QUASAR (Quantitative Understanding: Amplifying Student Achievement and Reasoning) Cognitive Assessment Instrument (QCAI) is designed to measure students' mathematical thinking and reasoning skills and consists of open-ended items that require students to show their solution processes and provide explanations for their answers. In this study, 33 polytomously scored items, distributed across four test forms, were evaluated for gender-related DIF. The data source was sixth- and seventh-grade student responses to each of the four test forms administered in the spring of 1992 at all six school sites participating in the QUASAR project. The sample consisted of 1,782 students, with approximately equal numbers of female and male students. The results indicated that DIF may not be serious for 31 of the 33 items (94%) in the QCAI. For the two items that were detected as functioning differently for male and female students, several plausible factors for DIF were discussed. The results from the secondary analyses, which removed the mutual influence of the two items, indicated that DIF in one item, PPP1, which favored female students over their matched male students, was of particular concern. These secondary analyses suggest that the detection of DIF in the other item in the original analysis may have been due to the influence of item PPP1, because both items were in the same test form.

13.
The evidence gathered in the present study supports the simultaneous development of test items in different languages. The simultaneous approach used in the present study involved writing an item in one language (e.g., French) and, before moving to the development of a second item, translating the item into the second language (e.g., English) and checking that both language versions of the item mean the same thing. The evidence collected through the item development stage suggested that the simultaneous test development method allowed information from item writers representing different language and cultural groups to influence test development directly. Certified English/French translators and interpreters and the French Immersion students confirmed that the test items in French and English had comparable meanings. The pairs of test forms had equal standard errors of measurement. The source of differential item functioning was not attributable to the adaptation process used to produce the two language forms, but to a lack of French language proficiency as well as other, unknown sources. Lastly, the simultaneous approach used in the present study was somewhat more efficient than the forward translation procedure currently in use.
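The equal-standard-error claim refers to the classical index

```latex
\mathrm{SEM} = s_X \sqrt{1-\rho_{XX'}} ,
```

where $s_X$ is the standard deviation of observed scores on a form and $\rho_{XX'}$ its reliability; comparable forms should yield comparable values of this quantity.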

14.
This study investigated differential item functioning (DIF), differential bundle functioning (DBF), and differential test functioning (DTF) across gender in the reading comprehension section of the Graduate School Entrance English Exam in China. The dataset comprised 10,000 test-takers' item-level responses to six five-item testlets. DIF and DBF were examined using the polytomous simultaneous item bias test (poly-SIBTEST) and the item response theory likelihood ratio test, and DTF was investigated with multi-group confirmatory factor analyses (MG-CFA). The results indicated that although none of the 30 items exhibited statistically and practically significant DIF across gender at the item level, 2 testlets were consistently identified as having significant DBF at the testlet level by the two procedures. Nonetheless, the DBF did not manifest itself at the overall test score level as DTF based on MG-CFA. This suggests that the relationship between item-level DIF and test-level DTF is a complicated issue, with a mediating effect of testlets in testlet-based language assessment.
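The likelihood ratio procedure mentioned here compares two nested IRT models: a compact model constraining the studied item's (or testlet's) parameters to be equal across gender, and an augmented model freeing them:

```latex
G^2 = -2\left[\ln L_{\text{compact}} - \ln L_{\text{augmented}}\right]
      \;\sim\; \chi^2_{df},
```

with degrees of freedom equal to the number of parameters freed; a significant $G^2$ flags DIF (or, applied to a bundle, DBF).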

15.
This research explored the measurement characteristics of two science examinations and the potential to use access arrangements data to investigate how students requiring reading support are affected by features of exam questions. For two science examinations, traditional and Rasch analyses provided estimates of difficulty and information on item functioning. For one examination, the performance of students eligible for support from a reader in exams was compared to that of a 'norm' group. For selected items, a sample of student responses was analysed. A number of factors potentially making questions easier or more difficult, or potentially contributing to problems with item functioning, were identified. A number of features that may particularly influence those requiring reading support were also identified.

16.
Applied Measurement in Education, 2013, 26(1): 89-97
Research on the use of multiple-choice tests has presented conflicting evidence about the use of statistical item difficulty as a means of ordering items. An alternative method advocated by many texts is the use of cognitive difficulty. This study examined the effect of using both statistical and cognitive item difficulty in determining item order. Results indicated that students who received items in increasing cognitive order, whatever the order of statistical difficulty, scored higher on hard items. Students who received forms with opposing cognitive and statistical difficulty orders scored highest on medium-level items. The study concludes with a call for more research on the effects of cognitive difficulty and suggests that future studies examine subscores as well as total test results.

17.
The premise of a great deal of current research guiding policy development has been that accommodations are the catalyst for student performance differences. Rather than accepting this premise, two studies were conducted to investigate the influence of extended time and content knowledge on the performance of ninth-grade students who took a statewide mathematics test with and without accommodations. Each study involved 1,250 accommodated students (extended time only) with learning disabilities and 1,250 nonaccommodated students demonstrating no disabilities. In Study One, a standard differential item functioning (DIF) analysis illustrated that the usual approach to studying the effects of accommodations contributes little to our understanding of the reason for performance differences across students. Next, a mixture item response theory DIF model was used to explore the most likely cause(s) for performance differences across the population. The results from both studies suggest that students for whom items were functioning differently were not accurately characterized by their accommodation status but rather by their content knowledge. That is, knowing students' accommodation status (i.e., accommodated or nonaccommodated) contributed little to understanding why accommodated and nonaccommodated students differed in their test performance. Rather, the data would suggest that a more likely explanation is that mathematics competency differentiated the groups of student learners regardless of their accommodation and/or reading levels.
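A mixture DIF model replaces the observed grouping (accommodated vs. nonaccommodated) with latent classes and asks which partition actually carries the item-level differences. As one concrete instance (the study's exact parameterization is not given in the abstract), a mixture Rasch model is

```latex
P(Y_{pi}=1)=\sum_{g=1}^{G}\pi_g\,
  \frac{\exp(\theta_{pg}-b_{ig})}{1+\exp(\theta_{pg}-b_{ig})},
```

where $\pi_g$ are latent-class proportions and $b_{ig}$ class-specific item difficulties; DIF appears as between-class differences in $b_{ig}$. Relating recovered class membership to accommodation status and content knowledge is what shows, as here, that the classes align with mathematics competency rather than with accommodation.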

18.
Although multiple choice examinations are often used to test anatomical knowledge, these often forgo the use of images in favor of text-based questions and answers. Because anatomy relies on visual resources, examinations using images should be used when appropriate. This study was a retrospective analysis of examination items that were text based, compared to the same questions when a reference image was included with the question stem. Item difficulty and discrimination were analyzed for 15 multiple choice items given across two different examinations in two sections of an undergraduate anatomy course. Results showed some differences in item difficulty, but these were not consistently associated with either text items or items with reference images. Differences in difficulty were mainly attributable to one group of students performing better overall on the examinations. There were no significant differences in item discrimination for any of the analyzed items. This implies that reference images do not significantly alter the item statistics; however, it does not indicate whether these images were helpful to the students when answering the questions. Care should be taken by question writers to analyze item statistics when making changes to multiple choice questions, including changes that are made for the perceived benefit of the students.

19.
Applied Measurement in Education, 2013, 26(1): 11-22
Previous research has provided conflicting findings about whether allowing the use of calculators changes the difficulty of mathematics tests or the time needed to complete them. Because the interpretation of results from standardized tests via norm tables depends on standardized conditions, the impact of allowing or not allowing examinees to use calculators while taking such tests would need to be specified as part of the standardized conditions. This article examines four item types that may perform differently under different conditions of calculator use. It also examines the effect of testing under calculator and noncalculator conditions on testing time, reliability, item difficulty, and item discrimination.

20.
In multiple-choice items, differential item functioning (DIF) in the correct response may or may not be caused by differentially functioning distractors. Identifying distractors as causes of DIF can provide valuable information for potential item revision or the design of new test items. In this paper, we examine a two-step approach based on a nested logit model for this purpose. The approach separates the testing of differential distractor functioning (DDF) from DIF, allowing for clearer evaluations of where distractors may be responsible for DIF. The approach is contrasted with competing methods and evaluated in simulation and real data analyses.
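In the nested logit family the abstract points to (e.g., Suh and Bolt's 2PL nested logit, a common parameterization; the paper's exact form is an assumption here), the correct response and the distractor choices get separate pieces of the model:

```latex
P(Y_{pi}=1\mid\theta_p)=\frac{\exp\!\left[a_i(\theta_p-b_i)\right]}
                             {1+\exp\!\left[a_i(\theta_p-b_i)\right]},
\qquad
P(D_{pi}=k\mid Y_{pi}=0,\theta_p)=\frac{\exp(\zeta_{ik}+\lambda_{ik}\theta_p)}
                                       {\sum_{m}\exp(\zeta_{im}+\lambda_{im}\theta_p)} .
```

Group differences in $(a_i, b_i)$ constitute DIF in the correct response, while group differences in the distractor parameters $(\zeta_{ik}, \lambda_{ik})$ constitute DDF; because the two parts are estimated separately, DDF can be tested without being confounded with DIF, which is the separation the two-step approach exploits.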
