Similar Articles
20 similar articles found.
1.
Two problems in test development relate to the use of illustrations: (1) Do illustrated items perform better than written items? (2) Does item performance vary as a function of the type and size of the illustration? A sample of 63 tests was drawn from all the Air Force Specialty Knowledge Tests containing illustrations. These 63 tests had been administered to approximately 28,261 airmen under operational conditions. Item statistics for illustrated and written items drawn from the same content areas were compared using F ratios. The results indicated (1) that illustrated items in general performed slightly better than matched written items, and (2) that the best-performing category of illustrated items was tables.
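A minimal sketch of the kind of F-ratio comparison this abstract describes, assuming a one-way ANOVA on item difficulty values; the numbers, group sizes, and the choice of statistic are invented for illustration and are not the study's data or exact procedure.

```python
# Illustrative only: compare item statistics (here, proportion-correct values)
# for illustrated vs. written items with an F ratio via one-way ANOVA.
# All values are invented.
import numpy as np
from scipy import stats

illustrated = np.array([0.62, 0.58, 0.71, 0.66, 0.60, 0.69])  # item p-values
written     = np.array([0.55, 0.49, 0.64, 0.52, 0.57, 0.61])

f_stat, p_value = stats.f_oneway(illustrated, written)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```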

2.
In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) between Black and matched White examinees for four verbal item types (analogies, antonyms, sentence completions, reading comprehension), using 13 disclosed GRE forms (988 verbal items) and 11 disclosed SAT forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both the GRE and the SAT. The most important finding is that on hard items, Black examinees perform differentially better than matched-ability White examinees for each of the four item types and on both tests. The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and the differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored.
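The core analysis described above reduces to a per-item-type correlation between a difficulty index and a DIF index. A minimal sketch on simulated values, assuming equated delta for difficulty and a signed DIF statistic; both the data and the positive difficulty-DIF relationship are fabricated to mirror the reported pattern.

```python
# Simulated illustration: correlate item difficulty (equated delta) with a
# signed DIF value; a positive correlation means harder items show DIF
# favoring the focal group, as the abstract reports.
import numpy as np

rng = np.random.default_rng(0)
n_items = 80
delta = rng.normal(11, 3, n_items)                                   # equated-delta difficulty
dif = 0.04 * (delta - delta.mean()) + rng.normal(0, 0.15, n_items)   # toy DIF values

r = np.corrcoef(delta, dif)[0, 1]
print(f"r(difficulty, DIF) = {r:.2f}")
```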

3.
This study investigated the role of item formats in the performance of 206 nonnative speakers of English on expressive skills (i.e., speaking and writing). Test scores were drawn from the field test of the Pearson Test of English Academic for Chinese, French, Hebrew, and Korean native speakers. Four item formats were examined in relation to expressive skills: multiple-choice questions asking for a single answer (SAMC), multiple-choice questions allowing multiple answers (MAMC), gap-filling items, and summarizing items. The results showed that, although the four groups had different score distributions, first language itself did not account for significant variance in the expressive skills. The summarizing item format assessing listening skills accounted for the greatest variance in the test takers' expressive skills. The SAMC format consistently explained less variance than the MAMC format in the expressive skills measured. Unlike previous research, this study found no gender difference.

4.
We analyzed a pool of items from an admissions test for differential item functioning (DIF) across groups based on age, socioeconomic status, citizenship, and English language status, using Mantel-Haenszel and item response theory procedures. DIF items were systematically examined to identify possible sources of DIF by item type, content, and wording. DIF was found primarily for the citizenship grouping. As suggested by expert reviewers, possible sources of DIF in the direction of U.S. citizens were often Quantitative Reasoning items containing figures, charts, or tables depicting real-world (as opposed to abstract) contexts. DIF items in the direction of non-U.S. citizens included "mathematical" items containing few words. DIF in the Verbal Reasoning items involved geocultural references and proper names that may be differentially familiar to non-U.S. citizens. This study is responsive to foundational changes in the fairness section of the Standards for Educational and Psychological Testing, which now considers additional groups in sensitivity analyses, given the increasing demographic diversity of test-taker populations.
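For the Mantel-Haenszel procedure named above, a minimal sketch on invented counts: examinees are stratified by total score, and the per-stratum 2 x 2 tables feed the MH common odds ratio, conventionally rescaled to the ETS delta metric. The counts reflect the standard textbook formulation, not this study's data.

```python
# Mantel-Haenszel DIF sketch. Rows are score strata; columns are
# [ref_correct, ref_wrong, focal_correct, focal_wrong]. Counts are invented.
import numpy as np

strata = np.array([
    [40, 60, 30, 70],
    [55, 45, 45, 55],
    [70, 30, 62, 38],
    [85, 15, 80, 20],
], dtype=float)

a, b, c, d = strata.T                               # unpack the four columns
n = strata.sum(axis=1)                              # stratum sizes
alpha_mh = np.sum(a * d / n) / np.sum(b * c / n)    # MH common odds ratio
delta_mh = -2.35 * np.log(alpha_mh)                 # ETS delta scale; negative favors the reference group
print(f"alpha_MH = {alpha_mh:.2f}, MH D-DIF = {delta_mh:.2f}")
```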

5.
We studied the conceptions held by secondary school and university students of the number line as a representation of the real numbers. In the context of a wider questionnaire, 307 students were presented with a task consisting of two verbal items and one graphic item related to the number line. The students were at different levels in their study of mathematics (in the third, fourth, or fifth year of secondary education, or at the beginning or advanced stage of a university degree in mathematics, biology, or physical education). A gradient in the depth of the students' conceptions, associated with the level of their mathematical studies, was found. This gradient extends from estrangement when facing the problem, or a conception of the line as a drawn or physical object (associated with students at lower levels of mathematical study), through a view centred on potential numeric density or on the line as a discrete collection of points, up to an instrumental conception of the line as a support for magnitudes (held by advanced biology students) and a conception focused on continuity (sustained by advanced mathematics students).

6.
In an attempt to identify some of the causes of answer-changing behavior, the effects of four test- and item-specific variables were evaluated. Three samples of New Zealand school children of different ages were administered tests of study skills. The number of answer changes per item was compared with the position of each item within a group of items, the position of each item in the test, and the discrimination and difficulty indices of each item. It is shown that answer changes were more likely to be made on items occurring early in a group of items and toward the end of a test. There was also a tendency for difficult items and items with poor discrimination to be changed more frequently. Some implications of answer changing for the design of tests are discussed.

7.
The aim of this mixed-methods study was to investigate the difficulties prospective elementary mathematics teachers have in solving the Programme for International Student Assessment (PISA) 2012 released items. A mathematics test consisting of 26 PISA items was administered, followed by interviews. Multiple data sources were used to provide rich insights into the types of mathematical knowledge a particular item requires and prospective teachers' difficulties in using these knowledge types. A sample of 52 prospective teachers completed the mathematics test, and 12 of them were interviewed afterwards. The data sets were complementary: the quantitative data showed that PISA items could be categorized under contextual, conceptual, and procedural knowledge, and that difficulties were most frequent on items combining all three knowledge types. The qualitative data revealed that few prospective teachers could give mathematical explanations for conceptual knowledge items and that their contextual knowledge was fragmented. Educational implications are discussed.

8.
This paper describes the development and validation of an item bank designed for students to assess their own achievements across an undergraduate degree programme in seven generic competences (i.e., problem-solving skills, critical-thinking skills, creative-thinking skills, ethical decision-making skills, effective communication skills, social interaction skills, and global perspective). The Rasch modelling approach was adopted for instrument development and validation. A total of 425 items were developed. The content validity of these items was examined via six focus-group interviews with target students, and the construct validity was verified against data collected from a large student sample (N = 1151). A matrix design was adopted to assemble the items in 26 test forms, which were distributed at random in each administration session. The results demonstrated that the item bank had high reliability and good construct validity. Cross-sectional comparisons of Years 1–4 students revealed patterns of change over the years. Correlation analyses shed light on the relationships between the constructs. Implications are drawn to inform future development of the instrument, and suggestions are made regarding ways to use it to enhance the teaching and learning of generic skills.
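The Rasch model used above specifies P(X = 1) = 1 / (1 + exp(-(theta - b))) for a person of ability theta on an item of difficulty b. A toy sketch, estimating one item's difficulty from simulated responses with abilities treated as known; an actual calibration like the study's jointly estimates persons and items with dedicated software, so this is illustration only.

```python
# Toy Rasch calibration: recover one item's difficulty b by maximum
# likelihood, treating person abilities as known. Data are simulated.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
theta = rng.normal(0, 1, 500)                              # person abilities
true_b = 0.4
x = rng.binomial(1, 1 / (1 + np.exp(-(theta - true_b))))   # 0/1 responses

def neg_log_lik(b):
    p = 1 / (1 + np.exp(-(theta - b)))                     # Rasch response probability
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

b_hat = minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x
print(f"estimated item difficulty: {b_hat:.2f} (true {true_b})")
```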

9.
This paper presents findings from research exploring gender-by-item-difficulty interaction in mathematics test scores in Cyprus. Data stemmed from two longitudinal studies with four different age groups of primary school students. The hypothesis that boys tend to outperform girls on the hardest items and girls tend to outperform boys on the easiest items was generally supported for each year group. The effect of social class was also examined. For each social class, there was a correlation between the item-difficulty differences estimated separately for girls and boys and the difficulty of the item estimated on the whole sample. It is claimed that in understanding gender differences in mathematics, item difficulty should be treated as an independent variable. Suggestions for further studies are provided, and implications for the development of assessment policy in mathematics are drawn.

10.
11.
In this study, a multiple-choice test entitled the Science Process Assessment was developed to measure the science process skills of students in grade four. Based on the Recommended Science Competency Continuum for Grades K to 6 for Pennsylvania Schools, this instrument measured the skills of (1) observing, (2) classifying, (3) inferring, (4) predicting, (5) measuring, (6) communicating, (7) using space/time relations, (8) defining operationally, (9) formulating hypotheses, (10) experimenting, (11) recognizing variables, (12) interpreting data, and (13) formulating models. To prepare the instrument, classroom teachers and science educators were invited to participate in two science education workshops designed to develop an item bank of test questions applicable to measuring process-skill learning. Participants formed “writing teams” and generated 65 test items representing the 13 process skills. After a comprehensive group critique of each item, 61 items were identified for inclusion in the Science Process Assessment item bank. To establish content validity, the item bank was submitted to a select panel of science educators to judge item acceptability. This analysis yielded 55 acceptable test items and produced the Science Process Assessment, Pilot 1. Pilot 1 was administered to 184 fourth-grade students; students were given a copy of the test booklet, and teachers read each test aloud to the students. Data from the item analysis of this first administration yielded a reliability coefficient of 0.73. Subsequently, 40 test items were selected for the Science Process Assessment, Pilot 2. Using the test-retest method, Pilot 2 (Test 1 and Test 2) was administered to 113 fourth-grade students, yielding reliability coefficients of 0.80 and 0.82, respectively; the correlation between Test 1 and Test 2 was 0.77. The results of this study indicate that (1) the Science Process Assessment, Pilot 2, is a valid and reliable instrument for measuring the science process skills of students in grade four; (2) educational workshops are a viable and productive means of developing item banks of test questions; and (3) involving classroom teachers and science educators in the test development process is educationally efficient and effective.
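The two kinds of reliability reported above can be illustrated with standard formulas: KR-20 for the internal consistency of a dichotomously scored test, and a Pearson correlation for test-retest agreement. A sketch on simulated 0/1 responses; the sample size and item count echo the abstract's Pilot 2 design, but the data and resulting coefficients are fabricated.

```python
# KR-20 and test-retest sketch on simulated dichotomous responses
# (113 examinees x 40 items, mirroring the abstract's Pilot 2 design).
import numpy as np

rng = np.random.default_rng(2)
theta = rng.normal(0, 1, (113, 1))            # latent abilities
b = rng.normal(0, 1, (1, 40))                 # item difficulties
scores = (rng.random((113, 40)) < 1 / (1 + np.exp(-(theta - b)))).astype(int)

k = scores.shape[1]
p = scores.mean(axis=0)                       # item proportion-correct
kr20 = (k / (k - 1)) * (1 - np.sum(p * (1 - p)) / scores.sum(axis=1).var(ddof=1))

test1 = scores.sum(axis=1)                    # first administration totals
test2 = test1 + rng.normal(0, 2, len(test1))  # fake retest totals
retest_r = np.corrcoef(test1, test2)[0, 1]
print(f"KR-20 = {kr20:.2f}, test-retest r = {retest_r:.2f}")
```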

12.
Using a complex simulation study, we investigated parameter recovery, classification accuracy, and the performance of two item-fit statistics for correct and misspecified diagnostic classification models within a log-linear modeling framework. The manipulated test design factors included the number of respondents (1,000 vs. 10,000), attributes (3 vs. 5), and items (25 vs. 50), as well as different attribute correlations (.50 vs. .80) and marginal attribute difficulties (equal vs. different). We investigated misspecifications of interaction effect parameters under correct Q-matrix specification and two types of Q-matrix misspecification. While the misspecification of interaction effects had little impact on classification accuracy, invalid Q-matrix specifications led to notably decreased classification accuracy. The two proposed item-fit indexes were more sensitive to overspecification of Q-matrix entries for items than to underspecification. The information-based fit indexes AIC and BIC were sensitive to both over- and underspecification.
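The information-based indexes named above are simple functions of a model's maximized log-likelihood. A generic sketch; the log-likelihoods and parameter counts below are invented, not values from the simulation.

```python
# AIC/BIC comparison of two candidate model specifications; lower is better.
import numpy as np

def aic(log_lik: float, n_params: int) -> float:
    return -2 * log_lik + 2 * n_params

def bic(log_lik: float, n_params: int, n_obs: int) -> float:
    return -2 * log_lik + n_params * np.log(n_obs)

# e.g., a correct vs. an overspecified Q-matrix fit to the same 1,000 respondents
print(aic(-5210.3, 42), aic(-5198.7, 58))
print(bic(-5210.3, 42, 1000), bic(-5198.7, 58, 1000))
```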

13.
Recent findings suggest that people with dyslexia experience difficulties with the learning of serial-order information during the transition from short- to long-term memory (Szmalec et al., Journal of Experimental Psychology: Learning, Memory, & Cognition, 37(5), 1270-1279, 2011). At the same time, models of short-term memory increasingly incorporate a distinction between order and item processing (Majerus et al., Cognition, 107, 395-419, 2008). The current study investigated whether serial-order processing deficiencies in dyslexia can be traced back to a selective impairment of short-term memory for serial order, and whether this impairment also affects processing beyond the verbal domain. A sample of 26 adults with dyslexia and a group of age- and IQ-matched controls participated in a 2 × 2 × 2 experiment in which we assessed short-term recognition performance for order and item information, using both verbal and nonverbal material. Our findings indicate that, irrespective of the type of material, participants with dyslexia recalled the individual items with the same accuracy as the matched control group, whereas their ability to recognize the serial order in which those items were presented was impaired. We conclude that dyslexia is characterized by a selective impairment of short-term memory for serial order, but not for item information, and discuss the integration of these findings into current theoretical views on dyslexia and its associated dysfunctions.

14.
Gender fairness in testing can be impeded by the presence of differential item functioning (DIF), which potentially causes test bias. In this study, the presence and causes of gender-related DIF were investigated with real data from 800 items answered by 250,000 test takers. DIF was examined using the Mantel–Haenszel and logistic regression procedures. Little DIF was found in the quantitative items and a moderate amount in the verbal items. Vocabulary items favored women when sampled from traditionally female domains, but items sampled from male domains did not generally favor men. The sentence-completion item format in the English reading comprehension subtest favored men regardless of content. The findings, if supported in a cross-validation study, could lead to changes in how vocabulary items are sampled and in the use of the sentence-completion format in English reading comprehension, thereby increasing the gender fairness of the examined test.
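The logistic regression procedure mentioned above regresses the item response on the matching score, group membership, and their interaction; a significant group effect signals uniform DIF and a significant interaction signals non-uniform DIF. A sketch on simulated data with uniform DIF built in, using statsmodels; the coefficient values and group coding are assumptions for illustration.

```python
# Logistic-regression DIF sketch: item response ~ score + group + score:group.
# Data are simulated with a uniform-DIF effect (the 0.4 * group term).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
score = rng.normal(0, 1, n)                   # matching criterion (e.g., total score)
group = rng.integers(0, 2, n).astype(float)   # 0 = reference, 1 = focal
logit = -0.2 + 1.1 * score + 0.4 * group      # built-in uniform DIF
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([score, group, score * group]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)  # params[2] ~ uniform DIF, params[3] ~ non-uniform DIF
```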

15.
Raw scores on the 16 K-ABC subtests and the total raw scores on the sequential and simultaneous processing scales and the achievement scale were correlated with age in months for two separate samples, each subdivided by race and sex: the K-ABC standardization sample (N = 2000) and an additional group of blacks and whites tested during the development of the K-ABC sociocultural norms (N = 615). Within each sample, the highest and lowest correlations from all race/sex groups were contrasted across all K-ABC subtests and scales. All correlations between age and raw scores were statistically significant (p ⩽ .05). No significant differences occurred in the magnitude of these relationships as a function of race/sex grouping, supporting the construct validity of the K-ABC as a developmental measure of children's aptitude and achievement for blacks, whites, Hispanics, males, and females.

16.
Noting the wide differences in verbal abilities of middle- and lower-class children, the investigators proposed that two groups of children, one from the lower class and one from the middle class, who achieve comparable total scores on a group intelligence test, would get their scores by successfully completing different sets of items. In the first study, children were placed in social classes based on their fathers' occupations, following guidelines from the Warner scale. Middle-class children were matched with lower-class children on total Otis scores. No item-by-social-class interaction was found. The study was repeated using the occupational categories of the Dictionary of Occupational Titles as a guide to social class standing. Again, no item-by-social-class interaction appeared. If two social class groups are equated on total intelligence scores, one social class sample appears to succeed on essentially the same test items as the other. A given score on an intelligence test appears to represent the same skills for one social class as it does for another.

17.
Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
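The mean b-value method named above shifts the new form's item difficulties by the mean difference of the anchor items' b-values between the two calibrations. A minimal sketch with invented difficulties:

```python
# Mean b-value anchor equating: place new-form Rasch difficulties on the
# old form's scale using the anchor items common to both. Values invented.
import numpy as np

anchor_old = np.array([-0.8, -0.2, 0.1, 0.5, 1.1])   # anchors, old calibration
anchor_new = np.array([-0.6, -0.1, 0.3, 0.6, 1.3])   # same anchors, new calibration

shift = anchor_old.mean() - anchor_new.mean()        # linking constant
b_new_form = np.array([-1.0, 0.0, 0.4, 0.9])         # remaining new-form items
print(f"shift = {shift:.2f}, equated b's =", b_new_form + shift)
```

With only a handful of anchors, the shift is highly sensitive to which items happen to be sampled, which is the instability over anchor samples that the study documents.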

18.
This article describes a series of studies performed with the National Teacher Examinations that were designed to examine the relationship between the cultural content of special sets of general-culture test items and the performance of blacks and whites on these experimental items. Significant differences between the performance of blacks and whites were found on black, modern, and traditional test items. A replication of the study with the same test items, and also with a different group of test items, is also described.

19.
Differential Item Functioning (DIF) is traditionally used to identify different item performance patterns between intact groups, most commonly involving race or sex comparisons. This study advocates expanding the utility of DIF as a step in construct validation. Rather than grouping examinees based on cultural differences, the reference and focal groups are chosen from two extremes along a distinct cognitive dimension that is hypothesized to supplement the dominant latent trait being measured. Specifically, this study investigates DIF between proficient and non-proficient fourth- and seventh-grade writers on open-ended mathematics test items that require students to communicate about mathematics. It is suggested that the occurrence of DIF in this situation actually enhances, rather than detracts from, the construct validity of the test because, according to the National Council of Teachers of Mathematics (NCTM), mathematical communication is an important component of mathematical ability, the dominant construct being assessed. However, the presence of DIF influences the validity of inferences that can be made from test scores and suggests that two scores should be reported, one for general mathematical ability and one for mathematical communication. The fact that currently only one test score is reported, a simple composite of scores on multiple-choice and open-ended items, may lead to incorrect decisions being made about examinees.

20.
Applied Measurement in Education, 2013, 26(2): 175-199
This study used three different differential item functioning (DIF) detection procedures to examine the extent to which items in a mathematics performance assessment functioned differently for matched gender groups. In addition to examining the appropriateness of individual items in terms of DIF with respect to gender, an attempt was made to identify factors (e.g., content, cognitive processes, differences in ability distributions) that may be related to DIF. The QUASAR (Quantitative Understanding: Amplifying Student Achievement and Reasoning) Cognitive Assessment Instrument (QCAI) is designed to measure students' mathematical thinking and reasoning skills and consists of open-ended items that require students to show their solution processes and provide explanations for their answers. In this study, 33 polytomously scored items, distributed across four test forms, were evaluated with respect to gender-related DIF. The data source was sixth- and seventh-grade student responses to each of the four test forms administered in the spring of 1992 at all six school sites participating in the QUASAR project. The sample consisted of 1,782 students, with approximately equal numbers of female and male students. The results indicated that DIF may not be serious for 31 of the 33 items (94%) in the QCAI. For the two items detected as functioning differently for male and female students, several plausible factors for DIF were discussed. The results of the secondary analyses, which removed the mutual influence of the two items, indicated that DIF in one item, PPP1, which favored female students over their matched male counterparts, was of particular concern. These secondary analyses suggest that the detection of DIF in the other item in the original analysis may have been due to the influence of item PPP1, because both were in the same test form.
