Similar Articles (20 results)
1.
In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) of Black versus matched White examinees for four verbal item types (analogies, antonyms, sentence completions, reading comprehension) using 13 GRE-disclosed forms (988 verbal items) and 11 SAT-disclosed forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both the GRE and the SAT. The most important finding is that for hard items, Black examinees perform differentially better than matched-ability White examinees for each of the four item types and for both tests. The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored.

2.
Test preparation activities were determined for a large representative sample of Graduate Record Examination (GRE) Aptitude Test takers. About 3% of these examinees had attended formal coaching programs for one or more sections of the test.
After adjusting for differences in the background characteristics of coached and uncoached students, effects on test scores were related to the length and the type of programs offered. The effects on GRE verbal ability scores were not significantly related to the amount of coaching examinees received, and quantitative coaching effects increased slightly but not significantly with additional coaching. Effects on analytical ability scores, on the other hand, were related significantly to the length of coaching programs, through improved performance on two analytical item types, which have since been deleted from the test.
Overall, the data suggest that, when compared with the two highly susceptible item types that have been removed from the GRE Aptitude Test, the item types in the current version of the test (now called the GRE General Test) appear to show relatively little susceptibility to formal coaching experiences of the kinds considered here.

3.
We evaluated a computer-delivered response type for measuring quantitative skill. "Generating Examples" (GE) presents under-determined problems that can have many right answers. We administered two GE tests that differed in the manipulation of specific item features hypothesized to affect difficulty. Analyses addressed internal consistency reliability, external relations, features contributing to item difficulty, adverse impact, and examinee perceptions. Results showed that GE scores were reasonably reliable but only moderately related to the GRE quantitative section, suggesting the two tests might be tapping somewhat different skills. Item features that increased difficulty included asking examinees to supply more than one correct answer and to identify whether an item was solvable. Gender differences were similar to those found on the GRE quantitative and analytical test sections. Finally, examinees were divided on whether GE items were a fairer indicator of ability than multiple-choice items, but still overwhelmingly preferred to take the more conventional questions.

4.
A previous study of the initial, preoperational version of the Graduate Record Examinations (GRE) analytical ability measure (Powers & Swinton, 1984) revealed practically and statistically significant effects of test familiarization on analytical test scores. (Two susceptible item types were subsequently removed from the test.) Data from this study were reanalyzed for evidence of differential effects for subgroups of examinees classified by age, ethnicity, degree aspiration, English language dominance, and performance on other sections of the GRE General Test. The results suggested little, if any, difference among subgroups of examinees with respect to their response to the particular kind of test preparation considered in the study. Within the limits of the data, no particular subgroup appeared to benefit significantly more or significantly less than any other subgroup.

5.
Educational Assessment, 2013, 18(4), 295-308
We examined performance on the reading comprehension (RC) tasks of the Scholastic Assessment Test-I (SAT-I, or the "new" SAT), the Enhanced American College Testing Assessment (ACT), and the Graduate Record Examination (GRE) when the passages were missing. For the SAT-I and ACT, scores were well above chance and correlated substantially with verbal score on the earlier version of the SAT, indicating that examinees perform similarly with or without passages. Comparable but weaker results were found for the GRE. The findings raise doubts about the construct validity of the RC task. We argue that performance is influenced by the plausibility of item choices with or without the passages and that this, in turn, is the result of the construction of test items with little knowledge of the underlying reading process.

6.
The Formulating-Hypotheses (F-H) item presents a situation and asks examinees to generate as many explanations for it as possible. This study examined the generalizability, validity, and examinee perceptions of a computer-delivered version of the task. Eight F-H questions were administered to 192 graduate students. Half of the items restricted examinees to 7 words per explanation, and half allowed up to 15 words. Generalizability results showed high interrater agreement, with tests of between 2 and 4 items scored by one judge achieving coefficients in the .80s. Construct validity analyses found that F-H was only marginally related to the GRE General Test, and more strongly related to a measure of ideational fluency than the General Test was. Different response limits tapped somewhat different abilities, with the 15-word constraint appearing more useful for graduate assessment. These items added significantly to conventional measures in explaining school performance and creative expression.
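For context on how test length drives such generalizability coefficients, the Spearman-Brown projection gives the expected coefficient for a k-item test from the single-item coefficient ρ₁ (a generic reliability formula offered as a sketch; the study's decision-study computations may differ):

```latex
\rho_k = \frac{k\,\rho_1}{1 + (k - 1)\,\rho_1}
```

With ρ₁ ≈ .55, for example, a 3-item test projects to ρ₃ ≈ .79, consistent with coefficients in the .80s for short tests.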

7.
One of the major assumptions of item response theory (IRT) models is that performance on a set of items is unidimensional, that is, the probability of successful performance by examinees on a set of items can be modeled by a mathematical model that has only one ability parameter. In practice, this strong assumption is likely to be violated. An important pragmatic question to consider is: What are the consequences of these violations? In this research, evidence is provided of violations of unidimensionality on the verbal scale of the GRE Aptitude Test, and the impact of these violations on IRT equating is examined. Previous factor analytic research on the GRE Aptitude Test suggested that two verbal dimensions, discrete verbal (analogies, antonyms, and sentence completions) and reading comprehension, existed. Consequently, the present research involved two separate calibrations (homogeneous) of discrete verbal items and reading comprehension items as well as a single calibration (heterogeneous) of all verbal item types. Thus, each verbal item was calibrated twice and each examinee obtained three ability estimates: reading comprehension, discrete verbal, and all verbal. The comparability of ability estimates based on homogeneous calibrations (reading comprehension or discrete verbal) to each other and to the all-verbal ability estimates was examined. The effects of homogeneity of the item calibration pool on estimates of item discrimination were also examined. Then the comparability of IRT equatings based on homogeneous and heterogeneous calibrations was assessed. The effects of calibration homogeneity on ability parameter estimates and discrimination parameter estimates are consistent with the existence of two highly correlated verbal dimensions. IRT equating results indicate that although violations of unidimensionality may have an impact on equating, the effect may not be substantial.
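To make the "one ability parameter" assumption concrete: under a unidimensional IRT model such as the three-parameter logistic (shown here as an illustration; the abstract does not name the specific model used for calibration), a single θ governs the response probability for every item i:

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

where a_i, b_i, and c_i are the item's discrimination, difficulty, and lower asymptote. Violations of unidimensionality mean that no single θ can reproduce all of these probabilities at once, which is what separate "discrete verbal" and "reading comprehension" calibrations probe.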

8.
This study tested the hypotheses that the nonverbal behavior of teachers is affected by the race and performance of their students. Fifty-six white college-age subjects, acting as teachers, were led to praise successful or unsuccessful students. The students were either white or black. Stimulus teachers' nonverbal behavior was recorded, and silent samples of their behavior were shown to naive judges who rated how pleased they appeared to be with their student. Results showed that stimulus teachers were more pleased with successful than unsuccessful students, and more pleased with white than black students.

9.
An assumption of item response theory is that a person's score is a function of the item response parameters and the person's ability. In this paper, the effect of variations in instructional coverage on item characteristic functions is examined. Using data from the Second International Mathematics Study (1985), curriculum clusters were formed based on teachers' ratings of their students' opportunities to learn the items on a test. After forming curriculum clusters, item response curves were compared using signed and unsigned sums of squared differences. Some of the differences in the item response curves between curriculum clusters were found to be large, but better performance was not necessarily related to greater opportunity to learn. The item response curve differences were much larger than differences reported in prior studies based on comparisons of black and white students. Implications of the findings for applications of item response theory to educational achievement test data are discussed.
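The abstract names but does not define these indices; one plausible formalization (offered as a hedged sketch) evaluates the two clusters' item response curves at ability points θ₁, …, θₙ and aggregates the pointwise differences d_k:

```latex
d_k = P_{1}(\theta_k) - P_{2}(\theta_k), \qquad
D_{\mathrm{signed}} = \frac{1}{n}\sum_{k=1}^{n} \operatorname{sign}(d_k)\, d_k^{2}, \qquad
D_{\mathrm{unsigned}} = \frac{1}{n}\sum_{k=1}^{n} d_k^{2}
```

The signed index can mask offsetting advantages across the ability range, which is why both versions are typically reported.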

10.
This study investigates the comparability of two item response theory based equating methods: true score equating (TSE), and estimated true equating (ETE). Additionally, six scaling methods were implemented within each equating method: mean-sigma, mean-mean, two versions of fixed common item parameter, Stocking and Lord, and Haebara. Empirical test data were examined to investigate the consistency of scores resulting from the two equating methods, as well as the consistency of the scaling methods both within equating methods and across equating methods. Results indicate that although the degree of correlation among the equated scores was quite high, regardless of equating method/scaling method combination, non-trivial differences in equated scores existed in several cases. These differences would likely accumulate across examinees making group-level differences greater. Systematic differences in the classification of examinees into performance categories were observed across the various conditions: ETE tended to place lower ability examinees into higher performance categories than TSE, while the opposite was observed for high ability examinees. Because the study was based on one set of operational data, the generalizability of the findings is limited and further study is warranted.
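As a concrete reminder of what one of these scaling methods does, here is a minimal Python sketch of mean-sigma linking under a standard 2PL/3PL parameterization (the function names and item data are hypothetical illustrations, not the study's implementation):

```python
import numpy as np

def mean_sigma(b_new, b_old):
    """Mean-sigma linking: estimate the linear transformation
    theta* = A*theta + B that puts the new form's scale onto the
    old form's scale, using common-item difficulty estimates."""
    A = np.std(b_old, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_old) - A * np.mean(b_new)
    return A, B

def rescale_item(a, b, A, B):
    # Under theta* = A*theta + B, difficulties map as A*b + B and
    # discriminations as a / A; guessing parameters are unchanged.
    return a / A, A * b + B

# Hypothetical common-item difficulties estimated on each form:
b_new = np.array([-1.2, -0.3, 0.4, 1.1])
b_old = np.array([-1.0, -0.1, 0.6, 1.4])
A, B = mean_sigma(b_new, b_old)
a_star, b_star = rescale_item(1.1, 0.4, A, B)
```

Mean-mean replaces the standard-deviation ratio with a ratio of mean discriminations, while the Stocking-Lord and Haebara methods instead choose A and B by minimizing discrepancies between test and item characteristic curves, respectively.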

11.
Exploratory and confirmatory factor analyses were used to explore relationships among existing item types and three new computer-administered item types for the analytical scale of the Graduate Record Examination General Test. One new item type was an open-ended version of the current multiple-choice analytical reasoning item type. The other new item types had no counterparts on the existing test. The computer tests were administered at four sites to a sample of students who had previously taken the GRE General Test. Scores from the regular GRE and the special computer administration were matched for a sample of 349 students. Factor analyses suggested that the new item types with no counterparts in the existing GRE were reliably assessing unique constructs, but the open-ended analytical reasoning items were not measuring anything beyond what is measured by the current multiple-choice version of these items.

12.
Recent studies have shown that restricting review and answer change opportunities on computerized adaptive tests (CATs) to items within successive blocks reduces time spent in review, satisfies most examinees' desires for review, and controls against distortion in proficiency estimates resulting from intentional incorrect answering of items prior to review. However, restricting review opportunities on CATs may not prevent examinees from artificially raising proficiency estimates by using judgments of item difficulty to signal when to change previous answers. We evaluated six strategies for using item difficulty judgments to change answers on CATs and compared the results to those from examinees reviewing and changing answers in the usual manner. The strategy conditions varied in terms of when examinees were prompted to consider changing answers and in the information provided about the consistency of the item selection algorithm. We found that examinees fared best on average when they reviewed and changed answers in the usual manner. The best gaming strategy was one in which the examinees knew something about the consistency of the item selection algorithm and were prompted to change responses only when they were unsure about answer correctness and sure about their item difficulty judgments. However, even this strategy did not produce a mean gain in proficiency estimates.

13.
With the advent of modern computer technology, there have been growing efforts in recent years to computerize standardized tests, including the popular Graduate Record Examination (GRE), the Graduate Management Admission Test (GMAT), and the Test of English as a Foreign Language (TOEFL). Many such computer-based tests are computerized adaptive tests, whose major feature is that, depending on their performance during testing, different examinees may be given different sets of items (questions). In this way, items can be used efficiently to achieve maximum accuracy in estimating each examinee's ability. This short paper briefly introduces the computerized adaptive test (CAT) and illustrates its application to the assessment of reading comprehension in a second language. The advantages and disadvantages are analyzed, and on that basis some recommendations are given for future study.
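To make the "different examinees get different items" idea concrete, here is a minimal Python sketch of the maximum-information selection rule that drives most CATs (the item bank and the 2PL model are hypothetical illustrations, not the GRE's operational algorithm):

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL response probability (an illustrative model choice;
    operational CATs have often used the 3PL instead)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    # Item information at ability theta under the 2PL.
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

def next_item(theta_hat, a, b, administered):
    """Pick the unadministered item with maximum information at the
    current ability estimate -- the core CAT selection rule."""
    info = fisher_information(theta_hat, a, b)
    info[list(administered)] = -np.inf  # exclude items already given
    return int(np.argmax(info))

# Hypothetical item bank: discriminations a, difficulties b.
a = np.array([0.8, 1.2, 1.5, 1.0, 2.0])
b = np.array([-1.0, -0.2, 0.3, 0.9, 0.0])
first = next_item(theta_hat=0.0, a=a, b=b, administered=set())
```

After each response, the ability estimate is updated (e.g., by maximum likelihood) and the rule is applied again, so high and low performers quickly diverge onto different item sets.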

14.
Administering tests under time constraints may result in poorly estimated item parameters, particularly for items at the end of the test (Douglas, Kim, Habing, & Gao, 1998; Oshima, 1994). Bolt, Cohen, and Wollack (2002) developed an item response theory mixture model to identify a latent group of examinees for whom a test is overly speeded, and found that item parameter estimates for end-of-test items in the nonspeeded group were similar to estimates for those same items when administered earlier in the test. In this study, we used the Bolt et al. (2002) method to study the effect of removing speeded examinees on the stability of a score scale over an 11-year period. Results indicated that using only the nonspeeded examinees for equating and estimating item parameters provided a more unidimensional scale, smaller effects of item parameter drift (including fewer drifting items), and less scale drift (i.e., bias) and variability (i.e., root mean squared errors) when compared to the total group of examinees.

15.
This study was concerned with academic performance in black and white children and the interactions of race with other variables on school achievement. Subjects were 334 blacks and 637 whites in grades three to six. Data consisted of general background information and grade-equivalent scores on the California Achievement Tests. They were analyzed using multiple regression analysis of variance. Results indicated that blacks scored lower than whites and fell farther behind as they progressed from grade to grade. Significant interactions were revealed for sex, social class, family structure, and teachers. Means for black children were generally less variable than for white children.

16.
The purpose of this study was to identify broad classes of items that behave differentially for handicapped examinees taking special, extended-time administrations of the Scholastic Aptitude Test (SAT). To identify these item classes, the performance of nine handicapped groups and one nonhandicapped group on each of two forms of the SAT was investigated through a two-stage procedure. The first stage centered on the performance of item clusters. Individual items composing clusters showing questionable performance were then examined. This two-stage procedure revealed little indication of differentially functioning item classes. However, some notable instances of differential performance at the item level were detected, the most serious of which affected visually impaired students taking the braille edition of the test.

17.
Several samples of black and white students were drawn from the 1970 PSAT administration in Georgia and studied for item x race interaction on both the verbal and mathematical sections of the test. When subsamples of candidates were drawn from their respective racial groups, matched on mathematical score for the study of verbal items and matched on verbal score for the study of mathematical items, there was an observable decrease in the size of the item x race interaction, suggesting that one factor contributing to that interaction was simply the difference in performance levels on the test shown by the two races. Further analyses demonstrated a moderate item x group interaction for blacks native to different cities and a moderate item x group interaction for blacks native to areas of different population density.

18.
Recent simulation studies indicate that there are occasions when examinees can use judgments of relative item difficulty to obtain positively biased proficiency estimates on computerized adaptive tests (CATs) that permit item review and answer change. Our purpose in the study reported here was to evaluate examinees' success in using these strategies while taking CATs in a live testing setting. We taught examinees two item difficulty judgment strategies designed to increase proficiency estimates. Examinees who were taught each strategy and examinees who were taught neither strategy were assigned at random to complete vocabulary CATs under conditions in which review was allowed after completing all items and when review was allowed only within successive blocks of items. We found that proficiency estimate changes following review were significantly higher in the regular review conditions than in the strategy conditions. Failure to obtain systematically higher scores in the strategy conditions was due in large part to errors examinees made in judging the relative difficulty of CAT items.

19.
Statistics used to detect differential item functioning can also reflect differential strengths and weaknesses in the performance characteristics of population subgroups. In turn, item features associated with the differential performance patterns are likely to reflect some facet of the item task, and hence its difficulty, that might previously have been overlooked. In this study, several item features were identified and coded for a large number of reading comprehension items from two admissions testing programs. Item features included subject matter content, various properties of item structure, cognitive demand indicators, and semantic content (propositional analysis). Differential item functioning was evaluated for males and females and for White and Black examinees. Results showed a number of significant relationships between item features and indicators of differential item functioning, many of which were consistent across testing programs. Implications of the results for related areas of research are discussed.

20.
Once a differential item functioning (DIF) item has been identified, little is known about the examinees for whom the item functions differentially. This is because DIF analysis focuses on manifest group characteristics that are associated with DIF but do not explain why examinees respond differentially to items. We first analyze item response patterns for gender DIF and then illustrate, through the use of a mixture item response theory (IRT) model, how the manifest characteristic associated with DIF often has a very weak relationship with the latent groups actually being advantaged or disadvantaged by the item(s). Next, we propose an alternative approach to DIF assessment that first uses an exploratory mixture model analysis to define the primary dimension(s) that contribute to DIF and then studies examinee characteristics associated with those dimensions in order to understand the cause(s) of DIF. Comparison of academic characteristics of these examinees across classes reveals some clear differences in manifest characteristics between groups.
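A mixture IRT model of the kind referenced here allows item parameters to differ across latent classes, with class membership inferred from response patterns rather than read off a manifest variable. A generic Rasch-type sketch (the abstract does not specify the exact parameterization used):

```latex
P(X_{ij} = 1) = \sum_{g=1}^{G} \pi_g \,
\frac{\exp(\theta_{jg} - b_{ig})}{1 + \exp(\theta_{jg} - b_{ig})}
```

where π_g is the proportion of examinees in latent class g, θ_{jg} is person j's ability within that class, and b_{ig} is item i's class-specific difficulty. DIF shows up as between-class differences in b_{ig}, and those latent classes need not align with gender or any other manifest grouping.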
