
In an attempt to identify some of the causes of answer-changing behavior, the effects of four test- and item-specific variables were evaluated. Three samples of New Zealand school children of different ages were administered tests of study skills. The number of answer changes per item was compared with the position of each item in a group of items, the position of each item in the test, the discrimination index, and the difficulty index of each item. Answer changes were more likely to be made on items occurring early in a group of items and toward the end of a test. There was also a tendency for difficult items and poorly discriminating items to be changed more frequently. Some implications of answer changing for the design of tests are discussed.
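The two item statistics named in this abstract can be sketched as follows. This is a minimal illustration, not the study's procedure: it assumes difficulty is the proportion of correct responses and discrimination is the classical upper-lower index, and the 27% group fraction and function names are hypothetical choices.

```python
from typing import List


def item_difficulty(responses: List[List[int]], item: int) -> float:
    """Proportion of examinees answering the item correctly (higher = easier)."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses: List[List[int]], item: int,
                        fraction: float = 0.27) -> float:
    """Upper-lower index: proportion correct in the top-scoring group
    minus proportion correct in the bottom-scoring group.

    The 27% group size is a conventional choice and an assumption here.
    """
    ranked = sorted(responses, key=sum)          # order examinees by total score
    n = max(1, round(len(ranked) * fraction))    # size of each extreme group
    lower, upper = ranked[:n], ranked[-n:]
    p_upper = sum(row[item] for row in upper) / n
    p_lower = sum(row[item] for row in lower) / n
    return p_upper - p_lower


# Rows are examinees, columns are items scored 1 (correct) or 0 (incorrect).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(item_difficulty(responses, 0), item_discrimination(responses, 0))
```

An item with a low difficulty value (few correct answers) and a discrimination index near zero would be the kind the abstract reports as changed more often.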

Open-ended counterparts to a set of items from the quantitative section of the Graduate Record Examination (GRE-Q) were developed. Examinees responded to these items by gridding a numerical answer on a machine-readable answer sheet or by typing on a computer. The test section with the special answer sheets was administered at the end of a regular GRE administration. Test forms were spiraled so that random groups received either the grid-in questions or the same questions in a multiple-choice format. In a separate data collection effort, 364 paid volunteers who had recently taken the GRE used a computer keyboard to enter answers to the same set of questions. Despite substantial format differences noted for individual items, total scores for the multiple-choice and open-ended tests demonstrated remarkably similar correlational patterns. There were no significant interactions of test format with either gender or ethnicity.


The effect of changing item responses on scores of elementary school children on a standardized achievement test was studied. Previous research, primarily involving non-standardized instruments and adult samples, indicates that changed responses are more likely to be correct than not. Subjects were 165 third grade students using the Metropolitan Reading Tests. Students received no special instructions regarding changing responses. Changes were identified visually and were independently verified. While frequency of response changes was low, such changes generally improved scores. Sex differences in number and success of changes were non-significant. The relationship between frequency of response change and test score was minimal. Responses to difficult items were changed more frequently with less success than changes on easy items. High scorers made more successful changes than did low scorers. Within the limits of the methodology, results clearly indicated that response changes of elementary students on multiple-choice items tend to improve test scores.

This study established a Chinese scale for measuring high school students’ ocean literacy. This included testing its reliability, validity, and differential item functioning (DIF), with the aim of compensating for the lack of DIF tests in current scales. The construct validity and reliability were verified by analyzing the scale’s items using the Rasch model, and a gender DIF test was conducted to ensure the fairness of the test results when distinct groups were compared. The results indicated that the scale established in this study is unidimensional and possesses favorable internal consistency and construct validity. The gender DIF test results indicated that several items were difficult for either female or male students to answer correctly; however, the experts and scholars discussed these items individually and suggested retaining them. The final Chinese version of the ocean literacy scale developed here comprises 48 items that can reflect high school students’ understanding of ocean literacy, which helps students understand the topics of marine science encountered in real life.

The hypothesis that some students, when tested under formula directions, omit items about which they have useful partial knowledge implies that such directions are not as fair as rights directions, especially to those students who are less inclined to guess. This hypothesis may be called the differential effects hypothesis. An alternative hypothesis states that examinees would perform no better than chance expectation on items that they would omit under formula directions but would answer under rights directions. This may be called the invariance hypothesis. Experimental data on this question were obtained by conducting special test administrations of College Board SAT-verbal and Chemistry tests and by including experimental tests in a Graduate Management Admission Test administration. The data provide a basis for evaluating the two hypotheses and for assessing the effects of directions on the reliability and parallelism of scores for sophisticated examinees taking professionally developed tests. Results support the invariance hypothesis rather than the differential effects hypothesis.
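The contrast between the two scoring directions can be made concrete with a small sketch. It assumes the conventional correction-for-guessing formula, R - W/(k - 1), where omitted items score zero; the abstract does not state the exact formula used, and the function names are hypothetical.

```python
from typing import Optional


def rights_score(right: int) -> float:
    """Rights scoring: number of correct answers; wrong answers and omits score 0."""
    return float(right)


def formula_score(right: int, wrong: int, options: int) -> float:
    """Formula scoring (assumed form): R - W / (k - 1); omitted items are not penalized."""
    return right - wrong / (options - 1)


def expected_gain_from_answering(options: int,
                                 p_correct: Optional[float] = None) -> float:
    """Expected formula-score change from answering one item instead of omitting it.

    With p_correct left as None the examinee is assumed to guess at chance (1/k).
    """
    p = 1.0 / options if p_correct is None else p_correct
    return p - (1.0 - p) / (options - 1)


print(formula_score(right=30, wrong=10, options=5))   # 30 - 10/4
print(expected_gain_from_answering(5))                # pure guessing
print(expected_gain_from_answering(5, p_correct=0.5)) # partial knowledge
```

Under the invariance hypothesis, omitted items behave like the chance case, so answering them is expected to leave the formula score unchanged. Under the differential effects hypothesis, they behave like the partial-knowledge case, so omitting them costs points.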

When tests are administered under fixed time constraints, test performances can be affected by speededness. Among other consequences, speededness can result in inaccurate parameter estimates in item response theory (IRT) models, especially for items located near the end of tests (Oshima, 1994). This article presents an IRT strategy for reducing contamination in item difficulty estimates due to speededness. Ordinal constraints are applied to a mixture Rasch model (Rost, 1990) so as to distinguish two latent classes of examinees: (a) a "speeded" class, comprised of examinees that had insufficient time to adequately answer end-of-test items, and (b) a "nonspeeded" class, comprised of examinees that had sufficient time to answer all items. The parameter estimates obtained for end-of-test items in the nonspeeded class are shown to more accurately approximate their difficulties when the items are administered at earlier locations on a different form of the test. A mixture model can also be used to estimate the class memberships of individual examinees. In this way, it can be determined whether membership in the speeded class is associated with other student characteristics. Results are reported for gender and ethnicity.

Identifying students’ conceptual scientific understanding is difficult if the appropriate tools are not available for educators. Concept inventories have become a popular tool to assess student understanding; however, traditionally, they are multiple-choice tests. International science education standard documents advocate that assessments should be reform based, contain diverse question types, and align with instructional approaches. To date, no instrument of this type targeting student conceptions in biotechnology has been developed. We report here the development, testing, and validation of a 35-item Biotechnology Instrument for Knowledge Elicitation (BIKE) that includes a mix of question types. The BIKE was designed to elicit student thinking and a variety of conceptual understandings, as opposed to testing closed-ended responses. The design phase contained nine steps, including a literature search for content, student interviews, a pilot test, and expert review. Data from 175 students over two semesters, including 16 student interviews and six expert reviewers (professors from six different institutions), were used to validate the instrument. Cronbach’s alpha on the pretest and posttest was 0.664 and 0.668, respectively, indicating the BIKE has internal consistency. Cohen’s kappa for inter-rater reliability among the 6,525 total items was 0.684, indicating substantial agreement among scorers. Item analysis demonstrated that the items were challenging, there was discrimination among the individual items, and there was alignment with research-based design principles for construct validity. This study provides a reliable and valid conceptual understanding instrument in the understudied area of biotechnology.
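The two reliability statistics reported for the BIKE can be sketched with their textbook definitions. This is an illustration on toy data, not a reproduction of the study's analysis; the function names and the sample data are hypothetical.

```python
from collections import Counter
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(scores: List[List[float]]) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).

    Rows are examinees, columns are items.
    """
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def cohen_kappa(rater_a: Sequence, rater_b: Sequence) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - chance) / (1 - chance)


# Toy data: two perfectly consistent items, and two pairs of raters.
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
print(cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0]))
print(cohen_kappa([1, 1, 0, 0], [1, 0, 1, 0]))
```

Kappa discounts the agreement two scorers would reach by chance, which is why it is usually lower than raw percent agreement.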

In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee-selected items. The results of a series of simulation studies showed that (1) the parameters of the new models were recovered well, (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models, (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee-selected items, and (4) even when the missingness mechanism in examinee-selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.

Two matched forms of a 50-item multiple-choice grammar test were developed. Twenty items designed to be humorous were included in one form. Test forms were randomly assigned to 126 eighth graders who received the test plus alternate forms of a questionnaire. Inclusion of humorous items did not affect grammar scores on matched humorous/nonhumorous items or on common post-treatment items, nor did inclusion affect results of anxiety measures. Students favored inclusion of humor on tests, judged effects of humor positively, and estimated humorous items to be easier. Humor did not lower performance but was sought by the students. Potential for more valid and humane measurement is discussed.

Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple-choice tests administered in an annual, large-scale, high-stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one-parameter model and equated using the mean b-value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
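The mean b-value method mentioned above can be sketched as a single additive shift. This assumes the common reading of the method under the one-parameter model: the equating constant is the difference between the anchor items' mean difficulty on the base scale and on the new form's scale. The function names and the numbers are hypothetical.

```python
from statistics import mean
from typing import List


def mean_b_shift(anchor_b_base: List[float], anchor_b_new: List[float]) -> float:
    """Equating constant: mean anchor difficulty on the base scale minus
    mean difficulty of the same anchor items as calibrated on the new form."""
    return mean(anchor_b_base) - mean(anchor_b_new)


def to_base_scale(values: List[float], shift: float) -> List[float]:
    """Place new-form difficulties (or abilities) on the base scale."""
    return [v + shift for v in values]


# The same three anchor items, calibrated in two different years.
anchor_base = [-1.0, 0.0, 1.0]
anchor_new = [-0.5, 0.5, 1.5]

shift = mean_b_shift(anchor_base, anchor_new)
print(shift)
print(to_base_scale([0.5, 2.0], shift))
```

Because the constant is an average over the anchor set, a short anchor test lets a single item whose relative difficulty drifted move the whole scale, which matches the instability the study reports.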

An assumption that is fundamental to the scoring of student-constructed responses (e.g., essays) is the ability of raters to focus on the response characteristics of interest rather than on other features. A common example, and the focus of this study, is the ability of raters to score a response based on the content achievement it demonstrates independent of the quality with which it is expressed. Previously scored responses from a large-scale assessment in which trained scorers rated exclusively constructed-response formats were altered to enhance or degrade the quality of the writing, and scores that resulted from the altered responses were compared with the original scores. Statistically significant differences in favor of the better-writing condition were found in all six content areas. However, the effect sizes were very small in mathematics, reading, science, and social studies items. They were relatively large for items in writing and language usage (mechanics). It was concluded from the last two content areas that the manipulation was successful and from the first four that trained scorers are reasonably well able to differentiate writing quality from other achievement constructs in rating student responses.

Recent studies have shown that restricting review and answer change opportunities on computerized adaptive tests (CATs) to items within successive blocks reduces time spent in review, satisfies most examinees' desires for review, and controls against distortion in proficiency estimates resulting from intentional incorrect answering of items prior to review. However, restricting review opportunities on CATs may not prevent examinees from artificially raising proficiency estimates by using judgments of item difficulty to signal when to change previous answers. We evaluated six strategies for using item difficulty judgments to change answers on CATs and compared the results to those from examinees reviewing and changing answers in the usual manner. The strategy conditions varied in terms of when examinees were prompted to consider changing answers and in the information provided about the consistency of the item selection algorithm. We found that examinees fared best on average when they reviewed and changed answers in the usual manner. The best gaming strategy was one in which the examinees knew something about the consistency of the item selection algorithm and were prompted to change responses only when they were unsure about answer correctness and sure about their item difficulty judgments. However, even this strategy did not produce a mean gain in proficiency estimates.

Six undergraduate and three graduate classes were given multiple-choice tests with subsequent evaluation of answer changes. The 300 students were tested twice, once before and once after instruction on answer changing. After each test, students were asked to complete two forms. The forms evaluated attitude toward answer changing, reasons for changing, and confidence in final answers. Students showed a significant increase in favorability toward answer changing after instruction. No significant change was found in number of answers changed. Psychology students were found to change significantly more items than were business students. Mean gain score did not change significantly after instruction. It was concluded that although instruction does lead to a change in attitude toward answer changing, the number of changes and overall gain due to answer changing do not change. It was also determined that students continue to make significant gains even when their confidence in the final answer is less than 50 on a 100-point scale.

High item discrimination can be a symptom of a special kind of measurement disturbance introduced by an item that gives persons of high ability a special advantage over and above their higher abilities. This type of disturbance, which can be interpreted as a form of item "bias," can be encouraged by methods that routinely interpret highly discriminating items as the "best" items on a test and may be compounded by procedures that weight items by their discrimination. The type of measurement disturbance described and illustrated in this paper occurs when an item is sensitive to individual differences on a second, undesired dimension that is positively correlated with the variable intended to be measured. Possible secondary influences of this type include opportunity to learn, opportunity to answer, and test wiseness.

In educational measurement, the construction of parallel test forms is often a combinatorial optimization problem that involves the time-consuming selection of items to construct tests having approximately the same test information functions (TIFs) and constraints. This article proposes a novel method based on a genetic algorithm (GA) to construct parallel test forms effectively. The sums of squared errors of the TIFs produced by the GA were compared with those of the Swanson and Stocking method and the Wang and Ackerman method. Experimental results show that tests constructed using the GA yielded lower error, with an average improvement ratio above 90%.

The alignment method (Asparouhov & Muthén, 2014) is an alternative to multiple-group factor analysis for estimating measurement models and testing for measurement invariance across groups. Simulation studies evaluating the performance of the alignment method for estimating measurement models across groups show promising results for continuous indicators. This simulation study builds on previous research by investigating the performance of the alignment method’s measurement model estimates with polytomous indicators under conditions of systematically increasing partial measurement invariance. We also present an evaluation of the testing procedure, which has not been the focus of previous simulation studies. Results indicate that the alignment method adequately recovers parameter estimates under small and moderate amounts of noninvariance, with issues arising only in extreme conditions. In addition, the statistical tests of invariance were fairly conservative and had less power for items with more extreme skew. We include recommendations for using the alignment method based on these results.

Part of the controversy about allowing examinees to review and change answers to previous items on computerized adaptive tests (CATs) centers on a strategy for obtaining positively biased ability estimates attributed to Wainer (1993) in which examinees intentionally answer items incorrectly before review and to the best of their abilities upon review. Our results, based on both simulated and live testing data, showed that there were instances in which the Wainer strategy yielded inflated ability estimates as well as instances in which it yielded deflated ability estimates. The success of the strategy in inflating ability estimates depended on the ability estimation method used (maximum likelihood versus Bayesian), the examinee's true ability level, the standard error of the ability estimate, the examinee's ability to implement the strategy, and the type of decision made from the ability estimate. We discuss approaches to dealing with the Wainer strategy in operational CAT settings.

Wilcox (16) proposed a latent structure model for answer-until-correct tests that can solve various measurement problems including correcting for guessing without assuming guessing is at random. This paper proposes a closed sequential procedure for estimating true score that can be used in conjunction with an answer-until-correct test. For criterion-referenced tests where the goal is to determine whether an examinee’s true score is above or below a known constant, the accuracy of the new procedure is exactly the same as a more conventional sequential solution. The advantage of the new procedure is that it eliminates the possibility of using an inordinately large number of items when in fact a large number of items is not needed; typical sequential procedures always allow this possibility. In addition, the new procedure appears to compare favorably to traditional tests where the number of items to be administered is fixed in advance.
