Similar Articles
1.
Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees had to guess on the final six questions of the analytical section of the Graduate Record Examination in order to finish before time expired. At higher ability levels, even more guessing was required, because the questions administered to higher-ability examinees were typically more time-consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is complicated by the relationship between time and difficulty and by the multidimensionality of the test.
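To make the mechanism concrete, here is a minimal simulation sketch (not the authors' analysis; the item parameters, test length, and estimation method are all illustrative assumptions) showing how a terminal string of random guesses biases a maximum-likelihood ability estimate under the 3PL model:

```python
# Minimal sketch: simulate a fixed-length test, then replace the final
# items' responses with chance-level guesses and compare MLE ability
# estimates. All parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
n_items, n_guessed = 30, 6          # last 6 items answered by guessing
a = rng.uniform(0.8, 2.0, n_items)  # discrimination
b = rng.normal(0.0, 1.0, n_items)   # difficulty
c = np.full(n_items, 0.2)           # pseudo-guessing (5-option items)

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def mle_theta(x, a, b, c, grid=np.linspace(-4, 4, 801)):
    p = p3pl(grid[:, None], a, b, c)
    loglik = (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

true_theta, bias_normal, bias_guess = 1.0, [], []
for _ in range(2000):
    p = p3pl(true_theta, a, b, c)
    x = (rng.random(n_items) < p).astype(int)
    bias_normal.append(mle_theta(x, a, b, c) - true_theta)
    x_g = x.copy()   # guessing: correct only with probability c
    x_g[-n_guessed:] = (rng.random(n_guessed) < c[-n_guessed:]).astype(int)
    bias_guess.append(mle_theta(x_g, a, b, c) - true_theta)

print(f"mean bias, all items attempted: {np.mean(bias_normal):+.3f}")
print(f"mean bias, last {n_guessed} guessed: {np.mean(bias_guess):+.3f}")
```

On typical runs the guessing condition shows a clear negative mean bias, consistent with the estimation errors described above.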

2.
3.
With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number-correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real-number item scores. The generalized algorithm is distinct from the Lord-Wingersky algorithm in that it explicitly incorporates the task of identifying all possible unique real-number test scores in each recursive step. Some applications of the generalized recursive algorithm, such as IRT test score reliability estimation and IRT proficiency estimation based on summed test scores, are illustrated with a short test under varying scoring schemes for its items.
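The recursion itself is compact. Below is a minimal Python sketch of the generalized algorithm as described above: for one fixed proficiency level, each item contributes a small set of possible real-number scores with conditional probabilities, and the recursion carries forward a dictionary keyed by every unique summed score reached so far (the item score values and probabilities are illustrative assumptions):

```python
# Generalized Lord-Wingersky recursion: conditional distribution of the
# summed score given one fixed theta, with real-number item scores. The
# dictionary of unique sums is the "identify all possible unique
# real-number test scores" step the article highlights.
from collections import defaultdict

def summed_score_dist(items):
    """items: per item, a list of (score_value, probability) pairs,
    probabilities conditional on a single fixed proficiency."""
    dist = {0.0: 1.0}                        # before any item: score 0
    for item in items:
        new = defaultdict(float)
        for s, p in dist.items():
            for v, q in item:
                new[round(s + v, 10)] += p * q   # merge duplicate sums
        dist = dict(new)
    return dist

# Illustrative 3-item test with mixed scoring (weights are assumptions):
items = [
    [(0.0, 0.4), (1.0, 0.6)],                # dichotomous item
    [(0.0, 0.2), (0.5, 0.5), (1.5, 0.3)],    # polytomous, real-valued
    [(0.0, 0.3), (2.0, 0.7)],                # double-weighted item
]
for score, prob in sorted(summed_score_dist(items).items()):
    print(f"P(X = {score:>4}) = {prob:.4f}")
```

With all-dichotomous 0/1 items, this reduces exactly to the original Lord-Wingersky recursion.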

4.
This study explored the use of kernel equating for integrating and extending two procedures proposed for assessing item order effects in test forms that have been administered to randomly equivalent groups. When used together, these procedures can provide complementary information about the extent to which item order effects impact test scores, both in overall score distributions and at specific test scores. In addition to detecting item order effects, the integrated procedures also suggest the equating function that most adequately adjusts the scores to mitigate the effects. To demonstrate, the statistical equivalence of alternate versions of two large-volume Advanced Placement exams was assessed.
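For readers unfamiliar with the method, the following is a highly simplified sketch of the kernel equating idea: continuize each form's discrete score distribution with a Gaussian kernel, then equate by matching percentiles. It omits the log-linear presmoothing, mean- and variance-preserving continuization, and bandwidth selection that operational kernel equating uses; the bandwidth and all data below are assumptions.

```python
# Simplified kernel-equating sketch: Gaussian-kernel continuization of
# two discrete score distributions, then percentile matching
# (x -> G_Y^{-1}(F_X(x))). Not a full operational implementation.
import numpy as np
from scipy.stats import norm

def kernel_cdf(x, scores, probs, h=0.6):
    """Gaussian-kernel continuized CDF of a discrete score distribution."""
    return np.sum(probs * norm.cdf((x - scores) / h))

def equate(x, scores_x, probs_x, scores_y, probs_y, h=0.6):
    """Map score x on form X to the equivalent score on form Y."""
    u = kernel_cdf(x, scores_x, probs_x, h)
    grid = np.linspace(scores_y.min() - 3, scores_y.max() + 3, 4001)
    cdf_y = np.array([kernel_cdf(g, scores_y, probs_y, h) for g in grid])
    return np.interp(u, cdf_y, grid)         # inverse CDF by interpolation

# Illustrative 0-5 score scales for two randomly equivalent groups:
scores = np.arange(6)
probs_x = np.array([.05, .15, .25, .25, .20, .10])   # original item order
probs_y = np.array([.08, .18, .26, .24, .16, .08])   # reordered form
for x in scores:
    y = equate(x, scores, probs_x, scores, probs_y)
    print(f"form X score {x} -> form Y equivalent {y:.2f}")
```

The gap between this equating function and the identity function is, in essence, the estimated item order effect at each score point.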

5.
Many educational and psychological tests are inherently multidimensional, meaning that they measure two or more dimensions or constructs. The purpose of this module is to illustrate how test practitioners and researchers can apply multidimensional item response theory (MIRT) to understand better what their tests are measuring, how accurately the different composites of ability are being assessed, and how this information can be cycled back into the test development process. Procedures for conducting MIRT analyses, from obtaining evidence that the test is multidimensional, to modeling the test as multidimensional, to illustrating the properties of multidimensional items graphically, are described from both a theoretical and a substantive basis. The module also illustrates these procedures using data from a ninth-grade mathematics achievement test and concludes with a discussion of future directions in MIRT research.
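As a concrete reference point, the compensatory multidimensional extension of the two-parameter logistic model is the form most often fit in MIRT analyses like those described here (the notation below is generic, not taken from the module itself):

```latex
% Compensatory MIRT (multidimensional 2PL): probability that examinee j
% answers item i correctly, given the ability vector theta_j.
P\left(X_{ij} = 1 \mid \boldsymbol{\theta}_j\right)
  = \frac{\exp\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_i\right)}
         {1 + \exp\left(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j + d_i\right)}
```

Here a_i holds item i's discriminations on each dimension, theta_j is examinee j's ability vector, and d_i is an intercept; because the dimensions combine additively, strength on one dimension can compensate for weakness on another.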

6.
Applications of traditional unidimensional item response theory models to passage-based reading comprehension assessment data have been criticized based on potential violations of local independence. However, simple rules for determining dependency, such as including all items associated with a particular passage, may overestimate the dependency that actually exists among the items. The current study proposed a more refined method, based on cognitive principles and substantive theories, to determine those items that pose a threat. Specifically, the use of common necessary information from the text was examined as a contributor to local dependence. Cognitively similar item pairs, those with connected necessary information, had higher local dependence values than item pairs with no connected necessary information. Results suggest that focusing on necessary information may be useful, to some extent, for understanding and managing item dependence on passage-based reading comprehension tests.
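The abstract reports pairwise local dependence values without naming the index; a common choice for this purpose is Yen's Q3, the correlation between two items' residuals after a fitted unidimensional model is removed. A minimal sketch, assuming a 2PL fit and using simulated data with one deliberately dependent item pair:

```python
# Yen's Q3 sketch: residual correlations between item pairs after
# removing the unidimensional model's expected scores. Data and
# parameters are simulated placeholders.
import numpy as np

def q3(responses, theta, a, b):
    """responses: (n_persons, n_items) 0/1; theta: (n_persons,);
    a, b: (n_items,) 2PL parameters. Returns the Q3 matrix."""
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))  # expected scores
    resid = responses - p                             # model residuals
    return np.corrcoef(resid, rowvar=False)           # pairwise Q3

rng = np.random.default_rng(0)
theta = rng.normal(size=500)
a, b = np.ones(5), np.linspace(-1, 1, 5)
p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
x = (rng.random((500, 5)) < p).astype(float)
x[:, 4] = x[:, 3]                  # induce dependence between items 4 and 5
Q3 = q3(x, theta, a, b)
for i, j in zip(*np.triu_indices(5, k=1)):
    if abs(Q3[i, j]) > 0.2:        # conventional screening threshold
        print(f"items {i+1} and {j+1}: Q3 = {Q3[i, j]:+.2f}")
```

Applied to passage-based tests, the pairs flagged this way could then be checked against the "connected necessary information" criterion the study proposes.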

7.
This article addresses the issue of how to detect item preknowledge using item response time data in two computer-based large-scale licensure examinations. Item preknowledge is indicated by an unexpectedly short response time together with a correct response. Two samples were used for detecting item preknowledge in each examination. The first sample, from the early stage of the operational test, was used for item calibration. The second sample, from the late stage of the operational test, may feature item preknowledge. The purpose of this research was to explore whether there was evidence of item preknowledge and compromised items in the second sample, using the parameters estimated from the first sample. The results showed that for one nonadaptive operational examination, two items (of 111) were potentially exposed, and two candidates (of 1,172) showed some indication of preknowledge on multiple items. For another licensure examination that featured computerized adaptive testing, there was no indication of item preknowledge or compromised items. Implications for handling detected aberrant examinees and compromised items are discussed.
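The flagging logic sketched below is the generic response-time approach (a lognormal response-time model in the style of van der Linden); the article's exact model, parameter values, and thresholds are not reproduced here. A response is suspicious when it is correct and its standardized log-time residual falls far below expectation:

```python
# Sketch: flag (person, item) cells where a correct response was given
# implausibly fast under a lognormal response-time model. Parameters,
# threshold, and data are assumptions for illustration.
import numpy as np

def flag_preknowledge(rt, correct, beta, alpha, tau, z_cut=-2.33):
    """rt: (n_persons, n_items) response times in seconds; correct: 0/1;
    beta: item time intensity; alpha: item time discrimination;
    tau: (n_persons,) person speed. Returns boolean flag matrix."""
    # standardized log-time residual: log T ~ N(beta - tau, 1/alpha^2)
    z = alpha * (np.log(rt) - (beta - tau[:, None]))
    return (z < z_cut) & (correct == 1)

rng = np.random.default_rng(1)
n_p, n_i = 200, 20
beta = rng.normal(4.0, 0.3, n_i)           # ~55 s typical item time
alpha = rng.uniform(1.5, 2.5, n_i)
tau = rng.normal(0.0, 0.2, n_p)
rt = np.exp(beta - tau[:, None] + rng.normal(0, 1, (n_p, n_i)) / alpha)
correct = (rng.random((n_p, n_i)) < 0.7).astype(int)
rt[0, :5] = 3.0                            # person 0 answers 5 items in 3 s
correct[0, :5] = 1
flags = flag_preknowledge(rt, correct, beta, alpha, tau)
print("person 0 flagged items:", np.where(flags[0])[0])
```

In the two-sample design described above, beta and alpha would be calibrated on the early-stage sample and the flags computed on the late-stage sample.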

8.
Logistic regression has recently been advanced as a viable procedure for detecting differential item functioning (DIF). One of the advantages of this procedure is the considerable flexibility it offers in the specification of the regression equation. This article describes incorporating two ability estimates into a single regression analysis, with the result that substantially fewer items exhibit DIF. A comparable analysis conducted using the Mantel-Haenszel procedure yielded similar results. It is argued that by simultaneously conditioning on two relevant ability estimates, more accurate matching of examinees in the reference and focal groups is obtained, and thus multidimensional item impact is not mistakenly identified as DIF.
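The procedure reduces to comparing nested logistic regressions. Below is a sketch with two matching variables, using simulated placeholder data: a likelihood-ratio test on the added group term tests for uniform DIF (an ability-by-group interaction could be added for nonuniform DIF):

```python
# Logistic-regression DIF with two ability estimates: compare a model
# conditioning on both abilities against one that adds the group term.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def lr_dif_test(item, theta1, theta2, group):
    """item: 0/1 responses; theta1, theta2: two ability estimates;
    group: 0 = reference, 1 = focal. Returns (LR statistic, p-value)."""
    base = sm.Logit(item, sm.add_constant(
        np.column_stack([theta1, theta2]))).fit(disp=0)
    full = sm.Logit(item, sm.add_constant(
        np.column_stack([theta1, theta2, group]))).fit(disp=0)
    lr = 2 * (full.llf - base.llf)
    return lr, chi2.sf(lr, df=1)

rng = np.random.default_rng(2)
n = 2000
group = rng.integers(0, 2, n)
theta1, theta2 = rng.normal(size=n), rng.normal(size=n)
logit = 1.2 * theta1 + 0.8 * theta2 - 0.5      # no true group effect
item = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
lr, p = lr_dif_test(item, theta1, theta2, group)
print(f"LR = {lr:.2f}, p = {p:.3f}")            # expect non-significant
```

Conditioning on only one of the two abilities in the base model would let the group term absorb the other dimension's impact, which is exactly the misidentification the article argues against.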

9.
Educational Assessment, 2013, 18(2), 127-143
This study investigated factors related to score differences on computerized and paper-and-pencil versions of a series of primary K-3 reading tests. Factors studied included item and student characteristics. The results suggest that the score differences were related more to student than to item characteristics. These student characteristics include response-style variables, especially omitting, and socioeconomic status as measured by free-lunch eligibility. In addition, response style and socioeconomic status appear to be relatively independent factors in the score differences. Variables studied but not found to be related to the format score differences included association of items with a reading passage, item difficulty, and teacher versus computer administration of items. However, because this study is the first to examine the factors behind these score differences below Grade 3, and because a number of states are increasing computer testing in the primary grades, additional studies are needed to verify the importance of these two factors.

10.
In a previous simulation study of methods for assessing differential item functioning (DIF) in computer-adaptive tests (Zwick, Thayer, & Wingersky, 1993, 1994), modified versions of the Mantel-Haenszel and standardization methods were found to perform well. In that study, data were generated using the 3-parameter logistic (3PL) model, and this same model was assumed in obtaining item parameter estimates. In the current study, the 3PL data were used, but the Rasch model was assumed in obtaining the item parameter estimates, which determined the information table used for item selection. Although the obtained DIF statistics were highly correlated with the generating DIF values, they tended to be smaller in magnitude than in the 3PL analysis, resulting in a lower probability of DIF detection. This reduced sensitivity appeared to be related to a degradation in the accuracy of matching. Expected true scores from the Rasch-based computer-adaptive test tended to be biased downward, particularly for lower-ability examinees.
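For reference, here is a sketch of the Mantel-Haenszel common odds ratio and its ETS delta-scale transform, the statistic at the core of the modified procedure; the within-stratum tabulation shown is the standard version and omits the CAT-specific matching details from the study. Data are simulated placeholders:

```python
# Mantel-Haenszel DIF sketch: common odds ratio across matched score
# strata, transformed to the ETS delta (MH D-DIF) metric.
import numpy as np

def mh_delta(correct, group, strata):
    """correct: 0/1; group: 0 = reference, 1 = focal; strata: matching
    score level per examinee. Returns (alpha_MH, MH D-DIF)."""
    num = den = 0.0
    for s in np.unique(strata):
        m = strata == s
        n_t = m.sum()
        a = np.sum(m & (group == 0) & (correct == 1))   # ref correct
        b = np.sum(m & (group == 0) & (correct == 0))   # ref incorrect
        c = np.sum(m & (group == 1) & (correct == 1))   # focal correct
        d = np.sum(m & (group == 1) & (correct == 0))   # focal incorrect
        num += a * d / n_t
        den += b * c / n_t
    alpha = num / den
    return alpha, -2.35 * np.log(alpha)    # ETS delta metric

rng = np.random.default_rng(3)
n = 4000
group = rng.integers(0, 2, n)
strata = rng.integers(0, 5, n)                  # matched ability strata
p_correct = 0.3 + 0.1 * strata - 0.05 * group   # slight focal disadvantage
correct = (rng.random(n) < p_correct).astype(int)
alpha, d_dif = mh_delta(correct, group, strata)
print(f"alpha_MH = {alpha:.2f}, MH D-DIF = {d_dif:+.2f}")
```

The study's finding follows directly from this structure: if Rasch-based scores misplace examinees into the wrong strata, the within-stratum tables no longer compare truly comparable groups, attenuating the statistic.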

11.
12.
A comparison of animism in college males and females was made. The test instrument was the Crowell-Dole Information Scale, a self-report questionnaire about common objects. A total of 59.8 percent of all subjects indicated animistic tendencies. Chi-square analysis of the raw data indicated no significant difference in the incidence of animism between males and females. No significant difference was found between students who had taken one or more college biology courses and those with no formal training in biology.

13.
14.
Monte Carlo simulations with 20,000 replications are reported to estimate the probability of rejecting the null hypothesis regarding DIF using SIBTEST when there is DIF present and/or when impact is present due to differences on the primary dimension to be measured. Sample sizes are varied from 250 to 2000 and test lengths from 10 to 40 items. Results generally support previous findings for Type I error rates and power. Impact is inversely related to test length. The combination of DIF and impact, with the focal group having lower ability on both the primary and secondary dimensions, results in impact partially masking DIF so that items biased toward the reference group are less likely to be detected.
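What SIBTEST estimates is a weighted mean difference, beta, in studied-item performance between groups matched on a valid subtest. The sketch below computes a raw beta-hat and omits SIBTEST's regression correction and standard error, so it will absorb some impact as well as DIF, which is precisely the masking issue the study examines; all data are simulated placeholders:

```python
# Raw beta-hat sketch (SIBTEST's target quantity, without the
# regression correction): weighted mean difference in studied-item
# scores between groups matched on the valid-subtest score.
import numpy as np

def sibtest_beta(studied, valid_score, group):
    """studied: score on the studied item(s); valid_score: matching
    subtest score; group: 0 = reference, 1 = focal."""
    beta, n_total = 0.0, len(group)
    for k in np.unique(valid_score):
        m = valid_score == k
        ref = studied[m & (group == 0)]
        foc = studied[m & (group == 1)]
        if len(ref) and len(foc):
            beta += m.sum() / n_total * (ref.mean() - foc.mean())
    return beta

rng = np.random.default_rng(4)
n = 1000
group = rng.integers(0, 2, n)
theta = rng.normal(-0.5 * group, 1.0)       # impact: focal lower ability
valid_score = np.clip(np.round(10 + 4 * theta + rng.normal(0, 1.5, n)), 0, 20)
p_item = 1 / (1 + np.exp(-(theta - 0.2 * group)))   # DIF against focal
studied = (rng.random(n) < p_item).astype(float)
print(f"beta-hat = {sibtest_beta(studied, valid_score, group):+.3f}")
```

With a noisy or short valid subtest, matching degrades and part of the ability difference leaks into beta-hat, mirroring the inverse relationship between impact and test length reported above.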

15.
New high-frequency, automated data collection and analysis algorithms could offer new insights into complex learning processes, especially for tasks in which students have opportunities to generate unique open-ended artifacts such as computer programs. These approaches should be particularly useful because the need for scalable project-based and student-centered learning is growing considerably. In this article, we present studies focused on how students learn computer programming, based on data drawn from 154,000 code snapshots of computer programs under development by approximately 370 students enrolled in an introductory undergraduate programming course. We use methods from machine learning to discover patterns in the data and to predict final exam grades. We begin with a set of exploratory experiments that use fully automated techniques to investigate how much students change their programming behavior across all assignments in the course. The results show that students' change in programming patterns is only weakly predictive of course performance. We subsequently home in on a single assignment, mapping students' learning processes and trajectories and automatically identifying productive and unproductive (sink) states within these trajectories. Results show that our process-based metric has better predictive power for final exam grades than midterm grades do. We conclude with recommendations about the use of such methods for assessment, real-time feedback, and course improvement.
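A sketch of the general pipeline such studies follow (the article's actual snapshot features and models are not reproduced; the feature set and effect sizes below are assumptions): summarize each student's snapshot stream into per-student features, then cross-validate a model predicting the final exam score:

```python
# Sketch: per-student features derived from a code-snapshot stream,
# fed to a regularized regression predicting final exam scores.
# Features and the data-generating process are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n_students = 370

# Assumed per-student features:
#   0: snapshots per assignment, 1: mean edit size between snapshots,
#   2: fraction of snapshots failing to compile, 3: time spent in
#      "sink" (unproductive) states of the inferred trajectory.
X = rng.normal(size=(n_students, 4))
final_exam = 70 - 6 * X[:, 3] - 3 * X[:, 2] + rng.normal(0, 8, n_students)

scores = cross_val_score(Ridge(alpha=1.0), X, final_exam,
                         cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

Cross-validation matters here: with hundreds of snapshot-derived features and only a few hundred students, in-sample fit would badly overstate predictive power.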

16.
Principal component evaluation methods were applied to the statistical analysis of college students' comprehensive assessment scores. Using data produced by one college's comprehensive assessment scheme as the sample, the study measured how strongly the floating intellectual-education score affects the assessment results. The floating intellectual-education score was found to have a significant influence on the outcomes, providing a basis and reference for revising and improving that college's comprehensive student assessment scheme.
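A minimal sketch of the principal-component approach the study applies (the sub-score columns and weighting rule are assumptions, not the college's actual scheme): extract components from standardized sub-scores and weight them by explained variance to form a composite:

```python
# Principal-component composite sketch: standardize sub-scores, extract
# components, weight by explained variance. Columns are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
# columns: intellectual (incl. floating adjustment), moral, physical,
# extracurricular sub-scores for 300 students (simulated placeholders)
scores = rng.normal(75, 8, size=(300, 4))

Z = StandardScaler().fit_transform(scores)
pca = PCA().fit(Z)
weights = pca.explained_variance_ratio_         # component weights
composite = pca.transform(Z) @ weights          # weighted component sum

print("variance explained per component:", np.round(weights, 3))
print("first 5 composite scores:", np.round(composite[:5], 2))
```

Examining the loadings of the intellectual-education column across components is one way to gauge the influence the study reports: a sub-score dominating the high-variance components will dominate the composite.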

17.
Recent studies have shown that restricting review and answer change opportunities on computerized adaptive tests (CATs) to items within successive blocks reduces time spent in review, satisfies most examinees' desires for review, and controls against distortion in proficiency estimates resulting from intentional incorrect answering of items prior to review. However, restricting review opportunities on CATs may not prevent examinees from artificially raising proficiency estimates by using judgments of item difficulty to signal when to change previous answers. We evaluated six strategies for using item difficulty judgments to change answers on CATs and compared the results to those from examinees reviewing and changing answers in the usual manner. The strategy conditions varied in terms of when examinees were prompted to consider changing answers and in the information provided about the consistency of the item selection algorithm. We found that examinees fared best on average when they reviewed and changed answers in the usual manner. The best gaming strategy was one in which the examinees knew something about the consistency of the item selection algorithm and were prompted to change responses only when they were unsure about answer correctness and sure about their item difficulty judgments. However, even this strategy did not produce a mean gain in proficiency estimates.

18.
Even if national and international assessments are designed to be comparable, subsequent psychometric analyses often reveal differential item functioning (DIF). Central to achieving comparability is examining the presence of DIF and, if DIF is found, investigating its sources to ensure that differentially functioning items do not lead to bias. In this study, sources of DIF were examined using think-aloud protocols. Think-aloud protocols were conducted with expert reviewers who compared the English and French versions of 40 items previously identified as DIF (N = 20) and non-DIF (N = 20). Three highly trained and experienced experts in verifying and accepting or rejecting multilingual versions of curriculum and testing materials for government purposes participated in this study. Although there was a considerable amount of agreement in the identification of differentially functioning items, the experts did not consistently identify and distinguish DIF and non-DIF items. Our analyses of the think-aloud protocols identified particular linguistic, general pedagogical, content-related, and cognitive factors related to sources of DIF. Implications are provided for identifying DIF before the actual administration of tests at national and international levels.

19.
This study focused on the effects of administration mode (computer-adaptive test [CAT] versus self-adaptive test [SAT]), item-by-item answer feedback (present versus absent), and test anxiety on results obtained from computerized vocabulary tests. Examinees were assigned at random to four testing conditions (CAT with feedback, CAT without feedback, SAT with feedback, SAT without feedback). Examinees completed the Test Anxiety Inventory (Spielberger, 1980) before taking their assigned computerized tests. Results showed that the CATs were more reliable and took less time to complete than the SATs. Administration time for both the CATs and SATs was shorter when feedback was provided than when it was not, and this difference was most pronounced for examinees at medium to high levels of test anxiety. These results replicate prior findings regarding the precision and administrative efficiency of CATs and SATs but point to new possible benefits of including answer feedback on such tests.

20.
This article develops a conceptual framework that addresses score comparability. The intent of the framework is to help identify and organize threats to comparability in a particular assessment situation. Aspects of testing situations that might threaten score comparability are delineated, procedures for evaluating the degree of score comparability are described, and suggestions are made about how to minimize the effects of potential threats. The situations considered are restricted to those in which test developers intend to (a) use scores on two or more tests interchangeably, (b) collect data that allow for the conversion of scores on each of the tests to a common scale, and (c) use the scores to make decisions about individuals. Comparability of scores on alternate forms of performance assessments, on adaptive and paper-and-pencil tests, and on alternate pools used for computerized adaptive tests is considered within the framework.

