Similar literature
 20 similar documents found (search time: 0 ms)
1.
Two new methods have been proposed to determine unexpected sum scores on sub-tests (testlets) both for paper-and-pencil tests and computer adaptive tests. A method based on a conservative bound using the hypergeometric distribution, denoted p, was compared with a method where the probability for each score combination was calculated using a highest density region (HDR). Furthermore, these methods were compared with the standardized log-likelihood statistic with and without a correction for the estimated latent trait value (denoted as l*z and lz, respectively). Data were simulated on the basis of the one-parameter logistic model, and both parametric and non-parametric logistic regression were used to obtain estimates of the latent trait. Results showed that it is important to take the trait level into account when comparing subtest scores. In a nonparametric item response theory (IRT) context, an adapted version of the HDR method was a powerful alternative to p. In a parametric IRT context, results showed that l*z had the highest power when the data were simulated conditionally on the estimated latent trait level.
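As an illustration of the standardized log-likelihood statistic mentioned in this abstract, the following is a minimal sketch of the uncorrected lz under the one-parameter logistic model. It uses the textbook definition of lz from the person-fit literature, not code from the paper; the function names `rasch_prob` and `lz`, and the example item difficulties, are assumptions made for illustration. The corrected l*z is not implemented here.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def lz(responses, theta, difficulties):
    """Uncorrected standardized log-likelihood person-fit statistic.

    responses    -- list of 0/1 item scores (hypothetical data)
    theta        -- latent trait value, treated here as known
    difficulties -- item difficulty parameters, one per item
    """
    log_lik = expected = variance = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_prob(theta, b)
        q = 1.0 - p
        log_lik += x * math.log(p) + (1 - x) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    # The variance is zero if every item has p = 0.5; lz is undefined then.
    return (log_lik - expected) / math.sqrt(variance)


# Assumed example: four items, examinee at theta = 0.
difficulties = [-2.0, -1.0, 1.0, 2.0]
consistent = [1, 1, 0, 0]  # easy items right, hard items wrong
aberrant = [0, 0, 1, 1]    # easy items wrong, hard items right
print(lz(consistent, 0.0, difficulties), lz(aberrant, 0.0, difficulties))
```

Large negative values flag response patterns that are unlikely given the trait level, which is why the aberrant pattern above scores lower than the consistent one.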

2.
In the service of educational accountability, student achievement tests are being used to measure constructs quite unlike those envisioned by test developers. Scores are compared to cut points to create classifications like “proficient”; scores are combined over time to measure growth; student scores are aggregated to measure the effectiveness of teachers, schools, and school districts; indices are created to measure college and career readiness. These and other new uses rely on derived scores created to measure new constructs. The field of educational and psychological measurement has largely ignored these significant, consequential measurement applications. The conceptual frameworks and analytical tools of educational and psychological measurement should be used to study such derived scores and the validity of their uses and interpretations.

3.
This study focused on the effects of administration mode (computer-adaptive test [CAT] versus self-adaptive test [SAT]), item-by-item answer feedback (present versus absent), and test anxiety on results obtained from computerized vocabulary tests. Examinees were assigned at random to four testing conditions (CAT with feedback, CAT without feedback, SAT with feedback, SAT without feedback). Examinees completed the Test Anxiety Inventory (Spielberger, 1980) before taking their assigned computerized tests. Results showed that the CATs were more reliable and took less time to complete than the SATs. Administration time for both the CATs and SATs was shorter when feedback was provided than when it was not, and this difference was most pronounced for examinees at medium to high levels of test anxiety. These results replicate prior findings regarding the precision and administrative efficiency of CATs and SATs but point to new possible benefits of including answer feedback on such tests.

4.
In a previous simulation study of methods for assessing differential item functioning (DIF) in computer-adaptive tests (Zwick, Thayer, & Wingersky, 1993, 1994), modified versions of the Mantel-Haenszel and standardization methods were found to perform well. In that study, data were generated using the 3-parameter logistic (3PL) model and this same model was assumed in obtaining item parameter estimates. In the current study, the 3PL data were used but the Rasch model was assumed in obtaining the item parameter estimates, which determined the information table used for item selection. Although the obtained DIF statistics were highly correlated with the generating DIF values, they tended to be smaller in magnitude than in the 3PL analysis, resulting in a lower probability of DIF detection. This reduced sensitivity appeared to be related to a degradation in the accuracy of matching. Expected true scores from the Rasch-based computer-adaptive test tended to be biased downward, particularly for lower-ability examinees.

5.
Berglund, G. W. (1970). The Effect of Four Sets of Test Instructions on Scores in Mental Ability Tests. Scand. J. Educ. Res. 14, 31-38. Four hundred and eighteen Swedish children (11-year-olds) were divided randomly into four experimental groups. Three mental ability tests of the factor type were administered to the groups by means of four different sets of instructions. In the first group the tests were presented as intelligence tests and in the second group as achievement tests. The third group received the original instructions of the tests and the fourth group received routine instructions. It is concluded (a) that the four instructions do not differentiate the groups in power tests, and (b) that the routine instruction does not affect the subjects’ working speed to the same degree as the other instructions.

6.
A College Board-sponsored survey of a nationally representative sample of 1995–96 SAT takers yielded a database for more than 4,000 examinees, about 500 of whom had attended formal coaching programs outside their schools. Several alternative analytical methods were used to estimate the effects of coaching on SAT I: Reasoning Test scores. The various analyses produced slightly different estimates. All of the estimates, however, suggested that the effects of coaching are far less than is claimed by major commercial test preparation companies. The revised SAT does not appear to be any more coachable than its predecessor.

7.
This essay seeks to establish a metaphor that likens the professional practice of teaching to the attributes and training of an offensive lineman in American football. Effective classroom instruction does not rely exclusively on a rare set of talents but rather rests on a commitment to the work of teaching. Like the position of offensive lineman, the profession of teaching is one of service. Moreover, it is one in which a person's performance can blossom through intense determination. An invitation is offered to serve as an effective teacher.

8.
9.
A sample of college-bound juniors from 275 high schools took a test consisting of 70 math questions from the SAT. A random half of the sample was allowed to use calculators on the test. Both genders and three ethnic groups (White, African American, and Asian American) benefitted about equally from being allowed to use calculators; Latinos benefitted slightly more than the other groups. Students who routinely used calculators on classroom mathematics tests were relatively advantaged on the calculator test. Test speededness was about the same whether or not students used calculators. Calculator effects on individual items ranged from positive through neutral to negative and could either increase or decrease the validity of an item as a measure of mathematical reasoning skills. Calculator effects could be either present or absent in both difficult and easy items.

10.
Reliability of Scores From Teacher-Made Tests   Total citations: 1, self-citations: 0, citations by others: 1
Reliability is the property of a set of test scores that indicates the amount of measurement error associated with the scores. Teachers need to know about reliability so that they can use test scores to make appropriate decisions about their students. The level of consistency of a set of scores can be estimated by using the methods of internal analysis to compute a reliability coefficient. This coefficient, which can range between 0.0 and +1.0, usually has values around 0.50 for teacher-made tests and around 0.90 for commercially prepared standardized tests. Its magnitude can be affected by such factors as test length, test-item difficulty and discrimination, time limits, and certain characteristics of the group: the extent of their testwiseness, level of student motivation, and homogeneity in the ability measured by the test.
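To make the idea of an internal-consistency reliability coefficient concrete, here is a minimal sketch that computes Cronbach's alpha from an examinee-by-item score matrix and applies the Spearman-Brown formula to show how test length affects reliability. The abstract does not say which coefficient the article uses, so choosing alpha is an assumption, as are the function names and the toy data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of examinee rows, each a list of item scores."""
    k = len(scores[0])  # number of items
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test is lengthened by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical 0/1 scores for five students on four items.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(cronbach_alpha(scores))
# Doubling the length of a test with reliability 0.50:
print(spearman_brown(0.50, 2))
```

The Spearman-Brown step assumes the added items are parallel to the existing ones, which is rarely exactly true for classroom tests.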

11.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.

12.
13.
This study explored the use of kernel equating for integrating and extending two procedures proposed for assessing item order effects in test forms that have been administered to randomly equivalent groups. When these procedures are used together, they can provide complementary information about the extent to which item order effects impact test scores, in overall score distributions and also at specific test scores. In addition to detecting item order effects, the integrated procedures also suggest the equating function that most adequately adjusts the scores to mitigate the effects. To demonstrate, the statistical equivalences of alternate versions of two large-volume advanced placement exams were assessed.

14.
A comparison of animism in college males and females was made. The test instrument was the Crowell-Dole Information Scale, a self-report questionnaire of common objects. A total of 59.8 percent of all subjects indicated animistic tendencies. Chi-square analysis of the raw data indicated no significant difference in the incidence of animism for males and females. No significant difference was found between those students having one or more college biology courses and those with no formal training in biology.

15.
Previous studies have shown that several key variables influence student achievement in geometry, but no research has been conducted to determine how these variables interact. A model of achievement in geometry was tested on a sample of 102 high school students. Structural equation modeling was used to test hypothesized relationships among variables linked to successful problem solving in geometry. These variables, including motivation, achievement emotions, pictorial representation, and categorization skills, were examined for their influence on geometry achievement. Results indicated that the model fit well. Achievement emotions, specifically boredom and enjoyment, had a significant influence on student motivation. Student motivation influenced students’ use of pictorial representations and achievement. Pictorial representation also directly influenced achievement. Categorization skills had a significant influence on pictorial representations and student achievement. The implications of these findings for geometry instruction and for future research are discussed.

16.
17.
18.
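Based on a study of function S-rough sets and the dynamic programming algorithm, this paper proposes the concepts of similarity and credibility and gives a scheme and steps for scoring non-standardized test questions, in which the key steps are transfer processing and computing the length of the longest common subsequence. The paper mainly describes transfer processing based on function S-rough sets, analyzes the structure of the solution and the method for computing the length of the longest common subsequence, and finally gives the source code for the transfer function and for the function that computes the longest common subsequence length.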
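The longest-common-subsequence step named in this abstract can be sketched with the standard dynamic programming recurrence. This is a generic illustration, not the source code the paper provides; the `similarity` helper, which divides the LCS length by the length of the answer key, is a hypothetical stand-in for the paper's similarity measure, whose exact definition the abstract does not give.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Uses the standard dynamic programming recurrence, keeping only the
    previous row of the table to save memory.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(response, key):
    """Hypothetical similarity score: LCS length relative to the key length."""
    if not key:
        return 0.0
    return lcs_length(response, key) / len(key)


print(lcs_length("ABCBDAB", "BDCABA"))  # 4, e.g. "BCBA"
print(similarity("x = a + b", "x = a + b"))  # 1.0
```

The running time is proportional to the product of the two sequence lengths, which is acceptable for short free-text answers.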

19.
Randomly selected fifth, seventh, ninth, and eleventh graders (sixty from each grade) were given an ability test. The score and the time taken were used to test the hypotheses of no negative linear relationship and no curvilinear relationship between test score and test time. Although no significant linear relationships were found, significant curvilinear regressions of time on score were found in grades seven and nine. The strength of these significant relationships was relatively low in both grades.

20.
The first generation of computer-based tests depends largely on multiple-choice items and constructed-response questions that can be scored through literal matches with a key. This study evaluated scoring accuracy and item functioning for an open-ended response type where correct answers, posed as mathematical expressions, can take many different surface forms. Items were administered to 1,864 participants in field trials of a new admissions test for quantitatively oriented graduate programs. Results showed automatic scoring to approximate the accuracy of multiple-choice scanning, with all processing errors stemming from examinees improperly entering responses. In addition, the items functioned similarly in difficulty, item-total relations, and male-female performance differences to other response types being considered for the measure.
