Similar literature
Found 20 similar articles; search took 31 ms.
1.
The trustworthiness of low-stakes assessment results largely depends on examinee effort, which can be measured by the amount of time examinees devote to items using solution behavior (SB) indices. Because SB indices are calculated for each item, they can be used to understand how examinee motivation changes across items within a test. Latent class analysis (LCA) was used with the SB indices from three low-stakes assessments to explore patterns of solution behavior across items. Across tests, the favored models consisted of two classes, with Class 1 characterized by high and consistent solution behavior (>90% of examinees) and Class 2 by lower and less consistent solution behavior (<10% of examinees). Additional analyses provided supportive validity evidence for the two-class solution with notable differences between classes in self-reported effort, test scores, gender composition, and testing context. Although results were generally similar across the three assessments, striking differences were found in the nature of the solution behavior pattern for Class 2 and the ability of item characteristics to explain the pattern. The variability in the results suggests motivational changes across items may be unique to aspects of the testing situation (e.g., content of the assessment) for less motivated examinees.
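A minimal Python sketch of how item-level SB indices might be derived from response times. The per-item thresholds, the sample values, and both function names are illustrative assumptions; the abstract does not report how thresholds were set.

```python
def solution_behavior(response_times, thresholds):
    """Flag each item 1 (solution behavior) or 0 (rapid guess).

    Both arguments are per-item times in seconds. The thresholds are
    hypothetical; the abstract does not state the values actually used.
    """
    if len(response_times) != len(thresholds):
        raise ValueError("need exactly one threshold per item")
    return [1 if t >= th else 0 for t, th in zip(response_times, thresholds)]


def proportion_solution_behavior(sb_flags):
    """Share of items on which one examinee showed solution behavior."""
    return sum(sb_flags) / len(sb_flags)


# One examinee, three items: the second response is faster than its threshold.
flags = solution_behavior([12.0, 2.0, 30.0], [5.0, 5.0, 10.0])
print(flags, proportion_solution_behavior(flags))
```

Keeping the flags at the item level, rather than only the per-examinee proportion, is what allows a pattern across items to be modeled, as the latent class analysis in the abstract does.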

2.
This study examined the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content‐representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
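The mean b‐value method mentioned above can be sketched in Python as follows. This is an illustration under the usual one‐parameter assumption that the two calibrations differ only by a shift in the difficulty scale; the function names and difficulty values are hypothetical and not taken from the study.

```python
def mean_b_constant(anchor_b_base, anchor_b_new):
    """Equating constant: mean anchor difficulty on the base scale
    minus mean anchor difficulty on the new form's scale."""
    if not anchor_b_base or len(anchor_b_base) != len(anchor_b_new):
        raise ValueError("anchor lists must be non-empty and of equal length")
    mean_base = sum(anchor_b_base) / len(anchor_b_base)
    mean_new = sum(anchor_b_new) / len(anchor_b_new)
    return mean_base - mean_new


def to_base_scale(values, constant):
    """Shift new-form difficulties (or abilities) onto the base scale."""
    return [v + constant for v in values]


# Hypothetical anchor difficulties from two separate calibrations.
constant = mean_b_constant([0.0, 1.0, -1.0], [0.5, 1.5, -0.5])
print(constant, to_base_scale([0.5, 2.0], constant))
```

Because the constant is an average over the anchor items, a short anchor set lets a single item with drifting difficulty move the whole scale, which is consistent with the instability the abstract reports for shorter anchors.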

3.
The objective of this project was to develop a multiple choice test of graphing skills appropriate for science students from grades seven through twelve. Skills associated with the construction and interpretation of line graphs were delineated, and nine objectives encompassing these skills were developed. Twenty-six items were then constructed to measure these objectives. To establish content validity, items and objectives were submitted to a panel of reviewers. The experts agreed over 94% of the time on assignment of items to objectives and 98% on the scoring of items. TOGS was first administered to 119 7th, 9th, and 11th graders. The reliability (KR-20) was 0.81. Poorly functioning items were rewritten based on the item difficulty and discrimination data. The revised version of the test was given to 377 7th through 12th grade students. Total scores ranged from 2 to 26 correct (mean = 13.3, SD = 5.3). The reliability (KR-20) was 0.83 for all subjects and ranged from 0.71 for eighth graders to 0.88 for ninth graders. Point biserial correlations showed 24 of the 26 items above 0.30 with an average value of 0.43. It was concluded from this and other data that TOGS was a valid and reliable instrument for measuring graphing abilities.
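For readers unfamiliar with the KR-20 coefficient reported above, the following Python sketch shows the standard formula for dichotomously scored items. The function name and the tiny data set are illustrative assumptions, and the population variance of total scores is used here as one common convention; the abstract does not say which variance estimator the authors used.

```python
def kr20(scores):
    """Kuder-Richardson formula 20.

    scores: list of examinee rows, each a list of 0/1 item scores.
    """
    n = len(scores)        # number of examinees
    k = len(scores[0])     # number of items
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    # Population variance of the total scores.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Proportion correct per item, then the sum of p * (1 - p).
    p = [sum(row[j] for row in scores) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses: four examinees, three items.
print(kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))
```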

4.
This paper demonstrates and discusses the use of think aloud protocols (TAPs) as an approach for examining and confirming sources of differential item functioning (DIF). The TAPs are used to investigate to what extent surface characteristics of the items that are identified by expert reviews as sources of DIF are supported by empirical evidence from examinee thinking processes in the English and French versions of a Canadian national assessment. In this research, the TAPs confirmed sources of DIF identified by expert reviews for 10 out of 20 DIF items. The moderate agreement between TAPs and expert reviews indicates that evidence from expert reviews cannot be considered sufficient in deciding whether DIF items are biased and such judgments need to include evidence from examinee thinking processes.

5.
Two hundred twenty-six experts on teacher stress and burnout were surveyed to determine the relevance of 49 teacher stress items to their overall concepts of teacher stress. Relevance ratings for all items fell within the relevant-to-quite-relevant range, and all items were retained for inclusion in the Teacher Stress Inventory. The most relevant items dealt with feeling unable to cope and experiencing physical exhaustion; the least were related to student motivation problems. Overall, the experts were congruent to a significant degree in the way they rated the items. When the group was compared in terms of background variables, the younger respondents, those who present stress management workshops, and those who conduct qualitative or quantitative stress research perceived some of the stress factors as being more relevant to teacher stress than did the older respondents, those not publishing, and those conducting combinations of qualitative and quantitative stress research.

6.
The purpose of this article is to address a major gap in the instructional sensitivity literature on how to develop instructionally sensitive assessments. We propose an approach to developing and evaluating instructionally sensitive assessments in science and test this approach with one elementary life‐science module. The assessment we developed was administered to 125 students in seven classrooms. The development approach considered three dimensions of instructional sensitivity; that is, assessment items should: represent the curriculum content, reflect the quality of instruction, and have formative value for teaching. Focusing solely on the first dimension, representation of the curriculum content, this study was guided by the following research questions: (1) What science module characteristics can be systematically manipulated to develop items that prove to be instructionally sensitive? and (2) Are the instructionally sensitive assessments developed sufficiently valid to make inferences about the impact of instruction on students' performance? In this article, we describe our item development approach and provide empirical evidence to support validity arguments about the developed instructionally sensitive items. Results indicated that: (1) manipulations of the items at different proximities to vary their sensitivity were aligned with the rules for item development and also corresponded with pre‐to‐post gains; and (2) the items developed at different distances from the science module showed a pattern of pre‐to‐post gain consistent with their instructional sensitivity, that is, the closer the items were to the science module, the larger the observed gains and effect sizes. © 2012 Wiley Periodicals, Inc. J Res Sci Teach 49: 691–712, 2012

7.
A hierarchy of analytic concepts and processes was derived from the related substantive and syntactical structures of geographic method and submitted to study as a test of a central premise underlying the curriculum reform movement in social studies. Performance data were obtained from 84 instructed second graders on test items criterion referenced to specific capabilities of the hierarchy. Application of Murray’s (5) misclassification model yielded the following results: (a) Values of correct classification probabilities for a modified five-level hierarchy were high, indicating over 90% of the Ss were correctly classified on all levels of the modified scale; and (b) Overall test of fit was well within nonsignificant boundaries, indicating the data could be assumed to fit the hypothesized latent pure scale model. Implications for curriculum were derived.

8.
Drawing upon research which conceptualizes anger as a multidimensional construct including three associated components, namely anger experience (affective anger), hostility (anger cognitions), and anger expression (aggression, assertion, and withdrawal), the preliminary development of a Multidimensional School Anger Inventory (MSAI) for adolescents is described. This scale is a modification and extension of the School Anger Inventory and was developed to assess the affective, cognitive, and expressive aspects of anger using items having school-relevant content. Data were collected through personal interviews of 202 males from three different schools: School 1 included general education students in a parochial school in grades 6 through 12; School 2 included students attending general education or mainstreamed special education classes at a public intermediate school; and School 3 included students participating in a public day treatment program for youths with serious emotional disturbance. Scale development is discussed focusing on item development and scale refinement through item and factor analyses. Four factors were identified that accounted for 43.3% of the common variance. Anger Experience, Cynical Attitudes, and Anger Expression were identified as major clusters with the anger expression items bifurcating into Destructive Expression and Positive Coping components. The resulting 31-item scale has strong psychometric qualities and appears to have promise for use in research, treatment planning, and outcome evaluations. © 1998 John Wiley & Sons, Inc.

9.
This article discusses the development and validation of a measure of adolescent students' perceived belonging or psychological membership in the school environment. An initial set of items was administered to early adolescent students in one suburban middle school (N = 454) and two multi-ethnic urban junior high schools (N = 301). Items with low variability and items detracting from scale reliability were dropped, resulting in a final 18-item Psychological Sense of School Membership (PSSM) scale, which had good internal consistency reliability with both urban and suburban students and in both English and Spanish versions. Significant findings of several hypothesized subgroup differences in psychological school membership supported scale construct validity. The quality of psychological membership in school was found to be substantially correlated with self-reported school motivation, and to a lesser degree with grades and with teacher-rated effort in the cross-sectional scale development studies and in a subsequent longitudinal project. Implications for research and for educational practice, especially with at-risk students, are discussed.

10.
Everyone experiences some anxiety while taking an examination. High-test-anxious (HTA) and low-test-anxious (LTA) students are described by two characteristic differences: frequency and intensity of anxious responses and attentional direction to testing cues. The purposes of this study were threefold: (1) to report “potent” testing cues (i.e., 90% response agreement for both intensity and frequency) that were identified by HTA and LTA students; (2) to report differences between HTA and LTA students for frequencies and intensities of responses to testing cues; and (3) to report differences between HTA and LTA students of attentional direction to testing cues. A pool of 396 males and females who were enrolled in physical geology completed the State-Trait Anxiety Inventory. A random sample consisting of 93 HTA and 40 LTA subjects completed the Test Cues Identification Questionnaire (TCIQ). The TCIQ consists of 28 disruptive items and 27 helpful items. Subjects responded with both frequency and intensity ratings for all of the 55 items in the TCIQ. Results revealed that 22 items were viewed by subjects as “potent” testing cues. Empirical evidence obtained did not support previous theoretical reports of differences between HTA and LTA students for either frequency and intensity of anxious responses or attentional direction to the set of disruptive and helpful testing cues. Although test anxiousness did not appear to be associated with those two characteristic differences, a discriminant analysis revealed 24 items in the TCIQ which significantly, χ²(24) = 47.59, p < 0.004, separated HTA and LTA subjects' responses. Apparently, HTA and LTA students differ in their responses to specific disruptive and helpful cues but not in their responses to the set of testing cues as was previously postulated.

11.
The use of content validity as the primary assurance of the measurement accuracy for science assessment examinations is questioned. An alternative accuracy measure, item validity, is proposed. Item validity is based on research using qualitative comparisons between (a) student answers to objective items on the examination, (b) clinical interviews with examinees designed to ascertain their knowledge and understanding of the objective examination items, and (c) student answers to essay examination items prepared as an equivalent to the objective examination items. Calculations of item validity are used to show that selected objective items from the science assessment examination overestimated the actual student understanding of science content. Overestimation occurs when a student correctly answers an examination item, but for a reason other than that needed for an understanding of the content in question. There was little evidence that students incorrectly answered the items studied for the wrong reason, resulting in underestimation of the students' knowledge. The equivalent essay items were found to limit the amount of mismeasurement of the students' knowledge. Specific examples are cited and general suggestions are made on how to improve the measurement accuracy of objective examinations.

12.
Two problems in test development relate to the use of illustrations: (1) Do illustrated items perform better than written items, and (2) Does item performance vary as a function of the type and size of the illustration? A sample of 63 tests was drawn from all the Air Force Specialty Knowledge Tests containing illustrations. These 63 tests had been administered to approximately 28,261 airmen under operational conditions. Item statistics between illustrated and written items drawn from the same content areas were compared using F ratios. The results indicated: (1) That illustrated items in general performed slightly better than matched written items; (2) That the best-performing category of illustrated items was tables.

13.
As part of the validation research process for a new self-report social-emotional test for children, internalizing social-emotional symptoms (e.g., depression, anxiety, social withdrawal, somatic complaints, positive and negative affectivity) of a group of elementary-age gifted students (n = 65) were contrasted with those of a carefully matched (by gender and age) comparison group of non-gifted students (n = 65). Subjects completed the Internalizing Symptoms Scale for Children (ISSC) (Merrell & Walters, 1996), a self-report measure of internalizing symptoms, affect, and cognition. The gifted students reported significantly fewer internalizing symptoms than did the comparison group. An analysis of critical items separating the two groups indicated that the gifted students differed most substantially from their non-gifted peers on ISSC items that relate to self-efficacy and perceived self-importance. Although these types of self-perceptions are considered to be a peripheral rather than a central component of specific internalizing disorders, it is hypothesized that their positive presence in children may act as a “buffering” factor, possibly insulating children from insults to their social-emotional functioning that may lead to the development of internalizing forms of psychopathology. The results of this investigation are discussed in terms of their relationship to conflicting previous research in this area, to future research needs in the study of social-emotional symptoms and development of gifted children, and in terms of the construct validity evidence for the ISSC. © 1996 John Wiley & Sons, Inc.

14.
A rapidly expanding arena for item response theory (IRT) is in attitudinal and health‐outcomes survey applications, often with polytomous items. In particular, there is interest in computer adaptive testing (CAT). Meeting model assumptions is necessary to realize the benefits of IRT in this setting, however. Although local item dependence has been investigated both for polytomous items in fixed‐form settings and for dichotomous items in CAT settings, there have been no publications applying local item dependence detection methodology to polytomous items in CAT despite its central importance to these applications. The current research uses a simulation study to investigate the extension of widely used pairwise statistics, Yen's Q3 statistic and Pearson's X² statistic, in this context. The simulation design and results are contextualized throughout with a real item bank of this type from the Patient‐Reported Outcomes Measurement Information System (PROMIS).
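As a reminder of what Yen's Q3 measures, here is a minimal Python sketch: for an item pair it is the correlation, across respondents, of the residuals between observed and model‐expected item scores. The expected scores are assumed to come from an already fitted IRT model; the function names and sample numbers are hypothetical, and the sketch does not reproduce the CAT extension studied in the paper.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def q3(obs_i, exp_i, obs_j, exp_j):
    """Yen's Q3 for items i and j: correlation of their score residuals."""
    res_i = [o - e for o, e in zip(obs_i, exp_i)]
    res_j = [o - e for o, e in zip(obs_j, exp_j)]
    return pearson(res_i, res_j)


# Hypothetical observed scores and model-expected scores for four respondents.
print(q3([1, 0, 1, 0], [0.5] * 4, [1, 0, 1, 0], [0.5] * 4))
```

Values far from zero suggest that the pair shares variance the model does not account for, which is the local dependence the abstract is concerned with.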

15.
Improving professional attitudes and behaviors requires critical self-reflection. Research on reflection is necessary to understand professionalism among medical students. The aims of this prospective validation study at the Mayo Medical School and Cleveland Clinic Lerner College of Medicine were: (1) to develop and validate a new instrument for measuring reflection on professionalism, and (2) to determine whether learner variables are associated with reflection on the gross anatomy experience. An instrument for assessing reflections on gross anatomy, comprising 12 items structured on five‐point scales, was developed. Factor analysis revealed a three‐dimensional model including low reflection (four items), moderate reflection (five items), and high reflection (three items). Item mean scores ranged from 3.05 to 4.50. The overall mean for all 12 items was 3.91 (SD = 0.52). Internal consistency reliability (Cronbach's α) was satisfactory for individual factors and overall (Factor 1 α = 0.78; Factor 2 α = 0.69; Factor 3 α = 0.70; Overall α = 0.75). Simple linear regression analysis indicated that reflection scores were negatively associated with teamwork peer scores (P = 0.018). The authors report the first validated measurement of medical student reflection on professionalism in gross anatomy. Critical reflection is a recognized component of professionalism and may be important for behavior change. This instrument may be used in future research on professionalism among medical students. Anat Sci Educ 6: 232–238. © 2012 American Association of Anatomists.
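The internal consistency figures above are Cronbach's α values. A minimal Python sketch of the standard formula follows; the function name and the two‐item data set are illustrative assumptions and have no connection to the study's data.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha.

    item_scores: list of items, each a list of scores across the same
    respondents (for example five-point ratings).
    """
    k = len(item_scores)
    # Total score per respondent, summing across items.
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_var_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))


# Hypothetical ratings: two items, three respondents.
print(cronbach_alpha([[1, 2, 3], [1, 3, 2]]))
```

The sample variance is used for both the item variances and the total-score variance, so the choice of denominator cancels in the ratio.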

16.
17.
Federal policy on alternate assessment based on modified academic achievement standards (AA-MAS) inspired this research. Specifically, an experimental study was conducted to determine whether tests composed of modified items would have the same level of reliability as tests composed of original items, and whether these modified items helped reduce the performance gap between AA-MAS eligible and ineligible students. Three groups of eighth-grade students (N = 755) defined by eligibility and disability status took original and modified versions of reading and mathematics tests. In a third condition, the students were provided limited reading support along with the modified items. Changes in reliability across groups and conditions for both the reading and mathematics tests were determined to be minimal. Mean item difficulties within the Rasch model were shown to decrease more for students who would be eligible for the AA-MAS than for non-eligible groups, revealing evidence of differential boost. Exploratory analyses indicated that shortening the question stem may be a highly effective modification, and that adding graphics to reading items may be a poor modification.

18.
According to the Educational Policies Commission, the central purpose of education in this country is to lead students to develop the ability to think. No standard way exists to measure whether or not the schools are achieving that purpose. The EPC identified 10 rational powers as constituting the essence of the ability to think. The research reported here was done to ascertain which rational powers are measured by commercially-available, standardized tests in science. A universe of standardized tests was defined and 12 specific tests were randomly selected for analysis. All instruments were validated by a panel of experts, as was a training program for the four teacher-evaluators who applied previously-evaluated criteria to each test item to determine which rational powers had to be used in responding to the item. Seven of the 12 standardized tests analyzed in the research required that students use only the rational power of recall in responding. In fact, approximately 90% of the items analyzed from all tests required only recall. Students were required to use other rational powers only rarely when responding to a test item and the use of the rational powers of comparing, imagining, and analyzing was not necessary on any of the test items examined. The conclusion was drawn that the producers of standardized tests are not concerned with measuring student achievement of the rational powers. The purpose which runs through and strengthens all other educational purposes, the common thread of education, is the development of the ability to think.

19.
This research investigated the nature of the method effect associated with positively worded items of the Life Orientation Test–Revised. In a first cross-sectional study (N = 11,028) the best fitting model posits 2 factors, representing general optimism and a specific factor associated with positive items. In a second longitudinal study (N = 203), a unified latent curve latent state–trait model was used to assess the developmental trajectory of the specific factor from 16 to 20 years. This factor contains a prevalence of trait versus state variance, and presents a relationship pattern with external criteria (e.g., depression) that differs from the one involving general optimism.

20.
The primary objective of this research was to compare various groups of Greek university students for their level of knowledge of Evolution by means of Natural Selection (ENS). For the purpose of the study, we used a well-known questionnaire, the Conceptual Inventory of Natural Selection (CINS), and 352 biology majors and non-majors from the University of Athens took part. A principal components analysis revealed problems with the items designed to assess the concepts of population stability, differential survival and variation inheritable; these items therefore need to be reconsidered. Nonetheless, the results of the CINS for each Greek sub-group showed that the higher the involvement in evolution education, the higher the students' performance on the CINS test. This linear correlation, together with other evidence, supports the CINS authors' claims about the usefulness of the CINS as an assessment of instruction. Unfortunately, Greek university students gave teleological and proximate answers to many of the CINS items. Comparisons between the least and most evolutionarily educated university students revealed that the latter gave more evolutionary answers. Oddly, advanced biology majors did not show an improvement over novice biology students on all 20 items of the CINS (only on 14 of the 20). They even gave more teleological answers than novice biology majors on the concept "natural resources are limited". Finally, Greek university students' level of knowledge of ENS seems to be closer to that of Canadian students than to that of US students.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号