Similar Literature
19 similar documents found (search time: 31 ms)
1.
States are increasingly requiring that public school teachers pass one or more tests as a condition for permanent employment. As a result of a recent federal court decision, these tests must now satisfy the same legal standards as other employment tests. Moreover, some of the measures used to assess teacher competence no longer rely on multiple-choice items. They now utilize various types of open-ended performance assessments. This article discusses how these developments may affect the adverse impact, reliability, validity, and pass-fail standards of teacher certification tests. The article concludes by recommending that such tests combine multiple-choice questions with open-ended tasks that focus on the common or critical situations that are likely to arise across the full range of practice settings for which the teacher is being certified or licensed.

2.
Using data from a large-scale exam, in this study we compared various designs for equating constructed-response (CR) tests to determine which design was most effective in producing equivalent scores across the two tests to be equated. In the context of classical equating methods, four linking designs were examined: (a) an anchor set containing common CR items, (b) an anchor set containing common CR items that were rescored, (c) an external multiple-choice (MC) anchor test, and (d) an equivalent groups design incorporating rescored CR items (no anchor test). The use of CR items without rescoring resulted in much larger bias than the other designs. The use of an external MC anchor resulted in the next-largest bias. The use of a rescored CR anchor and the equivalent groups design led to similar levels of equating error.
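
These designs rest on classical anchor-based equating methods. As a point of reference, here is a minimal sketch of one such method, chained linear equating through a common anchor; the function and variable names, and the toy data, are illustrative assumptions, not the study's implementation.

```python
import numpy as np

def linear_link(scores_from, scores_to):
    """Mean-sigma linear link from the scale of scores_from to the scale
    of scores_to, fit on data from a single group of examinees."""
    mu_f, sd_f = np.mean(scores_from), np.std(scores_from, ddof=1)
    mu_t, sd_t = np.mean(scores_to), np.std(scores_to, ddof=1)
    return lambda x: mu_t + (sd_t / sd_f) * (np.asarray(x) - mu_f)

def chained_linear_equate(x, grp1_x, grp1_anchor, grp2_anchor, grp2_y):
    """Chained linear equating for a common-item design: map form-X scores
    to the anchor scale (group 1), then the anchor scale to form Y (group 2)."""
    x_to_anchor = linear_link(grp1_x, grp1_anchor)
    anchor_to_y = linear_link(grp2_anchor, grp2_y)
    return anchor_to_y(x_to_anchor(x))

# Toy data: two nonequivalent groups sharing a common anchor test
rng = np.random.default_rng(7)
theta1, theta2 = rng.normal(0.0, 1.0, 500), rng.normal(0.3, 1.0, 500)
grp1_x = 30 + 6 * theta1 + rng.normal(0, 2, 500)          # group 1, form X
grp1_anchor = 15 + 3 * theta1 + rng.normal(0, 1.5, 500)   # group 1, anchor
grp2_y = 32 + 5 * theta2 + rng.normal(0, 2, 500)          # group 2, form Y
grp2_anchor = 15 + 3 * theta2 + rng.normal(0, 1.5, 500)   # group 2, anchor

print(chained_linear_equate([25.0, 35.0], grp1_x, grp1_anchor, grp2_anchor, grp2_y))
```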

3.
Performance assessments appear, on a priori grounds, likely to produce far more local item dependence (LID) than traditional multiple-choice tests do. This article (a) defines local item independence, (b) presents a compendium of causes of LID, (c) discusses some of LID's practical measurement implications, (d) details some empirical results for both performance assessments and multiple-choice tests, and (e) suggests some strategies for managing LID in order to avoid negative measurement consequences.

4.
Although test scores from similar tests in multiple-choice and constructed-response formats are highly correlated, equivalence in rankings may mask differences in substantive strategy use. The author used an experimental design and participant think-alouds to explore cognitive processes in mathematical problem solving among undergraduate examinees (N = 64). The study examined the effect of format on mathematics performance and strategy use for male and female examinees given stem-equivalent items. A statistically significant main effect of format on performance was found, with constructed-response items more difficult. The multiple-choice format was associated with more varied strategies, backward strategies, and guessing. Format was found to moderate the effect of problem conceptualization on performance. Results suggest that while the multiple-choice format may be adequate for ranking students on performance, the constructed-response format should be preferred for the many contemporary educational purposes that seek to provide nuanced information about student cognition.

5.
In this study we examined variations of the nonequivalent groups equating design for tests containing both multiple-choice (MC) and constructed-response (CR) items to determine which design was most effective in producing equivalent scores across the two tests to be equated. Using data from a large-scale exam, this study investigated the use of anchor CR item rescoring (known as trend scoring) in the context of classical equating methods. Four linking designs were examined: an anchor with only MC items; a mixed-format anchor test containing both MC and CR items; a mixed-format anchor test incorporating common CR item rescoring; and an equivalent groups (EG) design with CR item rescoring, thereby avoiding the need for an anchor test. Designs using either MC items alone or a mixed anchor without CR item rescoring resulted in much larger bias than the other two designs. The EG design with trend scoring resulted in the smallest bias, leading to the smallest root mean squared error value.

6.
One of the most widely used methods for equating multiple parallel forms of a test is to incorporate a common set of anchor items in all of its operational forms. Under appropriate assumptions it is possible to derive a linear equation for converting raw scores from one operational form to the others. The present note points out that the single most important determinant of the efficiency of the equating process is the magnitude of the correlation between the anchor test and the unique components of each form. It suggests using some monotonic function of this correlation as a measure of equating efficiency, and presents a simple model relating the relative length of the anchor test and the test reliability to this measure.
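
To see how anchor length and reliability can drive this correlation, consider a minimal classical sketch, assuming all items are parallel with a common inter-item correlation ρ (an assumption for illustration, not necessarily the note's exact model). An anchor of n_A items and a form's unique section of n_U items then correlate as below; the Spearman-Brown formula ties ρ to the reliability of an n-item form.

```latex
% Anchor-unique correlation under parallel items with inter-item correlation rho
\rho_{AU} = \frac{\sqrt{n_A n_U}\,\rho}
                 {\sqrt{\bigl(1 + (n_A - 1)\rho\bigr)\,\bigl(1 + (n_U - 1)\rho\bigr)}}
% Spearman-Brown: reliability of an n-item form
\rho_{XX'} = \frac{n\rho}{1 + (n - 1)\rho}
```

Both a relatively longer anchor (larger n_A) and a more reliable test (larger ρ) push ρ_AU, and hence the equating efficiency, upward.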

7.
Two matched forms of a 50-item multiple-choice grammar test were developed. Twenty items designed to be humorous were included in one form. Test forms were randomly assigned to 126 eighth graders, who received the test plus alternate forms of a questionnaire. Inclusion of humorous items did not affect grammar scores on matched humorous/nonhumorous items or on common post-treatment items, nor did it affect results of anxiety measures. Students favored inclusion of humor on tests, judged effects of humor positively, and estimated humorous items to be easier. Humor did not lower performance but was sought by the students. The potential for more valid and humane measurement is discussed.

8.
Problem-solving strategy is frequently cited as mediating the effects of response format (multiple-choice, constructed-response) on item difficulty, yet there are few direct investigations of examinee solution procedures. Fifty-five high school students solved parallel constructed-response and multiple-choice items that differed only in the presence of response options. Student performance was videotaped to assess solution strategies. Strategies were categorized as "traditional" (those associated with constructed-response problem solving, e.g., writing and solving algebraic equations) or "nontraditional" (those associated with multiple-choice problem solving, e.g., estimating a potential solution). Surprisingly, participants sometimes adopted nontraditional strategies to solve constructed-response items. Furthermore, differences in difficulty between response formats did not correspond to differences in strategy choice: some items showed a format effect on strategy but no effect on difficulty; other items showed the reverse. We interpret these results in light of the relative comprehension challenges posed by the two groups of items.

9.
Instruments designed to measure teachers' knowledge for teaching mathematics have been widely used to evaluate the impact of professional development and to investigate the role of teachers' knowledge in teaching and student learning. These instruments assess a mixture of content knowledge and pedagogical content knowledge. However, little attention has been given to the content alignment between such instruments and curricular standards, particularly in regard to how content knowledge and pedagogical content knowledge items are distributed across mathematical topics. This article provides content maps for two widely used teacher assessment instruments in the USA relative to the widely adopted Common Core State Standards. This common reference enables comparisons of content alignment both between the instruments and between parallel forms within each instrument. The findings indicate that only a small number of items on both instruments are designed to capture teachers' pedagogical content knowledge and that the majority of these items are focused on curricular topics in the later grades rather than in the early grades. Furthermore, some forms designed for use as pre- and post-assessments of professional development or teacher education are not parallel in terms of curricular topics, so estimates of teachers' knowledge growth based on these forms may not mean what users assume. The implications of these findings for teacher educators and researchers who use teacher knowledge instruments are discussed.

10.
In the class session following feedback regarding their scores on multiple-choice exams, undergraduate students in a large human development course rated the strength of possible contributors to their exam performance. Students rated items related to their personal effort in preparing for the exam (identified as student effort in the paper), their ability to perform well on the exams (identified as student ability), and teacher input that might have affected their exam performance. Students rated most student effort items higher than teacher input and student ability items. Nevertheless, across all exams, ratings of student ability and teacher input correlated more strongly with exam performance than did student effort ratings. High and low performers on the exams differed significantly on ratings of student ability and teacher input, but were more similar on ratings of student effort.

11.
Psychometric models based on the structural equation modeling framework are commonly used in many multiple-choice test settings to assess measurement invariance of test items across examinee subpopulations. The premise of the current article is that they may also be useful in the context of performance assessment tests for testing the measurement invariance of raters. The modeling approach, and how it can be used for performance tests with less-than-optimal rater designs, is illustrated using a data set from a performance test designed to measure medical students' patient management skills. The results suggest that group-specific rater statistics can help spot differences in rater performance that might be due to rater bias, identify specific weaknesses and strengths of individual raters, and enhance decisions related to future task development, rater training, and test scoring processes.
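
To make the invariance idea concrete, here is a minimal sketch in standard multigroup factor-model notation (an illustration, not the authors' exact specification): rater r's score is a linear function of the examinee's latent skill η, and invariance across examinee subgroups g is tested by constraining loadings and intercepts.

```latex
% Single-factor model for rater r in examinee subgroup g
X_{r}^{(g)} = \tau_{r}^{(g)} + \lambda_{r}^{(g)} \eta + \varepsilon_{r}^{(g)}
% Metric invariance: \lambda_{r}^{(g)} = \lambda_{r} for all g
% Scalar invariance: additionally \tau_{r}^{(g)} = \tau_{r} for all g
```

A rater whose loading or intercept cannot be constrained equal across subgroups without degrading model fit is flagged as a candidate for rater bias.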

12.
This paper describes an item response model for multiple-choice items and illustrates its application in item analysis. The model provides parametric and graphical summaries of the performance of each alternative associated with a multiple-choice item; the summaries describe each alternative's relationship to the proficiency being measured. The interpretation of the parameters of the multiple-choice model and the use of the model in item analysis are illustrated using data obtained from a pilot test of mathematics achievement items. The use of such item analysis for the detection of flawed items, for item design and development, and for test construction is discussed.
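
The abstract does not name its model, but one representative model of this kind is the Thissen-Steinberg multiple-choice model, in which every alternative, plus a latent "don't know" category 0, has its own trace line:

```latex
% Probability of choosing alternative k, with guessing proportions d_k (sum to 1)
P(X = k \mid \theta) = \frac{e^{a_k \theta + c_k} + d_k\, e^{a_0 \theta + c_0}}
                            {\sum_{h=0}^{m} e^{a_h \theta + c_h}}
```

Plotting each alternative's curve against θ yields the kind of per-alternative graphical summary the abstract describes; a distractor whose curve rises with proficiency, for example, signals a flawed item.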

13.
Standard procedures for equating tests, including those based on item response theory (IRT), require item responses from large numbers of examinees. Such data may not be forthcoming for reasons theoretical, political, or practical. Information about items' operating characteristics may be available from other sources, however, such as content and format specifications, expert opinion, or psychological theories about the skills and strategies required to solve them. This article shows how, in the IRT framework, collateral information about items can be exploited to augment or even replace examinee responses when linking or equating new tests to established scales. The procedures are illustrated with data from the Pre-Professional Skills Test (PPST).
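
One way to formalize such collateral information (offered as a sketch only; the article's own formulation may differ) is as a regression prior on item parameters, so that predicted operating characteristics can stand in for response-based estimates when placing a new form on the established scale:

```latex
% Item difficulty b_i predicted from collateral features x_i
% (content codes, format specifications, expert ratings), with residual variance sigma^2
b_i \sim \mathcal{N}\!\left(\mathbf{x}_i^{\top} \gamma,\; \sigma^{2}\right)
```

With ample response data the prior washes out; with few or no responses it carries the linking.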

14.
Linguistic complexity of test items is one test format element that has been studied in the context of struggling readers and their participation in paper-and-pencil tests. The present article presents findings from an exploratory study on the potential relationship between linguistic complexity and test performance for deaf readers. A total of 64 students completed 52 multiple-choice items, 32 in mathematics and 20 in reading. These items were coded for linguistic complexity components of vocabulary, syntax, and discourse. Mathematics items had higher linguistic complexity ratings than reading items, but there were no significant relationships between item linguistic complexity scores and student performance on the test items. The discussion addresses issues related to the subject area, student proficiency levels in the test content, factors to look for in determining a "linguistic complexity effect," and areas for further research in test item development and deaf students.

15.
The development of alternate assessments for students with disabilities plays a pivotal role in state and national accountability systems. An important assumption in the use of alternate assessments in these accountability systems is that scores are comparable on different test forms across diverse groups of students over time. The use of test equating is a common way that states attempt to establish score comparability on different test forms. However, equating presents many unique practical and technical challenges for alternate assessments. This article provides case studies of equating for two alternate assessments in Michigan and an approach for determining whether equating is preferable to not equating on these assessments. This approach is based on examining equated score and performance-level differences and investigating population invariance across subgroups of students with disabilities. Results suggest that using an equating method with these data appeared to have a minimal impact on proficiency classifications. The population invariance assumption was suspect for some subgroups and equating methods, with some large potential differences observed.
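
A standard index for the population invariance checks mentioned here is the root mean square difference (RMSD) of subgroup equating functions from the total-group function (the article may use a variant; this form follows Dorans and Holland):

```latex
% e_g = subgroup equating function, e = total-group function,
% w_g = subgroup weights, sigma_Y = SD of scores on the reference form
\mathrm{RMSD}(x) = \frac{\sqrt{\sum_{g} w_g \bigl(e_g(x) - e(x)\bigr)^{2}}}{\sigma_Y}
```

Large RMSD values at plausible cut scores indicate that a single equating function may not serve all subgroups equally well.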

16.
This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common-item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common-item equating methodology to standard setting ratings to account for systematic differences between standard setting panels) has received almost no attention in the literature. Identity equating was also examined to provide context. Data from a standard setting form of a large national certification test (N examinees = 4,397; N panelists = 13) were split into content-equivalent subforms with common items, and resampling methodology was used to investigate the error introduced by each approach. Common-item equating (circle-arc and nominal weights mean) was evaluated at samples of size 10, 25, 50, and 100. The standard setting approaches (resetting and rescaling the standard) were evaluated by resampling (N = 8) and by simulating panelists (N = 8, 13, and 20). Results were inconclusive regarding the relative effectiveness of resetting and rescaling the standard. Small-sample equating, however, consistently produced new form cut scores that were less biased and less prone to random error than new form cut scores based on resetting or rescaling the standard.
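
Of the small-sample methods named above, nominal weights mean equating is the simplest to state: fix the slope at the ratio of test lengths and pass the line through the form means. The sketch below is illustrative (the function name and toy numbers are assumptions, not the study's data); circle-arc equating instead fits an arc through two fixed end points and an empirically estimated middle point.

```python
def nominal_weights_mean_equate(x, n_items_x, n_items_y, mean_x, mean_y):
    """Nominal weights mean equating: slope fixed at the ratio of test
    lengths, line passing through the two form means. Equivalent-groups
    sketch; common-item designs would use synthetic-population means."""
    slope = n_items_y / n_items_x
    return mean_y + slope * (x - mean_x)

# Toy use: a raw score of 42 on a 60-item form, equated to a 55-item form
print(nominal_weights_mean_equate(42, 60, 55, mean_x=40.0, mean_y=36.5))
```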

17.
In many educational tests, both multiple-choice (MC) and constructed-response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form-specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long-standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.

18.
Both multiple-choice and constructed-response items have known advantages and disadvantages in measuring scientific inquiry. In this article we explore the function of explanation multiple-choice (EMC) items and examine how EMC items differ from traditional multiple-choice and constructed-response items in measuring scientific reasoning. A group of 794 middle school students was randomly assigned to answer either constructed-response or EMC items following regular multiple-choice items. By applying a Rasch partial-credit analysis, we found that there is a consistent alignment between the EMC and multiple-choice items. Also, the EMC items are easier than the constructed-response items but are harder than most of the multiple-choice items. We discuss the potential value of the EMC items as a learning and diagnostic tool.
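
For reference, the Rasch partial credit model applied here has the standard form below, written in generic notation: person ability θ, item i with score categories 0, ..., m_i, step difficulties δ_ij, and the empty sum for h = 0 taken as 0.

```latex
P(X_i = x \mid \theta) =
  \frac{\exp\Bigl(\sum_{j=1}^{x} (\theta - \delta_{ij})\Bigr)}
       {\sum_{h=0}^{m_i} \exp\Bigl(\sum_{j=1}^{h} (\theta - \delta_{ij})\Bigr)}
```

Each step parameter δ_ij marks where adjacent categories j-1 and j are equally likely, which is how polytomous CR or EMC scores can be placed on the same scale as dichotomous MC items.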

19.
This study explores measurement of a construct called knowledge integration in science using multiple-choice and explanation items. We use construct and instructional validity evidence to examine the role multiple-choice and explanation items play in measuring students' knowledge integration ability. For construct validity, we analyze item properties such as alignment, discrimination, and target range on the knowledge integration scale using a Rasch Partial Credit Model analysis. For instructional validity, we test the sensitivity of multiple-choice and explanation items to knowledge integration instruction using a cohort comparison design. Results show that (1) one third of correct multiple-choice responses are aligned with higher levels of knowledge integration while three quarters of incorrect multiple-choice responses are aligned with lower levels of knowledge integration, (2) explanation items discriminate between high and low knowledge integration ability students much more effectively than multiple-choice items, (3) explanation items measure a wider range of knowledge integration levels than multiple-choice items, and (4) explanation items are more sensitive to knowledge integration instruction than multiple-choice items.
