Similar literature
1.
Although test scores from similar tests in multiple-choice and constructed-response formats are highly correlated, equivalence in rankings may mask differences in substantive strategy use. The author used an experimental design and participant think-alouds to explore cognitive processes in mathematical problem solving among undergraduate examinees (N = 64). The study examined the effect of format on mathematics performance and strategy use for male and female examinees given stem-equivalent items. A statistically significant main effect of format on performance was found, with constructed-response items more difficult. The multiple-choice format was associated with more varied strategies, backward strategies, and guessing. Format was found to moderate the effect of problem conceptualization on performance. Results suggest that while the multiple-choice format may be adequate for ranking students on performance, the constructed-response format should be preferred for the many contemporary educational purposes that seek nuanced information about student cognition.

2.
Equivalent forms of a ten-item completion test were constructed. The same test items then were rewritten in matching format and in multiple-choice format, resulting in two forms (A and B) of each of three types of test. All tests were administered to 73 examinees, and parallel-forms reliability coefficients (correlation between scores on A and B) were calculated. These empirically obtained values were compared to the values of the reliability coefficient predicted from theoretically derived equations which indicate the influence of chance success due to guessing on test reliability. In accordance with theory it was found that the completion test was more reliable than the matching test and that the matching test was more reliable than the multiple-choice test. The empirically obtained reliability coefficients were very close to those predicted from the mathematically derived formulas.

3.
A simple modification to the method of answering and scoring multiple choice tests allows students to indicate their estimates of the probability of the correctness of the multiple choice options for each question, without affecting the validity of the assessment. A study was conducted using a test that investigated common misconceptions in mechanics. The study showed that for assessment purposes this method gives results that are very similar to results obtained by students who answer in the traditional manner. Year 12 Physics students (N=85) were randomly allocated to two treatment groups: one received a standard format multiple choice test, the other a test format allowing students to select more than one response in a multiple choice test, and to distribute their marks among their chosen options. An analysis of the students' uncertainties is used to argue that not only can students appeal to different conceptions in different contexts, but that they can also hold conflicting conceptions with respect to a single context.

4.
Scores were obtained from 198 ninth grade students on achievement motivation, test anxiety, testwiseness, and risktaking. Tests in mathematics and vocabulary were constructed in free response and multiple choice form, and administered to the subjects in that order, with an interval of 5 weeks between administrations. Partial correlations were computed between scores on the multiple choice tests and achievement motivation, test anxiety, testwiseness, and risktaking, with free response scores partialled out. The partial correlations were corrected for the unreliability in the free response scores, and tested for significance. All partials involving achievement motivation and test anxiety were nonsignificant, as were all partials based on mathematics scores. The partial correlations of vocabulary scores with testwiseness and risktaking were significant without exception. It was concluded that the use of multiple choice tests can favour certain examinees: those who are highly testwise and willing to take risks in the test situation. It was noted that the extent to which these examinees were favoured was dependent on the nature of the test, and that a verbal test seemed more susceptible than a numerical test.
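As a rough illustration of the kind of statistic this abstract describes, the sketch below computes a first-order partial correlation and optionally disattenuates the covariate. The abstract does not say which correction was applied; the standard correction for attenuation (dividing each correlation with the covariate by the square root of its reliability) is assumed here, and the function name and parameters are hypothetical.

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz, rel_z=1.0):
    """Partial correlation of x and y with z partialled out.

    rel_z is the reliability of z (assumed known). With rel_z=1.0 this is the
    ordinary partial correlation; otherwise the correlations involving z are
    first corrected for attenuation (an assumed, textbook correction).
    """
    r_xz_c = r_xz / sqrt(rel_z)  # disattenuated correlation of x with z
    r_yz_c = r_yz / sqrt(rel_z)  # disattenuated correlation of y with z
    return (r_xy - r_xz_c * r_yz_c) / sqrt((1 - r_xz_c**2) * (1 - r_yz_c**2))
```

Here x could stand for a multiple choice score, y for a trait such as testwiseness, and z for the free response score. Lower reliability in z shrinks the corrected partial correlation, since more of the x-y relationship is attributed to the true covariate.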

5.
A common item format frequently encountered on survey questionnaires is the one which asks respondents to check all those categories which are personally applicable. Thus, if there are r categories a subject is free to check none, one, or up to r categories. If the researcher wants to compare c independent groups on their responses to such an item, the usual chi-square test of homogeneity of distributions is inappropriate since subjects can appear in more than one category of the outcome measure. This paper develops and illustrates a new statistic which can compare the response patterns to the item across groups. Post hoc procedures to be used in conjunction with the statistical test are also developed.

6.
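Computer-based testing has become increasingly common, but different computer-based test formats may yield biased results when measuring the same task, which can make the outcomes of educational measurement and evaluation unfair. Drawing on item response theory, this article examines whether computerized linear testing and computerized adaptive testing are equivalent in test efficiency, in the statistical characteristics of test results, and in their effects on examinees' individual psychological traits, and reports an empirical study based on the "Modern Educational Technology" course for pre-service teachers. The results show that examinees' scores on the two tests are comparable. Computerized adaptive testing offers higher test efficiency and reliability, but the presence or absence of immediate feedback has a larger effect on examinees' test anxiety; computerized linear testing has more reasonable content validity, and the presence or absence of immediate feedback has a smaller effect on test anxiety. The study provides a scientific analysis of whether the choice of test format in teaching evaluation is fair and reasonable, and offers a theoretical reference for test administrators choosing a format suited to the testing scenario.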

7.
The development of statistical methods for detecting test collusion is a new research direction in the area of test security. Test collusion may be described as large‐scale sharing of test materials, including answers to test items. Current methods of detecting test collusion are based on statistics also used in answer‐copying detection. Therefore, in computerized adaptive testing (CAT) these methods lose power because the actual test varies across examinees. This article addresses that problem by introducing a new approach that works in two stages: in Stage 1, test centers with an unusual distribution of a person‐fit statistic are identified via Kullback–Leibler divergence; in Stage 2, examinees from identified test centers are analyzed further using the person‐fit statistic, where the critical value is computed without data from the identified test centers. The approach is extremely flexible. One can employ any existing person‐fit statistic. The approach can be applied to all major testing programs: paper‐and‐pencil testing (P&P), computer‐based testing (CBT), multiple‐stage testing (MST), and CAT. Also, the definition of test center is not limited by the geographic location (room, class, college) and can be extended to support various relations between examinees (from the same undergraduate college, from the same test‐prep center, from the same group at a social network). The suggested approach was found to be effective in CAT for detecting groups of examinees with item pre‐knowledge, meaning those with access (possibly unknown to us) to one or more subsets of items prior to the exam.
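The Stage 1 idea in this abstract can be sketched roughly as follows. This is not the authors' procedure: the binning, the add-one smoothing, the direction of the divergence, and the choice to compare each center against all other centers pooled are assumptions made for illustration, and every function name is hypothetical.

```python
from math import log

def histogram(values, edges):
    """Smoothed relative frequencies of values over the bins given by edges."""
    counts = [1.0] * (len(edges) - 1)  # add-one smoothing avoids log(0)
    for v in values:
        for b in range(len(edges) - 1):
            is_last = b == len(edges) - 2
            # Bins are half-open, except the last one, which includes its upper edge.
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def center_divergences(fit_by_center, edges):
    """Divergence of each center's person-fit distribution from all other centers pooled."""
    result = {}
    for center, values in fit_by_center.items():
        rest = [v for c, vs in fit_by_center.items() if c != center for v in vs]
        result[center] = kl_divergence(histogram(values, edges), histogram(rest, edges))
    return result
```

Centers with the largest divergence would be the candidates passed on to a Stage 2 examinee-level analysis; how a cutoff is chosen is not specified in the abstract and is left open here.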

8.
Admission decisions frequently rely on multiple assessments. As a consequence, it is important to explore rational approaches to combine the information from different educational tests. For example, U.S. graduate schools usually receive both TOEFL iBT® scores and GRE® General scores of foreign applicants for admission; however, little guidance has been given to combine information from these two assessments, even though the relationships between such sections as GRE Verbal and TOEFL iBT Reading are obvious. In this study, principles are provided to explore the extent to which different assessments complement one another and are distinguishable. Augmentation approaches developed for individual tests are applied to provide an accurate evaluation of combined assessments. Because augmentation methods require estimates of measurement error and internal reliability data are unavailable, required estimates of measurement error are obtained from repeaters, examinees who took the same test more than once. Because repeaters are not representative of all examinees in typical assessments, minimum discriminant information adjustment techniques are applied to the available sample of repeaters to treat the effect of selection bias. To illustrate methodology, combining information from TOEFL iBT scores and GRE General scores is examined. Analysis suggests that information from the GRE General and TOEFL iBT assessments is complementary but not redundant, indicating that the two tests measure related but somewhat different constructs. The proposed methodology can be readily applied to other situations where multiple assessments are needed.

9.
This study investigates the effect of method of assessment on student performance. Five research conditions go together with one of four assessment modes, namely: portfolio, case-based, peer assessment, and multiple choice evaluation. Data collection is done by means of a pre-test/post-test design with the help of two standardised tests (N=816). Results show that assessment method does make a difference: assessments do not produce overall effects on student performance. Moreover, student-activating instruction efforts do not automatically result in more extensive learning gains. Finally, test results show, when compared to other assessments, a statistically significant positive effect of the multiple choice test on students' test scores. However, students' preparation level and the closed book format of the tests might serve explanatory purposes.

10.
This paper illustrates that the psychometric properties of scores and scales that are used with mixed‐format educational tests can impact the use and interpretation of the scores that are reported to examinees. Psychometric properties that include reliability and conditional standard errors of measurement are considered in this paper. The focus is on mixed‐format tests in situations for which raw scores are integer‐weighted sums of item scores. Four associated real‐data examples include (a) effects of weights associated with each item type on reliability, (b) comparison of psychometric properties of different scale scores, (c) evaluation of the equity property of equating, and (d) comparison of the use of unidimensional and multidimensional procedures for evaluating psychometric properties. Throughout the paper, and especially in the conclusion section, the examples are related to issues associated with test interpretation and test use.

11.
This research examines the effect of two testing strategies on academic achievement and summative evaluations in an introductory statistics course. In 2001, 63 students underwent an hourly midterm format; and in 2002, 68 students underwent a bi-weekly exam format. Other than the exam format, the class lectures and labs were identical in terms of content, structure, pace, and the cumulative final exam. Findings from the regression analyses show that students in the bi-weekly format performed better than the students in the hourly midterm format. On average, students who took the bi-weekly exams performed about 10 percentage points higher (one letter grade) on the exams during the semester and about 15 percentage points higher on the cumulative final exam compared to their peers who took hourly midterms. The benefits of the bi-weekly format were significantly greater among female students than male students. Finally, students in the bi-weekly format were less likely to drop the class and evaluated the class far more favorably. Carrie B. Myers is an Assistant Professor of Adult and Higher Education at Montana State University. She received her Ph.D. in Higher Education Administration from Washington State University. Her research focuses on student and faculty development and assessment and evaluation. Scott M. Myers is an Associate Professor of Sociology at Montana State University. His areas of research are family demography and education. He received a Ph.D. in Sociology and a Ph.D. in Demography from the Pennsylvania State University.

12.
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee‐selected items. The results of a series of simulation studies showed: (1) that the parameters of the new models were recovered well, (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models, (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee‐selected items, and (4) even when the missingness mechanism in examinee‐selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.

13.
In this contribution we concentrate on the features of a particular item format: items having as the last option “none of the above” (NOTA items). There is considerable dispute on the advisability of using NOTA items in testing. Some authors conclude that NOTA items should be avoided, some come to neutral conclusions, while others argue that NOTA items are optimal test items. In this article, we contribute evidence to this discussion by conducting protocol analysis on written statements of examinees while answering NOTA items. In our investigation, a test containing 30 multiple-choice items was administered to 169 university students. The results show that NOTA options appear to be more attractive than options with specified solutions in those cases where a problem solver fails. Also, a relationship is found between the quality of (incorrect) problem solving and the choice of NOTA options: the higher the quality of the incorrect problem-solving process, the more likely the student is to choose NOTA. Overall, our research supports the statement that “the more confidence an examinee has in his worked solution, which is inconsistent with one of the specified solutions, the more eager he seems to choose ‘none of the above’.”

14.
Responses to a 40-item test were simulated for 150 examinees under free-response and multiple-choice formats. The simulation was replicated three times for each of 30 variations reflecting format and the extent to which examinees were (a) misinformed, (b) successful in guessing free-response answers, and (c) able to recognize with assurance correct multiple-choice options that they could not produce under free-response testing. Internal consistency reliability (KR20) estimates were consistently higher for the free-response score sets, even when the free-response item difficulty indices were augmented to yield mean scores comparable to those from multiple-choice testing. In addition, all test score sets were correlated with four randomly generated sets of unit-normal measures, whose intercorrelations ranged from moderate to strong. These measures served as criteria because one of them had been used as the basic ability measure in the simulation of the test score sets. Again, the free-response score sets yielded superior results even when tests of equal difficulty were compared. The guessing and recognition factors had little or no effect on reliability estimates or correlations with the criteria. The extent of misinformation affected only multiple-choice score KR20's (the more misinformation, the higher the KR20's). Although free-response tests were found to be generally superior, the extent of their advantage over multiple-choice was judged sufficiently small that other considerations might justifiably dictate format choice.
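For readers unfamiliar with the KR20 coefficient mentioned in this abstract, a minimal sketch of the Kuder-Richardson formula 20 for dichotomously scored items follows. The use of population variances throughout and the function name are assumptions; the abstract does not describe how the study computed its estimates.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores."""
    n_items = len(responses[0])
    n_examinees = len(responses)
    # Proportion of examinees answering each item correctly.
    p = [sum(row[i] for row in responses) / n_examinees for i in range(n_items)]
    # Sum of item variances p * (1 - p).
    item_var_sum = sum(pi * (1 - pi) for pi in p)
    # Population variance of examinees' total scores.
    total_var = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - item_var_sum / total_var)
```

The function assumes at least two items and some variation in total scores; with identical totals the total-score variance is zero and the coefficient is undefined.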

15.
Objective: The present study extends field research on interviews with young children suspected of having been abused by examining multiple assessment interviews designed to be inquisitory and exploratory, rather than formal evidential or forensic interviews. Methods: Sixty-six interviews with 24 children between the ages of 3 and 6 years who were undergoing an assessment for suspected child abuse were examined. Each child was interviewed 2, 3, or 4 times. The interviewer's questions were categorized in terms of openness (open, closed or choice), in terms of the degree of interviewer input (free recall, direct, leading, suggestive), and for topic (whether the question was abuse-specific or nonabuse-related). Children's on-task responses were coded for amount of information (number of clauses) reported in relation to each question type and topic, and off-task responses were categorized as either ignoring the question or a diverted response. Results: Children provided a response to most questions, independent of question type or topic and typically responded with one or two simple clauses. Some children disclosed abuse in response to open-ended questions; generally, however, failure to respond to a question was more likely for abuse-specific than for nonabuse-related questions. Conclusion: The findings are discussed in terms of the growing literature on interviewing children about suspected abuse, particularly in interviews conducted over multiple sessions. Practice implications: Assessment of suspected child abuse may involve more than a single investigative interview. Research examining children's responses to questioning over multiple interviews (or single interviews conducted over multiple sessions) is necessary for the development of best practise guidelines for the assessment of abuse.

16.
Drawing upon Aihwa Ong’s concept of ‘neoliberalism as exception’, this paper explores how the education authority in Shanghai capitalises on neoliberal knowledge, techniques and logics to address local challenges. Through the creation of ‘new high-quality schools’ that is accompanied by a new assessment system, the authority hopes to persuade parents to choose non-elite schools instead of prestigious schools that excel in academic performance. The neoliberal strategy of school choice is supported by the policy of school autonomy for educators to go beyond test scores to promote holistic development in students. The paper underlines the indigenisation of neoliberalism through policy dynamics where multiple educational stakeholders interact with and mutually influence one another. By highlighting ‘neoliberalism with Chinese characteristics’ in Shanghai, this study demonstrates how neoliberalism coexists with state forms, cultural norms and social practices in a particular locality.

17.
The debate over who in the family makes the selection of a preferred new school is an important one for many reasons. This paper presents some of the positions in that debate and attempts to resolve some of the apparent contradictions and anomalies in previous findings by using a new three-step model of choice. This model clearly suggests that the reported roles of both parents and children are susceptible to variations over time during the process of choice and that some of the differences discernible in previous studies may be due to this. In addition, the model predicts that a simple division of families into 'alert' and 'inert', 'disconnected' and 'privileged', or parent-centred and child-centred will not work in making sense of the complex micropolitics of choice in most families.

18.
Although a few studies report sizable score gains for examinees who repeat performance‐based assessments, research has not yet addressed the reliability and validity of inferences based on ratings of repeat examinees on such tests. This study analyzed scores for 8,457 single‐take examinees and 4,030 repeat examinees who completed a 6‐hour clinical skills assessment required for physician licensure. Each examinee was rated in four skill domains: data gathering, communication‐interpersonal skills, spoken English proficiency, and documentation proficiency. Conditional standard errors of measurement computed for single‐take and multiple‐take examinees indicated that ratings were of comparable precision for the two groups within each of the four skill domains; however, conditional errors were larger for low‐scoring examinees regardless of retest status. In addition, multiple‐take examinees exhibited less score consistency across the skill domains on their first attempt, but their scores became more consistent on their second attempt. Further, the median correlation between scores on the four clinical skill domains and three external measures was .15 for multiple‐take examinees on their first attempt but increased to .27 for their second attempt, a value comparable to the median correlation of .26 for single‐take examinees. The findings support the validity of inferences based on scores from the second attempt.

19.
We examined change in test-taking effort over the course of a three-hour, five test, low-stakes testing session. Latent growth modeling results indicated that change in test-taking effort was well-represented by a piecewise growth form, wherein effort increased from test 1 to test 4 and then decreased from test 4 to test 5. There was significant variability in effort for each of the five tests, which could be predicted from examinees’ conscientiousness, agreeableness, mastery approach goal orientation, and whether the examinee “skipped” or attended the initial testing session. The degree to which examinees perceived a particular test as important was related to effort for the difficult, cognitive test but not for less difficult, noncognitive tests. There was significant variability in the rates of change in effort, which could be predicted from examinees’ agreeableness. Interestingly, change in test-taking effort was not related to change in perceived test importance. Implications of these results for assessment practice and directions for future research are discussed.

20.
Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees had to guess on the final six questions of the analytical section of the Graduate Record Examination if they were to finish before time expired. At the higher-ability levels, even more guessing was required because the questions administered to higher-ability examinees were typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is very complex because of the relationship of time and difficulty, and the multidimensionality of the test.
