Similar Documents
20 similar documents found (search time: 15 ms)
1.
《Educational Assessment》2013,18(2):125-146
States are implementing statewide assessment programs that classify students into proficiency levels that reflect state-defined performance standards. In an effort to provide support for score interpretations, this study examined the consistency of classifications based on competing item response theory (IRT) models for data from a state assessment program. Classification of students into proficiency levels was compared based on a 1-parameter vs. a 3-parameter IRT model. Despite an overall high level of agreement between classifications based on the 2 models, systematic differences were observed. Under the 1-parameter model, proficiency was underestimated for low proficiency classifications but overestimated for upper proficiency classifications. This resulted in higher "Below Basic" and "Advanced" classifications under 1-parameter vs. 3-parameter IRT applications. Implications of these differences are discussed.
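
The contrast between the two models in this abstract comes down to the shape of the item characteristic curve. As a minimal illustration (not the study's code; all parameter values are invented), the 3-parameter logistic model adds a discrimination parameter a and a guessing floor c to the 1-parameter (Rasch) curve, which is why the two models diverge most for low-proficiency examinees:

```python
import math

# Item characteristic curves under the 1-parameter (Rasch) and
# 3-parameter logistic IRT models. a = discrimination, b = difficulty,
# c = pseudo-guessing; the parameter values below are invented.

def p_1pl(theta, b):
    """Probability of a correct response under the 1PL model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# For a low-proficiency examinee the 1PL probability approaches zero,
# while the 3PL probability is bounded below by the guessing floor c.
low_theta = -3.0
print(p_1pl(low_theta, b=0.0))                # ~0.047
print(p_3pl(low_theta, a=1.0, b=0.0, c=0.2))  # ~0.238
```

Because the 3PL floor keeps low-ability success probabilities near c, fitting a 1PL model to data with guessing tends to push low proficiency estimates further down, consistent with the underestimation the abstract reports.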

2.
The understanding of what makes a question difficult is a crucial concern in assessment. To study the difficulty of test questions, we focus on the case of PISA, which assesses to what degree 15-year-old students have acquired knowledge and skills essential for full participation in society. Our research question is to identify PISA science item characteristics that could influence the item's proficiency level. It is based on an a-priori item analysis and a statistical analysis. Results show that, of the different characteristics of PISA science items determined in our a-priori analysis, only cognitive complexity and format have explanatory power for an item's proficiency level. The proficiency level cannot be explained by the dependence/independence of the information provided in the unit and/or item introduction, or by the competence assessed. We conclude that in PISA it appears possible to anticipate a high proficiency level, that is, students' low scores, for items displaying a high cognitive complexity. For items of middle or low cognitive complexity, the cognitive complexity level is not sufficient to predict item difficulty; other characteristics play a crucial role. We discuss anticipating difficulty in assessment from a broader perspective.

3.
This study investigated the perceived difficulty of the components of the Putonghua Proficiency Test. First, a questionnaire survey of 2,084 students at ten normal universities across China examined how perceptions of difficulty were distributed, comparing students across certification levels, dialect regions, and between those who had and had not yet taken the test. Second, sample test papers from 640 students were analyzed using percentages and repeated-measures ANOVA, with each component's score-loss index as the reference, to examine how closely perceived difficulty corresponded to actual test performance. The questionnaire results show that the component normal-university students most often judged hardest alternated between "speaking on an assigned topic" and "reading monosyllabic words," while "reading polysyllabic words aloud" was uniformly chosen as the easiest. The test results, however, show that students of all certification levels and dialect backgrounds lost the most points on "speaking on an assigned topic," followed in turn by "reading monosyllabic words" and "reading polysyllabic words," with "reading a short passage aloud" showing the least score loss.

4.
Methods are presented for comparing grades obtained in a situation where students can choose between different subjects. It must be expected that the comparison between the grades is complicated by the interaction between the students' pattern and level of proficiency on one hand, and the choice of the subjects on the other hand. Three methods based on item response theory (IRT) for the estimation of proficiency measures that are comparable over students and subjects are discussed: a method based on a model with a unidimensional representation of proficiency, a method based on a model with a multidimensional representation of proficiency, and a method based on a multidimensional representation of proficiency where the stochastic nature of the choice of examination subjects is explicitly modeled. The methods are compared using the data from the Central Examinations in Secondary Education in the Netherlands. The results show that the unidimensional IRT model produces unrealistic results, which do not appear when using the two multidimensional IRT models. Further, it is shown that both the multidimensional models produce acceptable model fit. However, the model that explicitly takes the choice process into account produces the best model fit.

5.
To examine the effect of training in autonomous vocabulary-learning strategies on vocabulary proficiency, and whether such training has different effects for students at different language proficiency levels, an experiment was conducted with 158 first-year undergraduates. The experimental group received 14 weeks of training in autonomous learning strategies, while the control group received none. All participants took two vocabulary proficiency tests, serving as the pretest and posttest. The results were analyzed using independent-samples t tests and multi-factor ANOVA. The analysis shows that training in autonomous learning strategies was equally effective in improving vocabulary proficiency for students at different language proficiency levels.

6.
Assessment items are commonly field tested prior to operational use to observe statistical item properties such as difficulty. Item parameter estimates from field testing may be used to assign scores via pre-equating or computer adaptive designs. This study examined differences between item difficulty estimates based on field test and operational data and the relationship of such differences to item position changes and student proficiency estimates. Item position effects were observed for 20 assessments, with items in later positions tending to be more difficult. Moreover, field test estimates of item difficulty were biased slightly upward, which may indicate examinee knowledge of which items were being field tested. Nevertheless, errors in field test item difficulty estimates had negligible impacts on student proficiency estimates for most assessments. Caution is still warranted when using field test statistics for scoring, and testing programs should conduct investigations to determine whether the effects on scoring are inconsequential.

7.
This study measured the effects of an online supplementary mathematics curriculum designed for middle school English language learners who speak Spanish as a first language. A randomized experiment measured the achievement differences between middle school English language learners who used the Web-based HELP Math (Help with English Language Proficiency) curriculum and students who used other technology-based programs. Three hundred and ninety-six students participated. Both groups made statistically significant gains from pretest to posttest within their respective curricula, but no main effect was found between the two groups. Post hoc analyses revealed that students with higher levels of English proficiency, who participated in the comparison condition, performed significantly better than students in the HELP Math condition, while students with lower levels of English proficiency performed better in the HELP Math program (although these differences were not statistically significant). Findings are interpreted with caution due to the truncated length of the intervention.

8.
The purpose of the present study is to examine the language characteristics of a few states' large-scale assessments of mathematics and science and investigate whether the language demands of the items are associated with the degree of differential item functioning (DIF) for English language learner (ELL) students. A total of 542 items from 11 assessments at Grades 4, 5, 7, and 8 from three states were rated for linguistic complexity based on a developed linguistic coding scheme. The linguistic ratings were compared to each item's DIF statistics. The results yielded a stronger association between the linguistic rating and DIF statistics for ELL students in the "relatively easy" items than in the "not easy" items. Particularly, general academic vocabulary and the amount of language in an item were found to have the strongest association with the degrees of DIF, particularly for ELL students with low English language proficiency. Furthermore, the items were grouped into four bundles to closely look at the relationship between the varying degrees of language demands and ELL students' performance. Differential bundling functioning (DBF) results indicated that the exhibited DBF was more substantial as the language demands increased. By disentangling linguistic difficulty from content difficulty, the results of the study provide strong evidence of the impact of linguistic complexity on ELL students' performance on tests. The study discusses the implications for the validation of the tests and instructions for ELL students.

9.
The paper reports and discusses a government-initiated nationwide assessment of writing proficiency among Norwegian compulsory school students. A sample study of 7th and 10th grade students is reported and discussed with regard to the challenges of measuring writing skills in a valid and reliable manner. For the 7th graders, the results showed a greater proportion of narrative texts, which, in contrast to more scientifically oriented texts, were assessed as "lower than expected"; for the 10th graders the tendency was the opposite with respect to central linguistic components. Low correlations between the raters were found at both levels, indicating differing views among teachers as to what can be expected of students' writing proficiency. The results are discussed in relation to the usefulness of the theoretical model as a basis for assessment of writing proficiency, as well as other obstacles to constructing valid and reliable writing tests.

10.
This study examined (1) differences in background, integrative/instrumental motivation, learning approach, learning strategy, and second language (L2) proficiency, and (2) the determinants of learning outcomes, between Hong Kong and Mainland (Chinese) students. A questionnaire survey was distributed to 773 L2 learners enrolled in Bachelor of Education (English Language) programmes across four universities in Hong Kong and Mainland China. The results showed that L2 proficiency was the strongest predictor of learning outcomes for both Hong Kong and Mainland students, while integrative motivation was also a significant predictor of learning outcomes in both sample groups. In addition, instrumental motivation, deep approaches, and learning strategies were found to be significant predictors of learning outcomes for Mainland students. Mainland students demonstrated lower levels of motivation, learning approaches, learning strategies, L2 proficiency, and learning outcomes relative to Hong Kong students. Implications for curriculum design, classroom teaching and assessment, and future research are discussed.

11.
Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies, and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test, and these were due to test speededness or context dependence (related to passage structure). Also, the results highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may affect the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.
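
One common LID detection statistic of the kind this abstract reviews is Yen's Q3: correlate two items' residuals after removing the model-predicted probability of success. The sketch below uses invented thetas, difficulties, and responses; `q3` and `pearson` are hypothetical helper names, not the study's code:

```python
import math

# Yen's Q3 sketch for flagging local item dependence (LID): the residual
# of each response is (observed - Rasch-predicted probability), and Q3 for
# an item pair is the correlation of those residuals across examinees.

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def q3(responses_i, responses_j, thetas, b_i, b_j):
    res_i = [u - p_rasch(t, b_i) for u, t in zip(responses_i, thetas)]
    res_j = [u - p_rasch(t, b_j) for u, t in zip(responses_j, thetas)]
    return pearson(res_i, res_j)

# Two items answered identically by every examinee (extreme dependence);
# all values are invented for illustration.
thetas = [-1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
item_i = [0, 0, 1, 0, 1, 1]
item_j = [0, 0, 1, 0, 1, 1]  # perfectly dependent on item_i
print(round(q3(item_i, item_j, thetas, b_i=0.0, b_j=0.3), 2))  # close to +1
```

Under local independence Q3 hovers near zero (slightly negative in finite samples), so item pairs with large positive Q3 are candidates for grouping into testlets.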

12.
In view of the fact that incoming students at Lishui University generally have a weak foundation in English and that College English teachers there are generally overloaded, this paper reflects, from a management perspective, on the guiding principles of College English teaching administration at Lishui University. Only by following the educational philosophy of "educating people first, lifelong education," respecting teachers' leading role in educational and teaching activities, adhering to an incentive-oriented guiding principle, and establishing an efficiency-oriented approach to teaching management can the university actively advance College English teaching reform and ensure students' English proficiency.

13.
An assumption of item response theory is that a person's score is a function of the item response parameters and the person's ability. In this paper, the effect of variations in instructional coverage on item characteristic functions is examined. Using data from the Second International Mathematics Study (1985), curriculum clusters were formed based on teachers' ratings of their students' opportunities to learn the items on a test. After forming curriculum clusters, item response curves were compared using signed and unsigned sums of squared differences. Some of the differences in the item response curves between curriculum clusters were found to be large, but better performance was not necessarily related to greater opportunity to learn. The item response curve differences were much larger than differences reported in prior studies based on comparisons of black and white students. Implications of the findings for applications of item response theory to educational achievement test data are discussed.

14.

This study investigated the learning styles of adult English as a second language (ESL) students in Northwest Arkansas. Learning style differences by age, gender, and country of origin were explored. A total of 69 northwest Arkansas adult ESL students attending 7 adult-education centers were administered the VARK Learning Styles Questionnaire. Most participants came from Mexico and El Salvador, their ages ranged from 23 to 45, and females were on average 10 years older than males. Note taking was chosen by one third of participants as their favorite learning style, 20% favored aural modes, 15% favored kinesthetic, 4% favored visual, and 15% chose combinations of learning styles. Females chose auditory and multimodal learning styles, while males favored note taking. Students differed by level of English proficiency, with beginning-intermediate students favoring aural learning styles more than advanced students. ANOVA results indicated that participants were significantly less visual and more read-write than either aural or kinesthetic, and that males and females differed significantly in their choice of aural learning. Hispanic males chose note taking and kinesthetic learning styles significantly more than visual or auditory modes of learning. Hispanic females chose note taking, aural, and kinesthetic learning styles significantly more than visual. Asian males favored note taking and aural learning. A correlation was found between age and learning styles, with subgroups exhibiting a negative correlation between age and kinesthetic learning; Mexican males and females exhibited the strongest negative correlation. Males showed a low positive correlation between age and note taking.

15.
《Educational Assessment》2013,18(4):357-375
A test designed with built-in modifications and covering the same grade-level mathematics content provided more precise measurement of mathematics achievement for lower performing students with disabilities. Fourth-grade students with disabilities took a test based on modified state curricular standards for their mandated statewide mathematics assessment. To link the modified test with the general test, a block of items was administered to students with and without disabilities who took the general mathematics assessment. Item difficulty and student mathematics ability parameters were estimated using item response theory (IRT) methodology. Results support the conclusion that a modified test, based on the same curricular objectives but providing a more targeted measurement of expected outcomes for lower achieving students, could be developed for this special population.

16.
Applying item response theory, this study analyzes how the writers of multiple-choice items affect the validity of reading comprehension tests, in terms of both examinee ability estimates and item difficulty estimates. In the experimental design, two groups of examinees first took the same reading proficiency test; the ability estimates of the two groups derived from this test showed no significant difference. The same two groups were then given items written by two different item writers. Although these items were based on the same reading passages and their difficulty estimates did not differ significantly, the two groups' performance differed significantly. Under the Rasch model, examinee performance is jointly determined by examinee ability and item difficulty. It can therefore be inferred that the items produced by different item writers affected examinee performance and, in turn, the validity of multiple-choice reading comprehension tests.

17.
The purpose of this study was to compare the opinions of students, teachers, and administrators relative to student evaluation of instruction in selected community colleges. While important educational decisions in community colleges are made on the basis of students’ evaluations (as in retention, promotion, tenure, and pay), little has been accomplished in testing the assumptions behind student evaluation of instruction. The student evaluation process assumes that students are honest, serious, and evaluate instruction, not some incidental activity.

A 25-item Student Evaluation Process Scale was completed by 607 students, 130 faculty, and 45 administrators in five Illinois community colleges. Findings revealed few significant differences in the opinions of students regarding evaluation of instruction based on the variables of sex, age, school location, student type (transfer or occupational), and class standing. There were few significant differences in faculty opinion or within the administrative groups based on selected variables. There were significant differences when the opinions of students, faculty, and administrators were compared. Students and faculty tended to agree with those items that questioned the objectivity of student evaluation of instruction. Administrators and students tended to agree with items reflecting the seriousness with which students evaluate instruction. Faculty and administrators indicated that student evaluation of instruction impacted faculty members' instructional performances. Neither students, faculty, nor administrators supported the concept of merit pay tied to student evaluation of instruction.

The role of student evaluation of instruction in a faculty evaluation system must be investigated. A variety of groups should participate in this investigation.

18.
This study examined the effects of a learning game, [The Math App], on the mathematics proficiency of middle school students. For the study, researchers recruited 306 students, Grades 6–8, from two schools in rural southwest Virginia. Over a nine-week period, [The Math App] was deployed as an intervention for investigation. Students were assigned to a game intervention treatment condition or a paper-and-pencil control condition. In the game intervention condition, students learned fractions concepts by playing [The Math App]. In the analysis, students' mathematical proficiency levels prior to the intervention were taken into account. Results indicate that students in the game intervention group showed higher mathematics proficiency than those in the paper-and-pencil group. In particular, significantly higher performance of the intervention group was noted among 7th graders and inclusion groups. The empirically derived results of the reported study could contribute to the field of educational video game research, which has not reached a consensus on the effects of games on students' mathematics performance in classroom settings.

19.
The central idea of differential item functioning (DIF) is to examine differences between two groups at the item level while controlling for overall proficiency. This approach is useful for examining hypotheses at a finer-grain level than is permitted by a total test score. The methodology proposed in this paper is also aimed at estimating differences at the item rather than the overall score level, but with the innovation that item-level differences for many groups simultaneously are the focus. This is a straightforward generalization of DIF as variance rather than one or several group differences; conceptually, this can be referred to as item difficulty variation (IDV). When instruction is of interest, and "group" is a unit at which instruction is determined or delivered, IDV signals value-added effects that can be influenced by either demographic or instructional variables.
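
The variance view of DIF described here can be sketched in a few lines: estimate an item's difficulty separately in each group (a logit transform of the group's proportion correct is the simplest classical choice), then summarize the spread as a variance. All group proportions and helper names below are invented for illustration and are not the paper's method:

```python
import math
import statistics

# Item difficulty variation (IDV) sketch: instead of comparing two groups
# (classical DIF), summarize item-level differences across many groups as
# a variance of group-level difficulty estimates.

def logit_difficulty(p_correct):
    # Classical logit transform of a proportion-correct value; higher = harder.
    return -math.log(p_correct / (1.0 - p_correct))

# Proportion correct on one item for, say, ten classrooms (invented values):
group_p = [0.55, 0.60, 0.52, 0.58, 0.57, 0.61, 0.54, 0.59, 0.56, 0.53]
difficulties = [logit_difficulty(p) for p in group_p]
idv = statistics.variance(difficulties)  # large IDV suggests the item is sensitive to group/instructional differences
print(round(idv, 4))
```

An item whose difficulty is stable across classrooms has IDV near zero; an instructionally sensitive item shows a large IDV, which is the signal the abstract connects to value-added effects.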

20.
Over the past few decades, those who take tests in the United States have exhibited increasing diversity with respect to native language. Standard psychometric procedures for ensuring item and test fairness that have existed for some time were developed when test-taking groups were predominantly native English speakers. A better understanding of the potential influence that insufficient language proficiency may have on the efficacy of these procedures is needed. This paper represents a first step in arriving at this better understanding. We begin by addressing some of the issues that arise in a context in which assessments in a language such as English are taken increasingly by groups that may not possess the language proficiency needed to take the test. For illustrative purposes, we use the first-language status of a test taker as a surrogate for language proficiency and describe an approach to examining how the results of fairness procedures are affected by inclusion or exclusion of those who report that English is not their first language in the fairness analyses. Furthermore, we explore the sensitivity of the results of these procedures, differential item functioning (DIF) and score equating, to potential shifts in population composition. We employ data from a large-volume testing program for this illustrative purpose. The equating results were not affected by either inclusion or exclusion of such test takers in the analysis sample, or by shifts in population composition. The effect on DIF results, however, varied across focal groups.
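
The DIF procedure referenced in abstracts like this one is commonly operationalized with the Mantel-Haenszel common odds ratio, computed across matched score strata and mapped onto the ETS delta scale via MH D-DIF = -2.35 ln(α). A minimal sketch with invented counts (`mh_odds_ratio` is a hypothetical helper name, not a named procedure from the paper):

```python
import math

# Mantel-Haenszel DIF sketch. Each stratum is a 2x2 table at one matched
# score level: (A, B, C, D) = (reference correct, reference incorrect,
#                              focal correct,     focal incorrect).
# The counts below are invented for illustration.

def mh_odds_ratio(strata):
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

strata = [(30, 10, 20, 20), (40, 10, 30, 20)]
alpha = mh_odds_ratio(strata)
delta = -2.35 * math.log(alpha)  # ETS delta scale; negative values favor the reference group
print(round(alpha, 2), round(delta, 2))
```

Because the statistic is built from group-by-stratum counts, shifting the population composition (or including/excluding a subgroup such as non-native English speakers) changes the tables directly, which is why DIF results can be sensitive to such shifts while equating results may not be.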


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号