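PISA: Defining and Assessing Mathematical Literacy   Cited by: 8 (self-citations: 0, citations by others: 8)
Mathematical literacy was the major assessment domain of the Programme for International Student Assessment (PISA) in 2003. PISA holds that this domain requires students not only to master the mathematical skills needed to meet the challenges of future society, but also to analyze, reason, and communicate ideas effectively by posing, analyzing, and solving problems in a variety of situations and domains. The focus of PISA's mathematical literacy assessment therefore differs from that of earlier assessments such as the Third International Mathematics and Science Study (IEA/TIMSS), whose assessment framework is built on the common elements of member countries' national curricula, whereas the focus of the PISA assessment is not …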

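Gender differences in mathematics learning have long been a matter of concern. PISA 2012 results show that although boys and girls in Tianjin did not differ in mathematics achievement, they differed clearly in their drive, motivation, and self-beliefs regarding mathematics learning. Compared with boys, girls showed less perseverance, less openness to problem solving, and less confidence in their own ability to solve mathematics problems, while reporting stronger mathematics anxiety and a greater tendency to attribute failure in mathematics to factors other than themselves.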

This study provides empirical evidence about the sampling variability and generalizability (reliability) of a statewide science performance assessment. Results at both individual and school levels indicate that task-sampling variability was the major source of measurement error in the performance assessment; rater-sampling variability was negligible. Adding more tasks improves the generalizability of the measurement. For the school-level assessment, the variation of performance among students within a school was larger than the variation among schools. Increasing the number of students taking a test within a school thus increases the generalizability of the assessment. Finally, the allocation of students in a matrix-sampling design is compared to a students-crossed-with-tasks design. The former would require fewer tasks per student than the latter to build a generalizable measure of school performance.
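
The claim that adding tasks improves generalizability can be illustrated with the standard relative generalizability coefficient for a persons-crossed-with-tasks design. This is a minimal sketch, not the analysis reported in the study: the function name `g_coefficient` and any variance components passed to it are hypothetical and chosen only for illustration.

```python
def g_coefficient(var_person: float, var_residual: float, n_tasks: int) -> float:
    """Relative generalizability coefficient for a persons x tasks design.

    var_person   -- variance component for persons (the "true" differences)
    var_residual -- person-by-task interaction confounded with error
    n_tasks      -- number of tasks each person completes

    All inputs are hypothetical illustration values, not results from the study.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    # Averaging over n_tasks tasks divides the error variance by n_tasks.
    relative_error = var_residual / n_tasks
    return var_person / (var_person + relative_error)
```

With a large task-related error component, the coefficient rises as tasks are added, for example when comparing `g_coefficient(1.0, 3.0, 1)` with `g_coefficient(1.0, 3.0, 3)`.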

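Over the past twenty years, large-scale student assessments have had a profound influence on educational research, school systems, and education policy. The Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), and the Progress in International Reading Literacy Study (PIRLS) have made student achievement scores comparable across countries to a certain degree, allowing a closer look, from within schools, at how schooling differs between countries. Large-scale student assessments also provide the data needed for school development, educational leadership, and the improvement of student achievement. Taking PIRLS as an example, this article draws on a German perspective to offer lessons for China's future participation in large-scale international student assessments.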

Many states are implementing direct writing assessments to assess student achievement. While much literature has investigated minimizing raters' effects on writing scores, little attention has been given to the type of model used to prepare raters to score direct writing assessments. This study reports on an investigation that occurred in a state-mandated writing program when a scoring anomaly became apparent once assessments were put in operation. The study indicates that using a spiral model for training raters and scoring papers results in higher mean ratings than does using a sequential model for training and scoring. Findings suggest that making decisions about cut-scores based on pilot data has important implications for program implementation.

Cross-cultural studies can shed new light on theories of gender differences in cognition. In the present study, Chinese students were given items from the math subtest of the Scholastic Aptitude Test (SAT) that have been found to produce the largest gender differences in American students. The authors describe how four different explanations of gender differences make different predictions regarding the possible size of the gender difference in Chinese students. Consistent with the Differential Coursework view but contrary to the predictions of several other views, the results revealed no difference in performance on the SAT items between Chinese males and females.

In large-scale assessments, such as state-wide testing programs, national sample-based assessments, and international comparative studies, there are many steps involved in the measurement and reporting of student achievement. Each of these steps is a potential source of inaccuracy. It is of interest to identify the source and magnitude of the errors in the measurement process that may threaten the validity of the final results. Assessment designers can then improve the assessment quality by focusing on areas that pose the highest threats to the results. This paper discusses the relative magnitudes of three main sources of error with reference to the objectives of assessment programs: measurement error, sampling error, and equating error. A number of examples from large-scale assessments are used to illustrate these errors and their impact on the results. The paper concludes by making a number of recommendations that could lead to an improvement in the accuracy of large-scale assessment results.

This article proposes that sampling design effects have potentially huge unrecognized impacts on the results reported by large-scale district and state assessments in the United States. When design effects are unrecognized and unaccounted for, they lead to underestimating the sampling error in item and test statistics. Underestimating the sampling errors, in turn, results in unanticipated instability in the testing program and an increase in Type I errors in significance tests. This is especially true when the standard error of equating is underestimated. The problem is caused by the typical district and state practice of using nonprobability cluster-sampling procedures, such as convenience, purposeful, and quota sampling, then calculating statistics and standard errors as if the samples were simple random samples.
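
The underestimation described above can be sketched with Kish's well-known approximation for the design effect of a cluster sample, deff = 1 + (m - 1) * rho, where m is the cluster size and rho the intraclass correlation. This is only an illustration under stated assumptions: equal-sized clusters, a known intraclass correlation, and hypothetical function names. The article itself does not specify this formula, and the approximation strictly applies to probability cluster samples rather than the nonprobability samples it criticizes.

```python
import math


def design_effect(cluster_size: float, icc: float) -> float:
    """Kish approximation: deff = 1 + (m - 1) * rho (assumes equal cluster sizes)."""
    return 1.0 + (cluster_size - 1.0) * icc


def srs_standard_error(sd: float, n: int) -> float:
    """Standard error of a mean computed as if the sample were a simple random sample."""
    return sd / math.sqrt(n)


def cluster_standard_error(sd: float, n: int, cluster_size: float, icc: float) -> float:
    """SRS standard error inflated by the square root of the design effect."""
    return srs_standard_error(sd, n) * math.sqrt(design_effect(cluster_size, icc))
```

For example, with hypothetical values of 13 students per cluster and an intraclass correlation of 0.25, the design effect is 4, so the naive simple-random-sample standard error is half of the cluster-adjusted one.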

Differential item functioning (DIF) analyses are a routine part of the development of large-scale assessments. Less common are studies to understand the potential sources of DIF. The goals of this study were (a) to identify gender DIF in a large-scale science assessment and (b) to look for trends in the DIF and non-DIF items due to content, cognitive demands, item type, item text, and visual-spatial or reference factors. To facilitate the analyses, DIF studies were conducted at 3 grade levels and for 2 randomly equivalent forms of the science assessment at each grade level (administered in different years). The DIF procedure itself was a variant of the "standardization procedure" of Dorans and Kulick (1986) and was applied to very large sets of data (6 sets of data, each involving 60,000 students). It has the advantages of being easy to understand and to explain to practitioners. Several findings emerged from the study that would be useful to pass on to test development committees. For example, when there was DIF in science items, MC items tended to favor male examinees and OR items tended to favor female examinees. Compiling DIF information across multiple grades and years increases the likelihood that important trends in the data will be identified and that item writing practices will be informed by more than anecdotal reports about DIF.
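
The basic index behind the standardization procedure can be sketched as a weighted average of the difference in proportion correct between focal and reference groups at each matched score level, with the focal-group counts as weights. This is a minimal sketch of the basic index only, not the variant used in the study; the function name `std_p_dif` and the input layout are hypothetical.

```python
from typing import Iterable, Tuple


def std_p_dif(levels: Iterable[Tuple[int, float, float]]) -> float:
    """Standardized p-difference for one item.

    Each element of `levels` describes one matched total-score level as
    (n_focal, p_focal, p_reference):
      n_focal     -- number of focal-group examinees at that score level
      p_focal     -- proportion of them answering the item correctly
      p_reference -- proportion correct in the reference group at that level

    Negative values indicate the item is harder for the focal group than for
    matched reference-group examinees; positive values indicate the reverse.
    """
    weighted_sum = 0.0
    total_weight = 0
    for n_focal, p_focal, p_reference in levels:
        weighted_sum += n_focal * (p_focal - p_reference)
        total_weight += n_focal
    if total_weight == 0:
        raise ValueError("no focal-group examinees in any score level")
    return weighted_sum / total_weight
```

Weighting by focal-group counts keeps the index on the proportion-correct scale, which is part of what makes the procedure easy to explain to practitioners.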

In the past decade, extensive research on gender and learning styles has produced a multitude of findings: gender differences in learning styles are small on average, but across studies quite different results are observed. In the present study, this heterogeneity is the central focus of our attention. Two possible interpretations concerning the educational context and the concept of gender identity are investigated: the teacher and the subject he or she teaches. Besides the variable gender as a dichotomous variable, the variable gender identity is included to reflect the theoretical standpoint of the social construction of gender differences. Using multivariate techniques on a data set of 432 adult secondary students, the observed relations between gender, gender identity and learning styles are described. Gender identity turns out to explain more variance in the use of learning styles compared to gender. Furthermore, it is shown that gender (identity) differences in learning styles do not vary across teachers and, with one exception, they do not vary across subjects.

A comparison of PISA and TIMSS 2003 achievement results in mathematics   Cited by: 1 (self-citations: 0, citations by others: 1)
Margaret Wu, Prospects, 2009, 39(1): 33-46
This study compares the Programme for International Student Assessment (PISA) 2003 Mathematics results with the Trends in International Mathematics and Science Study (TIMSS) 2003 Grade 8 mathematics results, using country mean scores for 22 participants of both studies. It is found that Western countries generally performed better in PISA than in TIMSS, and Eastern European and Asian countries generally performed better in TIMSS than in PISA. Furthermore, two factors, content balance and years of schooling, can account for 93% of the variation between the differential performance of countries in PISA and TIMSS. Consequently, the rankings of countries in the two studies can be reconciled to a reasonable degree of accuracy.

This study investigates to what extent contextualized and non-contextualized mathematics test items have a differential impact on examinee effort. Mixture item response theory (IRT) models are applied to two subsets of items from a national assessment on mathematics in the second grade of the pre-vocational track in secondary education in Flanders. One subset focused on elementary arithmetic and consisted of non-contextualized items. Another subset of contextualized items focused on the application of arithmetic in authentic problem-solving situations. Results indicate that differential performance on the subsets is to a large extent due to test effort. The non-contextualized items appear to be much more susceptible to low examinee effort in low-stakes testing situations. However, subgroups of students can be found with regard to the extent to which they show low effort. One can distinguish a compliant, an underachieving, and a dropout group. Group membership is also linked to relevant background characteristics.

In this study we examined variations of the nonequivalent groups equating design for tests containing both multiple-choice (MC) and constructed-response (CR) items to determine which design was most effective in producing equivalent scores across the two tests to be equated. Using data from a large-scale exam, this study investigated the use of anchor CR item rescoring (known as trend scoring) in the context of classical equating methods. Four linking designs were examined: an anchor with only MC items; a mixed-format anchor test containing both MC and CR items; a mixed-format anchor test incorporating common CR item rescoring; and an equivalent groups (EG) design with CR item rescoring, thereby avoiding the need for an anchor test. Designs using either MC items alone or a mixed anchor without CR item rescoring resulted in much larger bias than the other two designs. The EG design with trend scoring resulted in the smallest bias, leading to the smallest root mean squared error value.

The evaluation of developmental interventions has been hampered by a lack of practical, reliable, and objective developmental assessment systems. This article describes the construction of a domain-general computerized developmental assessment system for texts: the Lexical Abstraction Assessment System (LAAS). The LAAS provides assessments of the order of hierarchical complexity of oral and written texts, employing scoring rules developed with predictive discriminant analysis. The LAAS is made possible by a feature of conceptual structure we call hierarchical order of abstraction, which produces systematic quantifiable changes in lexical composition with development. The LAAS produces scores that agree with human ratings of hierarchical complexity more than 80% of the time within one-third of a complexity order across 6 complexity orders (18 levels), spanning the portion of the lifespan from about 4 years of age through adulthood. This corresponds to a Kendall's tau of .93.

This article presents the pseudo-equivalent group approach and discusses how it can enhance the quality of linking in the presence of nonequivalent groups. The pseudo-equivalent group approach makes it possible to achieve pseudo-equivalence using propensity score reweighting techniques. We use it to perform linking to establish scale concordance between two assessments. The article presents Monte-Carlo simulations and a real data application based on data from the Survey of Adult Skills (PIAAC) and the Programme for International Student Assessment (PISA). Monte-Carlo simulations suggest that the pseudo-equivalent group design is particularly useful whenever there is a large overlap across the two groups with respect to balancing variables and when the correlation between such variables and ability is medium or high. The example based on PISA and PIAAC data indicates that the approach can provide reasonably accurate linking that can be used for group-level comparisons.

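Gender differences in language are an objectively existing social phenomenon. As a subsystem of the overall cultural system, the linguistic sign system reflects the differences between the sexes and their respective status through four modes of signification: identification, appraisal, prescription, and formation. This article gives examples of gender differences in English from these four perspectives, analyzes the causes of the phenomenon, and points out that the differences between gendered language varieties are gradually narrowing and that sexism in English will thereby be eliminated.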

陈媛媛 (Chen Yuanyuan), 《海外英语》 (Overseas English), 2014, (17): 237+240
Men and women use language differently in many ways. This paper illustrates gender differences in conversation and the different interpretive frames within which discourse between men and women takes place. At a deeper level, it tries to explain these differences from the perspective of socialization.

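Canada achieved outstanding results in the PISA assessments from 2000 to 2009. Analysis shows that this is related to its strong tradition of valuing education, unified provincial curricula, high-quality teachers, and successful immigrant education. In its recent education reforms, Canada has emphasized the leading role of government, established sound accountability systems, focused on educational equity and balanced development, strengthened the teaching workforce, and attracted talented people to teach over the long term and throughout their careers. This experience offers a useful reference for China's current education reform.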

