Similar Literature
 Found 20 similar documents (search time: 218 ms)
1.
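As a typical growth model, vertical scaling (also called vertical equating or vertical calibration) is often used to assess the development of examinees' academic achievement or ability. This study uses response data from ethnic-minority students in Grades 4 to 6 in Xinjiang on the Chinese-language tests of three academic achievement quality monitoring assessments administered from 2011 to 2013. Data were collected under a common-item design, and scale scores were constructed using the Thurstone method and IRT concurrent calibration. The study linked scores across the three grades, measured the growth in Chinese-language achievement of Grade 4 to 6 students in Xinjiang bilingual classes, and provides quantitative indicators that can serve as a reference for achievement monitoring.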

2.
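Teaching content: equivalent fractions. Pre-lesson reflection: "Equivalent fractions" are an important part of understanding fractions and an important knowledge base for students' later study of the "basic property of fractions". In the traditional textbook arrangement, "equivalent fractions" are not presented as a separate topic. However, when students first encounter fractions in Grade 3, they often show an initial intuition for "equivalent fractions". This lesson case therefore takes "equivalent fractions" as its teaching content to show how Grade 3 students actually perceive, recognize, and understand the concept,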

3.
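The comparability of test scores bears on test fairness and is an important dimension for judging the quality of a test, especially for large-scale English tests with multiple parallel forms. The most critical step in studying score comparability is score equating. Drawing on item response theory, this paper shows how to equate the scores of two parallel forms of the college entrance examination (Gaokao) English test following IRT equating procedures. The resulting equated scores show that (1) the two forms differ in difficulty, so actual test scores must be adjusted with reference to the equating results when they are used; and (2) when item response theory is used to equate Gaokao English scores, a rigorous model selection process is required to choose a parameter model that fits the data.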

4.
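Teaching content: "The Basic Property of Fractions", People's Education Press primary school mathematics, Grade 5, Volume 2. Teaching objectives: 1. Students understand and master the basic property of fractions and know how it relates to the invariance of the quotient in whole-number division. 2. Students can use the basic property of fractions to convert a fraction into an equal fraction with a different denominator.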

5.
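Since elective subjects were introduced into the Gaokao, the composition of the Gaokao total score has drawn increasing attention. Because the non-comparability of scores across different elective subjects has not been resolved, examinees who choose different subjects are treated unfairly and high school students' subject choices are clearly negatively affected; this is also one of the fundamental reasons why some Gaokao schemes have met resistance in implementation. Although the Gaokao subjects are tests that measure different constructs, empirical data show a fairly strong correlation between the combined score in Chinese, mathematics, and foreign language and elective subject scores. This study uses the Chinese, mathematics, and foreign language tests as an anchor test and applies the frequency estimation equipercentile method to scale elective subject scores on the synthetic population, making the scores comparable. The study shows that scaling elective subject scores and counting the scale scores, rather than raw scores, toward the total score is both necessary and feasible.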

6.
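1 Introduction. Test equating systematically converts test scores across multiple test forms that assess the same psychological trait, so that scores on the different forms are comparable. Because item response theory (IRT) defines item difficulty and the psychological trait (ability) on the same scale, equating in IRT can also be viewed as systematically converting item parameters across multiple forms assessing the same trait, so that item parameters from the different forms are comparable.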
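
The item-parameter conversion described in this abstract can be illustrated with a minimal sketch. It assumes a linear scale transformation estimated with the mean/sigma method from common items, which the abstract itself does not name; the function names and the difficulty values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_from, b_to):
    """Slope A and intercept B that map the 'from' scale onto the 'to' scale.

    b_from and b_to are difficulty estimates of the SAME common items,
    calibrated separately on each form (mean/sigma method, an assumption).
    """
    A = pstdev(b_to) / pstdev(b_from)
    B = mean(b_to) - A * mean(b_from)
    return A, B

def transform_item(a, b, A, B):
    """Re-express discrimination a and difficulty b on the target scale."""
    return a / A, A * b + B

# Hypothetical common-item difficulties on two forms.
b_form_x = [-1.0, 0.0, 1.0]
b_form_y = [-1.5, 0.5, 2.5]

A, B = mean_sigma_constants(b_form_x, b_form_y)
a_new, b_new = transform_item(1.2, 0.0, A, B)
print(A, B, a_new, b_new)
```

Under the same transformation an ability value is converted as A * theta + B, so abilities and difficulties stay on one common scale after linking.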

7.
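The examination standards and the technical specifications for test paper structure of the new-curriculum Gaokao geography test are built on the General Senior High School Curriculum Plan (Experimental) and the General Senior High School Geography Curriculum Standards (Experimental). The technical specifications for the geography test paper structure cover the paper's structural model, content, objectives, item types, difficulty, score allocation, time limit, length, and equating. They serve as the standard for item writing, item review, evaluation, and monitoring in the new-curriculum Gaokao geography test and as the basis for quality control of test papers and items, and they can provide a systematic, operational set of standards and methods for evaluating the quality of Gaokao geography test papers and items.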

8.
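Because of test security concerns, flawed test assembly, and similar problems, the forms of some tests cannot share anchor items or were built without them. Comparing the scores of examinees who took different forms requires test equating. Unlike tests with anchor items, which can be equated through the anchor items shared between forms, tests without anchor items must be equated with special methods. Currently there are three main approaches to equating tests without anchor items. The first works through specific items in the tests, that is, by constructing common "anchor items", as in the constructed random-equivalent-groups test method and methods that equate using prior information about items. The second constructs equivalent examinee groups, namely the constructed random-equivalent-groups sample method. The third relies on the cognitive attributes measured by the test items and is generally carried out with a cognitive diagnostic model, the rule space model.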

9.
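Test Equating
Test equating is a measurement technique that converts test scores on different scales to a common scale. Specifically, test equating converts the scores from multiple test forms that measure the same kind of knowledge or psychological trait to scores on the same scale, so that scores on the different forms are comparable. For example, suppose three tests, A, B, and C, all measure English proficiency. If the same student performs at the same level on all three and scores 60 on Test A, 65 on Test B, and 55 on Test C, then Test C is the hardest, Test A the next hardest, and Test B the easiest. To equate the scores of these three tests, they can all be converted to the score system of any one test. If they are converted to the Test A score system, then 65 on Test B and 55 on Test C both correspond to 60 on Test A.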
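
The A/B/C example can be reproduced with a small sketch of mean equating, one of the simplest equating methods. This is an assumption made for illustration: the abstract does not say which method is used, and mean equating presumes the forms differ in difficulty by a constant amount. The score lists are hypothetical data for one group that took all three tests.

```python
from statistics import mean

def mean_equate(score, scores_from, scores_to):
    """Shift a score by the difference between the two forms' mean scores."""
    return score - mean(scores_from) + mean(scores_to)

# Hypothetical scores of the same group on each test.
test_a = [50, 60, 70]
test_b = [55, 65, 75]
test_c = [45, 55, 65]

b_on_a = mean_equate(65, test_b, test_a)  # Test B score expressed on the Test A scale
c_on_a = mean_equate(55, test_c, test_a)  # Test C score expressed on the Test A scale
print(b_on_a, c_on_a)
```

With these illustrative numbers, 65 on Test B and 55 on Test C both map to 60 on the Test A scale, matching the example in the abstract.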

10.
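The examination standards and the technical specifications for test paper structure of the new-curriculum Gaokao comprehensive liberal arts test are built on the General Senior High School Curriculum Plan (Experimental) and the General Senior High School curriculum standards (experimental) for the individual liberal arts subjects. The technical specifications used to model the structure of the comprehensive liberal arts test paper cover the paper's structural model, content, objectives, item types, difficulty, score allocation, length, time limit, and equating. They serve as the standard for item writing, item review, evaluation, and monitoring in the new-curriculum Gaokao comprehensive liberal arts test and as the basis for quality control of test papers and items, and they can provide a systematic, operational set of standards and methods for evaluating the quality of Gaokao comprehensive liberal arts test papers and items.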

11.
Scaling is the process of constructing a score scale that associates numbers or other ordered indicators with the performance of examinees. Scaling typically is conducted to aid users in interpreting test results. This module describes different types of raw scores and scale scores, illustrates how to incorporate various sources of information into a score scale, and introduces vertical scaling and its related designs and methodologies as a special type of scaling. After completion of this module, the reader should be able to understand the relationship between various types of raw scores, understand the relationship between raw scores and scale scores, construct a scale with desired properties, evaluate an existing score scale, understand how content and standards information are built into a scale, and understand how vertical scales are developed and used in practice.

12.
Most growth models implicitly assume that test scores have been vertically scaled. What may not be widely appreciated are the different choices that must be made when creating a vertical score scale. In this paper empirical patterns of growth in student achievement are compared as a function of different approaches to creating a vertical scale. Longitudinal item‐level data from a standardized reading test are analyzed for two cohorts of students between Grades 3 and 6 and Grades 4 and 7 for the entire state of Colorado from 2003 to 2006. Eight different vertical scales were established on the basis of choices made for three key variables: Item Response Theory modeling approach, linking approach, and ability estimation approach. It is shown that interpretations of empirical growth patterns appear to depend upon the extent to which a vertical scale has been effectively “stretched” or “compressed” by the psychometric decisions made to establish it. While all of the vertical scales considered show patterns of decelerating growth across grade levels, there is little evidence of scale shrinkage.

13.
14.
Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two‐ and three‐parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL‐ or 3PL‐based vertical scaling in the presence of LPE‐induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades.

15.
Vertical achievement scales, which range from the lower elementary grades to high school, are used pervasively in educational assessment. Using simulated data modeled after real tests, the present article examines two procedures available for vertical scaling: a Thurstone method and three-parameter item response theory. Neither procedure produced artifactual scale shrinkage; both procedures produced modest scale expansion for one simulated condition.

16.
Many psychological constructs show heterotypic continuity—their behavioral manifestations change with development but their meaning remains the same. However, research has paid little attention to how to account for heterotypic continuity. A promising approach to account for heterotypic continuity is creating a developmental scale using vertical scaling. A simulation was conducted to compare creating a developmental scale using vertical scaling to traditional approaches of longitudinal assessment. Traditional approaches that failed to account for heterotypic continuity resulted in less accurate growth estimates, at the person- and group level. Findings suggest that ignoring heterotypic continuity may result in faulty developmental inferences. Creating a developmental scale with vertical scaling is recommended to link different measures across time and account for heterotypic continuity.

17.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.

18.
Reading and Mathematics tests of multiple-choice items for grades Kindergarten through 9 were vertically scaled using the three-parameter logistic model and two different scaling procedures: concurrent and separate by grade groups. Item parameters were estimated using Markov chain Monte Carlo methodology while fixing the grade 4 population abilities to have a standard normal distribution. For the separate grade-groups scaling, grade groupings were linked using the Stocking and Lord test characteristic curve procedure. Abilities were estimated using the maximum-likelihood method. In either content area, scatterplots of item difficulty, discrimination, and ability estimates from the two methods showed consistently strong linear relationships. However, as grade deviated from the base grade of four, the best-fit linear line through the pairs of item discriminations started to rotate away from the identity line. This indicated the discrimination estimates from the separate grade-groups procedure for extreme grades to be, on average, higher than those from the concurrent analysis. The study also observed some systematic change in score variability across grades. In general, the two vertical scaling approaches yielded similar results at more grades in Reading than in Mathematics.
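
As a rough illustration of the two ingredients this abstract names, the sketch below writes out the three-parameter logistic response function and a test-characteristic-curve loss of the kind minimized in Stocking and Lord linking. It is not the study's implementation: the scaling constant D is omitted, the ability grid is unweighted, and all item parameters are hypothetical.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (no D constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)

def stocking_lord_loss(A, B, items_from, items_to, grid):
    """Squared TCC difference after rescaling items_from with slope A, intercept B."""
    rescaled = [(a / A, A * b + B, c) for a, b, c in items_from]
    return sum((tcc(t, items_to) - tcc(t, rescaled)) ** 2 for t in grid)

# Hypothetical common items as (a, b, c); items_to equals items_from
# expressed on a scale with A = 2 and B = 0.5.
items_from = [(1.0, 0.0, 0.2), (1.5, 1.0, 0.25)]
items_to = [(0.5, 0.5, 0.2), (0.75, 2.5, 0.25)]
grid = [x / 2 for x in range(-8, 9)]  # ability points from -4 to 4

loss_true = stocking_lord_loss(2.0, 0.5, items_from, items_to, grid)
loss_identity = stocking_lord_loss(1.0, 0.0, items_from, items_to, grid)
print(loss_true, loss_identity)
```

In practice the constants A and B would be found by numerically minimizing this loss; here the loss is only evaluated at the known constants and at the identity transformation to show that the former fits better.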

19.
Scale scores for educational tests can be made more interpretable by incorporating score precision information at the time the score scale is established. Methods for incorporating this information are examined that are applicable to testing situations with number-correct scoring. Both linear and nonlinear methods are described. These methods can be used to construct score scales that discourage the overinterpretation of small differences in scores. The application of the nonlinear methods also results in scale scores that have nearly equal error variability along the score scale and that possess the property that adding a specified number of points to and subtracting the same number of points from any examinee's scale score produces an approximate two-sided confidence interval with a specified coverage. These nonlinear methods use an arcsine transformation to stabilize measurement error variance for transformed scores. The methods are compared through the use of illustrative examples. The effect of rounding on measurement error variability is also considered and illustrated using stanines.
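
The variance-stabilizing idea can be sketched as follows. The abstract says only that an arcsine transformation is used; the Freeman-Tukey averaged form below is an assumption, as is the binomial error model used to check it, and the test length of 40 items is hypothetical.

```python
import math

def arcsine_transform(x, k):
    """Freeman-Tukey arcsine transform of a number-correct score x on k items."""
    return 0.5 * (math.asin(math.sqrt(x / (k + 1)))
                  + math.asin(math.sqrt((x + 1) / (k + 1))))

def error_variance(p, k):
    """Variance of the transformed score when x is binomial(k, p) (assumed model)."""
    probs = [math.comb(k, x) * p**x * (1 - p)**(k - x) for x in range(k + 1)]
    values = [arcsine_transform(x, k) for x in range(k + 1)]
    m = sum(w * v for w, v in zip(probs, values))
    return sum(w * (v - m) ** 2 for w, v in zip(probs, values))

k = 40  # hypothetical test length
for p in (0.3, 0.5, 0.7):
    raw_variance = k * p * (1 - p)  # binomial variance of the raw score
    print(p, raw_variance, error_variance(p, k))
```

Under the binomial assumption the raw-score error variance changes with true proportion correct, while the variance of the transformed score stays close to constant; a linear rescaling of the transformed values would then give scale scores with nearly equal error variability.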

20.
Two methods of constructing equal-interval scales for educational achievement are discussed: Thurstone's absolute scaling method and Item Response Theory (IRT). Alternative criteria for choosing a scale are contrasted. It is argued that clearer criteria are needed for judging the appropriateness and usefulness of alternative scaling procedures, and more information is needed about the qualities of the different scales that are available. In answer to this second need, some examples are presented of how IRT can be used to examine the properties of scales: It is demonstrated that for observed score scales in common use (i.e., any scores that are influenced by measurement error), (a) systematic errors can be introduced when comparing growth at selected percentiles, and (b) normalizing observed scores will not necessarily produce a scale that is linearly related to an underlying normally distributed true trait.
