Found 20 similar documents (search time: 15 ms)
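Cultivating students' scientific thinking is an important goal of science education, and thinking about and solving problems scientifically is a basic requirement for a scientifically literate person. In paper-and-pencil assessment in science subjects, test items can be built around scientific papers, the history of science, and storylines to assess students' scientific thinking.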

Research on the use of multiple-choice tests has presented conflicting evidence about the use of statistical item difficulty as a means of ordering items. An alternative method advocated by many texts is the use of cognitive difficulty. This study examined the effect of using both statistical and cognitive item difficulty in determining item order. Results indicated that students who received items in increasing cognitive order, regardless of the order of statistical difficulty, scored higher on hard items. Students who received the forms with opposing cognitive and statistical difficulty orders scored highest on medium-level items. The study concludes with a call for more research on the effects of cognitive difficulty and suggests that future studies examine subscores as well as total test results.

This study applied kernel equating (KE) in two scenarios: equating to a very similar population and equating to a very different population, referred to as a distant population, using SAT® data. The KE results were compared to the results obtained from analogous traditional equating methods in both scenarios. The results indicate that KE results are comparable to the results of other methods. Further, the results show that when the two populations taking the two tests are similar on the anchor score distributions, different equating methods yield the same or very similar results, even though they have different assumptions.

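曹文娟, 白俊梅. 《考试研究》, 2013, (3): 79-85, 33
This paper uses R-2.15.2 to simulate how the variance of anchor-test difficulty parameters affects test equating error, comparing different types of anchor-test difficulty variance under three equating methods (chained equipercentile, Levine, and Tucker). The results show that when the anchor test's difficulty variance is smaller than that of the full test, the random and systematic equating errors are essentially the same as when the two variances are equal (that is, when the anchor test is a parallel shortened version of the full test, a minitest). Requiring the anchor test to have the same statistical specifications as the full test may therefore be too strict.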

Score equity assessment (SEA) is introduced and placed within a fair assessment context that includes differential prediction or fair selection and differential item functioning. The notion of subpopulation invariance of linking functions is central to the assessment of score equity, just as it has been for differential item functioning and differential prediction. Advanced Placement (AP) data are used for illustrative purposes. The use of multiple-choice and constructed-response items in AP provides an opportunity to observe a case where subpopulation invariance of linking functions does not hold (U.S. History), and a case in which it does hold (Calculus AB). The lack of invariance for U.S. History might be attributed to several sources. The role of SEA in assessing the fairness of test assembly processes is discussed.

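From scoring and equating through score reporting, the stages depend on and affect one another, so the results are highly prone to error. To monitor this process and keep the number of mistakes to a minimum, a set of quality control procedures is needed. Quality control here means a formal, systematic process for ensuring that expected quality standards are met during scoring, equating, and score reporting. The scoring-equating-reporting process can be divided into 11 steps; in many cases, quality checks can be carried out on the final product.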

There is significant potential for error in long production processes that consist of sequential stages, each of which is heavily dependent on the previous stage, such as the SER (Scoring, Equating, and Reporting) process. Quality control procedures are required in order to monitor this process and to reduce the number of mistakes to a minimum. In the context of this module, quality control is a formal systematic process designed to ensure that expected quality standards are achieved during scoring, equating, and reporting of test scores. The module divides the SER process into 11 steps. For each step, possible mistakes that might occur are listed, followed by examples and quality control procedures for avoiding, detecting, or dealing with these mistakes. Most of the listed quality control procedures are also relevant for Internet-delivered and scored testing. Lessons from other industries are also discussed. The motto of this module is: There is a reason for every mistake. If you can identify the mistake, you can identify the reason it happened and prevent it from recurring.

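Principal component evaluation was applied to a statistical analysis of university students' comprehensive assessment scores. Using data produced by one college's comprehensive assessment scheme as the sample, the study examined how much the floating score for academic performance (智育浮动分) influences the assessment results. It found that this component significantly affects the results, which provides a basis and reference for revising and improving the scheme.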

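A Review of Research on Test Equating   Total citations: 1, self-citations: 0, citations by others: 1
This study reviews the literature on test equating from five perspectives: research history; definitions; data collection designs; equating models and methods; and equating error together with the criteria for evaluating different equating methods. It aims to provide a theoretical basis for further equating research.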

An item-preequating design and a random groups design were used to equate forms of the American College Testing (ACT) Assessment Mathematics Test. Equipercentile and 3-parameter logistic model item-response theory (IRT) procedures were used for both designs. Both item preequating methods produced inadequate equating results, and the IRT item preequating method resulted in more equating error than had no equating been conducted. Although neither of the item preequating methods performed well, the results from the equipercentile preequating method were more consistent with those from the random groups method than were the results from the IRT item preequating method. Item context and position effects were likely responsible, at least in part, for the inadequate results for item preequating. Such effects need to be either controlled or modeled, and the design further researched, before the item preequating design can be recommended for operational use.


The effect of changing item responses on scores of elementary school children on a standardized achievement test was studied. Previous research, primarily involving non-standardized instruments and adult samples, indicates that changed responses are more likely to be correct than not. Subjects were 165 third-grade students taking the Metropolitan Reading Tests. Students received no special instructions regarding changing responses. Changes were identified visually and were independently verified. While the frequency of response changes was low, such changes generally improved scores. Sex differences in number and success of changes were non-significant. The relationship between frequency of response change and test score was minimal. Responses to difficult items were changed more frequently, and with less success, than responses to easy items. High scorers made more successful changes than did low scorers. Within the limits of the methodology, results clearly indicated that response changes of elementary students on multiple-choice items tend to improve test scores.

Using factor analysis, we conducted an assessment of multidimensionality for 6 forms of the Law School Admission Test (LSAT) and found 2 subgroups of items or factors for each of the 6 forms. The main conclusion of the factor analysis component of this study was that the LSAT appears to measure 2 different reasoning abilities: inductive and deductive. The technique of N. J. Dorans & N. M. Kingston (1985) was used to examine the effect of dimensionality on equating. We began by calibrating (with item response theory [IRT] methods) all items on a form to obtain Set I of estimated IRT item parameters. Next, the test was divided into 2 homogeneous subgroups of items, each having been determined to represent a different ability (i.e., inductive or deductive reasoning). The items within these subgroups were then recalibrated separately to obtain item parameter estimates, and then combined into Set II. The estimated item parameters and true-score equating tables for Sets I and II corresponded closely.

With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number‐correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real‐number item scores. The generalized algorithm is distinct from the Lord‐Wingersky algorithm in that it explicitly incorporates the task of figuring out all possible unique real‐number test scores in each recursion. Some applications of the generalized recursive algorithm, such as IRT test score reliability estimation and IRT proficiency estimation based on summed test scores, are illustrated with a short test by varying scoring schemes for its items.
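
The recursion described in this abstract can be sketched roughly as follows. This is an illustrative sketch only, not the article's implementation: the function names, the 3PL response function without a scaling constant, and the rounding used to merge equal real-valued totals are all assumptions made for the example. The first function handles dichotomous items (the Lord‐Wingersky case); the second tracks every distinct summed score at each step, which is the idea the abstract attributes to the generalized algorithm.

```python
from math import exp


def p_correct(theta, a, b, c=0.0):
    """Assumed 3PL probability of a correct response (c=0 gives 2PL)."""
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


def lord_wingersky(probs):
    """Number-correct score distribution for dichotomous items.

    probs[i] is the probability of answering item i correctly at a
    fixed proficiency. Returns a list where index s holds P(score = s).
    """
    dist = [1.0]  # before any item, score 0 has probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)   # item answered incorrectly
            new[score + 1] += mass * p       # item answered correctly
        dist = new
    return dist


def generalized_recursion(items):
    """Summed-score distribution when item scores are arbitrary real numbers.

    items is a list of dicts, one per item, mapping each possible item
    score to its probability at a fixed proficiency. Returns a dict
    mapping each distinct summed score to its probability.
    """
    dist = {0.0: 1.0}
    for category_probs in items:
        new = {}
        for total, mass in dist.items():
            for score, p in category_probs.items():
                # Rounding (an assumption here) merges totals that are
                # equal up to floating-point noise.
                key = round(total + score, 10)
                new[key] = new.get(key, 0.0) + mass * p
        dist = new
    return dict(sorted(dist.items()))
```

For example, `lord_wingersky([p_correct(0.0, 1.0, 0.0)] * 2)` gives the score distribution for two identical items at a proficiency equal to their difficulty, and `generalized_recursion` accepts weighted or partial-credit scores such as `{0: 0.5, 0.5: 0.5}`.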

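An Example of Applying Item Response Theory to Equate a Test with Multiple Item Types   Total citations: 2, self-citations: 2, citations by others: 2
Using a statewide high school examination in a U.S. state as an example, this paper describes how item response theory can be applied to equate a test containing multiple item types, and also discusses some other technical aspects of the test.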


Using the Hartshorne and May circles test, cheating behavior was detected among 152 undergraduates under two situations: high threat, high supervision (HTHS) and low threat, low supervision (LTLS). Rest’s Defining Issues Test was used to assess level of moral development. Among all subjects and across both situations, subjects high in moral development cheated less than other subjects (p < .01). However, in the LTLS situation, subjects high in moral development were just as likely to cheat as subjects low in moral development. The implications of the findings for moral education and the cognitive developmental theory of moral development are discussed.

Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees had to guess on the final six questions of the analytical section of the Graduate Record Examination if they were to finish before time expired. At the higher-ability levels, even more guessing was required because the questions administered to higher-ability examinees were typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is very complex because of the relationship between time and difficulty and the multidimensionality of the test.

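Using the different parameters of the Ping command, the connectivity of a running network is tested and diagnosed to determine how the network is operating and to locate the cause of faults, so that the network runs stably.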

A single-group (SG) equating with nearly equivalent test forms (SiGNET) design was developed by Grant to equate small-volume tests. Under this design, the scored items for the operational form are divided into testlets or mini tests. An additional testlet is created but not scored for the first form. If the scored testlets are testlets 1–6 and the unscored testlet is testlet 7, then the first form is composed of testlets 1–6 and the second form is composed of testlets 2–7. The seven testlets are administered as a single form, and when a sufficient number of examinees have taken it, the second form (testlets 2–7) is equated to the first form (testlets 1–6) using an SG equating design. This design thus facilitates the use of an SG equating and allows for the accumulation of data, both of which may reduce equating error. This study compared equatings under the SiGNET and common-item equating designs and found lower equating error for the SiGNET design in very small sample size conditions (e.g., N = 10).
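
The testlet layout in this abstract can be sketched as below. This is an illustration only: the function names are hypothetical, and the final step uses simple mean equating as a stand-in, because the abstract does not say which SG equating method the study used.

```python
def signet_forms(testlets, n_scored):
    """Split one administered form into two overlapping scored forms.

    testlets holds n_scored + 1 testlets in administration order.
    Form 1 uses the first n_scored testlets; form 2 drops the first
    testlet and adds the last one (e.g., 1-6 and 2-7 for seven testlets).
    """
    if len(testlets) != n_scored + 1:
        raise ValueError("expected exactly one extra (unscored) testlet")
    return testlets[:n_scored], testlets[1:]


def form_scores(testlet_scores, n_scored):
    """Scores of one examinee on both forms from a single administration."""
    if len(testlet_scores) != n_scored + 1:
        raise ValueError("expected one score per administered testlet")
    return sum(testlet_scores[:n_scored]), sum(testlet_scores[1:])


def mean_equating_shift(all_testlet_scores, n_scored):
    """Constant to add to a form 2 score to place it on the form 1 scale.

    Mean equating is an assumption chosen for brevity: because every
    examinee has a score on both forms (single group), the shift is the
    difference between the two form means.
    """
    pairs = [form_scores(scores, n_scored) for scores in all_testlet_scores]
    mean_form1 = sum(p[0] for p in pairs) / len(pairs)
    mean_form2 = sum(p[1] for p in pairs) / len(pairs)
    return mean_form1 - mean_form2
```

Because both forms are scored from the same administration, each new examinee adds one paired observation, which is how the design lets data accumulate for a small-volume test.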
