Using data from a large-scale exam, in this study we compared various designs for equating constructed-response (CR) tests to determine which design was most effective in producing equivalent scores across the two tests to be equated. In the context of classical equating methods, four linking designs were examined: (a) an anchor set containing common CR items, (b) an anchor set incorporating common CR items rescored, (c) an external multiple-choice (MC) anchor test, and (d) an equivalent groups design incorporating rescored CR items (no anchor test). The use of CR items without rescoring resulted in much larger bias than the other designs. The use of an external MC anchor resulted in the next largest bias. The use of a rescored CR anchor and the equivalent groups design led to similar levels of equating error.


The main purposes of this study were to develop bi-factor multidimensional item response theory (BF-MIRT) observed-score equating procedures for mixed-format tests and to investigate relative appropriateness of the proposed procedures. Using data from a large-scale testing program, three types of pseudo data sets were formulated: matched samples, pseudo forms, and simulated data sets. Very minor within-format residual dependence in mixed-format tests was found after controlling for the influence of the primary general factor. The unidimensional IRT and BF-MIRT equating methods produced similar equating results for the data used in this study. When a BF-MIRT model is implemented, we recommend the use of observed-score equating instead of true-score equating because the latter requires an arbitrary approximation or reduction process to relate true scores on test forms.

The 1986 scores from Florida's Statewide Student Assessment Test, Part II (SSAT-II), a minimum-competency test required for high school graduation in Florida, were placed on the scale of the 1984 scores from that test using five different equating procedures. For the highest scoring 84% of the students, four of the five methods yielded results within 1.5 raw-score points of each other. They would be essentially equally satisfactory in this situation, in which the tests were made parallel item by item in difficulty and content and the groups of examinees were population cohorts separated by only 2 years. Also, the results from six different lengths of anchor items were compared. Anchors of 25, 20, 15, or 10 randomly selected items provided equatings as effective as 30 items using the concurrent IRT equating method, but an anchor of 5 randomly selected items did not.

In large-scale assessments, such as state-wide testing programs, national sample-based assessments, and international comparative studies, there are many steps involved in the measurement and reporting of student achievement. Each of these steps introduces sources of inaccuracy. It is of interest to identify the source and magnitude of the errors in the measurement process that may threaten the validity of the final results. Assessment designers can then improve the assessment quality by focusing on areas that pose the highest threats to the results. This paper discusses the relative magnitudes of three main sources of error with reference to the objectives of assessment programs: measurement error, sampling error, and equating error. A number of examples from large-scale assessments are used to illustrate these errors and their impact on the results. The paper concludes by making a number of recommendations that could improve the accuracy of large-scale assessment results.

This study explores classification consistency and accuracy for mixed-format tests using real and simulated data. In particular, the current study compares six methods of estimating classification consistency and accuracy for seven mixed-format tests. The relative performance of the estimation methods is evaluated using simulated data. Study results from real data analysis showed that the procedures exhibited similar patterns across various exams, but some tended to produce lower estimates of classification consistency and accuracy than others. As data became more multidimensional, unidimensional and multidimensional item response theory (IRT) methods tended to produce different results, with the unidimensional approach yielding lower estimates than the multidimensional approach. Results from simulated data analysis demonstrated smaller estimation error for the multidimensional IRT methods than for the unidimensional IRT method. The unidimensional approach yielded larger error as tests became more multidimensional, whereas a reverse relationship was observed for the multidimensional IRT approach. Among the non-IRT approaches, the normal approximation and Livingston-Lewis methods performed well, whereas the compound multinomial method tended to produce relatively larger error.

Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Overall, chained mean equating produced the most accurate results for low scores, but the two circle-arc methods produced the most accurate results, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.
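Two of the methods named in this abstract, chained mean and chained linear equating, can be sketched in a few lines. This is a minimal illustration of the textbook definitions, not the study's implementation: the function names and the score lists are hypothetical, and the smoothing, circle-arc, and resampling steps are not reproduced. Each function first links the new form X to the anchor V in the new-form sample, then links V to the reference form Y in the reference-form sample.

```python
from statistics import mean, stdev


def chained_mean_equate(x, new_x, new_v, ref_v, ref_y):
    """Chained mean equating: shift by mean differences along X -> V -> Y."""
    # Link X to the anchor V using the new-form sample.
    v = x - mean(new_x) + mean(new_v)
    # Link V to the reference form Y using the reference-form sample.
    return v - mean(ref_v) + mean(ref_y)


def chained_linear_equate(x, new_x, new_v, ref_v, ref_y):
    """Chained linear equating: match means and standard deviations per link."""
    # Link X to V in the new-form sample.
    v = mean(new_v) + stdev(new_v) / stdev(new_x) * (x - mean(new_x))
    # Link V to Y in the reference-form sample.
    return mean(ref_y) + stdev(ref_y) / stdev(ref_v) * (v - mean(ref_v))


# Hypothetical scores: total (X or Y) and anchor (V) for each sample.
new_x = [10, 12, 14, 16, 18]   # new form, new-form sample
new_v = [4, 5, 6, 7, 8]        # anchor, new-form sample
ref_v = [5, 6, 7, 8, 9]        # anchor, reference-form sample
ref_y = [13, 15, 17, 19, 21]   # reference form, reference-form sample

print(chained_mean_equate(14, new_x, new_v, ref_v, ref_y))
print(chained_linear_equate(14, new_x, new_v, ref_v, ref_y))
```

The mean method estimates only two means per link, which is one reason it can remain stable in very small samples; the linear method also estimates standard deviations, so it tends to be more sensitive to sampling error.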

The goal of this study was the development of a procedure to predict the equating error associated with the long-term equating method of Tate (2003) for mixed-format tests. An expression for the determination of the error of an equating based on multiple links using the error for the component links was derived and illustrated with simulated data. Expressions relating the equating error for single equating links to relevant factors like the equating design and the history of the examinee population ability distribution were determined based on computer simulation. Use of the resulting procedure for the selection of a long-term equating design was illustrated.

An Extension of Four IRT Linking Methods for Mixed-Format Tests
Under item response theory (IRT), linking proficiency scales from separate calibrations of multiple forms of a test to achieve a common scale is required in many applications. Four IRT linking methods including the mean/mean, mean/sigma, Haebara, and Stocking-Lord methods have been presented for use with single-format tests. This study extends the four linking methods to a mixture of unidimensional IRT models for mixed-format tests. Each linking method extended is intended to handle mixed-format tests using any mixture of the following five IRT models: the three-parameter logistic, graded response, generalized partial credit, nominal response (NR), and multiple-choice (MC) models. A simulation study is conducted to investigate the performance of the four linking methods extended to mixed-format tests. Overall, the Haebara and Stocking-Lord methods yield more accurate linking results than the mean/mean and mean/sigma methods. When the NR model or the MC model is used to analyze data from mixed-format tests, limitations of the mean/mean, mean/sigma, and Stocking-Lord methods are described.
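For orientation, the two moment-based methods can be sketched for the familiar single-format case with dichotomous items. This is an illustrative sketch only: the function names and parameter values are hypothetical, and the extension to mixed-format tests developed in the study is not reproduced here. Both methods estimate the slope A and intercept B of the transformation theta_target = A * theta_source + B from the common items' discrimination (a) and difficulty (b) estimates.

```python
from statistics import mean, pstdev


def mean_sigma_coefficients(b_source, b_target):
    """Mean/sigma method: slope from the ratio of difficulty SDs."""
    A = pstdev(b_target) / pstdev(b_source)
    B = mean(b_target) - A * mean(b_source)
    return A, B


def mean_mean_coefficients(a_source, a_target, b_source, b_target):
    """Mean/mean method: slope from the ratio of mean discriminations."""
    A = mean(a_source) / mean(a_target)
    B = mean(b_target) - A * mean(b_source)
    return A, B


def rescale_item(a, b, A, B):
    """Place a source-scale item (a, b) on the target scale."""
    return a / A, A * b + B


# Hypothetical common-item estimates from two separate calibrations.
a_source = [0.8, 1.0, 1.2, 1.5]
b_source = [-1.0, 0.0, 0.5, 1.5]
# Target-scale values constructed so the true transformation is A=1.2, B=0.5.
a_target = [a / 1.2 for a in a_source]
b_target = [1.2 * b + 0.5 for b in b_source]

print(mean_sigma_coefficients(b_source, b_target))
print(mean_mean_coefficients(a_source, a_target, b_source, b_target))
```

Because the example data are an exact linear transformation, both methods recover the same coefficients. With real estimates the two generally differ, and the characteristic-curve methods (Haebara, Stocking-Lord) minimize a criterion over response functions instead of matching moments.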

In most large-scale assessments of student achievement, several broad content domains are tested. Because more items are needed to cover the content domains than can be presented in the limited testing time to each individual student, multiple test forms or booklets are utilized to distribute the items to the students. The construction of an appropriate booklet design is a complex and challenging endeavor that has far-reaching implications for data calibration and score reporting. This module describes the construction of booklet designs as the task of allocating items to booklets under context-specific constraints. Several types of experimental designs are presented that can be used as booklet designs. The theoretical properties and construction principles for each type of design are discussed and illustrated with examples. Finally, the evaluation of booklet designs is described and future directions for researching, teaching, and reporting on booklet designs for large-scale assessments of student achievement are identified.

The development of alternate assessments for students with disabilities plays a pivotal role in state and national accountability systems. An important assumption in the use of alternate assessments in these accountability systems is that scores are comparable on different test forms across diverse groups of students over time. The use of test equating is a common way that states attempt to establish score comparability on different test forms. However, equating presents many unique, practical, and technical challenges for alternate assessments. This article provides case studies of equating for two alternate assessments in Michigan and an approach to determine whether or not equating would be preferred to not equating on these assessments. This approach is based on examining equated score and performance-level differences and investigating population invariance across subgroups of students with disabilities. Results suggest that using an equating method with these data appeared to have a minimal impact on proficiency classifications. The population invariance assumption was suspect for some subgroups and equating methods with some large potential differences observed.

The choice of anchor tests is crucial in applications of the nonequivalent groups with anchor test design of equating. Sinharay and Holland (2006, 2007) suggested “miditests,” which are anchor tests that are content‐representative and have the same mean item difficulty as the total test but have a smaller spread of item difficulties. Sinharay and Holland (2006, 2007), Cho, Wall, Lee, and Harris (2010), Fitzpatrick and Skorupski (2016), Liu, Sinharay, Holland, Curley, and Feigenbaum (2011a), Liu, Sinharay, Holland, Feigenbaum, and Curley (2011b), and Yi (2009) found the miditests to lead to better equating than minitests, which are representative of the total test with respect to content and difficulty. However, these findings recently came into question as Trierweiler, Lewis, and Smith (2016) concluded, based on a comparison of correlation coefficients of miditests and minitests with the total test, that making an anchor test a miditest does not generally increase the anchor to total score correlation and recommended the continuation of the practice of using minitests over miditests. Their recommendation raises the question, “Should miditests continue to be considered in practice?” This note defends the miditests by citing literature that favors miditests and then by showing that miditests perform as well as the minitests in most realistic situations considered in Trierweiler et al. (2016), which implies that miditests should continue to be seriously considered by equating practitioners.

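When equating booklets within a single assessment year, international large-scale assessment programs mainly use concurrent estimation to estimate item parameters and use plausible values to report individual student proficiency; the equating designs and procedures are relatively uniform and consistent across programs. When equating across years, each program, depending on its own design features, uses anchor items or anchor persons (common examinees) together with concurrent estimation, and applies a linear transformation to place student proficiency scores on a common scale so that scores can be compared across years. In light of conditions in China, an equating design that combines anchor items and anchor persons is recommended for linking assessment results across years.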

In this study, we describe what factors influence the observed score correlation between an (external) anchor test and a total test. We show that the anchor to full‐test observed score correlation is based on two components: the true score correlation between the anchor and total test, and the reliability of the anchor test. Findings using an analytical approach suggest that making an anchor test a miditest does not generally maximize the anchor to total test correlation. Results are discussed in the context of what conditions maximize the correlations between the anchor and total test.
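The two components named in this abstract can be related through the classical attenuation identity. The expression below is a textbook sketch, not necessarily the authors' own derivation. It assumes that measurement errors on the external anchor $A$ and the total test $X$ are uncorrelated, and the symbols are introduced here for illustration: $\rho_{XA}$ is the observed-score correlation, $\rho_{T_X T_A}$ the true-score correlation, and $\rho_{XX'}$ and $\rho_{AA'}$ the reliabilities of the total test and the anchor.

```latex
\rho_{XA} \;=\; \rho_{T_X T_A}\,\sqrt{\rho_{XX'}\,\rho_{AA'}}
```

With the total test's reliability held fixed, the observed correlation depends only on the true-score correlation and the anchor's reliability. For example, a true-score correlation of 0.95 with reliabilities of 0.90 and 0.80 gives an observed correlation of about 0.81.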


Previous researchers have established the equivalence of a group administered version of the PPVT with the standard procedure of individual administration, as well as the reliability between alternate forms of the PPVT. In this study, an attempt was made to establish the concurrent validity of a group administered version of the PPVT in terms of two criterion variables. An r of .62 was obtained between the Otis, a group test of intelligence, and the PPVT. An r of .55 was found between the PPVT and the Stanford Achievement Test. Both r’s were significant beyond the .01 level. The concurrent validity of the PPVT was established and suggestions for additional research were made.

This study used innovative assessment practices to obtain and document broad learning outcomes for a 15-hour game-based curriculum in Quest Atlantis, a multi-user virtual environment that supports school-based participation in socioscientific inquiry in ecological sciences. Design-based methods were used to refine and align the enactment of virtual narrative and scientific investigations to a challenging problem-solving assessment and indirectly to achievement test items that were independent of the curriculum. In study one, one sixth-grade teacher used the curriculum in two of his classes and obtained larger gains in understanding and achievement than his two other classes, which used an expository text to learn the same concepts and skills. Further treatment refinements were carried out, and two forms of virtual formative feedback were introduced. In study two, the same teacher used the curriculum in all four of his classes; the revised curriculum resulted in even larger gains in understanding and achievement. Gains averaged 1.1 SD and 0.4 SD, respectively, with greater gains shown for students who engaged more with formative feedback. Principles for assessing designs and designing assessments in virtual environments are presented.
Daniel T. Hickey

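Simulation and empirical studies were used to examine how sample size, number of booklets, and anchor item type affect the precision of item parameter equating in large-scale assessments. Both the simulation and the empirical results showed the following: (1) under most conditions, equating precision for the parameters of dichotomously (0/1) scored items was better than for polytomously scored items, although the difference was less pronounced in the empirical study than in the simulation study; (2) increasing the sample size played an important role in improving the equating precision of item parameters, whereas increasing the number of booklets had very little effect; (3) for both discrimination and difficulty parameters, a combination of 3 booklets and 2,000 examinees already achieved good equating precision, and precision could be improved further simply by increasing the sample size for each booklet to 3,000; under polytomous scoring, when 5 booklets were used, 2,000 examinees per booklet was the most suitable combination.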

Many states are implementing direct writing assessments to assess student achievement. While much literature has investigated minimizing raters' effects on writing scores, little attention has been given to the type of model used to prepare raters to score direct writing assessments. This study reports on an investigation that occurred in a state-mandated writing program when a scoring anomaly became apparent once assessments were put in operation. The study indicates that using a spiral model for training raters and scoring papers results in higher mean ratings than does using a sequential model for training and scoring. Findings suggest that making decisions about cut-scores based on pilot data has important implications for program implementation.

Researchers in education are often interested in determining whether independent groups are equivalent on a specific outcome. Equivalence tests for 2 independent populations have been widely discussed, whereas testing for equivalence with more than 2 independent groups has received little attention. The authors discuss alternatives for testing the equivalence of more than 2 independent populations, and they use a Monte Carlo study to demonstrate and compare the performance of these alternatives under several conditions. The results indicate that a 1-way test (e.g., Wellek's F test) is recommended for assessing the equivalence of more than 2 independent groups because approaches based on conducting pairwise tests of equivalence are overly conservative.

Educational Assessment, 2013, 18(1): 99-110
The purpose of this article is to describe some of the measurement issues encountered in the equating of performance assessments designed for use in making teacher certification decisions. As some teacher certification programs move from sole reliance on multiple-choice items to inclusion of complex performance tasks, difficult measurement issues related to equating may arise. A variety of analytic and judgmental strategies are described in this article that may provide solutions for addressing these equating issues. Analytic strategies are based on examinee data and involve the modification of existing equating procedures, such as linear and equipercentile methods, that have been used successfully in the past with test forms composed of multiple-choice items. Judgmental strategies for equating involve the use of expert judgments to determine the equivalence of scores obtained from alternate forms of an assessment instrument.

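As an important means of evaluating educational quality, large-scale educational assessments often use multiple-booklet designs. Such designs typically adopt incomplete matrix sampling with common items, and the common items can be arranged in two ways: as a common anchor or as a rotating anchor. A common-anchor multiple-booklet design must consider the proportion of common items, their content structure, their statistical characteristics, and their position within the booklets. A rotating-anchor multiple-booklet design is a balanced incomplete matrix design that usually assembles booklets from item blocks, and it must consider the number of blocks, the internal structure of each block, and the arrangement of the blocks. Processing test data from multiple-booklet designs involves scale score estimation under item response theory models, scaling methods, and equating techniques. Examining these issues can provide guidance and recommendations for the design of educational tests.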
