期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Assessing Early Language Development in Children with Vision Disability and Motor Disability

Stephen Hennessey 《International Journal of Disability, Development & Education》2011,58(2):169-187

This article describes a method for identifying test items as disability neutral for children with vision and motor disabilities. Graduate students rated 130 items of the Preschool Language Scale and obtained inter‐rater correlation coefficients of 0.58 for ratings of items as disability neutral for children with vision disability, and 0.77 for ratings of items as disability neutral for children with motor disability. These ratings were used to create three item sets considered disability neutral for children with vision disability, motor disability, or both disabilities. Two methods for scoring the item sets were identified: scoring each set as a partially administered developmental test, or computing standard scores based upon pro‐rated raw score totals. The pro‐rated raw score method generated standard scores that were significantly inflated and therefore less useful for the assessment purposes than the ratio quotient method. This research provides a test accommodation technique for assessing children with multiple disabilities. 相似文献

2.

Comparisons among Designs for Equating Mixed-Format Tests in Large-Scale Assessments

Sooyeon Kim Michael E. Walker Frederick McHale 《Journal of Educational Measurement》2010,47(1):36-53

In this study we examined variations of the nonequivalent groups equating design for tests containing both multiple-choice (MC) and constructed-response (CR) items to determine which design was most effective in producing equivalent scores across the two tests to be equated. Using data from a large-scale exam, this study investigated the use of anchor CR item rescoring (known as trend scoring) in the context of classical equating methods. Four linking designs were examined: an anchor with only MC items, a mixed-format anchor test containing both MC and CR items; a mixed-format anchor test incorporating common CR item rescoring; and an equivalent groups (EG) design with CR item rescoring, thereby avoiding the need for an anchor test. Designs using either MC items alone or a mixed anchor without CR item rescoring resulted in much larger bias than the other two designs. The EG design with trend scoring resulted in the smallest bias, leading to the smallest root mean squared error value. 相似文献

3.

Five Methods to Score the Teacher Observation of Classroom Adaptation Checklist and to Examine Group Differences

Ze Wang David Rohrer Chi-ching Chuang Mayo Fujiki Keith Herman Wendy Reinke 《Journal of Experimental Education》2015,83(1):24-50

This study compared 5 scoring methods in terms of their statistical assumptions. They were then used to score the Teacher Observation of Classroom Adaptation Checklist, a measure consisting of 3 subscales and 21 Likert-type items. The 5 methods used were (a) sum/average scores of items, (b) latent factor scores with continuous indicators, (c) latent factor scores with ordered categorical indicators using the mean- and variance-adjusted weighted least squares estimation method, (d) latent factor scores with ordered categorical indicators using the full information maximum likelihood estimation method, and (e) multidimensional graded response model using the Bock-Aitkin expectation-maximization estimation procedure. Measurement invariance between gender groups and between free/reduced-price lunch status groups was evaluated with the second, third, fourth, and fifth methods. Group mean differences based on the 5 methods were calculated and compared. 相似文献

4.

Weighting Constructed-Response Items in IRT-Based Exams

《教育实用测度》2013,26(4):257-275

Weighting responses to Constructed-Response (CR) items has been proposed as a way to increase the contribution these items make to the test score when there is insufficient testing time to administer additional CR items. The effect of various types of weighting items of an IRT-based mixed-format writing examination was investigated. Constructed-response items were weighted by increasing their representation according to the test blueprint, by increasing their contribution to the test characteristic curve, by summing the ratings of multiple raters, and by applying optimal weights utilized in IRT pattern scoring. Total score and standard errors of the weighted composite forms of CR and Multiple-Choice (MC) items were compared against each other and against a form containing additional rather than weighted items. Weighting resulted in a slight reduction of test reliability but reduced standard error in portions of the ability scale. 相似文献

5.

Use of Adjustment by Minimum Discriminant Information in Linking Constructed‐Response Test Scores in the Absence of Common Items

Yi‐Hsuan Lee Shelby J. Haberman Neil J. Dorans 《Journal of Educational Measurement》2019,56(2):452-472

In many educational tests, both multiple‐choice (MC) and constructed‐response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form‐specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long‐standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation. 相似文献

6.

Equating Subscores under the Nonequivalent Anchor Test (NEAT) Design

Gautam Puhan Longjuan Liang 《Educational Measurement》2011,30(1):23-35

The study examined two approaches for equating subscores. They are (1) equating subscores using internal common items as the anchor to conduct the equating, and (2) equating subscores using equated and scaled total scores as the anchor to conduct the equating. Since equated total scores are comparable across the new and old forms, they can be used as an anchor to equate the subscores. Both chained linear and chained equipercentile methods were used. Data from two tests were used to conduct the study and results showed that when more internal common items were available (i.e., 10–12 items), then using common items to equate the subscores is preferable. However, when the number of common items is very small (i.e., five to six items), then using total scaled scores to equate the subscores is preferable. For both tests, not equating (i.e., using raw subscores) is not reasonable as it resulted in a considerable amount of bias. 相似文献

7.

Estimating the Consistency and Accuracy of Classifications Based on Test Scores 总被引：1，自引：0，他引：1

Samuel A. Livingston Charles Lewis 《Journal of Educational Measurement》1995,32(2):179-197

This article presents a method for estimating the accuracy and consistency of classifications based on test scores. The scores can be produced by any scoring method, including a weighted composite. The estimates use data from a single form. The reliability of the score is used to estimate effective test length in terms of discrete items. The true-score distribution is estimated by fitting a 4-parameter beta model. The conditional distribution of scores on an alternate form, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Agreement between classifications on alternate forms is estimated by assuming conditional independence, given the true score. Evaluation of the method showed estimates to be within 1 percentage point of the actual values in most cases. Estimates of decision accuracy and decision consistency statistics were only slightly affected by changes in specified minimum and maximum possible scores. 相似文献

8.

Effects of Assigning Raters to Items

Robert C. Sykes Kyoko Ito Zhen Wang 《Educational Measurement》2008,27(1):47-55

Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers though fewer than the number of items. 相似文献

9.

高考语文阅读主观题评分方法对题目参数分析的影响

温红博李峰《考试研究》2020,(1):65-73

针对目前高考语文阅读主观题评分方法的局限,提出基于SOLO理论的分类评价法和基于阅读认知过程的建构整合模型(CI)评分法。选择1019名学生高考语文阅读三道主观题的真实作答,采用三种评分法评分,采用项目反应理论对三道主观题进行测量学分析,结果表明:相对于原始评分法,SOLO评分法和CI评分法题目之间具有更高的相关,测验模型拟合更佳,题目区分度较高,题目得分的难度阈限和步长更合理,题目的信息量更大,而CI评分法又明显优于SOLO评分法。研究支持了将CI评方法作为高考语文阅读主观题评分方法的潜在优势。相似文献

10.

Formula Scoring of Multiple-Choice Tests (Correction for Guessing)

Robert B. Frary 《Educational Measurement》1988,7(2):33-38

Formula scoring is a procedure designed to reduce multiple-choice test score irregularities due to guessing. Typically, a formula score is obtained by subtracting a proportion of the number of wrong responses from the number correct. Examinees are instructed to omit items when their answers would be sheer guesses among all choices but otherwise to guess when unsure of an answer. Thus, formula scoring is not intended to discourage guessing when an examinee can rule out one or more of the options within a multiple-choice item. Examinees who, contrary to the instructions, do guess blindly among all choices are not penalized by formula scoring on the average; depending on luck, they may obtain better or worse scores than if they had refrained from this guessing. In contrast, examinees with partial information who refrain from answering tend to obtain lower formula scores than if they had guessed among the remaining choices. (Examinees with misinformation may be exceptions.) Formula scoring is viewed as inappropriate for most classroom testing but may be desirable for speeded tests and for difficult tests with low passing scores. Formula scores do not approximate scores from comparable fill-in-the-blank tests, nor can formula scoring preclude unrealistically high scores for examinees who are very lucky. 相似文献

11.

Defining Score Scales in Relation to Measurement Error

Michael J. Kolen 《Journal of Educational Measurement》1988,25(2):97-110

Scale scores for educational tests can be made more interpretable by incorporating score precision information at the time the score scale is established. Methods for incorporating this information are examined that are applicable to testing situations with number-correct scoring. Both linear and nonlinear methods are described. These methods can be used to construct score scales that discourage the overinterpretation of small differences in scores. The application of the nonlinear methods also results in scale scores that have nearly equal error variability along the score scale and that possess the property that adding a specified number of points to and subtracting the same number of points from any examinee's scale score produces an approximate two-sided confidence interval with a specified coverage. These nonlinear methods use an arcsine transformation to stabilize measurement error variance for transformed scores. The methods are compared through the use of illustrative examples. The effect of rounding on measurement error variability is also considered and illustrated using stanines 相似文献

12.

试卷中含有单个高计分主观题时的信度估计方法

杨志明丁港王雯《教育测量与评价(理论版)》2021,(1):44-48

测评信度是衡量考试质量的核心指标之一,但常规的信度估计方法在估计含有单个高计分主观题试卷的信度时并不恰当,因为这种高计分主观题对测验总分方差的影响太大。解决这种问题的一个做法是:在估计出单个高计分主观题信度的基础上,进一步运用分层α系数公式估计整个试卷的测评信度。单个高计分主观题信度的估计方法有两种,即使用重测信度的估计方法,或者使用根据两个随机变量的相关系数会因随机误差的存在而衰减的特点所提出的估计方法。相似文献

13.

Investigating the Effectiveness of Equating Designs for Constructed-Response Tests in Large-Scale Assessments

Sooyeon Kim Michael E. Walker Frederick McHale 《Journal of Educational Measurement》2010,47(2):186-201

Using data from a large-scale exam, in this study we compared various designs for equating constructed-response (CR) tests to determine which design was most effective in producing equivalent scores across the two tests to be equated. In the context of classical equating methods, four linking designs were examined: (a) an anchor set containing common CR items, (b) an anchor set incorporating common CR items rescored, (c) an external multiple-choice (MC) anchor test, and (d) an equivalent groups design incorporating rescored CR items (no anchor test). The use of CR items without rescoring resulted in much larger bias than the other designs. The use of an external MC anchor resulted in the next largest bias. The use of a rescored CR anchor and the equivalent groups design led to similar levels of equating error. 相似文献

14.

Differences between multiple choice items and constructed response items in the IEA timss surveys

《Studies in Educational Evaluation》2005,31(2-3):145-161

In international large-scale surveys, constructed response (CR) items are increasingly being used and multiple-choice (MC) items are being used less frequently. In this article the two item types will be compared in terms of any differences they have on national mean scores. TIMSS 1995 and TIMSS 1999 data have been used. Are there different effects of the question types for mathematics and science? Does the introduction of open-ended items into the math and science tests affect the math and science achievement results? 相似文献

15.

国际语言测试的计分趋向考察

詹先君洪民《教育与考试》2011,(1):79-84

通过对几种知名的国际语言测试的计分方法进行考察,发现语言测试的计分方法表现了动态计分、转换分数、设置不计分项、倒扣分的趋势;引起这种趋向的动因在于追求计分的精准性、方便考试用户、促使语言测试可持续性发展。国内的大部分语言测试还只是采取单一化的原始计分方法,不利于语言测试的质量提高和发展,不妨以国际语言测试的计分趋向为参照,对计分方法尝试进行改革,以使语言测试更加科学。相似文献

16.

Digital Module 18: Automated Scoring https://ncme.elevate.commpartners.com

Sue Lottridge Amy Burkhardt Michelle Boyer 《Educational Measurement》2020,39(3):141-142

In this digital ITEMS module, Dr. Sue Lottridge, Amy Burkhardt, and Dr. Michelle Boyer provide an overview of automated scoring. Automated scoring is the use of computer algorithms to score unconstrained open-ended test items by mimicking human scoring. The use of automated scoring is increasing in educational assessment programs because it allows scores to be returned faster at lower cost. In the module, they discuss automated scoring from a number of perspectives. First, they discuss benefits and weaknesses of automated scoring, and what psychometricians should know about automated scoring. Next, they describe the overall process of automated scoring, moving from data collection to engine training to operational scoring. Then, they describe how automated scoring systems work, including the basic functions around score prediction as well as other flagging methods. Finally, they conclude with a discussion of the specific validity demands around automated scoring and how they align with the larger validity demands around test scores. Two data activities are provided. The first is an interactive activity that allows the user to train and evaluate a simple automated scoring engine. The second is a worked example that examines the impact of rater error on test scores. The digital module contains a link to an interactive web application as well as its R-Shiny code, diagnostic quiz questions, activities, curated resources, and a glossary. 相似文献

17.

How Should Colleges Treat Multiple Admissions Test Scores?

下载免费PDF全文

Krista Mattern Justine Radunzel Maria Bertling Andrew D. Ho 《Educational Measurement》2018,37(3):11-23

The percentage of students retaking college admissions tests is rising. Researchers and college admissions offices currently use a variety of methods for summarizing these multiple scores. Testing organizations such as ACT and the College Board, interested in validity evidence like correlations with first‐year grade point average (FYGPA), often use the most recent test score available. In contrast, institutions report using a variety of composite scoring methods for applicants with multiple test records, including averaging and taking the maximum subtest score across test occasions (“superscoring”). We compare four scoring methods on two criteria. First, we compare correlations between scores and FYGPA by scoring method. We find them similar (). Second, we compare the extent to which test scores differentially predict FYGPA by scoring method and number of retakes. We find that retakes account for additional variance beyond standardized achievement and positively predict FYGPA across all scoring methods. Superscoring minimizes this differential prediction—although it may seem that superscoring should inflate scores across retakes, this inflation is “true” in that it accounts for the positive effects of retaking for predicting FYGPA. Future research should identity factors related to retesting and consider how they should be used in college admissions. 相似文献

18.

Equating Minimum-Competency Tests: Comparisons of Methods

John R. Hills Raja G. Subhiyah Thomas M. Hirsch 《Journal of Educational Measurement》1988,25(3):221-231

The 1986 scores from Florida's Statewide Student Assessment Test, Part II (SSAT-II), a minimum-competency test required for high school graduation in Florida, were placed on the scale of the 1984 scores from that test using five different equating procedures. For the highest scoring 84 % of the students, four of the five methods yielded results within 1.5 raw-score points of each other. They would be essentially equally satisfactory in this situation, in which the tests were made parallel item by item in difficulty and content and the groups of examinees were population cohorts separated by only 2 years. Also, the results from six different lengths of anchor items were compared. Anchors of 25, 20, 15, or 10 randomly selected items provided equatings as effective as 30 items using the concurrent IRT equating method, but an anchor of 5 randomly selected items did not 相似文献

19.

A Comparison of Equal Percentile and Partial Credit Equatings for Performance-Based Assessments Composed of Free-Response Items

Huynh Huynh Steven Ferrara 《Journal of Educational Measurement》1994,31(2):125-141

This study compares the equal percentile (EP) and partial credit (PC) equatings for raw scores derived from performance-based assessments composed of free-response (open-ended) items clustered around long reading selections or multistep mathematics problems. Data are from the Maryland School Performance Assessment Program. The results suggest that Masters (1982; Wright & Masters, 1982) partial credit model may be useful for equating examinations composed of moderately easy (or not too difficult)items sharing a first principal component with at least 25% of the total variance. This conclusion appears to hold even in the presence of some level of response dependency for the items within each cluster. Although visible discrepancies were found between PC and EP equated scores in the skewed tail of the score distributions, the direction of these discrepancies is unpredictable. Therefore, it cannot be concluded from the study that the two methods give equivalent results when the distributions are markedly skewed. 相似文献

20.

IRT Approaches to Modeling Scores on Mixed-Format Tests

Won-Chan Lee Stella Y. Kim Jiwon Choi Yujin Kang 《Journal of Educational Measurement》2020,57(2):230-254

This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models. 相似文献