Frequency distributions of test scores may appear irregular and, as estimates of a population distribution, contain a substantial amount of sampling error. Techniques for smoothing score distributions are available that have the capacity to improve estimation. In this article, estimation/smoothing methods that are flexible enough to fit a wide variety of test score distributions are reviewed. The methods are a kernel method, a strong true-score model-based method, and a method that uses polynomial log-linear models. The use of these methods is then reviewed, and applications of the methods are presented that include describing and comparing test score distributions, estimating norms, and estimating equipercentile equivalents in test score equating. Suggestions for further research are also provided.

The purpose of this study was to compare the IRT-based area method and the Mantel-Haenszel method for investigating differential item functioning (DIF), to determine the degree of agreement between the methods in identifying potentially biased items, and, when the two methods led to different results, to identify possible reasons for the discrepancies. Data for the study were the item responses of Anglo American and Native American students who took the 1982 New Mexico High School Proficiency Exam. Two samples of 1,000 students from each group were studied. The major findings were that (a) the consistency of classifications of items into "biased" and "not-biased" categories across replications was 75% to 80% for both methods and (b) when the unreliability of the statistics was taken into account, the two methods led to very similar results. Discrepancies between methods were due to the presence of nonuniform DIF (the Mantel-Haenszel method could not identify these items) and the choice of interval over which DIF was assessed (the IRT method results depended on the choice of interval). The implications for practitioners seem clear: The Mantel-Haenszel method in general provides an acceptable approximation to the IRT-based methods.

The use of accommodations has been widely proposed as a means of including English language learners (ELLs) or limited English proficient (LEP) students in state and districtwide assessments. However, very little experimental research has been done on specific accommodations to determine whether these pose a threat to score comparability. This study examined the effects of linguistic simplification of 4th- and 6th-grade science test items on a state assessment. At each grade level, 4 experimental 10-item testlets were included on operational forms of a statewide science assessment. Two testlets contained regular field-test items, but in a linguistically simplified condition. The testlets were randomly assigned to LEP and non-LEP students through the spiraling of test booklets. For non-LEP students, in 4 t-test analyses of the differences in means for each corresponding testlet, 3 of the mean score comparisons were not significantly different, and the 4th showed the regular version to be slightly easier than the simplified version. Analysis of variance (ANOVA), followed by pairwise comparisons of the testlets, showed no significant differences in the scores of non-LEP students across the 2 item types. Among the 40 items administered in both regular and simplified format, item difficulty did not vary consistently in favor of either format. Qualitative analyses of items that displayed significant differences in p values were not informative, because the differences were typically very small. For LEP students, there was 1 significant difference in student means, and it favored the regular version. However, because the study was conducted in a state with a small number of LEP students, the analyses of LEP student responses lacked statistical power. The results of this study show that linguistic simplification is not helpful to monolingual English-speaking students who receive the accommodation. Therefore, the results provide evidence that linguistic simplification is not a threat to the comparability of scores of LEP and monolingual English-speaking students when offered as an accommodation to LEP students. The study findings may also have implications for the use of linguistic simplification accommodations in science assessments in other states and in content areas other than science.

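This study has three aims: (1) to propose a scheme for comparing score distributions across years when all test items are released after administration, namely a method that combines the designs of Japan's national and regional tests and applies the linking principles of test theory to compare score distributions across years; (2) to discuss the feasibility of the scheme, specifically the possibility of using the test data, the form of regional cooperation, and the requirements placed on the examinee groups; and (3) to verify the scheme with real data by presenting a cross-year comparison of Japanese-language test scores of third-year junior high school students in the 2006 and 2009 school years, which found that the score distributions were essentially unchanged on either test.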

Whenever the purpose of measurement is to inform an inference about a student’s achievement level, it is important that we be able to trust that the student’s test score accurately reflects what that student knows and can do. Such trust requires the assumption that a student’s test event is not unduly influenced by construct-irrelevant factors that could distort his score. This article examines one such factor, test-taking motivation, that tends to induce a person-specific, systematic negative bias on test scores. Because current measurement models underlying achievement testing assume students respond effortfully to test items, it is important to identify test scores that have been materially distorted by non-effortful test taking. A method for conducting effort-related individual score validation is presented, and it is recommended that measurement professionals have a responsibility to identify invalid scores to individuals who make inferences about student achievement on the basis of those scores.

In this study, the relationship between student affective performance and classroom physical environment, social climate, and management style was investigated in a sample of classes in Hong Kong primary schools. The results of Pearson and canonical correlation analyses indicated that among the measures of classroom environment, perceived quality of physical environment and class master's expert power, personal power, and coercive power were the strongest predictors of affective performance. This finding supports the importance of class master's management style in the classroom environment. Students' attitudes toward school and teachers appeared to be most sensitive to variation in the classroom environment, and self-concept was the least sensitive among the seven student affective measures. Students' self-efficacy of learning and intention to drop out were moderately sensitive to classroom environment. Profiles of effective and ineffective classroom environments were also mapped. In effective classrooms, class masters care for students, pay attention to teaching, do not use force or punishment but do create a good classroom climate with their professional knowledge, personal morality, and personality. Physical environment and psychological environment are both important; a good classroom environment is highly correlated with student affective performance.

Educational Assessment, 2013, 18(2): 133-147
This article presents a beginning effort to build a taxonomy for constructed-response test items. The taxonomy defines the categories for various item formats in three distinct dimensions: (a) type of reasoning competency employed, (b) nature of cognitive continuum employed, and (c) kind of response yielded. Each dimension is described, and the reasons for incorporating it into the taxonomy are explained. A theoretical rationale for the taxonomy is developed, and advantages and shortcomings of its use are noted.

Educational Assessment, 2013, 18(4): 317-340
A number of methods for scoring tests with selected-response (SR) and constructed-response (CR) items are available. The selection of a method depends on the requirements of the program, the particular psychometric model and assumptions employed in the analysis of item and score data, and how scores are to be used. This article compares 3 methods: unweighted raw scores, Item Response Theory pattern scores, and weighted raw scores. Student score data from large-scale end-of-course high school tests in Biology and English were used in the comparisons. In the weighted raw score method evaluated in this study, the CR items were weighted so that SR and CR items contributed the same number of points toward the total score. The scoring methods were compared for the total group and for subgroups of students in terms of the resultant scaled score distributions, standard errors of measurement, and proficiency-level classifications. For most of the student ability distribution, the three scoring methods yielded similar results. Some differences in results are noted. Issues to be considered when selecting a scoring method are discussed.
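
The weighted raw score idea in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the rule used here (multiply CR points by the ratio of the SR maximum to the CR maximum) is an assumption, and the function and parameter names are hypothetical.

```python
def weighted_raw_score(sr_points, cr_points, sr_max, cr_max):
    """Return a total score in which the CR section is rescaled so that
    its maximum equals the SR section's maximum (assumed weighting rule)."""
    # Assumption: a single weight is applied to the whole CR section.
    cr_weight = sr_max / cr_max
    return sr_points + cr_weight * cr_points


# Hypothetical test with 40 SR points and 20 CR points available:
# each CR point counts double, so both sections can contribute 40 points.
total = weighted_raw_score(sr_points=30, cr_points=15, sr_max=40, cr_max=20)
```

With these hypothetical numbers the unweighted raw score would be 45, whereas the weighted score gives the CR section the same maximum contribution as the SR section.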

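Because of test security concerns, improper test assembly, and similar problems, the forms of some tests cannot share anchor items or were built without them. Comparing the scores of examinees who took different forms requires test equating. Unlike tests with anchor items, which can be equated through the anchor items shared between forms, tests without anchor items must be equated with special methods. At present there are three main approaches to equating tests without anchor items. The first works through specific items in the test, that is, by constructing common "anchor items"; examples are the constructed random-equivalent-groups test method and equating that uses prior information about the items. The second constructs equivalent examinee groups, namely the constructed random-equivalent-groups sample method. The third equates through the cognitive attributes measured by the test items and generally relies on a cognitive diagnostic model, the rule space model.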

This article explores the amount of equating error at a passing score when equating scores from exams with small sample sizes. This article focuses on equating using classical test theory methods of Tucker linear, Levine linear, frequency estimation, and chained equipercentile equating. Both simulation and real data studies were used in the investigation. The results of the study supported past findings that as the sample sizes increase, the amount of bias in the equating at the passing score decreases. The research also highlights the importance for practitioners to understand the data, to have an informed expectation of the results, and to have a documented rationale for an acceptable amount of equating error.

This study compared and illustrated four differential distractor functioning (DDF) detection methods for analyzing multiple-choice items. The log-linear approach, two item response theory-model-based approaches with likelihood ratio tests, and the odds ratio approach were compared to examine the congruence among the four DDF detection methods. Data from a college-level mathematics placement test were analyzed to understand the causes of differential functioning. Results indicated some agreement among the four detection methods. To facilitate practical interpretation of the DDF results, several possible effect size measures were also obtained and compared.

Six procedures for combining sets of IRT item parameter estimates obtained from different samples were evaluated using real and simulated response data. In the simulated data analyses, true item and person parameters were used to generate response data for three different-sized samples. Each sample was calibrated separately to obtain three sets of item parameter estimates for each item. The six procedures for combining multiple estimates were each applied, and the results were evaluated by comparing the true and estimated item characteristic curves. For the real data, the two best methods from the simulation data analyses were applied to three different-sized samples and the resulting estimated item characteristic curves were compared to the curves obtained when the three samples were combined and calibrated simultaneously. The results support the use of covariance matrix-weighted averaging and a procedure that involves sample-size-weighted averaging of estimated item characteristic curves at the center of the ability distribution.

This Monte Carlo simulation study compares methods to estimate the effects of programs with multiple versions when assignment of individuals to program version is not random. These methods use generalized propensity scores, which are predicted probabilities of receiving a particular level of the treatment conditional on covariates, to remove selection bias. The results indicate that inverse probability of treatment weighting (IPTW) removes the most bias, followed by optimal full matching (OFM), and marginal mean weighting through stratification (MMWTS). The study also compared standard error estimation with Taylor series linearization, bootstrapping and the jackknife across propensity score methods. With IPTW, these standard error estimation methods performed adequately, but standard error estimates were biased in most conditions with OFM and MMWTS.
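
As a rough illustration of the IPTW idea summarized above, the sketch below weights each individual by the inverse of the generalized propensity score for the program version actually received, then computes a weighted mean outcome per version. It assumes the propensity scores have already been estimated by some model, and it is not the study's own procedure; the function names and the data layout are hypothetical.

```python
def iptw_weights(gps, treatment):
    """gps[i][v] is the (already estimated) probability that individual i
    receives version v; treatment[i] is the version actually received.
    Returns one weight per individual: 1 / P(received version | covariates)."""
    return [1.0 / probs[t] for probs, t in zip(gps, treatment)]


def weighted_mean_by_version(outcomes, treatment, weights, n_versions):
    """Weighted mean outcome for each program version (Hajek-style ratio)."""
    means = []
    for v in range(n_versions):
        num = sum(w * y for y, t, w in zip(outcomes, treatment, weights) if t == v)
        den = sum(w for t, w in zip(treatment, weights) if t == v)
        means.append(num / den)
    return means


# Hypothetical data: three individuals, two program versions.
gps = [[0.5, 0.5], [0.25, 0.75], [0.5, 0.5]]
treatment = [0, 0, 1]
outcomes = [1.0, 3.0, 5.0]
weights = iptw_weights(gps, treatment)
means = weighted_mean_by_version(outcomes, treatment, weights, n_versions=2)
```

Differences between the weighted means estimate the relative effects of the versions only if the propensity model is adequate; the sketch does not address standard error estimation.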

Building on previous works by Lord and Ogasawara for dichotomous items, this article proposes an approach to derive the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach could be used in place of empirical methods like the bootstrap method, to obtain standard errors of equated scores. Formulas are introduced to obtain the derivatives for computing the asymptotic standard errors. The approach was validated using mean-mean, mean-sigma, random-groups, or concurrent calibration equating of simulated samples, for tests modeled using the generalized partial credit model or the graded response model.

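Temperature rise is a very important performance parameter of a transformer, and in practice the results that different manufacturers obtain when testing the same transformer are often found to differ markedly. From the perspective of test methods, this paper introduces the temperature-rise test methods that three typical manufacturers use for distribution transformers, analyzes and compares in some detail the methods for calculating the winding resistance at the instant of power-off and the average oil temperature rise, and concludes with a summary of transformer temperature-rise test methods.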

In many educational tests, both multiple-choice (MC) and constructed-response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form-specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long-standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.

Factor score regression has recently received growing interest as an alternative for structural equation modeling. However, many applications are left without guidance because of the focus on normally distributed outcomes in the literature. We perform a simulation study to examine how a selection of factor scoring methods compare when estimating regression coefficients in generalized linear factor score regression. The current study evaluates the regression method and the correlation-preserving method as well as two sum score methods in ordinary, logistic, and Poisson factor score regression. Our results show that scoring method performance can differ notably across the considered regression models. In addition, the results indicate that the choice of scoring method can substantially influence research conclusions. The regression method generally performs the best in terms of coefficient and standard error bias, accuracy, and empirical Type I error rates. Moreover, the regression method and the correlation-preserving method mostly outperform the sum score methods.

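The assessment content, methods, and scoring of the plant physiology laboratory course, as well as the evaluation of its teaching, are discussed.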
