Similar literature
 Found 20 similar documents (search time: 46 ms)
1.
In this study, we compared 12 statistical strategies proposed for selecting loglinear models for smoothing univariate test score distributions and for enhancing the stability of equipercentile equating functions. The major focus was on evaluating the effects of the selection strategies on equating function accuracy. Selection strategies' influence on the estimation of cumulative test score distributions was also assessed. The results of this simulation study differentiate the selection strategies and define the situations where their use has the most important implications for equating function accuracy. The recommended strategy for estimating test score distributions and for equating is AIC minimization.
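The AIC-minimization strategy recommended in this abstract can be illustrated with a minimal sketch. It assumes the candidate loglinear models have already been fitted and that each is summarized by its polynomial degree (taken here as the number of free parameters) and its maximized log-likelihood. The function name `select_by_aic` and the numeric values are hypothetical, chosen only for illustration; they are not results from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = -2 * logL + 2 * k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def select_by_aic(candidates):
    """Return the (degree, log_likelihood) pair with the smallest AIC.

    `candidates` is a list of (degree, log_likelihood) tuples; the degree
    is assumed to equal the number of free parameters of the model.
    """
    return min(candidates, key=lambda c: aic(c[1], c[0]))


# Hypothetical fits (illustrative values only, not data from the study).
fits = [(1, -520.0), (2, -505.0), (3, -504.5), (4, -504.4)]
best_degree, _ = select_by_aic(fits)
print(best_degree)
```

Higher-degree models always fit at least as well, so the penalty term 2k is what stops the selection from drifting toward the most complex candidate.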

2.
The selection of bandwidth in kernel equating is important because it has a direct impact on the equated test scores. The aim of this article is to examine the use of double smoothing when selecting bandwidths in kernel equating and to compare double smoothing with the commonly used penalty method. This comparison was made using both an equivalent groups design and a nonequivalent group with anchor test design. The performance of the methods was evaluated through simulation studies using both symmetric and skewed score distributions. In addition, the bandwidth selection methods were applied to real data from a college admissions test. The results show that the traditional penalty method works well although double smoothing is a viable alternative because it performs reasonably well compared to the traditional method.

3.
Numerous methods have been proposed and investigated for estimating the standard error of measurement (SEM) at specific score levels. Consensus on the preferred method has not been obtained, in part because there is no standard criterion. The criterion procedure in previous investigations has been a single test occasion procedure. This study compares six estimation techniques. Two criteria were calculated by using test results obtained from a test-retest or parallel forms design. The relationship between estimated score level standard errors and the score scale was similar for the six procedures. These relationships were also congruent to findings from previous investigations. Similarity between estimates and criteria varied over methods and criteria. For test-retest conditions, the estimation techniques are interchangeable. The user's selection could be based on personal preference. However, for parallel forms conditions, the procedures resulted in estimates that were meaningfully different. The preferred estimation technique would be Feldt's method (cited in Gupta, 1965; Feldt, 1984).

4.
In operational equating situations, frequency estimation equipercentile equating is considered only when the old and new groups have similar abilities. The frequency estimation assumptions are investigated in this study under various situations from the perspectives of both theoretical interest and practical use. The study shows that frequency estimation equating can be used under circumstances when it is not normally used. To link theoretical results with practice, statistical methods are proposed for checking frequency estimation assumptions based on available data: observed-score distributions and item difficulty distributions of the forms. In addition to the conventional use of frequency estimation equating when the group abilities are similar, three situations are identified when the group abilities are dissimilar: (a) when the two forms and the observed conditional score distributions are similar (in this situation, the frequency estimation equating assumptions are likely to hold, and frequency estimation equating is appropriate); (b) when forms are similar but the observed conditional score distributions are not (in this situation, frequency estimation equating is not appropriate); and (c) when forms are not similar but the observed conditional score distributions are (frequency estimation equating is not appropriate). Statistical analysis procedures for comparing distributions are provided. Data from a large-scale test are used to illustrate the use of frequency estimation equating when the group difference in ability is large.

5.
When cut scores for classifications occur on the total score scale, popular methods for estimating classification accuracy (CA) and classification consistency (CC) require assumptions about a parametric form of the test scores or about a parametric response model, such as item response theory (IRT). This article develops an approach to estimate CA and CC nonparametrically by replacing the role of the parametric IRT model in Lee's classification indices with a modified version of Ramsay's kernel-smoothed item response functions. The performance of the nonparametric CA and CC indices is tested in simulation studies in various conditions with different generating IRT models, test lengths, and ability distributions. The nonparametric approach to CA often outperforms Lee's method and Livingston and Lewis's method, showing robustness to nonnormality in the simulated ability. The nonparametric CC index performs similarly to Lee's method and outperforms Livingston and Lewis's method when the ability distributions are nonnormal.

6.
This article examines whether Bayesian estimation with minimally informative prior distributions can alleviate the estimation problems often encountered with fitting the true score multitrait–multimethod structural equation model with split-ballot data. In particular, the true score multitrait–multimethod structural equation model encounters an empirical underidentification when (a) latent variable correlations are homogeneous, and (b) it is fitted to data from a 2-group split-ballot design; this is an understudied case of empirical underidentification due to a planned missingness (i.e., split-ballot) design. A Monte Carlo simulation and 3 empirical examples showed that Bayesian estimation performs better than maximum likelihood (ML) estimation. Therefore, we suggest using Bayesian estimation with minimally informative prior distributions when estimating the true score multitrait–multimethod structural equation model with split-ballot data. Furthermore, given the increase in planned missingness designs in psychological research, we also suggest using Bayesian estimation as a potential alternative to ML estimation for analyses using data from planned missingness designs.

7.
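Analysis of evaluation indices for test paper quality and examination scores   Cited 3 times (self-citations: 0; citations by others: 3)
This paper revises the formulas for the item caution coefficient, the reliability of examination scores, the validity of examination scores, and the student caution coefficient so that they are suitable for computer processing. It proposes four evaluation indices (the mean test score, test discrimination, the test caution coefficient, and the student test score rate) that quantify the evaluation of test paper quality, teaching quality, and student learning level, and that can be used in teaching evaluation.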

8.
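The generalized Pareto distribution (GPD) is an important distribution in statistical inference and is now widely applied in many fields. Several methods exist for estimating the GPD parameters, but the methods and their performance are generally constrained by the shape parameter k. This paper reviews several commonly used estimation methods, including the method of moments (MOM), least squares estimation (LSE), the elemental percentile method (EPM), and approximate generalized least squares estimation (AGLSE). A simulation study shows that no estimation method is uniformly best. When k is large, LSE performs relatively well in simulations of GPD parameter estimation; in particular, when k1/2, AGLSE estimates k with relatively high precision.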
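Of the estimators listed in this abstract, the method of moments is the simplest to write down. The sketch below is only an illustration and assumes the parameterization F(x) = 1 - (1 - kx/σ)^(1/k), which the abstract does not state explicitly; under that assumption the mean is σ/(1+k) and the variance is σ²/((1+k)²(1+2k)), so both parameters follow from the first two sample moments. The function name `gpd_mom` is hypothetical.

```python
import statistics


def gpd_mom(sample):
    """Method-of-moments estimates (k, sigma) for the GPD.

    Assumes the parameterization F(x) = 1 - (1 - k*x/sigma)**(1/k),
    for which mean = sigma/(1+k) and var = sigma**2/((1+k)**2 * (1+2k)).
    The variance is finite only for k > -1/2, so the estimator is
    meaningful only in that range.
    """
    m = statistics.mean(sample)
    v = statistics.variance(sample)  # unbiased sample variance
    ratio = m * m / v                # equals 1 + 2k under the model
    k = 0.5 * (ratio - 1.0)
    sigma = 0.5 * m * (ratio + 1.0)  # equals m * (1 + k)
    return k, sigma


# Tiny deterministic check: mean 2 and sample variance 2 give k = 0.5, sigma = 3.
k_hat, sigma_hat = gpd_mom([1, 3])
print(k_hat, sigma_hat)
```

The dependence on a finite variance is one concrete way in which the shape parameter restricts an estimator, which is the kind of limitation the abstract refers to.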

9.
Four methods for estimating a dynamic factor model, the direct autoregressive factor score (DAFS) model, are evaluated and compared. The first method estimates the DAFS model using a Kalman filter algorithm based on its state space model representation. The second one employs the maximum likelihood estimation method based on the construction of a block-Toeplitz covariance matrix in the structural equation modeling framework. The third method is built in the Bayesian framework and implemented using Gibbs sampling. The fourth is the least squares method, which also employs the block-Toeplitz matrix. All 4 methods are implemented in currently available software. The simulation study shows that all 4 methods reach appropriate parameter estimates with comparable precision. Differences among the 4 estimation methods and related software are discussed.

10.
This study examined the extent to which log-linear smoothing could improve the accuracy of differential item functioning (DIF) estimates in small samples of examinees. Examinee responses from a certification test were analyzed using White examinees in the reference group and African American examinees in the focal group. Using a simulation approach, separate DIF estimates for seven small-sample-size conditions were obtained using unsmoothed (U) and smoothed (S) score distributions. These small sample U and S DIF estimates were compared to a criterion (i.e., DIF estimates obtained using the unsmoothed total data) to assess their degree of variability (random error) and accuracy (bias). Results indicate that for most studied items smoothing the raw score distributions reduced random error and bias of the DIF estimates, especially in the small-sample-size conditions. Implications of these results for operational testing programs are discussed.

11.
This article considers two new smoothing methods in equipercentile equating, the cubic B-spline presmoothing method and the direct presmoothing method. Using a simulation study, these two methods are compared with established methods, the beta-4 method, the polynomial loglinear method, and the cubic spline postsmoothing method, under three sample sizes (300, 1,000, and 3,000) and for three test content areas (ITBS Maps and Diagrams, ITBS Reference and Materials, and ITBS Capitalization). Ten thousand random samples were simulated from population distributions, and the standard error, bias, and RMSE statistics were calculated. The cubic B-spline presmoothing method performed well in reducing total error of equating, whereas the direct presmoothing method appeared to need some modification for it to be as accurate as other smoothing methods.

12.
The Non-Equivalent-groups Anchor Test (NEAT) design has been in wide use since at least the early 1940s. It involves two populations of test takers, P and Q, and makes use of an anchor test to link them. Two linking methods used for NEAT designs are those (a) based on chain equating and (b) that use the anchor test to post-stratify the distributions of the two operational test scores to a common population (i.e., Tucker equating and frequency estimation). We show that, under different sets of assumptions, both methods are observed score equating methods and we give conditions under which the methods give identical results. In addition, we develop analogues of the Dorans and Holland (2000) RMSD measures of population invariance of equating methods for the NEAT design for both chain and post-stratification equating methods.

13.
The purpose of this study was to investigate multidimensional DIF with simple and nonsimple structures in the context of the multidimensional Graded Response Model (MGRM). This study examined and compared the performance of the IRT-LR and Wald tests using the MML-EM and MHRM estimation approaches with different test factors and test structures, both in simulation studies and in applications to real data sets. When the test structure included two dimensions, the IRT-LR (MML-EM) generally performed better than the Wald test and provided higher power rates. If the test included three dimensions, the methods provided similar performance in DIF detection. In contrast to these results, when the number of dimensions in the test was four, MML-EM estimation completely lost precision in estimating the nonuniform DIF, even with large sample sizes. The Wald test with the MHRM estimation approach outperformed the Wald test (MML-EM) and IRT-LR (MML-EM). The Wald test had a higher power rate and acceptable type I error rates for nonuniform DIF with the MHRM estimation approach. Small and/or unbalanced sample sizes, small DIF magnitudes, unequal ability distributions between groups, the number of dimensions, estimation methods, and test structure were evaluated as important test factors for detecting multidimensional DIF.

14.
Defining one observation as the score received by one examinee on one item, the results of this investigation suggest that, for a given test length, item-examinee sampling procedures having the same number of observations have, for all practical purposes, the same standard error in estimating μ but different standard errors in estimating σ. Additionally, the variance of the item difficulty indices (proportion answering the item correctly) was found to be a significant factor in accounting for differences in standard errors of estimating μ between normative distributions differing primarily in degree of skewness.

15.
Previous methods for estimating the conditional standard error of measurement (CSEM) at specific score or ability levels are critically discussed, and a brief summary of prior empirical results is given. A new method is developed that avoids theoretical problems inherent in some prior methods, is easy to implement, and estimates not only a quantity analogous to the CSEM at each score but also the conditional standard error of prediction (CSEP) at each score and the conditional true score standard deviation (CTSSD) at each score. The new method differs from previous methods in that previous methods have concentrated on attempting to estimate error variance conditional on a fixed value of true score, whereas the new method considers the variance of observed scores conditional on a fixed value of an observed parallel measurement and decomposes these conditional observed score variances into true and error parts. The new method and several older methods are applied to a variety of tests, and representative results are graphically displayed. The CSEM-like estimates produced by the new method are called conditional standard error of measurement in prediction (CSEMP) estimates and are similar to those produced by older methods, but the CSEP estimates produced by the new method offer an alternative interpretation of the accuracy of a test at different scores. Finally, evidence is presented that shows that previous methods can produce dissimilar results and that the shape of the score distribution may influence the way in which the CSEM varies across the score scale.

16.
Applied Measurement in Education, 2013, 26(3): 261-275
Establishing operational cutoff scores has traditionally been performed in two phases: (a) obtaining estimated cutoff scores based on expert judgments and (b) establishing the operational cutoff. The estimation phase involves selecting a method, collecting the data, and analyzing the results. In the second phase, the estimated cutoff score may be accepted or it may be adjusted after considering other pertinent information. This article provides an introductory review of several features of selected methods available at each of the two phases. Features of selected methods for estimating cutoff scores and methods for adjusting the estimated cutoff scores are discussed, and tentative recommendations for method selection at each phase are provided.

17.
The purpose of this study was to compare several methods for determining a passing score on an examination from the individual raters' estimates of minimal pass levels for the items. The methods investigated differ in the weighting that the estimates for each item receive in the aggregation process. An IRT-based simulation method was used to model a variety of error components of minimum pass levels. The results indicate little difference in estimated passing scores across the three methods. Less error was present when the ability level of the minimally competent candidates matched the expected difficulty level of the test. No meaningful improvement in passing score estimation was achieved for a 50-item test as opposed to a 25-item test; however, the RMSE values for estimates with 10 raters were smaller than those for 5 raters. The results suggest that the simplest method for aggregating minimum pass levels across the items in a test (adding them up) is the preferred method.
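The "adding them up" aggregation preferred in this abstract can be sketched in a few lines. The sketch assumes that each rater supplies, for each item, a minimal pass level expressed as a probability between 0 and 1, and that item-level estimates are first averaged over raters and then summed with equal weight; the function name `passing_score` and the example ratings are hypothetical.

```python
def passing_score(ratings):
    """Unweighted passing score from rater-by-item minimal pass levels.

    ratings[r][i] is rater r's estimated minimal pass level (0..1) for
    item i. Each item's estimates are averaged over raters, and the
    item means are summed to give the passing score in raw-score units.
    """
    n_raters = len(ratings)
    n_items = len(ratings[0])
    item_means = [
        sum(rater[i] for rater in ratings) / n_raters for i in range(n_items)
    ]
    return sum(item_means)


# Two hypothetical raters judging a two-item test.
example = [[0.5, 0.7],
           [0.7, 0.9]]
print(passing_score(example))
```

The weighted alternatives compared in the study would replace the final plain sum with a weighted one; the abstract reports little difference among them.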

18.
In psychological research, available data are often insufficient to estimate item factor analysis (IFA) models using traditional estimation methods, such as maximum likelihood (ML) or limited information estimators. Bayesian estimation with common-sense, moderately informative priors can greatly improve efficiency of parameter estimates and stabilize estimation. There are a variety of methods available to evaluate model fit in a Bayesian framework; however, past work investigating Bayesian model fit assessment for IFA models has assumed flat priors, which have no advantage over ML in limited data settings. In this paper, we evaluated the impact of moderately informative priors on the ability to detect model misfit for several candidate indices: posterior predictive checks based on the observed score distribution, leave-one-out cross-validation, and the widely applicable information criterion (WAIC). We found that although Bayesian estimation with moderately informative priors is an excellent aid for estimating challenging IFA models, methods for testing model fit in these circumstances are inadequate.

19.
Four methods are outlined for estimating or approximating from a single test administration the standard error of measurement of number-right test score at specified ability levels or cutting scores. The methods are illustrated and compared on one set of real test data.
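The abstract does not name its four methods. As one illustration of a single-administration estimate for number-right scores, the sketch below uses Lord's binomial error model, under which the standard error of measurement at raw score x on an n-item test is sqrt(x(n - x)/(n - 1)). Treating this as representative of the methods compared in the article is an assumption; the function name `binomial_sem` is hypothetical.

```python
import math


def binomial_sem(x, n):
    """Conditional SEM at number-right score x on an n-item test.

    Uses Lord's binomial error model: sqrt(x * (n - x) / (n - 1)).
    Shown only as an illustrative single-administration estimate.
    """
    if n < 2 or not 0 <= x <= n:
        raise ValueError("need n >= 2 and 0 <= x <= n")
    return math.sqrt(x * (n - x) / (n - 1))


# The estimate is largest at mid-range scores and zero at the extremes.
for score in (0, 10, 25, 40, 50):
    print(score, round(binomial_sem(score, 50), 3))
```

Because the estimate depends only on x and n, it can be evaluated at any cutting score without a second test administration.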

20.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.
