Confidence intervals are often recommended as a means of communicating the extent to which individual test scores may be influenced by measurement error. However, test manuals and assessment texts vary widely in their recommendations about how confidence intervals should be constructed, and several contain misinterpretations of classical test theory. The most widely used procedure for constructing confidence intervals misrepresents the likely distribution of true scores, and confidence intervals constructed with it will be inaccurate, especially when extreme scores are involved. The various procedures for constructing confidence intervals that have been suggested in measurement texts are examined in relation to their approximation to the most accurate procedure that uses the estimated true score as the center of the confidence interval and the standard error of estimate to determine the width. In addition, the problems of applying these procedures to norm-referenced scores are discussed, an issue that has been largely ignored in the assessment literature and that leads to further misinterpretations of confidence intervals.
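The procedure named in this abstract can be illustrated with the standard classical test theory formulas: the estimated true score is the observed score regressed toward the group mean, and the standard error of estimate is SD × sqrt(r(1 − r)), where r is the reliability. The sketch below is only an illustration under those textbook formulas. The function names and the example values (mean 100, SD 15, reliability .90) are hypothetical, and the observed-score interval is included for contrast on the assumption that it is the kind of procedure the abstract criticizes.

```python
import math


def true_score_interval(observed, mean, sd, reliability, z=1.96):
    """Interval centered on the estimated true score.

    Width is set by the standard error of estimate,
    SD * sqrt(r * (1 - r)). All argument names are illustrative.
    """
    estimated_true = mean + reliability * (observed - mean)
    see = sd * math.sqrt(reliability * (1.0 - reliability))
    return estimated_true - z * see, estimated_true + z * see


def observed_score_interval(observed, sd, reliability, z=1.96):
    """Interval centered on the observed score (shown for contrast).

    Width is set by the standard error of measurement,
    SD * sqrt(1 - r).
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return observed - z * sem, observed + z * sem


if __name__ == "__main__":
    # Hypothetical extreme score on a scale with mean 100 and SD 15.
    print(true_score_interval(130, mean=100, sd=15, reliability=0.90))
    print(observed_score_interval(130, sd=15, reliability=0.90))
```

For an extreme observed score such as 130, the two intervals differ in both center and width, which is the situation the abstract singles out.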

Are boys discriminated in Swedish high schools?   Total citations: 1, self-citations: 0, citations by others: 1
Girls typically have higher grades than boys in school and recent research suggests that part of this gender difference may be due to discrimination against boys in grading. We rigorously test this in a field experiment where a random sample of the same tests in the Swedish language is subject to blind and non-blind grading. The non-blind test score is on average 15% lower for boys than for girls. Blind grading lowers the average grade by 13%, indicating that personal ties and/or grade inflation are important in non-blind grading. But we find no evidence of discrimination against boys in grading. The point estimate of the discrimination effect is close to zero with a 95% confidence interval of ±4.5% of the average non-blind grade.

A method for obtaining an approximate confidence interval for the difference in root mean square error of approximation, a widely used goodness-of-fit measure, of 2 structural equation models is discussed, which is based on an application of the bootstrap methodology. The confidence interval represents a useful tool when studying plausibility of parameter restrictions in nested structural equation models and can be used for examining the difference in fit, accounting for complexity, for any 2 models, whether nested or nonnested, fitted to the same data set. The method is illustrated on a numerical example.
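As a rough illustration of the idea in this abstract, the sketch below pairs a common point formula for RMSEA with a generic percentile bootstrap. It is not the authors' procedure: the abstract does not say which bootstrap variant is used, the RMSEA formula shown uses N − 1 in the denominator (some programs use N), and the `statistic` callable is a hypothetical placeholder for "fit both models to the resampled data and return the difference in RMSEA".

```python
import math
import random


def rmsea(chi_square, df, n):
    """Point estimate of RMSEA from a model chi-square, its df, and sample size n."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


def percentile_bootstrap_ci(data, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for statistic(data).

    `statistic` is supplied by the caller; for the use case in the abstract
    it would refit two models on each resample and return their RMSEA
    difference (a hypothetical step that is not implemented here).
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lower_index = max(int(alpha * n_boot), 0)
    upper_index = min(int((1.0 - alpha) * n_boot), n_boot - 1)
    return estimates[lower_index], estimates[upper_index]


if __name__ == "__main__":
    # Illustrative values only: chi-square 85 on 40 df with n = 301.
    print(rmsea(85.0, 40, 301))
    # Toy statistic (the sample mean) standing in for an RMSEA difference.
    toy = [1.0, 2.0, 2.5, 3.0, 4.5, 5.0, 6.0]
    print(percentile_bootstrap_ci(toy, lambda s: sum(s) / len(s), n_boot=500))
```

If the resulting interval for the RMSEA difference excludes zero, that would suggest a difference in fit between the two models, subject to the usual caveats about bootstrap intervals.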

Mediation models are commonly used to identify the mechanisms through which one variable influences another. Among longitudinal mediation methods, latent difference score mediation stands out due to its unique ability to capture nonlinear change over time. However, there is limited information regarding sample size demands to achieve adequate power with this method, resulting in few applications of latent difference score mediation. To address this limitation, the current study presents empirically supported sample size guidelines for 10 common latent difference score mediation structural models and 9 unique population models. The results of this study offer researchers a set of representative sample size estimates that may be used when designing studies or seeking funding.

We evaluated the statistical power of single-indicator latent growth curve models to detect individual differences in change (variances of latent slopes) as a function of sample size, number of longitudinal measurement occasions, and growth curve reliability. We recommend the 2 degree-of-freedom generalized test assessing loss of fit when both slope-related random effects, the slope variance and intercept-slope covariance, are fixed to 0. Statistical power to detect individual differences in change is low to moderate unless the residual error variance is low, sample size is large, and there are more than four measurement occasions. The generalized test has greater power than a specific test isolating the hypothesis of zero slope variance, except when the true slope variance is close to 0, and has uniformly superior power to a Wald test based on the estimated slope variance.

The present study evaluated the multiple imputation method, a procedure that is similar to the one suggested by Li and Lissitz (2004), and compared the performance of this method with that of the bootstrap method and the delta method in obtaining the standard errors for the estimates of the parameter scale transformation coefficients in item response theory (IRT) equating in the context of the common‐item nonequivalent groups design. Two different estimation procedures for the variance‐covariance matrix of the IRT item parameter estimates, which were used in both the delta method and the multiple imputation method, were considered: empirical cross‐product (XPD) and supplemented expectation maximization (SEM). The results of the analyses with simulated and real data indicate that the multiple imputation method generally produced very similar results to the bootstrap method and the delta method in most of the conditions. The differences between the estimated standard errors obtained by the methods using the XPD matrices and the SEM matrices were very small when the sample size was reasonably large. When the sample size was small, the methods using the XPD matrices appeared to yield slight upward bias for the standard errors of the IRT parameter scale transformation coefficients.

Wilcox (16) proposed a latent structure model for answer-until-correct tests that can solve various measurement problems including correcting for guessing without assuming guessing is at random. This paper proposes a closed sequential procedure for estimating true score that can be used in conjunction with an answer-until-correct test. For criterion-referenced tests where the goal is to determine whether an examinee’s true score is above or below a known constant, the accuracy of the new procedure is exactly the same as a more conventional sequential solution. The advantage of the new procedure is that it eliminates the possibility of using an inordinately large number of items when in fact a large number of items is not needed; typical sequential procedures always allow this possibility. In addition, the new procedure appears to compare favorably to traditional tests where the number of items to be administered is fixed in advance.

Many researchers assessing the efficacy of educational programs face challenges due to issues with non-randomization and the likelihood of dependence between nested subjects. The purpose of the study was to demonstrate a rigorous research methodology using a hierarchical propensity score matching method that can be utilized in contexts where randomization is not feasible and dependence between subjects is a concern. Although propensity score matching is not new in helping to create quasi-experimental models, many studies limit propensity score matching to student-level variables. To address this limitation in educational research, this study extends propensity score matching to the next level so that hierarchical modeling techniques can be used to help minimize error due to the likelihood of dependence between nested students. A large-scale educational program that targets first-semester freshmen was used to illustrate the utility and value of the methodology. This type of program is typical in higher education where student self-selection creates difficulty in assessing its true effects on student achievement; however, by using a rigorous methodology, administrators can have higher confidence when making programmatic and budgetary decisions.

The purpose of this study was to determine if a linear procedure, typically applied to an entire examination when equating scores and rescaling judges' standards, could be used with individual item data gathered through Angoff's (1971) standard-setting method. Specifically, experts' estimates of borderline group performance on one form of a test were transformed to be on the same scale as experts' estimates of borderline group performance on another form of the test. The transformations were based on examinees' responses to the items and on judges' estimates of borderline group performance. The transformed values were compared to the actual estimates provided by a group of judges. The equated and rescaled values were reasonably close to those actually assigned by the experts. Bias in the estimates was also relatively small. In general, the rescaling procedure was more accurate than the equating procedure, especially when the examinee sample size for equating was small.

An interval estimation procedure for proportion of explained observed variance in latent curve analysis is discussed, which can be used as an aid in the process of choosing between linear and nonlinear models. The method allows obtaining confidence intervals for the R² indexes associated with repeatedly followed measures in longitudinal studies. In addition to facilitating evaluation of local model fit, the approach is helpful for purposes of differentiating between plausible models stipulating different patterns of change over time, and in particular in empirical situations characterized by large samples and high statistical power. The procedure is also applicable in cross-sectional studies, as well as with general structural equation models. The method is illustrated using data from a nationally representative study of older adults.

Kelley and Lai (2011) recently proposed the use of accuracy in parameter estimation (AIPE) for sample size planning in structural equation modeling. The sample size that reaches the desired width for the confidence interval of root mean square error of approximation (RMSEA) is suggested. This study proposes a graphical extension with the AIPE approach, abbreviated as GAIPE, on RMSEA to facilitate sample size planning in structural equation modeling. GAIPE simultaneously displays the expected width of a confidence interval of RMSEA, the necessary sample size to reach the desired width, and the RMSEA values covered in the confidence interval. Power analysis for hypothesis tests related to RMSEA can also be integrated into the GAIPE framework to allow for a concurrent consideration of accuracy in estimation and statistical power to plan sample sizes. A package written in R has been developed to implement GAIPE. Examples and instructions for using the GAIPE package are presented to help readers make use of this flexible framework. With the capacity of incorporating information on accuracy in RMSEA estimation, values of RMSEA, and power for hypothesis testing on RMSEA in a single graphical representation, the GAIPE extension offers an informative and practical approach for sample size planning in structural equation modeling.

A procedure is presented for obtaining maximum likelihood trait estimates from number-correct (NC) scores for the three-parameter logistic model. The procedure produces an NC score to trait estimate conversion table, which can be used when the hand scoring of tests is desired or when item response pattern (IP) scoring is unacceptable for other (e.g., political) reasons. Simulated data are produced for four 20-item and four 40-item tests of varying difficulties. These data indicate that the NC scoring procedure produces trait estimates that are tau-equivalent to the IP trait estimates (i.e., they are expected to have the same mean for all groups of examinees), but the NC trait estimates have higher standard errors of measurement than IP trait estimates. Data for six real achievement tests verify that the NC trait estimates are quite similar to the IP trait estimates but have higher empirical standard errors than IP trait estimates, particularly for low-scoring examinees. Analyses in the estimated true score metric confirm the conclusions made in the trait metric.
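One way to picture an NC score to trait estimate conversion table is to compute, for each trait value on a grid, the probability of every possible NC score under the three-parameter logistic model and then pick the trait value that maximizes that probability for each score. The sketch below does this with the Lord-Wingersky recursion and a simple grid search. This is a swapped-in illustration, not necessarily the computation used in the article: the abstract does not say whether the likelihood is evaluated exactly or approximated, and the item parameters, the scaling constant D = 1.7, and the grid from −4 to 4 are all assumptions.

```python
import math


def p_3pl(theta, a, b, c, scale=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def number_correct_distribution(theta, items):
    """P(NC = x | theta) for x = 0..n via the Lord-Wingersky recursion."""
    dist = [1.0]
    for a, b, c in items:
        p = p_3pl(theta, a, b, c)
        updated = [0.0] * (len(dist) + 1)
        for x, prob in enumerate(dist):
            updated[x] += prob * (1.0 - p)   # item answered incorrectly
            updated[x + 1] += prob * p       # item answered correctly
        dist = updated
    return dist


def conversion_table(items, grid=None):
    """Map each NC score to the grid value of theta that maximizes its likelihood."""
    if grid is None:
        grid = [i / 20 for i in range(-80, 81)]  # -4.0 to 4.0 in steps of 0.05
    dists = [number_correct_distribution(theta, items) for theta in grid]
    table = {}
    for score in range(len(items) + 1):
        best = max(range(len(grid)), key=lambda g: dists[g][score])
        table[score] = grid[best]
    return table


if __name__ == "__main__":
    # Hypothetical (a, b, c) parameters for a four-item test.
    items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.25), (1.5, 1.0, 0.2)]
    for score, theta in conversion_table(items).items():
        print(score, theta)
```

Scores of zero and perfect scores have no finite maximum, so in this sketch they land on the ends of the grid; a practical table would need a rule for those cases and for scores near the chance level.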

The purpose of this investigation is to evaluate structural equation models (SEMs) for measures of the same construct collected on multiple occasions (one-variable, multiwave panel studies). Simplex models hypothesize that a measure at any one wave is substantially influenced by the measure at the immediately preceding wave; correlations between the same construct measured on different occasions are predicted to decline systematically as the number of intervening occasions increases. Alternatively, a one-factor model posits that a person's score at any one time is a function of some underlying "true" score and a random disturbance that is idiosyncratic to the time; no temporal ordering of correlations is assumed. Both the simplex and one-factor models can be fit when there is only a single indicator of each construct at each wave (e.g., scale scores), but there are serious limitations to such models. Stronger models are possible when the same set of multiple indicators (e.g., the items that make up the scale) is measured at each wave. In Study 1, based on students' evaluations of teaching effectiveness collected over an 8-year period, one-factor models fit the data well, whereas simplex models did not. In Study 2, based on personality variables collected over a 4-year period during adolescence, one-factor models again provided an excellent fit to the data, whereas the simplex model fit marginally worse. The results challenge an overreliance on simplex models and demonstrate that a one-factor model is a potentially useful alternative that should be considered in multiwave studies.

DETECT, the acronym for Dimensionality Evaluation To Enumerate Contributing Traits, is an innovative and relatively new nonparametric dimensionality assessment procedure used to identify mutually exclusive, dimensionally homogeneous clusters of items using a genetic algorithm (Zhang & Stout, 1999). Because the clusters of items are mutually exclusive, this procedure is most useful when the data display approximate simple structure. In many testing situations, however, data display a complex multidimensional structure. The purpose of the current study was to evaluate DETECT item classification accuracy and consistency when the data display different degrees of complex structure using both simulated and real data. Three variables were manipulated in the simulation study: the percentage of items displaying complex structure (10%, 30%, and 50%), the correlation between dimensions (.00, .30, .60, .75, and .90), and the sample size (500, 1,000, and 1,500). The results from the simulation study reveal that DETECT can accurately and consistently cluster items according to their true underlying dimension when as many as 30% of the items display complex structure, if the correlation between dimensions is less than or equal to .75 and the sample size is at least 1,000 examinees. If 50% of the items display complex structure, then the correlation between dimensions should be less than or equal to .60 and the sample size should be at least 1,000 examinees. When the correlation between dimensions is .90, DETECT does not work well with any complex dimensional structure or sample size. Implications for practice and directions for future research are discussed.

Pigeons were trained using a symbolic delayed matching-to-sample procedure involving bright versus dim houselight samples. We hypothesized that when sample stimuli differ in salience, increasing the size of the retention interval will affect performance on trials initiated by the more salient sample only. In agreement with this prediction, accuracy following the dim sample did not decline as the retention interval increased, whereas accuracy following the bright sample declined to well below 50% correct. In a second experiment, the less salient (dim) sample from Experiment 1 was arranged as the more salient sample in a sample/no-sample procedure. Accuracy on dim sample trials then declined to well below 50% correct as the retention interval increased, whereas accuracy on no-sample trials remained constant. The results suggest that when sample stimuli differ in salience, pigeons may transform the nominal discrimination task into a detection task in which they respond on the basis of the presence or absence of the more salient sample.

This study evaluates the reliability of profile analysis for assessing differential abilities on the McCarthy Scales for Children's Abilities (MSCA). Subjects were enrolled in private schools and ranged in age from 5–5 to 6–5. The test-retest interval ranged from 3 to 6 weeks, with a mean interval of 24 days. Results indicated 70.9% of the sample showed profile variability not reasonably accounted for as real fluctuations in measurable abilities. General application of the null hypothesis procedure for calculating statistical significance of scaled score differences as a basis for interpretive judgments is discussed.

Assessing the correspondence between model predictions and observed data is a recommended procedure for justifying the application of an IRT model. However, with shorter tests, current goodness-of-fit procedures that assume precise point estimates of ability are inappropriate. The present paper describes a goodness-of-fit statistic that considers the imprecision with which ability is estimated and involves constructing item fit tables based on each examinee's posterior distribution of ability, given the likelihood of their response pattern and an assumed marginal ability distribution. However, the posterior expectations that are computed are dependent and the distribution of the goodness-of-fit statistic is unknown. The present paper also describes a Monte Carlo resampling procedure that can be used to assess the significance of the fit statistic and compares this method with a previously used method. The results indicate that the method described herein is an effective and reasonably simple procedure for assessing the validity of applying IRT models when ability estimates are imprecise.


Factor mixture models are designed for the analysis of multivariate data obtained from a population consisting of distinct latent classes. A common factor model is assumed to hold within each of the latent classes. Factor mixture modeling involves obtaining estimates of the model parameters, and may also be used to assign subjects to their most likely latent class. This simulation study investigates aspects of model performance such as parameter coverage and correct class membership assignment and focuses on covariate effects, model size, and class-specific versus class-invariant parameters. When fitting true models, parameter coverage is good for most parameters even for the smallest class separation investigated in this study (0.5 SD between 2 classes). The same holds for convergence rates. Correct class assignment is unsatisfactory for the small class separation without covariates, but improves dramatically with increasing separation, covariate effects, or both. Model performance is not influenced by the differences in model size investigated here. Class-specific parameters may improve some aspects of model performance but negatively affect other aspects.

The design of research studies utilizing binary multilevel models must necessarily incorporate knowledge of multiple factors, including estimation method, variance component size, or number of predictors, in addition to sample sizes. This Monte Carlo study examined the performance of random effect binary outcome multilevel models under varying methods of estimation, level-1 and level-2 sample size, outcome prevalence, variance component sizes, and number of predictors using SAS software. Mean estimates of statistical power were influenced primarily by sample sizes at both levels. In addition, confidence interval coverage and width and the likelihood of nonpositive definite random effect covariance matrices were impacted by variance component size and estimation method. The interactions of these and other factors with various model performance outcomes are explored.

In the delayed matching of key location procedure, pigeons must remember the location of the sample key in order to choose correctly between two comparison keys. The deleterious effect of short intertrial intervals on key location matching found in previous studies suggested that pigeons’ short-term spatial memory is affected by proactive interference. However, because a reward expectancy mechanism may account for the intertrial interval effect, additional research aimed at demonstrating proactive interference was warranted. In Experiment 1, matching accuracy did not decline from early to late trials within a session, a finding inconsistent with a proactive interference effect. In Experiment 2, evidence suggestive of proactive interference was found: Matching was more accurate when the locations that served as distractors and as samples were chosen from different sets. However, this effect could have been due to differences in task difficulty, and the results of the two subsequent experiments provided no evidence of proactive interference. In Experiment 3, the distractor on Trial n was either the location that had served as the sample on Trial n − 1 or one that had been a sample on earlier trials. Matching accuracy was not inferior on the former type of trial. In Experiment 4, the stimuli that served as samples and distractors were taken from sets containing 2, 3, 5, or 9 locations. Matching accuracy was no worse, actually slightly better, with smaller memory set sizes. Overall, these findings suggested that pigeons’ memory for spatial location may be immune to proactive interference. However, when, in Experiment 5, an intratrial manipulation was used, clear evidence of proactive interference was found: Matching accuracy was considerably lower when the sample was preceded by the distractor for that trial than when it was preceded by the sample or by nothing. Possible reasons why interference was produced by intratrial but not intertrial manipulations are discussed, as are implications of these data for models of pigeons’ short-term spatial memory.
