Similar Literature (20 matching records)
1.
The authors examined the robustness of multilevel linear growth curve modeling to misspecification of an autoregressive moving average process. As previous research has shown (J. Ferron, R. Dailey, & Q. Yi, 2002; O. Kwok, S. G. West, & S. B. Green, 2007; S. Sivo, X. Fan, & L. Witta, 2005), estimates of the fixed effects were unbiased, and Type I error rates for the tests of the fixed effects were generally accurate when the present authors correctly specified or underspecified the model. However, random effects were poorly estimated under many conditions, even under correct model specification. Further, fit criteria performed inconsistently and were especially inaccurate when small sample sizes and short series lengths were combined. With the exception of elevated Type I error rates that occurred under some conditions, the best performance was obtained by use of an unstructured covariance matrix at the first level of the growth curve model.

2.
The asymptotically distribution-free (ADF) test statistic depends on very mild distributional assumptions and is theoretically superior to many other so-called robust tests available in structural equation modeling. The ADF test, however, often leads to model overrejection even at modest sample sizes. To overcome its poor small-sample performance, a family of robust test statistics obtained by modifying the ADF statistics was recently proposed. This study investigates by simulation the performance of the new modified test statistics. The results revealed that although a few of the test statistics adequately controlled Type I error rates in each of the examined conditions, most performed quite poorly. This result underscores the importance of choosing a modified test statistic that performs well for specific examined conditions. A parametric bootstrap method is proposed for identifying such a best-performing modified test statistic. Through further simulation it is shown that the proposed bootstrap approach performs well.

3.
Cluster sampling results in response variable variation both among respondents (i.e., within-cluster or Level 1) and among clusters (i.e., between-cluster or Level 2). Properly modeling within- and between-cluster variation could be of substantive interest in numerous settings, but applied researchers typically test only within-cluster (i.e., individual difference) theories. Specifying a between-cluster model in the absence of theory requires a specification search in multilevel structural equation modeling. This study examined a variety of within-cluster and between-cluster sample sizes, intraclass correlation coefficients, start models, parameter addition and deletion methods, and Type I error control techniques to identify which combination of start model, parameter addition or deletion method, and Type I error control technique best recovered the population between-cluster model. Results indicated that a "saturated" start model, univariate parameter deletion technique, and no Type I error control performed best, but recovered the population between-cluster model in less than 1 in 5 attempts at the largest sample sizes. The accuracy of specification search methods, suggestions for applied researchers, and future research directions are discussed.

4.
Formulas for the standard error of a parallel-test correlation and for the Kuder-Richardson formula 20 reliability estimate are provided. Given equal values of the two reliabilities in the population, the standard error of the Kuder-Richardson formula 20 is shown to be somewhat smaller than the standard error of a parallel-test correlation for reliability values, sample sizes, and test lengths that are usually encountered in practice.
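
As a concrete reference for the quantity whose standard error is being compared, here is a minimal sketch of the Kuder-Richardson formula 20 computation for a persons-by-items matrix of dichotomous scores. The simulated responses, sample size, and test length are hypothetical and illustrate only the point estimate, not the standard-error formulas derived in the article.

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson formula 20 for a persons-by-items 0/1 response matrix."""
    n_items = responses.shape[1]
    p = responses.mean(axis=0)                      # item difficulties (proportion correct)
    q = 1.0 - p
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (n_items / (n_items - 1)) * (1.0 - (p * q).sum() / total_var)

# Hypothetical data: 200 examinees, 25 items, ability-driven responses.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 25))
prob = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
scores = (rng.uniform(size=prob.shape) < prob).astype(int)
print(f"KR-20 estimate: {kr20(scores):.3f}")
```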

5.
《教育实用测度》2013,26(3):241-261
This simulation study compared two procedures to enable an adaptive test to select items in correspondence with a content blueprint. Trait level estimates obtained from testlet-based and constrained adaptive tests administered to 10,000 simulated examinees under two trait distributions and three item pool sizes were compared to the trait level estimates obtained from traditional adaptive tests in terms of mean absolute error, bias, and information. Results indicate that using constrained adaptive testing requires an increase of 5% to 11% in test length over the traditional adaptive test to reach the same error level, and that using testlets requires an increase of 43% to 104% in test length over the traditional adaptive test. Given these results, the use of constrained computerized adaptive testing is recommended for situations in which an adaptive test must adhere to particular content specifications.
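
The content-balancing idea behind constrained adaptive testing can be sketched roughly as follows: at each step, choose the content area that is furthest below its blueprint target, then pick the most informative remaining item from that area. The Rasch information function, item pool, blueprint, and fixed ability value below are hypothetical simplifications, not the procedures actually compared in the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pool = 300
difficulty = rng.normal(size=n_pool)                        # hypothetical Rasch item difficulties
content = rng.choice(["algebra", "geometry", "data"], size=n_pool)
blueprint = {"algebra": 0.5, "geometry": 0.3, "data": 0.2}  # target content proportions

def rasch_info(theta, b):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

def next_item(theta, administered):
    # Pick the content area with the largest shortfall relative to its target proportion.
    counts = {a: sum(content[i] == a for i in administered) for a in blueprint}
    n_done = max(len(administered), 1)
    area = max(blueprint, key=lambda a: blueprint[a] - counts[a] / n_done)
    candidates = [i for i in range(n_pool) if content[i] == area and i not in administered]
    return max(candidates, key=lambda i: rasch_info(theta, difficulty[i]))

# Administer a 30-item constrained test to one examinee (ability updating omitted for brevity).
theta_hat, administered = 0.0, []
for _ in range(30):
    administered.append(next_item(theta_hat, administered))
print({a: sum(content[i] == a for i in administered) for a in blueprint})
```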

6.
ADHD is one of the most common referrals to school psychologists and child mental health providers. Although a best practice assessment of ADHD requires more than the use of rating scales, rating scales are one of the primary components in the assessment of ADHD. Therefore, the goal of this paper is to provide the reader with a critical and comparative evaluation of the five most commonly used, narrow‐band, published rating scales for the assessment of ADHD. Reviews were conducted in four main areas: content and use, standardization sample and norms, scores and interpretation, and psychometric properties. It was concluded that the rating scales with the strongest standardization samples and evidence for reliability and validity are the ADDES, the ADHD‐IV, and the CRS‐R. In determining which of these to use, the prospective users may want to reflect on their goals for the assessment. The ACTeRS and the ADHDT are not recommended for use because they are lacking crucial information in their manuals and have less well‐documented evidence of reliability and validity. Conclusions and recommendations for scale usage are discussed. © 2003 Wiley Periodicals, Inc. Psychol Schs 40: 341–361, 2003.

7.
Many academic tests (e.g. short‐answer and multiple‐choice) sample required knowledge with questions scoring 0 or 1 (dichotomous scoring). Few textbooks give useful guidance on the length of test needed to do this reliably. Posey's binomial error model of 1932 provides the best starting point, but allows neither for heterogeneity of question difficulty and discriminatory power nor for students' uneven spread of knowledge. Even with these taken into account, it appears that tests of 30–60 items, as commonly used, must generally be far from adequate. No exact test length can be specified as ‘just sufficient’, but the tests of 300 items that some students take are not extravagantly long. The effects on reliability of some particular test forms and practices are discussed.
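
To make the test-length argument tangible, the hedged sketch below simulates two parallel forms under a simple binomial error model (a score on an n-item form is Binomial(n, true proportion correct)) and estimates reliability as the parallel-form correlation at several lengths. The assumed true-score distribution is illustrative, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(2)
true_p = rng.beta(6, 4, size=5000)           # hypothetical true proportions correct

for n_items in (30, 60, 150, 300):
    form_a = rng.binomial(n_items, true_p)
    form_b = rng.binomial(n_items, true_p)   # parallel form, same true scores
    rel = np.corrcoef(form_a, form_b)[0, 1]
    print(f"{n_items:4d} items -> parallel-form reliability = {rel:.2f}")
```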

8.
In criterion‐referenced tests (CRTs), the traditional measures of reliability used in norm‐referenced tests (NRTs) have often proved problematic because of NRT assumptions of one underlying ability or competency and of variance in the distribution of scores. CRTs, by contrast, are likely to be created when mastery of the skill or knowledge by all or almost all test takers is expected and thus little variation in the scores is expected. A comprehensive CRT often measures a number of discrete tasks that may not represent a single unifying ability or competence. Hence, CRTs theoretically violate the two most essential assumptions of classic NRT reliability theory, and they have traditionally required the logistical problems of multiple test administrations to the same test takers to estimate reliability. A review of the literature categorizes approaches to reliability for CRTs into two classes: estimates sensitive to all measures of error and estimates of consistency in test outcome. For single test administration of CRTs, Livingston's k² is recommended for estimating all measures of error, and Sc is proposed for estimates of consistency in test outcome. Both approaches are compared using data from a CRT exam, and recommendations for interpretation and use are proposed.
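
Assuming the commonly cited form of Livingston's coefficient, k² = (r·S² + (M − C)²) / (S² + (M − C)²), where r is a classical reliability estimate, M and S² are the score mean and variance, and C is the cut score, a minimal computation looks like the sketch below; the scores, reliability value, and cut score are hypothetical.

```python
import numpy as np

def livingston_k2(scores: np.ndarray, reliability: float, cut: float) -> float:
    """One common form of Livingston's criterion-referenced coefficient."""
    m, s2 = scores.mean(), scores.var(ddof=1)
    return (reliability * s2 + (m - cut) ** 2) / (s2 + (m - cut) ** 2)

# Hypothetical mastery test: scores cluster near the top, cut score of 80 percent.
rng = np.random.default_rng(3)
scores = np.clip(rng.normal(85, 6, size=400), 0, 100)
print(f"k^2 = {livingston_k2(scores, reliability=0.70, cut=80):.3f}")
```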

9.
This paper presents the results of a simulation study to compare the performance of the Mann-Whitney U test, Student's t test, and the alternate (separate variance) t test for two mutually independent random samples from normal distributions, with both one-tailed and two-tailed alternatives. The estimated probability of a Type I error was controlled (in the sense of being reasonably close to the attainable level) by all three tests when the variances were equal, regardless of the sample sizes. However, it was controlled only by the alternate t test for unequal variances with unequal sample sizes. With equal sample sizes, the probability was controlled by all three tests regardless of the variances. When it was controlled, we also compared the power of these tests and found very little difference. This means that very little power will be lost if the Mann-Whitney U test is used instead of tests that require the assumption of normal distributions.
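
A rough sketch of the simulation logic, using SciPy's implementations of the three tests, is shown below; the sample sizes, standard deviations, and replication count are hypothetical choices rather than the conditions of the study.

```python
import numpy as np
from scipy import stats

def type_i_error(n1, n2, sd1, sd2, reps=5000, alpha=0.05, seed=4):
    """Estimate two-sided Type I error rates under H0 (equal means) for three tests."""
    rng = np.random.default_rng(seed)
    rejections = {"Mann-Whitney U": 0, "Student t": 0, "Welch t": 0}
    for _ in range(reps):
        x = rng.normal(0, sd1, n1)
        y = rng.normal(0, sd2, n2)
        rejections["Mann-Whitney U"] += stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha
        rejections["Student t"] += stats.ttest_ind(x, y, equal_var=True).pvalue < alpha
        rejections["Welch t"] += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha
    return {k: v / reps for k, v in rejections.items()}

# Unequal variances combined with unequal sample sizes is the problematic case.
print(type_i_error(n1=10, n2=40, sd1=3.0, sd2=1.0))
```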

10.
Empirical researchers maximize their contribution to theory development when they compare alternative theory‐inspired models under the same conditions. Yet model comparison tools in structural equation modeling—χ2 difference tests, information criterion measures, and screening heuristics—have significant limitations. This article explores the use of the Friedman method of ranks as an inferential procedure for evaluating competing models. This approach has attractive properties, including limited reliance on sample size, limited distributional assumptions, an explicit multiple comparison procedure, and applicability to the comparison of nonnested models. However, this use of the Friedman method raises important issues regarding the lack of independence of observations and the power of the test.
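
The sketch below illustrates the general mechanics of the Friedman method of ranks for model comparison: compute a fit value for each competing model on each of many resampled data sets, then apply the Friedman test across models with samples as blocks. The regression models and bootstrap fit measure are hypothetical stand-ins for the SEM discrepancy functions the article has in mind, and the resampling step shares the dependence concerns the authors raise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

def sse(y_obs, X):
    """Residual sum of squares from an OLS fit of y_obs on X (with intercept)."""
    X = np.column_stack([np.ones(len(y_obs)), X])
    beta, *_ = np.linalg.lstsq(X, y_obs, rcond=None)
    return float(((y_obs - X @ beta) ** 2).sum())

# Fit three competing (here: nested regression) models on each bootstrap sample.
fits = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    yb, x1b, x2b = y[idx], x1[idx], x2[idx]
    fits.append([sse(yb, x1b.reshape(-1, 1)),
                 sse(yb, x2b.reshape(-1, 1)),
                 sse(yb, np.column_stack([x1b, x2b]))])
fits = np.asarray(fits)

# Friedman test on the per-sample fit values: blocks = samples, treatments = models.
chi2, p = stats.friedmanchisquare(fits[:, 0], fits[:, 1], fits[:, 2])
print(f"Friedman chi-square = {chi2:.1f}, p = {p:.4g}")
```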

11.
It has been argued that item variance and test variance are not necessary characteristics for criterion-referenced tests, although they are necessary for norm-referenced tests. This position is in error because it considers sample statistics as the criteria for evaluating items and tests. Within a particular sample, an item or test may have no variance, but in the population of observations for which the test was designed, calibrated, and evaluated, both items and tests must have variance.

12.
Recent advances in testing mediation have found that certain resampling methods and tests based on the mathematical distribution of 2 normal random variables substantially outperform the traditional z test. However, these studies have primarily focused only on models with a single mediator and 2 component paths. To address this limitation, a simulation was conducted to evaluate these alternative methods in a more complex path model with multiple mediators and indirect paths with 2 and 3 paths. Methods for testing contrasts of 2 effects were evaluated also. The simulation included 1 exogenous independent variable, 3 mediators and 2 outcomes and varied sample size, number of paths in the mediated effects, test used to evaluate effects, effect sizes for each path, and the value of the contrast. Confidence intervals were used to evaluate the power and Type I error rate of each method, and were examined for coverage and bias. The bias-corrected bootstrap had the least biased confidence intervals, greatest power to detect nonzero effects and contrasts, and the most accurate overall Type I error. All tests had less power to detect 3-path effects and more inaccurate Type I error compared to 2-path effects. Confidence intervals were biased for mediated effects, as found in previous studies. Results for contrasts did not vary greatly by test, although resampling approaches had somewhat greater power and might be preferable because of ease of use and flexibility.
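
A hedged sketch of the bias-corrected bootstrap for a single two-path indirect effect a·b is given below; the single-mediator data-generating model, sample size, and effect sizes are hypothetical and much simpler than the multiple-mediator design evaluated in the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 100
x = rng.normal(size=n)
m = 0.4 * x + rng.normal(size=n)             # mediator model: a-path = 0.4
y = 0.3 * m + 0.1 * x + rng.normal(size=n)   # outcome model: b-path = 0.3

def indirect_effect(x, m, y):
    a = np.polyfit(x, m, 1)[0]                               # slope of m on x
    X = np.column_stack([np.ones_like(x), m, x])
    b = np.linalg.lstsq(X, y, rcond=None)[0][1]              # slope of y on m, given x
    return a * b

est = indirect_effect(x, m, y)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    boot.append(indirect_effect(x[idx], m[idx], y[idx]))
boot = np.array(boot)

# Bias-corrected percentile interval (no acceleration term).
z0 = stats.norm.ppf((boot < est).mean())
lo, hi = stats.norm.cdf(2 * z0 + stats.norm.ppf([0.025, 0.975]))
ci = np.quantile(boot, [lo, hi])
print(f"indirect effect = {est:.3f}, 95% BC bootstrap CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```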

13.
Investigation of peer effects on achievement with sample survey data on schools may mean that only a random sample of the population of peers is observed for each individual. This generates measurement error in peer variables similar in form to the textbook case of errors-in-variables, resulting in the estimated peer group effects in an OLS regression model being biased towards zero. We investigate the problem using survey data for England from the Programme for International Student Assessment (PISA) linked to administrative microdata recording information for each PISA sample member's entire year cohort. We calculate a peer group measure based on these complete data and compare its use with a variable based on peers in just the PISA sample. We also use a Monte Carlo experiment to show how the extent of the attenuation bias rises as peer sample size falls. On average, the estimated peer effect is biased downwards by about one third when drawing a sample of peers of the size implied by the PISA survey design.
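
The attenuation mechanism can be illustrated with a small Monte Carlo sketch: when the regressor is a peer mean computed from a random sample of peers rather than from the full cohort, the added sampling error biases the OLS slope toward zero. The cohort size, peer-sample size, between-school variation, and effect size below are hypothetical, not those of the PISA linkage.

```python
import numpy as np

rng = np.random.default_rng(7)
n_students, cohort_size, peer_sample, beta = 2000, 150, 25, 0.5

full_mean = np.empty(n_students)
sampled_mean = np.empty(n_students)
y = np.empty(n_students)
for s in range(n_students):
    school_effect = rng.normal(scale=0.3)                       # schools differ modestly in mean ability
    peers = school_effect + rng.normal(size=cohort_size - 1)    # abilities of the full cohort of peers
    full_mean[s] = peers.mean()
    sampled_mean[s] = rng.choice(peers, size=peer_sample, replace=False).mean()
    y[s] = beta * full_mean[s] + rng.normal()                   # outcome driven by the true peer mean

def ols_slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"slope using full-cohort peer mean : {ols_slope(full_mean, y):.2f}")
print(f"slope using {peer_sample}-peer sampled mean : {ols_slope(sampled_mean, y):.2f}")
```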

14.
Statistical theories of goodness-of-fit tests in structural equation modeling are based on asymptotic distributions of test statistics. When the model includes a large number of variables or the population is not from a multivariate normal distribution, the asymptotic distributions do not approximate the distribution of the test statistics very well at small sample sizes. A variety of methods have been developed to improve the accuracy of hypothesis testing at small sample sizes. However, all these methods have their limitations, especially for nonnormally distributed data. We propose a Monte Carlo test that is able to control the Type I error rate more accurately than existing approaches for both normal and nonnormal data at small sample sizes. Extensive simulation studies show that the suggested Monte Carlo test has a more accurate observed significance level as compared to other tests, with reasonable power to reject misspecified models.
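
The logic of a Monte Carlo (parametric bootstrap) test of fit can be sketched in a simpler setting: estimate the null model, simulate many data sets from it, recompute the fit statistic on each, and take the p value as the proportion of simulated statistics at least as extreme as the observed one. The one-sample normality example below is only a stand-in for the structural equation models the article addresses.

```python
import numpy as np

rng = np.random.default_rng(8)
data = rng.exponential(scale=1.0, size=40)       # small, nonnormal sample

def fit_stat(x):
    """Discrepancy from the fitted null model; here, absolute sample skewness."""
    z = (x - x.mean()) / x.std(ddof=1)
    return abs((z ** 3).mean())

observed = fit_stat(data)

# Simulate the statistic's distribution under the fitted null model (a normal
# distribution with the sample mean and SD), then compare to the observed value.
mu, sd, reps = data.mean(), data.std(ddof=1), 2000
simulated = np.array([fit_stat(rng.normal(mu, sd, size=data.size)) for _ in range(reps)])
p_value = (simulated >= observed).mean()
print(f"observed statistic = {observed:.2f}, Monte Carlo p = {p_value:.3f}")
```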

15.
Generalizability theory, as a new generation of measurement theory, is gradually being applied to large-scale examinations. This article applies multivariate generalizability theory to the self-study examination course "English Proficiency Test (I), Written Test," examining measurement reliability, the composition of the total test score, decision reliability at the passing score, and optimization of the test structure. The study found that measurement reliability for this administration was high; the proportion of universe-score variance contributed by each subtest was largely consistent with the intended scoring weights of the paper; a passing score of 60 points yielded high decision reliability; and increasing every subtest to 15 items, or increasing the vocabulary subtest alone to 20 items, would effectively improve measurement reliability.
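
The item-count recommendation rests on standard D-study logic: under a single-facet persons-by-items design, the generalizability coefficient is Eρ² = σ²(p) / (σ²(p) + σ²(pi,e)/n_i), so relative error shrinks as items are added. The variance components in the sketch below are hypothetical, not those estimated for this examination, and the simple design ignores the multivariate structure used in the article.

```python
# Hypothetical variance components for a persons-by-items (p x i) design.
var_person = 0.20     # universe-score (person) variance
var_residual = 1.10   # person-by-item interaction confounded with error

def g_coefficient(n_items: int) -> float:
    """Generalizability coefficient for relative decisions with n_items items."""
    return var_person / (var_person + var_residual / n_items)

for n_items in (10, 15, 20, 30):
    print(f"{n_items:2d} items -> E(rho^2) = {g_coefficient(n_items):.2f}")
```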

16.
Structural equation modeling (SEM) techniques provide us with excellent tools for conducting preliminary evaluation of differential validity and reliability of measurement instruments among a comprehensive selection of population groups. This article demonstrates empirically an SEM technique for group comparison of reliability and validity. Data are from a study of 495 mothers' attitudes toward pregnancy. Proportions of African American and White, married and unmarried, and Medicaid and non-Medicaid mothers provided sample sizes large enough for group comparisons. Four hypotheses are tested: that factor structures are invariant between subgroups, that factor loadings are invariant between subgroups, that measurement error is invariant between subgroups, and that means of the latent variable are invariant between subgroups. Discussion of item distributions, sample size issues, and appropriate estimation techniques is included.

17.
The population discrepancy between unstandardized and standardized reliability of homogeneous multicomponent measuring instruments is examined. Within a latent variable modeling framework, it is shown that the standardized reliability coefficient for unidimensional scales can be markedly higher than the corresponding unstandardized reliability coefficient, or alternatively substantially lower than the latter. Based on these findings, it is recommended that scholars avoid estimating, reporting, interpreting, or using standardized scale reliability coefficients in empirical research, unless they have strong reasons to consider standardizing the original components of utilized scales.
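
The distinction can be seen by computing coefficient alpha once from the components' covariance matrix (unstandardized) and once from their correlation matrix (standardized); when component variances are very unequal the two can diverge markedly. The simulated unidimensional scale below is a hypothetical illustration of that divergence, not the latent variable framework used in the article.

```python
import numpy as np

def alpha_from(matrix: np.ndarray) -> float:
    """Coefficient alpha computed from a covariance (or correlation) matrix."""
    k = matrix.shape[0]
    return (k / (k - 1)) * (1.0 - np.trace(matrix) / matrix.sum())

# Hypothetical unidimensional scale whose components have very unequal variances.
rng = np.random.default_rng(9)
factor = rng.normal(size=(500, 1))
loadings = np.array([0.9, 0.7, 0.5, 0.4])
scales = np.array([1.0, 2.0, 5.0, 10.0])       # unequal component metrics
items = scales * (factor * loadings + rng.normal(size=(500, 4)) * np.sqrt(1 - loadings**2))

cov, corr = np.cov(items, rowvar=False), np.corrcoef(items, rowvar=False)
print(f"unstandardized alpha = {alpha_from(cov):.3f}")
print(f"standardized alpha   = {alpha_from(corr):.3f}")
```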

18.
Tablets can be used to facilitate systematic testing of academic skills. Yet, when using validated paper tests on tablet, comparability between the mediums must be established. Comparability between a tablet and a paper version of a basic math skills test (HRT: Heidelberger Rechen Test 1–4) was investigated. Five samples with second and third grade students participated. The associations between the tablet and paper version of HRT showed that these modes of administration were comparable for three arithmetic scales, but unacceptable for a pictorial counting scale. Scores were lower on tablet. Test-retest reliability for arithmetic scales on tablet was satisfactory, but was inferior for a low-performing sample. The overall convergent validity was satisfactory. No effect of test administrator was found. Arithmetic scales can potentially be transferred to tablet with good comparability and maintained test-retest reliability. Precautions are necessary when transferring pictorial scales into tablet. Separate norms for tablet are needed when interpreting scores.

19.
Recently a new mean scaled and skewness adjusted test statistic was developed for evaluating structural equation models in small samples and with potentially nonnormal data, but this statistic has received only limited evaluation. The performance of this statistic is compared to normal theory maximum likelihood and 2 well-known robust test statistics. A modification to the Satorra–Bentler scaled statistic is developed for the condition that sample size is smaller than degrees of freedom. The behavior of the 4 test statistics is evaluated with a Monte Carlo confirmatory factor analysis study that varies 7 sample sizes and 3 distributional conditions obtained using Headrick's fifth-order transformation to nonnormality. The new statistic performs badly in most conditions except under the normal distribution. The goodness-of-fit χ2 test based on maximum-likelihood estimation performed well under normal distributions as well as under a condition of asymptotic robustness. The Satorra–Bentler scaled test statistic performed best overall, whereas the mean scaled and variance adjusted test statistic outperformed the others at small and moderate sample sizes under certain distributional conditions.

20.
In this study we describe an analytic method for aiding in the generation of subscales that characterize the deep structure of tests. In addition we derive a procedure for estimating scores for these scales that are much more statistically stable than subscores computed solely from the items that are contained on that scale. These scores achieve their stability through augmentation with related information from elsewhere on the test. These methods were used to complement each other on a data set obtained from a Praxis administration. We found that the deep structure of the test yielded ten subscales and that, because the test was essentially unidimensional, ten subscores could be computed, all with very high reliability. This result was contrasted with the calculation of six traditional subscales based on surface features of the items. These subscales also yielded augmented subscores of high reliability.
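
A rough, simplified stand-in for the augmentation idea is sketched below: shrink each examinee's short-subscale score toward a regression-based prediction from the rest of the test, weighting by an assumed subscale reliability (a Kelley-style estimate). This is not the authors' multivariate procedure, and the data, reliability value, and test structure are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 1000
theta = rng.normal(size=n)                               # hypothetical single latent trait
p_correct = (1.0 / (1.0 + np.exp(-theta)))[:, None]      # Rasch-type probability, item difficulty 0

sub = (rng.uniform(size=(n, 5)) < p_correct).sum(axis=1)     # short 5-item subscale
rest = (rng.uniform(size=(n, 35)) < p_correct).sum(axis=1)   # remaining 35 items

subscale_reliability = 0.55                              # assumed reliability of the short subscale

# Regression-based prediction of the subscore from the rest of the test.
slope, intercept = np.polyfit(rest, sub, 1)
predicted = intercept + slope * rest

# Kelley-style shrinkage: weight the observed subscore by its assumed reliability.
augmented = subscale_reliability * sub + (1 - subscale_reliability) * predicted

print(f"corr(raw subscore, trait)       = {np.corrcoef(sub, theta)[0, 1]:.2f}")
print(f"corr(augmented subscore, trait) = {np.corrcoef(augmented, theta)[0, 1]:.2f}")
```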
