Similar Documents
A total of 20 similar documents were retrieved.
1.
The present study evaluated the multiple imputation method, a procedure that is similar to the one suggested by Li and Lissitz (2004), and compared the performance of this method with that of the bootstrap method and the delta method in obtaining the standard errors for the estimates of the parameter scale transformation coefficients in item response theory (IRT) equating in the context of the common‐item nonequivalent groups design. Two different estimation procedures for the variance‐covariance matrix of the IRT item parameter estimates, which were used in both the delta method and the multiple imputation method, were considered: empirical cross‐product (XPD) and supplemented expectation maximization (SEM). The results of the analyses with simulated and real data indicate that the multiple imputation method generally produced very similar results to the bootstrap method and the delta method in most of the conditions. The differences between the estimated standard errors obtained by the methods using the XPD matrices and the SEM matrices were very small when the sample size was reasonably large. When the sample size was small, the methods using the XPD matrices appeared to yield slight upward bias for the standard errors of the IRT parameter scale transformation coefficients.
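
The delta method mentioned here propagates the variance‐covariance matrix of the item parameter estimates through the scale transformation function. A minimal, generic sketch of that propagation is given below; the numerical Jacobian, the placeholder function `transformation_coefficients`, and its inputs are illustrative assumptions, not the study's actual implementation.

```python
import numpy as np

def delta_method_se(g, params, cov, eps=1e-6):
    """Delta-method standard errors for g(params).

    g      : function mapping the item-parameter vector to the
             transformation coefficients (e.g., slope A and intercept B).
    params : estimated item parameters (1-D array).
    cov    : variance-covariance matrix of those estimates
             (e.g., obtained via an XPD- or SEM-type procedure).
    """
    params = np.asarray(params, dtype=float)
    g0 = np.atleast_1d(g(params))
    # Forward-difference approximation of the Jacobian J of g at params.
    jac = np.zeros((g0.size, params.size))
    for j in range(params.size):
        bumped = params.copy()
        bumped[j] += eps
        jac[:, j] = (np.atleast_1d(g(bumped)) - g0) / eps
    # Delta method: Var(g) ~= J Sigma J^T; SEs are square roots of the diagonal.
    var_g = jac @ np.asarray(cov, dtype=float) @ jac.T
    return np.sqrt(np.diag(var_g))

# Hypothetical usage: two coefficients defined from a four-parameter vector.
transformation_coefficients = lambda p: np.array([p[0] / p[2], p[1] - p[3]])
se = delta_method_se(transformation_coefficients,
                     params=[1.1, 0.2, 0.9, -0.1],
                     cov=np.diag([0.01, 0.02, 0.01, 0.02]))
```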

2.
Measurement bias can be detected using structural equation modeling (SEM), by testing measurement invariance with multigroup factor analysis (Jöreskog, 1971; Meredith, 1993; Sörbom, 1974), MIMIC modeling (Muthén, 1989), or restricted factor analysis (Oort, 1992, 1998). In educational research, data often have a nested, multilevel structure, for example when data are collected from children in classrooms. Multilevel structures might complicate measurement bias research. In 2-level data, the potentially “biasing trait” or “violator” can be a Level 1 variable (e.g., pupil sex) or a Level 2 variable (e.g., teacher sex). One can also test measurement invariance with respect to the clustering variable (e.g., classroom). This article provides a stepwise approach for the detection of measurement bias with respect to these 3 types of violators. This approach works from Level 1 upward, so the final model accounts for all bias and substantive findings at both levels. The 5 proposed steps are illustrated with data on teacher–child relationships.
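
In the restricted factor analysis approach cited above, uniform measurement bias with respect to a violator variable is typically modeled as a direct effect of the violator on the observed item score. A schematic single-level version, written in generic notation rather than the article's own, is:

```latex
x_{ij} = \tau_j + \lambda_j\,\xi_i + b_j\,v_i + \varepsilon_{ij},
```

where \(\xi_i\) is the common trait, \(v_i\) the potential violator (e.g., pupil sex or teacher sex), and a nonzero direct effect \(b_j\) signals uniform bias in item \(j\); the stepwise procedure in the article applies this logic to Level 1, Level 2, and cluster-level violators in turn.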

3.
Wording effect refers to the systematic method variance caused by positive and negative item wordings on a self-report measure. This Monte Carlo simulation study investigated the impact of ignoring the wording effect on the reliability and validity estimates of a self-report measure. Four factors were considered in the simulation design: (a) the number of positively and negatively worded items, (b) the loadings on the trait and the wording effect factors, (c) sample size, and (d) the magnitude of the population validity coefficient. The findings suggest that the unidimensional model that ignores the negative wording effect would underestimate the composite reliability and criterion-related validity, but overestimate the homogeneity coefficient. The magnitude of relative bias of the composite reliability was generally small and acceptable, whereas the relative bias for the homogeneity coefficient and criterion-related validity coefficient was negatively correlated with the strength of the general trait factor.

4.
Standard errors of measurement of scale scores by score level (conditional standard errors of measurement) can be valuable to users of test results. In addition, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985) recommends that conditional standard errors be reported by test developers. Although a variety of procedures are available for estimating conditional standard errors of measurement for raw scores, few procedures exist for estimating conditional standard errors of measurement for scale scores from a single test administration. In this article, a procedure is described for estimating the reliability and conditional standard errors of measurement of scale scores. This method is illustrated using a strong true score model. Practical applications of this methodology are given. These applications include a procedure for constructing score scales that equalize standard errors of measurement along the score scale. Also included are examples of the effects of various nonlinear raw-to-scale score transformations on scale score reliability and conditional standard errors of measurement. These illustrations examine the effects on scale score reliability and conditional standard errors of measurement of (a) the different types of raw-to-scale score transformations (e.g., normalizing scores), (b) the number of scale score points used, and (c) the transformation used to equate alternate forms of a test. All the illustrations use data from the ACT Assessment testing program.
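
Under a strong true score model of the kind used for illustration here, the conditional standard error of measurement of a scale score is usually defined as the conditional standard deviation of the transformed raw score; in generic notation (not necessarily the article's),

```latex
\mathrm{CSEM}(s \mid \tau)
  = \sqrt{\sum_{x}\Pr(X = x \mid \tau)\,\bigl[s(x) - E\{s(X)\mid\tau\}\bigr]^{2}},
\qquad
E\{s(X)\mid\tau\} = \sum_{x}\Pr(X = x \mid \tau)\,s(x),
```

where \(\tau\) is the true score, \(X\) the raw score, and \(s(\cdot)\) the raw-to-scale transformation; averaging these conditional error variances over the true-score distribution gives the overall scale-score error variance used in the reliability estimate.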

5.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.

6.
Abstract

A dramatic shift in research priorities has recently produced a large number of ambitious randomized trials in K-12 education. In most cases, the aim is to improve student academic learning by improving classroom instruction. Embedded in these studies are theories about how the quality of classroom instruction must improve if these interventions are to succeed. The problem of measuring classroom quality then emerges as a major concern. This article first considers how errors of measurement reduce statistical power in studies of the impact of interventions on classroom quality. We show how to use information about reliability to compute power and plan new research. At the same time, errors of measurement introduce bias into estimates of the association between classroom quality and student outcomes. We show how to use knowledge about the magnitude of measurement error to eliminate or reduce this bias. We also briefly review research on the design of studies of the reliability of classroom measures. Such studies are essential to evaluate promising new classroom interventions.
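
The bias in question is the familiar attenuation of an estimated association when classroom quality is measured with error; a standard correction, written in generic notation rather than the article's, divides the observed regression coefficient by the reliability of the quality measure:

```latex
\hat{\beta}_{\text{corrected}} = \frac{\hat{\beta}_{\text{observed}}}{\lambda_Q},
\qquad
\lambda_Q = \frac{\sigma^{2}_{\text{true}}}{\sigma^{2}_{\text{true}} + \sigma^{2}_{\text{error}}},
```

where \(\lambda_Q\) is the reliability of the observed classroom-quality score. The correction as written applies to a simple regression of the student outcome on observed quality; with several error-prone predictors, the full error-covariance structure is needed.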

7.
It is well known that coefficient alpha is an estimate of reliability if its underlying assumptions are met and that it is a lower-bound estimate if the assumption of essential tau equivalency is violated. Very little literature addresses the assumption of uncorrelated errors among items and the effect of violating this assumption on alpha. True score models are proposed that can account for correlated errors. These models allow random measurement errors on earlier items to directly or indirectly affect scores on later items. Coefficient alpha may yield spuriously high estimates of reliability if these true score models reflect item responding. In practice, it is important to differentiate these models from models in which the errors are correlated because 1 or more factors have been left unspecified. If the latter model is an accurate representation of item responding, the assumption of essential tau equivalency is violated and alpha is a lower-bound estimate of reliability.
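
For reference, coefficient alpha for a composite of \(k\) items \(X = \sum_i X_i\) is

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{X_i}}{\sigma^{2}_{X}}\right),
```

which equals the reliability only when the items are essentially tau-equivalent and their errors are uncorrelated; under the correlated-error true score models described above, alpha can exceed the true reliability rather than bound it from below.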

8.
It is well known that measurement error in observable variables induces bias in estimates in standard regression analysis and that structural equation models are a typical solution to this problem. Often, multiple indicator equations are subsumed as part of the structural equation model, allowing for consistent estimation of the relevant regression parameters. In many instances, however, embedding the measurement model into structural equation models is not possible because the model would not be identified. To correct for measurement error, one has no other recourse than to provide the exact values of the variances of the measurement error terms of the model, although in practice such variances cannot be ascertained exactly, but only estimated from an independent study. The usual approach so far has been to treat the estimated values of error variances as if they were known exact population values in the subsequent structural equation modeling (SEM) analysis. In this article we show that fixing measurement error variance estimates as if they were true values can make the reported standard errors of the structural parameters of the model smaller than they should be. Inferences about the parameters of interest will be incorrect if the estimated nature of the variances is not taken into account. For general SEM, we derive an explicit expression that provides the terms to be added to the standard errors provided by the standard SEM software that treats the estimated variances as exact population values. Interestingly, we find there is a differential impact of the corrections to be added to the standard errors depending on which parameter of the model is estimated. The theoretical results are illustrated with simulations and also with empirical data on a typical SEM model.

9.
OBJECTIVE: The aim was to construct and test the reliability (utility, internal consistency, interrater agreement) and the validity (internal validity, concurrent validity) of a scale for home visiting social nurses to identify risks of physical abuse and neglect in mothers with a newborn child. METHOD: A 71-item scale was constructed based on a literature review and focus group sessions with social nurses and paraprofessionals who had experience with underprivileged families. This scale was applied in a random sample of 40 home visiting social nurses, who collected data in a sample of 373 nonabusive and 18 abusive/neglectful mothers with a newborn child. RESULTS: Items with prevalence rates below 5% and items making no significant difference between maltreating and non-maltreating mothers were omitted. The final version contained 20 items. This scale showed high internal consistency (alpha = .92) and high interrater reliability (r = .97). Exploratory factor analysis yielded a three-factor solution: Isolation (8 items, explaining 62.17% of the common variance), Psychological complexity (6 items, 18.86%), and Communication problems (6 items, 8.41%). Scores on Communication problems and Isolation significantly predicted scores on a social deprivation scale, which significantly distinguished maltreating from non-maltreating mothers. Mothers scoring high on Communication problems or Isolation obtained higher scores for social deprivation than low-scoring mothers. CONCLUSIONS: Home visiting nurses can identify risks for physical abuse and neglect among mothers with a newborn infant by focusing on signs of social isolation, distorted communication and psychological problems.

10.
The purpose of this study is to develop and evaluate unidimensional models that can handle semiordered data within scale items (i.e., items with multiple ordered response categories, and one additional nominal response category). We apply the models to scale data with not applicable (NA) responses to compare the model performance to conditions in which NA responses are treated as missing and ignored. We also conduct a small simulation study based on the operational study to evaluate the parameter recovery of the models under the operational conditions. Findings indicate that the proposed models show promise for (a) reducing standard errors of trait estimates for persons who select NA responses, (b) reducing nonresponse bias in trait estimates for persons who select NA responses, and (c) providing substantive information to practitioners about the nature of the relationship between NA selection and the trait being measured.

11.
Longitudinal studies offer unique opportunities to identify the specificity variance in the components of a psychometric scale that is administered repeatedly. This article discusses a procedure for evaluation of the relationship between true scale scores and criterion variables uncorrelated with measurement errors in longitudinally presented measures comprising unidimensional multicomponent instruments. The approach provides point and interval estimates of the true scale criterion validity with respect to a criterion that is assessed once or repeatedly, as well as a means for testing temporal stability in this validity. The outlined method is based on an application of the latent variable modeling methodology, is readily applicable with popular software, and is illustrated using empirical data.

12.
A preliminary scale of psychosocial competence for Chinese college students was constructed on the basis of a literature review, a questionnaire survey, and case interviews. Exploratory factor analysis (n = 691) retained 22 items and extracted four factors: emotion management, self-cognition, social adaptation, and interpersonal communication. Confirmatory factor analysis (n = 500) together with reliability and validity checks showed that the scale has a sound structure and good model fit, with satisfactory construct validity and internal consistency reliability, and can serve as an instrument for measuring the psychosocial competence of Chinese college students.

13.
An IRT method for estimating conditional standard errors of measurement of scale scores is presented, where scale scores are nonlinear transformations of number-correct scores. The standard errors account for measurement error that is introduced due to rounding scale scores to integers. Procedures for estimating the average conditional standard error of measurement for scale scores and reliability of scale scores are also described. An illustration of the use of the methodology is presented, and the results from the IRT method are compared to the results from a previously developed method that is based on strong true-score theory.
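
For dichotomous IRT models, the conditional raw-score distribution that such a method requires can be computed with the well-known Lord–Wingersky recursion; the sketch below then evaluates a conditional standard error for a rounded scale score. The 3PL item function, the linear raw-to-scale conversion, and the parameter values are illustrative assumptions, not the article's operational setup.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def raw_score_distribution(p):
    """Lord-Wingersky recursion: distribution of the number-correct score
    given per-item success probabilities p."""
    dist = np.array([1.0])
    for pi in p:
        new = np.zeros(dist.size + 1)
        new[:-1] += dist * (1.0 - pi)   # item answered incorrectly
        new[1:] += dist * pi            # item answered correctly
        dist = new
    return dist

def conditional_sem_scale(theta, a, b, c, raw_to_scale):
    """Conditional SEM of the rounded scale score at a given theta."""
    p = p_3pl(theta, a, b, c)
    dist = raw_score_distribution(p)
    scale = np.round(raw_to_scale(np.arange(dist.size)))  # rounding included
    mean = np.sum(dist * scale)
    return np.sqrt(np.sum(dist * (scale - mean) ** 2))

# Hypothetical example: 20 items, a linear raw-to-scale conversion.
rng = np.random.default_rng(0)
a, b, c = rng.uniform(0.8, 2.0, 20), rng.normal(0, 1, 20), np.full(20, 0.2)
print(conditional_sem_scale(0.0, a, b, c, lambda x: 10 + 1.5 * x))
```

Averaging such conditional error variances over the ability distribution yields an overall scale-score error variance and hence a reliability estimate, paralleling the comparison with the strong true-score method described in the abstract.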

14.
Multiple indicators of HIV risk behaviors are not yet well developed in the field of drug abuse and AIDS prevention, with most research relying on single‐item measures. Weak or inconclusive statistical analyses often result from measurement errors. This study illustrates the problems of biased statistical estimates when single‐item measures with measurement errors are used. Using HIV risk behaviors among injection drug users as an example, this study shows the impact of measurement error on statistical results in path analysis. The results suggest that more attention is needed to address the issue of measurement reliability in survey data. Measurement error should be taken into account in analyzing HIV risk behaviors, and appropriate multiple indicators for a full range of HIV risk behaviors should be developed to deal successfully with the “errors‐in‐variables” problem.

15.
Abstract

One major aim of international large-scale assessments (ILSAs) is to monitor changes in student performance over time. To accomplish this task, a set of common items is repeatedly administered in each assessment and linking methods are used to align the results from the different assessments on a common scale. The present article introduces a framework for discussing linking errors in ILSAs, in which different components of linking errors are distinguished (country-by-item interaction, assessment-by-item interaction and country-by-assessment-by-item interaction). Furthermore, the different components of linking errors are used to analytically derive standard errors for national trend estimates. In a simulation study, the proposed standard error formula outperforms the method that is used in PISA. In addition, the PISA 2006 and 2009 reading data are used to illustrate how the interpretation of national trend estimates can change when different procedures are applied to calculate standard errors.
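
The general idea behind such standard errors is that the uncertainty of a national trend estimate combines ordinary sampling error in the two assessments with a linking-error component driven by the item-interaction terms listed above; schematically, and not in the article's exact notation,

```latex
\widehat{\mathrm{SE}}\bigl(\hat{\mu}_{2} - \hat{\mu}_{1}\bigr)
  = \sqrt{\mathrm{SE}^{2}\bigl(\hat{\mu}_{1}\bigr)
        + \mathrm{SE}^{2}\bigl(\hat{\mu}_{2}\bigr)
        + \mathrm{LE}^{2}},
```

where the linking error \(\mathrm{LE}\) is estimated from the variability, across the \(L\) common items, of the shifts in their difficulty estimates between the two assessments (in the PISA approach, roughly the variance of those shifts divided by \(L\)); the article's contribution is to decompose this linking error further into the separate interaction components.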

16.
There has been a growing consensus among educational measurement experts and psychometricians that test taker characteristics may unduly affect performance on tests. This may lead to construct-irrelevant variance in the scores and thus render the test biased. Hence, it is incumbent on test developers and users alike to provide evidence that their tests are free of such bias. The present study exploited generalizability theory to examine the presence of gender differential performance on a high-stakes language proficiency test, the University of Tehran English Proficiency Test. An analysis of the performance of 2,343 examinees who had taken the test in 2009 indicated that the relative contributions of different facets to score variance were almost uniform across the gender groups. Further, there was no significant interaction between items and persons, indicating that the relative standings of the persons were uniform across all items. The lambda reliability coefficients were also uniformly high. All in all, the study provides evidence that the test is free of gender bias and enjoys a high level of dependability.
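
For a simple persons-by-items (p × i) generalizability design of the kind underlying such an analysis, the estimated variance components combine into a dependability coefficient in the familiar way; in generic notation (not necessarily the coefficient reported in the article),

```latex
\Phi = \frac{\sigma^{2}_{p}}
            {\sigma^{2}_{p} + \dfrac{\sigma^{2}_{i}}{n_{i}} + \dfrac{\sigma^{2}_{pi,e}}{n_{i}}},
```

where \(\sigma^{2}_{p}\), \(\sigma^{2}_{i}\), and \(\sigma^{2}_{pi,e}\) are the person, item, and residual (person-by-item plus error) variance components and \(n_{i}\) is the number of items; comparing such components and coefficients across gender groups is one way of checking whether the test functions equivalently for the two groups.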

17.
In structural equation modeling (SEM), researchers need to evaluate whether item response data, which are often multidimensional, can be modeled with a unidimensional measurement model without seriously biasing the parameter estimates. This issue is commonly addressed through testing the fit of a unidimensional model specification, a strategy previously determined to be problematic. As an alternative to the use of fit indexes, we considered the utility of a statistical tool that was expressly designed to assess the degree of departure from unidimensionality in a data set. Specifically, we evaluated the ability of the DETECT “essential unidimensionality” index to predict the bias in parameter estimates that results from misspecifying a unidimensional model when the data are multidimensional. We generated multidimensional data from bifactor structures that varied in general factor strength, number of group factors, and items per group factor; a unidimensional measurement model was then fit and parameter bias recorded. Although DETECT index values were generally predictive of parameter bias, in many cases, the degree of bias was small even though DETECT indicated significant multidimensionality. Thus we do not recommend the stand-alone use of DETECT benchmark values to either accept or reject a unidimensional measurement model. However, when DETECT was used in combination with additional indexes of general factor strength and group factor structure, parameter bias was highly predictable. Recommendations for judging the severity of potential model misspecifications in practice are provided.
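
The DETECT index is, roughly, an average of signed conditional covariances between item pairs; in one commonly cited (generic) formulation,

```latex
D(\mathcal{P}) = \frac{2}{n(n-1)} \sum_{i<j} \delta_{ij}(\mathcal{P})\,
                 \widehat{\mathrm{Cov}}\bigl(X_i, X_j \mid S_{(i,j)}\bigr),
```

where \(S_{(i,j)}\) is the rest score computed without items \(i\) and \(j\), and \(\delta_{ij} = +1\) if the partition \(\mathcal{P}\) places the two items in the same cluster and \(-1\) otherwise (the index is often reported multiplied by 100); values near zero are read as evidence of essential unidimensionality, which is why the study asks how well such benchmark values track the actual parameter bias.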

18.
This article studies the difference between the criterion validity coefficient of the widely used overall scale score for a unidimensional multicomponent measuring instrument and the maximal criterion validity coefficient that is achievable with a linear combination of its components. A necessary and sufficient condition of their identity is presented in the case of measurement errors being uncorrelated among themselves and with a used criterion. An upper bound of the difference in these validity coefficients is provided, indicating that it cannot exceed the discrepancy between the maximal reliability and composite reliability indexes. A readily applicable latent variable modeling procedure is discussed that can be used for point and interval estimation of the difference between the maximal and scale criterion validity coefficients. The outlined method is illustrated with a numerical example.
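
For a congeneric, unidimensional measure with loadings \(\lambda_i\) and uncorrelated error variances \(\theta_i\) (factor variance fixed to one), the composite reliability of the unit-weighted scale score and the maximal reliability of the optimally weighted composite can be written, in generic notation, as

```latex
\omega = \frac{\left(\sum_{i} \lambda_i\right)^{2}}
              {\left(\sum_{i} \lambda_i\right)^{2} + \sum_{i} \theta_i},
\qquad
\rho_{\max} = \frac{\sum_{i} \lambda_i^{2}/\theta_i}
                   {1 + \sum_{i} \lambda_i^{2}/\theta_i};
```

the bound stated in the abstract says that the gap between the maximal and scale criterion validity coefficients cannot exceed \(\rho_{\max} - \omega\).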

19.
This Monte Carlo simulation study investigated different strategies for forming product indicators for the unconstrained approach in analyzing latent interaction models when the exogenous factors are measured by unequal numbers of indicators under both normal and nonnormal conditions. Product indicators were created by (a) multiplying parcels of the larger scale by items of the smaller scale, and (b) matching items according to reliability to create several product indicators, ignoring those items with lower reliability. Two scaling approaches were compared where parceling was not involved: (a) fixing the factor variances, and (b) fixing 1 loading to 1 for each factor. The unconstrained approach was compared with the latent moderated structural equations (LMS) approach. Results showed that under normal conditions, the LMS approach was preferred because the biases of its interaction estimates and associated standard errors were generally smaller, and its power was higher than that of the unconstrained approach. Under nonnormal conditions, however, the unconstrained approach was generally more robust than the LMS approach. It is recommended to form product indicators by using items with higher reliability (rather than parceling) in the matching and then to specify the model by fixing 1 loading of each factor to unity when adopting the unconstrained approach.

20.
Forty science students received training for 12 weeks on delivering effective presentations and using a tertiary-level English oral presentation scale comprising three subscales (Verbal Communication, Nonverbal Communication, and Content and Organization) measured by 18 items. For their final project, each student was given 10 to 12 min to present on 1 of the 5 compulsory science books for the module and was rated by the tutor, peers, and himself/herself. Many-facet Rasch measurement, correlation, and analysis of variance were performed to mine the data. The results show that the student raters, tutor, items, and rating scales achieved high psychometric quality, though a small number of assessments exhibited bias. Although all of the biased self-assessments were underestimations of presentation skills, the peer and tutor assessment bias had a mixed pattern. In addition, self-, peer, and tutor assessments had low to medium correlations on the subscales, and a significant difference was found between the assessments. Implications are discussed.
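
The many-facet Rasch model used in such rater studies relates the log-odds of adjacent rating categories to additive facet terms; in a generic three-facet, rating-scale form (examinee, item, rater),

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_{n} - \beta_{i} - \gamma_{j} - \tau_{k},
```

where \(\theta_{n}\) is examinee \(n\)'s presentation ability, \(\beta_{i}\) the difficulty of item \(i\), \(\gamma_{j}\) the severity of rater \(j\) (tutor, peer, or self), and \(\tau_{k}\) the threshold for category \(k\); rater-by-examinee bias of the kind reported in the study is typically examined through interaction (bias) terms added to this base model.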
