Similar Articles
20 similar articles found (search time: 31 ms)
1.
The power of the chi-square test statistic used in structural equation modeling decreases as the absolute value of excess kurtosis of the observed data increases. Excess kurtosis is more likely when the number of item response categories is small. As a result, fit is likely to improve as the number of item response categories decreases, regardless of the true underlying factor structure or χ2-based fit index used to examine model fit. Equivalently, given a target value of approximate fit (e.g., root mean square error of approximation ≤ .05), a model with more factors is needed to reach it as the number of categories increases. This is true regardless of whether the data are treated as continuous (common factor analysis) or as discrete (ordinal factor analysis). We recommend using a large number of response alternatives (≥ 5) to increase the power to detect incorrect substantive models.
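A quick way to see why fewer response categories push excess kurtosis toward its negative extreme is to compute it directly. The sketch below (plain Python, made-up item responses, not data from the study) contrasts a symmetric 2-category item with a uniform 5-category item.

```python
def excess_kurtosis(xs):
    """Population excess kurtosis: fourth central moment over squared variance, minus 3."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

# Hypothetical item responses: symmetric binary vs. uniform 5-category.
binary = [0, 1] * 500
five_cat = [1, 2, 3, 4, 5] * 200
print(excess_kurtosis(binary))    # -2.0 (maximally platykurtic)
print(excess_kurtosis(five_cat))  # -1.3 (closer to the normal value of 0)
```

The binary item sits at the platykurtic floor of -2, while the 5-category item is already much closer to 0, consistent with the recommendation to use five or more response alternatives.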

2.
This study examined the efficacy of 4 different parceling methods for modeling categorical data with 2, 3, and 4 categories and with normal, moderately nonnormal, and severely nonnormal distributions. The parceling methods investigated were isolated parceling, in which items were parceled with other items sharing the same source of variance, and distributed parceling, in which items were parceled with items influenced by different factors. These parceling strategies were crossed with strategies in which items were parceled with either similarly distributed or differently distributed items, to create 4 different parceling methods. Overall, parceling together items influenced by different factors and with different distributions resulted in better model fit, but high levels of parameter estimate bias. Across all parceling methods, parameter estimate bias ranged from 20% to over 130%. Parceling strategies were contrasted with use of the WLSMV estimator for categorical, unparceled data. Results based on this estimator are encouraging, although some bias was found when high levels of nonnormality were present. Chi-square and root mean square error of approximation values based on WLSMV also yielded elevated Type II error rates for misspecified models when data were severely nonnormally distributed.

3.
This article investigates the effect of the number of item response categories on chi-square statistics for confirmatory factor analysis to assess whether a greater number of categories increases the likelihood of identifying spurious factors, as previous research had concluded. Four types of continuous single-factor data were simulated for a 20-item test: (a) uniform for all items, (b) symmetric unimodal for all items, (c) negatively skewed for all items, or (d) negatively skewed for 10 items and positively skewed for 10 items. For each of the 4 types of distributions, item responses were divided to yield item scores with 2, 4, or 6 categories. The results indicated that the chi-square statistic for evaluating a single-factor model was most inflated (suggesting spurious factors) for 2-category responses and became less inflated as the number of categories increased. However, the Satorra-Bentler scaled chi-square tended not to be inflated even for 2-category responses, except if the continuous item data had both negatively and positively skewed distributions.
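For reference, the Satorra-Bentler scaled statistic mentioned above divides the normal-theory chi-square by a scaling correction estimated from the data's multivariate kurtosis. A common way to write it (notation assumed here, not taken from the abstract) is:

```latex
T_{SB} = \frac{T_{ML}}{\hat{c}}, \qquad
\hat{c} = \frac{\operatorname{tr}\!\left(\hat{U}\hat{\Gamma}\right)}{d},
```

where T_ML is the maximum likelihood chi-square, d the model degrees of freedom, Γ̂ the estimated asymptotic covariance matrix of the sample moments, and Û a residual weight matrix; under normality ĉ is close to 1 and the correction vanishes.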

4.
This study compared diagonal weighted least squares robust estimation techniques available in 2 popular statistical programs: diagonal weighted least squares (DWLS; LISREL version 8.80) and mean-adjusted weighted least squares (WLSM) and mean- and variance-adjusted weighted least squares (WLSMV; Mplus version 6.11). A 20-item confirmatory factor analysis was estimated using item-level ordered categorical data. Three different nonnormality conditions were applied to 2- to 7-category data with sample sizes of 200, 400, and 800. Convergence problems were seen with nonnormal data when DWLS was used with few categories. Both DWLS and WLSMV produced accurate parameter estimates; however, bias in standard errors of parameter estimates was extreme for select conditions when nonnormal data were present. The robust estimators generally reported acceptable model–data fit, unless few categories were used with nonnormal data at smaller sample sizes; WLSMV yielded better fit than WLSM for most indices.

5.
To ensure the validity of statistical results, model-data fit must be evaluated for each item. In practice, certain actions or treatments are needed for misfit items. If all misfit items are treated, much item information would be lost during calibration. On the other hand, if only severely misfit items are treated, the inclusion of misfit items may invalidate the statistical inferences based on the estimated item response models. Hence, given response data, one has to find a balance between treating too few and too many misfit items. In this article, misfit items are classified into three categories based on the extent of misfit. Accordingly, three different item treatment strategies are proposed in determining which categories of misfit items should be treated. The impact of using different strategies is investigated. The results show that the test information functions obtained under different strategies can be substantially different in some ability ranges.

6.
This study investigated differential item functioning (DIF), differential bundle functioning (DBF), and differential test functioning (DTF) across gender of the reading comprehension section of the Graduate School Entrance English Exam in China. The datasets included 10,000 test-takers' item-level responses to 6 five-item testlets. Both DIF and DBF were examined by using the polytomous simultaneous item bias test and the item response theory likelihood ratio test, and DTF was investigated with multi-group confirmatory factor analyses (MG-CFA). The results indicated that although none of the 30 items exhibited statistically and practically significant DIF across gender at the item level, 2 testlets were consistently identified as having significant DBF at the testlet level by the two procedures. Nonetheless, DBF did not manifest itself at the overall test score level to produce DTF based on MG-CFA. This suggests that the relationship between item-level DIF and test-level DTF is a complicated issue with the mediating effect of testlets in testlet-based language assessment.

7.
In structural equation modeling (SEM), researchers need to evaluate whether item response data, which are often multidimensional, can be modeled with a unidimensional measurement model without seriously biasing the parameter estimates. This issue is commonly addressed through testing the fit of a unidimensional model specification, a strategy previously determined to be problematic. As an alternative to the use of fit indexes, we considered the utility of a statistical tool that was expressly designed to assess the degree of departure from unidimensionality in a data set. Specifically, we evaluated the ability of the DETECT “essential unidimensionality” index to predict the bias in parameter estimates that results from misspecifying a unidimensional model when the data are multidimensional. We generated multidimensional data from bifactor structures that varied in general factor strength, number of group factors, and items per group factor; a unidimensional measurement model was then fit and parameter bias recorded. Although DETECT index values were generally predictive of parameter bias, in many cases, the degree of bias was small even though DETECT indicated significant multidimensionality. Thus we do not recommend the stand-alone use of DETECT benchmark values to either accept or reject a unidimensional measurement model. However, when DETECT was used in combination with additional indexes of general factor strength and group factor structure, parameter bias was highly predictable. Recommendations for judging the severity of potential model misspecifications in practice are provided.

8.
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee-selected items. The results of a series of simulation studies showed: (1) the parameters of the new models were recovered well; (2) the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models; (3) failing to consider the choice effect yielded shrunken parameter estimates for examinee-selected items; and (4) even when the missingness mechanism in examinee-selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.

9.
The purpose of the study was to investigate the relationship between the number of response categories employed and the internal-consistency reliability of Likert-type questionnaires. Two questionnaires, each composed of items with high loadings on one factor, were scaled with 2, 3, 4, 5, 6, and 7 categories. The questionnaires were administered to graduate students in education, and coefficient alpha reliabilities were computed both for random samples of items and for each total questionnaire. The results indicated that in situations where low total score variability is achieved with a small number of categories, reliability can be increased by increasing the number of categories employed. In situations where opinion is widely divided toward the content being measured, reliability appeared to be independent of the number of response categories.
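Coefficient alpha itself is simple to compute: it scales the ratio of summed item variances to total-score variance by k/(k - 1). The sketch below (plain Python, toy data invented for illustration) shows the formula and its two extremes.

```python
def cronbach_alpha(items):
    """Coefficient alpha. items: list of k item-score lists, one score per respondent."""
    k = len(items)
    n = len(items[0])

    def pvar(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(pvar(it) for it in items) / pvar(totals))

# Two perfectly parallel items -> alpha = 1; two unrelated items -> alpha = 0.
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))        # 1.0
print(cronbach_alpha([[1, 2, 1, 2], [1, 1, 2, 2]]))  # 0.0
```

Because alpha rises with total-score variance, compressing responses into few categories can depress it, which is the mechanism the study examines.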

10.
Educational Assessment, 2013, 18(3), 201–224
This article discusses an approach to analyzing performance assessments that identifies potential reasons for misfitting items and uses this information to improve on items and rubrics for these assessments. Specifically, the approach involves identifying psychometric features and qualitative features of items and rubrics that may possibly influence misfit; examining relations between these features and the fit statistic; conducting an analysis of student responses to a sample of misfitting items; and finally, based on the results of the previous analyses, modifying characteristics of the items or rubrics and reexamining fit. A mathematics performance assessment containing 53 constructed-response items scored on a holistic scale from 0 to 4 is used to illustrate the approach. The 2-parameter graded response model (Samejima, 1969) is used to calibrate the data. Implications of this method of data analysis for improving performance assessment items and rubrics are discussed as well as issues and limitations related to the use of the approach.
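The graded response model used for calibration assigns each score level (here 0-4) a probability via differences of cumulative logistic curves. A minimal sketch, with hypothetical parameter values rather than estimates from the study:

```python
from math import exp

def grm_probs(theta, a, thresholds):
    """Samejima-style graded response model: category probabilities at ability theta.

    a: item discrimination; thresholds: ordered boundary difficulties,
    giving len(thresholds) + 1 score categories.
    """
    def cum(b):  # P(score at or above the category just past boundary b)
        return 1.0 / (1.0 + exp(-a * (theta - b)))

    c = [1.0] + [cum(b) for b in thresholds] + [0.0]
    return [c[k] - c[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical item scored 0-4: five category probabilities summing to 1.
probs = grm_probs(theta=0.0, a=1.2, thresholds=[-1.5, -0.5, 0.5, 1.5])
print(len(probs), round(sum(probs), 6))  # 5 1.0
```

Misfit diagnosis of the kind the article describes compares these model-implied category probabilities against observed response proportions at each ability level.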

11.
This study tested assumptions of a servocontrol model of test item feedback. High school students responded to multiple-choice items and rated their certainty of correctness in each response. Next, learners either received feedback on the items or responded again to the same test. The same items were tested again after 1 and 8 days, with the order of alternatives randomized for half of the subjects in each feedback group. The results generally supported the control model and suggest that response certitude estimates can be treated as an index of comprehension.

12.
This study focused on how the design of a national student survey instrument was informed and improved through the combined use of student focus groups, cognitive interviews, and expert survey design advice. We were specifically interested in determining (a) how students interpret the items and response options, (b) the frequency of behaviors or activities associated with the response options, (c) if the items are clearly worded and specific enough to produce reliable and valid results, and (d) if the items and response categories accurately represent students' behaviors and perceptions. We collected focus group data from 8 colleges and universities as part of a nationally funded research project on student engagement. The findings provide additional insight into the importance of using focus groups and cognitive interviews to learn how students interpret various items and what different responses really mean.

13.
The 2010 Western Cape graduate destination survey utilised a sequential mixed-mode design in which an initial web survey was augmented with an equivalent telephonic survey. This article examines mode effect in the Western Cape survey in terms of overall effect size and the bearing it had on the main outcome of the study. Standardised residuals and Cramér’s V are used to determine mode effect across two scenarios, a full sample vs. a subsample, and using two categorical questions with different numbers of response categories. Overall effect size appears to be small in the first question, but increases noticeably together with non-responses in the second question that has many more response categories. Web responses to alumni or graduate destination surveys can perhaps be augmented with telephonic responses if necessary, provided response categories are kept to a minimum, and interviewers are trained properly and monitored for possible interviewer misbehaviour. The benefit of obtaining larger samples should then also outweigh the benefit of using web surveys alone.
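Cramér’s V, used above to gauge the mode effect, rescales the contingency-table chi-square to the [0, 1] range. A self-contained sketch with toy mode-by-response counts (invented for illustration, not the survey data):

```python
from math import sqrt

def cramers_v(table):
    """Cramér's V for a 2-D list of observed counts (e.g., survey mode x response)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, r in enumerate(row_tot):
        for j, c in enumerate(col_tot):
            expected = r * c / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (min(len(row_tot), len(col_tot)) - 1)))

print(cramers_v([[10, 0], [0, 10]]))  # 1.0 (mode fully determines the response)
print(cramers_v([[5, 5], [5, 5]]))    # 0.0 (no mode effect)
```

With more response categories the table grows sparser for a fixed sample, which is one reason the second question in the study shows a larger, noisier effect.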

14.
Model fit indices are being increasingly recommended and used to select the number of factors in an exploratory factor analysis. Growing evidence suggests that the recommended cutoff values for common model fit indices are not appropriate for use in an exploratory factor analysis context. A particularly prominent problem in scale evaluation is the ubiquity of correlated residuals and imperfect model specification. Our research focuses on a scale evaluation context and the performance of four standard model fit indices: root mean square error of approximation (RMSEA), standardized root mean square residual (SRMR), comparative fit index (CFI), and Tucker–Lewis index (TLI), and two equivalence-test-based model fit indices: RMSEA_t and CFI_t. We use Monte Carlo simulation to generate and analyze data based on a substantive example using the Positive and Negative Affect Schedule (N = 1,000). We systematically vary the number and magnitude of correlated residuals as well as nonspecific misspecification, to evaluate the impact on model fit indices in fitting a two-factor exploratory factor analysis. Our results show that all fit indices, except SRMR, are overly sensitive to correlated residuals and nonspecific error, resulting in solutions that are overfactored. SRMR performed well, consistently selecting the correct number of factors; however, previous research suggests it does not perform well with categorical data. In general, we do not recommend using model fit indices to select number of factors in a scale evaluation framework.
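The chi-square-based indices recurring in this list have simple closed forms. The sketch below computes RMSEA, CFI, and TLI from a fitted model's chi-square and a baseline (independence) model's, using illustrative values rather than figures from the study.

```python
from math import sqrt

def rmsea(chi2, df, n):
    """Root mean square error of approximation for a model fit to n observations."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index: target model vs. baseline (independence) model."""
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_m - df_m, chi2_b - df_b, 0.0)
    return 1.0 - num / den if den > 0 else 1.0

def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis index (non-normed fit index)."""
    return ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)

# Illustrative values for N = 1,000: a mildly misspecified model.
print(round(rmsea(120.0, 60.0, 1000), 3))  # ~0.032, under the usual .05 cutoff
print(round(cfi(120.0, 60.0, 2000.0, 78.0), 3))
print(round(tli(120.0, 60.0, 2000.0, 78.0), 3))
```

The study's point is that correlated residuals inflate the model chi-square, dragging these indices past their conventional cutoffs even when the factor count is correct.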

15.
This study examined the performance of the weighted root mean square residual (WRMR) through a simulation study using confirmatory factor analysis with ordinal data. Values and cut scores for the WRMR were examined, along with a comparison of its performance relative to commonly cited fit indexes. The findings showed that WRMR indicated worse fit when sample size increased or model misspecification increased. Lower (i.e., better) values of WRMR were observed when nonnormal data were present, there were lower loadings, and when few categories were analyzed. WRMR generally illustrated expected patterns of relations to other well-known fit indexes. In general, a cutoff value of 1.0 appeared to work adequately under the tested conditions, and the WRMR values of “good fit” were generally in agreement with other indexes. Users are cautioned that when the fitted model is misspecified, the index might provide misleading results under situations where extremely large sample sizes are used.
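For context, the WRMR examined here is usually defined (following Mplus technical work; notation assumed, not taken from the abstract) as a weighted average of squared residuals between sample statistics and their model-implied counterparts:

```latex
\mathrm{WRMR} = \sqrt{\frac{1}{e}\sum_{r=1}^{e}\frac{\left(s_r - \hat{\sigma}_r\right)^2}{v_r}}
```

where s_r is a sample statistic, σ̂_r its model-implied value, v_r the estimated asymptotic variance of s_r, and e the number of sample statistics. Dividing each residual by its sampling variance is what makes the index grow with sample size for a fixed misspecification.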

16.
17.
The reading data from the 1983–84 National Assessment of Educational Progress survey were scaled using a unidimensional item response theory model. To determine whether the responses to the reading items were consistent with unidimensionality, the full-information factor analysis method developed by Bock and associates (1985) and Rosenbaum's (1984) test of unidimensionality, conditional (local) independence, and monotonicity were applied. Full-information factor analysis involves the assumption of a particular item response function; the number of latent variables required to obtain a reasonable fit to the data is then determined. The Rosenbaum method provides a test of the more general hypothesis that the data can be represented by a model characterized by unidimensionality, conditional independence, and monotonicity. Results of both methods indicated that the reading items could be regarded as measures of a single dimension. Simulation studies were conducted to investigate the impact of balanced incomplete block (BIB) spiraling, used in NAEP to assign items to students, on methods of dimensionality assessment. In general, conclusions about dimensionality were the same for BIB-spiraled data as for complete data.

18.
Item response theory scalings were conducted for six tests with mixed item formats. These tests differed in their proportions of constructed response (c.r.) and multiple choice (m.c.) items and in overall difficulty. The scalings included those based on scores for the c.r. items that maintained the same number of levels as the item rubrics, either produced from single ratings or from multiple ratings that were averaged and rounded to the nearest integer, as well as scalings for a single form of c.r. items obtained by summing multiple ratings. A one-parameter (1PPC) or two-parameter (2PPC) partial credit model was used for the c.r. items and the one-parameter logistic (1PL) or three-parameter logistic (3PL) model for the m.c. items. Item fit was substantially worse with the combination 1PL/1PPC model than the 3PL/2PPC model due to the former's restrictive assumptions that there would be no guessing on the m.c. items and equal item discrimination across items and item types. The presence of varying item discriminations resulted in the 1PL/1PPC model producing estimates of item information that could be spuriously inflated for c.r. items that had three or more score levels. Information for some items with summed ratings was usually overestimated by 300% or more for the 1PL/1PPC model. These inflated information values resulted in underestimated standard errors of ability estimates. The constraints posed by the restricted model suggest limitations on the testing contexts in which the 1PL/1PPC model can be accurately applied.

19.
This study sought a scientific way to examine whether item response curves are influenced systematically by the cognitive processes underlying solution of the items in a procedural domain (addition of fractions). Starting from an expert teacher's logical task analysis and prediction of various erroneous rules and sources of misconceptions, an error diagnostic program was developed. This program was used to carry out an error analysis of test performance by three samples of students. After the cognitive structure of the subtasks was validated by a majority of the students, the items were characterized by their underlying subtask patterns. It was found that item response curves for items in the same categories were significantly more homogeneous than those in different categories. In other words, underlying cognitive subtasks appeared to systematically influence the slopes and difficulties of item response curves.

20.

Copyright © Beijing Qinyun Technology Development Co., Ltd. 京ICP备09084417号