Similar Literature
20 similar documents retrieved.
1.
Two new indices to detect answer copying on a multiple-choice test—S1 and S2—were proposed. The S1 index is similar to the K index (Holland, 1996) and the K2 index (Sotaridona & Meijer, 2002), but the distribution of the number of matching incorrect answers of the source and the copier is modeled by the Poisson distribution instead of the binomial distribution to improve on the detection rates of K and K2. The S2 index was proposed to overcome a limitation of the K and K2 indices, namely, their insensitivity to copying of correct answers: S2 incorporates the matching correct answers in addition to the matching incorrect answers. A simulation study was conducted to investigate the usefulness of S1 and S2 for 40- and 80-item tests, sample sizes of 100 and 500, and 10%, 20%, 30%, and 40% answer copying. The Type I errors and detection rates of S1 and S2 were compared with those of K2 and the ω copying index (Wollack, 1997). Results showed that all four indices were able to maintain their Type I error rates, with S1 and K2 being slightly conservative compared with S2 and ω. Furthermore, S1 had higher detection rates than K2, and S2 showed a significant improvement in detection rate over K and K2.
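A minimal sketch, in Python, of the Poisson tail computation that an S1-style check relies on. How the expected number of matching incorrect answers is estimated under the no-copying null (a regression model in the original papers) is assumed to be given here, and the function name is hypothetical.

```python
from scipy.stats import poisson

def s1_style_pvalue(observed_matches, expected_matches):
    """Upper-tail Poisson probability of seeing at least `observed_matches`
    matching incorrect answers, given the null-model mean `expected_matches`."""
    # P(X >= m) = sf(m - 1) for a Poisson-distributed match count
    return poisson.sf(observed_matches - 1, expected_matches)

# Example: 9 matching incorrect answers when about 3.2 are expected by chance
print(s1_style_pvalue(9, 3.2))  # a small p-value flags the pair for review
```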

2.
We investigated the statistical properties of the K-index (Holland, 1996), which can be used to detect copying behavior on a test. A simulation study was conducted to investigate the applicability of the K-index for small, medium, and large datasets. Furthermore, the Type I error rate and the detection rate of this index were compared with those of the copying index ω (Wollack, 1997). Several approximations were used to calculate the K-index. Results showed that all approximations were able to hold the Type I error rates below the nominal level. Results further showed that using ω resulted in higher detection rates than the K-indices for small and medium sample sizes (100 and 500 simulees).
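For contrast with the Poisson-based sketch above, here is the binomial-tail form of a K-index-style check (Python sketch). The per-item chance-match probability is assumed to have been estimated from the empirical data, which is where the approximations mentioned above differ; the function name is hypothetical.

```python
from scipy.stats import binom

def k_style_pvalue(matching_incorrect, copier_incorrect, match_prob):
    """Upper-tail binomial probability of the number of matching incorrect
    answers, with n = the copier's number of incorrect responses and
    p = an estimated per-item chance-match probability."""
    return binom.sf(matching_incorrect - 1, copier_incorrect, match_prob)
```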

3.
The standardized log-likelihood of a response vector (lz) is a popular IRT-based person-fit test statistic for identifying model-misfitting response patterns. Traditional use of lz is overly conservative in detecting aberrance because of an incorrect assumption about its theoretical null distribution. This study proposes a method for improving the accuracy of person-fit analysis with lz that takes test unreliability into account when estimating ability and constructs the null distribution for each lz through resampling. The Type I error and power (or detection rate) of the proposed method were examined at different test lengths, ability levels, and nominal α levels along with other methods, and power to detect three types of aberrance—cheating, lack of motivation, and speeding—was considered. Results indicate that the proposed method is a viable and promising approach: it has Type I error rates close to the nominal value for most ability levels and reasonably good power.
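A minimal sketch of the base lz statistic (the standardized log-likelihood) that the proposed resampling procedure builds on; the model-implied probabilities at the examinee's estimated θ are assumed to come from whatever IRT model has been fitted, and the inputs are hypothetical.

```python
import numpy as np

def lz_statistic(responses, probs):
    """Standardized log-likelihood person-fit statistic for a dichotomous
    response vector: (l0 - E[l0]) / sqrt(Var[l0])."""
    u = np.asarray(responses, dtype=float)
    p = np.asarray(probs, dtype=float)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    e_l0 = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    v_l0 = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - e_l0) / np.sqrt(v_l0)

# Example: large negative values suggest a misfitting (aberrant) pattern
print(lz_statistic([1, 0, 1, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]))
```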

4.
Applied Measurement in Education, 2013, 26(4): 265-288
Many of the currently available statistical indexes to detect answer copying lack sufficient power at small α levels or when the amount of copying is relatively small. Furthermore, no one index is uniformly best; depending on the type or amount of copying, certain indexes are better than others. The purpose of this article was to explore the utility of simultaneously using multiple copying indexes to detect different types and amounts of answer copying. This study compared eight copying indexes: S1 and S2 (Sotaridona & Meijer, 2003), K2 (Sotaridona & Meijer, 2002), ω (Wollack, 1997), B and H (Angoff, 1974), and the new indexes Runs and MaxStrings, plus all possible pairs and triplets of the 8 indexes, using multiple comparison procedures (Dunn, 1961) to adjust the critical α level for each index in a pair or triplet. Empirical Type I error rates and power of all indexes, pairs, and triplets were examined in a real data simulation (i.e., actual examinee responses to items, rather than generated item response vectors, were changed to match the actual responses of randomly selected source examinees) for 2 test lengths, 9 sample sizes, 3 types of copying, 4 α levels, and 4 percentages of items copied. This study found that using both ω and H* (i.e., H with empirically derived critical values) can help improve power in the most realistic types of copying situations (strings and mixed copying). The ω-H* paired index improved power most, particularly for small percentages of items copied and small amounts of copying, two conditions for which copying indexes tend to be underpowered.
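A minimal sketch of the Dunn (Bonferroni) α adjustment used when indexes are applied in pairs or triplets; the flagging rule shown (flag if any member of the set is significant at the adjusted level) is the standard familywise construction and is assumed here rather than quoted from the article.

```python
def adjusted_alpha(nominal_alpha, n_indices):
    """Dunn (Bonferroni) split of the nominal alpha across jointly used indexes."""
    return nominal_alpha / n_indices

def flag_pair(p_values, nominal_alpha=0.01):
    """Flag a source-copier pair if any index in the set is significant at the
    adjusted level, keeping familywise Type I error near the nominal alpha."""
    alpha_star = adjusted_alpha(nominal_alpha, len(p_values))
    return any(p <= alpha_star for p in p_values)

# Example: a two-index pair tested at a familywise alpha of .01
print(flag_pair([0.012, 0.003]))  # True: the second index clears .005
```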

5.
To assess item dimensionality, the following two approaches are described and compared: the hierarchical generalized linear model (HGLM) and the multidimensional item response theory (MIRT) model. Two generating models are used to simulate dichotomous responses to a 17-item test: the unidimensional and compensatory two-dimensional (C2D) models. For C2D data, seven items are modeled to load on the first and second factors, θ1 and θ2, with the remaining 10 items modeled unidimensionally, emulating a mathematics test in which seven items require an additional reading ability dimension. For both types of generated data, the multidimensionality of item responses is investigated using HGLM and MIRT. Comparison of the HGLM and MIRT results is possible through a transformation of items' difficulty estimates into probabilities of a correct response for a hypothetical examinee at the mean on θ1 and θ2. HGLM and MIRT performed similarly. The benefits of HGLM for item dimensionality analyses are discussed.
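A minimal sketch of the kind of transformation described above, assuming a Rasch-type parameterization in which the logit is θ − b: for an examinee at the mean of the trait(s) (θ = 0), the probability of a correct response depends only on the estimated difficulty. The function name and parameterization are illustrative assumptions.

```python
import math

def p_correct_at_mean(difficulty):
    """P(correct) for a hypothetical examinee at theta = 0 under a
    Rasch-type model where logit = theta - difficulty."""
    return 1.0 / (1.0 + math.exp(difficulty))

# Example: comparing HGLM- and MIRT-based difficulty estimates for one item
print(p_correct_at_mean(-0.4), p_correct_at_mean(-0.3))
```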

6.
Using a bidimensional two-parameter logistic model, the authors generated data for two groups on a 40-item test. The item parameters were the same for the two groups, but the correlation between the two traits varied between groups. The difference in the trait correlation was directly related to the number of items judged not to be invariant using traditional unidimensional IRT-based unsigned item invariance indexes; the higher trait correlation leads to higher discrimination parameter estimates when a unidimensional IRT model is fit to the multidimensional data. In the most extreme case, when rθ1θ2 = 0 for one group and rθ1θ2 = 1.0 for the other group, 33 out of 40 items were identified as not invariant. When using signed indexes, the effect was much smaller. The authors therefore suggest a cautious use of IRT-based item invariance indexes when data are potentially multidimensional and groups may vary in the strength of the correlations among traits.
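A minimal sketch, in Python, of generating responses from a compensatory two-dimensional 2PL with a specified trait correlation, in the spirit of the design described above; the intercept parameterization and all parameter values are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_m2pl(a, d, trait_corr, n_persons=1000):
    """Dichotomous responses from a compensatory 2D 2PL:
    P = logistic(a1*theta1 + a2*theta2 + d), corr(theta1, theta2) = trait_corr."""
    a = np.asarray(a, dtype=float)          # shape (n_items, 2)
    d = np.asarray(d, dtype=float)          # shape (n_items,)
    cov = np.array([[1.0, trait_corr], [trait_corr, 1.0]])
    theta = rng.multivariate_normal([0.0, 0.0], cov, size=n_persons)
    p = 1.0 / (1.0 + np.exp(-(theta @ a.T + d)))
    return (rng.uniform(size=p.shape) < p).astype(int)

# Example: same item parameters, two groups differing only in trait correlation
a = rng.uniform(0.8, 2.0, size=(40, 2))
d = rng.normal(0.0, 1.0, size=40)
group_low = simulate_m2pl(a, d, trait_corr=0.0)
group_high = simulate_m2pl(a, d, trait_corr=1.0)
```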

7.
Monte Carlo simulations with 20,000 replications are reported to estimate the probability of rejecting the null hypothesis regarding DIF using SIBTEST when there is DIF present and/or when impact is present due to differences on the primary dimension to be measured. Sample sizes are varied from 250 to 2000 and test lengths from 10 to 40 items. Results generally support previous findings for Type I error rates and power. Impact is inversely related to test length. The combination of DIF and impact, with the focal group having lower ability on both the primary and secondary dimensions, results in impact partially masking DIF so that items biased toward the reference group are less likely to be detected.

8.
The effect of item parameters (discrimination, difficulty, and level of guessing) on the item-fit statistic was investigated using simulated dichotomous data. Nine tests were simulated using 1,000 persons, 50 items, three levels of item discrimination, three levels of item difficulty, and three levels of guessing. The item fit was estimated using two fit statistics: the likelihood ratio statistic (X2B), and the standardized residuals (SRs). All the item parameters were simulated to be normally distributed. Results showed that the levels of item discrimination and guessing affected the item-fit values. As the level of item discrimination or guessing increased, item-fit values increased and more items misfit the model. The level of item difficulty did not affect the item-fit statistic.
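A minimal sketch of the standardized-residual flavor of item fit used here, assuming examinees have already been grouped by estimated ability and that the model-implied proportion correct is available for each group; the names and inputs are illustrative.

```python
import numpy as np

def standardized_residuals(observed_correct, group_sizes, expected_p):
    """Per-group standardized residuals for one item:
    (observed count - expected count) / binomial standard error."""
    o = np.asarray(observed_correct, dtype=float)
    n = np.asarray(group_sizes, dtype=float)
    p = np.asarray(expected_p, dtype=float)
    return (o - n * p) / np.sqrt(n * p * (1 - p))

# Example: three ability groups for one item (hypothetical counts)
print(standardized_residuals([62, 71, 88], [100, 100, 100], [0.58, 0.74, 0.86]))
```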

9.
Data from a large-scale performance assessment (N = 105,731) were analyzed with five differential item functioning (DIF) detection methods for polytomous items to examine the congruence among the DIF detection methods. Two different versions of the item response theory (IRT) model-based likelihood ratio test, the logistic regression likelihood ratio test, the Mantel test, and the generalized Mantel–Haenszel test were compared. Results indicated some agreement among the five DIF detection methods. Because statistical power is a function of sample size, the DIF detection results from extremely large data sets are not practically useful. As alternatives to the DIF detection methods, four IRT model-based indices of standardized impact and four observed-score indices of standardized impact for polytomous items were obtained and compared with the R2 measures of logistic regression.

10.
This study examined the effect of sample size ratio and model misfit on the Type I error rates and power of the Difficulty Parameter Differences procedure using Winsteps. A unidimensional 30-item test with responses from 130,000 examinees was simulated, and four independent variables were manipulated: sample size ratio (20/100/250/500/1000); model fit/misfit (1PL and 3PL with c = .15 models); impact (no difference/mean differences/variance differences/mean and variance differences); and percentage of items with uniform and nonuniform DIF (0%/10%/20%). In general, the results indicate the importance of ensuring model fit to achieve greater control of Type I error and adequate statistical power. The manipulated variables produced inflated Type I error rates, which were well controlled when a measure of DIF magnitude was applied. Sample size ratio also had an effect on the power of the procedure. The paper discusses the practical implications of these results.

11.
To assess the relative contribution of dynamic and summary features of vocal fundamental frequency (f0) to the statistical discrimination of pragmatic categories in infant-directed speech, 49 mothers were instructed to use their voice to get their 4-month-old baby's attention, show approval, and provide comfort. Vocal f0 from 621 tokens was extracted using a Computerized Speech Laboratory and custom software. Dynamic features were measured with convergent methods (visual judgment and quantitative modeling of f0 contour shape). Summary features were f0 mean, standard deviation, and duration. Dynamic and summary features both individually and in combination statistically discriminated between each of the pragmatic categories. Classification rates were 69% and 62% in initial and cross-validation DFAs, respectively.

12.
An approximate χ2 statistic based on McDonald's (1967) nonlinear factor analytic representation of item response theory was proposed and investigated with simulated data. The results were compared with Stout's T statistic (Nandakumar & Stout, 1993; Stout, 1987). Unidimensional and two-dimensional item response data were simulated under varying levels of sample size, test length, test reliability, and dimension dominance. The approximate χ2 statistic had good control over Type I errors when unidimensional data were generated and displayed very good power in identifying the two-dimensional data. The performance of the approximate χ2 was at least as good as Stout's T statistic in all conditions and was better than Stout's T statistic with smaller sample sizes and shorter tests. Further implications regarding the potential use of nonlinear factor analysis and the approximate χ2 in addressing current measurement issues are discussed.

13.
Testing the goodness of fit of item response theory (IRT) models is relevant to validating IRT models, and new procedures have been proposed. These alternatives compare observed and expected response frequencies conditional on observed total scores, and they use posterior probabilities for responses across θ levels rather than cross-classifying examinees using point estimates of θ and score responses. This research compared these alternatives with regard to their methods, properties (Type I error rates and empirical power), available research, and practical issues (computational demands, treatment of missing data, effects of sample size and sparse data, and available computer programs). Different advantages and disadvantages related to these characteristics are discussed. A simulation study provided additional information about empirical power and Type I error rates.

14.
Latent means methods such as multiple-indicator multiple-cause (MIMIC) modeling and structured means modeling (SMM) allow researchers to determine whether a significant difference exists between groups' factor means. Strong invariance is typically recommended when interpreting latent mean differences. The main purpose of this study was to examine the extent to which noninvariant intercepts affect conclusions drawn when implementing both MIMIC and SMM methods. The impact of intercept noninvariance on Type I error rates, power, and two model fit indices when using the MIMIC and SMM approaches was examined under various conditions. Type I error and power were adversely affected by intercept noninvariance. Although the fit indices did not detect small misspecifications in the form of noninvariant intercepts, one performed better than the other.

15.
The purpose of this article is to present an analytical derivation for the mathematical form of an average between-test overlap index as a function of the item exposure index, for fixed-length computerized adaptive tests (CATs). This algebraic relationship is used to investigate the simultaneous control of item exposure at both the item and test levels. The results indicate that, in fixed-length CATs, control of the average between-test overlap is achieved via the mean and variance of the item exposure rates of the items that constitute the CAT item pool. The mean of the item exposure rates is easily manipulated. Control over the variance of the item exposure rates can be achieved via the maximum item exposure rate (rmax). Therefore, item exposure control methods which implement a specification of rmax (e.g., Sympson & Hetter, 1985) provide the most direct control at both the item and test levels.
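A minimal sketch, in Python, of computing an average between-test overlap from a pool's item exposure rates using the commonly cited algebraic relation for fixed-length CATs (overlap driven by the mean and variance of the exposure rates); this stands in for the article's exact derivation and is an assumption, and the function name and example rates are hypothetical.

```python
import numpy as np

def average_test_overlap(exposure_rates, test_length):
    """Average between-test overlap implied by item exposure rates r_i for a
    fixed-length CAT: (pool_size / test_length) * (var(r) + mean(r)**2).
    Because mean(r) = test_length / pool_size, this reduces to
    test_length / pool_size + (pool_size / test_length) * var(r)."""
    r = np.asarray(exposure_rates, dtype=float)
    return (r.size / test_length) * (r.var() + r.mean() ** 2)

# Example: a 400-item pool, 30-item CAT, exposure rates capped near rmax = .20
rng = np.random.default_rng(1)
rates = np.clip(rng.exponential(30 / 400, size=400), 0.0, 0.20)
print(average_test_overlap(rates, test_length=30))
```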

16.
A Monte Carlo simulation study was conducted to evaluate the sensitivities of the likelihood ratio test and five commonly used delta goodness-of-fit (ΔGOF) indices (i.e., ΔGamma, ΔMcDonald's, ΔCFI, ΔRMSEA, and ΔSRMR) to detect a lack of metric invariance in a bifactor model. Experimental conditions included factor loading differences, location and number of noninvariant items, and sample size. The results indicated all ΔGOF indices held Type I error to a minimum and overall had adequate power for the study. For detecting the violation of metric invariance, only ΔGamma and ΔCFI, in addition to Δχ2, are recommended for use in the bifactor model, with values of −.016 to −.023 and −.003 to −.004, respectively. Moreover, in the variance component analysis, the magnitude of the factor loading differences contributed the most variation to all ΔGOF indices, whereas sample size affected Δχ2 the most.

17.
Two qualitatively different information-processing algorithms for solution of Raven's Progressive Matrices items have been identified. Whereas the Gestalt algorithm involves spatial operations upon the test stimuli, the Analytic algorithm employs logical operations upon features abstracted from the displays. In this study, training groups were established varying both in the Strength (Weak or Strong) and Type (Gestalt or Analytic) of training at three grade levels. Two sets of post-test measures were given. Ambiguous items were constructed such that more than one correct answer was possible, some being the result of the Gestalt algorithm and others of the Analytic algorithm. Subjects' performances on the Ambiguous items indicated that strong Analytic training had been particularly effective and was specific to Analytic answer options. The second post-test measure was Set I of the Advanced Progressive Matrices. Performance on these Test items indicated that the effects of strategy training had been maintained, and were due to the facilitation of Analytic item performance by Analytic training. The effects of Strength and Type of training were consistent across Grades. These results support Hunt's analysis of Raven's Progressive Matrices items, and demonstrate that strategy training based upon a precise information processing task analysis can be effective in improving Progressive Matrices performance. The implications of these results for intellectual assessment are discussed.

18.
Two simulation studies investigated the Type I error performance of two statistical procedures for detecting differential item functioning (DIF): SIBTEST and Mantel-Haenszel (MH). Because MH and SIBTEST are based on asymptotic distributions requiring "large" numbers of examinees, the first study examined Type I error for small sample sizes. No significant Type I error inflation occurred for either procedure. Because MH has the potential for Type I error inflation for non-Rasch models, the second study used a markedly non-Rasch test and systematically varied the shape and location of the studied item. When differences across examinee groups in the distribution of the measured ability were present, both procedures displayed inflated Type I error for certain items; MH displayed the greater inflation. Also, both procedures displayed statistically biased estimation of the zero DIF for certain items, though SIBTEST displayed much less bias than MH. When no latent distributional differences were present, both procedures performed satisfactorily under all conditions.
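A minimal sketch, in Python, of the Mantel-Haenszel common odds ratio that underlies the MH DIF procedure studied here; examinees are assumed to have already been stratified by total score, and the per-stratum 2×2 counts below are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """MH common odds ratio across total-score strata.
    Each stratum is (A, B, C, D): reference correct, reference incorrect,
    focal correct, focal incorrect. A value near 1.0 indicates little DIF."""
    num = den = 0.0
    for a, b, c, d in strata:
        t = a + b + c + d
        num += a * d / t
        den += b * c / t
    return num / den

# Example with three score strata (hypothetical counts)
print(mantel_haenszel_odds_ratio([(40, 10, 35, 15), (60, 20, 55, 25), (30, 30, 25, 35)]))
```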

19.
This study examined the efficacy of 4 different parceling methods for modeling categorical data with 2, 3, and 4 categories and with normal, moderately nonnormal, and severely nonnormal distributions. The parceling methods investigated were isolated parceling in which items were parceled with other items sharing the same source of variance, and distributed parceling in which items were parceled with items influenced by different factors. These parceling strategies were crossed with strategies in which items were either parceled with similarly distributed or differently distributed items, to create 4 different parceling methods. Overall, parceling together items influenced by different factors and with different distributions resulted in better model fit, but high levels of parameter estimate bias. Across all parceling methods, parameter estimate bias ranged from 20% to over 130%. Parceling strategies were contrasted with use of the WLSMV estimator for categorical, unparceled data. Results based on this estimator are encouraging, although some bias was found when high levels of nonnormality were present. Values of the chi-square and root mean squared error of approximation based on WLSMV also resulted in Type II error rates for misspecified models when data were severely nonnormally distributed.

20.
The authors assessed the effects of using “none of the above” as an option in a 40-item, general-knowledge multiple-choice test administered to undergraduate students. Examinees who selected “none of the above” were given an incentive to write the correct answer to the question posed. Using “none of the above” as the keyed option made items much more difficult (d = −1.11). Furthermore, 45% of the time that examinees correctly selected “none of the above,” they wrote either a wrong answer (19%) or no answer (26%), and rescoring items to deny credit in these cases caused item discrimination to fall (d = −0.35). Thus, when “none of the above” is the keyed option, credit earned by examinees with knowledge deficiencies can make items appear to have more discriminatory power than is actually the case. The authors recommend that “none of the above” should not be used as an option in multiple-choice items.
