Similar literature
1.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is item response theory–based true score equating, whose goal is to generate a conversion table to relate number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items, while at the same time retaining a large number of linking items.
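
To make the quantity in this abstract concrete, the following is a minimal sketch of a test characteristic curve (TCC) under the three‐parameter logistic model and of the largest gap between the TCCs of two administrations. It is not the stepwise detection method the article proposes. The function names, the scaling constant D = 1.7, and the ability grid from -4 to 4 are assumptions made for illustration.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (D = 1.7 is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items, D=1.7):
    """Test characteristic curve: expected number-correct score at ability theta.

    `items` is a list of (a, b, c) tuples: discrimination, difficulty, guessing.
    """
    return sum(p_3pl(theta, a, b, c, D) for a, b, c in items)

def max_tcc_gap(items_first, items_second, grid=None):
    """Largest absolute TCC difference between two administrations over an ability grid."""
    if grid is None:
        # Hypothetical grid: theta from -4.0 to 4.0 in steps of 0.1.
        grid = [-4.0 + 0.1 * i for i in range(81)]
    return max(abs(tcc(t, items_first) - tcc(t, items_second)) for t in grid)

if __name__ == "__main__":
    first = [(1.0, 0.0, 0.2), (1.2, -0.5, 0.25)]
    drifted = [(1.0, 0.4, 0.2), (1.2, -0.5, 0.25)]  # first item's difficulty shifted
    print(max_tcc_gap(first, drifted))
```

A drift screen built on this idea would compare such gaps with and without candidate linking items; how items are removed and when the procedure stops is specific to the article and is not reproduced here.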

2.
Differential linear drift of item location parameters over a 10-year period is demonstrated in data from the College Board Physics Achievement Test. The relative direction of drift is associated with the content of the items and reflects changing emphasis in the physics curricula of American secondary schools. No evidence of drift of discriminating power parameters was found. Statistical procedures for detecting, estimating, and accounting for item parameter drift in item pools for long-term testing programs are proposed.

3.
In this article we present a general approach not relying on item response theory models (non‐IRT) to detect differential item functioning (DIF) in dichotomous items in the presence of guessing. The proposed nonlinear regression (NLR) procedure for DIF detection is an extension of the method based on logistic regression. As a non‐IRT approach, NLR can be seen as a proxy for detection based on the three‐parameter IRT model, which is a standard tool in the field. Hence, NLR fills a logical gap in DIF detection methodology and as such is important for educational purposes. Moreover, the advantages of the NLR procedure, as well as a comparison to other commonly used methods, are demonstrated in a simulation study. A real data analysis is offered to demonstrate practical use of the method.

4.
Most of the existing classification accuracy indices of attribute patterns lose effectiveness when the response data is absent in diagnostic testing. To handle this issue, this article proposes new indices to predict the correct classification rate of a diagnostic test before administering the test under the deterministic noise input “and” gate (DINA) model. The new indices include an item‐level expected classification accuracy (ECA) for attributes and a test‐level ECA for attributes and attribute patterns, and both of them are calculated based solely on the known item parameters and Q‐matrix. Theoretical analysis showed that the item‐level ECA could be regarded as a measure of correct classification rates of attributes contributed by an item. This article also illustrates how to apply the item‐level ECA for attributes to estimate the correct classification rate of attribute patterns at the test level. Simulation results showed that two test‐level ECA indices, ECA_I_W (an index based on the independence assumption and the weighted sum of the item‐level ECAs) and ECA_C_M (an index based on Gaussian Copula function that incorporates the dependence structure of the events of attribute classification and the simple average of the item‐level ECAs), could make an accurate prediction for correct classification rates of attribute patterns.
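
For readers unfamiliar with the DINA model named in this abstract, the sketch below shows only its standard item response function: an examinee who has mastered every attribute the Q‐matrix row requires answers correctly with probability 1 - slip, and otherwise with probability guess. It does not implement the ECA indices, whose formulas are not given in the abstract. The function name and argument layout are assumptions.

```python
def dina_prob(alpha, q_row, slip, guess):
    """Probability of a correct response under the DINA model.

    alpha : list of 0/1 attribute mastery indicators for one examinee
    q_row : list of 0/1 entries, the item's row of the Q-matrix
    slip  : probability that a fully prepared examinee answers incorrectly
    guess : probability that an unprepared examinee answers correctly
    """
    # eta is True only if every attribute required by the item is mastered.
    eta = all(a >= q for a, q in zip(alpha, q_row))
    return 1.0 - slip if eta else guess

if __name__ == "__main__":
    q_row = [1, 0, 1]  # hypothetical item requiring attributes 1 and 3
    print(dina_prob([1, 1, 1], q_row, slip=0.1, guess=0.2))
    print(dina_prob([1, 1, 0], q_row, slip=0.1, guess=0.2))
```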

5.
Traditional methods for examining differential item functioning (DIF) in polytomously scored test items yield a single item‐level index of DIF and thus provide no information concerning which score levels are implicated in the DIF effect. To address this limitation of DIF methodology, the framework of differential step functioning (DSF) has recently been proposed, whereby measurement invariance is examined within each step underlying the polytomous response variable. The examination of DSF can provide valuable information concerning the nature of the DIF effect (i.e., is the DIF an item‐level effect or an effect isolated to specific score levels), the location of the DIF effect (i.e., precisely which score levels are manifesting the DIF effect), and the potential causes of a DIF effect (i.e., what properties of the item stem or task are potentially biasing). This article presents a didactic overview of the DSF framework and provides specific guidance and recommendations on how DSF can be used to enhance the examination of DIF in polytomous items. An example with real testing data is presented to illustrate the comprehensive information provided by a DSF analysis.

6.
Cross‐level invariance in a multilevel item response model can be investigated by testing whether the within‐level item discriminations are equal to the between‐level item discriminations. Testing the cross‐level invariance assumption is important to understand constructs in multilevel data. However, in most multilevel item response model applications, cross‐level invariance is assumed without being tested. In this study, the detection methods of differential item discrimination (DID) over levels and the consequences of ignoring DID are illustrated and discussed with the use of multilevel item response models. Simulation results showed that the likelihood ratio test (LRT) performed well in detecting global DID at the test level when some portion of the items exhibited DID. At the item level, the Akaike information criterion (AIC), the sample‐size adjusted Bayesian information criterion (saBIC), LRT, and Wald test showed a satisfactory rejection rate (>.8) when some portion of the items exhibited DID and the items had lower intraclass correlations (or higher DID magnitudes). When DID was ignored, the accuracy of the item discrimination estimates and standard errors was mainly problematic. Implications of the findings and limitations are discussed.

7.
Response accuracy and response time data can be analyzed with a joint model to measure ability and speed of working, while accounting for relationships between item and person characteristics. In this study, person‐fit statistics are proposed for joint models to detect aberrant response accuracy and/or response time patterns. The person‐fit tests take the correlation between ability and speed into account, as well as the correlation between item characteristics. They are posited as Bayesian significance tests, which have the advantage that the extremeness of a test statistic value is quantified by a posterior probability. The person‐fit tests can be computed as by‐products of a Markov chain Monte Carlo algorithm. Simulation studies were conducted in order to evaluate their performance. For all person‐fit tests, the simulation studies showed good detection rates in identifying aberrant patterns. A real data example is given to illustrate the person‐fit statistics for the evaluation of the joint model.

8.
The intent of this research was to find an item selection procedure in the multidimensional computer adaptive testing (CAT) framework that yielded higher precision for both the domain and composite abilities, had a higher usage of the item pool, and controlled the exposure rate. Five multidimensional CAT item selection procedures (minimum angle; volume; minimum error variance of the linear combination; minimum error variance of the composite score with optimized weight; and Kullback‐Leibler information) were studied and compared with two methods for item exposure control (the Sympson‐Hetter procedure and the fixed‐rate procedure, the latter simply placing a limit on the item exposure rate) using simulated data. The maximum priority index method was used for the content constraints. Results showed that the Sympson‐Hetter procedure yielded better precision than the fixed‐rate procedure but had much lower item pool usage and took more time. The five item selection procedures performed similarly under Sympson‐Hetter. For the fixed‐rate procedure, there was a trade‐off between the precision of the ability estimates and the item pool usage: the five procedures had different patterns. It was found that (1) Kullback‐Leibler had better precision but lower item pool usage; (2) minimum angle and volume had balanced precision and item pool usage; and (3) the two methods minimizing the error variance had the best item pool usage and comparable overall score recovery but less precision for certain domains. The priority index for content constraints and item exposure was implemented successfully.

9.
This study investigated possible explanations for an observed change in Rasch item parameters (b values) obtained from consecutive administrations of a professional licensure examination. Considered in this investigation were variables related to item position, item type, item content, and elapsed time between administrations of the item. An analysis of covariance methodology was used to assess the relations between these variables and change in item b values, with the elapsed time index serving to control for differences that could be attributed to average or pool changes in b values over time. A series of analysis of covariance models were fitted to the data in an attempt to identify item characteristics that were significantly related to the change in b values after the time elapsed between item administrations had been controlled. The findings indicated that the change in item b values was not related either to item position or to item type. A small, positive relationship between this change and elapsed time indicated that the pool b values were increasing over time. A test of simple effects suggested the presence of greater change for one of the content categories analyzed. These findings are interpreted, and suggestions for future research are provided.

10.
The human nose is a very sensitive detector and is able to detect potent aroma compounds down to low ng/L levels. These levels are often below detection limits of analytical instrumentation. The following laboratory exercise is designed to compare instrumental and human methods for the detection of volatile odor active compounds. Reference standards of 3‐mercapto‐1‐hexanol (3MH), a secondary thiol that is important to food quality, are analyzed by gas chromatography with flame ionization detection (GC‐FID), and these raw data are provided to students. Students also perform a series of 3‐alternative forced choice (3‐AFC) sensory tests to determine the human detection limits in a series of samples. For both data sets, 2 methods of data analysis (standard deviation of the response and the slope and signal‐to‐noise ratio for GC‐FID data; forced‐choice ascending concentration series method of limits and linear regression for 3‐AFC data) will be used to estimate instrumental detection limits and human thresholds. GC‐FID and 3‐AFC results are then compared by the students to demonstrate the importance of instrumental and human methods for food analysis, and to provide an experiential learning opportunity to critically think through multiple methods of analysis and compare the outcomes of those methods. In completing the laboratory exercise and discussion questions, students will gain an understanding of the advantages and disadvantages of human and instrumental measurements in food analysis, and compare the outcome of common data analysis methods for instrumental and sensory data.

11.
The aim of this study is to assess the efficiency of using the multiple‐group categorical confirmatory factor analysis (MCCFA) and the robust chi‐square difference test in differential item functioning (DIF) detection for polytomous items under the minimum free baseline strategy. While testing for DIF items, despite the strong assumption that all but the examined item are set to be DIF‐free, MCCFA with such a constrained baseline approach is commonly used in the literature. The present study relaxes this strong assumption and adopts the minimum free baseline approach where, aside from those parameters constrained for identification purpose, parameters of all but the examined item are allowed to differ among groups. Based on the simulation results, the robust chi‐square difference test statistic with the mean and variance adjustment is shown to be efficient in detecting DIF for polytomous items in terms of the empirical power and Type I error rates. To sum up, MCCFA under the minimum free baseline strategy is useful for DIF detection for polytomous items.

12.
As access to and reliance on technology continue to increase, so does the use of computerized testing for admissions, licensure/certification, and accountability exams. Nonetheless, full computer‐based test (CBT) implementation can be difficult due to limited resources. As a result, some testing programs offer both CBT and paper‐based test (PBT) administration formats. In such situations, evidence that scores obtained from different formats are comparable must be gathered. In this study, we illustrate how contemporary statistical methods can be used to provide evidence regarding the comparability of CBT and PBT scores at the total test score and item levels. Specifically, we looked at the invariance of test structure and item functioning across test administration mode and across subgroups of students defined by SES and sex. Multiple replications of both confirmatory factor analysis and Rasch differential item functioning analyses were used to assess invariance at the factorial and item levels. Results revealed a unidimensional construct with moderate statistical support for strong factorial‐level invariance across SES subgroups, and moderate support of invariance across sex. Issues involved in applying these analyses to future evaluations of the comparability of scores from different versions of a test are discussed.

13.
This study examined the utility of response time‐based analyses in understanding the behavior of unmotivated test takers. For the data from an adaptive achievement test, patterns of observed rapid‐guessing behavior and item response accuracy were compared to the behavior expected under several types of models that have been proposed to represent unmotivated test taking behavior. Test taker behavior was found to be inconsistent with these models, with the exception of the effort‐moderated model. Effort‐moderated scoring was found to both yield scores that were more accurate than those found under traditional scoring, and exhibit improved person fit statistics. In addition, an effort‐guided adaptive test was proposed and shown by a simulation study to alleviate item difficulty mistargeting caused by unmotivated test taking.

14.
In many testing programs it is assumed that the context or position in which an item is administered does not have a differential effect on examinee responses to the item. Violations of this assumption may bias item response theory estimates of item and person parameters. This study examines the potentially biasing effects of item position. A hierarchical generalized linear model is formulated for estimating item‐position effects. The model is demonstrated using data from a pilot administration of the GRE wherein the same items appeared in different positions across the test form. Methods for detecting and assessing position effects are discussed, as are applications of the model in the contexts of test development and item analysis.

15.
Many standardized tests are now administered via computer rather than paper‐and‐pencil format. The computer‐based delivery mode brings with it certain advantages. One advantage is the ability to adapt the difficulty level of the test to the ability level of the test taker in what has been termed computerized adaptive testing (CAT). A second advantage is the ability to record not only the test taker's response to each item (i.e., question), but also the amount of time the test taker spends considering and answering each item. Combining these two advantages, various methods were explored for utilizing response time data in selecting appropriate items for an individual test taker. Four strategies for incorporating response time data were evaluated, and the precision of the final test‐taker score was assessed by comparing it to a benchmark value that did not take response time information into account. While differences in measurement precision and testing times were expected, results showed that the strategies did not differ much with respect to measurement precision but that there were differences with regard to the total testing time.

16.
17.
This paper considers a modification of the DIF procedure SIBTEST for investigating the causes of differential item functioning (DIF). One way in which factors believed to be responsible for DIF can be investigated is by systematically manipulating them across multiple versions of an item using a randomized DIF study (Schmitt, Holland, & Dorans, 1993). In this paper, it is shown that the additivity of the index used for testing DIF in SIBTEST motivates a new extension of the method for statistically testing the effects of DIF factors. Because an important consideration is whether or not a studied DIF factor is consistent in its effects across items, a methodology for testing item × factor interactions is also presented. Using data from the mathematical sections of the Scholastic Assessment Test (SAT), the effects of two potential DIF factors, item format (multiple-choice versus open-ended) and problem type (abstract versus concrete), are investigated for gender. Results suggest a small but statistically significant and consistent effect of item format (favoring males for multiple-choice items) across items, and a larger but less consistent effect due to problem type.

18.
Calibration and equating is the quintessential necessity for most large‐scale educational assessments. However, there are instances when no consideration is given to the equating process in terms of context and substantive realization, and the methods used in its execution. In the view of the authors, equating is not merely an exhibit of the statistical methodology, but it is also a reflection of the thought process undertaken in its execution. For example, there is hardly any discussion in the literature of the ideological differences in the selection of an equating method. Furthermore, there is little evidence of modeling cohort growth through an identification and use of construct‐relevant linking items’ drift, using the common item nonequivalent group equating design. In this article, the authors philosophically justify the use of Huynh's statistical method for the identification of construct‐relevant outliers in the linking pool. The article also dispels the perception of scale instability associated with the inclusion of construct‐relevant outliers in the linking item pool and concludes that an appreciation of the rationale used in the selection of the equating method, together with the use of linking items in modeling cohort growth, can be beneficial to practitioners.

19.
Literature relating to the well‐being of older adults was reviewed to identify indicators relevant to the construct of self‐responsibility for wellness. The wellness model proposed by Travis (1981) has produced a variety of concepts which can be useful in improving the quality of life for older adults. The purpose of this study was to develop an instrument which would assess an individual's self‐responsibility for wellness. A 47‐item instrument developed for this purpose was evaluated by experts in gerontology and psychology. After revision and reevaluation it was field‐tested on a sample of 180 older adults (60 years of age and over). In order to take preliminary steps in establishing the validity and reliability of this instrument, the data were evaluated and an item analysis conducted to identify poor items. Cronbach's coefficient alpha was also computed (α = .90). A test‐retest correlation coefficient was computed, and an analysis of variance was performed to test for the relationship between self‐responsibility for wellness and demographic variables obtained during the field test.

The field testing of the instrument served as an educational needs assessment study. Evidence has been provided that there is a significant need for education programs which can provide training in the wellness skills as assessed by the instrument.
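
The reliability coefficient reported in this abstract (Cronbach's coefficient alpha) can be computed from an examinee‐by‐item score table as sketched below. This is a minimal illustration of the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). The function name and the toy data are assumptions, and the study's own data are not reproduced.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's coefficient alpha.

    `scores` is a list of rows, one per respondent, each a list of k item scores.
    Uses sample variances (n - 1 denominator) for items and for total scores.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item (column) across respondents.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

if __name__ == "__main__":
    # Hypothetical responses: 4 respondents, 3 items on a 1-5 scale.
    data = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2]]
    print(round(cronbach_alpha(data), 3))
```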

20.