Similar Documents
20 similar documents found (search time: 31 ms)
1.
2.
《Educational Assessment》2013,18(2):203-206
This rejoinder responds to the major statements and claims made in Clemans (this issue). The arbitrary and unrealistic assumptions made by the Thurstone procedure are described. We point out the logical inconsistency of Clemans's claim that the relationship between raw scores and abilities holds when transforming abilities into raw scores but not when transforming raw scores into abilities. Two effects that Clemans claims are caused by item response theory (IRT) scaling are examined, and we demonstrate that they occur more often with Thurstone scaling than with IRT scaling. We reiterate our belief in the superiority of IRT scaling over Thurstone scaling.

3.
4.
Two methods of constructing equal-interval scales for educational achievement are discussed: Thurstone's absolute scaling method and Item Response Theory (IRT). Alternative criteria for choosing a scale are contrasted. It is argued that clearer criteria are needed for judging the appropriateness and usefulness of alternative scaling procedures, and more information is needed about the qualities of the different scales that are available. In answer to this second need, some examples are presented of how IRT can be used to examine the properties of scales: It is demonstrated that for observed score scales in common use (i.e., any scores that are influenced by measurement error), (a) systematic errors can be introduced when comparing growth at selected percentiles, and (b) normalizing observed scores will not necessarily produce a scale that is linearly related to an underlying normally distributed true trait.

5.
A developmental scale for the North Carolina End-of-Grade Mathematics Tests was created using a subset of identical test forms administered to adjacent grade levels. Thurstone scaling and item response theory (IRT) techniques were employed to analyze the changes in grade distributions across these linked forms. Three variations of Thurstone scaling were examined, one based on Thurstone's 1925 procedure and two based on Thurstone's 1938 procedure. The IRT scaling was implemented using both BIMAIN and MULTILOG. All methods indicated that average mathematics performance improved from Grade 3 to Grade 8, with similar results for the two IRT analyses and one version of Thurstone's 1938 method. The standard deviations of the IRT scales did not show a consistent pattern across grades, whereas those produced by Thurstone's 1925 procedure generally decreased; one version of the 1938 method exhibited slightly increasing variation with increasing grade level, while the other version displayed inconsistent trends.
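Thurstone's 1925 procedure rests on within-grade normality: if the proportion-correct values of common items are converted to normal deviates, the deviates for two adjacent grades fall on a straight line, and the slope and intercept of that line recover the upper grade's mean and SD on the lower grade's scale. A minimal sketch of that linking idea (the item difficulties and grade parameters below are invented for illustration; this is not the full North Carolina procedure):

```python
from statistics import NormalDist

def thurstone_link(p_lower, p_upper):
    """Link two adjacent grades from common-item proportion-correct values.

    Under the normality assumption, z_ig = (mu_g - d_i) / sigma_g for item
    difficulty d_i, so the normal deviates of the two grades are linearly
    related.  Fixing the lower grade at mean 0, SD 1, regressing the
    lower-grade deviates on the upper-grade deviates gives slope sigma_upper
    and intercept -mu_upper.  Returns (mu_upper, sigma_upper)."""
    inv = NormalDist().inv_cdf
    z_lo = [inv(p) for p in p_lower]
    z_up = [inv(p) for p in p_upper]
    n = len(z_lo)
    mx = sum(z_up) / n
    my = sum(z_lo) / n
    sxx = sum((x - mx) ** 2 for x in z_up)
    sxy = sum((x - mx) * (y - my) for x, y in zip(z_up, z_lo))
    slope = sxy / sxx
    intercept = my - slope * mx
    return -intercept, slope
```

With error-free p-values generated from the model the linking is exact; with real data the regression averages over sampling error in the item p-values.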

6.
Homing pigeons were reinforced for emitting a perching response according to differential-reinforcement-of-low-rate (DRL) schedules. The spacing requirement between successive perchings was progressively increased by 1-sec steps up to 70 sec and then abruptly decreased to 60, 40, and 20 sec. IRT/OP (interresponse time/opportunity) functions were maximal near the time of reinforcement. The coefficients of variation of the IRT distributions (ratio between the interquartile range and median IRT) fluctuated around .32, testifying to equivalent levels of adjustment throughout the critical IRT range. The ratio between reinforced and total IRTs ranged between .90 and .20. These data contrast with the performance of another group of pigeons reinforced for a treadle-pressing response according to DRL schedules (flatter IRT/OP functions, high coefficients of variation, and low efficiencies). Despite these differences in temporal regulation between perching and treadle-pressing DRL, response rates and reinforcement rates followed the same trend in both cases: they decreased as schedule value increased. The DRL perching results are similar to previous results obtained in the same species when perching duration was reinforced.
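An IRT/OP function is the conditional probability that an interresponse time ends in a given time bin, given that it lasted at least that long: responses in the bin divided by opportunities to respond in the bin. A sketch of the computation from binned IRT counts (the counts are hypothetical):

```python
def irts_per_opportunity(counts):
    """counts[i] = number of interresponse times falling in bin i.

    IRT/OP for bin i = counts[i] / (number of IRTs lasting at least to
    bin i), i.e., responses emitted in the bin divided by opportunities
    to respond there.  The last nonempty bin always yields 1.0."""
    remaining = sum(counts)
    out = []
    for c in counts:
        out.append(c / remaining if remaining else 0.0)
        remaining -= c
    return out
```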

7.
When cut scores for classifications occur on the total score scale, popular methods for estimating classification accuracy (CA) and classification consistency (CC) require assumptions about a parametric form of the test scores or about a parametric response model, such as item response theory (IRT). This article develops an approach to estimate CA and CC nonparametrically by replacing the role of the parametric IRT model in Lee's classification indices with a modified version of Ramsay's kernel-smoothed item response functions. The performance of the nonparametric CA and CC indices is tested in simulation studies under various conditions with different generating IRT models, test lengths, and ability distributions. The nonparametric approach to CA often outperforms Lee's method and Livingston and Lewis's method, showing robustness to nonnormality in the simulated ability. The nonparametric CC index performs similarly to Lee's method and outperforms Livingston and Lewis's method when the ability distributions are nonnormal.
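Ramsay-style kernel smoothing replaces a parametric item response function with a Nadaraya-Watson weighted average of observed item scores around each ability value. A minimal sketch of that estimator (the Gaussian kernel, bandwidth, and simulated 2PL data are illustrative assumptions, not the article's exact procedure):

```python
import numpy as np

def kernel_irf(theta0, theta_hat, item_scores, h=0.4):
    """Nadaraya-Watson estimate of P(correct | theta = theta0):
    a Gaussian-kernel-weighted mean of examinees' 0/1 item scores,
    using their ability estimates theta_hat and bandwidth h."""
    t = np.asarray(theta_hat, float)
    x = np.asarray(item_scores, float)
    w = np.exp(-0.5 * ((t - theta0) / h) ** 2)   # kernel weights
    return float(np.sum(w * x) / np.sum(w))
```

Evaluating the estimator on a grid of theta values traces out a smooth, model-free item response function that can then play the role of the IRT curve in classification indices.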

8.
Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two- and three-parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL- or 3PL-based vertical scaling in the presence of LPE-induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades.
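Samejima's LPE raises a logistic curve to a positive exponent, which makes the item response function asymmetric; for exponents other than 1 no 2PL curve matches it exactly, and that asymmetry is the source of the systematic misfit discussed above. A minimal sketch of the two-parameter LPE (parameter values are illustrative):

```python
import math

def p_2pl(theta, a, b):
    """Standard 2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_lpe(theta, a, b, xi):
    """Logistic positive exponent: the 2PL curve raised to the power xi.
    xi = 1 reduces to the 2PL; xi != 1 makes the curve asymmetric about b."""
    return p_2pl(theta, a, b) ** xi
```

For example, with xi = 2 the probability at theta = b falls from .5 to .25, and the curve climbs toward its upper asymptote more slowly than any symmetric logistic.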

9.
As a global measure of precision, item response theory (IRT) estimated reliability is derived for four coefficients (Cronbach's α, Feldt-Raju, stratified α, and marginal reliability). Models with different underlying assumptions concerning test-part similarity are discussed. A detailed computational example is presented for the targeted coefficients. A comparison of the IRT model-derived coefficients is made and the impact of varying ability distributions is evaluated. The advantages of IRT-derived reliability coefficients for problems such as automated test form assembly and vertical scaling are discussed.
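Marginal reliability, one of the four coefficients named above, averages the IRT error variance over an assumed ability distribution; one common formulation is ρ = 1 − E[SE²(θ)]/σ²(θ), with SE²(θ) = 1/I(θ). A numerical sketch for a 2PL test under a standard normal prior (the item parameters are hypothetical, and this is only one of several formulations in the literature):

```python
import numpy as np

def marginal_reliability(a, b, grid=np.linspace(-4, 4, 161)):
    """rho = 1 - E[SE^2(theta)] / Var(theta), with SE^2 = 1 / I(theta),
    averaged over a discretized standard normal ability distribution
    (prior variance 1).  a, b: 2PL discriminations and difficulties."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    w = np.exp(-0.5 * grid ** 2)
    w /= w.sum()                                      # N(0,1) weights
    P = 1 / (1 + np.exp(-a * (grid[:, None] - b)))    # grid x items
    info = (a ** 2 * P * (1 - P)).sum(axis=1)         # test information
    return 1.0 - float((w / info).sum())
```

Doubling the test length roughly doubles the information function, so the sketch also reproduces the familiar rise of reliability with test length.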

10.
Empirical studies have demonstrated Type-I error (TIE) inflation (especially for highly discriminating easy items) of the Mantel-Haenszel chi-square test for differential item functioning (DIF) when data conform to item response theory (IRT) models more complex than the Rasch model and when the IRT proficiency distributions differ only in their means. However, no published study has manipulated the proficiency variance ratio (VR). Data were generated with the three-parameter logistic (3PL) IRT model, with proficiency VRs of 1, 2, 3, and 4. The present study suggests that inflation may be greater, and may affect all highly discriminating items (of low, moderate, and high difficulty), when the IRT proficiency distributions of the reference and focal groups differ in their variances as well. Inflation was greatest for the 21-item test (vs. 41 items) and the total sample size of 2,000 (vs. 1,000). Previous studies had not systematically examined the sample size ratio. A sample size ratio of 1:1 produced greater TIE inflation than 3:1, but primarily at the total sample size of 2,000.
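The Mantel-Haenszel statistic at issue aggregates one 2×2 (group × correct/incorrect) table per total-score stratum into a single continuity-corrected chi-square. A sketch of that computation (the table counts below are invented for illustration):

```python
def mantel_haenszel_chi2(tables):
    """tables: one 2x2 table per matching stratum, as
       [[ref_correct, ref_incorrect], [focal_correct, focal_incorrect]].
    Returns the continuity-corrected Mantel-Haenszel chi-square,
    comparing the summed reference-correct count A with its expectation
    and variance under the no-DIF hypothesis."""
    A = E = V = 0.0
    for (a, b), (c, d) in tables:
        n_ref, n_foc = a + b, c + d
        m_correct, m_incorrect = a + c, b + d
        T = n_ref + n_foc
        if T < 2:
            continue
        A += a
        E += n_ref * m_correct / T
        V += n_ref * n_foc * m_correct * m_incorrect / (T * T * (T - 1))
    return (abs(A - E) - 0.5) ** 2 / V
```

Under no DIF the statistic is approximately chi-square with one degree of freedom, so the TIE inflation discussed above shows up as rejection rates above the nominal alpha.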

11.
The purpose of this study was to examine slopes from curriculum-based measures of writing (CBM-W) as indicators of growth in writing. Responses to story prompts administered for 5 min to 89 students in Grades 2–5 were collected across 12 weeks and scored for correct word sequences (CWS) and correct minus incorrect sequences (CIWS). Linear mixed modeling revealed that, for students in Grades 2–3, a linear model with random effects on both intercept and slope fit the data best. For students in Grades 4–5, growth trends varied depending on number of weeks and scoring procedure used. The time point at which slopes were significantly different from zero varied by scoring procedure and grade. Gender was related to intercept and slope for CWS and CIWS in Grades 2–3 and to intercept and linear slope for CWS and CIWS in Grades 4–5. Findings suggest that CBM-W may be appropriate for monitoring student progress, and that gender should be considered in data-based decision making.

12.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): the item score approach (ISA), the testlet score approach (TSA), and the item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degree of dependence among within-testlet items. Reliability estimates from the TSA were lower than those from the INTA due to the loss of information with IRT approaches; this pattern did not apply to GT. The specified methods in IRT produced higher reliability estimates than those in GT using the same approach. Relatively smaller magnitudes of error in reliability estimates were observed for the ISA and for the IRT methods. Thus, it seems reasonable to use the TSA as well as the INTA for both GT and IRT. However, if there is relatively large dependence among within-testlet items, the INTA should be considered for IRT due to the nonnegligible loss of information.

13.
This article illustrates five different methods for estimating Angoff cut scores using item response theory (IRT) models. These include maximum likelihood (ML), expected a priori (EAP), modal a priori (MAP), and weighted maximum likelihood (WML) estimators, as well as the most commonly used approach based on translating ratings through the test characteristic curve (i.e., the IRT true-score (TS) estimator). The five methods are compared using a simulation study and a real data example. Results indicated that the application of different methods can sometimes lead to different estimated cut scores, and that there can be some key differences in impact data when using the IRT TS estimator compared to other methods. It is suggested that one should think carefully about the choice of method used to estimate ability and cut scores, because different methods have distinct features and properties. An important consideration in the application of Bayesian methods relates to the choice of the prior and the potential bias that priors may introduce into estimates.
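The IRT true-score (TS) estimator maps the summed Angoff ratings through the test characteristic curve (TCC): the cut score corresponds to the θ at which the TCC, i.e., the expected summed score, equals the summed rating. A sketch using bisection for a 2PL test (the item parameters and the rating are hypothetical):

```python
import math

def tcc(theta, items):
    """Test characteristic curve: expected summed score for a 2PL test.
    items: list of (a, b) parameter pairs."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in items)

def invert_tcc(target, items, lo=-6.0, hi=6.0):
    """Find the theta at which tcc(theta) equals target by bisection
    (the TCC is strictly increasing in theta)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The resulting θ can then be reported directly or converted to the reporting scale; the Bayesian estimators discussed above would instead pull the estimate toward the prior mean.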

14.
Item response theory (IRT) models can be subsumed under the larger class of statistical models with latent variables. IRT models are increasingly used for the scaling of responses derived from standardized assessments of competencies. The paper summarizes the strengths of IRT in contrast to more traditional techniques as well as in contrast to alternative models with latent variables (e.g., structural equation modeling). Subsequently, specific limitations of IRT, and cases where other methods might be preferable, are outlined.

15.
With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number-correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real-number item scores. The generalized algorithm is distinct from the Lord-Wingersky algorithm in that it explicitly incorporates the task of enumerating all possible unique real-number test scores at each step of the recursion. Some applications of the generalized recursive algorithm, such as IRT test score reliability estimation and IRT proficiency estimation based on summed test scores, are illustrated with a short test under varying scoring schemes for its items.
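The generalized recursion can be sketched directly: carry a map from each attainable summed score to its probability, and extend it one item at a time, letting the score keys be arbitrary real numbers rather than consecutive integers. In this sketch the category probabilities are taken as given at a fixed θ (in practice they come from an IRT model), and the item scores are invented for illustration:

```python
def summed_score_distribution(item_category_probs):
    """item_category_probs: one dict per item mapping a real-number item
    score to its probability at a fixed theta.  Returns a dict mapping
    each attainable summed score to its conditional probability, built by
    the generalized Lord-Wingersky recursion: extend the running score
    distribution one item at a time."""
    dist = {0.0: 1.0}
    for cat in item_category_probs:
        new_dist = {}
        for total, p_total in dist.items():
            for score, p in cat.items():
                s = round(total + score, 10)   # merge float duplicates
                new_dist[s] = new_dist.get(s, 0.0) + p_total * p
        dist = new_dist
    return dist
```

With integer 0/1 scores this reduces to the original Lord-Wingersky recursion; real-number scores simply enlarge the set of keys the recursion must track.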

16.
Bock, Muraki, and Pfeiffenberger (1988) proposed a dichotomous item response theory (IRT) model for the detection of differential item functioning (DIF), and they estimated the IRT parameters and the means and standard deviations of the multiple latent trait distributions. This IRT DIF detection method is extended to the partial credit model (Masters, 1982; Muraki, 1993) and presented as one of the multiple-group IRT models. Uniform and non-uniform DIF items and heterogeneous latent trait distributions were used to generate polytomous responses of multiple groups. The DIF method was applied to these simulated data using a stepwise procedure. The standardized DIF measures for slope and item location parameters successfully detected the non-uniform and uniform DIF items as well as recovered the means and standard deviations of the latent trait distributions. This stepwise DIF analysis based on the multiple-group partial credit model was then applied to the National Assessment of Educational Progress (NAEP) writing trend data.

17.
This study investigated the psychometric characteristics of constructed-response (CR) items referring to choice and non-choice passages administered to students in Grades 3, 5, and 8. The items were scaled using item response theory (IRT) methodology. The results indicated no consistent differences in the difficulty and discrimination of the items referring to the two types of passages. On average, students' scale scores on the choice and non-choice passages were comparable. Finally, the choice passages differed in terms of overall popularity and in their attractiveness to different gender and ethnic groups.

18.
Large-scale assessments such as the Programme for International Student Assessment (PISA) have field trials where new survey features are tested for utility in the main survey. Because of resource constraints, there is a trade-off between how much of the sample can be used to test new survey features and how much can be used for the initial item response theory (IRT) scaling. Utilizing real assessment data from the PISA 2015 Science assessment, this article demonstrates that using fixed item parameter calibration (FIPC) in the field trial yields stable item parameter estimates in the initial IRT scaling for samples as small as n = 250 per country. Moreover, the results indicate that for the recovery of the country-specific latent trait distributions, the estimates of the trend items (i.e., the information introduced into the calibration) are crucial. Thus, given the country-level sample size of n = 1,950 currently used in the PISA field trial, FIPC is useful for increasing the number of survey features that can be examined during the field trial without the need to increase the total sample size. This enables international large-scale assessments such as PISA to keep up with state-of-the-art developments regarding assessment frameworks, psychometric models, and delivery platform capabilities.
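The core idea of FIPC can be sketched as follows: the trend items' parameters are held fixed at their previously calibrated values, and only the group's latent mean and standard deviation are chosen to maximize the marginal likelihood of the observed responses. This toy version uses a 2PL, a discretized normal prior, and a coarse grid search (the data, grid, and item parameters are all illustrative; operational PISA scaling is far more elaborate):

```python
import numpy as np

def group_mll(resp, a, b, mu, sigma, nodes=61):
    """Marginal log-likelihood of a 0/1 response matrix under a 2PL with
    fixed item parameters (a, b) and a discretized N(mu, sigma^2) prior."""
    theta = np.linspace(-5, 5, nodes)
    w = np.exp(-0.5 * ((theta - mu) / sigma) ** 2)
    w /= w.sum()
    P = 1 / (1 + np.exp(-a * (theta[:, None] - b)))           # nodes x items
    logL = resp @ np.log(P).T + (1 - resp) @ np.log(1 - P).T  # persons x nodes
    return float(np.log((np.exp(logL) * w).sum(axis=1)).sum())

def estimate_group(resp, a, b):
    """Grid search for the group mean and SD with item parameters held
    fixed -- the core idea behind fixed item parameter calibration."""
    grid = [(m, s)
            for m in np.linspace(-1.0, 1.0, 21)
            for s in np.linspace(0.6, 1.4, 9)]
    return max(grid, key=lambda ms: group_mll(resp, a, b, *ms))
```

Because the fixed (trend) item parameters carry the scale, the recovered mean and SD are expressed on that common scale, which is what makes cross-country and cross-cycle comparisons possible.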

19.
Equating and vertical scaling both serve to establish relationships between scores from different tests. Equating is applied to test forms of the same nature administered at the same grade level, whereas vertical scaling is applied to tests of similar nature across different grade levels. Vertical scaling places the performance of different grades onto a single growth score scale. A vertical scale is an extended score scale whose metric spans and links grade levels, used to assess students' continuous achievement growth (Nitko, 2004). In instruction, vertical scales can be used to monitor and evaluate student progress; in educational research, they are a powerful tool for longitudinal studies. This article discusses the methodology of vertical scaling, including the definition of growth, data collection designs, test design, and methods based on Item Response Theory (IRT), and offers some practical recommendations for constructing vertical scales.

20.
The self-images of 49 adolescents with learning disabilities (Grades 9 through 12, mean age = 15.9) and 49 normally achieving peers (Grades 9 through 12, mean age = 16.0) were compared using the Offer Self-Image Questionnaire (OSIQ). The group with LD scored significantly lower than the comparison group on 4 of the 10 OSIQ scales. Later-diagnosed adolescents with LD scored significantly higher than early-diagnosed adolescents on two of the scales. Severity of the learning disability was not found to be related to self-image scores. In a second study, parental perceptions of the self-images of 28 of the 49 students with LD were studied by administering the Offer Parent-Adolescent Questionnaire (OPAQ) and an informal questionnaire to their parents. On 6 of the 10 OPAQ scales, parents perceived their children as having a lower self-image than the adolescents themselves reported. Significant but moderate relationships were found between parents' perceptions and adolescent self-image scores. Results of the two studies are interpreted in terms of a multidimensional conception of self-image that considers factors inherent to the individual as well as interpersonal and institutional factors.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号