1.

Two IRT Fixed Parameter Calibration Methods for the Bifactor Model

Kyung Yong Kim 《Journal of Educational Measurement》2020,57(1):29-50

2.

DIF Detection and Effect Size Measures for Polytomously Scored Items

Seock-Ho Kim Allan S. Cohen Cigdem Alagoz Sukwoo Kim 《Journal of Educational Measurement》2007,44(2):93-116

Data from a large-scale performance assessment (N = 105,731) were analyzed with five differential item functioning (DIF) detection methods for polytomous items to examine the congruence among the DIF detection methods. Two different versions of the item response theory (IRT) model-based likelihood ratio test, the logistic regression likelihood ratio test, the Mantel test, and the generalized Mantel–Haenszel test were compared. Results indicated some agreement among the five DIF detection methods. Because statistical power is a function of the sample size, the DIF detection results from extremely large data sets are not practically useful. As alternatives to the DIF detection methods, four IRT model-based indices of standardized impact and four observed-score indices of standardized impact for polytomous items were obtained and compared with the R² measures of logistic regression.

3.

Using an Approximate Chi-Square Statistic to Test the Number of Dimensions Underlying the Responses to a Set of Items

Marc E. Gessaroli André F. De Champlain 《Journal of Educational Measurement》1996,33(2):157-179

An approximate χ² statistic based on McDonald's (1967) nonlinear factor analytic representation of item response theory was proposed and investigated with simulated data. The results were compared with Stout's T statistic (Nandakumar & Stout, 1993; Stout, 1987). Unidimensional and two-dimensional item response data were simulated under varying levels of sample size, test length, test reliability, and dimension dominance. The approximate χ² statistic had good control over Type I errors when unidimensional data were generated and displayed very good power in identifying the two-dimensional data. The performance of the approximate χ² was at least as good as Stout's T statistic in all conditions and was better than Stout's T statistic with smaller sample sizes and shorter tests. Further implications regarding the potential use of nonlinear factor analysis and the approximate χ² in addressing current measurement issues are discussed.

4.

Associations among Attachment Classifications of Mothers, Fathers, and Their Infants (total citations: 2; self-citations: 0; citations by others: 2)

Howard Steele Miriam Steele Peter Fonagy 《Child development》1996,67(2):541-555

Associations are reported among classifications of Adult Attachment Interviews (AAIs) obtained from expectant parents and subsequent classifications of their infants in the Strange Situation Procedure (SSP). Mothers' AAIs predicted infant-mother SSPs (χ² = 41.87, N = 96, df = 9, p ≤ .0001), and fathers' AAIs predicted infant-father SSPs (χ² = 18.94, N = 90, df = 6, p ≤ .005). Associations between parents' AAIs and infant-parent SSPs were lessened by the failure to predict the insecure-resistant pattern with mother and the absence of this pattern with father. Counter to expectation, infant-father SSPs were associated with infant-mother SSPs (χ² = 3.78, N = 90, df = 1, p ≤ .05), which could not be accounted for in terms of an overlap between parental AAIs. A secondary analysis of the data suggested that this dependency effect of SSPs may be explained by the influence of maternal AAIs upon child-father SSPs. Results are discussed in terms of intergenerational and relationship-specific influences upon attachment during infancy, the possible influence of infant temperament, and the relative influence of mother and father upon the child's evolving representations of attachments within the family.

5.

Application of IRT Fixed Parameter Calibration to Multiple-Group Test Data

Seonghoon Kim Michael J. Kolen 《Applied Measurement in Education》2013,26(4):310-324

ABSTRACT

In applications of item response theory (IRT), fixed parameter calibration (FPC) has been used to estimate the item parameters of a new test form on the existing ability scale of an item pool. The present paper presents an application of FPC to multiple examinee groups test data that are linked to the item pool via anchor items, and investigates the performance of FPC relative to an alternative approach, namely independent 0–1 calibration and scale linking. Two designs for linking to the pool are proposed that involve multiple groups and test forms, for which multiple-group FPC can be effectively used. A real-data study shows that the multiple-group FPC method performs similarly to the alternative method in estimating ability distributions and new item parameters on the scale of the item pool. In addition, a simulation study shows that the multiple-group FPC method performs nearly equally to or better than the alternative method in recovering the underlying ability distributions and the new item parameters.

6.

Empathy, Emotional Expressiveness, and Prosocial Behavior (total citations: 6; self-citations: 1; citations by others: 6)

William Roberts Janet Strayer 《Child development》1996,67(2):449-470

Relations between emotional expressiveness, empathy, and prosocial behaviors are important for theoretical and practical reasons. In this study, all 3 areas were assessed across methods and sources. Emotional expressiveness and empathy were evaluated in 73 children in 3 age groups (5-, 9-, and 13-year-olds) by measuring facial and verbal responses to emotionally evocative videotapes and by ratings from best friends, parents, and teachers. Measures of emotional insight and role taking were also obtained. Prosocial behaviors were assessed by 3 laboratory tasks and by ratings from best friends, parents, and teachers. Confirming expectations, latent variable path analyses (Lohmöller, 1984) indicated that emotional expressiveness, emotional insight, and role taking were strong predictors of latent empathy (multiple R² = .60). Boys' empathy, in turn, was a strong predictor of prosocial behavior, R² = .55. In contrast, girls' empathy was related to prosocial behaviors with friends, R² = .13, but not to cooperation with peers. Thus present findings provide important support and clarification for certain theoretical expectations, and also raise issues that need clarification.

7.

Linking item parameters to a base scale (total citations: 1; self-citations: 0; citations by others: 1)

Taehoon Kang Nancy S. Petersen 《Asia Pacific Education Review》2012,13(2):311-321

This paper compares three methods of item calibration (concurrent calibration, separate calibration with linking, and fixed item parameter calibration) that are frequently used for linking item parameters to a base scale. Concurrent and separate calibrations were implemented using BILOG-MG. The Stocking and Lord (Appl Psychol Measure 7:201–210, 1983) characteristic curve method of parameter linking was used in conjunction with separate calibration. The fixed item parameter calibration (FIPC) method was implemented using both BILOG-MG and PARSCALE because the method is carried out differently by the two programs. Both programs use multiple EM cycles, but BILOG-MG does not update the prior ability distribution during FIPC calibration, whereas PARSCALE updates the prior ability distribution multiple times. The methods were compared using simulations based on actual testing program data, and results were evaluated in terms of recovery of the underlying ability distributions, the item characteristic curves, and the test characteristic curves. Factors manipulated in the simulations were sample size, ability distributions, and numbers of common (or fixed) items. The results for concurrent calibration and separate calibration with linking were comparable, and both methods showed good recovery results for all conditions. Between the two fixed item parameter calibration procedures, only the appropriate use of PARSCALE consistently provided item parameter linking results similar to those of the other two methods.

8.

Female-Teacher Gender and Sexuality in Twentieth-Century Ontario, Canada (total citations: 2; self-citations: 0; citations by others: 2)

Sheila L. Cavanagh 《History of education quarterly》2005,45(2):247-273

[The Romans] created the cult of the Vestal Virgins, high-minded priestesses of the goddess Vesta, Guardian Angel of Mankind and Keeper of the Hearth. These priestesses were educated in special normal training schools, were forbidden to marry, were subjected to drastic moral codes, and were accorded social position of preeminence.¹
Spinster teachers were hired so frequently in the late nineteenth and early twentieth centuries that they eventually became an important part of the cultural landscape.²
Single women seem forever to unnerve, anger and unwittingly scare large swaths of the population, both female and male.³

9.

Restricting a Familiar Name in Response to Learning a New One: Evidence for the Mutual Exclusivity Bias in Young Two-Year-Olds

William E. Merriman Colleen M. Stevenson 《Child development》1997,68(2):211-228

Children under 2½ years old tend to interpret novel words in accordance with the Mutual Exclusivity Principle, but tend not to reinterpret familiar words this way. Because alternative principles have been proposed that only predict the novel word effects, and because tests of the familiar word effects may have been flawed, a new test was administered. In Experiment 1 (N = 32), 24- to 25-month-olds heard stories in which a novel noun was used for an atypical exemplar of a familiar noun. When asked to select exemplars of the familiar noun, they showed a small but reliable tendency to avoid the object from the story. In Experiment 2 (N = 16), the novel nouns in the stories were replaced by pronouns and proper names, and the children did not avoid the story object in the test of the familiar noun. Thus, the aversion to this object that was observed in Experiment 1 was not due to its greater exposure or its being referenced immediately before testing, but to toddlers' Mutual Exclusivity bias. Their bias is hypothesized to be a form of implicit probabilistic knowledge that derives from the competitive nature of category retrieval.

10.

Testing Features of Graphical DIF: Application of a Regression Correction to Three Nonparametric Statistical Tests

Daniel M. Bolt Mark J. Gierl 《Journal of Educational Measurement》2006,43(4):313-333

Inspection of differential item functioning (DIF) in translated test items can be informed by graphical comparisons of item response functions (IRFs) across translated forms. Due to the many forms of DIF that can emerge in such analyses, it is important to develop statistical tests that can confirm various characteristics of DIF when present. Traditional nonparametric tests of DIF (Mantel-Haenszel, SIBTEST) are not designed to test for the presence of nonuniform or local DIF, while common probability difference (P-DIF) tests (e.g., SIBTEST) do not optimize power in testing for uniform DIF, and thus may be less useful in the context of graphical DIF analyses. In this article, modifications of three alternative nonparametric statistical tests for DIF, Fisher's χ² test, Cochran's Z test, and Goodman's U test (Marascuilo & Slaughter, 1981), are investigated for these purposes. A simulation study demonstrates the effectiveness of a regression correction procedure in improving the statistical performance of the tests when using an internal test score as the matching criterion. Simulation power and real data analyses demonstrate the unique information provided by these alternative methods compared to SIBTEST and Mantel-Haenszel in confirming various forms of DIF in translated tests.

11.

Data Sparseness and On-Line Pretest Item Calibration-Scaling Methods in CAT

Jae-Chun Ban Bradley A. Hanson Qing Yi Deborah J. Harris 《Journal of Educational Measurement》2002,39(3):207-218

The purpose of this study was to compare and evaluate three on-line pretest item calibration-scaling methods (the marginal maximum likelihood estimate with one expectation maximization [EM] cycle [OEM] method, the marginal maximum likelihood estimate with multiple EM cycles [MEM] method, and Stocking's Method B) in terms of item parameter recovery when the item responses to the pretest items in the pool are sparse. Simulations of computerized adaptive tests were used to evaluate the results yielded by the three methods. The MEM method produced the smallest average total error in parameter estimation, and the OEM method yielded the largest total error.

12.

A Comparative Study of On-line Pretest Item—Calibration/Scaling Methods in Computerized Adaptive Testing

Jae-Chun Ban Bradley A. Hanson Tianyou Wang Qing Yi Deborah J. Harris 《Journal of Educational Measurement》2001,38(3):191-212

The purpose of this study was to compare and evaluate five on-line pretest item-calibration/scaling methods in computerized adaptive testing (CAT): marginal maximum likelihood estimate with one EM cycle (OEM), marginal maximum likelihood estimate with multiple EM cycles (MEM), Stocking's Method A, Stocking's Method B, and BILOG/Prior. The five methods were evaluated in terms of item-parameter recovery, using three different sample sizes (300, 1000 and 3000). The MEM method appeared to be the best choice among these, because it produced the smallest parameter-estimation errors for all sample size conditions. MEM and OEM are mathematically similar, although the OEM method produced larger errors. MEM also was preferable to OEM, unless the amount of time involved in iterative computation is a concern. Stocking's Method B also worked very well, but it required anchor items that either would increase test lengths or require larger sample sizes depending on test administration design. Until more appropriate ways of handling sparse data are devised, the BILOG/Prior method may not be a reasonable choice for small sample sizes. Stocking's Method A had the largest weighted total error, as well as a theoretical weakness (i.e., treating estimated ability as true ability); thus, there appeared to be little reason to use it.

13.

The Use of Hierarchical Generalized Linear Model for Item Dimensionality Assessment

S. Natasha Beretvas Natasha J. Williams 《Journal of Educational Measurement》2004,41(4):379-395

To assess item dimensionality, the following two approaches are described and compared: hierarchical generalized linear model (HGLM) and multidimensional item response theory (MIRT) model. Two generating models are used to simulate dichotomous responses to a 17-item test: the unidimensional and compensatory two-dimensional (C2D) models. For C2D data, seven items are modeled to load on the first and second factors, θ₁ and θ₂, with the remaining 10 items modeled unidimensionally, emulating a mathematics test with seven items requiring an additional reading ability dimension. For both types of generated data, the multidimensionality of item responses is investigated using HGLM and MIRT. Comparison of HGLM and MIRT's results is possible through a transformation of items' difficulty estimates into probabilities of a correct response for a hypothetical examinee at the mean on θ₁ and θ₂. HGLM and MIRT performed similarly. The benefits of HGLM for item dimensionality analyses are discussed.
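
The transformation mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the parameterization, so the code assumes a compensatory two-dimensional logistic model in slope-intercept form, with both latent means fixed at 0; the function names `p_correct` and `p_at_mean` are hypothetical and are not taken from the article.

```python
import math


def p_correct(theta1, theta2, a1, a2, d):
    """Probability of a correct response under an assumed compensatory
    two-dimensional logistic model: logit = a1*theta1 + a2*theta2 + d."""
    z = a1 * theta1 + a2 * theta2 + d
    return 1.0 / (1.0 + math.exp(-z))


def p_at_mean(d):
    """Probability for a hypothetical examinee at theta1 = theta2 = 0.
    The slopes drop out at the mean, so only the intercept d matters."""
    return p_correct(0.0, 0.0, 1.0, 1.0, d)


if __name__ == "__main__":
    # Illustrative intercepts only; these are not estimates from the study.
    for d in (-1.0, 0.0, 1.0):
        print(d, round(p_at_mean(d), 3))
```

Under this assumption an item whose difficulty is expressed as b rather than as an intercept would use d = -b, which puts estimates from differently parameterized models on a common probability scale.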

14.

"He lives as a Master": Seventeenth-Century Masculinity, Gendered Teaching, and Careers of New England Schoolmasters

Jo Anne Preston 《History of education quarterly》2003,43(3):350-371

You that are men and thoughts of manhood know, Be Just now to the Man who made you so. Martyr'd by Scholars the stabbed Cassian dies, And falls to cursed Lads a Sacrafice. Not so my Cheever, Not by Scholars slain, But Praised and Lov'd, and wished to Life again. Cotton Mather, 1708¹

15.

Joseph Kinmont Hart and Vanderbilt University: Academic Freedom and the Rise and Fall of a Department of Education, 1930–1934

Deron R. Boyles 《History of education quarterly》2003,43(4):571-609

No one can follow the history of academic freedom… without wondering at the fact that any society, interested in the immediate goals of solidarity and self-preservation, should possess the vision to subsidize free criticism and inquiry, and without feeling that the academic freedom we still possess is one of the remarkable achievements of man. At the same time…one cannot but be disheartened by the cowardice and self-deception that frail men use who want to be both safe and free.¹

16.

No End to Equality

RICHARD NORMAN 《Journal of Philosophy of Education》1995,29(3):421-431

John White argues that 'egalitarianism, in education as elsewhere, is a will-o'-the-wisp'.¹ He claims that recent defences of egalitarianism, among which he kindly includes my own along with those of Thomas Nagel and Kai Nielsen, have failed to answer the basic question of why a more equal society should be regarded as valuable. I shall try to show that the positive philosophical commitments contained in his argument may point the way to an answer.

17.

Item Selection and Ability Estimation Procedures for a Mixed-Format Adaptive Test

Tsung-Han Ho Barbara G. Dodd 《Applied Measurement in Education》2013,26(4):305-326

In this study we compared five item selection procedures using three ability estimation methods in the context of a mixed-format adaptive test based on the generalized partial credit model. The item selection procedures used were maximum posterior weighted information, maximum expected information, maximum posterior weighted Kullback-Leibler information, and maximum expected posterior weighted Kullback-Leibler information procedures. The ability estimation methods investigated were maximum likelihood estimation (MLE), weighted likelihood estimation (WLE), and expected a posteriori (EAP). Results suggested that all item selection procedures, regardless of the information functions on which they were based, performed equally well across ability estimation methods. The principal conclusions drawn about the ability estimation methods are that MLE is a practical choice and WLE should be considered when there is a mismatch between pool information and the population ability distribution. EAP can serve as a viable alternative when an appropriate prior ability distribution is specified. Several implications of the findings for applied measurement are discussed.

18.

Alternative Interpretations of Alternative Assessments: Some Validity Issues in Educational Performance Assessments

Lyle F. Bachman 《Educational Measurement》2002,21(3):5-18

The use of alternative assessments has led many researchers to reexamine traditional views of test qualities, especially validity. Because alternative assessments generally aim at measuring complex constructs and employ rich assessment tasks, it becomes more difficult to demonstrate (a) the validity of the inferences we make and (b) that these inferences extrapolate to target domains beyond the assessment itself. An approach to addressing these issues from the perspective of language testing is described. It is then argued that in both language testing and educational assessment we must consider the roles of both language and content knowledge, and that our approach to the design and development of performance assessments must be both construct-based and task-based.¹

19.

Comparison of NOHARM and DETECT in Item Cluster Recovery: Counting Dimensions and Allocating Items

Holmes Finch Brian Habing 《Journal of Educational Measurement》2005,42(2):149-169

This study examines the performance of a new method for assessing and characterizing dimensionality in test data using the NOHARM model, and compares it with DETECT. Dimensionality assessment is carried out using two goodness-of-fit statistics that are compared to reference χ² distributions. A Monte Carlo study is used with item parameters based on a statewide basic skills assessment and the SAT. Other factors that are varied include the correlation among the latent traits, the number of items, the number of subjects, skewness of the latent traits, and the presence or absence of guessing. The performance of the two procedures is judged by the accuracy in determining the number of underlying dimensions, and the degree to which items are correctly clustered together. Results indicate that the new, NOHARM-based method appears to perform comparably to DETECT in terms of simultaneously finding the correct number of dimensions and clustering items correctly. NOHARM is generally better able to determine the number of underlying dimensions, but less able to group items together, than DETECT. When errors in item cluster assignment are made, DETECT is more likely to incorrectly separate items while NOHARM more often incorrectly groups them together.

20.

The Relationship Between Item Parameters and Item Fit

Hamzeh Dodeen 《Journal of Educational Measurement》2004,41(3):261-270

The effect of item parameters (discrimination, difficulty, and level of guessing) on the item-fit statistic was investigated using simulated dichotomous data. Nine tests were simulated using 1,000 persons, 50 items, three levels of item discrimination, three levels of item difficulty, and three levels of guessing. The item fit was estimated using two fit statistics: the likelihood ratio statistic (X²_B), and the standardized residuals (SRs). All the item parameters were simulated to be normally distributed. Results showed that the levels of item discrimination and guessing affected the item-fit values. As the level of item discrimination or guessing increased, item-fit values increased and more items misfit the model. The level of item difficulty did not affect the item-fit statistic.
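
A simulation of the kind this abstract describes could be set up roughly as follows. This is a sketch under stated assumptions, not the author's procedure: it assumes the three-parameter logistic (3PL) model, and the means, standard deviations, clipping bounds, and the names `p_3pl` and `simulate_responses` are all hypothetical. Only the 1,000 persons, 50 items, and normally distributed item parameters come from the abstract.

```python
import math
import random


def p_3pl(theta, a, b, c):
    """3PL item response function: guessing c, discrimination a, difficulty b."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(n_persons=1000, n_items=50, seed=0):
    """Draw normally distributed item parameters and abilities, then
    generate a persons-by-items matrix of 0/1 responses."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        a = max(0.2, rng.gauss(1.0, 0.2))             # assumed mean/SD, clipped to stay positive
        b = rng.gauss(0.0, 1.0)                       # assumed mean/SD
        c = min(0.5, max(0.0, rng.gauss(0.2, 0.05)))  # assumed mean/SD, clipped to [0, 0.5]
        items.append((a, b, c))
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    data = [
        [1 if rng.random() < p_3pl(t, a, b, c) else 0 for (a, b, c) in items]
        for t in thetas
    ]
    return items, thetas, data


if __name__ == "__main__":
    items, thetas, data = simulate_responses()
    print(len(data), len(data[0]))
```

Varying the assumed means of a, b, and c across runs would give the kind of level-by-level design the abstract mentions; the fit statistics themselves are not reproduced here.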