The sampling procedures were designed so that the full matrix of item variances and covariances could be estimated. Three subtest sizes were investigated: five, nine, and sixteen items. In each of these implementations, a double cross-validation was used, yielding two predicted scores for each individual. Discrepancy measures were also computed, showing the difference between the observed and the predicted scores. The prediction of individual scores was accomplished within various ranges of error. The correlations between predicted scores and observed scores ranged from the .70s to the .90s, depending on the number of predictor variables used. The procedure is applicable in situations in which large numbers of individuals are tested or in situations where multiple measures are taken.

This study describes the development of two instruments to investigate the extent to which students are engaged in scientific inquiry. As a result of the instrument development process employed, each finalized instrument consisted of 20 items separated into five categories. Both instruments were found to be internally consistent, with high reliability estimates. Factor analysis showed two factors for each instrument that, while not clustering the items into the five categories, did show item clustering that is consistent with research literature about students’ engagement in inquiry experiences. Based on the analyses completed, the instruments appear to be useful components of comprehensive assessment packages for assessing the extent to which students are experiencing inquiry in science classrooms.

This article investigates the effect of the number of item response categories on chi-square statistics for confirmatory factor analysis to assess whether a greater number of categories increases the likelihood of identifying spurious factors, as previous research had concluded. Four types of continuous single-factor data were simulated for a 20-item test: (a) uniform for all items, (b) symmetric unimodal for all items, (c) negatively skewed for all items, or (d) negatively skewed for 10 items and positively skewed for 10 items. For each of the 4 types of distributions, item responses were divided to yield item scores with 2, 4, or 6 categories. The results indicated that the chi-square statistic for evaluating a single-factor model was most inflated (suggesting spurious factors) for 2-category responses and became less inflated as the number of categories increased. However, the Satorra-Bentler scaled chi-square tended not to be inflated even for 2-category responses, except if the continuous item data had both negatively and positively skewed distributions.
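
The categorization step described above can be sketched in a few lines. This is only an illustration: the abstract does not say where the cut points were placed, so the equal-width thresholds, the function name `categorize`, and the `low`/`high` bounds are all assumptions made for the example.

```python
from bisect import bisect_right

def categorize(values, n_categories, low, high):
    """Map continuous values to ordered categories 0..n_categories-1.

    Assumption: equal-width thresholds between low and high. The study
    may have used different cut points.
    """
    width = (high - low) / n_categories
    # n_categories - 1 interior thresholds
    thresholds = [low + width * i for i in range(1, n_categories)]
    # bisect_right counts how many thresholds each value meets or exceeds
    return [bisect_right(thresholds, v) for v in values]

continuous = [0.1, 0.3, 0.6, 0.9]
two_cat = categorize(continuous, 2, 0.0, 1.0)
four_cat = categorize(continuous, 4, 0.0, 1.0)
```

Applying the same function with 2, 4, and 6 categories to one continuous data set yields the coarser and finer versions of each item that such a comparison needs.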

What is the extent of error likely with each of several approximations for the standard deviation, internal consistency reliability, and the standard error of measurement? To help answer this question, approximations were compared with exact statistics obtained on 85 different classroom tests constructed and administered by professors in a variety of fields; means and standard deviations of the resulting differences supported the use of approximations in practical situations. Results of this analysis (1) suggest a greater number of alternative formulas that might be employed, and (2) provide additional information concerning the accuracy of approximations with non-normal distributions.

Cross-cultural comparisons of latent variable means demand equivalent loadings and intercepts or thresholds. Although equivalence generally emphasizes items as originally designed, researchers sometimes modify response options in categorical items. For example, substantive research interests drive decisions to reduce the number of item categories. Further, categorical multiple-group confirmatory factor analysis (MG-CFA) methods generally require that the number of indicator categories be equal across groups; however, categories with few observations in at least one group can cause challenges. In the current paper, we examine the impact of collapsing ordinal response categories in MG-CFA. An empirical analysis and a complementary simulation study suggested meaningful impacts on model fit due to collapsing categories. We also found reduced scale reliability, measured as a function of Fisher’s information. Our findings further illustrated artifactual fit improvement, pointing to the possibility of data dredging for improved model-data consistency in challenging invariance contexts with large numbers of groups.

A brief behavior rating scale consisting of 28 items divided into 7 categories was developed for use in a school setting. Reliability coefficients for each of the categories ranged from .79 to .91; total reliability was .92. Test validity was based upon the successful discrimination between neurologically impaired, socially maladjusted, emotionally handicapped, and normal children.

This study compares the Rasch item fit approach for detecting multidimensionality in response data with principal component analysis without rotation using simulated data. The data in this study were simulated to represent varying degrees of multidimensionality and varying proportions of items representing each dimension. Because the requirement of unidimensionality is necessary to preserve the desirable measurement properties of Rasch models, useful ways of testing this requirement must be developed. The results of the analyses indicate that both the principal component approach and the Rasch item fit approach work in a variety of multidimensional data structures. However, each technique is unable to detect multidimensionality in certain combinations of the level of correlation between the two variables and the proportion of items loading on the two factors. In cases where the intention is to create a unidimensional structure, one would expect few items to load on the second factor and the correlation between the factors to be high. The Rasch item fit approach detects dimensionality more accurately in these situations.

To ensure the validity of statistical results, model-data fit must be evaluated for each item. In practice, certain actions or treatments are needed for misfit items. If all misfit items are treated, much item information would be lost during calibration. On the other hand, if only severely misfit items are treated, the inclusion of the remaining misfit items may invalidate the statistical inferences based on the estimated item response models. Hence, given response data, one has to find a balance between treating too few and too many misfit items. In this article, misfit items are classified into three categories based on the extent of misfit. Accordingly, three different item treatment strategies are proposed for determining which categories of misfit items should be treated. The impact of using different strategies is investigated. The results show that the test information functions obtained under different strategies can be substantially different in some ability ranges.

Using Muraki's (1992) generalized partial credit IRT model, polytomous items (responses to which can be scored as ordered categories) from the 1991 field test of the NAEP Reading Assessment were calibrated simultaneously with multiple-choice and short open-ended items. Expected information of each type of item was computed. On average, four-category polytomous items yielded 2.1 to 3.1 times as much IRT information as dichotomous items. These results provide limited support for the ad hoc rule of weighting k-category polytomous items the same as k - 1 dichotomous items for computing total scores. Polytomous items provided the most information about examinees of moderately high proficiency; the information function peaked at 1.0 to 1.5, and the population distribution mean was 0. When scored dichotomously, information in polytomous items sharply decreased, but they still provided more expected information than did the other response formats. For reference, a derivation of the information function for the generalized partial credit model is included.
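
As a rough illustration of the quantity discussed above, the following sketch computes category probabilities and item information for a single generalized partial credit item. It uses the standard form of the model with integer category scores 0..m, under which information equals the squared slope times the variance of the category score. The function names, the parameter names (`a`, `steps`), and the default scaling constant `D=1.0` are assumptions for the example, not details taken from the article.

```python
import math

def gpcm_probs(theta, a, steps, D=1.0):
    """Category probabilities for one GPCM item.

    steps: step parameters b_1..b_m (hypothetical values in the usage below).
    """
    # z_0 = 0, z_k = sum over v <= k of D * a * (theta - b_v)
    z = [0.0]
    for b in steps:
        z.append(z[-1] + D * a * (theta - b))
    m = max(z)  # subtract the maximum for numerical stability
    expz = [math.exp(v - m) for v in z]
    total = sum(expz)
    return [e / total for e in expz]

def gpcm_information(theta, a, steps, D=1.0):
    """Item information: (D*a)^2 times the variance of the category score."""
    p = gpcm_probs(theta, a, steps, D)
    mean = sum(k * pk for k, pk in enumerate(p))
    second = sum(k * k * pk for k, pk in enumerate(p))
    return (D * a) ** 2 * (second - mean ** 2)

# Hypothetical items: one dichotomous, one with four categories
info_dich = gpcm_information(0.0, 1.0, [0.0])
info_poly = gpcm_information(0.0, 1.0, [-1.0, 0.0, 1.0])
```

With a single step parameter the function reduces to the familiar two-parameter logistic information a²P(1 − P), which gives a quick sanity check on the polytomous case.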

Four item response theory (IRT) models were compared using data from tests where multiple items were grouped into testlets focused on a common stimulus. In the bi-factor model each item was treated as a function of a primary trait plus a nuisance trait due to the testlet; in the testlet-effects model the slopes in the direction of the testlet traits were constrained within each testlet to be proportional to the slope in the direction of the primary trait; in the polytomous model the item scores were summed into a single score for each testlet; and in the independent-items model the testlet structure was ignored. Using the simulated data, reliability was overestimated somewhat by the independent-items model when the items were not independent within testlets. Under these nonindependent conditions, the independent-items model also yielded greater root mean square error (RMSE) for item difficulty and underestimated the item slopes. When the items within testlets were instead generated to be independent, the bi-factor model yielded somewhat higher RMSE in difficulty and slope. Similar differences between the models were illustrated with real data.

Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers though fewer than the number of items.

In this research, a survey method utilising questionnaires and focus group interviews was employed to determine correlations between students’ learning styles and each of the presences in the CoI Framework across disciplines as well as students’ blended learning experience. To this end, linear regression was used to explore the correlation between each of the presences and the learning styles after controlling for the disciplines. In addition, a three-way cross-tabulation with chi-square statistics was used to examine variations in the students’ experience of blended learning across disciplines and learning styles. A total of 12 lecturers and 377 students from three private institutions were involved in this study. The results show that among the four discipline categories, only the soft-applied has a significant effect on the linear regression model. In this particular discipline, the Kinesthetic variable alone has a significant effect on all the three presences in the CoI Framework. The R-squared values are rather small. Further investigation should be directed towards an inclusion of a larger number of postgraduate participants, more courses in the soft-pure and hard-pure categories, and the learning styles of lecturers.

Two personality scales, each consisting of twenty-two items with Likert-type response options, were administered to sixty-one graduate students. The items were keyed by three methods: (a) Likert weighting, (b) unit weighting, dichotomizing between the agree and disagree options, and (c) unit weighting, dichotomizing as close as possible to p = .50. Internal consistency reliability coefficients were computed for both scales, keyed by the three methods. For both scales, keying method (c) resulted in reliability coefficients as high as those obtained with method (a). Method (b), however, effected a drastic drop in reliability in the scale with many highly skewed item response distributions. It was concluded that multiple-response personality items with highly skewed response distributions can be useful in a unit-weighted keying system if they are dichotomized close to p = .50 rather than between the agree and disagree responses.
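
Dichotomizing "as close as possible to p = .50" can be read as choosing, for each item, the cut point whose keyed proportion is nearest one half. The sketch below illustrates that reading; the function names, the coding of options as integers 1..n_options, and the sample responses are assumptions made for the example rather than details from the study.

```python
def best_cut(responses, n_options):
    """Return the cut c (score 1 if response >= c) whose keyed
    proportion is closest to .50. Ties keep the lowest cut."""
    n = len(responses)
    best_c, best_gap = None, None
    for c in range(2, n_options + 1):
        p = sum(r >= c for r in responses) / n
        gap = abs(p - 0.5)
        if best_gap is None or gap < best_gap:
            best_c, best_gap = c, gap
    return best_c

def dichotomize(responses, cut):
    """Unit-weight an item: 1 at or above the cut, 0 below it."""
    return [1 if r >= cut else 0 for r in responses]

# Hypothetical, highly skewed responses on a 5-point agreement scale
skewed_item = [5, 5, 5, 4, 4, 4, 3, 5, 4, 5]
cut = best_cut(skewed_item, 5)
keyed = dichotomize(skewed_item, cut)
```

For this skewed item, a split between agree and disagree (a cut at 4) would key nine of ten respondents the same way, whereas the cut nearest p = .50 separates "strongly agree" from everything else and keeps the item informative.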

This study describes the development of an instrument to investigate the extent to which technology is integrated in science instruction in ways aligned to science reform outlined in standards documents. The instrument was developed by: (a) creating items consistent with the five dimensions identified in science education literature, (b) establishing content validity with both national and international content experts, (c) refining the item pool based on content expert feedback, (d) pilot testing the instrument, (e) checking statistical reliability and item analysis, and (f) subsequently refining and finalizing the instrument. The TUSI was administered in a field test across eleven classrooms by three observers, with a total of 33 TUSI ratings completed. The finalized instrument was found to have acceptable inter-rater intraclass correlation reliability estimates. After the final stage of development, the TUSI instrument consisted of 26 items separated into the original five categories, which aligned with the exploratory factor analysis clustering of the items. Additionally, concurrent validity of the TUSI was established with the Reformed Teaching Observation Protocol. Finally, a subsequent set of 17 different classrooms was observed during the spring of 2011, and for the 9 classrooms where technology integration was observed, an overall Cronbach alpha reliability coefficient of 0.913 was found. Based on the analyses completed, the TUSI appears to be a useful instrument for measuring how technology is integrated into science classrooms and is seen as one mechanism for measuring the intersection of technological, pedagogical, and content knowledge in science classrooms.

The purpose of this study was to describe the development and validation of an instrument on Student Perceptions of Teachers' Knowledge (SPOTK) in relation to their pedagogy. Features of teachers' knowledge from the research literature related to instruction, representation, subject matter knowledge, and knowledge of how to assess students' understanding were used to generate categories in the SPOTK. The result of a pilot study with 634 Taiwanese junior high school students showed high reliability of the scales and a good factor structure, and suggested weak items to delete. In the main study, for which nine to ten items under each category were generated making a total of 37 items in the SPOTK, the instrument was administered to 1879 Taiwanese and 1081 Australian junior high school students varying in grades, sex and ability levels. Reliability and validity measures of the instrument were established based on Cronbach alpha and factor analysis. After the validating process, 28 items remained in the final instrument and reliabilities of the scales ranged from 0.82 to 0.97. Comments are made about the differences between Australian and Taiwanese students' responses, along with suggestions for using the instrument in future research.
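
Several of these abstracts report Cronbach alpha as the reliability estimate. For readers who want to see the computation, here is a minimal sketch of the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name and the toy score matrices are assumptions for illustration and do not reproduce any data from the studies.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """scores: list of respondent rows, each a list of k item scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    # Variance of each item across respondents
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the respondents' total scores
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: three perfectly consistent items, then two unrelated items
alpha_consistent = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
alpha_unrelated = cronbach_alpha([[1, 0], [0, 1], [1, 1], [0, 0]])
```

Population variances are used for both the items and the totals; sample variances would give the same alpha because the correction factor cancels in the ratio.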

This paper presents the item and test information functions of the Rank two-parameter logistic models (Rank-2PLM) for items with two (pair) and three (triplet) statements in forced-choice questionnaires. The Rank-2PLM model for pairs is the MUPP-2PLM (Multi-Unidimensional Pairwise Preference) and, for triplets, is the Triplet-2PLM. Fisher's information and directional information are described, and the test information for Maximum Likelihood (ML), Maximum A Posteriori (MAP), and Expected A Posteriori (EAP) trait score estimates is distinguished. Expected item/test information indexes at various levels are proposed and plotted to provide diagnostic information on items and tests. The expected test information indexes for EAP scores may be difficult to compute due to a typical test's vast number of item response patterns. The relationships of item/test information with discrimination parameters of statements, standard error, and reliability estimates of trait score estimates are discussed and demonstrated using real data. Practical suggestions for checking the various expected item/test information indexes and plots are provided.

Autobiographical narratives (N = 97) of guilt and shame experiences were analysed to determine how the nature of emotion and context relate to ways of coping in such situations. The coding categories were created by content analysis, and the connections between categories were analysed with optimal scaling and log-linear analysis. Two theoretical perspectives were tested: the view that shame generally is a more maladaptive emotion than guilt, and the view that in situations where responsibility is ambiguous, both guilt and shame feelings are likely to be maladaptive. In line with the latter, chronic rumination was more likely to occur in situations where responsibility was ambiguous compared to situations where the respondent's responsibility was clear, regardless of emotion. In addition, reparative behaviour was less frequently reported in shame situations than in situations where the respondent felt guilty or both guilty and ashamed. The findings supported the view that the nature of emotional reaction and the nature of the situation both affect the ways of coping.

This Monte Carlo simulation study investigated different strategies for forming product indicators for the unconstrained approach in analyzing latent interaction models when the exogenous factors are measured by unequal numbers of indicators under both normal and nonnormal conditions. Product indicators were created by (a) multiplying parcels of the larger scale by items of the smaller scale, and (b) matching items according to reliability to create several product indicators, ignoring those items with lower reliability. Two scaling approaches were compared where parceling was not involved: (a) fixing the factor variances, and (b) fixing 1 loading to 1 for each factor. The unconstrained approach was compared with the latent moderated structural equations (LMS) approach. Results showed that under normal conditions, the LMS approach was preferred because the biases of its interaction estimates and associated standard errors were generally smaller, and its power was higher than that of the unconstrained approach. Under nonnormal conditions, however, the unconstrained approach was generally more robust than the LMS approach. It is recommended to form product indicators by using items with higher reliability (rather than parceling) in the matching and then to specify the model by fixing 1 loading of each factor to unity when adopting the unconstrained approach.

This article used the multidimensional random coefficients multinomial logit model to examine the construct validity and detect substantial differential item functioning (DIF) in the Chinese version of the Motivated Strategies for Learning Questionnaire (MSLQ-CV). A total of 1,354 Hong Kong junior high school students were administered the MSLQ-CV. The partial credit model showed better goodness of fit than the rating scale model. Five items with substantial gender or grade DIF were removed from the questionnaire, and the correlations between the subscales indicated that the factors of cognitive strategy use and self-regulation were very highly correlated, which suggested that the two factors could be combined. The test reliability analysis showed that the test anxiety subscale had lower reliability than the other factors. Finally, the item difficulty and step parameters for the modified 39-item questionnaire were displayed. The order of the step difficulty estimates for some items implied that some grouping of categories might be required in the case of overlapping. Based on these findings, the directions for future research were discussed.

