Similar Documents
20 similar documents found (search time: 15 ms)
1.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.
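Under a unidimensional IRT model, the conditional standard error of measurement at a given ability level is the inverse square root of the test information there. A minimal sketch, assuming a 2PL model with made-up item parameters (not the exams analyzed in the article) and one common definition of marginal reliability:

```python
import numpy as np

# Hypothetical 2PL item parameters (a = discrimination, b = difficulty);
# illustrative values, not estimates from the exams in the article.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

def test_information(theta):
    """Fisher information of the 2PL test at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(a**2 * p * (1.0 - p))

# Conditional SEM = inverse square root of test information.
for th in np.linspace(-3, 3, 7):
    csem = 1.0 / np.sqrt(test_information(th))
    print(f"theta = {th:+.1f}   CSEM = {csem:.3f}")

# Marginal reliability under one common definition:
# true variance / (true variance + mean error variance), with var(theta) = 1.
grid = np.linspace(-4, 4, 161)
w = np.exp(-grid**2 / 2)
w /= w.sum()                                   # N(0,1) quadrature weights
mean_err_var = np.sum(w / np.array([test_information(t) for t in grid]))
print("marginal reliability ~", round(1.0 / (1.0 + mean_err_var), 3))
```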

2.
Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non-Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real-data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth-related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinees to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non-Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.
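The contrast between non-Bayes maximum likelihood and Bayesian estimators can be made concrete with a grid computation for a single response pattern; a sketch under an assumed 2PL model with hypothetical parameters (not the data from these examples), showing the shrinkage that drives such differences in score distributions:

```python
import numpy as np

# Hypothetical 2PL parameters and one observed response pattern
# (illustrative only; not the operational data from the examples).
a = np.array([1.0, 1.3, 0.7, 1.1])
b = np.array([-0.8, 0.0, 0.4, 1.2])
x = np.array([1, 1, 0, 1])

grid = np.linspace(-4, 4, 801)

def loglik(theta):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

ll = np.array([loglik(t) for t in grid])

# Non-Bayes maximum likelihood: the ability that maximizes the likelihood.
theta_ml = grid[np.argmax(ll)]

# Bayesian EAP: posterior mean under a standard normal prior.
posterior = np.exp(ll) * np.exp(-grid**2 / 2)
posterior /= posterior.sum()
theta_eap = np.sum(grid * posterior)

print(f"ML  estimate: {theta_ml:+.3f}")
print(f"EAP estimate: {theta_eap:+.3f}   (shrunk toward the prior mean, 0)")
```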

3.
Angus Duff, Educational Psychology, 2004, 24(5): 699–709
Given the psychometric limitations of existing measures of Kolb's experiential learning model (ELM), two new scales of learning styles have been developed. The validity of these scales has been supported in samples of undergraduate and MBA students in the USA. This paper provides evidence of some psychometric properties of scores yielded by these scales using samples of undergraduate students in the UK. Only limited support is found for the internal consistency reliability and construct validity of scores produced by the scales. However, an item attrition exercise identifies a two‐factor solution providing an acceptable fit to the data. The scales are reported as being positively correlated with academic performance and prior academic achievement. Despite the mixed evidence, we suggest further development of the scales is warranted to create a psychometrically sound measure of the ELM.

4.
Touch screen tablets are increasingly used in schools for learning and assessment. However, the validity and reliability of assessments delivered via tablets are largely unknown. The present study tested the psychometric properties of a tablet-based app designed to measure early literacy skills. Tablet-based tests were also compared with traditional paper-based tests. Children aged 2–6 years (N = 99) completed receptive tests delivered via a tablet for letter, word, and numeral skills. The same skills were tested with a traditional paper-based test that used an expressive response format. Children (n = 35) were post-tested 8 weeks later to examine the stability of test scores over time. The tablet test scores showed high internal consistency (all αs > .94), acceptable test-retest reliability (ICC range = .39–.89), and were correlated with child age, family SES, and home literacy teaching, indicating good predictive validity. The agreement between scores for the tablet and traditional tests was high (ICC range = .81–.94). The tablet tests provide valid and reliable measures of children's early literacy skills. The strong psychometric properties and ease of use suggest that tablet-based tests of literacy skills have the potential to improve assessment practices for research purposes and classroom use.

5.
The answer-until-correct (AUC) method of multiple-choice (MC) testing involves test respondents making selections until the keyed answer is identified. Despite attendant benefits that include improved learning, broad student adoption, and facile administration of partial credit, the use of AUC methods for classroom testing has been extremely limited. This study presents scoring properties and item analysis for 26 AUC university course examinations, administered using a commercial scratch-card response system. Here, we show that beyond the traditional pedagogical advantages of AUC, the availability of partial credit adds psychometric advantages by boosting both mean item discrimination and overall test-score reliability, compared to tests scored dichotomously upon initial response. Furthermore, we find a strong correlation between students' initial-response successes and the likelihood that they obtain partial credit when their initial responses are incorrect. Thus, partial credit is granted based on partial knowledge that remains latent in traditional MC tests. The fact that these advantages are realized in real-life classroom tests may motivate further expansion of the use of AUC MC tests in higher education.
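The discrimination gain from AUC partial credit can be illustrated with a toy simulation; the credit scheme and response model below are assumptions made for illustration, not the scoring rule of the commercial scratch-card system:

```python
import numpy as np

# Toy AUC simulation: higher-ability examinees succeed earlier, and, after
# a miss, recover in fewer extra selections (partial knowledge). The credit
# scheme below is hypothetical, not the scratch-card system's actual rule.
rng = np.random.default_rng(0)
n_examinees, n_items = 200, 26
ability = rng.normal(size=(n_examinees, 1))
difficulty = rng.normal(size=(1, n_items))
p = 1.0 / (1.0 + np.exp(-(ability - difficulty)))  # P(correct on first try)

first_try = rng.random((n_examinees, n_items)) < p
extra = rng.binomial(2, 1.0 - p)                   # extra tries after a miss
attempts = np.where(first_try, 1, 2 + extra)       # 1..4 selections per item

credit = np.array([1.0, 0.5, 0.25, 0.0])           # credit by attempt number
partial = credit[attempts - 1]
dichotomous = first_try.astype(float)

def mean_discrimination(scores):
    """Average corrected item-total correlation across items."""
    total = scores.sum(axis=1)
    rs = [np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
          for j in range(scores.shape[1])]
    return float(np.mean(rs))

print("mean discrimination, dichotomous:", round(mean_discrimination(dichotomous), 3))
print("mean discrimination, AUC partial:", round(mean_discrimination(partial), 3))
```

Because later-attempt success is tied to ability in this simulation, the partial-credit scores carry the latent partial knowledge the authors describe, and the corrected item-total correlations rise accordingly.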

6.
ANGUS DUFF, Educational Psychology, 2003, 23(2): 123–139
This investigation: first, examines some psychometric properties of the scores obtained on a 30-item short form of the Revised Approaches to Studying Inventory (RASI) using samples of postgraduate management (MBA) students (n=75); second, examines the relationship between scores on the three dimensions of the RASI and the background variables of age, gender and prior educational experience; third, tests for any relationship between the background variables and academic performance as measured by four distinct types of assessment; and fourth, examines the relationship between scores on the three dimensions of the RASI and academic performance. No previous published work has examined the approaches to learning of postgraduate business students. Key findings include: the instrument has satisfactory psychometric properties; and scores obtained on the RASI using samples of MBA students are good predictors of academic performance in continuous assessment tasks but poor predictors of performance in examinations and oral presentations.

7.
Anatomists often use images in assessments and examinations. This study aims to investigate the influence of different types of images on item difficulty and item discrimination in written assessments. A total of 210 of 460 students volunteered for an extra assessment in a gross anatomy course. This assessment contained 39 test items grouped in seven themes. The answer format alternated per theme and was either a labeled image or an answer list, resulting in two versions containing both images and answer lists. Subjects were randomly assigned to one version. Answer formats were compared through item scores. Both examinations had similar overall difficulty and reliability. Two cross‐sectional images resulted in greater item difficulty and item discrimination, compared to an answer list. A schematic image of fetal circulation led to decreased item difficulty and item discrimination. Three images showed variable effects. These results show that effects on assessment scores are dependent on the type of image used. Results from the two cross‐sectional images suggest an extra ability is being tested. Data from a scheme of fetal circulation suggest a cueing effect. Variable effects from other images indicate that a context‐dependent interaction takes place with the content of questions. The conclusion is that item difficulty and item discrimination can be affected when images are used instead of answer lists; thus, the use of images as a response format has potential implications for the validity of test items.

8.
The study aims to investigate the effects of delivery modalities on psychometric characteristics and student performance on cognitive tests. A first study assessed the inductive reasoning ability of 715 students under the supervision of teachers. A second study examined 731 students' performance on the application of the control-of-variables strategy in basic physics, but without teacher supervision due to the COVID-19 pandemic. Rasch measurement showed that the online format fitted the data better in the unidimensional model across the two conditions. Under teacher supervision, paper-based testing outperformed online testing in terms of reliability and total scores, but the pattern reversed without teacher supervision. Although measurement invariance was confirmed between the two versions at the item level, the differential bundle functioning analysis favored the online groups on item bundles constructed from figure-related materials. Response time was also discussed as an advantage of technology-based assessment for test development.

9.
ADHD is one of the most common referrals to school psychologists and child mental health providers. Although a best practice assessment of ADHD requires more than the use of rating scales, rating scales are one of the primary components in the assessment of ADHD. Therefore, the goal of this paper is to provide the reader with a critical and comparative evaluation of the five most commonly used, narrow‐band, published rating scales for the assessment of ADHD. Reviews were conducted in four main areas: content and use, standardization sample and norms, scores and interpretation, and psychometric properties. It was concluded that the rating scales with the strongest standardization samples and evidence for reliability and validity are the ADDES, the ADHD‐IV, and the CRS‐R. In determining which of these to use, prospective users may want to reflect on their goals for the assessment. The ACTeRS and the ADHDT are not recommended for use because their manuals lack crucial information and their evidence of reliability and validity is less well documented. Conclusions and recommendations for scale usage are discussed.

10.
Using a sample of 908 eleventh-grade science-stream male and female students from schools in similar socioeconomic areas, variance-based psychometric properties of three paper-and-pencil tests of logical thinking (the Longeot test, Lawson's TOFR, and Tobin and Capie's TOLT) are investigated. A sub-sample of 212 students took the three tests in different, randomly allocated orders of presentation, while 696 students took only two tests. Alpha coefficients for each test separately and for the three tests combined, concurrent validity coefficients, measures of item difficulty, item discrimination, item-criterion correlation, and 30-day stability coefficients are calculated. Considering the relative homogeneity of the sample, the reliability coefficients of the tests are judged satisfactory, but the concurrent validity coefficients are quite low, which implies incongruence among decisions made on the basis of the three tests. The need for estimating various psychometric parameters of alternative tests of logical thinking over different grade populations is emphasized.
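The alpha coefficients and classical item statistics reported here are straightforward to compute; a sketch on simulated dichotomous data (for 0/1 items, coefficient alpha reduces to KR-20):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an examinees-by-items score matrix;
    for 0/1 items this is KR-20."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Toy dichotomous responses driven by a latent ability (simulated data,
# not the 908-student sample from the study).
rng = np.random.default_rng(1)
ability = rng.normal(size=(908, 1))
p = 1.0 / (1.0 + np.exp(-(ability - rng.normal(size=20))))
items = (rng.random((908, 20)) < p).astype(int)

print("alpha =", round(cronbach_alpha(items), 3))

# Classical item statistics of the kind reported in the study.
difficulty = items.mean(axis=0)                # proportion correct per item
total = items.sum(axis=1)
discrimination = np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                           for j in range(items.shape[1])])
print("mean item difficulty    :", round(difficulty.mean(), 3))
print("mean item discrimination:", round(discrimination.mean(), 3))
```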

11.
Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.
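Of the methods compared, equipercentile equating is the easiest to sketch. A minimal random-groups version with toy, unsmoothed score distributions; the chained and frequency-estimation variants studied here add an anchor test on top of this basic idea:

```python
import numpy as np

def percentile_ranks(freqs):
    """Mid-percentile ranks at each integer score point."""
    f = freqs / freqs.sum()
    cum = np.cumsum(f)
    return 100.0 * (cum - f / 2)

def equipercentile(freq_x, freq_y):
    """Map each form-X score to the form-Y score with the same percentile
    rank, interpolating linearly between integer score points."""
    pr_x = percentile_ranks(freq_x)
    pr_y = percentile_ranks(freq_y)
    return np.interp(pr_x, pr_y, np.arange(len(freq_y)))

# Toy frequency distributions on two 10-item forms (illustrative only;
# operational equating would use smoothed distributions and, for the
# methods in this study, an anchor-test design).
rng = np.random.default_rng(2)
x = rng.binomial(10, 0.60, 5000)               # form X examinees
y = rng.binomial(10, 0.65, 5000)               # form Y examinees
freq_x = np.bincount(x, minlength=11).astype(float)
freq_y = np.bincount(y, minlength=11).astype(float)
print(np.round(equipercentile(freq_x, freq_y), 2))
```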

12.
When a computerized adaptive testing (CAT) version of a test co-exists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from its P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review research literature on CAT comparability issues and synthesize issues specific to these two settings. A framework of criteria for evaluating comparability was developed that contains the following three categories of criteria: validity criterion, psychometric property/reliability criterion, and statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria as well as various algorithms for improving comparability are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of the development of a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of incomparability may be due to the change from number-correct score-based scoring to IRT ability estimation-based scoring. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size were found to result in different levels of comparability in test scores.
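The scoring change the authors flag, from number-correct scoring to IRT ability estimation, sits inside a simple adaptive loop. A bare-bones CAT sketch with maximum-information selection and EAP scoring on a simulated 2PL pool; the content-balancing and exposure-control constraints the paper manipulates are deliberately omitted:

```python
import numpy as np

# Bare-bones CAT: simulated 2PL pool, maximum-information item selection,
# EAP scoring on a grid. Real systems add content balancing and exposure
# control, which this sketch leaves out.
rng = np.random.default_rng(3)
n_pool = 300
a = rng.uniform(0.5, 2.0, n_pool)              # discriminations
b = rng.normal(0.0, 1.0, n_pool)               # difficulties
grid = np.linspace(-4, 4, 161)
posterior = np.exp(-grid**2 / 2)               # N(0,1) prior (unnormalized)

def prob(theta, j):
    """2PL probability of a correct response to item j at ability theta."""
    return 1.0 / (1.0 + np.exp(-a[j] * (theta - b[j])))

true_theta, used = 0.8, []
for _ in range(20):                            # fixed test length of 20
    theta_hat = np.sum(grid * posterior) / posterior.sum()   # interim EAP
    p_all = prob(theta_hat, np.arange(n_pool))
    info = a**2 * p_all * (1.0 - p_all)        # item information at theta_hat
    info[used] = -np.inf                       # never readminister an item
    j = int(np.argmax(info))                   # maximum-information rule
    used.append(j)
    correct = rng.random() < prob(true_theta, j)             # simulee answers
    posterior *= prob(grid, j) if correct else 1.0 - prob(grid, j)

print("true theta:", true_theta)
print("final EAP :", round(np.sum(grid * posterior) / posterior.sum(), 3))
```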

13.
OBJECTIVE: The goal was to develop a retrospective inventory of parental threatening behavior to facilitate a better understanding of such behavior's role in the etiology of psychological distress. METHOD: Inventory items were developed based on theory and 135 students' responses to a question eliciting examples of threatening parental behavior. Following item development, two additional student samples (n = 200 and n = 603) completed batteries of self-report measures. Responses were used to eliminate unstable or redundant items from the inventory and to examine the inventory's psychometric properties. RESULTS: Factor analysis of the inventory revealed three factors, accounting for 66.2% of variance; this factor structure is compatible with theory, and consistent across maternal behavior scores, paternal behavior scores, and combined maternal and paternal scores. Cronbach's coefficient alphas indicated acceptable internal consistency; Pearson correlation coefficients indicated acceptable 4-week test-retest reliability. Moderate intercorrelations with two retrospective measures of childhood experiences suggested construct validity. Regression analyses demonstrated the ability of the inventory to predict both anxious and depressive symptomatology and lifetime symptoms of anxiety and depressive disorder. Normative data on combined parent scores, maternal scores, and paternal scores are also presented. CONCLUSIONS: Initial psychometric testing of the Parent Threat Inventory (PTI) suggests it is a reliable and valid tool for investigating the developmental antecedents of adult psychological distress. Further research should focus on addressing two limitations: (1) lack of normative and psychometric data on men and women suffering from clinical disorders, and (2) lack of validation by parental reporting.

14.
Instructional sensitivity is the psychometric capacity of tests or single items to capture effects of classroom instruction. Yet current item sensitivity measures' relationship to (a) actual instruction and (b) overall test sensitivity is rather unclear. The present study aims at closing these gaps by investigating test and item sensitivity to teaching quality, reanalyzing data from a quasi-experimental intervention study in primary school science education (1,026 students, 53 classes, mean age = 8.79 years, SD = 0.49, 50% female). We examine (a) the correlation of item sensitivity measures and the potential for cognitive activation in class and (b) consequences for test score interpretation when assembling tests from items varying in their degree of sensitivity to cognitive activation. Our study (a) provides validity evidence that item sensitivity measures may be related to actual classroom instruction and (b) points out that inferences on teaching drawn from test scores may vary due to test composition.

15.
The current study investigated how item formats and their inherent affordances influence test‐takers' cognition under uncertainty. Adult participants solved content‐equivalent math items in multiple‐selection multiple‐choice and four alternative grid formats. The results indicated that participants' affirmative response tendency (i.e., judge the given information as True) was affected by the presence of a grid, type of grid options, and their visual layouts. The item formats further affected the test scores obtained from the alternatives keyed True and the alternatives keyed False, and their psychometric properties. The current results suggest that the affordances rendered by item design can lead to markedly different test‐taker behaviors and can potentially influence test outcomes. They emphasize that a better understanding of the cognitive implications of item formats could potentially facilitate item design decisions for large‐scale educational assessments.

16.
This study investigates the relationships among factor correlations, inter-item correlations, and the reliability estimates of subscores, providing a guideline with respect to psychometric properties of useful subscores. In addition, it compares subscore estimation methods with respect to reliability and distinctness. The subscore estimation methods explored in the current study include augmentation based on classical test theory and multidimensional item response theory (MIRT). The study shows that there is no estimation method that is optimal according to both criteria. Augmented subscores show the most improvement in reliability compared to observed subscores but are the least distinct.
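Augmentation borrows strength from the other subscores via their covariances. A sketch of a Wainer-style linear augmentation on simulated data, using the classical identities Cov(tau1, x1) = rho1 · Var(x1) and Cov(tau1, x2) = Cov(x1, x2); reliabilities are known exactly here only because the true scores are simulated:

```python
import numpy as np

# Simulate two correlated true subscores plus independent measurement error
# (illustrative data, not from the study).
rng = np.random.default_rng(4)
n = 2000
true1, true2 = rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], n).T
x1 = true1 + rng.normal(0, .6, n)              # observed subscore 1
x2 = true2 + rng.normal(0, .6, n)              # observed subscore 2

rho1 = np.var(true1, ddof=1) / np.var(x1, ddof=1)   # reliability of x1
S = np.cov(np.stack([x1, x2]))                 # observed covariance matrix
c = np.array([rho1 * S[0, 0], S[0, 1]])        # Cov(tau1, [x1, x2])
w = np.linalg.solve(S, c)                      # best linear predictor weights
aug1 = x1.mean() + w @ np.stack([x1 - x1.mean(), x2 - x2.mean()])

print("corr(true1, x1)  =", round(np.corrcoef(true1, x1)[0, 1], 3))
print("corr(true1, aug1)=", round(np.corrcoef(true1, aug1)[0, 1], 3))
```

The augmented estimate tracks the true subscore more closely than the observed subscore does, but because it mixes in the other subscore it is also less distinct, matching the trade-off the study reports.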

17.
The comparison of scores from linguistically different tests is a twofold matter: the adaptation of tests and the comparison of scores. These two aspects of measurement invariance intersect at the need to guarantee psychometric equivalence between the original and adapted versions. In this study, the authors examined comparability in two stages. First, they conducted a thorough study of progressive factorial invariance through which they defined an anchor test. Second, they defined an observed-score equating function to establish equivalences between the original test and the adapted test, using a common-item nonequivalent-groups design for this purpose.

18.
The use of surveys, questionnaires, and rating scales to measure important outcomes in higher education is pervasive, but reliability and validity information is often based on problematic Classical Test Theory approaches. Rasch Analysis, based on Item Response Theory, provides a better alternative for examining the psychometric quality of rating scales and informing scale improvements. This paper outlines a six-step process for using Rasch Analysis to review the psychometric properties of a rating scale. The Partial Credit Model and Andrich Rating Scale Model will be described in terms of the psychometric information (i.e., reliability, validity, and item difficulty) and diagnostic indices generated. Further, this approach will be illustrated through the example of authentic data from a university-wide student evaluation of teaching.
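For reference, the Andrich Rating Scale Model assigns category probabilities from cumulative step logits; a sketch with hypothetical threshold values (the Partial Credit Model is the same computation with item-specific rather than shared thresholds):

```python
import numpy as np

def rsm_category_probs(theta, delta, taus):
    """Andrich Rating Scale Model: probabilities of response categories
    0..m for a person at theta on an item at location delta, with category
    thresholds tau_1..tau_m shared by all items on the scale."""
    steps = np.concatenate(([0.0], theta - delta - np.asarray(taus)))
    logits = np.cumsum(steps)                  # cumulative step logits
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

# Hypothetical five-category item (four thresholds), person slightly above
# the item location; all values are made up for illustration.
print(np.round(rsm_category_probs(theta=0.5, delta=0.0,
                                  taus=[-1.5, -0.5, 0.5, 1.5]), 3))
```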

19.
The TEMAS (acronym for Tell‐Me‐a‐Story)—an objectively scored, projective thematic personality instrument for children and adolescents—is analyzed, reviewed, and critiqued with regard to theoretical underpinnings and rationale for development, administration, scoring, psychometric properties, and research to date. The TEMAS appears to be an improvement over existing projective personality measures used by school psychologists. Although it requires more training than other projective techniques, competency in administration, scoring, and interpretation can be achieved within a one semester course in personality assessment. The test has evidence of reliability and validity, and it is a multicultural alternative to the TAT and other thematic apperception instruments. The use of the TEMAS by psychologists may achieve more accurate assessment of Black and Hispanic children. Limitations include geographically limited standardization samples and little research conducted by individuals other than the authors.

20.
Using Rasch analysis, the psychometric properties of a newly developed 35‐item parent‐proxy instrument, the Caregiver Assessment of Movement Participation (CAMP), designed to measure movement participation problems in children with Developmental Coordination Disorder, were examined. The CAMP was administered to 465 school children aged 5–10 years. Thirty of the 35 items were retained, as they had acceptable infit and outfit statistics. Item separation (7.48) and child separation (3.16) were good; moreover, the CAMP had excellent reliability (item reliability index = 0.98; person reliability index = 0.91). Principal components analysis of item residuals confirmed the unidimensionality of the instrument. Based on category probability statistics, the original five‐point scale was collapsed into a four‐point scale. Item threshold calibration of the CAMP against the Movement Assessment Battery for Children Test was computed. The results indicated that a CAMP total score of 75 is the optimal cut‐off point for identifying children at risk of movement problems.
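The infit and outfit statistics used to retain items are mean-square summaries of standardized residuals. A dichotomous-Rasch sketch with simulated data; the CAMP items are polytomous, where the same logic applies with category-level model variances:

```python
import numpy as np

def rasch_fit_statistics(X, theta, b):
    """Outfit and infit mean-squares per item for a dichotomous Rasch model.
    X: persons-by-items 0/1 matrix; theta: abilities; b: item difficulties."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    w = p * (1.0 - p)                          # model variance of each response
    z2 = (X - p) ** 2 / w                      # squared standardized residuals
    outfit = z2.mean(axis=0)                   # unweighted mean-square
    infit = ((X - p) ** 2).sum(axis=0) / w.sum(axis=0)   # information-weighted
    return outfit, infit

# Simulated check: data generated from the model, so both statistics should
# hover near 1 (465 persons and 30-ish items echo the study's dimensions).
rng = np.random.default_rng(5)
theta = rng.normal(size=465)
b = rng.normal(size=30)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random((465, 30)) < p).astype(float)
outfit, infit = rasch_fit_statistics(X, theta, b)
print("outfit range:", np.round([outfit.min(), outfit.max()], 2))
print("infit  range:", np.round([infit.min(), infit.max()], 2))
```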
