Similar Articles
20 similar articles found (search time: 328 ms)
1.
The computerization of reading assessments has presented a set of new challenges to test designers. From the vantage point of measurement invariance, test designers must investigate whether the traditionally recognized causes for violating invariance are still a concern in computer-mediated assessments. In addition, it is necessary to understand the technology-related threats to measurement invariance among test-taking populations. In this study, we used the available data (n = 800) from the previous administrations of the Pearson Test of English Academic (PTE Academic) reading, an international test of English comprising 10 test items, to investigate measurement invariance across gender and the Information and Communication Technology Development index (IDI). We conducted a multi-group confirmatory factor analysis (CFA) to assess invariance at four levels: configural, metric, scalar, and structural. Overall, we were able to confirm structural invariance for the PTE Academic, which is a necessary condition for conducting fair assessments. Implications for computer-based education and the assessment of reading are discussed.
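The four invariance levels named in this abstract (configural, metric, scalar, structural) are typically evaluated as a sequence of nested CFA models compared by chi-square difference tests. As an illustrative sketch only (the fit statistics below are hypothetical, not from the PTE Academic study; general df would use `scipy.stats.chi2.sf`, but for an even df difference the survival function has a simple closed form):

```python
import math

def chi2_sf_even_df(x, df):
    """Chi-square survival function, closed form for EVEN df only
    (chi-square with 2k df is an Erlang distribution)."""
    assert df > 0 and df % 2 == 0
    k = df // 2
    term, total = 1.0, 0.0
    for i in range(k):
        total += term
        term *= (x / 2.0) / (i + 1)
    return math.exp(-x / 2.0) * total

def chi_square_difference(chi2_restricted, df_restricted, chi2_free, df_free):
    """Chi-square difference test for nested invariance models.

    The more restricted model (e.g., metric) is tested against the freer
    model (e.g., configural); a non-significant p-value supports keeping
    the added equality constraints.
    """
    d_chi2 = chi2_restricted - chi2_free
    d_df = df_restricted - df_free
    return d_chi2, d_df, chi2_sf_even_df(d_chi2, d_df)

# Hypothetical fit statistics: configural (freer) vs. metric (restricted)
d_chi2, d_df, p = chi_square_difference(102.4, 76, 95.1, 70)
```

Here a p-value above the chosen alpha would let the analyst proceed from configural to metric invariance and repeat the comparison at the scalar and structural steps.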

2.
Understanding how situational features of assessment tasks impact reasoning is important for many educational pursuits, notably the selection of curricular examples to illustrate phenomena, the design of formative and summative assessment items, and determination of whether instruction has fostered the development of abstract schemas divorced from particular instances. The goal of our study was to employ an experimental research design to quantify the degree to which situational features impact inferences about participants’ understanding of Mendelian genetics. Two participant samples from different educational levels and cultural backgrounds (high school, n = 480; university, n = 444; Germany and USA) were used to test for context effects. A multi-matrix test design was employed, and item packets differing in situational features (e.g., plant, animal, human, fictitious) were randomly distributed to participants in the two samples. Rasch analyses of participant scores from both samples produced good item fit, person reliability, and item reliability and indicated that the university sample displayed stronger performance on the items compared to the high school sample. We found, surprisingly, that in both samples, no significant differences in performance occurred among the animal, plant, and human item contexts, or between the fictitious and “real” item contexts. In the university sample, we were also able to test for differences in performance between genders, among ethnic groups, and by prior biology coursework. None of these factors had a meaningful impact upon performance or context effects. Thus some, but not all, types of genetics problem solving or item formats are impacted by situational features.

3.
This study used a construct map to design a 30-item test in six sections, with spelling, multiple-choice, and short-answer item types. A total of 175 children aged 6 to 14 took the test. Rasch analysis showed that local item dependence within testlets was not severe. Reliability was 0.85, and item difficulty was well matched to examinee ability. Because the items were written from a construct map, the test has a degree of content validity, although the difficulty of 9 items deviated slightly from expectations. Five items did not fit the Rasch model well, and no notable differential item functioning by gender was found. Examinee ability correlated positively with time spent learning English. Finally, technical issues of ICT-based remote computerized adaptive testing are discussed.

4.
Sound assessment tools are needed to evaluate effects of mathematics interventions that familiarize children with early mathematics concepts before they enter the formal school system. We developed a short version of an existing early mathematics tool based on analyses of data collected in a nationally representative Danish sample. Research findings: The Danish adaptation and development process of the Tools for Early Assessment in Math (TEAM) for children aged 3–6 years was carried out in four steps: (a) choosing and translating relevant items, (b) conducting a pilot study, (c) testing items in a representative sample of Danish children aged 3–6 years (n = 5,621), and (d) analyses based on Rasch models. The process resulted in a final 19-item version—the DK-TEAM (final)—that has no differential item functioning relative to age and gender and is sensitive to the full range of abilities. The great majority of the children viewed the test as enjoyable. Practice or Policy: The DK-TEAM (final) appears to be broadly applicable for young Danish children, though the modest reliability at 3 years (which may be remediable by adding easy items) should be kept in mind.

5.
The performance of English language learners (ELLs) has been a concern given the rapidly changing demographics in US K-12 education. This study aimed to examine whether students' English language status has an impact on their inquiry science performance. Differential item functioning (DIF) analysis was conducted with regard to ELL status on an inquiry-based science assessment, using a multifaceted Rasch DIF model. A total of 1,396 seventh- and eighth-grade students took the science test, including 313 ELL students. The results showed that, overall, non-ELLs significantly outperformed ELLs. Of the four items that showed DIF, three favored non-ELLs while one favored ELLs. The item that favored ELLs provided a graphic representation of a science concept within a family context. There is some evidence that constructed-response items may help ELLs articulate scientific reasoning using their own words. Assessment developers and teachers should pay attention to the possible interaction between linguistic challenges and science content when designing assessment for and providing instruction to ELLs.

6.
The purpose of this study was to examine the validity and reliability of Curriculum-Based Measures in writing for English learners. Participants were 36 high school English learners with moderate to high levels of English language proficiency. Predictor variables were type of writing prompt (picture, narrative, and expository), time (3, 5, and 7 min), and scoring procedure (words written, words spelled correctly, correct word sequences, correct minus incorrect word sequences). Criterion variables were teacher ratings of writing performance and student performance on the Test of Written Language-III, the writing subtest of the Test of Emerging Academic English, and the Minnesota state writing test. Results supported the validity and reliability of a 5 to 7-min writing sample written in response to a narrative or picture prompt and scored for percent of correct word sequences, correct minus incorrect word sequences, or words written plus correct minus incorrect word sequences.
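The four scoring procedures named in this abstract are mechanical enough to sketch in code. A simplified illustration (spelling is checked against a caller-supplied lexicon, and a "correct word sequence" is counted whenever two adjacent words are both spelled correctly; real CWS scoring also weighs grammar, capitalization, and punctuation):

```python
def cbm_writing_scores(text, lexicon):
    """Score a writing sample with four common CBM writing metrics:
    WW (words written), WSC (words spelled correctly),
    CWS (correct word sequences), CIWS (correct minus incorrect sequences).
    """
    words = text.lower().split()
    ww = len(words)
    correct = [w.strip(".,!?") in lexicon for w in words]
    wsc = sum(correct)
    # A sequence is the junction between two adjacent words.
    cws = sum(a and b for a, b in zip(correct, correct[1:]))
    iws = (ww - 1) - cws
    ciws = cws - iws
    return {"WW": ww, "WSC": wsc, "CWS": cws, "CIWS": ciws}

# Toy sample with two misspellings ("adn", "hi" for "high"):
sample = "the dog ran fast adn jumped hi"
lex = {"the", "dog", "ran", "fast", "and", "jumped", "high"}
scores = cbm_writing_scores(sample, lex)
```

In the study, such raw counts (or their rates over a 3-, 5-, or 7-minute sample) are then correlated with the criterion measures.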

7.
English language programmes provide established pathways for international students seeking university admission in countries such as Australia and the United Kingdom. In order to refer international applicants to appropriate levels and durations of English language support prior to matriculation into their main course of study, pathway providers need effective and efficient language assessment tools. This report evaluates the effectiveness of an online vocabulary knowledge test as an index of English proficiency for university English pathway programme applicants (N = 177). The Timed Yes/No (TYN) test measures vocabulary recognition size and speed in a time- and resource-effective format. Test results were correlated with performance on a comprehensive placement test consisting of speaking, writing, reading and listening components. The predictive validity of word recognition accuracy (a proxy for size) and response time (a measure of efficiency) for placement test outcomes was examined independently and in combination. The TYN test scores’ sensitivity at predicting comprehensive placement test scores was assessed using a cut-score analysis resulting in an identification accuracy rate ranging from 76 to 86% for five critical band scores. The potential use of the online vocabulary-screening test for measuring international students’ English language proficiency is discussed in terms of reliability, validity, speed, usability and cost-effectiveness in onsite and offshore testing conditions.
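The cut-score analysis reported here amounts to checking how often the quick vocabulary test places an applicant on the same side of a criterion as the comprehensive placement test does. A minimal sketch with hypothetical scores and thresholds (none of these values come from the study):

```python
def identification_accuracy(vocab_scores, placement_bands, vocab_cut, band_cut):
    """Proportion of test takers whose classification by the quick
    vocabulary test (score >= vocab_cut) agrees with their classification
    by the comprehensive placement test (band >= band_cut).
    """
    hits = sum(
        (v >= vocab_cut) == (b >= band_cut)
        for v, b in zip(vocab_scores, placement_bands)
    )
    return hits / len(vocab_scores)

# Hypothetical paired results for eight applicants:
vocab = [55, 72, 40, 88, 58, 35, 90, 67]
bands = [4, 6, 3, 7, 5, 3, 7, 5]
acc = identification_accuracy(vocab, bands, vocab_cut=60, band_cut=5)
```

Repeating this over candidate cut scores (and over each critical band) yields the kind of 76-86% accuracy range the abstract reports.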

8.
ABSTRACT

Objectives: This study aims to test the dimensionality, reliability, and item quality of the revised UCLA loneliness scale as well as to investigate the differential item functioning (DIF) of the three dimensions of the revised UCLA loneliness scale in community-dwelling Chinese and Korean elderly individuals.

Method: Data from 493 elderly individuals (287 Chinese and 206 Korean) were used to examine the revised UCLA loneliness scale. The Rasch model based on item response theory (IRT) was used to test dimensionality, reliability, and item fit. The hybrid ordinal logistic regression-IRT test was used to evaluate DIF.

Results: Item separation reliability, person reliability, and Cronbach’s alpha met the benchmarks. The quality of the items in the three-dimension model met the benchmark. Eight items were detected as significant DIF items (at α < .01). The loneliness level of Chinese elderly individuals was significantly higher than that of Koreans in Dimensions 1 and 2, while Korean elderly participants showed significantly higher loneliness levels than Chinese participants in Dimension 3. Several of the collected demographic characteristics correlated more strongly with loneliness levels in Korean elderly individuals than in Chinese elderly individuals.

Conclusion: Analysis using the three dimensions is reasonable for the revised UCLA loneliness scale. The good quality of the items of this measure suggests that the revised UCLA loneliness scale can be used to assess the intended latent traits. Finally, the differences between the levels of loneliness in Chinese and Korean elderly individuals are associated with the factors of loneliness.

9.
This study describes the implementation of the Assessment of Learner-Centered Practices (ALCP) surveys in 4 English schools (3 primary schools and 1 secondary school) during the academic year 2002–2003. The ALCP teacher and student surveys for grades kindergarten through 12 were developed and validated with over 25,000 students and their teachers in the United States. The theoretical basis for the ALCP surveys is the American Psychological Association's Learner-Centered Psychological Principles. This paper firstly describes the knowledge base underpinning the ALCP surveys, then describes their implementation in the UK. Although the ALCP surveys have been extensively validated in the US, this study is the first attempt to trial them in the UK as a teacher development tool. Given the cultural similarities between the US and UK, as well as the presumed generalizability of the Learner-Centered Psychological Principles, establishing the psychometric qualities of the ALCP surveys with English teachers extends the cross-cultural usefulness of these surveys. The study found that the ALCP surveys demonstrated reliability and validity comparable to the U.S. data, and their usefulness in practice was confirmed via teacher evaluations.

10.
This study presents evidence regarding the construct validity and internal consistency of the IFSP Rating Scale (McWilliam & Jung, 2001), which was designed to rate individualized family service plans (IFSPs) on 12 indicators of family centered practice. Here, the Rasch measurement model is employed to investigate the scale's functioning and fit for both person and item diagnostics of 120 IFSPs that were previously analyzed with a classical test theory approach. Analyses demonstrated scores on the IFSP Rating Scale fit the model well, though additional items could improve the scale's reliability. Implications for applying the Rasch model to improve special education research and practice are discussed.

11.
The landscape of science education is being transformed by the new Framework for Science Education (National Research Council, A framework for K-12 science education: practices, crosscutting concepts, and core ideas. The National Academies Press, Washington, DC, 2012), which emphasizes the centrality of scientific practices—such as explanation, argumentation, and communication—in science teaching, learning, and assessment. A major challenge facing the field of science education is developing assessment tools that are capable of validly and efficiently evaluating these practices. Our study examined the efficacy of a free, open-source machine-learning tool for evaluating the quality of students’ written explanations of the causes of evolutionary change relative to three other approaches: (1) human-scored written explanations, (2) a multiple-choice test, and (3) clinical oral interviews. A large sample of undergraduates (n = 104) exposed to varying amounts of evolution content completed all three assessments: a clinical oral interview, a written open-response assessment, and a multiple-choice test. Rasch analysis was used to compute linear person measures and linear item measures on a single logit scale. We found that the multiple-choice test displayed poor person and item fit (mean square outfit >1.3), while both oral interview measures and computer-generated written response measures exhibited acceptable fit (average mean square outfit for interview: person 0.97, item 0.97; computer: person 1.03, item 1.06). Multiple-choice test measures were more weakly associated with interview measures (r = 0.35) than the computer-scored explanation measures (r = 0.63). Overall, Rasch analysis indicated that computer-scored written explanation measures (1) have the strongest correspondence to oral interview measures; (2) are capable of capturing students’ normative scientific and naive ideas as accurately as human-scored explanations; and (3) more validly detect understanding than the multiple-choice assessment. These findings demonstrate the great potential of machine-learning tools for assessing key scientific practices highlighted in the new Framework for Science Education.

12.
Individual person fit analyses provide important information regarding the validity of test score inferences for an individual test taker. In this study, we use data from an undergraduate statistics test (N = 1135) to illustrate a two-step method that researchers and practitioners can use to examine individual person fit. First, person fit is examined numerically with several indices based on the Rasch model (i.e., Infit, Outfit, and Between-Subset statistics). Second, person misfit is presented graphically with person response functions, and these person response functions are interpreted using a heuristic. Individual person fit analysis holds promise for improving score interpretation in that it may detect potential threats to validity of score inferences for some test takers. Individual person fit analysis may also highlight particular subsets of items (on which a test taker performs unexpectedly) that can be used to further contextualize her or his test performance.
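The Outfit and Infit indices mentioned in the first step can be computed directly from Rasch residuals. A minimal sketch for the dichotomous Rasch model (the response string and parameter values are illustrative, not from the study's data):

```python
import math

def rasch_p(theta, b):
    """Rasch model probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def person_fit(responses, theta, difficulties):
    """Outfit and Infit mean-square statistics for one test taker.

    Outfit: unweighted mean of squared standardized residuals.
    Infit: information-weighted mean square.
    Values near 1 indicate model-consistent responding; values well
    above 1 (e.g., > 1.3) are commonly flagged as misfit.
    """
    z2_terms, num, den = [], 0.0, 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        w = p * (1.0 - p)            # binomial variance (information)
        z2_terms.append((x - p) ** 2 / w)
        num += (x - p) ** 2
        den += w
    outfit = sum(z2_terms) / len(z2_terms)
    infit = num / den
    return outfit, infit

# A well-targeted, Guttman-like response string (no surprising answers):
diffs = [-2.0, -1.0, 0.0, 1.0, 2.0]
resp = [1, 1, 1, 0, 0]
outfit, infit = person_fit(resp, theta=0.0, difficulties=diffs)
```

A test taker who unexpectedly missed the easiest items while answering the hardest correctly would push both statistics above 1, which is what the paper's graphical person response functions then help diagnose.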

13.
The current scales for self-blame are not suitable for school bullying scenarios and most lack validity. This study used a self-developed scale to measure bullied victims’ tendency to self-blame and further examined whether victims and bully/victims exhibited different tendencies toward self-blame under both bullied and generalised scenarios. The study consisted of 1,320 student participants from grades five to nine. The research instrument was a self-constructed bullied-victim self-blame scale (BSS), and the results were analyzed using the Rasch rating scale model. The Rasch results showed strong evidence of BSS reliability and validity. The results indicated that participants’ self-blaming tendency scores were positively correlated with depression (r = .31). In addition, participants’ self-blaming scores in relational bullying were higher than those in verbal and physical bullying. The self-blaming tendency of bully/victims under bullied scenarios was higher than that of victims, but no difference was found between bully/victims and victims for generalised scenarios. The participants’ tendency to self-blame under generalised scenarios was significantly higher than under bullied scenarios. The tendencies of various roles to self-blame under different scenarios and the self-blaming counselling strategies for victims are discussed at the end of this study.

14.
Computer‐based tests (CBTs) often use random ordering of items in order to minimize item exposure and reduce the potential for answer copying. Little research has been done, however, to examine item position effects for these tests. In this study, different versions of a Rasch model and different response time models were examined and applied to data from a CBT administration of a medical licensure examination. The models specifically were used to investigate whether item position affected item difficulty and item intensity estimates. Results indicated that the position effect was negligible.
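At its simplest, testing whether item position affects difficulty is a regression of the difficulty estimate (or its change) on position. An illustrative ordinary-least-squares sketch with made-up data (the studies above embed this effect inside richer Rasch and response-time models):

```python
import numpy as np

# Hypothetical data: change in Rasch item difficulty (logits) vs. how many
# positions later an item appeared on the live form than at field test.
position_change = np.array([0, 5, 10, 15, 20, 25, 30, 35])
rid_change = np.array([0.00, 0.02, 0.05, 0.06, 0.09, 0.11, 0.12, 0.15])

# Ordinary least squares: rid_change = intercept + slope * position_change
X = np.column_stack([np.ones(len(position_change)), position_change])
(intercept, slope), *_ = np.linalg.lstsq(X, rid_change, rcond=None)
```

A slope near zero would match the "negligible position effect" finding in this abstract; a clearly positive slope would match the significant effect reported in abstract 17 below.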

15.
The applications of item response theory (IRT) models assume local item independence and that examinees are independent of each other. When a representative sample for psychometric analysis is selected using a cluster sampling method in a testlet‐based assessment, both local item dependence and local person dependence are likely to be induced. This study proposed a four‐level IRT model to simultaneously account for dual local dependence due to item clustering and person clustering. Model parameter estimation was explored using the Markov Chain Monte Carlo method. Model parameter recovery was evaluated in a simulation study in comparison with three other related models: the Rasch model, the Rasch testlet model, and the three‐level Rasch model for person clustering. In general, the proposed model recovered the item difficulty and person ability parameters with the least total error. The bias in both item and person parameter estimation was not affected but the standard error (SE) was affected. In some simulation conditions, the difference in classification accuracy between models could go up to 11%. The illustration using the real data generally supported model performance observed in the simulation study.
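The paper estimates its four-level model with Markov Chain Monte Carlo. As a much-reduced illustration of the same machinery, here is a random-walk Metropolis sampler for a single Rasch item difficulty, with examinee abilities treated as known and a N(0, 2²) prior (all values are simulated; this is a sketch of the estimation method, not the four-level model itself):

```python
import math
import random

def log_posterior(b, thetas, responses):
    """Log posterior of one Rasch item difficulty b (up to a constant)."""
    lp = -b * b / (2 * 2.0 ** 2)  # N(0, sd = 2) prior
    for theta, x in zip(thetas, responses):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        lp += math.log(p) if x == 1 else math.log(1.0 - p)
    return lp

def metropolis(thetas, responses, n_iter=4000, step=0.5, seed=1):
    """Random-walk Metropolis sampler; returns post-burn-in draws of b."""
    rng = random.Random(seed)
    b, draws = 0.0, []
    current_lp = log_posterior(b, thetas, responses)
    for _ in range(n_iter):
        proposal = b + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, thetas, responses)
        # Accept with probability min(1, posterior ratio):
        if math.log(rng.random()) < proposal_lp - current_lp:
            b, current_lp = proposal, proposal_lp
        draws.append(b)
    return draws[n_iter // 2:]  # discard first half as burn-in

# Simulate 400 examinees with known abilities answering one item (true b = 0.7)
sim = random.Random(7)
true_b = 0.7
thetas = [sim.gauss(0.0, 1.0) for _ in range(400)]
responses = [int(sim.random() < 1.0 / (1.0 + math.exp(-(t - true_b)))) for t in thetas]
draws = metropolis(thetas, responses)
estimate = sum(draws) / len(draws)
```

The full model in the paper samples item, testlet, person, and cluster parameters jointly in the same accept/reject fashion; the posterior mean of the retained draws serves as the parameter estimate.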

16.
ABSTRACT

Students’ attitude towards science (SAS) is often a subject of investigation in science education research. Rating-scale surveys are commonly used in the study of SAS. The present study illustrates how Rasch analysis can be used to provide psychometric information of SAS rating scales. The analyses were conducted on a 20-item SAS scale used in an existing dataset of the Trends in International Mathematics and Science Study (TIMSS 2011). Data of all the eighth-grade participants from Hong Kong and Singapore (N = 9,942) were retrieved for analyses. Additional insights from Rasch analysis that are not commonly available from conventional test and item analyses were discussed, such as measurement invariance of SAS, unidimensionality of the SAS construct, optimum utilization of SAS rating categories, and item difficulty hierarchy in the SAS scale. Recommendations on how TIMSS items on the measurement of SAS can be better designed were discussed. The study also highlights the importance of using Rasch estimates for statistical parametric tests (e.g. ANOVA, t-test) that are common in science education research for group comparisons.

17.
In operational testing programs using item response theory (IRT), item parameter invariance is threatened when an item appears in a different location on the live test than it did when it was field tested. This study utilizes data from a large state's assessments to model change in Rasch item difficulty (RID) as a function of item position change, test level, test content, and item format. As a follow-up to the real data analysis, a simulation study was performed to assess the effect of item position change on equating. Results from this study indicate that item position change significantly affects change in RID. In addition, although the test construction procedures used in the investigated state seem to somewhat mitigate the impact of item position change, equating results might be impacted in testing programs where other test construction practices or equating methods are utilized.

18.
In this paper we present a new methodology for detecting differential item functioning (DIF). We introduce a DIF model, called the random item mixture (RIM), that is based on a Rasch model with random item difficulties (besides the common random person abilities). In addition, a mixture model is assumed for the item difficulties such that the items may belong to one of two classes: a DIF or a non-DIF class. The crucial difference between the DIF class and the non-DIF class is that the item difficulties in the DIF class may differ according to the observed person groups while they are equal across the person groups for the items from the non-DIF class. Statistical inference for the RIM is carried out in a Bayesian framework. The performance of the RIM is evaluated using a simulation study in which it is compared with traditional procedures, like the likelihood ratio test, the Mantel-Haenszel procedure and the standardized p-DIF procedure. In this comparison, the RIM performs better than the other methods. Finally, the usefulness of the model is also demonstrated on a real-life data set.
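One of the baselines the RIM is compared against, the Mantel-Haenszel procedure, pools 2×2 (group × correct/incorrect) tables across ability strata into a common odds ratio for each item. A minimal sketch with hypothetical counts:

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio for one item.

    Each stratum is a tuple (A, B, C, D):
      A = reference group correct,  B = reference group incorrect,
      C = focal group correct,      D = focal group incorrect.
    A value near 1.0 suggests no DIF; values far from 1.0 indicate the
    item favors one group after conditioning on ability.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical 2x2 tables at three ability strata (a no-DIF pattern):
strata = [(30, 20, 15, 10), (40, 10, 20, 5), (18, 2, 27, 3)]
alpha_mh = mantel_haenszel_odds_ratio(strata)
```

In practice the odds ratio is usually reported on the delta scale (−2.35 × ln α) and paired with a chi-square significance test; the simulation in this paper pits that classic procedure against the Bayesian RIM.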

19.
This study aimed to develop a new scale to examine primary and secondary school students’ perceptions of the severity of cyberbullying behaviours, and to explore further whether differences exist in the means of gender, grade and participant role. A total of 707 primary and secondary school students (M = 14.7) in Taiwan participated in this study. Two Olweus-like global items were used to identify students’ participant roles. A self-reported cyberbullying severity scale (CSS) was developed and validated by Rasch measurement. Results of this study supported the reliability and validity of the 16-item CSS. Impersonation was rated as the most serious type of cyberbullying. Cyberbullying behaviours that occurred in private were rated as less severe than were those that occurred in public. A Rasch latent regression analysis revealed that some gender and involvement effects were found, but no statistically significant difference was found among means of four participant roles. The behavioural hierarchy of cyberbullying severity, mean differences among personal attributions and cyberbullying intervention are discussed at the end of the article.

20.
This study examined the underlying structure of the Depression scale of the revised Minnesota Multiphasic Personality Inventory using the dichotomous Rasch model and factor analysis. Rasch methodology was used to identify and restructure the Depression scale, and factor analysis was used to confirm the structure established by the Rasch model. The item calibration and factor analysis were carried out on the full sample of 2,600 normative subjects. The results revealed that the Depression scale did not consist of one homogeneous set of items, even though the scale was developed to measure one dimension of depression. Rasch analysis, as well as factor analysis, recognized two distinct content‐homogeneous subscales, here labeled mental depression and physical depression. The Rasch methodology provided a basis for a better understanding of the underlying structure and furnished a useful solution to the scale refinement.

