Similar Articles
20 similar articles found (search time: 31 ms)
1.
As computer‐based tests become more common, there is a growing wealth of metadata related to examinees' response processes, which include solution strategies, concentration, and operating speed. One common type of metadata is item response time. While response times have been used extensively to improve estimates of achievement, little work considers whether these metadata may provide useful information on social–emotional constructs. This study uses an analytic example to explore whether metadata might help illuminate such constructs. Specifically, analyses examine whether the amount of time students spend on test items (after accounting for item difficulty and estimates of true achievement), and on difficult items in particular, tells us anything about the student's academic motivation and self‐efficacy. While results do not indicate a strong relationship between mean item durations and these constructs in general, the amount of time students spend on very difficult items is highly correlated with motivation and self‐efficacy. The implications of these findings for using response process metadata to gain information on social–emotional constructs are discussed.
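The core analysis described in this abstract (residualizing item durations on item difficulty, then relating time spent on the hardest items to motivation) can be sketched with simulated data. Everything below is hypothetical and not taken from the study: the sample sizes, the effect sizes, and the difficulty cutoff are all assumed values chosen only to make the mechanics concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 200, 40
difficulty = np.linspace(-2, 2, n_items)      # item difficulties (assumed)
motivation = rng.normal(size=n_students)      # latent motivation/self-efficacy

# Log response times: harder items take longer, and motivated students
# persist (spend extra time) on the very difficult items.
hard = difficulty > 1.0
log_rt = (0.3 * difficulty
          + 0.5 * motivation[:, None] * hard
          + rng.normal(scale=0.4, size=(n_students, n_items)))

# Residualize on item difficulty by removing each item's mean duration,
# then average the residuals over the hard items only.
resid = log_rt - log_rt.mean(axis=0)
persistence = resid[:, hard].mean(axis=1)

r = np.corrcoef(persistence, motivation)[0, 1]
print(round(r, 2))
```

Under these assumed effect sizes, the per-student mean residual on very difficult items correlates strongly with the simulated motivation score, mirroring the qualitative pattern the abstract reports.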

2.
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee‐selected items. The results of a series of simulation studies showed (1) that the parameters of the new models were recovered well, (2) that the parameter estimates were almost unbiased when the new models were fit to data simulated from standard item response models, (3) that failing to consider the choice effect yielded shrunken parameter estimates for examinee‐selected items, and (4) that even when the missingness mechanism in examinee‐selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.

3.
In many educational tests, both multiple‐choice (MC) and constructed‐response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form‐specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long‐standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.

4.
This study proposes a structured constructs model (SCM) to examine measurement in the context of a multidimensional learning progression (LP). The LP is assumed to have features that go beyond a typical multidimensional IRT model, in that there are hypothesized to be certain cross‐dimensional linkages that correspond to requirements between the levels of the different dimensions. The new model builds on multidimensional item response theory models and change‐point analysis to add cut‐score and discontinuity parameters that embody these substantive requirements. This modeling strategy allows us to place the examinees in the appropriate LP level and simultaneously to model the hypothesized requirement relations. Results from a simulation study indicate that the proposed change‐point SCM recovers the generating parameters well. When the hypothesized requirement relations are ignored, the model fit tends to become worse, and the model parameters appear to be more biased. Moreover, the proposed model can be used to find validity evidence to support or disprove the initially hypothesized theoretical links in the LP through empirical data. We illustrate the technique with data from an assessment system designed to measure student progress in a middle‐school statistics and modeling curriculum.

5.
This research derived information functions and proposed new scalar information indices to examine the quality of multidimensional forced choice (MFC) items based on the RANK model. We also explored how GGUM‐RANK information, latent trait recovery, and reliability varied across three MFC formats: pairs (two response alternatives), triplets (three alternatives), and tetrads (four alternatives). As expected, tetrad and triplet measures provided substantially more information than pairs, and MFC items composed of statements with high discrimination parameters were most informative. The methods and findings of this study will help practitioners to construct better MFC items, make informed projections about reliability with different MFC formats, and facilitate the development of MFC triplet‐ and tetrad‐based computerized adaptive tests.

6.
Even though guessing biases difficulty estimates as a function of item difficulty in the dichotomous Rasch model, assessment programs whose tests include multiple‐choice items often construct scales using this model. Research has shown that when all items are multiple‐choice, this bias can largely be eliminated. However, many assessments contain a combination of multiple‐choice and constructed response items. Using vertically scaled numeracy assessments from a large‐scale assessment program, this article shows that eliminating the bias on estimates of the multiple‐choice items also affects the difficulty estimates of the constructed response items. This implies that the original estimates of the constructed response items were biased by guessing on the multiple‐choice items. This bias has implications both for defining difficulties in item banks used in adaptive testing composed of multiple‐choice and constructed response items, and for the construction of proficiency scales.
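The mechanism behind this bias is easy to illustrate: with a guessing floor c (as in a 3PL-style model), the observed proportion correct exceeds the no-guessing proportion by roughly c(1 - P_Rasch), so the inflation, and hence the downward bias in a fitted Rasch difficulty, is largest for hard items. A minimal sketch with hypothetical values (c = .25 for four-option items, abilities drawn from N(0, 1)); this is an illustration of the direction of the bias, not a reproduction of the article's analysis:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

theta = np.random.default_rng(1).normal(size=100_000)  # ability sample
c = 0.25                                               # guessing floor (4-option MC)

inflation = {}
for b in (-2.0, 0.0, 2.0):                             # easy, medium, hard item
    p_rasch = sigmoid(theta - b).mean()                # p-value with no guessing
    inflation[b] = c * (1 - p_rasch)                   # = p_3PL - p_Rasch
    print(b, round(inflation[b], 3))
```

Because the inflation grows with difficulty, a Rasch fit absorbs it as lower difficulty for hard multiple-choice items, which in turn distorts the relative placement of constructed response items on the same scale.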

7.
The purposes of this study were to (a) test the hypothesized factor structure of the Student-Teacher Relationship Scale (STRS; Pianta, 2001) for 308 African American (AA) and European American (EA) children using confirmatory factor analysis (CFA) and (b) examine the measurement invariance of the factor structure across AA and EA children. CFA of the hypothesized three-factor model with correlated latent factors did not yield an optimal model fit. Parameter estimates obtained from CFA identified items with low factor loadings and R2 values, suggesting that content revision is required for those items on the STRS. Deletion of two items from the scale yielded a good model fit, suggesting that the remaining 26 items reliably and validly measure the constructs for the whole sample. Tests for configural invariance, however, revealed that the underlying constructs may differ for AA and EA groups. Subsequent exploratory factor analyses (EFAs) for AA and EA children were carried out to investigate the comparability of the measurement model of the STRS across the groups. The results of EFAs provided evidence suggesting differential factor models of the STRS across AA and EA groups. This study provides implications for construct validity research and substantive research using the STRS given that the STRS is extensively used in intervention and research in early childhood education.

8.
Single‐best answers to multiple‐choice items are commonly dichotomized into correct and incorrect responses, and modeled using either a dichotomous item response theory (IRT) model or a polytomous one if differences among all response options are to be retained. The current study presents an alternative IRT‐based modeling approach to multiple‐choice items administered with the procedure of elimination testing, which asks test‐takers to eliminate all the response options they consider to be incorrect. The partial credit model is derived for the obtained responses. By extracting more information pertaining to test‐takers' partial knowledge on the items, the proposed approach has the advantage of providing more accurate estimation of the latent ability. In addition, it may shed some light on the possible answering processes of test‐takers on the items. As an illustration, the proposed approach is applied to a classroom examination of an undergraduate course in engineering science.
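A generic partial credit model assigns category probabilities from cumulative sums of (theta - delta_k) over the step difficulties. The sketch below shows those mechanics for a four-option item scored 0 to 3, roughly one point per incorrect option successfully eliminated; the step difficulties are made-up values, not the model actually derived in the article:

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Partial credit model category probabilities for categories 0..m,
    given step difficulties deltas = (delta_1, ..., delta_m)."""
    # Category 0 has cumulative sum 0; category x sums (theta - delta_k) up to x.
    cum = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    expcum = np.exp(cum - cum.max())   # subtract max for numerical stability
    return expcum / expcum.sum()

# Hypothetical elimination-testing item scored 0..3
probs = pcm_probs(theta=1.0, deltas=[-1.0, 0.0, 1.5])
print(probs.round(3))
```

With these values an examinee at theta = 1.0 is most likely to eliminate two of the three incorrect options (category 2), which is the sort of partial-knowledge information a dichotomous model discards.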

9.
This article addresses the issue of how to detect item preknowledge using item response time data in two computer‐based large‐scale licensure examinations. Item preknowledge is indicated by an unexpected short response time and a correct response. Two samples were used for detecting item preknowledge for each examination. The first sample was from the early stage of the operational test and was used for item calibration. The second sample was from the late stage of the operational test, which may feature item preknowledge. The purpose of this research was to explore whether there was evidence of item preknowledge and compromised items in the second sample using the parameters estimated from the first sample. The results showed that for one nonadaptive operational examination, two items (of 111) were potentially exposed, and two candidates (of 1,172) showed some indications of preknowledge on multiple items. For another licensure examination that featured computerized adaptive testing, there was no indication of item preknowledge or compromised items. Implications for detected aberrant examinees and compromised items are discussed in the article.
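The flagging logic (a correct response paired with an unexpectedly short response time) can be sketched with a simple lognormal response-time model in the spirit of van der Linden's speed models. All parameter values, the z < -3 cutoff, and the planted aberrance below are hypothetical choices for illustration, not the article's operational procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
n_persons, n_items = 500, 60

# Lognormal response-time model: log T = beta_item - tau_person + noise,
# with beta treated as calibrated on the early (clean) sample.
beta = rng.normal(4.0, 0.3, n_items)    # item time intensities
tau = rng.normal(0.0, 0.3, n_persons)   # person speed
log_t = beta[None, :] - tau[:, None] + rng.normal(0.0, 0.3, (n_persons, n_items))
correct = rng.random((n_persons, n_items)) < 0.6   # crude stand-in for responses

# Plant preknowledge: person 0 answers items 0-4 very fast and correctly.
log_t[0, :5] -= 2.5
correct[0, :5] = True

# Standardized log-time residuals given beta and an estimated person speed.
tau_hat = (beta[None, :] - log_t).mean(axis=1, keepdims=True)
z = (log_t - (beta[None, :] - tau_hat)) / 0.3

flagged = (z < -3) & correct   # unexpectedly fast AND correct
print(flagged[0, :5], int(flagged.sum()))
```

Requiring both conditions matters: fast wrong answers suggest rapid guessing rather than preknowledge, which is why the conjunction, not speed alone, is the indicator.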

10.
Many educational and psychological tests are inherently multidimensional, meaning these tests measure two or more dimensions or constructs. The purpose of this module is to illustrate how test practitioners and researchers can apply multidimensional item response theory (MIRT) to understand better what their tests are measuring, how accurately the different composites of ability are being assessed, and how this information can be cycled back into the test development process. Procedures for conducting MIRT analyses, from obtaining evidence that the test is multidimensional, to modeling the test as multidimensional, to illustrating the properties of multidimensional items graphically, are described from both a theoretical and a substantive basis. This module also illustrates these procedures using data from a ninth-grade mathematics achievement test. It concludes with a discussion of future directions in MIRT research.
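As a concrete example of the quantities a MIRT analysis inspects, the compensatory multidimensional 2PL gives P(correct) = sigmoid(a . theta + d); the multidimensional discrimination (MDISC) is the norm of a, and the item's direction of best measurement is given by the angles arccos(a_k / MDISC). A sketch with hypothetical item parameters:

```python
import numpy as np

def m2pl_prob(theta, a, d):
    """Compensatory multidimensional 2PL: P = sigmoid(a . theta + d)."""
    return 1 / (1 + np.exp(-(np.dot(a, theta) + d)))

a = np.array([1.2, 0.4])   # discriminations on two dimensions (hypothetical)
d = -0.5                   # intercept (easiness)

mdisc = np.linalg.norm(a)                  # overall discrimination
angles = np.degrees(np.arccos(a / mdisc))  # direction of best measurement

p = m2pl_prob(np.array([1.0, 0.0]), a, d)
print(round(p, 3), round(mdisc, 3), angles.round(1))
```

Here the small angle with the first axis shows the item measures mostly the first dimension, which is exactly the kind of graphical/vector summary the module describes.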

11.
Many large‐scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically, simple structured tests, such as these rely on multiple sections of multiple‐choice and/or constructed‐response items to generate multiple scores. In the current article, we propose an extension of the hierarchical rater model (HRM) to be applied with simple structured tests with constructed response items. In addition to modeling the appropriate trait structure, the multidimensional HRM (M‐HRM) presented here also accounts for rater severity bias and rater variability or inconsistency. We introduce the model formulation, test parameter recovery with a focus on latent traits, and compare the M‐HRM to other scoring approaches (unidimensional HRMs and a traditional multidimensional item response theory model) using simulated and empirical data. Results show more precise scores under the M‐HRM, with a major improvement in scores when incorporating rater effects versus ignoring them in the traditional multidimensional item response theory model.

12.
Cross‐level invariance in a multilevel item response model can be investigated by testing whether the within‐level item discriminations are equal to the between‐level item discriminations. Testing the cross‐level invariance assumption is important to understand constructs in multilevel data. However, in most multilevel item response model applications, the cross‐level invariance is assumed without testing of the cross‐level invariance assumption. In this study, the detection methods of differential item discrimination (DID) over levels and the consequences of ignoring DID are illustrated and discussed with the use of multilevel item response models. Simulation results showed that the likelihood ratio test (LRT) performed well in detecting global DID at the test level when some portion of the items exhibited DID. At the item level, the Akaike information criterion (AIC), the sample‐size adjusted Bayesian information criterion (saBIC), LRT, and Wald test showed a satisfactory rejection rate (>.8) when some portion of the items exhibited DID and the items had lower intraclass correlations (or higher DID magnitudes). When DID was ignored, the accuracy of the item discrimination estimates and standard errors was mainly problematic. Implications of the findings and limitations are discussed.
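The item-level LRT compares a constrained model (between-level discrimination fixed equal to the within-level one) against a free model, referring twice the log-likelihood difference to a chi-square distribution with 1 df. The sketch below shows only the test mechanics with hypothetical fitted log-likelihoods, not a multilevel IRT fit; for df = 1 the chi-square survival function equals erfc(sqrt(x/2)), so no statistics library is needed:

```python
import math

def lrt_df1(loglik_constrained, loglik_free):
    """LRT with 1 df: the constrained model fixes the between-level item
    discrimination equal to the within-level one (cross-level invariance)."""
    stat = 2 * (loglik_free - loglik_constrained)
    # chi-square(1) survival function: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(stat / 2)) if stat > 0 else 1.0
    return stat, p

# Hypothetical log-likelihoods for one item's pair of nested models
stat, p = lrt_df1(loglik_constrained=-1520.3, loglik_free=-1516.1)
print(round(stat, 2), round(p, 4))
```

A small p-value here would indicate DID for that item, i.e., that the cross-level invariance assumption should not be imposed.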

13.
A mixed‐effects item response theory (IRT) model is presented as a logical extension of the generalized linear mixed‐effects modeling approach to formulating explanatory IRT models. Fixed and random coefficients in the extended model are estimated using a Metropolis‐Hastings Robbins‐Monro (MH‐RM) stochastic imputation algorithm to accommodate the increased dimensionality due to modeling multiple design‐ and trait‐based random effects. As a consequence of using this algorithm, more flexible explanatory IRT models, such as the multidimensional four‐parameter logistic model, are easily organized and efficiently estimated for unidimensional and multidimensional tests. Rasch versions of the linear latent trait and latent regression model, along with their extensions, are presented and discussed; Monte Carlo simulations are conducted to determine the efficiency of parameter recovery of the MH‐RM algorithm; and an empirical example using the extended mixed‐effects IRT model is presented.

14.
In this article, it is shown how item text can be represented by (a) 113 features quantifying the text's linguistic characteristics, (b) 16 measures of the extent to which an information‐retrieval‐based automatic question‐answering system finds an item challenging, and (c) through dense word representations (word embeddings). Using a random forests algorithm, these data then are used to train a prediction model for item response times and predicted response times then are used to assemble test forms. Using empirical data from the United States Medical Licensing Examination, we show that timing demands are more consistent across these specially assembled forms than across forms comprising randomly‐selected items. Because an exam's timing conditions affect examinee performance, this result has implications for exam fairness whenever examinees are compared with each other or against a common standard.
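The final step, using predicted response times to assemble forms with comparable timing demands, can be sketched with a simple greedy balancing heuristic: give the longest remaining item to the form with the smallest running total. Both the heuristic and the per-item predicted times below are illustrative assumptions; the study's own assembly procedure may differ:

```python
import heapq

def assemble_forms(predicted_times, n_forms):
    """Greedy balancing: assign the longest remaining item to the form
    with the smallest running total of predicted response time."""
    heap = [(0.0, form_id, []) for form_id in range(n_forms)]
    heapq.heapify(heap)
    for t in sorted(predicted_times, reverse=True):
        total, form_id, items = heapq.heappop(heap)   # currently shortest form
        items.append(t)
        heapq.heappush(heap, (total + t, form_id, items))
    return sorted(heap)

# Hypothetical per-item predicted response times in seconds
times = [95, 80, 75, 70, 60, 55, 50, 45, 40, 30]
forms = assemble_forms(times, n_forms=2)
totals = [total for total, _, _ in forms]
print(totals)
```

The resulting forms have nearly equal total predicted time, which is the sense in which "timing demands are more consistent" than under random item selection.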

15.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.
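The contrast between the two response processes can be sketched directly: the Rasch model is cumulative (probability rises monotonically with theta), while the hyperbolic cosine model is single-peaked (probability is highest where the person and item locations coincide). The unfolding form below follows the commonly cited Andrich-Luo parameterization, and the unit parameter rho = 1 is an assumed value:

```python
import math

def rasch(theta, delta):
    """Cumulative process: probability rises monotonically with theta."""
    return 1 / (1 + math.exp(-(theta - delta)))

def hcm(theta, delta, rho=1.0):
    """Hyperbolic cosine (unfolding) model: probability peaks when the
    person location theta matches the item location delta."""
    return math.exp(rho) / (math.exp(rho) + 2 * math.cosh(theta - delta))

for th in (-2.0, 0.0, 2.0):
    print(th, round(rasch(th, 0.0), 3), round(hcm(th, 0.0), 3))
```

Note the symmetry of the unfolding curve: a rater far above the item location responds like one far below it, which is why the two processes call for different measurement models.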

16.
The current work examines children's sensitivity to rime unit spelling–sound correspondences within the context of early word reading as a way of assessing word‐specific influences on early word‐reading strategies. Sixty 6–7‐year‐olds participated in an experimental reading task that comprised word items that shared either frequent or infrequent rime unit correspondences. Retrospective self‐reports were taken as measures of strategy choice. The results showed that the children were more accurate in identifying word items that shared a common rime unit (consistent items) when compared with those containing infrequent rime units (unique and exception items). Moreover, while nonlexical (phonological) attempts were most frequently applied across all word types, these resulted in lower levels of accuracy, especially for the exception word items. The current data support the argument that children are increasingly sensitive to rime unit sound–spelling correspondences during the early stages of their word reading, and that the nature of these word‐specific orthographic representations shapes their reliance on particular lexical or nonlexical word‐reading strategies.

17.
Content‐based automated scoring has been applied in a variety of science domains. However, many prior applications involved simplified scoring rubrics without considering rubrics representing multiple levels of understanding. This study tested a concept‐based scoring tool for content‐based scoring, c‐rater™, for four science items with rubrics aiming to differentiate among multiple levels of understanding. The items showed moderate to good agreement with human scores. The findings suggest that automated scoring has the potential to score constructed‐response items with complex scoring rubrics, but in its current design cannot replace human raters. This article discusses sources of disagreement and factors that could potentially improve the accuracy of concept‐based automated scoring.

18.
When developing self-report instruments, researchers often have included both positively and negatively worded items to negate the possibility of response bias. Unfortunately, this strategy may interfere with examinations of the latent structure of self-report instruments by introducing method effects, particularly among negatively worded items. The substantive nature of the method effects remains unclear and requires examination. Building on recommendations from previous researchers (Tomás & Oliver, 1999), this study examined the longitudinal invariance of method effects associated with negatively worded items using a self-report measure of global self-esteem. Data were obtained from the National Educational Longitudinal Study (NELS; Ingels et al., 1992) across 3 waves, each separated by 2 years, and the longitudinal invariance of the method effects was tested using LISREL 8.20 with weighted least squares estimation on polychoric correlations and an asymptotic variance/covariance matrix. Our results indicated that method effects associated with negatively worded items exhibited longitudinal invariance of the factor structure, factor loadings, item uniquenesses, factor variances, and factor covariances. Therefore, method effects associated with negatively worded items demonstrated invariance across time, similar to measures of personality traits, and should be considered of potential substantive importance. One possible substantive interpretation is a response style.

19.
This 4‐year longitudinal research was designed to study special education determinations of students who participated in Tier 2 intervention in a Response to Intervention (RtI) model focused on reading across Grades 1–4. We compared identification rates for learning disabilities (LD) and student characteristics of 381 students the year prior to implementation with 377 students in the RtI environment. Across schools, 38–60 percent of students were English language learners (ELL). Key outcomes by Grade 4 for students with LD who had participated in a model of RtI were relatively greater reading impairment with effect sizes ranging from 0.64 to 0.82, and more equitable representation across ELL and native English speakers than in the cohort prior to RtI implementation. Notably, one‐third of the students identified for special services as LD in these schools were not identified until 4th grade.

20.
The premise of a great deal of current research guiding policy development has been that accommodations are the catalyst for student performance differences. Rather than accepting this premise, two studies were conducted to investigate the influence of extended time and content knowledge on the performance of ninth‐grade students who took a statewide mathematics test with and without accommodations. Each study involved 1,250 accommodated students (extended time only) with learning disabilities and 1,250 nonaccommodated students demonstrating no disabilities. In Study One, a standard differential item functioning (DIF) analysis illustrated that the usual approach to studying the effects of accommodations contributes little to our understanding of the reason for performance differences across students. Next, a mixture item response theory DIF model was used to explore the most likely cause(s) for performance differences across the population. The results from both studies suggest that students for whom items were functioning differently were not accurately characterized by their accommodation status but rather by their content knowledge. That is, knowing students' accommodation status (i.e., accommodated or nonaccommodated) contributed little to understanding why accommodated and nonaccommodated students differed in their test performance. Rather, the data would suggest that a more likely explanation is that mathematics competency differentiated the groups of student learners regardless of their accommodation and/or reading levels.


Copyright © Beijing Qinyun Technology Development Co., Ltd. 京ICP备09084417号