Similar Literature (20 results)
1.
Few adequately normed drawing tests are available for current practice. Two subtests of the McCarthy Scales, Draw-A-Design and Draw-A-Child, are the best normed of all drawing tests for children aged 2½ to 8½ years; however, no age-corrected deviation scaled scores are available for interpretation, only raw scores and age equivalents. This paper presents scaled scores for use in interpretation of these two drawing tests.

2.
Self-adapted testing has been described as a variation of computerized adaptive testing that reduces test anxiety and thereby enhances test performance. The purpose of this study was to gain a better understanding of these proposed effects of self-adapted tests (SATs); meta-analysis procedures were used to estimate differences between SATs and computerized adaptive tests (CATs) in proficiency estimates and post-test anxiety levels across studies in which these two types of tests have been compared. After controlling for measurement error, the results showed that SATs yielded proficiency estimates that were 0.12 standard deviation units higher and post-test anxiety levels that were 0.19 standard deviation units lower than those yielded by CATs. We speculate about possible reasons for these differences and discuss advantages and disadvantages of using SATs in operational settings.

3.
This study examined and compared various statistical methods for detecting individual differences in change. Considering 3 issues including test forms (specific vs. generalized), estimation procedures (constrained vs. unconstrained), and nonnormality, we evaluated 4 variance tests including the specific Wald variance test, the generalized Wald variance test, the specific likelihood ratio (LR) variance test, and the generalized LR variance test under both constrained and unconstrained estimation for both normal and nonnormal data. For the constrained estimation procedure, both the mixture distribution approach and the alpha correction approach were evaluated for their performance in dealing with the boundary problem. To deal with the nonnormality issue, we used the sandwich standard error (SE) estimator for the Wald tests and the Satorra–Bentler scaling correction for the LR tests. Simulation results revealed that testing a variance parameter and the associated covariances (generalized) had higher power than testing the variance solely (specific), unless the true covariances were zero. In addition, the variance tests under constrained estimation outperformed those under unconstrained estimation in terms of higher empirical power and better control of Type I error rates. Among all the studied tests, for both normal and nonnormal data, the robust generalized LR and Wald variance tests with the constrained estimation procedure were generally more powerful and had better Type I error rates for testing variance components than the other tests. Results from the comparisons between specific and generalized variance tests and between constrained and unconstrained estimation were discussed.
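As a point of reference for the two families of tests compared above, the sketch below shows how a Wald statistic for a single variance component and a boundary-corrected likelihood-ratio p-value (the 50:50 chi-square(0)/chi-square(1) mixture mentioned in the abstract) are typically computed. The estimate, its robust standard error, and the log-likelihoods are assumed to come from an already-fitted model; the numbers and names here are hypothetical.

```python
from scipy.stats import norm, chi2

def wald_variance_test(var_hat, se_var_hat):
    """One-sided Wald z test for H0: variance = 0 (a variance cannot be negative).
    se_var_hat could be a robust/sandwich SE taken from the fitted model."""
    z = var_hat / se_var_hat
    return z, norm.sf(z)  # upper-tail p-value

def lr_variance_test_boundary(loglik_full, loglik_reduced):
    """LR test for a single variance with the 50:50 chi-square(0)/chi-square(1)
    mixture correction for the boundary problem."""
    lr = 2.0 * (loglik_full - loglik_reduced)
    p_mixture = 0.5 * chi2.sf(lr, df=1)  # the chi-square(0) component contributes 0 for lr > 0
    return lr, p_mixture

# Hypothetical numbers purely for illustration
print(wald_variance_test(var_hat=0.35, se_var_hat=0.12))
print(lr_variance_test_boundary(loglik_full=-1250.4, loglik_reduced=-1254.9))
```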

4.
Numerous methods have been proposed and investigated for estimating the standard error of measurement (SEM) at specific score levels. Consensus on the preferred method has not been obtained, in part because there is no standard criterion. The criterion procedure in previous investigations has been a single test occasion procedure. This study compares six estimation techniques. Two criteria were calculated by using test results obtained from a test-retest or parallel forms design. The relationship between estimated score level standard errors and the score scale was similar for the six procedures. These relationships were also congruent with findings from previous investigations. Similarity between estimates and criteria varied over methods and criteria. For test-retest conditions, the estimation techniques are interchangeable. The user's selection could be based on personal preference. However, for parallel forms conditions, the procedures resulted in estimates that were meaningfully different. The preferred estimation technique would be Feldt's method (cited in Gupta, 1965; Feldt, 1984).

5.
This paper discusses how to maintain the integrity of national normative information for achievement tests when the test that is administered has been customized to satisfy local needs and is not a test that has been nationally normed. Using an Item Response Theory perspective, alternative procedures for item selection and calibration are examined with respect to their effect on the accuracy of normative information. It is emphasized that it is important to match the content of the customized test with that of the normed test if accurate normative data are desired.

6.
The aim of our study was to determine DEM test performance norms for school-aged children in Latvia, assess how DEM test results correlate with children’s reading rates, and compare test performance between strong and weak readers. A modified DEM test and a newly developed reading test were administered to 1487 children during a screening survey. Our study provides norms for adjusted DEM scores for children from 7 to 18 years of age. A high correlation exists between a child’s reading rate and her DEM speed scores for both parts of the test. Weak readers performed significantly more slowly on the DEM test than strong readers. Overall, 6% of the subject population scored 1 standard deviation below the mean value on both the DEM and reading tests. We conclude that these individuals may be at a higher risk for developing reading impairments.

7.
This paper describes four procedures previously developed for estimating conditional standard errors of measurement for scale scores: the IRT procedure (Kolen, Zeng, & Hanson, 1996), the binomial procedure (Brennan & Lee, 1999), the compound binomial procedure (Brennan & Lee, 1999), and the Feldt-Qualls procedure (Feldt & Qualls, 1998). These four procedures are based on different underlying assumptions. The IRT procedure is based on the unidimensional IRT model assumptions. The binomial and compound binomial procedures employ, as the distribution of errors, the binomial model and compound binomial model, respectively. By contrast, the Feldt-Qualls procedure does not depend on a particular psychometric model, and it simply translates any estimated conditional raw-score SEM to a conditional scale-score SEM. These procedures are compared in a simulation study, which involves two-dimensional data sets. The presence of two category dimensions reflects a violation of the IRT unidimensionality assumption. The relative accuracy of these procedures for estimating conditional scale-score standard errors of measurement is evaluated under various circumstances. The effects of three different types of transformations of raw scores are investigated including developmental standard scores, grade equivalents, and percentile ranks. All the procedures discussed appear viable. A general recommendation is made that test users select a procedure based on various factors such as the type of scale score of concern, characteristics of the test, assumptions involved in the estimation procedure, and feasibility and practicability of the estimation procedure.
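To make the binomial procedure concrete, the sketch below computes Lord's conditional raw-score SEM, sqrt(x(n - x)/(n - 1)), for a raw score x on an n-item test, and then translates it to the scale-score metric by multiplying by the local slope of the raw-to-scale conversion. The slope-based translation is an illustrative assumption in the spirit of the Feldt-Qualls approach, not necessarily the authors' exact formula; the conversion table and numbers are hypothetical.

```python
import numpy as np

def binomial_conditional_sem(x, n_items):
    """Conditional raw-score SEM under the binomial error model (Lord's formula)."""
    return np.sqrt(x * (n_items - x) / (n_items - 1))

def scale_score_sem(x, n_items, raw_to_scale):
    """Translate the conditional raw-score SEM to the scale-score metric by
    multiplying by the local slope of the raw-to-scale conversion (an assumption
    made here for illustration only)."""
    lo, hi = max(x - 1, 0), min(x + 1, n_items)
    slope = (raw_to_scale[hi] - raw_to_scale[lo]) / (hi - lo)
    return binomial_conditional_sem(x, n_items) * abs(slope)

# Hypothetical 40-item test with a simple linear conversion table
n = 40
conversion = {raw: 100 + 2.5 * raw for raw in range(n + 1)}
for raw in (10, 20, 30):
    print(raw,
          round(binomial_conditional_sem(raw, n), 2),
          round(scale_score_sem(raw, n, conversion), 2))
```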

8.
Two simulation studies investigated Type I error performance of two statistical procedures for detecting differential item functioning (DIF): SIBTEST and Mantel-Haenszel (MH). Because MH and SIBTEST are based on asymptotic distributions requiring "large" numbers of examinees, the first study examined Type I error for small sample sizes. No significant Type I error inflation occurred for either procedure. Because MH has the potential for Type I error inflation for non-Rasch models, the second study used a markedly non-Rasch test and systematically varied the shape and location of the studied item. When differences in distribution across examinee group of the measured ability were present, both procedures displayed inflated Type I error for certain items; MH displayed the greater inflation. Also, both procedures displayed statistically biased estimation of the zero DIF for certain items, though SIBTEST displayed much less than MH. When no latent distributional differences were present, both procedures performed satisfactorily under all conditions.
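For readers unfamiliar with the Mantel-Haenszel procedure mentioned above, it conditions on total score and pools 2x2 (group by right/wrong) tables across score strata. The sketch below computes the standard common odds ratio, the continuity-corrected MH chi-square, and the ETS delta index; the table layout and the numbers in the usage example are hypothetical.

```python
import numpy as np

def mantel_haenszel_dif(tables):
    """MH DIF statistics from a list of 2x2 tables, one per total-score stratum k:
    ((A_k, B_k), (C_k, D_k)) = ((ref right, ref wrong), (focal right, focal wrong))."""
    A = np.array([t[0][0] for t in tables], dtype=float)
    B = np.array([t[0][1] for t in tables], dtype=float)
    C = np.array([t[1][0] for t in tables], dtype=float)
    D = np.array([t[1][1] for t in tables], dtype=float)
    N = A + B + C + D
    n_ref, n_foc = A + B, C + D
    m_right, m_wrong = A + C, B + D

    alpha_mh = np.sum(A * D / N) / np.sum(B * C / N)        # MH common odds ratio
    expected = n_ref * m_right / N                          # E(A_k) under no DIF
    variance = n_ref * n_foc * m_right * m_wrong / (N**2 * (N - 1))
    chi2_mh = (abs(A.sum() - expected.sum()) - 0.5) ** 2 / variance.sum()
    delta_mh = -2.35 * np.log(alpha_mh)                     # ETS delta scale
    return alpha_mh, chi2_mh, delta_mh

# Two hypothetical score strata: ((ref right, ref wrong), (focal right, focal wrong))
tables = [((40, 10), (30, 20)), ((25, 5), (18, 12))]
print(mantel_haenszel_dif(tables))
```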

9.
Reading and Mathematics tests of multiple-choice items for grades Kindergarten through 9 were vertically scaled using the three-parameter logistic model and two different scaling procedures: concurrent and separate by grade groups. Item parameters were estimated using Markov chain Monte Carlo methodology while fixing the grade 4 population abilities to have a standard normal distribution. For the separate grade-groups scaling, grade groupings were linked using the Stocking and Lord test characteristic curve procedure. Abilities were estimated using the maximum-likelihood method. In either content area, scatterplots of item difficulty, discrimination, and ability estimates from the two methods showed consistently strong linear relationships. However, as grade deviated from the base grade of four, the best-fit line through the pairs of item discriminations started to rotate away from the identity line. This indicated that the discrimination estimates from the separate grade-groups procedure for extreme grades were, on average, higher than those from the concurrent analysis. The study also observed some systematic change in score variability across grades. In general, the two vertical scaling approaches yielded similar results at more grades in Reading than in Mathematics.

10.
Unbiased Estimation of the Variance and Standard Deviation of a Normal Population
For a normally distributed population, the moment estimators and the maximum likelihood estimators of the variance and the standard deviation are derived and their relationship is discussed; the two classes of estimators are shown to coincide, and unbiased estimators are then given.
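For reference, the standard normal-theory results that this abstract points to can be written out as follows (a sketch of the textbook formulas, not a reproduction of the paper's derivation):

```latex
% MLE of the variance, which equals the moment estimator, is biased:
\hat\sigma^{2}_{\mathrm{ML}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(X_i-\bar X\bigr)^{2},
\qquad
E\!\left[\hat\sigma^{2}_{\mathrm{ML}}\right] \;=\; \frac{n-1}{n}\,\sigma^{2}.

% The usual unbiased estimator of the variance:
S^{2} \;=\; \frac{1}{n-1}\sum_{i=1}^{n}\bigl(X_i-\bar X\bigr)^{2},
\qquad
E\!\left[S^{2}\right] \;=\; \sigma^{2}.

% S itself is biased for \sigma; under normality,
E[S] \;=\; c_4\,\sigma,
\qquad
c_4 \;=\; \sqrt{\frac{2}{n-1}}\;\frac{\Gamma(n/2)}{\Gamma\!\bigl((n-1)/2\bigr)},
% so an unbiased estimator of the standard deviation is
\hat\sigma_{\mathrm{unb}} \;=\; S / c_4 .
```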

11.
This study investigates the development of an adaptive strategy for the estimation of numerosity from the theoretical perspective of “strategic change” (Lemaire & Siegler, 1995; Siegler & Shipley, 1995). A simple estimation task was used in which participants of three different age groups (20 university students, 20 sixth-graders and 10 second-graders) had to estimate 100 numerosities of (colored) blocks presented in a 10x10 rectangular grid. Generally speaking, this task allows for two distinct estimation procedures: either repeatedly adding estimations of groups of blocks (=addition procedure) or subtracting the estimated number of empty squares from the (estimated) total number of squares in the grid (=subtraction procedure). A rational task analysis indicates that the most efficient overall estimation strategy consists of the adaptive use of both procedures, depending on the ratio of the blocks to the empty squares. The first hypothesis was that there will be a developmental difference in the adaptive use of the two procedures, and according to the second hypothesis this adaptive use will result in better estimation accuracy. Converging evidence from different kinds of data (i.e., response times, error rates, and retrospective reports) supported both hypotheses. From a methodological point of view, the study shows the potential of Beem’s (1995a, 1995b) “segmentation analysis” for unravelling subjects’ adaptive choices between different procedures in cognitive tasks, and for examining the relationship between these adaptive choices and performance.
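The rational task analysis described above amounts to a small decision rule: estimate whichever set (blocks or empty squares) is smaller, and subtract from the grid total when the empty squares are the ones estimated. A toy sketch is given below, under the assumption that estimation error grows with the number of elements to be estimated; the noise model and values are hypothetical.

```python
import random

def noisy_count(true_n, noise_per_item=0.03):
    """Toy assumption: estimation error grows with the number of elements estimated."""
    return true_n + random.gauss(0, noise_per_item * true_n)

def adaptive_estimate(true_blocks, grid_cells=100):
    """Addition procedure when blocks are the minority, subtraction procedure otherwise."""
    empties = grid_cells - true_blocks
    if true_blocks <= empties:
        return noisy_count(true_blocks)          # addition procedure: estimate the blocks
    return grid_cells - noisy_count(empties)     # subtraction procedure: total minus empties

# Hypothetical sparse and dense grids
print(adaptive_estimate(true_blocks=12), adaptive_estimate(true_blocks=93))
```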

12.
Item response theory (IRT) procedures have been used extensively to study normal latent trait distributions and have been shown to perform well; however, less is known concerning the performance of IRT with non-normal latent trait distributions. This study investigated the degree of latent trait estimation error under normal and non-normal conditions using four latent trait estimation procedures and also evaluated whether the test composition, in terms of item difficulty level, reduces estimation error. Most importantly, both true and estimated item parameters were examined to disentangle the effects of latent trait estimation error from item parameter estimation error. Results revealed that non-normal latent trait distributions produced a considerably larger degree of latent trait estimation error than normal data. Estimated item parameters tended to have comparable precision to true item parameters, thus suggesting that increased latent trait estimation error results from latent trait estimation rather than item parameter estimation.

13.
Achievement and cognitive tests are used extensively in the diagnosis and educational placement of children with reading disabilities (RD). Moreover, research on scholastic interventions often requires repeat testing and information on practice effects. Little is known, however, about the test-retest and other psychometric properties of many commonly used measures within the beginning reader population, nor are these nationally normed or experimental measures comparatively evaluated. This study examined the test-retest reliability, practice effects, and relations among a number of nationally normed measures of word identification and spelling and experimental measures of achievement and reading-related cognitive processing tests in young children with significant RD. Reliability was adequate for most tests, although lower than might be ideal on a few measures when there was a lengthy test-retest interval or with the reduced behavioral variability that can be seen in groups of beginning readers. Practice effects were minimal. There were strong relations between nationally normed measures of decoding and spelling and their experimental counterparts and with most measures of reading-related cognitive processes. The implications for the use of such tests in treatment studies that focus on beginning readers are discussed.

14.
Simulations of computerized adaptive tests (CATs) were used to evaluate results yielded by four commonly used ability estimation methods: maximum likelihood estimation (MLE) and three Bayesian approaches—Owen's method, expected a posteriori (EAP), and maximum a posteriori. In line with the theoretical nature of the ability estimates and previous empirical research, the results showed clear distinctions between MLE and the Bayesian methods, with MLE yielding lower bias, higher standard errors, higher root mean square errors, lower fidelity, and lower administrative efficiency. Standard errors for MLE based on test information underestimated actual standard errors, whereas standard errors for the Bayesian methods based on posterior distribution standard deviations accurately estimated actual standard errors. Among the Bayesian methods, Owen's provided the worst overall results, and EAP provided the best. Using a variable starting rule in which examinees were initially classified into three broad ability groups greatly reduced the bias for the Bayesian methods, but had little effect on the results for MLE. On the basis of these results, guidelines are offered for selecting appropriate CAT ability estimation methods in different decision contexts.
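As background for the Bayesian methods compared above, expected a posteriori (EAP) estimation reduces to a weighted average over quadrature points, and the posterior standard deviation serves as the standard error mentioned in the abstract. The sketch below assumes a 2PL response model and a standard normal prior; the item parameters and response pattern are hypothetical.

```python
import numpy as np

def eap_estimate(responses, a, b, n_quad=61):
    """EAP ability estimate and posterior SD for a 2PL model with a N(0,1) prior.
    responses: 0/1 vector; a, b: item discrimination and difficulty vectors."""
    theta = np.linspace(-4, 4, n_quad)                        # quadrature points
    prior = np.exp(-0.5 * theta**2)                           # unnormalized N(0,1) weights
    p = 1.0 / (1.0 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))
    like = np.prod(np.where(np.array(responses)[:, None] == 1, p, 1 - p), axis=0)
    post = prior * like
    post /= post.sum()
    eap = np.sum(theta * post)                                # posterior mean
    psd = np.sqrt(np.sum((theta - eap) ** 2 * post))          # posterior SD (standard error)
    return eap, psd

# Hypothetical 5-item test
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
b = np.array([-1.0, -0.3, 0.0, 0.6, 1.2])
print(eap_estimate([1, 1, 0, 1, 0], a, b))
```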

15.
In the presence of test speededness, the parameter estimates of item response theory models can be poorly estimated due to conditional dependencies among items, particularly for end-of-test items (i.e., speeded items). This article conducted a systematic comparison of five item calibration procedures—a two-parameter logistic (2PL) model, a one-dimensional mixture model, a two-step strategy (a combination of the one-dimensional mixture and the 2PL), a two-dimensional mixture model, and a hybrid model—by examining how sample size, percentage of speeded examinees, percentage of missing responses, and way of scoring missing responses (incorrect vs. omitted) affect the item parameter estimation in speeded tests. For nonspeeded items, all five procedures showed similar results in recovering item parameters. For speeded items, the one-dimensional mixture model, the two-step strategy, and the two-dimensional mixture model provided largely similar results and performed better than the 2PL model and the hybrid model in calibrating slope parameters. However, those three procedures performed similarly to the hybrid model in estimating intercept parameters. As expected, the 2PL model did not appear to be as accurate as the other models in recovering item parameters, especially when there were large numbers of examinees showing speededness and a high percentage of missing responses with incorrect scoring. Real data analysis further described the similarities and differences between the five procedures.

16.
In this article, we describe two United Kingdom (UK) screening tests for dyslexia: the Dyslexia Early Screening Test (DEST) and the Cognitive Profiling System (CoPS 1), both normed and designed to be administered by teachers to children four years and older. We first outline the political context in the UK, which for the first time, makes the use of such tests viable. We then outline the research programs behind and the components of each test; reliability and validity are also discussed. Information is presented on the tests in use. We conclude that tests such as these have the potential to identify children as at risk before they fail, halting the cycle of emotional and motivational problems traditionally associated with dyslexia. Both tests are appropriate for use in the United States, and initial reactions from the education sector have been favorable.

17.
Reporting confidence intervals with test scores helps test users make important decisions about examinees by providing information about the precision of test scores. Although a variety of estimation procedures based on the binomial error model are available for computing intervals for test scores, these procedures assume that items are randomly drawn from an undifferentiated universe of items, and therefore might not be suitable for tests developed according to a table of specifications. To address this issue, four interval estimation procedures that use category subscores for the computation of confidence intervals are presented in this article. All four estimation procedures assume that subscores instead of test scores follow a binomial distribution (i.e., compound binomial error model). The relative performance of the four compound binomial-based interval estimation procedures is compared to each other and to the better known normal approximation and Wilson score procedures based on the binomial error model.
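For context, the two binomial-error-model baselines named above (the normal approximation and the Wilson score procedure) can be written down directly for a proportion-correct score x/n on an n-item test. A brief sketch with hypothetical numbers follows; multiplying the endpoints by n puts the interval back on the raw-score scale.

```python
from math import sqrt

def normal_approx_interval(x, n, z=1.96):
    """Normal-approximation (Wald) interval for the proportion-correct score."""
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_score_interval(x, n, z=1.96):
    """Wilson score interval for the proportion-correct score."""
    p = x / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical examinee: 34 correct out of 40 items
print(normal_approx_interval(34, 40))
print(wilson_score_interval(34, 40))
```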

18.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
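The effort-moderated model described above incorporates response time by distinguishing solution behavior from rapid guessing. A common formalization, assumed here purely for illustration rather than taken from this article, flags a response as a rapid guess when its response time falls below an item-specific threshold and replaces the 3PL probability with a flat chance probability for flagged responses, so that they carry essentially no information about proficiency. Variable names, thresholds, and item parameters below are hypothetical.

```python
import numpy as np

def effort_moderated_loglik(theta, resp, rt, a, b, c, thresholds, n_options=4):
    """Log-likelihood of one examinee's responses under an effort-moderated 3PL:
    responses with rt below the item's threshold are treated as rapid guesses and
    given a flat chance probability (an assumed formalization, not the article's exact model)."""
    resp, rt = np.asarray(resp), np.asarray(rt)
    solution = rt >= np.asarray(thresholds)                     # solution-behavior flags
    p3pl = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))   # 3PL with D = 1.7
    p_correct = np.where(solution, p3pl, 1.0 / n_options)
    p_obs = np.where(resp == 1, p_correct, 1 - p_correct)
    return np.sum(np.log(p_obs))

# Hypothetical 4-item example: maximize over a coarse theta grid
thetas = np.linspace(-3, 3, 121)
resp, rt = [1, 0, 1, 1], [12.0, 2.1, 15.3, 9.8]
a, b, c = np.ones(4), np.array([-0.5, 0.0, 0.5, 1.0]), np.full(4, 0.2)
thresholds = np.full(4, 3.0)
lls = [effort_moderated_loglik(t, resp, rt, a, b, c, thresholds) for t in thetas]
print(thetas[int(np.argmax(lls))])
```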

19.
This study presents an empirical comparison of the accuracy of item sampling and examinee sampling in estimating norm statistics. Item samples were composed of 3, 6, or 12 items selected from a total test of 50 multiple-choice vocabulary questions. Overall, the study findings provided empirical evidence that item sampling is approximately as effective as examinee sampling for estimating the population mean and standard deviation. Contradictory trends occurred for lower ability and higher ability student populations in accuracy of estimated means and standard deviations when the number of items administered increased from 3 to 6 to 12. The findings from this study indicate that the variation of sequences of items occurring in item sampling need not have a significant effect on test performance.

20.
As part of developing a comprehensive strategy for structural equation model building and assessment, a large-scale Monte Carlo study of 7,200 covariance matrices sampled from 36 population models was conducted. This study compared maximum likelihood with the much simpler centroid method for the confirmatory factor analysis of multiple-indicator measurement models. Surprisingly, the contribution of maximum likelihood to model analysis is limited to formal evaluation of the model. No statistically discernible differences were obtained for the bias, standard errors, or mean squared error (MSE) of the estimated factor correlations, and empirically obtained maximum likelihood standard errors for the pattern coefficients were only slightly smaller than their centroid counterparts. Further supporting the recommendations of Anderson and Gerbing (1982), the considerably faster centroid method may have a useful role in the analysis of these models, particularly for the analysis of large models with 50 or more input variables. These results encourage the further development of a comprehensive research paradigm that exploits the relative strengths of both centroid and maximum likelihood as complementary estimation procedures along an integrated exploratory-confirmatory continuum of model specification, revision, and formal evaluation.
