1.

Post Hoc Analysis of Teacher-Made Tests: The Goodness-of-Fit Between Prescription and Practice

Arlen R. Gullickson Mary C. Ellwein 《Educational Measurement》1985,4(1):15-18

2.

Interpreting Scores from Standardized Tests

Rose A. Howard 《Clearing house (Menasha, Wis.)》2013,86(4):155-157

This essay seeks to establish a metaphor of the professional practice of teaching to the attributes and training of an offensive lineman in the game of American football. Effective classroom instruction does not rely exclusively on a rare set of talents but rather rests on the commitment to the work of teaching. Like the position of offensive lineman, the profession of teaching is one of service. And more, it is one in which the person's performance can blossom through intense determination. An invitation is offered to serve as an effective teacher.

3.

Transmutation of Scores Between Binet Tests and Group Tests

《The Journal of educational research》2012,105(4):338-343

4.

The Reliability of Test Scores

《The Journal of educational research》2012,105(5):370-379

5.

The Reliability of Ratings Versus the Reliability of Scores

Mark D. Reckase 《Educational Measurement》1995,14(4):31-31

6.

Independence and Non-Independence of True Scores and Error Scores in Mental Tests

Donald W. Zimmerman Richard H. Williams 《Journal of Experimental Education》2013,81(3):59-64

A comparison of animism in college males and females was made. The test instrument was the Crowell-Dole Information Scale, a self-report questionnaire of common objects. A total of 59.8 percent of all Ss indicated animistic tendencies. Chi-square analysis of the raw data indicated no significant difference in incidents of animism for males and females. No significant difference was found between those students having one or more college biology courses and those with no formal training in biology.

7.

Tests, Test Scores, and Constructs

Edward H. Haertel 《Educational Psychologist》2018,53(3):203-216

In the service of educational accountability, student achievement tests are being used to measure constructs quite unlike those envisioned by test developers. Scores are compared to cut points to create classifications like “proficient”; scores are combined over time to measure growth; student scores are aggregated to measure the effectiveness of teachers, schools, and school districts; indices are created to measure college and career readiness. These and other new uses rely on derived scores created to measure new constructs. The field of educational and psychological measurement has largely ignored these significant, consequential measurement applications. The conceptual frameworks and analytical tools of educational and psychological measurement should be used to study such derived scores and the validity of their uses and interpretations.

8.

Comparability of Scores From Performance Assessments

Bert F. Green 《Educational Measurement》1995,14(4):13-15

Why is comparability of forms important for performance assessments? Can traditional methods of form equating be used? What problems are likely to arise in equating? Can standards generalize across forms?

9.

IRT Approaches to Modeling Scores on Mixed-Format Tests

Won-Chan Lee Stella Y. Kim Jiwon Choi Yujin Kang 《Journal of Educational Measurement》2020,57(2):230-254

This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.

10.

Effects of Differentially Time-Consuming Tests on Computer-Adaptive Test Scores

Brent Bridgeman Frederick Cline 《Journal of Educational Measurement》2004,41(2):137-148

Time limits on some computer-adaptive tests (CATs) are such that many examinees have difficulty finishing, and some examinees may be administered tests with more time-consuming items than others. Results from over 100,000 examinees suggested that about half of the examinees must guess on the final six questions of the analytical section of the Graduate Record Examination if they were to finish before time expires. At the higher-ability levels, even more guessing was required because the questions administered to higher-ability examinees were typically more time consuming. Because the scoring model is not designed to cope with extended strings of guesses, substantial errors in ability estimates can be introduced when CATs have strict time limits. Furthermore, examinees who are administered tests with a disproportionate number of time-consuming items appear to get lower scores than examinees of comparable ability who are administered tests containing items that can be answered more quickly, though the issue is very complex because of the relationship of time and difficulty, and the multidimensionality of the test.

11.

The Reliability and Validity of Weighted Composite Scores

《Applied Measurement in Education》2013,26(3):221-240

The scores on 2 distinct tests (e.g., essay and objective) are often combined to create a composite score, which is used to make decisions. The validity of the observed composite can sometimes be evaluated relative to an external criterion. However, in cases where no criterion is available, the observed composite has generally been evaluated in terms of its reliability. The analyses in this article are based on a simple, content-based model for the validity of the observed composite as an estimate of a target composite, based on a priori weights for the 2 tests. The results suggest that giving extra weight to the more reliable of the 2 observed scores tends to improve the reliability of the composite, and up to a point tends to improve its validity. Giving too much weight to the more reliable score can decrease the validity of the observed composite as a measure of the target composite.
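
As a rough illustration of the reliability claim in this abstract (not the article's content-based validity model), the classical test theory expression for the reliability of a weighted sum of two scores can be sketched in Python. The function name, argument names, and numeric values are hypothetical, and measurement errors on the two tests are assumed to be uncorrelated.

```python
def composite_reliability(w1, w2, sd1, sd2, rel1, rel2, r12):
    """Reliability of C = w1*X1 + w2*X2 under classical test theory.

    Assumes (hypothetically) that the errors of the two tests are
    uncorrelated with each other and with the true scores.
    sd1, sd2   : observed-score standard deviations
    rel1, rel2 : reliabilities of the two observed scores
    r12        : observed correlation between the two scores
    """
    # Observed variance of the weighted composite.
    var_c = (w1 * sd1) ** 2 + (w2 * sd2) ** 2 + 2 * w1 * w2 * sd1 * sd2 * r12
    # Error variance of the composite: weighted sum of each test's error variance.
    err_var = (w1 * sd1) ** 2 * (1 - rel1) + (w2 * sd2) ** 2 * (1 - rel2)
    return 1 - err_var / var_c


# Made-up values: test 1 is the more reliable of the two.
equal_weights = composite_reliability(1, 1, 1.0, 1.0, 0.9, 0.6, 0.5)
extra_weight = composite_reliability(2, 1, 1.0, 1.0, 0.9, 0.6, 0.5)
print(round(equal_weights, 3), round(extra_weight, 3))
```

With these made-up inputs, doubling the weight on the more reliable test raises the composite reliability from about 0.83 to about 0.89. The sketch says nothing about validity, which is the part of the argument the article treats with its own model.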

12.

A Comparison of Scores on two College Freshman Intelligence Tests

《The Journal of educational research》2012,105(9):666-667

13.

On the Reliability of Testlet-Based Tests

Stephen G. Sireci David Thissen Howard Wainer 《Journal of Educational Measurement》1991,28(3):237-247

If a test is constructed of testlets, one must take into account the within-testlet structure in the calculation of test statistics. Failing to do so may yield serious biases in the estimation of such statistics as reliability. We demonstrate how to calculate the reliability of a testlet-based test. We show that traditional reliabilities calculated on two reading comprehension tests constructed of four testlets are substantial overestimates.

14.

The Reliability of Difference Scores in Populations and Samples

Donald W. Zimmerman 《Journal of Educational Measurement》2009,46(1):19-42

This study was an investigation of the relation between the reliability of difference scores, considered as a parameter characterizing a population of examinees, and the reliability estimates obtained from random samples from the population. The parameters in familiar equations for the reliability of difference scores were redefined in such a way that determinants of reliability in both populations and samples become more transparent. Computer simulation was used to find sample values and to plot frequency distributions of various correlations and variance ratios relevant to the reliability of differences. The shape of frequency distributions resulting from the simulations and the means and standard deviations of these distributions reveal the extent to which reliability estimates based on sample data can be expected to meaningfully represent population reliability.
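
For orientation, one familiar classical test theory form of the reliability of a difference score D = X − Y is given below. This is the standard textbook expression, assuming uncorrelated errors; it is not the redefined parameterization the article introduces, and the symbols are chosen here for illustration.

```latex
\rho_{DD'} =
  \frac{\sigma_X^{2}\,\rho_{XX'} + \sigma_Y^{2}\,\rho_{YY'} - 2\,\rho_{XY}\,\sigma_X\,\sigma_Y}
       {\sigma_X^{2} + \sigma_Y^{2} - 2\,\rho_{XY}\,\sigma_X\,\sigma_Y}
```

Here σ_X and σ_Y are the observed-score standard deviations, ρ_XX' and ρ_YY' are the reliabilities of X and Y, and ρ_XY is the correlation between X and Y. When the two variances are equal, the expression reduces to ((ρ_XX' + ρ_YY')/2 − ρ_XY) / (1 − ρ_XY), which shows how a high correlation between X and Y lowers the reliability of their difference.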

15.

Stability of School-Level Scores From Large-Scale Student Assessments

《Applied Measurement in Education》2013,26(2):173-185

More attention is being given to evaluating the quality of school-level assessment scores due to their importance for school-based planning and monitoring effectiveness. In this study, cross-year stability is proposed as an indicator of data quality and the degree of stability that is appropriate for large-scale assessments of student performance is explored. Following a search of Internet sites, Year 1 to Year 2 stability coefficients were calculated for assessment data from 21 states and 2 provinces. The median stability coefficient was .78 in mathematics and reading, but coefficients for writing were generally lower. A stability coefficient of .80 is recommended as the standard for large-scale assessments of student performance. A high degree of cross-year stability makes it easier to detect and attribute changes in school-level scores to school improvement efforts. The link between stability and reliability and several factors that may attenuate stability are discussed.

16.

Methodology for Examining the Reliability of Group Mean Difference Scores

Robert L. Brennan Ping Yin Michael T. Kane 《Journal of Educational Measurement》2003,40(3):207-230

This article treats various procedures for examining the reliability of group mean difference scores, with particular emphasis on procedures from univariate and multivariate generalizability theory. Attention is given to both traditional norm-referenced perspectives on reliability as well as criterion-referenced perspectives that focus on error-tolerance ratios and functions of them. The procedures discussed are illustrated using three cohorts of data for third- and fourth-grade students in Iowa who took the Iowa Tests of Basic Skills in recent years. For these data, estimates of reliability for norm-referenced decisions tend to be relatively low. By contrast, for criterion-referenced decisions, estimates of reliability-like coefficients based on error-tolerance ratios tend to be noticeably larger.

17.

Empirical Estimates of the Comparative Reliability of Matching Tests and Multiple-Choice Tests

《Journal of Experimental Education》2012,80(3):179-182

Equivalent forms of a ten-item completion test were constructed. The same test items then were rewritten in matching format and in multiple-choice format, resulting in two forms (A and B) of each of three types of test. All tests were administered to 73 examinees, and parallel-forms reliability coefficients (correlation between scores on A and B) were calculated. These empirically obtained values were compared to the values of the reliability coefficient predicted from theoretically derived equations which indicate the influence of chance success due to guessing on test reliability. In accordance with theory it was found that the completion test was more reliable than the matching test and that the matching test was more reliable than the multiple-choice test. The empirically obtained reliability coefficients were very close to those predicted from the mathematically derived formulas.
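
The parallel-forms coefficient described in this abstract is the correlation between examinees' scores on Form A and Form B. A minimal Python sketch follows, assuming a Pearson product-moment correlation (the abstract does not name the coefficient) and using hypothetical scores rather than the study's data; the function and variable names are illustrative.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mx, my = mean(x), mean(y)
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ten-item test scores for six examinees on each form.
form_a = [7, 5, 9, 4, 8, 6]
form_b = [6, 5, 9, 3, 8, 7]

# Parallel-forms reliability estimate: correlation between the two forms.
parallel_forms_r = pearson_r(form_a, form_b)
print(f"parallel-forms reliability estimate: {parallel_forms_r:.3f}")
```

Computing this coefficient separately for each item format would give the kind of empirical values the study compares against its theoretical predictions.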

18.

Using Patterns of Summed Scores in Paper-and-Pencil Tests and Computer-Adaptive Tests to Detect Misfitting Item Score Patterns

Rob R. Meijer 《Journal of Educational Measurement》2004,41(2):119-136

Two new methods have been proposed to determine unexpected sum scores on sub-tests (testlets) both for paper-and-pencil tests and computer adaptive tests. A method based on a conservative bound using the hypergeometric distribution, denoted p, was compared with a method where the probability for each score combination was calculated using a highest density region (HDR). Furthermore, these methods were compared with the standardized log-likelihood statistic with and without a correction for the estimated latent trait value (denoted as l*_z and l_z, respectively). Data were simulated on the basis of the one-parameter logistic model, and both parametric and non-parametric logistic regression was used to obtain estimates of the latent trait. Results showed that it is important to take the trait level into account when comparing subtest scores. In a nonparametric item response theory (IRT) context, an adapted version of the HDR method was a powerful alternative to p. In a parametric IRT context, results showed that l*_z had the highest power when the data were simulated conditionally on the estimated latent trait level.

19.

The Reliability of Tests Requiring Alternative Responses

《The Journal of educational research》2012,105(3):234-240

20.

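A Reliability Generalization Study of the MMPI as Applied in China

焦璨 张洁婷 吴利 张敏强 《Journal of South China Normal University (Social Science Edition)》2010,(4)

A reliability generalization study was conducted on articles about the Minnesota Multiphasic Personality Inventory (MMPI) published in China from 1989 to 2008. Descriptive analyses covered how reliability coefficients were reported, the levels of reliability, and the variability of reliability for the MMPI's 10 clinical scales and 3 validity scales. With sample type, sample size, and similar characteristics as predictors, the factors influencing the reliability of the MMPI scales were explored. The findings were then compared with those of reliability generalization studies of the MMPI conducted outside China; the comparison showed both similarities and differences in reliability levels, in the variability of reliability coefficients, and in their predictors.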