1.

Comparative Power of Student T Test and Mann-Whitney U Test for Unequal Sample Sizes and Variances

Donald W. Zimmerman 《Journal of Experimental Education》2013,81(3):171-174

A computer program generated power functions of the Student t test and Mann-Whitney U test under violation of the parametric assumption of homogeneity of variance for equal and unequal sample sizes. In addition to depression and elevation of nominal significance levels of the t test observed by Hsu and by Scheffé, the entire power functions of both the t test and the U test were depressed or elevated. When the smaller sample was associated with a smaller variance, the U test was more powerful in detecting differences over the entire range of possible differences between population means. When sample sizes were equal, or when the smaller sample had the larger variance, the t test was more powerful over this entire range. These results show that replacement of the t test by a nonparametric alternative under violation of homogeneity of variance does not necessarily maximize correct decisions.

2.

How reliable are students’ evaluations of teaching quality? A variance components approach

Daniela Feistauer Tobias Richter 《Assessment & Evaluation in Higher Education》2017,42(8):1263-1279

The inter-rater reliability of university students’ evaluations of teaching quality was examined with cross-classified multilevel models. Students (N = 480) evaluated lectures and seminars over three years with a standardised evaluation questionnaire, yielding 4224 data points. The total variance of these student evaluations was separated into the variance components of courses, teachers, students and the student/teacher interaction. The substantial variance components of teachers and courses suggest reliability. However, a similar proportion of variance was due to students, and the interaction of students and teachers was the strongest source of variance. Students’ individual perceptions of teaching and the fit of these perceptions with the particular teacher greatly influence their evaluations. This casts some doubt on the validity of student evaluations as indicators of teaching quality and suggests that aggregated evaluation scores should be used with caution.

3.

Further Study of the Choice of Anchor Tests in Equating

Tammy J. Trierweiler Charles Lewis Robert L. Smith 《Journal of Educational Measurement》2016,53(4):498-518

In this study, we describe the factors that influence the observed score correlation between an (external) anchor test and a total test. We show that the anchor to full‐test observed score correlation is based on two components: the true score correlation between the anchor and total test, and the reliability of the anchor test. Findings using an analytical approach suggest that making an anchor test a miditest does not generally maximize the anchor to total test correlation. Results are discussed in the context of what conditions maximize the correlations between the anchor and total test.

4.

Reliability and validity of the Math Essential Skill Screener—Elementary Version (MESS-E)

Bradley T. Erford Donna L. Bagley James A. Hopper Ramona M. Lee Kathleen A. Panagopulos Denise B. Preller 《Psychology in the schools》1998,35(2):127-135

The Math Essential Skill Screener–Elementary Version (MESS-E) is a screener devised to identify primary grade students at risk for math difficulties. Item analysis, interitem consistency, test–retest reliability, decision efficiency, and construct validity of the MESS-E were studied using four independent samples of boys and girls grades 1–3 (aged 6–8). Item analysis revealed median item difficulty of .64 and median item discrimination of .75. Interitem consistency was .92 (n = 171) and .94 (n = 711), while 30-day test–retest reliability was .86 (n = 125). Exploratory factor analysis indicated a one-factor solution accounting for 37% of observed variance. LISREL 7 confirmatory factor analysis procedures determined that the one-factor model fit the standardization sample data poorly (goodness-of-fit index = .729, χ² to df ratio = 9.91). The MESS-E yielded concurrent validity coefficients (n = 171) of .74 with the Woodcock–Johnson: Tests of Achievement–Revised (WJ-R) Math Cluster, .80 with the Wide-Range Achievement Test–Revised (WRAT-R) Arithmetic subtest and .73 with the KeyMath-R Operations Area standard scores. A diagnostic efficiency study yielded a total predictive value (TPV) of .93, sensitivity = .98, specificity = .88, positive predictive power (PPP) = .89, negative predictive power (NPP) = .98, and incremental validity = 39%. The MESS-E displayed a slight tendency to overidentify children potentially at risk for math difficulties. © 1998 John Wiley & Sons, Inc.
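
The diagnostic efficiency indices named in this abstract can be illustrated with a minimal Python sketch. The 2x2 counts below are hypothetical and are not taken from the study; they were chosen only because they happen to reproduce the reported rates, and the function name `diagnostic_efficiency` is likewise an assumption made for illustration.

```python
def diagnostic_efficiency(tp, fp, fn, tn):
    """Return common screening indices from a 2x2 decision table.

    tp: at-risk children flagged by the screener (true positives)
    fp: not-at-risk children flagged by the screener (false positives)
    fn: at-risk children missed by the screener (false negatives)
    tn: not-at-risk children correctly passed (true negatives)
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # share of at-risk children detected
        "specificity": tn / (tn + fp),   # share of not-at-risk children passed
        "ppp": tp / (tp + fp),           # positive predictive power
        "npp": tn / (tn + fn),           # negative predictive power
        "tpv": (tp + tn) / total,        # total predictive value (overall hit rate)
    }


# Hypothetical counts for 100 children (NOT data from the MESS-E study).
indices = diagnostic_efficiency(tp=49, fp=6, fn=1, tn=44)
for name, value in indices.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative counts, false positives (6) outnumber false negatives (1), which is the kind of pattern the abstract describes as a slight tendency to overidentify.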

5.

Book reviews

Tom Bramley 《Educational research; a review for teachers and all concerned with progress in education》2013,55(3):325-330

Background: A recent article published in Educational Research on the reliability of results in National Curriculum testing in England (Newton, The reliability of results from national curriculum testing in England, Educational Research 51, no. 2: 181–212, 2009) suggested that: (1) classification accuracy can be calculated from classification consistency; and (2) classification accuracy on a single test administration is higher than classification consistency across two tests.

Purpose: This article shows that it is not possible to calculate classification accuracy from classification consistency. It then shows that, given reasonable assumptions about the distribution of measurement error, the expected classification accuracy on a single test administration is higher than the expected classification consistency across two tests only in the case of a pass–fail test, but not necessarily for tests that classify test-takers into more than two categories.

Main argument and conclusion: Classification accuracy is defined in terms of a ‘true score’ specified in a psychometric model. Three things must be known or hypothesised in order to derive a value for classification accuracy: (1) a psychometric model relating observed scores to true scores; (2) the location of the cut-scores on the score scale; and (3) the distribution of true scores in the group of test-takers.

6.

Validating Student Score Inferences With Person‐Fit Statistic and Verbal Reports: A Person‐Fit Study for Cognitive Diagnostic Assessment

Ying Cui Mary Roduta Roberts 《Educational Measurement》2013,32(1):34-42

The goal of this study was to investigate the usefulness of person‐fit analysis in validating student score inferences in a cognitive diagnostic assessment. In this study, a two‐stage procedure was used to evaluate person fit for a diagnostic test in the domain of statistical hypothesis testing. In the first stage, the person‐fit statistic, the hierarchy consistency index (HCI; Cui, 2007; Cui & Leighton, 2009), was used to identify the misfitting student item‐score vectors. In the second stage, students’ verbal reports were collected to provide additional information about students’ response processes so as to reveal the actual causes of misfits. This two‐stage procedure helped to identify the misfits of item‐score vectors to the cognitive model used in the design and analysis of the diagnostic test, and to discover the reasons for misfits so that students’ problem‐solving strategies were better understood and their performances were interpreted in a more meaningful way.

7.

Validation against a Fallible Criterion

Edward E. Cureton 《Journal of Experimental Education》2013,81(3):258-263

This paper presents the results of a simulation study to compare the performance of the Mann-Whitney U test, Student's t test, and the alternate (separate variance) t test for two mutually independent random samples from normal distributions, with both one-tailed and two-tailed alternatives. The estimated probability of a Type I error was controlled (in the sense of being reasonably close to the attainable level) by all three tests when the variances were equal, regardless of the sample sizes. However, it was controlled only by the alternate t test for unequal variances with unequal sample sizes. With equal sample sizes, the probability was controlled by all three tests regardless of the variances. When it was controlled, we also compared the power of these tests and found very little difference. This means that very little power will be lost if the Mann-Whitney U test is used instead of tests that require the assumption of normal distributions.

8.

Two readiness measures as predictors of first and third-grade reading achievement

Mildred A. Randel Maurine A. Fry Elizabeth M. Ralls 《Psychology in the schools》1977,14(1):37-40

Multiple-regression procedures were used to assess the effectiveness of the ABC Inventory and the Metropolitan Readiness Test (MRT) in predicting first and third-grade reading achievement. Sex and chronological age were included in the first-grade analysis (N = 62) and the first-grade PMA intelligence test score was added to the equation in predicting third-grade reading (N = 65). MRT performance accounted for 11% of the variance in first-grade SRA reading scores (R = .34). In predicting third-grade reading, the MRT accounted for 26% of the variance and the PMA IQ scores accounted for an additional 6% (final multiple R = .57). No other predictor made a significant contribution to explaining variance in first or third-grade reading achievement.
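
The link between the multiple R values and the percentages of variance reported in this abstract is the usual squared-correlation relationship. The short Python sketch below only checks that arithmetic; the helper name `variance_explained` is an assumption for illustration, and the small gap between .34 squared and the reported 11% is presumably a rounding effect in the published R.

```python
def variance_explained(multiple_r):
    """Proportion of criterion variance accounted for, i.e. R squared."""
    return multiple_r ** 2


# R values as reported in the abstract.
first_grade = variance_explained(0.34)   # reported as 11% of variance
third_grade = variance_explained(0.57)   # reported as 26% + 6% = 32% of variance

print(f"first grade: R^2 = {first_grade:.4f}")
print(f"third grade: R^2 = {third_grade:.4f}")
```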

9.

Stability and correlates of student evaluations of teaching at a Chinese university

Guo‐Hai Chen David Watkins 《Assessment & Evaluation in Higher Education》2010,35(6):675-685

This paper examines the stability and validity of a student evaluations of teaching (SET) instrument used by the administration at a university in the PR China. The SET scores for two semesters of courses taught by 435 teachers were collected. A total of 388 teachers (170 males and 218 females) were also invited to fill out the 60‐item NEO Five‐Factor Inventory together with a demographic information questionnaire. The SET responses were found to have very high internal consistency, and confirmatory factor analysis supported a one‐factor solution. The SET re‐test correlations were .62 for both the teachers who taught the same course (n = 234) and those who taught a different course in the second semester (n = 201). Linguistics teachers received higher SET scores than social science, humanities, or science and technology teachers. Student ratings were significantly related to Neuroticism and Extraversion. Regression results showed that the Big‐Five personality traits as a group explained only 2.6% of the total variance of student ratings and academic discipline explained 12.7% of the total variance of student ratings. Overall, the stability and validity of SET were supported, and future uses of SET scores in the PR China are discussed.

10.

Measuring service quality in higher education: three instruments compared

Firdaus Abdullah 《International Journal of Research & Method in Education》2013,36(1):71-89

Measuring the quality of service in higher education is increasingly important, particularly as fees introduce a more consumerist ethic amongst students. This paper aims to test and compare the relative efficacy of three measuring instruments of service quality (namely HEdPERF, SERVPERF and the moderating scale of HEdPERF‐SERVPERF) within a higher education setting. The objective was to determine which instrument had the superior measuring capability in terms of unidimensionality, reliability, validity and explained variance. Tests were conducted utilizing a sample of higher education students, and the findings indicated that the HEdPERF scale resulted in more reliable estimations, greater criterion and construct validity, greater explained variance, and consequently was a better fit than the other two instruments. Consequently, a modified five‐factor structure of HEdPERF is put forward as the superior scale for the higher education sector.

11.

The Effect of Ignoring Classroom‐Level Variance in Estimating the Generalizability of School Mean Scores

Xin Wei Edward Haertel 《Educational Measurement》2011,30(1):13-22

Contemporary educational accountability systems, including state‐level systems prescribed under No Child Left Behind as well as those envisioned under the “Race to the Top” comprehensive assessment competition, rely on school‐level summaries of student test scores. The precision of these score summaries is almost always evaluated using models that ignore the classroom‐level clustering of students within schools. This paper reports balanced and unbalanced generalizability analyses investigating the consequences of ignoring variation at the level of classrooms within schools when analyzing the reliability of such school‐level accountability measures. Results show that the reliability of school means cannot be determined accurately when classroom‐level effects are ignored. Failure to take between‐classroom variance into account biases generalizability (G) coefficient estimates downward and standard errors (SEs) upward if classroom‐level effects are regarded as fixed, and biases G‐coefficient estimates upward and SEs downward if they are regarded as random. These biases become more severe as the difference between the school‐level intraclass correlation (ICC) and the class‐level ICC increases. School‐accountability systems should be designed so that classroom (or teacher) level variation can be taken into consideration when quantifying the precision of school rankings, and statistical models for school mean score reliability should incorporate this information.

12.

A study of the power associated with testing factor mean differences under violations of factorial invariance

David Kaplan Rani George 《Structural equation modeling》2013,20(2):101-118

We examine the power associated with the test of factor mean differences when the assumption of factorial invariance is violated. Utilizing the Wald test for obtaining power, issues of model size, sample size, and total versus partial noninvariance are considered along with variation of actual factor mean differences. Results of a population study show that power is profoundly affected by true factor mean differences but is relatively unaffected by the degree of factor loading noninvariance. Inequality of sample size has a profound effect on power probabilities with power decreasing as sample sizes become increasingly disparate. Sample size variations operate such that power is uniformly lower when the group with the smaller generalized variance is associated with the smaller sample size. An increase in the number of variables yields uniformly larger power probabilities. No substantial differences are found between total and partial noninvariance. Results are related to work in the area of robustness of Hotelling's T² statistic and discussed in terms of asymptotic covariability of factor means and factor loadings. Implications for practice are considered.

13.

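Application of Multivariate Generalizability Theory to Item-Writing Quality Control in the Self-Taught Higher Education Examinations: The Case of the Beijing "English Proficiency Test (I), Written Test"

Tian Lin (田霖) Wang Qiaoying (王桥影) Zhao Xiaomang (赵晓茫) 《考试研究》 (Examinations Research) 2012,(3):57-64

Generalizability theory, as a new generation of measurement theory, is increasingly being applied to large-scale testing. This article uses multivariate generalizability theory to examine the measurement reliability, the composition of the total score, the decision reliability of the passing score, and the optimization of the test structure for the self-taught examination paper "English Proficiency Test (I), Written Test". The study found that the measurement reliability of this examination was relatively high; the proportion of variance that each subtest contributed to the universe composite score was broadly consistent with the intended score weighting of the paper; using 60 points as the passing score gave relatively high decision reliability; and increasing the number of items in every subtest to 15, or increasing the number of items in the vocabulary subtest alone to 20, would effectively improve measurement reliability.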

14.

Using Excel for White's Test—An Important Technique for Evaluating the Equality of Variance Assumption and Model Specification in a Regression Analysis

Mark L. Berenson 《Decision Sciences Journal of Innovative Education》2013,11(3):243-262

There is consensus in the statistical literature that severe departures from its assumptions invalidate the use of regression modeling for purposes of inference. The assumptions of regression modeling are usually evaluated subjectively through visual, graphic displays in a residual analysis but such an approach, taken alone, may be insufficient for assessing the appropriateness of the fitted model. Here, an easy‐to‐use test of the assumption of equal variance (i.e., homoscedasticity) as well as model specification is provided. Given the importance of the equal‐variance assumption (i.e., if uncorrected, severe violations preclude the use of statistical inference and moderate violations result in a loss of statistical power) and given the fact that, if uncorrected, a misspecified or underspecified model could invalidate an entire study, the test developed by Halbert White in 1980 is recommended for supplementing a graphic residual analysis when teaching regression modeling to business students at both the undergraduate and graduate levels. Using this confirmatory approach to supplement a traditional residual analysis has value because students often find that graphic displays are too subjective for determining what constitutes severe from moderate departures from the equal variance assumption or for assessing patterns in plots that might indicate model misspecification or underspecification.

15.

Use of the Rasch measurement model to explore the relationship between content knowledge and topic-specific pedagogical content knowledge for organic chemistry

Bette Davidowitz Marietjie Potgieter 《International Journal of Science Education》2016,38(9):1483-1503

Research has shown that a high level of content knowledge (CK) is necessary but not sufficient to develop the special knowledge base of expert teachers known as pedagogical content knowledge (PCK). This study contributes towards research to quantify the relationship between CK and PCK in science. In order to determine the proportion of the variance in PCK accounted for by the variance in CK, instruments are required which are valid and reliable as well as being unidimensional to measure person abilities for CK and PCK. An instrument consisting of two paper-and-pencil tests was designed to assess Grade 12 teachers' CK and PCK in organic chemistry. We used the Rasch measurement model to convert raw score data into interval measures and to provide empirical evidence for the validity, reliability and unidimensionality of the tests. The correlation between CK and PCK was estimated as r = .66 (p < .001). We found evidence to suggest that while topic-specific PCK (TSPCK) develops with increasing teaching experience, high levels of CK can be acquired with limited teaching experience. These findings support the hypothesis that CK is a requirement for the development of TSPCK; proficiency in CK is, however, not necessarily associated with high levels of TSPCK.

16.

Long-term stability of students' evaluations: A note on Feldman's “consistency and variability among college students in rating their teachers and courses”

Herbert W. Marsh Dr. J. U. Overall 《Research in higher education》1979,10(2):139-147

Feldman (1977), reviewing research about the reliability of student evaluations, reported that while class average responses were quite reliable (.80s and .90s), single rater reliabilities were typically low (.20s). However, studies he reviewed determined single rater reliability with internal consistency measures which assumed that differences among students in the same class (within-class variance) were completely random, an assumption which Feldman seriously questioned. In the present study, this assumption was tested by collecting evaluations from the same students at the end of each class and again one year after graduation. Single rater reliability based upon an internal consistency approach (agreement among different students in the same class) was similar to that reported by Feldman. However, single rater reliability based upon a stability approach (agreement between end-of-term and follow-up ratings by the same student) was much higher (median r = .59). These results indicate that individual student evaluations were remarkably stable over time and more reliable than previously assumed. Most important, there was systematic information in individual student ratings, beyond that implied by the class average response, that internal consistency approaches have ignored or assumed to be nonexistent.
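
The contrast between low single rater reliabilities (.20s) and high class average reliabilities (.80s and .90s) is what the standard Spearman-Brown formula predicts when many raters are averaged. The Python sketch below illustrates that relationship only; the abstract does not state which formula the reviewed studies used, and the class sizes and the function name `spearman_brown` are hypothetical.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Reliability of the mean of n_raters parallel ratings (Spearman-Brown)."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# A single rater reliability of .20, averaged over hypothetical class sizes.
for class_size in (10, 25, 50):
    reliability = spearman_brown(0.20, class_size)
    print(f"class of {class_size}: reliability of class average = {reliability:.2f}")
```

Under these assumptions, a class of about 25 students already moves a single rater reliability of .20 into the .80s for the class average.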

17.

A Formula for Predicting the Comprehension Level of Material to be Presented Orally

《The Journal of educational research》2012,105(4):218-220

A total of 456 teachers, secondary students, and elementary students from nine public schools in a southeastern state took part in the study. The authors produced an operational definition for environmental robustness or the relative dramatic content of school structures. It was composed of 10 semantic differential pairs able to discriminate dramatic content and representative of a single factor measuring about two-thirds of the test variance for the concept "dramatic." The 10 pairs demonstrated a degree of test-retest reliability both as individual items and combined as a total instrument score using a new sample of 84 secondary school students from the same area. As predicted and with this same sample of 84 students, the mean environmental robustness score for students holding a positive evaluation of their school was significantly higher than the mean robustness score for students holding a neutral or negative evaluation of their school.

18.

The Construction and Interpretation of Differential Ability Patterns

David Segel 《Journal of Experimental Education》2013,81(3):283-287

Increasing the correlation between the independent variable and the mediator (the a coefficient) increases the effect size (ab) for mediation analysis; however, increasing a by definition increases collinearity in mediation models. As a result, the standard errors of product tests increase. The variance inflation caused by increases in a at some point outweighs the increase of the effect size (ab) and results in a loss of statistical power. This phenomenon also occurs with nonparametric bootstrapping approaches because the variance of the bootstrap distribution of ab approximates the variance expected from normal theory. Both variances increase dramatically when a exceeds the b coefficient, thus explaining the power decline with increases in a. Implications for statistical analysis and applied researchers are discussed.

19.

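Reliability Estimation Methods for Test Papers Containing a Single High-Scoring Subjective Item

Yang Zhiming (杨志明) Ding Gang (丁港) Wang Wen (王雯) 《教育测量与评价(理论版)》 (Educational Measurement and Evaluation, Theory Edition) 2021,(1):44-48

Assessment reliability is one of the core indicators of examination quality, but conventional reliability estimation methods are not appropriate for test papers that contain a single high-scoring subjective item, because such an item has too large an influence on the variance of the total test score. One way to address this problem is to first estimate the reliability of the single high-scoring subjective item and then apply the stratified α coefficient formula to estimate the reliability of the whole paper. There are two ways to estimate the reliability of the single high-scoring subjective item: a test-retest reliability estimate, or an estimation method based on the fact that the correlation between two random variables is attenuated by the presence of random error.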

20.

Psychometric Properties of IRT Proficiency Estimates

Michael J. Kolen Ye Tong 《Educational Measurement》2010,29(3):8-14

Psychometric properties of item response theory proficiency estimates are considered in this paper. Proficiency estimators based on summed scores and pattern scores include non-Bayes maximum likelihood and test characteristic curve estimators and Bayesian estimators. The psychometric properties investigated include reliability, conditional standard errors of measurement, and score distributions. Four real-data examples include (a) effects of choice of estimator on score distributions and percent proficient, (b) effects of the prior distribution on score distributions and percent proficient, (c) effects of test length on score distributions and percent proficient, and (d) effects of proficiency estimator on growth-related statistics for a vertical scale. The examples illustrate that the choice of estimator influences score distributions and the assignment of examinee to proficiency levels. In particular, for the examples studied, the choice of Bayes versus non-Bayes estimators had a more serious practical effect than the choice of summed versus pattern scoring.