Similar literature
20 similar documents found.
1.
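Performance appraisal is a core function of human resource management. At present, enterprise performance appraisal suffers from pitfalls such as subjective selection of appraisal indicators and arbitrary choice of appraisal methods. The design of appraisal indicators should be subjected to reliability and validity analysis and should be feasible. Appraisal methods fall into two categories according to the number of people being appraised, individual appraisal methods and multi-person appraisal methods; they are carried out by different evaluators along multiple dimensions and should be selected according to the enterprise's life cycle, size, and appraisal cost.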

2.
Any examination that involves moderate- to high-stakes implications for examinees should be psychometrically sound and legally defensible. Currently, there are two broad and competing families of test theories that are used to score examination data. The majority of instructors outside the high-stakes testing arena rely on classical test theory (CTT) methods. However, advances in item response theory software have made the application of these techniques much more accessible to classroom instructors. The purpose of this research is to analyze a common medical school anatomy examination using both the traditional CTT scoring method and a Rasch measurement scoring method to determine which technique provides more robust findings, and which set of psychometric indicators will be more meaningful and useful for anatomists looking to improve the psychometric quality and functioning of their examinations. Results produced by the more robust and meaningful methodology will undergo a rigorous psychometric validation process to evaluate construct validity. Implications of these techniques and additional possibilities for advanced applications are also discussed. Anat Sci Educ 7: 450–460. © 2014 American Association of Anatomists.
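
The contrast between the two scoring families in the abstract above can be illustrated with a minimal sketch. It is not the authors' analysis: the function names and the toy response matrix are assumptions made for illustration. The CTT side computes an item's difficulty (proportion correct) and a rest-score point-biserial discrimination; the Rasch side only evaluates the model's response probability for a given ability `theta` and item difficulty `b`, without estimating either parameter.

```python
import math
from statistics import mean, pstdev


def item_difficulty(responses, item):
    """CTT difficulty: proportion of examinees who answered the item correctly."""
    return mean(row[item] for row in responses)


def point_biserial(responses, item):
    """CTT discrimination: correlation of the 0/1 item score with the
    total score on the remaining items (rest score)."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    sx, sy = pstdev(item_scores), pstdev(rest_scores)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either score has no variance
    mx, my = mean(item_scores), mean(rest_scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, rest_scores))
    return cov / (sx * sy)


def rasch_probability(theta, b):
    """Rasch model: probability of a correct response given person
    ability theta and item difficulty b (both on the logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical scored responses: rows are examinees, columns are items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(item_difficulty(responses, 0))
print(point_biserial(responses, 0))
print(rasch_probability(0.0, 0.0))
```

In practice, Rasch person and item parameters are estimated from the data with dedicated software; the probability function is shown only to make the model's form concrete.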

3.
Examinees who take high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on their initial attempt. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign a different test form to repeat examinees. The use of multiple forms is expensive and can present psychometric challenges, particularly for low-volume credentialing programs; thus, it is important to determine if unwarranted score gains actually occur. Prior studies provide strong evidence that the same-form advantage is pronounced for aptitude tests. However, the sparse research within the context of achievement and credentialing testing suggests that the same-form advantage is minimal. For the present experiment, 541 examinees who failed a national certification test were randomly assigned to receive either the same test or a different (parallel) test on their second attempt. Although the same-form group had shorter response times on the second administration, score gains for the two groups were indistinguishable. We discuss factors that may limit the generalizability of these findings to other assessment contexts.

4.
Does reviewing previous answers during multiple-choice exams help examinees increase their final score? This article formalizes the question using a rigorous causal framework, the potential outcomes framework. Viewing examinees’ reviewing status as a treatment and their final score as an outcome, the article first explains the challenges of identifying the causal effect of answer reviewing in regular exam-taking settings. In addition to the impossibility of randomizing the treatment selection (reviewing status) and the lack of other information to make this selection process ignorable, the treatment variable itself is not fully known to researchers. From examinees’ answer sheet data, it is unclear whether an examinee who did not change his or her answer on a specific item reviewed it but retained the initial answer (treatment condition) or chose not to review it (control condition). Despite these challenges, the article develops partial identification strategies and shows that the sign of the answer reviewing effect can be reasonably inferred. By analyzing a statewide math assessment data set, the article finds that reviewing initial answers is generally beneficial for examinees.

5.
Standardizing aspects of assessments has long been recognized as a tactic to help make evaluations of examinees fair. It reduces variation in irrelevant aspects of testing procedures that could advantage some examinees and disadvantage others. However, recent attention to making assessment accessible to a more diverse population of students highlights situations in which making tests identical for all examinees can make a testing procedure less fair: Equivalent surface conditions may not provide equivalent evidence about examinees. Although testing accommodations are by now standard practice in most large-scale testing programmes, for the most part these practices lie outside formal educational measurement theory. This article builds on recent research in universal design for learning (UDL), assessment design, and psychometrics to lay out the rationale for inference that is conditional on matching examinees with principled variations of an assessment so as to reduce construct-irrelevant demands. The present focus is assessment for special populations, but it is argued that the principles apply more broadly.

6.
Educational tests are standardized so that all examinees are tested on the same material, under the same testing conditions, and with the same scoring protocols. This uniformity is designed to provide a level “playing field” for all examinees so that the test is “the same” for everyone. Thus, standardization is designed to promote fairness in testing. In practice, the material tested, the conditions under which a test is administered, and the scoring processes are often too rigid to provide the intended level playing field. For example, standardized testing conditions may interact with personal characteristics of examinees that affect test performance but are not construct-relevant. Thus, more flexibility in standardization is needed to account for the diversity of experiences, talents, and handicaps of the incredibly heterogeneous populations of examinees we currently assess. Traditional standardization procedures grew out of experimental psychology and psychophysics laboratories where keeping all conditions constant was crucial. Today, accounting for and measuring what is not constant across examinees is crucial to valid construct interpretations. To meet this need I introduce the concept of understandardization, which refers to ensuring sufficient flexibility in standardized testing conditions to yield the most accurate measurement of proficiency for each examinee.

7.
Many states are implementing response-to-intervention (RTI)–based assessment as the sole means of identifying students with specific learning disabilities (SLDs). Although RTI is often hailed as an improved model of identification, the concern that this model may elevate false positives has been examined. The risk of RTI producing a second form of diagnostic error, namely false negatives, has received relatively little attention, however. The widespread implementation of RTI necessitates an analysis of its ability to identify the students who are most vulnerable to being inaccurately judged as responsive to instruction, namely, students with coexisting intellectual talent and SLDs.

8.
Substantial growth in the numbers of English language learners (ELLs) in the United States and Canada in recent years has significantly affected the educational systems of both countries. This article focuses on critical issues and concerns related to the assessment of ELLs in U.S. and Canadian schools and emphasizes assessment approaches for test developers and decision makers that will facilitate increased equity, meaningfulness, and accuracy in assessment and accountability efforts. It begins by examining the crucial issue of defining ELLs as a group. Next, it examines the impact of testing originating from the No Child Left Behind Act of 2001 (NCLB) in the U.S. and government-mandated standards-driven testing in Canada by briefly describing each country's respective legislated testing requirements and outlining their consequences at several levels. Finally, the authors identify key points that test developers and decision makers in both contexts should consider in testing this ever-increasing group of students.

9.
The development of statistical methods for detecting test collusion is a new research direction in the area of test security. Test collusion may be described as large-scale sharing of test materials, including answers to test items. Current methods of detecting test collusion are based on statistics also used in answer-copying detection. Therefore, in computerized adaptive testing (CAT) these methods lose power because the actual test varies across examinees. This article addresses that problem by introducing a new approach that works in two stages: in Stage 1, test centers with an unusual distribution of a person-fit statistic are identified via Kullback–Leibler divergence; in Stage 2, examinees from identified test centers are analyzed further using the person-fit statistic, where the critical value is computed without data from the identified test centers. The approach is extremely flexible. One can employ any existing person-fit statistic. The approach can be applied to all major testing programs: paper-and-pencil testing (P&P), computer-based testing (CBT), multiple-stage testing (MST), and CAT. Also, the definition of a test center is not limited to geographic location (room, class, college) and can be extended to support various relations between examinees (from the same undergraduate college, from the same test-prep center, from the same group in a social network). The suggested approach was found to be effective in CAT for detecting groups of examinees with item pre-knowledge, meaning those with access (possibly unknown to us) to one or more subsets of items prior to the exam.
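
Stage 1 of the approach described above can be sketched roughly as follows. This is an illustrative reading, not the article's procedure: the abstract does not say how the distributions are estimated, so the histogram binning, the pseudo-count smoothing, the choice of "all other centers pooled" as the reference distribution, and all function names are assumptions.

```python
import math


def histogram(values, edges, pseudo=0.5):
    """Smoothed relative frequencies of values over the bins defined by edges.
    The pseudo-count keeps every bin positive so the log ratio is defined.
    Values outside [edges[0], edges[-1]] are ignored in this sketch."""
    n_bins = len(edges) - 1
    counts = [pseudo] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def center_divergences(fit_by_center, edges):
    """For each test center, the divergence of its person-fit distribution
    from the distribution of all other centers pooled together."""
    result = {}
    for center, values in fit_by_center.items():
        others = [v for c, vs in fit_by_center.items() if c != center for v in vs]
        result[center] = kl_divergence(
            histogram(values, edges), histogram(others, edges)
        )
    return result


# Hypothetical person-fit values per test center.
fit_by_center = {
    "A": [0.1, 0.2, -0.1, 0.0],
    "B": [0.0, 0.1, -0.2, 0.2],
    "C": [-2.5, -2.8, -2.2, -2.9],
}
edges = [-3, -2, -1, 0, 1, 2, 3]
divergences = center_divergences(fit_by_center, edges)
print(max(divergences, key=divergences.get))
```

Centers with the largest divergence would be carried into Stage 2, where the critical value for the person-fit statistic is computed from the remaining centers only; how the Stage 1 cutoff is chosen is left open here.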

10.
The assessment of differential item functioning (DIF) is routinely conducted to ensure test fairness and validity. Although many DIF assessment methods have been developed in the context of classical test theory and item response theory, they are not applicable to cognitive diagnosis models (CDMs), as the underlying latent attributes of CDMs are multidimensional and binary. This study proposes a very general DIF assessment method in the CDM framework which is applicable to various CDMs, more than two groups of examinees, and multiple grouping variables that are categorical, continuous, observed, or latent. The parameters can be estimated with Markov chain Monte Carlo algorithms implemented in the freeware WinBUGS. Simulation results demonstrated good parameter recovery and advantages in DIF assessment for the new method over the Wald method.

11.
Computerized adaptive testing (CAT) and multistage testing (MST) have become two of the most popular modes in large-scale computer-based sequential testing. Although most designs of CAT and MST have exhibited strengths and weaknesses in recent large-scale implementations, there is no simple answer to the question of which design is better because different modes may fit different practical situations. This article proposes a hybrid adaptive framework to combine both CAT and MST, inspired by an analysis of the history of CAT and MST. The proposed procedure is a design which transitions from a group sequential design to a fully sequential design. This allows for the robustness of MST in early stages, but also shares the advantages of CAT in later stages with fine tuning of the ability estimator once its neighborhood has been identified. Simulation results showed that hybrid designs following our proposed principles provided comparable or even better estimation accuracy and efficiency than standard CAT and MST designs, especially for examinees at the two ends of the ability range.

12.
Student examinees are key stakeholders in large-scale, high-stakes, public examination systems. How they perceive the purpose of testing, comprehend its technical characteristics, and interpret scores influences their response to the system's demands and their preparation for the examinations; this information relates to intended and unintended consequences of testing and is a component of an expanded notion of test validity. The research reported in this paper investigates examinees’ perceptions about the secondary school graduation and university-entrance national exams in Cyprus. Interviews with recent examinees reveal the versatility and complexity of their perceptions about the fairness and appropriateness of the system, which are influenced by design features of the exams and by the local context. There are important, mostly unintended, consequences on their in- and out-of-school experience, on school curricula and on instructional practices. Empirical evidence about consequential aspects of examinations contributes to the validity argument needed to support such programmes.

13.
Although much research has been conducted on the psychometric properties of cognitive diagnostic models, they are only recently being used in operational settings to provide results to examinees and other stakeholders. Using this newer class of models in practice comes with a fresh challenge for diagnostic assessment developers: effectively reporting results and supporting end users to accurately interpret results. Achieving the goal of communicating results in a way that leads users of the assessment to make accurate interpretations requires a prerequisite step that cannot be taken for granted. The assessment developers must first accurately interpret results from a psychometric, or measurement, standpoint. Through this article, we seek to begin a discussion about reasonable interpretations of the results that classification-based models provide about examinees. Interpretations from published research and ongoing practice show different, and sometimes conflicting, ways to interpret these results. This article seeks to formalize a comparison, critique, and discussion among the interpretations. Before beginning this discussion, we first present background on the results provided by classification-based models regarding the examinees. We then structure our discussion around key questions an assessment development team needs to answer for itself prior to constructing reports and interpretative guides for end users of the assessment.

14.
To mitigate security concerns and unfair score gains, credentialing programs routinely administer new test material to examinees retesting after an initial failing attempt. Counterintuitively, a small but growing body of recent research suggests that repeating the identical form does not create an unfair advantage. This study builds upon and extends this research by investigating changes in responses to specific items encountered on both the first and repeat attempts. Results indicate that score gains for repeat examinees who were assigned an identical form were not different from those of repeat examinees who received a different, but parallel, form. Analyses of responses to individual items answered incorrectly on the initial attempt found that examinees selected the same incorrect option on their second attempt 68% of the time, suggesting repeaters are misinformed rather than uninformed. Implications for feedback, remediation, and retesting policies are discussed.

15.
The attractiveness of computer-based tests (CBTs) is due largely to their capability to expand the ways we conduct testing. A relatively unexplored application, however, is actively using the computer to reduce construct-irrelevant variance while a test is being administered. This investigation introduces the effort-monitoring CBT, in which the computer monitors examinee effort (based on item response time) in a low-stakes test and displays warning messages to those exhibiting rapid-guessing behavior. The results of an experimental study are presented, which showed that an effort-monitoring CBT increased examinee effort and yielded more valid test scores than a conventional CBT. Thus, unlike previous research that has focused on identifying rapid-guessing behavior after it has occurred, the effort-monitoring CBT proactively attempts to suppress rapid-guessing behavior. This innovative testing procedure extends the capabilities of measurement practitioners to manage the psychometric challenges posed by unmotivated examinees.
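
The monitoring logic described above might look something like the following sketch. The abstract gives no operational details, so everything specific here is an assumption: the per-item time thresholds, the rule of warning after three consecutive rapid guesses, and the function names are hypothetical.

```python
def warning_points(response_times, thresholds, run_length=3):
    """Return the item indices at which a warning would be displayed.

    A response counts as a rapid guess when its time (in seconds) is below
    that item's threshold. A warning is issued once run_length consecutive
    rapid guesses have occurred, after which the streak counter resets.
    """
    warnings = []
    streak = 0
    for index, (time_taken, threshold) in enumerate(zip(response_times, thresholds)):
        streak = streak + 1 if time_taken < threshold else 0
        if streak == run_length:
            warnings.append(index)
            streak = 0
    return warnings


def response_time_effort(response_times, thresholds):
    """Proportion of responses that were NOT rapid guesses."""
    effortful = sum(1 for t, th in zip(response_times, thresholds) if t >= th)
    return effortful / len(response_times)


# Hypothetical response times (seconds) and a flat 3-second threshold.
times = [12.0, 1.0, 1.5, 0.8, 20.0, 1.0, 1.0, 9.0]
thresholds = [3.0] * len(times)
print(warning_points(times, thresholds))
print(response_time_effort(times, thresholds))
```

In a live test the check would run after each response rather than over a completed list; the batch form is used here only to keep the example short.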

16.
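That scores do not accurately represent examinees' true language ability is one of the most fundamental and most intractable problems in language measurement: the problem of validity. Measures adopted in the past, such as increasing the number of raters or re-rating, improved validity to some extent, but none of them could truly yield an objective score as close as possible to the true score. Longford proposed four score-adjustment models to address the reliability problem in subjective rating. This paper applies the severity adjustment model to the scores assigned by aberrant raters in the rating of HSK (Advanced) essays, and the adjusted scores improved considerably. In future examinations, therefore, this mathematical adjustment method can largely replace the previous practice of organizing raters to re-rate.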

17.
Quality control (QC) in testing is paramount. QC procedures for tests can be divided into two types. The first type, one that has been well researched, is QC for tests administered to large population groups on few administration dates using a small set of test forms (e.g., large-scale assessment). The second type is QC for tests, usually computerized, that are administered to small population groups on many administration dates using a wide array of test forms (continuous mode tests, CMT). Since the world of testing is headed in this direction, developing QC for CMT is crucial. In the current ITEMS module we discuss errors that might occur at the different stages of the CMT process, as well as the recommended QC procedure to reduce the incidence of each error. An illustration from a recent study is provided, and a computerized system that applies these procedures is presented. Instructions on how to develop one's own QC procedure are also included.

18.
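The Praxis Series in the United States is a set of tests for primary and secondary school teachers, intended to help education authorities at all levels certify teacher qualifications and issue teaching certificates. It comprises three components: a pre-professional skills test for teachers, subject knowledge assessments, and classroom teaching evaluation. It has three characteristics: strong theoretical support, good operability, and valid test results. China's teacher qualification examination can draw on it in several respects: comprehensive examination content, a broad range of examinees, process-oriented evaluation running throughout the examination, and continuity of evaluation.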

19.
Although a few studies report sizable score gains for examinees who repeat performance-based assessments, research has not yet addressed the reliability and validity of inferences based on ratings of repeat examinees on such tests. This study analyzed scores for 8,457 single-take examinees and 4,030 repeat examinees who completed a 6-hour clinical skills assessment required for physician licensure. Each examinee was rated in four skill domains: data gathering, communication-interpersonal skills, spoken English proficiency, and documentation proficiency. Conditional standard errors of measurement computed for single-take and multiple-take examinees indicated that ratings were of comparable precision for the two groups within each of the four skill domains; however, conditional errors were larger for low-scoring examinees regardless of retest status. In addition, on their first attempt multiple-take examinees exhibited less score consistency across the skill domains, but on their second attempt their scores became more consistent. Further, the median correlation between scores on the four clinical skill domains and three external measures was .15 for multiple-take examinees on their first attempt but increased to .27 for their second attempt, a value comparable to the median correlation of .26 for single-take examinees. The findings support the validity of inferences based on scores from the second attempt.

20.
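The cloze test is a way of testing integrated language-use ability. Teachers in general should be able to judge the quality of cloze items, while examinees should master the necessary problem-solving techniques for completing them.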
