Similar Documents
 20 similar documents found (search time: 46 ms)
1.
This study investigated the Type I error rate and power of four copying indices, K-index (Holland, 1996), Scrutiny! (Assessment Systems Corporation, 1993), g2 (Frary, Tideman, & Watts, 1977), and ω (Wollack, 1997) using real test data from 20,000 examinees over a 2-year period. The data were divided into three different test lengths (20, 40, and 80 items) and nine different sample sizes (ranging from 50 to 20,000). Four different amounts of answer copying were simulated (10%, 20%, 30%, and 40% of the items) within each condition. The ω index demonstrated the best Type I error control and power in all conditions and at all α levels. Scrutiny! and the K-index were uniformly conservative, and both had poor power to detect true copiers at the small α levels typically used in answer copying detection, whereas g2 was generally too liberal, particularly at small α levels. Some comments on the proper uses of copying indices are provided.
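For readers unfamiliar with how the ω index referenced above is formed, the following is a minimal Python sketch: it standardizes the observed number of answer matches against the matches expected by chance. It assumes the per-item probabilities that the alleged copier would select the source's chosen alternative (normally derived from a nominal response model) are already available; the function name, inputs, and the normal reference are illustrative, not Wollack's notation.

```python
import numpy as np
from scipy.stats import norm

def omega(copier_resp, source_resp, match_prob):
    """Omega-style copying statistic: standardize the observed number of identical
    answers against the matches expected by chance.  match_prob[i] is assumed to be
    the probability (e.g., from a nominal response model) that the alleged copier
    would choose the source's selected alternative on item i."""
    copier_resp = np.asarray(copier_resp)
    source_resp = np.asarray(source_resp)
    match_prob = np.asarray(match_prob, dtype=float)
    h = np.sum(copier_resp == source_resp)             # observed matches
    e = np.sum(match_prob)                             # expected matches under no copying
    sd = np.sqrt(np.sum(match_prob * (1 - match_prob)))
    w = (h - e) / sd
    return w, norm.sf(w)                               # statistic and upper-tail p-value
```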

2.
A statistical test for the detection of answer copying on multiple-choice tests is presented. The test is based on the idea that the answers of examinees to test items may be the result of three possible processes: (1) knowing, (2) guessing, and (3) copying, but that examinees who do not have access to the answers of other examinees can arrive at their answers only through the first two processes. This assumption leads to a distribution for the number of matched incorrect alternatives between the examinee suspected of copying and the examinee believed to be the source that belongs to a family of "shifted binomials." Power functions for the tests for several sets of parameter values are analyzed. An extension of the test to include matched numbers of correct alternatives would lead to improper statistical hypotheses.  相似文献   
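As a rough illustration of the logic described above (not the exact shifted-binomial family derived in the paper), the sketch below computes an upper-tail probability for the number of matched incorrect alternatives under a plain binomial guessing model; the match probability of 0.25 and the counts are invented for the example.

```python
from math import comb

def binomial_upper_tail(m_obs, n, p):
    """Upper-tail probability P(M >= m_obs) for M ~ Binomial(n, p)."""
    return sum(comb(n, m) * p**m * (1 - p)**(n - m) for m in range(m_obs, n + 1))

# Illustration: the source answered 15 items incorrectly; under a pure guessing
# model with four alternatives, the chance of matching the source's incorrect
# choice on any one of them is taken as 0.25.  The suspected copier matches the
# source on 11 of those incorrect answers.
p_value = binomial_upper_tail(11, 15, 0.25)
print(f"P(M >= 11) = {p_value:.6f}")
```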

3.
We investigated the statistical properties of the K-index (Holland, 1996) that can be used to detect copying behavior on a test. A simulation study was conducted to investigate the applicability of the K-index for small, medium, and large datasets. Furthermore, the Type I error rate and the detection rate of this index were compared with the copying index, ω (Wollack, 1997). Several approximations were used to calculate the K-index. Results showed that all approximations were able to hold the Type I error rates below the nominal level. Results further showed that using ω resulted in higher detection rates than the K-indices for small and medium sample sizes (100 and 500 simulees).

4.
The standardized log-likelihood of a response vector (lz) is a popular IRT-based person-fit test statistic for identifying model-misfitting response patterns. Traditional use of lz is overly conservative in detecting aberrance due to its incorrect assumption regarding its theoretical null distribution. This study proposes a method for improving the accuracy of person-fit analysis using lz which takes into account test unreliability when estimating the ability and constructs the distribution for each lz through resampling methods. The Type I error and power (or detection rate) of the proposed method were examined at different test lengths, ability levels, and nominal α levels along with other methods, and power to detect three types of aberrance—cheating, lack of motivation, and speeding—was considered. Results indicate that the proposed method is a viable and promising approach. It has Type I error rates close to the nominal value for most ability levels and reasonably good power.
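The sketch below illustrates the two ingredients named in the abstract: the standardized log-likelihood statistic lz and a resampling-based null distribution. It uses a simple Rasch model and a parametric bootstrap at the estimated ability; the paper's actual procedure, which also corrects for test unreliability in the ability estimate, is more involved, so treat this only as a schematic.

```python
import numpy as np

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def lz(responses, theta, b):
    """Standardized log-likelihood person-fit statistic."""
    p = rasch_p(theta, b)
    l0 = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    e = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    v = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - e) / np.sqrt(v)

def lz_bootstrap_pvalue(responses, theta_hat, b, n_rep=2000, seed=0):
    """Parametric-bootstrap null for lz: simulate model-consistent response vectors
    at theta_hat and compare the observed lz with the simulated distribution."""
    rng = np.random.default_rng(seed)
    observed = lz(np.asarray(responses), theta_hat, b)
    p = rasch_p(theta_hat, b)
    sims = np.array([lz(rng.binomial(1, p), theta_hat, b) for _ in range(n_rep)])
    return observed, float(np.mean(sims <= observed))   # low lz values signal misfit
```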

5.
Why do students give incorrect answers in PISA? What are the reasons for giving incorrect answers? Do all incorrect answers reflect only a lack of competence, or might even a competent child make a mistake? The aim of this article is to contribute to a better understanding of these issues. In the current investigation, we selected six students who responded incorrectly to one PISA question in mathematics or science when they solved it individually. Then, we analyzed their understanding of the PISA task and their reasoning about it through dialogical problem solving in triads to identify why they gave an incorrect answer. Moreover, we tried to determine how shared peer interaction might change the child's understanding and reasoning and enable her/him to solve the task. The results of this study illustrate the differences between incorrect answers that reflect a lack of competence and those that arise for other reasons. Based on the dialogical problem solving approach, we analyzed these two types of incorrect answers and the reasoning trajectories behind them.

6.
Applied Measurement in Education, 2013, 26(4): 265-288
Many of the currently available statistical indexes to detect answer copying lack sufficient power at small α levels or when the amount of copying is relatively small. Furthermore, there is no one index that is uniformly best. Depending on the type or amount of copying, certain indexes are better than others. The purpose of this article was to explore the utility of simultaneously using multiple copying indexes to detect different types and amounts of answer copying. This study compared eight copying indexes: S1 and S2 (Sotaridona & Meijer, 2003), K2 (Sotaridona & Meijer, 2002), ω (Wollack, 1997), B and H (Angoff, 1974), and new indexes Runs and MaxStrings, plus all possible pairs and triplets of the 8 indexes using multiple comparison procedures (Dunn, 1961) to adjust the critical α level for each index in a pair or triplet. Empirical Type-I error rates and power of all indexes, pairs, and triplets were examined in a real data simulation (i.e., where actual examinee responses to items [rather than generated item response vectors] were changed to match the actual responses for randomly selected source examinees) for 2 test lengths, 9 sample sizes, 3 types of copying, 4 α levels, and 4 percentages of items copied. This study found that using both ω and H* (i.e., H with empirically derived critical values) can help improve power in the most realistic types of copying situations (strings and mixed copying). The ω-H* paired index improved power most, particularly for small percentages of items copied and small amounts of copying, two conditions for which copying indexes tend to be underpowered.
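The Dunn (1961) adjustment mentioned above amounts to testing each index in a pair or triplet at a stricter level so that the combined Type-I error stays near the nominal α. A minimal sketch, with made-up p-values:

```python
def flag_pair(p_values, alpha=0.001):
    """Dunn-Bonferroni screening: a suspected copier-source pair is flagged if any
    index in the pair/triplet is significant at alpha / k, where k is the number
    of indexes used together."""
    k = len(p_values)
    return any(p < alpha / k for p in p_values.values())

# Hypothetical p-values for the omega and H* indexes for one pair of examinees:
print(flag_pair({"omega": 0.0003, "H*": 0.004}))   # True: 0.0003 < 0.001 / 2
```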

7.
To assess item dimensionality, the following two approaches are described and compared: hierarchical generalized linear model (HGLM) and multidimensional item response theory (MIRT) model. Two generating models are used to simulate dichotomous responses to a 17-item test: the unidimensional and compensatory two-dimensional (C2D) models. For C2D data, seven items are modeled to load on the first and second factors, θ1 and θ2, with the remaining 10 items modeled unidimensionally, emulating a mathematics test with seven items requiring an additional reading ability dimension. For both types of generated data, the multidimensionality of item responses is investigated using HGLM and MIRT. Comparison of HGLM and MIRT results is possible through a transformation of items' difficulty estimates into probabilities of a correct response for a hypothetical examinee at the mean on θ1 and θ2. HGLM and MIRT performed similarly. The benefits of HGLM for item dimensionality analyses are discussed.

8.
To assess the relative contribution of dynamic and summary features of vocal fundamental frequency (f0) to the statistical discrimination of pragmatic categories in infant-directed speech, 49 mothers were instructed to use their voice to get their 4-month-old baby's attention, show approval, and provide comfort. Vocal f0 from 621 tokens was extracted using a Computerized Speech Laboratory and custom software. Dynamic features were measured with convergent methods (visual judgment and quantitative modeling of f0 contour shape). Summary features were f0 mean, standard deviation, and duration. Dynamic and summary features both individually and in combination statistically discriminated between each of the pragmatic categories. Classification rates were 69% and 62% in initial and cross-validation DFAs, respectively.

9.
An approximate χ2 statistic based on McDonald's (1967) nonlinear factor analytic representation of item response theory was proposed and investigated with simulated data. The results were compared with Stout's T statistic (Nandakumar & Stout, 1993; Stout, 1987). Unidimensional and two-dimensional item response data were simulated under varying levels of sample size, test length, test reliability, and dimension dominance. The approximate χ2 statistic had good control over Type I errors when unidimensional data were generated and displayed very good power in identifying the two-dimensional data. The performance of the approximate χ2 was at least as good as Stout's T statistic in all conditions and was better than Stout's T statistic with smaller sample sizes and shorter tests. Further implications regarding the potential use of nonlinear factor analysis and the approximate χ2 in addressing current measurement issues are discussed.

10.
Using a bidimensional two-parameter logistic model, the authors generated data for two groups on a 40-item test. The item parameters were the same for the two groups, but the correlation between the two traits varied between groups. The difference in the trait correlation was directly related to the number of items judged not to be invariant using traditional unidimensional IRT-based unsigned item invariance indexes; the higher trait correlation leads to higher discrimination parameter estimates when a unidimensional IRT model is fit to the multidimensional data. In the most extreme case, when rθ1θ2 = 0 for one group and rθ1θ2 = 1.0 for the other group, 33 out of 40 items were identified as not invariant. When using signed indexes, the effect was much smaller. The authors, therefore, suggest a cautious use of IRT-based item invariance indexes when data are potentially multidimensional and groups may vary in the strength of the correlations among traits.
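A minimal sketch of the kind of data generation described above: responses from a compensatory two-dimensional two-parameter logistic model, with traits drawn from a bivariate normal whose correlation differs between groups. The item parameter values below are illustrative placeholders, not those used by the authors.

```python
import numpy as np

def simulate_m2pl(n_persons, a1, a2, d, rho, seed=0):
    """Generate dichotomous responses from a compensatory two-dimensional 2PL:
    P(X = 1) = logistic(a1*theta1 + a2*theta2 + d), with (theta1, theta2) drawn
    from a bivariate normal whose correlation is rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    theta = rng.multivariate_normal([0.0, 0.0], cov, size=n_persons)
    logit = theta[:, [0]] * a1 + theta[:, [1]] * a2 + d      # persons x items
    p = 1.0 / (1.0 + np.exp(-logit))
    return rng.binomial(1, p)

# 40 items; one group with rho = 0 and another with rho = 1.0 (the extreme
# condition described above).  Item parameters here are arbitrary examples.
a1, a2, d = np.full(40, 1.2), np.full(40, 0.8), np.zeros(40)
group_0 = simulate_m2pl(2000, a1, a2, d, rho=0.0)
group_1 = simulate_m2pl(2000, a1, a2, d, rho=1.0)
```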

11.
12.
The reading test performances of 60 hearing and 60 hearing-impaired children of similar measured reading ages on the Southgate reading test were analysed. As in an earlier study using the Brimer Wide-span test it was shown that the performances of the two groups were quite different. Deaf children tackled significantly more test items than the hearing and made significantly more errors in achieving similar reading scores. A detailed examination of both correct and incorrect answers showed that the deaf children were not simply providing answers to questions at random. Even where they produced incorrect responses they tended, as a group, to select the same answer. Unlike the hearing group, who did not converge on the same incorrect solution to difficult test items, the deaf were systematic in their choices, indicating that they were using a consistent strategy. A post hoc examination of individual test items indicated that the deaf children were selecting answers on the basis of word associations in each test item. On some items these produced a correct response, on others the same (incorrect) response. The implications of these findings are discussed to argue that reading tests based on hearing norms are of little value in the assessment of reading abilities and reading problems in hearing-impaired children.

13.
The current study investigated kindergarteners and second graders' ability to monitor and evaluate their own and a virtual peer's performance in a paired-associate learning task. Participants provided confidence judgments (CJs) for their own responses and performance-based judgments (judgments provided after receiving feedback on their performance) for both their own and a virtual peer's responses. For the performance-based judgments, children were confronted with their own or the peer's answer as well as the correct answer. Additionally, participants were asked to credit their own and the peer's correct and incorrect answers while facing feedback. Results indicate an age-related progression in metacognitive monitoring skills, with second graders differentiating more strongly in their confidence judgments between correct and incorrect responses compared to kindergarteners. Regarding performance-based judgments, children of both age groups provided higher judgments for correctly compared to incorrectly recognized items as well as for their own responses in comparison to the responses of the unknown child. Similarly, when crediting, participants of both age groups gave more credits for correct recognition than for incorrect recognition and for their own responses than for the peer's responses. The significant interaction between age group and recognition accuracy for the crediting shows that second graders gave more credits for correctly recognized items while kindergarteners gave more credits for incorrect answers than the older children – primarily for their own incorrect answers. In conclusion, the study provides new insights into 6- and 8-year-olds' evaluations of their own and an unknown child's performance in a paired-associate learning task by showing that children of both age groups generally judged and credited responses in their own favor. These results add to our understanding of biases in children's performance evaluations, including metacognitive judgments and judgments provided after receiving feedback.

14.
John White argues that 'egalitarianism, in education as elsewhere, is a will-o'-the-wisp'. He claims that recent defences of egalitarianism, among which he kindly includes my own along with those of Thomas Nagel and Kai Nielsen, have failed to answer the basic question of why a more equal society should be regarded as valuable. I shall try to show that the positive philosophical commitments contained in his argument may point the way to an answer.

15.
Three field studies tested the hypothesis that anticipating a graded test as opposed to a pass-fail test enhances metacognitive monitoring. Participants were teacher candidates who completed a mid-term and a final test in psychology courses. Each participant chose whether the result of the final test should be evaluated with one of five grades or with a pass-fail decision. In both tests, participants answered true–false inference items about the contents of the course and indicated their confidence in the correctness of each answer. When a graded test was expected, confidence and the absolute accuracy of the confidence judgments increased and bias decreased to a greater extent than when a pass-fail decision was expected. However, expecting a grade increased participants' confidence not only in correct answers but also in incorrect answers (Study 1). Feedback and instructions emphasizing the importance of accurate discrimination between correct and incorrect answers did not weaken this effect (Study 2). The generalizability of the findings was investigated by reanalyzing the test results of participants in eight other psychology courses (Study 3). The results are discussed in terms of the motivational consequences of grading.

16.
To address the low accuracy of current food safety question-answering systems and their inability to meet the requirements of intelligent question answering, a food safety question-answering system based on word-vector similarity is designed. Deep learning methods are used to build a food safety domain knowledge base and a word-vector model, and a question similarity measure incorporating a synonym lexicon is proposed. An input question is matched against all questions in the knowledge base, and the answer corresponding to the most similar question is returned. Experimental results show that the system reaches a question-answering accuracy of 80% and can meet the everyday question-answering needs of users in the food industry.
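A minimal sketch of the matching step described above: questions are represented as averaged word vectors, the query is compared with every stored question by cosine similarity, and the answer of the best match is returned. The embedding table, vector dimension, and knowledge-base layout (a list of dicts with "question" tokens and an "answer" string) are assumptions for illustration; the original system trains its own word vectors and also uses a synonym lexicon.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim=100):
    """Represent a question as the average of its word vectors; unknown words are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def best_answer(query_tokens, kb, embeddings):
    """Match the query against every stored question and return the answer of the
    most similar one along with its similarity score."""
    qv = sentence_vector(query_tokens, embeddings)
    scored = [(cosine(qv, sentence_vector(item["question"], embeddings)), item) for item in kb]
    score, item = max(scored, key=lambda pair: pair[0])
    return item["answer"], score
```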

17.
The development of statistical methods for detecting test collusion is a new research direction in the area of test security. Test collusion may be described as large-scale sharing of test materials, including answers to test items. Current methods of detecting test collusion are based on statistics also used in answer-copying detection. Therefore, in computerized adaptive testing (CAT) these methods lose power because the actual test varies across examinees. This article addresses that problem by introducing a new approach that works in two stages: in Stage 1, test centers with an unusual distribution of a person-fit statistic are identified via Kullback–Leibler divergence; in Stage 2, examinees from identified test centers are analyzed further using the person-fit statistic, where the critical value is computed without data from the identified test centers. The approach is extremely flexible. One can employ any existing person-fit statistic. The approach can be applied to all major testing programs: paper-and-pencil testing (P&P), computer-based testing (CBT), multiple-stage testing (MST), and CAT. Also, the definition of test center is not limited by the geographic location (room, class, college) and can be extended to support various relations between examinees (from the same undergraduate college, from the same test-prep center, from the same group at a social network). The suggested approach was found to be effective in CAT for detecting groups of examinees with item pre-knowledge, meaning those with access (possibly unknown to us) to one or more subsets of items prior to the exam.
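A schematic of the Stage-1 screening idea described above: bin a person-fit statistic, form each test center's empirical distribution, and flag centers whose Kullback–Leibler divergence from the pooled distribution is unusually large. The binning scheme and the fixed threshold here are arbitrary illustrations; the article derives its critical values differently.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """Discrete Kullback-Leibler divergence D(p || q) with additive smoothing."""
    p = (p + eps) / np.sum(p + eps)
    q = (q + eps) / np.sum(q + eps)
    return float(np.sum(p * np.log(p / q)))

def flag_centers(fit_by_center, n_bins=20, threshold=0.5):
    """Stage-1 screening sketch: compare each test center's empirical distribution of
    a person-fit statistic with the pooled distribution and flag centers whose KL
    divergence exceeds an (illustrative) threshold."""
    pooled = np.concatenate(list(fit_by_center.values()))
    edges = np.histogram_bin_edges(pooled, bins=n_bins)
    q, _ = np.histogram(pooled, bins=edges)
    flagged = {}
    for center, values in fit_by_center.items():
        p, _ = np.histogram(values, bins=edges)
        d = kl_divergence(p.astype(float), q.astype(float))
        if d > threshold:
            flagged[center] = d
    return flagged
```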

18.
Croatian 1st-year and 3rd-year high-school students (N = 170) completed a conceptual physics test. Students were evaluated with regard to two physics topics: Newtonian dynamics and simple DC circuits. Students answered test items and also indicated their confidence in each answer. Rasch analysis facilitated the calculation of three linear measures: (a) an item-difficulty measure based upon all responses, (b) an item-confidence measure based upon correct student answers, and (c) an item-confidence measure based upon incorrect student answers. Comparisons were made with regard to item difficulty and item confidence. The results suggest that Newtonian dynamics is a topic with stronger students' alternative conceptions than the topic of DC circuits, which is characterized by much lower students' confidence on both correct and incorrect answers. A systematic and significant difference between mean student confidence on Newtonian dynamics and DC circuits items was found in both student groups. Findings suggest some steps for physics instruction in Croatia as well as areas of further research for those in science education interested in additional techniques of exploring alternative conceptions. © 2005 Wiley Periodicals, Inc. J Res Sci Teach 43: 150–171, 2006

19.
Taking the quivers Q1, Q2, and Q3 as examples, finite partially ordered k-categories Γ1 and Γ2 are constructed, and two examples of the incidence algebras induced by Γ1, Γ2 and the functor category Γ1Γ2 are considered.

20.
Recent studies have shown that restricting review and answer change opportunities on computerized adaptive tests (CATs) to items within successive blocks reduces time spent in review, satisfies most examinees' desires for review, and controls against distortion in proficiency estimates resulting from intentional incorrect answering of items prior to review. However, restricting review opportunities on CATs may not prevent examinees from artificially raising proficiency estimates by using judgments of item difficulty to signal when to change previous answers. We evaluated six strategies for using item difficulty judgments to change answers on CATs and compared the results to those from examinees reviewing and changing answers in the usual manner. The strategy conditions varied in terms of when examinees were prompted to consider changing answers and in the information provided about the consistency of the item selection algorithm. We found that examinees fared best on average when they reviewed and changed answers in the usual manner. The best gaming strategy was one in which the examinees knew something about the consistency of the item selection algorithm and were prompted to change responses only when they were unsure about answer correctness and sure about their item difficulty judgments. However, even this strategy did not produce a mean gain in proficiency estimates.
