1.
Objective: We recently showed that the gender detection tools NamSor, Gender API, and Wiki-Gendersort accurately predicted the gender of individuals with Western given names. Here, we aimed to evaluate the performance of these tools with Chinese given names in Pinyin format. Methods: We constructed two datasets for the purpose of the study. File #1 was created by randomly drawing 20,000 names from a gender-labeled database of 52,414 Chinese given names in Pinyin format. File #2, which contained 9,077 names, was created by removing from File #1 all unisex names that we were able to identify (i.e., those that were listed in the database as both male and female names). We recorded for both files the number of correct classifications (correct gender assigned to a name), misclassifications (wrong gender assigned to a name), and nonclassifications (no gender assigned). We then calculated the proportion of misclassifications and nonclassifications (errorCoded). Results: For File #1, errorCoded was 53% for NamSor, 65% for Gender API, and 90% for Wiki-Gendersort. For File #2, errorCoded was 43% for NamSor, 66% for Gender API, and 94% for Wiki-Gendersort. Conclusion: We found that all three gender detection tools inaccurately predicted the gender of individuals with Chinese given names in Pinyin format and therefore should not be used in this population.
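The File #2 construction described above (dropping every name listed under both genders) can be sketched as follows. This is an illustrative Python sketch, not the authors' code; the `remove_unisex` helper and the example records are made up.

```python
from collections import defaultdict

def remove_unisex(records):
    """Drop names listed under more than one gender (hypothetical helper).

    records: iterable of (name, gender) pairs, e.g. [("wei", "M"), ("wei", "F")].
    Returns only the records whose name maps to exactly one gender.
    """
    genders = defaultdict(set)
    for name, gender in records:
        genders[name].add(gender)
    return [(n, g) for n, g in records if len(genders[n]) == 1]

records = [("wei", "M"), ("wei", "F"), ("qiang", "M"), ("xiuying", "F")]
print(remove_unisex(records))  # [('qiang', 'M'), ('xiuying', 'F')]
```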
2.
Researchers are often reluctant to rely on classification rates because a model with favorable classification rates but poor separation may not replicate well. In comparison, entropy captures information about borderline cases unlikely to generalize to the population. In logistic regression, the correctness of predicted group membership is known; however, this information has not yet been utilized in entropy calculations. The purpose of this study was to 1) introduce three new variants of entropy as approximate-model-fit measures, 2) establish rule-of-thumb thresholds to determine whether a theoretical model fits the data, and 3) investigate empirical Type I error and statistical power associated with those thresholds. Results are presented from two Monte Carlo simulations. Simulation results indicated that EFR-rescaled was the most representative of overall model effect size, whereas EFR provided the most intuitive interpretation for all group size ratios. Empirically-derived thresholds are provided.
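The abstract does not define the EFR variants, so they cannot be reproduced here; the sketch below shows only the standard idea they build on, the average entropy of predicted probabilities in a two-group logistic model, where borderline (near-0.5) predictions drive entropy up. All names and sample values are illustrative.

```python
import math

def mean_posterior_entropy(probs):
    """Average binary entropy of predicted class-1 probabilities.

    Values near 0 indicate confident, well-separated predictions;
    values near 1 indicate many borderline cases.
    """
    def h(p):
        if p in (0.0, 1.0):  # entropy of a certain prediction is 0
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(h(p) for p in probs) / len(probs)

print(mean_posterior_entropy([0.95, 0.90, 0.10]))  # well separated: low entropy
print(mean_posterior_entropy([0.55, 0.50, 0.45]))  # borderline: high entropy
```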
3.
Latent class models are often used to assign values to categorical variables that cannot be measured directly. This “imputed” latent variable is then used in further analyses with auxiliary variables. The relationship between the imputed latent variable and auxiliary variables can only be correctly estimated if these auxiliary variables are included in the latent class model. Otherwise, point estimates will be biased. We develop a method that correctly estimates the relationship between an imputed latent variable and external auxiliary variables, by updating the latent variable imputations to be conditional on the external auxiliary variables using a combination of multiple imputation of latent classes and the so-called three-step approach. In contrast with existing “one-step” and “three-step” approaches, our method allows the resulting imputations to be analyzed using the familiar methods favored by substantive researchers.
4.
Fingerprint segmentation is an important component of fingerprint recognition systems, and an accurate, effective segmentation method is key to improving recognition efficiency. Starting from the structural characteristics of fingerprints and from specific theoretical perspectives, this paper classifies and compares recent fingerprint segmentation methods along three dimensions: misclassification rate, time complexity, and algorithmic characteristics, and points out directions for further research on fingerprint segmentation.
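One classical family of methods covered by such surveys is block-wise variance segmentation: tiles with low gray-level variance are treated as background, since ridge regions alternate between dark and light. A minimal pure-Python sketch, with illustrative block size and threshold not taken from the article:

```python
def block_variance_mask(img, block=4, thresh=0.01):
    """Classify each block x block tile as foreground (ridges) or background.

    img: 2D list of gray levels in [0, 1].
    Returns a 2D list of booleans, one per tile (True = foreground).
    """
    rows, cols = len(img), len(img[0])
    mask = []
    for r in range(0, rows, block):
        row_mask = []
        for c in range(0, cols, block):
            vals = [img[i][j]
                    for i in range(r, min(r + block, rows))
                    for j in range(c, min(c + block, cols))]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            row_mask.append(var > thresh)  # low variance => background
        mask.append(row_mask)
    return mask

flat = [[0.5] * 4 for _ in range(4)]            # uniform background tile
ridge = [[float(i % 2)] * 4 for i in range(4)]  # alternating ridge pattern
img = [f + r for f, r in zip(flat, ridge)]      # 4x8 image: background | ridges
print(block_variance_mask(img))  # [[False, True]]
```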
5.
Background: A recent article published in Educational Research on the reliability of results in National Curriculum testing in England (Newton, “The reliability of results from national curriculum testing in England,” Educational Research 51, no. 2: 181–212, 2009) suggested that: (1) classification accuracy can be calculated from classification consistency; and (2) classification accuracy on a single test administration is higher than classification consistency across two tests.

Purpose: This article shows that it is not possible to calculate classification accuracy from classification consistency. It then shows that, given reasonable assumptions about the distribution of measurement error, the expected classification accuracy on a single test administration is higher than the expected classification consistency across two tests only in the case of a pass–fail test, but not necessarily for tests that classify test-takers into more than two categories.

Main argument and conclusion: Classification accuracy is defined in terms of a ‘true score’ specified in a psychometric model. Three things must be known or hypothesised in order to derive a value for classification accuracy: (1) a psychometric model relating observed scores to true scores; (2) the location of the cut-scores on the score scale; and (3) the distribution of true scores in the group of test-takers.
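The pass–fail case can be illustrated with a small Monte Carlo sketch using the three ingredients the article names: a model relating observed to true scores, a cut-score, and a true-score distribution. All parameter values here are illustrative, not taken from the article.

```python
import random

def simulate(cut=0.0, n=100_000, sd_error=1.0, seed=1):
    """Monte Carlo sketch: true scores ~ N(0, 1); each test-taker gets two
    independent observed scores (true score plus N(0, sd_error) measurement
    error) and is classified pass/fail against a single cut-score.

    Returns (accuracy, consistency): accuracy is agreement between one
    observed classification and the true classification; consistency is
    agreement between the two observed classifications.
    """
    rng = random.Random(seed)
    acc = cons = 0
    for _ in range(n):
        t = rng.gauss(0, 1)
        o1 = t + rng.gauss(0, sd_error)
        o2 = t + rng.gauss(0, sd_error)
        true_pass = t >= cut
        p1, p2 = o1 >= cut, o2 >= cut
        acc += (p1 == true_pass)
        cons += (p1 == p2)
    return acc / n, cons / n

accuracy, consistency = simulate()
print(accuracy > consistency)  # for a two-category (pass-fail) test: True
```

With two categories, an observed score can disagree with the true classification only by crossing the single cut-score once, whereas two error-laden observed scores can disagree with each other more often; with more than two categories this ordering need not hold, which is the article's point.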
6.
Objective: To evaluate the performance of gender detection tools that allow the uploading of files (e.g., Excel or CSV files) containing first names, are usable by researchers without advanced computer skills, and are at least partially free of charge. Methods: The study was conducted using four physician datasets (total number of physicians: 6,131; 50.3% female) from Switzerland, a multilingual country. Four gender detection tools met the inclusion criteria: three partially free (Gender API, NamSor, and genderize.io) and one completely free (Wiki-Gendersort). For each tool, we recorded the number of correct classifications (i.e., correct gender assigned to a name), misclassifications (i.e., wrong gender assigned to a name), and nonclassifications (i.e., no gender assigned). We computed three metrics: the proportion of misclassifications excluding nonclassifications (errorCodedWithoutNA), the proportion of nonclassifications (naCoded), and the proportion of misclassifications and nonclassifications (errorCoded). Results: The proportion of misclassifications was low for all four gender detection tools (errorCodedWithoutNA between 1.5 and 2.2%). By contrast, the proportion of unrecognized names (naCoded) varied: 0% for NamSor, 0.3% for Gender API, 4.5% for Wiki-Gendersort, and 16.4% for genderize.io. Using errorCoded, which penalizes both types of error equally, we obtained the following results: Gender API 1.8%, NamSor 2.0%, Wiki-Gendersort 6.6%, and genderize.io 17.7%. Conclusions: Gender API and NamSor were the most accurate tools. Genderize.io led to a high number of nonclassifications. Wiki-Gendersort may be a good compromise for researchers wishing to use a completely free tool. Other studies would be useful to evaluate the performance of these tools in other populations (e.g., Asian).
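The three metrics defined in the abstract can be computed directly from per-name outcomes. A minimal sketch; the `gender_metrics` helper and the sample data are hypothetical, only the metric definitions come from the abstract.

```python
def gender_metrics(results):
    """Compute the three metrics from a list of per-name outcomes,
    each one of "correct", "wrong", or "unclassified"."""
    n = len(results)
    wrong = results.count("wrong")
    na = results.count("unclassified")
    return {
        "errorCodedWithoutNA": wrong / (n - na),  # misclassifications among classified names
        "naCoded": na / n,                        # proportion of nonclassifications
        "errorCoded": (wrong + na) / n,           # both error types penalized equally
    }

sample = ["correct"] * 8 + ["wrong"] + ["unclassified"]
print(gender_metrics(sample))
```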
7.
Objective: We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database. Methods: We used a database containing the first names, surnames, and gender of 6,131 physicians practicing in a multicultural country (Switzerland). We uploaded the original CSV file (file #1), the file obtained after removing all diacritic marks, such as accents and cedilla (file #2), and the file obtained after removing all diacritic marks and retaining only the first term of the compound first names (file #3). For each file, we computed three performance metrics: proportion of misclassifications (errorCodedWithoutNA), proportion of nonclassifications (naCoded), and proportion of misclassifications and nonclassifications (errorCoded). Results: naCoded, which was high for file #1 (16.4%), was reduced after data manipulation (file #2: 11.7%, file #3: 0.4%). As the increase in the number of misclassifications was small, the overall performance of genderize.io (i.e., errorCoded) improved, especially for file #3 (file #1: 17.7%, file #2: 13.0%, and file #3: 2.3%). Conclusions: A relatively simple manipulation of the data improved the accuracy of gender inference by genderize.io. We recommend using genderize.io only with files that were modified in this way.
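The two manipulations described (removing diacritics for file #2; additionally keeping only the first term of compound first names for file #3) can be sketched in Python. `strip_diacritics` and `first_term` are hypothetical helper names, and the compound-name rule assumes hyphen- or space-separated compounds; the article's exact procedure may differ.

```python
import unicodedata

def strip_diacritics(name):
    """Remove accents, cedillas, etc. via Unicode decomposition (file #2 step)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def first_term(name):
    """Keep only the first term of a compound first name (file #3 step)."""
    return name.replace("-", " ").split()[0]

print(strip_diacritics("François"))              # Francois
print(first_term(strip_diacritics("Jean-Luc")))  # Jean
```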