
1.

Information gain and divergence-based feature selection for machine learning-based text categorization

Changki Lee Gary Geunbae Lee 《Information processing & management》2006

Most previous work on feature selection emphasized only reducing the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must use other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization without relying on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami’s method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], which is one of the greedy feature selection methods, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes produces more improvements of conventional machine learning algorithms over support vector machines, which are known to give the best classification accuracy.
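As a point of reference for the baseline named in this abstract, conventional information gain for a binary term feature can be sketched as follows. This is only the standard IG score, not the authors' divergence-based method; the toy corpus, labels, and function names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C | t present) - P(not t) H(C | t absent)."""
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    gain = entropy(labels)
    for subset in (with_term, without_term):
        if subset:  # skip an empty side to avoid dividing by zero
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


# Hypothetical toy corpus: each document is represented as a set of terms.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]

# Rank the vocabulary by information gain, highest first.
vocabulary = {t for d in docs for t in d}
ranked = sorted(vocabulary, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In this toy corpus "goal" and "vote" separate the classes perfectly and score 1 bit, while "team" appears once in each class and scores 0. Note that a plain IG ranking scores each term independently, so it would keep both of two fully redundant terms, which is the limitation the abstract sets out to address.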

2.

A comparison of feature selection methods for an evolving RSS feed corpus (cited by: 3; self-citations: 0; citations by others: 3)

Rudy Prabowo Mike Thelwall 《Information processing & management》2006,42(6):1491

Previous researchers have attempted to detect significant topics in news stories and blogs through the use of word frequency-based methods applied to RSS feeds. In this paper, three statistical feature selection methods, χ², Mutual Information (MI) and Information Gain (I), are proposed as alternative approaches for ranking term significance in an evolving RSS feed corpus. The extent to which the three methods agree with each other in determining the degree of significance of a term on a certain date is investigated, as well as the assumption that larger values tend to indicate more significant terms. An experimental evaluation was carried out with 39 different levels of data reduction to evaluate the three methods for differing degrees of significance. The three methods showed a significant degree of disagreement for a number of terms assigned an extremely large value. Hence, the assumption that the larger a value, the higher the degree of significance of a term should be treated cautiously. Moreover, MI and I show significant disagreement. This suggests that MI differs in the way it ranks significant terms, as MI does not take the absence of a term into account, although I does. I, however, has a higher degree of term reduction than MI and χ². This can result in losing some significant terms. In summary, χ² seems to be the best method for determining term significance for RSS feeds, as χ² identifies both types of significant behavior. The χ² method, however, is far from perfect, as an extremely high value can be assigned to relatively insignificant terms.
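The three scores compared in this abstract can be sketched from a 2×2 contingency table of term presence against membership in a target group (a date or a category). The formulas below are the standard textbook forms and the counts are hypothetical; the abstract does not say which exact variants the study used.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: term present, in group      b: term present, not in group
    c: term absent, in group       d: term absent, not in group
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """Pointwise MI: scores only the (term present, in group) cell."""
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c))) if a else float("-inf")


def _entropy(*counts):
    """Entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(x / total * math.log2(x / total) for x in counts if x)


def information_gain(a, b, c, d):
    """IG: also sums over the documents in which the term is absent."""
    n = a + b + c + d
    return (_entropy(a + c, b + d)
            - (a + b) / n * _entropy(a, b)
            - (c + d) / n * _entropy(c, d))


# Hypothetical counts for one term on one date.
table = (30, 10, 20, 40)
scores = (chi_square(*table), mutual_information(*table), information_gain(*table))
```

The pointwise MI above looks only at the cell where the term is present, whereas information gain averages over both the term-present and term-absent rows, which is one way to read the abstract's remark that MI ignores term absence while I does not.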

3.

A Bayesian feature selection paradigm for text classification

Guozhong Feng Jianhua Guo Bing-Yi Jing Lizhu Hao 《Information processing & management》2012

The automated classification of texts into predefined categories has witnessed a booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Due to category imbalance and feature sparsity in social text collections, filter methods may work poorly. In this paper, we perform feature selection in the training process, automatically selecting the best feature subset by learning, from a set of preclassified documents, the characteristics of the categories. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.

4.

A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization

Jieming Yang Yuanning Liu Xiaodong Zhu Zhen Liu Xiaoxu Zhang 《Information processing & management》2012

Feature selection, which can reduce the dimensionality of the vector space without sacrificing the performance of the classifier, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both inter-category and intra-category. We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and DIA association factor (DIA) when the Naïve Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.

5.

A novel personality detection method based on high-dimensional psycholinguistic features and improved distributed Gray Wolf Optimizer for feature selection

《Information processing & management》2023,60(2):103217

Existing personality detection methods based on user-generated text have two major limitations. First, they rely too heavily on pre-trained language models and ignore the sentiment information in psycholinguistic features. Second, there is no consensus on psycholinguistic feature selection, resulting in insufficient analysis of sentiment information. To tackle these issues, we propose a novel personality detection method based on high-dimensional psycholinguistic features and an improved distributed Gray Wolf Optimizer (GWO) for feature selection (IDGWOFS). Specifically, we introduce a Gaussian Chaos Map-based initialization and a neighbor search strategy into the original GWO to improve the performance of feature selection. To eliminate the bias generated when using mutual information to select features, we adopt symmetric uncertainty (SU) instead of mutual information as the measure of correlation and redundancy when constructing the fitness function, which balances the correlation between features and labels against the redundancy among features. Finally, we improve the common Spark-based parallelization design of GWO by parallelizing only the fitness computation steps, which improves the efficiency of IDGWOFS. The experiments indicate that our proposed method obtains average accuracy improvements of 3.81% and 2.19%, and average F1 improvements of 5.17% and 5.8%, on the Essays and Kaggle MBTI datasets, respectively. Furthermore, IDGWOFS has good convergence and scalability.
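The symmetric uncertainty measure mentioned in this abstract has a standard definition, SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), which normalizes mutual information to the range [0, 1]. The sketch below computes it for discrete variables; how the authors combine SU values into their fitness function is not stated in the abstract and is not reproduced here. The sample vectors are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum(n / total * math.log2(n / total) for n in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), with I = H(X) + H(Y) - H(X, Y)."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:  # both variables are constant
        return 0.0
    mutual_info = h_x + h_y - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (h_x + h_y)


# Hypothetical discrete feature and label vectors.
feature = [0, 0, 1, 1]
label_same = [0, 0, 1, 1]
label_unrelated = [0, 1, 0, 1]

su_same = symmetric_uncertainty(feature, label_same)
su_unrelated = symmetric_uncertainty(feature, label_unrelated)
```

Dividing by the sum of the entropies is what removes mutual information's tendency to favor variables with many distinct values, which is the bias the abstract refers to.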

6.

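Green supplier evaluation criteria based on automotive lightweight design

尤筱玥 雷星晖 石涌江 《科技管理研究》2018,(16)

Guided by the production requirements of lightweight vehicles, this study first uses a literature screening to derive the general criteria for evaluating green suppliers and automotive suppliers, respectively. It then applies life cycle assessment (LCA) to identify the key criteria and sub-criteria for evaluating green suppliers under automotive lightweighting requirements, deduplicates, supplements, and refines them along the sustainability dimensions (economic, environmental, social), and gives the lightweight design method corresponding to each sub-criterion. The results help to further refine the sub-criteria, make the evaluation indicators more targeted, and offer a useful reference for firms seeking to improve their green supplier selection performance.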

7.

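Application of the decision tree method in a drug selection model

张燕 汪卫霞 《大众科技》2013,(12):128-131

In recent years, data mining has attracted wide attention in the IT industry. Data mining techniques address the current problem of data poverty: through analysis, they extract valuable information from large volumes of disordered data, and this information can be used to solve decision problems such as medical diagnosis and risk assessment. The decision tree method is an important part of data mining. This article explains how a decision tree is constructed through its application to drug selection.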
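A minimal sketch of how a decision tree might be built for a drug-selection task follows. It assumes an ID3-style construction that splits on the attribute with the highest information gain; the abstract does not name the algorithm the article uses, and the patient records, attribute names, and drug labels are hypothetical.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Entropy (in bits) of the target column over the given rows."""
    counts = Counter(r[target] for r in rows)
    return -sum(n / len(rows) * math.log2(n / len(rows)) for n in counts.values())


def best_attribute(rows, attributes, target):
    """Return the attribute whose split yields the largest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset, target)
        return entropy(rows, target) - remainder
    return max(attributes, key=gain)


def build_tree(rows, attributes, target):
    """Recursively build a nested-dict tree; leaves are class labels."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # pure node or majority vote
    attr = best_attribute(rows, attributes, target)
    rest = [a for a in attributes if a != attr]
    return {attr: {value: build_tree([r for r in rows if r[attr] == value], rest, target)
                   for value in sorted({r[attr] for r in rows})}}


def predict(tree, row):
    """Follow the branches matching the row until a leaf label is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree


# Hypothetical patient records (not taken from the article).
patients = [
    {"bp": "high", "cholesterol": "high", "drug": "A"},
    {"bp": "high", "cholesterol": "normal", "drug": "A"},
    {"bp": "normal", "cholesterol": "high", "drug": "B"},
    {"bp": "normal", "cholesterol": "normal", "drug": "C"},
]
tree = build_tree(patients, ["bp", "cholesterol"], "drug")
```

On these records the root split is on `bp`, because it yields a pure branch for drug A, and the remaining records are then separated by `cholesterol`.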

8.

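Error analysis of high-voltage current transformers and precautions for their operation and maintenance

周亚新 《黑龙江科技信息》2007,(3S):40-40

This article describes the basic principle of current transformers, analyzes the causes of current transformer error, and details the points that require attention in the operation and maintenance of high-voltage current transformers.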

9.

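Current status and outlook of the patent mechanism in knowledge-intensive service innovation (cited by: 2; self-citations: 0; citations by others: 2)

曹勇 佘硕 《科研管理》2009,30(5):52-60

Because knowledge-intensive service industries combine the general characteristics of services with features that set them apart from other service sectors, effective protection of their innovations is a key factor in their further growth. Using theoretical analysis and an empirical survey, this study combines an institutional analysis of the patent mechanism with a statistical analysis of data and indicators on European service firms' use of patents and of the current state of business-method and related service patents in China. It concludes that the patent mechanism is highly significant for innovation in knowledge-intensive services and that the number of such patents will grow substantially. The study also uses the Trend function to project the number of business-method and related patents in China in 2016, and checks the conclusion with a quadratic curve-fitting model, in order to gauge the future of patent protection for innovation in knowledge-intensive services.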
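The forecasting step described in this abstract can be sketched as a linear least-squares trend checked against a quadratic fit. This assumes that the Trend function is a linear least-squares extrapolation of the kind a spreadsheet TREND function performs, which the abstract does not confirm. The yearly counts below are synthetic values generated from a known quadratic, not data from the study, and `polyfit` and `predict` are illustrative helper names.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    size = degree + 1
    # Augmented matrix (X^T X | X^T y) of the normal equations.
    m = [[sum(x ** (i + j) for x in xs) for j in range(size)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(size)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(size):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    return [m[i][size] / m[i][i] for i in range(size)]


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Synthetic yearly counts (hypothetical): x is years since 2004.
years = [2004, 2005, 2006, 2007, 2008]
xs = [y - 2004 for y in years]
counts = [10, 14, 22, 34, 50]  # generated from 10 + 2x + 2x^2

linear = polyfit(xs, counts, 1)      # linear trend, TREND-style
quadratic = polyfit(xs, counts, 2)   # quadratic curve used as a check

forecast_linear = predict(linear, 2016 - 2004)
forecast_quadratic = predict(quadratic, 2016 - 2004)
```

Years are shifted to start at zero so that the normal equations stay well conditioned. On accelerating data like this, the two forecasts diverge sharply, which shows why a second model is a useful check on a straight-line projection.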

10.

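Path selection for reforming China's research project cost compensation mechanism

阿儒涵 李铭禄 《科学学研究》2016,(4):558-563

Drawing on mechanism design theory, this paper first sets out the concept and core problems of a cost compensation mechanism for research projects. Second, for the two core problems of mechanism design, informational efficiency and incentive compatibility, it builds an analytical model for the research project cost compensation setting and uses cases from the United Kingdom, the United States, Japan, and China to analyze the current state of each country's mechanism. On this basis, and within the framework of the analytical model, it proposes reform paths aimed specifically at China's problems, together with related policy recommendations.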

11.

Social media competitive analysis and text mining: A case study in the pizza industry

Wu He Shenghua Zha Ling Li 《International Journal of Information Management》2013

Social media have been adopted by many businesses. More and more companies are using social media tools such as Facebook and Twitter to provide various services and interact with customers. As a result, a large amount of user-generated content is freely available on social media sites. To increase competitive advantage and effectively assess the competitive environment of businesses, companies need to monitor and analyze not only the customer-generated content on their own social media sites, but also the textual information on their competitors’ social media sites. In an effort to help companies understand how to perform a social media competitive analysis and transform social media data into knowledge for decision makers and e-marketers, this paper describes an in-depth case study which applies text mining to analyze unstructured text content on Facebook and Twitter sites of the three largest pizza chains: Pizza Hut, Domino's Pizza and Papa John's Pizza. The results reveal the value of social media competitive analysis and the power of text mining as an effective technique to extract business value from the vast amount of available social media data. Recommendations are also provided to help companies develop their social media competitive analysis strategy.

12.

Preferences in Wikipedia abstracts: Empirical findings and implications for automatic entity summarization

Danyun Xu Gong Cheng Yuzhong Qu 《Information processing & management》2014


13.

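A statistical and evolutionary analysis of the Chinese accounting information research literature: based on data from China's authoritative core journals in economics and management

韩少真 张晓明 《未来与发展》2013,(6):42-47

This paper systematically reviews Chinese research on accounting information from 1995 to 2012 and finds three distinct trends in its overall and evolving characteristics: the research focus is shifting toward information users, the research method toward empirical studies, and the mode of research toward collaboration. It attributes these three shifts mainly to the macro background of economic transition and institutional change, to reforms in graduate education, and to stronger academic exchange, especially international exchange.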

14.

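Government data governance in the big data context: a study of sharing and management mechanisms

肖炯恩 吴应良 《科技管理研究》2018,(17)

A review of the literature on big-data-driven management and decision-making shows that sharing mechanisms for big data resources, and the technologies for interconnecting information silos, are among the frontier topics in current big data research. An analysis of government data sharing and exchange applications in China and abroad identifies two problems: management mindset, and data barriers created by legacy systems. Building on a cloud platform and the theory of data as a service, the paper proposes a management framework for the full set of government data resources. Without making any change to existing systems, the framework leaves data where it is, does not copy it, and does not alter its original management model. It defines each operating entity's rights and obligations over the data, thereby addressing the mindset and system-barrier problems that data sharing and exchange face.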

15.

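The dynamic relationship between changes in China's industrial structure and FDI (cited by: 3; self-citations: 1; citations by others: 3)

陈迅 高远东 《科研管理》2006,27(5):137-142

Using national annual time series data for 1982-2003 and modern cointegration theory, this paper applies Granger causality tests to the long-run and short-run relationships between changes in China's industrial structure and FDI. The results show a long-run, two-way Granger causal relationship between China's industrial structure and FDI. In the short run, causality runs in one direction only, from changes in China's industrial structure to changes in FDI: changes in industrial structure have a positive effect on the growth rate of FDI, whereas changes in FDI are not a main driver of changes in industrial structure. FDI lagged one period has a significant effect on FDI inflows. Industrial structure change lagged one period has a positive effect on FDI, while industrial structure change lagged two periods has a negative effect.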

16.

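Evaluating the county-level innovation environment in the Guangdong-Hong Kong-Macao Greater Bay Area: an empirical analysis of 57 counties (county-level cities) in Guangdong

林海个县《科技管理研究》2020,(12)

Optimizing the innovation environment and innovation performance has become an important part of raising the independent innovation capacity of the Guangdong-Hong Kong-Macao Greater Bay Area. This paper therefore uses factor analysis and data envelopment analysis (DEA) to analyze empirically the innovation environment and innovation performance of counties. First, it builds an indicator system for the county-level innovation environment and, against the background of Greater Bay Area development, analyzes the innovation environment, innovation input, and innovation performance of counties in three different locations: the Pearl River Delta core area, the coastal economic belt, and the northern ecological development zone. Second, it selects representative counties (county-level cities) and evaluates their innovation performance comprehensively. Finally, based on the empirical results, it identifies each county's weaknesses and development problems and proposes strategies and recommendations for improving innovation performance and promoting the overall development of innovation capacity.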

17.

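Journal data research based on bibliographic record analysis: the case of 《情报学报》

车尧 宋扬 李兵 《中国科技期刊研究》2018,29(4):406-410

[Purpose] To mine journal bibliographic records, reveal the content characteristics of large numbers of articles, and provide a worked example for editors analyzing a journal's development. [Methods] Using VBA programs together with Excel for visualization, the titles, authors, …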

18.

The data scientist profile and its representativeness in the European e-Competence framework and the skills framework for the information age

《International Journal of Information Management》2017,37(6):726-734

The activities in our current world are mainly supported by data-driven web applications, making extensive use of databases and data services. This phenomenon led to the rise of Data Scientists as professionals of major relevance, who extract value from data and create state-of-the-art data artifacts that generate even more value. In recent years, the term Data Scientist has attracted significant attention. Consequently, it is relevant to understand its origin, knowledge base and skill set, in order to adequately describe the profile and distinguish it from others such as Business Analyst. This work proposes a conceptual model for the professional profile of a Data Scientist and evaluates the representativeness of this profile in two commonly recognized competence/skills frameworks in the field of Information and Communications Technology (ICT), namely the European e-Competence (e-CF) framework and the Skills Framework for the Information Age (SFIA). The results indicate that a significant part of the knowledge base and skill set of Data Scientists is related to ICT competences/skills, including programming, machine learning and databases. The Data Scientist professional profile is adequately represented in these two frameworks, but it is mainly seen as a multi-disciplinary profile, combining contributions from different areas, such as computer science, statistics and mathematics.

19.

Named entity recognition in Turkish: A comparative study with detailed error analysis

《Information processing & management》2022,59(6):103065

Named entity recognition aims to detect pre-determined entity types in unstructured text. There is a limited number of studies on this task for low-resource languages such as Turkish. We provide a comprehensive study for Turkish named entity recognition by comparing the performances of existing state-of-the-art models on the datasets with varying domains to understand their generalization capability and further analyze why such models fail or succeed in this task. Our experimental results, supported by statistical tests, show that the highest weighted F1 scores are obtained by Transformer-based language models, varying from 80.8% in tweets to 96.1% in news articles. We find that Transformer-based language models are more robust to entity types with a small sample size and longer named entities compared to traditional models, yet all models have poor performance for longer named entities in social media. Moreover, when we shuffle 80% of words in a sentence to imitate flexible word order in Turkish, we observe more performance deterioration, 12% in well-written texts, compared to 7% in noisy text.
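The word-order perturbation mentioned at the end of this abstract could be sketched as below. The abstract does not describe the exact procedure, so this is one plausible reading: pick 80% of the word positions at random and permute the words among those positions only. The function name, the whitespace tokenization, and the sample sentence are assumptions.

```python
import random


def shuffle_words(sentence, fraction=0.8, seed=None):
    """Permute a random `fraction` of the words, leaving the rest in place."""
    rng = random.Random(seed)
    words = sentence.split()
    k = round(fraction * len(words))
    # Choose which positions take part in the shuffle.
    positions = sorted(rng.sample(range(len(words)), k))
    chosen = [words[i] for i in positions]
    rng.shuffle(chosen)
    # Write the permuted words back into the chosen positions.
    for position, word in zip(positions, chosen):
        words[position] = word
    return " ".join(words)


# Hypothetical sample sentence.
original = "Ali dün akşam Ankara'ya gitti"
perturbed = shuffle_words(original, fraction=0.8, seed=0)
```

Because a random permutation can leave a word where it started, the number of words that actually move may be lower than the requested fraction. A study that needs an exact rate would have to reject such permutations.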

20.

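Measures China should take to become the world's center of science

武龙 武卫兵 《科学学研究》2001,19(2):43-46

Drawing on the process by which Britain became the center of world scientific activity, this paper outlines a general model of how such centers form and, in light of that model, analyzes the measures China should take to become the world's center of science.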