Similar literature
 20 similar documents found (search time: 0 ms)
1.
Most previous work on feature selection has emphasized only reducing the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must utilize other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization without relying on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami’s method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], which is one of the greedy feature selection methods, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes improves conventional machine learning algorithms to the point that they outperform support vector machines, which are known to give the best classification accuracy.
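The conventional information gain baseline mentioned in this abstract can be illustrated with a minimal sketch. It assumes each document is represented as a set of terms and treats "term occurs in the document" as a binary feature; the toy documents, labels, and function names are hypothetical, and the sketch does not reproduce the paper's divergence-based redundancy reduction.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    classes = sorted(set(labels))

    def class_counts(subset):
        return [sum(1 for y in subset if y == c) for c in classes]

    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Class entropy remaining after observing presence or absence of the term.
    conditional = sum(
        (len(part) / n) * entropy(class_counts(part))
        for part in (with_term, without_term)
        if part
    )
    return entropy(class_counts(labels)) - conditional

# Hypothetical toy corpus: each document is a set of terms.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]
vocabulary = sorted({t for d in docs for t in d})
ranked = sorted(vocabulary, key=lambda t: -information_gain(docs, labels, t))
```

Ranking by information gain alone can keep several terms that carry the same information (here "goal" and "vote" split the toy corpus identically), which is the redundancy problem the abstract describes.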

2.
A comparison of feature selection methods for an evolving RSS feed corpus   Total citations: 3, self-citations: 0, citations by others: 3
Previous researchers have attempted to detect significant topics in news stories and blogs through the use of word frequency-based methods applied to RSS feeds. In this paper, three statistical feature selection methods, χ2, Mutual Information (MI) and Information Gain (I), are proposed as alternative approaches for ranking term significance in an evolving RSS feed corpus. The extent to which the three methods agree with each other in determining the degree of significance of a term on a certain date is investigated, as is the assumption that larger values tend to indicate more significant terms. An experimental evaluation was carried out with 39 different levels of data reduction to evaluate the three methods for differing degrees of significance. The three methods showed a significant degree of disagreement for a number of terms assigned an extremely large value. Hence, the assumption that the larger a value, the higher the degree of significance of a term should be treated cautiously. Moreover, MI and I show significant disagreement. This suggests that MI differs in the way it ranks significant terms, as MI does not take the absence of a term into account, whereas I does. I, however, has a higher degree of term reduction than MI and χ2. This can result in losing some significant terms. In summary, χ2 seems to be the best method to determine term significance for RSS feeds, as χ2 identifies both types of significant behavior. The χ2 method, however, is far from perfect, as an extremely high value can be assigned to relatively insignificant terms.
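As a rough illustration of how two of these scores are commonly computed, the sketch below uses the standard 2x2 contingency-table forms of χ2 and pointwise mutual information. The mapping of the table cells to dates is an assumption made here for illustration (a: items on the date containing the term, b: items on other dates containing the term, c: items on the date without the term, d: items on other dates without the term); the paper's exact formulation is not given in the abstract.

```python
import math

def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/date contingency table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def pointwise_mi(a, b, c, d):
    """Pointwise mutual information (bits) between term presence and the date."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # the term never occurs on this date
    return math.log2(a * n / ((a + b) * (a + c)))

# Hypothetical counts for one term on one date.
chi2_value = chi_square(30, 10, 20, 40)
mi_value = pointwise_mi(30, 10, 20, 40)
```

In this pointwise form, the count d (items that lack the term on other dates) enters only through the total n, which is one way to read the abstract's remark that MI does not take term absence into account.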

3.
The automated classification of texts into predefined categories has witnessed a booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Due to category imbalance and feature sparsity in social text collections, filter methods may work poorly. In this paper, we perform feature selection in the training process, automatically selecting the best feature subset by learning, from a set of preclassified documents, the characteristics of the categories. We propose a generative probabilistic model, describing categories by distributions, handling the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.
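The idea of updating a binary inclusion/exclusion vector by Metropolis search can be sketched generically. The abstract does not specify the generative model or its posterior, so the `score` argument below is a hypothetical stand-in for a log-posterior, and `toy_score`, `INFORMATIVE`, and the temperature value are illustrative assumptions rather than the authors' method.

```python
import math
import random

def metropolis_feature_search(score, n_features, n_steps=2000, temperature=1.0, seed=0):
    """Search over binary inclusion vectors with single-bit Metropolis proposals."""
    rng = random.Random(seed)
    gamma = [rng.random() < 0.5 for _ in range(n_features)]
    current = score(gamma)
    best, best_score = list(gamma), current
    for _ in range(n_steps):
        j = rng.randrange(n_features)
        gamma[j] = not gamma[j]  # propose flipping one inclusion bit
        proposed = score(gamma)
        # Accept with probability min(1, exp((proposed - current) / temperature)).
        if proposed >= current or rng.random() < math.exp((proposed - current) / temperature):
            current = proposed
            if current > best_score:
                best, best_score = list(gamma), current
        else:
            gamma[j] = not gamma[j]  # reject: undo the flip
    return best, best_score

# Hypothetical toy score: reward informative features, penalize the rest.
INFORMATIVE = {0, 2, 5}

def toy_score(gamma):
    return sum(1.0 if j in INFORMATIVE else -0.5 for j, on in enumerate(gamma) if on)

best, best_score = metropolis_feature_search(toy_score, n_features=8, temperature=0.5)
```

Because worse proposals are sometimes accepted, the chain can leave local optima, which is the usual reason to prefer this kind of stochastic search over greedy forward selection.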

4.
Feature selection, which can reduce the dimensionality of the vector space without sacrificing the performance of the classifier, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both inter-category and intra-category. We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and DIA association factor (DIA) when the Naïve Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.

5.
Existing personality detection methods based on user-generated text have two major limitations. First, they rely so heavily on pre-trained language models that they ignore the sentiment information in psycholinguistic features. Second, they have no consensus on psycholinguistic feature selection, resulting in insufficient analysis of sentiment information. To tackle these issues, we propose a novel personality detection method based on high-dimensional psycholinguistic features and an improved distributed Gray Wolf Optimizer (GWO) for feature selection (IDGWOFS). Specifically, we introduce Gaussian Chaos Map-based initialization and a neighbor search strategy into the original GWO to improve the performance of feature selection. To eliminate the bias generated when using mutual information to select features, we adopt symmetric uncertainty (SU) instead of mutual information as the measure of correlation and redundancy when constructing the fitness function, which can balance the correlation between features and labels against the redundancy among features. Finally, we improve the common Spark-based parallelization design of GWO by parallelizing only the fitness computation steps to improve the efficiency of IDGWOFS. The experiments indicate that our proposed method obtains average accuracy improvements of 3.81% and 2.19%, and average F1 improvements of 5.17% and 5.8%, on the Essays and Kaggle MBTI datasets, respectively. Furthermore, IDGWOFS has good convergence and scalability.
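Symmetric uncertainty is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), which normalizes mutual information to the range [0, 1]. The sketch below computes it for discrete sequences; it assumes already-discretized features and does not reproduce the paper's fitness function or the GWO search, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a normalized mutual information."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)

# Hypothetical discretized feature and two candidate partners.
feature = [0, 0, 1, 1]
identical = [0, 0, 1, 1]
independent = [0, 1, 0, 1]
su_identical = symmetric_uncertainty(feature, identical)
su_independent = symmetric_uncertainty(feature, independent)
```

Dividing by the sum of the entropies is what removes the bias of raw mutual information toward features with many distinct values.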

6.
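Guided by the demands of lightweight automobile production, general criteria for evaluating green suppliers and automotive suppliers are each summarized through literature screening. Then, based on life cycle assessment (LCA), the key criteria and sub-criteria for green supplier evaluation under automotive lightweighting requirements are identified; these are deduplicated, supplemented, and optimized along the sustainability dimensions (economic, environmental, social), and the lightweight design method corresponding to each sub-criterion is given. The results help to further refine the sub-criteria and make the evaluation indicators more targeted, and they offer a useful reference for enterprises seeking to improve green supplier selection performance.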

7.
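Zhang Yan, Wang Weixia. 《大众科技》, 2013, (12): 128-131
In recent years, data mining has attracted wide attention in the IT industry. Data mining technology addresses the current problem of data that is poor in usable information: through analysis, it extracts valuable information from large volumes of disorganized data, and this information can be used to solve decision problems such as medical diagnosis and risk assessment. The decision tree method is an important part of data mining. This article explains the construction of a decision tree through its application to drug selection.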

8.
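This paper describes the basic principle of current transformers, analyzes the causes of current transformer error, and details the matters that require attention in the operation and maintenance of high-voltage current transformers.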

9.
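Cao Yong, She Shuo. 《科研管理》, 2009, 30(5): 52-60
Because knowledge-based services have both the general characteristics of the service sector and features of their own that set them apart from other service industries, effective protection of their innovation output has become a key factor in their further growth. Through theoretical analysis and empirical investigation, this study combines an institutional analysis of the patent mechanism with statistical analysis of data and indicators on European service firms' use of the patent mechanism and of the current state of business-method and related service patents in China. It concludes that the patent mechanism is of great significance for the innovation output of knowledge-based services and that the number of such patents will grow substantially. In addition, the Trend function is used to carry out a trend analysis of the number of business-method and related patents in China in 2016, and a quadratic curve fitting model is used to test the conclusion, in order to gauge the future of patent protection for innovation output in knowledge-based services.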
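A trend extrapolation of the kind described can be sketched as an ordinary least-squares line, which is what a spreadsheet TREND function fits for a single predictor. The numbers below are made-up placeholders, not patent counts from the study, and `linear_trend` is an illustrative name.

```python
def linear_trend(xs, ys, x_new):
    """Fit y = a + b*x by ordinary least squares and evaluate the line at x_new."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y + slope * (x_new - mean_x)

# Placeholder series (NOT data from the study): periods 1..4, then extrapolate to period 6.
periods = [1, 2, 3, 4]
counts = [3.0, 5.0, 7.0, 9.0]
forecast = linear_trend(periods, counts, 6)
```

A straight line extrapolates poorly when growth accelerates, which is presumably why the study checks its conclusion against a quadratic fit as well.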

10.
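Based on mechanism design theory, this paper first proposes the concept and core issues of a cost-compensation mechanism for research projects. Second, for the two core issues of mechanism design, information efficiency and incentive compatibility, it builds an analytical model in the context of research project cost compensation and, taking cases from the United Kingdom, the United States, Japan, and China, analyzes the current state of the research project cost-compensation mechanisms in the four countries. On this basis, and focusing on China's problems, it proposes paths for future reform within the framework of the analytical model and offers related policy recommendations.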

11.
Social media have been adopted by many businesses. More and more companies are using social media tools such as Facebook and Twitter to provide various services and interact with customers. As a result, a large amount of user-generated content is freely available on social media sites. To increase competitive advantage and effectively assess the competitive environment of businesses, companies need to monitor and analyze not only the customer-generated content on their own social media sites, but also the textual information on their competitors’ social media sites. In an effort to help companies understand how to perform a social media competitive analysis and transform social media data into knowledge for decision makers and e-marketers, this paper describes an in-depth case study which applies text mining to analyze unstructured text content on Facebook and Twitter sites of the three largest pizza chains: Pizza Hut, Domino's Pizza and Papa John's Pizza. The results reveal the value of social media competitive analysis and the power of text mining as an effective technique to extract business value from the vast amount of available social media data. Recommendations are also provided to help companies develop their social media competitive analysis strategy.

12.
13.
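This paper systematically reviews research on accounting information in China from 1995 to 2012 and finds that, in both its overall and its evolving characteristics, the research shows three diverging trends: the research field is shifting toward information users, the research method toward empirical research, and the mode of research toward collaborative research. It argues that the main causes of these three differences are the macro background of economic transition and institutional change, reforms in postgraduate education, and the strengthening of academic exchange, especially international exchange.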

14.
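A review of the literature on big-data-driven management and decision-making shows that mechanisms for sharing big data resources, and the technology for interconnecting information silos, are among the frontier topics in current big data research. An analysis of government data sharing and exchange applications in China and abroad identifies two problems in sharing and exchanging government data resources: management philosophy, and data barriers created by legacy systems. Based on a cloud platform and the theory of data as a service, a management framework for the full set of government data resources is proposed. Without making any changes to the existing systems, it keeps data in place, avoids copying data, and leaves the original data management model unchanged; it defines the rights and obligations of each operating entity with respect to the data, and resolves the management-philosophy and system-barrier problems facing data sharing and exchange.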

15.
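A study of the dynamic relationship between changes in China's industrial structure and FDI   Total citations: 3, self-citations: 1, citations by others: 3
Chen Xun, Gao Yuandong. 《科研管理》, 2006, 27(5): 137-142
Using national annual time-series data for 1982-2003 and modern cointegration theory, this paper tests for Granger causality in the long-run and short-run relationships between changes in China's industrial structure and FDI. The results show long-run bidirectional Granger causality between China's industrial structure and FDI. In the short run, changes in China's industrial structure Granger-cause changes in FDI in one direction only: changes in the industrial structure have a positive effect on the growth rate of FDI, whereas changes in FDI are not a main driver of changes in China's industrial structure. FDI lagged one period has a significant effect on FDI inflows; industrial structure change lagged one period has a positive effect on FDI, while industrial structure change lagged two periods has a negative effect.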

16.
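Optimizing the innovation environment and innovation performance has become an important aspect of improving the regional capacity for independent innovation in the Guangdong-Hong Kong-Macao Greater Bay Area. This paper therefore uses factor analysis and data envelopment analysis (DEA) to empirically analyze the innovation environment and innovation performance of counties. First, an indicator system for the county-level innovation environment is constructed and, against the background of Greater Bay Area development, the characteristics of the innovation environment, innovation input, and innovation performance of counties in three different locations (the Pearl River Delta core area, the coastal economic belt, and the northern ecological development zone) are analyzed. Second, representative counties (county-level cities) are selected and their innovation performance is comprehensively evaluated. Finally, based on the empirical results, the shortcomings and development problems of each county (city) are identified, and strategies and suggestions are proposed for improving innovation performance and promoting the overall development of innovation capacity.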

17.
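[Purpose] To mine journal bibliographic records, reveal the content characteristics of large numbers of articles, and provide a worked example for editors analyzing journal development trends. [Methods] VBA programs, together with Excel for visualization, are applied to articles published in 《情报学报》, covering their titles, authors,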

18.
The activities in our current world are mainly supported by data-driven web applications, making extensive use of databases and data services. This phenomenon has led to the rise of Data Scientists as professionals of major relevance, who extract value from data and create state-of-the-art data artifacts that generate still more value. In recent years, the term Data Scientist has attracted significant attention. Consequently, it is relevant to understand its origin, knowledge base and skill set, in order to adequately describe its profile and distinguish it from others such as Business Analyst. This work proposes a conceptual model for the professional profile of a Data Scientist and evaluates the representativeness of this profile in two commonly recognized competences/skills frameworks in the field of Information and Communications Technology (ICT), namely the European e-Competence (e-CF) framework and the Skills Framework for the Information Age (SFIA). The results indicate that a significant part of the knowledge base and skill set of Data Scientists is related to ICT competences/skills, including programming, machine learning and databases. The Data Scientist professional profile has an adequate representativeness in these two frameworks, but it is mainly seen as a multi-disciplinary profile, combining contributions from different areas, such as computer science, statistics and mathematics.

19.
Named entity recognition aims to detect pre-determined entity types in unstructured text. There is a limited number of studies on this task for low-resource languages such as Turkish. We provide a comprehensive study for Turkish named entity recognition by comparing the performances of existing state-of-the-art models on the datasets with varying domains to understand their generalization capability and further analyze why such models fail or succeed in this task. Our experimental results, supported by statistical tests, show that the highest weighted F1 scores are obtained by Transformer-based language models, varying from 80.8% in tweets to 96.1% in news articles. We find that Transformer-based language models are more robust to entity types with a small sample size and longer named entities compared to traditional models, yet all models have poor performance for longer named entities in social media. Moreover, when we shuffle 80% of words in a sentence to imitate flexible word order in Turkish, we observe more performance deterioration, 12% in well-written texts, compared to 7% in noisy text.
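The word-order perturbation described at the end of this abstract could be approximated as follows. The abstract does not say how the 80% of words are chosen or rearranged, so this sketch assumes whitespace tokenization, a random choice of positions, and a shuffle restricted to those positions; `shuffle_words` and the example sentence are illustrative.

```python
import random

def shuffle_words(sentence, fraction=0.8, seed=None):
    """Shuffle a given fraction of the words in a sentence among their own positions."""
    rng = random.Random(seed)
    words = sentence.split()
    k = round(fraction * len(words))
    # Pick which positions take part; all other words stay where they are.
    positions = sorted(rng.sample(range(len(words)), k))
    chosen = [words[i] for i in positions]
    rng.shuffle(chosen)
    for pos, word in zip(positions, chosen):
        words[pos] = word
    return " ".join(words)

# Hypothetical example sentence with ten tokens.
original = "the quick brown fox jumps over the lazy sleeping dog"
perturbed = shuffle_words(original, fraction=0.8, seed=42)
```

For a tagging task such as named entity recognition, each token's label would need to move with the token, so in practice the same permutation would be applied to the label sequence.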

20.
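Drawing on the process by which Britain became the world center of scientific activity, this paper summarizes a general model for the formation of a world center of scientific activity and, in light of this model, analyzes the measures China should take to become a world science center.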
