Similar Documents
20 similar documents found
1.
This paper proposes new modified methods for back-propagation neural networks and uses a semantic feature space to improve categorization performance and efficiency. The standard back-propagation neural network (BPNN) suffers from slow learning and from getting trapped in local minima, leading to poor performance and efficiency. In this paper, we propose two methods to modify the standard BPNN and adopt the semantic feature space (SFS) method to reduce the number of dimensions as well as construct latent semantics between terms. The experimental results show that the modified methods outperform the standard BPNN and are more efficient. The SFS method not only greatly reduces the dimensionality but also enhances performance, and can therefore be used to further improve the precision and efficiency of text categorization systems.
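The abstract does not specify the two BPNN modifications. A classic remedy for slow learning and local minima is adding a momentum term to the weight update; the sketch below is illustrative only, not the paper's method:

```python
def momentum_update(w, grad, velocity, lr=0.1, mu=0.9):
    """One gradient-descent step with momentum: the velocity term carries
    past updates forward, speeding convergence and helping the search roll
    through shallow local minima."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity


# One step on a single weight: velocity starts at zero, so the first
# update is just the plain gradient step scaled by lr.
w, v = momentum_update(1.0, 0.5, 0.0)
```

On subsequent steps the accumulated velocity dominates, which is what damps oscillation compared with plain back-propagation.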

2.
Feature selection, which can reduce the dimensionality of the vector space without sacrificing classifier performance, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both within a category (intra-category) and across categories (inter-category). We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that CMFS is significantly superior to Information Gain (IG), the Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and the DIA association factor (DIA) when the Naïve Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.
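CMFS's own formula is not given in the abstract, but the CHI baseline it is compared against scores a term for a category from a per-document contingency table. A minimal sketch of that standard filter score:

```python
def chi_square(A, B, C, D):
    """Chi-square score of a term for one category.
    A: docs in the category containing the term,
    B: docs outside the category containing the term,
    C: docs in the category lacking the term,
    D: docs outside the category lacking the term."""
    N = A + B + C + D
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / den if den else 0.0
```

A term's global score is then typically the maximum (or weighted average) of its per-category scores, and the top-k terms are kept.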

3.
Contextual feature selection for text classification   (Cited 1 time: 0 self-citations, 1 by others)
We present a simple approach for the classification of “noisy” documents using bigrams and named entities. The approach combines conventional feature selection with a contextual approach to filter out passages around selected features. Originally designed for call-for-tender documents, the method can be useful for other web collections that also contain non-topical content. Experiments are conducted on our in-house collection as well as on the 4-Universities data set, Reuters-21578 and 20 Newsgroups. We find a significant improvement on our collection and the 4-Universities data set (10.9% and 4.1%, respectively). Although the best results are obtained by combining bigrams and named entities, the impact of the latter is not found to be significant.

4.
Text documents usually contain many high-dimensional, non-discriminative (irrelevant and noisy) terms, which lead to steep computational costs and poor learning performance in text classification. One effective solution to this problem is feature selection, which aims to identify discriminative terms in text data. This paper proposes a method termed Hebb rule based feature selection (HRFS). HRFS is based on the supervised Hebb rule: it treats terms and classes as neurons and selects a term as discriminative if it keeps “exciting” the corresponding classes. In other words, following the original Hebb postulate, a term is highly correlated with a class if it is able to keep “exciting” that class. Six benchmark datasets are used to compare HRFS with seven other feature selection methods. Experimental results indicate that HRFS achieves better performance than the compared methods and can identify discriminative terms from the viewpoint of synapses between neurons. Moreover, HRFS is efficient, because it can be expressed as matrix operations, decreasing the complexity of feature selection.
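The Hebbian idea above can be sketched in a few lines: every time a term fires in a document of some class, the synapse between the term neuron and the class neuron is strengthened. This is an illustrative toy, not the paper's exact update rule:

```python
def hrfs_scores(docs, labels, eta=0.1):
    """Hebbian-style term scoring sketch: each (term, class) synapse gains
    `eta` whenever the term occurs in a document of that class; a term's
    score is the weight of its strongest class association."""
    w = {}  # (term, class) -> synapse weight
    for terms, label in zip(docs, labels):
        for t in set(terms):
            w[(t, label)] = w.get((t, label), 0.0) + eta
    scores = {}
    for (t, c), weight in w.items():
        scores[t] = max(scores.get(t, 0.0), weight)
    return scores
```

Terms that keep co-firing with one class accumulate weight; terms spread thinly over all classes never do, so ranking by score keeps the discriminative ones.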

5.
The automated classification of texts into predefined categories has witnessed booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Due to category imbalance and feature sparsity in social text collections, filter methods may work poorly. In this paper, we perform feature selection in the training process, automatically selecting the best feature subset by learning the characteristics of the categories from a set of preclassified documents. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.

6.
Fishery text classification is an effective way to make full use of fishery information resources. Targeting the structural characteristics of Chinese literature, this paper proposes a fishery text classification method that combines feature-term weights with a Support Vector Machine (SVM). The Vector Space Model (VSM) is used to construct the text vector space, the feature-term weights are used to compute each feature in the text feature vector, and the resulting vectors are fed into the SVM for fishery text classification. Experiments on standard documents downloaded from CNKI, evaluated on precision and recall, show that the proposed fishery text classification method achieves good classification results.
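Building VSM vectors with feature-term weights, as described above, is commonly done with TF-IDF weighting. The sketch below uses TF-IDF as one reasonable choice; the paper's exact weighting formula is not given in the abstract:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Vector Space Model with TF-IDF feature-term weights.
    Each document is a list of terms; returns one sparse vector per doc."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)  # raw term frequency in this document
        vecs.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vecs
```

These sparse vectors are what would then be handed to an SVM (after mapping terms to column indices) for training and classification.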

7.
With the blooming of Internet information delivery, document classification has become indispensable and is expected to be handled by automatic text categorization. This paper presents a text categorization system that solves the multi-class categorization problem. The system consists of two modules: a processing module and a classifying module. In the first module, ICF and Uni are used as indicators to extract the relevant terms. In the classifying module, fuzzy set theory is incorporated into OAA-SVM, and we propose an OAA-FSVM classifier to implement a multi-class classification system. The performances of OAA-SVM and OAA-FSVM are evaluated by a macro-average performance index.

8.
A comparison of feature selection methods for an evolving RSS feed corpus   (Cited 3 times: 0 self-citations, 3 by others)
Previous researchers have attempted to detect significant topics in news stories and blogs through word frequency-based methods applied to RSS feeds. In this paper, three statistical feature selection methods: χ², Mutual Information (MI) and Information Gain (I), are proposed as alternative approaches for ranking term significance in an evolving RSS feed corpus. We investigate the extent to which the three methods agree with each other on the degree of significance of a term on a certain date, as well as the assumption that larger values indicate more significant terms. An experimental evaluation was carried out at 39 different levels of data reduction to evaluate the three methods for differing degrees of significance. The three methods showed a significant degree of disagreement for a number of terms assigned an extremely large value. Hence, the assumption that the larger the value, the higher the degree of significance of a term should be treated cautiously. Moreover, MI and I show significant disagreement. This suggests that MI ranks significant terms differently, as MI does not take the absence of a term into account, although I does. I, however, has a higher degree of term reduction than MI and χ², which can result in losing some significant terms. In summary, χ² seems to be the best method to determine term significance for RSS feeds, as χ² identifies both types of significant behavior. The χ² method, however, is far from perfect, as an extremely high value can be assigned to relatively insignificant terms.
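The asymmetry noted above — MI ignores term absence while I accounts for it — is visible directly in the formulas. A sketch computing both from the same document-count contingency table (the A/B/C/D notation is the standard feature selection convention, assumed here, not taken from the paper):

```python
import math

def mi(A, B, C, D):
    """Pointwise mutual information between term presence and a category.
    Only the term-present cell A enters the score, so a term's absence
    never contributes."""
    N = A + B + C + D
    return math.log((A * N) / ((A + B) * (A + C))) if A else float("-inf")

def ig(A, B, C, D):
    """Information gain sums over BOTH term presence and absence."""
    N = A + B + C + D

    def h(p):
        return -p * math.log(p) if p > 0 else 0.0

    def H(pos, neg):
        tot = pos + neg
        return h(pos / tot) + h(neg / tot) if tot else 0.0

    prior = H(A + C, B + D)  # class entropy before seeing the term
    post = (A + B) / N * H(A, B) + (C + D) / N * H(C, D)
    return prior - post
```

When the term is independent of the category, both scores are zero; they diverge precisely on terms whose absence is informative, which is the disagreement the study measures.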

9.
Gene ontology (GO) consists of three structured controlled vocabularies, i.e., GO domains, developed for describing attributes of gene products, and its annotation is crucial to provide a common gateway for accessing different model organism databases. This paper explores an effective application of text categorization methods to this highly practical problem in biology. As a first step, we attempt to tackle the automatic GO annotation task posed in the Text Retrieval Conference (TREC) 2004 Genomics Track. Given a pair of genes and an article reference where the genes appear, the task simulates assigning GO domain codes. We approach the problem with careful consideration of the specialized terminology and pay special attention to various forms of gene synonyms, so as to exhaustively locate the occurrences of the target gene. We extract the words around the spotted gene occurrences and use them to represent the gene for GO domain code annotation. We regard the task as a text categorization problem and adopt a variant of kNN with supervised term weighting schemes, making our method among the top-performing systems in the TREC official evaluation. Furthermore, we investigate different feature selection policies in conjunction with the treatment of terms associated with negative instances. Our experiments reveal that round-robin feature space allocation with elimination of negative terms substantially improves performance as GO terms become specific.

10.
In text classification, it is necessary to perform feature selection to alleviate the curse of dimensionality caused by high-dimensional text data. In this paper, we utilize class term frequency (CTF) and class document frequency (CDF) to characterize the relevance between terms and categories at the level of term frequency (TF) and document frequency (DF). On the basis of this relevance measurement, three feature selection methods (ADF based on CTF (ADF-CTF), ADF based on CDF (ADF-CDF), and ADF based on both CTF and CDF (ADF-CTDF)) are proposed to identify relevant and discriminant terms by introducing absolute deviation factors (ADFs). Absolute deviation, a statistical concept, is adopted for the first time to measure the relevance divergence characterized by CTF and CDF. In addition, ADF-CTF and ADF-CDF can be combined with existing DF-based and TF-based methods, respectively, which results in new ADF-based methods. Experimental results on six high-dimensional textual datasets using three classifiers indicate that ADF-based methods outperform the original DF-based and TF-based ones in 89% of cases in terms of Micro-F1 and Macro-F1, which demonstrates the role of ADF, integrated into existing methods, in boosting classification performance. Findings also show that ADF-CTDF ranks first on average across the datasets and significantly outperforms the other methods in 99% of cases.
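The exact ADF formula is not given in the abstract; absolute deviation itself is simply the mean absolute spread of a term's per-class statistics (CTF or CDF values), so the underlying intuition can be sketched as:

```python
def adf(per_class_values):
    """Mean absolute deviation of a term's per-class statistics.
    A sketch of the absolute-deviation idea only -- not the paper's exact
    ADF definition. Terms whose frequency diverges strongly across classes
    score higher and are considered more discriminative."""
    m = sum(per_class_values) / len(per_class_values)
    return sum(abs(v - m) for v in per_class_values) / len(per_class_values)
```

A term appearing uniformly in every class scores zero, while a term concentrated in one class scores high, which is the property the ADF-based methods exploit.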

11.
To remove redundant and noisy features from network intrusion datasets, reduce the difficulty of data processing, and improve detection performance, this paper proposes an intrusion detection method based on feature selection and Support Vector Machines. The method uses the proposed feature selection algorithm to select an optimal feature combination, builds a model with an SVM as the classifier, and applies it to an intrusion detection system. Simulation results show that the method not only reduces the feature dimensionality and the training and testing time, but also improves the classification accuracy of intrusion detection.

12.
In this paper, a Generalized Cluster Centroid based Classifier (GCCC) and its variants for text categorization are proposed, using a clustering algorithm to integrate two well-known classifiers, i.e., the K-nearest-neighbor (KNN) classifier and the Rocchio classifier. KNN, a lazy learning method, suffers from inefficiency in online categorization while achieving remarkable effectiveness. Rocchio, which has efficient categorization performance, fails to obtain an expressive categorization model due to its inherent linear separability assumption. Our proposed method focuses on two points: one is that we use a clustering algorithm to strengthen the expressiveness of the Rocchio model; the other is that we employ the improved Rocchio model to speed up the categorization process of KNN. Extensive experiments conducted on both English and Chinese corpora show that GCCC and its variants have better categorization ability than some state-of-the-art classifiers, i.e., Rocchio, KNN and the Support Vector Machine (SVM).
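The Rocchio side of the hybrid above reduces to nearest-centroid classification. A minimal sketch with one centroid per class (GCCC additionally clusters inside each class to obtain several centroids, which is what relaxes the linear separability assumption):

```python
from collections import defaultdict

def nearest_centroid(train_vecs, labels, doc):
    """Rocchio-style classification: average the sparse vectors of each
    class into a centroid, then assign `doc` to the class whose centroid
    is most cosine-similar."""
    sums, counts = defaultdict(dict), defaultdict(int)
    for vec, label in zip(train_vecs, labels):
        counts[label] += 1
        for t, w in vec.items():
            sums[label][t] = sums[label].get(t, 0.0) + w

    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = sum(w * w for w in u.values()) ** 0.5
        nv = sum(w * w for w in v.values()) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    centroids = {c: {t: w / counts[c] for t, w in s.items()}
                 for c, s in sums.items()}
    return max(centroids, key=lambda c: cos(centroids[c], doc))
```

Because only a handful of centroids are compared at query time (versus every training document in KNN), this is the source of the speed-up the paper exploits.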

13.
李文  王炜立  洪胜华 《科技广场》2006,18(11):94-95
This paper describes an improved feature extraction method based on mutual information and its application to the classification of Chinese legal case texts; the concrete implementation process and experimental data are given.

14.
Proactive maintenance practices are becoming more standard in industrial environments, with a direct and profound impact on competitiveness within the sector. These practices demand continuous monitoring of industrial equipment, which generates extensive amounts of data. This information can be processed into useful knowledge with machine learning algorithms. However, before the algorithms can be applied effectively, the data must go through an exploratory phase: assessing the meaning of the features and the degree to which they are redundant. In this paper, we present the findings of an analysis conducted on a real-world dataset from a metallurgic company. A number of data analysis and feature selection methods are employed, uncovering several relationships, which are systematized in a rule-based model, and reducing the feature space from an initial 47-feature dataset to a 32-feature dataset.

15.
Existing personality detection methods based on user-generated text have two major limitations. First, they rely too heavily on pre-trained language models and ignore the sentiment information in psycholinguistic features. Second, there is no consensus on psycholinguistic feature selection, resulting in insufficient analysis of sentiment information. To tackle these issues, we propose a novel personality detection method based on high-dimensional psycholinguistic features and an improved distributed Gray Wolf Optimizer (GWO) for feature selection (IDGWOFS). Specifically, we introduce Gaussian chaos map-based initialization and a neighbor search strategy into the original GWO to improve feature selection performance. To eliminate the bias that arises when mutual information is used to select features, we adopt symmetric uncertainty (SU) instead of mutual information to evaluate correlation and redundancy in the fitness function, balancing the correlation between features and labels against the redundancy among features. Finally, we improve the common Spark-based parallelization design of GWO by parallelizing only the fitness computation steps, improving the efficiency of IDGWOFS. The experiments indicate that the proposed method obtains average accuracy improvements of 3.81% and 2.19%, and average F1 improvements of 5.17% and 5.8%, on the Essays and Kaggle MBTI datasets, respectively. Furthermore, IDGWOFS shows good convergence and scalability.
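Symmetric uncertainty, the quantity used in the fitness function above, normalizes mutual information by the two entropies, which removes MI's bias toward many-valued features. A self-contained sketch over discrete feature vectors:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (natural log) of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), bounded in [0, 1]:
    1 when the two variables determine each other, 0 when independent."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    mi = hx + hy - entropy(list(zip(xs, ys)))  # I(X;Y) via joint entropy
    return 2 * mi / (hx + hy)
```

In a filter fitness function, SU(feature, label) rewards relevance while SU(feature, feature) penalizes redundancy, which matches the correlation/redundancy balance described in the abstract.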

16.
This paper presents a cluster validation based document clustering algorithm, which is capable of identifying an important feature subset and the intrinsic value of the model order (cluster number). The important feature subset is selected by optimizing a cluster validity criterion subject to some constraint. To achieve model order identification capability, this feature selection procedure is conducted for each possible value of the cluster number. The feature subset and the cluster number that maximize the cluster validity criterion are chosen as our answer. We have evaluated our algorithm using several datasets from the 20 Newsgroups corpus. Experimental results show that our algorithm can find the important feature subset, estimate the cluster number and achieve higher micro-averaged precision than previous document clustering algorithms, which require the value of the cluster number to be provided.

17.
Evaluation of non-consensus research projects in science fund selection   (Cited 9 times: 0 self-citations, 9 by others)
In science fund selection, peer review usually has difficulty identifying the innovative projects among non-consensus proposals. This paper therefore proposes using the overall non-consensus of a project and indicator-level non-consensus degrees as quantitative aids for experts judging a project's innovativeness, forming an evaluation method for non-consensus research projects in which qualitative quality judgment is primary and quantitative data is supplementary. The method has been applied successfully in practical evaluation work.

18.
With the development of information technology and economic growth, the Internet of Things (IoT) industry has entered a fast lane of development. The IoT industry system has gradually matured, forming a complete industrial foundation that includes chips, electronic components, equipment, software, integrated systems, IoT services, and telecom operators. In the event of selective forwarding attacks, virus damage, malicious intrusion, and similar incidents, the losses caused by such security problems are more serious than in traditional networks, affecting not only network information but also physical objects. The limited resources of IoT sensor nodes, the complexity of networking, and the open wireless broadcast communication make IoT vulnerable to attacks. An Intrusion Detection System (IDS) helps identify anomalies in the network and takes the necessary countermeasures to ensure the safe and reliable operation of IoT applications. This paper proposes an IoT feature extraction and intrusion detection algorithm for smart cities based on a deep transfer learning model, which combines a deep learning model with intrusion detection technology. Based on the existing literature and algorithms, this paper introduces the modeling scheme of the transfer learning model and of data feature extraction. In the experimental part, KDD CUP 99 was selected as the experimental dataset, with 10% of the data used as training data, and the proposed algorithm was compared with existing algorithms. The experimental results show that the proposed algorithm achieves shorter detection time and higher detection efficiency.

19.
Traditional Information Retrieval (IR) models assume that the index terms of queries and documents are statistically independent of each other, which is intuitively wrong. This paper proposes incorporating the lexical and syntactic knowledge generated by a POS-tagger and a syntactic chunker into traditional IR similarity measures, so as to include this dependency information between terms. Our proposal is based on theories of discourse structure, segmenting documents and queries into sentences and entities. Therefore, we measure dependencies between entities instead of between terms. Moreover, we handle discourse references for each entity. The approach has been evaluated on Spanish and English corpora as well as on Question Answering tasks, obtaining significant improvements.

20.
Research on constructing a management information system for degree and graduate education   (Cited 2 times: 0 self-citations, 2 by others)
柳晨 《科技广场》2007,20(5):163-164
Based on the development situation of degree and graduate education in China, this paper plans and designs a management information system for degree and graduate education.
