Similar Literature
20 similar records found.
1.
This paper presents a classifier for text data samples consisting of main text and additional components, such as Web pages and technical papers. We focus on multiclass and single-labeled text classification problems and design the classifier based on a hybrid composed of probabilistic generative and discriminative approaches. Our formulation considers individual component generative models and constructs the classifier by combining these trained models based on the maximum entropy principle. We use naive Bayes models as the component generative models for the main text and additional components such as titles, links, and authors, so that we can apply our formulation to document and Web page classification problems. Our experimental results for four test collections confirmed that our hybrid approach effectively combined main text and additional components and thus improved classification performance.
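As an illustration of the hybrid idea, here is a minimal Python sketch that trains one naive Bayes model per document component and combines their log-scores log-linearly. The component names and the fixed weights are assumptions; the paper fits the combination weights via the maximum entropy principle.

```python
# Sketch: per-component naive Bayes models combined log-linearly.
import math
from collections import Counter

class ComponentNB:
    """Multinomial naive Bayes over one document component."""
    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        for text, y in zip(texts, labels):
            self.word_counts[y].update(text.split())
        return self

    def log_posterior(self, text, c):
        total = sum(self.word_counts[c].values())
        vocab = len({w for cnt in self.word_counts.values() for w in cnt})
        lp = math.log(self.class_counts[c] / sum(self.class_counts.values()))
        for w in text.split():  # Laplace-smoothed word likelihoods
            lp += math.log((self.word_counts[c][w] + 1) / (total + vocab))
        return lp

def classify(doc, models, weights):
    """doc: dict of component texts, e.g. {"main": ..., "title": ...}.
    weights are placeholders here; the paper learns them by max entropy."""
    classes = models["main"].classes
    scores = {c: sum(w * models[k].log_posterior(doc[k], c)
                     for k, w in weights.items()) for c in classes}
    return max(scores, key=scores.get)
```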

2.
Pixel features and moment features of images are combined to construct a neural network classifier, which is trained and tested on the extracted feature vectors. Each image is binarized and normalized to 16×16 pixels, from which a 256-dimensional binary pixel feature (16×16), a 13-dimensional grid feature, and 7 Hu moment features are extracted, giving 276 features in total. A BP neural-network classifier is built and trained with the steepest-descent BP algorithm, the momentum BP algorithm, and the variable-learning-rate BP algorithm; under identical conditions, the variable-learning-rate BP algorithm yields the shortest training time and the fastest convergence. A PNN neural-network classifier is also built and compared with the BP classifier; experimental results show that the PNN classifier performs better.
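A minimal sketch of the 276-dimensional feature vector described above (256 binary pixels + 13 grid features + 7 Hu moments), using OpenCV. The 13-cell grid layout is an assumption, as the abstract does not specify it; a 4×4 grid truncated to 13 cells is used as a stand-in.

```python
# Sketch: 256 pixel bits + 13 grid densities + 7 Hu moments = 276 dims.
import cv2
import numpy as np

def extract_features(gray_img):
    _, binary = cv2.threshold(gray_img, 0, 1,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    small = cv2.resize(binary, (16, 16), interpolation=cv2.INTER_NEAREST)
    pixel_feats = small.flatten().astype(np.float32)            # 256 dims
    # Grid features: mean ink density over coarse cells; the paper's
    # actual 13-cell layout is unspecified, so this is a placeholder.
    grid = cv2.resize(binary.astype(np.float32), (4, 4)).flatten()[:13]
    hu = cv2.HuMoments(cv2.moments(binary.astype(np.float32))).flatten()
    return np.concatenate([pixel_feats, grid, hu])               # 276 dims
```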

3.
This paper reports on the underlying IR problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations carried out on two general stemming strategies for this language, and also demonstrates that a light stemming approach could be quite effective. Based on searches done on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better MAP. When compared to an IR scheme without stemming or one based on only a light stemmer, we find the differences to be statistically significant. When compared with probabilistic, vector-space and language models, we find that the Okapi model results in the best retrieval effectiveness. The resulting MAP is found to be about 35% better than the classical tf-idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure for both queries and documents significantly improves IR performance (+10%), compared to word-based indexing strategies.
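For illustration, a light stemmer of the kind evaluated here can be little more than a short suffix-stripping table. The suffix list below is an assumption for demonstration, not the actual rule set from the paper.

```python
# Sketch of a "light" suffix-stripping stemmer; illustrative suffixes only.
LIGHT_SUFFIXES = ["ban", "ben", "nak", "nek", "val", "vel",
                  "ra", "re", "t", "k"]

def light_stem(word, min_stem=3):
    # Try longer suffixes first; keep at least min_stem characters.
    for suf in sorted(LIGHT_SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word
```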

4.
This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on a Czech test collection, we designed and evaluated two stemming approaches, a light one and a more aggressive one, and compared them with a no-stemming scheme as well as a language-independent approach (n-grams). To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM), and the classical tf-idf vector-space approach. We found that the Divergence from Randomness paradigm tends to provide better retrieval effectiveness than the Okapi, LM, or tf-idf models, although the performance differences were statistically significant only with respect to the last two IR approaches. Ignoring stemming generally reduces MAP by more than 40%, and these differences are always significant. Finally, while our more aggressive stemmer tends to show the best performance, its differences in performance from the light stemmer are not statistically significant.
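The language-independent character n-gram alternative mentioned above can be sketched in a few lines; n = 4 and the boundary-padding convention are illustrative choices, not the paper's exact settings.

```python
# Sketch: character n-gram indexing units for a word.
def char_ngrams(word, n=4):
    w = f"_{word}_"          # pad to mark word boundaries
    if len(w) <= n:
        return [w]
    return [w[i:i + n] for i in range(len(w) - n + 1)]

# e.g. char_ngrams("praha") -> ['_pra', 'prah', 'raha', 'aha_']
```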

5.
This work assesses the performance of two N-gram matching techniques for Arabic root-driven string searching: contiguous N-grams and hybrid N-grams that combine contiguous and non-contiguous ones. The two techniques were tested in three experiments involving different levels of textual word stemming, a textual corpus containing about 25 thousand words (with a total size of about 160 KB), and a set of 100 query words. The results of the hybrid approach showed significant performance improvement over the conventional contiguous approach, especially in the cases where stemming was used. The present results and the inconsistent findings of previous studies raise some questions regarding the efficiency of pure conventional N-gram matching and the ways in which it should be used in languages other than English.
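A sketch contrasting the two matching techniques: contiguous bigrams versus a hybrid set that also includes non-contiguous (skip) character pairs. The Dice coefficient used for matching is an assumption; it is a common choice for n-gram string similarity.

```python
# Sketch: contiguous vs. hybrid (contiguous + non-contiguous) bigrams.
from itertools import combinations

def contiguous_ngrams(word, n=2):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def hybrid_ngrams(word):
    # All ordered character pairs, contiguous or not (illustrative).
    return ["".join(p) for p in combinations(word, 2)]

def dice(a, b):
    """Dice similarity between two n-gram sets (assumed matching score)."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb))
```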

6.
Target noise feature extraction is a key technology in passive sonar target recognition systems. This paper first proposes a new method for analyzing ship-radiated noise signals using nonlinear features extracted from the noise limit cycle, and then classifies the extracted features with a neural network based on an adaptive genetic BP algorithm. Experimental results show that the system achieves good classification performance.

7.
This paper proposes a facial keypoint detection method that uses only a small number of frontal images and does not require face-image normalization, whereas traditional keypoint detection methods demand strict image preprocessing. Random forests are a classifier-ensemble algorithm well suited to multi-class problems, and although LBP features are simple, they capture a large amount of texture information. Combining an improved LBP feature with a random forest yields a facial keypoint detector: LBP features are extracted from Gaussian-smoothed images to generate a feature for each point, the informative features are taken as positive examples, and together with a set of negative examples they form the training set. Classification with the random forest classifier achieves a low error rate of only about 10%.
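A minimal sketch of the basic 8-neighbor LBP code underlying the texture feature above. The paper's "improved" LBP variant is not specified here, so this shows only the standard operator.

```python
# Sketch: standard 8-neighbor LBP codes over a grayscale image.
import numpy as np

def lbp_image(gray):
    g = gray.astype(np.int16)
    center = g[1:-1, 1:-1]
    codes = np.zeros_like(center, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the neighbor in direction (dy, dx).
        neigh = g[1 + dy: g.shape[0] - 1 + dy,
                  1 + dx: g.shape[1] - 1 + dx]
        codes |= ((neigh >= center).astype(np.uint8) << bit)
    return codes
```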

8.
Stemming and lemmatization are important steps in English text processing. This paper uses three clustering algorithms to carry out fairly comprehensive experiments on two stemming algorithms and one lemmatization algorithm. The results show that both stemming and lemmatization can improve the effectiveness and efficiency of English text clustering, although their influence on the clustering results is not significant. Compared with the Snowball Stemmer and the Stanford Lemmatizer, the Porter Stemmer performs better and more stably in terms of Entropy and Purity.
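A small sketch of how the three normalizers could be compared with NLTK. Note that NLTK's WordNet lemmatizer stands in for the Stanford Lemmatizer here, which is an assumption rather than the tool used in the paper.

```python
# Sketch: compare Porter, Snowball, and a lemmatizer on sample words.
# Requires the WordNet data: nltk.download("wordnet")
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
wnl = WordNetLemmatizer()

for w in ["studies", "ponies", "running", "meeting"]:
    print(w, porter.stem(w), snowball.stem(w), wnl.lemmatize(w))
```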

9.
Multi-classifier fusion is applied to customer classification in CRM to improve classification performance. With decision trees as the base classifiers, a least-squares technique is introduced for linear fusion of the multiple classifiers. Empirical results show that all four fusion schemes outperform any single base classifier, and even surpass a neural-network fusion based on a genetic algorithm, demonstrating the feasibility and effectiveness of the method.
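A sketch of least-squares linear fusion over bootstrapped decision trees in the spirit of this study; the number of trees, tree depth, and binary labels are illustrative assumptions.

```python
# Sketch: least-squares weights for fusing decision-tree outputs.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_ls_fusion(X, y, n_trees=4, seed=0):
    """Binary labels y in {0, 1} assumed (an illustrative choice)."""
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), len(X))  # bootstrap sample
        trees.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))
    # Stack base-classifier scores, then solve least squares for the
    # linear fusion weights against the true labels.
    P = np.column_stack([t.predict_proba(X)[:, 1] for t in trees])
    w, *_ = np.linalg.lstsq(P, y.astype(float), rcond=None)
    return trees, w

def predict_ls_fusion(trees, w, X):
    P = np.column_stack([t.predict_proba(X)[:, 1] for t in trees])
    return (P @ w > 0.5).astype(int)
```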

10.
Artificial intelligence (AI) is rapidly becoming the pivotal solution to support critical judgments in many life-changing decisions. In fact, a biased AI tool can be particularly harmful since these systems can contribute to or demote people’s well-being. Consequently, government regulations are introducing specific rules to prohibit the use of sensitive features (e.g., gender, race, religion) in the algorithm’s decision-making process to avoid unfair outcomes. Unfortunately, such restrictions may not be sufficient to protect people from unfair decisions as algorithms can still behave in a discriminatory manner. Indeed, even when sensitive features are omitted (fairness through unawareness), they could be somehow related to other features, named proxy features. This study shows how to unveil whether a black-box model, complying with the regulations, is still biased or not. We propose an end-to-end bias detection approach exploiting a counterfactual reasoning module and an external classifier for sensitive features. In detail, the counterfactual analysis finds the minimum cost variations that grant a positive outcome, while the classifier detects non-linear patterns of non-sensitive features that proxy sensitive characteristics. The experimental evaluation reveals the proposed method’s efficacy in detecting classifiers that learn from proxy features. We also scrutinize the impact of state-of-the-art debiasing algorithms in alleviating the proxy feature problem.
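One way to sketch the "external classifier" component: if a model can predict the sensitive attribute from the remaining features well above the majority-class rate, those features likely act as proxies. The classifier choice below is an assumption, not the paper's exact setup.

```python
# Sketch: detect proxy features by predicting the sensitive attribute.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def proxy_score(X_nonsensitive, sensitive):
    """sensitive: array of 0/1 attribute values. Returns the lift of
    cross-validated accuracy over the majority-class baseline; a value
    well above 0 suggests proxy information is present."""
    clf = GradientBoostingClassifier()
    acc = cross_val_score(clf, X_nonsensitive, sensitive, cv=5).mean()
    base = max(np.bincount(sensitive)) / len(sensitive)
    return acc - base
```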

11.
This paper presents an algorithm for generating stemmers from text stemmer specification files. A small study shows that the generated stemmers are computationally efficient, often running faster than stemmers custom written to implement particular stemming algorithms. The stemmer specification files are easily written and modified by non-programmers, making it much easier to create a stemmer, or tune a stemmer's performance, than would be the case with a custom stemmer program. Stemmer generation is thus also human-resource efficient.
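A toy illustration of the generator idea: compile a stemmer function from a rule table. The (suffix, replacement, minimum-stem-length) rule format is an assumption for demonstration, not the paper's actual specification-file syntax.

```python
# Sketch: build a stemmer from a declarative rule table.
def make_stemmer(rules):
    """rules: list of (suffix, replacement, min_stem_length) tuples."""
    rules = sorted(rules, key=lambda r: len(r[0]), reverse=True)
    def stem(word):
        for suffix, repl, min_len in rules:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                return word[: -len(suffix)] + repl
        return word
    return stem

english_light = make_stemmer([("ies", "y", 2), ("es", "", 2), ("s", "", 3)])
# english_light("stories") -> "story"
```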

12.
郑明国  蔡强国 《资源科学》2007,29(3):214-220
Starting from the concepts of producer's accuracy and user's accuracy, this paper first formulates both as conditional probabilities and then, using the product rule of probability, derives the relationship between them. The relationship shows that: (1) the ratio of user's accuracy to producer's accuracy can serve as an estimate of the ratio of a class's true area to its area in the classification result; and (2) this ratio can be used to correct remote-sensing classification results, yielding land-cover class areas closer to the true values, where the accuracy of the correction depends only on the reliability of the user's and producer's accuracy figures, not on the quality of the classification algorithm. The method can also be used to estimate the prior probabilities in maximum-likelihood classification. Experiments on the lanier image file supplied with the Erdas Imagine software show that all tested classification results, including one obtained by arbitrarily modifying a conventional maximum-likelihood classification, produced class-area proportions close to the true values after correction by the proposed method. Since an error matrix is produced for nearly every classified image as the standard accuracy-assessment procedure, the method has good practical value and can help land-use/land-cover studies obtain more accurate area data.
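A short derivation makes the stated relationship explicit (notation assumed: C is the true class, C-hat the mapped class):

```latex
% Producer's accuracy (PA) and user's accuracy (UA) for class k:
PA_k = P(\hat{C} = k \mid C = k), \qquad UA_k = P(C = k \mid \hat{C} = k)
% By the product rule,
P(C = k)\,PA_k = P(C = k,\ \hat{C} = k) = P(\hat{C} = k)\,UA_k
% hence
\frac{P(C = k)}{P(\hat{C} = k)} = \frac{UA_k}{PA_k}
% i.e. the UA/PA ratio estimates (true area)/(mapped area), giving the
% corrected area estimate \hat{A}_k = A_k^{\text{mapped}} \cdot UA_k / PA_k.
```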

13.
Practical classification problems often involve some kind of trade-off between the decisions a classifier may take. Indeed, it may be the case that decisions are not equally good or costly; therefore, it is important for the classifier to be able to predict the risk associated with each classification decision. Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. The objective is to quantify the trade-off between various classification decisions using probability and the costs that accompany such decisions. Within this framework, a loss function quantifies the costs, and hence the risk, of taking one decision over another.
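Within this framework, the conditional risk and the Bayes decision rule take the standard form:

```latex
% Conditional risk of action \alpha_i given observation x, with loss
% \lambda(\alpha_i \mid \omega_j) when the true class is \omega_j:
R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)
% Bayes decision rule: choose the action with minimum conditional risk,
\alpha^{*}(x) = \arg\min_{i} R(\alpha_i \mid x)
```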

14.
The research field of crisis informatics examines, amongst others, the potentials and barriers of social media use during disasters and emergencies. Social media allow emergency services to receive valuable information (e.g., eyewitness reports, pictures, or videos). However, the vast amount of data generated during large-scale incidents can lead to the issue of information overload. Research indicates that supervised machine learning techniques are suitable for identifying relevant messages and filtering out irrelevant ones, thus mitigating information overload. Still, they require a considerable amount of labeled data, clear criteria for relevance classification, a usable interface to facilitate the labeling process, and a mechanism to rapidly deploy retrained classifiers. To overcome these issues, we present (1) a system for social media monitoring, analysis and relevance classification, (2) abstract and precise criteria for relevance classification in social media during disasters and emergencies, (3) the evaluation of a well-performing Random Forest algorithm for relevance classification incorporating metadata from social media into a batch learning approach (e.g., 91.28%/89.19% accuracy, 98.3%/89.6% precision and 80.4%/87.5% recall with a fast training time with feature subset selection on the European floods/BASF SE incident datasets), as well as (4) an approach and preliminary evaluation for relevance classification including active, incremental and online learning to reduce the amount of required labeled data and to correct misclassifications of the algorithm by feedback classification. Using the latter approach, we achieved a well-performing classifier based on the European floods dataset while requiring only a quarter of the labeled data needed by the traditional batch learning approach. Despite a lesser effect on the BASF SE incident dataset, a substantial improvement could still be determined.
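The active-learning component can be sketched as an uncertainty-sampling loop around a Random Forest; the batch size and binary relevance labels are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: one uncertainty-sampling round for relevance labeling.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def query_most_uncertain(clf, X_labeled, y_labeled, X_pool, budget=50):
    """Retrain, then pick the pool items the model is least certain
    about for human labeling (binary relevance labels assumed)."""
    clf.fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(proba - 0.5)   # closest to 0.5 = most uncertain
    return np.argsort(uncertainty)[-budget:]

# clf = RandomForestClassifier(n_estimators=100)
# idx = query_most_uncertain(clf, X_l, y_l, X_pool)
# ...have a human label X_pool[idx], add to the labeled set, repeat.
```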

15.
This paper presents a new method for the performance evaluation of bit decoding algorithms. The method is based on estimating the probability density function (pdf) of the bit log likelihood ratio (LLR) by using an exponential model. It is widely known that the pdf of the bit LLR is close to the normal density. The proposed approach takes advantage of this property to present an efficient algorithm for the pdf estimation. The moment matching method is combined with the maximum entropy principle to estimate the underlying parameters. We present a simple method for computing the probabilities of the point estimates for the estimated parameters, as well as for the bit error rate. The corresponding results are used to compute the number of samples that are required for a given precision of the estimated values. It is demonstrated that this method requires significantly fewer samples as compared to the conventional Monte-Carlo simulation.
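As a sketch of the model class involved: a maximum-entropy density under moment constraints has an exponential form, with the normal density as a low-order special case. The order used in the paper is not restated here.

```latex
% Maximum-entropy density subject to moment constraints E[x^i] = m_i,
% i = 1, \dots, k, takes the exponential form
\hat{p}(x) = \exp\Big( \lambda_0 + \sum_{i=1}^{k} \lambda_i \, x^i \Big)
% k = 2 recovers the normal density, consistent with the bit-LLR pdf
% being close to normal; the \lambda_i are fitted by moment matching.
```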

16.
Based on the mechanism by which borehole stemming affects blasting, three-dimensional blasting models with different charge coefficients and stemming conditions were built in the finite-element dynamic analysis software ANSYS/LS-DYNA and computed with the multi-material ALE algorithm. By analyzing and comparing the overall stress contours and the effective stress in the stemming section, the influence of stemming conditions on the blasting effect under reverse initiation and the effective range of stemming under different charge coefficients were obtained, providing a reference for numerical blasting verification and practical engineering.

17.
黄莉  李湘东 《情报杂志》2012,31(7):177-181,176
The KNN (k-nearest-neighbor) algorithm is one of the most basic and widely used algorithms in automatic text classification, and it requires computing the similarity between texts. Taking the Jensen-Shannon divergence as an example, this paper derives and explains its basic principle and then uses it to compute inter-text similarity; for comparison, the conventional cosine measure is also used. KNN classification is then performed with each measure to examine how the choice of similarity computation affects KNN-based automatic text classification. Empirical studies on several test collections show that, compared with the cosine measure, classification based on the Jensen-Shannon divergence achieves higher accuracy but takes more time.
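A minimal sketch of the Jensen-Shannon divergence between two term-frequency distributions, as used for the KNN similarity above:

```python
# Sketch: JS divergence between two term-frequency vectors.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    p = np.asarray(p, float); p = p / p.sum()
    q = np.asarray(q, float); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Smaller divergence = more similar; with log base 2 the value lies in
# [0, 1], so a KNN similarity could use 1 - js_divergence(p, q).
```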

18.
逯洋 《科技广场》2007,(3):173-174
String matching is one of the most extensively studied problems in computer science, with broad applications in text editing and processing, image processing, document retrieval, natural-language recognition, biology, and other fields. As the Internet grows and the amount of information explodes, quickly finding the desired information in massive data has become a hot topic in web-search research, in which string-matching algorithms play a very important role: a good matching algorithm can significantly improve an application's efficiency. This paper studies how to design an algorithm that finds the longest common substring of two arbitrary strings and its length; such an algorithm can be applied in automatic grading systems, query systems, retrieval systems, and many other systems.
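The classic dynamic-programming solution to this problem runs in O(mn) time and space; a minimal sketch:

```python
# Sketch: longest common substring of two strings by DP.
def longest_common_substring(s, t):
    m, n = len(s), len(t)
    # dp[i][j] = length of the common suffix of s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best_len, best_end = 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:
                    best_len, best_end = dp[i][j], i
    return s[best_end - best_len: best_end], best_len

# longest_common_substring("retrieval", "retrieve") -> ("retriev", 7)
```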

19.
The aim in multi-label text classification is to assign a set of labels to a given document. Previous classifier-chain and sequence-to-sequence models have been shown to have a powerful ability to capture label correlations. However, they rely heavily on the label order, while labels in multi-label data are essentially an unordered set. The performance of these approaches is therefore highly variable depending on the order in which the labels are arranged. To avoid being dependent on label order, we design a reasoning-based algorithm named Multi-Label Reasoner (ML-Reasoner) for multi-label classification. ML-Reasoner employs a binary classifier to predict all labels simultaneously and applies a novel iterative reasoning mechanism to effectively utilize the inter-label information, where each instance of reasoning takes the previously predicted likelihoods for all labels as additional input. This approach is able to utilize information between labels, while avoiding the issue of label-order sensitivity. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches on the challenging AAPD dataset. We also apply our reasoning module to a variety of strong neural-based base models and show that it is able to boost performance significantly in each case.
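The iterative reasoning mechanism can be sketched as a loop that feeds the previous round's label likelihoods back in as extra input. The placeholder linear classifier and round count below are assumptions, not the paper's architecture.

```python
# Sketch: iterative reasoning over label likelihoods (PyTorch).
import torch
import torch.nn as nn

class Reasoner(nn.Module):
    def __init__(self, text_dim, n_labels, rounds=3):
        super().__init__()
        self.rounds = rounds
        self.n_labels = n_labels
        # Placeholder for the paper's base model: text features plus
        # previous label likelihoods map to new label logits.
        self.classifier = nn.Linear(text_dim + n_labels, n_labels)

    def forward(self, text_feats):
        probs = torch.full((text_feats.size(0), self.n_labels), 0.5,
                           device=text_feats.device)  # uninformative start
        for _ in range(self.rounds):
            logits = self.classifier(torch.cat([text_feats, probs], dim=1))
            probs = torch.sigmoid(logits)  # independent binary predictions
        return probs
```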

20.
Feature selection, which can reduce the dimensionality of the vector space without sacrificing the performance of the classifier, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both inter-category and intra-category. We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and DIA association factor (DIA) when the Naïve Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.
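A sketch of a comprehensive term-significance score in the spirit of CMFS, combining an intra-category factor P(t|c) with an inter-category factor P(c|t); the exact published formula may differ, so treat this as an assumption.

```python
# Sketch: score terms by an intra- times inter-category measure.
import numpy as np

def cmfs_scores(tf, smooth=1.0):
    """tf: (n_terms, n_categories) matrix of term frequencies."""
    tf = tf + smooth
    p_t_given_c = tf / tf.sum(axis=0, keepdims=True)   # intra-category
    p_c_given_t = tf / tf.sum(axis=1, keepdims=True)   # inter-category
    per_class = p_t_given_c * p_c_given_t
    return per_class.max(axis=1)   # best per-class score for each term

# Keep the k highest-scoring terms: np.argsort(cmfs_scores(tf))[-k:]
```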
