Found 12 similar articles (search time: 0 ms)
In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction.

In this paper, the scalability and quality of the contextual document clustering (CDC) approach are demonstrated for large data sets using the whole Reuters Corpus Volume 1 (RCV1) collection. CDC is a form of distributional clustering that automatically discovers contexts of narrow scope within a document corpus. These contexts act as attractors for clustering documents that are semantically related to each other. Once clustered, the documents are organized into a minimum spanning tree so that the topical similarity of adjacent documents within this structure can be assessed. The pre-defined categories from three different document category sets are used to assess the quality of CDC in terms of its ability to group and structure semantically related documents given the contexts. Quality is evaluated on two factors: the category overlap between adjacent documents within a cluster, and how well a representative document categorizes all the other documents within a cluster. As the RCV1 collection was collated in a time-ordered fashion, it was possible to assess the stability of clusters formed from documents within one time interval when presented with new, unseen documents at subsequent time intervals. We demonstrate that CDC is a powerful and scalable technique with the ability to create stable clusters of high quality. Additionally, to our knowledge this is the first time that a collection as large as RCV1 has been analyzed in its entirety using a static clustering approach.

The exponential growth of information available on the World Wide Web, and retrievable by search engines, has made it necessary to develop efficient and effective methods for organizing relevant content. Document clustering plays an important role here and remains an interesting and challenging problem in web computing. In this paper we present a document clustering method that takes into account both the content information and the hyperlink structure of a web page collection, where a document is viewed as a set of semantic units. We exploit this representation to determine the strength of the relation between two linked pages and to define a relational clustering algorithm based on a probabilistic graph representation. The experimental results show that the proposed approach, called RED-clustering, outperforms two of the best-known clustering algorithms, k-Means and Expectation Maximization.

In this paper, a document summarization framework for storytelling is proposed to extract essential sentences from a document by exploiting the mutual effects between terms, sentences and clusters. There are three phases in the framework: document modeling, sentence clustering and sentence ranking. The story document is modeled by a weighted graph whose vertices represent the sentences of the document. The sentences are clustered into different groups to find the latent topics in the story. To alleviate the influence of unrelated sentences in clustering, an embedding process is employed to optimize the document model. The sentences are then ranked according to the mutual effects between terms, sentences and clusters, and high-ranked sentences are selected to form the summary of the document. The experimental results on the Document Understanding Conference (DUC) data sets demonstrate the effectiveness of the proposed method in document summarization. The results also show that the embedding process for sentence clustering renders the system more robust with respect to different cluster numbers.

This paper presents a cluster validation based document clustering algorithm, which is capable of identifying an important feature subset and the intrinsic value of model order (cluster number). The important feature subset is selected by optimizing a cluster validity criterion subject to some constraint. For achieving model order identification capability, this feature selection procedure is conducted for each possible value of cluster number. The feature subset and the cluster number which maximize the cluster validity criterion are chosen as our answer. We have evaluated our algorithm using several datasets from the 20Newsgroup corpus. Experimental results show that our algorithm can find the important feature subset, estimate the cluster number and achieve higher micro-averaged precision than previous document clustering algorithms which require the value of cluster number to be provided.

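Construction and Application of an Extension Innovation Model for Enterprise Independent Innovation   Cited by: 2 (self-citations: 0, citations by others: 2)
Enterprises are the main actors in independent innovation, and improving their capacity for independent innovation bears on a nation's core competitiveness and overall national strength. This article argues that innovation methods are the fundamental source of enterprise independent innovation. After introducing Extenics theory and the system of extension innovation methods, it proposes an Extenics-based model and implementation platform for enterprise independent innovation. Finally, through training in extension innovation thinking and methods and a case study applying Extenics to Olympic marketing, it analyzes new ways for enterprises to pursue research on independent innovation methods.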

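Applying the concept of strategic cost management, and taking into account the actual situation of China's shipbuilding enterprises and the characteristics of the cluster market environment, this paper proposes the construction principles and basic framework of a strategic cost management model for China's shipbuilding enterprises. Based on an analysis of the value chain and the strategic cost drivers of shipbuilding enterprises in a cluster setting, it puts forward the strategic cost positioning of shipbuilding enterprises in that setting and sets out safeguard measures for implementing the model.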

The relevance feedback process uses information obtained from a user about a set of initially retrieved documents to improve subsequent search formulations and retrieval performance. In extended Boolean models, relevance feedback implies not only that new query terms must be identified and re-weighted, but also that the terms must be properly connected with Boolean And/Or operators. Salton et al. proposed a relevance feedback method, called the DNF (disjunctive normal form) method, for a well-established extended Boolean model. However, this method focuses mainly on generating Boolean queries and does not address the re-weighting of query terms. It also has some problems in generating reformulated Boolean queries. In this study, we investigate the problems of the DNF method and propose a relevance feedback method that uses hierarchical clustering techniques to solve them. We also propose a neural network model in which the term weights used in extended Boolean queries can be adjusted by the users’ relevance feedback.

Tourism is a steadily growing industry, driven by improving economic conditions and people's increasing capacity for communication and social interaction. Forecasting tourism demand is important not only for tourism operators seeking to maximize their revenues but also for the formation of countries' economic plans on a global scale. Based on these predictions, countries can regulate the sectors that benefit economically from tourism locally. It is therefore crucial to predict demand accurately many weeks in advance. In this study, we propose a new demand forecasting model for the hospitality industry that forecasts weekly hotel demand four weeks in advance using Attention-Long Short Term Memory (Attention-LSTM). Unlike most existing methods, the proposed method uses the time series demand data together with additional features obtained from K-Means Clustering findings, such as Top 10 Hotel Features or Hotel Embeddings obtained using Neural Networks (NN). The clustering part of our model was motivated by the fact that travelers choose their accommodation according to certain criteria, and hotels meeting similar criteria may have similar demand. Therefore, before the clustering part, we also applied methods that would let us represent the features of the hotels more properly, and we observed that the 10-D Embedded Hotel Data representation with NN Embeddings came to the fore. To assess the performance of the proposed hotel demand forecasting model, we used a real-world dataset provided by a tourism agency in Turkey. The results show that the proposed model achieves lower mean absolute error and mean absolute percentage error (improvements ranging from 3% at worst to 29% at best) than the currently used machine learning and deep learning models.

This paper presents a relevance model to rank the facts of a data warehouse that are described in a set of documents retrieved with an information retrieval (IR) query. The model is based on language modeling and relevance modeling techniques. We estimate the relevance of the facts by the probability of finding their dimension values and the query keywords in the documents that are relevant to the query. The model is the core of the so-called contextualized warehouse, which is a new kind of decision support system that combines structured data sources and document collections. The paper evaluates the relevance model with the Wall Street Journal (WSJ) TREC test subcollection and a self-constructed fact database.

High-quality evaluation of generated summaries is needed if we are to improve automatic summarization systems. Although human evaluation provides better results than automatic evaluation methods, its cost is huge and its results are difficult to reproduce. Therefore, we need an automatic method that simulates human evaluation if we are to improve our summarization system efficiently. Although automatic evaluation methods have been proposed, they are unreliable when used for individual summaries. To solve this problem, we propose a supervised automatic evaluation method based on a new regression model called the voted regression model (VRM). VRM has two characteristics: (1) model selection based on ‘corrected AIC’ to avoid multicollinearity, and (2) voting by the selected models to alleviate the problem of overfitting. Evaluation results obtained for TSC3 and DUC2004 show that our method achieved error reductions of about 17–51% compared with conventional automatic evaluation methods. Moreover, our method obtained the highest correlation coefficients in several different experiments.
