Similar Literature
 20 similar documents found (search time: 15 ms)
1.
阮光册 《情报科学》2012,(1):105-109
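Using text mining to uncover latent, valuable information in online news reports is a new approach in information research. The author discusses text mining methods for online news and conducts an empirical study using online news media coverage of the Shanghai World Expo as a case, with a comparative analysis of the differences in coverage. The paper selects Expo-related reports from Hong Kong media, Taiwan media, Chinese-language editions of overseas media, and local Shanghai media, describes the differences in report content based on text mining and feature extraction, and draws conclusions.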

2.
A Study of Users' Web Information Query Needs   Total citations: 6, self-citations: 0, citations by others: 6
曹树金  马利霞  郑敏 《情报科学》2006,24(6):876-883
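Presents the main results, relevant to web information organization, of a survey of users' web information query needs, and discusses three implications for web information organization: enhancing keyword search functionality is a long-term task; web academic classification schemes and folksonomies should coexist; and research on web page classification should be strengthened.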

3.
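[Purpose/Significance] With the rapid development of the Internet, online media have become the main carrier of public opinion. How to effectively obtain public sentiment on public policy from online media, in order to guide agenda setting for public policy communication, is one of the important questions facing government agencies. [Method/Process] Based on online media data, this paper applies data analysis techniques such as data mining and machine learning to propose a public-policy-oriented text analysis framework for online media content. Using text semantic analysis methods, it mines and compares the public policy agenda setting of mainstream online media with netizen opinion on social media from the perspectives of topic identification and sentiment analysis, and validates the framework using new energy vehicle policy as an example. [Result/Conclusion] The empirical study finds a considerable gap between the issues currently reported by online media and the focus of public attention on social media regarding the policy. Taking new energy vehicle policy as an example, the public tends to focus on how much they personally stand to gain, whereas online media reports mainly describe the policy and convey information. It is recommended that, when using online media to communicate public policy, government agencies set and adjust the agenda according to the topics the public focuses on, in order to increase public acceptance of the policy.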

4.
5.
6.
Queries submitted to search engines can be classified according to user goals into three distinct categories: navigational, informational, and transactional. Such classification may be useful, for instance, as additional information for advertisement selection algorithms and for search engine ranking functions, among other possible applications. This paper presents a study of the impact of using several features extracted from the document collection and query logs on the task of automatically identifying the users’ goals behind their queries. We propose the use of new features not previously reported in the literature and study their impact on the quality of the query classification task. Further, we study the impact of each feature on different web collections, showing that the choice of the best set of features may change according to the target collection.

7.
With the increase of information on the Web, it is difficult to quickly find the desired information among the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has focused on the subject or topic of a document. Genre, or style, is another view of a document, distinct from its subject or topic, and is also a criterion for classifying documents. In this paper, we suggest multiple sets of features to classify genres of web documents. The basic set of features, proposed in previous studies, is acquired from the textual properties of documents, such as the number of sentences, the number of occurrences of a certain word, etc. However, web documents differ from textual documents in that they contain URLs and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from URLs and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features and to discuss their characteristics. Finally, we conclude which set of features is appropriate for automatic genre classification of web documents.

8.
We present a term weighting approach for improving web page classification, based on the assumption that the images of a web page are the elements that mainly attract the user's attention. This assumption implies that the text contained in the visual block in which an image is located, called an image-block, should contain significant information about the page contents. In this paper we propose a new metric, called the Inverse Term Importance Metric, aimed at assigning higher weights to important terms contained in important image-blocks identified by performing a visual layout analysis. We propose different methods to estimate the importance of visual image-blocks and to smooth the term weight according to the importance of the blocks in which the term is located. The traditional TFxIDF model is modified accordingly and used in the classification task. The effectiveness of this new metric and of the proposed block evaluation methods has been validated using different classification algorithms.
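As a rough illustration of the idea in this abstract, the sketch below scales each term occurrence by an importance score of the visual block it appears in before applying IDF. The abstract does not give the formula for the Inverse Term Importance Metric, so the weighting scheme, the function name `block_weighted_tfidf`, and the block importance values are all assumptions made for illustration only.

```python
import math
from collections import defaultdict


def block_weighted_tfidf(pages):
    """Compute block-weighted TF-IDF scores.

    pages: list of pages; each page is a list of (importance, tokens) blocks.
    Hypothetical scheme (not the paper's metric): a term occurrence counts
    `importance` times instead of once, then a plain log IDF is applied.
    """
    n_pages = len(pages)
    doc_freq = defaultdict(int)   # number of pages containing each term
    weighted_tf = []              # per-page importance-weighted term counts
    for blocks in pages:
        tf = defaultdict(float)
        for importance, tokens in blocks:
            for tok in tokens:
                tf[tok] += importance
        weighted_tf.append(tf)
        for tok in tf:
            doc_freq[tok] += 1
    return [
        {tok: w * math.log(n_pages / doc_freq[tok]) for tok, w in tf.items()}
        for tf in weighted_tf
    ]


# Toy data: importance scores and tokens are invented for the example.
pages = [
    [(2.0, ["camera", "lens"]), (0.5, ["login", "camera"])],
    [(1.0, ["login", "news"])],
]
weights = block_weighted_tfidf(pages)
```

In this toy input, "camera" gains weight from the high-importance block, while "login" appears on every page and so receives an IDF of zero.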

9.
朱学芳  冯曦曦 《情报科学》2012,(7):1012-1015
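By studying the HTML structure and features of agricultural web pages, this paper describes an experimental study of content-based information extraction and classification for agricultural web pages. In the experiment, the DOM structure is used to extract and preprocess information from agricultural web pages; category attributes of the text are computed automatically from its content to obtain feature words; and, by summarizing the features of sample documents, new documents are classified automatically. The experimental results show that the proposed information extraction has low time complexity and high precision, and improves classification accuracy.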

10.
Ontologies and folksonomies are currently the most prominent web content classification schemes. While their roles are similar, their engineering is different. In an attempt to combine and harness their distinct powers, web and information scientists are attempting to integrate them, merging the flexibility, collaboration and information aggregation of folksonomies with the standardisation, automated validation and interoperability of ontologies. This paper explores the basics of web information classification engineering, identifies the strengths and weaknesses of the existing methodologies, assesses their effectiveness and investigates a number of key quality issues. It then investigates the existing methods for integrating ontologies and folksonomies and examines the integration requirements. It finally proposes a common framework for reconciliation of the two classification approaches and quality assurance.

11.
This paper is concerned with the quality of training data in learning to rank for information retrieval. While many data selection techniques have been proposed to improve the quality of training data for classification, research on the same issue for ranking appears to be insufficient. As pointed out in this paper, it is inappropriate to extend technologies for classification to ranking, and the development of novel technologies is sorely needed. In this paper, we study the development of such technologies. To begin with, we propose the concept of “pairwise preference consistency” (PPC) to describe the quality of a training data collection from the ranking point of view. PPC takes into consideration the ordinal relationship between documents as well as the hierarchical structure of queries and documents, both of which are unique properties of ranking. Then we select a subset of the original training documents by maximizing the PPC of the selected subset. We further propose an efficient solution to the maximization problem. Empirical results on the LETOR benchmark datasets and a web search engine dataset show that with the subset of training data selected by our approach, the performance of the learned ranking model can be significantly improved.

12.
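Corporate Social Responsibility Information Disclosure in China: Current Status and Countermeasures   Total citations: 2, self-citations: 0, citations by others: 2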
周祖城  王旭  韦佳园 《软科学》2007,21(4):83-86,90
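Based on a survey and analysis of the stand-alone reports, listed-company annual reports, website annual reports, and website sections of China's top 100 companies, the paper reveals the current state of social responsibility information disclosure by Chinese enterprises, analyzes and discusses it, and proposes several countermeasures.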

13.
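Evaluation of the Automatic Classification Function of Search Engines   Total citations: 2, self-citations: 0, citations by others: 2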
朱剑俊 《情报科学》2006,24(5):754-757
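This paper analyzes the automatic classification of search results by search engines. Through retrieval experiments that simulate a real environment, it compares the characteristics, differences, and user usage of this function in “中国搜索” and “搜狗”, and evaluates them.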

14.
潘晓  段鑫星 《情报科学》2021,39(7):131-135
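[Purpose/Significance] To address the low accuracy of collected intelligence, low recall in information retrieval, and low efficiency of intelligence classification management in current intelligence collection system models for small and medium-sized enterprises (SMEs), this paper proposes constructing an SME intelligence collection system model based on LDA and the fuzzy VIKOR method. [Method/Process] The architecture of the model is designed and built according to the LDA model; knowledge resources are collected through the enterprise management architecture, and the acquired knowledge is assigned to the corresponding modules of that architecture, achieving integrated management of enterprise knowledge. Classification steps for SME intelligence are designed according to the fuzzy VIKOR method; a standard Bayesian statistical method is introduced to obtain the optimal number of topics, and Gibbs sampling is used to derive the vector of the overall probability distribution over the latent topic set, achieving classification management in the system. [Result/Conclusion] Experimental results show that the system has high accuracy and can effectively improve the efficiency of intelligence classification management and retrieval recall. [Innovation/Limitations] This paper uses the LDA model to integrate and manage enterprise knowledge and combines it with the fuzzy VIKOR method to classify and manage collected enterprise intelligence, building an accurate and efficient system model; however, the model has not been applied in real enterprises for feedback and refinement.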

15.
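With the rapid development of the Internet, the number of web pages has expanded sharply, growing exponentially in recent years. Search engines face increasingly severe challenges and find it hard to locate, accurately and quickly, the pages that meet users' needs among a massive number of pages. Web page classification is one effective means of addressing this problem. Classification by topic and classification by genre are the two main streams of web page classification, and both effectively improve search engine retrieval efficiency. Web page genre classification means classifying web pages by their form of presentation and their purpose. This paper introduces the definition of web page genre and the classification features commonly used in genre classification research, and describes several commonly used feature selection methods, classification models, and classifier evaluation methods, giving researchers an overview of web page genre classification.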

16.
A Text Classification Method Based on Fuzzy Vector Space   Total citations: 1, self-citations: 0, citations by others: 1
郑凤萍  刘春雨 《情报科学》2007,25(4):588-591
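To address automatic text classification, this paper proposes a classification method based on a fuzzy vector space model and a radial basis function network. The network consists of an input layer, a hidden layer, and an output layer. The input layer receives the samples to be classified, the hidden layer extracts the pattern features implicit in the input samples, and the classification result is presented at the output layer. During feature extraction the method takes full account of the position of feature terms within the document and constructs fuzzy feature vectors, bringing automatic classification closer to manual classification. The method's effectiveness is verified using part of the document data from the 中国期刊网 full-text database.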

17.
金燕  闫晓妍  林琳 《现代情报》2009,29(3):23-25
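Introduces the key technologies for automatic collection of competitive intelligence (CI) in the Web environment and constructs a CI model based on automatic collection. The model can automatically collect Web information sources, perform text analysis, classification and clustering, monitor information sources on specific topics, and generate competitive intelligence reports for submission to corporate decision-makers, thereby improving the timeliness and scientific soundness of corporate decisions.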

18.
A new dictionary-based text categorization approach is proposed to classify chemical web pages efficiently. Using a chemistry dictionary, the approach can extract chemistry-related information more exactly from web pages. After automatic segmentation of the documents to find dictionary terms for document expansion, the approach adopts latent semantic indexing (LSI) to produce the final document vectors, and the relevant categories are finally assigned to the test document by using the k-NN text categorization algorithm. The effects of the characteristics of the chemistry dictionary and the test collection on categorization efficiency are discussed in this paper, and a new voting method, based on the collection characteristics, is also introduced to further improve categorization performance. The experimental results show that the proposed approach outperforms the traditional categorization method and is applicable to the classification of chemical web pages.
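The final k-NN step described in this abstract can be sketched as follows. This is a minimal illustration, assuming the document vectors have already been produced (for example by LSI); the abstract does not specify its new voting method, so the similarity-weighted vote used here, the function names, and the toy vectors and labels are assumptions rather than the authors' approach.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_classify(query, training, k=3):
    """Return the label with the highest summed similarity among the k nearest.

    training: list of (vector, label) pairs. Similarity-weighted voting is
    a generic choice, not the voting method proposed in the paper.
    """
    neighbours = sorted(
        training, key=lambda item: cosine(query, item[0]), reverse=True
    )[:k]
    votes = defaultdict(float)
    for vec, label in neighbours:
        votes[label] += cosine(query, vec)
    return max(votes, key=votes.get)


# Toy 2-dimensional vectors with hypothetical category labels.
training = [
    ([1.0, 0.1], "organic"),
    ([0.9, 0.2], "organic"),
    ([0.1, 1.0], "inorganic"),
    ([0.2, 0.9], "inorganic"),
]
label = knn_classify([0.8, 0.3], training, k=3)
```

Weighting each vote by similarity lets a close neighbour count for more than a distant one, which matters when the k nearest documents are split between categories.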

19.
Due to the large repository of documents available on the web, users are usually inundated by a large volume of information, most of which is found to be irrelevant. Since user perspectives vary, a client-side text filtering system that learns the user's perspective can reduce the problem of irrelevant retrieval. In this paper, we have provided the design of a customized text information filtering system which learns user preferences and modifies the initial query to fetch better documents. It uses a rough-fuzzy reasoning scheme. The rough-set based reasoning takes care of natural language nuances, like synonym handling, very elegantly. The fuzzy decider provides qualitative grading to the documents for the user's perusal. We have provided the detailed design of the various modules and some results related to the performance analysis of the system.

20.
石建  刘红鹰 《现代情报》2009,29(5):121-123
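Addressing the content of personalized web information services and the related technical issues that have been raised, this paper focuses on current representative research in priority areas of personalized information services. It argues that, among the key technologies used by current web personalized information systems, the areas deserving particular attention are the representation of user interests and behavior, clustering and classification, and the security and system evaluation of personalized information services.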
