Similar literature
Found 20 similar documents (search time: 859 ms)
1.
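Research on Similarity Computation Methods for XML Documents   Total citations: 1 (self-citations: 0, citations by others: 1)
XML (Extensible Markup Language) is becoming the standard for exchanging information among applications on the Web. With the large-scale emergence of semi-structured data in XML format, how to process and manage XML documents has become an active research topic. Computing the similarity of XML documents is an important problem in XML data processing and a key technique for XML document clustering and retrieval. An XML document consists of a logical structure and textual content, so the similarity between XML documents can be measured by structural features or by content features. This paper divides XML document similarity methods into two classes, structure-based methods and methods that combine structure with content, and compares and reviews the existing methods.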
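
As a rough illustration of what a structure-based measure can look like, the sketch below compares two XML documents by the Jaccard overlap of their root-to-node tag paths. This particular measure, the function names `tag_paths` and `structural_similarity`, and the sample documents are assumptions made for illustration; the abstract does not say which measures the survey covers.

```python
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    """Return the set of root-to-node tag paths found in an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths

def structural_similarity(xml_a, xml_b):
    """Jaccard overlap of tag-path sets; text content is ignored entirely."""
    paths_a, paths_b = tag_paths(xml_a), tag_paths(xml_b)
    return len(paths_a & paths_b) / len(paths_a | paths_b)

# Hypothetical sample documents: same root and title element, one differing child.
doc_a = "<book><title>A</title><author>X</author></book>"
doc_b = "<book><title>B</title><year>2001</year></book>"
similarity = structural_similarity(doc_a, doc_b)
```

Because only tag paths are compared, two documents with identical structure but unrelated text score 1.0, which is the kind of limitation that motivates combining structure with content.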

2.
Over a six-month time period a school board and its community discussed their district's strategic plan goals about diversity. This article analyzes that discussion within the practical theory frame articulated by Craig. Meeting talk and documents were analyzed to determine how the group's policy deliberation became an argument over what words to have (or avoid) in the strategic plan document. Proposals about document language were framed as technical editing, as inadvertent changing of a policy, and as wordsmithing. In addition to each frame being used, each was also challenged as to its being used to advance some group members' interests at the expense of others. Moving back and forth between using and resisting wording proposal frames, we suggest, is a reasonable way for groups to manage a dilemma they face in crafting policies about controversial, abstract issues. The paper concludes by identifying implications for dilemma theorizing and future study of groups.

3.
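With the spread of computer technology, handling everyday affairs through electronic records has become widely accepted, and this way of working is increasingly the norm. In the foreseeable future, electronic records will become an important means of passing on human memory. China's electronic records management is currently at a critical stage of transition from a dual-track system to a single-track system. Taking the evidentiary status of electronic records as the starting point, the author analyzes the current state of research in China on the evidentiary force of electronic records, identifies the factors that hinder electronic records from serving as legal evidence, and proposes corresponding measures from a management perspective, so that reform of the electronic records management system can proceed without such worries.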

4.
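A Survey of Literature Recommender Systems   Total citations: 1 (self-citations: 0, citations by others: 1)
Literature recommender systems help users discover personalized information among massive volumes of literature and have become an important component of literature retrieval systems. Research on literature recommendation grew out of the combined results of information retrieval, bibliometrics, and e-commerce recommender systems. The paper first discusses general personalized recommendation techniques, then systematically analyzes and summarizes the results achieved so far in literature recommendation. Because evaluation measures and methods are an important part of any recommender system, it also presents the measures commonly used to evaluate literature recommender systems. Finally, it gives an overall assessment of the state of research on literature recommender systems and points out directions for future work.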

5.
The method of bibliographic coupling in combination with the complete link cluster method was applied for mapping of the field of organic chemistry with the purpose of testing the applicability of a proposed mapping method on the field level. The method put forward aimed at the generation of cognitive cores of documents, so-called ‘bibliographic cliques’ in the network of bibliographically coupled research articles. The defining feature of these cliques is that they can be considered complete graphs where each bibliographic coupling link ties an unordered pair of documents. In this way, it was presumed that coherent groups of documents in the research front would be found and that these groups would be intellectually coherent as well. Statistical analysis and subject specialist evaluations confirmed these presumptions. The study also elaborates on the choice of observation period and the application of thresholds in relation to the size of document populations.
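
The clique condition described above can be sketched in a few lines: two documents are bibliographically coupled when they share cited references, and a group is a clique when every unordered pair in it is coupled. The function names, the `threshold` parameter, and the toy reference lists are assumptions for illustration, not the study's actual procedure or data.

```python
from itertools import combinations

def coupling_strength(refs_a, refs_b):
    """Number of cited references two documents have in common."""
    return len(set(refs_a) & set(refs_b))

def is_bibliographic_clique(docs, references, threshold=1):
    """True if every unordered pair of documents is coupled at or above threshold."""
    return all(
        coupling_strength(references[a], references[b]) >= threshold
        for a, b in combinations(docs, 2)
    )

# Hypothetical reference lists keyed by document id.
references = {
    "d1": {"r1", "r2", "r3"},
    "d2": {"r2", "r3", "r4"},
    "d3": {"r3", "r5"},
    "d4": {"r6"},
}
core = is_bibliographic_clique(["d1", "d2", "d3"], references)
with_outlier = is_bibliographic_clique(["d1", "d2", "d3", "d4"], references)
```

Requiring every pair to pass the threshold mirrors the complete link criterion, under which a cluster is only as strong as its weakest pairwise link.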

6.
Documents in digital and paper libraries may be arranged by topic in order to facilitate browsing. It may seem intuitively obvious that ordering documents by their subject should improve browsing performance; the results presented in this article suggest that ordering library materials by their Gray code values, and using links consistent with the small world model of document relationships, is consistent with improved browsing performance. Below, library circulation data, including ordering with Library of Congress Classification numbers and Library of Congress Subject Headings, are used to provide information useful in generating user-centered document arrangements, as well as user-independent arrangements. Documents may be linearly arranged so they can be placed in a line by topic, such as on a library shelf, or in a list on a computer display. Crossover links, jumps between a document and another document to which it is not adjacent, can be used in library databases to allow additional paths that one might take when browsing. The improvement that is obtained with different combinations of document orderings and different crossovers is examined and applications suggested.
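
For readers unfamiliar with Gray codes, the sketch below shows the property that makes them attractive for linear orderings: consecutive values differ in exactly one bit. It uses the standard binary reflected Gray code. How the article maps documents and their subject features to Gray code values is not stated in the abstract, so the helper names and the 3-bit example are illustrative assumptions only.

```python
def gray_code(n):
    """Binary reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def hamming_distance(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")

# The eight 3-bit Gray code values in sequence order.
codes = [gray_code(i) for i in range(8)]

# Distance between each pair of neighbours in the sequence.
neighbour_distances = [
    hamming_distance(codes[i], codes[i + 1]) for i in range(len(codes) - 1)
]
```

If each bit is read as the presence or absence of a subject feature, neighbours in this ordering differ in a single feature, which is one intuition for why such an ordering could place similar documents next to each other on a shelf.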

8.
Document theory is the least explored area of study about documents. It lags significantly behind applied document research, which summarizes various document processing practices that have accumulated for thousands of years. This problem has recently been complicated by the rise of so-called general document theories. The boundaries of the document concept have become blurred due to the development of parallel areas of study and their forced differentiation into “classic” and “library” document science. In addition, knowledge about objects that are referred to as documents that can neither be properly integrated nor applied in practice is being developed. This situation is mainly due to the lack of attention that is paid by document scientists to the theoretical and methodological issues of document science. This paper reviews the origins, nature, and the social roles of documents from the perspective of a synergetic paradigm and has the goal of constructing a synergetic document theory.

9.
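Document clustering is an effective way to organize documents and is widely used in information processing for the automatic discovery of unknown topics, with good results. This paper proposes a lightweight clustering algorithm. By reducing the number of index terms of the original documents, the algorithm can process a large number of small documents and group them into several thousand clusters or, by changing specific parameters, reduce the number of clusters to a few dozen. Theoretical analysis and practical application show that the algorithm improves the efficiency of processing high-dimensional data and large numbers of small documents.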

10.
Modern retrieval test collections are built through a process called pooling in which only a sample of the entire document set is judged for each topic. The idea behind pooling is to find enough relevant documents such that when unjudged documents are assumed to be nonrelevant the resulting judgment set is sufficiently complete and unbiased. Yet a constant-size pool represents an increasingly small percentage of the document set as document sets grow larger, and at some point the assumption of approximately complete judgments must become invalid. This paper shows that the judgment sets produced by traditional pooling when the pools are too small relative to the total document set size can be biased in that they favor relevant documents that contain topic title words. This phenomenon is wholly dependent on the collection size and does not depend on the number of relevant documents for a given topic. We show that the AQUAINT test collection constructed in the recent TREC 2005 workshop exhibits this biased relevance set; it is likely that the test collections based on the much larger GOV2 document set also exhibit the bias. The paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable test collections to be built.
Ellen Voorhees
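
The pooling procedure described above can be sketched as follows: the pool for a topic is the union of the top-ranked documents from each contributing run, only pooled documents are judged, and everything outside the judged-relevant set counts as nonrelevant at evaluation time. The run names, document ids, pool depth, and judgments below are hypothetical.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents from every contributing run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

def precision_at_k(ranking, judged_relevant, k):
    """Precision at k, treating any document not judged relevant as nonrelevant."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in judged_relevant) / k

# Hypothetical runs that contributed to the pool for one topic.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d2", "d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)

# Hypothetical assessor judgments; only pooled documents can appear here.
judged_relevant = {"d2", "d5"}

# A later system retrieves d7 and d8, which were never pooled, so they
# are scored as nonrelevant whether or not they actually are relevant.
new_run = ["d7", "d2", "d8", "d5"]
score = precision_at_k(new_run, judged_relevant, k=4)
```

With a fixed `depth`, the pool stays the same size while the collection grows, so more and more documents like `d7` end up unjudged, which is the situation the abstract warns about.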

11.
Observations from a unique investigation of failure analysis of Information Retrieval research engines held in 2003 are presented. The Reliable Information Access Workshop invited seven leading IR research groups to supply both their systems and their experts to an effort to analyze why their systems fail on some topics and whether the failures are due to system flaws, approach flaws, or the topic itself. There were surprising results from this cross-system failure analysis. One is that despite systems retrieving very different documents, the major cause of failure for any particular topic was almost always the same across all systems. Another is that relationships between aspects of a topic are not especially important for state-of-the-art systems; the systems are failing at a much more basic level where the top-retrieved documents are not reflecting some aspect at all. The investigatory framework and the lessons learned can serve as a model for needed future research in this area.

12.
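Libraries can measure the utilization of local literature resources through literature surveys, website access statistics, website link recommendations, and information exchange with users. Effective ways to raise utilization mainly include integrating local literature resources, encouraging users to take part in building them, and publicizing and promoting them.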

15.
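金悦 (Jin Yue), 赵彦昌 (Zhao Yanchang). 《档案学研究》, 2022, 36(6): 136-143
The diplomatic note (照会) was the most commonly used document form in modern foreign relations. The consular report files of the modern-era U.S. consulate in Fengtian (Mukden), held by the U.S. National Archives, contain a large number of notes exchanged between U.S. consuls and the authorities of Northeast China. These notes are rich in content, exist in both Chinese and English, and are completely preserved; as diplomatic documents they carry rich and varied historical meaning and therefore have very high archival value. Through a careful combing of the consular note files, this paper classifies them in detail by direction of dispatch, language, subject matter, and parties involved, and then studies them in depth in terms of document form and historical meaning. The study will help archival science and history better understand how the form and content of modern diplomatic notes evolved, and the deeper historical context they contain. Notes exchanged between China and the United States gradually evolved from a traditional, elaborate format toward an efficient and concise style, which is also evidence of the mutual blending and influence of Chinese and U.S. diplomatic documents and diplomatic ideas.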

16.
Document clustering of scientific texts using citation contexts   Total citations: 3 (self-citations: 0, citations by others: 3)
Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms. In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate the power of these citation-specific word features, and compare them with the original document’s textual representation in a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which determines the similarity between documents based on the number of co-citations, that is in-links represented by citing documents and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered by journal articles. 
More specifically, this document representation strategy, when used by the clustering algorithm investigated in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific journal datasets.
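
The synonymy problem and the citation-context remedy can be illustrated with a toy bag-of-words comparison. Two related documents that share no terms have cosine similarity zero, while adding the words of a citation context to one of them introduces shared vocabulary. The sample texts, the whitespace tokenizer, and the plain term-count weighting are assumptions for illustration and are not the paper's actual features or clustering algorithm.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Term-frequency vector from lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical full texts on the same subject with no terms in common.
full_a = "tumor gene expression"
full_b = "cancer genomic profiling"

# Hypothetical text surrounding a reference marker that cites document A.
context_a = "cancer profiling study"

plain = cosine(bag_of_words(full_a), bag_of_words(full_b))
augmented = cosine(
    bag_of_words(full_a) + bag_of_words(context_a),
    bag_of_words(full_b),
)
```

Here the citing author's wording supplies "cancer" and "profiling", terms that document A itself never uses, so the augmented representation registers the relatedness that the full text alone misses.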

17.
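祁卓麟 (Qi Zhuolin). 《图书馆论坛》, 2012, 32(2): 124-126, 174
An analysis of the past five years of document delivery data from the Northwest A&F University Library, covering basic activity, subject classification, and user needs, shows that demand for document delivery differs greatly across disciplines and user types, and that the collection structure should be adjusted continually in line with disciplinary development and user needs. Libraries should fund document delivery through multiple channels, broaden literature sources, integrate literature already obtained, and strengthen special collections in order to meet user needs.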

18.
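This paper proposes a clustering-topic-oriented method for representing text features: the feature vector of a text is described by the topic concepts obtained from clustering, raising the description of the text to the semantic level. First, clustering produces a set of latent topic concepts expressed as vectors; then the text feature vectors defined over the term space are projected onto this set of topic concepts, so that the text is described by latent topic concepts. Experimental analysis shows that a text vector built on the concept space is in essence the degree of association between the text vector and the topic concepts; it highlights the topical features of the text content and better reflects its semantic content, thereby effectively improving the model's performance in applications such as text retrieval and classification. Moreover, because the dimensionality of the clustering-derived concept space can be adjusted by the user, the concept space can be reduced effectively, improving the model's practical effectiveness.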
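
The projection step can be sketched as follows, assuming the topic concepts are available as cluster centroids in the term space and that the degree of association is measured by cosine similarity. Both are assumptions made for illustration: the abstract does not name the clustering algorithm or the association measure, and the vectors below are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def project_to_concepts(doc_vector, concept_vectors):
    """Describe a document by its association with each topic concept."""
    return [cosine(doc_vector, concept) for concept in concept_vectors]

# Hypothetical term space with four terms and two topic concepts,
# each concept given as a cluster centroid over those terms.
concepts = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

# A document that uses only the first term.
doc = [1.0, 0.0, 0.0, 0.0]
concept_vector = project_to_concepts(doc, concepts)
```

The document moves from four term dimensions to two concept dimensions, and choosing a different number of clusters changes the dimensionality of the representation directly.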

19.
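This paper argues that neither records life cycle theory nor records value theory can provide a theoretical basis for the public-access services for current records offered by archives. Interpreting public access to current records through the view that archives are formed first (档案形成在前说) resolves many of the problems. The paper further argues that archives differ essentially from records in form, informational content, and function; they are two entirely different kinds of things. It follows that making current records public is not the same as opening archives: current records should be public as a rule, whereas archives should have a relative closure period.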

20.
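孙国超 (Sun Guochao), 徐硕 (Xu Shuo), 乔晓东 (Qiao Xiaodong). 《情报工程》, 2016, 2(4): 020-029
As the document collections that researchers must handle keep growing, topic models, represented by LDA, can mine the latent topics of large collections at the semantic level, and LDA is therefore applied ever more widely. LDA attends only to the content of the collection and ignores other important external information about the documents. The AToT model adds two attributes to LDA, document authors and publication time, so that it can not only mine latent information in the documents but also analyze authors' research interests and how topics change over time. The AToT model's output takes the form of probability matrices, which cannot present the mined information intuitively, completely, and clearly, especially for researchers unfamiliar with data mining. This paper therefore develops a visualization system based on the AToT model that presents, clearly and attractively, the relationships among documents, topics, authors, time, and terms in the model, such as the topic distribution of a document, the term distribution of a topic, the research-interest distribution of an author, the topics similar to a given topic, and the evolution trend of a topic.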
