Similar literature
 20 similar documents found (search time: 31 ms)
1.
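The "wen family" (文族) of document terms comprises more than 50 terms, including 文, 公文, 文字, 文献, 文书, 文案, and 文牍. Drawing on statistical data and text retrieval, the paper describes their overall composition, word-formation patterns, frequency and scope of use, and meanings in ancient China. It then traces how these terms evolved in ancient China according to four basic types: steadily growing, wave-like, gradually disappearing, and rare document terms.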

2.
Compound noun segmentation is a key first step in language processing for Korean. Thus far, most approaches require some form of human supervision, such as pre-existing dictionaries, segmented compound nouns, or heuristic rules. As a result, they suffer from the unknown word problem, which can be overcome by unsupervised approaches. However, previous unsupervised methods normally do not consider all possible segmentation candidates, and/or rely on character-based segmentation clues such as bi-grams or all-length n-grams. So, they are prone to falling into a local solution. To overcome the problem, this paper proposes an unsupervised segmentation algorithm that searches the most likely segmentation result from all possible segmentation candidates using a word-based segmentation context. As word-based segmentation clues, a dictionary is automatically generated from a corpus. Experiments using three test collections show that our segmentation algorithm is successfully applied to Korean information retrieval, improving a dictionary-based longest-matching algorithm.
Jong-Hyeok Lee
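The idea of searching all segmentation candidates with word-based clues can be illustrated with a small dynamic-programming sketch. This is not the paper's algorithm: the scoring (product of unigram word probabilities), the function name `segment`, and the toy Latin-script counts are all assumptions made for illustration.

```python
import math

def segment(compound, word_counts):
    """Return the most probable split of `compound` into dictionary words.

    Dynamic programming over every split point, so all segmentation
    candidates are considered implicitly. Returns None if no split exists.
    """
    total = sum(word_counts.values())
    n = len(compound)
    # best[i] = (log-probability, words) for the best split of compound[:i]
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = compound[start:end]
            if best[start] is None or word not in word_counts:
                continue
            score = best[start][0] + math.log(word_counts[word] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return None if best[n] is None else best[n][1]

# Hypothetical counts standing in for a dictionary built from a corpus.
counts = {"information": 50, "retrieval": 30, "system": 40, "in": 80, "formation": 5}
print(segment("informationretrievalsystem", counts))
# → ['information', 'retrieval', 'system']
```

Because every split point is scored, the rarer split "in" + "formation" is considered and rejected rather than never being examined, which is the property a greedy longest-match pass lacks.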

3.
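This paper briefly introduces some results of natural language information processing research by Japanese scholars aimed at improving the performance of information retrieval systems: automatic indexing research and analysis of verb usage on English scientific and technical abstracts, and research on automatically extracting Japanese nouns from Japanese text that mixes kanji and kana.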

4.
5.
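Wang Hui (王卉), 《档案学研究》, 2020, 34(4): 87-96
Authority control of romanized names (拼音名词) is extremely important to developing and using modern Chinese customs archives. Not only is it the basic data source for building a customs archive database; more importantly, it helps remove the obstacles caused by constantly changing languages, scripts, and proper nouns during the development of archival information resources, enabling effective integration and full development of those resources and thereby improving the efficiency of their organization and retrieval. Taking the modern Guangdong customs archives as its subject and the Canton Customs (粤海关) archives as its focus, this paper consolidates romanized personal, place, and institution names in the Canton Customs archives, and sorts out the characteristics of the language and script in these archives, the different romanization systems, and their complex evolution. Combining traditional authority control methods with web-based linked data technology, it reconciles the problems that the different romanization systems cause (one sound for several words, one word with several sounds, one word with several meanings, and one meaning with several words) and, building on the Library of Congress's existing bibliographic framework, proposes an authority control model for customs archival documents centered on romanized names.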

6.
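Research on Automatic Indexing Based on Ensemble Learning   Cited: 1 (self-citations: 0, by others: 1)
Most current automatic indexing methods cannot make effective use of the multiple features contained in a text, whereas statistical machine learning models such as support vector machines and conditional random fields can use multiple text features for keyword extraction. Moreover, because automatic indexing models differ in performance, combining them through ensemble learning can improve indexing quality. To improve quality further, this paper attempts to combine the strengths of statistical machine learning models and ensemble learning, indexing documents by combined voting across multiple classification models. Experimental results show that ensemble-based automatic indexing improves the precision and recall of indexing results. In addition, within the ensemble indexing model, results from weighted base classifiers are better than results from unweighted base classifiers.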
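The difference between unweighted and weighted voting over base classifiers can be sketched as follows. This is a minimal illustration, not the paper's model: the function name `vote`, the 0.5 threshold, the weights, and the classifier outputs are all hypothetical.

```python
from collections import defaultdict

def vote(predictions, weights=None, threshold=0.5):
    """Combine keyword sets proposed by several base classifiers.

    predictions: one set of candidate keywords per base classifier.
    weights: optional per-classifier weights; equal weights if omitted.
    A candidate is kept when its share of the total weight exceeds threshold.
    """
    if weights is None:
        weights = [1] * len(predictions)
    total = sum(weights)
    score = defaultdict(float)
    for keywords, weight in zip(predictions, weights):
        for keyword in keywords:
            score[keyword] += weight
    return sorted(k for k, s in score.items() if s / total > threshold)

# Hypothetical outputs of three base classifiers for one document.
preds = [{"indexing", "ensemble"}, {"indexing", "svm"}, {"ensemble", "svm", "crf"}]
print(vote(preds))                     # → ['ensemble', 'indexing', 'svm']
print(vote(preds, weights=[6, 3, 1]))  # → ['ensemble', 'indexing']
```

With equal weights, "svm" passes on two votes out of three; once the first classifier is trusted more, the same two votes carry only 4 of 10 weight units and the term is dropped. In practice the weights would come from each base classifier's measured performance, which is an assumption here rather than something the abstract specifies.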

7.
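Taking matrix theory as its entry point, the paper recasts the vectors and sets commonly used in the classical vector space model (VSM) in matrix form, and argues that similarity computation by vector inner product is equivalent to multiplication of the corresponding matrices. Drawing on the definitions of sparse matrix and data sparsity, it analyzes why data sparsity arises in VSM-based information retrieval, and discusses the common effect of data sparsity on similarity computation in three cases: a portion of meaningless time complexity. Finally, it gives a three-level strategy for avoiding the data sparsity problem: text-level, text-collection-level, and matrix-level strategies.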
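A small sketch may make the equivalence concrete: multiplying a document-term matrix by a query vector yields exactly the per-document inner products, and storing only nonzero entries avoids the multiplications by zero that sparsity would otherwise waste. The matrix values and function names below are hypothetical and are not taken from the paper.

```python
def matvec(matrix, vector):
    """Dense product: one inner product per row of the document-term matrix."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sparse_matvec(rows, vector):
    """Same product over rows stored as {column: value}, skipping zero entries."""
    return [sum(v * vector[j] for j, v in row.items()) for row in rows]

# Hypothetical term weights: 3 documents x 4 terms, and one query.
docs = [[1, 0, 0, 2],
        [0, 0, 3, 0],
        [0, 1, 0, 1]]
query = [1, 0, 0, 1]

# Keep only the nonzero cells of each document row.
sparse_docs = [{j: v for j, v in enumerate(row) if v != 0} for row in docs]

print(matvec(docs, query))                # → [3, 0, 1]
print(sparse_matvec(sparse_docs, query))  # → [3, 0, 1]
```

The dense version performs 12 multiplications, 7 of them against a zero cell; the sparse version performs 5 and returns the same similarities.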

8.
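This paper proposes an MPEG-7-based description scheme for instructional speech content, which serves as the basis for hierarchically structured organization and multidimensional indexing of speech data. The description scheme gives users a hierarchical browsing view and navigation mechanism, along with multidimensional indexes that reflect the user's multi-angle observation and analysis, thereby enabling interoperable content-based speech retrieval and related services. Finally, the paper briefly analyzes techniques for automatic feature extraction and automatic description generation.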

9.
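Research on Implementing XML Indexing on a Traditional Text Retrieval System   Cited: 3 (self-citations: 0, by others: 3)
Lu Wei (陆伟), 《情报学报》, 2006, 25(6): 679-685
As an important standard for information exchange and storage, XML has received growing attention from scholars. Research on XML indexing mechanisms and their implementation, an important part of XML retrieval research, has produced some results. However, most of this work is built on databases or dedicated semi-structured data managers. This paper proposes a method for building an XML index on top of the traditional text retrieval system Okapi. It first introduces Okapi's index structure, then examines in depth the storage structure and implementation of the XML index, and evaluates the index's performance.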

10.
BACKGROUND: EUROETHICS is a database covering European literature on ethics in medicine. It is produced within Eurethnet, a European information network on ethics in medicine and biotechnology. OBJECTIVES: The aim of Euroethics is to disseminate information on European bioethical literature that may otherwise be difficult to find. METHODS: A collaboration model for pooling data from different centres was developed. The policy was to accomplish data uniformity, while still allowing for local differences in terms of software, indexing practices and resources. Records contributed to the database follow common standards in terms of data fields and indexing terms. The indexing terms derive from two thesauri, Thesaurus Ethics in the Life Sciences (TELS) and Medical Subject Headings (MeSH). Combining elements from search tools developed previously, the developers sought to find a technical solution optimized for this data model. An approach relying on a thesaurus database that is loaded along with the bibliographic database is described. RESULTS AND CONCLUSIONS: The present case study offers examples of possible approaches to several tasks often encountered in database development, such as: merging data from diverse sources, getting the most out of indexing terms used in a database, and handling more than one thesaurus in the same system.

11.
Background and objective: With the advent of an interprofessional approach to delivering health care in today's health care systems, should health care professionals be educated together? Supported by policy-making circles worldwide, interprofessional education is accumulating a research literature at an exponential rate. Using one-word search terms in the MEDLINE query box for scoping this body of literature, we obtained an unmanageable number of articles (342 338 in all fields). The objective of our study was to outline an efficient specific query. Methods: We created 1072 phrasal search terms consisting of a prefix, an adjective and a noun. Of those, 66 were prolific for the whole indexed period (1950–2006). Results: Only 2510 citations have the search term in all MEDLINE fields; of those 2049 were in title/abstract and 652 in title alone. From the 1950s, the citations were published at a slow rate, but the rate then exploded during the decade 1995 to 2006. The combination of prefixes 'inter' and 'multi' with the adjectives 'professional', 'disciplinary' and 'shared', and the nouns 'education', 'learning' and 'training' may retrieve almost all the relevant citations, while the terms 'collaborative' and 'common' may retrieve mainly irrelevant ones. The adjective 'cohesive' and nouns 'practice' and 'role' should also be considered. Conclusion: Phrasal search terms greatly increased the relevance of MEDLINE-retrieved citations.

12.
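Song Yunfang (宋芸芳), 《图书馆建设》, 2012, (3): 52-54, 57
Coordinate indexing (组配标引) selects two or more thesaurus terms that stand in a formal logical relationship and combines them, under specific rules, into a string of indexing terms, in order to support multi-level, multi-path retrieval of documents. Concept coordination is the key step in document indexing. According to the logical relationship among the participating subject headings, concept coordination falls into three basic types: intersecting coordination, qualifying coordination, and linking coordination. In practice, catalogers should avoid muddled construction of search terms caused by unfamiliarity with a new thesaurus, avoid overly broad, missed, and incorrect indexing caused by errors in converting subject concepts, and avoid off-topic indexing caused by failing to follow the specificity rule, so as to reduce coordinate indexing errors.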

13.
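To ease the heavy burden of keyword indexing for massive document collections, the article builds a computer-aided processing system for keyword indexing at that scale, and describes in detail the preprocessing specification for indexing data, the core automatic indexing workspace, and the manual indexing proofreading platform. The automatic indexing software was selected through data testing; when a single software package could not meet the indexing requirements, several ways of post-processing the machine indexing results were explored to improve machine indexing quality. Finally, the manual proofreading platform safeguards the quality of keyword indexing while feeding the problems found in machine indexing, together with suggestions for improvement, back to software design and thesaurus maintenance, ensuring continuous improvement of the computer-aided processing system.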

14.
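Over the past five years, automatic indexing research has made considerable progress in keyword extraction, indexing system design, automatic classification indexing, automatic indexing of web information, digital image indexing, audio indexing, video indexing, and evaluation of automatic indexing results. Weaknesses and shortcomings remain, however, and it still cannot match the results of manual indexing. Future research will move toward better language analysis techniques, more advanced automatic indexing methods for multimedia information, efficient intelligent self-learning mechanisms for knowledge bases, and ensemble learning in which multiple indexing methods or models complement one another.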

15.
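Discusses indexing specifications for document databases and indexing models for different document types; explores an indexing model that combines natural language with controlled language, in order to facilitate automatic conversion to controlled language.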

16.
The economic value of public libraries for local residents in Korea was measured. An economic-value measurement model that enables the estimation of diverse types of public library services was designed, using a conditional-value measurement method. Benefits were taken as the value of the main services provided by public libraries, such as accessibility to informational materials, facilities, and programs. Costs included the total amount of expenses at libraries such as personnel expenses, materials purchasing expenses, and other operational costs. Data were collected from 1220 users from 22 public libraries in the province of Seoul/Gyeonggi-do and the other seven Korean provinces. The return on investment (ROI) was calculated to be 3.66.

17.
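[Purpose/Significance] This article studies a self-organizing method for the implicit diffusion paths of science and technology policy. It mines the deep semantic information contained in science and technology policy texts and makes tacit knowledge explicit, providing a reference for researchers who wish to extend and enrich research on policy diffusion paths. [Method/Process] Combining the formal semantics and content semantics of full policy texts, the paper structures and deeply mines the texts, fully parses policy text resources, and extracts the features they contain, using automatic concept and relation acquisition and indexing techniques and network representation learning to mine the implicit structural information in the texts. A deep learning method based on the BiLSTM-CRF model is used to acquire concepts automatically and to index relations automatically. The concepts and relations obtained from multiple policy texts are assembled into concept-relation pairs, and representation learning is used to find a dense vector representation for each node. [Result/Conclusion] Experiments confirm the effectiveness of the proposed self-organizing method for implicit policy diffusion paths, which relies on implicit path features. The method extends policy research methods to some degree and provides a reference for researchers studying policy diffusion.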

18.
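XMARC Subject Information Mining Based on the EMM Chinese Word Extraction Algorithm   Cited: 4 (self-citations: 0, by others: 4)
Wang Lancheng (王兰成), 《情报学报》, 2005, 24(1): 82-86
Using an interval maximum word length over the segmentation dictionary, this paper improves forward maximum matching by character reduction (正向减字最大匹配法) into an automatic indexing method of "word-initial, long-word matching, short-word advance" (词首 长词匹配 短词推进), thereby effectively reducing domain-specific segmentation ambiguity and shortening indexing time. Finally, the research is put into practice in an implementation of XMARC subject information mining and retrieval, which demonstrates its superior combined performance in time and quality.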
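For readers unfamiliar with the baseline being improved, forward maximum matching by character reduction can be sketched as below. This is a generic textbook version, not the EMM algorithm itself. Reading "interval maximum word length" as "the longest dictionary word per initial character" is an assumption, and the toy dictionary and function names are hypothetical.

```python
def build_max_len(dictionary):
    """Longest dictionary word for each initial character."""
    max_len = {}
    for word in dictionary:
        max_len[word[0]] = max(max_len.get(word[0], 1), len(word))
    return max_len

def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation: try the longest candidate first,
    drop one trailing character at a time, fall back to a single character."""
    max_len = build_max_len(dictionary)
    words, i = [], 0
    while i < len(text):
        # Cap the first candidate by the per-initial limit, not a global one.
        limit = min(max_len.get(text[i], 1), len(text) - i)
        for size in range(limit, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy dictionary.
dictionary = {"自动", "自动标引", "标引", "方法"}
print(forward_max_match("自动标引方法", dictionary))
# → ['自动标引', '方法']
```

Capping the first candidate by the initial character's own maximum, rather than by the longest word in the whole dictionary, shortens the inner loop for characters that only start short words, which is one plausible source of the time savings the abstract mentions.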

19.
In this paper, which treats Swedish full text retrieval, the problem of morphological variation of query terms in the document database is studied. The Swedish CLEF 2003 test collection was used, and the effects of combination of indexing strategies with query terms on retrieval effectiveness were studied. Four of the seven tested combinations involved indexing strategies that used normalization, a form of conflation. All of these four combinations employed compound splitting, both during indexing and at query phase. SWETWOL, a morphological analyzer for the Swedish language, was used for normalization and compound splitting. A fifth combination used stemming, while a sixth attempted to group related terms by right hand truncation of query terms. The truncation was performed by a search expert. These six combinations were compared to each other and to a baseline combination, where no attempt was made to counteract the problem of morphological variation of query terms in the document database. The truncation combination, the four combinations based on normalization, and the stemming combination all outperformed the baseline. Truncation had the best performance. The main conclusion of the paper is that truncation, normalization and stemming enhanced retrieval effectiveness in comparison to the baseline. Further, normalization and stemming were not far below truncation.

20.
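Indexing consistency is an important indicator of subject indexing quality. The paper reviews four manifestations of inconsistent subject indexing of computer science literature, analyzes the causes of the inconsistency, and on that basis discusses methods for achieving consistent subject indexing of computer science literature.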
