Similar Documents
20 similar documents retrieved (search time: 31 ms)
1.
Research on Chinese Text Classification Based on Word Frequency   Total citations: 1 (self-citations: 0, others: 1)
姚兴山 《现代情报》2009,29(2):179-181
This paper describes the design and implementation of a Chinese text classification system, covering its architecture, feature extraction, training algorithm, and classification algorithm in detail. A word-frequency-based statistical method is applied to text classification, and a Chinese text classification method based on the statistical properties of single-character and two-character words is proposed: without a predefined lexicon, single-character and two-character word lists are built from corpus statistics and used to classify texts, achieving good results.
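A minimal sketch of the kind of frequency-based classification the abstract describes, using single-character and two-character statistics as features; the centroid scoring, function names, and sample data are assumptions for illustration, not the paper's implementation.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def features(text):
    """Frequency vector over single characters and two-character strings."""
    return Counter(char_ngrams(text, 1) + char_ngrams(text, 2))

def train_centroids(labeled_docs):
    """labeled_docs: list of (category, text). Returns one summed frequency vector per category."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(features(text))
    return centroids

def classify(text, centroids):
    """Assign the category whose centroid has the largest normalized overlap with the text."""
    vec = features(text)
    def score(centroid):
        return sum(vec[t] * centroid[t] for t in vec) / (sum(centroid.values()) or 1)
    return max(centroids, key=lambda label: score(centroids[label]))

# Hypothetical usage:
# centroids = train_centroids([("体育", "足球 比赛 进球"), ("财经", "股票 市场 涨跌")])
# print(classify("昨晚的足球比赛十分精彩", centroids))
```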

2.
Research on Full-Text Retrieval   Total citations: 11 (self-citations: 0, others: 11)
This paper presents a new algorithm for automatic Chinese word segmentation based on a stop-word list and a post-controlled thesaurus, combining ideas from the single-character indexing method and the thesaurus method. Based on this algorithm, a new full-text retrieval model is built.
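Illustrative only: a simple forward maximum-matching segmenter with a stop-word filter, in the general spirit of the dictionary/thesaurus-based segmentation the abstract mentions; it is not the paper's algorithm, and the lexicon, stop words, and parameter values are assumptions.

```python
def segment(text, lexicon, stop_words, max_len=4):
    """Greedy longest-match segmentation; unknown characters fall back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if n == 1 or word in lexicon:
                if word not in stop_words:
                    tokens.append(word)
                i += n
                break
    return tokens

# Hypothetical usage:
# lexicon = {"全文", "检索", "系统", "研究"}
# print(segment("全文检索系统的研究", lexicon, stop_words={"的"}))
# -> ['全文', '检索', '系统', '研究']
```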

3.
This paper designs and implements a system for testing how well hyperlink text matches the content it points to. A crawler first fetches the anchor text and the linked page content; Chinese word segmentation is then applied to each to obtain word lists, with the linked content segmented separately by title, paragraph-initial sentences, paragraph-final sentences, and body text. Finally, a conformity computation model scores the match between the anchor text and the linked text. Compared with human judgments, the conformity scores computed by the system are good and correlate well with manual assessments.
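A rough sketch of scoring how well anchor text matches the fields of the page it links to; the field names, weights, and scoring function are assumptions, not the paper's conformity model.

```python
def conformity(anchor_tokens, fields, weights=None):
    """fields: dict of field name -> token list (e.g. title, first/last sentences, body)."""
    weights = weights or {"title": 0.4, "first_sentences": 0.2, "last_sentences": 0.1, "body": 0.3}
    anchor = set(anchor_tokens)
    score = 0.0
    for name, tokens in fields.items():
        if not anchor or not tokens:
            continue
        overlap = len(anchor & set(tokens)) / len(anchor)  # share of anchor terms found in this field
        score += weights.get(name, 0.0) * overlap
    return score

# Hypothetical usage (tokens would come from a Chinese word segmenter):
# conformity(["开源", "软件"], {"title": ["开源", "软件", "下载"], "body": ["软件", "安装"]})
```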

4.
王丽 《科教文汇》2014,(13):101-101
Among contemporary writers, Mo Yan is second to none in the inventiveness of his fictional language. The expressive tension of his prose rests on sensibility and imagination: through deliberate deviation, by substituting and inverting morphemes, splitting words for flexible reuse, and building unconventional collocations, he gives his language a lively, novel, shifting, and richly colored quality that makes his novels distinctive.

5.
徐坤  曹锦丹 《情报杂志》2012,(1):172-174,171
This paper proposes an easy-to-implement and relatively accurate method for automatically recognizing out-of-vocabulary words in domain literature. The out-of-vocabulary word list generated by the method improves Chinese word segmentation, compensates for the slow updating of domain thesauri, facilitates subsequent processing of domain documents, and thus helps researchers use the literature more efficiently.

6.
Automatic word spacing in Korean remains a significant task in natural language processing owing to the extremely complex word spacing rules involved. Most previous models remove all spaces in input sentences and insert new spaces in the modified sentences. If input sentences contain only a small number of spacing errors, these models often return sentences with even more spacing errors than the input, because they discard the correct spaces that the users typed intentionally. To reduce this problem, we propose an automatic word spacing model based on a neural network that effectively uses the word spacing information in input sentences. The proposed model comprises a space insertion layer and a spacing-error correction layer. Using an approach similar to previous models, the space insertion layer inserts word spaces into input sentences from which all spaces have been removed. The spacing-error correction layer then post-corrects the errors of the space insertion layer using the word spacing typed by users. Because the two layers are tightly connected in the proposed model, the backpropagation flows are not blocked, so space insertion and error correction are performed simultaneously. In experiments, the proposed model outperformed all compared models on all measures on the same test data. In addition, it exhibited reliable performance (word-unit F1-measures of 94.17%–97.87%) regardless of how many word spacing errors were present in the input sentences.
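A minimal PyTorch sketch of the two-stage idea described above: a space-insertion head that labels each character, and a correction head that also sees the spacing the user originally typed, with both trained jointly so gradients flow through both stages. Layer sizes, names, and the feature combination are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpacingModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.insert_head = nn.Linear(2 * hidden, 2)           # space / no-space after each char
        self.correct_head = nn.Linear(2 * hidden + 1 + 2, 2)  # + user spacing + insertion logits

    def forward(self, char_ids, user_spaces):
        # char_ids: (batch, seq); user_spaces: (batch, seq) of 0/1 as typed by the user
        h, _ = self.encoder(self.emb(char_ids))
        insert_logits = self.insert_head(h)
        features = torch.cat([h, user_spaces.unsqueeze(-1).float(), insert_logits], dim=-1)
        correct_logits = self.correct_head(features)
        return insert_logits, correct_logits  # both heads trained jointly, end to end

# Hypothetical usage:
# model = SpacingModel(vocab_size=2000)
# chars = torch.randint(0, 2000, (1, 10)); spaces = torch.randint(0, 2, (1, 10))
# insert_logits, correct_logits = model(chars, spaces)
```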

7.
This paper presents our work towards developing a new speech corpus for Modern Standard Arabic (MSA), which can be used for implementing and evaluating Arabic speaker-independent, large-vocabulary, automatic, continuous speech recognition systems. The speech corpus was recorded by 40 Arabic native speakers (20 male and 20 female) from 11 countries representing three major regions (Levant, Gulf, and Africa). Three development phases were conducted, varying the size of the training data, the number of Gaussian mixture distributions, and the number of tied states (senones). In the third development phase, using 11 hours of training speech, the acoustic model is composed of 16 Gaussian mixture distributions with state distributions tied to 300 senones. Using three different data sets, this phase obtained an average word recognition correctness rate of 94.32% and an average Word Error Rate (WER) of 8.10% for the same speakers with different sentences (testing sentences). For different speakers with the same sentences (training sentences), it obtained 98.10% average correctness and 2.67% average WER, and for different speakers with different sentences (testing sentences), 93.73% average correctness and 8.75% average WER.

8.
Neural decoders were introduced as a generalization of the classic Belief Propagation (BP) decoding algorithms. In this work, we propose several neural decoders with different permutation-invariant structures for BCH codes and punctured RM codes. First, we propose a cyclically equivariant neural decoder that makes use of the cyclically invariant structure of these two code families. Next, we propose an affine equivariant neural decoder that exploits their affine invariant structure. Both decoders outperform previous neural decoders when decoding cyclic codes. The affine decoder achieves a smaller decoding error probability than the cyclic decoder, but it usually requires a longer running time. Similarly, using the affine invariant property of extended BCH codes and RM codes, we propose a list-decoding version of the cyclic decoder that can significantly reduce the frame error rate (FER) for these two codes. For certain high-rate codes, the gap between the list decoder and the Maximum Likelihood decoder is less than 0.1 dB when measured by FER.

9.
This paper describes an intelligent spelling error correction system for use in a word processing environment. The system employs a dictionary of 93,769 words and, provided the intended word is in the dictionary, identifies 80 to 90% of spelling and typing errors.
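Illustrative only, since the abstract does not describe the matching method: a dictionary-lookup check that flags unknown words and suggests the closest dictionary entries by string similarity.

```python
import difflib

def check_word(word, dictionary):
    """Return (is_correct, suggestions) for one token against a word list."""
    if word.lower() in dictionary:
        return True, []
    # Suggest near matches from the dictionary (ranking heuristic is an assumption).
    suggestions = difflib.get_close_matches(word.lower(), dictionary, n=3, cutoff=0.8)
    return False, suggestions

# Hypothetical usage:
# dictionary = {"receive", "believe", "separate"}
# print(check_word("recieve", dictionary))  # -> (False, ['receive'])
```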

10.
刘爱琴  安婷 《现代情报》2019,39(8):52-58
[Purpose/Significance] Knowledge association across non-related literature can stimulate the creation of new knowledge and provides an effective aid to scientific research. [Method/Process] Using the Chinese Classified Thesaurus (《中国分类主题词表》) as the controlled subject vocabulary, this paper first applies Chinese word segmentation to document abstracts and extracts subject terms, uses bibliometric analysis and clustering techniques to measure the similarity and dissimilarity of document features, and then retrieves documents for the user, returning precise results with a top-K algorithm. [Result/Conclusion] A knowledge association retrieval system for non-related literature is designed, revealing knowledge associations between documents at a finer granularity and providing users with a high-quality service.
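A small sketch of the final "return the top-K most related documents" step the abstract mentions; the similarity function, weighting, and data are placeholders, not the system's actual feature model.

```python
import heapq

def top_k(query_vec, doc_vecs, k=5):
    """doc_vecs: dict of doc_id -> term-weight dict. Returns the k best (score, doc_id) pairs."""
    def dot(a, b):
        return sum(a[t] * b.get(t, 0.0) for t in a)
    return heapq.nlargest(k, ((dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()))

# Hypothetical usage:
# top_k({"知识": 1.0, "关联": 0.5}, {"d1": {"知识": 0.3}, "d2": {"关联": 0.9}}, k=1)
```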

11.
Driven by the need to monitor research institutions dynamically, this paper proposes a method for ranking the importance of web resources collected from research institutions, combining a list of science-and-technology-sensitive terms with multiple indicators such as resource attributes, topical content, link text, anchor text, and source directory. A research framework and workflow for importance ranking of web resources is presented, experiments are carried out on data collected in real time, and conclusions and future work are given.

12.
An Empirical Study of the Effect of Corporate Governance on Firms' Investment in Technological Innovation   Total citations: 7 (self-citations: 0, others: 7)
Using data on listed companies in Jiangsu Province with technology centers at or above the provincial level, this study builds an econometric model to examine how corporate governance affects firms' investment in innovation. The empirical results show that ownership concentration and incentives for top managers have a significant positive effect on investment in technological innovation, the debt ratio has a significant negative effect, and the board size of the sample companies has no significant effect.
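Illustrative only: a regression of the general form the abstract describes, using the statsmodels package; the variable names, column names, and data file are assumptions, not the study's actual specification or data.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("governance_sample.csv")  # hypothetical firm-level data
X = sm.add_constant(df[["ownership_concentration", "executive_incentive",
                        "debt_ratio", "board_size"]])
y = df["rd_investment"]  # investment in technological innovation

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficient signs/significance would map to the reported findings
```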

13.
Conglomerates as a general framework for informetric research   Total citations: 2 (self-citations: 0, others: 2)
We introduce conglomerates as a general framework for informetric (and other) research. A conglomerate consists of two collections, a finite source collection and a pool, and two mappings, a source-item map and a magnitude map. The ratio of the sum of the magnitudes of all item-sets to the number of elements in the source collection is called the conglomerate ratio. It is a kind of average, generalizing the notion of an impact factor. The source-item relation of a conglomerate leads to a list of sources ranked according to the magnitude of their corresponding item-sets. This list, called a Zipf list, is the basic ingredient for all considerations related to power laws and Lotkaian or Zipfian informetrics. Examples where this framework applies include impact factors (including web impact factors), Bradford–Lotka type bibliographies, first-citation studies, word use, diffusion factors, elections and even bestseller lists.
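Under the definitions in the abstract, the conglomerate ratio can be written as follows; the symbols S, I(s), and m are notation assumed here, not taken from the paper.

```latex
% S is the finite source collection, I(s) the item-set of source s, m(.) the magnitude map.
\[
  \mathrm{CR} \;=\; \frac{\sum_{s \in S} m\!\big(I(s)\big)}{|S|}
\]
% With sources = journal articles, items = citations, and m = cardinality,
% CR reduces to an impact-factor-like average.
```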

14.
With the increasing use of research paper search engines, such as CiteSeer, for both literature search and hiring decisions, the accuracy of such systems is of paramount importance. This article employs conditional random fields (CRFs) for the task of extracting various common fields from the headers and citations of research papers. CRFs provide a principled way of incorporating various local features, external lexicon features, and global layout features. The basic theory of CRFs is becoming well understood, but best practices for applying them to real-world data require additional exploration. We make an empirical exploration of several factors, including variations on Gaussian, Laplace, and hyperbolic-L1 priors for improved regularization, and several classes of features. Based on CRFs, we further present a novel approach to constrained co-reference information extraction, i.e., improving extraction performance given that we know some citations refer to the same publication. On a standard benchmark dataset, we achieve new state-of-the-art performance, reducing the error in average F1 by 36% and the word error rate by 78% in comparison with the previous best SVM results. Accuracy compares even more favorably against HMMs. On four co-reference IE datasets, our system significantly improves extraction performance, with an error-rate reduction of 6–14%.
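A minimal sketch of CRF-based header field labeling using the third-party sklearn_crfsuite package (not the paper's implementation); the feature set, labels, and toy training example are assumptions.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple local lexical features for one header token (illustrative choices)."""
    w = tokens[i]
    return {
        "lower": w.lower(),
        "is_capitalized": w[:1].isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "has_at": "@" in w,
        "position": i,
    }

def to_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

# Hypothetical toy training data: one header line with gold field labels.
X_train = [to_features(["Conditional", "Random", "Fields", "Jane", "Doe", "jane@example.org"])]
y_train = [["TITLE", "TITLE", "TITLE", "AUTHOR", "AUTHOR", "EMAIL"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([to_features(["Hidden", "Markov", "Models", "John", "Roe", "john@example.org"])]))
```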

15.
Word sense ambiguity has been identified as a cause of poor precision in information retrieval (IR) systems. Word sense disambiguation and discrimination methods have been developed to help systems choose which documents should be retrieved in response to an ambiguous query. However, the only approaches that show a genuine benefit from word sense discrimination or disambiguation in IR are generally supervised ones. In this paper we propose a new unsupervised method that uses word sense discrimination in IR. Our method is based on spectral clustering and reorders an initially retrieved document list by boosting documents that are semantically similar to the target query. For several TREC ad hoc collections we show that our method is useful for queries that contain ambiguous terms. We are interested in improving precision after 5, 10, and 30 retrieved documents (P@5, P@10, P@30). We show that precision can be improved by 8% above current state-of-the-art baselines. We also focus on poorly performing queries.
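A rough sketch of the general idea above, not the paper's exact method: cluster the initially retrieved documents spectrally over a document-similarity matrix, then boost the cluster closest to the query before re-ranking. The TF-IDF features, cluster count, and boost weight are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

def rerank(query, docs, n_clusters=2, boost=0.5):
    tfidf = TfidfVectorizer().fit(docs + [query])
    doc_vecs, query_vec = tfidf.transform(docs), tfidf.transform([query])
    sim = cosine_similarity(doc_vecs)                      # doc-doc affinity matrix
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(sim)
    base = cosine_similarity(query_vec, doc_vecs).ravel()  # initial query-doc scores
    # Boost documents in the cluster with the highest mean similarity to the query.
    best = max(range(n_clusters),
               key=lambda c: base[labels == c].mean() if (labels == c).any() else -1.0)
    scores = base + boost * (labels == best)
    return [docs[i] for i in scores.argsort()[::-1]]

# Hypothetical usage:
# rerank("bank interest", ["river bank erosion", "bank deposit rates",
#                          "savings bank account", "bank of a river"])
```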

16.
王俊丽 《科教文汇》2012,(35):131-132
Based on quantitative statistics and error analysis of senior high school students' English learning errors, the study finds that, among lexical errors, the largest category concerns mastery of parts of speech, followed by confusion between synonyms and near-synonyms, with spelling and collocation errors ranking third and fourth. Errors in verb use and tense and errors in compound/complex sentences are the most frequent categories at the grammatical and discourse levels, respectively.

17.
李泗兰  郭雅 《大众科技》2014,(6):217-218
In practice, head teachers at higher vocational schools often need to produce score reports, exam admission tickets, invitations, and similar documents in batches. Such work is heavy, highly repetitive, tedious, and error-prone. Combining the IF function in Microsoft Office Excel 2010 with the mail merge feature in Word 2010 solves these problems with ease: the IF function is used to prepare the data source, while mail merge batch-edits and prints the largely fixed content of the Word document. Taking the printing and mailing of student score reports as an example, this paper describes in detail how mail merge can be applied in administration and teaching.
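The article itself uses Excel's IF function plus Word mail merge; the following is only an analogous scripted sketch of the same workflow (the file name, column names, and template are made up) that derives a pass/fail field from the data source and fills one report per student.

```python
import csv

TEMPLATE = "学生：{name}\n课程：{course}  成绩：{score}  评定：{result}\n"

def build_reports(csv_path):
    """Read a score sheet and render one report text per row."""
    reports = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Same role as =IF(score>=60, "及格", "不及格") in the Excel data source.
            row["result"] = "及格" if float(row["score"]) >= 60 else "不及格"
            reports.append(TEMPLATE.format(**row))
    return reports

# Hypothetical usage: scores.csv has columns name, course, score.
# for page in build_reports("scores.csv"): print(page)
```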

18.
薛调 《现代情报》2017,37(10):72
以"清博指数"微信总榜中的17所高校图书馆为研究样本,利用统计方法和定性研究方法对头条文章从推送频率、标题字数、标题特征和标题内容4个方面进行了分析,构建了头条文章标题内容的主题模型。提出了树立头条意识、提高读者认知,确定合理的推送频率及推送方式,选择恰当的标题特征,甄选契合读者需求的标题内容四点增强高校图书馆微信公众号信息传播效果的建议。  相似文献   

19.
A Brief Discussion of Ancient "Heyun" (Combined Rhymes)
李威  辛朝乾 《科教文汇》2011,(19):74-75
"Heyun" refers to the practice of temporarily altering a character's pronunciation to restore harmony when a rhyme in pre-Qin verse sounds discordant to the reader. It was an error that arose because earlier scholars, unaware of historical sound change, judged ancient pronunciation by contemporary standards. Analyzing the origins of the "heyun" theory, the criticisms leveled at it by scholars over the ages, and the reasons it arose offers useful lessons for phonological research today.

20.
Language modeling (LM), providing a principled mechanism to associate quantitative scores to sequences of words or tokens, has long been an interesting yet challenging problem in the field of speech and language processing. The n-gram model is still the predominant method, while a number of disparate LM methods, exploring either lexical co-occurrence or topic cues, have been developed to complement the n-gram model with some success. In this paper, we explore a novel language modeling framework built on top of the notion of relevance for speech recognition, where the relationship between a search history and the word being predicted is discovered through different granularities of semantic context for relevance modeling. Empirical experiments on a large vocabulary continuous speech recognition (LVCSR) task seem to demonstrate that the various language models deduced from our framework are very comparable to existing language models both in terms of perplexity and recognition error rate reductions.
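For context only: a tiny bigram language model with add-one smoothing, i.e. the classic n-gram baseline the abstract contrasts its relevance-based framework with; it does not implement the proposed relevance models, and the class and data are illustrative.

```python
from collections import Counter

class BigramLM:
    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        for words in sentences:
            tokens = ["<s>"] + words + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        # Add-one (Laplace) smoothed conditional probability P(word | prev).
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def score(self, words):
        tokens = ["<s>"] + words + ["</s>"]
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= self.prob(prev, word)
        return p

# Hypothetical usage:
# lm = BigramLM([["speech", "recognition"], ["speech", "processing"]])
# print(lm.score(["speech", "recognition"]))
```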
