1.

Spelling checkers, spelling correctors and the misspellings of poor spellers

《Information processing & management》1987,23(5):495-505

A large corpus of spelling errors taken from free writing is analyzed to assess how great a challenge such errors present for automatic checking and correction. The analysis reveals a high proportion of errors that match dictionary words; these would necessitate the use of context in error detection. Some of these errors are caused by incorrect word-division, a type of error difficult to spot since it calls into question the placing of word boundaries. Misspellings tend to differ from the correct words more than mistypings do. Some knowledge of pronunciation would help in correcting many of the errors, but misspellings do not always reflect pronunciation in a simple way.

2.

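A coarse Chinese word segmentation scheme based on a multi-level hash dictionary and the K-shortest-paths algorithm (Total citations: 1, self-citations: 1, citations by others: 0)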

岑咏华《情报理论与实践》2009,32(3)

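Building on existing research, this paper designs a multi-level hash dictionary structure for coarse Chinese word segmentation to make dictionary matching more efficient, and uses a deletion algorithm to improve the K-shortest-path search idea of the Chinese Academy of Sciences' ICTCLAS segmentation system. Finally, the paper implements the proposed technical scheme as a system. Experimental results show that the proposed coarse segmentation scheme performs well on large-scale text.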

3.

Research on an intelligent dictionary for automatic word segmentation

蔡灿民吴晟霍雪娜赵莉楠《科技广场》2007,(3):34-36

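At present, the vast majority of automatic word segmentation systems are dictionary-based. The completeness of the dictionary is the foundation and key determinant of a segmentation system's performance, yet completeness has always been hard to achieve. This paper introduces mechanical (dictionary-based) segmentation and dictionary-free segmentation, integrates the two to exploit their respective strengths, and proposes the concept of an intelligent dictionary with a self-learning capability to compensate for the unavoidable incompleteness of segmentation dictionaries.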

4.

Word classification and hierarchy using co-occurrence word information

Kazuhiro Morita El-Sayed Atlam Masao Fuketa Kazuhiko Tsuda Masaki Oono Jun-ichi Aoe 《Information processing & management》2004,40(6):9325

With the development of computers in recent years, complex, advanced processing can now be carried out at high speed. Moreover, a great deal of linguistic knowledge is used in natural language processing (NLP) systems to improve them. The need for co-occurrence word information in NLP systems is therefore growing, and various studies using such information have been carried out. A dictionary is also indispensable in NLP, because the capability of the entire system is governed by the size and quality of the dictionary. This paper describes the importance of co-occurrence word information in NLP systems. A classification technique based on the co-occurrence word (receiving word) and the co-occurrence frequency is described, and the classified groups are expressed hierarchically. The paper also proposes a technique for an automatic construction system and a complete thesaurus. The system was tested experimentally, and the effectiveness of the proposed technique was verified.

5.

Optimizing a Chinese word segmentation algorithm based on a two-level hash table

习明王增辉庄怡《人天科学研究》2010,(10):54-55

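A dictionary-based forward maximum matching algorithm that grows the candidate one character at a time is adopted, with the segmentation dictionary stored in an improved two-level hash table plus dynamic arrays. Without increasing the space complexity or maintenance complexity of existing typical dictionary mechanisms, this improves the speed and efficiency of Chinese word segmentation to some extent.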
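
The data structure named in this abstract can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the layout (first character, then second character, then a list of word tails) is an assumption, the names `build_index` and `segment` are hypothetical, and the character-by-character growth is simplified to a longest-match check over the stored tails.

```python
from collections import defaultdict

def build_index(words):
    """Two-level hash: first char -> second char -> list of remaining tails.

    The list plays the role of the dynamic array (assumed layout).
    """
    index = defaultdict(lambda: defaultdict(list))
    for w in words:
        if len(w) >= 2:          # single characters need no dictionary entry
            index[w[0]][w[1]].append(w[2:])
    return index

def segment(text, index):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        best = 1                 # fall back to a single character
        if i + 1 < len(text):
            tails = index.get(text[i], {}).get(text[i + 1], [])
            for tail in tails:
                length = 2 + len(tail)
                if text.startswith(tail, i + 2) and length > best:
                    best = length
        tokens.append(text[i:i + best])
        i += best
    return tokens

index = build_index(["中文", "分词", "中文分词", "算法", "优化"])
print(segment("中文分词算法的优化", index))
```

Only the bucket selected by the first two characters is scanned at each position, which is where a two-level hash can save time over a flat word list.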

6.

Automatic extraction of a bilingual dictionary from a patent corpus and its application in knowledge graphs

胡寅骏殷玥孙虎王茜《中国发明与专利》2021,(2):40-46

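A large body of Chinese-English patent texts is used as a parallel corpus, and a method for automatically extracting a Chinese-English dictionary is proposed. A seed bilingual dictionary is first built from the external semantic resource Wikipedia; candidate Chinese-English word pairs are then obtained by computing pointwise mutual information, and a threshold is set to select the pairs used to supplement the seed dictionary. Experimental results show that grouping words into phrases in the English documents helps improve the overall performance of automatic extraction; on the other hand, although sentence alignment can improve the automatic extraction ...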
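
The pointwise mutual information step mentioned in this abstract might look like the sketch below. It is a sketch under stated assumptions, not the paper's pipeline: it assumes sentence-aligned, pre-tokenized input, counts each word once per sentence, and omits the Wikipedia seed dictionary; the function name `pmi_pairs` and the threshold value are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def pmi_pairs(aligned, threshold):
    """Score (zh, en) word pairs by PMI over aligned sentence pairs.

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), with probabilities estimated
    as the fraction of sentence pairs in which the words occur.
    """
    n = len(aligned)
    zh_count, en_count, joint = Counter(), Counter(), Counter()
    for zh_tokens, en_tokens in aligned:
        zh_set, en_set = set(zh_tokens), set(en_tokens)
        zh_count.update(zh_set)
        en_count.update(en_set)
        joint.update(product(zh_set, en_set))
    scores = {}
    for (zh, en), c in joint.items():
        pmi = math.log((c * n) / (zh_count[zh] * en_count[en]))
        if pmi >= threshold:     # keep only candidates above the threshold
            scores[(zh, en)] = pmi
    return scores

aligned = [
    (["专利", "方法"], ["patent", "method"]),
    (["专利", "系统"], ["patent", "system"]),
    (["词典", "方法"], ["dictionary", "method"]),
]
for pair, score in sorted(pmi_pairs(aligned, 0.1).items()):
    print(pair, round(score, 3))
```

On a toy corpus this small, spurious pairs such as ("专利", "system") also pass the threshold, which illustrates why a threshold alone needs a reasonably large corpus to be selective.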

7.

Automatic spelling correction using a trigram similarity measure

Richard C. Angell George E. Freund Peter Willett 《Information processing & management》1983,19(4):255-261

A nearest neighbour search procedure is described for the automatic correction of misspellings. The procedure involves the replacement of a misspelt word by that word in a dictionary which best matches the misspelling, the degree of match being calculated using a similarity coefficient based on the number of trigrams common to the two words. Experiments with a collection of 1544 misspellings and a dictionary of 64,636 words suggest that the procedure results in the unique identification of the correct spelling for over 75% of the misspellings if the correct form of the word is in the dictionary, and that this figure may be increased to over 90% if near, rather than nearest, neighbours are acceptable.
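
A trigram-based nearest neighbour search of the kind this abstract describes can be sketched as follows. The abstract does not specify the coefficient or any padding, so the Dice coefficient and the two-blank padding used here are assumptions, and the function names are hypothetical.

```python
from collections import Counter

def trigrams(word, pad="  "):
    """Character trigrams of a word, padded with blanks so word edges count."""
    padded = pad + word.lower() + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def dice_similarity(a, b):
    """Dice coefficient: 2 * (shared trigrams) / (trigrams in a + trigrams in b)."""
    ta, tb = Counter(trigrams(a)), Counter(trigrams(b))
    common = sum((ta & tb).values())   # multiset intersection
    return 2 * common / (sum(ta.values()) + sum(tb.values()))

def nearest_neighbours(misspelling, dictionary, k=1):
    """Return the k dictionary words most similar to the misspelling."""
    ranked = sorted(dictionary,
                    key=lambda w: dice_similarity(misspelling, w),
                    reverse=True)
    return ranked[:k]

words = ["spelling", "spewing", "sapling"]
print(nearest_neighbours("speling", words))        # nearest neighbour only
print(nearest_neighbours("speling", words, k=2))   # near neighbours
```

Raising `k` corresponds to accepting near rather than nearest neighbours, the relaxation the abstract reports as raising the success rate.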

8.

An efficient file structure for specialized dictionaries and other “lumpy” data

E. J. Yannakoudakis 《Information processing & management》1987,23(6)

There are many cases where it is necessary to store sets of data that are variable in length, and to search these in order to satisfy requests for subsets with a common characteristic. This article presents a file structure that holds an integrated English dictionary used to locate clusters of words for presentation to an intelligent spelling error correction system. Although the emphasis has been on misspelling, the structure presented is capable of handling any other types of lumpy data provided the characteristics used in search requests can be translated into a set of integer numbers.

9.

A Chinese word segmentation method based on technical term extraction

郑阳莫建文《大众科技》2012,14(4):20-23

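In scientific and technical literature, out-of-vocabulary words and other domain-specific terms vary widely and are hard to recognize during Chinese word segmentation, which lowers segmentation accuracy for specialist texts. To address this, a Chinese word segmentation method based on technical term extraction is presented. Using a large domain-specific corpus and methods based on mutual information and statistics, out-of-vocabulary technical terms are extracted from the text to build a technical term dictionary, which is combined with a general-purpose dictionary for Chinese word segmentation by maximum matching. Experiments show that the method extracts the relevant technical terms fairly accurately, thereby improving segmentation precision, and that it has practical value.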

10.

Improved Arabic speech recognition system through the automatic generation of fine-grained phonetic transcriptions

《Information processing & management》2019,56(2):343-353

11.

An examination of undetected typing errors

Fred J. Damerau Eric Mays 《Information processing & management》1989,25(6)

We examine the effect of increasing word list size on the error rate of spelling correctors. An experiment on a large body of text shows that an increase in the word list size decreases the error rate.

12.

Applying query structuring in cross-language retrieval

《Information processing & management》2003,39(3):391-402

We will explore various ways to apply query structuring in cross-language information retrieval. In the first test, English queries were translated into Finnish using an electronic dictionary, and were run in a Finnish newspaper database of 55,000 articles. Queries were structured by combining the Finnish translation equivalents of the same English query key using the syn-operator of the InQuery retrieval system. Structured queries performed markedly better than unstructured queries. Second, the effects of compound-based structuring using a proximity operator for the translation equivalents of query language compound components were tested. The method was not useful in syn-based queries but resulted in decrease in retrieval effectiveness. Proper names are often non-identical spelling variants in different languages. This allows n-gram based translation of names not included in a dictionary. In the third test, a query structuring method where the Boolean and-operator was used to assign more weight to keys translated through n-gram matching gave good results.

13.

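Automatic identification and mining of Chinese synonyms for information retrieval (Total citations: 3, self-citations: 0, citations by others: 3)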

陆勇侯汉清《情报理论与实践》2006,29(4):472-475

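To make automatic synonym mining more efficient, this paper proposes a method for automatically identifying and mining synonyms from dictionary definitions, using a hyperlink analysis algorithm and a pattern matching algorithm to extract synonyms from different angles. The first part treats the relation between a defining word and the word it defines as a kind of link. For a given word, all related words linked to it are assembled into a word graph, in which each node represents a related word and each arc represents a defining/defined relation between words. Hyperlink analysis combined with the PageRank algorithm is used to compute a PageRank value for each word, and this value is taken as a measure of semantic similarity between words. A candidate synonym set is then generated for each word, and the best synonyms are recommended according to certain filtering principles and methods. The second part uses word definition patterns: it analyzes how words are defined, summarizes the patterns in which synonyms appear in dictionary definitions, and then identifies and mines synonyms by pattern matching. In addition, pattern matching was tested for mining synonyms in Web pages and journal articles. The test results show that using pattern matching and hyperlink analysis to automatically identify and mine synonyms is feasible and practical.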

14.

Unsupervised word sense disambiguation for Korean through the acyclic weighted digraph using corpus and dictionary

Yeohoon Yoon Choong-Nyoung Seon Songwook Lee Jungyun Seo 《Information processing & management》2007

Word sense disambiguation (WSD) is meant to assign the most appropriate sense to a polysemous word according to its context. We present a method for automatic WSD using only two resources: a raw text corpus and a machine-readable dictionary (MRD). The system learns the similarity matrix between word pairs from the unlabeled corpus, and it uses the vector representations of sense definitions from MRD, which are derived based on the similarity matrix. In order to disambiguate all occurrences of polysemous words in a sentence, the system separately constructs the acyclic weighted digraph (AWD) for every occurrence of polysemous words in a sentence. The AWD is structured based on consideration of the senses of context words which occur with a target word in a sentence. After building the AWD per each polysemous word, we can search the optimal path of the AWD using the Viterbi algorithm. We assign the most appropriate sense to the target word in sentences with the sense on the optimal path in the AWD. By experiments, our system shows 76.4% accuracy for the semantically ambiguous Korean words.
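
The optimal-path search mentioned in this abstract can be illustrated with a generic Viterbi pass over a layered acyclic graph of candidate senses. This is only a sketch of the search step: the layer structure, the sense labels, the additive scoring, and the toy weights are all assumptions, and the corpus-learned similarity matrix that would supply real edge weights is not modelled.

```python
def best_sense_path(layers, weight):
    """Viterbi search for the highest-scoring path through layered senses.

    layers: list of lists of sense labels, one list per word position.
    weight(prev, cur): edge score between senses in adjacent layers.
    """
    scores = {s: 0.0 for s in layers[0]}
    back = []                                # one back-pointer table per layer
    for layer in layers[1:]:
        new_scores, pointers = {}, {}
        for cur in layer:
            prev = max(scores, key=lambda p: scores[p] + weight(p, cur))
            new_scores[cur] = scores[prev] + weight(prev, cur)
            pointers[cur] = prev
        scores = new_scores
        back.append(pointers)
    path = [max(scores, key=scores.get)]     # best final sense
    for pointers in reversed(back):          # follow back-pointers
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy edge weights standing in for learned sense similarities (hypothetical).
toy = {
    ("bank/finance", "deposit/money"): 0.9,
    ("bank/finance", "deposit/sediment"): 0.1,
    ("bank/river", "deposit/money"): 0.2,
    ("bank/river", "deposit/sediment"): 0.6,
}
layers = [["bank/finance", "bank/river"], ["deposit/money", "deposit/sediment"]]
print(best_sense_path(layers, lambda p, c: toy[(p, c)]))
```

Because each layer only looks back one step, the cost grows linearly with sentence length rather than with the number of full sense combinations.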

15.

Unsupervised word sense disambiguation for Korean through the acyclic weighted digraph using corpus and dictionary

Yeohoon Yoon Choong-Nyoung Seon Songwook Lee Jungyun Seo 《Information processing & management》2006

Word sense disambiguation (WSD) is meant to assign the most appropriate sense to a polysemous word according to its context. We present a method for automatic WSD using only two resources: a raw text corpus and a machine-readable dictionary (MRD). The system learns the similarity matrix between word pairs from the unlabeled corpus, and it uses the vector representations of sense definitions from MRD, which are derived based on the similarity matrix. In order to disambiguate all occurrences of polysemous words in a sentence, the system separately constructs the acyclic weighted digraph (AWD) for every occurrence of polysemous words in a sentence. The AWD is structured based on consideration of the senses of context words which occur with a target word in a sentence. After building the AWD per each polysemous word, we can search the optimal path of the AWD using the Viterbi algorithm. We assign the most appropriate sense to the target word in sentences with the sense on the optimal path in the AWD. By experiments, our system shows 76.4% accuracy for the semantically ambiguous Korean words.

16.

Compression of index term dictionary in an inverted-file-orientated database: Some effective algorithms

Janusz L. Wiśniewski 《Information processing & management》1986,22(6)

A new method of index term dictionary compression in an inverted-file-orientated database is discussed. A technique of word coding that generates short fixed-length codes obtained from the index terms themselves by analysis of monogram and bigram statistical distributions is described. Transformation of the index term dictionary into a code dictionary preserves a word-to-word discrimination with a rate of three synonyms per 1300 terms, at compression ratio up to 90% and at low cost in terms of the CPU time expenditure. When applied in computer network environment, it offers substantial savings in communication channel utilization at negligible response time degradation. Experimental data for 26,113 index term dictionary of the New York Times Info Bank available via a computer network are presented.

17.

Developing "word guessing" ability in senior-three (final-year high school) English review

马燕宴《科教文汇》2014,(32):154-155

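Word-meaning guessing questions are a fixed item type in the reading comprehension section of the college entrance examination, generally accounting for about 10%. Word guessing requires students, beyond mastering the basic meanings and usage of vocabulary, to adapt word meanings flexibly to the context of the passage. What must be inferred may be a single word or the meaning of a phrase; it may be the meaning of an unfamiliar word, a new meaning of a familiar word, or the referent of a substitute word. However, most students spend a great deal of time and effort mechanically memorizing spellings and Chinese meanings, without a corresponding improvement in their ability to use words. In response, this paper proposes methods for developing "word guessing" ability during senior-three review.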

18.

Automatic extraction of bilingual word pairs using inductive chain learning in various languages

Hiroshi Echizen-ya Kenji Araki Yoshio Momouchi 《Information processing & management》2006

In this paper, we propose a new learning method for extracting bilingual word pairs from parallel corpora in various languages. In cross-language information retrieval, the system must deal with various languages. Therefore, automatic extraction of bilingual word pairs from parallel corpora with various languages is important. However, previous works based on statistical methods are insufficient because of the sparse data problem. Our learning method automatically acquires rules, which are effective to solve the sparse data problem, only from parallel corpora without any prior preparation of a bilingual resource (e.g., a bilingual dictionary, a machine translation system). We call this learning method Inductive Chain Learning (ICL). Moreover, the system using ICL can extract bilingual word pairs even from bilingual sentence pairs for which the grammatical structures of the source language differ from the grammatical structures of the target language because the acquired rules have the information to cope with the different word orders of source language and target language in local parts of bilingual sentence pairs. Evaluation experiments demonstrated that the recalls of systems based on several statistical approaches were improved through the use of ICL.

19.

A survey of word similarity algorithms

李慧《现代情报》2015,35(4):172-177

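Word similarity computation is widely used in natural language processing areas such as information retrieval, word sense disambiguation, and machine translation. Existing word similarity algorithms fall into two main classes: statistics-based and semantic-resource-based. The former computes similarity from contextual co-occurrence information gathered from large corpora, while the latter uses manually built semantic dictionaries or semantic networks. This paper compares and analyzes the two classes, focuses on algorithms based on Web corpora and on Wikipedia, and summarizes their respective characteristics and shortcomings. Finally, it suggests that, under the influence of information technology, word similarity algorithms based on Wikipedia and on hybrid techniques, as well as linked-data-driven similarity computation, are potential directions of development.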

20.

Effective foreign word extraction for Korean information retrieval

《Information processing & management》2002,38(1):91-109

In Korean text, foreign words, which are mostly transliterations of English words, are frequently used. Foreign words are usually very important index terms in Korean information retrieval since most of them are technical terms or names. So accurate foreign word extraction is crucial for high performance of information retrieval. However, accurate foreign word extraction is not easy because it inevitably accompanies word segmentation and most of the foreign words are unknown. In this paper, we present an effective foreign word recognition and extraction method. In order to accurately extract foreign words, we developed an effective method of word segmentation that involves unknown foreign words. Our word segmentation method effectively utilizes both unknown word information acquired through the automatic dictionary compilation and foreign word recognition information. Our HMM-based foreign word recognition method does not require large labeled examples for the model training unlike the previously proposed method.