Cross-linguistic Text Similarity Analysis Based on Co-occurrence of Text Weighted Words
Cite this article: ZHANG Xiao-yu, WANG Yong-bin, WU Lin. Cross-linguistic Text Similarity Analysis Based on Co-occurrence of Text Weighted Words[J]. Introduction of Educational Technology, 2020, 19(2): 92-95. DOI: 10.11907/rjdk.191233
Authors: ZHANG Xiao-yu, WANG Yong-bin, WU Lin
Affiliation: Key Laboratory of Convergent Media and Intelligent Technology, Communication University of China, Beijing 100024, China
Funding: Communication University of China Youth Science and Engineering Planning Project (3132018XNG1837)
Abstract: Cross-language text similarity computation has important applications in cross-language information retrieval, data mining, and plagiarism detection. Because languages differ in grammar and structure, however, it differs substantially from monolingual text similarity computation in space mapping and feature selection. To address this problem, the paper adopts a cross-language text similarity method based on weighted word co-occurrence relations in text. A cross-language word co-occurrence model is built from a parallel corpus, the model is used to map texts from one language into the other, and similarity is then computed between texts in different languages. In effect, the model captures the probability distribution over keywords in the target language given that certain keywords occur together in the source language. Experiments show that the rankings of cross-language texts produced by this method are closer to human judgments.

Keywords: word co-occurrence; text similarity; cross-language; statistical translation model
Received: 2019-03-01
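
The abstract gives no formulas, so the following Python sketch is only one plausible reading of the pipeline it describes (build a co-occurrence model from a parallel corpus, map a text into the other language, compare), not the authors' implementation. Several choices are assumptions made for illustration: the function names are hypothetical, co-occurrence is counted per aligned sentence pair, word weights are plain term frequencies, each source word is mapped independently rather than jointly with the other keywords, and cosine similarity is used for the final comparison.

```python
import math
from collections import Counter, defaultdict


def build_cooccurrence_model(parallel_pairs):
    """Estimate P(target word | source word) from aligned token lists.

    Assumption: a source word and a target word "co-occur" once for every
    aligned sentence pair in which both appear.
    """
    counts = defaultdict(Counter)
    for src_tokens, tgt_tokens in parallel_pairs:
        for s in set(src_tokens):
            for t in set(tgt_tokens):
                counts[s][t] += 1
    model = {}
    for s, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        model[s] = {t: c / total for t, c in tgt_counts.items()}
    return model


def map_text(src_tokens, model):
    """Map a source-language text to a weighted target-language word vector.

    Assumption: the weight of a source word is its term frequency.
    Words missing from the model contribute nothing.
    """
    mapped = defaultdict(float)
    for s, weight in Counter(src_tokens).items():
        for t, p in model.get(s, {}).items():
            mapped[t] += weight * p
    return dict(mapped)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cross_language_similarity(src_tokens, tgt_tokens, model):
    """Similarity between a source-language and a target-language text."""
    return cosine(map_text(src_tokens, model), dict(Counter(tgt_tokens)))


# Toy parallel corpus; the romanized source tokens are made-up placeholders.
pairs = [
    (["mao", "shuijiao"], ["cat", "sleeps"]),
    (["mao", "chi"], ["cat", "eats"]),
    (["gou", "chi"], ["dog", "eats"]),
]
model = build_cooccurrence_model(pairs)
query = ["mao", "shuijiao"]
candidates = [["dog", "eats"], ["cat", "sleeps"]]
ranked = sorted(
    candidates,
    key=lambda c: cross_language_similarity(query, c, model),
    reverse=True,
)
print(ranked)
```

Sorting candidate texts by this score yields a ranking that could be compared against a human-produced ranking, which is the kind of evaluation the abstract reports.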