
1.

Constructing and analyzing domain-specific language model for financial text mining

《Information processing & management》2023,60(2):103194

The application of natural language processing (NLP) to financial fields is advancing with an increase in the number of available financial documents. Transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) have been successful in NLP in recent years. These cutting-edge models have been adapted to the financial domain by applying financial corpora to existing pre-trained models and by pre-training with the financial corpora from scratch. In Japanese, by contrast, financial terminology cannot be applied from a general vocabulary without further processing. In this study, we construct language models suitable for the financial domain. Furthermore, we compare methods for adapting language models to the financial domain, such as pre-training methods and vocabulary adaptation. We confirm that the adaptation of a pre-training corpus and tokenizer vocabulary based on a corpus of financial text is effective in several downstream financial tasks. No significant difference is observed between pre-training with the financial corpus and continuous pre-training from the general language model with the financial corpus. We have released our source code and pre-trained models.

2.

Word sense disambiguation based on context selection using knowledge-based word similarity

Sunjae Kwon Dongsuk Oh Youngjoong Ko 《Information processing & management》2021,58(4):102551

In this paper, we introduce a novel knowledge-based word-sense disambiguation (WSD) system. In particular, the main goal of our research is to find an effective way to filter out unnecessary information by using word similarity. For this, we adopt two methods in our WSD system. First, we propose a novel encoding method for word vector representation by considering the graphical semantic relationships from the lexical knowledge bases, and the word vector representation is utilized to determine the word similarity in our WSD system. Second, we present an effective method for extracting the contextual words from a text for analyzing an ambiguous word based on word similarity. The results demonstrate that the suggested methods significantly enhance the baseline WSD performance in all corpora. In particular, the performance on nouns is similar to that of the state-of-the-art knowledge-based WSD models, and the performance on verbs surpasses that of the existing knowledge-based WSD models.

3.

Unsupervised word sense disambiguation for Korean through the acyclic weighted digraph using corpus and dictionary

Yeohoon Yoon Choong-Nyoung Seon Songwook Lee Jungyun Seo 《Information processing & management》2007

Word sense disambiguation (WSD) is meant to assign the most appropriate sense to a polysemous word according to its context. We present a method for automatic WSD using only two resources: a raw text corpus and a machine-readable dictionary (MRD). The system learns the similarity matrix between word pairs from the unlabeled corpus, and it uses the vector representations of sense definitions from MRD, which are derived based on the similarity matrix. In order to disambiguate all occurrences of polysemous words in a sentence, the system separately constructs the acyclic weighted digraph (AWD) for every occurrence of polysemous words in a sentence. The AWD is structured based on consideration of the senses of context words which occur with a target word in a sentence. After building the AWD for each polysemous word, we can search for the optimal path of the AWD using the Viterbi algorithm. We assign the most appropriate sense to the target word in sentences with the sense on the optimal path in the AWD. In experiments, our system achieves 76.4% accuracy for the semantically ambiguous Korean words.
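
The optimal-path search described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the digraph is arranged in layers (one layer of candidate senses per word), that each edge carries a similarity score, and that the best path maximizes the sum of edge scores. The function name `viterbi_best_path` and the toy sense labels and scores are hypothetical.

```python
def viterbi_best_path(layers, edge_score):
    """Find the highest-scoring path through a layered acyclic digraph.

    layers: list of lists of node labels (labels unique within a layer).
    edge_score: function (prev_node, node) -> float, an assumed similarity weight.
    Returns (total_score, path) where path picks one node per layer.
    """
    if not layers or any(not layer for layer in layers):
        raise ValueError("every layer needs at least one node")

    # best[node] = (score of the best path ending at node, that path)
    best = {node: (0.0, [node]) for node in layers[0]}
    for layer in layers[1:]:
        new_best = {}
        for node in layer:
            # Extend the best-scoring predecessor path into this node.
            score, path = max(
                (
                    (prev_score + edge_score(prev, node), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[node] = (score, path + [node])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy example: hypothetical sense labels and similarity scores.
toy_scores = {
    ("bank/finance", "deposit/money"): 0.9,
    ("bank/river", "deposit/sediment"): 0.7,
    ("deposit/money", "interest/money"): 0.8,
}
toy_layers = [
    ["bank/river", "bank/finance"],
    ["deposit/money", "deposit/sediment"],
    ["interest/money", "interest/curiosity"],
]
total, senses = viterbi_best_path(
    toy_layers, lambda a, b: toy_scores.get((a, b), 0.1)
)
print(senses)
```

Dynamic programming keeps only the best path into each node, so the cost grows with the number of edges between adjacent layers rather than with the number of complete sense combinations.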

4.

Unsupervised word sense disambiguation for Korean through the acyclic weighted digraph using corpus and dictionary

Yeohoon Yoon Choong-Nyoung Seon Songwook Lee Jungyun Seo 《Information processing & management》2006

Word sense disambiguation (WSD) is meant to assign the most appropriate sense to a polysemous word according to its context. We present a method for automatic WSD using only two resources: a raw text corpus and a machine-readable dictionary (MRD). The system learns the similarity matrix between word pairs from the unlabeled corpus, and it uses the vector representations of sense definitions from MRD, which are derived based on the similarity matrix. In order to disambiguate all occurrences of polysemous words in a sentence, the system separately constructs the acyclic weighted digraph (AWD) for every occurrence of polysemous words in a sentence. The AWD is structured based on consideration of the senses of context words which occur with a target word in a sentence. After building the AWD for each polysemous word, we can search for the optimal path of the AWD using the Viterbi algorithm. We assign the most appropriate sense to the target word in sentences with the sense on the optimal path in the AWD. In experiments, our system achieves 76.4% accuracy for the semantically ambiguous Korean words.

5.

Automatic extraction of bilingual word pairs using inductive chain learning in various languages

Hiroshi Echizen-ya Kenji Araki Yoshio Momouchi 《Information processing & management》2006

In this paper, we propose a new learning method for extracting bilingual word pairs from parallel corpora in various languages. In cross-language information retrieval, the system must deal with various languages. Therefore, automatic extraction of bilingual word pairs from parallel corpora with various languages is important. However, previous works based on statistical methods are insufficient because of the sparse data problem. Our learning method automatically acquires rules that are effective in solving the sparse data problem, using only parallel corpora without any prior preparation of a bilingual resource (e.g., a bilingual dictionary, a machine translation system). We call this learning method Inductive Chain Learning (ICL). Moreover, the system using ICL can extract bilingual word pairs even from bilingual sentence pairs for which the grammatical structures of the source language differ from the grammatical structures of the target language, because the acquired rules have the information to cope with the different word orders of source language and target language in local parts of bilingual sentence pairs. Evaluation experiments demonstrated that the recalls of systems based on several statistical approaches were improved through the use of ICL.

6.

Stabilizing unstable periodic orbits in large stability domains with dynamic time-delayed feedback control

《Journal of The Franklin Institute》2022,359(16):8484-8496

Dynamic time-delayed feedback control (DDFC) is applied to stabilize the unstable periodic orbits (UPOs) of chaotic systems in large stability domains. The stability domain is defined as certain areas of the parameter space of the feedback strength, in which the UPOs are stabilized. The control effect of the DDFC with a second-order controller system is investigated by considering two control objectives: to broaden the stability domain of the controlled UPOs and to minimize the modulus of the largest Floquet multiplier, which leads to a multi-objective optimization problem (MOP). The MOP is solved with the genetic algorithm. Case studies indicate that the control effect of the DDFC is significantly better than that of original time-delayed feedback control. The DDFC can stabilize the UPOs in a large stability domain and with a small modulus of the largest Floquet multiplier when the adjustable parameters of the controller system are properly designed.

7.

Resolving ambiguity in biomedical text to improve summarization

Laura Plaza Mark Stevenson Alberto Díaz 《Information processing & management》2012

Access to the vast body of research literature that is now available on biomedicine and related fields can be improved with automatic summarization. This paper describes a summarization system for the biomedical domain that represents documents as graphs formed from concepts and relations in the UMLS Metathesaurus. This system has to deal with the ambiguities that occur in biomedical documents. We describe a variety of strategies that make use of MetaMap and Word Sense Disambiguation (WSD) to accurately map biomedical documents onto UMLS Metathesaurus concepts. Evaluation is carried out using a collection of 150 biomedical scientific articles from the BioMed Central corpus. We find that using WSD improves the quality of the summaries generated.

8.

Relation extraction from tourism information based on convolutional neural networks

鲍玉来耿雪来飞龙《现代情报》2019,39(8):132-136

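[Purpose/Significance] Extracting knowledge elements from unstructured corpora is a key step in building a knowledge graph. This paper explores a method for extracting knowledge relations in the tourism domain using the convolutional neural network (CNN) model from deep learning. [Method/Process] Data crawled from specialized tourism websites were used to build a corpus, part of which was manually annotated to serve as training and test sets. Word segmentation, vectorization, and the CNN model were implemented in Python, and relation extraction experiments were carried out. [Result/Conclusion] The experiments show that a convolutional neural network achieves satisfactory results when extracting relations from unstructured tourism text (Precision 0.77, Recall 0.76, F1-measure 0.76). After refinement through manual proofreading, the extraction results can lay a foundation for building tourism knowledge graphs and domain ontologies.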
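
As a quick consistency check on the reported scores, the standard F1 definition (the harmonic mean of precision and recall) can be applied to the rounded values given in the abstract. This assumes the abstract uses the usual F1 formula, which it does not state explicitly.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.77 \times 0.76}{0.77 + 0.76}
    = \frac{1.1704}{1.53}
    \approx 0.765
```

Because the precision and recall are themselves rounded to two decimals, this is consistent with the reported F1-measure of 0.76.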

9.

A privacy-preserving algorithm for government data sharing based on local differential privacy

郝玉蓉朴春慧颜嘉麒蒋学红《情报杂志》2021,40(2):169-175,137

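[Purpose/Significance] To make well-founded decisions, a government department usually shares certain types of data with other departments according to business needs, providing supporting reference for departmental management or service decisions. Data sharing is crucial here, but sharing government data without appropriate safeguards can easily leak private information. [Method/Process] For the scenario in which government departments share statistical data, a government data sharing method based on local differential privacy is proposed. Building on the Generalized Randomized Response (GRR) algorithm, the method introduces data binning: equal-width binning assigns data records to smaller data domains, overcoming the large statistical error of current privacy-preserving algorithms when the data domain is large and the amount of data is small. [Result/Conclusion] The proposed algorithm was compared with GRR on both simulated and real datasets. The experiments show that it effectively reduces statistical error and maintains its utility under different distributions and data domain sizes.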
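
The combination of equal-width binning and Generalized Randomized Response can be sketched as follows. This is a minimal illustration built on the standard GRR mechanism, not the paper's algorithm: the function names, the value range, and the choice to bin before perturbing are all assumptions. With d bins and privacy budget ε, GRR reports the true bin with probability p = e^ε / (e^ε + d − 1) and any other given bin with probability q = 1 / (e^ε + d − 1).

```python
import math
import random
from collections import Counter


def equal_width_bin(value, low, high, num_bins):
    """Map a numeric value in [low, high] to a bin index in [0, num_bins)."""
    if not low <= value <= high:
        raise ValueError("value outside [low, high]")
    width = (high - low) / num_bins
    # The upper boundary value falls into the last bin.
    return min(int((value - low) / width), num_bins - 1)


def grr_perturb(true_bin, num_bins, epsilon, rng):
    """Report the true bin with probability p, else a uniformly chosen other bin."""
    exp_eps = math.exp(epsilon)
    p = exp_eps / (exp_eps + num_bins - 1)
    if rng.random() < p:
        return true_bin
    other = rng.randrange(num_bins - 1)
    # Skip over the true bin so that only the other bins can be chosen.
    return other if other < true_bin else other + 1


def grr_estimate(reports, num_bins, epsilon):
    """Unbiased frequency estimate per bin from perturbed reports."""
    if not reports:
        raise ValueError("no reports to aggregate")
    exp_eps = math.exp(epsilon)
    p = exp_eps / (exp_eps + num_bins - 1)
    q = 1.0 / (exp_eps + num_bins - 1)
    counts = Counter(reports)
    n = len(reports)
    return [(counts.get(b, 0) / n - q) / (p - q) for b in range(num_bins)]


# Toy usage with hypothetical data: values in [0, 100] grouped into 5 bins.
rng = random.Random(0)
values = [rng.uniform(0, 100) for _ in range(1000)]
reports = [grr_perturb(equal_width_bin(v, 0, 100, 5), 5, 1.0, rng) for v in values]
estimates = grr_estimate(reports, 5, 1.0)
```

Reducing the domain from many distinct values to a few bins raises p for a fixed ε, which is one plausible reading of why binning lowers the statistical error when data are scarce. Individual estimates can be negative or exceed 1, since the estimator is unbiased but not clipped.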

10.

Automatic acquisition of inflectional lexica for morphological normalisation

J. Šnajder B. Dalbelo Bašić M. Tadić 《Information processing & management》2008

Due to natural language morphology, words can take on various morphological forms. Morphological normalisation – often used in information retrieval and text mining systems – conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.

11.

Research on teaching language sense in secondary school Chinese classes

陈莎莎《科教文汇》2014,(14):167-168

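Since Xia Mianzun raised the question of language sense in 1924, no precise definition has been universally accepted, nor is there a concrete, widely applicable method for teaching it. Defining language sense need not get entangled in complex, or even unsettled, psychological terminology: following the "genus plus differentia" formula for definitions, language sense is simply the human intuition for language. Moreover, language-sense teaching must not remain superficial; the general methods of listening, speaking, reading, writing, and viewing must be concretely integrated with actual teaching.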

12.

VIDPSO: Victim item deletion based PSO inspired sensitive pattern hiding algorithm for dense datasets

《Information processing & management》2020,57(5):102255

Collaborative frequent itemset mining involves analyzing the data shared from multiple business entities to find interesting patterns from it. However, this comes at the cost of high privacy risk, because some of these patterns may contain business-sensitive information and are hence denoted as sensitive patterns. The revelation of such patterns can disclose confidential information. Privacy-preserving data mining (PPDM) includes various sensitive pattern hiding (SPH) techniques, which ensure that sensitive patterns do not get revealed when data mining models are applied on shared datasets. In the process of hiding sensitive patterns, some of the non-sensitive patterns also become infrequent. SPH techniques thus affect the results of data mining models. Maintaining a balance between data privacy and data utility is an NP-hard problem because it requires the selection of sensitive items for deletion and also the selection of transactions containing these items such that side effects of deletion are minimal. There are various algorithms proposed by researchers that use evolutionary approaches such as genetic algorithm (GA), particle swarm optimization (PSO) and ant colony optimization (ACO). These evolutionary SPH algorithms mask sensitive patterns through the deletion of sensitive transactions. Failure to mask sensitive patterns and loss of data have been the biggest challenges for such algorithms. The performance of evolutionary algorithms degrades further when applied to dense datasets. In this research paper, a victim item deletion based, PSO-inspired evolutionary algorithm named VIDPSO is proposed to sanitize dense datasets. In the proposed algorithm, each particle of the population consists of n sub-particles derived from pre-calculated victim items. The proposed algorithm has a high exploration capability to search the solution space for selecting optimal transactions.
Experiments conducted on real and synthetic dense datasets show that the VIDPSO algorithm outperforms GA-, PSO- and ACO-based SPH algorithms in terms of hiding failure, with minimal loss of data.

13.

Language models and fusion for authorship attribution

《Information processing & management》2019,56(6):102061

We deal with the task of authorship attribution, i.e., identifying the author of an unknown document, proposing the use of Part Of Speech (POS) tags as features for language modeling. The experimentation is carried out on corpora untypical for the task, i.e., with documents edited by non-professional writers, such as movie reviews or tweets. The former corpus is homogeneous with respect to topic, making the task more challenging. The latter corpus puts language models into a setting of a continuously and rapidly evolving language, a unique and noisy writing style, and the limited length of social media messages. While we find that language models based on POS tags are competitive in only one of the corpora (movie reviews), they generally provide efficiency benefits and robustness against data sparsity. Furthermore, we experiment with model fusion, where language models based on different modalities are combined. By linearly combining three language models, based on characters, words, and POS trigrams, respectively, we achieve the best generalization accuracy of 96% on movie reviews, while the combination of language models based on characters and POS trigrams provides 54% accuracy on the Twitter corpus. In fusion, POS language models prove to be essential and effective components.
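
Linear fusion of several language models can be sketched as a weighted sum of per-author scores. The abstract does not specify what is combined (probabilities, log-probabilities, or normalized scores) or how the weights are chosen, so the function names, the score values, and the weights below are all hypothetical.

```python
import math


def fuse_scores(model_scores, weights):
    """Linearly combine per-author scores from several models.

    model_scores: dict model_name -> dict author -> score (higher is better).
    weights: dict model_name -> non-negative weight; weights must sum to 1.
    Returns dict author -> fused score for authors scored by every model.
    """
    if set(model_scores) != set(weights):
        raise ValueError("weights must cover exactly the given models")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    authors = set.intersection(*(set(scores) for scores in model_scores.values()))
    return {
        author: sum(weights[m] * model_scores[m][author] for m in model_scores)
        for author in authors
    }


def attribute_author(model_scores, weights):
    """Pick the author with the highest fused score."""
    fused = fuse_scores(model_scores, weights)
    return max(fused, key=fused.get)


# Toy example: made-up scores from character, word, and POS-trigram models.
toy_scores = {
    "char": {"alice": 0.6, "bob": 0.4},
    "word": {"alice": 0.3, "bob": 0.7},
    "pos": {"alice": 0.7, "bob": 0.3},
}
toy_weights = {"char": 0.4, "word": 0.3, "pos": 0.3}
print(attribute_author(toy_scores, toy_weights))
```

In this toy case the word model alone would pick a different author than the fused model, which shows how one model's vote can be overridden by the others.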

14.

Automatic profiling of a triplet comparable corpus: research and application in intelligent information processing

王毅肖健袁琦宋金平李强《情报理论与实践》2012,35(4):94-98

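The automatic language profiling technique based on a triplet comparable corpus proposed in this article broadens the scope of this research field to include application-oriented research for natural language processing. With engineering feasibility in mind, the article proposes building a triplet comparable corpus and using automatic extraction techniques such as n-gram word strings, keyword clusters, and semantic multiword expressions. By contrasting against Chinglish expressions, it uncovers native-speaker English language models, with the goal of improving and advancing natural language processing applications such as machine translation and cross-language information retrieval.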

15.

PET image analysis based on the wavelet transform

闫镔王鹏李可郝晶吴义根谢千河支联合王崴鲁娜袁秀丽单保慈《中国科学院研究生院学报》2005,22(4):499-505

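A method is proposed that combines the wavelet transform with statistical testing to detect activated regions in PET images. First, simulated PET images are used to evaluate the reliability of the algorithm; the results show that statistical testing in the wavelet domain offers higher sensitivity, stronger robustness to noise, and faster computation than conventional testing performed directly in the spatial domain. Finally, the method is applied to real PET images, also with satisfactory results. The method provides a new multiscale, high-performance analysis tool for PET medical image processing and brain function research, and is of considerable significance for functional localization in brain research, clinical diagnosis, and the evaluation of drug efficacy.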

16.

Assessing the non-use value of heritage resources in the Wuyishan Scenic Area using the contingent valuation method

游巍斌何东进洪伟刘翠俞建安陈炳容朱建琴纪志荣陈晓芳《资源科学》2014,36(9):1880-1888

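Wuyishan (Mount Wuyi) is one of China's four World Heritage sites inscribed for both cultural and natural value. As the tourism gateway where the site's cultural and natural heritage resources are most concentrated and which was developed earliest, the Wuyishan Scenic Area has high heritage resource value and occupies an important position within the Wuyishan mixed heritage site. The contingent valuation method is used to assess respondents' willingness to pay and the non-use value of the Wuyishan Scenic Area. The results show that, taking uncertainty into account, the per capita payment value of surveyed visitors to the Wuyishan Scenic Area in 2009 was 29.67 yuan per year. For four respondent characteristics (income, education, enthusiasm for tourism, and awareness of heritage protection), willingness to pay differed significantly across group levels (p<0.05); gender and age showed no significant difference (p>0.05). Respondents with higher income, more education, and stronger enthusiasm for tourism and awareness of heritage protection had a higher willingness to pay, whereas gender and age were unrelated to willingness to pay.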

17.

Passivity-based adaptive tracking control of spacecraft line-of-sight relative motion with thrust saturation

《Journal of The Franklin Institute》2021,358(13):6408-6432

This paper considers the control problem of spacecraft line-of-sight (LOS) relative motion with thrust saturation in the presence of unmodeled dynamics, external disturbance and unknown mass property. By using the skew-symmetric property, a reference trajectory generator and an anti-windup technique, a novel passivity-based adaptive sliding mode control (SMC) scheme is proposed without prior knowledge of the uncertainty/disturbance bound. Within the Lyapunov framework, the establishment of a real sliding mode (which induces the practical stability of the closed-loop error system) is validated. The main contributions are that a new control gain adaptive algorithm is adopted to attenuate the overestimation of the switching gain, and a differentiable projection-based parameter adaptive algorithm is proposed to force the mass approximator to remain in a desired domain; the adaptive control law is then modified by the reference trajectory generator and anti-windup technique to compensate for the effect of thrust saturation. Finally, simulations are conducted to show the fine performance of the proposed control scheme.

18.

Application of distributed spatial data integration in the land and resources system

管博《中国科技信息》2006,(10):173-174

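Distributed spatial data integration is currently the main means of sharing heterogeneous spatial databases. Based on an in-depth study of distributed spatial database integration, and considering the characteristics and current state of heterogeneous spatial databases in the land and resources system, a relatively simple integration model for distributed spatial databases is proposed to consolidate heterogeneous spatial data within the system. The model has been applied in the land and resources system.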

19.

Korean language education centered on teaching with television dramas

潘燕梅《科教文汇》2012,(22):151-152

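The ultimate goal of language education is to improve learners' communicative competence, so teaching dialogue as it occurs in real-life situations is especially important. Korean language education centered on television dramas can, to some extent, strengthen this aspect of teaching. The language of television dramas is vivid, lively, and contemporary; it can greatly stimulate students' interest in learning and deepen their understanding of Korean culture. Selecting dramas suitable for students of different grades is an important part of drama-centered Korean language education. Using the situational teaching method, the author put drama-centered Korean language education into practice.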

20.

Neural network forecasting of the real estate sales price index

张喆姚亚辉《中国科技信息》2012,(19):114-115

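The real estate sales price index is an effective tool for guiding industry activity and market research, but the accuracy of its forecasts has always been a major concern. Artificial neural networks, an emerging interdisciplinary field, have in recent years been applied to a growing number of practical forecasting problems and show broad application prospects; in particular, they have great potential for predicting the future behavior of nonlinear systems. This paper therefore proposes a method for forecasting the real estate sales price index with an artificial neural network: the input data are first preprocessed, then a multilayer feedforward network trained with the backpropagation (BP) algorithm is used to study the application of artificial neural networks to forecasting the index, and finally it is concluded that the neural network method achieves relatively high forecasting accuracy.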