
Journal of University of Chinese Academy of Sciences ›› 2020, Vol. 37 ›› Issue (4): 553-561. DOI: 10.7523/j.issn.2095-6134.2020.04.016

• Computer Science •

Intrusion detection method based on entity embedding and long short-term memory networks

LAI Xunfei1,2,3,4, LIANG Xuwen2,3,4, XIE Zhuochen3, LI Zongwang3,4

  1. Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China;
  2. School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China;
  3. Shanghai Engineering Center for Microsatellites, Chinese Academy of Sciences, Shanghai 201203, China;
  4. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2019-01-25 Revised: 2019-04-03 Published: 2020-07-15
  • Corresponding author: LAI Xunfei
  • Supported by:
    National Natural Science Foundation of China (91738201) and the Shanghai Sailing Program (17YF1418200)

Abstract: Network intrusion detection suffers from low accuracy and a high false negative rate when the categorical variables in intrusion data cannot be represented effectively. To address this, a detection method that combines entity embedding with a long short-term memory (LSTM) network is proposed. First, during data preprocessing, the numerical and categorical variables in the network feature data are separated; entity embedding maps each categorical variable into a Euclidean space to obtain a vector representation, and this vector is appended after the numerical data to form the input. The input is then fed into the LSTM network for training, and the parameters are updated by backpropagation through time, so that the optimal embedding vectors used as input features and a relatively optimal LSTM detection model are obtained together. Experiments on the NSL-KDD data set show that entity embedding is an effective way to handle categorical variables in network intrusion data, and that combining it with an LSTM network effectively improves the detection rate. For the handling of categorical variables during preprocessing, entity embedding raises detection accuracy by 1.44 percentage points and lowers the false negative rate by 2.99 percentage points compared with the traditional One-Hot encoding method.

Key words: entity embedding, LSTM, intrusion detection, categorical variables
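
The preprocessing step and the two reported metrics can be illustrated with a small sketch. It is not the authors' implementation: the category list, the embedding dimension, the function names, and the sample feature values are all assumptions chosen for illustration, and the LSTM and its training loop are omitted. The embedding table stays at its random initial values here, whereas in the method described above it would be learned jointly with the LSTM weights. The false negative rate is computed with the usual definition FN / (FN + TP), which is assumed rather than taken from the paper.

```python
import random

# Hypothetical toy vocabulary for a single categorical feature.
# The embedding dimension is an assumption, not a value from the paper.
PROTOCOLS = ["tcp", "udp", "icmp"]
EMBED_DIM = 2


def make_embedding_table(categories, dim, seed=0):
    """Randomly initialise one dim-sized vector per category.

    During training these vectors would be updated by backpropagation
    together with the network weights; this sketch never updates them.
    """
    rng = random.Random(seed)
    return {c: [rng.uniform(-0.05, 0.05) for _ in range(dim)] for c in categories}


def one_hot(value, categories):
    """Traditional One-Hot encoding: one slot per category."""
    return [1.0 if c == value else 0.0 for c in categories]


def build_input(numeric, categorical_value, table):
    """Append the category's embedding vector after the numeric features."""
    return list(numeric) + list(table[categorical_value])


def accuracy_and_fnr(y_true, y_pred):
    """Return (accuracy, false negative rate), with 1 = attack, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fnr


# Usage: three made-up numeric features plus one embedded categorical feature.
table = make_embedding_table(PROTOCOLS, EMBED_DIM)
sample = build_input([0.0, 181.0, 5450.0], "tcp", table)
print(len(sample))  # 3 numeric values + 2 embedding values = 5
```

With One-Hot encoding the same categorical feature would occupy one input slot per category, so the width grows with the vocabulary size; the embedding keeps it at a fixed, chosen dimension and lets training place related categories near each other.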
