HVLM: Exploring Human-Like Visual Cognition and Language-Memory Network for Visual Dialog
Institution:1. College of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China;2. School of Cyber Science and Engineering, Wuhan University, Wuhan 430079, China;3. School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China;4. Henan Key Laboratory of Cyberspace Situation Awareness, Zhengzhou 450001, China
Abstract:Visual dialog is a visual-language task in which an AI agent converses with humans about a given image. To generate appropriate answers to the sequence of questions in a dialog, the agent must understand both the comprehensive visual content of the image and the fine-grained textual context of the dialog. However, previous studies typically represented a whole image with object-level visual features, which capture only the local perspective of the image and ignore the importance of its global information. In this paper, we propose a novel model, the Human-Like Visual Cognition and Language-Memory Network for Visual Dialog (HVLM), which simulates the global and local dual-perspective cognition of the human visual system to understand an image comprehensively. HVLM consists of two key modules: Local-to-Global Graph Convolutional Visual Cognition (LG-GCVC) and Question-guided Language Topic Memory (T-Mem). Specifically, in the LG-GCVC module, we design question-guided dual-perspective reasoning that jointly learns visual content from both local and global perspectives through a simple spectral graph convolution network. Furthermore, in the T-Mem module, we design an iterative learning strategy that gradually enriches fine-grained textual context details via an attention mechanism. Experimental results on the benchmark datasets VisDial v1.0 and VisDial v0.9 demonstrate the superiority of the proposed model, which achieves competitive performance.
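To make the two ideas named in the abstract concrete, the following minimal NumPy sketch illustrates (a) a simple spectral graph convolution over object-region features and (b) question-guided fusion of a local (object-level) view with a global (whole-image) view. All names, shapes, and the fusion form are illustrative assumptions for exposition, not the authors' released implementation of LG-GCVC or T-Mem.

import numpy as np

def normalized_adjacency(adj):
    """Symmetrically normalize A + I, as used in simple spectral graph convolution."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def simple_spectral_gcn(node_feats, adj, weight, k=2):
    """Propagate node features k hops with S = D^-1/2 (A + I) D^-1/2, then apply one linear projection."""
    s = normalized_adjacency(adj)
    h = node_feats
    for _ in range(k):
        h = s @ h
    return h @ weight  # (num_nodes, out_dim)

def question_guided_fusion(question_vec, local_nodes, global_vec):
    """Attend over local object nodes with the question, then concatenate the global view."""
    scores = local_nodes @ question_vec              # (num_objects,)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    local_view = attn @ local_nodes                  # question-weighted local summary
    return np.concatenate([local_view, global_vec])  # dual-perspective representation

# Toy usage: 5 detected objects with 8-d features on a fully connected graph.
rng = np.random.default_rng(0)
obj_feats = rng.standard_normal((5, 8))
adj = np.ones((5, 5)) - np.eye(5)
w = rng.standard_normal((8, 8))
objects = simple_spectral_gcn(obj_feats, adj, w, k=2)
question = rng.standard_normal(8)
global_feat = obj_feats.mean(axis=0) @ w             # stand-in for a whole-image feature
fused = question_guided_fusion(question, objects, global_feat)
print(fused.shape)  # (16,)

The sketch keeps the propagation step parameter-free and folds all learnable weights into a single projection, mirroring the "simple" spectral graph convolution the abstract refers to; the paper's full module additionally conditions the reasoning on the question and iteratively refines the textual memory.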
Keywords:Visual-language understanding; Dual-perspective reasoning; Simple spectral graph convolution network
This article is indexed in databases including ScienceDirect.