End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration
Author:
Affiliation:

Author biography:

Corresponding author:

Song Jingkuan, jingkuan.song@gmail.com

CLC number:

Fund:

National Key R&D Program of China (2022YFC2009900, 2022YFC2009903); National Natural Science Foundation of China (62122018, 62020106008, 61772116, 61872064)

    Abstract:

    To date, Transformer-based pre-trained models have demonstrated powerful modality representation capabilities, driving a shift toward a fully end-to-end paradigm for multimodal downstream tasks such as image captioning, with better performance and faster inference. However, the grid features extracted by such pre-trained models lack regional visual information, which leads to inaccurate descriptions of object content. Thus, the applicability of pre-trained models to image captioning remains largely unexplored. To address this problem, this paper proposes a novel end-to-end image captioning method based on visual region aggregation and dual-level collaboration (VRADC). Specifically, to learn regional visual information, a visual region aggregation module groups grid features with similar semantics into compact visual region representations. A dual-level collaboration module then uses cross-attention to learn more representative semantic information from the two kinds of visual features, which in turn guides the model to generate more fine-grained captions. Experimental results on the MSCOCO and Flickr30K datasets show that the proposed VRADC significantly improves the quality of generated captions and achieves state-of-the-art performance.
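The two modules described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the hard nearest-centroid assignment, the random "learned" centroids, the single attention head, and the simple averaging fusion are all assumptions standing in for learned components; only the overall flow (grid features → aggregated region features → cross-attention over both levels) follows the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_regions(grid, centroids):
    """Visual region aggregation (sketch): assign each grid feature to its
    most similar centroid by cosine similarity, then mean-pool each group
    into a compact region representation."""
    g = grid / np.linalg.norm(grid, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    assign = (g @ c.T).argmax(axis=1)            # (N,) hard assignment
    regions = [grid[assign == k].mean(axis=0)
               for k in range(centroids.shape[0])
               if (assign == k).any()]           # skip empty slots
    return np.stack(regions)                     # (K', d), K' <= K

def cross_attention(query, keys):
    """Single-head scaled dot-product cross-attention (values = keys)."""
    d = query.shape[-1]
    attn = softmax(query @ keys.T / np.sqrt(d))
    return attn @ keys

def dual_level_collaborate(query, grid, regions):
    """Dual-level collaboration (sketch): attend to grid-level and
    region-level features separately, then fuse the two contexts.
    A plain average stands in for a learned gate."""
    ctx_grid = cross_attention(query, grid)
    ctx_region = cross_attention(query, regions)
    return 0.5 * (ctx_grid + ctx_region)

rng = np.random.default_rng(0)
grid = rng.normal(size=(49, 8))       # e.g. 7x7 grid features from a backbone
centroids = rng.normal(size=(5, 8))   # K=5 region slots (learned in practice)
regions = aggregate_regions(grid, centroids)
word_query = rng.normal(size=(1, 8))  # one decoder word state
ctx = dual_level_collaborate(word_query, grid, regions)
print(regions.shape, ctx.shape)
```

The sketch shows why the region branch is "compact": 49 grid vectors collapse into at most 5 region vectors, and the decoder query draws on both granularities when forming its context.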

Cite this article:

Song Jingkuan, Zeng Pengpeng, Gu Jiayang, Zhu Jinkuan, Gao Lianli. End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration. Journal of Software, 2023, (5).

History
  • Received: 2022-04-18
  • Revised: 2022-08-03
  • Accepted:
  • Online: 2022-09-20
  • Published:
Copyright: Institute of Software, Chinese Academy of Sciences
Address: 4 South Fourth Street, Zhongguancun, Haidian District, Beijing 100190
Tel: 010-62562563 Fax: 010-62562533 Email: jos@iscas.ac.cn