Received: May 11, 2020; Revised: June 26, 2020
Abstract: The multimedia world in which human beings live is built from a large number of contents of different modalities, and the information carried by these modalities is highly correlated and complementary. The main purpose of multimodal representation learning is to mine the commonness and characteristics of different modalities and to produce latent vectors that can represent multimodal information. This article surveys the research on the currently widely studied visual-language representation, covering both traditional methods based on similarity models and the current mainstream pre-training methods based on language models. The currently favored approach is to semanticize visual features and then generate joint representations with textual features through a powerful feature extractor; the Transformer is now the mainstream network architecture across representation learning tasks. This article elaborates on the research background, the division of different lines of work, evaluation methods, and future development trends.
Keywords: multimodal representation learning; representation learning; multimodal machine learning; deep learning
Foundation items: National Natural Science Foundation of China (U1836215)
Reference text:
DU Peng-Fei, LI Xiao-Yong, GAO Ya-Li. Survey on Multimodal Visual Language Representation Learning. Journal of Software, 2021, 32(2): 327-348.
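To make the idea described in the abstract concrete, the sketch below shows, in a minimal and illustrative way, how visual region features can be projected ("semanticized") into the same embedding space as text tokens and fused by a Transformer encoder into a joint representation. It is not the method of any particular surveyed paper; all dimensions, the module name ToyVisionLanguageEncoder, and the toy vocabulary size are assumptions chosen for illustration.

```python
# Minimal sketch of visual-language fusion with a Transformer encoder.
# Assumed setup: text tokens as integer ids, visual inputs as pre-extracted
# region features (e.g., 36 detector regions of dimension 2048 per image).
import torch
import torch.nn as nn

class ToyVisionLanguageEncoder(nn.Module):
    def __init__(self, vocab_size=1000, visual_dim=2048, d_model=256,
                 nhead=4, num_layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> d_model
        self.visual_proj = nn.Linear(visual_dim, d_model)     # region features -> d_model
        self.type_embed = nn.Embedding(2, d_model)            # 0 = text, 1 = vision
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids, region_feats):
        # token_ids:    (batch, text_len) integer word ids
        # region_feats: (batch, num_regions, visual_dim) detector outputs
        txt = self.text_embed(token_ids)
        vis = self.visual_proj(region_feats)
        # Add modality-type embeddings so the model can tell text from vision.
        txt = txt + self.type_embed(torch.zeros_like(token_ids))
        vis = vis + self.type_embed(
            torch.ones(region_feats.shape[:2], dtype=torch.long,
                       device=region_feats.device))
        # Concatenate both modalities and let self-attention fuse them.
        return self.encoder(torch.cat([txt, vis], dim=1))

# Usage: joint multimodal representations for a batch of 2 image-text pairs.
model = ToyVisionLanguageEncoder()
tokens = torch.randint(0, 1000, (2, 8))   # 8 text tokens per example
regions = torch.randn(2, 36, 2048)        # 36 visual regions per example
joint = model(tokens, regions)            # shape: (2, 8 + 36, 256)
```

In the pre-training methods discussed in the survey, such a joint encoder is typically trained with objectives like masked language modeling and image-text matching; the sketch only shows the fusion step itself.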