Abstract: In recent years, deep learning has achieved excellent performance in unimodal areas such as computer vision (CV) and natural language processing (NLP). As the technology has matured, the importance and necessity of multimodal learning have become evident. Vision-language learning, an important branch of multimodal learning, has attracted considerable attention from researchers worldwide. Thanks to the development of the Transformer framework, a growing number of pre-trained models have been applied to vision-language multimodal learning, and the performance of related tasks has improved dramatically. In this paper, we systematically review current work on vision-language pre-trained models. First, we introduce background knowledge on pre-trained models. Second, we analyze and compare the architectures of pre-trained models from two different perspectives, discuss commonly used vision-language pre-training techniques, and describe five categories of downstream tasks in detail. Finally, we introduce the datasets commonly used for image and video pre-training tasks, and compare and analyze the performance of commonly used pre-trained models on different datasets across different tasks.