Abstract: As a cross-modal understanding task, video question answering (VideoQA) requires interaction between the semantic information of different modalities to generate an answer to a question about a given video. In recent years, graph neural networks have made remarkable progress on VideoQA tasks owing to their powerful capabilities in cross-modal information fusion and inference. However, most existing graph neural network approaches fail to further improve the performance of VideoQA models because of inherent deficiencies: overfitting or over-smoothing, weak robustness, and weak generalization. Motivated by the effectiveness and robustness of self-supervised contrastive learning in pre-training, this study proposes GMC, a self-supervised graph contrastive learning framework for VideoQA built on the idea of graph data augmentation. The framework applies two independent augmentation operations, one over nodes and one over edges, to generate dissimilar subsamples, and enforces consistency between the prediction distributions of the original graph and the augmented subsamples, thereby improving the accuracy and robustness of VideoQA models. The effectiveness of the proposed framework is verified through experimental comparisons with existing state-of-the-art VideoQA models and with different GMC variants on a public VideoQA dataset.
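
To make the augmentation-plus-consistency idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: it assumes a model mapping (node features, dense adjacency) to batched answer logits, and the names `node_drop`, `edge_drop`, and `consistency_loss` are ours. It shows one plausible form of the two independent augmentations and a KL-based consistency term between the original and augmented prediction distributions; GMC's actual operators and loss may differ.

```python
# Hypothetical sketch of node/edge dropping and a prediction-consistency
# loss; assumes `model(x, adj)` returns [batch, num_answers] logits.
import torch
import torch.nn.functional as F

def node_drop(x: torch.Tensor, adj: torch.Tensor, p: float = 0.2):
    """Randomly mask a fraction p of nodes and their incident edges."""
    keep = (torch.rand(x.size(0), device=x.device) > p).float()
    x_aug = x * keep.unsqueeze(-1)                         # zero node features
    adj_aug = adj * keep.unsqueeze(0) * keep.unsqueeze(1)  # drop incident edges
    return x_aug, adj_aug

def edge_drop(x: torch.Tensor, adj: torch.Tensor, p: float = 0.2):
    """Randomly remove a fraction p of edges; node features stay intact."""
    mask = (torch.rand_like(adj) > p).float()
    return x, adj * mask

def consistency_loss(model, x, adj):
    """KL divergence between predictions on the original graph and on two
    independently augmented subsamples (one per augmentation operator)."""
    log_p_orig = F.log_softmax(model(x, adj), dim=-1)
    loss = 0.0
    for aug in (node_drop, edge_drop):
        x_a, adj_a = aug(x, adj)
        p_aug = F.softmax(model(x_a, adj_a), dim=-1)
        loss = loss + F.kl_div(log_p_orig, p_aug, reduction="batchmean")
    return loss
```

In training, such a consistency term would typically be added to the supervised answer-classification loss with a weighting coefficient, so that the model is penalized when the augmented subsamples yield prediction distributions that diverge from the original graph's.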