Abstract: Convolutional neural networks (CNNs) have achieved successive performance breakthroughs in image forgery detection, but in realistic scenarios where the tampering technique is unknown, existing methods still fail to capture the long-range dependencies of the input image needed to mitigate recognition bias, which limits detection accuracy. In addition, because pixel-level annotation is laborious, image forgery detection tasks typically lack accurate pixel-level labels. To address these problems, this paper proposes a pretraining-driven multimodal boundary-aware visual transformer. To capture subtle forgery traces that are invisible in the RGB domain, the method first introduces a frequency-domain modality of the image and combines it with the RGB spatial domain to form a multimodal embedding. Second, the encoder of the backbone network is pretrained on ImageNet to alleviate the shortage of training samples. A transformer module is then integrated at the tail of this encoder to capture both low-level spatial details and global context, thereby improving the overall representation ability of the model. Finally, to mitigate the localization difficulty caused by the blurred boundaries of forged regions, this paper designs a boundary-awareness module that uses the noise distribution produced by a Scharr convolutional layer to attend to noise information rather than semantic content, and employs a boundary residual block to sharpen boundary information, thereby improving the model's boundary segmentation performance. Extensive experimental results show that the proposed method outperforms existing image forgery detection methods in recognition accuracy and exhibits better generalization and robustness across different forgery techniques.
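The abstract mentions a Scharr convolutional layer whose output noise distribution drives the boundary-awareness module. As a minimal illustrative sketch only (not the authors' implementation), the following PyTorch snippet shows how fixed Scharr kernels can be applied depthwise to an RGB tensor to obtain such a noise/edge map; the module name ScharrNoiseExtractor and all tensor shapes are assumptions made for illustration.

```python
# Minimal sketch, assuming a PyTorch setting; names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScharrNoiseExtractor(nn.Module):
    """Applies fixed horizontal/vertical Scharr kernels depthwise and
    returns a gradient-magnitude map that emphasizes noise/edge cues."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        gx = torch.tensor([[-3., 0., 3.],
                           [-10., 0., 10.],
                           [-3., 0., 3.]])
        gy = gx.t()
        # One (gx, gy) kernel pair per input channel, applied as grouped conv.
        kernel = torch.stack([gx, gy]).unsqueeze(1)      # (2, 1, 3, 3)
        kernel = kernel.repeat(in_channels, 1, 1, 1)     # (2*C, 1, 3, 3)
        self.register_buffer("kernel", kernel)
        self.groups = in_channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> per-channel Scharr gradient magnitude (B, C, H, W)
        g = F.conv2d(x, self.kernel, padding=1, groups=self.groups)
        gx, gy = g[:, 0::2], g[:, 1::2]
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

if __name__ == "__main__":
    extractor = ScharrNoiseExtractor()
    noise_map = extractor(torch.randn(1, 3, 224, 224))
    print(noise_map.shape)  # torch.Size([1, 3, 224, 224])
```

Because the kernels are registered as a buffer rather than parameters, they stay fixed during training, which matches the idea of steering attention toward low-level noise statistics instead of learnable semantic content.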