Abstract: In large-scale image retrieval, deep hashing models typically rely on large amounts of manually annotated data for training, but the high cost of manual annotation limits their practical application. To reduce this dependency, existing studies use texts provided by web users as weak supervision, guiding the model to mine semantic information from images that is associated with the texts. However, the inherent noise in user tags often degrades model performance. Multimodal pre-trained models such as CLIP exhibit strong image-text alignment capabilities. Motivated by this, this study leverages CLIP to refine user tags and proposes a weakly supervised hashing method called CLIP-guided tag refinement hashing (CTRH). The proposed method consists of three key components: a tag replacement module, a tag weighting module, and a tag-balanced loss function. The tag replacement module fine-tunes CLIP to mine tags potentially relevant to each image. The tag weighting module performs cross-modal global semantic interaction between the refined text and images to learn discriminative joint representations. To address the imbalance of user tags, the tag-balanced loss dynamically reweights hard samples to strengthen the model's representation learning. Experiments on two benchmark datasets, MirFlickr and NUS-WIDE, verify the effectiveness of the proposed method against state-of-the-art approaches.