Weakly Supervised Hashing for Image Retrieval via CLIP-Guided Tag Refinement

doi:10.13328/j.cnki.jos.007543

Funding: National Natural Science Foundation of China (62425603, 62372233); Basic Research Program of Jiangsu Province, Climbing Project (BK20240011)

Abstract:

In large-scale image retrieval, image hashing typically relies on large amounts of manually annotated data to train deep hashing models, but the high cost of manual annotation limits its practical application. To reduce this dependency, existing studies use text provided by web users as weak supervision, guiding the model to mine semantic information from images that is associated with the text. However, user tags are commonly noisy, which limits the performance of these methods. Multimodal pre-trained foundation models such as CLIP have strong image-text alignment capability. Inspired by this, we use CLIP to refine noisy user tags and propose a weakly supervised hashing method, CLIP-guided Tag Refinement Hashing (CTRH). The method has three main components: a tag replacement module, a tag weighting module, and a tag-balanced loss function. The tag replacement module fine-tunes CLIP to discover latent tags associated with each image. The tag weighting module performs cross-modal global semantic interaction between the refined text and the images to learn discriminative joint representations. To address the imbalanced distribution of user tags, the tag-balanced loss dynamically reweights training samples to strengthen representation learning on hard samples. Experiments on two benchmark datasets, MirFlickr and NUS-WIDE, show that the proposed method consistently outperforms state-of-the-art approaches, validating its effectiveness.
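The idea of using image-text alignment to judge user tags can be illustrated with a minimal sketch. It is not the CTRH tag replacement module, whose details the abstract does not give. It only assumes that an image and each candidate tag have already been mapped into a shared embedding space (as CLIP-style encoders do) and ranks tags by cosine similarity. The function names `cosine` and `rank_tags`, the toy 3-dimensional vectors, and the `top_k` cutoff are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_tags(image_emb, tag_embs, top_k=2):
    """Return the top_k (tag, score) pairs, most image-relevant first."""
    scored = [(tag, cosine(image_emb, emb)) for tag, emb in tag_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings standing in for encoder outputs (hypothetical values).
tag_embs = {
    "dog": [1.0, 0.0, 0.0],
    "beach": [0.0, 1.0, 0.0],
    "2009": [0.0, 0.0, 1.0],  # a noisy user tag unrelated to image content
}
image_emb = [0.9, 0.4, 0.0]

kept = rank_tags(image_emb, tag_embs, top_k=2)
```

In this toy setup the unrelated tag receives the lowest score and falls outside the kept set. A real pipeline would obtain the embeddings from a pre-trained or fine-tuned image-text model rather than from hand-written vectors.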

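Cite this article

Li Zechao, Jin Lu, Wang Haohua, Tang Jinhui. Weakly Supervised Hashing for Image Retrieval via CLIP-Guided Tag Refinement. Journal of Software, 2026, 37(5):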

History

Received: 2025-05-26
Revised: 2025-07-11
Published online: 2025-09-23
