Abstract: Text-based person search aims to find a target person given a text description, and it has attracted attention from researchers in both academia and industry. It faces two challenges: fine-grained retrieval and the heterogeneous gap between images and texts. Some methods use supervised attribute learning to obtain attribute-related features and build fine-grained, cross-modal semantic associations. However, attribute annotations are hard to obtain, making these methods difficult to apply in practice. How to explore attribute-related features without attribute annotations, so as to establish fine-grained, cross-modal semantic associations, therefore becomes a key problem. To address this issue, we incorporate pre-trained models and propose a text-based person search approach via virtual attribute learning, which associates images and texts at a fine-grained level through unsupervised attribute learning. First, based on the invariance and cross-modal consistency of pedestrian attributes, we propose a semantics-guided attribute decoupling method, which uses identity labels as supervision to automatically decouple attribute-related features. Second, we propose a feature learning via semantic reasoning module, which takes the learned attributes as nodes and the relations between attributes as edges to construct a semantic graph; information is exchanged among attributes over this graph to enhance the cross-modal discriminative ability of the features. Extensive experiments on the public text-based person search dataset CUHK-PEDES and the cross-modal retrieval dataset Flickr30k verify the effectiveness of the proposed approach.
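To make the graph-based semantic reasoning described above concrete, the following is a minimal, illustrative PyTorch sketch (not the authors' implementation): attribute-related features serve as graph nodes, pairwise affinities are assumed to act as edge weights, and one round of message passing exchanges information among attributes. The module name, dimensions, and affinity-based adjacency are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributeGraphReasoning(nn.Module):
    """Illustrative sketch: one round of message passing over K attribute nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)        # transform messages sent by neighbors
        self.update = nn.Linear(2 * dim, dim)  # fuse each node with its aggregated message

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (B, K, D) -- K attribute-related features per image or text sample
        affinity = torch.bmm(nodes, nodes.transpose(1, 2))  # (B, K, K) attribute relations as edges
        adj = F.softmax(affinity, dim=-1)                    # normalized edge weights
        messages = torch.bmm(adj, self.proj(nodes))          # aggregate information from neighbors
        out = self.update(torch.cat([nodes, messages], dim=-1))
        return F.relu(out) + nodes                           # residual connection keeps original features


# Usage example (hypothetical sizes): 8 samples, 6 attribute nodes, 256-d features
if __name__ == "__main__":
    layer = AttributeGraphReasoning(dim=256)
    feats = torch.randn(8, 6, 256)
    print(layer(feats).shape)  # torch.Size([8, 6, 256])
```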