ITSELF: Attention Guided Fine-Grained Alignment for Vision–Language Retrieval

University of Information Technology, VNU-HCM, Vietnam
WACV 2026

*Indicates Equal Contribution


Evolution of text-based person search paradigms. (a) Global matching methods use powerful MLLMs to synthesize extra training data. (b) Recent local implicit matching methods reason implicitly over relations among all local tokens. (c) Ours: ITSELF with GRAB, an attention-guided local branch that implicitly learns fine-grained, discriminative features for better alignment.

Abstract

Vision–language models (VLMs) have rapidly advanced and show strong promise for text-based person search (TBPS), a task that requires capturing fine-grained relationships between images and text to distinguish individuals. Previous methods address this challenge through local alignment, yet they are often prone to shortcut learning and spurious correlations, yielding misalignment. Moreover, injecting prior knowledge can distort intra-modality structure. To alleviate these issues, and motivated by our observation that encoder attention surfaces spatially precise evidence from the earliest training epochs, we introduce ITSELF, an attention-guided framework for implicit local alignment. At its core, Guided Representation with Attentive Bank (GRAB) converts the model’s own attention into an Attentive Bank of high-saliency tokens and applies local objectives on this bank, learning fine-grained correspondences without extra supervision. To make the selection reliable and non-redundant, we introduce Multi-Layer Attention for Robust Selection (MARS), which aggregates attention across layers and performs diversity-aware top-k selection; and Adaptive Token Scheduler (ATS), which schedules the retention budget from coarse to fine over training, preserving context early while progressively focusing on discriminative details. Extensive experiments on three widely used TBPS benchmarks show state-of-the-art performance and strong cross-dataset generalization, confirming the effectiveness and robustness of our approach without additional prior supervision.
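
To make the selection idea behind GRAB/MARS concrete, the following PyTorch-style sketch aggregates CLS-to-token attention across several encoder layers and greedily keeps a diverse top-k set of tokens. This is a minimal illustration under our own assumptions: the layer choice, head averaging, and cosine-redundancy threshold are not taken from the paper.

# Minimal sketch of multi-layer attention aggregation with diversity-aware
# top-k selection; details here are illustrative assumptions, not the
# paper's exact recipe.
import torch
import torch.nn.functional as F

def select_attentive_bank(attn_maps, tokens, k, sim_thresh=0.9):
    """attn_maps: list of [B, heads, N+1, N+1] attention tensors from chosen layers.
    tokens: [B, N, D] patch/word token embeddings (CLS/EOS excluded).
    Returns [B, k] indices of high-saliency, non-redundant tokens."""
    # Average CLS-to-token attention over heads, then over the chosen layers.
    saliency = torch.stack(
        [a.mean(dim=1)[:, 0, 1:] for a in attn_maps], dim=0
    ).mean(dim=0)                                   # [B, N]

    order = saliency.argsort(dim=-1, descending=True)
    feats = F.normalize(tokens, dim=-1)

    banks = []
    for b in range(tokens.size(0)):
        picked = []
        for idx in order[b].tolist():
            if len(picked) == k:
                break
            # Diversity check: skip tokens nearly identical to ones already kept.
            if picked and (feats[b, picked] @ feats[b, idx]).max() > sim_thresh:
                continue
            picked.append(idx)
        # Fallback: if the diversity filter was too strict, pad with next-best tokens.
        for idx in order[b].tolist():
            if len(picked) == k:
                break
            if idx not in picked:
                picked.append(idx)
        banks.append(torch.tensor(picked, device=tokens.device))
    return torch.stack(banks)                       # [B, k]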

Overview


Overview of our proposed ITSELF (an attention-guided implicit local alignment framework). The architecture features a dual-stream encoder for images (left) and text (right). At its core is the GRAB (Guided Representation with Attentive Bank) module, designed to learn fine-grained, discriminative cues. GRAB consists of MARS (Multi-Layer Attention for Robust Selection), which fuses attention across layers to select informative patches/tokens, and ATS (Adaptive Token Scheduler), which anneals token selection from coarse to fine during training. The model is optimized with a dual-loss strategy: a local loss L_local aligns guided local representations, and a global loss L_global matches overall embeddings. ITSELF reinforces global text–image alignment without additional supervision or inference-time cost.
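
For readers who prefer pseudocode, here is a schematic version of the dual-loss objective. It assumes symmetric InfoNCE losses at both the global and local levels, mean pooling over the attentive bank, and a weighting factor lam; none of these specifics are stated on this page, so treat it as a sketch rather than the paper's implementation.

# Schematic dual-loss step: global embeddings plus GRAB-selected local tokens.
# The InfoNCE form, mean pooling over the bank, and the weight `lam` are
# assumptions for illustration only.
import torch
import torch.nn.functional as F

def info_nce(img_feats, txt_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings [B, D]."""
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = img_feats @ txt_feats.t() / temperature        # [B, B]
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def itself_objective(img_global, txt_global, img_bank, txt_bank, lam=1.0):
    """img_bank / txt_bank: [B, k, D] attentive-bank tokens selected by GRAB."""
    loss_global = info_nce(img_global, txt_global)
    # Pool selected tokens into one local descriptor per sample (mean pooling
    # is one simple choice; the paper may align the bank differently).
    loss_local = info_nce(img_bank.mean(dim=1), txt_bank.mean(dim=1))
    return loss_global + lam * loss_local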

Quantitative Results

Comparison with SOTA methods. We compare our approach with recent methods on three benchmarks. Our model outperforms all CLIP-based competitors on every metric, setting new R@1 and mAP records on all datasets, including RSTPReid (+2.17% mAP). Gains extend to R@5 and R@10, showing broad retrieval improvements. Remarkably, we achieve this without ReID-domain pretraining, surpassing methods using larger backbones or extra resources. Notably, using only CLIP, we attain the top R@1 on ICFG-PEDES over all methods.

Domain Generalization. We test cross-domain performance by training on one dataset and evaluating on another. Our model consistently outperforms prior approaches across all six transfer settings (C→I, I→C, etc.), e.g., improving R@1 by over 2% in the challenging I→R setting. These results demonstrate robust text–image alignment and strong generalization across unseen domains.

Qualitative Results

Top-5 Retrieval Examples. Examples on the RSTPReid benchmark show that our method outperforms RDE. For queries such as "a man in a black and blue jacket" or "a boy in a patterned black jacket", our approach retrieves more correct matches in the top-5, demonstrating better alignment of fine-grained textual descriptions with images.

Attention Comparison. Grad-CAM visualizations reveal our model achieves sharper, query-focused attention on clothing and accessories, isolating target pedestrians in crowded scenes. RDE, by contrast, produces diffuse and sometimes irrelevant hotspots, highlighting our framework’s stronger attribute-level fidelity and reduced cross-identity confusion.

Top-K Token Selection Analysis. Our top-K token selection generates focused, semantically relevant attention maps. It successfully highlights regions corresponding to keywords like "white coat," "red boots," and "gray backpack," improving text-to-image grounding compared to the scattered baseline attention.

Ablation Study

Effectiveness of each component. We evaluate the contribution of each module in ITSELF across three datasets. The MARS module outperforms a fixed single-layer (SL) strategy, achieving R@1 gains of +2.24%, +3.01%, and +5.65%, showing the advantage of multi-layer attention. The ATS module further improves R@1, and combined with MARS, the full model achieves the strongest performance (+2.29%, +3.17%, +6.00%) by preserving discriminative cues and stabilizing optimization.

Layer Selection (MARS) & Discard Ratio (ATS). Multi-layer configurations in MARS consistently outperform single-layer baselines, with the Middle+Late (M+L) combination giving the best R@1 and mAP. Early-layer attention is sharply peaked with low semantic content, while middle layers capture broader context and late layers focus on discriminative, semantically rich regions.
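
As a rough sketch of how a coarse-to-fine retention budget could be scheduled (the actual ATS schedule and discard-ratio settings are not reproduced here), a simple linear anneal over training epochs looks like this:

# Illustrative coarse-to-fine token-retention schedule. A linear anneal is an
# assumption; ATS may use a different curve, granularity, or discard ratio.
def ats_keep_count(epoch, total_epochs, num_tokens,
                   start_keep_ratio=1.0, end_keep_ratio=0.3):
    """How many tokens to retain at a given epoch: early epochs keep most
    tokens (coarse context), later epochs keep only the most salient ones."""
    progress = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    ratio = start_keep_ratio + (end_keep_ratio - start_keep_ratio) * progress
    return max(1, int(round(ratio * num_tokens)))

# Example: with 196 patch tokens over 60 epochs, the budget shrinks from 196
# tokens at epoch 0 to roughly 59 at the final epoch.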

BibTeX

@inproceedings{Nguyen_2026_WACV,
  title={ITSELF: Attention Guided Fine-Grained Alignment for Vision–Language Retrieval},
  author={Nguyen, Tien-Huy and Tran, Huu-Loc and Ngo, Thanh Duc},
  booktitle={Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  pages={},
  year={2026}
}