Cross-modal Retrieval from Coarse-grained to Fine-grained Perspectives: A Survey
Abstract
Cross-modal Retrieval (CMR) has become a fundamental technique in multimedia understanding and recommendation systems, enabling information retrieval across heterogeneous modalities such as images, videos, and text. While several prior surveys have reviewed the progress of CMR, they are limited by outdated taxonomies and insufficient coverage of recent developments. In particular, most surveys focus on coarse-grained retrieval, which retrieves entire instances given a query, while neglecting fine-grained tasks that require retrieving only a specific part of an instance, such as a region within an image or a segment within a video, or retrieving at a finer semantic level to distinguish different subcategories. Moreover, owing to the rapid development of large-scale vision-language pre-training (VLP) and multimodal large language models (MLLMs), many existing surveys fail to capture the impact of these transformative advances on CMR. To address these gaps, this survey provides a unified taxonomy that categorizes CMR into coarse-grained cross-modal retrieval (CCMR) and fine-grained cross-modal retrieval (FCMR). CCMR aims to retrieve a whole instance given a query, as in image-text and video-text retrieval. FCMR aims either to distinguish and retrieve specific subordinate-level fine-grained categories within a superclass, or to retrieve a part of an instance, as in image grounding and video temporal grounding. Considering both types together broadens the view of CMR, bridges the gap between disparate tasks, and offers a comprehensive overview of the field. We review the major methodological paradigms, including recent VLP-based and MLLM-based approaches, and summarize widely used datasets and evaluation protocols. Beyond systematic performance comparisons, we also discuss applications and insights for future research.