

Cross-Modal Retrieval from Coarse-Grained to Fine-Grained Perspectives: A Survey

  • Abstract:
    Figures/Tables: Figure 1: A unified taxonomy of cross-modal retrieval tasks, from coarse-grained to fine-grained perspectives. Table 1: Comparison between this survey and existing surveys.
    Background: With the explosive growth of multimedia data, cross-modal retrieval has become a fundamental technique in multimedia understanding and recommendation systems, aiming to retrieve information across heterogeneous modalities such as images, videos, and text. Although prior surveys have reviewed progress in cross-modal retrieval, most are constrained by outdated taxonomies and insufficient coverage of recent developments. On the one hand, existing surveys tend to focus on coarse-grained retrieval tasks (i.e., retrieving entire image or video instances given a query) while largely overlooking fine-grained retrieval needs, which require not only distinguishing subtler semantic subcategories but also localizing and retrieving specific parts of an instance (such as a region within an image or a segment within a video). On the other hand, the rapid development of visual-language pretraining (VLP) and multimodal large language models (MLLMs) is profoundly reshaping the research paradigm of cross-modal retrieval, yet existing surveys generally fail to systematically review and assess these advances.
    Objective: To fill these gaps, this survey provides a unified taxonomy that divides cross-modal retrieval into coarse-grained and fine-grained cross-modal retrieval. By jointly analyzing and contrasting these two families of tasks, it aims to offer a broader perspective on cross-modal retrieval research, bridge the gap between disparate tasks, and comprehensively summarize the latest progress in the field. Beyond a systematic methodological review, the survey further analyzes how large-model techniques such as VLP and MLLMs are reshaping the research paradigm.
    Methods: The survey first establishes a taxonomy based on retrieval granularity: (1) coarse-grained cross-modal retrieval covers image-text retrieval and video-text retrieval, which aim to retrieve entire instances; (2) fine-grained cross-modal retrieval covers fine-grained subcategory retrieval (distinguishing subordinate-level fine-grained categories), image grounding, and video temporal grounding (localizing specific parts of an instance). Under this taxonomy, the survey systematically reviews the major methodological paradigms of each task, tracing the progression from traditional non-pretraining methods to the latest VLP-based and MLLM-based approaches. It also summarizes the datasets and evaluation benchmarks widely used in each task and compares the performance of different methods.
    Results: By learning general-purpose cross-modal representations through large-scale pretraining, VLP models have become mainstream and achieve leading performance in both coarse-grained and fine-grained cross-modal retrieval. For fine-grained retrieval tasks that target specific parts of an instance, MLLMs exhibit excellent zero-shot performance thanks to their strong instruction-following and reasoning abilities, with a particularly clear advantage in understanding complex queries. However, because MLLMs typically adopt computationally intensive early-fusion architectures, their application to coarse-grained retrieval over massive databases remains limited; they are more practical for fine-grained image grounding and video temporal grounding, where the search space is bounded. In addition, the survey reveals intrinsic methodological connections between coarse-grained and fine-grained retrieval: coarse-grained retrieval can improve precision by incorporating fine-grained modeling; fine-grained subcategory retrieval can use coarse-grained retrieval to determine the super-class and narrow the search scope; and two-stage methods for fine-grained image grounding and video temporal grounding generate candidate regions, thereby reducing the problem to coarse-grained retrieval.
    Conclusions: This survey proposes a new classification perspective based on retrieval granularity, systematically summarizes cross-modal retrieval tasks, reveals the connections and differences between coarse-grained and fine-grained retrieval, and analyzes how large-model techniques such as VLP and MLLMs are reshaping the research paradigm of the field. It concludes with future research directions, including hierarchical, compositional, and open-vocabulary fine-grained subcategory retrieval; building unified multimodal retrieval foundation models; shifting toward data-centric research paradigms; developing user-centric interactive retrieval systems; and exploring retrieval applications in emerging modalities such as 3D, audio, and embodied AI.
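The efficiency contrast drawn above (late-fusion VLP dual encoders scale to massive databases, early-fusion MLLMs do not) can be illustrated with a minimal sketch. This is not any specific model from the survey: the embeddings below are random stand-ins, and in a real system a trained image encoder and text encoder (e.g., a CLIP-style dual encoder) would produce them. The point is that once gallery items are embedded offline, retrieval reduces to cosine similarity plus top-k, with no per-pair joint forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Gallery side: embed every image once, offline (stand-in random vectors here).
gallery_embeddings = l2_normalize(rng.normal(size=(1000, 512)))

# Query side: embed the text query at search time.
query_embedding = l2_normalize(rng.normal(size=(512,)))

# Late fusion: a single matrix-vector product scores the whole gallery,
# unlike early-fusion MLLMs, which would have to jointly process every
# (query, candidate) pair through the full model.
scores = gallery_embeddings @ query_embedding
top_k = np.argsort(-scores)[:5]  # indices of the 5 best-matching images
print(top_k)
```

In practice the brute-force matrix product would be replaced by an approximate nearest-neighbor index for very large galleries, but the separation of offline gallery encoding from online query encoding is the property that makes dual encoders suitable for coarse-grained retrieval at scale.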


    Abstract: Cross-modal retrieval (CMR) has become a fundamental technique in multimedia understanding and recommendation systems, enabling information retrieval across heterogeneous modalities such as images, videos, and text. While several prior surveys have reviewed the progress of CMR, they are limited by outdated taxonomies and insufficient coverage of recent developments. In particular, most surveys focus on coarse-grained retrieval, which retrieves entire instances given a query, while neglecting fine-grained tasks that operate at a finer semantic level to distinguish different subcategories or retrieve only a specific part of an instance, such as a region within an image or a segment within a video. Moreover, owing to the rapid development of large-scale vision-language pre-training (VLP) models and multimodal large language models (MLLMs), many existing surveys fail to capture the impact of these transformative advances on CMR. To address these gaps, in this survey we provide a unified taxonomy, categorizing CMR into coarse-grained cross-modal retrieval (CCMR) and fine-grained cross-modal retrieval (FCMR). CCMR aims to retrieve the whole instance for a given query, as in image-text and video-text retrieval. FCMR aims to distinguish and retrieve specific subordinate-level fine-grained categories within a super-class, or to retrieve a part of an instance, as in image grounding and video temporal grounding. Considering both types together brings a broader view to CMR, bridges the gap between disparate tasks, and offers a comprehensive overview of the field. We review the major methodological paradigms, including recent VLP-based and MLLM-based approaches, and summarize widely used datasets and evaluation protocols. Beyond systematic performance comparisons, we discuss applications and insights for future research.
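Among the evaluation protocols the abstract mentions, coarse-grained retrieval benchmarks are most commonly reported with Recall@K: the fraction of queries whose ground-truth item appears in the top-K ranked results. A minimal sketch (the toy ranking data below is invented for illustration):

```python
import numpy as np

def recall_at_k(ranked_indices, ground_truth, k):
    """Recall@K for retrieval evaluation.

    ranked_indices: (num_queries, num_gallery) gallery ids, best match first.
    ground_truth:   (num_queries,) correct gallery id for each query.
    Returns the fraction of queries whose correct item is in the top K.
    """
    hits = [gt in ranking[:k] for ranking, gt in zip(ranked_indices, ground_truth)]
    return float(np.mean(hits))

# Toy example: 3 queries ranked against a 4-item gallery.
ranked = np.array([[2, 0, 1, 3],
                   [1, 3, 0, 2],
                   [0, 2, 3, 1]])
truth = np.array([2, 0, 3])

print(recall_at_k(ranked, truth, 1))  # only the first query hits at rank 1
print(recall_at_k(ranked, truth, 3))  # all three ground truths are in the top 3
```

Benchmarks typically report R@1, R@5, and R@10 in both retrieval directions (text-to-image and image-to-text), sometimes aggregated as their sum (rsum).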


