Xue-Yang Qin, Li-Shuang Li, Jing-Yao Tang, Fei Hao, Mei-Ling Ge, Guang-Yao Pang. Multi-task Visual Semantic Embedding Network for image-text retrieval[J]. Journal of Computer Science and Technology. DOI: 10.1007/s11390-024-4125-1

Multi-task Visual Semantic Embedding Network for image-text retrieval

  • Image-text retrieval aims to capture the semantic correspondence between images and texts, and it serves as a foundation and crucial component of multi-modal recommendation, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the beneficial impact of multi-task learning on image-text retrieval. To this end, a Multi-task Visual Semantic Embedding Network (MVSEN) is proposed for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints that improve the generalization and robustness of visual semantic embedding from a training perspective. In addition, we present an intra- and inter-modality interaction scheme that learns discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MSCOCO, with rSum improvements of 8.2% and 3.0%, respectively.
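To make the multi-task objective concrete, the following is a minimal sketch (not the authors' released code) of how the main image-text matching loss could be combined with the two auxiliary objectives named in the abstract. The hinge-based triplet formulation, loss weights, embedding dimension, label space, and classification head are all assumptions made for illustration; the cascaded graph convolutional inference and the intra-/inter-modality interaction modules are not shown.

```python
# Illustrative sketch only: one plausible multi-task objective combining
# image-text matching with text-text matching and multi-label classification.
import torch
import torch.nn as nn
import torch.nn.functional as F


def triplet_ranking_loss(a_emb, b_emb, margin=0.2):
    """Bidirectional hinge loss with in-batch hardest negatives.

    Both inputs are assumed L2-normalized, so the dot product is cosine similarity.
    """
    scores = a_emb @ b_emb.t()                       # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)                  # matched-pair similarities
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_b = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)      # a -> hardest negative b
    cost_a = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # b -> hardest negative a
    return cost_b.max(dim=1)[0].mean() + cost_a.max(dim=0)[0].mean()


class MultiTaskObjective(nn.Module):
    """Weighted sum of the main task and the two auxiliary tasks (weights are hypothetical)."""

    def __init__(self, emb_dim=1024, num_labels=1000, w_ttm=0.5, w_cls=0.5):
        super().__init__()
        self.cls_head = nn.Linear(emb_dim, num_labels)  # multi-label classification head
        self.w_ttm, self.w_cls = w_ttm, w_cls

    def forward(self, img_emb, txt_emb, txt_emb_alt, labels):
        # Main task: image-text matching in the joint embedding space.
        l_itm = triplet_ranking_loss(img_emb, txt_emb)
        # Auxiliary task 1: text-text matching between two captions of the same image.
        l_ttm = triplet_ranking_loss(txt_emb, txt_emb_alt)
        # Auxiliary task 2: multi-label classification (labels is a multi-hot float tensor).
        logits = self.cls_head(F.normalize(img_emb + txt_emb, dim=-1))
        l_cls = F.binary_cross_entropy_with_logits(logits, labels)
        return l_itm + self.w_ttm * l_ttm + self.w_cls * l_cls
```

Under this sketch, the three losses would be computed on batch embeddings produced by the visual and textual encoders and back-propagated jointly as a single training objective.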
