Abstract
Vision-language models (VLMs) have shown strong open-vocabulary learning abilities in various video understanding tasks. However, existing open-vocabulary temporal action detection (OV-TAD) methods generalize poorly to unseen action categories because they rely primarily on visual features. In this paper, we propose a novel framework, concept-guided semantic projection (CSP), to enhance the generalization ability of OV-TAD methods. By projecting video features into a unified action concept space, CSP detects actions through abstracted action concepts rather than relying solely on visual details. To further improve feature consistency across action categories, we introduce a mutual contrastive loss (MCL) that ensures semantic coherence and better feature discrimination. Extensive experiments on the ActivityNet and THUMOS14 benchmarks demonstrate that our method outperforms state-of-the-art OV-TAD methods.
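For intuition, the following minimal PyTorch sketch illustrates the two ideas named above: projecting snippet-level video features into a space spanned by text-derived action-concept embeddings, and a symmetric (mutual) video-text contrastive loss. The module and function names, feature dimensions, and the InfoNCE-style form of the loss are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptProjector(nn.Module):
    """Illustrative sketch: map snippet-level video features into a shared
    action-concept space spanned by text (concept) embeddings."""

    def __init__(self, video_dim=1024, concept_dim=512):
        super().__init__()
        self.proj = nn.Linear(video_dim, concept_dim)

    def forward(self, video_feats, concept_embeds):
        # video_feats:    (B, T, video_dim)  snippet-level visual features
        # concept_embeds: (K, concept_dim)   text embeddings of K action concepts
        v = F.normalize(self.proj(video_feats), dim=-1)   # (B, T, D)
        c = F.normalize(concept_embeds, dim=-1)           # (K, D)
        # Similarity of each snippet to each concept, usable as detection logits.
        return torch.einsum("btd,kd->btk", v, c)


def mutual_contrastive_loss(video_embeds, text_embeds, temperature=0.07):
    """Symmetric (video-to-text and text-to-video) InfoNCE-style loss,
    an assumed stand-in for the mutual contrastive objective (MCL)."""
    v = F.normalize(video_embeds, dim=-1)   # (N, D) pooled clip embeddings
    t = F.normalize(text_embeds, dim=-1)    # (N, D) matching concept embeddings
    logits = v @ t.T / temperature          # (N, N) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```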