Abstract
Vision-language models (VLMs) have shown strong open-vocabulary learning abilities in various video understanding tasks. However, existing open-vocabulary temporal action detection (OV-TAD) methods generalize poorly to unseen action categories because they rely primarily on visual features. In this paper, we propose a novel framework, concept-guided semantic projection (CSP), to enhance the generalization ability of OV-TAD methods. By projecting video features into a unified action concept space, CSP detects actions through abstracted action concepts rather than relying solely on visual details. To further improve feature consistency across action categories, we introduce a mutual contrastive loss (MCL) that ensures semantic coherence and better feature discrimination. Extensive experiments on the ActivityNet and THUMOS14 benchmarks demonstrate that our method outperforms state-of-the-art OV-TAD methods.
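For intuition, the following minimal PyTorch sketch illustrates the two ideas named above: projecting snippet-level video features into a space spanned by text-derived action-concept embeddings, and a symmetric (mutual) video-text contrastive loss. The module and function names, feature dimensions, and the InfoNCE-style form of the loss are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptProjector(nn.Module):
    """Illustrative sketch: map snippet-level video features into a shared
    action-concept space spanned by text (concept) embeddings."""

    def __init__(self, video_dim=1024, concept_dim=512):
        super().__init__()
        self.proj = nn.Linear(video_dim, concept_dim)

    def forward(self, video_feats, concept_embeds):
        # video_feats:    (B, T, video_dim)  snippet-level visual features
        # concept_embeds: (K, concept_dim)   text embeddings of K action concepts
        v = F.normalize(self.proj(video_feats), dim=-1)   # (B, T, D)
        c = F.normalize(concept_embeds, dim=-1)           # (K, D)
        # Similarity of each snippet to each concept, usable as detection logits.
        return torch.einsum("btd,kd->btk", v, c)


def mutual_contrastive_loss(video_embeds, text_embeds, temperature=0.07):
    """Symmetric (video-to-text and text-to-video) InfoNCE-style loss,
    an assumed stand-in for the mutual contrastive objective (MCL)."""
    v = F.normalize(video_embeds, dim=-1)   # (N, D) pooled clip embeddings
    t = F.normalize(text_embeds, dim=-1)    # (N, D) matching concept embeddings
    logits = v @ t.T / temperature          # (N, N) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```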