Text-Parsing Adaptive Gated Fusion for Audio-Visual Question Answering
Abstract
Audio-visual question answering (AVQA) requires a model to answer questions about visual objects, sounds, and their interactions in videos by integrating multimodal information. A key challenge in AVQA is that irrelevant audio-visual content can obscure question-relevant signals, so accurate prediction depends on targeted perception and reasoning mechanisms. In this study, we propose a text-parsing adaptive gated fusion network (TP-Net), a framework that dynamically adjusts modality contributions based on question semantics. Unlike existing methods that treat all question types uniformly, TP-Net emphasizes modality-specific features during prediction, for example by increasing the weight of audio features for auditory-centric questions. The proposed model comprises two components: a spatiotemporal perception module, which selects informative temporal segments and refines audio-visual spatiotemporal correlations, and a modality-gated fusion module, which reweights multimodal features according to question type. Extensive evaluations on standard AVQA benchmarks demonstrate that TP-Net achieves state-of-the-art performance, particularly in scenarios requiring nuanced multimodal reasoning.
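
To make the idea of question-conditioned modality gating concrete, the following is a minimal sketch, not the TP-Net implementation. It assumes a question embedding is mapped to one logit per modality by a hypothetical linear gate (`gate_weights`), that the logits are normalized with a softmax, and that the fused feature is the gate-weighted sum of the audio and visual features. The function names, the two-dimensional toy embedding, and all numeric values are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(question_vec, audio_vec, visual_vec, gate_weights):
    """Fuse audio and visual features with question-dependent gates.

    gate_weights is a hypothetical 2 x d matrix (row 0: audio, row 1: visual)
    that maps the d-dimensional question embedding to one logit per modality.
    """
    logits = [
        sum(w * q for w, q in zip(row, question_vec)) for row in gate_weights
    ]
    g_audio, g_visual = softmax(logits)
    fused = [g_audio * a + g_visual * v for a, v in zip(audio_vec, visual_vec)]
    return fused, (g_audio, g_visual)


# Toy example: the first question dimension stands in for "auditory-centric".
question_audio = [1.0, 0.0]
gate_weights = [[2.0, -2.0], [-2.0, 2.0]]  # assumed values, not learned
audio_feat = [1.0, 1.0, 1.0]
visual_feat = [0.0, 0.0, 0.0]

fused, gates = gated_fusion(question_audio, audio_feat, visual_feat, gate_weights)
print("gates (audio, visual):", gates)
print("fused feature:", fused)
```

In this toy setting the audio gate dominates for the auditory-centric question, so the fused feature stays close to the audio feature. In a trained model the gate parameters would be learned jointly with the rest of the network rather than set by hand.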