Multimodal Agent AI: A Survey of Recent Advances and Future Directions

孙玉柱; 孙鹤立; 马健聪; 张鹏; 黄小勇

doi:10.1007/s11390-025-4802-8

Abstract

In recent years, multimodal agent AI (MAA) has emerged as a pivotal research area with the potential to transform human-machine interaction. Agent AI systems, which perceive and respond to inputs from multiple modalities (e.g., language, vision, audio), have made notable progress in understanding complex environments and executing complex tasks. This survey comprehensively reviews recent developments in MAA, covering its fundamental concepts, core techniques, and applications across domains. We first introduce the foundations of agent AI and its multimodal interaction capabilities. We then analyze the core technologies that enable agents to perform task planning, decision-making, and multi-sensory fusion, and examine applications of MAA in robotics, healthcare, gaming, and other domains. We also analyze the challenges and limitations of current systems, including but not limited to interpretability, biased outputs, potential societal harms, and hallucination. In addition, we propose directions for future research, including human-AI collaboration and improved online learning methods. By reviewing existing work and highlighting open questions, this survey aims to provide a comprehensive roadmap for researchers and practitioners in the field of MAA.
