Efficient Multiagent Policy Optimization Based on Weighted Estimators in Stochastic Cooperative Environments

    Abstract: Multiagent deep reinforcement learning (MA-DRL) has attracted increasingly wide attention. Most existing MA-DRL algorithms, however, remain inefficient when faced with the non-stationarity that arises because the other agents continually change their behavior in stochastic environments. This paper extends the weighted double estimator to multiagent domains and proposes an MA-DRL framework named Weighted Double Deep Q-Network (WDDQN). By leveraging the weighted double estimator and a deep neural network, WDDQN not only reduces the estimation bias effectively but also handles raw visual inputs directly. To achieve efficient cooperation in multiagent domains, we introduce a lenient reward network and a scheduled replay strategy. Empirical results show that WDDQN outperforms an existing DRL algorithm (double DQN) and an MA-DRL algorithm (lenient Q-learning) in terms of average reward and convergence speed, and is more likely to converge to the Pareto-optimal Nash equilibrium in stochastic cooperative environments.
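    The core ingredient named in the abstract is the weighted double estimator. As a rough illustration, the following is a minimal tabular sketch of the weighted double Q-learning update rule (Zhang et al., IJCAI 2017) on which WDDQN builds; the hyperparameters alpha, gamma, and the weighting constant c are illustrative assumptions, not values from the paper.

    import numpy as np

    def weighted_double_q_update(QU, QV, s, a, r, s2, done,
                                 alpha=0.1, gamma=0.99, c=1.0):
        # Update estimator QU; QV is used only to compute the weight beta
        # and the double estimate. QU and QV are |S| x |A| arrays.
        # alpha, gamma, c are illustrative assumptions.
        a_star = int(np.argmax(QU[s2]))   # greedy next action under QU
        a_low = int(np.argmin(QU[s2]))    # lowest-valued next action under QU
        gap = abs(QV[s2, a_star] - QV[s2, a_low])
        beta = gap / (c + gap)            # weight in [0, 1)
        # Weighted mix of the single estimate QU(s', a*) and the double
        # estimate QV(s', a*): beta -> 1 recovers Q-learning (prone to
        # overestimation), beta -> 0 recovers double Q-learning (prone
        # to underestimation).
        backup = beta * QU[s2, a_star] + (1.0 - beta) * QV[s2, a_star]
        target = r if done else r + gamma * backup
        QU[s, a] += alpha * (target - QU[s, a])

    On each step, a fair coin decides which of the two estimators plays the role of QU above. WDDQN replaces the two tables with deep networks so that the same weighted target can be computed from raw visual inputs.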
