Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (2): 268-280. doi: 10.1007/s11390-020-9967-6

• Special Section on Learning and Mining in Dynamic Environments •

Efficient Multiagent Policy Optimization Based on Weighted Estimators in Stochastic Cooperative Environments

Yan Zheng1, Jian-Ye Hao1,*, Member, CCF, Zong-Zhang Zhang2, Member, CCF, IEEE, Zhao-Peng Meng1, Xiao-Tian Hao1        

  1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China;
    2 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2019-08-20 Revised: 2020-01-23 Online: 2020-03-05 Published: 2020-03-18
  • Contact: Jian-Ye Hao E-mail: jianye.hao@tju.edu.cn
  • About author: Yan Zheng received his Ph.D. degree in software engineering from Tianjin University, Tianjin. He is now a research fellow at Nanyang Technological University, Singapore, and also a member of the Deep Reinforcement Learning Laboratory at Tianjin University, Tianjin. His research interests include deep reinforcement learning and multiagent systems.
  • Supported by:
    The work was supported by the National Natural Science Foundation of China under Grant Nos. 61702362, U1836214, and 61876119, the Special Program of Artificial Intelligence of Tianjin Research Program of Application Foundation and Advanced Technology under Grant No. 16JCQNJC00100, the Special Program of Artificial Intelligence of Tianjin Municipal Science and Technology Commission of China under Grant No. 56917ZXRGGX00150, the Science and Technology Program of Tianjin of China under Grant Nos. 15PTCYSY00030 and 16ZXHLGX00170, and the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20181432.

Multiagent deep reinforcement learning (MA-DRL) has received increasingly wide attention. Most existing MA-DRL algorithms, however, are still inefficient when faced with the non-stationarity caused by agents constantly changing their behaviors in stochastic environments. This paper extends the weighted double estimator to multiagent domains and proposes an MA-DRL framework named Weighted Double Deep Q-Network (WDDQN). By leveraging the weighted double estimator and deep neural networks, WDDQN not only reduces estimation bias effectively but also handles scenarios with raw visual inputs. To achieve efficient cooperation in multiagent domains, we further introduce a lenient reward network and a scheduled replay strategy. Empirical results show that WDDQN outperforms an existing DRL algorithm (double DQN) and an MA-DRL algorithm (lenient Q-learning) in terms of average reward and convergence speed, and is more likely to converge to the Pareto-optimal Nash equilibrium in stochastic cooperative environments.
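For readers unfamiliar with the weighted double estimator that WDDQN builds on, the sketch below illustrates the tabular rule of weighted double Q-learning [19], which interpolates between the single (max) estimator of Q-learning [21] and the double estimator of double Q-learning [24]. It is a minimal illustration only: the function name and the constant c are assumptions made here for exposition, and the deep, multiagent WDDQN additionally relies on neural-network function approximation, the lenient reward network, and the scheduled replay strategy described in the paper.

import numpy as np

def weighted_double_target(q_u, q_v, next_state, reward, gamma, c=1.0):
    # q_u, q_v: two independently updated Q-tables of shape [n_states, n_actions].
    # c: positive constant controlling the interpolation weight (illustrative name).
    a_star = int(np.argmax(q_u[next_state]))  # greedy action under Q^U
    a_low = int(np.argmin(q_u[next_state]))   # lowest-valued action under Q^U
    # The weight approaches 1 (single estimator) when the other table Q^V
    # sees a large value gap at s', and 0 (double estimator) otherwise.
    gap = abs(q_v[next_state, a_star] - q_v[next_state, a_low])
    beta = gap / (c + gap)
    # Interpolate between Q^U(s', a*) (single estimator) and Q^V(s', a*)
    # (double estimator), then form the one-step bootstrapped target.
    value = beta * q_u[next_state, a_star] + (1.0 - beta) * q_v[next_state, a_star]
    return reward + gamma * value

As in double Q-learning, one of the two tables is picked at random on each step and updated toward this target, with the roles of Q^U and Q^V swapped when the other table is updated.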

Key words: deep reinforcement learning; multiagent system; weighted double estimator; lenient reinforcement learning; cooperative Markov game

[1] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. MIT Press, 1998.
[2] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013. https://arxiv.org/abs/1312.5602, Nov. 2019.
[3] Mnih V, Kavukcuoglu K, Silver D et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540):529-533.
[4] Mnih V, Badia A P, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K. Asynchronous methods for deep reinforcement learning. In Proc. the 33rd International Conference on Machine Learning, June 2016, pp.1928-1937.
[5] Schaul T, Quan J, Antonoglou I, Silver D. Prioritized experience replay. In Proc. the 4th International Conference on Learning Representations, May 2016.
[6] van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. In Proc. the 30th AAAI Conference on Artificial Intelligence, February 2016, pp.2094-2100.
[7] Wang Z, Schaul T, Hessel M, van Hasselt H, Lanctot M, de Freitas N. Dueling network architectures for deep reinforcement learning. In Proc. the 33rd International Conference on Machine Learning, June 2016, pp.1995-2003.
[8] Bloembergen D, Kaisers M, Tuyls K. Empirical and theoretical support for lenient learning. In Proc. the 10th International Conference on Autonomous Agents and Multiagent Systems, May 2011, pp.1105-1106.
[9] Matignon L, Laurent G J, le Fort-Piat N. Hysteretic Q-learning: An algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In Proc. the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 2007, pp.64-69.
[10] Matignon L, Laurent G J, le Fort-Piat N. Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems. Knowledge Engineering Review, 2012, 27(1):1-31.
[11] Panait L, Sullivan K, Luke S. Lenient learners in cooperative multiagent systems. In Proc. the 5th International Conference on Autonomous Agents and Multiagent Systems, May 2006, pp.801-803.
[12] Wei E, Luke S. Lenient learning in independent-learner stochastic cooperative games. Journal of Machine Learning Research, 2016, 17:Article No. 84.
[13] Yang T, Hao J, Meng Z, Zheng Y, Zhang C, Zheng Z. BayesToMoP: A fast detection and best response algorithm towards sophisticated opponents. In Proc. the 18th International Conference on Autonomous Agents and Multiagent Systems, May 2019, pp.2282-2284.
[14] Yang T, Hao J, Meng Z, Zhang C, Zheng Y, Zheng Z. Towards efficient detection and optimal response against sophisticated opponents. In Proc. the 28th International Joint Conference on Artificial Intelligence, August 2019, pp.623-629.
[15] Zheng Y, Meng Z P, Hao J Y, Zhang Z Z, Yang T P, Fan C J. A deep Bayesian policy reuse approach against nonstationary agents. In Proc. the 2018 Annual Conference on Neural Information Processing Systems, December 2018, pp.962-972.
[16] Gupta J K, Egorov M, Kochenderfer M. Cooperative multiagent control using deep reinforcement learning. In Proc. the 2017 International Conference on Autonomous Agents and Multiagent Systems Workshops, May 2017, pp.66-83.
[17] Lanctot M, Zambaldi V, Gruslys A et al. A unified game-theoretic approach to multiagent reinforcement learning. In Proc. the 2017 Annual Conference on Neural Information Processing Systems, December 2017, pp.4190-4203.
[18] Claus C, Boutilier C. The dynamics of reinforcement learning in cooperative multiagent systems. In Proc. the 15th AAAI Conference on Artificial Intelligence, July 1998, pp.746-752.
[19] Zhang Z, Pan Z, Kochenderfer M J. Weighted double Q-learning. In Proc. the 26th International Joint Conference on Artificial Intelligence, August 2017, pp.3455-3461.
[20] Zheng Y, Meng Z, Hao J, Zhang Z. Weighted double deep multiagent reinforcement learning in stochastic cooperative environments. In Proc. the 15th Pacific Rim International Conference on Artificial Intelligence, August 2018, pp.421-429.
[21] Watkins C. Learning from delayed rewards [Ph.D. Thesis]. King's College, University of Cambridge, 1989.
[22] Sutton R S. Learning to predict by the methods of temporal differences. Machine Learning, 1988, 3:9-44.
[23] Smith J E, Winkler R L. The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 2006, 52(3):311-322.
[24] van Hasselt H. Double Q-learning. In Proc. the 24th Annual Conference on Neural Information Processing Systems, December 2010, pp.2613-2621.
[25] Potter M A, de Jong K A. A cooperative coevolutionary approach to function optimization. In Proc. the 3rd International Conference on Parallel Problem Solving from Nature, October 1994, pp.249-257.
[26] Tang H, Houthooft R, Foote D, Stooke A, Chen O X, Duan Y, Schulman J, de Turck F, Abbeel P. #Exploration: A study of count-based exploration for deep reinforcement learning. In Proc. the 2017 Annual Conference on Neural Information Processing Systems, December 2017, pp.2753-2762.
[27] Benda M, Jagannathan V, Dodhiawala R. On optimal cooperation of knowledge sources-An empirical investigation. Technical Report, Boeing Advanced Technology Center, Boeing Computing Services, 1986.
[28] Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proc. the 2017 Annual Conference on Neural Information Processing Systems, December 2017, pp.6379-6390.
[29] Palmer G, Tuyls K, Bloembergen D, Savani R. Lenient multi-agent deep reinforcement learning. In Proc. the 17th International Conference on Autonomous Agents and Multiagent Systems, July 2018, pp.443-451.
[30] Buşoniu L, Babuška R, de Schutter B. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications - 1, Srinivasan D, Jain L C (eds.), Springer, 2010, pp.183-221.
[31] Chou P, Maturana D, Scherer S. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In Proc. the 34th International Conference on Machine Learning, August 2017, pp.834-843.