Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (5): 1002-1021. DOI: 10.1007/s11390-021-1217-z

Special Topic: Artificial Intelligence and Pattern Recognition


Robustness Assessment of Asynchronous Advantage Actor-Critic Based on Dynamic Skewness and Sparseness Computation: A Parallel Computing View

Tong Chen1, Ji-Qiang Liu1, He Li1, Shuo-Ru Wang1, Wen-Jia Niu1,*, Member, CCF, En-Dong Tong1,*, Member, CCF, Liang Chang2, Qi Alfred Chen3, and Gang Li4, Member, IEEE

    1 Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing 100044, China;
    2 Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China;
    3 Donald Bren School of Information and Computer Sciences, University of California, Irvine 92697, U.S.A.;
    4 Centre for Cyber Security Research and Innovation, Deakin University, Geelong, VIC 3216, Australia
  • Received: 2020-12-12   Revised: 2021-07-26   Online: 2021-09-30   Published: 2021-09-30
  • About author: Tong Chen received her M.S. degree in cyber security from Beijing Jiaotong University, Beijing, in 2018. She is currently a Ph.D. candidate in cyber security at Beijing Jiaotong University, Beijing. Her main research interests are cyber security and reinforcement learning security.
  • Supported by:
    The work was supported by the National Natural Science Foundation of China under Grant Nos. 61972025, 61802389, 61672092, U1811264, and 61966009, the National Key Research and Development Program of China under Grant Nos. 2020YFB1005604 and 2020YFB2103802, and Guangxi Key Laboratory of Trusted Software under Grant No. KX201902.

1. Background (Context):
As a form of autonomous learning, reinforcement learning has greatly driven the development of fundamental applications in artificial intelligence. Among the mainstream reinforcement learning algorithms, the asynchronous advantage actor-critic (A3C) has become popular in AI research for its ability to support asynchronous parallel learning, and it has led the parallel-computing-driven revolution in deep reinforcement learning. A3C runs multiple agents asynchronously to interact with the environment, abandoning the traditional single-agent learning mode and reaching autonomous learning faster through multi-agent collaboration. More and more practical application scenarios (e.g., power control) are now considering A3C for deployment. Built on parallel computing, A3C greatly improves the potential of synchronously parallel learning and opens a new door for the development of reinforcement learning.
2. Objective:
Related studies have shown that A3C cannot maintain its robustness under slight random environmental disturbances. It is therefore meaningful and important to assess the robustness of A3C systematically in the high-speed setting of parallel computing. In this work, our goal is to carry out a systematic robustness assessment of A3C based on multi-agent parallel computing.
3. Method:
We first compute action probability deviations and construct a global matrix from them, which captures the policy's action differences at each state. Through in-depth analysis of this global deviation matrix during A3C training, we define two novel measures, skewness and sparseness. Considering different weight combinations of skewness and sparseness, we combine the two into an integral robustness measure. Beyond this static assessment, we further propose a dynamic assessment algorithm based on situational whole-space state sampling over changing episodes, and analyze its time complexity to demonstrate the efficiency of the dynamic robustness assessment. We implement an A3C-based pathfinding scenario as our experimental environment and verify the effectiveness of the proposed method under different combinations of agent number and learning rate.
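To make the static part of the method concrete, below is a minimal Python sketch under illustrative assumptions: the deviation matrix holds absolute action-probability changes over a set of sampled states, skewness is taken as the mean third standardized moment of each state's deviations, sparseness as the fraction of near-zero entries, and the two are mixed with weights w_skew and w_sparse. All names and formulas here are hypothetical stand-ins, not the exact measures defined in the paper.

```python
import numpy as np

def action_probability_deviation(base_probs, perturbed_probs):
    """Global deviation matrix: one row per sampled state, one column per action.
    Each entry is the absolute change of that action's probability under disturbance."""
    return np.abs(np.asarray(perturbed_probs) - np.asarray(base_probs))

def skewness_measure(dev):
    """Illustrative skewness: mean third standardized moment over states (hypothetical form)."""
    mu = dev.mean(axis=1, keepdims=True)
    sigma = dev.std(axis=1, keepdims=True) + 1e-12
    return float(np.mean(((dev - mu) / sigma) ** 3))

def sparseness_measure(dev, eps=1e-3):
    """Illustrative sparseness: fraction of near-zero deviations (hypothetical form)."""
    return float(np.mean(dev < eps))

def robustness_score(dev, w_skew=0.5, w_sparse=0.5):
    """Weighted combination of the two measures as one integral robustness indicator."""
    return w_skew * skewness_measure(dev) + w_sparse * sparseness_measure(dev)

# Toy example: 100 sampled states, 4 actions; random distributions stand in for
# the clean and the disturbed A3C policy outputs.
rng = np.random.default_rng(0)
clean = rng.dirichlet(np.ones(4), size=100)
disturbed = rng.dirichlet(np.ones(4), size=100)
print(robustness_score(action_probability_deviation(clean, disturbed)))
```

The weighted combination only illustrates the idea of trading off the two indicators; the weight settings explored in the paper are not reproduced here.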
4. Results & Findings:
Across experiments with different combinations of agent number and learning rate, the proposed dynamic A3C robustness assessment achieves an accuracy of 83.3% compared with the benchmark. As the number of agents increases, skewness and sparseness decrease accordingly, by at most 38.1% and 7.86%, respectively. In general, a lower learning rate yields higher skewness and sparseness values, which also indicates a more robust A3C model.
5. Conclusions:
The experiments show that the proposed method can assess the robustness of A3C models with high accuracy, and we analyze in detail how the number of agents and the learning rate affect A3C robustness. This work is the first in-depth study of the robustness of parallel-computing-based A3C. We expect it to inspire a series of follow-up studies, including but not limited to: (1) robustness of A3C reinforcement learning over infinite state spaces; (2) robustness of more types of reinforcement learning; (3) sound mechanisms for ensuring the robustness of reinforcement learning.

Keywords: robustness assessment, skewness, sparseness, A3C, reinforcement learning

Abstract: Reinforcement learning as autonomous learning is greatly driving artificial intelligence (AI) development to practical applications. Having demonstrated the potential to significantly improve synchronously parallel learning, the parallel-computing-based asynchronous advantage actor-critic (A3C) opens a new door for reinforcement learning. Unfortunately, the acceleration's influence on A3C robustness has been largely overlooked. In this paper, we perform the first robustness assessment of A3C based on parallel computing. By perceiving the policy's action, we construct a global matrix of action probability deviation and define two novel measures of skewness and sparseness to form an integral robustness measure. Based on such static assessment, we then develop a dynamic robustness assessing algorithm through situational whole-space state sampling of changing episodes. Extensive experiments with different combinations of agent number and learning rate are implemented on an A3C-based pathfinding application, demonstrating that our proposed robustness assessment can effectively measure the robustness of A3C, achieving an accuracy of 83.3%.
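As a companion sketch for the dynamic assessment step mentioned in the abstract, the loop below shows one hypothetical way to organize whole-space state sampling over changing episodes: each episode re-samples part of the state space, rebuilds the action-probability deviation matrix, and records a combined skewness/sparseness score. The policy and perturbed_policy callables, the sample_ratio parameter, and the inline score are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def dynamic_robustness_assessment(policy, perturbed_policy, state_space,
                                  episodes, sample_ratio=0.2, w_skew=0.5,
                                  w_sparse=0.5, eps=1e-3, seed=0):
    """Hypothetical dynamic loop: at each episode, sample part of the whole state
    space, rebuild the deviation matrix, and record the combined score so its
    evolution across episodes can be tracked."""
    rng = np.random.default_rng(seed)
    n_sample = max(1, int(sample_ratio * len(state_space)))
    scores = []
    for _ in range(episodes):
        idx = rng.choice(len(state_space), size=n_sample, replace=False)
        dev = np.abs(np.array([perturbed_policy(state_space[i]) for i in idx]) -
                     np.array([policy(state_space[i]) for i in idx]))
        mu = dev.mean(axis=1, keepdims=True)
        sigma = dev.std(axis=1, keepdims=True) + 1e-12
        skew = float(np.mean(((dev - mu) / sigma) ** 3))  # illustrative skewness
        sparse = float(np.mean(dev < eps))                 # illustrative sparseness
        scores.append(w_skew * skew + w_sparse * sparse)
    return scores

# Usage with stand-in policies over 4 actions on a 10x10 grid of states.
states = [(r, c) for r in range(10) for c in range(10)]
uniform = lambda s: np.full(4, 0.25)
noisy = lambda s: np.random.default_rng(hash(s) % (2 ** 32)).dirichlet(np.ones(4))
print(dynamic_robustness_assessment(uniform, noisy, states, episodes=5))
```

Sampling only a fraction of the state space per episode keeps the per-episode cost bounded, which is in line with the time-efficiency argument the paper makes for the dynamic algorithm.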

Key words: robustness assessment, skewness, sparseness, asynchronous advantage actor-critic, reinforcement learning
