Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (1): 179-193. doi: 10.1007/s11390-020-9651-x

Labeled Network Stack: A High-Concurrency and Low-Tail Latency Cloud Server Framework for Massive IoT Devices

Wen-Li Zhang1, Member, CCF, ACM, IEEE, Ke Liu1, Member, CCF, Yi-Fan Shen1,2, Ya-Zhu Lan1, Member, CCF, Hui Song1, Member, CCF, Ming-Yu Chen1,2,3, Member, CCF, ACM, IEEE, Yuan-Fei Chen1,4, Member, CCF        

  1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
  2 University of Chinese Academy of Sciences, Beijing 100049, China;
  3 Peng Cheng Laboratory, Shenzhen 518000, China;
  4 Zhongke Zhicheng Electronic Technology Company Limited, Jining 272000, China
  • Received: 2019-04-20 Revised: 2019-11-08 Online: 2020-01-05 Published: 2020-01-14
  • About author: Wen-Li Zhang received her Ph.D. degree in computer system architecture from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, in 2014. She is currently an associate professor at ICT, CAS. She is a member of CCF, ACM, and IEEE. Her main research interests include architecture and algorithm optimization for high-end computers and datacenter networks.
  • Supported by:
    The work was supported by the National Key Research and Development Plan of China under Grant No. 2016YFB1000203.

Abstract: Internet of Things (IoT) applications maintain massive numbers of client connections to cloud servers, and the number of networked IoT devices is growing rapidly. IoT services therefore require both low tail latency and high concurrency in datacenters. This study aims to determine whether an order-of-magnitude improvement over mainstream systems is possible in tail latency and concurrency, by proposing a hardware-software codesigned labeled network stack (LNS) for future datacenters. The key innovation is a cross-layer payload labeling mechanism that distinguishes requests by payload across the full network stack, including the application, TCP/IP, and Ethernet layers. This design enables prioritized packet processing and forwarding along the full datapath, so that latency-insensitive requests cannot significantly interfere with high-priority requests. We build a prototype datacenter server and evaluate the LNS design against a commercial x86 server and the mTCP research stack, using a cloud-supported IoT application scenario. Experimental results show that the LNS design can provide an order-of-magnitude improvement in tail latency and concurrency. A single datacenter server node can support over 2 million concurrent long-lived connections from IoT devices while keeping the 99th-percentile tail latency within 50 ms. In addition, the hardware-software codesign approach substantially reduces the overhead of labeling and prioritization and limits the interference of high-priority requests with low-priority requests.

Key words: tail latency, high concurrency, network stack, cloud server, Internet of Things (IoT) service
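
To make the idea of label-based prioritization and the 99th-percentile metric in the abstract more concrete, the following is a minimal software-only sketch. It is not the authors' LNS implementation, which is a hardware-software codesign spanning the application, TCP/IP, and Ethernet layers. The label values `HIGH` and `LOW`, the class `LabeledQueue`, the helpers `percentile` and `drain`, and the 1 ms service time are all hypothetical names and numbers chosen for illustration.

```python
import heapq
import itertools
import math

# Hypothetical label values; the abstract does not specify an encoding.
HIGH, LOW = 0, 1


class LabeledQueue:
    """Serve requests in label order, FIFO within the same label."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserves arrival order

    def push(self, label, request):
        heapq.heappush(self._heap, (label, next(self._seq), request))

    def pop(self):
        label, _, request = heapq.heappop(self._heap)
        return label, request

    def __len__(self):
        return len(self._heap)


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def drain(queue, service_time_ms=1.0):
    """Serve all requests queued at t=0 with one worker.

    Returns the completion time (ms) of every request, grouped by label.
    """
    clock = 0.0
    latencies = {HIGH: [], LOW: []}
    while len(queue):
        label, _ = queue.pop()
        clock += service_time_ms
        latencies[label].append(clock)
    return latencies


# 90 latency-insensitive requests arrive before 10 latency-sensitive ones.
q = LabeledQueue()
for i in range(90):
    q.push(LOW, f"bulk-{i}")
for i in range(10):
    q.push(HIGH, f"alert-{i}")

lat = drain(q)
print("p99 high-priority (ms):", percentile(lat[HIGH], 99))
print("p99 low-priority (ms):", percentile(lat[LOW], 99))
```

In this toy model the high-priority requests are served first even though they arrived last, so their tail latency is bounded by their own count rather than by the backlog of low-priority traffic. A real stack would additionally have to carry the label across layers and bound the delay imposed on low-priority requests, which this sketch does not attempt.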
