Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (1): 179-193.doi: 10.1007/s11390-020-9651-x

• Special Section on Applications •

Labeled Network Stack: A High-Concurrency and Low-Tail Latency Cloud Server Framework for Massive IoT Devices

Wen-Li Zhang1, Member, CCF, ACM, IEEE, Ke Liu1, Member, CCF, Yi-Fan Shen1,2, Ya-Zhu Lan1, Member, CCF, Hui Song1, Member, CCF, Ming-Yu Chen1,2,3, Member, CCF, ACM, IEEE, Yuan-Fei Chen1,4, Member, CCF   

    1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Peng Cheng Laboratory, Shenzhen 518000, China;
    4 Zhongke Zhicheng Electronic Technology Company Limited, Jining 272000, China
  • Received: 2019-04-20  Revised: 2019-11-08  Online: 2020-01-05  Published: 2020-01-14
  • About author:Wen-Li Zhang received her Ph.D. degree in computer system architecture from Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, in 2014. She is currently an associate professor of ICT, CAS. She is a member of CCF, ACM and IEEE. Her main research interests include architecture and algorithm optimization for high end computers and datacenter networks.
  • Supported by:
    The work was supported by the National Key Research and Development Plan of China under Grant No. 2016YFB1000203.

Internet of Things (IoT) applications maintain massive numbers of client connections to cloud servers, and the number of networked IoT devices is growing rapidly. IoT services therefore require both low tail latency and high concurrency in datacenters. This study aims to determine whether an order-of-magnitude improvement over mainstream systems is possible in both tail latency and concurrency, by proposing a hardware-software codesigned labeled network stack (LNS) for future datacenters. The key innovation is a cross-layer payload labeling mechanism that distinguishes requests by payload across the full network stack, including the application, TCP/IP, and Ethernet layers. This design enables prioritized packet processing and forwarding along the full datapath, such that latency-insensitive requests cannot significantly interfere with high-priority requests. We build a prototype datacenter server to evaluate the LNS design against a commercial x86 server and the mTCP user-level stack, using a cloud-supported IoT application scenario. Experimental results show that the LNS design provides an order-of-magnitude improvement in tail latency and concurrency: a single datacenter server node can support over 2 million concurrent long-living IoT connections while maintaining a 99th-percentile tail latency of 50 ms. In addition, the hardware-software codesign markedly reduces the labeling and prioritization overhead and limits the interference of high-priority requests with low-priority ones.
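The abstract describes the labeling mechanism only at a high level. As an illustrative sketch (not the paper's implementation), the per-stage effect of priority labels can be modeled as a queue pair in which labeled, latency-sensitive requests are always dequeued before best-effort ones; the label values and request names below are hypothetical:

```python
from collections import deque

# Hypothetical label encoding; the LNS paper's actual labels may differ.
HIGH, LOW = 0, 1

class LabeledQueue:
    """One datapath stage: high-priority traffic is drained before low-priority."""
    def __init__(self):
        self.queues = (deque(), deque())  # index 0 = HIGH, index 1 = LOW

    def enqueue(self, label, payload):
        self.queues[label].append(payload)

    def dequeue(self):
        for q in self.queues:  # always check the HIGH queue first
            if q:
                return q.popleft()
        return None  # stage idle

q = LabeledQueue()
q.enqueue(LOW, "sensor-batch-upload")   # latency-insensitive bulk traffic
q.enqueue(HIGH, "device-heartbeat")     # latency-sensitive control traffic
q.enqueue(LOW, "firmware-download")

order = [q.dequeue() for _ in range(3)]
print(order)  # → ['device-heartbeat', 'sensor-batch-upload', 'firmware-download']
```

Applying this dequeue rule at every layer (application, TCP/IP, Ethernet) is what keeps bulk traffic from sitting in front of latency-sensitive requests anywhere along the datapath.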

Key words: tail latency, high concurrency, network stack, cloud server, Internet of Things (IoT) service

[1] Gubbi J, Buyya R, Marusic S et al. Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 2013, 29(7): 1645-1660.
[2] Botta A, De Donato W, Persico V et al. Integration of cloud computing and Internet of Things: A survey. Future Generation Computer Systems, 2016, 56: 684-700.
[3] Mohammadi M, Al-Fuqaha A, Sorour S et al. Deep learning for IoT big data and streaming analytics: A survey. IEEE Communications Surveys & Tutorials, 2018, 20(4): 2923-2960.
[4] Dean J, Barroso L A. The tail at scale. Communications of the ACM, 2013, 56(2): 74-80.
[5] Zats D, Das T, Mohan P, Borthakur D, Katz R. DeTail: Reducing the flow completion time tail in datacenter networks. ACM SIGCOMM Computer Communication Review, 2012, 42: 139-150.
[6] Li J, Sharma N K, Ports D R et al. Tales of the tail: Hardware, OS, and application-level sources of tail latency. In Proc. the ACM Symposium on Cloud Computing, November 2014, Article No. 9.
[7] Liu H. A measurement study of server utilization in public clouds. In Proc. the 9th IEEE International Conference on Dependable, Autonomic and Secure Computing, December 2011, pp.435-442.
[8] Thekkath C A, Nguyen T D, Moy E et al. Implementing network protocols at user level. IEEE/ACM Transactions on Networking, 1993, 1(5): 554-565.
[9] Zhang W, Liu K, Song H et al. Labeled network stack: A codesigned stack for low tail-latency and high concurrency in datacenter services. In Proc. the 15th IFIP WG 10.3 International Conference on Network and Parallel Computing, November 2018, pp.132-136.
[10] Wu W, Feng X, Zhang W, Chen M. MCC: A predictable and scalable massive client load generator. In Proc. the 2019 BenchCouncil International Symposium on Benchmarking, Measuring and Optimizing, November 2019.
[11] Song H, Zhang W, Liu K et al. HCMonitor: An accurate measurement system for high concurrent network services. In Proc. the 2019 IEEE International Conference on Networking, Architecture and Storage, August 2019, Article No. 2.
[12] Xu Z W, Li C D. Low-entropy cloud computing systems. SCIENTIA SINICA Informationis, 2017, 47(9): 1149-1163.
[13] Nowlan M F, Tiwari N, Iyengar J et al. Fitting square pegs through round pipes: Unordered delivery wire-compatible with TCP and TLS. In Proc. the 9th USENIX Symposium on Networked Systems Design and Implementation, April 2012, pp.383-398.
[14] Moritz P, Nishihara R, Wang S et al. Ray: A distributed framework for emerging AI applications. In Proc. the 13th USENIX Symposium on Operating Systems Design and Implementation, October 2018, pp.561-577.
[15] Nguyen M, Li Z, Duan F et al. The tail at scale: How to predict it? In Proc. the 8th USENIX Workshop on Hot Topics in Cloud Computing, June 2016, Article No. 17.
[16] Delimitrou C, Kozyrakis C. Amdahl's law for tail latency. Communications of the ACM, 2018, 61(8): 65-72.
[17] Xu Y, Musgrave Z, Noble B et al. Bobtail: Avoiding long tails in the cloud. In Proc. the 10th USENIX Symposium on Networked Systems Design and Implementation, April 2013, pp.329-342.
[18] Lai Z, Cui Y, Li M et al. TailCutter: Wisely cutting tail latency in cloud CDN under cost constraints. In Proc. the 35th Annual IEEE International Conference on Computer Communications, April 2016.
[19] Suresh L, Canini M, Schmid S et al. C3: Cutting tail latency in cloud data stores via adaptive replica selection. In Proc. the 12th USENIX Conference on Networked Systems Design and Implementation, May 2015, pp.513-527.
[20] Kasture H, Sanchez D. TailBench: A benchmark suite and evaluation methodology for latency-critical applications. In Proc. the 2016 IEEE International Symposium on Workload Characterization, September 2016, pp.3-12.
[21] Cerrato I, Annarumma M, Risso F. Supporting fine-grained network functions through Intel DPDK. In Proc. the 3rd European Workshop on Software Defined Networks, September 2014, pp.1-6.
[22] Shanmugalingam S, Ksentini A, Bertin P. DPDK Open vSwitch performance validation with mirroring feature. In Proc. the 23rd International Conference on Telecommunications, May 2016, Article No. 45.
[23] Marinos I, Watson R N M, Handley M. Network stack specialization for performance. ACM SIGCOMM Computer Communication Review, 2014, 44(4): 175-186.
[24] Ousterhout A, Fried J, Behrens J et al. Shenango: Achieving high CPU efficiency for latency-sensitive datacenter workloads. In Proc. the 16th USENIX Symposium on Networked Systems Design and Implementation, February 2019, pp.361-378.
[25] Kaffes K, Chong T, Humphries J T et al. Shinjuku: Preemptive scheduling for μsecond-scale tail latency. In Proc. the 16th USENIX Symposium on Networked Systems Design and Implementation, February 2019, pp.345-360.
[26] Jeong E, Woo S, Jamshed M, Jeong H, Ihm S, Han D, Park K. mTCP: A highly scalable user-level TCP stack for multicore systems. In Proc. the 11th USENIX Symposium on Networked Systems Design and Implementation, April 2014, pp.489-502.
[27] Belay A, Prekas G, Klimovic A et al. IX: A protected data plane operating system for high throughput and low latency. In Proc. the 11th USENIX Symposium on Operating Systems Design and Implementation, October 2014, pp.49-65.
[28] Dragojevic A, Narayanan D, Hodson O, Castro M. FaRM: Fast remote memory. In Proc. the 11th Symposium on Networked Systems Design and Implementation, April 2014, pp.401-414.
[29] Jose J, Subramoni H, Luo M et al. Memcached design on high performance RDMA capable interconnects. In Proc. the 2011 International Conference on Parallel Processing, September 2011, pp.743-752.
[30] Mitchell C, Geng Y, Li J. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proc. the 2013 USENIX Annual Technical Conference, June 2013, pp.103-114.
[31] Ongaro D, Rumble S M, Stutsman R, Ousterhout J K, Rosenblum M. Fast crash recovery in RAMCloud. In Proc. the 23rd ACM Symposium on Operating Systems Principles, October 2011, pp.29-41.
[32] Nishtala R, Fugal H, Grimm S et al. Scaling Memcache at Facebook. In Proc. the 10th Symposium on Networked Systems Design and Implementation, April 2013, pp.385-398.
[33] Han S, Marshall S, Chun B G, Ratnasamy S. MegaPipe: A new programming interface for scalable network I/O. In Proc. the 10th USENIX Symposium on Operating System Design and Implementation, October 2012, pp.135-148.
[34] Bao Y G, Wang S. Labeled von Neumann architecture for software-defined cloud. J. Comput. Sci. Technol., 2017, 32(2): 219-223.
[35] Ma J, Sui X, Sun N H et al. Supporting differentiated services in computers via programmable architecture for resourcing-on-demand (PARD). In Proc. the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, March 2015, pp.131-143.
[36] Marian T, Lee K S, Weatherspoon H. NetSlices: Scalable multi-core packet processing in user-space. In Proc. the 8th ACM/IEEE Symposium on Architectures for Networking and Communication Systems, October 2012, pp.27-38.
