Abstract Stream processing is a special form of the dataflow execution model that offers extensive opportunities for optimization and automatic parallelization. A streaming application is represented as a graph of computation stages that communicate via FIFO channels. In a shared-memory environment, a FIFO channel is classically a fixed-size synchronized buffer shared between the producer and the consumer. As the number of concurrent stage workers increases, synchronization overheads such as contention and waiting times rise sharply and severely impair application performance. In this paper, we present a novel multithreaded model that isolates memory between threads by default and provides a higher-level abstraction for scalable unicast or multicast communication between threads: CHAUS (Channel for Unbounded Streaming). The CHAUS model hides the underlying synchronization details, but requires the user to declare the producer-consumer relationships of a channel in advance. The runtime system is then responsible for reliable data transmission at data-item granularity, as declared. To provide an unbounded buffer for streaming and to reduce synchronization overheads, we propose a virtual-memory-based implementation of a scalable CHAUS channel. We evaluate the programmability of CHAUS by successfully porting dedup and ferret from PARSEC and by implementing a MapReduce library with a Phoenix-like API. The experimental results show that workloads built with CHAUS run faster than those built with Pthreads, and that CHAUS scales best compared with two Pthreads versions. For three workloads, the CHAUS versions take no more than 0.17x the runtime of their Pthreads counterparts on both 16 and 32 cores.
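To make the abstraction concrete, the following is a minimal sketch of the kind of channel the abstract describes: a FIFO with a single declared producer-consumer relationship and an unbounded buffer, so pushes never block. This toy version grows a linked list under a mutex; the paper's CHAUS runtime instead uses virtual memory to avoid most of this synchronization. All names here (`channel_t`, `chan_push`, etc.) are illustrative, not the CHAUS API.

```c
/* Illustrative unbounded FIFO channel (NOT the CHAUS API).
   A classic fixed-size buffer would block the producer when full;
   here the buffer grows, so only the consumer ever waits. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct node { int item; struct node *next; } node_t;

typedef struct {
    node_t *head, *tail;        /* dummy-headed FIFO list   */
    int closed;                 /* producer has finished    */
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} channel_t;

void chan_init(channel_t *c) {
    node_t *dummy = malloc(sizeof *dummy);
    dummy->next = NULL;
    c->head = c->tail = dummy;
    c->closed = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->nonempty, NULL);
}

/* push never blocks: the buffer is unbounded. */
void chan_push(channel_t *c, int item) {
    node_t *n = malloc(sizeof *n);
    n->item = item;
    n->next = NULL;
    pthread_mutex_lock(&c->lock);
    c->tail->next = n;
    c->tail = n;
    pthread_cond_signal(&c->nonempty);
    pthread_mutex_unlock(&c->lock);
}

/* close marks end-of-stream so blocked consumers can drain and exit. */
void chan_close(channel_t *c) {
    pthread_mutex_lock(&c->lock);
    c->closed = 1;
    pthread_cond_broadcast(&c->nonempty);
    pthread_mutex_unlock(&c->lock);
}

/* pop blocks until an item arrives or the channel is closed and empty;
   returns 1 on success, 0 on end-of-stream. */
int chan_pop(channel_t *c, int *out) {
    pthread_mutex_lock(&c->lock);
    while (c->head->next == NULL && !c->closed)
        pthread_cond_wait(&c->nonempty, &c->lock);
    node_t *first = c->head->next;
    if (first == NULL) {        /* closed and drained */
        pthread_mutex_unlock(&c->lock);
        return 0;
    }
    *out = first->item;
    free(c->head);              /* retire the old dummy node */
    c->head = first;
    pthread_mutex_unlock(&c->lock);
    return 1;
}
```

Every `chan_push`/`chan_pop` here takes a global lock, which is exactly the contention the abstract says hurts scalability as stage workers multiply; the VM-based CHAUS design exists to remove that per-item synchronization while keeping the same unbounded-FIFO semantics.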
This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000403.
DOI: 10.1007/s11390-017-1801-4
About the author: Yu Zhang is an associate professor in the School of Computer Science and Technology, University of Science and Technology of China (USTC), Hefei. She received her Ph.D. degree in computer software and theory from USTC, Hefei, in 2005. Her research interests are software security and the construction and evaluation of efficient parallel systems.
Cite this article:
Yu Zhang, Yu-Fen Yu, Hui-Fang Cao, Jian-Kang Chen, Qi-Liang Zhang. CHAUS: Scalable VM-based channels for unbounded streaming. Journal of Computer Science and Technology, 2017, 32(6): 1288-1304.
References:
[1] Bienia C, Kumar S, Singh J P, Li K. The PARSEC benchmark suite: Characterization and architectural implications. In Proc. the 17th Int. Conf. Parallel Architectures and Compilation Techniques, October 2008, pp.72-81.
[2] Michael M M, Scott M L. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proc. the 15th Annual ACM Symp. Principles of Distributed Computing, May 1996, pp.267-275.
[3] Tsigas P, Zhang Y. A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems. In Proc. the 13th Annual ACM Symp. Parallel Algorithms and Architectures, July 2001, pp.134-143.
[4] Kogan A, Petrank E. Wait-free queues with multiple enqueuers and dequeuers. In Proc. the 16th ACM Symp. Principles and Practice of Parallel Programming, February 2011.
[5] Morrison A, Afek Y. Fast concurrent queues for x86 processors. In Proc. the 18th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, February 2013, pp.103-112.
[6] Hendler D, Incze I, Shavit N, Tzafrir M. Flat combining and the synchronization-parallelism tradeoff. In Proc. the 22nd Annual ACM Symp. Parallelism in Algorithms and Architectures, June 2010, pp.355-364.
[7] Feldman S D, Bhat A, LaBorde P, Yi Q, Dechev D. Effective use of non-blocking data structures in a deduplication application. In Proc. the Companion Publication for Conf. Systems, Programming, & Applications: Software for Humanity, October 2013, pp.133-142.
[8] Thies W, Karczmarek M, Amarasinghe S. StreamIt: A language for streaming applications. In Proc. the 11th Int. Conf. Compiler Construction, April 2002, pp.179-196.
[9] Thakur R, Rabenseifner R, Gropp W. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications, 2005, 19(1): 49-66.
[10] Yoo R M, Romano A, Kozyrakis C. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Proc. IEEE Int. Symp. Workload Characterization, October 2009, pp.198-207.
[11] Gloger W. Dynamic memory allocator implementations in Linux system libraries. http://www.malloc.de/papers/malloc-slides.html, May 2006.
[12] Evans J. A scalable concurrent malloc(3) implementation for FreeBSD. https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf, July 2017.
[13] de Melo A C. Performance counters on Linux: The new tools. http://vger.kernel.org/~acme/perf/perf.pdf, July 2017.
[14] Clements A T, Kaashoek M F, Zeldovich N. Scalable address spaces using RCU balanced trees. In Proc. the 17th Int. Conf. Architectural Support for Programming Languages and Operating Systems, March 2012, pp.199-210.
[15] Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. the 13th Int. Symp. High Performance Computer Architecture, February 2007, pp.13-24.
[16] Zhang Y, Cao H F. DMR: A deterministic MapReduce for multicore systems. International Journal of Parallel Programming, 2017, 45(1): 128-141.
[17] Lu M, Zhang L, Huynh H P, Ong Z, Liang Y, He B S, Goh R S, Huynh R. Optimizing the MapReduce framework on Intel Xeon Phi coprocessor. In Proc. IEEE Int. Conf. Big Data, October 2013, pp.125-130.
[18] Reed E C, Chen N, Johnson R E. Expressing pipeline parallelism using TBB constructs: A case study on what works and what doesn't. In Proc. the Compilation of the Co-Located Workshops on DSM'11, TMC'11, AGERE! 2011, AOOPES'11, NEAT'11, & VMIL'11, October 2011, pp.133-138.
[19] Gordon M I, Thies W, Amarasinghe S. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In Proc. the 12th Int. Conf. Architectural Support for Programming Languages and Operating Systems, October 2006, pp.151-162.
[20] McCool M, Robison A D, Reinders J. Structured Parallel Programming: Patterns for Efficient Computation. Amsterdam: Elsevier, 2012.
[21] Bienia C, Li K. Characteristics of workloads using the pipeline programming model. In Proc. the Int. Conf. Computer Architecture, June 2010, pp.161-171.