Stream processing is a special form of the dataflow execution model that offers extensive opportunities for optimization and automatic parallelization. A streaming application is represented as a graph of computation stages that communicate with each other via FIFO channels. In a shared-memory environment, a FIFO channel is classically a fixed-size synchronized buffer shared between the producer and the consumer. As the number of concurrent stage workers increases, synchronization overheads such as contention and waiting time rise sharply and severely impair application performance. In this paper, we present a novel multithreaded model that isolates memory between threads by default and provides a higher-level abstraction for scalable unicast or multicast communication between threads: CHAUS (Channel for Unbounded Streaming). The CHAUS model hides the underlying synchronization details but requires the user to declare the producer-consumer relationships of a channel in advance. The runtime system then ensures reliable concurrent data transmission, at user-defined data-item granularity, according to the declared relationships. To provide unbounded buffering for streams and to reduce synchronization overheads, we propose a virtual-memory-based solution for implementing a scalable CHAUS channel. We validate the programmability of CHAUS by porting dedup and ferret from PARSEC and by implementing a MapReduce library with a Phoenix-like API. Experimental results show that workloads built with CHAUS run faster than their Pthreads counterparts, and that CHAUS has the best scalability compared with two Pthreads versions: three workloads' CHAUS versions take no more than 0.17x the runtime of the Pthreads versions on both 16 and 32 cores.
This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000403.
About the author: Yu Zhang is an associate professor in the School of Computer Science and Technology, University of Science and Technology of China (USTC), Hefei. She received her Ph.D. degree in computer software and theory from USTC, Hefei, in 2005. Her research interests include software security and the construction and evaluation of efficient parallel systems.
Yu Zhang, Yu-Fen Yu, Hui-Fang Cao, Jian-Kang Chen, Qi-Liang Zhang. CHAUS: Scalable VM-based channels for unbounded streaming. Journal of Computer Science and Technology, 2017, 32(6): 1288-1304.
Bienia C, Kumar S, Singh J P, Li K. The PARSEC benchmark suite: Characterization and architectural implications. In Proc. the 17th Int. Conf. Parallel Architectures and Compilation Techniques, October 2008, pp.72-81.
Michael M M, Scott M L. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proc. the 15th Annual ACM Symp. Principles of Distributed Computing, May 1996, pp.267-275.
Tsigas P, Zhang Y. A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems. In Proc. the 13th Annual ACM Symp. Parallel Algorithms and Architectures, July 2001, pp.134-143.
Kogan A, Petrank E. Wait-free queues with multiple enqueuers and dequeuers. In Proc. the 16th ACM Symp. Principles and Practice of Parallel Programming, February 2011.
Morrison A, Afek Y. Fast concurrent queues for x86 processors. In Proc. the 18th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, February 2013, pp.103-112.
Hendler D, Incze I, Shavit N, Tzafrir M. Flat combining and the synchronization-parallelism tradeoff. In Proc. the 22nd Annual ACM Symp. Parallelism in Algorithms and Architectures, June 2010, pp.355-364.
Feldman S D, Bhat A, LaBorde P, Yi Q, Dechev D. Effective use of non-blocking data structures in a deduplication application. In Proc. the Companion Publication for Conf. Systems, Programming, & Applications: Software for Humanity, October 2013, pp.133-142.
Thies W, Karczmarek M, Amarasinghe S. StreamIt: A language for streaming applications. In Proc. the 11th Int. Conf. Compiler Construction, April 2002, pp.179-196.
Thakur R, Rabenseifner R, Gropp W. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications, 2005, 19(1): 49-66.
Yoo R M, Romano A, Kozyrakis C. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Proc. IEEE Int. Symp. Workload Characterization, October 2009, pp.198-207.
Gloger W. Dynamic memory allocator implementations in Linux system libraries. http://www.malloc.de/papers/malloc-slides.html, May 2006.
Evans J. A scalable concurrent malloc(3) implementation for FreeBSD. https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf, July 2017.
de Melo A C. Performance counters on Linux: The new tools. http://vger.kernel.org/~acme/perf/perf.pdf, July 2017.
Clements A T, Kaashoek M F, Zeldovich N. Scalable address spaces using RCU balanced trees. In Proc. the 17th Int. Conf. Architectural Support for Programming Languages and Operating Systems, March 2012, pp.199-210.
Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. the 13th Int. Symp. High Performance Computer Architecture, February 2007, pp.13-24.
Zhang Y, Cao H F. DMR: A deterministic MapReduce for multicore systems. International Journal of Parallel Programming, 2017, 45(1): 128-141.
Lu M, Zhang L, Huynh H P, Ong Z, Liang Y, He B S, Goh R S, Huynh R. Optimizing the MapReduce framework on Intel Xeon Phi coprocessor. In Proc. IEEE Int. Conf. Big Data, October 2013, pp.125-130.
Reed E C, Chen N, Johnson R E. Expressing pipeline parallelism using TBB constructs: A case study on what works and what doesn't. In Proc. the Compilation of the Co-Located Workshops on DSM'11, TMC'11, AGERE! 2011, AOOPES'11, NEAT'11, & VMIL'11, October 2011, pp.133-138.
Gordon M I, Thies W, Amarasinghe S. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In Proc. the 12th Int. Conf. Architectural Support for Programming Languages and Operating Systems, October 2006, pp.151-162.
McCool M, Robison A D, Reinders J. Structured Parallel Programming: Patterns for Efficient Computation. Amsterdam: Elsevier, 2012.
Bienia C, Li K. Characteristics of workloads using the pipeline programming model. In Proc. the Int. Conf. Computer Architecture, June 2010, pp.161-171.