SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Edited by: Editorial Board of Journal Of Computer Science and Technology
P.O. Box 2704, Beijing 100190, P.R. China Sponsored by: Institute of Computing Technology, CAS & China Computer Federation Undertaken by: Institute of Computing Technology, CAS Published by: SCIENCE PRESS, BEIJING, CHINA Distributed by: China: All Local Post Offices Other Countries: Springer
Non-Volatile Main Memories (NVMMs) have recently emerged as a promising technology for future memory systems. Generally, NVMMs have many desirable properties such as high density, byte-addressability, non-volatility, low cost, and energy efficiency, at the expense of high write latency, high write power consumption, and limited write endurance. NVMMs have become a competitive alternative of Dynamic Random Access Memory (DRAM), and will fundamentally change the landscape of memory systems. They bring many research opportunities as well as challenges on system architectural designs, memory management in operating systems (OSes), and programming models for hybrid memory systems. In this article, we revisit the landscape of emerging NVMM technologies, and then survey the state-of-the-art studies of NVMM technologies. We classify those studies with a taxonomy according to different dimensions such as memory architectures, data persistence, performance improvement, energy saving, and wear leveling. Second, to demonstrate the best practices in building NVMM systems, we introduce our recent work of hybrid memory system designs from the dimensions of architectures, systems, and applications. At last, we present our vision of future research directions of NVMMs and shed some light on design challenges and opportunities.
This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that impact the high-performance computing applications. We provide insights into the memory-relevant performance behaviours of the Phytium 2000+ system through micro-benchmarking. With the help of the well-known roofline model, we analyze the Phytium 2000+ system, taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
Driven by the increasing requirements of high-performance computing applications, supercomputers are prone to containing more and more computing nodes. Applications running on such a large-scale computing system are likely to spawn millions of parallel processes, which usually generate a burst of I/O requests, introducing a great challenge into the metadata management of underlying parallel file systems. The traditional method used to overcome such a challenge is adopting multiple metadata servers in the scale-out manner, which will inevitably confront with serious network and consistence problems. This work instead pursues to enhance the metadata performance in the scale-up manner. Specifically, we propose to improve the performance of each individual metadata server by employing GPU to handle metadata requests in parallel. Our proposal designs a novel metadata server architecture, which employs CPU to interact with file system clients, while offloading the computing tasks about metadata into GPU. To take full advantages of the parallelism existing in GPU, we redesign the in-memory data structure for the name space of file systems. The new data structure can perfectly fit to the memory architecture of GPU, and thus helps to exploit the large number of parallel threads within GPU to serve the bursty metadata requests concurrently. We implement a prototype based on BeeGFS and conduct extensive experiments to evaluate our proposal, and the experimental results demonstrate that our GPU-based solution outperforms the CPU-based scheme by more than 50% under typical metadata operations. The superiority is strengthened further on high concurrent scenarios, e.g., the high-performance computing systems supporting millions of parallel threads.
Genomic sequence alignment is the most critical and time-consuming step in genomic analysis. Alignment algorithms generally follow a seed-and-extend model. Acceleration of the extension phase for sequence alignment has been well explored in computing-centric architectures on field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), and graphics processing unit (GPU) (e.g., the Smith-Waterman algorithm). Compared with the extension phase, the seeding phase is more critical and essential. However, the seeding phase is bounded by memory, i.e., fine-grained random memory access and limited parallelism on conventional system. In this paper, we argue that the processing-in-memory (PIM) concept could be a viable solution to address these problems. This paper describes \PIM-Align"|an application-driven near-data processing architecture for sequence alignment. In order to achieve memory-capacity proportional performance by taking advantage of 3D-stacked dynamic random access memory (DRAM) technology, we propose a lightweight message mechanism between different memory partitions, and a specialized hardware prefetcher for memory access patterns of sequence alignment. Our evaluation shows that the proposed architecture can achieve 20x and 1 820x speedup when compared with the best available ASIC implementation and the software running on 32-thread CPU, respectively.
Accesses Per Cycle (APC), Concurrent Average Memory Access Time (C-AMAT), and Layered Performance Matching (LPM) are three memory performance models that consider both data locality and memory assess concurrency. The APC model measures the throughput of a memory architecture and therefore reflects the quality of service (QoS) of a memory system. The C-AMAT model provides a recursive expression for the memory access delay and therefore can be used for identifying the potential bottlenecks in a memory hierarchy. The LPM method transforms a global memory system optimization into localized optimizations at each memory layer by matching the data access demands of the applications with the underlying memory system design. These three models have been proposed separately through prior efforts. This paper reexamines the three models under one coherent mathematical framework. More specifically, we present a new memory-centric view of data accesses. We divide the memory cycles at each memory layer into four distinct categories and use them to recursively define the memory access latency and concurrency along the memory hierarchy. This new perspective offers new insights with a clear formulation of the memory performance considering both locality and concurrency. Consequently, the performance model can be easily understood and applied in engineering practices. As such, the memory-centric approach helps establish a unified mathematical foundation for model-driven performance analysis and optimization of contemporary and future memory systems.
Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace dynamic random access memory (DRAM) as main memory. However, because of the relatively high latency and low bandwidth of NVM, NVM is often paired with DRAM to build a heterogeneous memory system (HMS). As a result, data objects of the application must be carefully placed to NVM and DRAM for the best performance. In this paper, we introduce a lightweight runtime solution that automatically and transparently manages data placement on HMS without the requirement of hardware modifications and disruptive change to applications. Leveraging online profiling and performance models, the runtime solution characterizes memory access patterns associated with data objects, and minimizes unnecessary data movement. Our runtime solution effectively bridges the performance gap between NVM and DRAM. We demonstrate that using NVM to replace the majority of DRAM can be a feasible solution for future HPC systems with the assistance of a software-based data management.
Byte-addressable persistent memory (B-APM) presents a new opportunity to bridge the performance gap between main memory and storage. In this paper, we present the usage scenarios for this new technology, based on the capabilities of Intel's DCPMM. We outline some of the basic performance characteristics of DCPMM, and explain how it can be configured and used to address the needs of memory and I/O intensive applications in the HPC (high-performance computing) and data intensive domains. Two decision trees are presented to advise on the configuration options for BAPM; their use is illustrated with two examples. We show that the flexibility of the technology has the potential to be truly disruptive, not only because of the performance improvements it can deliver, but also because it allows systems to cater for wider range of applications on homogeneous hardware.
The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. The inherent computation sparsity makes it challenging to achieve high-performance kernel on the emerging many-core architecture. In this paper, we present a highly efficient short-range force kernel on the Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by the poor data locality and write conflicts. To enhance the data locality, we propose a super-cluster-based neighbor list with an appropriate granularity that fits in the local memory of computing cores. In the absence of a low overhead locking mechanism, using data-privatization force array is a more feasible method to avoid write conflicts, but results in the large overhead of data reduction. We propose a dual-slice partitioning scheme for both hardware resources and computing tasks, which utilizes the on-chip data communication to reduce data reduction overhead and provide load balancing. Moreover, we exploit the single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a performance speedup of 226x compared with the reference implementation and achieves 20% of peak flop rate on the Sunway many-core processor.
Persistent indexing structures are proposed in response to emerging non-volatile memory (NVM) to provide high performance yet durable indexes. However, due to the lack of real NVM hardware, many prior persistent indexing structures were evaluated via emulation, which varies a lot across different setups and differs from the real deployment. Recently, Intel has released its Optane DC Persistent Memory Module (PMM), which is the first production-ready NVM. In this paper, we revisit popular persistent indexing structures on PMM and conduct comprehensive evaluations to study the performance differences among persistent indexing structures, including persistent hash tables and persistent trees. According to the evaluation results, we find that Cacheline-Conscious Extendible Hashing (CCEH) achieves the best performance among all evaluated persistent hash tables, and Failure-Atomic ShifT B+-Tree (FAST) and Write Optimal Radix Tree (WORT) perform better than other trees. Besides, we find that the insertion performance of hash tables is heavily influenced by data locality, while the insertion latency of trees is dominated by the flush instructions. We also uncover that no existing emulation methods accurately simulate PMM for all the studied data structures. Finally, we provide three suggestions on how to fully utilize PMM for better performance, including using clflushopt/clwb with sfence instead of clflush, flushing continuous data in a batch, and avoiding data access immediately after it is flushed to PMM.
Iterative control structures allow the repeated execution of tasks, activities or sub-processes according to the given conditions in a process model. Iterative control structures can significantly increase the risk of triggering temporal exceptions since activities within the scope of these control structures could be repeatedly executed until a predefined condition is met. In this paper, we propose two approaches to unravel iterative control structures from process models. The first approach unravels loops based on zero-one principle. The second approach unravels loops based on branching probabilities assigned at split gateways. The proposed methods can be used to unfold structured loops, nested loops and crossing loops. Since the unfolded model does not contain any iterative control structures, it can be used for further analysis by process designers during the modeling phase. The proposed methods are implemented based on workflow graphs, and therefore they are compatible with modeling languages such as Business Process Modelling Notation (BPMN). In the experiments, the execution behavior of unfolded process models is compared against the original models based on the concept of runs. Experimental results reveal that runs generated from the original models can be correctly executed in the unfolded BPMN models that do not contain any loops.
Communication and coordination between OSS developers who do not work physically in the same location have always been the challenging issues. The pull-based development model, as the state-of-art collaborative development mechanism, provides high openness and transparency to improve the visibility of contributors' work. However, duplicate contributions may still be submitted by more than one contributors to solve the same problem due to the parallel and uncoordinated nature of this model. If not detected in time, duplicate pull-requests can cause contributors and reviewers to waste time and energy on redundant work. In this paper, we propose an approach combining textual and change similarities to automatically detect duplicate contributions in pull-based model at submission time. For a new-arriving contribution, we first compute textual similarity and change similarity between it and other existing contributions. And then our method returns a list of candidate duplicate contributions that are most similar with the new contribution in terms of the combined textual and change similarity. The evaluation shows that 83.4% of the duplicates can be found in average when we use the combined textual and change similarity compared to 54.8% using only textual similarity and 78.2% using only change similarity.
Entity relation classification aims to classify the semantic relationship between two marked entities in a given sentence, and plays a vital role in various natural language processing applications. However, existing studies focus on exploiting mono-lingual data in English, due to the lack of labeled data in other languages. How to effectively benefit from a richly-labeled language to help a poorly-labeled language is still an open problem. In this paper, we come up with a language adaptation framework for cross-lingual entity relation classification. The basic idea is to employ adversarial neural networks (AdvNN) to transfer feature representations from one language to another. Especially, such a language adaptation framework enables feature imitation via the competition between a sentence encoder and a rival language discriminator to generate effective representations. To verify the effectiveness of AdvNN, we introduce two kinds of adversarial structures, dual-channel AdvNN and single-channel AdvNN. Experimental results on the ACE 2005 multilingual training corpus show that our single-channel AdvNN achieves the best performance on both unsupervised and semi-supervised scenarios, yielding an improvement of 6.61% and 2.98% over the state-of-the-art, respectively. Compared with baselines which directly adopt a machine translation module, we find that both dual-channel and single-channel AdvNN significantly improve the performances (F1) of cross-lingual entity relation classification. Moreover, extensive analysis and discussion demonstrate the appropriateness and effectiveness of different parameter settings in our language adaptation framework.
A universal and general quantum simultaneous secret distribution (QSSD) protocol is put forward based on the properties of the one-dimensional high-level cluster states, in which one sender dispatches different high-level classical secret messages to many users at the same time. Due to the idea of quantum dense coding, the sender can send different two-dit classical messages (two d-level classical numbers) to different receivers simultaneously by using a one-dimensional d-level cluster state, which means that the information capacity is up to the maximal. To estimate the security of quantum channels, a new eavesdropping check strategy is put forward. Meanwhile, a new attack model, the general individual attack is proposed and analyzed. It is shown that the new eavesdropping check strategy can effectively prevent the traditional attacks including the general individual attack. In addition, multiparty quantum secret report (MQSR, the same as quantum simultaneous secret submission (QSSS)) in which different users submit their different messages to one user simultaneously can be gotten if the QSSD protocol is changed a little.