HXPY: An Efficient Package for Processing Financial Time-Series Data

Abstract:
Background: Massive volumes of financial time-series data are generated around the world every day, and such data must be analyzed quickly to deliver maximum value; many academic studies and industrial applications likewise call for a high-performance financial time-series computing framework. However, traditional financial time-series computing frameworks fall short in both performance and functional coverage, and make poor use of multi-threaded computation and CUDA.
Objective: This paper aims to provide a new financial time-series computing framework that is compatible with the Python Pandas interface, optimizes single-threaded performance, supports multi-threading and CUDA for computational acceleration, and implements more financial time-series functions.
Methods: This paper proposes HXPY, a new financial time-series computing framework. Based on techniques such as single instruction multiple data (SIMD), streaming algorithms, and memory layout optimization, the corresponding functions are implemented and optimized in modern C++ and exposed through a near-native Python interface that also supports conversion to and from other Python libraries.
Results: HXPY shows significant performance advantages. In single-threaded comparisons, HXPY achieves a 5x-10x speedup over Python Pandas on text file I/O, a 2x-3000x speedup on time-series functions, and a 15x-200x speedup on grouped functions. Meanwhile, in multi-threaded tests HXPY outperforms the Ray-based Modin by 2x-200x, and in CUDA tests it outperforms NVIDIA's cuDF by 2x-400x.
Conclusions: HXPY implements a new dataframe structure that can process and compute financial time-series data efficiently; in an incremental analysis we observe substantial performance improvements from the initial version to the version with all optimizations enabled. HXPY has strong practical value and is already under internal testing and use at several research institutions and partners. In the future, we will keep optimizing it and add support for more functions.

Keywords:
- dataframe /
- time-series data /
- single instruction multiple data (SIMD) /
- compute unified device architecture (CUDA)
Abstract: A tremendous amount of data is generated by global financial markets every day, and such time-series data needs to be analyzed in real time to explore its potential value. In recent years, we have witnessed the successful adoption of machine learning models on financial data, where the demand for accuracy and timeliness calls for highly effective computing frameworks. However, traditional financial time-series data processing frameworks have shown performance degradation and adaptation issues, such as the handling of outliers caused by stock suspensions in Pandas and TA-Lib. In this paper, we propose HXPY, a high-performance data processing package with a C++/Python interface for financial time-series data. HXPY supports miscellaneous acceleration techniques such as streaming algorithms, vectorized instruction sets, and memory optimization, together with various functions such as time window functions, group operations, down-sampling operations, cross-section operations, row-wise or column-wise operations, shape transformations, and alignment functions. The results of our benchmark and incremental analysis demonstrate the superior performance of HXPY compared with its counterparts. From MiBs to GiBs of data, HXPY significantly outperforms other in-memory dataframe computing rivals, even by up to hundreds of times.
1. Introduction
Time-series data, composed of a sequence of data points indexed or listed in time order, is prevalent due to its success in modeling many real-life applications such as finance, meteorology, statistics, and economics, as well as sensors, the Internet of Things (IoT), and control systems. From a machine learning perspective, high-impact applications on time-series data can be categorized into the following extensively studied tasks: forecasting[1-3], classification[4, 5], representation learning[6, 7], and time-series anomaly detection (TAD)[8, 9].
In the time-series data family, financial time-series data plays an essential role. It records the fluctuation of specific characteristics of a particular target over time, such as the change of a particular stock index over time in Fig.1, the implied return rate of a particular bond at different dates, or the revenues of a listed company in different quarters. Financial time-series data is characterized by immense volume and real-time velocity. A tremendous amount of financial time-series data is generated on a global scale at frequencies ranging from microseconds to seasons, and is then analyzed and utilized by numerous financial analysts to support decision-making.
Figure 1. A financial time-series data sample: daily open, high, low, and close prices of the CNI 2000 INDEX (399303) in April 2022. Red denotes that the day's close price is lower than the previous day's close price; green denotes that the day's close price is higher than the previous day's close price.

A typical application scenario of financial time-series data is quantitative trading. The data needs to be analyzed immediately after being collected from exchanges, and trading instructions are issued subsequently. This process is usually completed within microseconds to seconds[10], and any delay can cause loss of profit or even loss of capital. Therefore, the importance of accuracy and timeliness in financial time-series data processing demands highly effective computing frameworks.
Some efforts have been made toward financial time-series data processing, primarily based on Python due to its friendly syntax, rich interfaces with other programming languages, and machine learning ecosystem. Many recent studies have been conducted with Python, such as the finance platform Qlib by Microsoft[11], stock movement predictions[12], and reinforcement learning based trading[13]. Python packages such as Pandas[14] and TA-Lib were developed to process financial time series in Python. Among such approaches, the two-dimensional Pandas dataframe is more prevalent, since it can analyze the two dimensions of time and securities at the same time. Despite its flexibility and popularity, Pandas suffers from performance issues on large-scale data because of its inefficient single-threaded algorithm implementations and its lack of multi-threading and CUDA (Compute Unified Device Architecture) acceleration[15]. Despite what Pandas' multi-threaded counterpart Modin[16] and its CUDA analogue cuDF strive for, performance challenges such as the in-memory conversion between row and column storage and efficient partitioning algorithms stand in the way of further optimization. The experimental results in this paper show that some of their functions are optimized, while others may not be as fast as native Pandas. Therefore, with the ever-increasing volume of financial time-series data, the call for high-performance computing has become increasingly urgent.

This paper presents HXPY, a high-performance financial time-series data processing package in Python and C++ with abundant functions for I/O management, manipulation, and calculation, packaged in a user-friendly interface. As shown in Fig.2, HXPY consists of five layers, each with different components; it is written in several programming languages and combines various optimization methods. Our system keeps a consistent interface with the popular Pandas to be ready-to-hand, but with a different internal backend computation mechanism: SIMD (single instruction multiple data) instructions, memory optimizations, row-major storage, streaming algorithms, and efficient thread task division algorithms are used for intensive acceleration. As a result, our system achieves up to hundreds of times better single-threaded performance while also supporting multi-threading and CUDA acceleration. Moreover, we expand the function set of Pandas to support more fundamental operations on financial time-series data and meet the needs of financial analysis.
Our main contributions include the following.
1) We design and implement the HXPY package for high-performance financial time-series data processing.
2) We optimize I/O functions to achieve more than 5x speedup compared with Pandas.
3) We optimize time-series functions and achieve performance improvements of dozens of times.
4) We implement a multi-threaded version of the dataframe and achieve 2x to 200x performance improvement over Ray[17]-based Modin[18], a multi-threaded variant of Pandas.
5) We implement a CUDA version of the dataframe and achieve significant performance gains over NVIDIA's CuDF from 2x to 400x.
6) In addition, HXPY is not only an academic attempt, but has also been deployed in practice by entities such as the International Digital Economy Academy (IDEA) at a scale of 10 thousand CPU cores.

The rest of the paper is organized as follows. Section 2 presents related work on other financial time-series data processing packages. In Section 3, we illustrate the internal mechanisms of the optimized functions. Section 4 presents, analyzes, and explains the benchmark procedures and results. In Section 5, we further describe the implementation details of the HXPY package. Finally, Section 6 concludes with HXPY's status quo and future road-maps.
2. Related Work
2.1 SIMD and SIMT
Single instruction multiple data (SIMD) and single instruction multiple threads (SIMT) can drastically accelerate computation. The earliest application of SIMD can be traced back to the ILLIAC IV[19]. Later, there were also several attempts at vector computers. The core idea of SIMD is operating on multiple scalars in one instruction, as in the familiar modern x86 processors equipped with Advanced Vector Extensions (AVX) instructions. For example, the vaddps instruction in the x86 world can add multiple packed single-precision floating-point (fp32) values in a fixed number of cycles. As shown in Fig.3, $A_{x,y,z,w}$, $B_{x,y,z,w}$, and $C_{x,y,z,w}$ denote scalars, and the SIMD operation can process the four pairs of scalar operands from $A_{x,y,z,w}$ and $B_{x,y,z,w}$ concurrently in a single instruction.
Modern CPUs usually have multiple extended registers, such as zmm0 and zmm15, which are 512 bits long. Through reasonable instruction and register planning, continuous calculations on floating-point numbers or integers can be scheduled and executed to achieve high-performance computing. Meanwhile, SIMD instructions also bring challenges for programmers, including but not limited to data alignment, interaction with other control-flow code, and possible frequency down-scaling. simdjson's success in using SIMD for string parsing[20] came only nearly 20 years after the wide adoption of SIMD instruction sets; this prolonged time lag reflects the difficulty of SIMD programming as well.
At present, AVX512, the widest SIMD instruction set on the x86 architecture, can process 16 single-precision floating-point numbers in one instruction. In practice, the width of SIMD instructions cannot increase indefinitely, while the memory bandwidth available to a single core is also limited. Especially with the increasing number of single-socket CPU cores and the popularity of multi-chip module (MCM) chips, a single CPU core can no longer saturate the bandwidth of all memory channels. At a finer granularity, the width of the registers and the number of ALUs also become constraints, since they decide the maximum number of floating-point operands that can be processed in one cycle.
Moreover, with the increasing operand width of a single SIMD instruction, the waste of computing resources becomes more acute. For example, in the era of SSE (Streaming SIMD Extensions) instructions, a 128-bit register could describe four single-precision floating-point numbers, which might represent a coordinate in three-dimensional space, wasting only a quarter of the computing resources. However, as the number of operand scalars gradually increases, real-world data lengths are not always a multiple of 8 or even 16. Memory bandwidth, frequency drops, power limits, etc., can also become bottlenecks. Some research has shown that AVX512 is not as performant as AVX2 in specific scenarios[21]. Therefore, GPUs and many dedicated DSPs have been designed for data processing purposes in addition to the CPU.
In the world of GPUs, SIMD and SIMT are both popular. Modern GPUs usually consist of thousands or even more stream processors (or CUDA cores in the NVIDIA world, as depicted in Fig.4). Each stream processor can perform calculations on floating-point or integer data. Different threads inside SIMT can access different register sets, use different memory addressing methods, or follow different execution paths. As a result, though the frequency of each stream processor is typically lower than that of a CPU and the ALU width is smaller, a more flexible programming model, more stream processors, and enormous memory throughput contribute to the GPU's greater value in many scenarios.
Figure 4. NVIDIA GA100 Ampere architecture, which contains thousands of CUDA cores on chip.

Despite their success in computation efficiency optimization, it is still very challenging to directly apply SIMD and SIMT to financial time-series data. In addition to the aforementioned byte alignment issues, many outliers such as NAN or INF in IEEE 754 floating-point numbers[22] often exist in financial data. These outliers need to be handled carefully, since they may interrupt the continuous flow of instructions. Moreover, it is difficult to find a memory layout that can simultaneously provide the memory continuity required by both cross-sectional and time-series function computations.
In this paper, HXPY tries to alleviate the adoption problems of applying SIMD and SIMT instructions to financial time-series data containing NAN and INF, and addresses the memory layout issue to ensure continuous memory access in both cross-sectional and time-series functions.
2.2 1D Approach of Financial Time-Series Data
As in the classical parallel processing area, a natural approach for financial time-series data is to consider the data of each financial instrument along the time axis as a one-dimensional (1D) vector and perform analysis on different 1D vectors. Most multivariate time-series data can also be split into single variables and turned into one-dimensional arrays sampled on the time axis. Many programming languages like R[23] and Julia[24] natively provide the semantics of one-dimensional arrays, but usually without any built-in sliding window functions, leaving users to write their own code to access the elements of the array in turn for statistical analysis.
As a famous library for financial data processing, TA-Lib was started as an open-source project in 1999 by Mario Fortier. TA-Lib provides C implementations of around 200 indicators such as ADX, MACD, and RSI, and several candlestick pattern recognition algorithms, such as recognizing a cross pattern. Although TA-Lib lacks data I/O functions, it is still widely used across academia and industry. Many studies use TA-Lib as a toolkit for financial time-series data processing. For example, Duvinage et al.[25] analyzed intra-day performance on the Japanese market, while Nelson et al.[26] used TA-Lib to generate features and train a CNN to predict stock movements. Even cryptocurrency can be traded with TA-Lib processed features[27].
Since the maintenance of TA-Lib stopped in 2007, the application of TA-Lib inevitably exhibits inherent performance defects and inflexibility, making it very hard to satisfy the needs of today's financial data processing. For example, TA-Lib only supports double-precision floating-point input, while the input of most neural networks ranges from single-precision to INT8. In consequence, casting between different data types yields some performance degradation. Worse still, TA-Lib poorly handles outliers and does not support multi-threading, which requires researchers to perform additional preprocessing of the data and goes against the design of modern compute devices.
2.3 2D Approach of Financial Time-Series Data
When dealing with financial time-series data, many cross-sectional analyses require calculations across financial instruments, such as sorting the price-earnings ratios (PE) of all stocks at a specific date or selecting stocks with a price-earnings ratio of less than 10. In such cases, the one-dimensional storage format can be insufficient: one has to continuously construct vectors from the temporal dimension or the securities dimension, extract data for different functions, and write calculated results back to different vectors. In contrast, tabular data is a more intuitive way of analyzing data, since many people's understanding of data starts with tables affiliated with indexes and columns. Both Python and R dataframes provide rich support for tabular data, where multiple variables and names of financial time series can be formatted as columns or indexes of tabular data.
One much-favored Python package, Pandas, which originated at AQR Capital Management, supports tabular data processing with a variety of row-wise and column-wise operations. Nevertheless, Pandas dataframe operations face performance issues even on moderately large datasets[15]. Some ameliorating studies have been conducted on Pandas dataframe semantics, with only a small subset of functions supported. For instance, NVIDIA's cuDF accelerates specific Pandas dataframe functions on the GPU, whereas Modin, launched by the Berkeley RISE Lab, accelerates Pandas via CPU multi-threading. These projects are not as widely used as native Pandas, and the speedup ratio varies by function; as Section 4 demonstrates, many of their functions are even slower than single-threaded Pandas.
Therefore, our work HXPY supports 2D tabular data layouts as in Fig.5 and an interface to Pandas, with financial time-series data-oriented optimizations. In addition to single-threaded algorithm and execution optimization, HXPY integrates both multi-threading and CUDA acceleration, offering more practical functions related to financial time-series data such as technical indicators and industry-grouped functions.
Figure 5. A typical layout of a financial dataframe. The first column, consisting of the datetime, is called the index, and the first row, containing ticker name strings like ``000001'', is called the column, where ``000001'' is the abbreviation of ``000001.XSHE'', i.e., the stock whose trading code is 000001 on the Shenzhen Stock Exchange; the company name of ``000001.XSHE'' is Ping An Bank Co., Ltd. All floating-point values are stocks' daily close prices.

3. HXPY Functions
As a robust data analysis framework, it is essential for HXPY to provide a variety of functions for arbitrary analysis. HXPY has improved and added fundamental functions for financial time-series data processing, while preserving consistency with the Pandas interface as much as possible for user-friendliness. Due to space limitations, we only give one or two examples of each type of function and briefly explain the implementation.
3.1 I/O Functions
The comma-separated values (CSV) file format is widely used in financial time-series data storage and transfer, especially for data transfer among different programming languages and platforms. In a CSV file, typically, a text line ended by the newline character represents a row of data, and commas are used to separate the items in every row, as shown in Fig.6.
When reading a CSV file, all lines of the file are read first, and then each line is parsed concurrently. Before parsing, from the number of lines and the number of delimiters in the first line, we can estimate the size of the memory required for the data and pre-allocate it. When parsing each line in the file, we use the double-pointer technique, where the two pointers point to two separators such as commas, and then use the Boost Spirit library[28] to conduct literal parsing on the text between the two pointers.
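For illustration, the following is a minimal sketch of this double-pointer split, assuming one purely numeric row and using Boost.Spirit's qi::parse for the literal conversion; the function name and the NaN fallback for unparsable fields are ours, and HXPY's production parser is more elaborate.

```cpp
#include <boost/spirit/include/qi.hpp>
#include <limits>
#include <string_view>
#include <vector>

namespace qi = boost::spirit::qi;

// Sketch: two pointers walk from comma to comma; Boost.Spirit parses the
// literal between them. Unparsable fields fall back to NaN.
std::vector<double> parse_csv_line(std::string_view line) {
    std::vector<double> fields;
    const char* lo = line.data();
    const char* end = line.data() + line.size();
    while (true) {
        const char* hi = lo;
        while (hi != end && *hi != ',') ++hi;   // advance to the next comma
        double value = std::numeric_limits<double>::quiet_NaN();
        const char* it = lo;
        qi::parse(it, hi, qi::double_, value);  // literal parse of [lo, hi)
        fields.push_back(value);
        if (hi == end) break;
        lo = hi + 1;                            // skip the separator
    }
    return fields;
}
```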
Similarly, for CSV writing, we process each line concurrently and write the lines sequentially to the file after all threads have completed their work. When processing each line, we also pre-allocate the length of the string according to the number of elements in the line, and then use the fmt library to convert the elements into text form.

Besides literal serialization, a binary format is more I/O-efficient. HXPY provides a contiguous, binary storage format that improves sequential reading and writing. This contiguous storage structure stores elements in a row-major, uncompressed layout, while enabling zero-copy modifications by mapping from a given memory address. At the beginning of the file, we store various offsets to facilitate partial reading, so this binary format also supports reading only part of the index instead of all the data. Partial reads improve convenience especially when the file is very large, such as an archive of many years of historical data. As a result, both the CSV and binary storage formats gain significant speed improvements in the benchmark experiments discussed in Section 4.
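To illustrate the offsets-up-front idea, a hypothetical header layout is sketched below; the field names and widths are our own illustration, not HXPY's actual on-disk format.

```cpp
#include <cstdint>

// Hypothetical header sketch: storing offsets at the beginning of the file
// lets a reader seek directly to the index or the value block, so only part
// of the data needs to be read.
struct BinaryHeader {
    std::uint64_t magic;           // format identifier and version
    std::uint64_t n_rows;          // number of index entries
    std::uint64_t n_cols;          // number of columns
    std::uint64_t index_offset;    // byte offset of the index section
    std::uint64_t columns_offset;  // byte offset of the column-name section
    std::uint64_t values_offset;   // byte offset of the row-major value block
};
```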
3.2 Element Functions
Element-wise functions are usually element-by-element operations on one or more dataframes, such as addition, subtraction, multiplication, and division of two dataframe objects, size comparisons, or scientific functions such as sine, square root, and relu in Fig.7. Such functions usually behave as contiguous scalar operations on aligned memory, and thus the compiler easily vectorizes them. We implement this type of function with a simple partitioning strategy so that each thread's memory access is contiguous, that is, each working thread takes a continuous and aligned subset reference of the original storage.
For the task of each thread, due to space limitations, we only illustrate how the abs function is implemented with CPU SIMD. Unlike sqrt, which has native x86 AVX instruction support (vsqrtss), we need to implement the absolute value function manually. Inspired by the IEEE 754 floating-point layout, we notice that both single- and double-precision floating-point numbers have a sign bit. Therefore, we pack every eight floating-point numbers (or 16 numbers if the processor supports AVX-512) into a 256-bit processing unit (or a 512-bit unit in AVX512), and then perform a logical bit-wise AND with a mask whose bits are all ones except the sign bit of each lane, which clears the sign bits and yields the absolute values of the inputs. In practice, we place multiple such processing units in loop sections to improve register and arithmetic logic unit (ALU) utilization. The demo C++ code is shown in Fig.8.
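In the spirit of Fig.8, a minimal sketch of such a kernel is given below, assuming AVX2 intrinsics; the loop unrolling and outlier handling of the production version are omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <immintrin.h>

// Sketch: absolute value by clearing the IEEE 754 sign bit, eight packed
// single-precision floats per AVX2 instruction.
void abs_avx2(float* data, std::size_t n) {
    // All bits set except the sign bit of each 32-bit lane.
    const __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7FFFFFFF));
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_and_ps(v, mask));
    }
    for (; i < n; ++i)  // scalar tail when n is not a multiple of 8
        data[i] = std::fabs(data[i]);
}
```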
3.3 Column and Row Functions
Since dataframes have indexes and columns that index a 2D value array, a large number of row-by-row or column-by-column accesses are expected in practice. In financial time-series data analysis, usually, each row represents a different timestamp, and each column denotes a different equity. HXPY adopts row-major storage. When dealing with column functions that involve a large number of non-contiguous memory accesses, HXPY first transposes the data into contiguous storage. As exemplified by Fig.9, when the matrix is large enough and there is sufficient memory, HXPY finds this transpose profitable, even accounting for the extra memory and the cost of the transpose itself. Thus, no matter whether the function call is row-wise or column-wise, HXPY can call the same operator, which reduces the code workload. Such functions sometimes do not change shape, such as computing the sorted index for each row of data in Fig.10, and sometimes they are reductive, such as computing the mean or standard deviation of a column of data in Fig.11. In both cases, HXPY pre-allocates the memory for the result in advance, and the operators then write directly to the corresponding addresses.
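A common way to implement such a transpose is with cache blocking; the sketch below is our illustration of the idea, not HXPY's exact kernel.

```cpp
#include <cstddef>

// Sketch: cache-blocked transpose of a row-major matrix so that subsequent
// column-wise operators can scan contiguous memory.
void transpose_blocked(const float* src, float* dst,
                       std::size_t rows, std::size_t cols) {
    constexpr std::size_t B = 64;  // block edge; tuned to the cache in practice
    for (std::size_t i = 0; i < rows; i += B)
        for (std::size_t j = 0; j < cols; j += B)
            for (std::size_t ii = i; ii < i + B && ii < rows; ++ii)
                for (std::size_t jj = j; jj < j + B && jj < cols; ++jj)
                    dst[jj * rows + ii] = src[ii * cols + jj];
}
```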
Figure 9. Transposing storage to make column-wise functions access memory contiguously.

3.4 Time-Series Functions
Time-series windowed functions are a very critical set of functions in HXPY. We use several tactics to optimize the performance of time-series functions, and the specific optimization methods are described as follows. Time-series functions usually produce result dataframes of the same shape (except that the first T − 1 data points may be invalid results), as Fig.12 illustrates. They can sometimes be reductive, such as down-sampling finer-grained data in time into aggregate statistics in Fig.13 or sampling on timestamps.
Given the importance of time-series functions, and for the rigor of the paper, we use the optimization process of the ts_corr function to demonstrate in detail how HXPY optimizes time-series functions. On the basis of the comparison with Pandas, we disassemble and analyze the sources of HXPY's performance improvement in more depth. The ts_corr function (Equation (1)) calculates the windowed Pearson's correlation coefficient between two aligned dataframes. All of the incremental analysis results are shown in Section 4.
3.4.1 Language Overhead
The basic idea is to implement the ts_corr function by iterating over each cell in a dataframe. For each cell, we look back at the past T elements along the time index of the two dataframes, store them in an array, calculate the correlation coefficient, and write back to the corresponding cell of the resulting dataframe. To measure the language overhead, we start with this preliminary idea and implement basic versions of ts_corr in Python and C++, respectively. The C++ code is compiled without any optimizations (-O0). The brute-force version in Python costs 1422 seconds, and the C++ version costs 29.3 seconds on the small dataset introduced in Section 4.
3.4.2 Memory Optimization
The basic version of ts_corr has a critical memory shortcoming: the T values looked back along a column are not contiguous in memory due to our use of row-major storage.
An intuitive improvement is to first copy the data of each column into a one-dimensional array, as in Fig.14, and then perform the calculation of ts_corr on this array in turn, which ensures the continuity of memory each time ts_corr is invoked. This memory fetch trick alone reduces the time from 29 seconds to 1.6 seconds.
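The trick itself is a plain strided gather; a minimal sketch (the function name is ours) follows.

```cpp
#include <cstddef>

// Sketch: gather one column of a row-major matrix into a contiguous buffer,
// so each windowed pass over the column scans sequential memory.
void gather_column(const float* values, std::size_t n_rows,
                   std::size_t n_cols, std::size_t col, float* out) {
    for (std::size_t row = 0; row < n_rows; ++row)
        out[row] = values[row * n_cols + col];  // element stride is n_cols
}
```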
Figure 14. Transforming the memory layout to accelerate calculation.

3.4.3 Floating Number Optimization
Profiling shows that most of the calculation overhead of the current program version is associated with floating-point calculations, and GCC has floating-point optimization flags that are not enabled by default even at -O3. Though some of these optimization flags may be safe, others can break strict IEEE compliance. For example, in financial computing, turning on -freciprocal-math to allow the compiler to compute x/y as x × (1/y) is usually correct, with rarely a precision issue, but -ffinite-math-only cannot be turned on, because financial data usually contains a lot of NAN or INF values and turning on this flag may cause incorrect results. To ensure that the calculation results are correct, we turn on three safe floating-point optimization flags: -fno-trapping-math generates nonstop code, on the assumption that no math exceptions that can be handled by the user program will be raised; -fno-math-errno disables the global errno variable for simple math functions; and -freciprocal-math enables faster floating-point division. Finally, we gain around a 40% speedup.
3.4.4 Streaming Algorithm
The complexity of the algorithms implemented so far is O(nmT), where n represents the number of rows, m represents the number of columns, and T represents the length of the time window. However, after introducing the streaming algorithm, we only need to scan each row once.
The streaming algorithm usually describes a class of algorithms whose input data is a sequence of items and whose calculation can be completed by scanning the sequence once or a few times. After being formalized and popularized by [29] in 1996, a variety of streaming algorithms are now widely used in network optimization, clustering, and real-time data analysis systems. Compared with batched computation, streaming computation has the advantages of real-time processing, low latency, and limited memory usage, while the streaming algorithm also suffers from reduced accuracy and limited applicable scenarios. In addition, several open-source frameworks such as Apache Flink[30] and Apache Storm[31] have been developed to simplify the deployment of streaming algorithms. Fig.15 describes a simple streaming algorithm that calculates a fixed-length windowed sum; complex functions such as ts_corr can also be calculated by maintaining the first- and second-order streamed statistics of x and y.
In the world of financial time-series data, both stream computing and batch computing are widely used. Although financial time-series data naturally exhibits the characteristics of time streaming, some analyses are fixed in a window, such as the moving average (MA) of the past five days. Certain analyses, such as the exponential moving average (EMA), defined as $EMA_t = k \times Price_t + (1-k) \times EMA_{t-1}$ with a coefficient $k$ between 0 and 1, need to be calculated continuously from the first data point, and are thus naturally streamed. The closer the coefficient $k$ is to 1, the closer the EMA result is to the value at the current time point; the closer $k$ is to 0, the more historical information the EMA result contains. Moreover, specific fixed-window calculations can also be streamed to reduce time complexity. For instance, to calculate a five-day moving average, starting on the sixth day we can multiply the previous day's result by 5, subtract the first day's value, add the value of the sixth day, and divide by 5, that is, $MA_{t,5} = (MA_{t-1,5} \times 5 - Price_{t-5} + Price_t)/5$. Such optimizations can significantly reduce calculation time when the window is wide (because all the data is scanned only once). However, calculation errors are much more likely to accumulate due to the limited precision of floating-point numbers. Therefore, it is necessary to weigh the trade-off between calculation speed and accuracy to select an appropriate algorithm for practical applications.
In our experiments, when the value ranges of the two dataframes are similar, the errors between the streaming version of ts_corr and the fixed-window version are very small. Under the conditions mentioned above, we can calculate ts_corr over any time window, and the time complexity is reduced to O(nm), independent of T. The streaming algorithm brings a nearly three-fold speed improvement when T = 10 in our benchmark.
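For concreteness, a minimal single-column sketch of the streaming fixed-window correlation is shown below; the NAN handling and numerical safeguards of the production version are omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: maintain first- and second-order sums of x and y over the window,
// so each element is added once and evicted once: O(n) instead of O(nT).
std::vector<double> ts_corr_stream(const std::vector<double>& x,
                                   const std::vector<double>& y,
                                   std::size_t T) {
    std::vector<double> out(x.size(), std::nan(""));
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t t = 0; t < x.size(); ++t) {
        sx += x[t]; sy += y[t];
        sxx += x[t] * x[t]; syy += y[t] * y[t]; sxy += x[t] * y[t];
        if (t >= T) {  // evict the element leaving the window
            std::size_t s = t - T;
            sx -= x[s]; sy -= y[s];
            sxx -= x[s] * x[s]; syy -= y[s] * y[s]; sxy -= x[s] * y[s];
        }
        if (t + 1 >= T) {  // window [t-T+1, t] is complete
            double cov = sxy - sx * sy / T;
            double vx  = sxx - sx * sx / T;
            double vy  = syy - sy * sy / T;
            out[t] = cov / std::sqrt(vx * vy);
        }
    }
    return out;
}
```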
3.4.5 Multi-Threading
So far, we have covered most of the optimizations for a single thread. Following Modin's idea, however, when the dataframe is huge, it is a better choice to partition it and use multiple workers to calculate the function. We can partition the dataframe along columns, as Fig.16 shows. For example, when there are 1000 stocks with 2000 days of data and 10 worker threads, we can give each thread 100 columns of 2000 days of data, ensuring continuous memory access per thread. Thus, SIMD instructions can still be applied, and the memory accesses of different threads do not intersect; 16 threads bring more than a three-fold speed improvement. When the number of execution threads is larger than 16, we cannot bind the threads to a single non-uniform memory access (NUMA) node on the AMD 7742 CPU, and thus the fluctuations of execution time increase rapidly.

Figure 16. Distributing rows or columns to different threads.

3.4.6 CUDA Cores
At the time of writing, the x86 CPU with the most cores in a single socket provides 64 physical cores and 128 threads (AMD EPYC 7763). However, due to NUMA and memory bandwidth limitations, the speedup of multi-threaded computing on the CPU is not always linear. As the number of CPU threads increases, memory bandwidth gradually becomes a bottleneck, and the benefits brought by optimization methods such as SIMD gradually subside as the partitions shrink. Current mainstream CUDA graphics cards have nearly 10000 cores. We implement the streaming algorithm version of ts_corr on CUDA and assign a CUDA core to each column of data; the result is around 5x faster than the CPU version on an NVIDIA A100 40 GB. We believe it could be much faster if the CUDA kernel operator were delicately tuned.
3.5 Grouped Functions
Different variables in financial time-series data are usually not uniform and exhibit clustering characteristics. For example, stocks are usually categorized into different industries, while bulk commodities are usually divided into agricultural products, ores, non-ferrous metals, oil and gas, and other sectors. In practice, it is common to group the data first and then analyze it, such as calculating statistical characteristics within a given industry. This by-industry analysis approach is beneficial. For example, the COVID-19 pandemic that began in 2020 resulted in losses for companies engaged in shipping or hospitality, but biomedical companies researching vaccines did not necessarily suffer such losses; neither by-row nor by-column analysis captures this intra-market structural information.
In HXPY, grouping is implemented by another dataframe whose values represent the categories. After the index and column alignment between the data and group dataframes is achieved, the original dataframe is evenly partitioned, values of different groups inside each partition are dispatched to workers, and results are merged and reduced concurrently. In each row, items of different groups are concatenated into different aligned arrays, and then the different arrays can be dispatched to our optimized operators for contiguous row-wise functions, achieving operator reuse. We use a hash table to efficiently store the index positions of different classes, and small containers are optimized as stack objects. When the result is calculated, the data is written back to a new dataframe according to the recorded class positions. Fig.17 shows a demo of a grouped function, which calculates the mean of different groups.
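For illustration only, a minimal standalone sketch of the per-row grouped mean idea follows, with a hash table from category to accumulator; HXPY's version additionally reuses the optimized row-wise operators and stack-allocates small containers.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Sketch: mean per group within one row; group[j] is the category of column j.
std::unordered_map<int, double> grouped_mean_row(
        const std::vector<double>& row, const std::vector<int>& group) {
    std::unordered_map<int, double> sum;
    std::unordered_map<int, std::size_t> count;
    for (std::size_t j = 0; j < row.size(); ++j) {
        sum[group[j]] += row[j];
        ++count[group[j]];
    }
    for (auto& [g, s] : sum)
        s /= static_cast<double>(count[g]);  // finalize: sum -> mean
    return sum;
}
```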
3.6 Shape Manipulating Functions
In-memory dataframe analysis requires flexible operations on shapes and indexes. HXPY supports many shape manipulating functions as well as data indexing, slicing, and transformation, as shown in Figs.18-21. These shape manipulating functions usually need access to the indexes or columns, so we construct a hash table to speed up the lookup of index names. In addition, in functions such as concat that operate on many small dataframes, we pre-construct the new index and dataframe, and then the alignment of items from the old dataframes into the new large dataframe can be executed concurrently. In the implementation of the transpose function, we refer to Eigen's efficient matrix transpose algorithm, and directly swap the index and columns to avoid copying. Through reasonable execution planning, frequent memory allocation is avoided, and the acceleration effect is finally achieved.

4. Experiments
In this section, we empirically evaluate the proposed framework HXPY, aiming to answer the following major question: is HXPY a high-performance financial time-series data processing framework that provides fast calculation speed, accurate results, and a sufficient set of financial functions? We compare HXPY with Pandas[14] and one of Pandas' multi-core improvements, Modin[16, 18]. We also benchmark Pandas' CUDA counterpart, cuDF. With the goal of maximizing diversity and ensuring fair comparisons, we select different functions from each group of functions to test and verify the effectiveness of our proposed optimization methods. The experimental results are arranged into different tables grouped by function type.
4.1 Benchmark Data and Settings
First, we introduce the benchmark datasets used in our experiments. The benchmarks are adopted to compare HXPY's speed with other analysis frameworks and to clearly illustrate how it performs. We have two benchmark datasets, i.e., the small dataset and the large dataset. The small dataset consists of 16 years of daily stock data from 2006 to 2021 in the Chinese market, while the large dataset consists of 1000 days of 1-minute stock data in the Chinese market, as shown in Table 1. These datasets are compiled from the public data sources of the Chinese stock exchanges.
Table 1. Large and Small Datasets Used in Benchmark

| Dataset | Dates | Start Time | End Time | Daily Timestamps | Number of Rows | Number of Columns | Memory Size (GiB) | CSV Size (GiB) |
|---|---|---|---|---|---|---|---|---|
| Small | 3890 | 20060104 | 20211231 | 1 | 3890 | 4797 | 0.03 | 0.06 |
| Large | 1000 | 20180102 | 20220216 | 240 | 240000 | 4740 | 4.34 | 8.61 |

Note: The file size is measured in GiB, i.e., $1024^3$ (1 073 741 824) bytes.

We conduct our benchmark evaluation on a server equipped with two 64-core AMD 7742 CPUs. The compilers are GCC 11.2 and NVCC 11.6. The memory is 2 TB of 8-channel memory at 3200 MHz. We use eight NVIDIA A100 40 GB GPUs with NVLINK[32] enabled for the CUDA benchmark.
4.2 I/O Functions
In the I/O benchmark, we disable the OS's file cache and guarantee that only one process accesses the file at a time. Files are stored on a RAID array of eight PM1733 3.84 TB NVMe SSDs, whose maximum read and write speeds usually require multiple processes to saturate; thus, the disks do not bottleneck the read and write speeds. Consequently, the design of the file parsing pipeline is what differentiates the frameworks, and the Latin square design[33] is used to ensure that different frameworks take turns evenly and fairly in the I/O test.
For the CSV benchmark, firstly, a CSV file is generated at a given path. The benchmarked framework reads this file sequentially (where the sequence order differs across multiple experiments) and dumps it into another CSV file in order. For the binary file test, the dataframe is constructed by the read_csv function, the binary is dumped by HXPY and Pandas, and then the binary reading benchmark is conducted. For the binary benchmark of Pandas, besides Python's native pickle format, we also benchmark Apache Arrow, a very popular cross-language storage format.

The speedup results in Table 2 and Table 3 indicate that HXPY delivers state-of-the-art I/O performance. A single-threaded HXPY can outperform four-threaded Modin and is only slightly slower than eight-threaded Modin. Also, although Modin's binary read function is multi-threaded, its threaded binary I/O is even slower than the original Pandas.
Table 2. CSV File I/O Performance Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| read_csv | Small | 0.29 | 2.27 | 0.88 | 0.09 | 7.8x | 9.7x |
| read_csv | Large | 34.30 | 178.30 | 32.00 | 11.60 | 5.2x | 2.7x |
| to_csv | Small | 0.97 | 7.39 | 2.01 | 0.21 | 7.6x | 9.5x |
| to_csv | Large | 102.40 | 692.40 | 95.20 | 25.10 | 6.7x | 3.8x |

Note: The number in parentheses, as in HXPY(8), is the number of threads used in the benchmark. Speedup(n) is the performance gain of HXPY over its counterpart in the n-threaded benchmark. Bold numbers denote the best performance achieved among the packages.

Table 3. Binary File I/O Performance Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas#(1) | Pandas$(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|---|
| read_binary | Small | 0.03 | 0.17 | 0.050 | 0.18 | 0.009 | 1.6x | 20.0x |
| read_binary | Large | 1.50 | 5.90 | 2.700 | 7.18 | 0.380 | 1.8x | 18.9x |
| to_binary | Small | 0.03 | 0.41 | 0.046 | 0.63 | 0.030∘ | 1.5x | 21.0x |
| to_binary | Large | 1.30 | 8.00 | 2.900 | 8.77 | 1.300∘ | 2.2x | 6.7x |

Note: # denotes Pandas using the Arrow format. $ denotes Pandas using the Python pickle format. ∘ denotes that to_binary's multi-threaded version is not implemented in HXPY, so single-threaded results were used. Bold numbers denote the best performance achieved among the packages.

4.3 Element Functions
Element functions are a relatively simple type of function, and there are no dramatic differences in experimental speed among the frameworks for such functions. We benchmark four common element-wise functions: power, round, abs, and relu. Specifically, power calculates the power of numbers, while round is handy in finance since we often need to round the data or keep only a few significant digits (e.g., rounding stock prices to cents). Both abs and relu are used to eliminate negative values, where relu[34] transforms negative values to zero as defined below. Nonetheless, it can be concluded that HXPY achieves consistent speedups across dataframes of different sizes. For functions where Pandas does not have a native implementation, such as relu, HXPY's implementation through a C++ lambda function leads to a maximum 298x speedup.
\begin{split} relu(x) = \begin{cases} 0, & \text{if}\ x \leqslant 0, \\ x, & \text{if}\ x > 0. \end{cases} \end{split}

From the results in Table 4 we can find that power is a compute-intensive function and thus multi-threading improves it a lot (i.e., 6x to 7x with eight threads). However, the multi-threading speedup of relu and abs is less significant (2x to 3x with eight threads), where the memory bandwidth bound may be reached, considering the functions themselves are very simple. The speedup columns in the table represent HXPY's speedup compared with the corresponding counterpart, Pandas or Modin. Both single-threaded and multi-threaded versions of HXPY approach or substantially outperform their opponents.
Table 4. Element Functions Performance Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| power | Small | 0.180 | 0.240 | 0.120 | 0.028 | 1.30x | 4.30x |
| power | Large | 16.600 | 16.200 | 2.300 | 2.500 | 0.97x | 0.92x |
| round | Small | 0.020 | 0.450 | 0.670 | 0.010 | 22.00x | 67.00x |
| round | Large | 1.280 | 40.200 | 6.030 | 0.550 | 31.40x | 10.90x |
| abs | Small | 0.024 | 0.038 | 0.081 | 0.009 | 1.60x | 9.00x |
| abs | Large | 1.160 | 1.210 | 0.890 | 0.540 | 1.04x | 1.65x |
| relu | Small | 0.021 | 5.570 | 1.330 | 0.011 | 265.00x | 120.00x |
| relu | Large | 1.250 | 373.000 | 49.900 | 0.580 | 298.00x | 86.00x |

Note: Bold numbers denote the best performance achieved among the packages.

4.4 Row and Column Functions
Besides element functions, researchers often need to compute statistics on each timestamp or each security. These functions can provide researchers with insights into the data distribution of each cross-section or security symbol. The performance pursuit of row-wise and column-wise functions is intricate, since ensuring the continuity of row memory access inevitably sacrifices some column access performance. We test three axis-level functions: rank is a popular function in financial data analysis that obtains the ranking of different securities at every timestamp; std calculates the standard deviation, since volatility is a critical concept in finance; maxmin_scale is a useful row-wise function in finance for normalizing data to the range [0, 1].
On the one hand, the results in Table 5 show that in the row-wise functions (axis=1), HXPY always wins over Pandas, thanks to HXPY's row-major storage compared with Pandas' column-major storage. On the other hand, for column-wise functions (axis=0), HXPY is not as fast as Pandas on the large dataset. The reason behind this slowdown is the massive discontinuous memory accesses. Nevertheless, from an overall perspective, after evenly combining the times of column-wise and row-wise functions, HXPY still has a significant advantage in both its single-threaded and multi-threaded versions.
Table 5. Row-Wise and Column-Wise Functions Performance Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| rank(axis=1) | Small | 0.590 | 1.150 | 0.350 | 0.140 | 1.90x | 2.50x |
| rank(axis=1) | Large | 40.700 | 87.900 | 17.300 | 8.200 | 2.10x | 2.10x |
| rank(axis=0) | Small | 0.690 | 0.980 | 0.280 | 0.150 | 1.40x | 1.90x |
| rank(axis=0) | Large | 80.600 | 161.300 | 51.500 | 18.100 | 2.00x | 2.80x |
| std(axis=1) | Small | 0.017 | 0.351 | 0.098 | 0.002 | 20.60x | 49.00x |
| std(axis=1) | Large | 1.040 | 32.700 | 4.560 | 0.140 | 31.40x | 32.60x |
| std(axis=0) | Small | 0.093 | 0.126 | 0.076 | 0.012 | 1.30x | 6.30x |
| std(axis=0) | Large | 11.500 | 6.600 | 3.510 | 4.500 | 0.50x | 0.80x |
| maxmin_scale(axis=1) | Small | 0.040 | 1.270 | 0.260 | 0.030 | 31.70x | 8.70x |
| maxmin_scale(axis=1) | Large | 2.580 | 86.600 | 13.100 | 1.820 | 33.50x | 7.20x |
| maxmin_scale(axis=0) | Small | 0.300 | 1.290 | 0.280 | 0.050 | 4.30x | 5.60x |
| maxmin_scale(axis=0) | Large | 26.800 | 4.300 | 3.700 | 16.000 | 0.16x | 0.23x |

4.5 Time-Series Functions
When testing time-series functions, the length of the time window is usually critical; however, to date, both Pandas and HXPY implement window-independent time complexity algorithms for most windowed time-series functions via streaming algorithms. For instance, we can use two heaps to maintain the sliding-windowed median of an array. Since the running time is no longer related to T, we only need to benchmark one window size T for a fair comparison. We test the computational speed of the different frameworks with the same time window (T=10). Using T=10 is meaningful since 10 is not a multiple of 8, so the AVX speedup is negatively affected, which makes the speed comparison emphasize the internal algorithms more. Besides, the number 10 has practical significance in finance, where a 10-day or 10-minute aggregation window is often used for trend analysis.
Since time-series functions play a very critical role in financial data analysis, we test a larger number of functions in this subsection. Specifically, we test ts_sum, ts_std, and ts_max for simple statistics, ts_rank for ranking values within the time-series window, ts_corr (Equation (1)) for the windowed Pearson's correlation between two aligned dataframes, and ts_argmaxmin_diff (Equation (2)) for a more complicated statistic that calculates the index difference between the maximum value and the minimum value occurring in the time window; all are widely used as typical technical indicators. Sub-sampling along the time-series axis is also tested. Further analysis of the time-series function speedups with different methods can be found in Table 6.
Table 6. Incremental Analysis of the ts_corr Function

| Language | Optimization or Description | Execution Time on Small Dataset (s) |
|---|---|---|
| Python | Brute-force | 1422.0000 |
| C++ | Brute-force | 29.3000 |
| C and Python | Pandas's ts_corr | 2.9100 |
| C++ | Memory optimization | 1.6400 |
| C++ | Memory optimization + SIMD | 0.5600 |
| C++ | Memory optimization + SIMD + FNO | 0.4000 |
| C++ | Memory optimization + SIMD + FNO + SA | 0.1400 |
| C++ | Memory optimization + SIMD + FNO + SA + 16 threads | 0.0450 |
| CUDA | SA + 4790 CUDA threads | 0.0093 |

Note: FNO: floating number optimization enabled. SA: streaming algorithm enabled.

Let a, b be two time-series sequences. The time-series correlation is defined as
\begin{split} ts\_corr(a, b, t, T) = \frac{\text{cov}(a_{t-T+1:t},\ b_{t-T+1:t})}{\text{std}(a_{t-T+1:t}) \times \text{std}(b_{t-T+1:t})}, \end{split} (1)

where $a_{t-T+1:t} = (a_{t-T+1}, \dots, a_{t})$ is the sub-sequence from $t-T+1$ to $t$, and $\text{cov}$, $\text{std}$ are the sample covariance and standard deviation functions, respectively. The time-series argmaxmin difference is defined as
\begin{split} &ts\_argmaxmin\_diff(a, t, T)\\ = \;& {\rm argmin}_{t-T+1 \leqslant s \leqslant t} (a_s) - {\rm argmax}_{t-T+1 \leqslant s \leqslant t} (a_s). \end{split} (2)

From the results in Table 7, we can see that on most of the time-series functions, HXPY achieves significant performance improvements. However, HXPY is slightly inferior to its counterparts on the down-sampling functions, where discontinuous memory access becomes a bottleneck due to the reduced computational intensity.
Table 7. Time-Series Operations Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| ts_sum | Small | 0.078 | 0.51 | 0.187 | 0.040 | 6.50x | 4.6x |
| ts_sum | Large | 3.250 | 14.60 | 8.610 | 2.090 | 4.50x | 4.1x |
| ts_std | Small | 0.093 | 0.67 | 0.190 | 0.042 | 7.20x | 4.5x |
| ts_std | Large | 4.890 | 22.56 | 9.100 | 2.260 | 4.60x | 4.0x |
| ts_max | Small | 0.240 | 0.66 | 0.200 | 0.068 | 2.70x | 2.9x |
| ts_max | Large | 20.100 | 26.00 | 8.580 | 3.910 | 1.30x | 2.2x |
| ts_rank | Small | 1.850 | 2.91 | * | 0.400 | 1.50x | 7.2x |
| ts_rank | Large | 176.000 | 234.00 | * | 27.800 | 1.30x | 8.4x |
| ts_corr | Small | 0.140 | 2.39 | 4.130 | 0.050 | 17.10x | 82.6x |
| ts_corr | Large | 8.510 | 84.00 | 121.700 | 2.770 | 9.90x | 43.9x |
| ts_argmaxmin_diff | Small | 0.430 | 901.00 | 176.600 | 0.100 | 2100.00x | 1760.0x |
| ts_argmaxmin_diff | Large | 45.100 | ** | ** | 11.600 | N/A | N/A |
| ts_subsample_median | Small | 0.240 | 0.21 | 0.170 | 0.044 | 0.87x | 3.8x |
| ts_subsample_median | Large | 25.100 | 15.70 | 5.620 | 4.790 | 0.62x | 1.2x |

Note: * denotes that rolling().rank() is not supported in Modin 0.14. ** denotes too slow (more than one hour) to measure.

4.6 Grouped Functions
Grouped functions are helpful when the analysis is conducted across different sectors. However, we only test grouped functions on the daily (small) dataset, since the industry sector data of Chinese stocks is only updated at a daily frequency. We achieve a significant speedup of about 15x to 30x single-threaded and 130x to 200x multi-threaded on the daily dataset.
The performance improvement in Table 8 comes not only from HXPY's design but also from the fact that Python Pandas does not natively provide a two-dimensional (2D) matrix group semantic. As a result, users must first convert a 2D dataframe into a one-dimensional (1D) value array, add a column containing the 1D group array, perform the group operations, and convert the result back into a 2D dataframe. Such frequent shape transformations also bring a significant performance penalty.
Table 8. Grouped Functions Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| grouped_count | Small | 0.90 | 13.8 | 24.9 | 0.19 | 15.3x | 131x |
| grouped_max | Small | 0.96 | 19.7 | 30.8 | 0.20 | 20.5x | 154x |
| grouped_mean | Small | 1.00 | 30.1 | 40.3 | 0.20 | 30.1x | 201x |

4.7 Shape Manipulating Functions
We test standard shape manipulating scenarios, such as concatenating massive numbers of small files, aligning one dataframe to another, and other operations like selecting a subset of the index. In the concatenation benchmark, we split data into different dataframes and measure the time each package takes to merge them into one big dataframe. For the alignment benchmark, we align the small dataset to the large dataset or the large dataset to the small dataset to measure the performance of different frameworks on dataframe reshaping operations. For the re-indexing task, we randomly sample from the original index with 2% of index names entirely unseen (this increases the difficulty of the task because such non-existing rows involve additional memory allocation), and then measure the speed of different frameworks building new dataframes based on such unseen index names.
This type of function requires complex shape transformations, resulting in poor multi-threading speedups, as illustrated in Table 9. We find that Modin is much slower than the original Pandas in many cases, such as alignment and re-indexing. This is because many shaping operations are difficult to divide equally among threads, especially when designing exception handling for non-existing columns. Although HXPY can still speed up through multi-threading, the speedup ratio is far lower than the number of threads used.
Table 9. Shape Manipulation Functions Compared with Pandas and Modin

| Function | Operation | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| concat | 100 daily files | 0.044 | 3.970 | 0.910 | 0.038 | 90.20x | 23.9x |
| concat | 1000 daily files | 0.450 | 79.600 | 18.300 | 0.300 | 176.80x | 61.0x |
| align | Small to large | 1.090 | 1.860 | 6.110 | 0.470 | 1.70x | 13.0x |
| align | Large to small | 0.021 | 0.042 | 1.000 | 0.017 | 2.00x | 58.8x |
| reindex | 1000 index in small set | 0.005 | 0.003 | 0.066 | 0.002 | 0.66x | 33.0x |
| reindex | 10000 index in large set | 0.130 | 0.320 | 0.980 | 0.120 | 2.40x | 8.1x |

4.8 CUDA Functions
Since CUDA is becoming more and more popular in many data science scenarios, we benchmark the CUDA version of HXPY against NVIDIA's cuDF. We find that most functions on CUDA show notable performance improvements compared with the multi-threaded CPU version of HXPY. Specifically, simple mathematical functions such as abs and power are greatly accelerated in the CUDA version of HXPY. However, functions such as rank that require frequent abnormal-value processing and sorting show little acceleration compared with the CPU version. Table 10 presents the CUDA function performance comparison.
Table 10. CUDA Functions Compared with NVIDIA cuDF

| Function | Dataset | HXPY(8) | HXPY(CUDA) | cuDF(CUDA) | Speedup(CPU) | Speedup(CUDA) |
|---|---|---|---|---|---|---|
| abs | Small | 0.009 | 0.0021 | 0.27 | 4.3x | 128.0x |
| abs | Large | 0.540 | 0.0280 | 4.11 | 19.3x | 146.0x |
| power | Small | 0.028 | 0.0020 | 0.81 | 14.0x | 405.0x |
| power | Large | 2.500 | 0.0360 | 4.54 | 69.4x | 126.0x |
| rank(axis=0) | Small | 0.150 | 0.1500 | 1.96 | 1.0x | 13.1x |
| rank(axis=0) | Large | 18.100 | 9.4500 | 19.40 | 1.9x | 2.0x |
| rank(axis=1) | Small | 0.140 | 0.1500 | + | 0.9x | N/A |
| rank(axis=1) | Large | 8.200 | 8.8300 | + | 0.9x | N/A |
| ts_sum | Small | 0.040 | 0.0043 | 0.47 | 9.3x | 109.0x |
| ts_sum | Large | 2.090 | 0.2000 | 4.58 | 10.4x | 22.9x |
| ts_corr | Small | 0.050 | 0.0093 | # | 5.4x | N/A |
| ts_corr | Large | 2.770 | 0.5000 | # | 5.5x | N/A |

Note: HXPY(CUDA) and cuDF(CUDA) were benchmarked on a single NVIDIA A100 40 GB with CUDA 11.6. Speedup(CPU) compares HXPY's CUDA version with HXPY's CPU multi-threaded version; Speedup(CUDA) compares HXPY's CUDA version with cuDF's CUDA version. + denotes that rank() along axis=1 is not supported in cuDF 2022.4, and # denotes that rolling().corr() is not supported in cuDF 2022.4.

Due to the size of CUDA memory and the difficulty of programming, both NVIDIA cuDF and HXPY only implement a subset of their functions on CUDA. For instance, only the column-wise version of the rank function is implemented in cuDF; therefore, row-wise analysis on cuDF dataframes is unavailable at this moment. Besides, it seems that cuDF does not have specific optimizations for floating-point financial time-series data, as many functions are not even as fast as the CPU version of Pandas.
5. HXPY Package
Section 4 demonstrates the powerful performance of our enhanced functions. In this section, we briefly describe how we turn our research exploration into an industry-grade software package.
5.1 Backend
A dataframe class, as in Python Pandas, typically contains an index, a list-like set of columns, and the storage of a dense matrix. The elements in the index and columns can be sequential strings, integers, or DateTime values, while the value types can be homogeneous or heterogeneous (usually using column storage, where different columns may have different types).
In HXPY, all of the above components are implemented in C++ and provided with multiple backbones. For instance, the original index implementation of Pandas is a linear vector structure, while HXPY provides both vector and tree-based data structures for faster search. As for storage, HXPY also provides unmanaged storage, which can directly construct objects from already allocated memory.
In the Python interface, HXPY supports interchange with Python Pandas and NumPy[35]. Therefore, researchers can use HXPY to process the computationally intensive part of the data and then convert HXPY objects into Pandas objects, so that they can still utilize Python's rich data analysis ecosystem for plotting and other operations. Such two-way conversion offers researchers flexibility and ease of use.
5.2 Memory Layout
HXPY uses an aligned, contiguous memory layout to enable SIMD instructions. It emphasizes memory alignment because misaligned memory brings additional overhead for reading and writing data on both the CPU and the GPU, as illustrated in Fig.22. Intrusive pointers are used for reference counting to achieve shallow copies of objects within the same process, in the several cases where avoiding deep copies brings performance gains. Atomic operations are applied to guarantee the thread safety of reference counting in a multi-threaded environment. The memory copy between CUDA memory and the host is based on CUDA runtime APIs.
Figure 22. A misaligned memory address might cause performance loss.

5.3 Execution and Dispatching
Since HXPY supports multiple data types and device types, it is crucial to design and implement a dynamic verification mechanism to check the validity of operators and operands at run-time. Here, we refer to the implementation of PyTorch[36], a prevalent tensor library widely used in deep learning, which introduced the registration mechanism of operators. We also implement reflection based on modern C++ features, calling functions based on the string names of operators. If a user encounters an unimplemented operator at run-time, a warning is thrown and handled as detailed in Subsection 5.4. Users can also register new operators at run-time.
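A minimal sketch of such a string-keyed operator registry is given below; the operator signature is hypothetical and far simpler than HXPY's actual dispatch tables.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical operator signature for the sketch.
using Op = std::function<void(void* dst, const void* src, std::size_t n)>;

std::unordered_map<std::string, Op>& registry() {
    static std::unordered_map<std::string, Op> ops;  // name -> implementation
    return ops;
}

void register_op(const std::string& name, Op op) {
    registry()[name] = std::move(op);  // run-time registration by string name
}

const Op& resolve(const std::string& name) {
    auto it = registry().find(name);
    if (it == registry().end())  // unimplemented operator: report and handle
        throw std::runtime_error("unimplemented operator: " + name);
    return it->second;
}
```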
As for task distribution, we implement multi-threaded task invocation based on OpenMP[37]. Each sub-thread processes only part of the dataframe. The partitioning rule is dynamically generated according to the function type and the data size to ensure memory access efficiency. For CUDA objects, a CUDA stream takes charge of generating and scheduling work on each CUDA device.
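A minimal sketch of OpenMP-based column partitioning follows, assuming a column-sliced split so each worker touches a disjoint block of the row-major storage; HXPY's actual partitioning rule is generated dynamically as described above.

```cpp
#include <cstddef>
#include <omp.h>

// Sketch: statically split columns across OpenMP threads; kernel(begin, end)
// processes the half-open column range [begin, end).
template <typename Kernel>
void for_each_column_block(std::size_t n_cols, Kernel kernel) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        std::size_t chunk = (n_cols + nth - 1) / nth;
        std::size_t begin = static_cast<std::size_t>(tid) * chunk;
        std::size_t end = begin + chunk < n_cols ? begin + chunk : n_cols;
        if (begin < n_cols)
            kernel(begin, end);
    }
}
```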
5.4 Runtime Error Handling
When executing any function, various errors can occur. Some errors are acceptable, such as taking sqrt of a negative number, which only results in a numerical anomaly in the result, while others may prevent producing any result at all, such as adding two dataframes that do not have the same size.
Before and during the execution of a function, HXPY performs checks to ensure the reasonableness and integrity of the results. HXPY uses the newly introduced source location feature of C++20, so that all error reports carry debugging information such as line numbers and source file names. In Python, exceptions raised in HXPY's C++ sources are converted into exceptions that can be caught and handled by Python's native error handling mechanisms. Besides, there is a global switch to turn off warning messages, providing cleaner outputs for expert users.
5.5 Cross Language Build System
C++ is a statically typed language and needs to be compiled before execution, which conflicts with users' need to see analysis results interactively. Thus, it is essential to find an appropriate approach to interpreting and executing analysis statements. Pybind11 provides a modern and elegant way to expose C++ types in Python and vice versa. Many high-performance computing packages use Pybind11 to provide a more flexible method of programming, such as OpenFOAM, a solver for computational mechanics[38], and the SEAL library, a widespread implementation of fully homomorphic encryption[39]. HXPY also uses Pybind11 to enable its Python interface. Besides this, HXPY provides different compiled binaries for CPU and CUDA based on macro definitions to support different device platforms. We can use macro definitions to support AMD CPUs without AVX512, or ARM CPUs with NEON instructions.
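A minimal Pybind11 binding sketch is given below; the module and function names are illustrative only, not HXPY's actual API.

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <vector>

// Sketch: exposing a trivial C++ function to Python; pybind11/stl.h converts
// a Python list of floats into std::vector<double> automatically.
static double vector_sum(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

PYBIND11_MODULE(hxpy_demo, m) {
    m.doc() = "Minimal binding example";
    m.def("vector_sum", &vector_sum, "Sum a list of floats from Python");
}
```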
5.6 Docker Distribution and Documentation
Since we adopt a very recent compiler (GCC 11.2), the resulting binaries are not compatible with the glibc and libstdc++ versions shipped by common systems. At the same time, inconsistent CUDA versions can also cause run-time errors. To tackle this issue, we no longer distribute binary packages such as Python wheels or C shared libraries directly, but distribute optimized docker containers[40] instead, since in practice users otherwise have to spend time resolving various library and environment dependency problems; this is also similar to the practice of NVIDIA's cuDF. In our docker containers, various environment variables are set appropriately for optimal performance.
All functions in HXPY are documented with code examples in both C++ and Python. Annotation information such as function signatures and argument types is also embedded in the Python package and can thus be retrieved with Python's help function. Based on the documents, users can troubleshoot problems by themselves, check the supported functions, or perform performance tuning.
5.7 Custom Functions
For advanced users, such as financial researchers who want to implement new financial time-series functions or override the behavior of existing ones, HXPY reserves interfaces based on function pointers for various function types. Users only need to implement a lambda function to benefit from HXPY's efficient execution framework, including automatic multi-threading. This customization can be implemented in both Python and C++, as shown in Fig.23 and Fig.24 respectively, which enhances the extensibility of HXPY. In the near future, we also plan to provide run-time just-in-time (JIT) compilation of functions and a dynamic operator registration system.
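The sketch below conveys the idea with hypothetical names: the user supplies only an element-wise lambda, while the framework owns the execution loop and applies its automatic multi-threading.

```cpp
#include <functional>
#include <vector>

// The user-facing extension point: an element-wise function object.
using ElementFn = std::function<double(double)>;

// Framework-side driver (hypothetical): applies the user's lambda with
// the framework's own multi-threaded loop (compile with -fopenmp).
void apply_custom(const std::vector<double>& in, std::vector<double>& out,
                  const ElementFn& fn) {
    #pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(in.size()); ++i)
        out[i] = fn(in[i]);
}

// Usage: a user-defined "leaky ReLU" without touching framework internals.
// std::vector<double> out(in.size());
// apply_custom(in, out, [](double x) { return x > 0 ? x : 0.01 * x; });
```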
6. Conclusions
In this work, we proposed HXPY, a high-performance package for processing financial time-series data efficiently. HXPY implements a new dataframe architecture that supports both CPU and GPU backends, is optimized especially for financial applications, and offers a user-friendly interface similar to that of Python Pandas. Unlike previous dataframes that use column storage[14, 18], HXPY adopts row storage and supports a large number of streaming algorithms, so that statistical calculations over different time cross-sections can be vectorized along the row dimension. As a result, except for a few functions that access columns contiguously, the performance of HXPY clearly surpasses that of Pandas.
Compared with Modin, we found that Modin's multi-threading acceleration does not always scale linearly, and the Modin version of some functions is even slower than Pandas. This is because Modin uses the Ray[17] backend, which communicates over sockets, a relatively heavyweight approach. HXPY instead uses the more lightweight multi-threading provided by OpenMP[37], so its performance improves nearly linearly with the number of threads. In addition, the homogeneous storage design makes dataframe partitioning easier: HXPY usually only needs to partition along the index or the columns, while Modin needs to partition along both axes.
In the CUDA world, the performance of HXPY significantly exceeds that of NVIDIA cuDF. Although neither framework yet supports a sufficient set of operations on CUDA, and both are currently restricted by the limits of GPU memory, we are optimistic about data analysis on CUDA. Both HXPY and cuDF are still preliminary, and require effective partitioning and further work such as kernel tuning. Nevertheless, we believe that more financial data calculations will be moved onto GPUs as GPU memory and the set of supported operators grow.
Some industry partners are also making endeavors to improve the application of this framework. At present, the framework has been applied to financial data calculations for the stock and futures markets: tens of thousands of CPU cores and hundreds of GPUs are using HXPY for the cleaning, sorting, and feature calculation of various types of financial data from global exchanges, and this deployment has been running continuously for several months.
In the future, we plan to optimize and improve HXPY into a high-quality, open-sourced, and long-lasting project. We hope our work offers innovative thoughts, possibilities, and inspiration to both academic research and industrial applications in finance, and that the HXPY package proves valuable to researchers and practitioners in the field of computational finance.
-
Figure 4. NVIDIA GA100 Ampere architecture, which contains thousands of CUDA cores on chip.

Figure 5. A typical layout of a financial dataframe. The first column, consisting of datetimes, is called the index, and the first row, containing ticker name strings like "000001", is called the column. "000001" is the abbreviation of "000001.XSHE", i.e., the stock whose trading code is 000001 on the Shenzhen Stock Exchange, issued by Ping An Bank Co., Ltd. All floating-point values are the stocks' daily close prices.

Figure 9. Transposing storage to make column-wise functions access memory contiguously.

Figure 14. Transforming the memory layout to accelerate calculation.

Figure 16. Distributing rows or columns to different threads.

Figure 22. A misaligned memory address might cause performance loss.

Table 1. Large and Small Datasets Used in Benchmark
| Dataset | Dates | Start Time | End Time | Daily Timestamps | Number of Rows | Number of Columns | Memory Size (GiB) | CSV Size (GiB) |
|---|---|---|---|---|---|---|---|---|
| Small | 3890 | 20060104 | 20211231 | 1 | 3890 | 4797 | 0.03 | 0.06 |
| Large | 1000 | 20180102 | 20220216 | 240 | 240000 | 4740 | 4.34 | 8.61 |

Note: The file size is measured in GiB, i.e., 1024^3 (1 073 741 824) bytes.

Table 2. CSV File I/O Performance Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| read_csv | Small | 0.29 | 2.27 | 0.88 | 0.09 | 7.8x | 9.7x |
| read_csv | Large | 34.30 | 178.30 | 32.00 | 11.60 | 5.2x | 2.7x |
| to_csv | Small | 0.97 | 7.39 | 2.01 | 0.21 | 7.6x | 9.5x |
| to_csv | Large | 102.40 | 692.40 | 95.20 | 25.10 | 6.7x | 3.8x |

Note: The number in parentheses, e.g., HXPY(8), is the number of threads used in the benchmark. Speedup(n) denotes the speedup of HXPY over its counterpart in the n-threaded benchmark.

Table 3. Binary File I/O Performance Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas#(1) | Pandas$(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|---|
| read_binary | Small | 0.03 | 0.17 | 0.050 | 0.18 | 0.009 | 1.6x | 20.0x |
| read_binary | Large | 1.50 | 5.90 | 2.700 | 7.18 | 0.380 | 1.8x | 18.9x |
| to_binary | Small | 0.03 | 0.41 | 0.046 | 0.63 | 0.030° | 1.5x | 21.0x |
| to_binary | Large | 1.30 | 8.00 | 2.900 | 8.77 | 1.300° | 2.2x | 6.7x |

Note: # denotes Pandas using the Arrow format; $ denotes Pandas using the Python Pickle format; ° denotes that the multi-threaded version of to_binary is not implemented in HXPY, so single-threaded results were used.

Table 4. Element Functions Performance Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| power | Small | 0.180 | 0.240 | 0.120 | 0.028 | 1.30x | 4.30x |
| power | Large | 16.600 | 16.200 | 2.300 | 2.500 | 0.97x | 0.92x |
| round | Small | 0.020 | 0.450 | 0.670 | 0.010 | 22.00x | 67.00x |
| round | Large | 1.280 | 40.200 | 6.030 | 0.550 | 31.40x | 10.90x |
| abs | Small | 0.024 | 0.038 | 0.081 | 0.009 | 1.60x | 9.00x |
| abs | Large | 1.160 | 1.210 | 0.890 | 0.540 | 1.04x | 1.65x |
| relu | Small | 0.021 | 5.570 | 1.330 | 0.011 | 265.00x | 120.00x |
| relu | Large | 1.250 | 373.000 | 49.900 | 0.580 | 298.00x | 86.00x |

Table 5. Row-Wise and Column-Wise Functions Performance Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| rank(axis=1) | Small | 0.590 | 1.150 | 0.350 | 0.140 | 1.90x | 2.50x |
| rank(axis=1) | Large | 40.700 | 87.900 | 17.300 | 8.200 | 2.10x | 2.10x |
| rank(axis=0) | Small | 0.690 | 0.980 | 0.280 | 0.150 | 1.40x | 1.90x |
| rank(axis=0) | Large | 80.600 | 161.300 | 51.500 | 18.100 | 2.00x | 2.80x |
| std(axis=1) | Small | 0.017 | 0.351 | 0.098 | 0.002 | 20.60x | 49.00x |
| std(axis=1) | Large | 1.040 | 32.700 | 4.560 | 0.140 | 31.40x | 32.60x |
| std(axis=0) | Small | 0.093 | 0.126 | 0.076 | 0.012 | 1.30x | 6.30x |
| std(axis=0) | Large | 11.500 | 6.600 | 3.510 | 4.500 | 0.50x | 0.80x |
| maxmin_scale(axis=1) | Small | 0.040 | 1.270 | 0.260 | 0.030 | 31.70x | 8.70x |
| maxmin_scale(axis=1) | Large | 2.580 | 86.600 | 13.100 | 1.820 | 33.50x | 7.20x |
| maxmin_scale(axis=0) | Small | 0.300 | 1.290 | 0.280 | 0.050 | 4.30x | 5.60x |
| maxmin_scale(axis=0) | Large | 26.800 | 4.300 | 3.700 | 16.000 | 0.16x | 0.23x |

Table 6. Incremental Analysis of the ts_corr Function
| Language | Optimization or Description | Execution Time on Small Dataset (s) |
|---|---|---|
| Python | Brute-force | 1422.0000 |
| C++ | Brute-force | 29.3000 |
| C and Python | Pandas's ts_corr | 2.9100 |
| C++ | Memory optimization | 1.6400 |
| C++ | Memory optimization + SIMD | 0.5600 |
| C++ | Memory optimization + SIMD + FNO | 0.4000 |
| C++ | Memory optimization + SIMD + FNO + SA | 0.1400 |
| C++ | Memory optimization + SIMD + FNO + SA + 16 threads | 0.0450 |
| CUDA | SA + 4790 CUDA threads | 0.0093 |

Note: FNO: floating number optimization enabled; SA: streaming algorithm enabled.

Table 7. Time-Series Operations Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| ts_sum | Small | 0.078 | 0.51 | 0.187 | 0.040 | 6.50x | 4.6x |
| ts_sum | Large | 3.250 | 14.60 | 8.610 | 2.090 | 4.50x | 4.1x |
| ts_std | Small | 0.093 | 0.67 | 0.190 | 0.042 | 7.20x | 4.5x |
| ts_std | Large | 4.890 | 22.56 | 9.100 | 2.260 | 4.60x | 4.0x |
| ts_max | Small | 0.240 | 0.66 | 0.200 | 0.068 | 2.70x | 2.9x |
| ts_max | Large | 20.100 | 26.00 | 8.580 | 3.910 | 1.30x | 2.2x |
| ts_rank | Small | 1.850 | 2.91 | * | 0.400 | 1.50x | 7.2x |
| ts_rank | Large | 176.000 | 234.00 | * | 27.800 | 1.30x | 8.4x |
| ts_corr | Small | 0.140 | 2.39 | 4.130 | 0.050 | 17.10x | 82.6x |
| ts_corr | Large | 8.510 | 84.00 | 121.700 | 2.770 | 9.90x | 43.9x |
| ts_argmaxmin_diff | Small | 0.430 | 901.00 | 176.600 | 0.100 | 2100.00x | 1760.0x |
| ts_argmaxmin_diff | Large | 45.100 | ** | ** | 11.600 | N/A | N/A |
| ts_subsample_median | Small | 0.240 | 0.21 | 0.170 | 0.044 | 0.87x | 3.8x |
| ts_subsample_median | Large | 25.100 | 15.70 | 5.620 | 4.790 | 0.62x | 1.2x |

Note: * denotes that rolling().rank() is not supported in Modin 0.14; ** denotes too slow (more than one hour) to measure.

Table 8. Grouped Functions Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| grouped_count | Small | 0.90 | 13.8 | 24.9 | 0.19 | 15.3x | 131x |
| grouped_max | Small | 0.96 | 19.7 | 30.8 | 0.20 | 20.5x | 154x |
| grouped_mean | Small | 1.00 | 30.1 | 40.3 | 0.20 | 30.1x | 201x |

Table 9. Shape Manipulation Functions Compared with Pandas and Modin
| Function | Operation | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| concat | 100 daily files | 0.044 | 3.970 | 0.910 | 0.038 | 90.20x | 23.9x |
| concat | 1000 daily files | 0.450 | 79.600 | 18.300 | 0.300 | 176.80x | 61.0x |
| align | Small to large | 1.090 | 1.860 | 6.110 | 0.470 | 1.70x | 13.0x |
| align | Large to small | 0.021 | 0.042 | 1.000 | 0.017 | 2.00x | 58.8x |
| reindex | 1000 indices in small set | 0.005 | 0.003 | 0.066 | 0.002 | 0.66x | 33.0x |
| reindex | 10000 indices in large set | 0.130 | 0.320 | 0.980 | 0.120 | 2.40x | 8.1x |

Table 10. CUDA Functions Compared with NVIDIA cuDF
| Function | Dataset | HXPY(8) | HXPY(CUDA) | cuDF(CUDA) | Speedup(CPU) | Speedup(CUDA) |
|---|---|---|---|---|---|---|
| abs | Small | 0.009 | 0.0021 | 0.27 | 4.3x | 128.0x |
| abs | Large | 0.540 | 0.0280 | 4.11 | 19.3x | 146.0x |
| power | Small | 0.028 | 0.0020 | 0.81 | 14.0x | 405.0x |
| power | Large | 2.500 | 0.0360 | 4.54 | 69.4x | 126.0x |
| rank(axis=0) | Small | 0.150 | 0.1500 | 1.96 | 1.0x | 13.1x |
| rank(axis=0) | Large | 18.100 | 9.4500 | 19.40 | 1.9x | 2.0x |
| rank(axis=1) | Small | 0.140 | 0.1500 | + | 0.9x | N/A |
| rank(axis=1) | Large | 8.200 | 8.8300 | + | 0.9x | N/A |
| ts_sum | Small | 0.040 | 0.0043 | 0.47 | 9.3x | 109.0x |
| ts_sum | Large | 2.090 | 0.2000 | 4.58 | 10.4x | 22.9x |
| ts_corr | Small | 0.050 | 0.0093 | # | 5.4x | N/A |
| ts_corr | Large | 2.770 | 0.5000 | # | 5.5x | N/A |

Note: HXPY(CUDA) and cuDF(CUDA) were benchmarked on a single NVIDIA A100 40 GB with CUDA 11.6. Speedup(CPU) compares HXPY's CUDA version with HXPY's CPU multi-threaded version; Speedup(CUDA) compares HXPY's CUDA version with cuDF's CUDA version. + denotes that rank() along axis=1 is not supported in cuDF 2022.4, and # denotes that rolling().corr() is not supported in cuDF 2022.4.

-
[1] Farnoosh A, Azari B, Ostadabbas S. Deep switching auto-regressive factorization: Application to time series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(8): 7394–7403. DOI: 10.1609/aaai.v35i8.16907.
[2] Rasul K, Seward C, Schuster I, Vollgraf R. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In Proc. the 38th International Conference on Machine Learning, Jul. 2021, pp.8857–8868.
[3] Pan Q Y, Hu W B, Chen N. Two birds with one stone: Series saliency for accurate and interpretable multivariate time series forecasting. In Proc. the 30th International Joint Conference on Artificial Intelligence, Aug. 2021, pp.2884–2891. DOI: 10.24963/ijcai.2021/397.
[4] Lee D, Lee S, Yu H. Learnable dynamic temporal pooling for time-series classification. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(9): 8288–8296. DOI: 10.1609/aaai.v35i9.17008.
[5] Mbouopda M F. Uncertain time series classification. In Proc. the 30th International Joint Conference on Artificial Intelligence, Aug. 2021, pp.4903–4904. DOI: 10.24963/ijcai.2021/683.
[6] Yue Z H, Wang Y J, Duan J Y, Yang T M, Huang C R, Tong Y H, Xu B X. TS2Vec: Towards universal representation of time series. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(8): 8980–8987. DOI: 10.1609/aaai.v36i8.20881.
[7] Eldele E, Ragab M, Chen Z H, Wu M, Kwoh C K, Li X L, Guan C T. Time-series representation learning via temporal and contextual contrasting. In Proc. the 30th International Joint Conference on Artificial Intelligence, Aug. 2021, pp.2352–2359. DOI: 10.24963/ijcai.2021/324.
[8] Deng A L, Hooi B. Graph neural network-based anomaly detection in multivariate time series. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(5): 4027–4035. DOI: 10.1609/aaai.v35i5.16523.
[9] Kim S, Choi K, Choi H S, Lee B, Yoon S. Towards a rigorous evaluation of time-series anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(7): 7194–7201. DOI: 10.1609/aaai.v36i7.20680.
[10] McGowan M J. The rise of computerized high frequency trading: Use and controversy. Duke L. & Tech. Rev., 2010, 16.
[11] Yang X, Liu W Q, Zhou D, Bian J, Liu T Y. Qlib: An AI-oriented quantitative investment platform. arXiv: 2009.11189, 2021. https://arxiv.org/abs/2009.11189, Dec. 2022.
[12] Ding Q G, Wu S F, Sun H, Guo J D, Guo J. Hierarchical multi-scale Gaussian transformer for stock movement prediction. In Proc. the 29th International Joint Conference on Artificial Intelligence, Jul. 2020, pp.4640–4646. DOI: 10.24963/ijcai.2020/640.
[13] Wang J Y, Zhang Y, Tang K, Wu J J, Xiong Z. Alphastock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In Proc. the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Jul. 2019, pp.1900–1908. DOI: 10.1145/3292500.3330647.
[14] McKinney W. Pandas: A foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing, 2011, 14(9): 1–9.
[15] Petersohn D. Dataframe systems: Theory, architecture, and implementation. Technical Report No. UCB/EECS-2021-193, University of California, Berkeley, 2021. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-193.html, Dec. 2022.
[16] Petersohn D, Macke S, Xin D, Ma W, Lee D J L, Mo X X, Gonzalez J E, Hellerstein J M, Joseph A D, Ganesh A. Towards scalable dataframe systems. Proceedings of the VLDB Endowment, 2020, 13(12): 203–204. DOI: 10.14778/3407790.3407807.
[17] Moritz P, Nishihara R, Wang S, Tumanov A, Liaw R, Liang E, Elibol M, Yang Z H, Paul W, Jordan M I, Stoica I. Ray: A distributed framework for emerging AI applications. In Proc. the 13th USENIX Symposium on Operating Systems Design and Implementation, Oct. 2018, pp.561–577. https://www.usenix.org/system/files/osdi18-moritz.pdf, Jan. 2023.
[18] Petersohn D, Tang D X, Durrani R, Melik-Adamyan A, Gonzalez J E, Joseph A D, Parameswaran A G. Flexible rule-based decomposition and metadata independence in modin: A parallel dataframe system. Proceedings of the VLDB Endowment, 2021, 15(3): 739–751. DOI: 10.14778/3494124.3494152.
[19] Hord R M. The Illiac IV: The First Supercomputer. Springer Science & Business Media, 2013.
[20] Langdale G, Lemire D. Parsing gigabytes of JSON per second. The VLDB Journal, 2019, 28(6): 941–960. DOI: 10.1007/s00778-019-00578-5.
[21] Watanabe H, Nakagawa K M. SIMD vectorization for the Lennard-Jones potential with AVX2 and AVX-512 instructions. Computer Physics Communications, 2019, 237: 1–7. DOI: 10.1016/j.cpc.2018.10.028.
[22] Kahan W. IEEE standard 754 for binary floating-point arithmetic. Lecture Notes on the Status of IEEE 754, 1996.
[23] Ihaka R, Gentleman R. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 1996, 5(3): 299–314. DOI: 10.2307/1390807.
[24] Bezanson J, Edelman A, Karpinski S, Shah V B. Julia: A fresh approach to numerical computing. SIAM Review, 2017, 59(1): 65–98. DOI: 10.1137/141000671.
[25] Duvinage M, Mazza P, Petitjean M. The intra-day performance of market timing strategies and trading systems based on Japanese candlesticks. Quantitative Finance, 2013, 13(7): 1059–1070. DOI: 10.1080/14697688.2013.768774.
[26] Nelson D M Q, Pereira A C M, De Oliveira R A. Stock market’s price movement prediction with LSTM neural networks. In Proc. International Joint Conference on Neural Networks (IJCNN), May 2017, pp.1419–1426. DOI: 10.1109/IJCNN.2017.7966019.
[27] Tummon E, Raja M A, Ryan C. Trading cryptocurrency with deep deterministic policy gradients. In Proc. the 21st International Conference on Intelligent Data Engineering and Automated Learning, Nov. 2020, pp.245–256. DOI: 10.1007/978-3-030-62362-3_22.
[28] De Guzman J, Nuffer D. The Spirit parser library: Inline parsing in C++. C/C++ Users Journal, 2003, 21(9): 22–46.
[29] Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 1999, 58(1): 137–147. DOI: 10.1006/jcss.1997.1545.
[30] Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K. Apache Flink™: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 36(4): 28–38.
[31] Iqbal M H, Soomro T R. Big data analysis: Apache storm perspective. International Journal of Computer Trends and Technology, 2015, 19(1): 9–14. DOI: 10.14445/22312803/IJCTT-V19P103.
[32] Foley D, Danskin J. Ultra-performance Pascal GPU and NVLink interconnect. IEEE Micro, 2017, 37(2): 7–17. DOI: 10.1109/MM.2017.37.
[33] Grant D A. The Latin square principle in the design and analysis of psychological experiments. Psychological Bulletin, 1948, 45(5): 427–442. DOI: 10.1037/h0053912.
[34] Agarap A F. Deep learning using rectified linear units (ReLU). arXiv: 1803.08375, 2018. https://arxiv.org/abs/1803.08375, Dec. 2022.
[35] Harris C R, Millman K J, van der Walt S J, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith N J, Kern R, Picus M, Hoyer S, van Kerkwijk M H, Brett M, Haldane A, del Río J F, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant T E. Array programming with NumPy. Nature, 2020, 585(7825): 357–362. DOI: 10.1038/s41586-020-2649-2.
[36] Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z M, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J J, Chintala S. PyTorch: An imperative style, high-performance deep learning library. In Proc. the 33rd International Conference on Neural Information Processing Systems, Dec. 2019, Article No. 712.
[37] Dagum L, Menon R. OpenMP: An industry standard API for shared-memory programming. IEEE Computational Science and Engineering, 1998, 5(1): 46–55. DOI: 10.1109/99.660313.
[38] Rodriguez S, Cardiff P. A general approach for running Python codes in OpenFOAM using an embedded Pybind11 Python interpreter. OpenFOAM® Journal, 2022, 2: 166–182. DOI: 10.51560/ofj.v2.79.
[39] Titus A J, Kishore S, Stavish T, Rogers S M, Ni K. PySEAL: A Python wrapper implementation of the SEAL homomorphic encryption library. arXiv: 1803.01891, 2018. https://arxiv.org/abs/1803.01891, Dec. 2022.
[40] Anderson C. Docker [software engineering]. IEEE Software, 2015, 32(3): 102-c3. DOI: 10.1109/MS.2015.62.
-