HXPY: An Efficient Package for Processing Financial Time-Series Data

Abstract:
Background: Massive volumes of financial time-series data are generated around the world every day, and such data must be analyzed quickly to deliver maximum value; many academic studies and industrial applications likewise call for a high-performance financial time-series computing framework. However, traditional financial time-series computing frameworks fall short in both performance and functional coverage, and make poor use of multi-threaded computation and CUDA.
Objective: This paper aims to provide a new financial time-series computing framework that is compatible with the Python Pandas interface, optimizes single-threaded performance, supports multi-threading and CUDA for computational acceleration, and implements more financial time-series functions.
Methods: This paper proposes HXPY, a new financial time-series computing framework. Based on techniques such as single instruction multiple data (SIMD), streaming algorithms, and memory layout optimization, the corresponding functions are implemented and optimized in modern C++ and exposed through a near-native Python interface that also supports conversion to and from other Python libraries.
Results: HXPY shows significant performance advantages. In single-threaded comparisons, HXPY achieves a 5x-10x speedup over Python Pandas on text file I/O, a 2x-3000x speedup on time-series functions, and a 15x-200x speedup on grouped functions. Meanwhile, in multi-threaded tests HXPY outperforms the Ray-based Modin by 2x-200x, and in CUDA tests it outperforms NVIDIA's cuDF by 2x-400x.
Conclusions: HXPY implements a new dataframe structure that can process and compute financial time-series data efficiently; in an incremental analysis we observe substantial performance improvements from the initial version to the version with all optimizations enabled. HXPY has strong practical value and is already under internal testing and use at several research institutions and partners. In the future, we will keep optimizing it and add support for more functions.

Keywords:
- dataframe /
- time-series data /
- single instruction multiple data (SIMD) /
- compute unified device architecture (CUDA)
Abstract: A tremendous amount of data is generated by global financial markets every day, and such time-series data needs to be analyzed in real time to explore its potential value. In recent years, we have witnessed the successful adoption of machine learning models on financial data, where the demand for accuracy and timeliness calls for highly effective computing frameworks. However, traditional financial time-series data processing frameworks have shown performance degradation and adaptation issues, such as the handling of outliers caused by stock suspensions in Pandas and TA-Lib. In this paper, we propose HXPY, a high-performance data processing package with a C++/Python interface for financial time-series data. HXPY supports miscellaneous acceleration techniques such as streaming algorithms, vectorized instruction sets, and memory optimization, together with various functions such as time window functions, group operations, down-sampling operations, cross-section operations, row-wise or column-wise operations, shape transformations, and alignment functions. The results of our benchmark and incremental analysis demonstrate the superior performance of HXPY compared with its counterparts. From MiBs to GiBs of data, HXPY significantly outperforms other in-memory dataframe computing rivals, even by up to hundreds of times.
1. Introduction
Time-series data, composed of a sequence of data points indexed or listed in time order, is prevalent due to its success in modeling many real-life applications such as finance, meteorology, statistics, and economics, as well as sensors, the Internet of Things (IoT), and control systems. From a machine learning perspective, high-impact applications on time-series data can be categorized into the following extensively studied tasks: forecasting[1-3], classification[4, 5], representation learning[6, 7], and time-series anomaly detection (TAD)[8, 9].
In the time-series data family, financial time-series data plays an essential role. It records the fluctuation of specific characteristics of a particular target over time, such as the change of a particular stock index over time in Fig.1, the implied return rate of a particular bond at different dates, or the revenues of a listed company in different quarters. Financial time-series data is characterized by immense volume and real-time velocity. A tremendous amount of financial time-series data is generated on a global scale at frequencies ranging from microseconds to seasons, and is then analyzed and utilized by numerous financial analysts to support decision-making.
Figure 1. A financial time-series data sample: daily open, high, low, and close prices of the CNI 2000 INDEX (399303) in April 2022. Red denotes that the day's close price is lower than the previous day's close price; green denotes that the day's close price is higher than the previous day's close price.

A typical application scenario of financial time-series data is quantitative trading. The data needs to be analyzed immediately after being collected from exchanges, and trading instructions are issued subsequently. This process is usually completed within microseconds to seconds[10], and any delay can cause loss of profit or even loss of capital. Therefore, the importance of accuracy and timeliness in financial time-series data processing demands highly effective computing frameworks.
Some efforts have been made toward financial time-series data processing, primarily based on Python due to its friendly syntax, rich interfaces with other programming languages, and machine learning ecosystem. Many recent studies have been conducted with Python, such as the finance platform Qlib by Microsoft[11], stock movement predictions[12], and reinforcement learning based trading[13]. Python packages such as Pandas[14] and TA-Lib were developed to process financial time series in Python. Among such approaches, the two-dimensional Pandas dataframe is more prevalent, since it can analyze the two dimensions of time and securities at the same time. Despite its flexibility and popularity, Pandas suffers from performance issues on large-scale data because of its inefficient single-threaded algorithm implementations and its lack of multi-threading and CUDA (Compute Unified Device Architecture) acceleration[15]. Despite what Pandas' multi-threaded counterpart Modin[16] and its CUDA analogue cuDF strive for, performance challenges such as the in-memory conversion between row and column storage and efficient partitioning algorithms stand in the way of further optimization. The experimental results in this paper show that some of their functions are optimized, while others may not be as fast as native Pandas. Therefore, with the ever-increasing volume of financial time-series data, the call for high-performance computing has become increasingly urgent.

This paper presents HXPY, a high-performance financial time-series data processing package in Python and C++ with abundant functions for I/O management, manipulation, and calculation, packaged in a user-friendly interface. As shown in Fig.2, HXPY consists of five layers, each with different components; it is written in several programming languages and combines various optimization methods. Our system keeps a consistent interface with the popular Pandas to be ready-to-hand, but with a different internal backend computation mechanism: SIMD (single instruction multiple data) instructions, memory optimizations, row-major storage, streaming algorithms, and efficient thread task division algorithms are used for intensive acceleration. As a result, our system achieves up to hundreds of times better single-threaded performance while also supporting multi-threading and CUDA acceleration. Moreover, we expand the function set of Pandas to support more fundamental operations on financial time-series data and meet the needs of financial analysis.
Our main contributions include the following.
1) We design and implement the HXPY package for high-performance financial time-series data processing.
2) We optimize I/O functions to achieve more than 5x speedup compared with Pandas.
3) We optimize time-series functions and achieve performance improvements of dozens of times.
4) We implement a multi-threaded version of the dataframe and achieve 2x to 200x performance improvement over Ray[17]-based Modin[18], a multi-threaded variant of Pandas.
5) We implement a CUDA version of the dataframe and achieve significant performance gains over NVIDIA's CuDF from 2x to 400x.
6) In addition, HXPY is not only an academic attempt, but has also been deployed in practice by entities such as the International Digital Economy Academy (IDEA) at a scale of 10 thousand CPU cores.

The rest of the paper is organized as follows. Section 2 presents related work on other financial time-series data processing packages. In Section 3, we illustrate the internal mechanisms of the optimized functions. Section 4 presents, analyzes, and explains the benchmark procedures and results. In Section 5, we further describe the implementation details of the HXPY package. Finally, Section 6 concludes with HXPY's status quo and future road-maps.
2. Related Work
2.1 SIMD and SIMT
Single instruction multiple data (SIMD) and single instruction multiple threads (SIMT) can drastically accelerate computation. The earliest application of SIMD can be traced back to the ILLIAC IV[19]. Later, there were also several attempts at vector computers. The core idea of SIMD is operating on multiple scalars in one instruction, as in the familiar modern x86 processors equipped with Advanced Vector Extensions (AVX) instructions. For example, the vaddps instruction in the x86 world can add multiple packed single-precision floating-point (fp32) values in a fixed number of cycles. As shown in Fig.3, $A_{x,y,z,w}$, $B_{x,y,z,w}$, and $C_{x,y,z,w}$ denote scalars, and the SIMD operation can process the four pairs of scalar operands from $A_{x,y,z,w}$ and $B_{x,y,z,w}$ concurrently in a single instruction.
Modern CPUs usually have multiple extended registers, such as zmm0 and zmm15, which are 512 bits long. Through reasonable instruction and register planning, continuous calculations on floating-point numbers or integers can be scheduled and executed to achieve high-performance computing. Meanwhile, SIMD instructions also bring challenges for programmers, including but not limited to data alignment, interaction with other control-flow code, and possible frequency down-scaling. simdjson's success in using SIMD for string parsing[20] came only nearly 20 years after the wide adoption of SIMD instruction sets; this prolonged time lag reflects the difficulty of SIMD programming as well.
At present, AVX512, the widest SIMD instruction set on the x86 architecture, can process 16 single-precision floating-point numbers in one instruction. In practice, the width of SIMD instructions cannot increase indefinitely, while the memory bandwidth available to a single core is also limited. Especially with the increasing number of single-socket CPU cores and the popularity of multi-chip module (MCM) chips, a single CPU core can no longer saturate the bandwidth of all memory channels. At a finer granularity, the width of the registers and the number of ALUs also become constraints, since they decide the maximum number of floating-point operands that can be processed in one cycle.
Moreover, with the increasing operand width of a single SIMD instruction, the waste of computing resources becomes more acute. For example, in the era of SSE (Streaming SIMD Extensions) instructions, a 128-bit register could describe four single-precision floating-point numbers, which might represent a coordinate in three-dimensional space, wasting only a quarter of the computing resources. However, as the number of operand scalars gradually increases, real-world data lengths are not always a multiple of 8 or even 16. Memory bandwidth, frequency drops, power limits, etc., can also become bottlenecks. Some research has shown that AVX512 is not as performant as AVX2 in specific scenarios[21]. Therefore, GPUs and many dedicated DSPs have been designed for data processing purposes in addition to the CPU.
In the world of GPUs, SIMD and SIMT are both popular. Modern GPUs usually consist of thousands or even more stream processors (or CUDA cores in the NVIDIA world, as depicted in Fig.4). Each stream processor can perform calculations on floating-point or integer data. Different threads inside SIMT can access different register sets, use different memory addressing methods, or follow different execution paths. As a result, though the frequency of each stream processor is typically lower than that of a CPU and the ALU width is smaller, a more flexible programming model, more stream processors, and enormous memory throughput contribute to the GPU's greater value in many scenarios.
Figure 4. NVIDIA GA100 Ampere architecture, which contains thousands of CUDA cores on chip.

Despite their success in computation efficiency optimization, it is still very challenging to directly apply SIMD and SIMT to financial time-series data. In addition to the aforementioned byte alignment issues, many outliers such as NAN or INF in IEEE 754 floating-point numbers[22] often exist in financial data. These outliers need to be handled carefully, since they may interrupt the continuous flow of instructions. Moreover, it is difficult to find a memory layout that can simultaneously provide the memory continuity required by both cross-sectional and time-series function computations.
In this paper, HXPY tries to alleviate the adoption problems of applying SIMD and SIMT instructions to financial time-series data containing NAN and INF, and addresses the memory layout issue to ensure continuous memory access in both cross-sectional and time-series functions.
2.2 1D Approach of Financial Time-Series Data
As in the classical parallel processing area, a natural approach for financial time-series data is to consider the data of each financial instrument along the time axis as a one-dimensional (1D) vector and perform analysis on different 1D vectors. Most multivariate time-series data can also be split into single variables and turned into one-dimensional arrays sampled on the time axis. Many programming languages like R[23] and Julia[24] natively provide the semantics of one-dimensional arrays, but usually without any built-in sliding window functions, leaving users to write their own code to access the elements of the array in turn for statistical analysis.
As a famous library for financial data processing, TA-Lib was started as an open-source project in 1999 by Mario Fortier. TA-Lib provides C implementations of around 200 indicators such as ADX, MACD, and RSI, and several candlestick pattern recognition algorithms, such as recognizing a cross pattern. Although TA-Lib lacks data I/O functions, it is still widely used across academia and industry. Many studies use TA-Lib as a toolkit for financial time-series data processing. For example, Duvinage et al.[25] analyzed intra-day performance on the Japanese market, while Nelson et al.[26] used TA-Lib to generate features and train a CNN to predict stock movements. Even cryptocurrency can be traded with TA-Lib processed features[27].
Since the maintenance of TA-Lib stopped in 2007, the application of TA-Lib inevitably exhibits inherent performance defects and inflexibility, making it very hard to satisfy the needs of today's financial data processing. For example, TA-Lib only supports double-precision floating-point input, while the input of most neural networks ranges from single-precision to INT8. In consequence, casting between different data types yields some performance degradation. Worse still, TA-Lib poorly handles outliers and does not support multi-threading, which requires researchers to perform additional preprocessing of the data and goes against the design of modern compute devices.
2.3 2D Approach of Financial Time-Series Data
When dealing with financial time-series data, many cross-sectional analyses require calculations across financial instruments, such as sorting the price-earnings ratios (PE) of all stocks at a specific date or selecting stocks with a price-earnings ratio of less than 10. In such cases, the one-dimensional storage format can be insufficient: one has to continuously construct vectors from the temporal dimension or the securities dimension, extract data for different functions, and write calculated results back to different vectors. In contrast, tabular data is a more intuitive way of analyzing data, since many people's understanding of data starts with tables affiliated with indexes and columns. Both Python and R dataframes provide rich support for tabular data, where multiple variables and names of financial time series can be formatted as columns or indexes of tabular data.
One much-favored Python package, Pandas, which originated at AQR Capital Management, supports tabular data processing with a variety of row-wise and column-wise operations. Nevertheless, Pandas dataframe operations face performance issues even on moderately large datasets[15]. Some ameliorating studies have been conducted on Pandas dataframe semantics, with only a small subset of functions supported. For instance, NVIDIA's cuDF accelerates specific Pandas dataframe functions on the GPU, whereas Modin, launched by the Berkeley RISE Lab, accelerates Pandas via CPU multi-threading. These projects are not as widely used as native Pandas, and the speedup ratio varies by function; as Section 4 demonstrates, many of their functions are even slower than single-threaded Pandas.
Therefore, our work HXPY supports 2D tabular data layouts as in Fig.5 and an interface to Pandas, with financial time-series data-oriented optimizations. In addition to single-threaded algorithm and execution optimization, HXPY integrates both multi-threading and CUDA acceleration, offering more practical functions related to financial time-series data such as technical indicators and industry-grouped functions.
Figure 5. A typical layout of a financial dataframe. The first column, consisting of the datetime, is called the index, and the first row, containing ticker name strings like ``000001'', is called the column, where ``000001'' is the abbreviation of ``000001.XSHE'', i.e., the stock whose trading code is 000001 on the Shenzhen Stock Exchange; the company name of ``000001.XSHE'' is Ping An Bank Co., Ltd. All floating-point values are stocks' daily close prices.

3. HXPY Functions
As a robust data analysis framework, it is essential for HXPY to provide a variety of functions for arbitrary analysis. HXPY has improved and added fundamental functions for financial time-series data processing, while preserving consistency with the Pandas interface as much as possible for user-friendliness. Due to space limitations, we only give one or two examples of each type of function and briefly explain the implementation.
3.1 I/O Functions
The comma-separated values (CSV) file format is widely used in financial time-series data storage and transfer, especially for data transfer among different programming languages and platforms. In a CSV file, typically, a text line ended by the newline character represents a row of data, and commas are used to separate the items in every row, as shown in Fig.6.
When reading a CSV file, all lines of the file are read first, and then each line is parsed concurrently. Before parsing, from the number of lines and the number of delimiters in the first line, we can estimate the size of the memory required for the data and pre-allocate it. When parsing each line in the file, we use the double-pointer technique, where the two pointers point to two separators such as commas, and then use the Boost Spirit library[28] to conduct literal parsing on the text between the two pointers.
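For illustration, the following is a minimal sketch of this double-pointer split, assuming one purely numeric row and using Boost.Spirit's qi::parse for the literal conversion; the function name and the NaN fallback for unparsable fields are ours, and HXPY's production parser is more elaborate.

```cpp
#include <boost/spirit/include/qi.hpp>
#include <limits>
#include <string_view>
#include <vector>

namespace qi = boost::spirit::qi;

// Sketch: two pointers walk from comma to comma; Boost.Spirit parses the
// literal between them. Unparsable fields fall back to NaN.
std::vector<double> parse_csv_line(std::string_view line) {
    std::vector<double> fields;
    const char* lo = line.data();
    const char* end = line.data() + line.size();
    while (true) {
        const char* hi = lo;
        while (hi != end && *hi != ',') ++hi;   // advance to the next comma
        double value = std::numeric_limits<double>::quiet_NaN();
        const char* it = lo;
        qi::parse(it, hi, qi::double_, value);  // literal parse of [lo, hi)
        fields.push_back(value);
        if (hi == end) break;
        lo = hi + 1;                            // skip the separator
    }
    return fields;
}
```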
Similarly, for CSV writing, we process each line concurrently and write the lines sequentially to the file after all threads have completed their work. When processing each line, we also pre-allocate the length of the string according to the number of elements in the line, and then use the fmt library to convert the elements into text form.

Besides literal serialization, a binary format is more I/O-efficient. HXPY provides a contiguous, binary storage format that improves sequential reading and writing. This contiguous storage structure stores elements in a row-major, uncompressed layout, while enabling zero-copy modifications by mapping from a given memory address. At the beginning of the file, we store various offsets to facilitate partial reading, so this binary format also supports reading only part of the index instead of all the data. Partial reads improve convenience especially when the file is very large, such as an archive of many years of historical data. As a result, both the CSV and binary storage formats gain significant speed improvements in the benchmark experiments discussed in Section 4.
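To illustrate the offsets-up-front idea, a hypothetical header layout is sketched below; the field names and widths are our own illustration, not HXPY's actual on-disk format.

```cpp
#include <cstdint>

// Hypothetical header sketch: storing offsets at the beginning of the file
// lets a reader seek directly to the index or the value block, so only part
// of the data needs to be read.
struct BinaryHeader {
    std::uint64_t magic;           // format identifier and version
    std::uint64_t n_rows;          // number of index entries
    std::uint64_t n_cols;          // number of columns
    std::uint64_t index_offset;    // byte offset of the index section
    std::uint64_t columns_offset;  // byte offset of the column-name section
    std::uint64_t values_offset;   // byte offset of the row-major value block
};
```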
3.2 Element Functions
Element-wise functions are usually element-by-element operations on one or more dataframes, such as addition, subtraction, multiplication, and division of two dataframe objects, size comparisons, or scientific functions such as sine, square root, and relu in Fig.7. Such functions usually behave as contiguous scalar operations on aligned memory, and thus the compiler easily vectorizes them. We implement this type of function with a simple partitioning strategy so that each thread's memory access is contiguous, that is, each working thread takes a continuous and aligned subset reference of the original storage.
For the task of each thread, due to space limitations, we only illustrate how the abs function is implemented with CPU SIMD. Unlike sqrt, which has native x86 AVX instruction support (vsqrtss), we need to implement the absolute value function manually. Inspired by the IEEE 754 floating-point layout, we notice that both single- and double-precision floating-point numbers have a sign bit. Therefore, we pack every eight floating-point numbers (or 16 numbers if the processor supports AVX-512) into a 256-bit processing unit (or a 512-bit unit in AVX512), and then perform a logical bit-wise AND with a mask whose bits are all ones except the sign bit of each lane, which clears the sign bits and yields the absolute values of the inputs. In practice, we place multiple such processing units in loop sections to improve register and arithmetic logic unit (ALU) utilization. The demo C++ code is shown in Fig.8.
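In the spirit of Fig.8, a minimal sketch of such a kernel is given below, assuming AVX2 intrinsics; the loop unrolling and outlier handling of the production version are omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <immintrin.h>

// Sketch: absolute value by clearing the IEEE 754 sign bit, eight packed
// single-precision floats per AVX2 instruction.
void abs_avx2(float* data, std::size_t n) {
    // All bits set except the sign bit of each 32-bit lane.
    const __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7FFFFFFF));
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_and_ps(v, mask));
    }
    for (; i < n; ++i)  // scalar tail when n is not a multiple of 8
        data[i] = std::fabs(data[i]);
}
```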
3.3 Column and Row Functions
Since dataframes have indexes and columns that index a 2D value array, a large number of row-by-row or column-by-column accesses are expected in practice. In financial time-series data analysis, usually, each row represents a different timestamp, and each column denotes a different equity. HXPY adopts row-major storage. When dealing with column functions that involve a large number of non-contiguous memory accesses, HXPY first transposes the data into contiguous storage. As exemplified by Fig.9, when the matrix is large enough and there is sufficient memory, HXPY finds this transpose profitable, even accounting for the extra memory and the cost of the transpose itself. Thus, no matter whether the function call is row-wise or column-wise, HXPY can call the same operator, which reduces the code workload. Such functions sometimes do not change shape, such as computing the sorted index for each row of data in Fig.10, and sometimes they are reductive, such as computing the mean or standard deviation of a column of data in Fig.11. In both cases, HXPY pre-allocates the memory for the result in advance, and the operators then write directly to the corresponding addresses.
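A common way to implement such a transpose is with cache blocking; the sketch below is our illustration of the idea, not HXPY's exact kernel.

```cpp
#include <cstddef>

// Sketch: cache-blocked transpose of a row-major matrix so that subsequent
// column-wise operators can scan contiguous memory.
void transpose_blocked(const float* src, float* dst,
                       std::size_t rows, std::size_t cols) {
    constexpr std::size_t B = 64;  // block edge; tuned to the cache in practice
    for (std::size_t i = 0; i < rows; i += B)
        for (std::size_t j = 0; j < cols; j += B)
            for (std::size_t ii = i; ii < i + B && ii < rows; ++ii)
                for (std::size_t jj = j; jj < j + B && jj < cols; ++jj)
                    dst[jj * rows + ii] = src[ii * cols + jj];
}
```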
Figure 9. Transposing storage to make column-wise functions access memory contiguously.

3.4 Time-Series Functions
Time-series windowed functions are a very critical set of functions in HXPY. We use several tactics to optimize the performance of time-series functions, and the specific optimization methods are described as follows. Time-series functions usually produce result dataframes of the same shape (except that the first T − 1 data points may be invalid results), as Fig.12 illustrates. They can sometimes be reductive, such as down-sampling finer-grained data in time into aggregate statistics in Fig.13 or sampling on timestamps.
Given the importance of time-series functions, and for the rigor of the paper, we use the optimization process of the ts_corr function to demonstrate in detail how HXPY optimizes time-series functions. On the basis of the comparison with Pandas, we disassemble and analyze the sources of HXPY's performance improvement in more depth. The ts_corr function (Equation (1)) calculates the windowed Pearson's correlation coefficient between two aligned dataframes. All of the incremental analysis results are shown in Section 4.
3.4.1 Language Overhead
The basic idea is to implement the ts_corr function by iterating over each cell in a dataframe. For each cell, we look back at the past T elements along the time index of the two dataframes, store them in an array, calculate the correlation coefficient, and write back to the corresponding cell of the resulting dataframe. To measure the language overhead, we start with this preliminary idea and implement basic versions of ts_corr in Python and C++, respectively. The C++ code is compiled without any optimizations (-O0). The brute-force version in Python costs 1422 seconds, and the C++ version costs 29.3 seconds on the small dataset introduced in Section 4.
3.4.2 Memory Optimization
The basic version of ts_corr has a critical memory shortcoming: the T values looked back along a column are not contiguous in memory due to our use of row-major storage.
An intuitive improvement is to first copy the data of each column into a one-dimensional array, as in Fig.14, and then perform the calculation of ts_corr on this array in turn, which ensures the continuity of memory each time ts_corr is invoked. This memory fetch trick alone reduces the time from 29 seconds to 1.6 seconds.
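The trick itself is a plain strided gather; a minimal sketch (the function name is ours) follows.

```cpp
#include <cstddef>

// Sketch: gather one column of a row-major matrix into a contiguous buffer,
// so each windowed pass over the column scans sequential memory.
void gather_column(const float* values, std::size_t n_rows,
                   std::size_t n_cols, std::size_t col, float* out) {
    for (std::size_t row = 0; row < n_rows; ++row)
        out[row] = values[row * n_cols + col];  // element stride is n_cols
}
```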
Figure 14. Transforming the memory layout to accelerate calculation.

3.4.3 Floating Number Optimization
Profiling shows that most of the calculation overhead of the current program version is associated with floating-point calculations, and GCC has floating-point optimization flags that are not enabled by default even at -O3. Though some of these optimization flags may be safe, others can break strict IEEE compliance. For example, in financial computing, turning on -freciprocal-math to allow the compiler to compute x/y as x × (1/y) is usually correct, with rarely a precision issue, but -ffinite-math-only cannot be turned on, because financial data usually contains a lot of NAN or INF values and turning on this flag may cause incorrect results. To ensure that the calculation results are correct, we turn on three safe floating-point optimization flags: -fno-trapping-math generates nonstop code, on the assumption that no math exceptions that can be handled by the user program will be raised; -fno-math-errno disables the global errno variable for simple math functions; and -freciprocal-math enables faster floating-point division. Finally, we gain around a 40% speedup.
3.4.4 Streaming Algorithm
The complexity of the algorithms implemented so far is O(nmT), where n represents the number of rows, m represents the number of columns, and T represents the length of the time window. However, after introducing the streaming algorithm, we only need to scan each row once.
The streaming algorithm usually describes a class of algorithms whose input data is a sequence of items and whose calculation can be completed by scanning the sequence once or a few times. After being formalized and popularized by [29] in 1996, a variety of streaming algorithms are now widely used in network optimization, clustering, and real-time data analysis systems. Compared with batched computation, streaming computation has the advantages of real-time processing, low latency, and limited memory usage, while the streaming algorithm also suffers from reduced accuracy and limited applicable scenarios. In addition, several open-source frameworks such as Apache Flink[30] and Apache Storm[31] have been developed to simplify the deployment of streaming algorithms. Fig.15 describes a simple streaming algorithm that calculates a fixed-length windowed sum; complex functions such as ts_corr can also be calculated by maintaining the first- and second-order streamed statistics of x and y.
In the world of financial time-series data, both stream computing and batch computing are widely used. Although financial time-series data naturally exhibits the characteristics of time streaming, some analyses are fixed in a window, such as the moving average (MA) of the past five days. Certain analyses, such as the exponential moving average (EMA), defined as $EMA_t = k \times Price_t + (1-k) \times EMA_{t-1}$ with a coefficient $k$ between 0 and 1, need to be calculated continuously from the first data point, and are thus naturally streamed. The closer the coefficient $k$ is to 1, the closer the EMA result is to the value at the current time point; the closer $k$ is to 0, the more historical information the EMA result contains. Moreover, specific fixed-window calculations can also be streamed to reduce time complexity. For instance, to calculate a five-day moving average, starting on the sixth day we can multiply the previous day's result by 5, subtract the first day's value, add the value of the sixth day, and divide by 5, that is, $MA_{t,5} = (MA_{t-1,5} \times 5 - Price_{t-5} + Price_t)/5$. Such optimizations can significantly reduce calculation time when the window is wide (because all the data is scanned only once). However, calculation errors are much more likely to accumulate due to the limited precision of floating-point numbers. Therefore, it is necessary to weigh the trade-off between calculation speed and accuracy to select an appropriate algorithm for practical applications.
In our experiments, when the value ranges of the two dataframes are similar, the errors between the streaming version of ts_corr and the fixed-window version are very small. Under the conditions mentioned above, we can calculate ts_corr over any time window, and the time complexity is reduced to O(nm), independent of T. The streaming algorithm brings a nearly three-fold speed improvement when T = 10 in our benchmark.
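For concreteness, a minimal single-column sketch of the streaming fixed-window correlation is shown below; the NAN handling and numerical safeguards of the production version are omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: maintain first- and second-order sums of x and y over the window,
// so each element is added once and evicted once: O(n) instead of O(nT).
std::vector<double> ts_corr_stream(const std::vector<double>& x,
                                   const std::vector<double>& y,
                                   std::size_t T) {
    std::vector<double> out(x.size(), std::nan(""));
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t t = 0; t < x.size(); ++t) {
        sx += x[t]; sy += y[t];
        sxx += x[t] * x[t]; syy += y[t] * y[t]; sxy += x[t] * y[t];
        if (t >= T) {  // evict the element leaving the window
            std::size_t s = t - T;
            sx -= x[s]; sy -= y[s];
            sxx -= x[s] * x[s]; syy -= y[s] * y[s]; sxy -= x[s] * y[s];
        }
        if (t + 1 >= T) {  // window [t-T+1, t] is complete
            double cov = sxy - sx * sy / T;
            double vx  = sxx - sx * sx / T;
            double vy  = syy - sy * sy / T;
            out[t] = cov / std::sqrt(vx * vy);
        }
    }
    return out;
}
```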
3.4.5 Multi-Threading
So far, we have covered most of the optimizations for a single thread. Following Modin's idea, however, when the dataframe is huge, it is a better choice to partition it and use multiple workers to calculate the function. We can partition the dataframe along columns, as Fig.16 shows. For example, when there are 1000 stocks with 2000 days of data and 10 worker threads, we can give each thread 100 columns of 2000 days of data, ensuring continuous memory access per thread. Thus, SIMD instructions can still be applied, and the memory accesses of different threads do not intersect; 16 threads bring more than a three-fold speed improvement. When the number of execution threads is larger than 16, we cannot bind the threads to a single non-uniform memory access (NUMA) node on the AMD 7742 CPU, and thus the fluctuations of execution time increase rapidly.

Figure 16. Distributing rows or columns to different threads.

3.4.6 CUDA Cores
At the time of writing, the x86 CPU with the most cores in a single socket provides 64 physical cores and 128 threads (AMD EPYC 7763). However, due to NUMA and memory bandwidth limitations, the speedup of multi-threaded computing on the CPU is not always linear. As the number of CPU threads increases, memory bandwidth gradually becomes a bottleneck, and the benefits brought by optimization methods such as SIMD gradually subside as the partitions shrink. Current mainstream CUDA graphics cards have nearly 10000 cores. We implement the streaming algorithm version of ts_corr on CUDA and assign a CUDA core to each column of data; the result is around 5x faster than the CPU version on an NVIDIA A100 40 GB. We believe it could be much faster if the CUDA kernel operator were delicately tuned.
3.5 Grouped Functions
Different variables in financial time-series data are usually not uniform and exhibit clustering characteristics. For example, stocks are usually categorized into different industries, while bulk commodities are usually divided into agricultural products, ores, non-ferrous metals, oil and gas, and other sectors. In practice, it is common to group the data first and then analyze it, such as calculating statistical characteristics within a given industry. This by-industry analysis approach is beneficial. For example, the COVID-19 pandemic that began in 2020 resulted in losses for companies engaged in shipping or hospitality, but biomedical companies researching vaccines did not necessarily suffer such losses; neither by-row nor by-column analysis captures this intra-market structural information.
In HXPY, grouping is implemented by another dataframe whose values represent the categories. After the index and column alignment between the data and group dataframes is achieved, the original dataframe is evenly partitioned, values of different groups inside each partition are dispatched to workers, and results are merged and reduced concurrently. In each row, items of different groups are concatenated into different aligned arrays, and then the different arrays can be dispatched to our optimized operators for contiguous row-wise functions, achieving operator reuse. We use a hash table to efficiently store the index positions of different classes, and small containers are optimized as stack objects. When the result is calculated, the data is written back to a new dataframe according to the recorded class positions. Fig.17 shows a demo of a grouped function, which calculates the mean of different groups.
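For illustration only, a minimal standalone sketch of the per-row grouped mean idea follows, with a hash table from category to accumulator; HXPY's version additionally reuses the optimized row-wise operators and stack-allocates small containers.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Sketch: mean per group within one row; group[j] is the category of column j.
std::unordered_map<int, double> grouped_mean_row(
        const std::vector<double>& row, const std::vector<int>& group) {
    std::unordered_map<int, double> sum;
    std::unordered_map<int, std::size_t> count;
    for (std::size_t j = 0; j < row.size(); ++j) {
        sum[group[j]] += row[j];
        ++count[group[j]];
    }
    for (auto& [g, s] : sum)
        s /= static_cast<double>(count[g]);  // finalize: sum -> mean
    return sum;
}
```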
3.6 Shape Manipulating Functions
In-memory dataframe analysis requires flexible operations on shapes and indexes. HXPY supports many shape manipulating functions as well as data indexing, slicing, and transformation, as shown in Figs.18-21. These shape manipulating functions usually need access to the indexes or columns, so we construct a hash table to speed up the lookup of index names. In addition, in functions such as concat that operate on many small dataframes, we pre-construct the new index and dataframe, and then the alignment of items from the old dataframes into the new large dataframe can be executed concurrently. In the implementation of the transpose function, we refer to Eigen's efficient matrix transpose algorithm, and directly swap the index and columns to avoid copying. Through reasonable execution planning, frequent memory allocation is avoided, and the acceleration effect is finally achieved.

4. Experiments
In this section, we empirically evaluate the proposed framework HXPY, aiming to answer the following major question: is HXPY a high-performance financial time-series data processing framework that provides fast calculation speed, accurate results, and a sufficient set of financial functions? We compare HXPY with Pandas[14] and one of Pandas' multi-core improvements, Modin[16, 18]. We also benchmark Pandas' CUDA counterpart, cuDF. With the goal of maximizing diversity and ensuring fair comparisons, we select different functions from each group of functions to test and verify the effectiveness of our proposed optimization methods. The experimental results are arranged into different tables grouped by function type.
4.1 Benchmark Data and Settings
First, we introduce the benchmark datasets used in our experiments. The benchmarks are adopted to compare HXPY's speed with other analysis frameworks and to clearly illustrate how it performs. We have two benchmark datasets, i.e., the small dataset and the large dataset. The small dataset consists of 16 years of daily stock data from 2006 to 2021 in the Chinese market, while the large dataset consists of 1000 days of 1-minute stock data in the Chinese market, as shown in Table 1. These datasets are compiled from the public data sources of the Chinese stock exchanges.
Table 1. Large and Small Datasets Used in Benchmark

| Dataset | Dates | Start Time | End Time | Daily Timestamps | Number of Rows | Number of Columns | Memory Size (GiB) | CSV Size (GiB) |
|---|---|---|---|---|---|---|---|---|
| Small | 3890 | 20060104 | 20211231 | 1 | 3890 | 4797 | 0.03 | 0.06 |
| Large | 1000 | 20180102 | 20220216 | 240 | 240000 | 4740 | 4.34 | 8.61 |

Note: The file size is measured in GiB, i.e., $1024^3$ (1 073 741 824) bytes.

We conduct our benchmark evaluation on a server equipped with two 64-core AMD 7742 CPUs. The compilers are GCC 11.2 and NVCC 11.6. The memory is 2 TB of 8-channel memory at 3200 MHz. We use eight NVIDIA A100 40 GB GPUs with NVLINK[32] enabled for the CUDA benchmark.
4.2 I/O Functions
In the I/O benchmark, we disable the OS's file cache and guarantee that only one process accesses the file at a time. Files are stored on a RAID array of eight PM1733 3.84 TB NVMe SSDs, whose maximum read and write speeds usually require multiple processes to saturate; thus, the disks do not bottleneck the read and write speeds. Consequently, the design of the file parsing pipeline is what differentiates the frameworks, and the Latin square design[33] is used to ensure that different frameworks take turns evenly and fairly in the I/O test.
For the CSV benchmark, firstly, a CSV file is generated at a given path. The benchmarked framework reads this file sequentially (where the sequence order differs across multiple experiments) and dumps it into another CSV file in order. For the binary file test, the dataframe is constructed by the read_csv function, the binary is dumped by HXPY and Pandas, and then the binary reading benchmark is conducted. For the binary benchmark of Pandas, besides Python's native pickle format, we also benchmark Apache Arrow, a very popular cross-language storage format.

The speedup results in Table 2 and Table 3 indicate that HXPY delivers state-of-the-art I/O performance. A single-threaded HXPY can outperform four-threaded Modin and is only slightly slower than eight-threaded Modin. Also, although Modin's binary read function is multi-threaded, its threaded binary I/O is even slower than the original Pandas.
Table 2. CSV File I/O Performance Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| read_csv | Small | 0.29 | 2.27 | 0.88 | 0.09 | 7.8x | 9.7x |
| read_csv | Large | 34.30 | 178.30 | 32.00 | 11.60 | 5.2x | 2.7x |
| to_csv | Small | 0.97 | 7.39 | 2.01 | 0.21 | 7.6x | 9.5x |
| to_csv | Large | 102.40 | 692.40 | 95.20 | 25.10 | 6.7x | 3.8x |

Note: The number in parentheses, as in HXPY(8), is the number of threads used in the benchmark. Speedup(n) is the performance gain of HXPY over its counterpart in the n-threaded benchmark. Bold numbers denote the best performance achieved among the packages.

Table 3. Binary File I/O Performance Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas#(1) | Pandas$(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|---|
| read_binary | Small | 0.03 | 0.17 | 0.050 | 0.18 | 0.009 | 1.6x | 20.0x |
| read_binary | Large | 1.50 | 5.90 | 2.700 | 7.18 | 0.380 | 1.8x | 18.9x |
| to_binary | Small | 0.03 | 0.41 | 0.046 | 0.63 | 0.030∘ | 1.5x | 21.0x |
| to_binary | Large | 1.30 | 8.00 | 2.900 | 8.77 | 1.300∘ | 2.2x | 6.7x |

Note: # denotes Pandas using the Arrow format. $ denotes Pandas using the Python pickle format. ∘ denotes that to_binary's multi-threaded version is not implemented in HXPY, so single-threaded results were used. Bold numbers denote the best performance achieved among the packages.

4.3 Element Functions
Element functions are a relatively simple type of function, and there are no dramatic differences in experimental speed among the frameworks for such functions. We benchmark four common element-wise functions: power, round, abs, and relu. Specifically, power calculates the power of numbers, while round is handy in finance since we often need to round the data or keep only a few significant digits (e.g., rounding stock prices to cents). Both abs and relu are used to eliminate negative values, where relu[34] transforms negative values to zero as defined below. Nonetheless, it can be concluded that HXPY achieves consistent speedups across dataframes of different sizes. For functions where Pandas does not have a native implementation, such as relu, HXPY's implementation through a C++ lambda function leads to a maximum 298x speedup.
\begin{split} relu(x) = \begin{cases} 0, & \text{if}\ x \leqslant 0, \\ x, & \text{if}\ x > 0. \end{cases} \end{split}

From the results in Table 4 we can find that power is a compute-intensive function and thus multi-threading improves it a lot (i.e., 6x to 7x with eight threads). However, the multi-threading speedup of relu and abs is less significant (2x to 3x with eight threads), where the memory bandwidth bound may be reached, considering the functions themselves are very simple. The speedup columns in the table represent HXPY's speedup compared with the corresponding counterpart, Pandas or Modin. Both single-threaded and multi-threaded versions of HXPY approach or substantially outperform their opponents.
Table 4. Element Functions Performance Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| power | Small | 0.180 | 0.240 | 0.120 | 0.028 | 1.30x | 4.30x |
| power | Large | 16.600 | 16.200 | 2.300 | 2.500 | 0.97x | 0.92x |
| round | Small | 0.020 | 0.450 | 0.670 | 0.010 | 22.00x | 67.00x |
| round | Large | 1.280 | 40.200 | 6.030 | 0.550 | 31.40x | 10.90x |
| abs | Small | 0.024 | 0.038 | 0.081 | 0.009 | 1.60x | 9.00x |
| abs | Large | 1.160 | 1.210 | 0.890 | 0.540 | 1.04x | 1.65x |
| relu | Small | 0.021 | 5.570 | 1.330 | 0.011 | 265.00x | 120.00x |
| relu | Large | 1.250 | 373.000 | 49.900 | 0.580 | 298.00x | 86.00x |

Note: Bold numbers denote the best performance achieved among the packages.

4.4 Row and Column Functions
Besides element functions, researchers often need to compute statistics on each timestamp or each security. These functions can provide researchers with insights into the data distribution of each cross-section or security symbol. The performance pursuit of row-wise and column-wise functions is intricate, since ensuring the continuity of row memory access inevitably sacrifices some column access performance. We test three axis-level functions: rank is a popular function in financial data analysis that obtains the ranking of different securities at every timestamp; std calculates the standard deviation, since volatility is a critical concept in finance; maxmin_scale is a useful row-wise function in finance for normalizing data to the range [0, 1].
On the one hand, the results in Table 5 show that in the row-wise functions (axis=1), HXPY always wins over Pandas, thanks to HXPY's row-major storage compared with Pandas' column-major storage. On the other hand, for column-wise functions (axis=0), HXPY is not as fast as Pandas on the large dataset. The reason behind this slowdown is the massive discontinuous memory accesses. Nevertheless, from an overall perspective, after evenly combining the times of column-wise and row-wise functions, HXPY still has a significant advantage in both its single-threaded and multi-threaded versions.
Table 5. Row-Wise and Column-Wise Functions Performance Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| rank(axis=1) | Small | 0.590 | 1.150 | 0.350 | 0.140 | 1.90x | 2.50x |
| rank(axis=1) | Large | 40.700 | 87.900 | 17.300 | 8.200 | 2.10x | 2.10x |
| rank(axis=0) | Small | 0.690 | 0.980 | 0.280 | 0.150 | 1.40x | 1.90x |
| rank(axis=0) | Large | 80.600 | 161.300 | 51.500 | 18.100 | 2.00x | 2.80x |
| std(axis=1) | Small | 0.017 | 0.351 | 0.098 | 0.002 | 20.60x | 49.00x |
| std(axis=1) | Large | 1.040 | 32.700 | 4.560 | 0.140 | 31.40x | 32.60x |
| std(axis=0) | Small | 0.093 | 0.126 | 0.076 | 0.012 | 1.30x | 6.30x |
| std(axis=0) | Large | 11.500 | 6.600 | 3.510 | 4.500 | 0.50x | 0.80x |
| maxmin_scale(axis=1) | Small | 0.040 | 1.270 | 0.260 | 0.030 | 31.70x | 8.70x |
| maxmin_scale(axis=1) | Large | 2.580 | 86.600 | 13.100 | 1.820 | 33.50x | 7.20x |
| maxmin_scale(axis=0) | Small | 0.300 | 1.290 | 0.280 | 0.050 | 4.30x | 5.60x |
| maxmin_scale(axis=0) | Large | 26.800 | 4.300 | 3.700 | 16.000 | 0.16x | 0.23x |

4.5 Time-Series Functions
When testing time-series functions, the length of the time window is usually critical; however, to date, both Pandas and HXPY implement window-independent time complexity algorithms for most windowed time-series functions via streaming algorithms. For instance, we can use two heaps to maintain the sliding-windowed median of an array. Since the running time is no longer related to T, we only need to benchmark one window size T for a fair comparison. We test the computational speed of the different frameworks with the same time window (T=10). Using T=10 is meaningful since 10 is not a multiple of 8, so the AVX speedup is negatively affected, which makes the speed comparison emphasize the internal algorithms more. Besides, the number 10 has practical significance in finance, where a 10-day or 10-minute aggregation window is often used for trend analysis.
Since time-series functions play a very critical role in financial data analysis, we test a larger number of functions in this subsection. Specifically, we test ts_sum, ts_std, and ts_max for simple statistics, ts_rank for ranking values within the time-series window, ts_corr (Equation (1)) for the windowed Pearson's correlation between two aligned dataframes, and ts_argmaxmin_diff (Equation (2)) for a more complicated statistic that calculates the index difference between the maximum value and the minimum value occurring in the time window; all are widely used as typical technical indicators. Sub-sampling along the time-series axis is also tested. Further analysis of the time-series function speedups with different methods can be found in Table 6.
Table 6. Incremental Analysis of the ts_corr Function

| Language | Optimization or Description | Execution Time on Small Dataset (s) |
|---|---|---|
| Python | Brute-force | 1422.0000 |
| C++ | Brute-force | 29.3000 |
| C and Python | Pandas's ts_corr | 2.9100 |
| C++ | Memory optimization | 1.6400 |
| C++ | Memory optimization + SIMD | 0.5600 |
| C++ | Memory optimization + SIMD + FNO | 0.4000 |
| C++ | Memory optimization + SIMD + FNO + SA | 0.1400 |
| C++ | Memory optimization + SIMD + FNO + SA + 16 threads | 0.0450 |
| CUDA | SA + 4790 CUDA threads | 0.0093 |

Note: FNO: floating number optimization enabled. SA: streaming algorithm enabled.

Let a, b be two time-series sequences. The time-series correlation is defined as
\begin{split} ts\_corr(a, b, t, T) = \frac{\text{cov}(a_{t-T+1:t},\ b_{t-T+1:t})}{\text{std}(a_{t-T+1:t}) \times \text{std}(b_{t-T+1:t})}, \end{split} (1)

where $a_{t-T+1:t} = (a_{t-T+1}, \dots, a_{t})$ is the sub-sequence from $t-T+1$ to $t$, and $\text{cov}$, $\text{std}$ are the sample covariance and standard deviation functions, respectively. The time-series argmaxmin difference is defined as
\begin{split} &ts\_argmaxmin\_diff(a, t, T)\\ = \;& {\rm argmin}_{t-T+1 \leqslant s \leqslant t} (a_s) - {\rm argmax}_{t-T+1 \leqslant s \leqslant t} (a_s). \end{split} (2)

From the results in Table 7, we can see that on most of the time-series functions, HXPY achieves significant performance improvements. However, HXPY is slightly inferior to its counterparts on the down-sampling functions, where discontinuous memory access becomes a bottleneck due to the reduced computational intensity.
Table 7. Time-Series Operations Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| ts_sum | Small | 0.078 | 0.51 | 0.187 | 0.040 | 6.50x | 4.6x |
| ts_sum | Large | 3.250 | 14.60 | 8.610 | 2.090 | 4.50x | 4.1x |
| ts_std | Small | 0.093 | 0.67 | 0.190 | 0.042 | 7.20x | 4.5x |
| ts_std | Large | 4.890 | 22.56 | 9.100 | 2.260 | 4.60x | 4.0x |
| ts_max | Small | 0.240 | 0.66 | 0.200 | 0.068 | 2.70x | 2.9x |
| ts_max | Large | 20.100 | 26.00 | 8.580 | 3.910 | 1.30x | 2.2x |
| ts_rank | Small | 1.850 | 2.91 | * | 0.400 | 1.50x | 7.2x |
| ts_rank | Large | 176.000 | 234.00 | * | 27.800 | 1.30x | 8.4x |
| ts_corr | Small | 0.140 | 2.39 | 4.130 | 0.050 | 17.10x | 82.6x |
| ts_corr | Large | 8.510 | 84.00 | 121.700 | 2.770 | 9.90x | 43.9x |
| ts_argmaxmin_diff | Small | 0.430 | 901.00 | 176.600 | 0.100 | 2100.00x | 1760.0x |
| ts_argmaxmin_diff | Large | 45.100 | ** | ** | 11.600 | N/A | N/A |
| ts_subsample_median | Small | 0.240 | 0.21 | 0.170 | 0.044 | 0.87x | 3.8x |
| ts_subsample_median | Large | 25.100 | 15.70 | 5.620 | 4.790 | 0.62x | 1.2x |

Note: * denotes that rolling().rank() is not supported in Modin 0.14. ** denotes too slow (more than one hour) to measure.

4.6 Grouped Functions
Grouped functions are helpful when the analysis is conducted across different sectors. However, we only test grouped functions on the daily (small) dataset, since the industry sector data of Chinese stocks is only updated at a daily frequency. We achieve a significant speedup of about 15x to 30x single-threaded and 130x to 200x multi-threaded on the daily dataset.
The performance improvement in Table 8 comes not only from HXPY's design but also from the fact that Python Pandas does not natively provide a two-dimensional (2D) matrix group semantic. As a result, users must first convert a 2D dataframe into a one-dimensional (1D) value array, add a column containing the 1D group array, perform the group operations, and convert the result back into a 2D dataframe. Such frequent shape transformations also bring a significant performance penalty.
Table 8. Grouped Functions Compared with Pandas and Modin

| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| grouped_count | Small | 0.90 | 13.8 | 24.9 | 0.19 | 15.3x | 131x |
| grouped_max | Small | 0.96 | 19.7 | 30.8 | 0.20 | 20.5x | 154x |
| grouped_mean | Small | 1.00 | 30.1 | 40.3 | 0.20 | 30.1x | 201x |

4.7 Shape Manipulating Functions
We test standard shape manipulating scenarios, such as concatenating massive numbers of small files, aligning one dataframe to another, and other operations like selecting a subset of the index. In the concatenation benchmark, we split data into different dataframes and measure the time each package takes to merge them into one big dataframe. For the alignment benchmark, we align the small dataset to the large dataset or the large dataset to the small dataset to measure the performance of different frameworks on dataframe reshaping operations. For the re-indexing task, we randomly sample from the original index with 2% of index names entirely unseen (this increases the difficulty of the task because such non-existing rows involve additional memory allocation), and then measure the speed of different frameworks building new dataframes based on such unseen index names.
This type of function requires complex shape transformations, resulting in poor multi-threading speedups, as illustrated in Table 9. We find that Modin is much slower than the original Pandas in many cases, such as alignment and re-indexing. This is because many shaping operations are difficult to divide equally among threads, especially when designing exception handling for non-existing columns. Although HXPY can still speed up through multi-threading, the speedup ratio is far lower than the number of threads used.
Table 9. Shape Manipulation Functions Compared with Pandas and Modin

| Function | Operation | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| concat | 100 daily files | 0.044 | 3.970 | 0.910 | 0.038 | 90.20x | 23.9x |
| concat | 1000 daily files | 0.450 | 79.600 | 18.300 | 0.300 | 176.80x | 61.0x |
| align | Small to large | 1.090 | 1.860 | 6.110 | 0.470 | 1.70x | 13.0x |
| align | Large to small | 0.021 | 0.042 | 1.000 | 0.017 | 2.00x | 58.8x |
| reindex | 1000 index in small set | 0.005 | 0.003 | 0.066 | 0.002 | 0.66x | 33.0x |
| reindex | 10000 index in large set | 0.130 | 0.320 | 0.980 | 0.120 | 2.40x | 8.1x |

4.8 CUDA Functions
Since CUDA is becoming more and more popular in many data science scenarios, we benchmark the CUDA version of HXPY against NVIDIA's cuDF. We find that most functions on CUDA show notable performance improvements compared with the multi-threaded CPU version of HXPY. Specifically, simple mathematical functions such as abs and power are greatly accelerated in the CUDA version of HXPY. However, functions such as rank that require frequent abnormal-value processing and sorting show little acceleration compared with the CPU version. Table 10 presents the CUDA function performance comparison.
Table 10. CUDA Functions Compared with NVIDIA cuDF

| Function | Dataset | HXPY(8) | HXPY(CUDA) | cuDF(CUDA) | Speedup(CPU) | Speedup(CUDA) |
|---|---|---|---|---|---|---|
| abs | Small | 0.009 | 0.0021 | 0.27 | 4.3x | 128.0x |
| abs | Large | 0.540 | 0.0280 | 4.11 | 19.3x | 146.0x |
| power | Small | 0.028 | 0.0020 | 0.81 | 14.0x | 405.0x |
| power | Large | 2.500 | 0.0360 | 4.54 | 69.4x | 126.0x |
| rank(axis=0) | Small | 0.150 | 0.1500 | 1.96 | 1.0x | 13.1x |
| rank(axis=0) | Large | 18.100 | 9.4500 | 19.40 | 1.9x | 2.0x |
| rank(axis=1) | Small | 0.140 | 0.1500 | + | 0.9x | N/A |
| rank(axis=1) | Large | 8.200 | 8.8300 | + | 0.9x | N/A |
| ts_sum | Small | 0.040 | 0.0043 | 0.47 | 9.3x | 109.0x |
| ts_sum | Large | 2.090 | 0.2000 | 4.58 | 10.4x | 22.9x |
| ts_corr | Small | 0.050 | 0.0093 | # | 5.4x | N/A |
| ts_corr | Large | 2.770 | 0.5000 | # | 5.5x | N/A |

Note: HXPY(CUDA) and cuDF(CUDA) were benchmarked on a single NVIDIA A100 40 GB with CUDA 11.6. Speedup(CPU) compares HXPY's CUDA version with HXPY's CPU multi-threaded version; Speedup(CUDA) compares HXPY's CUDA version with cuDF's CUDA version. + denotes that rank() along axis=1 is not supported in cuDF 2022.4, and # denotes that rolling().corr() is not supported in cuDF 2022.4.

Due to the size of CUDA memory and the difficulty of programming, both NVIDIA cuDF and HXPY only implement a subset of their functions on CUDA. For instance, only the column-wise version of the rank function is implemented in cuDF; therefore, row-wise analysis on cuDF dataframes is unavailable at this moment. Besides, it seems that cuDF does not have specific optimizations for floating-point financial time-series data, as many functions are not even as fast as the CPU version of Pandas.
5. HXPY Package
Section 4 demonstrates the powerful performance of our enhanced functions. In this section, we briefly describe how we turn our research exploration into an industry-grade software package.
5.1 Backend
A dataframe class, as in Python Pandas, typically contains an index, a list-like set of columns, and the storage of a dense matrix. The elements in the index and columns can be sequential strings, integers, or DateTime values, while the value types can be homogeneous or heterogeneous (usually using column storage, where different columns may have different types).
In HXPY, all of the above components are implemented in C++ and provided with multiple backbones. For instance, the original index implementation of Pandas is a linear vector structure, while HXPY provides both vector and tree-based data structures for faster search. As for storage, HXPY also provides unmanaged storage, which can directly construct objects from already allocated memory.
In the Python interface, HXPY supports interchange with Python Pandas and NumPy[35]. Therefore, researchers can use HXPY to process the computationally intensive part of the data and then convert HXPY objects into Pandas objects, so that they can still utilize Python's rich data analysis ecosystem for plotting and other operations. Such two-way conversion offers researchers flexibility and ease of use.
5.2 Memory Layout
HXPY uses an aligned, contiguous memory layout to enable SIMD instructions. It emphasizes memory alignment because misaligned memory brings additional overhead for reading and writing data on both the CPU and the GPU, as illustrated in Fig.22. Intrusive pointers are used for reference counting to achieve shallow copies of objects within the same process, in the several cases where avoiding deep copies brings performance gains. Atomic operations are applied to guarantee the thread safety of reference counting in a multi-threaded environment. The memory copy between CUDA memory and the host is based on CUDA runtime APIs.
Figure 22. A misaligned memory address might cause performance loss.

5.3 Execution and Dispatching
Since HXPY supports multiple data types and device types, it is crucial to design and implement a dynamic verification mechanism to check the validity of operators and operands at run-time. Here, we refer to the implementation of PyTorch[36], a prevalent tensor library widely used in deep learning, which introduced the registration mechanism of operators. We also implement reflection based on modern C++ features, calling functions based on the string names of operators. If a user encounters an unimplemented operator at run-time, a warning is thrown and handled as detailed in Subsection 5.4. Users can also register new operators at run-time.
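A minimal sketch of such a string-keyed operator registry is given below; the operator signature is hypothetical and far simpler than HXPY's actual dispatch tables.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical operator signature for the sketch.
using Op = std::function<void(void* dst, const void* src, std::size_t n)>;

std::unordered_map<std::string, Op>& registry() {
    static std::unordered_map<std::string, Op> ops;  // name -> implementation
    return ops;
}

void register_op(const std::string& name, Op op) {
    registry()[name] = std::move(op);  // run-time registration by string name
}

const Op& resolve(const std::string& name) {
    auto it = registry().find(name);
    if (it == registry().end())  // unimplemented operator: report and handle
        throw std::runtime_error("unimplemented operator: " + name);
    return it->second;
}
```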
As for task distribution, we implement multi-threaded task invocation based on OpenMP[37]. Each sub-thread processes only part of the dataframe. The partitioning rule is dynamically generated according to the function type and the data size to ensure memory access efficiency. For CUDA objects, a CUDA stream takes charge of generating and scheduling work on each CUDA device.
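A minimal sketch of OpenMP-based column partitioning follows, assuming a column-sliced split so each worker touches a disjoint block of the row-major storage; HXPY's actual partitioning rule is generated dynamically as described above.

```cpp
#include <cstddef>
#include <omp.h>

// Sketch: statically split columns across OpenMP threads; kernel(begin, end)
// processes the half-open column range [begin, end).
template <typename Kernel>
void for_each_column_block(std::size_t n_cols, Kernel kernel) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        std::size_t chunk = (n_cols + nth - 1) / nth;
        std::size_t begin = static_cast<std::size_t>(tid) * chunk;
        std::size_t end = begin + chunk < n_cols ? begin + chunk : n_cols;
        if (begin < n_cols)
            kernel(begin, end);
    }
}
```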
5.4 Runtime Error Handling
When executing any function, various errors can occur. Some errors are acceptable, such as taking sqrt of a negative number, which only results in a numerical anomaly in the result, while others may prevent producing any result at all, such as adding two dataframes that do not have the same size.
Before and during the execution of a function, HXPY performs checks to ensure the reasonableness and integrity of the results. HXPY uses the newly introduced source location feature of C++20, so that all error reports carry debugging information such as line numbers and source file names. In Python, exceptions raised in HXPY's C++ sources are converted into exceptions that can be caught and handled by Python's native error handling mechanisms. Besides, there is a global switch to turn off warning messages, providing cleaner outputs for expert users.
5.5 Cross Language Build System
C++ is a statically typed language and needs to be compiled before execution, which conflicts with users' need to see analysis results interactively. Thus, it is essential to find an appropriate approach to interpreting and executing analysis statements. Pybind11 provides a modern and elegant way to expose C++ types in Python and vice versa. Many high-performance computing packages use Pybind11 to provide a more flexible method of programming, such as OpenFOAM, a solver for computational mechanics[38], and the SEAL library, a widespread implementation of fully homomorphic encryption[39]. HXPY also uses Pybind11 to enable its Python interface. Besides this, HXPY provides different compiled binaries for CPU and CUDA based on macro definitions to support different device platforms. We can use macro definitions to support AMD CPUs without AVX512, or ARM CPUs with NEON instructions.
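A minimal Pybind11 binding sketch is given below; the module and function names are illustrative only, not HXPY's actual API.

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <vector>

// Sketch: exposing a trivial C++ function to Python; pybind11/stl.h converts
// a Python list of floats into std::vector<double> automatically.
static double vector_sum(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

PYBIND11_MODULE(hxpy_demo, m) {
    m.doc() = "Minimal binding example";
    m.def("vector_sum", &vector_sum, "Sum a list of floats from Python");
}
```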
5.6 Docker Distribution and Documentation
Since we adopt a very recent compiler (GCC 11.2), the resulting binaries are not compatible with the glibc and libstdc++ versions shipped by common systems. At the same time, inconsistent CUDA versions can also cause run-time errors. To tackle this issue, we no longer distribute binary packages such as Python wheels or C shared libraries directly, but distribute optimized docker containers[40] instead, since in practice users otherwise have to spend time resolving various library and environment dependency problems; this is also similar to the practice of NVIDIA's cuDF. In our docker containers, various environment variables are set appropriately for optimal performance.
All functions in HXPY are documented with code examples in both C++ and Python. Annotation information such as function signatures and argument types is also embedded in the Python package and can thus be retrieved with Python's help function. Based on the documents, users can troubleshoot problems by themselves, check the supported functions, or perform performance tuning.
5.7 Custom Functions
For advanced users, such as financial researchers who want to implement new financial time-series functions or override the behavior of existing ones, HXPY reserves interfaces based on function pointers for various function types. Users only need to implement a lambda function to benefit from HXPY's efficient execution framework, including automatic multi-threading. This customization can be implemented in both Python and C++, as shown in Fig.23 and Fig.24 respectively, which enhances the extensibility of HXPY. In the near future, we also plan to provide run-time just-in-time (JIT) compilation of functions and a dynamic operator registration system.
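The sketch below conveys the idea with hypothetical names: the user supplies only an element-wise lambda, while the framework owns the execution loop and applies its automatic multi-threading.

```cpp
#include <functional>
#include <vector>

// The user-facing extension point: an element-wise function object.
using ElementFn = std::function<double(double)>;

// Framework-side driver (hypothetical): applies the user's lambda with
// the framework's own multi-threaded loop (compile with -fopenmp).
void apply_custom(const std::vector<double>& in, std::vector<double>& out,
                  const ElementFn& fn) {
    #pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(in.size()); ++i)
        out[i] = fn(in[i]);
}

// Usage: a user-defined "leaky ReLU" without touching framework internals.
// std::vector<double> out(in.size());
// apply_custom(in, out, [](double x) { return x > 0 ? x : 0.01 * x; });
```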
6. Conclusions
In this work, we proposed HXPY, a high-performance package for processing financial time-series data efficiently. HXPY implements a new dataframe architecture that supports both CPU and GPU backends, is optimized especially for financial applications, and offers a user-friendly interface similar to that of Python Pandas. Unlike previous dataframes that use column storage[14, 18], HXPY adopts row storage and supports a large number of streaming algorithms, so that statistical calculations over different time cross-sections can be vectorized along the row dimension. As a result, except for a few functions that access columns contiguously, the performance of HXPY clearly surpasses that of Pandas.
Compared with Modin, we found that Modin's multi-threading acceleration does not always scale linearly, and the Modin version of some functions is even slower than Pandas. This is because Modin uses the Ray[17] backend, which communicates over sockets, a relatively heavyweight approach. HXPY instead uses the more lightweight multi-threading provided by OpenMP[37], so its performance improves nearly linearly with the number of threads. In addition, the homogeneous storage design makes dataframe partitioning easier: HXPY usually only needs to partition along the index or the columns, while Modin needs to partition along both axes.
In the CUDA world, the performance of HXPY significantly exceeds that of NVIDIA cuDF. Although neither framework yet supports a sufficient set of operations on CUDA, and both are currently restricted by the limits of GPU memory, we are optimistic about data analysis on CUDA. Both HXPY and cuDF are still preliminary, and require effective partitioning and further work such as kernel tuning. Nevertheless, we believe that more financial data calculations will be moved onto GPUs as GPU memory and the set of supported operators grow.
Some industry partners are also making endeavors to improve the application of this framework. At present, the framework has been applied to financial data calculations for the stock and futures markets: tens of thousands of CPU cores and hundreds of GPUs are using HXPY for the cleaning, sorting, and feature calculation of various types of financial data from global exchanges, and this deployment has been running continuously for several months.
In the future, we plan to optimize and improve HXPY into a high-quality, open-sourced, and long-lasting project. We hope our work offers innovative thoughts, possibilities, and inspiration to both academic research and industrial applications in finance, and that the HXPY package proves valuable to researchers and practitioners in the field of computational finance.
-
Figure 4. NVIDIA GA100 Ampere architecture, which contains thousands of CUDA cores on chip.

Figure 5. A typical layout of a financial dataframe. The first column, consisting of datetimes, is called the index, and the first row, containing ticker name strings like "000001", is called the column. "000001" is the abbreviation of "000001.XSHE", i.e., the stock whose trading code is 000001 on the Shenzhen Stock Exchange, issued by Ping An Bank Co., Ltd. All floating-point values are the stocks' daily close prices.

Figure 9. Transposing storage to make column-wise functions access memory contiguously.

Figure 14. Transforming the memory layout to accelerate calculation.

Figure 16. Distributing rows or columns to different threads.

Figure 22. A misaligned memory address might cause performance loss.

Table 1. Large and Small Datasets Used in Benchmark
| Dataset | Dates | Start Time | End Time | Daily Timestamps | Number of Rows | Number of Columns | Memory Size (GiB) | CSV Size (GiB) |
|---|---|---|---|---|---|---|---|---|
| Small | 3890 | 20060104 | 20211231 | 1 | 3890 | 4797 | 0.03 | 0.06 |
| Large | 1000 | 20180102 | 20220216 | 240 | 240000 | 4740 | 4.34 | 8.61 |

Note: The file size is measured in GiB, i.e., 1024^3 (1 073 741 824) bytes.

Table 2. CSV File I/O Performance Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| read_csv | Small | 0.29 | 2.27 | 0.88 | 0.09 | 7.8x | 9.7x |
| read_csv | Large | 34.30 | 178.30 | 32.00 | 11.60 | 5.2x | 2.7x |
| to_csv | Small | 0.97 | 7.39 | 2.01 | 0.21 | 7.6x | 9.5x |
| to_csv | Large | 102.40 | 692.40 | 95.20 | 25.10 | 6.7x | 3.8x |

Note: The number in parentheses, e.g., HXPY(8), is the number of threads used in the benchmark. Speedup(n) denotes the speedup of HXPY over its counterpart in the n-threaded benchmark.

Table 3. Binary File I/O Performance Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas#(1) | Pandas$(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|---|
| read_binary | Small | 0.03 | 0.17 | 0.050 | 0.18 | 0.009 | 1.6x | 20.0x |
| read_binary | Large | 1.50 | 5.90 | 2.700 | 7.18 | 0.380 | 1.8x | 18.9x |
| to_binary | Small | 0.03 | 0.41 | 0.046 | 0.63 | 0.030° | 1.5x | 21.0x |
| to_binary | Large | 1.30 | 8.00 | 2.900 | 8.77 | 1.300° | 2.2x | 6.7x |

Note: # denotes Pandas using the Arrow format; $ denotes Pandas using the Python Pickle format; ° denotes that the multi-threaded version of to_binary is not implemented in HXPY, so single-threaded results were used.

Table 4. Element Functions Performance Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| power | Small | 0.180 | 0.240 | 0.120 | 0.028 | 1.30x | 4.30x |
| power | Large | 16.600 | 16.200 | 2.300 | 2.500 | 0.97x | 0.92x |
| round | Small | 0.020 | 0.450 | 0.670 | 0.010 | 22.00x | 67.00x |
| round | Large | 1.280 | 40.200 | 6.030 | 0.550 | 31.40x | 10.90x |
| abs | Small | 0.024 | 0.038 | 0.081 | 0.009 | 1.60x | 9.00x |
| abs | Large | 1.160 | 1.210 | 0.890 | 0.540 | 1.04x | 1.65x |
| relu | Small | 0.021 | 5.570 | 1.330 | 0.011 | 265.00x | 120.00x |
| relu | Large | 1.250 | 373.000 | 49.900 | 0.580 | 298.00x | 86.00x |

Table 5. Row-Wise and Column-Wise Functions Performance Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| rank(axis=1) | Small | 0.590 | 1.150 | 0.350 | 0.140 | 1.90x | 2.50x |
| rank(axis=1) | Large | 40.700 | 87.900 | 17.300 | 8.200 | 2.10x | 2.10x |
| rank(axis=0) | Small | 0.690 | 0.980 | 0.280 | 0.150 | 1.40x | 1.90x |
| rank(axis=0) | Large | 80.600 | 161.300 | 51.500 | 18.100 | 2.00x | 2.80x |
| std(axis=1) | Small | 0.017 | 0.351 | 0.098 | 0.002 | 20.60x | 49.00x |
| std(axis=1) | Large | 1.040 | 32.700 | 4.560 | 0.140 | 31.40x | 32.60x |
| std(axis=0) | Small | 0.093 | 0.126 | 0.076 | 0.012 | 1.30x | 6.30x |
| std(axis=0) | Large | 11.500 | 6.600 | 3.510 | 4.500 | 0.50x | 0.80x |
| maxmin_scale(axis=1) | Small | 0.040 | 1.270 | 0.260 | 0.030 | 31.70x | 8.70x |
| maxmin_scale(axis=1) | Large | 2.580 | 86.600 | 13.100 | 1.820 | 33.50x | 7.20x |
| maxmin_scale(axis=0) | Small | 0.300 | 1.290 | 0.280 | 0.050 | 4.30x | 5.60x |
| maxmin_scale(axis=0) | Large | 26.800 | 4.300 | 3.700 | 16.000 | 0.16x | 0.23x |

Table 6. Incremental Analysis of the ts_corr Function
| Language | Optimization or Description | Execution Time on Small Dataset (s) |
|---|---|---|
| Python | Brute-force | 1422.0000 |
| C++ | Brute-force | 29.3000 |
| C and Python | Pandas's ts_corr | 2.9100 |
| C++ | Memory optimization | 1.6400 |
| C++ | Memory optimization + SIMD | 0.5600 |
| C++ | Memory optimization + SIMD + FNO | 0.4000 |
| C++ | Memory optimization + SIMD + FNO + SA | 0.1400 |
| C++ | Memory optimization + SIMD + FNO + SA + 16 threads | 0.0450 |
| CUDA | SA + 4790 CUDA threads | 0.0093 |

Note: FNO: floating number optimization enabled; SA: streaming algorithm enabled.

Table 7. Time-Series Operations Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| ts_sum | Small | 0.078 | 0.51 | 0.187 | 0.040 | 6.50x | 4.6x |
| ts_sum | Large | 3.250 | 14.60 | 8.610 | 2.090 | 4.50x | 4.1x |
| ts_std | Small | 0.093 | 0.67 | 0.190 | 0.042 | 7.20x | 4.5x |
| ts_std | Large | 4.890 | 22.56 | 9.100 | 2.260 | 4.60x | 4.0x |
| ts_max | Small | 0.240 | 0.66 | 0.200 | 0.068 | 2.70x | 2.9x |
| ts_max | Large | 20.100 | 26.00 | 8.580 | 3.910 | 1.30x | 2.2x |
| ts_rank | Small | 1.850 | 2.91 | * | 0.400 | 1.50x | 7.2x |
| ts_rank | Large | 176.000 | 234.00 | * | 27.800 | 1.30x | 8.4x |
| ts_corr | Small | 0.140 | 2.39 | 4.130 | 0.050 | 17.10x | 82.6x |
| ts_corr | Large | 8.510 | 84.00 | 121.700 | 2.770 | 9.90x | 43.9x |
| ts_argmaxmin_diff | Small | 0.430 | 901.00 | 176.600 | 0.100 | 2100.00x | 1760.0x |
| ts_argmaxmin_diff | Large | 45.100 | ** | ** | 11.600 | N/A | N/A |
| ts_subsample_median | Small | 0.240 | 0.21 | 0.170 | 0.044 | 0.87x | 3.8x |
| ts_subsample_median | Large | 25.100 | 15.70 | 5.620 | 4.790 | 0.62x | 1.2x |

Note: * denotes that rolling().rank() is not supported in Modin 0.14; ** denotes too slow (more than one hour) to measure.

Table 8. Grouped Functions Compared with Pandas and Modin
| Function | Dataset | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| grouped_count | Small | 0.90 | 13.8 | 24.9 | 0.19 | 15.3x | 131x |
| grouped_max | Small | 0.96 | 19.7 | 30.8 | 0.20 | 20.5x | 154x |
| grouped_mean | Small | 1.00 | 30.1 | 40.3 | 0.20 | 30.1x | 201x |

Table 9. Shape Manipulation Functions Compared with Pandas and Modin
| Function | Operation | HXPY(1) | Pandas(1) | Modin(8) | HXPY(8) | Speedup(1) | Speedup(8) |
|---|---|---|---|---|---|---|---|
| concat | 100 daily files | 0.044 | 3.970 | 0.910 | 0.038 | 90.20x | 23.9x |
| concat | 1000 daily files | 0.450 | 79.600 | 18.300 | 0.300 | 176.80x | 61.0x |
| align | Small to large | 1.090 | 1.860 | 6.110 | 0.470 | 1.70x | 13.0x |
| align | Large to small | 0.021 | 0.042 | 1.000 | 0.017 | 2.00x | 58.8x |
| reindex | 1000 indices in small set | 0.005 | 0.003 | 0.066 | 0.002 | 0.66x | 33.0x |
| reindex | 10000 indices in large set | 0.130 | 0.320 | 0.980 | 0.120 | 2.40x | 8.1x |

Table 10. CUDA Functions Compared with NVIDIA cuDF
| Function | Dataset | HXPY(8) | HXPY(CUDA) | cuDF(CUDA) | Speedup(CPU) | Speedup(CUDA) |
|---|---|---|---|---|---|---|
| abs | Small | 0.009 | 0.0021 | 0.27 | 4.3x | 128.0x |
| abs | Large | 0.540 | 0.0280 | 4.11 | 19.3x | 146.0x |
| power | Small | 0.028 | 0.0020 | 0.81 | 14.0x | 405.0x |
| power | Large | 2.500 | 0.0360 | 4.54 | 69.4x | 126.0x |
| rank(axis=0) | Small | 0.150 | 0.1500 | 1.96 | 1.0x | 13.1x |
| rank(axis=0) | Large | 18.100 | 9.4500 | 19.40 | 1.9x | 2.0x |
| rank(axis=1) | Small | 0.140 | 0.1500 | + | 0.9x | N/A |
| rank(axis=1) | Large | 8.200 | 8.8300 | + | 0.9x | N/A |
| ts_sum | Small | 0.040 | 0.0043 | 0.47 | 9.3x | 109.0x |
| ts_sum | Large | 2.090 | 0.2000 | 4.58 | 10.4x | 22.9x |
| ts_corr | Small | 0.050 | 0.0093 | # | 5.4x | N/A |
| ts_corr | Large | 2.770 | 0.5000 | # | 5.5x | N/A |

Note: HXPY(CUDA) and cuDF(CUDA) were benchmarked on a single NVIDIA A100 40 GB with CUDA 11.6. Speedup(CPU) compares HXPY's CUDA version with HXPY's CPU multi-threaded version; Speedup(CUDA) compares HXPY's CUDA version with cuDF's CUDA version. + denotes that rank() along axis=1 is not supported in cuDF 2022.4, and # denotes that rolling().corr() is not supported in cuDF 2022.4.

-
[1] Farnoosh A, Azari B, Ostadabbas S. Deep switching auto-regressive factorization: Application to time series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(8): 7394–7403. DOI: 10.1609/aaai.v35i8.16907.
[2] Rasul K, Seward C, Schuster I, Vollgraf R. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In Proc. the 38th International Conference on Machine Learning, Jul. 2021, pp.8857–8868.
[3] Pan Q Y, Hu W B, Chen N. Two birds with one stone: Series saliency for accurate and interpretable multivariate time series forecasting. In Proc. the 30th International Joint Conference on Artificial Intelligence, Aug. 2021, pp.2884–2891. DOI: 10.24963/ijcai.2021/397.
[4] Lee D, Lee S, Yu H. Learnable dynamic temporal pooling for time-series classification. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(9): 8288–8296. DOI: 10.1609/aaai.v35i9.17008.
[5] Mbouopda M F. Uncertain time series classification. In Proc. the 30th International Joint Conference on Artificial Intelligence, Aug. 2021, pp.4903–4904. DOI: 10.24963/ijcai.2021/683.
[6] Yue Z H, Wang Y J, Duan J Y, Yang T M, Huang C R, Tong Y H, Xu B X. TS2Vec: Towards universal representation of time series. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(8): 8980–8987. DOI: 10.1609/aaai.v36i8.20881.
[7] Eldele E, Ragab M, Chen Z H, Wu M, Kwoh C K, Li X L, Guan C T. Time-series representation learning via temporal and contextual contrasting. In Proc. the 30th International Joint Conference on Artificial Intelligence, Aug. 2021, pp.2352–2359. DOI: 10.24963/ijcai.2021/324.
[8] Deng A L, Hooi B. Graph neural network-based anomaly detection in multivariate time series. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(5): 4027–4035. DOI: 10.1609/aaai.v35i5.16523.
[9] Kim S, Choi K, Choi H S, Lee B, Yoon S. Towards a rigorous evaluation of time-series anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(7): 7194–7201. DOI: 10.1609/aaai.v36i7.20680.
[10] McGowan M J. The rise of computerized high frequency trading: Use and controversy. Duke L. & Tech. Rev., 2010, 16.
[11] Yang X, Liu W Q, Zhou D, Bian J, Liu T Y. Qlib: An AI-oriented quantitative investment platform. arXiv: 2009.11189, 2021. https://arxiv.org/abs/2009.11189, Dec. 2022.
[12] Ding Q G, Wu S F, Sun H, Guo J D, Guo J. Hierarchical multi-scale Gaussian transformer for stock movement prediction. In Proc. the 29th International Joint Conference on Artificial Intelligence, Jul. 2020, pp.4640–4646. DOI: 10.24963/ijcai.2020/640.
[13] Wang J Y, Zhang Y, Tang K, Wu J J, Xiong Z. Alphastock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In Proc. the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Jul. 2019, pp.1900–1908. DOI: 10.1145/3292500.3330647.
[14] McKinney W. Pandas: A foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing, 2011, 14(9): 1–9.
[15] Petersohn D. Dataframe systems: Theory, architecture, and implementation. Technical Report No. UCB/EECS-2021-193, University of California, Berkeley, 2021. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-193.html, Dec. 2022.
[16] Petersohn D, Macke S, Xin D, Ma W, Lee D J L, Mo X X, Gonzalez J E, Hellerstein J M, Joseph A D, Ganesh A. Towards scalable dataframe systems. Proceedings of the VLDB Endowment, 2020, 13(12): 203–204. DOI: 10.14778/3407790.3407807.
[17] Moritz P, Nishihara R, Wang S, Tumanov A, Liaw R, Liang E, Elibol M, Yang Z H, Paul W, Jordan M I, Stoica I. Ray: A distributed framework for emerging AI applications. In Proc. the 13th USENIX Symposium on Operating Systems Design and Implementation, Oct. 2018, pp.561–577. https://www.usenix.org/system/files/osdi18-moritz.pdf, Jan. 2023.
[18] Petersohn D, Tang D X, Durrani R, Melik-Adamyan A, Gonzalez J E, Joseph A D, Parameswaran A G. Flexible rule-based decomposition and metadata independence in modin: A parallel dataframe system. Proceedings of the VLDB Endowment, 2021, 15(3): 739–751. DOI: 10.14778/3494124.3494152.
[19] Hord R M. The Illiac IV: The First Supercomputer. Springer Science & Business Media, 2013.
[20] Langdale G, Lemire D. Parsing gigabytes of JSON per second. The VLDB Journal, 2019, 28(6): 941–960. DOI: 10.1007/s00778-019-00578-5.
[21] Watanabe H, Nakagawa K M. SIMD vectorization for the Lennard-Jones potential with AVX2 and AVX-512 instructions. Computer Physics Communications, 2019, 237: 1–7. DOI: 10.1016/j.cpc.2018.10.028.
[22] Kahan W. IEEE standard 754 for binary floating-point arithmetic. Lecture Notes on the Status of IEEE 754, 1996.
[23] Ihaka R, Gentleman R. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 1996, 5(3): 299–314. DOI: 10.2307/1390807.
[24] Bezanson J, Edelman A, Karpinski S, Shah V B. Julia: A fresh approach to numerical computing. SIAM Review, 2017, 59(1): 65–98. DOI: 10.1137/141000671.
[25] Duvinage M, Mazza P, Petitjean M. The intra-day performance of market timing strategies and trading systems based on Japanese candlesticks. Quantitative Finance, 2013, 13(7): 1059–1070. DOI: 10.1080/14697688.2013.768774.
[26] Nelson D M Q, Pereira A C M, De Oliveira R A. Stock market’s price movement prediction with LSTM neural networks. In Proc. International Joint Conference on Neural Networks (IJCNN), May 2017, pp.1419–1426. DOI: 10.1109/IJCNN.2017.7966019.
[27] Tummon E, Raja M A, Ryan C. Trading cryptocurrency with deep deterministic policy gradients. In Proc. the 21st International Conference on Intelligent Data Engineering and Automated Learning, Nov. 2020, pp.245–256. DOI: 10.1007/978-3-030-62362-3_22.
[28] De Guzman J, Nuffer D. The Spirit parser library: Inline parsing in C++. C/C++ Users Journal, 2003, 21(9): 22–46.
[29] Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 1999, 58(1): 137–147. DOI: 10.1006/jcss.1997.1545.
[30] Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K. Apache Flink™: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 36(4): 28–38.
[31] Iqbal M H, Soomro T R. Big data analysis: Apache storm perspective. International Journal of Computer Trends and Technology, 2015, 19(1): 9–14. DOI: 10.14445/22312803/IJCTT-V19P103.
[32] Foley D, Danskin J. Ultra-performance Pascal GPU and NVLink interconnect. IEEE Micro, 2017, 37(2): 7–17. DOI: 10.1109/MM.2017.37.
[33] Grant D A. The Latin square principle in the design and analysis of psychological experiments. Psychological Bulletin, 1948, 45(5): 427–442. DOI: 10.1037/h0053912.
[34] Agarap A F. Deep learning using rectified linear units (ReLU). arXiv: 1803.08375, 2018. https://arxiv.org/abs/1803.08375, Dec. 2022.
[35] Harris C R, Millman K J, van der Walt S J, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith N J, Kern R, Picus M, Hoyer S, van Kerkwijk M H, Brett M, Haldane A, del Río J F, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant T E. Array programming with NumPy. Nature, 2020, 585(7825): 357–362. DOI: 10.1038/s41586-020-2649-2.
[36] Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z M, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J J, Chintala S. PyTorch: An imperative style, high-performance deep learning library. In Proc. the 33rd International Conference on Neural Information Processing Systems, Dec. 2019, Article No. 712.
[37] Dagum L, Menon R. OpenMP: An industry standard API for shared-memory programming. IEEE Computational Science and Engineering, 1998, 5(1): 46–55. DOI: 10.1109/99.660313.
[38] Rodriguez S, Cardiff P. A general approach for running Python codes in OpenFOAM using an embedded Pybind11 Python interpreter. OpenFOAM® Journal, 2022, 2: 166–182. DOI: 10.51560/ofj.v2.79.
[39] Titus A J, Kishore S, Stavish T, Rogers S M, Ni K. PySEAL: A Python wrapper implementation of the SEAL homomorphic encryption library. arXiv: 1803.01891, 2018. https://arxiv.org/abs/1803.01891, Dec. 2022.
[40] Anderson C. Docker [software engineering]. IEEE Software, 2015, 32(3): 102-c3. DOI: 10.1109/MS.2015.62.
-