
计算机科学技术学报 ›› 2019,Vol. 34 ›› Issue (1): 7793.doi: 10.1007/s1139001919005
所属专题： Computer Architecture and Systems
Min Li^{1,2}, Student Member, CCF, Chao Yang^{3,4,5,*}, Senior Member, CCF, Member, ACM, IEEE, Qiao Sun^{1} WenJing Ma^{1}, WenLong Cao^{1,2}, Student Member, CCF, and YuLong Ao^{3,4,5}, Member, CCF
Min Li^{1,2}, Student Member, CCF, Chao Yang^{3,4,5,*}, Senior Member, CCF, Member, ACM, IEEE, Qiao Sun^{1} WenJing Ma^{1}, WenLong Cao^{1,2}, Student Member, CCF, and YuLong Ao^{3,4,5}, Member, CCF
kmeans是无监督学习中一种经典的基于距离计算的聚类算法，由于简单，易实现，无需标记样本等特点，kmeans在图像处理、数据挖掘、文本聚类、生物学等领域有着广泛的应用，并越来越多的被用作许多更复杂算法的预处理手段。大数据时代的到来，使得样本数据的特征维度从原来的几十维上升到了数以千计，相应的，对计算速度也提出了更高的要求，因此，研究kmeans的并行加速对于实际应用是非常重要且有意义的。我们提出了一种将距离计算与簇标签规约融合的kmeans并行算法；同时将其部署在申威26010众核处理器上。该算法减少了对距离矩阵的存储及内存访问次数，使用三层分块策略进行任务划分，以实现数据到硬件资源的高效映射，并利用协作式的核间数据共享方式以提高数据的重用率，使用寄存器通信机制实现簇标签规约。通过使用指令重排、双缓冲等优化技术，我们进一步大幅提升了kmeans的性能。此外，我们对分块大小的选择进行了讨论，并建立简单的性能模型进行性能分析。实验显示，我们的方法可以达到348.1 GFlops的浮点计算性能，相比机器最大浮点性能/理论上最大性能，可以获得46.9%/84%的浮点计算效率，相比较当前在CPU和GPU上最好的实现算法，具有很大的优越性。
[1] Wu X, Kumar V, Quinlan J R, Ghosh J, Yang Q, Motoda H, Mclachlan G J, Ng A S, Liu B, Yu P S et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 2007, 14(1):137. [2] Muja M, Lowe D G. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(11):22272240. [3] You Y, Demmel J, Czechowski K, Song L, Vuduc R. CASVM:Communicationavoiding support vector machines on distributed systems. In Proc. IEEE International Parallel and Distributed Processing Symposium, May 2015, pp.847859. [4] Wu J, Leng C, Wang Y, Hu Q, Cheng J. Quantized convolutional neural networks for mobile devices. In Proc. Computer Vision and Pattern Recognition, June 2016, pp.48204828. [5] Narayanan R, Ozisikyilmaz B, Zambreno J, Memik G, Choudhary A. MineBench:A benchmark suite for data mining workloads. In Proc. IEEE International Symposium on Workload Characterization, October 2006, pp.182188. [6] Hadian A, Shahrivari S. High performance parallel kmeans clustering for diskresident datasets on multicore CPUs. Journal of Supercomputing, 2014, 69(2):845863. [7] Wang H, Zhao J, Li H, Wang J. Parallel clustering algorithms for image processing on multicore CPUs. In Proc. International Conference on Computer Science and Software Engineering, December 2008, pp.450453. [8] Farivar R, Rebolledo D, Chan E, Campbell R H. A parallel implementation of kmeans clustering on GPUs. In Proc. International Conference on Parallel and Distributed Processing Techniques and Applications, July 2008, pp.340345. [9] Che S, Boyer M, Meng J, Tarjan D, Sheaffer J W, Skadron K. A performance study of general purpose applications on graphics processors using CUDA. Journal of Parallel & Distributed Computing, 2008, 68(10):13701380. [10] Fang W, Lau K K, Lu M, Xiao X, Chi K L, Yang P Y, He B, Luo Q, Sander P V, Yang K. Parallel data mining on graphics processors. Technical Report, Hong Kong University of Science and Technology, 2008. http://www.comp.hkbu.edu.hk/~youli/Papers/Papers/2009Fall/Mining/Parallel%20Data%20Mining%20on%20Graphics%20Processorsgpuminer.pdf, September 2018. [11] Wu R, Zhang B, Hsu M. Clustering billions of data points using GPUs. In Proc. the Combined Workshops on UnConventional High Performance Computing Workshop Plus Memory Access Workshop, May 2009, pp.16. [12] Zechner M, Granitzer M. Accelerating kmeans on the graphics processor via CUDA. In Proc. the 1st International Conference on Intensive Applications and Services, April 2009, pp.715. [13] Kohlhoff K J, Pande V S, Altman R B. Kmeans for parallel architectures using allprefixsum sorting and updating steps. IEEE Transactions on Parallel and Distributed Systems, 2013, 24(8):16021612. [14] Li Y, Zhao K, Chu X, Liu J. Speeding up kmeans algorithm by GPUs. Journal of Computer and System Sciences, 2013, 79(2):216229. [15] Karantasis K I, Polychronopoulos E D, Dimitrakopoulos G N. Accelerating data clustering on GPUbased clusters under shared memory abstraction. In Proc. IEEE International Conference on Cluster Computing Workshops and Posters, September 2010, Article No. 31. [16] Karantasis K I, Polychronopoulos E D. Programming GPU clusters with shared memory abstraction in software. In Proc. the 19th Euromicro International Conference on Parallel, Distributed and NetworkBased Processing, February 2011, pp.223230. [17] Wasif M K, Narayanan P J. Scalable clustering using multiple GPUs. In Proc. the 18th International Conference on High Performance Computing, December 2011, Article No. 14. [18] Kijsipongse E, URuekolan S. Dynamic load balancing on GPU clusters for largescale kmeans clustering. In Proc. the 9th International Joint Conference on Computer Science and Software Engineering, May 2012, pp.346350. [19] Stuart J A, Owens J D. MultiGPU MapReduce on GPU clusters. In Proc. IEEE International Parallel & Distributed Processing Symposium, May 2011, pp.10681079. [20] Wu F, Wu Q, Tan Y, Wei L, Shao L, Gao L. A vectorized kmeans algorithm for Intel many integrated core architecture. In Proc. the 10th International Symposium on Advanced Parallel Processing Technologies, August 2013, pp.277294. [21] Fu H, Liao J, Yang J, Wang L, Song Z, Huang X, Yang C, Xue W, Liu F, Qiao F et al. The Sunway TaihuLight supercomputer:System and applications. Science China Information Sciences, 2016, 59(7):Article No. 072001. [22] Yang C, Xue W, Fu H, You H, Wang X, Ao Y, Liu F, Gan L, Xu P, Wang L et al. 10Mcore scalable fullyimplicit solver for nonhydrostatic atmospheric dynamics. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis, November 2016, Article No. 6. [23] Garey M R, Johnson D S, Witsenhausen H S. The complexity of the generalized LloydMax problem (Corresp.). IEEE Transactions on Information Theory, 1982, 28(2):255256. [24] Hamerly G, Elkan C. Alternatives to the kmeans algorithm that find better clusterings. In Proc. the 11th International Conference on Information and Knowledge Management, November 2002, pp.600607. [25] Bradley P S, Fayyad U M. Refining initial points for kmeans clustering. In Proc. the 15th International Conference on Machine Learning, July 1998, pp.9199. [26] Arthur D, Vassilvitskii S. Kmeans++:The advantages of careful seeding. In Proc. the 18th ACMSIAM Symposium on Discrete Algorithms, January 2007, pp.10271035. [27] Sarje A, Zola J, Aluru S. Accelerating pairwise computations on cell processors. IEEE Transactions on Parallel and Distributed Systems, 2011, 22(1):6977. [28] Sarje A, Aluru S. Allpairs computations on manycore graphics processors. Parallel Computing, 2013, 39(2):7993. [29] Steuwer M, Friese M, Albers S, Gorlatch S. Introducing and implementing the allpairs skeleton for programming multiGPU systems. International Journal of Parallel Programming, 2014, 42(4):601618. [30] Dongarra J J, Croz J D, Hammarling S, Duff I S. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 1990, 16(1):117. [31] Dhillon I, Modha D. A data clustering algorithm on distributed memory multiprocessors. Lecture Notes in Computer Science, 2000, 1759:245260. [32] Zhang J, Zhou C, Wang Y, Ju L, Du Q, Chi X, Xu D, Chen D, Liu Y, Liu Z. Extremescale phase field simulations of coarsening dynamics on the Sunway TaihuLight supercomputer. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis, November 2016, Article No. 4. [33] Lucas D D, Yver Kwok C, Cameronsmith P, Graven H, Bergmann D, Guilderson T P, Weiss R, Keeling R. Designing optimal greenhouse gas observing networks that consider performance and cost. Geoscientific Instrumentation Methods & Data Systems Discussions, 2014, 4(1):121137. [34] Sapsanis C, Georgoulas G, Tzes A, Lymberopoulos D. Improving EMG based classification of basic hand movements using EMD. In Proc. the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, July 2013, pp.57545757. [35] Altun K, Barshan B, Tuncel O. Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 2010, 43(10):36053620. [36] Barshan B, Yüksek M C. Recognizing daily and sports activities in two open source machine learning environments using bodyworn sensor units. Computer Journal, 2014, 57(11):16491667. [37] Altun K, Barshan B. Human activity recognition using inertial/magnetic sensor units. In Proc. the 1st International Workshop on Human Behavior Understanding, August 2010, pp.3851. [38] Newling J, Fleuret F. Fast kmeans with accurate bounds. In Proc. the 33rd International Conference on International Conference on Machine Learning, June 2016, pp.936944. [39] Elkan C. Using the triangle inequality to accelerate kmeans. In Proc. the 20th International Conference on Machine Learning, August 2003, pp.147153. [40] Drake J, Hamerly G. Accelerated kmeans with adaptive distance bounds. In Proc. the 5th NIPS Workshop on Optimization for Machine Learning, December 2012, pp.4253. [41] Ding Y, Zhao Y, Shen X, Musuvathi M, Mytkowicz T. Yinyang kmeans:A dropin replacement of the classic kmeans with consistent speedup. In Proc. the 32nd International Conference on Machine Learning, July 2015, pp.579587. [42] Bottesch T, Buhler T, Kachele M. Speeding up kmeans by approximating Euclidean distances via block vectors. In Proc. the 33rd International Conference on Machine Learning, June 2016, pp.25782586. [43] Wu J, Hong B. An efficient kmeans algorithm on CUDA. In Proc. IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, May 2011, pp.17401749. [44] Lin Z, Lo C, Chow P. Kmeans implementation on FPGA for highdimensional data using triangle inequality. In Proc. the 22nd International Conference on Field Programmable Logic and Applications, August 2012, pp.437442. [45] Winterstein F, Bayliss S, Constantinides G A. FPGAbased kmeans clustering using treebased data structures. In Proc. the 23rd International Conference on Field Programmable Logic and Applications, September 2013, Article No. 18. [46] Tang Q Y, Khalid M A S. Acceleration of kmeans algorithm using Altera SDK for OpenCL. ACM Transactions on Reconfigurable Technology and Systems, 2016, 10(1):Article No. 6. [47] Jiang L, Yang C, Ao Y, Yin W, Ma W, Sun Q, Liu F, Lin R, Zhang P. Towards highly efficient DGEMM on the emerging SW26010 manycore processor. In Proc. the 46th International Conference on Parallel Processing, August 2017, pp.422431. [48] Fang J, Fu H, Zhao W, Chen B, Zheng W, Yang G. swDNN:A library for accelerating deep learning applications on Sunway TaihuLight. In Proc. IEEE International Parallel & Distributed Processing Symposium, May 2017, pp.615624. [49] Zhao W, Ma H, He Q. Parallel kmeans clustering based on MapReduce. In Proc. the 1st IEEE International Conference on Cloud Computing, December 2009, pp.674679. [50] Cordeiro R L F, Traina A J M, Kang U, Faloutsos C et al. Clustering very large multidimensional datasets with MapReduce. In Proc. the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2011, pp.690698. [51] Li Q, Wang P, Wang W, Hu H, Li Z, Li J. An efficient kmeans clustering algorithm on MapReduce. In Proc. the 19th International Conference on Database Systems for Advanced Applications, April 2014, pp.357371. 
[1]  Jason Liu, Pedro Espina, XianHe Sun. 关于储存系统建模和优化的综述[J]. 计算机科学技术学报, 2021, 36(1): 7189. 
[2]  Lan Huang, DaLin Li, KangPing Wang, Teng Gao, Adriano Tavares. 一个关于高级综合工具性能优化的综述[J]. 计算机科学技术学报, 2020, 35(3): 697720. 
[3]  Qi Chen, Kang Chen, ZuoNing Chen, Wei Xue, Xu Ji, Bin Yang. 神威存储系统面向应用I/O性能提升的优化介绍[J]. 计算机科学技术学报, 2020, 35(1): 4760. 

版权所有 © 《计算机科学技术学报》编辑部 本系统由北京玛格泰克科技发展有限公司设计开发 技术支持：support@magtech.com.cn 总访问量： 