计算机科学技术学报 ›› 2019,Vol. 34 ›› Issue (5): 1152-1166.doi: 10.1007/s11390-019-1966-0

所属专题: Data Management and Data Mining

• Data Management and Data Mining • 上一篇    

基于标签属性图节点凝聚力的CK-modes聚类算法

Da-Wei Wang1, Wan-Qiu Cui2, Biao Qin1,*, Member, CCF   

  1. 1 School of Information, Renmin University of China, Beijing 100872, China;
    2 School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • 收稿日期:2018-07-29 修回日期:2019-07-25 出版日期:2019-08-31 发布日期:2019-08-31
  • 通讯作者: Biao Qin E-mail:qinbiao@ruc.edu.cn
  • 作者简介:Da-Wei Wang is a Ph.D. candidate in School of Information, Renmin University of China, Beijing. His research interests include design and analysis of algorithms, social network mining, databases, and graph query.
  • 基金资助:
    The work was supported by the National Natural Science Foundation of China under Grant No. 61772534, and the Excellent Chinese-Foreign Youth Exchange Foundation Program of Chinese Association of Science and Technology under Grant No. 311319000207.

CK-Modes Clustering Algorithm Based on Node Cohesion in Labeled Property Graph

Da-Wei Wang1, Wan-Qiu Cui2, Biao Qin1,*, Member, CCF   

  1. 1 School of Information, Renmin University of China, Beijing 100872, China;
    2 School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received:2018-07-29 Revised:2019-07-25 Online:2019-08-31 Published:2019-08-31
  • Contact: Biao Qin E-mail:qinbiao@ruc.edu.cn
  • About author:Da-Wei Wang is a Ph.D. candidate in School of Information, Renmin University of China, Beijing. His research interests include design and analysis of algorithms, social network mining, databases, and graph query.
  • Supported by:
    The work was supported by the National Natural Science Foundation of China under Grant No. 61772534, and the Excellent Chinese-Foreign Youth Exchange Foundation Program of Chinese Association of Science and Technology under Grant No. 311319000207.

对K-modes聚类算法而言,聚类数K和初始质心的确定是至关重要的。然而,大多数基于K-modes的改进方法都是手工指定K值并随机生成初始质心。这使得聚类算法受到人为因素的影响较大,且迭代时间不稳定。为了克服这个限制,我们提出了内聚K-modes(CK-modes)算法来自动生成簇数K和初始质心。具体地,我们通过构造基于免索引邻接的标记属性图,以捕获输入数据集样本中节点的全局和局部内聚力。利用基于属性相似度计算的内聚节点将图分解为K-nodes树,其可以确定K值,然后从分割子树中选择初始质心。在真实数据集和合成数据集上的实验验证表明,CK-modes算法在聚类质量上优于最先进的算法。由于属性图的构造和内聚计算与K-modes无关,前者的占用时间仅是后者多次迭代聚类时间的一次性影响,因此,CK-modes显著地加速了聚类收敛。

关键词: 聚类数K, 属性图, 免索引邻接, K-nodes树, 凝聚力节点

Abstract: The designation of the cluster number K and the initial centroids is essential for K-modes clustering algorithm. However, most of the improved methods based on K-modes specify the K value manually and generate the initial centroids randomly, which makes the clustering algorithm significantly dependent on human-based decisions and unstable on the iteration time. To overcome this limitation, we propose a cohesive K-modes (CK-modes) algorithm to generate the cluster number K and the initial centroids automatically. Explicitly, we construct a labeled property graph based on index-free adjacency to capture both global and local cohesion of the node in the sample of the input datasets. The cohesive node calculated based on the property similarity is exploited to split the graph to a K-node tree that determines the K value, and then the initial centroids are selected from the split subtrees. Since the property graph construction and the cohesion calculation are only performed once, they account for a small amount of execution time of the clustering operation with multiple iterations, but significantly accelerate the clustering convergence. Experimental validation in both real-world and synthetic datasets shows that the CK-modes algorithm outperforms the state-of-the-art algorithms.

Key words: cluster number, property graph, index-free adjacency, K-node tree, cohesive node

[1] Shiokawa H, Fujiwara Y, Onizuka M. SCAN++:Efficient algorithm for finding clusters, hubs and outliers on largescale graphs. Proceedings of the VLDB Endowment, 2015, 8(11):1178-1189.
[2] Zhang W P, Li Z J, Li R H, Liu Y H, Mao R, Qiao S J. MapReduce-based graph structural clustering algorithm. Journal of Software, 2018, 29(3):627-641. (in Chinese)
[3] Wu Y, Zhong Z N, Xiong W, Chen L, Jing N. An efficient method for attributed graph clustering. Chinese Journal of Computer, 2013, 36(8):1704-1713. (in Chinese)
[4] Guo T, Ding X W, Li Y F. Parallel K-modes algorithm based on MapReduce. In Proc. the 3rd International Conference on Digital Information, Networking, and Wireless Communications, February 2015, pp.176-179.
[5] Zhou F F, Li J C, Huang W, Wang J H, Zhao Y. Extending dimensions in Radviz for visual clustering analysis. Journal of Software, 2016, 27(5):1127-1139. (in Chinese)
[6] Noori-Daryan M, Taleizadeh A A, Govindan K. Joint replenishment and pricing decisions with different freight modes considerations for a supply chain under a composite incentive contract. Journal of the Operational Research Society, 2018, 69(6):876-894.
[7] Huang Z X. Clustering large data sets with mixed numeric and categorical values. In Proc. the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining, February 1997, pp.21-35.
[8] Ahmad A, Dey L. A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recognition Letters, 2007, 28(1):110-118.
[9] Park H S, Jun C H. A simple and fast algorithm for Kmedoids clustering. Expert Systems with Applications, 2009, 36(2):3336-3341.
[10] Zadegan S M R, Mirzaie M, Sadoughi F. Randed Kmedoids:A fast and accurate rank-based partitioning algorithm for clustering large datasets. Knowledge-Based Systems, 2013, 39:133-143.
[11] Ferrarini L, Olofsen H, Palm W M, van Buchem M A, Reiber J H C, Admiraal-Behloul F. GAMEs:Growing and adaptive meshes for fully automatic shape modeling and analysis. Medical Image Analysis. 2007, 11(3):302-314.
[12] Ng M K, Chan E Y, So M M C, Ching W K. A semisupervised regression model for mixed numerical and categorical variables. Pattern Recognition, 2007, 40(6):1745-1752.
[13] Bachem O, Lucic M, Hassani S H, Krause A. Approximate K-means++ in sublinear time. In Proc. the 30th AAAI Conference on Artificial Intelligence, February 2016, pp.1459-1467.
[14] Arthur D, Vassilvitskii S. K-means++:The advantages of careful seeding. In Proc. the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, January 2007, pp.1027-1035.
[15] Liu Y C, Li Z M, Xiong H, Gao X D, Wu J J. Understanding of internal clustering validation measures. In Proc. the 10th IEEE International Conference on Data Mining, December 2010, pp.911-916.
[16] Liu Y C, Li Z M, Xiong H, Gao X D, Wu J J. Understanding and enhancement of internal clustering validation measures. IEEE Transactions on Cybernetics, 2013, 43(3):982-994.
[17] Robinson I, Webber J, Eifrem E. Graph Databases (1st edition). O'Reilly Media, 2013.
[18] Akpan N P, Iwok I A. A minimum spanning tree approach of solving a transportation problem. International Journal of Mathematics and Statistics Invention, 2017, 5(3):9-18.
[19] Li M C, Han S, Shi J. An enhanced ISODATA algorithm for recognizing multiple electric appliances from the aggregated power consumption dataset. Energy and Buildings, 2017, (140):305-316.
[20] Hesthaven J S. A stable penalty method for the compressible Navier-Stokes equations:Ⅱ. One-dimensional domain decomposition schemes. SIAM Journal on Scientific Computing, 1997, 18(3):658-685.
[21] Jin X, Han J. K-medoids clustering. In Encyclopedia of Machine Learning, Sammut G, Webb G I (eds.), Springer, 2016, pp.564-565.
[22] Han L S, Xiang L S, Liu X Y, Luan J. The K-medoids algorithm with initial centers optimized based on a P System. Journal of Information and Computational Science, 2014, 11(6):1765-1773.
[23] Kang Z, Peng C, Cheng Q. Clustering with adaptive manifold structure learning. In Proc. the 33rd Int. Conference on Data Engineering, Apr. 2017, pp.79-82.
[24] Nehak D, Dehak R, Glass J, Reynolds D, Kenny P. Cosine similarity scoring without score normalization techniques. In Proc. the Speaker and Language Recognition Workshop, June 2010, Article No. 15.
[25] Cheng H, Zhou Y, Yu J X. Clustering large attributed graphs:A balance between structural and attribute similarities. ACM Transactions on Knowledge Discovery from Data, 2011, 5(2):Article No. 12.
[26] Chang L J, Li W, Lu Q, Zhang W J, Yang S Y. pSCAN:Fast and exact structural graph clustering. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(2):387-401.
[27] Schubert E, Sander J, Ester M, Kriegel H P, Xu X W. DBSCAN revisited, revisited:Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems, 2017, 42(3):Article No. 19.
[28] Du Z H, Li Y B. An improved BIRCH clustering algorithm and application in thermal power. In Proc. the 2010 International Conference on Web Information Systems and Mining, October 2010, pp.53-56.
[29] Xiong H, Wu J J, Chen J. K-means clustering versus validation measures:A data-distribution perspective. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 2009, 39(2):318-331.
[30] Wu J J, Xiong H, Chen J. Adapting the right measures for K-means clustering. In Proc. the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, June 2009, pp.877-886.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 刘明业; 洪恩宇;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] 陈世华;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] 高庆狮; 张祥; 杨树范; 陈树清;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] 闵应骅; 韩智德;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] 闵应骅;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] 张钹; 张铃;. Statistical Heuristic Search[J]. , 1987, 2(1): 1 -11 .
[10] 朱鸿;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: