›› 2016, Vol. 31 ›› Issue (6): 1194-1211.doi: 10.1007/s11390-016-1692-9

Special Issue: Data Management and Data Mining

• Regular Paper • Previous Articles     Next Articles

Efficient Metric All-k-Nearest-Neighbor Search on Datasets Without Any Index

Hai-Da Zhang1,2, Zhi-Hao Xing1, Lu Chen1, Yun-Jun Gao1*, Senior Member, CCF, Member, ACM, IEEE   

  1. 1 College of Computer Science, Zhejiang University, Hangzhou 310027, China;
    2 School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, Australia
  • Received:2015-04-29 Revised:2016-05-26 Online:2016-11-05 Published:2016-11-05
  • Supported by:

    This work was supported in part by the National Basic Research 973 Program of China under Grant No. 2015CB352502, the National Natural Science Foundation of China under Grant Nos. 61522208, 61379033, and 61472348, and the Fundamental Research Funds for the Central Universities of China under Grant Nos. 2015XZZX004-18 and 2015XZZX005-07.

An all-k-nearest-neighbor (AkNN) query finds k nearest neighbors for each query object.This problem arises naturally in many areas,such as GIS (geographic information system),multimedia retrieval,and recommender systems.To support various data types and flexible distance metrics involved in real applications,we study AkNN retrieval in metric spaces,namely,metric AkNN (MAkNN) search.Consider that the underlying indexes on the query set and the object set may not exist,which is natural in many scenarios.For example,the query set and the object set could be the results of other queries,and thus,the underlying indexes cannot be built in advance.To support MAkNN search on datasets without any underlying index,we propose an efficient disk-based algorithm,termed as Partition-Based MAkNN Algorithm (PMA),which follows a partition-search framework and employs a series of pruning rules for accelerating the search.In addition,we extend our techniques to tackle an interesting variant of MAkNN queries,i.e.,metric self-AkNN (MSAkNN) search,where the query set is identical to the object set.Extensive experiments using both real and synthetic datasets demonstrate the effectiveness of our pruning rules and the efficiency of the proposed algorithms,compared with state-of-the-art MAkNN and MSAkNN algorithms.

[1] Chen H L, Chang Y I. All-nearest-neighbors finding based on the Hilbert curve. Expert Systems with Applications, 2011, 38(6):7462-7475.

[2] Zhang J, Mamoulis N, Papadias D, Tao Y F. Allnearest-neighbors queries in spatial databases. In Proc. the 16th International Conference on Scientific and Statistical Database Management, June 2004, pp.297-306.

[3] Böhm C, Krebs F. The k-nearest neighbour join:Turbo charging the KDD process. Knowledge and Information Systems, 2004, 6(6):728-749.

[4] Chen Y, Patel J M. Efficient evaluation of all-nearestneighbor queries. In Proc. the 23rd International Conference on Data Engineering, Apr. 2007, pp.1056-1065.

[5] Emrich T, Graf F, Kriegel H P, Schubert M, Thoma M. Optimizing all-nearest-neighbor queries with trigonometric pruning. In Proc. the 22nd International Conference on Scientific and Statistical Database Management, June 2010, pp.501-518.

[6] Xia C Y, Lu H J, Ooi B C, Hu J. GORDER:An efficient method for KNN join processing. In Proc. the VLDB Conference, Oct. 2004, pp.756-767.

[7] Kybic J, Vnu?ko I. Approximate all nearest neighbor search for high dimensional entropy estimation for image registration. Signal Processing, 2012, 92(5):1302-1316.

[8] Lu W, Shen Y Y, Chen S, Ooi B C. Efficient processing of k nearest neighbor joins using MapReduce. Proceedings of the PVLDB Endowment, 2012, 5(10):1016-1027.

[9] Sankaranarayanan J, Samet H, Varshney A. A fast all nearest neighbor algorithm for applications involving large point-clouds. Computers & Graphics, 2007, 31(2):157-174.

[10] Nakano K, Olariu S.An optimal algorithm for the anglerestricted all nearest neighbor problem on the reconfigurable mesh, with applications. IEEE Transactions on Parallel and Distributed Systems, 1997, 8(9):983-990.

[11] Wang Y R, Horng S J, Wu C H. Efficient algorithms for the all nearest neighbor and closest pair problems on the linear array with a reconfigurable pipelined bus system. IEEE Transactions on Parallel and Distributed Systems, 2005, 16(3):193-206.

[12] Wang Y, Metwally A, Parthasarathy S.Scalable all-pairs similarity search in metric spaces. In Proc. the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2013, pp.829-837.

[13] Jain A K, Murthy M N, Flynn P J. Data clustering:A review. ACM Computing Surveys, 1999, 31(3):264-323.

[14] Nanopoulos A, Theodoridis Y, Manolopoulos Y. C2P:Clustering based on closest pairs. In Proc. the 27th International Conference on Very Large Data Bases, Jan. 2001, pp.331-340.

[15] Aggarwal C C, Yu P S.Outlier detection for high dimensional data. In Proc. the 2001 ACM SIGMOD International Conference on Management of Data, June 2001, pp.37-46.

[16] Álvarez M, Pan A, Raposo J, Bellas F, Cacheda F. Using clustering and edit distance techniques for automatic web data extraction. In Lecture Notes in Computer Science 4831, Benatallah B, Casati F, Georgakopoulos D, Bartolini C, Sadiq W, Godart C (eds.), Springer-Verlag, 2007, pp.212-224.

[17] Lowe D G. Object recognition from local scale-invariant features. In Proc. the 7th IEEE International Conference on Computer Vision, Sept. 1999, pp.1150-1157.

[18] Jacox E H, Samet H. Spatial join techniques. ACM Transaction on Database Systems, 2007, 32(1):Article No. 7.

[19] Yang C, Yu X H, Liu Y. Continuous KNN join processing for real-time recommendation. In Proc. the IEEE International Conference on Data Mining, Dec. 2014, pp.640-649.

[20] Yao B, Li F F, Kumar P. K nearest neighbor queries and KNN-joins in large relational databases (almost) for free. In Proc. the 26th IEEE International Conference on Data Engineering, Mar. 2010, pp.4-15.

[21] Yu C, Cui B, Wang S G, Su J W. Efficient index-based KNN join processing for high-dimensional data. Information and Software Technology, 2007, 49(4):332-344.

[22] Yu C, Zhang R, Huang Y C, Xiong H. High-dimensional KNN joins with incremental updates. Geoinformatica, 2010, 14(1):55-82.

[23] Zhang C, Li F, Jestes J. Efficient parallel KNN joins for large data in MapReduce. In Proc. the 15th International Conference on Extending Database Technology, Jan. 2012, pp.38-49.

[24] Clarkson K L. Nearest neighbor queries in metric spaces. Discrete & Computational Geometry, 1999, 22(1):63-93.

[25] Figueroa K. An efficient algorithm to all k nearest neighbor problem in metric spaces[M.S.Thesis]. Universidad Michoacana, 2000. (in Spanish)

[26] Paredes R, Chávez E, Figueroa K, Navarro G. Practical construction of k-nearest neighbor graphs in metric spaces. In Lecture Notes in Computer Science 4007, Álvarez C, SernaM (eds.), Springer-Verlag, 2006, pp.85-97.

[27] Chen L, Gao Y J, Chen G, Zhang H D. Metric all-knearest-neighbor search. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(1):98-112.

[28] Park Y, Park S, Lee S G, Jung W. Scalable k-nearest neighbor graph construction based on greedy filtering. In Proc. the 22nd International Conference on World Wide Web, May 2013, pp.227-228.

[29] Emrich T, Kriegel H P, Kröger P, Niedermayer J, Renz M, Züfle A. Reverse-k-nearest-neighbor join processing. In Lecture Notes in Computer Science 8098, Nascimento M A, Sellis T, Cheng R, Sander J, Zheng Y, Kriegel H P, Renz M, Sengstock C (eds.), Springer-Verlag, 2013, pp.277-294.

[30] Song G, Rochas J, Huet F, Magoulés F. Solutions for processing k nearest neighbor joins for massive data on MapReduce. In Proc. the Euromicro International Conference on Parallel, Distributed and Network-Based Processing, Mar. 2015, pp.279-287.

[31] Jang M, Shin Y S, Chang J W. A grid-based k-nearest neighbor join for large scale datasets on MapReduce. In Proc. the 17th IEEE International Conference on High Performance Computing and Communications (HPCC), the 7th IEEE International Symposium on Cyberspace Safety and Security (CSS), the 12th IEEE International Conference on Embedded Software and Systems (ICESS), Aug. 2015, pp.888-891.

[32] Miranda N, Chávez E, Piccoli M F, Reyes N. (Very) Fast (all) k-nearest neighbors in metric and non metric spaces without indexing. In Lecture Notes in Computer Science 8199, Brisaboa N, Pedreira O, Zezula P (eds.), SpringerVerlag, 2013, pp.300-311.

[33] Chávez E, Navarro G, Baeza-Yates R, Marroqu?n J L. Searching in metric spaces. ACM Computing Surveys, 2001, 33(3):273-321.

[34] Chen L, Gao Y J, Li X H, Jensen C S, Chen G. Efficient metric indexing for similarity search. In Proc. the 31st IEEE International Conference on Data Engineering, Apr. 2015, pp.591-602.

[35] Achtert E, Kriegel H P, Kröger P, Renz M, Züfle A. Reverse k-nearest neighbor search in dynamic and general metric databases. In Proc. the 12th International Conference on Extending Database Technology:Advances in Database Technology, Jan. 2009, pp.886-897.

[36] Tao Y F, Yiu M L, Mamoulis N. Reverse nearest neighbor search in metric spaces. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(9):1239-1252.

[37] Jacox E H, Samet H. Metric space similarity joins. ACM Transactions on Database Systems, 2008, 33(2):Article No. 7.

[38] Silva Y N, Pearson S.Exploiting database similarity joins for metric spaces. Proceedings of the VLDB Endowment, 2012, 5(12):1922-1925.

[39] Chen L, Lian X. Efficient processing of metric skyline queries. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(3):351-365.

[40] Tiakas E, Valkanas G, Papadopoulos A N, Manolopoulos Y, Gunopulos D. Metric-based top-k dominating queries. In Proc. the 17th International Conference on Extending Database Technology, Mar. 2014, pp.415-426.

[41] Bustos B, Navarro G, Chávez E. Pivot selection techniques for proximity searching in metric spaces. Pattern Recognition Letters, 2003, 24(14):2357-2366.

[42] Mao R, Xu H L, Wu W B, Li J Q, Li Y, Lu M H. Overcoming the challenge of variety:Big data abstraction, the next evolution of data management for AAL communication systems. IEEE Communications Magazine, 2015, 53(1):42-47.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] Shen Li; Stephen Y.H.Su;. Generalized Parallel Signature Analyzers with External Exclusive-OR Gates[J]. , 1986, 1(4): 49 -61 .
[2] Xu Zhiming;. Discrete Interpolation Surface[J]. , 1990, 5(4): 329 -332 .
[3] Weigeng Shi;. Reconnectable Network with Limited Resources[J]. , 1991, 6(3): 243 -249 .
[4] Liao Xianzhi; Jin Lan;. A Mechanism Supporting the Client/Server Relationship in the Operating System of Distributed System “THUDS”[J]. , 1991, 6(3): 256 -262 .
[5] Jin Guohua; Yang Xuejun; Chen Fujie;. Loop Staggering,Loop Compacting:Restructuring Techniques for Thrashing Problem[J]. , 1993, 8(1): 49 -57 .
[6] Wu Xindong;. Inductive Learning[J]. , 1993, 8(2): 22 -36 .
[7] Zhang Bo; Zhang Ling;. On Memory Capacity of the Probabilistic Logic Neuron Network[J]. , 1993, 8(3): 62 -66 .
[8] Gu Junzhong;. An Object-Oriented Transaction Model[J]. , 1993, 8(4): 3 -20 .
[9] Huang Guoyong; Li Sanli;. TSP: A Heterogeneous Multiprocessor Supercomputing System Based on i860XP[J]. , 1994, 9(3): 285 -288 .
[10] Zhao Yu; Zhang Qiong; Xiang Hui; Shi Jiaosing; He Zhijun;. A Simplified Model for Generating 3D Realistic Sound in the Multimedia and Virtual Reality Systems[J]. , 1996, 11(4): 461 -470 .

ISSN 1000-9000(Print)

         1860-4749(Online)
CN 11-2296/TP

Home
Editorial Board
Author Guidelines
Subscription
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved