›› 2010, Vol. 25 ›› Issue (3): 389-400.

• Special Section on Trends Changing Data Management • Previous Articles     Next Articles

Outlier Detection over Sliding Windows for Probabilistic Data Streams

Bin Wang (王斌), Member, CCF, Xiao-Chun Yang (杨晓春), Senior Member, CCF, Member ACM, IEEE, Guo-Ren Wang (王国仁), Senior Member, CCF, Member, ACM, IEEE, and Ge Yu (于戈), Senior Member, CCF, Member, ACM, IEEE   

  1. School of Information Science and Engineering, Northeastern University, Shenyang 110004, China
    Key Laboratory of Medical Image Computing (Northeastern University), Ministry of Education, Shenyang 110004, China
  • Received:2009-06-30 Revised:2009-11-16 Online:2010-05-05 Published:2010-05-05
  • About author:
    Bin Wang received the Ph.D.~degree in computer science from the Northeastern University, China, in 2008. He is currently a lecturer in computer science at the Northeastern University. He is a member of CCF. His research interests include design and analysis of algorithms, databases, data quality, and distributed systems.
    Xiao-Chun Yang received the Ph.D.~degree in computer science from Northeastern University, China, in 2001. She is a professor at Northeastern University, China. She is a member of the ACM and the IEEE Computer Society, and a senior member of CCF. Her research interests are in the areas of data quality, query processing, data privacy, and distributed data management.
    Guo-Ren Wang received the Ph.D. degree from Northeastern University, China, in 1996. He is a professor at Northeastern University of China. He is a member of IEEE, ACM, and a senior member of CCF. His research interests are XML data management, query processing and optimization, bioinformatics, high-dimensional indexing, parallel database systems, and P2P data management. He has published more than 100 research papers.
    Ge Yu received his Ph.D. degree in computer science from Kyushu University of Japan in 1996. He is a professor at Northeastern University of China. He is a member of IEEE, ACM, and a senior member of CCF. His research interests include database theory and technology, distributed and parallel systems, embedded software, network information security.
  • Supported by:

    This work is partially supported by the National Natural Science Foundation of China under Grant Nos. 60973020, 60828004, and 60933001, the Program for New Century Excellent Talents in University of China under Grant No. NCET-06-0290, and the Fundamental Research Funds for the Central Universities under Grant No. N090504004.

Outlier detection is a very useful technique in many applications, where data is generally uncertain and could be described using probability. While having been studied intensively in the field of deterministic data, outlier detection is still novel in the emerging uncertain data field. In this paper, we study the semantic of outlier detection on probabilistic data stream and present a new definition of distance-based outlier over sliding window. We then show the problem of detecting an outlier over a set of possible world instances is equivalent to the problem of finding the $k$-th element in its neighborhood. Based on this observation, a dynamic programming algorithm (DPA) is proposed to reduce the detection cost from $O(2^{|R(e,d)|})$ to $O(|k{\cdot}R(e,d)|)$, where $R(e,d)$ is the $d$-neighborhood of $e$. Furthermore, we propose a pruning-based approach (PBA) to effectively and efficiently filter non-outliers on single window, and dynamically detect recent $m$ elements incrementally. Finally, detailed analysis and thorough experimental results demonstrate the efficiency and scalability of our approach.


[1] Knorr E M, Ng R T. Algorithms for mining distance-based outliers in large datasets. In Proc. the 24th International Conference on Very Large Data Bases (VLDB), New York City, USA, Aug. 24-27, 1998, pp.392-403.

[2] Knorr E M, Ng R T. Finding intensional knowledge of distance-based outliers. In Proc. the 25th International Conference on Very Large Data Bases (VLDB), Edinburgh, UK, Sept. 7-10, 1999, pp.211-222.

[3] Shen H, Zhan Y. Improved approximate detection of duplicates for data streams over sliding windows. Journal of Computer Science and Technology, 2008, 23(6): 973-987.

[4] Breuning M M, Kriegel H P, Ng R T, Sander J. LOF: Identifying density-based local outliers. In Proc. the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD), Dallas, USA, May 16-18, 2000, pp.93-104.

[5] Hinterberger H. Exploratory Data Analysis. Encyclopedia of Database Systems, Springer US, 2009, p.1080.

[6] Arning A, Agrawal R, Raghavan P. A linear method for deviation detection in large databases. In Proc. the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Portland, USA, Aug. 2-4, 1996, pp.164-169.

[7] Sarawagi S, Agrawal R, Megiddo N. Discovery-driven exploration of OLAP data cubes. In Proc. the 6th International Conference on Extending Database Technology (EDBT), Valencia, Spain, Mar. 23-27, 1998, pp.168-182.

[8] Pei J, Jiang B, Lin X, Yuan Y. Probabilistic skylines on uncertain data. In Proc. the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, Austria, Sept. 2327, 2007, pp.15-26.

[9] Soliman M A, Ilyas I F, Chang K C-C. Top-k query processing in uncertain databases. In Proc. the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey, Apr. 15-20, 2007, pp.345-360.

[10] Hua M, Pei J, Zhang W, Lin X. Ranking queries on uncertain data: A probabilistic threshold approach. In Proc. the ACM SIGMOD International Conference on Management of Data (SIGMOD), Vancouver, Canada, Jun. 10-12, 2008, pp.673689.

[11] Aggarwal C C, Yu P S. Outlier detection with uncertain data. In Proc. the SIAM International Conference on Data Mining (SDM), Atlanta, USA, Apr. 24-26, 2008, pp.483-493.

[12] Kriegel H, Pfeifle M. Density-based clustering of uncertain data. In Proc. the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Chicago, USA, Aug. 21-24, 2005, pp.672-677.

[13] Jin C, Yi K, Chen L, Yu J X, Lin X. Sliding-window top-k queries on uncertain streams. PVLDB, 2008, 1(1): 301-312.

[14] Woo H, Mok A K. Real-time monitoring of uncertain data streams using probabilistic similarity. In Proc. the 28th IEEE Real-Time Systems Symposium (RTSS), Tucson, Arizona, USA, Dec. 3-6, 2007, pp.288-300.

[15] Aggarwal C C, Yu P S. A framework for clustering uncertain data streams. In Proc. the 24th International Conference on Data Engineering (ICDE), Cancun, Mexico, Apr. 7-12, 2008, pp.150-159.

[16] Lin X, Yuan Y, Wang W, Lu H. Stabbing the sky: Efficient skyline computation over sliding windows. In Proc. the 21st International Conference on Data Engineering (ICDE), Tokyo, Japan, Apr. 5-8, 2005, pp.502-513.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] Liu Mingye; Hong Enyu;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] Chen Shihua;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] Gao Qingshi; Zhang Xiang; Yang Shufan; Chen Shuqing;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] Chen Zhaoxiong; Gao Qingshi;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] Huang Heyan;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] Min Yinghua; Han Zhide;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] Tang Tonggao; Zhao Zhaokeng;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] Min Yinghua;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] Zhu Hong;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
[10] Li Minghui;. CAD System of Microprogrammed Digital Systems[J]. , 1987, 2(3): 226 -235 .

ISSN 1000-9000(Print)

         1860-4749(Online)
CN 11-2296/TP

Home
Editorial Board
Author Guidelines
Subscription
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved