• Database and Knowledge-Based Systems • Previous Articles     Next Articles

PGG: An Online Pattern Based Approach for Stream Variation Management

Lu-An Tang1, Bin Cui2,4, Hong-Yan Li2,3,**, Gao-Shan Miao2,3, Dong-Qing Yang2,4, and Xin-Biao Zhou2,3   

  1. {\it 1Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, U.S.A. 2School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China 3Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing 100871, China 4Key Laboratory of High Confidence Software Technologies (Ministry of Education), Peking University, Beijing 100871, China
  • Received:2008-04-14 Revised:2008-06-03 Online:2008-07-10 Published:2008-07-10

Many database applications require efficient processing of data streams with value variations and fluctuant sampling frequency. The variations typically imply fundamental features of the stream and important domain knowledge of underlying objects. In some data streams, successive events seem to recur in a certain time interval, but the data indeed evolves with tiny differences as time elapses. This feature, so called {\it pseudo periodicity}, poses a new challenge to stream variation management. This study focuses on the online management for variations over such streams. The idea can be applied to many scenarios such as patient vital signal monitoring in medical applications. This paper proposes a new method named Pattern Growth Graph (PGG) to detect and manage variations over evolving streams with following features: 1) adopts the wave-pattern to capture the major information of data evolution and represent them compactly; 2) detects the variations in a single pass over the stream with the help of wave-pattern matching algorithm; 3) only stores different segments of the pattern for incoming stream, and hence substantially compresses the data without losing important information; 4) distinguishes meaningful data changes from noise and reconstructs the stream with acceptable accuracy. Extensive experiments on real datasets containing millions of data items, as well as a prototype system, are carried out to demonstrate the feasibility and effectiveness of the proposed scheme.

Key words: database query; natural language processing; word segmentation; disambiguation;


[1] Papadimitriou S, Yu P S. Optimal multi-scale patterns in time series streams. In {\it Proc. the 2006 ACM SIGMOD International Conference on Management of Data}, Chicago, IL, USA, June 27--29, 2006, pp.647--658.
[2]} Aggarwal C C, Han J, Wang J, Yu P S. A framework for projected clustering of high dimensional data streams. In {\it Proc. the Thirtieth International Conference on Very Large Data Bases}, Toronto, Canada, August 31--September 3, 2004, Vol.30, pp.852--863.
[3]} Wang H, Fan W, Yu P S, Han J. Mining concept-drifting data streams using ensemble classifiers. In {\it Proc. the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, KDD'03, Washington D C, August 24--27, 2003, pp.226--235.
[4]} Babcock B, Datar M, Motwani R, O'Callaghan L. Maintaining variance and $k$-medians over data stream windows. In {\it Proc. the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'03}, San Diego, California, June 9--11, 2003, pp.234--243.
[5]} Wang H, Pei J, Yu P S. Online mining data streams: Problems, applications and progress. In {\it Proc. the 21st International Conference on Data Engineering, ICDE'05}, Tokyo, Japan, April 5--8, 2005.
[6]} Keogh E, Lin J, Fu A. HOT SAX: Efficiently finding the most unusual time series subsequence. In {\it Proc. Fifth IEEE International Conference on Data Mining}, Houston, Texas, USA, Nov. 2005, pp.27--30.
[7]} Keogh E, Lonardi S, Chiu B. Finding surprising patterns in a time series database in linear time and space. In {\it Proc. the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, Edmonton, Alberta, Canada, July 23--26, 2002, pp.550--556.
[8]} Wu H, Salzberg B, Zhang D. Online event-driven subsequence matching over financial data streams. In {\it Proc. the 2004 ACM SIGMOD International Conference on Management of Data}, Paris, France, June 13--18, 2004, pp.23--34.
[9]} Varon Joseph, Marik PE. Clinical information systems and the electronic medical record in the intensive care unit. {\it Current Option in Critical Care}, 2002, 8(6): 616--624.
[10]} Zhou X, Miao G, Li H, Tang L, Wei X. PEDS-VM: A variation management prototype for pattern evolving data streams. In {\it Proc. the Ninth International Conference on Web-Age Information Management, WAIM 08}, Zhangjiajie, China, July 20--22, 2008. (To appear)
[11]} Tang L, Cui B, Li H, Miao G, Yang D, Zhou X. Effective variation management for pseudo periodical streams. In {\it Proc. the 2007 ACM SIGMOD International Conference on Management of Data}, %SIGMOD'07, Beijing, China, June 11--14, 2007, pp.257--268.
[12]} Papadimitriou S, Sun J, Faloutsos C. Streaming pattern discovery in multiple time-series. In {\it Proc. the 31st International Conference on Very Large Data Bases}, Trondheim, Norway, August 30 -- September 02, 2005, pp.697--708.
[13]} Babu S, Widom J. Continuous queries over data streams. {\it SIGMOD Rec.}, Sept. 2001, 30(3): 109--120.
[14]} Abadi D, Carney D, Cetintemel U, Cherniack M, Convey C, Lee S, Stonebraker M, Tatbul N, Zdonik S. Aurora: A new model and architecture for data stream management. {\it The VLDB Journal}, August 2003, 12(2): 120--139.
[15]} Cortes C, Fisher K, Pregibon D, Rogers A. Hancock: A language for extracting signatures from data streams. In {\it Proc. the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, Boston, Massachusetts, United States, August 20--23, 2000, pp.9--17.
[16]} Chandrasekaran S, Cooper O, Deshpande A, Franklin M J, Hellerstein J M, Hong W, Krishnamurthy S, Madden S R, Reiss F, Shah M A. TelegraphCQ: Continuous dataflow processing. In {\it Proc. the 2003 ACM SIGMOD International Conference on Management of Data}, %SIGMOD'03, San Diego, California, June 9--12, 2003, pp.668--668.
[17]} Sullivan M. Tribeca: A stream database manager for network traffic analysis. In {\it Proc. the 22nd International Conference on Very Large Data Bases}, San Francisco, CA, September 3--6, 1996, p.594.
[18]} Yao Y, Gehrke J. The cougar approach to in-network query processing in sensor networks. {\it SIGMOD Rec.}, Sept. 2002, 31(3): 9--18.
[19]} Cormode G, Datar M, Indyk P, Muthukrishnan S. Comparing data streams using Hamming norms (how to zero in). In {\it Proc. the 28th International Conference on Very Large Data Bases}, Hong Kong, China, August 20--23, 2002, pp.335--345.
[20]} Datar M, Gionis A, Indyk P, Motwani R. Maintaining stream statistics over sliding windows. In {\it Proc. the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms}, San Francisco, California, January 6--8, 2002, pp.635--644.
[21]} Hu Z, Li H, Qiu B, Tang L, Fan Y, Liu H, Gao J, Zhou X. Using control theory to guide load shedding in medical data stream management system. In {\it Proc. the 10th Asian Computing Science Conference, Advances in Computer Science, 2005}, Kunming, China, {\it LNCS} 3818, pp.236--248.
[22]} Ganguly S, Garofalakis M, Rastogi R. Processing set expressions over continuous update streams. In {\it Proc. the 2003 ACM SIGMOD International Conference on Management of Data}, %SIGMOD'03, San Diego, California, June 9--12, 2003, pp.265--276.
[23]} David J Fraenkel, Melleesa Cowie, Peter Daley. Quality benefits of an intensive care clinical information system. {\it Crit. Care Medi.}, 2003, 31: 120--125.
[24]} Axel Junger, Achim Michel {\it et al}. Evaluation of the suitability of a patient data management system for ICUs on a general ward. {\it International Journal of Medical Informatics}, 2001, 64: 57--66.
[25]} Liu Y B, Cai J R, Yin J, Fu W A. Clustering text data streams. {\it Journal of Computer Science and Technology}, Jan. 2008, 23(1): 112--128.
[26]} Hu X G, Li P P, Wu X D, Wu G Q. A semi-random multiple decision-tree algorithm for mining data streams. {\it Journal of Computer Science and Technology}, Sept. 2007, 22(5): 711--724.
[27]} Chong Z H, Yu J X, Zhang Z J, Lin X M, Wang W, Zhou A Y. Efficient computation of k-medians over data streams under memory constraints. {\it Journal of Computer Science and Technology}, Mar. 2006, 21(2): 284--296.
[28]} Chang J H, Lee W S. Effect of count estimation in finding frequent itemsets over online transactional data streams. {\it Journal of Computer Science and Technology}, Jan. 2005, 20(1): 63--69.
[29]} Cai Y D, Clutter D, Pape G, Han J, Welge M, Auvil L. MAIDS: Mining alarming incidents from data streams. In {\it Proc. the 2004 ACM SIGMOD International Conference on Management of Data}, Paris, France, June 13--18, 2004, %SIGMOD'04. pp.919--920.
[30]} Teng W, Chen M, Yu P S. A regression-based temporal pattern mining scheme for data streams. In {\it Proc. the 29th International Conference on Very Large Data Bases}, Berlin, Germany, September 09--12, 2003, pp.93--104.
[31]} Zhu Y, Shasha D. Efficient elastic burst detection in data streams. In {\it Proc. the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, Washington D C, August 24--27, 2003, pp.336--345.
[32]} Ma J, Perkins S. Online novelty detection on temporal sequences. In {\it Proc. the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, Washington D C, August 24--27, 2003, pp.613--618.
[33]} Aggarwal C C. On abnormality detection in spuriously populated data streams. In {\it Proc. SIAM International Conference on Data Mining}, Newport Beach, CA, USA, 2005.
[34]} Lin J, Keogh E, Lonardi S, Chiu B. A symbolic representation of time series, with implications for streaming algorithms. In {\it Proc. the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery}, San Diego, California, June 13, 2003, pp.2--11.
[35]} Gilbert A C, Kotidis Y, Muthukrishnan S {\it et al}. One-Pass Wavelet Decompositions of Data Streams. {\it IEEE Trans. Knowl. and Data Eng.}, Mar. 2003, 15(3): 541--554.
[36]} Papadimitriou S, Brockwell A, Faloutsos C. Adaptive, unsupervised stream mining. {\it The VLDB Journal}, Sept. 2004, 13(3): 222--239.
[37]} Gao L, Wang X. Continuous similarity-based queries on streaming time series. {\it IEEE Trans. Knowl. Data Eng.}, Oct. 2005, 17(10): 1320--1332.
[38]} Wu H, Salzberg B, Zhang D. Online event-driven subsequence matching over financial data streams. In {\it Proc. the 2004 ACM SIGMOD International Conference on Management of Data}, Paris, France, June 13--18, 2004, pp.23--34.
[39]} Wu H, Sharp G, Salzberg B, Kaeli D, Shirato H, Jiang S. A finite state model for respiratory motion analysis in image guided radiation therapy. {\it Physics in Medicine and Biology $($PMB$)$}, 2004, 49(23): 5357--5372.
[40]} Wu H, Salzberg B, Sharp G C, Jiang S B, Shirato H, Kaeli D. Subsequence matching on structured time series data. In {\it Proc. the 2005 ACM SIGMOD International Conference on Management of Data}, Baltimore, Maryland, June 14--16, 2005, pp.682--693.
[41]} Aggarwal C C. A framework for diagnosing changes in evolving data streams. In {\it Proc. the 2003 ACM SIGMOD International Conference on Management of Data}, San Diego, California, June 09--12, 2003, pp.575--586.
[42]} Wang H, Pei J. A random method for quantifying changing distributions in data streams. In {\it Proc. the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases $($PKDD$)$}, Porto, Portugal, October 2005, pp.684--691.
[43]} Keogh E J, Chu S, Hart D, Pazzani M J. An online algorithm for segmenting time series. In {\it Proc. the 2001 IEEE International Conference on Data Mining}, Cercone N (ed), November 29--December 02, 2001, pp.289--296.
[44]} http://peer.berkeley.edu/nga/flatfile.html
[45]} http://www.schoolsobservatory.org
[1] Sheng-Luan Hou, Xi-Kun Huang, Chao-Qun Fei, Shu-Han Zhang, Yang-Yang Li, Qi-Lin Sun, Chuan-Qing Wang. A Survey of Text Summarization Approaches Based on Deep Learning [J]. Journal of Computer Science and Technology, 2021, 36(3): 633-663.
[2] Peng Ni, Su-Yun Zhao, Zhi-Gang Dai, Hong Chen, Cui-Ping Li. Partial Label Learning via Conditional-Label-Aware Disambiguation [J]. Journal of Computer Science and Technology, 2021, 36(3): 590-605.
[3] Nuo Qun, Hang Yan, Xi-Peng Qiu, Xuan-Jing Huang. Chinese Word Segmentation via BiLSTM+Semi-CRF with Relay Node [J]. Journal of Computer Science and Technology, 2020, 35(5): 1115-1126.
[4] Xiang-Guang Zhou, Ren-Bin Gong, Fu-Geng Shi, Zhe-Feng Wang. PetroKG: Construction and Application of Knowledge Graph in Upstream Area of PetroChina [J]. Journal of Computer Science and Technology, 2020, 35(2): 368-378.
[5] Chen-Chen Sun, De-Rong Shen, Yue Kou, Tie-Zheng Nie, Ge Yu. Topological Features Based Entity Disambiguation [J]. , 2016, 31(5): 1053-1068.
[6] Fei Tian, Bin Gao, En-Hong Chen, Tie-Yan Liu. Learning Better Word Embedding by Asymmetric Low-Rank Projection of Knowledge Graph [J]. , 2016, 31(3): 624-634.
[7] Xiang-Lilan Zhang, Zhi-Gang Luo, Ming Li. Merge-Weighted Dynamic Time Warping for Speech Recognition [J]. , 2014, 29(6): 1072-1082.
[8] Dan Yang, De-Rong Shen, Ge Yu, Yue Kou, and Tie-Zheng Nie. Query Intent Disambiguation of Keyword-Based Semantic Entity Search in Dataspaces [J]. , 2013, 28(2): 382-393.
[9] Peter Szolovits. Possibilities for Healthcare Computing [J]. , 2011, 26(4): 625-631.
[10] Javier Tejada-Cárcamo, Hiram Calvo, Alexander Gelbukh, and Kazuo Hara. Unsupervised WSD by Finding the Predominant Sense Using Context as a Dynamic Thesaurus [J]. , 2010, 25(5): 1030-1039.
[11] Xu Sun, Hou-Feng Wang, and Bo Wang. Predicting Chinese Abbreviations from Definitions: An Empirical Learning Approach Using Support Vector Regression [J]. , 2008, 23(4 ): 602-611 .
[12] Hai Zhao and Chunyu Kit. Scaling Conditional Random Fields by One-Against-the-Other Decomposition [J]. , 2008, 23(4 ): 612-619 .
[13] Jun Zhao and Fei-Fan Liu. Linguistic Theory Based Contextual Evidence Mining for Statistical Chinese Co-Reference Resolution [J]. , 2007, 22(4): 608-617 .
[14] Chang-Le Zhou, Yun Yang, and Xiao-Xi Huang. Computational Mechanisms for Metaphor in Languages: A Survey [J]. , 2007, 22(2): 308-319 .
[15] Sheng Li and Tie-Jun Zhao. Chinese Information Processing and Its Prospects [J]. , 2006, 21(5): 838-846 .
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] Wu Yunzeng;. On the Development of Applications of Logic in Programming[J]. , 1987, 2(1): 30 -34 .
[2] Wang Lei; Tan Ying;. The Researches in Fault-Tolerant D ataflow Architecture[J]. , 1991, 6(4): 395 -398 .
[3] WANG Deqiang; ZHAO Lianchang;. The Twisted-Cube Connected Networks[J]. , 1999, 14(2): 181 -187 .
[4] Jin-Woo Kim, Ju-Hum Kwon, Young-Gab Kim, Chee-Yang Song, Hyun-Seok Kim, and Doo-Kwon Baik. EAFoC: Enterprise Architecture Framework Based on Commonality[J]. , 2006, 21(6): 952 -964 .
[5] Qiang Liu, Tao Huang, Shao-Hua Liu, and Hua Zhong. An Ontology-Based Approach for Semantic Conflict Resolution in Database Integration[J]. , 2007, 22(2): 218 -227 .
[6] Shung Han Cho, Student Member, IEEE, Yuntai Kyong, Student Member, IEEE, Sangjin Hong, Senior Member, IEEE, and We-Duke Cho, Member, IEEE. Self Localization Method Using Parallel Projection Model for Mobile Sensor in Navigation Applications[J]. , 2009, 24(3): 588 -603 .
[7] Peng Xiao, Zhi-Gang Hu, and Yan-Ping Zhang. An Energy-Aware Heuristic Scheduling for Data-Intensive Workflows in Virtualized Datacenters[J]. , 2013, 28(6): 948 -961 .
[8] Po Hu, Min-Lie Huang, and Xiao-Yan Zhu. Exploring the Interactions of Storylines from Informative News Events[J]. , 2014, 29(3): 502 -518 .
[9] Jing-Yuan Zhao, Mei-Qin Wang, Long Wen . Improved Linear Cryptanalysis of CAST-256[J]. , 2014, 29(6): 1134 -1139 .
[10] Peng Du, Jie-Yi Zhao, Wan-Bin Pan, Yi-Gang Wang. GPU Accelerated Real-time Collision Handling in Virtual Disassembly[J]. , 2015, 30(3): 511 -518 .

ISSN 1000-9000(Print)

         1860-4749(Online)
CN 11-2296/TP

Home
Editorial Board
Author Guidelines
Subscription
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved