Massive XML data are increasingly generated for the representation, storage and exchange of web information. Twig query processing over massive XML data has become a research focus. However, most traditional algorithms cannot be directly implemented in a distributed manner. Some of the existing distributed algorithms generate a lot of useless intermediate results and execute many join operations of partial results in most cases; others require the priori knowledge of query pattern before XML partition, storage and query processing, which is impractical in the cases of large-scale data or frequent incoming new queries. To improve efficiency and scalability, in this paper, we propose a 3-phase distributed algorithm DisT3 based on node distribution mechanism to avoid unnecessary intermediate results. Furthermore, we propose a lightweight local index ReP with an enhanced XML partitioning approach using arbitrary partitioning strategy, and based on ReP we propose an improved 2-phase distributed algorithm DisT2ReP to further reduce the communication cost. After the performance guarantees are analyzed, extensive experiments are conducted to verify the efficiency and scalability of our proposed algorithms in distributed twig query applications.
About author: Xin Bi received his M.S. degree in computer science from Northeastern University, Shenyang, in 2011. Currently, he is a Ph.D. candidate of Northeastern University, Shenyang. His main research interest includes XML data management and distributed database system.
Xin Bi, Xiang-Guo Zhao, Guo-Ren Wang.基于节点分发的分布式Twig查询处理技术[J] Journal of Computer Science and Technology , 2017,V32(1): 78-92
Xin Bi, Xiang-Guo Zhao, Guo-Ren Wang.Efficient Processing of Distributed Twig Queries Based on Node Distribution[J] Journal of Computer Science and Technology, 2017,V32(1): 78-92
 Chen S, Li H G, Tatemura J, Hsiung W P, Agrawal D, Can dan K S. Twig2Stack:Bottom-up processing of generalized tree-pattern queries over XML documents. In Proc. the 32nd VLDB, Sept. 2006, pp.283-294. Liu J, Yan D L. Answering approximate queries over XML data. IEEE Transactions on Fuzzy Systems, 2016, 24(2):288-305. Mohammed S, El-Alfy E S M, Barradah A F. Improved selectivity estimator for XML queries based on structural synopsis. World Wide Web, 2015, 18(4):1123-1144. Bruno N, Koudas N, Srivastava D. Holistic twig joins:Op timal XML pattern matching. In Proc. the ACM SIGMOD International Conference on Management of Data, June 2002, pp.310-321. Al-Khalifa S, Jagadish H, Patel J et al. Structural joins:A primitive for efficient XML query pattern matching. In Proc. the 18th International Conference on Data Engineering (ICDE), Feb. 26-Mar. 1, 2002, pp.141-152. Jiang H, Wang W, Lu H, Yu J X. Holistic twig joins on indexed XML documents. In Proc. the 29th VLDB, Sept. 2003, pp.273-284. Chen T, Lu J, Ling T W. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. the ACM SIGMOD International Conference on Management of Data, June 2005, pp.455-466. Lu J, Ling T W, Chan C Y, Chen T. From region encoding to extended dewey:On efficient processing of XML twig pattern matching. In Proc. the 31st VLDB, Aug. 30-Sept. 2, 2005, pp.193-204. Jiang H, Lu H, Wang W. Efficient processing of XML twig queries with or-predicates. In Proc. the ACM SIGMOD International Conference on Management of Data, June 2004, pp.59-70. Buneman P, Cong G, Fan W, Kementsietsidis A. Using partial evaluation in distributed query evaluation. In Proc. the 32nd VLDB, Sept. 2006, pp.211-222. Lü K, Zhu Y, Sun W et al. Parallel processing XML documents. In Proc. the International Database Engineering and Applications Symposium (IDEA), July 2002, pp.96-105. Suciu D. Distributed query evaluation on semistructured data. ACM Transactions on Database Systems, 2002, 27(1):1-62. Kido K, Amagasa T, Kitagawa H. Processing XPath queries in PC-clusters using XML data partitioning. In Proc. the 22nd International Conference on Data Engineering Workshops, April 2006, Article No. 114. Tang N, Wang G, Yu J X, Wong K F, Yu G. WIN:An efficient data placement strategy for parallel XML databases. In Proc. the 11th International Conference on Parallel and Distributed Systems, July 2005, pp.349-355. Kurita H, Hatano K, Miyazaki J, Uemura S. Efficient query processing for large XML data in distributed environments. In Proc. the 21st International Conference on Advanced Information Networking and Applications, May 2007, pp.317-322. Bordawekar R, Lim L, Shmueli O. Parallelization of XPath queries using multi-core processors:Challenges and experiences. In Proc. the 12th International Conference on Extending Database Technology (EDBT), March 2009, pp.180-191. Machdi I, Amagasa T, Kitagawa H. GMX:An XML data partitioning scheme for holistic twig joins. In Proc. the 10th International Conference on Information Integration and Web-Based Applications & Services (iiWAS), Nov. 2008, pp.137-146. Machdi I, Amagasa T, Kitagawa H. XML data partitioning strategies to improve parallelism in parallel holistic twig joins. In Proc. the 3rd International Conference on Ubiquitous Information Management and Communication (ICUIMC), Jan. 2009, pp.471-480. Kling P, Özsu M T, Daudjee K. Generating efficient exe-cution plans for vertically partitioned XML databases. Proc. the VLDB Endowment, 2010, 4(1):1-11. Bidoit N, Colazzo D, Malla N, Sartiani C. Partitioning XML documents for iterative queries. In Proc. the 16th International Database Engineering & Applications Symposium, Aug. 2012, pp.51-60. Bidoit N, Colazzo D, Malla N, Ulliana F, Nolé M, Sartiani C. Processing XML queries and updates on map/reduce clusters. In Proc. the 16th International Conference on Extending Database Technology (EDBT), Mar. 2013, pp.745-748. Wu H. Parallelizing structural joins to process queries over big XML data using MapReduce. In Proc. the 25th International Conference on Database and Expert Systems Applications (DEXA), Sept. 2014, pp.183-190. Shnaiderman L, Shmueli O. Multi-core processing of XML twig patterns. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(4):1057-1070. Camacho-Rodrguez J, Colazzo D, Manolescu I. Building large XML stores in the Amazon cloud. In Proc. the 28th International Conference on Data Engineering Workshops (ICDEW), Apr. 2012, pp.151-158. Choi H, Lee K H, Kim S H, Lee Y J, Moon B. HadoopXML:A suite for parallel processing of massive XML data with multiple twig pattern queries. In Proc. the 21st ACM International Conference on Information and Knowledge Management, Oct. 29-Nov. 21, 2012, pp.2737-2739. Dean J, Ghemawat S. MapReduce:Simplified data processing on large clusters. In Proc. the 6th Conference on Operating Systems Design & Implementation (OSDI), Dec. 2004, pp.137-150. Damigos M, Gergatsoulis M, Plitsos S. Distributed processing of XPath queries using MapReduce. In Proc. the 17th ADBIS, Sept. 2013, pp.69-77. Damigos M, Gergatsoulis M, Kalogeros E. Distributed evaluation of XPath queries over large integrated XML data. In Proc. the 18th Panhellenic Conference on Informatics, Oct. 2014, pp.61:1-61:6. Afrati F N, Damigos M, Gergatsoulis M. Lower bounds on the communication of XPath queries in MapReduce. In Proc. the Workshops of the EDBT/ICDT Joint Conference, Mar. 2015, pp.38-41. Bi X, Wang G, Zhao X, Zhang Z, Chen S. Distributed XML twig query processing using MapReduce. In Proc. the 17th Asia-PacificWeb Conference (APWeb), Sept. 2015, pp.203-214. Ding L, Wang G, Xin J, Wang X, Huang S, Zhang R. ComMapReduce:An improvement of MapReduce with lightweight communication mechanisms. Data & Knowledge Engineering, 2013, 88:224-247. Schmidt A, Waas F, Kersten M, Carey M J, Manolescu I, Busse R. XMark:A benchmark for XML data management. In Proc. the 28th VLDB, Aug. 2002, pp.974-985.