Journal of Computer Science and Technology ›› 2022, Vol. 37 ›› Issue (3): 699-718. doi: 10.1007/s11390-021-1076-7
Special Topic: Artificial Intelligence and Pattern Recognition
Rong-Yu Cao1,2 (曹荣禹), Student Member, CCF, Yi-Xuan Cao1,2 (曹逸轩), Member, CCF, IEEE, Gan-Bin Zhou3 (周干斌), and Ping Luo1,2,4 (罗平), Senior Member, CCF, Member, IEEE
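1. Background: As information technology has spread into vertical domains such as finance, law, government, and education, the number of electronic documents has grown rapidly. To obtain valuable information from these unstructured documents, recovering the underlying document structure is essential: it makes the documents easier to re-edit, restyle, or rearrange, and it supports many downstream NLP and text-mining applications. However, converting a document from its editing format (e.g., WORD or LaTeX) to its display format (e.g., PDF or JPG) preserves only the layout; the underlying physical and logical structure is partially or completely lost. Making this conversion reversible in general therefore remains an open problem. For these reasons, this paper studies the extraction of variable-depth logical document structure from long documents.
2. Objective: The goal of this work is to extract variable-depth logical document structure from long documents, that is, to reorganize the already-recognized physical objects of a document into a hierarchy. The difficulty is that a long document contains a large number of physical objects, and these objects sit at different levels, so the depth of the hierarchy varies from document to document.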
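3. Method: Inspired by how humans work out a document's hierarchical structure while reading, we propose a new neural-network-based model. The input to the model is an ordered sequence of the document's already-recognized physical objects, and the output is a hierarchy tree composed of these objects. Specifically, we insert the physical objects into the tree one at a time, in sequence order, each at its appropriate position. For an object to be inserted, we visit the possible insertion positions in the current tree in a fixed traversal order until a suitable position is found. Deciding whether a candidate position is suitable is formulated as a binary classification problem, "put or skip". The hierarchy tree grows in this way until all physical objects have been inserted. We further study several variants of the model, including the effect of different traversal orders when inserting a node, explicit versus implicit heading detection, and tolerance to erroneously inserted nodes during insertion. To measure the accuracy of the logical structure tree, we propose a new evaluation metric. We also explore the effect of the logical structure tree on a downstream passage retrieval task.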
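The insertion loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the learned "put or skip" classifier is replaced by a hypothetical rule (`rule_put`) that compares an assumed, already-known `level` attribute, the candidate positions are assumed to be the nodes on the rightmost path of the current tree (so that document order is preserved), and only the leaf-to-root traversal order is shown. All names (`Node`, `BODY`, `rightmost_path`, `rule_put`, `build_tree`, `outline`) are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

BODY = 99  # hypothetical level given to non-heading objects (paragraphs, tables, ...)


@dataclass
class Node:
    text: str
    level: int  # hypothetical feature: 0 = root, 1 = top-level heading, BODY = body text
    children: List["Node"] = field(default_factory=list)


def rightmost_path(root: Node) -> List[Node]:
    """Return the nodes from the root down to the most recently inserted node."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def rule_put(parent: Node, obj: Node) -> bool:
    """Stand-in for the learned put-or-skip classifier (an assumption).

    Returns True ("put") when obj may become a child of parent.
    """
    return parent.level < obj.level


def build_tree(
    objects: Iterable[Tuple[str, int]],
    put: Callable[[Node, Node], bool] = rule_put,
) -> Node:
    """Insert objects one by one, in sequence order, into a growing tree."""
    root = Node("ROOT", 0)
    for text, level in objects:
        obj = Node(text, level)
        # Leaf-to-root traversal: try the deepest candidate position first.
        for candidate in reversed(rightmost_path(root)):
            if put(candidate, obj):
                candidate.children.append(obj)
                break
        else:
            # Every candidate said "skip"; fall back to attaching at the root.
            root.children.append(obj)
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, two spaces per depth level."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    doc = [
        ("1 Introduction", 1),
        ("Intro paragraph", BODY),
        ("1.1 Background", 2),
        ("Background paragraph", BODY),
        ("2 Method", 1),
    ]
    print("\n".join(outline(build_tree(doc))))
```

In the model itself the `put` decision would come from a neural classifier rather than a level comparison; the sketch only shows how a sequence of binary decisions along a traversal order yields a tree of variable depth.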
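4. Results: In our experiments, the proposed model achieves F1 scores of 0.9726, 0.7291, and 0.9578 on the Chinese annual report, English annual report, and arXiv document datasets, respectively, and the baseline models all score lower than the proposed model. In addition, on the first two datasets, explicitly extracting headings improves F1 by 0.0148 and 0.1184, respectively. In the downstream passage retrieval task, adding features from the logical hierarchy tree improves mAP by 0.189.
5. Conclusions: The experimental results show that the proposed model clearly outperforms the two baseline models in accuracy. Because explicitly extracting headings improves both accuracy and efficiency, we adopt the two-stage model. The root-to-leaf traversal order gives the best trade-off between accuracy and efficiency, whereas the leaf-to-root traversal order achieves the highest accuracy at some cost in efficiency. Moreover, once the logical document hierarchy is available, it also improves passage retrieval accuracy in the downstream task. In summary, the proposed model for extracting logical structure from long documents is effective.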
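For readers unfamiliar with the mAP figure quoted above, the following is a minimal sketch of how mean average precision is commonly computed for ranked retrieval results. The exact evaluation protocol used in the paper is not described in this abstract, so the function names and the normalization choice (dividing by the number of relevant items that appear in the ranked list) are assumptions.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_relevance[i] is True when the passage at rank i + 1 is relevant.
    Assumes every relevant passage appears somewhere in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)
```

An improvement of 0.189 in mAP then means that this averaged score rose by 0.189 once hierarchy-tree features were added to the retrieval model.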