Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application

Rong-Yu Cao; Yi-Xuan Cao; Gan-Bin Zhou; Ping Luo

doi:10.1007/s11390-021-1076-7

Rong-Yu Cao¹,²,
Yi-Xuan Cao¹,²,
Gan-Bin Zhou³,
Ping Luo¹,²,⁴

¹Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
²University of Chinese Academy of Sciences, Beijing 100049, China
³WeChat Search Application Department, Tencent Holdings Ltd., Beijing 100080, China
⁴Peng Cheng Laboratory, Shenzhen 518066, China

Author Bio:
Rong-Yu Cao received his B.E. degree in software engineering from Dalian University of Technology, Dalian, in 2016, and is now a Ph.D. student at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. His research interests include natural language processing and document analysis.

Publication History
- Received: 2020-10-15
- Revised: 2021-04-28
- Accepted: 2021-05-08
- Published: 2022-05-29

Funds: This work was supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002104, and the National Natural Science Foundation of China under Grant Nos. 62076231 and U1811461.

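Abstract

1. Background: As information technology has recently penetrated vertical domains such as finance, law, government, and education, the number of electronic documents has grown rapidly. To obtain valuable information from these unstructured documents, it is essential to recover their underlying structure, which makes it easier to re-edit, restyle, or reflow them. Document structure is also critical to many downstream NLP and text mining applications. However, converting a document from its editing format (e.g., Word or LaTeX) to its display format (e.g., PDF or JPG) preserves only the layout; the underlying physical and logical structures are partially or completely lost. Making this conversion reversible in general therefore remains an open problem. For these reasons, this paper studies the extraction of variable-depth logical document hierarchy from long documents.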
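2. Objective: The goal of this paper is to extract variable-depth logical document hierarchy from long documents, that is, to reorganize the already recognized physical document objects into a hierarchical structure. The difficulty is that a long document contains a large number of physical objects, and these objects sit at different levels, so the depth of the hierarchy varies from document to document.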
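3. Method: Inspired by how humans perceive document hierarchy while reading, we propose a new neural-network-based model. Its input is an ordered sequence of recognized physical document objects, and its output is a hierarchy tree over these objects. Specifically, we insert each physical object, in sequence order, at the appropriate position in the tree. For each object to be inserted, we examine the possible insertion positions in the current tree in a fixed traversal order until a suitable position is found. Deciding whether a candidate position is suitable is formulated as a binary classification problem, "put or skip". The tree grows in this way until all physical objects have been inserted. We further explore several variants of the model, including the effect of different traversal orders during insertion, explicit versus implicit heading detection, and tolerance to errors made in earlier insertion steps. To measure the accuracy of the extracted hierarchy tree, we propose a new evaluation measure. In addition, we study the effect of the logical hierarchy tree on the downstream passage retrieval task.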
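4. Results: In the experiments, the proposed model achieves F1 scores of 0.9726, 0.7291, and 0.9578 on the Chinese annual report, English annual report, and arXiv document datasets, respectively. All of the baseline models score lower than the proposed model. On the first two datasets, explicitly extracting headings improves F1 by 0.0148 and 0.1184, respectively. In the downstream passage retrieval task, using features from the logical hierarchy tree yields an improvement of 0.189 in mAP.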
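5. Conclusion: According to the experimental results, the proposed model is clearly more accurate than the two baseline models. Because explicit heading extraction clearly improves both accuracy and efficiency, we adopt the two-stage model. The root-to-leaf traversal order gives the best tradeoff between accuracy and efficiency, while the leaf-to-root traversal attains the highest accuracy at some cost in efficiency. The extracted logical document hierarchy also improves the accuracy of passage retrieval in the downstream task. In summary, the proposed model for extracting logical structure from long documents is effective.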
Keywords:
- document analysis
- document and text processing
- machine learning
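
The sequential "put or skip" insertion described under Method can be illustrated with a minimal Python sketch. Several things here are assumptions made only for illustration and are not taken from the paper: the candidate positions are restricted to the rightmost path of the current tree, each object carries a hypothetical integer `level` feature, and `toy_should_put` is a hand-written rule standing in for the learned binary classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Node:
    text: str
    level: int  # hypothetical feature: 0 = root, 1..k = heading levels, 9 = body text
    children: List["Node"] = field(default_factory=list)


def rightmost_path(root: Node) -> List[Node]:
    """Candidate parents, root first: follow the last child at every level."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def toy_should_put(parent: Node, obj: Node) -> bool:
    """Stand-in for the learned classifier: True means "put", False means "skip"."""
    if not parent.children:
        return True
    # Put here unless the parent's last child could itself contain the new object.
    return parent.children[-1].level >= obj.level


def build_tree(
    objects: Iterable[Tuple[str, int]],
    should_put: Callable[[Node, Node], bool] = toy_should_put,
) -> Node:
    root = Node("<root>", 0)
    for text, level in objects:
        obj = Node(text, level)
        candidates = rightmost_path(root)
        # Root-to-leaf traversal: take the first "put"; if every candidate is
        # skipped, fall back to the deepest one.
        parent = next((c for c in candidates if should_put(c, obj)), candidates[-1])
        parent.children.append(obj)
    return root


def render(node: Node, depth: int = 0) -> List[str]:
    """Indented text view of the tree, one line per node."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


DEMO_OBJECTS = [
    ("1 Introduction", 1),
    ("para a", 9),
    ("1.1 Background", 2),
    ("para b", 9),
    ("2 Method", 1),
    ("para c", 9),
]

if __name__ == "__main__":
    print("\n".join(render(build_tree(DEMO_OBJECTS))))
```

A leaf-to-root variant would scan the same candidates in reverse order, with a decision rule suited to that order; the toy rule above is written for the root-to-leaf direction only.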
Abstract: In this paper, we study the problem of extracting variable-depth "logical document hierarchy" from long documents, namely organizing the recognized "physical document objects" into hierarchical structures. Discovering the logical document hierarchy is a vital step in supporting many downstream applications (e.g., passage-based retrieval and high-quality information extraction). However, long documents, which may contain hundreds or even thousands of pages and a variable-depth hierarchy, challenge existing methods. To address these challenges, we develop a framework, namely Hierarchy Extraction from Long Document (HELD), in which we "sequentially" insert each physical object at the proper position in the current tree. Determining whether each possible position is proper can be formulated as a binary classification problem. To further improve effectiveness and efficiency, we study design variants of HELD, including the traversal order of the insertion positions, explicit or implicit heading extraction, and tolerance to insertion errors in predecessor steps. As for evaluation, we find that previous studies ignore the error in which the depth of a node is correct while its path to the root is wrong. Since such mistakes may seriously degrade downstream applications, we develop a new measure for a more careful evaluation. Empirical experiments on thousands of long documents from the Chinese financial market, the English financial market, and English scientific publications show that the HELD model with the "root-to-leaf" traversal order and explicit heading extraction is the best choice for balancing effectiveness and efficiency, with accuracies of 0.9726, 0.7291, and 0.9578 on the Chinese financial, English financial, and arXiv datasets, respectively. Finally, we show that the logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task. In summary, we conduct a systematic study of this task in terms of methods, evaluations, and applications.
Keywords:
- logical document hierarchy
- long documents
- passage retrieval
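
The evaluation concern raised in the abstract, a node whose depth is correct while its path to the root is wrong, can be made concrete with a small Python sketch. This is not the measure proposed in the paper, which the abstract does not define; `depth_accuracy` and `path_accuracy` are hypothetical helpers, and the trees are toy parent maps in which `None` marks a top-level node.

```python
from typing import Dict, List, Optional

ParentMap = Dict[str, Optional[str]]


def root_path(node: str, parent: ParentMap) -> List[str]:
    """Chain of nodes from the top level down to `node`, inclusive."""
    path: List[str] = []
    cur: Optional[str] = node
    while cur is not None:
        path.append(cur)
        cur = parent[cur]
    return path[::-1]


def depth_accuracy(gold: ParentMap, pred: ParentMap) -> float:
    """Fraction of nodes whose predicted depth equals the gold depth."""
    hits = sum(len(root_path(n, gold)) == len(root_path(n, pred)) for n in gold)
    return hits / len(gold)


def path_accuracy(gold: ParentMap, pred: ParentMap) -> float:
    """Fraction of nodes whose whole predicted root path equals the gold path."""
    hits = sum(root_path(n, gold) == root_path(n, pred) for n in gold)
    return hits / len(gold)


# Gold: S1 and S2 are top-level sections, each with one subsection.
GOLD: ParentMap = {"S1": None, "S1.1": "S1", "S2": None, "S2.1": "S2"}
# Prediction: S2 and S2.1 are both wrongly attached under S1.
PRED: ParentMap = {"S1": None, "S1.1": "S1", "S2": "S1", "S2.1": "S1"}

if __name__ == "__main__":
    print("depth accuracy:", depth_accuracy(GOLD, PRED))
    print("path accuracy:", path_accuracy(GOLD, PRED))
```

In this toy case S2.1 keeps depth 2 but ends up under S1 instead of S2, so a depth-only score counts it as correct (0.75 overall) while a path-based score does not (0.5 overall).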

Other related attachments
- External link to this article
  https://rdcu.be/cQ9Z7
