Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application

Rong-Yu Cao1,2 (曹荣禹), Student Member, CCF, Yi-Xuan Cao1,2 (曹逸轩), Member, CCF, IEEE, Gan-Bin Zhou3 (周干斌), and Ping Luo1,2,4 (罗平), Senior Member, CCF, Member, IEEE        

  1 Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  2 University of Chinese Academy of Sciences, Beijing 100049, China
  3 WeChat Search Application Department, Tencent Holdings Ltd., Beijing 100080, China
  4 Peng Cheng Laboratory, Shenzhen 518066, China
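1. Research background: As information technology has penetrated vertical domains such as finance, law, government, and education, the number of electronic documents has grown rapidly. To obtain valuable information from these unstructured documents, recovering the underlying document structure is essential, because it makes the documents easy to re-edit, restyle, or rearrange. Document structure is also essential for supporting many downstream NLP and text-mining applications. However, converting a document from its editing format (e.g., WORD or LaTeX) to its display format (e.g., PDF or JPG) preserves only the layout, while the basic physical and logical structure is partially or completely lost. Making this conversion reversible in general therefore remains an open problem. For these reasons, this paper studies the extraction of variable-depth logical document structure from long documents.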
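3. Method: Inspired by how humans grasp a document's hierarchy while reading, we propose a new neural-network-based model. The input is an ordered sequence of already-recognized physical document objects, and the output is a hierarchy tree composed of these objects. Specifically, we follow the sequence order and insert each physical object, one at a time, at the appropriate position in the tree. For each object to be inserted, we query the possible insertion positions in the current tree in a fixed traversal order until a suitable position is found. Deciding whether a candidate position is suitable can be formulated as a binary classification problem, namely "put or skip". The hierarchy tree is built in this way until all physical objects have been inserted. We further explore several variants of the model, including the effect of different traversal orders during insertion, explicit versus implicit heading detection, and tolerance to errors in previously inserted nodes. To measure the accuracy of the logical structure tree, we propose a new evaluation metric. In addition, we explore the effect of the logical structure tree on the downstream passage retrieval task.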
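
The sequential "put or skip" insertion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the HELD model itself: the `Node` class, the numeric `level` field, and the hand-written `put_or_skip` rule are hypothetical stand-ins for the learned neural classifier, and the sketch assumes that the candidate insertion positions are the nodes on the rightmost path of the current tree (so that document order is preserved), visited root-to-leaf.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    text: str
    level: int  # hypothetical: 0 = root, 1..n = heading depth, 99 = body text
    children: List["Node"] = field(default_factory=list)


def rightmost_path(root: Node) -> List[Node]:
    """Candidate parents, root first (assumed candidate set)."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def put_or_skip(parent: Node, obj: Node) -> bool:
    """Stand-in for the learned binary classifier: True means 'put'."""
    if not parent.children:
        return True
    # Put here if the parent's last child is at the same level or deeper.
    return parent.children[-1].level >= obj.level


def build_tree(objects: List[Tuple[str, int]]) -> Node:
    """Insert objects one by one, querying candidates root-to-leaf."""
    root = Node("ROOT", 0)
    for text, level in objects:
        obj = Node(text, level)
        candidates = rightmost_path(root)
        target = candidates[-1]  # fallback if every candidate is skipped
        for parent in candidates:
            if put_or_skip(parent, obj):
                target = parent
                break
        target.children.append(obj)
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as indented lines for inspection."""
    lines: List[str] = []
    for child in node.children:
        lines.append("  " * depth + child.text)
        lines.extend(outline(child, depth + 1))
    return lines


demo_objects = [
    ("1 Intro", 1), ("1.1 Background", 2), ("para a", 99), ("para b", 99),
    ("1.2 Scope", 2), ("2 Method", 1), ("para c", 99),
]
print("\n".join(outline(build_tree(demo_objects))))
```

In a learned setting, `put_or_skip` would be replaced by a classifier that scores the pair (candidate position, object); the insertion loop itself stays the same.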

Keywords: document analysis, document and text processing, machine learning

Abstract: In this paper, we study the problem of extracting a variable-depth "logical document hierarchy" from long documents, namely organizing the recognized "physical document objects" into hierarchical structures. Discovering the logical document hierarchy is a vital step in supporting many downstream applications (e.g., passage-based retrieval and high-quality information extraction). However, long documents, which contain hundreds or even thousands of pages and a variable-depth hierarchy, challenge existing methods. To address these challenges, we develop a framework, Hierarchy Extraction from Long Document (HELD), in which we "sequentially" insert each physical object at the proper position in the current tree. Determining whether each possible position is proper can be formulated as a binary classification problem. To further improve effectiveness and efficiency, we study design variants of HELD, including the traversal order of the insertion positions, explicit versus implicit heading extraction, and tolerance to insertion errors in preceding steps. As for evaluation, we find that previous studies ignore the error in which the depth of a node is correct while its path to the root is wrong. Since such mistakes may seriously degrade downstream applications, we develop a new measure for a more careful evaluation. Experiments on thousands of long documents from the Chinese financial market, the English financial market, and English scientific publications show that the HELD model with the "root-to-leaf" traversal order and explicit heading extraction offers the best tradeoff between effectiveness and efficiency, with accuracies of 0.9726, 0.7291, and 0.9578 on the Chinese financial, English financial, and arXiv datasets, respectively. Finally, we show that the logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task. In summary, we conduct a systematic study of this task in terms of methods, evaluations, and applications.
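
To make the evaluation issue concrete, the toy example below contrasts a depth-only score with a path-based score. It is only an illustration of the kind of error described in the abstract, not the measure proposed in the paper: the function names, the parent-map representation, and both scores are assumptions introduced here.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation: node name -> parent name (None = top level).
ParentMap = Dict[str, Optional[str]]


def path_to_root(node: str, parents: ParentMap) -> Tuple[str, ...]:
    """Ancestors of a node, nearest first."""
    path = []
    current = parents[node]
    while current is not None:
        path.append(current)
        current = parents[current]
    return tuple(path)


def depth_accuracy(gold: ParentMap, pred: ParentMap) -> float:
    """Fraction of nodes whose depth matches (ignores which ancestors)."""
    hits = sum(
        len(path_to_root(n, gold)) == len(path_to_root(n, pred)) for n in gold
    )
    return hits / len(gold)


def path_accuracy(gold: ParentMap, pred: ParentMap) -> float:
    """Fraction of nodes whose full path to the root matches."""
    hits = sum(path_to_root(n, gold) == path_to_root(n, pred) for n in gold)
    return hits / len(gold)


# Paragraph b1 belongs under section B but is predicted under section A.
gold = {"A": None, "B": None, "a1": "A", "b1": "B"}
pred = {"A": None, "B": None, "a1": "A", "b1": "A"}

print(depth_accuracy(gold, pred))  # 1.0: the depth-only score misses the error
print(path_accuracy(gold, pred))   # 0.75: the path-based score exposes it
```

The misplaced paragraph keeps its depth, so a depth-only score reports a perfect tree, while a passage retriever relying on section membership would return it under the wrong section.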

Key words: logical document hierarchy, long documents, passage retrieval

