
Multimodal Dependence Attention and Large-Scale Data Based Offline Handwritten Formula Recognition

Han-Chao Liu, Lan-Fang Dong, Xin-Ming Zhang

Liu HC, Dong LF, Zhang XM. Multimodal dependence attention and large-scale data based offline handwritten formula recognition. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 39(3): 654−670 May 2024. DOI: 10.1007/s11390-022-1987-y.
Liu HC, Dong LF, Zhang XM. Multimodal dependence attention and large-scale data based offline handwritten formula recognition. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 39(3): 654−670 May 2024. CSTR: 32374.14.s11390-022-1987-y.


Funds: This work is supported by the National Key Research and Development Program of China under Grant No. 2020YFB1313602.
    Author Bio:

    Han-Chao Liu is now a Ph.D. candidate in the School of Computer Science and Technology, University of Science and Technology of China, Hefei. He received his B.E. degree in computer science from Northwest Agriculture and Forestry University, Yangling, in 2015. His research interests include image analysis and pattern recognition.

    Lan-Fang Dong received her B.E. degree in computer science from Lanzhou University, Lanzhou, in 1991, and her M.S. degree in computer application from University of Science and Technology of China, Hefei, in 1994. She is currently an associate professor with the School of Computer Science and Technology, University of Science and Technology of China, Hefei. Her research interests include computing and visualization, intelligent image analysis, and computer animation.

    Xin-Ming Zhang received his B.E. and M.E. degrees in electrical engineering from China University of Mining and Technology, Xuzhou, in 1985 and 1988, respectively, and his Ph.D. degree in computer science and technology from the University of Science and Technology of China, Hefei, in 2001. Since 2002, he has been with the faculty of the University of Science and Technology of China, Hefei, where he is currently a professor with the School of Computer Science and Technology.

    Corresponding author:

    Xin-Ming Zhang: xinming@ustc.edu.cn

  • Extended Abstract:
    Research Background

    With the advance of informatization, people increasingly use computers to handle tasks in their daily work and study. Formulas, as a tool for expressing, abstracting and defining problems, are widely used in our daily study and life. However, because of their complex two-dimensional structures, entering formulas into a computer is complicated and time-consuming. Although handwriting is the most natural way for humans to record information, handwritten input is difficult for computers to understand. Offline handwritten formula recognition aims to convert handwritten formula images into a format that computers can edit and understand (such as LaTeX strings). Due to the variability of handwritten symbols and the complex two-dimensional structures of formulas, offline handwritten formula recognition has long been a very challenging task. With the development of deep learning in recent years, attention-based encoder-decoder networks have greatly advanced this field and improved its recognition performance. However, current work performs relatively well on simple formulas but relatively poorly on complex formulas with long LaTeX string labels, and there is still little research on optimizing the recognition of long and complex formulas. In addition, to improve recognition performance, researchers have designed increasingly sophisticated and complex model structures, while the existing training data is relatively scarce and often insufficient to train such complex models properly; model overfitting has gradually become a bottleneck restricting the development of this field.

    Objective

    Our work first enhances the training data by constructing a large-scale handwritten formula image dataset, so as to reduce model overfitting and improve offline handwritten formula recognition. In addition, we optimize the recognition of long and complex formula images to improve the usability of the model and further improve the recognition performance.

    Methods

    We construct a handwritten formula image dataset, HFID, collected from real scenes. It covers 156 classes of commonly used formula symbols and contains 26520 handwritten formula images from mathematics, physics and chemistry, roughly twice the size of the CROHME (Competition on Recognition of Online Handwritten Mathematical Expressions) dataset, which is currently the most widely used dataset in this field. In addition, we design a multimodal dependence attention (MDA) module. It extracts multimodal features of the symbols in a formula to represent them and, taking these multimodal features as input, uses an attention mechanism to model the dependencies among the symbols in the same formula. These dependencies assist the recognition of the symbols and improve the recognition performance of the model.

    Results

    We conduct experiments on the CROHME and HFID datasets. Compared with the model trained without HFID pretraining, the model pretrained on the HFID training set and fine-tuned on the CROHME training set improves the recognition results on CROHME 2014, CROHME 2016 and CROHME 2019 from 47.70%, 50.83% and 51.29% to 58.62%, 60.35% and 57.80%, respectively. After adding the MDA module, the results on CROHME 2014, CROHME 2016 and CROHME 2019 further rise to 59.94%, 62.70% and 59.38%, respectively, and the result on the HFID test set improves from 59.12% to 60.16%. In addition, we visualize the weight maps generated by MDA and verify that MDA can indeed learn the dependencies among symbols. We also analyze the recognition results for formulas in different length ranges; the experimental results show that, with the MDA module, the recognition of long and complex formulas indeed improves. Finally, with model ensembling, we achieve 63.79% and 65.24% on CROHME 2014 and CROHME 2016, respectively, which are the best recognition results reported on these two datasets so far.

    Conclusions

    The experimental results show that the HFID dataset constructed in this paper can effectively reduce the impact of model overfitting and further improve recognition performance, and that the symbol dependencies learned by the MDA module can effectively improve the recognition of long and complex formulas and further improve the performance of the model on offline handwritten formula recognition. In future work, we will study how to apply the Transformer, a powerful encoder-decoder network, to offline handwritten formula recognition to further improve the recognition performance.

    Abstract:

    Offline handwritten formula recognition is a challenging task due to the variety of handwritten symbols and two-dimensional formula structures. Recently, the deep neural network recognizers based on the encoder-decoder framework have achieved great improvements on this task. However, the unsatisfactory recognition performance for formulas with long \LaTeX strings is one shortcoming of the existing work. Moreover, lacking sufficient training data also limits the capability of these recognizers. In this paper, we design a multimodal dependence attention (MDA) module to help the model learn visual and semantic dependencies among symbols in the same formula to improve the recognition performance of the formulas with long \LaTeX strings. To alleviate overfitting and further improve the recognition performance, we also propose a new dataset, Handwritten Formula Image Dataset (HFID), which contains 26520 handwritten formula images collected from real life. We conduct extensive experiments to demonstrate the effectiveness of our proposed MDA module and HFID dataset and achieve state-of-the-art performances, 63.79% and 65.24% expression accuracy on CROHME 2014 and CROHME 2016, respectively.

  • Formulas including mathematical expressions (MEs) are convenient and essential for describing problems, definitions and theories in documents of math, physics, chemistry, and many other fields. Due to the importance of formulas, handwritten formula recognition has received considerable attention and is extensively applied in smart education, office automation, and human-computer interaction. Currently, it is still a challenging task due to the various ambiguities of handwritten symbols and the complicated two-dimensional structure of formulas[1-3].

    Handwritten formula recognition has been studied for decades, since the 1960s[4]. To conquer the challenges of this task, extensive research has been carried out over the years. The work on this task can be divided into traditional methods and deep neural network (DNN) based methods. The traditional methods[5-8] divide the task into symbol segmentation, symbol recognition, and structural analysis, and apply manually designed rules to handle these three sub-problems sequentially or simultaneously. In recent years, approaches based on DNNs have been proposed[1-3, 9-13]. Specifically, these approaches leverage attention-based encoder-decoder networks[14] to translate ME images to \LaTeX strings end-to-end and achieve significant improvements over the traditional work. Besides this research, the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME)[15-17] offers training data and a test bed for researchers around the world to evaluate their work, and has greatly boosted the development of handwritten mathematical expression recognition (HMER).

    Although the recent work[1-3, 9-13] on HMER that employs attention-based encoder-decoder models has achieved impressive results, there are still some shortcomings to be addressed. First, it is hard to recognize complex formulas with long \LaTeX strings correctly. The formulas with long \LaTeX strings usually have longer dependencies among symbols than those with short \LaTeX strings. The dependencies among the symbols in the same formula can help the model recognize them correctly since they represent the relationships among symbols. For example, when recognizing “2” in “\frac { 1 } { 2 } a _ { 3 }”, the dependencies among “\frac”, “1” and “2” can assist the model effectively since these three symbols belong to the same fraction. However, the attention-based encoder-decoder models for handwritten formula recognition usually utilize an LSTM[18] or GRU[19] as the decoder, which implicitly learns an embedded language model to capture the dependencies among symbols[3]. The LSTM and GRU can only model the dependencies between co-dependent elements step by step, and it is difficult for them to learn long-range dependencies since the path between long-range co-dependent elements is long[20, 21]. As a consequence, the recognition performance for complex formulas is relatively poor[1, 22]. Second, the training data is far from sufficient for training DNNs. Recent research tends to build more and more complex and powerful neural networks to enhance modeling capability and improve recognition performance. However, the CROHME dataset[15-17], which is the most widely used dataset for HMER, contains fewer than 10k training samples. Insufficient training data has limited the development of HMER. Due to the lack of training data, researchers[11, 12] have to apply data augmentation methods to the training set of CROHME to alleviate overfitting. However, the synthetic samples usually preserve the same handwriting styles and \LaTeX strings as the original samples, which limits their effectiveness. Besides, the images in CROHME are drawn from the online stroke information. They all have clean backgrounds and legible handwriting, which is idealized and unlike real-life samples.

    In this paper, we aim at alleviating the effects of the challenges introduced above. We propose a multimodal dependence attention (MDA) module to improve the recognition performance of formulas with long \LaTeX strings. In addition to the embedded language model in the decoder of the traditional encoder-decoder model, we use the MDA module to explicitly learn the dependencies among the symbols in the same formula. The meanings of the symbols depend on both the visual information and the linguistic semantic information. Therefore, in the MDA module, we utilize multimodal features that merge the visual features and character embeddings as the representation of each symbol. After the multimodal information fusion, an attention mechanism is employed to calculate the correspondences between the features of the current symbol and the features of earlier symbols. Unlike LSTM and GRU, the attention mechanism reduces the length of the dependence path between any two symbols in the same formula to one, making it easier to model the dependencies in formulas with long \LaTeX strings. Moreover, considering the lack of training data, we construct the Handwritten Formula Image Dataset (HFID), a new large-scale public dataset for handwritten formula recognition. With 26520 handwritten formula images collected from real life, HFID offers a sufficient training set to alleviate overfitting. It is also worth noticing that, unlike the CROHME dataset, HFID contains handwritten chemical equation images besides the handwritten ME images. Similar to MEs, chemical equations also have two-dimensional structures, and recognizing these two types of formulas can be treated as the same problem.

    In summary, the main contributions of this paper are highlighted as follows.

    1) We propose the MDA module to improve the recognition performance for the formulas with long \LaTeX strings. MDA first merges the visual and semantic features to represent each symbol. Then, it uses an attention mechanism to learn the dependencies of the symbols in the same formula to assist the recognition of each symbol.

    2) We propose a large-scale dataset called HFID for handwritten formula recognition. To the best of our knowledge, HFID, which contains 26520 handwritten formula images, is the biggest dataset for handwritten formula recognition. Besides, HFID can be used not only for HMER but also for handwritten chemical equation recognition. HFID is publicly available now1.

    3) We design an attention-based encoder-decoder model equipped with the MDA module. Using this model, extensive experiments are conducted to show the effectiveness of HFID and the MDA module. We achieve state-of-the-art performance, 63.79% and 65.24% accuracy on the test sets of CROHME 2014[15] and CROHME 2016[16], respectively, which are the most widely used datasets for the HMER task.

    The rest of this paper is organized as follows. In Section 2, related work is discussed. Section 3 gives a detailed introduction to the HFID dataset. We introduce the multimodal dependence attention in Section 4. The experiments and analyses are presented in Section 5. Finally, the conclusions are drawn in Section 6.

    The task of handwritten formula recognition has gradually become a hot topic in optical character recognition (OCR) in recent years, especially since the inception of CROHME. Aiming to provide a public benchmarking dataset for researchers worldwide to evaluate their progress clearly, CROHME was first organized in 2011[15] and has been held six times. The CROHME dataset[15-17] has gradually become the most widely used dataset for HMER and has boosted the development of HMER. Initially, there were only 921 ME samples with 56 symbol classes in the training set of CROHME 2011[15], while in CROHME 2016, the training data expanded to 8835 ME samples with 101 symbol classes[16]. Despite the fact that the size of the training data has increased more than tenfold, the CROHME dataset is still relatively small compared with the public datasets for other OCR tasks, such as the CASIA HWDB dataset[23] for handwritten Chinese recognition and the IAM dataset[24] for handwritten English sentence recognition. It is far from sufficient for training the models of the recent work[9, 11, 12] on HMER.

    The recent work on HMER can be divided into two groups of methods: the traditional methods and the attention-based encoder-decoder models. Traditional methods divide the task of HMER into three sub-tasks: symbol segmentation, symbol recognition, and structural analysis, and tackle the sub-tasks sequentially or simultaneously. Hu and Zanibbi[5] focused on the symbol segmentation sub-task and proposed to use the AdaBoost algorithm and multi-scale features to segment the symbols correctly. Awal et al.[7] tackled the HMER task using a global approach. They proposed a new contextual modeling method combining syntactic and structural information to find the most likely combination of segmentation and recognition hypotheses. Álvaro et al.[8] parsed MEs with a two-dimensional stochastic context-free grammar. By combining various stochastic sources of information, they selected the most likely generated \LaTeX string as the final result. Although the traditional methods achieved admirable results in several CROHME competitions, they all require elaborate manually designed rules or grammars. On the other hand, the attention-based encoder-decoder models are free of human-designed rules and can learn end-to-end recognition models directly from the training data. Zhang et al.[1] built a model whose encoder is a VGG-like[25] CNN model to extract features from the ME images. Besides, their model applies a GRU layer and an attention mechanism in the decoder to translate features to \LaTeX strings. Wu et al.[2] proposed the paired adversarial learning method to help the network learn semantic invariant features. Furthermore, by employing deeper networks and a novel attention method called pre-aware coverage attention, they enhanced their accuracy on CROHME 2014 from 39.66% to 48.88% using a single model[3]. Due to the lack of data, Le et al.[11] applied local and global distortions to the original ME images to augment the training set. Similarly, Li et al.[12] also proposed to generate new training samples by their scale augmentation method. By augmenting the training set, [11] and [12] improved their accuracy on the CROHME dataset. Although the attention-based encoder-decoder models can lead to better results than the traditional methods, they all demand huge amounts of training data and computing resources.

    To improve the performance of image captioning, Gu et al.[26] proposed a language CNN model that can capture the long-range dependencies among the words of the same sentence. Compared with RNN-based language models that can only predict the next word based on one previous word and the hidden state, the language CNN model can be fed with all the history words, which are critical for the image captioning task. Xiu et al.[27] proposed a multi-level multimodal fusion network for handwritten Chinese line recognition. They pretrained a set of language embeddings and visual embeddings for each individual Chinese character and fused these embeddings as the multimodal information. Besides the multimodal feature fusion at the character level, Xiu et al.[27] also applied a language CNN model to model the long-range dependencies among the multimodal information at the text fragment level. The multi-level multimodal information is finally integrated into the decoder of the model to generate the recognition results.

    In this paper, we propose the MDA module that can model the dependencies among the multimodal information of symbols to improve the recognition performance of formulas with long \LaTeX strings. It is worth noticing that the multimodal information used in MDA does not need any pretraining process; it is learned automatically during the training of our model, which is very different from the multimodal information of [27]. Besides, we utilize the attention mechanism to capture the dependencies of symbols in the same formula, which is more flexible than the CNN networks with fixed receptive fields used in [26, 27]. The details of the MDA module will be introduced in Section 4.

    In this section, we present a new public dataset, HFID, for handwritten formula image recognition to alleviate overfitting caused by the lack of training data. HFID consists of 26520 handwritten formula images with 404904 symbol samples covering 156 symbol classes. To the best of our knowledge, it offers the largest public collection of handwritten formula images collected from real life. It is also worth noticing that not only mathematical expression images but also images of chemical equations are included in our dataset. In this section, the collection and annotation methods for the handwritten formula images are first introduced. Then, we provide the statistics of the dataset.

    In order to collect handwritten formula images efficiently, we first render the collected \LaTeX strings into printed formula images and then ask volunteers to write the corresponding formulas on paper. Finally, we scan the handwritten formulas into grayscale images at 300 dpi. In comparison with the most widely used CROHME dataset, which uses online stroke information to draw formula images with clear strokes and clean backgrounds, the images in HFID, which are directly scanned from paper, are typically noisy due to various writing tools, paper grains, dirty marks, and other types of interference. Fig.1 gives a few examples of the formula images in HFID. As shown in Fig.1, the images in HFID are closer to the images encountered in real-life applications than those in CROHME.

    Figure  1.  Examples of samples in HFID.

    Besides image collection, sample annotation is also crucial in dataset construction. Since the handwritten formula images are copied from the corresponding printed formula images that are rendered from the known \LaTeX strings, we can easily get the \LaTeX label for each collected handwritten formula image after the data collection of HFID. However, one formula can usually be represented by different \LaTeX strings. For example, both “\sqrt a+b^2_0” and “\sqrt { a } + b _ { 0 } ^ { 2 }” can represent \sqrt{a}+b_0^2 . To eliminate the ambiguity and make it easier to train and evaluate models using the proposed HFID, we have normalized all the \LaTeX notations manually by the following rules (a minimal normalization sketch is given after the list):

    • separate the symbol words with spaces,

    • add “{” and “}” symbols to surround the superscripts, subscripts, numerators, denominators, and subexpressions under “\sqrt”,

    • generate the subscripts before the superscripts of the same symbol.
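
    As a rough illustration of these rules (not the actual procedure used to build HFID, whose labels were normalized manually), a tokenized \LaTeX string could be normalized along the following lines; the token handling and the regular expression below are our own simplifications:

        import re

        # Illustrative normalizer for the three rules above; this is a simplified
        # sketch, not the procedure actually used to build HFID.
        def normalize_latex(tokens):
            out = []
            i = 0
            while i < len(tokens):
                tok = tokens[i]
                # Rule 2: wrap single-token script / sqrt arguments in braces.
                if tok in ('^', '_', '\\sqrt') and i + 1 < len(tokens) and tokens[i + 1] != '{':
                    out.extend([tok, '{', tokens[i + 1], '}'])
                    i += 2
                else:
                    out.append(tok)
                    i += 1
            # Rule 1: separate the symbol words with spaces.
            s = ' '.join(out)
            # Rule 3: emit the subscript before the superscript of the same symbol.
            s = re.sub(r'\^ \{ ([^{}]*) \} _ \{ ([^{}]*) \}', r'_ { \2 } ^ { \1 }', s)
            return s

        print(normalize_latex(['\\sqrt', 'a', '+', 'b', '^', '2', '_', '0']))
        # -> \sqrt { a } + b _ { 0 } ^ { 2 }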

    HFID is divided into the training set, the validation set and the test set. The overall composition of HFID is shown in Table 1. As shown in Table 1, there are 23032, 1528, and 1960 samples in the training set, the validation set, and the test set, respectively. For the training set and the validation set, which are used in the training stage, over 230 volunteers have contributed their handwriting. For the test set, we have collected the handwriting of 30 volunteers whose handwriting does not appear in the training set or the validation set. Besides, the \LaTeX strings of the samples in the validation set and the test set are all different from those in the training set to avoid overfitting. As mentioned above, there are 156 symbol classes in HFID, which fully cover the 101 symbol classes of the CROHME dataset. The introduction of the 101 symbol classes shared by the CROHME dataset and HFID can be found in [15]. The 55 additional symbol classes in HFID are: “D”, “J”, “K”, “O”, “Q”, “U”, “W”, “Z”, “hat”, “;”, “:”, “csc”, “cot”, “%”, “supseteq”, “longrightarrow”, “omega”, “prod”, “cap”, “tau”, “supseteq”, “arcsin”, “triangle”, “psi”, “emptyset”, “rightleftharpoons”, “cosh”, “delta”, “psi”, “sinh”, “notin”, “subset”, “uparrow”, “varphi”, “downarrow”, “equiv”, “ln”, “lg”, “eta”, “varepsilon”, “circ”, “rho”, “iint”, “approx”, “overline”, “tanh”, “cup”, “partial”, “arctan”, “supset”, “sup”, “arccos”, “sim”, “sec”, “overrightarrow”.

    Table  1.  Composition of HFID
    Usage Number of Volunteers Name Number of Samples Number of Symbols
    Training stage Over 230 Training set 23032 351830
    Training stage Over 230 Validation set 1528 22995
    Test stage 30 Test set 1960 30079

    Table 2 lists the distributions of the formulas in different \LaTeX string length ranges in the training sets of CROHME 2016 and HFID. Notice that we also normalize the \LaTeX strings of the samples in CROHME 2016 using the method introduced above. As shown in Table 2, the training set of HFID is more evenly distributed than that of CROHME 2016 over the different length intervals. CROHME 2016 focuses more on the samples with relatively short \LaTeX strings and pays less attention to the samples with long \LaTeX strings, which are difficult to recognize correctly[1, 22]. Only 19.84% of the training samples in CROHME 2016 have \LaTeX strings longer than 25 symbols. In contrast, the ratio in HFID is 42.50%. With a more uniform distribution of formulas over different \LaTeX string lengths, HFID offers similar chances for the model to be trained with samples of different lengths.

    Table  2.  Distributions of Formulas with Different Lengths in the Training Set of CROHME 2016 and HFID
    Length CROHME 2016 (Number of Formulas / Percent (%)) HFID (Number of Formulas / Percent (%))
    1-5 1600 / 18.11 1882 / 8.17
    6-10 1955 / 22.12 2937 / 12.75
    11-15 1335 / 15.11 2669 / 11.59
    16-20 1111 / 12.57 2775 / 12.05
    21-25 1078 / 12.20 2977 / 12.93
    26-30 609 / 6.89 2433 / 10.56
    31-35 442 / 5.00 1989 / 8.63
    35+ 705 / 7.98 5370 / 23.31

    With large-scale training data, HFID is expected to help researchers in handwritten formula recognition improve their work and to advance the development of this task.

    In this section, the models used in our experiments are introduced. The proposed baseline model, which is not equipped with the MDA module, is introduced first. After that, the MDA module is described in detail. Finally, we explain how to apply the MDA module to the baseline model. The overall architecture of the model equipped with MDA is illustrated in Fig.2.

    Figure  2.  Overview of the model equipped with the MDA module. It consists of four parts: a DenseNet ( D_{\rm cnn} ) for image feature extraction, a visual attention mechanism ( Att_{\rm v} ) to select the important visual features at each time step, a two-layer LSTM ( L_{\rm rnn}^{(1)} and L_{\rm rnn}^{(2)} ) for symbol recognition, and the MDA module, which consists of two sequential processing steps: the multimodal information fusion (MIF) process and the dependence attention (DA) process. A_t=\{{\boldsymbol{a}}_0, {\boldsymbol{a}}_1, {\boldsymbol{a}}_2, ..., {\boldsymbol{a}}_{t-1}\} and M_t=\{{\boldsymbol{m}}_0, {\boldsymbol{m}}_1, {\boldsymbol{m}}_2, ..., {\boldsymbol{m}}_{t-1}\} are the sets of the visual and multimodal features of the symbols at the earlier time steps, respectively. For the meaning of {\boldsymbol{c}}_t , please refer to Subsection 4.2.2.

    Similar to the existing handwritten formula recognition models, we design an attention-based encoder-decoder model using DenseNet[28] and LSTM as the baseline model in our experiments. In the baseline model, DenseNet is applied as the encoder to transform the input image into intermediate features. The decoder consists of two LSTM layers that utilize the features extracted by the encoder to recognize a symbol at each time step and finally form a sequence of symbols as the recognition result. To help the decoder focus on suitable parts of the features, the attention mechanism generates, at each time step, an attention map that indicates the important parts of the features. The overall processing of the baseline model can be formulated in (1) to (5):

    \begin{aligned} {\boldsymbol{V}} = D_{\rm cnn}({\boldsymbol{I}}), \end{aligned} (1)
    \begin{aligned} \hat{{\boldsymbol{h}}}_t = L_{\rm rnn}^{(1)}({\boldsymbol{Ey}}_{t-1}, {\boldsymbol{h}}_{t-1}), \end{aligned} (2)
    \begin{aligned} {\boldsymbol{a}}_t = Att_{\rm v}({\boldsymbol{V}}, \hat{{\boldsymbol{h}}}_t, {\boldsymbol{F}}), \end{aligned} (3)
    \begin{aligned} {\boldsymbol{h}}_{t} = L_{\rm rnn}^{(2)}({\boldsymbol{a}}_t, \hat{{\boldsymbol{h}}}_t), \end{aligned} (4)
    \begin{aligned} {\boldsymbol{y}}_t = g({\boldsymbol{W}}_{\rm o}({\boldsymbol{Ey}}_{t-1} + {\boldsymbol{W}}_{\rm h}{\boldsymbol{h}}_t + {\boldsymbol{W}}_{\rm a}{\boldsymbol{a}}_t)). \end{aligned} (5)

    In (1), {\boldsymbol{I}} is the input image, D_{\rm cnn} is the DenseNet processing, {\boldsymbol{V}} \in \mathbb{R}^{H \times W \times C} represents the feature maps extracted by the encoder ( H , W and C are the height, width and channel of the feature map {\boldsymbol{V}} , respectively). L_{\rm rnn}^{(1)} in (2) and L_{\rm rnn}^{(2)} in (4) are the first and the second LSTM in the decoder, respectively. \hat{{\boldsymbol{h}}}_t \in \mathbb{R}^{d_{\rm h}} and {\boldsymbol{h}}_t \in \mathbb{R}^{d_{\rm h}} are the hidden state of the first and the second LSTM, respectively ( {d_{\rm h}} is the dimension of the hidden state in LSTM). {\boldsymbol{E}} \in \mathbb{R}^{d_{\rm w} \times {W}} and {\boldsymbol{y}}_{t} \in \mathbb{R}^{W} are the embedding matrix and the recognition result at the t -th time step, respectively ( W is the number of symbol classes and d_{\rm w} is the character embedding dimension). {\boldsymbol{W}}_{\rm o} \in \mathbb{R} ^{{W} \times d_{\rm w}} , {\boldsymbol{W}}_{\rm h} \in \mathbb{R}^{d_{\rm w} \times d_{\rm h}} and {\boldsymbol{W}}_{\rm a} \in \mathbb{R}^{d_{\rm w} \times C} in (5) are the parameters to be learned in the model and g represents the \rm softmax function.
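
    To make the notation concrete, the classification step in (5) can be written as a small function. The NumPy sketch below uses toy dimensions and random placeholder parameters; it is only an illustration of the formula, not the actual implementation:

        import numpy as np

        def softmax(x):
            z = np.exp(x - x.max())
            return z / z.sum()

        def output_step(e_y_prev, h_t, a_t, W_o, W_h, W_a):
            """(5): y_t = softmax(W_o (E y_{t-1} + W_h h_t + W_a a_t))."""
            return softmax(W_o @ (e_y_prev + W_h @ h_t + W_a @ a_t))

        rng = np.random.default_rng(0)
        d_w, d_h, C, num_classes = 256, 256, 64, 101     # toy sizes
        y_t = output_step(rng.normal(size=d_w), rng.normal(size=d_h), rng.normal(size=C),
                          0.01 * rng.normal(size=(num_classes, d_w)),
                          0.01 * rng.normal(size=(d_w, d_h)),
                          0.01 * rng.normal(size=(d_w, C)))
        print(y_t.shape, round(float(y_t.sum()), 6))     # (101,) 1.0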

    It is worth noticing that, for each symbol predicted by the decoder, not all of the feature map provides useful information. The decoder should know which part of the feature map is important for recognizing the current symbol. The visual attention Att_{\rm v} in (3) calculates a weight \alpha_t^i for each element {\boldsymbol{v}}_i \in \mathbb{R}^{C} in {\boldsymbol{V}} , and the context vector {\boldsymbol{a}}_t \in \mathbb{R}^C in (3) is the weighted sum of the elements in {\boldsymbol{V}} . {\boldsymbol{F}} in (3) is the coverage vector[1] that helps the attention mechanism focus more accurately. In summary, the processing of Att_{\rm v} can be formulated in (6) to (8):

    \begin{aligned} e_{t}^i = {\boldsymbol{U}}_{\rm e}^{\rm T} {\rm tanh}({\boldsymbol{U}}_{\rm h}\hat{{\boldsymbol{h}}}_t +{\boldsymbol{U}}_{\rm v}{\boldsymbol{v}}_i + {\boldsymbol{U}}_{\rm f}{\boldsymbol{f}}_t^i), \end{aligned} (6)
    \begin{aligned} \alpha_{t}^i = \frac{\exp(e_t^i)}{\displaystyle\sum\limits_{k=1}^{H\times W}\exp(e_{t}^k)}, \end{aligned} (7)
    \begin{aligned} {\boldsymbol{a}}_t = \displaystyle\sum\limits_{i=1}^{H \times W}\alpha_{t}^{i}{\boldsymbol{v}}_i, \end{aligned} (8)

    where {\boldsymbol{v}}_i \in \mathbb{R}^C is the i -th element in {\boldsymbol{V}} , \alpha_t^i is the corresponding weight for {\boldsymbol{v}}_i , {\boldsymbol{a}}_t is the output visual context vector generated by Att_{\rm v} and {\boldsymbol{f}}_t^i \in \mathbb{R}^{d_{\rm cov}} is the coverage vector for {\boldsymbol{v}}_i at each time step. {\boldsymbol{U}}_{\rm h} \in \mathbb{R}^{d_{\rm a} \times d_{\rm h}} , {\boldsymbol{U}}_{\rm v}\in \mathbb{R}^{d_{\rm a} \times C} , {\boldsymbol{U}}_{\rm f}\in \mathbb{R}^{d_{\rm a} \times d_{\rm cov}} and {\boldsymbol{U}}_{\rm e} \in \mathbb{R}^{d_{\rm a}} are the parameters to be learned in the attention ( d_{\rm a} and d_{\rm cov} are the dimensions of the hidden state in Att_{\rm v} and of the coverage vector, respectively). The calculation of the coverage vector {\boldsymbol{F}} for the whole feature map {\boldsymbol{V}} follows (9) and (10):

    \begin{aligned} {\boldsymbol{\beta}}_t = \displaystyle\sum\limits_{l=0}^{t-1} {\boldsymbol{\alpha}}_l, \end{aligned} (9)
    \begin{aligned} {\boldsymbol{F}} = Conv({\boldsymbol{\beta}}_t, {\boldsymbol{Q}}), \end{aligned} (10)

    where {\boldsymbol{\alpha}}_l denotes the attention weights generated at the l -th time step, Conv(\cdot) represents the convolution operation and {\boldsymbol{Q}} is the convolution kernel to be learned.
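
    The following NumPy sketch puts (6)-(10) together for a single time step. The shapes are toy values and the learned coverage convolution Conv(\cdot) is replaced by a simple per-location linear map of \beta_t , so this is only an illustration of the computation, not the model's implementation:

        import numpy as np

        def visual_attention(V, h_hat, alphas_past, U_h, U_v, U_f, U_e, Q):
            """Coverage-based visual attention, (6)-(10), for one decoding step.
            V: (H*W, C) flattened feature map; h_hat: (d_h,) hidden state of the
            first LSTM; alphas_past: (t, H*W) attention weights of earlier steps."""
            beta = alphas_past.sum(axis=0)              # (9): sum of past attention maps
            F = beta[:, None] * Q[None, :]              # (10): simplified coverage features (H*W, d_cov)
            e = np.tanh(h_hat @ U_h.T + V @ U_v.T + F @ U_f.T) @ U_e   # (6): energies (H*W,)
            alpha = np.exp(e - e.max())
            alpha /= alpha.sum()                        # (7): softmax over all spatial positions
            a = alpha @ V                               # (8): context vector (C,)
            return a, alpha

        rng = np.random.default_rng(0)
        HW, C, d_h, d_a, d_cov = 48, 64, 256, 128, 32   # toy sizes
        a_t, alpha_t = visual_attention(
            rng.normal(size=(HW, C)), rng.normal(size=d_h), rng.random(size=(3, HW)),
            rng.normal(size=(d_a, d_h)), rng.normal(size=(d_a, C)),
            rng.normal(size=(d_a, d_cov)), rng.normal(size=d_a), rng.normal(size=d_cov))
        print(a_t.shape, alpha_t.shape)                 # (64,) (48,)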

    The whole process of the MDA module consists of two steps: the multimodal information fusion (MIF) and the dependence attention (DA) process which utilizes the multimodal features generated by the MIF to learn the correspondences among symbols. In this subsection, we will introduce these two steps of MDA in detail.

    The meanings of different handwritten symbols depend on both the visual perceptual information and the linguistic semantic information. MIF merges the visual and semantic features automatically to give a better representation of the symbols in formulas. In MIF, we use the visual context vector {\boldsymbol{a}}_t generated by the visual attention mechanism Att_{\rm v} as the visual information and the character embeddings as the linguistic information. Here, we parameterize the MIF processing using a multi-layer perceptron as shown in (11):

    \begin{aligned}[b] {\boldsymbol{m}}_t &= {{MIF}}({\boldsymbol{a}}_t, {\boldsymbol{Ey}}_{t}) \\ &= \sigma({\boldsymbol{W}}_{\rm ma}{\boldsymbol{a}}_t + {\boldsymbol{W}}_{\rm me}({\boldsymbol{Ey}}_t) + {\boldsymbol{b}}_{\rm m}), \end{aligned} (11)

    where {\boldsymbol{W}}_{\rm ma} \in \mathbb{R}^{d_{\rm m} \times C} , {\boldsymbol{W}}_{\rm me} \in \mathbb{R}^{d_{\rm m} \times d_{\rm w}} and {\boldsymbol{b}}_{\rm m} \in \mathbb{R}^{d_{\rm m}} are parameters to be learned in the training process ( d_{\rm m} is the dimension of the multimodal feature {\boldsymbol{m}}_t ). \sigma is the Sigmoid function. With the multimodal features, our model can consider both the visual and the linguistic information at the same time.
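
    A minimal NumPy sketch of (11), with placeholder parameters and toy dimensions rather than the trained weights of the model:

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def mif(a_t, e_y_t, W_ma, W_me, b_m):
            """(11): fuse the visual context a_t (C,) and the character
            embedding E y_t (d_w,) into the multimodal feature m_t (d_m,)."""
            return sigmoid(W_ma @ a_t + W_me @ e_y_t + b_m)

        rng = np.random.default_rng(0)
        C, d_w, d_m = 64, 256, 256                      # toy sizes
        m_t = mif(rng.normal(size=C), rng.normal(size=d_w),
                  rng.normal(size=(d_m, C)), rng.normal(size=(d_m, d_w)), np.zeros(d_m))
        print(m_t.shape)                                # (256,)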

    It is worth noting that the visual feature {\boldsymbol{a}}_t used in the MIF process is an intermediate feature generated by Att_{\rm v} in the recognition process. Different from Xiu et al.[27], who applied pre-trained visual and word embeddings as the multimodal information, we do not need to pre-train the visual and word embeddings, which would demand extra training computation, time and data. Besides, compared with the work of Gu et al.[26], which used the feature extracted by VGGNet[25] directly at all the time steps, {\boldsymbol{a}}_t in our work focuses on different specific regions of the whole feature map at different time steps, which is more flexible and accurate.

    Learning the dependencies among symbols is important for recognizing formulas with long \text{\LaTeX} strings. Although the LSTM in the decoder has powerful sequence modeling ability and can capture context information, it suffers from the gradient vanishing and exposure bias problems[27, 29, 30]. Furthermore, it offers only a long path between long-range co-dependent symbols[20, 21]. To learn the long-range dependencies among symbols more efficiently and improve the recognition performance of formulas with long \text{\LaTeX} strings, we propose the DA process to find the dependencies between the symbol to be recognized at the current time step and the symbols of earlier time steps. Using the attention mechanism, we can shorten the length of the path between any two symbols in the same formula to one, which makes learning the dependencies among symbols easier. In the DA process, we use {\boldsymbol{a}}_t , A_t = \{{\boldsymbol{a}}_0, {\boldsymbol{a}}_1, \ldots, {\boldsymbol{a}}_{t-1}\} and M_t = \{{\boldsymbol{m}}_0, {\boldsymbol{m}}_1, \ldots, {\boldsymbol{m}}_{t-1}\} as the inputs. {\boldsymbol{a}}_t , A_t and M_t are the query vector, the key vector set and the value vector set in the DA, respectively. A_t and M_t are the sets of the visual context features and multimodal features of the earlier time steps, respectively. The DA process can be formulated in (12) to (15):

    \begin{aligned} r_t^i = \frac{{\boldsymbol{a}}_t {\boldsymbol{a}}_i ^ {\rm T}}{\sqrt{C}}, \end{aligned} (12)
    \begin{aligned} \gamma_t^i = \frac{\exp(r_t^i)}{\displaystyle\sum\limits_{j=0}^{t-1}\exp(r_t^j)}, \end{aligned} (13)
    \begin{aligned} {\boldsymbol{s}}_t = \sum_{i=0}^{t-1}\gamma_t^i {\boldsymbol{m}}_i, \end{aligned} (14)
    \begin{aligned} {\boldsymbol{c}}_t = \sigma({\boldsymbol{W}}_{\rm ca}{\boldsymbol{a}}_t + {\boldsymbol{W}}_{\rm cs}{\boldsymbol{s}}_t), \end{aligned} (15)

    where {\boldsymbol{a}}_i (0 \leqslant i \leqslant t-1) is the visual context vector at the i -th time step in A_t , and \gamma_t^i is the normalized similarity between {\boldsymbol{a}}_t and {\boldsymbol{a}}_i . {\boldsymbol{s}}_t is the context multimodal information vector, which models all the multimodal features of the earlier time steps that are related to the current symbol to be recognized. Finally, to facilitate the transmission of information more effectively, we employ a residual-like structure: a multi-layer perceptron integrates {\boldsymbol{a}}_t and {\boldsymbol{s}}_t into {\boldsymbol{c}}_t , which replaces the visual feature {\boldsymbol{a}}_t in (4). {\boldsymbol{W}}_{\rm ca} \in \mathbb{R}^{d_{\rm h} \times C} and {\boldsymbol{W}}_{\rm cs} \in \mathbb{R} ^ {d_{\rm h} \times d_{\rm m}} are the linear mapping parameters to be learned during training. \sigma is the Sigmoid function. Like the models in the previous work[26, 27], the proposed attention module can model the information from earlier time steps. However, the models in [26, 27] use CNNs, which have fixed receptive fields and require a fixed input sequence length to model the information from earlier time steps. For example, Xiu et al.[27] fixed the input sequence length to 24 in their model. In contrast, we apply an attention mechanism that imposes no limit on the input sequence length, and the receptive field of DA covers the whole sequence of the earlier symbols.
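
    The DA computation in (12)-(15) can be sketched in a few lines of NumPy; here A_t and M_t are stacked as arrays, and the parameters and sizes are placeholders rather than trained values:

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def dependence_attention(a_t, A_t, M_t, W_ca, W_cs):
            """(12)-(15): a_t (C,) is the query, A_t (t, C) the keys (earlier
            visual contexts), M_t (t, d_m) the values (earlier multimodal features)."""
            r = A_t @ a_t / np.sqrt(a_t.shape[0])       # (12): scaled dot-product scores (t,)
            gamma = np.exp(r - r.max())
            gamma /= gamma.sum()                        # (13): softmax over earlier time steps
            s_t = gamma @ M_t                           # (14): multimodal context vector (d_m,)
            c_t = sigmoid(W_ca @ a_t + W_cs @ s_t)      # (15): fused dependence vector (d_h,)
            return c_t, gamma

        rng = np.random.default_rng(0)
        t, C, d_m, d_h = 5, 64, 256, 256                # toy sizes
        c_t, gamma_t = dependence_attention(
            rng.normal(size=C), rng.normal(size=(t, C)), rng.normal(size=(t, d_m)),
            rng.normal(size=(d_h, C)), rng.normal(size=(d_h, d_m)))
        print(c_t.shape, gamma_t.shape)                 # (256,) (5,)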

    It is worth noticing that the computation of the DA process is inspired by the Scaled Dot-Product Attention in Self-Attention[31]. However, the DA process is very different from Self-Attention. The Self-Attention in [31] is designed to learn the dependencies among all the positions in the same input vector, and the query, key and value vectors used in Self-Attention are computed from the same input vector. In contrast, the query vector {\boldsymbol{a}}_t , the key vectors A_t = \{{\boldsymbol{a}}_0, {\boldsymbol{a}}_1, \ldots, {\boldsymbol{a}}_{t-1}\} and the value vectors M_t = \{{\boldsymbol{m}}_0, {\boldsymbol{m}}_1, \ldots, {\boldsymbol{m}}_{t-1}\} used in DA are not computed from the same vector. Besides the difference in inputs, the goal of DA is also different from that of Self-Attention. The goal of DA is to learn the dependencies between the symbol at the current time step and the symbols of earlier time steps, and then to model the multimodal context information of the earlier symbols that are relevant to the present symbol to be recognized. By learning the dependencies among symbols effectively and modeling the multimodal context information, which are the key contributions of the DA process, our model can improve the recognition performance for formulas with long \text{\LaTeX} strings.

    With the baseline model and the MDA module introduced in the sections above, the proposed recognition model equipped with MDA can be formulated in (16) to (19):

    \begin{aligned} {\boldsymbol{m}}_{t-1} = {{MIF}}({\boldsymbol{a}}_{t-1}, {\boldsymbol{Ey}}_{t-1}), \end{aligned} (16)
    \begin{aligned} \hat{{\boldsymbol{h}}}_t = L_{\rm rnn}^{(1)}({\boldsymbol{m}}_{t-1}, {\boldsymbol{h}}_{t-1}), \end{aligned} (17)
    \begin{aligned} {\boldsymbol{c}}_t = {{DA}}({\boldsymbol{a}}_t, A_t, M_t), \end{aligned} (18)
    \begin{aligned} {\boldsymbol{h}}_t = L_{\rm rnn}^{(2)}({\boldsymbol{c}}_t, \hat{{\boldsymbol{h}}}_t). \end{aligned} (19)

    The model first uses the DenseNet to extract the feature map {\boldsymbol{V}} of the input image by (1). Then, at each time step, the MIF process formulated in (11) merges the visual and semantic information of the symbol recognized at the previous time step. Subsequently, the multimodal feature {\boldsymbol{m}}_{t-1} is used as the input of the first LSTM to generate the hidden state \hat{{\boldsymbol{h}}}_t . With {\boldsymbol{V}} and \hat{{\boldsymbol{h}}}_t , we can get the context vector {\boldsymbol{a}}_t by the calculation of {Att}_{\rm v} . Then the DA process, which follows (12) to (15), models the dependencies between the current symbol and the symbols of earlier time steps and generates the dependence vector {\boldsymbol{c}}_t . With {\boldsymbol{c}}_t and \hat{{\boldsymbol{h}}}_t , we generate the hidden state {\boldsymbol{h}}_t of the second LSTM. Finally, the recognition result for the current symbol is calculated by (5).

    The DenseNet used in our models shares the same architecture as that in [10], which has been shown to be quite effective in feature extraction for HMER. For details of the DenseNet, refer to [10]. The dimensions of the hidden states in the LSTMs and the dimension of the symbol embedding are set to 256. The dimension of the visual attention mechanism is set to 512, and the size of the convolution kernel used to compute the coverage vector is 11\times 11 . The dimension of the multimodal vectors used in MDA is set to 256 in our models.
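
    For convenience, the hyper-parameters listed above can be summarized in one place; the key names below are our own shorthand, only the values come from the text:

        # Hyper-parameters reported above; the key names are our own shorthand.
        MODEL_CONFIG = {
            "lstm_hidden_dim": 256,        # d_h of both LSTM layers
            "symbol_embedding_dim": 256,   # d_w
            "visual_attention_dim": 512,   # d_a of Att_v
            "coverage_kernel_size": (11, 11),
            "multimodal_dim": 256,         # d_m used in MDA
        }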

    To evaluate the effectiveness of our models, we conduct experiments on CROHME 2014, CROHME 2016 and the proposed HFID. As the \text{\LaTeX} strings are symbol sequences, following the previous work on handwritten formula recognition[1, 10-13, 22], the expression recognition rate (ExpRate) and the word error rate (WER) are used as the evaluation metrics in our experiments. WER is computed by:

    \begin{aligned} WER = \frac{N_{\rm sub} + N_{\rm del} + N_{\rm ins}}{N_{\rm groundtruth}},\nonumber \end{aligned}

    where N_{\rm sub} , N_{\rm del} and N_{\rm ins} are the numbers of substitution errors, deletion errors and insertion errors, respectively. N_{\rm groundtruth} is the number of symbols in the ground-truth \text{\LaTeX} string. The substitution, deletion and insertion errors are the three types of errors in handwritten formula recognition. For example, if the ground-truth \text{\LaTeX} string is “1 + 1” and the recognition result is “1 + l”, we call the error a substitution error. If the recognition result is “+ 1”, then we call it a deletion error. If the recognition result is “1 + + 1”, we call it an insertion error. WER reflects the recognition performance at the symbol level while ExpRate reflects the recognition performance at the formula level. In our experiments, we use the WER to control the training process, which will be introduced in detail in the subsection on training configurations, and we utilize the ExpRate to evaluate the performance of our models. Following the existing work[1-3, 10, 15, 16], the ExpRate reported in our experiments is calculated on the label graph[32], which also considers the alignment accuracy. Using the official tools offered by CROHME, we convert the recognized \text{\LaTeX} strings to label graphs and evaluate the ExpRate in our experiments.
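
    For illustration, WER over tokenized \text{\LaTeX} strings can be computed with a standard edit-distance routine such as the sketch below; this is only an example of the metric itself, not the official CROHME evaluation tool mentioned above:

        def wer(hypothesis, reference):
            """Word error rate over symbol sequences: (substitutions + deletions +
            insertions) / len(reference), computed with Levenshtein distance."""
            n, m = len(reference), len(hypothesis)
            # dp[i][j]: edit distance between reference[:i] and hypothesis[:j].
            dp = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(n + 1):
                dp[i][0] = i
            for j in range(m + 1):
                dp[0][j] = j
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                    dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                                   dp[i][j - 1] + 1,        # insertion
                                   dp[i - 1][j - 1] + cost) # substitution / match
            return dp[n][m] / max(n, 1)

        print(wer("1 + l".split(), "1 + 1".split()))   # one substitution -> 0.333...
        print(wer("+ 1".split(), "1 + 1".split()))     # one deletion     -> 0.333...
        print(wer("1 + + 1".split(), "1 + 1".split())) # one insertion    -> 0.333...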

    Our experiments are conducted with the TensorFlow framework[33] using an NVIDIA GTX 1080Ti GPU. We utilize the AdaDelta algorithm[34] as the optimizer and the cross-entropy as the loss function during training. To avoid exploding gradients, gradient clipping is applied during training. Weight decay[35] is also applied to regularize our model and is set to 1.0\times10^{-3} in our experiments. We also apply a warm-up training strategy: the initial learning rate is set to 1.0\times10^{-5} and the learning rate increases by 1.0\times10^{-5} after each training step. We train our model for 1.0\times10^5 training steps in the warm-up stage. After the warm-up training, we keep the learning rate fixed and reduce it to 10% of its value if the WER stays higher than the current lowest WER for 15 consecutive epochs. We stop training before the fifth decay would happen. Furthermore, the beam search algorithm[36] is utilized at test time and we set the beam size to 10 in our experiments.
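
    The warm-up and WER-driven decay schedule described above can be viewed as a simple controller; the class below is an illustrative stand-in for our training script, with the constants taken from the text:

        class LrScheduler:
            """Warm-up then WER-driven decay, following the training setup above:
            the learning rate grows by 1e-5 per step for the first 1e5 steps, is
            then held fixed, and is multiplied by 0.1 whenever the validation WER
            has not improved for 15 consecutive epochs; training stops before the
            fifth decay would take effect."""

            def __init__(self, warmup_steps=100_000, step_increment=1e-5,
                         patience=15, decay_factor=0.1, max_decays=4):
                self.lr = 1e-5
                self.warmup_steps = warmup_steps
                self.step_increment = step_increment
                self.patience = patience
                self.decay_factor = decay_factor
                self.max_decays = max_decays
                self.steps = self.bad_epochs = self.decays = 0
                self.best_wer = float("inf")

            def on_step(self):
                self.steps += 1
                if self.steps <= self.warmup_steps:
                    self.lr += self.step_increment

            def on_epoch_end(self, val_wer):
                """Returns False when training should stop."""
                if val_wer < self.best_wer:
                    self.best_wer, self.bad_epochs = val_wer, 0
                else:
                    self.bad_epochs += 1
                if self.bad_epochs >= self.patience:
                    self.decays += 1
                    if self.decays > self.max_decays:
                        return False      # stop before the fifth decay happens
                    self.lr *= self.decay_factor
                    self.bad_epochs = 0
                return True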

    In order to figure out whether HFID alleviates overfitting and helps improve the recognition performance, we conduct a series of experiments using the baseline model on the CROHME 2014, CROHME 2016 and CROHME 2019 datasets. The recognition performances on the test sets of these three datasets are shown in Table 3. In these experiments, the validation set of each dataset and the WER evaluation metric are employed to select the best model during training. The rows named “HFID + CROHME 2014/2016/2019” in Table 3 mean that we use the HFID dataset to pretrain the model and finetune the model using the training data of CROHME. It is worth noticing that warm-up is not utilized in the finetuning process and the initial learning rate of the finetuning is set to 0.1. As shown in Table 3, there is an impressive improvement in the recognition performance after using HFID as the training dataset in the pretraining process. The ExpRate rises by 10.92%, 9.52% and 6.51% on the test sets of CROHME 2014, CROHME 2016 and CROHME 2019, respectively, which indicates the effectiveness of HFID. To give further insight into the effectiveness of HFID, we also list the results of models that are pretrained on synthetic data in Table 3. The synthetic data is generated by the methods in [11]. In Table 3, the rows named “Synthetic + CROHME 2014/2016/2019” mean that we use the synthetic data to pretrain the model and finetune the model using the training set of CROHME. As shown in Table 3, there is a 4.53%, 0.52% and 3.52% ExpRate increase after applying the synthetic data to pretrain the models on the test sets of CROHME 2014, 2016 and 2019, respectively. Although these increases indicate the effectiveness of the synthetic data generation methods[11], there still exist gaps between the ExpRates of “Synthetic + CROHME 2014/2016/2019” and “HFID + CROHME 2014/2016/2019”. These gaps, 6.39%, 9.00% and 3.09% on the test sets of CROHME 2014, 2016 and 2019, respectively, indicate that the samples collected from real life in HFID are more effective than the synthetic samples.

    Table  3.  Recognition Performances of the Baseline Model Trained with Different Training Data on CROHME 2014, CROHME 2016 and CROHME 2019
    Dataset Number of Samples Used in Training ExpRate (%)
    CROHME 2014 8810 47.70
    Synthetic + CROHME 2014 31842 (23032+8810) 52.23
    HFID + CROHME 2014 31842 (23032+8810) 58.62
    CROHME 2016 8810 50.83
    Synthetic + CROHME 2016 31842 (23032+8810) 51.35
    HFID + CROHME 2016 31842 (23032+8810) 60.35
    CROHME 2019 9993 51.29
    Synthetic + CROHME 2019 33025 (23032+9993) 54.71
    HFID + CROHME 2019 33025 (23032+9993) 57.80
    Note: “(23032+8810/9993)” indicates that there are 23032 and 8810/9993 training samples used in the pretraining and the finetuning process, respectively.

    To figure out whether HFID can alleviate overfitting, we list the average cross-entropy loss values on the training sets and the test sets of CROHME 2014, 2016 and 2019 in Table 4. “HFID + CROHME 2014/2016/2019” in Table 4 also means that the model is first pretrained on HFID and then finetuned on the CROHME dataset. As shown in Table 4, when the model is trained only with the training set of CROHME, the gaps between the average training loss and the average test loss are relatively large, which indicates that severe overfitting has occurred. In contrast, when the model is trained using the training samples of HFID and CROHME, the gaps between the average training loss and the average test loss are nearly 50% smaller than those obtained when the model is trained using CROHME only. From these results, we can conclude that HFID helps the model escape from the local optimum reached using only the CROHME dataset and alleviates overfitting. It is worth noticing that CROHME 2014 and CROHME 2016 share the same training set and validation set, and we only train one model for CROHME 2014 and CROHME 2016 in the experiments. Therefore, the average training loss values of “CROHME 2014” and “CROHME 2016” are the same in Table 4. For the same reason, the average training loss values of “HFID + CROHME 2014” and “HFID + CROHME 2016” are the same.

    Table  4.  Average Loss Values on the Training Set and the Test Set of Different Datasets
    Dataset Average Training Loss Average Test Loss Loss Gap
    CROHME 2014 0.75 8.20 7.45
    CROHME 2016 0.75 8.31 7.56
    CROHME 2019 0.82 8.81 7.99
    HFID + CROHME 2014 4.50 8.64 4.14
    HFID + CROHME 2016 4.50 8.61 4.11
    HFID + CROHME 2019 4.31 9.42 5.11
    Note: Lower loss gap between the average training loss and the average test loss stands for less overfitting.

    Table 5 gives a comparison of the ExpRate of the baseline model and the model equipped with MDA on the test sets of CROHME 2014, CROHME 2016, CROHME 2019 and HFID. In Table 5, “HFID + CROHME 2014/2016/2019” has the same meaning as in Table 3 and Table 4. As shown in Table 5, the recognition performances of the model equipped with MDA are always better than those of the model without MDA on all the test sets, which indicates the effectiveness of MDA. Besides, “Baseline + MDA” gets better results on the test sets of CROHME 2014, CROHME 2016 and CROHME 2019 than “Baseline” no matter whether HFID is used to pretrain the model or not. Although the improvement brought by MDA becomes smaller as the amount of training data increases, MDA improves the recognition performances regardless of whether the training data is sufficient. In our opinion, the reason for this phenomenon is that adding HFID as the training set in the pretraining process already improves the recognition performance by a large margin, and it is more difficult for MDA to further improve the accuracy on the basis of the “HFID + CROHME 2014/2016/2019” results.

    Table  5.  ExpRate (%) Comparison Results of the Baseline Model and Baseline + MDA Model on the Test Set of Different Datasets
    Dataset Baseline Baseline + MDA
    CROHME 2014 47.77 50.81
    CROHME 2016 50.83 53.71
    CROHME 2019 51.29 53.21
    HFID + CROHME 2014 58.62 59.94
    HFID + CROHME 2016 60.35 62.70
    HFID + CROHME 2019 57.80 59.38
    HFID 59.12 60.16

    To give a further exploration of the effects of the MDA module, in Fig.3, we visualize the MDA attention maps of some samples in CROHME 2016. As shown in Fig.3, each row in the attention maps represents the dependencies between the features of the symbol at the current time step and the features of the symbols of earlier time steps. For example, the sixth row in Fig.3(a) illustrates the dependencies between the features of the sixth symbol (“ 2 ”) and the features of the previous six symbols (“ <s> ”, “ \frac ”, “{”, “ 1 ”, “}” and “{”).

    Figure  3.  MDA attention maps of (a) the “UN_117_em_359_0” sample and (b) the “UN_128_em_1019_0” sample in the test set of CROHME 2016. The symbols on the horizontal axis stand for the recognition results while each element on the vertical axis represents the symbol to be recognized at each time step. Each row in the two subfigures visualizes the dependencies between the current symbol (symbol on the vertical axis) and the symbols from earlier time steps (symbols on the horizontal axis). “ <s> ” is the start token, whose corresponding multimodal feature {\boldsymbol{m}}_0 is initialized to zero.

    As shown in Fig.3(a), when the second “ 1 ” (the 11th row), the second “ 2 ” (the 16th row) and the second “ l ” (the 14th row) are to be recognized, the highest weight is assigned by the MDA module to the first “ 1 ”, the first “ 2 ” and the first “ l ”, respectively. The same phenomenon can also be observed in the third row (the second “ a ”), the 9th row (the third “ a ”) and the 14th row (the second “ + ”) in Fig.3(b). From these visualization results, we can infer that the MDA module can focus on the symbols of the same class as the current symbol to be recognized.

    In addition to the visual association between the current symbol and the earlier symbols, MDA can also learn the semantic relevance between symbols. For example, as shown in the 6th row of Fig.3(a), the most relevant symbols to the current symbol “ 2 ” are “ \frac ” and “ 1 ” since they come from the same subexpression “ \frac { 1 } { 2 } ” in this formula. Besides, as shown in the 12th row of Fig.3(a), the “ ( ” in “ (l+1) ” gains the highest weight when recognizing the “ ) ” in “ (l+1) ”, which implies that MDA has learned the semantic relevance between “ ( ” and “ ) ” in the same subexpression. Furthermore, the 10th row (“ + ” and “ l ”) in Fig.3(a), the 6th row (“ 0 ” and the first “ _ ”) in Fig.3(b), the 12th row (“ 1 ” and the second “ _ ”) in Fig.3(b), etc., also show the semantic relevance between symbols. From the phenomena shown in Fig.3, we can infer that, at each time step of the recognition process, MDA is likely to attend to the symbols that have visual and semantic associations with the current symbol to be recognized, owing to the input multimodal symbol features.

    Furthermore, Fig.4 shows the ExpRates of the models with and without MDA for formulas in different \text{\LaTeX} string length ranges in the CROHME 2016 test set, to reveal the effects of MDA further. As shown in Fig.4, the ExpRates of both models gradually decrease as the \text{\LaTeX} string length grows, since formulas with longer \text{\LaTeX} strings usually contain more symbols and more complex structures, which makes them more difficult to recognize correctly. But the ExpRates of the model with MDA are higher than those of the baseline model in all the \text{\LaTeX} string length ranges. It is worth noticing that the gaps between the ExpRates of the two models are negligible when the length of the \text{\LaTeX} string is in the ranges of 11–15 and 16–20, and become large again when the length grows beyond 20. This trend suggests that MDA can help improve the recognition of formulas with long \text{\LaTeX} strings, which matches our original design intention. From the results of Fig.3 and Fig.4, we can infer that by learning the visual and semantic dependencies between symbols, MDA helps improve the recognition performance for formulas with long \text{\LaTeX} strings.

    Figure  4.  ExpRates of different models for the formulas with different \text{\LaTeX} string lengths in the test set of CROHME 2016.

    Table 6 shows the ablation experiment results for the MDA module. As the MDA module consists of the MIF step and the DA step, in the ablation experiments we append each step to the baseline model to verify its effectiveness. In Table 6, the row “Baseline” shows the results of our baseline model. The “Baseline + MIF” row and the “Baseline + DA” row represent the results of the baseline model equipped with the MIF step and with the DA step, respectively. The “Baseline + MDA” row shows the results of the model equipped with the whole MDA. Since the DA step uses the multimodal feature {\boldsymbol{m}}_i generated in MIF, in the experiments of the “Baseline + DA” row, we replace “ {\boldsymbol{m}}_i ” in (14) with “ {\boldsymbol{a}}_i ”. As shown in Table 6, adding either the MIF step or the DA step alone can improve the recognition performance on CROHME 2014, CROHME 2016 and CROHME 2019, demonstrating the effectiveness of MIF and DA. Furthermore, as shown in Table 6, “Baseline + MDA” achieves better performances than “Baseline”, “Baseline + MIF” and “Baseline + DA”. From the results in Table 6, we can conclude that the MIF step and the DA step are both effective, whether they are applied independently or collaboratively.

    Table  6.  Results of Ablation Experiments for MDA on CROHME 2014, CROHME 2016 and CROHME 2019
    System ExpRate (%)
    CROHME 2014 CROHME 2016 CROHME 2019
    Baseline 58.62 60.35 57.80
    Baseline+MIF 59.22 60.96 58.72
    Baseline+DA 58.72 62.35 58.05
    Baseline+MDA 59.94 62.70 59.38

    The results of our model and the existing systems on the test set of CROHME 2014 are listed in Table 7. Systems I to VII are participating systems in CROHME 2014 while the other systems are recent attention-based encoder-decoder models. It is worth noticing that systems I to VII use the online formula data, which contains more information (i.e., the stroke sequence information) than the offline formula images, as the input of their systems. It is more difficult for models that use only the offline image information to recognize formulas correctly than for those that use the online data. Besides, system III utilizes a large amount of private data to train their model.

    Table  7.  ExpRate (%) Results of Different Systems on CROHME 2014
    System ExpRate(%) \leqslant 1(%) \leqslant 2(%) \leqslant 3(%)
    I[15] 37.22 44.22 47.26 52.20
    II[15] 15.01 22.31 26.57 27.69
    III[15] 62.68 72.31 75.15 76.88
    IV[15] 18.97 28.19 32.35 33.37
    V[15] 18.97 26.37 30.83 32.96
    VI[15] 25.66 33.16 35.90 37.32
    VII[15] 26.06 33.87 38.54 39.96
    Deng et al. 2017[9] 39.96
    Zhang et al. 2017*[1] 44.40 58.40 62.20 63.10
    Wu et al. 2018[2] 47.06
    Zhang et al. 2018*[10] 52.80 68.10 72.00 72.70
    Le and Nakagawa 2019[11] 48.78 63.39 70.18 73.83
    Wu et al. 2020*[3] 54.87 70.69 75.76 79.01
    Li et al. 2020*[12] 60.45 73.43 77.69 80.12
    Ours 59.94 76.06 80.83 82.35
    Ours* 63.79 78.60 83.16 84.99
    Note: The best results are highlighted in bold.

    In Table 7, the columns “ \leqslant 1(%)”, “ \leqslant 2(%)” and “ \leqslant 3(%)” show the ExpRate when one, two and three errors can be tolerated, respectively. “*” in Table 7 marks the results of the systems that use the ensemble method. In Table 7, our model is equipped with the proposed MDA module, pretrained on HFID and finetuned on CROHME 2014. As shown in Table 7, our results outperform the recent attention-based encoder-decoder models for offline handwritten formula recognition by a large margin. Even considering the online systems, we achieve the state-of-the-art results on CROHME 2014.

    The results of our model and the existing systems on the test set of CROHME 2016 are listed in Table 8. The first five systems in Table 8 are participating systems in CROHME 2016 that use the online data as input, while the other five models are recent attention-based encoder-decoder models for offline handwritten formula recognition. Although the MyScript system obtains the best ExpRate on CROHME 2016, it is trained with a large amount of private online data, which contains more information than the offline formula images used in our models. Moreover, in terms of the “≤1 (%)”, “≤2 (%)” and “≤3 (%)” results, our model is still significantly better than MyScript. When only the offline systems are considered, similar to Table 7, our model outperforms the recent attention-based encoder-decoder models by a large margin, and we achieve the state-of-the-art performance for offline handwritten formula recognition on CROHME 2016. We omit the comparisons with the participating teams of CROHME 2019[17] because the detailed configurations of the participating models are not available and some of them use extra recognition software to obtain better results.

    Table 8. ExpRate (%) Results of Different Systems on CROHME 2016

    System                       ExpRate (%)    ≤1 (%)    ≤2 (%)    ≤3 (%)
    MyScript[16]                 67.65          75.59     79.86     –
    Wiris[16]                    49.61          60.42     64.69     –
    Tokyo[16]                    43.94          50.91     53.70     –
    Sao Paolo[16]                33.39          43.50     49.17     –
    Nantes[16]                   13.34          21.02     28.33     –
    Zhang et al. 2017*[1]        44.55          57.10     61.55     62.34
    Zhang et al. 2018*[10]       50.10          63.80     67.40     68.50
    Le and Nakagawa 2019[11]     45.60          59.29     65.65     69.66
    Wu et al. 2020*[3]           57.89          70.44     76.29     79.16
    Li et al. 2020*[12]          58.06          71.67     75.59     77.59
    Ours                         62.70          76.42     81.74     83.58
    Ours*                        65.24          78.95     83.67     85.85
    Note: The best results are highlighted in bold. “–” indicates that the result is not reported.

    To further illustrate the effectiveness of the proposed MDA module, we also list in Table 9 the results of our model trained only with CROHME, and compare them with the recent mainstream offline systems. For a fair comparison, we omit the results of [11] and [12] in Table 9, since both works concentrate on data augmentation and train their models with extra synthetic training data. Similar to Table 7 and Table 8, “*” marks the systems that use the ensemble method. As shown in Table 9, a single model equipped with the MDA module achieves 50.81% and 53.71% ExpRate on CROHME 2014 and CROHME 2016, respectively, which is higher than the single-model results of [2] and [3]. With the ensemble, we achieve 56.28% and 58.34% ExpRate on CROHME 2014 and CROHME 2016, respectively, which also outperforms the mainstream offline systems. Table 9 shows that even without the HFID pretraining stage, the MDA module helps our model achieve outstanding recognition results, demonstrating the effectiveness of MDA.

    Table 9. Recognition Results on CROHME 2014 and CROHME 2016 of Different Offline Systems Not Trained with Extra Data

    System                       ExpRate (%)
                                 CROHME 2014    CROHME 2016
    Zhang et al. 2017*[1]        46.55          44.55
    Wu et al. 2018[2]            39.66          –
    Wu et al. 2018*[2]           47.06          –
    Zhang et al. 2018*[10]       52.80          50.10
    Wu et al. 2020[3]            48.88          49.61
    Wu et al. 2020*[3]           54.87          57.89
    Ours                         50.81          53.71
    Ours*                        56.28          58.34
    Note: The best results are highlighted in bold. “–” indicates that the result is not reported.

    In Table 10, we compare the results of our model with those of existing methods on the HFID test set. Since the state-of-the-art methods other than [1] and [10] provide no public implementation, we compare our methods with [1] and [10], which released their source code publicly. In these experiments, we use the official code offered by [1] and [10] and train their models on HFID. As shown in Table 10, the method in [1] achieves only 36.48% accuracy due to its relatively simple network architecture. The method in [10] achieves 58.83% on the HFID test set, which is only slightly lower than our baseline model (59.12%). On the HFID test set, our model equipped with MDA produces the best results, with an accuracy of 60.16% and a WER of 5.15%.

    Table 10. Recognition Results of Different Methods on HFID

    Method                       WER (%)    ExpRate (%)
    Zhang et al. 2017[1]         26.01      36.48
    Zhang et al. 2018[10]        5.92       58.83
    Ours (Baseline)              5.49       59.12
    Ours (Baseline + MDA)        5.15       60.16
    Note: The best results are highlighted in bold.
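    The WER reported in Table 10 can be read as the total number of token-level edit operations (substitutions, insertions and deletions) divided by the total number of ground-truth tokens over the test set. A minimal sketch, assuming whitespace-tokenized \text{\LaTeX} strings and a standard Levenshtein distance, is given below; the exact evaluation script may differ.

```python
def token_edit_distance(pred, gt):
    """Levenshtein distance between two token lists (substitution,
    insertion and deletion all cost 1), using a rolling DP row."""
    dp = list(range(len(gt) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, g in enumerate(gt, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete p
                        dp[j - 1] + 1,      # insert g
                        prev + (p != g))    # substitute / match
            prev = cur
    return dp[-1]

def corpus_wer(preds, gts):
    """Corpus-level WER: total edit operations over total ground-truth tokens."""
    edits = sum(token_edit_distance(p.split(), g.split())
                for p, g in zip(preds, gts))
    tokens = sum(len(g.split()) for g in gts)
    return 100.0 * edits / tokens

# Toy example: one substitution among eight ground-truth tokens -> 12.5% WER.
print(corpus_wer(["x ^ { 2 }", "\\sqrt { y }"],
                 ["x ^ { 2 }", "\\sqrt { z }"]))
```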

    Attending to the right parts of the feature maps is the foundation of MDA and of the symbol prediction at each time step. At each time step, the visual attention {Att}_{\rm v} produces an attention map that locates the symbol currently being recognized, and provides the decoder with a context vector formed from the attention map and the feature maps of the input image. To illustrate the recognition process of our model, we visualize the attention maps generated by {Att}_{\rm v} during inference in Fig.5. From Fig.5, we can see that the locations generated for entity symbols such as “ \backslash frac”, “-” and “ \backslash sqrt” correspond strongly to human intuition. These correct locations allow our model to extract the accurate visual features used in MDA and the decoder to predict the current symbol. Besides the general entity symbols, the \text{\LaTeX} strings of formulas also contain many virtual symbols such as “_”, “^”, “{” and “}”. As shown in Fig.5, our model focuses on background areas of the image when encountering these virtual symbols. More interestingly, we find that the locations generated for the virtual symbols carry semantic meaning to some extent. For example, the 16th subfigure, which corresponds to the “{” of the subexpression “{ 5 }”, attends to the left border of “5”, while the 18th subfigure, which corresponds to the “}” of the same subexpression, focuses on the right border of “5”. Besides, the virtual symbols in the outer layers of a subexpression tend to be assigned larger areas than those in the inner layers. For example, the last two subfigures of Fig.5 correspond to the penultimate “}” and the last “}”, respectively. The penultimate subfigure focuses only on the right border of “4” since it corresponds to the “}” of “{ 4 }”, while the last subfigure attends to a larger area on the right border of “ \frac 1 4 ” since it corresponds to the last “}” of “^ { \backslash frac { 1 } { 4 } }”.

    Figure 5. Attention visualization of a test sample in CROHME 2016 whose \text{\LaTeX} string is “- ( \backslash frac { 5 + \backslash sqrt { 5 } } { 5 - \backslash sqrt { 5 } } ) ^ { \backslash frac { 1 } { 4 } }”. Attention weights are visualized in red, and darker red denotes a higher attention weight.
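    Visualizations such as Fig.5 can be produced by upsampling the attention map of each time step to the input image resolution and blending it over the formula image in red. The sketch below shows one way to do this with NumPy and matplotlib; the image size, the attention-map size and the helper name overlay_attention are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_attention(image, attn, out_path):
    """Blend a low-resolution attention map over a grayscale formula image.

    image: 2D numpy array (H, W), grayscale formula image in [0, 1].
    attn:  2D numpy array (H', W') of attention weights for one time step.
    """
    h, w = image.shape
    # Nearest-neighbour upsampling of the attention map to the image size.
    rows = (np.arange(h) * attn.shape[0] // h).clip(0, attn.shape[0] - 1)
    cols = (np.arange(w) * attn.shape[1] // w).clip(0, attn.shape[1] - 1)
    attn_up = attn[np.ix_(rows, cols)]
    attn_up = attn_up / (attn_up.max() + 1e-8)   # normalize for display

    plt.figure(figsize=(4, 2))
    plt.imshow(image, cmap="gray")
    plt.imshow(attn_up, cmap="Reds", alpha=0.5)  # darker red = higher weight
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()

# Toy usage with random stand-ins for the image and one attention map.
rng = np.random.default_rng(0)
overlay_attention(rng.random((64, 256)), rng.random((8, 32)), "attn_step_0.png")
```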

    Although the proposed MDA module and the HFID dataset improve handwritten formula recognition performance by a large margin, the results of our model still leave considerable room for improvement. To analyze the remaining errors, we list a few correctly and wrongly recognized handwritten formula samples in Fig.6.

    Figure 6. Examples of recognition results for the handwritten formula samples in the test set of CROHME. The green texts represent the parts that are correctly recognized. The red texts without double strikethroughs are the substitution errors, while the red texts with double strikethroughs are the deletion errors.

    The samples in Fig.6(a) and Fig.6(b) are correctly recognized by our model, which shows that our model is effective in dealing with complex two-dimensional structures and various handwriting styles. In contrast, the samples in Fig.6(c), Fig.6(d) and Fig.6(e) are recognized incorrectly. From these three samples, we can infer that our model may predict wrong results when symbols share nearly the same glyph. Such symbols, including letters whose uppercase and lowercase forms are similar and similarly shaped symbols such as “ \gamma ” and “r” or “9” and “g”, often have very similar handwritten appearances. For example, the wrongly recognized “j” in Fig.6(d) looks almost identical to “i” because it is written too narrowly. Besides, when symbols are written too close to each other or too small, the model may miss them. For example, the “,” in Fig.6(d) is missed because it is too close to the “ \backslash cdots” symbol and is written at a small scale.

    In this paper, we proposed the MDA (multimodal dependence attention) module and a new dataset, HFID, to improve the performance of handwritten formula recognition. The MDA module, which utilizes the multimodal features of symbols at each time step, learns the visual and semantic dependencies among the symbols in the same formula to improve the recognition of formulas with long \text{\LaTeX} strings. The HFID dataset, which contains 26520 handwritten formula images collected from real life, is, to the best of our knowledge, the largest dataset for handwritten formula recognition, and offers sufficient training data to help researchers break through the bottleneck of overfitting. With the proposed HFID dataset and the MDA module, we achieved state-of-the-art performance for offline handwritten formula recognition, with 63.79% and 65.24% expression accuracy on CROHME 2014 and CROHME 2016, respectively. The extensive experimental results and analyses demonstrate the effectiveness of the proposed HFID dataset and MDA module. Furthermore, the Transformer is an outstanding encoder-decoder architecture that has proven effective on various tasks, and applying it to offline handwritten formula recognition will be our future work.

    [1] Zhang J S, Du J, Zhang S L, Liu D, Hu Y L, Hu J S, Wei S, Dai L R. Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition. Pattern Recognition, 2017, 71: 196–206. DOI: 10.1016/j.patcog.2017.06.017.
    [2] Wu J W, Yin F, Zhang Y M, Zhang X Y, Liu C L. Image-to-markup generation via paired adversarial learning. In Proc. the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Sept. 2018, pp.18–34. DOI: 10.1007/978-3-030-10925-7_2.
    [3] Wu J W, Yin F, Zhang Y M, Zhang X Y, Liu C L. Handwritten mathematical expression recognition via paired adversarial learning. Int. J. Comput. Vision, 2020, 128(10): 2386–2401. DOI: 10.1007/s11263-020-01291-5.
    [4] Anderson R H. Syntax-directed recognition of hand-printed two-dimensional mathematics. In Proc. the Association for Computing Machinery Inc. Symposium, Aug. 1967, pp.436–459. DOI: 10.1145/2402536.2402585.
    [5] Hu L, Zanibbi R. Segmenting handwritten math symbols using AdaBoost and multi-scale shape context features. In Proc. the 12th International Conference on Document Analysis and Recognition, Aug. 2013, pp.1180–1184. DOI: 10.1109/ICDAR.2013.239.
    [6] Álvaro F, Sánchez J A, Benedí J M. Offline features for classifying handwritten math symbols with recurrent neural networks. In Proc. the 22nd International Conference on Pattern Recognition, Aug. 2014, pp.2944–2949. DOI: 10.1109/ICPR.2014.507.
    [7] Awal A M, Mouchère H, Viard-Gaudin C. A global learning approach for an online handwritten mathematical expression recognition system. Pattern Recognit. Lett., 2014, 35: 68–77. DOI: 10.1016/j.patrec.2012.10.024.
    [8] Álvaro F, Sánchez J A, Benedí J M. An integrated grammar-based approach for mathematical expression recognition. Pattern Recognit., 2016, 51: 135–147. DOI: 10.1016/j.patcog.2015.09.013.
    [9] Deng Y T, Kanervisto A, Ling J, Rush A M. Image-to-markup generation with coarse-to-fine attention. In Proc. the 34th International Conference on Machine Learning, Aug. 2017, pp.980–989.
    [10] Zhang J S, Du J, Dai L R. Multi-scale attention with dense encoder for handwritten mathematical expression recognition. In Proc. the 24th International Conference on Pattern Recognition, Aug. 2018, pp.2245–2250. DOI: 10.1109/ICPR.2018.8546031.
    [11] Le A D, Indurkhya B, Nakagawa M. Pattern generation strategies for improving recognition of handwritten mathematical expressions. Pattern Recognit. Lett., 2019, 128: 255–262. DOI: 10.1016/j.patrec.2019.09.002.
    [12] Li Z, Jin L W, Lai S X, Zhu Y C. Improving attention-based handwritten mathematical expression recognition with scale augmentation and drop attention. In Proc. the 17th International Conference on Frontiers in Handwriting Recognition, Sept. 2020, pp.175–180. DOI: 10.1109/ICFHR2020.2020.00041.
    [13] Zhang J S, Du J, Yang Y X, Song Y Z, Wei S, Dai L R. A tree-structured decoder for image-to-markup generation. In Proc. the 37th International Conference on Machine Learning, Jul. 2020, Article No. 1027.
    [14] Xu K, Ba J L, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel R S, Bengio Y. Show, attend and tell: Neural image caption generation with visual attention. In Proc. the 32nd International Conference on Machine Learning, Jul. 2015, pp.2048–2057.
    [15] Mouchère H, Zanibbi R, Garain U, Viard-Gaudin C. Advancing the state of the art for handwritten math recognition: The CROHME competitions, 2011–2014. Int. J. Document Anal. Recognit., 2016, 19(2): 173–189. DOI: 10.1007/s10032-016-0263-5.
    [16] Mouchère H, Viard-Gaudin C, Zanibbi R, Garain U. ICFHR2016 CROHME: Competition on recognition of online handwritten mathematical expressions. In Proc. the 15th International Conference on Frontiers in Handwriting Recognition, Oct. 2016, pp.607–612. DOI: 10.1109/ICFHR.2016.0116.
    [17] Mahdavi M, Zanibbi R, Mouchère H, Viard-Gaudin C, Garain U. ICDAR 2019 CROHME + TFD: Competition on recognition of handwritten mathematical expressions and typeset formula detection. In Proc. the 2019 International Conference on Document Analysis and Recognition, Sept. 2019, pp.1533–1538. DOI: 10.1109/ICDAR.2019.00247.
    [18] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput., 1997, 9(8): 1735–1780. DOI: 10.1162/neco.1997.9.8.1735.
    [19] Chung J, Gulcehre C, Cho K H, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv: 1412.3555, 2014. https://arxiv.org/abs/1412.3555, May 2024.
    [20] Gehring J, Auli M, Grangier D, Yarats D, Dauphin Y N. Convolutional sequence to sequence learning. In Proc. the 34th International Conference on Machine Learning, Aug. 2017, pp.1243–1252.
    [21] Tang G B, Müller M, Rios A, Sennrich R. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proc. the 2018 Conference on Empirical Methods in Natural Language Processing, Oct. 31–Nov. 4, 2018, pp.4263–4272. DOI: 10.18653/v1/D18-1458.
    [22] Zhang J S, Du J, Dai L R. Track, Attend, and Parse (TAP): An end-to-end framework for online handwritten mathematical expression recognition. IEEE Trans. Multimedia, 2019, 21(1): 221–233. DOI: 10.1109/TMM.2018.2844689.
    [23] Liu C, Yin F, Wang D, Wang Q. CASIA online and offline Chinese handwriting databases. In Proc. the 2011 International Conference on Document Analysis and Recognition, Sept. 2011, pp.37–41. DOI: 10.1109/ICDAR.2011.17.
    [24] Marti U V, Bunke H. The IAM-database: An English sentence database for offline handwriting recognition. Int. J. Document Anal. Recognit., 2002, 5(1): 39–46. DOI: 10.1007/s100320200071.
    [25] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv: 1409.1556, 2014. https://arxiv.org/abs/1409.1556, May 2024.
    [26] Gu J X, Wang G, Cai J F, Chen T. An empirical study of language CNN for image captioning. In Proc. the 2017 IEEE International Conference on Computer Vision, Oct. 2017, pp.1231–1240. DOI: 10.1109/ICCV.2017.138.
    [27] Xiu Y H, Wang Q Q, Zhan H J, Lan M, Lu Y. A handwritten Chinese text recognizer applying multi-level multimodal fusion network. In Proc. the 2019 International Conference on Document Analysis and Recognition, Sept. 2019, pp.1464–1469. DOI: 10.1109/ICDAR.2019.00235.
    [28] Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Jul. 2017, pp.2261–2269. DOI: 10.1109/CVPR.2017.243.
    [29] Weston J, Chopra S, Bordes A. Memory networks. arXiv: 1410.3916, 2014. https://arxiv.org/abs/1410.3916, May 2024.
    [30] Ranzato M A, Chopra S, Auli M, Zaremba W. Sequence level training with recurrent neural networks. arXiv: 1511.06732, 2015. https://arxiv.org/abs/1511.06732, May 2024.
    [31] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing Systems, Dec. 2017, pp.6000–6010.
    [32] Zanibbi R, Mouchère H, Viard-Gaudin C. Evaluating structural pattern recognition for handwritten math via primitive label graphs. In Proc. the SPIE 8658, Document Recognition and Retrieval XX, Feb. 2013, Article No. 865817. DOI: 10.1117/12.2008409.
    [33] Abadi M, Agarwal A, Barham P et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv: 1603.04467, 2016. https://arxiv.org/abs/1603.04467, May 2024.
    [34] Zeiler M D. ADADELTA: An adaptive learning rate method. arXiv: 1212.5701, 2012. https://arxiv.org/abs/1212.5701, May 2024.
    [35] Krogh A, Hertz J A. A simple weight decay can improve generalization. In Proc. the 4th International Conference on Neural Information Processing Systems, Dec. 1991, pp.950–957.
    [36] Cho K. Natural language understanding with distributed representation. arXiv: 1511.07916, 2015. https://arxiv.org/abs/1511.07916, May 2024.

Publication History
  • Received: 2021-10-20
  • Accepted: 2022-04-26
  • Published online: 2023-06-19
  • Issue date: 2024-06-27
