A Large Chinese Text Dataset in the Wild

doi:10.1007/s11390-019-1923-y

Abstract: We introduce a very large dataset of Chinese text in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, detecting and recognizing text in natural images remains challenging, especially for complicated character sets such as Chinese. Lack of training data has long been a bottleneck, particularly for deep learning methods, which require massive amounts of training data. The newly created dataset contains about 1 million Chinese character instances covering 3 850 unique characters, annotated by experts in over 30 000 street view images. It is challenging and diverse, containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, its bounding box, and six attributes that indicate the character's background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (average precision, AP, of 70.9%), and text line detection (average edit distance, AED, of 22.1). The dataset, source code, and trained models are publicly available.
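To make the edit-distance metric reported for text line detection more concrete, the following is a minimal sketch of an average edit distance. It assumes the metric is the mean Levenshtein distance between predicted and reference character sequences; the exact evaluation protocol is not given in the abstract and may differ. The function names `edit_distance` and `average_edit_distance` are hypothetical and are not taken from the released source code.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: the minimum number of insertions, deletions,
    and substitutions needed to turn sequence a into sequence b."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y, or keep a match
            ))
        prev = curr
    return prev[-1]


def average_edit_distance(pairs: Iterable[Tuple[Sequence, Sequence]]) -> float:
    """Mean edit distance over (predicted, reference) pairs.
    Assumption: one pair per evaluated unit, for example per image."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (predicted, reference) pair is required")
    return sum(edit_distance(p, r) for p, r in pairs) / len(pairs)
```

Lower values are better under this definition, since a perfect prediction has distance 0 from its reference. Python strings work directly as sequences of characters, so Chinese text needs no special handling.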
