A Large Chinese Text Dataset in the Wild

Tai-Ling Yuan; Zhe Zhu; Kun Xu; Cheng-Jun Li; Tai-Jiang Mu; Shi-Min Hu

doi:10.1007/s11390-019-1923-y

Tai-Ling Yuan, Zhe Zhu, Kun Xu, Cheng-Jun Li, Tai-Jiang Mu, Shi-Min Hu. A Large Chinese Text Dataset in the Wild. Journal of Computer Science and Technology, 2019, 34(3): 509-521. DOI: 10.1007/s11390-019-1923-y

Abstract

In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for more complicated character sets such as Chinese. Lack of training data has always been a problem, especially for deep learning methods, which require massive training data. We provide details of a newly created dataset of Chinese text with about 1 million Chinese characters from 3 850 unique ones, annotated by experts in over 30 000 street view images. This is a challenging dataset with good diversity, containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, bounding box, and six attributes. The attributes indicate the character's background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.
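
To make the annotation description above concrete, the following is a minimal sketch in Python of what a per-character record and a top-1 accuracy check could look like. It is not the dataset's actual file format or the authors' evaluation code: the class name `CharAnnotation`, its field names, the `(x, y, width, height)` box layout, and the attribute key `occluded` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CharAnnotation:
    """Hypothetical record for one annotated character (names are illustrative)."""
    text: str                                  # the underlying character
    bbox: Tuple[float, float, float, float]    # assumed layout: (x, y, width, height)
    # The abstract mentions six attributes per character; the keys used
    # here are placeholders, not the dataset's real attribute names.
    attributes: Dict[str, bool] = field(default_factory=dict)


def top1_accuracy(predicted: List[str], truth: List[CharAnnotation]) -> float:
    """Fraction of characters whose single best prediction matches the annotation."""
    if len(predicted) != len(truth):
        raise ValueError("predictions and annotations must have the same length")
    if not truth:
        return 0.0
    hits = sum(p == a.text for p, a in zip(predicted, truth))
    return hits / len(truth)


# Toy data, invented for this example only.
sample = [
    CharAnnotation("中", (10.0, 20.0, 32.0, 32.0), {"occluded": False}),
    CharAnnotation("文", (44.0, 20.0, 32.0, 32.0), {"occluded": True}),
    CharAnnotation("人", (78.0, 20.0, 32.0, 32.0), {"occluded": False}),
]
accuracy = top1_accuracy(["中", "国", "人"], sample)
```

Here two of the three toy predictions match their annotations, so `accuracy` is 2/3. Keeping the attributes in a dictionary of flags is one possible design; it would allow results to be broken down by attribute, for example accuracy on occluded characters only.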
