Scanned Chinese Invoice Dataset
DAVAR LAB
Introduction
The SCID dataset comes from the CSIG 2022 Competition on Invoice Recognition and Analysis.
The dataset is also described in a paper accepted by the Journal of Image and Graphics in 2023: "SCID: a Chinese characters invoice-scanned dataset in relevant to key information extraction derived of visually-rich document images".
The dataset contains six types of invoices for algorithm verification: Taxi Invoice, Train Invoice, Passenger Invoice, Toll Invoice, Air Itinerary Invoice and Quota Invoice. All images have been desensitized. Some visualization examples are shown below.
Air Ticket | General Quota Invoice | Taxi Invoice |
Passenger Transport Invoice | Toll Invoice | Train Ticket |
Annotation
We provide two types of annotation files for the training data, ocr.json and gt.json:
ocr.json
This file contains the location and content annotations for each text instance, defined as follows:
"abf3b61f-cefe-374e-2ace-ac1fbdf3f3af_1.jpg": {
    "height": 891,
    "width": 1245,
    "content_ann": {
        "texts": [
            "112002070106",
            "12921503",
            "壹佰元整",
            "###",
            ...
        ],
        "bboxes": [
            [ 453, 338, 830, 328, 832, 383, 454, 393 ],
            [ 446, 411, 739, 406, 741, 466, 448, 473 ],
            [ 462, 603, 809, 595, 812, 683, 464, 693 ],
            [ 428, 347, 883, 364, 882, 709, 419, 710 ],
            ...
        ]
    }
},
where
- texts: the text content of each text instance,
- bboxes: the location of each text instance, given as eight integers per instance.
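The structure above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes that ocr.json is a single JSON object keyed by image file name, and that the eight numbers of each bbox are x, y pairs for the four corners of a quadrilateral. Neither assumption is stated explicitly in this document. The helper names `load_annotations` and `iter_text_instances` are hypothetical.

```python
import json


def load_annotations(path):
    """Read an annotation file (assumed: one JSON object keyed by image name)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_text_instances(ocr_annotations):
    """Yield (image_name, text, polygon) for every annotated text instance."""
    for image_name, record in ocr_annotations.items():
        content = record["content_ann"]
        # texts[i] and bboxes[i] describe the same text instance.
        for text, bbox in zip(content["texts"], content["bboxes"]):
            # Assumption: the eight numbers are x1, y1, ..., x4, y4.
            polygon = [(bbox[i], bbox[i + 1]) for i in range(0, len(bbox), 2)]
            yield image_name, text, polygon


# Inline sample copied from the example above (elided entries omitted).
sample = {
    "abf3b61f-cefe-374e-2ace-ac1fbdf3f3af_1.jpg": {
        "height": 891,
        "width": 1245,
        "content_ann": {
            "texts": ["112002070106", "12921503", "壹佰元整", "###"],
            "bboxes": [
                [453, 338, 830, 328, 832, 383, 454, 393],
                [446, 411, 739, 406, 741, 466, 448, 473],
                [462, 603, 809, 595, 812, 683, 464, 693],
                [428, 347, 883, 364, 882, 709, 419, 710],
            ],
        },
    }
}

instances = list(iter_text_instances(sample))
for image_name, text, polygon in instances:
    print(image_name, text, polygon)
```

For the real file, the same iterator would be called as `iter_text_instances(load_annotations("ocr.json"))`, provided the file layout matches the assumption above.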
gt.json
This file contains the ground-truth entity annotations. An example corresponding to the above ocr.json is as follows:
"abf3b61f-cefe-374e-2ace-ac1fbdf3f3af_1.jpg": {
    "发票代码": 112002070106,
    "发票号码": 12921503,
    "金额": "壹佰元整"
}
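In the example, the keys 发票代码, 发票号码 and 金额 mean invoice code, invoice number and amount, and each value also appears as a text instance in ocr.json. The sketch below shows one way to look up where an entity sits on the image by exact string match. This is an illustration only, not the competition's method: the helper `locate_entities` is hypothetical, and real annotations may contain entities that do not match any single text instance exactly. Because gt.json stores some values as numbers while ocr.json stores strings, values are converted with `str` before comparison.

```python
def locate_entities(gt_record, ocr_record):
    """Map each entity key to the bbox of the first OCR text equal to its value.

    Entities without an exact match are mapped to None.
    """
    texts = ocr_record["content_ann"]["texts"]
    bboxes = ocr_record["content_ann"]["bboxes"]
    located = {}
    for key, value in gt_record.items():
        target = str(value)  # gt values may be numbers; OCR texts are strings
        match = None
        for text, bbox in zip(texts, bboxes):
            if text == target:
                match = bbox
                break
        located[key] = match
    return located


# Inline samples copied from the ocr.json and gt.json examples above.
ocr_record = {
    "height": 891,
    "width": 1245,
    "content_ann": {
        "texts": ["112002070106", "12921503", "壹佰元整", "###"],
        "bboxes": [
            [453, 338, 830, 328, 832, 383, 454, 393],
            [446, 411, 739, 406, 741, 466, 448, 473],
            [462, 603, 809, 595, 812, 683, 464, 693],
            [428, 347, 883, 364, 882, 709, 419, 710],
        ],
    },
}
gt_record = {"发票代码": 112002070106, "发票号码": 12921503, "金额": "壹佰元整"}

located = locate_entities(gt_record, ocr_record)
for key, bbox in located.items():
    print(key, bbox)
```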
Terms of Use
Citation
@article{SCID,
  author  = {Liang Qiao and Zaisheng Li and Zhanzhan Cheng and Xi Li},
  title   = {{SCID:} a Chinese characters invoice-scanned dataset in relevant to key information extraction derived of visually-rich document images.},
  journal = {Journal of Image and Graphics},
  volume  = {28},
  number  = {08},
  pages   = {2298-2313},
  year    = {2023},
}