Scanned Chinese Invoice Dataset

DAVAR LAB

Introduction

The SCID dataset comes from the CSIG 2022 Competition on Invoice Recognition and Analysis.

The dataset is also described in a paper accepted by the Journal of Image and Graphics in 2023, "SCID: a Chinese characters invoice-scanned dataset in relevant to key information extraction derived of visually-rich document images".

The dataset contains six types of invoices for algorithm verification: Taxi Invoice, Train Invoice, Passenger Invoice, Toll Invoice, Air Itinerary Invoice and Quota Invoice. All images have been desensitized. Some examples are shown below.

[Figure: example images of Air Ticket, General Quota Invoice, Taxi Invoice, Passenger Transport Invoice, Toll Invoice and Train Ticket]

Annotation

We provide two types of annotation files for the training data, ocr.json and gt.json:

ocr.json

This file contains the annotations for each text instance's location and content, defined as follows:

"abf3b61f-cefe-374e-2ace-ac1fbdf3f3af_1.jpg": {
	"height": 891,
	"width": 1245,
	"content_ann": {
		"texts": [
			"112002070106", "12921503", "壹佰元整", "###", ...
		],
		"bboxes": [
			[ 453, 338, 830, 328, 832, 383, 454, 393 ],
			[ 446, 411, 739, 406, 741, 466, 448, 473 ],
			[ 462, 603, 809, 595, 812, 683, 464, 693 ],
			[ 428, 347, 883, 364, 882, 709, 419, 710 ],
			...
		]
	}
},
						
where
- texts: the text content of each text instance;
- bboxes: the location of each text instance, given as eight integers per box (presumably the x, y coordinates of four corner points).
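As an illustration of how such a record might be consumed, the following minimal Python sketch pairs each text with its box. It rests on assumptions that the description above does not confirm: that texts and bboxes are index-aligned, and that the eight numbers of a box are x1, y1, ..., x4, y4. The helper name iter_text_instances is hypothetical, and the sample record is the example above with the elided entries left out.

```python
import json
from typing import Dict, Iterator, List, Tuple

Point = Tuple[int, int]


def iter_text_instances(record: Dict) -> Iterator[Tuple[str, List[Point]]]:
    """Yield (text, polygon) pairs from one ocr.json-style image record."""
    ann = record["content_ann"]
    texts, bboxes = ann["texts"], ann["bboxes"]
    # Assumption: texts[i] belongs to bboxes[i].
    if len(texts) != len(bboxes):
        raise ValueError("texts and bboxes must have the same length")
    for text, box in zip(texts, bboxes):
        # Assumption: the eight numbers are x1, y1, x2, y2, x3, y3, x4, y4.
        polygon = list(zip(box[0::2], box[1::2]))
        yield text, polygon


# Sample record copied from the example above (elided entries omitted).
sample = json.loads("""
{
  "abf3b61f-cefe-374e-2ace-ac1fbdf3f3af_1.jpg": {
    "height": 891,
    "width": 1245,
    "content_ann": {
      "texts": ["112002070106", "12921503", "壹佰元整", "###"],
      "bboxes": [
        [453, 338, 830, 328, 832, 383, 454, 393],
        [446, 411, 739, 406, 741, 466, 448, 473],
        [462, 603, 809, 595, 812, 683, 464, 693],
        [428, 347, 883, 364, 882, 709, 419, 710]
      ]
    }
  }
}
""")

instances = {name: list(iter_text_instances(rec)) for name, rec in sample.items()}
for name, items in instances.items():
    for text, polygon in items:
        print(name, text, polygon)
```

For the real file, the same function could be applied to each value of the dictionary returned by json.load, assuming the top level of ocr.json maps image names to records as the example suggests.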

gt.json

This file contains the entity annotation ground truth. An example corresponding to the ocr.json example above is as follows (the keys 发票代码, 发票号码 and 金额 mean invoice code, invoice number and amount):

"abf3b61f-cefe-374e-2ace-ac1fbdf3f3af_1.jpg": {
	"发票代码":112002070106,
	"发票号码":12921503,
	"金额":"壹佰元整",
}
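To show how the two files relate, here is a minimal Python sketch that looks up each gt.json entity value among the OCR texts of the same image. The helper name locate_entities is hypothetical, and exact string equality is only an assumption that happens to hold for the example above; in the full dataset an entity could span several text instances or differ in formatting from the OCR text.

```python
from typing import Dict, List, Optional


def locate_entities(gt_record: Dict[str, object], texts: List[str]) -> Dict[str, Optional[int]]:
    """Map each entity key to the index of the OCR text equal to its value, or None."""
    result: Dict[str, Optional[int]] = {}
    for key, value in gt_record.items():
        # Entity values may be numbers in gt.json, while OCR texts are strings.
        target = str(value)
        result[key] = texts.index(target) if target in texts else None
    return result


# Values copied from the gt.json and ocr.json examples above.
gt_record = {"发票代码": 112002070106, "发票号码": 12921503, "金额": "壹佰元整"}
texts = ["112002070106", "12921503", "壹佰元整", "###"]

matches = locate_entities(gt_record, texts)
print(matches)
```

Each returned index can then be used to fetch the matching entry of bboxes, which gives a location for the entity under the same index-alignment assumption.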
						

Terms of Use

  • The public annotations belong to Hikvision Research Institute and are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
  • Citation

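  • Citation (translated from the Chinese reference format): Liang Qiao, Zaisheng Li, Zhanzhan Cheng, Xi Li. 2023. SCID: a scanned Chinese invoice dataset for information extraction from visually rich document images. Journal of Image and Graphics, 28(08): 2298-2313 [DOI: 10.11834/jig.220911]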
  • @article{SCID,
      author = {Liang Qiao and Zaisheng Li and Zhanzhan Cheng and Xi Li},
      title = {{SCID:} a Chinese characters invoice-scanned dataset in relevant to key information extraction derived of visually-rich document images},
      journal = {Journal of Image and Graphics},
      volume = {28},
      number = {08},
      pages = {2298-2313},
      year = {2023},
    }
    					

Dataset Download

To obtain the download link, please download the file Application_Form_for_Using_SCID.doc and fill in the required information. Scan the signed form and send it to qiaoliang6@hikvision.com. We will then send you the dataset download link.