DAVAR LAB

Scanned Chinese Invoice Dataset

Introduction

The SCID dataset comes from the CSIG 2022 Competition on Invoice Recognition and Analysis.

The dataset is also described in a paper accepted by the Journal of Image and Graphics in 2023: "SCID: a Chinese characters invoice-scanned dataset in relevant to key information extraction derived of visually-rich document images".

The dataset contains six types of invoices for algorithm verification: Taxi Invoice, Train Invoice, Passenger Invoice, Toll Invoice, Air Itinerary Invoice and Quota Invoice. All images have been desensitized. Some examples are shown below.


Example images (in order): Air Ticket, General Quota Invoice, Taxi Invoice, Passenger Transport Invoice, Toll Invoice, Train Ticket.

Annotation

We provide two types of annotation files for the training data, ocr.json and gt.json:

ocr.json

This file contains the annotations for each text instance's location and content, defined as follows:

"abf3b61f-cefe-374e-2ace-ac1fbdf3f3af_1.jpg": {
	"height": 891,
	"width": 1245,
	"content_ann": {
		"texts": [
			"112002070106", "12921503", "壹佰元整", "###", ...
		],
		"bboxes": [
			[ 453, 338, 830, 328, 832, 383, 454, 393 ],
			[ 446, 411, 739, 406, 741, 466, 448, 473 ],
			[ 462, 603, 809, 595, 812, 683, 464, 693 ],
			[ 428, 347, 883, 364, 882, 709, 419, 710 ],
			...
		]
	}
},

where
- texts: the text content of each text instance,
- bboxes: the location of each text instance, given as eight coordinate values per box.
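
As an illustration of how these fields might be consumed, here is a minimal Python sketch. It rests on three assumptions that the description above does not state: that ocr.json is a single JSON object keyed by image file name, that the i-th entry of texts belongs to the i-th entry of bboxes, and that the eight values of a box are the x, y coordinates of its four corner points. The helper names iter_text_instances and to_rect are hypothetical, and the sample data is copied from the example above rather than read from disk.

```python
import json

# Sample entry copied from the example above. In practice the data would come
# from json.load() on ocr.json; the top-level layout (one object keyed by
# image file name) is an assumption.
SAMPLE_OCR = json.loads("""
{
  "abf3b61f-cefe-374e-2ace-ac1fbdf3f3af_1.jpg": {
    "height": 891,
    "width": 1245,
    "content_ann": {
      "texts": ["112002070106", "12921503", "壹佰元整", "###"],
      "bboxes": [
        [453, 338, 830, 328, 832, 383, 454, 393],
        [446, 411, 739, 406, 741, 466, 448, 473],
        [462, 603, 809, 595, 812, 683, 464, 693],
        [428, 347, 883, 364, 882, 709, 419, 710]
      ]
    }
  }
}
""")


def iter_text_instances(ocr):
    """Yield (image_name, text, corner_points) for every text instance."""
    for image_name, record in ocr.items():
        ann = record["content_ann"]
        # Assumption: texts[i] and bboxes[i] describe the same instance.
        for text, bbox in zip(ann["texts"], ann["bboxes"]):
            # Assumed layout: [x1, y1, x2, y2, x3, y3, x4, y4]
            points = list(zip(bbox[0::2], bbox[1::2]))
            yield image_name, text, points


def to_rect(points):
    """Axis-aligned rectangle (x_min, y_min, x_max, y_max) enclosing the box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


instances = list(iter_text_instances(SAMPLE_OCR))
for name, text, points in instances:
    print(name, text, to_rect(points))
```

Converting each box to an enclosing rectangle is only a convenience for tools that expect axis-aligned boxes; the original corner points are kept for models that work with quadrilaterals.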

gt.json

This file contains the ground-truth entity annotations. An example corresponding to the ocr.json entry above is as follows (the keys 发票代码, 发票号码 and 金额 mean invoice code, invoice number and amount):

"abf3b61f-cefe-374e-2ace-ac1fbdf3f3af_1.jpg": {
	"发票代码": 112002070106,
	"发票号码": 12921503,
	"金额": "壹佰元整"
}
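
One possible way to connect the two files is to look up each ground-truth value among the OCR texts of the same image. The sketch below is only an illustration under stated assumptions: entity values are matched by exact string equality, numeric values (shown unquoted in the example) are converted to strings first, and the helper name locate_entities is hypothetical. Real entries may need fuzzier matching, for example when one entity spans several text instances.

```python
# Sample data copied from the two examples above.
SAMPLE_TEXTS = ["112002070106", "12921503", "壹佰元整", "###"]
SAMPLE_GT = {"发票代码": 112002070106, "发票号码": 12921503, "金额": "壹佰元整"}


def locate_entities(gt_entry, texts):
    """Map each entity key to the index of the OCR text equal to its value.

    Keys whose value is not found verbatim among the texts map to None.
    """
    result = {}
    for key, value in gt_entry.items():
        # Values appear unquoted (numeric) in the example, so normalise to str.
        value_str = str(value)
        result[key] = texts.index(value_str) if value_str in texts else None
    return result


matches = locate_entities(SAMPLE_GT, SAMPLE_TEXTS)
print(matches)
```

The returned indices can then be used to pick the matching entries of bboxes, which gives a location for each entity.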

Terms of Use

The public annotations belong to Hikvision Research Institute and are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

Chinese citation format: 乔梁，李再升，程战战，李玺. 2023. SCID：用于富含视觉信息文档图像中信息提取任务的扫描中文票据数据集. 中国图象图形学报，28（08）：2298-2313［DOI：10.11834/jig.220911］

@article{SCID,
	author  = {Liang Qiao and Zaisheng Li and Zhanzhan Cheng and Xi Li},
	title   = {{SCID:} a Chinese characters invoice-scanned dataset in relevant to key information extraction derived of visually-rich document images.},
	journal = {Journal of Image and Graphics},
	volume  = {28},
	number  = {08},
	pages   = {2298-2313},
	year    = {2023},
}

Dataset Download

To obtain the download link, please download the file Application_Form_for_Using_SCID.doc and fill in the required information. Then scan the signed form and send it to qiaoliang6@hikvision.com. We will reply with the dataset download link.