A Large-scale Video Text Dataset

Published: October 21, 2019

Introduction

Here, we release a large-scale video text dataset named LSVTD. In recent years, research in video scene text still remains unpopular in contrast to its promising application prospect. The existing video scene text datasets are limited on the scale of video items and scenarios, which may restrain research of video scene text spotting. Then we collect and annotate LSVTD, which contains 100 scene videos acquired from 21 typical real-life scenarios.

Note that, in the origin paper, we have decaleard that LSVTD is collected from 22 scenes. However, due to some privacy policy, we have to remove 1 video text scenario (including 5 videos), leading to 21 scenarios with 95 videos keeped. Even so, we then supplement 2, 1 and 2 videos for city road, street view and bookstore scenarios, respectively. Besides, we also refine the dataset by removing some segments of background frames without text content.

LSVTD mainly characterized is described in detail as follows:

Much larger scale. LSVTD has 100 videos, consists of more than 65k frames and 563k instances.
More diversified scenarios. LSVTD covers a wide range of 13 indoor (eg. bookstore, shopping mall) and 8 outdoor (eg. highway, city road) scenarios. The variety of scenarios challenges text spotting algorithms to achieve robust performance.
Multilingual text instances. LSVTD contains text with more than two kinds of languages (English and Chinese etc.), which are further divided into 2 major categories: Latin and Non-Latin. We additionally label this attribute for the convenience of evaluation for different algorithms. In addition, each text region is labeled with quality score.

Dataset released

Videos: 100 videos are provided in total, in which 66 videos for training and 34 videos for testing.
Annotations: only the training annotations (.xml) are provided.
If you need evaluate your algorithm on the testing set, please send us the predicted results (same as the annotation format) by email.
We will give you the evaluation results as soon as possible. In future, we will consider to add the on-line evaluation server or provide a evaluation tool.
Dataset request: If you need this dataset, please email us [See Contact]. We will share a download link with you. Tips: Please tell us your name and institute in your email.

Contact

If you have any questions about the dataset, please contact Jing Lu or Zhanzhan Cheng .(lujing6kh[at]163.com or 11821104[at]zju.edu.cn).

Terms of Use

The public annotations belong to Zhejiang University and are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The images belong to Zhejiang University and are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Change Log

2020-03-06 (GMT+8): dataset update by removing some consecutive background frames and adding 5 extra videos
2019-10-25 (GMT+8): dataset released
2019-10-21 (GMT+8): website create

@inproceedings{cheng2019you, title={You Only Recognize Once: Towards Fast Video Text Spotting}, author={Cheng, Zhanzhan and Lu, Jing and Niu, Yi and Pu, Shiliang and Wu, Fei and Zhou, Shuigeng}, booktitle={Proceedings of the 27th ACM International Conference on Multimedia}, pages={855–863}, year={2019}, organization={ACM} }

Slides

Paper

BibTex Citation