A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research.
Both English and Chinese captions.
825K captions for 41.3K video clips.
Comprehensive and representative video content from 600 fine-grained human activities.
Unique and lexically richer annotations to support more natural and diverse caption generation.
@InProceedings{Wang_2019_ICCV,
author = {Wang, Xin and Wu, Jiawei and Chen, Junkun and Li, Lei and Wang, Yuan-Fang and Wang, William Yang},
title = {{VaTeX}: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}
}