VATEX v1.1 (Latest Version)


Note: We withhold the test-set annotations for challenge use, but you can submit your results to our VATEX Video Captioning Challenge and Video-guided Machine Translation Challenge for evaluation. The challenges remain open indefinitely. Using the validation split is not mandatory: you may submit test results from a model trained on both the training and validation sets.


Training Set
  • 25,991 Videos
  • 259,910 English Captions
  • 259,910 Chinese Captions

(v1.0, 57.3 MB)

Validation Set
  • 3,000 Videos
  • 30,000 English Captions
  • 30,000 Chinese Captions

(v1.0, 6.6 MB)

Public Test Set
(English released for VMT evaluation)
  • 6,000 Videos
  • 60,000 English Captions

(v1.1, 4.9 MB)

Private Test Set
(held out for caption evaluation)
  • 6,278 Videos

(v1.1, 0.26 MB)


VATEX v1.0 (used for the VATEX Captioning Challenge 2019)



Training Set
  • 25,991 Videos
  • 259,910 English Captions
  • 259,910 Chinese Captions

(v1.0, 57.3 MB)

Validation Set
  • 3,000 Videos
  • 30,000 English Captions
  • 30,000 Chinese Captions

(v1.0, 6.6 MB)

Public Test Set
  • 6,000 Videos


(v1.0, 0.25 MB)


Pretrained Video Features


Note: Due to legal and privacy concerns, we cannot directly share the downloaded YouTube videos or clips in any way (including but not limited to email, online drives, and GitHub). However, many open-source tools can download the original clips (e.g., [Tool #1] and [Tool #2]). Some videos may now be unavailable (deleted or hidden by either YouTube or the uploaders), although they were available when we collected the dataset. Because these make up an extremely small percentage, we do not expect a significant impact on performance.



In addition to the YouTube video IDs, we provide pretrained video features below for quick development. The features, which cover all the videos, were extracted using a pretrained I3D model [here]. Each video is represented by a NumPy array of shape (1, num_of_segments, 1024).
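As a minimal sketch of working with the stated feature shape, the snippet below checks that an array is (1, num_of_segments, 1024) and mean-pools it over segments into a single 1024-dimensional vector. The function name `pool_video_feature` and the mean-pooling step are illustrative assumptions, not part of the released tooling, and the array here is synthetic rather than a real VATEX feature.

```python
import numpy as np


def pool_video_feature(feature: np.ndarray) -> np.ndarray:
    """Mean-pool a (1, num_of_segments, 1024) feature into a (1024,) vector.

    Hypothetical helper for illustration; mean pooling is one simple choice,
    not necessarily what any particular VATEX baseline does.
    """
    if feature.ndim != 3 or feature.shape[0] != 1 or feature.shape[2] != 1024:
        raise ValueError(f"unexpected feature shape: {feature.shape}")
    segments = feature[0]          # drop the leading axis: (num_of_segments, 1024)
    return segments.mean(axis=0)   # average over segments: (1024,)


# Synthetic stand-in for one video's feature with 7 segments.
dummy_feature = np.ones((1, 7, 1024), dtype=np.float32)
pooled = pool_video_feature(dummy_feature)
print(pooled.shape)
```

A real feature file would presumably be read with `np.load(path)` before being passed to the helper; the file naming and layout of the download are not specified here, so adjust the loading step to match what you actually receive.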



I3D Features on AWS S3:

Annotation Format



{
    'videoID': 'YouTubeID_StartTime_EndTime',
    'enCap': 
        [
            'Regular English Caption #1',
            'Regular English Caption #2',
            'Regular English Caption #3',
            'Regular English Caption #4',
            'Regular English Caption #5',
            'Parallel English Caption #1',
            'Parallel English Caption #2',
            'Parallel English Caption #3',
            'Parallel English Caption #4',
            'Parallel English Caption #5'
        ],
    'chCap': 
        [
            'Regular Chinese Caption #1',
            'Regular Chinese Caption #2',
            'Regular Chinese Caption #3',
            'Regular Chinese Caption #4',
            'Regular Chinese Caption #5',
            'Parallel Chinese Caption #1',
            'Parallel Chinese Caption #2',
            'Parallel Chinese Caption #3',
            'Parallel Chinese Caption #4',
            'Parallel Chinese Caption #5'
        ]
}
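The format above can be read along the following lines. This is a sketch under stated assumptions: the helper names (`Clip`, `parse_video_id`, `split_captions`) are hypothetical, the start and end times are assumed to be integers (possibly zero-padded), and the parallel English and Chinese captions are assumed to pair up by position, as the matching numbering in the sample suggests. The record is made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Clip(NamedTuple):
    youtube_id: str
    start: int  # assumed to be an integer time offset
    end: int


def parse_video_id(video_id: str) -> Clip:
    """Split 'YouTubeID_StartTime_EndTime' into its three parts."""
    # Split from the right: a YouTube ID may itself contain underscores.
    youtube_id, start, end = video_id.rsplit("_", 2)
    return Clip(youtube_id, int(start), int(end))


def split_captions(captions: List[str]) -> Tuple[List[str], List[str]]:
    """Return (regular, parallel) captions: the first five and the last five."""
    if len(captions) != 10:
        raise ValueError(f"expected 10 captions, got {len(captions)}")
    return captions[:5], captions[5:]


# Made-up record following the annotation format; not real VATEX data.
record = {
    "videoID": "abc_DEF-123_000010_000020",
    "enCap": [f"Regular English Caption #{i}" for i in range(1, 6)]
    + [f"Parallel English Caption #{i}" for i in range(1, 6)],
    "chCap": [f"Regular Chinese Caption #{i}" for i in range(1, 6)]
    + [f"Parallel Chinese Caption #{i}" for i in range(1, 6)],
}

clip = parse_video_id(record["videoID"])
regular_en, parallel_en = split_captions(record["enCap"])
regular_ch, parallel_ch = split_captions(record["chCap"])

# Assumption: parallel captions align index by index across languages.
pairs = list(zip(parallel_en, parallel_ch))
print(clip, len(pairs))
```

Splitting the ID from the right keeps IDs with embedded underscores intact, which a plain `split("_")` would break.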