Spoken Moments is a video description dataset with over 500K short videos depicting a broad range of events. We collect the descriptions as audio recordings to keep them as natural and concise as possible. We provide our cached videos and descriptions for academic use.
To obtain the dataset, please fill out the form (coming soon).
Mathew Monfort*, SouYoung Jin*, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2021
Also see Moments in Time and Multi Moments in Time, which are used as data sources for the videos in Spoken Moments.
The Spoken Moments dataset contains 500k videos randomly chosen from the Multi-Moments in Time (M-MiT) training set and all 10k videos from the M-MiT validation set. Each video in the training set has at least one audio description. We transcribed each audio recording using the public Google Automatic Speech Recognition (ASR) engine to generate a text caption for each video. Analyzing these transcriptions gives a picture of the coverage and diversity of our captions: they have an average length of 18 words and a unique vocabulary of 50,570 words, including 20,645 nouns, 12,523 adjectives and 7,436 verbs, with a total word count of 5.6 million. Please see the paper for more details.
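As a rough illustration of the transcription step, the sketch below shows how a single recorded description could be transcribed with the Google Cloud Speech-to-Text Python client. This is not the exact pipeline used to build the dataset; the file name, audio encoding, and sample rate are placeholder assumptions.

```python
# Illustrative sketch only (not the dataset's official pipeline): transcribe one
# recorded description with the Google Cloud Speech-to-Text client
# (google-cloud-speech >= 2.x). File name, encoding, and sample rate are
# placeholder assumptions.
from google.cloud import speech

client = speech.SpeechClient()

# Read the raw audio of a single spoken description (hypothetical file name).
with open("description_recording.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)

# Each result carries ranked alternatives; keep the top hypothesis and join
# the segments into a single text caption.
caption = " ".join(r.alternatives[0].transcript for r in response.results)
print(caption)
```

The table below compares Spoken Moments with existing video description datasets.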
Dataset | Clips | Videos | Captions | Words | Vocab |
---|---|---|---|---|---|
TACoS | 7,206 | 127 | 18,227 | 146,771 | 28,292 |
YouCook II | 15,400 | 2,000 | 15,400 | 121,418 | 2,583 |
MSVD | 1,970 | 1,970 | 70,028 | 607,339 | 13,010 |
Charades | 10,000 | 10,000 | 27,800 | 645,636 | 32,804 |
MPII-MD | 68,337 | 94 | 68,375 | 653,467 | 24,549 |
MSR-VTT | 10,000 | 7,180 | 200,000 | 1,856,523 | 29,316 |
ActivityNet Captions | 100,000 | 20,000 | 100,000 | 1,348,000 | 15,564 |
VideoStory | 123,000 | 20,000 | 123,000 | 1,633,226 | - |
Epic-Kitchens | 76,885 | 633 | 76,885 | 227,974 | 1,737 |
Vatex-en | 41,300 | 41,300 | 413,000 | 4,994,768 | 44,103 |
Spoken Moments | 515,912 | 459,742 | 515,912 | 5,618,064 | 50,570 |
Type | Total | Average per Caption | Unique |
---|---|---|---|
Words | 5,618,064 | 18.01 | 50,570 |
Verbs | 492,941 | 1.58 | 7,436 |
Nouns | 1,365,305 | 4.37 | 20,645 |
Adjectives | 386,039 | 1.24 | 12,523 |
Type | Dataset | Coverage |
---|---|---|
Objects | ImageNet | 69.2% |
Objects | MS-COCO | 100% |
Actions | Kinetics | 85.1% |
Actions | Moments in Time | 96.2% |
Scenes | Places365 | 47.4% |
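The word statistics above can be approximated directly from the transcribed captions. The following sketch assumes a hypothetical captions.txt file with one caption per line and uses spaCy for part-of-speech tagging; tokenizer and tagger differences mean it will not exactly reproduce the published numbers.

```python
# Sketch of the caption statistics above (totals, per-caption averages, unique
# vocabulary) using spaCy for POS tagging. Assumes a hypothetical captions.txt
# with one transcribed caption per line.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

totals = Counter()                      # total token count per POS tag
vocab = {"WORD": set(), "VERB": set(), "NOUN": set(), "ADJ": set()}
n_captions = 0
n_words = 0

with open("captions.txt") as f:
    for line in f:
        caption = line.strip()
        if not caption:
            continue
        n_captions += 1
        for tok in nlp(caption):
            if not tok.is_alpha:        # skip punctuation and numbers
                continue
            n_words += 1
            vocab["WORD"].add(tok.lower_)
            if tok.pos_ in ("VERB", "NOUN", "ADJ"):
                totals[tok.pos_] += 1
                vocab[tok.pos_].add(tok.lemma_.lower())

print(f"Words: total {n_words}, avg {n_words / n_captions:.2f}, "
      f"unique {len(vocab['WORD'])}")
for pos in ("VERB", "NOUN", "ADJ"):
    print(f"{pos}: total {totals[pos]}, avg {totals[pos] / n_captions:.2f}, "
          f"unique {len(vocab[pos])}")
```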
Each AMT worker is given a task of recording themselves describing 10 different videos. Each target video is shown on the left of the screen, while a video with an example text description is shown on the right. This example shows workers the type of description and the level of detail we expect; it stays on the right side of the screen throughout the task while the target videos on the left cycle as the worker completes each description.
Below each target video is a button that lets the worker start recording their voice as they describe it. Once they press this button, the video is removed from the screen and the recording starts. We hide the video while the description is being recorded to ensure that the recordings are concise and pertain only to the important events that stand out in the worker's memory.
This work was supported by the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside, as well as by the Intelligence Advanced Research Projects Activity (IARPA) via Department of the Interior/Interior Business Center (DOI/IBC) contract number D17PC00341.