Spoken Moments in Time

Spoken Moments is a video description dataset with over 500K short videos depicting a broad range of events. We collect our descriptions as audio recordings to ensure that they remain as natural and concise as possible. We provide our cached videos and descriptions for academic use.

Video Presentation



Spoken Moments in Time Dataset

To obtain the dataset, please fill out the form


Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Mathew Monfort*, SouYoung Jin*, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2021

Also see Moments in Time and Multi-Moments in Time, which are used as data sources for the videos in Spoken Moments.

Dataset Statistics

The Spoken Moments dataset contains 500K videos randomly chosen from the Multi-Moments in Time (M-MiT) training set and all of the 10K videos from the validation set. Each video in the training set has at least one audio description. We transcribed each audio recording using the public Google Automatic Speech Recognition (ASR) engine to generate text captions for each video. Analyzing these transcriptions gives a picture of the coverage and diversity of our captions. Our captions have an average length of 18 words and a unique vocabulary of 50,570 words, including 20,645 nouns, 12,523 adjectives and 7,436 verbs, with a total word count of 5.6 million. Please check the paper for more details.
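As a rough illustration of how word-level statistics like these can be derived from ASR transcripts, here is a minimal sketch. The regex tokenizer, the function name `caption_stats` and the sample transcripts are assumptions made for illustration and are not the pipeline used in the paper. Part-of-speech counts (nouns, verbs, adjectives) would require a tagger and are left out.

```python
import re
from collections import Counter

def tokenize(caption):
    # Assumed, simplistic tokenizer: lowercase, keep alphabetic runs
    # with an optional apostrophe suffix (e.g. "man's").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", caption.lower())

def caption_stats(captions):
    # Count total words, average words per caption and unique vocabulary size.
    counts = Counter()
    total = 0
    for caption in captions:
        tokens = tokenize(caption)
        counts.update(tokens)
        total += len(tokens)
    average = total / len(captions) if captions else 0.0
    return {
        "total_words": total,
        "average_length": average,
        "vocabulary_size": len(counts),
    }

# Hypothetical transcripts, not taken from the dataset.
sample = [
    "a man is throwing a ball to a dog",
    "a dog runs across the grass",
]
stats = caption_stats(sample)
print(stats)
```

On real transcripts the numbers depend heavily on the tokenizer and on how ASR errors are handled, so results from a sketch like this should not be expected to match the published figures exactly.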

Table 1. Caption Dataset Comparison: We compare our proposed Spoken Moments dataset to existing video caption datasets. The word count and vocabulary are generated using ASR transcriptions. Compared to other existing datasets for video captioning, we provide a large jump in scale in terms of clips, source videos, number of captions, and the diversity and total size of our captions.
Dataset Clips Videos Captions Words Vocab
TACoS 7,206 127 18,227 146,771 28,292
YouCook II 15,400 2,000 15,400 121,418 2,583
MSVD 1,970 1,970 70,028 607,339 13,010
Charades 10,000 10,000 27,800 645,636 32,804
MPII-MD 68,337 94 68,375 653,467 24,549
MSR-VTT 10,000 7,180 200,000 1,856,523 29,316
ActivityNet Captions 100,000 20,000 100,000 1,348,000 15,564
VideoStory 123,000 20,000 123,000 1,633,226 -
Epic-Kitchens 76,885 633 76,885 227,974 1,737
Vatex-en 41,300 41,300 413,000 4,994,768 44,103
Spoken Moments 515,912 459,742 515,912 5,618,064 50,570

Table 2. Spoken Moments caption statistics: the total and average number of words, verbs, nouns and adjectives in our captions as well as the number of unique examples of each. Our captions include a large number of unique actions, nouns and adjectives.
Type Total Average Unique
Words 5,618,064 18.01 50,570
Verbs 492,941 1.58 7,436
Nouns 1,365,305 4.37 20,645
Adjectives 386,039 1.24 12,523

Table 3. Coverage of class vocabulary: we show the percentage of the class vocabulary from different datasets that occurs in our captions. Compared to large-scale image and video datasets, our captions cover a large number of the classes in the object, scene and action datasets commonly used for training visual models.
Type Dataset Coverage
Objects ImageNet 69.2%
Objects MS-COCO 100%
Actions Kinetics 85.1%
Actions Moments in Time 96.2%
Scenes Places365 47.4%
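One way to read the coverage figures in Table 3 is as the share of a dataset's class names that appear at least once in the caption vocabulary. The sketch below works under that assumption. The matching rule (every word of a class name must occur in the vocabulary), the function names and the toy data are hypothetical, and the procedure in the paper may differ.

```python
import re

def words(text):
    # Assumed tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def caption_vocabulary(captions):
    # Collect the set of unique words across all captions.
    vocab = set()
    for caption in captions:
        vocab.update(words(caption))
    return vocab

def class_coverage(class_names, vocab):
    # Percentage of class names whose words all appear in the vocabulary.
    if not class_names:
        return 0.0
    covered = [name for name in class_names
               if all(w in vocab for w in words(name))]
    return 100.0 * len(covered) / len(class_names)

# Hypothetical captions and class list, not taken from any dataset.
captions = [
    "a dog catches a frisbee in the park",
    "someone is slicing bread in a kitchen",
]
classes = ["dog", "frisbee", "kitchen", "traffic light"]

vocab = caption_vocabulary(captions)
coverage = class_coverage(classes, vocab)
print(f"{coverage:.1f}%")  # 75.0%
```

Exact word matching is a deliberately strict choice here. Matching on lemmas or synonyms would raise the measured coverage.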


Each Amazon Mechanical Turk (AMT) worker is presented with the task of recording themselves describing 10 different videos. Each video is shown on the left of the screen, while a video with an example text description is shown on the right. This example shows the workers the types of descriptions we are looking for and the amount of detail we expect from them. It stays on the right side of the screen throughout the task, while the target videos on the left cycle as the worker completes each description.

Below each target description is a button that allows the worker to start recording their voice as they describe the video. Once they press this button the video is removed from the screen and the recording is started. We block the worker from seeing the video while recording the description to ensure that the recordings are concise and pertain only to the important events highlighted in their memory.


This work was supported by the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside, as well as the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00341.