Spoken Moments in Time

Spoken Moments is a video description dataset with over 500K short videos depicting a broad range of events. We collect our descriptions using audio recordings to ensure that they remain as natural and concise as possible. We provide our cached videos and descriptions for academic use.

Video Presentation

Examples

Download

Spoken Moments in Time Dataset

To obtain the dataset, please fill out the form

Code and pretrained models

(Coming Soon)

Paper

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Mathew Monfort*, SouYoung Jin*, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2021

Paper Supplementary Material BIB POSTER

Also see Moments in Time and Multi-Moments in Time, which are used as data sources for the videos in Spoken Moments.

Dataset Statistics

The Spoken Moments dataset contains 500K videos randomly chosen from the Multi-Moments in Time (M-MiT) training set and all of the 10K videos from the validation set. Each video in the training set contains at least one audio description. We transcribed each audio recording using the public Google Automatic Speech Recognition (ASR) engine to generate text captions for each video. Analyzing these transcriptions gives a picture of the coverage and diversity of our captions. Our captions have an average length of 18 words and a unique vocabulary of 50,570 words, consisting of 20,645 nouns, 12,523 adjectives and 7,436 verbs, with a total word count of 5.6 million. Please check the paper for more details.

**Table 1. Caption Dataset Comparison:** We compare our proposed Spoken Moments dataset to existing video caption datasets. The word count and vocabulary are generated using ASR transcriptions. Compared to other existing video captioning datasets, we provide a large jump in scale in terms of clips, source videos, number of captions, and the diversity and total size of our captions.

| Dataset | Clips | Videos | Captions | Words | Vocab |
|---|---:|---:|---:|---:|---:|
| TACoS | 7,206 | 127 | 18,227 | 146,771 | 28,292 |
| YouCook II | 15,400 | 2,000 | 15,400 | 121,418 | 2,583 |
| MSVD | 1,970 | 1,970 | 70,028 | 607,339 | 13,010 |
| Charades | 10,000 | 10,000 | 27,800 | 645,636 | 32,804 |
| MPII-MD | 68,337 | 94 | 68,375 | 653,467 | 24,549 |
| MSR-VTT | 10,000 | 7,180 | 200,000 | 1,856,523 | 29,316 |
| ActivityNet Captions | 100,000 | 20,000 | 100,000 | 1,348,000 | 15,564 |
| VideoStory | 123,000 | 20,000 | 123,000 | 1,633,226 | - |
| Epic-Kitchens | 76,885 | 633 | 76,885 | 227,974 | 1,737 |
| Vatex-en | 41,300 | 41,300 | 413,000 | 4,994,768 | 44,103 |
| Spoken Moments | 515,912 | 459,742 | 515,912 | 5,618,064 | 50,570 |
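
To make the scale comparison in Table 1 concrete, the following minimal sketch computes, for each column, the ratio between Spoken Moments and the next-largest dataset. The numbers are copied from Table 1; the function name `scale_vs_next_largest` and the data layout are illustrative assumptions, not part of any released code.

```python
# Values copied from Table 1: (clips, videos, captions, words, vocab).
# None marks the vocabulary size that Table 1 does not report for VideoStory.
COLUMNS = ("clips", "videos", "captions", "words", "vocab")
TABLE_1 = {
    "TACoS": (7206, 127, 18227, 146771, 28292),
    "YouCook II": (15400, 2000, 15400, 121418, 2583),
    "MSVD": (1970, 1970, 70028, 607339, 13010),
    "Charades": (10000, 10000, 27800, 645636, 32804),
    "MPII-MD": (68337, 94, 68375, 653467, 24549),
    "MSR-VTT": (10000, 7180, 200000, 1856523, 29316),
    "ActivityNet Captions": (100000, 20000, 100000, 1348000, 15564),
    "VideoStory": (123000, 20000, 123000, 1633226, None),
    "Epic-Kitchens": (76885, 633, 76885, 227974, 1737),
    "Vatex-en": (41300, 41300, 413000, 4994768, 44103),
    "Spoken Moments": (515912, 459742, 515912, 5618064, 50570),
}


def scale_vs_next_largest(table, target="Spoken Moments"):
    """Map each column to (next-largest dataset, target value / its value)."""
    ratios = {}
    for i, column in enumerate(COLUMNS):
        # Skip the target itself and any entry the table leaves blank.
        others = [
            (row[i], name)
            for name, row in table.items()
            if name != target and row[i] is not None
        ]
        best_value, best_name = max(others)
        ratios[column] = (best_name, table[target][i] / best_value)
    return ratios


ratios = scale_vs_next_largest(TABLE_1)
for column, (name, ratio) in ratios.items():
    print(f"{column}: {ratio:.2f}x the next largest ({name})")
```

On the Table 1 numbers this gives roughly 4.2x the clips of VideoStory and 11.1x the source videos of Vatex-en, with smaller margins for captions, words and vocabulary.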

**Table 2. Spoken Moments caption statistics:** The total and average number of words, verbs, nouns and adjectives in our captions, as well as the number of unique examples of each. Our captions include a large number of unique verbs, nouns and adjectives.

| Type | Total | Average | Unique |
|---|---:|---:|---:|
| Words | 5,618,064 | 18.01 | 50,570 |
| Verbs | 492,941 | 1.58 | 7,436 |
| Nouns | 1,365,305 | 4.37 | 20,645 |
| Adjectives | 386,039 | 1.24 | 12,523 |
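
As a rough illustration of how the word-level statistics in Table 2 can be computed from ASR transcriptions, here is a minimal sketch. It assumes simple lowercase regex tokenization, which may differ from the tokenization used for the paper, and it omits the verb, noun and adjective counts because those require a part-of-speech tagger. The function name `caption_stats` and the toy captions are hypothetical.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase runs of letters, digits and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def caption_stats(captions):
    """Return total words, average words per caption and vocabulary size."""
    vocabulary = Counter()
    total_words = 0
    for caption in captions:
        tokens = TOKEN_PATTERN.findall(caption.lower())
        total_words += len(tokens)
        vocabulary.update(tokens)
    average_words = total_words / len(captions) if captions else 0.0
    return {
        "total_words": total_words,
        "average_words": average_words,
        "unique_words": len(vocabulary),
    }


# Toy transcriptions standing in for the real ASR output.
toy_captions = [
    "a man is walking a dog in the park",
    "two children are playing with a ball",
]
stats = caption_stats(toy_captions)
print(stats)
```

The same function can be run over the full set of transcriptions once they are loaded as a list of strings.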

**Table 3. Coverage of our class vocabulary:** We show the percentage of the class vocabulary from different datasets that occurs in our captions. Compared to large-scale image and video datasets, our captions cover a large number of the classes in object, scene and action datasets commonly used for training visual models.

| Type | Dataset | Coverage |
|---|---|---:|
| Objects | ImageNet | 69.2% |
| Objects | MS-COCO | 100% |
| Actions | Kinetics | 85.1% |
| Actions | Moments in Time | 96.2% |
| Scenes | Places365 | 47.4% |
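
A coverage percentage like those in Table 3 can be sketched as follows. The matching rule here is an assumption: a class counts as covered when every word of its name appears somewhere in the caption vocabulary, with underscores treated as spaces. The rule actually used for Table 3 may differ, and the function name `class_coverage` and the toy inputs are hypothetical.

```python
import re

TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text, treat underscores as spaces and split into tokens."""
    return TOKEN_PATTERN.findall(text.lower().replace("_", " "))


def class_coverage(class_names, captions):
    """Return (percentage of classes covered, list of covered class names)."""
    vocabulary = set()
    for caption in captions:
        vocabulary.update(tokenize(caption))
    covered = []
    for name in class_names:
        tokens = tokenize(name)
        # Assumed rule: every word of the class name must occur in the vocabulary.
        if tokens and all(token in vocabulary for token in tokens):
            covered.append(name)
    percentage = 100.0 * len(covered) / len(class_names) if class_names else 0.0
    return percentage, covered


toy_captions = [
    "a man is walking a dog in the park",
    "two children are playing with a ball",
]
toy_classes = ["dog", "ball", "hot_dog", "giraffe"]
coverage, covered = class_coverage(toy_classes, toy_captions)
print(f"{coverage:.1f}% of classes covered: {covered}")
```

A stricter variant could require multi-word class names to appear as contiguous phrases within a single caption.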

Annotation

Each Amazon Mechanical Turk (AMT) worker is asked to record themselves describing 10 different videos. Each target video is shown on the left of the screen, while a video with an example text description is shown on the right. The example shows workers the type of description we are looking for and the amount of detail we expect. It stays on the right side of the screen throughout the task, while the target videos on the left cycle as the worker completes each description.

Below each target video is a button that allows the worker to start recording their voice as they describe the video. Once they press this button, the video is removed from the screen and the recording starts. We block the worker from seeing the video while recording to ensure that the descriptions are concise and pertain only to the important events that stand out in their memory.

Acknowledgements

This work was supported by the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside, as well as the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00341.