Moments is a research project in development by the MIT-IBM Watson AI Lab. The project is dedicated to building a very large-scale dataset to help AI systems recognize and understand actions and events in videos.
Today, the dataset includes a collection of one million labeled 3-second videos involving people, animals, objects or natural phenomena, each capturing the gist of a dynamic scene.
Three-second events capture an ecosystem of changes in the world: three seconds are enough to convey meaningful information about how agents (human, animal, artificial or natural) transform from one state to another.
The dataset is designed to have large inter-class and intra-class variation, representing dynamic events at different levels of abstraction (e.g. "opening" doors, drawers, curtains, presents, eyes, mouths, and even flower petals).
Moments is a large-scale, human-annotated video dataset capturing visual and/or audible actions produced by humans, animals, objects or nature; together, these actions allow for the creation of compound activities occurring at longer time scales.
Supervised tasks that cover a large part of the visual and auditory ecosystem help construct powerful but flexible feature detectors, allowing models to quickly transfer learned representations to novel domains.
Can we understand what models attend to during a prediction?
Here, we show the areas of the video frames that our neural network focuses on in order to recognize the event in the video. These visualizations show the network's ability to locate the most important areas of each video clip so that it can identify each moment.
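To make the idea of "areas the network is focusing on" concrete, here is a minimal sketch of one common way to produce such a heatmap, class activation mapping (CAM). This page does not say which visualization method is used, so CAM is an assumption made for illustration. The function name `class_activation_map`, the toy feature maps and the toy weights are all hypothetical, and the sketch uses plain Python lists rather than a deep learning framework.

```python
def class_activation_map(features, class_weights):
    """Weighted sum of feature maps, rescaled to [0, 1].

    features:      list of C channel maps, each an H x W nested list
                   (assumed to come from a network's last conv layer).
    class_weights: list of C weights linking each channel to the
                   predicted class (assumed to come from the classifier).
    """
    height = len(features[0])
    width = len(features[0][0])

    # Accumulate each channel's contribution to the target class.
    cam = [[0.0] * width for _ in range(height)]
    for channel, weight in zip(features, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]

    # Rescale so the most important location is 1 and the least is 0.
    lowest = min(min(row) for row in cam)
    highest = max(max(row) for row in cam)
    span = highest - lowest
    if span == 0:
        return [[0.0] * width for _ in range(height)]
    return [[(value - lowest) / span for value in row] for row in cam]


# Toy input: two 2x2 feature maps for a single frame (illustrative values).
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
class_weights = [1.0, 2.0]
heatmap = class_activation_map(features, class_weights)
```

In this toy case the bottom-right location gets the highest score, so it is the region that would be highlighted when the heatmap is overlaid on the frame. For a video clip, the same computation would be repeated for each sampled frame.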
Mathew Monfort, Bolei Zhou, Sarah Adel Bargal,
Tom Yan, Alex Andonian, Kandan Ramakrishnan, Lisa Brown,
Quanfu Fan, Dan Gutfreund, Carl Vondrick, Aude Oliva
To obtain the dataset, please contact email@example.com.