Deep Segments: Comparisons between Scenes and their Constituent Fragments using Deep Learning

Doshi, Jigar
Mason, Celeste
Wagner, Alan
Kira, Zsolt
We examine the problem of visual scene understanding and abstraction from first-person video. This is an important problem: successful approaches would enable complex scene-characterization tasks that go beyond classification, for example describing novel scenes in terms of previously encountered visual experiences. Our approach uses the final layer of a convolutional neural network as a high-level, scene-specific representation that is robust enough to noise to be used with wearable cameras. Researchers have demonstrated the use of convolutional neural networks for object recognition. Inspired by results from cognitive science and neuroscience, we use the output maps produced by a convolutional neural network as a sparse, abstract representation of visual images. Our approach abstracts scenes into constituent segments that can be characterized by the spatial and temporal distribution of objects. We demonstrate the viability of the system on video taken from Google Glass. Experiments examining the system's ability to judge scene similarity show a correlation of ρ(384) = 0.498 with human evaluations and 90% accuracy on a category-match problem. Finally, we demonstrate high-level scene prediction by showing that the system matches two scenes using only a few initial segments and predicts objects that will appear in subsequent segments.
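The abstract describes comparing scenes via CNN-derived segment representations. As a minimal sketch of that idea, the snippet below compares two scenes segment by segment using cosine similarity of per-segment feature vectors. The random vectors, the 4096-dimensional size, and the averaging of per-segment similarities are all illustrative assumptions standing in for the paper's actual CNN output maps and comparison metric, which the abstract does not specify in detail.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def scene_similarity(scene_a, scene_b):
    # Compare two scenes segment by segment and average the
    # per-segment cosine similarities (a simple aggregation choice,
    # not necessarily the one used in the paper).
    n = min(len(scene_a), len(scene_b))
    return sum(cosine_similarity(scene_a[i], scene_b[i]) for i in range(n)) / n

# Stand-in features: in practice each vector would be the CNN's
# final-layer activations for one video segment.
rng = np.random.default_rng(0)
scene_a = [rng.random(4096) for _ in range(3)]
scene_b = [rng.random(4096) for _ in range(3)]

# A scene compared with itself scores (numerically) 1.0.
print(scene_similarity(scene_a, scene_a))
print(scene_similarity(scene_a, scene_b))
```

Because comparison is per-segment, the same function supports the prediction use case sketched in the abstract: matching only the first few segments of a new scene against stored scenes, then reading off the objects associated with the matched scene's later segments.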