Joint Semantic Segmentation and 3D Reconstruction from Monocular Video
Abstract
We present an approach for joint inference of 3D scene structure and semantic
labeling from monocular video. Starting from a monocular image stream, our
framework produces a 3D volumetric semantic and occupancy map, which is much
more useful than the series of 2D semantic label images or the sparse point
cloud produced by traditional semantic segmentation and Structure from Motion
(SfM) pipelines, respectively. We derive a Conditional Random Field (CRF) model
defined in 3D space that jointly infers the semantic category and occupancy of
each voxel. Joint inference in the 3D CRF allows more informed priors and
constraints than are possible when the two problems are solved separately in
their traditional frameworks. We use class-specific semantic cues to constrain
the 3D structure in areas where multiview constraints are weak. Our model
comprises higher-order factors, which help when the depth is unobservable. We
also use class-specific semantic cues to reduce the degree of such higher-order
factors or, where possible, to approximate them with unaries. We demonstrate
improved 3D structure and temporally consistent semantic segmentation on
difficult, large-scale, forward-moving monocular image sequences.
Date
2014-09
Resource Type
Text
Resource Subtype
Book Chapter
Proceedings