Interactive Videos: Plausible Video Editing using Sparse Structure Points

Chia-Sheng Chang
National Tsing Hua University

Hung-Kuo Chu
National Tsing Hua University

Niloy J. Mitra
University College London

Computer Graphics Forum (Proc. of Eurographics 2016)

Abstract

Video remains the method of choice for capturing temporal events. However, without access to the underlying 3D scene models, it remains difficult to make object-level edits in a single video or across multiple videos. While it may be possible to explicitly reconstruct the 3D geometry to facilitate these edits, such a workflow is cumbersome, expensive, and tedious. In this work, we present a much simpler workflow for plausibly editing and mixing raw video footage using only sparse structure points (SSPs) recovered directly from the raw sequences. First, we use user scribbles to structure the point representations obtained by running structure-from-motion on the input videos. The resulting structure points, even when noisy and sparse, are then used to enable various video edits in 3D, including view perturbation, keyframe animation, object duplication, and transfer across videos. Specifically, we describe how to synthesize object images from new views by adopting a novel image-based rendering technique that uses the SSPs as a proxy for the missing 3D scene information. We propose a structure-preserving image warping on multiple input frames adaptively selected from the object video, followed by a spatio-temporally coherent image stitching to compose the final object image. Simple planar shadows and depth maps are synthesized for the objects to generate a plausible video sequence mimicking real-world interactions. We demonstrate our system on a variety of input videos to produce complex edits that are otherwise difficult to achieve.

Algorithm

Overview: (a) Given the input videos, the system, assisted by the user, starts by (b) organizing the recovered 3D information into a set of sparse structure points (SSPs). (c) Such SSPs, when organized properly, enable the user to perform various object-level edits. The system then re-renders the edited objects for each target frame using a novel image-based rendering technique that runs in three stages: (d) first, multiple input frames are adaptively retrieved from the object video; (e) these frames are then warped and stitched to form the final object image; (f) finally, the object images, along with synthesized shadows and depth maps, are blended to generate a plausible video sequence.
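Stage (d) above retrieves input frames adaptively for each target view. The overview does not state the selection criterion, so the following is only a minimal sketch under one plausible assumption: frames are ranked by the angle between their viewing direction toward the object and that of the target camera. The function names (`view_dir`, `select_frames`), the parameter `k`, and the use of a single object center are all hypothetical and not taken from the paper.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def view_dir(cam_pos, obj_center):
    """Unit direction from a camera position toward the object center."""
    return unit(tuple(o - c for o, c in zip(obj_center, cam_pos)))

def select_frames(cam_positions, obj_center, target_cam, k=3):
    """Pick the k input frames whose viewing direction is closest
    (smallest angle) to that of the target camera.

    Assumption: angular proximity is the ranking criterion; the
    actual system may use a different or richer measure.
    """
    target = view_dir(target_cam, obj_center)

    def angle(i):
        d = view_dir(cam_positions[i], obj_center)
        dot = sum(a * b for a, b in zip(d, target))
        # Clamp to guard acos against floating-point drift.
        return math.acos(max(-1.0, min(1.0, dot)))

    return sorted(range(len(cam_positions)), key=angle)[:k]

# Four input cameras around an object at the origin.
cams = [(0, 0, -5), (5, 0, 0), (1, 0, -5), (0, 0, 5)]
picked = select_frames(cams, (0, 0, 0), target_cam=(0.4, 0, -5), k=2)
```

In this toy setup the two cameras nearly facing the object from the same side as the target view are selected, while the side and rear cameras are skipped.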

Results

(Top row) Six input videos and their SSPs. Below are eight video sequences generated by our system using mixed 3D manipulations, e.g., shuffling, keyframe animation (4-9th rows), and duplication (8-9th rows). For each result, we show four representative frames in which shadows and inter-object occlusions are handled plausibly by our system. See the supplementary video for the complete sequences.
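Inter-object occlusion of the kind shown in these results can be resolved once each object image has a synthesized depth map. The page does not describe the blending step in detail, so the sketch below is an illustrative assumption: a plain per-pixel depth test over masked object layers, with images stored as nested lists. The function name `composite` and the `(color, depth, mask)` layer layout are hypothetical.

```python
def composite(background, bg_depth, layers):
    """Composite object layers over a background with a per-pixel depth test.

    background, bg_depth: 2D lists (rows of pixels / depth values).
    layers: iterable of (color, depth, mask) triples, each a 2D list
            with the same shape as the background; mask is truthy where
            the object covers the pixel.
    Smaller depth means closer to the camera (an assumed convention).
    """
    height, width = len(background), len(background[0])
    out = [row[:] for row in background]   # copy so inputs stay untouched
    zbuf = [row[:] for row in bg_depth]
    for color, depth, mask in layers:
        for y in range(height):
            for x in range(width):
                if mask[y][x] and depth[y][x] < zbuf[y][x]:
                    out[y][x] = color[y][x]
                    zbuf[y][x] = depth[y][x]
    return out

# A 1x3 image with two overlapping objects, A and B.
bg = [["bg", "bg", "bg"]]
bg_z = [[10.0, 10.0, 10.0]]
layer_a = ([["A", "A", "A"]], [[5.0, 5.0, 5.0]], [[1, 1, 0]])
layer_b = ([["B", "B", "B"]], [[3.0, 7.0, 3.0]], [[0, 1, 1]])
result = composite(bg, bg_z, [layer_a, layer_b])
```

Because the test is per pixel, the result does not depend on the order in which layers are supplied, as long as no two layers have exactly equal depth at a pixel.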

Video

Acknowledgement

We thank the anonymous reviewers for their invaluable comments, suggestions, and additional references. The project was supported in part by the Ministry of Science and Technology of Taiwan (102-2221-E-007-055-MY3 and 103-2221-E-007-065-MY3) and the ERC Starting Grant SmartGeometry (StG-2013-335373).