Editing Video by Recovering Scene Structure
How do we let users edit captured video meaningfully—by first recovering the scene structure (moving objects, lighting vs reflectance, cross-frame consistency) that makes plausible modifications possible?
Editing video is harder than editing a photograph: changes to one frame must propagate consistently to every other, and many edits (removing a person, separating lighting from material, stabilising flicker) require understanding the underlying scene rather than just manipulating pixels. We approach editing as inverse reconstruction: decompose video into scene structure first, then edit.
This thread spans my doctoral and postdoc years across UCL, MPI-Inf, Harvard, and LIRIS-CNRS. My earliest piece (2011, UCL) is the cinemagraphs authoring tool—a moment image isolated from a stabilised clip. Miguel Granados led the video-inpainting work at MPI-Inf (2012)—removing dynamic objects from crowded scenes, and the harder case of background recovery under a free-moving camera. Nicolas Bonneel led the consistency and decomposition line (2014–2017)—interactive intrinsic decomposition, blind temporal consistency stabilising any per-frame filter, and the spatio-temporal extension to camera arrays. Our 2016 multicut paper takes a different angle on the same theme: cut the video into the right regions before editing.
Authors
Bjoern Andres · Nicolas Bonneel · Miguel Granados · Oliver Grau · Jan Kautz · Kwang In Kim · Steffen Kirchhoff · Evgeny Levinkov · Sylvain Paris · Fabrizio Pece · Hanspeter Pfister · Kartic Subr · Kalyan Sunkavalli · Deqing Sun · Christian Theobalt · Oliver Wang
Papers in this thread
European Conference on Visual Media Production (CVMP), 2011
An authoring tool that pipelines stabilisation, segmentation, motion selection, and loop detection to produce cinemagraphs—short looping clips where only a chosen region moves.
Computer Graphics Forum (Eurographics), 2012
Object removal from crowded scenes by filling the spatio-temporal hole from other regions of the video where the occluded background was visible, posed as a graph-cut optimisation. Pitched at occlusions harder than previous work had attempted.
European Conference on Computer Vision (ECCV), 2012
Inpaints background revealed by removing dynamic objects from a free-moving-camera video by aligning candidate frames with piecewise planar homographies—sidestepping the full per-frame depth and pose recovery that earlier free-camera methods required.
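To make the alignment step concrete, here is a minimal sketch of how a piecewise planar model maps a pixel from one frame into another: each image region is assigned its own 3x3 homography, and a point is warped with the homography of the region it falls in. The region function, the matrices, and the helper names are hypothetical and chosen only for illustration; how the paper estimates regions and homographies is not shown here.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (nested lists, row-major)."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    wh = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return (xh / wh, yh / wh)


def warp_piecewise(point, region_of, homographies):
    """Warp a point with the homography of the planar region containing it.

    region_of: hypothetical function mapping a point to a region label.
    homographies: dict from region label to a 3x3 homography.
    """
    return apply_homography(homographies[region_of(point)], point)


# Illustrative setup: two planar regions split at x = 100,
# each moving by a different translation between frames.
homographies = {
    "left": [[1.0, 0.0, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "right": [[1.0, 0.0, -3.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]],
}


def region_of(point):
    return "left" if point[0] < 100 else "right"
```

A single global homography would force both regions to share one motion; the per-region lookup is what lets differently oriented planes in the background align separately.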
Transactions on Graphics (SIGGRAPH Asia), 2014
Decomposes video into reflectance and illumination via a hybrid L2-Lp gradient split. It runs two orders of magnitude faster than prior tools, fast enough to support interactive refinement and lighting-aware compositing.
Transactions on Graphics (SIGGRAPH Asia), 2015
A gradient-domain post-process that stabilises any per-frame filter against flicker by borrowing temporal regularity from the unprocessed video—agnostic to what the filter actually is. Demonstrated across stylisation, intrinsic decomposition, and depth.
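As a rough illustration of this idea, the toy sketch below works on 1D "frames" under a static camera, so no motion compensation is needed. For each frame it solves a small least-squares problem that keeps the spatial gradients of the processed frame while pulling the result towards the previous output wherever the unprocessed video is itself temporally stable. The function names, the exponential weight, and the parameters `lam` and `alpha` are assumptions made for this example, not the paper's implementation; pixel values are assumed to lie in [0, 1] and frames to have at least two samples.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system; lower[0] is unused, upper[-1] must be 0."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def laplacian_times(x):
    """Apply D^T D to x, where D takes forward differences of a 1D signal."""
    n = len(x)
    out = []
    for i in range(n):
        v = 0.0
        if i > 0:
            v += x[i] - x[i - 1]
        if i < n - 1:
            v += x[i] - x[i + 1]
        out.append(v)
    return out


def stabilise(original, processed, lam=10.0, alpha=50.0):
    """Minimise |D o - D p_t|^2 + sum_i w_i (o_i - prev_i)^2 frame by frame.

    original, processed: lists of equally long 1D frames (lists of floats).
    The normal equations (D^T D + W) o = D^T D p_t + W prev are tridiagonal.
    """
    n = len(processed[0])
    lower = [0.0] + [-1.0] * (n - 1)
    upper = [-1.0] * (n - 1) + [0.0]
    out = [[float(v) for v in processed[0]]]  # first frame is kept as-is
    for t in range(1, len(processed)):
        # Assumed weight: strong where the unprocessed video barely changes.
        w = [lam * math.exp(-alpha * (original[t][i] - original[t - 1][i]) ** 2)
             for i in range(n)]
        lp = laplacian_times(processed[t])
        rhs = [lp[i] + w[i] * out[-1][i] for i in range(n)]
        diag = [(1.0 if i in (0, n - 1) else 2.0) + w[i] for i in range(n)]
        out.append(solve_tridiagonal(lower, diag, upper, rhs))
    return out
```

In this sketch, a processed clip whose frames differ only by a per-frame brightness offset comes out with every frame equal to the first, because a constant offset has zero spatial gradient and the temporal term then fixes the absolute level. Nothing in `stabilise` inspects what produced `processed`, which is the sense in which the approach is blind to the filter.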
Pacific Graphics (Short Paper), 2016
Interactive multi-label video segmentation from multi-coloured scribbles, posed as a multicut on a supervoxel graph and solved fast enough to feel responsive. Multiple objects are cut at once with consistent spatio-temporal boundaries, rather than through chained binary segmentations.
Computer Graphics Forum (Eurographics), 2017
Extends the blind-consistency idea from the temporal domain to joint space and time, covering stereo, light-field, and wide-baseline rigs. It also adds a filter-transfer scheme that runs the expensive filter on a small subset of frames and propagates the effect to the rest, an order-of-magnitude saving for camera-array data.