Monocular Dynamic 3D Reconstruction

With only ordinary RGB video—no depth sensor, no rig—can we recover dynamic 3D scene geometry and motion?

Monocular dynamic 3D reconstruction takes video from a single moving camera observing a deforming scene and tries to recover a 4D representation that includes geometry, appearance, and motion. The problem is fundamentally under-constrained at any one instant, since each moment is seen from only one viewpoint, so progress depends on how well the chosen scene representation and supervision signals work together.

We've approached this in two ways. First, per-scene methods fit a representation to a single video. Here we study which motion models and regularisations help (GauFRe, MonoDyGauBench) and which additional signals can resolve the ambiguity, such as semantics, attention, and optical flow supervision (SAFF). Second, Zero-MSF is data-driven: a feed-forward model trained on about a million synthetic examples that transfers zero-shot to real video, with no per-scene fitting.
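To make the ambiguity concrete, here is a minimal sketch, not taken from any of the papers in this thread, of how 3D scene flow relates to the 2D optical flow a camera observes. It assumes a pinhole camera that stays fixed between the two frames; the function names and the intrinsics (`focal`, `cx`, `cy`) are hypothetical and chosen only for illustration. It shows two different depth and scene-flow hypotheses that produce the same optical flow, which is why 2D cues alone cannot pin down the 3D motion.

```python
# Illustrative only: pinhole camera, no camera motion between frames.
# All names and numeric values here are assumptions for the example.

def backproject(pixel, depth, focal, cx, cy):
    """Lift a pixel (u, v) with known depth to a 3D point in camera space."""
    u, v = pixel
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)


def project(point, focal, cx, cy):
    """Project a 3D camera-space point to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def induced_optical_flow(pixel, depth, scene_flow, focal, cx, cy):
    """2D displacement of a pixel caused by moving its 3D point by scene_flow."""
    p0 = backproject(pixel, depth, focal, cx, cy)
    p1 = tuple(a + b for a, b in zip(p0, scene_flow))
    u1, v1 = project(p1, focal, cx, cy)
    return (u1 - pixel[0], v1 - pixel[1])


# Two hypotheses for the same pixel: a near point moving slowly and
# a point twice as far away moving twice as fast.
pixel = (420.0, 240.0)
intrinsics = dict(focal=500.0, cx=320.0, cy=240.0)

flow_near = induced_optical_flow(pixel, 2.0, (0.1, 0.0, 0.0), **intrinsics)
flow_far = induced_optical_flow(pixel, 4.0, (0.2, 0.0, 0.0), **intrinsics)

print(flow_near, flow_far)  # both hypotheses give the same 2D flow
```

Both hypotheses explain the observed optical flow equally well, so something else has to break the tie: a motion model, a regulariser, or a prior learned from data.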

Authors

Abhishek Badki · Orazio Gallo · Leonidas J. Guibas · Adam Harley · Numair Khan · Eliot Laidlaw · Douglas Lanman · Yiqing Liang · Runfeng Li · Zhengqin Li · Alexander Meyerowitz · Thu Nguyen-Phuoc · Mikhail Okunev · Srinath Sridhar · Hang Su · Mikaela Angelina Uy · Lei Xiao

Papers in this thread

Semantic Attention Flow Fields for Monocular Dynamic Scene Decomposition

International Conference on Computer Vision (ICCV), 2023

Reconstructs a 4D neural volume carrying not just colour and density but also scene flow, semantics, and attention, then uses the latter two to decompose foreground objects from background across spacetime without supervision.

GauFRe🧇: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis

arXiv (Dec. 2023) + WACV, 2025

Casts monocular dynamic reconstruction as a canonical Gaussian template plus a forward-warping deformation field, with a separate static component initialised to absorb non-moving regions so the deformation focuses on what actually moves. Trains in roughly twenty minutes and renders in real time.

Monocular Dynamic Gaussian Splatting: Fast, Brittle, and Scene Complexity Rules

Transactions on Machine Learning Research (TMLR), 2025

An apples-to-apples benchmark of monocular dynamic Gaussian splatting methods, categorised by motion representation. Method differences are resolvable on synthetic data but get swamped by real-world scene complexity, and the optimisation is uniformly brittle.

Zero-Shot Monocular Scene Flow Estimation in the Wild

Computer Vision and Pattern Recognition (CVPR), 2025

A feed-forward model that jointly predicts geometry and scene flow, trained with a data recipe of one million synthetic samples. It generalises zero-shot to casual DAVIS video and RoboTAP manipulation scenes, with no per-scene optimisation required.
