When the input is only ordinary RGB video — no depth sensor, no rig — can we recover dynamic 3D scene geometry well enough to compete with depth-sensor-supervised methods?
Monocular dynamic 3D reconstruction takes video from a single moving camera observing a deforming scene and tries to recover a complete 4D representation (geometry, appearance, and motion) over the captured time window. The problem is fundamentally under-constrained at any one instant, since each moment of the scene is seen from only one viewpoint, so progress depends on how well the chosen scene representation and the supervision signals work together.
Yiqing Liang's PhD work has driven this line of research. Starting from semantic attention flow fields built atop a dynamic NeRF (ICCV 2023), the work moved to a forward-warping Gaussian deformation formulation (GauFRe, with Meta colleagues) for real-time rendering, then to a TMLR benchmark (MonoDyGauBench) that puts the recent flood of monocular dynamic Gaussian methods on a like-for-like footing. The latest paper (Zero-MSF, with NVIDIA) abandons per-scene optimization entirely and trains a feed-forward predictor for scene flow that generalizes zero-shot to in-the-wild video.
Abhishek Badki · Orazio Gallo · Leonidas J. Guibas · Adam Harley · Numair Khan · Eliot Laidlaw · Douglas Lanman · Yiqing Liang · Runfeng Li · Zhengqin Li · Alexander Meyerowitz · Thu Nguyen-Phuoc · Mikhail Okunev · Srinath Sridhar · Hang Su · Mikaela Angelina Uy · Lei Xiao