Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision -- expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences (`flow') as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
Modern visual-geometry networks first encode each image into patch tokens and a camera token (a) using a multi-view transformer backbone. Based on these latent features, there are several ways to predict dense correspondences between frames. Traditional correspondence heads (b) infer flow directly from patch features, relying purely on visual appearance and ignoring the underlying scene geometry. Alternatively, one may compute flow by explicitly projecting predicted 3D points into another view using decoded camera poses (c); however, this approach assumes static scenes and is highly sensitive to geometric prediction errors. In contrast, our factored flow mechanism (d) combines the geometry latents from the source view with the camera latents from the target view and decodes correspondences directly in latent space. This design yields geometry-aware flow, improves robustness, and naturally extends to dynamic scenes.
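The projective baseline (c) can be made concrete with a short sketch. This is a generic pinhole reprojection, not Flow3r's code: predicted 3D points from view 1 are transformed by view 2's pose, projected with the intrinsics, and the displacement from the source pixels gives the flow. The function name and the shared-intrinsics assumption are illustrative; the sketch also makes the baseline's weaknesses visible, since it presumes a static scene and passes any geometry or pose error straight into the flow.

```python
import numpy as np

def projective_flow(points_3d, pixels, K, R, t):
    """Flow from view 1 to view 2 by explicitly reprojecting predicted
    3D points (in the view-1 camera frame) with view 2's relative pose.
    Assumes a static scene; errors in geometry or pose propagate
    directly into the resulting flow.

    points_3d: (N, 3) predicted 3D points in the view-1 camera frame
    pixels:    (N, 2) their pixel coordinates in view 1
    K:         (3, 3) intrinsics (assumed shared by both views here)
    R, t:      (3, 3) rotation and (3,) translation of view 2 w.r.t. view 1
    """
    cam2 = points_3d @ R.T + t        # transform points into the view-2 frame
    proj = cam2 @ K.T                 # apply the pinhole intrinsics
    uv2 = proj[:, :2] / proj[:, 2:3]  # perspective divide -> view-2 pixels
    return uv2 - pixels               # flow = reprojected - source pixels
```

With an identity relative pose the flow is zero, and a small lateral translation shifts points by f·tx/z pixels, which illustrates the depth sensitivity that makes this design fragile.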
Flow3r predicts visual geometry using factored flow supervision, enabling scalable geometry learning from unlabeled videos. Each input image is encoded and processed by the multi-view transformer to produce camera tokens and patch tokens. For data with dense geometry and pose labels, we directly supervise the patch tokens and camera tokens with the corresponding labels. For dynamic datasets, we predict flow between two frames in a factorized manner, supervised by an off-the-shelf 2D flow prediction model, UFM [1]. To obtain the factored flow, we fuse the patch features of one frame with the camera features of the other, and decode the fused representation through the DPT head to produce dense flow predictions.
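The fusion step above can be sketched as follows. This is a minimal toy illustration, not the actual Flow3r implementation: the weight matrices (`W_fuse`, `W_dec`) and the concatenate-then-MLP fusion are hypothetical stand-ins, and the real model decodes through a DPT head to dense per-pixel flow rather than per-patch vectors. What the sketch does preserve is the factorization: geometry latents (patch tokens) come from the source frame, while the pose latent (camera token) comes from the target frame.

```python
import numpy as np

def fuse_and_decode(patch_tokens_src, cam_token_tgt, W_fuse, W_dec):
    """Toy sketch of factored flow: broadcast the target frame's camera
    token to every source patch, concatenate with the source patch
    tokens, and decode the fused representation to a 2D flow per patch.
    Weight names are illustrative, not actual Flow3r parameters.
    """
    n = patch_tokens_src.shape[0]
    cam = np.broadcast_to(cam_token_tgt, (n, cam_token_tgt.shape[-1]))
    fused = np.concatenate([patch_tokens_src, cam], axis=-1) @ W_fuse
    fused = np.maximum(fused, 0.0)  # ReLU as a simple nonlinearity stand-in
    return fused @ W_dec            # (n_patches, 2) flow vectors

# toy shapes: 16 patches, 32-dim patch tokens, 8-dim camera token
rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 32))
cam = rng.standard_normal(8)
W_fuse = rng.standard_normal((40, 64)) * 0.1
W_dec = rng.standard_normal((64, 2)) * 0.1
flow = fuse_and_decode(patches, cam, W_fuse, W_dec)  # shape (16, 2)
```

Because the prediction lives entirely in latent space, the flow loss can back-propagate into both the geometry latents of one frame and the pose latents of the other, which is what lets a single 2D supervision signal shape both quantities.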
We first compare our factored prediction paradigm against training with only labeled 3D data, and against alternative flow prediction designs, on both static and dynamic scenes. We denote the `base model' trained with only 3D supervision as 3d-sup. Building upon this `no-flow' baseline, we study three variants that incorporate additional unlabeled data via flow supervision using different formulations: (1) flow-projective computes flow explicitly from predicted camera poses and pointmaps via projective geometry; (2) flow-tracking adopts a VGGT-style tracking head based on pairwise patch features; (3) flow-factored applies our proposed factored flow prediction formulation. All models share a (small) VGGT-like architecture and are trained from scratch.
We find that on both static and dynamic scenes, our proposed factored flow prediction (flow-factored) yields consistent gains over the model variant without additional unlabeled data (3d-sup) as well as other flow-supervised alternatives. Qualitatively, the figure below shows that flow-factored yields cleaner reconstructions than 3d-sup while also improving over other flow mechanisms.
To investigate the scaling behavior of the proposed factored flow prediction, we progressively increase the number of unlabeled dynamic videos used for flow supervision on SpatialVID. Specifically, we keep the 3D-labeled OmniWorld set fixed (1K sequences) and scale the number of SpatialVID sequences used to apply the flow loss (3K, 10K, and 20K sequences). For reference, we also train a model using only additional labeled data by increasing OmniWorld to 4K sequences without any unlabeled videos. The results show that scaling the total number of training sequences to a larger regime (e.g., 10X or 20X the amount used in the 3d-sup no-flow baseline) yields consistent improvements. Notably, using 20K unlabeled videos together with 1K labeled sequences outperforms training with 4K labeled sequences alone.
Here we scale the training of an off-the-shelf large visual geometry network (pi3) by leveraging our factored flow prediction strategy with unlabeled dynamic data. We evaluate performance using pose accuracy and reconstruction metrics on four dynamic datasets (Kinetics700, Epic-Kitchens, Sintel, and Bonn) and four static datasets (Co3Dv2, ScanNet, NRGBD, and 7-Scenes). Best, second-best, and third-best results are highlighted in light red, orange, and yellow, respectively. Flow3r consistently outperforms state-of-the-art methods in both camera pose estimation and scene reconstruction, demonstrating the effectiveness of leveraging large-scale unlabeled videos for visual geometry learning via factored-flow supervision.
In this work, we present Flow3r and demonstrate that it effectively leverages in-the-wild unlabeled data by introducing factored flow prediction, advancing visual geometry learning beyond existing fully supervised methods. While our approach opens up new possibilities, several challenges remain.
First, Flow3r relies on off-the-shelf models to provide pseudo-ground-truth flow supervision, and there may be domains where such 2D prediction fails, limiting Flow3r's performance upper bound. Second, although our factored flow formulation elegantly handles dynamic scenes and enables flow supervision to improve the learning of both camera motion and scene geometry, Flow3r may struggle in complex scenes with multiple independently moving components. Finally, our current experiments operate at a moderate scale (~800K video sequences for flow supervision), and scaling to truly large-scale settings (~10-100M videos) presents an exciting but unexplored direction. While this is out of scope for our work due to computational constraints, we envision Flow3r's formulation serving as a building block for future large-scale learning methods.
We thank the members of the Physical Perception Lab at CMU for their valuable discussions.
This work was supported by an NVIDIA academic grant. This work used Bridges-2 at Pittsburgh Supercomputing Center through allocation CIS250061 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. This work was supported by Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number 140D0423C0074. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.