Future Optical Flow Prediction Improves Robot Control & Video Generation

Abstract

Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
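For readers unfamiliar with the term, a dense optical flow field assigns every pixel a 2D displacement between frames. The sketch below illustrates only that data layout. The names `constant_flow` and `advect_point` are hypothetical and not part of FOFPred, whose forecasts come from a learned model rather than a hand-set field.

```python
from typing import List, Tuple

# Assumed layout for illustration: flow[y][x] = (dx, dy), a displacement in pixels.
Flow = List[List[Tuple[float, float]]]


def constant_flow(height: int, width: int, dx: float, dy: float) -> Flow:
    """Build a dense flow field in which every pixel moves by the same (dx, dy)."""
    return [[(dx, dy) for _ in range(width)] for _ in range(height)]


def advect_point(flow: Flow, x: int, y: int) -> Tuple[float, float]:
    """Return where the pixel at (x, y) lands after applying its flow vector."""
    dx, dy = flow[y][x]
    return (x + dx, y + dy)


# A toy 4x6 field standing in for a forecast: everything drifts right and slightly up.
flow = constant_flow(height=4, width=6, dx=1.5, dy=-0.5)
next_pos = advect_point(flow, x=2, y=3)
```

Because the field holds one vector per pixel, it is "spatially dense": a forecasting model has to output a full height-by-width grid of displacements rather than a handful of keypoint tracks.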

