TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments
February 2, 2026
Authors: Zhiyu Huang, Yun Zhang, Johnson Liu, Rui Song, Chen Tang, Jiaqi Ma
cs.AI
Abstract
Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, even though semantic inference is inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Project website: https://ucla-mobility.github.io/TIC-VLA/
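
For concreteness, below is a minimal Python sketch of the two ideas the abstract describes: an action interface conditioned on the current observation, a stale semantic state, and explicit latency metadata, and a training-time rollout that injects a sampled reasoning delay. All names here (SemanticState, policy_act, rollout_with_injected_delay) and the stand-in reasoner, observation, and dynamics are hypothetical illustrations of the described interface, not the authors' implementation or API.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class SemanticState:
        # Vision-language semantic state plus the time reasoning finished.
        embedding: np.ndarray
        timestamp: float

    def policy_act(obs, semantic, now):
        # Delayed semantic-control interface: condition the action on the
        # current observation, the stale semantic state, and the explicit
        # latency metadata. A real model would use a learned network here.
        latency = now - semantic.timestamp
        features = np.concatenate([obs, semantic.embedding, [latency]])
        return np.tanh(features[:2])  # placeholder 2-DoF velocity command

    def rollout_with_injected_delay(steps=100, control_dt=0.05, delay=(0.5, 3.0)):
        # Latency-consistent rollout: a sampled multi-second reasoning delay
        # is injected, so the policy trains in the same stale-semantics
        # regime it will face under asynchronous deployment.
        rng = np.random.default_rng(0)
        reasoner = lambda o: 0.5 * o              # stand-in for a slow VLM
        obs, t = rng.standard_normal(4), 0.0      # stand-in observation
        semantic = SemanticState(reasoner(obs), timestamp=t)
        queued, ready_at = obs, t + rng.uniform(*delay)
        for _ in range(steps):
            if t >= ready_at:                     # delayed reasoning result arrives
                semantic = SemanticState(reasoner(queued), timestamp=t)
                queued, ready_at = obs, t + rng.uniform(*delay)
            action = policy_act(obs, semantic, now=t)  # control runs every tick
            obs = obs + 0.1 * rng.standard_normal(4)   # stand-in dynamics
            t += control_dt
        return action

The key design point the sketch illustrates is that the reasoning latency is passed to the policy as an explicit input rather than hidden from it, which is what allows the controller to compensate for asynchronous reasoning.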