CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Abstract

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, and the resulting computational overhead limits their scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that uses instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building on this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that the compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and on real-world robotic tasks show that CogVLA achieves state-of-the-art performance, with success rates of 97.4% and 70.0%, respectively, while reducing training costs 2.5-fold and inference latency 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
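The three stages named in the abstract can be illustrated with a minimal Python sketch. This is not the CogVLA implementation: the function names `film_modulate`, `prune_tokens`, and `coupled_attention_mask` are hypothetical, and how the modulation parameters and relevance scores are derived from the instruction is an assumption left outside the sketch. It shows only the generic operations the abstract refers to: FiLM-style scale-and-shift of features, keeping the highest-scoring tokens, and a mask that is causal over vision-language tokens and fully visible for action tokens.

```python
def film_modulate(features, gamma, beta):
    """FiLM: scale and shift each feature channel of every token.

    gamma and beta are assumed to come from the instruction embedding;
    how they are produced is not specified here.
    """
    return [[g * x + b for x, g, b in zip(token, gamma, beta)]
            for token in features]


def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving original order.

    scores are assumed to be instruction-relevance scores (hypothetical).
    """
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


def coupled_attention_mask(num_context, num_action):
    """Build a boolean mask; mask[i][j] is True if query i may attend to key j.

    The first num_context positions are vision-language tokens (causal).
    The last num_action positions are action tokens, which see every
    context token and every other action token (bidirectional).
    """
    total = num_context + num_action
    mask = []
    for i in range(total):
        if i < num_context:
            row = [j <= i for j in range(total)]  # causal: no look-ahead
        else:
            row = [True] * total  # action tokens attend everywhere
        mask.append(row)
    return mask


if __name__ == "__main__":
    print(film_modulate([[1.0, 2.0], [3.0, 4.0]], [2.0, 0.5], [0.0, 1.0]))
    print(prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], keep=2))
    for row in coupled_attention_mask(3, 2):
        print(row)
```

Because the action rows of the mask are fully visible, all action tokens could be decoded in one parallel pass rather than one at a time, which is one plausible reading of "bidirectional action parallel decoding".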

