DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Abstract

Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet they remain computationally expensive because of dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in the language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains more than 97% at 89% reduction. With this dual-stage compression applied during training, it achieves 99.7% accuracy at 67% reduction and 97.6% at 89% reduction, surpassing prior state-of-the-art visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline: it exceeds 100% of baseline accuracy with a substantial 53.1% token reduction and retains 97.6% accuracy under an extreme 93.4% reduction. These results show that end-to-end training with DUET-VLM enables robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.
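
The abstract names two stages but does not say how either one is computed, so the following is only an illustrative sketch of the general idea, not the DUET-VLM implementation. It assumes cosine-similarity merging for the vision-only stage and dot-product scoring against a single text query vector for the text-guided stage. The function names (`merge_redundant`, `drop_by_text`, `dual_stage`) and the parameters `threshold` and `keep_ratios` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_redundant(tokens, threshold):
    """Stage 1 (vision-only, assumed rule): fold each token into the first
    existing group whose first member has cosine similarity above `threshold`;
    otherwise start a new group. Each group is returned as its mean vector."""
    groups = []
    for tok in tokens:
        for group in groups:
            if cosine(group[0], tok) > threshold:
                group.append(tok)
                break
        else:
            groups.append([tok])
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]


def drop_by_text(tokens, text_query, keep_ratio):
    """Stage 2 (text-guided, assumed rule): keep the top `keep_ratio` fraction
    of tokens ranked by dot product with `text_query`, in original order."""
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    scores = [sum(x * y for x, y in zip(tok, text_query)) for tok in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:keep])]


def dual_stage(tokens, text_query, threshold=0.95, keep_ratios=(0.5, 0.5)):
    """Merge once, then prune once per hypothetical backbone layer.
    A real model would recompute hidden states between layers; this toy
    version reuses the same vectors throughout."""
    tokens = merge_redundant(tokens, threshold)
    for ratio in keep_ratios:
        tokens = drop_by_text(tokens, text_query, ratio)
    return tokens


if __name__ == "__main__":
    visual = [[1, 0], [1, 0.01], [0, 1], [0.7, 0.7], [-1, 0], [0, -1]]
    query = [1.0, 0.0]
    kept = dual_stage(visual, query)
    print(len(visual), "->", len(kept), "tokens")
```

In this toy setup the first stage removes only near-duplicate vectors, while the second stage discards tokens that score low against the text query, which mirrors the split between vision-only compression and text-guided dropping described in the abstract.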
