Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Abstract

We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model, Eagle 2.5-8B, achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.
