Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Abstract

We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges of long-video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model, Eagle 2.5-8B, achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.
