LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Abstract

Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths of up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To address this, we introduce LongWriter-V-22k, an SFT dataset of 22,158 examples, each comprising multiple input images, an instruction, and a corresponding output ranging from 0 to 10,000 words. Moreover, to obtain long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback on lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks for evaluating the long-generation capabilities of VLMs. Our 7B-parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models such as GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
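
The segment-wise pairing behind IterDPO can be illustrated with a minimal sketch. This is one plausible reading of the abstract, not the authors' implementation: the function names, the fixed word-count segmentation, the `revise` callback (standing in for a human or model corrector), and the `prompt`/`chosen`/`rejected` record layout are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def split_into_segments(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Fixed-size word chunks are an assumption; the abstract does not say
    how segment boundaries are chosen.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_segment_pairs(
    instruction: str,
    output: str,
    revise: Callable[[str, str], str],
    max_words: int = 500,
) -> List[Dict[str, str]]:
    """Form (chosen, rejected) pairs segment by segment.

    `revise(context, segment)` is a hypothetical corrector that returns a
    corrected version of `segment` given the already corrected text before it.
    """
    pairs: List[Dict[str, str]] = []
    accepted: List[str] = []  # corrected segments produced so far
    for segment in split_into_segments(output, max_words):
        context = " ".join(accepted)
        corrected = revise(context, segment)
        if corrected != segment:
            # Only segments that were actually changed yield a preference pair.
            prompt = instruction if not context else f"{instruction}\n{context}"
            pairs.append({"prompt": prompt, "chosen": corrected, "rejected": segment})
        # Later segments are conditioned on the corrected text, not the original.
        accepted.append(corrected)
    return pairs


if __name__ == "__main__":
    # Toy corrector: pretend every "cat" is a factual error and should be "dog".
    fix = lambda context, segment: segment.replace("cat", "dog")
    for pair in build_segment_pairs("Describe", "a cat sat on a cat", fix, max_words=3):
        print(pair)
```

Working on short segments keeps each correction cheap to produce, which is the motivation the abstract gives for not collecting feedback on full 3,000-word outputs. Image inputs are omitted here for brevity; in a real setting they would be part of the prompt.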
