LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
February 20, 2025
Authors: Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
cs.AI
Abstract
Existing Large Vision-Language Models (LVLMs) can process inputs with context
lengths up to 128k visual and text tokens, yet they struggle to generate
coherent outputs beyond 1,000 words. We find that the primary limitation is the
absence of long output examples during supervised fine-tuning (SFT). To tackle
this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158
examples, each with multiple input images, an instruction, and corresponding
outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that
maintain high fidelity to the input images, we apply Direct Preference
Optimization (DPO) to the SFT model. Given the high cost of collecting human
feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which
breaks long outputs into segments and uses iterative corrections to form
preference pairs with the original outputs. Additionally, we develop
MMLongBench-Write, a benchmark featuring six tasks to evaluate the
long-generation capabilities of VLMs. Our 7B parameter model, trained with
LongWriter-V-22k and IterDPO, achieves impressive performance on this
benchmark, outperforming larger proprietary models like GPT-4o. Code and data:
https://github.com/THU-KEG/LongWriter-V
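The IterDPO procedure described above (splitting a long output into segments, correcting each segment, and pairing the correction against the original) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the segment granularity, the `correct` callback (standing in for a stronger model or human editor), and all function names are assumptions.

```python
from typing import Callable, List, Tuple

def split_into_segments(text: str, max_words: int = 500) -> List[str]:
    """Split a long output into word-bounded segments (assumed granularity)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def build_iterdpo_pairs(
    long_output: str,
    correct: Callable[[str, str], str],
    max_words: int = 500,
) -> List[Tuple[str, str, str]]:
    """Return (context, chosen, rejected) triples for segment-level DPO.

    `correct(prefix, segment)` is an assumed oracle that revises one
    segment given the already-corrected prefix; the revised segment is
    preferred ("chosen") over the original one ("rejected").
    """
    pairs = []
    prefix = ""
    for segment in split_into_segments(long_output, max_words):
        revised = correct(prefix, segment)
        if revised != segment:
            pairs.append((prefix, revised, segment))
        # Condition later segments on the corrected text so corrections
        # accumulate iteratively across the long output.
        prefix += revised + " "
    return pairs
```

The resulting triples can then feed a standard DPO objective applied segment by segment, which avoids collecting human preferences over entire 3,000-word outputs at once.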