LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
February 20, 2025
Authors: Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
cs.AI
Abstract
Existing Large Vision-Language Models (LVLMs) can process inputs with context
lengths up to 128k visual and text tokens, yet they struggle to generate
coherent outputs beyond 1,000 words. We find that the primary limitation is the
absence of long output examples during supervised fine-tuning (SFT). To tackle
this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158
examples, each with multiple input images, an instruction, and corresponding
outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that
maintain high fidelity to the input images, we apply Direct Preference
Optimization (DPO) to the SFT model. Given the high cost of collecting human
feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which
breaks long outputs into segments and uses iterative corrections to form
preference pairs with the original outputs. Additionally, we develop
MMLongBench-Write, a benchmark featuring six tasks to evaluate the
long-generation capabilities of VLMs. Our 7B parameter model, trained with
LongWriter-V-22k and IterDPO, achieves impressive performance on this
benchmark, outperforming larger proprietary models like GPT-4o. Code and data:
https://github.com/THU-KEG/LongWriter-V
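The IterDPO procedure described above (splitting a long output into segments, correcting each segment, and pairing the correction against the original) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the segment granularity, the `correct` callback (standing in for a stronger model or human editor), and all function names are assumptions.

```python
from typing import Callable, List, Tuple

def split_into_segments(text: str, max_words: int = 500) -> List[str]:
    """Split a long output into word-bounded segments (assumed granularity)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def build_iterdpo_pairs(
    long_output: str,
    correct: Callable[[str, str], str],
    max_words: int = 500,
) -> List[Tuple[str, str, str]]:
    """Return (context, chosen, rejected) triples for segment-level DPO.

    `correct(prefix, segment)` is an assumed oracle that revises one
    segment given the already-corrected prefix; the revised segment is
    preferred ("chosen") over the original one ("rejected").
    """
    pairs = []
    prefix = ""
    for segment in split_into_segments(long_output, max_words):
        revised = correct(prefix, segment)
        if revised != segment:
            pairs.append((prefix, revised, segment))
        # Condition later segments on the corrected text so corrections
        # accumulate iteratively across the long output.
        prefix += revised + " "
    return pairs
```

The resulting triples can then feed a standard DPO objective applied segment by segment, which avoids collecting human preferences over entire 3,000-word outputs at once.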