ETCHR: Editing To Clarify and Harness Reasoning
May 22, 2026
Authors: Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches either are constrained by fixed, predefined toolkits or rely on unified multimodal methods that produce noisy intermediate images. We pursue a third option: a dedicated image-editing model that is decoupled from the understanding model. However, off-the-shelf image editors fall short as reasoning assistants because of two complementary gaps: a language-side gap, where editors trained as passive instruction followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor that is decoupled from the downstream understanding model and trained with a two-stage recipe targeting the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Because the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs without additional training. Across five task families (fine-grained perception, chart understanding, logical reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.
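To make the decoupled design concrete, here is a minimal Python sketch of an edit-then-answer pipeline in which the same editor is paired with two different understanding models. It is an illustration under stated assumptions, not the authors' implementation: the names `EditThenAnswer`, `toy_editor`, `toy_mllm_a`, and `toy_mllm_b` are hypothetical, images are represented as plain strings, and passing both the original and the edited image to the understanding model is an assumption that the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in types: an "image" is a plain string label so the
# sketch runs without any imaging or model library.
Image = str
Editor = Callable[[Image, str], Image]            # (image, question) -> edited image
Understander = Callable[[List[Image], str], str]  # (images, question) -> answer


@dataclass
class EditThenAnswer:
    """Pairs an editor with an understanding model; neither depends on the other."""
    editor: Editor
    understander: Understander

    def answer(self, image: Image, question: str) -> str:
        # Step 1: question-conditioned edit (for example a crop or a view change).
        edited = self.editor(image, question)
        # Step 2: the unchanged understanding model answers from the images.
        # Assumption: it receives both the original and the edited image.
        return self.understander([image, edited], question)


def toy_editor(image: Image, question: str) -> Image:
    # Placeholder for a trained, reasoning-aware image editor.
    return f"{image}+edit[{question}]"


def toy_mllm_a(images: List[Image], question: str) -> str:
    # Placeholder for one understanding model.
    return f"A saw {len(images)} images"


def toy_mllm_b(images: List[Image], question: str) -> str:
    # Placeholder for a different understanding model.
    return f"B saw {len(images)} images"


# The same editor is reused with two different understanding models,
# and neither model is modified or retrained.
pipeline_a = EditThenAnswer(toy_editor, toy_mllm_a)
pipeline_b = EditThenAnswer(toy_editor, toy_mllm_b)
```

Because the editor only exchanges images and text with the understanding model, swapping `toy_mllm_a` for `toy_mllm_b` requires no change to the editor, which mirrors the training-free integration the abstract describes.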