ETCHR: Editing To Clarify and Harness Reasoning

Abstract

Multimodal Large Language Models (MLLMs) have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed, predefined toolkits or, in the case of unified multimodal methods, produce noisy intermediate images. We pursue a third option: using a dedicated image editing model and decoupling it from the understanding model. However, off-the-shelf image editors fail as reasoning assistants because of two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor that is decoupled from the downstream understanding model and trained with a two-stage recipe targeting the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Because the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.
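
The decoupled design described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Editor`, `Understander`, and `EditThenAnswer` are hypothetical, images are treated as opaque objects, and passing both the original and the edited image to the understanding model is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical aliases (not from the paper): an image is any opaque object.
Image = object
# Assumed editor interface: (image, question) -> edited image.
Editor = Callable[[Image, str], Image]
# Assumed MLLM interface: (images, question) -> answer text.
Understander = Callable[[List[Image], str], str]


@dataclass
class EditThenAnswer:
    """Question-conditioned editor in front of an interchangeable MLLM."""

    editor: Editor
    understander: Understander

    def answer(self, image: Image, question: str) -> str:
        # Step 1: the editor sees the question and produces a clarifying edit.
        edited = self.editor(image, question)
        # Step 2: the unmodified MLLM answers from the images it is given.
        # Passing the original alongside the edit is an assumption here.
        return self.understander([image, edited], question)
```

Under these assumptions, swapping the understanding model only means constructing `EditThenAnswer` with a different `understander` callable; the editor is reused unchanged, which mirrors the training-free plug-in property claimed in the abstract.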