SridBench: A Benchmark of Scientific Research Illustration Drawing for Image Generation Models

Abstract

Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, while newer multimodal models such as GPT-4o-image integrate high-level reasoning, improving semantic understanding and structural composition. Scientific illustration generation exemplifies this evolution: unlike general image synthesis, it demands accurate interpretation of technical content and the transformation of abstract ideas into clear, standardized visuals. The task is considerably more knowledge-intensive and labor-intensive than general image synthesis, often requiring hours of manual work and specialized tools. Automating it in a controllable, intelligent manner would provide substantial practical value. Yet no benchmark currently exists to evaluate image generation models on this task. To fill this gap, we introduce SridBench, the first benchmark for scientific figure generation. It comprises 1,120 instances curated from leading scientific papers across 13 natural and computer science disciplines, collected with the help of human experts and multimodal large language models (MLLMs). Each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. Experimental results reveal that even top-tier models such as GPT-4o-image lag behind human performance, with common issues in text and visual clarity and in scientific correctness. These findings highlight the need for more advanced reasoning-driven visual generation capabilities.