DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Abstract

Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, which leads to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, and these usually depend on understanding the physical rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate how well VLMs understand and reason about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.
