PAI-Bench: A Comprehensive Benchmark for Physical AI

Abstract

Physical AI aims to develop models that can perceive and predict real-world dynamics, yet the extent to which current multimodal large language models and video generative models support these abilities remains poorly understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across three tasks: video generation, conditional video generation, and video understanding. It comprises 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often fail to maintain physically coherent dynamics, while multimodal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in meeting the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.