Φeat: Physically-Grounded Feature Representation

Abstract

Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce Φeat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a purely self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that Φeat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.
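
As an illustration of what contrasting views of the same material can look like, the sketch below computes a generic InfoNCE-style contrastive loss over paired embeddings. The abstract does not specify the loss used by Φeat, so the choice of InfoNCE, the function names (`l2_normalize`, `info_nce`), the temperature value, and the synthetic embeddings are all assumptions made for illustration. Row i of each input stands for the embedding of one rendering of material i (for example, a different shape or lighting condition).

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def info_nce(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in, not the paper's stated loss).

    view_a[i] and view_b[i] are embeddings of two views of the same material;
    every other row in view_b acts as a negative for view_a[i].
    """
    a = l2_normalize(view_a)
    b = l2_normalize(view_b)
    logits = a @ b.T / temperature                      # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: view i of A matches view i of B.
    return float(-np.mean(np.diag(log_prob)))


# Synthetic stand-ins for backbone outputs: 8 hypothetical materials, 32-dim embeddings.
rng = np.random.default_rng(0)
materials = rng.normal(size=(8, 32))
view_a = materials + 0.05 * rng.normal(size=(8, 32))  # e.g. one shape/lighting condition
view_b = materials + 0.05 * rng.normal(size=(8, 32))  # e.g. another condition

aligned = info_nce(view_a, view_b)         # correct material pairing
shuffled = info_nce(view_a, view_b[::-1])  # deliberately mismatched pairing
print(f"aligned loss: {aligned:.4f}, shuffled loss: {shuffled:.4f}")
```

The loss is low when embeddings of the same material agree across views and high when the pairing is broken, which is the behaviour a material-sensitive, lighting-invariant representation would need under this assumed objective.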
