JanusMesh: Fast Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

Abstract

Creating a 3D visual illusion, a single 3D mesh that conveys entirely different semantics when viewed from different angles, is an appealing but difficult problem. Existing optimization-based methods are slow and can produce oversaturated colors, while naive stitching approaches fail to produce geometrically coherent objects, leaving visible seams and semantic leakage between views. We present a fast, training-free framework for generating text-driven 3D visual illusions. Our approach decouples generation into two stages. First, we propose a cross-space dual-branch denoising process that dynamically decodes 3D latents into voxel space, where CLIP-guided orientation alignment and Signed Distance Field (SDF) blending ensure seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments show that our method generates highly realistic dual-semantic 3D illusions in 3-5 minutes and significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency. Project page: https://siang1105.github.io/JanusMesh.github.io/
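
To make the idea of SDF blending concrete, the following is a minimal sketch and not the paper's implementation. It assumes two shapes are already available as signed distance values on a shared voxel grid and fuses them with a polynomial smooth minimum, a common generic blending operator. The function names (`sphere_sdf`, `smooth_union`, `make_grid`), the sphere stand-ins for the two semantic branches, and the blend width `k` are all illustrative assumptions; the abstract does not specify the actual blending rule.

```python
import numpy as np


def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - np.asarray(center), axis=-1) - radius


def smooth_union(d1, d2, k=0.1):
    """Polynomial smooth minimum of two SDFs (assumed blending operator).

    k > 0 sets the width of the blend region. Where the two distances
    differ by at least k, the result equals the plain minimum.
    """
    h = np.clip(0.5 + 0.5 * (d2 - d1) / k, 0.0, 1.0)
    return d2 * (1.0 - h) + d1 * h - k * h * (1.0 - h)


def make_grid(resolution=32, extent=1.0):
    """Voxel-center coordinates in [-extent, extent]^3, shape (R, R, R, 3)."""
    axis = np.linspace(-extent, extent, resolution)
    return np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)


# Two overlapping spheres stand in for the two branch geometries (hypothetical).
grid = make_grid()
sdf_a = sphere_sdf(grid, center=(-0.3, 0.0, 0.0), radius=0.4)
sdf_b = sphere_sdf(grid, center=(0.3, 0.0, 0.0), radius=0.4)

fused = smooth_union(sdf_a, sdf_b, k=0.1)
occupancy = fused < 0.0  # voxels inside the fused shape
```

A hard union (`np.minimum`) would leave a crease where the two surfaces meet; the smooth minimum rounds that junction, which is one plausible way to avoid a visible seam. A larger `k` widens the blended region at the cost of inflating the shape near the junction.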