Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions

Abstract

We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.
