A Touch, Vision, and Language Dataset for Multimodal Alignment

Abstract

Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code and data: https://tactile-vlm.github.io.
