A Touch, Vision, and Language Dataset for Multimodal Alignment
February 20, 2024
Authors: Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg
cs.AI
Abstract
Touch is an important sensing modality for humans, but it has not yet been
incorporated into a multimodal generative language model. This is partially due
to the difficulty of obtaining natural language labels for tactile data and the
complexity of aligning tactile readings with both visual observations and
language descriptions. As a step towards bridging that gap, this work
introduces a new dataset of 44K in-the-wild vision-touch pairs, with English
language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V
(90%). We use this dataset to train a vision-language-aligned tactile encoder
for open-vocabulary classification and a touch-vision-language (TVL) model for
text generation using the trained encoder. Results suggest that by
incorporating touch, the TVL model improves (+29% classification accuracy)
touch-vision-language alignment over existing models trained on any pair of
those modalities. Although only a small fraction of the dataset is
human-labeled, the TVL model demonstrates improved visual-tactile understanding
over GPT-4V (+12%) and open-source vision-language models (+32%) on a new
touch-vision understanding benchmark. Code and data:
https://tactile-vlm.github.io.
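
To make the open-vocabulary classification setup concrete, below is a minimal sketch of how a vision-language-aligned tactile encoder can be queried at inference time, in the spirit of CLIP-style zero-shot classification. The function names, encoder interfaces, and prompt template are illustrative assumptions, not the released TVL implementation (see https://tactile-vlm.github.io for the actual code and data).

```python
# Minimal sketch (PyTorch) of open-vocabulary tactile classification with a
# tactile encoder aligned to a text encoder in a shared embedding space.
# `tactile_encoder` and `text_encoder` are hypothetical callables assumed to
# return embeddings of matching dimension; they stand in for the trained models.
import torch
import torch.nn.functional as F

def classify_touch(tactile_image: torch.Tensor,
                   candidate_labels: list[str],
                   tactile_encoder,   # tactile image batch -> shared embedding space
                   text_encoder) -> str:  # list of prompts  -> shared embedding space
    """Return the label whose text embedding is closest to the tactile embedding."""
    with torch.no_grad():
        # Embed the tactile reading (e.g., an image from a vision-based tactile
        # sensor) and L2-normalize so dot products are cosine similarities.
        touch_emb = F.normalize(tactile_encoder(tactile_image.unsqueeze(0)), dim=-1)

        # Embed one natural-language prompt per candidate label.
        prompts = [f"This surface feels {label}." for label in candidate_labels]
        text_emb = F.normalize(text_encoder(prompts), dim=-1)

        # Cosine similarity serves as the classification logit; pick the best match.
        logits = touch_emb @ text_emb.T          # shape: (1, num_labels)
        return candidate_labels[logits.argmax(dim=-1).item()]

# Usage example with open-vocabulary material descriptions for one tactile frame:
# labels = ["smooth and hard", "soft and fuzzy", "rough and grainy"]
# print(classify_touch(tactile_frame, labels, tactile_encoder, text_encoder))
```

Because classification reduces to nearest-neighbor search over text embeddings, the label set can be changed freely at inference time, which is what makes the classifier open-vocabulary rather than tied to a fixed set of training classes.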