A Touch, Vision, and Language Dataset for Multimodal Alignment
February 20, 2024
Authors: Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg
cs.AI
Abstract
Touch is an important sensing modality for humans, but it has not yet been
incorporated into a multimodal generative language model. This is partially due
to the difficulty of obtaining natural language labels for tactile data and the
complexity of aligning tactile readings with both visual observations and
language descriptions. As a step towards bridging that gap, this work
introduces a new dataset of 44K in-the-wild vision-touch pairs, with English
language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V
(90%). We use this dataset to train a vision-language-aligned tactile encoder
for open-vocabulary classification and a touch-vision-language (TVL) model for
text generation using the trained encoder. Results suggest that by
incorporating touch, the TVL model improves (+29% classification accuracy)
touch-vision-language alignment over existing models trained on any pair of
those modalities. Although only a small fraction of the dataset is
human-labeled, the TVL model demonstrates improved visual-tactile understanding
over GPT-4V (+12%) and open-source vision-language models (+32%) on a new
touch-vision understanding benchmark. Code and data:
https://tactile-vlm.github.io.
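To make the open-vocabulary classification setup concrete, below is a minimal PyTorch sketch of how a tactile encoder aligned to a CLIP-style vision-language embedding space can be queried: the tactile reading and candidate text labels are embedded into the shared space and labels are ranked by cosine similarity. The names `tactile_encoder`, `text_encoder`, and the prompt template are hypothetical placeholders, not the released TVL API; see https://tactile-vlm.github.io for the actual code and checkpoints.

```python
# Sketch only (assumed interfaces): open-vocabulary classification with a
# tactile encoder aligned to a shared vision-language embedding space.
import torch
import torch.nn.functional as F

def open_vocab_classify(tactile_image, class_names, tactile_encoder, text_encoder):
    """Score one tactile reading against free-form class names by cosine similarity."""
    with torch.no_grad():
        # Embed the tactile reading into the shared space; encoder is assumed
        # to map a (C, H, W) tensor batch to (N, D) embeddings.
        touch_emb = F.normalize(tactile_encoder(tactile_image.unsqueeze(0)), dim=-1)
        # Embed each candidate label as a short text prompt (hypothetical template).
        prompts = [f"This feels {name}." for name in class_names]
        text_emb = F.normalize(text_encoder(prompts), dim=-1)
        # Cosine similarities over the open vocabulary, turned into probabilities.
        probs = (touch_emb @ text_emb.T).softmax(dim=-1).squeeze(0)
    return {name: p.item() for name, p in zip(class_names, probs)}

# Example usage with encoders that return (N, D) embeddings:
# scores = open_vocab_classify(reading, ["smooth", "rough", "squishy"], tac_enc, txt_enc)
```

Because the tactile encoder shares an embedding space with the text encoder, no fixed label set is needed at training time; any describable surface property can be scored at inference.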