A Touch, Vision, and Language Dataset for Multimodal Alignment

Abstract

Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code and data: https://tactile-vlm.github.io.
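The open-vocabulary classification mentioned above can be illustrated with a minimal sketch. It assumes only that a tactile embedding and a set of label text embeddings live in one shared space, and that the label closest to the tactile embedding by cosine similarity is chosen. The function names and the toy three-dimensional vectors are hypothetical and stand in for real encoder outputs; this is not the authors' implementation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_scores(query, label_embeddings):
    """Return the cosine similarity between the query and every label embedding."""
    q = l2_normalize(query)
    scores = {}
    for label, embedding in label_embeddings.items():
        e = l2_normalize(embedding)
        scores[label] = sum(a * b for a, b in zip(q, e))
    return scores


def classify_open_vocabulary(tactile_embedding, label_embeddings):
    """Pick the label whose text embedding is most similar to the tactile embedding."""
    scores = cosine_scores(tactile_embedding, label_embeddings)
    return max(scores, key=scores.get)


# Hypothetical toy embeddings; a real system would obtain these from
# a text encoder (labels) and a tactile encoder (sensor reading).
label_embeddings = {
    "smooth": [1.0, 0.1, 0.0],
    "rough": [0.0, 1.0, 0.2],
    "soft": [0.1, 0.0, 1.0],
}
tactile_embedding = [0.1, 0.9, 0.3]

predicted = classify_open_vocabulary(tactile_embedding, label_embeddings)
print(predicted)
```

Because the label set is just a dictionary of text embeddings, new labels can be added at inference time without retraining, which is what makes the vocabulary open in this sketch.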
