
Teaching Metric Distance to Autoregressive Multimodal Foundational Models

March 4, 2025
Authors: Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu
cs.AI

Abstract

As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are pronounced in cases of limited training data, highlighting DIST2Loss's effectiveness in resource-constrained settings.
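The core idea described above, turning a continuous exponential-family distribution induced by a distance metric into discrete categorical targets, can be sketched numerically. The code below is a minimal illustration, not the paper's implementation: it assumes a one-dimensional token vocabulary where token index distance `|k - target|` serves as the metric, and builds soft targets `p(k) ∝ exp(-d(k, target)/τ)` for a standard cross-entropy loss. The function names and the temperature parameter `tau` are hypothetical.

```python
import numpy as np

def dist2loss_targets(target_idx, num_tokens, tau=1.0):
    """Distance-aware soft targets: p(k) ∝ exp(-d(k, target)/tau),
    using absolute index distance as the metric (an assumption)."""
    distances = np.abs(np.arange(num_tokens) - target_idx)
    logits = -distances / tau
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def dist2loss(model_logits, target_idx, tau=1.0):
    """Cross-entropy between the model's categorical distribution and
    the distance-aware soft target distribution."""
    p = dist2loss_targets(target_idx, len(model_logits), tau)
    m = model_logits.max()
    log_q = model_logits - m - np.log(np.exp(model_logits - m).sum())
    return -np.sum(p * log_q)
```

Compared with one-hot cross-entropy, a prediction one token away from the target is penalized less than a prediction far away, which is how the model learns to preserve metric relationships among tokens.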

