Directional Textual Inversion for Personalized Text-to-Image Generation
December 15, 2025
Authors: Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim
cs.AI
Abstract
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
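The core mechanics the abstract describes (a fixed-magnitude embedding whose direction is optimized on the unit hypersphere via Riemannian SGD, a constant-direction vMF prior gradient for MAP estimation, and slerp between learned concepts) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `rsgd_step` and `slerp`, the learning rate, and the sign convention for the prior term are choices made here for clarity.

```python
import numpy as np

def rsgd_step(v, euclid_grad, lr=0.01, mu=None, kappa=0.0):
    """One Riemannian SGD step on the unit hypersphere.

    v           : current unit-norm embedding direction.
    euclid_grad : Euclidean gradient of the task loss w.r.t. v.
    mu, kappa   : optional von Mises-Fisher prior mean direction and
                  concentration. Since log vMF(v; mu, kappa) = kappa * mu.v
                  (up to a constant), the MAP objective contributes a
                  constant prior gradient of -kappa * mu to the loss.
    """
    g = euclid_grad
    if mu is not None and kappa > 0:
        g = g - kappa * mu  # constant-direction prior gradient (assumed sign)
    # Project onto the tangent space at v: remove the radial component.
    g_tan = g - np.dot(g, v) * v
    # Take a descent step, then retract to the sphere by renormalizing.
    v_new = v - lr * g_tan
    return v_new / np.linalg.norm(v_new)

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v."""
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if omega < 1e-8:  # nearly parallel: linear interpolation is stable
        return (1 - t) * u + t * v
    return (np.sin((1 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)
```

Because the update is projected to the tangent space and retracted by renormalization, every iterate stays exactly on the unit sphere, which is what keeps the learned token's magnitude fixed at an in-distribution scale; the same unit-norm parameterization is what makes slerp between two learned concepts well defined.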