Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
November 19, 2025
Authors: Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya
cs.AI
Abstract
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
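To make the notion of intrinsic dimension concrete, the sketch below estimates the ID of a point cloud with the TwoNN maximum-likelihood formula, which uses only the ratio of each point's second to first nearest-neighbor distance. The abstract does not say which estimator the authors use, so TwoNN is an assumption chosen for illustration, and the names `twonn_intrinsic_dimension` and `sample_embedded_plane` are hypothetical. The toy data is a 2-dimensional Gaussian sheet linearly embedded in 8 ambient dimensions, standing in for a set of embedding vectors.

```python
import math
import random


def twonn_intrinsic_dimension(points):
    """Estimate intrinsic dimension via the TwoNN maximum-likelihood formula.

    For each point, mu = r2 / r1 is the ratio of the distances to its second
    and first nearest neighbors. The estimate is N / sum(log(mu)).
    Assumption: illustrative estimator, not necessarily the paper's.
    """
    log_ratio_sum = 0.0
    used = 0
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        # Brute-force search for the two nearest neighbors of p.
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        # Skip exact duplicates (r1 == 0) and degenerate cases.
        if r1 > 0.0 and math.isfinite(r2):
            log_ratio_sum += math.log(r2 / r1)
            used += 1
    if used == 0 or log_ratio_sum == 0.0:
        raise ValueError("not enough distinct points to estimate dimension")
    return used / log_ratio_sum


def sample_embedded_plane(n, seed=0):
    """Toy data: a 2-D Gaussian sheet linearly embedded in 8 dimensions."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Every coordinate is a linear function of (u, v), so the cloud
        # has 2 degrees of freedom despite living in 8 dimensions.
        points.append((u, v, 0.5 * u, -0.5 * v, u + v, 0.0, 0.0, 0.0))
    return points


if __name__ == "__main__":
    cloud = sample_embedded_plane(400, seed=0)
    print(f"ambient dimension: {len(cloud[0])}")
    print(f"estimated intrinsic dimension: {twonn_intrinsic_dimension(cloud):.2f}")
```

On this toy cloud the estimate should land near 2 rather than 8, which is the sense in which ID counts degrees of freedom rather than embedding width. The brute-force neighbor search is quadratic in the number of points, so a real analysis would likely swap in a nearest-neighbor index.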