Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

November 19, 2025
Authors: Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya
cs.AI

Abstract

Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
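
To make the notion of intrinsic dimension concrete, here is a minimal sketch of one common estimator, TwoNN, applied to synthetic points. This is an illustration only: the abstract does not say which ID estimator the authors use, so TwoNN is swapped in as an assumption, and the function name `two_nn_id` and the toy data are hypothetical. In the setting described above, the points would be a model's token embeddings for a single text rather than random samples.

```python
import math
import random


def two_nn_id(points):
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    For each point, take the ratio mu = r2 / r1 of the distances to its
    second and first nearest neighbours. Under the TwoNN model, log(mu)
    is exponentially distributed with rate d, so d is estimated as
    N / sum(log(mu)).
    """
    log_ratios = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, where the ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


# Synthetic data (not from the paper): a 2-D plane embedded in 5-D space.
random.seed(0)
points = []
for _ in range(400):
    u, v = random.random(), random.random()
    points.append([u, v, u + v, u - v, 0.0])

estimate = two_nn_id(points)
print(f"ambient dimension: 5, estimated intrinsic dimension: {estimate:.2f}")
```

The ambient space has five coordinates, but the points vary along only two independent directions, so the estimate should land near 2. The genre-level values quoted in the abstract (roughly 8 to 10.5) are the same kind of quantity, measured on text representations.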