Document Understanding, Measurement, and Manipulation Using Category Theory

October 24, 2025
Authors: Jared Claypoole, Yunye Gong, Noson S. Yanofsky, Ajay Divakaran
cs.AI

Abstract

We apply category theory to extract multimodal document structure, which leads us to develop information-theoretic measures, content summarization and extension, and self-supervised improvement of large pretrained models. First, we develop a mathematical representation of a document as a category of question-answer pairs. Second, we develop an orthogonalization procedure that divides the information contained in one or more documents into non-overlapping pieces. The structures extracted in these two steps lead us to methods for measuring and enumerating the information contained in a document. We also build on those steps to develop new summarization techniques and a solution to a new problem, exegesis, which produces an extension of the original document. Our question-answer pair methodology enables a novel rate-distortion analysis of summarization techniques. We implement our techniques using large pretrained models, and we propose a multimodal extension of our overall mathematical framework. Finally, we develop a novel self-supervised method that uses RLVR (reinforcement learning with verifiable rewards) to improve large pretrained models through consistency constraints, such as composability and closure under certain operations, that stem naturally from our category-theoretic framework.
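
The orthogonalization and measurement steps described in the abstract can be made concrete with a toy sketch. This is not the authors' implementation: the abstract does not specify how question-answer pairs are represented, so the sketch assumes each pair is modeled as a set of atomic facts, and the names `orthogonalize`, `distortion`, and `qa_facts` are hypothetical. Under that assumption, overlapping pairs are split into disjoint pieces, and a summary is scored by the fraction of pieces it fails to cover.

```python
from collections import defaultdict


def orthogonalize(qa_facts):
    """Split overlapping fact sets into disjoint pieces.

    qa_facts maps a QA-pair label to the set of atomic facts it covers
    (an assumed representation, chosen only for illustration).
    Facts are grouped by the exact set of QA pairs that contain them,
    so every fact lands in exactly one piece.
    """
    membership = defaultdict(set)
    for label, facts in qa_facts.items():
        for fact in facts:
            membership[fact].add(label)

    pieces = defaultdict(set)
    for fact, labels in membership.items():
        pieces[frozenset(labels)].add(fact)
    return dict(pieces)


def distortion(pieces, summary_labels):
    """Fraction of pieces not covered by any QA pair kept in the summary."""
    if not pieces:
        return 0.0
    kept = set(summary_labels)
    lost = [signature for signature in pieces if not (signature & kept)]
    return len(lost) / len(pieces)


# Hypothetical document with three QA pairs; Q1 and Q2 share fact "b".
qa_facts = {
    "Q1": {"a", "b"},
    "Q2": {"b", "c"},
    "Q3": {"d"},
}

pieces = orthogonalize(qa_facts)
information_count = len(pieces)  # number of non-overlapping pieces
summary_distortion = distortion(pieces, ["Q1"])  # summary keeps only Q1
```

In this toy model the number of pieces stands in for the amount of information in the document, and varying how many QA pairs the summary keeps traces out a simple rate versus distortion trade-off.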