Document Understanding, Measurement, and Manipulation Using Category Theory

Abstract

We apply category theory to extract the structure of multimodal documents, and we use that structure to develop information-theoretic measures, methods for content summarization and extension, and self-supervised improvement of large pretrained models. First, we develop a mathematical representation of a document as a category of question-answer pairs. Second, we develop an orthogonalization procedure that divides the information contained in one or more documents into non-overlapping pieces. The structures extracted in these two steps lead to methods for measuring and enumerating the information contained in a document. Building on the same steps, we develop new summarization techniques and a solution to a new problem, exegesis, which results in an extension of the original document. Our question-answer pair methodology enables a novel rate-distortion analysis of summarization techniques. We implement our techniques using large pretrained models, and we propose a multimodal extension of the overall mathematical framework. Finally, we develop a novel self-supervised method that uses reinforcement learning with verifiable rewards (RLVR) to improve large pretrained models through consistency constraints, such as composability and closure under certain operations, that stem naturally from our category-theoretic framework.
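As a rough illustration of what dividing information into non-overlapping pieces can look like, the sketch below treats each document as a set of question-answer pairs and splits two documents into three disjoint parts using ordinary set operations. This is a toy under stated assumptions, not the orthogonalization procedure itself: the names `QAPair` and `split_non_overlapping` are hypothetical, the sample pairs are invented, and two pairs count as the same only when their strings match exactly, whereas a real system would presumably need some notion of semantic equivalence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """Hypothetical container for one question-answer pair (hashable, so it can live in a set)."""
    question: str
    answer: str


def split_non_overlapping(doc_a, doc_b):
    """Split two collections of QA pairs into three mutually disjoint sets.

    Assumption: pairs are compared by exact string equality only.
    """
    a, b = set(doc_a), set(doc_b)
    return {
        "only_a": a - b,   # information found only in the first document
        "shared": a & b,   # information common to both documents
        "only_b": b - a,   # information found only in the second document
    }


# Invented example data, for illustration only.
doc_a = [
    QAPair("Who wrote the report?", "Ada"),
    QAPair("When was it filed?", "2021"),
]
doc_b = [
    QAPair("When was it filed?", "2021"),
    QAPair("Where was it filed?", "Paris"),
]

pieces = split_non_overlapping(doc_a, doc_b)
counts = {name: len(part) for name, part in pieces.items()}
```

Counting the pairs in each piece gives a crude tally of how much each document contributes on its own and how much the two share; the three pieces are disjoint by construction and together cover every pair in either document.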