Document Understanding, Measurement, and Manipulation Using Category Theory
October 24, 2025
Authors: Jared Claypoole, Yunye Gong, Noson S. Yanofsky, Ajay Divakaran
cs.AI
Abstract
We apply category theory to extract multimodal document structure, which leads
us to develop information-theoretic measures, content summarization and
extension, and self-supervised improvement of large pretrained models. We first
develop a mathematical representation of a document as a category of
question-answer pairs. Second, we develop an orthogonalization procedure to
divide the information contained in one or more documents into non-overlapping
pieces. The structures extracted in the first and second steps lead us to
develop methods to measure and enumerate the information contained in a
document. We also build on those steps to develop new summarization techniques,
as well as a solution to a new problem, namely exegesis, which results in an
extension of the original document. Our question-answer pair methodology
enables a novel rate-distortion analysis of summarization techniques. We
implement our techniques using large pretrained models, and we propose a
multimodal extension of our overall mathematical framework. Finally, we develop
a novel self-supervised method that uses RLVR to improve large pretrained models
through consistency constraints, such as composability and closure under certain
operations, that stem naturally from our category-theoretic framework.