SpreadsheetLLM: Encoding Spreadsheets for Large Language Models

Abstract

Spreadsheets, with their extensive two-dimensional grids, various layouts, and diverse formatting options, present notable challenges for large language models (LLMs). In response, we introduce SpreadsheetLLM, which pioneers an efficient encoding method designed to unleash and optimize LLMs' powerful understanding and reasoning capabilities on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach is limited by LLMs' token constraints, making it impractical for most applications. To tackle this challenge, we develop SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs. It comprises three modules: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. It significantly improves performance on the spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT4's in-context learning setting. Moreover, a fine-tuned LLM with SheetCompressor reaches an average compression ratio of 25 times while achieving a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%. Finally, we propose Chain of Spreadsheet for downstream tasks of spreadsheet understanding and validate it on a new and demanding spreadsheet QA task. We methodically leverage the inherent layout and structure of spreadsheets, demonstrating that SpreadsheetLLM is highly effective across a variety of spreadsheet tasks.
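
To make the contrast between the vanilla serialization and an inverse index more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the encoding format, so the separators, the function names (vanilla_encode, inverse_index_encode, compression_ratio), the toy sheet, and the use of character counts as a stand-in for token counts are all assumptions made for illustration. Cell formats are left out for brevity.

```python
from collections import defaultdict

def vanilla_encode(cells):
    """Serialize every cell as 'address,value', joined by '|'.
    The separator choice is an assumption, not the paper's format."""
    return "|".join(f"{addr},{value}" for addr, value in cells.items())

def inverse_index_encode(cells):
    """Group addresses under each distinct value so a repeated value is
    written once; empty cells are skipped entirely (illustrative only)."""
    index = defaultdict(list)
    for addr, value in cells.items():
        if value != "":
            index[value].append(addr)
    return "|".join(f"{value}:{','.join(addrs)}" for value, addrs in index.items())

def compression_ratio(original, compressed):
    """Character-length ratio, used here as a rough proxy for token savings."""
    return len(original) / len(compressed)

# Hypothetical toy sheet: cell address -> value, with two empty cells.
sheet = {
    "A1": "Region", "B1": "Status",
    "A2": "North",  "B2": "Open",
    "A3": "South",  "B3": "Open",
    "A4": "",       "B4": "",
}

vanilla = vanilla_encode(sheet)
inverted = inverse_index_encode(sheet)
print(vanilla)
print(inverted)
print(round(compression_ratio(vanilla, inverted), 2))
```

In this toy case the saving comes from writing the repeated value "Open" once and from dropping empty cells. The much larger ratio reported in the abstract is for the full SheetCompressor framework on real spreadsheets, which this sketch does not attempt to reproduce.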
