Performance Prediction for Large Systems via Text-to-Text Regression

Abstract

In many industries, predicting the metric outcomes of large systems is a fundamental problem, and it has largely been addressed with traditional tabular regression. However, such methods struggle on complex systems data in the wild, such as configuration files or system logs, where feature engineering is often infeasible. We propose text-to-text regression as a general, scalable alternative. For predicting resource efficiency on Borg, Google's massive compute cluster scheduling system, a 60M-parameter encoder-decoder trained from random initialization achieves up to a near-perfect 0.99 (0.9 average) rank correlation across the entire fleet, and 100x lower MSE than tabular approaches. The model also adapts easily to new tasks with only 500 few-shot examples and captures the densities of complex outcome distributions. Ablation studies highlight the importance of using encoders, increasing sequence length, and the model's inherent uncertainty quantification. These findings pave the way for universal simulators of real-world outcomes.
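The abstract reports results as rank correlation and MSE. As a rough illustration of what those two metrics measure, here is a minimal sketch in plain Python. It assumes the rank correlation is Spearman's (the abstract does not say which variant is used), and the function names and sample values are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Spearman rank correlation (assumed variant): Pearson correlation of the ranks."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical efficiency targets and model predictions, for illustration only.
targets = [0.2, 0.5, 0.9, 0.4]
predictions = [0.25, 0.45, 0.8, 0.5]
rho = spearman(targets, predictions)
err = mse(targets, predictions)
```

Rank correlation depends only on how the predictions are ordered, while MSE penalizes the size of each error, which is presumably why the abstract reports both.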
