BBQ Scene Generation: Numeric Bounding Box and Color Control in Large-Scale Text-to-Image Models

Abstract

Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework. We obtain precise spatial and chromatic control by training on captions enriched with parametric annotations, without architectural modifications or inference-time optimization. This also enables intuitive user interfaces such as object dragging and color pickers, replacing ambiguous iterative prompting with precise, familiar controls. Across comprehensive evaluations, BBQ achieves strong box alignment and improves RGB color fidelity over state-of-the-art baselines. More broadly, our results support a new paradigm in which user intent is translated into an intermediate structured language, consumed by a flow-based transformer acting as a renderer and naturally accommodating numeric parameters.
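
To make the idea of conditioning on numeric parameters inside structured text more concrete, the following is a minimal sketch of how a user interface might turn a color-picker value and a drag gesture into a structured caption. Everything here is an assumption for illustration: the JSON layout, the field names (`description`, `bbox`, `rgb`), the normalized `(x_min, y_min, x_max, y_max)` box convention, and the helper names are hypothetical and are not taken from the BBQ model's actual caption schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SceneObject:
    """One object in the scene (hypothetical schema, for illustration only)."""
    description: str
    # Assumed convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
    bbox: tuple
    # RGB triplet with integer channels in [0, 255].
    rgb: tuple


def hex_to_rgb(hex_color: str) -> tuple:
    """Convert a color-picker value such as '#C81E1E' into an RGB triplet."""
    h = hex_color.lstrip("#")
    if len(h) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {hex_color!r}")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def drag(obj: SceneObject, dx: float, dy: float) -> SceneObject:
    """Shift an object's box by (dx, dy), clamping so it stays inside the image."""
    x0, y0, x1, y1 = obj.bbox
    dx = max(-x0, min(dx, 1.0 - x1))
    dy = max(-y0, min(dy, 1.0 - y1))
    # Round to keep the numbers in the caption short and stable.
    moved = tuple(round(v, 4) for v in (x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return SceneObject(obj.description, moved, obj.rgb)


def to_structured_caption(scene: str, objects: list) -> str:
    """Serialize the scene into structured text that a model could be conditioned on."""
    payload = {"scene": scene, "objects": [asdict(o) for o in objects]}
    return json.dumps(payload, ensure_ascii=False)


# Usage: pick a color, place an object, drag it to the right, and re-serialize.
mug = SceneObject("ceramic mug", (0.1, 0.2, 0.4, 0.6), hex_to_rgb("#C81E1E"))
mug = drag(mug, dx=0.25, dy=0.0)
caption = to_structured_caption("a wooden desk by a window", [mug])
print(caption)
```

The point of the sketch is that a drag or a color pick only changes a few numbers in the structured text, so the user never has to rephrase a descriptive prompt to nudge an object or adjust its shade.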