CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

December 3, 2025
Authors: Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, Wen-Huang Cheng
cs.AI

Abstract

Cooking is a sequential and visually grounded activity in which each step, such as chopping, mixing, or frying, carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle with structured multi-step scenarios such as recipe illustration. In addition, current recipe illustration methods cannot adapt to the natural variability in recipe length: they generate a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with their corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything outperforms existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for applications in instructional media and procedural content creation.
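The abstract does not say how SRC pairs steps with image regions, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes a block-diagonal attention mask, a common way to realize regional control, where the text tokens of step i may attend only to the image tokens of region i. The function name `build_step_region_mask`, the equal-sized region layout, and all token counts are hypothetical.

```python
from typing import List


def build_step_region_mask(step_token_counts: List[int],
                           region_token_count: int) -> List[List[bool]]:
    """Return mask[t][p]: True if text token t may attend to image token p.

    Assumed (hypothetical) layout: the text tokens of all steps are
    concatenated in order, and the image canvas is split into one
    equal-sized block of image tokens per step.
    """
    num_steps = len(step_token_counts)
    total_image_tokens = num_steps * region_token_count
    mask: List[List[bool]] = []
    for step_index, token_count in enumerate(step_token_counts):
        region_start = step_index * region_token_count
        region_end = region_start + region_token_count
        # Only image tokens inside this step's region are visible.
        row = [region_start <= p < region_end
               for p in range(total_image_tokens)]
        # Every text token of this step shares the same allowed region.
        mask.extend(list(row) for _ in range(token_count))
    return mask


# A three-step recipe whose steps tokenize to 2, 1, and 3 tokens,
# with 2 image tokens per region (all numbers are illustrative).
mask = build_step_region_mask([2, 1, 3], region_token_count=2)
```

Because the mask is built from the list of steps, its size grows with the recipe: adding a step adds one more region and one more block of rows, which is one simple way to picture a single denoising pass covering a variable number of steps.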