Learning Continuous 3D Words for Text-to-Image Generation

Abstract

Current controls over diffusion models for image generation (e.g., text prompts or ControlNet) fall short in recognizing abstract, continuous attributes such as illumination direction or non-rigid shape change. In this paper, we present an approach that gives users of text-to-image models fine-grained control over several attributes in an image. We do this by engineering special sets of input tokens that can be transformed in a continuous manner, which we call Continuous 3D Words. These attributes can, for example, be represented as sliders and applied jointly with text prompts for fine-grained control over image generation. Given only a single mesh and a rendering engine, we show that our approach can be adopted to provide continuous user control over several 3D-aware attributes, including time-of-day illumination, bird wing orientation, dolly zoom effect, and object poses. Our method can condition image creation on multiple Continuous 3D Words and text descriptions simultaneously while adding no overhead to the generative process. Project page: https://ttchengab.github.io/continuous_3d_words
