잠재 구역 네트워크: 생성 모델링, 표현 학습 및 분류를 위한 통합 원리

초록

생성 모델링, 표현 학습, 그리고 분류는 기계 학습(ML)의 세 가지 핵심 문제이지만, 이들의 최첨단(SoTA) 솔루션은 여전히 대부분 분리된 상태로 남아 있다. 본 논문에서 우리는 다음과 같은 질문을 던진다: 단일 원칙이 이 세 가지 문제를 모두 해결할 수 있을까? 이러한 통합은 ML 파이프라인을 단순화하고 다양한 작업 간의 시너지를 증진시킬 수 있다. 우리는 이러한 목표를 향한 한 걸음으로서 잠재 구역 네트워크(Latent Zoning Network, LZN)를 소개한다. LZN의 핵심은 모든 작업에 걸쳐 정보를 인코딩하는 공유된 가우시안 잠재 공간을 생성하는 것이다. 각 데이터 유형(예: 이미지, 텍스트, 레이블)은 샘플을 분리된 잠재 구역으로 매핑하는 인코더와 잠재 공간을 다시 데이터로 매핑하는 디코더로 구성된다. ML 작업은 이러한 인코더와 디코더의 조합으로 표현된다: 예를 들어, 레이블 조건부 이미지 생성은 레이블 인코더와 이미지 디코더를 사용하며, 이미지 임베딩은 이미지 인코더를 사용하고, 분류는 이미지 인코더와 레이블 디코더를 사용한다. 우리는 LZN의 잠재력을 세 가지 점점 더 복잡한 시나리오에서 입증한다: (1) LZN은 기존 모델을 향상시킬 수 있다(이미지 생성): SoTA Rectified Flow 모델과 결합했을 때, LZN은 CIFAR10에서 FID를 2.76에서 2.59로 개선한다—훈련 목표를 수정하지 않고도. (2) LZN은 독립적으로 작업을 해결할 수 있다(표현 학습): LZN은 보조 손실 함수 없이도 비지도 표현 학습을 구현할 수 있으며, ImageNet에서의 하위 선형 분류에서 MoCo와 SimCLR 방법을 각각 9.3%와 0.2% 앞선다. (3) LZN은 여러 작업을 동시에 해결할 수 있다(생성과 분류의 결합): 이미지와 레이블 인코더/디코더를 사용하여 LZN은 설계상 두 작업을 동시에 수행하며, FID를 개선하고 CIFAR10에서 SoTA 분류 정확도를 달성한다. 코드와 훈련된 모델은 https://github.com/microsoft/latent-zoning-networks에서 확인할 수 있다. 프로젝트 웹사이트는 https://zinanlin.me/blogs/latent_zoning_networks.html에 있다.

English

Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59-without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.

잠재 구역 네트워크: 생성 모델링, 표현 학습 및 분류를 위한 통합 원리

Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

초록

Support