By Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, Chanjun Park
As the development of large language models (LLMs) progresses, aligning them with
human preferences has become increasingly important. We propose stepwise DPO
(sDPO), an extension of the recently popularized direct preference optimization
(DPO) for alignment tuning. This approach involves dividing the available
preference datasets and utilizing them in a stepwise manner, rather than
employing them all at once. We demonstrate that this method facilitates the use
of more precisely aligned reference models within the DPO training framework.
Furthermore, the final model trained with sDPO is more performant, even
outperforming other popular LLMs with more parameters.
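The stepwise procedure can be illustrated with a minimal sketch: the standard DPO objective is applied to one chunk of the preference data at a time, and the policy from the previous step becomes the frozen reference model for the next. The chunk split, hyperparameters, and the `train_one_dpo_round` helper below are illustrative assumptions, not the authors' implementation.

```python
import copy
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss computed from sequence log-probs under policy and reference."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def sdpo_train(model, preference_data, train_one_dpo_round, num_steps=2):
    """Stepwise DPO sketch: split the preference data into chunks and, at each
    step, use the previously aligned model as the (frozen) reference model."""
    chunk_size = len(preference_data) // num_steps
    reference = copy.deepcopy(model)          # step-1 reference: the initial SFT model
    for step in range(num_steps):
        chunk = preference_data[step * chunk_size:(step + 1) * chunk_size]
        # `train_one_dpo_round` is a hypothetical helper: it runs ordinary DPO on
        # one data chunk with the given frozen reference and returns the updated policy.
        model = train_one_dpo_round(model, reference, chunk, loss_fn=dpo_loss)
        reference = copy.deepcopy(model)      # the next step's reference is more aligned
    return model
```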
3D Gaussian Splatting (GS) has achieved considerable improvements over Neural
Radiance Fields in terms of 3D fitting fidelity and rendering speed. However,
its unstructured representation of scattered Gaussians poses a significant
challenge for generative modeling. To address this problem, we introduce
GaussianCube, a structured GS representation that is both powerful and
efficient for generative modeling. We achieve this by first proposing a
modified densification-constrained GS fitting algorithm which can yield
high-quality fitting results using a fixed number of free Gaussians, and then
re-arranging the Gaussians into a predefined voxel grid via Optimal Transport.
The structured grid representation allows us to use a standard 3D U-Net as the
backbone for diffusion-based generative modeling without elaborate designs. Extensive
experiments conducted on ShapeNet and OmniObject3D show that our model achieves
state-of-the-art generation results both qualitatively and quantitatively,
underscoring the potential of GaussianCube as a powerful and versatile 3D
representation.
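As a rough illustration of the rearrangement step described above, the sketch below assigns a fixed number of fitted Gaussians to the cells of a voxel grid by solving a dense assignment problem on pairwise distances (via scipy's `linear_sum_assignment`). The abstract specifies Optimal Transport; this Hungarian-style assignment, the grid extent, and the feature layout are simplifying assumptions, and the dense cost matrix is only tractable for small grids.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def arrange_gaussians_into_grid(gaussian_means, gaussian_features, resolution=16):
    """Assign a fixed number of Gaussians to the cells of a voxel grid by solving
    an assignment problem on squared distances between Gaussian centers and voxel
    centers (a simple stand-in for the Optimal Transport step)."""
    n = gaussian_means.shape[0]
    assert n == resolution ** 3, "fitting must yield exactly one Gaussian per voxel"

    # Voxel centers of a regular grid spanning [-1, 1]^3.
    axis = (np.arange(resolution) + 0.5) / resolution * 2.0 - 1.0
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    voxel_centers = grid.reshape(-1, 3)

    # Cost: squared Euclidean distance between every Gaussian and every voxel center.
    # This dense n x n matrix limits the sketch to small grids.
    cost = ((gaussian_means[:, None, :] - voxel_centers[None, :, :]) ** 2).sum(-1)
    gauss_idx, voxel_idx = linear_sum_assignment(cost)

    # Scatter the per-Gaussian features into the structured grid.
    feat_dim = gaussian_features.shape[1]
    volume = np.zeros((resolution ** 3, feat_dim), dtype=gaussian_features.dtype)
    volume[voxel_idx] = gaussian_features[gauss_idx]
    return volume.reshape(resolution, resolution, resolution, feat_dim)
```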
There has been tremendous progress in multimodal Large Language Models
(LLMs). Recent works have extended these models to video input with promising
instruction following capabilities. However, an important missing piece is
temporal localization. These models cannot accurately answer "When?"
questions. We identify three key aspects that limit their temporal localization
capabilities: (i) time representation, (ii) architecture, and (iii) data. We
address these shortcomings by proposing Language Instructed
Temporal-Localization Assistant (LITA) with the following features: (1) We
introduce time tokens that encode timestamps relative to the video length to
better represent time in videos. (2) We introduce SlowFast tokens in the
architecture to capture temporal information at fine temporal resolution. (3)
We emphasize temporal localization data for LITA. In addition to leveraging
existing video datasets with timestamps, we propose a new task, Reasoning
Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for
learning and evaluating this task. Reasoning temporal localization requires
Video LLMs to combine reasoning with temporal localization. LITA demonstrates
strong performance on this challenging task, nearly doubling the temporal mean
intersection-over-union (mIoU) of baselines. In addition, we show that our
emphasis on temporal localization also substantially improves video-based text
generation compared to existing Video LLMs, including a 36% relative
improvement in Temporal Understanding. Code is available at:
https://github.com/NVlabs/LITA
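The relative time tokens mentioned in feature (1) can be sketched as follows: timestamps are normalized by the video length and quantized into a fixed vocabulary of time tokens. The token naming scheme and the number of tokens are assumptions for illustration.

```python
def timestamp_to_token(t_seconds, video_length, num_time_tokens=100):
    """Map an absolute timestamp onto one of K time tokens relative to video length."""
    t_rel = min(max(t_seconds / video_length, 0.0), 1.0)          # normalize to [0, 1]
    idx = min(int(t_rel * num_time_tokens), num_time_tokens - 1)  # quantize into K bins
    return f"<t_{idx}>"

def token_to_timestamp(token, video_length, num_time_tokens=100):
    """Invert the mapping: recover the bin-center timestamp in seconds."""
    idx = int(token.strip("<>").split("_")[1])
    return (idx + 0.5) / num_time_tokens * video_length

# Example: in a 90 s video, second 45 falls in the middle bin.
assert timestamp_to_token(45.0, 90.0) == "<t_50>"
```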
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have
revolutionized the field of content generation, enabling significant
advancements in areas like image editing and video synthesis. Despite their
formidable capabilities, these models are not without their limitations. It is
still challenging to synthesize an image that aligns well with the input text,
and multiple runs with carefully crafted prompts are required to achieve
satisfactory results. To mitigate these limitations, numerous studies have
endeavored to fine-tune the pre-trained diffusion model, i.e., the UNet, using
various techniques. Yet, amidst these efforts, a pivotal question of
text-to-image diffusion model training has remained largely unexplored: Is it
possible and feasible to fine-tune the text encoder to improve the performance
of text-to-image diffusion models? Our findings reveal that, instead of
replacing the CLIP text encoder used in Stable Diffusion with other large
language models, we can enhance it through our proposed fine-tuning approach,
TextCraftor, leading to substantial improvements in quantitative benchmarks and
human assessments. Interestingly, our technique also empowers controllable
image generation through the interpolation of different text encoders
fine-tuned with various rewards. We also demonstrate that TextCraftor is
orthogonal to UNet finetuning, and can be combined to further improve
generative quality.
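The controllable-generation claim above suggests a simple picture: blending text encoders fine-tuned with different rewards. Below is a minimal sketch of plain weight-space interpolation between two fine-tuned CLIP text encoder checkpoints; the checkpoint names are hypothetical, and whether TextCraftor interpolates weights or intermediate embeddings is not specified here.

```python
import torch

def interpolate_text_encoders(state_dict_a, state_dict_b, alpha=0.5):
    """Linear weight-space interpolation between two fine-tuned text encoders.
    alpha = 0 returns encoder A, alpha = 1 returns encoder B."""
    assert state_dict_a.keys() == state_dict_b.keys()
    return {
        name: torch.lerp(state_dict_a[name].float(), state_dict_b[name].float(), alpha)
        for name in state_dict_a
    }

# Usage (hypothetical checkpoints fine-tuned with different reward models):
# blended = interpolate_text_encoders(torch.load("encoder_aesthetic.pt"),
#                                     torch.load("encoder_clip_score.pt"),
#                                     alpha=0.3)
# pipe.text_encoder.load_state_dict(blended)
```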
By Yujin Chen, Yinyu Nie, Benjamin Ummenhofer, Reiner Birkl, Michael Paulitsch, Matthias Müller, Matthias Nießner
We present Mesh2NeRF, an approach to derive ground-truth radiance fields from
textured meshes for 3D generation tasks. Many 3D generative approaches
represent 3D scenes as radiance fields for training. Their ground-truth
radiance fields are usually fitted from multi-view renderings from a
large-scale synthetic 3D dataset, which often results in artifacts due to
occlusions or under-fitting issues. In Mesh2NeRF, we propose an analytic
solution to directly obtain ground-truth radiance fields from 3D meshes,
characterizing the density field with an occupancy function featuring a defined
surface thickness, and determining view-dependent color through a reflection
function considering both the mesh and environment lighting. Mesh2NeRF extracts
accurate radiance fields, which provide direct supervision for training
generative NeRFs and single-scene representations. We validate the effectiveness
of Mesh2NeRF across various tasks, achieving a noteworthy 3.12 dB improvement in
PSNR for view synthesis in single-scene representation on the ABO dataset, a
0.69 dB PSNR improvement in single-view conditional generation of ShapeNet
Cars, and notably improved mesh extraction from NeRF in the unconditional
generation of Objaverse Mugs.
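One plausible reading of the density construction described above (an occupancy function with a defined surface thickness) is sketched below: points within a thin band around the mesh surface receive a constant, large density, and all other points receive zero. The constant value, the band interpretation, and the SDF helper are assumptions, not the paper's exact formulation.

```python
import numpy as np

def analytic_density(points, signed_distance_fn, surface_thickness=0.01, sigma_max=1e3):
    """Occupancy-style density: a constant, large density inside a shell of the
    given thickness around the mesh surface, and zero elsewhere."""
    sdf = signed_distance_fn(points)                      # (N,) signed distances to the mesh
    inside_shell = np.abs(sdf) <= surface_thickness / 2.0
    return np.where(inside_shell, sigma_max, 0.0)

# Usage with a hypothetical SDF helper (e.g. built on trimesh):
# import trimesh
# mesh = trimesh.load("chair.obj")
# sdf_fn = lambda p: trimesh.proximity.signed_distance(mesh, p)
# sigma = analytic_density(sample_points, sdf_fn)
```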