By Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen
We explore an evolutionary search strategy for scaling inference-time compute
in Large Language Models. The proposed approach, Mind Evolution, uses a
language model to generate, recombine, and refine candidate responses, and it
avoids the need to formalize the underlying inference problem whenever a
solution evaluator is available. Controlling for inference cost, we
find that Mind Evolution significantly outperforms other inference strategies
such as Best-of-N and Sequential Revision in natural language planning tasks.
In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more
than 98% of the problem instances using Gemini 1.5 Pro without the use of a
formal solver.
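A minimal sketch of the kind of generate/recombine/refine loop the abstract describes, assuming a programmatic solution evaluator is available; the helper callables (`llm_generate`, `llm_recombine`, `llm_refine`, `evaluate`) and all hyperparameters are hypothetical, not taken from the paper:

```python
import random

def evolutionary_search_sketch(task, llm_generate, llm_recombine, llm_refine,
                               evaluate, population_size=8, generations=10):
    """Hypothetical evolutionary search over LLM-generated candidate responses.

    `evaluate` stands in for the solution evaluator the abstract assumes;
    here it returns a (score, feedback) pair for a candidate response.
    """
    # Initial population: independently sampled candidate responses.
    population = [llm_generate(task) for _ in range(population_size)]

    for _ in range(generations):
        scored = [(evaluate(task, cand), cand) for cand in population]
        scored.sort(key=lambda x: x[0][0], reverse=True)
        if scored[0][0][0] >= 1.0:  # assume a score of 1.0 means all checks pass
            return scored[0][1]

        # Keep the best half as parents.
        parents = [cand for _, cand in scored[: population_size // 2]]

        children = []
        while len(children) < population_size - len(parents):
            a, b = random.sample(parents, 2)
            child = llm_recombine(task, a, b)            # crossover via the LLM
            _, feedback = evaluate(task, child)
            children.append(llm_refine(task, child, feedback))  # refinement guided by feedback
        population = parents + children

    # No perfect solution found: return the best remaining candidate.
    return max(population, key=lambda c: evaluate(task, c)[0])
```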
By Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E
We introduce PaSa, an advanced Paper Search agent powered by large language
models. PaSa can autonomously make a series of decisions, including invoking
search tools, reading papers, and selecting relevant references, to ultimately
obtain comprehensive and accurate results for complex scholarly queries. We
optimize PaSa using reinforcement learning with a synthetic dataset,
AutoScholarQuery, which includes 35k fine-grained academic queries and
corresponding papers sourced from top-tier AI conference publications.
Additionally, we develop RealScholarQuery, a benchmark collecting real-world
academic queries to assess PaSa performance in more realistic scenarios.
Despite being trained on synthetic data, PaSa significantly outperforms
existing baselines on RealScholarQuery, including Google, Google Scholar,
Google with GPT-4 for paraphrased queries, ChatGPT (search-enabled GPT-4o),
GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably,
PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78%
in recall@20 and 39.90% in recall@50. It also exceeds PaSa-GPT-4o by 30.36% in
recall and 4.25% in precision. Model, datasets, and code are available at
https://github.com/bytedance/pasa.
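For reference, the recall@k metric reported above measures the fraction of relevant papers that appear among the top-k retrieved results; a small illustrative helper (not taken from the PaSa codebase):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant papers found in the top-k retrieved results."""
    top_k = set(retrieved[:k])
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(top_k & relevant) / len(relevant)

# Example: 3 of 4 relevant papers appear in the top 20 results -> recall@20 = 0.75
```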
By Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego
One of the most widely used methods to evaluate LLMs is the Multiple Choice
Question (MCQ) test. MCQ benchmarks enable testing LLM knowledge on
almost any topic at scale, as the results can be processed automatically. To
help the LLM answer, a few examples, known as few-shot examples, can be
included in the prompt. Moreover, the LLM can be asked to answer the question directly with the
selected option or to first provide the reasoning and then the selected answer,
which is known as chain of thought. In addition to checking whether the
selected answer is correct, the evaluation can look at the LLM-estimated
probability of its response as an indication of the confidence of the LLM in
the response. In this paper, we study how the LLM confidence in its answer
depends on whether the model has been asked to answer directly or to provide
the reasoning before answering. The results of the evaluation of questions on a
wide range of topics in seven different models show that LLMs are more
confident in their answers when they provide reasoning before the answer. This
occurs regardless of whether the selected answer is correct. Our hypothesis is
that the reasoning modifies the probability of the selected answer, since the
LLM predicts the answer based on both the input question and the reasoning that
supports its selection. Therefore, LLM-estimated probabilities seem to have
intrinsic limitations that should be understood before they are used in
evaluation procedures. Interestingly, the same behavior
has been observed in humans, for whom explaining an answer increases confidence
in its correctness.
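A minimal sketch of the confidence measure discussed here, assuming it is read off as the softmax probability of the selected option letter given the model's logits over the options (the logit values below are invented purely for illustration):

```python
import math

def option_confidence(option_logits, selected):
    """Softmax probability of the selected option, given logits over the options.

    `option_logits` maps option letters to the model's logits for the
    corresponding answer tokens; `selected` is the letter the model chose.
    """
    m = max(option_logits.values())
    exp = {k: math.exp(v - m) for k, v in option_logits.items()}  # numerically stable softmax
    total = sum(exp.values())
    return exp[selected] / total

# Illustrative values only: when reasoning precedes the answer, the distribution
# over options tends to sharpen, raising the probability of the selected answer.
direct = option_confidence({"A": 2.1, "B": 1.7, "C": 0.9, "D": 0.4}, "A")
cot    = option_confidence({"A": 4.0, "B": 1.2, "C": 0.3, "D": 0.1}, "A")
print(f"direct: {direct:.2f}, after reasoning: {cot:.2f}")
```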
The 2D cartoon style is a prominent art form in digital character creation,
particularly popular among younger audiences. While advancements in digital
human technology have spurred extensive research into photorealistic digital
humans and 3D characters, interactive 2D cartoon characters have received
comparatively less attention. Unlike 3D counterparts, which require
sophisticated construction and resource-intensive rendering, Live2D, a
widely-used format for 2D cartoon characters, offers a more efficient
alternative: it allows 2D characters to be animated in a manner that simulates
3D movement without requiring a complete 3D model. Furthermore,
Live2D employs lightweight HTML5 (H5) rendering, improving both accessibility
and efficiency. In this technical report, we introduce Textoon, an innovative
method for generating diverse 2D cartoon characters in the Live2D format based
on text descriptions. Textoon leverages cutting-edge language and vision
models to comprehend textual intentions and generate 2D appearances, and it can
create a wide variety of stunning, interactive 2D characters within one
minute. The project homepage is https://human3daigc.github.io/Textoon_webpage/.
By Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang
Enhancing large language models (LLMs) with real-time APIs can help generate
more accurate and up-to-date responses. However, evaluating the function
calling abilities of LLMs in real-world scenarios remains under-explored due to
the complexity of data collection and evaluation. In this work, we introduce
ComplexFuncBench, a benchmark for complex function calling across five
real-world scenarios. Compared to existing benchmarks, ComplexFuncBench
encompasses multi-step and constrained function calling, which requires
long-parameter filling, parameter value reasoning, and a 128k long context.
Additionally, we propose an automatic framework, ComplexEval, for
quantitatively evaluating complex function calling tasks. Through comprehensive
experiments, we demonstrate the deficiencies of state-of-the-art LLMs in
function calling and suggest future directions for optimizing these
capabilities. The data and code are available at
https://github.com/THUDM/ComplexFuncBench.
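As a simplified illustration of scoring a multi-step function-calling trace (this is not the ComplexEval implementation; the exact-match criterion and the example calls are assumptions):

```python
def match_call(predicted, expected):
    """True if the predicted call uses the expected function with the expected arguments."""
    return (predicted["name"] == expected["name"]
            and predicted.get("arguments", {}) == expected.get("arguments", {}))

def score_trace(predicted_calls, expected_calls):
    """Fraction of reference calls matched, in order, by the predicted multi-step trace."""
    matched = sum(
        1 for pred, exp in zip(predicted_calls, expected_calls) if match_call(pred, exp)
    )
    return matched / len(expected_calls) if expected_calls else 1.0

# Hypothetical two-step trace where the second call gets one parameter wrong.
predicted = [
    {"name": "search_flights", "arguments": {"origin": "PEK", "dest": "SFO", "date": "2024-05-01"}},
    {"name": "book_hotel", "arguments": {"city": "San Francisco", "nights": 2}},
]
expected = [
    {"name": "search_flights", "arguments": {"origin": "PEK", "dest": "SFO", "date": "2024-05-01"}},
    {"name": "book_hotel", "arguments": {"city": "San Francisco", "nights": 3}},
]
print(score_trace(predicted, expected))  # 0.5
```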
By Di Chang, Hongyi Xu, You Xie, Yipeng Gao, Zhengfei Kuang, Shengqu Cai, Chenxu Zhang, Guoxian Song, Chao Wang, Yichun Shi, Zeyuan Chen, Shijie Zhou, Linjie Luo, Gordon Wetzstein, Mohammad Soleymani
We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for
animating a single human image using facial expressions and body movements
derived from a driving video, generating realistic, context-aware dynamics
for both the subject and the surrounding environment. Building on prior
approaches centered on human pose control, X-Dyna addresses key shortcomings
that cause the loss of dynamic details, enhancing the lifelike qualities of human
video animations. At the core of our approach is the Dynamics-Adapter, a
lightweight module that effectively integrates reference appearance context
into the spatial attentions of the diffusion backbone while preserving the
capacity of the motion modules to synthesize fluid and intricate dynamic details.
Beyond body pose control, we connect a local control module with our model to
capture identity-disentangled facial expressions, facilitating accurate
expression transfer for enhanced realism in animated scenes. Together, these
components form a unified framework capable of learning physical human motion
and natural scene dynamics from a diverse blend of human and scene videos.
Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna
outperforms state-of-the-art methods, creating highly lifelike and expressive
animations. The code is available at https://github.com/bytedance/X-Dyna.
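The Dynamics-Adapter is described as injecting reference appearance context into the spatial attention of the diffusion backbone. Below is a generic, hypothetical sketch of that kind of appearance injection, not the actual X-Dyna module; the class name, gating scheme, and tensor shapes are assumptions:

```python
import torch
import torch.nn as nn

class ReferenceAppearanceAdapter(nn.Module):
    """Illustrative adapter: lets spatial attention also attend to reference-image tokens.

    Tensor shapes follow the usual (batch, tokens, channels) convention of
    diffusion UNet attention blocks; this is a sketch, not the X-Dyna design.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gate keeps the backbone unchanged at start

    def forward(self, hidden_states, reference_tokens):
        # Queries come from the animated frame, keys/values from the reference image.
        attn_out, _ = self.attn(hidden_states, reference_tokens, reference_tokens)
        return hidden_states + self.gate.tanh() * attn_out
```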
By Nada Saadi, Tathagata Raha, Clément Christophe, Marco AF Pimentel, Ronnie Rajan, Praveen K Kanithi
This paper investigates the challenges of developing large language models
(LLMs) proficient in both multilingual understanding and medical knowledge. We
demonstrate that simply translating medical data does not guarantee strong
performance on clinical tasks in the target language. Our experiments reveal
that the optimal language mix in training data varies significantly across
different medical tasks. We find that larger models with carefully calibrated
language ratios achieve superior performance on native-language clinical tasks.
Furthermore, our results suggest that relying solely on fine-tuning may not be
the most effective approach for incorporating new language knowledge into LLMs.
Instead, data- and computation-intensive pretraining methods may still be
necessary to achieve optimal performance in multilingual medical settings.
These findings provide valuable guidance for building effective and inclusive
medical AI systems for diverse linguistic communities.
By Shengkui Zhao, Kun Zhou, Zexu Pan, Yukun Ma, Chong Zhang, Bin Ma
The application of generative adversarial networks (GANs) has recently
advanced speech super-resolution (SR) based on intermediate representations
like mel-spectrograms. However, existing SR methods, which typically rely on
independently trained and concatenated networks, may produce inconsistent
representations and poor speech quality, especially in out-of-domain scenarios.
In this work, we propose HiFi-SR, a unified network that leverages end-to-end
adversarial training to achieve high-fidelity speech super-resolution. Our
model features a unified transformer-convolutional generator designed to
seamlessly handle both the prediction of latent representations and their
conversion into time-domain waveforms. The transformer network serves as a
powerful encoder, converting low-resolution mel-spectrograms into latent space
representations, while the convolutional network upscales these representations
into high-resolution waveforms. To enhance high-frequency fidelity, we
incorporate a multi-band, multi-scale time-frequency discriminator, along with
a multi-scale mel-reconstruction loss in the adversarial training process.
HiFi-SR is versatile, capable of upscaling any input speech signal between 4
kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that
HiFi-SR significantly outperforms existing speech SR methods across both
objective metrics and ABX preference tests, for both in-domain and
out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
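A rough PyTorch skeleton of the transformer-convolutional generator shape described above; layer sizes, strides, and activations are illustrative guesses, not the HiFi-SR configuration:

```python
import torch
import torch.nn as nn

class TransformerConvSRGenerator(nn.Module):
    """Sketch of a unified generator: transformer encoder over low-resolution mel frames,
    followed by a transposed-convolution stack that upsamples latents into a waveform."""

    def __init__(self, n_mels=80, d_model=512, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # The product of the strides (8 * 8 * 4 = 256) sets the number of output
        # samples per mel frame; a real model would match this to its hop size.
        self.upsampler = nn.Sequential(
            nn.ConvTranspose1d(d_model, 256, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(128, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),
        )

    def forward(self, mel):                                # mel: (batch, frames, n_mels)
        latents = self.encoder(self.in_proj(mel))          # (batch, frames, d_model)
        return self.upsampler(latents.transpose(1, 2))     # (batch, 1, frames * 256)
```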
By Xiangyue Liu, Kunming Luo, Heng Li, Qi Zhang, Yuan Liu, Li Yi, Ping Tan
We introduce GaussianAvatar-Editor, an innovative framework for text-driven
editing of animatable Gaussian head avatars that can be fully controlled in
expression, pose, and viewpoint. Unlike static 3D Gaussian editing, editing
animatable 4D Gaussian avatars presents challenges related to motion occlusion
and spatial-temporal inconsistency. To address these issues, we propose the
Weighted Alpha Blending Equation (WABE). This function enhances the blending
weight of visible Gaussians while suppressing the influence of non-visible
Gaussians, effectively handling motion occlusion during editing. Furthermore,
to improve editing quality and ensure 4D consistency, we incorporate
conditional adversarial learning into the editing process. This strategy helps
to refine the edited results and maintain consistency throughout the animation.
By integrating these methods, our GaussianAvatar-Editor achieves photorealistic
and consistent results in animatable 4D Gaussian editing. We conduct
comprehensive experiments across various subjects to validate the effectiveness
of our proposed techniques, demonstrating the superiority of our approach
over existing methods. More results and code are available at
https://xiangyueliu.github.io/GaussianAvatar-Editor/.
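As a rough illustration of the weighted-blending idea (the precise WABE formulation is defined in the paper; the per-Gaussian visibility weights here are an assumed input), standard front-to-back alpha blending can be modulated so that occluded Gaussians contribute little to the edited result:

```python
import numpy as np

def weighted_alpha_blend(colors, alphas, visibility):
    """Front-to-back alpha blending with per-Gaussian visibility weighting.

    colors:     (N, 3) RGB of the Gaussians sorted front to back along a ray
    alphas:     (N,)   opacities in [0, 1]
    visibility: (N,)   weights near 1 for visible Gaussians, near 0 for occluded ones
    """
    out = np.zeros(3)
    transmittance = 1.0
    for c, a, w in zip(colors, alphas, visibility):
        a_eff = a * w                       # suppress contributions of non-visible Gaussians
        out += transmittance * a_eff * c
        transmittance *= (1.0 - a_eff)
    return out
```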