Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
1334
Unifying multimodal understanding and generation has shown impressive
capabilities in cutting-edge proprietary systems. In this work, we introduce
BAGEL, an open-source foundational model that natively supports multimodal
understanding and generation. BAGEL is a unified, decoder-only model pretrained
on trillions of tokens curated from large-scale interleaved text, image, video,
and web data. When scaled with such diverse multimodal interleaved data, BAGEL
exhibits emerging capabilities in complex multimodal reasoning. As a result, it
significantly outperforms open-source unified models in both multimodal
generation and understanding across standard benchmarks, while exhibiting
advanced multimodal reasoning abilities such as free-form image manipulation,
future frame prediction, 3D manipulation, and world navigation. In the hope of
facilitating further opportunities for multimodal research, we share the key
findings, pretraining details, data creation protocol, and release our code and
checkpoints to the community. The project page is at https://bagel-ai.org/
Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen
766
The efficiency of attention is important due to its quadratic time
complexity. We enhance the efficiency of attention through two key
contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to
accelerate attention computation. Our implementation achieves 1038 TOPS on an
RTX 5090, a 5x speedup over the fastest FlashAttention kernel on the same GPU.
Experiments show that our FP4 attention can accelerate inference of various
models in a plug-and-play way. Second, we pioneer low-bit attention to training
tasks. Existing low-bit attention works like FlashAttention3 and SageAttention
focus only on inference. However, the efficiency of training large models is
also important. To explore whether low-bit attention can be effectively applied
to training tasks, we design an accurate and efficient 8-bit attention for both
forward and backward propagation. Experiments indicate that 8-bit attention
achieves lossless performance in fine-tuning tasks but exhibits slower
convergence in pretraining tasks. The code will be available at
https://github.com/thu-ml/SageAttention.
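To make the low-bit idea concrete, here is a minimal NumPy sketch of INT8-quantized attention scores with per-tensor scales and dequantization before the softmax; it only illustrates the general quantization pattern and is not the paper's FP4/FP8 Tensor Core kernel.

```python
# Minimal sketch: per-tensor INT8 quantization of Q and K for the score matmul,
# accumulated in INT32 and dequantized before softmax; softmax and PV stay in float32.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization; returns int8 tensor and scale."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def attention_int8_scores(Q, K, V):
    d = Q.shape[-1]
    q_i8, q_s = quantize_int8(Q)
    k_i8, k_s = quantize_int8(K)
    scores = q_i8.astype(np.int32) @ k_i8.astype(np.int32).T        # integer matmul
    scores = scores.astype(np.float32) * (q_s * k_s) / np.sqrt(d)   # dequantize
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V

Q, K, V = (np.random.randn(128, 64).astype(np.float32) for _ in range(3))
out = attention_int8_scores(Q, K, V)
```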
Reward models play a critical role in guiding large language models toward
outputs that align with human expectations. However, an open challenge remains
in effectively utilizing test-time compute to enhance reward model performance.
In this work, we introduce Reward Reasoning Models (RRMs), which are
specifically designed to execute a deliberate reasoning process before
generating final rewards. Through chain-of-thought reasoning, RRMs leverage
additional test-time compute for complex queries where appropriate rewards are
not immediately apparent. To develop RRMs, we implement a reinforcement
learning framework that fosters self-evolved reward reasoning capabilities
without requiring explicit reasoning traces as training data. Experimental
results demonstrate that RRMs achieve superior performance on reward modeling
benchmarks across diverse domains. Notably, we show that RRMs can adaptively
exploit test-time compute to further improve reward accuracy. The pretrained
reward reasoning models are available at
https://huggingface.co/Reward-Reasoning.
Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
362
Scaling test-time compute is crucial for enhancing the reasoning capabilities
of large language models (LLMs). Existing approaches typically employ
reinforcement learning (RL) to maximize a verifiable reward obtained at the end
of reasoning traces. However, such methods optimize only the final performance
under a large and fixed token budget, which hinders efficiency in both training
and deployment. In this work, we present a novel framework, AnytimeReasoner, to
optimize anytime reasoning performance, which aims to improve token efficiency
and the flexibility of reasoning under varying token budget constraints. To
achieve this, we truncate the complete thinking process to fit within token
budgets sampled from a prior distribution, compelling the model to summarize the
best answer for each truncated thought, which is then verified. This introduces
verifiable dense rewards into the reasoning process, facilitating more
effective credit assignment in RL optimization. We then optimize the thinking
and summary policies in a decoupled manner to maximize the cumulative reward.
Additionally, we introduce a novel variance reduction technique, Budget
Relative Policy Optimization (BRPO), to enhance the robustness and efficiency
of the learning process when reinforcing the thinking policy. Empirical results
in mathematical reasoning tasks demonstrate that our method consistently
outperforms GRPO across all thinking budgets under various prior distributions,
enhancing both training and token efficiency.
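As a schematic illustration of the anytime objective (the interfaces, budget prior, and toy verifier below are assumptions, not the authors' implementation), this sketch samples token budgets from a prior, summarizes an answer at each truncation, and averages the verifiable rewards.

```python
# Illustrative sketch: expected verifiable reward over sampled thinking budgets.
import random

def anytime_reward(verify, summarize, thinking_tokens, budget_prior, n_samples=8):
    """Average reward of answers summarized at sampled truncation points."""
    total = 0.0
    for _ in range(n_samples):
        budget = budget_prior()               # e.g., uniform over [1, len(thinking)]
        truncated = thinking_tokens[:budget]  # cut the thought at the budget
        answer = summarize(truncated)         # model summarizes its best answer so far
        total += verify(answer)               # 1.0 if verifiably correct else 0.0
    return total / n_samples

# Toy usage with stand-in components.
thinking = ["step"] * 100
prior = lambda: random.randint(1, len(thinking))
summarize = lambda toks: f"answer_after_{len(toks)}"
verify = lambda ans: 1.0 if int(ans.split("_")[-1]) > 60 else 0.0
print(anytime_reward(verify, summarize, thinking, prior))
```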
Emile van Krieken, Pasquale Minervini, Edoardo Ponti, Antonio Vergari
342
Neurosymbolic (NeSy) predictors combine neural perception with symbolic
reasoning to solve tasks like visual reasoning. However, standard NeSy
predictors assume conditional independence between the symbols they extract,
thus limiting their ability to model interactions and uncertainty - often
leading to overconfident predictions and poor out-of-distribution
generalisation. To overcome the limitations of the independence assumption, we
introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy
predictors that use discrete diffusion to model dependencies between symbols.
Our approach reuses the independence assumption from NeSy predictors at each
step of the diffusion process, enabling scalable learning while capturing
symbol dependencies and uncertainty quantification. Across both synthetic and
real-world benchmarks - including high-dimensional visual path planning and
rule-based autonomous driving - NeSyDMs achieve state-of-the-art accuracy among
NeSy predictors and demonstrate strong calibration.
A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native
agentic ability to use external tools such as web browsers for searching and
writing/executing code for image manipulation to think with images. In the
open-source research community, while significant progress has been made in
language-only agentic abilities such as function calling and tool integration,
the development of multi-modal agentic capabilities that involve truly thinking
with images, along with corresponding benchmarks, remains less explored. This
work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning
(Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large
Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the
ability to browse websites for real-time information updates and write code to
manipulate and analyze input images through cropping, rotation, and other image
processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT)
with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs'
agentic search and coding abilities. Our experimental results demonstrate that
Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and
+10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT
also achieves +29.3% F1 / +25.9% EM gains on existing multi-hop QA benchmarks
such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities.
Our findings suggest that Visual-ARFT offers a promising path toward building
robust and generalizable multimodal agents.
Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, Kede Ma
313
DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing
reasoning and generalization capabilities of large language models (LLMs)
through reinforcement learning. Nevertheless, the potential of
reasoning-induced computational modeling has not been thoroughly explored in
the context of image quality assessment (IQA), a task critically dependent on
visual reasoning. In this paper, we introduce VisualQuality-R1, a
reasoning-induced no-reference IQA (NR-IQA) model, and we train it with
reinforcement learning to rank, a learning algorithm tailored to the
intrinsically relative nature of visual quality. Specifically, for a pair of
images, we employ group relative policy optimization to generate multiple
quality scores for each image. These estimates are then used to compute
comparative probabilities of one image having higher quality than the other
under the Thurstone model. Rewards for each quality estimate are defined using
continuous fidelity measures rather than discretized binary labels. Extensive
experiments show that the proposed VisualQuality-R1 consistently outperforms
discriminative deep learning-based NR-IQA models as well as a recent
reasoning-induced quality regression method. Moreover, VisualQuality-R1 is
capable of generating contextually rich, human-aligned quality descriptions,
and supports multi-dataset training without requiring perceptual scale
realignment. These features make VisualQuality-R1 especially well-suited for
reliably measuring progress in a wide range of image processing tasks like
super-resolution and image generation.
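A minimal sketch of the ranking signal, assuming Thurstone Case V with unit-variance score noise and the standard fidelity measure as the continuous reward (the paper's exact formulation may differ).

```python
# Comparative probability under Thurstone Case V and a fidelity-style reward.
from math import erf, sqrt

def thurstone_prob(mu_a, mu_b):
    """P(A judged better than B); unit-variance scores give a difference variance of 2,
    so P = Phi((mu_a - mu_b) / sqrt(2))."""
    z = (mu_a - mu_b) / sqrt(2.0)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def fidelity_reward(p_pred, p_true):
    """Continuous fidelity between predicted and ground-truth comparison probabilities."""
    return sqrt(p_pred * p_true) + sqrt((1.0 - p_pred) * (1.0 - p_true))

# Example: sampled quality estimates 3.8 vs. 3.1; ground truth says A is better.
p = thurstone_prob(3.8, 3.1)
print(p, fidelity_reward(p, p_true=1.0))
```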
Transformers, the standard implementation for large language models (LLMs),
typically consist of tens to hundreds of discrete layers. While more layers can
lead to better performance, this approach has been challenged as far from
efficient, especially given the superiority of continuous layers demonstrated
by diffusion and flow-based models for image generation. We propose the Latent
Flow Transformer (LFT), which replaces a block of layers with a single learned
transport operator trained via flow matching, offering significant compression
while maintaining compatibility with the original architecture. Additionally,
we address the limitations of existing flow-based methods in preserving
coupling by introducing the Flow Walking (FW) algorithm. On the Pythia-410M
model, LFT trained with flow matching compresses 6 of 24 layers and outperforms
directly skipping 2 layers (KL Divergence of LM logits at 0.407 vs. 0.529),
demonstrating the feasibility of this design. When trained with FW, LFT further
distills 12 layers into one while reducing the KL to 0.736, surpassing that from
skipping 3 layers (0.932), significantly narrowing the gap between
autoregressive and flow-based generation paradigms.
Dario Garcia-Gasulla, Jordi Bayarri-Planas, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Adrian Tormos, Daniel Hinjos, Pablo Bernabeu-Perez, Anna Arias-Duart, Pablo Agustin Martin-Torres, Marta Gonzalez-Mallo, Sergio Alvarez-Napagao, Eduard Ayguadé-Parra, Ulises Cortés
272
Purpose: With advancements in Large Language Models (LLMs) for healthcare,
the need arises for competitive open-source models to protect the public
interest. This work contributes to the field of open medical LLMs by optimizing
key stages of data preprocessing and training, while showing how to improve
model safety (through DPO) and efficacy (through RAG). The evaluation
methodology used, which includes four different types of tests, defines a new
standard for the field. The resultant models, shown to be competitive with the
best private alternatives, are released with a permissive license.
Methods: Building on top of strong base models like Llama 3.1 and Qwen 2.5,
Aloe Beta uses a custom dataset to enhance public data with synthetic Chain of
Thought examples. The models undergo alignment with Direct Preference
Optimization, emphasizing ethical and policy-aligned performance in the
presence of jailbreaking attacks. Evaluation includes closed-ended, open-ended,
safety, and human assessments to maximize the reliability of results.
Results: Recommendations are made across the entire pipeline, backed by the
solid performance of the Aloe Family. These models deliver competitive
performance across healthcare benchmarks and medical fields, and are often
preferred by healthcare professionals. On bias and toxicity, the Aloe Beta
models significantly improve safety, showing resilience to unseen jailbreaking
attacks. For a responsible release, a detailed risk assessment specific to
healthcare is attached to the Aloe Family models.
Conclusion: The Aloe Beta models, and the recipe that leads to them, are a
significant contribution to the open-source medical LLM field, offering
top-of-the-line performance while maintaining high ethical requirements. This
work sets a new standard for developing and reporting aligned LLMs in
healthcare.
Reinforcement learning (RL) has recently demonstrated strong potential in
enhancing the reasoning capabilities of large language models (LLMs).
In particular, the "Zero" reinforcement learning introduced by DeepSeek-R1-Zero
enables direct RL training of base LLMs without relying on an intermediate
supervised fine-tuning stage. Despite these advancements, current works for LLM
reasoning mainly focus on mathematical and coding domains, largely due to data
abundance and the ease of answer verification. This limits the applicability
and generalization of such models to broader domains, where questions often
have diverse answer representations, and data is more scarce. In this paper, we
propose General-Reasoner, a novel training paradigm designed to enhance LLM
reasoning capabilities across diverse domains. Our key contributions include:
(1) constructing a large-scale, high-quality dataset of questions with
verifiable answers curated by web crawling, covering a wide range of
disciplines; and (2) developing a generative model-based answer verifier, which
replaces traditional rule-based verification with the capability of
chain-of-thought and context-awareness. We train a series of models and
evaluate them on a wide range of datasets covering domains such as physics,
chemistry, finance, and electronics. Our comprehensive evaluation across these
12 benchmarks (e.g., MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH, and MATH AMC)
demonstrates that General-Reasoner outperforms existing baseline methods,
achieving robust and generalizable reasoning performance while maintaining
superior effectiveness in mathematical reasoning tasks.
Dongkeun Yoon, Seungone Kim, Sohee Yang, Sunkyoung Kim, Soyeon Kim, Yongil Kim, Eunbi Choi, Yireun Kim, Minjoon Seo
202
Despite their strengths, large language models (LLMs) often fail to
communicate their confidence accurately, making it difficult to assess when
they might be wrong and limiting their reliability. In this work, we
demonstrate that reasoning models, i.e., LLMs that engage in extended chain-of-thought
(CoT) reasoning, exhibit superior performance not only in problem-solving but
also in accurately expressing their confidence. Specifically, we benchmark six
reasoning models across six datasets and find that they achieve strictly better
confidence calibration than their non-reasoning counterparts in 33 out of the
36 settings. Our detailed analysis reveals that these gains in calibration stem
from the slow thinking behaviors of reasoning models, such as exploring
alternative approaches and backtracking, which enable them to adjust their
confidence dynamically throughout their CoT, making it progressively more
accurate. In particular, we find that reasoning models become increasingly
better calibrated as their CoT unfolds, a trend not observed in non-reasoning
models. Moreover, removing slow thinking behaviors from the CoT leads to a
significant drop in calibration. Lastly, we show that these gains are not
exclusive to reasoning models: non-reasoning models also benefit when guided to
perform slow thinking via in-context learning.
Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei
192
Recent Large Reasoning Models (LRMs) have shown substantially improved
reasoning capabilities over traditional Large Language Models (LLMs) by
incorporating extended thinking processes prior to producing final responses.
However, excessively lengthy thinking introduces substantial overhead in terms
of token consumption and latency, which is particularly unnecessary for simple
queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the
first kind of model capable of adaptively determining whether to perform
thinking based on the contextual information of user queries. To achieve this,
we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as
a cold start, followed by online reinforcement learning with the proposed
Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the
appropriate thinking mode. Furthermore, we introduce a metric called Hybrid
Accuracy to quantitatively assess the model's capability for hybrid thinking.
Extensive experimental results show that LHRMs can adaptively perform hybrid
thinking on queries of varying difficulty and type. LHRMs outperform existing
LRMs and LLMs in reasoning and general capabilities while significantly
improving efficiency. Together, our work advocates for a reconsideration of the
appropriate use of extended thinking processes and provides a solid starting
point for building hybrid thinking systems.
Recent reasoning-focused language models achieve high accuracy by generating
lengthy intermediate reasoning paths before producing final answers. While this
approach is effective in solving problems that require logical thinking, long
reasoning paths significantly increase memory usage and reduce the throughput of
token generation, limiting the practical deployment of such models. We propose
Reasoning Path Compression (RPC), a training-free method that accelerates
inference by leveraging the semantic sparsity of reasoning paths. RPC
periodically compresses the KV cache by retaining entries that receive high
importance scores, computed using a selector window composed of
recently generated queries. Experiments show that RPC improves generation
throughput of QwQ-32B by up to 1.60x compared to the inference with full
KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our
findings demonstrate that semantic sparsity in reasoning traces can be
effectively exploited for compression, offering a practical path toward
efficient deployment of reasoning LLMs. Our code is available at
https://github.com/jiwonsong-dev/ReasoningPathCompression.
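A simplified sketch of the retention rule described above (the scoring and retention details are illustrative assumptions, not the released implementation): cached entries are scored by the attention they receive from a window of recent queries, and only the top-scoring entries plus the local window are kept.

```python
# Toy KV-cache compression driven by a selector window of recent queries.
import numpy as np

def compress_kv(keys, values, recent_queries, keep_ratio=0.5, keep_recent=32):
    """keys/values: (T, d); recent_queries: (W, d). Returns the retained KV subset."""
    d = keys.shape[-1]
    scores = recent_queries @ keys.T / np.sqrt(d)              # (W, T)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    importance = attn.sum(axis=0)                              # aggregate over the window
    importance[-keep_recent:] = np.inf                         # always keep the local window
    k = max(keep_recent, int(keep_ratio * len(keys)))
    kept = np.sort(np.argsort(importance)[-k:])                # top-k, original order
    return keys[kept], values[kept]

keys, values = np.random.randn(1024, 64), np.random.randn(1024, 64)
k2, v2 = compress_kv(keys, values, np.random.randn(64, 64))
```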
Learning general-purpose reasoning capabilities has long been a challenging
problem in AI. Recent research in large language models (LLMs), such as
DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can
enable pre-trained LLMs to develop reasoning capabilities using simple
question-answer pairs. In this paper, we aim to train visual language models
(VLMs) to perform reasoning on image data through reinforcement learning and
visual question-answer pairs, without any explicit chain-of-thought (CoT)
supervision. Our findings indicate that simply applying reinforcement learning
to a VLM -- by prompting the model to produce a reasoning chain before
providing an answer -- can lead the model to develop shortcuts from easy
questions, thereby reducing its ability to generalize across unseen data
distributions. We argue that the key to mitigating shortcut learning is to
encourage the model to interpret images prior to reasoning. Therefore, we train
the model to adhere to a caption-reason-answer output format: initially
generating a detailed caption for an image, followed by constructing an
extensive reasoning chain. When trained on 273K CoT-free visual question-answer
pairs and using only reinforcement learning, our model, named Visionary-R1,
outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and
Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
Wentao Ma, Weiming Ren, Yiming Jia, Zhuofeng Li, Ping Nie, Ge Zhang, Wenhu Chen
152
Large multimodal models (LMMs) have recently emerged as a powerful tool for
long video understanding (LVU), prompting the development of standardized LVU
benchmarks to evaluate their performance. However, our investigation reveals a
rather sober lesson for existing LVU benchmarks. First, most existing
benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation
results are inflated due to the possibility of guessing the correct answer;
Second, a significant portion of questions in these benchmarks have strong
priors to allow models to answer directly without even reading the input video.
For example, Gemini-1.5-Pro can achieve over 50% accuracy given a random frame
from a long video on Video-MME. We also observe that increasing the number of
frames does not necessarily lead to improvement on existing benchmarks, which
is counterintuitive. As a result, the validity and robustness of current LVU
benchmarks are undermined, impeding a faithful assessment of LMMs' long-video
understanding capability. To tackle this problem, we propose VideoEval-Pro, a
realistic LVU benchmark containing open-ended short-answer questions that
truly require understanding the entire video. VideoEval-Pro assesses both
segment-level and full-video understanding through perception and reasoning
tasks. By evaluating 21 proprietary and open-source video LMMs, we conclude the
following findings: (1) video LMMs show drastic performance (>25%) drops on
open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do
not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other
MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input
frames. Our results show that VideoEval-Pro offers a more realistic and
reliable measure of long video understanding, providing a clearer view of
progress in this domain.
Intelligent game creation represents a transformative advancement in game
development, utilizing generative artificial intelligence to dynamically
generate and enhance game content. Despite notable progress in generative
models, the comprehensive synthesis of high-quality game assets, including both
images and videos, remains a challenging frontier. To create high-fidelity game
content that simultaneously aligns with player preferences and significantly
boosts designer efficiency, we present Hunyuan-Game, an innovative project
designed to revolutionize intelligent game production. Hunyuan-Game encompasses
two primary branches: image generation and video generation. The image
generation component is built upon a vast dataset comprising billions of game
images, leading to the development of a group of customized image generation
models tailored for game scenarios: (1) General Text-to-Image Generation. (2)
Game Visual Effects Generation, involving text-to-effect and reference
image-based game visual effect generation. (3) Transparent Image Generation for
characters, scenes, and game visual effects. (4) Game Character Generation
based on sketches, black-and-white images, and white models. The video
generation component is built upon a comprehensive dataset of millions of game
and anime videos, leading to the development of five core algorithmic models,
each targeting critical pain points in game development and having robust
adaptation to diverse game video scenarios: (1) Image-to-Video Generation. (2)
360 A/T Pose Avatar Video Synthesis. (3) Dynamic Illustration Generation. (4)
Generative Video Super-Resolution. (5) Interactive Game Video Generation. These
image and video generation models not only exhibit high-level aesthetic
expression but also deeply integrate domain-specific knowledge, establishing a
systematic understanding of diverse game and anime art styles.
LLM pruning has emerged as a promising technology for compressing LLMs,
enabling their deployment on resource-limited devices. However, current
methodologies typically require access to public calibration samples, which can
be challenging to obtain in privacy-sensitive domains. To address this issue,
we introduce FedPrLLM, a comprehensive federated pruning framework designed for
the privacy-preserving compression of LLMs. In FedPrLLM, each client only needs
to calculate a pruning mask matrix based on its local calibration data and
share it with the server to prune the global model. This approach allows for
collaborative pruning of the global model with the knowledge of each client
while maintaining local data privacy. Additionally, we conduct extensive
experiments to explore various possibilities within the FedPrLLM framework,
including different comparison groups, pruning strategies, and the decision to
scale weights. Our extensive evaluation reveals that one-shot pruning with
layer comparison and no weight scaling is the optimal choice within the
FedPrLLM framework. We hope our work will help guide future efforts in pruning
LLMs in privacy-sensitive fields. Our code is available at
https://github.com/Pengxin-Guo/FedPrLLM.
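The collaboration pattern can be sketched as follows; the Wanda-style saliency and majority-vote rule are illustrative assumptions rather than the paper's exact criterion, but they show how clients share only binary masks while the server prunes the global model.

```python
# Toy federated pruning: clients send binary masks, the server aggregates and prunes.
import numpy as np

def client_mask(weight, act_norm, sparsity=0.5):
    """Keep-mask from a Wanda-style saliency: |w| * per-input activation norm."""
    saliency = np.abs(weight) * act_norm[None, :]
    thresh = np.quantile(saliency, sparsity)
    return (saliency >= thresh).astype(np.int8)

def server_aggregate(masks):
    """Majority vote: a weight survives if at least half the clients keep it."""
    votes = np.sum(masks, axis=0)
    return votes >= (len(masks) / 2.0)

weight = np.random.randn(256, 512)
masks = [client_mask(weight, np.abs(np.random.randn(512))) for _ in range(4)]
pruned = weight * server_aggregate(np.stack(masks))
```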
Code-switching (CS) poses a significant challenge for Large Language Models
(LLMs), yet its comprehensibility remains underexplored in LLMs. We introduce
CS-Sum to evaluate the comprehensibility of CS by LLMs through CS
dialogue-to-English summarization. CS-Sum is the first benchmark for CS dialogue
summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and
Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language
pair. Evaluating ten LLMs, including open and closed-source models, we analyze
performance across few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA
on synthetic data) approaches. Our findings show that though the scores on
automated metrics are high, LLMs make subtle mistakes that alter the complete
meaning of the dialogue. To this end, we identify the three most common types of errors
that LLMs make when handling CS input. Error rates vary across CS pairs and
LLMs, with some LLMs showing more frequent errors on certain language pairs,
underscoring the need for specialized training on code-switched data.
Invisible image watermarking can protect image ownership and prevent
malicious misuse of visual generative models. However, existing generative
watermarking methods are mainly designed for diffusion models while
watermarking for autoregressive image generation models remains largely
underexplored. We propose IndexMark, a training-free watermarking framework for
autoregressive image generation models. IndexMark is inspired by the redundancy
property of the codebook: replacing autoregressively generated indices with
similar indices produces negligible visual differences. The core component in
IndexMark is a simple yet effective match-then-replace method, which carefully
selects watermark tokens from the codebook based on token similarity, and
promotes the use of watermark tokens through token replacement, thereby
embedding the watermark without affecting the image quality. Watermark
verification is achieved by calculating the proportion of watermark tokens in
generated images, with precision further improved by an Index Encoder.
Furthermore, we introduce an auxiliary validation scheme to enhance robustness
against cropping attacks. Experiments demonstrate that IndexMark achieves
state-of-the-art performance in terms of image quality and verification
accuracy, and exhibits robustness against various perturbations, including
cropping, noise, Gaussian blur, random erasing, color jittering, and JPEG
compression.
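A toy sketch of the match-then-replace idea (codebook, watermark set, and threshold are made up for illustration): pair each non-watermark code with its most similar watermark code, swap indices after generation, and verify by the watermark-token proportion.

```python
# Toy codebook watermarking via match-then-replace and proportion-based verification.
import numpy as np

def build_pairs(codebook, watermark_ids):
    """For every non-watermark index, find the most similar watermark index."""
    wm_set = set(watermark_ids)
    sims = codebook @ codebook[watermark_ids].T      # codes assumed L2-normalized
    return {i: watermark_ids[int(np.argmax(sims[i]))]
            for i in range(len(codebook)) if i not in wm_set}

def embed(indices, pairs):
    """Replace generated indices with their matched watermark indices."""
    return [pairs.get(i, i) for i in indices]

def verify(indices, watermark_ids, threshold=0.9):
    wm_set = set(watermark_ids)
    frac = float(np.mean([i in wm_set for i in indices]))
    return frac, frac >= threshold

codebook = np.random.randn(1024, 8)
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
wm_ids = list(range(0, 1024, 2))                     # toy watermark set
generated = list(np.random.randint(0, 1024, size=256))
frac, ok = verify(embed(generated, build_pairs(codebook, wm_ids)), wm_ids)
```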
Despite widespread adoption, multimodal large language models (MLLMs) suffer
performance degradation when encountering unfamiliar queries under distribution
shifts. Existing methods to improve MLLM generalization typically require
either more instruction data or larger advanced model architectures, both of
which incur non-trivial human labor or computational costs. In this work, we
take an alternative approach to enhance the robustness of MLLMs under
distribution shifts, from a representation learning perspective. Inspired by
the information bottleneck (IB) principle, we derive a variational lower bound
of the IB for MLLMs and devise a practical implementation, Visual Instruction
Bottleneck Tuning (Vittle). We then provide a theoretical justification of
Vittle by revealing its connection to an information-theoretic robustness
metric of MLLMs. Empirical validation of three MLLMs on open-ended and
closed-form question answering and object hallucination detection tasks over 45
datasets, including 30 shift scenarios, demonstrates that Vittle consistently
improves the MLLM's robustness under shifts by pursuing the learning of a
minimal sufficient representation.
As the size of large language models grows exponentially, GPU memory has
become a bottleneck for adapting these models to downstream tasks. In this
paper, we aim to push the limits of memory-efficient training by minimizing
memory usage on model weights, gradients, and optimizer states, within a
unified framework. Our idea is to eliminate both gradients and optimizer states
using zeroth-order optimization, which approximates gradients by perturbing
weights during forward passes to identify gradient directions. To minimize
memory usage on weights, we employ model quantization, e.g., converting from
bfloat16 to int4. However, directly applying zeroth-order optimization to
quantized weights is infeasible due to the precision gap between discrete
weights and continuous gradients, which would otherwise require de-quantization
and re-quantization. To overcome this challenge, we propose Quantized
Zeroth-order Optimization (QZO), a novel approach that perturbs the continuous
quantization scale for gradient estimation and uses a directional derivative
clipping method to stabilize training. QZO is orthogonal to both scalar-based
and codebook-based post-training quantization methods. Compared to
full-parameter fine-tuning in bfloat16, QZO can reduce the total memory cost by
more than 18x for 4-bit LLMs, and enables fine-tuning Llama-2-13B and
Stable Diffusion 3.5 Large within a single 24GB GPU.
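The core trick can be illustrated with a toy SPSA-style sketch (the loss, grouping, and clipping threshold are assumptions): gradients are estimated from two forward passes that perturb the continuous quantization scales while the discrete INT4 weights stay fixed, and the directional derivative is clipped before the update.

```python
# Toy zeroth-order step that perturbs quantization scales, not the discrete weights.
import numpy as np

def dequant(w_int4, scales):
    return w_int4 * scales[:, None]                  # per-row scales

def qzo_step(w_int4, scales, loss_fn, lr=1e-3, eps=1e-3, clip=1.0):
    z = np.random.randn(*scales.shape)               # random perturbation direction
    loss_plus = loss_fn(dequant(w_int4, scales + eps * z))
    loss_minus = loss_fn(dequant(w_int4, scales - eps * z))
    d = (loss_plus - loss_minus) / (2 * eps)         # directional derivative estimate
    d = np.clip(d, -clip, clip)                      # clipping to stabilize training
    return scales - lr * d * z                       # update only the continuous scales

rng = np.random.default_rng(0)
w_int4 = rng.integers(-8, 8, size=(64, 64)).astype(np.float32)
scales = np.abs(rng.standard_normal(64)).astype(np.float32) * 0.1
loss = lambda W: float(np.mean((W @ np.ones(64) - 1.0) ** 2))
scales = qzo_step(w_int4, scales, loss)
```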
Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang
102
Despite impressive advancements in Visual-Language Models (VLMs) for
multi-modal tasks, their reliance on RGB inputs limits precise spatial
understanding. Existing methods for integrating spatial cues, such as point
clouds or depth, either require specialized sensors or fail to effectively
exploit depth information for higher-order reasoning. To this end, we propose
SSR (Spatial Sense and Reasoning), a novel framework that
transforms raw depth data into structured, interpretable textual rationales.
These textual rationales serve as meaningful intermediate representations to
significantly enhance spatial reasoning capabilities. Additionally, we leverage
knowledge distillation to compress the generated rationales into compact latent
embeddings, which facilitate resource-efficient and plug-and-play integration
into existing VLMs without retraining. To enable comprehensive evaluation, we
introduce a new dataset named SSR-CoT, a million-scale visual-language
reasoning dataset enriched with intermediate spatial reasoning annotations, and
present SSRBench, a comprehensive multi-task benchmark. Extensive experiments
on multiple benchmarks demonstrate SSR substantially improves depth utilization
and enhances spatial reasoning, thereby advancing VLMs toward more human-like
multi-modal understanding. Our project page is at
https://yliu-cs.github.io/SSR.
Mixture-of-Experts (MoE) architectures within Large Reasoning Models (LRMs)
have achieved impressive reasoning capabilities by selectively activating
experts to facilitate structured cognitive processes. Despite notable advances,
existing reasoning models often suffer from cognitive inefficiencies like
overthinking and underthinking. To address these limitations, we introduce a
novel inference-time steering methodology called Reinforcing Cognitive Experts
(RICE), designed to improve reasoning performance without additional training
or complex heuristics. Leveraging normalized Pointwise Mutual Information
(nPMI), we systematically identify specialized experts, termed "cognitive
experts," that orchestrate meta-level reasoning operations characterized by
tokens like "<think>". Empirical evaluations with leading MoE-based LRMs
(DeepSeek-R1 and Qwen3-235B) on rigorous quantitative and scientific reasoning
benchmarks demonstrate noticeable and consistent improvements in reasoning
accuracy, cognitive efficiency, and cross-domain generalization. Crucially, our
lightweight approach substantially outperforms prevalent reasoning-steering
techniques, such as prompt design and decoding constraints, while preserving
the model's general instruction-following skills. These results highlight
reinforcing cognitive experts as a promising, practical, and interpretable
direction to enhance cognitive efficiency within advanced reasoning models.
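A minimal sketch of the expert-identification step (the counting scheme over routing logs is an assumption): compute normalized PMI between "expert e was routed" and "the token is reasoning markup such as <think>", and flag high-nPMI experts as candidate cognitive experts.

```python
# Normalized PMI between an expert's routing events and reasoning-markup tokens.
import math

def npmi(routing_trace, expert_id, is_reasoning_token):
    """routing_trace: list of (expert_id, token) pairs from MoE routing logs."""
    n = len(routing_trace)
    both = sum(1 for e, t in routing_trace if e == expert_id and is_reasoning_token(t))
    p_e = sum(1 for e, _ in routing_trace if e == expert_id) / n
    p_r = sum(1 for _, t in routing_trace if is_reasoning_token(t)) / n
    p_er = both / n
    if p_er == 0:
        return -1.0
    return math.log(p_er / (p_e * p_r)) / (-math.log(p_er))

trace = [(0, "<think>"), (0, "so"), (1, "the"), (0, "<think>"), (2, "answer")]
print(npmi(trace, 0, lambda t: t == "<think>"))
```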
Generative AI search is reshaping information retrieval by offering
end-to-end answers to complex queries, reducing users' reliance on manually
browsing and summarizing multiple web pages. However, while this paradigm
enhances convenience, it disrupts the feedback-driven improvement loop that has
historically powered the evolution of traditional Web search. Web search engines can
continuously improve their ranking models by collecting large-scale,
fine-grained user feedback (e.g., clicks, dwell time) at the document level. In
contrast, generative AI search operates through a much longer search pipeline,
spanning query decomposition, document retrieval, and answer generation, yet
typically receives only coarse-grained feedback on the final answer. This
introduces a feedback loop disconnect, where user feedback for the final output
cannot be effectively mapped back to specific system components, making it
difficult to improve each intermediate stage and sustain the feedback loop. In
this paper, we envision NExT-Search, a next-generation paradigm designed to
reintroduce fine-grained, process-level feedback into generative AI search.
NExT-Search integrates two complementary modes: User Debug Mode, which allows
engaged users to intervene at key stages; and Shadow User Mode, where a
personalized user agent simulates user preferences and provides AI-assisted
feedback for less interactive users. Furthermore, we envision how these
feedback signals can be leveraged through online adaptation, which refines
current search outputs in real-time, and offline update, which aggregates
interaction logs to periodically fine-tune query decomposition, retrieval, and
generation models. By restoring human control over key stages of the generative
AI search pipeline, we believe NExT-Search offers a promising direction for
building feedback-rich AI search systems that can evolve continuously alongside
human feedback.
Xiaoyu Tian, Yunjie Ji, Haotian Wang, Shuaiting Chen, Sitong Zhao, Yiping Peng, Han Zhao, Xiangang Li
92
Distillation has emerged as a practical and effective approach to enhance the
reasoning capabilities of open-source language models. In this work, we conduct
a large-scale empirical study on reasoning data distillation by collecting
verified outputs from three state-of-the-art teacher models (AM-Thinking-v1,
Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We
construct three parallel datasets and analyze their distributions, revealing
that AM-Thinking-v1-distilled data exhibits greater token length diversity and
lower perplexity. Student models trained on each dataset are evaluated on
reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench.
The AM-based model consistently achieves the best performance (e.g., 84.3 on
AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and
demonstrates adaptive output behavior, producing longer responses for harder
tasks and shorter ones for simpler tasks. These findings highlight the value of
high-quality, verified reasoning traces. We release the AM-Thinking-v1 and
Qwen3-235B-A22B distilled datasets to support future research on open and
high-performing reasoning-oriented language models. The datasets are publicly
available on Hugging Face:
https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled and
https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled.
Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda
92
As language models become more powerful and sophisticated, it is crucial that
they remain trustworthy and reliable. There is concerning preliminary evidence
that models may attempt to deceive or keep secrets from their operators. To
explore the ability of current techniques to elicit such hidden knowledge, we
train a Taboo model: a language model that describes a specific secret word
without explicitly stating it. Importantly, the secret word is not presented to
the model in its training data or prompt. We then investigate methods to
uncover this secret. First, we evaluate non-interpretability (black-box)
approaches. Subsequently, we develop largely automated strategies based on
mechanistic interpretability techniques, including logit lens and sparse
autoencoders. Evaluation shows that both approaches are effective in eliciting
the secret word in our proof-of-concept setting. Our findings highlight the
promise of these approaches for eliciting hidden knowledge and suggest several
promising avenues for future work, including testing and refining these methods
on more complex model organisms. This work aims to be a step towards addressing
the crucial problem of eliciting secret knowledge from language models, thereby
contributing to their safe and reliable deployment.
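The logit-lens part of such an interpretability pipeline can be sketched as below, using GPT-2 as a stand-in since the Taboo model itself is not available here: each layer's residual stream is projected through the final LayerNorm and the unembedding to see which tokens intermediate layers already favor.

```python
# Logit-lens sketch on GPT-2 (stand-in model, not the paper's Taboo model).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Give me a hint about your secret word."
with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

for layer, hidden in enumerate(out.hidden_states):
    h_last = model.transformer.ln_f(hidden[0, -1])   # final-token residual, normalized
    logits = model.lm_head(h_last)                   # reuse the unembedding matrix
    print(f"layer {layer:2d} -> {tok.decode(int(logits.argmax()))!r}")
```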
We introduce Vox-Profile, a comprehensive benchmark to characterize rich
speaker and speech traits using speech foundation models. Unlike existing works
that focus on a single dimension of speaker traits, Vox-Profile provides
holistic and multi-dimensional profiles that reflect both static speaker traits
(e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech
flow). This benchmark is grounded in speech science and linguistics, developed
with domain experts to accurately index speaker and speech characteristics. We
report benchmark experiments using over 15 publicly available speech datasets
and several widely used speech foundation models that target various static and
dynamic speaker and speech properties. In addition to benchmark experiments, we
showcase several downstream applications supported by Vox-Profile. First, we
show that Vox-Profile can augment existing speech recognition datasets to
analyze ASR performance variability. Vox-Profile is also used as a tool to
evaluate the performance of speech generation systems. Finally, we assess the
quality of our automated profiles through comparison with human evaluation and
show convergent validity. Vox-Profile is publicly available at:
https://github.com/tiantiaf0627/vox-profile-release.
Chongyang Shi, Sharon Lin, Shuang Song, Jamie Hayes, Ilia Shumailov, Itay Yona, Juliette Pluto, Aneesh Pappu, Christopher A. Choquette-Choo, Milad Nasr, Chawin Sitawarin, Gena Gibson, Andreas Terzis, John "Four" Flynn
82
Gemini is increasingly used to perform tasks on behalf of users, where
function-calling and tool-use capabilities enable the model to access user
data. Some tools, however, require access to untrusted data, introducing risk.
Adversaries can embed malicious instructions in untrusted data that cause the
model to deviate from the user's expectations and mishandle their data or
permissions. In this report, we set out Google DeepMind's approach to
evaluating the adversarial robustness of Gemini models and describe the main
lessons learned from the process. We test how Gemini performs against a
sophisticated adversary through an adversarial evaluation framework, which
deploys a suite of adaptive attack techniques to run continuously against past,
current, and future versions of Gemini. We describe how these ongoing
evaluations directly help make Gemini more resilient against manipulation.
Reinforcement finetuning (RFT) has become a standard approach for enhancing
the reasoning capabilities of large language models (LLMs). However, its impact
on model trustworthiness remains underexplored. In this work, we identify and
systematically study a critical side effect of RFT, which we term the
hallucination tax: a degradation in refusal behavior that causes models to
confidently produce hallucinated answers to unanswerable questions. To investigate
this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of
unanswerable math problems designed to probe models' ability to recognize an
unanswerable question by reasoning from the insufficient or ambiguous
information. Our results show that standard RFT training could reduce model
refusal rates by more than 80%, significantly increasing the model's tendency
to hallucinate. We further demonstrate that incorporating just 10% SUM during
RFT substantially restores appropriate refusal behavior, with minimal accuracy
trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage
inference-time compute to reason about their own uncertainty and knowledge
boundaries, improving generalization not only to out-of-domain math problems
but also to factual question answering tasks.
Haohang Li, Yupeng Cao, Yangyang Yu, Jordan W. Suchow, Zining Zhu
82
Despite their remarkable success and deployment across diverse workflows,
language models sometimes produce untruthful responses. Our limited
understanding of how truthfulness is mechanistically encoded within these
models jeopardizes their reliability and safety. In this paper, we propose a
method for identifying representations of truthfulness at the neuron level. We
show that language models contain truth neurons, which encode truthfulness in a
subject-agnostic manner. Experiments conducted across models of varying scales
validate the existence of truth neurons, confirming that the encoding of
truthfulness at the neuron level is a property shared by many language models.
The distribution patterns of truth neurons over layers align with prior
findings on the geometry of truthfulness. Selectively suppressing the
activations of truth neurons found through the TruthfulQA dataset degrades
performance both on TruthfulQA and on other benchmarks, showing that the
truthfulness mechanisms are not tied to a specific dataset. Our results offer
novel insights into the mechanisms underlying truthfulness in language models
and highlight potential directions toward improving their trustworthiness and
reliability.
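The suppression experiment can be sketched with a forward hook; the module and neuron indices below are placeholders, not the identified truth neurons.

```python
# Zero selected hidden units of a layer via a forward hook, then compare behavior.
import torch

def suppress_neurons(module, neuron_ids):
    """Register a hook that zeroes the given hidden units of `module`'s output."""
    def hook(mod, inputs, output):
        output[..., neuron_ids] = 0.0
        return output
    return module.register_forward_hook(hook)

# Toy demonstration on a stand-in MLP block.
mlp = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.GELU())
handle = suppress_neurons(mlp, neuron_ids=[3, 17, 42])
with torch.no_grad():
    out = mlp(torch.randn(2, 16))
assert torch.all(out[:, [3, 17, 42]] == 0)
handle.remove()                                      # restore normal behavior
```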
Safal Shrestha, Minwu Kim, Aadim Nepal, Anubhav Shrestha, Keith Ross
72
Designing effective reasoning-capable LLMs typically requires training using
Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with
carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on
extensive training data. This creates a major challenge when the amount of
quality training data is scarce. We propose a sample-efficient, two-stage
training strategy to develop reasoning LLMs under limited supervision. In the
first stage, we "warm up" the model by distilling Long CoTs from a toy domain,
namely, Knights & Knaves (K&K) logic puzzles, to acquire general reasoning
skills. In the second stage, we apply RLVR to the warmed-up model using a
limited set of target-domain examples. Our experiments demonstrate that this
two-phase approach offers several benefits: (i) the warmup phase alone
facilitates generalized reasoning, leading to performance improvements across a
range of tasks, including MATH, HumanEval+, and MMLU-Pro. (ii) When both
the base model and the warmed-up model are RLVR trained on the same small
dataset (≤100 examples), the warmed-up model consistently outperforms the
base model; (iii) Warming up before RLVR training allows a model to maintain
cross-domain generalizability even after training on a specific domain; (iv)
Introducing warmup in the pipeline improves not only accuracy but also overall
sample efficiency during RLVR training. The results in this paper highlight the
promise of warmup for building robust reasoning LLMs in data-scarce
environments.
Pierre Le Jeune, Benoît Malézieux, Weixuan Xiao, Matteo Dora
72
Ensuring the safety of large language models (LLMs) is critical for
responsible deployment, yet existing evaluations often prioritize performance
over identifying failure modes. We introduce Phare, a multilingual diagnostic
framework to probe and evaluate LLM behavior across three critical dimensions:
hallucination and reliability, social biases, and harmful content generation.
Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic
vulnerabilities across all safety dimensions, including sycophancy, prompt
sensitivity, and stereotype reproduction. By highlighting these specific
failure modes rather than simply ranking models, Phare provides researchers and
practitioners with actionable insights to build more robust, aligned, and
trustworthy language systems.
Han Zheng, Ilia Shumailov, Tianqi Fan, Aiden Hall, Mathias Payer
62
The rapid advancement of bug-finding techniques has led to the discovery of
more vulnerabilities than developers can reasonably fix, creating an urgent
need for effective Automated Program Repair (APR) methods. However, the
complexity of modern bugs often makes precise root cause analysis difficult and
unreliable. To address this challenge, we propose crash-site repair to simplify
the repair task while still mitigating the risk of exploitation. In addition,
we introduce a template-guided patch generation approach that significantly
reduces the token cost of Large Language Models (LLMs) while maintaining both
efficiency and effectiveness.
We implement our prototype system, WILLIAMT, and evaluate it against
state-of-the-art APR tools. Our results show that, when combined with the
top-performing agent CodeRover-S, WILLIAMT reduces token cost by 45.9% and
increases the bug-fixing rate to 73.5% (+29.6%) on ARVO, a ground-truth open
source software vulnerabilities benchmark. Furthermore, we demonstrate that
WILLIAMT can function effectively even without access to frontier LLMs: a
local model running on a Mac M4 Mini achieves a reasonable repair rate. These
findings highlight the broad applicability and scalability of WILLIAMT.
Linbo Liu, Xinle Liu, Qiang Zhou, Lin Chen, Yihan Liu, Hoan Nguyen, Behrooz Omidvar-Tehrani, Xi Shen, Jun Huan, Omer Tripp, Anoop Deoras
62
With the rapid advancement of powerful large language models (LLMs) in recent
years, a wide range of software engineering tasks can now be addressed using
LLMs, significantly enhancing productivity and scalability. Numerous benchmark
datasets have been developed to evaluate the coding capabilities of these
models, while they primarily focus on problem-solving and issue-resolution
tasks. In contrast, we introduce a new coding benchmark MIGRATION-BENCH with a
distinct focus: code migration. MIGRATION-BENCH aims to serve as a
comprehensive benchmark for migration from Java 8 to the latest long-term
support (LTS) versions (Java 17, 21). MIGRATION-BENCH includes a full dataset
and its subset Selected, with 5,102 and 300 repositories respectively.
Selected is a representative subset curated for complexity and difficulty,
offering a versatile resource to support research in the field of code
migration. Additionally, we provide a comprehensive evaluation framework to
facilitate rigorous and standardized assessment of LLMs on this challenging
task. We further propose SD-Feedback and demonstrate that LLMs can effectively
tackle repository-level code migration to Java 17. For the selected subset with
Claude-3.5-Sonnet-v2, SD-Feedback achieves 62.33% and 27.00% success rate
(pass@1) for minimal and maximal migration respectively. The benchmark dataset
and source code are available at:
https://huggingface.co/collections/AmazonScience and
https://github.com/amazon-science/self_debug respectively.
Guoheng Sun, Ziyao Wang, Bowei Tian, Meng Liu, Zheyu Shen, Shwai He, Yexiao He, Wanghao Ye, Yiting Wang, Ang Li
52
As post-training techniques evolve, large language models (LLMs) are
increasingly augmented with structured multi-step reasoning abilities, often
optimized through reinforcement learning. These reasoning-enhanced models
outperform standard LLMs on complex tasks and now underpin many commercial LLM
APIs. However, to protect proprietary behavior and reduce verbosity, providers
typically conceal the reasoning traces while returning only the final answer.
This opacity introduces a critical transparency gap: users are billed for
invisible reasoning tokens, which often account for the majority of the cost,
yet have no means to verify their authenticity. This opens the door to token
count inflation, where providers may overreport token usage or inject
synthetic, low-effort tokens to inflate charges. To address this issue, we
propose CoIn, a verification framework that audits both the quantity and
semantic validity of hidden tokens. CoIn constructs a verifiable hash tree from
token embedding fingerprints to check token counts, and uses embedding-based
relevance matching to detect fabricated reasoning content. Experiments
demonstrate that CoIn, when deployed as a trusted third-party auditor, can
effectively detect token count inflation with a success rate reaching up to
94.7%, showing the strong ability to restore billing transparency in opaque LLM
services. The dataset and code are available at
https://github.com/CASE-Lab-UMD/LLM-Auditing-CoIn.
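The commitment side of such an audit can be sketched with a plain Merkle tree over per-token fingerprints (the fingerprinting and the embedding-based relevance check are simplified away here): the provider commits to a root hash, and an auditor can later verify membership proofs for sampled hidden tokens.

```python
# Minimal Merkle-tree commitment over per-token fingerprints.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Leaves are fingerprint bytes (e.g., hashed token embeddings)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                            # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Toy usage: strings stand in for hashed hidden-token embedding fingerprints.
fingerprints = [f"token_embedding_{i}".encode() for i in range(7)]
print(merkle_root(fingerprints).hex(), "covers", len(fingerprints), "hidden tokens")
```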
Nam V. Nguyen, Huy Nguyen, Quang Pham, Van Nguyen, Savitha Ramasamy, Nhat Ho
52
Sparse mixture of experts (SMoE) offers an appealing solution to scale up the
model complexity beyond merely increasing the network's depth or width.
However, we argue that effective SMoE training remains challenging because of
the suboptimal routing process, in which the experts that perform the computation
do not directly contribute to the routing decision. In this work, we propose
competition, a novel mechanism to route tokens to experts with the highest
neural response. Theoretically, we show that the competition mechanism enjoys a
better sample efficiency than the traditional softmax routing. Furthermore, we
develop CompeteSMoE, a simple yet effective algorithm to train large language
models by deploying a router to learn the competition policy, thus enjoying
strong performances at a low training overhead. Our extensive empirical
evaluations on both the visual instruction tuning and language pre-training
tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE
compared to state-of-the-art SMoE strategies. We have made the implementation
available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an
improved version of the previous study at arXiv:2402.02526.
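A rough sketch of competition-based routing, where the response measure (output norm) and top-k mixing are assumptions chosen for illustration rather than the paper's exact formulation.

```python
# Route each token to the experts with the largest response norm instead of router logits.
import torch

def competition_route(x, experts, top_k=2):
    """x: (batch, d); experts: list of nn.Modules. Returns mixed output and winner ids."""
    outputs = torch.stack([e(x) for e in experts], dim=1)         # (batch, E, d)
    response = outputs.norm(dim=-1)                               # (batch, E) neural response
    scores, winners = response.topk(top_k, dim=-1)                # highest response wins
    weights = torch.softmax(scores, dim=-1).unsqueeze(-1)         # (batch, k, 1)
    chosen = torch.gather(outputs, 1, winners.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
    return (weights * chosen).sum(dim=1), winners

experts = [torch.nn.Linear(32, 32) for _ in range(8)]
y, winners = competition_route(torch.randn(4, 32), experts)
```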
Large Language Model (LLM) reasoning for complex tasks inherently involves a
trade-off between solution accuracy and computational efficiency. The
subsequent step of verification, while intended to improve performance, further
complicates this landscape by introducing its own challenging trade-off:
sophisticated Generative Reward Models (GenRMs) can be computationally
prohibitive if naively integrated with LLMs at test-time, while simpler, faster
methods may lack reliability. To overcome these challenges, we introduce
FlexiVe, a novel generative verifier that flexibly balances computational
resources between rapid, reliable fast thinking and meticulous slow thinking
using a Flexible Allocation of Verification Budget strategy. We further propose
the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework
that intelligently integrates FlexiVe, proactively identifying solution
completion points to trigger targeted verification and provide focused solver
feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing
errors within reasoning traces on ProcessBench. Furthermore, on challenging
mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full
approach outperforms baselines like self-consistency in reasoning accuracy and
inference efficiency. Our system offers a scalable and effective solution to
enhance LLM reasoning at test time.
Hao Mark Chen, Guanxi Lu, Yasuyuki Okoshi, Zhiwen Mo, Masato Motomura, Hongxiang Fan
52
Test-time scaling (TTS) has proven effective in enhancing the reasoning
capabilities of large language models (LLMs). Verification plays a key role in
TTS, simultaneously influencing (1) reasoning performance and (2) compute
efficiency, due to the quality and computational cost of verification. In this
work, we challenge the conventional paradigms of verification, and make the
first attempt toward systematically investigating the impact of verification
granularity, that is, how frequently the verifier is invoked during generation,
beyond verifying only the final output or individual generation steps. To this
end, we introduce Variable Granularity Search (VG-Search), a unified algorithm
that generalizes beam search and Best-of-N sampling via a tunable granularity
parameter g. Extensive experiments with VG-Search under varying compute
budgets, generator-verifier configurations, and task attributes reveal that
dynamically selecting g can improve the compute efficiency and scaling
behavior. Building on these findings, we propose adaptive VG-Search strategies
that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over
Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to
support future research.
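A schematic sketch of variable-granularity search (all interfaces below are assumptions): the generator extends each beam by g steps and the verifier is invoked once per chunk to prune beams, so g interpolates between step-level beam search and Best-of-N over complete samples.

```python
# Toy variable-granularity search: verify every g generation steps.
import random

def vg_search(generate_chunk, verify, n_beams=4, keep=2, g=8, max_steps=64):
    beams = [""] * n_beams
    for _ in range(0, max_steps, g):
        beams = [generate_chunk(b, g) for b in beams]             # extend each beam by g steps
        scored = sorted(beams, key=verify, reverse=True)[:keep]   # verifier invoked per chunk
        beams = scored + [scored[i % keep] for i in range(n_beams - keep)]  # refill beams
    return max(beams, key=verify)

# Toy usage with stand-in generator/verifier.
gen = lambda prefix, g: prefix + "".join(random.choice("ab") for _ in range(g))
verify = lambda s: s.count("a") / max(len(s), 1)
print(vg_search(gen, verify))
```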
Despite significant advances in large language models (LLMs), their knowledge
memorization capabilities remain underexplored, due to the lack of standardized
and high-quality test grounds. In this paper, we introduce a novel, real-world
and large-scale knowledge injection benchmark that evolves continuously over
time without requiring human intervention. Specifically, we propose WikiDYK,
which leverages recently-added and human-written facts from Wikipedia's "Did
You Know..." entries. These entries are carefully selected by expert Wikipedia
editors based on criteria such as verifiability and clarity. Each entry is
converted into multiple question-answer pairs spanning diverse task formats
from easy cloze prompts to complex multi-hop questions. WikiDYK contains 12,290
facts and 77,180 questions, which is also seamlessly extensible with future
updates from Wikipedia editors. Extensive experiments using continued
pre-training reveal a surprising insight: despite their prevalence in modern
LLMs, Causal Language Models (CLMs) demonstrate significantly weaker knowledge
memorization capabilities compared to Bidirectional Language Models (BiLMs),
exhibiting a 23% lower accuracy in terms of reliability. To compensate for the
smaller scales of current BiLMs, we introduce a modular collaborative framework
utilizing ensembles of BiLMs as external knowledge repositories to integrate
with LLMs. Experiments show that our framework further improves the reliability
accuracy by up to 29.1%.
This research offers a unique evaluation of how AI systems interpret the
digital language of Generation Alpha (Gen Alpha, born 2010-2024). As the first
cohort raised alongside AI, Gen Alpha faces new forms of online risk due to
immersive digital engagement and a growing mismatch between their evolving
communication and existing safety tools. Their distinct language, shaped by
gaming, memes, and AI-driven trends, often conceals harmful interactions from
both human moderators and automated systems. We assess four leading AI models
(GPT-4, Claude, Gemini, and Llama 3) on their ability to detect masked
harassment and manipulation within Gen Alpha discourse. Using a dataset of 100
recent expressions from gaming platforms, social media, and video content, the
study reveals critical comprehension failures with direct implications for
online safety. This work contributes: (1) a first-of-its-kind dataset capturing
Gen Alpha expressions; (2) a framework to improve AI moderation systems for
youth protection; (3) a multi-perspective evaluation including AI systems,
human moderators, and parents, with direct input from Gen Alpha co-researchers;
and (4) an analysis of how linguistic divergence increases youth vulnerability.
Findings highlight the urgent need to redesign safety systems attuned to youth
communication, especially given Gen Alpha's reluctance to seek help when adults
fail to understand their digital world. This study combines the insight of a
Gen Alpha researcher with systematic academic analysis to address critical
digital safety challenges.
Detecting AI risks becomes more challenging as stronger models emerge and
find novel methods such as Alignment Faking to circumvent these detection
attempts. Inspired by how risky behaviors in humans (i.e., illegal activities
that may hurt others) are sometimes guided by strongly-held values, we believe
that identifying values within AI models can be an early warning system for
AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal
AI models' priorities on a range of AI value classes. Then, we collect
AIRiskDilemmas, a diverse collection of dilemmas that pit values against one
another in scenarios relevant to AI safety risks such as Power Seeking. By
measuring an AI model's value prioritization using its aggregate choices, we
obtain a self-consistent set of predicted value priorities that uncover
potential risks. We show that values in LitmusValues (including seemingly
innocuous ones like Care) can predict both seen risky behaviors in
AIRiskDilemmas and unseen risky behaviors in HarmBench.
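A hedged sketch of how aggregate dilemma choices might be converted into a value-priority ranking is shown below; the records and the win-rate aggregation are illustrative assumptions rather than the LitmusValues implementation.

```python
# Sketch: each dilemma pits two value classes against each other, and the
# option the model picks counts as a "win" for the value behind it. Records
# below are hypothetical placeholders.
from collections import Counter
from itertools import chain

# (value_backed_by_option_A, value_backed_by_option_B, model_choice)
choices = [
    ("Care", "Power Seeking", "A"),
    ("Honesty", "Care", "B"),
    ("Honesty", "Power Seeking", "A"),
]

wins = Counter()
appearances = Counter(chain.from_iterable((a, b) for a, b, _ in choices))
for value_a, value_b, picked in choices:
    wins[value_a if picked == "A" else value_b] += 1

priorities = sorted(appearances, key=lambda v: wins[v] / appearances[v], reverse=True)
print(priorities)  # values ordered by how often the model sides with them
```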
Media bias detection is a critical task in ensuring fair and balanced
information dissemination, yet it remains challenging due to the subjectivity
of bias and the scarcity of high-quality annotated data. In this work, we
perform sentence-level bias classification by fine-tuning a RoBERTa-based model
on the expert-annotated BABE dataset. Using McNemar's test and the 5x2
cross-validation paired t-test, we show statistically significant improvements
in performance when comparing our model to a domain-adaptively pre-trained
DA-RoBERTa baseline. Furthermore, attention-based analysis shows that our model
avoids common pitfalls like oversensitivity to politically charged terms and
instead attends more meaningfully to contextually relevant tokens. For a
comprehensive examination of media bias, we present a pipeline that combines
our model with an existing bias-type classifier. Our method exhibits good
generalization and interpretability, although it remains constrained to
sentence-level analysis and limited by dataset size, given the lack of larger
and more advanced bias corpora. We discuss context-aware modeling, bias
neutralization, and advanced bias-type classification as potential future
directions. Our findings contribute to building more robust, explainable, and
socially responsible NLP systems for media bias detection.
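For readers unfamiliar with the statistical comparison used above, the sketch below shows how McNemar's test can be run on the paired sentence-level predictions of two classifiers using statsmodels; the prediction arrays are dummy placeholders, not results from the BABE experiments.

```python
# Sketch: McNemar's test on paired predictions from two bias classifiers.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
pred_ours = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])   # fine-tuned model (hypothetical)
pred_base = np.array([1, 0, 0, 1, 0, 0, 0, 1, 1, 1])   # DA-RoBERTa baseline (hypothetical)

ours_correct = pred_ours == y_true
base_correct = pred_base == y_true

# 2x2 table of paired outcomes: rows = ours correct/incorrect, cols = baseline.
table = np.array([
    [np.sum(ours_correct & base_correct),  np.sum(ours_correct & ~base_correct)],
    [np.sum(~ours_correct & base_correct), np.sum(~ours_correct & ~base_correct)],
])
print(mcnemar(table, exact=True))  # statistic and p-value on the discordant pairs
```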
Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh
32
Recent years have seen a significant increase in video content creation and
consumption. Crafting engaging content requires the careful curation of both
visual and audio elements. While visual cue curation, through techniques like
optimal viewpoint selection or post-editing, has been central to media
production, its natural counterpart, audio, has not undergone equivalent
advancements. This often results in a disconnect between visual and acoustic
saliency. To bridge this gap, we introduce a novel task: visually-guided
acoustic highlighting, which aims to transform audio to deliver appropriate
highlighting effects guided by the accompanying video, ultimately creating a
more harmonious audio-visual experience. We propose a flexible,
transformer-based multimodal framework to solve this task. To train our model,
we also introduce a new dataset -- the muddy mix dataset, leveraging the
meticulous audio and video crafting found in movies, which provides a form of
free supervision. We develop a pseudo-data generation process to simulate
poorly mixed audio, mimicking real-world scenarios through a three-step process
-- separation, adjustment, and remixing. Our approach consistently outperforms
several baselines in both quantitative and subjective evaluation. We also
systematically study the impact of different types of contextual guidance and
difficulty levels of the dataset. Our project page is here:
https://wikichao.github.io/VisAH/.
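The three-step pseudo-data idea (separation, adjustment, remixing) can be pictured with a small sketch: perturb the gains of separated stems to simulate a poorly mixed input whose training target is the original mix. The stems here are random arrays standing in for the outputs of a source-separation model, so this is an illustration of the principle rather than the released pipeline.

```python
# Rough sketch of pseudo-data generation for acoustic highlighting.
import numpy as np

rng = np.random.default_rng(0)
stems = {name: rng.standard_normal(16000) for name in ("speech", "music", "effects")}
target_mix = sum(stems.values())                # the well-crafted reference mix

def make_muddy_mix(stems, gain_range=(0.1, 2.0)):
    """Randomly re-weight each stem so salient sources may end up under-mixed."""
    gains = {name: rng.uniform(*gain_range) for name in stems}
    return sum(gains[name] * stem for name, stem in stems.items()), gains

muddy, gains = make_muddy_mix(stems)
print({k: round(v, 2) for k, v in gains.items()})
```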
Xiang He, Dongcheng Zhao, Yang Li, Qingqun Kong, Xin Yang, Yi Zeng
32
Multimodal learning enhances the perceptual capabilities of cognitive systems
by integrating information from different sensory modalities. However, existing
multimodal fusion research typically assumes static integration, not fully
incorporating key dynamic mechanisms found in the brain. Specifically, the
brain exhibits an inverse effectiveness phenomenon, wherein weaker unimodal
cues yield stronger multisensory integration benefits; conversely, when
individual modal cues are stronger, the effect of fusion is diminished. This
mechanism enables biological systems to achieve robust cognition even with
scarce or noisy perceptual cues. Inspired by this biological mechanism, we
explore the relationship between multimodal output and information from
individual modalities, proposing an inverse effectiveness driven multimodal
fusion (IEMF) strategy. By incorporating this strategy into neural networks, we
achieve more efficient integration with improved model performance and
computational efficiency, demonstrating up to 50% reduction in computational
cost across diverse fusion methods. We conduct experiments on audio-visual
classification, continual learning, and question answering tasks to validate
our method. Results consistently demonstrate that our method performs
excellently in these tasks. To verify universality and generalization, we also
conduct experiments on Artificial Neural Networks (ANN) and Spiking Neural
Networks (SNN), with results showing good adaptability to both network types.
Our research emphasizes the potential of incorporating biologically inspired
mechanisms into multimodal networks and provides promising directions for the
future development of multimodal artificial intelligence. The code is available
at https://github.com/Brain-Cog-Lab/IEMF.
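The following is a minimal PyTorch sketch of an inverse-effectiveness-style fusion rule, written to illustrate the principle rather than the exact IEMF formulation: the weaker the unimodal evidence, the larger the weight given to the fused multimodal term.

```python
# Sketch only: gate the fused prediction by how weak the unimodal cues are.
import torch
import torch.nn.functional as F

def inverse_effectiveness_fusion(logits_audio, logits_visual, fused_logits):
    """Blend unimodal and fused predictions with a strength-dependent gate."""
    conf_a = F.softmax(logits_audio, dim=-1).max(dim=-1).values
    conf_v = F.softmax(logits_visual, dim=-1).max(dim=-1).values
    unimodal_strength = 0.5 * (conf_a + conf_v)           # in (0, 1]
    alpha = (1.0 - unimodal_strength).unsqueeze(-1)       # weak cues -> large alpha
    unimodal_avg = 0.5 * (logits_audio + logits_visual)
    return alpha * fused_logits + (1.0 - alpha) * unimodal_avg

a, v, f = torch.randn(4, 10), torch.randn(4, 10), torch.randn(4, 10)
print(inverse_effectiveness_fusion(a, v, f).shape)  # torch.Size([4, 10])
```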
Xiang Zhang, Juntai Cao, Jiaqi Wei, Yiwei Xu, Chenyu You
22
Tokenization is the first - and often underappreciated - layer of computation
in language models. While Chain-of-Thought (CoT) prompting enables transformer
models to approximate recurrent computation by externalizing intermediate
steps, we show that the success of such reasoning is fundamentally bounded by
the structure of tokenized inputs. This work presents a theoretical and
empirical investigation into how tokenization schemes, particularly
subword-based methods like byte-pair encoding (BPE), impede symbolic
computation by merging or obscuring atomic reasoning units. We introduce the
notion of Token Awareness to formalize how poor token granularity disrupts
logical alignment and prevents models from generalizing symbolic procedures.
Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate
that token structure dramatically affects reasoning performance, causing failures
even with CoT, while atomically-aligned formats unlock strong generalization,
allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g.,
o1) in structured reasoning. Our findings reveal that symbolic reasoning
ability in LLMs is not purely architectural, but deeply conditioned on
token-level representations.
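The granularity issue is easy to see with an off-the-shelf BPE tokenizer; the snippet below uses the tiktoken library's cl100k_base encoding as an example. Multi-digit numbers collapse into merged subword tokens, whereas space-separated digits remain atomic, which is the kind of aligned format the paper argues enables stronger symbolic generalization.

```python
# Quick illustration of token granularity with an example BPE tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["12345 + 67890", "1 2 3 4 5 + 6 7 8 9 0"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r} -> {pieces}")
# The merged form packs several digits into each token; the spaced form
# exposes one digit per token, i.e., atomically aligned reasoning units.
```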
Pengyue Jia, Seongheon Park, Song Gao, Xiangyu Zhao, Yixuan Li
22
Worldwide image geolocalization-the task of predicting GPS coordinates from
images taken anywhere on Earth-poses a fundamental challenge due to the vast
diversity in visual content across regions. While recent approaches adopt a
two-stage pipeline of retrieving candidates and selecting the best match, they
typically rely on simplistic similarity heuristics and point-wise supervision,
failing to model spatial relationships among candidates. In this paper, we
propose GeoRanker, a distance-aware ranking framework that leverages large
vision-language models to jointly encode query-candidate interactions and
predict geographic proximity. In addition, we introduce a multi-order distance
loss that ranks both absolute and relative distances, enabling the model to
reason over structured spatial relationships. To support this, we curate
GeoRanking, the first dataset explicitly designed for geographic ranking tasks
with multimodal candidate information. GeoRanker achieves state-of-the-art
results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly
outperforming current best methods.
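A simplified, pairwise version of a distance-aware ranking objective is sketched below to convey the idea; it is not the paper's exact multi-order loss. For every candidate pair, the candidate geographically closer to the ground-truth location should receive the higher ranker score.

```python
# Sketch: margin ranking loss over candidate pairs ordered by geographic distance.
import torch

def pairwise_distance_ranking_loss(scores, distances_km, margin=0.1):
    """scores, distances_km: (N,) tensors for N retrieved candidates."""
    n = scores.shape[0]
    i, j = torch.triu_indices(n, n, offset=1)
    # sign = +1 when candidate i is closer (should outrank j), -1 otherwise.
    sign = torch.sign(distances_km[j] - distances_km[i])
    return torch.clamp(margin - sign * (scores[i] - scores[j]), min=0).mean()

scores = torch.randn(8, requires_grad=True)
dists = torch.rand(8) * 5000.0
loss = pairwise_distance_ranking_loss(scores, dists)
loss.backward()
print(float(loss))
```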
Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
22
Multi-hop Question Answering (MHQA) adds layers of complexity to question
answering, making it more challenging. When Language Models (LMs) are prompted
with multiple search results, they are tasked not only with retrieving relevant
information but also employing multi-hop reasoning across the information
sources. Although LMs perform well on traditional question-answering tasks, the
causal mask can hinder their capacity to reason across complex contexts. In
this paper, we explore how LMs respond to multi-hop questions by permuting
search results (retrieved documents) under various configurations. Our study
reveals interesting findings as follows: 1) Encoder-decoder models, such as the
ones in the Flan-T5 family, generally outperform causal decoder-only LMs in
MHQA tasks, despite being significantly smaller in size; 2) altering the order
of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned
decoder-only models, with optimal performance observed when the document order
aligns with the reasoning chain order; 3) enhancing causal decoder-only models
with bi-directional attention by modifying the causal mask can effectively
boost their end performance. In addition to the above, we conduct a thorough
investigation of the distribution of LM attention weights in the context of
MHQA. Our experiments reveal that attention weights tend to peak at higher
values when the resulting answer is correct. We leverage this finding to
heuristically improve LMs' performance on this task. Our code is publicly
available at https://github.com/hwy9855/MultiHopQA-Reasoning.
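The causal-mask modification mentioned in finding (3) can be sketched as follows: tokens inside the retrieved-documents span attend bidirectionally, while all other positions keep the standard causal pattern. How the resulting mask is consumed depends on the specific model implementation, so this is an illustrative construction only.

```python
# Sketch: build a boolean attention mask (True = attention allowed) that is
# causal everywhere except within the retrieved-documents span.
import torch

def prefix_bidirectional_mask(seq_len, doc_span):
    """doc_span = (start, end): positions holding the retrieved documents."""
    allowed = torch.tril(torch.ones(seq_len, seq_len)).bool()  # causal pattern
    start, end = doc_span
    allowed[start:end, start:end] = True  # full attention within the documents
    return allowed

mask = prefix_bidirectional_mask(seq_len=10, doc_span=(0, 6))
print(mask.int())
```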
Recent advances in large language models (LLMs) and the abundance of food
data have resulted in studies to improve food understanding using LLMs. Despite
several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there
has been limited research on integrating food-related KGs with LLMs. We
introduce KERL, a unified system that leverages food KGs and LLMs to provide
personalized food recommendations and generate recipes with associated
micro-nutritional information. Given a natural language question, KERL extracts
entities, retrieves subgraphs from the KG, which are then fed into the LLM as
context to select the recipes that satisfy the constraints. Next, our system
generates the cooking steps and nutritional information for each recipe. To
evaluate our approach, we also develop a benchmark dataset by curating
recipe-related questions combined with constraints and personal preferences. Through
extensive experiments, we show that our proposed KG-augmented LLM significantly
outperforms existing approaches, offering a complete and coherent solution for
food recommendation, recipe generation, and nutritional analysis. Our code and
benchmark datasets are publicly available at
https://github.com/mohbattharani/KERL.
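At a high level, the entity-extraction, subgraph-retrieval, and constrained-selection flow can be sketched as below; extract_entities, get_subgraph, and llm_generate are hypothetical placeholders rather than the released KERL interfaces.

```python
# High-level sketch of KG-augmented prompting for recipe recommendation.

def kg_augmented_answer(question, kg, extract_entities, get_subgraph, llm_generate):
    entities = extract_entities(question)                  # e.g., ["low sodium", "tofu"]
    triples = [t for e in entities for t in get_subgraph(kg, e)]
    context = "\n".join(f"{h} | {r} | {t}" for h, r, t in triples)
    prompt = (
        "Knowledge graph facts:\n" + context +
        "\n\nQuestion: " + question +
        "\nSelect recipes satisfying the constraints and list their nutrition."
    )
    return llm_generate(prompt)
```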
Brain-to-image decoding has been recently propelled by the progress in
generative AI models and the availability of large ultra-high field functional
Magnetic Resonance Imaging (fMRI). However, current approaches depend on
complicated multi-stage pipelines and preprocessing steps that typically
collapse the temporal dimension of brain recordings, thereby limiting
time-resolved brain decoders. Here, we introduce Dynadiff (Dynamic Neural
Activity Diffusion for Image Reconstruction), a new single-stage diffusion
model designed for reconstructing images from dynamically evolving fMRI
recordings. Our approach offers three main contributions. First, Dynadiff
simplifies training as compared to existing approaches. Second, our model
outperforms state-of-the-art models on time-resolved fMRI signals, especially
on high-level semantic image reconstruction metrics, while remaining
competitive on preprocessed fMRI data that collapse time. Third, this approach
allows a precise characterization of the evolution of image representations in
brain activity. Overall, this work lays the foundation for time-resolved
brain-to-image decoding.
Despite advances in transformer-based language models (LMs), a fundamental
question remains largely unanswered: Are all layers activated during inference?
We investigate this question by detecting unactivated layers (which we refer to
as Voids) using a non-trainable and parameter-free adaptive computation method
called L2 Adaptive Computation (LAC). We adapt LAC from its original
efficiency-focused application to trace activated layers during inference. This
method monitors changes in the L2-norm of activations to identify voids. We
analyze layer activation in instruction-tuned LMs across two phases: Prompt
Processing (PP), where we trace activated layers for each token in the input
prompts, and Response Generation (RG), where we trace activated layers for each
generated token. We further demonstrate that distinct layers are activated
during these two phases. To show the effectiveness of our method, we evaluated
three distinct instruction-tuned LMs from the Llama, Mistral, and Qwen families
on three benchmarks: MMLU, GPQA Diamond, and BoolQ. For example, on MMLU with a
zero-shot setting, skipping voids in Qwen2.5-7B-Instruct resulted in an
improvement from 69.24 to 71.29 while the model uses only 30% of the layers.
Similarly, Mistral-7B-Instruct-v0.3 on GPQA Diamond improved from 13.88 to
18.36 when using 70% of the layers during both the PP and RG phases. These
results show that not all layers contribute equally during inference, and that
selectively skipping most of them can improve the performance of models on
certain tasks.
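A rough sketch of the void-detection idea: a layer is flagged as unactivated for a given token when it barely changes the L2-norm of that token's hidden state. The threshold and the relative-change criterion below are illustrative choices and may differ from the paper's exact rule.

```python
# Sketch: flag "void" layers by the relative change in activation L2-norm.
import torch

def find_voids(hidden_states, threshold=0.05):
    """hidden_states: list of (seq_len, dim) tensors, one per layer (incl. embeddings)."""
    voids = []
    for layer_idx in range(1, len(hidden_states)):
        prev_norm = hidden_states[layer_idx - 1].norm(dim=-1)
        curr_norm = hidden_states[layer_idx].norm(dim=-1)
        rel_change = (curr_norm - prev_norm).abs() / (prev_norm + 1e-6)
        voids.append(rel_change < threshold)               # (seq_len,) bool per layer
    return torch.stack(voids)                              # (num_layers, seq_len)

states = [torch.randn(12, 768) for _ in range(33)]         # dummy activations
print(find_voids(states).float().mean())                   # fraction of skippable slots
```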
A well-known issue with Retrieval Augmented Generation (RAG) is that
retrieved passages that are irrelevant to the query sometimes distract the
answer-generating LLM, causing it to provide an incorrect response. In this
paper, we shed light on this core issue and formulate the distracting effect of
a passage w.r.t. a query (and an LLM). We provide a quantifiable measure of the
distracting effect of a passage and demonstrate its robustness across LLMs.
Our research introduces novel methods for identifying and using hard
distracting passages to improve RAG systems. By fine-tuning LLMs with these
carefully selected distracting passages, we achieve up to a 7.5% increase in
answering accuracy compared to counterparts fine-tuned on conventional RAG
datasets. Our contribution is two-fold: first, we move beyond the simple binary
classification of irrelevant passages as either completely unrelated or
distracting, and second, we develop and analyze multiple methods for finding
hard distracting passages. To our knowledge, no other research has provided
such a comprehensive framework for identifying and utilizing hard distracting
passages.
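One plausible way to operationalize such a measure, sketched below under the assumption of a log-probability oracle, is the drop in the LLM's confidence in the gold answer once the passage is added to the context; the paper's exact definition may differ, and answer_logprob is a hypothetical helper.

```python
# Sketch: score a passage's distracting effect as the drop in the log-probability
# of the gold answer when the passage is placed in the context.

def distracting_effect(answer_logprob, query, gold_answer, passage):
    """answer_logprob(context, query, answer) -> log p(answer | context, query)."""
    base = answer_logprob(context="", query=query, answer=gold_answer)
    with_passage = answer_logprob(context=passage, query=query, answer=gold_answer)
    return base - with_passage   # larger value = more distracting

# Passages with the highest scores can then be mined as hard distractors for
# fine-tuning, in the spirit of the approach described above.
```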
Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska
02
We present a conceptual framework for training Vision-Language Models (VLMs)
to perform Visual Perspective Taking (VPT), a core capability for embodied
cognition essential for Human-Robot Interaction (HRI). As a first step toward
this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse,
that enables supervised learning for spatial reasoning tasks. Each instance
includes an RGB image, a natural language description, and a ground-truth 4x4
transformation matrix representing object pose. We focus on inferring Z-axis
distance as a foundational skill, with future extensions targeting full 6
Degrees Of Freedom (DOFs) reasoning. The dataset is publicly available to
support further research. This work serves as a foundational step toward
embodied AI systems capable of spatial understanding in interactive human-robot
scenarios.
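For concreteness, the sketch below shows how the Z-axis distance targeted as the foundational skill can be read off a 4x4 homogeneous transform; the numeric values are arbitrary.

```python
# Sketch: a 4x4 homogeneous transform holds rotation (3x3) and translation (3x1);
# the Z component of the translation is the distance the VPT task asks for.
import numpy as np

T = np.array([
    [1.0, 0.0, 0.0, 0.20],   # rotation | x translation (m)
    [0.0, 1.0, 0.0, -0.05],  #          | y translation (m)
    [0.0, 0.0, 1.0, 1.35],   #          | z translation (m)
    [0.0, 0.0, 0.0, 1.0],
])

translation = T[:3, 3]
z_distance = translation[2]
print(f"object is {z_distance:.2f} m along the camera Z axis")
```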
Alexandre Chapin, Bruno Machado, Emmanuel Dellandrea, Liming Chen
02
Visual representations are central to the learning and generalization
capabilities of robotic manipulation policies. While existing methods rely on
global or dense features, such representations often entangle task-relevant and
irrelevant scene information, limiting robustness under distribution shifts. In
this work, we investigate object-centric representations (OCR) as a structured
alternative that segments visual input into a fixed set of entities,
introducing inductive biases that align more naturally with manipulation tasks.
We benchmark a range of visual encoders (object-centric, global, and dense
methods) across a suite of simulated and real-world manipulation tasks ranging
from simple to complex, and evaluate their generalization under diverse visual
conditions including changes in lighting, texture, and the presence of
distractors. Our findings reveal that OCR-based policies outperform dense and
global representations in generalization settings, even without task-specific
pretraining. These insights suggest that OCR is a promising direction for
designing visual systems that generalize effectively in dynamic, real-world
robotic environments.