By Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, Andrey Kuznetsov
While the task of face swapping has recently gained attention in the research
community, a related problem of head swapping remains largely unexplored. In
addition to skin color transfer, head swapping poses extra challenges, such as
the need to preserve the structural information of the whole head during
synthesis and to inpaint the gaps between the swapped head and the background.
In this paper, we address these concerns with GHOST 2.0, which consists of two
problem-specific modules. First, we introduce an enhanced Aligner model for
head reenactment, which preserves identity information at multiple scales and
is robust to extreme pose variations. Second, we use a Blender module that
seamlessly integrates the reenacted head into the target background by
transferring skin color and inpainting mismatched regions. Both modules
outperform the baselines on their respective tasks, allowing us to achieve
state-of-the-art results in head swapping. We also tackle complex cases, such
as a large difference in hairstyles between the source and target. Code is
available at
https://github.com/ai-forever/ghost-2.0
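For illustration, a minimal sketch of how the two modules described above
might compose into a head-swap pipeline; the aligner and blender interfaces
shown here are hypothetical stand-ins, not the released API.

```python
# Hypothetical composition of the two modules; aligner and blender are stand-in callables.
def head_swap(source_img, target_img, aligner, blender):
    # 1) Reenact the source head so it matches the target's pose and expression.
    reenacted_head = aligner(source=source_img, driver=target_img)
    # 2) Transfer skin color and inpaint the mismatched region around the swapped head.
    return blender(head=reenacted_head, background=target_img)
```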
By Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo
We introduce Kanana, a series of bilingual language models that demonstrate
superior performance in Korean and competitive performance in English. The
computational cost of Kanana is significantly lower than that of
state-of-the-art models of similar size. The report details the techniques
employed during pre-training to achieve compute-efficient yet competitive
models, including high-quality data filtering, staged pre-training, depth
up-scaling, and pruning and distillation. Furthermore, the report outlines the
methodologies utilized during the post-training of the Kanana models,
encompassing supervised fine-tuning and preference optimization, aimed at
enhancing their capability for seamless interaction with users. Lastly, the
report elaborates on plausible approaches used for language model adaptation to
specific scenarios, such as embedding, retrieval augmented generation, and
function calling. The Kanana model series spans from 2.1B to 32.5B parameters
with 2.1B models (base, instruct, embedding) publicly released to promote
research on Korean language models.
By Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, Vivek Natarajan
Scientific discovery relies on scientists generating novel hypotheses that
undergo rigorous experimental validation. To augment this process, we introduce
an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI
co-scientist is intended to help uncover new, original knowledge and to
formulate demonstrably novel research hypotheses and proposals, building upon
prior evidence and aligned to scientist-provided research objectives and
guidance. The system's design incorporates a generate, debate, and evolve
approach to hypothesis generation, inspired by the scientific method and
accelerated by scaling test-time compute. Key contributions include: (1) a
multi-agent architecture with an asynchronous task execution framework for
flexible compute scaling; (2) a tournament evolution process for self-improving
hypothesis generation. Automated evaluations show continued benefits of
test-time compute, improving hypothesis quality. While general purpose, we
focus development and validation in three biomedical areas: drug repurposing,
novel target discovery, and explaining mechanisms of bacterial evolution and
anti-microbial resistance. For drug repurposing, the system proposes candidates
with promising validation findings, including candidates for acute myeloid
leukemia that show tumor inhibition in vitro at clinically applicable
concentrations. For novel target discovery, the AI co-scientist proposed new
epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and
liver cell regeneration in human hepatic organoids. Finally, the AI
co-scientist recapitulated unpublished experimental results via a parallel in
silico discovery of a novel gene transfer mechanism in bacterial evolution.
These results, detailed in separate, co-timed reports, demonstrate the
potential to augment biomedical and scientific discovery and usher in an era
of AI-empowered scientists.
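As a rough illustration of the tournament evolution idea, the sketch below
runs one round of pairwise comparisons and refinements over candidate
hypotheses; the judge and evolve callables are hypothetical LLM-backed
functions, not the system's actual agents.

```python
import random

# One tournament round over hypotheses (assumes an even number of candidates).
# judge(a, b) returns "a" or "b"; evolve(h) returns a refined hypothesis. Both are
# hypothetical stand-ins for LLM-backed agents.
def tournament_round(hypotheses, judge, evolve):
    random.shuffle(hypotheses)
    winners = []
    for a, b in zip(hypotheses[::2], hypotheses[1::2]):
        winners.append(a if judge(a, b) == "a" else b)
    # Winners survive and are also refined into new candidates for the next round.
    return winners + [evolve(h) for h in winners]
```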
By Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen
Understanding domain-specific theorems often requires more than just
text-based reasoning; effective communication through structured visual
explanations is crucial for deeper comprehension. While large language models
(LLMs) demonstrate strong performance in text-based theorem reasoning, their
ability to generate coherent and pedagogically meaningful visual explanations
remains an open challenge. In this work, we introduce TheoremExplainAgent, an
agentic approach for generating long-form theorem explanation videos (over 5
minutes) using Manim animations. To systematically evaluate multimodal theorem
explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems
across multiple STEM disciplines, along with 5 automated evaluation metrics.
Our results reveal that agentic planning is essential for generating detailed
long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an
overall score of 0.77. However, our quantitative and qualitative studies show
that most of the videos produced exhibit minor issues with visual element
layout. Furthermore, multimodal explanations expose deeper reasoning flaws that
text-based explanations fail to reveal, highlighting the importance of
multimodal explanations.
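For context, the snippet below is a hand-written example of the kind of Manim
scene such an agent might emit for a simple theorem; it is illustrative only
and is not output of TheoremExplainAgent.

```python
from manim import Scene, Text, MathTex, Write, FadeIn, UP

# A short, hand-written Manim scene illustrating the target output format.
class PythagoreanScene(Scene):
    def construct(self):
        title = Text("Pythagorean theorem")
        equation = MathTex(r"a^2 + b^2 = c^2")
        self.play(Write(title))
        self.play(title.animate.to_edge(UP))  # move the title up to make room
        self.play(FadeIn(equation))
        self.wait(2)
```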
Despite Greece's pivotal role in the global economy, large language models
(LLMs) remain underexplored for the Greek financial context due to the linguistic
complexity of Greek and the scarcity of domain-specific datasets. Previous
efforts in multilingual financial natural language processing (NLP) have
exposed considerable performance disparities, yet no dedicated Greek financial
benchmarks or Greek-specific financial LLMs have been developed until now. To
bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation
Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with
Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks
in Greek: numeric and textual named entity recognition, question answering,
abstractive summarization, and topic classification, thereby facilitating
systematic and reproducible LLM assessments. To underpin these tasks, we
present three novel, high-quality Greek financial datasets, thoroughly
annotated by expert native Greek speakers, augmented by two existing resources.
Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek
financial NLP remains challenging due to linguistic complexity, domain-specific
terminology, and financial reasoning gaps. These findings underscore the
limitations of cross-lingual transfer, the necessity for financial expertise in
Greek-trained models, and the challenges of adapting financial LLMs to Greek
text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to
promote reproducible research and advance Greek financial NLP, fostering
broader multilingual inclusivity in finance.
By Tushar Aggarwal, Kumar Tanmay, Ayush Agrawal, Kumar Ayush, Hamid Palangi, Paul Pu Liang
Multilingual language models (LMs) are expected to recall factual knowledge
consistently across languages, yet they often fail to transfer knowledge
between languages even when they possess the correct information in one of the
languages. For example, we find that an LM may correctly identify Rashed Al
Shashai as being from Saudi Arabia when asked in Arabic, but consistently fails
to do so when asked in English or Swahili. To systematically investigate this
limitation, we introduce a benchmark of 10,000 country-related facts across 13
languages and propose three novel metrics: Factual Recall Score, Knowledge
Transferability Score, and Cross-Lingual Factual Knowledge Transferability
Score, to quantify factual recall and knowledge transferability in LMs across
different languages. Our results reveal fundamental weaknesses in today's
state-of-the-art LMs, particularly in cross-lingual generalization where models
fail to transfer knowledge effectively across different languages, leading to
inconsistent performance sensitive to the language used. Our findings emphasize
the need for LMs to recognize language-specific factual reliability and
leverage the most trustworthy information across languages. We release our
benchmark and evaluation framework to drive future research in multilingual
knowledge transfer.
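For intuition, one plausible way to compute per-language recall and a
cross-lingual transferability score from a table of per-fact correctness is
sketched below; these are illustrative definitions, not the paper's exact
metric formulas.

```python
# results[lang][fact_id] is True if the model answered that fact correctly in that language.
def factual_recall(results, lang):
    answers = list(results[lang].values())
    return sum(answers) / len(answers)

def transferability(results, src_lang, tgt_lang):
    # Fraction of facts known in src_lang that are also answered correctly in tgt_lang.
    known = [f for f, correct in results[src_lang].items() if correct]
    if not known:
        return 0.0
    return sum(results[tgt_lang][f] for f in known) / len(known)

results = {
    "ar": {"rashed_al_shashai_country": True},
    "en": {"rashed_al_shashai_country": False},
}
print(factual_recall(results, "ar"), transferability(results, "ar", "en"))  # 1.0 0.0
```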
By Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, Benjamin Van Durme
We introduce Rank1, the first reranking model trained to take advantage of
test-time compute. Rank1 demonstrates the applicability, within retrieval, of
distilling a reasoning language model (e.g., OpenAI's o1, DeepSeek's R1) to
rapidly improve the performance of a smaller model. We gather and open-source
a dataset of more than 600,000 examples of R1 reasoning traces from queries
and passages in MS MARCO. Models trained on this dataset: (1) achieve
state-of-the-art performance on advanced reasoning and instruction-following
datasets; (2) work remarkably well out of distribution due to their ability to
respond to user-input prompts; and (3) produce explainable reasoning chains
that can be given to users or RAG-based systems. Further, we demonstrate
that quantized versions of these models retain strong performance while using
less compute/memory. Overall, Rank1 shows that test-time compute allows for a
fundamentally new type of explainable and performant reranker model for search.
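A minimal sketch of reranking with a reasoning model is shown below; the
prompt format, the generate callable, and the relevance-parsing rule are
assumptions for illustration, not Rank1's actual template or code.

```python
# Pointwise reranking with a reasoning model; generate(prompt) -> str is a stand-in LLM call.
def rerank(query, passages, generate):
    scored = []
    for passage in passages:
        prompt = (
            f"Query: {query}\nPassage: {passage}\n"
            "Reason step by step about whether the passage answers the query, then end "
            "with 'relevant: true' or 'relevant: false'."
        )
        output = generate(prompt)
        relevant = "relevant: true" in output.lower()
        scored.append((passage, relevant, output))  # keep the chain for explainability
    # Relevant passages first; stable sort preserves the original order within each group.
    return sorted(scored, key=lambda item: not item[1])
```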
By Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
Recently, o1-like models have drawn significant attention; these models
produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning
abilities of existing Large Language Models (LLMs). In this paper, to
understand the quality of these long CoTs and to measure the critique
abilities of existing LLMs on them, we introduce DeltaBench, which contains
long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) for
different reasoning tasks (e.g., math, code, general reasoning) and measures
the ability to detect errors in long CoT reasoning. Based on DeltaBench, we
first perform a fine-grained analysis of the generated long CoTs to assess the
effectiveness and efficiency of different o1-like models. Then, we conduct
extensive evaluations of existing process reward models (PRMs) and critic
models on detecting the errors in each annotated process, aiming to
investigate the boundaries and limitations of existing PRMs and critic models.
Finally, we hope that DeltaBench could guide developers to better understand
the long CoT reasoning abilities of their models.
By Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li
Reward models (RMs) are crucial for the training and inference-time scaling
of large language models (LLMs). However, existing reward models primarily
focus on human preferences, neglecting verifiable correctness signals which
have shown strong potential in training LLMs. In this paper, we propose agentic
reward modeling, a reward system that combines reward models with verifiable
correctness signals from different aspects to provide reliable rewards. We
empirically implement a reward agent, named RewardAgent, that combines human
preference rewards with two verifiable signals: factuality and instruction
following, to provide more reliable rewards. We conduct comprehensive
experiments on existing reward model benchmarks and inference-time best-of-n
search on real-world downstream tasks. RewardAgent significantly outperforms
vanilla reward models, demonstrating its effectiveness. We further construct
training preference pairs using RewardAgent and train an LLM with the DPO
objective, achieving superior performance on various NLP benchmarks compared to
conventional reward models. Our code is publicly released to facilitate
further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).
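As a rough sketch of the idea, a combined reward could add a learned
preference score to verifiable factuality and instruction-following checks;
the weights and verifier interfaces below are assumptions, not RewardAgent's
actual design.

```python
# Combine a learned preference reward with two verifiable correctness signals.
def agentic_reward(prompt, response, preference_rm, factuality_check, instruction_check,
                   w_pref=1.0, w_fact=1.0, w_inst=1.0):
    r_pref = preference_rm(prompt, response)             # scalar from a learned reward model
    r_fact = float(factuality_check(prompt, response))   # 1.0 if claims are verified, else 0.0
    r_inst = float(instruction_check(prompt, response))  # 1.0 if constraints are satisfied
    return w_pref * r_pref + w_fact * r_fact + w_inst * r_inst
```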
By Christoph Schuhmann, Gollam Rabby, Ameya Prabhu, Tawsif Ahmed, Andreas Hochlehnert, Huu Nguyen, Nick Akinci Heidrich, Ludwig Schmidt, Robert Kaczmarczyk, Sören Auer, Jenia Jitsev, Matthias Bethge
Paywalls, licenses and copyright rules often restrict the broad dissemination
and reuse of scientific knowledge. We take the position that it is both legally
and technically feasible to extract the scientific knowledge in scholarly
texts. Current methods, like text embeddings, fail to reliably preserve factual
content, and simple paraphrasing may not be legally sound. We urge the
community to adopt a new idea: convert scholarly documents into Knowledge Units
using LLMs. These units use structured data capturing entities, attributes and
relationships without stylistic content. We provide evidence that Knowledge
Units: (1) form a legally defensible framework for sharing knowledge from
copyrighted research texts, based on legal analyses of German copyright law and
U.S. Fair Use doctrine, and (2) preserve most (~95%) of the factual knowledge
of the original text, as measured by MCQ performance on facts from the original
copyrighted text across four research domains. Freeing scientific knowledge
from copyright promises transformative benefits for scientific research and
education by allowing language models to reuse important facts from copyrighted
text. To support this, we share open-source tools for converting research
documents into Knowledge Units. Overall, our work posits the feasibility of
democratizing access to scientific knowledge while respecting copyright.
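For illustration, a Knowledge Unit could be represented as a small structured
record of entities, attributes, and relationships; the schema below is a
hypothetical sketch, not the released format.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeUnit:
    # Structured factual content with no stylistic text from the source document.
    entities: list[str]
    attributes: dict[str, str] = field(default_factory=dict)
    relationships: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)
    source_id: str | None = None  # e.g., a DOI, kept for provenance

unit = KnowledgeUnit(
    entities=["aspirin", "cyclooxygenase"],
    attributes={"aspirin": "irreversible inhibitor"},
    relationships=[("aspirin", "inhibits", "cyclooxygenase")],
)
```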
There is growing excitement about the potential of Language Models (LMs) to
accelerate scientific discovery. Falsifying hypotheses is key to scientific
progress, as it allows claims to be iteratively refined over time. This process
requires significant researcher effort, reasoning, and ingenuity. Yet current
benchmarks for LMs predominantly assess their ability to generate solutions
rather than challenge them. We advocate for developing benchmarks that evaluate
this inverse capability: creating counterexamples for subtly incorrect
solutions. To demonstrate this approach, we start with the domain of
algorithmic problem solving, where counterexamples can be evaluated
automatically using code execution. Specifically, we introduce REFUTE, a
dynamically updating benchmark that includes recent problems and incorrect
submissions from programming competitions, where human experts successfully
identified counterexamples. Our analysis finds that the best reasoning agents,
even OpenAI o3-mini (high) with code execution feedback, can create
counterexamples for only <9% of incorrect solutions in REFUTE, even though
ratings indicate its ability to solve up to 48% of these problems from scratch.
We hope our work spurs progress in evaluating and enhancing LMs' ability to
falsify incorrect solutions, a capability that is crucial for both
accelerating research and making models self-improve through reliable
reflective reasoning.
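Counterexample validation in this setting can be automated by executing
programs: run the incorrect submission and a reference solution on a candidate
input and compare outputs. The sketch below illustrates that check under
simplifying assumptions (exact output match, no input-format validation); it
is not the benchmark's judge.

```python
import subprocess

# Returns True if the candidate input makes the incorrect program disagree with the reference.
def is_counterexample(candidate_input: str, incorrect_cmd: list[str],
                      reference_cmd: list[str], timeout: float = 5.0) -> bool:
    def run(cmd):
        proc = subprocess.run(cmd, input=candidate_input, capture_output=True,
                              text=True, timeout=timeout)
        return proc.stdout.strip()
    return run(incorrect_cmd) != run(reference_cmd)
```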
Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI)
agents via Reinforcement Learning (RL) faces critical challenges:
environment-based RL requires costly interactions, while environment-free
methods struggle with distribution shift and reward generalization. We propose
an environment-free RL framework that decouples value estimation from policy
optimization by leveraging a pretrained Value Environment Model (VEM). VEM
predicts state-action values directly from offline data, distilling human-like
priors about GUI interaction outcomes without requiring next-state prediction
or environmental feedback. This avoids compounding errors and enhances
resilience to UI changes by focusing on semantic reasoning (e.g., Does this
action advance the user's goal?). The framework operates in two stages: (1)
pretraining VEM to estimate long-term action utilities and (2) guiding policy
exploration with frozen VEM signals, enabling layout-agnostic GUI automation.
Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art
performance in both offline and online settings, outperforming environment-free
baselines significantly and matching environment-based approaches without
interaction costs. Importantly, VEM demonstrates that semantic-aware value
estimation can achieve performance comparable to online-trained methods.
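A minimal sketch of the second stage, assuming a frozen value model that
scores candidate GUI actions, is given below; the vem_score interface and the
softmax weighting are illustrative, not the paper's exact objective.

```python
import numpy as np

# Weight candidate actions by a frozen value model's estimates Q(s, a).
# vem_score(observation, action) -> float is a stand-in for the pretrained VEM.
def action_weights(observation, candidate_actions, vem_score, temperature=1.0):
    q = np.array([vem_score(observation, a) for a in candidate_actions])
    w = np.exp((q - q.max()) / temperature)  # softmax over value estimates
    return w / w.sum()                       # usable to re-weight a policy loss or pick actions
```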
Monocular depth estimation (MDE) aims to predict scene depth from a single
RGB image and plays a crucial role in 3D scene understanding. Recent advances
in zero-shot MDE leverage normalized depth representations and
distillation-based learning to improve generalization across diverse scenes.
However, current depth normalization methods for distillation, relying on
global normalization, can amplify noisy pseudo-labels, reducing distillation
effectiveness. In this paper, we systematically analyze the impact of different
depth normalization strategies on pseudo-label distillation. Based on our
findings, we propose Cross-Context Distillation, which integrates global and
local depth cues to enhance pseudo-label quality. Additionally, we introduce a
multi-teacher distillation framework that leverages complementary strengths of
different depth estimation models, leading to more robust and accurate depth
predictions. Extensive experiments on benchmark datasets demonstrate that our
approach significantly outperforms state-of-the-art methods, both
quantitatively and qualitatively.
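To make the global-versus-local distinction concrete, the sketch below applies
a generic scale-and-shift-invariant normalization to a whole depth map and,
alternatively, within local crops; this is the standard MiDaS-style recipe,
not necessarily the paper's exact formulation.

```python
import numpy as np

def normalize(depth):
    # Scale-and-shift-invariant normalization: subtract the median, divide by mean absolute deviation.
    t = np.median(depth)
    s = np.mean(np.abs(depth - t)) + 1e-6
    return (depth - t) / s

def local_normalize(depth, grid=2):
    # Apply the same normalization independently inside each cell of a grid of crops.
    h, w = depth.shape
    out = np.empty_like(depth, dtype=np.float64)
    for i in range(grid):
        for j in range(grid):
            ys = slice(i * h // grid, (i + 1) * h // grid)
            xs = slice(j * w // grid, (j + 1) * w // grid)
            out[ys, xs] = normalize(depth[ys, xs])
    return out
```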
By Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Language models depend heavily on high-quality data for optimal performance.
Existing approaches rely on manually designed heuristics, the perplexity of
existing models, training classifiers, or careful prompt engineering, which
require significant expert experience and human annotation effort while
introducing biases. We introduce CritiQ, a novel data selection method that
automatically mines criteria from human preferences for data quality with only
~30 human-annotated pairs and performs efficient data selection. The main
component, CritiQ Flow, employs a manager agent to evolve quality criteria and
worker agents to make pairwise judgments. We build a knowledge base that
extracts quality criteria from previous work to boost CritiQ Flow. Compared to
perplexity- and classifier-based methods, verbal criteria are more
interpretable and possess reusable value. After deriving the criteria, we train
the CritiQ Scorer to give quality scores and perform efficient data selection.
We demonstrate the effectiveness of our method in the code, math, and logic
domains, achieving high accuracy on human-annotated test sets. To validate the
quality of the selected data, we continually train Llama 3.1 models and observe
improved performance on downstream tasks compared to uniform sampling. Ablation
studies validate the benefits of the knowledge base and the reflection process.
We analyze how criteria evolve and the effectiveness of majority voting.
By Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa, Kate Olszewska, Yi Tay, Vinh Q. Tran, Quoc V. Le, Orhan Firat
Large language models (LLMs) are increasingly deployed in everyday
applications, demanding robust general reasoning capabilities and a diverse
reasoning skillset. However, current LLM reasoning benchmarks predominantly
focus on mathematical and coding abilities, leaving a gap in evaluating broader
reasoning proficiencies. One particular exception is the BIG-Bench dataset,
which has served as a crucial benchmark for evaluating the general reasoning
capabilities of LLMs, thanks to its diverse set of challenging tasks that
allowed for a comprehensive assessment of general reasoning across various
skills within a unified framework. However, recent advances in LLMs have led to
saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH).
State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus
diminishing its utility. To address this limitation, we introduce BIG-Bench
Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM
reasoning evaluation. BBEH replaces each task in BBH with a novel task that
probes a similar reasoning capability but exhibits significantly increased
difficulty. We evaluate various models on BBEH and observe a (harmonic) average
accuracy of 9.8% for the best general-purpose model and 44.8% for the best
reasoning-specialized model, indicating substantial room for improvement and
highlighting the ongoing challenge of achieving robust general reasoning in
LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
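For readers unfamiliar with the harmonic average: it is dominated by the
weakest per-task accuracies, which makes it a stricter aggregate than the
arithmetic mean. The toy numbers below are illustrative, not BBEH task scores.

```python
# Harmonic vs. arithmetic average over per-task accuracies (illustrative values).
accs = [0.05, 0.10, 0.40, 0.80]
arithmetic = sum(accs) / len(accs)               # 0.3375
harmonic = len(accs) / sum(1 / a for a in accs)  # ~0.119, pulled down by the weakest tasks
print(arithmetic, harmonic)
```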
By Yuntao Du, Kailin Jiang, Zhi Gao, Chenrui Shi, Zilong Zheng, Siyuan Qi, Qing Li
Knowledge editing techniques have emerged as essential tools for updating the
factual knowledge of large language models (LLMs) and multimodal models (LMMs),
allowing them to correct outdated or inaccurate information without retraining
from scratch. However, existing benchmarks for multimodal knowledge editing
primarily focus on entity-level knowledge represented as simple triplets, which
fail to capture the complexity of real-world multimodal information. To address
this issue, we introduce MMKE-Bench, a comprehensive MultiModal Knowledge
Editing Benchmark, designed to evaluate the ability of LMMs to edit diverse
visual knowledge in real-world scenarios. MMKE-Bench addresses these
limitations by incorporating three types of editing tasks: visual entity
editing, visual semantic editing, and user-specific editing. In addition,
MMKE-Bench uses free-form natural language to represent and edit knowledge,
offering a more flexible and effective format. The benchmark consists of 2,940
pieces of knowledge and 8,363 images across 33 broad categories, with
evaluation questions automatically generated and human-verified. We assess five
state-of-the-art knowledge editing methods on three prominent LMMs, revealing
that no method excels across all criteria, and that visual and user-specific
edits are particularly challenging. MMKE-Bench sets a new standard for
evaluating the robustness of multimodal knowledge editing techniques, driving
progress in this rapidly evolving field.
By Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
Effective personalization of LLMs is critical for a broad range of
user-interfacing applications such as virtual assistants and content curation.
Inspired by the strong in-context learning capabilities of LLMs, we propose
Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a
meta-learning problem. Under this framework, an LLM learns to quickly adapt to
a user via a few labeled preferences from that user, constructing a
personalized reward function for them. Additionally, since real-world
preference data is scarce and challenging to collect at scale, we propose
careful design choices to construct synthetic preference datasets for
personalization, generating over 1M synthetic personalized preferences using
publicly available LLMs. In particular, to successfully transfer from synthetic
data to real users, we find it crucial for the data to exhibit both high
diversity and coherent, self-consistent structure. We evaluate FSPO on
personalized open-ended generation for up to 1,500 synthetic users across
three domains: movie reviews, pedagogical adaptation based on
educational background, and general question answering, along with a controlled
human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in
generating responses that are personalized to synthetic users and a 72% winrate
with real human users in open-ended question answering.
By Liang Wang, Shaozhen Liu, Yu Rong, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang
Establishing the relationship between 3D structures and the energy states of
molecular systems has proven to be a promising approach for learning 3D
molecular representations. However, existing methods are limited to modeling
the molecular energy states from classical mechanics. This limitation results
in a significant oversight of quantum mechanical effects, such as quantized
(discrete) energy level structures, which offer a more accurate estimation of
molecular energy and can be experimentally measured through energy spectra. In
this paper, we propose to utilize the energy spectra to enhance the
pre-training of 3D molecular representations (MolSpectra), thereby infusing the
knowledge of quantum mechanics into the molecular representations.
Specifically, we propose SpecFormer, a multi-spectrum encoder for encoding
molecular spectra via masked patch reconstruction. By further aligning outputs
from the 3D encoder and spectrum encoder using a contrastive objective, we
enhance the 3D encoder's understanding of molecules. Evaluations on public
benchmarks reveal that our pre-trained representations surpass existing methods
in predicting molecular properties and modeling dynamics.
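The contrastive alignment between the 3D encoder and the spectrum encoder can
be written as a symmetric InfoNCE objective; the sketch below shows a generic
version of that loss, with an illustrative temperature rather than the paper's
settings.

```python
import torch
import torch.nn.functional as F

# Symmetric InfoNCE loss aligning paired 3D and spectrum embeddings of shape (batch, dim).
def contrastive_loss(z_3d, z_spec, temperature=0.07):
    z_3d = F.normalize(z_3d, dim=-1)
    z_spec = F.normalize(z_spec, dim=-1)
    logits = z_3d @ z_spec.t() / temperature                   # similarity of every 3D/spectrum pair
    targets = torch.arange(z_3d.size(0), device=z_3d.device)   # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```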
By Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
The Mixture of Experts (MoE) architecture reduces the training and inference
cost significantly compared to a dense model of equivalent capacity. Upcycling
is an approach that initializes and trains an MoE model using a pre-trained
dense model. While upcycling leads to initial performance gains, training
progresses more slowly than training from scratch, leading to suboptimal
performance in the long term. We propose Drop-Upcycling, a method that
effectively addresses this problem. Drop-Upcycling combines two seemingly
contradictory approaches: utilizing the knowledge of pre-trained dense models
while statistically re-initializing some parts of the weights. This approach
strategically promotes expert specialization, significantly enhancing the MoE
model's efficiency in knowledge acquisition. Extensive large-scale experiments
demonstrate that Drop-Upcycling significantly outperforms previous MoE
construction methods in the long term, specifically when training on hundreds
of billions of tokens or more. As a result, our MoE model with 5.9B active
parameters achieves comparable performance to a 13B dense model in the same
model family, while requiring approximately 1/4 of the training FLOPs. All
experimental resources, including source code, training data, model checkpoints
and logs, are publicly available to promote reproducibility and future research
on MoE.
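A minimal sketch of the "copy, then partially re-initialize" idea when
expanding one dense FFN weight into several experts is shown below; the
re-initialization rule (random columns, Gaussian noise scaled to the original
weight statistics) and the drop ratio are assumptions, not the paper's exact
procedure.

```python
import torch

def make_experts(dense_ffn_weight: torch.Tensor, num_experts: int, drop_ratio: float = 0.5):
    # Each expert starts as a copy of the dense weight with a random subset of columns re-initialized.
    d_out, d_in = dense_ffn_weight.shape
    experts = []
    for _ in range(num_experts):
        w = dense_ffn_weight.clone()
        cols = torch.randperm(d_in)[: int(drop_ratio * d_in)]
        w[:, cols] = torch.randn(d_out, len(cols)) * dense_ffn_weight.std()
        experts.append(w)
    return experts  # diversified initializations encourage expert specialization
```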
By Marcus Yu Zhe Wee, Justin Juin Hng Wong, Lynus Lim, Joe Yu Wei Tan, Prannaya Gupta, Dillion Lim, En Hao Tew, Aloysius Keng Siew Han, Yong Zhi Lim
Effective communication in Air Traffic Control (ATC) is critical to
maintaining aviation safety, yet the challenges posed by accented English
remain largely unaddressed in Automatic Speech Recognition (ASR) systems.
Existing models struggle with transcription accuracy for Southeast
Asian-accented (SEA-accented) speech, particularly in noisy ATC environments.
This study presents the development of ASR models fine-tuned specifically for
Southeast Asian accents using a newly created dataset. Our models achieve
significant improvements, reaching a Word Error Rate (WER) of 9.82% (0.0982)
on SEA-accented ATC speech. Additionally, the paper highlights the importance
of region-specific datasets and accent-focused training, offering a pathway for
deploying ASR systems in resource-constrained military operations. The findings
emphasize the need for noise-robust training techniques and region-specific
datasets to improve transcription accuracy for non-Western accents in ATC
communications.
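For reference, the reported WER of 0.0982 is the same quantity as 9.82%:
word-level edit distance divided by the number of reference words. A
self-contained computation is sketched below with a made-up ATC-style example.

```python
# Word error rate = (substitutions + deletions + insertions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("climb to flight level three five zero",
          "climb to flight level tree five zero"))  # 1 substitution / 7 words ~ 0.143
```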
As AI models are increasingly deployed across diverse real-world scenarios,
ensuring their safety remains a critical yet underexplored challenge. While
substantial efforts have been made to evaluate and enhance AI safety, the lack
of a standardized framework and comprehensive toolkit poses significant
obstacles to systematic research and practical adoption. To bridge this gap, we
introduce AISafetyLab, a unified framework and toolkit that integrates
representative attack, defense, and evaluation methodologies for AI safety.
AISafetyLab features an intuitive interface that enables developers to
seamlessly apply various techniques while maintaining a well-structured and
extensible codebase for future advancements. Additionally, we conduct empirical
studies on Vicuna, analyzing different attack and defense strategies to provide
valuable insights into their comparative effectiveness. To facilitate ongoing
research and development in AI safety, AISafetyLab is publicly available at
https://github.com/thu-coai/AISafetyLab, and we are committed to its continuous
maintenance and improvement.
By Zhengmian Hu, Tong Zheng, Vignesh Viswanathan, Ziyi Chen, Ryan A. Rossi, Yihan Wu, Dinesh Manocha, Heng Huang
Large Language Models (LLMs) have become an indispensable part of natural
language processing tasks. However, autoregressive sampling remains an
efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent
approach where, when generating each token, a small draft model generates
multiple drafts, and the target LLM verifies them in parallel, ensuring that
the final output conforms to the target model distribution. The two main design
choices in MDSD are the draft sampling method and the verification algorithm.
For a fixed draft sampling method, the optimal acceptance rate is a solution to
an optimal transport problem, but the complexity of this problem makes it
difficult to solve for the optimal acceptance rate and measure the gap between
existing verification algorithms and the theoretical upper bound. This paper
discusses the dual of the optimal transport problem, providing a way to
efficiently compute the optimal acceptance rate. For the first time, we measure
the theoretical upper bound of MDSD efficiency for vocabulary sizes in the
thousands and quantify the gap between existing verification algorithms and
this bound. We also compare different draft sampling methods based on their
optimal acceptance rates. Our results show that the draft sampling method
strongly influences the optimal acceptance rate, with sampling without
replacement outperforming sampling with replacement. Additionally, existing
verification algorithms do not reach the theoretical upper bound under either
sampling scheme. Our findings suggest that
carefully designed draft sampling methods can potentially improve the optimal
acceptance rate and enable the development of verification algorithms that
closely match the theoretical upper bound.
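To make the "with versus without replacement" distinction concrete, the sketch
below draws k draft tokens either way from a token distribution, using the
Gumbel-top-k trick for the without-replacement case; this is a generic
illustration of the draft sampling choice, not the paper's code.

```python
import numpy as np

def sample_without_replacement(logprobs: np.ndarray, k: int, rng=np.random):
    # Gumbel-top-k: adding Gumbel noise and taking the top k yields k distinct tokens.
    gumbel = -np.log(-np.log(rng.uniform(size=logprobs.shape)))
    return np.argsort(logprobs + gumbel)[::-1][:k]

def sample_with_replacement(logprobs: np.ndarray, k: int, rng=np.random):
    p = np.exp(logprobs - logprobs.max())
    p /= p.sum()
    return rng.choice(len(p), size=k, replace=True, p=p)  # duplicate drafts are possible
```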
Generating accurate and concise textual summaries from multimodal documents
is challenging, especially when dealing with visually complex content like
scientific posters. We introduce PosterSum, a novel benchmark to advance the
development of vision-language models that can understand and summarize
scientific posters into research paper abstracts. Our dataset contains 16,305
conference posters paired with their corresponding abstracts as summaries. Each
poster is provided in image format and presents diverse visual understanding
challenges, such as complex layouts, dense text regions, tables, and figures.
We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on
PosterSum and demonstrate that they struggle to accurately interpret and
summarize scientific posters. We propose Segment & Summarize, a hierarchical
method that outperforms current MLLMs on automated metrics, achieving a 3.14%
gain in ROUGE-L. This work will serve as a starting point for future research
on poster summarization.
Weakly supervised semantic segmentation (WSSS) typically utilizes limited
semantic annotations to obtain initial Class Activation Maps (CAMs). However,
due to the inadequate coupling between class activation responses and semantic
information in high-dimensional space, the CAM is prone to object co-occurrence
or under-activation, resulting in inferior recognition accuracy. To tackle this
issue, we propose DOEI, Dual Optimization of Embedding Information, a novel
approach that reconstructs embedding representations through semantic-aware
attention weight matrices to optimize the expression capability of embedding
information. Specifically, DOEI amplifies tokens with high confidence and
suppresses those with low confidence during the class-to-patch interaction.
This alignment of activation responses with semantic information strengthens
the propagation and decoupling of target features, enabling the generated
embeddings to more accurately represent target features in high-level semantic
space. In addition, we propose a hybrid-feature alignment module in DOEI that
combines RGB values, embedding-guided features, and self-attention weights to
increase the reliability of candidate tokens. Comprehensive experiments show
that DOEI is an effective plug-and-play module that empowers state-of-the-art
visual transformer-based WSSS models to significantly improve the quality of
CAMs and segmentation performance on popular benchmarks, including PASCAL VOC
(+3.6%, +1.5%, +1.2% mIoU) and MS COCO (+1.2%, +1.6% mIoU). Code will be
available at https://github.com/AIGeeksGroup/DOEI.