Daily Curated AI Research Papers and Translations
Reasoning is the fundamental cognitive process underlying inference, problem solving, and decision making. While large language models exhibit strong reasoning abilities in closed settings, they still struggle in open, dynamic environments. Agentic reasoning marks a paradigm shift by recasting LLMs as autonomous agents that plan, act, and learn through continuous interaction. This survey systematically organizes agentic reasoning research along three complementary dimensions. First, it characterizes environment dynamics through a three-layer architecture: foundational agentic reasoning establishes an agent's core individual capabilities in stable environments, including planning, tool use, and search; self-evolving agentic reasoning studies how agents refine these capabilities through feedback, memory, and adaptation mechanisms; and collective multi-agent reasoning extends intelligence to collaborative settings involving cooperation, knowledge sharing, and shared goals. Across these layers, we distinguish in-context reasoning, which scales test-time interaction through structured orchestration, from post-training reasoning, which optimizes behavior via reinforcement learning and supervised fine-tuning. We then systematically review representative agentic reasoning frameworks in real-world applications such as science, robotics, healthcare, autonomous research, and mathematics. The survey consolidates agentic reasoning methods into a unified roadmap connecting thinking with acting, and identifies open challenges and future directions, including personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance for real-world deployment.
Deep research agents (DRAs) generate citation-rich reports through multi-turn retrieval and synthesis, but existing benchmarks target text-only settings or short-form multimodal QA and lack end-to-end evaluation of multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-constructed tasks spanning 21 domains, each providing paired images and text to evaluate multimodal understanding and citation-grounded report generation. Compared with prior setups, MMDR-Bench emphasizes explicitly evidence-driven, report-style synthesis: models must link visual elements to attributable claims and keep the narrative, citations, and visual references consistent. We further propose a unified and interpretable evaluation framework: FLAE, a formula-based LLM-adaptive evaluation of report quality; TRACE, a trustworthy retrieval-calibration evaluation ensuring citations align with evidence; and MOSAIC, a multimodal support-alignment integrity check of image-text consistency. Each module yields fine-grained signals that support error diagnosis beyond a single overall score. Experiments on 25 frontier models reveal systematic trade-offs among generation quality, citation practice, and multimodal grounding, showing that strong text generation does not guarantee trustworthy evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
Video generation models have substantially advanced embodied intelligence, opening new possibilities for generating diverse robot data that couples perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real robot interactions remains challenging, and the lack of standardized benchmarks limits fair comparison and research progress. To fill this gap, we introduce RBench, a comprehensive robotics benchmark that evaluates robot-oriented video generation across five task domains and four distinct embodiments. The benchmark assesses both task-level accuracy and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluating 25 representative models reveals significant deficiencies in generating physically realistic robot behaviors. Moreover, the benchmark reaches a Spearman correlation of 0.96 with human evaluation, validating its effectiveness. While RBench provides the necessary lens for identifying these deficiencies, achieving physical realism requires going beyond evaluation to address the core problem of severely scarce high-quality training data. Building on these insights, we propose a refined four-stage data pipeline and use it to construct RoVid-X, currently the largest open-source dataset for robot video generation, containing 4 million annotated video clips covering thousands of tasks with comprehensive physical-property annotations. This synergistic evaluation-and-data ecosystem lays a solid foundation for rigorous evaluation and large-scale training of video models, accelerating the evolution of embodied AI toward general intelligence.
Writing an effective rebuttal is a demanding task that requires far more than linguistic fluency; it demands a precise grasp of how reviewers' intents map onto the details of the paper. Existing solutions typically treat it as an end-to-end text generation problem and suffer from hallucinated content, omitted criticisms, and a lack of verifiable grounding. To overcome these limitations, we propose RebuttalAgent, the first multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. The system decomposes complex reviews into atomic concerns, builds a dynamic hybrid context that fuses compressed summaries with high-fidelity source text, and integrates an autonomous, on-demand external retrieval module to resolve criticisms that require outside literature. By generating a reviewable response plan before drafting, RebuttalAgent ensures that every argument is explicitly anchored to internal or external evidence. We validate the approach on our proposed RebuttalBench benchmark, showing that the pipeline outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable assistant for the peer-review process. The code will be open-sourced.
Reinforcement learning (RL) is central to model post-training, especially for agentic models that require specialized reasoning behaviors. In this setting, model merging offers a practical mechanism for consolidating multiple RL-trained agents from different tasks into a single general-purpose model. However, existing merging methods are designed for supervised fine-tuning (SFT) and fall short at preserving the task-specific capabilities of RL-trained agentic models. The root cause is a task-vector mismatch between RL and SFT: on-policy RL produces task vectors that are highly sparse and heterogeneous, whereas SFT-style merging implicitly assumes dense, globally comparable task vectors. When standard global averaging is applied under this mismatch, the non-overlapping task vectors that encode critical task-specific behaviors in RL are attenuated and the parameter updates are diluted. To address this, we propose Reinforced Agent Merging (RAM), a distribution-aware merging framework designed for RL-trained agentic models. RAM decouples shared parameter updates from task-specific unique updates, averaging the shared components while selectively preserving and rescaling the unique components to counteract update dilution. Experiments across multiple agentic domains and model architectures show that RAM not only surpasses existing merging baselines but also unlocks synergy across agents, outperforming the individual domain-specialized agents.
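To make the shared-versus-unique decomposition concrete, here is a minimal PyTorch sketch of how such a distribution-aware merge could look. The sparsity threshold `eps`, the rescaling factor `alpha`, and the function name are illustrative assumptions for exposition, not RAM's actual implementation.

```python
import torch

def merge_rl_agents(base, finetuned, eps=1e-6, alpha=1.0):
    """Illustrative distribution-aware merge of RL-trained agents.

    base:      dict of parameter tensors for the shared base model
    finetuned: list of dicts with the same keys (one per RL-trained agent)
    eps:       threshold below which an update is treated as zero (RL deltas are sparse)
    alpha:     rescaling factor applied to task-unique updates
    """
    merged = {}
    for name, theta0 in base.items():
        # Task vectors: per-agent parameter deltas relative to the base model.
        taus = torch.stack([ft[name] - theta0 for ft in finetuned])   # [K, ...]
        active = taus.abs() > eps                                      # which agents touch each entry
        n_active = active.sum(dim=0)

        masked = taus * active
        # Shared entries (touched by >= 2 agents): average the active deltas.
        shared = masked.sum(dim=0) / n_active.clamp(min=1)
        # Unique entries (touched by exactly 1 agent): keep and rescale against dilution.
        unique = alpha * masked.sum(dim=0)

        delta = torch.where(n_active >= 2, shared,
                torch.where(n_active == 1, unique, torch.zeros_like(theta0)))
        merged[name] = theta0 + delta
    return merged
```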
GutenOCR is a family of end-to-end OCR front-ends fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B. These single-checkpoint vision-language models expose text recognition, detection, and grounding through a unified prompt-based interface. Trained on business documents, scientific literature, and synthetic grounding data, the models support full-page and localized reading, output line- and paragraph-level bounding boxes, and answer conditional "Where is X?" queries. We propose a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded-OCR score of its Qwen2.5-VL-7B backbone (0.40 → 0.82) on 10,500 held-out business and scientific documents. On the Fox and OmniDocBench v1.5 benchmarks, the approach substantially improves region- and line-level OCR and text-detection recall, with trade-offs in page-level linearization, color-guided OCR, and formula-dense layouts.
Chain-of-thought prompting has achieved remarkable success in unlocking the reasoning abilities of large language models. Although it improves reasoning performance, its verbosity imposes substantial computational overhead. Existing work often focuses only on outcome alignment and provides no supervision over the intermediate reasoning process, which obscures the analyzability of the latent reasoning chain. To address these challenges, we propose the Rendering-of-Thought framework, the first approach that materializes the reasoning chain by rendering textual reasoning steps as images, making the latent logic explicit and traceable. Specifically, we use the visual encoder of existing vision-language models as a semantic anchor to align visual embeddings with the text space. This design enables a plug-and-play implementation without additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks show that, compared with explicit chain-of-thought methods, our approach achieves 3-4x token compression and significant inference speedups while remaining competitive with other methods, validating the feasibility of this paradigm. The code is available at https://github.com/TencentBAC/RoT.
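As a rough illustration of the rendering step, the sketch below uses PIL to turn one textual reasoning step into an image tile that a VLM's frozen vision encoder could then consume. The tile size, the naive line wrapping, and the commented `vision_encoder` call are assumptions for illustration, not the released RoT code.

```python
from PIL import Image, ImageDraw

def render_reasoning_step(step_text, size=(448, 448), font=None):
    """Render one textual reasoning step onto a blank image tile.

    The tile can then be fed to a VLM's vision encoder so a handful of visual
    embeddings stand in for the much longer textual chain-of-thought tokens.
    """
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    # Naive character-count wrapping; a real pipeline would control font size and layout.
    words, lines, line = step_text.split(), [], ""
    for w in words:
        if len(line) + len(w) + 1 > 40:
            lines.append(line)
            line = w
        else:
            line = (line + " " + w).strip()
    lines.append(line)
    draw.multiline_text((10, 10), "\n".join(lines), fill="black", font=font)
    return img

# tiles = [render_reasoning_step(s) for s in reasoning_steps]
# visual_embeddings = vision_encoder(tiles)   # e.g. the VLM's frozen image encoder
```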
Document extraction is a core component of digitization workflows, yet existing vision-language models are heavily biased toward high-resource languages. Thai poses additional challenges due to the complexity of its non-Latin script, the absence of explicit word boundaries, and the highly unstructured nature of real-world documents, limiting the effectiveness of current open-source models. This paper presents Typhoon OCR, an open-source document-extraction vision-language model tailored to Thai and English. The model is fine-tuned from a vision-language backbone on a Thai-focused training dataset, built with a multi-stage pipeline that combines traditional OCR, VLM-based reconstruction, and carefully designed synthetic data. Typhoon OCR is a unified framework for text transcription, layout reconstruction, and document-level structural consistency. The latest release, Typhoon OCR V1.5, is a compact and efficient inference model designed to reduce reliance on metadata and simplify deployment. Comprehensive evaluations across diverse Thai documents, including financial reports, government forms, books, infographics, and handwritten documents, show that Typhoon OCR matches or surpasses large frontier proprietary models while substantially reducing computational cost. The results demonstrate that open-source vision-language OCR models can deliver accurate text extraction and layout reconstruction for Thai documents, rivaling proprietary systems while remaining lightweight and easy to deploy.
Large encoder-decoder models such as Whisper deliver strong offline transcription but remain impractical for streaming applications due to high latency. Although pre-trained models are readily available, Thai automatic speech recognition is still dominated by these offline architectures, leaving a critical gap for efficient streaming solutions. We introduce Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model designed for low-latency Thai speech recognition. We show that rigorous text normalization can stand in for scaling up the model: compared with Whisper Large-v3, our compact model achieves comparable accuracy at a 45x reduction in computational cost. Our normalization pipeline resolves systematic ambiguities in Thai transcription, including context-dependent verbalization of numbers and the repetition marker mai yamok, yielding a unified training target. We also propose a two-stage curriculum for adapting to the Isan dialect of northeastern Thailand that preserves performance on Central Thai. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a human-annotated gold dataset that follows standard Thai linguistic conventions, providing the research community with a standardized evaluation protocol.
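To illustrate the kind of ambiguity such normalization resolves, the sketch below expands the mai yamok repetition marker (ๆ) on space-delimited tokens. This is a simplified assumption about the pipeline: because Thai lacks explicit word boundaries, a full implementation would also need word segmentation and number verbalization.

```python
import re

# Thai repetition marker "mai yamok" (U+0E46) signals that the preceding word is repeated,
# e.g. "เด็ก ๆ" is read as "เด็ก เด็ก". Expanding it gives the ASR model a single,
# unambiguous training target instead of two equivalent written forms.
MAI_YAMOK = "\u0e46"
PATTERN = re.compile(r"(\S+)\s*" + MAI_YAMOK)

def expand_mai_yamok(text: str) -> str:
    # Duplicate the preceding space-delimited token; a real system would segment
    # Thai words first so only the repeated word (not a whole phrase) is duplicated.
    return PATTERN.sub(lambda m: f"{m.group(1)} {m.group(1)}", text)

print(expand_mai_yamok("เด็ก ๆ"))   # -> "เด็ก เด็ก"
```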
Agentic systems have recently become the dominant paradigm for formal theorem proving, achieving strong performance by orchestrating multiple models and tools. However, existing approaches often rely on task-specific pipelines and trained formal provers, limiting flexibility and reproducibility. This paper proposes a new paradigm: using a general-purpose coding agent directly as a formal mathematical reasoner. The advantages are threefold: (1) general coding agents provide a natural interface for diverse reasoning tasks beyond proving; (2) performance improves simply by swapping the underlying foundation model, with no training required; and (3) the MCP framework supports flexible extension and autonomous invocation of specialized tools, avoiding intricate designs. Building on this paradigm, we present Numina-Lean-Agent, which combines Claude Code with Numina-Lean-MCP to interact autonomously with Lean, retrieve relevant theorems, produce informal proofs, and invoke auxiliary reasoning tools. With Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all 12 problems of Putnam 2025 (12/12), matching the best closed-source systems. Beyond benchmark evaluation, we further validate its generality by assisting mathematicians in successfully formalizing the Brascamp-Lieb theorem. Numina-Lean-Agent and all solution code are released at https://github.com/project-numina/numina-lean-agent.
Financial agents powered by large language models (LLMs) are increasingly deployed for investment analysis, risk assessment, and automated decision-making, where their abilities to plan, invoke tools, and manipulate mutable state introduce new security risks in high-stakes and highly regulated financial environments. However, existing safety evaluations largely focus on language-model-level content compliance or abstract agent settings, failing to capture execution-grounded risks arising from real operational workflows and state-changing actions. To bridge this gap, we propose FinVault, the first execution-grounded security benchmark for financial agents, comprising 31 regulatory case-driven sandbox scenarios with state-writable databases and explicit compliance constraints, together with 107 real-world vulnerabilities and 963 test cases that systematically cover prompt injection, jailbreaking, financially adapted attacks, as well as benign inputs for false-positive evaluation. Experimental results reveal that existing defense mechanisms remain ineffective in realistic financial agent settings, with average attack success rates (ASR) still reaching up to 50.0% on state-of-the-art models and remaining non-negligible even for the most robust systems (ASR 6.7%), highlighting the limited transferability of current safety designs and the need for stronger financial-specific defenses. Our code can be found at https://github.com/aifinlab/FinVault.
Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma and https://huggingface.co/FlashLabs/Chroma-4B .
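A minimal sketch of what a 1:2 interleaved text-audio schedule could look like is shown below; the token placeholders and the scheduling function are illustrative assumptions rather than Chroma's implementation. The point of interleaving is that audio codec tokens can be streamed to the vocoder while the textual response is still being generated.

```python
def interleave_tokens(text_tokens, audio_tokens, ratio=(1, 2)):
    """Interleave text and audio tokens in a fixed ratio (default 1 text : 2 audio).

    The decoder emits one text token, then two audio tokens, and so on, so playback
    can start well before the full turn is finished.
    """
    t_per, a_per = ratio
    out, ti, ai = [], 0, 0
    while ti < len(text_tokens) or ai < len(audio_tokens):
        out.extend(text_tokens[ti:ti + t_per]); ti += t_per
        out.extend(audio_tokens[ai:ai + a_per]); ai += a_per
    return out

print(interleave_tokens(["T0", "T1"], ["A0", "A1", "A2", "A3"]))
# -> ['T0', 'A0', 'A1', 'T1', 'A2', 'A3']
```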
Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: https://01yzzyu.github.io/xr.github.io/.
We introduce RoboBrain 2.5, a next-generation embodied AI foundation model that advances general perception, spatial reasoning, and temporal modeling through extensive training on high-quality spatiotemporal supervision. Building upon its predecessor, RoboBrain 2.5 introduces two major capability upgrades. Specifically, it unlocks Precise 3D Spatial Reasoning by shifting from 2D pixel-relative grounding to depth-aware coordinate prediction and absolute metric constraint comprehension, generating complete 3D manipulation traces as ordered keypoint sequences under physical constraints. Complementing this spatial precision, the model establishes Dense Temporal Value Estimation that provides dense, step-aware progress prediction and execution state understanding across varying viewpoints, producing stable feedback signals for downstream learning. Together, these upgrades extend the framework toward more physically grounded and execution-aware embodied intelligence for complex, fine-grained manipulation. The code and checkpoints are available on the project website: https://superrobobrain.github.io
We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a "silent failure" because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.
Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
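Because the paper's exact formula is not reproduced here, the sketch below gives one plausible formalization of the phoneme shift rate: the fraction of rule-targeted phoneme positions whose realized phonemes follow the rule. The assumption of pre-aligned phoneme sequences, the function signature, and the example strings are illustrative, not the authors' definition.

```python
def phoneme_shift_rate(original, rule_applied, realized):
    """A plausible PSR formalization (the paper's exact definition may differ).

    For each aligned position where a phonological rule changes the phoneme
    (original != rule_applied), count whether the realized phoneme follows the rule.
    PSR near 1 suggests the rules survive synthesis; PSR near 0 suggests the
    speaker embedding overrides them.
    """
    targeted = [(r, z) for o, r, z in zip(original, rule_applied, realized) if o != r]
    if not targeted:
        return float("nan")
    followed = sum(1 for r, z in targeted if z == r)
    return followed / len(targeted)

# Illustrative non-rhoticity example ("" marks a dropped /r/):
print(phoneme_shift_rate(["k", "ɑː", "r"], ["k", "ɑː", ""], ["k", "ɑː", ""]))  # -> 1.0
```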
Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at https://github.com/tiktok/huvr.
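As a rough illustration of the hyper-network-for-INR idea, the sketch below maps an image embedding to the weights of a tiny coordinate MLP that reconstructs pixel colors. The layer sizes, SIREN-style sine activation, and class name are assumptions for exposition, not the paper's architecture; the encoder producing the embedding is left abstract.

```python
import torch
import torch.nn as nn

class INRHyperNetwork(nn.Module):
    """Minimal sketch: an image embedding is mapped to the weights of a small
    implicit neural representation (coordinate MLP) for fast reconstruction."""

    def __init__(self, embed_dim=256, hidden=64):
        super().__init__()
        self.hidden = hidden
        # Heads emitting the INR's parameters: (x, y) coords -> hidden -> RGB.
        self.w1 = nn.Linear(embed_dim, 2 * hidden)
        self.b1 = nn.Linear(embed_dim, hidden)
        self.w2 = nn.Linear(embed_dim, hidden * 3)
        self.b2 = nn.Linear(embed_dim, 3)

    def forward(self, z, coords):
        # z: [B, embed_dim] image embedding; coords: [B, N, 2] pixel coordinates in [-1, 1].
        B, N, _ = coords.shape
        W1 = self.w1(z).view(B, 2, self.hidden)
        b1 = self.b1(z).view(B, 1, self.hidden)
        W2 = self.w2(z).view(B, self.hidden, 3)
        b2 = self.b2(z).view(B, 1, 3)
        h = torch.sin(coords @ W1 + b1)      # SIREN-style activation
        return torch.sigmoid(h @ W2 + b2)    # [B, N, 3] reconstructed colors

# z = encoder(images)                  # any image encoder producing [B, 256] embeddings
# rgb = INRHyperNetwork()(z, coords)   # train with pixel-wise / perceptual / adversarial losses
```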
Large Language Models have demonstrated profound utility in the medical domain. However, their application to autonomous Electronic Health Records~(EHRs) navigation remains constrained by a reliance on curated inputs and simplified retrieval tasks. To bridge the gap between idealized experimental settings and realistic clinical environments, we present AgentEHR. This benchmark challenges agents to execute complex decision-making tasks, such as diagnosis and treatment planning, requiring long-range interactive reasoning directly within raw and high-noise databases. In tackling these tasks, we identify that existing summarization methods inevitably suffer from critical information loss and fractured reasoning continuity. To address this, we propose RetroSum, a novel framework that unifies a retrospective summarization mechanism with an evolving experience strategy. By dynamically re-evaluating interaction history, the retrospective mechanism prevents long-context information loss and ensures unbroken logical coherence. Additionally, the evolving strategy bridges the domain gap by retrieving accumulated experience from a memory bank. Extensive empirical evaluations demonstrate that RetroSum achieves performance gains of up to 29.16% over competitive baselines, while significantly decreasing total interaction errors by up to 92.3%.
Large language models exhibit surprising sensitivity to the structure of the prompt, but the mechanisms underlying this sensitivity remain poorly understood. In this work, we conduct an in-depth investigation of a striking case: in multiple-choice question answering, placing context before the questions and options (CQO) outperforms the reverse order (QOC) by over 14 percentage points, consistently across a wide range of models and datasets. Through systematic architectural analysis, we identify causal attention as the core mechanism: in QOC prompts, the causal mask prevents option tokens from attending to context, creating an information bottleneck where context becomes invisible to options.
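The causal-mask argument can be checked directly with a toy mask, as in the sketch below; the segment sizes and helper name are illustrative. Under a lower-triangular mask, option tokens can only attend to tokens that precede them, so in QOC order the context segment is unreachable from the options.

```python
import numpy as np

def context_visible_to_options(segment_order):
    """Check, under a causal mask, whether option tokens can attend to context tokens.

    segment_order: segment names in prompt order, e.g. ["C", "Q", "O"] for CQO.
    Each segment gets three dummy tokens; mask entry (i, j) is True iff token i
    may attend to token j (j <= i).
    """
    tokens = [seg for seg in segment_order for _ in range(3)]
    n = len(tokens)
    causal = np.tril(np.ones((n, n), dtype=bool))          # lower-triangular causal mask
    opt_rows = [i for i, s in enumerate(tokens) if s == "O"]
    ctx_cols = [j for j, s in enumerate(tokens) if s == "C"]
    return causal[np.ix_(opt_rows, ctx_cols)].any()

print(context_visible_to_options(["C", "Q", "O"]))   # True: options can read the context
print(context_visible_to_options(["Q", "O", "C"]))   # False: context is invisible to options
```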
This work advances autonomous robot exploration by integrating agent-level semantic reasoning with fast local control. We introduce FARE, a hierarchical autonomous exploration framework that integrates a large language model (LLM) for global reasoning with a reinforcement learning (RL) policy for local decision making. FARE follows a fast-slow thinking paradigm. The slow-thinking LLM module interprets a concise textual description of the unknown environment and synthesizes an agent-level exploration strategy, which is then grounded into a sequence of global waypoints through a topological graph. To further improve reasoning efficiency, this module employs a modularity-based pruning mechanism that reduces redundant graph structures. The fast-thinking RL module executes exploration by reacting to local observations while being guided by the LLM-generated global waypoints. The RL policy is additionally shaped by a reward term that encourages adherence to the global waypoints, enabling coherent and robust closed-loop behavior. This architecture decouples semantic reasoning from geometric decision making, allowing each module to operate at its appropriate temporal and spatial scale. In challenging simulated environments, our results show that FARE achieves substantial improvements in exploration efficiency over state-of-the-art baselines. We further deploy FARE on hardware and validate it in a complex, large-scale 200 m × 130 m building environment.
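One way such a waypoint-adherence term could be shaped is sketched below; the coefficient, the distance-based penalty, and the function name are illustrative assumptions, not FARE's actual reward design.

```python
import numpy as np

def shaped_reward(exploration_reward, agent_pos, waypoint, lam=0.1):
    """Illustrative reward shaping: the RL policy keeps its exploration reward but is
    additionally pulled toward the current LLM-generated global waypoint."""
    adherence = -np.linalg.norm(np.asarray(agent_pos) - np.asarray(waypoint))
    return exploration_reward + lam * adherence

print(shaped_reward(1.0, agent_pos=[2.0, 3.0], waypoint=[5.0, 7.0]))  # 1.0 - 0.1 * 5.0 = 0.5
```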
Modern CI/CD pipelines integrating agent-generated code exhibit a structural failure in responsibility attribution. Decisions are executed through formally correct approval processes, yet no entity possesses both the authority to approve those decisions and the epistemic capacity to meaningfully understand their basis. We define this condition as responsibility vacuum: a state in which decisions occur, but responsibility cannot be attributed because authority and verification capacity do not coincide. We show that this is not a process deviation or technical defect, but a structural property of deployments where decision generation throughput exceeds bounded human verification capacity. We identify a scaling limit under standard deployment assumptions, including parallel agent generation, CI-based validation, and individualized human approval gates. Beyond a throughput threshold, verification ceases to function as a decision criterion and is replaced by ritualized approval based on proxy signals. Personalized responsibility becomes structurally unattainable in this regime. We further characterize a CI amplification dynamic, whereby increasing automated validation coverage raises proxy signal density without restoring human capacity. Under fixed time and attention constraints, this accelerates cognitive offloading in the broad sense and widens the gap between formal approval and epistemic understanding. Additional automation therefore amplifies, rather than mitigates, the responsibility vacuum. We conclude that unless organizations explicitly redesign decision boundaries or reassign responsibility away from individual decisions toward batch- or system-level ownership, responsibility vacuum remains an invisible but persistent failure mode in scaled agent deployments.
The Korteweg-de Vries (KdV) equation serves as a foundational model in nonlinear wave physics, describing the balance between dispersive spreading and nonlinear steepening that gives rise to solitons. This article introduces sangkuriang, an open-source Python library for solving this equation using Fourier pseudo-spectral spatial discretization coupled with adaptive high-order time integration. The implementation leverages just-in-time (JIT) compilation for computational efficiency while maintaining accessibility for instructional purposes. Validation encompasses progressively complex scenarios including isolated soliton propagation, symmetric two-wave configurations, overtaking collisions between waves of differing amplitudes, and three-body interactions. Conservation of the classical invariants is monitored throughout, with deviations remaining small across all test cases. Measured soliton velocities conform closely to theoretical predictions based on the amplitude-velocity relationship characteristic of integrable systems. Complementary diagnostics drawn from information theory and recurrence analysis confirm that computed solutions preserve the regular phase-space structure expected for completely integrable dynamics. The solver outputs data in standard scientific formats compatible with common analysis tools and generates visualizations of spatiotemporal wave evolution. By combining numerical accuracy with practical accessibility on modest computational resources, sangkuriang offers a platform suitable for both classroom demonstrations of nonlinear wave phenomena and exploratory research into soliton dynamics.
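For readers who want a feel for the numerical scheme, the sketch below implements a bare-bones Fourier pseudo-spectral KdV right-hand side with adaptive high-order time stepping via SciPy's DOP853 integrator. It is not sangkuriang itself, which adds JIT compilation, invariant monitoring, diagnostics, and file output; the grid size, domain length, and tolerances here are illustrative choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

# KdV in the standard form u_t + 6 u u_x + u_xxx = 0 on a periodic domain.
N, L = 256, 40.0
x = np.linspace(0.0, L, N, endpoint=False)
k = 2.0 * np.pi * np.fft.fftfreq(N, d=L / N)     # angular wavenumbers

def kdv_rhs(t, u):
    u_hat = np.fft.fft(u)
    u_x = np.real(np.fft.ifft(1j * k * u_hat))            # spectral first derivative
    u_xxx = np.real(np.fft.ifft((1j * k) ** 3 * u_hat))   # spectral third derivative
    return -6.0 * u * u_x - u_xxx

# Single-soliton initial condition: u = (c/2) sech^2(sqrt(c)/2 (x - x0)), speed c.
c, x0 = 2.0, 10.0
u0 = 0.5 * c / np.cosh(0.5 * np.sqrt(c) * (x - x0)) ** 2

sol = solve_ivp(kdv_rhs, (0.0, 2.0), u0, method="DOP853", rtol=1e-8, atol=1e-10)
print(sol.y[:, -1].max())   # peak amplitude should remain close to c/2 = 1.0
```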
Web AI agents such as ChatGPT Agent and GenSpark are increasingly used for routine web-based tasks, yet they still rely on text-based input prompts, lack proactive detection of user intent, and offer no support for interactive data analysis and decision making. We present WebSeek, a mixed-initiative browser extension that enables users to discover and extract information from webpages and then flexibly build, transform, and refine tangible data artifacts, such as tables, lists, and visualizations, all within an interactive canvas. Within this environment, users can perform analysis, including data transformations such as joining tables or creating visualizations, while an in-built AI both proactively offers context-aware guidance and automation, and reactively responds to explicit user requests. An exploratory user study (N=15) with WebSeek as a probe reveals participants' diverse analysis strategies, underscoring their desire for transparency and control during human-AI collaboration.
Although much research has focused on AI explanations to support decisions in complex information-seeking tasks such as fact-checking, the role of evidence is surprisingly under-researched. In our study, we systematically varied explanation type, AI prediction certainty, and correctness of AI system advice for non-expert participants, who evaluated the veracity of claims and AI system predictions. Participants were provided the option of easily inspecting the underlying evidence. We found that participants consistently relied on evidence to validate AI claims across all experimental conditions. When participants were presented with natural language explanations, evidence was used less frequently although they relied on it when these explanations seemed insufficient or flawed. Qualitative data suggests that participants attempted to infer evidence source reliability, despite source identities being deliberately omitted. Our results demonstrate that evidence is a key ingredient in how people evaluate the reliability of information presented by an AI system and, in combination with natural language explanations, offers valuable support for decision-making. Further research is urgently needed to understand how evidence ought to be presented and how people engage with it in practice.
We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. Project page is available at https://motion3-to-4.github.io/.
While large language models (LLMs) have been shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning applications, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset of open-ended reasoning queries, each with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization to jointly improve logical correctness and language stability. Across thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs. The code and dataset are available at https://cure-med.github.io/
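As a reminder of what the Group Relative Policy Optimization component computes, the sketch below shows the group-relative advantage calculation on dummy rewards: each sampled completion's reward is standardized against the other completions for the same prompt, removing the need for a learned value function. The reward values and any combination with language-consistency signals are illustrative assumptions, not CURE-MED's exact reward.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each completion's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Rewards for a group of completions sampled from one medical query
# (e.g., scoring both answer correctness and staying in the query's language):
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```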