CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction

Abstract

Drug toxicity remains a major challenge in pharmaceutical development. Recent machine learning models have improved in silico toxicity prediction, but their reliance on annotated data and lack of interpretability limit their applicability and their ability to capture organ-specific toxicities driven by complex biological mechanisms. Large language models (LLMs) offer a promising alternative through step-by-step reasoning and integration of textual data, yet prior approaches lack biological context and a transparent rationale. To address this issue, we propose CoTox, a novel framework that integrates an LLM with chain-of-thought (CoT) reasoning for multi-toxicity prediction. CoTox combines chemical structure data, biological pathways, and gene ontology (GO) terms to generate interpretable toxicity predictions through step-by-step reasoning. Using GPT-4o, we show that CoTox outperforms both traditional machine learning and deep learning models. We further examine its performance across various LLMs to identify where CoTox is most effective. Additionally, we find that representing chemical structures with IUPAC names, which are easier for LLMs to understand than SMILES, enhances the model's reasoning ability and improves predictive performance. To demonstrate its practical utility in drug development, we simulate the treatment of relevant cell types with a drug and incorporate the resulting biological context into the CoTox framework. This approach allows CoTox to generate toxicity predictions aligned with physiological responses, as shown in a case study. This result highlights the potential of LLM-based frameworks to improve interpretability and support early-stage drug safety assessment. The code and prompts used in this work are available at https://github.com/dmis-lab/CoTox.
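As a rough illustration of the idea described above, the sketch below assembles a step-by-step prompt from an IUPAC name, a list of pathways, and a list of GO terms. It is not the released CoTox prompt: the class `CompoundContext`, the function `build_cot_prompt`, the prompt wording, and the three reasoning steps are all hypothetical names and choices made for this example. The actual prompts are in the repository linked above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CompoundContext:
    """Inputs named in the abstract; the field names are hypothetical."""
    iupac_name: str
    pathways: List[str] = field(default_factory=list)
    go_terms: List[str] = field(default_factory=list)


def build_cot_prompt(ctx: CompoundContext, toxicities: List[str]) -> str:
    """Assemble a chain-of-thought style prompt (illustrative wording only)."""
    # Fall back to an explicit marker when no biological context is supplied.
    pathways = "; ".join(ctx.pathways) or "none provided"
    go_terms = "; ".join(ctx.go_terms) or "none provided"
    # Assumed reasoning steps: structure first, then biology, then labels.
    steps = [
        "1. Describe structural features of the compound relevant to toxicity.",
        "2. Relate the listed pathways and GO terms to possible organ-level effects.",
        "3. For each toxicity type, give a rationale and then a final label.",
    ]
    return "\n".join([
        f"Compound (IUPAC name): {ctx.iupac_name}",
        f"Biological pathways: {pathways}",
        f"GO terms: {go_terms}",
        f"Toxicity types to assess: {', '.join(toxicities)}",
        "Reason step by step:",
        *steps,
    ])


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the paper's data.
    example = CompoundContext(
        iupac_name="N-(4-hydroxyphenyl)acetamide",
        pathways=["glutathione metabolism"],
        go_terms=["response to oxidative stress"],
    )
    print(build_cot_prompt(example, ["hepatotoxicity"]))
```

The resulting string would be sent to an LLM of choice. Keeping the compound representation, the biological context, and the reasoning steps as separate lines makes it straightforward to swap IUPAC names for SMILES, or to drop the context, when comparing configurations.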
