CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction
August 5, 2025
Authors: Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek, Jaewoo Kang
cs.AI
Abstract
Drug toxicity remains a major challenge in pharmaceutical development. Recent
machine learning models have improved in silico toxicity prediction, but their
reliance on annotated data and lack of interpretability limit their
applicability and their ability to capture organ-specific toxicities
driven by complex biological mechanisms. Large language models (LLMs) offer a
promising alternative through step-by-step reasoning and integration of textual
data, yet prior approaches lack biological context and transparent rationale.
To address this issue, we propose CoTox, a novel framework that integrates an LLM
with chain-of-thought (CoT) reasoning for multi-toxicity prediction. CoTox
combines chemical structure data, biological pathways, and gene ontology (GO)
terms to generate interpretable toxicity predictions through step-by-step
reasoning. Using GPT-4o, we show that CoTox outperforms both traditional
machine learning and deep learning models. We further examine its performance
across various LLMs to identify where CoTox is most effective. Additionally, we
find that representing chemical structures with IUPAC names, which are easier
for LLMs to understand than SMILES, enhances the model's reasoning ability and
improves predictive performance. To demonstrate its practical utility in drug
development, we simulate the treatment of relevant cell types with a drug and
incorporate the resulting biological context into the CoTox framework. This
approach allows CoTox to generate toxicity predictions aligned with
physiological responses, as shown in a case study. This result highlights the
potential of LLM-based frameworks to improve interpretability and support
early-stage drug safety assessment. The code and prompt used in this work are
available at https://github.com/dmis-lab/CoTox.
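
The abstract describes combining an IUPAC name, biological pathways, and GO terms into a step-by-step reasoning prompt. A minimal sketch of how such a prompt might be assembled is shown below. It is not the authors' prompt (see the repository above for that): the function name `build_cot_prompt`, the prompt wording, and the example inputs are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def build_cot_prompt(
    iupac_name: str,
    pathways: Sequence[str],
    go_terms: Sequence[str],
    toxicity_types: Sequence[str],
) -> str:
    """Assemble a step-by-step toxicity-reasoning prompt (illustrative only)."""

    def bullets(items: Sequence[str]) -> str:
        # Render a list as bullet lines, or a placeholder when it is empty.
        return "\n".join(f"- {item}" for item in items) or "- (none provided)"

    # Hypothetical wording; the real CoTox prompts live in the project repo.
    return "\n".join([
        "You are assessing the toxicity of a compound.",
        f"Compound (IUPAC name): {iupac_name}",
        "Affected biological pathways:",
        bullets(pathways),
        "Associated Gene Ontology (GO) terms:",
        bullets(go_terms),
        "Reason step by step:",
        "1. Describe structural features relevant to toxicity.",
        "2. Relate the pathways and GO terms to possible toxic mechanisms.",
        "3. For each toxicity type below, answer Toxic or Non-toxic "
        "with a short rationale.",
        bullets(toxicity_types),
    ])


# Example inputs are illustrative assumptions, not data from the paper.
prompt = build_cot_prompt(
    iupac_name="N-(4-hydroxyphenyl)acetamide",  # acetaminophen
    pathways=["Glutathione metabolism"],
    go_terms=["response to oxidative stress"],
    toxicity_types=["hepatotoxicity", "nephrotoxicity"],
)
print(prompt)
```

The resulting string could be sent to any chat-style LLM. Passing the IUPAC name rather than a SMILES string follows the abstract's finding that IUPAC names are easier for LLMs to interpret.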