CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction
August 5, 2025
Authors: Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek, Jaewoo Kang
cs.AI
Abstract
Drug toxicity remains a major challenge in pharmaceutical development. Recent
machine learning models have improved in silico toxicity prediction, but their
reliance on annotated data and lack of interpretability limit both their
applicability and their ability to capture organ-specific toxicities driven by
complex biological mechanisms. Large language models (LLMs) offer a
promising alternative through step-by-step reasoning and integration of textual
data, yet prior approaches lack biological context and transparent rationale.
To address this issue, we propose CoTox, a novel framework that integrates an
LLM with chain-of-thought (CoT) reasoning for multi-toxicity prediction. CoTox
combines chemical structure data, biological pathways, and gene ontology (GO)
terms to generate interpretable toxicity predictions through step-by-step
reasoning. Using GPT-4o, we show that CoTox outperforms both traditional
machine learning and deep learning models. We further examine its performance
across various LLMs to identify where CoTox is most effective. Additionally, we
find that representing chemical structures with IUPAC names, which are easier
for LLMs to understand than SMILES, enhances the model's reasoning ability and
improves predictive performance. To demonstrate its practical utility in drug
development, we simulate the treatment of relevant cell types with a drug and
incorporate the resulting biological context into the CoTox framework. This
approach allows CoTox to generate toxicity predictions aligned with
physiological responses, as shown in a case study. This result highlights the
potential of LLM-based frameworks to improve interpretability and support
early-stage drug safety assessment. The code and prompt used in this work are
available at https://github.com/dmis-lab/CoTox.
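
The abstract describes combining a chemical structure (as an IUPAC name), biological pathways, and GO terms into a step-by-step reasoning prompt. A minimal sketch of what such prompt assembly might look like follows. Everything in it is an assumption made for illustration: the `CompoundContext` type, the `build_cot_prompt` function, the prompt wording, the list of toxicity types, and the example pathway and GO entries are hypothetical placeholders, not the authors' actual prompt or data (those are in the linked repository). The sketch only builds the text; sending it to an LLM is left out.

```python
from dataclasses import dataclass, field

# Hypothetical toxicity endpoints, chosen only for illustration.
TOXICITY_TYPES = ["cardiotoxicity", "hepatotoxicity", "nephrotoxicity"]


@dataclass
class CompoundContext:
    """Inputs the abstract names: structure, pathways, and GO terms."""
    iupac_name: str
    pathways: list[str] = field(default_factory=list)
    go_terms: list[str] = field(default_factory=list)


def _bullets(items: list[str]) -> str:
    """Render items as a bullet list, or a placeholder when empty."""
    if not items:
        return "- (none provided)"
    return "\n".join(f"- {item}" for item in items)


def build_cot_prompt(ctx: CompoundContext,
                     toxicity_types: list[str] = TOXICITY_TYPES) -> str:
    """Assemble a chain-of-thought style prompt (assumed wording)."""
    sections = [
        "You are assessing the toxicity of a compound.",
        f"Compound (IUPAC name): {ctx.iupac_name}",
        "Perturbed biological pathways:",
        _bullets(ctx.pathways),
        "Enriched Gene Ontology (GO) terms:",
        _bullets(ctx.go_terms),
        "Reason step by step:",
        "1. Describe the relevant structural features of the compound.",
        "2. Relate the pathways and GO terms to possible organ-level effects.",
        "3. For each toxicity type below, answer 'toxic' or 'non-toxic' "
        "with a short rationale.",
        _bullets(toxicity_types),
    ]
    return "\n".join(sections)


if __name__ == "__main__":
    # Placeholder inputs for illustration; not results from the paper.
    example = CompoundContext(
        iupac_name="N-(4-hydroxyphenyl)acetamide",
        pathways=["Glutathione metabolism"],
        go_terms=["response to oxidative stress"],
    )
    print(build_cot_prompt(example))
```

Keeping the biological context as separate bullet lists means a missing pathway or GO section is shown explicitly as "(none provided)" rather than silently dropped, which makes it easier to compare model outputs with and without that context.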