CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction

Abstract

Drug toxicity remains a major challenge in pharmaceutical development. Recent machine learning models have improved in silico toxicity prediction, but their reliance on annotated data and their lack of interpretability limit their applicability. These constraints also limit their ability to capture organ-specific toxicities driven by complex biological mechanisms. Large language models (LLMs) offer a promising alternative through step-by-step reasoning and integration of textual data, yet prior approaches lack biological context and a transparent rationale. To address this issue, we propose CoTox, a novel framework that integrates LLMs with chain-of-thought (CoT) reasoning for multi-toxicity prediction. CoTox combines chemical structure data, biological pathways, and gene ontology (GO) terms to generate interpretable toxicity predictions through step-by-step reasoning. Using GPT-4o, we show that CoTox outperforms both traditional machine learning and deep learning models. We further examine its performance across various LLMs to identify where CoTox is most effective. Additionally, we find that representing chemical structures with IUPAC names, which are easier for LLMs to understand than SMILES, enhances the model's reasoning ability and improves predictive performance. To demonstrate its practical utility in drug development, we simulate the treatment of relevant cell types with drugs and incorporate the resulting biological context into the CoTox framework. This approach allows CoTox to generate toxicity predictions aligned with physiological responses, as shown in a case study. These results highlight the potential of LLM-based frameworks to improve interpretability and support early-stage drug safety assessment. The code and prompts used in this work are available at https://github.com/dmis-lab/CoTox.
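As a rough illustration of how the three inputs named in the abstract (an IUPAC name, biological pathways, and GO terms) might be combined into a single step-by-step prompt, the following is a minimal sketch. It is not the prompt used in this work, which is available in the linked repository. The class `CompoundContext`, the function `build_cot_prompt`, the wording of the reasoning steps, and the example inputs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CompoundContext:
    """Inputs named in the abstract; the field names are hypothetical."""
    iupac_name: str
    pathways: List[str] = field(default_factory=list)
    go_terms: List[str] = field(default_factory=list)


def build_cot_prompt(ctx: CompoundContext, toxicities: List[str]) -> str:
    """Assemble a plain-text chain-of-thought prompt (illustrative wording only)."""
    pathways = "; ".join(ctx.pathways) or "none provided"
    go_terms = "; ".join(ctx.go_terms) or "none provided"
    lines = [
        f"Compound (IUPAC name): {ctx.iupac_name}",
        f"Biological pathways: {pathways}",
        f"GO terms: {go_terms}",
        "",
        "Reason step by step:",
        "1. Describe structural features relevant to toxicity.",
        "2. Relate the listed pathways and GO terms to possible mechanisms.",
        "3. For each toxicity type below, answer Toxic or Non-toxic with a brief rationale.",
        "",
        "Toxicity types: " + ", ".join(toxicities),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder inputs for illustration; not data from the paper.
    example = CompoundContext(
        iupac_name="2-acetyloxybenzoic acid",
        pathways=["Arachidonic acid metabolism"],
        go_terms=["prostaglandin biosynthetic process"],
    )
    print(build_cot_prompt(example, ["hepatotoxicity", "cardiotoxicity"]))
```

The returned string would then be sent to whichever LLM is under evaluation. Keeping prompt assembly separate from the model call makes it straightforward to compare models, or to swap the IUPAC name for a SMILES string when testing the effect of the structure representation.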
