Generating π-Functional Molecules Using STGG+ with Active Learning

Abstract

Generating novel molecules with out-of-distribution properties is a major challenge in molecular discovery. Supervised learning methods generate high-quality molecules similar to those in their training dataset, but they struggle to generalize to out-of-distribution properties. Reinforcement learning can explore new regions of chemical space, but it often engages in reward hacking and generates non-synthesizable molecules. In this work, we address this problem by integrating a state-of-the-art supervised learning method, STGG+, into an active learning loop. Our approach iteratively generates molecules with STGG+, evaluates them, and fine-tunes the model on the results to continuously expand its knowledge. We denote this approach STGG+AL. We apply STGG+AL to the design of organic π-functional materials, focusing on two challenging tasks: 1) generating highly absorptive molecules characterized by high oscillator strength, and 2) designing absorptive molecules with reasonable oscillator strength in the near-infrared (NIR) range. The generated molecules are validated and rationalized in silico with time-dependent density functional theory. Our results demonstrate that, unlike existing approaches such as reinforcement learning (RL), our method is highly effective at generating novel molecules with high oscillator strength. We open-source our active learning code, our Conjugated-xTB dataset containing 2.9 million π-conjugated molecules, and the function for approximating the oscillator strength and absorption wavelength (based on sTDA-xTB).
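The generate, evaluate, and fine-tune cycle described in the abstract can be illustrated with a toy sketch. This is not the STGG+AL implementation: `ToyGenerator`, `toy_oracle`, and `active_learning_loop` are hypothetical names, the "molecules" are plain numbers, the generator is a Gaussian sampler rather than a graph generative model, and the oracle is an arbitrary scoring function standing in for a property estimate such as oscillator strength. It only shows, under those assumptions, how fine-tuning on the best candidates found so far can push a generator toward properties outside its starting distribution.

```python
import random
import statistics


def toy_oracle(x: float) -> float:
    """Hypothetical property evaluator: score peaks at x = 10 (assumed target)."""
    return -(x - 10.0) ** 2


class ToyGenerator:
    """Hypothetical stand-in for a generative model: samples scalars from a Gaussian."""

    def __init__(self, mean: float, std: float) -> None:
        self.mean = mean
        self.std = std

    def sample(self, n: int, rng: random.Random) -> list[float]:
        return [rng.gauss(self.mean, self.std) for _ in range(n)]

    def fine_tune(self, examples: list[float]) -> None:
        # "Fine-tuning" here simply re-centres the sampler on the given examples.
        self.mean = statistics.fmean(examples)


def active_learning_loop(generator, oracle, rounds=20, n_samples=200, top_k=20, seed=0):
    """Repeat: generate candidates, score them, fine-tune on the best seen so far."""
    rng = random.Random(seed)
    dataset = []  # accumulated (candidate, score) pairs across all rounds
    history = []  # generator mean after each round
    for _ in range(rounds):
        candidates = generator.sample(n_samples, rng)            # generate
        dataset.extend((c, oracle(c)) for c in candidates)       # evaluate
        dataset.sort(key=lambda pair: pair[1], reverse=True)
        generator.fine_tune([c for c, _ in dataset[:top_k]])     # fine-tune
        history.append(generator.mean)
    return history


if __name__ == "__main__":
    means = active_learning_loop(ToyGenerator(mean=0.0, std=1.0), toy_oracle)
    print("generator mean, first and last round:", means[0], means[-1])
```

In this toy setting the sampler starts far from the target region and moves toward it round by round, because each round's best candidates lie slightly beyond the current distribution. A real pipeline would replace the sampler with a molecular generator and the oracle with a property calculation.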
