∇²DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
June 20, 2024
Authors: Kuzma Khrabrov, Anton Ber, Artem Tsypin, Konstantin Ushenin, Egor Rumiantsev, Alexander Telepov, Dmitry Protasov, Ilya Shenbin, Anton Alekseev, Mikhail Shirokikh, Sergey Nikolenko, Elena Tutubalina, Artur Kadurin
cs.AI
Abstract
Methods of computational quantum chemistry provide accurate approximations of
molecular properties crucial for computer-aided drug discovery and other areas
of chemical science. However, high computational complexity limits the
scalability of their applications. Neural network potentials (NNPs) are a
promising alternative to quantum chemistry methods, but they require large and
diverse datasets for training. This work presents a new dataset and benchmark
called ∇²DFT, which is based on nablaDFT. It contains twice as many
molecular structures, three times as many conformations, new data types and tasks,
and state-of-the-art models. The dataset includes energies, forces, 17
molecular properties, Hamiltonian and overlap matrices, and a wavefunction
object. All calculations were performed at the DFT level
(ωB97X-D/def2-SVP) for each conformation. Moreover, ∇²DFT is the
first dataset that contains relaxation trajectories for a substantial number of
drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in
molecular property prediction, Hamiltonian prediction, and conformational
optimization tasks. Finally, we propose an extendable framework for training
NNPs and implement 10 models within it.