Fine-Tuned Language Models Generate Stable Inorganic Materials as Text

Abstract

We propose fine-tuning large language models to generate stable materials. While unorthodox, fine-tuning large language models on text-encoded atomistic data is simple to implement yet reliable, with around 90% of sampled structures obeying physical constraints on atom positions and charges. Using energy above hull calculations from both learned ML potentials and gold-standard DFT calculations, we show that our strongest model (fine-tuned LLaMA-2 70B) generates materials predicted to be metastable at about twice the rate (49% vs. 28%) of CDVAE, a competing diffusion model. Because of the inherent flexibility of text prompting, our models can simultaneously be used for unconditional generation of stable materials, infilling of partial structures, and text-conditional generation. Finally, we show that language models' ability to capture key symmetries of crystal structures improves with model scale, suggesting that the biases of pretrained LLMs are surprisingly well suited to atomistic data.
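As a minimal sketch of what "text-encoded atomistic data" could look like, the following assumes a simple line-based layout: lattice lengths, then lattice angles, then one element symbol and one line of fractional coordinates per atom. The abstract does not specify the encoding, so this layout, the `Crystal` container, and the helpers `encode`, `decode`, and `is_plausible` are all hypothetical and shown for illustration only. In particular, `is_plausible` is a basic geometric sanity check, not the validity criteria on atom positions and charges referred to above.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Crystal:
    """Hypothetical container for a periodic structure (not the authors' type)."""
    lengths: Vec3            # lattice lengths a, b, c in angstroms
    angles: Vec3             # lattice angles alpha, beta, gamma in degrees
    species: List[str]       # one element symbol per atom
    frac_coords: List[Vec3]  # one fractional coordinate triple per atom


def encode(crystal: Crystal) -> str:
    """Serialize a structure to plain text using the assumed layout."""
    lines = [
        " ".join(f"{x:.1f}" for x in crystal.lengths),
        " ".join(f"{x:.0f}" for x in crystal.angles),
    ]
    for symbol, coord in zip(crystal.species, crystal.frac_coords):
        lines.append(symbol)
        lines.append(" ".join(f"{x:.2f}" for x in coord))
    return "\n".join(lines)


def decode(text: str) -> Crystal:
    """Parse a (possibly model-generated) string back into a structure."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected lattice lengths and angles")
    lengths = tuple(float(x) for x in lines[0].split())
    angles = tuple(float(x) for x in lines[1].split())
    rest = lines[2:]
    if len(lengths) != 3 or len(angles) != 3 or len(rest) % 2 != 0:
        raise ValueError("malformed structure string")
    species = rest[0::2]
    frac_coords = []
    for ln in rest[1::2]:
        coord = tuple(float(x) for x in ln.split())
        if len(coord) != 3:
            raise ValueError("each atom needs three coordinates")
        frac_coords.append(coord)
    return Crystal(lengths, angles, species, frac_coords)


def is_plausible(crystal: Crystal) -> bool:
    """Basic geometric sanity check only; an assumption, not the paper's test."""
    return (
        len(crystal.species) > 0
        and all(x > 0 for x in crystal.lengths)
        and all(0 < x < 180 for x in crystal.angles)
        and all(0 <= x < 1 for c in crystal.frac_coords for x in c)
    )


# Usage: a two-atom toy cell with illustrative (not reference) values.
toy = Crystal((5.6, 5.6, 5.6), (90.0, 90.0, 90.0),
              ["Na", "Cl"], [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)])
toy_text = encode(toy)
```

Under this assumed layout, a sampled string that fails to parse in `decode` or fails `is_plausible` would be discarded before any energy calculation.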
