Abstract
Machine learning-assisted modeling of the inter-atomic potential energy surface (PES) is revolutionizing the field of molecular simulation. With the accumulation of high-quality electronic structure data, a model that can be pretrained on all available data and finetuned on downstream tasks with a small additional effort would bring the field to a new stage. Here we propose DPA-1, a Deep Potential model with a gated attention mechanism, which is highly effective for representing the conformation and chemical spaces of atomic systems and learning the PES. We tested DPA-1 on a number of systems and observed superior performance compared with existing benchmarks. When pretrained on large-scale datasets containing 56 elements, DPA-1 can be successfully applied to various downstream tasks with a great improvement of sample efficiency. Surprisingly, for different elements, the learned type embedding parameters form a spiral in the latent space and have a natural correspondence with their positions on the periodic table, showing interesting interpretability of the pretrained DPA-1 model.
Similar content being viewed by others
Introduction
Reliably representing the inter-atomic potential energy surface (PES) is core to the study of properties of molecules and materials in computational physics, chemistry, materials science, biology, etc. While electronic structure methods typically give accurate and transferable PES, they are prohibitively expensive for scaling to systems of more than thousands of atoms. On the other hand, empirical force fields are much more efficient but are inherently limited by their accuracy in many applications. By properly integrating machine learning (ML) methodologies and physical requirements like extensiveness and symmetries, various methods have emerged to address the accuracy v.s. efficiency dilemma in the realm of PES modeling1,2,3,4,5,6,7,8,9,10,11. Arguably, a new paradigm is forming: electronic structure methods are no longer used to generate the driving forces during molecular dynamics simulations but are used to generate data for training their alternatives, ML-based PES models.
Despite remarkable achievements of ML-based PES models12,13,14, challenges still remain. For a domain expert who would like to apply such methodologies in their applications, a natural first question is on the efforts needed for obtaining a reliable PES model: Are there ready-to-use PES models? If not, what would be the amount of training data and time cost required? Can we take advantage of the ever-increasing publicly-available training data?
To address these issues, there have been several efforts. On one hand, general-purpose models for various systems, such as silicon15, phosphorus16, water17, metals and alloys18,19,20,21,22, etc., have been developed and are directly applicable to relevant studies. However, the range of applicability of such models is typically limited to small conformation or chemical space. For example, for alloys, the majority of general-purpose ML models are developed for systems with at most two element types. On the other hand, several efficient data generation protocols have been developed23,24,25,26, of which a representative is DP-GEN25,26, a concurrent learning procedure that iteratively explores the configuration space using models trained with existing data, and then labels only those configurations with high uncertainty level. Even with these protocols, the computational effort needed for complicated systems is still prohibitive. For example, to train a fairly general-purpose model for the AlMgCu alloy system, 100k density functional theory (DFT)27,28 calculations were ultimately performed, resulting in the cost of ten million CPU core hours18.
With the accumulation of high-quality electronic structure data covering almost all the elements on the periodic table, it is becoming possible to systematically develop pretraining schemes, which have been widely adopted in areas like computer vision (CV)29,30 and natural language processing (NLP)31,32. In these schemes, one first trains a unified model on large-scale datasets and then finetunes it for downstream tasks, expecting that a good representation can be learned in the first stage, and the amount of supervised data needed for the second stage will be significantly reduced. Recently, the pretraining-finetuning idea has been applied to organic molecules systems for energy and force predictions33,34, and to tackle tasks beyond representing the PES35,36,37. Unfortunately, most ML-based PES models are premature for such schemes at scale in materials applications. Taking the widely used two versions of Deep Potential models6,7 as examples, the ML parameters are element-type-dependent, making it highly inefficient when the training data contains many elements.
Constant efforts have been devoted to adapt the architecture of the ML-based PES models for large datasets. Among them, one class of models named equivariant graph neural networks (GNN)38 that is built upon convolutions over atomic graphs of node and edge equivariant representations has shown promise of training on large datasets. SchNet5, PaiNN39, GemNet-OC40, DimeNet++41, PFP42, SCN43, SpinConv44 and Equiformer/EquiformerV245,46 are trained on the OC20/OC2M47 dataset containing about 133M/2M data frames of 56 elements. These models are benchmarked by the accuracy of energy, force and stable structure predictions. Very recently, it has been shown that introducing the attention architecture45 in a GNN model improves the performance on the OC20/OC2M dataset46. Chen and Ong48 proposed M3GNet, which was able to train on a subset of the Materials Project49 that contains 187,687 configurations encompassing 89 elements and labeled at the generalized gradient approximation (GGA)50 or GGA+U level. Takamota et. al.42 introduced the PFP model, which was trained on a dataset composed of molecular and crystal configurations including approximately 9 × 106 frames of 45 elements. Choudhary et. al.51 developed the ALIGNN model, and they were able to train the model on a subset of the JARVIS-DFT dataset52 that is composed of 307,113 data frames of 89 elements. The M3GNet, PFP, and ALIGNN models are proposed as “universal” potential models, however, their accuracies are not on par with PES models trained for specific materials applications.
The equivariant GNN models are potential candidates for pretraining, several issues worth special attention before applying them in downstream real-world applications. First, the GNN approaches are not well-suited for massively parallel molecular dynamics simulations53. The update of each GNN layer requires communications between spatially decomposed sub-regions of the system. In each evaluation of the energy and forces, in total several to a dozen such updates are required, which may lead to a substantial communication overhead in massively parallel high-performance super-computers. Second, some models, such as PaiNN, GemNet-OC, SCN, Equiformer/EquiformerV2, directly predict forces using rotationally equivariant networks39,40,45,54 instead of energy gradients with respect to atomic coordinates. Therefore, the predicted force is not conservative, which serves as a basic assumption in guaranteeing the accuracy of molecular simulations55. The DimeNet++41 Allegro53 models are conservative. Last but not least, some models, such as GemNet-OC, SpinConv, M3GNet, and ALIGNN are not smooth, i.e. a sudden energy jump may happen as the positions of atoms infinitesimally varies. This leads to non-conserved energy in the Hamiltonian dynamics simulations, which is used in computing the dynamical properties like diffusion constant and viscosity.
By far, how much the downstream materials applications benefit from the ML models trained on the large-scale datasets are still not clear. To answer the question, in this article, we propose DPA-1, a Deep Potential model with a gated attention mechanism. Designed with a local descriptor, this model is exceptionally well-suited for parallel simulations on large-scale systems containing millions of atoms56. Notably, DPA-1 predicts conservative forces, ensures smoothness and demonstrates outstanding efficacy in learning inter-atomic interactions. Moreover, once pretrained, DPA-1 can significantly decrease the supplementary efforts needed for subsequent downstream tasks. We tested DPA-1 on various systems and observed superior performance compared with existing benchmarks. Then we took AlMgCu alloy systems18 as an example, showing that after pretraining with single-element and binary samples, DPA-1 can save around 90% ternary samples compared with the DeepPot-SE model7. Finally, we pretrained DPA-1 using the OC20 dataset, which consists of 56 elements, and successfully applied it to various downstream tasks. We checked the interpretability of the pretrained model by looking into the learned embedding parameters for different element types, finding that the 56 elements are arranged on a spiral in the latent space, which has a natural correspondence with their physical properties on the periodic table. Above all, we believe that DPA-1 and the pretraining scheme will bring the field of molecular simulation to a new stage.
Results
We conducted a number of experiments to evaluate the performance of DPA-1, with its architecture illustrated in Fig. 1 and detailed in the Methods section. First, to test the model’s ability to transfer among different compositions, we trained it from scratch against various systems and tested it under several challenging schemes. Then, we used an AlMgCu dataset to test its ability to transfer to ternary systems upon pretraining with single-element and binary data. Finally, we pretrained DPA-1 using the OC2M subset in OC20 dataset47 and applied it to various downstream tasks. To illustrate the effectiveness of the type-embedding and attention schemes, we compared them against DeepPot-SE model7 in all the experiments. In the following, we shall introduce first the datasets we used, and then the experiments we conducted.
Datasets
AlMgCu alloy systems18. This dataset is generated using DP-GEN26, a concurrent learning scheme. After exploring 2.73 billion alloy configurations (derived from ~2000 bulk and surface systems), only a small portion (~100k configurations) of them are labeled and then compose the compact dataset. The exploration runs in the whole concentration space, i.e., AlxCuyMgz with 0 ≤ x, y, z ≤ 1, x + y + z = 1, and x, y, z take discrete values permitted by the finite-size simulation boxes. We can divide the systems into single, binary, and ternary subsets, in the name of the number of non-zero x, y, and z. The configuration space covers a temperature range of around 50.0 K to 2579.8 K and a pressure range of around 1 bar to 50,000 bar.
Solid-state electrolyte (SSE) systems57. These systems contain Li10XP2S12-type SSE materials, where X represents a single or combination of Ge/Si/Sn, and can be divided into three main parts: init, mix and single. The init part comes from the standard DP-GEN scheme starting from 590 structures that are generated via slightly perturbing DFT-relaxed crystal structures, Li10SiP2S12 and Li10SnP2S12 from Materials Project49. The exploration covers both ordered structures relaxed by DFT (i.e. structures downloaded from the Materials Project database, in which the position of Ge/Si/Sn/P atoms are fixed) and disordered structures whose 4d sites are randomly occupied by Ge/Si/Sn/P. Based on the init part, the mix part contains further exploration in binary and ternary mixture of Ge/Si/Sn, while the single part covers only a single X in Ge/Si/Sn with other changes in lattice and ratio of Li.
HEA systems. The High Entropy Alloy HEA dataset includes bulk TaNbWMoVAl alloy systems of various configurations and compositions. We employ DP-GEN to explore the composition space, starting from Ta3Nb3W3Mo3V3Al1, a 16-atom unit cell containing the former 5 elements as main components and Al as an additive. The dataset is divided into two subsets: interior and exterior. The interior (higher entropy) subset includes composition variations near the starting point. It covers six-component, quinary, quaternary, and ternary alloys. The exterior (lower entropy) subset includes systems that are close to the corners and edges of the composition space. It includes systems where one or two elements dominate, binary alloys, and simple substance systems. For both subsets, the temperature range is around 50.0 K to 388.1 K and the pressure range is around 1 bar to 50000 bar.
OC2047. OC20 consists of single adsorbates (small molecules) physically binding to the surfaces of catalysts covering periodic bulk materials with 56 elements. Both the chemical diversity and system size are much more complex than other benchmark datasets, such as MD1758, ANI-1x24, or QM959. OC2M is a subset including 2 million data points (energies and forces) randomly sampled from OC20, which is still challenging for model training and decent for pretraining. Johannes et al. recently provided several baselines on OC2M, taking months to converge40.
Accuracy on various datasets, trained from scratch
The majority of existing models usually focus on the ability to transfer among different configurations, in which case training and validation subsets consist of similar compositions (e.g. randomly sampled from the same dataset). However, to perform pretraining, the upstream and downstream datasets may differ violently. Thus, it’s vital for models under the pretraining scheme to transfer among different compositions or even among different datasets, which has, as far as we know, rarely been discussed before. In this work, we mainly focus on a more general but challenging scheme to comprehensively test the generalization ability of the model.
We first designed several challenging tasks to test the model’s ability to transfer among different compositions. For AlMgCu, SSE, and HEA systems, we divided them into subsets with different compositions for training and validation (See Datasets subsection for details). The results of DPA-1 and DeepPot-SE are shown in Table 1. With the training loss nearly the same (omitted in the table), the DPA-1 drastically outperforms DeepPot-SE in validation accuracy. For example, for AlMgCu systems, when trained only on single- and binary-element samples, the validation RMSE of DPA-1 on ternary samples can outperform DeepPot-SE by one order of magnitude (6.99 versus 65.1 meV/atom). This suggests that the DPA-1 model might have learned the latent interactions of ternary pairs Al-Mg-Cu from binary pairs Al-Mg, Al-Cu, Mg-Cu, and single-element interactions, possibly thanks to the type-embedding scheme and attention mechanism. We conducted an ablation study in Supplementary Note 1 on HEA systems to demonstrate the influence of each structural component.
To test the performance of DPA-1 in terms of predicting more physical quantities, we performed geometry relaxations on all AlMgCu ternary alloys available from the Materials Project to evaluate their accuracy in predicting formation energy and equilibrium volume (see details in Supplementary Note 2). We also used it to calculate the elastic moduli of AlMgCu systems, which requires accurately capturing the second-order information (see details in Supplementary Note 3). Additionally, we carried out molecular dynamics simulations on LiGePS systems to assess the diffusion coefficients in relation to temperature, comparing the results to ab initio molecular dynamics (AIMD) simulations and experimental studies (see details in Supplementary Note 4). In all tests, satisfactory agreement with the DFT and/or experimental references are obtained.
As a supplement, we also trained DPA-1 model on several simple systems to compare with other ML-based PES. Since these tasks are much easier than the above ones and out of our main focus, we place the results in Supplementary Note 8. Note that there may be relatively little room for improvement on these simple datasets, while DPA-1 can still outperform other methods with even less training samples.
Sample efficiency of pretrained models
As shown in Fig. 2, we use the learning curves to illustrate in terms of the amount of additional training data saved for downstream tasks thanks to model pretraining. In all the experiments, the learning curves were generated by an active learning procedure, in which a pool of data labeled by energy and force is prepared and three steps are repeated iteratively: using samples in the training pool to train the model; testing the model using the remaining samples; selecting 50 samples with the largest prediction errors on per-atom energies and adding them to the training pool. We use the term sample efficiency to denote the amount of training samples required by a model to achieve a given accuracy level for a certain task. The hyperparameter settings in these tests can be found in Supplementary Note 9.
We started with a relatively simple task to compare DeepPot-SE and DPA-1. In this task, both the two models were pretrained using single-element and binary subsets of the AlMgCu systems, and the learning curves were obtained using the AlMgCu ternary subset. As shown in Fig. 2a, DPA-1 exhibits a much better sample efficiency than DeepPot-SE, which should be expected.
Next, we used the OC2M dataset, which contains 56 elements, to pretrain DPA-1 and evaluated its performance on the HEA systems and the AlCu systems (Fig. 2b, c, respectively). As shown in Fig. 3c, the training cost of DeepPot-SE scales quadratically with the number of elements, making its pretraining computationally infeasible, while the number of elements has no effects on the training cost of DPA-1. It is observed that the sample efficiency of DPA-1 pretrained on OC2M is generally better than DPA-1 from scratch, while DeepPot-SE from scratch is the worst. Moreover, compared with AlCu systems, the improvement of pretraining is much more significant for HEA systems, possibly due to the fact that the number of elements of HEA is much larger than AlCu, and the local chemical environment is much more complicated.
The equivariant GNN models usually need thousands of GPU hours to be trained to a descent accuracy40. By contrast, the DPA-1 model only takes less than 200 GPU hours for training. The converged energy and force MAEs on the OC2M validation set are 0.681 eV and 0.076 eV/Å, respectively. This accuracy is comparable with the best energy-conserving GNN model DimeNet++, which achieves MAEs of 0.805 eV and 0.066 eV/Å, reported in ref. 40. A better performance of energy MAE 0.286 eV and force MAE 0.026 eV/Åis achieved by GemNet-OC at the cost of non-conservative forces and loss of smoothness40.
In the potential energy model, the presence of non-conservative forces and unsmoothness introduce an artificial energy drift in MD simulations. While investigating static properties, this drift can be removed by incorporating a thermostat in the simulation. However, it is essential to carefully examine the potential impact on the accuracy of property estimation. To calculate the dynamical response of the system, such as the self-diffusion coefficient, viscosity, and heat conductivity, it is typically necessary to evaluate auto-correlation functions by using the Green-Kubo relations60,61. The estimations of auto-correlation functions usually require 10-100 ps long micro-canonical (NVE) simulations to achieve converged statistics and eliminate possible nonergodicity in the Hamiltonian dynamics62. In this context, energy conservation is critical; otherwise, the energy drift may lead the system to an undesired thermodynamic state or even cause a blow-up in the total energy. In Supplementary Note 5, we demonstrate the magnitude of the total energy drift during a 100-ps long NVE MD simulation for OC20 configurations. The drift observed in non-conservative models is approximately 10−2 eV/atom, which corresponds to a temperature of roughly 102 K. In contrast, the energy-conserving DPA-1 model, as anticipated, does not exhibit any energy drift.
As shown in Supplementary Note 6, it has been observed that, when trained with 1 million steps on the AlMgCu alloy dataset, the non-conservative models achieve relatively higher force accuracy but lower energy accuracy compared to the conservative models. Furthermore, the accuracy of non-conservative models in predicting the equation of state (EOS), a fundamental material property, is lower than that of the conservative models. This may be attributed to the fact that non-conservative models predict energy and force separately, and thus accurate force prediction does not necessarily improve the shape of the energy landscape.
Interpretability of type embedding learned from pretraining
To see whether DPA-1 can learn physically meaningful information from pretraining, we investigated the 3-dimensional principal component analysis (PCA) visualization of the learned type embeddings in the OC2M-pretrained model. Interestingly, as shown in Fig. 3a, the arrangement of the elements generally follows the shape of a downward spiral. Elements belonging to the same period are lined up in the direction of the spiral; while elements belonging to the same family are listed in the direction orthogonal to the spiral. Even though some transition metal elements are almost bound together, this rule still roughly holds. It is observed that C, N, and O are outliers, possibly because in OC2M, C, N and O are mostly in organic molecules, which serve as adsorbates and have chemical environments that are very different from other elements.
In addition, we performed interpolation experiments for the type embedding of Li, an element unseen in OC2M. As shown in Fig. 3b, we let \({T}_{Li}=\lambda \left(Na\right)* {T}_{Na}+\left(1-\lambda \left(Na\right)\right)* {T}_{H}\), since Li lies between H and Na in the same family. When tested on the SSE system, only the bias in the atomic energy is changed, since the setup of the electronic method used to label the SSE system is different from that for OC2M, which typically causes an energy shift. It is found that the RMSE of energy and force shows a sudden drop when \(\lambda \left(Na\right)=0.7\), which meets the chemical intuition and further confirms the interpretability of the pretrained DPA-1 model. Moreover, we conducted analogous interpolation experiments for Nb and Mo on the HEA systems, and reached similar conclusions as the Li interpolation (see detailed report in Supplementary Note 7).
Discussion
In this paper, we developed DPA-1, an attention-based Deep Potential model that allows for large-scale pretraining on atomistic datasets. We tested DPA-1 from different aspects, showing its excellent performance in terms of its accuracy on various datasets when trained from scratch, as well as its sample efficiency when pretrained with existing data. Further investigations on the type embedding parameters suggest the interpretability of DPA-1 pretrained on OC2M.
In the future, it will be of interest to extend the training dataset to cover the full periodic table, and, in particular, see a more converged “spiral” in the latent space; the embedding information of local chemical environments may be useful to characterize different conformations. Multi-task and unsupervised training schemes are worth exploring; and, for downstream tasks, just like what has happened in the fields of CV and NLP, schemes like model compression, distillation, and transfer, etc., are desperately needed. We leave these possibilities and more applications to future works.
Methods
Consider a system of N atoms, the elemental types are \({{{\mathcal{A}}}}=\left\{{\alpha }_{1},{\alpha }_{2},...,{\alpha }_{i},...,{\alpha }_{N}\right\}\), and the atomic coordinates are \({{{\mathcal{R}}}}=\left\{{{{{\boldsymbol{r}}}}}_{1},{{{{\boldsymbol{r}}}}}_{2},...,{{{{\boldsymbol{r}}}}}_{i},...,{{{{\boldsymbol{r}}}}}_{N}\right\}\), with ri being the three Cartesian coordinates of atom i. The PES of the system is denoted by E, a function of elemental types and coordinates, i.e. \(E=E({{{\mathcal{A}}}},{{{\mathcal{R}}}})\). For each atom i, consider its neighbors \(\{j| j\in {{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\}\), where \({{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\) denotes the set of atom indices j such that rji < rc, with rji being the Euclidean distance between atoms i and j. E is represented as the summation of atomic energies \(\left\{{e}_{1},{e}_{2},...,{e}_{i},...,{e}_{N}\right\}\), where the atomic energy ei only depends on the information of \({{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\). We define \({N}_{i}=| {{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)|\), the cardinality of the set \({{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\). We use \({{{{\mathcal{A}}}}}^{i}\) to denote element types in \({{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\), and \({{{{\mathcal{R}}}}}^{i}\in {{\mathbb{R}}}^{{N}_{i}\times 3}\) their corresponding coordinates relative to i. The atomic energy ei is thus a function of \({{{{\mathcal{A}}}}}^{i}\) and \({{{{\mathcal{R}}}}}^{i}\). The atomic force on atom i, \({{{{\mathcal{F}}}}}_{i}\), is defined as the negative gradient of the total energy with respect to i’s coordinate:
We refer to ref. 7 for a detailed discussion of several requirements for PES modeling. In particular, the PES has to be invariant under translation, rotation, and permutation of the indices of atoms with the same element types.
The details of the model architecture are introduced below. We refer to Fig. 1 for the overall pipeline to predict the atomic energy ei: from the embedded neighboring environment, through the self-attention scheme, to the symmetry-preserving descriptors, and finally to the fitting network.
Local embedding matrix with type information
We obtain the local embedding matrix with the following three steps. First, \({{{{\mathcal{R}}}}}^{i}\) is mapped to the generalized coordinates \({\tilde{{{{\mathcal{R}}}}}}^{i}\in {{\mathbb{R}}}^{{N}_{i}\times 4}\). In this mapping, each row of \({{{{\mathcal{R}}}}}^{i},\{{x}_{ji},{y}_{ji},{z}_{ji}\}\), is transformed into a row of \({\tilde{{{{\mathcal{R}}}}}}^{i}\):
where \(\{{x}_{ji},{y}_{ji},{z}_{ji}\}\) denotes the Cartesian coordinates of rji = rj − ri, \({\hat{x}}_{ji}=\frac{s({r}_{ji}){x}_{ji}}{{r}_{ji}},{\hat{y}}_{ji}=\frac{s({r}_{ji}){y}_{ji}}{{r}_{ji}},{\hat{z}}_{ji}=\frac{s({r}_{ji}){z}_{ji}}{{r}_{ji}}\), and \(s({r}_{ji}):{\mathbb{R}}\,\mapsto\, {\mathbb{R}}\) is a continuous and differentiable scalar weighting function applied to each component, defined as:
Here rcs is a smooth cutoff parameter that allows the components in \({\tilde{{{{\mathcal{R}}}}}}^{i}\) to smoothly go to zero at the boundary of the local region defined by rc.
Second, we add the atomic type embedding as supplemental information. For atom i, the type embedding map Ti is defined as:
where αi is the atomic type of atom i and ϕT is a one-hot-like embedding network mapping from αi to a length-fixed vector.
Then, given both \({\tilde{{{{\mathcal{R}}}}}}^{i}\) and type embeddings \(\{{T}_{i}\}\cup \{{T}_{j}| j\in {{{{\mathcal{N}}}}}_{{r}_{c}}(i)\}\), we define the local embedding matrix \({{{{\mathcal{G}}}}}^{i}\in {{\mathbb{R}}}^{{N}_{i}\times {M}_{1}}\):
where G is a neural network mapping from scalar weight \(s({r}_{ji})\) and type embeddings of both center and neighbor atoms, through multiple hidden layers, to M1 outputs. Here we simply feed the concatenated inputs into G at once, as shown in Fig. 1b.
Attention method for building up trainable descriptors
The attention mechanism has achieved great success and played an increasingly important role in CV63 and NLP64. It has become an excellent tool for modeling the importance or relevance of visual regions or text tokens, thus is potentially appropriate to reweight the interactions among neighbor atoms according to both distance and angular information.
In DPA-1, we follow the standard self-attention mechanism and obtain the queries \({{{{\mathcal{Q}}}}}^{i,l}\in {{\mathbb{R}}}^{{N}_{i}\times {d}_{k}}\), keys \({{{{\mathcal{K}}}}}^{i,l}\in {{\mathbb{R}}}^{{N}_{i}\times {d}_{k}}\), and values \({{{{\mathcal{V}}}}}^{i,l}\in {{\mathbb{R}}}^{{N}_{i}\times {d}_{v}}\):
where Ql, Kl, Vl represent three linear transformations which output the queries and keys of dimension dk and values of dimension dv, and l is the index of attention layer. Here we take \({{{{\mathcal{G}}}}}^{i,0}={{{{\mathcal{G}}}}}^{i}\).
Then we adopt the scaled dot-product attention method65 to mix the neighbor features after calculating the attention weights:
where \(\varphi \left({{{{\mathcal{Q}}}}}^{i,l},{{{{\mathcal{K}}}}}^{i,l},{{{{\mathcal{R}}}}}^{i,l}\right)\in {{\mathbb{R}}}^{{N}_{i}\times {N}_{i}}\) is attention weights. In the original attention method, one typically has \(\varphi \left({{{{\mathcal{Q}}}}}^{i,l},{{{{\mathcal{K}}}}}^{i,l}\right)={{\mathrm{softmax}}}\,\left(\frac{{{{{\mathcal{Q}}}}}^{i,l}{({{{{\mathcal{K}}}}}^{i,l})}^{T}}{\sqrt{{d}_{k}}}\right)\), with \(\sqrt{{d}_{k}}\) being the normalization temperature. This is slightly modified to better incorporate the angular information:
where \({\hat{{{{\mathcal{R}}}}}}^{i}=\frac{{{{{\mathcal{R}}}}}^{i}}{\parallel {{{{\mathcal{R}}}}}^{i}{\parallel }_{2}}\in {{\mathbb{R}}}^{{N}_{i}\times 3}\) denotes normalized relative coordinates and ⊙ means element-wise multiplication. Intuitively, in the neighborhood of center atom i, neighbor atom k may be highly correlated with j when both the relative distance attention \({({{{{\mathcal{Q}}}}}^{i,l})}_{j}{({{{{\mathcal{K}}}}}^{i,l})}_{k}^{T}\) and normalized product of relative coordinates \(\frac{{{{{\bf{r}}}}}_{ji}{({{{{\bf{r}}}}}_{ki})}^{T}}{{r}_{ji}{r}_{ki}}\) have high scores.
Then we add layer normalization in a residual way to finally obtain the self-attentioned local embedding matrix \({\hat{{{{\mathcal{G}}}}}}^{i}\) in one such attention layer:
We also tried other attention-related tricks such as pre-layer normalization, multi-head attention, etc., which brought little improvement. In practice, as shown in Fig. 1c, we repeated this procedure by l(l ≥ 2) times for a more complete representation. If not stated otherwise, we use l = 2 in the following sections of the work. Next, we define the encoded feature matrix \({{{{\mathcal{D}}}}}^{i}\in {{\mathbb{R}}}^{{M}_{1}\times {M}_{2}}\) of atom i:
where \({\dot{{{{\mathcal{G}}}}}}^{i}\) stands for a sub-matrix of \({\hat{{{{\mathcal{G}}}}}}^{i}\), which takes the first M2(<M1) columns of \({\hat{{{{\mathcal{G}}}}}}^{i}\). Here the feature matrix \({{{{\mathcal{D}}}}}^{i}\), i.e. the descriptor, preserves all the invariance mentioned above, of which the proof can be found in ref. 7. We then pass the reshaped \({{{{\mathcal{D}}}}}^{i}\), concatenated with the type embedding parameters of the center atom, through the multi-layer fitting network:
The total energy of the system is then given as the summation of ei, and the atomic force \({{{{\mathcal{F}}}}}_{i}\) can be further computed via Eq. (1).
Model (pre-)training and finetuning
For model training or pretraining, we adopted the Adam stochastic gradient descent method66 on all the trainable parameters w inside the model to minimize the loss:
Here \({{{\mathcal{B}}}}\) represents a minibatch, \(| {{{\mathcal{B}}}}|\) is the batch size, t denotes the index of the training sample. \({E}^{{{{\boldsymbol{w}}}}},{{{{\mathcal{F}}}}}^{{{{\boldsymbol{w}}}}}\) denote the model outputs and \(E,{{{\mathcal{F}}}}\) are the corresponding DFT results. We also adopted a scheduler to tune the prefactors pϵ and pf during the training process to make a better balance between energy and force labels. Virial errors, which are omitted here, can be added to the loss for training if available.
To finetune the pretrained model with a new dataset, we first change the energy bias in the last layer of the pretrained model with the new statistical results of the new dataset, and then we fix part of the parameters in the pretrained model and train the remaining. For the following experiments, we obtained the best performance when only the type embedding parameters are fixed.
Data availability
The dataset used for training OC2M-pretrained DPA-1 is available at: https://www.aissquare.com/datasets/detail?pageType=datasets&name=Open_Catalyst_2020(OC20_Dataset). Other datasets are available in their references or on reasonable request.
Code availability
The codes of DPA-1 are in the repository of DeePMD-kit: https://github.com/deepmodeling/deepmd-kit. The OC2M-pretrained model is available at: https://www.aissquare.com/models/detail?pageType=models&name=DPA_1_OC2M.
References
Behler, J & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (2007).
Bartók, A. P., Payne, M. C., Kondor, R. & Csányi, G. ábor Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett. 104, 136403 (2010).
Thompson, A. P., Swiler, L. P., Trott, C. R., Foiles, S. M. & Tucker, G. J. Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials. J. Comput. Phys. 285, 316–330 (2015).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of International Conference on Machine Learning, 1263–1272 (PMLR, 2017).
Schütt, K. et al. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. In Proceedings of Advances in Neural Information Processing Systems (2017).
Zhang, L., Han, J., Wang, H., Car, R. & Weinan, E. J. P. R. L. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett. 120, 143001 (2018).
Zhang, L. et al. End-to-end symmetry preserving inter-atomic potential energy model for finite and extended systems. In Proceedings of Advances in Neural Information Processing Systems (2018).
Drautz, R. Atomic cluster expansion for accurate and transferable interatomic potentials. Phys. Rev. B 99, 014104 (2019).
Gasteiger, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. In Proceedings of International Conference on Learning Representations (2019).
Zhang, Y., Hu, C. & Jiang, B. Embedded atom neural network potentials: Efficient and accurate machine learning with a physically inspired representation. J. Phys. Chem. Lett. 10, 4962–4967 (2019).
Gasteiger, J., Becker, F. & Günnemann, S. Gemnet: Universal directional graph neural networks for molecules. Adv. Neural Inf. Process. Syst. 34, 6790–6802 (2021).
Deringer, V. L. et al. Gaussian process regression for materials and molecules. Chem. Rev. 121, 10073–10141 (2021).
Unke, O. T. et al. Machine learning force fields. Chem. Rev. 121, 10142–10186 (2021).
Wen, T., Zhang, L., Wang, H., Weinan, E. & Srolovitz, D. J. Deep potentials for materials science. Mater. Futures 1, 022601 (2022).
Bartók, A. P., Kermode, J., Bernstein, N. & Csányi, G. ábor Machine learning a general-purpose interatomic potential for silicon. Phys. Rev. X 8, 041048 (2018).
Deringer, V. L., Caro, M. A. & Csányi, G. ábor A general-purpose machine-learning force field for bulk and nanostructured phosphorus. Nat. Commun. 11, 1–11 (2020).
Zhang, L., Wang, H., Car, R. & Weinan, E. Phase diagram of a deep potential water model. Phys. Rev. Lett. 126, 236001 (2021).
Jiang, W., Zhang, Y., Zhang, L. & Wang, H. Accurate deep potential model for the Al–Cu–Mg alloy in the full concentration space. Chin. Phys. B 30, 050706 (2021).
Szlachta, W. J., Bartók, A. P. & Csányi, G. ábor Accuracy and transferability of Gaussian approximation potential models for tungsten. Phys. Rev. B 90, 104108 (2014).
Wang, X., Wang, Y., Zhang, L., Dai, F. & Wang, H. A tungsten deep neural-network potential for simulating mechanical property degradation under fusion service environment. Nucl. Fusion 62, 126013 (2022).
Wang, Yi. Nan, Zhang, LinFeng, Xu, B., Wang, XiaoYang & Wang, H. A generalizable machine learning potential of Ag–Au nanoalloys and its application to surface reconstruction, segregation, and diffusion. Model. Simul. Mater. Sci. Eng. 30, 025003 (2021).
Wen, T. et al. Specialising neural network potentials for accurate properties and application to the mechanical response of titanium. npj Comput. Mater. 7, 206 (2021).
Podryabinkin, E. V. & Shapeev, A. V. Active learning of linearly parametrized interatomic potentials. Comput. Mater. Sci. 140, 171–180 (2017).
Smith, J. S., Nebgen, B., Lubbers, N., Isayev, O. & Roitberg, A. E. Less is more: Sampling chemical space with active learning. J. Chem. Phys. 148, 241733 (2018).
Zhang, L., Lin, De-Ye, Wang, H., Car, R. & Weinan, E. Active learning of uniformly accurate interatomic potentials for materials simulation. Phys. Rev. Mater. 3, 023804 (2019).
Zhang, Y. et al. Dp-gen: A concurrent learning platform for the generation of reliable deep learning based potential energy models. Comput. Phys. Commun. 253, 107206 (2020).
Kohn, W. & Sham, L. J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140, A1133 (1965).
Car, R. & Parrinello, M. Unified approach for molecular dynamics and density-functional theory. Phys. Rev. Lett. 55, 2471 (1985).
Russakovsky, O. et al. Imagenet large-scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
Dosovitskiy, A. et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of International Conference on Learning Representations (2021).
Devlin, J., Chang, Ming-Wei, Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Smith, J. S., Isayev, O. & Roitberg, A. E. Ani-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci. 8, 3192–3203 (2017).
Smith, J. S. et al. Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning. Nat. Commun. 10, 1–8 (2019).
Liu, S. et al. Pre-training molecular graph representation with 3d geometry. In Proceedings of International Conference on Learning Representations (2022).
Stärk, H. et al. 3d infomax improves gnns for molecular property prediction. In Proceedings of International Conference on Machine Learning, 20479–20502 (PMLR, 2022).
Zhou, G. et al. Uni-mol: A universal 3d molecular representation learning framework. In Proceedings of International Conference on Learning Representations (2023).
Thomas, N. et al. Tensor field networks: Rotation-and translation-equivariant neural networks for 3D point clouds. Preprint at https://arxiv.org/abs/1802.08219 (2018).
Schütt, K., Unke, O. & Gastegger, M. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In Proceedings of International Conference on Machine Learning, 9377–9388 (PMLR, 2021).
Gasteiger, J. et al. Gemnet-oc: Developing graph neural networks for large and diverse molecular simulation datasets. In Proceedings of Transactions on Machine Learning Research (2022).
Gasteiger, J., Giri, S., Margraf, J. T. & Günnemann, S. Fast and uncertainty-aware directional message passing for non-equilibrium molecules. Preprint at https://arxiv.org/abs/2011.14115 (2022).
Takamoto, S. et al. Towards universal neural network potential for material discovery applicable to arbitrary combination of 45 elements. Nat. Commun. 13, 2991 (2022).
Zitnick, L. et al. Spherical channels for modeling atomic interactions. Adv. Neural Inf. Process. Syst. 35, 8054–8067 (2022).
Shuaibi, M. et al. Rotation invariant graph neural networks using spin convolutions. Preprint at https://arxiv.org/abs/2106.09575 (2021).
Liao, Yi-Lun & Smidt, T. Equiformer: Equivariant graph attention transformer for 3D atomistic graphs. In Proceedings of International Conference on Learning Representations (2023).
Liao, Y-L., Wood, B., Das, A. & Smidt, T. Equiformerv2: Improved equivariant transformer for scaling to higher-degree representations. In Proceedings of International Conference on Learning Representations (2024).
Chanussot, L. et al. Open catalyst 2020 (oc20) dataset and community challenges. ACS Catal. 11, 6059–6072 (2021).
Chen, C. & Ong, S. P. A universal graph deep learning interatomic potential for the periodic table. Nat. Comput. Sci. 2, 718–728 (2022).
Jain, A. et al. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865 (1996).
Choudhary, K. et al. Unified graph neural network force-field for the periodic table: solid state applications. Digit. Discov. 2, 346–355 (2023).
Choudhary, K. et al. The joint automated repository for various integrated simulations (Jarvis) for data-driven materials design. npj Comput. Mater. 6, 173 (2020).
Musaelian, A. et al. Learning local equivariant representations for large-scale atomistic dynamics. Nat. Commun. 14, 579 (2023).
Le, T., Noé, F. & Clevert, D.-A. Equivariant graph attention networks for molecular property prediction. Preprint at https://arxiv.org/abs/2202.09891 (2022).
Bond, S. D. & Leimkuhler, B. J. Molecular dynamics and the accuracy of numerically computed averages. Acta Numer. 16, 1–65 (2007).
Jia, W. et al. Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning. In Proceedings of SC20: International Conference For High Performance Computing, Networking, Storage And Analysis, 1–14 (IEEE, 2020).
Huang, J. et al. Deep potential generation scheme and simulation protocol for the li10gep2s12-type superionic conductors. J. Chem. Phys. 154, 094703 (2021).
Chmiela, S. et al. Machine learning of accurate energy-conserving molecular force fields. Sci. Adv. 3, e1603015 (2017).
Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1, 1–7 (2014).
Green, M. S. Markoff random processes and the statistical mechanics of time-dependent phenomena. J. Chem. Phys. 22, 398–413 (1954).
Kubo, R. Statistical-mechanical theory of irreversible processes. J. Phys. Soc. Jpn. 12, 570–586 (1957).
Lee, H-S. & Tuckerman, M. E. Dynamical properties of liquid water from ab initio molecular dynamics performed in the complete basis set limit. J. Chem. Phys. 126, 164501 (2007).
Guo, M.-H. et al. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 8, 331–368 (2022).
Galassi, A., Lippi, M. & Torroni, P. Attention in natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 32, 4291–4308 (2020).
Vaswani, A. et al. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems (2017).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
Acknowledgements
The work of H.W. and D.Z. is supported by the National Key R&D Program of China under Grant No. 2022YFA1004300, and the National Natural Science Foundation of China under Grant No. 12122103. We thank Y.L., Z.L., and G.K. for inspiring discussions. The computational resource was supported by the Bohrium Cloud Platform at DP technology.
Author information
Authors and Affiliations
Contributions
D.Z., L.Z., H.W., and F.Z.D. conceived the idea of this work. D.Z., H.B., and H.W. designed the model structure. D.Z. implemented the model. D.Z., H.B., W.J., and X.L. performed the experiments on different systems. All authors contributed to the discussions and edited the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, D., Bi, H., Dai, FZ. et al. Pretraining of attention-based deep learning potential model for molecular simulation. npj Comput Mater 10, 94 (2024). https://doi.org/10.1038/s41524-024-01278-7
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41524-024-01278-7