Abstract:
The accurate and efficient simulation of large (bio)molecular systems with quantum mechanical fidelity represents a grand challenge in computational science. Ab initio quantum chemistry methods provide accuracy but remain prohibitively expensive at realistic scales, whereas classical force fields achieve efficiency but sacrifice accuracy. Machine learning force fields (MLFFs) promise to close this gap, yet their predictive power is often limited by locality assumptions that miss the long-range effects governing the structure, dynamics, and function of complex (bio)molecular systems. This thesis develops a framework for general-purpose MLFFs that preserves quantum mechanical fidelity while scaling to large systems by combining quantum-mechanical data, efficient atomic representations, and models explicitly designed to capture long-range interactions.
To advance model development beyond small molecules, we introduce two quantum mechanical datasets that span the chemical space of cellular components: MD22 and QCell. MD22 offers a benchmark featuring molecular dynamics trajectories for six biomolecular units and two supramolecular complexes. It represents a significant increase in system size (up to 370 atoms) and conformational flexibility, and is specifically designed to probe nonlocal correlations. To support the training of general-purpose models, QCell significantly expands this coverage across all major classes of biomolecules, with ~500k diverse fragments of carbohydrates, nucleic acids, and lipids, as well as noncovalent dimers and ion-water clusters.
We then develop an efficient interatomic descriptor that makes collective effects tractable in global MLFFs, which couple all atomic degrees of freedom. The resulting algorithm, reduced descriptor gradient-domain machine learning (rGDML), automatically constructs the minimal set of interatomic features required to capture long-range fluctuations, reducing the quadratic growth of global descriptors to linear scaling. rGDML improves accuracy over both local and baseline global models, and its efficiency and stability are demonstrated through a 50 ns molecular dynamics simulation of a tetrapeptide. Its enhanced interpretability enables systematic analysis across MD22 molecules, revealing that nonlocal features (atoms separated by up to 15 Å in the studied systems) are essential to retain overall accuracy for peptides, DNA base pairs, fatty acids, and supramolecular complexes.
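The difference between quadratic and linear descriptor growth can be illustrated with a minimal sketch. It assumes that the global descriptor consists of inverse pairwise distances, and it replaces the automatic feature selection of rGDML with a hypothetical placeholder rule (`placeholder_pairs`, which keeps a fixed number of index-adjacent partners per atom). All function names are illustrative and not taken from the thesis.

```python
import numpy as np

def full_descriptor(coords):
    """Inverse distances for all N(N-1)/2 atom pairs (quadratic in N)."""
    i, j = np.triu_indices(len(coords), k=1)
    return 1.0 / np.linalg.norm(coords[i] - coords[j], axis=1)

def reduced_descriptor(coords, pairs):
    """Inverse distances for a selected subset of atom pairs only."""
    pairs = np.asarray(pairs)
    i, j = pairs[:, 0], pairs[:, 1]
    return 1.0 / np.linalg.norm(coords[i] - coords[j], axis=1)

def placeholder_pairs(n_atoms, k):
    """Hypothetical selection rule (NOT the rGDML procedure):
    pair each atom i with the next k atoms in index order,
    giving at most k * n_atoms features (linear in n_atoms)."""
    return [(i, j)
            for i in range(n_atoms)
            for j in range(i + 1, min(i + k + 1, n_atoms))]

# Small demonstration with random coordinates (assumed, in arbitrary units).
rng = np.random.default_rng(0)
coords = rng.normal(size=(10, 3))
print(full_descriptor(coords).size)
print(reduced_descriptor(coords, placeholder_pairs(10, 3)).size)
```

For 10 atoms the full descriptor has 45 entries, while the placeholder selection with `k = 3` keeps 24. In rGDML the retained pairs are chosen by the algorithm itself rather than by a fixed index rule, so that nonlocal pairs can be kept where they matter.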
Building on these insights, we introduce SO3LR, a pretrained general-purpose MLFF that couples a fast SO(3)-equivariant neural network for semi-local interactions with universal, physically grounded pairwise potentials for short-range repulsion, long-range electrostatics, and dispersion. SO3LR is trained on a diverse set of four million neutral and charged molecular complexes computed at the PBE0+MBD level of theory, ensuring broad coverage of covalent and noncovalent interactions. The model scales to 200k atoms on a single GPU and achieves reasonable to high accuracy across the chemical space of organic (bio)molecules. We validate this performance with polyalanine simulations from 300 to 800 K, accurate structural and spectroscopic observables across both high and low vibrational frequencies for a solvated protein, and consistent local and global structural properties for a glycoprotein and a lipid bilayer.
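The additive structure of such a model, a learned semi-local energy plus explicit pairwise long-range terms, can be sketched as follows. This is an illustration under strong simplifying assumptions and not the SO3LR implementation: charges and C6 coefficients are fixed inputs, the Coulomb and dispersion terms are undamped, the short-range repulsion is omitted, and `semilocal_energy` is a hypothetical stand-in for the neural network.

```python
import numpy as np

# e^2 / (4 * pi * eps0) in eV * Angstrom (approximate value).
COULOMB_CONSTANT = 14.3996

def pair_distances(coords):
    """Indices and distances for all unique atom pairs i < j."""
    i, j = np.triu_indices(len(coords), k=1)
    return i, j, np.linalg.norm(coords[i] - coords[j], axis=1)

def coulomb_energy(coords, charges):
    """Undamped point-charge electrostatics (assumption: fixed charges)."""
    i, j, r = pair_distances(coords)
    return COULOMB_CONSTANT * np.sum(charges[i] * charges[j] / r)

def dispersion_energy(coords, c6):
    """Undamped pairwise -C6 / r^6 dispersion (assumption: fixed C6 matrix)."""
    i, j, r = pair_distances(coords)
    return -np.sum(c6[i, j] / r**6)

def total_energy(coords, charges, c6, semilocal_energy):
    """Sum of a semi-local model energy and the pairwise long-range terms."""
    return (semilocal_energy(coords)
            + coulomb_energy(coords, charges)
            + dispersion_energy(coords, c6))
```

Because the long-range terms are simple pair sums with a known functional form, they extend beyond the cutoff of the semi-local network without enlarging its receptive field, which is the design motivation stated in the text.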
This thesis establishes a complete route from data to long-range-aware, general-purpose MLFFs that bring quantum accuracy to the biomolecular scale. The synthesis of machine learning and physics marks the beginning of realistic modeling of biological processes with quantum-level fidelity, with important implications for understanding health and disease.