Doctoral thesis (Dissertations and theses)
General-Purpose Machine Learning Force Fields for (Bio)Molecular Simulations
KABYLDA, Adil
2026
 

Files


Full Text
AKabylda2026.pdf
Author postprint (14.38 MB) Creative Commons License - Attribution

Details



Abstract :
[en] The accurate and efficient simulation of large (bio)molecular systems with quantum mechanical fidelity represents a grand challenge in computational science. Ab initio quantum chemistry methods provide accuracy but remain prohibitively expensive at realistic scales, whereas classical force fields achieve efficiency but sacrifice accuracy. Machine learning force fields (MLFFs) promise to close this gap, yet their predictive power is often limited by locality assumptions that miss the long-range effects governing the structure, dynamics, and function of complex (bio)molecular systems. This thesis develops a framework for general-purpose MLFFs that preserves quantum mechanical fidelity while scaling to large systems by combining quantum mechanical data, efficient atomic representations, and models explicitly designed to capture long-range interactions.
To advance model development beyond small molecules, we introduce two quantum mechanical datasets that span the chemical space of cellular components: MD22 and QCell. MD22 is a benchmark of molecular dynamics trajectories for six biomolecular units and two supramolecular complexes. It represents a significant increase in system size (up to 370 atoms) and conformational flexibility, and is specifically designed to probe nonlocal correlations. To support the training of broadly applicable, general-purpose models, QCell goes a step further and significantly expands coverage across all major classes of biomolecules, with ~500k diverse fragments of carbohydrates, nucleic acids, and lipids, as well as noncovalent dimers and ion-water clusters.
We then develop an efficient interatomic descriptor that makes collective effects tractable in global MLFFs, which couple all atomic degrees of freedom. The resulting algorithm, reduced descriptor gradient-domain machine learning (rGDML), automatically constructs the minimal set of interatomic features required to capture long-range fluctuations, converting the quadratic growth of global descriptors into linear scaling. rGDML improves accuracy over both local and baseline global models, and its efficiency and stability are demonstrated through a 50 ns molecular dynamics simulation of a tetrapeptide. Its enhanced interpretability enables systematic analysis across MD22 molecules, revealing that nonlocal features (atoms separated by up to 15 Å in the studied systems) are essential to retain overall accuracy for peptides, DNA base pairs, fatty acids, and supramolecular complexes.
Building on these insights, we introduce SO3LR, a pretrained general-purpose MLFF that couples a fast SO(3)-equivariant neural network for semi-local interactions with universal, physically grounded pairwise potentials for short-range repulsion, long-range electrostatics, and dispersion. SO3LR is trained on a diverse set of four million neutral and charged molecular complexes computed at the PBE0+MBD level of quantum mechanics, ensuring broad coverage of covalent and noncovalent interactions. The model scales to 200k atoms on a single GPU and achieves reasonable to high accuracy across the chemical space of organic (bio)molecules. We validate this performance with polyalanine simulations from 300 to 800 K, accurate structural and spectroscopic observables across both high and low vibrational frequencies for a solvated protein, and consistent local and global structural properties for a glycoprotein and a lipid bilayer.
This thesis establishes a complete route from data to long-range-aware, general-purpose MLFFs that bring quantum accuracy to the biomolecular scale. The synthesis of machine learning and physics marks the beginning of realistic modeling of biological processes with quantum-level fidelity, with important implications for understanding health and disease.
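The quadratic growth of global descriptors mentioned in the abstract can be illustrated with a minimal sketch. It assumes a descriptor built from inverse pairwise distances, a common choice for global kernel models; the abstract does not specify the descriptor form, and this is not the rGDML reduction itself. The function names `inverse_distance_descriptor` and `n_pair_features` are hypothetical and chosen for illustration only.

```python
import numpy as np


def inverse_distance_descriptor(positions):
    """Return 1/r_ij for every atom pair i < j (assumed descriptor form)."""
    positions = np.asarray(positions, dtype=float)
    # Indices of the strict upper triangle: every unordered pair exactly once.
    i, j = np.triu_indices(len(positions), k=1)
    distances = np.linalg.norm(positions[i] - positions[j], axis=1)
    return 1.0 / distances


def n_pair_features(n_atoms):
    """Number of pairwise features in a full global descriptor: N(N-1)/2."""
    return n_atoms * (n_atoms - 1) // 2


# Toy three-atom geometry (arbitrary coordinates, not taken from the thesis).
positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
descriptor = inverse_distance_descriptor(positions)

# Feature count for the largest MD22 system size quoted in the abstract.
largest_md22 = n_pair_features(370)
```

For a 370-atom system the full pairwise descriptor has 370 × 369 / 2 = 68,265 entries, which gives a sense of why a reduction to a number of features that grows linearly with system size matters for larger molecules.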
Disciplines :
Physics
Author, co-author :
KABYLDA, Adil  ;  University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Physics and Materials Science (DPHYMS)
Language :
English
Title :
General-Purpose Machine Learning Force Fields for (Bio)Molecular Simulations
Defense date :
16 January 2026
Institution :
Unilu - University of Luxembourg
Degree :
Docteur en Physique (DIP_DOC_0003_B)
Promotor :
TKATCHENKO, Alexandre ;  University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Physics and Materials Science (DPHYMS)
President :
FODOR, Etienne ;  University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Physics and Materials Science (DPHYMS)
Jury member :
ESPOSITO, Massimiliano  ;  University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Physics and Materials Science (DPHYMS)
Margraf, Johannes T.;  University of Bayreuth
Marrink, Siewert J.;  RUG - University of Groningen
FnR Project :
FNR15720828 - NavChem - Navigating Chemical Reaction Space With Machine Learning, 2021 (01/10/2021-30/09/2025) - Adil Kabylda
Available on ORBilu :
since 26 January 2026
