Dissertations and theses : Doctoral thesis
Physical, chemical, mathematical & earth Sciences : Chemistry
Cheminformatics and Computational Approaches for Identifying and Managing Unknown Chemicals in the Environment
Lai, Adelene [University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Environmental Cheminformatics]
University of Luxembourg, Esch-sur-Alzette, Luxembourg
Friedrich Schiller University, Jena, Germany
Docteur en Chimie
Schymanski, Emma
Steinbeck, Christoph
Schneider, Reinhard
Neumann, Steffen
Willighagen, Egon
Stelter, Michael
[en] cheminformatics ; environmental chemistry ; chemical pollutants ; algorithm ; chemicals management ; uvcb
[en] In most societies, using chemical products has become a part of daily life. Worldwide, over 350,000 chemicals have been registered for use in, for example, household consumption, industrial processes, and agriculture. However, despite the benefits chemicals may bring to society, their production, usage, and disposal lead to their eventual release into the environment, which has multiple implications. Anthropogenic chemicals have been detected in myriad ecosystems all over the planet, as well as in the tissues of wildlife and humans. The potential consequences of such chemical pollution are not fully understood, but the presence of chemicals in our environment has been linked to the onset of human disease and to threats to biodiversity.
Mitigating the potential negative effects of chemicals typically involves regulatory steps and multiple stakeholders. One key aspect thereof is environmental monitoring, which consists of environmental sampling, measurement, data analysis, and reporting. In recent years, advancements in Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS), open chemical databases, and software have enabled researchers to identify known (e.g., pesticides) as well as unknown environmental chemicals, commonly referred to as suspect or non-target compounds. However, identifying unknown chemicals, particularly non-targets, remains extremely challenging because of the lack of a priori knowledge of the analytes: all that is available are their mass spectrometry signals. In fact, a typical mass spectrum of an environmental sample contains thousands to tens of thousands of unknown features, which therefore require prioritisation before identification within a suitable workflow.
In this dissertation work, collaborations with two regulatory authorities responsible for environmental monitoring sought to identify relevant unknown compounds in the environment, specifically by developing computational workflows for unknown identification in LC-HRMS data. The first collaboration culminated in Publication A, which involved a joint project with the Zürcher Amt für Wasser, Energie und Luft. Environmental samples taken from wastewater treatment plant sites in Switzerland were retrospectively analysed using a pre-screening workflow that prioritised features suitable for non-target identification. For this purpose, a multi-step Quality Control algorithm that checks the quality of mass spectral data in terms of peak intensities, alignment, and signal-to-noise ratio was developed and used within pre-screening. This algorithm was incorporated into the R package Shinyscreen. Features that were prioritised by pre-screening then underwent identification using the in silico fragmentation tool MetFrag. To obtain these identifications, MetFrag was coupled to various open chemical information resources, including spectral databases such as MassBank Europe and MassBank of North America, as well as suspect lists from the NORMAN Suspect List Exchange and the CompTox Chemicals Dashboard database. One confirmed and twenty-one tentative compound identifications were achieved and reported according to an established confidence level scheme. Comprehensive data interpretation and detailed communication of MetFrag’s results were performed as a means of formulating evidence-based recommendations that may inform future environmental monitoring campaigns.
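The three quality criteria named above (peak intensity, alignment, and signal-to-noise ratio) can be illustrated with a minimal Python sketch. This is not the Shinyscreen implementation, which is an R package whose internals are not described in this abstract. The `Feature` class, the `quality_check` function, the use of the median intensity as a noise estimate, and all threshold values are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Feature:
    """Hypothetical LC-HRMS feature: an MS1 chromatogram plus the MS2 scan time."""
    rt: List[float]            # retention times (min) of the MS1 chromatogram points
    intensity: List[float]     # MS1 intensities at those retention times
    ms2_rt: Optional[float]    # retention time of the MS2 scan, None if no MS2 exists


def quality_check(feature: Feature,
                  min_intensity: float = 1e5,   # assumed threshold
                  max_rt_shift: float = 0.5,    # assumed MS1/MS2 tolerance (min)
                  min_sn: float = 3.0) -> Dict[str, bool]:
    """Run three illustrative checks and report each result plus an overall verdict."""
    # Locate the MS1 peak apex.
    apex = max(range(len(feature.intensity)), key=feature.intensity.__getitem__)
    peak_int = feature.intensity[apex]
    peak_rt = feature.rt[apex]

    # Crude noise estimate (assumption): median intensity of the chromatogram.
    noise = median(feature.intensity)
    sn = peak_int / noise if noise > 0 else float("inf")

    checks = {
        # 1. Is the MS1 peak intense enough?
        "intensity": peak_int >= min_intensity,
        # 2. Is there an MS2 scan, and does it align with the MS1 apex?
        "alignment": feature.ms2_rt is not None
        and abs(feature.ms2_rt - peak_rt) <= max_rt_shift,
        # 3. Does the peak stand out from the noise?
        "signal_to_noise": sn >= min_sn,
    }
    checks["passed"] = all(checks.values())
    return checks


example = Feature(
    rt=[5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6],
    intensity=[1e3, 1e3, 2e3, 5e5, 2e3, 1e3, 1e3],
    ms2_rt=5.35,
)
print(quality_check(example))
```

Returning the outcome of each check separately, rather than a single pass/fail flag, makes it possible to see why a feature was deprioritised, which matters when results have to be communicated to monitoring authorities.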
Building on the pre-screening and identification workflow developed in Publication A, Publication B resulted from a collaboration with the Luxembourgish Administration de la gestion de l’eau that sought to identify and, where possible, quantify unknown chemicals in Luxembourgish surface waters. More specifically, surface water samples collected as part of a two-year national monitoring campaign were measured using LC-HRMS and screened for pharmaceutical parent compounds and their transformation products. Compared to pharmaceutical compound information, which is publicly available from local authorities (and was used in the suspect list), information on transformation products is relatively scarce. Therefore, new approaches were developed in this work to mine data from the PubChem database as well as from the literature in order to formulate a suspect list containing pharmaceutical transformation products, in addition to their parent compounds. Overall, 94 pharmaceuticals and 14 transformation products were identified, of which 88 and 2, respectively, were confirmed identifications. The spatio-temporal occurrence and distribution of these compounds throughout the Luxembourgish environment were analysed using advanced data visualisations that highlighted patterns in certain regions and time periods of high incidence. These findings may support future chemicals management measures, particularly in environmental monitoring.
Another challenging aspect of managing chemicals is that they mostly exist as complex mixtures, both in the environment and in chemical products. Substances of Unknown or Variable composition, Complex reaction products or Biological materials (UVCBs) make up 20-40% of international chemical registries and include chlorinated paraffins, polymer mixtures, petroleum fractions, and essential oils. However, little is known about their chemical identities and/or compositions, which poses formidable obstacles to assessing their environmental fate and toxicity, let alone to identifying them in the environment. Publication C addresses the challenges of UVCBs by taking an interdisciplinary approach in reviewing the literature that incorporates considerations of their chemical representations, toxicity, environmental fate, exposure, and regulatory approaches. Improved substance registration requirements, grouping techniques to simplify assessment, and the use of Mixture InChI to represent UVCBs in a findable, accessible, interoperable, and reusable (FAIR) way in databases are amongst the key recommendations of this work.
Mixtures of homologous compounds, a specific type of UVCB, are commonly detected in environmental samples and include many High Production Volume (HPV) compounds such as surfactants. Compounds forming homologous series are related by a common core fragment and a repeating chemical subunit, and can be represented using general formulae (e.g., CₙF₂ₙ₊₁COOH) and/or Markush structures. However, a significant identification bottleneck is the inability to match their characteristic analytical signals in LC-HRMS data with chemicals in databases: while comb-like elution patterns and constant differences in mass-to-charge ratio indicate the presence of homologous series in samples, most chemical databases do not contain annotated homologous series. To address this gap, Publication D introduces a cheminformatics algorithm, OngLai, to detect homologous series within compound datasets. OngLai, openly implemented in Python using the RDKit, detects homologous series based on two inputs: a list of compounds and the chemical structure of a repeating unit. OngLai was applied to three open datasets from environmental chemistry, exposomics, and natural products, in which thousands of homologous series with a CH₂ repeating unit were detected. Classification of homologous series in compound datasets is expected to advance their analytical detection in samples.
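The link between a general formula and the constant mass differences seen in LC-HRMS data can be made concrete with a short Python sketch. It is not OngLai, which works on chemical structures with the RDKit. Instead, it expands the perfluorocarboxylic acid general formula quoted above (n carbons in the perfluoroalkyl chain, 2n+1 fluorines) into monoisotopic masses and then chains masses separated by one CF2 unit. The function names, the 0.002 Da tolerance, the minimum series length of three, and the two unrelated masses are assumptions for illustration only.

```python
# Monoisotopic masses (Da) of the elements needed for this example.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "F": 18.99840316, "O": 15.99491462}


def pfca_formula(n):
    """Element counts for the n-th member of the series C(n)F(2n+1)COOH."""
    return {"C": n + 1, "H": 1, "F": 2 * n + 1, "O": 2}


def mono_mass(formula):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONOISOTOPIC[el] * count for el, count in formula.items())


def find_series(masses, unit_mass, tol=0.002, min_len=3):
    """Chain masses whose successive differences equal unit_mass within tol.

    Returns a list of series (each a list of ascending masses); chains shorter
    than min_len are discarded. Thresholds are illustrative assumptions.
    """
    ordered = sorted(masses)
    used = set()
    series = []
    for i in range(len(ordered)):
        if i in used:
            continue
        chain = [i]
        while True:
            target = ordered[chain[-1]] + unit_mass
            # Look for the next unused mass one repeating unit heavier.
            nxt = next(
                (j for j in range(chain[-1] + 1, len(ordered))
                 if j not in used and abs(ordered[j] - target) <= tol),
                None,
            )
            if nxt is None:
                break
            chain.append(nxt)
        if len(chain) >= min_len:
            used.update(chain)
            series.append([ordered[j] for j in chain])
    return series


# Repeating unit CF2 and five homologues (n = 4..8) mixed with unrelated masses.
CF2 = mono_mass({"C": 1, "F": 2})
masses = [mono_mass(pfca_formula(n)) for n in range(4, 9)] + [250.1234, 301.0500]
for s in find_series(masses, CF2):
    print([round(m, 4) for m in s])
```

Because ionisation shifts every member of a series by the same amount, the same spacing applies to measured m/z values of singly charged ions. Mass spacing alone cannot distinguish isomers or confirm a shared core structure, which is why structure-based classification of series in databases is needed.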
Overall, the work in this dissertation contributed to the advancement of identifying and managing unknown chemicals in the environment using cheminformatics and computational approaches. All work conducted followed Open Science and FAIR data principles: all code, datasets, analyses, and results generated, including the final peer-reviewed publications, are openly available to the public. These efforts are intended to spur further developments in unknown chemical identification and management towards protecting the environment and human health.
Luxembourg Centre for Systems Biomedicine (LCSB): Environmental Cheminformatics (Schymanski Group)
Fonds National de la Recherche - FnR
Researchers ; Professionals ; Students ; General public ; Others
FnR ; FNR12341006 > Emma Schymanski > ECHIDNA > Environmental Cheminformatics To Identify Unknown Chemicals And Their Effects > 01/10/2018 > 30/09/2023 > 2018

File(s) associated to this reference

Fulltext file(s):

Open access
Doctoral-Thesis-Adelene-Lai-Cumulative-UniLu-submission.pdf (Author postprint, 18.08 MB)