Abstract:
[en] Platform Presentation at SETAC Europe 2023, Dublin, 30 April - 4 May 2023. Presenting in \emph{Session 4.05 Characterization, Testing and Assessment of Complex Substances (MCS, UVCBs \& MOCS)}. Presentation 4.05.T-05 at 14:40, Wednesday 3 May (Level 3 East Wing).

\textbf{Integrating UVCBs and Related Data into Open Chemical Knowledgebases}

Emma L. Schymanski$^{\textrm{1}}$, Anjana Elapavalore$^{\textrm{1}}$, Qingliang Li$^{\textrm{2}}$, Paul A. Thiessen$^{\textrm{2}}$, Leonid Zaslavsky$^{\textrm{2}}$, Jian Zhang$^{\textrm{2}}$, Evan E. Bolton$^{\textrm{2}}$

$^{\textrm{1}}$Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 avenue du Swing, 4367 Belvaux, Luxembourg. $^{\textrm{2}}$National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Although 20-40 \% of chemical registries consist of Substances of Unknown or Variable Composition, Complex Reaction Products, and Biological Materials (UVCBs), integrating and exchanging information on UVCBs in open chemical knowledgebases is challenging. The integration of UVCBs into high resolution mass spectrometry (HR-MS) based identification workflows is also problematic. Often only a name or numerical identifier is provided in the registry listings, hindering comparison, merging, and the enumeration or mapping of potential component species, whether based on expert knowledge or (semi-)automated cheminformatics methods. Improved UVCB handling in major open chemical resources will help support the exchange of information between registries, researchers, and regulators, as well as supporting, \emph{e.g.}, toxicological/environmental assessments and the integration of UVCBs into HR-MS-based workflows.
PubChem (https://pubchem.ncbi.nlm.nih.gov/), a large open chemical database with over 112M compounds, 298M substances and contributions from over 884 data sources, has recently introduced “concepts” specifically to improve its handling of UVCB-like entities. An initial dataset of {\textasciitilde}62K “concepts” was compiled from three large authoritative data sources with a high proportion of UVCBs (FDA GSRS, TSCA and ECHA). Close to 0.5M synonyms (names) were associated with these concepts and then used as the basis for literature mining dictionaries and sets of regular expressions for pattern-based recognition of UVCBs among synonyms. This approach was validated over several collections. Since UVCBs of variable composition often form (or are expressed as) homologue series, this subset of UVCBs is particularly conducive to automated grouping methods and adaptation to HR-MS workflows. Thus, as a second step, the homologue grouping algorithm OngLai was run over the PubChemLite for Exposomics database (a subset of PubChem with environmentally and toxicologically relevant annotation) and connected to “concepts” using mappings to representative or component structures provided by depositors. Over 163 connections between chemicals, homologue series and PubChem Concepts (often many concepts per series) were made; a select few have so far been hand-curated and processed as proof-of-concept exemplars. This contribution intends to show and discuss the potential (and pitfalls) associated with UVCB handling in open resources to support environmental and toxicological use cases.
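The idea of pattern-based recognition of UVCB-like synonyms mentioned above can be illustrated with a minimal sketch. The patterns and the function name `looks_like_uvcb` are hypothetical and chosen only for illustration; they are not the dictionaries or regular expressions used for PubChem Concepts, which the abstract does not list.

```python
import re

# Hypothetical, illustrative patterns only. These are assumptions and do NOT
# reproduce the expressions actually used to build PubChem Concepts.
UVCB_PATTERNS = [
    # Phrases typical of complex reaction products
    re.compile(r"\breaction products?\b", re.IGNORECASE),
    # Carbon-number ranges such as "C12-14" or "C10-C16"
    re.compile(r"\bC\d{1,2}\s*-\s*C?\d{1,2}\b"),
    # Process- or source-derived material names
    re.compile(r"\b(?:distillates?|extracts?|residues?)\b", re.IGNORECASE),
    # Explicitly underspecified composition
    re.compile(r"\b(?:unspecified|mixed isomers)\b", re.IGNORECASE),
]


def looks_like_uvcb(synonym: str) -> bool:
    """Return True if any illustrative pattern matches the synonym."""
    return any(pattern.search(synonym) for pattern in UVCB_PATTERNS)


if __name__ == "__main__":
    examples = [
        "Alcohols, C12-14, ethoxylated",
        "Distillates (petroleum), hydrotreated light",
        "Caffeine",
    ]
    for name in examples:
        print(f"{name!r}: {looks_like_uvcb(name)}")
```

A name-only heuristic like this will produce both false positives and false negatives, which is one reason validation over independent collections, as described above, matters.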