Article (Scientific journals)
Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS
Barnabas, Shadrack J.; Böhme, Timo; Boyer, Stephen K. et al.
2022, in Digital Discovery, p. 10.1039.D2DD00019
Peer reviewed
 

Documents


Full text
Barnabas_etal_2022_OntoChem_PFAS_d2dd00019a.pdf
Publisher postprint (668.47 kB)
This Open Access Article is licensed under a Creative Commons Attribution 3.0 Unported Licence

All rights reserved


All documents in ORBilu are protected by a user license.

Details



Abstract:
[en] Extracting PFAS with open source cheminformatics toolkits reveals ~1.78 million PFAS in Google Patents, ~28 K in the CORE literature repository. The extraction of chemical information from documents is a demanding task in cheminformatics due to the variety of text and image-based representations of chemistry. The present work describes the extraction of chemical compounds with unique chemical structures from the open access CORE (COnnecting REpositories) and Google Patents full text document repositories. The importance of structure normalization is demonstrated using three open access cheminformatics toolkits: the Chemistry Development Kit (CDK), RDKit and OpenChemLib (OCL). Each toolkit was used for structure parsing, normalization and subsequent substructure searching, using SMILES as structure representations of chemical molecules and International Chemical Identifiers (InChIs) for comparison. Per- and polyfluoroalkyl substances (PFAS) were chosen as a case study to perform the substructure search, due to their high environmental relevance, their presence in both literature and patent corpuses, and the current lack of community consensus on their definition. Three different structural definitions of PFAS were chosen to highlight the implications of various definitions from a cheminformatics perspective. Since CDK, RDKit and OCL implement different criteria and methods for SMILES parsing and normalization, different numbers of parsed compounds were extracted, which were then evaluated using the three PFAS definitions. A comparison of these toolkits and definitions is provided, along with a discussion of the implications for PFAS screening and text mining efforts in cheminformatics. Finally, the extracted PFAS (~1.7 M PFAS from patents and ~27 K from CORE) were compared against various existing PFAS lists and are provided in various formats for further community research efforts.
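The toolkit comparison described in the abstract can be pictured as a set comparison over structure identifiers. The following is a minimal sketch under stated assumptions, not the authors' pipeline: it assumes each toolkit has already produced a set of identifier strings (for example InChI-derived keys) for the structures it parsed and matched, and the function name `compare_toolkits` and the placeholder identifiers `ID-1` to `ID-5` are hypothetical, not data from the paper.

```python
from itertools import combinations


def compare_toolkits(hits):
    """Compare per-toolkit sets of structure identifiers.

    hits: non-empty dict mapping a toolkit name to the set of identifiers
    (assumed here to be InChI-derived strings) it reported as matches.
    """
    union = set().union(*hits.values())
    # Structures that every toolkit agrees on.
    consensus = set.intersection(*hits.values())
    # Structures reported by exactly one toolkit and by none of the others.
    unique = {}
    for name, ids in hits.items():
        others = set().union(*(s for n, s in hits.items() if n != name))
        unique[name] = ids - others
    # Size of the overlap for each pair of toolkits.
    pairwise = {
        (a, b): len(hits[a] & hits[b])
        for a, b in combinations(sorted(hits), 2)
    }
    return {
        "union": union,
        "consensus": consensus,
        "unique": unique,
        "pairwise": pairwise,
    }


# Placeholder identifiers for illustration only (not results from the paper).
hits = {
    "CDK": {"ID-1", "ID-2", "ID-3"},
    "RDKit": {"ID-1", "ID-2", "ID-4"},
    "OCL": {"ID-1", "ID-3", "ID-4", "ID-5"},
}
result = compare_toolkits(hits)
print(len(result["union"]), sorted(result["consensus"]), result["unique"])
```

Comparing on a normalized identifier rather than on raw SMILES strings matters in a sketch like this because the same molecule can be written as many different SMILES strings, so a string-level comparison would overstate the disagreement between toolkits.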
Disciplines:
Chemistry
Author, co-author:
Barnabas, Shadrack J.
Böhme, Timo
Boyer, Stephen K.
Irmer, Matthias
Ruttkies, Christoph
Wetherbee, Ian
KONDIC, Todor; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Environmental Cheminformatics
SCHYMANSKI, Emma; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB)
Weber, Lutz
External co-authors:
yes
Document language:
English
Title:
Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS
Publication/release date:
2022
Journal title:
Digital Discovery
ISSN:
2635-098X
Pagination:
10.1039.D2DD00019A
Peer reviewed:
Peer reviewed
Focus Area:
Computational Sciences
FnR project:
FNR12341006 - Environmental Cheminformatics To Identify Unknown Chemicals And Their Effects, 2018 (01/10/2018-30/09/2023) - Emma Schymanski
Available on ORBilu:
since 28 January 2023

Statistics

Number of views
181 (including 3 Unilu)
Number of downloads
138 (including 1 Unilu)

Scopus® citations
26
Scopus® citations excluding self-citations
24
OpenAlex citations
27
