Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS

Barnabas, Shadrack J.; Böhme, Timo; Boyer, Stephen K.; Irmer, Matthias; Ruttkies, Christoph; Wetherbee, Ian; KONDIC, Todor; SCHYMANSKI, Emma; Weber, Lutz

doi:10.1039/D2DD00019A

Article (Scientific journals)

2022 • In Digital Discovery, p. 10.1039.D2DD00019

Peer reviewed

Permalink
https://hdl.handle.net/10993/54247

DOI
10.1039/D2DD00019A

Files

Full Text

Barnabas_etal_2022_OntoChem_PFAS_d2dd00019a.pdf

Publisher postprint (668.47 kB)

This Open Access Article is licensed under a Creative Commons Attribution 3.0 Unported Licence

Details

Abstract :

Extracting PFAS with open source cheminformatics toolkits reveals ~1.78 million PFAS in Google Patents, ~28 K in the CORE literature repository.

The extraction of chemical information from documents is a demanding task in cheminformatics due to the variety of text and image-based representations of chemistry. The present work describes the extraction of chemical compounds with unique chemical structures from the open access CORE (COnnecting REpositories) and Google Patents full text document repositories. The importance of structure normalization is demonstrated using three open access cheminformatics toolkits: the Chemistry Development Kit (CDK), RDKit and OpenChemLib (OCL). Each toolkit was used for structure parsing, normalization and subsequent substructure searching, using SMILES as structure representations of chemical molecules and International Chemical Identifiers (InChIs) for comparison. Per- and polyfluoroalkyl substances (PFAS) were chosen as a case study to perform the substructure search, due to their high environmental relevance, their presence in both literature and patent corpuses, and the current lack of community consensus on their definition. Three different structural definitions of PFAS were chosen to highlight the implications of various definitions from a cheminformatics perspective. Since CDK, RDKit and OCL implement different criteria and methods for SMILES parsing and normalization, different numbers of parsed compounds were extracted, which were then evaluated using the three PFAS definitions. A comparison of these toolkits and definitions is provided, along with a discussion of the implications for PFAS screening and text mining efforts in cheminformatics. Finally, the extracted PFAS (~1.7 M PFAS from patents and ~27 K from CORE) were compared against various existing PFAS lists and are provided in various formats for further community research efforts.
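
The toolkit comparison described in the abstract can be pictured as simple set arithmetic over the compounds each toolkit flags. The following is a minimal sketch only, not the authors' code: the function name `compare_hits` and the identifiers `cmpd-1` to `cmpd-5` are hypothetical placeholders standing in for InChIs, and the sets do not reflect any result from the paper.

```python
from itertools import combinations


def compare_hits(hits_by_toolkit):
    """Summarise agreement between toolkits.

    hits_by_toolkit maps a toolkit name to the set of compound
    identifiers (for example InChIs) that the toolkit flagged as PFAS
    under one definition. At least one toolkit is expected.
    """
    all_sets = list(hits_by_toolkit.values())
    # Compounds flagged by every toolkit.
    consensus = set.intersection(*all_sets)
    # Compounds flagged by at least one toolkit.
    union = set().union(*all_sets)
    # Compounds flagged by exactly one toolkit and no other.
    unique = {}
    for name, hits in hits_by_toolkit.items():
        others = [h for n, h in hits_by_toolkit.items() if n != name]
        unique[name] = hits - set().union(*others)
    # Size of the overlap for every pair of toolkits.
    pairwise = {
        (a, b): len(hits_by_toolkit[a] & hits_by_toolkit[b])
        for a, b in combinations(sorted(hits_by_toolkit), 2)
    }
    return {
        "consensus": consensus,
        "union": union,
        "unique": unique,
        "pairwise_overlap": pairwise,
    }


# Hypothetical placeholder data, not taken from the paper.
hits = {
    "CDK": {"cmpd-1", "cmpd-2", "cmpd-3"},
    "RDKit": {"cmpd-1", "cmpd-2", "cmpd-4"},
    "OCL": {"cmpd-1", "cmpd-3", "cmpd-4", "cmpd-5"},
}
summary = compare_hits(hits)
print(len(summary["consensus"]), len(summary["union"]))
```

Using a toolkit-independent identifier such as the InChI as the set element is what makes results from different parsers comparable at all; running the same summary once per PFAS definition would show how much of the disagreement comes from the toolkit and how much from the definition.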

Disciplines :

Chemistry

Author, co-author :

Barnabas, Shadrack J.

Böhme, Timo

Boyer, Stephen K.

Irmer, Matthias

Ruttkies, Christoph

Wetherbee, Ian

KONDIC, Todor ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Environmental Cheminformatics

SCHYMANSKI, Emma ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB)

Weber, Lutz

External co-authors :

yes

Language :

English

Title :

Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS

Publication date :

2022

Journal title :

Digital Discovery

ISSN :

2635-098X

Pages :

10.1039.D2DD00019A

Peer reviewed :

Peer reviewed

Focus Area :

Computational Sciences

Additional URL :

http://xlink.rsc.org/?DOI=D2DD00019A

FnR Project :

FNR12341006 - Environmental Cheminformatics To Identify Unknown Chemicals And Their Effects, 2018 (01/10/2018-30/09/2023) - Emma Schymanski

Available on ORBilu :

since 28 January 2023

Statistics

Number of views

204 (3 by Unilu)

Number of downloads

154 (1 by Unilu)
