Computational platforms and environments; Data integration; Databases; Proteome informatics
Abstract:
Recent advances in AI-based methods have revolutionized the field of structural biology. Concomitantly, high-throughput sequencing and functional genomics have generated genetic variants at an unprecedented scale. However, efficient tools and resources are needed to link these disparate data types: to ‘map’ variants onto protein structures, to better understand how the variation causes disease, and thereby to design therapeutics. Here we present the Genomics 2 Proteins portal (https://g2p.broadinstitute.org/), a human proteome-wide resource that maps 20,076,998 genetic variants onto 42,413 protein sequences and 77,923 structures, with a comprehensive set of structural and functional features. Additionally, the portal allows users to interactively upload residue-wise protein annotations (for example, variants and scores), as well as protein structures beyond those available in databases, to establish the connection between genomics and proteins. The portal serves as an easy-to-use discovery tool for researchers to hypothesize the structure–function relationship between natural or synthetic variations and their molecular phenotypes.
Research centre:
Luxembourg Centre for Systems Biomedicine (LCSB): Bioinformatics Core (R. Schneider Group)
Disciplines:
Genetics & genetic processes
Author, co-author:
Kwon, Seulki
Safer, Jordan
Nguyen, Duyen T.
HOKSZA, David ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine > Bioinformatics Core > Research BioCore
MAY, Patrick ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core
Arbesfeld, Jeremy A.
Rubin, Alan F.
Campbell, Arthur J.
Burgin, Alex
Iqbal, Sumaiya
External co-authors:
yes
Document language:
English
Title:
Genomics 2 Proteins portal: a resource and discovery tool for linking genetic screening outputs to protein sequences and structures