; Schneider, Reinhard in Trends in Biochemical Sciences - Regular Edition (1993), 18(4), 120-123

Prediction of protein secondary structure is an old problem and progress has been slow. Recently, spectacular success has been claimed in the blind prediction of the catalytic subunit of the cAMP-dependent protein kinase. When predictions in this and other test cases are assessed critically, some claims of prediction success turn out to be exaggerated, but a kernel of real progress remains: protein structure prediction can be improved substantially when a family of related sequences is available. Enough so that molecular biologists equipped with a new amino acid sequence and a multiple sequence alignment in hand may be tempted to test the new prediction methods.

; Schneider, Reinhard in Nucleic Acids Research (1993), 21(13), 3105-3109

; Schneider, Reinhard in Computation of Biomolecular Structures, Achievements, Problems and Perspectives (1993)

; ; Schneider, Reinhard in Protein Science: A Publication of the Protein Society (1992), 1(3), 409-417

The Protein Data Bank currently contains about 600 data sets of three-dimensional protein coordinates determined by X-ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence.
However, statistical analyses of protein sequence-structure relations require nonredundant data. We have developed two algorithms to extract from the data base representative sets of protein chains with maximum coverage and minimum redundancy. The first algorithm focuses on optimizing a particular property of the selected proteins and works by successive selection of proteins from an ordered list and exclusion of all neighbors of each selected protein. The other algorithm aims at maximizing the size of the selected set and works by successive thinning out of clusters of similar proteins. Both algorithms are generally applicable to other data bases in which criteria of similarity can be defined and relate to problems in graph theory. The largest nonredundant set extracted from the current release of the Protein Data Bank has 155 protein chains. In this set, no two proteins have sequence similarity higher than a certain cutoff (30% identical residues for aligned subsequences longer than 80 residues), yet all structurally unique protein families are represented. Periodically updated lists of representative data sets are available by electronic mail from the file server "netserv @ embl-heidelberg.de." The selection may be useful in statistical approaches to protein folding as well as in the analysis and documentation of the known spectrum of three-dimensional protein structures.

; ; et al in Protein Science: A Publication of the Protein Society (1992), 1(12), 1677-1690

With the completion of the first phase of the European yeast genome sequencing project, the complete DNA sequence of chromosome III of Saccharomyces cerevisiae has become available (Oliver, S.G., et al., 1992, Nature 357, 38-46).
We have tested the predictive power of computer sequence analysis on the 176 probable protein products of this chromosome, after exclusion of six problem cases. When the results of database similarity searches are pooled with prior knowledge, a likely function can be assigned to 42% of the proteins, and a predicted three-dimensional structure to a third of these (14% of the total). The function of the remaining 58% remains to be determined. Of these, about one-third have one or more probable transmembrane segments. Among the most interesting proteins with predicted functions are a new member of the type X polymerase family, a transcription factor with an N-terminal DNA-binding domain related to GAL4, a "fork head" DNA-binding domain previously known only in Drosophila and in mammals, and a putative methyltransferase. Our analysis increased the number of known significant sequence similarities on chromosome III by 13, to now 67. Although the near 40% success rate of identifying unknown protein function by sequence analysis is surprisingly high, the information gap between known protein sequences and unknown function is expected to widen and become a major bottleneck of genome projects in the near future. Based on the experience gained in this test study, we suggest that the development of an automated computer workbench for protein sequence analysis must be an important item in genome projects.

; ; et al in Nature (1992), 358(6384), 287-287

; Schneider, Reinhard in Proteins (1991), 9(1), 56-68

The database of known protein three-dimensional structures can be significantly increased by the use of sequence homology, based on the following observations.
(1) The database of known sequences, currently at more than 12,000 proteins, is two orders of magnitude larger than the database of known structures. (2) The currently most powerful method of predicting protein structures is model building by homology. (3) Structural homology can be inferred from the level of sequence similarity. (4) The threshold of sequence similarity sufficient for structural homology depends strongly on the length of the alignment. Here, we first quantify the relation between sequence similarity, structure similarity, and alignment length by an exhaustive survey of alignments between proteins of known structure and report a homology threshold curve as a function of alignment length. We then produce a database of homology-derived secondary structure of proteins (HSSP) by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability, and sequence profile. Tertiary structures of the aligned sequences are implied, but not modeled explicitly. The database effectively increases the number of known protein structures by a factor of five to more than 1800. The results may be useful in assessing the structural significance of matches in sequence database searches, in deriving preferences and patterns for structure prediction, in elucidating the structural role of conserved residues, and in modeling three-dimensional detail by homology.
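
As an illustration of the first selection algorithm summarized in the Protein Science (1992), 1(3) abstract above (successive selection from an ordered list, excluding all neighbors of each selected protein), here is a minimal Python sketch. It is not the authors' implementation. The function names, the chain identifiers, and the example pair data are hypothetical. The neighbor rule uses only the single cutoff quoted in that abstract (more than 30% identical residues over an alignment longer than 80 residues), and ignoring shorter alignments is a simplifying assumption made here.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (chain_a, chain_b, percent_identity, alignment_length); hypothetical record layout
Pair = Tuple[str, str, float, int]


def build_neighbors(
    pairs: Iterable[Pair],
    identity_cutoff: float = 30.0,  # cutoff quoted in the abstract
    min_length: int = 80,           # alignment length quoted in the abstract
) -> Dict[str, Set[str]]:
    """Map each chain to the chains considered similar to it.

    Assumption: alignments of min_length residues or fewer are ignored.
    """
    neighbors: Dict[str, Set[str]] = {}
    for a, b, identity, length in pairs:
        if length > min_length and identity > identity_cutoff:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    return neighbors


def select_representatives(
    ordered_chains: List[str],
    neighbors: Dict[str, Set[str]],
) -> List[str]:
    """Walk the list in order; keep a chain, then exclude all of its neighbors."""
    excluded: Set[str] = set()
    selected: List[str] = []
    for chain in ordered_chains:  # the caller ranks chains, best first
        if chain in excluded:
            continue
        selected.append(chain)
        excluded.add(chain)
        excluded.update(neighbors.get(chain, set()))
    return selected


# Hypothetical example data, for illustration only.
example_pairs: List[Pair] = [
    ("A", "B", 45.0, 120),  # similar
    ("B", "C", 35.0, 100),  # similar
    ("C", "D", 90.0, 40),   # alignment too short, ignored in this sketch
]
example_neighbors = build_neighbors(example_pairs)
print(select_representatives(["A", "B", "C", "D"], example_neighbors))
```

The order of the input list decides which member of a similar group survives, which is how a property of interest (for example, a quality ranking supplied by the caller) can be favored in the selected set.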