Using these alignments as structural templates accurate structural m

hazel
Uploaded On 2022-10-11



Presentation Transcript

Using these alignments as structural templates, accurate structural models covering the full length of the protein (upper right quadrant, Fig. 2) can be constructed for 19% of the proteins in SP + TrEMBL. For an additional 10% of the proteins, such modeling is possible for part of the sequence (lower right quadrant). In all, some structural information, whether full length or not, is available for 43% (19% + 10% + 10% + 4%) of the proteins in SP + TrEMBL.

How many reasonably accurate models can be built for each experimental structure? As the PDB (Protein Data Bank, January 2000) has 3,100 nonredundant structures filtered at 95% sequence identity, the modeling ratio of full length structural models to the number of experimental protein structures is currently (0.19 × [number of sequences in SP + TrEMBL]) / 3,100 ≈ 20. That is, for every unique protein in the PDB, on average 20 reasonably accurate full length models may be built from SP + TrEMBL sequences. We expect this ratio to increase as more and more sequences become available. At the same time, the growing database of experimental structures will in general put more and more sequences within reach of model building based on several alternative structural templates, with an attendant potential gain of modeling accuracy.

Reliable extrapolation from the different current sequence datasets to all natural protein sequences is difficult because of uneven representation of types of proteins and species. Therefore, we computed the structural coverage of a representative set of completely sequenced genomes (not including the recently completed human genome) as a basis for more reliable extrapolation.

nature structural biology • volume 8 number 6 • june 2001

The coverage differs significantly depending on whether one focuses on the fraction of protein sequences with a link to a known structure (the optimistic view, which uses a more permissive threshold in results from PSI-BLAST similarity searches; Table 1) or the fraction of the total number of amino acid residues that can be accurately modeled (the realistic view, which uses the less permissive threshold of 30% minimal sequence identity over aligned regions in results from FASTA searches; Table 1). In the optimistic view, some structural information is available for 30–35% of sequences of genomes. In contrast, in the realistic view, only 5–10% of residues from complete genomes can be placed in accurate structural models. Note that by using more sensitive methods and data from more recent structures, it is possible to increase the fraction of genomic sequences with a link to a fold to ~50% and the fraction of all residues with such a link to ~40% (J. Gough and C. Chothia, pers. comm., see also http://stash.mrc-lmb.cam.ac.uk/superfamily).

Compared with structural coverage of complete genomes, the SP + TrEMBL sequence database (Fig. 2) is clearly biased in that it contains a larger fraction of accurately modelable residues than is found in complete genomes (23% versus 5–10%). To estimate the overall effort in structural genomics, we use genome-based numbers. These numbers are similar in spirit, but different in detail, from those of other studies of structural assignment across completely sequenced genomes.

Scope as a function of desired model quality

We now take a detailed look at the way in which the number of experimental structure determinations required to cover protein space depends on the reliability of homology modeling methods. Anticipating complete organization of protein sequences into domain families across all species, we use the current Pfam collection of protein alignments and perform data simulations with models built using template structures at differing levels of model quality. More precisely, in each simulation run, we set the minimal model quality in terms of a maximal modeling distance or minimal model-template sequence identity.

As an illustration of coverage at different levels of minimal model quality, consider the example of the ras-like protein family (G-domain) in yeast. This is best visualized in a two-dimensional projection (Fig. 3) of a higher-dimensional protein sequence space in which distances between points (each point is a family member) represent the modeling distance between proteins, quantified as percent residue differences between amino acid sequences. At a given minimal modeling quality, a certain number of structural templates (centers of circles) are needed to provide modeling coverage to sequence neighbors (contained in a circle). A decrease in maximal modeling distance (smaller circles) leads to higher accuracy structural models; however, more experimental structure determinations (more circles) must be made to cover family members.

[Figure caption; beginning missing in source] [...] template sequence identity. Data are from all models from CASP2 and CASP3 (Critical Assessment of Techniques for Protein Structure Prediction) for which 80% of the protein residues are modeled. Each [...] target. The range bars delineate the most and the least accurate models [...]. As sequence identity falls below 30%, errors in Cα coordinates rapidly increase. RMSD is root mean square (r.m.s.) positional deviation. The primary cause of this effect is alignment errors between target and template sequences. The alignment error is quantified as the percentage of misaligned residues in 3D. Model data were obtained from the CASP Web site (…center.llnl.gov).

To simulate structural coverage as a function of minimal model quality, we use release 4.4 of PfamA, which contains 2,000 domain families (including, by our definition, 1,626 nonmembrane families) constructed from 260,000 domain sequences. Most Pfam families represent structural domains and are assembled using sequence profiles in the form of hidden Markov models. Manual curators aim at ensuring high quality alignments and accurate definition of domain boundaries. Of the proteins in SP + TrEMBL, 63% have at least one domain in Pfam.

The number of structure determinations required was estimated using a greedy coverage algorithm. The greedy algorithm first selects the structural target (template structure) that would generate the maximum number of models for sequences within the maximal modeling distance. It then selects the target that returns the maximum number of models for the remaining sequences in the family, and so on, until there are no sequences left that cannot be modeled. The algorithm is run repeatedly on a given collection of family alignments for different values of maximal modeling distance. The results are as follows.

Assuming that 30% or better sequence identity is required for accurate modeling, 13,000 experimental structures are required to cover models for all nonmembrane domains (in 1,626 families) in Pfam. Many of these have already been done (we estimate ~35% of all residues using the criteria of Table 1, second column), but the point here is to derive numbers for modeling density in a known family collection and use these for extrapolation to less well-known regions of protein sequence space. Inclusion of membrane associated families in Pfam increases the number of structure determinations required for accurate modeling of all 260,000 sequences in Pfam to 17,000.

How does the number of structure determinations required to cover the Pfam collection depend on desired model quality? First of all, there is a clear trade-off between model quality and experimental effort (Fig. 4): as minimal modeling quality (horizontal axis) increases, more template structures (vertical axis) are required. Above 30% sequence identity, the number of experimental structure determinations increases approximately linearly with the minimal sequence identity between model and template. The slope change at 20% sequence identity represents the current limit of sensitivity for reliably grouping protein domains into families. For modeling distances in the twilight zone of sequence identity (~10–20%), modeling density is highest, so that a single structure determination of any protein from the family is often sufficient to cover all family members, but with a high penalty in model quality. The shape of the curve (Fig. 4) is such that a minimal modeling distance at 30% sequence identity captures most of the savings in effort (decrease in the number of structure determinations), confirming our choice of minimal model quality for the purposes of estimating the scope of structural genomics.

Practical considerations in covering protein space

A number of factors may modify our simple estimates. Here, we consider (i) substantial savings from a slight relaxation of completeness requirements; (ii) realistic success rates of structure determination; (iii) special types of protein sequences; and (iv) variation in target selection strategy.

(i) Quasi-completeness: in computing the minimal number of structure determinations for complete model coverage of all pro[...]

Fig. 2 Current structural coverage of proteins in SP + TrEMBL. Color contours show the density of models that can be currently built as a func[...] a model (vertical axis), and the sequence identity between the modeled protein and the closest known experimental structure (horizontal [...] TrEMBL (release 7 and 11, respectively; SP = SWISS-PROT) using sequence profile searches with PSI-BLAST (see Methods) against proteins with known structures in the PDB (Protein Data Bank). Models to the right of the vertical red line are based on 30% sequence identity over [...] of relatively high quality; these are called ‘accurate’ or ‘reasonably accurate’ models in the text and form the basis of the estimates of modeling density. Models above the horizontal red line cover 80% or more of the sequence. The most useful models (upper right, 19% of the total) cover most of the length of the protein.

Table 1 Two views of current structural [...]

Organism          Optimistic view   Realistic view
[name missing]    36%               10.5%
[name missing]    32%               7.7%
[name missing]    29%               7.8%
[name missing]    32%               6.0%
[name missing]    39%               10.3%
[name missing]    31%               10.1%
[name missing]    31%               9.9%
[name missing]    36%               6.8%
[name missing]    35%               6.5%

Full genus names: Mycoplasma genitalium, Mycoplasma pneumoniae, [...]
Optimistic view: percentage of genomic sequences with a link to a known structure. The fraction of genome sequences for which a partial structural model, including those at low accuracy, can be constructed.
Realistic view: percentage of genome residues in accurate homology models (based on 30% or higher sequence identity).
The coverage of genome sequences and residues was obtained using sequence searches with PSI-BLAST and FASTA (see [...]
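The modeling-ratio arithmetic quoted in the transcript can be checked with a short sketch. The sequence count of SP + TrEMBL is not legible in this copy, so the sketch does not assume one. It back-solves the count implied by the stated figures (19% full-length coverage, 3,100 structures, ratio of about 20). The function names are hypothetical, and the implied count is only approximate because the ratio of 20 is a rounded value.

```python
def modeling_ratio(full_length_fraction, n_sequences, n_structures):
    """Full-length models per experimental structure."""
    return full_length_fraction * n_sequences / n_structures


def implied_sequence_count(ratio, full_length_fraction, n_structures):
    """Invert modeling_ratio to find the database size a given ratio implies."""
    return ratio * n_structures / full_length_fraction


# Figures stated in the text: 19% of proteins, 3,100 structures, ratio ~20.
# The result is an inference from rounded numbers, not a value from the source.
IMPLIED_N = implied_sequence_count(20, 0.19, 3100)

# The four quadrants of Fig. 2 quoted in the text add up to the stated total.
QUADRANT_TOTAL = 19 + 10 + 10 + 4

print(f"implied database size: about {IMPLIED_N:,.0f} sequences")
print(f"proteins with some structural information: {QUADRANT_TOTAL}%")
```

Under these assumptions the stated ratio corresponds to a database of a few hundred thousand sequences.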
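The difference between the optimistic and realistic views of genome coverage comes down to what is counted: sequences with any link to a structure, or residues that sit in accurate models. The following is a minimal sketch of the two measures. The `Protein` record, its field names, and the toy genome are hypothetical and are not taken from the paper's data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Protein:
    length: int               # total number of residues
    has_structure_link: bool  # any hit to a known structure (permissive threshold)
    accurate_residues: int    # residues aligned to a template at >= 30% identity


def optimistic_view(genome: List[Protein]) -> float:
    """Fraction of sequences with a link to a known structure."""
    return sum(p.has_structure_link for p in genome) / len(genome)


def realistic_view(genome: List[Protein]) -> float:
    """Fraction of all residues that can be placed in accurate models."""
    return sum(p.accurate_residues for p in genome) / sum(p.length for p in genome)


# Toy genome (made-up values): two proteins have a structural link, but only
# part of one of them is covered by an accurate model.
TOY_GENOME = [
    Protein(length=300, has_structure_link=True, accurate_residues=150),
    Protein(length=200, has_structure_link=True, accurate_residues=0),
    Protein(length=400, has_structure_link=False, accurate_residues=0),
    Protein(length=100, has_structure_link=False, accurate_residues=0),
]

print(optimistic_view(TOY_GENOME), realistic_view(TOY_GENOME))
```

In the toy genome, half of the sequences have a link to a structure, yet only 15% of residues are accurately modelable. That is the same kind of gap the text reports between the two columns of Table 1.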
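The greedy coverage procedure described in the text can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes a family is given as a symmetric table of pairwise modeling distances (percent residue differences), breaks ties by name order, and uses a made-up five-member family. The names `greedy_templates`, `symmetric`, and `TOY` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def symmetric(pairs: Dict[Tuple[str, str], float]) -> Dict[str, Dict[str, float]]:
    """Expand a dict of pairwise distances into a full symmetric table."""
    names = sorted({n for pair in pairs for n in pair})
    table = {n: {m: 0.0 for m in names} for n in names}
    for (x, y), d in pairs.items():
        table[x][y] = table[y][x] = d
    return table


def greedy_templates(distance: Dict[str, Dict[str, float]],
                     max_distance: float) -> List[str]:
    """Choose templates greedily until every family member can be modeled."""
    members = sorted(distance)  # sorted so ties are broken deterministically
    # reach[t]: members that could be modeled from template t (including t itself)
    reach: Dict[str, Set[str]] = {
        t: {s for s in members if s == t or distance[t][s] <= max_distance}
        for t in members
    }
    uncovered: Set[str] = set(members)
    chosen: List[str] = []
    while uncovered:
        # Select the template that models the most still-uncovered sequences.
        best = max(members, key=lambda t: len(reach[t] & uncovered))
        chosen.append(best)
        uncovered -= reach[best]
    return chosen


# Hypothetical family: distances are percent residue differences (100 - identity).
TOY = symmetric({
    ("a", "b"): 40, ("a", "c"): 55, ("a", "d"): 85, ("a", "e"): 90,
    ("b", "c"): 45, ("b", "d"): 80, ("b", "e"): 88,
    ("c", "d"): 75, ("c", "e"): 82, ("d", "e"): 60,
})

# Rerun for several maximal modeling distances, as the text describes.
for max_distance in (90, 70, 50):
    print(max_distance, greedy_templates(TOY, max_distance))
```

On the toy family, tightening the maximal modeling distance from 90 to 70 to 50 raises the number of templates from one to two to three, which mirrors the trade-off between model quality and experimental effort. A greedy selection of this kind is a heuristic for the underlying set-cover problem and is not guaranteed to return the smallest possible set of templates.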