Protein Engineering vol.10 no.7 pp.757–761, 1997

Relations of the numbers of protein sequences, families and folds

Chun-Ting Zhang

Department of Physics, Tianjin University, Tianjin 300072, China

The relations among the numbers of protein sequences, families and folds have been studied theoretically. It is found that the number of families is related to the natural logarithm of the number of sequences. The logarithmic relation should not be changed regardless of what value of the homology threshold is applied in the protein sequence comparison routines. To study the relation between the numbers of families and folds, the degenerate degree of a fold has been introduced. The degenerate degree of a fold is the number of protein families which adopt the same fold. The distribution of the degenerate degrees of folds has been found to be very likely exponential. Based on the distribution, the average degenerate degree $\bar{d}$ is calculated. The number of folds is simply equal to that of families divided by the average degenerate degree of folds. It is shown that $\bar{d}$ is an increasing function of time. The current value of $\bar{d}$ is about 2. It will continue to increase and reach the value of at least 3.3 in some years. By using the above result, the numbers of protein folds for four species have been estimated. In particular, the number of folds for human proteins is estimated to be ≤5200.

Keywords: degeneracy/degenerate degree/distribution of degenerate degrees/numerical relations/protein families/protein folds/protein sequences

Introduction

Protein sequence pairs with more than 30% residue identity are clustered together into superfamilies, or 30SEQ families (Orengo et al., 1994). For convenience, the 30SEQ family is also called family hereafter in this paper.
It is well established that in most cases each family adopts a unique fold structure, while in the other cases different families may adopt the same fold structure (Sander and Schneider, 1991; Holm et al., 1992; Pascarella and Argos, 1992; Flores et al., 1993; Hilbert et al., 1993; Holm and Sander, 1993; Orengo et al., 1993; Yee and Dill, 1993; Lessel and Schomburg, 1994; Rufino and Blundell, 1994). This implies that the number of protein folds should be less than the number of the families, which should be less than the number of proteins. Therefore, it is reasonable to ask how many folds there are in nature. In other words, there exists an upper limit for the number of the unique folds. The question was probably first raised by Chothia (1992), who estimated the figure to be about 1000. Since then, several research groups have tackled this problem again. However, different results were reported. Blundell and Johnson (1993) estimated the number to be less than 1000, in agreement with the estimate of Chothia (1992), but Alexandrov and Go (1994) and Orengo et al. (1994) reported much larger figures than previously estimated, 6700 and 7920, respectively. Recently, Wang (1996) gave a very low estimate of probably only 400. Obviously, this is an ongoing controversial issue.

© Oxford University Press

The fact that there exists a limited number of protein folds is in agreement with the principles of stereochemistry. Ptitsyn and Finkelstein have pointed out that due to the stereochemical constraints, the possible number of globular protein folds is limited (Ptitsyn and Finkelstein, 1980; Finkelstein and Ptitsyn, 1987). This conclusion is most welcome to researchers in the area of protein structure prediction. The prediction of protein tertiary structure from amino acid sequences based on the principle of free energy minimization has not yet been successful.
In this case, the knowledge-based approach to predicting the tertiary structures of proteins, such as the threading and profile methods (Bowie et al., 1991; Jones et al., 1992), seems to be one of the most promising approaches. The fact that there exists a limited number of protein folds provides a solid basis for such an approach. Therefore, further discussion on the above issue is necessary and meaningful.

In this paper, the relations among the numbers of protein sequences, families and folds are studied theoretically. The numbers of folds for proteins in four species are estimated based on the theory established.

Result of analysis

Logarithmic relation

Three quantities are concerned in our case, i.e. the numbers of protein sequences, families and folds, denoted by $s$, $f_a$ and $f_o$, respectively. Note that all three quantities are functions of time. For example, $s(t)$ indicates the cumulative number of protein sequences found through the year $t$. $f_a(t)$ and $f_o(t)$ indicate the cumulative numbers of protein families and folds found through the year $t$, respectively. It is first important to study the relationships among these quantities. Suppose that there is a protein set consisting of $s$ protein sequences. Let $s$ have an increment $\Delta s$. Accordingly, $f_a$ also has an increment $\Delta f_a$. Obviously, for given $s$ we should have

$$\Delta f_a \sim \Delta s \tag{1}$$

Now, for given $\Delta s$, suppose that

$$\Delta f_a \sim \frac{1}{s_0 + s} \tag{2}$$

where $s_0$ is a constant to be determined later. Equation 2 should be explained. Since the 30SEQ families are based on the identity of residues, for given $\Delta s$, the larger the quantity $s$, the lower is the probability of finding new family members, i.e. the smaller the quantity $\Delta f_a$. Consequently, we have

$$\Delta f_a = k \frac{\Delta s}{s_0 + s} \tag{3}$$

where $k$ is a proportionality constant. Integrating both sides from $t_0$ to $t$, we find

$$f_a(t) = f_a(t_0) + k \ln\left(\frac{s_0 + s(t)}{s_0 + s(t_0)}\right) \tag{4}$$

where $f_a(t)$ and $f_a(t_0)$ are the cumulative numbers of protein families found through the years $t$ and $t_0$, respectively, $s(t)$ and $s(t_0)$ are the corresponding cumulative numbers of protein sequences, and $t > t_0$. Choosing appropriate $t_0$ such that $f_a(t_0) = s(t_0) = 0$, we find

$$f_a(t) = k \ln\left(1 + \frac{s(t)}{s_0}\right) \tag{5}$$

The data for $f_a(t)$ and $s(t)$ from the year 1960 through the year 1992 were given by Orengo et al. (1994). Figure 1 shows the relation between $f_a$ and $s$. The data points are fitted to Equation 5 by using the nonlinear least-squares method. The constants $k$ and $s_0$ are found to be

$$k = 20\,795, \quad s_0 = 58\,412 \tag{6}$$

The solid curve in Figure 1 is drawn according to Equations 5 and 6. We can see that the logarithmic relation shows good validity, indicating the correctness of Equation 5. Note that the last data point in Figure 1, i.e. $s = 28\,000$ and $f_a = 7700$, was also given by Orengo et al. (1994).

Fig. 1. Number of protein 30SEQ families $f_a$ versus the number of protein sequences $s$. Data were obtained from Orengo et al. (1994). The solid curve was drawn using the equation $f_a = k \ln(1 + s/s_0)$ with $k = 20\,795$ and $s_0 = 58\,412$. See the text for more details.

It was correctly pointed out by Eisenhaber et al. (1995) that the number of protein families critically depends on the value of homology threshold applied in the protein sequence comparison routines. However, the logarithmic form of Equation 5 should not be changed regardless of what value of the homology threshold is applied. Actually, only the parameters $k$ and $s_0$ in Equation 5 depend on the value of homology threshold applied.

Fig. 2. Distribution of the degenerate degrees of folds. The integers on the abscissa indicate the degenerate degrees of folds. The height of the bar along the ordinate indicates the number of folds which have the same degenerate degree as shown under the bar.

Degeneracy, degenerate degree and the distribution of degenerate degrees

As pointed out previously, different families may adopt the same fold structure.
In other words, one fold may be associated with more than one family. This phenomenon of protein structure is called degeneracy. The number of families associated with one fold structure is called the degenerate degree of the fold concerned. Generally, the degenerate degree is a positive integer > 1. For convenience, the degenerate degree may also be equal to 1, in which case there is no degeneracy at all.

Recently, a Structural Classification of Proteins Database (SCOP) has been established by Murzin et al. (1995). Based on the sequence alignment, followed by clustering together of structures with more than 30% residue identity, 559 protein families and 286 folds were found in August 1995 (including pre-release). For more details, see also the paper by Wang (1996). The distribution of degenerate degrees for different folds is shown in Figure 2. The 286 folds are divided into 13 degenerate classes. The degenerate degree of the first class consisting of 197 folds is 1, i.e. no degeneracy. The degenerate degree of the second class consisting of 38 folds is 2 and so forth. Based on this result, the density matrix $D$ for the distribution of the degenerate degrees is defined as

$$D = \left( \begin{array}{cccc} d_1 & d_2 & \cdots & d_n \\ p_1 & p_2 & \cdots & p_n \end{array} \right) \tag{7}$$

where $d_1, d_2, \ldots, d_n$ are the degenerate degrees and $p_1, p_2, \ldots, p_n$ are the frequencies of occurrence for the first, second, ... and $n$th degenerate class, respectively. Obviously,

$$\sum_{i=1}^{n} p_i = 1 \tag{8}$$

It has been reported that $n = 13$ through the year 1995 (Murzin et al., 1995; Wang, 1996), so

$$D = \left( \begin{array}{ccccccccccccc} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 14 & 15 & 19 & 20 \\ \frac{197}{286} & \frac{38}{286} & \frac{20}{286} & \frac{12}{286} & \frac{5}{286} & \frac{2}{286} & \frac{1}{286} & \frac{2}{286} & \frac{4}{286} & \frac{2}{286} & \frac{1}{286} & \frac{1}{286} & \frac{1}{286} \end{array} \right) \tag{9}$$

The average degenerate degree $\bar{d}$ is calculated by

$$\bar{d} = \sum_{i=1}^{n} d_i p_i \tag{10}$$

which might be an important parameter for the study of protein structure. Substituting Equation 9 into 10, we find

$$\bar{d} = 1.955 \tag{11}$$

The variance of the degenerate degrees based on the density matrix in Equation 7 is calculated as usual:

$$(\Delta d)^2 = \sum_{i=1}^{n} p_i (d_i - \bar{d})^2 \tag{12}$$

where $(\Delta d)^2$ represents the variance. Using the data in Equations 9 and 11, we find

$$\Delta d = 2.441 \tag{13}$$

where $\Delta d$ is the standard deviation. $\bar{d}$ and $\Delta d$ are two main quantities describing the statistical characteristics of the distribution in Equation 7. Note that $\bar{d}$ is not a constant; generally, $\bar{d}$ is a function of time. Based on the definition of $\bar{d}$ (Equation 10), the relation between $f_a(t)$ and $f_o(t)$ may be simply written as

$$f_o(t) = \frac{f_a(t)}{\bar{d}(t)} \tag{14}$$

where $f_a(t)$ and $f_o(t)$ are the cumulative numbers of protein families and folds found through the year $t$ and $\bar{d}(t)$ is the average degenerate degree of the folds associated with the year $t$. Orengo et al. (1994) introduced a very useful quantity $a(t)$, defined as

$$a(t) = \frac{\Delta f_o(t)}{\Delta f_a(t)} \tag{15}$$

where

$$\Delta f_o(t) = f_o(t) - f_o(t-1), \quad \Delta f_a(t) = f_a(t) - f_a(t-1) \tag{16}$$

$f_o(t-1)$ and $f_a(t-1)$ are the numbers of folds and families, respectively, found through the year $t-1$. According to their explanation (Orengo et al., 1994), the quantity $a(t)$ is the percentage of newly determined non-homologous proteins which adopt novel folds in the year $t$. Equation 15 may be rewritten as

$$a(t) = \frac{\Delta f_o(t)/\Delta t}{\Delta f_a(t)/\Delta t} \tag{17}$$

where the numerator (denominator) indicates the increasing rate of the number of folds (families) and $\Delta t$ is the time increment. That is, $a(t)$ is the ratio of the two rates. In other words, $a(t)$ is the increasing rate ratio of fold/family. The fact that (Orengo et al., 1994)

$$0 \le a(t) \le 1 \tag{18}$$

indicates that the increasing rate of families is faster than that of folds. This result is in agreement with the result that $\bar{d}(t) > 1$.

Simple mathematical inference shows that $\bar{d}(t)$ continues to increase, at least in the coming years. From Equation 14, we have

$$\Delta f_a(t) = \bar{d}(t)\,\Delta f_o(t) + f_o(t)\,\Delta \bar{d}(t) \tag{19}$$

Using Equation 15 to substitute $\Delta f_o(t) = a(t)\,\Delta f_a(t)$, we find

$$\Delta \bar{d}(t) = \frac{\Delta f_a(t)}{f_o(t)} \left( 1 - \bar{d}(t)\,a(t) \right) \tag{20}$$

Since $\Delta f_a(t) > 0$, the condition that $\Delta \bar{d}(t) > 0$ is

$$\bar{d}(t) < \frac{1}{a(t)} \tag{21}$$

As we know from Equation 11, $\bar{d}(1995) = 1.955$. According to Orengo et al. (1994), $a(1995) \approx 0.3$. Hence Equation 21 is valid. The value of $\bar{d}(t)$ will continue to increase in the future until the following condition is satisfied:

$$\bar{d}(t^*) = \frac{1}{a(t^*)} \tag{22}$$

where $\Delta \bar{d}(t^*) = 0$, i.e. $\bar{d}(t^*)$ begins to reach its maximum value $\bar{d}_{\max}$ in the year $t^*$. We think that $\bar{d}_{\max}$ is an important quantity for the study of protein structure. At present, we can say that $\bar{d}_{\max} \ge 3.3$ (i.e. 1/0.3).

Estimates of the numbers of protein families and folds for four species

As pointed out in the Introduction, the estimate of the number of possible protein folds, i.e. $f_o$, is an important yet controversial issue (Chothia, 1992; Blundell and Johnson, 1993; Alexandrov and Go, 1994; Orengo et al., 1994; Wang, 1996). The theory established above provides an alternative approach for estimating this figure. As is well known, there are $(0.5\text{–}1.0) \times 10^5$ protein sequences for the human species (Chothia, 1992). Taking the middle value, we obtain $s = 0.75 \times 10^5$ for humans. Substituting this figure into Equations 5 and 6, we find

$$f_a = 17\,175 \quad \text{for humans} \tag{23}$$

that is, based on the 30SEQ families (Orengo et al., 1994), the number of protein families for humans is about 17 175. To estimate the number of folds for humans, we find by using Equation 14

$$f_o = \frac{f_a}{\bar{d}} \le \frac{17\,175}{3.3} = 5200 \quad \text{for humans} \tag{24}$$

that is, the number of folds of human proteins is ≤5200. The numbers of families and the upper limit numbers of folds for proteins in the species of Escherichia coli, yeast, Caenorhabditis elegans and humans calculated by this method are listed in Table I.

An interesting question may be raised: are the folds for one species relevant to those of another species? An equivalent form of this question is: are there some overlaps among the sets of folds for different species? The answer seems to be ‘yes’.
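The estimates of Equations 23 and 24, and those listed for the other species in Table I, can be reproduced numerically. The Python sketch below evaluates Equation 5 with the fitted constants of Equation 6 and then divides by 3.3, the lower bound on $\bar{d}_{\max}$. The function names and the rounding to the nearest integer are illustrative assumptions, not part of the original analysis; the gene counts are those quoted from Chothia (1992).

```python
import math

# Fitted constants from Equation 6.
K = 20795.0
S0 = 58412.0
# Lower bound on the maximum average degenerate degree (1/0.3, as rounded in the text).
D_MAX_LOWER = 3.3


def families(s):
    """Number of 30SEQ families predicted by Equation 5 for s sequences."""
    return K * math.log(1.0 + s / S0)


def folds_upper_limit(s):
    """Upper limit on the number of folds (Equation 24): families divided by 3.3."""
    return families(s) / D_MAX_LOWER


# Gene counts as quoted in Table I (Chothia, 1992).
genes = {"E.coli": 4000, "Yeast": 7000, "C.elegans": 15000, "Human": 75000}

# (families, fold upper limit) per species, rounded to the nearest integer.
estimates = {
    name: (round(families(s)), round(folds_upper_limit(s)))
    for name, s in genes.items()
}

for name, (fa, fo) in estimates.items():
    print(f"{name:10s} families ~ {fa:6d}   folds <= {fo:5d}")
```

The values in Table I appear to be these results rounded further to the nearest ten or hundred, so small differences in the last digits are expected.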
The principle that governs the folding topologies of proteins is probably independent of the species.

Table I. Estimates of the numbers of protein families and folds for four species

Species      Number of genes(a)   Number of families   Number of folds(b)
E.coli              4000                 1380                 420
Yeast               7000                 2350                 710
C.elegans          15000                 4750                1440
Human              75000                17175                5200

(a) Data obtained from Chothia (1992).
(b) The upper limit values.

Discussion

The degenerate degree may be any positive integer except zero. Looking at Figure 2 or Equation 9, we find that some integers, e.g. 10, 11, 12, 13, 16, 17 and 18, between 1 and 20 are absent. It seems that there is no reason for the absence of these numbers. This is probably due to the bias of experimental work. Furthermore, the maximum degenerate degree is unlikely to be only 20. The fold with the degenerate degree of 20 is the so-called β/α (TIM)-barrel (Murzin et al., 1995; Wang, 1996). If one more family is found in the future which adopts the same fold of β/α (TIM)-barrel, then the maximum degenerate degree is 21 in this case. Generally, the degenerate degrees have the values of positive integers from 1 through $d_{\max}$ successively. One of the most curious questions in the protein-folding studies is $d_{\max} = ?$

Recently, based on a simple lattice model of protein folding, Li et al. (1996) introduced a new concept called the designability of protein structures. The concept is quantified by counting the number of sequences that uniquely fold into the same particular structure. The larger the number, the more designable is the structure. Furthermore, it was shown that the highly designable structures are more stable against sequence mutations and thermal fluctuations and also possess more secondary structure elements and tertiary symmetries (Li et al., 1996). Although the concept was derived from a simple model only, it has some implication for the real protein folding.
The degenerate degree of a fold proposed here is actually the measurement of the designability of this fold. The larger the degenerate degree, the more designable is the fold. The β/α (TIM)-barrel is the most highly designable fold we know so far. Therefore, the question $d_{\max} = ?$ is equivalent to asking the maximum designability degree of folds.

The average degenerate degree $\bar{d}$ defined in Equation 10 is an important parameter describing the distribution represented by the density matrix Equation 7. A reliable conclusion of this paper is that $\bar{d}$ will increase continuously in future years. Furthermore, we conclude that in some years $\bar{d}$ should reach at least 3.3. The present value of $\bar{d}$ is 1.955, as shown in Equation 11. Holm and Sander (1996) estimated that there will be 1600 families and 400 folds found by the end of 1997. If their estimate is correct, $\bar{d}$ will reach the value of 4.0 by the end of 1997. It is reasonable to predict that $\bar{d}$ will continue to increase even after the year 1997. This prediction may be confirmed by the following consideration. Using Equation 17, we have

$$a(1997) = \frac{(400 - 286)/2}{(1600 - 559)/2} = 0.11 \tag{25}$$

Since $\bar{d}(1997) = 4.0 < 1/a(1997) = 9.1$, the above prediction follows immediately. Based on these figures, we may estimate roughly the lower limit of $t^*$. Equation 22 is far from being satisfied in the year 1997. Therefore, the equation will be satisfied in the year 1998 at the earliest, i.e. $t^* \ge 1998$. Even in 1998, it seems that Equation 22 will not be satisfied. It is thought that $t^* > 2000$ is very likely.

A fold with degenerate degree ≥3 was defined as a superfold (Orengo et al., 1994). The introduction of the concept of superfold is important. We do not know why Orengo et al. (1994) chose the integer 3 as the threshold value for defining the superfold. We consider that a suitable threshold value for defining the superfold is 2. Probably the folds with degenerate degree 2 had not been observed when the paper by Orengo et al. (1994) was written.
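How much the choice of threshold matters can be checked directly from the fold counts in Equation 9. The Python sketch below computes the fraction of folds whose degenerate degree is at least a given threshold, for the threshold of 3 used by Orengo et al. (1994) and for the threshold of 2 suggested here. The function name `superfold_fraction` is hypothetical and is introduced only for illustration.

```python
# Fold counts per degenerate degree, taken from Equation 9 (SCOP, August 1995).
SCOP_COUNTS = {1: 197, 2: 38, 3: 20, 4: 12, 5: 5, 6: 2, 7: 1,
               8: 2, 9: 4, 14: 2, 15: 1, 19: 1, 20: 1}


def superfold_fraction(counts, threshold):
    """Fraction of folds with degenerate degree >= threshold."""
    total = sum(counts.values())
    selected = sum(n for degree, n in counts.items() if degree >= threshold)
    return selected / total


frac_2 = superfold_fraction(SCOP_COUNTS, 2)
frac_3 = superfold_fraction(SCOP_COUNTS, 3)
print(f"threshold 2: {100 * frac_2:.1f}% of folds")
print(f"threshold 3: {100 * frac_3:.1f}% of folds")
```

With a threshold of 2 this counts 89 of the 286 folds (about 31.1%); with a threshold of 3 it counts 51 of the 286 folds (about 17.8%).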
By our definition, except for the singlets, in which there is no degeneracy at all, all the degenerate folds are superfolds. Based on this definition, the percentage of the superfolds over the folds is 31.1% by using the current density matrix Equation 9. In other words, about one third of folds are superfolds. By the concept of designability, the superfolds are those folds which are probably more designable. The tertiary structures for nine superfolds were shown in the paper by Orengo et al. (1994), in which the TIM barrel and the up–down 4 α-helical bundle, etc., were included. Interestingly, there are really more secondary structure elements (helix and sheet) and tertiary symmetries in these superfolds. If the concept of designability is correct, it is expected that these superfolds should possess more thermodynamic stability against thermal fluctuations and other perturbations. Furthermore, the superfolds should fold more readily kinetically. All of these could be examined experimentally.

The distribution represented by the density matrix Equation 9 is another interesting problem that needs to be discussed further. To what distribution does Equation 9 correspond? Is it normal or exponential? It is unlikely that Equation 9 is a normal distribution; rather, it could be characterized well as an exponential decay. We will discuss the implication of the exponential distribution below by comparing the two distributions. If the distribution is normal, the density function $g(d)$ is

$$g(d) = \begin{cases} 2 \times \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(d-1)^2}{2\sigma^2}}, & d \ge 1 \\ 0, & d < 1 \end{cases} \tag{26}$$

where $\sigma$ is the standard deviation and $d$ is the degenerate degree. If the distribution is exponential, the density function $e(d)$ is

$$e(d) = \begin{cases} \lambda e^{-\lambda (d-1)}, & d \ge 1 \\ 0, & d < 1 \end{cases} \tag{27}$$

where $\lambda = 1/\sigma$. Based on these equations, it is clear that when $d > 1 + 2\sigma$, then $g(d) < e(d)$. Using $\sigma = \Delta d = 2.44$ (see Equation 13), we find that when $d > 6$, the probability of occurrence of an event in the normal distribution Equation 26 is less than that of the exponential Equation 27. Actually, when $d > 3\sigma$ or $d > 8$, the probability of occurrence of an event in the normal distribution Equation 26 is very small. In contrast, when $d > 8$, the probability of the exponential distribution Equation 27 is still considerably large compared with the normal Equation 26. When $d = 20$, the maximum degenerate degree observed so far, the probability associated with this event in the normal distribution Equation 26 is only $1.6 \times 10^{-10}$ of that of the exponential Equation 27. In other words, it is almost impossible for the event of $d = 20$ to take place if the distribution is normal. It is the exponential rather than the normal distribution that makes the events of $d > 8$ take place with a higher probability. Hence our overall feeling is that the distribution Equation 9 is very likely exponential. However, it is still too early to draw a definite conclusion before more data are available.

The distribution of the degenerate degrees may depend on the database used. To compare with that based on SCOP, the collections of related folds in the Sali–Overington database (Sali and Overington, 1994) are analyzed here. The 105 alignments collected in this database are viewed as 105 folds. Although some folds in the Sali–Overington database are identical with those in SCOP, the former is by no means a subset of the latter. There are 162 30SEQ families in the Sali–Overington database, which are associated with the 105 folds. We find seven degenerate classes, i.e. $d = 1, 2, 3, 4, 5, 8$ and 16. The corresponding density matrix denoted by $D_1$ is

$$D_1 = \left( \begin{array}{ccccccc} 1 & 2 & 3 & 4 & 5 & 8 & 16 \\ \frac{82}{105} & \frac{14}{105} & \frac{2}{105} & \frac{3}{105} & \frac{2}{105} & \frac{1}{105} & \frac{1}{105} \end{array} \right) \tag{28}$$

Accordingly, the average degenerate degree $\bar{d}_1 = 1.54$ and the standard deviation $\Delta d_1 = 1.76$, which may be compared with $\bar{d} = 1.955$ and $\Delta d = 2.44$ in SCOP.
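The summary statistics quoted for the two databases, and the comparison between the normal and exponential densities, can be recomputed from Equations 9, 10, 12, 26, 27 and 28. The following Python sketch is illustrative only: the helper names are hypothetical, and the standard deviation computed from the SCOP counts is used for $\sigma$ in both densities, as assumed in the discussion of Equations 26 and 27.

```python
import math

# Fold counts per degenerate degree: Equation 9 (SCOP) and Equation 28 (Sali-Overington).
SCOP = {1: 197, 2: 38, 3: 20, 4: 12, 5: 5, 6: 2, 7: 1,
        8: 2, 9: 4, 14: 2, 15: 1, 19: 1, 20: 1}
SALI_OVERINGTON = {1: 82, 2: 14, 3: 2, 4: 3, 5: 2, 8: 1, 16: 1}


def mean_and_std(counts):
    """Average degenerate degree (Equation 10) and standard deviation (Equation 12)."""
    total = sum(counts.values())
    mean = sum(d * n for d, n in counts.items()) / total
    variance = sum(n * (d - mean) ** 2 for d, n in counts.items()) / total
    return mean, math.sqrt(variance)


def normal_density(d, sigma):
    """Equation 26: normal density restricted to d >= 1."""
    if d < 1:
        return 0.0
    prefactor = 2.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return prefactor * math.exp(-((d - 1) ** 2) / (2.0 * sigma ** 2))


def exponential_density(d, sigma):
    """Equation 27 with lambda = 1 / sigma."""
    if d < 1:
        return 0.0
    lam = 1.0 / sigma
    return lam * math.exp(-lam * (d - 1))


scop_mean, scop_std = mean_and_std(SCOP)
so_mean, so_std = mean_and_std(SALI_OVERINGTON)

# Ratio of the two densities at the largest observed degenerate degree.
ratio_at_20 = normal_density(20, scop_std) / exponential_density(20, scop_std)

print(f"SCOP:            mean = {scop_mean:.3f}, std = {scop_std:.3f}")
print(f"Sali-Overington: mean = {so_mean:.3f}, std = {so_std:.3f}")
print(f"g(20) / e(20) = {ratio_at_20:.2e}")
```

Under these assumptions the ratio at $d = 20$ is of order $10^{-10}$, which is why an observed degenerate degree of 20 is so hard to reconcile with a normal distribution.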
Generally, the two distribution Equations 9 and 28 are similar.

Conclusion

The relations among the numbers of protein sequences, families and folds have been studied. A logarithmic relation between the numbers of sequences and families has been found. It is important to point out that the logarithmic form should not be changed regardless of what value of the homology threshold is applied to define the families. On the other hand, the relation between the numbers of families and folds is much more complicated than that between the sequences and families. One of the contributions of this paper is that the concept of the degenerate degree of a fold has been introduced. Based on this, the distribution of the degenerate degrees has been studied and found to be very likely exponential. The formalism presented in this paper seems to provide a basis to facilitate the further study of related problems of protein structures.

Data and materials

The data analyzed in this paper were based on SCOP, Release of August 95 (Murzin et al., 1995), which were obtained via URL:http://scop.mrc-lmb.cam.ac.uk/scop/. The SCOP, Release of August 95, analyzed all released PDB entries available at that time. The data on the numbers of sequences and families were obtained from Dr Janet Thornton via e-mail, which were based on the analysis of the SWISS-PROT database, Release 27 (Orengo et al., 1994).

Acknowledgments

The author thanks Dr Z.-X.Wang for some stimulating discussions. He is also grateful to Dr Janet Thornton for sending the data used in Figure 1. This study was supported in part by the Pandeng Project of China and grant 19577104 from the China Natural Science Foundation.

References

Alexandrov,N.N. and Go,N. (1994) Protein Sci., 3, 866–875.
Blundell,T.L. and Johnson,M.S. (1993) Protein Sci., 2, 877–883.
Bowie,J.U., Luthy,R. and Eisenberg,D. (1991) Science, 253, 164–170.
Chothia,C. (1992) Nature, 357, 543–544.
Eisenhaber,F., Persson,B. and Argos,P. (1995) CRC Crit. Rev. Biochem. Mol. Biol., 30, 1–94.
Finkelstein,A.V. and Ptitsyn,O.B. (1987) Prog. Biophys. Mol. Biol., 50, 171–190.
Flores,T.P., Orengo,C.A., Moss,D. and Thornton,J.M. (1993) Protein Sci., 2, 1811–1826.
Hilbert,M., Bohm,G. and Jaenicke,R. (1993) Proteins, 17, 138–151.
Holm,L. and Sander,C. (1993) Nucleic Acids Res., 22, 3600–3609.
Holm,L. and Sander,C. (1996) Science, 273, 595–602.
Holm,L., Ouzounis,C., Sander,C., Tuparev,G. and Vriend,G. (1992) Protein Sci., 1, 1691–1698.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 86–89.
Lessel,U. and Schomburg,D. (1994) Protein Engng, 7, 1175–1187.
Li,H., Helling,R., Tang,C. and Wingreen,N. (1996) Science, 273, 666–669.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Orengo,C.A., Flores,T.P., Taylor,W.R. and Thornton,J.M. (1993) Protein Engng, 6, 485–500.
Orengo,C.A., Jones,D.T. and Thornton,J.M. (1994) Nature, 372, 631–634.
Pascarella,S. and Argos,P. (1992) Protein Engng, 5, 121–137.
Ptitsyn,O.B. and Finkelstein,A.V. (1980) Q. Rev. Biophys., 13, 339–386.
Rufino,S.D. and Blundell,T.L. (1994) J. Comput.-Aided Mol. Des., 8, 5–27.
Sali,A. and Overington,J.P. (1994) Protein Sci., 3, 1582–1596.
Sander,C. and Schneider,R. (1991) Proteins, 9, 56–68.
Wang,Z.-X. (1996) Proteins, 26, 186–191.
Yee,D.P. and Dill,K.A. (1993) Protein Sci., 2, 884–899.

Received December 11, 1996; revised March 12, 1997; accepted March 14, 1997
