Structural, Conformational and Interactional Investigation of Proteins with Related Sequences and Multiple Structures

Homologous proteins are macromolecules with related primary sequences and multiple native structures; together with sequence-unrelated nonhomologous proteins, they constitute the protein universe. Here we make a thorough sample selection and employ quantitative predictions to analyze structures, conformations, steric and hydrophobic interactions, and the underlying molecular mechanisms in proteins via two coarse-grained models (hydrophobic-polar and large-small). First, five empirical relations are determined from nonhomologous samples, correlating the large and hydrophobic residue content of primary sequences with that of the helix and β-sheet structures of functional conformations. When applied to homologous proteins, these empirical relations allow us to survey the interaction performance precisely, identify four types of molecular mechanisms, and compute the stability level of conformational ensembles. In total, 1764 structural inspections capture essential features and furnish structural-interactional insights for homologous proteins, and suggest a fruitful way to better understand conformational variability in biomolecular processes such as protein evolution, dynamics, folding and design.


Introduction
Proteins are specialized molecular machines vital for the existence and proper maintenance of all living organisms. They execute their biological roles through an almost endless variety of functions that depend on their diverse three-dimensional (3D) native structures, which are built from secondary structure elements (mainly helices and β-sheets) and encoded by the amino acid sequences. However, the relationship between sequence and structure is inherently complex: it results from many different driving forces and interactions and involves a multitude of spatial and temporal scales, so that predicting unknown structures from amino acid sequences alone remains unsolved. Despite this long-standing conundrum, many efforts [1][2][3][4][5][6] have been made to reduce the protein sequence-structure gap, ranging from the examination of underlying principles and properties to applications such as better understanding the biological and chemical activities of cells and organs, structure-based discovery of specific inhibitors, and protein structure prediction for rational structure-based drug design aimed at treating human diseases.
One of the simplest ways of gauging the extent of the sequence-structure gap is to compare proteins [7][8][9][10] by means of alignments between sequences and structures, which can be summarized in four broad subsets: (1) alignments with low residue sequence identity (below 25%) that reveal unrelated proteins; (2) alignments with low sequence identity in distantly related proteins whose sequences have changed through evolution and which are generally clustered into a common fold; (3) alignments with considerably high sequence identity (>25%) in proteins that usually have both structural and evolutionary relatedness and are assigned to the same family, and that are often assumed to possess similar structures as well; (4) alignments with very high sequence identity (≳98%) in sequence-similar but structure-dissimilar protein chains. The 25% identity threshold can assume different values depending on the study method and approach used.
The first subset above is commonly applied through preset filters in the advanced search interfaces of macromolecule databases to remove the redundant structures of the third subset and to assemble protein structure libraries. Nonetheless, such a sequence-based criterion for similarity may be harmful, because many proteins with high sequence identity have different structures (the fourth subset), so that filtering them out leads to a loss of relevant structural and functional information [11]. The second and third subsets, on the other hand, are employed in the template-based methods [12][13][14] of threading (or fold recognition) and homology modeling, respectively, to construct a model for a query or target structure from a known template structure. The fourth subset comprises special proteins with equal or very similar sequences but reasonably dissimilar structures, and this case is evaluated more thoroughly here.
Factors contributing to structural differences in sequence-identical proteins (the fourth subset above) [11][15][16][17][18] typically include: alternative conformations (e.g. a protein crystallized in different space groups, or alternative fits to the same NMR/crystallographic data); solvent (crystallization conditions with different pH or salt concentrations); temperature; apo versus ligand-bound forms of a protein; inter- or intra-chain interactions, such as those due to different quaternary protein-protein contacts, point mutations, or oxidized versus reduced disulfide bridges; and motions of large fragments or domains.
Here the sequence-structure correlation is explored using two coarse-grained models, HP (hydrophobic-polar) and LS (large-small), in a quantitative, empirical approach applied especially to homologous proteins, in which one sequence can assume conformational multiplicity and functional diversity. This paper is arranged as follows. Section 2 sets out the methodology for the selection of nonhomologous and homologous proteins, the energetics (molecular interactions), the secondary structure elements, and the structural variables used. Section 3 then presents the initial results: the sample selection and the sequence-structure correlation computed for nonhomologous proteins via five empirical linear relations. In 684 structural-interactional inspections of homologous proteins, the linear relations are used to examine the individual and mutual action of steric and hydrophobic interactions through four types of molecular mechanisms, to quantify the strengths of these interactions, and to measure the stability level of protein conformational ensembles. Lastly, the main conclusions are summarized.

Definition of Homologous Proteins and Protein Structure Library
The terms nonhomologous and homologous proteins used in this paper must first be defined. Experimentally determined macromolecular structures deposited in the Protein Data Bank (PDB) [19] are culled under the following conditions: (i) non-redundant chains whose primary sequences differ in at least three residues are included in the set of nonhomologous or sequence-unrelated proteins; only primary structures consisting solely of the 20 types of naturally occurring amino acids are examined. (ii) For redundant protein chain pairs whose primary sequences differ in zero (100% sequence identity), one or two residues and whose secondary structure elements (helices and/or β-strands) differ in fewer than four residues, one chain is removed and the other is kept as a member of the nonhomologous set. (iii) Related chain pairs whose primary sequences differ in zero, one or two residues and whose helix or strand segments differ in at least four residues are both inserted into the set of homologous or sequence-related proteins. The extension to one and two residue differences in primary sequences allows our approach to be employed to explore mutation-induced fold changes, protein evolution and misfolding [20][21]. As a consequence of conditions (i)-(iii), a protein pair is considered redundant only when both its sequences and its structures are highly similar, and homologous when the two proteins have similar sequences and dissimilar structures. To select homologous proteins of a given chain length with N residues, a residue-by-residue alignment and comparison of sequences and secondary structure elements is employed together with condition (iii) for each protein pair (Figure 1). In a helix and/or strand ensemble with n homologous proteins, the total number of protein pair combinations C_{n,2} [22] is obtained by:
C_{n,2} = \frac{n!}{2!\,(n-2)!}    (1)
For instance, if n is equal to 2, 3, 4 (Figure 1), 5, 6 or 7 N-residue proteins, then C_{n,2} is 1, 3, 6, 10, 15 or 21, respectively.
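As a quick check of relation (1), the pair counts can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name pair_count is ours and is not part of the original method.

```python
from math import comb


def pair_count(n: int) -> int:
    """Number of unordered protein pairs C(n,2) in an ensemble of n proteins."""
    return comb(n, 2)


# Ensemble sizes quoted in the text (n = 2 ... 7).
counts = [pair_count(n) for n in range(2, 8)]
print(counts)

# Pair combinations for a set of 126 chains, as used for the 70-residue sample.
print(pair_count(126))
```

The same function gives the number of alignments that have to be inspected for any ensemble size, which grows quadratically with n.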

Molecular Interactions
Proteins make use of a rich repertoire of amino acid residues whose physico-chemical properties are strategic for their molecular and cellular activities. Among these properties, volume and hydrophobicity have been recognized as indispensable in the selection and maintenance of native conformations and biological functions [3][24][25][26], and they relate to steric and hydrophobic interactions, respectively. Here, the volume and hydrophobicity of the 20 natural amino acids (single-letter codes) are assigned by binary codes [27][28][29], large-small (LS) and hydrophobic-polar (HP), in the following subgroups: large-hydrophobic (F, H, I, L, M, V, W, Y), large-polar (E, K, Q, R), small-hydrophobic (A, C, P, T), and small-polar (D, G, N, S). Together, the residue-level LS and HP codes form a four-letter description that captures essential features and information for proteins, especially when the results of both are compared: the first refers to steric hindrance and macromolecular packing, and the second to the hydrophobic interaction and effect. The large and hydrophobic components are predominant, each separately accounting for 12 of the 20 amino acids, and they are therefore the ones taken into account in the results below.
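The residue grouping above can be written down as a small encoder. The following Python sketch is illustrative only: the four subgroups are taken from the text, while the function name encode and the decision to reject non-standard residues with an error are our assumptions.

```python
# Residue subgroups as listed in the text (one-letter codes).
LARGE_HYDROPHOBIC = set("FHILMVWY")
LARGE_POLAR = set("EKQR")
SMALL_HYDROPHOBIC = set("ACPT")
SMALL_POLAR = set("DGNS")

LARGE = LARGE_HYDROPHOBIC | LARGE_POLAR
HYDROPHOBIC = LARGE_HYDROPHOBIC | SMALL_HYDROPHOBIC
ALL_RESIDUES = LARGE | SMALL_HYDROPHOBIC | SMALL_POLAR


def encode(sequence: str) -> tuple[str, str]:
    """Return the LS (large/small) and HP (hydrophobic/polar) codes of a sequence."""
    seq = sequence.upper()
    unknown = set(seq) - ALL_RESIDUES
    if unknown:
        # Assumption: only the 20 natural amino acids are accepted.
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    ls = "".join("L" if r in LARGE else "S" for r in seq)
    hp = "".join("H" if r in HYDROPHOBIC else "P" for r in seq)
    return ls, hp


ls_code, hp_code = encode("FEAD")
print(ls_code, hp_code)                       # one residue from each subgroup
print(ls_code.count("L"), hp_code.count("H"))  # numbers of large and hydrophobic residues
```

Counting the letters L and H in the two codes gives the numbers of large and hydrophobic residues of a sequence.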

Computation of Residue Sequences and Secondary Structures
In the primary structures, the residue sequences can be expressed by the total numbers of large (N_L) and hydrophobic (N_H) residues. A sequence with a given N_i may have none, one, two or many associated proteins, where the subscript index "i" accounts for the large (L) and hydrophobic (H) residues at both the primary and secondary structural levels. For the two periodic secondary structure elements (helices, and β-sheets constituted by strands), very short overall lengths L_j are not considered (only elements with L_j > 6 residues are kept) in order to obtain more reliable measures, since proteins are dynamically diffusive and subject to environmental perturbations [18,30]; the index "j" stands for the (3_{10}, α, π)-helices (h) and β-strands (e). Furthermore, turns and coils are less accurately determined regions than helices and strands; hence, they are not inspected here.

Sequence-Structure Variables and Their Accuracies
For proteins, the total numbers t_{i,j} of large and hydrophobic residues in secondary structure elements of lengths L_j give the real proportion p_{i,j} (in percentage), measured by:
p_{i,j} = 100 \, \frac{t_{i,j}}{L_j}    (2)
where p_{i,j} ranges from 0 (whenever L_j contains no large or hydrophobic residues, t_{i,j} = 0) to 100% (whenever L_j consists entirely of these residues, t_{i,j} = L_j). The estimated proportions p_{i,j} of large and hydrophobic residues in helices and strands are taken directly from prediction equations obtained by linear fits to PDB data, as shown below. The accuracy of our predictions is obtained by measuring ∆p_{i,j}, the modulus of the dissimilarity between the real and estimated proportions p_{i,j}, through:
\Delta p_{i,j} = \left| p_{i,j}^{\mathrm{real}} - p_{i,j}^{\mathrm{estimated}} \right|    (3)
where ∆p_{i,j} can vary from zero (both p_{i,j} are equal) to 100% (one p_{i,j} is zero and the other is 100%). More specifically, the prediction accuracy is considered excellent (∆p_{i,j} ≤ 5%, that is, fluctuations with ∆p_{i,j} ≈ 0), good (5% < ∆p_{i,j} ≤ 15%, ∆p_{i,j} ≈ 10%), acceptable (15% < ∆p_{i,j} ≲ 25%, provided that L_j ≲ 15), and bad (for any larger ∆p_{i,j}).
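The proportion, its dissimilarity and the accuracy classes described above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are ours, and the approximate bounds (≲ 25%, ≲ 15 residues) are treated as hard thresholds, which the text does not strictly require.

```python
def proportion(t_ij: int, L_j: int) -> float:
    """Real proportion (in %) of large or hydrophobic residues in an element of length L_j."""
    if L_j <= 0 or not 0 <= t_ij <= L_j:
        raise ValueError("need L_j > 0 and 0 <= t_ij <= L_j")
    return 100.0 * t_ij / L_j


def dissimilarity(p_real: float, p_estimated: float) -> float:
    """Modulus of the difference between real and estimated proportions (in %)."""
    return abs(p_real - p_estimated)


def accuracy(dp: float, L_j: int) -> str:
    """Accuracy class of a prediction; hard cutoffs at 25% and 15 residues are assumed."""
    if dp <= 5.0:
        return "excellent"
    if dp <= 15.0:
        return "good"
    if dp <= 25.0 and L_j <= 15:
        return "acceptable"
    return "bad"


# Hypothetical element: 12 residues, 6 of them large, estimated proportion 58%.
p_real = proportion(6, 12)
dp = dissimilarity(p_real, 58.0)
print(p_real, dp, accuracy(dp, 12))
```

Note that the class "acceptable" depends on the element length as well as on the dissimilarity, so the same ∆p_{i,j} can be acceptable for a short element and bad for a long one.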

Selection of Nonhomologous and Homologous Proteins
In the protein selection for each analyzed chain length N, proteins with post-translational modifications that introduce non-natural amino acids (condition (i) above) were removed first. Next, each pair of database proteins is aligned and compared by residue sequence (condition (i)) and, if necessary, also by secondary structure elements (conditions (ii) and (iii)), and the redundant chains are excluded, so that only the nonhomologous and homologous macromolecules remain. After this stage, they are partitioned into a nonhomologous set and a homologous set (e.g. Figure 1). Figure 2a shows the residue sequence identity of 126 proteins with 70 residues, which provide C_{126,2} = 7875 pair combinations (1). Figure 2b displays the helix and strand dissimilarity of at least four residues for 61 homologous protein pairs with 100% sequence identity from Figure 2a.
The sequence identity (Figure 2a) is frequently less than 25% for pairs of nonhomologous proteins and equal to 100% for homologous ones. The homologous protein pairs (Figure 2b) usually have dissimilarities in either helices or strands, but sometimes in both secondary structure elements, as shown for the 8 pairs numbered 11, 39, 42, 43, 47, 48, 57 and 58. The results for the residue sequence identity (Figure 2a) and secondary structure dissimilarities (Figure 2b) extend reasonably to other chain lengths N, though they are displayed here only for N equal to 70 residues. From Figures 2a,b, the 126 proteins yield 94 nonhomologous and 32 homologous cases. For other N as well, the nonhomologous proteins are more numerous and have more diversified residue sequences than the homologous ones; consequently, the nonhomologous macromolecules are analyzed first.
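The pairwise selection under conditions (i)-(iii) can be sketched in Python as below. This is a hedged illustration, not the procedure actually used to build the sample sets: it assumes two chains of equal length N, per-residue secondary structure strings in which 'h' marks helix, 'e' marks strand and '-' marks anything else, and a structure difference counted at every position where the two assignments differ and at least one of them is helix or strand. All function names are hypothetical.

```python
def sequence_differences(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("chains must have the same length N")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def structure_differences(ss_a: str, ss_b: str) -> int:
    """Positions whose helix ('h') or strand ('e') assignment differs (assumed convention)."""
    if len(ss_a) != len(ss_b):
        raise ValueError("assignments must have the same length N")
    return sum(a != b and (a in "he" or b in "he") for a, b in zip(ss_a, ss_b))


def classify_pair(seq_a: str, seq_b: str, ss_a: str, ss_b: str) -> str:
    """Label a chain pair following conditions (i)-(iii) of the text."""
    if sequence_differences(seq_a, seq_b) >= 3:
        return "nonhomologous"  # condition (i): sequences differ in >= 3 residues
    if structure_differences(ss_a, ss_b) >= 4:
        return "homologous"     # condition (iii): same sequence, different structure
    return "redundant"          # condition (ii): keep only one chain of the pair


# Toy 10-residue example with identical sequences and different assignments.
print(classify_pair("ACDEFGHIKL", "ACDEFGHIKL", "hhhhhh----", "--eeee----"))
```

In a full selection, a function of this kind would be applied to every one of the C_{n,2} pairs of a given chain length.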

Measurement of Sequence-Structure Correlation for Nonhomologous Proteins
For nonhomologous proteins of each chain length N, the numbers N_i of large and hydrophobic residues in the primary structures are computed individually, and the normalized quantities n_i (= N_i/N) are then compared with the real proportions p_{i,j} (2) of these residues in the secondary structure elements, helices and strands. Though p_{i,j} and n_i are apparently uncorrelated quantities, plots of p_{i,j} as a function of n_i (Figure 3) are made for 317 helix and 223 strand samples, giving a total of 1080 experimental data points, whose linear fits have the general form for the estimated p_{i,j}:
p_{i,j} = m \, n_i + b    (4)
where m and b are the slope and intercept, R is the linear correlation coefficient of the fit, and their specific values (4(a)-(e)) are displayed in Figure 3. The 1080 points of p_{i,j} versus n_i (linear relations (4(a)-(e)) in Figure 3) express how the information on large and hydrophobic residues is transferred from primary to secondary structures of folded conformations determined by X-ray crystallography, NMR spectroscopy or electron microscopy. These relations depend on the types of residues and secondary structure elements: the large residues in helices and the hydrophobic residues in strands have less steep straight lines (slopes m ≲ 80.0 (4(a),(d),(e))) than the large residues in strands and the hydrophobic residues in helices (m ≈ 100.0 (4(b),(c))). 470 out of 540 nonhomologous samples have both points (both excellent or good, ∆p_{i,j} ≤ 15% (3)) close to their straight lines, resulting from a concurrent and efficacious use of the large and hydrophobic residues from primary to secondary structures via a doubly effective molecular mechanism.
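A linear fit of the form p_{i,j} = m n_i + b, together with its correlation coefficient R, can be obtained by ordinary least squares. The Python sketch below is an assumption about how such a fit may be computed (the fitting procedure is not specified in the text), and the data points are hypothetical values chosen only to make the example run; they are not taken from Figure 3.

```python
from math import sqrt


def linear_fit(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Least-squares slope m, intercept b and Pearson correlation coefficient R."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    m = sxy / sxx
    b = mean_y - m * mean_x
    r = sxy / sqrt(sxx * syy)
    return m, b, r


# Hypothetical (n_i, p_ij) points, NOT data from the paper.
n_i = [0.40, 0.50, 0.60, 0.70]
p_ij = [45.0, 52.0, 61.0, 66.0]
m, b, r = linear_fit(n_i, p_ij)
print(f"m = {m:.1f}, b = {b:.1f}, R = {100 * r:.1f}%")
```

With fitted m and b, an estimated proportion for any protein follows from its primary sequence alone, through n_i.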
Another 56 protein samples have a more efficacious and compensating employment of one residue type (just one ∆p_{i,j} ≤ 15%) from primary to secondary structures by means of a singly effective mechanism. The 14 remaining samples show a subtle employment of residues (both or one acceptable, ∆p_{i,j} ≲ 25% with L_j ≲ 15), thus using a partially effective mechanism. Consequently, no sample possesses native structures with a bad mechanism for both residue types (both with bad dissimilarities, ∆p_{L,j} and ∆p_{H,j} > 15%). Points far from the straight lines in the single and partial mechanisms of some protein samples have contributed to the low linear correlation coefficients R ≈ 60.0% (4(a)-(e)).
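The four mechanism types can be expressed as a simple decision rule on the pair of dissimilarities of a sample. The Python sketch below is one possible reading of the criteria above, and it rests on assumptions: a residue type counts as effective when its ∆p is at most 15%, and a sample with no effective residue type is called partially effective when at least one ∆p is acceptable (at most 25% in an element of at most 15 residues, taken here as hard cutoffs). The function name is hypothetical.

```python
def classify_mechanism(dp_large: float, dp_hydrophobic: float, L_j: int) -> str:
    """Mechanism type of one sample from its two dissimilarities (in %)."""
    values = (dp_large, dp_hydrophobic)
    effective = sum(dp <= 15.0 for dp in values)
    if effective == 2:
        return "doubly effective"
    if effective == 1:
        return "singly effective"
    # No residue type within 15%: look for acceptable values (assumed hard cutoffs).
    acceptable = sum(dp <= 25.0 and L_j <= 15 for dp in values)
    return "partially effective" if acceptable >= 1 else "bad"


# Hypothetical samples covering the four outcomes.
for sample in [(3.0, 8.0, 20), (22.0, 9.0, 20), (20.0, 22.0, 7), (40.0, 50.0, 7)]:
    print(sample, classify_mechanism(*sample))
```

Applied to every sample of a set, this rule yields the tallies of doubly, singly, partially and badly effective mechanisms.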
The proportions p_{H,e} of hydrophobic residues in strands gave an unsatisfactory straight line (R < 40% for p_{H,e} = 61.9 n_H + 32.9 (4f), 223 samples), which was used only to separate the p_{H,e} points lying below and above it; separate fits to these two groups gave rise to p^1_{H,e} and p^2_{H,e} (4(d) and (e), respectively). This dual behavior of p_{H,e} may be due simultaneously to the long-range character of hydrophobic interactions [31] and to the non-local strands that constitute β-pleated sheets [32][33].
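The use of line (4f) as a separator can be illustrated with a short Python sketch. Only the coefficients 61.9 and 32.9 come from the text; the function names, the labels "4d" and "4e" for the lower and upper groups, and the choice of assigning a point lying exactly on the line to the lower group are our assumptions.

```python
def separator_line(n_H: float) -> float:
    """Value of the separating line (4f) for a normalized hydrophobic content n_H."""
    return 61.9 * n_H + 32.9


def strand_branch(n_H: float, p_real: float) -> str:
    """Return '4d' for points on or below line (4f) and '4e' for points above it."""
    return "4e" if p_real > separator_line(n_H) else "4d"


# Hypothetical strand samples with the same n_H and low/high real proportions.
print(strand_branch(0.5, 50.0))
print(strand_branch(0.5, 80.0))
```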
The five linear relations p_{i,j} ((4(a)-(e)), Figure 3) depend only on the primary sequences (through n_L, n_H and N), and they will be validated by predictions on homologous protein samples, similarly to cross-validation assays in statistics [34]. The focus here, however, is a thorough case study by means of a rule-based approach ((3), (4(a)-(e))), so that it does not suffice to identify the occurrence and determine the quantity of a type of mechanism: the protein names (PDB IDs) should also be furnished precisely whenever necessary. Furthermore, the four types of molecular mechanisms and their amounts should be confirmed, complemented or refuted in the more precise inspections that follow for another separate sample set, the homologous proteins.

Molecular Interactions and Mechanisms in Homologous Proteins
For nonhomologous proteins (Figure 3), the empirical sequence-structure correlations between p_{i,j} and n_i (4(a)-(e)) were determined, the steric and hydrophobic interactions were analyzed, and four types of molecular mechanisms were found. These correlations are now employed as prediction rules to compute estimated p_{i,j} values which, compared with the real p_{i,j} through their dissimilarities ∆p_{i,j} (3), permit us to survey molecular interactions and mechanisms in the secondary structure elements of homologous proteins. Note that to compute an estimated p_{i,j}, it suffices to know the primary sequence of the protein through the normalized quantities n_L or n_H.
The helical structures are inspected first, by means of ∆p_{i,h} (Figure 4), the modulus of the dissimilarity between the real (2) and estimated (4(a),(c)) proportions of large and hydrophobic residues. In addition, a thorough analysis is carried out for samples with troublesome dissimilarities (∆p_{i,h} > 15%), so as to better understand the individual occurrence of the steric and hydrophobic interactions and their acting mechanisms.
In the 194 homologous samples with helix structures and their 388 values of ∆p_{i,h} (Figure 4), 162 samples have both residue types (324 values of ∆p_{L,h} and ∆p_{H,h} ≤ 15%) inside the gray rectangles, and therefore make use of a doubly effective mechanism. In contrast, the 31 once-underlined samples (numbers 2, 12, 18...182, 183) have only one residue type inside the gray rectangles, while the twice-underlined sample, number 45, has none (both acceptable, 15% < ∆p_{i,h} ≲ 25% with L_h = 7); they work with singly and partially effective mechanisms, respectively. Among the once-underlined samples, the steric interactions perform better than the hydrophobic ones, with 21 samples having ∆p_{L,h} inside the gray rectangles. After the helices, a similar procedure is followed for 148 samples with strands and their 296 values of ∆p_{i,e} (Figure 5) between the real proportions (2) and the estimated proportions of large (4b) and hydrophobic (4(d),(e)) residues. The choice between (4(d)) and (4(e)) for the estimated p_{H,e} was based on low or high p_{H,e} values, respectively, as done previously for nonhomologous proteins (Figure 3). 126 out of the 148 samples with strand structures (Figure 5) have both points inside the gray rectangles with excellent or good predictions (252 points with ∆p_{i,e} ≤ 15%), and consequently use a doubly effective mechanism. On the other hand, 21 samples with one point inside the gray rectangles (the once-underlined numbers 1, 7, 14…101, 110) and one sample with none (the twice-underlined number 2, with both ∆p_{i,e} acceptable and L_e = 7) work with a singly and a partially effective mechanism, respectively. In 17 of the 21 once-underlined samples (all except numbers 7, 52, 53 and 67 in Figure 5b), the hydrophobic interactions are more effective than the steric ones.
All 342 homologous protein samples with helix and strand structures (Figures 4 and 5) use double, single or partial mechanisms of steric and hydrophobic interactions, as disclosed by their 684 dissimilarities ∆p_{i,j}. We now analyze visually the homologous samples within conformational ensembles (like Figure 1), and observe that different arrangements of amino acid residues from primary to secondary structures may or may not produce more than one type of mechanism in these ensembles (Figure 6), which allows us to hypothesize the stability levels of such ensembles. Figure 6 shows that the particular disposition of amino acid residues in each strand segment leads to specific fulfillments of the steric and hydrophobic interactions (∆p_{i,e}) and consequently to less or more stable forms, given by the singly (in 4BBSL, 3RZOl, 3RZDl, 4C2M1) or doubly (3GTGl) effective mechanisms, respectively. In addition, these results evidence the intrinsic interaction instability at short strand lengths L_e, which vary from 3 residues (3M3Y1), with one isolated strand not forming a β-sheet, to 13 residues (4C2M1), with four extended strands constituting two antiparallel β-sheets. This instability justifies the cutoff length L_j > 6 adopted earlier in this paper for more precise measures of p_{i,j} and ∆p_{i,j}.
It is important to point out that, for each conformational ensemble, our rule-based approach allows us to compute individually the performance of the steric and hydrophobic interactions (observing ∆p_{L,j} and ∆p_{H,j}) in each native conformation, as well as to identify the existence of one or more sorts of molecular mechanisms. Such interaction performance and detection of mechanisms are visualized in Figure 7 for a conformational ensemble of 18 homologous ribosomal proteins constituted by diversified helical segments. In Figure 7, the four samples numbered two to five (2QOW with ∆p_{L,h} and ∆p_{H,h} equal to 0.6% and 1.3%; 2YKR, 3.1%, 2.3%; 2QOY, 2.2%, 0.2%; 2I2P, 10.5%, 4.3%) use the steric and hydrophobic interactions with a subtle predominance of one over the other, considering that both have excellent or good performances (∆p_{L,h} and ∆p_{H,h} ≤ 15%). The same holds for the 13 other homologous partners, which also have ∆p_{L,h} and ∆p_{H,h} ≤ 15%, and therefore the 17 samples work in a stable form with only the doubly effective mechanism. When this individual numerical characterization is applied to the proteins of Figure 1, the helix ensemble of 4 homologous samples is found to contain one case (1BQT, with ∆p_{L,h} and ∆p_{H,h} equal to 23.7% and 13.6%) with a single mechanism and three cases (1PMX, 11.8%, 1.9%; 2GF1, 4.5%, 4.5%; 3GF1, 12.1%, 12.0%) with a double mechanism.
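The ensemble-level reading of these numbers can be reproduced with a short Python sketch. The ∆p values are the ones quoted above for the helix ensembles of Figures 1 and 7; the function names and the 15% cutoff applied as a hard threshold are our assumptions, and only the four Figure 7 samples whose values are given in the text are included.

```python
THRESHOLD = 15.0  # % cutoff for an excellent or good prediction (assumed hard threshold)


def mechanism(dp_L: float, dp_H: float) -> str:
    """Mechanism of one conformation from its steric and hydrophobic dissimilarities."""
    within = int(dp_L <= THRESHOLD) + int(dp_H <= THRESHOLD)
    return {2: "double", 1: "single", 0: "partial or bad"}[within]


# (dp_L,h, dp_H,h) in %, as quoted in the text.
figure1_helix = {"1BQT": (23.7, 13.6), "1PMX": (11.8, 1.9),
                 "2GF1": (4.5, 4.5), "3GF1": (12.1, 12.0)}
figure7_helix = {"2QOW": (0.6, 1.3), "2YKR": (3.1, 2.3),
                 "2QOY": (2.2, 0.2), "2I2P": (10.5, 4.3)}


def ensemble_mechanisms(ensemble: dict[str, tuple[float, float]]) -> dict[str, str]:
    """Mechanism of every member of a conformational ensemble."""
    return {pdb_id: mechanism(*dps) for pdb_id, dps in ensemble.items()}


fig1 = ensemble_mechanisms(figure1_helix)
fig7 = ensemble_mechanisms(figure7_helix)
print(fig1)
print(set(fig7.values()))  # a single mechanism type indicates a uniform ensemble
```

An ensemble whose members all share the double mechanism is read here as the more stable case, while a mixture of mechanisms signals members with a less stable form.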
With regard to the coupled action of ∆p_{L,j} and ∆p_{H,j}, the homologous (and nonhomologous) samples are comparable, with 84% (87%), 15% (10%), 1% (3%) and 0% (0%) of the samples working via doubly, singly, partially and badly effective molecular mechanisms, respectively. The quantitative agreement between both types of samples across three types of mechanisms indicates that proteins make use of an interactional plasticity: depending on the sample, the secondary structure elements of 3D native structures use either both interactions, more stably, in a large majority (>80%) of the cases, or one interaction or partially both, less stably, in a smaller number of cases (<20%). Therefore, the singularity of sequence and plurality of structures in homologous proteins preserve the interaction performances and the four types of molecular mechanisms uniformly, in addition to validating the five rules p_{i,j} ((4(a)-(e)), Figure 3) that originated from the singularity of sequences and structures in nonhomologous proteins.
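The percentages above follow directly from the sample counts reported in the text (470, 56 and 14 of 540 nonhomologous samples; 162 + 126, 31 + 21 and 1 + 1 of 342 homologous samples). The short Python sketch below only redoes this arithmetic; the variable and function names are ours.

```python
# Sample counts per mechanism, as reported in the text.
tallies = {
    "nonhomologous": {"double": 470, "single": 56, "partial": 14, "bad": 0},
    "homologous": {"double": 162 + 126, "single": 31 + 21, "partial": 1 + 1, "bad": 0},
}


def percentages(tally: dict[str, int]) -> dict[str, int]:
    """Share of each mechanism, rounded to the nearest percent."""
    total = sum(tally.values())
    return {name: round(100 * count / total) for name, count in tally.items()}


for label, tally in tallies.items():
    print(label, sum(tally.values()), percentages(tally))

# Two dissimilarities (large and hydrophobic) are inspected per sample.
total_inspections = 2 * sum(sum(t.values()) for t in tallies.values())
print(total_inspections)
```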
None of the 1764 inspections of nonhomologous and homologous samples (Figures 3-5) reveals native state structures with a malfunctioning or bad mechanism for both residue types (∆p_{L,j} and ∆p_{H,j} > 15%). The occurrence of a bad mechanism in both ∆p_{L,j} and ∆p_{H,j} for a native protein conformation could indicate atypical interactional behaviors that would call for further inquiry, such as specific interactions with other parts of the protein or with another macromolecule, or a direct influence of other molecular interactions biasing the steric and hydrophobic driving forces measured by ∆p_{i,j}. However, this unsatisfactory mechanism may plausibly occur, in a relevant form measurable by ∆p_{i,j}, in conformational ensembles of denatured states and folding intermediates during protein dynamics. In folding and dynamics, and in other processes such as protein design and evolution, our rule-based approach, using both ∆p_{L,j} and ∆p_{H,j} and four mechanism types, can be a useful tool for investigating the strength and nature of steric and hydrophobic forces.
The current approach, based on two coarse-grained models, is insufficient for sharper measures of the secondary structure composition, as traditionally occurs with these types of models [35][36]; consequently, other approaches, such as semiempirical ones, or higher resolution levels with more letter codes or atomic models, should be evaluated. However, detailed protein approaches are also limited in many respects, since they frequently demand too many computational resources, and the details of molecular interactions and cellular environments that they use or try to capture are still not fully understood [37][38]. In summary, our current low-resolution approach is a suitable instrument for capturing pivotal insights and principles of homologous proteins when quantitatively accurate estimations and systematic investigations are needed and particular details can be suppressed.

Conclusion
Numerous studies have examined homology-derived proteins by template-based methods, via search optimization for sequence-sequence comparisons and multiple sequence and sequence-structure alignments that incorporate information about protein families or folds [39][40][41][42][43][44]; these methods have been used in several kinds of investigation, including homology inference, structure modeling, functional prediction and phylogenetic analysis. In the sequence-structure context, despite considerable research efforts, some relevant questions are not fully understood, such as the key action of fundamental interactions (e.g., steric and hydrophobic/hydrophilic ones), the driving mechanisms resulting from these interactions, and their implications for analyzing conformational ensembles in homologous proteins; such questions have therefore been analyzed here by means of a rule-based approach.
Firstly, nonhomologous proteins showed a direct synchronism between the employment (p_{i,j}) of residues in folded structures and their availability (n_i) in primary structures for the steric and hydrophobic interactions, through five empirical linear relations p_{i,j} (Figure 3) that modulate the different strategies employed by residue volume and hydrophobicity. Then, when used as prediction rules in homologous proteins, these linear relations, through the moduli of dissimilarities ∆p_{i,j} (684 values of ∆p_{i,j}, Figures 4-5), measure the strengths of the individual steric and hydrophobic interactions, check the stability level given by both coupled interactions (looking for ∆p_{L,j} and ∆p_{H,j} ≤ 15%), and identify four types of molecular mechanisms for homologous proteins (84% double, 15% single, 1% partial and 0% bad); they also allow us to visualize the occurrence of one or more types of these mechanisms in helix and strand ensembles (Figures 1, 6-7) of native conformations.
In summary, taken together, our 1764 inspections are intended to contribute better criteria for the selection of conformational ensembles, to capture fundamental aspects of proteins and to furnish structural-interactional insights for native conformations of homologous proteins. They also support the inference that our rule-based approach can potentially be applied to study other proteins and to better understand conformational variability in biomolecular processes such as protein evolution, design, dynamics and folding [29][45][46][47][48][49]. These inspections, obtained via two coarse-grained models, complement other results from simplified approaches, including misfolding and unfolding events [50], comparative modeling to explore protein-like features [51], lattice models for protein folding [52], energy landscape mapping methods for structure predictions [53], and evaluation of knots in proteins [54]. Furthermore, our approach is intended to join other tools and resources [55][56][57][58] to help researchers in the study of protein sequence-structure correlations and to pave the way for improving the general understanding of conformational ensembles in further proteins.

Supplementary Materials
In the present paper, the homologous proteins have been inspected according to their secondary structure compositions, which form conformational ensembles with different sizes and numbers of components (Table 1). The helix ensembles were displayed and analyzed in Figures 1, 4 and 7, and the strand ensembles in Figures 5 and 6. The results considered 248 homologous proteins (total sum of the second column of Table 1) comprising conformational ensembles of 11 sizes (the rows of Table 1) and a total of 78 ensembles (the second column divided by the first column in each row of Table 1), separated by semicolons. Some pair combinations (C_{n,2} (1)) of n homologous proteins possess dissimilarities in both helices and strands, and others in helices or strands only, so that the proteins are divided into 194 samples with helix ensembles (Figure 4) and 148 samples with strand ensembles (Figure 5), totaling 684 dissimilarities ∆p_{i,j}.