PLoS Computational Biology
Accurate contact-based modelling of repeat proteins predicts the structure of new repeats protein families

The authors have declared that no competing interests exist

Article Type: Research Article
Abstract

Repeat proteins are abundant in eukaryotic proteomes. They are involved in many eukaryotic-specific functions, including signalling. For many of these proteins, the structure is not known, as they are difficult to crystallise. Today, using direct coupling analysis and deep learning, it is often possible to predict a protein’s structure. However, the unique sequence features of repeat proteins have made it challenging to use direct coupling analysis for predicting their contacts. Here, we show that deep learning-based methods (trRosetta, DeepMetaPsicov (DMP) and PconsC4) overcome this problem and can predict intra- and inter-unit contacts in repeat proteins. In a benchmark dataset of 815 repeat proteins, about 90% can be correctly modelled. Further, among 48 PFAM families lacking a protein structure, we produce models of forty-one families with estimated high accuracy.

Repeat proteins are widespread among organisms and particularly abundant in eukaryotic proteomes. Their primary sequences contain repeated amino acid segments that give rise to structures with repeated folds or domains. Although the repeated units can often be recognised from the sequence alone, structural information is frequently missing. Here, we used contact prediction to predict the structure of repeat proteins directly from their primary sequences. We benchmark the methods on a dataset comprising all the known repeat structures. We evaluate the contact predictions and the obtained models for different classes of repeat proteins. Further, we develop and benchmark a quality assessment (QA) method specific for repeat proteins. Finally, we used the prediction pipeline for all PFAM repeat families without resolved structures and found that forty-one of them could be modelled with high accuracy.

Bassot, Elofsson, and Soeding: Accurate contact-based modelling of repeat proteins predicts the structure of new repeats protein families

Introduction

Repeat proteins contain periodic units in the primary sequence that are likely the result of duplication events at the genetic level [1]. Repeat proteins emerge through replication slippage [2] and double-strand break repair [3]. This protein class is present in all genomes but is more frequent in eukaryotic organisms [4-6], where repeat proteins are involved in a wide range of functions [7]. In particular, due to their extended structures, repeat proteins often act as molecular scaffolds in protein signalling or in protein complexes, for example the WD40 domain [8] or ankyrin repeats [9,10]. Repeat proteins are usually conserved among orthologs [4,11] while exhibiting a more accelerated evolution and divergence among paralogs [11].

A classification of repeat proteins was proposed by Kajava [12,13] based on the length of the repeat units and the tertiary structure of the repeat units. According to Kajava’s classification, there are five classes of repeat proteins. However, in this study, we ignore classes I and II because there are no available structures for class I, and class II structures fold into coiled coils, which can be predicted using other methods. Moreover, the extreme amino acid compositional bias of many of these proteins makes it difficult to identify the coevolving residues in these two classes.

The dataset used in our study contains three classes of proteins divided into 20 subclasses by their secondary structure, according to RepeatsDB [14] (Fig 1). The three classes are: class III, extended repeats (e.g. α and β solenoids); class IV, closed repeat structures (e.g. TIM barrels, β-barrels and β-propellers); and class V, where the units appear as separate domains on a string. Further, in class V the repeat units are longer than in the other classes.

Repeat protein classification.
Fig 1

Representation of the repeat classes and subclasses as classified in RepeatsDB 2.0 [14].

The solenoid structures (subclasses III.1, III.2 and III.3) dominate class III [13], and the number of repeat units in these proteins varies widely (from 4 to 38), Fig 1. The length of the individual unit is also quite variable (from 10 to 50 residues) [14], with β-solenoids having significantly shorter repeats than α and α/β solenoids [13].

Members of class IV are constrained in variability by the closed fold. Indeed, although the class contains ten subclasses with different units, the number of units varies only between 3 and 16, and proteins with more than ten repeat units are rare. In this class too, the length of the repeat units varies between 10 and 50 residues [13]. Finally, class V proteins are made up of extended repeat units, often longer than 40 residues [14]. Each unit folds into a proper domain, and the units have only a few inter-unit contacts.

Many repeat protein families lack a resolved structure. For these protein families, residue-residue contact prediction is the best method to obtain structural information [15]. Contact prediction methods use residue-residue co-evolution from multiple sequence alignment and identify the residues’ evolutionary constraints imposed by the tertiary protein structure [16]. Nevertheless, repeat proteins are a difficult target for contact prediction; the internal symmetry introduces artefacts in the contact map at a distance corresponding to the repeated units [17].

Here, we benchmark the deep-learning-based contact prediction programs PconsC4 [18], trRosetta [19] and DeepMetaPsicov [20] against GaussDCA [21] on a comprehensive dataset generated from RepeatsDB [14]. The predicted contacts were then used as constraints to generate protein models. The model quality was evaluated by combining the quality assessment scores from Pcons [22] and QmeanDisCo [23] through a random forest regression. Based on the benchmark, we propose models for the protein structures of PFAM protein families missing resolved structures.

Results and discussion

General contact prediction analysis in repeat proteins

To assess the quality of the contact predictions among repeat protein classes, we generated a dataset of proteins from the reviewed entries of RepeatsDB [14], clustered at 40% sequence identity. For each repeat region in the dataset, we also extracted a representative repeat unit and a pair of repeats, obtaining three datasets: i) a single unit dataset; ii) a double unit dataset; iii) a complete repeat region dataset.

For all three sets, multiple sequence alignments (MSAs) and secondary structure predictions were generated. Subsequently, contacts were predicted for each family using the MSA as input for trRosetta, PconsC4, DeepMetaPsicov, and GaussDCA. The performance of the contact predictions was then evaluated for each subclass separately. As expected, the most recent method, trRosetta, outperforms an older deep learning method such as PconsC4 and a simple DCA method such as GaussDCA. Even compared with the more recent DeepMetaPsicov, trRosetta shows a consistent improvement in all but two classes (Fig 2). In general, for all the methods, the predictions for the full-length regions give better results than those obtained after splitting the proteins into smaller units (Fig 3 and S1 Fig). In class V, however, which is composed of entire domains forming repeats of the “beads on a string” type, splitting into units sometimes yields better contact predictions for PconsC4, DeepMetaPsicov, and GaussDCA (S1 Fig).

The precision of contact predictions.
Fig 2

Positive Predictive Value (PPV) for GaussDCA (red), PconsC4 (blue), DeepMetaPsicov (green), and trRosetta (orange).

Here, it should be remembered that trRosetta, PconsC4 and DeepMetaPsicov, in addition to other information, use DCA predictions as an input and then learn to recognise specific patterns [18]. Therefore, artefacts present in the DCA predictions might propagate into these methods. In Fig 4, selected contact maps are shown as examples. The GaussDCA predictions contain periodic artefacts: wrong predictions (red dots) forming diagonal lines between equivalent positions in the repeat units. PconsC4, DeepMetaPsicov and trRosetta appear efficient in removing the artefacts seen in GaussDCA. It can also be noted that there is only limited overlap between our repeat protein set and the training sets of PconsC4 and DeepMetaPsicov: 25 out of 2856 and 29 out of 3456 proteins are identical, respectively. Further, the shared proteins are not predicted with higher precision than the other proteins (S2 Fig). trRosetta instead has a much bigger training set of 15,051 proteins [19]. To the best of our knowledge, the IDs of these proteins are not available, so we cannot test the performance for the shared proteins. However, the high general consistency shown by trRosetta in our benchmark makes us confident that the results can be generalised and that they are not strongly affected by a potential overlap with the training set.

The precision of contact predictions of trRosetta for the three datasets.
Fig 3

The single unit dataset is shown in blue, the double unit dataset in red, and the complete region dataset in green.

GaussDCA, PconsC4, DeepMetaPsicov, and trRosetta contact maps.
Fig 4

Contact maps for predictions obtained with GaussDCA, PconsC4, DeepMetaPsicov and trRosetta. The real contacts from the structure are shown in grey, the correctly predicted contacts in green, and the falsely predicted contacts in red.

It is well known that, for DCA methods, the prediction quality is directly correlated with the number of sequences in the starting MSA [18]. This trend is also observed here, with trRosetta always showing the best performance (Fig 5).

The relation between Precision and the effective number of sequences in the MSA.
Fig 5

Positive Predictive Value for trRosetta in orange, GaussDCA in red, PconsC4 in blue and DeepMetaPsicov in green, plotted against the Neff value (the effective number of sequences weighted by the length of the protein). Each dot corresponds to one protein in the datasets, and the lines are running averages (n = 50).

Differences among repeat classes in contacts prediction

Fig 2 shows that the fraction of correctly predicted contacts varies among the protein repeat classes and subclasses for all the methods. To clarify the origin of these differences, we investigated the source of the predicted contacts in more depth. One central aspect that affects the difficulty of prediction is the pattern of the contacts [24]. In general, contacts that are part of larger interaction areas or close in the sequence are predicted more accurately. Therefore, we compared the intra-unit and inter-unit contacts predicted by DeepMetaPsicov and trRosetta (Fig 6). Here, we counted the intra- and inter-unit contacts in the PDB structures and selected the same number of predicted intra- and inter-unit contacts. The PPV was then calculated as the number of correctly predicted contacts divided by the number of selected predicted contacts.

Predicted contacts analysis.
Fig 6

a) Examples of inter- and intra-unit contacts. b) PPV for intra-unit contacts (red) and inter-unit contacts (blue) predicted by DeepMetaPsicov. The lines are the respective running averages of the PPV over the ratio of inter-unit contacts to the total number of contacts in the protein. c) PPV for intra-unit contacts (red) and inter-unit contacts (blue) predicted by trRosetta. The lines are the respective running averages of the PPV over the ratio of inter-unit contacts to the total number of contacts in the protein.

On average, the intra-unit contacts are predicted with higher accuracy than the inter-unit contacts by both DeepMetaPsicov and trRosetta, with trRosetta slightly outperforming DeepMetaPsicov in both.

Protein model generation

For PconsC4 and DeepMetaPsicov, protein models were generated with CONFOLD [25], using the contact predictions from the respective method combined with secondary structure predictions from PSIPRED. For trRosetta, pyRosetta [26] was instead used for model folding, with the predicted distances and angles as input, as described in Yang et al. [19]. Here, no secondary structure predictions were used.

In Fig 7, we compare the TM-scores between the models and the corresponding PDB structures. Here, the trRosetta pipeline outperforms the other two methods in all classes. It should be noted, however, that the use of distances and angles instead of contacts is mainly responsible for the difference in performance from DeepMetaPsicov, which is only slightly inferior when compared on contact prediction precision. In total, with trRosetta, 732 models out of 815 (89.8%) are predicted with a TM-score of at least 0.5.

Protein model quality.
Fig 7

TM-score for the subfamilies; Models from trRosetta in orange, PconsC4 in blue and DeepMetaPsicov in green.

Quality assessment of the models

To evaluate the quality of the models obtained by trRosetta, we compare the TM-scores of the models with the quality assessment scores from Pcons [22] and QmeanDisCo [23]. Because the models are generally of high quality, both quality assessment methods fail to rank a substantial number of models properly (Fig 8A and 8B).

TM-score versus QA methods.
Fig 8

a) TM-score versus Pcons-score for complete region models generated with trRosetta. b) TM-score versus QmeanDisCo score for full region models created from trRosetta contacts.

To improve the quality estimation, we developed a Random Forest Regression method using multiple inputs (Pcons, QmeanDisCo, protein length). Five-fold cross-validation was performed on the complete region dataset. The method obtained an average accuracy of 83.6%, and an average absolute error of 0.09 TM-score, see Fig 9A. The Random Forest Regression predicts the TM-score better than Pcons and QmeanDisCo alone, Fig 9B. We found that nine features were helpful for the prediction of the TM-score, S3 Fig. The most important features are the Pcons score, the local QmeanDisCo score, and protein length.
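The two summary statistics reported above, the average absolute error and the Pearson correlation between real and predicted TM-scores, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the three toy score pairs are hypothetical and are not values from the benchmark.

```python
import math


def mean_absolute_error(real, predicted):
    """Average absolute difference between real and predicted TM-scores."""
    return sum(abs(r - p) for r, p in zip(real, predicted)) / len(real)


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Toy values for illustration only (not data from the study).
real_tm = [0.4, 0.6, 0.8]
predicted_tm = [0.5, 0.6, 0.7]

mae = mean_absolute_error(real_tm, predicted_tm)
correlation = pearson(real_tm, predicted_tm)
```

A low error together with a high correlation is what one would hope to see when a quality estimator both ranks models well and is calibrated on the TM-score scale.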

a) Real TM-score versus Random Forest Predicted TM-score for complete region models generated with trRosetta. b) Pearson correlation coefficient between the TM-score and the QA methods.
Fig 9

Modelling of repeat protein families without known structures

We selected 48 PFAM repeat-families without resolved structure and fed them through the trRosetta structure prediction pipeline.

Among the models, 41 out of 48 (85%) have a predicted TM-score higher than 0.5 (Table 1). For twelve of these families, we could identify a template with a GMQE score [27] higher than 0.4 using Swissmodel [28]. In these cases, homology models were generated for comparison with the contact-based models. We compared the similarity between the contact-based and homology-based models with the predicted TM-score of the contact-based model. For four families (LVIVD, LRR_3, WD40_alt, LGFP), the models obtained by homology agree with the predicted TM-score: the difference between the two TM-scores is below 0.1, i.e. the estimated TM-score agrees with what would be estimated if the homology model were identical to an experimental structure. However, for six other families the quality is overestimated (RHS_repeat, DCAF15_WD40, DUF4116, Phage_fiber_2, RTTN_N, MORN 2), and for two others it is underestimated (DUF5122, FG-GAP_2), see Table 1.
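The grouping described above can be reproduced with a short sketch. It assumes that agreement means an absolute difference below 0.1 between the predicted TM-score and the TM-score between the contact-based and homology models, as stated in the text; the function name `classify` and the dictionary layout are illustrative choices. The numbers are the two corresponding columns of Table 1 for the twelve families with a template.

```python
# family: (predicted TM-score, TM-score between contact-based and homology model),
# copied from Table 1.
families = {
    "RHS_repeat": (0.789065, 0.51),
    "LVIVD": (0.793165, 0.78),
    "DUF5122": (0.685674, 0.82),
    "LRR_3": (0.697779, 0.71),
    "WD40_alt": (0.718139, 0.71),
    "LGFP": (0.779499, 0.85),
    "DCAF15_WD40": (0.677563, 0.56),
    "MORN 2": (0.71293, 0.52),
    "DUF4116": (0.688163, 0.33),
    "Phage_fiber_2": (0.540125, 0.21),
    "FG-GAP_2": (0.640251, 0.82),
    "RTTN_N": (0.6874, 0.58),
}


def classify(predicted, observed, tolerance=0.1):
    """Label a family by how the predicted TM-score compares with the observed one."""
    difference = predicted - observed
    if abs(difference) < tolerance:
        return "agree"
    return "overestimated" if difference > 0 else "underestimated"


labels = {name: classify(*scores) for name, scores in families.items()}
counts = {}
for label in labels.values():
    counts[label] = counts.get(label, 0) + 1
```

With these inputs the sketch yields four families in agreement, six overestimated and two underestimated, matching the grouping given in the text.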

Table 1
In the columns: the family name, the PFAM ID, the Uniprot ID of the sequence used for the modelling, the predicted TM-score, the best template PDB ID, the Swissmodel GMQE score, the sequence identity of the target/template alignment, and the TM-score between the contact-based model and the homology model.
The models of the PFAM families with predicted TM-score.
PFAM family | PFAM ID | Representative protein Uniprot ID | TM prediction | Template PDB ID | GMQE | Identity | TM-score between the contact model and homology model
Nebulin | PF00880 | A0A094KVK3 | 0.483185 | 6b40_A | 0.12 | 18.60% | -
SWM_repeat | PF13753 | A0A0B0HSH2 | 0.45394 | 2ra1_A | 0.21 | 20.72% | -
Plasmod_MYXSPDY | PF07981 | A0A0L7M9B8 | 0.592519 | - | - | - | -
C_tripleX | PF02363 | A0A1A9WU23 | 0.440582 | 4xbm_A | 0.28 | 29.08% | -
RHS_repeat | PF05593 | A0A1G0MXS8 | 0.789065 | 6fay_A | 0.7 | 28.10% | 0.51
Plasmodium repeat MYXSPDY | PF00839 | A0A1I7SWM5 | 0.585336 |  | 0.02 | 9.09% | -
LVIVD | PF08309 | A0A1V1NWB1 | 0.793165 | 4jsn_B | 0.5 | 12.50% | 0.78
DUF5122 | PF17164 | A0A1Z4C3E9 | 0.685674 | 2ymu_A | 0.46 | 19.50% | 0.82
Bacterial tandem repeat domain | PF17660 | A0A252E8A5 | 0.74885 | 4qp0_A | 0.17 | 13.70% | -
SprB | PF13573 | A0A257INW4 | 0.602068 | 2c26_A | 0.31 | 16.36% | -
Lustrin_cystein | PF14625 | A0A2A2LSA2 | 0.521748 | 6nan_A | 0.13 | 20.63% | -
SPW | PF03779 | A0A2A3HD64 | 0.687938 | - | - | - | -
Chlorovi_GP_rpt | PF06598 | A7RAI0 | 0.627261 | - | - | - | -
CRAM_rpt | PF07016 | A7S4G3 | 0.667426 | 4aea9_A | 0.39 | 17.65% | -
Dicty_CTDC | PF00526 | D3BR65 | 0.566775 | 4u8u_N | 0.17 | 19.23% | -
RtxA | PF07634 | D3UXB8 | 0.744867 | 5vgz_A | 0.03 | 20.51% | -
YTV | PF07639 | D5SU36 | 0.591834 | - | - | - | -
LRR_3 | PF07725 | D7MCA5 | 0.697779 | 2omx_A | 0.59 | 18.97% | 0.71
Ice_nucleation | PF00818 | F3GDU0 | 0.714233 | - | - | - | -
LSPR | PF06049 | G1RYA9 | 0.571161 | - | - | - | -
UCH-protein repeats | PF13446 | G1XIQ8 | 0.611722 | 2h5x_C | 0.08 | 22.45% | -
WD40_alt | PF00400 | G3VIY2 | 0.718139 | 5obm_A | 0.59 | 22.78% | 0.71
Lipoprotein_15 | PF03640 | I3BT02 | 0.449038 | 4yx7_A | 0.02 | 16% | -
SSURE | PF11966 | J1S4N0 | 0.509385 | - | - | - | -
LGFP | PF08310 | L8TNF3 | 0.779499 | 6sx4_A | 0.76 | 32.21% | 0.85
zf-C2H2_3rep | PF18868 | O64827 | 0.588789 | 1z9v_A | 0.04 | 12.12% | -
DCAF15_WD40 | PF14939 | Q29AL9 | 0.677563 | 6pai_B | 0.67 | 29.69% | 0.56
SVS_QK | PF10578 | Q6P6X2 | 0.605577 | - | - | - | -
DUF2963 | PF11178 | Q6YQH3 | 0.804535 | 5e9t_D | 0.32 | 24.24% | -
Plasmo_rep | PF12135 | Q7RTC2 | 0.637834 | 4nee_C | 0.07 | 23.08% | -
Curlin | PF07012 | Q8EIH3 | 0.669206 | - | - | - | -
MORN 2 | PF07661 | Q8RH85 | 0.71293 | 1h3i_A | 0.68 | 27.87% | 0.52
OGFr_III | PF04680 | Q9NZT2 | 0.679499 | 5xme_A | 0.21 | 16.09% | -
HNH_repeat | PF18780 | R2SEH8 | 0.648563 | 2xsj_C | 0.24 | 14.67% | -
ChW | PF07538 | R5P8A5 | 0.674395 | - | - | - | -
WG_beta_rep | PF14903 | R6YH89 | 0.681589 | 2ki4_A | 0.06 | 7.41% | -
DUF4116 | PF13475 | R7MCC4 | 0.688163 | 5lu2_A | 0.4 | 17.81% | 0.33
PHINT_rpt | PF14882 | S6TLB9 | 0.460735 | 5lnk_Q | 0.03 | 13.51% | -
Chlam_PMP | PF02415 | S7J9T7 | 0.674383 | 2m7o_A | 0.13 | 25% | -
Xin | PF08043 | T0NQR8 | 0.680028 | 1ixv_A | 0.1 | 8.33% | -
WXXGXW | PF12779 | U2FCE1 | 0.731013 | - | - | - | -
SBBP | PF06739 | U5QIU9 | 0.724106 | 6i3b_A | 0.38 | 24.45% | -
Ish1 | PF10281 | U7Q0S5 | 0.433415 | 1jjr_A | 0.1 | 15.73% | -
Phage_fiber_2 | PF03406 | V5CQL0 | 0.540125 | 5iv5_A | 0.58 | 22.43% | 0.21
FG-GAP_2 | PF14312 | W4LGN0 | 0.640251 | 5ffg_A | 0.56 | 23.04% | 0.82
CXCXC | PF03128 | W5N853 | 0.865153 | 1vgh_A | 0.36 | 26.47% | -
RTTN_N | PF14726 | W5P499 | 0.6874 | 4plr_A | 0.58 | 16.28% | 0.58
WDCP | PF15390 | W5Q8K9 | 0.427924 | 5nnz_A | 0.09 | 12.58% | -

Fig 10 shows the superposition of the contact-based and homology models. Two trRosetta models differ significantly from the homology models: Phage_fiber_2 (PF03406), where the template has a partially disordered extended structure while the trRosetta model is packed, and DUF4116 (PF13475), in which the trRosetta model is folded as an α-solenoid while the homology model is a longer helical bundle.

Comparison between the contact-based model and homology modelling.
Fig 10

The superposition between the contact-based model (red) and the homology model (blue) and respective TM-score.

The other 36 families do not have suitable templates, and, therefore, we cannot compare their trRosetta models with a homology-based model. However, the quality assessment shows high scores for the vast majority of the models.

Here we describe a few interesting models in more detail; all the models are available at https://figshare.com/articles/dataset/Repeats_Proteins_contact_prediction_based_modelling_datasets/9995618. We encourage others to investigate the other models in detail.

SPW family (PF03779)

According to the PFAM database [29], the SPW family is present in Bacteria and Archaea, and each protein consists of one or two repeat units. Some members also contain an additional domain, either a Vitamin K epoxide reductase (PF07884) or a NAD-dependent epimerase/dehydratase (PF01370). Each repeat unit is formed by two transmembrane alpha-helices and is characterised by an SPW motif [30]. According to our model, the repeated motif is buried in the membrane, symmetrically located close to the extracellular side (Fig 11B). PFAM architectures show many proteins with only a single SPW motif; however, a more careful analysis of these sequences shows that in many cases they contain a second, degenerate SPW unit in which the proline residue is conserved (S4 Fig).

Selected models.
Fig 11

The different protein units are coloured in red and blue. a) SPW; b) SPW, with the “SPW” motif in red; c) Curlin; d) UCH-protein; e) Xin repeat.

The tryptophan is on the outer side of the protein, facing the bilayer, while the proline is on the inner side of the protein, promoting the formation of a kink in the transmembrane helix [31]. The protein contains a Ser-Pro motif, which is rare among TM proteins and most likely increases the bending effect of the proline significantly through its hydrogen bond pattern [32].

Curlin repeats family (PF07012)

Here, the trRosetta model has a higher predicted TM score (Table 1) and agrees better with information from the available literature [37]. Curlin is predicted to have a β-solenoid structure, see Fig 11C. DeBenedictis et al. presented ab-initio models for two members of the Curlin repeat family, CsgA and CsgB [37]. The structure of their best models is visually in agreement with our model (a direct comparison is difficult as the coordinates are not available for their models). Our model is also in agreement with the partial structure of the repeat units of CsgA published by Perov et al. [38]. This model contains two parallel β-sheets with individual units situated perpendicular to the fibril axis (corresponding PDB IDs are 6G8C, 6G8D, 6G8E).

UCH-protein (PF13446)

Our model (Fig 11D) suggests that this repeat region belongs to class V.1 (α-beads), with four helical domains separated by flexible linkers.

The UCH-protein repeats family is a repeat domain found in ubiquitin carboxyl-terminal hydrolases. Although UCH-proteins are widespread among eukaryotes, the repeated domain is present only in yeasts, in a variable number of units. According to PFAM [29], the UCH-protein repeats could be involved in the formation of a complex of UCH with Rsp5 and Rup1.

Xin repeat (PF08043)

The Xin repeat is a motif with a variable number of units, known for binding and stabilising F-actin [33]. In mouse and chicken, it is located in the adherens junction complex [33]. In humans, Xin-repeat proteins are involved in the developmental and adaptive remodelling of the actin cytoskeleton [34], behaving as scaffold proteins with multiple interaction partners: I) they interact with the EVH1 domain of Mena/VASP/EVL [34]; II) they interact with the SH3 domain of Nebulin and Nebulette, although the binding site is located in a disordered region [35]; III) they interact with Aciculin [36].

In our model (Fig 11E), the Xin repeats fold as an α-solenoid. This clarifies the fold of Xin-repeat proteins, which are formed by an α-solenoid N-terminus and a long disordered C-terminal region.

Conclusion

Here, we performed a comprehensive coevolution analysis on repeat protein families, and we show that the trRosetta contact prediction method overcomes the traditional difficulties of previous deep learning and DCA methods for this class of proteins. We investigated the modelling of repeat units, and we developed a novel quality assessment method for repeat proteins. Finally, we tested the pipeline on PFAM families without protein structures, showing its usefulness in providing new structural information.

This paper summarises the extraordinary improvement of structure prediction methods in the past few years and shows that it is now possible to predict the structure of 85% of PFAM repeat families satisfactorily.

Materials and methods

Datasets generation

The repeat protein dataset was generated starting from the 3585 reviewed entries in RepeatsDB [14,39]. The proteins of class I and II were removed, and then the dataset was homology reduced using CD-HIT [40] at 40% identity resulting in 815 repeat regions. From this “complete region dataset” two other datasets were generated. First, a “single unit” dataset with one repeat unit from each family, and secondly a “double unit” dataset with two. In the two derived datasets, the representative units were selected, avoiding or at least minimising, the presence of insertions.

The dataset of repeat protein families without resolved structures was generated by collecting all the repeat protein families lacking structural information in PFAM [29] as of May 2019 and removing domains with a significant overlap with predicted disorder. This resulted in 48 protein families. The representative sequence for each repeat family was chosen according to two criteria: 1) select the most common architecture; 2) include, when possible, at least three repeat units.

Multiple sequence alignment (MSA)

The multiple sequence alignments (MSAs) were generated with HHblits [41], using an E-value cutoff of 0.001, against the Uniclust30_2017_04 database [42]. The number of effective sequences of the alignment, expressed as the Neff score, was calculated by HHblits and used for subsequent analysis. More detail about the Neff calculation can be found at https://github.com/soedinglab/hh-suite/wiki [41].
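As an illustration of the alignment step, the sketch below assembles an HHblits command line with the E-value cutoff given above. It is a minimal sketch rather than the authors' script: the file names and the database path are hypothetical placeholders, and all other HHblits settings (for example the number of iterations, which the text does not specify) are left at their defaults. The command is only built here, not executed.

```python
def hhblits_command(query_fasta, database_prefix, output_a3m, e_value=0.001):
    """Build the argument list for an HHblits search (does not run it)."""
    return [
        "hhblits",
        "-i", query_fasta,       # query sequence (hypothetical file name)
        "-d", database_prefix,   # path prefix of the Uniclust30 database (hypothetical)
        "-e", str(e_value),      # E-value cutoff for including sequences
        "-oa3m", output_a3m,     # write the resulting MSA in A3M format
    ]


command = hhblits_command(
    "repeat_region.fasta",
    "databases/uniclust30_2017_04",
    "repeat_region.a3m",
)
```

On a machine with HH-suite installed, the list could be passed to `subprocess.run(command, check=True)`.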

Contact prediction and models generation

For DeepMetaPsicov and PconsC4, the protein models were generated following the PconsFold2 protocol [43]. The secondary structure of the repeat regions was predicted by PSIPRED [44]. Protein contacts were predicted using DeepMetaPsicov [45] or PconsC4 [18] and, together with the secondary structure predictions, used as input to CONFOLD [25]. The modelling used the top-scoring 1.5 L contacts (where L is the length of the modelled region).

The trRosetta models were obtained by running trRosetta locally [19] and using the predicted distances and angles as input for pyRosetta [26].

Contacts analysis

A protein contact was defined as two residues with a beta carbon distance equal to or lower than 8 Å in the PDB structure and separated by more than five residues in the sequence. Using this definition, we assessed the fraction of correctly predicted contacts, the Positive Predictive Value (PPV), taking into account the top-scoring 1.5 L contacts.
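The contact definition and the PPV over the top-scoring 1.5 L predictions can be sketched as follows. This is an illustrative sketch, not the code used in the study: it assumes that "more than five residues" means a sequence separation of at least six, that predictions arrive as a dictionary of pair scores, and that the function names and the toy coordinates are hypothetical.

```python
import math


def true_contacts(cb_coords, max_dist=8.0, min_sep=6):
    """Residue pairs (i, j), i < j, whose beta carbons lie within max_dist
    and that are at least min_sep positions apart in the sequence."""
    contacts = set()
    n = len(cb_coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(cb_coords[i], cb_coords[j]) <= max_dist:
                contacts.add((i, j))
    return contacts


def ppv_top_contacts(scores, contacts, length, factor=1.5, min_sep=6):
    """PPV of the top-scoring factor * L predicted pairs.

    scores maps (i, j) pairs with i < j to a predicted contact score.
    """
    eligible = [(s, pair) for pair, s in scores.items() if pair[1] - pair[0] >= min_sep]
    eligible.sort(reverse=True)
    top = eligible[: int(factor * length)]
    if not top:
        return 0.0
    correct = sum(1 for _, pair in top if pair in contacts)
    return correct / len(top)


# Toy chain: seven residues on a line, with an eighth folded back near the start.
coords = [(3.8 * i, 0.0, 0.0) for i in range(7)] + [(0.0, 5.0, 0.0)]
native = true_contacts(coords)
predicted_scores = {(0, 7): 0.9, (0, 6): 0.8, (1, 7): 0.4, (2, 5): 0.99}
ppv = ppv_top_contacts(predicted_scores, native, length=len(coords))
```

In the toy example, the pair (2, 5) is ignored because it is too close in sequence, and two of the three remaining predictions are native contacts.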

Since trRosetta predicts distances instead of contacts between the residues, we summed the probabilities of the distance bins equal to or shorter than 8 Å, as in Greener et al. [20], in order to compare them with the contacts predicted by the other methods.
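Converting a predicted distance distribution into a contact probability can be sketched as below. The bin layout is an assumption about the trRosetta output, not something stated in the text: the sketch assumes an L x L x 37 array in which the first bin collects non-contacts and the remaining 36 bins span 2 to 20 Å in 0.5 Å steps, so that the bins up to 8 Å are indices 1 to 12. The function name and parameters are hypothetical.

```python
import numpy as np


def contact_probabilities(distogram, first_bin=1, bin_start=2.0, bin_width=0.5, cutoff=8.0):
    """Sum the probabilities of all distance bins up to the contact cutoff.

    distogram: array of shape (L, L, n_bins); bin 0 is assumed to mean
    'no contact' and is skipped.
    """
    n_contact_bins = int(round((cutoff - bin_start) / bin_width))  # 12 bins for 2-8 A
    return distogram[:, :, first_bin:first_bin + n_contact_bins].sum(axis=-1)


# Toy input: a uniform distribution over 37 bins for a two-residue "protein".
toy_distogram = np.full((2, 2, 37), 1.0 / 37)
toy_contacts = contact_probabilities(toy_distogram)
```

The resulting L x L matrix can then be ranked and evaluated in the same way as the contact scores from the other predictors.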

In the intra/inter-unit contacts analysis, the predicted contacts of each protein were divided into i) intra-unit contacts, if between residues inside the same unit; ii) inter-unit contacts, if the residues are in different repeat units. The unit mapping was taken from the RepeatsDB database [14]. In this analysis, we calculated the number of intra- and inter-unit contacts in the PDB structure and then selected the same number of predicted intra- and inter-unit contacts. The PPV was then calculated as the fraction of correct predictions.
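The intra/inter-unit procedure can be sketched as follows. This is a minimal sketch under stated assumptions: repeat units are given as inclusive (start, end) residue ranges, the predicted pairs are already sorted by decreasing score, and pairs involving residues outside any annotated unit are skipped. All names and the toy data are hypothetical.

```python
def unit_of(residue, units):
    """Index of the repeat unit containing the residue, or None."""
    for index, (start, end) in enumerate(units):
        if start <= residue <= end:
            return index
    return None


def split_by_unit(pairs, units):
    """Separate residue pairs into intra-unit and inter-unit lists, keeping order."""
    intra, inter = [], []
    for i, j in pairs:
        unit_i, unit_j = unit_of(i, units), unit_of(j, units)
        if unit_i is None or unit_j is None:
            continue  # residue outside the annotated repeat units
        (intra if unit_i == unit_j else inter).append((i, j))
    return intra, inter


def intra_inter_ppv(ranked_pairs, native_contacts, units):
    """PPV for intra- and inter-unit contacts, selecting as many predictions
    of each kind as there are native contacts of that kind."""
    native_intra, native_inter = split_by_unit(native_contacts, units)
    pred_intra, pred_inter = split_by_unit(ranked_pairs, units)
    result = {}
    for name, native, predicted in (
        ("intra", native_intra, pred_intra),
        ("inter", native_inter, pred_inter),
    ):
        selected = predicted[: len(native)]
        native_set = set(native)
        result[name] = (
            sum(pair in native_set for pair in selected) / len(selected)
            if selected else None
        )
    return result


# Toy example with two repeat units of ten residues each.
units = [(0, 9), (10, 19)]
native = {(0, 7), (1, 8), (2, 15), (3, 16)}
ranked = [(0, 7), (2, 15), (4, 12), (1, 9), (3, 16)]
ppv = intra_inter_ppv(ranked, native, units)
```

Matching the number of selected predictions to the number of native contacts of each kind keeps the two PPV values comparable even when a protein has far more intra-unit than inter-unit contacts.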

Template search and homology modelling

The template search and homology modelling were performed for the representative sequences using the default options of Swissmodel [28].

Protein models analysis

The model quality, expressed as a TM-score, was estimated with a random forest regression model using the Python module Sklearn. The random forest regression was optimized to include 240 estimators and a maximum depth of 60. The models from the trRosetta “complete region” benchmark set were used as a training set. The label of the training set was the TM-score of each model [46]. To ensure that the protein structure and the model were aligned correctly, the TMalign option -I was used, providing a local alignment of the two sequences.

For training, five cross-validation sets were generated. Several inputs were used for the random forest, described briefly below and in Table 1. The Confold and QmeanDisco inputs were obtained from analysing the first ranked model. Pcons was run using the option -d using all the models in the stage2 folder generated by Confold. Among the different sets of features tried, we selected nine features that all improve the prediction of the random forest regression, see S3 Fig.
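A random forest regressor with the stated hyperparameters and five-fold cross-validation could be set up along the following lines. This is a sketch under loudly stated assumptions: the three features (a Pcons score, a QmeanDisCo score and the protein length) stand in for the nine features actually selected, and both the features and the TM-score labels are synthetic values generated only to make the example runnable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
n_models = 200

# Synthetic stand-ins for the real inputs (NOT data from the study).
features = np.column_stack([
    rng.uniform(0, 1, n_models),      # hypothetical Pcons score
    rng.uniform(0, 1, n_models),      # hypothetical QmeanDisCo score
    rng.integers(50, 600, n_models),  # hypothetical protein length
])
tm_score = np.clip(
    0.5 * features[:, 0] + 0.4 * features[:, 1] + rng.normal(0, 0.05, n_models),
    0.0,
    1.0,
)

# Hyperparameters as stated in the text: 240 estimators, maximum depth 60.
forest = RandomForestRegressor(n_estimators=240, max_depth=60, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Each model's TM-score is predicted by a forest trained on the other four folds.
predicted_tm = cross_val_predict(forest, features, tm_score, cv=folds)
mean_abs_error = float(np.mean(np.abs(predicted_tm - tm_score)))
```

After cross-validation, `forest.fit(features, tm_score)` followed by `forest.feature_importances_` gives a per-feature importance ranking of the kind reported in S3 Fig.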

References

JHeringa. Detection of internal repeats: how common are they? Curr Opin Struct Biol. 1998;8: 338345. 10.1016/s0959-440x(98)80068-7

MStrand, TAProlla, RMLiskay, TDPetes. Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature. 1993;365: 274276. 10.1038/365274a0

FPâques, W-YLeung, JEHaber. Expansions and Contractions in a Tandem Repeat Induced by Double-Strand Break Repair. Molecular and Cellular Biology. 1998. pp. 20452054. 10.1128/mcb.18.4.2045

ESchaper, OGascuel, MAnisimova. Deep conservation of human protein tandem repeats within the eukaryotes. Mol Biol Evol. 2014;31: 11321148. 10.1093/molbev/msu062

E.M.Marcotte, M.Pellegrini, T.O.Yeates, D.Eisenberg A census of protein repeats. J Mol Biol. 1999;293: 151160. 10.1006/jmbi.1999.3136

AKBjörklund, DEkman, AElofsson. Expansion of protein domain repeats. PLoS Comput Biol. 2006;2: e114. 10.1371/journal.pcbi.0020114

MAAndrade, CPerez-Iratxeta, CPPonting. Protein Repeats: Structures, Functions, and Evolution. Journal of Structural Biology. 2001. pp. 117131. 10.1006/jsbi.2001.4392

CUStirnimann, EPetsalaki, RBRussell, CWMüller. WD40 proteins propel cellular networks. Trends Biochem Sci. 2010;35: 565574. 10.1016/j.tibs.2010.04.003

JLi, AMahajan, M-DTsai. Ankyrin repeat: a unique motif mediating protein-protein interactions. Biochemistry. 2006;45: 1516815178. 10.1021/bi062188q

10 

LKMosavi, TJCammett, DCDesrosiers, Z-YPeng. The ankyrin repeat as molecular architecture for protein recognition. Protein Sci. 2004;13: 14351448. 10.1110/ps.03554604

11 

EPersi, YIWolf, EVKoonin. Positive and strongly relaxed purifying selection drive the evolution of repeats in proteins. Nat Commun. 2016;7: 13570. 10.1038/ncomms13570

12 

AVKajava. Review: Proteins with Repeated Sequence—Structural Prediction and Modeling. Journal of Structural Biology. 2001. pp. 132144. 10.1006/jsbi.2000.4328

13 

AVKajava. Tandem repeats in proteins: From sequence to structure. Journal of Structural Biology. 2012. pp. 279288. 10.1016/j.jsb.2011.08.009

14 

LPaladin, LHirsh, DPiovesan, MAAndrade-Navarro, AVKajava, SCETosatto. RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures. Nucleic Acids Res. 2017;45: 3613. 10.1093/nar/gkw1268

15 

LAAbriata, GETamò, BMonastyrskyy, AKryshtafovych, MDal Peraro. Assessment of hard target modeling in CASP12 reveals an emerging role of alignment-based contact prediction methods. Proteins. 2018;86 Suppl 1: 97112. 10.1002/prot.25423

16 

FPazos, MHelmer-Citterich, GAusiello, AValencia. Correlated mutations contain information about protein-protein interaction. J Mol Biol. 1997;271: 511523. 10.1006/jmbi.1997.1198

17 

REspada, RGParra, TMora, AMWalczak, DUFerreiro. Capturing coevolutionary signals inrepeat proteins. BMC Bioinformatics. 2015;16: 207. 10.1186/s12859-015-0648-3

18 

MMichel, DMHurtado, AElofsson. PconsC4: fast, accurate, and hassle-free contact predictions. Bioinformatics. 2018. 10.1093/bioinformatics/bty1036

19 

JYang, IAnishchenko, HPark, ZPeng, SOvchinnikov, DBaker. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A. 2020;117: 14961503. 10.1073/pnas.1914677117

20 

Greener JG, Kandathil SM, Jones DT. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nat Commun. 2019;10: 3977. 10.1038/s41467-019-11994-0

21 

Baldassi C, Zamparo M, Feinauer C, Procaccini A, Zecchina R, Weigt M, et al. Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners. PLoS One. 2014;9: e92721. 10.1371/journal.pone.0092721

22 

Lundström J, Rychlewski L, Bujnicki J, Elofsson A. Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci. 2001;10: 2354–2362. 10.1110/ps.08501

23 

Studer G, Rempfer C, Waterhouse AM, Gumienny R, Haas J, Schwede T. QMEANDisCo-distance constraints applied on model quality estimation. Bioinformatics. 2020;36: 2647. 10.1093/bioinformatics/btaa058

24 

Skwark MJ, Raimondi D, Michel M, Elofsson A. Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput Biol. 2014;10: e1003889. 10.1371/journal.pcbi.1003889

25 

Adhikari B, Bhattacharya D, Cao R, Cheng J. CONFOLD: Residue-residue contact-guided ab initio protein folding. Proteins. 2015;83: 1436–1449. 10.1002/prot.24829

26 

Chaudhury S, Lyskov S, Gray JJ. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics. 2010;26: 689–691. 10.1093/bioinformatics/btq007

27 

Biasini M, Bienert S, Waterhouse A, Arnold K, Studer G, Schmidt T, et al. SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res. 2014;42: W252–W258. 10.1093/nar/gku340

28 

Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 2018;46: W296–W303. 10.1093/nar/gky427

29 

El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47: D427–D432. 10.1093/nar/gky995

30 

Yeats C, Bentley S, Bateman A. New knowledge from old: in silico discovery of novel protein domains in Streptomyces coelicolor. BMC Microbiol. 2003;3: 3. 10.1186/1471-2180-3-3

31 

von Heijne G. Proline kinks in transmembrane alpha-helices. J Mol Biol. 1991;218: 499–503. 10.1016/0022-2836(91)90695-3

32 

Deupi X, Olivella M, Govaerts C, Ballesteros JA, Campillo M, Pardo L. Ser and Thr Residues Modulate the Conformation of Pro-Kinked Transmembrane α-Helices. Biophysical Journal. 2004. pp. 105–115. 10.1016/S0006-3495(04)74088-6

33 

Sinn HW, Balsamo J, Lilien J, Lin JJ-C. Localization of the novel Xin protein to the adherens junction complex in cardiac and skeletal muscle during development. Dev Dyn. 2002;225: 1–13. 10.1002/dvdy.10131

34 

van der Ven PFM, Ehler E, Vakeel P, Eulitz S, Schenk JA, Milting H, et al. Unusual splicing events result in distinct Xin isoforms that associate differentially with filamin c and Mena/VASP. Exp Cell Res. 2006;312: 2154–2167. 10.1016/j.yexcr.2006.03.015

35 

Eulitz S, Sauer F, Pelissier M-C, Boisguerin P, Molt S, Schuld J, et al. Identification of Xin-repeat proteins as novel ligands of the SH3 domains of nebulin and nebulette and analysis of their interaction during myofibril formation and remodeling. Mol Biol Cell. 2013;24: 3215–3226. 10.1091/mbc.E13-04-0202

36 

Molt S, Bührdel JB, Yakovlev S, Schein P, Orfanos Z, Kirfel G, et al. Aciculin interacts with filamin C and Xin and is essential for myofibril assembly, remodeling and maintenance. J Cell Sci. 2014;127: 3578–3592. 10.1242/jcs.152157

37 

DeBenedictis EP, Ma D, Keten S. Structural predictions for curli amyloid fibril subunits CsgA and CsgB. RSC Adv. 2017;7: 48102–48112.

38 

Perov S, Lidor O, Salinas N, Golan N, Tayeb-Fligelman E, Deshmukh M, et al. Structural Insights into Curli CsgA Cross-β Fibril Architecture Inspired Repurposing of Anti-amyloid Compounds as Anti-biofilm Agents. 10.1101/493668

39 

Hirsh L, Paladin L, Piovesan D, Tosatto SCE. RepeatsDB-lite: a web server for unit annotation of tandem repeat proteins. Nucleic Acids Res. 2018;46: W402–W407. 10.1093/nar/gky360

40 

Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006. pp. 1658–1659. 10.1093/bioinformatics/btl158

41 

Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods. 2012. pp. 173–175. 10.1038/nmeth.1818

42 

Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45: D170–D176. 10.1093/nar/gkw1081

43 

Bassot C, Menendez Hurtado D, Elofsson A. Using PconsC4 and PconsFold2 to Predict Protein Structure. Curr Protoc Bioinformatics. 2019; e75. 10.1002/cpbi.75

44 

McGuffin LJ, Bryson K, Jones DT. The PSIPRED protein structure prediction server. Bioinformatics. 2000. pp. 404–405. 10.1093/bioinformatics/16.4.404

45 

Kandathil SM, Greener JG, Jones DT. Prediction of interresidue contacts with DeepMetaPSICOV in CASP13. Proteins. 2019;87: 1092–1099. 10.1002/prot.25779

46 

Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33: 2302–2309. 10.1093/nar/gki524