How ConServ ranks the conservation of each residue.

An evolutionary rank (conservation score) k^C, ranging from 1 (for a rapidly evolving, highly variable residue) to 9 (for a slowly evolving, conserved residue), is calculated for each residue of a protein directly from the free PDB chain sequence, or from a provided FASTA sequence, using a new, locally coded implementation of ConSurf[1, 2] running in Python.

In brief, the target sequence is used to search the UniRef90 database[3] with HMMER[4] for similar sequences, and the resulting matches are accepted or rejected according to the following criteria. First, CD-HIT[5] is used to reduce matches that share >= 95% sequence identity with each other to a single representative sequence. Any remaining sequence that has <= 60% overlap with the target sequence is rejected, as is any sequence that has either <= 35% or >= 95% sequence identity with the target. Finally, any sequence that is a subset of itself elsewhere, with >= 10% overlap, is rejected.
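The identity and overlap criteria above can be illustrated with a minimal Python sketch. It is not the ConServ code: the Match class, its field names and the function names are hypothetical, identity and overlap are assumed to be precomputed fractions relative to the target, and the CD-HIT clustering and self-overlap checks are left out.

```python
from dataclasses import dataclass

# Thresholds taken from the text, expressed as fractions rather than percentages.
MIN_OVERLAP = 0.60   # matches with overlap <= 60% are rejected
MIN_IDENTITY = 0.35  # matches with identity <= 35% are rejected
MAX_IDENTITY = 0.95  # matches with identity >= 95% are rejected


@dataclass
class Match:
    """Hypothetical container for one database match (not a ConServ type)."""
    name: str
    identity: float  # assumed: fraction of identical residues versus the target
    overlap: float   # assumed: fraction of the target covered by the match


def is_acceptable(match: Match) -> bool:
    """Return True if the match passes the overlap and identity criteria."""
    if match.overlap <= MIN_OVERLAP:
        return False
    return MIN_IDENTITY < match.identity < MAX_IDENTITY


def filter_matches(matches):
    """Keep only acceptable matches, preserving their input order."""
    return [m for m in matches if is_acceptable(m)]


# Example with made-up values: only "a" passes all three thresholds.
example = [
    Match("a", identity=0.50, overlap=0.90),
    Match("b", identity=0.97, overlap=0.95),  # too similar to the target
    Match("c", identity=0.30, overlap=0.80),  # too dissimilar to the target
    Match("d", identity=0.60, overlap=0.55),  # too little overlap
]
accepted_names = [m.name for m in filter_matches(example)]
```

Note that all comparisons are inclusive on the rejection side, so a match with exactly 35% or 95% identity, or exactly 60% overlap, is rejected, as the criteria state.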

Following this selection procedure, the top 300 acceptable sequences (or all of them if there are fewer than 300) are aligned to the target template using mafft-linsi[6], and the per-residue conservation score is calculated using the rate4site program[7]. These final scores are then converted into ConSurf grades from 1 to 9, where residues graded 1 are the least conserved and those graded 9 are the most conserved.
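As an illustration of the final step, the sketch below maps per-residue scores onto grades 1 to 9 using simple equal-width bins. This is an assumed, simplified binning and not necessarily the scheme used by ConSurf or ConServ; the function name is hypothetical, and it assumes that a lower score means a slower evolutionary rate and therefore a more conserved residue.

```python
def scores_to_grades(scores, n_grades=9):
    """Map per-residue rate scores to integer grades 1..n_grades.

    Assumption: lower score = slower evolution = more conserved, so the
    lowest score receives grade n_grades and the highest receives grade 1.
    Equal-width bins are a simplification chosen for illustration.
    """
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All residues score the same: assign the middle grade to each.
        return [(n_grades + 1) // 2] * len(scores)
    width = (hi - lo) / n_grades
    grades = []
    for s in scores:
        # Bin 0 holds the lowest rates; clamp the maximum into the last bin.
        bin_index = min(int((s - lo) / width), n_grades - 1)
        grades.append(n_grades - bin_index)
    return grades


# Example with made-up scores.
example_grades = scores_to_grades([-1.0, 0.0, 1.0, 3.5])  # [9, 7, 5, 1]
```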

References.

[1] Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, et al. ConSurf: Identification of Functional Regions in Proteins by Surface-Mapping of Phylogenetic Information. Bioinformatics. 2003;19:163-4.

[2] Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, et al. ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res. 2005;33:W299-302.

[3] Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006;34:D187-91.

[4] Johnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics. 2010;11:431.

[5] Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658-9.

[6] Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018;34:2490-2.

[7] Pupko T, Bell R, Mayrose I, Glaser F, Ben-Tal N. Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002;18 Suppl 1:S71-7.