How ConServ ranks the conservation of each residue.
An evolutionary rank (conservation score) kC, ranging from 1 (for a rapidly evolving, highly variable residue) to 9 (for a slowly evolving, conserved residue), is calculated for each residue of a protein directly from the free PDB chain sequence, or from a provided FASTA sequence, using a new locally coded implementation of ConSurf[1, 2] running in Python.
In brief, the target sequence is used to search the UniRef90 database[3] with HMMER[4] for similar sequences, and the resulting matches are then accepted or rejected according to the following criteria. First, CD-Hit[5] is used to reduce matches that share >= 95% sequence identity with each other to a single representative sequence. Any remaining sequence that has <= 60% overlap with the target sequence is rejected, as is any sequence that has either <= 35% or >= 95% sequence identity with the target. Finally, any sequence that is a subset of itself elsewhere with >= 10% overlap is rejected.
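The overlap and identity thresholds above can be sketched as a simple filter. This is a minimal illustration, not the ConServ code itself: the `Hit` record, its field names, and the assumption that overlap and identity have already been computed as fractions between 0 and 1 are all hypothetical. The CD-Hit clustering step and the final sub-sequence check are not reproduced here.

```python
from dataclasses import dataclass

# Thresholds taken from the text, expressed as fractions.
MIN_OVERLAP = 0.60   # hits with overlap <= 60% are rejected
MIN_IDENTITY = 0.35  # hits with identity <= 35% are rejected
MAX_IDENTITY = 0.95  # hits with identity >= 95% are rejected


@dataclass
class Hit:
    """Hypothetical record for one database match (names are illustrative)."""
    name: str
    identity: float  # sequence identity with the target, 0..1
    overlap: float   # fraction of the target covered by the hit, 0..1


def is_acceptable(hit: Hit) -> bool:
    """Apply the overlap and identity criteria described in the text."""
    if hit.overlap <= MIN_OVERLAP:
        return False
    if hit.identity <= MIN_IDENTITY or hit.identity >= MAX_IDENTITY:
        return False
    return True


def filter_hits(hits: list[Hit]) -> list[Hit]:
    """Keep only acceptable hits, preserving their original order."""
    return [h for h in hits if is_acceptable(h)]


example_hits = [
    Hit("good", identity=0.70, overlap=0.90),
    Hit("too_short", identity=0.70, overlap=0.60),
    Hit("too_distant", identity=0.35, overlap=0.90),
    Hit("too_similar", identity=0.95, overlap=0.90),
]
accepted = [h.name for h in filter_hits(example_hits)]
```

Because the thresholds in the text are inclusive, a hit sitting exactly on a boundary (for example 60% overlap) is rejected, which is why the comparisons use `<=` and `>=`.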
Following this selection procedure, the top 300 acceptable sequences (or all of them, if there are fewer than 300) are aligned to the target template using mafft-linsi[6], and the per-residue conservation scores are calculated using the rate4site program[7]. These final scores are then converted into ConSurf grades from 1 to 9, where residues graded 1 are the least conserved and those graded 9 are the most conserved.
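The last two steps, capping the homologue set at 300 and mapping per-residue scores onto grades 1 to 9, can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the hits are assumed to be already sorted best-first, and higher scores are assumed to mean faster evolution, so the lowest score receives grade 9. The equal-width binning used here is a stand-in chosen for simplicity and is not necessarily the grading scheme that ConSurf applies.

```python
def select_top(hits, limit=300):
    """Return the first `limit` hits, or all of them if there are fewer.

    Assumes `hits` is already ordered from best to worst match.
    """
    return hits[:limit]


def to_grades(scores, n_grades=9):
    """Map evolutionary-rate scores onto integer grades 1..n_grades.

    Assumption: higher score = faster evolving = less conserved, so the
    lowest score gets the highest grade. Equal-width bins are an
    illustrative choice, not ConSurf's documented scheme.
    """
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # No variation at all: place every residue in the middle grade.
        return [(n_grades + 1) // 2] * len(scores)
    grades = []
    for score in scores:
        # Bin index 0 holds the slowest-evolving residues.
        bin_index = int((score - lowest) * n_grades / (highest - lowest))
        bin_index = min(bin_index, n_grades - 1)  # the maximum falls in the last bin
        grades.append(n_grades - bin_index)
    return grades


example_scores = [-1.0, 0.0, 1.0, 2.0]
example_grades = to_grades(example_scores)
```

With these example scores the slowest-evolving residue is graded 9 and the fastest-evolving residue is graded 1, matching the direction of the scale described in the text.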
References.
[1] Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, et al. ConSurf: Identification of Functional Regions in Proteins by Surface-Mapping of Phylogenetic Information. Bioinformatics. 2003;19:163-4.
[2] Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, et al. ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res. 2005;33:W299-302.
[3] Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006;34:D187-91.
[4] Johnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics. 2010;11:431.
[5] Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658-9.
[6] Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018;34:2490-2.
[7] Pupko T, Bell R, Mayrose I, Glaser F, Ben-Tal N. Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002;18 Suppl 1:S71-7.