Consensus Finder User Guide

Background
How To Use
Advanced Features
FAQ

Background

Why:

Consensus protein sequences are useful for numerous applications. Mutating a protein to be more like the consensus of its homologs often increases its stability, allowing it to function at higher temperatures and improving soluble expression in recombinant hosts. "Consensus Finder" helps identify the consensus sequence and find potentially stabilizing mutations.


What it does:

Consensus Finder takes your protein sequence, finds similar sequences in the NCBI database, aligns them, removes redundant/highly similar sequences, trims the alignment to the size of the original query, and analyzes the consensus. The output is a trimmed alignment, a consensus sequence, frequency and count tables for the amino acids at each position, and a list of suggested mutations to consensus that may be stabilizing.


How to use:

Choose File

Choose a file containing your protein sequence. It needs to be a plain text file in FASTA format containing the sequence of your protein of interest (protein sequence, not DNA).

If you don’t have that handy, you can download a FASTA file for proteins from NCBI.
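If you want to check a file before uploading it, the sketch below tests whether some text looks like a single protein sequence in FASTA format. It is only an illustration: the function name, the accepted letters, and the "exactly one sequence" rule are assumptions, not the checks that Consensus Finder itself performs.

```python
# Hypothetical pre-upload check; not part of Consensus Finder.
# Letters accepted as protein: 20 canonical amino acids plus common
# ambiguity codes (X, B, Z, U, O), stop (*) and gap (-). This is an assumption.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYXBZUO*-")
DNA_LETTERS = set("ACGTUN")


def check_protein_fasta(text):
    """Return (ok, message) for a string holding one FASTA record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return False, "first line must be a header starting with '>'"
    if sum(ln.startswith(">") for ln in lines) != 1:
        return False, "expected exactly one sequence"
    seq = "".join(lines[1:]).upper()
    if not seq:
        return False, "no sequence found after the header"
    bad = set(seq) - PROTEIN_LETTERS
    if bad:
        return False, "unexpected characters: " + "".join(sorted(bad))
    # A sequence made only of nucleotide letters is almost certainly DNA.
    if set(seq) <= DNA_LETTERS:
        return False, "sequence looks like DNA, not protein"
    return True, "looks like a protein FASTA file"


if __name__ == "__main__":
    print(check_protein_fasta(">my_protein\nMKTAYIAKQR\n"))
    print(check_protein_fasta(">my_gene\nATGGCCAAT\n"))
```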

Submit

Press Submit and wait. Click the link to see if it’s done or wait for an email if you provided an email address. Usually takes 10-30 minutes, but sometimes takes longer.

Results

The results page will look something like this:

Download full results

These mutations may stabilize your protein since they differ from the consensus residue

Change S 215 to C (94% of similar proteins have C, only 4% have S)

Change S 590 to N (90% of similar proteins have N, only 3% have S)

Change R 60 to L (88% of similar proteins have L, only 2% have R)

Change Y 241 to H (82% of similar proteins have H, only 5% have Y)

Change T 526 to S (81% of similar proteins have S, only 8% have T)

Change L 341 to I (78% of similar proteins have I, only 9% have L)

Change K 484 to L (77% of similar proteins have L, only 1% have K)

Change G 522 to P (76% of similar proteins have P, only 1% have G)

.



The results page shows a list of mutations that are likely to increase the stability of your protein. For example, in the case shown above, the “S” (serine) at position 215 in your protein is a “C” (cysteine) in 94% of the proteins compared. If you mutate your protein’s S to match the consensus C, it is likely to stabilize your protein. The mutations are listed in order from most conserved (and best bet for stabilization) to least, so if you get a long list, start at the top. If you have your gene and want to use site-directed mutagenesis to change the indicated residues, a site like primerX can aid you in designing primers to make these changes (http://www.bioinformatics.org/primerx/cgi-bin/DNA_1.cgi).
If you don’t get any results, see the Optional operations section below to change the parameters.


Advanced Features

Optional operations

You can change extra options by clicking “show” near “Optional operations” before you submit your job.


2000 Set maximum sequences for BLAST search (Range: 10 - 10000)

1e-3 Set maximum e value for BLAST search (Range: 1e-30 - 1e-1)

.6 Minimum conservation threshold for suggesting mutations (Range: .05 - .99, or leave blank to only use ratio)

7 Minimum conservation ratio for suggesting mutations (Range: 1 - 100, or leave blank to only use threshold)

[] Use only matched portions, not complete sequences

1 Iterations of ClustalW alignments (Range: 1 - 5)

.9 CD-Hit redundancy (Range: .5 - 1.0)

BLAST parameters can be changed to adjust the maximum number of sequences and the maximum e value. One or the other will limit the number of sequences returned. For both, lower values tend to return fewer, more similar results. The default for “Set maximum sequences ...” is 2000, with a reasonable range of 10-10,000; BLAST searches will occasionally time out if too many sequences are requested. The default for "Set maximum e value ..." is 1e-3 (scientific notation), with a reasonable range of 1e-30 to 1e-1.


Changing the “Conservation threshold” or "Conservation ratio" will change how many mutation suggestions are returned. For the threshold, the default is 0.6, meaning that a mutation is suggested wherever the query residue differs from the consensus and the consensus residue is present in at least 60% of proteins. A reasonable range is 0.05-0.99, with lower numbers returning more suggestions. The ratio works similarly, but instead of using the absolute frequency of the consensus residue, it compares the frequency of the consensus residue with that of the residue present in the query sequence, so you are more likely to get suggested mutations when your query protein has a rare residue. Reasonable values for the ratio are between 1 and 100, with the default being 7 (i.e. make a suggestion when the consensus residue is at least 7 times more frequent than the query residue). You can specify a threshold value, a ratio value, or both. If both are specified, any mutation that fits either criterion will be suggested, so giving both a threshold and a ratio will give you more suggestions.
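The "either criterion" rule can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the program's actual code: the function name is made up, and the handling of a query residue that never appears in the alignment (frequency 0) is an assumption.

```python
def suggest_mutation(consensus_freq, query_freq, threshold=0.6, ratio=7):
    """Return True if a position would be suggested for mutation.

    consensus_freq: frequency (0-1) of the consensus residue at the position
    query_freq:     frequency (0-1) of the query's own residue at the position
    threshold, ratio: pass None to mimic leaving that option blank.
    Assumes the query residue already differs from the consensus residue.
    """
    by_threshold = threshold is not None and consensus_freq >= threshold

    if ratio is None:
        by_ratio = False
    elif query_freq == 0:
        # Assumption: a query residue never seen in the alignment is
        # treated as infinitely rare, so the ratio test passes.
        by_ratio = consensus_freq > 0
    else:
        by_ratio = consensus_freq / query_freq >= ratio

    # A mutation that fits either criterion is suggested.
    return by_threshold or by_ratio


if __name__ == "__main__":
    print(suggest_mutation(0.94, 0.04))              # passes the threshold
    print(suggest_mutation(0.45, 0.05))              # passes only the ratio
    print(suggest_mutation(0.45, 0.05, ratio=None))  # threshold only: fails
```

With the defaults, a position where the consensus residue is at 45% and the query residue at 5% is still suggested, because 45% is at least 7 times 5% even though it is below the 60% threshold.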


By default, complete sequences are downloaded individually from NCBI. This takes some time, especially with a lot of BLAST hits. Checking the box for “Use only matched portions...” uses only the partial sequences matching the target in the BLAST result, which can make the program run faster.


Increasing the number of “Iterations of ClustalW alignments” will, in theory, give better quality alignments, but it takes much longer and can give rise to other issues. The default is just 1 iteration; reasonable options are integers from 1-5.


The maximum threshold for eliminating redundant sequences with CD-HIT defaults to 0.9, which removes any sequences with over 90% identity to prevent oversampling of overrepresented groups of proteins. A reasonable range is 0.7-1.0, with a setting of 1.0 keeping all redundant sequences.


Additional Outputs. i.e. “Download full results”

There are other useful results in the .zip file you get! There are 6 files in your full results; one is your original query sequence. The others are:


[your_query]_mutations.txt

This is a text file that contains the same information displayed on the results web page, i.e. a list of suggested mutations.
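If you want to work with the suggestions in a script, a sketch like the one below can turn each line into the usual short mutation name (e.g. S215C). It assumes the lines in the file look exactly like the example lines shown on the results page above; the real file may differ, and the function name is made up for illustration.

```python
import re

# Pattern based on the example results-page lines, e.g.
# "Change S 215 to C (94% of similar proteins have C, only 4% have S)".
# That the file uses exactly this wording is an assumption.
MUTATION_RE = re.compile(
    r"Change (?P<old>[A-Z]) (?P<pos>\d+) to (?P<new>[A-Z]) "
    r"\((?P<new_pct>\d+(?:\.\d+)?)% of similar proteins have [A-Z], "
    r"only (?P<old_pct>\d+(?:\.\d+)?)% have [A-Z]\)"
)


def parse_mutations(text):
    """Return a list of dicts, one per suggested mutation found in text."""
    rows = []
    for m in MUTATION_RE.finditer(text):
        rows.append({
            "mutation": f"{m['old']}{m['pos']}{m['new']}",
            "consensus_pct": float(m["new_pct"]),
            "query_pct": float(m["old_pct"]),
        })
    return rows


if __name__ == "__main__":
    example = (
        "Change S 215 to C (94% of similar proteins have C, only 4% have S)\n"
        "Change R 60 to L (88% of similar proteins have L, only 2% have R)\n"
    )
    for row in parse_mutations(example):
        print(row["mutation"], row["consensus_pct"], row["query_pct"])
```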


[your_query]_consensus.fst

This file contains the consensus sequence across the full length of your query protein. It is a plain text file in FASTA format. Each position is the most common amino acid at that position from all the homologs compared (regardless of how strong the consensus is).
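The idea of "most common amino acid at each position" can be sketched as follows. This is a minimal illustration, not the program's own code: the function name is hypothetical, gaps and non-canonical letters are ignored when counting (as described for the frequencies file), and writing "-" for a column with no canonical amino acids is an assumption.

```python
from collections import Counter

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def consensus(aligned_seqs):
    """Return the most common canonical residue in each alignment column.

    aligned_seqs: list of equal-length strings from a trimmed alignment.
    """
    result = []
    for column in zip(*aligned_seqs):
        # Count only the 20 canonical amino acids; skip gaps and "X".
        counts = Counter(aa for aa in column if aa in CANONICAL)
        # Assumption: a column with no canonical residues is shown as "-".
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["MKV-", "MRV-", "MKIA"]))
```

Note that the last column yields a residue even though only one of the three sequences has one there, which is why the consensus file alone does not tell you how strong the consensus is.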


[your_query]_trimmed_alignment.fst

This file is a text file in FASTA format containing an alignment of all the representative sequences used to calculate the consensus. This alignment has the redundant sequences removed, and is trimmed to the length of your query sequence. That means that any insertions relative to the query sequence are deleted, and any deletions are replaced with gaps (“-”). The names are the VERSION numbers, so you can cross-reference the individual sequences or look them up at NCBI.
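Trimming to the query can be pictured as dropping every alignment column where the query has a gap. The sketch below shows that idea on plain strings; it is an illustration under that assumption, with a made-up function name, and is not the program's actual trimming code.

```python
def trim_to_query(query_row, other_rows):
    """Keep only the alignment columns where the query has a residue.

    query_row:  the query sequence as it appears in the alignment (with gaps)
    other_rows: the other aligned sequences, same length as query_row
    Returns (trimmed_query, trimmed_others).
    """
    # Columns where the query has "-" are insertions relative to the query.
    keep = [i for i, ch in enumerate(query_row) if ch != "-"]
    trimmed_query = "".join(query_row[i] for i in keep)
    # Gaps in the other sequences (deletions relative to the query) stay as "-".
    trimmed_others = ["".join(row[i] for i in keep) for row in other_rows]
    return trimmed_query, trimmed_others


if __name__ == "__main__":
    query, others = trim_to_query("MK-VL", ["MKAVL", "M--VL"])
    print(query)
    print(others)
```

After trimming, every row has the same length as the ungapped query, so column N of the alignment corresponds to position N of your protein.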


[your_query]_counts.csv

This is a table (you can open it as a spreadsheet in Excel) that shows the count of each amino acid at each position. The first column is a list of the 20 amino acids, plus gap (“-”), and other (this can be unknown amino acids or non-canonical amino acids, often designated as “X”). The next column corresponds to position 1 of your query sequence, and contains a count for each amino acid from the alignment at this position. The next column is for position 2, and so on for as many columns as the length of your query sequence. Thus you can see how frequent each amino acid is at each position.


[your_query]_frequencies.csv

This is a table (you can open it as a spreadsheet in Excel) that has the frequency of each amino acid at each position from 0 to 1.0 (i.e. 100%). The layout is the same as the [your_query]_counts.csv file. The frequencies are calculated based upon only the 20 canonical amino acids, so the consensus is calculated by ignoring any gaps or non-canonical amino acids. For example, if half the sequences contain an alanine and the rest have a gap at a position, the “A” row will show 1.0, not 0.5 (the “-” row will also contain 1.0, since it is also normalized to the total of the 20 canonical amino acids).
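The normalization can be sketched for a single position like this. It is a minimal illustration of the rule described above (divide every count, including the gap count, by the total of the 20 canonical amino acids); the function name and the behavior for a position with no canonical amino acids are assumptions.

```python
CANONICAL = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(counts):
    """Convert the counts at one position into frequencies.

    counts: dict mapping a row label ("A", "C", ..., "-") to its count.
    Every count is divided by the total of the 20 canonical amino acids,
    so gaps do not dilute the amino acid frequencies.
    """
    total = sum(counts.get(aa, 0) for aa in CANONICAL)
    if total == 0:
        # Assumption: report zeros when no canonical residue is present.
        return {label: 0.0 for label in counts}
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Half the sequences have alanine, the other half have a gap.
    print(column_frequencies({"A": 5, "-": 5}))
```

In the example, both "A" and "-" come out as 1.0, matching the alanine example in the text.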


F.A.Q.

I got no/too few suggested mutations. What’s wrong?

Try changing some of the Advanced Options. You could decrease the “Conservation Threshold” or "Conservation Ratio" so that a weaker consensus will still suggest a mutation. Changing the “maximum sequences” or “maximum e value” of the BLAST search can also affect the number of suggested mutations. Either increasing or decreasing the two BLAST options can result in more or fewer suggested mutations: more results can give a weaker consensus, so fewer positions will be above your “Conservation Threshold”, but fewer results are more likely to have the same residues as your query, so no mutation can be suggested.


Why do I get so many suggested mutations?

If you got more suggestions than you want to actually make, just use the suggestions from the top of the list, since the mutations are sorted with the “best” (i.e. most conserved) suggestions at the top. You can also increase the “Conservation Threshold” or "Conservation Ratio" or change the BLAST settings to change the number of suggested mutations.


How do I open the files I downloaded?

All the files can be opened in a simple text editor (like “notepad” in Windows or “gedit” in Linux, or “TextEdit” on OS X). But it can be useful to open some of the files with other software. Sequence files (i.e. [your_query]_consensus.fst , [your_query]_trimmed_alignment.fst , and your original query file) can be opened with a number of sequence viewer/editor programs like Jalview, SeaView, or MEGA. The comma separated value files (i.e. [your_query]_counts.csv and [your_query]_frequencies.csv) can be opened with any spreadsheet software like MS Excel, LibreOffice Calc, or Numbers.



-----------------------------------------------------------------------------------------------------------------------

Copyright 2016 Bryan J. Jones (bryanjjones@gmail.com)

Consensus Finder can be freely copied and distributed under the GNU General Public

License version 2 (GPLv2) or later.


Citations:

Weizhong Li, Lukasz Jaroszewski & Adam Godzik. "Clustering of highly homologous sequences to reduce the size of large protein databases", Bioinformatics, (2001) 17:282-283

Weizhong Li, Lukasz Jaroszewski & Adam Godzik. "Tolerating some redundancy significantly speeds up clustering of large protein databases", Bioinformatics, (2002) 18:77-82

Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega." Mol Syst Biol. 2011 Oct 11;7:539. doi: 10.1038/msb.2011.75.

Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., & Madden T.L. (2009) "BLAST+: architecture and applications." BMC Bioinformatics 10:421.