Alignment Refinement




 

Overview

This option applies an automatic refinement algorithm to the current multiple alignment in Cn3D's sequence viewer, using the existing 'block model' as a biologically-motivated constraint on the search space. To perform refinement:

  1. enable editing in the sequence viewer (Edit->Enable Editor);
  2. all blocks are refined by default, but to limit refinement to specific blocks, mark them using the 'Edit->Mark Block' command -- marked blocks are highlighted in grey;
  3. select the 'Edit->Alignment Refiner' command in the sequence viewer;
  4. unchecking the 'Refine all unstructured rows' box in the refinement parameters dialog allows you to limit the refinement to specific non-structure rows, if desired;
  5. change any other dialog settings, then click the 'OK' button.

When the refinement options dialog appears, Cn3D highlights in yellow all initially aligned residues (the yellow highlighting replaces the grey highlight of a marked block, but the mark is remembered). After refinement, any changes made are identified as unhighlighted residues that are now part of an aligned block in the sequence viewer. All originally aligned residues that are no longer aligned in the refined multiple alignment remain in yellow, so the modifications are easy to identify. If no yellow highlights appear outside the bounds of an aligned block after refinement, no changes were made. If you are dissatisfied with the results, issue the 'Edit->Undo' command in the sequence viewer to restore your original pre-refinement alignment: trial-and-error use of the refiner can thus be performed in a non-destructive manner.
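The highlighting rules above amount to simple set comparisons between the aligned residues before and after refinement. The following is a minimal sketch of that bookkeeping, not Cn3D's own code; the function name and the representation of aligned residues as (row, residue index) pairs are assumptions made for illustration.

```python
def classify_changes(before, after):
    """Compare aligned residues before and after a refiner run.

    `before` and `after` are sets of (row, residue_index) pairs that lie
    inside an aligned block (a hypothetical representation).
    """
    return {
        # Aligned now but not initially: unhighlighted residues inside blocks.
        "newly_aligned": after - before,
        # Aligned initially but not now: yellow residues outside any block.
        "no_longer_aligned": before - after,
        # Aligned both times: yellow residues inside blocks.
        "unchanged": before & after,
    }


before = {(1, 10), (1, 11), (1, 12)}
after = {(1, 11), (1, 12), (1, 13)}  # the block on row 1 shifted by one residue
changes = classify_changes(before, after)
```

In this example the shift leaves residue 10 yellow and outside the block, while residue 13 is inside the block and unhighlighted.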

Starting with a passably good initial multiple alignment, this alignment refinement procedure corrects a variety of flaws and provides a means of adding (or removing) alignment columns. The algorithm tends to do best at finding and fixing local misalignments on a single sequence. Further, it performs well even in families that contain evolutionarily distant members, while not disrupting known regions of structural and sequence conservation represented by the block model. (Links to a paper describing the alignment refiner and to a stand-alone version of the refiner are given in the Reference section below.)

This feature is not intended for use with an alignment that has not first undergone basic scrutiny. Because the algorithm presumes that 'on average' the input alignment is correct, it has a harder time making global, coordinated changes across a majority of the sequences. An alignment that is contaminated by many sequences unrelated to the domain family being described, that consists of multiple families with sufficiently distinctive features, or that has other serious errors may be nominally improved for individual sequences, but that is not guaranteed. Such initial flaws can confound the algorithm, which assumes that all sequences in the alignment do in fact belong to one family, so global misalignments may remain in such a case.


Algorithm Summary

In concept, the alignment refiner reformulates the 'Block Align' functionality found in the Import Viewer, but instead applies it to sequences in the sequence viewer so as to optimize each block's location (and optionally, size) on each sequence, based on a PSSM computed from the multiple alignment. Provisional alignments in the Import Viewer are not affected by the alignment refiner.

Each independent run of the refiner is composed of one or more cycles. In turn, a cycle can carry out two distinct refinement phases: block shifting and block modification. In the block-shifting phase, blocks on each selected sequence from the multiple alignment are repositioned consistent with the initial block model (i.e., block order, number and size are unmodified). The same dynamic programming algorithm used by the Import Viewer's 'Block Align' is run for each sequence selected for refinement, where the optimization is done against the PSSM computed from the current state of the multiple alignment with the sequence currently under refinement left out. In this phase, sequences are refined in ascending order according to the score of the sequence against the PSSM over the aligned residues at the start of the phase (i.e., from worst to best score vs. the PSSM).
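The leave-one-out scoring and worst-to-best ordering described above can be sketched as follows. This is an illustration only: the toy PSSM here is a pseudocounted log-odds table against a uniform background, which is an assumption and not how Cn3D computes its PSSM, and all function names are hypothetical. Rows are assumed to be the aligned (block) residues of each sequence, all the same length and without gaps.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(rows, pseudocount=1.0):
    """Toy PSSM: per-column log-odds against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*rows):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_row(row, pssm):
    """Sum the PSSM scores of one row's aligned residues."""
    return sum(column_scores[aa] for aa, column_scores in zip(row, pssm))


def refinement_order(rows):
    """Row indices sorted from worst to best leave-one-out score."""
    scored = []
    for i, row in enumerate(rows):
        others = rows[:i] + rows[i + 1:]  # leave the current row out
        scored.append((score_row(row, build_pssm(others)), i))
    return [i for _, i in sorted(scored)]


rows = ["ACDE", "ACDE", "ACDE", "WWWW"]
order = refinement_order(rows)  # the outlier row (index 3) comes first
```

The sketch only reproduces the ordering step; the repositioning of blocks by dynamic programming is not shown.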

The other (and by default optional) phase in each cycle is 'block modification'. In this phase, pre-existing blocks in the alignment may be extended or contracted at their N- and/or C-termini. New block creation and block splitting are not allowed. Note that unlike the block-shifting phase, this phase simultaneously alters the block on every sequence in the multiple alignment. Furthermore, this phase does not use a dynamic programming approach but a heuristic, which is outlined in the 'Block Modification Phase Parameters' section.

The algorithm has been evaluated and benchmarked against a set of 362 Conserved Domain Database (CDD) alignments and shows an overall score improvement, reliable retrieval of functionally important sites, and enhanced sensitivity compared to the original CDD alignments (citation in the Reference section). The score used by the alignment refiner is the sum of the sequence-to-PSSM scores of all sequences in a multiple alignment. The method is reasonably fast, refining an initial alignment containing a hundred or so columns and several hundred highly diverse sequences within minutes.

However, as with all automated methods, the alignment refiner will not 'do the right thing' in every case from a biological point-of-view, even when it improves the alignment score. So, care should be taken to examine the output before proceeding, using 'Undo' and/or re-running the refiner with different settings or for a restricted set of sequences if necessary.


General Refiner Parameters


Block Shifting Phase Parameters

Note: the reference (master) sequence in the alignment does not undergo refinement. The next three parameters correspond to parameters in the 'Block Align' dialog and control the size of the maximum allowed loop in the refined alignment. (For Cn3D's purposes, a 'loop' is simply a contiguous stretch of unaligned residues.) This is important when full sequences or large footprint extensions are used, because the N- and C-terminal blocks can be shifted far from their initial positions for reasons unrelated to the quality of the initial alignment. For example, if there are multiple domains in a sequence, a block may be shifted out of the current domain of interest into a neighboring domain or tandem repeat. The maximum loop allowed in the refined alignment is: min{ (loop percentile)*(maximum loop length) + loop extension, loop cutoff }.
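As a worked illustration of the formula above, here is a minimal sketch in Python. The function and argument names are hypothetical, and any rounding or special-case handling that Cn3D may apply to these values is not covered by this page and is not modeled.

```python
def max_allowed_loop(loop_percentile, max_loop_length, loop_extension, loop_cutoff):
    """min{ (loop percentile)*(maximum loop length) + loop extension, loop cutoff }"""
    bound = loop_percentile * max_loop_length + loop_extension
    return min(bound, loop_cutoff)


# With illustrative values: 1.0 * 20 + 10 = 30, which is capped by the cutoff of 25.
capped = max_allowed_loop(1.0, 20, 10, 25)

# Here 0.5 * 20 + 5 = 15 stays below the cutoff of 100, so the cutoff has no effect.
uncapped = max_allowed_loop(0.5, 20, 5, 100)
```

The cutoff acts purely as an upper limit: it only matters when the percentile-based bound plus the extension would exceed it.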


Block Modification Phase Parameters

The following three threshold parameters control the decision as to whether it is advantageous to extend or shrink a block by a column. A column can be added to the N- or C-terminal end of a block if: i) no sequence in the alignment has a gap in the column, and ii) each of the three quantities computed for the column exceeds its threshold value. Expansion at a block terminus stops as soon as a column failing this test is encountered. Similarly, a column can be removed from the N- or C-terminal end of a block if all three quantities computed for that column fall below their respective threshold values. Shrinking of a block terminus stops as soon as a column failing this test is encountered, or when the minimum block size is reached.
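The extend and shrink rules above can be sketched as a pair of greedy scans outward from, or inward at, one block terminus. This is a hedged illustration rather than the refiner's implementation: the three per-column quantities are treated as opaque numbers supplied by the caller, the '-' gap character is an assumption, and all names are hypothetical.

```python
GAP = "-"  # assumed gap symbol


def can_extend(column, quantities, thresholds):
    """Addable only if gap-free and every quantity exceeds its threshold."""
    return GAP not in column and all(q > t for q, t in zip(quantities, thresholds))


def can_shrink(quantities, thresholds):
    """Removable only if every quantity falls below its threshold."""
    return all(q < t for q, t in zip(quantities, thresholds))


def extend_terminus(candidates, thresholds):
    """Count columns to add; `candidates` is ordered outward from the block edge
    as (column_residues, quantities) pairs. Stops at the first failing column."""
    added = 0
    for column, quantities in candidates:
        if not can_extend(column, quantities, thresholds):
            break
        added += 1
    return added


def shrink_terminus(edge_quantities, thresholds, block_width, min_block_size):
    """Count columns to remove, scanning inward from the block edge. Stops at the
    first failing column or when the minimum block size is reached."""
    removed = 0
    for quantities in edge_quantities:
        if block_width - removed <= min_block_size:
            break
        if not can_shrink(quantities, thresholds):
            break
        removed += 1
    return removed


thresholds = (0.5, 0.5, 0.5)
added = extend_terminus(
    [("AAA", (0.9, 0.8, 0.7)), ("A-A", (0.9, 0.9, 0.9)), ("AAA", (0.9, 0.9, 0.9))],
    thresholds,
)
removed = shrink_terminus(
    [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.9, 0.1, 0.1)],
    thresholds, block_width=5, min_block_size=1,
)
```

In the example, extension stops after one column because the second candidate contains a gap, even though the third column would pass on its own. Shrinking removes two columns and stops at the third, where one quantity is not below its threshold.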


Reference

The Cn3D alignment refiner was initially developed as a stand-alone program for block-based multiple alignment refinement. Further details can be found in the paper:

"Refining multiple sequence alignments with conserved core regions", Saikat Chakrabarti, Christopher J. Lanczycki, Anna R. Panchenko, Teresa M. Przytycka, Paul A. Thiessen and Stephen H. Bryant (2006) Nucl. Acids Res., 34, 2598-2606 (PMID: 16707662).

The stand-alone program, which takes Cn3D-readable files as input, may be downloaded from the NCBI ftp site.