Chris Lewis at the U of S

Email: Christopher Lewis

Types of Multiple Alignment methods

This page provides a concise overview of the categories for multiple alignment methods discussed in "Recent progress in multiple sequence alignment: a survey" by C. Notredame.

Progressive Algorithms

Progressive sequence alignment algorithms form a multiple alignment from two sequences. Then they align the consensus of this multiple alignment with another sequence producing an expanded multiple alignment. This process is repeated until all sequences are aligned.

The order in which sequences are aligned is important. Given n sequences, S1..Sn, it is likely that you will arrive at a different multiple alignment for each permutation of S1..Sn. For this reason, alignment algorithms such as ClustalW use a dendrogram - a phylogenetic tree based on distance calculated from sequence similarity to determine the order for the alignment. Sequences with the shortest distance are aligned first.

Figure 1 shows the way in which error can be introduced using a progressive alignment strategy.

Exact Algorithms

Exact alignments are based on the Needleman-Wunsch algorithm. This is a dynamic backtracking algorithm which always produces the optimal alignment. The problem with exact alignment strategies is that the amount of time and memory that they requires tends to grow exponentially with the number of sequences.

Previously I sent around a Dynamic Programming example. This example aligned two sequences using a 2 dimensional matrix. Assuming two sequences of length n, this algorithm would require n^2 units of memory. To extend this algorithm for three sequences you would have to use a cube, or 3 dimensional matrix, and the memory requirements would grow to n^3 memory units. Aligning m sequences would require n^m memory units; something which is not feasible. Processing time grows at a similar rate.

The papers lists some tools which are able to align up to 20-30 sequences under specific conditions. It is interesting to note that the MSA algorithm (and possibly the others) is being run, at least in one case, on a supercomputer[1].

Iterative Algorithms

The paper distinguishes between two types of iterative algorithms, Stochastic and non-Stochastic iterative algorithms. A stochastic algorithm is one that incorporates some degree of randomness into the solution; genetic algorithms are good examples of stochastic algorithms. While Stochastic algorithms have been shown to produce good alignments, they are unfortunately too slow for use in large scale projects.

Early non-Stochastic algorithms proceeded by realigning each sequence in the multiple alignment until the iterations consistently fail to improve the alignment. More recent efforts are based on a double iteration strategy. These algorithms have an inner loop that seeks to optimize the weighted sum-of-pairs, and an outer loop that seeks to optimize the weights calculated for the phylogenetic tree. Two other methods include local information to guide the alignment. One method described is based on the creation of a consensus sequence, the other method is based on profiles.

All iterative algorithms, whether stochastic or non-stochastic, proceed by taking a non-optimal alignment and trying to optimize it. These optimization are performed iteratively and continue until they reach some defined limit. Whether or not they have reached this limit will be evaluated using the Objective Function for stochastic algorithms, and based on whether or not the iterations have reached convergence for non-stochastic algorithms.

Consistency Based Algorithms

The optimal alignment in a consistency based algorithm is the one that agrees best with all th possible pair--wise alignments; this is an NP complete problem, so all consistency based algorithms are heuristics. These algorithms are based on any method that is able to do a pair-wise alignment of two sequences. This method is used to produce a library of alignments for each pair of sequences. This library of alignments is then used to ensure consistency in the resulting multiple sequence alignment.

Consistency is ensured as the alignment pairs are used to create a position specific substitution matrix. These alignment pairs have been created using both local and global alignment algorithms. The advantage of this substitution matrix over the other substitution matrices (BLOSUM*, PAM*) is that it is based on the current sequences, and the substitution scores are a measure of how likely it is that a base will appear in a given position. Once this new position specific substitution matrix is created the multiple sequence alignment is created using a progressive alignment similar to that used in ClustalW.

A Work in Progress