Chris Lewis at the U of S



Email: Christopher Lewis

Suffix Trees in Computational Biology

This page was produced to meet the CMPT 857 WWW page requirement. It introduces suffix trees, highlights applications of suffix trees, and provides links to additional information.

For a more detailed explanation of suffix trees and their uses, see "Algorithms on Strings, Trees, and Sequences" by Dan Gusfield. This book was the reference for most of the information on this page.



"Suffix trees are widely used in the computer field... Recent improvements in the method have cut the memory requirement to 17 bytes per letter, which brings the method to the verge of practicality [for bioinformatics applications]" -- Nat Goodman (Genome Technology 10.02, pg 88).


1. Introduction

Any string of length m can be degenerated into m suffixes, and these suffixes can be stored in a suffix tree. Creating this structure requires time O(m) and searching for a pattern in it requires time O(n), where n is the length of the pattern. These two properties make the suffix tree an appealing structure for a diverse range of bioinformatics applications including: multiple genome alignment (Michael Hohl et al., 2002); selection of signature oligonucleotides for DNA arrays (Kaderali and Schliep, 2002); and identification of sequence repeats (Kurtz and Schleiermacher, 1999).

Additional Links

2. Background

T12 =               (empty)
T11 =           i
T10 =          pi
T9  =         ppi
T8  =        ippi
T7  =       sippi
T6  =      ssippi
T5  =     issippi
T4  =    sissippi
T3  =   ssissippi
T2  =  ississippi
T1  = mississippi

Figure 1 - The list of suffixes in the string 'mississippi', these suffixes are added to a suffix tree in Fig. 2.

The first linear-time suffix tree algorithm was developed by Weiner in 1973. A more space efficient algorithm was produced by McCreight in 1976, and Ukkonen produced an "on-line" variant of it in 1995. The key to search speed in a suffix tree is that there is a path from the root for each suffix of the text. This means that at most n comparisons are needed to find a pattern of length n. Lloyd Allison has a detailed introduction to suffix trees, which includes a javascript suffix tree demonstration and a discussion of suffix tree applications. His example uses the string 'mississippi', which can be decomposed into 12 suffixes (Fig 1). A suffix is a substring that includes the final character of the string, for instance the suffix 'ippi' can be found starting at position 8.

A suffix tree can be either implicit (Fig 2a) or explicit (Fig 2b). Suffixes in an implicit suffix tree can end at an interior node -- making them prefixes of another suffix. For example, in the implicit suffix tree for 'mississippi' the final suffix 'i' is a prefix of 'ippi', 'issippi', and 'ississippi'. In the explicit suffix tree, each suffix is explicit -- the suffix ends at a leaf node. This is guaranteed because a terminating character is added to the string (and the tree) that would otherwise never occur in the string. Each string has a known alphabet: the alphabet for 'mississippi' could be [i,m,s], and selecting a terminator that is not part of the alphabet ensures that the terminator will never occur in the string. Creating the explicit suffix tree for 'mississippi' involved adding the character '$' to the string. This required the addition of 2 extra leaves: one for the empty suffix and one for the suffix 'i'.

+
 - i
 -  - ppi
 -  - ssi
 -  -  - ppi
 -  -  - ssippi
 - mississippi
 - p
 -  - i
 -  - pi
 - s
 -  - i
 -  -  - ppi
 -  -  - ssippi
 -  - si
 -  -  - ppi
 -  -  - ssippi

Figure 2a - Implicit suffix tree for 'mississippi'

+
 - $
 - i
 -  - $
 -  - ppi$
 -  - ssi
 -  -  - ppi$
 -  -  - ssippi$
 - mississippi$
 - p
 -  - i$
 -  - pi$
 - s
 -  - i
 -  -  - ppi$
 -  -  - ssippi$
 -  - si
 -  -  - ppi$
 -  -  - ssippi$

Figure 2b - Explicit suffix tree for 'mississippi'

Ukkonen's linear time algorithm begins with an implicit suffix tree containing the first character of the string. Then it steps through the string adding successive characters until the tree is complete. This in order addition of characters gives Ukkonen's algorithm it's "on-line" property; earlier algorithms proceeded backward from the last character. The naive implementation of this algorithm requires O(n^2) or even O(n^3) time, however applying several programming tricks reduces this to O(n) time. To see an example of suffix tree growth, try the applet in "Growing A Suffix Tree" by Ghitza and Warda of McGill University.

The suffixes in Ghitza and Warda's applet are listed as two numbers (x, y), where x represents the start of the suffix in the string and y represents the end of the suffix. Storing the suffixes as indexes to the main string is known as Edge-label compression and allows a suffix tree to be created in O(m) space. The 'mississippi' example is shown using edge-label compression in Figure 3. This trick reduces the size of the output from O(m^2) using substrings as labels to O(m). This is necessary to produce a suffix tree in O(m) time because the 'big-O' time for an algorithm is at least as large as it's output.

+
 - (12, 12)
 - (2,2)
 - (12, 12)
 -  - (3,5)
 -  - (6,8)
 -  -  - (9,12)
 -  -  - (6,12)
 - (1,12)
 - (9,9)
 -  - (11,12)
 -  - (10, 12)
 - (3,3)
 -  - (4,4)
 -  -  - (9,12)
 -  -  - (6,12)
 -  - (4,5)
 -  -  - (9,12)
 -  -  - (6, 12)

Figure 3a - Explicit suffix tree for 'mississippi' using Edge-label compression.

+
 - $
 - i
 -  - $
 -  - ppi$
 -  - ssi
 -  -  - ppi$
 -  -  - ssippi$
 - mississippi$
 - p
 -  - i$
 -  - pi$
 - s
 -  - i
 -  -  - ppi$
 -  -  - ssippi$
 -  - si
 -  -  - ppi$
 -  -  - ssippi$ 

Figure 2b - Explicit suffix tree for 'mississippi' using substrings as labels.

Another necessary implementation trick to achieve a linear time and space bound is the use of suffix links. A suffix link is a pointer from an internal node labelled xS to another internal node labelled S, where x is an arbitrary character and S is a possibly empty substring (Fig 4). Suffix links allow extensions without walking down from the root of the tree because they point to the location of the next extension. Suffix links and edge label compression are the core requirements for the O(m) suffix tree implementation. Rutgers.edu has an excellent suffix tree generation cgi. In their representation the suffix links are represented as light grey arrows. Note the chain of suffix links: 'issi' -> 'ssi' -> 'si' -> 'i' in the suffix tree for 'mississippi'.

+
 - $
 - i (1)
 -  - $
 -  - ppi$
 -  - ssi (4) -> (3)
 -  -  - ppi$
 -  -  - ssippi$
 - mississippi$
 - p
 -  - i$
 -  - pi$
 - s
 -  - i (2) -> (1)
 -  -  - ppi$
 -  -  - ssippi$
 -  - si (3) -> (2)
 -  -  - ppi$
 -  -  - ssippi$

Figure 4a - Explicit suffix tree for 'mississippi' with emphasis on suffix links




issi(4) -> ssi(3)
x = i, S = ssi






si(2) -> i(1)
x = s, S = i

ssi(3) -> si(2)
x = s, S = si

Figure 4b - Suffix links for 'mississippi' the suffix is in the form xS, and the suffix link is S.

When there is more than one string to be searched a generalized suffix tree (GST) can be created. As the GST contains more than one string, each leaf node contains an identifier indicating the string associated with this suffix. For instance, in the implicit GST for 'cagca' and 'gagcga' (Fig 5) the suffix 'agcga' belongs to string 2 and the suffix 'agca' belongs to string 1.

+
 - g (1)      
 - - a (2) -> (7) s:2
 - - - gcga (3) -> (9) s:2
 - - c (4) -> (11)
 - - - ga (5) -> (12) s:2
 - - - a (6) -> (13) s:1
 - a (7) s:2,1 
 - - gc (8) -> (4)
 - - - - ga (9) -> (5) s:2
 - - - - a (10) -> (6) s:1
 - c (11)
 - - ga (12) -> (2) s:2
 - - a (13) -> (7) s:1
 - - - gca (14) -> (10) s:1

Figure 5 - The implicit generalized suffix tree for 'cagca' (string 1) and 'gagcga' (string 2).
Entry a (2)->(7) s:2 represents a leaf node labelled 'a', with a node number of 2, a suffix link to node 7, part of string 2.

Additional Links

2.1. Practical considerations

  1. For very large problems a linear time and space bound is not good enough. This lead to the development of structures such as Suffix Arrays to conserve memory (section 4.3).
  2. Suffix links are designed to quickly take you from one region of the tree to another. In large trees this can cause problems with paging because a suffix link may take you to a region of the tree that is not in memory. For an explanation of paging and it's role in memory management, see the following course notes - Paging.
  3. There must be a trade off between speed and space when storing children. Storing children in a linked list is more memory efficient than using an array, but it requires more processing time to find and add children. For those who are not familiar with arrays and linked lists, the Linked List Tutorial in C provides a detailed explanation of linked lists, and the article entitled " Deciding whether to use arrays or linked lists" explains why one structure would be used instead of the other.
  4. Memory is often the limiting factor for solutions based on suffix trees. Since there is usually more than one solution, memory usage dictates the choice of solution. For instance in an application using more than one string the solution could be implemented using either a GST or match statistics (section 4.2). Creation of a GST is simpler, but match statistics use less memory.
  5. Suffix trees are alphabet dependent -- the size of the alphabet affects the creation and search times of the tree. In the strict case the space requirement must be denoted
    Θ(m|Σ|)
    , or the creation time must be denoted
    min (O(m log m), O(m log |Σ|))
    and the search time must be denoted
    min (O(|P| log m), O(|P| log |Σ|))
    . See Suffix Tree Issues (ppt) for an explanation of the effect alphabet size has on suffix trees and GSTs.

3. Applications of suffix trees

3.1. General Applications

3.1.1. Pattern matching

Found 4 matches:    ___
 1:  1-3:           actgtta...
 1:  7-9:    ...tgttact
 2:  2-4:          gactagcg...
 3:  6-8:    ...acacacta

Figure 6 - The matches for 'act' in 'actgttact', 'gactagcga', and 'gacacacta'.

Suffix trees allow a pattern of length n to be found in O(n + k) time, where k is the number of occurrences of the pattern in the text. This is known as Exact String Matching. Use of a generalized suffix tree allows this pattern to be found in multiple strings in O(n + k) time. For instance, the pattern 'act' occurs 4 times in the strings 'actgttact', 'gactagcga', and 'gacacacta' (Fig 6). A simple extension of this problem is the Exact Set Matching problem. This problem involves finding all exact matches for a set of strings. This problem can be solved using suffix trees in O(n + k) time, where n is the total length of the set of patterns (Gusfield, 1997, p.123).

3.1.2. Longest Common Substring of Two Strings

+
 - g (1)             c=2  d=1
 - - a (2) s:2       c=1  d=2
 - - - gcga (3) s:2  
 - - c (4)           c=2  d=2
 - - - ga (5) s:2
 - - - a (6) s:1
 - a (7) s:2,1       c=2  d=1
 - - gc (8)          c=2  d=3
 - - - - ga (9) s:2    
 - - - - a (10) s:1
 - c (11)            c=2  d=1
 - - ga (12) s:2       
 - - a (13) s:1      c=1  d=2
 - - - gca (14) s:1

Figure 7 - The longest common substring for 'cagca' and 'gagcga' is 'agc', the suffix label on node 8, as node 8 covers both strings and has the deepest string-depth.

The Longest Common Substring problem can be solved in O(m) time on a GST of size m. Two values are required for each internal node in the tree:

  1. String coverage c(v) - the number of distinct strings covered by suffixes in the children of v,
  2. and string-depth d(v) - the number of characters in the suffix label for the node v.

Given these two values, the longest common substring is the suffix label on a node covering all the strings which has the deepest string-depth (Fig 7). The value of c(v) and d(v) can be calculated at the time the tree is created or during an O(m) traversal of the tree. A pointer to the node with the deepest string-depth covering all the strings can be updated during this traversal (Gusfield, 1997, p 125).


3.1.3. Common substrings of more than two strings

+-----+------------+-------+
|Node | Children   | depth |
+-----+------------+-------+
| 1   | 1, 2       | 4     |
+-----+------------+-------+
| 2   | 2          | 6     |
+-----+------------+-------+
| 4   | 1, 2, 3    | 3     |
+-----+------------+-------+
| 7   | 1, 2, 4    | 2     |
+-----+------------+-------+
| 8   | 1, 2, 3, 4 | 8     |
+-----+------------+-------+
| 11  | 1, 2       | 3     |
+-----+------------+-------+
| 13  | 1          | 5     |
+-----+------------+-------+ 

Figure 8 - Table showing the string-depth and list of unique strings covered at each internal node for four strings.

Finding the common substrings of more than two strings is a conceptually simple extension of the longest common substring problem. Instead of calculating the string coverage for each node, generate a list of unique strings covered by the node. This list and the string-depth of the node can be stored in a table created during traversal of the tree (Fig 7b). Any node covering more than one string ends a substring common to the strings it covers. For example node 8 indicates that strings 1, 2, 3 and 4 have a common substring 8 characters long. Equally interesting are internal nodes covering only one string because they end a unique substring. For instance node 2 ends a unique substring in string 2 that contains 6 characters.

The unique strings covered by each node can be encoded in a vector of bits such that bit i is 1 if string i is covered by the node. The sum of the bits in this vector will equal c(v) for the node. This vector can be created by ORing the vector returned from each child with that of it's parent.

Gusfield lists a O(Kn) time for this implementation.



Why does Gusfield allows time K to OR two K-bit vectors? Isn't OR an O(1) operation?

Additional Links

3.2. Biological Applications of Suffix Trees

3.2.1. Genome Alignment

At least two programs based on Suffix trees are available for whole genome alignment: MUMmer (Delcher, et al.) and Multiple Genome Aligner (MGA) (Michael Höhl, et al.). MUMmer and MGA both use common subsequences as anchors for the alignment. While they both use the same data structure: "a suffix-tree data structure, which permits very fast and memory-efficient comparison [of the genomes]" (Delcher et. al., 2002), the two applications take different approaches to genome alignment.

MUMmer extracts Maximal Unique Matches (MUMs) - sequences that occur exactly once in each genome, sorts these sequences to find the longest set of MUMs occurring in the same order in both sequences, and uses this set of sequences to anchor the multiple alignment. The gaps between anchors are filled using the Needleman Wunsch dynamic programming alignment algorithm. As Needleman Wunsch does not scale well for multiple sequences (its time and space requirements increase exponentially with the number of strings), MUMmer is restricted to comparing two genomes.

MGA computes the longest non-overlapping sequence of Maximal Multiple Exact Matches (multiMEMs) and uses these to guide the multiple alignment. A MEM is defined as a (K+1)-tuple (l, p_0, p_1, ... p_k-1) such that l indicates the length of the MEM, and p_i indicates the start coordinate of the exact match in genome i. A maximal MEM cannot be extended to the left or the right and is referred to as a multiMEM. Gaps are shortened by recursively extracting multiMEM sequences, and finally filled using ClustalW - a progressive/iterative multiple alignment method (Types of Multiple Alignment Methods).

Additional Links

3.2.2. Signature Selection

Microarrays are commonly used in gene expression experiments to measure the expression levels of a set of genes. Signature oligonucleotides, short genetic sequences that are specific to the individual genes, must be designed and attached to the microarray. Suffix trees have been used as the core of an algorithm to select these signatures (Kaderali and Schliep, 2002).

Given a set of genomic sequences, the software must identify at least one signature oligonucleotide (probe) for each sequence. These probes must hybridize to only the desired sequence. The algorithm produces a GST from the reverse compliment of all the genomic sequences (candidate probe sequences). Using the GST, the algorithm identifies all common substrings and rejects these regions because probes designed in them would not be specific to a single genomic sequence. Criteria such as probe length are used to further prune this tree.

The algorithm computes the melting temperature (Tm) for all interactions between the remaining candidate probe sequences and the target sequences. The Tm for a pair of sequences is computed by a modified version of the Needleman Wunsch algorithm while they are aligned. The dynamic programming recursion in Needleman Wunsch has been modified to include parameters related to Tm. Recursively calculating the Tm takes advantage of repetitive regions in the sequence and avoids re-computation of Tm for shared subsequences.

Additional Links

3.2.3. Links Related to Suffix Trees in Computation Biology


4. Optimizing the Space Requirements of a Suffix Tree

4.1 Creation of DAGs

For many applications, a suffix tree requiring O(m) space is too large to be practical. One technique to improve the space requirement of a suffix tree involves collapsing common branches of the suffix tree form a Directed Acyclic Graph (DAG). Two branches p and q can be merged if they are isomorphic. In a suffix tree branch p is isomorphic to branch q if there is a suffix link between p and q, and the number of leaves in the two subtrees is equal. Merging these two nodes involves setting the pointer from q's parent to p and removing q from the tree (Fig 9). Creating DAGs saves space by eliminating redundancy in the graph, but it introduces complexity that increases the time for some applications.

+
 - $
 - a (q)
 -  - $
 -  - xa
 -  -  - $
 -  -  - xa$
 - x
 -  - a (p)->(q)
 -  -  - $
 -  -  - xa
 -  -  -  - $
 -  -  -  - xa$
 -  - yxaxaxa$
 - yxaxaxa$

Figure 9a - The explicit suffix tree for 'xyxaxaxa'. The branches below p and q are isomorphic because they both have 3 leaves and have a suffix link between them.

+
 - $
 - a (q)
 -  - $
 -  - xa
 -  -  - $
 -  -  - xa$
 - x
 -  ->(q) 




 -  - yxaxaxa$
 - yxaxaxa$

Figure 9b - The branches below p and q are merged because they are isomorphic. Setting the pointer from p's parent to q removes p from the tree and saves 5 nodes.

4.2. Use of Match Statistics

Another technique to optimize space requirements for some applications is the use of match statistics. Recall the common substring problem -- this problem was solved by creating a GST of the two strings and looking for the node with the deepest string-depth which covered both strings. The space required in this problem can be reduced by storing only the shorter string in the suffix tree. Then the second string is streamed along the suffix tree, stepping through the tree as if adding the string. However, instead of adding the string to the tree a table of match statistics is produced that tracks common regions. This is the approach used in MUMmer to reduce the space requirement of their algorithm.

4.3. Use of Suffix Arrays

T12 =               (empty)
T11 = i
T8  = ippi
T5  = issippi
T2  = ississippi
T1  = mississippi
T10 = pi
T9  = ppi
T7  = sippi
T4  = sissippi
T6  = ssippi
T3  = ssissippi

Figure 10 - The suffixes for the string 'mississippi' in lexicological order.

A suffix array for an m character string is an array of m integers arranged so that the suffixes indexed by the integers are in lexicographical (dictionary) order (Fig 10). For instance the suffix array for 'mississippi' is [12,11,8,5,2,1,10,9,7,4,6,3]. The first and last occurrence of a pattern can be found in this array using a binary search in time O(n). For instance, the pattern pi occurs in position 6 and 7 of the array corresponding to positions 10 and 9 of the string.

A suffix tree can be converted to a suffix array in O(n) time. This is useful when there is memory available to create the suffix tree, but it's space must be reduced after creation. This situation might arise when creating a suffix tree from a large sequence database that will be used in another application. If there is not enough memory available to create and convert the suffix tree there are slower methods available to create the suffix array directly from the sequence.

Additional Links

5. Suffix Tree Source Code


6. References

  1. Lars Kaderali and Alexander Schliep. Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics 2002 18: 1340-1349.
  2. Michael Höhl, Stefan Kurtz, and Enno Ohlebusch. Efficient multiple genome alignment. Bioinformatics 2002 18: 312S-320S.
  3. Ezekiel F. Adebiyi, Tao Jiang, and Michael Kaufmann. An efficient algorithm for finding short approximate non-tandem repeats. Bioinformatics 2001 17: 5S-12S.
  4. S Kurtz and C Schleiermacher. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 1999 15: 426-427.
  5. A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of Whole Genomes. Nucleic Acids Research, 27:11 (1999), 2369-2376.
  6. A.L. Delcher, A. Phillippy, J. Carlton, and S.L. Salzberg. Fast Algorithms for Large-scale Genome Alignment and Comparison. Nucleic Acids Research (2002), Vol. 30, No. 11 2478-2483.
  7. Stefan Kurtz. Reducing the Space Requirement of Suffix Trees. Software--Practice and Experience, 29(13), pages 1149-1171, 1999.
  8. Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Jan 15, 1997. Cambridge Univ Pr.

A Work in Progress