Any string of length m can be degenerated into m suffixes, and these suffixes can be stored in a suffix tree. Creating this structure requires time O(m) and searching for a pattern in it requires time O(n), where n is the length of the pattern. These two properties make the suffix tree an appealing structure for a diverse range of bioinformatics applications including: multiple genome alignment (Michael Hohl et al., 2002); selection of signature oligonucleotides for DNA arrays (Kaderali and Schliep, 2002); and identification of sequence repeats (Kurtz and Schleiermacher, 1999).

- Big 'O' notation - A pdf with a brief explanation of order "O" notation as it applies to the analysis of algorithms.
- Data Structures and Search Trees - A web page that introduces list and tree structures including: arrays, linked lists, binary trees, tries, and suffix trees.

T12 = (empty) T11 = i T10 = pi T9 = ppi T8 = ippi T7 = sippi T6 = ssippi T5 = issippi T4 = sissippi T3 = ssissippi T2 = ississippi T1 = mississippi |

The first linear-time suffix tree algorithm was developed by Weiner in 1973. A more space efficient algorithm was produced by McCreight in 1976, and Ukkonen produced an "on-line" variant of it in 1995. The key to search speed in a suffix tree is that there is a path from the root for each suffix of the text. This means that at most n comparisons are needed to find a pattern of length n. Lloyd Allison has a detailed introduction to suffix trees, which includes a javascript suffix tree demonstration and a discussion of suffix tree applications. His example uses the string 'mississippi', which can be decomposed into 12 suffixes (Fig 1). A suffix is a substring that includes the final character of the string, for instance the suffix 'ippi' can be found starting at position 8.

A suffix tree can be either *implicit* (Fig 2a) or
*explicit* (Fig 2b). Suffixes in an implicit suffix tree can
end at an interior node -- making them prefixes of another suffix.
For example, in the implicit suffix tree for 'mississippi' the
final suffix 'i' is a prefix of 'ippi', 'issippi', and
'ississippi'. In the explicit suffix tree, each suffix is explicit
-- the suffix ends at a leaf node. This is guaranteed because a
terminating character is added to the string (and the tree) that
would otherwise never occur in the string. Each string has a known
alphabet: the alphabet for 'mississippi' could be [i,m,s], and
selecting a terminator that is not part of the alphabet ensures
that the terminator will never occur in the string. Creating the
explicit suffix tree for 'mississippi' involved adding the
character '$' to the string. This required the addition of 2 extra
leaves: one for the empty suffix and one for the suffix 'i'.

+ - i - - ppi - - ssi - - - ppi - - - ssippi - mississippi - p - - i - - pi - s - - i - - - ppi - - - ssippi - - si - - - ppi - - - ssippi |
+ - |

Ukkonen's linear time algorithm begins with an implicit suffix tree containing the first character of the string. Then it steps through the string adding successive characters until the tree is complete. This in order addition of characters gives Ukkonen's algorithm it's "on-line" property; earlier algorithms proceeded backward from the last character. The naive implementation of this algorithm requires O(n^2) or even O(n^3) time, however applying several programming tricks reduces this to O(n) time. To see an example of suffix tree growth, try the applet in "Growing A Suffix Tree" by Ghitza and Warda of McGill University.

The suffixes in Ghitza and Warda's applet are listed as two
numbers (x, y), where x represents the start of the suffix in the
string and y represents the end of the suffix. Storing the suffixes
as indexes to the main string is known as *Edge-label
compression* and allows a suffix tree to be created in O(m)
space. The 'mississippi' example is shown using edge-label
compression in Figure 3. This trick reduces the size of the output
from O(m^2) using substrings as labels to O(m). This is necessary
to produce a suffix tree in O(m) time because the 'big-O' time for
an algorithm is at least as large as it's output.

+ - (12, 12) - (2,2) - (12, 12) - - (3,5) - - (6,8) - - - (9,12) - - - (6,12) - (1,12) - (9,9) - - (11,12) - - (10, 12) - (3,3) - - (4,4) - - - (9,12) - - - (6,12) - - (4,5) - - - (9,12) - - - (6, 12) |
+ - $ - i - - $ - - ppi$ - - ssi - - - ppi$ - - - ssippi$ - mississippi$ - p - - i$ - - pi$ - s - - i - - - ppi$ - - - ssippi$ - - si - - - ppi$ - - - ssippi$ |

Another necessary implementation trick to achieve a linear time
and space bound is the use of suffix links. A *suffix link*
is a pointer from an internal node labelled xS to another internal
node labelled S, where x is an arbitrary character and S is a
possibly empty substring (Fig 4). Suffix links allow extensions
without walking down from the root of the tree because they point
to the location of the next extension. Suffix links and edge label
compression are the core requirements for the O(m) suffix tree
implementation. Rutgers.edu has an excellent suffix tree generation
cgi. In their
representation the suffix links are represented as light grey
arrows. Note the chain of suffix links: 'issi' -> 'ssi' ->
'si' -> 'i' in the
suffix tree for 'mississippi'.

+ - $ - i |
issi(4) -> ssi(3) x = i, S = ssi si(2) -> i(1) x = s, S = i ssi(3) -> si(2) x = s, S = si |

When there is more than one string to be searched a
*generalized suffix tree* (GST) can be created. As the GST
contains more than one string, each leaf node contains an
identifier indicating the string associated with this suffix. For
instance, in the implicit GST for 'cagca' and 'gagcga' (Fig 5) the
suffix 'agcga' belongs to string 2 and the suffix 'agca' belongs to
string 1.

+ - g (1) - - a (2) -> (7) s:2 - - - gcga (3) -> (9) s:2 - - c (4) -> (11) - - - ga (5) -> (12) s:2 - - - a (6) -> (13) s:1 - a (7) s:2,1 - - gc (8) -> (4) - - - - ga (9) -> (5) s:2 - - - - a (10) -> (6) s:1 - c (11) - - ga (12) -> (2) s:2 - - a (13) -> (7) s:1 - - - gca (14) -> (10) s:1 |

- Suffix Trees - A power point presentation that introduces suffix trees as a specialized form of keyword trees
- Building Suffix Trees in O(m) time - A power point presentation that describes Ukkonen's Linear-time Suffix Tree Construction Algorithm
- Lecture 22 - Lecture notes about suffix trees (pdf). Notes include: a suffix tree definition; an algorithm to create suffix trees; a representation for suffix trees; the use of generalized suffix trees; definition of the longest common extension problem and the maximal palindromes problem.

- For very large problems a linear time and space bound is not good enough. This lead to the development of structures such as Suffix Arrays to conserve memory (section 4.3).
- Suffix links are designed to quickly take you from one region of the tree to another. In large trees this can cause problems with paging because a suffix link may take you to a region of the tree that is not in memory. For an explanation of paging and it's role in memory management, see the following course notes - Paging.
- There must be a trade off between speed and space when storing children. Storing children in a linked list is more memory efficient than using an array, but it requires more processing time to find and add children. For those who are not familiar with arrays and linked lists, the Linked List Tutorial in C provides a detailed explanation of linked lists, and the article entitled " Deciding whether to use arrays or linked lists" explains why one structure would be used instead of the other.
- Memory is often the limiting factor for solutions based on suffix trees. Since there is usually more than one solution, memory usage dictates the choice of solution. For instance in an application using more than one string the solution could be implemented using either a GST or match statistics (section 4.2). Creation of a GST is simpler, but match statistics use less memory.
- Suffix trees are alphabet dependent -- the size of the alphabet
affects the creation and search times of the tree. In the strict
case the space requirement must be denoted
Θ(m|Σ|), or the creation time must be denotedmin (O(m log m), O(m log |Σ|))and the search time must be denotedmin (O(|P| log m), O(|P| log |Σ|)). See Suffix Tree Issues (ppt) for an explanation of the effect alphabet size has on suffix trees and GSTs.

Found 4 matches: ___ 1: 1-3: |

Suffix trees allow a pattern of length n to be found in O(n + k)
time, where k is the number of occurrences of the pattern in the
text. This is known as *Exact String Matching*. Use of a
generalized suffix tree allows this pattern to be found in multiple
strings in O(n + k) time. For instance, the pattern 'act' occurs 4
times in the strings 'actgttact', 'gactagcga', and 'gacacacta' (Fig
6). A simple extension of this problem is the *Exact Set
Matching* problem. This problem involves finding all exact
matches for a set of strings. This problem can be solved using
suffix trees in O(n + k) time, where n is the total length of the
set of patterns (Gusfield, 1997, p.123).

+ - g (1) c=2 d=1 - - a (2) s:2 c=1 d=2 - - - gcga (3) s:2 - - c (4) c=2 d=2 - - - ga (5) s:2 - - - a (6) s:1 - a (7) s:2,1 c=2 d=1 - - |

The *Longest Common Substring* problem can be solved in
O(m) time on a GST of size m. Two values are required for each
internal node in the tree:

*String coverage*c(v) - the number of distinct strings covered by suffixes in the children of v,- and
*string-depth*d(v) - the number of characters in the suffix label for the node v.

Given these two values, the longest common substring is the suffix label on a node covering all the strings which has the deepest string-depth (Fig 7). The value of c(v) and d(v) can be calculated at the time the tree is created or during an O(m) traversal of the tree. A pointer to the node with the deepest string-depth covering all the strings can be updated during this traversal (Gusfield, 1997, p 125).

+-----+------------+-------+ |Node | Children | depth | +-----+------------+-------+ | 1 | 1, 2 | 4 | +-----+------------+-------+ | 2 | 2 | 6 | +-----+------------+-------+ | 4 | 1, 2, 3 | 3 | +-----+------------+-------+ | 7 | 1, 2, 4 | 2 | +-----+------------+-------+ | 8 | 1, 2, 3, 4 | 8 | +-----+------------+-------+ | 11 | 1, 2 | 3 | +-----+------------+-------+ | 13 | 1 | 5 | +-----+------------+-------+ |

Finding the common substrings of more than two strings is a conceptually simple extension of the longest common substring problem. Instead of calculating the string coverage for each node, generate a list of unique strings covered by the node. This list and the string-depth of the node can be stored in a table created during traversal of the tree (Fig 7b). Any node covering more than one string ends a substring common to the strings it covers. For example node 8 indicates that strings 1, 2, 3 and 4 have a common substring 8 characters long. Equally interesting are internal nodes covering only one string because they end a unique substring. For instance node 2 ends a unique substring in string 2 that contains 6 characters.

The unique strings covered by each node can be encoded in a vector of bits such that bit i is 1 if string i is covered by the node. The sum of the bits in this vector will equal c(v) for the node. This vector can be created by ORing the vector returned from each child with that of it's parent.

Gusfield lists a O(Kn) time for this implementation.

- There are K strings in the GST, so the vector has K bits.
- The vector of each node must be ORed with the vectors of the nodes children. For l children this takes time lK.
- As there are O(n) edges in the GST the total time is O(nK)

Why does Gusfield allows time K to OR two K-bit vectors? Isn't OR an O(1) operation?

- Having considered the matter, OR can only be O(1) if K is less
than the instruction width of the computer architecture -- on a
machine with 32 bit instructions you could encode up to 32 strings
per integer and OR two integers with a single machine operation.
This makes the time for a single OR:
Θ(K / width)orO(K).

- EXACT STRING MATCHING ALGORITHMS - This page describes 24 different exact string matching algorithms. There is no discussion of suffix trees for exact string matching, but it provides a nice sample of alternate methods.
- Applications of Suffix Trees - Power point slides discussing exact string and substring matching, longest common substrings, finding and representing repeated substrings, and space efficient alternative implementations.
- Carrot - a clustering tool for web search results in polish and english - this link has been included to the everyday uses of suffix trees.

At least two programs based on Suffix trees are available for whole genome alignment: MUMmer (Delcher, et al.) and Multiple Genome Aligner (MGA) (Michael Höhl, et al.). MUMmer and MGA both use common subsequences as anchors for the alignment. While they both use the same data structure: "a suffix-tree data structure, which permits very fast and memory-efficient comparison [of the genomes]" (Delcher et. al., 2002), the two applications take different approaches to genome alignment.

MUMmer extracts Maximal Unique Matches (MUMs) - sequences that occur exactly once in each genome, sorts these sequences to find the longest set of MUMs occurring in the same order in both sequences, and uses this set of sequences to anchor the multiple alignment. The gaps between anchors are filled using the Needleman Wunsch dynamic programming alignment algorithm. As Needleman Wunsch does not scale well for multiple sequences (its time and space requirements increase exponentially with the number of strings), MUMmer is restricted to comparing two genomes.

MGA computes the longest non-overlapping sequence of Maximal Multiple Exact Matches (multiMEMs) and uses these to guide the multiple alignment. A MEM is defined as a (K+1)-tuple (l, p_0, p_1, ... p_k-1) such that l indicates the length of the MEM, and p_i indicates the start coordinate of the exact match in genome i. A maximal MEM cannot be extended to the left or the right and is referred to as a multiMEM. Gaps are shortened by recursively extracting multiMEM sequences, and finally filled using ClustalW - a progressive/iterative multiple alignment method (Types of Multiple Alignment Methods).

Microarrays are commonly used in gene expression experiments to measure the expression levels of a set of genes. Signature oligonucleotides, short genetic sequences that are specific to the individual genes, must be designed and attached to the microarray. Suffix trees have been used as the core of an algorithm to select these signatures (Kaderali and Schliep, 2002).

Given a set of genomic sequences, the software must identify at least one signature oligonucleotide (probe) for each sequence. These probes must hybridize to only the desired sequence. The algorithm produces a GST from the reverse compliment of all the genomic sequences (candidate probe sequences). Using the GST, the algorithm identifies all common substrings and rejects these regions because probes designed in them would not be specific to a single genomic sequence. Criteria such as probe length are used to further prune this tree.

The algorithm computes the melting temperature (Tm) for all interactions between the remaining candidate probe sequences and the target sequences. The Tm for a pair of sequences is computed by a modified version of the Needleman Wunsch algorithm while they are aligned. The dynamic programming recursion in Needleman Wunsch has been modified to include parameters related to Tm. Recursively calculating the Tm takes advantage of repetitive regions in the sequence and avoids re-computation of Tm for shared subsequences.

- Supplementary information on the software is available from http://www.zaik.uni-koeln.de/bioinformatik/arraydesign.html
- Needleman Wunsch dynamic programming tutorial. This is the same link I provided in section 3.2.1, however it is equally relevant here. Once you've read and understood the description of the basic Needleman Wunsch algorithm be sure to check out their "advanced" tutorial, which uses a more interesting scoring schema.

- Linear Time Algorithm for Finding and Representing all Tandem Repeats in a String - Gusfield describes an O(|S|) algorithm for finding and marking every tandem repeat in S. A tandem repeat is a pattern that repeats several times in close succession. Prologues genes might appear as tandem repeats in genomic sequence.
- REPuter - REPuter is an application, based on suffix trees, to extract maximal repeats from a complete genome. The software is described in "REPuter: fast computation of maximal repeats in complete genomes" (Kurtz and Schleiermacher, 1999). The running time for their algorithm is O(n+m), which is optimal for this problem, and their implementation requires only 12.5*n bytes to represent the suffix tree.
- Generalized Suffix Trees for Biological Sequence Data: Applications and Implementation - a paper describing GST operations, and a variety of applications for suffix trees in computational biology including: Search applications, Exact match searches, Subsequence composition searches, Homology searches, Single sequence analysis applications, and Multiple sequence analysis applications

For many applications, a suffix tree requiring O(m) space is too large to be practical. One technique to improve the space requirement of a suffix tree involves collapsing common branches of the suffix tree form a Directed Acyclic Graph (DAG). Two branches p and q can be merged if they are isomorphic. In a suffix tree branch p is isomorphic to branch q if there is a suffix link between p and q, and the number of leaves in the two subtrees is equal. Merging these two nodes involves setting the pointer from q's parent to p and removing q from the tree (Fig 9). Creating DAGs saves space by eliminating redundancy in the graph, but it introduces complexity that increases the time for some applications.

+ - $ - a (q) - - $ - - xa - - - $ - - - xa$ - x - - a (p)->(q) - - - $ - - - xa - - - - $ - - - - xa$ - - yxaxaxa$ - yxaxaxa$ |
+ - $ - a (q) - - $ - - xa - - - $ - - - xa$ - x - ->(q) - - yxaxaxa$ - yxaxaxa$ |

Another technique to optimize space requirements for some applications is the use of match statistics. Recall the common substring problem -- this problem was solved by creating a GST of the two strings and looking for the node with the deepest string-depth which covered both strings. The space required in this problem can be reduced by storing only the shorter string in the suffix tree. Then the second string is streamed along the suffix tree, stepping through the tree as if adding the string. However, instead of adding the string to the tree a table of match statistics is produced that tracks common regions. This is the approach used in MUMmer to reduce the space requirement of their algorithm.

T12 = (empty) T11 = i T8 = ippi T5 = issippi T2 = ississippi T1 = mississippi T10 = pi T9 = ppi T7 = sippi T4 = sissippi T6 = ssippi T3 = ssissippi |

A suffix array for an m character string is an array of m integers arranged so that the suffixes indexed by the integers are in lexicographical (dictionary) order (Fig 10). For instance the suffix array for 'mississippi' is [12,11,8,5,2,1,10,9,7,4,6,3]. The first and last occurrence of a pattern can be found in this array using a binary search in time O(n). For instance, the pattern pi occurs in position 6 and 7 of the array corresponding to positions 10 and 9 of the string.

A suffix tree can be converted to a suffix array in O(n) time. This is useful when there is memory available to create the suffix tree, but it's space must be reduced after creation. This situation might arise when creating a suffix tree from a large sequence database that will be used in another application. If there is not enough memory available to create and convert the suffix tree there are slower methods available to create the suffix array directly from the sequence.

- Suffix Arrays - a paper describing the suffix array data structure and it's application to on-line string searching. The paper describes an algorithm to create and search the suffix array in O(P + log N) time where P is the pattern length and N is the text length. This time is comparable to the O(n+m+k) time to create and search a suffix tree.
- Optimal Exact String Matching Based on Suffix Arrays - a paper that describes the use of suffix arrays for exact string matching. The algorithm described in this paper is not confined to exact string matching, but can also be used to solve any problem that would be solved using a top down traversal of a suffix tree.
- The enhanced suffix array and its applications to genome analysis - the paper describes a framework allowing every suffix tree algorithm based on a bottom up traversal of the suffix tree to be implemented using the more space efficient suffix tree. Using the more space efficient suffix array allows very large genomes to be analysed.
- Reducing the Space Requirement of Suffix Trees (Kurtz, 1999) - Describes a technique for compressing suffix trees by exploiting redundant information in the tree. In the worst case their implementation requires 20 bytes per input character, and 10.1 bytes per input character in the average case.

- ANSI C implementation of a Suffix Tree - initiated by Dotan Tsadok of Haifa University. This implementation includes a Makefile, and a c source and header file. The suffix tree is created using Ukkonen's on-line algorithm, and was implemented based on the description in Algorithms on Strings, Trees, and Sequences (Gusfield, 1997).
- Strmat - a collection of string matching programs initiated by Dan Gusfield. The algorithms are tied together using a simple menu program. This project has been maintained and contributed by several developers and is available free of charge. A slightly modified version of the stree source was used to generate some of the diagrams on this page.
- Fast String Searching With Suffix Trees - A discussion of searching with suffix trees by Mark Nelson. The discussion of suffix trees is well written and informative, and the page contains numerous useful figures. The c++ source provided is useful, some of the diagrams on this page were produced by a slightly modified version of the source.

- Lars Kaderali and Alexander Schliep.
*Selecting signature oligonucleotides to identify organisms using DNA arrays*. Bioinformatics 2002 18: 1340-1349. - Michael Höhl, Stefan Kurtz, and Enno Ohlebusch.
*Efficient multiple genome alignment*. Bioinformatics 2002 18: 312S-320S. - Ezekiel F. Adebiyi, Tao Jiang, and Michael Kaufmann.
*An efficient algorithm for finding short approximate non-tandem repeats*. Bioinformatics 2001 17: 5S-12S. - S Kurtz and C Schleiermacher.
*REPuter: fast computation of maximal repeats in complete genomes*. Bioinformatics 1999 15: 426-427. - A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O.
White, and S.L. Salzberg.
*Alignment of Whole Genomes*. Nucleic Acids Research, 27:11 (1999), 2369-2376. - A.L. Delcher, A. Phillippy, J. Carlton, and S.L. Salzberg.
*Fast Algorithms for Large-scale Genome Alignment and Comparison*. Nucleic Acids Research (2002), Vol. 30, No. 11 2478-2483. - Stefan Kurtz.
*Reducing the Space Requirement of Suffix Trees*. Software--Practice and Experience, 29(13), pages 1149-1171, 1999. - Dan Gusfield,
*Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology*. Jan 15, 1997. Cambridge Univ Pr.