Generalized suffix tree

Suffix tree for the strings ABAB and BABA. Suffix links not shown.

In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given the set of strings $D=\{S_{1},S_{2},\dots ,S_{d}\}$ of total length $n$, it is a Patricia tree containing all $n$ suffixes of the strings. It is mostly used in bioinformatics.^[1]

Functionality

A generalized suffix tree can be built in $\Theta (n)$ time and space, and can be used to find all $z$ occurrences of a string $P$ of length $m$ in $O(m+z)$ time, which is asymptotically optimal (assuming the size of the alphabet is constant^[2]^:119).

When constructing such a tree, each string should be padded with a unique out-of-alphabet marker symbol (or string). This ensures that no suffix is a prefix of another, so each suffix is represented by its own leaf node.
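The effect of the unique terminators can be illustrated with a minimal sketch. It builds an uncompressed suffix trie by naive insertion, which takes quadratic time and space, so it is not one of the linear-time constructions and not a Patricia tree. The names `Node`, `build_generalized_suffix_trie`, `leaves`, and `find_occurrences` are hypothetical, positions are 0-based, and the terminator of string `i` is modeled as the tuple `("$", i)`, an assumption chosen so that it cannot collide with any ordinary character.

```python
class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node
        self.suffix = None   # (string index, start position), set on leaves only


def build_generalized_suffix_trie(strings):
    """Insert every suffix of every terminated string into one trie.

    Naive quadratic construction, for illustration only.
    """
    root = Node()
    for i, s in enumerate(strings):
        # Assumed terminator: a tuple unique to string i, outside the alphabet.
        symbols = list(s) + [("$", i)]
        for start in range(len(symbols)):
            node = root
            for sym in symbols[start:]:
                node = node.children.setdefault(sym, Node())
            # The unique terminator guarantees this node is a leaf.
            node.suffix = (i, start)
    return root


def leaves(node):
    """Collect the (string index, start) labels of all leaves below node."""
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.suffix is not None:
            found.append(current.suffix)
        stack.extend(current.children.values())
    return found


def find_occurrences(root, pattern):
    """Walk down along pattern, then report every leaf below that point."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return sorted(leaves(node))


root = build_generalized_suffix_trie(["ABAB", "BABA"])
print(find_occurrences(root, "BA"))
```

Each of the ten suffixes (five per terminated string, counting the suffix that consists of the terminator alone) ends at a distinct leaf, and a pattern query returns the leaves below the node reached by spelling out the pattern.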

Algorithms for constructing a generalized suffix tree include Ukkonen's algorithm (1995) and McCreight's algorithm (1976).

Example

A suffix tree for the strings ABAB and BABA is shown in the figure above. The strings are padded with the unique terminator strings $0 and $1; in general, the terminators may be strings or unique single symbols. The numbers in the leaf nodes are the string number and the starting position. Notice how a left-to-right traversal of the leaf nodes corresponds to the sorted order of the suffixes. Edges on $ from the root are left out in this example.
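The sorted order of the suffixes in this example can be reproduced with a short sketch. It assumes 0-based positions and that `$` sorts before the digits and letters, as it does in ASCII; the ordering used in the figure may differ. The function name `sorted_suffixes` is hypothetical, and suffixes that start inside a terminator are skipped, in line with the omitted `$` edges.

```python
def sorted_suffixes(strings):
    """Return (suffix, string index, start) triples in lexicographic order."""
    entries = []
    for i, s in enumerate(strings):
        padded = s + "$" + str(i)      # terminator strings $0, $1, ...
        for start in range(len(s)):    # skip suffixes starting in the terminator
            entries.append((padded[start:], i, start))
    entries.sort()                     # plain string comparison, "$" < "A" < "B"
    return entries


for suffix, string_index, start in sorted_suffixes(["ABAB", "BABA"]):
    print(f"{suffix:8} string {string_index}, position {start}")
```

Under these assumptions, the leaf labels read from left to right would be the `(string index, start)` pairs in the order printed.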

Alternatives

An alternative to building a generalized suffix tree is to concatenate the strings and build a regular suffix tree or suffix array for the resulting string. When hits are evaluated after a search, each global position is mapped to a document and a local position within it, for example by a binary search over the starting or ending positions of the documents.
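The concatenation approach can be sketched as follows, using a suffix array. This is a minimal illustration under stated assumptions: the suffix array is built by naive sorting rather than a linear-time algorithm, the separator `SEP` is assumed not to occur in any document or pattern, patterns are non-empty, and positions are 0-based. The names `build_index` and `search` are hypothetical.

```python
from bisect import bisect_right

SEP = "\x00"  # assumed separator, must not occur in any document or pattern


def build_index(docs):
    """Concatenate the documents and build a (naive) suffix array."""
    text = SEP.join(docs) + SEP
    starts, pos = [], 0
    for d in docs:
        starts.append(pos)          # global start position of each document
        pos += len(d) + 1           # +1 for the separator
    # Naive construction by sorting all suffixes; for illustration only.
    sa = sorted(range(len(text)), key=lambda k: text[k:])
    return text, starts, sa


def search(text, starts, sa, pattern):
    """Return sorted (document index, local position) pairs for pattern."""
    m = len(pattern)

    def boundary(strict):
        # Binary search over the length-m prefixes of the sorted suffixes.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    left, right = boundary(False), boundary(True)
    hits = []
    for global_pos in sa[left:right]:
        # Map the global position to its document by binary search.
        doc = bisect_right(starts, global_pos) - 1
        hits.append((doc, global_pos - starts[doc]))
    return sorted(hits)


text, starts, sa = build_index(["ABAB", "BABA"])
print(search(text, starts, sa, "BA"))
```

Because the pattern cannot contain the separator, no reported match spans two documents, so the binary search over `starts` is enough to recover the document and the local position.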

References

↑ Paul Bieganski; John Riedl; John Carlis; Ernest F. Retzel (1994). "Generalized Suffix Trees for Biological Sequence Data". Biotechnology Computing, Proceedings of the Twenty-Seventh Hawaii International Conference on. pp. 35–44.
↑ Gusfield, Dan (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 0-521-58519-8.

External links

A C implementation of Generalized Suffix Tree for two strings

This article is issued from Wikipedia - version of the 5/31/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.