Thursday, May 2, 2013

Why Suffix Tree and not Prefix Tree.


To find a pattern we need to find sub string in a text.Since Tries can also find sub string, but there is a catch since they find the sub strings with well defined start and end delimiter.
For ex : Suppose I need to search for a word in a text "bull". Then Trie can do it in O(m) time. But let say we need to find pattern such as "l bid be" now it is a sub-string, but it doesn't have a well defined delimiters starting and ending.
So Basically tries doesn't solve our problem for pattern search.It can find string with well defined boundaries, but not sub string with delimiters.Now this has obvious applications in finding some patterns in Human Gnome.
Like :
"ADCNDJCDNCD CNDJCNDNCDXXCAVDA CDNJCDNCD HBADSCD"
Now if I need to find  "CDNCD HBA" then tries can't do this. Since they stores words and not sub strings.
                                  
Therefore we need some other data structures which can do this.

Firstly our requirement is :
a)We need searching like Tries, means searching of sub string should be done in time proportional to the length of the pattern.
b)Instead of storing individual words in a Trie, we can store either Prefixes or Suffixes of the text(not of the words). Since by this way we can catch the patterns (i.e sub strings) other than individual words.

Now why prefixes or suffixes ?
Since to find sub strings, we must require to search in prefixes or suffixes of text. What it really means is ?
First lets look at Sub string definition.
We can define sub string in two ways:
 -->A prefix of some suffix.
 -->A suffix of some prefix.

For ex :
I have some Text as :
 "abcde h"
 Firstly we must find its prefixes :
  "a"
  "ab"
 "abc"
"abcd"
"abcde"
"abcde h"
Now to find a sub string/pattern let say : "de h"
This sub string must be a suffix of some prefix.
Here in our example it is suffix of last prefix.

Similarly for our other definition of sub string :
We calculate suffixes :
 "h"
" h"
"e h"
"de h"
"cde h"
"bcde h"
"abcde h"
Now to find substring "cde ". It must be  prefix of some suffix.
Here in our case it is prefix of suffix "cde h"

So from this we mean : if we store prefixes or suffixes of the text. We can easily search for a sub string (even without delimiters).

So our next step is :
Create a Trie of prefixes or suffixes of text.

Now our next question is Does Trie of prefixes really do our job?

Logically it seems that we don't require Trie of prefixes.
Since our main concern is the starting delimiter of the sub string.Because if we create a prefix tree, from where we should start our search for pattern.Since our Trie of prefixes always starts with prefixes.Now in order to find intermediate sub string like "cde h", we are not able start the Trie traversal from root of the Trie.
  There are two ways to deal with this problem :
 -->Either create a prefix tree and make some techniques to start traversal from bottom-up.Now this seems to be inappropriate , with some complex traversing from bottom to top,while also to make our Data structure to remain as Trie.
-->Or create a Trie of suffixes.
    Now this will solve our problem, because we will get our starting point from the top of the Trie and match till our sub string length.

So in short we will create a Trie , storing suffixes of entire Text(not individual words).