The KMP algorithm in Haskell
Almost all the string algorithms I read do index manipulation somewhere. However, meddling with indices is never a smart move in a pedagogical setting. For example, see the code for the KMP algorithm in C. It's short, to the point, and elegant in its own way. Except it's hard to see the meaning behind all the operations.
As a remedy, this article demonstrates how to write the KMP string matching algorithm without all those indices.
The KMP string matching algorithm solves the following problem.
Given a string $pat$ of length $m$, decide whether it occurs in a string $text$ of length $n$, in $O(n)$ time.
Half of the KMP algorithm implementations out there are actually the MP algorithm: Twan van Laarhoven's implementation, the earlier version of the KMP package, and even Wikipedia's page. Although both KMP and MP run in $O(n)$ time, KMP uses at most $O(\log m)$ time to advance to match the next element in the list, while MP could take $O(m)$ comparisons. More concretely, KMP can output the sequence $m_0,\dots,m_{n-1}$ in $O(n)$ time, where $m_i = 1$ iff $pat$ is a suffix of $text[0..i]$, and the time between consecutive outputs is $O(\log m)$.
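To make the sequence $m_0,\dots,m_{n-1}$ concrete, here is a naive $O(nm)$ reference for it (my own sketch, not part of the article's code; `matchSequence` is a name I made up):

```haskell
import Data.List (inits, isSuffixOf)

-- Naive O(n*m) reference for the match sequence:
-- element i is True iff pat is a suffix of text[0..i].
matchSequence :: Eq a => [a] -> [a] -> [Bool]
matchSequence pat text = [ pat `isSuffixOf` p | p <- tail (inits text) ]
```

For example, `matchSequence "ab" "abab"` is `[False,True,False,True]`; the KMP automaton developed below produces the same sequence lazily, with the claimed delay between elements.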
This added benefit comes at a cost. In the MP algorithm, the failure table has only one $-1$. In the failure table for KMP there is no $-1$; everything just goes to $0$ instead. If one tries not to use any indices, and wants to separate the searching from the table building, one needs a special element to treat the $0$ positions.
Compared to the KMP package, the implementation here doesn't use any arrays.
1 The algorithm
We build an automaton $A$ from an input string $S=a_0,\dots,a_{m-1}$. The automaton consists of states $nil, s_0,\dots,s_{m-1}$.
Because of laziness, it is built incrementally. The total running time is $O(n)$, even if $m$ is much larger than $n$.
$δ$ is the transition function: $δ(s_i,a_i)=s_{i+1}$, and $δ(s_i,x)=f_i$ for $x \neq a_i$, where $f_i$ is the failure state associated with state $s_i$. We call $a_i$ the value of $s_i$.
One would also want to know whether a state is an accepting state. This prompts us to use the following declaration for the automaton.
data Automaton a = Node { value   :: a,
                          success :: Automaton a,
                          failure :: Automaton a,
                          accept  :: Bool
                        }
                 | Null { success :: Automaton a,
                          accept  :: Bool
                        }

isNull :: Automaton a -> Bool
isNull (Null _ _) = True
isNull _          = False
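As a tiny illustration (my own example, not from the article), here is the automaton for the pattern "ab" tied together by hand, plus a minimal simulator; `run`, `abAutomaton`, and the state names `s0`, `s1` are hypothetical helpers:

```haskell
-- The Automaton type from the article, repeated so this sketch stands alone.
data Automaton a = Node { value :: a, success :: Automaton a, failure :: Automaton a, accept :: Bool }
                 | Null { success :: Automaton a, accept :: Bool }

isNull :: Automaton a -> Bool
isNull (Null _ _) = True
isNull _          = False

-- Hand-tied automaton for the pattern "ab":
-- s0 expects 'a'; on mismatch it goes to Null, which consumes one character
-- and restarts at s0.  s1 expects 'b' and is accepting; on mismatch it falls
-- back to s0, whose value 'a' differs from 'b' (the KMP condition).
abAutomaton :: Automaton Char
abAutomaton = s0
  where s0 = Node 'a' s1 (Null s0 False) False
        s1 = Node 'b' s0 s0 True

-- Minimal simulation: one Bool per consumed character, True on a full match.
run :: Eq a => Automaton a -> [a] -> [Bool]
run a0 = go a0
  where go _ [] = []
        go a (x:xs)
          | not (isNull a) && value a /= x = go (failure a) (x:xs)  -- stay on x
          | otherwise                      = accept a : go (success a) xs
```

For instance, `run abAutomaton "xabab"` gives `[False,False,True,False,True]`: a match ends at each 'b'.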
Assume we have built the automaton; then matching is quite easy: just simulate the automaton. It's important to notice that next and stay express whether we want to read the next character or stay at the same character. The code below is a generalized version of matching: it basically does a fold while we match. isInfixOf' is an example of how to use matchFold to test if a list is an infix of another list.
matchFold :: Eq a => Automaton a -> [a] -> ([a] -> b -> b) -> ([a] -> b -> b) -> b -> b
matchFold _ [] _ _ identity = identity
matchFold state text nomat mat identity = match' state text
  where match' _ [] = identity
        match' a (x:xs)
          | not (isNull a) && value a /= x = stay
          | not (accept a)                 = nomat (x:xs) next
          | otherwise                      = mat (x:xs) next
          where next = match' (success a) xs
                stay = match' (failure a) (x:xs)
isInfixOf' :: Eq a => [a] -> [a] -> Bool
isInfixOf' pattern text
  | null pattern = True
  | otherwise    = or $ matchFold (buildAutomaton pattern) text (const (False:)) (const (True:)) []
It is obvious how we can build the entire automaton except for the failure transition.
Another way to reason about it: we impose an order on a set of strings by measuring the length of each string, so this order is a linear order when the strings in the set have pairwise distinct lengths. Let $Prefixes(s)$ be the set of all prefixes of $s$, and $Suffixes(s)$ be the set of all suffixes of $s$.

$$\begin{aligned}
Border(s)   &= Prefixes(s) \cap Suffixes(s)\\
Border'(sa) &= \{x \mid x \in Border(s),\ xa \notin Border(sa)\}\\
b(s)        &= \max(Border(s) \setminus \{s\})\\
b'(sa)      &= \max(Border'(sa))
\end{aligned}$$
Assume we want to compute $b(x)$ for a string $x$. We first build a function $next$ that iterates through $Prefixes(x)$ by length, and let $last$ be the function that returns the last element of a string. We have the following relation for $sa$ a prefix of $x$, where $s$ is a string and $a$ is a character in the alphabet.
$$b(sa) = \begin{cases} next(b(s)) & \text{if } last(next(b(s))) = a\\ b(b(s)a) & \text{otherwise} \end{cases}$$
$$b'(sa) = \begin{cases} b'(b(s)a) & \text{if } last(next(b(s))) = a\\ b(s) & \text{otherwise} \end{cases}$$

In fact, one can see that the failure function is precisely $b'$. What's not clear is how one can compute this function without keeping track of the entire history of $b$.
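These definitions can be spelled out as naive executable code (my own sketch, for illustration only: it reads $Border'(sa)$ as the borders $x$ of $s$ that cannot be extended by $a$, i.e. $xa$ is not a border of $sa$; the helper names `borders`, `b`, `b'` are mine):

```haskell
import Data.List (inits, isPrefixOf, isSuffixOf)

-- All borders of s, in increasing order of length (includes "" and s itself).
borders :: Eq a => [a] -> [[a]]
borders s = [ x | x <- inits s, x `isSuffixOf` s ]

-- b: the longest proper border.
b :: Eq a => [a] -> [a]
b s = last [ x | x <- borders s, x /= s ]

-- b' for the prefix sa of the pattern: the longest border x of s such that
-- xa is not a border of sa.  Since x is always a suffix of s, "xa is a border
-- of sa" reduces to "xa is a prefix of sa".  Partial, like the failure
-- function: undefined when every border extends by a (the Null case).
b' :: Eq a => [a] -> a -> [a]
b' s a = last [ x | x <- borders s, not ((x ++ [a]) `isPrefixOf` (s ++ [a])) ]
```

For the pattern "aab" this gives `b' "aa" 'b' == "a"`, while for "abab" it gives `b' "aba" 'b' == ""`, since the border "a" would extend to the prefix "ab".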
The essential function is build ys s. It builds the automaton by consuming the remaining string ys, and s records the essential information needed to compute the transition of the failure function.
- If $value(s) = value(t)$, then $b(t) = b(s)$, and we would know $b'(next(t)) = next(s)$.
- If $value(s) \neq value(t)$, then $b(t) = s$, and we would compute $b'(next(t))$ by searching back through the failure edges.
Note we have also computed $b'(next(t))$, which allows us to compute the next node $next(t)$.
So in build ys s, s will precisely store the state $b'(t)$.
buildAutomaton :: Eq a => [a] -> Automaton a
buildAutomaton (x:xs) = automaton
  where automaton = Node x (build xs automaton) (Null automaton False) (null xs)
        build []     s = s
        build (x:xs) s
          | x == value s = success s `nextNode` failure s
          | otherwise    = newS `nextNode` s
          where nextNode a b = Node x (build xs a) b (null xs)
                newS = success $ until (\s -> isNull s || x == value s) failure s
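Putting everything together, the sketch below repeats the article's definitions verbatim in one self-contained module so they can be run directly; `matches` is my helper name, and the example strings in the comments are mine. For a non-empty pattern, `matches` yields exactly the sequence $m_0,\dots,m_{n-1}$ from the introduction.

```haskell
data Automaton a = Node { value :: a, success :: Automaton a, failure :: Automaton a, accept :: Bool }
                 | Null { success :: Automaton a, accept :: Bool }

isNull :: Automaton a -> Bool
isNull (Null _ _) = True
isNull _          = False

-- Fold over the text while matching: nomat/mat are called with the remaining
-- text at every non-matching/matching position.
matchFold :: Eq a => Automaton a -> [a] -> ([a] -> b -> b) -> ([a] -> b -> b) -> b -> b
matchFold _ [] _ _ identity = identity
matchFold state text nomat mat identity = match' state text
  where match' _ [] = identity
        match' a (x:xs)
          | not (isNull a) && value a /= x = stay
          | not (accept a)                 = nomat (x:xs) next
          | otherwise                      = mat (x:xs) next
          where next = match' (success a) xs
                stay = match' (failure a) (x:xs)

buildAutomaton :: Eq a => [a] -> Automaton a
buildAutomaton (x:xs) = automaton
  where automaton = Node x (build xs automaton) (Null automaton False) (null xs)
        build []     s = s
        build (y:ys) s
          | y == value s = success s `nextNode` failure s
          | otherwise    = newS `nextNode` s
          where nextNode a b = Node y (build ys a) b (null ys)
                newS = success $ until (\t -> isNull t || y == value t) failure s

-- One Bool per text character: True iff pat is a suffix of the text read so far.
matches :: Eq a => [a] -> [a] -> [Bool]
matches pat text = matchFold (buildAutomaton pat) text (const (False:)) (const (True:)) []

isInfixOf' :: Eq a => [a] -> [a] -> Bool
isInfixOf' pat text
  | null pat  = True
  | otherwise = or $ matchFold (buildAutomaton pat) text (const (False:)) (const (True:)) []

-- ghci> matches "aab" "aaaab"
-- [False,False,False,False,True]
-- ghci> isInfixOf' "abc" "ababab"
-- False
```

Because the automaton is built lazily, something like `isInfixOf' (replicate 1000000 'a' ++ "b") "ab"` should return `False` quickly: only the states reachable while consuming the short text are ever forced, matching the article's claim that the total running time is $O(n)$ even when $m$ is much larger than $n$.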