Strings with hamming distance exactly $1$

Lin Yang asked me about the complexity for the following problem, which is the day 2 part 2 of the advent of code 2018. It is an elegant programming exercise, and also a clever algorithmic exercise. The problem can be summarized below.

Problem1

Given a set $W$ of $n$ length $m$ strings. Decide if there are two of them that differs at exactly one position.

In other words, we want to find two strings in $W$ with hamming distance $1$ .

The naive algorithm would have running time $O (n^{2} m)$ . The complexity of the problem have gathered a lot of attention a while ago, for example a post in dev.to, and on reddit. Some of them had a running time of $O (n m^{2})$ instead. Some require hashing to get the expected running time of $O (mn)$ . Here we are interested in an algorithm with worst case $O (mn)$ time.

1 An $O (mn)$ time algorithm

First, we define some equivalent classes on the strings in $W$ .

$x \equiv^{i} y$ , if $x [1.. i - 1] = y [1.. i - 1]$ . Namely, if the first $i - 1$ elements of $x$ and $y$ match.
$x \equiv_{i} y$ if $x [i + 1.. m] = y [i + 1.. m]$ . Namely, if the last $m - i + 1$ elements of $x$ and $y$ match.

The algorithm uses the following idea. For each $i$ , decide if there are any strings $x$ and $y$ such that differs in precisely position $i$ .

Lemma2

For distinct $x$ and $y$ , they differ only in position $i$ if and only if $x \equiv^{i} y$ and $x \equiv_{i} y$ .

Let $P_{i}$ and $S_{i}$ be the collection of equivalent classes of $\equiv^{i}$ and $\equiv_{i}$ , respectively. We show a result related to the meet of partitions.

Lemma3

Let $A$ and $B$ be partitions of $[n]$ . There is an $O (n)$ time algorithm to test find the sets in ${A \cap B ∣ A \in A, B \in B}$ .

Proof

Let $A = {A_{1}, \dots, A_{k}}$ and $B = {B_{1}, \dots, B_{ℓ}}$ . We define $I_{i} = (a, b)$ such that $i \in A_{a}$ and $i \in B_{b}$ . Then we know $i$ is in $A_{a} \cap B_{b}$ if $I_{i} = (a, b)$ . Hence we are interested in find the largest set of elements such $S$ such that for $i, j \in S$ , $I_{i} = I_{j}$ . The simplified problem can be solved in $O (n)$ time. Indeed, the pair is just a base $n$ number with $2$ digits. We can apply radix sort with running time $O (n)$ and group by the result.

Note one can also directly use a partition refinement data structure to get the same result.

As a corollary, consider $A = P_{i}$ and $B = S_{i}$ , then we obtain the following lemma.

Lemma4

Given the collections $P_{i}$ and $S_{i}$ , there is an $O (n)$ time algorithm to test if there are two strings $x, y \in W$ that differs in precisely position $i$ .

Lemma5

Finding $P_{1}, \dots, P_{m}$ and $S_{1}, \dots, S_{m}$ can be done in $O (mn)$ time.

Proof

To find the equivalent classes, build two tries for the strings. Trie $T_{P}$ for strings in $W$ and trie $T_{S}$ for the reversal of strings in $W$ . Building the tries takes $O (mn)$ time. Inspect the nodes at depth $i - 1$ in $T_{P}$ and nodes at depth $m - i + 1$ in $T_{S}$ to recover $P_{i}$ and $S_{i}$ in $O (n)$ time.

Theorem6

There is an $O (mn)$ time algorithm that solves Problem 1.

Proof

Finding the sequence of equivalent classes takes $O (mn)$ time by Lemma 4. For each $i$ , checking if there exists $x, y \in W$ differs in precisely position $i$ takes $O (n)$ time by Lemma 3. Since $i$ ranges from $1$ to $m$ , we obtain the final running time is $O (mn)$ .

2 Remarks

Ruosong Wang communicated another $O (mn)$ solution. It is much easier to describe. Let $⋄$ be a symbol not in the alphabet. Build a generalized suffix tree over the set of strings $S^{'} = {x ⋄ x ∣ x \in W}$ . Traverse the suffix tree, up to level $m$ , and output true if a path that contains $⋄$ was traversed, and can lead to more than $2$ leaves. Indeed, this means the substring $x ⋄ y$ appears at least twice. Hence there are at least two strings of the form $y a x$ and $y b x$ in $W$ . This definitely hits the optimal running time, but implementing a generalized suffix tree is fairly hard.

We do assume the alphabet size is constant. If the alphabet size is $σ$ and ordered, then there is an extra factor of $lo g σ$ in building the tries. The the final running time will be $O (mn lo g σ)$ .

Problem 1 also reduces to finding the closest pair of elements by hamming metric [1]. It does not get us the desired running time though.

3 An implementation in Haskell

The implementation is mostly faithful to the presentation in the article. We did not implement counting sort nor radix sort.

References

[1]

K. Min, M.-Y. Kao, H. Zhu, The closest pair problem under the hamming metric, in: H.Q. Ngo (Ed.), Computing and Combinatorics, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009: pp. 205–214.

Posted by Chao Xu on 2018-12-23.

Tags: algorithms, strings, tries.

Strings with hamming distance exactly 1

1 An O(mn) time algorithm

2 Remarks

3 An implementation in Haskell

References

Strings with hamming distance exactly $1$

1 An $O (mn)$ time algorithm