# Strings with hamming distance exactly $1$

Lin Yang asked me about the complexity for the following problem, which is the day 2 part 2 of the advent of code 2018. It is an elegant programming exercise, and also a clever algorithmic exercise. The problem can be summarized below.

Given a set $W$ of $n$ length $m$ strings. Decide if there are two of them that differs at exactly one position.

In other words, we want to find two strings in $W$ with hamming distance $1$.

The naive algorithm would have running time $O(n_{2}m)$. The complexity of the problem have
gathered a lot of attention a while ago, for example a post
in dev.to, and on
reddit. Some of them had a running time of $O(nm_{2})$ instead. Some require hashing to get
the *expected* running time of $O(mn)$. Here we are interested in an algorithm
with *worst case* $O(mn)$
time.

# 1 An $O(mn)$ time algorithm

First, we define some equivalent classes on the strings in $W$.

- $x≡_{i}y$, if $x[1..i−1]=y[1..i−1]$. Namely, if the first $i−1$ elements of $x$ and $y$ match.
- $x≡_{i}y$ if $x[i+1..m]=y[i+1..m]$. Namely, if the last $m−i+1$ elements of $x$ and $y$ match.

The algorithm uses the following idea. For each $i$, decide if there are any strings $x$ and $y$ such that differs in precisely position $i$.

For distinct $x$ and $y$, they differ only in position $i$ if and only if $x≡_{i}y$ and $x≡_{i}y$.

Let $P_{i}$ and $S_{i}$ be the collection of equivalent classes of $≡_{i}$ and $≡_{i}$, respectively. We show a result related to the meet of partitions.

Let $A$ and $B$ be partitions of $[n]$. There is an $O(n)$ time algorithm to test find the sets in ${A∩B∣A∈A,B∈B}$.

Let $A={A_{1},…,A_{k}}$ and $B={B_{1},…,B_{ℓ}}$. We define $I_{i}=(a,b)$ such that $i∈A_{a}$ and $i∈B_{b}$. Then we know $i$ is in $A_{a}∩B_{b}$ if $I_{i}=(a,b)$. Hence we are interested in find the largest set of elements such $S$ such that for $i,j∈S$, $I_{i}=I_{j}$. The simplified problem can be solved in $O(n)$ time. Indeed, the pair is just a base $n$ number with $2$ digits. We can apply radix sort with running time $O(n)$ and group by the result.

Note one can also directly use a partition refinement data structure to get the same result.

As a corollary, consider $A=P_{i}$ and $B=S_{i}$, then we obtain the following lemma.

Given the collections $P_{i}$ and $S_{i}$, there is an $O(n)$ time algorithm to test if there are two strings $x,y∈W$ that differs in precisely position $i$.

Finding $P_{1},…,P_{m}$ and $S_{1},…,S_{m}$ can be done in $O(mn)$ time.

To find the equivalent classes, build two tries for the strings. Trie $T_{P}$ for strings in $W$ and trie $T_{S}$ for the reversal of strings in $W$. Building the tries takes $O(mn)$ time. Inspect the nodes at depth $i−1$ in $T_{P}$ and nodes at depth $m−i+1$ in $T_{S}$ to recover $P_{i}$ and $S_{i}$ in $O(n)$ time.

There is an $O(mn)$ time algorithm that solves Problem 1.

Finding the sequence of equivalent classes takes $O(mn)$ time by Lemma 4. For each $i$, checking if there exists $x,y∈W$ differs in precisely position $i$ takes $O(n)$ time by Lemma 3. Since $i$ ranges from $1$ to $m$, we obtain the final running time is $O(mn)$.

# 2 Remarks

Ruosong
Wang communicated another $O(mn)$
solution. It is much easier to describe. Let $⋄$ be a symbol not in the alphabet.
Build a generalized
suffix tree over the set of strings $S_{′}={x⋄x∣x∈W}$. Traverse
the suffix tree, up to level $m$, and
output `true`

if a path that contains $⋄$ was traversed, and can lead to more
than $2$ leaves. Indeed, this means the
substring $x⋄y$ appears at least
twice. Hence there are at least two strings of the form $yax$ and $ybx$
in $W$. This definitely hits the optimal
running time, but implementing a generalized suffix tree is fairly
hard.

We do assume the alphabet size is constant. If the alphabet size is $σ$ and ordered, then there is an extra factor of $gσ$ in building the tries. The the final running time will be $O(mngσ)$.

Problem 1 also reduces to finding the closest pair of elements by hamming metric [1]. It does not get us the desired running time though.

# 3 An implementation in Haskell

The implementation is mostly faithful to the presentation in the article. We did not implement counting sort nor radix sort.