# Two problem related to sequence of sets

Given a sequence of sets $S_{1},…,S_{n}$ with a total of $m$ elements. Partition $[n]$, such that if $i,j$ is in the same partition class, then $S_{i}=S_{j}$.

Solve the problem by building a trie over the lexicographic ordering of the elements in the set. Since the alphabet has size $n$, it has running time $O(mgn)$. One can get better running time using integer data structures, say $O(mggn)$ using van Emde Boas tree.

$O(m)$ time is actually possible. For each $k$, we build the set $H_{k}={j∣k∈S_{j}}$ (as a list). We define equivalent relation $≡_{k}$ as $i≡_{k}j$ if $S_{i}∩[k]=S_{j}∩[k]$. If we have equivalent class of $≡_{k}$, we can obtain the equivalent class of $≡_{k+1}$ in $O(∣H_{k}∣)$ time. Hence together the running time is $O(m)$.

Given a sequence of sets $S_{1},…,S_{n}$ containing a total of $m$ integers, and a integer $k$. Decide if there exists $i$ and $j$ such that $i=j$ and $∣S_{i}∩S_{j}∣≥k$.

We assume the elements in the sets are in $[m]$. Let $S=⋃_{i=1}S_{i}$.

For $k=0,1$, we can solve it in $O(m)$ time: Decide if any element appears more than once in the sets.

For larger $k$, we shall compute $∣S_{i}∩S_{j}∣$ for every pair $i$ and $j$. To do this, we start with an all zero $n×n$ matrix $C$. At the end of the algorithm, $C_{i,j}=∣S_{i}∩S_{j}∣$ for all $i,j∈[n]$. For each element $x$, we find $E_{x}={i∣x∈S_{i}}$. This takes $O(m)$ time. We increment $C_{i,j}$ for all $i,j∈E_{x}$. We claim this algorithm have running time $O(nm)$. Indeed, for each $x$, we spend $∣E_{x}∣$ time in incrementing $C_{i,j}$ where $i,j∈E_{x}$. Hence the running time is bounded by $∑_{x∈S}∣E_{x}∣_{2}$. We know $∑_{x∈S}∣E_{x}∣=m$ and $∣E_{x}∣≤n$. We see the worst case is when $∣E_{x}∣=n$ and $∣S∣=m/n$. In that case, we have running time $O(∑_{x∈S}n_{2})=O(mn)$.

Since we just want to find a pair ${i,j}$ where $∣S_{i}∩S_{j}∣≥k$. We can stop the algorithm as soon as $C_{i,j}≥k$ for some $i$ and $j$. This means we can increment at most $(k−1)n_{2}$ times.

Together, the running time become $O(min(nm,kn_{2}+m))$.

For $k=2$. One can improve the running time when $n$ is large by reduce it to a problem similar to finding rectangles or finding a $C_{4}$ in the incident graph. Let $n_{′}$ be $∣⋃_{i}S_{i}∣$, we can obtain a more refined bound. Together, the final running time for $k=2$ is $O(min(m_{4/3},dm,n_{2}+m))$. Here $d$ is the degeneracy of the incident graph of the sets and the elements, which is bounded above by the maximum degree.

Recently, I had some result for $k=3$, where the running time improves to $O(m_{28/15})$.