Word break with cost
Given a set of strings , a cost function and a string . Find elements such that , and is minimized.
This problem is a generalization of the word break problem on leetcode. Many algorithms you see online assumes that each string in has constant length, checking the hash table takes time, and obtains an time algorithm. It is not as easy if the strings can have arbitrary length. Here we show an algorithm that considers the strings in have arbitrary length.
Assume the input has length . Consider the following graph , where , and there is an edge from and , if , the label of the edge is the string , and the cost is . Let be total length of the input, namely $_{wW} |w| + |s|. Let be number of substrings in matches some element in . The graph has edges. Note . Indeed, the sum of the length of the labels of all outgoing edges cannot be more than , and the length of each label is different. Hence each vertex can have at most outgoing edges. The graph is a DAG, so we can find the shortest path from to in linear time with respect to the number of edges. This shows if we can compute the graph in time, then we solve the problem in time.
We can build an Aho–Corasick automaton for in time. It can be used to find all substrings of that matches something in by traversing the automaton once. The running time is the total number of substrings matched, which is . Hence building the graph takes time. is clearly no more than , where . Also, it is also clear . Indeed, there can be at most edges start from , since each edge has a label of different length, and sum of those length labels is no larger than .
If we only want to know if there exists a solution, then there is a time algorithm [1]. The algorithm is close to optimal assuming the algorithm is combinatorial and the alphabet can be arbitrarily large. A recent progress showed the version with minimum cost is also solvable in time [2].