Bounded regression on data streams
1 Bounded Regression on Data Streams
Hsien-Chih sent me this problem. Similar problem has been asked on Quora. He noticed it might be solved in near linear time using min-cost circulation. Here we show a generalization.
Given
- ,
- ,
- .
Output such that for all , and minimize .
2 Reduce the problem to min-cost circulation
It’s natural to model this problem as variations of min-cost circulation problem on a graph.
The graph with vertices .
Edges:
- Edge for all .
- Edge for all .
Edge Capacity:
- has lower bound , upper bound for all .
- All other edges are uncapacited. Namely lower bound and upper bound are and respectively.
Edge Costs: has cost function . Cost function on other edges are .
A function is called a circulation if for all vertex . It is feasible if is within the capacity. It is min-cost if is minimized.
Solving the min-cost circulation problem would give us the desired by setting .
3 min-cost circulation on series-parallel graphs
The constructed graph is a two terminal series-parallel graph. There is a simple procedure to solve min-cost flow problem on series-parallel graphs. Consider a series connection of two edges, each with cost function and . We can replace it with an edge with cost function . If it is a parallel connection, then we can replace it with one edge and a cost function , where is the infimal convolution: .
Once we have a good data structure to represent the costs, we can reduce the graph to one single edge easily, and find the minimum cost circulation. In particular, if the cost are continuous, convex and piecewise linear in a interval and everywhere else, and the total number of breakpoints is , then Booth and Tarjan has an algorithm that runs in time [1].
Because all edge has a cost function with at most breakpoint. The bounded regression problem can be solved in time.
4 Isotonic regression
We can try to minimize instead ( error). It is a generalization of the lipschitz isotonic regression problem [2] when and for some constant . We can also ask to minimize the error.
If the upper bounds are and all lower bounds are , then the problem is called the isotonic regression problem. I have solved a interesting problem using isotonic regression.
We can express all the problems as min-cost circulation problem on a appropriate graph. If the min-cost circulation algorithm on those graphs have the same running time as current best algorithm, it would imply something more general is acting in the background.
Here is what we know.
- error: This post shows it can be solved in time using the min-cost circulation formulation. It matches the running time of specialized algorithms.
- error: It can be solved in time, but doesn’t come from the quadratic cost min-cost circulation formulation.
- error: It can be solved in time. However, it doesn’t come from the minimax circulation problem. (In the minimax circulation, the cost is the largest edge cost incurred by the circulation).
- error: It can be solved in time. This is equivalent to the longest non-decreasing subsequence problem.
This prompt the following two natural problems:
Can min-cost circulation with quadratic cost on series parallel graph have time solution? This is in fact possible when all edges have no capacity[3]. But with capacity, even for a edge with a lower bound of and cost, we don’t know.
What about minimax circulation? We can’t find any study of minimax circulation on series-parallel graphs.