A segment tree problem

22-12-2018

I came across a problem with a solution that surprised me. Below is the abridged problem statement. You can find the full problem here.

We say an element dominates a subarray of $n$ elements if it appears more than $n/2$ times. For example, 1 dominates [1, 1, 2, 1, 3] and no number dominates [1, 2, 1, 2]. We are given an array of numbers and have to answer some queries on it. Each query is an interval. For each query, we have to determine the number, if any, that dominates the subarray described by that interval.

This problem can be solved using a segment tree that keeps track of the dominating number in each node. Here’s the crucial observation: the dominating number for a node, if it exists, must be the same as either children’s. This is easy to verify. So for every node, we just need to check which of the two children’s dominating numbers also dominates that node. This check can be done in $O(\log n)$ if we also maintain frequency statistics in each node (using a hash map, for example). Thus, we can answer each query in $O(\log\log n)$ .

As you can see, this is a fairly standard segment tree problem. I was thus delighted to learn that there exists an elegant probabilistic solution.

To answer a query, we can pick a random element in that subarray and check if it dominates the subarray. If it is a dominating number, we are done. Otherwise, repeat this process a few more times. If we still don’t find a dominating number, there probably isn’t one. In each trial, the probabililty of picking a dominating number if it exists is at least $1/2$ . So the probability of a false negative, the event that we fail to find one after $k$ trials, is at most $1/2^k$ . As is typical of randomised algorithms, we can easily make this margin of error as small as we want since the probability of a false negative shrinks at an exponential rate in $k$ . Since we only run a constant number of trials, we can answer each query in $O(\log n)$ .

As a bonus, we can do away with segment trees altogether. We don’t need it to maintain the frequency statistics, nor do we need a fenwick tree. Suppose we have a vector<vector<int>> v(n) such that each $v[i]$ stores the indices where $i$ occurs in the original array in sorted order. We can populate $v$ in a single pass over the array. Then, the frequency of $i$ in the interval $[l, r]$ , can be computed by binary searching for $l$ and $r$ in $v[i]$ . This structure also uses $O(n)$ memory and can answer queries in $O(\log n)$ .