Problem
You are interested in analyzing some hard-to-obtain data from two separate databases. Each database contains n
numerical values—so there are 2n
values total—and you may assume that no two values are the same. You’d like to determine the median of this set of 2n
values, which we will define here to be the nth smallest value. However, the only way you can access these values is through queries to the databases. In a single query, you can specify a value k
to one of the two databases, and the chosen database will return the k
th smallest value that it contains. Since queries are expensive, you would like to compute the median using as few queries as possible. Task Design an algorithm that finds the median value using at most O(log n) queries.
Solution
def median(A, B):
if len(A) == 1 or len(B) == 1:
return any element from A or B
else:
m1 <- query(A, n/2)
m2 <- query(B, n/2)
if m1 == m2:
return m1
elif m1 < m2:
return median(A[n/2 : n], B[1 : n/2])
else:
return median(A[1 : n/2], B[n/2 : n])
Proof Outline
Let median(A, B)
correctly compute the median of inputs of size less than ( n ).
Base Case: If there is only one element, return that element as the median.
Inductive Step: Assume the function correctly computes the median for arrays of size less than n. Now, consider arrays A and B of size n.
To find the median: Use the query function twice to determine the medians of A and B, equal to m1 and m2. Since A and B are of equal size:
If ( m_1 = m_2 ), the median of ( A + B ) is ( m_1 ) (or ( m_2 )). This is because both halves of A and Bare equally sized, thus an equal number of elements in A and B are greater than and less than the median:
Without loss of generality, assume m1 < m2. (If not, swap the arrays)
From the properties of medians:
Since m1 < m2, it follows that:
Thus, the median of the combined array must lie within ( A[n/2+1:n] + B[1:n/2] ). Removing the first n/2 elements and the last n/2 elements of A + B preserves the median.
By the inductive hypothesis, the recursive call to median(A', B')
(on the reduced arrays) correctly computes the median. Hence, the function works for input arrays of size ( n ).