Find Median from Data Stream (Leetcode #295)

The problem of finding the median from a stream of numbers is a classic question, often found in coding challenges. This post will break down how you can solve this problem efficiently, starting from a simple brute force approach, all the way to a sophisticated and optimized solution. We’ll also explain the code for a more efficient solution and dive into its time and space complexities.

Understanding the Problem Statement

The challenge asks us to design a data structure that can handle two operations:

  1. addNum(int num) - Add a number to a running list of numbers.

  2. findMedian() - Return the median value of the current list of numbers.

The median is defined as the middle value when the numbers are sorted. If the size of the list is even, the median is the average of the two middle values. The operations should ideally be optimized for efficiency, particularly if there are many numbers in the data stream.

Brute Force Approach

The brute force approach to solve this problem is to keep all numbers in a list and sort them every time you need to find the median.

  1. Step 1: Maintain an unsorted list of numbers.

  2. Step 2: When you need to find the median, sort the list and then pick the middle element (or average the two middle elements if the list length is even).

While simple, this solution becomes inefficient for large data streams because the sorting operation is expensive—it runs in O(N log N) time complexity each time you need to find the median.

Hint to Solve the Problem Efficiently

To optimize finding the median, consider how you can use two heaps—a min-heap and a max-heap—to maintain the order of numbers while ensuring quick access to the median. The idea is to split the numbers into two halves, making it easy to balance them and retrieve the median.

Efficient Solution

The efficient solution for this problem leverages two heaps:

  • Max-Heap for the left half (stores the smaller half of numbers)

  • Min-Heap for the right half (stores the larger half of numbers)

Here is the Python code that implements this efficient approach:

import heapq

class MedianFinder:

    def __init__(self):
        self.leftHalf = []  # Max heap (invert the values for max heap behavior)
        self.rightHalf = [] # Min heap

    def addNum(self, num: int) -> None:
        if not self.leftHalf or num <= -self.leftHalf[0]:
            heapq.heappush(self.leftHalf, -num)  # Invert num for max heap
        else:
            heapq.heappush(self.rightHalf, num)

        # Rebalance heaps
        if len(self.leftHalf) > len(self.rightHalf) + 1:
            heapq.heappush(self.rightHalf, -heapq.heappop(self.leftHalf))
        elif len(self.rightHalf) > len(self.leftHalf):
            heapq.heappush(self.leftHalf, -heapq.heappop(self.rightHalf))

    def findMedian(self) -> float:
        if len(self.leftHalf) > len(self.rightHalf):
            return -self.leftHalf[0]
        elif len(self.rightHalf) > len(self.leftHalf):
            return self.rightHalf[0]
        else:
            return (-self.leftHalf[0] + self.rightHalf[0]) / 2.0

Explanation:

  1. Heaps Setup: The left half (max-heap) stores the smaller numbers, and the right half (min-heap) stores the larger numbers.

  2. Balancing Heaps: Every time we add a number, we place it in the appropriate heap. If one heap becomes larger by more than one element, we rebalance them by moving an element from one heap to the other.

  3. Finding the Median: The median is simply the root of the heap that has more elements, or the average of the roots if both heaps are balanced.

Time and Space Complexity

  • Time Complexity:

    • The addNum() function operates in O(log N) time because inserting into a heap takes O(log N).

    • The findMedian() function runs in O(1) time, as accessing the root of a heap is constant.

  • Space Complexity:

    • The solution uses O(N) space, where N is the number of elements in the data stream. This space is used by the two heaps to store all elements.

Conclusion

The brute force approach may work for smaller data sets but is inefficient for larger data streams. By using two heaps, we can efficiently find the median in logarithmic time. This solution strikes a balance between keeping the numbers sorted and providing fast median retrieval, making it ideal for handling real-time data streams.