Misra and Gries describe a couple of approaches. I don't entirely understand their paper, but a key idea is to use a bag.
Boyer and Moore's original majority algorithm paper has a lot of incomprehensible proofs and discussion of formal verification of FORTRAN code, but it also gives a very good start on an explanation of how the majority algorithm works. The key idea is that if the majority of the elements are A, and you remove, one at a time, a copy of A and a copy of something else, then in the end you will have only copies of A. Next, it should be clear that removing two different items, neither of which is A, can only increase the majority that A holds. Therefore it's safe to remove any pair of items, as long as they're different.

This idea can then be made concrete. Take the first item out of the list and stick it in a box. Take the next item out and stick it in the box. If they're the same, let them both sit there. If the new one is different, throw it away, along with an item from the box. Repeat until all items are either in the box or in the trash. Since the box is only allowed to hold one kind of item at a time, it can be represented very efficiently as a pair (item type, count).
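To make the box-and-trash procedure explicit, here's a minimal sketch in Python (the function name majority_candidate is my own; the original paper works with FORTRAN):

    def majority_candidate(items):
        # The "box" is represented as a pair: (item type, count).
        candidate, count = None, 0
        for x in items:
            if count == 0:
                # The box is empty: put the new item in it.
                candidate, count = x, 1
            elif x == candidate:
                # Same kind as the box's contents: let it sit there too.
                count += 1
            else:
                # Different kind: throw it in the trash along with one item from the box.
                count -= 1
        return candidate

For example, majority_candidate([1, 2, 1, 3, 1, 1]) returns 1. Note that if no element actually has a majority, whatever survives in the box is meaningless, so a second counting pass is needed to confirm the result.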
The generalization to find all items that may occur more than n/k times is simple, but explaining why it works is a little harder. The basic idea is that we can find and destroy groups of k distinct elements without changing anything. Why? If w > n/k, then w - 1 > (n - k)/k. That is, if an element occurs w times out of n, and we take away one copy of it along with k-1 other distinct elements, it still occurs w - 1 times out of the remaining n - k, which keeps it above the threshold: the popular element remains popular!
Implementation: instead of only allowing one kind of item in the box, allow k-1 of them. Whenever you see a group of k different items show up (that is, there are k-1 types in the box, and the one arriving doesn't match any of them), you throw one of each type in the trash, including the one that just arrived.

What data structure should we use for this "box"? Well, a bag, of course! As Misra and Gries explain, if the elements can be ordered, a tree-based bag with O(log k) basic operations will give the whole algorithm a complexity of O(n log k). One point to note is that the operation of removing one of each element is expensive (I think O(k log k)), but that cost is amortized over the arrivals of those elements, so it's no big deal. Of course, if your elements are hashable rather than orderable, you can use a hash-based bag instead, which under certain common assumptions will give even better asymptotic performance (but it's not guaranteed). If your elements are drawn from a small finite set, you can guarantee that. If they can only be compared for equality, then your bag gets much more expensive and I'm pretty sure you end up with something like O(nk) instead.
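Here's a minimal sketch of the generalized version in Python, using a plain dict as the hash-based bag (the function name frequent_candidates is my own; a tree-based bag would give the guaranteed O(n log k) bound discussed above):

    def frequent_candidates(items, k):
        # The "box" is a bag holding at most k-1 distinct types: element -> count.
        box = {}
        for x in items:
            if x in box:
                box[x] += 1
            elif len(box) < k - 1:
                box[x] = 1
            else:
                # k distinct items are now in play (k-1 types in the box plus x):
                # throw one of each type in the trash, including the one that just arrived.
                for y in list(box):
                    box[y] -= 1
                    if box[y] == 0:
                        del box[y]
        return set(box)

Every element occurring more than n/k times is guaranteed to be among the returned candidates, but the converse doesn't hold, so a second pass over the data is needed to check the actual counts. For example, frequent_candidates("abracadabra", 3) returns {'a'}, and indeed 'a' is the only character appearing more than 11/3 times.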