Union-Find
Danny Sleator
15-211 Spring 2009
January 15, 2009

----------------------------------
Summary

  The problem
  Equivalence relations
  Tree-based solutions
    non-optimized solution, and analysis
    union by rank, and analysis
    path compression
----------------------------------

Last time we talked about an algorithm for generating a maze (in this
case, really a spanning tree of a graph).  We came up with this
algorithm:

  1. Start with a "fully-walled" maze.
  2. Randomly pick a wall and delete it if it won't create a cycle.
  3. Stop when every room is reachable from every other room.

    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+

We also interpreted this graph theoretically.  In this case, regard
each room above as a node in a graph.  Each room has an edge to its
neighbors (crossing a wall).  We're going to find a random subset of
the edges of this graph that forms a tree connecting all the nodes
together.

  1. Start with no edges.
  2. Randomly pick an edge and add it if it won't create a cycle.   <-------- This lecture
  3. Stop when the current edges connect all the nodes together.

It's clear that at any point in time what we have selected is a set
of edges that forms a forest of trees.

We did not talk about the test to see if adding this edge to the
current forest of trees will create a cycle.  That is the subject of
this lecture.  We'll develop a simple data structure to solve this
problem.  We'll analyze its worst-case behavior.  Then we'll come up
with a couple of improved versions, and analyze their behaviors too.

Family of Disjoint Sets
-----------------------

It's fairly obvious what's going on here.  When doing the test "will
adding this edge create a cycle", what we really need to know is
whether the two nodes at opposite ends of this edge are already
connected together.  All that matters is the partitioning of the
nodes of the graph into disjoint sets, and what we need to do is
(1) given two nodes, determine if they're in the same set, and
(2) given two nodes in different sets, union those two sets together.
Maintaining a collection of disjoint sets and supporting these two
operations is called the UNION-FIND problem.

[[Aside: This can be viewed as the problem of implementing a sort of
dynamic equivalence relation.  A relation ~ (a set of pairs of
elements from some base set) is called an equivalence relation if
(for all a, b, and c) it satisfies:

    a ~ a                          reflexive
    a ~ b iff b ~ a                symmetric
    a ~ b & b ~ c implies a ~ c    transitive

Quiz: which of these are equivalence relations?  <, <=, a = O(b), =,
"connected in an undirected graph".

The Union-Find problem can be regarded as a sort of implementation of
a dynamic equivalence relation, where the relation can change as a
result of union operations. ]]

1st Implementation
------------------

The way we'll approach this problem is to implement the following
methods.

    UnionFind(n)   Create a collection of n sets {0}, {1}, {2}, ... {n-1}.

    find(i)        Return the *canonical element* of the set containing i.

    union(i,j)     i and j are canonical elements of two disjoint sets.
                   This unions them together.

The "canonical element" can be regarded as simply the unique name of
the set.  That is, i and j are in the same set if and only if
find(i) == find(j).
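To tie this interface back to the maze algorithm, here is a sketch of
the edge-selection loop (this is illustration only, not the code we
develop in lecture; the Edge class and the method name
randomSpanningTree are made up, and it assumes the UnionFind class we
are about to build):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class MazeSketch {
        // A wall between rooms u and v; rooms are numbered 0 .. n-1.
        static class Edge {
            final int u, v;
            Edge(int u, int v) { this.u = u; this.v = v; }
        }

        // Returns the walls to delete: a random spanning tree of the room graph.
        static List<Edge> randomSpanningTree(int n, List<Edge> walls) {
            UnionFind uf = new UnionFind(n);
            Collections.shuffle(walls);          // "randomly pick a wall"
            List<Edge> deleted = new ArrayList<>();
            for (Edge e : walls) {
                int a = uf.find(e.u);
                int b = uf.find(e.v);
                if (a != b) {                    // endpoints in different sets: no cycle
                    uf.union(a, b);
                    deleted.add(e);
                }
            }
            return deleted;                      // n-1 walls once all rooms are connected
        }
    }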
Note that if you don't like these assumptions on union(), you can
remove them as follows:

    void safe_union(int i, int j) {
        int a = find(i);
        int b = find(j);
        if (a != b) union(a, b);
    }

Also, the assumptions that the base set is known in advance, and that
the elements are integers, are not material.  They just simplify the
exposition here, and are easily removed.

Here's our first solution:

    class UnionFind {
        int[] p;

        UnionFind(int n) {
            p = new int[n];
            for (int i = 0; i < n; i++) {
                p[i] = -1;
            }
        }

        int find(int i) {
            if (p[i] == -1) return i;
            return find(p[i]);
        }

        void union(int i, int j) {
            p[i] = j;
        }
    }

The array p[] is viewed as specifying the "parent" of each set
element.  An element that is a root has parent -1.  The invariant
that holds is that when we regard p[] as a parent pointer, the array
represents a set of disjoint trees.  Each of these disjoint trees is
one of the disjoint sets of our collection.  The canonical element of
a set is the root.  Since union() is only called on a pair of root
elements, the tree structure is preserved.  (Making the parent of the
root of one tree be a node in another tree preserves treeness.)

[[Example]]

Consider a sequence of m of these operations, starting from a
collection of n disjoint sets.  How long can this take, in the worst
case?  Well, clearly, union is constant time, O(1).  And find, in the
worst case, touches every element, so it's O(n).  [We'll be covering
asymptotic notation, i.e. O(), in much more detail in the next few
lectures.]  This indicates that at worst, the running time is O(mn).
Is it possible to achieve something this bad?  Yes.  What is the bad
example?

2nd Solution: Union by Rank
---------------------------

Well, the bad example was constructed by making the root of a large
tree point to a small tree.  But when we do a union, we can do the
link either way, and the result will still be correct.  So what if we
always make the root of the "smaller" tree link to the root of the
"larger" tree?

We do this by defining a measure of the size of the tree called the
"rank".  It works as follows.  Initially the rank of each singleton
set is 0.  When we link two trees of differing ranks together, we
always make the one of smaller rank the child of the larger one.  If
we are linking two trees of equal rank together, we do it either way,
but then increment the rank of the new root.  Here is the code for
this:

    class UnionFind {
        int[] p;
        int[] r;

        UnionFind(int n) {
            p = new int[n];
            r = new int[n];
            for (int i = 0; i < n; i++) {
                p[i] = -1;
                r[i] = 0;
            }
        }

        int find(int i) {
            if (p[i] == -1) return i;
            return find(p[i]);
        }

        void union(int i, int j) {
            if (r[i] > r[j]) { int t = i; i = j; j = t; }
            // now r[j] >= r[i]
            if (r[i] == r[j]) r[j]++;
            p[i] = j;
        }
    }

The same argument as before assures us that this code is correct.
Only the running time is different from before.
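Before analyzing the running time, here is a tiny worked example of
union by rank (a sketch using the class above; the particular
elements chosen are arbitrary).  A rank only goes up when two roots
of *equal* rank are linked, so even reaching rank 2 already takes
four elements:

    UnionFind uf = new UnionFind(4);   // four singletons, all rank 0
    uf.union(uf.find(0), uf.find(1));  // links two rank-0 roots: new root has rank 1
    uf.union(uf.find(2), uf.find(3));  // links two rank-0 roots: new root has rank 1
    uf.union(uf.find(0), uf.find(2));  // links two rank-1 roots: new root has rank 2
    // the rank-2 root now has all 4 elements in its tree

Property 2 below makes this precise: a root of rank r always has at
least 2^r nodes in its tree.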
Again the cost of union() is O(1).  What about find()?

Property 1: Each step up the tree causes the rank to increase.

Proof: All we need to prove is that if this property holds before a
union() operation, then it will hold after the union() operation.  A
union just adds one more link to the forest of trees, and only
changes the rank of the root node of the new set.  So we only have to
verify the property in the vicinity of these changes.

Case 1: the two nodes being unioned have different ranks.  Then the
bigger one is put on top, and clearly Property 1 will hold when
traversing the new edge.

Case 2: the two nodes being unioned have the same rank.  In this case
the new parent has its rank increased, so traversing the new edge
satisfies Property 1.  And increasing the rank of the root just
causes the traversal of other edges into that root to have an even
bigger rank increase.  QED.

Property 2: Take any node.  Let r be its rank and let s be its *size*
(the number of nodes in the subtree rooted there).  Then the
following holds:

    s >= 2^r

Proof: Clearly the property holds initially, because for all nodes
the size is 1 and the rank is 0.  We just need to show that the
property is preserved when we do a union.  Let the two trees being
unioned have ranks q and r, with q <= r.  Let the sizes of these
trees be a and b respectively.  [[Pictures]]  To prove the result we
only have to consider the inequality at the new root, because for all
other nodes their sizes and ranks stay the same.

Case 1: q < r.  In this case the new rank of the root is still r.
The size of the tree rooted there used to be b, and now it's a+b.  If
b >= 2^r then clearly a+b >= 2^r.

Case 2: q == r.  Here the new rank is r' = r+1.  Inductively we know
that

    a >= 2^q   and   b >= 2^r

Adding these together, and noting that q = r, we get

    a+b >= 2^r + 2^r = 2^(r+1) = 2^r'

QED.

Now, finally, we can actually get a handle on the worst-case running
time of this algorithm.  Again union() is O(1).  How long could
find() take?  The biggest a tree can be is n nodes.  How big could
the rank be?  Let r be the rank.  We know that

    n >= 2^r

By taking logs of both sides (and swapping sides) we get r <= log(n).
We also know that (Property 1) each step up the tree increases the
rank.  Therefore the cost of a find() is bounded by O(log n).  And we
can conclude that the total cost of a sequence of m operations is at
most

    O(m log(n))

It's also not hard to create a case that is this bad.  [[work it out]]

[Note: In this course the default base to assume when you see the
word "log" is base 2.  Roughly, log n is the number of times you have
to double a number (starting from 1) until you get to n.  We'll have
much more to say about logs later in the course.]

3rd Solution: Union by Rank and Path Compression
------------------------------------------------

Can we do even better than this?  Here's an interesting observation.
If, after doing a find, we restructure the tree into a new tree that
has the same root, the whole algorithm will still work correctly.
Shortening the paths from nodes to the root could be useful.  In
particular, it's easy to make every node on the find path have the
root as its parent.  This is called *path compression*.  The only
change needed in the code is that

    int find(int i) {
        if (p[i] == -1) return i;
        return find(p[i]);
    }

becomes

    int find(int i) {
        if (p[i] == -1) return i;
        return (p[i] = find(p[i]));
    }

What happens to the running time?  Well, it gets much more efficient.
Now, the cost of a sequence of m operations on an initial collection
of n disjoint sets has running time

    O(m log*(n))

where log*(n) is the number of times (starting from n) you have to
take logs to get to 1 (or below).

    log*(1)       = 0
    log*(2)       = 1
    log*(4)       = 2
    log*(16)      = 3
    log*(65536)   = 4
    log*(2^65536) = 5

Note that the last one, 2^65536, is more than the number of particles
in the known universe.  But even log*() is a fast-growing function
compared to the real answer.  The correct bound is

    O(m alpha(m,n))

where alpha() is the inverse of the Ackermann function.  The growth
rate of this is so monstrously slow that it makes log*() look like
it's skyrocketing to infinity.  The details are beyond the scope of
this course.  See the book "Data Structures and Network Algorithms",
by Robert Tarjan, for all the details.
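One remark about the code before the final optimization below.  The
recursive find() is not the only way to do path compression; here is
a sketch (not part of the lecture code) of an equivalent iterative,
two-pass version of the same method, for the class with the separate
rank array:

    int find(int i) {
        int root = i;
        while (p[root] != -1) root = p[root];  // pass 1: locate the root
        while (p[i] != -1) {                   // pass 2: point each node on the path at the root
            int next = p[i];
            p[i] = root;
            i = next;
        }
        return root;
    }

It produces exactly the same compressed tree as the recursive
version; it just avoids the recursion.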
One Final Code Optimization
---------------------------

Here's an observation.  You only need the rank values on nodes that
are roots of trees -- i.e. nodes which have no parent.  This allows
you to keep a single array instead of both a parent array and a rank
array.  The trick is to let negative values of p[] indicate a root
node, and have the value be -(r[i] + 1), where r[i] is the rank.  (So
a root of rank 0 has p value -1, and decrementing a root's p entry
increments its rank.)  Here's the code.

    class UnionFind {
        int[] p;

        UnionFind(int n) {
            p = new int[n];
            for (int i = 0; i < n; i++) {
                p[i] = -1;             // every element starts as a root of rank 0
            }
        }

        int find(int i) {
            if (p[i] < 0) return i;    // negative entry: i is a root
            return (p[i] = find(p[i]));
        }

        void union(int i, int j) {
            if (p[i] < p[j]) { int t = i; i = j; j = t; }
            // now rank(i) <= rank(j), i.e. p[i] >= p[j]
            if (p[i] == p[j]) p[j]--;  // equal ranks: the new root's rank goes up
            p[i] = j;
        }
    }
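Finally, a small hypothetical test of this last version (the class
name UnionFindDemo and the particular unions are made up for
illustration).  It uses the "safe" pattern of finding the canonical
elements before calling union():

    public class UnionFindDemo {
        public static void main(String[] args) {
            UnionFind uf = new UnionFind(6);     // sets {0} {1} {2} {3} {4} {5}

            int a = uf.find(0), b = uf.find(1);
            if (a != b) uf.union(a, b);          // now {0,1} {2} {3} {4} {5}

            a = uf.find(1); b = uf.find(2);
            if (a != b) uf.union(a, b);          // now {0,1,2} {3} {4} {5}

            System.out.println(uf.find(0) == uf.find(2));  // true
            System.out.println(uf.find(0) == uf.find(3));  // false
        }
    }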