Union-Find
Danny Sleator
15-211 Spring 2009
January 15, 2009

----------------------------------
Summary

  The problem
  Equivalence relations
  Tree-based solutions
    non-optimized solution, and analysis
    union by rank, and analysis
    path compression
----------------------------------

Last time we talked about an algorithm for generating a maze (in this
case, really a spanning tree of a graph).  We came up with this
algorithm:

  1. Start with a "fully-walled" maze.
  2. Randomly pick a wall and delete it if it won't create a cycle.
  3. Stop when every room is reachable from every other room.

    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+
    |  |  |  |  |  |
    +--+--+--+--+--+

We also interpreted this graph theoretically.  In this case, regard
each room above as a node in a graph.  Each room has an edge to its
neighbors (crossing a wall).  We're going to find a random subset of
the edges of this graph that forms a tree connecting all the nodes
together.

  1. Start with no edges.
  2. Randomly pick an edge and add it if it won't create a cycle.   <-------- This lecture
  3. Stop when the current edges connect all the nodes together.

It's clear that at any point in time what we have selected is a set
of edges that forms a forest of trees.

We did not talk about the test to see if adding this edge to the
current forest of trees will create a cycle.  That is the subject of
this lecture.  We'll develop a simple data structure to solve this
problem.  We'll analyze its worst-case behavior.  Then we'll come up
with a couple of improved versions, and analyze their behaviors too.

Family of Disjoint Sets
-----------------------

It's fairly obvious what's going on here.  When doing the test "will
adding this edge create a cycle", what we really need to know is
whether the two nodes at opposite ends of this edge are already
connected together.  All that matters is the partitioning of the
nodes of the graph into disjoint sets, and what we need to do is
(1) given two nodes, determine if they're in the same set, and
(2) given two nodes in different sets, union those two sets together.
Maintaining a collection of disjoint sets and supporting these two
operations is called the UNION-FIND problem.

[[Aside: This can be viewed as the problem of implementing a sort of
dynamic equivalence relation.  A relation ~ (a set of pairs of
elements from some base set) is called an equivalence relation if
(for all a, b, and c) it satisfies:

    a ~ a                          reflexive
    a ~ b iff b ~ a                symmetric
    a ~ b & b ~ c implies a ~ c    transitive

Quiz: which of these are equivalence relations?  <, <=, a = O(b), =,
"connected in an undirected graph".

The Union-Find problem can be regarded as a sort of implementation of
a dynamic equivalence relation, where the relation can change as a
result of union operations. ]]

1st Implementation
------------------

The way we'll approach this problem is to implement the following
methods.

    UnionFind(n)   Create a collection of n sets {0}, {1}, {2}, ... {n-1}.

    find(i)        Return the *canonical element* of the set containing i.

    union(i,j)     i and j are canonical elements of two disjoint sets.
                   This unions them together.

The "canonical element" can be regarded as simply the unique name of
the set.  That is, i and j are in the same set if and only if
find(i) == find(j).
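To tie this interface back to the maze algorithm, here is a sketch of
the edge-selection loop (this is illustration only, not the code we
develop in lecture; the Edge class and the method name
randomSpanningTree are made up, and it assumes the UnionFind class we
are about to build):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class MazeSketch {
        // A wall between rooms u and v; rooms are numbered 0 .. n-1.
        static class Edge {
            final int u, v;
            Edge(int u, int v) { this.u = u; this.v = v; }
        }

        // Returns the walls to delete: a random spanning tree of the room graph.
        static List<Edge> randomSpanningTree(int n, List<Edge> walls) {
            UnionFind uf = new UnionFind(n);
            Collections.shuffle(walls);          // "randomly pick a wall"
            List<Edge> deleted = new ArrayList<>();
            for (Edge e : walls) {
                int a = uf.find(e.u);
                int b = uf.find(e.v);
                if (a != b) {                    // endpoints in different sets: no cycle
                    uf.union(a, b);
                    deleted.add(e);
                }
            }
            return deleted;                      // n-1 walls once all rooms are connected
        }
    }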
Note that if you don't like these assumptions on union(), you can
remove them as follows:

    void safe_union(int i, int j) {
        int a = find(i);
        int b = find(j);
        if (a != b) union(a, b);
    }

Also, the assumptions that the base set is known in advance, and that
the elements are integers, are not material.  They just simplify the
exposition here, and are easily removed.

Here's our first solution:

    class UnionFind {
        int[] p;

        UnionFind(int n) {
            p = new int[n];
            for (int i = 0; i < n; i++) {
                p[i] = -1;
            }
        }

        int find(int i) {
            if (p[i] == -1) return i;
            return find(p[i]);
        }

        void union(int i, int j) {
            p[i] = j;
        }
    }

The array p[] is viewed as specifying the "parent" of each set
element.  An element that is a root has parent -1.  The invariant
that holds is that when we regard p[] as a parent pointer, the array
represents a set of disjoint trees.  Each of these disjoint trees is
one of the disjoint sets of our collection.  The canonical element of
a set is the root.  Since union() is only called on a pair of root
elements, the tree structure is preserved.  (Making the parent of the
root of one tree be a node in another tree preserves treeness.)

[[Example]]

Consider a sequence of m of these operations, starting from a
collection of n disjoint sets.  How long can this take, in the worst
case?  Well, clearly, union is constant time, O(1).  And find, in the
worst case, touches every element, so it's O(n).  [We'll be covering
asymptotic notation, i.e. O(), in much more detail in the next few
lectures.]  This indicates that at worst, the running time is O(mn).
Is it possible to achieve something this bad?  Yes.  What is the bad
example?

2nd Solution: Union by Rank
---------------------------

Well, the bad example was constructed by making the root of a large
tree point to a small tree.  But when we do a union, we can do the
link either way, and the result will still be correct.  So what if we
always make the root of the "smaller" tree link to the root of the
"larger" tree?

We do this by defining a measure of the size of the tree called the
"rank".  It works as follows.  Initially the rank of each singleton
set is 0.  When we link two trees of differing ranks together, we
always make the one of smaller rank the child of the larger one.  If
we are linking two trees of equal rank together, we do it either way,
but then increment the rank of the new root.  Here is the code for
this:

    class UnionFind {
        int[] p;
        int[] r;

        UnionFind(int n) {
            p = new int[n];
            r = new int[n];
            for (int i = 0; i < n; i++) {
                p[i] = -1;
                r[i] = 0;
            }
        }

        int find(int i) {
            if (p[i] == -1) return i;
            return find(p[i]);
        }

        void union(int i, int j) {
            if (r[i] > r[j]) { int t = i; i = j; j = t; }
            // now r[j] >= r[i]
            if (r[i] == r[j]) r[j]++;
            p[i] = j;
        }
    }

The same argument as before assures us that this code is correct.
Only the running time is different from before.
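Before analyzing the running time, here is a tiny worked example of
union by rank (a sketch using the class above; the particular
elements chosen are arbitrary).  A rank only goes up when two roots
of *equal* rank are linked, so even reaching rank 2 already takes
four elements:

    UnionFind uf = new UnionFind(4);   // four singletons, all rank 0
    uf.union(uf.find(0), uf.find(1));  // links two rank-0 roots: new root has rank 1
    uf.union(uf.find(2), uf.find(3));  // links two rank-0 roots: new root has rank 1
    uf.union(uf.find(0), uf.find(2));  // links two rank-1 roots: new root has rank 2
    // the rank-2 root now has all 4 elements in its tree

Property 2 below makes this precise: a root of rank r always has at
least 2^r nodes in its tree.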
Again the cost of union() is O(1).  What about find()?

Property 1: Each step up the tree causes the rank to increase.

Proof: All we need to prove is that if this property holds before a
union() operation, then it will hold after the union() operation.  A
union just adds one more link to the forest of trees, and only
changes the rank of the root node of the new set.  So we only have to
verify the property in the vicinity of these changes.

Case 1: the two nodes being unioned have different ranks.  Then the
bigger one is put on top, and clearly Property 1 will hold when
traversing the new edge.

Case 2: the two nodes being unioned have the same rank.  In this case
the new parent has its rank increased, so traversing the new edge
satisfies Property 1.  And increasing the rank of the root just
causes the traversal of other edges into that root to have an even
bigger rank increase.  QED.

Property 2: Take any node.  Let r be its rank and let s be its *size*
(the number of nodes in the subtree rooted there).  Then the
following holds:

    s >= 2^r

Proof: Clearly the property holds initially, because for all nodes
the size is 1 and the rank is 0.  We just need to show that the
property is preserved when we do a union.  Let the two trees being
unioned have ranks q and r, with q <= r.  Let the sizes of these
trees be a and b respectively.  [[Pictures]]  To prove the result we
only have to consider the inequality at the new root, because for all
other nodes their sizes and ranks stay the same.

Case 1: q < r.  In this case the new rank of the root is still r.
The size of the tree rooted there used to be b, and now it's a+b.  If
b >= 2^r then clearly a+b >= 2^r.

Case 2: q == r.  Here the new rank is r' = r+1.  Inductively we know
that

    a >= 2^q   and   b >= 2^r

Adding these together, and noting that q = r, we get

    a+b >= 2^r + 2^r = 2^(r+1) = 2^r'

QED.

Now, finally, we can actually get a handle on the worst-case running
time of this algorithm.  Again union() is O(1).  How long could
find() take?  The biggest a tree can be is n nodes.  How big could
the rank be?  Let r be the rank.  We know that

    n >= 2^r

By taking logs of both sides (and swapping sides) we get r <= log(n).
We also know that (Property 1) each step up the tree increases the
rank.  Therefore the cost of a find() is bounded by O(log n).  And we
can conclude that the total cost of a sequence of m operations is at
most

    O(m log(n))

It's also not hard to create a case that is this bad.  [[work it out]]

[Note: In this course the default base to assume when you see the
word "log" is base 2.  Roughly, log n is the number of times you have
to double a number (starting from 1) until you get to n.  We'll have
much more to say about logs later in the course.]

3rd Solution: Union by Rank and Path Compression
------------------------------------------------

Can we do even better than this?  Here's an interesting observation.
If, after doing a find, we restructure the tree into a new tree that
has the same root, the whole algorithm will still work correctly.
Shortening the paths from nodes to the root could be useful.  In
particular, it's easy to make every node on the find path have the
root as its parent.  This is called *path compression*.  The only
change needed in the code is that

    int find(int i) {
        if (p[i] == -1) return i;
        return find(p[i]);
    }

becomes

    int find(int i) {
        if (p[i] == -1) return i;
        return (p[i] = find(p[i]));
    }

What happens to the running time?  Well, it gets much more efficient.
Now, the cost of a sequence of m operations on an initial collection
of n disjoint sets has running time

    O(m log*(n))

where log*(n) is the number of times (starting from n) you have to
take logs to get to 1 (or below).

    log*(1)       = 0
    log*(2)       = 1
    log*(4)       = 2
    log*(16)      = 3
    log*(65536)   = 4
    log*(2^65536) = 5

Note that the last one, 2^65536, is more than the number of particles
in the known universe.  But even log*() is a fast-growing function
compared to the real answer.  The correct bound is

    O(m alpha(m,n))

where alpha() is the inverse of the Ackermann function.  The growth
rate of this is so monstrously slow that it makes log*() look like
it's skyrocketing to infinity.  The details are beyond the scope of
this course.  See the book "Data Structures and Network Algorithms",
by Robert Tarjan, for all the details.
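One remark about the code before the final optimization below.  The
recursive find() is not the only way to do path compression; here is
a sketch (not part of the lecture code) of an equivalent iterative,
two-pass version of the same method, for the class with the separate
rank array:

    int find(int i) {
        int root = i;
        while (p[root] != -1) root = p[root];  // pass 1: locate the root
        while (p[i] != -1) {                   // pass 2: point each node on the path at the root
            int next = p[i];
            p[i] = root;
            i = next;
        }
        return root;
    }

It produces exactly the same compressed tree as the recursive
version; it just avoids the recursion.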
One Final Code Optimization
---------------------------

Here's an observation.  You only need the rank values on nodes that
are roots of trees -- i.e. nodes which have no parent.  This allows
you to keep a single array instead of both a parent array and a rank
array.  The trick is to let negative values of p[] indicate a root
node, and have the value be -(r[i] + 1), where r[i] is the rank.  (So
a root of rank 0 has p value -1, and decrementing a root's p entry
increments its rank.)  Here's the code.

    class UnionFind {
        int[] p;

        UnionFind(int n) {
            p = new int[n];
            for (int i = 0; i < n; i++) {
                p[i] = -1;             // every element starts as a root of rank 0
            }
        }

        int find(int i) {
            if (p[i] < 0) return i;    // negative entry: i is a root
            return (p[i] = find(p[i]));
        }

        void union(int i, int j) {
            if (p[i] < p[j]) { int t = i; i = j; j = t; }
            // now rank(i) <= rank(j), i.e. p[i] >= p[j]
            if (p[i] == p[j]) p[j]--;  // equal ranks: the new root's rank goes up
            p[i] = j;
        }
    }
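Finally, a small hypothetical test of this last version (the class
name UnionFindDemo and the particular unions are made up for
illustration).  It uses the "safe" pattern of finding the canonical
elements before calling union():

    public class UnionFindDemo {
        public static void main(String[] args) {
            UnionFind uf = new UnionFind(6);     // sets {0} {1} {2} {3} {4} {5}

            int a = uf.find(0), b = uf.find(1);
            if (a != b) uf.union(a, b);          // now {0,1} {2} {3} {4} {5}

            a = uf.find(1); b = uf.find(2);
            if (a != b) uf.union(a, b);          // now {0,1,2} {3} {4} {5}

            System.out.println(uf.find(0) == uf.find(2));  // true
            System.out.println(uf.find(0) == uf.find(3));  // false
        }
    }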