String matching algorithms

Problems involving strings, i.e. sequences of characters are common in programming competitions. A very useful tool for many of these problems are pattern matching algorithms. Given a string S and a string T, the pattern matching problem is to find all occurrences of T as a substring of S. Let $ n $ denote the length of S and $ m $ denote the length of T. We can solve the pattern matching problem in $ O(nm) $ time by simply checking every length $ m $ substring of S and checking whether it is equal to T. This is not very efficient, though. Today, we will learn one algorithm that solves this problem in $ O(n + m) $ time. In addition to solving this specific problem, it also ellucidates an extremely powerful technique for attacking string problems - using hashing.

The Rabin-Karp algorithm

The algorithm we'll see for pattern matching today is the Rabin-Karp algorithm. It uses a very powerful technique to solve a seemingly impossible problem: Given a substring of S, we want to compare it to T in constant time! While we can't actually compare the substring to T in constant time, we can check whether they are probably the same by using hashing. The Rabin-Karp algorithm first computes a hash of the string T, and then, computes the hashes of each substring of S and compares them. The trick is in being able to compute the hash of each substring of S in constant time. To do so, we use a special hash function that has the property that given the hash of one substring of S, we can find the hash of the next substring by removing the previous character and adding the next one and updating the hash in constant time. This special hash function is called the polynomial hash function, because it looks like this $$ h(S[i..j]) = x^{m-1} S[i] + x^{m-2} S[i+1] + x^{m-3} S[i+2] + ... + x S[j-1] + S[j] $$ where the characters $ S[i] $ are interpreted as integers (for example, by using their position in the alphabet as their value). Note that this is a polynomial in $ x $ whose coefficients are the values of the characters of the substring, hence the name polynomial hash function. Note that if we choose the base $ x $ to be at least as large as the size of the alphabet, then the polynomial hash is unique for every possible string of the same length. We will therefore usually use $ x = 26 $ (if working with alphabetic characters) or $ x = 255 $ (if working with ASCII), for example. However, since these hash values can get extremely large, we will need to work in modular arithmetic, which means the hashes will no longer be unique. The probability that two strings hash to the same value, however, should be low. Assume that all of the equations given here are performed modulo $ p $ for some large prime $ p $. Good choices in practice are $ 10^9 + 7 $ and $ 10^9 + 9 $.

The fact that makes this hash function useful is the following. Suppose we are given the hash of the substring $ S[i..j] $. Then we can compute the hash of $ S[i+1...j+1] $ as follows. $$ h(S[i+1...j+1]) = x \cdot h(S[i...j]) - x^{m} S[i] + S[j+1] $$ Since this can be done in constant time, we can compute the hashes of every substring of length $ m $ in constant time each (except the first one which will take $ O(m) $ time.) It then suffices to compare the hashes with the hash of T, and the ones that match are the occurrences of T in S, right? Well... almost. Remember that we compute the hashes modulo $ p $, so there is a small chance of collisions, i.e. two distinct strings might have the same hash value. We can deal with this in several ways:

The most common approach: Don't! The probability of collisions is very small, so a common approach to deal with them is to simply assume that they won't happen. If they do happen, we can change the value of $ x $ or $ p $ and try again.
Use multiple different values of $ p $ and check that the hashes are equal for all of them. This increases the runtime, but gives more confidence that the answers are correct
Compare the strings that have equal hash values to see whether they really are a match. This solutions guarantees that the algorithm is correct, but it can make it slower if there are lots of matches. In the worst case, performance degrades to that of the naive algorithm. If very few matches are expected, this can be fine, though.

Additional resources

You can find an example implementation of Rabin-Karp here.
Here's a recitation that describes a more advance version of Karp-Rabin called the "oracle", that allows you to preprocess a string and the compute the fingerprint of any range within it in O(1) time.
Some more detailed notes with some math about the probability of collisions can be found here.

Danny Sleator

Last modified: Wed Nov 9 13:13:29 2022