Problems involving strings, i.e., sequences of characters, are common in
programming competitions. *Pattern matching* algorithms are a very useful
tool for many of these problems. Given a string S and a string T, the
pattern matching problem is to find all occurrences of T as a substring
of S. Let \( n \) denote the length of S and \( m \) denote the length of
T. We can solve the pattern matching problem in \( O(nm) \) time by simply
checking every length \( m \) substring of S and checking whether it is
equal to T. This is not very efficient, though. Today, we will learn one
algorithm that solves this problem in \( O(n + m) \) time. In addition
to solving this specific problem, it also elucidates an extremely powerful
technique for attacking string problems: hashing.

The algorithm we'll see for pattern matching today is the
*Rabin-Karp* algorithm. It uses a very powerful technique to solve a
seemingly impossible problem: Given a substring of S, we want to compare
it to T in constant time! While we can't actually compare the substring
to T in constant time, we can check whether they are probably the same
by using hashing. The Rabin-Karp algorithm first computes a hash of
the string T, and then, computes the hashes of each substring of S
and compares them. The trick is in being able to compute the hash
of each substring of S in constant time. To do so, we use a special
hash function with the following property: given the hash of one substring
of S, we can obtain the hash of the next substring in constant time, by
removing the contribution of the outgoing character and adding that of the
incoming one. This special hash function is called the *polynomial hash
function*, because it looks like this:
$$
h(S[i..j]) = x^{m-1} S[i] + x^{m-2} S[i+1] + x^{m-3} S[i+2] + \cdots + x \, S[j-1] + S[j]
$$
where \( j = i + m - 1 \) and the characters \( S[i] \) are interpreted as
integers (for example, by using their position in the alphabet as their
value). Note that this
is a polynomial in \( x \) whose coefficients are the values of the
characters of the substring, hence the name polynomial hash function.
Note that if we choose the base \( x \) to be at least as large as the
size of the alphabet, then the polynomial hash is unique for every
possible string of the same length. We will therefore usually use \( x = 26 \)
(if working with alphabetic characters) or \( x = 255 \) (if working with ASCII),
for example. However, since these hash values can get extremely large, we
will need to work in modular arithmetic, which means the hashes will no longer
be unique. The probability that two strings hash to the same value, however,
should be low. Assume that all of the equations given here are performed
modulo \( p \) for some large prime \( p \). Good choices in practice
are \( 10^9 + 7 \) and \( 10^9 + 9 \).
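As a concrete illustration, the polynomial hash can be computed with Horner's rule. The sketch below assumes lowercase strings and uses the illustrative choices \( x = 26 \) and \( p = 10^9 + 7 \); the function name is ours, not part of any standard library.

```python
# Illustrative parameters: x = 26 for lowercase letters, p = 10^9 + 7.
X = 26
P = 10**9 + 7

def poly_hash(s: str) -> int:
    """Compute x^{m-1} s[0] + x^{m-2} s[1] + ... + s[m-1]  (mod p)."""
    h = 0
    for c in s:
        # Horner's rule: multiplying by x shifts every earlier
        # coefficient up one power before the new character is added.
        h = (h * X + (ord(c) - ord('a'))) % P
    return h
```

For example, `poly_hash("ba")` is \( 1 \cdot 26 + 0 = 26 \), since `'b'` has value 1 and `'a'` has value 0.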

The fact that makes this hash function useful is the following. Suppose we
are given the hash of the substring \( S[i..j] \). Then we can compute the
hash of \( S[i+1..j+1] \) as
$$
h(S[i+1..j+1]) = x \cdot h(S[i..j]) - x^{m} S[i] + S[j+1].
$$
Since this update takes constant time, we can compute the hashes of every
substring of length \( m \) in constant time each (except the first one,
which takes \( O(m) \) time). It then suffices to compare these hashes with
the hash of T, and the positions where they match are the occurrences of T
in S, right? Well... almost. Remember that we compute the hashes modulo
\( p \), so there is a small chance of collisions, i.e., two distinct
strings might have the same hash value. We can deal with this in several
ways:

- The most common approach: Don't! The probability of a collision is very small, so we can simply assume that collisions won't happen. If one does happen, we can change the value of \( x \) or \( p \) and try again.
- Use multiple different values of \( p \) and check that the hashes are equal for all of them. This increases the runtime, but gives more confidence that the answers are correct.
- Compare the strings that have equal hash values to see whether they really are a match. This solution guarantees that the algorithm is correct, but it can make the algorithm slower if there are many matches. In the worst case, performance degrades to that of the naive algorithm. If very few matches are expected, though, this can be fine.
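Putting the pieces together, here is a hedged sketch of the whole algorithm, following the first option above (report hash matches without verifying them). The function name and the parameter choices \( x = 26 \), \( p = 10^9 + 7 \) are illustrative, and lowercase input is assumed.

```python
X = 26
P = 10**9 + 7

def rabin_karp(S: str, T: str) -> list[int]:
    """Return the starting indices of all (probable) occurrences of T in S."""
    n, m = len(S), len(T)
    if m == 0 or m > n:
        return []  # for simplicity, treat an empty pattern as having no matches
    val = lambda c: ord(c) - ord('a')
    # Hashes of T and of the first length-m window of S, via Horner's rule.
    ht = hs = 0
    for i in range(m):
        ht = (ht * X + val(T[i])) % P
        hs = (hs * X + val(S[i])) % P
    xm = pow(X, m, P)  # x^m mod p, used to drop the outgoing character
    matches = []
    if hs == ht:
        matches.append(0)
    for i in range(n - m):
        # h(S[i+1..i+m]) = x * h(S[i..i+m-1]) - x^m * S[i] + S[i+m]  (mod p)
        hs = (hs * X - xm * val(S[i]) + val(S[i + m])) % P
        if hs == ht:
            matches.append(i + 1)
    return matches
```

For instance, `rabin_karp("abracadabra", "abra")` returns `[0, 7]`.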

You can find an example implementation of Rabin-Karp here.
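If you prefer the last option above, a small change makes the answers exact: on a hash match, compare the actual substring against T before reporting it. Again a sketch under the same illustrative assumptions (lowercase input, \( x = 26 \), \( p = 10^9 + 7 \)):

```python
X = 26
P = 10**9 + 7

def rabin_karp_verified(S: str, T: str) -> list[int]:
    """Rabin-Karp that double-checks every hash match, so it never
    reports a false positive (at the cost of O(m) per candidate)."""
    n, m = len(S), len(T)
    if m == 0 or m > n:
        return []
    val = lambda c: ord(c) - ord('a')
    ht = hs = 0
    for i in range(m):
        ht = (ht * X + val(T[i])) % P
        hs = (hs * X + val(S[i])) % P
    xm = pow(X, m, P)
    matches = []
    for i in range(n - m + 1):
        # Only fall back to a character-by-character check on a hash hit.
        if hs == ht and S[i:i + m] == T:
            matches.append(i)
        if i < n - m:
            hs = (hs * X - xm * val(S[i]) + val(S[i + m])) % P
    return matches
```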

Here's a recitation that describes a more advanced version of Karp-Rabin, called the "oracle", which allows you to preprocess a string and then compute the fingerprint of any range within it in \( O(1) \) time.

Some more detailed notes with some math about the probability of collisions can be found here.

Danny Sleator Last modified: Wed Nov 9 13:13:29 2022