15-211 CMU
April 8, 2008 Game Programming
Danny Sleator

----------------------------------------------------------

Example of NIM
Mathematical framework
   definitions
   values
   preclusion
   minimax formulation
   alpha-beta

----------------------------------------------------------

EXAMPLE: Nim
------------

Before we can talk about real programming in practical games like
chess, we need to develop a little theory.  We start with a simple
example: the game of Nim.

There are several piles of stones.  Two players alternate removing
stones.  You must pick a pile and take one or more stones from that
pile.  The one who is stuck with no stones left loses.

Let's analyze a sample game.  Call the two players A (Alice) and B
(Bob).  Alice goes first.  Say there is a pile of 1 stone and a pile
of 2 stones.  We'll denote this (1,2).  What are the options for A?
Take 1 from pile 1, or take 1 or 2 from pile 2.

                      (1,2)                    A's move
                      / | \
                     /  |  \
                    /   |   \
                   /    |    \
                  /     |     \
                 /      |      \
                /       |       \
              (2)     (1,1)     (1)            B's move
              / \       |        |
           (1)  ()     (1)       ()            A's move
            |           |
            ()          ()                     B's move

Is it possible for A to guarantee a win?  To find out, we have to
analyze all the options.  Moving to (2) would allow B to take all the
stones from the pile and win.  So that's a bad option for A.  Moving
to (1,1) forces B to move to (1), then A wins.  Moving to (1) allows
B to win.  So A can win by moving to (1,1).

Let's try to formalize this reasoning.  Let's assign a numerical value
of +1 or -1 to each position in the tree.  The value is +1 if it's
possible for the current player to force a win (there's a winning
strategy for the current player).  And the value is -1 if the other
player can win.  Here are the values assigned to the game above.

                      (1,2)1
                      / | \
                     /  |  \
                    /   |   \
                   /    |    \
                  /     |     \
                 /      |      \
                /       |       \
              (2)1    (1,1)-1   (1)1
              / \       |        |
           (1)1 ()-1   (1)1      ()-1
            |           |
            ()-1        ()-1

What's the general rule for computing the values of the positions in
the tree?  Well, if a node has no children, its value is obviously
-1, because the player whose turn it is to move has lost.  Suppose
that we've inductively evaluated all the children of a node.  What's
the value of the node?  Well, if there is a child of value -1, then
the value of this node is 1.  If all the children have value 1, then
the value of  this node is -1.  A more concise way to describe this is
that the value of a node is just the maximum of the negations of
the values of the children.

This is an example of a:

     2-person,
     perfect information,
     deterministic,
     finite,
     zero-sum game.

What do these terms mean?  2-person is clear.  Perfect information
means that both players know the entire state of the world.  (Unlike,
say, poker where what you have in your hand is hidden from the other
players.)  Deterministic means that there is no randomness, such as
dice.  (Backgammon is perfect information but not deterministic.)
Finite means that the game must end in a finite number of moves.
Zero-sum means that the total winnings of all players add to zero.
Poker and backgammon are zero-sum, but there are games such as the
prisoner's dilemma or the game of chicken, in which this is not the
case.  Of course there are rich and beautiful theories and algorithms
for these kinds of games, but they're beyond the scope of this course.

Many famous and popular games have these properties, such as
tic-tac-toe, chess, checkers, Go, Othello, Hex.

[By the way, there's a beautiful algorithm to play optimal nim.  Write
the number of stones in each pile in binary.  Bitwise XOR all these
numbers together.  If the result is 0, the value of the position is
-1, otherwise it's 1.  Now if the position has non-zero XOR, you can
always win at Nim by moving to a position where the XOR is zero.  Can
you prove that this works?]
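As a sanity check, here is that XOR rule sketched in Python (the
representation of a position as a tuple of pile sizes, and the function
names, are my choices, not part of the notes):

```python
from functools import reduce

def nim_value(piles):
    """+1 if the player to move can force a win, -1 otherwise (XOR rule)."""
    x = reduce(lambda a, b: a ^ b, piles, 0)   # bitwise XOR of all pile sizes
    return 1 if x != 0 else -1

def winning_move(piles):
    """A move to a zero-XOR position, as (pile index, new size), or None."""
    x = reduce(lambda a, b: a ^ b, piles, 0)
    for i, p in enumerate(piles):
        if x ^ p < p:                          # pile i can be shrunk to x ^ p
            return (i, x ^ p)
    return None
```

On the example above, nim_value((1, 2)) is 1 and winning_move((1, 2)) is
(1, 1): shrink the pile of 2 down to 1 stone, reaching (1,1), the same
move found by the tree analysis.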

So, let's say we have one of these 2-person, perfect information,
deterministic, finite, zero-sum games, in which players alternate
moves.  How do we figure out what to do?  It's easy.  First we draw
the entire game tree.  The nodes with no children are terminal
positions (those where the game is over, such as checkmate in chess,
or 3-in-a-row in tic-tac-toe).  Start by labeling the terminal
positions.  Now propagate the values up the tree using the maximum of
the negatives of the children.  Let's formalize this.

Let P be a position in the game.  Let S(P) be the successor positions,
that is, the positions that can result when the player whose turn it
is in position P makes a move.  If P is a terminal position, let v(P)
be the value of that position.  (In all the examples so far, this
value must be -1, 0 or 1, but the theory works for real valued
positions.  This will become important later.)

The value v(P) of a non-terminal position P is defined by the
following formula:

                     v(P) =    MAX    -v(P')
                            P' in S(P)

Theorem: If m = v(P), then there exists a strategy for the moving
player in which this player wins at least m, and there exists a
strategy for the other player in which the moving player wins at most
m.

A "strategy" simply means a table that tells you what to do in each
position.  The theorem is an obvious consequence of the discussion
above.

This definition can be turned into a trivial recursive algorithm for
evaluating a game:

      negamax(P)
          if P is terminal, return v(P)
          m = -infinity
          for each P' in S(P)
              m =  max(m, -negamax(P'))
          return m
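For concreteness, here is a sketch of that negamax pseudocode
specialized to Nim, with a position written as a tuple of pile sizes
(the helper name and representation are mine):

```python
def successors(piles):
    """All positions reachable in one move: take k >= 1 stones from one pile."""
    return [tuple(x for x in piles[:i] + (piles[i] - k,) + piles[i+1:] if x)
            for i, p in enumerate(piles) for k in range(1, p + 1)]

def negamax(piles):
    """Value (+1 or -1) of the position for the player to move."""
    if not piles:                 # terminal: the player to move is stuck, and loses
        return -1
    return max(-negamax(s) for s in successors(piles))
```

This reproduces the hand analysis above: negamax((1, 2)) is 1 and
negamax((1, 1)) is -1.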

Note: here and below I use the following notation: "=" is assignment,
"<=" is less than or equal to, ">=" is greater than or equal to.

The Minimax formulation:

Let's rewrite this so that it is always working from the point of view
of one of the players.  Let's call the two players Max and Min, and
assume that Max is trying to maximize the value and Min is trying to
minimize the value.  For a terminal position P, we let v'(P) be the
value of the position with respect to Max.

We can rewrite the above search algorithm as follows:

      Max(P)
          if P is terminal, return v'(P)
          m = -infinity
          for each P' in S(P)
              m =  max(m, Min(P'))
          return m

      Min(P)
          if P is terminal, return v'(P)
          m = infinity
          for each P' in S(P)
              m =  min(m, Max(P'))
          return m
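A quick Python sketch of the Max/Min pair, using a game tree written as
nested lists whose leaves are terminal values with respect to Max (a
representation I'm choosing purely for illustration):

```python
def max_value(tree):
    """Value of a Max node; a bare number is a terminal position."""
    if isinstance(tree, (int, float)):
        return tree
    return max(min_value(t) for t in tree)

def min_value(tree):
    """Value of a Min node; a bare number is a terminal position."""
    if isinstance(tree, (int, float)):
        return tree
    return min(max_value(t) for t in tree)
```

For example, max_value([[3, 4], [5, 2]]) is 3: Min would answer the
first move with 3 and the second with 2, and Max prefers 3.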

This does exactly the same thing.  It is easy to see this by making a
syntactic change in the program.  (Just propagate a "-" sign through the
code.)  The reason we write this here is because it's easier to
understand what comes below (Alpha-Beta) in this formulation.

More efficient Evaluation
=========================

Preclusion
----------
If the value of all the positions are 1, 0, or -1, then a simple form
of preclusion can be applied to make the search more efficient.  The
idea is simply to stop searching after you've found a winning move,
because the other successors cannot affect the value.  Here is the
resulting algorithm.

      negamax(P)
          if P is terminal, return v(P)
          m = -1
          for each P' in S(P)
              m = max(m, -negamax(P'))
              if m == 1 then return 1
          return m
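In Python, again on Nim tuples (the successors helper repeats the move
generation used earlier; the names are mine):

```python
def successors(piles):
    """Take k >= 1 stones from some pile; drop empty piles."""
    return [tuple(x for x in piles[:i] + (piles[i] - k,) + piles[i+1:] if x)
            for i, p in enumerate(piles) for k in range(1, p + 1)]

def negamax_pruned(piles):
    """Same value as plain negamax, but stops at the first winning move."""
    if not piles:
        return -1
    m = -1
    for s in successors(piles):
        m = max(m, -negamax_pruned(s))
        if m == 1:
            return 1              # a winning move precludes the remaining ones
    return m
```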

Alpha-Beta
----------
The alpha-beta algorithm is simply a generalization of this idea which
applies to games with arbitrary values.  Consider the following game
tree, where the values are written with respect to Max.  Say that I'm
evaluating the nodes from left to right.  Q, S and T (either terminal
nodes or trees) have the values shown.

                      P                 Max's move
                     / \
                    /   \
                10 Q     R              Min's move
                        / \
                       /   \
                    5 S     T           Max's move

I evaluate Q and find that the value is 10 (all values here are from
Max's perspective).  It follows that the value of P is at least 10.
Now I move on and evaluate S and find that its value is 5.  Do I need
to look at T?  The answer is no.  The value of T cannot affect the
value of P.  All T can do is pull down the value of R, which we
already know is at most 5.

For those familiar with chess, consider this.  Suppose I'm examining
position P in a chess game.  I evaluate the first move m, and I find
that m is a good move, because it leaves me ahead by a rook.  Now,
suppose that I continue and look at another move m'.  I check the
first move of my opponent and find that it leaves me a queen down.  Do
I need to look at the other alternatives for the opponent?  No.
Because I already know that m' is not as good as m.  If the opponent
has a mating move after I make move m', it does not matter.

When properly implemented, this kind of pruning can occur many levels
deep in the tree.  There's a very elegant way to express this in a
recursive search that adds two parameters to the search function.  A
range (alpha, beta) is added.  This range tells the search that if the
value is <= alpha, I really don't care what the value actually is,
I just want to know that it's <= alpha.  And it tells the search that 
if the value is >= beta, I really don't care to know the exact value,
just that it's >= beta.  If the value is in the range (alpha,beta)
then I do want to know the exact value.

We'll state the algorithm in two ways.  First as a generalization of
Minimax and then as a generalization of negamax.

      MaxAB(alpha,beta,P)
          if P is terminal, return v'(P)
          for each P' in S(P)
              alpha = max(alpha, MinAB(alpha, beta, P'))
              if alpha >= beta then return alpha
          return alpha

      MinAB(alpha,beta,P)
          if P is terminal, return v'(P)
          for each P' in S(P)
              beta = min(beta, MaxAB(alpha, beta, P'))
              if beta <= alpha then return beta
          return beta

The parameters alpha and beta are simply numbers with alpha<beta, and
the interval (alpha,beta) is known as the _alpha-beta window_.
These are local variables --- each recursive call creates its own
instances of alpha and beta that are independent of all others.

Note: It's easy to fold these two mutually recursive functions into
one.  To see this, start with MinAB() and replace alpha by -beta and
beta by -alpha, and negate the returned result.  You'll see that what
you end up with is just MaxAB.  So the alternate formulation is:

      AB(alpha,beta,P)
          if P is terminal, return v(P)
          for each P' in S(P)
              alpha = max(alpha, -AB(-beta, -alpha, P'))
              if alpha >= beta then return alpha
          return alpha
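Here is that negamax form sketched in Python for Nim, passing the
window as ordinary floats (the successors helper and the names are
mine):

```python
def successors(piles):
    """Take k >= 1 stones from some pile; drop empty piles."""
    return [tuple(x for x in piles[:i] + (piles[i] - k,) + piles[i+1:] if x)
            for i, p in enumerate(piles) for k in range(1, p + 1)]

def ab(alpha, beta, piles):
    """Alpha-beta negamax; a terminal position is worth -1 to the mover."""
    if not piles:
        return -1
    for s in successors(piles):
        alpha = max(alpha, -ab(-beta, -alpha, s))
        if alpha >= beta:
            return alpha          # fail high: cut off the remaining moves
    return alpha
```

Calling ab(float('-inf'), float('inf'), (1, 2)) returns 1, agreeing
with the plain negamax search.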

This is usually the best one to implement in practice, because it
avoids duplication of code.

Theorem: Let v(P) be the value of a position P.  Let X be the value
returned by a call to AB(alpha,beta,P).  Then we know the following:

        if          X <= alpha     then   v(P) <= alpha
        if  alpha < X < beta       then   v(P) == X
        if          X >= beta      then   v(P) >= beta

Proof (Optional material):

The proof is by induction on the game tree.  Take a particular
node in the tree P, and assume that the theorem holds for all the
successor positions of P.  Also assume WLOG that P is a MAX node.
There are three cases.

Case 1: X <= alpha.

    Since X <= alpha, it must be the case that each return value
    from the recursive calls is also <= alpha (because we're taking
    the maximum of them to compute X).  Therefore by induction, their
    true values all must also be <= alpha.  Thus, their maximum
    (which is v(P)) is also <= alpha.

Case 2: alpha < X < beta.

    Let P' be one of the successors of P.  Let X' be the value
    returned by the alpha-beta search at P'.  If P' has the highest
    value of all the successors of P, then when it's called its return
    value is strictly inside the range (alpha, beta).  Therefore by
    induction the return value is the true value, and it's X.  If X'
    is one of those that is <= the current alpha, then the true value
(by induction) is also <= alpha, so it does not affect the
    outcome.

Case 3: beta <= X.

    We know that for one of the calls to alpha-beta the result
    returned X' is >= beta.  By induction, for this position P' the
    true value is >= beta.  Therefore we know that the true value of P
    is also >= beta.

QED.


To compute the value of a game using the alpha-beta algorithm, you
simply call AB(-infinity, infinity, P).  The value returned must be in
the interval (-infinity, infinity).  (Every returned value is actually
the value of some leaf.)  Therefore by the theorem the value returned
must be the actual value of the position.

Let's look at what this algorithm does on the example shown above.
[Let i denote infinity in this discussion.]  We start by calling
MaxAB(-i, i, P).  alpha is initially -i.  We then call MinAB(-i,i,Q).
It returns 10, so alpha (at node P) is now 10.  Next we call MinAB(10,
i, R).  The first thing it does is call MaxAB(10, i, S).  This returns
some value X which is at most 10.  (Since the true value 5 is below
the range [10,i].)  This causes Min to set beta to X, and immediately
terminate without having to evaluate T.


               [-i,i] P 10               Max
                     / \
                    /   \
                   /     \
                  /       \
                 /         \
                /           \
               /             \
       [-i,i] Q 10     [10,i] R X<=10     Min
                             / \
                            /   \
                           /     \
                          /       \
                  [10,i] S X<=10   T     Max


    The numbers in brackets left of each node are
    the values of [alpha,beta] on the call to evaluate
    that node.  The number to the right of the node is
    the value returned by the call to Min() (or Max()).
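The same trace can be reproduced in Python: nested lists for the tree,
leaves scored for Max, plus a list recording which leaves actually get
evaluated (all names are mine; T's value of 20 is an arbitrary
stand-in, since it is never looked at):

```python
def max_ab(alpha, beta, tree, seen):
    """MaxAB on a nested-list tree; a bare number is a leaf, scored for Max."""
    if isinstance(tree, (int, float)):
        seen.append(tree)
        return tree
    for child in tree:
        alpha = max(alpha, min_ab(alpha, beta, child, seen))
        if alpha >= beta:
            return alpha
    return alpha

def min_ab(alpha, beta, tree, seen):
    """MinAB on a nested-list tree."""
    if isinstance(tree, (int, float)):
        seen.append(tree)
        return tree
    for child in tree:
        beta = min(beta, max_ab(alpha, beta, child, seen))
        if beta <= alpha:
            return beta
    return beta

inf = float('inf')
seen = []
# P = [Q, R], with Q = 10 and R = [S, T], S = 5, T = 20
value = max_ab(-inf, inf, [10, [5, 20]], seen)
```

Afterwards value is 10 and seen is [10, 5]: the leaf 20 (playing the
role of T) was never evaluated, exactly as the trace predicts.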

[Present a bigger example on the blackboard.]