Game Playing with MCTS

Overview

Your goal is to create a system that can accurately play Connect 4 (also called “Four in a Line”) using Monte Carlo Tree Search. Here is a site where you can read the rules, and here is a site where you can play online. Play a few games, especially if you haven’t played Connect Four before.

Get started

To get started with your assignment, copy the files game1.py and MCTS.py into your directory. game1.py contains everything necessary to get the game moving. Use Python’s help facility to see what functions and methods are in there that you might find useful. In other words, do the following:

$ python3
Python 3.5.2 (default, Jul 10 2019, 11:58:48) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import game1
>>> help(game1)

You can see the results of a random game by running

python3 MCTS.py

from a command prompt.

Your tasks

You’ll now implement the MCTS function in MCTS.py, which will also require implementing some functions in the Node class (also located in MCTS.py). Your MCTS function will need to interact with the game by calling some methods in the game class. Here are the relevant functions, methods, and classes:

  • newGame(): Function in game1.py that returns a State object.
  • State: Class in game1.py that defines the game states; important methods for instances of State are defined below.
  • getMoves(): Method in State that returns a collection of moves.
  • isTerminal(): Method in State that returns true if the state is terminal and false otherwise.
  • value(): Returns +1, 0, or -1 for a player 1 win, draw, or loss respectively. Only valid to call on terminal states.
  • turn(): Returns +1 for the first player’s turn and -1 for the second player’s turn.
  • key(): Returns a string representation of the state. You might find this occasionally helpful for debugging purposes.

You shouldn’t change the code in game1.py, and you shouldn’t access any other methods, functions or instance variables.

Implementation details

When describing MCTS, the word “rollout” is sometimes used to describe just the simulation step, or the entire selection/expansion/simulation/backpropagation process. Humans don’t always agree on how to use words. (You can find direct confirmation of this in the Wikipedia article for MCTS.) For this assignment, the word “rollout” is used to describe the whole process.

MCTS follows a basic pattern for selecting nodes: it performs repeated rollouts, and each rollout consists of the following operations:

  • Selection: It selects nodes from the expanded part of the tree until it reaches a node that has at least one child that is unexpanded. It begins this process at the root node, which is passed as a parameter to MCTS. Node selection is performed by choosing the child with highest UCB value. The formula for UCB value is in the MCTS reading; for the exploration constant, you should use the variable UCB_CONST, which is defined near the top of the module.
  • Expansion: Assuming it finds an unexpanded child that is not terminal, it expands that child. Expanding a child means adding a node for it to the MCTS tree. (If it finds a terminal child, it still adds it to the tree, but there is no simulating to do in the next step - it simply backpropagates immediately; see Backpropagation below.)
  • Simulation: A simulation is performed from the newly expanded child: all moves are chosen randomly for both players, starting at the state for that child. The simulation continues until a terminal state is reached, which determines the outcome of the simulation (+1 for a player 1 win, 0 for a draw, -1 for a player 2 win).
  • Backpropagation: Finally, it propagates the outcome of the simulation up the tree. This “backpropagation” involves going from the node for the newly expanded child all the way up to the root, and updating the count of visits and the value (based on the outcome of this simulation) for each node visited.
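As a concrete sketch of the UCB computation used during selection (the function and parameter names here are hypothetical, and this assumes the node tracks a running reward total and a visit count; UCB_CONST stands in for the module-level constant):

```python
import math

UCB_CONST = 0.5  # stand-in for the constant defined near the top of MCTS.py

def ucb_weight(value_sum, visits, parent_visits, ucb_const=UCB_CONST):
    """UCB = average reward (exploitation) plus an exploration bonus that is
    large for children visited rarely relative to their parent."""
    return value_sum / visits + ucb_const * math.sqrt(math.log(parent_visits) / visits)
```

Selection then picks the child maximizing this weight; as a child accumulates visits, its exploration term shrinks and its average reward dominates.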

The number of rollouts to perform is indicated by the rollouts parameter in the MCTS function (a stub for this function is in MCTS.py). After finishing all rollouts, MCTS returns the move that results in the highest-value child of the root from the current player’s perspective. (I.e., the move that has resulted in the best winning percentage from the perspective of the player choosing moves.)

The description above is almost enough to allow you to implement the algorithm, but it’s a little vague about what the utility is for each node. We’ll define \(U(n)/N(n)\) as the winning percentage from this node for the player who chose the action to reach this node (i.e., the parent’s turn), where a draw counts as half of a win. For example, imagine MCTS has visited a node four times, and that in this node’s parent, player 2 is choosing the move. Player 2 lost once, won twice, and had a draw once. Then the value for the node would be (0 + 1 + 1 + 0.5)/4 = 0.625. Note that this means the value of each node should always be between 0 and 1, regardless of whether it’s a node where player 1 plays next or player 2 plays next. By doing it this way, a high value for a node always means that it is good, and you don’t have to worry about adding some sort of min/max code. Admittedly, this means that you’re going to do some switching back and forth between whether you think of the players as +1/-1, or +1/0, depending on the context.
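The worked example above can be checked with a small helper (the name is hypothetical, not part of the starter code) that maps a game outcome to a reward from the perspective of the player who chose the move into the node:

```python
def reward_for_parent(outcome, parent_turn):
    """Map an outcome (+1 player 1 win, 0 draw, -1 player 2 win) to a reward
    in [0, 1] for the player whose turn it was in the parent node
    (parent_turn is +1 or -1). A draw counts as half a win."""
    if outcome == 0:
        return 0.5
    return 1.0 if outcome == parent_turn else 0.0

# The example from the text: player 2 chose the move; the four outcomes were
# one loss for player 2 (+1), two wins (-1, -1), and one draw (0).
outcomes = [1, -1, -1, 0]
node_value = sum(reward_for_parent(o, -1) for o in outcomes) / len(outcomes)  # 0.625
```
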

Pseudocode

The following pseudocode summarizes the above.

function MCTS(root, rollouts)
   for i = 1 : rollouts
      node = root
      # selection
      while all children of node are expanded and node is not terminal
         node = child of node with highest UCB value
      # expansion
      if node is not terminal
         pick an unexpanded child state
         create a new_node for the resulting state and add it to the tree
         # simulation
         outcome = random_playout(new_node's state)
      else
         new_node = node
         outcome = value of node's terminal state
      # backpropagation
      update visits and values along the path from new_node up to root
      according to outcome
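The pseudocode can be exercised end to end on a toy game. Everything below is a sketch under stated assumptions, not the starter code: the subtraction game stands in for game1.py, and the Node internals, the addMove successor method, and the helper names are all invented for illustration.

```python
import math, random

UCB_CONST = 0.5  # stand-in for the module constant

class SubtractState:
    """Toy game: players alternately take 1 or 2 stones; whoever takes the
    last stone wins. Mimics the documented State interface."""
    def __init__(self, stones, player=1):
        self.stones, self.player = stones, player
    def getMoves(self):
        return [m for m in (1, 2) if m <= self.stones]
    def addMove(self, move):  # hypothetical successor method
        return SubtractState(self.stones - move, -self.player)
    def isTerminal(self):
        return self.stones == 0
    def value(self):  # the player who just took the last stone wins
        return -self.player
    def turn(self):
        return self.player

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}              # move -> Node
        self.visits, self.value_sum = 0, 0.0
    def updateValue(self, reward):
        self.visits += 1
        self.value_sum += reward
    def UCBWeight(self, parent_visits):
        return (self.value_sum / self.visits
                + UCB_CONST * math.sqrt(math.log(parent_visits) / self.visits))

def reward_for_parent(outcome, parent_turn):
    return 0.5 if outcome == 0 else (1.0 if outcome == parent_turn else 0.0)

def random_playout(state):
    while not state.isTerminal():
        state = state.addMove(random.choice(state.getMoves()))
    return state.value()

def MCTS(root, rollouts):
    for _ in range(rollouts):
        node = root
        # selection: descend while fully expanded and not terminal
        while (not node.state.isTerminal()
               and len(node.children) == len(node.state.getMoves())):
            node = max(node.children.values(),
                       key=lambda c: c.UCBWeight(node.visits))
        if node.state.isTerminal():
            outcome = node.state.value()
        else:
            # expansion: add one unexpanded child, then simulate from it
            move = random.choice([m for m in node.state.getMoves()
                                  if m not in node.children])
            child = Node(node.state.addMove(move), parent=node)
            node.children[move] = child
            node = child
            outcome = (node.state.value() if node.state.isTerminal()
                       else random_playout(node.state))
        # backpropagation: each node is rewarded from its parent's perspective
        while node is not None:
            parent_turn = (node.parent.state.turn() if node.parent
                           else -node.state.turn())  # root's value is unused
            node.updateValue(reward_for_parent(outcome, parent_turn))
            node = node.parent
    # return the move whose child has the best average value for the chooser
    return max(root.children,
               key=lambda m: root.children[m].value_sum / root.children[m].visits)
```

On SubtractState(4), taking 1 stone (leaving the opponent a multiple of 3) is optimal, and with a few hundred rollouts the returned move should reflect that.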

Implementation tips

  • The Node class stores relevant information for the game tree. I’ve included a constructor with instance variables I think are likely to be useful. You can add to those instance variables if you like or change them. You should not remove or change the signature of any methods or constructors (existing methods are used by starter code and/or the grader’s tests), although you may add methods. If you change your instance variables, you should ensure that the methods all still behave in accordance with the comments. (I.e., it is okay to change the implementation of these methods if need be, but you should not change their names or parameters.)
  • The Node class has stubs for updateValue and UCBWeight. You should implement these methods (this will be helpful for the MCTS implementation).
  • Use helper functions! There are a lot of parts here - I’d recommend breaking it into small chunks, each with its own method, so you can see what it’s doing.

Run your code

Once you’ve implemented MCTS (see the implementation tips above), you should be able to play a better game. You can run it as:

python3 MCTS.py --rollouts 50 --displayBoard

Change 50 to however many rollouts you’d like. The “displayBoard” option will allow you to see the agent’s ranking of the moves each time it has a choice (1 is the move it thinks is best). I recommend debugging with a small number of rollouts and examining how your node values and visits change; compare to results you calculate on paper. A debugger will likely be helpful. The randomness in this algorithm can make debugging tough, so you may want to “seed” the random number generator. Random number generators aren’t truly random - instead, they generate a set sequence based on some initial input (see this page if you want to learn more about random number generation). You can set that initial input to get the same sequence every time. Add the following line just after your imports to do so:

random.seed(1)

Delete the line when you want to go back to different games.

Once you have the basic implementation working, answer the written questions below in a file named questions.txt. For many of these questions, understanding the command-line arguments will be helpful. To see a description of all the command-line options, run

python3 MCTS.py --help

Written questions

  1. Run your MCTS algorithm for 25 games with rollouts set to 100. This may take a little time, but not more than a few minutes. How many games does player 1 win?
  2. Again run your MCTS algorithm for 25 games with rollouts set to 100, except now make MCTS player 2. How many games does player 1 win? (If your answers to these two parts don’t make sense to you based on which player should be the smarter player, you probably have an error in your implementation.)
  3. Experiment with several values of the UCB exploration constant. Report what values you used and your results for how this affects your MCTS agent’s effectiveness. Explain your hypotheses about why you see these results, tying back to what you know about how varying this constant affects the algorithm. Your experimentation should be sufficiently extensive to explore at least some trend in the results. Make sure you run enough games with enough rollouts that you feel confident in being able to see and explore a trend in the results (i.e., if I only run 3 games, it’s likely going to be difficult to detect differences based on small changes to the UCB exploration constant.)
  4. Change the UCB exploration constant back to its original value (.5) and experiment with two MCTS players with different numbers of rollouts relative to one another. For example, you might look at an agent with 8 rollouts and an agent with 64 rollouts, and then look at an agent with 8 rollouts versus an agent with 512 rollouts. As in the previous question, report what values you used and your results. Explain your hypotheses about why you see these results, tying back to what you know about the algorithm and collecting additional evidence to support your hypotheses if you can. Your experimentation should be sufficiently extensive to explore at least some trend in the results, and you should be sure to design your experimentation so as to avoid confounding the impact of being player 1 versus player 2 and the number of rollouts. You may want to use a smaller board to be able to explore your results further without spending a ton of time running simulations; you can adjust the size of the board with the constants at the top of game1.py.

FAQ

Backpropagation extent

Q: Do we end backpropagation at the state for which we started searching (the current move), or do we backpropagate all the way back to the root of the whole game (the first move)?

A: You backpropagate to the state where you started the procedure, not all the way to the first move of the game (references: Algorithm 1 in the Browne et al. survey paper; Figure 1 in the original UCT paper). Backpropagating all the way to the start of the game biases the results. Backpropagating just to the current node of interest ensures that the UCB criterion was applied once per visit; if you backpropagate all the way to the start of the game, you make dramatic increases to the visit counts on the top nodes that aren’t balanced by repeated UCB checks.
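A minimal sketch of that stopping condition (the Node fields and function name here are assumptions for illustration, not the starter code):

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0

def backpropagate(leaf, search_root):
    """Update every node from the newly expanded leaf up to (and including)
    the node where this call to MCTS began, then stop; nodes from earlier
    moves of the game are left untouched."""
    node = leaf
    while True:
        node.visits += 1
        if node is search_root:
            return
        node = node.parent
```
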

Tracking whose turn it is

Q: How does the value of a state reflect whose turn it is?

A: There isn’t a single way to do this. That said, here are some ideas. Since the second half of the UCB criterion is frequency based, you’ve got to stick with an approach of picking child nodes that maximize the UCB criterion, so it is likely easiest for the value in a node to reflect whose turn it is. In a node that is being chosen by the max player, a bigger value should be good for that player; likewise, in a node that is being chosen by the min player, a bigger value should be good for that player. For UCB, the range of the value is from 0 to 1. For a node chosen by the max player, 1 should indicate a strong outcome for that player (hopefully leading to a terminal state with +1), whereas for a node chosen by the min player, 1 should indicate a strong outcome for that player (hopefully leading to a terminal state with -1).

Grading

Here is some information on how we will grade part 2 of this assignment.

It is exceedingly difficult to tell if your code is working perfectly without lots of detailed tests and a dramatically over-specified assignment description. (This tends to be true for a lot of AI assignments.) Therefore, here is what I will be instructing the grader:

Run their program a number of times. First, just verify that it wins most games with a decent number of rollouts, something like this (I’ll supply the command to run, and the output I’m getting, at least):

$ python MCTS.py --numGames 20 --rollouts 10
Player 1 games won: 20/20
Draws: 0/20

Then verify that it flips to player 2 if they reverse the order of play:

$ python MCTS.py --second --numGames 20 --rollouts 10
Player 1 games won: 4/20
Draws: 0/20

Then have both players go at it, and unbalance the number of rollouts for each, and verify that it seems to be behaving as expected:

$ python MCTS.py --numGames 20 --rollouts 20 --rolloutsSecondMCTSAgent 10
Player 1 games won: 14/20
Draws: 0/20
$ python MCTS.py --numGames 20 --rollouts 10 --rolloutsSecondMCTSAgent 20
Player 1 games won: 3/20
Draws: 0/20

Look to see if the above is giving something similar-ish to what I’ve got.

Also skim through their code to make sure it looks like they’re doing the right sort of thing.

On the writeup, they should answer the questions in enough detail to show that they understood the question, ran the experiments, and interpreted the results.

Exemplary: all of the above

Meets expectations: close to all of the above, but the program behaves differently than expected in some case (despite the code otherwise looking good), and the writeup is still strong and/or is missing some small pieces.