# Using Reinforcement Learning To Learn To Play Tic-Tac-Toe

About a year ago I set myself the goal of writing an algorithm that could learn to play tic-tac-toe.
I didn’t want to tell the algorithm what the rules of the game are, nor did I want it to try and use some kind of calculation to look ahead at possible moves which might lead to a win from the current state of the board. I wanted the algorithm to “learn” how to play, by playing against itself.
I knew nothing about machine learning, so I spent a bit of time learning about neural networks, but I didn’t get very far. I convinced myself that neural networks wouldn’t work and put the project to the back of my mind.
A few months ago at a Christmas party, I bumped into an acquaintance, JP, and I ended up telling him about my goal. He suggested reinforcement learning and a few days later I started reading about it. It didn’t take long before I was trying to understand Q-learning and failing fast by getting lost in the maths. So with my limited knowledge I went about iteratively designing an algorithm to “learn” to play.
Here is the result, which you can play against interactively by clicking on the board or buttons. It contains no code telling it what good or bad moves are, and the AI algorithm knows nothing about the rules of the game: it’s simply told what the outcome of each move is (unfinished, won, drawn, lost). It doesn’t even know that the aim is to get three in a row.
All the AI algorithm knows is that X and O make alternate moves and that a square can only contain an X or an O or be empty.

The “clear” button clears the board and resets the game.
The “reset stats” button clears the score board below the buttons.
The “play autonomously” button causes it to start playing against itself and continue learning. It swaps sides every hundred games and will occasionally explore randomly, according to its exploration strategy (see below).
The “swap roles” button causes it to change sides. The “explore” button can be used to turn off exploration during games against humans.

Tic-tac-toe is played by successively placing ‘X’s and ‘O’s (markers) on the board until one side has three in a row, vertically, horizontally or diagonally. To start with, each cell on the board is empty (modelled as a null in Javascript). As the game progresses, the state of the board, a.k.a. “board state”, changes. As an example, here is the board state after one move by X in the middle of the top row of the board:

|     | j=0 | j=1 | j=2 |
|-----|-----|-----|-----|
| i=0 |     | X   |     |
| i=1 |     |     |     |
| i=2 |     |     |     |

Each state can be converted into a “pattern” which is built by concatenating the cell values from left to right, starting with the top row and ending with the bottom row, delimiting the rows with a pipe character. Empty cells are represented using a hyphen (‘-’). So the example above becomes: “-X-|---|---”.
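As a rough sketch of that conversion (the function name `boardToPattern` is my own for illustration and is not necessarily what the project's code uses), a board held as a 3x3 array with `null` for empty cells could be turned into a pattern like this:

```javascript
// Hypothetical helper: converts a 3x3 board (null = empty cell) into a pattern string.
function boardToPattern(board) {
  return board
    .map(row => row.map(cell => (cell === null ? '-' : cell)).join(''))
    .join('|'); // rows are delimited with a pipe character
}

// Board after X has moved to the middle of the top row (i=0, j=1).
const board = [
  [null, 'X', null],
  [null, null, null],
  [null, null, null],
];

console.log(boardToPattern(board)); // prints "-X-|---|---"
```

Because the pattern is a plain string, it can be used directly as a key in an ordinary JavaScript object.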
The patterns are used as keys, which map to a list of possible moves from that board state (like a dictionary). To make things simple, I decided that X would always start, regardless of whether it was the AI or the human moving first. For each possible move, the algorithm stores the final result that was achieved: a win for X, a win for O or a draw. After the first move, there are eight possible moves, after the second, only seven, etc. Results are recorded in JSON similar to the following example, which shows what the algorithm has learned for the pattern where no move has yet been made:

```javascript
const memory = {
  "---|---|---": [
    {i: 1, j: 1, xWinRewards: 185510, oWinRewards: 20028, drawRewards: 60161},
    {i: 2, j: 2, xWinRewards:   1390, oWinRewards:  1910, drawRewards:   379},
    ...
  ],
  ...
};
```

Each element in the array represents a move which can be made from the board state represented by the key, with ‘i’ representing the vertical position, and ‘j’ the horizontal position. So ‘xWinRewards’ records how well X has done from this board state, in this case an empty board, after moving to the position indicated by i and j. ‘oWinRewards’ records how well player O has done, and ‘drawRewards’ shows how likely a draw is.

The example above was taken after a hundred thousand games were played. It shows how the algorithm has learned that X is most likely to win if it starts in the centre of the board (position i=1, j=1). It also shows how moving to the centre is more likely to result in a draw than a win for the second player O, because it has gained more rewards for draws than for O winning.
You can also see that far fewer games were played where the first player opened in the bottom right (2, 2) because it has gained fewer rewards for that move.

Notice the term “reward”. The numbers recorded above are not the number of games; rather, they are rewards (or points) that are given at the end of the game, when the algorithm walks through each move of the game, retroactively awarding points for each move. For each move, it first builds the pattern of the board as it was before that move and looks that pattern up in its memory. It then adds a reward to the relevant “next move” which was made in the current game, adding it to the relevant field in the object shown above. Take for example the board after an initial move by X to the centre of the board. Let’s say that O won later in the game, after moving to position 2,2 as their first move. So the pattern which is looked up is “---|-X-|---”. In that memory location, the algorithm finds the object with i=2 and j=2. It adds a reward to the field named “oWinRewards”. The algorithm then moves on to the next move which took place in the game and does the same thing for the pattern “---|-X-|--O”, since that pattern represents the game after the second move.

Initially I used a reward of just one point. It soon became clear that the algorithm wasn’t learning properly and it was still making “stupid” decisions, for example it wouldn’t complete two in a row in order to win immediately and would prefer to move elsewhere rather than taking the victory. The reason is that in the beginning it only moves randomly (it has to explore since it has no experience to exploit, see below) and sometimes it would miss the chance to win immediately, but then win later in a game anyway. It learned that it could skip an immediate win and still be successful. The solution was to increase the reward for winning shorter games, so that the incentive was higher.
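The retroactive walk through a finished game, combined with a larger reward for shorter games, might look roughly like the sketch below. The names `recordGame`, `toPattern` and `rewardForGameLength` are hypothetical, and the doubling reward scale is purely an assumption for illustration; the article does not state the actual values used.

```javascript
// Assumed reward scale (not from the article): shorter games earn more points.
function rewardForGameLength(numMoves) {
  return 2 ** (9 - numMoves); // a 9-move game earns 1 point, a 5-move game 16
}

// Converts a 3x3 board (null = empty) into a pattern such as "---|-X-|---".
function toPattern(board) {
  return board.map(row => row.map(c => (c === null ? '-' : c)).join('')).join('|');
}

// moves: [{i, j, player}] in the order played; outcome: 'X', 'O' or 'draw'.
function recordGame(memory, moves, outcome) {
  const field = outcome === 'X' ? 'xWinRewards'
              : outcome === 'O' ? 'oWinRewards' : 'drawRewards';
  const reward = rewardForGameLength(moves.length);
  const board = [[null, null, null], [null, null, null], [null, null, null]];

  for (const move of moves) {
    // Look up the board state as it was before this move was made.
    const pattern = toPattern(board);
    if (!memory[pattern]) memory[pattern] = [];
    let entry = memory[pattern].find(m => m.i === move.i && m.j === move.j);
    if (!entry) {
      entry = {i: move.i, j: move.j, xWinRewards: 0, oWinRewards: 0, drawRewards: 0};
      memory[pattern].push(entry);
    }
    entry[field] += reward;              // reward the move that was actually played
    board[move.i][move.j] = move.player; // then replay it to reach the next state
  }
}

// X opens in the centre, O answers at (2,2) and later wins along the bottom row.
const exampleMemory = {};
recordGame(exampleMemory, [
  {i: 1, j: 1, player: 'X'}, {i: 2, j: 2, player: 'O'},
  {i: 0, j: 0, player: 'X'}, {i: 2, j: 0, player: 'O'},
  {i: 0, j: 1, player: 'X'}, {i: 2, j: 1, player: 'O'},
], 'O');
console.log(exampleMemory['---|-X-|---']);
```

With this assumed scale the six-move game above adds 8 points to “oWinRewards” for every move played in it, whereas a game that drags on to nine moves would only add 1 point per move, so quick wins leave a stronger trace in memory.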

The next problem I encountered was that the algorithm was giving too much weight to moves which it had already encountered many times, even if a different move was actually better. It resulted in the algorithm making what appeared to be silly moves, and not really seeming to learn. The solution was to choose the next move based on relative rewards, rather than absolute rewards. That way, the total number of times a move was tried was no longer relevant. Rather, the chances of winning were what made the algorithm choose the next move. I later learned that there are different strategies here: you can base the choice of the next move on the average reward, or the best one that has ever been achieved, and the strategy relates to the reinforcement learning algorithm being used (for example Q-learning apparently uses the best reward rather than an average one).

Another factor which influenced how the algorithm learned was whether it learned against a player who always moved randomly or against itself. It became a stronger player when learning against itself than against a random monkey. That makes sense because if you’ve only ever had to respond to random moves, you won’t have learned how to play someone who knows how to win tic-tac-toe, where more skill is required.

Finally, the last important factor in learning is the trade-off between exploration (trying a random move which might lead to learning something new) and exploitation (using existing knowledge to choose a “good” move based on experience). Just like a human, the algorithm starts by only exploring randomly, because it doesn’t know better, but the more games it plays, the less it explores and the more it chooses moves based on experience. The method used to decide whether to explore or not has quite an effect on how quickly and how well the algorithm learns.
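One common way to implement such a trade-off is a decaying “epsilon-greedy” rule. The sketch below is only an assumption about how this could be done, not a description of the method used in this project: `explorationRate`, `chooseMove`, the decay constant and the floor value are all hypothetical.

```javascript
// Assumed schedule: the probability of exploring decays from 1 towards a small floor.
function explorationRate(gamesPlayed, decayGames = 20000, floor = 0.01) {
  return Math.max(floor, Math.exp(-gamesPlayed / decayGames));
}

// scoredMoves: [{i, j, score}]; random: injectable source of numbers in [0, 1).
function chooseMove(scoredMoves, gamesPlayed, random = Math.random) {
  if (random() < explorationRate(gamesPlayed)) {
    // explore: pick any legal move at random
    return scoredMoves[Math.floor(random() * scoredMoves.length)];
  }
  // exploit: pick the move with the highest score so far
  return scoredMoves.reduce((best, m) => (m.score > best.score ? m : best));
}

const candidates = [
  {i: 0, j: 0, score: 38.1},
  {i: 1, j: 1, score: 72.0},
  {i: 2, j: 2, score: 38.3},
];
console.log(chooseMove(candidates, 0));      // a brand new player always explores
console.log(chooseMove(candidates, 100000)); // an experienced player mostly exploits
```

Passing the random source in as a parameter makes the behaviour easy to test. Tuning the decay and the floor is exactly the kind of knob that changes how quickly and how well the player learns.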

So does the algorithm really learn? Let’s take a look by asking some questions.

1) Does the algorithm know where the best starting location is?

Yes, because it knows that starting in the centre gives the best chances of a win. The rewards for the possible moves before any player has had a turn (pattern “---|---|---”), ordered by reward with the best move at the top and the worst at the bottom, are as follows:

```javascript
{i: 1, j: 1, xWinRewards: 185510, oWinRewards: 20028, drawRewards: 60161} // => 72.0
{i: 2, j: 2, xWinRewards:   1390, oWinRewards:  1910, drawRewards:   379} // => 38.3
{i: 0, j: 0, xWinRewards:   1169, oWinRewards:  1526, drawRewards:   447} // => 38.1
{i: 2, j: 0, xWinRewards:   1291, oWinRewards:  2068, drawRewards:   419} // => 34.7
{i: 0, j: 2, xWinRewards:   1191, oWinRewards:  1920, drawRewards:   495} // => 33.9
{i: 1, j: 2, xWinRewards:   1155, oWinRewards:  2300, drawRewards:   402} // => 30.4
{i: 0, j: 1, xWinRewards:   1122, oWinRewards:  2372, drawRewards:   400} // => 29.2
{i: 1, j: 0, xWinRewards:    844, oWinRewards:  1678, drawRewards:   491} // => 29.0
{i: 2, j: 1, xWinRewards:   1079, oWinRewards:  2292, drawRewards:   474} // => 28.7
```

The reward that is shown on the right of each line, which is used when choosing the next move to make, is calculated as follows:

```javascript
winRewards = player === X ? possibleMove.xWinRewards : possibleMove.oWinRewards;
drawRewards = possibleMove.drawRewards;
lossRewards = player === X ? possibleMove.oWinRewards : possibleMove.xWinRewards;
total = winRewards + drawRewards + lossRewards;
...
finalReward = (100*(winRewards/total)) + (10*(drawRewards/total)) + (-1*(lossRewards/total));
```

Remember, reward is calculated relatively rather than absolutely, and rewards given at the end of the game are higher if the game is shorter. Note that the above calculation also weights the rewards, so that the algorithm prefers winning over drawing, and drawing over losing.
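To make the arithmetic concrete, here is a self-contained sketch of that calculation applied to two of the opening moves listed above. It is a simplification: the function name `finalReward` and the use of the strings 'X' and 'O' for the players are my assumptions, the steps elided in the snippet above are left out, and returning zero for a move with no recorded rewards is also an assumption.

```javascript
// Weighted, relative score of a possible move from the point of view of `player`.
function finalReward(possibleMove, player) {
  const winRewards = player === 'X' ? possibleMove.xWinRewards : possibleMove.oWinRewards;
  const lossRewards = player === 'X' ? possibleMove.oWinRewards : possibleMove.xWinRewards;
  const drawRewards = possibleMove.drawRewards;
  const total = winRewards + drawRewards + lossRewards;
  if (total === 0) return 0; // assumption: a move with no history scores zero
  // wins weigh 100, draws 10 and losses -1, each as a share of the total
  return (100 * winRewards + 10 * drawRewards - lossRewards) / total;
}

const centre = {i: 1, j: 1, xWinRewards: 185510, oWinRewards: 20028, drawRewards: 60161};
const corner = {i: 2, j: 2, xWinRewards: 1390, oWinRewards: 1910, drawRewards: 379};

console.log(finalReward(centre, 'X').toFixed(1)); // prints "72.0"
console.log(finalReward(corner, 'X').toFixed(1)); // prints "38.3"
```

For the centre, the total is 265,699 points, of which roughly 69.8% are wins, 22.6% draws and 7.5% losses, giving 69.8 + 2.3 - 0.1 ≈ 72.0. Although the centre has been tried far more often than the corner, only the proportions matter.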

If you were wondering whether or not the centre is indeed the best opening move, please read my previous article which shows that contrary to common belief, the centre is a better opening move than a corner. Wikipedia, xkcd and Google all get it wrong.

2) Does it know how to respond to an opening move in the centre?

Here are the rewards for the next moves after an opening in the centre (pattern “---|-X-|---”):

```javascript
{i: 2, j: 0, xWinRewards: 26920, oWinRewards: 3712, drawRewards:  9333} // => 11.5
{i: 0, j: 0, xWinRewards: 67866, oWinRewards: 8622, drawRewards: 35873} // => 10.8
{i: 0, j: 2, xWinRewards: 21060, oWinRewards: 2616, drawRewards:  5856} // => 10.8
{i: 2, j: 2, xWinRewards: 21014, oWinRewards: 2574, drawRewards:  6730} // => 10.6
{i: 0, j: 1, xWinRewards: 12096, oWinRewards:  668, drawRewards:   603} // =>  5.4
{i: 2, j: 1, xWinRewards: 12502, oWinRewards:  664, drawRewards:   595} // =>  5.2
{i: 1, j: 2, xWinRewards: 11850, oWinRewards:  592, drawRewards:   582} // =>  4.9
{i: 1, j: 0, xWinRewards: 11994, oWinRewards:  580, drawRewards:   589} // =>  4.8
```

My previous article and other sources like Wikipedia’s tic-tac-toe article and Martin Gardner’s book *Hexaflexagons and Other Mathematical Diversions* agree that the best counter to an opening move in the centre is a move to a corner. The results above agree because all four corners have higher rewards (again shown on the right side) than the edges.

3) Does it know how to win on the next move when it nearly has three in a row?

Yes, it knows how to win instantly, for example with the pattern “OX-|-X-|O--”, which looks as follows, where the winning move is the middle of the bottom row (2,1):

|     | j=0 | j=1 | j=2 |
|-----|-----|-----|-----|
| i=0 | O   | X   |     |
| i=1 |     | X   |     |
| i=2 | O   |     |     |

The rewards for that pattern are:

```javascript
{i: 2, j: 1, xWinRewards: 7104, oWinRewards:  0, drawRewards: 0} // => 100.0
{i: 1, j: 0, xWinRewards:  224, oWinRewards:  0, drawRewards: 2} // =>  99.2
{i: 1, j: 2, xWinRewards:   16, oWinRewards:  8, drawRewards: 0} // =>  66.3
{i: 0, j: 2, xWinRewards:    0, oWinRewards: 48, drawRewards: 0} // =>  -1.0
{i: 2, j: 2, xWinRewards:    0, oWinRewards: 56, drawRewards: 0} // =>  -1.0
```

So indeed, it knows the reward is highest for (2,1) and will move there to win. It also knows that it doesn’t need to block the opponent’s imminent win on the left edge (1, 0).

4) Does it know how to block the opponent from winning on their next turn?

Consider the pattern “XO-|-O-|--X” which looks like this:

|     | j=0 | j=1 | j=2 |
|-----|-----|-----|-----|
| i=0 | X   | O   |     |
| i=1 |     | O   |     |
| i=2 |     |     | X   |

It’s X’s turn and the algorithm would need to block the bottom middle (2, 1) in order to stop O winning on their next move. Here are the rewards for that pattern:

```javascript
{i: 2, j: 1, xWinRewards: 0, oWinRewards:  4, drawRewards: 4} // =>  4.5
{i: 0, j: 2, xWinRewards: 0, oWinRewards: 32, drawRewards: 0} // => -1.0
{i: 2, j: 0, xWinRewards: 0, oWinRewards: 32, drawRewards: 0} // => -1.0
{i: 1, j: 2, xWinRewards: 0, oWinRewards: 16, drawRewards: 0} // => -1.0
{i: 1, j: 0, xWinRewards: 0, oWinRewards: 32, drawRewards: 0} // => -1.0
```

The algorithm knows that moving to (2,1) is best because it avoids the loss. It also knows that X has never won from here and can only draw by stopping the win for O.

5) Does it know about the strategy on Wikipedia’s tic-tac-toe page which shows how O can best counter opening moves by X?

Yes! The algorithm is clever! But not clever enough to read Wikipedia 😀 It’s simply learned through playing, that these moves give the highest rewards. Here is what Wikipedia says:

Player O must always respond to a corner opening with a center mark,…

Try it out on the interactive board at the top of this article! The algorithm does this in three out of the four corner openings, but strangely not when X starts in the top right. I guess it needs a little more practice on that opening move.

…and to a center opening with a corner mark.

Yes, the algorithm does this.

An edge opening must be answered either with a center mark, a corner mark next to the X, or an edge mark opposite the X.

For the bottom and right edges it responds by taking the centre cell. For the left edge it responds by taking the bottom left corner, i.e. next to the starting position. For the top, it fails; I guess it needs some more practice here too.

Since we are pretty much able to answer “yes” to these five questions, I claim that it has indeed learned how to play tic-tac-toe.
If you like that and think it’s impressive, read on, because it gets better.

If you look at the interactive board at the top of this article, you will see a line which says “Unique games seen: 4330”. In my previous article I confirmed that there are 255,168 unique games, of which only 29,600 are not classed as “stupid” because of missed wins or blocks. So if there are so many games, why has the algorithm encountered fewer than 2% of them, and how is it still able to play so well and to have learned everything shown above?

We can get a list of all patterns where the algorithm hasn’t experienced every possible next move, by opening the JavaScript console for the interactive board (right click on the board and select “inspect”, in Chrome), and running the following snippet:

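```javascript
Object.keys(model.patterns).forEach(function(k) {
  // count the markers which have already been placed in this pattern
  let len = k.split('').filter(function(l) { return l === 'X' || l === 'O'; }).length;
  // a fully explored pattern has one known next move for every empty cell
  if (model.patterns[k] && model.patterns[k].possibleNextMoves &&
      (model.patterns[k].possibleNextMoves.length + len !== 9)) {
    console.log(k);
  }
});
```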

2927 patterns have missing solutions, i.e. the large majority of them. Let’s randomly take one and have a look, e.g. “X--|OXO|---”:

|     | j=0 | j=1 | j=2 |
|-----|-----|-----|-----|
| i=0 | X   |     |     |
| i=1 | O   | X   | O   |
| i=2 |     |     |     |

The algorithm has only experienced one move from there, although there are actually 5.

```javascript
{i: 0, j: 1, xWinRewards: 8, oWinRewards: 0, drawRewards: 0}
```
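To see which moves are missing for a given pattern, one could compare its empty cells with the moves that have been recorded. This is a small sketch, and `unexploredMoves` is a hypothetical helper rather than part of the project's code:

```javascript
// Lists the empty cells of a pattern for which no move has been recorded yet.
function unexploredMoves(pattern, knownMoves) {
  const missing = [];
  pattern.split('|').forEach((row, i) => {
    [...row].forEach((cell, j) => {
      const tried = knownMoves.some(m => m.i === i && m.j === j);
      if (cell === '-' && !tried) missing.push({i, j});
    });
  });
  return missing;
}

// The pattern above has five empty cells, but only (0, 1) has ever been tried.
console.log(unexploredMoves('X--|OXO|---', [{i: 0, j: 1}]));
```

For this pattern the helper reports the four untried cells, including the winning move at (2, 2).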

The move to (0, 1) is actually a “stupid” move, because it misses the immediate win which it would get if it moved to (2, 2). But if we take a step back, O’s last move was also “stupid” as it should have blocked that potential win. If we look at the rewards for the two patterns where O makes its second move to take the left or right edge, we see the following.

Pattern “X--|OX-|---”:

```javascript
{i: 2, j: 2, xWinRewards: 98, oWinRewards: 12, drawRewards: 21} // => 10.7
{i: 2, j: 0, xWinRewards: 16, oWinRewards:  0, drawRewards:  0} // =>  0.0
{i: 1, j: 2, xWinRewards:  8, oWinRewards:  0, drawRewards:  0} // =>  0.0
```

Pattern “X--|-XO|---”:

```javascript
{i: 2, j: 2, xWinRewards: 164, oWinRewards: 148, drawRewards: 4} // => 46.5
```

The algorithm has learned to block the potential win by X, and so it will never again go down the path which resulted in the above game. Moreover, if we try to recreate this pattern using the interactive board at the top of this article, we find that we can’t, because the algorithm has learned much better moves. If the algorithm starts by taking the centre and a human follows by taking the left edge, the algorithm doesn’t take the top left corner, because by taking the top edge it can force either an immediate win or a fork, and so it wins in all cases. Equally, if a human starts as X and takes the top left corner, the algorithm will take the centre because it is a better defence, and so the board layout above will never happen. Finally, if a human starts in the centre, the algorithm will take the top right corner rather than an edge, because taking the corner is a better defence to an opening in the centre than an edge is.

The algorithm only ever came across the pattern above during the exploration phase of its learning cycle. It has found the strongest moves starting from the opening and as such it doesn’t need to bother learning about games where it ends up in a weaker position. Humans do the same in chess: games where we move, say, the king with our second move don’t make sense, so we don’t keep them in our toolbox of best moves. I personally find it quite impressive how the algorithm has homed in on optimal games and as such reduced its need to know every possible game.

Originally I questioned whether or not this type of machine learning was “brute force” because it needs to play so many games before becoming strong. I assumed it needed to play a great many unique games, and that after “learning” it was just going to use its massive memory to work out what to do. But it isn’t actually playing that many unique games; it just takes time to work out which of the 4,330 games that it knows about are the best ones. If it needed to encounter all 29,600 games which don’t involve stupid moves, or indeed more than that number, then I think one could claim that the algorithm uses brute force in order to become strong. But because that isn’t the case, I am confident in saying that this algorithm isn’t using brute force to learn.

Having done the analysis above, I am a lot more impressed by the algorithm than I was the first time it beat me, which was quite an impressive moment in itself.

So if I am so impressed, am I scared of AI? Not really. It took me a long time to tune this algorithm so that it played reasonably and there are many dimensions which can be tuned:

• Exploration: how long should it explore for and when should it no longer explore? If it explores too long, it just learns to play randomly. If it doesn’t explore enough, my teenage son can easily beat it when he uses some unexpected moves.
• Rewards: the question is really “what is the best move”, and defining what “best” even means can be tricky – see my previous article. In the case of this algorithm, the rewards are two dimensional, because not only do they depend on whether the game is won/drawn/lost, but also how quickly, so that it learns to not miss immediate wins.
• Experience: how many games does it need to play until it becomes good? I wasn’t able to answer “yes” to my five questions until it had played 100,000 games.

Tuning those dimensions takes quite a lot of time and, I would say, still needs a human to do it. As with all uses of software, one needs to work out if there is a business case for it. I must have spent 5-20 hours a week on this project over three months, let’s call it a round 20 man days. I’m not sure I’d find someone who would have paid me to write this algorithm, because who really cares that much about playing a futile game which will always end in a draw when played by experienced players? My current customer is introducing an algorithm which takes the description of the insurance case being created and, based on that text, preselects some fields in the form. It’s 80% accurate and it doesn’t use reinforcement learning; more likely it uses pattern recognition with neural networks.
If it saves each user 10 seconds per insurance case, that’s over a man year of work saved every year. Designing, implementing and maintaining the algorithm certainly took less time, so there is a real business case with a positive return on investment. As long as companies reinvest this return, rather than laying people off, I have no problem with it. Most of my career has been based on building software to increase productivity and I don’t currently believe that AI will increase this productivity by orders of magnitude more than what we have been achieving without AI, using just good old-fashioned software.

Just a couple of closing remarks now:

• The algorithm presented here learns by playing against itself. I did play around with training it against more sophisticated players to start with, and it seemed to improve the rate at which it learned. But I find it quite cool that I was able to remove that and get it to learn against itself. It’s effectively taking the algorithm with no knowledge, throwing it into a dark room and saying “hey, go play, and I’ll tell you the result of each game, you don’t need to know the rules, you can learn them along the way”. It learns the rules and the strategies to become a strong player!
• The algorithm can be classed as reinforcement learning. But what about its subclass, for example is it a Q-learning algorithm? Not quite.
• The title of this article could have been “Using *Deep* Reinforcement Learning To Learn To Play Tic-Tac-Toe”, which would have required using a neural network to recognise patterns, rather than just matching known board patterns exactly. I played with this a little but couldn’t train a network which worked at all. I believe that the problem is that unlike matching letters or images, similar looking patterns in tic-tac-toe or other games like chess can actually have exact opposite outcomes, one leading to a win and the other leading to a loss. Imagine a game of chess where a king is moved by just one position so that the opponent could put it into checkmate: the patterns would be very similar, the outcomes very different. This is a different problem to classifying handwritten letters. But I am really a novice when it comes to neural networks, so it’s best if I don’t comment further.

The complete code for this project can be viewed here: http://github.com/maxant/tictactoe/.

What are the next steps? I have spent a lot of time on this little project. I’ve learned a lot. One of my original thoughts was that a true test would be to introduce a new game, say Three Men’s Morris and see if the same algorithm could learn to play that too. One day… but first it’s time for something different…

Finally a few links that I used along the way: