The self-learning method of Tothello
I have been working on Tothello for more than 4 years. I fully agree with Michal Buro when he states that building a good othello program needs 2-3 years depending on your experience. When I started to develop Tothello, there were already a few computer programs in existence that had taught themselves to play reversi. In my view, the best described theory belongs to Herakles written by Konstantinos Tournavitis (see the links). It is a clear description of how to train a program playing auto-games.
Tothello uses a slightly different approach. This game can be learnt at quite an impressive level that is learning the game starting from endgame positions whereby the exact result is known and then moving backward to the mid- and opening game. It is very important to mention that Tothello does not only use a well-trained set of coefficients but also different built-in optimizations like position heuristics and move ordering. Certainly, these optimizations differ certainly from game type to game type: you have different optimizations in a normal 8 and a reversed 8 game. Even normal 8 and normal 6 are very different in this regard.
I used a separate program to tune these parameters. I developed about 50 complex routines to execute each separate phase of the learning and verification procedure.
The learning started with the calculation of different optimizations: position heuristic, move ordering, etc. Then game positions with empty squares between 19 and 8 were generated. These game positions can be solved perfectly, and by saving consecutive boards of the exact solution, at the end you can get a huge number of training positions. These training positions are suitable to train the coefficients used by your evaluation function. What this procedure makes difficult is that even for solving game positions with empties 19-16 you need an already well-functioning eval function. Otherwise it will take too long (it isn’t so easy to solve 5 million of such game positions). Once you have trained your coefficients set for these empties you have to work out the MPC (multi prob cut) coefficients to make the endgame solving even better. These coefficients were tuned using the Konjugate Gradient method.
There is an important aspect you should take care of: how to train those coefficients which occur very rarely? I used a direct generation method for these coefficients by generating first a random game position and afterwards modifying it in order to obtain the actual coefficient. Maybe it isn’t the best way, because these game positions are not too „realistic“ when starting the game from the well-known opening position, but ensures the minimum occurrence of the coefficients you need for tuning. As a consequence Tothello is now able to play by starting from any random game position, even from that which are setup manually and totally unrealistic in a normal opening game.
Having a good endgame solver you can move to the next stage, to the empty squares between 20 and 24. It is very difficult to solve perfectly a huge number of training sets at this game stage. One option is to decrease the confidence level of the endgame solver. Having an already working MPC endgame solver you can indulge in the luxury of solving these game positions only at, let’s say, 90-95% confidence level. Depending on the accuracy of your evaluation function, you can calculate how this approximation will affect your coefficients. It turns out that the tuned coefficients will not be impaired too much, in general, at a confidential level between 95-98% and you gain considerable reduction in the time needed for generation & solving. Thus you can have the next coefficient set calculated and the MPC calculation can follow. During its learning procedure Tothello found some board positions very difficult to solve (see “Difficult to solve”).
If you have a very good endgame solver or many computers at your disposal (e.g. you are a system administrator ) you can extend this stage to empties up to 26-28. Don’t forget: it is not too difficult and it doesn’t take too much time to solve one game position at this stage. But to train a huge number of coefficients you need a few million of solved game positions. This is very difficult indeed. At this point I need to amend slightly the random board generation method, because if in the first starting board position the mover has too many possible moves, solving takes considerably longer. Therefore I had to minimize, within certain limits, the mobility of the starting board position by choosing that board which has a suitable number of moves.
You can check the accuracy of your coefficients by generating new, realistic game positions, solving them, comparing the results to the evaluation function’s prediction and calculating the dispersion. Naturally this can be done only for endgame positions.
We have now reached game positions with the number of empty squares bigger than 24-26. Having an already working coefficient set for the endgame positions, you can execute midgame searches - without using midgame MPC search - to depths 10-14. You can do a midgame search for game positions with the number of empty squares 38 -40 and select the best move for that position. In this case I used the midgame search only to find the best move. Making these best moves consecutively you will finally reach a position which can be perfectly solved. This is the auto-play part of the learning. Having this exact result you assign it to each intermediate position.
In fact this is the main idea behind the self learning procedure of many other programs too, but there could be small differences in the implementation: for example in case of Mouse, whereby the best move (calculated by the eval function) was not always chosen, but one random move from the first three best moves.
For a while I investigated the accuracy of selecting the best move in this way and assigning the end result to all intermediate game positions. Please note that this highly depends on the accuracy of your evaluation function. I found that in 90% percent of the cases it was correct, and for the rest the endgame result was very close to the perfect value of the intermediate game positions. So the learning could be performed.
Please note that if you use this method from the very beginning and start auto playing games from a level with a much bigger number of empty squares, you may impair the accuracy of your learning procedure. That’s why learning - in my opinion - must be done gradually backwards from the endgame. If you already have a training set for empties 26-38, you can move to the next interval, for ex. 39-46.
You will find that you would definitely need a different coefficient set for each game stage. This is due to the fact that a coefficient in the middle game will become more or less important in the endgame, so coefficients’ weight vary. For example, mobility is very important in the opening and midgame and it is less important in the endgame.
Having tuned coefficients for the midgame, you will have to calculate the midgame MPC parameters (if you use the Multi ProbCut search) for your evaluation function. This is also very time consuming. Here the problem lies in the fact that you have to execute full width non MPC searches. You will get as a result the a, b, sigma parameters for each (shallow search depth, deep search depth) pair. Since the full width search takes time, I only managed to calculate the parameters up to the depth = 14. To have a fast search in the midgame you will really need MPC parameters for depths = 15,16,17,18,19,20,21,22… Top level programs use the calculated values to extrapolate the a,b, sigma parameters for deeper searches. Tothello uses a similar implementation of extrapolation. The sigma varies considerably depending on the shallow search and deep search depth. This must be reflected by the extrapolation.
I have already mentioned the opening book, which can be built either via self-playing or via learning from selected games of game archives. I needed the self-playing mechanism to build the reversed games’ book anyway, thus Tothello used this procedure when it built its opening books.
One can ask why don’t we make use of game databases which are available on the internet and why don’t we allow our program to learn from these games (played by high level programs). Of course you can. These databases can definitely be used to play games which start with the standard opening position, but I think their most important value is that an opening book based on them can be built. However, it is very difficult to select from these game archives a trusted set of games. Random games also help to train your rare coefficients. I didn’t use game archives because:
- it is very difficult to select from these game archives a trusted set of games;
- despite the huge number of random games available for instance on the CGS server, these game positions are not “supervised” and they will not guarantee you that your coefficients will have a 90% required minimum occurrence;
- there are no useful game archives for anti 8x8, 7x7, 6x6, etc. Thus a self-learning is required anyway.