A Simple Guide for Everyone
Monte-Carlo Tree Search (MCTS) is a powerful algorithm used to make decisions by exploring potential future outcomes in a tree-like structure. Imagine you're playing a game like chess; there are many possible moves you could make, and each move leads to a different set of possibilities. MCTS helps to identify the best move by simulating different potential sequences of actions (paths) and assessing their outcomes. The algorithm uses random sampling of these paths to estimate the potential value of different actions, guiding it towards exploring more promising options. It’s like trying different routes on a map to see which path leads to the best destination while gradually focusing more on the most promising routes.
The strength of MCTS lies in its balance between exploration (trying out less-known possibilities) and exploitation (focusing on the known good moves). It continuously updates its strategy based on new information from these simulations, refining the search towards the most advantageous actions. This makes MCTS suitable for problems where the search space is vast and it's impossible to compute all possible outcomes, such as complex board games or real-time decision-making tasks.
Monte Carlo Tree Search is a heuristic search algorithm that focuses on the most promising nodes of the tree, where the value of each node is estimated from the outcomes of a number of random simulations.
Monte Carlo Tree Search proceeds in four repeated steps: Selection, Expansion, Simulation (Rollout), and Backpropagation.
Selection: This step is performed by traversing the tree from the root node until reaching a node that has unexplored children and is not in a terminal state. At each level, the child node is chosen according to the Upper Confidence Bounds for Trees (UCT) policy, which follows the Upper Confidence Bound (UCB) rule defined by the formula:

\[ UCB(w) = Q(w) + C \sqrt{\frac{2 \ln N(v)}{N(w)}} \]

Where:

\( Q(w) \) is the value estimate of child node \( w \) (its average simulation reward),
\( N(v) \) is the number of times the parent node \( v \) has been visited,
\( N(w) \) is the number of times the child node \( w \) has been visited,
\( C \) is the exploration constant that controls the trade-off between the two terms.

The first term \( Q(w) \) favors exploitation, preferring nodes that have performed well historically, and the second term \( C \sqrt{\frac{2 \ln N(v)}{N(w)}} \) favors exploration, preferring nodes that have been visited less often. The node with the higher UCB value is selected.
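As a concrete illustration, here is a minimal Python sketch of this selection rule. The Node class, its fields (N, Q, children, parent), and the value of the exploration constant C are assumptions made for this example, not part of any particular library.

import math

# Minimal, illustrative node bookkeeping (hypothetical fields, not from a specific library)
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.N = 0      # number of visits
        self.Q = 0.0    # average simulation reward

def ucb_score(parent, child, C=1.4):
    # An unvisited child gets an infinite score so that every child is tried at least once
    if child.N == 0:
        return float("inf")
    # Exploitation term Q(w) plus the exploration term C * sqrt(2 ln N(v) / N(w))
    return child.Q + C * math.sqrt(2 * math.log(parent.N) / child.N)

def select_child(parent):
    # Select the child with the highest UCB value
    return max(parent.children, key=lambda c: ucb_score(parent, c))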
Expansion: Here, one or more new nodes are added as children of the node that was selected during the selection step. Each newly added node is called a child (or leaf) node.
Simulation (Rollout): A random simulation (or "rollout") is performed from the newly expanded node. This involves selecting actions according to a default policy until a terminal state is reached, where a reward is obtained.
Backpropagation: Once the terminal state is reached, backpropagation is performed from the leaf node back to the root node, updating \( Q(v) \) and incrementing \( N(v) \) for every node along the path. Here, \( Q(v) \) aggregates the outcomes of all the simulations run in the subtree rooted at \( v \).
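As a small sketch of this update, assuming each node stores a visit count N, an average reward Q, and a parent pointer (illustrative field names), the backpropagation step can be written as follows; keeping a running average of the outcomes is equivalent to storing their sum and dividing by the visit count.

def backpropagate(node, reward):
    # Walk from the simulated leaf back to the root, updating the statistics on the path
    while node is not None:
        node.N += 1                              # increment the visit count N(v)
        node.Q += (reward - node.Q) / node.N     # fold the outcome into the running average
        node = node.parent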
This animation demonstrates the Monte Carlo Tree Search (MCTS) process. It illustrates the four main steps: Selection, Expansion, Simulation, and Backpropagation, showing how the algorithm navigates through the search tree to make decisions.
Monte-Carlo Tree Search (MCTS) has demonstrated significant success across various domains, particularly in AI-driven game playing. One of the most remarkable achievements is in the game of Go, where traditional search techniques struggle due to the immense search space. The success of AlphaGo, which utilized MCTS in combination with deep neural networks, showcased how MCTS can guide AI to perform at a superhuman level. By simulating different move sequences and evaluating the outcomes, MCTS enabled AlphaGo to select optimal strategies and defeat world champions.
Beyond games, MCTS has proven effective in other decision-making tasks, such as robotics and real-time optimization. In robotic path planning, for instance, MCTS helps robots navigate through dynamic environments by simulating different paths and choosing the one with the highest likelihood of success. This adaptability to different problem domains is one of the key reasons why MCTS is widely used, providing near-optimal solutions in situations where exhaustive search is impractical.
MCTS also plays a crucial role in reinforcement learning (RL), where it is used to improve decision-making capabilities. Algorithms like AlphaZero and MuZero incorporate MCTS to simulate future actions and evaluate potential outcomes, enabling the agents to learn and refine their strategies over time. This integration has shown that MCTS can be effectively combined with deep learning techniques to tackle complex, high-dimensional problems across a variety of fields, including robotics, optimization, and strategic planning.
Monte-Carlo Tree Search (MCTS) is widely used in robotics for sequential decision-making tasks, especially in scenarios where the robot needs to plan its actions under uncertainty. In robotics, decision-making often involves navigating a dynamic environment, choosing actions that optimize certain objectives (such as reaching a goal, minimizing energy use, or avoiding obstacles). MCTS helps robots make these decisions by simulating different action sequences and evaluating their potential outcomes. It is particularly beneficial in situations where the robot must decide on the best course of action in real-time, as MCTS can dynamically adapt its strategy based on new observations and information gained through simulations.
One example is robotic path planning in uncertain environments, such as a robot navigating through a cluttered room or an unmapped outdoor area. Here, MCTS can simulate potential paths and predict the likelihood of success (e.g., avoiding obstacles) for each path. The robot then chooses the most promising path based on these simulations, updating its plan if new obstacles are detected. In more complex tasks, like robotic manipulation, MCTS can help the robot decide the sequence of movements needed to grasp an object or assemble a structure, considering the uncertainties in the environment and the robot's own motion.
Another important application is in multi-robot coordination, where MCTS helps in deciding the optimal strategy for each robot in a group to achieve a collective goal. For instance, in search and rescue operations, multiple drones or robots may be used to explore different areas or search for survivors. MCTS can be used to decide which regions each robot should explore by simulating different allocation strategies and choosing the one that maximizes coverage and minimizes overlap. Similarly, in industrial settings, robots working together on a task, such as transporting objects in a warehouse, can use MCTS to optimize their paths and reduce potential collisions or delays.
Monte Carlo Tree Search (MCTS) has several variants that adapt the basic approach to different contexts and improve its efficiency, scalability, or applicability. Here are some of the most common variants and the key differences between them:
The basic MCTS algorithm consists of four steps: Selection, Expansion, Simulation, and Backpropagation. The original form of MCTS follows a standard implementation of these steps, with nodes selected based on criteria like the Upper Confidence Bound for Trees (UCT).
Upper Confidence Bounds for Trees (UCT)
Key Feature: UCT is a specific method of node selection within MCTS that balances exploration (trying less explored nodes) and exploitation (focusing on promising nodes).
Difference: UCT uses a formula based on the Upper Confidence Bound (UCB) to decide which node to visit, ensuring a balance between gaining more information about lesser-known parts of the tree and focusing on the current best options.
Advantage: UCT helps the algorithm focus on the best moves while still occasionally trying new ones to avoid missing out on potential opportunities.
Rapid Action Value Estimation (RAVE)
Key Feature: RAVE is a technique used to speed up the convergence of MCTS by sharing information across similar states.
Difference: It estimates the value of a move not just from the context in which it has been played, but also from other contexts in the tree where the move was possible. It gives earlier and more reliable estimates of move quality.
Advantage: RAVE accelerates MCTS by quickly propagating information about move effectiveness, reducing the number of simulations required to get accurate results. It is especially useful in the early stages of exploration.
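The following is a hedged sketch of how a RAVE value can be computed: the node's own MCTS statistics are blended with shared all-moves-as-first (AMAF) statistics using a weight beta that shrinks as the node accumulates real visits. The fields amaf_Q and amaf_N, the constant k, and the weighting schedule are illustrative assumptions, not a definitive implementation.

import math

def rave_value(child, k=1000):
    # beta is close to 1 while the child has few real visits (rely on the shared AMAF estimate)
    # and decays toward 0 as simulation counts grow (rely on the node's own statistics)
    beta = math.sqrt(k / (3 * child.N + k))
    mc_value = child.Q          # regular MCTS average reward for this child
    amaf_value = child.amaf_Q   # all-moves-as-first average reward for this child's move
    return (1 - beta) * mc_value + beta * amaf_value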
All Moves As First (AMAF)
Key Feature: This is an estimation method similar to RAVE. It assumes that a move should have a similar value regardless of the point in a sequence at which it is made.
Difference: AMAF works by treating moves made later in a simulation as if they had been played at the first opportunity. This provides more data for evaluating moves, even though each move may have occurred at a different point in the sequence.
Advantage: This helps the MCTS converge faster by effectively increasing the sample size of move evaluations.
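A simplified sketch of gathering AMAF statistics during backpropagation is shown below: every node on the selected path credits any of its children whose move also appeared later in the simulation, as if that move had been played first. The path list, the child.move and amaf_* fields, and the simulation_moves set are hypothetical names; a full implementation would track, for each node, only the moves made after it.

def update_amaf(path, simulation_moves, reward):
    # path: nodes from the root to the expanded leaf
    # simulation_moves: the set of moves made during the rollout (simplifying assumption)
    for node in path:
        for child in node.children:
            if child.move in simulation_moves:
                child.amaf_N += 1
                child.amaf_Q += (reward - child.amaf_Q) / child.amaf_N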
Progressive Widening
Key Feature: Limits the branching factor by adding new children incrementally as more simulations are performed.
Difference: Instead of expanding every possible move immediately, progressive widening adds moves to the search tree gradually, with a selection bias towards promising actions.
Advantage: This helps control the number of child nodes and improves computational efficiency, especially in games with a large branching factor where exploring all possible moves at once would be impractical.
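A minimal sketch of a progressive-widening rule follows: a node may only add a new child while its number of children stays below a budget that grows slowly with its visit count. The constants k and alpha, and the helpers has_untried_actions and expand, are assumptions for illustration.

def allowed_children(node, k=2.0, alpha=0.5):
    # The child budget grows roughly as k * N(v)^alpha, widening the branching factor gradually
    return int(k * (node.N ** alpha)) + 1

def maybe_expand(node):
    # Expand only while the node is under its progressive-widening budget
    if len(node.children) < allowed_children(node) and has_untried_actions(node):
        return expand(node)   # reuse the expansion step of the basic algorithm
    return None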
Progressive Bias
Key Feature: Uses heuristic information to bias the selection of nodes within the MCTS process.
Difference: Progressive bias alters the UCT formula to include heuristic scores for different nodes, making it more likely that promising moves are selected earlier in the search process.
Advantage: When domain knowledge is available (like heuristic evaluation of a board position), this variant helps the search to converge faster by prioritizing likely good moves based on that knowledge.
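One possible way to express a progressively biased selection score is sketched below: the standard UCB value is augmented by a heuristic bonus whose influence fades as the child accumulates visits. The heuristic(child) function is an assumed, domain-specific evaluation, and the decay schedule shown is just one common choice.

import math

def progressive_bias_score(parent, child, C=1.4):
    if child.N == 0:
        return float("inf")
    ucb = child.Q + C * math.sqrt(2 * math.log(parent.N) / child.N)
    # The heuristic bonus decays as the child is visited more, so real statistics take over
    return ucb + heuristic(child) / (child.N + 1)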
P-UCT
Key Feature: Uses a policy network to guide the selection of nodes.
Difference: P-UCT leverages a policy network (from deep learning) to bias MCTS towards moves that have higher prior probabilities, instead of relying purely on UCB for exploration and exploitation.
Advantage: The policy network helps MCTS make more informed decisions by guiding the search, leading to fewer simulations needed to identify strong moves. It’s used in implementations like AlphaGo.
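A sketch of a PUCT-style selection rule in the spirit of AlphaGo-like systems is given below: each child stores a prior probability P produced by a policy network, and the exploration bonus is weighted by that prior. The field names and the constant c_puct are assumptions for illustration, not the exact formula of any specific system.

import math

def puct_score(parent, child, c_puct=1.5):
    # Simulation-based value estimate plus a prior-weighted exploration bonus
    q = child.Q if child.N > 0 else 0.0
    u = c_puct * child.P * math.sqrt(parent.N) / (1 + child.N)
    return q + u

def select_child_puct(parent):
    return max(parent.children, key=lambda c: puct_score(parent, c))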
Nested Monte Carlo Search
Key Feature: Uses a hierarchy of Monte Carlo searches to refine the evaluation of moves.
Difference: In a nested search, a Monte Carlo search at a higher level calls another Monte Carlo search at a lower level for evaluating nodes.
Advantage: This can increase the precision of move evaluation and lead to stronger decision-making, particularly in puzzles or games that require precise evaluation at many levels of depth.
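Below is a heavily simplified sketch of a nested Monte Carlo search: level 0 scores a state with a single random playout, and at level n every candidate move is evaluated by a level n-1 search before the best one is played. The helpers random_playout_score, legal_moves, apply_move, is_terminal, and final_score are assumed for the problem at hand, and the sketch omits the usual bookkeeping of the best sequence found so far.

def nested_search(state, level):
    # Level 0: evaluate the state with a single random playout
    if level == 0:
        return random_playout_score(state)
    # Higher levels: at every step, play the move whose level-(n-1) search scores best
    while not is_terminal(state):
        best_move, best_score = None, float("-inf")
        for move in legal_moves(state):
            score = nested_search(apply_move(state, move), level - 1)
            if score > best_score:
                best_move, best_score = move, score
        state = apply_move(state, best_move)
    return final_score(state)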
Parallel MCTS
Key Feature: Parallelizes the MCTS algorithm across multiple processors or threads to speed up simulations.
Difference: Variants like root parallelization (building several independent trees from the same root and merging their statistics) or leaf parallelization (running multiple rollouts in parallel from the same leaf) are used to speed up the MCTS process.
Advantage: Significantly speeds up the search by using multiple CPUs or GPUs, enabling more simulations to be run in the same amount of time.
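As an illustration of root parallelization, the sketch below runs several independent searches from the same root position with Python's multiprocessing and merges their root-level visit counts before committing to a move. The helper run_mcts_from and its {move: visit_count} return format are assumptions introduced for this example.

from collections import Counter
from multiprocessing import Pool

def run_one_search(root_state):
    # Assumed helper: runs an independent MCTS from this state and returns
    # a {move: visit_count} dictionary for the root's children
    return run_mcts_from(root_state)

def root_parallel_search(root_state, num_workers=4):
    # Root parallelization: several independent trees are grown from the same root,
    # then their root statistics are merged before choosing a move
    with Pool(num_workers) as pool:
        results = pool.map(run_one_search, [root_state] * num_workers)
    merged = Counter()
    for counts in results:
        merged.update(counts)
    return merged.most_common(1)[0][0]   # the most visited move across all workers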
MCTS with Transposition Tables
Key Feature: Transposition tables store information about game states that have been visited multiple times.
Difference: MCTS with transposition tables avoids re-exploring nodes that represent the same game state but are reached via different sequences of moves.
Advantage: Reduces redundant calculations, especially in games with a high number of transpositions (i.e., equivalent states reachable through different move orders).
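A minimal sketch of sharing statistics through a transposition table is shown below: nodes are looked up by a canonical state key, so two different move orders that reach the same position reuse the same statistics entry. The state_key helper (for example a Zobrist hash) and the dictionary layout are illustrative assumptions.

transposition_table = {}   # maps a canonical state key to shared node statistics

def get_node(state):
    # state_key is an assumed helper, e.g. a Zobrist hash or a canonical board encoding
    key = state_key(state)
    if key not in transposition_table:
        # The same entry is reused no matter which move order reached this state
        transposition_table[key] = {"N": 0, "Q": 0.0, "children": {}}
    return transposition_table[key]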
The pseudocode below follows the four main steps of MCTS: Selection, Expansion, Simulation (Rollout), and Backpropagation. It explains how the algorithm navigates the search tree, adds new nodes, simulates outcomes, and updates the tree based on simulation results.
import math

# Main function for the Monte Carlo Tree Search
def monte_carlo_tree_search(root):
    # Run iterations while the computational budget (time, simulations) is not exhausted
    while resources_left(time, computational_power):
        # Step 1: Selection
        leaf = traverse(root)
        # Step 2: Expansion
        if not is_terminal(leaf):
            leaf = expand(leaf)
        # Step 3: Simulation (Rollout)
        simulation_result = rollout(leaf)
        # Step 4: Backpropagation
        backpropagate(leaf, simulation_result)
    # Return the best child node based on the number of visits
    return best_child(root)

# Function for node traversal (Selection)
def traverse(node):
    while fully_expanded(node) and not is_terminal(node):
        node = best_uct(node)
    # Return the node that has unexplored children or is terminal
    return node

# Function for expanding the tree (Expansion)
def expand(node):
    # Create a child node for one of the node's unexplored actions and add it to the tree
    child = pick_unvisited(node)
    node.children.append(child)
    return child

# Function for the result of the simulation (Rollout)
def rollout(node):
    # Simulate with the default policy until a terminal state is reached
    while not is_terminal(node):
        node = rollout_policy(node)
    # Return the outcome of the simulation
    return result(node)

# Function for randomly selecting a successor state (Rollout Policy)
def rollout_policy(node):
    return pick_random(node.children)

# Function for backpropagation (Backpropagation)
def backpropagate(node, result):
    # Update the stats of every node on the path from the leaf back to the root
    while node is not None:
        update_stats(node, result)   # increment N and fold the result into Q
        node = node.parent

# Function for selecting the best child based on the UCT value
def best_uct(node):
    # Use the UCB formula to select the child with the highest score
    return max(node.children, key=uct_value)

# Function to calculate the UCT value
def uct_value(node):
    # node.Q is the average reward, node.parent.N is the number of visits to the parent,
    # node.N is the number of visits to this node
    C = exploration_constant
    return node.Q + C * math.sqrt(2 * math.log(node.parent.N) / node.N)

# Function for selecting the best child (Best Child)
def best_child(node):
    # Pick the child with the highest number of visits
    return max(node.children, key=lambda child: child.N)
This pseudocode follows the formal definition of MCTS, covering the steps of Selection, Expansion, Simulation, and Backpropagation. Each function performs a specific task within one of these steps, giving a clear high-level outline of the algorithm; helper functions such as resources_left, pick_unvisited, and update_stats are left abstract and depend on the problem domain.
AlphaGo is perhaps the most famous application of MCTS, where it was used in combination with deep neural networks to conquer the ancient board game Go. Go is notorious for its vast search space—far exceeding that of chess—making traditional brute-force approaches infeasible. In AlphaGo, MCTS simulated different sequences of moves to evaluate their potential outcomes, while deep neural networks predicted the probability of winning from each state. The integration of MCTS enabled the system to efficiently explore promising moves by balancing exploration and exploitation. This approach allowed AlphaGo to achieve superhuman performance, culminating in a victory against world champion Lee Sedol in 2016, marking a significant breakthrough in AI research.
AlphaGo Zero, a successor to AlphaGo, achieved superhuman performance without using human game data. Instead, it learned entirely from self-play using reinforcement learning, starting from random play. This version further optimized the use of MCTS in conjunction with deep learning by employing a single neural network to represent both policy and value, thereby simplifying the architecture while improving performance.
AlphaZero, an evolution of AlphaGo Zero, generalizes the approach to master multiple games—such as chess, shogi, and Go—without any domain-specific knowledge. In contrast to traditional game-playing programs that rely on handcrafted evaluation functions and domain-specific adaptations, AlphaZero started with no prior knowledge and learned purely through self-play, given only the game rules. The same algorithm and neural network architecture were applied to all three games, demonstrating that a general-purpose reinforcement learning algorithm can achieve superhuman performance across multiple challenging games. MCTS guided the learning process by simulating different moves during training, evaluating the outcomes, and using the results to improve the policy and value networks.
MuZero further advances the application of MCTS by learning not only the optimal policy but also the model of the environment. It combines MCTS with model-based reinforcement learning, where the agent learns to predict the environment's dynamics (transitions and rewards) through experience. This allows MuZero to be applied to a broader range of problems, including classic board games like chess, Go, and shogi, as well as visually rich environments in Atari games. By learning a model of the environment, MuZero can simulate future outcomes using MCTS even when the underlying rules are not known, making it a powerful tool for decision-making in complex, real-world scenarios.
In MuZero, MCTS is used to plan the agent's actions by simulating different sequences of actions and their corresponding outcomes. MuZero predicts the future dynamics of the environment (such as the next state and reward) using its learned model. MCTS evaluates possible actions by simulating multiple steps ahead, with the agent predicting state transitions, rewards, and policy values. These predictions are then used to expand the search tree and choose the most promising actions. The process iterates over simulations to refine the estimates of action values, helping MuZero make more informed decisions in complex environments where the exact rules may not be fully known.
In the realm of deep reinforcement learning, TreeQN and ATreeC are algorithms that incorporate MCTS into the training process to enhance decision-making capabilities. In TreeQN, MCTS is integrated with Q-learning to build a tree of future states, allowing the agent to simulate potential future scenarios to improve its action-value estimates. Similarly, ATreeC extends actor-critic methods by using MCTS to perform lookahead planning during training. These methods demonstrate how incorporating tree search within RL frameworks can help the agent learn more effective policies, particularly in environments with long-term dependencies or sparse rewards, where standard RL techniques might struggle.
Even though MCTS has proven effective, there are still open questions for research: