
Evolving a Neural Network to Solve CartPole-v1 (ERL)


In this notebook, we'll solve the CartPole-v1 environment from Gymnasium. The goal is to balance a pole on a moving cart for as long as possible. We'll create a population of simple neural network "agents". The best agents (those who balance the pole the longest) will be selected to "reproduce" and create the next generation, gradually evolving a successful policy.

The common, standard approach to this problem is a policy gradient algorithm such as REINFORCE. For demonstration purposes, however, we'll use a Genetic Algorithm instead. The main drawback of this approach is its high computational cost, which in some cases leads to diminishing returns compared to standard gradient-based algorithms, so the choice to use ERL or another mixed EA-NN method should always be weighed carefully.

In some cases, evolutionary algorithms can work better than gradient-based training. Currently (2025), these methods show promising results in robotics, game AI, complex systems optimization, and other applications [1].

CartPole-v1

Imagine a cart that can move left or right on a track. On top of this cart, a pole is attached by a hinge, free to swing. The goal of our agent is simple: keep the pole balanced upright for as long as possible.

[Figure: the CartPole-v1 environment]

Observation (State):
What the agent "sees." In CartPole, this is a vector of four numbers:

[Cart_Position, Cart_Velocity, Pole_Angle, Pole_Angular_Velocity]

Actions:
The agent chooses one of two discrete actions:

[Push_Left, Push_Right]

Reward: For every timestep the pole remains upright (i.e., its angle doesn't exceed a certain threshold), the agent receives a reward of +1.

Termination:
The game is over if:

  • The pole falls too far.
  • The cart moves too far from the center of the track.
  • 500 timesteps pass (the episode is truncated - the agent successfully balanced the pole for the maximum duration).

The objective for our agent is to learn a policy—a mapping from observed states to actions—that maximizes the cumulative reward over an episode.
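
Before building anything, it helps to poke at the environment directly. A minimal sketch (assuming gymnasium is already installed; the setup cell below installs it):

import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # Box of 4 floats: position, velocity, angle, angular velocity
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right

state, _ = env.reset(seed=0)
print(state)                  # an initial state: 4 small values near zero
env.close()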

Evolutionary Reinforcement Learning (ERL)

Traditionally, Reinforcement Learning (RL) agents learn using algorithms like Q-learning, Policy Gradients, or Actor-Critic methods, which typically rely on gradient descent to update a neural network's weights. ERL takes a different path, combining the population-based search of Evolutionary Algorithms (EAs) with the learning framework of RL. In this specific example, we're using a form of Neuroevolution, where the structure of a neural network is fixed, but its weights are optimized purely through evolution.

Agent Policy Network

Agent's "brain" is a simple feed-forward neural network. It takes the four state observations as input and outputs the probability of taking each action (left or right).

$$\text{Policy}(s) = \text{Softmax}(\text{Dense}(\text{ReLU}(\text{Dense}(s))))$$

  • $s$ is the input state vector.
  • $\text{Softmax}$ converts the final layer's outputs into probabilities for each action.

The parameters (weights and biases) of this network, collectively denoted as $\theta$, define our agent's behavior. The goal is to find the optimal $\theta$ that yields the highest cumulative reward.
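
To make $\theta$ concrete, here is a small sketch (it assumes the same architecture as the create_agent function defined later, with a 32-unit hidden layer) that counts the numbers being evolved:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(4,)),              # 4 state observations
    layers.Dense(32, activation='relu'),   # 4*32 weights + 32 biases = 160 parameters
    layers.Dense(2, activation='softmax')  # 32*2 weights + 2 biases  = 66 parameters
])

# theta is simply this list of weight/bias arrays
theta = model.get_weights()
print(sum(w.size for w in theta))  # 226 parameters in total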

The ERL Loop

Instead of using gradient descent to iteratively adjust a single agent's weights, ERL maintains a population of agents, each with its own unique set of neural network weights. The learning process mimics natural evolution:

1. Initialization - We'll start with a population of $N$ agents. Each agent's neural network has its weights initialized randomly:

$$\text{Population}_0 = \{\text{Agent}_1(\theta_1), \text{Agent}_2(\theta_2), \dots, \text{Agent}_N(\theta_N)\}$$

2. Evaluation (Fitness Calculation) - Each agent in the current population is given a "chance at life" to play the CartPole game for one episode. Its fitness is the total reward it accumulates during that episode. A higher reward means better fitness.

$$\text{Fitness}(\text{Agent}_i) = \sum_{t=0}^{T_i} r_t$$

Where $r_t$ is the reward at timestep $t$, and $T_i$ is the total number of timesteps for Agent $i$ in one episode.

3. Selection - We take the top $K$ agents (e.g., the top 20%) as parents. This ensures that only the most successful traits are passed on.

$$\text{Parents} \subset \text{Population} \quad \text{such that } \forall p \in \text{Parents},\ \text{Fitness}(p) \ge \text{Fitness}(a)\ \text{for all } a \in \text{Population} \setminus \text{Parents}$$

4. Crossover - We'll create a child agent whose weights are the average of its two parents' weights. This allows successful traits from different parents to merge.

$$\theta_C = \frac{\theta_A + \theta_B}{2}$$

5. Mutation - To introduce novelty and prevent the population from getting stuck in local optima, small random changes are applied to a child's weights.

For each weight:

$$w' = w + \delta$$

Where $\delta$ is a small random number (e.g., drawn from a normal distribution with mean 0 and a small standard deviation), applied with a certain probability (the mutation rate); a per-weight sketch of this rule follows after this list.

6. Next Generation - The newly created children (along with the very best parent, often carried over using "elitism") form the new population.
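
A minimal per-weight sketch of step 5 (the full implementation below applies the same idea per weight tensor rather than per individual weight; mutation_rate and mutation_scale here are illustrative values):

import numpy as np

def mutate(weights, mutation_rate=0.05, mutation_scale=0.1):
    """Apply w' = w + delta to each individual weight with probability mutation_rate."""
    mutated = []
    for w in weights:                                     # one array per kernel/bias
        mask = np.random.rand(*w.shape) < mutation_rate   # which weights to perturb
        delta = np.random.normal(0.0, mutation_scale, w.shape)
        mutated.append(w + mask * delta)                  # noise only where mask is True
    return mutated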

Python Implementation

We'll use gymnasium and tensorflow:

# Install required packages
! pip install -q gymnasium tensorflow pygame
 
import gymnasium as gym
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt

Every agent is a dense network with softmax output:

def create_agent(input_size, output_size):
    """
    Creates a Keras Sequential model to act as our agent.
    """
    model = keras.Sequential([
        layers.Input(shape=(input_size,)),
        layers.Dense(32, activation='relu'),
        layers.Dense(output_size, activation='softmax') # Output probabilities for each action
    ])
    return model
 
def get_action(agent, state):
    """
    Gets an action from the agent for a given state.
    """
    # Keras expects a batch dimension, so we add one.
    state_tensor = np.expand_dims(state, axis=0)
    action_probs = agent.predict(state_tensor, verbose=0)[0]
    # Choose the action with the highest probability
    action = np.argmax(action_probs)
    return action
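
get_action always picks the most probable action (argmax), so an agent's behavior is deterministic given its weights. A stochastic variant that samples from the softmax output is a common alternative (a sketch, not used in the run below):

def get_action_stochastic(agent, state):
    """Sample an action according to the policy's output probabilities."""
    state_tensor = np.expand_dims(state, axis=0)
    action_probs = agent.predict(state_tensor, verbose=0)[0]
    action_probs = action_probs / action_probs.sum()  # renormalize the float32 output
    return np.random.choice(len(action_probs), p=action_probs)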

The fitness function plays one full episode and returns the total reward collected by the agent:

def evaluate_fitness(agent, env):
    """
    Runs one episode of the environment with the given agent
    and returns the total reward (fitness).
    """
    state, _ = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        action = get_action(agent, state)
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        
    return total_reward
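
Since the initial state is randomized, a single episode gives a noisy fitness estimate. If evolution looks unstable, one option is to average over a few episodes, at the cost of proportionally more simulation time; a sketch:

def evaluate_fitness_avg(agent, env, episodes=3):
    """Average the total reward over several episodes for a less noisy fitness signal."""
    return sum(evaluate_fitness(agent, env) for _ in range(episodes)) / episodes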

Now we put the genetic algorithm together:

POPULATION_SIZE = 50
NUM_GENERATIONS = 20
SELECTION_RATE = 0.2  # Keep the top 20% of the population as parents
MUTATION_RATE = 0.05  # Probability that a weight matrix/vector will be mutated
MUTATION_SCALE = 0.1  # The scale of the random noise added during mutation
 
# Environment Setup
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
 
# 1. INITIALIZATION: Create the initial population of agents
population = [create_agent(state_size, action_size) for _ in range(POPULATION_SIZE)]
 
# ERL Loop
history = {'best_fitness': [], 'avg_fitness': []}
 
for generation in range(NUM_GENERATIONS):
    # 2. EVALUATION: Calculate fitness for each agent
    fitness_scores = [evaluate_fitness(agent, env) for agent in population]
    
    # Sort agents by fitness (descending)
    sorted_population = sorted(zip(population, fitness_scores), key=lambda x: x[1], reverse=True)
    
    # Log progress
    best_agent, best_fitness = sorted_population[0]
    avg_fitness = sum(fitness_scores) / POPULATION_SIZE
    history['best_fitness'].append(best_fitness)
    history['avg_fitness'].append(avg_fitness)
    print(f"Generation {generation:2d}: Best Fitness = {best_fitness:.0f}, Avg Fitness = {avg_fitness:.2f}")
 
    # 3. SELECTION: Select the top agents to be parents
    num_parents = int(POPULATION_SIZE * SELECTION_RATE)
    parents = [agent for agent, score in sorted_population[:num_parents]]
    
    # Create the next generation
    next_population = []
    # Elitism: The very best agent automatically survives
    next_population.append(best_agent) 
 
    # 4. & 5. CROSSOVER & MUTATION
    while len(next_population) < POPULATION_SIZE:
        # Pick two distinct parents by index (np.random.choice works on indices, not model objects)
        parent1_idx, parent2_idx = np.random.choice(len(parents), 2, replace=False)
        parent1, parent2 = parents[parent1_idx], parents[parent2_idx]
        child = create_agent(state_size, action_size)
        
        # Crossover: Average the weights of the parents
        p1_weights = parent1.get_weights()
        p2_weights = parent2.get_weights()
        child_weights = [(w1 + w2) / 2.0 for w1, w2 in zip(p1_weights, p2_weights)]
        
        # Mutation: Add random noise to the child's weights
        mutated_weights = []
        for weights in child_weights:
            if np.random.rand() < MUTATION_RATE:
                noise = np.random.normal(0, MUTATION_SCALE, weights.shape)
                mutated_weights.append(weights + noise)
            else:
                mutated_weights.append(weights)
        child.set_weights(mutated_weights)
        
        next_population.append(child)
        
    population = next_population
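
A note on runtime: most of the time is spent inside get_action, because Model.predict sets up a full prediction loop for a single sample at every timestep. Calling the model directly is usually much faster for single-state inference; a sketch of a possible drop-in replacement (not used above):

def get_action_fast(agent, state):
    """Single-sample inference without the overhead of Model.predict."""
    state_tensor = np.expand_dims(state, axis=0).astype(np.float32)
    action_probs = agent(state_tensor, training=False).numpy()[0]
    return int(np.argmax(action_probs))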

Plotting the results:

# Plot the learning curve
plt.figure(figsize=(10, 5))
plt.plot(history['best_fitness'], label='Best Fitness')
plt.plot(history['avg_fitness'], label='Average Fitness')
plt.title('Fitness Over Generations')
plt.xlabel('Generation')
plt.ylabel('Fitness (Score)')
plt.legend()
plt.grid(True)
plt.show()
 
# Watch an agent play! A new window might pop up to show the game.
print("\nShowing an agent...\n")
 
import random
 
random_agent = random.choice(population)
vis_env = gym.make("CartPole-v1", render_mode="human")
evaluate_fitness(random_agent, vis_env)
vis_env.close()

[Figure: best and average fitness over generations]

Showing an agent...

[Animation: Agent Play]
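
If you want to keep the evolved policy, the best agent logged in the final generation can be saved and reloaded like any other Keras model (the file name is arbitrary):

# Persist the best agent from the final generation
best_agent.save("best_cartpole_agent.keras")

# Later: reload it without re-running evolution
loaded_agent = keras.models.load_model("best_cartpole_agent.keras")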

Conclusion

In game development, similar approaches allow building adaptive and less predictable NPC behaviors (in this example we select from a population of 50 agents). ERL and other mixed EA-NN approaches also show immense promise in fields like robotics, product design, and quantitative finance.

References

Anton Nesterov © 2025 | vski·science