ℹ️ First things first! This post was written in collaboration with Alexey Vinel (Professor, Halmstad University). Some ideas and visuals are borrowed from my previous post on Q-learning written for LearnDataSci. Unlike most posts you'll find on Reinforcement Learning, we explore it here from the angle of multiple agents, which makes things slightly more complicated and more exciting at the same time. While this will be a good resource for developing an intuitive understanding of Reinforcement Learning (Q-learning, to be specific), I highly recommend visiting the theoretical material (some links are shared in the appendix) if you want to explore Reinforcement Learning beyond this post.
I had to fork OpenAI's gym library to implement a custom environment. The code can be found in this GitHub repository. If you'd like to explore an interactive version, check out this Google Colab notebook. We use Python to implement the algorithms; if you're not familiar with Python, you can pretend the snippets don't exist and just read the textual parts (including the code comments). Alright, time to get started 🚀
"The agent learns to take desired for a given state in the environment",
For a given state S, if you take action A, the new state of the environment becomes S', and the reward received is R.
State | Action | Reward | Probability | Next State |
---|---|---|---|---|
Sp | Aq | Rpq | 1.0 | Sp' |
... | ... | ... | ... | ... |
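To make this concrete, here's a minimal sketch (not taken from the environment's code, just an illustration) of how one row of this table could be represented in Python:

from collections import namedtuple

# One row of the table above: taking action A in state S yields reward R and
# moves the environment to next_state with the given probability.
Transition = namedtuple('Transition', ['state', 'action', 'reward', 'probability', 'next_state'])

example = Transition(state='Sp', action='Aq', reward='Rpq', probability=1.0, next_state="Sp'")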
# Let's first install the custom gym module, which contains the environment
pip uninstall gym -y
pip install git+https://github.com/satwikkansal/gym-dual-taxi.git#"egg=gym&subdirectory=gym/"
import gym
env = gym.make('DualTaxi-v1')
env.render()
# PS: If you're using jupyter notebook and get env not registered error, you have to restart your kernel after installing the custom gym package in the last step.
We've created the DualTaxi-v1 environment and rendered its current state. In the rendered output, you can see the grid with the two taxis, the passenger location, and the destination. Next, let's look at the observation and action spaces,

>>> env.observation_space, env.action_space
(Discrete(6144), Discrete(36))
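Where do these numbers come from? Here's a quick back-of-the-envelope check (assuming a 4x4 grid, consistent with the taxi positions shown later in the post): each taxi occupies one of 16 cells, the passenger is at one of 6 places (R, G, B, Y, or inside either taxi), the destination is one of 4 locations, and each taxi picks one of 6 primitive actions, so the joint action space is their product.

# Back-of-the-envelope check of the space sizes (4x4 grid assumed)
n_states = 16 * 16 * 6 * 4  # taxi 1 cell * taxi 2 cell * passenger location * destination
n_actions = 6 * 6           # 6 primitive actions per taxi, combined into one joint action
print(n_states, n_actions)  # 6144 36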
def play_random(env, num_episodes):
    """
    Play the episodes by taking random actions, capturing the frames so we can replay them later.
    """
    frames = []
    for i in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_action = env.action_space.sample()
            state, reward, done, _ = env.step(next_action)
            # 'ansi' render mode returns the rendering as text (check the repo's env for the exact return type)
            frames.append({'frame': env.render(mode='ansi'), 'action': next_action, 'reward': reward})
    return frames
# Trying the dumb agent
print_frames(play_random(env, num_episodes=2)) # check github for the code for print_frames
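The print_frames helper isn't reproduced here (see the GitHub repo for the actual implementation). As a rough, hypothetical stand-in, assuming each frame is a dict like the ones collected in play_random above, it could look something like this:

import time
from IPython.display import clear_output

def print_frames(frames, delay=0.1):
    """Replay the captured frames one by one (hypothetical stand-in for the repo's helper)."""
    for i, frame in enumerate(frames):
        clear_output(wait=True)  # keep a notebook's output cell tidy between frames
        print(frame['frame'])
        print(f"Timestep: {i + 1}, Action: {frame['action']}, Reward: {frame['reward']}")
        time.sleep(delay)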
For a given state, the higher the Q-value of a state-action pair, the higher the expected long-term reward of taking that particular action.
Formally, we write the Q-value as Q(s_t, a_t), where s_t is the state and a_t is the action at timestep t (the agent was in state s_t at this point in time). After every step, the Q-value is updated as per the Bellman equation:

Q(s_t, a_t) ← (1 - alpha) * Q(s_t, a_t) + alpha * (reward + gamma * max_a Q(s_{t+1}, a))

Here alpha is the learning rate and gamma is the discount factor.

So how do we apply Q-learning to the DualTaxi-v1 environment? Because we have two taxis in our environment, we can do it in a couple of ways: with a single Q-table over the joint action space of both taxis, or with two independent Q-tables, one per taxi. We'll try both, starting with the single-table version.

import numpy as np
from collections import Counter, deque
import random
def bellman_update(q_table, state, action, next_state, reward):
"""
Function to perform the q-value update as per bellman equation.
"""
# Get the old q_value
old_q_value = q_table[state, action]
# Find the maximum q_value for the actions in next state
next_max = np.max(q_table[next_state])
# Calculate the new q_value as per the equation
new_q_value = (1 - alpha) * old_q_value + alpha * (reward + gamma * next_max)
# Finally, update the q_value
q_table[state, action] = new_q_value
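# A quick worked example with made-up numbers: if alpha = 0.1, gamma = 0.7,
# old_q_value = 0, reward = -1 and next_max = 2, the new q-value is
# (1 - 0.1) * 0 + 0.1 * (-1 + 0.7 * 2) = 0.04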
def update(q_table, env, state):
    """
    Selects an action according to the epsilon-greedy method, performs it, and then calls bellman_update
    to update the Q-values.
    """
    if random.uniform(0, 1) < epsilon:
        # Explore: with probability epsilon, pick a random action
        action = env.action_space.sample()
    else:
        # Exploit: otherwise, pick the action with the highest q-value for this state
        action = np.argmax(q_table[state])
    next_state, reward, done, info = env.step(action)
    bellman_update(q_table, state, action, next_state, reward)
    return next_state, reward, done, info
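# Note: with epsilon = 0.2 (set further below), the agent explores with a random action
# roughly 20% of the time and exploits its current q-values the remaining 80% of the time.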
def train_agent(
q_table, env, num_episodes, log_every=50000, running_metrics_len=50000,
evaluate_every=1000, evaluate_trials=200):
"""
This is the training logic. It takes input as a q-table, the environment.
The training is done for num_episodes episodes. The results are logged periodically.
We also record some useful metrics like average reward in the last 50k timesteps, the average length of the last 50 episodes and so on. These are helpful to gauge how the algorithm is performing
over time.
After every few episodes of training. We run an evaluation routine, where we just "exploit", i.e. rely on the q-table so far and see how well the agent has learned so far. Over time, the results should get
better until the q-table starts converging, after which there's negligible change in the results.
"""
rewards = deque(maxlen=running_metrics_len)
episode_lengths = deque(maxlen=50)
total_timesteps = 0
metrics = {}
for i in range(num_episodes):
epochs = 0
state = env.reset()
num_penalties, reward= 0, 0
done = False
while not done:
state, reward, done, info = update(q_table, env, state)
rewards.append(reward)
epochs += 1
total_timesteps += 1
if total_timesteps % log_every == 0:
rd = Counter(rewards)
avg_ep_len = np.mean(episode_lengths)
zeroes, fill_percent = calculate_q_table_metrics(q_table)
print(f'Current Episode: {i}')
print(f'Reward distribution: {rd}')
                print(f'Last 50 episode lengths (avg: {avg_ep_len})')
print(f'{zeroes} Q table zeroes, {fill_percent} percent filled')
episode_lengths.append(epochs)
if i % evaluate_every == 0:
print('===' * 10)
print(f"Running evaluation after {i} episodes")
finish_percent, avg_time, penalties = evaluate_agent(q_table, env, evaluate_trials)
print('===' * 10)
rd = Counter(rewards)
avg_ep_len = float(np.mean(episode_lengths))
zeroes, fill_percent = calculate_q_table_metrics(q_table)
metrics[i] = {
'train_reward_distribution': rd,
'train_ep_len': avg_ep_len,
'fill_percent': fill_percent,
'test_finish_percent': finish_percent,
'test_ep_len': avg_time,
'test_penalties': penalties
}
print("Training finished.")
return q_table, metrics
def calculate_q_table_metrics(grid):
"""
This function counts what percentage of cells in the q-table is non-zero.
Note: Certain state-action combinations are illegal, so the table might never be full.
"""
r, c = grid.shape
total = r * c
count = 0
for row in grid:
for cell in row:
if cell == 0:
count += 1
fill_percent = (total - count) / total * 100.0
return count, fill_percent
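# Note: the same counts can be computed without explicit loops,
# e.g. count = int(np.count_nonzero(grid == 0)).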
def evaluate_agent(q_table, env, num_trials):
"""
The routine to evaluate an agent. It simply exploits the q-table and records the performance metrics.
"""
total_epochs, total_penalties, total_wins = 0, 0, 0
for _ in range(num_trials):
state = env.reset()
epochs, num_penalties, wins = 0, 0, 0
done = False
while not done:
next_action = np.argmax(q_table[state])
state, reward, done, _ = env.step(next_action)
if reward < -2:
num_penalties += 1
elif reward > 10:
wins += 1
epochs += 1
total_epochs += epochs
total_penalties += num_penalties
total_wins += wins
average_penalties, average_time, complete_percent = compute_evaluation_metrics(num_trials,total_epochs,total_penalties,total_wins)
print_evaluation_metrics(average_penalties,average_time,num_trials,total_wins)
return complete_percent, average_time, average_penalties
def print_evaluation_metrics(average_penalties, average_time, num_trials, total_wins):
print("Evaluation results after {} trials".format(num_trials))
print("Average time steps taken: {}".format(average_time))
print("Average number of penalties incurred: {}".format(average_penalties))
print(f"Had {total_wins} wins in {num_trials} episodes")
def compute_evaluation_metrics(num_trials, total_epochs, total_penalties, total_wins):
average_time = total_epochs / float(num_trials)
average_penalties = total_penalties / float(num_trials)
complete_percent = total_wins / num_trials * 100.0
return average_penalties, average_time, complete_percent
import numpy as np
# The hyper-parameters of Q-learning
alpha = 0.1    # learning rate
gamma = 0.7    # discount factor
epsilon = 0.2  # exploration rate
env = gym.make('DualTaxi-v1')
num_episodes = 50000
# Initialize a q-table full of zeroes
q_table = np.zeros([env.observation_space.n, env.action_space.n])
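# The table has one row per encoded state and one column per joint action, i.e. shape (6144, 36)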
q_table, metrics = train_agent(q_table, env, num_episodes) # Get back trained q-table and metrics
Total encoded states are 6144
==============================
Running evaluation after 0 episodes
Evaluation results after 200 trials
Average time steps taken: 1500.0
Average number of penalties incurred: 1500.0
Had 0 wins in 200 episodes
==============================
----------------------------
Skipping intermediate output
----------------------------
==============================
Running evaluation after 49000 episodes
Evaluation results after 200 trials
Average time steps taken: 210.315
Average number of penalties incurred: 208.585
Had 173 wins in 200 episodes
==============================
Current Episode: 49404
Reward distribution: Counter({-3: 15343, -12: 12055, -4: 11018, -11: 4143, -20: 3906, -30: 1266, -2: 1260, 99: 699, -10: 185, 90: 125})
Last 50 episode lengths (avg: 63.0)
48388 Q table zeroes, 78.12319155092592 percent filled
Training finished.
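A single Q-table over the joint action space grows quickly. Before switching to the multi-agent setup, here's a rough back-of-the-envelope comparison of how many Q-values have to be learned in each case (the per-taxi action count of 6 is an assumption consistent with the action encoding used below):

# Rough size comparison of the two setups
joint_entries = 6144 * 36        # one table over the joint action space -> 221,184 values
per_taxi_entries = 2 * 6144 * 6  # two independent per-taxi tables       ->  73,728 values
print(joint_entries, per_taxi_entries)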
In this second approach, each taxi learns its own Q-table of size state_space_size x sqrt(action_space_size), i.e. 6144 x 6, instead of sharing one big 6144 x 36 table.
def update_multi_agent(q_table1, q_table2, env, state):
    """
    Same as the update method discussed in the last section, just modified for two independent q-tables.
    """
    if random.uniform(0, 1) < epsilon:
        # Explore: pick a random joint action and decode it into the two taxis' actions
        action = env.action_space.sample()
        action1, action2 = env.decode_action(action)
    else:
        # Exploit: each taxi picks the best action according to its own q-table
        action1 = np.argmax(q_table1[state])
        action2 = np.argmax(q_table2[state])
        action = env.encode_action(action1, action2)
    next_state, reward, done, info = env.step(action)
    reward1, reward2 = reward
    # Each q-table is updated with its own action and its own reward
    bellman_update(q_table1, state, action1, next_state, reward1)
    bellman_update(q_table2, state, action2, next_state, reward2)
    return next_state, reward, done, info
def train_multi_agent(
q_table1, q_table2, env, num_episodes, log_every=50000, running_metrics_len=50000,
evaluate_every=1000, evaluate_trials=200):
"""
Same as the train method discussed in the last section, just modified for two independent q-tables.
"""
rewards = deque(maxlen=running_metrics_len)
episode_lengths = deque(maxlen=50)
total_timesteps = 0
metrics = {}
for i in range(num_episodes):
epochs = 0
state = env.reset()
done = False
while not done:
# Modification here
state, reward, done, info = update_multi_agent(q_table1, q_table2, env, state)
rewards.append(sum(reward))
epochs += 1
total_timesteps += 1
if total_timesteps % log_every == 0:
rd = Counter(rewards)
avg_ep_len = np.mean(episode_lengths)
zeroes1, fill_percent1 = calculate_q_table_metrics(q_table1)
zeroes2, fill_percent2 = calculate_q_table_metrics(q_table2)
print(f'Current Episode: {i}')
print(f'Reward distribution: {rd}')
                print(f'Last 50 episode lengths (avg: {avg_ep_len})')
print(f'{zeroes1} Q table 1 zeroes, {fill_percent1} percent filled')
print(f'{zeroes2} Q table 2 zeroes, {fill_percent2} percent filled')
episode_lengths.append(epochs)
if i % evaluate_every == 0:
print('===' * 10)
print(f"Running evaluation after {i} episodes")
finish_percent, avg_time, penalties = evaluate_multi_agent(q_table1, q_table2, env, evaluate_trials)
print('===' * 10)
rd = Counter(rewards)
avg_ep_len = float(np.mean(episode_lengths))
zeroes1, fill_percent1 = calculate_q_table_metrics(q_table1)
zeroes2, fill_percent2 = calculate_q_table_metrics(q_table2)
metrics[i] = {
'train_reward_distribution': rd,
'train_ep_len': avg_ep_len,
'fill_percent1': fill_percent1,
'fill_percent2': fill_percent2,
'test_finish_percent': finish_percent,
'test_ep_len': avg_time,
'test_penalties': penalties
}
print("Training finished.\n")
return q_table1, q_table2, metrics
def evaluate_multi_agent(q_table1, q_table2, env, num_trials):
"""
Same as evaluate method discussed in the last section, just modified for two independent q-tables.
"""
total_epochs, total_penalties, total_wins = 0, 0, 0
for _ in range(num_trials):
state = env.reset()
epochs, num_penalties, wins = 0, 0, 0
done = False
while not done:
# Modification here
next_action = env.encode_action(
np.argmax(q_table1[state]),
np.argmax(q_table2[state]))
state, reward, done, _ = env.step(next_action)
reward = sum(reward)
if reward < -2:
num_penalties += 1
elif reward > 10:
wins += 1
epochs += 1
total_epochs += epochs
total_penalties += num_penalties
total_wins += wins
average_penalties, average_time, complete_percent = compute_evaluation_metrics(num_trials,total_epochs,total_penalties,total_wins)
print_evaluation_metrics(average_penalties,average_time,num_trials,total_wins)
return complete_percent, average_time, average_penalties
# The hyperparameters of Q-learning
alpha = 0.1    # learning rate
gamma = 0.8    # discount factor
epsilon = 0.2  # exploration rate
env_c = gym.make('DualTaxi-v1', competitive=True)
num_episodes = 50000
q_table1 = np.zeros([env_c.observation_space.n, int(np.sqrt(env_c.action_space.n))])
q_table2 = np.zeros([env_c.observation_space.n, int(np.sqrt(env_c.action_space.n))])
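# Each table has shape (6144, 6): the same encoded states, but only that taxi's own 6 actions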
q_table1, q_table2, metrics_c = train_multi_agent(q_table1, q_table2, env_c, num_episodes)
Total encoded states are 6144
==============================
Running evaluation after 0 episodes
Evaluation results after 200 trials
Average time steps taken: 1500.0
Average number of penalties incurred: 1500.0
Had 0 wins in 200 episodes
==============================
----------------------------
Skipping intermediate output
----------------------------
==============================
Running evaluation after 48000 episodes
Evaluation results after 200 trials
Average time steps taken: 323.39
Average number of penalties incurred: 322.44
Had 158 wins in 200 episodes
==============================
Current Episode: 48445
Reward distribution: Counter({-12: 13993, -3: 12754, -4: 11561, -20: 3995, -11: 3972, -30: 1907, -10: 649, -2: 524, 90: 476, 99: 169})
Last 50 episode lengths (avg: 78.08)
8064 Q table 1 zeroes, 78.125 percent filled
8064 Q table 2 zeroes, 78.125 percent filled
==============================
Running evaluation after 49000 episodes
Evaluation results after 200 trials
Average time steps taken: 434.975
Average number of penalties incurred: 434.115
Had 143 wins in 200 episodes
==============================
Current Episode: 49063
Reward distribution: Counter({-3: 13928, -12: 13605, -4: 10286, -11: 4542, -20: 3917, -30: 1874, -10: 665, -2: 575, 90: 433, 99: 175})
Last 50 episode lengths (avg: 75.1)
8064 Q table 1 zeroes, 78.125 percent filled
8064 Q table 2 zeroes, 78.125 percent filled
Current Episode: 49706
Reward distribution: Counter({-12: 13870, -3: 13169, -4: 11054, -11: 4251, -20: 3985, -30: 1810, -10: 704, -2: 529, 90: 436, 99: 192})
Last 50 episode lengths (avg: 76.12)
8064 Q table 1 zeroes, 78.125 percent filled
8064 Q table 2 zeroes, 78.125 percent filled
Training finished.
from collections import defaultdict
import matplotlib.pyplot as plt
# import seaborn as sns  # optional, only needed for prettier plot styling
def plot_metrics(m):
"""
Plotting various metrics over the number of episodes.
"""
ep_nums = list(m.keys())
series = defaultdict(list)
for ep_num, metrics in m.items():
for metric_name, metric_val in metrics.items():
t = type(metric_val)
if t in [float, int, np.float64]:
series[metric_name].append(metric_val)
for m_name, values in series.items():
plt.plot(ep_nums, values)
plt.title(m_name)
plt.xlabel('Number of episodes')
plt.show()
def play(q_table, env, num_episodes):
    """Capture frames by playing greedily using the q-table."""
    frames = []
    for i in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_action = np.argmax(q_table[state])
            state, reward, done, _ = env.step(next_action)
            # 'ansi' render mode returns the rendering as text (check the repo's env for the exact return type)
            frames.append({'frame': env.render(mode='ansi'), 'action': next_action, 'reward': reward})
    return frames
def play_multi(q_table1, q_table2, env, num_episodes):
    """
    Capture frames by playing using the two q-tables.
    """
    frames = []
    for i in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_action = env.encode_action(
                np.argmax(q_table1[state]),
                np.argmax(q_table2[state]))
            state, reward, done, _ = env.step(next_action)
            frames.append({'frame': env.render(mode='ansi'), 'action': next_action, 'reward': reward})
    return frames
plot_metrics(metrics)
frames = play(q_table, env, 10)
print_frames(frames)
plot_metrics(metrics_c)
print_frames(play_multi(q_table1, q_table2, env_c, 10))
The environment exposes an env.P object, which contains a mapping of the form current_state: {action_taken: [(transition_prob, next_state, reward, done)]}. This is all the info we need to simulate the environment, and this is what we can use to create the transition table.

env.P  # First, let's take a peek at this object
{0: {
0: [(1.0, 0, -30, False)],
1: [(1.0, 1536, -0.5, True)],
2: [(1.0, 1560, -0.5, True)],
3: [(1.0, 1536, -0.5, True)],
4: [(1.0, 1536, -0.5, True)],
5: [(1.0, 1536, -0.5, True)],
6: [(1.0, 96, -0.5, True)],
7: [(1.0, 0, -30, False)],
8: [(1.0, 24, -0.5, True)],
9: [(1.0, 0, -30, False)],
10: [(1.0, 0, -30, False)],
11: [(1.0, 0, -30, False)],
12: [(1.0, 480, -0.5, True)],
13: [(1.0, 384, -0.5, True)],
14: [(1.0, 0, -30, False)],
15: [(1.0, 384, -0.5, True)],
16: [(1.0, 384, -0.5, True)],
17: [(1.0, 384, -0.5, True)],
18: [(1.0, 96, -0.5, True)],
19: [(1.0, 0, -30, False)],
20: [(1.0, 24, -0.5, True)],
21: [(1.0, 0, -30, False)],
22: [(1.0, 0, -30, False)],
23: [(1.0, 0, -30, False)],
24: [(1.0, 96, -0.5, True)],
25: [(1.0, 0, -30, False)],
26: [(1.0, 24, -0.5, True)],
27: [(1.0, 0, -30, False)],
28: [(1.0, 0, -30, False)],
29: [(1.0, 0, -30, False)],
30: [(1.0, 96, -0.5, True)],
31: [(1.0, 0, -30, False)],
32: [(1.0, 24, -0.5, True)],
33: [(1.0, 0, -30, False)],
34: [(1.0, 0, -30, False)],
35: [(1.0, 0, -30, False)]},
1: {0: [(1.0, 1, -30, False)],
1: [(1.0, 1537, -0.5, True)],
2: [(1.0, 1561, -0.5, True)],
3: [(1.0, 1537, -0.5, True)],
4: [(1.0, 1537, -0.5, True)],
5: [(1.0, 1537, -0.5, True)],
6: [(1.0, 97, -0.5, True)],
7: [(1.0, 1, -30, False)],
8: [(1.0, 25, -0.5, True)],
9: [(1.0, 1, -30, False)],
10: [(1.0, 1, -30, False)],
11: [(1.0, 1, -30, False)],
12: [(1.0, 481, -0.5, True)],
13: [(1.0, 385, -0.5, True)],
14: [(1.0, 1, -30, False)],
15: [(1.0, 385, -0.5, True)],
16: [(1.0, 385, -0.5, True)],
17: [(1.0, 385, -0.5, True)],
18: [(1.0, 97, -0.5, True)],
19: [(1.0, 1, -30, False)],
20: [(1.0, 25, -0.5, True)],
21: [(1.0, 1, -30, False)],
22: [(1.0, 1, -30, False)],
23: [(1.0, 1, -30, False)],
24: [(1.0, 97, -0.5, True)],
25: [(1.0, 1, -30, False)],
26: [(1.0, 25, -0.5, True)],
27: [(1.0, 1, -30, False)],
28: [(1.0, 1, -30, False)],
29: [(1.0, 1, -30, False)],
30: [(1.0, 97, -0.5, True)],
31: [(1.0, 1, -30, False)],
32: [(1.0, 25, -0.5, True)],
33: [(1.0, 1, -30, False)],
34: [(1.0, 1, -30, False)],
35: [(1.0, 1, -30, False)]},
# omitting the whole output because it's very long!
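If you only want to inspect a single entry rather than the whole mapping, you can index into it directly; for example, using the output shown above:

env.P[0][1]  # transition list for taking action 1 in state 0
# [(1.0, 1536, -0.5, True)]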
! pip install pandas
import pandas as pd
table = []
env_c = gym.make('DualTaxi-v1', competitive=True)
def state_to_human_readable(s):
passenger_loc = ['R', 'G', 'B', 'Y', 'T1', 'T2'][s[2]]
destination = ['R', 'G', 'B', 'Y'][s[3]]
return f'Taxi 1: {s[0]}, Taxi 2: {s[1]}, Pass: {passenger_loc}, Dest: {destination}'
def action_to_human_readable(a):
actions = 'NSEWPD'
return actions[a[0]], actions[a[1]]
for state_num, transition_info in env_c.P.items():
for action, possible_transitions in transition_info.items():
transition_prob, next_state, reward, done = possible_transitions[0]
        table.append({
            'State': state_to_human_readable(list(env_c.decode(state_num))),
            'Action': action_to_human_readable(env_c.decode_action(action)),
            'Probability': transition_prob,
            'Next State': state_to_human_readable(list(env_c.decode(next_state))),
            'Reward': reward,
            'Is over': done,
        })
pd.DataFrame(table)
 | State | Action | Probability | Next State | Reward | Is over |
---|---|---|---|---|---|---|
0 | Taxi 1: (0, 0), Taxi 2: (0, 0), Pass: R, Dest: R | (N, N) | 1.0 | Taxi 1: (0, 0), Taxi 2: (0, 0), Pass: R, Dest: R | (-15, -15) | False |
1 | Taxi 1: (0, 0), Taxi 2: (0, 0), Pass: R, Dest: R | (N, S) | 1.0 | Taxi 1: (1, 0), Taxi 2: (0, 0), Pass: R, Dest: R | (-0.5, 0) | True |
2 | Taxi 1: (0, 0), Taxi 2: (0, 0), Pass: R, Dest: R | (N, E) | 1.0 | Taxi 1: (1, 0), Taxi 2: (0, 1), Pass: R, Dest: R | (-0.5, 0) | True |
3 | Taxi 1: (0, 0), Taxi 2: (0, 0), Pass: R, Dest: R | (N, W) | 1.0 | Taxi 1: (1, 0), Taxi 2: (0, 0), Pass: R, Dest: R | (-0.5, 0) | True |
4 | Taxi 1: (0, 0), Taxi 2: (0, 0), Pass: R, Dest: R | (N, P) | 1.0 | Taxi 1: (1, 0), Taxi 2: (0, 0), Pass: R, Dest: R | (-0.5, 0) | True |
... | ... | ... | ... | ... | ... | ... |
221179 | Taxi 1: (3, 3), Taxi 2: (3, 3), Pass: T2, Dest: Y | (D, S) | 1.0 | Taxi 1: (3, 3), Taxi 2: (2, 3), Pass: T2, Dest: Y | (-0.5, 0) | True |
221180 | Taxi 1: (3, 3), Taxi 2: (3, 3), Pass: T2, Dest: Y | (D, E) | 1.0 | Taxi 1: (3, 3), Taxi 2: (3, 3), Pass: T2, Dest: Y | (-15, -15) | False |
221181 | Taxi 1: (3, 3), Taxi 2: (3, 3), Pass: T2, Dest: Y | (D, W) | 1.0 | Taxi 1: (3, 3), Taxi 2: (3, 2), Pass: T2, Dest: Y | (-0.5, 0) | True |
221182 | Taxi 1: (3, 3), Taxi 2: (3, 3), Pass: T2, Dest: Y | (D, P) | 1.0 | Taxi 1: (3, 3), Taxi 2: (3, 3), Pass: T2, Dest: Y | (-15, -15) | False |
221183 | Taxi 1: (3, 3), Taxi 2: (3, 3), Pass: T2, Dest: Y | (D, D) | 1.0 | Taxi 1: (3, 3), Taxi 2: (3, 3), Pass: T2, Dest: Y | (-15, -15) | False |
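As a sanity check, the transition table has one row per (state, action) pair:

6144 * 36  # = 221184 rows, matching the final index 221183 above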