# Optimal Path Detection With Reinforcement Learning

# Optimal Path Detection With Reinforcement Learning

In this article, design an agent that finds the optimum path through a given map using Reinforcement Learning.

Join the DZone community and get the full member experience.

Join For Free**Bias comes in a variety of forms, all of them potentially damaging to the efficacy of your ML algorithm. Read how Alegion's Chief Data Scientist discusses the source of most headlines about AI failures here.**

In this article, I will design an agent that finds the optimum path through a given map using Reinforcement Learning. I hope it becomes a useful article in the sense of awareness.

Reinforcement Learning (RL) is a machine learning technique that deals with the problems of finding the optimum actions that must be done in a given situation in order to maximize rewards.

This learning technique, which is inspired by behavioral psychology, is usually described as follows. An agent in any environment makes certain movements in this environment and gains rewards as a result of these movements. The main aim is to maximize the total reward and learn the optimal policy for the longest range.

The most commonly used applications of Reinforcement Learning are games. For example, in the 90's, Backgammon had done his learning with a Reinforcement Learning, which beat a human. Also recently, it performed its training with Reinforcement Learning (Google DeepMind's Alpha GO) on the machine that defeated the best GO player in the world. In addition, RL is a commonly used learning technique in Robotics, Control, Autonomous Vehicles, and many other industrial areas.

I will not explain the details of this technique in this article with the reason that Reinforcement Learning is a deep topic, but those who want to learn about this subject can start to learn the theory of this technique by watching the following video or making searches over the internet.

Now we can move on to a sample application development that will be the subject of this article.

**Q-Learning For Optimal Hiking**

We have a map like this, and I am looking for the answer to this question, starting from the starting point and getting the most total rewards to the finish point. However, in this map, the rewards won when passing from one cell to another cell vary according to different situations. As you can see from the map, there are plain and hill shaped cells, and the rewards of passing between these cells are given to us as follows.

Plain to Plain: -1

Hill to Hill: -2

Plain to Hill: -3

Hill to Plain: -1

As you can see, we have information about how the map is, and how many rewards we have won when moving between states. We will try to find out the most optimal policy by using the Q-Learning method. The basic rationale of Q-Learning is to find out how the agent can finish the journey with maximum rewards using experience on the environment. For this, it is aimed that the agent learns the environment in the best way by making different movements on the environment.

Now let's write a Reinforcement Learning code that will go to the end by maximizing the winning rewards in a given map and find out the optimal way and observe the results.

We will implement the code with MATLAB.

At the first stage, we are doing the map we want to solve and the necessary definitions.

```
%****************************************************
%Defining Maze Setup
goalmark = 4;
startmark = 2;
rewards = [-1 -2 -3 -1];
%****************************************************
%You can change maze whatever you want
%We need 1 start and 1 stop state.
maze =[ 1 0 0 1 1
0 2 0 1 0
0 0 0 0 0
0 0 1 1 0
1 0 1 1 0
1 1 1 1 4];
%****************************************************
actUp=3;
actDown=4;
actLeft=1;
actRight=2;
ACTIONS = [actLeft, actRight, actUp, actDown];
mazeSize = size(maze);
mazeRow = mazeSize(1);
mazeColumn = mazeSize(2);
%****************************************************
```

Now, we create the inter-state reward matrix.

```
%****************************************************
%Construct Reward Matrix
rewardMat = zeros(mazeRow*mazeColumn,length(ACTIONS));
transitMat = zeros(mazeRow*mazeColumn,length(ACTIONS));
stateIterator=0;
rew = 0;
rewforGoal =10;
for i=1:mazeRow
for j=1:mazeColumn
stateIterator = (i-1)*mazeColumn+j; % Defining Current State
%****************************************************
%Up
if(i-1>=1)
if(maze(i-1,j)==0 && maze(i,j)==0) % plain plain
rew = -1;
end
if(maze(i-1,j)==0 && maze(i,j)==1) % hill plain
rew = -1;
end
if(maze(i-1,j)==1 && maze(i,j)==0) % plain hill
rew = -3;
end
if(maze(i-1,j)==1 && maze(i,j)==1) % hill-hill
rew = -2;
end
if maze(i-1,j)==goalmark
rewardMat(stateIterator,actUp)= rewforGoal;
else
rewardMat(stateIterator,actUp)= rew;
end
transitMat(stateIterator,actUp) = stateIterator-(abs(mazeColumn-j)+abs(1-j)+1);
end
%****************************************************
%Down
if(i+1<=mazeRow) if(maze(i+1,j)==0 && maze(i,j)==0) % plain plain rew = -1; end if(maze(i+1,j)==0 && maze(i,j)==1) % hill plain rew = -1; end if(maze(i+1,j)==1 && maze(i,j)==0) % plain hill rew = -3; end if(maze(i+1,j)==1 && maze(i,j)==1) % hill-hill rew = -2; end if maze(i+1,j)==goalmark rewardMat(stateIterator,actDown)= rewforGoal; %10 else rewardMat(stateIterator,actDown)= rew; end transitMat(stateIterator,actDown) = stateIterator+(abs(mazeColumn-j)+abs(1-j)+1); end %**************************************************** %Left if(j-1>0)
if(maze(i,j-1)==0 && maze(i,j)==0) % plain plain
rew = -1;
end
if(maze(i,j-1)==0 && maze(i,j)==1) % hill plain
rew = -1;
end
if(maze(i,j-1)==1 && maze(i,j)==0) % plain hill
rew = -3;
end
if(maze(i,j-1)==1 && maze(i,j)==1) % hill-hill
rew = -2;
end
if maze(i,j-1)==goalmark
rewardMat(stateIterator,actLeft)= rewforGoal; %10
else
rewardMat(stateIterator,actLeft)= rew;
end
transitMat(stateIterator,actLeft) = stateIterator-1;
end
%****************************************************
%Right
if(j+1<=mazeColumn)
if(maze(i,j+1)==0 && maze(i,j)==0) % plain plain
rew = -1;
end
if(maze(i,j+1)==0 && maze(i,j)==1) % hill plain
rew = -1;
end
if(maze(i,j+1)==1 && maze(i,j)==0) % plain hill
rew = -3;
end
if(maze(i,j+1)==1 && maze(i,j)==1) % hill-hill
rew = -2;
end
if maze(i,j+1)==goalmark
rewardMat(stateIterator,actRight)= rewforGoal; %10
else
rewardMat(stateIterator,actRight)= rew;
end
transitMat(stateIterator,actRight) = stateIterator+1;
end
%****************************************************
end
end
```

Now we will use it in the code and come to define other parameters.

```
%**************************************
%Finding Goal and Start states positions
[sx,sy] = find(maze==startmark);
startState = (sx-1)*mazeColumn+sy;
[gx,gy] = find(maze==goalmark);
goalState = (gx-1)*mazeColumn+gy;
%**************************************
%Initializing Parameteters
Q = transitMat*0;
Q(goalState,1)= 10;
gamma = 1;
alpha = 0.5;
randomProb = 0.25;
tx = sx;
ty = sy;
V = zeros(mazeColumn*mazeRow,1);
episode=30000;
stepSize=20;
rewforGoal =10;
%**************************************
```

Next came the coding part of learning (Q-Learning).

```
%**************************************
%Start Q Learning
for it=1:episode
tx = sx;
ty = sy;
next=0;
for step=1:stepSize
next = 0;
%*************************************
%Defining Current Position and fing possible actions
curr = (tx-1)*mazeColumn+ty;
actionOptions = Q(curr,:);
%*************************************
%*************************************
% Finding Next Action with using random probability
while next==0
%choosing random action
if rand < randomProb
randAction = randperm(length(actionOptions));
index = randAction(1);
else
[value, index] = max(actionOptions);
end
next = transitMat(curr, index);
end
%*************************************
%*************************************
% Get reward for current position
rew = rewardMat(curr,index);
%*************************************
%*************************************
% Q Learning Equation
if(curr~=goalState)
Q(curr,index) = Q(curr,index) + alpha*( (rew+gamma*max(Q(next,:))) - (Q(curr,index)));
end
%*************************************
%*************************************
% Finding Next Step Position
sect = fix((next/mazeColumn));
if sect == 0
sect =1;
end
rest = mod(next,mazeColumn);
if next==mazeColumn*mazeRow
tx = sect;
ty = mazeColumn;
else
if ty == 0
tx = sect;
ty = mazeColumn;
else
tx = sect+1;
ty = rest;
end
end
%*************************************
end
end
%*****************************************
```

Now, visualize the algorithm that we wrote.

```
%****************PLOT*********************
%***********************************************
% get the figure and axes handles
hFig = gcf;
hAx = gca;
% set the figure to full screen
set(hFig,'units','normalized','outerposition',[0 0 1 1]);
%***********************************************
%***********************************************
%Plot V Values using Q Matrix
subplot(1,3,1);
V = max(Q,[],2);
V = reshape(V,mazeRow,mazeColumn);
%V(goalState) = 10;
imagesc(V);
title('V Values');
hold on;
for i=1:mazeRow
for j=1:mazeColumn
cell = (i-1)*mazeColumn+j;
content = sprintf('%2.2f',V(cell));
labels = text(j-0.20,i, content);
set(labels);
end
end
%***********************************************
%***********************************************
%Plot Q Values
subplot(1,3,2);
imagesc(maze);
title('Q Values');
hold on;
for i=1:mazeRow
for j=1:mazeColumn
cell = (i-1)*mazeColumn+j;
label1 = text(j-0.15,i-0.21, sprintf('%2.2f',Q(cell,actUp)));
set(label1,'FontSize',8);
label2 = text(j-0.15,i+0.22, sprintf('%2.2f',Q(cell,actDown)));
set(label2,'FontSize',8);
label3 = text(j-0.40,i, sprintf('%2.2f',Q(cell,actLeft)));
set(label3,'FontSize',8);
label4 = text(j+0.16,i, sprintf('%2.2f',Q(cell,actRight)));
set(label4,'FontSize',8);
end
end
%***********************************************
%***********************************************
%Plot Optimal Path
subplot(1,3,3);
imagesc(maze);
title('Optimal Path');
hold on;
curr = startState;
tx = sx;
ty = sy;
QPr = Q;
QPr(QPr==0)=-1000;
while curr~=goalState
curr = (tx-1)*mazeColumn+ty;
if curr==startState
labelN = text(ty-0.40,tx-0.4, sprintf('START'));
set(labelN,'FontSize',15);
elseif curr==goalState
labelN = text(ty-0.35,tx, sprintf('GOAL'));
set(labelN,'FontSize',15);
continue;
end
actionOptions = QPr(curr,:);
[value, index] = max(actionOptions);
if index == actLeft
labelN = text(ty-0.2,tx, sprintf('L'));
set(labelN,'FontSize',25);
ty = ty-1;
end
if index == actRight
labelN = text(ty-0.2,tx, sprintf('R'));
set(labelN,'FontSize',25);
ty = ty+1;
end
if index == actUp
labelN = text(ty-0.2,tx, sprintf('U'));
set(labelN,'FontSize',25);
tx = tx -1;
end
if index == actDown
labelN = text(ty-0.2,tx, sprintf('D'));
set(labelN,'FontSize',25);
tx = tx +1;
end
end
%***********************************************
```

Now let's try and test the code that we wrote.

First I give a map like the one below.

maze =[ 1 0 2 1 1

0 0 0 1 0

0 0 0 0 0

0 0 1 1 0

1 4 1 1 0

1 1 1 1 1 ];

0: Plain

1: Hill

2: Starting Point

4: Finish

**Parameters:**

**Gamma:**0.7

**Alpha: **0.5

**Episode: **30000

**Results:**

We have 3 different matrices as output. The first one is the V values, the second is the Q values, and the third is the optimal path.

**Your machine learning project needs enormous amounts of training data to get to a production-ready confidence level. Get a checklist approach to assembling the combination of technology, workforce and project management skills you’ll need to prepare your own training data.**

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

{{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}