Optimal Path Detection With Reinforcement Learning


In this article, we design an agent that finds the optimal path through a given map using Reinforcement Learning.


In this article, I will design an agent that finds the optimal path through a given map using Reinforcement Learning. I hope you find it a useful introduction to the topic.

Reinforcement Learning (RL) is a machine learning technique that deals with finding the optimal actions to take in a given situation in order to maximize reward.

This learning technique, which is inspired by behavioral psychology, is usually described as follows: an agent in an environment makes certain moves and earns rewards as a result of those moves. The main aim is to maximize the total reward, that is, to learn the policy that yields the highest long-run return.
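
In formal terms, the agent's objective is usually written as maximizing the expected discounted return (this is the standard textbook formulation rather than anything specific to this article's example):

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

where \gamma \in [0, 1] is the discount factor that controls how strongly future rewards count toward the total.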

The most widely known applications of Reinforcement Learning are games. For example, in the 1990s, a backgammon program trained with Reinforcement Learning learned to beat strong human players. More recently, Google DeepMind's AlphaGo, which defeated the best Go players in the world, was also trained with Reinforcement Learning. In addition, RL is a commonly used technique in robotics, control, autonomous vehicles, and many other industrial areas.

Reinforcement Learning is a deep topic, so I will not explain its details in this article; readers who want to go further can start learning the theory from the many introductions available online.

Now we can move on to the sample application that is the subject of this article.

Q-Learning For Optimal Hiking

We have a map containing plain and hill cells, and we are looking for the path from the start point to the finish point that collects the greatest total reward. The reward earned when passing from one cell to another depends on the terrain, and is given as follows:

Plain to Plain: -1

Hill to Hill: -2

Plain to Hill: -3

Hill to Plain: -1

So we know the layout of the map and the reward earned for each move between states. We will try to find the optimal policy using the Q-Learning method. The basic idea of Q-Learning is for the agent to learn, from experience in the environment, how to complete the journey with the maximum total reward. To do this, the agent explores the environment by trying out different moves.
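
The heart of the method is the Q-Learning update rule, which is the same equation applied inside the training loop later in this article (s is the current state, a the chosen action, r the reward received, and s' the resulting state):

Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma \max_{a'} Q(s', a') - Q(s, a) ]

Here, \alpha is the learning rate and \gamma is the discount factor.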

Now let's write the Reinforcement Learning code that learns to reach the goal on a given map while maximizing the collected reward, extracts the optimal path, and lets us observe the results.

We will implement the code with MATLAB.

In the first stage, we set up the map we want to solve and make the necessary definitions.

%****************************************************
%Defining Maze Setup
goalmark = 4;
startmark = 2;
rewards = [-1 -2 -3 -1];  % [plain->plain, hill->hill, plain->hill, hill->plain], listed for reference


%****************************************************
%You can change the maze to whatever you want.
%We need one start and one goal state.
maze =[  1 0 0 1 1
         0 2 0 1 0
         0 0 0 0 0
         0 0 1 1 0
         1 0 1 1 0
         1 1 1 1 4 ];
%****************************************************

actUp=3;
actDown=4;
actLeft=1;
actRight=2;
ACTIONS = [actLeft, actRight, actUp, actDown];

mazeSize = size(maze);
mazeRow = mazeSize(1);
mazeColumn = mazeSize(2);
%****************************************************
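
Throughout the script, every (row, column) cell of the maze is mapped to a single state number by row-major flattening. The short sketch below only illustrates that convention; the anonymous helper function is mine and does not appear in the original script, which computes the same expression inline.

% Illustrative only: row/column to linear state index, as used throughout the script.
rowColToState = @(i, j) (i-1)*mazeColumn + j;

% Example: the start cell of the maze above is at row 2, column 2,
% so its state index is (2-1)*5 + 2 = 7.
exampleState = rowColToState(2, 2);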

Now we create the reward matrix for transitions between states, along with a transition matrix that records which state each action leads to.

%****************************************************
%Construct Reward Matrix
rewardMat =  zeros(mazeRow*mazeColumn,length(ACTIONS));
transitMat = zeros(mazeRow*mazeColumn,length(ACTIONS));
stateIterator=0;
rew = 0;
rewforGoal =10;
for i=1:mazeRow
    for j=1:mazeColumn
        stateIterator = (i-1)*mazeColumn+j;     % Defining Current State

        %****************************************************
        %Up
        if(i-1>=1)
            if(maze(i-1,j)==0 && maze(i,j)==0)  % plain plain
                rew = -1;
            end
            if(maze(i-1,j)==0 && maze(i,j)==1)  % hill plain
                rew = -1;
            end
            if(maze(i-1,j)==1 && maze(i,j)==0)  % plain hill
                rew = -3;
            end
            if(maze(i-1,j)==1 && maze(i,j)==1)  % hill-hill
                rew = -2;
            end

            if maze(i-1,j)==goalmark
                rewardMat(stateIterator,actUp)= rewforGoal; 
            else
                rewardMat(stateIterator,actUp)= rew;
            end

            transitMat(stateIterator,actUp) = stateIterator-(abs(mazeColumn-j)+abs(1-j)+1);
        end

        %****************************************************
        %Down
        if(i+1<=mazeRow)
            if(maze(i+1,j)==0 && maze(i,j)==0)  % plain plain
                rew = -1;
            end
            if(maze(i+1,j)==0 && maze(i,j)==1)  % hill plain
                rew = -1;
            end
            if(maze(i+1,j)==1 && maze(i,j)==0)  % plain hill
                rew = -3;
            end
            if(maze(i+1,j)==1 && maze(i,j)==1)  % hill-hill
                rew = -2;
            end

            if maze(i+1,j)==goalmark
                rewardMat(stateIterator,actDown)= rewforGoal;  %10
            else
                rewardMat(stateIterator,actDown)= rew;
            end

            transitMat(stateIterator,actDown) = stateIterator+(abs(mazeColumn-j)+abs(1-j)+1);
        end

        %****************************************************
        %Left
        if(j-1>0)
            if(maze(i,j-1)==0 && maze(i,j)==0) % plain plain
                rew = -1;
            end
            if(maze(i,j-1)==0 && maze(i,j)==1) % hill plain
                rew = -1;
            end
            if(maze(i,j-1)==1 && maze(i,j)==0) % plain hill
                rew = -3;
            end
            if(maze(i,j-1)==1 && maze(i,j)==1) % hill-hill
                rew = -2;
            end

            if maze(i,j-1)==goalmark
                rewardMat(stateIterator,actLeft)= rewforGoal;  %10
            else
                rewardMat(stateIterator,actLeft)= rew;
            end

            transitMat(stateIterator,actLeft) = stateIterator-1;
        end

        %****************************************************
        %Right
        if(j+1<=mazeColumn)
            if(maze(i,j+1)==0 && maze(i,j)==0) % plain plain
                rew = -1;
            end
            if(maze(i,j+1)==0 && maze(i,j)==1) % hill plain
                rew = -1;
            end
            if(maze(i,j+1)==1 && maze(i,j)==0) % plain hill
                rew = -3;
            end
            if(maze(i,j+1)==1 && maze(i,j)==1) % hill-hill
                rew = -2;
            end


            if maze(i,j+1)==goalmark
                rewardMat(stateIterator,actRight)= rewforGoal;  %10
            else
                rewardMat(stateIterator,actRight)= rew;
            end

            transitMat(stateIterator,actRight) = stateIterator+1;
        end
        %****************************************************
    end
end
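
After these loops, each row of rewardMat holds the reward for taking each of the four actions from the corresponding state, and the matching row of transitMat holds the state that action leads to (a value of 0 means the move would leave the grid and is therefore unavailable). A quick sanity check, not part of the original script, is to print the rows for a single state, for example state 7 (row 2, column 2):

% Illustrative check only: inspect the actions available from state 7.
s = 7;
disp(rewardMat(s, :));   % rewards for [left, right, up, down]
disp(transitMat(s, :));  % next states for [left, right, up, down]; 0 = off the grid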

Next, we locate the start and goal states in the maze and define the remaining parameters.

%**************************************
%Finding Goal and Start states positions
[sx,sy] = find(maze==startmark);
startState = (sx-1)*mazeColumn+sy;

[gx,gy] = find(maze==goalmark);
goalState = (gx-1)*mazeColumn+gy;

%**************************************
%Initializing Parameteters
Q = transitMat*0;                % Q table: one row per state, one column per action
Q(goalState,1)= 10;
gamma = 1;                       % discount factor
alpha = 0.5;                     % learning rate
randomProb = 0.25;               % exploration probability (epsilon-greedy)
tx = sx;                         % current row of the agent
ty = sy;                         % current column of the agent
V = zeros(mazeColumn*mazeRow,1);
episode=30000;                   % number of training episodes
stepSize=20;                     % maximum steps per episode
rewforGoal =10;                  % reward for reaching the goal state
%**************************************

Next comes the learning part of the code (Q-Learning).

%**************************************
%Start Q Learning
for it=1:episode
    tx = sx;
    ty = sy;
    next=0;
    for step=1:stepSize
        next = 0;

        %*************************************
        %Defining Current Position and finding possible actions
        curr =  (tx-1)*mazeColumn+ty;
        actionOptions = Q(curr,:);
        %*************************************

        %*************************************
        % Finding Next Action with using random probability
        while next==0
            %choosing random action
            if rand < randomProb
                randAction = randperm(length(actionOptions));
                index = randAction(1);
            else
                [value, index] = max(actionOptions);
            end
            next = transitMat(curr, index);
        end
        %*************************************

        %*************************************
        % Get reward for current position
        rew = rewardMat(curr,index);
        %*************************************


        %*************************************
        % Q Learning Equation
        if(curr~=goalState)
            Q(curr,index) = Q(curr,index) + alpha*( (rew+gamma*max(Q(next,:))) - (Q(curr,index)));
        end
        %*************************************

        %*************************************
        % Finding Next Step Position (decode the linear state index back to row/column)
        tx = ceil(next/mazeColumn);
        ty = next - (tx-1)*mazeColumn;
        %*************************************

    end
end
%*****************************************
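
Once training is done, both the value of each state and the greedy policy can be read straight out of the learned Q matrix. The plotting code below derives V in exactly this way; the two lines here (with my own name for the action index) just make the idea explicit:

% Illustrative only: state values and greedy actions from the learned Q matrix.
V = max(Q, [], 2);                  % value of each state = value of its best action
[~, greedyAction] = max(Q, [], 2);  % index of the best action per state (1=L, 2=R, 3=U, 4=D)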

Now, let's visualize the results of the algorithm we wrote.

%****************PLOT*********************


%***********************************************
 % get the figure and axes handles
 hFig = gcf;
 hAx  = gca;
 % set the figure to full screen
 set(hFig,'units','normalized','outerposition',[0 0 1 1]);
%***********************************************


%***********************************************
%Plot V Values using Q Matrix
subplot(1,3,1);
V = max(Q,[],2);
V = reshape(V,mazeRow,mazeColumn);
%V(goalState) = 10;
imagesc(V); 
title('V Values');
hold on;
for i=1:mazeRow
    for j=1:mazeColumn
        cell = (i-1)*mazeColumn+j;
        content = sprintf('%2.2f',V(cell));
        labels = text(j-0.20,i, content); 
        set(labels);
    end
end
%***********************************************


%***********************************************
%Plot Q Values
subplot(1,3,2);
imagesc(maze);
title('Q Values');
hold on;
for i=1:mazeRow
    for j=1:mazeColumn
        cell = (i-1)*mazeColumn+j;

        label1 = text(j-0.15,i-0.21, sprintf('%2.2f',Q(cell,actUp))); 
        set(label1,'FontSize',8);

        label2 = text(j-0.15,i+0.22, sprintf('%2.2f',Q(cell,actDown))); 
        set(label2,'FontSize',8);

        label3 = text(j-0.40,i, sprintf('%2.2f',Q(cell,actLeft))); 
        set(label3,'FontSize',8);

        label4 = text(j+0.16,i, sprintf('%2.2f',Q(cell,actRight))); 
        set(label4,'FontSize',8);        
    end
end
%***********************************************


%***********************************************
%Plot Optimal Path
subplot(1,3,3);
imagesc(maze);
title('Optimal Path');
hold on;
curr = startState;
tx = sx;
ty = sy;
QPr = Q;
QPr(QPr==0)=-1000;
while curr~=goalState
    curr =  (tx-1)*mazeColumn+ty;

    if curr==startState
        labelN = text(ty-0.40,tx-0.4, sprintf('START'));
        set(labelN,'FontSize',15);
    elseif curr==goalState
        labelN = text(ty-0.35,tx, sprintf('GOAL'));
        set(labelN,'FontSize',15);
        continue;
    end

    actionOptions = QPr(curr,:);    
    [value, index] = max(actionOptions);        

    if index == actLeft
        labelN = text(ty-0.2,tx, sprintf('L'));
        set(labelN,'FontSize',25);
        ty = ty-1;
    end

    if index == actRight
        labelN = text(ty-0.2,tx, sprintf('R'));
        set(labelN,'FontSize',25);
        ty = ty+1;
    end

    if index == actUp
        labelN = text(ty-0.2,tx, sprintf('U'));
        set(labelN,'FontSize',25);
        tx = tx -1;
    end

    if index == actDown
        labelN = text(ty-0.2,tx, sprintf('D'));
        set(labelN,'FontSize',25);
        tx = tx +1;
    end
end
%***********************************************

Now let's test the code we wrote.

First, I give it the map below.

maze =[  1 0 2 1 1
         0 0 0 1 0
         0 0 0 0 0
         0 0 1 1 0
         1 4 1 1 0
         1 1 1 1 1 ];

0: Plain
1: Hill
2: Starting Point
4: Finish

Parameters:

Gamma: 0.7

Alpha: 0.5

Episode: 30000

Results:

We have three different outputs: the first is the V values, the second is the Q values, and the third is the optimal path.

