Learning Automata Applet




Instructions

This Java applet demonstrates discrete learning automata. The learning automaton selects an action at random, based on each action's own internal probability, and the selected action returns either a reward or a penalty. The red action is the optimal action and returns a reward at least 50% of the time. The aim of the learning automaton is to identify this action from among all the others in as few iterations as possible. You can control the number of iterations between screen updates, the probability update rule that is used, and the number of actions being searched. The learning rates for these methods are pre-set, however, so you may not consider this a fair comparison of the algorithms. The algorithms are also available as a Matlab toolbox - see below.

Learning automata are finite-state adaptive systems that interact iteratively with a general environment. Through a probabilistic trial-and-error process they try to select the action that produces the best response. Think of the learning automata model as consisting of two components, an automaton and an environment. The learning cycle begins with the automaton generating an action that is input to the environment. The environment receives and evaluates the action, returning a feedback signal to the automaton that represents the quality of that action in the environment. This signal reflects the degree of success the action had and is used to alter the automaton's structure so as to improve its action-selection strategy. Actions are selected probabilistically: internal to the learning automaton is a probability distribution over the actions that is used for action selection. The probability update rule is the heart of the learning automaton, and many different learning rules have been developed to improve the convergence speed and accuracy of these systems.
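
As a rough illustration of this cycle, here is a minimal Java sketch using the Linear Reward-Inaction (LRI) rule listed below. This is not the applet's source code; the class name, method names and reward probabilities are assumptions made for the example.

    import java.util.Arrays;
    import java.util.Random;

    public class LearningAutomatonSketch {
        static final Random rng = new Random();

        // Environment: action i returns a reward with probability rewardProb[i].
        // The "red" (optimal) action is simply the one with the highest probability.
        static boolean environment(int action, double[] rewardProb) {
            return rng.nextDouble() < rewardProb[action];
        }

        // Automaton: sample an action from the internal probability vector p.
        static int selectAction(double[] p) {
            double u = rng.nextDouble();
            double cumulative = 0.0;
            for (int i = 0; i < p.length; i++) {
                cumulative += p[i];
                if (u < cumulative) return i;
            }
            return p.length - 1; // guard against floating-point rounding
        }

        // Linear Reward-Inaction (LRI): on a reward, move probability mass toward
        // the chosen action; on a penalty, leave the distribution unchanged.
        static void updateLRI(double[] p, int chosen, boolean reward, double theta) {
            if (!reward) return;
            for (int i = 0; i < p.length; i++) {
                p[i] += (i == chosen) ? theta * (1.0 - p[i]) : -theta * p[i];
            }
        }

        public static void main(String[] args) {
            double[] rewardProb = {0.2, 0.3, 0.6, 0.4}; // assumed values; action 2 is optimal
            double[] p = {0.25, 0.25, 0.25, 0.25};      // uniform initial distribution
            for (int iter = 0; iter < 10000; iter++) {
                int a = selectAction(p);
                updateLRI(p, a, environment(a, rewardProb), 0.01); // theta = 0.01 as in the demo
            }
            System.out.println(Arrays.toString(p)); // p should concentrate on action 2
        }
    }

On a reward the chosen action's probability is pushed toward one while the others shrink proportionally; on a penalty the distribution is left unchanged, which is what distinguishes LRI from the reward-penalty rules.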

The learning rules available in this demo are:

Linear Reward-Inaction LRI (theta = 0.01)
Linear Reward-Penalty LRP (theta = 0.001) - sketched after the list

The following rules also use times-selected/times-updated vectors that are initialized by trying each action ten times:

Reward epsilon Penalty (theta1 = 0.01; theta2 = 0.002)
TSE estimator learning rule (lambda = 0.01)
Pursuit (lambda = 0.005)
Discrete Pursuit Automata DPA (lambda = 0.2; n = 10000)
DTSE (theta = 5; n = 100)
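
For contrast with LRI, here is a sketch of the Linear Reward-Penalty (LRP) update from the list above. It is written as a drop-in replacement for updateLRI in the earlier sketch and is likewise an illustration rather than the applet's actual code; in this symmetric form the same theta is used for both the reward and the penalty step.

    // Linear Reward-Penalty (LRP): on a reward the chosen action gains probability
    // mass exactly as in LRI; on a penalty the chosen action loses mass, which is
    // shared equally among the remaining actions so the vector still sums to one.
    static void updateLRP(double[] p, int chosen, boolean reward, double theta) {
        int r = p.length; // number of actions
        for (int i = 0; i < r; i++) {
            if (reward) {
                p[i] += (i == chosen) ? theta * (1.0 - p[i]) : -theta * p[i];
            } else {
                p[i] += (i == chosen) ? -theta * p[i] : theta * (1.0 / (r - 1) - p[i]);
            }
        }
    }

The estimator rules (TSE, Pursuit, DPA, DTSE) additionally use the times-selected/times-updated counts described above to form a reward estimate for each action, and are not sketched here.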

For further information on learning automata, please refer to my publications or contact me at mnwhowell@gmail.com.