# 程序代写代做代考 chain algorithm 1. You MUST answer this question.

1. You MUST answer this question.

(a) For what types of decision problems would the application of reinforcement

learning methods be preferable over other methods? [3 marks ]

(b) Explain the difference between value and reward signal. What properties are

desirable for the choice of a reward signal when setting-up a reinforcement

learning algorithm? [3 marks ]

(c) Why is the ergodicity of a Markov chain an important property in the con-

text of reinforcement learning? [4 marks ]

(d) Explain the concept of regret in the context of performance evaluation of

on-line algorithms. Does the ε-greedy method achieve zero regret? If so,

why? If not, why not? [3 marks ]

(e) How can one achieve sufficient exploration of states and actions in a rein-

forcement learning problem? [3 marks ]

(f) Using eligibility traces can speed-up learning. Give an example and briefly

explain your answer, also mentioning potential risks. [3 marks ]

(g) Explain how the stationary solution of a well-behaved Markovian decision

problem can be calculated when the transition probabilities and the starting

state are known. How can this solution be used to compare a number of

applicable control policies with respect to a given cost function? [3 marks ]

(h) Consider a one-dimensional track of discrete states which are numbered from

1 to N . Rewards are always zero, except for state 1 where a reward of r = 1

is given. The discount factor is γ. Assume the agent is currently in state k

(1 < k < N) and can move either to k + 1 or to k − 1. How do the values
of the two reachable states differ? [3 marks ]
Page 1 of 3
2. (a) What is a state in a reinforcement learning problem and what is the reason
the states are sometimes not fully observable? What is a belief state in the
context of Partially Observable Markov Decision Processes (POMDP)? [3 marks ]
(b) Classical reinforcement algorithms aim at maximising the mean discounted
reward over an unlimited future. In POMDPs, however, often a finite time
horizon is considered. Why is this so? [2 marks ]
(c) What do you understand by the term particle filter? Explain — with equa-
tions — how this can be used to define an approximate solution method for
POMDPs. [3 marks ]
(d) In principle, partial observability considerably increases the complexity of
the learning algorithms. Explain two ways (different from question (c)) to
reduce the complexity of POMDPs. [4 marks ]
(e) How do on- and off-policy methods differ at the algorithmic level? What is
the key requirement for off-policy methods in terms of policy architecture? [2 marks ]
(f) How does a semi-Markov Decision Process (SMDP) differ from an MDP?
Explain, using expressions for reward, state transition law and cost criterion
for the example of a robot navigating in a large building, why SMDPs can
be beneficial. If the robot receives information about its position, but does
not have access to a map of the building, would it still be able to realise
these benefits? [4 marks ]
(g) Describe an application of inverse reinforcement learning in robotics. Pro-
vide details of an algorithm for the problem by a graphical representation
or by relevant equations. [4 marks ]
(h) Give an example where a stochastic policy is better than a deterministic
policy even after perfect exploration? [3 marks ]
Page 2 of 3
3. Connect Four (according to wikipedia) is a game in which two players first choose
a colour and then take turns dropping a disc of their respective colour from the
top into a seven-column, six-row grid. The pieces fall straight down, occupying
the next available space within the column. The objective of the game is to
connect four of one’s own discs next to each other vertically, horizontally, or
diagonally before your opponent.
Consider the problem of training an artificial player using reinforcement learning.
(a) How would you design a reward signal for this problem? How would you
apply the algorithm and organise the training? [2 marks ]
(b) While the number of actions is quite small, the state space is rather large:
Each of the 42 grid positions can be empty or occupied by one of the colours,
which leads to an upper bound of 342 states, but not all of these states
actually occur. Would it still make sense to use a look-up table for the
representation of the value function? [2 marks ]
(c) Assuming that the answer for (b) is yes, what reinforcement learning algo-
rithm would you choose? Describe the main steps of the algorithm. How
would you initialise and control its parameters? [6 marks ]
(d) Assuming that the answer of (b) is no, explain two ways how the algo-
rithms can be adapted to be applicable to the problem? Which way would
you choose? How would you initialise and control the parameters of the
algorithm? [7 marks ]
(e) If only a few thousands of games have been played for training, which im-
plementation, (c) or (d), can be expected to perform better against a skilled
human player? [2 marks ]
(f) How does the two-player problem differ from the standard reinforcement
learning problem? Extending upon this, how does a multi-agent reinforce-
ment learning problem differ from its single-agent counterpart? Explain
what aspects of the problem formulation change owing to the introduction
of additional agents? Outline a learning procedure that works in the pres-
ence of multiple agents. [6 marks ]
Page 3 of 3