1. You MUST answer this question.

(a) Consider a bandit problem with two arms. It is known that one of the arms
leads to rewards that are uniformly distributed in the interval [0, 1],
while for the other one rewards in [0, 2] are possible. How many exploratory
actions will you need to take on average in order to identify which of the two
arms has the higher average reward? In the above example, give a value
for the optimistic initialisation of the reward estimate which is sufficient to
identify the optimal solution in at least 80% of all cases. [6 marks]
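
The setting in part (a) can be explored empirically. The following is a minimal simulation sketch, not a model answer: it assumes that the second arm is also uniformly distributed (on [0, 2]), that the agent acts purely greedily with random tie-breaking, and that estimates are updated with a constant step size `alpha`. The function name `fraction_optimal` and all parameter defaults are illustrative choices.

```python
import random


def fraction_optimal(q0, steps=200, runs=2000, alpha=0.1, seed=0):
    """Fraction of runs whose final greedy choice is the [0, 2] arm.

    q0 is the (optimistic) initial reward estimate used for both arms.
    Assumption: both arms are uniform, arm 0 on [0, 1] and arm 1 on [0, 2].
    """
    rng = random.Random(seed)
    highs = [1.0, 2.0]  # upper reward bound of each arm
    hits = 0
    for _ in range(runs):
        q = [q0, q0]  # reward estimates, initialised to q0
        for _ in range(steps):
            best = max(q)
            # Greedy action selection, ties broken uniformly at random.
            a = rng.choice([i for i, v in enumerate(q) if v == best])
            r = rng.uniform(0.0, highs[a])
            # Constant step-size update towards the observed reward.
            q[a] += alpha * (r - q[a])
        hits += q[1] > q[0]
    return hits / runs


if __name__ == "__main__":
    for q0 in (0.0, 0.5, 1.0, 2.0):
        print(q0, fraction_optimal(q0))
```

Comparing the returned fraction for several values of `q0` gives a rough empirical check against the 80% target; the result depends on the assumed `alpha`, `steps`, and tie-breaking rule.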

(b) The concept of value is important in reinforcement learning. Explain the
difference between value and reward signal. What properties are desirable
for the choice of a reward signal when setting up a reinforcement learning
algorithm? [3 marks]

(c) Explain the difference between on-policy learning and off-policy learning for
the example of an agent that moves in a grid world that contains one or
more “pits” (positions with very large negative reward). [3 marks]

(d) Describe (using symbols and pseudocode) the SARSA and Q-learning
algorithms. What is the essential difference between them? Explain using an
example. In some applications, it is empirically observed, although not
theoretically justified, that SARSA converges faster than Q-learning. Describe
possible reasons for this effect. [6 marks]
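
For reference, the two tabular update rules named in part (d) can be sketched as follows. This is an illustrative sketch rather than a full answer: it assumes a table `Q` indexed as `Q[state][action]`, and the function names, `alpha` (learning rate), and `gamma` (discount factor) are chosen here for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """SARSA: bootstrap from the action a_next that was actually selected."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Q-learning: bootstrap from the greedy (maximal) action in s_next."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```

The only difference is the bootstrap term: SARSA uses the value of the next action chosen by the behaviour policy, while Q-learning uses the maximum over actions in the next state regardless of which action is taken.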

(e) Exploration is essential in most reinforcement learning algorithms. Identify
three different approaches to exploration. [3 marks]
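
As an illustration of what exploration rules can look like in code, the sketch below shows three commonly used schemes: epsilon-greedy, softmax (Boltzmann) selection, and an upper-confidence-bound rule. These are examples chosen here, not the only acceptable answers; the function names and the exploration constant `c` are assumptions made for this sketch.

```python
import math
import random


def epsilon_greedy(q, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])


def softmax_probs(q, temperature):
    """Boltzmann action probabilities; higher temperature means more exploration."""
    m = max(q)  # subtract the maximum for numerical stability
    weights = [math.exp((v - m) / temperature) for v in q]
    total = sum(weights)
    return [w / total for w in weights]


def ucb_action(q, counts, t, c=2.0):
    """Pick the action with the largest estimate plus uncertainty bonus."""
    for a, n in enumerate(counts):
        if n == 0:  # try every action once before using the bonus
            return a
    return max(
        range(len(q)),
        key=lambda a: q[a] + c * math.sqrt(math.log(t) / counts[a]),
    )
```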

(f) Consider a one-dimensional track of discrete states which are numbered from
1 to N. Rewards are always zero, except for state 1 where a reward of r = 1
is given. The discount factor is γ. Assume the agent is currently in state k