
COSC1127/1125 Artificial Intelligence
School of Computing Technologies Semester 2, 2021
Prof. Sebastian Sardiña
Tutorial No. 9: Reinforcement Learning
You can check this video on max/min vs arg max/arg min, and this video on the formulas for Temporal Difference in the AIMA book (Russell & Norvig, Artificial Intelligence: A Modern Approach).
PART 1: Passive agents

In the passive Adaptive Dynamic Programming (ADP) paradigm, an agent estimates the transition model T using tables of frequencies (model-based learning). Consider the 3-by-2 gridworld below, with the policy depicted as arrows and the terminal states illustrated with X's.
[Gridworld figure: a 3-row by 2-column grid; arrows show the policy and X's mark the terminal states.]
Cells are indexed by (row, column).
The passive ADP agent sees the following state transition observations:
Trial 1: (3,2) → (2,2) → (2,1) → (1,1)
Trial 2: (3,2) → (2,2) → (1,2) → (1,1)
Trial 3: (3,2) → (2,2) → (2,1) → (3,1)
Trial 4: (3,2) → (2,2) → (2,1) → (2,2) → (2,1) → (1,1)
With perceived rewards as follows:
State   Reward
(3, 2)    −1
(2, 2)    −2
(2, 1)    −2
(1, 2)    −1
(3, 1)    −10
(1, 1)    +16
(a) Compute the transition model by computing:
i. the table of state-action frequencies/counts, N(s, a);
Solution: (Note: this is a partial solution)
state (s)   action (a)   Frequency
(3,2)       N            4
(2,2)       W            ....
(2,1)       N            ....
(1,2)       W            1
ii. the table of state-action-state frequencies/counts, N(s, a, s′);
Solution: (Note: this is a partial solution)
state (s)   action (a)   next state (s')   Frequency
(3,2)       N            (2,2)             4
(2,2)       W            (2,1)             ....
(2,2)       W            (1,2)             ....
(2,1)       N            (1,1)             ....
(2,1)       N            (3,1)             ....
(2,1)       N            (2,2)             ....
(1,2)       W            (1,1)             1
iii. an estimate of the transition model T, using the above tables.
Solution: (Note: this is a partial solution)
state (s)   action (a)   next state (s')   T(s,a,s')
(3,2)       N            (2,2)             4/4 = 1.0
(2,2)       W            (2,1)             ....
(2,2)       W            (1,2)             ....
(2,1)       N            (1,1)             ....
(2,1)       N            (3,1)             ....
(2,1)       N            (2,2)             ....
(1,2)       W            (1,1)             1/1 = 1.0
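As a cross-check, the counting procedure can be sketched in Python (a sketch under the tutorial's setup; the variable names are our own, and the action for each state is read off the policy arrows in the figure):

```python
from collections import defaultdict

# Fixed policy: the action the agent takes in each non-terminal state
policy = {(3, 2): 'N', (2, 2): 'W', (2, 1): 'N', (1, 2): 'W'}

# The four observed trials, as sequences of visited states
trials = [
    [(3, 2), (2, 2), (2, 1), (1, 1)],
    [(3, 2), (2, 2), (1, 2), (1, 1)],
    [(3, 2), (2, 2), (2, 1), (3, 1)],
    [(3, 2), (2, 2), (2, 1), (2, 2), (2, 1), (1, 1)],
]

N_sa = defaultdict(int)   # N(s, a): how often action a was taken in s
N_sas = defaultdict(int)  # N(s, a, s'): how often that led to s'

for trial in trials:
    for s, s_next in zip(trial, trial[1:]):
        a = policy[s]
        N_sa[(s, a)] += 1
        N_sas[(s, a, s_next)] += 1

# Maximum-likelihood estimate: T(s, a, s') = N(s, a, s') / N(s, a)
T = {k: N_sas[k] / N_sa[(k[0], k[1])] for k in N_sas}
```

For instance, (3,2) → (2,2) is observed in all four trials, so T((3,2), N, (2,2)) = 4/4 = 1.0, as in the table above.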
(b) Using the same environment model and observations, now consider how a passive Temporal Difference agent will estimate the utility of the states. The temporal difference learning rule is:
U(s) ← U(s) + α[R(s) + γU(s′) − U(s)]
Take α = 0.5 as a constant (recall α is the learning rate) and γ = 1. Compute:
i. the utilities after the first trial;
Solution: (Note: this is a partial solution)
Initially, we set U[s] = 0 for all states s ∈ S.
• The starting state is (3, 2), with R[(3, 2)] = −1 and U[(3, 2)] = 0.
• First observation is (3, 2) → (2, 2). Update U[(3, 2)]:
U[(3, 2)] = U[(3, 2)] + α(R[(3, 2)] + γU[(2, 2)] − U[(3, 2)]) = 0 + 0.5(−1 + 0 − 0) = −0.5
• Second observation is (2, 2) → (2, 1). Update U[(2, 2)]:
U[(2, 2)] = U[(2, 2)] + α(R[(2, 2)] + γU[(2, 1)] − U[(2, 2)]) = ....
RMIT AI 2021 (Semester 2) - Sebastian Sardiña
• Third observation is (2, 1) → (1, 1). Update U[(2, 1)]:
U[(2, 1)] = ....
          = ....
You can check THIS VIDEO on the solving of this problem, in which we initially set U[s] = R[s] for all states s ∈ S (following Russell & Norvig, "Artificial Intelligence: A Modern Approach", 3rd edition).
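The trial-by-trial updates above can be sketched as a short Python function (a sketch under this tutorial's conventions: all utilities start at 0, and terminal states are never themselves updated; the function name is our own):

```python
# Rewards from the tutorial's table
R = {(3, 2): -1, (2, 2): -2, (2, 1): -2, (1, 2): -1, (3, 1): -10, (1, 1): 16}

def run_td(trials, alpha=0.5, gamma=1.0):
    """Apply the passive TD rule U(s) <- U(s) + alpha*(R(s) + gamma*U(s') - U(s))
    to every observed transition, trial by trial, and return the utilities."""
    U = {s: 0.0 for s in R}  # assumption: U starts at 0 everywhere
    for trial in trials:
        for s, s_next in zip(trial, trial[1:]):
            U[s] += alpha * (R[s] + gamma * U[s_next] - U[s])
    return U

trial1 = [(3, 2), (2, 2), (2, 1), (1, 1)]
trial2 = [(3, 2), (2, 2), (1, 2), (1, 1)]
```

Running `run_td([trial1])` reproduces the first update above: U[(3, 2)] = −0.5. Running `run_td([trial1, trial2])` then gives U[(3, 2)] = −1.25, matching part (ii).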
ii. the utilities after the first and second trials.
Solution: (Note: this is a partial solution)
• First observation is (3, 2) → (2, 2). Update U[(3, 2)]:
U[(3, 2)] = U[(3, 2)] + α(R[(3, 2)] + γU[(2, 2)] − U[(3, 2)])
          = −0.5 + 0.5(−1 − 1 + 0.5) = −1.25
• Second observation is (2, 2) → (1, 2). Update U[(2, 2)]:
U[(2, 2)] = U[(2, 2)] + α(R[(2, 2)] + γU[(1, 2)] − U[(2, 2)]) = ....
• Third observation is (1, 2) → (1, 1). Update U[(1, 2)]:
U[(1, 2)] = .... = ....
(c) (optional) Consider the "occasionally-random" and exploration function methods to strike a balance between exploitation and exploration. Recall that in the "occasionally-random" approach, 1/t of the time the agent selects a random action, and otherwise it follows the greedy policy. What would a high value of t mean? What about a low value of t?
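The occasionally-random rule can be sketched as follows (a minimal sketch; the function and parameter names are our own):

```python
import random

def occasionally_random(t, greedy_action, actions, rng=random):
    """With probability 1/t pick a uniformly random action;
    otherwise follow the greedy policy.

    A high t means the agent almost always exploits (little exploration);
    a low t means it frequently explores at random.
    """
    if rng.random() < 1.0 / t:
        return rng.choice(actions)
    return greedy_action
```

For example, at t = 1 the agent acts completely at random, while for large t it is nearly always greedy.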
Contrast this with the exploration function concept. As an example, consider this exploration function,