# CSE 473: Introduction to Artificial Intelligence

Assignment 6: AEMS
Heaven or hell
I can’t tell.
In this assignment you will implement an online POMDP solver, named AEMS2.
Introduction
As we have seen in class, in a POMDP, the agent can’t directly observe the world-states, thus the agent instead maintains a posterior probability distribution over all states to represent how likely the agent thinks it’s in each state. This “posterior probability” is called a belief state and defined as the following:
$$b_t(s) = P(s \mid s_0, a_0, z_1, a_1, \dots, a_{t-1}, z_t)$$

Where $b_t(s)$ indicates how likely the agent thinks it is in state $s$.
Now, we can think of the discrete world-states in the POMDP as being ordered by some ordering. We can then “name” each state with a number that represents its position in that list, with the first discrete state being 0, and the last being
$|S| - 1$. This allows us to write the belief state instead as a vector (a list of numbers), where the $i$-th number in that vector has the value $b_t(s_i)$ (the probability of being in that state). This vector representation fully captures all the possible belief states.
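For instance, in a two-state problem like the Tiger environment provided with the assignment (tiger behind the left door vs. the right door), a belief state is just a length-2 vector; a minimal sketch:

```python
import numpy as np

# A belief state over |S| = 2 world-states (e.g. the two-state Tiger
# problem: index 0 = tiger-left, index 1 = tiger-right).
# Entries must be non-negative and sum to 1.
belief = np.array([0.5, 0.5])  # uniform prior: no idea where the tiger is

assert np.all(belief >= 0)
assert np.isclose(belief.sum(), 1.0)

# belief[i] is the probability the agent assigns to being in state s_i.
print(belief[0])  # probability of tiger-left
```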
The belief state represents the agent’s partial understanding of which state of the world it is actually in. Therefore, a POMDP may actually be seen as an MDP, whose state space consists of the belief state vectors we defined earlier. This means that in theory we could apply the same techniques used in solving MDPs to find the optimal policy of a POMDP model.
However, the belief state space is a continuous, real-valued probability distribution over states, meaning there are infinitely many possible belief state vectors. As a result, standard MDP solvers such as VI, PI, and even RTDP are too computationally expensive. In fact, this is true even when we know the world dynamics encoded by the transition, reward, and observation functions. While we may use the same fundamental principles of search such as greedy action selection and admissible heuristics, we need to develop more complex algorithms with better convergence rates to operate on POMDP models if we want to have any chance of solving them.
Implementation
Important: For this part of the assignment, you will need to have NumPy installed.
If you don’t already have NumPy installed, you may install it through your distribution package manager if you are using Linux.
Alternatively, you can install NumPy using pip: pip install numpy (you may need to use pip3), or conda: conda install numpy.
We will provide you the model parser, an environment simulator (not as fancy as Pacman though), and vectors for a tight lower bound. You will be asked to implement an upper bound, a loose lower bound, and most importantly AEMS2 for an infinite horizon POMDP.
Files you’ll edit:
mdpSolver.py
Both of the heuristics you implement.
aems.py
Files you will not edit:
pomdp.py
Stores the model in a POMDP class (look at how T, O, and R are represented)
offlineSolver.py
General offline solver for POMDPs (look at the abstract methods)
onlineSolver.py
General online solver for POMDPs (look at the abstract methods)
environment.py
Changes the real state of the environment according to the agent’s action; simulates and returns an observation along with the instant reward gained.
Gives the value function or the best action based on a provided policy (used for PBVI as the lower bound).
evaluate.py
Evaluate a POMDP solver by running it multiple times.
test.py
Test and grade! You can check the tests in test folder.
The above builds on a general POMDP solver that works with any model expressed in the common POMDP file format developed by Anthony Cassandra (see http://www.pomdp.org/code/pomdp-file-spec.html).
A few environments are provided in /examples/env:
Tiger
Two closed doors, opening one gives you a large reward. Behind the other one is a tiger! (Full description: Kaelbling et al., 1998)
Hallway
A navigation environment with 57 states. (Full description: QMDP paper by Littman et al., 1995)
Hallway2
A navigation environment with 57 states. (Full description: QMDP paper by Littman et al., 1995)
TagAvoid
Laser tag game for robots! (Full description: PBVI paper by Pineau et al., 03)
RockSample_4_4
Scientific exploration by a rover (Full description: HSVI paper by Smith & Simmons, 04)
You do not need to worry too much about the format. Just take a look at the QMDP class in mdpSolver.py which documents some of the properties of the POMDP object. Remember that we only work with indices in the code, so actions, states, and observations are represented as a number and not their name.
Important: The parameter named “precision” determines the computing precision of your methods (including deciding convergence). You can use it to avoid extra computation in value iteration or heuristic updates.
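For example, a value-iteration loop can use the precision as its stopping threshold. Here is a minimal sketch; the array layouts `T[a, s, s']` and `R[s, a]` are assumptions for illustration, not the actual pomdp.py interface:

```python
import numpy as np

def value_iteration(T, R, gamma, precision):
    """Sketch: MDP value iteration that stops once the largest update
    falls below `precision`.

    Assumed layouts (illustrative only):
      T[a, s, s2] = P(s2 | s, a)   -- transition tensor, shape (|A|, |S|, |S|)
      R[s, a]     = expected immediate reward, shape (|S|, |A|)
    """
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s2} T[a, s, s2] * V[s2]
        Q = R + gamma * np.einsum('ast,t->sa', T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < precision:  # converged to tolerance
            return V_new, Q
        V = V_new
```

For instance, on a one-state MDP with reward 1 and γ = 0.9, this converges to V ≈ 1/(1 − 0.9) = 10.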
Question 1 (5 points) – QMDP – An upper bound attempt at a POMDP solution
QMDP is one of the very first algorithms suggested for solving POMDPs. It can be calculated very fast and offline (without the need for simulation). However, it’s not a very good approximation. Therefore, QMDP is often used to calculate a bound for other online solvers, including the AEMS solver that we will be implementing at the end.
To describe it simply, QMDP assumes that after one action, the agent will learn what state it is in with certainty (an optimistic assumption, which is why it yields an upper bound). As a result, after just one action the problem reduces to a standard MDP. Since we know all the parameters of the MDP underlying the POMDP, we can solve this MDP using standard tools like value iteration:
$$Q_{MDP}(s, a) = \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma\, V_{MDP}(s')\right] = E[R(s, a)] + \gamma \sum_{s'} T(s, a, s')\, V_{MDP}(s')$$
Thus QMDP boils down to solving the underlying MDP to produce the $Q_{MDP}$ above, and then constructing the POMDP solution as follows:

$$Q_{QMDP}(b_t, a) = \sum_s b_t(s)\, Q_{MDP}(s, a)$$

$$V_{QMDP} = \max_a Q_{QMDP}(b_t, a)$$
Implement the QMDP class in mdpSolver.py. Although it is an offline method, you should try to make it as fast as possible; consider using vector operations in NumPy.
Hint: In a POMDP, rewards are given as $R(s, a, s', o)$, which depends on the state reached after the action and on the observation. How would you construct $E[R(s, a)]$ using the POMDP rewards, observation probabilities, and transition probabilities?
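Putting the two equations together, the belief-space step of QMDP is a single matrix product. The following is a sketch, not the actual mdpSolver.py interface; `Q_mdp` is assumed to be the (|S|, |A|) matrix produced by value iteration on the underlying MDP:

```python
import numpy as np

def qmdp_values(belief, Q_mdp):
    """Q_QMDP(b, a) = sum_s b(s) * Q_MDP(s, a), for all actions at once.

    belief: (|S|,) belief vector b_t
    Q_mdp:  (|S|, |A|) Q-values of the underlying MDP (from value iteration)
    Returns a (|A|,) vector of belief-space Q-values.
    """
    return belief @ Q_mdp

def qmdp_best_action(belief, Q_mdp):
    """V_QMDP = max_a Q_QMDP(b, a); return the maximizing action index."""
    return int(np.argmax(qmdp_values(belief, Q_mdp)))
```

Because the online portion is just one vector-matrix product, QMDP is cheap enough to evaluate at every step, which is what makes it useful as a bound inside AEMS.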
You can evaluate the performance of QMDP with the following command:
python3 evaluate.py QMDP [Environment] [Number of Runs] [precision]
For example, the following command gives you the average total reward of QMDP over 1000 runs of the Hallway problem (computing precision = 0.01):
python3 evaluate.py QMDP Hallway 1000 0.01
You can test your QMDP implementation with the following command:
python3 test.py q1
Question 2 (3 points) – MinMDP – A naive lower bound of a POMDP solution
Now that we have an upper bound, it would be nice to find a lower bound on the POMDP solution. Here we won’t spend too much effort and will use MinMDP, the “worst possible future reward”, to lower-bound the POMDP.
Here’s how the MinMDP works: After getting the (expected) immediate reward of the next action, the agent will always receive the worst possible reward (minimum of all
$R(s, a, s')$,
no matter what the belief is) in any future steps. We can see that this is trivially a lower bound. The concrete formulation is as follows:
$$R_{min} = \min_{a, s, s'} R(s, a, s')$$

$$V_{worst}(b_t) = \max_a \sum_s b_t(s)\, R(s, a) + (\gamma + \gamma^2 + \dots)\, R_{min}$$
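The formulation above can be sketched in a few lines, assuming (as in Question 1) that an expected-reward matrix `R_exp[s, a]` = $E[R(s, a)]$ has already been computed; the names here are illustrative, not the actual MinMDP interface:

```python
import numpy as np

def min_mdp_value(belief, R_exp, R_min, gamma):
    """V_worst(b) = max_a sum_s b(s) R(s, a) + (gamma + gamma^2 + ...) R_min.

    belief: (|S|,) belief vector b_t
    R_exp:  (|S|, |A|) expected immediate rewards E[R(s, a)]
    R_min:  scalar, min over (a, s, s') of R(s, a, s')
    Requires gamma < 1, so the infinite tail sums to gamma / (1 - gamma).
    """
    tail = gamma / (1.0 - gamma) * R_min
    return float(np.max(belief @ R_exp)) + tail
```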
Hint 1: How does one calculate the $V$ function given the $Q$ function? Consider the definition of $V_{worst}(b_t)$ above. Can you derive the corresponding $Q$ function?
Hint 2:
$(\gamma + \gamma^2 + \dots)$
is the infinite sum of a geometric sequence. What value does it take when $\gamma < 1$?
Implement this bound in the MinMDP class. You can evaluate it with the following command:
python3 evaluate.py MinMDP [Environment] [Number of Runs] [precision]
You can test your implementation by running the following command:
python3 test.py q2
Question 3 (5 points) – Understanding AEMS2
Here’s a link to the paper introducing AEMS. Feel free to refer to it to understand the details of the algorithm, as we will only describe the high-level concepts below.
As mentioned in the introduction, solving POMDPs offline is hard due to the infinite number of belief states, so much so that an offline solver may never finish computing the complete optimal policy even for a given initial state. We instead turn to an online POMDP solver, which interleaves planning with executing actions (recall the similarities with reinforcement learning). By considering a number of actions in an anytime fashion, the agent is able to choose a good action (a partial solution) when it must act in the real world, while improving its solution during off-time.
Recap
First, let’s go over how belief states are updated in POMDPs. The belief state is updated in two steps after each action: first because of performing the action ($a_t$), and then based on the new observation ($z_{t+1}$) that the agent received because of this action:
(1) Action update

$$b'_t(s) = P(s \mid s_0, a_0, z_1, a_1, \dots, a_{t-1}, z_t, a_t) = \sum_{s'} P(s' \mid s_0, \dots, a_{t-1}, z_t)\, P(s \mid a_t, s') = \sum_{s'} b_t(s')\, P(s \mid a_t, s')$$

(2) Observation update

$$b_{t+1}(s) = P(s \mid s_0, \dots, z_t, a_t, z_{t+1}) \propto b'_t(s)\, P(z_{t+1} \mid s, a_t) = b'_t(s)\, O(s, a_t, z_{t+1})$$

Representing Search in a POMDP: And-Or Tree
To represent the search tree of an online POMDP solver, we can encode the belief states encountered during the search ($b$ and $b'$) as nodes in a tree. Since the belief state is updated by two aspects, i.e. action and observation, there are two different types of belief nodes: OR nodes, representing the parts of the search where an action is picked ($b_t \to b'_t$), and AND nodes, representing the parts where the observations are considered ($b'_t \to b_{t+1}$). Note that since the agent can pick the actions, only one branch of an OR node can be optimal. In contrast, at an AND node any observation may occur, so all branches need to be considered.
[Figure] And-Or Tree: Triangles represent OR nodes, where the agent should pick one action from all available ones. An OR node contains the belief state. Circles represent AND nodes, where the agent observes. The probability of each observation depends on the observation function and the posterior probability of states in the current node ($b'$). This And-Or tree has two heuristics, one upper and one lower bound (the numbers in brackets).
Searching the And-Or Tree
A solution of a POMDP (and in fact of an MDP) can be thought of as finding the value $V^*$ at some belief state (or state, in the case of an MDP). This is similar to adversarial search, where we tried to find the achievable utility. Recall alpha-beta pruning, where we used a branch-and-bound mechanism to efficiently prune the search tree.
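The two-step belief update from the recap above can be written compactly in NumPy. This is a sketch; the array layouts for T and O are illustrative assumptions, not necessarily those of pomdp.py:

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One action/observation belief update, b_t -> b'_t -> b_{t+1}.

    belief: (|S|,) current belief b_t
    Assumed layouts (illustrative only):
      T[a, s, s2] = P(s2 | s, a)   (transition function)
      O[a, s2, z] = P(z | s2, a)   (observation function)
    """
    # (1) Action update: b'_t(s2) = sum_s b_t(s) P(s2 | s, a)
    b_prime = belief @ T[action]
    # (2) Observation update: b_{t+1}(s2) proportional to b'_t(s2) O(s2, a, z)
    b_new = b_prime * O[action, :, observation]
    return b_new / b_new.sum()  # normalize away the "proportional to"
```

This update is exactly what an online solver performs along each action/observation branch of the tree we are about to search.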
We can utilize the same principles here and select two heuristics (one upper and one lower bound), which enable us to prune nodes whose upper bound is lower than the lower bound of at least one other node. In fact, there is an online POMDP algorithm that is based on pruning (Ross et al., 2008). However, without sufficiently tight bounds, pruning turns out not to be very effective in practice. Instead, we take a different approach: we try to minimize the difference between the upper and lower bounds. This is referred to as error minimization. We can think of this difference as the agent’s uncertainty about the value of an action, or alternatively, as the error of its estimated value. Anytime Error Minimization Search (AEMS) is based on this idea: at each step it expands and updates the node in the tree with the maximum (expected) error, thus reducing the error. The original paper contains all the details of the algorithm, but to make things easier, we’ll go over the high-level ideas.
AEMS, at last
First, let’s define a representation for a lower bound of the POMDP’s $V^*$-value as $L(b)$ and an upper bound as $U(b)$, where $b$ is the belief state. We define the “error” as $\epsilon(b) = U(b) - L(b)$. This error will eventually be used to guide which node to expand; however, right now it isn’t sufficient yet. Next, we need to define a way to pick “optimal” actions. One way of doing this is to select an action probabilistically, following the likelihood of that action being “good” in the long run: let $P(a \mid b)$ be the probability of action $a$ being the best action at belief state $b$. There are many ways to model this distribution that all make sense. However, to keep things simple, AEMS2 takes the optimistic view and imagines that each action can eventually achieve a value equal to its upper bound. This means that the best action to pick is the one with the highest upper bound.
Specifically, it means that $P(a \mid b)$ is equal to 1 if $a$ maximizes the upper bound at $b$, and 0 otherwise. To understand the details of why this is a reasonable assumption, look to page 4 of the paper.
With this done, we can model the probability of reaching belief state $b$ after $d$ steps in the tree (where we pick actions according to the method described above):

$$P(b_d) = \prod_{i=0}^{d-1} P(o_i \mid b_i, a_i)\, P(a_i \mid b_i)$$

This is derived by chaining probabilities (remember we have Markovian assumptions, so things are independent). Note that the first term can be derived from the observation and transition probabilities, whereas the second term is the action-selection probability we defined a paragraph ago. This means we can write the expected error $E = \mathbb{E}[\epsilon]$ of a belief state $b$ that is $d$ steps away (i.e. deep down the tree) as:

$$E(b_d) = \gamma^d\, P(b_d)\, \epsilon(b_d)$$

Each time we expand a leaf node of the search tree, we get a better estimate of the upper and lower bounds of that node and reduce its error. To make our expansion worthwhile, we want to improve the estimate with the worst error bound. Thus we will pick the leaf node that has the maximum $E(b_d)$ value as our candidate to expand.
With this we can derive a high-level algorithm for AEMS. While we still have planning time remaining (i.e. we don’t need to make a decision yet), do:
1. Find the node in the fringe (leaves) of the And-Or tree that has the maximal $E(b_d)$ value as defined above.
2. Expand this node into a subtree, using QMDP and MinMDP as bound heuristics when creating new leaves.
3. Update the upper and lower bounds of this node based on those of its children.
4. Backtrack the bound changes all the way up to the root.
If we’re out of time, use the current tree to determine the best action at the root. Return that action.
To check that you understand AEMS conceptually, please answer the following questions in the writeup:
1. (1 point) What does “anytime” mean in the context of AEMS? Explain why the “anytime” property is desirable for an online POMDP solver.
2. (4 points) Write down the error term that AEMS2 would calculate for the fringe nodes in the example And-Or tree above ($b_2$ to $b_8$). (Use a discount factor $\gamma = 0.95$