
Reinforcement Learning:

An Introduction

Second edition, in progress

Richard S. Sutton and Andrew G. Barto

© 2012

A Bradford Book

The MIT Press

Cambridge, Massachusetts

London, England


In memory of A. Harry Klopf

Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Series Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

Summary of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

I The Problem 1

1 Introduction 3

1.1 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . 4

1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3 Elements of Reinforcement Learning . . . . . . . . . . . . . . 7

1.4 An Extended Example: Tic-Tac-Toe . . . . . . . . . . . . . . 10

1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.6 History of Reinforcement Learning . . . . . . . . . . . . . . . 16

1.7 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . . 23

2 Bandit Problems 25

2.1 An n-Armed Bandit Problem . . . . . . . . . . . . . . . . . . 26

2.2 Action-Value Methods . . . . . . . . . . . . . . . . . . . . . . 27

2.3 Softmax Action Selection . . . . . . . . . . . . . . . . . . . . . 30

2.4 Incremental Implementation . . . . . . . . . . . . . . . . . . . 32

2.5 Tracking a Nonstationary Problem . . . . . . . . . . . . . . . 33

2.6 Optimistic Initial Values . . . . . . . . . . . . . . . . . . . . . 35

2.7 Associative Search (Contextual Bandits) . . . . . . . . . . . . 37


2.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.9 Bibliographical and Historical Remarks . . . . . . . . . . . . . 40

3 The Reinforcement Learning Problem 43

3.1 The Agent–Environment Interface . . . . . . . . . . . . . . . . 43

3.2 Goals and Rewards . . . . . . . . . . . . . . . . . . . . . . . . 48

3.3 Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.4 Unified Notation for Episodic and Continuing Tasks . . . . . . 52

∗3.5 The Markov Property . . . . . . . . . . . . . . . . . . . . . . . 53

3.6 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . 58

3.7 Value Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.8 Optimal Value Functions . . . . . . . . . . . . . . . . . . . . . 66

3.9 Optimality and Approximation . . . . . . . . . . . . . . . . . 71

3.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.11 Bibliographical and Historical Remarks . . . . . . . . . . . . . 74

II Tabular Action-Value Methods 79

4 Dynamic Programming 83

4.1 Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.2 Policy Improvement . . . . . . . . . . . . . . . . . . . . . . . . 87

4.3 Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.4 Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.5 Asynchronous Dynamic Programming . . . . . . . . . . . . . . 98

4.6 Generalized Policy Iteration . . . . . . . . . . . . . . . . . . . 99

4.7 Efficiency of Dynamic Programming . . . . . . . . . . . . . . . 101

4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

4.9 Bibliographical and Historical Remarks . . . . . . . . . . . . . 103

5 Monte Carlo Methods 107

5.1 Monte Carlo Policy Evaluation . . . . . . . . . . . . . . . . . 108


5.2 Monte Carlo Estimation of Action Values . . . . . . . . . . . . 112

5.3 Monte Carlo Control . . . . . . . . . . . . . . . . . . . . . . . 114

5.4 On-Policy Monte Carlo Control . . . . . . . . . . . . . . . . . 118

∗5.5 Evaluating One Policy While Following Another (Off-policy Policy Evaluation) . . . . . 121

5.6 Off-Policy Monte Carlo Control . . . . . . . . . . . . . . . . . 122

5.7 Incremental Implementation . . . . . . . . . . . . . . . . . . . 124

5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

5.9 Bibliographical and Historical Remarks . . . . . . . . . . . . . 127

6 Temporal-Difference Learning 129

6.1 TD Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6.2 Advantages of TD Prediction Methods . . . . . . . . . . . . . 134

6.3 Optimality of TD(0) . . . . . . . . . . . . . . . . . . . . . . . 137

6.4 Sarsa: On-Policy TD Control . . . . . . . . . . . . . . . . . . 141

6.5 Q-Learning: Off-Policy TD Control . . . . . . . . . . . . . . . 144

6.6 Games, Afterstates, and Other Special Cases . . . . . . . . . . 147

6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

6.8 Bibliographical and Historical Remarks . . . . . . . . . . . . . 149

7 Eligibility Traces 153

7.1 n-Step TD Prediction . . . . . . . . . . . . . . . . . . . . . . . 154

7.2 The Forward View of TD(λ) . . . . . . . . . . . . . . . . . . . 159

7.3 The Backward View of TD(λ) . . . . . . . . . . . . . . . . . . 163

7.4 Equivalence of Forward and Backward Views . . . . . . . . . . 166

7.5 Sarsa(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

7.6 Q(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

7.7 Replacing Traces . . . . . . . . . . . . . . . . . . . . . . . . . 175

7.8 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . 178

∗7.9 Variable λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

7.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179


7.11 Bibliographical and Historical Remarks . . . . . . . . . . . . . 180

8 Planning and Learning with Tabular Methods 183

8.1 Models and Planning . . . . . . . . . . . . . . . . . . . . . . . 183

8.2 Integrating Planning, Acting, and Learning . . . . . . . . . . . 186

8.3 When the Model Is Wrong . . . . . . . . . . . . . . . . . . . . 191

8.4 Prioritized Sweeping . . . . . . . . . . . . . . . . . . . . . . . 194

8.5 Full vs. Sample Backups . . . . . . . . . . . . . . . . . . . . . 198

8.6 Trajectory Sampling . . . . . . . . . . . . . . . . . . . . . . . 202

8.7 Heuristic Search . . . . . . . . . . . . . . . . . . . . . . . . . . 205

8.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

8.9 Bibliographical and Historical Remarks . . . . . . . . . . . . . 209

III Approximate Solution Methods 211

9 On-policy Approximation of Action Values 213

9.1 Value Prediction with Function Approximation . . . . . . . . 214

9.2 Gradient-Descent Methods . . . . . . . . . . . . . . . . . . . . 217

9.3 Linear Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 221

9.4 Control with Function Approximation . . . . . . . . . . . . . . 230

9.5 Should We Bootstrap? . . . . . . . . . . . . . . . . . . . . . . 236

9.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

9.7 Bibliographical and Historical Remarks . . . . . . . . . . . . . 239

10 Off-policy Approximation of Action Values 243

11 Policy Approximation 245

11.1 Actor–Critic Methods . . . . . . . . . . . . . . . . . . . . . . . 245

11.2 R-Learning and the Average-Reward Setting . . . . . . . . . . 248

12 State Estimation 251


13 Temporal Abstraction 253

IV Frontiers 255

14 Biological Reinforcement Learning 257

15 Applications and Case Studies 259

15.1 TD-Gammon . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

15.2 Samuel’s Checkers Player . . . . . . . . . . . . . . . . . . . . . 265

15.3 The Acrobot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

15.4 Elevator Dispatching . . . . . . . . . . . . . . . . . . . . . . . 272

15.5 Dynamic Channel Allocation . . . . . . . . . . . . . . . . . . . 277

15.6 Job-Shop Scheduling . . . . . . . . . . . . . . . . . . . . . . . 281

16 Prospects 289

16.1 The Unified View . . . . . . . . . . . . . . . . . . . . . . . . . 289

16.2 Other Frontier Dimensions . . . . . . . . . . . . . . . . . . . . 292

References 295

Index 320


Preface

We first came to focus on what is now known as reinforcement learning in late 1979. We were both at the University of Massachusetts, working on one of the earliest projects to revive the idea that networks of neuronlike adaptive elements might prove to be a promising approach to artificial adaptive intelligence. The project explored the “heterostatic theory of adaptive systems” developed by A. Harry Klopf. Harry’s work was a rich source of ideas, and we were permitted to explore them critically and compare them with the long history of prior work in adaptive systems. Our task became one of teasing the ideas apart and understanding their relationships and relative importance. This continues today, but in 1979 we came to realize that perhaps the simplest of the ideas, which had long been taken for granted, had received surprisingly little attention from a computational perspective. This was simply the idea of a learning system that wants something, that adapts its behavior in order to maximize a special signal from its environment. This was the idea of a “hedonistic” learning system, or, as we would say now, the idea of reinforcement learning.

Like others, we had a sense that reinforcement learning had been thoroughly explored in the early days of cybernetics and artificial intelligence. On closer inspection, though, we found that it had been explored only slightly. While reinforcement learning had clearly motivated some of the earliest computational studies of learning, most of these researchers had gone on to other things, such as pattern classification, supervised learning, and adaptive control, or they had abandoned the study of learning altogether. As a result, the special issues involved in learning how to get something from the environment received relatively little attention. In retrospect, focusing on this idea was the critical step that set this branch of research in motion. Little progress could be made in the computational study of reinforcement learning until it was recognized that such a fundamental idea had not yet been thoroughly explored.

The field has come a long way since then, evolving and maturing in several directions. Reinforcement learning has gradually become one of the most active research areas in machine learning, artificial intelligence, and neural network research. The field has developed strong mathematical foundations and impressive applications. The computational study of reinforcement learning is now a large field, with hundreds of active researchers around the world in diverse disciplines such as psychology, control theory, artificial intelligence, and neuroscience. Particularly important have been the contributions establishing and developing the relationships to the theory of optimal control and dynamic programming. The overall problem of learning from interaction to achieve goals is still far from being solved, but our understanding of it has improved significantly. We can now place component ideas, such as temporal-difference learning, dynamic programming, and function approximation, within a coherent perspective with respect to the overall problem.

Our goal in writing this book was to provide a clear and simple account of the key ideas and algorithms of reinforcement learning. We wanted our treatment to be accessible to readers in all of the related disciplines, but we could not cover all of these perspectives in detail. Our treatment takes almost exclusively the point of view of artificial intelligence and engineering, leaving coverage of connections to psychology, neuroscience, and other fields to others or to another time. We also chose not to produce a rigorous formal treatment of reinforcement learning. We did not reach for the highest possible level of mathematical abstraction and did not rely on a theorem–proof format. We tried to choose a level of mathematical detail that points the mathematically inclined in the right directions without distracting from the simplicity and potential generality of the underlying ideas.

The book consists of four parts. Part I is introductory and problem oriented. We focus on the simplest aspects of reinforcement learning and on its main distinguishing features. One full chapter is devoted to introducing the reinforcement learning problem whose solution we explore in the rest of the book. Part II presents tabular versions (assuming a small finite state space) of all the basic solution methods based on estimating action values. We introduce dynamic programming, Monte Carlo methods, and temporal-difference learning. There is a chapter on eligibility traces which unifies the latter two methods, and a chapter that unifies planning methods (such as dynamic programming and state-space search) and learning methods (such as Monte Carlo and temporal-difference learning). Part III is concerned with extending the tabular methods to include various forms of approximation including function approximation, policy-gradient methods, and methods designed for solving off-policy learning problems. Part IV surveys some of the frontiers of reinforcement learning in biology and applications.

This book was designed to be used as a text in a one-semester course, perhaps supplemented by readings from the literature or by a more mathematical text such as Bertsekas and Tsitsiklis (1996) or Szepesvári (2010). This book can also be used as part of a broader course on machine learning, artificial intelligence, or neural networks. In this case, it may be desirable to cover only a subset of the material. We recommend covering Chapter 1 for a brief overview, Chapter 2 through Section 2.2, Chapter 3 except Sections 3.4, 3.5 and 3.9, and then selecting sections from the remaining chapters according to time and interests. The five chapters of Part II build on each other and are best covered in sequence; of these, Chapter 6 is the most important for the subject and for the rest of the book. A course focusing on machine learning or neural networks should cover Chapter 9, and a course focusing on artificial intelligence or planning should cover Chapter 8. Throughout the book, sections that are more difficult and not essential to the rest of the book are marked with a ∗. These can be omitted on first reading without creating problems later on. Some exercises are marked with a ∗ to indicate that they are more advanced and not essential to understanding the basic material of the chapter.

The book is largely self-contained. The only mathematical background assumed is familiarity with elementary concepts of probability, such as expectations of random variables. Chapter 9 is substantially easier to digest if the reader has some knowledge of artificial neural networks or some other kind of supervised learning method, but it can be read without prior background. We strongly recommend working the exercises provided throughout the book. Solution manuals are available to instructors. This and other related and timely material is available via the Internet.

At the end of most chapters is a section entitled “Bibliographical and Historical Remarks,” wherein we credit the sources of the ideas presented in that chapter, provide pointers to further reading and ongoing research, and describe relevant historical background. Despite our attempts to make these sections authoritative and complete, we have undoubtedly left out some important prior work. For that we apologize, and welcome corrections and extensions for incorporation into a subsequent edition.

In some sense we have been working toward this book for thirty years, and we have lots of people to thank. First, we thank those who have personally helped us develop the overall view presented in this book: Harry Klopf, for helping us recognize that reinforcement learning needed to be revived; Chris Watkins, Dimitri Bertsekas, John Tsitsiklis, and Paul Werbos, for helping us see the value of the relationships to dynamic programming; John Moore and Jim Kehoe, for insights and inspirations from animal learning theory; Oliver Selfridge, for emphasizing the breadth and importance of adaptation; and, more generally, our colleagues and students who have contributed in countless ways: Ron Williams, Charles Anderson, Satinder Singh, Sridhar Mahadevan, Steve Bradtke, Bob Crites, Peter Dayan, and Leemon Baird. Our view of reinforcement learning has been significantly enriched by discussions with Paul Cohen, Paul Utgoff, Martha Steenstrup, Gerry Tesauro, Mike Jordan, Leslie Kaelbling, Andrew Moore, Chris Atkeson, Tom Mitchell, Nils Nilsson, Stuart Russell, Tom Dietterich, Tom Dean, and Bob Narendra. We thank Michael Littman, Gerry Tesauro, Bob Crites, Satinder Singh, and Wei Zhang for providing specifics of Sections 4.7, 15.1, 15.4, 15.5, and 15.6 respectively. We thank the Air Force Office of Scientific Research, the National Science Foundation, and GTE Laboratories for their long and farsighted support.


We also wish to thank the many people who have read drafts of this book and provided valuable comments, including Tom Kalt, John Tsitsiklis, Pawel Cichosz, Olle Gällmo, Chuck Anderson, Stuart Russell, Ben Van Roy, Paul Steenstrup, Paul Cohen, Sridhar Mahadevan, Jette Randlov, Brian Sheppard, Thomas O’Connell, Richard Coggins, Cristina Versino, John H. Hiett, Andreas Badelt, Jay Ponte, Joe Beck, Justus Piater, Martha Steenstrup, Satinder Singh, Tommi Jaakkola, Dimitri Bertsekas, Torbjörn Ekman, Christina Björkman, Jakob Carlström, and Olle Palmgren. Finally, we thank Gwyn Mitchell for helping in many ways, and Harry Stanton and Bob Prior for being our champions at MIT Press.

Series Foreword


Summary of Notation

Capital letters are used for random variables and major algorithm variables. Lower case letters are used for the values of random variables and for scalar functions. Quantities that are required to be real-valued vectors are written in bold and in lower case (even if random variables).

s state

a action

S set of all nonterminal states

S+ set of all states, including the terminal state

A(s) set of actions possible in state s

t discrete time step

T final time step of an episode

St state at t

At action at t

Rt reward at t, dependent, like St, on At−1 and St−1

Gt return (cumulative discounted reward) following t

G^(n)_t n-step return (Section 7.1)

G^λ_t λ-return (Section 7.2)

π policy, decision-making rule

π(s) action taken in state s under deterministic policy π

π(a|s) probability of taking action a in state s under stochastic policy π

p(s′|s, a) probability of transition from state s to state s′ under action a

r(s, a, s′) expected immediate reward on transition from s to s′ under action a

vπ(s) value of state s under policy π (expected return)

v∗(s) value of state s under the optimal policy

qπ(s, a) value of taking action a in state s under policy π

q∗(s, a) value of taking action a in state s under the optimal policy

Vt estimate (a random variable) of vπ or v∗

Qt estimate (a random variable) of qπ or q∗

v̂(s,w) approximate value of state s given a vector of weights w

q̂(s, a,w) approximate value of state–action pair s, a given weights w

w,wt vector of (possibly learned) weights underlying an approximate value function

x(s) vector of features visible when in state s

w⊤x inner product of vectors, w⊤x = Σᵢ wᵢxᵢ; e.g., v̂(s,w) = w⊤x(s)


δt temporal-difference error at t (a random variable, even though not upper case)

Zt(s) eligibility trace for state s at t

Zt(s, a) eligibility trace for a state–action pair

zt eligibility trace vector at t

γ discount-rate parameter

ε probability of random action in ε-greedy policy

α, β step-size parameters

λ decay-rate parameter for eligibility traces
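
As a concrete illustration of the return G_t defined above, here is a short Python sketch. The reward sequence and the function name are made up for illustration; they are not from the book.

```python
# Compute the return G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
# for a finite episode, using the recursion G_t = R_{t+1} + γ G_{t+1}.

def discounted_return(rewards, gamma):
    """Return G_0 given the episode's reward sequence R_1, R_2, ..., R_T."""
    g = 0.0
    for r in reversed(rewards):  # work backward from the final reward
        g = r + gamma * g        # each step applies G_t = R_{t+1} + γ G_{t+1}
    return g

# A single reward of 1 arriving on the third step, discounted twice:
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # ≈ 0.81
```

Working backward through the rewards means each step is a single multiply-add, which is how the recursive definition of the return is usually evaluated in practice.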

Part I

The Problem


Chapter 1

Introduction

The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly a major source of knowledge about our environment and ourselves. Whether we are learning to drive a car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior. Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence.

In this book we explore a computational approach to learning from interaction. Rather than directly theorizing about how people or animals learn, we explore idealized learning situations and evaluate the effectiveness of various learning methods. That is, we adopt the perspective of an artificial intelligence researcher or engineer. We explore designs for machines that are effective in solving learning problems of scientific or economic interest, evaluating the designs through mathematical analysis or computational experiments. The approach we explore, called reinforcement learning, is much more focused on goal-directed learning from interaction than are other approaches to machine learning.


1.1 Reinforcement Learning

Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning.

Reinforcement learning is defined not by characterizing learning methods, but by characterizing a learning problem. Any method that is well suited to solving that problem, we consider to be a reinforcement learning method. A full specification of the reinforcement learning problem in terms of optimal control of Markov decision processes must wait until Chapter 3, but the basic idea is simply to capture the most important aspects of the real problem facing a learning agent interacting with its environment to achieve a goal. Clearly, such an agent must be able to sense the state of the environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. The formulation is intended to include just these three aspects—sensation, action, and goal—in their simplest possible forms without trivializing any of them.
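
The sensation-action-goal loop just described can be sketched in a few lines of Python. The environment, the agent, and all names here are hypothetical stand-ins for illustration, not definitions from the book.

```python
# A minimal sketch of an agent interacting with an environment:
# it senses a state, takes an action, and receives a reward tied to a goal.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

class CoinGuessEnv:
    """Toy environment: the state is a coin face; guessing it earns reward 1."""
    def reset(self):
        self.state = random.choice(["heads", "tails"])
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0   # goal: guess correctly
        self.state = random.choice(["heads", "tails"])  # environment moves on
        return self.state, reward

def agent_policy(state):
    # The agent senses the state but, in this sketch, chooses at random.
    return random.choice(["heads", "tails"])

env = CoinGuessEnv()
state = env.reset()
total_reward = 0.0
for t in range(100):  # 100 steps of the sensation-action-reward loop
    action = agent_policy(state)
    state, reward = env.step(action)
    total_reward += reward
```

A learning agent would differ from this one only in using the observed rewards to change how it maps states to actions; the interaction loop itself stays the same.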

Reinforcement learning is different from supervised learning, the kind of learning studied in most current research in machine learning, statistical pattern recognition, and artificial neural networks. Supervised learning is learning from examples provided by a knowledgeable external supervisor. This is an important kind of learning, but alone it is not adequate for learning from interaction. In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In uncharted territory—where one would expect learning to be most beneficial—an agent must be able to learn from its own experience.

One of the challenges that arise in reinforcement learning and not in other kinds of learning is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration–exploitation dilemma has been intensively studied by mathematicians for many decades (see Chapter 2). For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised learning as it is usually defined.
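
The balance just described can be sketched with an ε-greedy rule on a made-up three-armed bandit; Chapter 2 develops these ideas properly. The reward probabilities below are illustrative assumptions, not data from the book.

```python
# Sketch of exploration vs. exploitation: with probability epsilon the agent
# explores (tries any action); otherwise it exploits its current estimates.
import random

random.seed(0)

true_means = [0.2, 0.5, 0.8]  # hypothetical expected rewards, unknown to the agent
q = [0.0, 0.0, 0.0]           # sample-average estimates of each action's value
n = [0, 0, 0]                 # how many times each action has been tried
epsilon = 0.1

for t in range(5000):
    if random.random() < epsilon:
        a = random.randrange(3)                # explore: pick any action
    else:
        a = max(range(3), key=lambda i: q[i])  # exploit: best current estimate
    reward = 1.0 if random.random() < true_means[a] else 0.0
    n[a] += 1
    q[a] += (reward - q[a]) / n[a]             # incremental sample average
```

Because each action keeps being tried occasionally, every estimate q[a] improves with experience, and the agent progressively favors the action that appears best while still sampling the others.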

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. This is in contrast with many approaches that consider subproblems without addressing how they might fit into a larger picture. For example, we have mentioned that much of machine learning research is concerned with supervised learning without explicitly specifying how such an ability would finally be useful. Other researchers have developed theories of planning with general goals, but without considering planning’s role in real-time decision-making, or the question of where the predictive models necessary for planning would come from. Although these approaches have yielded many useful results, their focus on isolated subproblems is a significant limitation.

Reinforcement learning takes the opposite tack, starting with a complete, interactive, goal-seeking agent. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments. Moreover, it is usually assumed from the beginning that the agent has to operate despite significant uncertainty about the environment it faces. When reinforcement learning involves planning, it has to address the interplay between planning and real-time action selection, as well as the question of how environmental models are acquired and improved. When reinforcement learning involves supervised learning, it does so for specific reasons that determine which capabilities are critical and which are not. For learning research to make progress, important subproblems have to be isolated and studied, but they should be subproblems that play clear roles in complete, interactive, goal-seeking agents, even if all the details of the complete agent cannot yet be filled in.

One of the larger trends of which reinforcement learning is a part is that toward greater contact between artificial intelligence and other engineering disciplines. Not all that long ago, artificial intelligence was viewed as almost entirely separate from control theory and statistics. It had to do with logic and symbols, not numbers. Artificial intelligence was large LISP programs, not linear algebra, differential equations, or statistics. Over the last decades this view has gradually eroded. Modern artificial intelligence researchers accept statistical and control algorithms, for example, as relevant competing methods or simply as tools of their trade. The previously ignored areas lying between artificial intelligence and conventional engineering are now among the most active, including new fields such as neural networks, intelligent control, and our topic, reinforcement learning. In reinforcement learning we extend ideas from optimal control theory and stochastic approximation to address the broader and more ambitious goals of artificial intelligence.

1.2 Examples

A good way to understand reinforcement learning is to consider some of the examples and possible applications that have guided its development.

• A master chess player makes a move. The choice is informed both by planning—anticipating possible replies and counterreplies—and by immediate, intuitive judgments of the desirability of particular positions and moves.

• An adaptive controller adjusts parameters of a petroleum refinery’s operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without sticking strictly to the set points originally suggested by engineers.

• A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.

• A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.

• Phil prepares his breakfast. Closely examined, even this apparently mundane activity reveals a complex web of conditional behavior and interlocking goal–subgoal relationships: walking to the cupboard, opening it, selecting a cereal box, then reaching for, grasping, and retrieving the box. Other complex, tuned, interactive sequences of behavior are required to obtain a bowl, spoon, and milk jug. Each step involves a series of eye movements to obtain information and to guide reaching and locomotion. Rapid judgments are continually made about how to carry the objects or whether it is better to ferry some of them to the dining table before obtaining others. Each step is guided by goals, such as grasping a spoon or getting to the refrigerator, and is in service of other goals, such as having the spoon to eat with once the cereal is prepared and ultimately obtaining nourishment.


These examples share features that are so basic that they are easy to overlook. All involve interaction between an active decision-mak