
Commit b80f62c

Usage examples for PassiveTDAgent
1 parent 2d0d765 commit b80f62c

File tree

2 files changed: +181 -2 lines changed

images/mdp.png

824 Bytes

rl.ipynb

Lines changed: 181 additions & 2 deletions
@@ -13,12 +13,14 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 1,
 "metadata": {
 "collapsed": true
 },
 "outputs": [],
-"source": []
+"source": [
+"from rl import *"
+]
 },
 {
 "cell_type": "markdown",
@@ -37,6 +39,183 @@
 "\n",
 "In summary we have a sequence of state action transitions with rewards associated with some states. Our goal is to find the optimal policy (pi) which tells us what action to take in each state."
 ]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Passive Reinforcement Learning\n",
+"\n",
+"In passive Reinforcement Learning the agent follows a fixed policy and tries to learn the Reward function and the Transition model (if it is not already aware of them).\n",
+"\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Passive Temporal Difference Agent\n",
+"\n",
+"The PassiveTDAgent class in the rl module implements the Agent Program (notice the usage of the word Program) described in **Fig 21.4** of the AIMA Book. PassiveTDAgent uses temporal differences to learn utility estimates. In simple terms, we learn the difference between successive states and back up the values to previous states while following a fixed policy. Let us look at the source before we see some usage examples."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 3,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"%psource PassiveTDAgent"
+]
+},
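The update this agent performs (AIMA **Fig 21.4**) is U(s) ← U(s) + alpha(Ns[s]) * (reward + gamma * U(s') − U(s)). Below is a minimal sketch of a single such update, with illustrative names (td_update, U, Ns) rather than the actual rl.py code:

```python
# Minimal sketch of one temporal-difference backup; not the rl.py source.
def td_update(U, Ns, s, reward, s_next, alpha, gamma=0.9):
    """Nudge U[s] towards reward + gamma * U[s_next] by a step of size alpha(n)."""
    Ns[s] = Ns.get(s, 0) + 1          # visit count drives the decaying learning rate
    u_s, u_next = U.get(s, 0.0), U.get(s_next, 0.0)
    U[s] = u_s + alpha(Ns[s]) * (reward + gamma * u_next - u_s)
    return U[s]

# Example: one backup from (0, 0) to (0, 1) with the book's alpha schedule.
U, Ns = {}, {}
td_update(U, Ns, (0, 0), -0.04, (0, 1), alpha=lambda n: 60. / (59 + n))
```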
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"The Agent Program can be obtained by creating an instance of the class and passing the appropriate parameters. Because of the `__call__` method, the object that is created behaves like a callable and returns an appropriate action, as most Agent Programs do. To instantiate the object we need a policy (pi) and an MDP whose state utilities are to be estimated. Let us import a GridMDP object from the mdp module. **Fig[17, 1]** is similar to **Fig 21.1** but has some discounting, as **gamma = 0.9**."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 4,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"from mdp import Fig"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 5,
+"metadata": {
+"collapsed": false
+},
+"outputs": [
+{
+"data": {
+"text/plain": [
+"<mdp.GridMDP at 0x7f1f0c77ab00>"
+]
+},
+"execution_count": 5,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"Fig[17,1]"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"**Fig[17,1]** is a GridMDP object and is similar to the grid shown in **Fig 21.1**. The rewards in the terminal states are **+1** and **-1**, and **-0.04** in the rest of the states. <img src=\"files/images/mdp.png\"> Now we define a policy similar to **Fig 21.1** in the book."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 6,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"policy = {(0, 0): (0, 1),\n",
+" (0, 1): (0, 1),\n",
+" (0, 2): (1, 0),\n",
+" (1, 0): (-1, 0),\n",
+" (1, 2): (1, 0),\n",
+" (2, 0): (-1, 0),\n",
+" (2, 1): (0, 1),\n",
+" (2, 2): (1, 0),\n",
+" (3, 0): (-1, 0),\n",
+" (3, 1): None,\n",
+" (3, 2): None,\n",
+" }"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let us create our object now. We also use the **same alpha** as given in the footnote of the book on **page 837**."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 7,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"our_agent = PassiveTDAgent(policy, Fig[17,1], alpha=lambda n: 60./(59+n))"
+]
+},
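Because of `__call__`, our_agent is itself the agent program: it can be called with a percept and returns the fixed policy's action. A hedged illustration, assuming (as in the rl module) that a percept is a (state, reward) pair:

```python
# Assumed percept format: (state, reward), as used by the rl module's agents.
percept = ((0, 0), -0.04)       # start square together with its living reward
action = our_agent(percept)     # records the percept, updates estimates, returns an action
print(action)                   # should print (0, 1): the fixed policy's action for (0, 0)
```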
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"The rl module also has a simple function to simulate trials, called **run_single_trial**. Now we can try our implementation. We can also compare the utility estimates learned by our agent to those obtained via **value iteration**.\n"
+]
+},
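A hedged example of a single simulated trial; the signature run_single_trial(agent_program, mdp) is assumed from the rl module:

```python
# Simulate one trial of the fixed-policy agent through the 4x3 grid world.
run_single_trial(our_agent, Fig[17, 1])
```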
+{
+"cell_type": "code",
+"execution_count": 8,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"from mdp import value_iteration"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"The values calculated by value iteration:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 9,
+"metadata": {
+"collapsed": false
+},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"{(0, 1): 0.3984432178350045, (1, 2): 0.649585681261095, (3, 2): 1.0, (0, 0): 0.2962883154554812, (3, 0): 0.12987274656746342, (3, 1): -1.0, (2, 1): 0.48644001739269643, (2, 0): 0.3447542300124158, (2, 2): 0.7953620878466678, (1, 0): 0.25386699846479516, (0, 2): 0.5093943765842497}\n"
+]
+}
+],
+"source": [
+"print(value_iteration(Fig[17,1]))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Now the values estimated by our agent after 200 trials."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {
