Lecture notes addendum: The one thing we discussed that's not in the
lecture notes is how to do stack arguments for tail calls: it turns
out that the recursive caller has to write the stack arguments into
its caller's stack argument zone. This gets hairier when you have
non-self-recursive tail calls, because then the original caller needs
to allocate stack space for arguments that it never writes to: for
example, if A non-tail-calls B (which has 3 arguments), and B
tail-calls C (which has 10 arguments), A has to allocate stack space
for 4 stack arguments even though it itself never calls anything that
uses stack arguments.
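
The bookkeeping above can be sketched as a small worklist computation.
This is a hypothetical sketch, not code from the compiler: `arity` and
`tail_calls` stand in for per-function results you'd already have from
earlier passes, and it assumes the System V convention of six register
arguments.

```python
# Hypothetical sketch: how many stack-argument slots must a function's
# caller reserve? Under the System V convention, the first 6 arguments
# travel in registers and the rest spill to the stack.

ARG_REGS = 6

def stack_args(arity):
    # arguments that don't fit in registers go on the stack
    return max(0, arity - ARG_REGS)

def stack_arg_zone(fn, arity, tail_calls):
    # The caller of fn must reserve the max over fn itself and every
    # function fn can reach through tail calls (transitively).
    seen, work, need = set(), [fn], 0
    while work:
        g = work.pop()
        if g in seen:
            continue
        seen.add(g)
        need = max(need, stack_args(arity[g]))
        work.extend(tail_calls[g])
    return need

# A non-tail-calls B (3 args); B tail-calls C (10 args).
# A must reserve space for C's 4 stack arguments.
arity = {"B": 3, "C": 10}
tail_calls = {"B": ["C"], "C": []}
print(stack_arg_zone("B", arity, tail_calls))  # → 4
```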

Today I wanted to talk about one important optimization in compilers:
tail call optimization.

This might end up being a pretty quick lecture because the
optimization itself is pretty straightforward, but this is an
optimization important enough that it's actually explicit in the
Racket specification: you don't technically have a "Racket
implementation" until you have this. To see what this is about, let's
take a look at the "same" program written in two languages. First of
all, Racket:

[pull up racket_infinite_loop.rkt]
(define (f)
  (f))
(f)

So, this is obviously an infinite loop, right? That's cool, we're down
with infinite loops. Let's run this program. [do so] And look at
that, it's looping infinitely.

Now, compare this with another program that looks identical and
should behave identically in theory, except this time, it's written in
Python.

[pull up python_infinite_loop.py]
def f():
    f()
f()

Let's run this version. [do so] And this time, the program
explodes. The difference is that while Racket performs tail call
optimization, Python somewhat famously doesn't. And while we might not
be too concerned about this program in particular, there are lots of
functions that are invoked recursively many times before eventually
returning a result --- we don't want to have to worry about our
program dying if a function loops more than an arbitrary number of
times.

So specifically, the thing that causes Python to die here is that
every time the function is recursively applied, a new stack frame is
allocated, and eventually, the stack just takes up so much space that
the execution is automatically halted. Python, of course, is an
interpreted language (as is regular Racket), but basically the same
principle holds in our compiled code: every time a recursive call
happens, the prelude of the procedure is going to push all the
callee-save registers and move the stack pointer to create enough
space for spilled variables. Eventually, this is going to result in a
segfault as the stack space used by the program exceeds that which is
provided to it by the OS (or at least that's my understanding).
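
We can make the Python failure mode concrete. This is a sketch: the
exact depth you reach before the interpreter gives up depends on the
configured recursion limit, so don't expect a particular number.

```python
# Each recursive application allocates a frame; CPython halts the
# program with RecursionError once the stack hits its limit.
import sys

def f(depth=0):
    try:
        return f(depth + 1)
    except RecursionError:
        return depth  # how deep we got before the interpreter gave up

sys.setrecursionlimit(2000)
print(f())  # a number somewhat below 2000, not an infinite loop
```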

The answer that tail call elimination provides is that in some
circumstances we can simply reuse the current stack frame. We can only
do this if the recursive call that would be generating the new frame
is the very last thing that happens in this particular iteration of
the function call. So, given this program, we can achieve efficiency
by changing the callq instruction generated for the recursive call
into a straight, unconditional jump to the beginning of the procedure
code. There are some subtleties that I'll get into, but that's really
the core of it.
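
To see what "turn the callq into a jmp" means operationally, here's
the same transformation done by hand in Python: rebinding the argument
and jumping back to the top of the body is just a loop, so no new
frame is allocated per iteration. (The function names here are made up
for illustration.)

```python
def countdown_rec(n):
    # tail-recursive form: one stack frame per iteration,
    # so a large n would blow the stack in Python
    if n == 0:
        return "done"
    return countdown_rec(n - 1)

def countdown_tco(n):
    # the same logic after "callq -> jmp"
    while True:        # the jump target: the top of the body
        if n == 0:
            return "done"
        n = n - 1      # rebind the argument, then "jump"

print(countdown_tco(10**6))  # fine: constant stack space
```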

To implement this, the first step is to mark which recursive calls can
be eliminated. I recommend doing this as part of reveal-functions,
because when the program is still in Racket form it's very clear which
calls are in tail position and which aren't. For now, we're only
talking about eliminating recursive tail calls, so only calls inside
(define)'d functions can be eliminated.
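
The marking pass itself can be sketched as a walk over the function
body that tracks whether we're in tail position. This is a
hypothetical sketch over s-expressions-as-nested-lists, not the actual
pass: it only handles `if` and `app`, which is enough to show the
idea.

```python
# Sketch of the reveal-functions tweak: rewrite recursive (app ...)
# forms in tail position to (tailcall ...).

def mark_tails(expr, self_name, tail=True):
    if not isinstance(expr, list):
        return expr
    head = expr[0]
    if head == 'if':
        # the condition is never in tail position; branches inherit it
        _, cond, then, els = expr
        return ['if', mark_tails(cond, self_name, False),
                      mark_tails(then, self_name, tail),
                      mark_tails(els, self_name, tail)]
    if head == 'app':
        target = expr[1]
        op = ('tailcall'
              if tail and target == ['function-ref', self_name]
              else 'app')
        return [op, target] + [mark_tails(a, self_name, False)
                               for a in expr[2:]]
    # any other operator: its arguments are not in tail position
    return [head] + [mark_tails(a, self_name, False) for a in expr[1:]]

body = ['if', ['eq?', 'n', 0],
        'prod',
        ['app', ['function-ref', 'times-iter'],
                ['-', 'n', 1], 'm', ['+', 'prod', 'm']]]
print(mark_tails(body, 'times-iter')[3][0])  # → tailcall
```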

So given a program like this:

(define (times n m)
  (if (eq? n 0)
      0
      (+ m (times (- n 1) m))))

we will do our normal revealing of functions:

(define (times n m)
  (if (eq? n 0)
      0
      (+ m (app (function-ref times) (- n 1) m))))

Since the addition happens after the recursive call, we can't reuse
the stack frame for the call: there are still things happening within
the current frame after the call returns. But if we change the program
like so:

(define (times-iter n m prod)
  (if (eq? n 0)
      prod
      (times-iter (- n 1) m (+ prod m))))

we can see that the recursive call is at the end of one path through
the function, so we can change it to

(define (times-iter n m prod)
  (if (eq? n 0)
      prod
      (tailcall (function-ref times-iter) (- n 1) m (+ prod m))))
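
As a sanity check, here are the two definitions transcribed to Python
(the `prod` accumulator defaults to 0 here purely for convenience;
the Racket version takes it explicitly):

```python
def times(n, m):
    # non-tail: the addition happens after the call returns
    if n == 0:
        return 0
    return m + times(n - 1, m)

def times_iter(n, m, prod=0):
    # tail: nothing is left to do in this frame after the call
    if n == 0:
        return prod
    return times_iter(n - 1, m, prod + m)

print(times(7, 6), times_iter(7, 6))  # → 42 42
```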

The downside of introducing a new AST form this early, of course, is
that we have to propagate it all the way through the rest of our
passes. Once we get into the C language, I recommend treating tailcall
as a new statement, just like assign and return. This is because it
really is more like the return statement than anything else: it's not
going to write a result to the LHS, like an assign would; it's just
going to do a jump and then relinquish control over returning the
result to the callee. You'll have to propagate this statement through
all the C passes, but it's very straightforward to do so, with one
exception.

The one pass that is interesting among the C passes is
uncover-call-live-roots. Remember, one of the things that this pass
does is, when it sees a function call:

(assign lhs (app (function-ref f) y z))

Given the set '(somevector someclosure) of live heap values, we have
to compile this to

(call-live-roots (somevector someclosure) (assign lhs (app (function-ref f) y z)))

And this will in turn get compiled into instructions that push
somevector and someclosure onto the root stack and then pop them off
afterwards, so that they don't get garbage collected while the
recursive call is executing.

But this seems like a really bad thing for tail calls, right? We can't
do anything after a tail call, because we've wiped out this stack
frame and we'll never return to it.

QUESTION: what's the solution here?

ANSWER: Actually, there isn't a problem at all! Like I said, we're
never returning to this stack frame, so everything in the frame is
dead at the point that we make the tail call, except for the arguments
to the function. And the callee won't collect those if they're alive
when it executes. So in fact, we don't need to insert a
call-live-roots around tailcalls.

When I said that this pass was interesting, I meant that it's
interesting for what we _don't_ have to do, rather than what we do.

Now, like regular apps, tailcalls go away when we perform instruction
selection. Recall that when we did instruction selection on regular
apps, we turned them into "indirect callq"s

(indirect-callq (reg rax))

which eventually we print out as

callq *%rax

It shouldn't be surprising that there's an analogous form for indirect
jumps: we'll introduce

(indirect-jmp (reg rax))

into our pseudo-x86 language and then print it out as

jmp *%rax
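
The printing step is mechanical. A sketch, assuming instructions are
represented as tagged tuples (your AST representation will differ):

```python
# Hypothetical print-x86 cases for the two indirect-transfer forms.
def print_instr(instr):
    op = instr[0]
    if op == 'indirect-callq':
        return f'callq *%{instr[1][1]}'   # instr[1] is ('reg', name)
    if op == 'indirect-jmp':
        return f'jmp *%{instr[1][1]}'
    raise ValueError(f'unknown instruction: {op}')

print(print_instr(('indirect-jmp', ('reg', 'rax'))))  # → jmp *%rax
```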

So our times function will end up looking something like this


        .globl times
times:
        pushq %rbp
        movq %rsp, %rbp
        pushq %r14
        pushq %r13
        pushq %r12
        pushq %rbx
        subq $16, %rsp

        movq %rdi, n
        movq %rsi, m
        movq %rdx, prod

        ... do stuff ...

        leaq times(%rip), %r12
        ... closure stuff ...

        movq n, %rdi
        movq m, %rsi
        movq prod, %rdx
        jmp *%r12

        ... other cases ...

And then we're cool, right? Except ---

QUESTION: Can anybody spot what the problem is here?

ANSWER: We're jumping to _before_ the function prelude, so for every
iteration, we're still doing some of the work to allocate a new frame:
pushing the base pointer, allocating stack space, etc. That's bad!

The way that I solved this, which perhaps isn't the cleverest, is to
introduce a new label that marks the end of the prelude and the
beginning of the function's body. I then used that as the indirect-jmp
target:

        .globl times
times:
        ... prelude ...

times_body:
        movq %rdi, n
        movq %rsi, m
        movq %rdx, prod

        ... do stuff ...

        leaq times_body(%rip), %r12
        ... stuff ...

Then allllll the way back in the modified version of reveal-functions,
I change function-refs within tailcalls to append "_body" to the
target label. Maybe a better way to do this, though, is to make sure
that the prelude is always of constant length and then jump to an
offset from the function entry label.

So, that's the story for recursive tail calls. What about tail calls
to other functions, or mutually recursive tail calls? For example,

(define (odd n)
  (if (eq? n 0)
      #f
      (even (- n 1))))
(define (even n)
  (if (eq? n 0)
      #t
      (odd (- n 1))))

The recursive calls to even and odd are in tail position, so we want
to eliminate them too. We can do so, but in general this requires a
bit of care, because now we are going to reuse the same stack frame
for calls to different functions. So when we select instructions for a
function with tail calls to other functions, we need to make sure that
the frame has enough space for the spilled locals and arguments for
_every_ function that can get tailcalled, transitively: if function A
tailcalls function B, which tailcalls function C, then function A's
stack frame needs to be the max of what A, B, and C alone would need.
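
That max is a reachability computation over the tail-call graph, which
may contain cycles (as with even and odd). A hypothetical sketch,
where `frame_size` and `tail_calls` stand in for per-function results
from earlier passes:

```python
# Sketch: the frame a function allocates must cover every function it
# can reach by tail calls. A worklist over the (possibly cyclic)
# tail-call graph computes the max.

def unified_frame_size(fn, frame_size, tail_calls):
    seen, work, need = set(), [fn], 0
    while work:
        g = work.pop()
        if g in seen:
            continue
        seen.add(g)
        need = max(need, frame_size[g])
        work.extend(tail_calls[g])
    return need

frame_size = {'A': 16, 'B': 32, 'C': 48}
tail_calls = {'A': [], 'B': ['C'], 'C': ['B']}  # B and C tail-call each other
print(unified_frame_size('B', frame_size, tail_calls))  # → 48
```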

Alternatively, we can roll back the stack space and jump to a point
after the callee-saved registers are pushed but before the stack space
is allocated.

This also means that we have to have this information to perform a
tailcall; if we don't have it, we have to fall back to normal
calls. For example,

(define (foo fn)
  (fn 42))

The call to fn is in tail position here, but fn is an arbitrary
function value, and we have no way of knowing how much stack space is
needed for it. For that reason, we have to fall back to regular calls.