A calling convention for ARM that supports proper tail-calls efficiently
Tue 27 Nov 2012 12:06 EST
Because proper tail calls are necessary for object-oriented languages, we can’t quite use the standard calling conventions unmodified when compiling OO languages efficiently to ARM architectures.
Here’s one approach to a non-standard, efficient, tail-call-supporting calling convention that I’ve been exploring recently.
The big change from the standard is that we do not move the stack pointer down over outbound arguments when we make a call.
Instead, the callee moves the stack pointer as they see fit. The reason for this is so that the callee can tail-call someone else without having to do any hairy adjusting of the frame, and so that the original caller doesn’t have to know anything about what’s left to clean up when they receive control: all the clean-up has already been completed.
This bears stating again: just after return from a subroutine, all clean-up has already been completed.
In the official standard, the stack space used to communicate arguments to a callee is owned by the caller. In this modified convention, that space is owned by the callee as soon as control is transferred.
Other aspects of the convention are similar to the AAPCS standard:
- keep the stack Full Descending, just like the standard.
- ensure it is 8-byte aligned at all times, just like (a slight restriction of) the standard.
- make outbound arguments leftmost-low in memory, that is, “pushed from right to left”. This makes the convention compatible with naive C struct overlaying of memory.
- furthermore, ensure argument 0 in memory is also 8-byte aligned.
Details of the stack layout
Consider compiling a single subroutine, either a leaf or a non-leaf routine. We need to allocate stack space to incoming arguments, to saved temporaries, to outbound arguments, and to padding so we maintain the correct stack alignment. Let
- Ni= inward-arg-count, the number of arguments the routine expects
- No= most-tail-args, the largest number of outbound tail-call arguments the routine produces
- Nt= inward-temp-count, the number of temps the routine requires
- Na= outward-arg-count, the number of arguments supplied in a particular call the routine makes to some other routine
Upon entry to the routine, where Ni=5, No=7, Nt=3, Na=3, we
have the following stack layout. Recall that stacks are
full-descending.
(low)                                                               (high)
    | outbound  |   |   temps   |   |shuffle|      inbound      |
    | 0 | 1 | 2 |---| 0 | 1 | 2 |---| - | - | 0 | 1 | 2 | 3 | 4 |---|
                    ^                                               ^
                  sp for non-leaf                                sp for leaf
I’ve marked two interesting locations in the stack: the position of the stack pointer for leaf routines, and the position of the stack pointer for non-leaf routines, which need some space of their own to store their internal state at times when they delegate to another routine. Leaf routines simply leave the stack pointer in place as they start execution; non-leaf routines adjust the stack pointer themselves as control arrives from their caller.
Note that the first four arguments are transferred in registers, but that stack slots still need to be reserved for them. Note also the padding after the outbound arguments, the temps, and the inbound/shuffle-space.
The shuffle-space is used to move values around during preparation for a tail call whenever the routine needs to supply more arguments to the tail-called routine than it received in turn from its caller.
The extra shuffle slots are only required if there’s no room in the inbound slots plus padding. For example, if Ni=5 and No=6, then since we expect the inbound arguments to have one slot of padding, that slot can be used as shuffle space.
Addressing calculations
Leaf procedures do not move the stack pointer on entry. Nonleaf procedures do move the stack pointer on entry. This means we have different addressing calculations depending on whether we’re a leaf or nonleaf procedure.
- Pad8(x)= x rounded up to the nearest multiple of 8.
- sp_delta = Pad8(No * 4) + Pad8(Nt * 4), the distance SP might move on entry and exit.
Leaf procedures, where the stack pointer does not move on entry to the routine:
inward(n) = rn, if n < 4
          | sp - Pad8(Ni * 4) + (n * 4)
temp(n) = sp - sp_delta + (n * 4)
outward(n) (tail calls only) = rn, if n < 4
                             | sp - Pad8(Na * 4) + (n * 4)
Nonleaf procedures, where the stack pointer moves down by sp_delta
bytes on entry to the routine:
inward(n) = rn, if n < 4
          | sp + sp_delta - Pad8(Ni * 4) + (n * 4)
temp(n) = sp + (n * 4)
outward(n) (non-tail calls) = rn, if n < 4
                            | sp - Pad8(Na * 4) + (n * 4)
outward(n) (tail calls) = rn, if n < 4
                        | sp + sp_delta - Pad8(Na * 4) + (n * 4)
Variations
This convention doesn’t easily support varargs. One option would be to sacrifice simple C struct overlaying of the inbound argument stack area, flipping arguments so they are pushed from left to right instead of from right to left. That way, the first argument is always at a known location.
Another option would be to use an argument count at the end of the argument list in the varargs case. This requires both the caller and callee to be aware that a varargs-specific convention is being used.
Of course, varargs may not even be required: instead, a vector could be passed in as a normal argument. Whether this makes sense or not depends on the language being compiled.


Comments (closed)
Do you have GC, or do you not /really/ have proper tail calls? That is, do you ever stack allocate vectors and tail call with them as arguments? This would not be allowed for proper tail calls - that vector would be considered escaping and thus have to be heap allocated. In that case, you get into some lifetime difficulties that can only be fully eradicated by GC or a restricted computational model for which you can statically determine all points in the code that are reached if and only if heap allocated (let's say specifically heap allocated for tail calling) resources become dead (so you can introduce a free at exactly the right time). I'm gonna go out on a limb and say... not proper.
Stack allocation requires you to wait around for the called subroutine in order to then release the allocated space, so it isn't really a tail call. Imagine modelling stack allocation in Scheme:
(let ((v (stack-alloc!)))
(let ((result (do-something-with v)))
(stack-release! v)
result))
The call to do-something-with can't be in tail position, because there's a pending storage-reclamation-action in its continuation.