Because proper tail calls are necessary for object-oriented languages,
we can't quite use the standard calling conventions unmodified when
compiling OO languages efficiently to ARM architectures.
Here's one approach to a non-standard, efficient,
tail-call-supporting calling convention that I've been exploring
recently.
The big change from the standard is that we do not move the stack
pointer down over outbound arguments when we make a call.
Instead, the callee moves the stack pointer as they see fit. The
reason for this is so that the callee can tail-call someone else
without having to do any hairy adjusting of the frame, and so that the
original caller doesn't have to know anything about what's left to
clean up when they receive control: all the clean-up has already been
completed.
This bears stating again: just after return from a subroutine, all
clean-up has already been completed.
In the official standard, the stack space used to communicate
arguments to a callee is owned by the caller. In this modified
convention, that space is owned by the callee as soon as control is
transferred.
Other aspects of the convention are similar to the AAPCS standard:
- keep the stack Full Descending, just like the standard.
- ensure it is 8-byte aligned at all times, just like (a slight
restriction of) the standard.
- make outbound arguments leftmost-low in memory, that is, "pushed
from right to left". This makes the convention compatible with
naive C struct overlaying of memory.
- furthermore, ensure argument 0 in memory is also 8-byte aligned.
Details of the stack layout
Consider compiling a single subroutine, either a leaf or a non-leaf
routine. We need to allocate stack space to incoming arguments, to
saved temporaries, to outbound arguments, and to padding so we
maintain the correct stack alignment. Let
  Ni = inward-arg-count, the number of arguments the routine expects
  No = most-tail-args, the largest number of outbound tail-call
       arguments the routine produces
  Nt = inward-temp-count, the number of temps the routine requires
  Na = outward-arg-count, the number of arguments supplied in a
       particular call the routine makes to some other routine
Upon entry to the routine, where Ni=5, No=7, Nt=3, Na=3, we
have the following stack layout. Recall that stacks are
full-descending.
(low)                                                      (high)
| outbound  |   | temps     |   |shuffle| inbound           |   |
| 0 | 1 | 2 |---| 0 | 1 | 2 |---| - | - | 0 | 1 | 2 | 3 | 4 |---|
                ^                                               ^
         sp for non-leaf                                   sp for leaf
I've marked two interesting locations in the stack: the position of
the stack pointer for leaf routines, and the position of the stack
pointer for non-leaf routines, which need some space of their own to
store their internal state at times when they delegate to another
routine. Leaf routines simply leave the stack pointer in place as they
start execution; non-leaf routines adjust the stack pointer themselves
as control arrives from their caller.
Note that the first four arguments are transferred in registers, but
that stack slots still need to be reserved for them. Note also the
padding after the outbound arguments, the temps, and the
inbound/shuffle-space.
The shuffle-space is used to move values around during preparation for
a tail call whenever the routine needs to supply more arguments to the
tail-called routine than it received in turn from its caller.
The extra shuffle slots are only required if there's no room in the
inbound slots plus padding. For example, if Ni=5 and No=6, then since
we expect the inbound arguments to have one slot of padding, that slot
can be used as shuffle space.
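As a rough sketch of the layout rules above, the following Python lists the 4-byte slots from low to high address. The function and label names are made up for illustration, and the sketch assumes the tail-call area is sized as the larger of the padded inbound area and the padded No slots, which is my reading of the shuffle-space rule rather than something stated explicitly.

```python
def pad8(x):
    """Round x up to the nearest multiple of 8."""
    return (x + 7) & ~7

def stack_layout(ni, no, nt, na):
    """Label each 4-byte slot, from low address to high address."""
    def region(name, count, size_bytes):
        # 'count' named slots followed by padding up to size_bytes.
        slots = [f"{name}{i}" for i in range(count)]
        return slots + ["pad"] * (size_bytes // 4 - count)

    inbound_bytes = pad8(ni * 4)
    # Assumption: extra shuffle slots appear only when the padded
    # inbound area is too small for the largest tail call.
    tail_bytes = max(pad8(no * 4), inbound_bytes)
    shuffle = ["shuffle"] * ((tail_bytes - inbound_bytes) // 4)
    return (region("out", na, pad8(na * 4))
            + region("temp", nt, pad8(nt * 4))
            + shuffle
            + region("in", ni, inbound_bytes))

layout = stack_layout(ni=5, no=7, nt=3, na=3)
print(" | ".join(layout))
```

With Ni=5 and No=7 this gives two shuffle slots, as in the diagram; with Ni=5 and No=6 it gives none, because the existing padding slot after the inbound arguments already provides the room.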
Addressing calculations
Leaf procedures do not move the stack pointer on entry. Nonleaf
procedures do move the stack pointer on entry. This means we have
different addressing calculations depending on whether we're a leaf or
nonleaf procedure.
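Pad8(x) = x rounded up to the nearest multiple of 8.
sp_delta = Pad8(No * 4) + Pad8(Nt * 4), the distance SP might move on entry and exit.
As written, sp_delta assumes Pad8(No * 4) >= Pad8(Ni * 4). If the
inbound area is the larger of the two, Pad8(Ni * 4) has to stand in
for Pad8(No * 4), otherwise the temps would overlap the inbound
arguments.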
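For concreteness, here is a small Python rendering of these two definitions, evaluated for the running example (No=7, Nt=3). The names pad8 and sp_delta simply mirror the definitions above; the bit-masking trick is one common way to round up to a multiple of 8 and is my choice, not part of the convention.

```python
def pad8(x):
    """Round x up to the nearest multiple of 8."""
    return (x + 7) & ~7

def sp_delta(no, nt):
    """Bytes the stack pointer moves on entry to a non-leaf routine."""
    return pad8(no * 4) + pad8(nt * 4)

print(pad8(20), pad8(28), pad8(12))  # 24 32 16
print(sp_delta(no=7, nt=3))          # 48
```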
Leaf procedures, where the stack pointer does not move on entry to the
routine:
  inward(n) = rn, if n < 4
            | sp - Pad8(Ni * 4) + (n * 4), otherwise
  temp(n) = sp - sp_delta + (n * 4)
  outward(n) (tail calls only) = rn, if n < 4
                               | sp - Pad8(Na * 4) + (n * 4), otherwise
Nonleaf procedures, where the stack pointer moves down by sp_delta
bytes on entry to the routine:
  inward(n) = rn, if n < 4
            | sp + sp_delta - Pad8(Ni * 4) + (n * 4), otherwise
  temp(n) = sp + (n * 4)
  outward(n) (non-tail calls) = rn, if n < 4
                              | sp - Pad8(Na * 4) + (n * 4), otherwise
  outward(n) (tail calls) = rn, if n < 4
                          | sp + sp_delta - Pad8(Na * 4) + (n * 4), otherwise
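The two sets of formulas should name the same memory locations, just relative to different values of sp. The following Python sketch checks that for the running example (Ni=5, No=7, Nt=3). The function signatures, the `leaf` and `tail` flags, and the entry address 0x1000 are all invented for illustration.

```python
def pad8(x):
    """Round x up to the nearest multiple of 8."""
    return (x + 7) & ~7

def inward(n, sp, ni, delta, leaf):
    """Register name or address of inbound argument n."""
    if n < 4:
        return f"r{n}"
    base = sp if leaf else sp + delta
    return base - pad8(ni * 4) + n * 4

def temp(n, sp, delta, leaf):
    """Address of temporary n."""
    return (sp - delta if leaf else sp) + n * 4

def outward(n, sp, na, delta, leaf, tail):
    """Register name or address of outbound argument n."""
    if n < 4:
        return f"r{n}"
    if leaf and not tail:
        raise ValueError("a leaf routine makes no non-tail calls")
    # Non-leaf tail calls build their arguments up in the inbound area.
    base = sp + delta if (tail and not leaf) else sp
    return base - pad8(na * 4) + n * 4

entry_sp = 0x1000                    # hypothetical sp at entry
delta = pad8(7 * 4) + pad8(3 * 4)    # sp_delta for No=7, Nt=3
leaf_sp = entry_sp                   # leaf: sp stays put
nonleaf_sp = entry_sp - delta        # non-leaf: sp moves down

print(hex(inward(4, leaf_sp, 5, delta, leaf=True)))
print(hex(inward(4, nonleaf_sp, 5, delta, leaf=False)))
```

Both calls print the same address, eight bytes below the entry sp, which is where the diagram places inbound argument 4.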
Variations
This convention doesn't easily support varargs. One option would be to
sacrifice simple C struct overlaying of the inbound argument stack
area, flipping arguments so they are pushed from left to right instead
of from right to left. That way, the first argument is always at a
known location.
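To see why the right-to-left layout is awkward for varargs, note that the inward(n) formula places slot 0 at sp - Pad8(N * 4) on entry, where N is the argument count. A quick Python sketch (the helper name arg0_offset is made up here) shows that offset shifting as the count changes, so a callee that doesn't know the count statically can't find its first argument slot.

```python
def pad8(x):
    """Round x up to the nearest multiple of 8."""
    return (x + 7) & ~7

def arg0_offset(arg_count):
    """Byte offset of argument slot 0 from sp on entry,
    following inward(0) = sp - Pad8(N * 4)."""
    return -pad8(arg_count * 4)

offsets = {n: arg0_offset(n) for n in range(1, 7)}
print(offsets)
```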
Another option would be to use an argument count at the end of the
argument list in the varargs case. This requires both the caller and
callee to be aware that a varargs-specific convention is being used.
Of course, varargs may not even be required: instead, a vector could
be passed in as a normal argument. Whether this makes sense or not
depends on the language being compiled.