If x1 contains 0x10001000, then I believe that these two instructions do the same thing: they load four bytes from 0x10001004-0x10001007 into w6.
ldr w6, [x1, #4] // "Unsigned Offset" (immediate)
ldur w6, [x1, #4] // "Unscaled" load
- Link to the LDR instruction (scroll down to the “Unsigned Offset” variant).
- Link to the LDUR instruction.
To summarize my understanding, the 32-bit LDR (immediate) can only use non-negative multiples of 4 as the offset, and the 64-bit LDR (immediate) can only use non-negative multiples of 8. It's clear that these next two instructions are illegal:
ldr w6, [x1, #1]  // ILLEGAL: offset is not a multiple of 4
ldr w6, [x1, #-4] // ILLEGAL: offset is negative
If I want a negative offset, or if I want an unaligned transfer, then I must use ldur:
ldur w6, [x1, #1]  // Loads bytes from 0x10001001-0x10001004
ldur w6, [x1, #-4] // Loads bytes from 0x10000ffc-0x10000fff
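To make my mental model of the two encodings concrete, here is a rough sketch (my own model of the rules, not authoritative): LDR (immediate, unsigned offset) takes a non-negative offset that is a multiple of the access size, scaled into a 12-bit field, while LDUR takes any signed 9-bit byte offset.

```python
def ldr_imm_ok(offset, size):
    """LDR (immediate, unsigned offset): offset must be a non-negative
    multiple of the access size, with offset/size fitting in 12 bits."""
    return offset >= 0 and offset % size == 0 and offset // size <= 4095

def ldur_ok(offset):
    """LDUR: any signed 9-bit byte offset, no alignment requirement."""
    return -256 <= offset <= 255

# 32-bit (size=4) examples from above:
print(ldr_imm_ok(4, 4), ldur_ok(4))    # True True  -> either encoding works
print(ldr_imm_ok(1, 4), ldur_ok(1))    # False True -> must use ldur
print(ldr_imm_ok(-4, 4), ldur_ok(-4))  # False True -> must use ldur
```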
Question #1:
Why are there two different instructions? Why not just simplify and do this under the hood? I presume that an ldur may have much more latency than an ldr (immediate). Is ARM just trying to make it very visible that your code may be slower?
Question #2:
Obviously, SVE (vectors) would be better for this application, but I'm just using it to focus my understanding of ldur. On a recent Neoverse, which of these approaches would you expect to be faster/better? (QEMU doesn't model what I need, so I cannot measure.)
#1 (unscaled offsets):

ldur x2, [x1, #-4]
ldur x3, [x1, #-8]
ldur x4, [x1, #-12]
ldur x5, [x1, #-16]
ldur x6, [x1, #-20]
ldur x7, [x1, #-24]
ldur x8, [x1, #-28]
ldur x9, [x1, #-32]

#2 (explicit sub before each load):

sub x1, x1, 4
ldr x2, [x1]
sub x1, x1, 4
ldr x3, [x1]
sub x1, x1, 4
ldr x4, [x1]
sub x1, x1, 4
ldr x5, [x1]
sub x1, x1, 4
ldr x6, [x1]
sub x1, x1, 4
ldr x7, [x1]
sub x1, x1, 4
ldr x8, [x1]
sub x1, x1, 4
ldr x9, [x1]

#3 (post-increment to destroy x1):

ldr x2, [x1], #-4
ldr x3, [x1], #-4
ldr x4, [x1], #-4
ldr x5, [x1], #-4
ldr x6, [x1], #-4
ldr x7, [x1], #-4
ldr x8, [x1], #-4
ldr x9, [x1], #-4
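In case it helps to compare them, here is how I model the effective addresses of the three variants (a quick sketch; note that the post-index form reads from the *current* x1 before decrementing it):

```python
base = 0x10001000  # example value of x1, as above

# Approach 1: ldur xN, [x1, #-off] -- address is base - off
a1 = [base - 4 * (i + 1) for i in range(8)]

# Approach 2: sub x1, x1, 4 before each load -- same addresses
a2 = []
p = base
for _ in range(8):
    p -= 4
    a2.append(p)

# Approach 3: post-index ldr xN, [x1], #-4 -- loads from the current
# x1, then decrements, so the first load is at base itself
a3 = []
p = base
for _ in range(8):
    a3.append(p)
    p -= 4

print(a1 == a2)        # True
print(a3[0] - a1[0])   # 4 -> approach 3's addresses sit 4 bytes higher
```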
Thanks!