I recently talked to a homebrew developer who was trying to add exception
handlers at link time but found out that Dolphin was overwriting their
exception handlers. I figure that's not the usual way to do exception
handlers, but... making us load the executable after setting up memory
rather than before is easy, and matches what we do when booting discs,
so I suppose there's no reason not to do it. It also matches the intent
of why Dolphin is writing default exception handlers – we're writing
them because some homebrew relies on exception handlers being left
around from whatever program was running before it (see 3dd777be70).
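Let's take advantage of ARM64's input register shifting one last time,
shall we? Multiplication by one minus a power of two (here -3 = 1 - 4)
becomes a single SUB with a shifted input.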
Before:
0x1280005b mov w27, #-0x3
0x1b1b7f18 mul w24, w24, w27
After:
0x4b180b18 sub w24, w24, w24, lsl #2
ARM64's flexible shifting of input registers also allows us to multiply
by a negative power of two in one instruction: shift the input of a NEG
instruction.
Before:
0x128001f7 mov w23, #-0x10
0x1b1a7efa mul w26, w23, w26
0x93407f58 sxtw x24, w26
After:
0x4b1a13fa neg w26, w26, lsl #4
0x93407f58 sxtw x24, w26
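The shifted-operand rewrites can be checked against a plain 32-bit multiply. The following is a minimal sketch, not JIT code: `mul32`, `sub_lsl`, and `neg_lsl` are hypothetical helper names that model the instructions with 32-bit wraparound.

```python
MASK32 = 0xFFFFFFFF

def mul32(a, b):
    """Model of a 32-bit MUL: the low 32 bits of the product."""
    return (a * b) & MASK32

def sub_lsl(rn, rm, shift):
    """Model of `sub wd, wn, wm, lsl #shift` with 32-bit wraparound."""
    return (rn - (rm << shift)) & MASK32

def neg_lsl(rm, shift):
    """Model of `neg wd, wm, lsl #shift`, i.e. a SUB from the zero register."""
    return sub_lsl(0, rm, shift)

samples = [0, 1, 2, 0x32, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
for x in samples:
    # Multiplying by -0x10 matches NEG with the input shifted left by 4.
    assert neg_lsl(x, 4) == mul32(x, -0x10 & MASK32)
    # Multiplying by -0x3 matches x - (x << 2).
    assert sub_lsl(x, x, 2) == mul32(x, -0x3 & MASK32)
```

The sample list includes the extremes of the 32-bit range so that the wraparound cases are exercised as well.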
If the destination register doesn't equal the input register, using it
to temporarily hold the immediate value is fair game as it'll be
overwritten with the result of the multiplication anyway. This can
slightly reduce register pressure.
Before:
0x52800659 mov w25, #0x32
0x1b197f5b mul w27, w26, w25
After:
0x5280065b mov w27, #0x32
0x1b1b7f5b mul w27, w26, w27
By taking advantage of ARM64's ability to shift an input register by any
amount, we can calculate multiplication by a number that is one more
than a power of two with a single instruction.
Before:
0x52800838 mov w24, #0x41
0x1b187f7b mul w27, w27, w24
After:
0x0b1b1b7b add w27, w27, w27, lsl #6
Turn multiplications by a power of two into bitshifts.
Before:
0x52800817 mov w23, #0x40
0x1b167ef6 mul w22, w23, w22
After:
0x531a66d6 lsl w22, w22, #6
Multiplication by one is also trivial. Depending on the registers
involved, either a single MOV or no instructions will be generated.
Before:
0x52800038 mov w24, #0x1
0x1b1a7f1b mul w27, w24, w26
After:
0x2a1a03fb mov w27, w26
Before:
0x52800039 mov w25, #0x1
0x1b1a7f3a mul w26, w25, w26
After:
Nothing!
Add a new function that will handle all the special cases regarding
multiplication. It does nothing for now, but will be expanded in
follow-up commits.
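As a rough illustration of where such a function can end up, here is a sketch of how the immediate might be classified into the special cases described in the other commit messages. The function name, the strategy labels, and the order of the checks are assumptions made for this example, not the actual implementation.

```python
def classify_multiplier(imm):
    """Return (strategy, shift) for multiplying by the signed constant imm.

    The strategy labels are hypothetical and name the instruction that
    would replace the MOV + MUL pair.
    """
    def log2_exact(n):
        # Exponent if n is a positive power of two, otherwise None.
        return n.bit_length() - 1 if n > 0 and n & (n - 1) == 0 else None

    if imm == 1:
        return ("mov_or_nothing", 0)   # x * 1: MOV, or no code at all
    shift = log2_exact(imm)
    if shift is not None:
        return ("lsl", shift)          # x * 2^n: x << n
    shift = log2_exact(imm - 1)
    if shift is not None:
        return ("add_lsl", shift)      # x * (2^n + 1): x + (x << n)
    shift = log2_exact(-imm)
    if shift is not None:
        return ("neg_lsl", shift)      # x * -(2^n): -(x << n)
    shift = log2_exact(1 - imm)
    if shift is not None:
        return ("sub_lsl", shift)      # x * (1 - 2^n): x - (x << n)
    return ("mul", None)               # general case: MOV + MUL
```

For example, `classify_multiplier(0x41)` yields `("add_lsl", 6)` and `classify_multiplier(-3)` yields `("sub_lsl", 2)`, matching the listings in the other commit messages.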
We can merge an SXTW with the SUB, eliminating one instruction. In
addition, it is no longer necessary to allocate a temporary register,
reducing register pressure.
Before:
0x93407f59 sxtw x25, w26
0x93407ebb sxtw x27, w21
0xcb1b033b sub x27, x25, x27
After:
0x93407f5b sxtw x27, w26
0xcb35c37b sub x27, x27, w21, sxtw
Because of the previous commit, `regs_in_use` must not include `dest_reg`
when calling `MMIOLoadToReg`. There are also some other registers we can
skip including in `regs_in_use` just for efficiency's sake.
The `addr_reg_set = false` statements that I've added in this commit are
technically redundant – if `mmio_address` is non-zero then `addr_reg_set`
is already false – but it's just a coincidence that that's the case.
The old calculation was stride * (max_index + 1), which fails if stride is less than the size of a component. For instance, if float XYZ positions are used and the stride is set to 4 (i.e. sizeof(float)) instead of 12 (i.e. 3 * sizeof(float)), the recording would be missing the last 8 bytes of the final element in the array. Or, if stride is set to 0, no bytes would be recorded at all (though that's not a useful configuration, so it's unlikely to actually exist).
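The arithmetic can be worked through with a small sketch. The old formula is the one quoted above. The replacement formula is an assumption made for illustration (the last element starts at stride * max_index and occupies one full component); this message does not spell out the new calculation, and the function names are hypothetical.

```python
def old_array_size(stride, max_index):
    """Old calculation: assumes every element occupies exactly `stride` bytes."""
    return stride * (max_index + 1)

def needed_array_size(stride, max_index, component_size):
    """Assumed requirement: the final element starts at stride * max_index
    and is component_size bytes long, regardless of the stride."""
    return stride * max_index + component_size

# Float XYZ positions (12 bytes each) with a stride of 4 and max_index of 9.
assert old_array_size(4, 9) == 40
assert needed_array_size(4, 9, 12) == 48   # the old size is 8 bytes short

# With a stride of 0, the old formula records nothing at all.
assert old_array_size(0, 9) == 0
assert needed_array_size(0, 9, 12) == 12

# With the usual stride of 12, both calculations agree.
assert old_array_size(12, 9) == needed_array_size(12, 9, 12) == 120
```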
I'm not aware of any games affected by this issue.
This should fix recording the wall in the staircase leading to the basement in Luigi's Mansion (though I haven't tested it, as I don't own a copy of Luigi's Mansion). That wall uses NormalIndex3, and the index for the normal vector there (generally 0x02XX or 0x01XX) is always lower than the index for the tangent or binormal (generally 0x07XX). Other games seem to usually have a similar range of indices for the normal, tangent, and binormal, so this issue wouldn't affect them.
In most cases, games will use the same type for all vertex components (either Index8 or Index16 or Direct). However, RS2's deflection towers use Index16 for the texture coordinate and Index8 for everything else, meaning the texture coordinates were recorded incorrectly (the first byte was used, so only indices 0 and 1 were recorded instead of 0 through 0x0192). Worse still, some background elements in RS2 use direct positions but indexed normals or texture coordinates, and those would not be recorded at all.
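The effect of reading an Index16 value as an Index8 can be reproduced in a few lines. This is a toy model that assumes big-endian indices stored at the start of each vertex; `read_index` and the sample values are hypothetical.

```python
import struct

def read_index(vertex, offset, width):
    """Read a big-endian index of `width` bytes (1 = Index8, 2 = Index16)."""
    fmt = {1: ">B", 2: ">H"}[width]
    return struct.unpack_from(fmt, vertex, offset)[0]

# Three vertices whose texture coordinate index is stored as Index16.
vertices = [struct.pack(">H", i) for i in (0x0000, 0x00FF, 0x0192)]

# Using the component's own width finds the real maximum index.
correct_max = max(read_index(v, 0, 2) for v in vertices)
# Assuming Index8 reads only the first (high) byte of each index.
buggy_max = max(read_index(v, 0, 1) for v in vertices)

assert correct_max == 0x0192
assert buggy_max == 0x01
```

Because the high byte of every index up to 0x0192 is either 0x00 or 0x01, the mistaken read only ever sees indices 0 and 1.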
This is a regression from b5fd35f951.