I’m working on a transpiler that converts a source language into C++ without violating its semantics or other attributes of the language.
This language is VM/bytecode based, and it has a lot of “intermediate” memory values that its compiler generates. These cause a lot of bloat even after transpilation into C++, because the C++ compiler is not able to resolve the origin of the memory and perform compile-time optimizations.
“Memory” is stored in a contiguous memory block, and all interaction goes through it. Since the source language is an embedded one, it relies on a sophisticated reflection system that collects the attributes of values so that they can be accessed from C++. The reflection system has the following class (pseudocode):
class Property
{
int Offset;
int Alignment;
// ... etc.
};
And the code generator reads values from the “Memory” like this:
Property* Prop = reinterpret_cast<Property*>(Bytecode + 32); // Offset of the FProperty* is known to the code generator, so it is a literal integer here
uint8* PropertyAddress = Memory + Prop->Offset; // Offset of PropertyAddress is not known, so I have to read it from reflection. PropertyAddress is used by the generated code multiple times, and it is copied around as a pointer to implement the "stack" logic.
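Put together, the generated access pattern looks roughly like this minimal, self-contained sketch (`ReadFlag` and the buffer layout are illustrative assumptions, not the real generated code); the point is that `Prop->Offset` is a run-time load, so the compiler cannot fold accesses through `PropertyAddress`:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical minimal reflection record, mirroring the Property class above.
struct Property
{
    int Offset;
    int Alignment;
};

// Sketch of what the generated code does: the location of the Property inside
// the bytecode is a compile-time literal (+32 here), but Property->Offset is
// only known at run time, so every access goes through two dependent loads
// that the optimizer cannot see through.
bool ReadFlag(const uint8_t* Bytecode, uint8_t* Memory)
{
    // In real code a Property object must actually live at this address;
    // this sketch ignores strict-aliasing/lifetime subtleties.
    const Property* Prop = reinterpret_cast<const Property*>(Bytecode + 32);
    uint8_t* PropertyAddress = Memory + Prop->Offset; // run-time offset
    return *PropertyAddress != 0;
}
```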
Handwritten C++ version of the code:
class NativeClass
{
bool SomeValue;
bool Function() { return (SomeValue && (3 > 2)) || (5 > 1); }
};
Clang output:
Function(NativeClass&): # @Function(NativeClass&)
mov al, 1
ret
Clang output for the same exact code from code generator’s output:
TranspiledCode(stack&, void*): # @TranspiledCode(stack&, void*)
mov rax, qword ptr [rdi + 8]
mov rcx, qword ptr [rdi + 16]
movsxd rdx, dword ptr [rax + 10]
movsxd rsi, dword ptr [rax + 40]
movsxd r8, dword ptr [rax + 70]
mov byte ptr [rcx + rdx], 0
mov byte ptr [rcx + rsi], 1
mov r9, qword ptr [rdi]
movsxd r10, dword ptr [rax + 88]
cmp byte ptr [r9 + r10], 0
setne r9b
cmp byte ptr [rcx + rdx], 0
setne dl
and dl, r9b
mov byte ptr [rcx + r8], dl
movsxd r8, dword ptr [rax + 108]
cmp byte ptr [rcx + rsi], 0
setne sil
or sil, dl
mov byte ptr [rcx + r8], sil
mov rdx, qword ptr [rdi + 32]
movsxd rsi, dword ptr [rax + 155]
add rax, 147
lea r8, [rcx + rsi]
movzx esi, byte ptr [rcx + rsi]
mov byte ptr [rdx], sil
mov qword ptr [rdi + 40], rax
mov qword ptr [rdi + 48], r8
mov qword ptr [rdi + 56], rcx
ret
If there is a “virtually free” operation defined in the source language, like comparing a few bools or summing and subtracting values, writing the same code in C++ often outputs just a few lines of assembly. Meanwhile the transpiled version outputs roughly 30x more code. That is still faster than evaluating the same code in the VM, but I am able to optimize a lot of this away if I do something like this instead:
bool SomeValue; // I know this specific value is completely local and won't be used elsewhere
uint8* PropertyAddress = &SomeValue;
Clang output when using a locally constructed bool value for the same bytecode function:
TranspiledCodeOptimized(stack&, void*): # @TranspiledCodeOptimized(stack&, void*)
mov rax, qword ptr [rdi + 32]
mov byte ptr [rax], 1
ret
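For completeness, the localized pattern can be reproduced in a tiny self-contained sketch (`TranspiledLocalized` and the opcode comments are illustrative assumptions, not real generated output). Because every write and read of `SomeValue` is visible to Clang, the whole expression folds to a constant store:

```cpp
#include <cstdint>

// Sketch of the "localized" variant: the transpiler knows this particular
// slot never escapes, so it emits a real local instead of a Memory slot.
bool TranspiledLocalized(uint8_t* Result)
{
    bool SomeValue = false;                 // was: Memory + Prop->Offset
    uint8_t* PropertyAddress = reinterpret_cast<uint8_t*>(&SomeValue);
    *PropertyAddress = 1;                   // opcode: store into the slot
    // opcode: evaluate the boolean expression through the slot
    bool Folded = (*PropertyAddress != 0 && (3 > 2)) || (5 > 1);
    *Result = Folded ? 1 : 0;               // opcode: copy the result out
    return Folded;
}
```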
However, this is not always possible, because a single function defined in the source language is spread across multiple C++ functions (one function per n opcodes, depending on how much the transpiler could fold/combine). So bool SomeValue can’t be local to the function body where I declare it; its value has to be carried across those functions.
I’m looking for a way to hint to the compiler that the content of the “Memory” can be optimized at compile time, without using the actual types declared in C++.
I tried to create a struct that contains the values that I can “localize” from the “Memory” (pseudocode):
struct LocalizedMemoryElements
{
bool SomeValue;
int SomeFunctionsReturnValue;
int SomeFunctionsInputValue;
};
and passing this struct to the generated opcode function makes Clang generate faster code:
uint8* PropertyAddress = reinterpret_cast<uint8*>(&Locals.SomeValue); // Locals is the LocalizedMemoryElements& parameter
Clang output:
TranspiledCodeWithLocalizedMemoryElements(stack&, void*, LocalizedMemoryElements&): # @TranspiledCodeWithLocalizedMemoryElements(stack&, void*, LocalizedMemoryElements&)
mov word ptr [rcx + 4], 256
mov byte ptr [rcx + 6], 0
mov byte ptr [rcx + 16], 1
mov rax, qword ptr [rdi + 32]
mov byte ptr [rax], 1
ret
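To illustrate how the struct survives the per-opcode function split, here is a hedged sketch (`Opcode_Store`, `Opcode_Compute`, and the arithmetic are made-up names, not the real generated signatures); each generated function takes the same struct by reference, so the value is carried across functions while staying visible to the optimizer:

```cpp
#include <cstdint>

// Hypothetical localized slots, mirroring the struct above.
struct LocalizedMemoryElements
{
    bool SomeValue;
    int  SomeFunctionsReturnValue;
    int  SomeFunctionsInputValue;
};

// First split: an opcode that writes the slot through a byte pointer,
// exactly as the generated code does with PropertyAddress.
void Opcode_Store(LocalizedMemoryElements& Locals)
{
    uint8_t* PropertyAddress = reinterpret_cast<uint8_t*>(&Locals.SomeValue);
    *PropertyAddress = 1;
}

// Second split: a later opcode that reads the same slot. After inlining,
// Clang can trace the value through both functions.
int Opcode_Compute(LocalizedMemoryElements& Locals)
{
    Locals.SomeFunctionsReturnValue =
        Locals.SomeValue ? Locals.SomeFunctionsInputValue + 1 : 0;
    return Locals.SomeFunctionsReturnValue;
}
```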
But there are problems with this:
- I end up duplicating memory by creating another struct in local scope or on the heap, because the “Memory” is already allocated by the VM.
- Reinterpret-casting the “Memory” to LocalizedMemoryElements is not possible either, because I am unable to localize/nativize all types of data. The optimization I’m after mostly works for POD types.
- I also tried to exploit __restrict a lot, but with no luck.