Why does writing to 32bit register clear the upper part of the 64bit register?

I was facing the same problem when I was writing child process injection assembly for MacType:
mov eax, dword ptr [rax] zeros the upper part of rax, but instructions like mov ax, [rax] won't wipe the high part of eax.

Looks like a lot of people have asked the problem before, and here is the answer:

Question

In the x86-64 Tour of Intel Manuals, I read

Perhaps the most surprising fact is that an instruction such as MOV EAX, EBX automatically zeroes >upper 32 bits of RAX register.

The Intel documentation (3.4.1.1 General-Purpose Registers in 64-Bit Mode in manual Basic Architecture) quoted at the same source tells us:

64-bit operands generate a 64-bit result in the destination general-purpose register.
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register.
8-bit and 16-bit operands generate an 8-bit or 16-bit result. The upper 56 bits or 48 bits (respectively) of the destination general-purpose register are not be modified by the operation. If the result of an 8-bit or 16-bit operation is intended for 64-bit address calculation, explicitly sign-extend the register to the full 64-bits.
In x86-32 and x86-64 assembly, 16 bit instructions such as

mov ax, bx

don't show this kind of "strange" behaviour that the upper word of eax is zeroed.

Thus: what is the reason why this behaviour was introduced? At a first glance it seems illogical (but the reason might be that I am used to the quirks of x86-32 assembly).

Answer

The processor model as documented in the Intel/AMD processor manual is a pretty imperfect model for the real execution engine of a modern core. In particular, the notion of the processor registers does not match reality, there is no such thing as a EAX or RAX register.

One primary job of the instruction decoder is to convert the legacy x86/x64 instructions into micro-ops, instructions of a RISC-like processor. Small instructions that are easy to execute concurrently and being able to take advantage of multiple execution sub-units. Allowing as many as 6 instructions to execute at the same time.

To make that work, the notion of processor registers is virtualized as well. The instruction decoder allocates a register from a big bank of registers. When the instruction is retired, the value of that dynamically allocated register is written back to whatever register currently holds the value of, say, RAX.

To make that work smoothly and efficiently, allowing many instructions to execute concurrently, it is very important that these operations don't have an interdependency. And the worst kind you can have is that the register value depends on other instructions. The EFLAGS register is notorious, many instructions modify it.

Same problem with the way you like it to work. Big problem, it requires two register values to be merged when the instruction is retired. Creating a data dependency that's going to clog up the core. By forcing the upper 32-bit to 0, that dependency instantly disappears, no longer a need to merge. Warp 9 execution speed.

Refer:
https://stackoverflow.com/questions/25455447/x86-64-registers-rax-eax-ax-al-overwriting-full-register-contents
https://stackoverflow.com/questions/11177137/why-do-x86-64-instructions-on-32-bit-registers-zero-the-upper-part-of-the-full-6
https://stackoverflow.com/questions/11177137/why-do-x86-64-instructions-on-32-bit-registers-zero-the-upper-part-of-the-full-6

最后一次更新于2018-12-11

Aloha せかい

Thoughts and memos about random things.