Both uops.info and Agner Fog’s instruction tables list the 64-bit BSWAP instruction with a latency of 2 cycles on modern Intel architectures such as Broadwell, Cannon Lake, and Ice Lake. According to the same sources, it has been a single-cycle instruction on AMD for a very long time.
However, I can’t imagine why this instruction would need 2 cycles of latency on Intel. It doesn’t compute anything; it is a zero-gate instruction, literally just a set of 64 wires crossing in a particular pattern.
What is the technical reason for the 2-cycle latency on Intel? Or are the latency figures on uops.info and in Agner Fog’s tables both incorrect?
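
To make the "just wires" point concrete, here is a minimal C sketch of what 64-bit BSWAP computes. The function name `bswap64_model` is made up for illustration. It is a reference model of the instruction's result, not a statement about how any CPU implements it. Each output byte is a plain copy of one input byte (output byte `7 - i` comes from input byte `i`), so no arithmetic is involved:

```c
#include <stdint.h>

/* Hypothetical reference model of 64-bit BSWAP (name is illustrative).
 * Output byte (7 - i) is a copy of input byte i; nothing is computed,
 * the bytes are only moved to new positions. */
static uint64_t bswap64_model(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t byte = (x >> (8 * i)) & 0xFF;  /* extract input byte i */
        r |= byte << (8 * (7 - i));             /* place it at position 7 - i */
    }
    return r;
}
```

For example, `bswap64_model(0x0102030405060708)` returns `0x0807060504030201`. GCC and Clang expose the same operation as `__builtin_bswap64`.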