From Tanenbaum’s Structured Computer Organization:
Most instructions can be divided into one of two categories: register-memory or register-register.
Register-memory instructions allow memory words to be fetched into registers, where, for example, they can be used as ALU inputs
in subsequent instructions. (‘‘Words’’ are the units of data moved
between memory and registers. A word might be an integer. We will
discuss memory organization later in this chapter.) Other
register-memory instructions allow registers to be stored back into
memory. A typical register-register instruction fetches two operands from the registers, brings them to the ALU input registers, performs
some operation on them (such as addition or Boolean AND), and stores
the result back in one of the registers. The process of running two
operands through the ALU and storing the result is called the data
path cycle and is the heart of most CPUs. To a considerable extent,
it defines what the machine can do. Modern computers have multiple
ALUs operating in parallel and specialized for different functions.
The faster the data path cycle is, the faster the machine runs.
Are there memory-memory instructions?
Or is a memory-memory “operation” implemented as two register-memory instructions (one for read and the other for write)?
Isn’t this less efficient than moving data directly between two places in the same memory without going via a register?
Lots of machine architectures have memory-memory instructions.
The IBM System/360 and its successors have a whole set of instructions that operate on two locations in memory (the Storage-to-Storage (SS) group). The “Move Character” (MVC) instruction copies up to 256 bytes from one memory location to another, and even has a clear definition for when the source and destination ranges overlap. Similarly, there are Compare Logical Character (CLC), which does a string comparison, and OR Character (OC), AND Character (NC), and XOR Character (XC), which are bitwise logical operators, etc. They also have a set of decimal arithmetic instructions, which operate only on memory; there aren’t any registers for decimal math.
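As a rough illustration (a minimal C sketch, not the architecture’s authoritative definition, which is in IBM’s Principles of Operation), MVC is usually described as copying one byte at a time, left to right, storing each byte before fetching the next. That ordering is what makes the overlap behaviour well defined:

```c
#include <stddef.h>

/* Sketch of MVC-like copy semantics: move len bytes (1..256 on S/360),
 * one byte at a time, left to right, storing each byte before fetching
 * the next.  With dst == src + 1 this propagates the first byte through
 * the whole field (the classic S/360 idiom for filling a field with a
 * repeated character). */
static void mvc_like(unsigned char *dst, const unsigned char *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}
```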
Then there are the memory-immediate instructions, which have one operand in memory and the other in the instruction itself. The DEC PDP-10 had Add One to Storage (AOS) and Subtract One from Storage (SOS). The IBM S/360 family had a wide range of Storage Immediate (SI) instructions, in which one operand was a memory location and the other was an 8-bit quantity in the instruction.
Memory chips do not have a mechanism for transferring data directly from one memory location to another. Hence, the processor must read the data from memory, and then write it to the new location.
In computer systems having DMA controllers, it is possible to perform memory transfers without involving the CPU. There are potential complications, such as cache coherency.
The Motorola 68000 (“68K”) architecture had an orthogonal instruction set, and both operands could specify absolute memory addresses. You could also do things like directly increment or decrement the value at a specific memory location, whereas with a more RISC-like architecture you’d still be required to load memory into a register, increment the register, then write (store) the register back to memory.
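As a rough C-level illustration (the assembly in the comments is schematic pseudo-code, not exact 68K or ARM syntax), incrementing a counter that lives in memory can be a single read-modify-write instruction on a 68K-style machine, but needs a load, an add, and a store on a load/store machine:

```c
/* Increment a counter held in memory.
 *
 * 68K-style (memory-destination) code, roughly one instruction:
 *     addq   #1, counter       ; read-modify-write of the memory location
 *
 * Load/store (RISC-like) code, roughly three instructions:
 *     load   r1, [counter]
 *     add    r1, r1, 1
 *     store  r1, [counter]
 */
void bump_counter(volatile unsigned long *counter)
{
    *counter += 1;
}
```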
The ColdFire architecture is the heir/successor to the 68K, and I think they might have trimmed away some of the more exotic instructions and addressing modes.
Of all the 32-bit and 64-bit CPUs produced each year, most use the ARM architecture.
The ARM architecture, like the DLX and RISC-V architectures and other load/store architectures,
has only three kinds of instructions: (1) instructions that have no effect on memory (“register-register instructions”), (2) instructions that LOAD from external memory into a register (and do practically nothing else), and (3) instructions that STORE a value from a register into external memory (and do practically nothing else). Kinds (2) and (3) are the “register-memory instructions”.
Is a memory-memory “operation” implemented as two register-memory
instructions (one for read and the other for write)?
Yes and no.
Computers built with the most common 32-bit or 64-bit CPUs have no memory-memory instructions.
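To make the “yes” part concrete, here is a hedged C sketch: on a load/store machine a one-word memory-to-memory “move” has no single instruction, so the compiler emits a register-memory load followed by a register-memory store (the mnemonics in the comment are schematic):

```c
/* Copy one word between two memory locations on a load/store architecture.
 * There is no memory-to-memory instruction, so this compiles to roughly:
 *     LDR  rX, [src]     ; register-memory: memory -> register
 *     STR  rX, [dst]     ; register-memory: register -> memory
 */
void copy_word(int *dst, const int *src)
{
    *dst = *src;
}
```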
Some read-modify-write-memory operations are very useful in building non-blocking algorithms on systems with more than one processor connected to the same memory.
Some less-common CPU architectures, such as the 32-bit x86 and 64-bit x86-64 architectures, do have memory-memory instructions. In particular, some can perform read-modify-write-memory in a single instruction, such as compare-and-swap.
ARM processors intended for use in multi-processor systems can perform read-modify-write-memory operations, but not as a single instruction: they split them up into multiple instructions, such as load-linked/store-conditional, where any one instruction either LOADs or STOREs, not both.
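As a hedged sketch using C11 atomics: the same read-modify-write (an atomic add) can be written as a compare-and-swap retry loop. On x86-64 the compare-exchange typically compiles to a single LOCK CMPXCHG, an instruction that both reads and writes memory, while classic ARM lowers it to a load-exclusive/store-exclusive loop in which every instruction either loads or stores, never both:

```c
#include <stdatomic.h>

/* A read-modify-write (atomic add) built from compare-and-swap.
 * On x86-64 the compare-exchange below typically becomes LOCK CMPXCHG
 * (one instruction that reads and writes memory); on classic ARM it is
 * lowered to an LDREX/STREX (load-linked/store-conditional) retry loop. */
static int atomic_add_via_cas(_Atomic int *counter, int delta)
{
    int observed = atomic_load(counter);
    /* If another core changed *counter between the load and the
     * compare-exchange, observed is refreshed and we try again. */
    while (!atomic_compare_exchange_weak(counter, &observed, observed + delta))
        ;
    return observed + delta;
}
```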
two register-memory
instructions (one for read and the other for write)? Isn’t this
less efficient than moving data directly between two places in the same
memory without going via a register?
Yes, this inefficiency is part of the von Neumann bottleneck.
Commodity DRAM only allows one address at a time to be selected,
so even those less-common CPUs that have memory-to-memory operations in a single instruction are forced to implement those instructions as multiple memory cycles: one memory cycle for the read, and a second memory cycle for the write.
In a small loop where instructions are being read from the instruction cache, a single instruction that does both doesn’t run any faster than the two separate instructions that would be required on an ARM processor.
Simply copying data from one place to another is extremely common,
so several techniques have been developed for speeding it up, bypassing some or all of the von Neumann bottleneck:
- Some DMA hardware directly copies data from one chip to a different chip in a single memory cycle, typically reading from main memory and writing to some peripheral, or reading from some peripheral and writing into main memory. This requires sending different addresses and different READ/WRITE Enable signals to the two chips.
- Displaying stuff on screen has historically used a variety of hardware speed-ups — character ROMs, tiled rendering, hardware sprites, blitter hardware for speeding up bit blit operations, dual-ported video DRAM, etc. Some of these techniques involve reading data from one chip and sending it directly to a different chip during a single memory cycle.
- Some Computational RAM chips can copy large blocks of data from one location to another inside the memory chip, much faster than reading that data (from one address) out of the chip, then writing that data (to a different address) back into the chip.
A single load or store operation is hard enough to implement. It is actually one of the most important things to get both right and fast. There are alignment, caches, address translation, communication with other cores, exception handling, and memory-mapped hardware to deal with. It’s more complicated than most other instructions.
A modern ARM processor has load/store and nothing else. A modern x86 processor has more complex instructions (“add register x to memory address y”), but that kind of operation gets internally split into micro-operations that each only load or store.
An operation moving data from memory to memory must contain two addresses, and addresses are complicated, so you either get massive instructions on x86 or instructions that just won’t fit into a 32-bit word on ARM. Such an operation must do the whole load/store logic twice, and a single instruction could take two page faults for aligned accesses, or four for unaligned accesses.
It is just an enormous amount of complexity for very little gain compared to just having two instructions.