From Section 2.1.3, RISC versus CISC, in Structured Computer Organization by Tanenbaum:
While the initial emphasis was on simple instructions that could be executed quickly, it was soon realized that designing instructions that could be issued (started) quickly was the key to good performance. How long an instruction actually took mattered less than how many could be started per second.
-
What does “issue or start an instruction” mean? Does it refer to steps 1, 2, 3, 4, and 5 in the fetch-decode-execute cycle?
-
Does “execute an instruction” mean only step 6 in the fetch-decode-execute cycle?
Thanks.
Note:
The CPU executes each instruction in a series of small steps. Roughly speaking, the steps are as follows:
- Fetch the next instruction from memory into the instruction register.
- Change the program counter to point to the following instruction.
- Determine the type of instruction just fetched.
- If the instruction uses a word in memory, determine where it is.
- Fetch the word, if needed, into a CPU register.
- Execute the instruction.
- Go to step 1 to begin executing the following instruction.
This sequence of steps is frequently referred to as the fetch-decode-execute cycle.
It probably refers to pipelining, that is, parallel (or semi-parallel) execution of instructions. That’s the only scenario I can think of where it does not really matter how long something takes, as long as you can have enough of them running in parallel.
So, the CPU may fetch one instruction (step 1 in the table above), and then, as soon as it proceeds to step 2 for that instruction, it can at the same time (in parallel) start step 1 for the next instruction, and so on.
Let’s call our two consecutive instructions A and B. So, the CPU executes step 1 (fetch) for instruction A. Now, when the CPU proceeds to step 2 for instruction A, it cannot yet start step 1 for instruction B, because the program counter has not advanced yet. So, it has to wait until it has reached step 3 for instruction A before it can get started with step 1 for instruction B. This is the time it takes to start another instruction, and we want to keep it to a minimum (start instructions as quickly as possible) so that we can be executing as many instructions as possible in parallel.
CISC architectures have instructions of varying lengths: some instructions are only one byte long, others are two bytes long, and yet others are several bytes long. This does not make it easy to increment the program counter immediately after fetching one instruction, because the instruction has to be decoded to a certain degree in order to figure out how many bytes long it is. On the other hand, one of the primary characteristics of RISC architectures is that all instructions have the same length, so the program counter can be incremented immediately after fetching instruction A, meaning that the fetching of instruction B can begin immediately afterwards. That’s what the author means by starting instructions quickly, and that’s what increases the number of instructions that can be executed per second.
In the above table, step 2 says “Change the program counter to point to the following instruction” and step 3 says “Determine the type of instruction just fetched.” These two steps can be in that order only on RISC machines. On CISC machines, you have to determine the type of instruction just fetched before you can change the program counter, so step 2 has to wait. This means that on CISC machines the next instruction cannot be started as quickly as it can be started on a RISC machine.
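If it helps to see that difference in issue rate as numbers, here is a rough sketch in Python (the one-unit fetch and one-unit length-decode timings are assumptions for illustration, not figures from the book or from any real CPU):

```python
# Illustrative sketch only -- the timings are made up.
# With fixed-length instructions the program counter can be advanced as
# soon as a fetch completes; with variable-length instructions the CPU
# must partially decode the instruction to learn its size first.

FETCH = 1          # assumed time to fetch an instruction word
LENGTH_DECODE = 1  # assumed extra time to discover a variable instruction's length

def time_until_nth_fetch(n, fixed_length):
    """Time at which the fetch of instruction number n can begin."""
    per_issue = FETCH if fixed_length else FETCH + LENGTH_DECODE
    return n * per_issue

# The fixed-length (RISC-like) front end can start a new instruction every
# FETCH units; the variable-length (CISC-like) one only every
# FETCH + LENGTH_DECODE units, so it starts fewer instructions per second.
print(time_until_nth_fetch(100, fixed_length=True))   # 100
print(time_until_nth_fetch(100, fixed_length=False))  # 200
```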
What does “issue or start an instruction” mean?
In the context of what’s written, issuing (starting) an instruction means beginning to handle it, and the figure that matters is the amount of time between when the CPU can start handling one instruction and when it can do the same for the one that follows. If that explanation sounds vague, the reason is that it’s going to vary a lot by architecture.
Say you have some instructions to run on a fictional CPU with numbered registers where fetch and decode (F&D) always takes three units of time and execution takes ten:
ADD R1, R2 ; R1 ← R1 + R2
LOAD R3, 0 ; R3 ← 0
LSR R4, 3 ; R4 ← R4 shifted 3 bits to the right
On a simplistic design that doesn’t parallelize anything, the total time to run this snippet of code is 39 units: 3+10 (F&D plus execute) for the ADD, 3+10 for the LOAD, and 3+10 for the LSR.
At some point, CPU designers noticed their chips had hardware that sat idle while instructions executed and figured out that nothing was stopping them from using it to execute later instructions that weren’t dependent on the outcome of those that came before. In the example above, none of the instructions have anything in common with the others. All of the registers involved are unique, as are what the instructions do. (For the sake of simplicity, let’s say ADD, LOAD and LSR all use different hardware to execute.) This means they can all go through the execution phase in parallel without producing the wrong results.
Three instructions executing in perfect, not-of-this-world parallel would take a total of 13 units of time, a 66% time savings over the non-parallelized CPU. That’s quite an improvement until the real world intrudes and you realize that you can’t do all of the F&D phase in parallel, because you’d have no idea whether the LOAD instruction is going to use the same registers as the ADD above it. To prevent that, F&D must be done serially. Once you’ve determined that an instruction doesn’t depend on anything prior, you can send it off for execution and go F&D the next one.
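To make that dependency check concrete, here is a toy sketch (the instruction encoding and the "no shared registers" rule are invented for illustration, not taken from any real issue logic):

```python
# Toy dependency check: an instruction may be sent off for parallel
# execution only if it shares no registers with instructions still in flight.

def registers_of(instr):
    # crude: treat operands that look like "R<number>" as registers
    return {op for op in instr[1:] if op.startswith("R")}

def can_issue(instr, in_flight):
    return all(registers_of(instr).isdisjoint(registers_of(prev))
               for prev in in_flight)

program = [("ADD", "R1", "R2"),
           ("LOAD", "R3", "0"),
           ("LSR", "R4", "3")]

in_flight = []
for instr in program:
    print(instr, "can issue:", can_issue(instr, in_flight))
    in_flight.append(instr)
```

In this snippet all three instructions report that they can issue, because their register sets don’t overlap; a second ADD R1, R2 arriving while the first was still executing would be held back.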
The first instruction takes three units for F&D and then is sent somewhere else in the processor to be executed. Then the second F&D can go ahead for another three time units and then the third for the same. At that point, we’ve spent nine units of time on F&D. Because the third instruction was the last to be decoded, we have to wait another ten units after that for it to execute. That means the last instruction will finish executing 19 time units after F&D starts on the first one. That’s still a not-too-shabby savings of 51% over doing everything serially.
When you’re able to execute a lot of instructions in parallel, how long the instructions take to execute becomes a bit less relevant than how long it takes to start (issue) them. Start time has become more critical because every unit of time spent evaluating whether or not an instruction can be executed is time the execution hardware won’t be busy. If the designers of our fictitious CPU find a way to cut a time unit out of that phase, the total time for the three instructions in the example drops from 19 to 16, which is nothing to sneeze at. Chop out one more and you’re down to 13.
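The arithmetic above can be written out as a small sketch (the 3-unit F&D and 10-unit execute figures are the fictional ones from the example, and the functions are just bookkeeping, not a CPU model):

```python
# Total time for n independent instructions under three assumptions.

def serial(n, fd, ex):
    # No overlap at all: every instruction pays F&D + execute in turn.
    return n * (fd + ex)

def perfect_parallel(n, fd, ex):
    # Not-of-this-world case: all F&D and all execution fully overlapped,
    # so n doesn't matter.
    return fd + ex

def serial_fd_parallel_ex(n, fd, ex):
    # F&D done one instruction at a time; execution overlaps freely.
    # The last instruction finishes ex units after its F&D completes.
    return n * fd + ex

N, EXECUTE = 3, 10
print(serial(N, 3, EXECUTE))                 # 39
print(perfect_parallel(N, 3, EXECUTE))       # 13
print(serial_fd_parallel_ex(N, 3, EXECUTE))  # 19
print(serial_fd_parallel_ex(N, 2, EXECUTE))  # 16 -- one unit shaved off issue time
print(serial_fd_parallel_ex(N, 1, EXECUTE))  # 13 -- another unit shaved off
```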
Does “execute an instruction” mean only step 6 in the fetch-decode-execute cycle?
For the purposes of this discussion, it would be reasonable to say it means using the hardware in the CPU to carry out the instruction’s wishes.