Previously, we covered data hazards. If one instruction writes a register really late in the pipeline, and a following instruction needs that value really early in the pipeline, then the second instruction could use the wrong register contents. The simplest thing you could do is to put in noops. It doesn't slow down the clock, but it requires the programmer to know the details of the pipeline, and it is not backwards compatible. Also, the code bloats. Instead, you could put in detect-and-stall hardware so the code doesn't bloat. You might have to slow down the clock a little bit.
A more advanced feature is detect and forward. You take the output from the ALU and route it straight back to the input of the ALU in the next cycle. With forwarding, an instruction that depends on an add does not have to stall. However, you still need a single noop after a load instruction when the next instruction uses the loaded value. The value is not ready early enough to get through without a stall, because it is not available until the memory stage.
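As a rough sketch of the comparison above, the number of stall cycles (or noops) can be computed from how far apart the producer and consumer instructions are. The function name `stalls_needed` and its parameters are made up for illustration. The no-forwarding case assumes a 5-stage pipeline in which a register written in WB can be read in ID during the same cycle; a different register file design would change that number.

```python
def stalls_needed(producer_op, distance, forwarding):
    """Stall cycles needed before a consumer that sits `distance`
    instructions after the producer of the register it reads."""
    if forwarding:
        # An ALU result can be forwarded right after EX, so the very next
        # instruction can use it. A load only has its data after MEM, so
        # the consumer must be at least 2 instructions behind.
        required_gap = 2 if producer_op == "lw" else 1
    else:
        # Assumption: the value is only readable once the producer is in WB,
        # with write and read allowed in the same cycle.
        required_gap = 3
    return max(0, required_gap - distance)


print(stalls_needed("add", 1, forwarding=True))   # dependent add, forwarding
print(stalls_needed("lw", 1, forwarding=True))    # load-use, forwarding
print(stalls_needed("add", 1, forwarding=False))  # detect and stall only
```

The load-use case is the only one that still stalls once forwarding is in place, which matches the single noop described above.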
Control hazards come up because of branches. If you branch, you may not know whether you're going to jump, or where you're going to jump, until you evaluate the expression.
Pipeline function for BEQ:
Three ways to handle:
Advantage: no software bloat. Disadvantage: have to add hardware.
The CPI increases every time a branch is detected! Is that necessary? Not always! Sometimes the branch is not taken. You can keep fetching, assuming the branch is not taken. If you are wrong, then that is okay as long as you do not COMPLETE any instructions you mistakenly executed. As the instruction has not modified any globally visible state, you're fine.
To squash, just replace the instructions that need to be squashed with noops wherever they are in the pipeline.
For example, if you have the following program:
    beq 1 2 1
    sub 3 4 5
    add 6 7 8
You need to consider the time filling the pipeline, the time getting the instructions through, and the time squashing.
If the branch is not taken, you don't lose any time. The time filling the pipeline is 4 cycles, and each of the 3 instructions then takes 1 more cycle to finish, so 4 + 3 = 7 cycles total.
If you speculate that the branch is not taken, and it is really taken, then it takes 4 cycles to fill the pipeline, plus 2 to get the useful instructions (beq and add) through, plus 3 for the squashed instructions because the branch was taken, so 9 cycles total. This increases the CPI each time the branch is taken!
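The cycle counts for the two outcomes can be checked with a few lines of arithmetic. This is only a sketch: `total_cycles` is a hypothetical helper, and it assumes a 5-stage pipeline with a 3-instruction penalty whenever a taken branch was fetched past.

```python
PIPELINE_DEPTH = 5  # IF, ID, EX, MEM, WB


def total_cycles(completed, squashed=0):
    """Cycles to run a short program: fill the pipeline, then one cycle
    per instruction that completes, plus one per squashed instruction."""
    fill = PIPELINE_DEPTH - 1
    return fill + completed + squashed


# Branch not taken: beq, sub and add all complete.
not_taken = total_cycles(completed=3)

# Branch taken while fetching as if not taken: only beq and add complete,
# and the 3 instructions fetched after beq are squashed.
taken = total_cycles(completed=2, squashed=3)

print(not_taken, taken)  # 7 9
```

Under these assumptions the taken case finishes fewer instructions in more cycles, which is where the higher CPI comes from.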
The second we know the address of the instruction we want to fetch, we want to know: is it a branch instruction? If it is, where is the target address? Is it taken or not?
For the LC2k, you can reliably know the target address of any branch (beq) instruction, since it is just PC + 1 + Offset! All of that information is encoded in the instruction! You can't know for certain whether the branch will be taken, since that is based on the contents of registers. You can predict it with some accuracy, but not 100% accuracy.
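The target calculation is small enough to sketch directly. The helper names below are illustrative, and the sketch assumes the offset is stored as a 16-bit two's complement field in the low bits of the instruction word.

```python
def sign_extend_16(field):
    """Interpret the low 16 bits of `field` as a two's complement number."""
    field &= 0xFFFF
    return field - 0x10000 if field & 0x8000 else field


def branch_target(pc, offset_field):
    """Target of a beq at address `pc`: PC + 1 + Offset."""
    return pc + 1 + sign_extend_16(offset_field)


# A beq at address 0 with offset 1 skips one instruction and lands on address 2.
print(branch_target(0, 1))        # 2
# A backward branch: offset -3 encoded as 0xFFFD, from address 10.
print(branch_target(10, 0xFFFD))  # 8
```

Nothing here depends on register contents, which is why the target can be known as soon as the instruction bits are available.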
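A branch predictor predicts the next fetch address (to be used in the next cycle). It requires three things to be predicted at the fetch stage: whether the fetched instruction is a branch, whether the branch is taken, and the target address if it is taken.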
Observation: The target address remains the same for a conditional direct branch across dynamic instances.
The first time you encounter an instruction, you just guess. No problem, since the instruction will likely be run hundreds of thousands of times.
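Because the target stays the same from one execution to the next, it can be remembered in a small table indexed by the branch's address. The class below is a minimal sketch of that idea, not a description of any particular hardware: the name `BranchTargetBuffer` and the use of an unbounded dictionary are assumptions made for illustration (a real table has a fixed number of entries).

```python
class BranchTargetBuffer:
    """Remembers the target of each branch that has been seen taken."""

    def __init__(self):
        self.targets = {}  # branch address -> last known target address

    def predict_next_pc(self, pc):
        # First encounter: nothing is known, so guess fall-through (pc + 1).
        return self.targets.get(pc, pc + 1)

    def update(self, pc, taken, target):
        # Called once the branch resolves later in the pipeline.
        if taken:
            self.targets[pc] = target


btb = BranchTargetBuffer()
print(btb.predict_next_pc(4))   # 5, the first-time guess
btb.update(4, taken=True, target=9)
print(btb.predict_next_pc(4))   # 9, remembered from the last execution
```

The first prediction may be wrong, but every later execution of the same branch benefits from the stored target.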
Always not taken:
Always taken:
Backward taken, forward not taken:
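The three static schemes named above can each be written as a one-line rule. This is an illustrative sketch with hypothetical function names; it assumes the branch offset is available at prediction time and that a negative offset means a backward branch (typically the bottom of a loop).

```python
def always_not_taken(offset):
    return False


def always_taken(offset):
    return True


def backward_taken_forward_not_taken(offset):
    # Backward branches usually close loops, so guess taken for those only.
    return offset < 0


for offset in (-3, 2):
    print(offset,
          always_not_taken(offset),
          always_taken(offset),
          backward_taken_forward_not_taken(offset))
```

None of these rules looks at how the branch behaved in the past, which is what separates them from the counter-based predictors.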
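You can use more states to improve performance by using a 2-bit saturating counter instead of a 1-bit one. The counter goes up by one (to a maximum of 3) each time the branch is taken and down by one (to a minimum of 0) each time it is not taken. If the counter is 3 or 2 out of 3, then predict taken. Else, predict not taken.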
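A 2-bit saturating counter is easy to model in software. The sketch below uses a hypothetical class name and an arbitrary starting state of 0; real predictors keep one such counter per table entry and may initialize it differently.

```python
class TwoBitCounter:
    """Saturating counter with states 0..3; 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)  # saturate at 3
        else:
            self.state = max(0, self.state - 1)  # saturate at 0


counter = TwoBitCounter()
outcomes = [True, True, True, False, True]  # a loop branch with one exit
predictions = []
for taken in outcomes:
    predictions.append(counter.predict())
    counter.update(taken)

print(predictions)  # [False, False, True, True, True]
```

Note that the single not-taken outcome does not flip the prediction: the counter only drops from 3 to 2, so the next prediction is still taken. That hysteresis is the benefit of the extra state.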
Sometimes you have exceptions, such as divide by zero or overflow. After the exception occurs, the hardware makes sure that the instructions before it complete, and that the instructions after it never occurred (never changed globally visible state). Then, you jump (jalr) to the memory address of the code that handles that exception.
If you get lucky, you can get close to a CPI of 1 (the ideal case, with no stalls). In reality, it won't be quite that good. If you want to improve performance more, then you can use multiprocessors. You could share the cache, but not the register files.
In superscalar pipelining, you build two (or more) pipelines that execute in parallel. Most programmers find multithreaded programs much harder to debug than single-threaded programs.