The decoder of any x86 CPU (the logic that takes fetched instructions and decodes them into a form the execution units can understand) has one of the highest gate counts of any piece of logic on the chip. This translates into quite a bit of time being spent in the decoding stage when preparing to process an instruction, either for the first time or after a branch misprediction.
This is where the Willamette’s trace cache comes into play. The trace cache acts as a middleman between the decoding stage and the first stage of execution after decoding has been completed. The trace cache essentially caches decoded micro-ops (the instructions after they have been fetched and decoded, and thus ready for execution) so that instead of going through the fetching and decoding process all over again when executing an instruction it has already decoded, the Willamette can go straight to the trace cache, retrieve the decoded micro-op and begin execution.
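To make the idea concrete, here is a minimal sketch of a cache of decoded micro-ops, written in Python. It is a toy model and not a description of Intel's actual design: the class name, the `decode_cost` and `hit_cost` cycle counts, and the string-based "decoder" are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyTraceCache:
    """Toy model: maps an instruction address to its decoded micro-ops."""
    decode_cost: int = 5   # hypothetical cycles for a full fetch + decode
    hit_cost: int = 1      # hypothetical cycles to read cached micro-ops
    cache: Dict[int, List[str]] = field(default_factory=dict)
    cycles: int = 0        # running total of cycles spent in the front end

    def decode(self, instruction: str) -> List[str]:
        # Stand-in for the real x86 decoder: one fake micro-op per token.
        return [f"uop({token})" for token in instruction.split()]

    def fetch(self, address: int, instruction: str) -> List[str]:
        if address in self.cache:
            # Hit: skip fetch/decode and hand back the stored micro-ops.
            self.cycles += self.hit_cost
            return self.cache[address]
        # Miss: pay the full decode cost once, then remember the result.
        self.cycles += self.decode_cost
        uops = self.decode(instruction)
        self.cache[address] = uops
        return uops


def run_loop(model: ToyTraceCache,
             program: List[Tuple[int, str]],
             iterations: int) -> int:
    """Fetch every instruction of `program` `iterations` times; return cycles."""
    for _ in range(iterations):
        for address, instruction in program:
            model.fetch(address, instruction)
    return model.cycles


program = [(0, "add eax ebx"), (4, "sub ecx edx"), (8, "jnz loop")]
total_cycles = run_loop(ToyTraceCache(), program, iterations=10)
```

With these made-up costs, ten passes over the three-instruction loop cost 42 front-end cycles in the model (15 for the first pass, 27 for the nine cached passes), versus 150 if every fetch had to be decoded again. The real numbers will differ; the point is only that the decode cost is paid once per cached instruction.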
The trace cache isn’t there only to improve performance; it also helps hide the penalties associated with incorrectly predicting a branch deep in the Willamette’s 20-stage pipeline. On the Willamette, an incorrectly predicted branch could potentially send execution back only as far as the trace cache, where the fetching/decoding process can be skipped and execution can resume almost immediately. A major downside of the 20-stage pipeline is therefore somewhat masked by the trace cache.
Another benefit of the trace cache is that it caches the micro-ops in the predicted path of execution, meaning that if the Willamette fetches 3 instructions from the trace cache, they are already presented in their order of execution. This adds the potential for an incorrectly predicted path of execution among the cached micro-ops; however, Intel is confident that these penalties will be minimized by the prediction algorithms used by the Willamette.
Double Pumped ALU & Low Latency Data Cache
This was the big attention getter when we published our first live report from IDF: the Willamette has a double pumped integer Arithmetic Logic Unit (ALU). The ALU is what actually executes integer instructions in the Willamette, and by “double pumping” it (running it at twice the core frequency), Intel is able to make the two physical ALUs of the Willamette produce the benefits of four ALUs each running at the core frequency of the CPU.
The Willamette’s double pumped ALU should make the business application and content creation performance of the processor very difficult to beat, since those two areas rely relatively little on the FPU and would benefit greatly from the double pumped integer ALU.
The double pumped ALU naturally reduces the latency associated with executing instructions. For example, a single add or subtract would take only half a clock cycle on a Willamette. Theoretically, you could execute a total of four such instructions in a single clock cycle through the Willamette’s two physical ALUs, courtesy of their double pumped nature.
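The arithmetic behind "two physical ALUs behaving like four" and the half-cycle add can be written out in a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the 1 GHz core clock is a round hypothetical figure, not a Willamette specification.

```python
def simple_ops_per_core_clock(physical_alus: int, pump_factor: int) -> int:
    """Peak simple integer ops completed per core clock cycle."""
    return physical_alus * pump_factor


def simple_op_latency_cycles(pump_factor: int) -> float:
    """Latency of one simple op (add/subtract), measured in core clock cycles."""
    return 1 / pump_factor


def peak_simple_ops_per_second(physical_alus: int,
                               pump_factor: int,
                               core_hz: int) -> int:
    """Theoretical peak rate; ignores dependencies, stalls and everything else."""
    return simple_ops_per_core_clock(physical_alus, pump_factor) * core_hz


# Two physical ALUs, each double pumped (pump factor of 2).
ops_per_clock = simple_ops_per_core_clock(physical_alus=2, pump_factor=2)  # 4
latency = simple_op_latency_cycles(pump_factor=2)                          # 0.5

# Hypothetical 1 GHz core clock, chosen only to keep the numbers round.
peak = peak_simple_ops_per_second(2, 2, core_hz=1_000_000_000)
```

These are peak figures for simple operations only; real code will rarely keep both ALUs busy on every half cycle.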
On the topic of low latencies, the Willamette will also feature a very low latency data cache, which makes up half of the CPU’s L1 cache. The L1 data cache boasts an extremely low 2-clock load latency, which is considerably lower than the L1 data cache latency on the Pentium III.