What Has Improved?

Ivy Bridge is what Intel calls a tick+, a transition to the latest 22nm process technology (the famous P1270 process) with minor architectural optimizations compared to predecessor Sandy Bridge (described in detail by Anand here):

  • Divider is twice as fast
  • MOVs take no execution slots
  • Improved prefetchers
  • Improved shift/rotate and split/Load
  • Buffers are dynamically allocated to threads (not statically split in two parts for each thread)

Given the changes, we should not expect a major jump in single-threaded performance. Anand made a very interesting Intel CPU generational comparison in his Haswell review, showing the IPC improvements of the Ivy Bridge core are very modest. Clock for clock, the Ivy Bridge architecture performed:

  • 5% better in 7-zip (single-threaded test, integer, low IPC)
  • 8% better in Cinebench (single-threaded test, mostly FP, high IPC)
  • 6% better in compiling (multi-threaded, mostly integer, high IPC)

So the Ivy Bridge core improvements are pretty small, but they are measureable over very different kinds of workloads.

The core architecture improvements might be very modest, but that does not mean that the new Xeon E5-2600 V2 series will show insignificant improvements over the previous Xeon E5-2600. The largest improvement comes of course from the P1270 process: 22nm tri-gate (instead of 32nm planar) transistors. Discussing the actual quality of Intel process technology is beyond our expertise, but the results are tangible:

Focus on the purple text: within the same power envelope, the Ivy Bridge Xeon is capable of delivering 25% more performance while still consuming less power. In other words, the P1270 process allowed Intel to increase the number of cores and/or clock speed significantly. This can be easily demonstrated by looking at the high-end cores. An octal-core Xeon E5-2680 came with a TDP of 130W and ran at 2.7GHz. The E5-2697 runs at the same clock speed and has the same TDP label, but comes with four extra cores.

Virtualization Improvements

Each new generation of Xeon has reduced the amount of cycles required for a VMexit or a VMentry, but another way to reduce hardware virtualization overhead is to avoid VMexits all together. One of the major causes of VMexits (and thus also VMentries) are interrupts. With external interrupts, the guest OS has to check which interrupt has the priority and it does this by checking the APIC Task Priority Register (TPR). Intel already introduced an optimization for external interrupts in the Xeon 7400 series (back in 2008) with the Intel VT FlexPriority. By making sure a virtual copy of the APIC TPR exists, the guest OS is capable of reading out that register without a VMexit to the hypervisor.

The Ivy Bridge core is now capable of eliminating the VMexits due to "internal" interrupts, interrupts that originate from within the guest OS (for example inter-vCPU interrupts and timers). The virtual processor will then need to access the APIC registers, which will require a VMexit. Apparantly, the current Virtual Machine Monitors do not handle this very well, as they need somewhere between 2000 to 7000 cycles per exit, which is high compared to other exits.

The solution is the Advanced Programmable Interrupt Controller virtualization (APICv). The new Xeon has microcode that can be read by the Guest OS without any VMexit, though writing still causes an exit. Some tests inside the Intel labs show up to 10% better performance.

Related to this, Sandy Bridge introduced support for large pages in VT-d (faster DMA for I/O, chipset translates virtual addresses to physical), but in fact still fractioned large pages into 4KB pages. Ivy Bridge fully supports large pages in VT-d.

Only Xen 4.3 (July 2013) and KVM 1.4 (Spring 2013) support these new features. Both VMware and Microsoft are working on it, but the latest documents about vSphere 5.5 do not mention anything about APICv. AMD is working on an alternative called Advanced Virtual Interrupt Controller (AVIC). We found AVIC inside the AMD64 programmer's manual at page 504, but it is not clear which Opterons will support it (Warsaw?).

Introduction Improvements, Continued
Comments Locked

70 Comments

View All Comments

  • JohanAnandtech - Friday, September 20, 2013 - link

    I have to admit were are new to SPECjbb 2013. Any suggestions for the JVM tunings to reduce the GC latency?
  • mking21 - Wednesday, September 18, 2013 - link

    Surely its more interesting to see if the 12 core is faster than the 10 and 8 core V2s.
    Its not obvious to me that the 12 Core can out perform the 2687w v2 in real world measures rather than in synthetic benchmarks. The higher sustained turbo clock is really going to be hard to beat.
  • JohanAnandtech - Wednesday, September 18, 2013 - link

    There will be a follow-up, with more energy measurements, and this looks like a very interesting angle too. However, do know that the maximum Turbo does not happen a lot. In case of the 2697v2, we mostly saw 3 GHz, hardly anything more.
  • mking21 - Wednesday, September 18, 2013 - link

    Yes based on bin specs 3Ghz is what I would expect from 2697v2 if more than 6 or more cores are in use. 5 or more cores on 2687wv2 will run @ 3.6Ghz. While 2690v2 will run 3.3Ghz with 4 or more cores. So flat out the 12 core will be faster than 10 core will be faster than 8 core - but in reality hard to run these flat out with real-world tasks, so usually faster clock wins. Look forward to u sharing some comparative benchmarks.
  • psyq321 - Thursday, September 19, 2013 - link

    3 GHz is the maximum all-core turbo for 2697 v2.

    You are probably seeing 3 GHz because several cores are in use and 100% utilized.
  • JohanAnandtech - Friday, September 20, 2013 - link

    With one thread, the CPU ran at 3.4 GHz but only for very brief periods (almost unnoticeable).
  • polyzp - Saturday, September 21, 2013 - link

    AMD's Kaveri IGPU absolutley destroys intel iris 5200! Look at the first benchmarks ever leaked! +500% :O

    AMDFX .blogspot.com
  • Jajo - Tuesday, October 1, 2013 - link

    E5-2697v2 vs. E5-2690 +30% performance @ +50% cores? I am a bit disappointed. Don't get me wrong, I am aware of the 200 Mhz difference and the overall performance per watt ratio is great but I noticed something similar with the last generation (X5690 vs. E5-2690).
    There are still some single threaded applications out there and yes, there is a turbo. But it won't be aggressive on an averagely loaded ESXi server which might host VMs with single threaded applications.
    I somehow do not like this development, my guess is that the Hex- or Octacore CPUs with higher clocks are still a better choice for virtualization in such a scenario.

    Just my 2 cents
  • Chrisrodinis - Wednesday, October 23, 2013 - link

    Here is an easy to understand, hands on video explaining how to upgrade your server by installing an Intel E5 2600 V2 processor: http://www.youtube.com/watch?v=duzrULLtonM
  • DileepB - Thursday, October 31, 2013 - link

    I think 12 core diagram and description are incorrect! The mainstream die is indeed a 10 core die with 25 MB L3 that most skus are derived from. But the second die is actually a 15 core die with 37.5 MB. I am guessing (I know I am right :-))
    That they put half of the 10 core section with its QPIs and memory controllers, 5 cores and 12.5 MB L3 on top and connected the 2 sections using an internal QPI. From the outside it looks like a 15 core part, currently sold as a 12 core part only. A full 15 core sku would require too much power well above the 130W TDP that current platforms are designed for. They might sell the 15 core part to high end HPC customers like Cray! The 12 core sku should have roughly 50% higher die area than the 10 core die!

Log in

Don't have an account? Sign up now