Other Improvements

One of the improvements that caught our attention was the "Fast Access of FS & GS base registers". We were under the impression that segment registers were not used in a modern OS with 64-bit flat addressing (with the exception of the Binary Translation VMM of VMware), but the promise of "Critical optimization for large thread-count server workloads" in Intel's Xeon E5-2600 V2 presentation seems to indicate otherwise.

Indeed, no modern operating system really uses the segment registers anymore, but FS and GS are the exception. The GS register (in 64-bit mode; FS in 32-bit x86) points to the Thread Local Storage descriptor block. That block stores information unique to each thread, and it is accessed quite a bit when many threads are running concurrently.
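
To make the access pattern a bit more concrete, here is a minimal sketch (assuming GCC with pthreads; the variable names, counts, and file name are just for illustration) of a thread-local variable, which the compiler turns into a load relative to the FS or GS base rather than a lookup through any shared structure:

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread gets its own copy of 'counter' in its Thread Local Storage
     * block; the compiler addresses it relative to the FS or GS base (which
     * register depends on the OS and mode). GCC on Linux x86-64, for example,
     * emits something like: mov %fs:counter@tpoff, %eax */
    static __thread int counter = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000; i++)
            counter++;    /* segment-relative access, no locking needed */
        printf("thread-local counter = %d\n", counter);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;    /* build with: gcc -pthread tls_demo.c */
    }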

That sounds great, but unfortunately operating system support alone is not enough to benefit from this. An older Intel presentation states that the feature is implemented by adding "Four new instructions for ring-3 access of FS & GS base registers". GCC 4.7 (and later) exposes them through the "-mfsgsbase" flag and matching intrinsics, so your source code has to be recompiled to make use of this. So although Ivy Bridge could make user-space thread switching a lot faster, it will take a while before commercial code actually uses it.
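
For reference, here is a rough sketch of what that ring-3 access looks like with GCC's intrinsics. The file name and build line are illustrative, and it assumes a CPU that implements FSGSBASE and a kernel that has enabled user-mode access (CR4.FSGSBASE):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Reads the FS/GS base directly from user space with the new instructions,
     * instead of a system call or an MSR access. Build with: gcc -mfsgsbase fsgsbase_demo.c
     * The instructions fault (#UD/SIGILL) if the CPU or kernel does not expose FSGSBASE. */
    int main(void)
    {
        uint64_t fs_base = _readfsbase_u64();   /* RDFSBASE */
        uint64_t gs_base = _readgsbase_u64();   /* RDGSBASE */

        printf("FS base = %#llx\n", (unsigned long long)fs_base);
        printf("GS base = %#llx\n", (unsigned long long)gs_base);

        /* _writefsbase_u64()/_writegsbase_u64() wrap WRFSBASE/WRGSBASE, which is
         * what would let a user-space threading library switch TLS blocks without
         * entering the kernel. */
        return 0;
    }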

Other ISA optimizations, such as Float16 to and from single-precision conversion, will be useful for some image and video processing applications, but we cannot imagine that many server applications will benefit from them. HPC and render farms, on the other hand, may find them useful.
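
For the curious, a minimal example of those Float16 conversions using the F16C intrinsics (the sample values and file name are arbitrary; the point is simply halving the storage footprint of single-precision data):

    #include <immintrin.h>
    #include <stdio.h>

    /* Round-trips four floats through half precision, the kind of thing an image
     * pipeline would do to halve its memory footprint.
     * Build with: gcc -mavx -mf16c f16c_demo.c */
    int main(void)
    {
        float in[4] = { 1.0f, 0.5f, 3.14159f, 65504.0f };   /* 65504 is the FP16 max */

        __m128  sp   = _mm_loadu_ps(in);
        /* VCVTPS2PH: four floats -> four half-precision values (round to nearest) */
        __m128i half = _mm_cvtps_ph(sp, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        /* VCVTPH2PS: back to single precision for computation */
        __m128  back = _mm_cvtph_ps(half);

        float out[4];
        _mm_storeu_ps(out, back);
        for (int i = 0; i < 4; i++)
            printf("%f -> fp16 -> %f\n", in[i], out[i]);
        return 0;
    }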

The Uncore

The uncore part has some modest improvements too. The snoop directory now has three states (Exclusive, Modified, Shared) instead of two, and it now improves server performance in 2-socket configurations as well. In Sandy Bridge, the snoop directory was disabled in 2-socket configurations as it hampered performance (disabling it is also a best practice on the Opterons).

Also, snoop broadcasting got a lot more "opportunistic". If lots of traffic is going on, broadcasts are avoided; if very little is going on, it "snoops away". If it is likely that the snoop directory will not have the entry, the snoop is issued before the directory feedback arrives. "Opportunistic" snooping keeps snoop traffic down, and as a result multi-core performance scales better, which is quite important when you are dealing with up to 24 physical cores in a system.
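
To make the idea of directory-filtered, opportunistic snooping a little more tangible, here is a purely conceptual sketch; this is not Intel's actual logic, and the 20% utilization threshold, state names, and function names are invented for illustration:

    #include <stdbool.h>
    #include <stdio.h>

    /* The directory tracks Shared/Exclusive/Modified per the article; a
     * "not present" case is added here so the filter has something to skip. */
    typedef enum { DIR_NOT_PRESENT, DIR_SHARED, DIR_EXCLUSIVE, DIR_MODIFIED } dir_state_t;

    typedef struct {
        bool snoop_remote;   /* send a snoop to the other socket(s)? */
        bool speculative;    /* issue it before the directory result is back? */
    } snoop_decision_t;

    snoop_decision_t decide_snoop(dir_state_t entry, unsigned qpi_utilization_pct)
    {
        snoop_decision_t d = { false, false };

        if (qpi_utilization_pct < 20) {
            /* Links mostly idle: "snoop away" speculatively to hide directory latency. */
            d.snoop_remote = true;
            d.speculative  = true;
            return d;
        }

        /* Links busy: only snoop when the directory says a remote copy may exist. */
        d.snoop_remote = (entry != DIR_NOT_PRESENT);
        return d;
    }

    int main(void)
    {
        snoop_decision_t d = decide_snoop(DIR_SHARED, 75);
        printf("snoop remote: %d, speculative: %d\n", d.snoop_remote, d.speculative);
        return 0;
    }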

Wrapping up, the maximum PCI Express bandwidth when performing two-thirds reads and one-third writes has been further improved from 80GB/s (using quad-channel 1600 MT/s DDR3) to 90GB/s. There are now two memory controllers instead of one to reduce latency, and bandwidth is also improved thanks to support for DDR3-1866. Lastly, the half-width QPI mode is disabled in turbo mode, as it is very likely that there is a lot of traffic over the interconnects between the sockets; turbo mode is, after all, triggered by heavy CPU activity.
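
As a quick sanity check of where the DDR3-1866 headroom comes from, here is a back-of-the-envelope calculation of the theoretical peak (assuming four 64-bit channels per socket; note that the 80/90GB/s figures above are measured with a mixed read/write pattern, not theoretical peaks):

    #include <stdio.h>

    /* Theoretical peak = channels x 8 bytes per transfer x transfer rate. */
    int main(void)
    {
        const double channels = 4.0;            /* quad-channel per socket */
        const double bytes_per_transfer = 8.0;  /* 64-bit DDR3 channel */

        double ddr3_1600 = channels * bytes_per_transfer * 1600e6 / 1e9;  /* GB/s */
        double ddr3_1866 = channels * bytes_per_transfer * 1866e6 / 1e9;  /* GB/s */

        printf("DDR3-1600, per socket: %.1f GB/s\n", ddr3_1600);   /* ~51.2 GB/s */
        printf("DDR3-1866, per socket: %.1f GB/s\n", ddr3_1866);   /* ~59.7 GB/s */
        printf("Dual socket, DDR3-1866: %.1f GB/s theoretical\n", 2 * ddr3_1866);
        return 0;
    }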

Comments

  • JohanAnandtech - Friday, September 20, 2013 - link

    I have to admit we are new to SPECjbb 2013. Any suggestions for JVM tunings to reduce the GC latency?
  • mking21 - Wednesday, September 18, 2013 - link

    Surely it's more interesting to see if the 12-core is faster than the 10- and 8-core V2s.
    It's not obvious to me that the 12-core can outperform the 2687W v2 in real-world measures rather than in synthetic benchmarks. The higher sustained turbo clock is really going to be hard to beat.
  • JohanAnandtech - Wednesday, September 18, 2013 - link

    There will be a follow-up with more energy measurements, and this looks like a very interesting angle too. However, do know that the maximum turbo does not happen a lot. In the case of the 2697 v2, we mostly saw 3 GHz, hardly anything more.
  • mking21 - Wednesday, September 18, 2013 - link

    Yes, based on the bin specs, 3 GHz is what I would expect from the 2697 v2 if six or more cores are in use. The 2687W v2 will run at 3.6 GHz with five or more cores, while the 2690 v2 will run at 3.3 GHz with four or more cores. So flat out the 12-core will be faster than the 10-core, which will be faster than the 8-core, but in reality it is hard to run these flat out with real-world tasks, so usually the faster clock wins. Looking forward to you sharing some comparative benchmarks.
  • psyq321 - Thursday, September 19, 2013 - link

    3 GHz is the maximum all-core turbo for 2697 v2.

    You are probably seeing 3 GHz because several cores are in use and 100% utilized.
  • JohanAnandtech - Friday, September 20, 2013 - link

    With one thread, the CPU ran at 3.4 GHz but only for very brief periods (almost unnoticeable).
  • polyzp - Saturday, September 21, 2013 - link

    AMD's Kaveri IGPU absolutely destroys Intel's Iris 5200! Look at the first benchmarks ever leaked! +500% :O

    AMDFX.blogspot.com
  • Jajo - Tuesday, October 1, 2013 - link

    E5-2697 v2 vs. E5-2690: +30% performance for +50% cores? I am a bit disappointed. Don't get me wrong, I am aware of the 200 MHz difference and the overall performance per watt ratio is great, but I noticed something similar with the last generation (X5690 vs. E5-2690).
    There are still some single-threaded applications out there, and yes, there is a turbo. But it won't be aggressive on an averagely loaded ESXi server which might host VMs with single-threaded applications.
    I somehow do not like this development; my guess is that the hex- or octa-core CPUs with higher clocks are still a better choice for virtualization in such a scenario.

    Just my 2 cents
  • Chrisrodinis - Wednesday, October 23, 2013 - link

    Here is an easy-to-understand, hands-on video explaining how to upgrade your server by installing an Intel E5 2600 V2 processor: http://www.youtube.com/watch?v=duzrULLtonM
  • DileepB - Thursday, October 31, 2013 - link

    I think the 12-core diagram and description are incorrect! The mainstream die is indeed a 10-core die with 25 MB L3 that most SKUs are derived from, but the second die is actually a 15-core die with 37.5 MB L3.
    I am guessing (I know I am right :-)) that they put half of the 10-core section, with its QPIs and memory controllers, 5 cores, and 12.5 MB L3, on top and connected the two sections using an internal QPI. From the outside it looks like a 15-core part, currently sold as a 12-core part only. A full 15-core SKU would require too much power, well above the 130W TDP that current platforms are designed for. They might sell the 15-core part to high-end HPC customers like Cray! The 12-core SKU should have roughly 50% more die area than the 10-core die!
