Other Improvements

One of the improvements that caught our attention was the "Fast Access of FS & GS base registers". We were under the impression that segment registers were not used in a modern OS with 64-bit flat addressing (with the exception of the Binary Translation VMM of VMware), but the promise of "Critical optimization for large thread-count server workloads" in Intel's Xeon E5-2600 V2 presentation seems to indicate otherwise.

Indeed, no modern operating system uses the segment registers for regular addressing, but FS and GS are the exceptions. One of them points to the Thread Local Storage (TLS) block: GS under 64-bit Windows, FS under 64-bit Linux, with the roles swapped on 32-bit x86. That block stores the data unique to each thread and is accessed frequently when many threads are running concurrently.
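
As a rough illustration (our sketch, not from Intel's materials): every access to a compiler-managed thread-local variable turns into a segment-relative memory operand, which is why the FS/GS base sits on the hot path.

```c
/* Minimal sketch of why TLS traffic hits FS/GS. Compile on 64-bit Linux
 * with "gcc -O2 -S tls.c" and the increment below shows up as a
 * segment-relative access along the lines of
 *     movl %fs:tls_counter@tpoff, %eax
 * (64-bit Windows uses GS for the same purpose). */
#include <stdio.h>

static __thread int tls_counter;   /* one private copy per thread */

int bump(void)
{
    return ++tls_counter;          /* segment-relative load + store */
}

int main(void)
{
    printf("counter: %d\n", bump());
    return 0;
}
```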

That sounds great, but unfortunately operating system support alone is not sufficient to benefit from this. An older Intel presentation states that this feature is implemented by adding "Four new instructions for ring-3 access of FS & GS base registers" (RDFSBASE, RDGSBASE, WRFSBASE, and WRGSBASE). GCC 4.7 (and later) exposes these behind the "-mfsgsbase" flag, so source code has to be recompiled to make use of them. So although Ivy Bridge could make user thread switching a lot faster, it will take a while before commercial code actually takes advantage of this.
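
For the curious, a minimal sketch of what "-mfsgsbase" enables, using GCC's intrinsics. Note the caveat in the comment: this assumes the OS has opted in, which is exactly the missing operating system support mentioned above.

```c
/* Minimal sketch of ring-3 access to the FS/GS base via the new
 * instructions. Build with: gcc -mfsgsbase fsgsbase.c
 * Caveat: RDFSBASE/RDGSBASE fault with #UD unless the kernel has set
 * CR4.FSGSBASE, which no mainstream OS does at the time of writing. */
#include <immintrin.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* One unprivileged instruction instead of a system call or a
     * privileged MSR access to fetch the TLS block address. */
    uint64_t fs_base = _readfsbase_u64();   /* RDFSBASE */
    uint64_t gs_base = _readgsbase_u64();   /* RDGSBASE */

    printf("FS base: 0x%" PRIx64 "\n", fs_base);
    printf("GS base: 0x%" PRIx64 "\n", gs_base);
    return 0;
}
```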

Other ISA optimizations (Float16 to and from single-precision conversion, i.e. the F16C instructions) will be useful for some image/video processing applications, but we cannot imagine that many server applications will benefit from them. HPC and render farms, on the other hand, may find them useful.
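
As a quick sketch of what this conversion looks like in practice, assuming a compiler with F16C support (e.g. gcc -mf16c):

```c
/* Minimal sketch of Float16 <-> single-precision conversion via the
 * F16C instructions. Build with: gcc -mf16c f16c.c */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float in[4] = {1.0f, 0.5f, -2.25f, 65504.0f};  /* 65504 = max half */

    /* VCVTPS2PH: pack four floats into four half-precision values. */
    __m128i half = _mm_cvtps_ph(_mm_loadu_ps(in), _MM_FROUND_TO_NEAREST_INT);

    /* VCVTPH2PS: expand them back to single precision. */
    float out[4];
    _mm_storeu_ps(out, _mm_cvtph_ps(half));

    for (int i = 0; i < 4; i++)
        printf("%g -> %g\n", in[i], out[i]);
    return 0;
}
```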

The Uncore

The uncore has received some modest improvements too. The snoop directory now has three states (Exclusive, Modified, Shared) instead of two, and it now improves server performance in 2-socket configurations as well. On Sandy Bridge the snoop directory was disabled in 2-socket configurations as it hampered performance there (disabling it is also a best practice on the Opterons).

Also, snoop broadcasting has become a lot more "opportunistic". If there is a lot of traffic, broadcasts are avoided; if very little is going on, it "snoops away". If it is likely that the snoop directory will not have the entry, the snoop is issued before the directory feedback arrives. "Opportunistic" snooping keeps snoop traffic down, and as a result multi-core performance scales better, which is quite important when you are dealing with up to 24 physical cores in a system.
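
To make the last two paragraphs concrete, here is a toy model of the directory states and the broadcast policy. Every name and threshold is invented for illustration; Intel does not publish the actual encodings or heuristics.

```c
/* Toy model of the three-state snoop directory and the "opportunistic"
 * broadcast policy. Purely illustrative: the states are named after the
 * article's description, the rest is our guesswork. */
#include <stdbool.h>
#include <stdio.h>

enum dir_state { DIR_EXCLUSIVE, DIR_MODIFIED, DIR_SHARED };

/* Hypothetical watermarks on interconnect utilization, in percent. */
#define LOW_TRAFFIC   10
#define HIGH_TRAFFIC  70

static bool should_broadcast(unsigned qpi_util_pct, bool dir_hit_likely)
{
    if (qpi_util_pct < LOW_TRAFFIC)
        return true;    /* fabric is idle: "snoop away" */
    if (qpi_util_pct > HIGH_TRAFFIC)
        return false;   /* fabric is busy: filter through the directory */
    /* In between: issue the snoop before the directory answers only
     * when the directory probably has no entry for this line. */
    return !dir_hit_likely;
}

int main(void)
{
    printf("idle fabric     -> broadcast? %d\n", should_broadcast(5, true));
    printf("busy fabric     -> broadcast? %d\n", should_broadcast(90, false));
    printf("likely dir miss -> broadcast? %d\n", should_broadcast(40, false));
    return 0;
}
```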

Wrapping up, maximum PCI Express bandwidth when performing two-thirds reads and one-third writes has been further improved from 80GB/s (using quad-channel 1600 MT/s DDR3) to 90GB/s. There are now two memory controllers instead of one to reduce latency, and bandwidth is also improved thanks to the support for DDR3-1866. Lastly, the half-width QPI mode is disabled in turbo mode, as it is very likely that there is a lot of traffic over the interconnect between the sockets; turbo mode is, after all, triggered by heavy CPU activity.
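
As a back-of-the-envelope check (our own arithmetic, assuming the standard 8-byte DDR3 channel), the theoretical peak of a quad-channel DDR3-1866 configuration is:

$$ 4 \times 1866\,\text{MT/s} \times 8\,\text{B} \approx 59.7\,\text{GB/s per socket} $$

Doubling that for two sockets gives roughly 119GB/s of raw DRAM bandwidth, which helps explain why faster memory lifts the sustained 90GB/s figure quoted above.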

Comments

  • mczak - Tuesday, September 17, 2013

    Yes that's surprising indeed. I wonder how large the difference in die size is (though the reason for two dies might have more to do with power draw).
  • zepi - Tuesday, September 17, 2013

    How about adding turbo frequencies to the SKU comparison tables? That'd make comparing the SKUs a bit easier, as turbo is sometimes the more representative figure depending on the load that these babies run.
  • JarredWalton - Tuesday, September 17, 2013

    I added Turbo speeds to all SKUs as well as linking the product names to the various detail pages at AMD/Intel. Hope that helps! (And there were a few clock speed errors before that are now corrected.)
  • zepi - Wednesday, September 18, 2013

    Appreciated!
  • zepi - Wednesday, September 18, 2013

    For most server buyers things are not this simple, but for armchair sysadmins this might do:
    http://cornflake.softcon.fi/export/ivyexeon.png
  • ShieTar - Tuesday, September 17, 2013

    "Once we run up to 48 threads, the new Xeon can outperform its predecessor by a wide margin of ~35%. It is interesting to compare this with the Core i7-4960x results , which is the same die as the "budget" Xeon E5s (15MB L3 cache dies). The six-core chip at 3.6GHz scores 12.08."

    What I find most interesting here is that the Xeon manages to show a factor of 23 between multi-threaded and single-threaded performance, a very good scaling for a 24-thread CPU. The 4960X only manages a factor of 7 with its 12 threads. So it is not merely a question of "cores over clock speed", but rather hyperthreading seems to not work very well on the consumer CPUs in the case of Cinebench. The same seems to be true for the Sandy Bridge and Haswell models as well.

    Do you know why this is? Is hyperthreading implemented differently for the Xeons? Or is it caused by the different OS used (Windows 2008 vs Windows 7/8)?
  • JlHADJOE - Tuesday, September 17, 2013

    ^ That's very interesting. Made me look over the Xeon results and yes, they do appear to be getting close to a 100% increase in performance for each thread added.
  • psyq321 - Tuesday, September 17, 2013

    Hyperthreading is the same.

    However, the HCC version of IvyTown has two separate memory controllers and more features enabled (Direct Cache Access, different prefetchers, etc.), so it might scale better.

    I am achieving a 1.41x speed-up with a dual Xeon 2697 v2 setup compared to my old dual Xeon 2687W setup. This is so close to the "ideal" 1.5x scaling that it is pretty amazing. And the 2687W was running at a slightly higher clock in all-core turbo.

    So, I must say I am very happy with the IvyTown upgrade.
  • garadante - Tuesday, September 17, 2013

    It's not 24 threads, it's 48 threads for that scaling. 2x physical CPUs with 12 cores each, for 24 physical cores and a total of 48 logical cores.
  • Kevin G - Tuesday, September 17, 2013

    Actually, if you run the numbers, the scaling factor from 1 to 48 threads is 21.9. I'm curious what the result would have been with Hyper-Threading disabled, as that can actually decrease performance in some instances.
