Other Improvements

One of the improvements that caught our attention was the "Fast Access of FS & GS base registers". We were under the impression that segment registers were not used in a modern OS with 64-bit flat addressing (with the exception of the Binary Translation VMM of VMware), but the promise of "Critical optimization for large thread-count server workloads" in Intel's Xeon E5-2600 V2 presentation seems to indicate otherwise.

Indeed, modern operating systems no longer rely on the segment registers for memory segmentation, but the FS and GS registers are an exception. On Windows, the GS register (in 64-bit mode; FS in 32-bit x86) points to the Thread Local Storage descriptor block; 64-bit Linux uses FS for the same purpose in user space. That per-thread block stores information unique to each thread and is accessed frequently when many threads are running concurrently.
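To make the idea of per-thread storage concrete, here is a minimal sketch using Python's threading.local. It is only a high-level analogue: it does not touch the FS or GS registers itself, and the names tls, worker and run_threads are made up for this illustration. What it shows is the behaviour the thread block provides, namely that each thread sees its own copy of the same variable.

```python
import threading

# One thread-local object; every thread gets its own independent attributes.
tls = threading.local()

def worker(name, results):
    # Each thread writes to "the same" attribute without clobbering the others.
    tls.value = name
    results[name] = tls.value

def run_threads(names):
    results = {}
    threads = [threading.Thread(target=worker, args=(n, results)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

seen = run_threads(["a", "b", "c"])
# The main thread never assigned tls.value, so it has no such attribute.
main_has_value = hasattr(tls, "value")
```

In native code the equivalent lookup is typically an address computed relative to the FS or GS base, which is why fast access to those base registers matters once thread counts get large.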

That sounds great, but unfortunately operating system support alone is not sufficient to benefit from this. An older Intel presentation states that this feature is implemented by adding "Four new instructions for ring-3 access of FS & GS base registers". GCC 4.7 (and later) has a flag called "-mfsgsbase" that lets the compiler use these instructions when you recompile your source code. So although Ivy Bridge could make user thread switching a lot faster, it will take a while before commercial code actually takes advantage of this.

Other ISA optimizations (Float16 to and from SP conversion) will be useful for some image/video processing applications, but we cannot imagine that many server applications will benefit from this. HPC/render farms on the other hand may find this useful.
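As a rough illustration of what a Float16 round trip does to a value, the sketch below uses the half-precision format code "e" from Python's standard struct module. This is an assumption-laden stand-in: it runs in software and says nothing about how the hardware conversion instructions perform, and the helper name to_half_and_back is invented here.

```python
import struct

def to_half_and_back(x):
    """Round x to IEEE 754 half precision (16 bits), then widen it again."""
    packed = struct.pack("<e", x)          # 2 bytes: 1 sign, 5 exponent, 10 mantissa bits
    return struct.unpack("<e", packed)[0]

half_size_bytes = struct.calcsize("<e")    # storage per value in half precision
single_size_bytes = struct.calcsize("<f")  # storage per value in single precision

pi_ish = to_half_and_back(3.14159)         # loses precision: only ~3 decimal digits survive
exact = to_half_and_back(0.5)              # powers of two are represented exactly
```

Half precision halves the storage per value compared with single precision, which is the appeal for image and video data where three or so significant digits are often enough.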

The Uncore

The uncore has some modest improvements too. The snoop directory now has three states (Exclusive, Modified, Shared) instead of two, and it now improves server performance in 2-socket configurations as well. In Sandy Bridge the snoop directory was disabled in 2-socket configurations as it hampered performance (disabling it is also a best practice on the Opterons).

Also, the snoop broadcasting got a lot more "opportunistic". If lots of traffic is going on, broadcasts are avoided; if very little is going on, it "snoops away". If it is likely that the snoop directory will not have the entry, the snoop is issued prior to directory feedback. "Opportunistic" snooping makes sure that snooping traffic is reduced, and as a result multi-core performance scales better, which is quite important when you are dealing with up to 24 physical cores in a system.

Wrapping up, maximum PCI Express bandwidth when performing two thirds reads and one third writes has been further improved from 80GB/s (using quad-channel 1600 MT/s DDR3) to 90GB/s. There are now two memory controllers instead of one to reduce latency. Bandwidth is also improved thanks to the support for DDR3-1866. Lastly, the half-width QPI mode is disabled in turbo mode, as it is very likely that there is a lot of traffic on the interconnects between the sockets. Turbo mode is after all triggered by heavy CPU activity.
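To put the DDR3-1866 support in perspective, here is a back-of-the-envelope calculation of theoretical peak memory bandwidth per socket. It assumes four channels and the standard 64-bit (8-byte) DDR3 data bus per channel; the function name is made up for this sketch, and the results are theoretical ceilings, not measured figures.

```python
def peak_dram_bandwidth_gbs(transfers_mt_s, channels=4, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second).

    transfers_mt_s: data rate in mega-transfers per second (e.g. 1600 for DDR3-1600)
    channels:       memory channels per socket (assumed quad-channel)
    bus_bytes:      bytes moved per transfer per channel (64-bit bus = 8 bytes)
    """
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9

ddr3_1600 = peak_dram_bandwidth_gbs(1600)   # quad-channel DDR3-1600
ddr3_1866 = peak_dram_bandwidth_gbs(1866)   # quad-channel DDR3-1866
gain_percent = (ddr3_1866 / ddr3_1600 - 1) * 100

print(f"DDR3-1600: {ddr3_1600:.1f} GB/s per socket")
print(f"DDR3-1866: {ddr3_1866:.1f} GB/s per socket")
print(f"Gain:      {gain_percent:.1f}%")
```

Under these assumptions the faster memory raises the per-socket ceiling from 51.2 GB/s to roughly 59.7 GB/s, a gain of about 16.6%.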


70 Comments


  • psyq321 - Tuesday, September 17, 2013 - link

    Yep, EP-46xx v2 will use the same C1 stepping (for HCC SKUs) for production parts as 2P Xeons, but there will be some features enabled in microcode which did not make it in the 26xx SKUs.

    EX is already on D1 stepping for QS, as the validation cycle for EX is more strict due to more RAS features etc.
  • Casper42 - Tuesday, September 17, 2013 - link

    So I work for HP and your comments about 4x1P instead of 2x2P make me wonder if you have been sneaking around our ProLiant development lab in Houston.

    I was there 6 weeks ago and a decent sized cluster of 1P nodes was being assembled on an as yet unannounced HP platform. I was told the early/beta customer it was for had done some testing and found for their particular HPC app, they were in fact getting measurably better overall performance.

    The interesting thing about this design was they put 2 x 1P nodes on a single PCB (Motherboard) in order to more easily adapt the 1P nodes to a system largely designed with 2P space requirements in mind.

    Pretty sure the chips were Haswell based as well but can't recall for sure.
  • André - Tuesday, September 17, 2013 - link

    Would be nice to see benchmarks for OS X, considering this thing is going inside the new Mac Pro.

    Final Cut X, After Effects, Premiere Pro, Photoshop, Lightroom, DaVinci Resolve etc.

    I believe the 2660v2 hits the sweet spot with its 10 cores.
  • DanNeely - Tuesday, September 17, 2013 - link

    That'd require Apple giving Anandtech a new Mac Pro to run benchmarks on...
  • Kevin G - Tuesday, September 17, 2013 - link

    Now that Intel has officially launched the new Xeons, the new Mac Pro can't be far behind.
  • wallysb01 - Tuesday, September 17, 2013 - link

    Well, you could run the CPU benchmarks just fine. But not the GPU ones.
  • Simon G - Tuesday, September 17, 2013 - link

    Typo in Conclusion section . . . " Thta's no small feat, . . ."
  • garadante - Tuesday, September 17, 2013 - link

    There's a minor error on the Cinebench single-threaded graph. It has the clock speed for the E5-2697 v2 as 2.9 instead of 2.7, as it should be. Which is semi confusing on that graph as it explains the lower single-threaded performance from the E5-2690.
  • SanX - Tuesday, September 17, 2013 - link

    This forum has most obsolete comments design of pre-Neanderthals times, no Edit, no Delete, no look at previous user comments. Effin shame
  • MrSpadge - Tuesday, September 17, 2013 - link

    You mixed up forum and article comments.
