Other Improvements

One of the improvements that caught our attention was the "Fast Access of FS & GS base registers". We were under the impression that segment registers go unused in a modern OS with 64-bit flat addressing (the Binary Translation VMM of VMware being the exception), but the promise of a "Critical optimization for large thread-count server workloads" in Intel's Xeon E5-2600 V2 presentation seems to indicate otherwise.

Indeed, a modern operating system makes almost no use of the segment registers, but FS and GS are the exception. One of them (GS on 64-bit Windows, FS on 64-bit Linux) points to the Thread Local Storage block, a per-thread data structure that stores information unique to each thread and is accessed quite a bit when many threads are running concurrently.
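
To make the point a bit more concrete, here is a minimal C sketch of our own (not from Intel's material): a __thread variable compiles to a segment-relative memory access, so every per-thread lookup goes through the FS or GS base.

#include <pthread.h>
#include <stdio.h>

/* One instance of this variable exists per thread; on x86-64 Linux GCC turns
   accesses into FS-relative loads/stores (mov %fs:offset), while 64-bit
   Windows uses GS for the same purpose. */
static __thread int per_thread_counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    per_thread_counter++;   /* segment-relative access, no locking needed */
    printf("counter in this thread: %d\n", per_thread_counter);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;               /* build with: gcc -pthread tls_demo.c */
}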

That sounds great, but unfortunately operating system support alone is not sufficient to benefit from this. An older Intel presentation states that the feature is implemented by adding "four new instructions for ring-3 access of FS & GS base registers" (RDFSBASE, RDGSBASE, WRFSBASE, and WRGSBASE). GCC 4.7 (and later) exposes them via the "-mfsgsbase" flag, so your source code has to be recompiled to make use of this. So although Ivy Bridge could make user thread switching a lot faster, it will take a while before commercial code actually takes advantage of it.
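
As a rough sketch of what ring-3 access looks like (our own example, not Intel's), GCC makes the new instructions available as intrinsics when built with -mfsgsbase; note that the CPU must support FSGSBASE and the kernel must enable it, otherwise the instructions simply fault.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* RDFSBASE / RDGSBASE: read the segment base registers directly from
       user mode, without a system call or an MSR access. If the kernel has
       not set CR4.FSGSBASE, these instructions raise #UD (SIGILL). */
    unsigned long long fs_base = _readfsbase_u64();
    unsigned long long gs_base = _readgsbase_u64();

    printf("FS base: %#llx, GS base: %#llx\n", fs_base, gs_base);
    return 0;               /* build with: gcc -mfsgsbase fsgsbase_demo.c */
}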

Other ISA optimizations, such as the Float16 to and from single-precision conversion instructions (F16C), will be useful for some image and video processing applications, but we cannot imagine that many server applications will benefit from them. HPC and render farms, on the other hand, may find them useful.
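
For the curious, this is roughly what those conversions look like from C (our own illustration, compiled with GCC's -mf16c): four single-precision floats are packed down to half precision and expanded back.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float in[4] = { 1.0f, 0.5f, 3.14159f, -2.0f };
    float out[4];

    __m128  sp = _mm_loadu_ps(in);
    /* VCVTPS2PH: four single-precision floats -> four half-precision values */
    __m128i hp = _mm_cvtps_ph(sp, _MM_FROUND_TO_NEAREST_INT);
    /* VCVTPH2PS: expand the half-precision values back to single precision */
    __m128  rt = _mm_cvtph_ps(hp);

    _mm_storeu_ps(out, rt);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;               /* build with: gcc -mf16c f16c_demo.c */
}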

The Uncore

The uncore part has received some modest improvements too. The snoop directory now has three states (Exclusive, Modified, Shared) instead of two, and it now improves performance in 2-socket configurations as well; in Sandy Bridge EP the snoop directory was disabled in 2-socket configurations because it hampered performance (disabling it is also a best practice on the Opterons).

Snoop broadcasting has also become a lot more "opportunistic". If there is a lot of traffic, broadcasts are avoided; if very little is going on, the uncore "snoops away". And if it is likely that the snoop directory will not have the entry, the snoop is issued before the directory feedback arrives. This "opportunistic" snooping keeps snoop traffic down, and as a result multi-core performance scales better, which is quite important when you are dealing with up to 24 physical cores in a system.
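
Purely to illustrate the policy as described (the thresholds and structure below are our own invention, certainly not Intel's actual uncore logic), the decision could be sketched like this:

/* Toy model of the "opportunistic" snoop decision described above. */
typedef struct {
    double link_utilization;    /* current interconnect traffic, 0.0 - 1.0 */
    double directory_hit_rate;  /* how often the snoop directory has the line */
} uncore_state_t;

typedef enum { WAIT_FOR_DIRECTORY, BROADCAST_SNOOP_NOW } snoop_action_t;

static snoop_action_t choose_snoop_action(const uncore_state_t *s)
{
    /* Lots of traffic: avoid extra broadcasts and rely on the directory. */
    if (s->link_utilization > 0.75)
        return WAIT_FOR_DIRECTORY;

    /* Quiet links, or the directory is unlikely to have the entry:
       issue the snoop early instead of waiting for directory feedback. */
    if (s->link_utilization < 0.25 || s->directory_hit_rate < 0.5)
        return BROADCAST_SNOOP_NOW;

    return WAIT_FOR_DIRECTORY;
}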

Wrapping up, maximum PCI Express bandwidth with a mix of two-thirds reads and one-third writes has been further improved from 80GB/s (with quad-channel DDR3-1600) to 90GB/s. There are now two memory controllers instead of one to reduce latency, and bandwidth also improves thanks to the support for DDR3-1866. Lastly, the half-width QPI mode is disabled in turbo mode, as it is very likely that there is a lot of traffic over the interconnect between the sockets; turbo mode is, after all, triggered by heavy CPU activity.

Comments

  • ShieTar - Tuesday, September 17, 2013 - link

    Oops, you are perfectly right of course. In that case the 4960X actually gets slightly better efficiency (12.08, or 0.28 per thread and GHz) than the dual 2697s (33.56, or 0.26 per thread and GHz), which makes perfect sense.

    It also indicates the 4960X gets about 70% of the performance of a single 2697 at 38% of the cost. Then again, a 1270v3 gets you 50% of the performance at 10% of the price. So when talking farms (i.e. more than one system cooperating), four single-socket boards with 1270v3 will get you almost the power of a dual-socket board with 2697v2 (minus communication overhead), will likely have a similar power draw (plus communication overhead), and save you $4400 in the process. Since you use 32 instead of 48 threads, but 4 installations instead of 1, software licensing costs may vary strongly in either direction.

    Would be interesting to see this tested. Anybody willing to send AT four single-socket workstations?
  • hpvd - Tuesday, September 17, 2013 - link

    yes - this would be really interesting. But you should use an InfiniBand interconnect for good scaling. And without an expensive IB switch this could only be done with 3 machines...
  • DanNeely - Tuesday, September 17, 2013 - link

    Won't the much higher price of a 4 socket board kill any CPU cost savings?

    In any event, the 1270v3 is a unisocket chip so you'd need to do 4 boxes to cluster.

    Poking around on Intel's site it looks like all 1xxx Xeons are uniprocessor, 2xxx is dual socket, 4xxx quad, 8xxx octo socket. But the 4xxx series is still on 2012 models and the 8xxx on 2011 releases. The 4-way chips could just be a bit behind the 2-way ones being reviewed now; but with the 8-way ones not updated in 2 years I'm wondering if they're being stealth discontinued due to the minimal number of cases where 2 smaller servers aren't a better buy.
  • hpvd - Tuesday, September 17, 2013 - link

    I think we are talking about around 4 systems, each with one CPU, one mainboard, RAM, ... + a network interface card
  • hpvd - Tuesday, September 17, 2013 - link

    another advantage would be that these CPUs use the latest Haswell architecture: some workloads would greatly benefit from its AVX2 ...
  • Kevin G - Tuesday, September 17, 2013 - link

    I'd fathom the bigger benefit of Haswell is found in the TSX and L4 cache for server workloads. The benefits of AVX2 would be exploited in more HPC centric workloads. Now if Intel would just release a socketed 1200v3 series CPU with L4 cache.
  • MrSpadge - Tuesday, September 17, 2013 - link

    > Now if Intel would just release a socketed 1200v3 series CPU with L4 cache.

    Agreed! And someone should test it at server loads. And BOINC. And if only Intel would release an overclockable Haswell with L4 which we can actually buy!
  • ShieTar - Tuesday, September 17, 2013 - link

    A 4-socket board is expensive, but that's not the point I was making. A Xeon E5-4xxx is not likely to be less expensive than the E5-2xxx part anyway.

    The question was specifically how four single-socket boards (with 4 cores each, at 3.5GHz, and Haswell technology) would position themselves against a dual-socket board with 24 cores at 2.7GHz and Ivy Bridge EP tech. Admittedly, the 3 extra boards will add a bit of cost (~$500), and extra memory & communication cards, etc. can also add something depending on the usage scenario. Then again, a single 4-core might get the work done with less than half the memory of a 12-core, so you might save a little there as well.
  • psyq321 - Tuesday, September 17, 2013 - link

    E5-46xx v2 is coming in a few months; qualification samples are already available and for all intents and purposes it is ready - Intel just needs to ramp up production.

    E7-88xx v2 is coming in Q1 2014 and is definitely not discontinued. The platform (Brickland) will be compatible with both Ivy Bridge EX (E7-88xx v2 among others) and Haswell EX (E7-88xx v3 among others) CPUs and will also be able to take DDR4 RAM. It will require a different LGA 2011 socket, though.

    The EX platform will come with up to 15 cores in the Ivy Bridge EX generation.
  • Kevin G - Tuesday, September 17, 2013 - link

    The E5-46xx is simply a rebranded E5-26xx with official support for quad socket. The dies are going to be the same between both families. Intel is just doing extra validation for the quad-socket market, as that market tends to favor more reliability features as the socket count goes up.

    While not socket compatible, Brickland as a platform is expected to be used for the next (last?) Itanium chips.
