Other Improvements

One of the improvements that caught our attention was the "Fast Access of FS & GS base registers". We were under the impression that segment registers were not used in a modern OS with 64-bit flat addressing (with the exception of the Binary Translation VMM of VMware), but the promise of "Critical optimization for large thread-count server workloads" in Intel's Xeon E5-2600 V2 presentation seems to indicate otherwise.

Indeed no modern operating system uses the segment registers, but FS and GS registers are an exception. The GS register (for 64-bit; FS for 32-bit x86) points to the Thread Local Storage descriptor block. That thread block stores unique information for each thread and is accessed quite a bit when many threads are running concurrently.

That sounds great, but unfortunately operating system support is not sufficient to benefit from this. An older Intel presentation states that this feature is implement by adding "Four new instructions for ring-3 access of FS & GS base registers". The GCC compiler 4.7 (and later) has a flag called "-fsgsbase" to recompile your source code to make use of this. So although Ivy Bridge could make user thread switching a lot faster, it will take a while before commercial code actually implements this.

Other ISA optimizations (Float16 to and from SP conversion) will be useful for some image/video processing applications, but we cannot imagine that many server applications will benefit from this. HPC/render farms on the other hand may find this useful.

The Uncore

The uncore part has some modest improvements too. The snoop directory has now three states (Exclusive, Modified, Shared) instead of two and it improves server performance in 2-socket configurations as well. In Sandy Bridge the snoop directoy was disabled in 2-socket configurations as it hampered performance (which is also a best practice on the Opterons).

Also, the snoop broadcoasting got a lot more "opportunistic". If lots of traffic is going on, broadcasts are avoided; if very little is going on, it "snoops away". If it is likely that the snoop directory will not have the entry, the snoop is issued prior to directory feedback. "Opportunistic" snooping makes sure that snooping traffic is reduced and as a result the multi-core performance scales better. Which is quite important when your are dealing with up to 24 physical cores in a system.

Wrapping up, maximum PCI Express bandwidth when performing two thirds reads and one third writes has been further improved from 80GB/s (using quad-channel 1600 MT/s DDR3) to 90GB/s. T here are now two memory controllers instead of one to reduce latency. Bandwidth is also improved thanks to the support for DDR3- 1866. Lastly, the half width QPI mode is disabled in turbo mode, as it is very likely that there is a lot of traffic between the interconnects between the sockets. Turbo mode is after all triggered by heavy CPU activity.

Improvements Positioning: SKUs and Servers
POST A COMMENT

70 Comments

View All Comments

  • Kevin G - Tuesday, September 17, 2013 - link

    Odd that Intel went the 3 die route with Ivy Bridge-EP. It was no surprise that the lowend would be a variant of the 6 core Ivy Bridge-E found in the Core i7-4900 series. Apple leaked that the line up would scale to 12 cores. The surprise is a native 10 core part and the differences between it and the 12 core design.

    Judging from the diagrams, Intel altered its internal ring bus for connecting cores. One ring goes orbits around all three columns of cores while another connects two columns. Thus the cores in the middle column have better latency for coherency as they have fewer stops on the ring bus to reach any core. The outer columns should have similar latency than the native 10 core chip for coherency: fewer cores to stop but longer traces on the die between columns.

    Not disclosed is how the 12 core chip divides cache. Previously each core would have a 2.5 MB of L3 cache that was more local than the rest of the L3 cache. The middle column may have access to L3 cache on both sides.

    The usage of dual memory controllers on the 12 core die is interesting. I wonder what measurable differences it produces. I'd fathom tests with a mix of reads/writes (ie databases) would show the greatest benefit as a concurrent read and write may occur. In a single socket configuration, enabling NUMA may produce a benefit. (Actually, how many single socket 2011 boards have this option?)
    Reply
  • madmilk - Tuesday, September 17, 2013 - link

    It looks like each ring is connected to two columns. One ring goes around all three, but does not connect to the center column. Reply
  • JlHADJOE - Tuesday, September 17, 2013 - link

    I'm guessing the 12-core might see action in the 8P segment, which is well overdue for an update. Reply
  • psyq321 - Tuesday, September 17, 2013 - link

    There will be 15-core E7 8xxx v2 CPUs based on the same IvyTown architecture.

    As Intel is not showing the die-shot of a 12 core Ivy EP, I wonder if the 15-core EX and 12-core EP are using the same 3x5 die.
    Reply
  • Kevin G - Tuesday, September 17, 2013 - link

    The memory controller interfaces are different between the Ivy Bridge-EP and Ivy Bridge-EX. The EP uses DDR3 in all of its forms (vanilla, ECC, buffered ECC, LR ECC) where as the EX version is going to use a serial interface similar in concept to FB-DIMMs. There will be two types of memory buffers for the EX line, one for DDR3 and later another that will use DDR4 memory. No changes need to be made to the new EX socket to support both types of memory. Reply
  • Brutalizer - Tuesday, September 17, 2013 - link

    I would have expected this newest Intel 12-core cpu to perform better. For instance, in Java SPECjbb2013 benchmarks, it gets 35,500 and 4,500. However, the Oracle SPARC T5 gets 75.700 and 23.300 which totally demolishes the x86 cpu. Have not the x86 cpus improved that much in comparison to SPARC? The x86 still lags behind?
    https://blogs.oracle.com/BestPerf/entry/20130326_s...
    Reply
  • JohanAnandtech - Tuesday, September 17, 2013 - link

    Be careful when you compare inflated, for marketing purposes results with independent "limited optimization" results ;-) Reply
  • Phil_Oracle - Friday, February 21, 2014 - link

    What do you mean by inflated for marketing purposes? SPECjbb2013 is clearly a real world, recent benchmark that’s full audited by all vendors on the SPEC committee. If you make such claims, surely you have some evidence? Reply
  • extide - Tuesday, September 17, 2013 - link

    Dont forget those T5's run at TDP's in the 200-300W range... If you clocked up one of these babies to those power levels I am sure it would be >= to the T5. Reply
  • Kevin G - Tuesday, September 17, 2013 - link

    TDP's are indeed higher on the SPARC side but not as radically as you indicate. Generally they do not consume more than 200W. (Unfortunately Oracle doesn't give a flat power consumption figure for just the CPU, this is just an estimate based upon their total system power calculator. For reference, the POWER7 is 200W and the POWER7+ is 180W.) Reply

Log in

Don't have an account? Sign up now