Other Improvements

One of the improvements that caught our attention was the "Fast Access of FS & GS base registers". We were under the impression that segment registers were not used in a modern OS with 64-bit flat addressing (with the exception of the Binary Translation VMM of VMware), but the promise of "Critical optimization for large thread-count server workloads" in Intel's Xeon E5-2600 V2 presentation seems to indicate otherwise.

Indeed, modern operating systems make almost no use of the segment registers, but the FS and GS registers are an exception. On Windows, the GS register (in 64-bit mode; FS in 32-bit x86) points to the per-thread block that holds Thread Local Storage; 64-bit Linux uses FS for the same purpose. That thread block stores unique information for each thread and is accessed quite a bit when many threads are running concurrently.

That sounds great, but unfortunately operating system support alone is not sufficient to benefit from this. An older Intel presentation states that this feature is implemented by adding "Four new instructions for ring-3 access of FS & GS base registers" (RDFSBASE, RDGSBASE, WRFSBASE and WRGSBASE). GCC 4.7 (and later) has a flag called "-mfsgsbase" that enables these instructions when you recompile your source code. So although Ivy Bridge could make user thread switching a lot faster, it will take a while before commercial code actually implements this.
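Before recompiling anything, it helps to know whether a machine advertises the feature at all. The following is a minimal sketch, assuming a Linux system where /proc/cpuinfo lists CPU feature flags (the kernel names these two "fsgsbase" and "f16c"). The helper name cpu_flags and the SAMPLE text are made up for illustration and are not part of any tool.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. non-x86 or empty input)


# Abbreviated, made-up sample of what an Ivy Bridge system might report.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse2 avx f16c rdrand fsgsbase\n"

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # not on Linux: fall back to the sample text
    flags = cpu_flags(text)
    for name in ("fsgsbase", "f16c"):
        print(name, "yes" if name in flags else "no")
```

A "yes" here only means the CPU reports the capability; the operating system still has to allow the instructions in user mode and the application has to be built to use them.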

Other ISA optimizations (Float16 to and from SP conversion) will be useful for some image/video processing applications, but we cannot imagine that many server applications will benefit from this. HPC/render farms on the other hand may find this useful.
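To make the Float16 trade-off concrete, here is a small software sketch of the same conversion. It uses the "e" (half precision) format of Python's struct module rather than the hardware instructions, so it only illustrates the storage saving and the precision loss, not the speed. The function name to_half_and_back is an illustrative choice.

```python
import struct


def to_half_and_back(x):
    """Round x to IEEE 754 half precision (16 bits) and convert it back to a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    # Half precision takes 2 bytes per value, single precision takes 4.
    print(struct.calcsize("<e"), struct.calcsize("<f"))
    # Values that fit in the 11-bit significand survive the round trip unchanged.
    print(to_half_and_back(1.0))
    # Most values do not: 0.1 comes back as the nearest representable half.
    print(to_half_and_back(0.1))
```

Halving the storage per value is what makes the format attractive for image and video data, where the lost precision is usually acceptable.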

The Uncore

The uncore part has some modest improvements too. The snoop directory now has three states (Exclusive, Modified, Shared) instead of two, and it now improves server performance in 2-socket configurations as well. In Sandy Bridge the snoop directory was disabled in 2-socket configurations as it hampered performance (disabling it is also a best practice on the Opterons).

Also, the snoop broadcasting got a lot more "opportunistic". If lots of traffic is going on, broadcasts are avoided; if very little is going on, it "snoops away". If it is likely that the snoop directory will not have the entry, the snoop is issued prior to directory feedback. "Opportunistic" snooping makes sure that snooping traffic is reduced and as a result the multi-core performance scales better, which is quite important when you are dealing with up to 24 physical cores in a system.
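The three rules above can be sketched as a toy decision function. This is purely illustrative: Intel has not published the actual heuristics, so the thresholds, the order in which the rules are checked, and every name below (snoop_policy, busy, idle, the returned labels) are assumptions.

```python
def snoop_policy(link_utilization, directory_likely_hit, busy=0.7, idle=0.2):
    """Pick a snoop strategy from interconnect load and a directory-hit guess.

    link_utilization is a fraction between 0.0 and 1.0. The thresholds 'busy'
    and 'idle' are hypothetical values, not figures from Intel.
    """
    if link_utilization >= busy:
        # Lots of traffic: avoid broadcasts and rely on the directory.
        return "directory_first"
    if link_utilization <= idle:
        # Very little traffic: broadcasting costs little, so "snoop away".
        return "broadcast"
    if not directory_likely_hit:
        # Directory probably lacks the entry: snoop before its feedback arrives.
        return "snoop_early"
    return "directory_first"


if __name__ == "__main__":
    for load, hit in ((0.9, True), (0.05, True), (0.5, False), (0.5, True)):
        print(load, hit, snoop_policy(load, hit))
```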

Wrapping up, maximum PCI Express bandwidth when performing two thirds reads and one third writes has been further improved from 80GB/s (using quad-channel 1600 MT/s DDR3) to 90GB/s. There are now two memory controllers instead of one to reduce latency. Bandwidth is also improved thanks to the support for DDR3-1866. Lastly, the half width QPI mode is disabled in turbo mode, as it is very likely that there is a lot of traffic on the interconnects between the sockets. Turbo mode is after all triggered by heavy CPU activity.
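As a rough cross-check of what DDR3-1866 support buys, the theoretical peak memory bandwidth of one socket is transfer rate times bus width times channel count. The sketch below assumes the standard 64-bit (8-byte) DDR3 channel and four channels per socket; these are upper bounds, not the measured figures quoted above, and the function name is illustrative.

```python
def peak_dram_bandwidth_gbs(mega_transfers_per_s, channels=4, bus_bytes=8):
    """Theoretical peak in GB/s: MT/s x bytes per transfer x channels / 1000."""
    return mega_transfers_per_s * bus_bytes * channels / 1000


if __name__ == "__main__":
    ddr3_1600 = peak_dram_bandwidth_gbs(1600)  # quad-channel DDR3-1600
    ddr3_1866 = peak_dram_bandwidth_gbs(1866)  # quad-channel DDR3-1866
    print(round(ddr3_1600, 1), "GB/s per socket")
    print(round(ddr3_1866, 1), "GB/s per socket")
    print("gain: %.1f%%" % ((ddr3_1866 / ddr3_1600 - 1) * 100))
```

The faster memory raises the per-socket ceiling from roughly 51 GB/s to roughly 60 GB/s, a gain of about 17%.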

70 Comments

  • Bytales - Tuesday, September 17, 2013 - link

    Please make some gaming related tests. I'm planning on upgrading from 2x2609 to 2x2690v2, now that I know for sure that 10 cores 25 MB cache is a complete die. I don't trust the design of the 12 core die very much, it's not how I would design the CPU. Besides, the 2690v2 is 3 GHz base and 3.6 boost, perfect for gaming.

    Would have liked to see how a 2690v2 would compare with a 2687W v2 in gaming related tests, seeing as the latter has a 3.4 base 4 GHz boost but 2 cores less.

    Anyways, I'm not paying 3000+ euros for a disabled die (like the one in the 2687W v2) so the 10 core is my choice, but still would have liked to see how higher freq lower core count would impact gaming performance!
  • mking21 - Wednesday, September 18, 2013 - link

    I can tell you now that the 8 core is going to kick the 10 core's ass for gaming. The higher clock will win here. So as you are going to pay 3000 euros you may as well get the best, even if it does have two cores disabled. But I do agree for me a more interesting comparison would have been 12 vs 10 vs 8 all V2s all fastest clock available versions...
  • mapesdhs - Wednesday, September 18, 2013 - link


    IMO for gaming you'd be better off with a used oc'd 2700K. I just bought one for 160 UKP,
    fitted with a used TRUE (cost 15), two new Coolermaster Blademaster fans, Q-fan active
    (ASUS M4E mbd, used, cost 130), runs at 5GHz no problem, silent running when idle. See:

    http://valid.canardpc.com/a64s8p

    The vast majority of games gain the most from a sensible middle ground between
    multiple cores and a high clock. Few will properly exploit more than 4 cores with HT.
    Using a multi-core XEON for gaming is silly. You would see far greater gaming
    performance by getting a much cheaper 4/6-core and spending the saved cash on
    more powerful GPUs like two 780 or Titans SLI, or two 7970 CF, etc. A 4-core Z68
    should be just fine, though if you do want oodles of PCIe lanes for high-end SLI/CF
    then I'd get X79 and a 3930K (don't see the point of IB-E).

    Trust me, a 5GHz 2700K, or a 4.7GHz 3930K, paired with two much better GPUs
    via the saved money, will be massively better for gaming vs. what you could afford
    having spent thousands on two 10 or 12-core CPUs with much lower clocks. Most
    2600Ks will oc pretty nicely too.

    Bytales, what GPU(s) do you have in your system atm?

    Ian.

    PS. IB/HW are a waste of time. They don't oc as well as SB. I bought a 2500K for 125, only
    took 3 mins to get it running 4.7 stable on a used Gigabyte Z68 board (which cost a mere 35).
  • Bytales - Saturday, September 21, 2013 - link

    The reason I'm looking at Xeons is because of the motherboard I own, which is the Z9PE-D8 WS, which I bought because I need the PCI Express lanes two Xeons provide. No other motherboard could have gotten me what this one does, and I have looked everywhere. That's the reason I need these Xeons. I originally bought two 2609 CPUs and a CrossFire Tahiti LE (one burned down due to bitcoin mining); their purpose was/is to make the PC usable until the new Xeons and the new Radeons become available. I know I won't be getting the best possible CPUs for gaming on this platform. I just want some decent performers. The 2609s I have now are 2.4 GHz, no boost, no HT, and did their job well so far. I'm expecting decent gaming performance out of a 3 GHz chip with multiple cores. Sure, I could get the 2687W v2 for the same price, but I have a hate for disabled things. Why the hell didn't they make a 10 core chip with 25 MB cache, 3.5 base, 4 GHz boost and 150-160 W TDP? I would have bought such a CPU. But as it is I'll have to make do with two 2690s. Maybe, just maybe, if I see some gaming benchmarks between the two CPUs, I will consider the 2687W v2. Until then, my first choice is the 2690.
    Hopefully, the people from AnandTech will test this aspect of the CPUs, gaming that is, because all they tested was server/enterprise stuff, which was to be expected after all.
    Gaming was not what these CPUs were built for. But I like having strong CPUs which will have my back if I decide to do some other stuff as well. I do a bunch of converting, compressing, AutoCAD, Photoshop, etc. That's why the more cores, the better.
  • Ktracho - Thursday, October 3, 2013 - link

    I would think you can get the PCIe lanes you want with a motherboard that has a PLX bridge chip, such as the ASUS P9X79-E WS, without needing to resort to a two-socket motherboard. As far as gaming, I think the E5-1620 v2 gives good bang for the money, and if you need more cores, the E5-1650 v2 does well, too. If you need a little better performance, you can get the E5-1680 v2, but at a price. Too bad Intel doesn't sell single-socket CPU versions with more than 6 cores, though.
  • MrSpadge - Tuesday, September 17, 2013 - link

    The Xeon2660v2 could in theory be what Ivy-E should have been for enthusiasts: something at least a bit more worth spending big $ on. The mainboard would have to let us enable multi-core turbo and OC the bus though.
  • psyq321 - Tuesday, September 17, 2013 - link

    Situation with IvyBridge EP is absolutely the same as with Sandy Bridge EP:

    - No BCLK "straps" (or ratios) for Xeon line - only 100 MHz allowed
    - No unlocked multipliers
    - BCLK overclocking works - your mileage may vary. I can get up to 105 MHz with dual Xeon 2697 v2 setup on Z9PE D8 WS

    So, Ivy Bridge EP Xeons do not overclock particularly well - the best you can get out of 2S parts (26xx v2) is 100-150 MHz depending on the max. turbo multiplier your SKU has.
  • ezekiel68 - Wednesday, September 18, 2013 - link

    Johan, what do you mean by "...over four NUMA nodes" in the last sentence on the Compression And Decompression page?

    My understanding is that for both Opteron and Xeon, a NUMA node is a complete CPU package (with all its cores) and the associated RAM directly connected to that CPU's memory controllers. In the charts, all of the Opterons are listed as "2x Opteron XXXX". Are you considering each die within the Opteron MCM package to be a separate NUMA node -- or how else are you coming up with "four" above?
  • JohanAnandtech - Friday, September 20, 2013 - link

    AFAIK, the two dies in the package communicate via hypertransport links and it is quicker for one die to communicate with its own memory than with the memory attached to the second die.
  • ddkeenan - Wednesday, September 18, 2013 - link

    The data in this article is incomplete. The JVM tuning used is targeted for throughput alone, basically ignoring GC pause times. The critical jOPS metric is intended to measure with response time constraints, and the results posted here are most likely highly variable and definitely dreadfully low because of the poor tuning choices.

    Actual customers care more about response time/latency these days. Throughput is often solved by scaling horizontally, response time is not. Commercial benchmarking should try to reflect that desire by focusing on response time and the SPECjbb2013 critical jOPS in order to influence hardware and software vendors to compete.

    Finally, to Kevin G, I think it's also likely that SPARC T-series systems have been focusing on customer metrics more than competitive benchmarks, and now there's a benchmark that takes response time into consideration.
