Istanbul's Improvements

The cores inside “Istanbul” are no different from those found in Shanghai. Istanbul introduces only a few improvements: HT Assist, slightly higher HT speeds, APML and x8 ECC.

x8 ECC: Each DRAM chip on a DIMM provides either 4 bits or 8 bits of a 64-bit data word. Chips that provide 4 bits are called x4 (by 4), and chips that provide 8 bits are called x8 (by 8). It takes eight x8 chips or sixteen x4 chips to make a 64-bit word, so at least eight chips are located on one or both sides of a DIMM. Istanbul’s memory controller now supports error correction for both x4 and x8 DIMMs.
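
The chip counts follow from dividing the word width by the chip width. A minimal Python sketch of that arithmetic is shown here; the function name is made up for illustration, and any extra chips that hold ECC check bits are deliberately left out, as they are in the text above.

```python
def chips_per_word(word_bits: int, chip_bits: int) -> int:
    """Number of DRAM chips needed to supply one data word."""
    if word_bits % chip_bits != 0:
        raise ValueError("chip width must divide the word width")
    return word_bits // chip_bits


x8_chips = chips_per_word(64, 8)  # chips that each provide 8 bits
x4_chips = chips_per_word(64, 4)  # chips that each provide 4 bits
print(f"x8 chips per 64-bit word: {x8_chips}")
print(f"x4 chips per 64-bit word: {x4_chips}")
```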

APML Remote Power Management Interface: APML provides an interface for monitoring and controlling platform power consumption via P-state limits. It requires a CPU and a BMC (management processor) that support APML, plus software (the OS or a management application) that can read power data through APML and change power management parameters. Both the hardware and the software are still in development, so APML won’t be available on the servers launching this month. APML is interesting because it would allow you to cap power without going into the BIOS. AMD’s PowerCap Manager limits power by making sure the CPU’s clock never goes beyond a set limit, effectively underclocking the CPU. This is very useful in a datacenter that is cooling or power limited. Of course, BIOS options are not that handy in a datacenter with hundreds of servers, and that is where APML could make the difference.

Higher HT Speeds: Later versions of the “Shanghai” Opteron support HyperTransport 3.0, or HT3. HT3 allows much higher clock speeds than the 1 GHz HyperTransport links that all the older Opterons have been using so far. With HT3, the clock was boosted to 2.2 GHz DDR, good for 8.8 GB/s in each direction. Istanbul pushes the HyperTransport clock up to 2.4 GHz DDR, good for 9.6 GB/s in each direction, which is as fast as the QPI links found on the slower “Nehalem” Xeons. Since the new Fiorano platform is not ready, we still have to test with an older NVIDIA MCP55 platform. That does not matter, though: the CPU interconnect speed is handled by the CPUs, not the board or chipset. You can clearly see this in the BIOS screenshot below:
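
The bandwidth figures can be reproduced with a small Python sketch. It assumes a 16-bit-wide link in each direction, which the article does not state; the function name is hypothetical.

```python
def ht_bandwidth_gb_s(clock_ghz: float, link_width_bits: int = 16) -> float:
    """Peak one-way bandwidth of a DDR HyperTransport link in GB/s.

    link_width_bits=16 is an assumption, not a figure from the article.
    """
    transfers_per_ns = clock_ghz * 2       # DDR: two transfers per clock cycle
    bytes_per_transfer = link_width_bits / 8
    return transfers_per_ns * bytes_per_transfer


for name, clock_ghz in (("older Opterons", 1.0), ("Shanghai HT3", 2.2), ("Istanbul", 2.4)):
    print(f"{name}: {ht_bandwidth_gb_s(clock_ghz):.1f} GB/s per direction")
```

Under that assumption, the 2.2 GHz and 2.4 GHz clocks give the 8.8 GB/s and 9.6 GB/s quoted above.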

The last improvement is HT Assist. We will discuss this feature in more detail.

HT Assist: Only for the Quad-Socket

HT Assist is a probe filter (or snoop filter) that AMD implemented. First, let us look at a quad Shanghai system. CPU 3 needs a cache line which CPU 1 has access to. The most recent data, however, is in CPU 2’s L2 cache.

Start at CPU 3 and follow the sequence of operations:

1. CPU 3 requests information from CPU 1 (blue “data request” arrow in diagram)
2. CPU 1 broadcasts to see if another CPU has more recent data (three red “probe request” arrows in diagram)
3. CPU 3 sits idle while these probes are resolved (four red & white “probe response” arrows in diagram)
4. The requested data is sent from CPU 2 to CPU 3 (two blue and white “data response” arrows in diagram)

There are two serious problems with this broadcasting approach. Firstly, it wastes a lot of bandwidth, as 10 transactions are needed to perform a relatively simple action. Secondly, those 10 transactions add a lot of latency to the instruction on CPU 3 that needs the piece of data (the data that CPU 3 requested from CPU 1).
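
The figure of 10 transactions is simply the sum of the arrows listed in the four steps above. A short Python tally, with labels that paraphrase the diagram legend:

```python
# Arrow counts taken from the four broadcast steps listed above.
broadcast_steps = {
    "data request (CPU 3 to CPU 1)": 1,
    "probe requests (broadcast by CPU 1)": 3,
    "probe responses": 4,
    "data response (CPU 2 to CPU 3)": 2,
}

total = sum(broadcast_steps.values())
for label, count in broadcast_steps.items():
    print(f"{count:2d}  {label}")
print(f"{total:2d}  transactions in total")
```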

The solution is a directory-based system that AMD calls HT Assist. HT Assist reserves a 1 MB portion of each CPU’s L3 cache to act as a directory. This directory tracks where that CPU’s cache lines are used elsewhere in the system. In other words, only 5 MB of each 6 MB L3 cache is left for data, but a lot of probe or snoop traffic is eliminated. To understand this, look at the picture below:

Let us see what happens. Start again with CPU 3:

1. CPU 3 requests information from CPU 1 (blue line)
2. CPU 1 checks its L3 directory cache to locate the requested data (fat red line)
3. The read from CPU 1’s L3 directory cache indicates that CPU 2 has the most recent copy, so CPU 1 probes CPU 2 directly (dark red line)
4. The requested data is sent from CPU 2 to CPU 3 (blue and white lines)
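
The lookup in steps 2 and 3 can be pictured with a toy model in Python. This is only a sketch of the idea, not AMD's implementation: the class, its methods, and the address 0x1000 are all hypothetical, and the real directory format is not described in this article.

```python
from typing import Dict, List


class ToyDirectory:
    """Hypothetical per-CPU directory: cache line address -> CPU holding the newest copy."""

    def __init__(self) -> None:
        self._owner: Dict[int, int] = {}

    def record(self, line_addr: int, cpu: int) -> None:
        """Note that `cpu` now holds the most recent copy of the line."""
        self._owner[line_addr] = cpu

    def probe_targets(self, line_addr: int) -> List[int]:
        """CPUs that must be probed: the recorded owner, or nobody if the line is untracked."""
        owner = self._owner.get(line_addr)
        return [] if owner is None else [owner]


# CPU 1's directory knows that CPU 2 holds the newest copy of line 0x1000 (made-up address).
directory_on_cpu1 = ToyDirectory()
directory_on_cpu1.record(0x1000, cpu=2)

print(directory_on_cpu1.probe_targets(0x1000))  # one directed probe, to CPU 2
print(directory_on_cpu1.probe_targets(0x2000))  # untracked line: no probes at all
```

The point of the sketch is that a single directory read replaces the broadcast: the home CPU either probes exactly one CPU or none.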

Instead of 10 transactions, we have only 4 this time. The result is a considerable reduction in latency and wasted bandwidth. Probe “broadcasting” can be eliminated in 8 of 11 typical CPU-to-CPU transactions. Stream measurements show that 4-way memory bandwidth improves by roughly 60%: 41.5 GB/s with HT Assist versus 25.5 GB/s without HT Assist.
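
For readers who want to check the percentages, here is a short Python calculation using only the numbers quoted in this paragraph; the variable and function names are ours.

```python
def percent_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


stream_gain = percent_gain(41.5, 25.5)     # Stream bandwidth, with vs. without HT Assist
fewer_transactions = 100 * (10 - 4) / 10   # 10 transactions reduced to 4
filtered_share = 100 * 8 / 11              # 8 of 11 transaction types need no broadcast

print(f"Stream bandwidth gain: {stream_gain:.1f}%")
print(f"Fewer transactions in the example: {fewer_transactions:.0f}%")
print(f"Transaction types without broadcast: {filtered_share:.1f}%")
```

The Stream gain comes out slightly above the rounded 60% figure.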

But it must be clear that HT Assist is only useful in quad-socket systems, and of the utmost importance in eight-socket systems. In a dual-socket system, a broadcast is the same as a unicast, as there is only one other CPU. HT Assist also lowers the hit rate of the L3 cache (5 MB instead of 6 MB), so it should be disabled on 2P systems. If you look in the BIOS...
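
A back-of-the-envelope Python sketch shows why the socket count matters. It is a simplification that assumes the home CPU probes every other socket when broadcasting and exactly one socket when the directory names an owner, as in the example above; the function names are hypothetical.

```python
def broadcast_probes(sockets: int) -> int:
    """Probes the home CPU sends when it has to ask every other socket."""
    return sockets - 1


def directed_probes(sockets: int) -> int:
    """Probes sent when the directory points at a single owner (simplified)."""
    return 1 if sockets > 1 else 0


for sockets in (2, 4, 8):
    saved = broadcast_probes(sockets) - directed_probes(sockets)
    print(f"{sockets} sockets: broadcast={broadcast_probes(sockets)}, "
          f"directed={directed_probes(sockets)}, saved={saved}")
```

With two sockets nothing is saved, while the 1 MB of L3 cache is still given up; with four or eight sockets the saving grows with every added CPU.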

...you get 3 options next to probe filter: “auto”, “disabled” and “MP”. In automatic mode, the probe filter (HT Assist) is turned off for 2P systems. You can force HT Assist on by setting “MP”, indicating there are more than 2 processors.


40 Comments


  • solori - Tuesday, June 2, 2009 - link

    I should have said "abundant (cheap) memory."
  • mkruer - Monday, June 1, 2009 - link

    I am disappointed that you did not bench X5550 vs 2435. This is the chip that the Opteron 2435 was designed to go up against, not the X5570 which is clocked 300MHz higher and 40% more expensive. Heaven forbid that you try to include chips at the same price point. That being said other sites that did compare based upon price, and not top of the line, show that the Opteron 2435 is indeed comparable to the X5550 at the same price point and speed. Now if AMD can up the speed of the hex core, then it will be a more direct comparison to the X5570. The X5570 is 50% faster but it is also >50% more in cost.
  • mino - Wednesday, June 3, 2009 - link

    Right.

    Actually, I have no qualms with comparing the best with the best, but the commentary is mostly out-of-place.
    I guess this was written after 3 days without sleep, but anyway.

    After an excellent vApus Mark I article I would expect better than old-school style:
    "1000 $ Pentium 4 3.2 EE is clearly (15%) better than $400 Athlon 3200+ so Athlon is clearly a piece of junk. Well maybe for games not so much but generally it is a piece of junk."

    Thank god the numbers tell their own story.
  • JohanAnandtech - Wednesday, June 3, 2009 - link

    It seems that some people like to create the impression that we did not take into account that both CPUs were not at the same pricing.

    However:

    http://it.anandtech.com/IT/showdoc.aspx?i=3571&...
    [quote]"However, as the Opteron 2435 competes with 2.66 GHz Xeon and not the Xeon 2.93 GHz, this is the first benchmark where “Istanbul” is competitive."[/quote]

    http://it.anandtech.com/IT/showdoc.aspx?i=3571&...
    [quote]"The Nehalem-based Xeon moves forward, but does not make a huge jump. Performance of the six-core Opteron was decreased by 2%, which is inside the error margin of this benchmark. It is still an excellent result for the latest Opteron: this result means it will have no trouble competing with the 2.66 GHz Xeon X5550."[/quote]

    http://it.anandtech.com/IT/showdoc.aspx?i=3571&...
    [quote]"The new Opteron 2435 at 2.6 GHz was a pleasant surprise on vApus Mark I: it keeps up with more expensive Xeons on ESX 3.5 update 4 while consuming less, and offers a competitive performance/watt and performance/price ratio on vSphere 4. The six-core Opteron is about 11 to 30% slower on vSphere 4 than the 2.93 GHz Xeon X5570 but the overall cost of the Istanbul platform is significantly lower (DDR-2 versus DDR-3) and the 2.6 GHz 2435 consumes less power in a virtualized environment"[/quote]

    And I have confidence that the vast majority of my readers are intelligent people who can decrease the benchmarks by 8 to 10% to see what a Xeon X5550 would do.
  • mino - Thursday, June 4, 2009 - link

    No, I do not like that, nor like to create such an impression.

    The article presents the numbers reasonably well for me. It is just that your (justified) love for Nehalem is glowing through and many, many comments were out of place.
    I believe this was not intentional but caused by your love for the Nehalem platform which is otherwise great.

    All the numbers tell one thing - Istanbul is generally on par with Nehalem clock for clock +- 10% depending on the workload.

    About that glowing love for Nehalem:
    >>>MCS eFMS 9.2
    "A single 8-thread Xeon X55xx is by far the best choice here."

    Why ? There is no 1*2435 number.
    Based on the numbers published single 2435 will get about 55-58rps which for all practical needs is identical performance to _flagship_ Nehalem.

    >>>3ds Max 2008 32b
    "We are sure that there are probably more efficient render engines out there, but it is simply not a market the AMD six-core should cater to. Nehalem-based Xeons are simply way too powerful for this kind of application. Render engines scale almost perfectly with clockspeed. So if cost is your main concern, consider the Xeon E5520 at 2.26 GHz, the cheapest CPU that still supports HT. We will test this one soon, but we expect it to deliver 67 frames per hour, which is still more than 20% better than any Opteron."

    OK, so first you bash (rightfully) the application for its rigid resource use pattern, then say that Nehalem is "way too powerful for this KIND of application" for Opteron to compete with.
    You managed to contradict your own reasoning to promote Nehalem for rendering while the numbers speak about a single improperly optimized app.
    Which the SW vendor will pretty certainly take care of in due time. These numbers are just a result of no (affordable) 6-core presence on the market up to now.

    By these 2 comments you took the article balance from "Istanbul is generally about 5% slower per_clock than Nehalem, in certain apps it is on par or better while in others it loses about 15%" - which is what the numbers tell - to "Istanbul is good for VMware, forget about it elsewhere".

    Which is about as much bad publicity as you could give to the second fastest CPU on the market by_large_margin.

    Fact is, at a given price, Nehalem box is ALMOST IDENTICAL performance-wise to Istanbul box. While both crush everything else on the market by 30+ %.
  • lopri - Monday, June 1, 2009 - link

    Page 2, "..The most recent data is however in CPU’s L2-cache" I think you meant CPU #2?
  • JohanAnandtech - Monday, June 1, 2009 - link

    Yes, good catch. Fixed the issue.
  • classy - Monday, June 1, 2009 - link

    I skipped right to the virtualization portions. It is by far becoming the most dominant criterion for most of the IT world. The 6 core opty looks solid there, so it will come down to price. Now with the quickly developing virtual desktop infrastructures, how well a platform does virtualization makes it just two fold more important. Many folks have already virtualized mission critical apps. I know we're doing Exchange in the near future. The days of separate physical servers and desktops are going the way of the dodo bird. It's becoming all about virtualization.
  • genkk - Tuesday, June 2, 2009 - link

    Why is power consumption not shown here? Did the benchmark guys at AnandTech lose the papers, or do they not want you to see it?

    Anyway, go to techreport.com, where Istanbul wins.
  • JohanAnandtech - Tuesday, June 2, 2009 - link

    More detailed power consumption numbers will be available in the next review.
