In the second quarter of this year, we’ll have affordable servers with up to 48 cores (AMD’s Magny-Cours) and 64 threads (Intel’s Nehalem-EX). The most obvious way to wield all that power is to consolidate massive numbers of virtual machines on those powerhouses: typically something like 20 to 50 VMs per machine. Port aggregation with a quad-port gigabit Ethernet card is probably not going to suffice. With 40 VMs sharing a quad-port gigabit NIC, each VM gets at best 100Mbit/s; we are back in the early Fast Ethernet days. Until virtualization took over, our network intensive applications each got a gigabit pipe; now we will be offering them ten times less? This is not acceptable.
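
A quick back-of-the-envelope calculation (a Python sketch using the VM counts above; it assumes the four gigabit ports aggregate perfectly, which is the best case) shows how thin the slices get:

# Bandwidth left per VM when a quad-port gigabit NIC is shared by all VMs.
uplink_mbit = 4 * 1000                    # quad-port gigabit, perfectly aggregated
for vm_count in (20, 40, 50):
    per_vm = uplink_mbit / vm_count
    print(f"{vm_count} VMs -> {per_vm:.0f} Mbit/s per VM at best")
# 20 VMs -> 200 Mbit/s per VM at best
# 40 VMs -> 100 Mbit/s per VM at best
# 50 VMs -> 80 Mbit/s per VM at best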

Granted, few applications actually need a full 1Gbit/s pipe. Database servers need considerably less, only a few megabits per second; even at full load, the servers in our database tests rarely go beyond 10Mbit/s. Web servers are typically satisfied with a few tens of Mbit/s, although AnandTech's own web server is frequently bottlenecked by its 100Mbit connection. Fileservers, however, can completely saturate gigabit links: our own fileserver in the Sizing Servers Lab routinely transmits 120MB/s, a saturated 1Gbit/s link, and the faster the fileserver, the shorter the wait to deploy images and install additional software. So if we want to consolidate these kinds of workloads on the newest “über machines”, we need something better than one or two gigabit connections for 40 applications.

Optical 10Gbit Ethernet (10GBase-SR/LR) saw the light of day in 2002. Like optical Fibre Channel in the storage world, it was very expensive technology. Somewhat more affordable 10G over “InfiniBand-ish” copper cable (10GBase-CX4) was born in 2004. In 2006, 10GBase-T held the promise that 10Gbit Ethernet would become available over ordinary copper UTP cables. That promise has still not materialized in 2010: CX4 is by far the most popular copper-based 10G Ethernet, and the reason is that 10GBase-T PHYs need too much power. The early 10GBase-T solutions needed up to 15W per port! Compare this to the 0.5W that a typical gigabit port needs, and you'll understand why you find so few 10GBase-T ports in servers. Broadcom reported a breakthrough just a few weeks ago, claiming that its newest 40nm PHYs use less than 4W per port. Still, it will take a while before 10GBase-T conquers the world, as this kind of state-of-the-art technology needs some time to mature.

We decided to check out some of the more mature CX4-based solutions, as they are decently priced and require less power. A dual-port CX4 card, for example, goes as low as 6W… that is 6W for the controller, both ports and the rest of the card, so a complete dual-port NIC needs considerably less than a single early 10GBase-T port. But back to our virtualized server: can 10Gbit Ethernet offer something that the current popular quad-port gigabit NICs can’t?
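
Put differently, here is what those power figures work out to per Gbit/s of port bandwidth (a simple sketch using the numbers quoted above; actual consumption varies per product):

# Watts per Gbit/s of raw port bandwidth, using the figures from the text.
options = {
    "early 10GBase-T port":    (15.0, 10),   # (watts, Gbit/s)
    "Broadcom 40nm 10GBase-T": (4.0, 10),    # claimed, per port
    "typical gigabit port":    (0.5, 1),
    "dual-port CX4 card":      (6.0, 20),    # whole card, two 10G ports
}
for name, (watts, gbit) in options.items():
    print(f"{name}: {watts / gbit:.2f} W per Gbit/s")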

Adapting the network layers for virtualization

When lots of VMs are hitting the same NIC, quite a few performance problems can arise. First, one network intensive VM may completely fill up the transmit queues and block access to the controller for some time, increasing the network latency that the other VMs see. Second, the hypervisor has to emulate a network switch that sorts and routes the packets of the various active VMs. Such an emulated switch costs quite a bit of processor time, and this emulation and the other network calculations might all be running on one core; in that case, that single core can limit your network bandwidth and raise network latency. That is not all: moving data around without DMA means the CPU has to handle all the memory move/copy actions too. In a nutshell, a NIC with a single transmit/receive queue plus a software emulated switch is not an ideal combination if you want to run lots of network intensive VMs: it reduces the effective bandwidth, raises NIC latency, and increases the CPU load significantly.
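
To see where the cycles go, here is a minimal, purely illustrative Python sketch of that software path: one queue shared by every VM, with a hypervisor core doing the Layer 2 lookup and the memory copy for each frame. All names and MAC values are invented for illustration.

from collections import deque

# Toy model of the software-emulated switch: a single shared queue, and a
# hypervisor core that sorts and copies every frame itself.
shared_rx_queue = deque()                       # one queue for all VMs
mac_table = {"00:50:56:aa:00:01": "vm1",        # virtual MAC -> VM (illustrative)
             "00:50:56:aa:00:02": "vm2"}
vm_buffers = {"vm1": [], "vm2": []}

def software_switch_poll():
    """Runs on one hypervisor core: Layer 2 lookup and copy, all in software."""
    frames_copied = 0
    while shared_rx_queue:
        frame = shared_rx_queue.popleft()
        vm = mac_table.get(frame["dst_mac"])                    # CPU does the sorting
        if vm is not None:
            vm_buffers[vm].append(bytearray(frame["payload"]))  # CPU does the copy
            frames_copied += 1
    return frames_copied                                        # every frame costs time on this core

shared_rx_queue.append({"dst_mac": "00:50:56:aa:00:02", "payload": b"hello"})
print(software_switch_poll(), vm_buffers)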


Without VMDQ, the hypervisor has to emulate a software switch. (Source: Intel VMDQ Technology)
 

Several companies have solved this I/O bottleneck by making use of multiple queues. Intel calls it VMDq; Neterion calls it IOV. A single NIC controller is equipped with several queues. Each receive queue can be assigned to the virtual NIC of a VM and mapped to that VM's guest memory. Interrupts are load balanced across several cores, avoiding the problem of one CPU being completely overwhelmed by the interrupts of tens of VMs.


With VMDq, the NIC becomes a Layer 2 switch with many different Rx/Tx queues. (Source: Intel VMDQ Technology)
 

When packets arrive at the controller, the NIC’s Layer 2 classifier/sorter places them, based on their virtual MAC addresses, in the queue assigned to the corresponding VM. Layer 2 sorting is thus done in hardware rather than in software; the hypervisor only has to look in the right queue and route the packets to the right VM. Packets that have to leave the physical server are placed in the transmit queues of the individual VMs; in the ideal situation each VM has its own queue, and packets are sent to the physical wire in round-robin fashion.
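
The following is a minimal, purely illustrative Python sketch of that data path: a Layer 2 sorter dropping incoming frames into per-VM receive queues based on their destination MAC, and a round-robin scheduler draining per-VM transmit queues onto the wire. The class and MAC values are invented for illustration; in a real VMDq NIC the classification happens in silicon, not in software.

from collections import deque
from itertools import cycle

class VmdqNicModel:
    """Toy model of a VMDq-style NIC: one Rx and one Tx queue per VM."""

    def __init__(self, mac_to_vm):
        self.mac_to_vm = mac_to_vm                        # virtual MAC -> VM name
        self.rx_queues = {vm: deque() for vm in mac_to_vm.values()}
        self.tx_queues = {vm: deque() for vm in mac_to_vm.values()}

    def classify_rx(self, frame):
        """Layer 2 sorter: drop the frame into the Rx queue of the target VM."""
        vm = self.mac_to_vm.get(frame["dst_mac"])
        if vm is not None:                                # unknown MACs are ignored in this toy model
            self.rx_queues[vm].append(frame)              # hypervisor later drains this queue into the VM

    def transmit_round_robin(self):
        """Drain the per-VM Tx queues one frame at a time, round-robin, onto the wire."""
        wire = []
        for vm in cycle(list(self.tx_queues)):
            if not any(self.tx_queues.values()):          # all queues empty: done
                break
            if self.tx_queues[vm]:
                wire.append(self.tx_queues[vm].popleft())
        return wire

nic = VmdqNicModel({"00:50:56:aa:00:01": "vm1", "00:50:56:aa:00:02": "vm2"})
nic.classify_rx({"dst_mac": "00:50:56:aa:00:02", "payload": b"ping"})
nic.tx_queues["vm1"].extend([{"payload": b"a1"}, {"payload": b"a2"}])
nic.tx_queues["vm2"].append({"payload": b"b1"})
print([f["payload"] for f in nic.transmit_round_robin()])  # b'a1', b'b1', b'a2': fair interleaving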

The hypervisor has to support this, and your NIC vendor must of course provide an “SR-IOV” capable driver for the hypervisor. VMware ESX 3.5 and 4.0 support VMDq and similar technologies under the name “NetQueue”; Microsoft Windows Server 2008 R2 supports this too, under the name “VMQ”.

Comments

  • Parak - Tuesday, March 9, 2010 - link

    The per-port prices of 10Gbe are still $ludicrous; you're not going to be able to connect an entire vmware farm plus storage at a "reasonable" price. I'd suggest looking at infiniband:

    Pros:

    40Gb/s theoretical - about 25Gb/s maximum out of single stream ip traffic, or 2.5x faster than 10Gbe.
    Per-switch-port costs are about 3x-4x lower than those of 10Gbe, and per-adapter-port costs are comparable.
    Latency even lower than 10Gbe.
    Able to do remote direct memory access for specialized protocols (google helps here).
    Fully supported under your major operating systems, including ESX4.

    Cons:

    Hefty learning curve. Expect to delve into mailing lists and obscure documentation, although just the "basic" IP functionality is easy enough to get started with.

    10Gbe has familiarity going for it, but it is just not cost-effective enough yet, whereas infiniband just seems to get cheaper, faster, and lately, a lot more user friendly. Just something to consider next time :D
  • has407 - Monday, March 8, 2010 - link

    Thanks. Good first-order test and summary. A few more details and tests would be great, and I look forward to more on this subject...

    1. It would be interesting to see what happens when the number of VMs exceeds the number of VMDQs provided by the interface. E.g., 20-30 VMs with 16 VMDQs... does it fall on its face? If yes, that has significant implications for hardware selection and VM/hardware placement.

    2. Would be interesting to see if the Supermicro/Intel NIC can actually drive both ports at close to an aggregate 20Gbs.

    3. What were the specific test parameters used (MTU, readers/writers, etc)? I ask because those throughput numbers seem a bit low for the non-virtual test (wouldn't have been surprised 2-3 years ago) and very small changes can have very large effects with 10Gbe.

    4. I assume most of the tests were primarily unidirectional? Would be interesting to see performance under full-duplex load.

    > "In general, we would advise going with link aggregation of quad-port gigabit Ethernet ports in native mode (Linux, Windows) for non-virtualized servers."

    10x 1Gbe links != 1x 10Gbe link. Before making such decisions, people need to understand how link aggregation works and its limitations.
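
    To make that concrete: with hash-based aggregation (802.3ad-style bonding), every flow is pinned to one physical link, so a single stream never gets more than 1Gb/s no matter how many gigabit ports you bond. A rough sketch in Python (the hash policy here is simplified for illustration):

    import hashlib

    LINKS = ["eth0", "eth1", "eth2", "eth3"]     # four bonded 1Gb/s ports

    def pick_link(src_ip, dst_ip, dst_port):
        """Hash-based selection, similar in spirit to a layer3+4 bonding policy."""
        key = f"{src_ip}-{dst_ip}-{dst_port}".encode()
        return LINKS[int(hashlib.md5(key).hexdigest(), 16) % len(LINKS)]

    # One big transfer = one flow = one link, so it tops out at ~1Gb/s:
    print(pick_link("10.0.0.1", "10.0.0.2", 445))          # same link every time
    # Many independent flows spread across links and can use the aggregate:
    for port in (5000, 5001, 5002, 5003):
        print(port, "->", pick_link("10.0.0.1", "10.0.0.2", port))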

    > "10Gbit is no longer limited to the happy few but is a viable backbone technology."

    I'd say it has been for some time, as vendors who staked their lives on FC or Infiniband have discovered over the last couple years much to their chagrin (at least outside of niche markets). Consolidation using 10Gbe has been happening for a while.
  • tokath - Tuesday, March 9, 2010 - link

    "2. Would be interesting to see if the Supermicro/Intel NIC can actually drive both ports at close to an aggregate 20Gbs. "

    At best, since it's a PCIe 1.1 x8 card, it would be about 12Gbps per direction, for a total aggregate throughput of about 24Gbps of bi-directional traffic.

    The PCIe 2.0 x8 dual port 10Gb NICs can push line rate on both ports.
  • somedude1234 - Wednesday, March 10, 2010 - link

    "At best since it's a PCIe 1.1 x8 would be about 12Gbps per direction for a total aggregate throughput of about 24Gbps bi-directional traffic."

    How are you figuring 12 Gbps max? PCIe 1.x can push 250 MBps per lane (in each direction). An x8 connection should max out around 2,000 MBps, which sounds just about right for a dual 10 GbE card.
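
    For what it's worth, both numbers follow from the raw PCIe 1.x figures; the difference is whether you subtract protocol overhead. A quick Python sketch (the ~20% overhead factor is an assumption, not a measured value):

    # PCIe 1.x back-of-the-envelope for an x8 dual-port 10GbE card.
    lane_raw_gt_s = 2.5                    # GT/s per lane, PCIe 1.x
    encoding      = 8 / 10                 # 8b/10b line coding
    lanes         = 8

    raw_gbit = lane_raw_gt_s * encoding * lanes      # 16 Gbit/s per direction
    raw_mb   = raw_gbit * 1000 / 8                   # 2000 MB/s, the figure above
    usable   = raw_gbit * 0.8                        # minus ~20% TLP/flow-control overhead (assumed)
    print(raw_gbit, raw_mb, round(usable, 1))        # 16.0 2000.0 12.8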
  • mlambert - Monday, March 8, 2010 - link

    This is a great article and I hope to see more like it.
  • krazyderek - Monday, March 8, 2010 - link

    In the opening statements it basically boils down to file servers being the biggest bandwidth hogs, so I'd like to see an SMB and enterprise review of how exactly you could saturate these connections, comparing the 4x 1Gb ports to your 10Gb cards in real-world usage. Everyone uses Chariot to show theoretical numbers, but I'd like to see real-world examples.

    What kind of RAID arrays, SSDs and CPUs are required on both the server AND CLIENT side of these cards to really utilize that much bandwidth?

    Other than a scenario such as 4 or 5 clients all writing large sequential files to a fileserver at the same time, I'm having trouble seeing the need for a 10Gb connection; even at that level you'd be limited by hard disk performance on a 4- or maybe even 8-disk RAID array unless you're using 15k drives in RAID 0.

    I guess I'd like to see the other half of this "affordable 10Gb" explained for SMB and how best to use it, when it's usable, and what is required beyond the server's NIC to use it.

    Continuing the above example, if the 4 or 5 clients were reading off a server instead of writing, you begin to be limited by the client CPU and HD write speeds; in this scenario, what upgrades are required on the client side to best make use of the 10Gb server?

    Hope this doesn't sound too newb.
  • dilidolo - Monday, March 8, 2010 - link

    I agree with you.

    The biggest benefit for 10Gb is not bandwidth, it's port consolidation, thus reducing total cost.

    Then it comes down to how much I/O the storage subsystem can provide. If the storage system can only provide 500MB/s, then how can a 10Gb NIC help?

    I also don't understand why anyone wants to run a file server as a VM and connect it to a NAS to store the actual data. A NAS is designed for that already; why add another layer?
  • JohanAnandtech - Monday, March 8, 2010 - link

    File server access is - as far as I have seen - not that random. In our case it is used to stream images (OS + desktop apps), software installations, etc.

    So in most cases you have relatively few users downloading hundreds of MB. Why would you not consolidate that file server? It uses very little CPU power (compared to the webservers) most of the time, and it can use the power of your SAN pretty well as it accesses the disks sequentially. Why would you need a separate NAS?

    Once your NAS is integrated in your virtualized platform, you can get the benefit of HA, live migration etc.

  • dilidolo - Monday, March 8, 2010 - link

    For most people, storage for a virtualized platform is NAS based (NFS/iSCSI). I still count iSCSI as NAS as it's an add-on to NAS. Most NAS devices support multiple protocols: NFS, CIFS, iSCSI, etc.

    If you don't have a proper NAS device, that's a different story, but if you do, why waste resources on the virtual host to duplicate the features your NAS already provides?
  • MGSsancho - Tuesday, March 9, 2010 - link

    The only thing I can think of at the moment is that your SAN is overburdened and you want to move portions of it into your VM to give your SAN more resources to do other things. As mentioned, streaming system images can be put on a cheap/simple NAS or a VM, allowing your SAN with all its features to do what you paid for it to do. Seems like a quick fix to free up your SAN temporarily; however, it is rare to see any IT shop set things up ideally. There are always various constraints.
