Monday, February 27, 2012

Alternatives to Cisco for 10 Gb/s Servers

This is a post written in response to Chris Marget's well-done series "Nexus 5K Layout for 10Gb/s Servers". While I appreciate the detail and thought that went into the Cisco-based design, he's clearly a "Cisco guy" showing a teensy-weensy bit of bias toward the dominant vendor in the networking industry. The end result of his design is a price of over $50K per rack just for networking, approaching the aggregate cost of the actual servers inside the rack! Cisco is the Neiman Marcus of the networking world.

So I came up with an alternate design based on anything-but-Cisco switches. The 10G switches from Arista here use Broadcom's Trident+ chip, which supports very similar hardware features to the Cisco solution (MLAG/vPC, DCB, hardware support for TRILL if needed in the future, etc.). Many other vendors offer 10G switches based on this merchant silicon, such as Dell/Force10 and IBM. Because this is commodity switching silicon that Broadcom will sell to just about anyone (like an x86 processor), pricing would be similar for Trident+ solutions from other vendors¹.

Like Chris, I also include a non-redundant 1G layer 2 switch for out-of-band management of servers and switches, in this case a Dell PowerConnect 5548. I have followed his design of 3-rack "pods", with sixteen 2U servers per rack, each server having a dual-port 10 Gbps SFP+ NIC. All needed optics are included. Server-to-switch copper cabling is not included in the pricing, nor are the fiber runs. Switch-to-switch Twinax cables are included.

Here's a quick diagram; note that the "core" switches are not included in the costs, but the optics used on those core switches are:

Made with Graphviz in 5 minutes because drop shadows don't add information to a diagram!

Vendor | SKU            | Description                                                                                  | Qty | List $ | Extended $
Arista | DCS-7050S-64-R | Arista 7050, 48x SFP+ & 4x QSFP+ switch, rear-to-front airflow, dual 460W AC power supplies |  2  | 29,995 | 59,990
Arista | QSFP-SR4       | 40GBASE-SR4 QSFP+ optic, up to 100m over OM3 MMF or 150m over OM4 MMF                       |  8  |  1,995 | 15,960
Arista | CAB-Q-Q-2M     | 40GBASE-CR4 QSFP+ to QSFP+ Twinax copper cable, 2 meter                                     |  2  |    190 |    380
Dell   | 225-0849       | PCT5548, 48 GbE ports, managed switch, 10GbE and stacking built-in                          |  1  |  1,295 |  1,295
Dell   | 320-2880       | SFP transceiver, 1000BASE-SX for PowerConnect, LC connector                                 |  4  |    169 |    676
       |                | Pod subtotal (list)                                                                         |     |        | 78,301
       |                | Per rack (subtotal ÷ 3)                                                                     |     |        | 26,100
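
If you want to sanity-check the arithmetic above, or swap in your own discounted pricing, the subtotal and per-rack figures fall out like this (a throwaway sketch using the list prices from the table; one pod covers three racks):

```python
# Quantity x list price for each line item, then the pod subtotal
# and the per-rack figure (one pod = 3 racks).
bom = [
    ("DCS-7050S-64-R", 2, 29995),
    ("QSFP-SR4",       8,  1995),
    ("CAB-Q-Q-2M",     2,   190),
    ("225-0849",       1,  1295),
    ("320-2880",       4,   169),
]

subtotal = sum(qty * price for _, qty, price in bom)
print(f"pod subtotal: ${subtotal:,}")         # $78,301
print(f"per rack:     ${subtotal / 3:,.0f}")  # ~$26,100
```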

As you can see, the cost is roughly half that of the Cisco Nexus 5000-based solution: just over US$26K per rack (list pricing) versus just over $50K per rack in Chris's favored design. The total oversubscription ratio is the same as in Chris's design as well, although here we have 4x40G links going to the core instead of his 16x10G. In any case, each 40G link can be broken out into 4x10G links with a splitter cable if your core isn't capable of 40G, or if you want a "wider" Clos-style architecture with more than two core/spine switches. You'll need layer 3 at the access layer or TRILL to take advantage of that design, though.
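
To make the oversubscription claim concrete, here's the back-of-the-envelope version, assuming all 96 server-facing ports in a pod are in use (48 dual-attached servers) and the 4x40G uplinks described above:

```python
# Edge bandwidth vs. uplink bandwidth for one 3-rack pod.
server_ports = 48 * 2        # 48 dual-attached servers per pod
edge_gbps    = server_ports * 10
uplink_gbps  = 4 * 40        # 2x 40G uplinks per switch, 2 switches
print(f"oversubscription: {edge_gbps / uplink_gbps:.0f}:1")  # 6:1

# The same uplink capacity expressed as 10G links, e.g. after breaking
# each QSFP+ uplink out into 4x10G with a splitter cable:
print(f"equivalent 10G uplinks: {uplink_gbps // 10}")        # 16
```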

Note also that the control and management planes of the two Arista 7050S switches remain independent for resiliency: this is not a "stacked" configuration. There is also 80 Gbps available between switches on the MLAG peer link in the event of uplink or downlink failures that cause traffic to transit the peer link (which it would not do in normal operation). Assuming, as Chris does, that your core switches support MLAG/vPC, the uplinks are all active and form a 4x40G port channel to the network core.

Finally, if you want to do 10GBASE-T instead of SFP+/Twinax to the servers, you can get away with spending less than $20K per rack! Arista's 7050T-64 switch is basically the same as the 7050S-64, but with 48 10GBASE-T ports instead of the 48 SFP+ ports, and it lists for just $20,995. If you assume, like everyone else, that servers will soon have 10GBASE-T "for free" on the motherboard, that is probably the way to go.

Full disclosure: I am not affiliated with any networking equipment vendor in any way, except as a small customer. I might indirectly own stock in one or more of the companies mentioned here via mutual funds, but if I do, I am unaware of it. I pay mutual fund managers to make those decisions for me, thereby stimulating the luxury-sedan/yacht/country-club sector of the economy.


1. Except Cisco, of course. They want $48K (list) for their Nexus 3064 which is based on the older "non-plus" version of the Broadcom Trident, and will therefore never support TRILL or DCB in hardware.

Monday, February 06, 2012

Bufferbloat on a 3G network

I've been experiencing terrible network performance lately on Sprint's 3G (EVDO) network in downtown Chicago. I have both a Sprint Mobile Broadband card for my laptop and an iPhone 4S on Sprint's network. Sprint performance used to be fantastic compared with AT&T and Verizon mobile data networks in Chicago, but the introduction of the iPhone on Sprint seems to have caused some capacity problems. The worst spot I've come across seems to be the Ogilvie train station.

I decided to run some diagnostics from my laptop in the area during a busy rush hour. I expected to see that the network was just hopelessly oversubscribed, with high packet loss. This is a very busy commuter train station, and there are probably tens of thousands of 3G smart-phones in the vicinity at rush hour. There's also lots of mirrored glass and steel in the high-rise buildings above - basically it's the worst "urban canyon" radio environment imaginable.

However, some simple ping tests from my laptop broadband card showed almost no packet loss. What these simple tests revealed instead was a problem I had previously thought carriers knew to avoid, one that only really affected consumer devices: bufferbloat.

[Chart: ping round-trip times to the first-hop router during the test]
The data in this chart was collected on Friday, 3 February 2012, starting at 5:03 PM. I ran a simple ping test from my laptop to the first-hop router on the 3G connection, so I was essentially testing only the "wireless" part of the network. I presume that Sprint has fiber-based backhaul with plenty of bandwidth from their towers in downtown Chicago.
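
If you want to reproduce this kind of measurement, the test itself is nothing fancy: one ping per second to the first-hop address, then look at loss and the spread of round-trip times. Here's a rough sketch of the idea (it assumes a Unix-like `ping` and that you already know the first-hop address, e.g. from traceroute/tracert; the address shown is a placeholder, and on Windows the ping flags and output format differ):

```python
# Rough sketch of a first-hop RTT test: ping once per second, then
# summarize packet loss and the min/max round-trip time.
import re
import subprocess
import time

FIRST_HOP = "10.0.0.1"   # placeholder: the 3G connection's first-hop router
SAMPLES = 300            # roughly five minutes at one probe per second

rtts = []
for _ in range(SAMPLES):
    out = subprocess.run(["ping", "-c", "1", "-W", "3", FIRST_HOP],
                         capture_output=True, text=True).stdout
    m = re.search(r"time=([\d.]+) ms", out)
    rtts.append(float(m.group(1)) if m else None)   # None = lost or timed out
    time.sleep(1)

good = [r for r in rtts if r is not None]
loss = 100.0 * rtts.count(None) / len(rtts)
print(f"loss {loss:.1f}%  min {min(good):.0f} ms  max {max(good):.0f} ms")
```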

The minimum round-trip time observed was 92 ms, which is similar to what I see on 3G networks in uncongested locations. However, ping times varied wildly, and the worst round-trip time was over 1.7 seconds. An RTT this long is disastrous from a user-experience standpoint. It means, for example, that connecting to an HTTPS-enabled site takes nearly 7 seconds before the first bit of the HTML page is transferred to the client.
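
The "nearly 7 seconds" figure is just round-trip counting: a brand-new HTTPS connection needs roughly four round trips before the first response byte arrives, assuming a full TLS handshake with no session resumption (one RTT for the TCP handshake, about two for TLS, and one for the HTTP request itself).

```python
# Back-of-the-envelope accounting for the ~7 second HTTPS figure,
# using the worst-case RTT observed in the ping test.
worst_rtt = 1.7   # seconds

round_trips = {
    "TCP three-way handshake":            1,
    "full TLS handshake (no resumption)": 2,
    "HTTP request / first response byte": 1,
}

total = sum(round_trips.values()) * worst_rtt
print(f"~{total:.1f} s to the first byte of HTML")   # ~6.8 s
```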

There was no other network activity on my laptop at all, so this insane result seems to have come from the 3G network itself. It looks to me as though Sprint has massively over-sized the buffers on their wireless towers. There is really no excuse for this at all given the recent attention buffer bloat has received in the networking community. I can't think of any circumstance beyond perhaps satellite communications where holding on to an IP packet for 1.5 seconds is at all reasonable.
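
To put a rough number on "massively over-sized": if roughly 1.5 seconds of that delay is packets queued at the tower, then at EVDO Rev A's peak downlink rate of about 3.1 Mbps (my assumption; the per-user rate at a congested cell is usually lower) that queue is holding on the order of half a megabyte of other people's traffic ahead of my ping.

```python
# Implied buffer occupancy if ~1.5 s of the delay is queueing at the tower,
# assuming (my assumption) the queue drains at EVDO Rev A's ~3.1 Mbps peak rate.
queue_delay_s = 1.5
link_mbps     = 3.1

buffered_bytes = queue_delay_s * link_mbps * 1_000_000 / 8
print(f"~{buffered_bytes / 1024:.0f} KB sitting in the buffer")   # ~568 KB
```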

Now, I suppose the problem could be caused by upstream buffering on the laptop. But as I said there was no other activity, confirmed by the wireless card's byte counters. Even if a flow-control mechanism in EVDO was telling my laptop not to transmit, or even telling it to re-transmit previous data, there should not be 1.5 seconds of buffering in the card or its drivers. An IP packet 1.5 seconds old should just be dropped.

I plan on doing more testing in the near future, but I have to ask the mobile networking experts out there: am I totally misinterpreting the data from this admittedly simple test? Is this buffering something inherent in CDMA technology? Can anybody think of a test to see whether it is the OS or driver buffers holding on to the packets for so long? (I don't think Wireshark would work for this.) Obviously I don't have access to hardware-based radio test equipment, so software tests are all I can really do.

Assuming the problem is actually the downstream buffers in the tower, Sprint really needs to adjust their buffer sizes and start dropping some packets to make their 3G network usable again.

Technical details of equipment used in the test: Dell D430 laptop, Windows 7 current with all patches and service packs, Dell-Novatel Wireless 5720 Sprint Mobile Broadband (EVDO-Rev A) card with firmware version 145 and driver version 3.0.3.0.