Monday, February 06, 2012

Bufferbloat on a 3G network

I've been experiencing terrible network performance lately on Sprint's 3G (EVDO) network in downtown Chicago. I have both a Sprint Mobile Broadband card for my laptop and an iPhone 4S on Sprint's network. Sprint performance used to be fantastic compared with AT&T and Verizon mobile data networks in Chicago, but the introduction of the iPhone on Sprint seems to have caused some capacity problems. The worst spot I've come across seems to be the Ogilvie train station.

I decided to run some diagnostics from my laptop in the area during a busy rush hour. I expected to see that the network was just hopelessly oversubscribed, with high packet loss. This is a very busy commuter train station, and there are probably tens of thousands of 3G smart-phones in the vicinity at rush hour. There's also lots of mirrored glass and steel in the high-rise buildings above - basically it's the worst "urban canyon" radio environment imaginable.

However, some simple ping tests from my laptop broadband card showed almost no packet loss. What these tests revealed instead was a problem that I had thought carriers knew to avoid, and that only really affected consumer devices: "buffer bloat".

[Chart: ping round-trip times from my laptop to the first-hop router on the 3G connection]
The data in this chart was collected on Friday, 3 February 2012 starting at 5:03 PM. I ran a simple ping test from my laptop to the first-hop router on the 3G connection, so I was essentially only testing the "wireless" part of the network. I presume that Sprint has fiber-based backhaul from their towers in downtown Chicago that has plenty of bandwidth.
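For anyone who wants to repeat this kind of test, here is a minimal Python sketch of how the saved ping output could be summarized. It assumes the usual Windows `ping` reply format (`time=92ms`, `Request timed out.`); the function name `summarize_ping`, the 10.0.0.1 address, and the sample lines are made up for illustration and are not my actual measurements.

```python
import re

# Matches the RTT field in a Windows ping reply line, e.g. "time=92ms" or "time<1ms".
RTT_PATTERN = re.compile(r"time[=<](\d+)\s*ms")


def summarize_ping(lines):
    """Return sent/lost counts and min/max RTT (ms) from ping output lines."""
    rtts = []
    lost = 0
    for line in lines:
        match = RTT_PATTERN.search(line)
        if match:
            rtts.append(int(match.group(1)))
        elif "timed out" in line.lower():
            lost += 1
    return {
        "sent": len(rtts) + lost,
        "lost": lost,
        "min_ms": min(rtts) if rtts else None,
        "max_ms": max(rtts) if rtts else None,
    }


# Hypothetical sample lines, invented for this example.
sample = [
    "Reply from 10.0.0.1: bytes=32 time=92ms TTL=64",
    "Reply from 10.0.0.1: bytes=32 time=1712ms TTL=64",
    "Request timed out.",
    "Reply from 10.0.0.1: bytes=32 time=640ms TTL=64",
]

print(summarize_ping(sample))
```

A low loss count combined with a large gap between the minimum and maximum RTT is the pattern that points toward queueing rather than an oversubscribed, lossy link.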

The minimum round-trip time observed was 92 ms, which is similar to what I see on 3G networks in uncongested locations. However, ping times varied wildly, and the worst round-trip time was over 1.7 seconds. An RTT this long is disastrous from a user experience standpoint. It means, for example, that connecting to an HTTPS-enabled site takes nearly 7 seconds before the first bit of the HTML web page is transferred to the client.
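The 7-second figure is just round trips added up. The sketch below spells out the arithmetic under some stated assumptions: one round trip for the TCP handshake, two for a full TLS handshake, and one for the HTTP request and the first bytes of the response. It ignores DNS lookups and server processing time, and the function name `time_to_first_byte` is only illustrative.

```python
def time_to_first_byte(rtt_s, tcp_rtts=1, tls_rtts=2, http_rtts=1):
    """Rough time before the first response byte arrives, counting only round trips.

    Assumes a full (non-resumed) TLS handshake costing two round trips and
    ignores DNS, server think time, and transmission time.
    """
    return rtt_s * (tcp_rtts + tls_rtts + http_rtts)


worst_case = time_to_first_byte(1.7)     # worst RTT observed in the test
best_case = time_to_first_byte(0.092)    # minimum RTT observed in the test

print(f"worst case: {worst_case:.1f} s, best case: {best_case:.2f} s")
```

With the same four round trips, the 92 ms minimum RTT works out to well under half a second, which is why the page feels fine on an uncongested tower.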

There was no other network activity on my laptop at all, so this insane result seems to have come from the 3G network itself. It looks to me as though Sprint has massively over-sized the buffers on their wireless towers. There is really no excuse for this at all given the recent attention buffer bloat has received in the networking community. I can't think of any circumstance beyond perhaps satellite communications where holding on to an IP packet for 1.5 seconds is at all reasonable.
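To get a feel for how much data 1.5 seconds of buffering represents, here is a back-of-the-envelope sketch. The relationship is simply delay = queued bits / drain rate. The 1 Mbit/s drain rate is an assumption picked for illustration, not something I measured, and on a real radio link the rate varies and link-layer retransmissions add delay too, so treat the numbers as an order-of-magnitude estimate only.

```python
def queued_bytes(delay_s, rate_bits_per_s):
    """Bytes that must sit in a FIFO queue to cause delay_s at the given drain rate."""
    return delay_s * rate_bits_per_s / 8


def queueing_delay(queued, rate_bits_per_s):
    """Seconds a packet waits behind `queued` bytes at the given drain rate."""
    return queued * 8 / rate_bits_per_s


ASSUMED_RATE = 1_000_000   # bits per second; hypothetical, not measured
PACKET_SIZE = 1500         # bytes; a typical full-size Ethernet-style packet

backlog = queued_bytes(1.5, ASSUMED_RATE)
print(f"{backlog:.0f} bytes queued, about {backlog / PACKET_SIZE:.0f} full-size packets")
print(f"delay check: {queueing_delay(backlog, ASSUMED_RATE):.1f} s")
```

Under that assumption, a 1.5 second delay means well over a hundred full-size packets are waiting in line ahead of each ping.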

Now, I suppose the problem could be caused by upstream buffering on the laptop. But as I said there was no other activity, confirmed by the wireless card's byte counters. Even if a flow-control mechanism in EVDO was telling my laptop not to transmit, or even telling it to re-transmit previous data, there should not be 1.5 seconds of buffering in the card or its drivers. An IP packet 1.5 seconds old should just be dropped.

I plan on doing more testing in the near future, but I have to ask the mobile networking experts out there: am I totally misinterpreting the data from this admittedly simple test? Is this buffering something inherent in CDMA technology? Can anybody think of a test to see if it is the OS or driver buffers holding on to the packets for so long (I don't think Wireshark would work for this)? Obviously I don't have access to hardware-based radio testing equipment, so software tests are all I can really do.

Assuming the problem is actually the downstream buffers in the tower, Sprint really needs to adjust their buffer sizes and start dropping some packets to make their 3G network usable again.

Technical details of equipment used in the test: Dell D430 laptop, Windows 7 current with all patches and service packs, Dell-Novatel Wireless 5720 Sprint Mobile Broadband (EVDO-Rev A) card with firmware version 145 and driver version
