Friday, June 29, 2012

Windows Server 2012 storage = awesome sauce

We've been playing with the Windows Server 2012 release candidate on a new NAS system, and Storage Spaces together with deduplication make for an impressive combination (see screenshot).

89% deduplication rate
We copied a week's worth of database and disk-image backups from a few servers to a deduplication-enabled volume on the test system. This amounted to a total of 845 GiB of raw, uncompressed data files. After waiting a bit for the deduplication to kick in, we ended up with a space savings of roughly 90%.
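To put that in concrete terms, here is a back-of-the-envelope sketch in Python. The function names are mine, and the only inputs are the 845 GiB and the 89% to 90% savings figures quoted above; it is plain arithmetic, not output from any Windows tool.

```python
def stored_size_gib(raw_gib: float, savings_fraction: float) -> float:
    """Space still occupied on disk once dedup removes `savings_fraction` of the raw data."""
    if not 0.0 <= savings_fraction < 1.0:
        raise ValueError("savings_fraction must be in [0, 1)")
    return raw_gib * (1.0 - savings_fraction)


def dedup_ratio(savings_fraction: float) -> float:
    """Equivalent 'N:1' reduction ratio for a given savings fraction."""
    if not 0.0 <= savings_fraction < 1.0:
        raise ValueError("savings_fraction must be in [0, 1)")
    return 1.0 / (1.0 - savings_fraction)


RAW_GIB = 845.0  # raw backup data copied to the volume (from the post)

for savings in (0.89, 0.90):
    print(f"{savings:.0%} savings: {stored_size_gib(RAW_GIB, savings):.1f} GiB on disk, "
          f"about {dedup_ratio(savings):.1f}:1")
```

In other words, the whole week of backups ends up occupying somewhere around 85 to 93 GiB.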

This is the kind of result usually seen on purpose-built and reassuringly expensive deduplication appliances such as those from Data Domain.

The data copy process itself was also quite interesting. We configured twelve 2TB 7200 RPM drives into a Windows Storage Spaces pool, and set up a 5 TB NTFS volume on them in parity mode. Storage Spaces give you much of the flexibility of something like ZFS or Drobo: you create a pool of raw disks, and can carve it up into thin-provisioned volumes with different RAID and size policies. These volumes can be formatted with NTFS, ReFS, or shared out as raw iSCSI to other systems. Disks of different sizes can be added or removed and the pool will re-balance data automatically.

We copied the files from another NAS using ROBOCOPY with two threads, and the Windows 2012 system was able to write out the data at 100% of network speed (about 120 MiB/s) while using just 2% of a single Xeon E5-2620. Parity calculations are not a bottleneck here. Microsoft also supposedly has some tricks in Storage Spaces to prevent the "software RAID-5 write hole" for parity volumes, à la ZFS. The actual deduplication process took a few hours after the data was ingested, as it is a post-process system that runs at a low priority in the background.
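For a sense of scale, here is a rough Python estimate of how long the ingest takes at that rate. It assumes the copy sustained about 120 MiB/s from start to finish, which is a simplification on my part; the function name is purely illustrative.

```python
def transfer_hours(size_gib: float, rate_mib_per_s: float) -> float:
    """Hours needed to move `size_gib` at a constant `rate_mib_per_s`."""
    if rate_mib_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = size_gib * 1024 / rate_mib_per_s  # 1 GiB = 1024 MiB
    return seconds / 3600


# 845 GiB of backups at roughly GbE line rate (figures from the post)
print(f"{transfer_hours(845, 120):.1f} hours")
```

That works out to about two hours of copying, with the post-process deduplication taking a few more hours on top.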

There are caveats with the new deduplication feature, making it unsuitable for things like live VM disks or live databases. But it's certainly great for backup data, archival data, and general purpose file sharing. Management of the Storage Spaces and Deduplication features is dead-simple through the GUI, with sensible defaults. There is also a wealth of PowerShell commands to let you dig into the details not exposed in the GUI.

Finally, you can't beat the cost, which is basically "free" if you were already buying Windows Server 2012 anyway.

Friday, May 18, 2012

Fixing slow NFS performance between VMware and Windows 2008 R2

I've seen hundreds of reports of slow NFS performance between VMware ESX/ESXi and Windows Server 2008 (with or without R2) out there on the internet, mixed in with a few reports of it performing fabulously.

We use the storage on our big Windows file servers periodically for one-off dev/test VMware virtual machines, and have been struggling with this quite a bit recently. It used to be fast. Now it was very slow, like less than 3 MB/s for a copy of a VMDK. It made no sense.

We chased a lot of ideas. Started with the Windows and VMware logs of course, but nothing significant showed up. The Windows Server performance counters showed low CPU utilization and queue depth, low disk queue depth, less than 1 ms average IO service time, and a paltry 30 Mbps network utilization on bonded GbE links.

So where was the bottleneck? I ran across this Microsoft article about slow NFS performance when user name mapping wasn't set up, but it only seemed to apply to Windows 2003. Surely the patch mentioned there had made it into the 2008 code base?

Now, NFS version 3 is a really stupid and insecure protocol. I'm shocked it is still in widespread use frankly. There is basically no authentication other than easily spoofed source IP addresses; the server blindly trusts whatever user and group identifiers are set by the NFS client.

Another complication is that the POSIX permissions model of user/group/other bits isn't even close to the Windows ACL model using local users, domain users, groups, nesting of groups, and exclusions.

Ultimately, there have to be actual Windows security accounts assigned permissions to files on the Windows server. Therefore some means of "mapping" UNIX-style user and group IDs to Windows accounts and groups must be in place. Handling the lack of nesting, exclusions, inheritance, etc. on the UNIX side is an additional problem, so you often have to "dumb down" the Windows security model to make things work.

With Windows 2008 and later, you can use "unmapped" access for NFS, which really just means UNIX UID/GIDs mapped directly to Windows security identifiers (SIDs) created on-the-fly by the Windows server. Or you can choose to pick up your UNIX-to-Windows account mappings from Active Directory attributes.

VMware always sends userid=0 (root) and groupid=0 (also root) to NFS servers. On the Windows side of things, if you are using the "unmapped" method as we had been, this gets translated into a really strange-looking NTFS access control list (ACL). It will show permissions for security IDs with no usernames that look like "S-1-5-88-1-0".
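Here is a small Python sketch of how those odd-looking SIDs appear to be put together. The owner form (S-1-5-88-1-<uid>) matches what we saw on our files; the group form (S-1-5-88-2-<gid>) is my understanding of the convention rather than something shown above, so treat it as an assumption. The function names are made up for illustration.

```python
UNMAPPED_PREFIX = "S-1-5-88"  # prefix seen on the unmapped NFS ACL entries


def unmapped_owner_sid(uid: int) -> str:
    """SID the Windows NFS server appears to generate for an unmapped UNIX uid."""
    return f"{UNMAPPED_PREFIX}-1-{uid}"


def unmapped_group_sid(gid: int) -> str:
    """Assumed equivalent for an unmapped UNIX gid (sub-authority 2, not confirmed here)."""
    return f"{UNMAPPED_PREFIX}-2-{gid}"


def parse_unmapped_sid(sid: str):
    """Return ('uid' or 'gid', id) for an unmapped SID, or None for any other SID."""
    parts = sid.split("-")
    if len(parts) != 6 or "-".join(parts[:4]) != UNMAPPED_PREFIX:
        return None
    kind = {"1": "uid", "2": "gid"}.get(parts[4])
    if kind is None or not parts[5].isdigit():
        return None
    return kind, int(parts[5])


# VMware always presents itself as root:root
print(unmapped_owner_sid(0), unmapped_group_sid(0))
print(parse_unmapped_sid("S-1-5-88-1-0"))
```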

The first thing we did was reconfigure Windows NFS services to use Active Directory account mapping, then set up accounts with the uid=0 and gid=0 in AD. This worked, and assigned permissions to these new Active Directory accounts, but unfortunately it didn't improve performance at all.

So, I started looking at the permissions on the directories and files in our NFS shares. Someone had added permissions to a few directories so they could back up VMware files from the Windows side of things across the network using SMB file sharing. This was in addition to the "unmapped" UNIX-style permissions created by the Windows NFS service.

So, given the old KB article above that highlighted slow performance (but no access denied errors) when the Windows NFS server tried to find a mapping for a user, I decided tweaking the permissions was worth a shot. I ran across the nfsfile utility for setting NFS permissions on files. Finding little documentation online, the only aid I had was the command-line tool's help text:
NFSFILE [/v] [/s] [/i[[u=]|[g=]|[wu=]|[wg=]]] [/r[[u=]|[g=]|[m=]]] [/c[w|x]]
/? - this message
/v - verbose
/s - scan sub-directories for matching files
/i - include files matching the specified criteria
        u - NFS owner SID matches
        g - NFS group SID matches
        wu - NFS owner SID matches
        wg - NFS group SID matches
/r - replace specified option on file
        u - set uid
        g - set gid
        m - set modebits to
        wu - Set Windows Owner account
        wg - Set Windows Group account
/c - convert the file according to
        w - Windows style ACL (Mapped)
        x - Unix Style ACL (Unmapped)
After some experimentation, I found that this command:
nfsfile /v /s /ru=0 /rg=0 /rm=777 /cx DIRECTORYNAME
reset all the permissions to the UNIX style for unmapped NFS access.
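If you have more than one share to fix, a tiny Python helper can build the same command line for each directory. This is only a dry-run sketch: it prints the commands rather than running them, the directory paths are hypothetical, and it uses nothing beyond the flags from the command above.

```python
def nfsfile_reset_command(directory: str, uid: int = 0, gid: int = 0, mode: str = "777"):
    """Argument list for the nfsfile invocation shown above, applied to one directory."""
    return [
        "nfsfile",
        "/v",            # verbose
        "/s",            # scan sub-directories
        f"/ru={uid}",    # replace owner uid
        f"/rg={gid}",    # replace group gid
        f"/rm={mode}",   # replace mode bits
        "/cx",           # convert to UNIX-style (unmapped) ACL
        directory,
    ]


# Hypothetical share roots; substitute your own paths.
for share in (r"D:\nfs\vmstore1", r"D:\nfs\vmstore2"):
    print(" ".join(nfsfile_reset_command(share)))
```

Review the printed commands before pasting them into a prompt on the file server, since they rewrite permissions recursively.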

After eliminating the Active Directory integration configuration and restarting Windows NFS services, VMware performance via NFS was again quite fast, bounded only by the disk subsystem or network.

What I think was happening is this: the Windows Services for NFS, when it encounters additional Windows ACLs on the files shared via NFS, figures it has to go evaluate all of those permissions by doing AD lookups for user and group IDs. Since NFS is a stateless protocol, it has to do this for *every* read and write request from the client. We did see a lot of traffic to our domain controllers from the NFS servers.

I am guessing that when only the "simple" UNIX-style ACLs set by the nfsfile utility are in place, Windows NFS services does not have to make a request to Active Directory for each NFS request, so things are much faster.

It worked for us anyway, and I am too lazy to dig into it much further, having burned way too much time on it already. But I hope this write-up helps somebody out there.

Monday, February 27, 2012

Alternatives to Cisco for 10 Gb/s Servers

This is a post written in response to Chris Marget's well done series "Nexus 5K Layout for 10Gb/s Servers". While I appreciate the detail and thought that went into the Cisco-based design, he's clearly a "Cisco guy" showing a teensy-weensy bit of bias toward the dominant vendor in the networking industry. The end result of his design is a price that is over $50K per rack just for networking - approaching the aggregate cost of the actual servers inside the rack! Cisco is the Neiman Marcus of the networking world.

So I came up with an alternate design based on anything-but-Cisco switches. The 10G switches from Arista here use Broadcom's Trident+ chip, which supports very similar hardware features as the Cisco solution (MLAG/vPC, DCB, hardware support for TRILL if needed in the future, etc.). Many other vendors offer 10G switches based on this merchant silicon, such as Dell/Force10 and IBM. Because this is commodity switching silicon which BRCM will sell to just about anyone (like an x86 processor), pricing would be similar for Trident+ solutions from other vendors [1].

Like Chris, I also include a non-redundant 1G layer 2 switch for out-of-band management of servers and switches, in this case a Dell 5548. I have followed his design by organizing into 3-rack "pods", each containing sixteen 2U servers that each have a dual-port 10 Gbps network adapter with SFP+ ports. All needed optics are included. Server-to-switch copper cabling is not included in the pricing, nor are the fiber runs. Switch-to-switch Twinax cables are included.

Here's a quick diagram. Note that the "core" switches are not included in the costs, but the optics used on those core switches are:

Made with Graphviz in 5 minutes because drop shadows don't add information to a diagram!

Vendor | SKU | Description | Qty | List $ | Extended $
Arista | DCS-7050S-64-R | Arista 7050, 48xSFP+ & 4xQSFP+ switch, rear-to-front airflow and dual 460W AC power supplies | 2 | 29995 | 59990
Arista | QSFP-SR4 | 40GBASE-SR4 QSFP+ Optic, up to 100m over OM3 MMF or 150m over OM4 MMF | 8 | 1995 | 15960
Arista | CAB-Q-Q-2M | 40GBASE-CR4 QSFP+ to QSFP+ Twinax Copper Cable 2 Meter | 2 | 190 | 380
Dell | 225-0849 | PCT5548, 48 GbE Ports, Managed Switch, 10GbE and Stacking built-in | 1 | 1295 | 1295
Dell | 320-2880 | SFP Transceiver 1000BASE-SX for PowerConnect LC Connector | 4 | 169 | 676
Subtotal | | | | | 78301
Per rack | | | | | 26100
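For anyone who wants to re-run the numbers with their own discounts or a different vendor, here is the same bill of materials as a few lines of Python. The quantities and list prices are copied from the table; the variable names are mine, and the three-rack divisor comes from the pod layout described earlier.

```python
# (vendor, SKU, quantity, list price in US$) copied from the table above
BILL_OF_MATERIALS = [
    ("Arista", "DCS-7050S-64-R", 2, 29995),
    ("Arista", "QSFP-SR4", 8, 1995),
    ("Arista", "CAB-Q-Q-2M", 2, 190),
    ("Dell", "225-0849", 1, 1295),
    ("Dell", "320-2880", 4, 169),
]
RACKS_PER_POD = 3

subtotal = sum(qty * price for _vendor, _sku, qty, price in BILL_OF_MATERIALS)
per_rack = subtotal / RACKS_PER_POD

for vendor, sku, qty, price in BILL_OF_MATERIALS:
    print(f"{vendor:<7}{sku:<16}{qty:>3} x {price:>6} = {qty * price:>6}")
print(f"subtotal {subtotal}, per rack {per_rack:.0f}")
```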

As you can see, the costs are roughly half that of the Cisco Nexus 5000-based solution, at just over US$26K per rack (list pricing) versus just over $50K per rack in Chris's favored design. The total oversubscription ratio is the same as Chris's design as well, although we have 4x40G links going to the core here instead of 16x10G as in his design. 40G links can be broken out into 4x10G links in any case with a splitter cable if your core is not capable of 40G, or you want to use a "wider" Clos-style architecture with more than two core/spine switches. You'll need to do layer 3 access or TRILL to take advantage of that design, though.
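The oversubscription arithmetic is easy to sketch in Python. I am assuming here that every rack in the pod holds sixteen servers and that both 10G ports on every server are live, which is how the 96 server-facing SFP+ ports across the two switches get used up; the names are illustrative.

```python
def oversubscription(servers: int, ports_per_server: int, port_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth for one pod."""
    downstream = servers * ports_per_server * port_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


SERVERS_PER_POD = 3 * 16  # assumption: three racks of sixteen 2U servers each

this_design = oversubscription(SERVERS_PER_POD, 2, 10, uplinks=4, uplink_gbps=40)
chris_design = oversubscription(SERVERS_PER_POD, 2, 10, uplinks=16, uplink_gbps=10)
print(f"4x40G uplinks: {this_design:.0f}:1, 16x10G uplinks: {chris_design:.0f}:1")
```

Both come out at 6:1 under those assumptions, since 4x40G and 16x10G are each 160 Gbps of uplink.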

Note also that the control and management planes of the two Arista 7050S switches remain independent for resiliency: this is not a "stacked" configuration. There is also 80 Gbps available between switches on the MLAG peer link in the event of uplink or downlink failures that cause traffic to transit the peer link (which it would not do in normal operation). Assuming, as Chris does, that your core switches support MLAG/vPC, the uplinks are all active and form a 4x40G port channel to the network core.

Finally, if you want to do 10GBASE-T instead of SFP+/twinax to the servers, you can get away with spending roughly $20K per rack! Arista's 7050T-64 switch is basically the same as the 7050S-64, but has 48 10GBASE-T ports instead of the 48 SFP+ ports. And it lists for just $20995. If you assume like everyone else that servers will soon have 10GBASE-T "for free" on the motherboard, that is probably the way to go.

Full disclosure: I am not affiliated with any networking equipment vendor in any way, except as a small customer. I might indirectly own stock in one or more of the companies mentioned here via mutual funds, but if I do, I am unaware of it. I pay mutual fund managers to make those decisions for me, thereby stimulating the luxury-sedan/yacht/country-club sector of the economy.

1. Except Cisco, of course. They want $48K (list) for their Nexus 3064 which is based on the older "non-plus" version of the Broadcom Trident, and will therefore never support TRILL or DCB in hardware.

Monday, February 06, 2012

Bufferbloat on a 3G network

I've been experiencing terrible network performance lately on Sprint's 3G (EVDO) network in downtown Chicago. I have both a Sprint Mobile Broadband card for my laptop and an iPhone 4S on Sprint's network. Sprint performance used to be fantastic compared with AT&T and Verizon mobile data networks in Chicago, but the introduction of the iPhone on Sprint seems to have caused some capacity problems. The worst spot I've come across seems to be the Ogilvie train station.

I decided to run some diagnostics from my laptop in the area during a busy rush hour. I expected to see that the network was just hopelessly oversubscribed, with high packet loss. This is a very busy commuter train station, and there are probably tens of thousands of 3G smart-phones in the vicinity at rush hour. There's also lots of mirrored glass and steel in the high-rise buildings above - basically it's the worst "urban canyon" radio environment imaginable.

However, some simple ping tests from my laptop broadband card showed almost no packet loss. What these simple tests revealed instead was a problem that I previously thought carriers knew to avoid, and that only really affected consumer devices: "buffer bloat".

The data in this chart was collected on Friday, 3 February 2012 starting at 5:03 PM. I ran a simple ping test from my laptop to the first-hop router on the 3G connection, so I was essentially only testing the "wireless" part of the network. I presume that Sprint has fiber-based backhaul from their towers in downtown Chicago that has plenty of bandwidth.
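If you want to repeat this kind of test, a few lines of Python will boil saved ping output down to the numbers that matter. This is a sketch, not the tool I used: it assumes Windows-style reply lines ("time=92ms") and "Request timed out." for losses, and the sample lines below are made up for illustration rather than taken from my capture.

```python
import re
from statistics import mean

# Matches "time=92ms", "time<1ms" and the Linux-style "time=92.3 ms"
RTT_PATTERN = re.compile(r"time[=<](\d+(?:\.\d+)?)\s*ms")


def summarize_ping(lines):
    """Return sent/lost counts and min/mean/max RTT (ms) from ping output lines."""
    rtts, lost = [], 0
    for line in lines:
        match = RTT_PATTERN.search(line)
        if match:
            rtts.append(float(match.group(1)))
        elif "timed out" in line.lower():
            lost += 1
    sent = len(rtts) + lost
    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "min_ms": min(rtts) if rtts else None,
        "mean_ms": mean(rtts) if rtts else None,
        "max_ms": max(rtts) if rtts else None,
    }


# Illustrative sample only (hypothetical address and values)
sample = [
    "Reply from 10.0.0.1: bytes=32 time=95ms TTL=64",
    "Reply from 10.0.0.1: bytes=32 time=640ms TTL=64",
    "Request timed out.",
    "Reply from 10.0.0.1: bytes=32 time=1500ms TTL=64",
]
print(summarize_ping(sample))
```

Low loss combined with a huge gap between the minimum and maximum RTT is the bufferbloat signature; high loss with steady RTTs would point at plain congestion or poor radio conditions instead.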

The minimum round-trip time observed was 92 ms, which is similar to what I see on 3G networks in uncongested locations. However, ping times varied wildly, and the worst round-trip time was over 1.7 seconds. An RTT this long is disastrous from a user experience standpoint. It means, for example, that connecting to an HTTPS-enabled site takes nearly 7 seconds before the first bit of the HTML web page is transferred to the client.
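Here is the arithmetic behind that figure, sketched in Python. The round-trip breakdown is my assumption of a typical case: one round trip for the TCP handshake, two for a full TLS handshake with no session resumption, and one more for the HTTP request and the start of the response. DNS lookups and server think time are ignored.

```python
# Assumed round trips before the first byte of HTML arrives over HTTPS
ROUND_TRIPS = {
    "TCP three-way handshake": 1,
    "full TLS handshake (no session resumption)": 2,
    "HTTP request and first response byte": 1,
}


def time_to_first_byte(rtt_seconds: float) -> float:
    """Seconds until the first byte of the page, counting network round trips only."""
    return sum(ROUND_TRIPS.values()) * rtt_seconds


for rtt in (0.092, 1.7):  # best and worst RTTs observed in the test
    print(f"RTT {rtt * 1000:.0f} ms -> first byte after about {time_to_first_byte(rtt):.1f} s")
```

Four round trips at the worst-case RTT is where the "nearly 7 seconds" comes from; at the best-case RTT the same exchange takes well under half a second.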

There was no other network activity on my laptop at all, so this insane result seems to have come from the 3G network itself. It looks to me as though Sprint has massively over-sized the buffers on their wireless towers. There is really no excuse for this at all given the recent attention buffer bloat has received in the networking community. I can't think of any circumstance beyond perhaps satellite communications where holding on to an IP packet for 1.5 seconds is at all reasonable.
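To get a feel for how much data has to be sitting in a queue to produce that kind of delay, here is a small Python sketch. Queueing delay is just the bytes waiting divided by the rate at which the link drains them. The 1 Mbit/s drain rate and 1500-byte packet size are hypothetical round numbers for illustration, not measurements of Sprint's network.

```python
def queued_bytes(delay_seconds: float, drain_rate_bps: float) -> float:
    """Bytes that must be buffered ahead of a packet to delay it by `delay_seconds`."""
    return delay_seconds * drain_rate_bps / 8


HYPOTHETICAL_RATE_BPS = 1_000_000  # assumed per-user drain rate, not a measured value
PACKET_BYTES = 1500                # assumed full-size IP packets

backlog = queued_bytes(1.5, HYPOTHETICAL_RATE_BPS)
print(f"{backlog:.0f} bytes queued, roughly {backlog / PACKET_BYTES:.0f} full-size packets")
```

The slower the link actually drains, the fewer buffered packets it takes to reach the same delay, which is why a buffer sized for a fast link becomes painful when the radio rate drops.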

Now, I suppose the problem could be caused by upstream buffering on the laptop. But as I said there was no other activity, confirmed by the wireless card's byte counters. Even if a flow-control mechanism in EVDO was telling my laptop not to transmit, or even telling it to re-transmit previous data, there should not be 1.5 seconds of buffering in the card or its drivers. An IP packet 1.5 seconds old should just be dropped.

I plan on doing more testing in the near future, but I have to ask the mobile networking experts out there: am I totally misinterpreting the data from this admittedly simple test? Is this buffering something inherent in CDMA technology? Can anybody think of a test to see if it is the OS or driver buffers holding on to the packets for so long (I don't think Wireshark would work for this)? Obviously I don't have access to hardware-based radio testing equipment, so software tests are all I can really do.

Assuming the problem is actually the downstream buffers in the tower, Sprint really needs to adjust their buffer sizes and start dropping some packets to make their 3G network usable again.

Technical details of equipment used in the test: Dell D430 laptop, Windows 7 current with all patches and service packs, Dell-Novatel Wireless 5720 Sprint Mobile Broadband (EVDO-Rev A) card with firmware version 145 and driver version

Friday, January 06, 2012

What the Internet looks like in 2012

This is a graph of links between visible Autonomous Systems on the Internet generated from public BGP routing tables early on 1 Jan 2012. Each link of each AS path in the BGP data is represented as an edge, with duplicates removed. The data was then graphed using the twopi layout tool from Graphviz. Links to the top twelve most-connected service provider networks are highlighted in color, with all other AS links in white.
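The edge-building step is simple enough to sketch in Python. This is a reconstruction of the idea rather than the script I actually ran: it assumes the AS paths have already been parsed out of a BGP table dump into lists of AS numbers, it treats links as undirected, and it skips repeated hops from AS-path prepending. The AS numbers in the sample are from the range reserved for documentation.

```python
from collections import Counter


def as_path_edges(paths):
    """Unique undirected AS-to-AS links from an iterable of AS paths (lists of ints)."""
    edges = set()
    for path in paths:
        for a, b in zip(path, path[1:]):
            if a != b:  # a repeated AS is prepending, not a link
                edges.add((min(a, b), max(a, b)))
    return edges


def top_connected(edges, n=12):
    """The n AS numbers that appear in the most links."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [asn for asn, _count in degree.most_common(n)]


def to_dot(edges):
    """Graphviz DOT text for an undirected graph of the links."""
    body = [f"  AS{a} -- AS{b};" for a, b in sorted(edges)]
    return "\n".join(["graph internet {", *body, "}"])


# Tiny made-up sample using documentation AS numbers
sample_paths = [[64496, 64497, 64498], [64499, 64497, 64497, 64498]]
sample_edges = as_path_edges(sample_paths)
print(to_dot(sample_edges))
print(top_connected(sample_edges, 1))
```

Save the DOT output to a file and render it with something like `twopi -Tpng internet.dot -o internet.png` (the file names are placeholders). Coloring the links of the top networks would be done by adding a color attribute to the matching edges.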

I'm struck by the sheer density of connectivity on the modern Internet. Each of the 94865 lines on this graph represents at least one physical link between organizations. But in the case of larger networks that same thin line might represent dozens of routers and 10 Gb/s fibers at many locations throughout the world.

It certainly looks as robust as originally intended, but also chaotic and disordered. Surely no government, organization, or evil genius bent on world domination could possibly control all those links. The sooner our politicians figure that out, the better.