Friday, May 18, 2012

Fixing slow NFS performance between VMware and Windows 2008 R2


I've seen hundreds of reports of slow NFS performance between VMware ESX/ESXi and Windows Server 2008 (with or without R2) out there on the internet, mixed in with a few reports of it performing fabulously.

We periodically use the storage on our big Windows file servers for one-off dev/test VMware virtual machines, and had been struggling with this quite a bit recently. It used to be fast. Now it was very slow: less than 3 MB/s for a copy of a VMDK. It made no sense.

We chased a lot of ideas. We started with the Windows and VMware logs of course, but nothing significant showed up. The Windows Server performance counters showed low CPU utilization and processor queue depth, low disk queue depth, less than 1 ms average IO service time, and a paltry 30 Mbps network utilization on bonded GbE links.
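As a quick sanity check on those numbers, here is a minimal Python sketch of the unit arithmetic. The function names are made up for illustration, and it assumes a single 1000 Mbps link as the baseline rather than the bonded pair.

```python
# Rough unit arithmetic for the figures quoted above.
# Assumption: one GbE link = 1000 Mbps; bonding is ignored for simplicity.

def mbytes_to_mbits(mb_per_s: float) -> float:
    """Convert megabytes per second to megabits per second."""
    return mb_per_s * 8


def link_utilization(mbps: float, link_mbps: float = 1000.0) -> float:
    """Fraction of the link capacity in use."""
    return mbps / link_mbps


copy_rate_mbps = mbytes_to_mbits(3.0)      # the ~3 MB/s VMDK copy
observed_share = link_utilization(30.0)    # the ~30 Mbps seen in perfmon

print(f"3 MB/s is about {copy_rate_mbps:.0f} Mbps")
print(f"30 Mbps is {observed_share:.0%} of one GbE link")
```

So the copy rate and the network counter roughly agree with each other, and both are a tiny fraction of what a single link could carry, which is why the network itself did not look like the bottleneck.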

So where was the bottleneck? I ran across this Microsoft article about slow NFS performance when user name mapping wasn't set up, but it only seemed to apply to Windows 2003. Surely the patch mentioned there had made it into the 2008 code base?

Now, NFS version 3 is a really stupid and insecure protocol. I'm shocked it is still in widespread use frankly. There is basically no authentication other than easily spoofed source IP addresses; the server blindly trusts whatever user and group identifiers are set by the NFS client.

Another complication is that the POSIX permissions model of user/group/other bits isn't even close to the Windows ACL model using local users, domain users, groups, nesting of groups, and exclusions.

Ultimately, there have to be actual Windows security accounts assigned permissions to files on the Windows server. Therefore some means of "mapping" UNIX-style user and group IDs to Windows accounts and groups must be in place. Handling the lack of nesting, exclusions, inheritance, etc. on the UNIX side is an additional problem, so you often have to "dumb down" the Windows security model to make things work.

With Windows 2008 and later, you can use "unmapped" access for NFS, which really just means UNIX UIDs/GIDs mapped directly to Windows security identifiers (SIDs) created on the fly by the Windows server. Or you can choose to pick up your UNIX-to-Windows account mappings from Active Directory attributes.

VMware always sends userid=0 (root) and groupid=0 (also root) to NFS servers. On the Windows side of things, if you are using the "unmapped" method as we had been, this gets translated into a really strange-looking NTFS access control list (ACL). It will show permissions for security IDs with no usernames that look like "S-1-5-88-1-0".
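To make those odd SIDs a little less mysterious, here is a minimal Python sketch that decodes them. It assumes the convention, as I understand it, that unmapped owners appear as S-1-5-88-1-<uid> and unmapped groups as S-1-5-88-2-<gid>; the function name is hypothetical and this is not part of any Windows tool.

```python
import re
from typing import Optional, Tuple

# Assumed layout of "unmapped" NFS SIDs on Windows:
#   S-1-5-88-1-<uid>  -> UNIX owner
#   S-1-5-88-2-<gid>  -> UNIX group
_UNMAPPED_SID = re.compile(r"^S-1-5-88-([12])-(\d+)$")
_KINDS = {"1": "uid", "2": "gid"}


def parse_unmapped_sid(sid: str) -> Optional[Tuple[str, int]]:
    """Return ("uid", n) or ("gid", n) for an unmapped NFS SID, else None."""
    match = _UNMAPPED_SID.match(sid.strip())
    if match is None:
        return None
    return _KINDS[match.group(1)], int(match.group(2))


print(parse_unmapped_sid("S-1-5-88-1-0"))   # the SID from the ACL above
print(parse_unmapped_sid("S-1-5-88-2-0"))
print(parse_unmapped_sid("S-1-5-32-544"))   # an ordinary SID, not unmapped NFS
```

Under that assumption, "S-1-5-88-1-0" is simply uid 0, which is the root user VMware sends.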

The first thing we did was reconfigure Windows NFS services to use Active Directory account mapping, then set up accounts with uid=0 and gid=0 in AD. This worked, and assigned permissions to these new Active Directory accounts, but unfortunately it didn't improve performance at all.

So, I started looking at the permissions on the directories and files in our NFS shares. Someone had added permissions to a few directories so they could back up VMware files from the Windows side of things across the network using SMB file sharing. This was in addition to the "unmapped" UNIX-style permissions created by the Windows NFS service.

So, given the old KB article above that highlighted slow performance (but no access denied errors) when the Windows NFS server tried to find a mapping for a user, I decided tweaking the permissions was worth a shot. I ran across the nfsfile utility for setting NFS permissions on files. Finding little documentation online, the only aid I had was the command-line tool's help text:
NFSFILE [/v] [/s] [/i[[u=]|[g=]|[wu=]|[wg=]]] [/r[[u=]|[g=]|[m=]]] [/c[w|x]]
/? - this message
/v - verbose
/s - scan sub-directories for matching files
/i - include files matching the specified criteria
        u - NFS owner SID matches
        g - NFS group SID matches
        wu - NFS owner SID matches
        wg - NFS group SID matches
/r - replace specified option on file
        u - set uid
        g - set gid
        m - set modebits to
        wu - Set Windows Owner account
        wg - Set Windows Group account
/c - convert the file according to
        w - Windows style ACL (Mapped)
        x - Unix Style ACL (Unmapped)
After some experimentation, I found that this command:
nfsfile /v /s /ru=0 /rg=0 /rm=777 /cx DIRECTORYNAME
reset all the permissions to the UNIX style for unmapped NFS access.
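
If you have several share directories to fix, a small script can generate the same command for each one. This is a minimal Python sketch that only builds and prints the command lines using the exact flags shown above; the helper name and the example paths are hypothetical, and it deliberately does not run anything.

```python
from typing import List


def build_nfsfile_reset(directory: str, uid: int = 0, gid: int = 0,
                        mode: str = "777") -> List[str]:
    """Build the nfsfile argument list used above for one directory.

    Flags mirror the command in this post: verbose, recurse, set uid/gid,
    set mode bits, and convert to UNIX-style (unmapped) ACLs.
    """
    return [
        "nfsfile", "/v", "/s",
        f"/ru={uid}", f"/rg={gid}", f"/rm={mode}",
        "/cx", directory,
    ]


# Hypothetical share directories; substitute your own.
for share in [r"D:\nfs\vmstore1", r"D:\nfs\vmstore2"]:
    print(" ".join(build_nfsfile_reset(share)))
```

Review the printed commands before running them by hand, since the reset replaces whatever permissions were on those files.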

After eliminating the Active Directory integration configuration and restarting Windows NFS services, VMware performance via NFS was again quite fast, bounded only by the disk subsystem or network.

What I think was happening is this: the Windows Services for NFS, when it encounters additional Windows ACLs on the files shared via NFS, figures it has to go evaluate all of those permissions by doing AD lookups for user and group IDs. Since NFS is a stateless protocol, it has to do this for *every* read and write request from the client. We did see a lot of traffic to our domain controllers from the NFS servers.

I am guessing that when only the "simple" UNIX-style ACLs set by the nfsfile utility are in place, Windows NFS services does not have to make a request to Active Directory for each NFS request, so things are much faster.

It worked for us anyway, and I am too lazy to dig into it much further, having burned way too much time on it already. But I hope this write-up helps somebody out there.