Fixing slow NFS performance between VMware and Windows 2008 R2

2012-05-18 11:35 nfsfile permissions nfs windows performance

I’ve seen hundreds of reports of slow NFS performance between VMware ESX/ESXi and Windows Server 2008 (with or without R2) out there on the internet, mixed in with a few reports of it performing fabulously.

We periodically use the storage on our big Windows file servers for one-off dev/test VMware virtual machines, and have been struggling with this quite a bit recently. It used to be fast. Now it was very slow: less than 3 MB/s for a copy of a VMDK. It made no sense.

We chased a lot of ideas. We started with the Windows and VMware logs of course, but nothing significant showed up. The Windows Server performance counters showed low CPU utilization and queue depth, low disk queue depth, less than 1 ms average IO service time, and a paltry 30 Mbps network utilization on bonded GbE links.

So where was the bottleneck? I ran across this Microsoft article about slow NFS performance when user name mapping wasn’t set up, but it only seemed to apply to Windows 2003. Surely the patch mentioned there had made it into the 2008 code base?

Now, NFS version 3 is a really stupid and insecure protocol. I’m shocked it is still in widespread use frankly. There is basically no authentication other than easily spoofed source IP addresses; the server blindly trusts whatever user and group identifiers are set by the NFS client.
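To make the trust problem concrete, here is a minimal sketch of the AUTH_SYS (a.k.a. AUTH_UNIX) credential body that an NFSv3 client attaches to its RPC calls, following the `authsys_parms` layout in the ONC RPC specification (RFC 5531). The helper names and the machine name `esx01` are hypothetical, chosen only for illustration. Note that nothing in the structure proves who the caller is; the client simply fills in whatever numbers it likes.

```python
import struct


def xdr_string(data: bytes) -> bytes:
    """XDR-encode an opaque string: 4-byte length, data, zero padding to 4 bytes."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def authsys_credential(stamp: int, machinename: str, uid: int, gid: int, gids=()) -> bytes:
    """Build the body of an AUTH_SYS credential (authsys_parms in RFC 5531)."""
    body = struct.pack(">I", stamp)                 # arbitrary client-chosen stamp
    body += xdr_string(machinename.encode("ascii"))  # client-supplied host name
    body += struct.pack(">II", uid, gid)            # client-supplied UID and GID
    body += struct.pack(">I", len(gids))            # count of supplementary GIDs
    for g in gids:
        body += struct.pack(">I", g)
    return body


# A client claiming to be root (hypothetical host name).
root_cred = authsys_credential(stamp=0, machinename="esx01", uid=0, gid=0)

# Any other client can claim any other identity just as easily.
other_cred = authsys_credential(stamp=0, machinename="esx01", uid=1000, gid=1000)
```

The only difference between the two credentials is the pair of integers the client chose to write, which is exactly why the server has no real basis for trusting them.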

Another complication is that the POSIX permissions model of user/group/other bits isn’t even close to the Windows ACL model using local users, domain users, groups, nesting of groups, and exclusions.

Ultimately, there have to be actual Windows security accounts assigned permissions to files on the Windows server. Therefore some means of “mapping” UNIX-style user and group IDs to Windows accounts and groups must be in place. Handling the lack of nesting, exclusions, inheritance, etc. on the UNIX side is an additional problem, so you often have to “dumb down” the Windows security model to make things work.

With Windows 2008 and later, you can use “unmapped” access for NFS, which is really just UNIX UIDs/GIDs mapped directly to Windows security identifiers (SIDs) created on-the-fly by the Windows server. Or you can choose to pick up your UNIX-to-Windows account mappings from Active Directory attributes.

VMware always sends userid=0 (root) and groupid=0 (also root) to NFS servers. On the Windows side of things, if you are using the “unmapped” method as we had been, this gets translated into a really strange-looking NTFS access control list (ACL). It will show permissions for security IDs with no usernames that look like “S-1-5-88-1-0”.
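A minimal sketch of how those odd-looking SIDs appear to be formed. The owner form `S-1-5-88-1-<uid>` matches what we saw in our ACLs. The group form `S-1-5-88-2-<gid>` is an assumption based on my reading of Microsoft's description of unmapped UNIX user access, not something shown above. The function names are hypothetical.

```python
UNMAPPED_PREFIX = "S-1-5-88-"


def unmapped_owner_sid(uid: int) -> str:
    """SID that unmapped access appears to generate for a UNIX UID."""
    return f"{UNMAPPED_PREFIX}1-{uid}"


def unmapped_group_sid(gid: int) -> str:
    """Assumed SID form for a UNIX GID (taken from Microsoft docs, not observed here)."""
    return f"{UNMAPPED_PREFIX}2-{gid}"


def is_unmapped_nfs_sid(sid: str) -> bool:
    """True if the SID belongs to the on-the-fly NFS family rather than a real account."""
    return sid.upper().startswith(UNMAPPED_PREFIX)


# VMware's root/root identity as it would show up in the NTFS ACL.
vmware_owner = unmapped_owner_sid(0)
vmware_group = unmapped_group_sid(0)
```

A helper like `is_unmapped_nfs_sid` is handy when eyeballing an ACL listing: anything that does not match the prefix is a regular Windows account or group.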

The first thing we did was reconfigure Windows NFS services to use Active Directory account mapping, then set up accounts with uid=0 and gid=0 in AD. This worked, and assigned permissions to these new Active Directory accounts, but unfortunately it didn’t improve performance at all.

So, I started looking at the permissions on the directories and files in our NFS shares. Someone had added permissions to a few directories so they could back up VMware files from the Windows side of things across the network using SMB file sharing. This was in addition to the “unmapped” UNIX-style permissions created by the Windows NFS service.

So, given the old KB article above that highlighted slow performance (but no access denied errors) when the Windows NFS server tried to find a mapping for a user, I decided tweaking the permissions was worth a shot. I ran across the _nfsfile_ utility for setting NFS permissions on files. Finding little documentation online, the only aid I had was the command-line tool’s help text:

NFSFILE [/v] [/s] [/i[[u=]|[g=]|[wu=]|[wg=]]] [/r[[u=]|[g=]|[m=]]] [/c[w|x]]  
/? - this message
/v - verbose
/s - scan sub-directories for matching files
/i - include files matching the specified criteria        
    u - NFS owner SID matches  
    g - NFS group SID matches  
    wu - NFS owner SID matches  
    wg - NFS group SID matches  
/r - replace specified option on file
    u - set uid
    g - set gid
    m - set modebits to  
    wu - Set Windows Owner account
    wg - Set Windows Group account
/c - convert the file according to
    w - Windows style ACL (Mapped)
    x - Unix Style ACL (Unmapped)

After some experimentation, I found that this command:

nfsfile /v /s /ru=0 /rg=0 /rm=777 /cx DIRECTORYNAME

reset all the permissions to the UNIX style for unmapped NFS access.
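If you need to do this across several shares, a small wrapper keeps the flags consistent. This is only a sketch, assuming `nfsfile` is on the PATH of the Windows server. The function names and the example path are hypothetical, and the wrapper defaults to a dry run that only prints the command.

```python
import subprocess


def nfsfile_reset_command(path: str, uid: int = 0, gid: int = 0, mode: str = "777") -> list:
    """Build the nfsfile invocation shown above for one directory."""
    return [
        "nfsfile",
        "/v",            # verbose
        "/s",            # scan sub-directories
        f"/ru={uid}",    # replace owner UID
        f"/rg={gid}",    # replace group GID
        f"/rm={mode}",   # replace mode bits
        "/cx",           # convert to UNIX-style (unmapped) ACL
        path,
    ]


def reset_unix_acls(path: str, dry_run: bool = True) -> list:
    """Print the command, and run it only when dry_run is False."""
    cmd = nfsfile_reset_command(path)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


# Dry run against a hypothetical share path; nothing is executed.
planned = reset_unix_acls(r"D:\NFS\vmstore")
```

Review the printed command before switching `dry_run` off, since the conversion rewrites permissions on everything under the directory.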

After eliminating the Active Directory integration configuration and restarting Windows NFS services, VMware performance via NFS was again quite fast, bounded only by the disk subsystem or network.

What I think was happening is this: the Windows Services for NFS, when it encounters additional Windows ACLs on the files shared via NFS, figures it has to go evaluate all of those permissions by doing AD lookups for user and group IDs. Since NFS is a stateless protocol, it has to do this for *every* read and write request from the client. We did see a lot of traffic to our domain controllers from the NFS servers.

I am guessing that when _only_ the “simple” UNIX-style ACLs set by the _nfsfile_ utility are in place, Windows NFS services does _not_ have to make a request to Active Directory for each NFS request, so things are much faster.
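A back-of-the-envelope model shows why a per-request lookup would hurt so much. This is a sketch under loud assumptions: requests are handled one at a time, each request moves 64 KiB, and each AD lookup adds 20 ms. Neither the 64 KiB nor the 20 ms figure was measured; they are hypothetical values picked for illustration. Only the roughly 1 ms IO service time comes from our performance counters.

```python
def throughput_mb_per_s(io_size_kib: float, service_ms: float, lookup_ms: float = 0.0) -> float:
    """Throughput for strictly serialized requests with a fixed per-request overhead."""
    per_request_s = (service_ms + lookup_ms) / 1000.0
    return (io_size_kib / 1024.0) / per_request_s


# Hypothetical: 64 KiB requests, 1 ms disk service time, 20 ms AD lookup per request.
with_lookup = throughput_mb_per_s(io_size_kib=64, service_ms=1.0, lookup_ms=20.0)

# Same requests with no directory lookup at all.
without_lookup = throughput_mb_per_s(io_size_kib=64, service_ms=1.0)

print(f"with lookup:    {with_lookup:.1f} MB/s")
print(f"without lookup: {without_lookup:.1f} MB/s")
```

Under those assumptions the model lands at roughly 3 MB/s with the lookup and several tens of MB/s without it, which is at least consistent with what we observed. It does not prove the theory.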

It worked for us anyway, and I am too lazy to dig into it much further, having burned way too much time on it already. But I hope this write-up helps somebody out there.


Comments:

Jeff -

I have yet to confirm if this fixed our speed issues, but I wanted to post that the syntax of the command above isn’t correct. After a cut/paste of the above command failed, I did some hunting on TechNet and found out that it should look like this (which worked) -

nfsfile /v /s /r g=0 /r u=0 /r m=777 /cx .

So the u= g= and m= are all sub-parameters of the /r parameter. I’ll confirm whether this fixed our issues or not. Hopefully this works, thanks for the post!


RPM -

Updated the command line. A copy-paste error had removed the /r from in front of the mode (m) parameter. I do not believe the spaces between /r and the subcommand are necessary.

Note that you really want to have NO OTHER PERMISSIONS on the files/directories in question.

Also, you do need to watch out for thin provisioning of VMware VMDKs on Windows NFS datastores. It appears that Windows only supports one “sparse” region in an NFS-shared VMDK file. So, when you quick-format NTFS on a thin-provisioned VMDK on a Windows NFS-mounted datastore, you see a very odd and long delay as Windows actually writes zeros out to a huge swath of disk between the front of the disk and the copy of the MFT records, which is somewhere in the middle of the virtual disk.

There’s apparently a good reason that Windows isn’t on the VMware compatibility list as an NFS target (although I think the OEM-built Windows Storage Server might be when operating as an iSCSI target).

But it works in a pinch for oddball dev/test scenarios if you’re short on storage.


Anonymous -

I can confirm that this workaround increases performance from 3 MB/s to 8 MB/s in a Gigabit and SSD environment. For most production environments, 8 MB/s is still too slow for a sysadmin, so an NFS server on a Unix system is preferred.


Anonymous -

Thanks very much for this solution - I was experiencing extremely slow read performance using an AIX client and NFS services running on a Windows Server 2012 R2 guest OS on a vSphere 5.5 host. The nfsfile command documented has given a massive read speed improvement. Thanks again.