
Fixing slow NFS performance between VMware and Windows 2008 R2


I've seen hundreds of reports of slow NFS performance between VMware ESX/ESXi and Windows Server 2008 (with or without R2) out there on the internet, mixed in with a few reports of it performing fabulously.

We use the storage on our big Windows file servers periodically for one-off dev/test VMware virtual machines, and have been struggling with this quite a bit recently. It used to be fast, but now it was very slow: less than 3 MB/s for a copy of a VMDK. It made no sense.

We chased a lot of ideas. We started with the Windows and VMware logs, of course, but nothing significant showed up. The Windows Server performance counters showed low CPU utilization and queue depth, low disk queue depth, less than 1 ms average IO service time, and a paltry 30 Mbps network utilization on bonded GbE links.

So where was the bottleneck? I ran across this Microsoft article about slow NFS performance when user name mapping wasn't set up, but it only seemed to apply to Windows 2003. Surely the patch mentioned there had made it into the 2008 code base?

Now, NFS version 3 is a really stupid and insecure protocol. I'm shocked it is still in widespread use frankly. There is basically no authentication other than easily spoofed source IP addresses; the server blindly trusts whatever user and group identifiers are set by the NFS client.
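To make the "blindly trusts the client" point concrete, here is a rough sketch (mine, not from any NFS implementation) of the AUTH_SYS/AUTH_UNIX credential an NFSv3 client attaches to each RPC call, per RFC 5531's XDR layout. Every field, including uid and gid, is filled in by the client with no server-side verification:

```python
import struct

def auth_unix_cred(uid, gid, machine="client1", aux_gids=()):
    """Build an AUTH_SYS (AUTH_UNIX) credential body in XDR encoding.
    Every field here -- including uid and gid -- is chosen freely by
    the client; the server has no way to verify any of it."""
    name = machine.encode()
    pad = (4 - len(name) % 4) % 4            # XDR pads strings to 4 bytes
    body = struct.pack(">I", 0)              # stamp (arbitrary)
    body += struct.pack(">I", len(name)) + name + b"\x00" * pad
    body += struct.pack(">II", uid, gid)     # client-asserted identity
    body += struct.pack(">I", len(aux_gids)) # auxiliary group list
    for g in aux_gids:
        body += struct.pack(">I", g)
    return body

# A "root" credential is just uid=0, gid=0 -- nothing stops any
# client on the network from sending this:
cred = auth_unix_cred(0, 0)
```

This is why NFSv3 export security usually falls back to restricting which source IPs may mount, which is itself spoofable.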

Another complication is that the POSIX permissions model of user/group/other bits isn't even close to the Windows ACL model using local users, domain users, groups, nesting of groups, and exclusions.

Ultimately, there have to be actual Windows security accounts assigned permissions to files on the Windows server. Therefore some means of "mapping" UNIX-style user and group IDs to Windows accounts and groups must be in place. Handling the lack of nesting, exclusions, inheritance, etc. on the UNIX side is an additional problem, so you often have to "dumb down" the Windows security model to make things work.
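The mismatch is easy to see if you enumerate everything a POSIX mode can express. This little illustration (mine, for clarity only) decomposes a mode into the three rwx triples that are the *entire* NFSv3 permission vocabulary, versus Windows ACLs with arbitrary ACEs, deny entries, and inheritance:

```python
def mode_to_triples(mode):
    """Decompose a POSIX mode (e.g. 0o754) into (owner, group, other)
    rwx strings -- the full extent of what NFSv3 can represent."""
    out = []
    for shift in (6, 3, 0):                  # owner, group, other
        trio = (mode >> shift) & 0o7
        out.append("".join(
            b if trio & (4 >> i) else "-"    # test r=4, w=2, x=1 bits
            for i, b in enumerate("rwx")))
    return tuple(out)

mode_to_triples(0o754)   # ('rwx', 'r-x', 'r--')
```

Three fixed principals and nine bits: any richer Windows ACL has to be flattened into (or hidden behind) this model when exposed over NFSv3.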

With Windows 2008 and later, you can use "unmapped" access for NFS, which is really just UNIX UIDs/GIDs mapped directly to Windows security identifiers (SIDs) created on the fly by the Windows server. Or you can choose to pick up your UNIX-to-Windows account mappings from Active Directory attributes.

VMware always sends uid=0 (root) and gid=0 (also root) to NFS servers. On the Windows side of things, if you are using the "unmapped" method as we had been, this gets translated into a really strange-looking NTFS access control list (ACL). It will show permissions for security IDs with no usernames that look like "S-1-5-88-1-0".
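Those odd SIDs are not random: as I understand Microsoft's unmapped-access scheme, the reserved S-1-5-88 authority encodes the raw UNIX identity, with sub-authority 1 carrying the uid, 2 the gid, and 3 the mode bits. A tiny sketch (my helper, not a Windows API) that generates them:

```python
def unmapped_unix_sid(kind, value):
    """Build an 'unmapped' NFS identity SID string.
    Windows encodes raw UNIX IDs under the S-1-5-88 authority:
    sub-authority 1 = uid, 2 = gid, 3 = mode bits (as I understand
    the documented scheme)."""
    subauth = {"uid": 1, "gid": 2, "mode": 3}[kind]
    return f"S-1-5-88-{subauth}-{value}"

# VMware always acts as root, so its files end up owned by:
unmapped_unix_sid("uid", 0)   # 'S-1-5-88-1-0'
```

So the mystery ACL entries on our VMDKs were simply "UNIX uid 0", "UNIX gid 0", and the file's mode bits, wearing SID costumes.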

The first thing we did was reconfigure Windows NFS services to use Active Directory account mapping, then set up accounts with uid=0 and gid=0 in AD. This worked, and assigned permissions to these new Active Directory accounts, but unfortunately it didn't improve performance at all.

So, I started looking at the permissions on the directories and files in our NFS shares. Someone had added permissions to a few directories so they could back up VMware files from the Windows side of things across the network using SMB file sharing. This was in addition to the "unmapped" UNIX-style permissions created by the Windows NFS service.

So, given the old KB article above that highlighted slow performance (but no access-denied errors) when the Windows NFS server tried to find a mapping for a user, I decided tweaking the permissions was worth a shot. I ran across the nfsfile utility for setting NFS permissions on files. Finding little documentation online, the only aid I had was the command-line tool's help text:
NFSFILE [/v] [/s] [/i[[u=]|[g=]|[wu=]|[wg=]]] [/r[[u=]|[g=]|[m=]]] [/c[w|x]]
  /? - this message
  /v - verbose
  /s - scan sub-directories for matching files
  /i - include files matching the specified criteria
       u  - NFS owner SID matches
       g  - NFS group SID matches
       wu - Windows owner SID matches
       wg - Windows group SID matches
  /r - replace specified option on file
       u  - set uid
       g  - set gid
       m  - set mode bits
       wu - set Windows owner account
       wg - set Windows group account
  /c - convert the file according to
       w - Windows-style ACL (mapped)
       x - UNIX-style ACL (unmapped)
After some experimentation, I found that this command:
nfsfile /v /s /ru=0 /rg=0 /rm=777 /cx DIRECTORYNAME
reset all the permissions to the UNIX style for unmapped NFS access.

After eliminating the Active Directory integration configuration and restarting Windows NFS services, VMware performance via NFS was again quite fast, bounded only by the disk subsystem or network.

What I think was happening is this: the Windows Services for NFS, when it encounters additional Windows ACLs on the files shared via NFS, figures it has to go evaluate all of those permissions by doing AD lookups for user and group IDs. Since NFS is a stateless protocol, it has to do this for *every* read and write request from the client. We did see a lot of traffic to our domain controllers from the NFS servers.

I am guessing that when only the "simple" UNIX-style ACLs set by the nfsfile utility are in place, Windows NFS services does not have to make a request to Active Directory for each NFS request, so things are much faster.

It worked for us anyway, and I am too lazy to dig into it much further, having burned way too much time on it already. But I hope this write-up helps somebody out there.

Comments

Jeff said…
I have yet to confirm if this fixed our speed issues, but I wanted to post that the syntax of the command above isn't correct. After a cut/paste of the above command failed, I did some hunting on TechNet and found out that it should look like this (which worked) -

nfsfile /v /s /r g=0 /r u=0 /r m=777 /cx *.*

So the u= g= and m= are all sub-parameters of the /r parameter. I'll confirm whether this fixed our issues or not. Hopefully this works, thanks for the post!
RPM said…
Updated the command line. A copy-paste error removed the /r from in front of the mode (m) parameter. I do not believe the spaces between /r and the subcommand are necessary.

Note that you really want to have NO OTHER PERMISSIONS on the files/directories in question.

Also, you do need to watch out for thin provisioning of VMware VMDKs on Windows NFS datastores. It appears that Windows only supports one "sparse" region in an NFS-shared VMDK file. So, when you quick-format NTFS on a thin-provisioned VMDK on a Windows NFS-mounted datastore, you see a very odd and long delay as Windows actually writes zeros out to a huge swath of disk between the front of the disk and the copy of the MFT records, which are somewhere in the middle of the virtual disk.

There's apparently a good reason that Windows isn't on the VMware compatibility list as an NFS target (although I think the OEM-built Windows Storage Server might be when operating as an iSCSI target).

But it works in a pinch for oddball dev/test scenarios if you're short on storage.
Anonymous said…
I can confirm that this workaround increases performance from 3 MB/s to 8 MB/s in a Gigabit and SSD environment. In most production environments, 8 MB/s is still too slow for a sysadmin; an NFS server on a Unix system is preferred.
Anonymous said…
Thanks very much for this solution - I was experiencing extremely slow read performance using an AIX client and NFS services running on a Windows Server 2012 R2 guest OS on a vSphere 5.5 host. The nfsfile command documented has given a massive read speed improvement. Thanks again.
