More on Locked files in VMware vCenter and ESX

Back in February, I posted a story on Fixing Invalid VM’s in VMware vCenter and ESX. It became one of my two most popular stories and remains so 12 months later.

Now, I can bring you some more on this topic, thanks to Bmonroe on Experts-Exchange for posting his answers. It was this article that pulled the thread that untied the knot.

After upgrading vCenter from 4.1 to 5.1, we found that we had a phantom server. In fact, this server had become Schrödinger's Server. It was visible neither from vCenter Server nor when using the client software to connect directly to the host. However, the server was alive and providing all of its services (hence Schrödinger: the server was both alive and dead).

This was an untenable situation. The server couldn’t be managed or controlled, yet it was interfering with our other production services and causing all sorts of problems. Any attempt to delete the files directly from the datastore resulted in a “device or resource busy” error.

Here is how I fixed it…

1.       Log on to the ESX host where the VM was last known to be running.

2.      Type cmd:  vmkfstools -D /vmfs/volumes/path/to/file to dump information on the file into /var/log/vmkernel

3.      Type cmd:  less /var/log/vmkernel and scroll to the bottom. You will see output like below:

Feb 25 15:49:17 vm22 vmkernel: 2:00:15:18.435 cpu6:1038)FS3: 130:

Feb 25 15:49:17 vm22 vmkernel: 2:00:15:18.435 cpu6:1038)Lock [type 10c00001 offset 30439424 v 21, hb offset 4154368

Feb 25 15:49:17 vm22 vmkernel: gen 66493, mode 1, owner 46c60a7c-94813bcf-4273-0017a44c7727 mtime 8781867] 

Feb 25 15:49:17 vm22 vmkernel: 2:00:15:18.435 cpu6:1038)Addr <4, 588, 7>, gen 20, links 1, type reg, flags 0x0, uid 0, gid 0, mode 644

Feb 25 15:49:17 vm22 vmkernel: 2:00:15:18.435 cpu6:1038)len 23973, nb 1 tbz 0, zla 2, bs 65536

Feb 25 15:49:17 vm22 vmkernel: 2:00:15:18.435 cpu6:1038)FS3: 132:

4.      The owner of the lock is on the third line. Only the last part of the owner value is needed, in this case 0017a44c7727

5.      Type cmd: esxcfg-info | grep -i 'system uuid' | awk -F '-' '{print $NF}' This will display the last part of the system uuid of the ESX server. You need to run the esxcfg-info command on each ESX server in the cluster to discover the owner. Of course, for me, it was on Host number 5 of a 6 Host Cluster.

6.      When you find the ESX server whose uuid matches the lock owner, log on to that ESX server and run the command: ps -elf | grep vmname where vmname is the problem VM. Example output below:

4 S root 13254 1 0 65 -10 - 435 schedu Feb 25 ? 00:00:02 /usr/lib/vmware/bin/vmkload_app /usr/lib/vmware/bin/vmware-vmx -ssched.group=host/user/pool2 -@ pipe=/tmp/vmhsdaemon-0/vmxf7fb85ef5d8b3522;vm=f7fb85ef5d8b3522 /vmfs/volumes/470e25b6-37016b37-a2b3-001b78bedd4c/iu-lsps-vstest/iu-lsps-vstest.vmx0

7.      Since there is a process running (PID 13254 in the example), you need to kill it by following steps 5-12 on stopping a VM above.

8.      Once the kill is complete the files should be released.
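
If you would rather not pick the values out by eye, steps 4 and 6 can be roughly sketched in shell. This is only a sketch: lock_owner_suffix and vmx_pid are helper names I made up for illustration (they are not VMware tools), the sample lines are copied from the output above, and it assumes the log and ps output on your host look like those samples.

```shell
#!/bin/sh
# Sketch only: lock_owner_suffix and vmx_pid are hypothetical helper names,
# not VMware commands. They just parse text in the formats shown in the post.

# Step 4: read vmkernel log text on stdin and print the last dash-separated
# field of the first "owner <uuid>" found.
lock_owner_suffix() {
    sed -n 's/.*owner [0-9a-f]*-[0-9a-f]*-[0-9a-f]*-\([0-9a-f]*\).*/\1/p' | head -n 1
}

# Step 6: read `ps -elf` output on stdin and print the PID (4th column) of
# any vmware-vmx process whose command line mentions the VM name in $1.
vmx_pid() {
    awk -v vm="$1" 'index($0, "vmware-vmx") && index($0, vm) { print $4 }'
}

# Sample lines copied from the example output in the post.
log_sample='Feb 25 15:49:17 vm22 vmkernel: gen 66493, mode 1, owner 46c60a7c-94813bcf-4273-0017a44c7727 mtime 8781867]'
ps_sample='4 S root 13254 1 0 65 -10 - 435 schedu Feb 25 ? 00:00:02 /usr/lib/vmware/bin/vmkload_app /usr/lib/vmware/bin/vmware-vmx -ssched.group=host/user/pool2 /vmfs/volumes/470e25b6-37016b37-a2b3-001b78bedd4c/iu-lsps-vstest/iu-lsps-vstest.vmx'

owner=$(printf '%s\n' "$log_sample" | lock_owner_suffix)
pid=$(printf '%s\n' "$ps_sample" | vmx_pid iu-lsps-vstest)

echo "lock owner suffix: $owner"
echo "vmx pid: $pid"

# On a real host you would feed in live data instead of the samples, e.g.
#   tail -n 20 /var/log/vmkernel | lock_owner_suffix
#   ps -elf | vmx_pid vmname
# and compare the owner suffix with the output of the step 5 command.
```

The comparison with the step 5 output is still done by hand on each host. The letters in the two values may differ in case, so compare them case-insensitively.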
