VMware VMX files, snapshots, and VMDKs – The parent virtual disk has been modified since the child was created

I had a bit of misfortune last night with VMware, and seeing as I’m big into learning from my mistakes, I thought I’d share this one with everyone.  I had been working on an issue on a VMware ESX 3.5 server, and after speaking with VMware support they recommended that I rebuild the guest’s VMX file.  For those of you who don’t know, a VMX is essentially the guest configuration file.  This isn’t a big deal to do and I have done it successfully several times in the past.  However, this time was different.

I waited until 6pm, shut down the VM, removed it from inventory, copied the current VMX file out of the guest’s VM folder as a backup, renamed it, and then began creating the new virtual machine.  Creating the new virtual machine is much like creating any other VMware guest; the only difference is that you exactly replicate the original guest’s configuration settings and choose the ‘use existing disks’ option, pointing the new guest configuration at the server’s existing disks.  Pretty easy, right?  I thought so.

So I made sure the new VMX saved and booted up the guest machine.  Immediately I found a problem.  I couldn’t log in and I was getting domain controller errors, which forced me to log in as the local admin.  Upon logging in I realized that things looked drastically different.  Starting to freak out (since this was a production Exchange server), I checked the mail databases and found that their last-modified date was over 3 months ago.  What had gone wrong?!

After I managed to calm down a bit, I shut down the guest and put my original VMX file back in place, hoping I could get the server to boot with its initial configuration.  After moving the new VMX file out and putting the old one back in, I tried to boot the server and was greeted with the error message from the title of this post: “The parent virtual disk has been modified since the child was created.”

At this point I immediately realized what had happened and was in absolute horror.  Someone had created a snapshot of the server 3 months ago and never merged it back into the VM.  When I created the new guest, I had pointed the disks at the original VMDK files (which is your only option), and when I booted the server from them, the parent disk was modified, which broke the snapshot chain.  I thought I had lost the system.

Before I go any further, I want to point out how I got to this point and things that should have been done to avoid such a catastrophic problem.

Issue – When you create the new VMX file and are asked to select the existing disk, the wizard only shows you the base VMDK file rather than all of the snapshots (even though they end in VMDK as well). 
Lesson Learned – Check the snapshot manager before making ANY changes.  A more thorough look in the data store browser would have pointed out the snapshot files to me as well.

Issue – I had wanted to make a local copy of the guest’s folder on the data store prior to making ANY changes but was unable to due to lack of space.
Lesson Learned – If you are using local storage, ALWAYS have enough space to make a complete backup of any guest on that system.  If it’s a SAN, you might be able to get away with just taking a snapshot at the storage level.

Issue – The snapshots in general
Lesson Learned – As a rule, I don’t usually take snapshots in VMware.  And if I do, I don’t let them hang around for more than a day.  Keep an eye on the snapshots!
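
As a rough sketch of the first lesson above, the check for leftover snapshot files can be scripted from the console before touching the VMX.  The helper name list_snapshot_descriptors is made up for this example, and it assumes snapshot descriptors follow the NAME-000001.vmdk naming that shows up in the directory listing later in this post.  Treat it as untested on your particular ESX version.

```shell
# Hypothetical helper: list snapshot descriptor files in a VM folder.
# Assumes snapshot descriptors are named NAME-NNNNNN.vmdk (six digits).
# Prints one file name per line; returns 1 if none were found.
list_snapshot_descriptors() {
    vm_dir=$1
    status=1
    for f in "$vm_dir"/*-[0-9][0-9][0-9][0-9][0-9][0-9].vmdk; do
        [ -e "$f" ] || continue    # the glob matched nothing
        echo "${f##*/}"            # strip the directory part
        status=0
    done
    return "$status"
}
```

Pointing it at the guest folder (for example, list_snapshot_descriptors /vmfs/volumes/datastore1/UberServ01) should print any snapshot descriptors it finds.  No output and a non-zero exit status means nothing matched that naming pattern, which is worth double checking in the snapshot manager anyway.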

A note on the console listings below: the commands I entered follow the # prompt, and everything else is console output.  Values such as the server name, datastore IDs, and CIDs are mine; you’ll need to insert your own.

How I got out of this mess
After doing some research, I determined that the issue was with what VMware calls the CID chain.  The CID value is used to link snapshots to their parent VMDK files, and when you boot from the parent file while snapshots are present, you break this chain.  I didn’t have a backup of the system (at the VMware level), and even though I had found steps online for trying to fix this, I let the experts at VMware do it for me.  Afterwards, I recreated the scenario at home for the purpose of writing this blog entry.  The steps below are offered at your own risk and there is no guarantee that they will work.

Step 1
BACK UP EVERYTHING.  Back up as much as you can; these steps involve making direct edits to VMDK files which, if done incorrectly, can cause corruption.  Also, try not to make ANY changes on the VMware guest if you have booted the parent VMDK.  I realized my mistake and shut it down as soon as I could.

Step 2 
If you haven’t done so already, put the original VMX file back in place.  We’ll be using it to try to boot the system.  If you deleted it (for some silly reason) you may be able to make a new one but I haven’t tried that.

Step 3
Enable SSH on the ESX box if you haven’t already.  I’m running ESXi 3.5 in this test but you can very easily google how to enable SSH on your ESX version.  Using SSH just makes this a lot easier (especially if you are remote as I was).  In this example I’ll actually be using telnet.

Step 4
Figure out what the CID values are for each of the snapshots.  To do this, log in to the console (via SSH) and pull the value out of each of the VMDK files using the commands below.  Note that the server I am running this test on is called UberServ01.

Change to the guest directory
~ # cd /vmfs/volumes
/vmfs/volumes # ls
0016047d-c4d39b6a-ec38-631130484fa9  Hypervisor1
3abb47ef-875ea67c-c948-7bf6ff8d3c38  Hypervisor2
4af76f09-2611a4b1-ea7e-000f1ff86fb0  Hypervisor3
4af76f0b-0ba64946-17d8-000f1ff86fb0  datastore1
931ac070-8437760b-9dcc-b0a7dbce2d74
/vmfs/volumes # cd ./datastore1/
/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0 # ls
ISO                    UberServ01
/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0 # cd ./UberServ01/
/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # ls -ltr *.vmdk
-rw-------    1 root     root          255 Jul 15 21:34 UberServ01-000001.vmdk
-rw-------    1 root     root     67131904 Jul 15 21:36 UberServ01-000001-delta.vmdk
-rw-------    1 root     root          262 Jul 15 21:42 UberServ01-000002.vmdk
-rw-------    1 root     root     67131904 Jul 15 21:44 UberServ01-000002-delta.vmdk
-rw-------    1 root     root          532 Jul 15 21:48 UberServ01.vmdk
-rw-------    1 root     root    10758666240 Jul 15 21:52 UberServ01-flat.vmdk

The last command lists every file in the directory with the .vmdk extension.  As you can see, we have two snapshots here, each with a small descriptor file and a much larger -delta file.  The snapshot descriptors are UberServ01-000001.vmdk and UberServ01-000002.vmdk.  To return the CID values, enter the following commands.

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01-000001.vmdk
CID=986a79c0
parentCID=4fc239f6

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01-000002.vmdk
CID=1df04fbb
parentCID=986a79c0

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01.vmdk
CID=0db4eee2
parentCID=ffffffff

Note that the parent file (the original VMDK) will always have a parentCID value of ‘ffffffff’ since it’s the base disk and has no parent of its own.
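
If there are more than a couple of snapshots, the grep calls above can be rolled into one loop.  This is only a sketch: print_cids is a name invented for this example, it assumes sed is available on your console, and it should only be pointed at the small descriptor files, never at the -delta or -flat files.

```shell
# Hypothetical helper: print the CID and parentCID of each descriptor given.
# Assumes each file is a small text descriptor containing lines that start
# with "CID=" and "parentCID=", as in the grep output in this post.
print_cids() {
    for descriptor in "$@"; do
        cid=$(sed -n 's/^CID=//p' "$descriptor")
        parent=$(sed -n 's/^parentCID=//p' "$descriptor")
        echo "${descriptor##*/} CID=$cid parentCID=$parent"
    done
}
```

Called as print_cids ./UberServ01.vmdk ./UberServ01-000001.vmdk ./UberServ01-000002.vmdk, it prints one line per file so the values can be compared side by side.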

So as of right now here is the info we have…
Parent CID (current) – 0db4eee2
Snapshot 1 Parent CID – 4fc239f6
Snapshot 1 CID – 986a79c0
Snapshot 2 Parent CID – 986a79c0
Snapshot 2 CID – 1df04fbb

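The issue should be apparent at this point.  The CIDs need to reference each other in the correct order: each file’s parentCID has to match the CID of the file it was created from.  So this is what we currently have…

UberServ01.vmdk         CID=0db4eee2  parentCID=ffffffff
UberServ01-000001.vmdk  CID=986a79c0  parentCID=4fc239f6  (does not match 0db4eee2)
UberServ01-000002.vmdk  CID=1df04fbb  parentCID=986a79c0  (matches snapshot 1)

Do you see the issue?  The first snapshot expects its parent to have a CID of 4fc239f6, but the parent’s CID is now 0db4eee2.  To fix this, we need to edit the parent VMDK file so that its CID is 4fc239f6 again.  To do this, enter the following command…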

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # vi ./UberServ01.vmdk

Now this is the tricky part.  I absolutely hate vi, I just can’t seem to get the hang of it.  But here’s what I did to edit the file.  Keep in mind that if the file is large, it can take some time to load, so if you get a black screen for a period of time, just hang tight.

-Use the arrow keys to arrow down to the beginning of the line that starts with ‘CID=’.  This line should list the incorrect CID which you pulled earlier.  Rather than deleting the value we are just going to comment it out.
-Press the insert key once
-Press enter to insert a new line
-Type a # to comment out the original CID line
-Arrow up one to start entering text on your new blank line
-Type in the new CID value prefaced by ‘CID=’
-When you are done, the lines should look something like this…

CID=4fc239f6
#CID=0db4eee2

-Press the escape key once.
-Type in ‘:wq’ (minus the single quotes)
-This should kick you back out to the command line
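
For anyone who dislikes vi as much as I do, the same edit can be sketched without an interactive editor.  This is an assumption-heavy alternative rather than what was actually done here: set_descriptor_cid is a made-up helper name, it assumes awk and cp are available on your console, and it leaves a copy of the original descriptor next to it as NAME.bak before rewriting anything.

```shell
# Hypothetical helper: set the CID in a descriptor, keeping the old CID line
# as a comment directly below the new one.
# Copies the descriptor to NAME.bak first and refuses anything that is not
# exactly 8 lowercase hex digits.
set_descriptor_cid() {
    descriptor=$1
    new_cid=$2
    case $new_cid in
        *[!0-9a-f]*) echo "CID must be lowercase hex" >&2; return 1 ;;
    esac
    [ "${#new_cid}" -eq 8 ] || { echo "CID must be 8 digits" >&2; return 1; }
    cp "$descriptor" "$descriptor.bak" || return 1
    # Rewrite from the backup: emit the new CID line, then the old one commented out.
    awk -v cid="$new_cid" '
        /^CID=/ { print "CID=" cid; print "#" $0; next }
        { print }
    ' "$descriptor.bak" > "$descriptor"
}
```

In this post’s scenario the call would be set_descriptor_cid ./UberServ01.vmdk 4fc239f6.  Lines starting with parentCID= or #CID= are left alone because the pattern only matches lines that begin with CID=.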

At this point I would verify the chain one more time to ensure that you have the CIDs correct.

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01.vmdk
CID=4fc239f6
#CID=0db4eee2
parentCID=ffffffff

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01-000001.vmdk
CID=986a79c0
parentCID=4fc239f6

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01-000002.vmdk
CID=1df04fbb
parentCID=986a79c0
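
Comparing the values by eye works for two snapshots, but the check is mechanical enough to script.  The sketch below is one way to do it under a few assumptions: check_cid_chain is a hypothetical name, sed is available on the console, and you pass the descriptors in chain order yourself (base disk first, then each snapshot).

```shell
# Hypothetical helper: verify that each descriptor's parentCID equals the CID
# of the descriptor listed before it. Pass the base descriptor first, then
# the snapshot descriptors in order. Returns 1 if any link is broken.
check_cid_chain() {
    prev_cid=ffffffff    # the base disk has no parent
    result=0
    for descriptor in "$@"; do
        cid=$(sed -n 's/^CID=//p' "$descriptor")
        parent=$(sed -n 's/^parentCID=//p' "$descriptor")
        if [ "$parent" = "$prev_cid" ]; then
            echo "OK       ${descriptor##*/}"
        else
            echo "MISMATCH ${descriptor##*/}: parentCID=$parent, expected $prev_cid"
            result=1
        fi
        prev_cid=$cid
    done
    return "$result"
}
```

Commented-out lines such as #CID=0db4eee2 are ignored because only lines beginning with CID= are read.  A MISMATCH line names the file whose parentCID does not line up with the CID of the file before it.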

So now it looks like…

UberServ01.vmdk         CID=4fc239f6  parentCID=ffffffff
UberServ01-000001.vmdk  CID=986a79c0  parentCID=4fc239f6  (matches the parent)
UberServ01-000002.vmdk  CID=1df04fbb  parentCID=986a79c0  (matches snapshot 1)

If you are confident in your changes, try to fire up the VM now.  With any luck, it will boot correctly and your current data will be back where it should be.  After you make sure that it boots correctly, I would shut it down and delete the snapshots to merge everything back into one file.  If you booted up the parent file and made changes, the chances that some sort of corruption occurred are high.  In my case, I had to rebuild the server’s WMI namespace, which wasn’t that big of a deal.  But all I did was boot the server up!

8 thoughts on “VMware VMX files, snapshots, and VMDKs – The parent virtual disk has been modified since the child was created”

  1. Babak

    Thank you very much, indeed. Your guide helped me. I could have lost a $3,000 software installed on my server. God bless you!

  2. wenk

    Dear admin,

    What if we don’t have the original parent VMDK file (the original backup was damaged), because the VM has already been running on it and the data inside has been accessed and modified over the last 3 days?  Can I still merge the data from the delta into the parent VMDK?

  3. Marcus

    YOU REALLY SAVED MY LIFE!
    I also noticed very quickly (within a few minutes) that I was on an old state of the server (from about 9 months ago).
    So I shut it down and went searching for articles, and glad, glad, glad, I found yours!
    In my case the shit happened while moving the data MANUALLY from one datastore to another and adding the machine again through the wizard. (yes I found the
    VMware should really build in a way for the wizard to recognize that there are snapshot files, and at least ask whether they should be integrated or not!
    I think it’s one of the easiest ways to really destroy a system in a VM!

    Thank you again!
