VMware VMX files, snapshots, and VMDKs – The parent virtual disk has been modified since the child was created

Thursday, 15. July 2010

I had a bit of misfortune last night with VMware, and seeing as I’m big into learning from my mistakes, I thought I’d share this one with everyone.  I had been working on an issue on a VMware ESX 3.5 server, and after speaking with VMware support they recommended that I rebuild the guest’s VMX file.  For those of you who don’t know, a VMX is essentially the guest configuration file.  This isn’t a big deal to do and I have done it successfully several times in the past.  However, this time was different.

I waited until 6pm, shut down the VM, removed it from inventory, copied the current VMX file out of the guest’s VM folder for backup, renamed it, and then began creating the new virtual machine.  Creating the new virtual machine is much like creating any other VMware guest.  The only difference is that you exactly replicate the original guest’s configuration settings and you choose the ‘use existing disks’ option, pointing the new guest configuration to the server’s existing disks.  Pretty easy, right?  I thought so.

So I made sure the new VMX saved and booted up the guest machine.  Immediately I found a problem.  I couldn’t log in and I was getting domain controller errors, which forced me to log in as the local admin.  Upon logging in I realized that things looked drastically different.  Starting to freak out (since this was a production Exchange server), I checked the mail databases and found that the last-modified date was over 3 months ago.  What had gone wrong?!?!?

After I managed to calm down a bit, I shut down the guest, and put my original VMX file back in place hoping I could get the server to boot with its initial configuration.  After moving the new VMX file out, and putting the old one back in, I tried to boot the server and I was greeted with an error message that looked like this….
[image: error dialog, “The parent virtual disk has been modified since the child was created”]

At this point I immediately realized what had happened and was in absolute horror.  Someone had created a snapshot of the server 3 months ago and never merged it back into the VM.  When I created the new guest, I had pointed its disks at the original VMDK files (which is your only option), and when I booted the server from them the base disk was modified, which broke the snapshot chain.  I thought I had lost the system.

Before I go any further, I want to point out how I got to this point and things that should have been done to avoid such a catastrophic problem.

Issue - When you create the new VMX file and are asked to select the existing disk, the wizard only shows you the base VMDK file rather than all of the snapshots (even though they end in VMDK as well). 
Lesson Learned – Check the snapshot manager before making ANY changes.  A more thorough look in the datastore browser would have pointed out the snapshot files to me as well.

Issue – I had wanted to make a local copy of the guest’s folder on the datastore prior to making ANY changes but was unable to due to lack of space.
Lesson Learned – If you are using local storage, ALWAYS have enough space to make a complete backup of any guest on that system.  If it’s a SAN, you might be able to get away with just taking a snapshot at the storage level.

Issue – The snapshots in general
Lesson Learned – As a rule, I don’t usually take snapshots in VMWare.  And if I do, I don’t let them hang out there for more than a day.  Keep an eye on the snapshots!
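
The first lesson above (look for snapshot files before touching anything) is easy to turn into a quick check.  The following is a rough Python sketch for illustration only; it is not something that was run on the ESX host.  It assumes snapshot disks follow the `<name>-00000N.vmdk` and `<name>-00000N-delta.vmdk` naming seen in the listing later in this post, and the function name is made up.

```python
import re

# Assumption: snapshot descriptors are named <vm>-NNNNNN.vmdk and their data
# files <vm>-NNNNNN-delta.vmdk, matching the directory listing in this post.
SNAPSHOT_PATTERN = re.compile(r"^.+-\d{6}(-delta)?\.vmdk$")

def find_snapshot_disks(filenames):
    """Return the sorted names of files that look like snapshot disks."""
    return sorted(name for name in filenames if SNAPSHOT_PATTERN.match(name))

# File names copied from a datastore browser or an `ls` of the guest folder.
listing = [
    "UberServ01-000001.vmdk",
    "UberServ01-000001-delta.vmdk",
    "UberServ01.vmdk",
    "UberServ01-flat.vmdk",
    "UberServ01.vmx",
]

snapshots = find_snapshot_disks(listing)
if snapshots:
    print("Snapshot files present, check the snapshot manager first:")
    for name in snapshots:
        print("  " + name)
```

If the list comes back non-empty, the base VMDK is not the whole story and should not be booted on its own.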

Green – Console output
Blue – My values, you’ll need to insert your own
Normal text – What I entered
Bold – Parts I’m trying to point out in the output

How I got out of this mess
After doing some research, I determined that the issue was with what VMware calls the CID chain.  The CID value is used to link snapshots to their parent VMDK files, and when you boot from the parent file while snapshots are present, you break this chain.  I didn’t have a backup of the system (at the VMware level), and even though I had found steps online for trying to fix this, I let the experts at VMware do it for me.  Afterwards, I recreated the scenario at home for the purpose of writing this blog entry.  The steps below are offered at your own risk and there is no guarantee that they will work.

Step 1
BACKUP EVERYTHING.  Back up as much as you can; these steps involve making direct edits to VMDK files which, if done incorrectly, can cause corruption.  Also, try not to make ANY changes on the VMware guest when you boot the parent VMDK.  I realized my mistake and shut it down as soon as I could.

Step 2 
If you haven’t done so already, put the original VMX file back in place.  We’ll be using it to try to boot the system.  If you deleted it (for some silly reason) you may be able to make a new one but I haven’t tried that.

Step 3
Enable SSH on the ESX box if you haven’t already.  I’m running ESXi 3.5 in this test but you can very easily google how to enable SSH on your ESX version.  Using SSH just makes this a lot easier (especially if you are remote as I was).  In this example I’ll actually be using telnet.

Step 4
Figure out what the CID values are for each of the snapshots.  To do this we need to log in to the console (via SSH) and find the value on each of the files.  To do this, I entered the following commands.  Note the server I am running this test on is called UberServ01. 

Change to the guest directory
~ # cd /vmfs/volumes
/vmfs/volumes # ls
0016047d-c4d39b6a-ec38-631130484fa9  Hypervisor1
3abb47ef-875ea67c-c948-7bf6ff8d3c38  Hypervisor2
4af76f09-2611a4b1-ea7e-000f1ff86fb0  Hypervisor3
4af76f0b-0ba64946-17d8-000f1ff86fb0  datastore1
931ac070-8437760b-9dcc-b0a7dbce2d74
/vmfs/volumes # cd ./datastore1/
/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0 # ls
ISO                    UberServ01
/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0 # cd ./UberServ01/
/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # ls -ltr *.vmdk
-rw-------    1 root     root          255 Jul 15 21:34 UberServ01-000001.vmdk
-rw-------    1 root     root     67131904 Jul 15 21:36 UberServ01-000001-delta.vmdk
-rw-------    1 root     root          262 Jul 15 21:42 UberServ01-000002.vmdk
-rw-------    1 root     root     67131904 Jul 15 21:44 UberServ01-000002-delta.vmdk
-rw-------    1 root     root          532 Jul 15 21:48 UberServ01.vmdk
-rw-------    1 root     root    10758666240 Jul 15 21:52 UberServ01-flat.vmdk

The last command lists all of the files in the directory with the VMDK file extension.  As you can see, we have two snapshots here, one called UberServ01-000001.vmdk and the other UberServ01-000002.vmdk.  To return the CID values, enter the following commands.

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01-000001.vmdk
CID=986a79c0
parentCID=4fc239f6

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01-000002.vmdk
CID=1df04fbb
parentCID=986a79c0

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01.vmdk
CID=0db4eee2
parentCID=ffffffff

Note that the parent file (the original VMDK) will always have a parentCID value of ‘ffffffff’ since it’s the base of the chain and has no parent of its own.
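
If you would rather not eyeball the grep output, the same values can be pulled out of the descriptor text with a few lines of Python.  This is only a sketch: it assumes you have copied the contents of the small descriptor files (not the -flat or -delta data files) off the host, and the function names are invented for this example.

```python
import re

# Matches uncommented "CID=xxxxxxxx" and "parentCID=xxxxxxxx" lines only.
CID_LINE = re.compile(r"^\s*(CID|parentCID)\s*=\s*([0-9a-fA-F]{8})\s*$")

def read_cids(descriptor_text):
    """Return {'CID': ..., 'parentCID': ...} found in VMDK descriptor text."""
    found = {}
    for line in descriptor_text.splitlines():
        match = CID_LINE.match(line)
        if match:
            found[match.group(1)] = match.group(2).lower()
    return found

def is_base_disk(cids):
    """A disk whose parentCID is ffffffff has no parent, so it is the base disk."""
    return cids.get("parentCID") == "ffffffff"

# The two lines grep returned for each file in the console output above.
snapshot_1 = "CID=986a79c0\nparentCID=4fc239f6\n"
base = "CID=0db4eee2\nparentCID=ffffffff\n"

print(read_cids(snapshot_1), is_base_disk(read_cids(snapshot_1)))
print(read_cids(base), is_base_disk(read_cids(base)))
```

Commented-out lines (anything starting with #) are ignored on purpose, so a descriptor that keeps an old CID as a comment still reports only the live value.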

So as of right now here is the info we have….
Parent (base disk) CID – 0db4eee2
Snapshot 1 Parent CID – 4fc239f6
Snapshot 1 CID – 986a79c0
Snapshot 2 Parent CID – 986a79c0
Snapshot 2 CID – 1df04fbb

The issue should be apparent at this point.  The CIDs need to reference each other in the correct order.  So this is what we currently have…..

image

Do you see the issue?  The 1st snapshot references a parent CID (4fc239f6) that no longer matches the CID in the parent file (0db4eee2).  To fix this, we need to edit the parent VMDK file so that its CID is once again 4fc239f6.  To do this, enter the following command….

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # vi ./UberServ01.vmdk

Now this is the tricky part.  I absolutely hate vi, I just can’t seem to get the hang of it.  But here’s what I did to edit the file.  Keep in mind that if the file is large, it can take some time to load, so if you get a black screen for a period of time, just hang tight.

-Use the arrow keys to arrow down to the beginning of the line that starts with ‘CID=’.  This line should list the incorrect CID which you pulled earlier.  Rather than deleting the value we are just going to comment it out.
-Press the insert key once
-Press enter to insert a new line
-Type a # to comment out the original CID line
-Arrow up one to start entering text on your new blank line
-Type in the new CID value prefaced by ‘CID=’
-When you are done, the lines should look something like this….

image
-Press the escape key once. 
-Type in ‘:wq’ (minus the single quotes)
-This should kick you back out to the command line
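
For anyone else who hates vi, the edit described in the steps above boils down to a small text transformation.  The sketch below shows it in Python on a copy of the descriptor text; it is an illustration of the change, not a tool that was used on the server, and the function name is hypothetical.  Work on a backup copy and only put the result back once you have compared it with the original.

```python
def replace_cid(descriptor_text, new_cid):
    """Return the descriptor text with a new CID line, keeping the old one as a comment."""
    output = []
    for line in descriptor_text.splitlines():
        if line.startswith("CID="):
            output.append("CID=" + new_cid)  # the value the first snapshot expects
            output.append("#" + line)        # original line, commented out
        else:
            output.append(line)              # parentCID and everything else untouched
    return "\n".join(output) + "\n"

original = "CID=0db4eee2\nparentCID=ffffffff\n"
print(replace_cid(original, "4fc239f6"), end="")
```

Only the line that starts with ‘CID=’ is touched, so the parentCID line and the rest of the descriptor pass through unchanged.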

At this point I would verify the chain one more time to ensure that you have the CIDs correct.

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01.vmdk
CID=4fc239f6
#CID=0db4eee2
parentCID=ffffffff

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01-000001.vmdk
CID=986a79c0
parentCID=4fc239f6

/vmfs/volumes/4af76f0b-0ba64946-17d8-000f1ff86fb0/UberServ01 # grep CID ./UberServ01-000002.vmdk
CID=1df04fbb
parentCID=986a79c0
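
The check being done by eye here is simply that every disk’s parentCID equals the CID of the disk before it in the chain.  A minimal Python sketch of that rule follows, using the values from this post; the function name and the tuple layout are made up for the example.

```python
def find_broken_links(chain):
    """chain: list of (name, cid, parent_cid), ordered from base disk to newest snapshot.

    Returns a list of (child_name, expected_parent_cid, found_parent_cid)
    for every child whose parentCID does not match its parent's CID.
    """
    problems = []
    for parent, child in zip(chain, chain[1:]):
        if child[2] != parent[1]:
            problems.append((child[0], parent[1], child[2]))
    return problems

before_fix = [
    ("UberServ01.vmdk", "0db4eee2", "ffffffff"),
    ("UberServ01-000001.vmdk", "986a79c0", "4fc239f6"),
    ("UberServ01-000002.vmdk", "1df04fbb", "986a79c0"),
]
# Same chain after the base disk's CID has been set back to 4fc239f6.
after_fix = [("UberServ01.vmdk", "4fc239f6", "ffffffff")] + before_fix[1:]

print(find_broken_links(before_fix))
print(find_broken_links(after_fix))
```

An empty list means every link lines up; anything else names the snapshot whose parent reference is off.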

So now it looks like….

image 
If you are confident with your changes, try to fire up the VM now.  With any luck, it will boot correctly and your current data will be back where it should be.  After you make sure that it boots correctly, I would shut it down and delete the snapshots to merge everything back into one file.  If you booted up the parent file and made changes, the chances that some sort of corruption occurred are high.  In my case, I had to rebuild the server’s WMI namespace, which wasn’t that big of a deal.  But all I did was boot the server up!

Part 3 – Attaching your LeftHand ISCSI LUN to ESXi

Tuesday, 1. December 2009

In the first two articles in this series I discussed the process of configuring the LeftHand VSA and preparing a volume for attachment to a VMWare server.  A few things should be noted here.
-I’m using ESXi 3.5 for this walk through
-The VSA is installed on the same ESXi server that I am attaching the volume to
-Since I’m working in a test environment I am keeping the ISCSI traffic on the same subnet as the Virtual Machines.  The VMKernel is the interface in VMWare that needs to see the ISCSI traffic so to keep this simple I will leave everything on the same subnet.  Remember ESXi is different than ESX as far as the network configuration goes.  We really aren’t looking at performance right now, just trying to get the VSA up and running so we can play with it.

Alright let’s get right into it……

Since this is ESXi, I have already configured a VMKernel port for management of the ESXi host.  That being said, my current network configuration looks like the image below.  Note the VSA is part of the VM Network already.  The hosts on the VM Network are all on the same subnet, 10.20.30.X /24.
image

So basically all I need to do at this point is enable the iSCSI software initiator in ESXi and give it the information it needs to find the iSCSI LUN we created.  Under configuration in your VI Client click on “Storage Adapters” and then scroll down in the right pane until you see “iSCSI Software Adapter”.  Select it by left clicking on it and then choose properties in the lower pane.
image

On the properties page click the “Configure…” button and check the box under Status that says “Enabled”.  Note you can’t insert your own iSCSI name at this point.  Once you press OK the system will think for a brief moment and then populate the “iSCSI Properties” section at the top of the Properties page.  Additionally, if you click on “Configure…” again you are now able to edit the iSCSI name; however, I wouldn’t recommend doing so unless you know what you are doing.

image image

You can now continue configuring the iSCSI connection.  Select the “Dynamic Discovery” tab from the top of the Properties window.  Underneath “Dynamic Discovery” click the “Add…” button in the lower part of the screen.  On the “Add Send Targets Server” window enter the IP of your LeftHand cluster.  Please note here: DON’T USE THE IP OF YOUR LEFTHAND NODE.  You have to use the cluster IP.  I know it doesn’t make a lot of sense since we are only working with one node, but you have to think of situations where you would have more than one node in a cluster.  Leave the port value at its default setting of 3260.
image
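
If discovery doesn’t find anything later on, one quick sanity check is whether anything is answering on TCP port 3260 at the cluster IP.  The Python sketch below does a plain TCP connect from whatever machine you run it on (for example your workstation on the same subnet); that is an assumption of this example, and a successful connect from there does not prove that the VMKernel interface can reach the target.  The function name is made up.

```python
import socket

def portal_reachable(host, port=3260, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Usage would be something like `portal_reachable("10.20.30.50")`, where 10.20.30.50 is a hypothetical cluster IP on the test subnet used in this post; substitute your own.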

Now click on the “CHAP Authentication” tab of the properties window.  If you prefer not to use CHAP you could very easily go back to the “General” tab, copy the iSCSI name, go back to the LeftHand CMC, change the server definition to “CHAP not required”, and paste the name you copied into the “Initiator Node Name” text box.  We simply configured CHAP in the previous article because we didn’t know what the ESXi Initiator Name would be at that point in the configuration.  If you choose not to use CHAP and make the change described above, disregard the next step where we configure the CHAP authentication.

Click the “Configure…” button and enter the credentials you specified when you configured the server in the LeftHand CMC.  Once done press “OK”, and then close the Properties window.
image

After you press CLOSE, VMware will ask you if you want to “Rescan the Host”.  Choose “Yes”; this will rescan all of the HBAs (Host Bus Adapters) on the ESX server.
image

With any luck, after the HBA scan completes you should see the volume you created in the LeftHand CMC appear as an available LUN.  If you don’t see it, verify the CHAP username and password you used and make sure they match what you defined in the LeftHand CMC.  You had to define a Target Secret and an Initiator Secret, which had to be different.  Ensure you are using the correct Secret and then try hitting the “Rescan…” button at the top of the Storage Adapters pane.  (The Windows iSCSI Initiator lets you do mutual authentication, which would use both the Initiator and Target Secrets.)

image
Now if we select Storage from the “Hardware” area on the left hand side of the screen we can see our existing datastore.  To see the iSCSI datastore we are going to have to add it.  Select the “Add Storage…” link from the right side of the screen.
image

This should bring up the “Add Storage Wizard”.  Ensure that Disk/Lun is selected and press NEXT.

image

On the next screen you should see your iSCSI LUN.  Ensure it’s selected and press NEXT.
image

The next screen warns you that the current Disk Layout will be destroyed.  Note in the image below that I had used this LUN for a quick test on a MS Server and formatted it as NTFS.  The wizard picks up on that and warns that the disk layout will be destroyed.  Press NEXT.
image

On the next screen give the datastore a name and then press NEXT.
image

On the next screen choose how big you want to make the datastore.  Most of the time I would just leave the option “Maximize Capacity” checked.  That will use all of the space on the LUN and is the default setting.  The other option on this screen is Maximum Block Size.  There are a couple of different schools of thought on this, and if you aren’t sure I would recommend the default settings.
image

On the last screen just verify your settings and press FINISH.
image

After it finishes creating your VMFS datastore, the new datastore will show up under storage where you can start treating it just like any other datastore.   That’s it!  You’ve successfully implemented a completely virtualized SAN solution!

Part 1 – Configuring the LeftHand VSA

Sunday, 29. November 2009

Part of the LeftHand SAN portfolio is the LeftHand VSA (Virtual SAN Appliance).  The VSA can be loaded into a virtual environment and turns your local storage into SAN storage.  So now you get the benefits of a SAN (snapshots, thin provisioning, etc.) without having to actually have a physical SAN.  I like to talk about one situation in particular that makes the VSA pretty amazing.  Say you have a corporate office as well as several remote sales offices that each have a small server on site.  You have the LeftHand SAN at your corporate office and are looking for a backup solution for the remote sites.  Enter the VSA….  Configure ESX on the server, install the VSA, convert the server into a VMware guest on top of the VSA storage, and configure the site replication.  That’s it….  Your entire backup is done and all you had to do was make sure you had a big enough pipe (I believe a T1 or similar would work fine).   If that particular remote sales office gets wiped out by a hurricane, all you have to do is mount that VMware server on a different ESX box and you are up and running.  Pretty cool, huh?  Let’s take a look at how to actually configure the VSA.

Step 1 – Download the VSA and get to the VM
Go to HP’s website and download the 30 day trial of the VSA (http://www.hp.com/go/tryvsa).  You’ll have to enter some info about yourself but afterwards you are allowed to download the Virtual Machine.  The download doesn’t come in VM appliance form so you’ll need to import the VMX config file to get it to show up in ESX.  The ZIP file you download has three folders in it: Documents, Centralized Management Console (CMC), and Virtual SAN Appliance.  You can start off by installing the CMC right away since that’s the easy part.  For some reason you need to run HP’s installer to extract the VSA from the installer CAB files.  After you run the installer you should have the HP LeftHand Networks group under "All Programs".  Browse to the group and select "VSA Files".
image

Within the VSA files directory you should see the VMware files as well as some PDF files that walk you through the configuration.  I found the PDFs not as straightforward as I had hoped but you can certainly give them a try if you like.

Step 2 – Copy the VMware files to the ESX machine.
I did it through the vSphere client Datastore browser (side note: the screen shots and installation in this post were done on an ESXi 3.5 server).  Go under configuration, select storage under the Hardware tab, and on the right hand side right click on the datastore you wish to use and select “Browse Datastore”.
image

Select the upload button on the tool bar (circled in red) and select the “Upload Folder” button.  Then browse to ‘C:\Program Files\LeftHand Networks\Virtual SAN Appliance’ (unless you changed the install path) and upload the entire folder.  I let the upload complete and then went through the folder and deleted the PDFs, but you don’t need to do this.
image

Step 3 – Add the VSA to inventory and configure its settings
After it’s uploaded, select the “Virtual SAN Appliance” on the left hand side to display the folder files on the right.  Right click on the .vmx file and select “Add to Inventory”.
image

Now you will be presented with the Add to Inventory Wizard.
Enter a name, NEXT
image

Select a host from the Resource Pool, NEXT
image

FINISH
image
Now you can close the Datastore Browser and return to the Virtual Machine tab of the VSphere client, where we should now see our VSA.
image

Don’t fire it up yet.  We need to configure a few things first.  Right click on the VSA and go to “Edit Settings”.  The configuration should be pretty standard but you want to make sure that it has 1024 MB of RAM and that the network configuration is how you want it to be.  This is my test box so I’m going to leave it on my VM Network; however, if you are doing a separate VLAN for iSCSI this might need to be modified.
image

The only thing left to do at this point is add in a disk for the SAN appliance to use.  The VSA will accept more than one disk but you really only want 1.  I made the mistake the first time I did this of giving the VSA three 10 gig disks.  For some reason the VSA sees multiple drives and automatically changes the configuration to RAID-5 so you lose a disk.   Not necessary since you hopefully already have your ESX datastore sitting on a RAID configuration.   So, just add one disk.  I’ll add a 10 gig disk for this example.  Keep in mind that you’ll want to fully allocate this, you can’t go back and add more disk or change the size; at least not that I am aware of.  It might be interesting to see if you could use the VMWare converter to increase the size.  Perhaps I’ll try that in a later post.
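
To put numbers on the three-disk mistake: with standard RAID-5, one disk’s worth of space goes to parity, so usable space is (number of disks minus one) times the disk size.  The sketch below assumes the VSA’s automatic RAID-5 follows that usual rule, which is my assumption rather than something HP’s documentation was checked for; the function name is invented.

```python
def raid5_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a standard RAID-5 set built from equal-sized disks."""
    if disk_count < 3:
        raise ValueError("RAID-5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb

# Three 10 GB virtual disks, as in the first attempt described above.
print(raid5_usable_gb(3, 10))
```

So three 10 GB disks would leave roughly 20 GB usable, versus the full 10 GB you get from a single 10 GB disk with no parity overhead.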

Select Add at the bottom left of the settings screen.  On the first screen select Hard Disk and click NEXT.
image

Ensure Create a new virtual disk is selected and click NEXT.
image
Set the size of the disk you want and click NEXT.
image

This step is key.  The VSA will only look for disks in the SCSI 1:X range.  That is, the disk you attach needs to be on SCSI 1:0.  If you don’t configure this right the VSA won’t show any disks.
image
Click Finish.
image
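
If the VSA later shows no disks, it is worth confirming which SCSI node the data disk actually landed on.  The Python sketch below scans the text of a .vmx file for disk entries and groups them by controller.  It assumes disks appear as `scsiC:U.fileName = "..."` lines, which is how VMX files normally record them; the file names in the sample are made up, and so is the function name.

```python
import re

# Assumption: each virtual disk is recorded as scsi<controller>:<unit>.fileName = "<file>"
DISK_KEY = re.compile(r'^\s*scsi(\d+):(\d+)\.fileName\s*=\s*"([^"]*)"', re.IGNORECASE)

def disks_by_controller(vmx_text):
    """Return {controller_number: [(unit_number, file_name), ...]} from VMX text."""
    disks = {}
    for line in vmx_text.splitlines():
        match = DISK_KEY.match(line)
        if match:
            controller, unit, file_name = int(match.group(1)), int(match.group(2)), match.group(3)
            disks.setdefault(controller, []).append((unit, file_name))
    return disks

# Hypothetical excerpt; the .vmdk names are placeholders, not the VSA's real file names.
vmx_text = '''
scsi0:0.present = "TRUE"
scsi0:0.fileName = "VSA-system.vmdk"
scsi1.present = "TRUE"
scsi1:0.present = "TRUE"
scsi1:0.fileName = "VSA-data.vmdk"
'''

found = disks_by_controller(vmx_text)
print(found)
print("Data disk on SCSI 1:0:", (0 in [unit for unit, _ in found.get(1, [])]))
```

If controller 1 is missing from the result, the disk was added on SCSI 0 and the VSA will not see it.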

Step 4 – Launch the VM and configure the IP address
Once you have made these changes you can safely fire up the VM.  Open the console and you should see the following screen.  It should sit on this screen for less than a minute, then you should get a login prompt.
image

Type “start” and press ENTER.
image

At the next screen press ENTER.
image

Now you are at the main menu.  The menu lists all of the console configuration options, and as you can see there aren’t many things you can do from the console.
image

We want to configure the network, so select Network TCP/IP Settings and press ENTER.  You will be prompted with an interface selection screen.  Press ENTER again to select eth0.
image
On the next screen enter a hostname and select the option to use a specific IP address.  If you haven’t figured it out by now, the console screen only accepts tab, ENTER, and space.  Arrows don’t work for moving from option to option.  On this screen I entered the hostname, tabbed three times to get to the third option, pressed space to select it, and then tabbed through and entered the rest of the information for the IP address.   Tab to OK and press ENTER to finalize the settings.  After you press ENTER you will receive a warning about the NIC needing to be reset; just press ENTER again.
image
image
After the changes are done processing you’ll get a message indicating that the new IP address has been set.  Press ENTER, and then back your way out to the main menu where you can select logout to get back to the main page.  Your VSA is configured and ready for the final configuration in the LeftHand CMC.  In the next few posts I’ll walk through configuring the VSA in the CMC and how to attach it to your VM network.

LeftHand SAN

Saturday, 28. November 2009

I recently had a client who was looking for some of the advantages of VMware (HA, VMotion, etc…) but didn’t have the required storage infrastructure to do so. We started pricing out SAN storage but quickly realized that the traditional FC (Fibre Channel) SANs were, as expected, incredibly expensive. Both Dell and HP came back with numbers that were well beyond the client’s budget. During a discussion with an HP storage specialist the “LeftHand” name came up. I had heard of iSCSI in the past in regards to Dell’s EqualLogic SANs but had never implemented one. Needless to say we pursued the option and got a LeftHand SAN specialist to come in and talk to us about their appliances. I have to say, I was very impressed.   I signed up for the HP LeftHand Academy Technical Training.  I felt the course was a good overview of the appliances, and if LeftHand is something you might be interested in I strongly suggest taking it.  The class number was HH670P.

Here are some of the notes from our LeftHand Training

LeftHand vs. Traditional SAN
LeftHand is a truly virtual SAN implementation.  On other SANs I had worked with you could literally log into the controller, pick which physical disks you wanted in a volume from each disk cabinet, and then provision the RAID.  LeftHand sees all of its storage as one big pool.  As you add more nodes to the cluster the amount of storage you have increases, but it’s still all one big pool.  All the data is striped across all of the nodes in the cluster.  With Network RAID you can lose an entire node in a cluster and not even know.  Bottom line is LeftHand isn’t a traditional SAN.

Licensing
The really, really, really nice part about LeftHand is that it’s an all-inclusive license, meaning that you get all of the features for a flat fee. Everything is included; no extra licenses for Snapshots or SAN replication are needed, which makes the package even more appealing.

5 Main points
-Storage clustering
Physical appliances are seen as clusters.  Clusters have a single VIP (Virtual IP) that is used as the ISCSI Target address. You can start with one appliance and as your storage needs increase, simply add more appliances to the cluster, increasing your SAN storage.

-Network RAID
LeftHand uses what they call Network RAID to ensure up time in the case of appliance / hardware failure.  The data is striped across all of the nodes in a cluster.  You can configure your cluster for different levels of replication which is what LeftHand calls Network RAID.  For instance if you have two physical appliances you can configure 2 way replication.  In 2 way replication an exact copy of all of your data from one node would be on the second node.  In turn this cuts your usable space in half since a 1 to 1 replication of your data is taking place.  On the other hand 2 way replication, when you have 3 or 4 physical nodes, sounds very appealing.  Your data is in two places and you can very easily lose an entire physical node and the cluster would still be running.   Additionally as you add more than 2 nodes you can configure 3 or 4 way replication spreading your copies of your data across more physical nodes.  (Side note: The LeftHand appliances use RAID-5 in the physical nodes for local disk)
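
The space trade-off is simple arithmetic: usable capacity is the raw pool divided by the replication level.  The sketch below is a rough Python illustration under simplifying assumptions that are mine, not LeftHand’s: all nodes are the same size, and the local RAID-5 overhead and any metadata are ignored.  The node size is a made-up number.

```python
def usable_capacity_tb(node_count, node_capacity_tb, replication_level):
    """Approximate usable space for a cluster of equal nodes under n-way replication."""
    if not 1 <= replication_level <= node_count:
        raise ValueError("replication level must be between 1 and the number of nodes")
    return node_count * node_capacity_tb / replication_level

# Hypothetical 4 TB nodes.
print(usable_capacity_tb(2, 4, 2))  # two nodes, 2-way replication
print(usable_capacity_tb(4, 4, 2))  # four nodes, 2-way replication
```

With two nodes and 2-way replication you end up with one node’s worth of usable space, which is the “cut in half” effect described above; adding nodes at the same replication level grows usable space again.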

-Thin Provisioning
If you aren’t familiar with thin provisioning you should be.  It’s becoming a very common word in both the storage and the VMWare world.  Both LeftHand SAN and VSphere support thin provisioning.  Thin provisioning allows you to use storage on an “as you use it” basis.  When we used to provision disks we had to fully provision them meaning that once I clicked the commit button and created the disk in the storage manager that disk was gone out of my available pool.  So if a DBA requested a 1 terabyte disk for one of his DB servers I had to fully provision the disk initially even if they weren’t planning on filling up that terabyte until 5 years down the road.  With thin provisioning I tell the SAN that I want a 1 terabyte disk and it presents a 1 terabyte disk to the OS but it doesn’t actually use the space until it needs to.  The SAN will just use space as it needs it.  The downside to this of course is that since you can overprovision your SAN you can run into a situation where a thin provisioned volume tries to use more space and there just isn’t any there.  That = BAD
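
A small worked example makes the overprovisioning risk concrete.  The Python sketch below compares what has been promised to servers against what the pool can actually hold; the pool size and volume figures are invented for illustration, and the function is not part of any LeftHand tool.

```python
def overcommit_report(pool_gb, provisioned_gb, used_gb):
    """Summarize a thin-provisioned pool from per-volume provisioned and used sizes."""
    total_provisioned = sum(provisioned_gb)
    total_used = sum(used_gb)
    return {
        "overcommit_ratio": total_provisioned / pool_gb,  # above 1.0 means more promised than exists
        "free_gb": pool_gb - total_used,                  # real space left in the pool
        "overprovisioned": total_provisioned > pool_gb,
    }

# Hypothetical 2000 GB pool with three thin volumes of 1000 GB each.
report = overcommit_report(2000, [1000, 1000, 1000], [200, 500, 300])
print(report)
```

In this made-up case 3000 GB has been promised against a 2000 GB pool, so everything is fine only until the volumes collectively write more than the remaining 1000 GB.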

-Snapshots
Like any other SAN (or any other good one) you can snapshot.  The all-inclusive license is a big plus here.  Additionally, you can do some cool stuff with snapshots and backups.  For instance, if you have an NTFS volume you can back up the snapshot.  Using the LeftHand CLI you can snap a copy of the volume, use Windows’ built-in iSCSI initiator on your backup server to mount the snapshot, back up the snapped copy, dismount the volume, and finally remove the snapshot from the SAN.  Of course, if you are doing VMware you need something that can read VMFS.

-Remote copy
I won’t get too much into this since it sort of speaks for itself but you can use Remote Copy to asynchronously copy your data to another LeftHand appliance for DR purposes.  There are a ton of options here (scheduled, not scheduled, throttling, etc…) so it’s worth looking into if you are doing straight backups to SAN at a remote DR site. 

The VSA
I’m not going to spend a lot of time talking about this because I plan on having a later post that describes the VSA configuration.  However, it is worth noting that LeftHand is the only SAN provider that I know of (save EMC, I believe) that has a Virtual SAN Appliance.  Have old unused servers at your colo?  If they run ESX, load the VSA and use them with Remote Copy to back up your data.  The instructor at the class told me he thought the VSA was about 85% as fast as the physical appliance, provided the hardware it ran on was up to spec.  More to come on the VSA!

Comments (Random Notes)
-Dual Gigabit NICs; can handle 10 Gig NICs if you have the infrastructure
-No Management port.  Only in band management on the ISCSI network
-Every appliance is its own controller.  You no longer have controllers and then tack on disk drawers
-Managed through the LeftHand Management console
-Can swap between full and thin volume provisioning at any time
-The appliances use Managers on each node to form cluster quorum.  No quorum = No Cluster
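
On that last note about quorum: if quorum means a strict majority of the managers (that is my assumption here, not something stated in the notes above), the arithmetic looks like the Python sketch below.  The function names are made up.

```python
def quorum_size(manager_count):
    """Smallest number of managers that forms a strict majority (assumed quorum rule)."""
    return manager_count // 2 + 1

def managers_you_can_lose(manager_count):
    """How many managers can fail while a majority is still running."""
    return manager_count - quorum_size(manager_count)

for count in (1, 2, 3, 5):
    print(count, quorum_size(count), managers_you_can_lose(count))
```

Under that assumption, two managers tolerate no failures at all while three tolerate one, which is why an odd number of managers is the usual choice for majority-based clusters.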