A client ran into a corrupted .vhd file for the data drive of a Xen server in a pool. We helped them restore from a backup, but some items had not been backed up properly, so our task was to see whether we could somehow recover that data from the corrupted drive.
First, we had to find the raw file for the drive. To do this we looked at the Local Storage -> General tab in XenCenter to find the UUID of the storage repository that contains the failing disk.
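For reference, on the dom0 the backing file of a disk on a file-based SR usually sits under /var/run/sr-mount, in a directory named after the SR UUID, with the VDI UUID as the file name. That is the same layout as the path passed to qemu-img later in this post. A minimal sketch of building that path follows; vhd_path is a hypothetical helper name, and the UUIDs in the usage note are the ones from this recovery.

```shell
# Build the dom0 path of a VDI's backing .vhd file.
#   $1 = UUID of the storage repository (SR)
#   $2 = UUID of the virtual disk (VDI)
# Assumes a file-based SR mounted under /var/run/sr-mount.
vhd_path() {
    printf '/var/run/sr-mount/%s/%s.vhd\n' "$1" "$2"
}

# Usage:
#   ls -l "$(vhd_path f40f93af-ae36-147b-880a-729692279845 3f204a06-ba18-42ab-ad28-84ca3a73d397)"
```

If the SR is not file-based (for example LVM), the disk is not a plain file and this path will not exist.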
When we tried to attach the failing disk, we got this error:
Attaching virtual disk 'xxxxxx' to VM 'xxxx' The attempt to load the VDI failed
So we knew that the Xen servers / pool would refuse to load the corrupted vhd, and I had to come up with another way to try to access the data.
After much research I came across a tool published by ‘twindb.com’ called ‘undrop tool for innodb’. The idea is that even after you drop or delete InnoDB files on your system, markers remain in the file system that allow code to parse what ‘used’ to be there. They claimed this also works, to some degree, on corrupted file systems.
The documentation was poor and took a long time to figure out. They claimed to have 24-hour support, so I thought I would call them and just pay them to sort out the issue, but they took a while and had not called back by the time I had sorted it out myself. All of the documentation they did have linked to their GitHub account, but the link was dead. I searched and found a couple of other people who had forked the code before twindb took it down. My guess is that they now run more of a service business, helping people resolve issues directly, and don't want to support the code. Since this code worked for our needs, I have forked it so that we can make it permanently available: https://github.com/matraexinc/undrop-for-innodb
The first step was to copy the .vhd to a working directory, then fetch and build the tools and run them against the copy:
# cp -a 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd /tmp/restore_vhd/orig.vhd
#git clone https://github.com/matraexinc/undrop-for-innodb
#apt-get install bison flex
#apt-get install libmysqld-dev #this was not mentioned anywhere, but an important file was quietly not compiled without it.
#cd undrop-for-innodb
#make #build the tools; sys_parser may need its own target (make sys_parser)
#mv * ../. #move all of the compiled files into your working directory
#./stream_parser -f orig.vhd #here is the magic: their code goes through and finds all of the ibdata1 logs and markers and creates data you can start to work through
#mv pages-orig.vhd pages-ibdata1 #the program creates an organized set of data for you, and the next programs expect to find it at pages-ibdata1.
#./recover_dictionary.sh #this needs to run mysql as root; it creates a database named ‘test’ which lists all of the databases, tables and indexes it found.
This was where I had to start coming up with a custom solution in order to process the large volume of customer databases. I used some PHP to script the following commands for all of the many databases that needed to be restored. For each database and table, you must run a command against the ‘index’ page file that the previous commands created for you, so you have to loop through each of them. Start by listing the tables and their indexes:
select c.name as tablename, a.id as indexid from SYS_INDEXES a join SYS_TABLES c on (a.TABLE_ID = c.ID);
This returns a list of the tables and any associated indexes. Using this list, you must generate commands that:
- generate a create statement for the table you are recovering,
- generate a load infile sql statement and the associated data file.
#./sys_parser -h localhost -u username -p password -d test tablenamefromsql
This generates the create statement for the table. Save it to a createtable.sql file and execute it on your database to restore your table.
#./c_parser -5 -o data.load -f pages-ibdata1/FIL_PAGE_INDEX/00000017493.page -t createtable.sql
This outputs a “load data infile ‘data.load’” statement. Pipe this to mysql and it will restore your data.
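The per-table loop described above could be sketched in shell roughly as follows. This is not the PHP script I actually used, only an illustration under stated assumptions: the function names are hypothetical, the dictionary database is ‘test’, the credentials are the same placeholders as above, and table names may contain a slash (database/table), which is replaced so it can be used in a file name. It only prints the commands so they can be reviewed before anything is run.

```shell
# Sketch only: print the recovery commands for one table instead of running them.
PAGES_DIR="pages-ibdata1/FIL_PAGE_INDEX"   # assumption: output of stream_parser, renamed as above
DICT_DB="test"                             # assumption: database created by recover_dictionary.sh

# Turn a page file name such as 00000017493.page into its index id (17493).
index_id_from_page() {
    name=$(basename "$1" .page)
    id=$(printf '%s' "$name" | sed 's/^0*//')
    [ -n "$id" ] || id=0
    printf '%s\n' "$id"
}

# Find the page file whose numeric name matches the index id given in $1.
page_for_index() {
    for f in "$PAGES_DIR"/*.page; do
        [ -e "$f" ] || continue
        if [ "$(index_id_from_page "$f")" = "$1" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Print the two commands for one table ($1) and its index page file ($2).
recovery_commands() {
    table=$1
    page=$2
    safe=$(printf '%s' "$table" | tr '/' '_')   # make the name usable as a file name
    printf './sys_parser -h localhost -u username -p password -d %s %s > %s.sql\n' "$DICT_DB" "$table" "$safe"
    printf './c_parser -5 -o %s.load -f %s -t %s.sql\n' "$safe" "$page" "$safe"
}

# Usage (hypothetical table name, index id taken from the query above):
#   recovery_commands "mydb/mytable" "$(page_for_index 17493)"
```

Matching the page file by its numeric value rather than by a fixed number of leading zeros avoids guessing how stream_parser pads the file names.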
I found one case where the create statement was not properly created, for table_id 754. It appears that the sys_parser code relies on indexes, and this client table did not have any index (not even a primary key), so no create statement was generated and the import could not continue. To work around this, I manually inserted a fake primary key on one of the columns into the dictionary database:
#insert into SYS_INDEXES set id=10000009, table_id=754, name='PRIMARY', N_FIELDS=1, Type=3, SPACE=0, PAGE_NO=400000000;
#insert into SYS_FIELDS set INDEX_ID=10000009, POS=0, COL_NAME='myprimaryfield';
Then I was able to run the sys_parser command which then created the statement.
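To spot this situation before running sys_parser on every table, one option is to query the recovered dictionary for tables that have no row in SYS_INDEXES. The sketch below only prints such a query. It assumes the column names used in the statements above (SYS_TABLES.ID, SYS_TABLES.NAME, SYS_INDEXES.ID, SYS_INDEXES.TABLE_ID); tables_without_index_sql is a hypothetical helper name.

```shell
# Print a query listing dictionary tables that have no entry in SYS_INDEXES.
tables_without_index_sql() {
    printf '%s\n' \
        "SELECT t.ID AS table_id, t.NAME AS tablename" \
        "FROM SYS_TABLES t" \
        "LEFT JOIN SYS_INDEXES i ON i.TABLE_ID = t.ID" \
        "WHERE i.ID IS NULL;"
}

# Usage (database name and credentials are assumptions):
#   tables_without_index_sql | mysql -u root -p test
```

Each table it returns would need a fake index inserted as shown above before sys_parser can produce a create statement for it.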
An Idea that Did not work ….
The idea was to create a new hdd device at /dev/xvdX, create a new filesystem on it and mount it. Then, using a tool such as dd or qemu-img, overwrite the already mounted device with the contents of the vhd. While the contents are corrupted, the hope was that we would be able to explore them as best we could.
The command I ran was:
#qemu-img convert -p -f vpc -O raw /var/run/sr-mount/f40f93af-ae36-147b-880a-729692279845/3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd /dev/xvde
Where 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd is the name of the file / UUID that is corrupted on the Xen dom0 server, and f40f93af-ae36-147b-880a-729692279845 is the UUID of the Storage / SR that it was located on.
The command took a while to complete (it had to convert 50GB), but the contents of the vhd started to show up as I ran find commands on the mounted directory. During the transfer the results were sporadic, as the partition was only partially built. After it completed, I had access to about 50% of the data.
An Idea that Did not work (2) ….
This was not good enough to get the files the client needed. I had a suspicion that the qemu-img convert command may have dropped some of the data that was still available, so I figured I would try another, somewhat similar command that actually seems to be a bit simpler.
This time I created another disk on the same local storage and found it using the xe vdi-list command on the dom0:
#xe vdi-list name-label=disk_for_copyingover
This showed me that the file for this disk was ‘fd959935-63c7-4415-bde0-e11a133a50c0.vhd’.
I found it on disk and executed a cat from the corrupted vhd file into the new vhd file while the disk was still mounted:
#cat 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd > ../8c5ecc86-9df9-fd72-b300-a40ace668c9b/fd959935-63c7-4415-bde0-e11a133a50c0.vhd
Where 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd is the name of the file / UUID that is corrupted on the Xen dom0 server, and fd959935-63c7-4415-bde0-e11a133a50c0.vhd is the name of the vdi we created to copy over.
This method completely corrupted the mounted drive, so I scrapped it.
Try some file partition recovery tools:
I started with testdisk (apt-get install testdisk) and ran it directly against the vhd file