Hacking a corrupt VHD on xen in order to access innodb mysql information.

A client ran into a corrupted .vhd file for the data drive of a Xen server in a pool. We helped them restore from a backup; however, there were some items that they had not backed up properly. Our task was to see if we could somehow restore that data from their drive.

First, we had to find the raw file for the drive. To do this we looked at the Local Storage -> General tab in XenCenter to find the UUID of the storage that contains the failing disk.

When we tried to attach the failing disk, we got this error:

Attaching virtual disk 'xxxxxx' to VM 'xxxx'
The attempt to load the VDI failed

So, we know that the xen servers / pool reject loading the corrupted vhd. So I came up with a way to try and access the data.

After much research I came across a tool that was published by ‘twindb.com’ called ‘undrop tool for innodb’.  The idea is that even after you drop or delete innodb files on your system, there are still markers in the file system which allow code to parse what ‘used’ to be on the system. They claimed some level of this worked for corrupted file systems.

The documentation was poor, and it took a long time to figure out. They claimed to have 24-hour support, so I thought I would call them and just pay them to sort out the issue, but they took a while and didn’t call back before I had sorted it out myself. All of the documentation they did have linked to their GitHub account, but the link was dead. I searched and found a couple of other people who had forked the code before twindb took it down. I am thinking that perhaps they run more of a service business now, helping people resolve the issue, and don’t want to support the code. Since this code worked for our needs, I have forked it so that we can make it permanently available: https://github.com/matraexinc/undrop-for-innodb

First step was for me to copy the .vhd to a working directory

#cp -a 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd /tmp/restore_vhd/orig.vhd
#cd /tmp/restore_vhd/
#git clone https://github.com/matraexinc/undrop-for-innodb
#cd undrop-for-innodb
#apt-get install bison flex
#apt-get install libmysqld-dev  #this was not mentioned anywhere, however an important file was quietly not compiled without it.
#mv * ../.  #move all of the compiled files into your working directory
#cd ../
#./stream_parser -f orig.vhd # here is the magic: their code goes through and finds all of the ibdata1 logs and markers and creates data you can start to work through
#mv pages-orig.vhd pages-ibdata1  #the program created an organized set of data for you, and the next programs need to find this at pages-ibdata1.
#./recover_dictionary.sh #this will need to run mysql as root and it will create a database named ‘test’ which has a listing of all of the databases, tables and indexes it found.

This was where I had to start coming up with a custom solution in order to process the large volume of customer databases. I used some PHP to script the following commands for all of the many databases that needed to be restored. For each database and table, you must run a command against the ‘index’ page file that the previous commands created for you, so you must loop through each of them. First, list the tables and their index ids from the ‘test’ database:

 

select c.name as tablename
,a.id as indexid
from SYS_INDEXES a
join SYS_TABLES c on (a.TABLE_ID = c.ID)

 

This returns a list of the tables and any associated indexes. Using this, you must generate commands which

  1. generate a create statement for the table you are restoring,
  2. generate a load infile SQL statement and associated data file

#sys_parser -h localhost -u username -p password -d test tablenamefromsql

This generates the create statement for the table. Save this to a createtable.sql file and execute it on your database to restore your table.

#c_parser -5 -o data.load -f pages-ibdata1/FIL_PAGE_INDEX/00000017493.page -t createtable.sql

This outputs a “load data infile ‘data.load’” statement. You should pipe this to MySQL and it will restore your data.
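The PHP used to loop over every table is not shown in this post. As a rough illustration only, here is a minimal Python 3 sketch of the same idea: it takes one (table name, index id) row from the query above and builds the two command strings. The function names, the output file names, the replacement of "/" in table names, and the zero-padding width of the page file (guessed from the single example file name above) are all assumptions, not part of the original script.

```python
def page_path(index_id, width=11, pages_dir="pages-ibdata1"):
    """Path of the index page file written by stream_parser.

    The padding width is inferred from the one example file name in this
    post; check the files in FIL_PAGE_INDEX and adjust it if it differs.
    """
    return "%s/FIL_PAGE_INDEX/%0*d.page" % (pages_dir, width, index_id)


def recovery_commands(table, index_id, user="username", password="password"):
    """Return the sys_parser and c_parser commands for one table."""
    # Hypothetical precaution: keep "/" out of the generated file names.
    safe = table.replace("/", "_")
    create_sql = "%s.create.sql" % safe
    data_file = "%s.load" % safe
    return [
        # Step 1: write the CREATE TABLE statement to a file.
        "./sys_parser -h localhost -u %s -p %s -d test %s > %s"
        % (user, password, table, create_sql),
        # Step 2: write the data file and capture the LOAD DATA statement.
        "./c_parser -5 -o %s -f %s -t %s > %s.load.sql"
        % (data_file, page_path(index_id), create_sql, safe),
    ]


# Example row as it might come back from the SYS_INDEXES / SYS_TABLES query.
for command in recovery_commands("shop/orders", 17493):
    print(command)
```

Printing the commands rather than executing them makes it easy to review the list before running anything against the recovered pages.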


I found one case (table_id 754) where the create statement was not properly generated. It appears that the sys_parser code relies on indexes, and in this case the client's table did not have an index (not even a primary key). As a result no create statement was produced and the import could not continue. To work around this, I manually inserted a fake primary key on one of the columns into the dictionary tables. Note that the INDEX_ID in SYS_FIELDS must match the id inserted into SYS_INDEXES.

#insert into SYS_INDEXES set id=1000009, table_id = 754, name='PRIMARY', N_FIELDS=1, Type=3, SPACE=0, PAGE_NO=400000000
#insert into SYS_FIELDS set INDEX_ID=1000009, POS=0, COL_NAME='myprimaryfield'

Then I was able to run the sys_parser command which then created the statement.


An Idea that Did not work ….

The idea is to create a new hdd device at /dev/xvdX, create a new filesystem on it, and mount it. Then, using a tool such as dd or qemu-img, overwrite the already mounted device with the contents of the vhd. While the contents are corrupted, the idea is that we will be able to explore the corrupted contents as best we can.

so the command I ran was

#qemu-img convert -p -f vpc -O raw /var/run/sr-mount/f40f93af-ae36-147b-880a-729692279845/3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd /dev/xvde

 

Where 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd is the name of the file / UUID that is corrupted on the Xen dom0 server, and f40f93af-ae36-147b-880a-729692279845 is the UUID of the Storage / SR that it was located on.

 

The command took a while to complete (it had to convert 50GB), but the contents of the vhd started to show up as I ran find commands on the mounted directory. During the transfer the results were sporadic, as the partition was only partially built. After it completed, I had access to about 50% of the data.

An Idea that Did not work (2) ….

This was not good enough to get the files the client needed. I had a suspicion that the qemu-img convert command may have dropped some of the data that was still available, so I figured I would try another, somewhat similar command that actually seems to be a bit simpler.

This time I created another disk on the same local storage and found it using the xe vdi-list command on the dom0.

#xe vdi-list name-label=disk_for_copyingover

This showed me that the UUID of this file was ‘fd959935-63c7-4415-bde0-e11a133a50c0.vhd’.

I found it on disk and executed a cat from the corrupted vhd file into the mounted vhd file while it was running.

cat 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd > ../8c5ecc86-9df9-fd72-b300-a40ace668c9b/fd959935-63c7-4415-bde0-e11a133a50c0.vhd

Where 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd is the name of the file / UUID that is corrupted on the Xen dom0 server, and fd959935-63c7-4415-bde0-e11a133a50c0.vhd is the name of the VDI we created to copy over.

 

This method completely corrupted the mounted drive, so I scrapped this method.

Next up:  

Try some partition recovery tools:

I started with testdisk (apt-get install testdisk) and ran it directly against the vhd file

testdisk 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd

Enabling Xen VM auto start for 6.2- command line

Citrix removed auto start from the easy-to-access options in XenCenter for 6.X servers.

However you can still run it.

First enable it on your pool

  • xe pool-param-set uuid=UUID other-config:auto_poweron=true

Then run a command to list all of the VMs in your pool and generate the command that turns auto power on for each VM that is currently running.

  • xe vm-list power-state=running | awk -F': ' '/uuid/ {print "xe vm-param-set uuid=" $NF " other-config:auto_poweron=true;"}'

This will give you a list of commands to enable auto_poweron for each of the running vm in your pool
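For readers who prefer to see the text processing spelled out, here is a minimal Python 3 sketch of what the awk one-liner does. It works on the text output of `xe vm-list` and only builds the command strings. The function name and the sample UUID are made up for illustration, and the sample assumes the usual `key ( RO) : value` layout of `xe` output.

```python
def autostart_commands(vm_list_output):
    """Build one `xe vm-param-set` command per uuid line in the text."""
    commands = []
    for line in vm_list_output.splitlines():
        key, sep, value = line.partition(":")
        # Only the "uuid ( RO) : <value>" lines carry a VM uuid.
        if sep and key.strip().startswith("uuid"):
            commands.append(
                "xe vm-param-set uuid=%s other-config:auto_poweron=true"
                % value.strip()
            )
    return commands


# Made-up sample in the shape of `xe vm-list power-state=running` output.
sample = """uuid ( RO)           : 11111111-2222-3333-4444-555555555555
     name-label ( RW): web01
    power-state ( RO): running
"""

for command in autostart_commands(sample):
    print(command)
```

As with the awk version, this only prints the commands; they still have to be run on the dom0.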

Command Dump – Extending a disk on XenServer with xe

To expand the disk on a XenServer using the command line, I assume that you have backed up the data elsewhere before the expansion, as this method deletes everything on the disk to be expanded.

  • dom0>xe vm-list name-label=<your vm name> # to get the UUID of the VM = VMUUID
  • dom0>xe vm-shutdown uuid=<VMUUID>
  • dom0>xe vbd-list  params=device,empty,vdi-name-label,vdi-uuid   vm-name-label=<your vm name>  # to get the vdi-uuid of the disk you would like to expand = VDIUUID
  • dom0>xe vdi-resize uuid=<VDIUUID> disk-size=120GB #use the size that you would like to expand to
  • dom0>xe vm-start uuid=<VMUUID>

That's it on the dom0. Now, as your VM boots up, log in via SSH and complete the changes by deleting the old partition, repartitioning, and making a new filesystem. I am going to do this as though the disk is mounted at /data.

  • domU>df /data # to get the device name = DEVICENAME
  • domU>umount /dev/DEVICENAME
  • domU>fdisk /dev/DEVICENAME
  •    [d] to delete the existing partition
  •    [n] to create a new partition
  •    [w] to write the partition table (this also exits fdisk)
  • domU>mkfs.ext3 /dev/DEVICENAME
  • domU>mount /data
  • domU>df /data # to see the expanded size
 

Looking for help with XenServer?   Matraex can help.

XenServer and XenCenter

Why do we Blog about XenServer and XenCenter?

First, a quick bit about why we chose XenServer

We are small users of the XenServer and XenCenter software, and when we were first evaluating the hypervisor, we didn’t know much at all about virtualizing servers.
At the same time as we were looking at XenServer, we were also looking into Hyper-V and VMware. Of the three, I found the open source model that XenServer had, backed by Citrix’s large-company status, to be the most appealing.
The Xen hypervisor that XenServer is built on was also what Amazon AWS was based on, and our experience with AWS helped us lean towards XenServer.
To add to this, the XenCenter software was very simple to use. The way that we were able to quickly create and manage pools of servers and simply connect to the console seemed to address the features we would need, without overcomplicating things like the VMware software did. And I liked the simple, fast interface.

And finally, since we don’t like to have Windows or GUI interfaces in our environment, we loved that the hypervisor is a Linux install we can log into and run ‘xe’ commands on. This makes XenServer very scriptable.

XenServer is scriptable

Looking back, the reason we have created so many blog posts about XenServer is simply that it is so easy to do. As we have run into things that we have had difficulty doing, it has been simple to document the process of figuring them out. We have the option to simply cut and paste our command line history. This seems so much easier than creating picture snippets of a GUI-based management system, and it makes it simple to turn our documentation of the process of troubleshooting an issue into a blog post.

Solutions to Problems are easy to forget

When we find a solution to a problem, it can be very easy to implement and forget. What happens then is that we end up doing the same research a year later to solve the same problem. This is one of the reasons that many of our blog posts are not polished; the posts just read like a stream-of-consciousness troubleshooting session. We are not expert article writers, we are expert website developers, server administrators and technical implementers. However, we recognized that when we solve a difficult problem, if we document it in a place that is easy to find (our own blog), we can easily come back to it. We simply search our own blog for it.

All of our blog topics

So really,  the reasons above apply to many of our blog topics.

  • Easy to script or describe in text (without pictures), so we are able to cut and paste
  • Solution is one that we want to easily be able to find and apply again

Examples of XenServer Blog Posts

manually removing a pool slave from a pool in XenCenter

Problem: The pool master was lost or the ip address was changed. Upon bootup of one of the pool’s slaves, it came up with no management network, and no network interfaces to configure.

Resolution:

MAKE SURE YOUR VMs ARE BACKED UP!!!! LOCAL STORAGE WILL GO AWAY AFTER THIS AND WILL HAVE TO BE RE-CREATED.

Remove the slave server from XenCenter.

At the slave console’s main menu, go to “Network and Management Interface”, “Emergency Network Reset”

Log in, and walk through the steps of re-assigning your address. Go ahead and enter an address for the master when prompted.

The server will reboot.

Go to “Local Command Shell” on the main menu.

Check the state of the server:

xe host-is-in-emergency-mode

answer: true

because the server is still in emergency mode, we need to edit the pool.conf.

nano /etc/xensource/pool.conf

It will probably reference “slave” and whatever address you defined as your master.

Remove all entries and add : master

save the conf file with Ctrl + o, exit with Ctrl + x

Rename the state.db with this command.

mv /var/xapi/state.db /var/xapi/state.db-old

Exit to the main console with xsconsole.

reboot it, and you should be able to re-add it to XenCenter and your pool.

More on changing ip addresses here:

http://support.citrix.com/article/CTX123477

Adding your local storage back to the xenserver:

Once you’ve re-added your server back to XenCenter, you’ll notice that your storage devices are gone. to re-add:

On the console tab of the server you just added, You can list your devices with:

cat /proc/partitions

get your device id’s with:

ll /dev/disk/by-id

Execute the following command:

xe sr-create content-type=user device-config:device=/dev/disk/by-id/<device ID from the list from the previous command> host-uuid=<ID can be copied and pasted from the “general” tab> name-label="Give It a Name" shared=false type=lvm

If you’re trying to add the disk with the system on it, you’ll have to select the partition to restore:

xe sr-create content-type=user device-config:device=/dev/disk/by-id/<device ID for the partition from the list from the previous command> host-uuid=<ID can be copied and pasted from the “general” tab> name-label="Give It a Name" shared=false type=lvm

This might at least allow you to get any files on that storage off to a more stable place. With a server in this condition, I would recommend reloading XenServer once you’ve taken everything that you need off of it.

Matt Long

02/24/2015

 

In XenCenter Console – mount DVD drive in Ubuntu 14.04

When running Ubuntu 14.04 LTS as a guest under XenServer 6.5, I was attempting to install xs-tools.iso by mounting it into the server using the drop-down box.

However, at the console, I was unable to find /dev/cdrom or /dev/dvd* or /dev/sr* or anything that seemed to fit.

So I ran fdisk -l

#fdisk -l

and I found a disk I didnt recognize

Disk /dev/xvdd: 119 MB, 119955456 bytes
255 heads, 63 sectors/track, 14 cylinders, total 234288 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/xvdd doesn't contain a valid partition table

So I mounted it and looked at the contents

#mount /dev/xvdd /mnt
#ls -l /mnt
dr-xr-xr-x 4 root root 2048 Jan 27 04:08 Linux
-r--r--r-- 1 root root 1180 Jan 27 04:08 README.txt
-r--r--r-- 1 root root 65 Jan 27 04:07 AUTORUN.INF
-r--r--r-- 1 root root 802816 Jan 27 04:07 citrixguestagentx64.msi
-r--r--r-- 1 root root 802816 Jan 27 04:07 citrixguestagentx86.msi
-r--r--r-- 1 root root 278528 Jan 27 04:07 citrixvssx64.msi
-r--r--r-- 1 root root 253952 Jan 27 04:07 citrixvssx86.msi
-r--r--r-- 1 root root 1925120 Jan 27 04:07 citrixxendriversx64.msi
-r--r--r-- 1 root root 1486848 Jan 27 04:07 citrixxendriversx86.msi
-r--r--r-- 1 root root 26 Jan 27 04:07 copyright.txt
-r--r--r-- 1 root root 831488 Jan 27 04:07 installwizard.msi
-r-xr-xr-x 1 root root 50449456 Jan 27 04:03 dotNetFx40_Full_x86_x64.exe
-r-xr-xr-x 1 root root 1945 Jan 27 04:03 EULA_DRIVERS
-r-xr-xr-x 1 root root 1654835 Jan 27 04:03 xenlegacy.exe
-r-xr-xr-x 1 root root 139542 Jan 27 04:03 xluninstallerfix.exe


So I found it!  Now just to install the tools and reboot

#cd /mnt/Linux && ./install.sh
#reboot

XenCenter – missing ‘Logs’ tab

XenCenter has moved the status of actions for each physical host and VM away from the very intuitive ‘Logs’ tab where it used to be. Here is where they moved it.

  • At the bottom of the left pane there is an option called ‘Notifications’. When you click it you are automatically shown all of the alerts (such as the status changes).
  • At the top of the left pane, while Notifications is selected, you will notice that it gives you three options: “Alerts”, “Updates” and “Events”.
  • If you click on “Events” you will see the status of ongoing ‘Exports’, transfers or other actions.

 

Script for Patching XenServer 6.5

Here’s a little script that you can run at the dom0 console to automate loading patches on a fresh installation of XenServer 6.5 up to patch XS65E005. If they add more patches, just add more lines referencing the new patch name and its download URL (e.g. XS65E006, etc.), starting with the “wget” command and ending with the “rm -f *.xsupdate” command.

#!/bin/bash

wget http://downloadns.citrix.com.edgesuite.net/akdlm/10194/XS65E001.zip
unzip XS65E001.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E001.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate

wget http://downloadns.citrix.com.edgesuite.net/akdlm/10195/XS65E002.zip
unzip XS65E002.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E002.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate

wget http://downloadns.citrix.com.edgesuite.net/akdlm/10196/XS65E003.zip
unzip XS65E003.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E003.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate

wget http://downloadns.citrix.com.edgesuite.net/akdlm/10201/XS65E005.zip
unzip XS65E005.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E005.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate
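Since every patch repeats the same five lines, the script body can also be generated from a list of patches. The following Python 3 sketch is one hypothetical way to do that. It only prints shell lines and does not run `xe` itself. The `PATCHES` list and `BASE_URL` simply restate the URLs used above, and the function name is an assumption for illustration.

```python
BASE_URL = "http://downloadns.citrix.com.edgesuite.net/akdlm"

# (download directory, patch name) pairs taken from the script above.
PATCHES = [
    ("10194", "XS65E001"),
    ("10195", "XS65E002"),
    ("10196", "XS65E003"),
    ("10201", "XS65E005"),
]

# Same upload-and-apply pipeline as the script, with the patch name filled in.
APPLY_TEMPLATE = (
    "xe patch-apply "
    "uuid=`xe patch-upload file-name={patch}.xsupdate 2>&1|tail -1"
    "|awk -F\" \" '{{print $NF}}'` "
    "host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1"
    "|awk '{{print $NF}}'`"
)


def patch_block(directory, patch):
    """Return the five shell lines that download and apply one patch."""
    return [
        "wget %s/%s/%s.zip" % (BASE_URL, directory, patch),
        "unzip %s.zip" % patch,
        APPLY_TEMPLATE.format(patch=patch),
        "rm -f *.zip",
        "rm -f *.xsupdate",
    ]


print("#!/bin/bash")
for directory, patch in PATCHES:
    print()
    print("\n".join(patch_block(directory, patch)))
```

Adding a new patch then means adding one pair to `PATCHES`, once its download directory number is known.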

Changing IP Addresses on a XenServer 6.5 Pool

To change the ip addresses on a XenServer 6.5 pool, start with the slaves, and use the following xe commands:

Remember: Slaves first, then the Master

NOTE: There is no need to change the IP from the Management Console.

Find the UUID of the Host Management PIF:

xe pif-list params=uuid,host-name-label,device,management

You will see a big list. Find the UUID for the slave that you’re working on. Use the “more” pipe if the UUID for your particular slave scrolls off the screen:

xe pif-list params=uuid,host-name-label,device,management | more

Change the IP Address on the first slave:

xe pif-reconfigure-ip uuid=<UUID of host management PIF> IP=<New IP> gateway=<GatewayIP> netmask=<Subnet Mask> DNS=<DNS Lookup IPs> mode=<dhcp,none,static>

Then:

xe-toolstack-restart

Verify the new address with ifconfig, and/or ping it from a workstation.

Point the slave to the new Master IP Address:

xe pool-emergency-reset-master master-address=NEW_IP_OF_THE_MASTER

Repeat the commands above on all slaves.

On the Master:

xe pif-list params=uuid,host-name-label,device,management

xe pif-reconfigure-ip uuid=<UUID of host management PIF> IP=<New IP> gateway=<GatewayIP> netmask=<Subnet Mask> DNS=<DNS Lookup IPs> mode=<dhcp,none,static>

xe-toolstack-restart

DO NOT run the emergency-reset-master command on the Master.

Reboot the Master, then reboot the Slaves and verify that they can find the Master.

Matt Long

04/06/2015

Using MPT-Status for RAID Monitoring in a Poweredge C6100 with Perc 6

This post outlines the steps needed to get a CLI report of the conditions of your RAIDs in a Poweredge C6100 with a PERC 6/i RAID Controller.

Verify your controller type:

cat /proc/scsi/mptsas/0

ioc0: LSISAS1068E B3, FwRev=011b0000h, Ports=1, MaxQ=277

Download the following packages:

daemonize-1.5.6-1.el5.i386.rpm mpt-status-1.2.0-3.el5.centos.i386.rpm lsscsi-0.17-3.el5.i386.rpm

http://dl.nux.ro/utils/mpt-status/mpt-status-1.2.0-3.el5.centos.i386.rpm

http://dl.nux.ro/utils/mpt-status/daemonize-1.5.6-1.el5.i386.rpm

http://mirror.centos.org/centos/5/os/i386/CentOS/lsscsi-0.17-3.el5.i386.rpm

Install mpt-status:

rpm -ivh mpt-status-1.2.0-3.el5.centos.i386.rpm daemonize-1.5.6-1.el5.i386.rpm lsscsi-0.17-3.el5.i386.rpm

modprobe mptctl

echo mptctl >> /etc/modules

Verify your modules:

lsmod |grep mpt

mptctl 90739 0

mptsas 57560 4

mptscsih 39876 1 mptsas

mptbase 91081 3 mptctl,mptsas,mptscsih

scsi_transport_sas 27681 1 mptsas

scsi_mod 145658 7 mptctl,sg,libata,mptsas,mptscsih,scsi_transport_sas,sd_mod

run:

mpt-status or mpt-status -n -s

Also, you can use: lsscsi -l

This little script:

echo `mpt-status -n -s|awk '/OPTIMAL/ {print $1, "OK"}; /ONLINE/ {print $1, "OK"}; /DEGRADED/ {print $1, "FAILURE"}; /scsi/ {print $2}; /MISSING/ {print $1, "FAILURE"} '`

reports:

vol_id:0 OK phys_id:1 OK phys_id:0 OK 100% 100%

On a rebuild, it reports:

vol_id:0 FAILURE phys_id:2 OK phys_id:3 OK 75% 75%
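To make the awk rules easier to follow, here is a minimal Python 3 sketch that applies the same five rules, in the same order, to a block of text. The sample input is an assumption: it was worked backwards from the report lines shown above and is not captured `mpt-status -n -s` output, so treat the exact input layout as hypothetical.

```python
def summarize(mpt_status_output):
    """Mirror the awk rules: state keywords become OK or FAILURE."""
    parts = []
    for line in mpt_status_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Each rule is checked independently, as in the awk program.
        if "OPTIMAL" in line:
            parts.append(fields[0] + " OK")
        if "ONLINE" in line:
            parts.append(fields[0] + " OK")
        if "DEGRADED" in line:
            parts.append(fields[0] + " FAILURE")
        if "scsi" in line and len(fields) > 1:
            parts.append(fields[1])
        if "MISSING" in line:
            parts.append(fields[0] + " FAILURE")
    # `echo` of the backquoted output joins everything with single spaces.
    return " ".join(parts)


# Hypothetical input, reconstructed from the healthy report shown above.
healthy = """vol_id:0 OPTIMAL
phys_id:1 ONLINE
phys_id:0 ONLINE
scsi_id:1 100%
scsi_id:0 100%
"""

print(summarize(healthy))
```

A volume line containing DEGRADED would produce `vol_id:0 FAILURE` in the same position, which matches the rebuild report above.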

Copy that script into a file called “check_raid”, and make it executable (e.g. chmod 755).

Edit nagios-statd on parcel1. Replace “sudo /customcommands/check_raid.pl -b -w1 -c1” with the filename /customcommands/check_raid (without the switches) at line 20, and remove “sudo”.

So, from this:

commandlist['Linux'] = ("df -P","who -q | grep \"#\"","ps ax","uptime","free | awk '$1~/^Swap:/{print ($3/$2)*100}'","sudo /customcommands/check_raid.pl -b -w1 -c1")

To this:

commandlist['Linux'] = ("df -P","who -q | grep \"#\"","ps ax","uptime","free | awk '$1~/^Swap:/{print ($3/$2)*100}'","/customcommands/check_raid")

Port 1040 will need to be opened in XenServer. Edit /etc/sysconfig/iptables and insert this line:

-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 1040 -j ACCEPT

Restart the firewall:

service iptables restart

Output:

Flushing firewall rules: [ OK ]

Setting chains to policy ACCEPT: filter [ OK ]

Unloading iptables modules: [ OK ]

Applying iptables firewall rules: [ OK ]

Loading additional iptables modules: ip_conntrack_netbios_n[FAILED]

NOTE: The “FAILED” error above doesn’t seem to be a problem.

Verify that port 1040 is open by checking the firewall status:

service iptables status

Output:

Table: filter
Chain INPUT (policy ACCEPT)
num target prot opt source destination
1 ACCEPT 47 -- 0.0.0.0/0 0.0.0.0/0
2 RH-Firewall-1-INPUT all -- 0.0.0.0/0 0.0.0.0/0

Chain FORWARD (policy ACCEPT)
num target prot opt source destination
1 RH-Firewall-1-INPUT all -- 0.0.0.0/0 0.0.0.0/0

Chain OUTPUT (policy ACCEPT)
num target prot opt source destination

Chain RH-Firewall-1-INPUT (2 references)
num target prot opt source destination
1 ACCEPT all -- 0.0.0.0/0 0.0.0.0/0
2 ACCEPT icmp -- 0.0.0.0/0 0.0.0.0/0 icmp type 255
3 ACCEPT esp -- 0.0.0.0/0 0.0.0.0/0
4 ACCEPT ah -- 0.0.0.0/0 0.0.0.0/0
5 ACCEPT udp -- 0.0.0.0/0 224.0.0.251 udp dpt:5353
6 ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:631
7 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:631
8 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:1040
9 ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED
10 ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 state NEW udp dpt:694
11 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:22
12 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:80
13 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:443
14 REJECT all -- 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited

Running “nagios-statd” opens port 1040 on Parcel1 and listens for commands to be initiated by nagios-stat on the Nagios server.

On the Nagios server, in a file called “remote.orig.cfg”, there are commands defined using “nagios-stat”. NOTE: These are from a working server and haven’t been modified to work with mpt. Some changes may need to be made. This is just an example of the interaction between the Nagios server and client.
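Judging from the GenericHandler.run method in the nagios-statd script further down in this post, the daemon reads one line containing a function name (such as raid), writes back that function's output, and prefixes failures with ERROR. Under that assumption, a minimal Python 3 client for testing a node by hand might look like the sketch below. It is not the real nagios-stat plugin, and the helper names are made up for illustration.

```python
import socket


def build_request(function):
    """The daemon reads a single line holding the function name."""
    return function.strip().encode("ascii") + b"\n"


def is_error(reply):
    """Failure replies from the daemon start with the word ERROR."""
    return reply.startswith("ERROR")


def query_statd(host, function, port=1040, timeout=10.0):
    """Send one function name and return everything the daemon writes back."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(function))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the daemon closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", "replace")


# Example use against a node running nagios-statd (host name is a placeholder):
#   reply = query_statd("parcel1", "raid")
#   print("problem" if is_error(reply) else "ok", reply)
```

This is handy for confirming that the firewall rule and the daemon both work before involving the Nagios server.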

Example:

define command{
        command_name check_remote_raid
        command_line $USER1$/nagios-stat -w $ARG1$ -c $ARG2$ -p $ARG3$ raid $HOSTADDRESS$
}

This command defined above is used in the “services.cfg” file.

Example:

define service{
        use matraex-template
        host_name mtx-lilac
        service_description Lilac /data Raid
        check_command check_remote_raid!1!1!1040
}

The three files needed on the C6100 node are:

/customcommands/check_raid (contents below) -rwxr-xr-x

/customcommands/nagios-statd (contents below) -rwxr-xr-x

/etc/init.d/nagios-statd (contents below) -rwxr--r--

Creating the soft links:

ln -s /etc/init.d/nagios-statd /etc/rc.d/rc3.d/K01nagios-statd

ln -s /etc/init.d/nagios-statd /etc/rc.d/rc3.d/S99nagios-statd

The -s = soft, and -f if used, forces overwrite.

/rc3.d/ designates runlevel 3

So when you do this:

ls -lt /customcommands/nagios-statd /etc/init.d/nagios-statd /customcommands/check_raid /etc/rc.d/rc3.d/*nagios-statd

This is what you should see:

lrwxrwxrwx 1 root root 22 Mar 6 08:08 /etc/rc.d/rc3.d/K01nagios-statd -> ../init.d/nagios-statd

-rwxr-xr-x 1 root root 365 Mar 6 07:59 /customcommands/check_raid

lrwxrwxrwx 1 root root 22 Mar 6 07:52 /etc/rc.d/rc3.d/S99nagios-statd -> ../init.d/nagios-statd

-rwxr-xr-x 1 root root 649 Mar 6 07:51 /etc/init.d/nagios-statd

-rwxr-xr-x 1 root root 9468 Mar 5 12:05 /customcommands/nagios-statd

Script Files:

NOTE: Here’s a little fix that helped me out. I had originally pasted these scripts into a DOS/Windows editor (wordpad) and it added DOS-type returns to the file, resulting in an error:

-bash: ./nagios-statd: /bin/sh^M: bad interpreter: No such file or directory

If you encounter this, do this:

Open the file in vi

hit “:” to go into command mode

enter “set fileformat=unix”

then :wq to quit.
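If several files were pasted through a Windows editor, the same fix can be scripted. This is a minimal Python 3 sketch that rewrites DOS line endings as Unix ones; the function names are made up for illustration, and it assumes the file is small enough to read into memory.

```python
def to_unix_newlines(data):
    """Replace DOS (CRLF) line endings with Unix (LF) ones."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite the file in place; return True if anything changed."""
    with open(path, "rb") as handle:
        original = handle.read()
    converted = to_unix_newlines(original)
    if converted != original:
        with open(path, "wb") as handle:
            handle.write(converted)
    return converted != original


# Example use (path taken from this post):
#   convert_file("/customcommands/nagios-statd")
```

Working on bytes avoids any guessing about the file's text encoding.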

/customcommands/check_raid:

#!/bin/bash

EXECFILE=/usr/sbin/mpt-status

if [ ! -e $EXECFILE ] ; then
    echo
    echo "Error $EXECFILE is not installed, please install before running"
    echo
    echo "Usage $0";
    echo
    exit 10
fi

echo `$EXECFILE -n -s|awk '/OPTIMAL/ {print $1, "OK"}; /ONLINE/ {print $1, "OK"}; /DEGRADED/ {print $1, "FAILURE"}; /scsi/ {print $2}; /MISSING/ {print $1, "FAILURE"} '`

/customcommands/nagios-statd:

#!/usr/bin/python

import getopt, os, sys, signal, socket, SocketServer

class Functions:

“Contains a set of methods for gathering data from the server.”

def __init__(self):

self.nagios_statd_version = 3.09

# As of right now, the commands are for df, who, proc, uptime, and swap.

commandlist = {}

commandlist[‘AIX’] = (“df -Ik”,”who | wc -l”,”ps ax”,”uptime”,”lsps -sl | grep -v Paging | awk ‘{print $2}’ | cut -f1 -d%”)

commandlist[‘BSD/OS’] = (“df”,”who | wc -l”,”ps -ax”,”uptime”,None)

commandlist[‘CYGWIN_NT-5.0’] = (“df -P”,None,”ps -s -W | awk ‘{printf(“%6s%6s%3s%6s%sn”,$1,$2,” S”,” 0:00″,substr($0,22))}'”,None,None)

commandlist[‘CYGWIN_NT-5.1’] = commandlist[‘CYGWIN_NT-5.0’]

commandlist[‘FreeBSD’] = (“df -k”,”who | wc -l”,”ps ax”,”uptime”,”swapinfo | awk ‘$1!~/^Device/{print $5}'”)

commandlist[‘HP-UX’] = (“bdf -l”,”who -q | grep “#””,”ps -el”,”uptime”,None)

commandlist[‘IRIX’] = (“df -kP”,”who -q | grep “#””,”ps -e -o “pid tty state time comm””,”/usr/bsd/uptime”,None)

commandlist[‘IRIX64’] = commandlist[‘IRIX’]

commandlist[‘Linux’] = (“df -P”,”who -q | grep “#””,”ps ax”,”uptime”,”free | awk ‘$1~/^Swap:/{print ($3/$2)*100}'”,”/customcommands/check_raid”)

commandlist[‘NetBSD’] = (“df -k”,”who | wc -l”,”ps ax”,”uptime”,”swapctl -l | awk ‘$1!~/^Device/{print $5}'”)

commandlist[‘NEXTSTEP’] = (“df”,”who | /usr/ucb/wc -l”,”ps -ax”,”uptime”,None)

commandlist[‘OpenBSD’] = (“df -k”,”who | wc -l”,”ps -ax”,”uptime”,”swapctl -l | awk ‘$1!~/^Device/{print $5}'”)

commandlist[‘OSF1’] = (“df -P”,”who -q | grep “#””,”ps ax”,”uptime”,None)

commandlist[‘SCO-SV’] = (“df -Bk”,”who -q | grep “#””,”ps -el -o “pid tty s time args””,”uptime”,None)

commandlist[‘SunOS’] = (“df -k”,”who -q | grep “#””,”ps -e -o “pid tty s time comm””,”uptime”,”swap -s | tr -d -s -c [:digit:][:space:] | nawk ‘{print ($3/($3+$4))*100}'”)

commandlist[‘UNIXWARE2’] = (“/usr/ucb/df”,”who -q | grep “#””,”ps -el | awk ‘{printf(“%6d%9s%2s%5s %sn”,$5,substr($0, 61, 8),$2,substr($0,69,5),substr($0,75))}”,”echo `uptime`, load average: 0.00, `sar | awk ‘{oldidle=idle;idle=$5} END {print 100-oldidle}’`,0.00″,None)

# Now to make commandlist with the correct one for your OS.

try:

self.commandlist = commandlist[os.uname()[0]]

except KeyError:

print “Your platform isn’t supported by nagios-statd – exiting.”

sys.exit(3)

# Below are the functions that the client can call.

def disk(self):

return self.__run(0)

def proc(self):

return self.__run(2)

def swap(self):

return self.__run(4)

def uptime(self):

return self.__run(3)

def user(self):

return self.__run(1)

def raid(self):

return self.__run(5)

def version(self):

i = “nagios-statd ” + str(self.nagios_statd_version)

return i

def __run(self,cmdnum):

# Unmask SIGCHLD so popen can detect the return status (temporarily)

signal.signal(signal.SIGCHLD, signal.SIG_DFL)

outputfh = os.popen(self.commandlist[cmdnum])

output = outputfh.read()

returnvalue = outputfh.close()

signal.signal(signal.SIGCHLD, signal.SIG_IGN)

if (returnvalue):

return “ERROR %s ” % output

else:

return output

class NagiosStatd(SocketServer.StreamRequestHandler):

“Handles connection initialization and data transfer (as daemon)”

def handle(self):

# Check to see if user is allowed

if self.__notallowedhost():

self.wfile.write(self.error)

return 1

if not hasattr(self,”generichandler”):

self.generichandler = GenericHandler(self.rfile,self.wfile)

self.generichandler.run()

def __notallowedhost(self):

“Compares list of allowed users to client’s IP address.”

if hasattr(self.server,”allowedhosts”) == 0:

return 0

for i in self.server.allowedhosts:

if i == self.client_address[0]: # Address is in list

return 0

try: # Do an IP lookup of host in blocked list

i_ip = socket.gethostbyname(i)

except:

self.error = “ERROR DNS lookup of blocked host “%s” failed. Denying by default.” % i

return 1

if i_ip != i: # If address in list isn’t an IP

if socket.getfqdn(i) == socket.getfqdn(self.client_address[0]):

return 0

self.error = “ERROR Client is not among hosts allowed to connect.”

return 1

class GenericHandler:

def __init__(self,rfile=sys.stdin,wfile=sys.stdout):

# Create functions object

self.functions = Functions()

self.rfile = rfile

self.wfile = wfile

def run(self):

# Get the request from the client

line = self.rfile.readline()

line = line.strip()

# Check for appropriate requests from client

if len(line) == 0:

self.wfile.write(“ERROR No function requested from client.”)

return 1

# Call the appropriate function

try:

output = getattr(self.functions,line)()

except AttributeError:

error = “ERROR Function “” + line + “” does not exist.”

self.wfile.write(error)

return 1

except TypeError:

error = “ERROR Function “” + line + “” not supported on this platform.”

self.wfile.write(error)

return 1

# Send output

if output.isspace():

error = “ERROR Function “” + line + “” returned no information.”

self.wfile.write(error)

return 1

elif output == “ERROR”:

error = “ERROR Function “” + line + “” exited abnormally.”

self.wfile.write(error)

else:

for line in output:

self.wfile.write(line)

class ReUsingServer (SocketServer.ForkingTCPServer):

allow_reuse_address = True

class Initialization:
    "Methods for interacting with user - initial code entry point."
    def __init__(self):
        self.port = 1040
        self.ip = ''
        # Run this through Functions initially, to make sure the platform is supported.
        i = Functions()
        del(i)

    def getoptions(self):
        "Parses command line"
        try:
            opts, args = getopt.getopt(sys.argv[1:], "a:b:ip:P:Vh", ["allowedhosts=","bindto=","inetd","port=","pid=","version","help"])
        except getopt.GetoptError, (msg, opt):
            print sys.argv[0] + ": " + msg
            print "Try '" + sys.argv[0] + " --help' for more information."
            sys.exit(3)
        for option,value in opts:
            if option in ("-a","--allowedhosts"):
                value = value.replace(" ","")
                self.allowedhosts = value.split(",")
            elif option in ("-b","--bindto"):
                self.ip = value
            elif option in ("-i","--inetd"):
                self.runfrominetd = 1
            elif option in ("-p","--port"):
                self.port = int(value)
            elif option in ("-P","--pid"):
                self.pidfile = value
            elif option in ("-V","--version"):
                self.version()
                sys.exit(3)
            elif option in ("-h","--help"):
                self.usage()

    def main(self):
        # Retrieve command line options
        self.getoptions()
        # Just splat to stdout if we're running under inetd
        if hasattr(self,"runfrominetd"):
            server = GenericHandler()
            server.run()
            sys.exit(0)
        # Check to see if the port is available
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((self.ip, self.port))
            s.close()
            del(s)
        except socket.error, (errno, msg):
            print "Unable to bind to port %s: %s - exiting." % (self.port, msg)
            sys.exit(2)
        # Detach from terminal
        if os.fork() == 0:
            # Make this the controlling process
            os.setsid()
            # Be polite and chdir to /
            os.chdir('/')
            # Try to close all open filehandles
            for i in range(0,256):
                try: os.close(i)
                except: pass
            # Redirect the offending filehandles
            sys.stdin = open('/dev/null','r')
            sys.stdout = open('/dev/null','w')
            sys.stderr = open('/dev/null','w')
            # Set the path
            os.environ["PATH"] = "/bin:/usr/bin:/usr/local/bin:/usr/sbin"
            # Reap children automatically
            signal.signal(signal.SIGCHLD, signal.SIG_IGN)
            # Save pid if user requested it
            if hasattr(self,"pidfile"):
                self.savepid(self.pidfile)
            # Create a forking TCP/IP server and start processing
            server = ReUsingServer((self.ip,self.port),NagiosStatd)
            if hasattr(self,"allowedhosts"):
                server.allowedhosts = self.allowedhosts
            server.serve_forever()
        # Get rid of the parent
        else:
            sys.exit(0)

    def savepid(self,file):
        try:
            fh = open(file,"w")
            fh.write(str(os.getpid()))
            fh.close()
        except:
            print "Unable to save PID file - exiting."
            sys.exit(2)

    def usage(self):
        print "Usage: " + sys.argv[0] + " [OPTION]"
        print "nagios-statd daemon - remote UNIX system monitoring tool for Nagios.\n"
        print " -a, --allowedhosts=HOSTS  Comma delimited list of IPs/hosts allowed to connect."
        print " -b, --bindto=IP           IP address for the daemon to bind to."
        print " -i, --inetd               Run from inetd."
        print " -p, --port=PORT           Port to listen on."
        print " -P, --pid=FILE            Save pid to FILE."
        print " -V, --version             Output version information and exit."
        print " -h, --help                Print this help and exit."
        sys.exit(3)

    def version(self):
        i = Functions()
        print "nagios-statd %.2f" % i.nagios_statd_version
        print "os.uname()[0] = %s " % os.uname()[0]
        print "Written by Nick Reinking\n"
        print "Copyright (C) 2002 Nick Reinking"
        print "This is free software. There is NO warranty; not even for MERCHANTABILITY or"
        print "FITNESS FOR A PARTICULAR PURPOSE."
        print "\nNagios is a trademark of Ethan Galstad."

if __name__ == "__main__":
    # Check to see if running Python 2.x+ / needed because getfqdn() is Python 2.0+ only
    if (int(sys.version[0]) < 2):
        print "nagios-statd requires Python version 2.0 or greater."
        sys.exit(3)
    i = Initialization()
    i.main()

/etc/init.d/nagios-statd:

#!/bin/sh
#
# This file should have uid root, gid sys and chmod 744
#
if [ ! -d /usr/bin ]
then # /usr not mounted
    exit
fi

killproc() { # kill the named process(es)
    pid=`/bin/ps -e |
         /bin/grep -w $1 |
         /bin/sed -e 's/^ *//' -e 's/ .*//'`
    [ "$pid" != "" ] && kill $pid
}

# Start/stop processes required for netsaint_statd server
case "$1" in
'start')
    /customcommands/nagios-statd -a <IP of Allowed Nagios Server>,<IP of Test Workstation> -p 1040
    ;;
'stop')
    killproc nagios-statd
    ;;
*)
    echo "Usage: /etc/init.d/nagios-statd { start | stop }"
    ;;
esac

 

Testing:

As you can see in the script file above, I’ve added the IP Address of a test workstation. This will allow me to simply telnet to a node in the C6100 and execute one of the commands defined in this section of the /customcommands/nagios-statd script:

    # Below are the functions that the client can call.
    def disk(self):
        return self.__run(0)

    def proc(self):
        return self.__run(2)

    def swap(self):
        return self.__run(4)

    def uptime(self):
        return self.__run(3)

    def user(self):
        return self.__run(1)

    def raid(self):
        return self.__run(5)

At your workstation, telnet to <Node IP Address> 1040

When connected, the screen will be blank.

Type “raid”. The screen won’t echo this.

When you hit enter, you should see:

vol_id:0 OK phys_id:2 OK phys_id:3 OK 100% 100%
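
The manual telnet check can also be scripted. The following is a minimal Python 3 sketch and is not part of nagios-statd; the function name `query_statd` and the timeout value are assumptions made for illustration. It opens a TCP connection, sends the command name followed by a newline (the daemon reads a single line) and returns whatever the daemon writes before it closes the connection.

```python
import socket

def query_statd(host, command, port=1040, timeout=5.0):
    """Send one command line to a nagios-statd style daemon and return its reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # The daemon's handler calls readline(), so the command must end in a newline.
        conn.sendall(command.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # an empty read means the daemon closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace").strip()
```

Calling `query_statd("<Node IP Address>", "raid")` from the test workstation should return the same status line that the telnet session prints.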

Now you’re ready to move on to the Nagios configuration.

Matt Long

03/06/2015

Adding and Removing Local Storage From XenServer

To add local storage on XenServer 6.x

Get your device IDs with:

ll /dev/disk/by-id

The host uuid can be copied and pasted from the general tab of your host in XenCenter.

Create your storage:

xe sr-create content-type=user device-config:device=/dev/sdb host-uuid=<Place the host's UUID here> name-label="<Name your local storage here>" shared=false type=lvm

NOTE: Make sure that “shared=” is false. If you have shared storage on a hypervisor, you won’t be able to add it to a pool. When a hypervisor is added to a pool, its local storage is automatically shared in that pool.

NOTE: Replace sdb in the above command with the device that you’re adding.

To remove local storage on XenServer 6.x

Go to the console in XenCenter or log in to your XenServer host via SSH.

List your storage repositories.

xe sr-list

You will see something like this:

uuid ( RO) : <The uuid number you want is here>
name-label ( RW): Local storage
name-description ( RW):
host ( RO): host.example.com
type ( RO): lvm
content-type ( RO): user

The uuid string is the Storage Repository uuid (SR-uuid), which you need for the next step.

Get the Physical Block Device UUID.

xe pbd-list sr-uuid=Your-UUID

uuid ( RO) This is the PBD-uuid

Unplug the local storage.

xe pbd-unplug uuid=Your-PBD-uuid

Delete the PBD:

xe pbd-destroy uuid=your-PBD-uuid

Forget (remove) the local storage so that it no longer shows up as detached.

xe sr-forget uuid=your-SR-uuid

Now check your XenCenter that it’s removed.
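
The removal steps can be summarized in a short Python 3 sketch. This is an illustration only: `parse_xe_records` and `removal_commands` are hypothetical helper names, and the parser assumes the `xe ...-list` output keeps the "field ( RO): value" layout shown above, with a blank line between records. It does not run anything; it only turns the listing into dictionaries and returns the three xe commands in the order they have to be executed.

```python
def parse_xe_records(text):
    """Split `xe ...-list` output into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        # Drop the "( RO)" / "( RW)" access marker from the field name.
        name = key.split("(")[0].strip()
        current[name] = value.strip()
    if current:
        records.append(current)
    return records

def removal_commands(sr_uuid, pbd_uuid):
    """Return the xe commands from the steps above, in the order they must run."""
    return [
        ["xe", "pbd-unplug", "uuid=" + pbd_uuid],
        ["xe", "pbd-destroy", "uuid=" + pbd_uuid],
        ["xe", "sr-forget", "uuid=" + sr_uuid],
    ]
```

The order matters: the PBD has to be unplugged before it can be destroyed, and the SR is forgotten last.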

Automating patch installation on XenServer

I have four instances of freshly installed XenServer 6.2, and there are about a dozen patches that need to be applied to each. Here I will attempt to somewhat automate the application of these patches.

What you will need to know:

The list of patches required

The URLs of the download pages for each patch to be applied

The UUID of the patch

The UUID of the target host. This can be found by running the following on the console:

xe host-list

The procedure is as follows:

Use the wget command to download the patch. I’ll use Service Pack 1 as an example. We want that patch first, as it is cumulative, and will cut down the number of other patches to be installed.

To find the URL where SP1 resides, go to the XenCenter console, Tools, Check for Updates. This will give you a list of patches available for your server, with links to the download location. Click on the link for XS62ESP1. On the web page that opens, click on "Download". This will open another page. This is the URL that you want. Copy it to the clipboard.

You’ll need the urls and filenames for all the applicable patches when you customize your script.

 

http://downloadns.citrix.com.edgesuite.net/8707/XS62ESP1.zip

Now initiate the wget command at the console in XenCenter using this URL:

wget http://downloadns.citrix.com.edgesuite.net/8707/XS62ESP1.zip

NOTE: The URL is case sensitive!

Then unzip it:

unzip XS62ESP1.zip

This zip file contains a file with the extension "xsupdate", as in "XS62ESP1.xsupdate". That's our patch.

Register the patch on the Target Server:

xe patch-upload file-name=XS62ESP1.xsupdate

This will display the uuid of the patch. We’ll need that for the next command. You can call it up again with:

xe patch-list

Install the patch:

xe patch-apply uuid=<The patch uuid from the xe patch-list command> host-uuid=<The host uuid from the xe host-list command>

Due to limited disk space, we’re going to clean out our working directory:

rm *

Now we’ll write our script: Notice that we’re doing the upload/registration and apply in one command.

Start of Script:

wget http://downloadns.citrix.com.edgesuite.net/8707/XS62ESP1.zip

unzip XS62ESP1.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/8737/XS62ESP1002.zip

unzip XS62ESP1002.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1002.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/9058/XS62ESP1005.zip

unzip XS62ESP1005.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1005.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/9491/XS62ESP1008.zip

unzip XS62ESP1008.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1008.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/9617/XS62ESP1009.zip

unzip XS62ESP1009.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1009.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/9698/XS62ESP1011.zip

unzip XS62ESP1011.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1011.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/9703/XS62ESP1013.zip

unzip XS62ESP1013.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1013.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/9708/XS62ESP1014.zip

unzip XS62ESP1014.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1014.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/10128/XS62ESP1015.zip

unzip XS62ESP1015.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1015.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/10134/XS62ESP1012.zip

unzip XS62ESP1012.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1012.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

wget http://downloadns.citrix.com.edgesuite.net/10174/XS62ESP1016.zip

unzip XS62ESP1016.zip

xe patch-apply uuid=`xe patch-upload file-name=XS62ESP1016.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`

rm -f *

Now just put your own list of patches in place of mine in the text above and paste it into the console. You'll be off and running! (Go get coffee.)
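
Since every stanza of the script differs only in the download URL, the text can also be generated instead of edited by hand. The following Python 3 sketch is one way to do that; `patch_stanza` is a hypothetical helper, and it assumes that the .xsupdate file inside each zip has the same base name as the zip, which holds for the patches listed above. The awk call relies on awk's default whitespace splitting rather than an explicit -F.

```python
from posixpath import basename, splitext
from urllib.parse import urlparse

# Shell pipelines taken from the script above. Only PATCH_UUID goes through
# str.format(), so only it needs doubled braces around the awk program.
PATCH_UUID = "`xe patch-upload file-name={name}.xsupdate 2>&1|tail -1|awk '{{print $NF}}'`"
HOST_UUID = "`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`"

def patch_stanza(url):
    """Return the four shell lines that download and apply one patch."""
    zip_name = basename(urlparse(url).path)      # e.g. XS62ESP1.zip
    name = splitext(zip_name)[0]                 # e.g. XS62ESP1
    return "\n".join([
        "wget " + url,
        "unzip " + zip_name,
        "xe patch-apply uuid=" + PATCH_UUID.format(name=name) + " host-uuid=" + HOST_UUID,
        "rm -f *",
    ])

def patch_script(urls):
    """Join one stanza per URL, in the order given (service pack first)."""
    return "\n\n".join(patch_stanza(u) for u in urls)
```

Printing `patch_script([...])` with your own list of URLs gives text that can be pasted into the console in the same way as the hand-written script.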

Matt Long

02/17/2015

Disk write speed testing different XenServer configurations – single disk vs mdadm vs hardware raid

In our virtual environment, one of the VM Host servers has a hardware RAID controller, so naturally we used the hardware RAID.

The server is a Dell 6100, which uses a low-featured LSI SAS RAID controller.
One of the 'low' features was that it only allows two RAID volumes at a time. It also does not do RAID 10.

So I decided to create a RAID 1 with two SSD drives for the host, and we would also put the root operating systems for each of the Guest VMs there. It would be fast and redundant. Then we have up to four 1TB disks for the larger data sets. We have multiple identically configured VM Hosts in our pool.

For the data drives, with only one more RAID volume available and no RAID 10, I was limited to a RAID 5, a mirror with two spares, or a JBOD. In order to get the most space out of the four 1TB drives, I created the RAID 5. After configuring two identical VM hosts like this, I put a DRBD Primary / Primary connection between the two of them and an OCFS2 filesystem on top of it. I found I got write speeds as low as 3 MB/s. I wasn't originally thinking about what speeds I would get; I just expected that they would be somewhere around disk write speed, so I suppose I was expecting acceptable speeds between 30 and 80 MB/s. When I didn't get them, I realized I would have to do some simple benchmarking on my four 1TB drives to see which configuration would give me the best combination of speed and size.
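
The space trade-off between those layouts is simple arithmetic. The Python 3 sketch below works it out for four 1TB disks; the function name `usable_tb` and the layout labels are made up for illustration, and the figures are nominal capacities that ignore filesystem and metadata overhead.

```python
def usable_tb(layout, disks=4, size_tb=1.0):
    """Nominal usable capacity in TB for the layouts considered above."""
    if layout == "raid5":
        return (disks - 1) * size_tb     # one disk's worth of space holds parity
    if layout == "raid10":
        return (disks // 2) * size_tb    # every block is stored on two disks
    if layout == "mirror+spares":
        return size_tb                   # two-disk mirror, remaining disks sit idle as spares
    if layout == "jbod":
        return disks * size_tb           # all the space, no redundancy
    raise ValueError("unknown layout: " + layout)

for name in ("raid5", "raid10", "mirror+spares", "jbod"):
    print(name, usable_tb(name), "TB")
```

With four 1TB disks this gives 3 TB for RAID 5 against 2 TB for RAID 10, which is why RAID 5 looked attractive before the write speeds were measured.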

A couple of environment items

  • I will mount the final drive on /data
  • I mount temporary drives in /mnt when testing
  • We use XenServer for our virtual environment,  I will refer to the host as the VM Host or dom0 and to a guest VM as VM Guest or domU.
  • The final speed that we are looking to get is on domU,  since that is where our application will be,  however I will be doing tests in both dom0 and domU environments.
  • It is possible that the domU may be the only VM Guest,  so we will also test raw disk access from domU for the data (and skip the abstraction level provided by the dom0)

So, as I test the different environments I need to be able to create and destroy the local storage on the dom0 VM Host. Here are some commands that help me to do it.
I already went through XenCenter and removed all connections and virtual disks on the storage I want to remove. I had to click on the device "Local Storage 2" under the host, click the Storage tab and make sure each was deleted. {VM Host SR Delete Process}

xe sr-list host=server1 #find and keep the uuid of the sr in my case "c2457be3-be34-f2c1-deac-7d63dcc8a55a"
xe pbd-list   sr-uuid=c2457be3-be34-f2c1-deac-7d63dcc8a55a # find and keep the uuid of the pbd connecting sr to dom0 "b8af1711-12d6-5c92-5ab2-c201d25612a9"
xe pbd-unplug  uuid=b8af1711-12d6-5c92-5ab2-c201d25612a9 #unplug the device from the sr
xe pbd-destroy uuid=b8af1711-12d6-5c92-5ab2-c201d25612a9 #destroy the devices
xe sr-forget uuid=c2457be3-be34-f2c1-deac-7d63dcc8a55a #destroy the sr

Now that the SR is destroyed, I can work on the raw disks on the dom0 and do some benchmarking on the speeds of different soft configurations from there.
Once I have made a change to the structure of the disks, I can recreate the SR with a new name on top of whatever solution I come up with by running:

xe sr-create content-type=user device-config:device=/dev/XXX host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage XXX on `cat /etc/hostname`" shared=false type=lvm

Replace XXX with what works for you.
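
The backtick pipeline that fills in host-uuid is worth unpacking: `grep -B1` prints the line before the one matching the local hostname, which in `xe host-list` output is the uuid line, and `awk '{print $NF}'` keeps its last field. The Python 3 sketch below mirrors that logic. `host_uuid_for` is a hypothetical name, the sample uuids are invented, and it assumes the uuid line directly precedes the name-label line, as in the listings shown in these notes.

```python
def host_uuid_for(host_list_output, hostname):
    """Return the uuid on the line just before the first line mentioning hostname."""
    previous = ""
    for line in host_list_output.splitlines():
        if hostname in line and previous.lstrip().startswith("uuid"):
            return previous.split()[-1]   # last whitespace-separated field, like awk's $NF
        previous = line
    return None

# Invented sample in the layout printed by `xe host-list`.
SAMPLE = """\
uuid ( RO)                : 11111111-2222-3333-4444-555555555555
          name-label ( RW): plot1
    name-description ( RW): Default install of XenServer

uuid ( RO)                : aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
          name-label ( RW): plot2
    name-description ( RW): Default install of XenServer
"""

print(host_uuid_for(SAMPLE, "plot2"))
```

Unlike the shell version, this sketch insists that the preceding line is a uuid line, so a hostname that happens to appear in a description does not return the wrong field.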

Most of the tests were just me running dd commands and writing down the slowest speed, and then what seemed to be about the average speed, in MB/s. The first write always seemed a bit slower and each subsequent write was faster. I am not sure whether that means an idle disk takes a bit longer to speed up and write. If that is the case then there are two scenarios: if the disk is often idle, it will see the slower number, but if the disk is busy, it will see the higher average number, so I tracked them both. The idle disk testing was not scientific, and many of my tests did not wait long enough for the disk to go idle in between tests.

The commands I ran for testing were dd commands

dd if=/dev/zero of=data/speetest.`date +%s` bs=1k count=1000 conv=fdatasync  #for 1 mb
dd if=/dev/zero of=data/speetest.`date +%s` bs=1k count=10000 conv=fdatasync  #for 10 mb
dd if=/dev/zero of=data/speetest.`date +%s` bs=1k count=100000 conv=fdatasync  #for 100 mb
dd if=/dev/zero of=data/speetest.`date +%s` bs=1k count=1000000 conv=fdatasync  #for 1000 mb
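
For repeated runs it can be handy to collect the numbers in a script rather than reading them off dd's output. This Python 3 sketch times a comparable write; `write_speed_mb_s` is a hypothetical helper and not what was used for the results in this post. It uses `os.fsync`, which waits for the data to reach the device in the same spirit as dd's conv=fdatasync, and reports MB/s with 1 MB = 1,000,000 bytes as GNU dd does. Python buffers its writes, so the pattern of system calls differs from dd with bs=1k and the figures are only roughly comparable.

```python
import os
import time

def write_speed_mb_s(path, block_size=1024, count=1000):
    """Write count blocks of zeros to path, sync, and return the speed in MB/s."""
    block = b"\0" * block_size
    start = time.perf_counter()
    with open(path, "wb") as fh:
        for _ in range(count):
            fh.write(block)
        fh.flush()                 # push Python's buffer to the OS
        os.fsync(fh.fileno())      # wait until the OS has written it to the device
    elapsed = time.perf_counter() - start
    return (block_size * count) / elapsed / 1e6
```

Calling it a few times in a row with count set to 1000, 10000 and 100000, and keeping both the slowest and the typical result, matches the way the tests here were recorded.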

I won't get into the details of every single command I ran as I was creating the different disk configurations and environments, but I will document a couple of them.

Soft RAID 10 on dom0

dom0>mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb2 --assume-clean
dom0>mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd2 --assume-clean
dom0>mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/md0 /dev/md1 --assume-clean
dom0>mkfs.ext3 /dev/md10
dom0>xe sr-create content-type=user device-config:device=/dev/md10 host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage md10 on `cat /etc/hostname`" shared=false type=lvm

Dual Dom0 Mirror – Striped on DomU for an “Extended RAID 10”

dom0> {VM Host SR Delete Process} #to clean out 'Local storage md10'
dom0>mdadm --manage /dev/md2 --stop
dom0>mkfs.ext3 /dev/md0 && mkfs.ext3 /dev/md1
dom0>xe sr-create content-type=user device-config:device=/dev/md0 host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage md0 on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/md1 host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage md1 on `cat /etc/hostname`" shared=false type=lvm
domU>
#at this point use Xen Center to add and attach disks from each of the local md0 and md1 SRs to the domU (they were attached on my systems as xvdb and xvdc)
domU> mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
domU> mkfs.ext3 /dev/md10  && mount /dev/md10 /data

Four disks SR from dom0, soft raid 10 on domU

domU>umount /data
domU> mdadm --manage /dev/md10 --stop
domU> {delete md2 and md1 disks from the storage tab under your VM Host in Xen Center}
dom0> {VM Host SR Delete Process} #to clean out 'Local storage md10'
dom0>mdadm --manage /dev/md2 --stop
dom0>mdadm --manage /dev/md1 --stop
dom0>mdadm --manage /dev/md0 --stop
dom0>fdisk /dev/sda #delete partition and write (d w)
dom0>fdisk /dev/sdb #delete partition and write (d w)
dom0>fdisk /dev/sdc #delete partition and write (d w)
dom0>fdisk /dev/sdd #delete partition and write (d w)
dom0>xe sr-create content-type=user device-config:device=/dev/sda host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sda on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/sdb host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sdb on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/sdc host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sdc on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/sdd host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sdd on `cat /etc/hostname`" shared=false type=lvm
domU>mdadm --create /dev/md10 -l10 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvde /dev/xvdf
domU>mdadm --detail --scan >> /etc/mdadm/mdadm.conf 
domU>echo 100000 > /proc/sys/dev/raid/speed_limit_min #I made the resync go fast, which reduced it from 26 hours to about 3 hours
domU>mdadm --grow /dev/md0 --size=max

Creating a Bootable USB Install Thumb drive for XenServer

We have a couple of sites with XenServer VM machines, so part of our redundancy / failure plan is to be able to quickly install / reinstall a XenServer hypervisor.

There are plenty of more involved methods, such as setting up PXE servers, but the quickest / lowest-tech method is to have a USB thumb drive on hand.

We could use one of the plethora of tools for creating a USB thumb drive (unetbootin, USB to ISO, etc.), but they all seem to have problems with the ISO (OS not found, error with install, etc.).

So I found one that works well:

http://rufus.akeo.ie/

It appears he keeps his software up to date. Download it and run it, select your USB drive, then check the box to 'create a bootable disk using ISO Image' and select the image to use from your hard drive. I downloaded the ISO image from

http://xenserver.org/overview-xenserver-open-source-virtualization/download.html

– XenServer Installation ISO

Then just boot from the USB drive and the install should start.

 

Xen Center – import from exported XVA file for restoring – does not create a new VM

Building a backup procedure for Xen Center

This started when defining the procedure for how to backup VMs ‘off site’ in a way that would later allow us to restore them,  should some sort of unrecoverable error occur.
The concept is,  we want to be able to take a VM,  which is currently running and get a backup of some kind which can then be restored at a later point.

First I will explain what I have found Xen Center currently allows. I don't know what the suggested 'Best Practice' for this procedure is; I couldn't find it by searching, so I explored the options available within Xen Center and at the command line.

Exporting (stopping a VM first)

This method requires you to stop a VM first,   which means that it does not work well for VMs which are already in production,  and the method is not a viable “backup” solution for us,  but I explain it here to make the distinction between the different types of exporting.  This method is something that would work well for backing up defined templates to an offsite location,  but would not work well for saving a full running VM in a way that could be restored from an offsite location.

Xen Center allows you to export a VM, which you can later import. First, shut down your VM, right click on it, go to Export and follow the steps in the wizard to put the export on your local drive. You can conveniently export multiple stopped VMs at once. The progress is displayed in the Logs tab for the pool. I told the export process to 'verify' the exported files, which added A LONG time to the process, so be prepared for this.

Once the export is complete, you can move these files wherever you want. To restore, simply right click on your pool, go to Import, select the file from disk and follow the wizard, and the VMs will be started. (I am not sure what happens with MAC address collisions here if the same VM you exported is currently running.)

Exporting a live VM

It seems reasonable that a currently running VM cannot be exported directly to a file, because the changes that occur in a running VM would make the export inconsistent. Here is how we work around this.

In short,  here are the steps

  1. create a snapshot of a VM
  2. export the snapshot to a file offsite using Xen Center (we are backed up)
  3. start the restore by creating a VM martyr by installing a new VM or template (hopefully a small one)
  4. destroy the martyr's existing VDI
  5. import the snapshot from a file (could be on the same server or pool, or a completely new build)
  6. attach the imported VDI as a bootable disk
  7. rename from martyr to your correct name and start the new VM

With more detail:

First take a snapshot of the running VM, then go to the Snapshots tab for the VM and export that snapshot. Follow the wizard and save it to a file.

When it comes time to reimport, there is a little preparation to do. Keep track of these values:

#Find the UUID for the SR that you will be importing the file back to
xe sr-list host=HOSTNAME
>079eb2c5-091d-5c2f-7e84-17705a8158cf
#get a list of all of the CURRENT uuids for the virtual disks on the sr
xe vdi-list sr-uuid=079eb2c5-091d-5c2f-7e84-17705a8158cf|grep uuid
>uuid ( RO) : bb8015b0-0672-45af-aed5-c5308f60b914
>uuid ( RO) : f0b67634-25bc-486d-b38e-0e8294de7df6
>uuid ( RO) : cdc13e40-9ffe-497c-91ff-d426a52aaf2a

Now we import the file. Right click on the host you would like to restore it to and click 'Import'. The import process asks you for a few pieces of information about the restore (host name, network, etc.). Go through the steps and click Finish. The VM will be imported again, with the progress shown in the Logs tab of the host and pool. When complete, we have a virtual disk that is not attached to any VM, which we now need to attach to one.

Here things are a bit more complex. First we create a VM 'martyr'. This is what I call a VM that we create through some other method, solely for the purpose of attaching our reimported snapshot to it. We will take the guts out of whatever VM we create and put the guts from our import into it. On the technical side, we take a VM, disconnect its existing bootable VDI and connect the VDI we just imported. I create the VM using a template or an install (I don't cover that here) and name it martyr_for_import.

#get a list of the latest uuids for the virtual disks on the sr
xe vdi-list sr-uuid=079eb2c5-091d-5c2f-7e84-17705a8158cf|grep uuid
>uuid ( RO) : bb8015b0-0672-45af-aed5-c5308f60b914
>uuid ( RO) : f0b67634-25bc-486d-b38e-0e8294de7df6
>uuid ( RO) : cdc13e40-9ffe-497c-91ff-d426a52aaf2a
>uuid ( RO) : 04a7f80e-e108-4468-9bd3-fada613e9a42
#each time I have done this, the imported uuid is listed last, but I run the list before and after just to make sure. In this case my vdi is: 04a7f80e-e108-4468-9bd3-fada613e9a42
#find the current vbds attached to this vm
xe vbd-list vm-name-label=martyr_for_import
>uuid ( RO) : b0f4cb5e-5285-bbec-13a3-f581c6e6d287
 vm-uuid ( RO): 708b633a-683d-859f-1f1f-bf8495d17fe8
 vm-name-label ( RO): martyr_for_import
 vdi-uuid ( RO): a36d6025-039b-4f6e-9d19-f7eb7d1d4c46
 empty ( RO): false
 device ( RO): xvdd
uuid ( RO) : eb12fdac-c36c-78fa-8eb6-67fa3a5a1d85
 vm-uuid ( RO): 708b633a-683d-859f-1f1f-bf8495d17fe8
 vm-name-label ( RO): martyr_for_import
 vdi-uuid ( RO): cdc13e40-9ffe-497c-91ff-d426a52aaf2a
 empty ( RO): false
 device ( RO): xvda
#shut down the vm
xe vm-shutdown uuid=708b633a-683d-859f-1f1f-bf8495d17fe8
#destroy the vdi virtual disk that is attached to our martyr as the current xvda vbd
xe vdi-destroy uuid=cdc13e40-9ffe-497c-91ff-d426a52aaf2a
#you can verify that it has been destroyed and detached by running xe vbd-list vm-name-label=martyr_for_import again
#now attach our snapshot vdi as a new bootable vbd device as xvda again (note the bootable=true and type=Disk)
xe vbd-create vm-uuid=708b633a-683d-859f-1f1f-bf8495d17fe8 vdi-uuid=04a7f80e-e108-4468-9bd3-fada613e9a42 bootable=true device=xvda type=Disk
#okay we are attached (you can verify by running xe vbd-list vm-name-label=martyr_for_import again)
#go ahead and start the vm through Xen Center or run this command
xe vm-start uuid=708b633a-683d-859f-1f1f-bf8495d17fe8
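
Comparing the two vdi-list outputs by eye works, but the before/after check is easy to script. The Python 3 sketch below is an illustration only (`new_uuids` is a hypothetical helper): it pulls every uuid out of both listings and returns those that appear only in the second one, in listing order.

```python
import re

UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

def new_uuids(before, after):
    """Return uuids present in `after` but not in `before`."""
    seen = set(UUID_RE.findall(before))
    return [u for u in UUID_RE.findall(after) if u not in seen]

# The two listings from the walkthrough above.
BEFORE = """\
uuid ( RO) : bb8015b0-0672-45af-aed5-c5308f60b914
uuid ( RO) : f0b67634-25bc-486d-b38e-0e8294de7df6
uuid ( RO) : cdc13e40-9ffe-497c-91ff-d426a52aaf2a
"""
AFTER = BEFORE + "uuid ( RO) : 04a7f80e-e108-4468-9bd3-fada613e9a42\n"

print(new_uuids(BEFORE, AFTER))
```

If more than one uuid comes back, something else was created on the SR between the two listings, and the imported VDI should be identified some other way before anything is destroyed.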

			

Deleting Orphaned Disks in Citrix XenServer

I found that while building my virtual environment with templates ready to deploy I created quite a few templates and snapshots.

I did a pretty good job of deleting the extras when I didn’t need them any more,  but in some cases when deleting a VM I no longer needed,  I forgot to check the box to delete the snapshots that went WITH that VM.

I could see under the dom0 host -> Storage tab that space was still allocated to the snapshots (Usage was higher than the combined visible usage of servers and templates, and Virtual allocation was way higher than it should be).

But there was no place that listed the snapshots taking up that space. When I looked for a way to delete these orphaned snapshots (and the disk snapshots that went with them), I found some cumbersome command line methods.

Like this old method that someone used - http://blog.appsense.com/2009/11/deleting-orphaned-disks-in-citrix-xenserver-5-5/

After a bit more digging, I found that by clicking on the Local Storage under the dom0 host and then clicking on the 'Storage' tab there, I would see a list of all of the storage elements that are allocated. Some of them were snapshots without a name. It turns out those were the orphaned ones. If a disk was allocated to a live server, the delete button would not be enabled, so I just deleted the old ones that could be deleted.

 

 

Resizing a VDI on XenServer using XenCenter and Commandline

Occasionally I need to change the size of a disk, perhaps to allocate more space to the OS.

To do this,  on the host I unmount the disk

umount /data

Click on the domU server in XenCenter and click on the Storage tab, select the storage item I want to resize and click 'Detach'.
Then, at the command line on one of the dom0 hosts:

 xe sr-list host=dom0hostname

Write down the uuid of the SR that the virtual disk is in (we will use XXXXX-XXXXX-XXXX).

 xe vdi-list sr-uuid=XXXXX-XXXXX-XXXX

Write down the uuid of the disk that you want to resize (we will use YYYY-YYYY-YYYYY).
Also note the virtual-size parameter that is shown. VDIs cannot be shrunk, so you will need a disk size LARGER than the size displayed here.

 xe vdi-resize uuid=YYYY-YYYY-YYYYY disk-size=9887654
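
Because a VDI can only grow, it is worth checking the new size against the current virtual-size before running the command. The Python 3 sketch below builds the command line and refuses to shrink. `vdi_resize_command` is a hypothetical helper, and it assumes disk-size is given in bytes, the same unit in which virtual-size is reported.

```python
GIB = 1024 ** 3

def vdi_resize_command(vdi_uuid, current_bytes, new_gib):
    """Return the xe vdi-resize argument list, or raise if the disk would not grow."""
    new_bytes = new_gib * GIB
    if new_bytes <= current_bytes:
        raise ValueError("VDIs cannot be shrunk: the new size must exceed the current virtual-size")
    return ["xe", "vdi-resize", "uuid=" + vdi_uuid, "disk-size=" + str(new_bytes)]

# Growing a 10 GiB disk to 20 GiB, using the placeholder uuid from above.
print(" ".join(vdi_resize_command("YYYY-YYYY-YYYYY", 10 * GIB, 20)))
```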


XenCenter – live migrating a vm in a pool to another host

When migrating a VM from one host to another host in the pool, I found it to be very easy at first.

In fact, it was one of the first things I tested after setting up my first VM on a host in a pool. 4 steps:

  1. Simply right click on the vm in XenCenter ->
  2. Migrate to Server ->
  3. Select from your available servers.
  4. Follow the wizard

In building some servers, I wanted to get some base templates which are 'aware' of the network I am putting together. This would involve adding some packages and configuration, taking a snapshot and then turning that snapshot into a template that I could easily restart next time I wanted a similar server. Then, when I went to migrate one of the servers into its final resting place, I found an interesting error.

  1. Right click on the vm in XenCenter ->
  2. Migrate to Server ->
  3. All servers listed – Cannot see required storage

I found this odd, since I was sure that the pool could see all of the required storage (in fact I was able to start a new VM on the storage available, so I knew the storage was there).

I soon found out, though, that the issue is that the live migrate feature just doesn't work when there is more than one snapshot. I will have to rethink how I manage my snapshots, but basically I found that by removing old snapshots down to where the VM only had one snapshot (I left one that was a couple of days old), I was able to follow the original 4 steps.

 

Note: the way I found out about the limitation on the number of snapshots was by:

  1. Right click on the vm in XenCenter ->
  2. Migrate to Server ->
  3. The available servers are all grayed out, so select “Migrate VM Wizard”
  4. In the wizard that comes up, select the current pool for “Destination”
  5. This populates a list of VMs, each with a Home Server drop-down for choosing the server in the destination pool to migrate the VM to (my understanding is that this will move the VM to that server AND make that server the “Home Server” for that VM)
  6. When you attempt to select from the drop-down list under Home Server, you see a message “You attempted to migrate a VM with more than one snapshot”

Using that information I removed all but one snapshot and was able to migrate.  I am sure there is some logical reason behind snapshot / migration limitation but for now I will work around it and come up with some other way to handle my snapshots than just leaving them under the snapshot tab of the server.
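
To spot VMs that would hit this limit before trying to migrate them, the snapshots could be counted from the command line. This is only a sketch: count_csv and vm_snapshot_count are illustrative names, and it assumes that `xe snapshot-list snapshot-of=<vm uuid> --minimal` prints the snapshot uuids as a comma-separated list.

```shell
# Count the items in a comma-separated list (an empty string counts as 0).
count_csv() {
    if [ -z "$1" ]; then
        echo 0
    else
        printf '%s\n' "$1" | tr ',' '\n' | grep -c .
    fi
}

# Hypothetical helper: number of snapshots that belong to the given VM uuid.
vm_snapshot_count() {
    count_csv "$(xe snapshot-list snapshot-of="$1" --minimal)"
}
```

Running `vm_snapshot_count <vm uuid>` on a pool member should then print a number; anything above 1 would, going by the behavior described above, block the live migration.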

 

Notes on Recovering from a XenServer Pool failure

For my pool I have 8 XenServers (plot1, plot2, plot3, plot4, plot5, plot6, plot7 and plot8)

At the start of my tests,  plot1 is the pool master.

If the pool master goes down, another server must take over as master.
To simulate this, I just ran ‘shut down’ on the master host.

A large issue here is that all of the slaves in the pool just disabled their management interfaces, so they could not be connected to using XenCenter (something I did not expect). So I connected to plot2 via SSH.

Once connected, I verified the host’s state:

xe host-is-in-emergency-mode

The server said FALSE!?! The server didn’t even know that the pool was in trouble? So I ran pool-list:

xe pool-list

The command took a long time, so I figured I would stop it and put a time command in front of it to find out how long it really took:

time xe pool-list

Turns out, when I shut down the pool master, I am shutting down the pool! I am not simulating an error at all. Somehow the pool master notified the slaves that it was gracefully shutting down, telling the slaves ‘don’t worry, I will be all right’. The commands above never returned, so I just told plot2 to take over as master to see how we could recover from this situation.

xe pool-emergency-transition-to-master

At this point on plot2, the pool was restored, but we still could not connect to the management interfaces of any of the other plots in the pool. But XenCenter WAS able to connect to plot2, and it synchronized the entire pool, showing all of the other hosts (including plot1, which was the master previously) as down.

 

The other hosts in the pool are still running all of their services (SSH, apache or whatever); they just cannot communicate about the pool, so I have to ‘recover’ them back into the pool.
On the new master I run:

xe pool-recover-slaves

This brings the slaves back into the pool so they are visible within XenCenter again. plot1, the original master, is still turned off, but visible as turned off in XenCenter, so I right click on it in XenCenter and select Power On. It begins booting and I hold my breath to see if there are any master-master conflicts, since the shut down host thought it was the all powerful one when it shut down.

Once it comes up (3 minutes later) I find that plot1 gracefully fell into place as a slave. So the moral of this story:
!Don’t shut down the pool master. If you do, you will lose XenCenter access to all of the hosts in the pool, so you MUST either 1) bring it back up immediately or 2) SSH to the console of another host and run #xe pool-emergency-transition-to-master and then #xe pool-recover-slaves. This will restore your pool minus the host that was originally the master. Reconnect XenCenter to the new pool master, then use XenCenter to power on the host that was the pool master.

!Best Practice: before stopping a host that is currently the pool master, connect to another host in the pool and run #xe pool-emergency-transition-to-master and then #xe pool-recover-slaves prior to shutting down the host.
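
The two recovery commands can be wrapped so that the second one only runs when the first one succeeds. This is a rough sketch, not a hardened script: the function name recover_pool_from_this_host is made up, and it is meant to be run over SSH on the host that should become the new master.

```shell
# Promote this host to pool master, then pull the other hosts back into the pool.
# Stops early if the promotion itself fails.
recover_pool_from_this_host() {
    xe pool-emergency-transition-to-master || return 1
    xe pool-recover-slaves
}
```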

Well, so now that we know shutting down the master does not simulate a failure, we will have to use another ‘onsite’ method.

!Simulation2:
On plot2 (current pool master) I disconnected the ethernet cables.

The XenCenter console can no longer connect to the pool again, so I have to use SSH. This time I will connect to plot3 and find out what it thinks of the pool issue.
xe host-is-in-emergency-mode

This command returns false; somehow the host thinks everything is okay. I run xe pool-list and xe host-list, both of which never return. Come on host, shouldn’t you recognize a failure here?

I ping the IP of the pool master and the ping fails, but xe host-is-in-emergency-mode still returns false. For some reason, this host just does not think it has a problem.

So, I guess I just can’t trust xe host-is-in-emergency-mode.

Even after 2 hours, xe host-is-in-emergency-mode still returns false.

So for monitoring, I will have to come up with some other method, but the rules for how to recover are the same:

xe pool-emergency-transition-to-master
xe pool-recover-slaves

This brought the pool up again on plot3 with plot3 as the new master.
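
On the monitoring question raised above, one possible approach is to skip xe entirely: read the master's address from the host's pool configuration file and ping it. This is only a sketch under assumptions. It assumes the file is /etc/xensource/pool.conf and contains either `master` or `slave:<address>`, that `ping -c 1 -W 2` is available on the host, and the function names are made up for illustration.

```shell
# Print the master's address from a pool.conf-style file,
# or "self" when this host is the master itself.
pool_master_address() {
    conf=${1:-/etc/xensource/pool.conf}
    role=$(cat "$conf") || return 1
    case "$role" in
        master)  echo self ;;
        slave:*) echo "${role#slave:}" ;;
        *)       return 1 ;;
    esac
}

# Succeeds when the master answers a single ping (or when this host is the master).
master_reachable() {
    addr=$(pool_master_address "$1") || return 2
    [ "$addr" = self ] && return 0
    ping -c 1 -W 2 "$addr" >/dev/null 2>&1
}
```

A cron job could then run something like `master_reachable || echo "pool master unreachable"` and send the output wherever alerts go. A ping only shows that the master's network is up, not that its toolstack is healthy, so this is a coarse check.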
Now the trick is to bring plot2 back on. In this case, plot2 never ended up going offline, so it is still running without the ethernet cable plugged in. When I plug it back in, I may end up with some master-master conflicts ….. here goes!

After reconnecting the ethernet cable to plot2 (the old master):
– plot3 did not automatically recognize that the host is back up. In fact, XenCenter still shows it in red as though it is shut down. I right clicked on it and told it to power on, but it didn’t do anything but wait.
– plot2 did not make any changes. It appears they both happily think they are the masters.

To test how the pool reacted, I attempted to disable one of the slaves from plot2 with xe host-disable uuid=xxxxxxxx (my thought is that plot2 is incorrectly considered down and not connected, so the disable should not be let through).
It turns out that plot2 could not disable the host, because the host ‘could not be contacted’. This is good because it makes sure that none of the slaves are confused. In fact, plot3 is not confused either; it is only plot2, the master that went missing, that is confused (I have seen in the Xen docs that they call this a barrier of some sort).
I tried to connect to plot2 with XenCenter, but XenCenter smartly told me that I cannot connect because it appears that the server was created as a backup from my pool and that the dangerous operation is not allowed. (I will try to trick XenCenter into connecting by removing references to my pool from it and then trying again.)
AH! It let me! That means XenCenter is smart enough to recognize when you are attempting to connect to both split-brain masters of a pool at once and prevents it, but it will connect to either one on its own.
To dig further into this issue, I decided to further ‘break’ the pool by splitting the two masters further with different definitions of the pool. On the plot2 master I used XenCenter to destroy the disconnected host plot7. XenCenter let me do this. Now when I go to reconnect, I will be attempting to pull the orphaned master, with a different definition of the pool, back into the pool.

Now the trick is to determine the best way to bring the old master, plot2, back into the current pool as a slave. We need to tell the new master to recover slaves.

xe pool-recover-slaves

That pulls plot2 back in as a slave, and GREAT, it did not use any of the pool definition from plot2. plot3 properly asserted its role as the true pool master.
I can imagine a bad scenario happening if I told the “OLD” master to recover slaves. I imagine that either the split would have gotten much worse, or (if the barrier was really working) the pool would have told the old master that it was not possible.

Other methods that I did not use, which may have worked but were not tried (they don’t feel right):
– from the orphaned master: xe pool-join force=1 ….. server username password (I doubt this would work since it is already a member of a pool)
– from the orphaned master: xe pool-emergency-reset-master master-address= ip of new master (this one I am not sure of; it would be worth a shot if for some reason pool-recover-slaves is not working)

The thing that you NEVER want to do while a master or any other server is orphaned or down is remove the server from the pool. What can happen in this situation is that the server that is down still thinks it is in the pool when it comes back up, but the pool does not know about it. We get into a race condition that I have only ever found one way out of. The orphaned server thinks it is in a pool, but cannot get out of the pool without connecting to the master. The master will not recognize the orphaned server, so the server can’t do anything. (The way out of this was to promote the orphaned server to master, then remove all of the hosts in the pool, then delete all of the stored resources and PBDs, and then join the pool anew. This sucked because everything on the server was destroyed, so I could have just reinstalled XenServer.)

I have heard of, but not attempted, reinstalling XenServer without selecting the disks:
http://support.citrix.com/article/CTX120962

Promoting a XenServer host to pool master

There are a couple of problems that can be found when attempting to promote a XenServer host to pool master.

  • If you are on host1 attempting to designate host2 through xsconsole, you are prompted to select from all of the other hostnames in the pool. However, xsconsole appears to use the hostname only, and if your server is not configured to resolve the other host by its hostname, xsconsole will show an error that it cannot connect to the host.
    This could be resolved by making /etc/hosts entries for each host, but that is overkill when you need to make a quick pool master change.
  • The next idea is, by convention, to only designate a host as pool master FROM that host, and to make sure that each host can refer to itself using its hostname.
    Not too much of a problem here; it seems pretty reasonable to have the current hostname in the /etc/hosts file. However, the default XenServer install does not do this.

 

So the best way I have found is to have a convention to only promote a pool master FROM that host, using the command line method with the UUID.

# xe pool-designate-new-master host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
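
The one-liner works by finding the name-label line that contains this host's name in the xe host-list output and taking the uuid line printed just before it. Because grep matches substrings, a host named plot1 would also match plot10. A slightly stricter variant is sketched below. The function name host_uuid_from_list is made up, and it assumes the usual host-list layout (a uuid line followed by a name-label line) and host names without spaces.

```shell
# Read `xe host-list` output on stdin and print the uuid of the host whose
# name-label is exactly the given name.
host_uuid_from_list() {
    awk -v name="$1" '
        /^uuid/                     { uuid = $NF }        # remember the latest uuid
        /name-label/ && $NF == name { print uuid; exit }  # exact name match only
    '
}
```

It could be used as `xe pool-designate-new-master host-uuid="$(xe host-list | host_uuid_from_list "$(cat /etc/hostname)")"`.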

A couple of problems can still prevent a new pool master from being designated.

  • Not all hosts in the pool are available. For example, one of the hosts in the pool was down, and I received an error because host5 was not available. I solved this by using XenCenter to destroy the unavailable hosts. That did not seem like a good idea, since I wanted them to come up at some time in the future and rejoin the pool, but I guess that has to be done manually.
    • You attempted an operation which involves a host which could not be contacted.
      host: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (HOST5)
  • Another problem that I have not encountered yet, but that is still possible, is being unable to designate a new master while you have servers down which the pool thinks may still be running (sounds like some sort of split brain ‘protection’ or something).

Add a second disk as Local Disk Storage to an existing XenServer

On each of the XenServers in the pool I created,  I have at least 2 partitions that I wanted to be available to my VMs.

For a little while I was just running individual commands to figure it out each time and finally I decided to come up with a single command that I could copy and paste

I have it below so I can always come to this blog post and find it

First I find out which partition I want to add

#cat /proc/partitions

I just have to replace the /dev/sdb in the command below with the actual partition I want to add, and I might need to change the “name-label” in case I already have a Local storage 2. Otherwise, the system figures out what the current hostname is, gets the uuid, and names the storage appropriately. This works in a pool where host-list returns more than one host.

CAUTION:  when cutting and pasting from below,  be careful to make sure that the quotes match exactly,   I have run into situations where the Double Quotes(“) around the name-label parameter and the single quotes (‘) around the awk parameter,  show as question marks (?) when pasted into the XenCenter console.

#xe sr-create content-type=user device-config:device=/dev/sdb host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage 2 on `cat /etc/hostname`" shared=false type=lvm
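
Before creating the SR, it may be worth confirming that the device name really shows up in /proc/partitions, since a typo in the device path is easy to make when pasting. The helper below is a sketch with a made-up name (partition_listed); it assumes the usual four-column /proc/partitions layout with the device name in the last column.

```shell
# Succeeds when the given device (with or without the /dev/ prefix) appears
# in the name column of /proc/partitions, or of the file given as $2.
partition_listed() {
    dev=${1#/dev/}
    table=${2:-/proc/partitions}
    awk -v dev="$dev" '$4 == dev { found = 1 } END { exit !found }' "$table"
}
```

For example, `partition_listed /dev/sdb || echo "no such partition"` could be run just before the sr-create command.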

 

Adding a new XenServer to my XenServer Pool – Homogeneity required

In order to add an additional XenServer to an existing XenServer Pool, the servers must be homogeneous, meaning that all of the same updates must be applied.

#blogpostinnoteform #couldbecleanedup

I have not had any luck applying updates using the XenCenter software ‘Apply pending updates’.

Although, XenCenter does a good job of showing which servers have updates to apply.

Below are my notes on how to find and apply patches so that a XenServer can have the same updates / patches applied as the pool and then be added.

On any server in the pool:

xe patch-list

This will list out several patches. The confusing thing for me was knowing which patches are included in a Service Pack. Since service packs seem to roll up all of the patches into them, it seems that patches which are applied as part of the service pack show a size of 1.
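
To see which patches the new server still lacks, the patch names from the pool and from the new server can be compared. The sketch below uses a made-up helper name (missing_items) and assumes the names are collected on each side with `xe patch-list params=name-label --minimal`, which should print them as one comma-separated line. It does not know about service pack roll-ups, so patches that a service pack already includes may still be reported.

```shell
# Print every item of the first comma-separated list that is absent from the second.
missing_items() {
    for item in $(printf '%s' "$1" | tr ',' ' '); do
        case ",$2," in
            *",$item,"*) ;;              # already present on the new host
            *) echo "$item" ;;
        esac
    done
}
```

Usage would be along the lines of `missing_items "<names from the pool>" "<names from the new server>"`, pasting in the two lists gathered on each side.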

Search for the downloads on support.citrix.com.
On the server to add:

wget http://downloadns.citrix.com.edgesuite.net/8707/XS62ESP1.zip # to get the patch
unzip XS62ESP1.zip # to open the patch
xe patch-upload file-name=XS62ESP1.xsupdate

This will output the uuid of the patch; you need this (you can also get it from running #xe patch-list).
You also need the host-uuid, which you can get from #xe host-list, but since the host is not in a pool yet, you should be able to just use command line tab completion (xe is smart like that).

xe patch-apply uuid=0850b186-4d47-11e3-a720-001b2151a503 host-uuid=93c98aa5-935b-41a4-9b79-789fa68db354

(A technique that has worked for me is to copy this text, paste it all at once, and then press ‘TAB’, which auto-completes the host-uuid. So I can paste it all at once, press tab and enter, and leave the system to its work.)

wget http://downloadns.citrix.com.edgesuite.net/8707/XS62ESP1.zip
 unzip XS62ESP1.zip # to open the patch
 xe patch-upload file-name=XS62ESP1.xsupdate
 xe patch-apply uuid=0850b186-4d47-11e3-a720-001b2151a503 host-uuid=
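
Since patch-upload prints the uuid of the patch, the upload and apply steps can also be chained without copying the uuid by hand. This is a minimal sketch with a made-up function name (upload_and_apply); it assumes the only thing patch-upload prints on success is the patch uuid, and that the patch has not been uploaded to this host before.

```shell
# Upload an .xsupdate file and apply the resulting patch to the given host.
upload_and_apply() {
    update_file=$1
    target_host_uuid=$2
    patch_uuid=$(xe patch-upload file-name="$update_file") || return 1
    xe patch-apply uuid="$patch_uuid" host-uuid="$target_host_uuid"
}
```

On a standalone server, `upload_and_apply XS62ESP1.xsupdate "$(xe host-list --minimal)"` should work, given that host-list returns a single uuid while the host is not in a pool yet.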

 
