Notes on Recovering from a XenServer Pool failure

For my pool I have 8 XenServers (plot1, plot2, plot3, plot4, plot5, plot6, plot7 and plot8).

At the start of my tests, plot1 is the pool master.

If the pool master goes down, another server must take over as master.
To simulate this, I just ran ‘shut down’ on the master host.

A large issue here is that all of the slaves in the pool disabled their management interfaces, so they could not be reached with XenCenter (something I did not expect). So I connected to plot2 via SSH.

Once connected to this other server in the pool, I verified its state:

xe host-is-in-emergency-mode

The server said false!? It didn’t even know that the pool was in trouble. So I ran pool-list:

xe pool-list

The command took a long time, so I figured I would stop it and put a time command in front of it to find out how long it really took:

time xe pool-list

It turns out that when I shut down the pool master, I am shutting down the pool; I am not simulating an error at all. Somehow the pool master notified the slaves that it was shutting down gracefully, telling them ‘don’t worry, I will be all right’. The commands above never returned, so I just told plot2 to take over as master to see how we could recover from this situation:

xe pool-emergency-transition-to-master

At this point the pool was restored on plot2, but we still could not connect to the management interfaces of any of the other hosts in the pool. XenCenter WAS able to connect to plot2, however, and it synchronized the entire pool, showing all of the other hosts (including plot1, the previous master) as down.

The other hosts in the pool are still running all of their services (SSH, Apache or whatever); they just cannot communicate about the pool, so I have to ‘recover’ them back into the pool.
On the new master I run:

xe pool-recover-slaves

This brings the slaves back into the pool so they are visible within XenCenter again. plot1, the original master, is still powered off but shows as such in XenCenter, so I right-click on it in XenCenter and choose Power On. It begins booting and I hold my breath to see if there are any master-master conflicts, since the shut-down host thought it was the all-powerful one when it went down.

Once it comes up (3 minutes later) I find that plot1 gracefully fell into place as a slave. So the moral of this story:
!Don’t shut down the pool master. If you do, you will lose XenCenter access to all of the hosts in the pool, so you MUST either 1) bring it back up immediately or 2) SSH to the console of another host and run #xe pool-emergency-transition-to-master and then #xe pool-recover-slaves. This will restore your pool minus the host that was originally the master. Reconnect XenCenter to the new pool master, then use XenCenter to power on the host that was the pool master.

!Best Practice: before stopping a host that is currently the pool master, connect to another host in the pool and run #xe pool-emergency-transition-to-master and then #xe pool-recover-slaves prior to shutting down the host. (While the master is still healthy, #xe pool-designate-new-master, covered further down, is the orderly way to hand over the role.)
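
The two recovery commands from the moral above can be wrapped in one small function. This is only a sketch: promote_and_recover is a name made up for illustration, and it assumes it is run over SSH on the surviving host that should become the new master.

```shell
# Hypothetical helper: run on the host that should take over as master.
promote_and_recover() {
    # Promote this host; stop here if the promotion is refused.
    xe pool-emergency-transition-to-master || return 1
    # Tell the remaining slaves to follow the new master.
    xe pool-recover-slaves
}
```

Stopping after a failed promotion keeps pool-recover-slaves from running on a host that is not actually the master.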

Well, now that we know shutting down the master does not simulate a failure, we will have to use another ‘onsite’ method.

!Simulation2:
On plot2 (the current pool master) I disconnected the Ethernet cables.

XenCenter can no longer connect to the pool, so again I have to use SSH. This time I will connect to plot3 and find out what it thinks of the pool issue.
xe host-is-in-emergency-mode

This command returns false; somehow the host thinks everything is okay. I run xe pool-list and xe host-list, both of which never return. Come on, host, shouldn’t you recognize a failure here?

I ping the IP of the pool master and the ping fails, but xe host-is-in-emergency-mode still returns false. For some reason this host just does not think it has a problem.

So I guess I just can’t trust xe host-is-in-emergency-mode.

Even after 2 hours, xe host-is-in-emergency-mode still returns false.

So for monitoring I will have to come up with some other method, but the rules for how to recover are the same:

xe pool-emergency-transition-to-master
xe pool-recover-slaves

This brought the pool up again on plot3 with plot3 as the new master.
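
Coming back to the monitoring problem: since xe host-is-in-emergency-mode kept answering false, a check probably has to test the master directly. The sketch below is one possible approach, not something that was tested on this pool. It assumes the coreutils timeout command exists on the host (older dom0 builds may not ship it) and that a slave records its master in /etc/xensource/pool.conf as slave:<address>. Both function names are made up.

```shell
# Print the master address recorded in a pool.conf-style file.
# Prints nothing and returns 1 if the file does not name a master
# (for example because this host is itself the master).
master_address() {
    conf=${1:-/etc/xensource/pool.conf}
    case "$(cat "$conf")" in
        slave:*) sed 's/^slave://' "$conf" ;;
        *)       return 1 ;;
    esac
}

# Succeed only if the given command finishes within the deadline (seconds).
responds_within() {
    deadline=$1
    shift
    timeout "$deadline" "$@" >/dev/null 2>&1
}

# Possible use from cron on a slave:
#   addr=$(master_address) && ! responds_within 15 xe pool-list \
#       && echo "pool master $addr is not answering"
```

The deadline matters because, as seen above, xe pool-list on a slave simply hangs when the master is gone.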
Now the trick is to bring plot2 back. In this case plot2 never went offline; it is still running without the Ethernet cable plugged in, so when I plug it back in I may end up with some master-master conflicts... here goes!

After reconnecting the Ethernet cable to plot2 (the old master):
– plot3 did not automatically recognize that the host is back up; in fact, XenCenter still shows it red as though it is shut down. I right-clicked on it and told it to power on, but it didn’t do anything but wait.
– plot2 did not make any changes. It appears they both happily think they are the master.

To test how the pool reacted, I attempted to disable one of the slaves from plot2 with xe host-disable uuid=xxxxxxxx (my thought was that plot2 is incorrectly considered down and not connected, so the disable should not be let through).
It turns out that plot2 could not disable the host because the host ‘could not be contacted’. This is good because it means that none of the slaves are confused. In fact, plot3 is not confused either; only plot2, the master that went missing, is confused (I have seen the Xen docs call this a barrier of some sort).
I tried to connect to plot2 with XenCenter, but XenCenter smartly told me that I cannot connect because it appears that the server was created as a backup from my pool and that the dangerous operation is not allowed. (I will try to trick XenCenter into connecting by removing references to my pool from it and then trying again.)
AH! It let me! That means that XenCenter is smart enough to recognize when you are attempting to make two separate connections to the split-brain masters of a pool, and prevents it.
To dig further into this issue, I decided to ‘break’ the pool further by giving the two masters different definitions of the pool. On the plot2 master I used XenCenter to destroy the disconnected host plot7. XenCenter let me do this. Now when I go to reconnect, I will be attempting to pull the orphaned master, with its different definition of the pool, back into the pool.

Now the trick is to determine the best way to bring plot2, the old master, back into the current pool as a slave. We need to tell the new master to recover slaves:

xe pool-recover-slaves

That pulls plot2 back in as a slave, and GREAT, it did not use any of the pool definition from plot2. plot3 properly asserted its role as the true pool master.
I can imagine a bad scenario happening if I had told the “OLD” master to recover slaves. I imagine that either the split would have gotten much worse, or (if the barrier was really working) the pool would have told the old master that it was not possible.
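
After a recovery like this it is worth confirming which host the pool now regards as master. A small sketch: it uses xe pool-list params=master --minimal to get the master's UUID and xe host-param-get to turn that into a name. pool_master_name is a made-up helper name.

```shell
# Hypothetical helper: print the name-label of the pool's current master.
pool_master_name() {
    # --minimal prints just the value, here the master host's UUID.
    master_uuid=$(xe pool-list params=master --minimal) || return 1
    xe host-param-get uuid="$master_uuid" param-name=name-label
}
```

In a split-brain situation each side answers from its own copy of the pool database, so the result is only meaningful on a host known to be part of the surviving pool.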

Other methods that I did not try but which may have worked (they don’t feel right):
– from the orphaned master: xe pool-join force=1 ….. server username password (I doubt this would work since it is already a member of a pool)
– from the orphaned master: xe pool-emergency-reset-master master-address=<IP of new master> (this one I am not sure of; it would be worth a shot if for some reason pool-recover-slaves is not working)

The thing that you NEVER want to do while a master or any other server is orphaned or down is remove the server from the pool. What can happen in this situation is that the server that is down still thinks it is in the pool when it comes back up, but the pool does not know about it. We get into a deadlock that I have only ever found one way out of. The orphaned server thinks it is in a pool, but cannot get out of the pool without connecting to the master. The master will not recognize the orphaned server, so the server can’t do anything. (The way out of this was to promote the orphaned server to master, then remove all of the hosts in the pool, then delete all of the stored resources and PBDs, and then join the pool anew. This sucked because everything on the server was destroyed, so I could have just reinstalled XenServer.)

I have heard of, but not attempted, reinstalling XenServer without selecting the disks:
http://support.citrix.com/article/CTX120962

Promoting a XenServer host to pool master

There are a couple of problems that can be found when attempting to promote a XenServer host to pool master.

  • If you are on host1 attempting to designate host2 through xsconsole, you are prompted to select from all of the other hostnames in the pool. However, xsconsole appears to use the hostname only, and if your server cannot resolve the other host by its hostname, xsconsole shows an error that it cannot connect to the host.
    This could be resolved by making /etc/hosts entries for each host, but that is overkill when you need to make a quick pool master change.
  • The next idea is, by convention, to only designate a host as pool master FROM that host, and to make sure that each host can refer to itself using its hostname.
    Not too much of a problem here; it seems pretty reasonable to have the current hostname in the /etc/hosts file. However, the default XenServer install does not do this.

So the best way I have found is to have a convention to only promote a pool master FROM that host, using the command-line method with the UUID:

# xe pool-designate-new-master host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
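
An alternative sketch for the UUID lookup, in case the grep pipeline feels fragile: grep -f /etc/hostname matches substrings, so a host named plot1 would also match plot10. xe host-list accepts a name-label= filter, and --minimal prints only the UUID. local_host_uuid and designate_self are hypothetical helper names, and the sketch assumes the host's name-label in the pool equals the output of hostname.

```shell
# Hypothetical helper: UUID of the host this shell is running on.
local_host_uuid() {
    xe host-list name-label="$(hostname)" --minimal
}

# Hypothetical helper: make the local host the new pool master.
designate_self() {
    uuid=$(local_host_uuid)
    # An empty result means no host in the pool carries this name-label.
    [ -n "$uuid" ] || { echo "no host named $(hostname) in pool" >&2; return 1; }
    xe pool-designate-new-master host-uuid="$uuid"
}
```

Checking for an empty UUID first avoids handing xe a blank host-uuid= argument.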

A couple of problems can still prevent a new pool master from being designated.

  • Not all hosts in the pool are available. For example, one of the hosts in the pool was down, and I received an error because host5 was not available. I solved this by using XenCenter to destroy the unavailable hosts. That did not seem like a good idea since I wanted them to come up at some time in the future and rejoin the pool, but I guess that has to be done manually.
    • You attempted an operation which involves a host which could not be contacted.
      host: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (HOST5)
  • Another problem that I have not encountered yet, but which is still possible, is being unable to designate a new master while servers are down which the pool thinks may still be running (sounds like some sort of split-brain ‘protection’).
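
Since an unreachable host blocks the change, it may help to list the hosts the pool considers down before trying. A hedged sketch: it relies on the host-metrics-live field that xe host-list reports for each host and assumes that field can be used as a filter; hosts_not_live is a made-up name.

```shell
# Hypothetical helper: print the UUIDs of hosts the pool database marks
# as not live, one per line (no output if every host is live).
hosts_not_live() {
    # --minimal returns a comma-separated list; split it into lines
    # and drop the empty line produced by an empty list.
    xe host-list host-metrics-live=false --minimal | tr ',' '\n' | sed '/^$/d'
}
```

The flag reflects what the pool believes, so a host that has only just dropped off the network may still be reported as live.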

Add a second disk as Local Disk Storage to an existing XenServer

On each of the XenServers in the pool I created, I have at least 2 partitions that I want to be available to my VMs.

For a little while I was just running individual commands to figure it out each time, and finally I decided to come up with a single command that I could copy and paste.

I have it below so I can always come to this blog post and find it.

First I find out which partition I want to add:

#cat /proc/partitions

I just have to replace the /dev/sdb in the command below with the actual partition I want to add, and I might need to change the name-label in case I already have a Local storage 2. Otherwise, the command figures out the current hostname, gets the host's UUID and names the storage appropriately. This works in a pool where host-list returns more than one host.

CAUTION: when cutting and pasting from below, make sure that the quotes match exactly. I have run into situations where the double quotes (") around the name-label parameter and the single quotes (') around the awk parameter show up as question marks (?) when pasted into the XenCenter console.

#xe sr-create content-type=user device-config:device=/dev/sdb host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage 2 on `cat /etc/hostname`" shared=false type=lvm
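
To sidestep the quoting problem when pasting, the same command can live in a small function where the device and label are arguments. A sketch only: add_local_sr is a hypothetical name, and it uses the name-label= filter of xe host-list instead of the grep pipeline, assuming the host's name-label equals its hostname. Note that creating an lvm SR initialises the device, so anything already on it is lost.

```shell
# Hypothetical helper: create a local LVM SR on the given device.
# Usage: add_local_sr /dev/sdb ["optional name-label"]
add_local_sr() {
    device=$1
    label=${2:-"Local storage 2 on $(hostname)"}
    # Look up this host's UUID by its name-label.
    host_uuid=$(xe host-list name-label="$(hostname)" --minimal)
    [ -n "$host_uuid" ] || return 1
    xe sr-create content-type=user type=lvm shared=false \
        device-config:device="$device" \
        host-uuid="$host_uuid" \
        name-label="$label"
}
```

Called as add_local_sr /dev/sdb, this builds the same sr-create call as the one-liner above.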

 

Adding a new XenServer to my XenServer Pool – Homogeneity required

In order to add an additional XenServer to an existing XenServer pool, the servers must be homogeneous, meaning that all of the same updates must be applied.

#blogpostinnoteform #couldbecleanedup

I have not had any luck applying updates using the XenCenter ‘Apply pending updates’ feature, although XenCenter does a good job of showing which servers have updates to apply.

Below are my notes on how to find and apply patches so that a XenServer can have the same updates / patches applied as the pool and then be added.

On any server in the pool:

xe patch-list

This will list several patches. The confusing thing for me was knowing which patches are included in a Service Pack. Since service packs seem to roll up all of the earlier patches, it seems that patches which were applied as part of the service pack show a size of 1.
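
To see which patches a new host still lacks, the name lists from both sides can be compared. A minimal sketch, assuming the lists come from xe patch-list params=name-label --minimal (comma-separated), run once on a pool member and once on the new host, and that patch names contain no spaces. missing_patches is a made-up name.

```shell
# Hypothetical helper: print names that are in the first comma-separated
# list (the pool) but not in the second (the new host), one per line.
missing_patches() {
    pool_list=$(echo "$1" | tr -d ' ')
    host_list=$(echo "$2" | tr -d ' ')
    for name in $(echo "$pool_list" | tr ',' ' '); do
        case ",$host_list," in
            *",$name,"*) ;;          # the new host already has this one
            *) echo "$name" ;;
        esac
    done
}

# Possible use, with the two lists copied from each side:
#   missing_patches "XS62E001,XS62ESP1" "XS62E001"
```

Wrapping the second list in commas makes the match exact, so XS62E001 is not mistaken for XS62E0010.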

Search for the downloads on support.citrix.com, then on the server to add:

wget http://downloadns.citrix.com.edgesuite.net/8707/XS62ESP1.zip # to get the patch
unzip XS62ESP1.zip # to open the patch
xe patch-upload file-name=XS62ESP1.xsupdate

This will output the UUID of the patch. You need this (you can also get it by running #xe patch-list).
You also need the host-uuid, which you can get from #xe host-list, but since the host is not in a pool yet, you should be able to just use command-line tab completion (xe is smart like that).

xe patch-apply uuid=0850b186-4d47-11e3-a720-001b2151a503 host-uuid=93c98aa5-935b-41a4-9b79-789fa68db354

(A technique that has worked for me is to copy the text below, paste it all at once and then press ‘TAB’, which auto-completes the host-uuid. So I can paste it all at once, press Tab and Enter, and leave the system to its work.)

wget http://downloadns.citrix.com.edgesuite.net/8707/XS62ESP1.zip
unzip XS62ESP1.zip # to open the patch
xe patch-upload file-name=XS62ESP1.xsupdate
xe patch-apply uuid=0850b186-4d47-11e3-a720-001b2151a503 host-uuid=
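
The same steps can be scripted so the patch UUID never has to be copied by hand, since xe patch-upload prints it. A sketch under assumptions: upload_and_apply is a hypothetical name, and the host UUID is passed in by the caller.

```shell
# Hypothetical helper: upload a patch file and apply it to one host.
# Usage: upload_and_apply XS62ESP1.xsupdate <host-uuid>
upload_and_apply() {
    patch_file=$1
    host_uuid=$2
    # patch-upload prints the UUID of the uploaded patch on success.
    patch_uuid=$(xe patch-upload file-name="$patch_file") || return 1
    xe patch-apply uuid="$patch_uuid" host-uuid="$host_uuid"
}
```

If the patch was uploaded earlier, the upload step may fail; in that case the UUID has to be taken from xe patch-list instead.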

 
