Asides
Recovering / updating a secondary Cisco ASA into an Active/Standby config
I have two Cisco ASA 5510s with ALMOST matching configurations.
Basically, I copied the configuration of one device to the other device (using tftp://), but then we had a problem where they could not communicate because they both thought they were the primary.
Running the command below on both devices showed the exact same thing:
[codebox]show running-config failover[/codebox]
The secondary had failed, so what I had to do was tell one of the devices that it was the secondary:
[codebox]failover lan unit secondary[/codebox]
The full recipe:
[codebox line_numbers="true" remove_breaks="false" lang="text"]
ciscoasa> enable
Password:<Enter>
ciscoasa# conf t
ciscoasa(config)# interface Ethernet0/1
no shutdown
description LAN Failover Interface
ciscoasa(config)# failover
failover lan unit secondary
failover lan interface failover_link Ethernet0/1
failover interface ip failover_link 172.16.100.1 255.255.255.252 standby 172.16.100.2
Detected an Active mate
Beginning configuration replication from mate.
End configuration replication from mate.
ciscoasa(config)#write
ciscoasa(config)#config-register 0x2102
ciscoasa(config)#reload
Proceed with reload? [confirm] <Enter>
[/codebox]
Immediately, they connected.
Here is how you duplicate this:
[codebox]ciscoasa(config)#write erase
ciscoasa(config)#reload[/codebox]
!Confirm and agree
!As the system is booting, press ESC to stop the boot process
Use BREAK or ESC to interrupt boot.
Make it so enable can be run without a password, and then boot:
[codebox]ROMMON #1>confreg 0x41
ROMMON #2>boot[/codebox]
The boot process continues, and then you can run enable to enter privileged mode; just press Enter when the password prompt appears.
[codebox]ciscoasa>enable
Password:
ciscoasa#conf t
ciscoasa(config)#[/codebox]
At this point I am going to configure an interface with an IP address so I can copy the configuration I need over TFTP.
[codebox]ciscoasa(config)#interface Ethernet0/0
no shutdown
nameif inside
security-level 100
ip address 192.168.101.1 255.255.254.0 standby 192.168.101.2[/codebox]
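Before pulling the file, it is worth confirming the ASA can actually reach the TFTP server (a quick sanity check I am adding here; the address is the TFTP server used in the next step):
[codebox]ciscoasa(config)# ping 192.168.101.64[/codebox]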
Now I can connect to the TFTP server I have set up and copy the running-config I need:
[codebox]ciscoasa(config)# copy tftp://192.168.101.64/running-config-2014-01-22.txt running-config[/codebox]
Since I just copied the running configuration from a file, the system simply thinks it is the primary:
[codebox]ciscoasa(config)# show failover
Last Failover at: 10:43:23 MST Jan 22 2014
This host: Primary - Active
Active time: 208 (sec)
slot 0: ASA5510 hw/sw rev (2.0/8.0(4)) status (Up Sys)
Interface inside (192.168.101.1): Normal (Waiting)
Interface outside (192.168.1.184): Normal (Waiting)
slot 1: empty
Other host: Secondary - Failed
Active time: 0 (sec)
slot 0: empty
Interface inside (192.168.101.2): Unknown (Waiting)
Interface outside (192.168.1.183): Unknown (Waiting)
slot 1: empty[/codebox]
Notice that the secondary is failed, and when I connect to the other device I get the exact same thing. Basically, what has to be done here is that I need to tell one of them that it is the secondary. I tried many different things to make it secondary:
[codebox]ciscoasa(config)# failover reset
ciscoasa(config)# no failover active
WARNING: NO Standby detected in the network, or standby is in FAILED state.
Switching this unit to Standby can bring down the Network without any Active
ciscoasa(config)#
Switching to Standby
Switching to Active[/codebox]
These did not work; the system just couldn't communicate with the other device until I explicitly set it to be the secondary device:
[codebox]ciscoasa(config)# failover lan unit secondary
State check detected an Active mate
Beginning configuration replication from mate.
End configuration replication from mate.
Switching to Standby[/codebox]
That was it! Now the devices are synced: when I turn off device 1, device 2 takes over; when I turn off device 2, device 1 takes over. Now the status is:
[codebox]ciscoasa(config)# show failover
Last Failover at: 10:55:22 MST Jan 22 2014
This host: Secondary - Standby Ready
Active time: 702 (sec)
slot 0: ASA5510 hw/sw rev (2.0/8.0(4)) status (Up Sys)
Interface inside (192.168.101.2): Normal
Interface outside (192.168.1.183): Normal
slot 1: empty
Other host: Primary - Active
Active time: 5717 (sec)
slot 0: ASA5510 hw/sw rev (2.0/8.0(4)) status (Up Sys)
Interface inside (192.168.101.1): Normal
Interface outside (192.168.1.184): Normal
slot 1: empty
[/codebox]
This exercise helps us learn how to recover failed systems; however, it is not the most efficient way to recover. I removed the secondary from the failover pair, cleared the config, and reloaded. Enter No, then Enter, when asked whether you want to save:
[codebox]ciscoasa(config)# clear configure failover
ciscoasa(config)# write erase
ciscoasa(config)# reload
System config has been modified. Save? [Y]es/[N]o:N
Proceed with reload? [confirm] <Enter>[/codebox]
This brings up a prompt asking if you want to configure the firewall interactively; I answer no:
[codebox]Ignoring startup configuration as instructed by configuration register.
INFO: Converting to disk0:/
Pre-configure Firewall now through interactive prompts [yes]? no[/codebox]
Now, starting from 'scratch', let's configure only the failover interface and start as the secondary; this is the quickest method to bring up a 'fresh' standby:
[codebox]ciscoasa> enable
Password:<Enter>
ciscoasa# conf t
ciscoasa(config)# interface Ethernet0/1
no shutdown
description LAN Failover Interface
ciscoasa(config)# failover
failover lan unit secondary
failover lan interface failover_link Ethernet0/1
failover interface ip failover_link 172.16.100.1 255.255.255.252 standby 172.16.100.2
Detected an Active mate
Beginning configuration replication from mate.
End configuration replication from mate.
ciscoasa(config)#write
ciscoasa(config)#reload
Proceed with reload? [confirm] <Enter>[/codebox]
The key line is failover lan unit secondary. This was copied directly from my running configuration on the live device, except that the word primary was changed to secondary, as you see above.
Don't forget to write and reload to test!
Cisco ASA 5510 booting to ROMMON
If you are starting from a clean Cisco ASA 55XX, you will boot to the
- ROMMON #1>
prompt, and you will have to manually type
- ROMMON #1>boot
To address this, type
- ROMMON #1>confreg
This takes you through a list of options; choose the settings so the device boots from flash on its own.
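If you would rather skip the interactive prompts, you should be able to set the configuration register directly and boot (hedged: 0x1 is the standard ASA register value for a normal boot from flash):
[codebox]ROMMON #1>confreg 0x1
ROMMON #2>boot[/codebox]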
Notes on Recovering from a XenServer Pool failure
For my pool I have 8 XenServers (plot1, plot2, plot3, plot4, plot5, plot6, plot7 and plot8)
At the start of my tests, plot1 is the pool master.
If the pool master goes down, another server must take over as master.
To simulate this, I just ran 'shutdown' on the master host.
A large issue here is that all of the slaves in the pool just disabled their management interfaces, so they cannot be connected to using XenCenter (something I did not expect).
So I connected to plot2, another server in the pool, via SSH and verified its state:
xe host-is-in-emergency-mode
The server said FALSE!?! The server didn't even know that the pool was in trouble? So I ran pool-list:
xe pool-list
The command took a long time, so I figured I would stop it and put a time command in front of it to find out how long it really took:
time xe pool-list
Turns out, when I shut down the pool master, I am shutting down the pool! I am not simulating an error at all. Somehow the pool master notified the slaves that it was gracefully shutting down, telling them 'don't worry, I will be all right.' The commands above never returned, so I just told plot2 to take over as master to see how we could recover from this situation:
xe pool-emergency-transition-to-master
At this point, on plot2, the pool was restored, but we could still not connect to the management interfaces of any of the other plots in the pool. But XenCenter WAS able to connect to plot2, and it synchronized the entire pool, showing all of the other hosts (including plot1, which was previously the master) as down.
The other hosts in the pool are still running all of their services (SSH, apache, or whatever); they just cannot communicate about the pool, so I have to 'recover' them back into the pool.
On the new master, I run:
xe pool-recover-slaves
This brings the slaves back into the pool so they are visible within XenCenter again. plot1, the original master, is still turned off, but visible as turned off in XenCenter, so I right-click on it in XenCenter and select Power On. It begins booting, and I hold my breath to see if there are any master-master conflicts, since the shut-down host thought it was the all-powerful one when it shut down.
Once it comes up (3 minutes later), I find that plot1 gracefully fell into place as a slave. So, the moral of this story:
!Don't shut down the pool master. If you do, you will lose XenCenter access to all of the hosts in the pool, so you MUST either 1) bring it back up immediately, or 2) SSH to the console of another host and run #xe pool-emergency-transition-to-master and then #xe pool-recover-slaves; this will restore your pool minus the host that was originally the master. Reconnect with XenCenter to the new pool master, then use XenCenter to power on the host that was the old master.
!Best Practice: before stopping a host that is currently the pool master, connect to another host in the pool and run #xe pool-emergency-transition-to-master and then #xe pool-recover-slaves, as consolidated below.
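The whole best-practice sequence, consolidated for copy and paste (just the two commands from above, run on the host that should take over):
[codebox]# on another host in the pool, BEFORE shutting down the current master
xe pool-emergency-transition-to-master
xe pool-recover-slaves[/codebox]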
Well, now that we know shutting down the master does not simulate a failure, we will have to use another 'on-site' method.
!Simulation 2:
On plot2 (the current pool master), I disconnected the ethernet cables.
The XenCenter console can no longer connect to the pool, so I have to use SSH again. This time I will connect to plot3 and find out what it thinks of the pool issue:
xe host-is-in-emergency-mode
This command returns false; somehow the host thinks everything is okay. I run xe pool-list and xe host-list, both of which never return. Come on, host, shouldn't you recognize a failure here?
I ping the pool master's IP and the ping fails, but xe host-is-in-emergency-mode still returns false. For some reason, this host just does not think it has a problem.
So, I guess I just can't trust xe host-is-in-emergency-mode.
Even after 2 hours, xe host-is-in-emergency-mode still returns false.
So for monitoring I will have to come up with some other method (a rough sketch follows the recovery below), but the rules for how to recover are the same:
xe pool-emergency-transition-to-master
xe pool-recover-slaves
This brought the pool up again on plot3 with plot3 as the new master.
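Since xe host-is-in-emergency-mode cannot be trusted, here is the rough monitoring sketch I mentioned (my own addition, not something I ran during this test). It assumes the xe CLI and the coreutils timeout command are available on a slave, and treats a hung xe pool-list as a sign the master is unreachable:
[codebox]#!/bin/bash
# hedged sketch: alert when the pool master stops answering
# xe pool-list hangs when the master is gone, so give it a deadline
if ! timeout 20 xe pool-list >/dev/null 2>&1; then
    echo "$(date): xe pool-list hung or failed - pool master may be down" >&2
    # hook your alerting (email, nagios, etc.) in here
fi[/codebox]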
Now the trick is to bring plot2 back online. In this case, plot2 never actually went offline; it is still running without the ethernet cable plugged in. So when I plug it back in, I may end up with some master-master conflicts..... here goes!
After reconnecting the ethernet cable to plot2 (the old master):
- plot3 did not automatically recognize that the host is back up; in fact, in XenCenter it still shows red as though it is shut down. I right-clicked on it and told it to power on, but it did nothing but wait.
- plot2 did not make any changes; it appears they both happily think they are the masters.
To test how the pool reacted, I attempted to disable one of the slaves from plot2 with xe host-disable uuid=xxxxxxxx (my thought was that plot2 is incorrectly considered down and not connected, so the disable should not go through).
It turns out that plot2 could not disable the host, because the host 'could not be contacted'. This is good, because it makes sure that none of the slaves are confused. In fact, plot3 is not confused either; it is only plot2, the master that went missing, that is confused (I have seen the Xen docs call this a barrier of some sort).
I tried to connect to plot2 with XenCenter, but XenCenter smartly told me that I cannot connect because it appears the server was created as a backup of my pool, and that the dangerous operation is not allowed. (I will try to trick XenCenter into connecting by removing references to my pool from it and then trying again.)
AH! It let me! That means XenCenter is smart enough to recognize when you are attempting to make two separate connections to the split-brain masters of a pool, and prevents it, but only as long as it still holds references to the pool.
To dig further into this issue, I decided to 'break' the pool further by giving the two masters different definitions of the pool. On the plot2 master, I used XenCenter to destroy the disconnected host plot7; XenCenter let me do this. Now when I go to reconnect, I will be attempting to pull the orphaned master, with its different definition of the pool, back into the pool.
Now the trick is to determine the best way to bring the plot2 old master back into the current pool as a slave. We need to tell the new master to recover slaves:
xe pool-recover-slaves
That pulls plot2 back in as a slave, and GREAT, it did not use any of the pool definition from plot2. plot3 properly asserted its role as the true pool master.
I can imagine a bad scenario if I had told the 'OLD' master to recover slaves: either the split would have gotten much worse, or (if the barrier was really working) the pool would have told the old master that it was not possible.
Other methods that may have worked but were not tried (they don't feel right):
- from the orphaned master: xe pool-join force=1 ….. server username password (I doubt this would work since it is already a member of a pool)
- from the orphaned master: xe pool-emergency-reset-master master-address=<ip of new master> (this one I am not sure of; it would be worth a shot if for some reason the pool is not working)
The thing that you NEVER want to do while a master or any other server is orphaned or down is remove the server from the pool. What can happen in this situation is that the server that is down still thinks it is in the pool when it comes back up, but the pool does not know about it. We get into a stuck state that I have only ever found one way out of: the orphaned server thinks it is in a pool, but cannot get out of the pool without connecting to the master, and the master will not recognize the orphaned server, so the server can't do anything. (The way out of this was to promote the orphaned server to master, then remove all of the hosts in the pool, then delete all of the stored resources and PBDs, and then join the pool anew. This sucked because everything on the server was destroyed, so I could just as well have reinstalled XenServer. A hedged sketch of that escape route follows.)
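Here is a hedged reconstruction of that escape route as commands (the uuids are placeholders; this wipes the orphan's view of the pool, so treat it as a last resort):
[codebox]# on the orphaned server
xe pool-emergency-transition-to-master   # promote the orphan to master of its own one-host pool
xe host-list                             # find the uuids of the stale member hosts
xe host-forget uuid=<stale-host-uuid>    # repeat for each stale host
xe pbd-list                              # find the stale storage attachments
xe pbd-unplug uuid=<pbd-uuid>
xe pbd-destroy uuid=<pbd-uuid>
xe sr-forget uuid=<sr-uuid>              # drop the stale SR records
xe pool-join master-address=<new-master-ip> master-username=root master-password=<password>[/codebox]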
I have heard of, but have not attempted, reinstalling XenServer without selecting the disks:
http://support.citrix.com/article/CTX120962
Promoting a XenServer host to pool master
There are a couple of problems that can be found when attempting to promote a XenServer host to pool master.
- If you are on host1 attempting to designate host2 through xsconsole, you are prompted to select from all of the other hostnames in the pool. However, xsconsole appears to use the hostname only, and if your server cannot resolve the other host by its hostname, xsconsole will show an error: cannot connect to host.
This could be resolved by making /etc/hosts entries for each host, but that is overkill when you need to make a quick pool master change.
- The next idea is, by convention, to only designate a host as pool master FROM that host, and make sure that each host can refer to itself using its hostname.
Not too much of a problem here; it seems pretty reasonable to have the current hostname in the /etc/hosts file. However, the default XenServer install does not do this.
So the best way I have found is the convention of only promoting a pool master FROM that host, using the command line and the host's UUID:
# xe pool-designate-new-master host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
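In case the one-liner is opaque, here it is broken into steps (my annotation of the same command, not a different method; it assumes, as the one-liner does, that xe host-list prints each host's uuid line immediately before its name-label line, and that /etc/hostname matches this host's name-label):
[codebox]xe host-list                               # each host prints a uuid line, then a name-label line
grep -B1 -f /etc/hostname <(xe host-list)  # keep our name-label line plus the uuid line just above it
HOST_UUID=$(grep -B1 -f /etc/hostname <(xe host-list) | head -n1 | awk '{print $NF}')
xe pool-designate-new-master host-uuid=$HOST_UUID[/codebox]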
A couple of problems can still prevent a new pool master from being designated:
- Not all hosts in the pool are available. For example, one of the hosts in the pool was down, and I received an error because host5 was not available. I solved this by using XenCenter to destroy the down hosts. That did not seem like a good idea, since I wanted them to come up at some time in the future and rejoin the pool, but I guess that has to be done manually.
The error was:
You attempted an operation which involves a host which could not be contacted.
host: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (HOST5)
- Another problem that I have not encountered yet, but which is still possible, is being unable to designate a new master while you have servers down that the pool thinks may still be running (this sounds like some sort of split-brain 'protection').
- http://discussions.citrix.com/topic/250603-how-to-forcefully-remove-a-dead-host-from-a-pool/
Add a second disk as Local Disk Storage to an existing XenServer
On each of the XenServers in the pool I created, I have at least 2 partitions that I wanted to be available to my VMs.
For a little while I was just running individual commands to figure it out each time, and finally I decided to come up with a single command that I could copy and paste.
I have it below so I can always come back to this blog post and find it.
First, I find out which partition I want to add:
#cat /proc/partitions
I just have to replace the /dev/sdb in the command below with the actual disk or partition I want to add, and I might need to change the "name-label" in case I already have a 'Local storage 2'. Otherwise, the command figures out the current hostname, gets the host uuid, and names the storage appropriately. This works in a pool where host-list returns more than one host.
CAUTION: when cutting and pasting from below, be careful to make sure that the quotes match exactly. I have run into situations where the double quotes (") around the name-label parameter and the single quotes (') around the awk parameter show as question marks (?) when pasted into the XenCenter console.
#xe sr-create content-type=user device-config:device=/dev/sdb host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage 2 on `cat /etc/hostname`" shared=false type=lvm
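To double-check that the new SR was created on the right host (a hedged follow-up I am adding, not part of the original note):
[codebox]#xe sr-list type=lvm params=uuid,name-label,host[/codebox]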
Adding a new XenServer to my XenServer Pool – Homogeneity required
In order to add an additional XenServer to an existing XenServer pool, the servers must be homogeneous, meaning that all of the same updates must be applied.
#blogpostinnoteform #couldbecleanedup
I have not had any luck applying updates using XenCenter's 'Apply pending updates', although XenCenter does a good job of showing which servers have updates to apply.
Below are my notes on how to find and apply patches so that a XenServer can have the same updates/patches applied as the pool, and then be added.
On any server in the pool:
xe patch-list
This will list several patches. The confusing thing for me was knowing which patches are included in a service pack; since service packs seem to roll up all of the patches into them, patches which were applied as part of a service pack show a size of 1.
Search for the downloads on support.citrix.com.
On the server to add:
wget http://downloadns.citrix.com.edgesuite.net/8707/XS62ESP1.zip # get the patch
unzip XS62ESP1.zip # open the patch
xe patch-upload file-name=XS62ESP1.xsupdate
This will output the uuid of the patch; you need this (you can also get it by running #xe patch-list).
You also need the host-uuid, which you can get from #xe host-list; but since the host is not in a pool yet, you should be able to just use command-line tab completion (xe is smart like that).
xe patch-apply uuid=0850b186-4d47-11e3-a720-001b2151a503 host-uuid=93c98aa5-935b-41a4-9b79-789fa68db354
(A technique that has worked for me is to copy the text below, paste it all at once, and then press TAB, which auto-completes the host-uuid; so I can paste it all at once, press Tab and Enter, and leave the system to its work.)
wget http://downloadns.citrix.com.edgesuite.net/8707/XS62ESP1.zip
unzip XS62ESP1.zip
xe patch-upload file-name=XS62ESP1.xsupdate
xe patch-apply uuid=0850b186-4d47-11e3-a720-001b2151a503 host-uuid=
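Once the patch levels match, the host should be homogeneous enough to join the pool. The actual join step is not in my notes above, so, hedged, with placeholders for the master's address and password, it would look like:
[codebox]xe pool-join master-address=<pool-master-ip> master-username=root master-password=<password>[/codebox]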
MDADM – Failed disk recovery (too many disk errors)
This only happens once every couple of years, but occasionally a SCSI disk on one of our servers has too many errors and is kicked out of the md array.
And... we have to rebuild it. Perhaps we should replace it, since it appears to be having problems, but really, the I in RAID stands for inexpensive (or something), so I would rather lean toward being frugal with the disks and replace them only if required.
I can never remember off the top of my head the commands to recover, so this time I am going to blog it so I can easily find it.
First step: take a look at the status of the arrays on the disk:
#cat /proc/mdstat
(I don't have a copy of what the failed drive looks like since I didn't start blogging until after)
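For reference, a failed member in /proc/mdstat generally looks something like this (illustrative only; reconstructed from the healthy output further down plus mdstat's standard (F) failed marker):
[codebox]#cat /proc/mdstat
md0 : active raid5 sdb1[3](F) sdc1[2] sda1[0]
      140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U][/codebox]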
Sometimes an infrequent disk error can cause md to fail a hard drive and remove it from an array, even though the disk is fine.
That is what happened in this case, and I knew the disk was at least partially good. The partition that failed was /dev/sdb1, which was part of a RAID 5; on that same device, another partition is part of a RAID 1, and that RAID 1 is still healthy, so I knew the disk itself is fine. So I am only re-adding the partition to the array so it can rebuild. If the disk has a second problem in the next few months, I will go ahead and replace it, since the issue that happened tonight probably indicates a disk that is beginning to fail but still has plenty of life in it.
The simple process is:
#mdadm /dev/md0 --remove /dev/sdb1
This removes the faulty partition from the array. This is when you would physically replace the disk in the machine; since I am only going to rebuild the same disk, I skip that and move to the next step.
#mdadm /dev/md0 --re-add /dev/sdb1
The disk started to rebuild and VOILA! We are rebuilding and will be back online in a few minutes.
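If you want more detail than /proc/mdstat gives while the array rebuilds (standard mdadm, not part of my original notes):
[codebox]#mdadm --detail /dev/md0[/codebox]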
Now take a look at the status of the arrays:
#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[3] sdc1[2] sda1[0]
      140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
      [=======>.............]  recovery = 35.2% (24758528/70316352) finish=26.1min speed=29020K/sec
md1 : active raid1 sda2[0] sdb2[1]
      1365440 blocks [2/2] [UU]
In case you want to do any troubleshooting on what happened, this command is useful for looking into the logs:
#grep mdadm /var/log/syslog -A10 -B10
But this next command is the one that I use to see the important events related to the failure and rebuild. As I am typing this, the rebuild is just over 60% complete, which you can see in the log:
#grep mdadm /var/log/syslog
Jun 15 21:02:02 xxxxxx mdadm: Fail event detected on md device /dev/md0, component device /dev/sdb1
Jun 15 22:03:16 xxxxxx mdadm: RebuildStarted event detected on md device /dev/md0
Jun 15 22:11:16 xxxxxx mdadm: Rebuild20 event detected on md device /dev/md0
Jun 15 22:19:16 xxxxxx mdadm: Rebuild40 event detected on md device /dev/md0
Jun 15 22:27:16 xxxxxx mdadm: Rebuild60 event detected on md device /dev/md0
You can see from the times that it took me just over an hour to respond and start the rebuild. (I know, that seems too long if I were just doing this remotely, but when I got the notice I went on site, since I thought I would have to do a physical swap; I had to wait a bit while the colo security verified my ID, and I was probably moving a little slow after some nachos at Jalepeno's.) Once the rebuild started, it took about 10 minutes per 20% of the disk to rebuild.
————————-
Update: 9 months later the disk finally gave out and I had to manually replace the disk. I blogged again:
https://www.matraex.com/mdadm-failed-d…nreadable-disk/
Employee Appreciation Night – Idaho Steelheads Hockey
The crew at Matraex headed down to CenturyLink Arena to watch the Idaho Steelheads play against the Colorado SomethingSomethings. The group of 13 (including some family that came along) sat just behind the east goal and caught some great action.
Unfortunately the Steelheads lost, but everyone enjoyed the show!
Family Bowling Night
At 6:00 on Wednesday, October 3rd, Matraex employees and their families had a bowling night at Big Al's. It was great to get together with coworkers, spouses, and kids, and spend some time having fun and getting to know the families. John was the experienced bowler of our bunch, scoring an impressive 199. We are learning that John is a man of many talents.
Bowling at Big Al's
Wednesday October 3rd
6:00
http://www.ilovebigals.com/meridian/