Hacking a corrupt VHD on Xen in order to access InnoDB MySQL information

A client ran into a corrupted .vhd file for the data drive of a Xen server in a pool. We helped them restore from a backup; however, some items had not been backed up properly, so our task was to see if we could somehow restore the data from their drive.

First, we had to find the raw file for the drive. To do this, we looked at the Local Storage -> General tab in XenCenter to find the UUID of the storage repository that contains the failing disk.

When we tried to attach the failing disk, we got this error:

Attaching virtual disk 'xxxxxx' to VM 'xxxx'
The attempt to load the VDI failed

So we know that the Xen servers / pool refuse to load the corrupted vhd, and I had to come up with another way to try to access the data.

After much research I came across a tool published by ‘twindb.com’ called ‘undrop tool for innodb’. The idea is that even after you drop or delete InnoDB files on your system, there are still markers in the file system which allow code to parse what used to be on the system. They claimed this also worked, to some degree, for corrupted file systems.

The documentation was poor, and it took a long time to figure out. They claimed to have 24-hour support, so I thought I would call them and just pay them to sort out the issue, but they took a while and didn’t call back before I had sorted it out myself. All of the documentation they had linked to their GitHub account, but the link was dead. I searched and found a couple of other people who had forked the code before twindb took it down. I suspect they now run more of a service business that helps people resolve these issues, and they don’t want to support the code. Since this code worked for our needs, I have forked it so that we can make it permanently available: https://github.com/matraexinc/undrop-for-innodb

The first step was to copy the .vhd to a working directory, then fetch and build the tool:

#cp -a 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd /tmp/restore_vhd/orig.vhd
#cd /tmp/restore_vhd/
#git clone https://github.com/matraexinc/undrop-for-innodb
#cd undrop-for-innodb
#apt-get install bison flex
#apt-get install libmysqld-dev  #this was not mentioned anywhere, however an important file was quietly not compiled without it
#mv * ../.  #once compiled, move all of the compiled files into your working directory
#cd ../
#./stream_parser -f orig.vhd  #here is the magic: their code goes through and finds all of the ibdata1 logs and markers and creates data you can start to work through
#mv pages-orig.vhd pages-ibdata1  #the program created an organized set of data for you, and the next programs need to find this at pages-ibdata1
#./recover_dictionary.sh  #this needs to run mysql as root; it creates a database named ‘test’ which lists all of the databases, tables and indexes it found

This is where I had to come up with a custom solution in order to process the large volume of customer databases. I used some PHP to script the following commands for all of the many databases that needed to be restored. For each database and table, you must run a command that corresponds to an ‘index’ file that the previous commands created for you, so you must loop through each of them. The following query lists them:

 

select c.name as tablename
,a.id as indexid
from SYS_INDEXES a
join SYS_TABLES c on (a.TABLE_ID = c.ID)

 

This returns a list of the tables and any associated indexes. Using this, you must generate commands which

  1. generate a create statement for the table you are restoring,
  2. generate a load infile SQL statement and an associated data file

#sys_parser -h localhost -u username -p password -d test tablenamefromsql

This generates the create statement for the table. Save it to a createtable.sql file and execute it on your database to restore your table.

#c_parser -5 -o data.load -f pages-ibdata1/FIL_PAGE_INDEX/00000017493.page -t createtable.sql

This outputs a “load data infile ‘data.load’” statement. You should pipe this to MySQL and it will restore your data.
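The per-table loop described above can be sketched in Python (the original work used PHP, which is not shown). This is only an illustration under assumptions: the helper names, the output file names, and the 11-digit zero padding of the page file name (copied from the 00000017493.page example) are mine, and the credentials are the placeholders from the commands above.

```python
import shlex

PAGES_DIR = "pages-ibdata1/FIL_PAGE_INDEX"  # directory used in the c_parser example


def page_file(index_id, width=11):
    """Build the page file path for an index id.

    Assumption: ids are zero padded to 11 digits, as in 00000017493.page.
    """
    return "{}/{}.page".format(PAGES_DIR, str(index_id).zfill(width))


def recovery_commands(table_name, index_id, user="username", password="password"):
    """Return the sys_parser and c_parser commands for one (tablename, indexid) row."""
    base = table_name.replace("/", "_")  # hypothetical naming for the output files
    create_sql = base + ".create.sql"
    load_file = base + ".load"
    sys_parser = "./sys_parser -h localhost -u {} -p {} -d test {} > {}".format(
        shlex.quote(user), shlex.quote(password),
        shlex.quote(table_name), shlex.quote(create_sql))
    c_parser = "./c_parser -5 -o {} -f {} -t {}".format(
        shlex.quote(load_file), shlex.quote(page_file(index_id)),
        shlex.quote(create_sql))
    return [sys_parser, c_parser]


if __name__ == "__main__":
    # Rows as returned by the SYS_INDEXES / SYS_TABLES query; these values are made up.
    rows = [("clientdb/orders", 17493)]
    for table_name, index_id in rows:
        for command in recovery_commands(table_name, index_id):
            print(command)
```

Printing the commands instead of running them makes it easy to review the list, or to run a single table by hand, before touching the database.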


I found one example where the create statement was not properly created, for table_id 754. It appears that the sys_parser code relies on indexes, and in this case the client table did not have an index (not even a primary key). This meant that no create statement was generated and the import did not continue. To work around this, I manually inserted a fake primary key on one of the columns into the dictionary database:

#insert into SYS_INDEXES set ID=10000009, TABLE_ID=754, NAME='PRIMARY', N_FIELDS=1, TYPE=3, SPACE=0, PAGE_NO=400000000
#insert into SYS_FIELDS set INDEX_ID=10000009, POS=0, COL_NAME='myprimaryfield'

Then I was able to run the sys_parser command which then created the statement.


An Idea that Did not work ….

The idea was to create a new disk device at /dev/xvdX, create a new filesystem on it and mount it. Then, using a tool such as dd or qemu-img, overwrite the already mounted device with the contents of the vhd. While the contents are corrupted, the idea was that we would be able to explore them as best we could.

So the command I ran was:

#qemu-img convert -p -f vpc -O raw /var/run/sr-mount/f40f93af-ae36-147b-880a-729692279845/3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd /dev/xvde

 

Where 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd is the name of the file / UUID that is corrupted on the Xen dom0 server, and f40f93af-ae36-147b-880a-729692279845 is the UUID of the storage repository (SR) that it was located on.

 

The command took a while to complete (it had to convert 50GB), but the contents of the vhd started to show up as I ran find commands on the mounted directory. During the transfer, the results were sporadic as the partition was only partially built; however, after it completed, I had access to about 50% of the data.

An Idea that Did not work (2) ….

This was not good enough to get the files the client needed. I had a suspicion that the qemu-img convert command may have dropped some of the data that was still available, so I figured I would try another, somewhat similar command that actually seems to be a bit simpler.

This time I created another disk on the same local storage and found it using the xe vdi-list command on the dom0.

#xe vdi-list name-label=disk_for_copyingover

This showed me that the UUID of this file was ‘fd959935-63c7-4415-bde0-e11a133a50c0.vhd’.

I found it on disk and executed a cat from the corrupted vhd file into the mounted vhd file while it was running:

cat 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd > ../8c5ecc86-9df9-fd72-b300-a40ace668c9b/fd959935-63c7-4415-bde0-e11a133a50c0.vhd

Where 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd is the name of the file / UUID that is corrupted on the Xen dom0 server, and fd959935-63c7-4415-bde0-e11a133a50c0.vhd is the name of the VDI we created to copy over.

 

This method completely corrupted the mounted drive, so I scrapped this method.

Next up: try some partition recovery tools.

I started with testdisk (apt-get install testdisk) and ran it directly against the vhd file:

testdisk 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd

Enabling Xen VM auto start for 6.2 – command line

Citrix removed auto start from the easy-to-access options in XenCenter for 6.X servers.

However, you can still enable it.

First enable it on your pool

  • xe pool-param-set uuid=UUID other-config:auto_poweron=true

Then run a command to get all of the VMs in your pool and turn auto power on for all of the VMs that are currently on.

  • xe vm-list power-state=running | awk -F': ' '/^uuid/ {print "xe vm-param-set uuid="$NF" other-config:auto_poweron=true;"}'

This will give you a list of commands to enable auto_poweron for each of the running VMs in your pool.
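The same command generation can be sketched in Python, which makes the parsing step explicit. It assumes that `xe vm-list` prints one `uuid ( RO) : <uuid>` line per VM; the function name and the sample UUID are made up for illustration.

```python
def autostart_commands(vm_list_output):
    """Turn `xe vm-list power-state=running` output into xe vm-param-set commands."""
    commands = []
    for line in vm_list_output.splitlines():
        key, sep, value = line.partition(":")
        # Only the "uuid ( RO) : <uuid>" lines matter; name-label etc. are skipped.
        if sep and key.strip().startswith("uuid"):
            commands.append(
                "xe vm-param-set uuid={} other-config:auto_poweron=true".format(value.strip()))
    return commands


if __name__ == "__main__":
    # Made-up sample in the shape xe vm-list prints.
    sample = (
        "uuid ( RO)           : 11111111-2222-3333-4444-555555555555\n"
        "          name-label ( RW): web01\n"
        "         power-state ( RO): running\n"
    )
    for command in autostart_commands(sample):
        print(command)
```

As with the awk version, the output is a list of commands to review and then run on the dom0.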

Command Dump – Extending a disk on XenServer with xe

To expand a disk on a XenServer using the command line, follow the steps below. I assume that you have backed up the data elsewhere before the expansion, as this method deletes everything on the disk to be expanded.

  • dom0>xe vm-list name-label=<your vm name> # to get the UUID of the VM = VMUUID
  • dom0>xe vm-shutdown uuid=<VMUUID>
  • dom0>xe vbd-list  params=device,empty,vdi-name-label,vdi-uuid   vm-name-label=<your vm name>  # to get the vdi-uuid of the disk you would like to expand = VDIUUID
  • dom0>xe vdi-resize uuid=<VDIUUID> disk-size=120GB # use the size that you would like to expand to
  • dom0>xe vm-start uuid=<VMUUID>

That’s it on the dom0. Now, as your VM boots up, log in via SSH and complete the changes by deleting the old partition, repartitioning and making a new filesystem. I am going to do this as though the disk is mounted at /data.

  • domU>df /data # to get the device name = DEVICENAME
  • domU>umount /dev/DEVICENAME
  • domU>fdisk /dev/DEVICENAME
  •    [d] to delete the existing partition
  •    [n] to create a new partition
  •    [w] to write the partition table (fdisk exits after writing)
  • domU>mkfs.ext3 /dev/DEVICENAME
  • domU>mount /data
  • domU>df /data # to see the expanded size

 

XenServer and XenCenter

Why do we Blog about XenServer and XenCenter?

First, a quick bit about why we chose XenServer

We are small users of the XenServer and XenCenter software, and when we were first evaluating the hypervisor, we didn’t know much at all about virtualizing servers.
At the same time as we were looking at XenServer, we were also looking into Hyper-V and VMware. Of the three, I found the open source model that XenServer had, backed by Citrix’s large company status, to be the most appealing.
Xen, the hypervisor that XenServer is built on, was also what Amazon AWS was based on, and our experience with AWS helped us lean towards XenServer.
To add to this, the XenCenter software was very simple to use. The way that we were able to quickly create and manage pools of servers and simply connect to the console seemed to address the features we would need without overcomplicating things like the VMware software did. And I liked the simple, fast interface.

And finally, since we don’t like to have Windows or GUI interfaces in our environment, we loved that the hypervisor is a Linux install we can log into and run ‘xe’ commands on. This makes XenServer very scriptable.

XenServer is scriptable

Looking back, the reason we have created so many blog posts about XenServer is simply that it is so easy to do. As we have run into things that we have had difficulty doing, it has been simple to document the process of figuring them out: we have the option to simply cut and paste our command line history. This seems so much easier than creating picture snippets of a GUI-based management system, and it makes it simple to turn our documentation of troubleshooting an issue into a blog post.

Solutions to Problems are easy to forget

When we find a solution to a problem, it can be very easy to implement and forget. What happens then is that we end up doing the same research a year later to solve the same problem. This is one of the reasons that many of our blog posts are not polished; the posts just read like a stream-of-consciousness troubleshooting session. We are not expert article writers; we are expert website developers, server administrators and technical implementers. However, we recognized that when we solve a difficult problem, if we document it in a place that is easy to find (our own blog) we can easily come back to it. We simply search our own blog for it.

All of our blog topics

So really,  the reasons above apply to many of our blog topics.

  • Easy to script or describe in text (without pictures), so we are able to cut and paste
  • Solution is one that we want to be able to easily find and apply again

Examples of XenServer Blog Posts

manually removing a pool slave from a pool in XenCenter

Problem: The pool master was lost or the ip address was changed. Upon bootup of one of the pool’s slaves, it came up with no management network, and no network interfaces to configure.

Resolution:

MAKE SURE YOUR VMs ARE BACKED UP!!!! LOCAL STORAGE WILL GO AWAY AFTER THIS AND WILL HAVE TO BE RE-CREATED.

Remove the slave server from XenCenter.

At the slave console’s main menu, go to “Network and Management Interface”, “Emergency Network Reset”

Log in, and walk through the steps of re-assigning your address. Go ahead and enter an address for the master when prompted.

The server will reboot.

Go to “Local Command Shell” on the main menu.

Check the state of the server:

xe host-is-in-emergency-mode

answer: true

Because the server is still in emergency mode, we need to edit the pool.conf.

nano /etc/xensource/pool.conf

It will probably reference “slave” and whatever address you defined as your master.

Remove all entries and add: master

Save the conf file with Ctrl + o, exit with Ctrl + x

Rename the state.db with this command.

mv /var/xapi/state.db /var/xapi/state.db-old

Exit to the main console with xsconsole.

Reboot it, and you should be able to re-add it to XenCenter and your pool.

More on changing ip addresses here:

http://support.citrix.com/article/CTX123477

Adding your local storage back to the xenserver:

Once you’ve re-added your server to XenCenter, you’ll notice that your storage devices are gone. To re-add them:

On the console tab of the server you just added, you can list your devices with:

cat /proc/partitions

Get your device IDs with:

ll /dev/disk/by-id

Execute the following command:

xe sr-create content-type=user device-config:device=/dev/disk/by-id/<device ID from the list from the previous command> host-uuid=<ID can be copied and pasted from the “general” tab> name-label="Give It a Name" shared=false type=lvm

If you’re trying to add the disk with the system on it, you’ll have to select the partition to restore:

xe sr-create content-type=user device-config:device=/dev/disk/by-id/<device ID for the partition from the list from the previous command> host-uuid=<ID can be copied and pasted from the “general” tab> name-label="Give It a Name" shared=false type=lvm

This might at least allow you to get any files on that storage off to a more stable place. With a server in this condition, I would recommend reloading XenServer once you’ve taken everything that you need off of it.

Matt Long

02/24/2015

 

In XenCenter Console – mount DVD drive in Ubuntu 14.04

When running Ubuntu 14.04 LTS as a guest under XenServer 6.5, I was attempting to install xs-tools.iso by mounting it into the server using the drop-down box.

 

However, at the console I was unable to find /dev/cdrom, /dev/dvd*, /dev/sr* or anything that seemed to fit.

 

So I ran fdisk -l

#fdisk -l

and I found a disk I didn’t recognize:

Disk /dev/xvdd: 119 MB, 119955456 bytes
255 heads, 63 sectors/track, 14 cylinders, total 234288 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/xvdd doesn't contain a valid partition table

So I mounted it and looked at the contents

#mount /dev/xvdd /mnt
#ls -l /mnt
dr-xr-xr-x 4 root root 2048 Jan 27 04:08 Linux
-r--r--r-- 1 root root 1180 Jan 27 04:08 README.txt
-r--r--r-- 1 root root 65 Jan 27 04:07 AUTORUN.INF
-r--r--r-- 1 root root 802816 Jan 27 04:07 citrixguestagentx64.msi
-r--r--r-- 1 root root 802816 Jan 27 04:07 citrixguestagentx86.msi
-r--r--r-- 1 root root 278528 Jan 27 04:07 citrixvssx64.msi
-r--r--r-- 1 root root 253952 Jan 27 04:07 citrixvssx86.msi
-r--r--r-- 1 root root 1925120 Jan 27 04:07 citrixxendriversx64.msi
-r--r--r-- 1 root root 1486848 Jan 27 04:07 citrixxendriversx86.msi
-r--r--r-- 1 root root 26 Jan 27 04:07 copyright.txt
-r--r--r-- 1 root root 831488 Jan 27 04:07 installwizard.msi
-r-xr-xr-x 1 root root 50449456 Jan 27 04:03 dotNetFx40_Full_x86_x64.exe
-r-xr-xr-x 1 root root 1945 Jan 27 04:03 EULA_DRIVERS
-r-xr-xr-x 1 root root 1654835 Jan 27 04:03 xenlegacy.exe
-r-xr-xr-x 1 root root 139542 Jan 27 04:03 xluninstallerfix.exe


So I found it!  Now just to install the tools and reboot

#cd /mnt/Linux && ./install.sh
#reboot

XenCenter – missing ‘Logs’ tab

XenCenter has moved the status of actions for each physical server and VM away from the very intuitive ‘Logs’ tab where it was before. Here is where they moved it.

  • At the bottom of the left pane there is an option called ‘Notifications’. When you click it, you are automatically shown all of the alerts (such as status changes).
  • At the top of the left pane, once you have clicked on Notifications, you will notice that it gives you three options: “Alerts”, “Updates” and “Events”.
  • If you click on “Events” you will see the status of ongoing exports, transfers or other actions.

 

Script for Patching XenServer 6.5

Here’s a little script that you can run at the dom0 console to automate loading patches on a fresh installation of XenServer 6.5, up to patch XS65E005. If they add more patches, just add more lines referencing the new patch name (e.g. XS65E006, etc.), starting with the “wget” command and ending with the “rm -f *.xsupdate” command.

#!/bin/bash

wget http://downloadns.citrix.com.edgesuite.net/akdlm/10194/XS65E001.zip
unzip XS65E001.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E001.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate

wget http://downloadns.citrix.com.edgesuite.net/akdlm/10195/XS65E002.zip
unzip XS65E002.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E002.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate

wget http://downloadns.citrix.com.edgesuite.net/akdlm/10196/XS65E003.zip
unzip XS65E003.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E003.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate

wget http://downloadns.citrix.com.edgesuite.net/akdlm/10201/XS65E005.zip
unzip XS65E005.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E005.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate
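Since every patch repeats the same five commands, the script can also be generated from a list of patches. The following Python sketch does that; the function names are my own, the download directories and patch names are the ones from the script above, and awk's default field separator is used in place of -F" ".

```python
BASE_URL = "http://downloadns.citrix.com.edgesuite.net/akdlm"

# (download directory, patch name) pairs taken from the script above.
PATCHES = [
    ("10194", "XS65E001"),
    ("10195", "XS65E002"),
    ("10196", "XS65E003"),
    ("10201", "XS65E005"),
]

# Shell fragment that looks up the UUID of the local host, as in the original script.
HOST_UUID = "`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`"


def patch_block(directory, name):
    """Return the five shell lines that download, apply and clean up one patch."""
    upload = "`xe patch-upload file-name={}.xsupdate 2>&1|tail -1|awk '{{print $NF}}'`".format(name)
    return "\n".join([
        "wget {}/{}/{}.zip".format(BASE_URL, directory, name),
        "unzip {}.zip".format(name),
        "xe patch-apply uuid={} host-uuid={}".format(upload, HOST_UUID),
        "rm -f *.zip",
        "rm -f *.xsupdate",
    ])


def patch_script(patches=PATCHES):
    """Return the complete bash script for all listed patches."""
    blocks = [patch_block(directory, name) for directory, name in patches]
    return "#!/bin/bash\n\n" + "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(patch_script())
```

Adding a later patch then only means appending one pair to PATCHES, once its download directory is known.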

Changing IP Addresses on a XenServer 6.5 Pool

To change the ip addresses on a XenServer 6.5 pool, start with the slaves, and use the following xe commands:

Remember: Slaves first, then the Master

NOTE: There is no need to change the IP from the Management Console.

Find the UUID of the Host Management PIF:

xe pif-list params=uuid,host-name-label,device,management

You will see a big list. Find the UUID for the slave that you’re working on. Use the “more” pipe if the UUID for your particular slave scrolls off the screen:

xe pif-list params=uuid,host-name-label,device,management | more

Change the IP Address on the first slave:

xe pif-reconfigure-ip uuid=<UUID of host management PIF> IP=<New IP> gateway=<GatewayIP> netmask=<Subnet Mask> DNS=<DNS Lookup IPs> mode=<dhcp,none,static>

Then:

xe-toolstack-restart

Verify the new address with ifconfig, and/or ping it from a workstation.

Point the slave to the new Master IP Address:

xe pool-emergency-reset-master master-address=NEW_IP_OF_THE_MASTER

Repeat the commands above on all slaves.
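Because a mistyped address on the management interface can cut you off from the host, it may help to sanity-check the values before running the command. The Python sketch below builds the static-mode command from the parameters shown above; the function name is my own, and it assumes that DNS takes a comma-separated list and that the gateway should sit inside the new subnet.

```python
import ipaddress


def pif_reconfigure_command(pif_uuid, ip, gateway, netmask, dns):
    """Validate the addresses, then build the xe pif-reconfigure-ip command (static mode)."""
    # strict=False lets us pass a host address together with its netmask.
    network = ipaddress.ip_network("{}/{}".format(ip, netmask), strict=False)
    if ipaddress.ip_address(gateway) not in network:
        raise ValueError("gateway {} is not inside {}".format(gateway, network))
    for server in dns.split(","):
        ipaddress.ip_address(server.strip())  # raises ValueError on a malformed address
    return ("xe pif-reconfigure-ip uuid={} IP={} gateway={} netmask={} DNS={} mode=static"
            .format(pif_uuid, ip, gateway, netmask, dns))


if __name__ == "__main__":
    # Placeholder values; substitute the PIF UUID from xe pif-list.
    print(pif_reconfigure_command("PIF-UUID", "192.168.1.10", "192.168.1.1",
                                  "255.255.255.0", "8.8.8.8"))
```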

On the Master:

xe pif-list params=uuid,host-name-label,device,management

xe pif-reconfigure-ip uuid=<UUID of host management PIF> IP=<New IP> gateway=<GatewayIP> netmask=<Subnet Mask> DNS=<DNS Lookup IPs> mode=<dhcp,none,static>

xe-toolstack-restart

DO NOT run the emergency-reset-master command on the Master.

Reboot the Master, then reboot the Slaves and verify that they can find the Master.

Matt Long

04/06/2015

Using MPT-Status for RAID Monitoring in a PowerEdge C6100 with PERC 6

This post outlines the steps needed to get a CLI report of the condition of your RAIDs in a PowerEdge C6100 with a PERC 6/i RAID controller.

Verify your controller type:

cat /proc/scsi/mptsas/0

ioc0: LSISAS1068E B3, FwRev=011b0000h, Ports=1, MaxQ=277

Download the following packages:

daemonize-1.5.6-1.el5.i386.rpm mpt-status-1.2.0-3.el5.centos.i386.rpm lsscsi-0.17-3.el5.i386.rpm

http://dl.nux.ro/utils/mpt-status/mpt-status-1.2.0-3.el5.centos.i386.rpm

http://dl.nux.ro/utils/mpt-status/daemonize-1.5.6-1.el5.i386.rpm

http://mirror.centos.org/centos/5/os/i386/CentOS/lsscsi-0.17-3.el5.i386.rpm

Install mpt-status:

rpm -ivh mpt-status-1.2.0-3.el5.centos.i386.rpm daemonize-1.5.6-1.el5.i386.rpm lsscsi-0.17-3.el5.i386.rpm

modprobe mptctl

echo mptctl >> /etc/modules

Verify your modules:

lsmod |grep mpt

mptctl 90739 0
mptsas 57560 4
mptscsih 39876 1 mptsas
mptbase 91081 3 mptctl,mptsas,mptscsih
scsi_transport_sas 27681 1 mptsas
scsi_mod 145658 7 mptctl,sg,libata,mptsas,mptscsih,scsi_transport_sas,sd_mod

Run:

mpt-status or mpt-status -n -s

Also, you can use: lsscsi -l

This little script:

echo `mpt-status -n -s|awk '/OPTIMAL/ {print $1, "OK"}; /ONLINE/ {print $1, "OK"}; /DEGRADED/ {print $1, "FAILURE"}; /scsi/ {print $2}; /MISSING/ {print $1, "FAILURE"} '`

reports:

vol_id:0 OK phys_id:1 OK phys_id:0 OK 100% 100%

On a rebuild, it reports:

vol_id:0 FAILURE phys_id:2 OK phys_id:3 OK 75% 75%
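To make the awk rules easier to follow, here is the same mapping as a Python sketch. The input lines in the example are hypothetical; they are shaped only so that they reproduce the report line shown above, not copied from real mpt-status output, and the function names are my own.

```python
def summarize(mpt_status_output):
    """Mirror the awk one-liner: map status keywords to OK / FAILURE tokens."""
    tokens = []
    for line in mpt_status_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Like awk, every rule is tested independently for each line.
        if "OPTIMAL" in line:
            tokens.append(fields[0] + " OK")
        if "ONLINE" in line:
            tokens.append(fields[0] + " OK")
        if "DEGRADED" in line:
            tokens.append(fields[0] + " FAILURE")
        if "scsi" in line:
            tokens.append(fields[1] if len(fields) > 1 else "")
        if "MISSING" in line:
            tokens.append(fields[0] + " FAILURE")
    # echo `...` joins the awk output lines with single spaces.
    return " ".join(token for token in tokens if token)


def has_failure(summary):
    """True when any volume or disk was reported as FAILURE."""
    return "FAILURE" in summary


if __name__ == "__main__":
    # Hypothetical input, for illustration only.
    sample = "vol_id:0 OPTIMAL\nphys_id:1 ONLINE\nphys_id:0 ONLINE\nscsi_id:1 100%\nscsi_id:0 100%\n"
    print(summarize(sample))
```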

Copy that script into a file called “check_raid”, and make it executable, e.g. with mode 755.

Edit nagios-statd on parcel1. At line 20, replace “sudo /customcommands/check_raid.pl -b -w1 -c1” with the new filename, /customcommands/check_raid (without the switches and without “sudo”).

So, from this:

commandlist['Linux'] = ("df -P","who -q | grep \"#\"","ps ax","uptime","free | awk '$1~/^Swap:/{print ($3/$2)*100}'","sudo /customcommands/check_raid.pl -b -w1 -c1")

To this:

commandlist['Linux'] = ("df -P","who -q | grep \"#\"","ps ax","uptime","free | awk '$1~/^Swap:/{print ($3/$2)*100}'","/customcommands/check_raid")

Port 1040 will need to be opened in XenServer. Edit /etc/sysconfig/iptables and insert this line:

-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 1040 -j ACCEPT

Restart the firewall:

service iptables restart

Output:

Flushing firewall rules: [ OK ]
Setting chains to policy ACCEPT: filter [ OK ]
Unloading iptables modules: [ OK ]
Applying iptables firewall rules: [ OK ]
Loading additional iptables modules: ip_conntrack_netbios_n[FAILED]

NOTE: The “FAILED” error above doesn’t seem to be a problem.

Check the status of port 1040:

service iptables status

Output:

Table: filter
Chain INPUT (policy ACCEPT)
num target prot opt source destination
1 ACCEPT 47 -- 0.0.0.0/0 0.0.0.0/0
2 RH-Firewall-1-INPUT all -- 0.0.0.0/0 0.0.0.0/0

Chain FORWARD (policy ACCEPT)
num target prot opt source destination
1 RH-Firewall-1-INPUT all -- 0.0.0.0/0 0.0.0.0/0

Chain OUTPUT (policy ACCEPT)
num target prot opt source destination

Chain RH-Firewall-1-INPUT (2 references)
num target prot opt source destination
1 ACCEPT all -- 0.0.0.0/0 0.0.0.0/0
2 ACCEPT icmp -- 0.0.0.0/0 0.0.0.0/0 icmp type 255
3 ACCEPT esp -- 0.0.0.0/0 0.0.0.0/0
4 ACCEPT ah -- 0.0.0.0/0 0.0.0.0/0
5 ACCEPT udp -- 0.0.0.0/0 224.0.0.251 udp dpt:5353
6 ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:631
7 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:631
8 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:1040
9 ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED
10 ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 state NEW udp dpt:694
11 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:22
12 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:80
13 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:443
14 REJECT all -- 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited

Running “nagios-statd” opens port 1040 on Parcel1 and listens for commands to be initiated by nagios-stat on the Nagios server.
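Judging from the nagios-statd source listed further down, the daemon reads a single line naming a function (disk, user, proc, uptime, swap, raid or version), runs it and writes the output back, prefixing failures with ERROR. A minimal Python 3 client for testing that exchange by hand might look like the sketch below; the function names are my own and this is not the nagios-stat plugin itself.

```python
import socket


def build_request(function_name):
    """nagios-statd reads one line and strips it, so send the name plus a newline."""
    return function_name.strip().encode("ascii") + b"\n"


def is_error(reply):
    """Replies that start with ERROR signal a failed, unknown or denied request."""
    return reply.startswith("ERROR")


def query_statd(host, function_name, port=1040, timeout=10.0):
    """Send one request to nagios-statd and return everything it writes back."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(function_name))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the daemon closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

For example, query_statd("parcel1", "raid") should return the check_raid report if port 1040 is reachable and the calling host is allowed to connect.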

On the Nagios server, in a file called “remote.orig.cfg”, there are commands defined using “nagios-stat”. NOTE: These are from a working server and haven’t been modified to work with mpt, so some changes may need to be made. This is just an example of the interaction between the Nagios server and client.

Example:

define command{
    command_name check_remote_raid
    command_line $USER1$/nagios-stat -w $ARG1$ -c $ARG2$ -p $ARG3$ raid $HOSTADDRESS$
}

This command defined above is used in the “services.cfg” file.

Example:

define service{
    use matraex-template
    host_name mtx-lilac
    service_description Lilac /data Raid
    check_command check_remote_raid!1!1!1040
}

The three files needed on the C6100 node are:

/customcommands/check_raid (contents below) -rwxr-xr-x

/customcommands/nagios-statd (contents below) -rwxr-xr-x

/etc/init.d/nagios-statd (contents below) -rwxr--r--

Creating the soft links:

ln -s /etc/init.d/nagios-statd /etc/rc.d/rc3.d/K01nagios-statd

ln -s /etc/init.d/nagios-statd /etc/rc.d/rc3.d/S99nagios-statd

The -s = soft, and -f if used, forces overwrite.

/rc3.d/ designates runlevel 3

So when you do this:

ls -lt /customcommands/nagios-statd /etc/init.d/nagios-statd /customcommands/check_raid /etc/rc.d/rc3.d/*nagios-statd

This is what you should see:

lrwxrwxrwx 1 root root 22 Mar 6 08:08 /etc/rc.d/rc3.d/K01nagios-statd -> ../init.d/nagios-statd
-rwxr-xr-x 1 root root 365 Mar 6 07:59 /customcommands/check_raid
lrwxrwxrwx 1 root root 22 Mar 6 07:52 /etc/rc.d/rc3.d/S99nagios-statd -> ../init.d/nagios-statd
-rwxr-xr-x 1 root root 649 Mar 6 07:51 /etc/init.d/nagios-statd
-rwxr-xr-x 1 root root 9468 Mar 5 12:05 /customcommands/nagios-statd

Script Files:

NOTE: Here’s a little fix that helped me out. I had originally pasted these scripts into a DOS/Windows editor (wordpad) and it added DOS-type returns to the file, resulting in an error:

-bash: ./nagios-statd: /bin/sh^M: bad interpreter: No such file or directory

If you encounter this, do this:

Open the file in vi

hit “:” to go into command mode

enter “set fileformat=unix”

then :wq to save and quit.
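If several files were pasted through a DOS/Windows editor, the same fix can be scripted. This is a small Python sketch of what “set fileformat=unix” does to the line endings; the function names are my own.

```python
def to_unix_line_endings(data):
    """Replace DOS CRLF line endings with Unix LF in a bytes object."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite the file in place; return True if anything was changed."""
    with open(path, "rb") as handle:
        original = handle.read()
    converted = to_unix_line_endings(original)
    if converted != original:
        with open(path, "wb") as handle:
            handle.write(converted)
    return converted != original
```

Calling convert_file("/customcommands/nagios-statd") would remove the ^M that causes the “bad interpreter” error; the file mode is left untouched because the file is rewritten in place.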

/customcommands/check_raid:

#!/bin/bash

EXECFILE=/usr/sbin/mpt-status

if [ ! -e $EXECFILE ] ; then
    echo
    echo "Error $EXECFILE is not installed, please install before running"
    echo
    echo "Usage $0";
    echo
    exit 10
fi

echo `$EXECFILE -n -s|awk '/OPTIMAL/ {print $1, "OK"}; /ONLINE/ {print $1, "OK"}; /DEGRADED/ {print $1, "FAILURE"}; /scsi/ {print $2}; /MISSING/ {print $1, "FAILURE"} '`

/customcommands/nagios-statd:

#!/usr/bin/python

import getopt, os, sys, signal, socket, SocketServer

class Functions:
    "Contains a set of methods for gathering data from the server."

    def __init__(self):
        self.nagios_statd_version = 3.09
        # As of right now, the commands are for df, who, proc, uptime, and swap.
        commandlist = {}
        commandlist['AIX'] = ("df -Ik","who | wc -l","ps ax","uptime","lsps -sl | grep -v Paging | awk '{print $2}' | cut -f1 -d%")
        commandlist['BSD/OS'] = ("df","who | wc -l","ps -ax","uptime",None)
        commandlist['CYGWIN_NT-5.0'] = ("df -P",None,"ps -s -W | awk '{printf(\"%6s%6s%3s%6s%s\\n\",$1,$2,\" S\",\" 0:00\",substr($0,22))}'",None,None)
        commandlist['CYGWIN_NT-5.1'] = commandlist['CYGWIN_NT-5.0']
        commandlist['FreeBSD'] = ("df -k","who | wc -l","ps ax","uptime","swapinfo | awk '$1!~/^Device/{print $5}'")
        commandlist['HP-UX'] = ("bdf -l","who -q | grep \"#\"","ps -el","uptime",None)
        commandlist['IRIX'] = ("df -kP","who -q | grep \"#\"","ps -e -o \"pid tty state time comm\"","/usr/bsd/uptime",None)
        commandlist['IRIX64'] = commandlist['IRIX']
        commandlist['Linux'] = ("df -P","who -q | grep \"#\"","ps ax","uptime","free | awk '$1~/^Swap:/{print ($3/$2)*100}'","/customcommands/check_raid")
        commandlist['NetBSD'] = ("df -k","who | wc -l","ps ax","uptime","swapctl -l | awk '$1!~/^Device/{print $5}'")
        commandlist['NEXTSTEP'] = ("df","who | /usr/ucb/wc -l","ps -ax","uptime",None)
        commandlist['OpenBSD'] = ("df -k","who | wc -l","ps -ax","uptime","swapctl -l | awk '$1!~/^Device/{print $5}'")
        commandlist['OSF1'] = ("df -P","who -q | grep \"#\"","ps ax","uptime",None)
        commandlist['SCO-SV'] = ("df -Bk","who -q | grep \"#\"","ps -el -o \"pid tty s time args\"","uptime",None)
        commandlist['SunOS'] = ("df -k","who -q | grep \"#\"","ps -e -o \"pid tty s time comm\"","uptime","swap -s | tr -d -s -c [:digit:][:space:] | nawk '{print ($3/($3+$4))*100}'")
        commandlist['UNIXWARE2'] = ("/usr/ucb/df","who -q | grep \"#\"","ps -el | awk '{printf(\"%6d%9s%2s%5s %s\\n\",$5,substr($0, 61, 8),$2,substr($0,69,5),substr($0,75))}'","echo `uptime`, load average: 0.00, `sar | awk '{oldidle=idle;idle=$5} END {print 100-oldidle}'`,0.00",None)
        # Now to make commandlist with the correct one for your OS.
        try:
            self.commandlist = commandlist[os.uname()[0]]
        except KeyError:
            print "Your platform isn't supported by nagios-statd - exiting."
            sys.exit(3)

    # Below are the functions that the client can call.
    def disk(self):
        return self.__run(0)

    def proc(self):
        return self.__run(2)

    def swap(self):
        return self.__run(4)

    def uptime(self):
        return self.__run(3)

    def user(self):
        return self.__run(1)

    def raid(self):
        return self.__run(5)

    def version(self):
        i = "nagios-statd " + str(self.nagios_statd_version)
        return i

    def __run(self,cmdnum):
        # Unmask SIGCHLD so popen can detect the return status (temporarily)
        signal.signal(signal.SIGCHLD, signal.SIG_DFL)
        outputfh = os.popen(self.commandlist[cmdnum])
        output = outputfh.read()
        returnvalue = outputfh.close()
        signal.signal(signal.SIGCHLD, signal.SIG_IGN)
        if (returnvalue):
            return "ERROR %s " % output
        else:
            return output

class NagiosStatd(SocketServer.StreamRequestHandler):
    "Handles connection initialization and data transfer (as daemon)"

    def handle(self):
        # Check to see if user is allowed
        if self.__notallowedhost():
            self.wfile.write(self.error)
            return 1
        if not hasattr(self,"generichandler"):
            self.generichandler = GenericHandler(self.rfile,self.wfile)
        self.generichandler.run()

    def __notallowedhost(self):
        "Compares list of allowed users to client's IP address."
        if hasattr(self.server,"allowedhosts") == 0:
            return 0
        for i in self.server.allowedhosts:
            if i == self.client_address[0]: # Address is in list
                return 0
            try: # Do an IP lookup of host in blocked list
                i_ip = socket.gethostbyname(i)
            except:
                self.error = "ERROR DNS lookup of blocked host \"%s\" failed. Denying by default." % i
                return 1
            if i_ip != i: # If address in list isn't an IP
                if socket.getfqdn(i) == socket.getfqdn(self.client_address[0]):
                    return 0
        self.error = "ERROR Client is not among hosts allowed to connect."
        return 1

class GenericHandler:

    def __init__(self,rfile=sys.stdin,wfile=sys.stdout):
        # Create functions object
        self.functions = Functions()
        self.rfile = rfile
        self.wfile = wfile

    def run(self):
        # Get the request from the client
        line = self.rfile.readline()
        line = line.strip()
        # Check for appropriate requests from client
        if len(line) == 0:
            self.wfile.write("ERROR No function requested from client.")
            return 1
        # Call the appropriate function
        try:
            output = getattr(self.functions,line)()
        except AttributeError:
            error = "ERROR Function \"" + line + "\" does not exist."
            self.wfile.write(error)
            return 1
        except TypeError:
            error = "ERROR Function \"" + line + "\" not supported on this platform."
            self.wfile.write(error)
            return 1
        # Send output
        if output.isspace():
            error = "ERROR Function \"" + line + "\" returned no information."
            self.wfile.write(error)
            return 1
        elif output == "ERROR":
            error = "ERROR Function \"" + line + "\" exited abnormally."
            self.wfile.write(error)
        else:
            for line in output:
                self.wfile.write(line)

class ReUsingServer (SocketServer.ForkingTCPServer):
    allow_reuse_address = True

class Initialization:
    "Methods for interacting with user - initial code entry point."

    def __init__(self):
        self.port = 1040
        self.ip = ''
        # Run this through Functions initially, to make sure the platform is supported.
        i = Functions()
        del(i)

    def getoptions(self):
        "Parses command line"
        try:
            opts, args = getopt.getopt(sys.argv[1:], "a:b:ip:P:Vh", ["allowedhosts=","bindto=","inetd","port=","pid=","version","help"])
        except getopt.GetoptError, (msg, opt):
            print sys.argv[0] + ": " + msg
            print "Try '" + sys.argv[0] + " --help' for more information."
            sys.exit(3)
        for option,value in opts:
            if option in ("-a","--allowedhosts"):
                value = value.replace(" ","")
                self.allowedhosts = value.split(",")
            elif option in ("-b","--bindto"):
                self.ip = value
            elif option in ("-i","--inetd"):
                self.runfrominetd = 1
            elif option in ("-p","--port"):
                self.port = int(value)
            elif option in ("-P","--pid"):
                self.pidfile = value
            elif option in ("-V","--version"):
                self.version()
                sys.exit(3)
            elif option in ("-h","--help"):
                self.usage()

    def main(self):
        # Retrieve command line options
        self.getoptions()
        # Just splat to stdout if we're running under inetd
        if hasattr(self,"runfrominetd"):
            server = GenericHandler()
            server.run()
            sys.exit(0)
        # Check to see if the port is available
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((self.ip, self.port))
            s.close()
            del(s)
        except socket.error, (errno, msg):
            print "Unable to bind to port %s: %s - exiting." % (self.port, msg)
            sys.exit(2)
        # Detach from terminal
        if os.fork() == 0:
            # Make this the controlling process
            os.setsid()
            # Be polite and chdir to /
            os.chdir('/')
            # Try to close all open filehandles
            for i in range(0,256):
                try: os.close(i)
                except: pass
            # Redirect the offending filehandles
            sys.stdin = open('/dev/null','r')
            sys.stdout = open('/dev/null','w')
            sys.stderr = open('/dev/null','w')
            # Set the path
            os.environ["PATH"] = "/bin:/usr/bin:/usr/local/bin:/usr/sbin"
            # Reap children automatically
            signal.signal(signal.SIGCHLD, signal.SIG_IGN)
            # Save pid if user requested it
            if hasattr(self,"pidfile"):
                self.savepid(self.pidfile)
            # Create a forking TCP/IP server and start processing
            server = ReUsingServer((self.ip,self.port),NagiosStatd)
            if hasattr(self,"allowedhosts"):
                server.allowedhosts = self.allowedhosts
            server.serve_forever()
        # Get rid of the parent
        else:
            sys.exit(0)

    def savepid(self,file):
        try:
            fh = open(file,"w")
            fh.write(str(os.getpid()))
            fh.close()
        except:
            print "Unable to save PID file - exiting."
            sys.exit(2)

    def usage(self):
        print "Usage: " + sys.argv[0] + " [OPTION]"
        print "nagios-statd daemon - remote UNIX system monitoring tool for Nagios.\n"
        print "-a, --allowedhosts=HOSTS Comma delimited list of IPs/hosts allowed to connect."
        print "-b, --bindto=IP IP address for the daemon to bind to."
        print "-i, --inetd Run from inetd."
        print "-p, --port=PORT Port to listen on."
        print "-P, --pid=FILE Save pid to FILE."
        print "-V, --version Output version information and exit."
        print " -h, --help Print this help and exit."
        sys.exit(3)

    def version(self):
        i = Functions()
        print "nagios-statd %.2f" % i.nagios_statd_version
        print "os.uname()[0] = %s " % os.uname()[0]
        print "Written by Nick Reinking\n"
        print "Copyright (C) 2002 Nick Reinking"
        print "This is free software. There is NO warranty; not even for MERCHANTABILITY or"
        print "FITNESS FOR A PARTICULAR PURPOSE."
        print "\nNagios is a trademark of Ethan Galstad."

if __name__ == "__main__":
    # Check to see if running Python 2.x+ / needed because getfqdn() is Python 2.0+ only
    if (int(sys.version[0]) < 2):
        print "nagios-statd requires Python version 2.0 or greater."
        sys.exit(3)
    i = Initialization()
    i.main()
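The allowed-hosts check in __notallowedhost above can be read in isolation as the following minimal sketch. It is written for Python 3 rather than the Python 2 of the script, and the function name is_allowed and its resolve and fqdn parameters are hypothetical names introduced here so the logic can be exercised without real DNS. They are not part of nagios-statd.

```python
import socket


def is_allowed(client_ip, allowed_hosts,
               resolve=socket.gethostbyname, fqdn=socket.getfqdn):
    """Return True if client_ip may connect, mirroring __notallowedhost.

    allowed_hosts is None when no -a option was given (everyone allowed).
    resolve and fqdn are injectable only for illustration and testing.
    """
    if allowed_hosts is None:
        return True
    for host in allowed_hosts:
        # Exact match on the literal entry (usually an IP address).
        if host == client_ip:
            return True
        try:
            host_ip = resolve(host)
        except OSError:
            # Like the script, deny as soon as any entry fails to resolve.
            return False
        # Entry was a host name, not an IP: compare fully qualified names.
        if host_ip != host and fqdn(host) == fqdn(client_ip):
            return True
    return False
```

One consequence worth noticing is that a single unresolvable name in the list denies the client even if a later entry would have matched, so it may be safest to list plain IP addresses after -a.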

/etc/init.d/nagios-statd:

#!/bin/sh
#
# This file should have uid root, gid sys and chmod 744
#
if [ ! -d /usr/bin ]
then # /usr not mounted
    exit
fi

killproc() { # kill the named process(es)
    pid=`/bin/ps -e |
        /bin/grep -w $1 |
        /bin/sed -e 's/^ *//' -e 's/ .*//'`
    [ "$pid" != "" ] && kill $pid
}

# Start/stop processes required for netsaint_statd server
case "$1" in
'start')
    /customcommands/nagios-statd -a <IP of Allowed Nagios Server>,<IP of Test Workstation> -p 1040
    ;;
'stop')
    killproc nagios-statd
    ;;
*)
    echo "Usage: /etc/init.d/nagios-statd { start | stop }"
    ;;
esac

 

Testing:

As you can see in the script file above, I’ve added the IP Address of a test workstation. This will allow me to simply telnet to a node in the C6100 and execute one of the commands defined in this section of the /customcommands/nagios-statd script:

    # Below are the functions that the client can call.
    def disk(self):
        return self.__run(0)

    def proc(self):
        return self.__run(2)

    def swap(self):
        return self.__run(4)

    def uptime(self):
        return self.__run(3)

    def user(self):
        return self.__run(1)

    def raid(self):
        return self.__run(5)
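How a typed word such as "raid" ends up calling one of these methods can be sketched as follows. This is a simplified Python 3 illustration of the getattr lookup that GenericHandler.run performs. The Functions class here is a stand-in that returns canned text, and the dispatch function is a hypothetical name, not part of the script.

```python
import io


class Functions:
    """Stand-in for the real Functions class; returns canned text."""

    def raid(self):
        return "vol_id:0 OK phys_id:2 OK phys_id:3 OK 100% 100%\n"


def dispatch(functions, rfile, wfile):
    """Read one request line and call the method of that name."""
    name = rfile.readline().strip()
    if not name:
        wfile.write("ERROR No function requested from client.")
        return 1
    try:
        # Look the method up by the name the client sent, then call it.
        output = getattr(functions, name)()
    except AttributeError:
        wfile.write('ERROR Function "%s" does not exist.' % name)
        return 1
    wfile.write(output)
    return 0


if __name__ == "__main__":
    reply = io.StringIO()
    dispatch(Functions(), io.StringIO("raid\n"), reply)
    print(reply.getvalue(), end="")
```

Adding a new check to the daemon therefore only requires adding a method to Functions. The name the client sends is the method name.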

At your workstation, telnet to <Node IP Address> 1040

When connected, the screen will be blank.

Type “raid”. The screen won’t echo this.

When you hit Enter, you should see:

vol_id:0 OK phys_id:2 OK phys_id:3 OK 100% 100%
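If telnet is not installed on the test workstation, the same check can be scripted. The following is a minimal Python 3 sketch, assuming the daemon listens on port 1040 as configured above and closes the connection after replying. The function name query_statd is a hypothetical helper, not part of nagios-statd.

```python
import socket


def query_statd(host, command, port=1040, timeout=10.0):
    """Send one command to nagios-statd and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The daemon reads a single line, so terminate the command with \n.
        sock.sendall(command.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")
```

Calling query_statd("<Node IP Address>", "raid") from an allowed workstation should return the same line that telnet displays. Replace the placeholder with the address of the node being tested.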

Now you’re ready to move on to the Nagios configuration.

Matt Long

03/06/2015
