COMMANDDUMP – Monitor File for error – Ding if found

An elusive error was occurring that we needed to be notified of immediately.  The fastest way to catch it was to run the following script at a bash prompt so that when the error happened the script would beep until we stopped it.

while true; do ret=$(tail -n 1 error-error-main.log | grep -i FATAL); if [ "$ret" != "" ]; then echo "$ret"; echo -en "\0007"; fi; sleep 1; done
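The same check can be wrapped in a one-shot function, which makes the loop easier to test and reuse. This is just a sketch; the function name is mine and the log filename is whatever you need to watch.

```shell
#!/bin/sh
# check_fatal LOGFILE -- one pass of the monitor loop above: if the last
# line of LOGFILE contains FATAL (case-insensitive), print it, ring the
# terminal bell on stderr, and return 0; otherwise return 1.
check_fatal() {
    logfile=$1
    ret=$(tail -n 1 "$logfile" | grep -i FATAL)
    if [ -n "$ret" ]; then
        echo "$ret"
        printf '\a' >&2   # bell goes to stderr so captured output stays clean
        return 0
    fi
    return 1
}

# The original loop then becomes:
# while true; do check_fatal error-error-main.log; sleep 1; done
```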

Load problems after disk replacement on an OCFS2 and DRBD system

Notes blurb on investigating a complex issue.  The issue was resolved, though without a concise explanation; notes are kept here in order to continue the investigation in the case it happens again.

Recently,    we had a disk failure on one of two SAN servers utilizing MD, OCFS2 and drbd to keep two servers synchronized.

We will call the two systems A and B.

The disk was replaced on System A, which required a reboot in order for the system to recognize the new disk; then we had to --re-add the disk to the MD array.  Once this happened, the disk started to rebuild.  The OCFS2 and DRBD layers did not seem to have any issue rebuilding quickly as soon as the servers came back up; the layers of redundancy made it fairly painless.  However, the load on System B went up to 2.0+ and on System A up to 7.0+!

This slowed down System B significantly and made System A completely unusable.

I took a look at many different tools to try to debug this:

  • top
  • iostat -x 1
  • iotop
  • lsof
  • atop

The dynamics of how we use the redundant SANs should be taken into account here.

We mount System B to an application server via NFS, and reads and writes are done to System B.  This makes it odd that System A is having such a hard time keeping up; it only has to handle the DRBD and OCFS2 communication in order to stay synced (System B is handling reads and writes, where System A only has to handle writes on the DRBD layer when changes are made).  iotop shows this between 5 and 40 K/s, which seemed minimal.

Nothing pointed to any kind of direct indicator of what was causing the 7+ load on System A.  The top two processes seemed to be drbd_r_r0 and o2hb-XXXXXX, which take up minimal amounts of read and write.

The command to run on a disk to see what is happening is

#iotop -oa

This command shows you only the processes that have done some amount of disk reads or writes (-o), and it shows them cumulatively (-a) so you can easily see what is using the I/O on the system.  From this I figured out that a majority of the writes on the system were going to the system drive.

What I found from this is that the iotop tool does not show the activity that is occurring at the DRBD / OCFS2 level.  I was able to see on System B, where the NFS drive was connected, that the nfsd process was writing multiple MB of information when I would write to the NFS drive (cat /dev/zero > tmpfile), but I would see only 100K or so written to DRBD on System B, and nothing on System A; however, I would be able to see the file on System A.

I looked at the CPU load on System A when running the huge write, and it increased by about 1 (from 7+ to 8+), so it was doing some work; iotop just did not monitor it.

So I looked to iostat to find out if it would allow me to see the writes to the actual devices in the MD array.

I ran

#iostat -x 5

so I could see what was being written to the devices.  Here I could see that the disk utilization on System A and System B was similar (about 10% per drive in the MD array), and the await time on System B was a bit higher than on System A.  When I did this test I caused the load to go up to about 7 on all servers (application server, System A and System B).  Stopping the write made the load on the application server and on System B go back down.

While this did not give me the cause, it helped me see that disk writes on System A are trackable through iostat, and since no writes were occurring when I ran iostat -x 5, I have to assume that there is some sort of other overhead causing the huge load.  With nothing else I felt I could test, I just rebooted Server A.

Lo and behold, the load dropped; writing huge files and deleting huge files was no longer an issue.  The only thing I could think was that a large amount of traffic of something was being transferred back and forth to some ‘zombie’ server or something.  (I had attempted to restart OCFS2 and DRBD and the system wouldn’t allow that either, which seems to indicate a problem with some process being held open by a zombie process.)

In the end, this is the best description I can give of the problem.  While this is not a real resolution, I publish it so that when an issue like this comes up in the future, we will be able to investigate three different possibilities in order to get closer to figuring out the true issue.

  1. Investigate the network traffic (using ntop for traffic,   tcpdump for contents,  and eth for total stats and possible errors)
  2. Disconnect / Reconnect the drbd and ocfs2 pair to stop the synchronization and watch the load balance to see if that is related to the issue.
  3. Attempt to start and stop the drbd and ocfs2 processes and debug any problems with that process. (watch the traffic or other errors related to those processes)

Find out which PHP packages are installed on ubuntu / debian

As we have moved or upgraded sites from one server to another, sometimes we have needed to know which PHP5 dependencies were installed on one server (servera), so that we could make sure those same dependencies were met on another server (serverb).

To do this we can run a simple command line tool on servera

servera# echo `dpkg -l|awk '$1 ~ /ii/ && $2 ~ /php5/{print $2}'`
libapache2-mod-php5 php5 php5-cli php5-common php5-curl php5-gd php5-mcrypt php5-mysql php5-pgsql 

and then we copy the output and paste it after the apt-get install command on serverb

serverb# apt-get install libapache2-mod-php5 php5 php5-cli php5-common php5-curl php5-gd php5-mcrypt php5-mysql php5-pgsql 

Don’t forget to reload Apache, as some packages do not reload it automatically

serverb# /etc/init.d/apache2 reload

Utility – Bulk Convert the Unix Timestamp in log messages To a Readable Date

I have often run into the need to convert a large list of timestamps from Unix Timestamp to a readable date.

Batch Unix Timestamp Convert

Often times this is simply a need that I have when receiving an error message from a server,  or when reviewing log files which only use Unix Timestamps.

So I created a simple utility: just paste in your text from the log file, and the utility will search the text for timestamps listed as the first part of each line and convert each timestamp to a date.

While this might be useful at some point as an automated process,  for now I just use it when I need it.

I am documenting the tool here with a link for myself (or any one else that may need it) so that it is simple to find.

http://matraex.com/batch-timestamp-to-date.php

Possible future upgrades to this utility will likely search out Unix Timestamps anywhere in the text and convert them,  instead of just at the first of the line.
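For reference, the same first-field conversion can be done at the shell; this is a sketch assuming GNU date and log lines that start with a Unix timestamp, reading from stdin.

```shell
#!/bin/sh
# Convert a leading Unix timestamp on each line to a readable UTC date.
# Everything after the first field passes through unchanged.
while read -r ts rest; do
    printf '%s %s\n' "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S')" "$rest"
done
```

Run it as `sh convert.sh < error.log` (filename assumed).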

One Line WordPress Install

To install the latest version of WordPress to your current working directory in Linux you can run this command

#wget -O - https://wordpress.org/latest.tar.gz |tar --strip-components=1 -xvzf - wordpress/

Just make sure you are in your install directory when you run it

#cd /var/www/html


my btmp file is huge on linux, what do I do

The /var/log/btmp file tracks all of the failed login attempts on your machine.  If it is huge it probably means someone is trying to brute-force attack your computer.

The file is binary, so you can not just view it; you have to use

#lastb|less

Most likely you will find that someone has been attempting to repeatedly hack your computer; consider setting up a firewall which limits the IP addresses that are allowed to connect to your SSH port.

You could also install DenyHosts

#apt-get install denyhosts

One issue that can occur is that if you are getting attacked, the log gets too large.

Most likely your logrotate.conf file has a /var/log/btmp entry in it.   Update this file to rotate and compress the log file more frequently (see the logrotate documentation)
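As a sketch of what a tighter entry might look like (the rotation period, count, and create mode here are assumptions; check your distribution's default btmp stanza):

```
/var/log/btmp {
    weekly           # rotate weekly instead of the typical monthly
    rotate 4         # keep about a month of compressed history
    compress
    missingok
    minsize 1M
    create 0660 root utmp
}
```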

The Linux find command is awesomely powerful!

At least I think it is awesome.  Here are a couple of useful commands which highlight some of its more powerful features.  (These are just ones I used recently; as soon as you start chaining sed, awk, sort and uniq, the commands get even more powerful.)

Changing the ownership of all files which do not have the correct ownership (useful to me when doing a server migration where the postfix user was uid 102 and changed to uid 105)
This command also lists the details of the file before it runs the chown command on it.

find . -not -uid 105 -ls -exec chown postfix {} \;

Get a list of all of the files that have been modified in the last 20 minutes

find . -mmin -20

Find all log files older than 60 days and their sizes; I use awk to sum the sizes up.

find /data/webs/ -path '*/logs/*' -name '*log' -mtime +60 -exec du {} \; |awk '{t+=$1; print t" "$0}'

Often I just turn around and delete these files if I do not need them; the command above helps me know what kind of space I would be recovering and if there are any HUGE file size offenders.

find /data/webs/ -path '*/logs/*' -name '*log' -mtime +60 -delete
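Before adding -delete, it can be worth previewing the hit list with the same pattern; a minimal sketch:

```shell
#!/bin/sh
# Preview what -delete would remove: print the matches and count them first.
find /data/webs/ -path '*/logs/*' -name '*log' -mtime +60 -print | wc -l
```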

How to skip certain directories when running updatedb

To skip certain directories when running updatedb edit

/etc/updatedb.conf

and add the directories you want to skip to the PRUNEPATHS configuration variable (PRUNEFS takes filesystem types; PRUNEPATHS takes directory paths)

PRUNEPATHS="/tmp /my/giant/directory"

That is it.  Then run updatedb again and it will skip the listed directories; in my case updatedb ran much faster.

Quick script to install WordPress from the Linux command line

I find that it is much faster to download and install WordPress from the command line in Linux than attempting to use FTP

By running the following script in a new directory, you will:

  • download the latest version of WordPress
  • untar / unzip it
  • move the files into the current directory
  • cleanup the unused empty directory
  • and update the ownership of all of the files to match the directory you are already in.
wget https://wordpress.org/latest.tar.gz
tar -xvzf latest.tar.gz
mv wordpress/* .
rm -rf wordpress/ latest.tar.gz
chown -R `stat . --printf '%u.%g'` *

AWK script to show number of apache hits per minute

Documenting this script,  to save me the time of rewriting it time and again on different servers

 tail -n100000 www.access.log|awk '/09\/Apr/{print $4}'|awk -F'[' '{print $2}'|awk -F':' '{print $1":"$2":"$3}' |sort -n|uniq -c

This shows output like this

 21 09/Apr/2015:12:48
 21 09/Apr/2015:12:49
 21 09/Apr/2015:12:50
 21 09/Apr/2015:12:51
 21 09/Apr/2015:12:52
 711 09/Apr/2015:12:53
 1371 09/Apr/2015:12:54
 1903 09/Apr/2015:12:55
 2082 09/Apr/2015:12:56
 2256 09/Apr/2015:12:57
 2123 09/Apr/2015:12:58
 1951 09/Apr/2015:12:59
 1589 09/Apr/2015:13:00
 1427 09/Apr/2015:13:01
 811 09/Apr/2015:13:02

SSL Cipher Suites – Apache config for IE 11

In past posts I showed how I had followed some suggestions from Qualys on configuring Apache to use only specific ciphers in order to pass all of the required security scans.

However, it turns out that blindly using their list of ciphers led to another problem (displaying the page in IE 11), which I describe the fix to below.

In addition, the process I go through below can help you troubleshoot, and enable or disable the ciphers, for any situation and browser.

On this page:
https://community.qualys.com/blogs/securitylabs/2013/08/05/configuring-apache-nginx-and-openssl-for-forward-secrecy

They suggest setting this SSLCipherSuite:

 EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS !RC4

However, I found IE 11 was showing “This web page can not be displayed” on Windows 7 and Windows 2008 Server (probably others as well).

I figured out that the problem was the cipher suite by commenting out the SSLCipherSuite line in Apache and restarting, after which the page loaded.

So the next step was, with the SSLCipherSuite line still commented out, to run the ssllabs test:
https://www.ssllabs.com/ssltest/

the result of which showed some details about the cipher suites used by different browsers.  I would use this tool to make sure you have the correct cipher suite for any and all browsers, and to exclude any older insecure browsers.

If you look down the report to the “Handshake Simulation” portion you will find a listing of browsers with the cipher each used.  IE 11 / Win 7 was working even before I noticed the ‘can not be displayed’ error, so I went on a hunch and decided to try to enable the IE 8-10 / Win 7 option, which showed

 TLS_RSA_WITH_AES_256_CBC_SHA

I googled “openssl TLS_RSA_WITH_AES_256_CBC_SHA”, which brought me to the OpenSSL page listing all of the ciphers, and there I found “AES256-SHA”, the name I needed to include in the Apache SSLCipherSuite directive

https://www.openssl.org/docs/apps/ciphers.html

Next, to confirm that this cipher was even available on my server, I ran this command

openssl ciphers AES256-SHA

which returned a result showing that the cipher was indeed an option on the server

So I added it towards the end, and the resulting SSLCipherSuite directive I have is:

SSLCipherSuite "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA AES256-SHA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS !RC4"

And now I can load the webpage in the IE 11 browser.

Note that when I ran the ssllabs.com test again, it downgraded the site to an A-, probably because the cipher does not offer Forward Secrecy (notated with a small orange ‘No FS’ on the report).

I decided that this is an okay grade in order to allow IE 11 to access the site, but hopefully Microsoft figures it out.

SSL Vulnerability and Problem Test – Online and Command Line

There are many vulnerabilities out there,  and there seems to be no single test for all of them.

When working to correct SSL issues, some of the more comprehensive tests test EVERYTHING.  While this is good, it can also make it difficult to test the smaller incremental changes that we make as system administrators.

This blog post is a way to collect in one place the links and methods we can use to quickly test individual failures.

The big test, which only takes a minute or so but is somewhat bloated for individual tests, is ssllabs.com.  You will find out about most failures here and even get a grade.

http://ssllabs.com

But you won’t find them all, and it is difficult to quickly test small changes.  So here are some instant tests.

If you have an SSL chain issue:

openssl s_client -connect example.com:443

To test for CVE-2014-0224, otherwise known as the CCS Injection vulnerability, enter your domain here:

http://ccsbug.exposed/

To test for CVE-2014-0160, or Heartbleed:

http://possible.lv/tools/hb/

Verify ssl certificate chain using openssl

SSL Certificates ‘usually’ work and show ‘green’ in  browsers,    even if the full certificate chain is not correctly configured in apache.

You can use tools such as SSL Labs (link) or run a PCI ASV check on your site to find out if you are compliant, but a quicker way is to use openssl from the command line.

Using this command you can quickly verify your SSL certificate and certificate chain from your Linux command line using openssl

openssl s_client -showcerts -connect mydomain.com:443

If you receive a line ‘Verify return code: 0’ at the end of the long output, your chain is working; however, you might receive error 27 if it is not configured correctly.

In order to configure it correctly you will likely need a line in your Apache conf file

 SSLCACertificateFile <yourCAfilename>

in addition to the lines which list your key and cert files

SSLCertificateFile <yourcertfilename>
SSLCertificateKeyFile <yourkeyfilename>

ip tables commands which ‘might’ make your firewall PCI compliant

This is a list of the iptables commands that will setup a minimal firewall which ‘might’ be PCI compliant

This is primarily here to remind me, so I have a reference in the future.

I also include ports for FTP and SSH for a single developer IP, as well as monitoring for a single monitoring server.  The format is simple and can easily be changed for other services.

Be sure to replace ‘my.ip’ with your development IP, and ‘monitoring.ip’ with your monitoring server’s IP.

This is on a Linux Ubuntu machine (of course)

apt-get install iptables iptables-persistent
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -s my.ip/32 -j ACCEPT
iptables -A INPUT -p tcp --dport 21 -s my.ip/32 -j ACCEPT
iptables -A INPUT -p tcp --dport 5666 -s monitoring.ip/32 -j ACCEPT 
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p udp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
iptables -A INPUT -p udp --dport 443 -j ACCEPT
iptables -A INPUT -j REJECT --reject-with icmp-host-unreachable


iptables -A INPUT -p icmp --icmp-type timestamp-request -j DROP
iptables -A OUTPUT -p icmp --icmp-type timestamp-reply -j DROP

iptables -t raw -A PREROUTING -p tcp --tcp-flags FIN,SYN FIN,SYN -j DROP
iptables -t raw -A PREROUTING -p tcp --tcp-flags SYN,RST SYN,RST -j DROP
iptables -t raw -A PREROUTING -p tcp --tcp-flags FIN,SYN,RST,PSH,ACK,URG FIN,PSH,URG -j DROP
iptables -t raw -A PREROUTING -p tcp --tcp-flags FIN,SYN,RST,PSH,ACK,URG FIN -j DROP
iptables -t raw -A PREROUTING -p tcp --tcp-flags FIN,SYN,RST,PSH,ACK,URG NONE -j DROP
iptables -t raw -A PREROUTING -p tcp --tcp-flags FIN,SYN,RST,PSH,ACK,URG FIN,SYN,RST,PSH,ACK,URG -j DROP
iptables-save > /etc/iptables/rules.v4


Installing tsung on an amazon t2.micro server

Install Ubuntu 14.04, then run:

#apt-get update
#apt-get install erlang erlang-dev erlang-eunit
#wget http://tsung.erlang-projects.org/dist/tsung-1.5.1.tar.gz
#tar -xvzf tsung-1.5.1.tar.gz
#cd tsung-1.5.1
#make
#make install
#tsung-recorder start

That is it!!  You are now collecting data and you can run a recording session.

———————–read below for instructions on a failed attempt

Install Ubuntu 14.04,  launch and run

#apt-get update
#apt-get install tsung

This still comes up with a crash report because tsung is attempting to use the wrong version of erlang; it seems that the tsung build expects a different version of erlang, perhaps because the versions that Debian considers the most up to date are not compatible.

—–read below if you want instructions that I started but which did not work, because Amazon’s yum-based AMI sucks compared to Ubuntu

Once you launch and connect to the Amazon server (I chose a small Amazon server which already has the Amazon CLI tools installed)

#sudo yum update
#sudo yum --nogpgcheck install http://tsung.erlang-projects.org/dist/redhat/tsung-1.5.1-1.fc20.x86_64.rpm
#sudo ln -s /usr/bin/erl /bin/erl     #(not sure why the package install erlang in one location and tsung looks in another ....)


Now you are ready to run the tsung command to record your session

#tsung-recorder start -d 7 -P http

But you get the error below…

Starting Tsung recorder on port 8090
[root@ip-172-16-1-236 ~]# {"init terminating in do_boot",{undef,[{tsung_recorder,start,[]},{init,start_it,1},{init,start_em,1}]}}
Crash dump was written to: erl_crash.dump
init terminating in do_boot ()

			

Compare the packages (deb / apache) on two debian/ubuntu servers

Debian / Ubuntu

I worked up this command and I don’t want to lose it

#diff <(dpkg -l|awk '/ii /{print $2}') <(ssh 111.222.33.44 "dpkg -l"|awk '/ii /{print $2}')|grep '>'|sed -e 's/>//'

This command shows a list of all of the packages installed on 111.222.33.44 that are not installed on the current machine

To make this work for you,  just update the ssh 111.222.33.44 command to point to the server you want to compare it with.

I used this command to actually create my apt-get install command

#apt-get install `diff <(dpkg -l|awk '/ii /{print $2}') <(ssh 111.222.33.44 "dpkg -l"|awk '/ii /{print $2}')|grep '>'|sed -e 's/>//'`

Just be careful that you have the same Linux kernels etc,  or you may be installing more than you expect

Apache

The same thing can be done to see if we have the same Apache modules enabled on both machines

diff <(a2query -m|awk '{print $1}'|sort) <(ssh 111.222.33.44 a2query -m|awk '{print $1}'|sort)

This will show you which modules are / are not enabled on the different machines

Installing s3tools on SUSE using yast

We manage many servers with multiple flavors of Linux.   All of them use either apt or yum for package management.

The concept of yast is the same as apt and yum,  but was new to me,   so I thought I would document it.

Run yast, which pulls up an ncurses Control Center; use the arrows to go to Software -> Software Repositories

#yast

Use the arrows or press Alt+A to add a new repository

I selected Specify URL (the default) and press Alt+x to go to the next screen where I typed into the url box

http://s3tools.org/repo/SLE_11/

and then pressed Alt+n to continue.

Now I have a new repository and I press Alt+q to quit.

At the command line I typed

#yast2 -i s3cmd

And s3cmd is installed, all in about 15 minutes!

Disk write speed testing different XenServer configurations – single disk vs mdadm vs hardware raid

In our virtual environment, one of the VM host servers has a hardware RAID controller on it, so naturally we used the hardware RAID.

The server is a Dell 6100, which uses a low-featured LSI SAS RAID controller.
One of the ‘low’ features is that it only allows two RAID volumes at a time.  Also, it does not do RAID 10.

So I decided to create a RAID 1 with two SSD drives for the host, where we would also put the root operating systems for each of the guest VMs.  It would be fast and redundant.  Then we have up to four 1TB disks for the larger data sets.  We have multiple identically configured VM hosts in our pool.

For the data drives, with only one more RAID volume available and no RAID 10, I was limited to either a RAID 5, a mirror with two spares, or a JBOD.  In order to get the most space out of the four 1TB drives, I created the RAID 5.  After configuring two identical VM hosts like this, putting a DRBD Primary/Primary connection between the two of them and an OCFS2 filesystem on top of it, I found I got as low as 3MB/s write speed.  I wasn’t originally thinking about what speeds I would get; I just kind of expected that the speeds would be somewhere around disk write speed, so I suppose I was expecting acceptable speeds between 30 and 80 MB/s.  When I didn’t get that, I realized I was going to have to do some simple benchmarking on my four 1TB drives to see what configuration would give me the best speed and size out of them.

A couple of environment items

  • I will mount the final drive on /data
  • I mount temporary drives in /mnt when testing
  • We use XenServer for our virtual environment,  I will refer to the host as the VM Host or dom0 and to a guest VM as VM Guest or domU.
  • The final speed that we are looking to get is on domU,  since that is where our application will be,  however I will be doing tests in both dom0 and domU environments.
  • It is possible that the domU may be the only VM Guest,  so we will also test raw disk access from domU for the data (and skip the abstraction level provided by the dom0)

So, as I test the different environments I need to be able to create and destroy the local storage on the dom0 VM host.  Here are some commands that help me do it.
I already went through XenCenter and removed all connections and virtual disks on the storage I want to remove; I had to click on the device “Local Storage 2” under the host, click the storage tab, and make sure each was deleted. {VM Host SR Delete Process}

xe sr-list host=server1 #find and keep the uuid of the sr in my case "c2457be3-be34-f2c1-deac-7d63dcc8a55a"
xe pbd-list   sr-uuid=c2457be3-be34-f2c1-deac-7d63dcc8a55a # find and keep the uuid of the pbd connecting sr to dom0 "b8af1711-12d6-5c92-5ab2-c201d25612a9"
xe pbd-unplug  uuid=b8af1711-12d6-5c92-5ab2-c201d25612a9 #unplug the device from the sr
xe pbd-destroy uuid=b8af1711-12d6-5c92-5ab2-c201d25612a9 #destroy the devices
xe sr-forget uuid=c2457be3-be34-f2c1-deac-7d63dcc8a55a #destroy the sr

Now that the SR is destroyed, I can work on the raw disks on the dom0 and do some benchmarking of the speeds of different soft configurations from there.
Once I have made a change to the structure of the disks, I can recreate the SR with a new name on top of whatever solution I come up with by:

xe sr-create content-type=user device-config:device=/dev/XXX host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage XXX on `cat /etc/hostname`" shared=false type=lvm

Replace the XXX with what works for you.

Most of the tests were me just running dd commands and writing down the slowest time, and then what seemed to be about the average time in MB/s.  It seemed like the first time a write was done it was a bit slower, but each subsequent time it was faster; I am not sure if that means an idle disk takes a bit longer to spin up and write.  If that is the case then there are two scenarios: if the disk is often idle it will see the slower number, but if the disk is busy it will see the higher average number, so I tracked them both.  The idle-disk observation was not scientific, and many of my tests did not wait long enough for the disk to go idle in between tests.

The commands I ran for testing were dd commands

dd if=/dev/zero of=/data/speedtest.`date +%s` bs=1k count=1000 conv=fdatasync  #for 1 MB
dd if=/dev/zero of=/data/speedtest.`date +%s` bs=1k count=10000 conv=fdatasync  #for 10 MB
dd if=/dev/zero of=/data/speedtest.`date +%s` bs=1k count=100000 conv=fdatasync  #for 100 MB
dd if=/dev/zero of=/data/speedtest.`date +%s` bs=1k count=1000000 conv=fdatasync  #for 1000 MB
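To avoid retyping those, a small wrapper can run the same write test at a given size against a target directory; a sketch (the function name and defaults are mine, assuming GNU dd):

```shell
#!/bin/sh
# ddtest DIR COUNT -- write COUNT 1k blocks to a scratch file in DIR with
# fdatasync (as in the dd lines above), report dd's summary line, clean up.
ddtest() {
    dir=$1
    count=${2:-100000}   # default ~100 MB
    f="$dir/speedtest.$$"
    dd if=/dev/zero of="$f" bs=1k count="$count" conv=fdatasync 2>&1 | tail -n 1
    rm -f "$f"
}
```

For example, `ddtest /data 1000000` reproduces the 1000 MB run.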

I won’t get into the details of every single command I ran as I was creating the different disk configurations and environments, but I will document a couple of them.

Soft RAID 10 on dom0

dom0>mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb2 --assume-clean
dom0>mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd2 --assume-clean
dom0>mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/md0 /dev/md1 --assume-clean
dom0>mkfs.ext3 /dev/md10
dom0>xe sr-create content-type=user device-config:device=/dev/md10 host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage md10 on `cat /etc/hostname`" shared=false type=lvm

Dual Dom0 Mirror – Striped on DomU for an “Extended RAID 10”

dom0> {VM Host SR Delete Process} #to clean out 'Local storage md10'
dom0>mdadm --manage /dev/md2 --stop
dom0>mkfs.ext3 /dev/md0 && mkfs.ext3 /dev/md1
dom0>xe sr-create content-type=user device-config:device=/dev/md0 host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage md0 on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/md1 host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage md1 on `cat /etc/hostname`" shared=false type=lvm
domU> #at this point use Xen Center to add and attach disks from each of the local md0 and md1 SRs to the domU (they were attached on my systems as xvdb and xvdc)
domU> mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
domU> mkfs.ext3 /dev/md10 && mount /dev/md10 /data

Four disks SR from dom0, soft raid 10 on domU

domU>umount /data
domU> mdadm --manage /dev/md10 --stop
domU> {delete md2 and md1 disks from the storage tab under your VM Host in Xen Center}
dom0> {VM Host SR Delete Process} #to clean out 'Local storage md10'
dom0>mdadm --manage /dev/md2 --stop
dom0>mdadm --manage /dev/md1 --stop
dom0>mdadm --manage /dev/md0 --stop
dom0>fdisk /dev/sda #delete partition and write (d w)
dom0>fdisk /dev/sdb #delete partition and write (d w)
dom0>fdisk /dev/sdc #delete partition and write (d w)
dom0>fdisk /dev/sdd #delete partition and write (d w)
dom0>xe sr-create content-type=user device-config:device=/dev/sda host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sda on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/sdb host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sdb on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/sdc host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sdc on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/sdd host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sdd on `cat /etc/hostname`" shared=false type=lvm
domU>mdadm --create /dev/md10 -l10 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvde /dev/xvdf
domU>mdadm --detail --scan >> /etc/mdadm/mdadm.conf 
domU>echo 100000 > /proc/sys/dev/raid/speed_limit_min #I made the resync go fast, which reduced it from 26 hours to about 3 hours
domU>mdadm --grow /dev/md10 --size=max

Working with GB Large mysql dump files -splitting insert statements

Recently I had to restore a huge database from a huge MySQL dump file.
Since the dump file had all of the create statements mixed with insert statements, I found the recreation of the database took a very long time, with the possibility that it might error out and roll back all of the transactions.

So I came up with the following script which processes the single MySQL dump file and splits it out so we can run the different parts separately.

This creates files that can be run individually called

  • mysql.tblname.beforeinsert
  • mysql.tblname.insert
  • mysql.tblname.afterinsert
cat mysql.dump.sql| awk 'BEGIN{ TABLE="table_not_set"}
{
	if($1=="CREATE" && $2=="TABLE")
	{ 
		TABLE=$3
		gsub("`","",TABLE)
		inserted=0
	}
	if($1!="INSERT") 
	{
		if(!inserted)
		{
			print $0 > "mysql."TABLE".beforeinsert";
		}
		else
		{
			print $0 > "mysql."TABLE".afterinsert";
		}
	} else {
		print $0 > "mysql."TABLE".insert"; 
		inserted=1
	}
}
'
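Before pointing the splitter at a multi-gigabyte dump, it is worth sanity-checking it on a toy file. A small self-contained check (the file names under /tmp are just for the demonstration; the awk body is the same logic as above, condensed):

```shell
# Build a tiny one-table dump and split it the same way as above.
cat > /tmp/toy.sql <<'EOF'
CREATE TABLE `users` (
  id INT
);
INSERT INTO `users` VALUES (1);
INSERT INTO `users` VALUES (2);
ALTER TABLE `users` ADD PRIMARY KEY (id);
EOF

cd /tmp
awk 'BEGIN{ TABLE="table_not_set" }
{
    if($1=="CREATE" && $2=="TABLE") { TABLE=$3; gsub("`","",TABLE); inserted=0 }
    if($1!="INSERT") {
        if(!inserted) print $0 > "mysql."TABLE".beforeinsert"
        else          print $0 > "mysql."TABLE".afterinsert"
    } else {
        print $0 > "mysql."TABLE".insert"
        inserted=1
    }
}' toy.sql

wc -l mysql.users.*   # beforeinsert: 3 lines, insert: 2, afterinsert: 1
```

The three phases can then be loaded separately: the CREATE statements first, then the bulk INSERTs, then the trailing index/constraint statements.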

Setting up DRBD with OCFS2 on an Ubuntu 12.04 server for Primary/Primary

We run in a virtual environment, so we initially considered the virtual kernel for the latest Linux kernels.
We learned that we should NOT use it if we want the OCFS2 distributed-locking file system: the virtual kernel did not ship the correct OCFS2 modules, so we would have had to do a custom build of the modules, and we decided against that. We just went with the latest standard kernel and installed the ocfs2 tools from the package manager.

DRBD, on the other hand, had to be downloaded, compiled and installed regardless of kernel. Here are the procedures; these must be run on each of the pair of machines.
We assume that /dev/xvdb is a similar-sized device on both machines.

apt-get install make gcc flex
wget http://oss.linbit.com/drbd/8.4/drbd-8.4.4.tar.gz
tar xzvf drbd-8.4.4.tar.gz
cd drbd-8.4.4/
./configure --prefix=/usr --localstatedir=/var --sysconfdir=/etc --with-km
make all
make install

Configure both systems to be aware of each other without DNS, in /etc/hosts:

192.168.100.10 server1
192.168.100.11 server2

Create a configuration file at /etc/drbd.d/disk.res

resource r0 {
protocol C;
syncer { rate 1000M; }
startup {
wfc-timeout 15;
degr-wfc-timeout 60;
become-primary-on both;
}
net {
#requires a clustered filesystem (ocfs2) for two primaries mounted simultaneously
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
cram-hmac-alg sha1;
shared-secret "sharedsanconfigsecret";
}
on server1 {
device /dev/drbd0;
disk /dev/xvdb;
address 192.168.100.10:7788;
meta-disk internal;
}
on server2 {
device /dev/drbd0;
disk /dev/xvdb;
address 192.168.100.11:7788;
meta-disk internal;
}
}

 

Configure drbd to start on reboot, verify that DRBD is running on both machines, then reboot and verify again:

update-rc.d drbd defaults
/etc/init.d/drbd start
drbdadm -- --force create-md r0
drbdadm up r0
cat /proc/drbd

At this point you should see that both devices are connected, Secondary/Secondary and Inconsistent/Inconsistent.
Now we start the sync fresh, on server1 only; both sides are blank, so drbd should manage any changes from here on. cat /proc/drbd will show UpToDate/UpToDate.
Then we mark both primary and reboot to verify everything comes back up:

server1>drbdadm -- --clear-bitmap new-current-uuid r0
server1>drbdadm primary r0
server2>drbdadm primary r0
server2>reboot
server1>reboot

I took a snapshot at this point
Now it is time to set up the OCFS2 clustered file system on top of the device. First create /etc/ocfs2/cluster.conf:

cluster:
	node_count = 2
	name = mycluster
node:
	ip_port = 7777
	ip_address = 192.168.100.10
	number = 1
	name = server1
	cluster = mycluster
node:
	ip_port = 7777
	ip_address = 192.168.100.11
	number = 2
	name = server2
	cluster = mycluster

Get the needed packages, configure them and set them up to start at boot. When reconfiguring, remember to enter the name of the cluster you want started at boot (mycluster). Run the below on both machines:

apt-get install ocfs2-tools
dpkg-reconfigure ocfs2-tools
mkfs.ocfs2 -L mycluster /dev/drbd0 #only run this on server1
mkdir -p /data
echo "/dev/drbd0  /data  ocfs2  noauto,noatime,nodiratime,_netdev  0 0" >> /etc/fstab
mount /data
touch /data/testfile.`hostname`
stat /data/testfile.*
rm /data/testfile* # you will only have to run this on one machine
reboot

So, everything should be running on both computers at this point. When things come back up, make sure everything is connected.
You can run these commands from either server

/etc/init.d/o2cb status
cat /proc/drbd


Setting DRBD in Primary / Primary — common commands to sync resync and make changes

As we have been setting up our farm with an NFS share, the DRBD primary/primary connection between servers is important.

We are setting up a group of /customcommands/ that we will be able to run to help us keep track of all of the common status and maintenance commands we use. But when we have to create, make changes to the structure, sync and resync, recover, grow or move the servers, we need to document our 'Best Practices' and how we can recover.
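As a sketch of what one of those custom commands might look like, here is a small status wrapper (the function name and the optional file argument for offline testing are my own; the cs:/ro:/ds: fields are the standard /proc/drbd layout):

```shell
# drbd_summary: condense /proc/drbd down to the connection state (cs:),
# roles (ro:) and disk states (ds:) we check most often.
drbd_summary() {
    local src=${1:-/proc/drbd}
    awk '/cs:/ {
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(cs|ro|ds):/) out = out $i " "
        print out
    }' "$src"
}
```

On a healthy primary/primary pair this should print something like `cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate` for each resource.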

From base Server install

apt-get install gcc make flex
wget http://oss.linbit.com/drbd/8.4/drbd-8.4.1.tar.gz
tar xvfz drbd-8.4.1.tar.gz
cd drbd-8.4.1/
./configure --prefix=/usr --localstatedir=/var --sysconfdir=/etc --with-km
make KDIR=/lib/modules/3.2.0-58-virtual/build
make install

Set up /etc/drbd.d/disk.res:

resource r0 {
protocol C;
syncer { rate 1000M; }
startup {
wfc-timeout 15;
degr-wfc-timeout 60;
become-primary-on both;
}
net {
#requires a clustered filesystem (ocfs2) for two primaries mounted simultaneously
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
cram-hmac-alg sha1;
shared-secret "sharedsanconfigsecret";
}
on server1 {
device /dev/drbd0;
disk /dev/xvdb;
address 192.168.100.10:7788;
meta-disk internal;
}

on server2 {
device /dev/drbd0;
disk /dev/xvdb;
address 192.168.100.11:7788;
meta-disk internal;
}
}

 

Setup your /etc/hosts

192.168.100.10 server1
192.168.100.11 server2

Setup /etc/hostname with

server1

Reboot, verify your settings and SAVE A DRBDVMTEMPLATE. Then clone your VM to a new server called server2.

Setup /etc/hostname with

server2

Start drbd with /etc/init.d/drbd start. This will likely try to create the connection, but this is where we are going to 'play' to learn the commands and how we can sync, etc.

cat /proc/drbd   #shows the status of the connections
server1> drbdadm down r0   #turns off the drbd resource and connection
server2> drbdadm down r0   #turns off the drbd resource and connection
server1> drbdadm -- --force create-md r0  #creates a new set of metadata on the drive, which erases drbd's memory of any past sync status
server2> drbdadm -- --force create-md r0  #creates a new set of metadata on the drive, which erases drbd's memory of any past sync status
server1> drbdadm up r0   #turns on the drbd resource and connection; they should connect without a problem, with no memory of a past sync history
server2> drbdadm up r0   #same on server2
server1> drbdadm -- --clear-bitmap new-current-uuid r0  #creates a new 'disk sync image', essentially telling drbd that the servers are blank so no sync needs to be done; both servers immediately show UpToDate/UpToDate in /proc/drbd
server1> drbdadm primary r0
server2> drbdadm primary r0    #make both servers primary; now when you put a filesystem on /dev/drbd0 you will be able to read and write on both systems as though they are local

So, let's do some failure scenarios. Say we lose a server; it doesn't matter which one since they are both primaries, but in this case we will say server2 failed. Create a new VM from DRBDVMTEMPLATE, which already has drbd configured on it, or create another one using the instructions above.

Open /etc/hostname and set it to

server2

Reboot. Make sure /etc/init.d/drbd start is running.

server1>watch cat /proc/drbd  #watch the status of drbd; it is very useful and telling about what is happening. You will want DRBD to be Connected Primary/Unknown UpToDate/DUnknown
server2>drbdadm down r0
server2>drbdadm wipe-md r0  #this is an optional step used to wipe out the metadata. I have not seen that it does anything different from creating the metadata with the command below, but it is useful to know in case you want to get rid of md metadata on your disk
server2>drbdadm -- --force create-md r0  #this makes sure that there is no partial resync data left over from where you cloned it
server2>drbdadm up r0  #this brings drbd on server2 back into the resource and connects them; it will immediately start syncing, and you should see SyncSource Primary/Secondary UpToDate/Inconsistent on server1. For me it was going to take 22 hours for my test of 1TB (10 MB/second)

Let's get funky: what happens if you stop everything in the middle of a sync?

server1>drbdadm down r0  #we shut down the drbd resource that has the most up-to-date information. On server2, /proc/drbd shows Secondary/Unknown Inconsistent/DUnknown; server2 does not know about server1 any more, but server2 still knows that server2 is inconsistent. (An insertable step here could be, on server2: drbdadm down r0; drbdadm up r0, with no change to the effect)
server1>drbdadm up r0  #this brings server1 back online; /proc/drbd on server1 shows SyncSource, server2 shows SyncTarget. server1 came back up as the UpToDate server, server2 was Inconsistent; it figured it out

 

Where things started to go wrong and become less 'syncable' was when both servers went down and had to be brought back up separately, with a new uuid created on each separately. So let's simulate that the drbd config fell apart and we have to put it together again.

server2>drbdadm disconnect r0; drbdadm -- --force create-md r0; drbdadm connect r0;   #start the sync process over

awk Command to remove Non IP entries from /etc/hosts and /etc/hosts.deny

We had a script automatically adding malicious IPs to our /etc/hosts.deny file on one of our servers.

The script went awry and ended up putting hundreds of thousands of non-IP entries into the file, with the malicious IP addresses mixed in.

I used this awk script to clean it up, remove all of the non-IP entries, and make the list unique:

 awk '/ALL/ && $NF ~ /^[0-9.]+$/' /etc/hosts.deny | sort -n -k2 | uniq > /etc/hosts.deny2

Once I inspected /etc/hosts.deny2, I replaced the original:

mv /etc/hosts.deny2 /etc/hosts.deny
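A quick way to verify a filter like this behaves before overwriting anything is to run it against a throwaway file (the sample entries below are made up; the regex is anchored so that partial matches like hostnames containing digits are rejected):

```shell
# Toy hosts.deny with junk, a duplicate, and two real IP entries.
cat > /tmp/hosts.deny.test <<'EOF'
ALL: 10.0.0.1
ALL: bogus-entry
ALL: 192.168.5.77
ALL: 10.0.0.1
EOF

# Keeps only the two unique IP lines; "bogus-entry" is dropped.
awk '/ALL/ && $NF ~ /^[0-9.]+$/' /tmp/hosts.deny.test | sort -n -k2 | uniq
```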


Mdadm – Failed disk recovery (unreadable disk)

Well,

After 9 more months I ran into another disk failure. (The first disk failure is written up here: https://www.matraex.com/mdadm-failed-disk-recovery/)

But this time, the system was unable to read the disk at all:

#fdisk /dev/sdb

This process just hung for a few minutes. It seemed I couldn't simply run a few commands like before to remove and add the disk back to the software RAID, so I had to replace the disk. Before I went to the datacenter I ran:

#mdadm /dev/md0 --remove /dev/sdb1

I physically went to our data center and found the disk that showed the failure (it was disk sdb, so I 'assumed' it was the center disk out of three, but I was able to verify, since it was not blinking from normal disk activity). I removed the disk, swapped it out for one that I had sitting there waiting for this to happen, and replaced it. Then I ran a command to make sure the disk was correctly partitioned to fit into the array:

#fdisk /dev/sdb

This command did not hang, but responded with 'cannot read disk'. Darn; it looks like some error happened within the OS or on the backplane that left a newly added disk unreadable. I scheduled a restart on the server, and later, when the server came back up, fdisk could read the disk. It looks like I had used the disk for something before, but since I had put it in my spare disk pile, I knew I could wipe it, and I partitioned it with one partition to match what the md was expecting (same as the old disk):

#fdisk /dev/sdb
>d 2         - deletes the old partition 2
>d 1         - deletes the old partition 1
>n           - creates a new partition
>p           - sets the new partition type as primary
>1           - sets the new partition number to 1
><ENTER>     - accept the default starting cylinder
><ENTER>     - accept the default ending cylinder
>w           - write the partition changes to disk and exit fdisk

 

Now the partition is ready to add back to the RAID array:

#mdadm /dev/md0 --add /dev/sdb1

And we can immediately see the progress:

#mdadm --detail /dev/md0
/dev/md0:
 Version : 00.90.03
 Creation Time : Wed Jul 18 00:57:18 2007
 Raid Level : raid5
 Array Size : 140632704 (134.12 GiB 144.01 GB)
 Device Size : 70316352 (67.06 GiB 72.00 GB)
 Raid Devices : 3
 Total Devices : 3
Preferred Minor : 0
 Persistence : Superblock is persistent
Update Time : Sat Feb 22 10:32:01 2014
 State : active, degraded, recovering
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
 Spare Devices : 1
Layout : left-symmetric
 Chunk Size : 64K
Rebuild Status : 0% complete
UUID : fe510f45:66fd464d:3035a68b:f79f8e5b
 Events : 0.537869
Number Major Minor RaidDevice State
 0 8 1 0 active sync /dev/sda1
 3 8 17 1 spare rebuilding /dev/sdb1
 2 8 33 2 active sync /dev/sdc1

And then to see the rebuild progress:

#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[3] sda1[0] sdc1[2]
 140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
 [==============>......] recovery = 71.1% (50047872/70316352) finish=11.0min speed=30549K/sec
md1 : active raid1 sda2[0]
 1365440 blocks [2/1] [U_]

Wow; in the time I have been blogging this, it is already 71 percent rebuilt! But wait, what is this: md1 is failed? I check my monitor and what do I find but another message showing that md1 failed with the reboot. I was so used to getting the notice saying md0 was down that I did not notice md1 did not come back up with the reboot! How can this be?

It turned out that sdb was in use in both md1 and md0, but even though sdb could not be read at all and /dev/sdb1 had failed out of the md0 array, somehow the raid subsystem had not noticed and degraded the md1 array, even though the entire sdb disk was not responding (perhaps sdb2 WAS responding back then, just not sdb; who knows at this point). Maybe the errors on the old disk could have been corrected by the reboot if I had tried that before replacing the disk, but that doesn't matter any more. All I know is that I have to repartition the sdb device in order to support both the md0 and md1 arrays.

I had to wait until sdb finished rebuilding, then remove it from md0, use fdisk to destroy the partitions, build new partitions matching sda, and add the disk back to both md0 and md1.

XenCenter – live migrating a vm in a pool to another host

When migrating a vm from one host to another host in the pool, I found it to be very easy at first.

In fact, it was one of the first tests I did after setting up my first vm on a host in a pool. 4 steps:

  1. Simply right click on the vm in XenCenter ->
  2. Migrate to Server ->
  3. Select from your available servers.
  4. Follow the wizard

In building some servers, I wanted to get some base templates which are 'aware' of the network I am putting together. This would involve adding some packages and configuration, taking a snapshot, and then turning that snapshot into a template that I could easily restart next time I wanted a similar server. Then, when I went to migrate one of the servers to its final resting place, I found an interesting error.

  1. Right click on the vm in XenCenter ->
  2. Migrate to Server ->
  3. All servers listed – Cannot see required storage

I found this odd, since I was sure that the pool could see all of the required storage (in fact, I was able to start a new VM on the available storage, so I knew the storage was there).

I soon found out that the issue is that the live migrate feature just doesn't work when there is more than one snapshot. I will have to rethink how I manage snapshots, but basically I found that by removing old snapshots so that the VM had only one left (I kept one that was a couple of days old), I was able to follow the original 4 steps.

 

Note: the way I found out about the snapshot limitation was:

  1. Right click on the vm in XenCenter ->
  2. Migrate to Server ->
  3. The available servers are all grayed out,  so Select “Migrate VM Wizard”
  4. In the wizard that comes up select the current pool for “Destination”
  5. This populates a list of VMs with Home Server in the destination pool where you want to migrate the VM (my understanding is that this will move the VM to that server AND make that new server the "Home Server" for that VM)
  6. When you attempt to select from the drop down list under Home Server, you see a message: "You attempted to migrate a VM with more than one snapshot"

Using that information, I removed all but one snapshot and was able to migrate. I am sure there is some logical reason behind the snapshot/migration limitation, but for now I will work around it and come up with some other way to handle my snapshots than just leaving them under the snapshot tab of the server.

 

apt-get – NO_PUBKEY – how to add the pubkey

I have run into this situation many times on Ubuntu and Debian, so I thought I would finally document the fix.

When you run into an apt-get error where there is no public key available for a package you want to install, you get this:

The following signatures couldn't be verified because the public key is not available: NO_PUBKEY xxxxxxxxxxxxxxxxxxxxxx

This means your system does not trust the signature; so, if you trust MIT's keyserver, you can do this to fix it:

root@servername:~# gpg --keyserver pgp.mit.edu --recv-keys xxxxxxxxxxxxxxxxxxxxxx
root@servername:~# gpg --armor --export xxxxxxxxxxxxxxxxxxxxxx| apt-key add -

This has solved it for me every time so far. At some point I might run into a situation where MIT does not have the keys, but for now this works, and I trust them.

Entire session below:

The following signatures couldn't be verified because the public key is not available: NO_PUBKEY xxxxxxxxxxxxxxxxxxxxxx
 root@servername:~# gpg --keyserver pgp.mit.edu --recv-keys xxxxxxxxxxxxxxxxxxxxxx
 root@servername:~# gpg --armor --export xxxxxxxxxxxxxxxxxxxxxx| apt-key add -
 OK
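If you hit this often, the two steps can be wrapped in a tiny helper (entirely hypothetical; pass `echo` as the second argument to preview the commands instead of running them):

```shell
# fetch_pubkey KEYID [runner] - fetch a key from pgp.mit.edu and hand it
# to apt-key.  With "echo" as the runner it only prints what it would do.
fetch_pubkey() {
    local keyid=$1 run=${2:-}
    $run gpg --keyserver pgp.mit.edu --recv-keys "$keyid"
    $run sh -c "gpg --armor --export $keyid | apt-key add -"
}

fetch_pubkey DEADBEEFDEADBEEF echo   # dry run: prints the two commands
```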

MDADM – Failed disk recovery (too many disk errors)

This only happens once every couple of years, but occasionally a SCSI disk on one of our servers has too many errors and is kicked out of the md array.
And... we have to rebuild it. Perhaps we should replace it, since it appears to be having problems; but really, the I in RAID is inexpensive (or something), so I would rather lean toward being frugal with the disks and replace them only if required.

I can never remember off the top of my head the commands to recover, so this time I am going to blog it so I can easily find it.

First step: take a look at the status of the arrays on the disk:

#cat /proc/mdstat
(I don't have a copy of what the failed drive looks like since I didn't start blogging until after)
Sometimes an infrequent disk error can cause  md to fail a hard drive and remove it from an array, even though the disk is fine.

That is what happened in this case, and I knew the disk was at least partially good. The disk/partition that failed was /dev/sdb1 and was part of a RAID 5; on that same device, another partition is part of a RAID 1. That RAID 1 was still healthy, so I knew the disk was basically fine and only re-added it to the array so it could rebuild. If the disk has a second problem in the next few months, I will go ahead and replace it, since the issue that happened tonight probably indicates a disk that is beginning to fail but probably still has lots of life in it.

The simple process is:

#mdadm /dev/md0 --remove /dev/sdb1

This removed the faulty disk. That is when you would physically replace the disk in the machine; since I am only going to rebuild the same disk, I skip that and move to the next step.

#mdadm /dev/md0 --re-add /dev/sdb1

The disk started to reload and VOILA! We are rebuilding and will be back online in a few minutes.

Now take a look at the status of the arrays:

#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[3] sdc1[2] sda1[0]
 140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
 [=======>.............] recovery = 35.2% (24758528/70316352) finish=26.1min speed=29020K/sec
md1 : active raid1 sda2[0] sdb2[1]
 1365440 blocks [2/2] [UU]

 

In case you want to do any troubleshooting on what happened, this command is useful for looking through the logs:

#grep mdadm /var/log/syslog -A10 -B10

But this command is the one I use to see the important events related to the failure and rebuild. As I am typing this, I am just over 60% through the rebuild, which you can see in the log:

#grep mdadm /var/log/syslog
Jun 15 21:02:02 xxxxxx mdadm: Fail event detected on md device /dev/md0, component device /dev/sdb1
Jun 15 22:03:16 xxxxxx mdadm: RebuildStarted event detected on md device /dev/md0
Jun 15 22:11:16 xxxxxx mdadm: Rebuild20 event detected on md device /dev/md0
Jun 15 22:19:16 xxxxxx mdadm: Rebuild40 event detected on md device /dev/md0
Jun 15 22:27:16 xxxxxx mdadm: Rebuild60 event detected on md device /dev/md0

You can see from the times that it took me just over an hour to respond and start the rebuild. (I know, that seems too long if I were just doing this remotely, but when I got the notice I went on site, since I thought I would have to do a physical swap; I had to wait a bit while the colo security verified my ID, and I was probably moving a little slow after some nachos at Jalepeno's.) Once the rebuild started, it took about 10 minutes per 20% of the disk.

————————-

Update: 9 months later the disk finally gave out and I had to manually replace it. I blogged again:

https://www.matraex.com/mdadm-failed-d…nreadable-disk/

Linux System Discovery

Over the last couple of weeks I have been working on doing some in depth “System Discovery” work for a client.

The client came to us after a major employee restructuring,  during which they lost ALL of the technical knowledge of their network.
The potentially devastating business move on their part turned into a very intriguing challenge for me.

They asked me to come in and document the services each of their 3 Linux servers provides.
As I dug in I found that their network had some very unique, intelligent solutions:

  • A reliable production network
  • Thin Client Linux printing stations,  remotely connected via VPN
  • Several Object Oriented PHP based web applications

Several open source products had been combined to create robust solutions.

It has been a very rewarding experience to document the systems and give ownership of the systems, network and processes back to the owner.

The documentation I provided included:

  • A high level network diagram as a quick reference overview for new administrators and developers
  • An overall application and major network, server and node object description
  • Detailed per server/node description with connection documentation,  critical processes , important paths and files and dependencies
  • Contact Information for the people and companies that the systems rely on.

As a business owner myself, I have tried to help the client recognize that even when they use an outside consultant, it is VERY important that they maintain details of their critical business processes INSIDE of their company. There might not be anything in business as rewarding as giving ownership of a "lost" system back to a client.

 

Matraex Upgraded Mail Client From Squirrelmail to Roundcube

Matraex has officially upgraded our web based mail client from Squirrelmail to Roundcube.

Roundcube is a modern mail client utilizing newer technologies for faster and more feature rich mail interaction.  Roundcube runs on our Linux webservers, utilizing Apache, PHP and MySQL.  The software connects to the mail server using the IMAP protocol.

All address book contacts and preferences were imported to Roundcube from Squirrelmail at the time of the transition.

As well as updating and implementing their own technologies, Matraex provides server administration, open source production implementation and software customizations to businesses as a service.

Users with questions about the new mail service or Matraex Consulting Services should contact:

Michael Blood
Matraex, Inc
208.344.1115
www.matraex.com

 

Network Boot Server with Linux Install, Debian Etch and Lenny, CentOS and KNOPPIX

I just LOVE my dedicated PXE boot server at the office with several flavors of linux install on it.

I can bring a new server online with a base install in as few as five minutes with Debian or CentOS.
I can debug workstations and servers with a quick-booting KNOPPIX install.

I even have some kernel installations customized to install network drivers for the Dell 2650, so the installs I do for those are quick and simple (basically, the broadcom network drivers and the openssh-server package are preseeded to be installed with the default package set).
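The preseeding itself is just a line or two appended to the installer's preseed file. A sketch of the relevant directive (the exact package names, particularly the firmware package, are from memory and may need adjusting for your hardware):

```
# Pull in sshd and the Broadcom firmware during the base install
d-i pkgsel/include string openssh-server firmware-bnx2
```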

Here are the contents of my pxelinux.cfg/default file:

DISPLAY boot.txt

#DEFAULT etch_i386_install

LABEL etch_i386_install
kernel debian/etch/i386/linux
append vga=normal initrd=debian/etch/i386/initrd.gz --
LABEL etch_i386_expert
kernel debian/etch/i386/linux
append priority=low vga=normal initrd=debian/etch/i386/initrd.gz --
LABEL etch_i386_rescue
kernel debian/etch/i386/linux
append vga=normal initrd=debian/etch/i386/initrd.gz rescue/enable=true --
LABEL knoppix
kernel knoppix/vmlinuz
append secure myconfig=scan nfsdir=192.168.0.1:/srv/diskless/knoppix nodhcp lang=us ramdisk_size=100000 init=/etc/init apm=power-off nomce vga=791 initrd=knoppix/miniroot.gz quiet BOOT_IMAGE=knoppix
LABEL centos5_install
kernel centos/5/vmlinuz
append ks=nfs:192.168.0.1:/srv/diskless/centos/5/ks_prompt.cfg initrd=centos/5/initrd.img ramdisk_size=100000 ksdevice=eth0 ip=dhcp url --url http://mirror.centos.org/centos/5/os/i386/CentOS/
LABEL centos5_raid_install_noprompt
kernel centos/5/vmlinuz
append ks=nfs:192.168.0.1:/srv/diskless/centos/5/ks_raid.cfg initrd=centos/5/initrd.img ramdisk_size=100000 ksdevice=eth0 ip=dhcp url --url http://mirror.centos.org/centos/5/os/i386/CentOS/
LABEL centos5_hda_install_noprompt
kernel centos/5/vmlinuz
append ks=nfs:192.168.0.1:/srv/diskless/centos/5/ks_hda.cfg initrd=centos/5/initrd.img ramdisk_size=100000 ksdevice=eth0 ip=dhcp url --url http://mirror.centos.org/centos/5/os/i386/CentOS/
LABEL centos5_install_noprompt
kernel centos/5/vmlinuz
append ks=nfs:192.168.0.1:/srv/diskless/centos/5/ks.cfg initrd=centos/5/initrd.img ramdisk_size=100000 ksdevice=eth0 ip=dhcp url --url http://mirror.centos.org/centos/5/os/i386/CentOS/

LABEL lenny_i386_install
kernel debian/lenny/i386/linux
append vga=normal initrd=debian/lenny/i386/initrd.gz --

LABEL lenny_amd64_install
kernel debian/lenny/amd64/linux
append vga=normal initrd=debian/lenny/amd64/initrd.gz --

LABEL etch_amd64_install
kernel debian/etch/amd64/linux
append vga=normal initrd=debian/etch/amd64/initrd.gz --

LABEL etch_amd64_linux
kernel debian/etch/amd64/linux
append vga=normal initrd=debian/etch/amd64/initrd.gz --

LABEL etch_amd64_expert
kernel debian/etch/amd64/linux
append priority=low vga=normal initrd=debian/etch/amd64/initrd.gz --

LABEL etch_amd64_rescue
kernel debian/etch/amd64/linux
append vga=normal initrd=debian/etch/amd64/initrd.gz rescue/enable=true --

LABEL etch_amd64_auto
kernel debian/etch/amd64/linux
append auto=true priority=critical vga=normal initrd=debian/etch/amd64/initrd.gz --

PROMPT 1

Here are the contents of my boot.txt file (so that I know what to type at the command line when booting)

– Boot Menu –
=============

etch_i386_install   –   Debian Stable
etch_i386_expert    –   Debian Stable (Shows install menu every step)
etch_i386_rescue    –   Debian Stable Rescue
lenny_i386_install — has Broadcom net card customization
lenny_amd64_install — has Broadcom net card customization
etch_amd64_install
etch_amd64_linux
etch_amd64_expert
etch_amd64_rescue
etch_amd64_auto
centos5_install –  CentOS 5 (Will prompt for disks)
centos5_install_noprompt –  CentOS 5 (Will auto install without prompts)
centos5_hda_install_noprompt –  CentOS 5 (Will auto install without prompts)
centos5_raid_install_noprompt –  CentOS 5 (Will auto install on raid 1 without prompts)
knoppix

Hope someone out there can find some use for this.
We can of course help people having trouble with their own TFTP and PXE boot server.

Installed PERC management software afaapps and created simple mirror

I just installed Debian Lenny on a Dell 2650 with an OLD PERC 3 RAID controller.

I then installed the afaapps package from Dell’s website (http://support.us.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R85529&formatcnt=1&libid=0&fileid=112003)
Use this link or just search for ‘afaapps’ under the Drivers and Downloads section of the Dell support site.

After extracting the rpm from the downloaded file, I ran alien against it to turn it into a Debian package:

#apt-get install alien
#alien -d --scripts afaapps-2.8-0.i386.rpm

Now just install the created Debian package:

#dpkg -i afaapps_2.8-1_i386.deb

Now that you have installed afacli, you can run it at the command line, which opens the PERC command prompt "FASTCMD>".
Then open / connect to the RAID controller using "open afa0":

#afacli
FASTCMD> open afa0
Executing: open “afa0”

A simple 'disk list' command shows what your disk situation looks like:

AFA0> disk list
Executing: disk list

B:ID:L  Device Type  Blocks   Bytes/Block Usage  Shared
——  ————–  ——— ———– —————- ——
0:00:0  Disk  35566478  512  Initialized  NO
0:01:0  Disk  287132440 512  Initialized  NO
0:02:0  Disk  287132440 512  Initialized  NO

You may have to initialize your disks by typing 'disk initialize 1' and 'disk initialize 2' to make sure the container can access them; you can see in my example above that my two disks are already initialized.

Now I will create a volume on disk 1 and mirror it to disk 2:

AFA0> container create volume 1
AFA0> container create mirror 1 2

At the bottom of your screen you should see the status of the mirroring job, something like:

Stat:OK!, Task:100, Func:MSC Ctr:1, State:RUN  16.2%

 

Once the job completes you can partition and format the disk.  Check the label on the disk by running:

AFA0> container list
Executing: container list
Num  Total  Oth Chunk  Scsi  Partition
Label Type  Size  Ctr Size  Usage  B:ID:L Offset:Size
—– —— —— — —— ——- —— ————-
0  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

1  Mirror  136GB  Valid  0:01:0 64.0KB: 136GB
/dev/sdb  0:04:0 64.0KB: 136GB

From this I can see that I will need to partition and format disk /dev/sdb.

Have fun!  And if I can help you on it let me know.
