Load problems after disk replacement on a ocfs2 and drbd system.

Notes Blurb on investigating a complex issue.    resolved,  however not with a concise description,  notes kept in order to continue the issue in the case it happens again.

Recently,    we had a disk failure on one of two SAN servers utilizing MD, OCFS2 and drbd to keep two servers synchronized.

We will call the two Systems: A  and B

 

The disk was replaced on System A, which required a reboot in order for the system to recognize the new disk,   then we ad to –re-add the disk to the MD.  Once this happened,  the disk started to rebuild.   The OCFS and drbd layers did not seem to have any issue rebuilding quickly as soon as the servers rebuilt,  the layers of redundancy made it fairly painless.   However,  the load on System B went up to 2.0+  and on System A up to 7.0+!

This slowed down System B significantly and made System A completely unusable.

I took a look at the many different tools to try to debug this.

  • top
  • iostat -x 1
  • iotop
  • lsof
  • atop

The dynamics of how we use the redundant sans should be taken into should be taken into account here.

We mount System B to an application server via NFS,  and reads and writes are done to System B,   this makes it odd that System A is having such a hard time keeping up,  it honly has to handle the DRBD and OCFS2 communication in order to keep synced (System A is handling reads and writes,   where System B is only having to handle writes on the DRBD layer when changes are made.  iotop shows this between 5 and 40 K/s,  which seemed minimal.

Nothing is pointing to any kind of a direct indicator of what is causing the 7+ load on System A.   the top two processes seem to be drbd_r_r0 and o2hb-XXXXXX,  which take up minimal amounts of read and write

The command to run on a disk to see what is happening is

#iotop -oa

This command shows you only the commands that have used some amount of disk reas or write (-o)  and it shows them cumulatively (-a) so you can easily see what is using the io on the system.    From this I figured out that a majority of the write on the system,  was going to the system drive.

What I found from this,  is that the iotop, tool does not show the activity that is occuring at the drbd / ocfs2 level.   I was able to see that on System B,  where the NFS drive was connected to,  that the nfsd command was writing MULTIPLE MB of information when I would write to the nfsdrive (cat /dev/zero> tmpfile),  but I would see only 100K or something written to drbd on System B,  and nothing on SystemA,  however I would be able to see the file on System A,

I looked at the cpuload on Sysetm A when running the huge write,  and it increased by about 1 (from 7+ to 8+)  so it was doing some work ,  iotop just did not monitor it.

So  i looked to iostat to find out if i would allow me to see the writes to the actual devices in the MD.

I ran

#iostat -x 5

So I could see what was being written to the devices,  here is could see that the disk utilization on System A and System B was similar (about 10% per drive in the MD Array)  and the await time on System B was a bit higher than System A.  When I did this test I caused the load to go up on all servers to about 7 (application server,  System A and System B)  Stopping the write made the load time on the application server, and on System B go back down.

While this did not give me the cause, it helped me to see that disk writes on System A are trackable through iostat, and since no writes are occurring when I run iostat -x 5 I have to assume that there is some sort of other overhead that is causing the huge load time.    With nothing else I felt I could test,  I just rebooted the Server A.

Low and behold,   the load dropped,    writing huge files,  deleting huge files was no longer an issue.   The only think I could think was that there was a large amount of traffic of something which  was being transferred back and forth to some ‘zombie’ server or something.   (I had attempted to restart ocfs2 and drbd and the system wouldn’t allow that either which seems like it indicates a problem with some process being held open by a zombie process)

In the end,  this is the best scenario I can use to describe the problem.  While this is not real resolution.  I publish this so that when an issue comes up with this in the future,  we will be able to investigate about three different possibilities in order to get closer to  figuring out the true issue.

  1. Investigate the network traffic (using ntop for traffic,   tcpdump for contents,  and eth for total stats and possible errors)
  2. Disconnect / Reconnect the drbd and ocfs2 pair to stop the synchronization and watch the load balance to see if that is related to the issue.
  3. Attempt to start and stop the drbd and ocfs2 processes and debug any problems with that process. (watch the traffic or other errors related to those processes)

Recovering / Resyncing a distributed DRBD dual primary Split Brain – [servera] has a different data from [serverb]

Recovering / Resyncing  a distributed DRBD dual primary Split Brain – [servera] has a different data from [serverb]

A client had a pair of servers running drbd in order to keep a large file system syncronized and highly available.  However at some point in time the drbd failed and the two servers got out of sync and it went unnoticed for long enough,  that new files were written on both ‘servera’ and on ‘serverb’.

At this point both servers believe that they are the primary,  and the servers are running in what you call a ‘Split Brain’

To determine that split brain has happened you can run several commands.  In our scenario we have two servers servera and serverb

servera#drbd-overview
 0:r0/0 WFConnection Primary/Unknown UpToDate/DUnknown C r----- /data ocfs2 1.8T 1001G 799G 56%
serverb#drbd-overview
 0:r0/0  StandAlone Primary/Unknown UpToDate/DUnknown r----- /data ocfs2 1.8T 1.1T 757G 58%

From the output above (color added) we can see that servera knows that it is in StandAlone mode,   the server realizes that it can not connect.  We can research the logs and we can find out why it things it is in  StandAlone d.  To do this we grep the syslog.

serverb#grep split /var/log/syslog
Nov 2 10:15:26 serverb kernel: [41853948.860147] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
Nov 2 10:15:26 serverb kernel: [41853948.862910] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
Nov 2 10:15:26 serverb kernel: [41853948.862934] block drbd0: Split-Brain detected but unresolved, dropping connection! 
Nov 2 10:15:26 serverb kernel: [41853948.862950] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Nov 2 10:15:26 serverb kernel: [41853948.865829] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)

This set of log entries lets  us know that when serverb attempted to connect to servera,  it detected a situation where both file systems had been written to,  so it could no longer synchronize.  it made these entries and put itself into Standalone mode.

servera on the other hand says that it is waiting for a connection WFConnection.

The next step is to determine which of the two servers has the ‘master’ set of data.  This set of data will sync OVER THE TOP of the other server.

In our client’s case we had to do some investigation in order to determine what differences there were on the two servers.

After some discovery we realized that in our case serverb had the most up to date information, except in the case of one directory,  we simply copied that data from servera to serverb,  and then serverb was ready to become our primary.  In the terminology of DRBD,  servera is our ‘split-brain victim’ and serverb is our ‘splitbrain survivor’  we will need to run a set of commands which

  1. ensures the status of the victim to ‘Standalone’ (currently it is ‘WFConnection’)
  2. umount the drive on the victim(servera) so that the filesystem is no longer accessible
  3. sets the victim to be ‘secondary’ server,   this will allow us to sync from the survivor to victim KNOWING the direction the data will go.
  4. start the victim (servera) and let the let the ‘split brain detector’ know that it is okay to overwrite the data on the victim(servera) with the data on the survivor (serverb)
  5. start the survivor(serverb)  (if the serverb server was in WFConnection mode, it would not need to be started,  however ours was in StandAlone mode so it will need to be restarted)

At first we were concerned that we would have to resync 1.2 TB of data,  however we read here  that

The split brain victim is not subjected to a full device synchronization. Instead, it has its local modifications rolled back, and any modifications made on the split brain survivor propagate to the victim.

The client runs a dual primary,  however as we rebuild the synced pair, we need to ensure that the ‘victim’  is rebuilt from the survivor,  so we move the victim from a primary, to a secondary.   And it seems that we are unable to mount a drive (using our ocfs2 filesystem) while it is a secondary.  So we had to ‘umount’ the drive,  and we were unable to remount it while it is a secondary.  In a future test (in which restoring data redundancy primary / primary is less critical),  we will find out whether we are able to keep the primary/primary status while we are rebuilding from a split brain.

While the drbd-overview tool shows all of the ‘resources’ we are required to use a third parameter specifying the ‘resource’ to operate on .  If you have more than one drbd resource defined you will need to identify which resource you are working with.   You can look in your /etc/drbd.conf file or in your /etc/drbd.d/disk.res (your file may be named differently).  The file has the form of

resource r0 { 
....................
}

where r0 is your resource name,   you can also see this buried in your output of drbd-overview

servera# drbd-overview
0:r0/0 WFConnection Primary/Unknown UpToDate/DUnknown C r----- /data ocfs2 1.8T 1001G 799G 56%

So we ran the following commands on servera to prepare it as the victim

servera# drbd-overview #check the starting status of the victim
 0:r0/0 WFConnection Primary/Unknown UpToDate/DUnknown C r----- /data ocfs2 1.8T 1001G 799G 56%
serverb# drbd-overview #check the starting status of the survivor
 0:r0/0 StandAlone Primary/Unknown UpToDate/DUnknown r----- /data ocfs2 1.8T 1.1T 760G 58%

From this above we can see that serverb has 58% usage and 760GB free, were server a has 56% usage and 799GB free.
Based on what I know about the difference between servera and serverb, this helps me to confirm that serverb has more data and is the ‘survivor’

servera# drbdadm disconnect r0 # 1. ensures the victim is standalone 
servera# drbd-overview #confirm it is now StandAlone 
 0:r0/0 StandAlone Primary/Unknown UpToDate/DUnknown r----- /data ocfs2 1.8T 1001G 799G 56%
servera# umount /data # 2. we can not mount the secondary drive with read write
servera# drbdadm secondary r0 # 3. ensures the victim is the secondary 
servera# drbd-overview #confirm it is now secondary 
  0:r0/0 StandAlone Secondary/Unknown UpToDate/DUnknown r-----
servera# drbdadm connect --discard-my-data r0 # 4. start / connect the victim up again knowing that its data should be overwritten with a primary 
servera# drbd-overview #confirm the status and that it it is now connected [WFConnection]
  0:r0/0 WFConnection Secondary/Unknown UpToDate/DUnknown C r-----


I also checked the logs to confirm the status change

servera#grep drbd /var/log/syslog|tail -4
Nov 4 05:14:03 servera kernel: [278068.555213] drbd r0: conn( StandAlone -> Unconnected )
Nov 4 05:14:03 servera kernel: [278068.555247] drbd r0: Starting receiver thread (from drbd_w_r0 [19105])
Nov 4 05:14:03 servera kernel: [278068.555331] drbd r0: receiver (re)started
Nov 4 05:14:03 servera kernel: [278068.555364] drbd r0: conn( Unconnected -> WFConnection )

Next we simply have to run this command on serverb to let it know that it can connect as the survivor (like I mentioned above,  if the survivor was in WFConnection mode,  it would automatically reconnect,  however we were in StandAlone mode)

serverb# drbd-overview #check one more time that serverb is not yet connected
 0:r0/0 StandAlone Primary/Unknown UpToDate/DUnknown r----- /data ocfs2 1.8T 1.1T 760G 58%
serverb# drbdadm connect r0 # 5. start the surviving server to ensure that it reconnects
serverb# drbd-overview #confirm serverb and servera are communicating again 
 0:r0/0 SyncSource Primary/Secondary UpToDate/Inconsistent C r----- /data ocfs2 1.8T 1.1T 760G 58%
 [>....................] sync'ed: 0.1% (477832/478292)M
servera# drbd-overview #check that servera confirms what serverb says about communicating again 
 0:r0/0 SyncTarget Secondary/Primary Inconsistent/UpToDate C r-----
 [>....................] sync'ed: 0.3% (477236/478292)M

Another way to confirm that the resync started happening is to check the logs

servera# grep drbd /var/log/syslog|grep resync
Nov 4 05:18:09 servera kernel: [278314.571951] block drbd0: Began resync as SyncTarget (will sync 489771348 KB [122442837 bits set]).
serverb# grep drbd /var/log/syslog|grep resync
Nov 4 05:18:09 serverb kernel: [42008909.652451] block drbd0: Began resync as SyncSource (will sync 489771348 KB [122442837 bits set]).


Finally,   we simply run a command to promote servera to be a primary again,  and then both servers will be writable

servera#drbdadm primary r0
servera# drbd-overview
 0:r0/0 Connected Primary/Primary UpToDate/UpToDate C r-----
servera# mount /data #remount the data drive we unmounted previously

 

 

Now that we ‘started’ recovering from the split-brain issue we just have to watch the two servers to confirm once they have fully recovered.  once that is complete we will put in place log watchers and FileSystem tests to send out a notification to the system administrator if it should happen again.

Disk write speed testing different XenServer configurations – single disk vs mdadm vs hardware raid

Disk write speed testing different XenServer configurations – single disk vs mdadm vs hardware raid

In our virtual environment on of the VM Host servers has a hardware raid controller on it .  so natuarally we used the hardware raid.

The server is a on a Dell 6100 which uses a low featured LSI SAS RAID controller.
One of the ‘low’ features was that it only allows two RAID volumes at a time.  Also it does not do RAID 10

So I decided to create a RAID 1 with two SSD drives for the host,  and we would also put the root operating systems for each of the Guest VMs there.   It would be fast and redundant.   Then we have upto 4 1TB disks for the larger data sets.  We have multiple identically configured VM Hosts in our Pool.

For the data drives,  with only 1 more RAID volume I could create without a RAID 10,  I was limited to either a RAID V,   a mirror with 2 spares,   a JBOD.  In order to get the most space out of the 4 1TB drives,   I created the RAIDV.   After configuring two identical VM hosts like this,  putting a DRBD Primary / Primary connection between the two of them and then OCFS2 filesystem on top of it.  I found I got as low as 3MB write speed.   I wasnt originally thinking about what speeds I would get,  I just kind of expected that the speeds would be somewhere around disk write speed and so I suppose I was expecting to get acceptable speeds beetween 30 and 80 MB/s.   When I didn’t,  I realized I was going to have to do some simple benchmarking on my 4 1TB drives to see what configuration will work best for me to get the best speed and size configuration out of them.

A couple of environment items

  • I will mount the final drive on /data
  • I mount temporary drives in /mnt when testing
  • We use XenServer for our virtual environment,  I will refer to the host as the VM Host or dom0 and to a guest VM as VM Guest or domU.
  • The final speed that we are looking to get is on domU,  since that is where our application will be,  however I will be doing tests in both dom0 and domU environments.
  • It is possible that the domU may be the only VM Guest,  so we will also test raw disk access from domU for the data (and skip the abstraction level provided by the dom0)

So,  as I test the different environments I need to be able to createw and destroy the local storage on the dom0 VM Host.  Here are some commands that help me to do it.
I already went through xencenter and removed all connections and virtual disk on the storage I want to remove,  I had to click on the device “Local Storage 2” under the host and click the storage tab and make sure each was deleted. {VM Host SR Delete Process}

xe sr-list host=server1 #find and keep the uuid of the sr in my case "c2457be3-be34-f2c1-deac-7d63dcc8a55a"
xe pbd-list   sr-uuid=c2457be3-be34-f2c1-deac-7d63dcc8a55a # find and keep the uuid of the pbd connectig sr to dom0 "b8af1711-12d6-5c92-5ab2-c201d25612a9"
xe pbd-unplug  uuid=b8af1711-12d6-5c92-5ab2-c201d25612a9 #unplug the device from the sr
xe pbd-destroy uuid=b8af1711-12d6-5c92-5ab2-c201d25612a9 #destroy the devices
xe sr-forget uuid=c2457be3-be34-f2c1-deac-7d63dcc8a55a #destroy the sr

Now that the sr is destroyed,  I can work on the raw disks on the dom0 and do some benchmarking on the speeds of differnt soft configurations from their.
Once I have made  a change,  to the structure of the disks,  I can recreate the sr with a new name on top of whatever solution I come up with by :

xe sr-create content-type=user device-config:device=/dev/XXX host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk ‘{print $NF}’` name-label=”Local storage XXX on `cat /etc/hostname`” shared=false type=lvm

Replace the red XXX with what works for you

Most of the tests were me just running dd commands and writing the slowest time,  and then what seemed to be about the average time in MB/s.   It seemed like,  the first time a write was done it was a bit slower but each subsequent time it was faster and I am not sure if that means when a disk is idle,  it takes a bit longer to speed up and write?  if that is the case then there are two scenarios,   if the disk is often idle,  the it will use the slower number,  but if the disk is busy,  it will use the higher average number,  so I tracked them both.  The idle disk issue was not scientific and many of my tests did not wait long enough for the disk to go idle inbetween tests.

The commands I ran for testing were dd commands

dd if=/dev/zero of=data/speetest.`date +%s` bs=1k count=1000 conv=fdatasync  #for 1 mb
dd if=/dev/zero of=data/speetest.`date +%s` bs=1k count=10000 conv=fdatasync  #for 10 mb
dd if=/dev/zero of=data/speetest.`date +%s` bs=1k count=100000 conv=fdatasync  #for 100 mb
dd if=/dev/zero of=data/speetest.`date +%s` bs=1k count=1000000 conv=fdatasync  #for 1000 mb

I wont get into the details of every single command I ran as I was creating the different disk configurations and environments but I will document a couple of them

Soft RAID 10 on dom0

dom0>mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb2 --assume-clean
dom0>mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd2 --assume-clean
dom0>mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/md0 /dev/md1 --assume-clean
dom0>mkfs.ext3 /dev/md10
dom0>xe sr-create content-type=user device-config:device=/dev/md10 host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk ‘{print $NF}’` name-label=”Local storage md10 on `cat /etc/hostname`” shared=false type=lvm

Dual Dom0 Mirror – Striped on DomU for an “Extended RAID 10”

dom0> {VM Host SR Delete Process} #to clean out 'Local storage md10'
dom0>mdadm --manage /dev/md2 --stop
dom0>mkfs.ext3 /dev/md0 && mkfs.ext3 /dev/md1
dom0>xe sr-create content-type=user device-config:device=/dev/md0 host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk ‘{print $NF}’` name-label=”Local storage md0 on `cat /etc/hostname`” shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/md1 host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk ‘{print $NF}’` name-label=”Local storage md1 on `cat /etc/hostname`” shared=false type=lvm
domU>
#at this  point use Xen Center to add and attach disks from each of the local md0 and md1 disks to the domU (they were attached on my systems as xvdb and xvdc
domU> mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
domU> mkfs.ext3 /dev/md10  && mount /data /dev/md10

Four disks SR from dom0, soft raid 10 on domU

domU>umount /data
domU> mdadm --manage /dev/md10 --stop
domU> {delete md2 and md1 disks from the storage tab under your VM Host in Xen Center}
dom0> {VM Host SR Delete Process} #to clean out 'Local storage md10'
dom0>mdadm --manage /dev/md2 --stop
dom0>mdadm --manage /dev/md1 --stop
dom0>mdadm --manage /dev/md0 --stop
dom0>fdisk /dev/sda #delete partition and write (d w)
dom0>fdisk /dev/sdb #delete partition and write (d w)
dom0>fdisk /dev/sdc #delete partition and write (d w)
dom0>fdisk /dev/sdd #delete partition and write (d w)
dom0>xe sr-create content-type=user device-config:device=/dev/sda host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sda on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/sdb host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sdb on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/sdc host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sdc on `cat /etc/hostname`" shared=false type=lvm
dom0>xe sr-create content-type=user device-config:device=/dev/sdd host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'` name-label="Local storage sdd on `cat /etc/hostname`" shared=false type=lvm
domU>mdadm --create /dev/md10 -l10 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvde /dev/xvdf
domU>mdadm --detail --scan >> /etc/mdadm/mdadm.conf 
domU>echo 100000 > /proc/sys/dev/raid/speed_limit_min #I made the resync go fast, which reduced it from 26 hours to about 3 hours
domU>mdadm --grow /dev/md0 --size=max

Setting up DRBD with OCSF2 on a Ubuntu 12.04 server for Primary/Primary

Setting up DRBD with OCSF2 on a Ubuntu 12.04 server for Primary/Primary

We run in a virtual environment and so we thought we would go with the virtual kernel for the latest linux kernls
We learned that we should NOT not in the case we want to use the OCFS2 distributed locking files system because ocfs2 did not have the correct modules so we would have had to doa  custom build of the modules so we decided against it.   we just went with the latest kernel,   and would install ocfs2 tools from the package manager.

DRBD on the other hand had to be downloaded, compiled and installed regardless of kernel,   here are the procedures,  these must be run on each of a pair of machines.
We assume that /dev/xvdb has a similar sized device on both machines.

apt-get install make gcc flex
wget http://oss.linbit.com/drbd/8.4/drbd-8.4.4.tar.gztar xzvf drbd-8.4.4.tar.gz 
cd drbd-8.4.4/
./configure --prefix=/usr --localstatedir=/var --sysconfdir=/etc --with-km
make all

Connfigure both systems to be aware of eachother without dns /etc/hosts

192.168.100.10 server1
192.168.100.11 server2

Create a configuration file at /etc/drbd.d/disk.res

resource r0 {
protocol C;
syncer { rate 1000M; }
startup {
wfc-timeout 15;
degr-wfc-timeout 60;
become-primary-on both;
}
net {
#requires a clustered filesystem ocfs2 for 2 prmaries, mounted simultaneously
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
cram-hmac-alg sha1;
shared-secret "sharedsanconfigsecret";
}
on server1 {
device /dev/drbd0;
disk /dev/xvdb;
address 192.168.100.10:7788;
meta-disk internal;
}
on server2 {
device /dev/drbd0;
disk /dev/xvdb;
address 192.168.100.11:7788;
meta-disk internal;
}
}

 

configure drbd to start on reboot verify that DRBD is running on both machines and reboot,  and verify again

update-rc.d drbd defaults
/etc/init.d/drbd start
drbdadm -- --force create-md r0
drbdadm up r0
cat /proc/drbd

at this point  you should see that both devices are connected Secondary/Secondary and Inconsistent/Inconsistent.
Now we start the sync fresh,   on server1 only both sides are blank so drbd should manage any changes from here on.  cat /proc/drbd will show UpToDate/UpToDate
Then we mark both primary and reboot to verify everything comes back up

server1>drbdadm -- --clear-bitmap new-current-uuid r0
server1>drbdadm primary r0
server2>drbdadm primary r0
server2>reboot
server1>reboot

I took a snapshot at this point
Now it is time to setup the OCFS2 clustered file system on top of the device first setup a /etc/ocfs2/cluster.conf

cluster:
node_count = 2
name = mycluster
node:
ip_port = 7777
ip_address = 192.168.100.10
number = 1
name = server1
cluster = mycluster
node:
ip_port = 7777
ip_address = 192.168.100.11
number = 2
name = server2
cluster = mycluster

get the needed packages, configure them and setup for reboot,  when reconfiguring,   remember to put the name  of the cluster you want to start at boot up mycluster run the below on both machines

apt-get install ocfs2-tools
dpkg-reconfigure ocfs2-tools
mkfs.ocfs2 -L mycluster /dev/drbd0 #only run this on server1
mkdir -p /data
echo "/dev/drbd0  /data  ocfs2  noauto,noatime,nodiratime,_netdev  0 0" >> /etc/fstab
mount /data
touch /data/testfile.`hostname`
stat /data/testfile.*
rm /data/testfile* # you will only have to run this on one machine
reboot

So,  everything should be running on both computers at this point when things come backup make sure everythign is connected.
You can run these commands from either server

/etc/init.d/o2cb status
cat /proc/drbd

			

Setting DRBD in Primary / Primary — common commands to sync resync and make changes

Setting DRBD in Primary / Primary — common commands to sync resync and make changes

As we have been setting up our farm with an NFS share the DRBD primary / primary connection between servers is important.

We are setting up a group of /customcommands/ that we will be able to run to help us keep track of all of the common status and maintenance commands we use,  but  when we have to create, make changes to the structure,  sync and resync, recover, grow or move the servers,  We need to document our ‘Best Practices’ and how we can recover.

From base Server install

apt-get install gcc make flex
wget http://oss.linbit.com/drbd/8.4/drbd-8.4.1.tar.gz
tar xvfz drbd-8.4.1.tar.gz
cd drbd-8.4.1/
./configure --prefix=/usr --localstatedir=/var --sysconfdir=/etc --with-km
make KDIR=/lib/modules/3.2.0-58-virtual/build
make install

Setup in/etc/drbd.d/disk.res

resource r0 {
protocol C;
syncer { rate 1000M; }
startup {
wfc-timeout 15;
degr-wfc-timeout 60;
become-primary-on both;
}
net {
#requires a clustered filesystem ocfs2 for 2 prmaries, mounted simultaneously
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
cram-hmac-alg sha1;
shared-secret "sharedsanconfigsecret";
}
on server1{
device /dev/drbd0;
disk /dev/xvdb;
address 192.168.100.10:7788;
meta-disk internal;
}

on riofarm-base-san2 {
device /dev/drbd0;
disk /dev/xvdb;
address 192.168.100.11:7788;
meta-disk internal;
}
}

 

Setup your /etc/hosts

192.168.100.10 server1
192.168.100.11 server2

Setup /etc/hostname with

server1

reboot,  verify your settings and SAVE A DRBDVMTEMPLATE clone your VM to a new server called server2

Setup /etc/hostname with

server2

start drbd with /etc/init.d/drbd  this will likely try and create the connection,  but this is where we are going to ‘play’ to learn the commands and how we can sync,  etc.

cat /proc/drbd   #shows the status of the connections
server1> drbdadm down r0   #turns of the drbdresource and connection
server2> drbdadm down r0   #turns of the drbd resource and connection
server1> drbdadm -- --force create-md r0  #creates a new set of meta data on the drive,  which 'erases drbds memory of the sync status in the past
server2> drbdadm -- --force create-md r0  #creates a new set of meta data on the drive,  which 'erases drbds memory of the sync status in the past
server1> drbdadm up r0   #turns on the drbdresource and connection and they shoudl connect without a problem,   with no memory of a past sync history
server2> drbdadm up r0    #turns on the drbdresource and connection and they shoudl connect without a problem,   with no memory of a past sync history
server1> drbdadm -- --clear-bitmap new-current-uuid r0  # this create a new 'disk sync image' essentially telling drbd that the servers are blank so no sync needs to be done  both servers are immediately UpToDate/UptoDate in /proc/drbd
server1> drbdadm primary r0
server2> drbdadm primary r0    #make both servers primary and  now when you put an a filesystem on /dev/drbd0 you will be able to read and write on both systems as though they are local

So,    lets do some failure scenarios,    Say, we loose a server,  it doesn’t matter which one since they are both primaries,  in this case though we will say server2 failed.  Create a new VM from DRBDVMTEMPLATE  which already had drbd made on it with the configuration or create another one using the instructions above.

Open /etc/hostname and set it to

server2

reboot.    Make sure /etc/init.d/drbd start is running

server1>watch cat /proc/drbd  #watch the status of dtbd,  it is very useful and telling about what is happening,   you will want DRBD to be Connected Primary/Unknown  UpToDate/DUnknown  
server2>drbdadm down
server2>dbadm wipe-md r0  #this is an optional step that is used to wipe out the meta data,  I have not seen that it does anything different than creating the metadata using the command below,  but it is useful to know the command in case you want to get rid of md  on your disk
server2>drbdadm -- --force create-md r0  ##this makes sure that their is no partial resync data left over from where you cloned it from
server2>drbdadm up r0 # this brings drbd server2 back into the resource and connects them,  it will immediately sart syncing you should see SyncSource Primary/Secondary UpToDate/Inconsistent on server1,     for me it was soing to to 22 hours for my test of a 1TM  (10 MB / second)

Lets get funky,  what happens if you stop everything in the middle of a sync

server1>drbdadm down r0 #we shut down the drdb resource that has the most up to date information,   on server2 /proc/drbd shows Secondary/Unknown  Inconsitent/DUnknown ,  server2 does not know about server1 any more,  but server2 still knows that server2 is inconsitent,   (insertable step here could be on server2: drbdadm down ro; drbdadm up ro,  with no change to the effect)
server1>drbdadm up ro #  this brings server1 back on line and /proc/drbd on server1 shows  SyncSource,  server2 shows SyncTarget,   server1 came backup as the UpToDate server,  server2 was Inconsistent,  it figured it out

 

Where things started to go wrong and become less ‘syncable’  was when servers were both down and had to be brought back up again separately with a new uuid was created on them separately.  so lets simulate that the drbd config fell apart,  and we have to put it together again.

server2>drbdadm disconnect ro;  drdbadm -- --force create-md r0 ; drbd connect ro;   #start the sync process over
Call Now Button(208) 344-1115

SIGN UP TO
GET OUR 
FREE
 APP BLUEPRINT

Join our email list

and get your free whitepaper