Load problems after disk replacement on an OCFS2 and DRBD system
Notes from investigating a complex issue. It was resolved, though without a concise root cause; the notes are kept so the investigation can continue if the issue comes up again.
Recently, we had a disk failure on one of two SAN servers that use MD, OCFS2, and DRBD to keep the two servers synchronized.
We will call the two systems A and B.
The disk was replaced on System A, which required a reboot in order for the system to recognize the new disk; then we had to re-add the disk to the MD array. Once this happened, the array started to rebuild. The OCFS2 and DRBD layers did not seem to have any issue catching up once the server came back; the layers of redundancy made it fairly painless. However, the load on System B went up to 2.0+ and on System A up to 7.0+!
This slowed down System B significantly and made System A completely unusable.
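For reference, the re-add step was along these lines (the MD device and partition names here are hypothetical; ours will differ):

# mdadm --manage /dev/md0 --re-add /dev/sdb1
# cat /proc/mdstat

Watching /proc/mdstat shows the rebuild progress once the member is back in the array.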
I took a look at many different tools to try to debug this:
- top
- iostat -x 1
- iotop
- lsof
- atop
The dynamics of how we use the redundant SANs should be taken into account here.
We mount System B to an application server via NFS, and reads and writes are done against System B. This makes it odd that System A is having such a hard time keeping up, since it only has to handle the DRBD and OCFS2 communication needed to stay synced (System B handles the NFS reads and writes, while System A only has to apply the DRBD-level writes when changes are made). iotop shows this traffic at between 5 and 40 K/s, which seemed minimal.
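To make the topology concrete, the setup looks roughly like this (the export path, host names, and device names are hypothetical). On System B, the OCFS2 filesystem on the DRBD device is exported over NFS, e.g. in /etc/exports:

/cluster    appserver(rw,sync,no_subtree_check)

On the application server:

# mount -t nfs systemB:/cluster /mnt/cluster

DRBD keeps the block device (e.g. /dev/drbd0, backed by the MD array) replicated between A and B, and OCFS2 is the cluster filesystem sitting on top of it on both nodes.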
Nothing pointed to any kind of direct indicator of what was causing the 7+ load on System A. The top two processes seemed to be drbd_r_r0 and o2hb-XXXXXX, which take up minimal amounts of read and write.
The command to run to see what is happening on the disks is:
#iotop -oa
This command shows only the processes that have done some amount of disk read or write (-o), and it shows them cumulatively (-a), so you can easily see what is using the I/O on the system. From this I figured out that the majority of the writes on the system were going to the system drive.
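If the spike is intermittent, iotop can also log accumulated samples to a file for later review; a minimal sketch:

# iotop -oab -d 5 -n 60 > /tmp/iotop.log

This takes a sample every 5 seconds for 5 minutes in batch mode (-b), so the output can be captured and compared against load spikes afterwards.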
What I found from this is that iotop does not show the activity occurring at the DRBD / OCFS2 level. On System B, where the NFS export lives, I could see the nfsd processes writing multiple MB of data when I wrote to the NFS mount (cat /dev/zero > tmpfile), but only 100K or so written to DRBD on System B, and nothing at all on System A, even though the file would show up on System A.
I looked at the CPU load on System A while running the huge write, and it increased by about 1 (from 7+ to 8+), so it was doing some work; iotop just did not see it.
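A bounded test write is a bit safer than redirecting /dev/zero into a file (which runs until the filesystem fills up); something along these lines from the application server, with a hypothetical mount point:

# dd if=/dev/zero of=/mnt/cluster/tmpfile bs=1M count=1024 conv=fsync
# rm /mnt/cluster/tmpfile

conv=fsync forces the data out to the NFS server before dd reports its throughput, so the numbers line up better with what iostat shows on the SAN nodes.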
So I looked to iostat to find out if it would let me see the writes to the actual devices in the MD array.
I ran
#iostat -x 5
So I could see what was being written to the devices. Here I could see that the disk utilization on System A and System B was similar (about 10% per drive in the MD array) and that the await time on System B was a bit higher than on System A. When I ran this test I caused the load to go up to about 7 on all servers (the application server, System A, and System B). Stopping the write made the load on the application server and on System B go back down.
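iostat can also be pointed at specific devices, which makes it easier to compare the MD members with the MD and DRBD devices themselves; the device names below are assumptions:

# iostat -x sda sdb md0 5

This prints the extended stats for just those devices every 5 seconds.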
While this did not give me the cause, it helped me see that disk writes on System A are trackable through iostat, and since no significant writes showed up in iostat -x 5 when the test write was not running, I have to assume that some other kind of overhead was causing the huge load. With nothing else I felt I could test, I just rebooted System A.
Lo and behold, the load dropped; writing and deleting huge files was no longer an issue. The only thing I could think was that a large amount of traffic of some kind was being transferred back and forth to some 'zombie' server or process. (I had attempted to restart OCFS2 and DRBD, and the system would not allow that either, which seems to indicate a problem with some process being held open by a zombie process.)
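Next time, before falling back to a reboot, it would be worth checking what is holding the cluster filesystem and the DRBD resource open; a rough sketch, with an assumed mount point and the r0 resource name taken from the drbd_r_r0 worker seen in iotop:

# fuser -vm /cluster
# lsof +f -- /cluster
# cat /proc/drbd
# drbdadm cstate r0

fuser and lsof show which processes still have files open on the mount (which would block an OCFS2 unmount or service stop), while /proc/drbd and the connection state show whether the DRBD link itself is healthy.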
In the end, this is the best description I can give of the problem. While it is not a real resolution, I am publishing it so that when an issue like this comes up in the future, we can investigate three different possibilities in order to get closer to the true cause.
- Investigate the network traffic (using ntop for traffic, tcpdump for contents, and the Ethernet interface counters for total stats and possible errors); a rough command sketch follows this list.
- Disconnect and reconnect the DRBD and OCFS2 pair to stop the synchronization, and watch the load to see if it is related to the issue.
- Attempt to stop and start the DRBD and OCFS2 processes and debug any problems with that process (watch the traffic and any other errors related to those processes).
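A rough sketch of the commands for those checks (the interface name, DRBD resource name, and port are assumptions; DRBD commonly listens on 7788, but our resource config may differ, and the init script paths depend on the distribution):

# tcpdump -ni eth1 port 7788
# ip -s link show eth1
# drbdadm disconnect r0
# drbdadm connect r0
# /etc/init.d/drbd status
# /etc/init.d/o2cb status

tcpdump shows whether the replication link is busy with retransmits or unexpected traffic, ip -s link shows interface totals and error counters, drbdadm disconnect/connect pauses and resumes replication for resource r0, and the init scripts report the state of the DRBD and OCFS2 (o2cb) services.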