MDADM – Failed disk recovery (too many disk errors)
This only happens once every couple of years, but occasionally a SCSI disk on one of our servers has too many errors, and is kicked out of the md array
And… we have to rebuild it. Perhaps we should replace it since it appears to be having problems, but really, the I in RAID is inexpensive (or something) so I would rather lean to being frugal with the disks and replacing them only if required.
I can never remember of the top of my head the commands to recover, so this time I am going to blog it so I can easily find it.
First step, take a look at the status of the arrays on the disk
#cat /proc/mdstat (I don't have a copy of what the failed drive looks like since I didn't start blogging until after) Sometimes an infrequent disk error can cause md to fail a hard drive and remove it from an array, even though the disk is fine.
That is what happened in this case, and I knew the disk was at least partially good. The disk / partition that failed was /dev/sdb1 and was part of a RAID V, on that same device another partition is part of a RAID I, that RAID I is still healthy so I knew the disk is just fine. So I am only re-adding the disk to the array so it can rebuild. If the disk has a second problem in the next few months, I will go ahead and replace it, since the issue that happened tonight is probably indicating a disk that is beginning to fail but probably still has lots of life in it.
The simple process is
#mdadm /dev/md0 --remove /dev/sdb1
This removed the faulty disk, that is when you would physically replace the disk in the machine, since I am only going to rebuild the disk I just skip that and move to the next step.
#mdadm /dev/md0 --re-add
The disk started to reload and VOILA! we are rebuilding and will be back online in a few minutes.
Now you take a look at the status of the arrays
#cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md0 : active raid5 sdb1 sdc1 sda1 140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U] [=======>.............] recovery = 35.2% (24758528/70316352) finish=26.1min speed=29020K/sec md1 : active raid1 sda2 sdb2 1365440 blocks [2/2] [UU]
In case you want to do any trouble shooting on what happened, this command is useful in looking into the logs.
#grep mdadm /var/log/syslog -A10 -B10
But this command is the one that I use to see the important events related to the failure and rebuild. As I am typing this I am just over 60% complete rebuilt which you see in the log
#grep mdadm /var/log/syslog Jun 15 21:02:02 xxxxxx mdadm: Fail event detected on md device /dev/md0, component device /dev/sdb1 Jun 15 22:03:16 xxxxxx mdadm: RebuildStarted event detected on md device /dev/md0 Jun 15 22:11:16 xxxxxx mdadm: Rebuild20 event detected on md device /dev/md0 Jun 15 22:19:16 xxxxxx mdadm: Rebuild40 event detected on md device /dev/md0 Jun 15 22:27:16 xxxxxx mdadm: Rebuild60 event detected on md device /dev/md0
You can see from the times, it took me just over an hour to respond and start the rebuild (I know, that seems too long if I were to just do this remotely, but when I got the notice, I went on site since I thought I would have to do a physical swap and I had to wait a bit while the Colo security verified my ID, and I was probably moving a little slow after some Nachos at Jalepeno’s) Once the rebuild started it took about 10 minutes per 20% of the disk to rebuild.
Update: 9 months later the disk finally gave out and I had to manually replace the disk. I blogged again: