Mdadm – Failed disk recovery (unreadable disk)
After 9 more months I ran into a nother disk failure. (First disk failure found here https://www.matraex.com/mdadm-failed-disk-recovery/)
But this time, The system was unable to read the disk at all
This process just hung for a few minutes. It seems I couldn’t simply run a few commands like before to remove and add the disk back to the software RAID. So I had to replace the disk. Before I went to the datacenter I ran
#mdadm /dev/md0 --remove /dev/sdb1
I physically went to our data center, found the disk that showed the failure (it was disk sdb so I ‘assumed’ it was the center disk out of three, but I was able to verify since it was not blinking from normal disk activity. I removed the disk, swapped it out for one that I had sitting their waiting for this to happen, and replaced it. Then I ran a command to make sure the disk was correctly partitioned to be able to fit into the array
This command did not hang, but responded with cannot read disk. Darn, looks like some error happened within the OS or on the backplane that made it so a newly added disk wasn’t readable. I scheduled a restart on the server later when the server came back up, fdisk could read the disk. It looks like I had used the disk for something before, but since I had put it in my spare disk pile, I knew I could delete it and I partitioned it with one partion to match what the md was expecting (same as the old disk)
>d 2 -deletes the old partition 2
>d 1 -deletes the old partition 1
>c -creates a new partion
>p – sets the new partion as primary
>1 – sets the new partion as number 1
>> <ENTER> – just press enter to accept the defaults starting cylinder
>> <ENTER> – just press enter to accept the defaults ending cylinder
>> w – write the partion changes to disk
>> Ctrl +c – break out of fdisk
Now the partition is ready to add back to the raid array
#mdadm /dev/md0 –add /dev/sdb1
And we can immediately see the progress
#mdadm /dev/md0 --detail /dev/md0: Version : 00.90.03 Creation Time : Wed Jul 18 00:57:18 2007 Raid Level : raid5 Array Size : 140632704 (134.12 GiB 144.01 GB) Device Size : 70316352 (67.06 GiB 72.00 GB) Raid Devices : 3 Total Devices : 3 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Sat Feb 22 10:32:01 2014 State : active, degraded, recovering Active Devices : 2 Working Devices : 3 Failed Devices : 0 Spare Devices : 1 Layout : left-symmetric Chunk Size : 64K Rebuild Status : 0% complete UUID : fe510f45:66fd464d:3035a68b:f79f8e5b Events : 0.537869 Number Major Minor RaidDevice State 0 8 1 0 active sync /dev/sda1 3 8 17 1 spare rebuilding /dev/sdb1 2 8 33 2 active sync /dev/sdc1
And then to see the progress of rebuilding
#cat /proc/mdadm Personalities : [raid1] [raid6] [raid5] [raid4] md0 : active raid5 sdb1 sda1 sdc1 140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U] [==============>......] recovery = 71.1% (50047872/70316352) finish=11.0min speed=30549K/sec md1 : active raid1 sda2 1365440 blocks [2/1] [U_]
Wow in the time I have been blogging this, already 71 percent rebuilt!, but wait! what is this, md1 is failed? I check my monitor and what do I find but another message that shows that md1 failed with the reboot. I was so used to getting the notice saying md0 was down I did not notice that md1 did not come backup with the reboot! How can this be?
It turnd out that sdb was in use on both md1 and md0, but even through sdb could not be read at all on /dev/sdb and /dev/sdb1 failed out of the md0 array, somehow the raid subsystem had not noticed and degraded the md1 array even though the entire sdb disk was not respoding (perhaps sdb2 WAS responding back then just not sdb), who knows at this point. Maybe the errors on the old disk could have been corrected by the reboot if I had tried that before replacing the disk, but that doesn’t matter any more, All I know is that I have to repartion the sdb device in order to support both the md0 and md1 arrays.
I had to wait until sdb finished rebuilding, then remove it from md0, use fdisk to destroy the partitions, build new partitions matching sda and add the disk back to md0 and md1
MDADM – Failed disk recovery (too many disk errors)
This only happens once every couple of years, but occasionally a SCSI disk on one of our servers has too many errors, and is kicked out of the md array
And… we have to rebuild it. Perhaps we should replace it since it appears to be having problems, but really, the I in RAID is inexpensive (or something) so I would rather lean to being frugal with the disks and replacing them only if required.
I can never remember of the top of my head the commands to recover, so this time I am going to blog it so I can easily find it.
First step, take a look at the status of the arrays on the disk
#cat /proc/mdstat (I don't have a copy of what the failed drive looks like since I didn't start blogging until after) Sometimes an infrequent disk error can cause md to fail a hard drive and remove it from an array, even though the disk is fine.
That is what happened in this case, and I knew the disk was at least partially good. The disk / partition that failed was /dev/sdb1 and was part of a RAID V, on that same device another partition is part of a RAID I, that RAID I is still healthy so I knew the disk is just fine. So I am only re-adding the disk to the array so it can rebuild. If the disk has a second problem in the next few months, I will go ahead and replace it, since the issue that happened tonight is probably indicating a disk that is beginning to fail but probably still has lots of life in it.
The simple process is
#mdadm /dev/md0 --remove /dev/sdb1
This removed the faulty disk, that is when you would physically replace the disk in the machine, since I am only going to rebuild the disk I just skip that and move to the next step.
#mdadm /dev/md0 --re-add
The disk started to reload and VOILA! we are rebuilding and will be back online in a few minutes.
Now you take a look at the status of the arrays
#cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md0 : active raid5 sdb1 sdc1 sda1 140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U] [=======>.............] recovery = 35.2% (24758528/70316352) finish=26.1min speed=29020K/sec md1 : active raid1 sda2 sdb2 1365440 blocks [2/2] [UU]
In case you want to do any trouble shooting on what happened, this command is useful in looking into the logs.
#grep mdadm /var/log/syslog -A10 -B10
But this command is the one that I use to see the important events related to the failure and rebuild. As I am typing this I am just over 60% complete rebuilt which you see in the log
#grep mdadm /var/log/syslog Jun 15 21:02:02 xxxxxx mdadm: Fail event detected on md device /dev/md0, component device /dev/sdb1 Jun 15 22:03:16 xxxxxx mdadm: RebuildStarted event detected on md device /dev/md0 Jun 15 22:11:16 xxxxxx mdadm: Rebuild20 event detected on md device /dev/md0 Jun 15 22:19:16 xxxxxx mdadm: Rebuild40 event detected on md device /dev/md0 Jun 15 22:27:16 xxxxxx mdadm: Rebuild60 event detected on md device /dev/md0
You can see from the times, it took me just over an hour to respond and start the rebuild (I know, that seems too long if I were to just do this remotely, but when I got the notice, I went on site since I thought I would have to do a physical swap and I had to wait a bit while the Colo security verified my ID, and I was probably moving a little slow after some Nachos at Jalepeno’s) Once the rebuild started it took about 10 minutes per 20% of the disk to rebuild.
Update: 9 months later the disk finally gave out and I had to manually replace the disk. I blogged again: