Mdadm – Failed disk recovery (unreadable disk)


After 9 more months I ran into another disk failure. (The first disk failure is covered here.)

But this time, the system was unable to read the disk at all:

#fdisk /dev/sdb

This just hung for a few minutes. It seemed I couldn't simply run a few commands, as before, to remove and re-add the disk to the software RAID, so I had to replace the disk. Before I went to the datacenter I ran:

#mdadm /dev/md0 --remove /dev/sdb1
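In my case sdb1 had already been kicked out of the array, so --remove was enough. When the kernel still considers the member active, mdadm refuses to remove it and the device has to be marked failed first; a sketch of that sequence with the same device names:

```
#mdadm /dev/md0 --fail /dev/sdb1
#mdadm /dev/md0 --remove /dev/sdb1
```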

I physically went to our data center and found the disk that showed the failure (it was sdb, so I ‘assumed’ it was the center disk of the three, but I was able to verify that since it was not blinking with normal disk activity). I removed the disk, swapped it for a spare I had sitting there waiting for this to happen, and put the new one in its place. Then I ran a command to make sure the disk was correctly partitioned to fit into the array:

#fdisk /dev/sdb

This command did not hang, but responded with "cannot read disk". Darn: it looked like some error within the OS or on the backplane kept the newly added disk from being readable. I scheduled a restart on the server, and when it came back up fdisk could read the disk. It looked like I had used the disk for something before, but since I had put it in my spare disk pile I knew I could wipe it, so I partitioned it with one partition to match what the md device was expecting (the same as the old disk):

#fdisk /dev/sdb
>d 2         - delete the old partition 2
>d 1         - delete the old partition 1
>n           - create a new partition
>p           - make the new partition primary
>1           - make it partition number 1
><ENTER>     - press enter to accept the default starting cylinder
><ENTER>     - press enter to accept the default ending cylinder
>w           - write the partition changes to disk (fdisk exits after writing)
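As an aside, the same keystrokes can be fed to fdisk non-interactively. This is fragile (fdisk's prompts and defaults vary between versions), so treat it as a sketch of the steps above rather than something to paste blindly:

```
#printf 'd\n2\nd\n1\nn\np\n1\n\n\nw\n' | fdisk /dev/sdb
```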


Now the partition is ready to add back to the RAID array:

#mdadm /dev/md0 --add /dev/sdb1

And we can immediately see the progress:

#mdadm --detail /dev/md0
        Version : 00.90.03
  Creation Time : Wed Jul 18 00:57:18 2007
     Raid Level : raid5
     Array Size : 140632704 (134.12 GiB 144.01 GB)
    Device Size : 70316352 (67.06 GiB 72.00 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Feb 22 10:32:01 2014
          State : active, degraded, recovering
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 0% complete

           UUID : fe510f45:66fd464d:3035a68b:f79f8e5b
         Events : 0.537869

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       3       8       17        1      spare rebuilding   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
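The two fields worth watching in that wall of output are the array state and the rebuild progress, and they are easy to pull out with grep. A sketch, run here against a saved snippet of the output above (in practice, pipe mdadm --detail /dev/md0 straight in):

```shell
# Filter array state and rebuild progress out of `mdadm --detail` output.
# The here-document stands in for the real command's output shown above.
grep -E 'State :|Rebuild Status' <<'EOF'
   Raid Devices : 3
          State : active, degraded, recovering
 Failed Devices : 0
 Rebuild Status : 0% complete
EOF
```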

And then to watch the progress of the rebuild:

#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[3] sda1[0] sdc1[2]
 140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
 [==============>......] recovery = 71.1% (50047872/70316352) finish=11.0min speed=30549K/sec
md1 : active raid1 sda2[0]
 1365440 blocks [2/1] [U_]
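Incidentally, that finish= figure is just arithmetic: the remaining blocks divided by the rebuild speed (both in units of 1K). It can be recomputed from the recovery line; a sketch with awk, run here against a copy of the line above (pipe `grep recovery /proc/mdstat` in for a live number; rounding differs slightly from the kernel's truncated 11.0):

```shell
# Recompute mdstat's finish= estimate: remaining 1K blocks / speed in K/sec.
echo ' [==============>......] recovery = 71.1% (50047872/70316352) finish=11.0min speed=30549K/sec' |
awk '{
  split($5, b, "[(/)]")    # b[2] = blocks done, b[3] = total blocks
  split($7, s, "[=K]")     # s[2] = speed in K/sec
  printf "%.1f min left\n", (b[3] - b[2]) / s[2] / 60
}'
```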

Wow, in the time I have been blogging this it is already 71 percent rebuilt! But wait, what is this: md1 is degraded? I checked my monitoring and found another message showing that md1 failed with the reboot. I was so used to getting the notice saying md0 was down that I did not notice md1 did not come back up after the reboot! How can this be?

It turned out that sdb was in use by both md1 and md0, but even though sdb could not be read at all and /dev/sdb1 had failed out of the md0 array, somehow the RAID subsystem had never noticed and degraded the md1 array, even though the entire sdb disk was not responding (perhaps sdb2 WAS responding back then, just not sdb as a whole; who knows at this point). Maybe the errors on the old disk could have been corrected by the reboot if I had tried that before replacing the disk, but that doesn't matter any more. All I know is that I have to repartition the sdb device to support both the md0 and md1 arrays.
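This kind of miss is what mdadm's monitor mode is for: it watches every array and mails on each event, so a degraded md1 produces its own notice instead of being lost in the md0 noise. A sketch of the usual setup (the address is an example, and the config path varies; Debian-style systems use /etc/mdadm/mdadm.conf):

```
# in /etc/mdadm.conf
MAILADDR admin@example.com

#mdadm --monitor --scan --daemonise
```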

I had to wait until sdb finished rebuilding, then remove it from md0, use fdisk to destroy the partitions, build new partitions matching sda, and add the disk back to both md0 and md1.
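Since the goal is just to make sdb's partition table identical to sda's, sfdisk can clone it in one step instead of retyping the layout in fdisk. A sketch of the whole sequence (assuming, as the mdstat output suggests, that sdb2 is md1's missing member, and MBR partition tables as used here):

```
#mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
#sfdisk -d /dev/sda | sfdisk /dev/sdb
#mdadm /dev/md0 --add /dev/sdb1
#mdadm /dev/md1 --add /dev/sdb2
```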