To God Goes All The Glory!
How to recover data from a crashed MBWE-II
I want to acknowledge the help given me by Gabriel (who sure earned his name this time!) along with everyone else on these fora who posted their own experiences with the MBWE. Without your help I would have been SO SCREWED it would not be funny.
As we all know, there's really no excuse for inadequate backups. And of all people, I know better.
However, there I was with 30+ years of accumulated experience, tools, tricks, tips, software, etc. on a single drive - just waiting for Good 'Ole Mr. Murphy to come in and ball it up. This data was both critical and irreplaceable, so "failure is NOT an option!"
There was no choice: I had to recover that data "regardless of cost or loss!" - even if it meant I had to go through those disks byte-by-byte with a disk editor.
I was damned lucky. I was able to recover about 99% of my data, with the lost data being (relatively) easily replaced.
It did cost me though. I went through about $700.00, four tanks of gasoline, and a number of trips to my local (!!) Micro-Center to get parts and materials. Not to mention two weeks of acid-reflux.
I am taking the trouble to document what eventually succeeded for me - in the hope that it will help others avoid some of the mistakes *I* made.
Lastly, please excuse the length of this article. Even though I will make it as brief as possible, it was a long time in the telling, and it won't be told here in three lines.
- Your hard drive must still be spinning, with the potential for recovering data
Obviously if your drive's platters have frozen solid and don't spin, or the drive is suffering from a gross mechanical defect - such as pieces rattling around inside - your chances of success plummet like a rock.
- You will need a computer that you can exclusively dedicate to this task for a while
"A while" might be measured in days, or even weeks. It took me two weeks of trial-and-error to get my data fully recovered.
- You will need at least twice as many drives as there were drives in your MBWE
My device had two 500 gig drives, so I purchased four drives to rebuild data on.
- Each drive will need to be at least twice the size of the drive you're trying to recover
Since I had two 500 gig drives, I purchased four 1 TB drives.
- You will need a controller card - or enough free SATA ports on your motherboard - for the extra drives, in addition to the drive(s) already in the system
- You may need a replacement drive for the one that failed
Try to get as exact a replacement as possible. Western Digital, same size, same model series if possible, etc.
- You will need a flavor of Linux compatible with your system and controller
Some people recommend the use of a "Live CD" for the recovery. I don't.
You will need to download, install, save test artifacts and files, etc. etc. etc. I found it much easier to just do a flat "install from scratch" on the recovery system.
- You will need ddrescue / dd_rescue
You will need to find, or download, a copy of "ddrescue" or "dd_rescue". (Despite the similar names, GNU ddrescue and dd_rescue are two separate programs with different command-line options; the steps below refer to dd_rescue.) If your distribution does not already come with one, install it via your distribution's package manager.
- You will need mdadm
This is commonly included in most recent distributions. If it's not included, you can download it via your distribution's package manager.
- You will need a recent copy of the Western Digital Data Lifeguard Tools CD to make a boot floppy of the Western Digital Data Lifeguard "Diagnostics".
- You will need to be on excellent terms with Lady Luck!
Or, as Scripture says: "The fervent effectual prayer of a righteous man availeth much."
And I'm not kidding.
If you're reading this, you are probably already in Deep Sneakers, and sinking fast. Luck, prayer, whatever, will be a primary constituent of your success.
Rule #1: Don't Touch That Drive!
You are already in trouble. Dinking around with the drive - potentially changing its contents - will only make it worse.
Prepare the new drives to receive the recovery data
- Attach all the new drives, create one single partition on each, and format as ext3.
- You can do this one-at-a-time, or you can attach all four of the new recovery drives to the controller, and format them all up there.
- Shut down and remove all formatted drives and set them aside carefully.
Copying the data off the damaged drive.
- Install the drive that is NOT damaged, and view the partition table with Gparted or QTParted and verify that the partition table is intact.
- Your partition table should look like this:
- Unallocated space. (This space is used to store individual system specific data, such as MAC address, serial number, etc.)
- Partition #1, formatted as ext3. (This is the boot partition, with /boot, /root, etc. on it.)
- Partition #2, formatted as swap (This is the system paging file.)
- Partition #3, formatted as ext3 (This is the rest of the O/S, /var, etc.)
- Partition #4, unknown format. (This is the data-store, don't modify or change this!)
These partitions will be essentially identical between the two drives on a two drive system - Linear array or mirrored.
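If you would rather check the layout from a command line than from Gparted, here is a minimal sketch that prints the type byte of each of the four primary partition entries. It assumes the drive uses a classic MBR partition table (table at byte 446 of the first sector, four 16-byte entries); that is an assumption about the MBWE drives, not something verified here. The function name `mbr_partition_types` and the device name /dev/sdb are placeholders. The function only reads, never writes.

```shell
# Print the type byte of each primary partition entry in an MBR.
# Assumes a classic MBR layout: table at byte 446, four 16-byte entries,
# type byte at offset 4 inside each entry. Read-only.
mbr_partition_types() {
    src="$1"    # a device (e.g. /dev/sdb, placeholder) or a saved first-sector file
    for i in 0 1 2 3; do
        off=$((446 + 16 * i + 4))
        type=$(dd if="$src" bs=1 skip="$off" count=1 2>/dev/null | od -An -tx1 | tr -d ' \n')
        echo "partition $((i + 1)): type 0x${type:-??}"
    done
}
```

Call it as `mbr_partition_types /dev/sdb`. In the standard MBR type codes, 83 is a Linux filesystem, 82 is Linux swap, and 00 is an unused entry. Running it against both drives and comparing the output is a quick way to see whether the two tables agree.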
- Using dd_rescue, copy the entire "un-damaged" drive (I will call it drive "B" from here on) to a file on one of the new drives.
- This will take a fairly long while - measured in hours.
- Take note of any failed blocks. (cut-and-paste to a text file.)
- Shut down the system, turn it off, remove the new drive with the file, label it, and put it somewhere safe.
- Attach another new drive.
- Using dd_rescue, copy the last partition from the "undamaged" drive to a file on the new drive.
- This will also take a long while. Almost exactly as long as the first copy, since this is where most of the data lives.
- Again, take note of any failed blocks. Hopefully you won't find any on this "good" drive during either copy.
- Shut down the system, turn it off, and remove both the new drive and the "B" drive. Label each one and put them somewhere safe, in separate places.
- Add the failed drive to the system and attempt to verify partitions
- Attach the failed drive ("A"), to the controller where the "B" drive was, and re-run the Gparted, QTParted partition verification step as noted above.
- Shut down and turn off the system.
IF the "failed" drive's partition table is OK, continue with the next section.
IF the "failed" drive's partition table is NOT OK, continue with the steps below.
- Use dd to copy the first 512 bytes (the sector that holds the partition table) from the disk with the good partition table to a file.
- Use dd again to write that file over the first 512 bytes of the "bad" disk, to see if we can recover valid partition data.
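The first-sector copy described above can be sketched roughly as follows. This is an illustration only: the function name `copy_first_sector`, the file names, and the device names in the comments are placeholders, and it saves the bad disk's original sector first so the change can be undone. Keep in mind that the first sector holds more than the partition table (boot code, for example), so this only makes sense when both drives really do share the same layout.

```shell
# Copy the first 512-byte sector from a good disk to a bad one,
# keeping a backup of the sector that gets overwritten.
# All three arguments are placeholders chosen for this sketch.
copy_first_sector() {
    good="$1"      # disk with the intact partition table, e.g. /dev/sdb
    bad="$2"       # disk whose partition table is damaged, e.g. /dev/sdc
    workdir="$3"   # directory for the saved sector files

    # Save both first sectors to files before changing anything.
    dd if="$good" of="$workdir/good-sector0.bin" bs=512 count=1 2>/dev/null || return 1
    dd if="$bad"  of="$workdir/bad-sector0.bin"  bs=512 count=1 2>/dev/null || return 1

    # Write the good sector over the bad disk's first sector.
    # conv=notrunc matters when the target is an image file rather than a device.
    dd if="$workdir/good-sector0.bin" of="$bad" bs=512 count=1 conv=notrunc 2>/dev/null
}
```

To undo the change, write bad-sector0.bin back to the bad disk the same way. Trying this on an image file of the bad disk first, rather than on the disk itself, is the safer order.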
Attempt to recover data from the failed drive
- Attach the failed drive ("A") to the controller where the "B" drive was, and attach another new drive.
- Reboot the system.
- Using dd_rescue, copy the last partition of the "A" drive to a file on the new disk.
- Again, this will take a long while.
- Also, take careful note of any bad blocks.
- Shut down the system, turn it off, remove and label the new drive, and put it away safely.
- Attach the last new drive and reboot.
- Attempt to copy data from the entire disk to a file on the last new hard disk
- Allow dd_rescue to copy about half the disk contents to a file, then abort it with CTRL-C.
- Hopefully, one of the two disks had the system partitions without errors.
- Shut down the system, turn it off, remove and label the last new drive, and put it away safely, leaving the potentially defective drive attached.
At this point, you should have all the images you need.
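Each of the imaging steps above boils down to "copy this device to that file and keep a record of what went wrong". A minimal sketch of a wrapper for that, assuming the basic `dd_rescue SOURCE DESTINATION` form: the function name `image_to_file`, the `.log` suffix, and the device names are made up for illustration. It refuses to overwrite an image that already exists, and by default it only prints the command so you can check the device names before anything runs.

```shell
# Dry-run switch: commands are only printed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Copy a source device to an image file, refusing to overwrite an existing image.
image_to_file() {
    src="$1"   # e.g. /dev/sdb (whole disk) or /dev/sdb4 (data partition); placeholders
    dst="$2"   # e.g. /recover/b/b-recover-data
    if [ -e "$dst" ]; then
        echo "refusing to overwrite $dst" >&2
        return 1
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ dd_rescue $src $dst"
    else
        # Keep everything dd_rescue reports, including failed blocks, in a text file.
        dd_rescue "$src" "$dst" 2>&1 | tee "$dst.log"
    fi
}
```

For example, `image_to_file /dev/sdb4 /recover/b/b-recover-data` prints the command it would run; set `DRY_RUN=0` once the names are right. The log file takes the place of cut-and-paste for recording failed blocks.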
Verify if the "failed" drive is really bad
- At this point, the system should be shut down, with all the new drives removed, and the one failing drive still attached.
- Boot the system using the "Diagnostics" floppy you created from the Western Digital Data Lifeguard CD.
- Select the correct drive in your system.
- Run the "Quick Test".
- It is not necessary to run the "full" test.
- If the drive passes the "Quick" test, repeat it a few times to verify that it always passes.
- Ideally, each pass will return an error code of "0000"
- If the drive passes, mark it so, and put it away.
- If the drive fails, mark it so, and set it aside where you won't pick it up to use it.
- The magnets out of a failed H/D make GREAT 'fridge magnets!
- Replace it with the replacement drive you purchased, or go purchase one. Remember to get as exact a replacement as humanly possible.
- Repeat this same exact procedure, substituting the other MBWE drive to verify it is OK.
Attempt to rebuild the damaged data array
- Re-attach the data image drives and prepare to recover
- Shut down and turn off the system if it is not already shut down.
- Attach the two drives that have the two data-partition images on them in positions 1 & 2 on the controller.
- Attach a blank drive - if available - as position #3.
- Restart the system.
- Mount the three drives in a convenient location
- I will assume /recover/a, /recover/b, and /recover/c are the mount points.
- I am also assuming that the drive with the drive "A" data image is first, the drive "B" data image is second.
- Loop-mount the recovered data image files created before.
- I will assume that they're named "a-recover-data" and "b-recover-data"
- Execute the following commands to loop-mount the two image files:
losetup /dev/loop0 /recover/a/a-recover-data
losetup /dev/loop1 /recover/b/b-recover-data
This creates two "fake" (virtual) drives, attached as /dev/loop0 and /dev/loop1, that contain the contents of these two files.
Trick: You can loop-mount ANY valid file-system image - including things like cd/dvd ISO images, etc.
- Merge the images into a copy of their original array
- Execute the following command to re-create the original MBWE array structure:
mdadm --assemble /dev/md1 --force /dev/loop0 /dev/loop1
This command takes the two loop-mounted array parts and (hopefully!) merges them into an array image similar to the one on the MBWE that the two drives came out of.
Hopefully the array built - and started! - correctly. If it didn't, I don't know how to help you here.
Assuming the array built correctly - mount /dev/md1 wherever convenient. (Let's assume /recover/md1)
Navigate to the mount point, and view the contents of the root of that "drive". If all has gone well, at this point you should see a filesystem containing folders and data - as you had it on the original MBWE.
If you successfully see a filesystem - congratulate yourself, take a deep breath, and perhaps take a short break.
If you don't have a filesystem here - I am not sure how to fix this. Not without messing with it myself.
Make a "backup" of the filesystem's apparent content.
- Very Important!
- Using "cp -R", copy the entire contents of the /dev/md1 mount point to the empty drive you have mounted at your third hard drive mount point.
- This will take a while. Take careful note of any files that generate errors.
- We do this because when we try to repair the two partition images, things might get destroyed.
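The backup copy above can be sketched like this, with cp's error messages collected in a text file instead of scrolling past on the screen. The function name `backup_tree` and the paths in the comments are placeholders for illustration.

```shell
# Copy a whole tree with cp -R and keep any error messages in a log file.
backup_tree() {
    src="$1"   # mount point of the array, e.g. /recover/md1
    dst="$2"   # directory on the empty drive, e.g. /recover/c/backup (hypothetical)
    log="$3"   # text file that collects cp's error messages
    mkdir -p "$dst" || return 1
    rc=0
    # "$src"/. copies the contents of src, including hidden files, into dst.
    cp -R "$src"/. "$dst"/ 2> "$log" || rc=$?
    if [ -s "$log" ]; then
        echo "cp reported problems; see $log" >&2
    fi
    return "$rc"
}
```

For example: `backup_tree /recover/md1 /recover/c/backup /recover/c/cp-errors.log`. Any file named in the log is one to check by hand later.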
Attempt to repair / recover the partition images
- Check array partitions for consistency
- Execute the following command to verify the structure of the array partition's filesystem.
fsck -t ext3 /dev/md1 -- -n -f -v
-n = Don't actually fix anything.
-f = Force scan, even if screwy.
-v = Tell us a lot about what you see.
- Again, remember to take careful note of any errors or issues seen.
- In my case, there were a lot of "inode hash" errors.
- Try a "real" fsck to clean up issues
- This will discover if any of the issues disclosed were "serious" issues. (They probably are, but we can see if we get lucky… .)
- Execute the following command:
fsck -t ext3 /dev/md1 -- -D -p -f -v
-D = Consolidate and re-index directories.
-p = "Preen" (auto-repair) non-critical problems.
-f = Force checking.
-v = Tell us what's happening.
- You may get a "/dev/md1: Adding dirhash hint to filesystem" message when you start the "real" fsck. This is indicating that fsck is updating the partition to handle indexing properly. This is a non-problem.
- When I did this, it still bailed out on me because "inode hash" issues are considered "critical" problems. What will happen is that - if you force fix, and you will need to, trust me - the directories and/or files with the inode hash errors will be deleted and the space consumed returned to the free pool.
- Retry fsck forcing it to fix all errors found
- We will need to absolutely clean up the issues found, so we must (at this point) force fsck to fix things.
- Execute the following code to do this:
fsck -t ext3 /dev/md1 -- -y -f -v
(Note: we're omitting the "-D" here deliberately.)
-y = Force auto-fix (answer any question "yes!").
- Re-execute the same command again to verify all issues have been resolved.
- Repeat until there are no more errors found.
- Once everything is OK, re-run fsck again to optimize and re-index directories.
fsck -t ext3 /dev/md1 -- -D -y -f -v
This does the same as before, except that the "-D" forces directory re-indexing and optimization.
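The "repeat until there are no more errors" step can also be scripted. Here is a minimal sketch that relies on fsck's documented exit status: 0 means no errors, 1 means errors were corrected, 4 means errors were left uncorrected, and 8 or above means fsck itself failed. The function name `fsck_until_clean`, the `FSCK` override variable, and the limit of five passes are assumptions made for this sketch.

```shell
# Re-run a forced fsck until it reports a clean filesystem or we give up.
# FSCK can be overridden (e.g. for testing); it defaults to the real fsck.
fsck_until_clean() {
    dev="$1"          # e.g. /dev/md1
    max="${2:-5}"     # arbitrary upper limit on passes, chosen for this sketch
    pass=1
    while [ "$pass" -le "$max" ]; do
        rc=0
        ${FSCK:-fsck} -t ext3 "$dev" -- -y -f -v || rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "clean after $pass pass(es)"
            return 0
        fi
        # Exit status 8 or higher means fsck itself failed (operational or
        # usage error), so repeating the run will not help.
        if [ "$rc" -ge 8 ]; then
            echo "fsck failed with status $rc" >&2
            return "$rc"
        fi
        pass=$((pass + 1))
    done
    echo "still reporting errors after $max passes" >&2
    return 1
}
```

Run it as `fsck_until_clean /dev/md1` after the backup has been made, then do the final "-D" pass by hand. Redirecting the output to a file keeps the list of deleted files and directories for later.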
- Un-mount /dev/md1, and stop the array
umount /dev/md1
mdadm --stop /dev/md1
Stop and take stock of things
Where we should be now
- We should have two partition image files loop-mounted.
- We should have them successfully assembled into an array.
- We should have successfully run fsck on the array partition and cleaned up any errors.
- We should have at least ONE good disk out of the two that came from the MBWE.
- We should have at least ONE good system image from the two drives.
- If you don't, you will need to download one and follow instructions to install it at a later step.
Begin rebuilding the two drives for the MBWE.
- I am assuming that the "B" drive contained no bad blocks - and if there were, they are in the data partition, not the system partitions.
- I am also assuming that we have a good drive "A", or a replacement, that may not have a good system image on it.
- If this is not true - you do not have ANY good system images, skip the single step below, download a system image, and follow the instructions to install it on the two drives, creating the last (fourth) partition.
- Using dd_rescue, copy the entirety of drive "B" to drive "A". This will replace the bad/missing system partitions, and re-create the 4th partition for the data.
- After this is about 1/2 done, stop the copy with CTRL-C.
- Using dd_rescue, copy the drive "A" data partition image that we fixed-up before, back to partition 4 of drive "A".
- Using dd_rescue, copy the drive "B" data partition image that we fixed-up before, back to partition 4 of drive "B".
- Once that is done, completely shut-down and turn off power.
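Before shutting down, it may be worth checking that each data partition really matches the image that was written to it. A minimal sketch, assuming GNU cmp (for the `-n` byte limit); the function name `verify_writeback` and the device name in the comments are placeholders. It only reads from both sides.

```shell
# Check that the first N bytes of a target match an image file,
# where N is the size of the image. Read-only.
verify_writeback() {
    img="$1"      # e.g. /recover/a/a-recover-data
    target="$2"   # e.g. /dev/sdb4 (placeholder for partition 4 of the rebuilt drive)
    size=$(wc -c < "$img" | tr -d ' ')
    if cmp -s -n "$size" "$img" "$target"; then
        echo "match: $target"
    else
        echo "MISMATCH: $target" >&2
        return 1
    fi
}
```

For example: `verify_writeback /recover/a/a-recover-data /dev/sdb4`. Expect this to take about as long as the copy itself, since both sides are read in full.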
Rebuild the MBWE
- Re-install the hard drives
- Replace the two side-rails on each hard drive (if you removed them)
- Re-insert the two drives into the MBWE, remembering that drive "A" goes in the slot closest to the controller electronics.
- Re-connect all connectors removed during MBWE tear-down.
- Reconnect network and power
- Re-attach the network cable to the MBWE.
- Re-attach the power connector to the MBWE.
- FIRE THAT PUPPY UP!! (and pray…)
- Re-connect power.
- Carefully monitor the front-panel lights.
At this point, the MBWE should boot, do a final internal fsck - which is indicated by the internal lit ring spinning - and then come fully back on-line.
- Note: If you replaced the system partitions with downloaded partition data, you may have to re-configure the MBWE to your needs.
Verify correct operation
- Attempt to access the web setup page
- Verify that the web-setup page works, and that the drive status is "OK"
- Re-configure any settings that you need to change.
- Attempt to access the pre-existing shares on the MBWE
- Verify that the original shares on the MBWE exist, you can access them, and you can read-and-write data to them.
- Note that any files or directories that were "corrected" during the fsck of the partition array above may not be there - you may have to replace this data. THAT is why I asked you to take notes!
Verify everything's correct, replace any lost data, return to service
- Satisfy yourself that everything is back to normal, by shutting down the MBWE, re-booting it, etc.
- You will probably notice that the MBWE is booting up - and serving files - much faster now than ever it did before.
- This is a result of both cleaning up all the cruft and problems, as well as the consolidate, optimize, and re-index steps that we performed during the FSCK operations above.
- Replace any necessary lost data
- Replace any necessary lost data as noted during the FSCK passes above.
- Return to Service
- Return the MBWE to normal operational status.
Congratulate Yourself on a Job Well Done!
What say ye?