
Troubleshooting RAID-1 devices

Versions: 2.4, 2.5, 3.0

Applies to platform: UTM Mercury and Macro 2.4, 2.5, 3.0, UTM Software 2.4, 2.5, UTM 3.0 with RAID
Last updated on: 9th November 2012

All Endian UTM Appliance models "Mercury Pro" and "Macro" have had built-in RAID-1 support since 2007.

Note

If you have a 5.0 Appliance, please refer to this link.

If you want to install an Endian software appliance on different hardware that has two hard disks and use them as a RAID array, the installer automatically detects them and gives you the choice to enable RAID-1.

This lesson guides you through tackling and resolving common issues that may occur with RAID-1 devices.

Warning

The commands presented in this lesson operate on the hard disks of your system at a low level. Using them in the wrong way may cause your hard disks to be wiped out and your system to become unusable! So, be careful!

1. Verifying Endian UTM Appliance's RAID status

In order to see whether the RAID array of your Endian UTM Appliance is working properly or whether some fix is needed, you can look at the /proc/mdstat special file:

root@endian:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb2[1] sda2[0]
205760448 blocks [2/2] [UU]

md1 : active raid1 sdb1[1] sda1[0]
38435392 blocks [2/2] [UU]
unused devices: <none>

The above output shows the structure of the RAID in a 2.5 Appliance: md1 is mounted on / as is, while md2 contains a PV (Physical Volume) for LVM holding four logical volumes called swap, log, var, and config, used as swap space and mounted on /var/log, /var, and /var/efw respectively. Whenever you see the [UU] string, the array is working correctly, while the [_U] string indicates that the array device is in a degraded state. The output for a 2.4 Appliance may differ slightly from the one above, since its RAID had a different setup.
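
You can cross-check this layout from the console. The following commands are an optional check, assuming the LVM command-line tools are available on the appliance: pvs and lvs list the physical and logical volumes mentioned above, while df -h shows where the filesystems are mounted.

root@endian:~# pvs
root@endian:~# lvs
root@endian:~# df -h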

2. Re-adding a missing partition from a RAID array

You can detect that a partition (or even a whole disk) in a RAID-1 array is missing by issuing the following command from the console:

root@endian:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[1]
38893248 blocks [2/1] [_U]

md4 : active raid1 sda4[1]
77762560 blocks [2/1] [_U]

md1 : active raid1 sda1[1]
32064 blocks [2/1] [_U]

In this case, the partitions of the sdb disk are missing from their arrays; sdb4, for instance, is no longer part of md4. You should also see a message like the following one in the /var/log/messages file, showing that the sdb4 partition is no longer part of the array:

md: kicking non-fresh sdb4 from array!
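
If you want to investigate why the partition was dropped, you can inspect its RAID superblock. This is an optional diagnostic step: mdadm --examine prints, among other things, the update time and the event counter of the member, which on a stale partition lag behind those of the healthy one.

root@endian:~# mdadm --examine /dev/sdb4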

While such a message usually indicates that the sdb4 partition is not working anymore, a first troubleshooting option is to try to rebuild the array, by issuing the following command to re-add the partition sdb4:

root@endian:~# mdadm --add /dev/md4 /dev/sdb4
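
Optionally, you can confirm that the rebuild has started by querying the array directly; mdadm --detail shows the array state and should list the re-added partition as a member being rebuilt.

root@endian:~# mdadm --detail /dev/md4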

You can then follow the synchronization process by looking again at the /proc/mdstat file. During the synchronization it looks like this (in the example below all the missing sdb partitions have been re-added: md3 is being rebuilt, md4 is queued, and md1 has already completed):

root@endian:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdb3[2] sda3[1]
38893248 blocks [2/1] [_U]
[=================>...] recovery = 85.1% (33107136/38893248) finish=2.4min speed=38939K/sec

md4 : active raid1 sdb4[2] sda4[1]
77762560 blocks [2/1] [_U]
resync=DELAYED

md1 : active raid1 sda1[0] sdb1[1]
32064 blocks [2/2] [UU]

You can also follow the process until it completes by using

root@endian:~# watch cat /proc/mdstat

This command displays the current content of the /proc/mdstat file every two seconds. To stop it, press CTRL+C.

If this procedure does not help in rebuilding the array, it becomes necessary to replace the hard disk entirely (see section 3 below).

3. Replacing a faulty hard disk

Replacing a hard disk in a RAID array is a process that takes some time, but it is quite easy to carry out in four steps:

  • Remove partitions from the array.
  • Remove the failed hard disk from the system and plug in the new hard disk.
  • Recreate the partition table on the new hard disk.
  • Add the newly created partitions to the RAID array.

Note

When you replace a faulty hard disk, make sure that the new hard disk is at least as large as the old one, or it will not be possible to recreate the partition table on the new disk.
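
Once the new disk is in place, you can verify that it is large enough by comparing the sizes of the two disks, for instance with the blockdev utility (an optional check, assuming blockdev is available on the appliance; it prints the size of a block device in bytes):

root@endian:~# blockdev --getsize64 /dev/sda
root@endian:~# blockdev --getsize64 /dev/sdb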

We assume that the /dev/sda disk is the good one, while /dev/sdb is the failed one, i.e., the one that must be replaced. If you use LVM partitions, you could have devices named like /dev/mapper/<somename> instead of /dev/sd*.

Remove partitions.
The failed hard disk and all its partitions must be removed from the array. Each partition must first be marked as failed and then removed from its array, which is done by issuing the following commands (replace /dev/md0 and /dev/sdb1 with the actual array and partition names shown in your /proc/mdstat):
root@endian:~# mdadm --manage /dev/md0 --fail /dev/sdb1
root@endian:~# mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1
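
This pair of commands must be repeated for every array that still contains a partition of the failed disk. As a purely illustrative sketch based on the example layout of section 2 (adapt the device names to your own /proc/mdstat output), removing the sdb3 and sdb4 partitions would look like:

root@endian:~# mdadm --manage /dev/md3 --fail /dev/sdb3
root@endian:~# mdadm --manage /dev/md3 --remove /dev/sdb3
root@endian:~# mdadm --manage /dev/md4 --fail /dev/sdb4
root@endian:~# mdadm --manage /dev/md4 --remove /dev/sdb4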

When all the partitions have been removed, which you can check by looking at the /proc/mdstat file and verifying that all the RAID devices (/dev/mdX) are in the [_U] state, you can proceed to replace the failed hard disk.

Replace hard disk.
The new disk must be physically inserted into its slot, an operation that may require the system to be turned off. Unplug the power cord as well, replace the old hard disk with the new one, then plug the power cord back in to boot the system.
Clone the partition table and boot information.
As soon as the hard disk is recognised by the system, you need to copy the partition table and the MBR (Master Boot Record) exactly as they are on the "good" hard disk (i.e., /dev/sda). To achieve this, you can use the following two commands:
Copy the MBR (Master Boot Record):
root@endian:~# dd if=/dev/sda of=/dev/sdb bs=446 count=1
Copy the partition table:
root@endian:~# sfdisk -d /dev/sda | sfdisk /dev/sdb
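
If you want to verify that the copy succeeded (an optional check), you can list the partition tables of both disks and compare them; the partitions on /dev/sdb should now mirror those on /dev/sda:

root@endian:~# sfdisk -l /dev/sda
root@endian:~# sfdisk -l /dev/sdb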

Add the new disk's partitions to the array.

When you are done with the copy, you can add the partitions of the new disk to the array using mdadm, proceeding as in section 2 above, "Re-adding a missing partition from a RAID array".
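
For example, with the layout used in section 2 (again purely illustrative; use the array and partition names from your own /proc/mdstat), re-adding the partitions and monitoring the rebuild would look like:

root@endian:~# mdadm --add /dev/md3 /dev/sdb3
root@endian:~# mdadm --add /dev/md4 /dev/sdb4
root@endian:~# watch cat /proc/mdstat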
