Troubleshooting RAID-1 Devices (newer appliances)

Version: 5.0

Applies to platform: Hotspot 5.0 with RAID, UTM 5.0
Last updated on: 10 August 2017


This lesson can only be applied to the most recent appliances, which are equipped with grub-install. For older appliances, please follow this lesson.

If you want to install an Endian software appliance on hardware that has two hard disks and use them as a RAID array, the installer automatically detects the disks and gives you the choice of enabling RAID-1.

This lesson guides you in tackling and resolving common issues that may occur with RAID-1 devices.


The commands presented in this lesson operate on your system's hard disks at a low level. Using them in the wrong way may cause your hard disks to be wiped and your system to become unusable, so be careful!

1. Verifying the Endian UTM Appliance's RAID status

To see whether the RAID array of your Endian UTM Appliance is working correctly or needs fixing, you can look at the /proc/mdstat special file:

root@endian:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb2[1] sda2[0]
205760448 blocks [2/2] [UU]

md1 : active raid1 sdb1[1] sda1[0]
38435392 blocks [2/2] [UU]
unused devices: <none>

The above output shows the structure of the RAID array in a 5.0 appliance: md1 is mounted on / as is, while md2 contains a PV (Physical Volume) for LVM with four logical volumes, called swap, log, var, and config, mounted as swap space, on /var/log, /var, and /var/efw respectively. Whenever you see the [UU] string, the array is working correctly, while a [_U] string indicates that the array device is in a degraded state.
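This check is easy to script: a degraded array shows a '_' in its status string. The following is a minimal sketch (check_raid is an illustrative helper name, not appliance tooling); it takes the path of an mdstat-format file so it can also be tried on a saved copy:

```shell
#!/bin/sh
# Sketch: report degraded arrays by scanning an mdstat-format file
# for status strings such as [_U] that contain '_' (missing member).
check_raid() {
    degraded=$(grep -o '\[[U_]*\]' "$1" 2>/dev/null | grep -c '_' || true)
    if [ "$degraded" -gt 0 ]; then
        echo "WARNING: $degraded degraded RAID device(s)"
    else
        echo "all RAID devices healthy"
    fi
}

# On the appliance:
# check_raid /proc/mdstat
```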

2. Re-adding a missing partition to a RAID array

You can detect that a partition (or even a whole disk) in a RAID-1 array is missing by issuing the following command from the console:

root@endian:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[1]
38893248 blocks [2/1] [_U]

md4 : active raid1 sda4[1]
77762560 blocks [2/1] [_U]

md1 : active raid1 sda1[1]
32064 blocks [2/1] [_U]

In this case, the whole sdb disk has dropped out of the arrays, and you should also see messages like the following one in the /var/log/messages file, showing, for instance, that the sdb4 partition is no longer part of its array:

md: kicking non-fresh sdb4 from array!

While this usually indicates that the sdb4 partition is no longer working, a first troubleshooting option is to try to rebuild the array. Issue the following command to re-add partition sdb4 to its array, and repeat it for each other missing partition:

root@endian:~# mdadm --add /dev/md4 /dev/sdb4

You can then follow the synchronization process by looking again at the /proc/mdstat file, which during the synchronization looks like:

root@endian:~# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdb3[2] sda3[1]
38893248 blocks [2/1] [_U]
[=================>...] recovery = 85.1% (33107136/38893248) finish=2.4min speed=38939K/sec

md4 : active raid1 sdb4[2] sda4[1]
77762560 blocks [2/1] [_U]

md1 : active raid1 sda1[0] sdb1[1]
32064 blocks [2/2] [UU]

You can also follow the process to completion using

root@endian:~# watch cat /proc/mdstat

This command displays the current content of the /proc/mdstat file every two seconds. Press CTRL+C to exit.

If this procedure does not help in rebuilding the array, it becomes necessary to replace the hard disk entirely, as described in the next section.

3. Replacing a faulty hard disk

Replacing a hard disk in a RAID array is a process that takes some time, but it is quite easy to achieve in four steps:

  • Remove partitions from the array.
  • Remove the failed hard disk from the system and plug in the new hard disk.
  • Recreate the partition table on the new hard disk and then the RAID array.
  • Add the newly created partitions to the RAID array.


When you replace a faulty hard disk, make sure that the new hard disk is at least as large as the old one, otherwise it will not be possible to recreate the partition table on the new disk.
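The size requirement can be checked before committing to a disk. The following is a minimal sketch (size_ok is an illustrative helper name); on the appliance the sizes would come from blockdev --getsize64:

```shell
#!/bin/sh
# Sketch: verify the replacement disk is at least as large as the
# surviving one, comparing sizes in bytes.
size_ok() {
    # $1: size of the good disk, $2: size of the new disk
    [ "$2" -ge "$1" ] && echo "new disk is large enough" \
                      || echo "new disk is TOO SMALL"
}

# On the appliance, assuming the device names used in this lesson:
# size_ok "$(blockdev --getsize64 /dev/sda)" "$(blockdev --getsize64 /dev/sdb)"
```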

We assume that the /dev/sda disk is the good one, while /dev/sdb is the failed one, i.e., the one that must be replaced. If you use LVM partitions, you could have devices named like /dev/mapper/<somename> instead of /dev/sd*.

Remove partitions.
The failed hard disk and all its partitions must be removed from the array. Each partition must be marked as failed and then removed from the array, an operation achieved by issuing the following commands (shown here for sdb1, which belongs to md1; repeat them for each partition):
root@endian:~# mdadm --manage /dev/md1 --fail /dev/sdb1
root@endian:~# mdadm --manage /dev/md1 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1

When all the partitions have been removed from the array (you can check this by looking at the /proc/mdstat file and verifying that every RAID device /dev/mdX is in the [_U] state), you can proceed to replace the failed hard disk.
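The per-partition fail/remove commands can also be scripted. A minimal sketch follows; the md-device-to-partition mapping is an assumption based on the layout used in this lesson and must be adjusted to your own /proc/mdstat, and by default (DRY_RUN=1) the commands are only printed, not executed:

```shell
#!/bin/sh
# Sketch: fail and remove every sdb member from its array.
# ASSUMPTION: the md<->partition mapping below mirrors the layout
# shown in this lesson; adjust it to your own /proc/mdstat.
# With DRY_RUN=1 (the default) the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

for pair in md1:sdb1 md3:sdb3 md4:sdb4; do
    md="/dev/${pair%%:*}"
    part="/dev/${pair##*:}"
    run mdadm --manage "$md" --fail "$part"
    run mdadm --manage "$md" --remove "$part"
done
```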

Replace hard disk.
The new disk must be physically inserted into its slot, an operation that may require the system to be turned off. Unplug the power cord, replace the old hard disk with the new one, then plug the power cord back in to boot the system.
Clone the partition table and boot information.
As soon as the new hard disk is recognised by the system, you need to replicate the partition table and the MBR (Master Boot Record) exactly as they are on the "good" hard disk (i.e., /dev/sda). To achieve this goal, you can use the following two commands:

Copy the partition table:

root@endian:~# sfdisk -d /dev/sda | sfdisk /dev/sdb

Install the boot loader into the MBR (Master Boot Record):

root@endian:~# grub-install /dev/sdb
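After cloning, you can sanity-check that the two partition tables now match. The following is a minimal sketch (normalize and tables_match are illustrative helper names); the sed normalization is needed because the sfdisk dumps always differ in the device names (/dev/sda1 versus /dev/sdb1):

```shell
#!/bin/sh
# Sketch: compare two sfdisk dump files, ignoring the device names,
# which legitimately differ between the two disks.
normalize() {
    sed 's|/dev/sd[a-z]|/dev/sdX|g' "$1"
}

tables_match() {
    # $1, $2: files produced by `sfdisk -d <disk> > <file>`
    if [ "$(normalize "$1")" = "$(normalize "$2")" ]; then
        echo "partition tables match"
    else
        echo "partition tables DIFFER"
    fi
}

# Typical use on the appliance:
# sfdisk -d /dev/sda > /tmp/sda.dump
# sfdisk -d /dev/sdb > /tmp/sdb.dump
# tables_match /tmp/sda.dump /tmp/sdb.dump
```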
Add new disk partitions to the array.
When the copy is done, add the partitions on the new disk to the arrays using mdadm, proceeding as described in section 2 above (re-adding a missing partition).