RAID 5 Maintenance, Rebuild and Recovery

Keep a comprehensive on-site maintenance log of all activities: who did what, and on what date.

Make sure you record the physical order, location, and date and time of installation and connection of all the system units, especially the hard drives. Visibly label each hard drive with its position in the system, along with its date of installation, while the system is functioning correctly.

RAID 5 System Maintenance Knowledge

Get familiar with your manufacturer-specific documentation, such that you are proficient in the replacement and substitution of any failed component, particularly a single failed hard disk drive.

Ensure you have a fault-reporting routine that will facilitate a response to a system fault as soon as it is flagged, particularly a hard disk drive drop-out.

RAID 5 Maintenance.

Back up your system and test the back-up capability by removing one of the drives from the subsystem while it is running. Replace this drive with a blank hard drive of exactly matching model and type, then allow the system to rebuild to an un-degraded operational state (this may take a relatively long time).
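
If the array in question happens to be a Linux software RAID (mdadm) system, a minimal sketch along the following lines can watch that rebuild via /proc/mdstat; this is an assumption for illustration, as hardware controllers report progress through their own vendor tools instead.

    import re
    import time

    # Watch mdadm rebuild progress from /proc/mdstat (Linux software RAID
    # only - an assumption; hardware controllers need their vendor tools).
    PROGRESS = re.compile(r"(?:recovery|resync)\s*=\s*([\d.]+)%")

    def rebuild_progress():
        """Return the rebuild percentage, or None if no rebuild is running."""
        with open("/proc/mdstat") as f:
            match = PROGRESS.search(f.read())
        return float(match.group(1)) if match else None

    while (pct := rebuild_progress()) is not None:
        print(f"rebuild in progress: {pct:.1f}%")
        time.sleep(60)
    print("no rebuild reported - verify the array state before trusting it")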

Note that if two live hard drives fail simultaneously there is no possibility of a successful automatic rebuild, and a specialised data recovery service will be needed to restore your system to an operational state.

Be aware that a system rebuild using a corrupt volume will not restore data integrity and will almost certainly cause further damage to the system's operational data.

Things to know about RAID 5 Systems.

  • RAID 5 is not a back-up solution; it is a performance option.
  • A RAID 5 with multiple failed hard drives needs specialist attention; never attempt a rebuild that could overwrite your data files.
  • RAID 5 rebuilds require all drive members installed; never attempt a rebuild on a system that is not fully provisioned.
  • In a RAID 5 system, stored data is written across the hard drives in sequence. Never swap hard drives between live locations within the array or shuffle them about.
  • Never remove more than one hard disk drive at a time from a RAID 5 system.
  • We advise never running check-disk type utilities on failed or failing RAID drives; they can cause mayhem when reallocating dud data areas.

 

The RAID 5 hot spare dilemma.

RAID 5 recovery and maintenance actions ought to be:

  • Run a data backup.
  • Check the back-up.
  • Find the bad drive, i.e. check that its serial number matches the one reported by the RAID controller (a sketch follows this list).
  • Replace the bad hard disk with a new, unused one.
  • Start the rebuild of the RAID.
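
A minimal sketch of the drive identification step, assuming a Linux host with smartmontools installed; the device names and the failed serial number are hypothetical, and the point is simply to match each physical drive's serial against the one the controller flagged.

    import subprocess

    # Hypothetical serial reported by the RAID controller for the bad drive.
    FAILED_SERIAL = "WD-WCC4E1234567"

    for dev in ("/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"):
        info = subprocess.run(["smartctl", "-i", dev],
                              capture_output=True, text=True).stdout
        for line in info.splitlines():
            if line.startswith("Serial Number:"):
                serial = line.split(":", 1)[1].strip()
                flag = "  <-- matches the controller's report" \
                       if serial == FAILED_SERIAL else ""
                print(f"{dev}: {serial}{flag}")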

During a RAID 5 rebuild there is an increased probability of an additional drive failure. Using a hot spare, your RAID will skip the first two steps, and a rebuild will be attempted before a contingency back-up is available. Running a hot spare is therefore not recommended for certain critical applications.

The generic tools initially deployed during data recovery procedures are by nature read-only. Does this mean that a data recovery practitioner using these tools cannot, in the process of attempting to recover data from a faulty RAID system, damage the very data he or she is trying to preserve?

In the overwhelming majority of cases, no damage to data stored on RAID-configured hard disk drives is incurred during the recovery process. However, when undertaking relatively large volumes of read instructions from large RAID-configured storage systems, mechanical and electronic components can become stressed, leading to malfunctions that can affect the read process.

RAID system data recovery.

We took four hard disk drives installed in a failed RAID 5 configured NAS system and connected them individually to an imaging tool. Two of the hard disk drives produced read errors requiring intervention to allow the process to complete. The disk images produced could therefore not be considered precise duplicates, but only components that could possibly be re-engineered into a working system.

Re-building the RAID system.

The imaged hard disk drives were then re-assembled in their RAID 5 system disk order, and the specific RAID 5 configuration table(s) were rebuilt and commissioned into a working system that could access the stored data volume striped across the individual disks.
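
For illustration, a simplified de-striping sketch of the kind of reassembly involved, assuming four equal-size member images, a 64 KiB chunk and the left-symmetric parity rotation that Linux mdadm uses by default; a real recovery must first establish the actual disk order, chunk size and layout of the failed system, and the file names here are hypothetical.

    # De-stripe a 4-member RAID 5 volume from member image files.
    CHUNK = 64 * 1024
    N = 4

    images = [open(f"disk{i}.img", "rb") for i in range(N)]
    out = open("volume.img", "wb")

    row = 0
    while True:
        chunks = [img.read(CHUNK) for img in images]
        if not all(chunks):
            break
        parity_disk = (N - 1) - (row % N)     # parity rotates backwards
        for d in range(N - 1):                # data starts after parity, wraps
            out.write(chunks[(parity_disk + 1 + d) % N])
        row += 1
    out.close()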

RAID data recovery and file checking.

The data files available from the system were transferred to an alternative storage medium and the integrity of the files checked by a data recovery technician.

Corrupted Data Files.

A small number of the available files were corrupt and unreadable; however, the majority of data files and folders were acceptable to the customer. The data recovery work, in terms of value, availability and cost, was also a viable proposition, and the customer elected to manually rebuild and reconfigure the small number of corrupted files.

The Data Recovery Technician's Dilemma.

Component replacement on the original hard disk drives, aimed at producing an error-free duplicate image, involves a dilemma. Further work introducing new components may increase read error counts, may improve the quality and integrity of the data recovered, or may bring no improvement whatsoever, i.e. the stored data available to be recovered was corrupt in the first place. Ultimately, the decision to undertake further costly and speculative component replacement and RAID system rebuild work depends on the quantity, and the value to the customer, of the data reading as corrupt.

Selecting a RAID Set-up level

There are two distinct types of drive array: physical and logical. A physical drive array is a group of physical hard disk drives. The physical hard disk drives are managed in partitions known as logical drives. A logical drive is a partition in a physical array of disks made up of contiguous data segments on the individual hard disk drives. A logical drive can consist of an entire physical hard disk drive array or a part of one.

In selecting the physical hard disk drives for use in your RAID set-up, be aware that the final RAID system performance will be constrained by the slowest and smallest hard disk used.

Choose enterprise-class, high-performance hard disk drives of matched type and capacity that the manufacturer recommends for use in RAID and server applications.

To ensure best performance for your specific application select your required RAID level when the system drive is created. The RAID level best suited for a hard disk drive array will be dependent on a number of factors:

  • Number of physical drives in the array.
  • Capacity of the individual drives.
  • Data redundancy needed.
  • Hard drive performance.

=============================================================================================================================================

RAID 0 – Data Striping

RAID 0 provides disk striping across all drives in the RAID array. RAID 0 does not provide any data redundancy, but does offer the best performance of any RAID level. RAID 0 breaks up data into smaller segments, and then stripes the data segments across each drive in the array. The size of each data segment is determined by the stripe size. RAID 0 offers high bandwidth.
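
The address arithmetic is simple enough to sketch; the 64 KiB segment size and four drives below are assumptions for illustration.

    # Map a logical byte offset to (drive, offset on that drive) in RAID 0.
    SEGMENT = 64 * 1024
    DRIVES = 4

    def locate(logical_offset):
        segment_no = logical_offset // SEGMENT
        drive = segment_no % DRIVES           # segments rotate round-robin
        offset = (segment_no // DRIVES) * SEGMENT + logical_offset % SEGMENT
        return drive, offset

    print(locate(0))            # (0, 0)      first segment, first drive
    print(locate(64 * 1024))    # (1, 0)      next segment moves to drive 1
    print(locate(256 * 1024))   # (0, 65536)  wraps back to drive 0, next row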

RAID 0 Uses.

Ideal for applications that require high bandwidth but do not require fault tolerance.

Strong Points.

By breaking up a large file into smaller segments, the RAID controller can use several drives, whether SAS or SATA, to read or write the file faster. RAID 0 involves no parity calculations to complicate the write operation.

Weak Points.

RAID 0 is not fault tolerant. If a drive in a RAID 0 configured array fails, the entire virtual disk (all hard disk drives associated with the virtual disk) will fail.

===================================================================================================================================================

RAID 1 – Disk Mirroring/Disk Duplexing

In RAID 1, the RAID controller duplicates all data from one drive to a second drive, providing complete data redundancy at the cost of doubling the required data storage capacity.

RAID 1 Uses.

Small databases, or any other application that requires fault tolerance but demands relatively small capacity.

Strong Points

Provides redundancy and fault tolerance with minimal capacity requirements.

Weak Points

Requires twice as many disk drives, and performance is badly impaired during drive rebuilds.

=====================================================================================================

RAID 5 – Data Striping with Striped Parity

RAID 5 includes disk striping at the block level and parity. Parity is the data's property of being odd or even, and parity checking detects errors in the data. In RAID 5, the parity information is written to all drives. RAID 5 uses striping with parity to maintain data redundancy. A minimum of three drives is required to build a RAID 5 array, and for best performance they should be identical drives.
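
In practice the parity used is an exclusive-or (XOR) across the stripe, which is why any one lost block can be recomputed from the survivors; a toy demonstration, using one byte per drive for brevity:

    # RAID 5 parity in miniature: parity = XOR of the data blocks, and a
    # lost block is the XOR of everything that survives.
    data = [0x5a, 0xc3, 0x7e]        # one byte from each of three data drives

    parity = 0
    for block in data:
        parity ^= block               # parity = d0 ^ d1 ^ d2

    rebuilt = parity ^ data[0] ^ data[2]   # drive 1 fails: rebuild its byte
    assert rebuilt == data[1]
    print(f"lost block 0x{data[1]:02x} rebuilt as 0x{rebuilt:02x}")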

RAID 5 Uses.

RAID 5 is best suited for networks that perform a lot of small I/O transactions simultaneously.

Strong Points.

RAID 5 addresses the bottleneck issue for random I/O operations. Because each drive contains both data and parity, numerous writes can take place concurrently. RAID 5 implementations typically include a function called hot swap, which allows drives to be replaced while the array is still functioning, either to increase drive capacity or to replace a damaged drive. The controller then rebuilds the data across the drives while the array is running. This is a valuable feature for systems that require 24×7 operation. A rebuild will generally be handled faster by a dedicated hardware RAID controller.

Weak Points

RAID 5 paired with large-capacity drives is more susceptible to array failure due to longer rebuild times: during a rebuild, losing another physical drive is catastrophic. RAID 6 provides protection against this issue.
==============================================================================================================

RAID 1 vs RAID 5

RAID 5

Needs 2 block reads and 2 block writes to write a single block (demonstrated in the sketch after this comparison); has lower storage cost; better for low update rates and large amounts of data.

RAID 1.

Better write performance, as writing a single block requires only 2 block writes; better for high update rates; typically needs around 60% more disks.
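
The 2-read/2-write figure for RAID 5 follows from the read-modify-write parity update; a short sketch of the arithmetic, with illustrative byte values:

    # Single-block RAID 5 update: read old data and old parity (2 reads),
    # cancel the old data out of the parity with XOR, fold in the new
    # data, then write new data and new parity back (2 writes).
    old_data, old_parity = 0x11, 0x77
    new_data = 0x2f

    new_parity = old_parity ^ old_data ^ new_data
    print(f"new parity: 0x{new_parity:02x}")   # 2 block reads + 2 block writes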

=============================================================================================================

RAID 6 – Distributed Parity and Disk Striping

RAID 6 is similar to RAID 5 (disk striping and parity), but instead of one parity block per stripe, there are two. With two independent parity blocks, RAID 6 can survive the loss of two disks in a virtual disk without losing data.
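
A toy sketch of how dual parity achieves this, in the spirit of the P+Q scheme used by Linux md RAID 6: P is a plain XOR, while Q weights each data block by a power of a generator in the Galois field GF(2^8). The byte values are illustrative, and a real implementation is table-driven rather than computed this naively.

    def gf_mul(a, b):
        """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1."""
        result = 0
        while b:
            if b & 1:
                result ^= a
            a <<= 1
            if a & 0x100:
                a ^= 0x11D
            b >>= 1
        return result

    def gf_pow(a, n):
        result = 1
        for _ in range(n):
            result = gf_mul(result, a)
        return result

    def gf_inv(a):
        return gf_pow(a, 254)      # a^255 = 1 in GF(2^8), so a^254 = 1/a

    data = [0x11, 0x22, 0x33]      # one byte from each of three data drives

    P = data[0] ^ data[1] ^ data[2]
    Q = 0
    for i, d in enumerate(data):
        Q ^= gf_mul(gf_pow(2, i), d)          # Q = d0 ^ g*d1 ^ g^2*d2, g = 2

    # lose data drive 1 alone: P rebuilds it, exactly as in RAID 5
    assert P ^ data[0] ^ data[2] == data[1]

    # lose data drive 1 AND the P block: Q still recovers the data
    partial = Q ^ data[0] ^ gf_mul(gf_pow(2, 2), data[2])
    assert gf_mul(gf_inv(2), partial) == data[1]
    # losing two data drives means solving the P and Q equations jointly
    print("double-failure cases recovered")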

RAID 6 provides high data throughput, especially for large files, and suits transaction processing applications because each drive can read and write independently. If a drive fails, the RAID controller uses the parity blocks to recreate the missing information.

RAID 6 Uses.

Use for data that requires a high level of protection from loss. Use for office automation and online customer service that requires fault tolerance. Use for any application that has high read request rates but low write request rates.

Strong Points

Provides data redundancy, high read rates, and good performance in most environments. Provides redundancy with lowest loss of capacity. If two drives in a RAID 6 virtual disk fail, two drive rebuilds are required, one for each drive. These rebuilds do not occur at the same time. The controller rebuilds one failed drive at a time. In the case of a failure of one drive or two drives in a virtual disk, the RAID controller uses the parity blocks to recreate the missing information. Provides the highest level of protection against drive failures of all of the RAID levels.

Weak Points

Not well suited to tasks requiring a lot of writes, and suffers more impact if no cache is used (clustering). Disk drive performance is reduced while a drive is being rebuilt. Environments with few processes do not perform as well, because the RAID overhead is not offset by the performance gains in handling simultaneous processes.

A RAID 6 virtual disk has to generate two sets of parity data for each write operation, which results in a significant decrease in write performance. RAID 6 also costs more because of the extra capacity required by using two parity blocks per stripe.

==============================================================================================================

RAID 10 – Combination of RAID 1 and RAID 0

RAID 10 is a combination of RAID 0 and RAID 1. RAID 10 consists of stripes across mirrored drives. RAID 10 breaks up data into smaller blocks and then mirrors the blocks of data to each RAID 1 RAID set. Each RAID 1 RAID set then duplicates its data to its other drive. The size of each block is determined by the stripe size parameter, which is set during the creation of the RAID set. RAID 10 supports up to eight spans.
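
A small sketch of the placement rule, assuming four drives arranged as two RAID 1 pairs with hypothetical names:

    # RAID 10: stripe blocks across mirrored pairs; each pair stores a
    # copy of its block on both member drives.
    PAIRS = [("disk0", "disk1"), ("disk2", "disk3")]

    def place(block_no):
        """Return the two drives that each receive a copy of this block."""
        return PAIRS[block_no % len(PAIRS)]   # stripe across the pairs

    for blk in range(4):
        a, b = place(blk)
        print(f"block {blk} -> mirrored on {a} and {b}")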

RAID 10 Uses.

Appropriate when used with data storage that requires 100 percent redundancy of mirrored arrays and that needs the enhanced I/O performance of RAID 0 (striped arrays). RAID 10 works well for medium-sized databases or any environment that requires a higher degree of fault tolerance and moderate to medium capacity.

Strong Points.

Provides both high data transfer rates and complete data redundancy.

Weak Points.

Requires twice as many drives as all other RAID levels except RAID 1.

==================================================================================================================================

RAID 50 – Combination of RAID 5 and RAID 0

RAID 50 provides the features of both RAID 0 and RAID 5. RAID 50 includes both parity and disk striping across multiple arrays. RAID 50 is best implemented on two RAID 5 disk arrays with data striped across both disk groups.

RAID 50 breaks up data into smaller blocks and then stripes the blocks of data to each RAID 5 disk set. RAID 5 breaks up data into smaller blocks, calculates parity by performing an exclusive-or on the blocks and then writes the blocks of data and parity to each drive in the array. The size of each block is determined by the stripe size parameter, which is set during the creation of the RAID set.

RAID 50 supports up to eight spans and can tolerate up to eight drive failures, though less than the total disk drive capacity is available. However, only one drive failure can be tolerated within each RAID 5 span.

RAID 50 Uses

Appropriate for data storage that requires the parity protection of RAID 5 together with the enhanced I/O performance of RAID 0 (striped arrays). RAID 50 works well for medium-sized databases or any environment that requires a higher degree of fault tolerance and moderate to medium capacity.

Strong Points

Provides high data transfer rates together with parity-based data redundancy.

Weak Points

Requires a minimum of six drives, and only one drive failure can be tolerated within each RAID 5 span.

Help and Support

Need help and support? Call Datlabs RAID technical support.

Degraded RAID 5

A degraded RAID 5 presents as a system that is unresponsive, or whose transaction processing times have become unacceptably protracted and noticeable to its clients. The most common example of a degraded RAID 5 is one with a single failed hard disk drive. When a hard drive fails, the system undertakes an automatic rebuild process, reconstructing the failed drive's data on a hot spare from the data and parity references stored across all the remaining working hard drives. These rebuild processes take preference over client services, so during this time the RAID will present to the client user as unresponsive, or in technical parlance "degraded". Once the rebuild has completed successfully, the RAID system will return to a responsive working state.
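
What the rebuild loop actually does can be sketched in a few lines, assuming four member images with member 2 failed and a 64 KiB chunk (file names and sizes are hypothetical): every chunk of the missing drive, data and parity alike, is the XOR of the same-offset chunks on the survivors, which is also why a rebuild must read every surviving drive in full.

    # Rebuild the failed member of a 4-drive RAID 5 onto a spare image.
    CHUNK = 64 * 1024
    survivors = [open(f"disk{i}.img", "rb") for i in (0, 1, 3)]
    spare = open("spare.img", "wb")

    while True:
        chunks = [s.read(CHUNK) for s in survivors]
        if not chunks[0]:
            break
        # missing chunk = XOR of the surviving chunks at the same offset
        spare.write(bytes(a ^ b ^ c for a, b, c in zip(*chunks)))
    spare.close()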

RAID 5 Failed Hard Disk Drive Replacement

The condition and health of the hard disk drives in a RAID 5 array need to be monitored on a regular basis. There are a number of maintenance applications available to technicians that facilitate comprehensive monitoring and management, the use of which is outside the scope of this blog, though a minimal sketch follows below. Once a failed hard drive is flagged by a RAID system, it needs to be replaced with a new, healthy hard drive of compatible capacity and performance. The replacement operation needs to be undertaken as soon as is practicable, in order to militate against a possible consecutive drive failure and a catastrophic system failure.
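
As a flavour of what such monitoring involves, a minimal health sweep, assuming a Linux host with smartmontools and direct SMART access to the member drives; many hardware controllers require a pass-through option such as smartctl's -d flag, and the device names here are assumptions.

    import subprocess

    # Poll each member drive's SMART overall-health verdict.
    for dev in ("/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"):
        out = subprocess.run(["smartctl", "-H", dev],
                             capture_output=True, text=True).stdout
        status = "unknown"
        for line in out.splitlines():
            if "overall-health" in line:
                status = line.split(":", 1)[1].strip()   # PASSED or FAILED!
        print(f"{dev}: {status}")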

RAID 5 Failure Technical Support.

A RAID 5 operating normally is unlikely to fail as a result of simultaneous hardware failures. If all drives were healthy prior to the RAID 5 failure, then it is more than likely a consequence of corrupt RAID 5 configuration tables in the control software. With this type of problem you are more than likely going to need the assistance of a technician who is familiar with RAID 5 set-up and operation and able to restore access to your data files. RAID recovery technicians are specialists in their field, and their services are unlikely to be at the lower end of your budget, particularly if you need their immediate and undivided attention. Faced with time-consuming and unexpected expense, the first thing to do is to adequately prepare the system for diagnosis and data recovery, so that you make the most efficient use of the time the technician spends addressing your problem.

RAID System Power Down.

This may be obvious, but it is worth amplifying: the configuration that your RAID hard disk drives are running with is now corrupt, or has the potential to become corrupt. While the RAID is in this state it will be abortively trying to rebuild itself, potentially allocating data to areas of the system containing files you would not want to lose. So the first thing to do is to power your system down.

Identify RAID hard disk order.

Once powered down, the next thing to do is to record the physical order of the hard disks in the array. A drive's position in the set can also be read from the disk itself, but this requires a specialist procedure best left to the technician.

RAID 5 Rebuild.

Rebuilding a RAID 5 following multiple hard drive failures or a corrupt configuration table is better attended to in a laboratory environment, where technicians have access to the specialist tools, workshop facilities and resources essential to the rebuild and recovery process. Need help? Then call our RAID 5 data recovery service.

High Capacity RAID Systems.

RAID configurations are used to protect against the failure of hard disks, storing data redundantly across a small group of disks. When one disk fails, a spare disk is brought into play to rebuild the system, either from a mirror copy in RAID 1 or from the remaining data and parity in RAID 5 or RAID 6. A RAID 1 or RAID 5 configured system cannot cope with more than one hard disk failing during the rebuild. A RAID 6 array, thanks to its dual parity, can cope with two concurrent disk failures.

RAID Rebuild Times

The problem with RAID systems is that they take an inordinate amount of time to rebuild in a failure situation, which means operational disruption for an increasing number of businesses. This disruption is a consequence of the large capacity of the disks being used and the fact that read/write speeds have not increased in proportion to the growth in capacity. With the array in use, the rebuild operation must also compete with normal I/O operations, so after a large-capacity disk failure the rebuild will be extremely slow.

Typically, an array using 2 TB hard disk drives will have a rebuild time of 6-8 hours and upward.
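
A back-of-envelope check on that figure, assuming a fairly generous 100 MB/s sustained rebuild rate:

    # Best-case rebuild time for a 2 TB member at an assumed 100 MB/s.
    capacity_bytes = 2 * 10**12
    rate_bytes_per_s = 100 * 10**6

    hours = capacity_bytes / rate_bytes_per_s / 3600
    print(f"best case: {hours:.1f} hours")   # ~5.6 hours on an idle array
    # once the rebuild competes with client I/O, 6-8 hours and upward follows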

We must also take into account that during a rebuild a whole hard disk drive's worth of data needs to be read from the remaining hard disk drives in the RAID array. This is a heavy workload that will place stress on the remaining hard disk drives and increase the probability of a second hard disk drive failure.

Larger numbers of smaller-capacity enterprise-class hard disk drive units in a nested RAID configuration can militate against disruption during a failure situation. In considering an implementation involving a larger number of disks, this has to be offset against the growing complexity of parity and longer access times.

Find yourself in a situation where you have lost access to the data on your RAID system?
Not sure how to react or what to do?
Here are a few tips to help you through this troubling time.

  1. Don't Panic
  2. Don't allow any disk-checking software to run on the array (e.g. Scandisk, Chkdsk, etc.)
  3. Don't change the disk order
  4. Don't continue to run the system
  5. Don't reconfigure the array
  6. Don't attempt to rebuild the RAID

If you would like more information on what to do in this type of situation, see our How to react in a data loss situation page.

RAID Failures in Server Rooms.

We were recently asked by a major media service provider to comment on RAID failures in their server room, subsequent to the deployment of an inert gas fire suppression system (IGFSS) and the release of fire suppressant gas.

Server Room Environment.

The general circumstances were that the servers were located in a dedicated server room with all environmental systems correctly installed and functional. There was no evidence of ambient temperature events or gradients outside acceptable limits. Inspection of each rack and of the server, communication, power and control equipment showed no sign of local overheating, burning or fire. We noted, however, that the server racks were operated without their front panel covers.

Failed RAID configured Hard Disk Drives.

We were presented with, and examined, a sample of hard disk drives from the failed RAID 6 configured servers. On inspection, each of the hard disk drives was fully functional and fit for purpose. We concluded that the RAID hard disk drives had not suffered permanent electronic, mechanical component or platter damage subsequent to IGFSS deployment. This was as expected: IGFSS comprise inert gas that presents no danger to electronics, hardware or storage media.

Rapid Suppressant Gas Release into Server Rooms.

IGFSS are designed to extinguish a fire by quickly flooding equipment rooms with inert gas in order to rapidly reduce oxygen levels. The gas is stored under high pressure in cylinders for rapid deployment to fire-affected areas. Cylinder pressures are circa 2,900 PSI, with discharge at circa 1,000 PSI. We noted that the system exhaust nozzles were reported to be pointed towards the racks and that covers were not fitted. We concluded that the rapid release of the gas directly onto or near the hard disk drives had resulted in high levels of acoustic noise, or transient air-displacement pressure variations, that may have disrupted the read/write functions of the RAID-configured server hard disk drives.

Prior Research and Testing.

Generally available research into the effects of suppressant gas release events is scarce, and what there is has been conducted under controlled conditions. What we do know is that real-world reported events are relatively rare, but have resulted in simultaneous multiple server failures. Trials conducted in the US have shown that the worst disruption to RAID server hard disk drive operation occurs during the first 60 seconds of IGFSS deployment, with the pressure variations or noise reaching the HDDs causing individual read/write heads to go “off track” (generally, HDDs can tolerate less than 1/1,000,000 of an inch of offset from the centre of the data track; anything in excess will disrupt read/write events).

Hard Drives and Server Room Environments

The wider question of which HDD models are more or less affected by IGFSS deployment is almost imponderable, each enterprise-class hard disk drive make and model having its own unique set of sensitivities and design specifics.

 

How to avoid RAID Server Failure.

Power Down.

Most IGFSS events will be preceded by an alarm. If possible, and without risk to any person, safely and gracefully shut down the RAID server systems to militate against the potential data loss and damage that may result from a rapid gas release event.

Fit Smaller Gas Exhaust Nozzles

Smaller nozzles generate less noise and have less impact on HDDs; where possible, fit the small variant of these nozzles. Although there is little evidence to support this, pneumatic sirens may also affect HDD operation, so install these outside the actual server rooms.

Ensure RACK Front Covers are fitted

Fit covers to the front of the racks, as these will act as noise baffles and shield drives from the impact of a direct gas release.

 

Oh, and don’t smoke in the server room!

pip pip