Getting the maximum benefit from redundant, tiered, high-availability and cloud storage systems means choosing the right parts, configuring them to take full advantage of their built-in resiliency features, and planning storage allocations to ensure optimum performance and the best protection for critical data. Here are some tips to help you get the most from your resilient storage investment.
Many paths to all devices: High-availability storage only works when the communications and data paths from sources to devices, and among controllers and cooperating devices, are open and usable. In cluster-based high-availability network storage, it’s important to make fully meshed connections, then to configure "multipath" access so that the solution can actually use those physical connections to route around failed devices and failed cables.
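To illustrate why full meshing matters, here is a minimal Python sketch that counts the paths left after a set of component failures. The component names (hba0, sw-a, ctrl0 and so on) and the three-hop path model are assumptions made for illustration; real multipath software also has to account for zoning, path priorities and I/O policy.

```python
from itertools import product
from typing import NamedTuple


class Path(NamedTuple):
    """One end-to-end route from a host adapter to a storage controller."""
    hba: str
    switch: str
    controller: str


def full_mesh(hbas, switches, controllers):
    """Every adapter reaches every controller through every switch."""
    return [Path(h, s, c) for h, s, c in product(hbas, switches, controllers)]


def surviving_paths(paths, failed):
    """Keep only the paths that touch none of the failed components."""
    failed = set(failed)
    return [p for p in paths if failed.isdisjoint(p)]


meshed = full_mesh(["hba0", "hba1"], ["sw-a", "sw-b"], ["ctrl0", "ctrl1"])
# Hypothetical minimal cabling: two independent, non-meshed routes.
sparse = [Path("hba0", "sw-a", "ctrl0"), Path("hba1", "sw-b", "ctrl1")]

failures = {"sw-a", "ctrl1"}  # one switch and one controller down
print(len(surviving_paths(meshed, failures)))  # 2
print(len(surviving_paths(sparse, failures)))  # 0
```

With the same two failures, the meshed layout still has two usable routes, while the sparsely cabled layout has none.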
RAID/mirroring is strength: Parity-based RAID takes your data, cuts it into chunks and stores the chunks across different devices, along with calculated parity bits that enable reconstruction of lost chunks in the event of device failures. Monolithic RAID systems do this across an array of disks in a single enclosure. More advanced software can do it across physically and geographically separated RAID mirrors at metro distances. On top of the protection that geographically separated mirroring already provides, this adds the ability to reconstruct data even when more disk units fail than the underlying RAID systems could handle alone. Enterprise HA storage vendors often make this sort of higher-order resiliency available as an option on top of mirroring and local RAID; NetApp’s SyncMirror is one example.
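The parity idea can be shown in a few lines of Python. This is a minimal sketch of single-parity striping (the scheme used by RAID 5-style layouts), assuming one stripe, zero padding and in-memory byte strings rather than real disks; the function names are illustrative only.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data, n_data):
    """Split data into n_data equal chunks (zero-padded) plus one parity chunk."""
    size = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(size * n_data, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(n_data)]
    return chunks + [xor_blocks(chunks)]


def rebuild(stripe, lost_index):
    """Recompute one lost chunk by XOR-ing all the surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripe = make_stripe(b"resilient storage", n_data=4)
# Simulate losing the device that held chunk 2, then rebuild it.
assert rebuild(stripe, 2) == stripe[2]
```

Because the parity chunk is the XOR of all data chunks, any single missing chunk equals the XOR of the rest. Losing two chunks from the same stripe is beyond what single parity can repair, which is where mirroring on top of local RAID adds protection.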
Shredding protocols for resilience and security: Taking the idea of lost-data reconstruction by parity a few steps further, some cloud storage vendors have, for the past several years, been experimenting with wide geographic dispersion of encrypted customer data, along with error-correction codes that enable powerful data reconstruction when individual storage units fail or go offline unexpectedly. In theory, this means that primary storage providers could aggregate service from secondary providers of low trust and reliability (read: cheap) to store secure enterprise data. If certain providers are offline, the system reconstructs missing chunks from what’s available. The storage protocol also ensures that attackers would have to compromise multiple devices across numerous service providers in order to recover readable data, making an attack prohibitively difficult and time-consuming.
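The vendors' actual dispersal protocols are not described here, so the following Python sketch substitutes a different, well-known technique with the same two properties: Shamir secret sharing. Any k of n providers can reconstruct the data, and fewer than k learn nothing about it. The field size, function names and provider numbering are assumptions for illustration, and a production system would more likely pair encryption with an erasure code.

```python
import secrets

PRIME = 257  # smallest prime above 255, so every byte value fits in the field


def split_byte(value, k, n):
    """Return n shares (x, y) of one byte; any k of them recover it."""
    # Random polynomial of degree k-1 whose constant term is the byte.
    coeffs = [value] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    return [
        (x, sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME)
        for x in range(1, n + 1)
    ]


def recover_byte(shares):
    """Lagrange interpolation at x = 0 over the integers modulo PRIME."""
    total = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % PRIME
                den = den * (xj - xm) % PRIME
        total = (total + yj * num * pow(den, -1, PRIME)) % PRIME
    return total


def disperse(data, k, n):
    """Return one list of share values per provider (provider i holds x = i+1)."""
    providers = [[] for _ in range(n)]
    for byte in data:
        for x, y in split_byte(byte, k, n):
            providers[x - 1].append(y)
    return providers


def reconstruct(available):
    """available maps provider number (1-based) to its share list; needs >= k."""
    points = list(available.items())
    length = len(points[0][1])
    return bytes(
        recover_byte([(x, ys[i]) for x, ys in points]) for i in range(length)
    )


providers = disperse(b"enterprise data", k=3, n=5)
# Providers 2 and 4 are offline; the remaining three are enough.
online = {1: providers[0], 3: providers[2], 5: providers[4]}
assert reconstruct(online) == b"enterprise data"
```

Note that this scheme multiplies storage by n, whereas erasure codes spread the overhead more efficiently; the sketch trades efficiency for brevity.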
Active/Active lowers TCO: Given the option, it makes sense to configure multi-controller resilient storage systems in active/active mode, where controllers mirror one another’s states to ensure non-disruptive fallback on demand. The benefit to TCO comes from the ability of such systems to permit non-disruptive upgrades, theoretically of the controllers themselves or of any controlled devices. By contrast, other controller resiliency configurations may require more complex preparation and procedures, or scheduling upgrades outside peak use times, all of which tend to add cost and inconvenience. However, the requirements for active/active configuration are stringent: controllers need to be configured identically, with all the same protocols, for mutual fallback to work reliably.
Evaluate your resiliency strategy, equipment, and upgrade status with tools: The complexity of today’s resilient storage solutions makes them hard to document, manage, and maintain without help. Trying to do so, in fact, can increase risks. For example, if you don’t carefully inventory and match configurations on two controllers before placing them in an active/active relationship (see above), the solution may refuse to initialize, or may initialize but fail when called upon. In other situations, out-of-date firmware or software, inappropriate jumper settings, or other subtle hardware misconfigurations can undermine the effectiveness of resilience strategies. Makers of storage solutions and storage management software often offer tools that make remote evaluation of fine-grained device status possible and, in some cases, even direct mitigation.
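As a rough idea of what such a pre-flight check does, this Python sketch compares two controller inventories and reports every setting that differs. The field names (model, firmware, protocols) and values are hypothetical; vendor tools gather far more detail and read it from the devices themselves.

```python
def config_mismatches(ctrl_a, ctrl_b):
    """Return {setting: (value_on_a, value_on_b)} for every setting that differs.

    A setting missing from one controller is reported with None on that side.
    """
    mismatches = {}
    for key in sorted(set(ctrl_a) | set(ctrl_b)):
        a, b = ctrl_a.get(key), ctrl_b.get(key)
        if a != b:
            mismatches[key] = (a, b)
    return mismatches


# Hypothetical inventories for two controllers intended as an active/active pair.
controller_a = {
    "model": "X100",
    "firmware": "4.2.1",
    "protocols": frozenset({"iscsi", "nfs"}),
}
controller_b = {
    "model": "X100",
    "firmware": "4.2.0",
    "protocols": frozenset({"iscsi"}),
}

for setting, (a, b) in config_mismatches(controller_a, controller_b).items():
    print(f"{setting}: A={a!r} B={b!r}")
```

An empty result is the condition to look for before pairing the controllers; here the firmware and protocol differences would both need to be resolved first.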

