The Journey to a Hybrid Software Defined Storage Infrastructure S01E06

This is the sixth episode of “The Journey to a Hybrid Software Defined Storage Infrastructure”. It is an IBM TEC study made by Angelo Bernasconi, PierLuigi Buratti, Luca Polichetti, Matteo Mascolo, Francesco Perillo.

This episode was already published on LinkedIn on November 8th, 2016 as the pilot episode, here:

https://www.linkedin.com/pulse/dr-hybrid-sds-infrastructure-pierluigi-buratti?trk=prof-post

To read the previous episodes, check here:

E01

https://ilovemystorage.wordpress.com/2016/11/23/the-journey-to-a-hybrid-software-defined-storage-infrastructure-s01e01/

E02

https://ilovemystorage.wordpress.com/2016/11/29/the-journey-to-a-hybrid-software-defined-storage-infrastructure-s01e02/

E03

https://ilovemystorage.wordpress.com/2016/12/06/the-journey-to-a-hybrid-software-defined-storage-infrastructure-s01e03/

E04

https://ilovemystorage.wordpress.com/2016/12/13/the-journey-to-a-hybrid-software-defined-storage-infrastructure-s01e04/

E05

https://ilovemystorage.wordpress.com/2016/12/19/the-journey-to-a-hybrid-software-defined-storage-infrastructure-s01e05/

Enjoy your reading!

Software Defined Storage (SDS) makes it possible to spin up, on standard components, a look-alike storage layer that enables data mirroring and manages the entire data replication process with the same tools and the same look-and-feel as the traditional disk mirroring functions between two storage controllers.

Using standard components means reducing the TCO by leveraging common compute and storage building blocks, while traditional storage is (almost) always a purpose-built infrastructure. Common compute and storage blocks also bring standardisation in the support and skills required to manage the infrastructure; this again simplifies IT operations and hence further reduces TCO.

On the other hand, standardised components mean standardised reliability and performance, so an SDS solution normally does not perform at the same level as the corresponding storage subsystem.

To summarise, using an SDS solution in a Disaster Recovery brings cost reduction, at the expense of the reliability and performance a traditional storage subsystem would deliver. Does this matter? Day by day, when the solution only replicates data and perhaps runs DR tests, this reduction in reliability and performance is normally not an issue; but in an emergency, when your infrastructure is asked to perform like (or out-perform) the production infrastructure, it might.

To further contain costs at the DR site, a Hyper-Converged Infrastructure (HCI) topology can be adopted. In the HCI topology, the SDS function delivering the storage capability is hosted on the same hypervisors as the DR VMs.

The DR solution can also have Recovery Tiers: your customer can split the IT infrastructure into waves (Tiers or Recovery Zones), each with a homogeneous Recovery Time Objective (RTO) requirement.

Typically, a three-tier approach is used:

  • Tier-1 applications, with an immediate or near-immediate restart (0-12h), which require dedicated resources
  • Tier-2 applications, with a restart within 1-4 days, can leverage shared or re-usable assets at the alternate site. Since the time is short, this spare capacity must usually already be in place and cannot be acquired at time of disaster (ATOD).
  • Tier-3 applications, with a restart time greater than 5 days, might be covered by additional Compute power freed up in the alternate site by closing Dev/Test environments and re-purposing those Compute resources for the DR solution. These Dev/Test environments can be restarted later, as soon as additional Compute resources are re-provisioned in the alternate site, or when the emergency is over and production has moved back to the original site.

All the above is a general definition; your customer might have more tiers or different combinations or requirements.
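
The three-tier split above can be sketched as data. The following Python snippet is an illustrative sketch only (tier names, RTO windows in hours, and the `assign_tier` helper are our own assumptions, not part of any product): it places an application in the lowest tier whose window covers its RTO requirement.

```python
from dataclasses import dataclass

# Hypothetical sketch: the three recovery tiers described above, expressed
# as maximum RTO windows in hours. Values are illustrative only.
TIER_MAX_RTO_HOURS = {
    "tier-1": 12,        # immediate or near-immediate restart (0-12h)
    "tier-2": 4 * 24,    # restart within 1-4 days
    "tier-3": None,      # restart beyond 5 days (no hard upper bound here)
}

@dataclass
class Application:
    name: str
    required_rto_hours: int

def assign_tier(app: Application) -> str:
    """Place an application in the lowest tier whose window covers its RTO."""
    if app.required_rto_hours <= TIER_MAX_RTO_HOURS["tier-1"]:
        return "tier-1"
    if app.required_rto_hours <= TIER_MAX_RTO_HOURS["tier-2"]:
        return "tier-2"
    return "tier-3"

print(assign_tier(Application("billing", 6)))      # tier-1
print(assign_tier(Application("reporting", 72)))   # tier-2
print(assign_tier(Application("archive", 240)))    # tier-3
```

In a real engagement the tier boundaries would come from the customer's business impact analysis, not from fixed constants.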

Other considerations in a DR strategy relate to the distance between the primary and secondary sites; distance brings network latency into play, so asynchronous replication might be the best option to contain the performance impact on production applications.
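
A back-of-the-envelope calculation shows why distance pushes you toward asynchronous replication. This sketch assumes light travels at roughly 200,000 km/s in fibre (about two thirds of c); real networks add switching and protocol overhead on top of this physical minimum.

```python
# Approximate one-way propagation speed in optical fibre: ~200 km per ms.
FIBRE_KM_PER_MS = 200.0

def min_replication_rtt_ms(distance_km: float) -> float:
    """Minimum round-trip time a synchronous write must wait for,
    from propagation delay alone (no switching or protocol overhead)."""
    return 2 * distance_km / FIBRE_KM_PER_MS

for d in (50, 300, 1000):
    print(f"{d:>5} km -> at least {min_replication_rtt_ms(d):.1f} ms added per synchronous write")
```

At 1000 km this already adds at least 10 ms to every synchronous write, which is why beyond metro distances asynchronous replication is usually preferred.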

Also, the “specialisation” of the sites plays a great role in the decision and in the complexity of the solution. Dealing with (managing and controlling) uni-directional disk mirroring is far simpler than doing the same with bi-directional disk mirroring, when it comes to data consistency.

In this article, we will have a view of possible DR Use Cases related to SDS, in a one-to-one topology.

In a one-to-one topology, four possible combinations exist:

  1. Prod on Premise; DR on a proprietary Alternate Site
  2. Prod on Premise; DR on a Cloud Provider
  3. Prod on a Cloud Provider; DR on another Cloud Provider site (or an alternate Cloud Provider)
  4. Prod on a Cloud Provider; DR on a proprietary Alternate Site

Whatever the combination, there are common pitfalls that you need to be aware of.

Common DR pitfalls

Planning and designing for the “best conditions” – It is important that your solution and your testing methodology are planned and designed to mimic what the real conditions might be. Designing for the “best case” certainly contains costs, but leaves you merely hoping that it will work.

Having Single-Point-Of-Failure (SPOF) – A SPOF can be anywhere in a solution, not only on the technology side. Your solution might depend on people, vendors, providers, and other external dependencies. It is essential that you identify most of your SPOFs clearly in advance and have a plan to mitigate your dependencies. Be also prepared to discover new SPOFs during the initial sessions of your DR Tests.

Having a Plan-A only – Any solution has a residual risk. Risks cannot be eliminated; they can only be mitigated, and the more risks you mitigate with a single solution, the more the solution will cost. Having a Plan-B with a different RTO might help mitigate additional risks without adding too much cost. Your RTO might not be the one expected, meaning you will need more time to restart your operations and the emergency will impact you more, but restarting later is far better than not restarting at all. A possible Plan-B, in a two-site topology, is a periodic backup secured off-site from both sites, at a distance that can reasonably be considered safe.

DR Testing – An un-tested DR solution is like running blindfolded through a wood: chances are high that you will hit an obstacle that stops your run, and you will discover it only during the most important run of your life. If testing is essential to guarantee you have a valid solution, the conditions you test under are just as important. If you perform a DR Test with a pre-arranged shutdown of operations on one site and a well-organised restart on the other, you prove that your DR Test works, but is this what will happen during an emergency? You should design your tests to mimic the possible emergency conditions as closely as possible, by simulating a so-called “rolling disaster”, where your IT service is impacted progressively by the emergency. This is the best way to test your solution and gain a reasonable understanding of whether it is resilient (able to withstand stress conditions).

Case 1: Production on-premise, DR on another “traditional” site

In this topology, the Storage used by IT on-premise is usually provided by traditional storage subsystems.

At the DR Site, to contain costs, instead of adopting the same storage subsystem, your customer can spin up the storage requirement adopting the corresponding SDS solution, which can also be implemented as a Hyper-Converged Infrastructure (HCI) topology.

SDS in this topology helps contain TCO, as you can reuse existing storage (perhaps freed up by a technology change) and deploy it as the storage used by the SDS in the alternate site. Likewise, the Compute you need for the SDS solution might come from a technology upgrade of the Compute in your primary site.

How a Three-tier DR solution can be implemented

Tier-1 applications could be hosted on the same Hyper-Converged Infrastructure as the SDS solution.

Tier-2 applications need additional spare Compute capacity at the alternate site.

Tier-3 applications might be covered by additional Compute power freed up in the alternate site by closing Dev/Test environments and re-purposing those Compute resources for the DR solution.

Attention Points

Virtualisation & Standardisation – Virtualisation plays a great role, as it masks the differences in the physical layer, making it easier to restart on a different hardware set; physical servers require more customisation during the recovery/restart process to adapt the software layer to the different physical layer the Compute has in the alternate site. In the same way, standardising your IT infrastructure on a small set of components helps reduce the permutations you must deal with.

Automation – DR operations are complex and require a high level of technical interaction, so automation can help reduce the dependency on critical resources.

DR Provider resource syndication – Most DR Providers actively use resource syndication to contain costs. Resource syndication means the same resource is used to provide service to different customers that are unlikely to be exposed to the same concurrent event. In other words, the Compute you use for your DR solution might also be used by other customers of the DR Provider, but not at the same time as you. It is important to understand this methodology and evaluate how the DR Provider applies resource syndication: resources syndicated among customers thousands of kilometres apart offer a different (better) “usability ratio” than resources syndicated among customers in the same city or city block.

Case 2: Production On-premise, DR to a Cloud Provider

In this topology, the SDS provides abstraction from the underlying technologies used by the cloud provider and creates the look-alike storage that enables a bi-directional data replication process (on-premise to cloud and cloud to on-premise).

Like the previous topology, you might adopt a Hyper-Converged Infrastructure (HCI) topology to contain costs.

How a Three-tier DR solution can be implemented

Tier-1 applications could possibly be hosted on the same Hyper-Converged Infrastructure as the SDS solution.

Tier-2 applications can leverage the Cloud Provider's on-Demand resources to add Compute as and when required.

Tier-3 applications, too, can be requested on-Demand from the Cloud Provider.

Attention Points

Provisioning Time – Cloud Providers are not created equal, so you need to evaluate the different offerings and the SLA associated with the provisioning time of on-Demand resources, as they impact the RTO of your DR solution. Another important point is the commitment the Cloud Provider offers on that SLA, and what limitations might apply (does it commit to provide on-Demand resources per the SLA, independently of the number of concurrent requests?).
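
The provisioning SLA is only one slice of the RTO budget. As a hypothetical sketch (the function name, the decomposition into detection, provisioning, and restore phases, and all the figures are our own illustrative assumptions), you can check whether a provider's SLA still fits a tier's window:

```python
# Illustrative sketch: the RTO budget is consumed by detection/declaration,
# on-Demand provisioning (the provider's SLA), and restore/restart work.
def rto_budget_ok(rto_hours: float,
                  detection_h: float,
                  provisioning_sla_h: float,
                  restore_and_restart_h: float) -> bool:
    """True if the three phases together fit inside the tier's RTO."""
    return detection_h + provisioning_sla_h + restore_and_restart_h <= rto_hours

# A 24h provisioning SLA fits a 4-day Tier-2 window...
print(rto_budget_ok(96, detection_h=2, provisioning_sla_h=24, restore_and_restart_h=12))   # True
# ...but clearly cannot fit a 12h Tier-1 window.
print(rto_budget_ok(12, detection_h=2, provisioning_sla_h=24, restore_and_restart_h=12))   # False
```

This is why Tier-1 usually needs dedicated, already-provisioned resources, while Tier-2 and Tier-3 can tolerate on-Demand provisioning.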

Resource Pricing – On-Demand resources are cheaper, as they are usually billed by usage. It is important to evaluate all the caveats associated with usage billing, and how they might impact your TCO with hidden and unplanned costs.

Compatibility with the Cloud Provider – Doing DR on a Cloud Provider from on-premise has the same requirements and challenges as a migration project to the Cloud. You must verify that what is going to be deployed on the Cloud Provider can actually run on that infrastructure.

Constraints on the Cloud Provider – Despite all the abstractions the cloud provider can offer, in the end you might still face some constraints in your design. An example is the limitation you might encounter in booting a Virtual Machine from a replicated disk (LUN) when that VM runs in the Cloud Provider's managed hypervisor infrastructure. Other examples are in the area of LAN and WAN, where the Software Defined Network (SDN) provides some flexibility, but not the full flexibility you have in your own infrastructure.

Cloud Resource Scaling – Having DR on Cloud might create the false expectation that cloud scalability is infinite. It is not, particularly in the specific Cloud Data Center where you have decided to replicate your data.

Cloud Networking (LAN) – Having DR on Cloud might also imply that you need to re-build your network structures on the Cloud Provider. In some cases, you will be forced to use Software Defined Networking (SDN) and Network Functions Virtualisation (NFV) to reproduce your complex enterprise network layout. Careful planning and evaluation of performance and scalability limitations is essential to ensure your re-provisioned Compute can work properly.

Cloud Networking (WAN) – All the external networks connected to your primary (failed) site must be re-routed to the alternate Cloud site. Different options are available, and you need to evaluate and plan which options best fit your requirements. Also consider the charges associated with network usage (download/upload charges) when using the Cloud Provider resources, as they might add a substantial cost to your DR TCO.

Case 3: Production on Cloud, DR on Cloud

In a Cloud-to-Cloud topology, the SDS provides abstraction from the underlying technologies used by the Cloud Providers and creates the look-alike storage that enables a bi-directional data replication process. On the same provider (different cloud sites), SDS might provide a disk replication technology that offers better options than the native features available from the provider.

As in the previous topologies, you might adopt a Hyper-Converged Infrastructure (HCI) topology here to contain costs.

How a Three-tier DR solution can be implemented

Tier-1 applications could possibly be hosted on the same Hyper-Converged Infrastructure as the SDS solution.

Tier-2 applications can leverage the Cloud Provider's on-Demand resources to add Compute as and when required.

Tier-3 applications, too, can be requested on-Demand from the Cloud Provider.

Attention Points

Cloud Resource Scaling – Having DR on Cloud might create the false expectation that cloud scalability is infinite. It is not, particularly in the specific Cloud Data Center where you have decided to replicate your data.

Provisioning Time – Cloud Providers are not created equal, so you need to evaluate the different offerings and the SLA associated with the provisioning time of on-Demand resources, as they impact the RTO of your DR solution. Another important point is the commitment the Cloud Provider offers on that SLA, and what limitations might apply (does it commit to provide on-Demand resources per the SLA, independently of the number of concurrent requests?).

Cloud Provider Compatibility (when different providers) – With different providers, you should expect different technologies. As in the on-premise-to-cloud topology, moving your workload from one provider to another is equivalent to, and as complex as, a cloud migration project from on-premise. Attention must be paid to the different methodologies, interfaces and services offered by the two Cloud Providers (monitoring, APIs, networks, provisioning time).

Resource Pricing (when different providers) – It is important to evaluate and understand all the caveats associated with the usage billing of the two providers, and how they might impact your TCO with hidden and unplanned costs.

Case 4: Production on Cloud, DR to On-premise

Although this option may seem odd in today's “get on the cloud” mood, once you give data the importance it deserves, it no longer sounds odd.

A company that wants to maintain flexibility in its cloud strategy should consider keeping a copy of the company data in its own site.

SDS enables this flexibility, and it can be implemented easily by reversing the flow of the data replication once everything has been on-boarded to the Cloud.

To leverage this topology as a DR solution, you must keep all your Compute locally and avoid becoming too dependent on services or features available only at the Cloud Provider, as they will not be available on-premise.

As in the other topologies, here too you can opt for a Hyper-Converged Infrastructure, in which case you would probably transform your traditional storage into an SDS solution hosted on-premises on the Compute servers.

In terms of DR Tiering, you can implement the same tiers as in the on-premise to on-premise case, as the target is your own site.

Attention Points

Compatibility with the Cloud Provider – Doing DR from a Cloud Provider to on-premise has challenges on the compatibility side. You might need to mimic the hypervisor infrastructure you used in the cloud, to avoid hypervisor migrations on-the-fly.

Dependencies on the Cloud Provider – You should also mind the possible dependencies on services or features that your applications have started using at the cloud provider (DNS, monitoring, logging, …), as they might not be available on-premise.

 
