Becoming multi-cloud & on-prem
Why be multi-cloud or go on-premise?
Customers may be sensitive about their data being captured by the product software. If the data is too large to travel over the internet, needs low-latency processing, or a simple pre-installed box is easier to sell than enabling a bunch of services on the customer's cloud, then hardware running the software, deployed on the customer's premises (on-premise or on-prem), may be a requirement.
So, the following 2 questions must be answered about the product:
- Is the product offered as a hosted SaaS, or is it a product the customer can host within their cloud? Supporting the product in a customer environment needs significant re-engineering & ongoing support effort.
- If customer-hosted, which clouds are important to support? Is on-prem hosting needed? Porting to every additional cloud provider takes effort, and taking it on-prem much more so.
For the rest of this document, we assume that the answer to question (1) is to make the product hostable in the customer's cloud. The last section assumes an on-prem requirement for question (2).
Engineering steps
Simplify stack footprint
A simple stack is always better, and when one has to move across public clouds, complexity can be doubly expensive. Going multi-cloud is a good time to revisit the stack and evaluate the cost benefit of any simplification.
Here are some suggestions:
- Unless there’s a good reason, use a single network, a single Kubernetes cluster, and the fewest infrastructure components & databases.
- Currently, Docker & Kubernetes are accepted as de facto standards. Being dockerized & kubernetized is practically the first ask by a cloud or platform vendor.
Automate deployment
The deployment should be scripted, for ease of customer setup. Scripting early ensures there are no information gaps within the Engineering team — code is the most accurate document! It also enables fast iterations.
Terraform has modules for most common public clouds, allows for modularization, and because it maintains an internal dependency graph & state, handles upgrades & partial installations well. Thus, it works well for this automation.
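As an illustration, here is a minimal sketch of such a deployment driver in Python, wrapping terraform. The folder layout (`deploy/terraform/`, per-cloud `*.tfvars` files) and the cloud names are assumptions for the example, not a prescribed structure.

```python
#!/usr/bin/env python3
"""Minimal deployment driver: wraps `terraform init/apply` per target cloud.

Assumes a hypothetical layout: deploy/terraform/ holds the terraform code,
and deploy/vars/<cloud>.tfvars holds the cloud-specific settings.
"""
import argparse
import pathlib
import subprocess

TF_DIR = pathlib.Path("deploy/terraform")


def run(cmd):
    """Run a command in the terraform directory, failing fast on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=TF_DIR, check=True)


def deploy(cloud, destroy=False):
    var_file = f"../vars/{cloud}.tfvars"   # assumed per-cloud settings file
    run(["terraform", "init", "-input=false"])
    action = "destroy" if destroy else "apply"
    run(["terraform", action, "-input=false", "-auto-approve", f"-var-file={var_file}"])


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Deploy the stack to a target cloud")
    parser.add_argument("cloud", choices=["gcp", "aws", "onprem"])
    parser.add_argument("--destroy", action="store_true")
    args = parser.parse_args()
    deploy(args.cloud, args.destroy)
```

With something like this, a customer setup reduces to installing terraform and running one command with the right variables file.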
Become cloud agnostic
Public clouds differ in their stack, APIs & SDKs. Knowing which clouds one may have to run the software on in the foreseeable future helps plan. Maintaining different cloud-specific versions of the software is not sustainable. Good engineering & modularization can ensure a single, largely cloud-agnostic code base, with cloud-specific variations minimized.
While one may never be fully cloud-agnostic, the below framework helps:
1. Use cloud-agnostic abstractions for all the services being used from the cloud — ie Kubernetes, MySQL, object storage etc. Do not use any cloud-specific SDK directly in the application code.
- Wherever wrapper libraries exist that work across cloud-specific implementations (eg minio client for GCS, S3, etc), use them.
- If an open-source self-hosted version of the service works as well as the cloud-provided version (eg MySQL instead of GCP CloudSQL), use that instead.
- If not, use a wrapper class around the cloud-specific SDK. The wrapper's interface should be cloud-agnostic. Push the cloud-specific configuration into settings or secrets, out of the application code (see the sketch after this list).
2. Even infrastructure-as-code (IaC, likely terraform) should be modularized. Perhaps have a common terraform folder, and a cloud-specific folder linking to the implementation in the common folder or overriding it as needed:
- List the cloud-provided services used from the current cloud, and for each, see if there’s an equivalent in the target cloud.
- If most application services are self-hosted in Kubernetes, the infrastructure code is small — and most deployment can be in helm charts.
- Is the service usage clean & are the cloud abstractions consistent? Can the usage be a single terraform module interface, with cloud-specific implementations? Eg a network module to cover GCP VPC & AWS VPC, or a Kubernetes module to cover GCP GKE & AWS EKS?
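On the application-code side (item 1 above), a minimal sketch of such a wrapper for object storage is shown below, built on the S3-compatible minio client mentioned earlier. The class, method and environment-variable names are illustrative assumptions; the point is that application code only sees put/get, while the endpoint & credentials live in settings or secrets.

```python
"""Cloud-agnostic object storage wrapper (sketch).

Uses the S3-compatible `minio` client, so the same code can talk to AWS S3,
GCS (S3-interoperability mode), or a self-hosted MinIO on-prem. All
cloud-specific details come from the environment, not application code.
"""
import os
from minio import Minio


class ObjectStore:
    def __init__(self):
        # Endpoint & credentials are pushed to the environment (or a secrets store).
        self._client = Minio(
            os.environ["OBJECT_STORE_ENDPOINT"],   # eg "s3.amazonaws.com" or "minio.internal:9000"
            access_key=os.environ["OBJECT_STORE_ACCESS_KEY"],
            secret_key=os.environ["OBJECT_STORE_SECRET_KEY"],
            secure=os.environ.get("OBJECT_STORE_TLS", "true") == "true",
        )

    def put(self, bucket, name, local_path):
        """Upload a local file, creating the bucket if needed."""
        if not self._client.bucket_exists(bucket):
            self._client.make_bucket(bucket)
        self._client.fput_object(bucket, name, local_path)

    def get(self, bucket, name, local_path):
        """Download an object to a local file."""
        self._client.fget_object(bucket, name, local_path)


# Application code stays cloud-agnostic:
#   store = ObjectStore()
#   store.put("reports", "2024/summary.pdf", "/tmp/summary.pdf")
```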
Prioritization
The above steps feed off each other, forcing more clarity, structure & cleanup among them. The sequence of the steps may depend on the organization's business priorities & the current state of the code. For example:
- A company that needs to release a customer deployable on the currently hosted cloud should prioritize scripting/automating the deployment, then simplification or cloud agnosticity.
- A company in the early stages of development with a clear multi-cloud requirement should prioritize a simple stack & cloud agnosticity.
Irrespective of the prioritization & phases, do minimize any throwaway work.
Port to a cloud
Porting to a specific cloud then includes:
- Mapping & then porting the application services to the cloud-specific equivalents, wherever needed. Eg AWS EKS instead of GCP GKE. In the absence of an equivalent, is there a compelling existing open-source alternative? If so, will the move to the open-source alternative be a step towards cloud-agnosticity?
- Port the IaC code. Terraform providers exist for most clouds, but the terraform module interfaces defined above must be implemented for the target cloud using its provider, or the terraform code itself rewritten.
This porting effort progressively decreases as the system becomes more kubernetized.
The exception is when going on-premise, detailed below.
Going on-prem
Public clouds offer a range of services that on-premise stacks do not provide out of the box. The first step is to identify the stack & the hardware to port the application software onto — this stack is then installed on the identified hardware. The service gaps vs the public cloud are then filled separately, and the application is ported (any code & IaC changes) & tested.
Identify the hardware & the stack
Stack
Public clouds offer compute & disk virtualization (virtual machines & persistent volumes respectively) & the ability to launch Kubernetes clusters, amongst others. At the very least, one needs a stack offering these on the on-premise hardware.
While multiple Kubernetes variants exist, the following 2 seem the most common & well-integrated:
- Red Hat OpenShift
- VMware ESXi, vCenter, VMware Tanzu (multiple flavours)
An on-prem port is a large engineering undertaking. The stack should be chosen carefully using the below factors:
1. Stack maturity:
- Stability & support: A more stable, established or well-used stack means fewer surprises, and the ability to resolve issues using community knowledge. Is there a support partner or an expert partner to help with the unknowns?
- Gap vs services needed: The alternatives for filling the infrastructure service gaps are documented to work on only some virtualization OS or Kubernetes software. Does the chosen stack support them?
- GPU support: Most stacks may not support GPUs well — ie multiple GPUs on a single physical node or VM, GPU multiplexing or time-slicing. This is an important factor if the application uses GPUs.
2. Costs & overhead: In addition to the engineering effort, there are other costs:
- Licensing cost: Solutions like VMware & some load balancer options (eg AVI) are licensed.
- Compute overhead: The virtualization & Kubernetes stack brings its own overhead. Though hard to measure precisely, a review of existing online literature will help with a rough evaluation.
- Team skills: Familiarity with a virtualization stack is usually a system-admin skill rather than a DevOps one, and may not be readily available in cloud-native organizations. Bringing in an experienced consultant can help save time on experiments & trials.
Hardware
- The cloud is elastic; on-prem hardware isn’t. Kubernetes should be configured to prioritize within this fixed compute by deprioritizing & delaying batch jobs.
- Based on the cloud setup & the expected load, estimate the compute, memory & storage needed — closer to the max load that may be reached. A production-grade setup should have (n+1) nodes put together in a single rack — where n is usually 2–5. This calculation gives the size of each node (see the sizing sketch after this list).
- The virtualization stack & networking are more involved in a multi-node setup. For quick POC deployments, a single node meeting the consolidated compute requirement may be good enough.
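As a rough illustration of this sizing arithmetic, the sketch below spreads an assumed peak requirement across n working nodes, adds an allowance for stack overhead, and keeps one spare node. All numbers are hypothetical placeholders.

```python
import math

# Hypothetical peak requirement, estimated from the current cloud setup & expected load.
peak = {"vcpu": 96, "ram_gb": 384, "storage_tb": 20}

n = 3              # working nodes; usually 2-5
overhead = 1.15    # assumed ~15% allowance for virtualization & Kubernetes overhead

per_node = {k: math.ceil(v * overhead / n) for k, v in peak.items()}
print(f"Per-node size: {per_node}")      # {'vcpu': 37, 'ram_gb': 148, 'storage_tb': 8}
print(f"Nodes to rack (n+1): {n + 1}")
```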
IP / networking
Below are the IP addresses involved in an application running on a Kubernetes cluster on on-prem hardware:
- Physical nodes, routers
- VMs. This includes some service gap-filling VMs — eg for DHCP, DNS, artefact-repository, etc
- Kubernetes pods & services
- External IPs for load balancers
The on-prem deployable box should be portable across customer sites, without needing internal network changes. Hence a few best practices:
- Create a separate private LAN in a preferably less common (and hence unlikely to conflict) IP range — eg 192.168.23.0/24 or so.
- Allocate the IP ranges within this block explicitly amongst the categories above (see the sketch below). Some ranges, like the Kubernetes pod & service CIDRs, need not be on the same network and are better managed by Kubernetes.
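Below is a small sketch of such an explicit allocation, carving the example 192.168.23.0/24 into per-category blocks with Python's ipaddress module. The split sizes and category names are illustrative assumptions; pod & service CIDRs stay on Kubernetes-managed internal networks as noted above.

```python
import ipaddress

# The private LAN for the on-prem box (example range from the text).
lan = ipaddress.ip_network("192.168.23.0/24")

# Carve the /24 into four /26 blocks and assign them explicitly.
nodes, vms, lb_pool, spare = lan.subnets(new_prefix=26)
plan = {
    "physical nodes & routers": nodes,                 # 192.168.23.0/26
    "VMs (DHCP, DNS, artefact repo, ...)": vms,        # 192.168.23.64/26
    "load-balancer external IPs": lb_pool,             # 192.168.23.128/26
    "spare / future use": spare,                       # 192.168.23.192/26
}
for purpose, block in plan.items():
    print(f"{purpose:40s} {block}")

# Kubernetes pod & service CIDRs are managed by Kubernetes on separate
# internal networks and need not come out of this LAN.
```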
Mapping the services
Below are some common services used by applications that are often missing from the virtualization stack, along with recommended alternatives:
Port, test & test on-site
Expect significant code & IaC rewrites. Public clouds & on-prem stacks are very different — there will be innumerable surprises. Don’t expect things to work the first time — Murphy’s law will kick in. Budget at least as much time for testing & fixing as it took to port, and test in the customer environment.