Multi-availability zone Kubernetes cluster administration

Organizations with the highest uptime requirements must be able to withstand more significant outages, such as the loss of an entire availability zone (AZ). While provisioning a Kubernetes cluster that spans multiple availability zones is straightforward with managed cloud providers, additional administrative requirements must be considered during the deployment, configuration, and operational phases of ArcGIS Enterprise on Kubernetes.

The following sections describe the considerations and requirements prior to configuring the organization, the anticipated effects of a loss of functionality within an availability zone, and how an administrator can verify the health of the underlying system once functionality in that availability zone is recovered.

At minimum, the Kubernetes cluster where the application is deployed must meet the following requirements:

  • Worker node groups span at least three availability zones
  • Adequate worker node capacity to rebalance all workloads into two availability zones during an outage
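
To confirm the cluster meets these requirements, an administrator can list the worker nodes together with their zone labels. A minimal check, assuming the standard topology.kubernetes.io/zone node label applied by most cloud providers:

    # List worker nodes with their availability zone label; at least
    # three distinct values should appear in the ZONE column.
    kubectl get nodes -L topology.kubernetes.io/zone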

Deployment and configuration

When deploying ArcGIS Enterprise on Kubernetes to a multi-AZ cluster, stateful workloads introduce an availability zone dependency when block storage devices are used: each associated disk is bound to the availability zone in which it was provisioned.

To ensure that the replicas of each statefulset are spread across the appropriate topology, a new property has been introduced in the deployment properties file, K8S_AVAILABILITY_TOPOLOGY_KEY, that must be updated prior to running the deployment script. Setting this property to any value other than kubernetes.io/hostname introduces a topologySpreadConstraint spec in the statefulsets, which prevents an unequal balance of replicas in any single availability zone. The availability zone node label key in most cloud providers is topology.kubernetes.io/zone.
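
For illustration, the generated constraint might resemble the following sketch of a statefulset pod spec excerpt. The label selector and maxSkew value here are assumptions meant to show the general shape, not the exact spec the deployment script produces:

    # Hypothetical excerpt from a statefulset pod spec. The selector
    # label is a placeholder, not the product's actual pod label.
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: example-statefulset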

During organization configuration, select the enhanced availability architecture profile to guarantee the highest availability. This is the only profile that provides adequate coverage for all stateful workloads in the event of an availability zone failure.

Even when using the enhanced availability architecture profile, a number of deployments remain at a single replica because their appropriate replica counts vary widely based on specific organization requirements. The following deployments should be considered for scaling above a single replica to further reduce downtime (a scaling sketch follows the list):

  • arcgis-enterprise-apps
  • arcgis-enterprise-manager
  • arcgis-enterprise-portal
  • arcgis-enterprise-web-style-app
  • arcgis-help
  • arcgis-javascript-api
  • arcgis-service-api
  • arcgis-service-lifecycle-manager
  • arcgis-featureserver-webhook-processor
  • arcgis-gpserver-webhook-processor
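
As a sketch only, such scaling could be performed directly with kubectl. The namespace and deployment name below are placeholders (actual deployment names may carry generated suffixes), and organizations that manage scaling through ArcGIS Enterprise Manager or the Admin API should prefer that route, since replica counts set outside the product may be reconciled back:

    # List the deployments to confirm exact names, then scale one of
    # the listed deployments to two replicas. The namespace and
    # deployment name are placeholders.
    kubectl get deployments -n arcgis
    kubectl scale deployment arcgis-enterprise-apps --replicas=2 -n arcgis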

Each of the deployments listed above, except for arcgis-enterprise-web-style-app, restarts in less time following an AZ loss than the relational data store failover process takes. Scaling all listed deployments to an additional replica adds approximately 1 CPU and 0.5 GiB of RAM to the total namespace requests and 5 CPU and 2.5 GiB of RAM to the total namespace limits.

When publishing services, all dedicated services should be set to run a minimum of two pods to ensure there is no interruption during rescheduling. The shared instance deployments should also be scaled to at least two replicas for the same purpose.
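
Replica counts for these service deployments are typically set when publishing or through the service's scaling settings; as a rough illustration only, the resulting pod counts can be reviewed afterward with kubectl (the namespace is a placeholder):

    # Confirm that each dedicated service deployment reports at least
    # two ready replicas in the READY column.
    kubectl get deployments -n arcgis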

Effects of an AZ loss

The meaning and impact of an availability zone loss vary depending on which cloud services are affected. The loss of a particular service, such as compute or storage, can have a significant impact on running workloads, while other network-based outages, such as DNS resolution failures or sharp increases in latency, can affect how microservices communicate with one another.

During relational store failover, certain functions, such as signing in, hosted feature service editing, and loading hosted layers, may be degraded for a short time. Once the standby instance has been promoted to primary, all organization functionality should return to a stable state.

Organization health verification

Administrators can use the integrated health check reports to assess the health of their site. They can also review the workloads in the namespace for pod startup issues and other scheduling problems that may arise.
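
A minimal sketch of such a review with kubectl, assuming a placeholder namespace of arcgis:

    # Surface pods that are not currently running, then check recent
    # events for scheduling or startup failures.
    kubectl get pods -n arcgis --field-selector=status.phase!=Running
    kubectl get events -n arcgis --sort-by=.lastTimestamp | tail -n 20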

During an AZ outage

If an availability zone is lost when block storage is used for PVs, the statefulset pods cannot be rescheduled into other availability zones because of the volume node affinity requirements of their persistent volumes.
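
These pods remain in a Pending state until the zone recovers. A sketch of confirming the cause, assuming the arcgis namespace placeholder; the scheduler's events for such a pod typically report a volume node affinity conflict:

    # Find pending pods, then inspect one for scheduling events that
    # mention a volume node affinity conflict.
    kubectl get pods -n arcgis --field-selector=status.phase=Pending
    kubectl describe pod <pending-pod-name> -n arcgis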

For the data stores that require it, a quorum is maintained so the associated services can function in a degraded state. Because the relational store is a core component that several other services depend on, there is an option to reset the standby through the Admin API. This removes the statefulset and its associated PVC and creates a new statefulset, allowing the standby instance to re-synchronize with the primary after it is rescheduled into one of the remaining availability zones.
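
After triggering the reset, the replacement pod and its new PVC can be watched until they are scheduled and bound in a surviving availability zone. A sketch, assuming the arcgis namespace placeholder:

    # Watch the replacement standby pod start, then confirm its PVC
    # is bound in one of the remaining availability zones.
    kubectl get pods -n arcgis -w
    kubectl get pvc -n arcgis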

Following an AZ outage

If services are restored in the availability zone that experienced the outage, the associated workloads that were stuck in a pending state should start, which can be confirmed through the Kubernetes API.
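
For example, a minimal confirmation with kubectl, again assuming the arcgis namespace placeholder; after recovery this query should return no pods:

    # Any pods still pending after the zone recovers warrant
    # further investigation.
    kubectl get pods -n arcgis --field-selector=status.phase=Pending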

If services cannot be restored in the affected availability zone, ensure your cluster retains a minimum of three availability zones, extending it into an additional availability zone if needed. Once that is complete, create a backup, undeploy and redeploy the organization, and perform a restore operation from the backup.