Multi-availability zone Kubernetes cluster administration

Organizations with the highest uptime requirements must be able to withstand significant outages, such as the loss of an entire availability zone (AZ). While provisioning a Kubernetes cluster that spans multiple AZs is straightforward with managed cloud providers, there are additional administrative requirements to consider during the deployment, configuration, and operational phases of ArcGIS Enterprise on Kubernetes.

The sections below describe the considerations that must be made and the requirements that must be met before configuring the organization, the anticipated effects of the loss of functionality within an AZ, and how an administrator can verify the health of the underlying system following the recovery of functionality in that AZ.

At a minimum, the Kubernetes cluster where the application is deployed must meet the following requirements:

  • Worker node groups span at least three AZs.
  • There is adequate worker node capacity to rebalance all workloads into two AZs during an outage.
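The capacity requirement above can be checked with simple arithmetic. The following is a minimal sketch, not an official sizing tool: the function name, the AZ names, and the CPU figures are hypothetical, and the same check would need to be repeated for memory.

```python
def survives_single_az_loss(capacity_by_az, required_cpu):
    """Return True if the workloads still fit after losing any one AZ.

    capacity_by_az maps an AZ name to its allocatable worker node CPU.
    required_cpu is the total CPU requested by all workloads.
    """
    # The cluster must span at least three AZs to begin with.
    if len(capacity_by_az) < 3:
        return False
    # Remove each AZ in turn and compare what remains with the requirement.
    for lost_az in capacity_by_az:
        remaining = sum(
            cpu for az, cpu in capacity_by_az.items() if az != lost_az
        )
        if remaining < required_cpu:
            return False
    return True


# Hypothetical cluster: three AZs with 32 allocatable CPU each.
cluster = {"az-a": 32, "az-b": 32, "az-c": 32}
print(survives_single_az_loss(cluster, 60))  # True: 64 CPU remain
print(survives_single_az_loss(cluster, 70))  # False: only 64 CPU remain
```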

Deployment and configuration

When ArcGIS Enterprise on Kubernetes is deployed to a multi-AZ cluster, stateful workloads that use block storage devices depend on the AZ in which their associated disks reside.

To ensure that the replicas of each statefulset are spread across the appropriate topology, a new property, K8S_AVAILABILITY_TOPOLOGY_KEY, has been introduced in the deployment properties file. Update this property before running the deployment script. Setting it to a value other than kubernetes.io/hostname introduces a topologySpreadConstraint specification to the statefulsets, which ensures that replicas are not unevenly concentrated in a single AZ. The AZ node label key in most cloud providers is topology.kubernetes.io/zone.
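To illustrate what a topology spread constraint expresses, the sketch below builds one as a plain dictionary and measures how unevenly a set of replicas is spread across zones. The field names (maxSkew, topologyKey, whenUnsatisfiable, labelSelector) are standard Kubernetes fields, but the values shown, the "app" label, and both helper functions are assumptions for illustration. The specification that the deployment script actually generates is not described here.

```python
from collections import Counter

ZONE_KEY = "topology.kubernetes.io/zone"


def spread_constraint(topology_key, app_label):
    """Build an illustrative topology spread constraint (values are assumed)."""
    return {
        "maxSkew": 1,                          # assumed value
        "topologyKey": topology_key,
        "whenUnsatisfiable": "DoNotSchedule",  # assumed value
        "labelSelector": {"matchLabels": {"app": app_label}},  # hypothetical label
    }


def zone_skew(replica_zones, all_zones):
    """Return the difference between the fullest and emptiest zone."""
    counts = Counter(replica_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)


zones = ["az-a", "az-b", "az-c"]
print(zone_skew(["az-a", "az-b", "az-c"], zones))  # 0: one replica per AZ
print(zone_skew(["az-a", "az-a", "az-b"], zones))  # 2: az-a holds two, az-c none
```

A skew of 0 or 1 means that losing any single AZ removes at most one more replica than losing any other AZ would.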

To guarantee the highest availability during organization configuration, configure the organization using the enhanced availability architecture profile. This is the only profile that guarantees adequate coverage for all stateful workloads in the case of an AZ failure.

Even when using the enhanced availability architecture profile, a number of deployments remain at a single replica because their scaling needs vary widely based on specific organization requirements. Consider scaling the following deployments above a single replica to further reduce downtime:

  • arcgis-enterprise-apps
  • arcgis-enterprise-manager
  • arcgis-enterprise-portal
  • arcgis-enterprise-web-style-app
  • arcgis-help
  • arcgis-javascript-api
  • arcgis-service-api
  • arcgis-service-lifecycle-manager
  • arcgis-featureserver-webhook-processor
  • arcgis-gpserver-webhook-processor

Each of the deployments listed above, except arcgis-enterprise-web-style-app, takes less time to restart following an AZ loss than the relational data store failover process. Scaling to an additional replica of all listed deployments adds approximately 1 additional CPU and 0.5 GiB additional RAM to the total namespace requests and 5 additional CPU and 2.5 GiB additional RAM to the total namespace limits.
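The following sketch shows one way to find which of the listed deployments are still at a single replica and to estimate the added namespace resources. It is illustrative only: the replica counts are hypothetical input (for example, collected from the Kubernetes API), and the estimate assumes that the approximate figures above scale linearly with each extra replica of all listed deployments.

```python
CANDIDATES = [
    "arcgis-enterprise-apps",
    "arcgis-enterprise-manager",
    "arcgis-enterprise-portal",
    "arcgis-enterprise-web-style-app",
    "arcgis-help",
    "arcgis-javascript-api",
    "arcgis-service-api",
    "arcgis-service-lifecycle-manager",
    "arcgis-featureserver-webhook-processor",
    "arcgis-gpserver-webhook-processor",
]


def single_replica_candidates(replicas_by_name):
    """Return the listed deployments that currently run exactly one replica."""
    return [name for name in CANDIDATES if replicas_by_name.get(name) == 1]


def added_resources(extra_replicas):
    """Estimate added requests and limits (assumes linear scaling)."""
    return {
        "requests_cpu": 1.0 * extra_replicas,
        "requests_gib": 0.5 * extra_replicas,
        "limits_cpu": 5.0 * extra_replicas,
        "limits_gib": 2.5 * extra_replicas,
    }


# Hypothetical replica counts for two of the deployments.
current = {"arcgis-help": 1, "arcgis-enterprise-portal": 2}
print(single_replica_candidates(current))  # ['arcgis-help']
print(added_resources(1))
```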

When publishing services, set all dedicated services to run a minimum of two pods to ensure that there is no interruption during rescheduling. Scale the shared instance deployments to at least two replicas for the same purpose.

Effects of an AZ loss

The meaning and impact of an AZ loss vary depending on the cloud services that are affected. Loss of a particular service, such as compute or storage, can have a significant impact on running workloads, while network-based outages, such as DNS resolution failures or sharp increases in latency, can affect how microservices communicate with one another.

During relational store failover, certain functions, such as sign in, hosted feature service editing, and loading of hosted layers, may be degraded for a short time. Once the standby instance has been promoted to primary, all organization functionality should return to a stable state.

Organization health verification

Administrators can use the integrated health check reports to assess the health of their site. They can also review the workloads of the namespace for issues in pod startup and other challenges that may arise.

During an AZ outage

If an AZ is lost when using block storage for persistent volumes (PVs), the statefulset pods cannot be rescheduled into other AZs due to the volume affinity requirements.

For the data stores that require it, a quorum is maintained so the associated services can function in a degraded state. Since the relational store is a core component that several other services depend on, the standby can be reset through the Admin API. This removes the statefulset and associated PVC and creates a new statefulset, which allows the standby instance to synchronize with the primary after moving into one of the remaining AZs.
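As a rough illustration of why three AZs matter for quorum, the sketch below assumes a simple majority rule (more than half of the members must be available). That rule and the function name are assumptions for illustration. The quorum rules of the individual data stores are not described here.

```python
def has_majority_quorum(total_members, available_members):
    """Return True if a strict majority of members is available (assumed rule)."""
    return available_members >= total_members // 2 + 1


# Three members spread across three AZs, one AZ lost: two of three remain.
print(has_majority_quorum(3, 2))  # True
# Two members across two AZs, one AZ lost: one of two remains.
print(has_majority_quorum(2, 1))  # False
```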

Following an AZ outage

When services are restored in the AZ that experienced the outage, restart the associated workloads that were in a pending state. You can confirm which workloads are pending through the Kubernetes API.
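One way to identify pending workloads is to inspect the pod list returned by the Kubernetes API, for example the JSON produced by running kubectl get pods with the -o json option in the deployment namespace. The sketch below filters that structure on the standard status.phase field. The function name and the sample pod names are hypothetical, and the sample is heavily trimmed.

```python
import json


def pending_pods(pod_list):
    """Return the names of pods whose phase is Pending."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") == "Pending"
    ]


# Hypothetical, trimmed sample of a pod list in JSON form.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "example-store-0"}, "status": {"phase": "Running"}},
  {"metadata": {"name": "example-store-1"}, "status": {"phase": "Pending"}}
]}
""")
print(pending_pods(sample))  # ['example-store-1']
```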

If services cannot be restored in the existing AZ, ensure that the cluster has a minimum of three remaining AZs. If it doesn't, extend the cluster into an additional AZ. Once that is complete, create a backup, undeploy and redeploy, and perform a restore operation from the backup.
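To check how many AZs still have usable worker nodes, the node list from the Kubernetes API (for example, the JSON produced by kubectl get nodes with the -o json option) can be grouped by the zone label. The sketch below is illustrative: the function name and sample nodes are hypothetical, and it assumes that the cluster uses the topology.kubernetes.io/zone label key.

```python
def ready_zones(node_list, zone_key="topology.kubernetes.io/zone"):
    """Return the set of zones that contain at least one Ready node."""
    zones = set()
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        # A node is usable when its Ready condition has the status "True".
        is_ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        zone = node.get("metadata", {}).get("labels", {}).get(zone_key)
        if is_ready and zone:
            zones.add(zone)
    return zones


def make_node(zone, ready):
    """Build a trimmed, hypothetical node object for the example."""
    return {
        "metadata": {"labels": {"topology.kubernetes.io/zone": zone}},
        "status": {"conditions": [{"type": "Ready", "status": ready}]},
    }


nodes = {"items": [
    make_node("az-a", "True"),
    make_node("az-b", "True"),
    make_node("az-c", "Unknown"),  # node in the failed AZ
]}
zones = ready_zones(nodes)
print(sorted(zones))    # ['az-a', 'az-b']
print(len(zones) >= 3)  # False: extend the cluster into an additional AZ
```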