GPU-enabled nodes

Kubernetes includes support for managing graphics processing units (GPUs) across different nodes in your cluster, using device plugins.

In ArcGIS Enterprise on Kubernetes, you can implement a device plugin to enable GPU nodes in your cluster and optimize GIS workflows such as raster analytics and deep learning. By default, capabilities such as raster analytics are configured to run in CPU mode, but they also provide the flexibility to run in GPU mode when these resources are available.

Providing and using GPUs in your cluster is optional, as doing so incurs additional cost.

To enable GPU, the NVIDIA device plugin for Kubernetes is required. The NVIDIA device plugin for Kubernetes is a daemonset that exposes the number of GPUs on each node of your cluster, allows you to run GPU-enabled containers, and tracks the health of the GPUs.

Note:

At this release, ArcGIS Enterprise on Kubernetes is only supported with NVIDIA GPUs.
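
Once the device plugin is running, each GPU node advertises an nvidia.com/gpu resource that Kubernetes can schedule against. As a quick check (a minimal example; the node name is a placeholder), confirm that a node reports GPU capacity:

    # A nonzero nvidia.com/gpu value in the node's capacity and allocatable sections
    # indicates that the device plugin is exposing GPUs on that node.
    kubectl describe node <your-node-name> | grep -i nvidia.com/gpu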

Enable GPU

The steps to enable GPU for your organization are specific to your environment and preferences and include the following:

  1. Complete steps to configure raster analytics or another capability for which you want to use GPU-enabled nodes.
  2. Verify whether your instance has the device plugin installed.

    Many cloud environments are preconfigured with GPU nodes. If the device plugin is not installed, see the NVIDIA device plugin for Kubernetes documentation for details and installation steps. If you've deployed on-premises, your administrator must enable GPU on each node in your cluster. A quick way to check whether the plugin is installed is shown in the example after these steps.

  3. To use GPU-enabled nodes for your organization's GIS workflows, set requests and limits for GPU.
  4. Optionally, if you want to run GPU workloads exclusively on GPU nodes, configure node affinity and tolerations.
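
A minimal check for step 2, assuming the NVIDIA device plugin was deployed as a daemonset (the daemonset name and namespace vary by installation method and are shown here only as an example):

    # Look for the NVIDIA device plugin daemonset; the name and namespace depend on
    # how the plugin was installed (static manifest, Helm chart, or cloud add-on).
    kubectl get daemonsets --all-namespaces | grep -i nvidia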

Set requests and limits for GPU

Use the ArcGIS Enterprise Administrator API Directory to set requests and limits for GPU for each of the following deployments:

  • system-rasteranalysistools-gpsyncserver (used for training models)
  • system-rasterprocessinggpu-dpserver (used for processing)

  1. Sign in to the ArcGIS Enterprise Administrator API Directory as an administrator.
  2. Click System > Deployments.
  3. Locate the system-rasteranalysistools-gpsyncserver deployment and click its corresponding ID.
  4. Click Edit Deployment.
  5. In the deployment JSON, locate the resources section for the deployment and the customResources parameter.
              
    "containers": [
          {
            "name": "main-container",
            "resources": {
              "memoryMin": "4Gi",
              "memoryMax": "8Gi",
              "cpuMin": "0.125",
              "customResources": {
                "limits":{"nvidia.com/gpu": "1"},
                "requests":{"nvidia.com/gpu": "1"}
              },
              "cpuMax": "2"
            },
    
  6. Update the customResources parameter for each container listed to include requests and limits for GPU.
  7. Click Submit to save edits to the deployment.
  8. Repeat steps for the system-rasterprocessinggpu-dpserver deployment.

Learn how to edit system deployments in the ArcGIS Enterprise Administrator API Directory documentation.
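
If you also have kubectl access to the cluster, you can optionally confirm that the new GPU requests and limits were applied to the underlying Kubernetes deployment (a sketch; the deployment and namespace names are placeholders for the ones used in your cluster):

    # Print the resources section of the deployment's container spec; the output
    # should include the nvidia.com/gpu requests and limits you set above.
    kubectl get deployment <gpu-deployment-name> -n <your-namespace> \
      -o jsonpath='{.spec.template.spec.containers[0].resources}'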

Configure node affinity and tolerations

GPU nodes can run both CPU and GPU workloads. If your CPU workloads are allowed to run on GPU nodes, no further steps are needed. However, if you want to ensure that GPU workloads run exclusively on GPU nodes, your administrator must take additional steps to configure node affinity and tolerations. This involves tainting the GPU nodes and applying tolerations to the applicable services so that they can be scheduled on the tainted nodes.

  1. To ensure GPU workloads are scheduled exclusively on GPU nodes, taint the GPU nodes.

    kubectl taint nodes <your-node-name> nvidia.com/gpu=Exists:NoExecute
    

  2. Label the GPU nodes. Alternatively, use an existing label that's already specified on the node.

    kubectl label nodes <your-node-name> raster=GPU
    

  3. Edit the service placement policy for the RasterProcessingGPU (DPServer) service under System, so that it uses node affinity and tolerations.

          
    
    "podPlacementPolicy": {
              "tolerations": [{
                 "effect": "NoExecute",
                 "key": "nvidia.com/gpu",
                 "operator": "Exists"
              }],
              "nodeAffinity": {
                 "requiredDuringSchedulingIgnoredDuringExecution": {
                      "nodeSelectorTerms": [{
                           "matchExpressions": [{
                              "key": "raster",
                              "operator": "In",
                              "values": ["GPU"]
                            }]
                       }]
                  }
              }
          }
    

  4. Verify that GPU pods are running on the GPU nodes.
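
    For example, list the pods along with the nodes they were scheduled on (a minimal check; the namespace and pod name filter are placeholders for your deployment):

    # GPU workloads, such as the raster processing pods, should appear only on the
    # labeled and tainted GPU nodes.
    kubectl get pods -n <your-namespace> -o wide | grep rasterprocessinggpu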

You can begin to use raster analysis tools and host imagery in your organization. Additionally, see recommendations for Tuning raster analytics.