This tutorial shows you how to deploy and serve a pre-trained machine learning (ML) model using GPUs on Google Kubernetes Engine (GKE) with NVIDIA Triton Inference Server and TensorFlow Serving. It provides a foundation for understanding and exploring practical model deployment for inference in a managed Kubernetes environment. You deploy a pre-built container to a GKE cluster with a single NVIDIA L4 Tensor Core GPU, and you prepare the GKE infrastructure to do online inference.
This tutorial is intended for Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists who want to host a pre-trained ML model on a GKE cluster. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.
Before reading this page, ensure that you're familiar with the following:
Create a Cloud Storage bucket
Create a Cloud Storage bucket to store the pre-trained model that will be served.
In Cloud Shell, run the following:
gcloud storage buckets create gs://$GSBUCKET
Configure your cluster to access the bucket using Workload Identity Federation for GKE
To let your cluster access the Cloud Storage bucket, you do the following:
- Create a Google Cloud service account.
- Create a Kubernetes ServiceAccount in your cluster.
- Bind the Kubernetes ServiceAccount to the Google Cloud service account.
Create a Google Cloud service account
In the Google Cloud console, go to the Create service account page.
In the Service account ID field, enter gke-ai-sa.
Click Create and continue.
In the Role list, select the Cloud Storage > Storage Insights Collector Service role.
Click Add another role.
In the Select a role list, select the Cloud Storage > Storage Object Admin role.
Click Continue, and then click Done.
Create a Kubernetes ServiceAccount in your cluster
In Cloud Shell, do the following:
Create a Kubernetes namespace:
kubectl create namespace gke-ai-namespace
Create a Kubernetes ServiceAccount in the namespace:
kubectl create serviceaccount gpu-k8s-sa --namespace=gke-ai-namespace
Bind the Kubernetes ServiceAccount to the Google Cloud service account
In Cloud Shell, run the following commands:
Add an IAM binding to the Google Cloud service account:
gcloud iam service-accounts add-iam-policy-binding gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[gke-ai-namespace/gpu-k8s-sa]"
The --member flag provides the full identity of the Kubernetes ServiceAccount in Google Cloud.
Annotate the Kubernetes ServiceAccount:
kubectl annotate serviceaccount gpu-k8s-sa \
    --namespace gke-ai-namespace \
    iam.gke.io/gcp-service-account=gke-ai-sa@PROJECT_ID.iam.gserviceaccount.com
Deploy the online inference server
Each online inference framework expects to find the pre-trained ML model in a specific format. The following section shows how to deploy the inference server depending on the framework you want to use:
Triton
In Cloud Shell, copy the pre-trained ML model into the Cloud Storage bucket:
gcloud storage cp src/triton-model-repository gs://$GSBUCKET --recursive
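Optionally, you can confirm the upload from Python before deploying. The following is a minimal sketch using the google-cloud-storage client library; it is not part of the tutorial repository and assumes the GSBUCKET environment variable is set as earlier in the tutorial.
# verify_upload.py: list the objects under the Triton model repository prefix.
# A sketch, not tutorial code; assumes the GSBUCKET environment variable is set.
import os
from google.cloud import storage  # pip install google-cloud-storage

bucket_name = os.environ["GSBUCKET"]
client = storage.Client()

# `gcloud storage cp --recursive` places the files under triton-model-repository/.
for blob in client.list_blobs(bucket_name, prefix="triton-model-repository"):
    print(f"{blob.name}  ({blob.size} bytes)")
You can get the same listing with gcloud storage ls --recursive gs://$GSBUCKET/triton-model-repository.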
Deploy the framework by using a Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster:
envsubst < src/gke-config/deployment-triton.yaml | kubectl --namespace=gke-ai-namespace apply -f -
Validate that GKE deployed the framework:
kubectl get deployments --namespace=gke-ai-namespace
When the framework is ready, the output is similar to the following:
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
triton-deployment   1/1     1            1           5m29s
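If you prefer to check the rollout programmatically instead of with kubectl, the following sketch uses the official Kubernetes Python client. It is not part of the tutorial repository and assumes that your kubeconfig already points at the cluster (for example, after running gcloud container clusters get-credentials).
# check_deployment.py: report whether the Triton Deployment's replicas are ready.
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # uses your current kubectl context
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment(name="triton-deployment", namespace="gke-ai-namespace")
ready = dep.status.ready_replicas or 0
wanted = dep.spec.replicas or 0
print(f"triton-deployment: {ready}/{wanted} replicas ready")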
Deploy the Services to access the Deployment:
kubectl apply --namespace=gke-ai-namespace -f src/gke-config/service-triton.yaml
Check that an external IP address is assigned:
kubectl get services --namespace=gke-ai-namespace
The output is similar to the following:
NAME            TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                        AGE
kubernetes      ClusterIP      34.118.224.1     <none>          443/TCP                                        60m
triton-server   LoadBalancer   34.118.227.176   35.239.54.228   8000:30866/TCP,8001:31035/TCP,8002:30516/TCP   5m14s
Take note of the IP address for triton-server in the EXTERNAL-IP column.
Check that the Service and the Deployment are working correctly:
curl -v EXTERNAL_IP:8000/v2/health/ready
Replace EXTERNAL_IP with your external IP address.
The output is similar to the following:
...
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
...
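You can run the same checks from Python. The sketch below, which is not part of the tutorial repository, uses the requests library against Triton's KServe v2 HTTP endpoints on port 8000; replace EXTERNAL_IP with the EXTERNAL-IP value you noted. The metadata call also shows the input and output tensors the mnist model expects, which is useful if you write your own client later.
# triton_health.py: check server readiness and fetch model metadata over HTTP.
import requests  # pip install requests

TRITON_URL = "http://EXTERNAL_IP:8000"  # replace EXTERNAL_IP with the LoadBalancer IP

# Readiness: Triton returns HTTP 200 with an empty body when the server is ready.
ready = requests.get(f"{TRITON_URL}/v2/health/ready", timeout=10)
print("server ready:", ready.status_code == 200)

# Model metadata: lists the input and output tensors of the mnist model.
meta = requests.get(f"{TRITON_URL}/v2/models/mnist", timeout=10)
print(meta.json())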
TF Serving
In Cloud Shell, copy the pre-trained ML model into the Cloud Storage bucket:
gcloud storage cp src/tfserve-model-repository gs://$GSBUCKET --recursive
Deploy the framework by using a Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster:
envsubst < src/gke-config/deployment-tfserve.yaml | kubectl --namespace=gke-ai-namespace apply -f -
Validate that GKE deployed the framework:
kubectl get deployments --namespace=gke-ai-namespace
When the framework is ready, the output is similar to the following:
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
tfserve-deployment   1/1     1            1           5m29s
Deploy the Services to access the Deployment:
kubectl apply --namespace=gke-ai-namespace -f src/gke-config/service-tfserve.yaml
Check that an external IP address is assigned:
kubectl get services --namespace=gke-ai-namespace
The output is similar to the following:
NAME             TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                         AGE
kubernetes       ClusterIP      34.118.224.1     <none>          443/TCP                         60m
tfserve-server   LoadBalancer   34.118.227.176   35.239.54.228   8500:30003/TCP,8000:32194/TCP   5m14s
Take note of the IP address for tfserve-server in the EXTERNAL-IP column.
Check that the Service and the Deployment are working correctly:
curl -v EXTERNAL_IP:8000/v1/models/mnist
Replace EXTERNAL_IP with your external IP address.
The output is similar to the following:
...
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Thu, 12 Oct 2023 19:01:19 GMT
< Content-Length: 154
<
{
  "model_version_status": [
    {
      "version": "1",
      "state": "AVAILABLE",
      "status": {
        "error_code": "OK",
        "error_message": ""
      }
    }
  ]
}
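You can make the same request from Python and also ask TF Serving for the model's signature. The sketch below, which is not part of the tutorial repository, uses the requests library against TF Serving's REST API, reachable on port 8000 through the Service you deployed; replace EXTERNAL_IP with the EXTERNAL-IP value you noted.
# tfserve_status.py: check model status and fetch its metadata over the REST API.
import requests  # pip install requests

TFSERVE_URL = "http://EXTERNAL_IP:8000"  # replace EXTERNAL_IP with the LoadBalancer IP

# Model status: the same information as the curl command above.
status = requests.get(f"{TFSERVE_URL}/v1/models/mnist", timeout=10)
print(status.json())

# Model metadata: includes the SignatureDef with the input and output tensor names.
metadata = requests.get(f"{TFSERVE_URL}/v1/models/mnist/metadata", timeout=10)
print(metadata.json())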
Serve the model
Triton
Create a Python virtual environment in Cloud Shell.
python -m venv ./mnist_client
source ./mnist_client/bin/activate
Install the required Python packages.
pip install -r src/client/triton-requirements.txt
Test Triton inference server by loading an image:
cd src/client
python triton_mnist_client.py -i EXTERNAL_IP -m mnist -p ./images/TEST_IMAGE.png
Replace the following:
- EXTERNAL_IP: Your external IP address.
- TEST_IMAGE: The name of the file that corresponds to the image you want to test. You can use the images stored in src/client/images.
Depending on which image you use, the output is similar to the following:
Calling Triton HTTP Service -> Prediction result: 7
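The triton_mnist_client.py script handles the image preprocessing and request format for you. To see roughly what such a client does, here is a minimal sketch that sends one grayscale image to Triton with the tritonclient Python package, plus numpy and Pillow (install them if triton-requirements.txt does not already provide them). The tensor names (input__0, output__0), the input shape, and the normalization are illustrative assumptions rather than the tutorial's actual values; check the model metadata endpoint or the repository's client script for the real ones.
# triton_infer_sketch.py: a minimal Triton HTTP inference request for one MNIST image.
# Tensor names, shape, and preprocessing are assumptions for illustration only.
import numpy as np
import tritonclient.http as httpclient
from PIL import Image  # pip install pillow

client = httpclient.InferenceServerClient(url="EXTERNAL_IP:8000")  # host:port, no scheme

# Load a test digit and scale pixel values to [0, 1] (assumed preprocessing).
image = Image.open("./images/TEST_IMAGE.png").convert("L").resize((28, 28))
batch = np.asarray(image, dtype=np.float32)[np.newaxis, np.newaxis, :, :] / 255.0

infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")  # assumed name
infer_input.set_data_from_numpy(batch)

response = client.infer(model_name="mnist", inputs=[infer_input])
scores = response.as_numpy("output__0")  # assumed output name
print("Prediction result:", int(np.argmax(scores)))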
TF Serving
Create a Python virtual environment in Cloud Shell.
python -m venv ./mnist_client
source ./mnist_client/bin/activate
Install the required Python packages.
pip install -r src/client/tfserve-requirements.txt
Test TensorFlow Serving with a few images.
cd src/client
python tfserve_mnist_client.py -i EXTERNAL_IP -m mnist -p ./images/TEST_IMAGE.png
Replace the following:
- EXTERNAL_IP: Your external IP address.
- TEST_IMAGE: A value from 0 to 9. You can use the images stored in src/client/images.
Depending on which image you use, the output is similar to the following:
Calling TensorFlow Serve HTTP Service -> Prediction result: 5
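Similarly, tfserve_mnist_client.py builds a request against TF Serving's REST predict endpoint. The sketch below, which is not part of the tutorial repository, shows what such a request can look like with the requests library; the 28x28 grayscale shape, the normalization, and the use of the default serving signature are illustrative assumptions, so check the model metadata endpoint or the repository's client script for the exact input format.
# tfserve_infer_sketch.py: a minimal REST predict request for one MNIST image.
# The instance shape and preprocessing are assumptions for illustration only.
import numpy as np
import requests
from PIL import Image  # pip install pillow

TFSERVE_URL = "http://EXTERNAL_IP:8000"  # replace EXTERNAL_IP with the LoadBalancer IP

# Load a test digit and scale pixel values to [0, 1] (assumed preprocessing).
image = Image.open("./images/TEST_IMAGE.png").convert("L").resize((28, 28))
instance = (np.asarray(image, dtype=np.float32) / 255.0).tolist()

response = requests.post(
    f"{TFSERVE_URL}/v1/models/mnist:predict",
    json={"instances": [instance]},
    timeout=10,
)
scores = response.json()["predictions"][0]
print("Prediction result:", int(np.argmax(scores)))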
Observe model performance
Triton
To observe the model performance, you can use the Triton dashboard integration in Cloud Monitoring. With this dashboard, you can view critical performance metrics like token throughput, request latency, and error rates.
To use the Triton dashboard, you must enable Google Cloud Managed Service for Prometheus, which collects the metrics from Triton, in your GKE cluster. Triton exposes metrics in Prometheus format by default; you do not need to install an additional exporter.
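Triton serves its Prometheus metrics on port 8002, which the triton-server Service you deployed exposes. If you want to inspect the raw metrics directly, for example before the dashboard is configured, the following is a minimal sketch using the requests library (not part of the tutorial repository); replace EXTERNAL_IP with the EXTERNAL-IP value you noted.
# triton_metrics.py: print the Prometheus metric samples that Triton exposes on port 8002.
import requests  # pip install requests

METRICS_URL = "http://EXTERNAL_IP:8002/metrics"  # replace EXTERNAL_IP with the LoadBalancer IP

text = requests.get(METRICS_URL, timeout=10).text
for line in text.splitlines():
    # Skip the # HELP and # TYPE comment lines and print only the metric samples.
    if line and not line.startswith("#"):
        print(line)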
You can then view the metrics by using the Triton dashboard. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the Triton observability guidance in the Cloud Monitoring documentation.
TF Serving
To observe the model performance, you can use the TF Serving dashboard integration in Cloud Monitoring. With this dashboard, you can view critical performance metrics like token throughput, request latency, and error rates.
To use the TF Serving dashboard, you must enable Google Cloud Managed Service for Prometheus, which collects the metrics from TF Serving, in your GKE cluster.
You can then view the metrics by using the TF Serving dashboard. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the TF Serving observability guidance in the Cloud Monitoring documentation.