Deploy a Ray Serve application with a Stable Diffusion model on Google Kubernetes Engine (GKE)

This guide shows how to deploy and serve a Stable Diffusion model on Google Kubernetes Engine (GKE) using Ray Serve and the Ray Operator add-on.

About Ray and Ray Serve

Ray is an open-source scalable compute framework for AI/ML applications. Ray Serve is a model serving library for Ray used for scaling and serving models in a distributed environment. For more information, see Ray Serve in the Ray documentation.

You can use a RayCluster or RayService resource to deploy your Ray Serve applications. You should use a RayService resource in production for the following reasons:

In-place updates for RayService applications
Zero-downtime upgrades for RayCluster resources
Highly available Ray Serve applications

Prepare your environment

To prepare your environment, follow these steps:

Launch a Cloud Shell session by clicking Activate Cloud Shell in the Google Cloud console. This launches a session in the bottom pane of the console.

Set environment variables:

export PROJECT_ID=PROJECT_ID
export CLUSTER_NAME=rayserve-cluster
export COMPUTE_REGION=us-central1
export COMPUTE_ZONE=us-central1-c
export CLUSTER_VERSION=CLUSTER_VERSION
export TUTORIAL_HOME=`pwd`

Replace the following:

PROJECT_ID: your Google Cloud project ID.
CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
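
The version requirement above can be checked before you create the cluster. The following is a minimal sketch, not part of the sample repository. It assumes that GKE version strings start with three dot-separated numbers, optionally followed by a suffix such as -gke.1014001. The helper names are hypothetical.

```python
import re

# Minimum GKE version stated in this guide.
MINIMUM = (1, 30, 1)


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '1.30.5-gke.1014001'."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version: str, minimum: tuple = MINIMUM) -> bool:
    """Return True if the version is at or above the minimum."""
    return parse_version(version) >= minimum
```

A value without a patch number, such as 1.30, raises ValueError in this sketch instead of being guessed at.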

Clone the GitHub repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples

Change to the working directory:

cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/stable-diffusion

Create a Python virtual environment:
venv
```
python -m venv myenv && \
source myenv/bin/activate
```
Conda
1. Install Conda.
2. Run the following commands:
  conda create -c conda-forge python=3.9.19 -n myenv && \
  conda activate myenv

When you deploy a Serve application with serve run, Ray expects the Python version of the local client to match the version used in the Ray cluster. The rayproject/ray:2.37.0 image uses Python 3.9. If you're running a different client version, select the appropriate Ray image.
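
To confirm that the active environment matches the image before you continue, you can run a short check like the following. This is a sketch under the assumption that comparing the major and minor version is sufficient. The function name is hypothetical.

```python
import sys

# Python version of the rayproject/ray:2.37.0 image, as stated in this guide.
EXPECTED = (3, 9)


def matches_cluster_python(expected=EXPECTED, actual=None) -> bool:
    """Compare only the major and minor version of the local interpreter."""
    if actual is None:
        actual = sys.version_info
    return tuple(actual[:2]) == tuple(expected)


print(
    f"local Python {sys.version_info.major}.{sys.version_info.minor}, "
    f"matches cluster image: {matches_cluster_python()}"
)
```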

Install the required dependencies to run the Serve application:

pip install ray[serve]==2.37.0
pip install torch
pip install requests

Create a cluster with a GPU node pool

Create an Autopilot or Standard GKE cluster with a GPU node pool:

Autopilot

Create an Autopilot cluster:

gcloud container clusters create-auto ${CLUSTER_NAME}  \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

Standard

Create a Standard cluster:

gcloud container clusters create ${CLUSTER_NAME} \
    --addons=RayOperator \
    --cluster-version=${CLUSTER_VERSION}  \
    --machine-type=c3d-standard-8 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=1

Create a GPU node pool:

gcloud container node-pools create gpu-pool \
    --cluster=${CLUSTER_NAME} \
    --machine-type=g2-standard-8 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=1 \
    --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest

Deploy a RayCluster resource

To deploy a RayCluster resource:

Review the following manifest:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: stable-diffusion-cluster
spec:
  rayVersion: '2.37.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      metadata:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.37.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
          resources:
            limits:
              cpu: "2"
              ephemeral-storage: "15Gi"
              memory: "8Gi"
            requests:
              cpu: "2"
              ephemeral-storage: "15Gi"
              memory: "8Gi"
        nodeSelector:
          cloud.google.com/machine-family: c3d
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 4
    groupName: gpu-group
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.37.0-gpu
          resources:
            limits:
              cpu: 4
              memory: "16Gi"
              nvidia.com/gpu: 1
            requests:
              cpu: 3
              memory: "16Gi"
              nvidia.com/gpu: 1
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-l4

This manifest describes a RayCluster resource.
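
The worker resource requests determine how many GPU nodes the worker group needs as it scales toward maxReplicas. The following rough sketch estimates this. It assumes the g2-standard-8 machine shape (8 vCPUs, 32 GB of memory, one L4 GPU) and uses raw node capacity, so allocatable resources on a real node are somewhat lower. The helper names are hypothetical.

```python
# Assumed shape of a g2-standard-8 node. Memory is treated as 32 Gi for simplicity.
NODE_CAPACITY = {"cpu": 8, "memory": 32, "gpu": 1}
# Worker container requests copied from the manifest above.
WORKER_REQUESTS = {"cpu": 3, "memory": 16, "gpu": 1}


def pods_per_node(capacity, requests):
    """The scarcest resource limits how many worker Pods fit on one node."""
    return min(capacity[name] // amount for name, amount in requests.items())


def nodes_needed(replicas, capacity=NODE_CAPACITY, requests=WORKER_REQUESTS):
    """Number of nodes required to schedule the given number of worker replicas."""
    per_node = pods_per_node(capacity, requests)
    if per_node == 0:
        raise ValueError("a single Pod does not fit on this node shape")
    return -(-replicas // per_node)  # ceiling division


print(nodes_needed(4))
```

Under these assumptions the GPU is the limiting resource, so each worker replica occupies its own node.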

Apply the manifest to your cluster:
```
kubectl apply -f ray-cluster.yaml
```

Verify the RayCluster resource is ready:

kubectl get raycluster

The output is similar to the following:

NAME                       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
stable-diffusion-cluster   2                 2                   6      20Gi     0      ready    33s

In this output, ready in the STATUS column indicates the RayCluster resource is ready.
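
If you want to script this check, one option is to parse the table that kubectl prints. The following sketch assumes that columns are separated by two or more spaces, as in the sample output above, and that no column is empty. The function name is hypothetical.

```python
import re


def raycluster_status(kubectl_output: str, name: str):
    """Return the STATUS value for the named RayCluster, or None if it is not listed."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return None
    # Header names such as "DESIRED WORKERS" contain single spaces,
    # so split on runs of two or more spaces.
    header = re.split(r"\s{2,}", lines[0].strip())
    status_index = header.index("STATUS")
    for line in lines[1:]:
        fields = re.split(r"\s{2,}", line.strip())
        if fields[0] == name and len(fields) > status_index:
            return fields[status_index]
    return None


SAMPLE = """\
NAME                       DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
stable-diffusion-cluster   2                 2                   6      20Gi     0      ready    33s
"""

print(raycluster_status(SAMPLE, "stable-diffusion-cluster"))
```

Because an empty column would shift the fields, treat this as an illustration of reading the table rather than a robust readiness probe.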

Connect to the RayCluster resource

To connect to the RayCluster resource:

Verify that GKE created the RayCluster service:

kubectl get svc stable-diffusion-cluster-head-svc

The output is similar to the following:

NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
stable-diffusion-cluster-head-svc   ClusterIP   34.118.238.247   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP   109s

Establish port-forwarding sessions to the Ray head:

kubectl port-forward svc/stable-diffusion-cluster-head-svc 8265:8265 2>&1 >/dev/null &
kubectl port-forward svc/stable-diffusion-cluster-head-svc 10001:10001 2>&1 >/dev/null &

Verify that the Ray client can connect to the Ray cluster using localhost:

ray list nodes --address http://localhost:8265

The output is similar to the following:

======== List: 2024-06-19 15:15:15.707336 ========
Stats:
------------------------------
Total: 3

Table:
------------------------------
    NODE_ID                                                   NODE_IP     IS_HEAD_NODE    STATE    NODE_NAME    RESOURCES_TOTAL                 LABELS
0  1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2  10.28.1.21  False           ALIVE    10.28.1.21   CPU: 2.0                        ray.io/node_id: 1d07447d7d124db641052a3443ed882f913510dbe866719ac36667d2
# Several lines of output omitted
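
The port-forwarding sessions in this section run in the background. If a later command cannot connect, it can help to confirm that the local ports accept connections. The following sketch uses only the Python standard library. The function names are hypothetical, and the default ports are the ones forwarded in this section.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_forwards(ports=(8265, 10001), host="127.0.0.1"):
    """Map each forwarded port to whether it currently accepts connections."""
    return {port: port_open(host, port) for port in ports}
```

For example, calling check_forwards() returns a dictionary in which a False value points to a port-forwarding session that is not running.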

Run a Ray Serve application

To run a Ray Serve application:

Run the Stable Diffusion Ray Serve application:

serve run stable_diffusion:entrypoint --working-dir=. --runtime-env-json='{"pip": ["torch", "torchvision", "diffusers==0.12.1", "huggingface_hub==0.25.2", "transformers", "fastapi==0.113.0"], "excludes": ["myenv"]}' --address ray://localhost:10001

The output is similar to the following:

2024-06-19 18:20:58,444 INFO scripts.py:499 -- Running import path: 'stable_diffusion:entrypoint'.
2024-06-19 18:20:59,730 INFO packaging.py:530 -- Creating a file package for local directory '.'.
2024-06-19 18:21:04,833 INFO handle.py:126 -- Created DeploymentHandle 'hyil6u9f' for Deployment(name='StableDiffusionV2', app='default').
2024-06-19 18:21:04,834 INFO handle.py:126 -- Created DeploymentHandle 'xo25rl4k' for Deployment(name='StableDiffusionV2', app='default').
2024-06-19 18:21:04,836 INFO handle.py:126 -- Created DeploymentHandle '57x9u4fp' for Deployment(name='APIIngress', app='default').
2024-06-19 18:21:04,836 INFO handle.py:126 -- Created DeploymentHandle 'xr6kt85t' for Deployment(name='StableDiffusionV2', app='default').
2024-06-19 18:21:04,836 INFO handle.py:126 -- Created DeploymentHandle 'g54qagbz' for Deployment(name='APIIngress', app='default').
2024-06-19 18:21:19,139 INFO handle.py:126 -- Created DeploymentHandle 'iwuz00mv' for Deployment(name='APIIngress', app='default').
2024-06-19 18:21:19,139 INFO api.py:583 -- Deployed app 'default' successfully.
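
The runtime environment JSON in the serve run command is easy to mistype by hand. One option is to build the command from a Python dictionary, as in the following sketch. It uses only the flags and values from the command in this section. The function name is hypothetical.

```python
import json
import shlex

# Runtime environment copied from the serve run command in this section.
runtime_env = {
    "pip": [
        "torch",
        "torchvision",
        "diffusers==0.12.1",
        "huggingface_hub==0.25.2",
        "transformers",
        "fastapi==0.113.0",
    ],
    "excludes": ["myenv"],
}


def serve_run_command(import_path: str, runtime_env: dict, address: str) -> str:
    """Build a shell-safe serve run command line as a single string."""
    args = [
        "serve",
        "run",
        import_path,
        "--working-dir=.",
        f"--runtime-env-json={json.dumps(runtime_env)}",
        "--address",
        address,
    ]
    # shlex.quote wraps any argument that contains spaces or quotes.
    return " ".join(shlex.quote(arg) for arg in args)


print(serve_run_command("stable_diffusion:entrypoint", runtime_env, "ray://localhost:10001"))
```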

Establish a port-forwarding session to the Ray Serve port (8000):

kubectl port-forward svc/stable-diffusion-cluster-head-svc 8000:8000 2>&1 >/dev/null &

Run the Python script:
```
python generate_image.py
```
The script writes the generated image to a file named output.png.
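
For orientation, a client for the forwarded Serve port could look like the following sketch. It is not the contents of generate_image.py. The /imagine route and the prompt query parameter are assumptions, so check stable_diffusion.py in the sample directory for the actual route. The function names are hypothetical.

```python
import urllib.parse
import urllib.request


def build_url(prompt: str, base: str = "http://localhost:8000", route: str = "/imagine") -> str:
    """Build the request URL. The route and parameter name are assumed, not confirmed."""
    query = urllib.parse.urlencode({"prompt": prompt})
    return f"{base}{route}?{query}"


def save_image(prompt: str, path: str = "output.png", timeout: float = 300.0) -> int:
    """Request an image for the prompt and write the response body to a file."""
    with urllib.request.urlopen(build_url(prompt), timeout=timeout) as response:
        data = response.read()
    with open(path, "wb") as handle:
        handle.write(data)
    return len(data)
```

With the port-forwarding session active, calling save_image("a cat sitting on a bench") would write the response to output.png, provided the assumed route matches the application.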

Deploy a RayService

The RayService custom resource manages the lifecycle of a RayCluster resource and Ray Serve application.

For more information about RayService, see Deploy Ray Serve Applications and Production Guide in the Ray documentation.

To deploy a RayService resource, follow these steps:

Review the following manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stable-diffusion
spec:
  serveConfigV2: |
    applications:
      - name: stable_diffusion
        import_path: ai-ml.gke-ray.rayserve.stable-diffusion.stable_diffusion:entrypoint
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/main.zip"
          pip: ["diffusers==0.12.1", "torch", "torchvision", "huggingface_hub==0.25.2", "transformers"]
  rayClusterConfig:
    rayVersion: '2.37.0'
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.37.0
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            resources:
              limits:
                cpu: "2"
                ephemeral-storage: "15Gi"
                memory: "8Gi"
              requests:
                cpu: "2"
                ephemeral-storage: "15Gi"
                memory: "8Gi"
          nodeSelector:
            cloud.google.com/machine-family: c3d
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 4
      groupName: gpu-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.37.0-gpu
            resources:
              limits:
                cpu: 4
                memory: "16Gi"
                nvidia.com/gpu: 1
              requests:
                cpu: 3
                memory: "16Gi"
                nvidia.com/gpu: 1
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-l4

This manifest describes a RayService custom resource.
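
The import_path value in the manifest combines a dotted module path and an attribute name, separated by a colon. The following sketch shows how such a value corresponds to a file inside the working directory archive. It assumes the module path is resolved relative to the repository root of the archive and is only an illustration, not how Ray Serve implements the lookup. The function name is hypothetical.

```python
from pathlib import PurePosixPath


def resolve_import_path(import_path: str):
    """Split 'package.module:attribute' into a relative file path and an attribute name."""
    module, separator, attribute = import_path.partition(":")
    if not separator or not module or not attribute:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    parts = module.split(".")
    # Every part except the last is a directory. The last part is the Python file.
    file_path = PurePosixPath(*parts[:-1], parts[-1] + ".py")
    return str(file_path), attribute


print(resolve_import_path(
    "ai-ml.gke-ray.rayserve.stable-diffusion.stable_diffusion:entrypoint"
))
```

In this reading, the path matches the working directory that you changed to earlier in this guide, and entrypoint is the object defined in stable_diffusion.py.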

Apply the manifest to your cluster:
```
kubectl apply -f ray-service.yaml
```

Verify that the Service is ready:

kubectl get svc stable-diffusion-serve-svc

The output is similar to the following:

NAME                         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
stable-diffusion-serve-svc   ClusterIP   34.118.236.0   <none>        8000/TCP   31m

Configure port-forwarding to the Ray Serve Service:

kubectl port-forward svc/stable-diffusion-serve-svc 8000:8000 2>&1 >/dev/null &

Run the Python script from the previous section:
```
python generate_image.py
```
The script generates an image similar to the image generated in the previous section.
