Troubleshooting scalability in GKE

High usage of the etcd database can cause cluster instability and resource shortages that prevent your Google Kubernetes Engine (GKE) clusters from scaling effectively.

Use this document to learn how to identify clusters where etcd usage is approaching its limit and find recommendations to free up space, helping to ensure that your cluster remains stable.

This information is important for Platform admins and operators responsible for maintaining the health and scalability of GKE clusters. For more information about the common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

This document covers troubleshooting cluster stability related to high etcd usage.

Identify clusters where etcd usage is approaching the limit

GKE provides insights and recommendations for the scenario where etcd usage is approaching the limit. You can find these insights and recommendations in the following ways:

  • Use the Google Cloud console. Go to the Kubernetes clusters page. In the Notifications column for specific clusters, check for the "Free up space to reduce risk of cluster instability" recommendation.
  • Use the gcloud CLI or Recommender API by specifying the ETCD_DB_USAGE_APPROACHING_LIMIT recommender subtype.

    To query for this recommendation, run the following command:

    gcloud recommender recommendations list \
        --recommender=google.container.DiagnosisRecommender \
        --location=LOCATION \
        --project=PROJECT_ID \
        --format=yaml \
        --filter="recommenderSubtype:ETCD_DB_USAGE_APPROACHING_LIMIT"
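If your clusters are spread over several locations, the query above can be wrapped in a small shell function that repeats it per location. The following is a minimal sketch, not an official tool: the function name list_recommendations and its argument order are assumptions made for illustration, and it uses only the gcloud flags shown in this document.

```shell
# Hypothetical helper: run the recommendation query once per location.
# Usage: list_recommendations PROJECT_ID SUBTYPE LOCATION [LOCATION...]
list_recommendations() {
  project="$1"
  subtype="$2"
  shift 2
  for location in "$@"; do
    # Print a separator so that results from each location are easy to tell apart.
    echo "== ${location} =="
    gcloud recommender recommendations list \
      --recommender=google.container.DiagnosisRecommender \
      --location="${location}" \
      --project="${project}" \
      --format=yaml \
      --filter="recommenderSubtype:${subtype}"
  done
}
```

For example, list_recommendations PROJECT_ID ETCD_DB_USAGE_APPROACHING_LIMIT LOCATION_1 LOCATION_2 runs the query in two locations. Passing a different subtype as the second argument reuses the same loop for other recommendations.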
    

To implement this recommendation, remove any unnecessary data from etcd to free up space. This might involve deleting old resources or moving large objects out of etcd. For more information, see Plan for large GKE clusters.
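To decide where to start, it can help to see which resource types hold the most objects. The following sketch assumes that the cluster's API server metrics are readable with kubectl get --raw /metrics and that they include the upstream Kubernetes apiserver_storage_objects metric. Both are assumptions about your cluster and version. The function name top_stored_resources is hypothetical. Object count is only a rough proxy for space used, because a few large objects can outweigh many small ones.

```shell
# Hypothetical helper: read API server metrics text on stdin and print
# "COUNT RESOURCE" for the N resource types with the most stored objects.
# N defaults to 10.
top_stored_resources() {
  limit="${1:-10}"
  awk '
    /^apiserver_storage_objects[{]/ {
      # Extract the value of the resource label.
      match($0, /resource="[^"]+"/)
      name = substr($0, RSTART + 10, RLENGTH - 11)
      # The sample value is the last field; adding 0 converts it to a number.
      printf "%d %s\n", $NF + 0, name
    }
  ' | sort -rn | head -n "$limit"
}
```

A possible invocation is kubectl get --raw /metrics | top_stored_resources 5, which prints the five resource types with the highest object counts.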

Identify clusters where storage usage per object type is approaching the limit

GKE provides insights and recommendations for the scenario where the total size of etcd objects per type is approaching the limit. You can find these insights and recommendations in the following ways:

  • Use the Google Cloud console. Go to the Kubernetes clusters page. In the Notifications column for specific clusters, check for the "Reduce the size of resource type(s)" recommendation.
  • Use the gcloud CLI or Recommender API by specifying the APISERVER_RESOURCE_TYPE_SIZE_EXCEEDS_LIMIT recommender subtype.

    To query for this recommendation, run the following command:

    gcloud recommender recommendations list \
        --recommender=google.container.DiagnosisRecommender \
        --location=LOCATION \
        --project=PROJECT_ID \
        --format=yaml \
        --filter="recommenderSubtype:APISERVER_RESOURCE_TYPE_SIZE_EXCEEDS_LIMIT"
    

    To decide which objects to remove, you can use kubectl to list them. For example, if ConfigMaps are nearing the storage limit, the following command writes a list of all ConfigMaps across all namespaces to new_file.txt, which you can then review for candidates for deletion:

    kubectl get configmaps --all-namespaces > new_file.txt
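A long listing is easier to act on when it is summarized. The following sketch counts objects per namespace from the default table output of kubectl get with --all-namespaces, where the first column is NAMESPACE and the first row is a header. The function name count_by_namespace is an assumption made for illustration.

```shell
# Hypothetical helper: read `kubectl get <type> --all-namespaces` output on
# stdin and print "COUNT NAMESPACE" lines, largest count first.
count_by_namespace() {
  # Skip the header row, then tally rows by the NAMESPACE column.
  awk 'NR > 1 { count[$1]++ } END { for (ns in count) print count[ns], ns }' \
    | sort -k1,1nr -k2,2
}
```

For example, count_by_namespace < new_file.txt shows which namespaces hold the most ConfigMaps, which narrows down where to look for objects that are no longer needed.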
    

To implement this recommendation and free up space, remove any unnecessary objects of the specified types from storage. This process might involve deleting old resources or moving large objects out of storage. For more information, see Plan for large GKE clusters.

What's next