
Early return in PodTemplateSpecSanitizer causes constant resource mismatch due to enforced resource requests/limits in GKE Autopilot clusters #3006

@Donnerbart

Description

Bug Report

What did you do?

Install a HiveMQ Platform and HiveMQ Platform Operator Helm chart in a GKE Autopilot cluster.

What did you expect to see?

I expect a smooth reconciliation.

What did you see instead? Under which circumstances?

We see a constant mismatch of our StatefulSet resource, so it's updated on every reconciliation:

15:04:28.712 [INFO] c.h.p.o.d.StatefulSetResourceMatcher - Detected changes in StatefulSet specification:
  Path: /spec/template/spec/containers/0/resources/limits/cpu
    Actual value: "1"
    Desired value: "1000m"

  Path: /spec/template/spec/containers/0/resources/requests/cpu
    Actual value: "1"
    Desired value: "1000m"

(StatefulSetResourceMatcher extends SSABasedGenericKubernetesResourceMatcher and uses the internal, pruned actual and desired maps for the diff logging)

This mismatch should be prevented by the PodTemplateSpecSanitizer. The actual root cause of the mismatch is hidden due to an unlucky combination of our resource requests/limits configuration and the interference of GKE Autopilot:

  • The HiveMQ Platform Helm chart configures cpu requests/limits of 1000m, which K8s serializes as 1. We therefore rely on the PodTemplateSpecSanitizer in JOSDK to sanitize the actualMap and prevent false positive mismatches on our StatefulSet resource.

  • The HiveMQ Platform Helm chart doesn't configure ephemeral-storage requests/limits by default, but GKE Autopilot enforces them and updates our StatefulSet accordingly on the fly.
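For illustration, the first bullet can be sketched with a minimal CPU quantity parser (a stdlib-only sketch with hypothetical names, not the Fabric8 Quantity implementation) that normalizes both serializations to millicores:

```java
import java.math.BigDecimal;

public class CpuQuantity {
    // Hypothetical, minimal CPU quantity parser: supports plain cores
    // and the "m" (milli) suffix only.
    static BigDecimal toMillicores(String quantity) {
        if (quantity.endsWith("m")) {
            return new BigDecimal(quantity.substring(0, quantity.length() - 1));
        }
        return new BigDecimal(quantity).multiply(BigDecimal.valueOf(1000));
    }

    public static void main(String[] args) {
        // K8s serializes the desired "1000m" as "1", but both are 1000 millicores.
        System.out.println(toMillicores("1000m").compareTo(toMillicores("1"))); // prints 0
    }
}
```

So the two values only differ in serialization, never numerically, which is exactly the case the sanitizer exists for.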

Under the hood we end up with these values in the matcher:

desired:
  resources:
    limits:
      cpu: 1000m
      memory: 2048M
    requests:
      cpu: 1000m
      memory: 2048M
actual:
  resources:
    limits:
      cpu: 1                 # changed by K8s
      ephemeral-storage: 1Gi # added by GKE Autopilot
      memory: 2048M
    requests:
      cpu: 1                 # changed by K8s
      ephemeral-storage: 1Gi # added by GKE Autopilot
      memory: 2048M

The size mismatch between the actual and desired maps triggers this early return in the PodTemplateSpecSanitizer. The cpu values are therefore never sanitized and we end up with a false positive mismatch of the StatefulSet.
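A stdlib-only sketch of that size-guarded sanitization (hypothetical names, simplified to flat string maps) shows how the extra ephemeral-storage key makes the early return fire before any cpu value is rewritten:

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;

public class SanitizerSketch {
    static BigDecimal millicores(String q) {
        return q.endsWith("m")
                ? new BigDecimal(q.substring(0, q.length() - 1))
                : new BigDecimal(q).multiply(BigDecimal.valueOf(1000));
    }

    // Sketch: replace numerically equal actual values with the desired
    // serialization, but only when the map sizes match.
    static void sanitize(Map<String, String> actual, Map<String, String> desired) {
        if (actual.size() != desired.size()) {
            return; // the early return described above: nothing gets sanitized
        }
        desired.forEach((key, desiredValue) -> {
            String actualValue = actual.get(key);
            if (actualValue != null
                    && millicores(actualValue).compareTo(millicores(desiredValue)) == 0) {
                actual.put(key, desiredValue);
            }
        });
    }

    public static void main(String[] args) {
        Map<String, String> desired = new LinkedHashMap<>(Map.of("cpu", "1000m"));
        Map<String, String> actual =
                new LinkedHashMap<>(Map.of("cpu", "1", "ephemeral-storage", "1Gi"));
        sanitize(actual, desired);
        // The Autopilot-injected key changed the size, so cpu stays "1".
        System.out.println(actual.get("cpu")); // prints 1
    }
}
```

Without the injected ephemeral-storage key the sizes match and the same call would rewrite cpu to "1000m".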

Since the desired state doesn't contain ephemeral-storage, there are no managed fields for this key in the requests/limits of our container. The SSABasedGenericKubernetesResourceMatcher then correctly prunes ephemeral-storage from the actual map, but this also hides it as the actual root cause of the wrong cpu mismatch. Even with debug logging, ephemeral-storage won't show up in the diff, because the diff is computed on the pruned actual map: var diff = getDiff(prunedActual, desiredMap, objectMapper);. The same applies to our custom logging, which also uses the pruned actual map.
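The pruning effect can be sketched as keeping only the keys the desired (managed) state knows about (again a simplified, hypothetical sketch of the SSA-based pruning):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PruneSketch {
    // Simplified view of managed-fields pruning: only keys present in the
    // desired (managed) map survive in the pruned actual map used for the diff.
    static Map<String, String> prune(Map<String, String> actual, Map<String, String> desired) {
        Map<String, String> pruned = new LinkedHashMap<>();
        actual.forEach((key, value) -> {
            if (desired.containsKey(key)) {
                pruned.put(key, value);
            }
        });
        return pruned;
    }

    public static void main(String[] args) {
        Map<String, String> desired = Map.of("cpu", "1000m", "memory", "2048M");
        Map<String, String> actual =
                Map.of("cpu", "1", "ephemeral-storage", "1Gi", "memory", "2048M");
        // ephemeral-storage never reaches the diff, hiding the real root cause.
        System.out.println(prune(actual, desired).containsKey("ephemeral-storage")); // prints false
    }
}
```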

Environment

Kubernetes cluster type: K8s 1.33.5 on GKE with Autopilot

java-operator-sdk version (from pom.xml):

5.1.4

$ java -version

openjdk version "21.0.8" 2025-07-15
OpenJDK Runtime Environment (build 21.0.8+9-Ubuntu-0ubuntu124.04.1)
OpenJDK 64-Bit Server VM (build 21.0.8+9-Ubuntu-0ubuntu124.04.1, mixed mode, sharing)

$ kubectl version

Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.33.5-gke.1080000

Possible Solution

The easiest solution would be to remove the early return: .filter(m -> m.size() == desiredResource.size()).

This shouldn't cost much performance: two more early returns still run before the equals() check that invokes the expensive getNumericalAmount().
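A sketch of the same sanitization without the size check (hypothetical names, simplified to flat string maps) shows that keys present only in the actual map, like Autopilot's ephemeral-storage, are simply never visited, while the cheap guards still short-circuit before any quantity parsing:

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;

public class NoSizeCheckSketch {
    static BigDecimal millicores(String q) {
        return q.endsWith("m")
                ? new BigDecimal(q.substring(0, q.length() - 1))
                : new BigDecimal(q).multiply(BigDecimal.valueOf(1000));
    }

    // Sanitization sketch without the size check: iteration is driven by the
    // desired map, so actual-only keys are ignored, and the null/equality
    // guards below keep the common case cheap.
    static void sanitize(Map<String, String> actual, Map<String, String> desired) {
        desired.forEach((key, desiredValue) -> {
            String actualValue = actual.get(key);
            if (actualValue == null || actualValue.equals(desiredValue)) {
                return; // cheap guards, no quantity parsing needed
            }
            if (millicores(actualValue).compareTo(millicores(desiredValue)) == 0) {
                actual.put(key, desiredValue);
            }
        });
    }

    public static void main(String[] args) {
        Map<String, String> desired = new LinkedHashMap<>(Map.of("cpu", "1000m"));
        Map<String, String> actual =
                new LinkedHashMap<>(Map.of("cpu", "1", "ephemeral-storage", "1Gi"));
        sanitize(actual, desired);
        // cpu is now sanitized despite the Autopilot-injected key.
        System.out.println(actual.get("cpu")); // prints 1000m
    }
}
```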
