Intermittent health check failures against Apollo server in Kubernetes cluster

We are running Apollo Server 2.19.1 in a Kubernetes cluster, and we use the following liveness and readiness probes:

      livenessProbe:
        httpGet:
          path: /.well-known/apollo/server-health
          port: 4000
          scheme: HTTP
          httpHeaders:
            - name: X-Forwarded-Host
              value: atlas
        initialDelaySeconds: 10
        timeoutSeconds: 1
        periodSeconds: 5
        successThreshold: 1
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /.well-known/apollo/server-health
          port: 4000
          scheme: HTTP
          httpHeaders:
            - name: X-Forwarded-Host
              value: atlas
        initialDelaySeconds: 10
        timeoutSeconds: 1
        periodSeconds: 5
        successThreshold: 1
        failureThreshold: 3

Pods restart anywhere from 0 to 10 times over a 12-hour period, and we cannot figure out why. We have 15 pods in production. In addition to the intermittent restarts, we consistently see a health check failure every 5 minutes on each pod.

Here are a few more observations to give you a sense of what we have looked at:

  • CPU usage and network traffic are very low.
  • Memory usage is generally low across all pods, but there are one or two spikes per hour, peaking at 200 to 600 MB. Memory usually returns to its pre-spike level.
  • Restarts cannot be correlated to memory spikes. Some pods have zero restarts, even though they have memory spikes.
  • Liveness checks fail exactly every 5 minutes on each pod. This strikes me as too regular and too evenly spread.
  • Correlating the liveness check failures with other 502 status codes in the Nginx logs suggests that the checks fail because Apollo refuses the connection or the network connection cannot be established.

Next, we will add tracing to the health check to understand the failure’s root cause.

If anybody has advice on what else we can do to figure this out, it would be greatly appreciated!

We are seeing similar issues on our installations. @rochecompaan, any headway in your investigation? We are pursuing a similar line of inquiry, adding probes that use eventLoopMonitor to check whether the Node event loop is getting stuck.
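
For anyone who wants to try the same idea without extra dependencies, here is a minimal sketch of an event loop lag check. It is not the eventLoopMonitor tool mentioned above; it swaps in the generic timer-drift technique, and the names `computeLagMs` and `startLagMonitor` are made up for illustration. The assumption is that if a timer fires much later than scheduled, the event loop was blocked for roughly that long and could not have answered a probe in that window. Node also ships `monitorEventLoopDelay` in `perf_hooks`, which may be a better fit if higher-resolution histograms are needed.

```typescript
// Hypothetical helpers, not part of Apollo Server or any library.

// Lag of one tick: how much later the timer fired than it was scheduled to.
// Clamped at zero because timers can fire marginally early after rounding.
function computeLagMs(
  previousTickMs: number,
  currentTickMs: number,
  intervalMs: number
): number {
  return Math.max(0, currentTickMs - previousTickMs - intervalMs);
}

// Starts a repeating timer and calls onStall whenever a tick arrives at
// least thresholdMs late. Returns a function that stops the monitor.
// Date.now() is wall-clock time, so a system clock adjustment can produce
// a spurious reading; this is accepted here to keep the sketch simple.
function startLagMonitor(
  intervalMs: number,
  thresholdMs: number,
  onStall: (lagMs: number) => void
): () => void {
  let previousTickMs = Date.now();
  const timer = setInterval(() => {
    const currentTickMs = Date.now();
    const lagMs = computeLagMs(previousTickMs, currentTickMs, intervalMs);
    previousTickMs = currentTickMs;
    if (lagMs >= thresholdMs) {
      onStall(lagMs);
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

One possible way to use it is to call `startLagMonitor(100, 1000, (lagMs) => console.warn("event loop stalled for", lagMs, "ms"))` at server startup. The 1000 ms threshold is chosen to match the `timeoutSeconds: 1` in the probes above, so a logged stall would line up with a probe that could have timed out. If no stalls are logged around the failures, the cause is more likely outside the Node process.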

Increasing the timeout from 1s to 10s seems to have fixed the issue for us.