Intermittent health check failures against Apollo Server in a Kubernetes cluster

We are running Apollo Server 2.19.1 in a Kubernetes cluster, and we use the following readiness and liveness probes:

      livenessProbe:
        httpGet:
          path: /.well-known/apollo/server-health
          port: 4000
          scheme: HTTP
          httpHeaders:
            - name: X-Forwarded-Host
              value: atlas
        initialDelaySeconds: 10
        timeoutSeconds: 1
        periodSeconds: 5
        successThreshold: 1
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /.well-known/apollo/server-health
          port: 4000
          scheme: HTTP
          httpHeaders:
            - name: X-Forwarded-Host
              value: atlas
        initialDelaySeconds: 10
        timeoutSeconds: 1
        periodSeconds: 5
        successThreshold: 1
        failureThreshold: 3

Pods restart anywhere from 0 to 10 times over a 12-hour cycle, and we cannot figure out why. We have 15 pods in production. In addition to the intermittent restarts, we consistently see a health check failure every 5 minutes on each pod.

Here are a few more observations to give you a sense of what we have looked at:

  • CPU usage and network traffic are very low.
  • Memory usage is generally low across all pods, but there are one or two spikes per hour, peaking somewhere between 200 and 600 MB. Memory usually recovers to its pre-spike level.
  • Restarts do not correlate with memory spikes: some pods have memory spikes but zero restarts.
  • Liveness checks fail exactly every 5 minutes on each pod. This strikes me as too regular and too evenly spread to be coincidental.
  • Correlating the liveness check failures with the 502 status codes in the Nginx logs suggests that the checks fail because Apollo refuses the connection or the network connection cannot be established (see the sketch below).
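
To tell a refused connection apart from a slow response, we are also thinking about running a small in-pod checker that hits the health endpoint directly and logs the error code or latency. A minimal sketch (host, port, interval, and timeout values are illustrative):

    // Sketch of an in-pod checker to distinguish "connection refused" from a slow response.
    const http = require('http');

    setInterval(() => {
      const start = Date.now();
      const req = http.get(
        {
          host: '127.0.0.1',
          port: 4000,
          path: '/.well-known/apollo/server-health',
          timeout: 1000, // mirror the probe's 1 s timeout
        },
        (res) => {
          res.resume(); // drain the body so the socket is released
          console.log(`${new Date().toISOString()} status=${res.statusCode} ms=${Date.now() - start}`);
        }
      );
      req.on('timeout', () => {
        console.log(`${new Date().toISOString()} timeout after ${Date.now() - start} ms`);
        req.destroy();
      });
      req.on('error', (err) => {
        // ECONNREFUSED here would mean the process is not accepting connections at all.
        console.log(`${new Date().toISOString()} error=${err.code} after ${Date.now() - start} ms`);
      });
    }, 5000);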

Next, we will add tracing to the health check to understand the failure’s root cause.
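
Concretely, we have something like this in mind, using Apollo Server 2's onHealthCheck hook (the schema and log fields below are placeholders):

    const { ApolloServer, gql } = require('apollo-server');

    // Placeholder schema so the sketch is self-contained.
    const typeDefs = gql`
      type Query {
        ping: String
      }
    `;
    const resolvers = { Query: { ping: () => 'pong' } };

    const server = new ApolloServer({
      typeDefs,
      resolvers,
      // Apollo Server 2 calls this for every request to /.well-known/apollo/server-health;
      // resolving returns 200, rejecting returns 503.
      onHealthCheck: async () => {
        // Log every invocation so gaps in this log can be lined up with kubelet probe
        // failures: a missing entry means the request never reached the handler.
        console.log(
          JSON.stringify({
            msg: 'health check hit',
            ts: new Date().toISOString(),
            rssBytes: process.memoryUsage().rss,
          })
        );
      },
    });

    server.listen({ port: 4000 }).then(({ url }) => {
      console.log(`server ready at ${url}`);
    });

If these log entries keep appearing even while the kubelet reports a failed probe, that would point at the connection being dropped before it reaches Node rather than at a slow handler.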

If anybody has advice on what else we can do to figure this out, it would be greatly appreciated!


Seeing similar issues on our installations. @rochecompaan, any headway in your investigation? We are looking along similar lines by adding probes that check whether the Node.js event loop is getting blocked, using eventLoopMonitor.
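
Roughly what we are experimenting with, using Node's built-in perf_hooks as a stand-in for the monitor (thresholds and intervals are illustrative):

    const { monitorEventLoopDelay } = require('perf_hooks'); // Node >= 11.10

    // Sample event-loop delay every 20 ms and report stalls periodically.
    const histogram = monitorEventLoopDelay({ resolution: 20 });
    histogram.enable();

    setInterval(() => {
      // The histogram reports values in nanoseconds.
      const p99Ms = histogram.percentile(99) / 1e6;
      const maxMs = histogram.max / 1e6;
      if (maxMs > 1000) {
        // A stall longer than the probe's 1 s timeout would be enough to fail a check.
        console.warn(`event loop stalled: p99=${p99Ms.toFixed(1)} ms, max=${maxMs.toFixed(1)} ms`);
      }
      histogram.reset();
    }, 30000);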

Increasing the probe timeout (timeoutSeconds) from 1 s to 10 s seems to have fixed the issue for us.