In Apollo Router v1.35.0 we are observing a sudden spike in query planning latency when load is increased above a certain limit

Hello folks,
We are in the process of upgrading our Apollo Router from v1.11.0 to v1.35.0. We have also added tracing via OTLP in our router.yaml, which uses the gRPC protocol to send traces to the Datadog agent, configured with a trace_buffer value of 3000.
We have observed that while the application metrics look healthy at <130k req/min, on increasing the load to 140k req/min on a single AWS EC2 instance the p99 latency for query planning takes a hit and the number of errors also shoots up.
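
For reference, the receiving side is the Datadog agent's OTLP ingest. The snippet below is only a minimal sketch of that agent config, assuming the standard otlp_config receiver keys and that trace_buffer lives under apm_config in datadog.yaml; it is illustrative, not our full agent configuration.

otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
apm_config:
  enabled: true
  trace_buffer: 3000  # the buffer value mentioned above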

Query planning numbers:
130k req/min - 12.6 ms
140k req/min - 7.8 s

This happens with no considerable increase in CPU usage, while memory usage increases considerably under the higher load.

Steps to reproduce the behavior:

  1. Apollo Router v1.35.0
  2. router.yaml config
telemetry:
  instrumentation:
    spans:
      default_attribute_requirement_level: required
  exporters:
    tracing:
      common:
        resource:
          "service.name": "gql-router"
      otlp:
        enabled: true
        endpoint: 0.0.0.0:4317
        protocol: grpc
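        # OTLP batch span processor tuning below: exports run on a 10s schedule, with up to
        # 300 concurrent exports of up to 3000 spans each and a queue of up to 2,000,000 spans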
        batch_processor:
          scheduled_delay: 10s
          max_concurrent_exports: 300
          max_export_batch_size: 3000
          max_export_timeout: 30s
          max_queue_size: 2000000
      propagation:
        datadog: true
    logging:
      common:
        service_name: "gql-router"
      stdout:
        enabled: true
        format:
          # text:
          #   ansi_escape_codes: false
          #   display_filename: false
          #   display_level: true
          #   display_line_number: false
          #   display_target: false
          #   display_thread_id: false
          #   display_thread_name: false
          #   display_timestamp: true
          #   display_resource: true
          #   display_span_list: false
          #   display_current_span: false
          #   display_service_name: true
          #   display_service_namespace: true
          json:
            display_filename: false
            display_level: true
            display_line_number: false
            display_target: false
            display_thread_id: false
            display_thread_name: false
            display_timestamp: true
            display_current_span: true
            display_span_list: false
            display_resource: true
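      # Log request headers and body only for requests carrying the header debug: true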
      experimental_when_header:
        - name: debug
          value: 'true'
          headers: true # default: false
          body: true
  3. Hit a load of 130k req/min and then 140k req/min (numbers might vary for your application)
  4. Look at query planning latency, memory and CPU usage (see the metrics sketch below)
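
For step 4, the query planning numbers above came from our tracing; metrics can also be scraped directly from the router. The snippet below is a minimal sketch assuming the v1.35 telemetry.exporters.metrics layout; the listen address and path are illustrative, not our exact settings.

telemetry:
  exporters:
    metrics:
      prometheus:
        enabled: true
        listen: 0.0.0.0:9090
        path: /metrics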

Expected behavior
Query planning latency should not spike like this when load is sustained over a period of time.

Desktop (please complete the following information):

  • AWS EC2 instance
  • 16 GB RAM
  • 8 vCPUs

For more info → Link to GitHub issue