In Apollo Router v1.35.0 we are observing a sudden spike in query planning latency when load is increased above a certain limit

Hello folks,
We are in the process of upgrading our Apollo Router from v1.11.0 to v1.35.0. We have also added tracing via OTLP in our router.yaml, which uses the gRPC protocol to send traces to the Datadog agent, configured with a trace_buffer value of 3000.
We have observed that while the application metrics look healthy at <130k req/min, on increasing the load to 140k req/min on a single AWS EC2 instance the p99 latency for query planning takes a hit and the number of errors also shoots up.
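
For reference, the receiving side is the Datadog agent's OTLP ingest. The snippet below is only a minimal sketch of that agent config, assuming the standard otlp_config receiver keys and that trace_buffer lives under apm_config in datadog.yaml; it is illustrative, not our full agent configuration.

otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
apm_config:
  enabled: true
  trace_buffer: 3000  # the buffer value mentioned above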

Query planning numbers:
130k req/min - 12.6 ms
140k req/min - 7.8 s

This happens with no considerable increase in CPU usage, while memory usage increases considerably under the higher load.

Steps to reproduce the behavior:

  1. Apollo Router v1.35.0
  2. router.yaml config
telemetry:
  instrumentation:
    spans:
      default_attribute_requirement_level: required
  exporters:
    tracing:
      common:
        resource:
          "service.name": "gql-router"
      otlp:
        enabled: true
        endpoint: 0.0.0.0:4317
        protocol: grpc
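        # OTLP batch span processor tuning below: exports run on a 10s schedule, with up to
        # 300 concurrent exports of up to 3000 spans each and a queue of up to 2,000,000 spans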
        batch_processor:
          scheduled_delay: 10s
          max_concurrent_exports: 300
          max_export_batch_size: 3000
          max_export_timeout: 30s
          max_queue_size: 2000000
      propagation:
        datadog: true
    logging:
      common:
        service_name: "gql-router"
      stdout:
        enabled: true
        format:
          # text:
          #   ansi_escape_codes: false
          #   display_filename: false
          #   display_level: true
          #   display_line_number: false
          #   display_target: false
          #   display_thread_id: false
          #   display_thread_name: false
          #   display_timestamp: true
          #   display_resource: true
          #   display_span_list: false
          #   display_current_span: false
          #   display_service_name: true
          #   display_service_namespace: true
          json:
            display_filename: false
            display_level: true
            display_line_number: false
            display_target: false
            display_thread_id: false
            display_thread_name: false
            display_timestamp: true
            display_current_span: true
            display_span_list: false
            display_resource: true
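      # Log request headers and body only for requests carrying the header debug: true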
      experimental_when_header:
        - name: debug
          value: 'true'
          headers: true # default: false
          body: true
  3. Hit a load of 130k req/min and then 140k req/min (numbers might vary for your application)
  4. Look at query planning latency, memory and CPU usage (see the metrics sketch below)
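
For step 4, the query planning numbers above came from our tracing; metrics can also be scraped directly from the router. The snippet below is a minimal sketch assuming the v1.35 telemetry.exporters.metrics layout; the listen address and path are illustrative, not our exact settings.

telemetry:
  exporters:
    metrics:
      prometheus:
        enabled: true
        listen: 0.0.0.0:9090
        path: /metrics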

Expected behavior
Query planning latency should not spike like this when load is sustained over a period of time.

Desktop (please complete the following information):

  • AWS EC2 instance
  • 16 GB RAM
  • 8 vCPUs

For more info → Link to GitHub issue