Hello folks,
We are in the process of upgrading our Apollo Router from v1.11.0 to v1.35.0. We have also added tracing via OTLP in our router.yaml, using the gRPC protocol to send traces to the Datadog Agent, which is configured with a trace_buffer of 3000.
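For reference, the agent side looks roughly like the sketch below. This is a trimmed datadog.yaml from memory rather than our exact production file; the OTLP gRPC receiver and the trace_buffer setting are the parts relevant to this report.

```yaml
# datadog.yaml (Datadog Agent) -- trimmed sketch, not the exact production file
apm_config:
  enabled: true
  trace_buffer: 3000              # the buffer value referenced above
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317    # matches the router's OTLP exporter endpoint
```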
We have observed that the application performs smoothly below 130k req/min, but on increasing the load to 140k req/min on a single AWS EC2 instance, the p99 latency for query planning takes a severe hit and the number of errors also shoots up.
Query planning p99 latency:
- 130k req/min → 12.6 ms
- 140k req/min → 7.8 s
These spikes occur with no considerable increase in CPU usage, while memory usage grows considerably under the higher load.
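One detail that may explain why memory (and not CPU) climbs: the trace exporter queue in the config below is very large. As a rough back-of-envelope, assuming on the order of 1 KB per buffered span (an assumption on our part, not a measured number), a full queue of max_queue_size: 2000000 spans would hold roughly 2 GB on its own. A tighter bound would look like the sketch below; the values are illustrative, not something we have validated.

```yaml
# Illustrative batch_processor values only -- not validated by us
telemetry:
  exporters:
    tracing:
      otlp:
        batch_processor:
          scheduled_delay: 5s
          max_concurrent_exports: 4    # we believe the default is 1; 300 seems very high
          max_export_batch_size: 512
          max_export_timeout: 30s
          max_queue_size: 10000        # bounds worst-case buffered spans (~10 MB at 1 KB/span)
```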
Steps to reproduce the behavior:
- Apollo Router v1.35.0
- router.yaml config
```yaml
telemetry:
  instrumentation:
    spans:
      default_attribute_requirement_level: required
  exporters:
    tracing:
      common:
        resource:
          "service.name": "gql-router"
      otlp:
        enabled: true
        endpoint: 0.0.0.0:4317
        protocol: grpc
        batch_processor:
          scheduled_delay: 10s
          max_concurrent_exports: 300
          max_export_batch_size: 3000
          max_export_timeout: 30s
          max_queue_size: 2000000
      propagation:
        datadog: true
    logging:
      common:
        service_name: "gql-router"
      stdout:
        enabled: true
        format:
          # text:
          #   ansi_escape_codes: false
          #   display_filename: false
          #   display_level: true
          #   display_line_number: false
          #   display_target: false
          #   display_thread_id: false
          #   display_thread_name: false
          #   display_timestamp: true
          #   display_resource: true
          #   display_span_list: false
          #   display_current_span: false
          #   display_service_name: true
          #   display_service_namespace: true
          json:
            display_filename: false
            display_level: true
            display_line_number: false
            display_target: false
            display_thread_id: false
            display_thread_name: false
            display_timestamp: true
            display_current_span: true
            display_span_list: false
            display_resource: true
      experimental_when_header:
        - name: debug
          value: 'true'
          headers: true # default: false
          body: true
```
- Hit a load of 130k req/min and then 140k req/min (numbers might vary for your application; see the load sketch after this list)
- Look at query planning latency, memory and CPU usage
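Any GraphQL-capable load generator should reproduce the two load levels above; a minimal sketch using Artillery for the 130k req/min phase is below (the endpoint, port, and query are placeholders, not our production values).

```yaml
# load.yaml -- run with: artillery run load.yaml
config:
  target: "http://localhost:4000"     # placeholder router endpoint
  phases:
    - duration: 600                   # hold the rate for 10 minutes
      arrivalRate: 2167               # ~130k req/min
scenarios:
  - flow:
      - post:
          url: "/"
          json:
            query: "query { __typename }"   # placeholder query
```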
Expected behavior
Query planning latency should remain stable over time and should not spike when load increases from 130k to 140k req/min.
Desktop (please complete the following information):
- AWS EC2 instance
- 16 GB RAM
- 8 vCPUs
For more info, you can follow this link: Link to GitHub issue