Intermittent slow queries - how to debug?

We’re facing some challenging performance issues with some queries taking an extremely long time to respond, often in excess of 20s and frequently timing out at 60s. These extreme timings cannot be reproduced in local dev environments, even when using load testing tools like graphql-bench and autocannon. These same queries can also complete in ~500ms.

We have a federated graph setup with an instance of Apollo Gateway (running ~0.24) and 2 instances of Apollo Server (running 3.x) providing subgraphs. All 3 use NestJS 7 on top of Apollo Server + express. Our resolvers use Typeorm to interface with a reasonably large and complex MSSQL database. The database is hugely OP and resource monitors never get over 10% RAM & CPU.

The API servers are hosted on AWS ECS and are generously resources. Like the DB server, the CPU and RAM usage average out at 10-20%, although the CPU does spike to >100% sometimes.

Tracing in Apollo Studio appears to show some high level resolvers “hanging” or “waiting” for long periods of time. ie. 27s out of a 30s response.

We do have some N+1 issues and we’ve made some limited use of Dataloader in places to resolve those issues, although we can definitely implement it more. I’m not convined that the N+1 issues are responsible for the extreme timings though due to the wild range of those timing and the inconsistencies of them.

Does this scenario sound famililar to anyone? Does anyone have any ideas or pointers how we might go about tracking down the root issue and debugging it? We’re a bit stumped tbh, especially as the issues seem to be impossible to repro locally.

If your app hangs on resolvers, the most likely query might be the problem.

This might be a stupid and obvious question from me but:

  1. Did you try to use EXPLAIN() to see if the query is hitting indexes and stuff?
  2. Did you use a copy of the production database on local/dev to do your tests?

Like first and obvious culprit in those cases is often a lack of indexes or badly written queries. And you will often not notice it on the local database because you will have a very small data set.

Now if you test it with a copy of production and it works fine then switch that app (if you can) to the production database. And see if the local app with the production database works fine. If no then that means the database and queries are fine and the problem is somewhere in the network.

Are you using some kind of ORM in your resolvers? If yes, try to see if it has any logging capabilities. Some ORM’s can log your queries with execution time.