Intermittent slow queries - how to debug?

We’re facing some challenging performance issues, with certain queries taking an extremely long time to respond, often in excess of 20s and frequently timing out at 60s. These extreme timings cannot be reproduced in local dev environments, even when using load-testing tools like graphql-bench and autocannon. The same queries can also complete in ~500ms.

We have a federated graph setup with an instance of Apollo Gateway (running ~0.24) and 2 instances of Apollo Server (running 3.x) providing subgraphs. All 3 use NestJS 7 on top of Apollo Server + express. Our resolvers use TypeORM to interface with a reasonably large and complex MSSQL database. The database is hugely over-provisioned and resource monitors never get over 10% RAM & CPU.

The API servers are hosted on AWS ECS and are generously resourced. Like the DB server, they average out at 10-20% CPU and RAM usage, although the CPU does spike to >100% sometimes.

Tracing in Apollo Studio appears to show some high-level resolvers “hanging” or “waiting” for long periods of time, e.g. 27s of a 30s response.

We do have some N+1 issues and we’ve made limited use of DataLoader in places to resolve them, although we can definitely apply it more widely. I’m not convinced the N+1 issues are responsible for the extreme timings though, given the wild range and inconsistency of those timings.
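The batching pattern we’ve applied so far looks roughly like this (the `Author` entity and loader wiring are placeholders rather than anything from our actual schema):

```ts
import DataLoader from 'dataloader';
import { In, Repository } from 'typeorm';

// Placeholder entity purely for illustration.
interface Author {
  id: number;
  name: string;
}

// Create one loader per request so nothing is cached across users.
export function createAuthorLoader(repo: Repository<Author>) {
  return new DataLoader<number, Author | undefined>(async (ids) => {
    // One query for the whole batch instead of one query per resolved field.
    const rows = await repo.find({ where: { id: In([...ids]) } });
    const byId = new Map(rows.map((a) => [a.id, a]));
    // DataLoader expects results in the same order as the incoming keys.
    return ids.map((id) => byId.get(id));
  });
}

// In a field resolver: return context.authorLoader.load(book.authorId);
```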

Does this scenario sound familiar to anyone? Does anyone have any ideas or pointers on how we might go about tracking down the root issue and debugging it? We’re a bit stumped tbh, especially as the issues seem to be impossible to repro locally.


If your app hangs on resolvers, the underlying database query is the most likely culprit.

This might be a stupid and obvious question from me but:

  1. Did you try using EXPLAIN (or MSSQL’s equivalent execution plan) to see whether the query is hitting indexes?
  2. Did you use a copy of the production database on local/dev to do your tests?

The first and obvious culprit in those cases is often a lack of indexes or badly written queries. And you will often not notice it on a local database because the data set is very small.
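If it’s TypeORM, one way to check is to pull out the exact SQL it generates and run that against a production-sized database with the actual execution plan enabled. A rough sketch, with a placeholder `Book` shape (use the repository and query from your actual slow resolver):

```ts
import { Repository } from 'typeorm';

// Placeholder shape; pass in the real repository used by the slow resolver.
interface Book {
  id: number;
  authorId: number;
}

// Dump the SQL and parameters TypeORM generates for a suspect query so they can
// be pasted into SSMS with "Include Actual Execution Plan" turned on.
export function dumpGeneratedSql(bookRepo: Repository<Book>) {
  const [sql, params] = bookRepo
    .createQueryBuilder('book')
    .where('book.authorId = :authorId', { authorId: 42 })
    .getQueryAndParameters();
  console.log(sql, params);
}
```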

Now if you test with a copy of production and it works fine, switch the local app (if you can) over to the production database and see if it still works fine. If it does, that means the database and queries are fine and the problem is somewhere in the network.

Are you using some kind of ORM in your resolvers? If yes, try to see if it has any logging capabilities. Some ORMs can log your queries together with execution times.
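For example, TypeORM can both print the generated SQL and warn about anything slow via the `logging` and `maxQueryExecutionTime` options. Roughly like this (connection details are placeholders; on older TypeORM versions the same options go into `createConnection`):

```ts
import { DataSource } from 'typeorm';

export const dataSource = new DataSource({
  type: 'mssql',
  host: 'db.example.internal',   // placeholder connection details
  username: 'app',
  password: 'secret',
  database: 'app',
  entities: [/* your entities */],
  logging: ['query', 'error', 'warn'], // print every generated SQL statement
  maxQueryExecutionTime: 1000,         // log any query slower than 1s as a warning
});
```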

This sounds very familiar to me. We have a very similar setup, and in Apollo Studio we’re seeing server wait times that look like there’s a disconnect between the Gateway and the subgraph. Our MSSQL DB does have occasional issues when we hit ~65k rpm, but that’s not the issue at hand. Our issues seem to occur all the time, and the long tail of slow, hung, and timed-out responses grows proportionally with the load placed on our GQL servers. We are in the middle of building some load tests so that we can attempt to replicate this in our non-production environments.

Our setup is Gateway 0.51, Apollo Server 3.10, Sequelize (tedious), and express. We leverage node clusters on 4-core EC2 boxes (Elastic Beanstalk), and our DB server is seriously over-provisioned and rarely sees 50% utilization.

Anyway, we have not had much luck finding the root cause of these occasional, unusually long delays in what are otherwise <500ms requests. It seems systemic and constrained to Apollo Server or Gateway (we cannot identify which).
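Something we might try to narrow that down is attaching a small timing plugin to both the Gateway’s ApolloServer and each subgraph, then comparing what they report for the same slow operation. A rough sketch (the threshold and console.warn are arbitrary choices):

```ts
import { ApolloServerPlugin } from 'apollo-server-plugin-base';

// Logs any operation that takes longer than the threshold. Attach it to both
// the Gateway and the subgraphs; if the Gateway reports 27s while the subgraph
// reports 500ms for the same operation, the time is being lost in between.
export const slowOperationLogger = (thresholdMs = 1000): ApolloServerPlugin => ({
  async requestDidStart() {
    const start = Date.now();
    return {
      async willSendResponse({ request }) {
        const elapsed = Date.now() - start;
        if (elapsed > thresholdMs) {
          console.warn(
            `Slow GraphQL operation ${request.operationName ?? '(anonymous)'}: ${elapsed}ms`,
          );
        }
      },
    };
  },
});

// Usage: new ApolloServer({ schema, plugins: [slowOperationLogger()] })
```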

We eventually solved our primary issue, which turned out to be a misconfiguration of TypeORM. We had some typos that weren’t being picked up by TypeScript because the “extra” portion of the TypeORM config is passed directly to the underlying DB driver and is typed as “any”.

Specifically, it was the “pool.max” setting that was killing us: it was defaulting to 10 and capping the number of available DB connections. We believe we have a secondary issue with DB connections not being closed in a timely manner, but once we upped “pool.max” to a much larger number, performance improved dramatically.
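For anyone who hits the same thing, the relevant part of the config ends up looking roughly like this (the connection details and numbers are illustrative, not recommendations):

```ts
import { DataSource } from 'typeorm';

export const dataSource = new DataSource({
  type: 'mssql',
  host: 'db.example.internal',   // placeholder connection details
  username: 'app',
  password: 'secret',
  database: 'app',
  entities: [/* your entities */],
  // "extra" is handed straight to the mssql/tedious driver and is typed as `any`,
  // so typos in here compile silently, which is exactly how we got burned.
  extra: {
    pool: {
      max: 100,                 // driver default is 10, which throttled our concurrent queries
      min: 0,
      idleTimeoutMillis: 30000, // release idle connections reasonably promptly
    },
  },
});
```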

We do still have a number of slow-running queries, but they’re now consistently slow and have all proved to be design problems and N+1 issues, which we are slowly working through.

@dlredden - We are also seeing behavior similar to what you’re describing. Were you ever able to track down the root cause?