Creating a new DataLoader per request spikes database CPU and slows down queries

When we instantiate the DataLoader outside of the context, queries run quickly. However, if we move the instantiation inside the context, our database CPU usage goes up significantly and the queries slow down. The queries we are running are standard and we are following the documented examples. What is the best way to debug something like this?


If you’re using TypeORM, you can simply turn on its query logging to see exactly which SQL statements are being run and how long they take.
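For example, assuming a typical createConnection setup (the connection details here are placeholders, and this runs inside an async bootstrap function), something like this will print every statement and flag slow queries:

import { createConnection } from 'typeorm';

// illustrative bootstrap; only `logging` and `maxQueryExecutionTime` matter here
const connection = await createConnection({
  type: 'postgres',
  url: process.env.DATABASE_URL,
  entities: [/* your entities */],
  logging: ['query', 'error'],  // print every SQL statement plus any errors
  maxQueryExecutionTime: 1000,  // also log queries that take longer than 1000 ms
});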

With GraphQL, to solve the N+1 problem fairly well, I recommend using DataLoader globally on plain (non-mutating) queries as a simple batcher, combined with an in-memory cache for frequent access to the same object, if you’re using _resolveObject or the like.

Normally, a dataloader is something you new up in a resolver, generally used as a simple per-resolver cache. I, however, found that by newing one up ahead of time and letting all my resolvers simply call .load() on it, it can batch very well.

From there, you can batch 1000s of lookups against your DB all at once, and if a different resolver might re-query an entity you just fetched, a short-TTL in-memory cache (or equivalent) can hold the queried results for a few seconds.

This allows a simple pattern such as:

const thingyLoader = new DataLoader(
  async ids => {
    // attempt to get every entity for the given ids
    // you may need to use a utility function to make sure
    // the results come back in the same order as the keys
  },
  {
    // ...other options
    maxBatchSize: 1000
  }
);

const resolvers = {
  Query: {
    thingy: async (parent, args, context, info) => {
      return await thingyLoader.load(args.id); // don't just return the ID, because the entity might not exist
    }
  },
  Thingy: {
    _resolveObject: async (fields) => {
      return await thingyLoader.load(fields.id);
    },
    specialTimeConsumingField: async (parent, args, context, info) => {
      // time-consuming work
      return 'whatever';
    }
  }
};

The _resolveObject resolver is the main loader, but most of the time you still need a root query field such as Query.thingy to kick off resolution; for general lookups anywhere else, a simple object resolver is all you need.

Using a dataloader this way, with your choice of “cache” (such as expiry-map), lets you return the resolved object with no real additional logic, and keeps the object in memory only a second or two longer than necessary, which is preferable to re-batching the lookup against the DB.

I use this general pattern in production, and have used it to fetch >100,000 entities from a database in a single query, and it took ~10s. Pretty performant, if you ask me.

Thank you for the detailed response. Just to clarify: are you suggesting we instantiate DataLoader once for the app instead of creating one per request? The https://github.com/graphql/dataloader README states that instances are typically created per request:

Typically instances are created per request when used within a web-server like express if different users can see different things.

Avoid multiple requests from different users using the DataLoader instance, which could result in cached data incorrectly appearing in each request.
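For reference, the per-request pattern the README describes looks roughly like this (a sketch assuming Apollo Server; createThingyLoader and batchFetchThingies are hypothetical names, and typeDefs/resolvers are defined elsewhere):

import DataLoader from 'dataloader';
import { ApolloServer } from 'apollo-server';

// a fresh loader per request means no cache is ever shared between users
const createThingyLoader = () =>
  new DataLoader(async (ids: readonly string[]) => {
    // batchFetchThingies stands in for your actual batched DB query,
    // returning results in the same order as the ids
    return batchFetchThingies(ids);
  });

const server = new ApolloServer({
  typeDefs,
  resolvers,
  // the context function runs once per request, so each request gets its own loader
  context: () => ({
    loaders: { thingy: createThingyLoader() },
  }),
});

Resolvers then call context.loaders.thingy.load(args.id) rather than a shared module-level loader.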

I just tried your implementation and am having a hard time getting it to work. Is there a sample project you can point me to that uses the strategy you mentioned? The reason we wanted to move to creating a new DataLoader per request was that when one of the columns got updated, the change was not reflected in subsequent queries.

I use the strategy I described for many database entities of the exact same type, referenced simply by their ID; more complex dataloaders would need complex keys with things like query parameters mixed in, which would make lookups very difficult. You could also have many different dataloaders for specific queries, but that sounds more complex than spending the time to build one complex-keyed dataloader.

For a simple lookup by ID with no variable where logic, a dataloader will work fine.

import DataLoader from 'dataloader';
import ExpiryMap from 'expiry-map';
import { Thingy } from '../db/entities';

export const thingyLoader = new DataLoader(
  async (ids: readonly string[]) => {
    const entities = await entityManager.findByIds(Thingy, [...ids], {
      loadRelationIds: {
        relations: ['relationName'],
      },
    });

    // dataloader requires one result per key, in the same order as the keys;
    // you should be able to find a reference to this util in the dataloader issues
    return ensureOrder({
      docs: entities,
      keys: ids,
      prop: 'id',
      nullable: true,
    });
  },
  {
    cacheMap: new ExpiryMap(2000), // i recommend making the TTL a tunable env var per entity type
    batchScheduleFn: callback => setTimeout(callback, 100), // wait this many ms before dispatching a batch
    maxBatchSize: 1000, // cap the batch so a single query doesn't get too large
  }
);
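In case that util is hard to track down, here’s a minimal sketch of what ensureOrder can look like; this is my assumption of its shape based on the call above, not the exact code from the issue thread:

// map each fetched doc back to its key's position; missing docs become
// null (when nullable) or an Error, as the dataloader contract requires
function ensureOrder<T>({ docs, keys, prop, nullable = false }: {
  docs: T[];
  keys: readonly string[];
  prop: keyof T;
  nullable?: boolean;
}): Array<T | Error | null> {
  const byKey = new Map(docs.map(doc => [String(doc[prop]), doc] as [string, T]));
  return keys.map(key =>
    byKey.get(key) ?? (nullable ? null : new Error(`No result for key ${key}`))
  );
}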

That’s about the simplest batching dataloader you can get. I add other layers on top for that extra oomph in speed.

I’ve omitted certain things like connecting to the db and initializing an entity manager.

Avoid multiple requests from different users using the DataLoader instance, which could result in cached data incorrectly appearing in each request.

This is valid advice, but my object resolvers are just that: read-only object resolvers for fetching a base entity from the database. The results will never differ between users’ .load()s, because the loader always runs the exact same SQL query for a given ID.
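On the stale-column issue you mentioned: with a long-lived loader you do have to evict entries yourself when data changes. DataLoader’s built-in clear() is made for this; here’s a sketch, where updateThingy and args.patch are hypothetical stand-ins for your actual mutation:

const resolvers = {
  Mutation: {
    updateThingy: async (parent, args, context, info) => {
      const updated = await entityManager.save(Thingy, {
        id: args.id,
        ...args.patch,
      });
      // evict the stale cache entry so the next .load() hits the DB again;
      // the 2s TTL on the cache map bounds staleness for everything else
      thingyLoader.clear(args.id);
      return updated;
    },
  },
};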