[General/Help] Managed federation overview

Curiosity question regarding managed federation.

In the docs:

"The gateway also knows that your subgraphs are prepared to handle operations against the updated set of schemas. This is because your subgraphs should register their updated schemas as part of their deployment, meaning they’re definitely running by the time the gateway is aware of the configuration change."

This makes sense. From the diagram here, this is how I understand the sequence of events:

  1. Subgraph X re-deploys.
  2. Subgraph X registers and updates schema.
  3. Apollo Schema Registry picks up change.
  4. Apollo Schema Registry updates configuration.
  5. Polling Gateway picks up change in configuration.

Here’s my question:

What happens at the Gateway in the time between steps 2 and 3 when Subgraph X ships a breaking, backwards-incompatible change to its service-layer business logic?

An example of a (potential?) high-latency race condition:

I remove a column from the database. Migrations run with the deployment. That field gets removed from the Subgraph X schema. If a query for the removed field gets through the Gateway (before the new schema is picked up in step 5), the Subgraph X service should already be up and running on the new version. Would the query then fail at the service/resolver layer?

As I understand it (and please correct me if I'm wrong!), the way around this would be for the registry to push to the Gateway via Pub/Sub once the new Subgraph X schema is registered, rather than having the Gateway poll the registry. Is this wrong?

…and in a moment of self-awareness: this example is contrived, and I understand that the point of Managed Federation is to solve the kind of situation that happens on every manual deployment without a manager. I just want to understand whether all of the risk is gone (or just most).

Hello, good question! You are correct that in the managed federation scenario you describe, there is a period of time when the gateway thinks Subgraph X supports a field that it no longer does. During that time, whenever the gateway sends Subgraph X a query that includes that field, Subgraph X will respond with an error. So yes, there is still some temporary risk to deploying Subgraph X like this.
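
To make that window concrete, here is a minimal Python sketch of the scenario. It is a toy model, not Apollo code: the `Subgraph` and `Gateway` classes, the `legacyCode` field, and the error strings are all hypothetical, and each service's view of the schema is reduced to a plain set of field names.

```python
from dataclasses import dataclass


@dataclass
class Subgraph:
    """Toy stand-in for Subgraph X: it can only resolve the fields it currently has."""
    fields: set

    def execute(self, requested: set) -> dict:
        unknown = sorted(requested - self.fields)
        if unknown:
            return {"errors": [f"subgraph: cannot query field '{name}'" for name in unknown]}
        return {"data": {name: "value" for name in sorted(requested)}}


@dataclass
class Gateway:
    """Toy stand-in for the gateway: it validates against the last config it polled."""
    known_fields: set

    def query(self, subgraph: Subgraph, requested: set) -> dict:
        unknown = sorted(requested - self.known_fields)
        if unknown:
            return {"errors": [f"gateway: unknown field '{name}'" for name in unknown]}
        # The gateway believes the subgraph supports every requested field.
        return subgraph.execute(requested)


subgraph_x = Subgraph(fields={"id", "name", "legacyCode"})
gateway = Gateway(known_fields={"id", "name", "legacyCode"})

# Steps 1-2: Subgraph X redeploys without the field and registers its new schema.
subgraph_x.fields.discard("legacyCode")

# Before step 5: the gateway still holds the old configuration, so it forwards
# the query and the subgraph is the one that rejects it.
during_window = gateway.query(subgraph_x, {"id", "legacyCode"})

# Step 5: the gateway polls and picks up the new configuration.
gateway.known_fields = set(subgraph_x.fields)
after_poll = gateway.query(subgraph_x, {"id", "legacyCode"})

print(during_window)  # error raised by the subgraph
print(after_poll)     # error raised by the gateway itself
```

In this model a client asking for the removed field gets an error either way; what changes after the poll is only where the error originates.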

There are other mechanisms available to help reduce these scenarios, however!

  • Firstly, if you’re removing a field in a subgraph, that removal will affect any clients that query that field, regardless of whether the gateway picks up the change immediately. Apollo Studio’s schema checks feature helps you identify when breaking changes like field removals are safe by checking the removal against your operation history to see if any clients would break if the field went away. Ideally, few to zero clients are still using a field you’re actively removing.
  • To prevent the exact scenario you describe, you could register your updated subgraph schema before deploying a version of Subgraph X that can’t resolve the field. While the gateway is waiting for its new configuration, it might still send queries with the removed field to Subgraph X, but the subgraph still knows how to resolve that field. Then when the gateway does get its new configuration, you can deploy an update to Subgraph X that actually removes support for the field.
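
The ordering in the second bullet can be sketched as a small state check. This is only an illustration under assumed names: the step labels and the `risky_steps` helper are hypothetical and do not correspond to any Apollo CLI command or API. It reports the steps after which the gateway would still route the field to a subgraph that can no longer resolve it.

```python
def risky_steps(order: list) -> list:
    """Return the steps after which forwarded queries for the field would fail."""
    registered = False              # has the registry received the new schema?
    gateway_routes_field = True     # does the gateway still send the field to Subgraph X?
    subgraph_resolves_field = True  # can Subgraph X still resolve the field?
    risky = []
    for step in order:
        if step == "register_schema":
            registered = True
        elif step == "gateway_picks_up_config":
            # The gateway can only stop routing the field once the new schema is registered.
            if registered:
                gateway_routes_field = False
        elif step == "deploy_removal":
            subgraph_resolves_field = False
        else:
            raise ValueError(f"unknown step: {step}")
        if gateway_routes_field and not subgraph_resolves_field:
            risky.append(step)
    return risky


deploy_first = ["deploy_removal", "register_schema", "gateway_picks_up_config"]
register_first = ["register_schema", "gateway_picks_up_config", "deploy_removal"]

print(risky_steps(deploy_first))    # ['deploy_removal', 'register_schema']
print(risky_steps(register_first))  # []
```

Registering first leaves no step where the gateway and the subgraph disagree about the field, at the cost of splitting the change into two deployments.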

Thanks for the great response, Stephen. Appreciate the comprehensive tertiary details as well. :fist_right:
