feat(v17): Implement tracing channels#4670
Conversation
|
@logaretm is attempting to deploy a commit to the The GraphQL Foundation Team on Vercel. A member of the Team first needs to authorize it. |
|
|
There was a problem hiding this comment.
Pull request overview
This PR introduces opt-in diagnostics tracing channels for GraphQL core operations while keeping the codebase runtime-agnostic (no direct node:diagnostics_channel dependency) by requiring consumers to register a compatible module via enableDiagnosticsChannel().
Changes:
- Added
src/diagnostics.tswith minimal duck-typed interfaces andmaybeTrace*helpers, plus public exports fromsrc/index.ts. - Wrapped
parse,validate,execute/subscribe, and field resolution to emitgraphql:*tracing channel lifecycles with lazy context fields where appropriate. - Added unit tests (via a fake diagnostics channel) and a Node integration test project to validate real
node:diagnostics_channelbehavior.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/diagnostics.ts | New core diagnostics/tracing channel registration + trace helper implementations. |
| src/index.ts | Exposes enableDiagnosticsChannel and related public types from the package entrypoint. |
| src/language/parser.ts | Wraps parse() with tracing emission. |
| src/validation/validate.ts | Wraps validate() with tracing emission. |
| src/execution/execute.ts | Wraps execute()/subscribe() and related execution entrypoints with tracing + lazy operation metadata. |
| src/execution/Executor.ts | Wraps field resolver invocation to emit graphql:resolve lifecycle events with lazy fieldPath. |
| src/testUtils/fakeDiagnosticsChannel.ts | Adds fake tracing channel implementation to support deterministic unit tests. |
| src/tests/diagnostics-test.ts | Verifies registration behavior and channel identity semantics. |
| src/language/tests/parser-diagnostics-test.ts | Validates graphql:parse lifecycle emission. |
| src/validation/tests/validate-diagnostics-test.ts | Validates graphql:validate lifecycle emission including error paths. |
| src/execution/tests/execute-diagnostics-test.ts | Validates graphql:execute lifecycle emission for sync/async/error paths. |
| src/execution/tests/subscribe-diagnostics-test.ts | Validates graphql:subscribe emission for sync vs async subscription setup. |
| src/execution/tests/resolve-diagnostics-test.ts | Validates graphql:resolve emission for sync/async/error and lazy fieldPath. |
| resources/integration-test.ts | Adds the new diagnostics integration test project to the integration suite. |
| integrationTests/diagnostics/test.js | Real Node integration tests for channel lifecycles and ALS propagation. |
| integrationTests/diagnostics/package.json | Adds diagnostics integration test project metadata and node engine constraint. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
5333b30 to
faf327b
Compare
Please excuse my limited understanding of the current tracing space, but is there a concept of nesting of channels? In theory, I would expect we would want to: (1) nest the "execute" channel for subscription events under "subscribe" and Naively, I would assume this would be very useful information to understanding the hierarchy. And I would think this would provide a way to pass the overarching context of the subscribe option to execute (i.e. execute would always have the document, as in your description of the channels/context within the linked issue #4629)? And similarly, we could also have the overall execute (or subscribe/execute) passed to resolve? |
|
@yaacovCR Happy to elaborate more!
From a purely diag/tracing channels API perspective, there's no parent/child relationship in the API. Each channel is independent. That said, the hierarchy you're describing does exist at runtime, it's just reconstructed on the subscriber side rather than on the emitter side. The mechanism is done through You can check the image I had in the PR description that represents how it looks like, or check this trace view as you can see the spans are nested under one another. This works because tracing channels propagate the async execution context (unlike plain event emitters). Subscribers bind an It's why I kept the runStores wrapper in the types and impl even after dropping Now, for graphql as the emitter, there are a couple of reasons not to encode the hierarchy ourselves:
The good news is subscribers will have access to Here is a quick example snippet: import { AsyncLocalStorage } from 'node:async_hooks';
import dc from 'node:diagnostics_channel';
import { enableDiagnosticsChannel } from 'graphql';
enableDiagnosticsChannel(dc);
const executeCtx = new AsyncLocalStorage();
const executeChannel = dc.tracingChannel('graphql:execute');
const resolveChannel = dc.tracingChannel('graphql:resolve');
// Bind on `start` only, the store activates when execute's runStores frame
// opens and stays active through the entire lifecycle. `end`, `asyncStart`,
// `asyncEnd`, and `error` all fire inside that frame (sync handlers directly,
// async ones via async_hooks propagation), so they inherit the same store
// without any extra wiring.
executeChannel.start.bindStore(executeCtx, ctx => ({
document: ctx.document,
operationName: ctx.operationName,
startedAt: Date.now(),
}));
// Subscribe to the full execute lifecycle. Every handler sees the same store.
executeChannel.subscribe({
start: () => {
const { operationName } = executeCtx.getStore();
console.log(`[execute:start] ${operationName}`);
},
end: () => {
const { operationName, startedAt } = executeCtx.getStore();
console.log(`[execute:end] ${operationName} (+${Date.now() - startedAt}ms sync)`);
},
asyncStart: () => {
const { operationName } = executeCtx.getStore();
console.log(`[execute:asyncStart] ${operationName}`);
},
asyncEnd: () => {
const { operationName, startedAt } = executeCtx.getStore();
console.log(`[execute:asyncEnd] ${operationName} (+${Date.now() - startedAt}ms total)`);
},
error: ({ error }) => {
const { operationName } = executeCtx.getStore();
console.log(`[execute:error] ${operationName} -> ${error.message}`);
},
});
// Nested channels inherit the parent's store too, because resolve fires
// beneath execute's runStores frame.
resolveChannel.subscribe({
start: resolveCtx => {
const parent = executeCtx.getStore();
console.log(` [resolve:start] ${resolveCtx.fieldPath} (under ${parent.operationName})`);
},
asyncEnd: resolveCtx => {
const parent = executeCtx.getStore();
console.log(` [resolve:asyncEnd] ${resolveCtx.fieldPath} (under ${parent.operationName})`);
},
error: ({ error }, name) => {
const parent = executeCtx.getStore();
console.log(` [resolve:error] under ${parent.operationName}: ${error.message}`);
},
}); |
This was hugely helpful. On this background, I am wondering whether instead of:
Maybe something like the below would be better for our official tracing channels:
The idea would be that The downside is you can no longer top-level distinguish between subscriptions and queries/mutations. Do we want to preserve that? Do we want to top-level distinguish between each of our operations? What about between queries and mutations? Again, I am asking these questions with very little knowledge of how the tracing system is used in practice, hopefully you can guide me here? Looping in @graphql/tsc if they have any thoughts. |
We could do
The distinction can be done with the args already sent to the execute channel via the I initially added a What do u think? |
Naming = so hard. Definitely like shorter. But I do prefer consistency with the other verbs we have for our channels. We could consider:
I am leaning toward the last option, again looping in better namers @graphql/tsc @graphql/tsc-emeriti |
|
@yaacovCR I will re-work the PR to use the autoloading snippet and I will dig into the benchmarks and see what I can do, ideally it should offer close to zero slowdown so I must be doing something wrong. |
Although maybe we should do instead: is there a convention in the diagnostic channels world? |
|
@yaacovCR Not a strict convention, but this is what is currently recommended. For the names, I think we can keep the inner one called
Naming is hard 😅 |
|
Not necessarily related, but possibly, what would a trace look like for incremental delivery? Options abound because of the disconnect between execution groups and delivery groups. |
a001dd3 to
03c9fea
Compare
|
@yaacovCR So I pushed a couple of optimizations, I think it is down significantly after running the benchmark u gave me. It was at 6% overhead at the time of your posting on my machine and now its down to 2.3%-3%. Can u have a go and see if that's acceptable? I have more ideas but will require juggling alternate Executor class implementations.
Right now they aren't handled and any deferred executions will be orphaned (up to the APM how to deal with that) since they will still fire the So it would look like this: I think we can keep the current PR scoped to this at the moment, and I can try a follow up if this one goes through. I'm thinking we can add another The group events look like extra spans, but their real job is to keep an ALS frame alive while the deferred resolvers run, so anything the APM bound stays readable. I'll verify end-to-end in the follow-up |
183bb06 to
bbddc0c
Compare
|
Squeezed a bit of performance with the last couple of commits 🙌 for the channel names happy to pick whatever works for us here to make this ready to go. |
db814a5 to
3d1de2a
Compare
Qard
left a comment
There was a problem hiding this comment.
I'm the creator of diagnostics_channel, for those that don't know me. 👋🏻
Generally, LGTM. I'd like to know more about the exact timing on resolve tracing. Can those be effectively linked together via channel.bindStore(...) triggering nested resolvers within the scope of their parent?
|
@Qard ATM nested resolvers are not linked together as children, they instead execute and appear as siblings. However they all appear as a child of the
I realize this isn't the ideal outcome, I will experiment with expanding the runStores callback to include resolve child recursion and see how it fares in benchmarks. |
|
Ideally we'd want that causality. Might be sufficient to pass the parent resolver info into the child events as that is past into the resolvers themselves. 🤔 |
|
@Qard I widened the runStores frame to include the nested resolvers, and got something looking like this
Perf wise, it has the same perf with no subs as the current state, all near 0% diff so it didn't introduce cold paths. Now comparing it against the current impl (no nesting) with subs attached, the numbers are very similar.
So only one case has seen a slow down which I think would come to a wash once an APM does somethings anyways. So we basically get it for free here, WDYT? worth including or maybe let the subscriber stitch it together? (they can with the path information in each call). |
…aceSync
The traced branch of `executeIgnoringIncremental` was not exercised by
any subscriber-attached test, tripping the 100% coverage threshold on
CI. Add a case in execute-diagnostics-test that subscribes to
`graphql:execute` and asserts start/end around a call.
Also drop the `traceSync` helper from diagnostics.ts; every call site
invokes `channel.traceSync(...)` directly, so the export was dead code.
Mark the `require('node:diagnostics_channel')` CJS fallback and the
enclosing `catch` block with c8 ignore comments — they are only
reachable on Node 20.0-20.15 (pre-`getBuiltinModule`) and on runtimes
without `diagnostics_channel` at all, neither of which the unit tests
run on.
Hoist the resolve tracing decision once per `executeFields` / `executeFieldsSerially` call instead of re-checking `shouldTrace(resolveChannel)` per field, and move the traced branch's `traceMixed` call + ctx closure into a module-scope helper `invokeResolverWithTracing`. The latter is the load-bearing change: once `executeField`'s body referenced `traceMixed` and the `() => resolveFn(...)` closure, V8 stopped inlining `executeField` into `executeFields`, paying a real call frame per field. Extracting the two-function tail keeps `executeField`'s bytecode small enough to inline again. Effect on the introspection benchmark (the worst-case fan-out shape, ~1000 resolver calls per iteration, no subscribers attached): - with the inline ternary: -5.94% vs upstream - with this refactor: -2.05% / -2.30% across two paired runs Other benchmarks (list-sync, list-async, list-asyncIterable, object-async) sit at the noise floor with 95% CIs that cross zero. Adds a serial-mutation case to resolve-diagnostics-test to cover the `executeFieldsSerially` branch the hoisted snapshot now lives in.
to cover custom perEventExecutors
also combine execute/subscribe
…e the original passed values, not the internal type
now that the files are longer... :)
Bun's node:diagnostics_channel ships TracingChannel but does not expose the aggregate hasSubscribers getter, so a subscriber added via tracingChannel.subscribe(handlers) is invisible to the previous ?.hasSubscribers gate and graphql-js silently drops every event. Introduce a shouldTrace helper that trusts the aggregate when present and falls back to checking each of the five underlying lifecycle channels when it is undefined. Route every emission gate (parse/validate/execute/ subscribe/resolve) through it.
V8 inlines executeField slightly better when the never-taken-on-Node branch is the trailing fallback rather than an early-returning if body. The flipped form cost ~5% on introspection; the original ordering plus c8 ignore start/stop over the Bun-only branch keeps both 100% coverage and the perf characteristics.
50ea29c to
78f3c31
Compare



This PR implements tracing channel support as discussed in #4629
graphql-js auto-resolves
node:diagnostics_channelat module load viaprocess.getBuiltinModule, with arequirefallback for Node 20.0–20.15. On runtimes wherenode:diagnostics_channelis unavailable (e.g. browsers) the load silently no-ops and emission sites short-circuit. APMs subscribe purely on the subscriber side — no interaction with the graphql API required.Most of the changes are tests; the actual source changes are mostly tame and introduce wrappers around pre-existing code paths.
API
graphql-js vendors a minimal structural subset of the tracing channel types for internal use. Ideally these wouldn't include deep specifics like
runStores, but that surface is required to maintain the sync-or-async behavior we have today. The alternative is to normalize to async for all calls which isn't ideal and probably would have a perf implication.The implemented channels are:
graphql:parsesourcegraphql:validateschema,documentgraphql:executeschema,document,variableValues,operationName(lazy),operationType(lazy)graphql:executeschema,operation,variableValues,operationName,operationTypegraphql:subscribeschema,document,variableValues,operationName(lazy),operationType(lazy)graphql:resolvefieldName,parentType,fieldType,args,isTrivialResolver,fieldPath(lazy)Usage
APMs and users subscribe by channel name on their own
node:diagnostics_channelimport — no graphql API call needed:I tested this locally in a dashboard and it produces waterfall structure like so with basic Sentry integration:
You can view that trace here.
closes #4629