fix(eval): wire App.plugins / context-cache / resumability through adk eval by saifer82 · Pull Request #5534 · google/adk-python

saifer82 · 2026-04-28T20:55:18Z

Link to Issue or Description of Change

Closes: #5503 (cli_eval bypasses App.plugins, breaking observability plugins (e.g., BigQueryAgentAnalyticsPlugin) during eval runs)

The maintainer confirmed reproduction and asked for a PR:

We have reproduced the issue and observed the same behavior. Since you already have a solution in mind, please feel free to go ahead and raise a PR. Our team will be happy to review it.

Problem:
cli_eval resolves agents via agent_module.agent.root_agent, which drops the wrapping App and therefore its plugins, context_cache_config, and resumability_config. As a result, when a project wraps its root agent in App(root_agent=..., plugins=[...]), plugin lifecycle hooks (on_event_callback, etc.) fire during adk web / adk run but are silently skipped during adk eval. Observability plugins like BigQueryAgentAnalyticsPlugin produce no telemetry rows for eval runs — exactly the workload where per-case latency / token / trajectory data is most useful.

Solution:
Resolve the App (when present) at the eval CLI entrypoint and plumb it through LocalEvalService to EvaluationGenerator._generate_inferences_from_root_agent, where the eval Runner is built. When an App is in play, the Runner is constructed from a copy of the App with the two internal eval plugins (_RequestIntercepterPlugin, EnsureRetryOptionsPlugin) merged into app.plugins. The user's App instance is never mutated. When no App is present the legacy bare-agent path is preserved.

This also incidentally fixes the parallel gaps with App.context_cache_config and App.resumability_config, which were dropped by the same bypass.
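The copy-and-merge step described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `App` dataclass is a stand-in that models only the fields named in this description, `build_eval_app` is a hypothetical helper name, the plugin values are placeholder strings, and appending the eval plugins after the user's plugins is an assumed ordering.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Optional


@dataclass
class App:
    """Stand-in for the ADK App (assumption: only fields named in this PR)."""
    name: str
    root_agent: Any
    plugins: list = field(default_factory=list)
    context_cache_config: Optional[Any] = None
    resumability_config: Optional[Any] = None


def build_eval_app(app: App, eval_plugins: list, root_agent: Any = None) -> App:
    """Hypothetical helper: return a copy of `app` with eval plugins merged in."""
    return replace(
        app,
        # An explicit root_agent override wins (the sub-agent eval scenario).
        root_agent=root_agent if root_agent is not None else app.root_agent,
        # Build a new list so the user's app.plugins is never mutated.
        plugins=[*app.plugins, *eval_plugins],
    )


user_app = App(
    name="demo",
    root_agent="root",
    plugins=["bq_analytics"],
    context_cache_config={"ttl_seconds": 600},
)
eval_app = build_eval_app(user_app, ["request_intercepter", "ensure_retry_options"])
```

Because the copy carries every other field over unchanged, `context_cache_config` and `resumability_config` reach the Runner without any extra plumbing, which is why the same change closes those two gaps.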

The four commits are sequenced for review readability:

fix(cli_eval): add get_app_or_root_agent resolver — new helper + back-compat shim for get_root_agent.
fix(evaluation): forward App through to the eval Runner — _generate_inferences_from_root_agent accepts app= and merges plugins; _process_query resolves the App for the public generate_responses entry point.
fix(eval): plumb App through LocalEvalService to fix App.plugins bypass — LocalEvalService.__init__ accepts app=; cli_tools_click.cli_eval uses the new resolver and passes app through.
test(cli_tools_click): mock get_app_or_root_agent in eval CLI tests — fixture update for the renamed resolver.

Testing Plan

Unit Tests:

I have added or updated unit tests for my change.
All unit tests pass locally.

$ uv run pytest tests/unittests/cli/ tests/unittests/evaluation/ tests/unittests/test_runners.py tests/unittests/apps/ -q
867 passed

New tests (10 cases across 3 files):

tests/unittests/cli/utils/test_cli_eval.py — 4 tests covering get_app_or_root_agent: App present, App absent, app attribute exists but is not an App instance (falls back), and get_root_agent back-compat.
tests/unittests/evaluation/test_evaluation_generator.py — 4 tests covering _generate_inferences_from_root_agent with app=: Runner built with app= (merged plugins), legacy fallback when app=None, user's App not mutated across repeated runs, and root_agent override propagates to merged App copy (sub-agent eval scenario).
tests/unittests/evaluation/test_local_eval_service.py — 2 tests asserting LocalEvalService forwards app (or None) through to _generate_inferences_from_root_agent.

Manual End-to-End (E2E) Tests:

Reproduction setup matches the issue: an agent wrapped in App(...) with BigQueryAgentAnalyticsPlugin registered, evaluating a single case via adk eval.

| Row type | Pre-fix (1.31.1) | Post-fix (this PR) |
|---|---|---|
| `INVOCATION_STARTING` rows | 0 | 1 |
| `LLM_REQUEST/RESPONSE` rows | 0 | 10 |
| `TOOL_STARTING/COMPLETED` rows | 0 | 8 |
| Total rows from one case | 0 | 30 |

Concretely, after running adk eval ./app routing_and_tools.evalset.json:route_sales_total_en against this PR:

+-----------------------+-------------------+---+
|      event_type       |       agent       | n |
+-----------------------+-------------------+---+
| STATE_DELTA           | root_agent        | 5 |
| LLM_REQUEST           | sales_performance | 4 |
| LLM_RESPONSE          | sales_performance | 4 |
| TOOL_STARTING         | sales_performance | 3 |
| TOOL_COMPLETED        | sales_performance | 3 |
| LLM_RESPONSE          | root_agent        | 1 |
| AGENT_COMPLETED       | sales_performance | 1 |
| USER_MESSAGE_RECEIVED | root_agent        | 1 |
| TOOL_STARTING         | root_agent        | 1 |
| INVOCATION_STARTING   | root_agent        | 1 |
| INVOCATION_COMPLETED  | root_agent        | 1 |
| AGENT_STARTING        | root_agent        | 1 |
| TOOL_COMPLETED        | root_agent        | 1 |
| LLM_REQUEST           | root_agent        | 1 |
| AGENT_STARTING        | sales_performance | 1 |
| AGENT_COMPLETED       | root_agent        | 1 |
+-----------------------+-------------------+---+
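As a cross-check, the per-event counts in the query output above can be summed and compared with the post-fix column of the summary table. The snippet below only re-adds the numbers already shown; the variable names are illustrative.

```python
# (event_type, agent, n) rows copied from the query output above.
rows = [
    ("STATE_DELTA", "root_agent", 5),
    ("LLM_REQUEST", "sales_performance", 4),
    ("LLM_RESPONSE", "sales_performance", 4),
    ("TOOL_STARTING", "sales_performance", 3),
    ("TOOL_COMPLETED", "sales_performance", 3),
    ("LLM_RESPONSE", "root_agent", 1),
    ("AGENT_COMPLETED", "sales_performance", 1),
    ("USER_MESSAGE_RECEIVED", "root_agent", 1),
    ("TOOL_STARTING", "root_agent", 1),
    ("INVOCATION_STARTING", "root_agent", 1),
    ("INVOCATION_COMPLETED", "root_agent", 1),
    ("AGENT_STARTING", "root_agent", 1),
    ("TOOL_COMPLETED", "root_agent", 1),
    ("LLM_REQUEST", "root_agent", 1),
    ("AGENT_STARTING", "sales_performance", 1),
    ("AGENT_COMPLETED", "root_agent", 1),
]


def count(*event_types: str) -> int:
    """Sum n over all rows whose event_type is one of `event_types`."""
    return sum(n for event, _agent, n in rows if event in event_types)


total_rows = sum(n for _event, _agent, n in rows)    # 30
llm_rows = count("LLM_REQUEST", "LLM_RESPONSE")      # 10
tool_rows = count("TOOL_STARTING", "TOOL_COMPLETED") # 8
invocation_starting_rows = count("INVOCATION_STARTING")  # 1
```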

The plugin captures the full lifecycle (root + sub-agent) and the Batch writer task cancelled log line confirms its teardown ran inside the eval Runner.

Checklist

I have read the CONTRIBUTING.md document.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have added tests that prove my fix is effective or that my feature works.
New and existing unit tests pass locally with my changes.
I have manually tested my changes end-to-end.
Any dependent changes have been merged and published in downstream modules. (N/A — no dependent changes.)

Additional context

Scope deliberately excluded:

cli_optimize (GEPA prompt optimization) — also routes through LocalEvalService but constructs it inside LocalEvalSampler with no app argument. Bringing the optimize path under App-plugin coverage is a small follow-up: thread app into LocalEvalSampler.__init__ and pass it on to LocalEvalService(...). Happy to do it in a separate PR.
adk eval generate (generate_eval_cases) — switched to the new resolver for consistency only. It uses ScenarioGenerator, not a Runner, so plugins don't apply there.
YAML / Visual Builder agents via AgentLoader — out of scope. cli_eval doesn't use AgentLoader today; aligning the two loaders would be a larger refactor and not what this issue asks for.

Open question for reviewers:

The issue raised the possibility of an opt-in flag (adk eval --use-app-plugins) in case the bypass was intentional. This PR makes App-plugins-on-eval the default behavior, on the assumption that a plugin contract of "fires on every event" is what users expect. Happy to gate it behind a flag if you'd prefer the conservative default.

🤖 Generated with Claude Code

google-cla · 2026-04-28T20:55:38Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Eval flows currently access `agent_module.agent.root_agent` directly, which drops the wrapping `App` (and therefore its plugins, context-cache config, and resumability config). Add `get_app_or_root_agent` that returns the `(app, root_agent)` pair, mirroring the resolution order `AgentLoader._load_from_module_or_package` already uses on the web / run paths. Keep `get_root_agent` as a back-compat wrapper. This commit is the resolver and unit tests only; subsequent commits plumb the App through `EvaluationGenerator` and `LocalEvalService` so plugins fire during eval runs.
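A rough sketch of the resolution order this commit message describes, under stated assumptions: `App` here is a stand-in class rather than the ADK one, the `app` attribute name on the agent module is inferred from the test descriptions ("app attribute exists but is not an App instance"), and the bodies are illustrative rather than the PR's implementation.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


class App:
    """Stand-in for the ADK App (assumption: only root_agent/plugins modeled)."""

    def __init__(self, root_agent: Any, plugins: Optional[list] = None):
        self.root_agent = root_agent
        self.plugins = plugins or []


def get_app_or_root_agent(agent_module: Any) -> Tuple[Optional[App], Any]:
    """Return (app, root_agent); app is None when no usable App is exported."""
    agent_ns = agent_module.agent
    candidate = getattr(agent_ns, "app", None)
    if isinstance(candidate, App):
        return candidate, candidate.root_agent
    # No App, or an `app` attribute of some other type: legacy bare-agent path.
    return None, agent_ns.root_agent


def get_root_agent(agent_module: Any) -> Any:
    """Back-compat wrapper that discards the App half of the pair."""
    return get_app_or_root_agent(agent_module)[1]


# Three module shapes: App present, App absent, `app` present but not an App.
wrapped = SimpleNamespace(agent=SimpleNamespace(app=App("root", ["plugin"])))
bare = SimpleNamespace(agent=SimpleNamespace(root_agent="root"))
odd = SimpleNamespace(agent=SimpleNamespace(app="not an App", root_agent="root"))
```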

`_generate_inferences_from_root_agent` now accepts an optional `app` parameter. When provided, the eval Runner is built from a copy of the App with internal eval plugins (`_RequestIntercepterPlugin`, `EnsureRetryOptionsPlugin`) merged into `app.plugins`. The user's App is never mutated, and the App's `context_cache_config` / `resumability_config` ride along automatically. When `app` is None, the legacy bare-agent path is preserved. `_process_query` (used by the public `generate_responses` entry point) now resolves `agent.app` first and forwards it to the helper, so projects that wrap their root agent in an `App` get plugin coverage during eval without further changes. The CLI plumbing that hands the App down from `cli_eval` / `LocalEvalService` is in the next commit.

Closes the loop on https://github.com/google/adk-python/issues/<TBD>: when a project wraps its root agent in `App(root_agent=..., plugins=[...])` and runs `adk eval`, the registered plugins (e.g., `BigQueryAgentAnalyticsPlugin`) now fire on every invocation just like they do for `adk web` / `adk run`. The same applies to `App.context_cache_config` and `App.resumability_config`, which now ride along automatically.

Changes:

* `LocalEvalService.__init__` accepts an optional `app` keyword argument and forwards it to `_generate_inferences_from_root_agent` for each eval case.
* `cli_tools_click.cli_eval` resolves the `App` via `get_app_or_root_agent` and passes it to `LocalEvalService`.
* `cli_optimize` (GEPA prompt optimization) also routes through `LocalEvalService` but currently constructs it inside `LocalEvalSampler` with no `app` argument; bringing the optimize path under App-plugin coverage is a separate, narrower follow-up and is intentionally not included here.
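The forwarding described in this commit might look roughly like the sketch below. It is deliberately simplified: `LocalEvalServiceSketch`, `perform_inference`, and the injected `generate_inferences` callable are hypothetical stand-ins, the code is synchronous for brevity, and plain strings take the place of agents, Apps, and eval cases.

```python
from typing import Any, Callable, List, Optional


class LocalEvalServiceSketch:
    """Hypothetical stand-in showing only how `app` is stored and forwarded."""

    def __init__(
        self,
        root_agent: Any,
        *,
        generate_inferences: Callable[..., Any],
        app: Optional[Any] = None,
    ):
        self._root_agent = root_agent
        self._generate_inferences = generate_inferences
        self._app = app  # None keeps the legacy bare-agent path

    def perform_inference(self, eval_cases: List[str]) -> List[Any]:
        # The same app (or None) is handed to the generator for every case.
        return [
            self._generate_inferences(case, root_agent=self._root_agent, app=self._app)
            for case in eval_cases
        ]


calls = []


def fake_generate(case: str, *, root_agent: Any, app: Optional[Any]) -> str:
    """Records what the service forwarded instead of running an agent."""
    calls.append((case, root_agent, app))
    return f"inferences for {case}"


with_app = LocalEvalServiceSketch("root", generate_inferences=fake_generate, app="user_app")
results = with_app.perform_inference(["case_1", "case_2"])

legacy = LocalEvalServiceSketch("root", generate_inferences=fake_generate)
legacy.perform_inference(["case_3"])
```

Making `app` keyword-only with a `None` default is one way to keep existing callers working unchanged, which matches the "optional `app` keyword argument" wording above.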

The eval CLI now resolves agents via `get_app_or_root_agent`. Update the shared `mock_get_root_agent` fixture in test_cli_tools_click.py to patch the new resolver and yield `(None, root_agent)`, matching the non-App path the eval-set-id tests exercise.

adk-bot added the "eval" label ([Component] This issue is related to evaluation) on Apr 28, 2026

saifer82 added 4 commits April 28, 2026 22:23

saifer82 force-pushed the fix/eval-app-plugins branch from 4ca7875 to 456cd98 on April 28, 2026 21:24

saifer82 wants to merge 4 commits into google:main from saifer82:fix/eval-app-plugins
