
fix(eval): wire App.plugins / context-cache / resumability through adk eval #5534

Open
saifer82 wants to merge 4 commits into google:main from saifer82:fix/eval-app-plugins

@saifer82

Link to Issue or Description of Change

The maintainer confirmed reproduction and asked for a PR:

We have reproduced the issue and observed the same behavior. Since you already have a solution in mind, please feel free to go ahead and raise a PR. Our team will be happy to review it.

Problem:
cli_eval resolves agents via agent_module.agent.root_agent, which drops the wrapping App and therefore its plugins, context_cache_config, and resumability_config. As a result, when a project wraps its root agent in App(root_agent=..., plugins=[...]), plugin lifecycle hooks (on_event_callback, etc.) fire during adk web / adk run but are silently skipped during adk eval. Observability plugins like BigQueryAgentAnalyticsPlugin produce no telemetry rows for eval runs — exactly the workload where per-case latency / token / trajectory data is most useful.

Solution:
Resolve the App (when present) at the eval CLI entrypoint and plumb it through LocalEvalService to EvaluationGenerator._generate_inferences_from_root_agent, where the eval Runner is built. When an App is in play, the Runner is constructed from a copy of the App with the two internal eval plugins (_RequestIntercepterPlugin, EnsureRetryOptionsPlugin) merged into app.plugins. The user's App instance is never mutated. When no App is present the legacy bare-agent path is preserved.

This also incidentally fixes the parallel gaps with App.context_cache_config and App.resumability_config, which were dropped by the same bypass.
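
As an illustration of the copy-and-merge step described above, here is a minimal sketch. It does not use the real ADK classes: `App` below is a hypothetical stand-in dataclass carrying only the fields named in this PR, `build_eval_app` is an illustrative helper name, and the plugin order (user plugins first, eval plugins appended) is an assumption rather than a statement of what the diff does.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Optional


@dataclass
class App:
    """Hypothetical stand-in for the ADK App; only fields named in this PR."""

    name: str
    root_agent: Any
    plugins: list = field(default_factory=list)
    context_cache_config: Optional[Any] = None
    resumability_config: Optional[Any] = None


def build_eval_app(app: App, eval_plugins: list, root_agent: Any = None) -> App:
    """Return a copy of `app` with eval plugins merged in; `app` is not mutated."""
    return replace(
        app,
        # Optional override, e.g. when evaluating a sub-agent instead of the root.
        root_agent=root_agent if root_agent is not None else app.root_agent,
        # A fresh list, so the user's plugin list is neither shared nor extended.
        plugins=[*app.plugins, *eval_plugins],
    )


user_app = App(
    name="demo",
    root_agent="root",
    plugins=["bigquery_plugin"],
    context_cache_config={"ttl": 60},
)
eval_app = build_eval_app(user_app, ["request_interceptor", "ensure_retry_options"])
sub_agent_app = build_eval_app(user_app, [], root_agent="sub_agent")
```

Because the copy keeps every field it does not explicitly override, `context_cache_config` and `resumability_config` carry over without any extra plumbing, which is the "ride along" behavior this PR relies on.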

The four commits are sequenced for review readability:

  1. fix(cli_eval): add get_app_or_root_agent resolver - new helper + back-compat shim for get_root_agent.
  2. fix(evaluation): forward App through to the eval Runner - _generate_inferences_from_root_agent accepts app= and merges plugins; _process_query resolves the App for the public generate_responses entry point.
  3. fix(eval): plumb App through LocalEvalService to fix App.plugins bypass - LocalEvalService.__init__ accepts app=; cli_tools_click.cli_eval uses the new resolver and passes app through.
  4. test(cli_tools_click): mock get_app_or_root_agent in eval CLI tests - fixture update for the renamed resolver.

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.
$ uv run pytest tests/unittests/cli/ tests/unittests/evaluation/ tests/unittests/test_runners.py tests/unittests/apps/ -q
867 passed

New tests (10 cases across 3 files):

  • tests/unittests/cli/utils/test_cli_eval.py — 4 tests covering get_app_or_root_agent: App present, App absent, app attribute exists but is not an App instance (falls back), and get_root_agent back-compat.
  • tests/unittests/evaluation/test_evaluation_generator.py — 4 tests covering _generate_inferences_from_root_agent with app=: Runner built with app= (merged plugins), legacy fallback when app=None, user's App not mutated across repeated runs, and root_agent override propagates to merged App copy (sub-agent eval scenario).
  • tests/unittests/evaluation/test_local_eval_service.py — 2 tests asserting LocalEvalService forwards app (or None) through to _generate_inferences_from_root_agent.
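
The resolver cases covered by the first test file can be sketched roughly as follows. This is a simplified illustration, not the code in the diff: `App` is a hypothetical stand-in class, the modules are faked with `SimpleNamespace`, and the assumption that the resolver receives an already-imported module (rather than, say, a path) is mine.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


class App:
    """Hypothetical stand-in for the ADK App class."""

    def __init__(self, root_agent: Any, plugins: Optional[list] = None):
        self.root_agent = root_agent
        self.plugins = plugins or []


def get_app_or_root_agent(agent_module: Any) -> Tuple[Optional[App], Any]:
    """Return (app, root_agent); app is None when no usable App is exported."""
    inner = agent_module.agent
    candidate = getattr(inner, "app", None)
    if isinstance(candidate, App):
        return candidate, candidate.root_agent
    # No `app` attribute, or it is not an App instance: fall back to the bare agent.
    return None, inner.root_agent


def get_root_agent(agent_module: Any) -> Any:
    """Back-compat wrapper that returns only the agent."""
    return get_app_or_root_agent(agent_module)[1]


wrapped_app = App(root_agent="root", plugins=["bigquery_plugin"])
with_app = SimpleNamespace(agent=SimpleNamespace(app=wrapped_app, root_agent="root"))
without_app = SimpleNamespace(agent=SimpleNamespace(root_agent="bare_root"))
wrong_type = SimpleNamespace(agent=SimpleNamespace(app="not an App", root_agent="bare_root"))
```

Returning a pair keeps the legacy call sites working: callers that only need the agent go through `get_root_agent`, while the eval entry point can check whether the first element is `None` to pick the legacy path.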

Manual End-to-End (E2E) Tests:

Reproduction setup matches the issue: an agent wrapped in App(...) with BigQueryAgentAnalyticsPlugin registered, evaluating a single case via adk eval.

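+------------------------------+------------------+--------------------+
|                              | Pre-fix (1.31.1) | Post-fix (this PR) |
+------------------------------+------------------+--------------------+
| INVOCATION_STARTING rows     | 0                | 1                  |
| LLM_REQUEST/RESPONSE rows    | 0                | 10                 |
| TOOL_STARTING/COMPLETED rows | 0                | 8                  |
| Total rows from one case     | 0                | 30                 |
+------------------------------+------------------+--------------------+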

Concretely, after running adk eval ./app routing_and_tools.evalset.json:route_sales_total_en against this PR:

+-----------------------+-------------------+---+
|      event_type       |       agent       | n |
+-----------------------+-------------------+---+
| STATE_DELTA           | root_agent        | 5 |
| LLM_REQUEST           | sales_performance | 4 |
| LLM_RESPONSE          | sales_performance | 4 |
| TOOL_STARTING         | sales_performance | 3 |
| TOOL_COMPLETED        | sales_performance | 3 |
| LLM_RESPONSE          | root_agent        | 1 |
| AGENT_COMPLETED       | sales_performance | 1 |
| USER_MESSAGE_RECEIVED | root_agent        | 1 |
| TOOL_STARTING         | root_agent        | 1 |
| INVOCATION_STARTING   | root_agent        | 1 |
| INVOCATION_COMPLETED  | root_agent        | 1 |
| AGENT_STARTING        | root_agent        | 1 |
| TOOL_COMPLETED        | root_agent        | 1 |
| LLM_REQUEST           | root_agent        | 1 |
| AGENT_STARTING        | sales_performance | 1 |
| AGENT_COMPLETED       | root_agent        | 1 |
+-----------------------+-------------------+---+

The plugin captures the full lifecycle (root + sub-agent), and the "Batch writer task cancelled" log line confirms that its teardown ran inside the eval Runner.
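
As a cross-check, the per-agent breakdown above can be summed to confirm that it matches the post-fix column of the summary table. The rows are copied from the breakdown; the aggregation code is only an illustrative sketch, not the query that produced the table.

```python
from collections import Counter

# (event_type, agent, n) rows copied from the per-agent breakdown above.
rows = [
    ("STATE_DELTA", "root_agent", 5),
    ("LLM_REQUEST", "sales_performance", 4),
    ("LLM_RESPONSE", "sales_performance", 4),
    ("TOOL_STARTING", "sales_performance", 3),
    ("TOOL_COMPLETED", "sales_performance", 3),
    ("LLM_RESPONSE", "root_agent", 1),
    ("AGENT_COMPLETED", "sales_performance", 1),
    ("USER_MESSAGE_RECEIVED", "root_agent", 1),
    ("TOOL_STARTING", "root_agent", 1),
    ("INVOCATION_STARTING", "root_agent", 1),
    ("INVOCATION_COMPLETED", "root_agent", 1),
    ("AGENT_STARTING", "root_agent", 1),
    ("TOOL_COMPLETED", "root_agent", 1),
    ("LLM_REQUEST", "root_agent", 1),
    ("AGENT_STARTING", "sales_performance", 1),
    ("AGENT_COMPLETED", "root_agent", 1),
]

# Collapse the per-agent rows into one count per event type.
by_type = Counter()
for event_type, _agent, n in rows:
    by_type[event_type] += n

llm_rows = by_type["LLM_REQUEST"] + by_type["LLM_RESPONSE"]       # 10
tool_rows = by_type["TOOL_STARTING"] + by_type["TOOL_COMPLETED"]  # 8
total_rows = sum(by_type.values())                                # 30
```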

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules. (N/A — no dependent changes.)

Additional context

Scope deliberately excluded:

  • cli_optimize (GEPA prompt optimization) — also routes through LocalEvalService but constructs it inside LocalEvalSampler with no app argument. Bringing the optimize path under App-plugin coverage is a small follow-up: thread app into LocalEvalSampler.__init__ and pass it on to LocalEvalService(...). Happy to do it in a separate PR.
  • adk eval generate (generate_eval_cases) — switched to the new resolver for consistency only. It uses ScenarioGenerator, not a Runner, so plugins don't apply there.
  • YAML / Visual Builder agents via AgentLoader — out of scope. cli_eval doesn't use AgentLoader today; aligning the two loaders would be a larger refactor and not what this issue asks for.

Open question for reviewers:

The issue raised the possibility of an opt-in flag (adk eval --use-app-plugins) in case the bypass was intentional. This PR makes App-plugins-on-eval the default behavior, on the assumption that a plugin contract of "fires on every event" is what users expect. Happy to gate it behind a flag if you'd prefer the conservative default.

🤖 Generated with Claude Code

@google-cla

google-cla Bot commented Apr 28, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@adk-bot adk-bot added the eval [Component] This issue is related to evaluation label Apr 28, 2026

fix(cli_eval): add get_app_or_root_agent resolver

Eval flows currently access `agent_module.agent.root_agent` directly,
which drops the wrapping `App` (and therefore its plugins, context-cache
config, and resumability config). Add `get_app_or_root_agent` that
returns the `(app, root_agent)` pair, mirroring the resolution order
`AgentLoader._load_from_module_or_package` already uses on the web /
run paths. Keep `get_root_agent` as a back-compat wrapper.

This commit is the resolver and unit tests only; subsequent commits plumb
the App through `EvaluationGenerator` and `LocalEvalService` so plugins
fire during eval runs.

fix(evaluation): forward App through to the eval Runner

`_generate_inferences_from_root_agent` now accepts an optional `app`
parameter. When provided, the eval Runner is built from a copy of the
App with internal eval plugins (`_RequestIntercepterPlugin`,
`EnsureRetryOptionsPlugin`) merged into `app.plugins`. The user's App
is never mutated, and the App's `context_cache_config` /
`resumability_config` ride along automatically. When `app` is None,
the legacy bare-agent path is preserved.

`_process_query` (used by the public `generate_responses` entry point)
now resolves `agent.app` first and forwards it to the helper, so
projects that wrap their root agent in an `App` get plugin coverage
during eval without further changes.

The CLI plumbing that hands the App down from `cli_eval` /
`LocalEvalService` is in the next commit.

fix(eval): plumb App through LocalEvalService to fix App.plugins bypass

Closes the loop on https://github.com/google/adk-python/issues/<TBD>:
when a project wraps its root agent in `App(root_agent=..., plugins=[...])`
and runs `adk eval`, the registered plugins (e.g.,
`BigQueryAgentAnalyticsPlugin`) now fire on every invocation just like
they do for `adk web` / `adk run`. Same applies to `App.context_cache_config`
and `App.resumability_config`, which now ride along automatically.

Changes:
* `LocalEvalService.__init__` accepts an optional `app` keyword argument
  and forwards it to `_generate_inferences_from_root_agent` for each
  eval case.
* `cli_tools_click.cli_eval` resolves the `App` via `get_app_or_root_agent`
  and passes it to `LocalEvalService`.
* `cli_optimize` (GEPA prompt optimization) also routes through
  `LocalEvalService` but currently constructs it inside `LocalEvalSampler`
  with no `app` argument; bringing the optimize path under App-plugin
  coverage is a separate, narrower follow-up and is intentionally not
  included here.

test(cli_tools_click): mock get_app_or_root_agent in eval CLI tests

The eval CLI now resolves agents via `get_app_or_root_agent`. Update
the shared `mock_get_root_agent` fixture in test_cli_tools_click.py to
patch the new resolver and yield `(None, root_agent)`, matching the
non-App path the eval-set-id tests exercise.
@saifer82 saifer82 force-pushed the fix/eval-app-plugins branch from 4ca7875 to 456cd98 Compare April 28, 2026 21:24
Development

Successfully merging this pull request may close these issues.

# cli_eval bypasses App.plugins, breaking observability plugins (e.g., BigQueryAgentAnalyticsPlugin) during eval runs