
Intermittent missing metrics after pod restart: failed PID→container correlation is not retried #1951

@wsszh

Description


Summary

In Kubernetes deployments, OBI can intermittently stop reporting metrics for a restarted or newly created pod until OBI itself is restarted. The root cause is that a failed initial PID→container correlation is treated as permanent.

Failure sequence

  1. Process watcher discovers an application PID.
  2. getContainerInfo(pid) reads /proc/<pid>/cgroup but fails — container metadata is not yet ready (kubelet still initializing the container).
  3. onNewProcess() returns false, permanently dropping the PID from Kubernetes enrichment.
  4. Pod metadata becomes available shortly after, but the already-seen PID is never retried.
  5. Metrics for that process stop flowing.
  6. Restarting OBI forces a fresh process scan and recovers metrics.

Affected code

pkg/appolly/discover/watcher_kube.go, onNewProcess():

func (wk *watcherKubeEnricher) onNewProcess(procInfo ProcessAttrs) (ProcessAttrs, bool) {
	wk.mt.Lock()
	defer wk.mt.Unlock()
	containerInfo, err := wk.getContainerInfo(procInfo.pid)
	if err != nil {
		// PID is permanently dropped here — never retried
		wk.log.Debug("can't get container info for PID", "pid", procInfo.pid, "error", err)
		return ProcessAttrs{}, false
	}
	// ...
}

Environment

  • Beyla v2.6.3 / OBI (latest main has the same code path)
  • Kubernetes, containerd runtime
  • Prometheus scrape mode
  • DaemonSet deployment

User-visible symptom

  • A pod restarts or is created for the first time.
  • OBI stops reporting HTTP/gRPC metrics for that pod.
  • Other pods on the same node continue working normally.
  • Restarting OBI fixes the problem.
  • The issue is intermittent — depends on startup timing and node load.

Evidence

Log pattern during the failure:

msg="Pod added" component=discover.watcherKubeEnricher containers="[name:\"app-name\"]"
msg="new process" component=discover.watcherKubeEnricher pid=...
msg="can't get container info for PID" component=discover.watcherKubeEnricher pid=... error="/proc/.../cgroup: couldn't find any docker entry for process with PID ..."

Note the pod event shows containers="[name:\"app-name\"]" with no containerID — the kubelet has created the pod object but hasn't yet populated the container status.

Healthy control on the same node: Other pods on the same node have valid containerStatuses[].containerID and their cgroup paths parse correctly with the existing regex patterns. The failure is a timing race, not a parser incompatibility.

Why this is a race

  • Process discovery can happen before the container cgroup entry is written to /proc/<pid>/cgroup.
  • The window is narrow (milliseconds) and depends on node load, kubelet scheduling, and process scan interval.
  • Some pods self-heal because additional pod events re-trigger correlation for already-registered processes — but never for processes that failed initial lookup.
  • Restarting OBI always recovers because a fresh process scan re-discovers the PID after container metadata is available.

Expected behavior

If the first getContainerInfo(pid) fails transiently, the process should be kept in a pending state and retried:

  • On subsequent pod create/update events (metadata becoming available).
  • On a short periodic interval (fallback for cases where no new pod event arrives).
  • Pending entries should be cleaned up when the process terminates.

Suggested fix direction

In onNewProcess(), when getContainerInfo(pid) fails:

  1. Store the process in a pendingProcesses map[app.PID]ProcessAttrs instead of dropping it.
  2. Add a retryPendingProcesses() helper that iterates pending PIDs and retries getContainerInfo().
  3. Call retryPendingProcesses() from two triggers:
    • After pod create/update events in enrichPodEvent() (event-driven recovery).
    • On a short periodic ticker in the enrich() select loop (fallback recovery).
  4. On successful retry: register the process normally and emit EventCreated.
  5. On process termination: delete(pendingProcesses, pid) to prevent leaks.

This approach:

  • Does not change behavior for processes whose first lookup succeeds (the common case).
  • Uses the existing injectable containerInfoForPID var for testability.
  • Handles both timing scenarios: slow cgroup availability (ticker) and delayed pod metadata (pod events).

I have a working prototype tested against v2.6.3 and am happy to port it to the current pkg/appolly/discover/ layout and submit a PR.
