
Intermittent missing metrics after pod restart: failed PID→container correlation is not retried #1951

@wsszh

Description


Summary

In Kubernetes deployments, OBI can intermittently stop reporting metrics for a restarted or newly created pod until OBI itself is restarted. The root cause is that a failed initial PID→container correlation is treated as permanent.

Failure sequence

  1. Process watcher discovers an application PID.
  2. getContainerInfo(pid) reads /proc/<pid>/cgroup but fails — container metadata is not yet ready (kubelet still initializing the container).
  3. onNewProcess() returns false, permanently dropping the PID from Kubernetes enrichment.
  4. Pod metadata becomes available shortly after, but the already-seen PID is never retried.
  5. Metrics for that process stop flowing.
  6. Restarting OBI forces a fresh process scan and recovers metrics.

Affected code

pkg/appolly/discover/watcher_kube.go, onNewProcess():

func (wk *watcherKubeEnricher) onNewProcess(procInfo ProcessAttrs) (ProcessAttrs, bool) {
	wk.mt.Lock()
	defer wk.mt.Unlock()
	containerInfo, err := wk.getContainerInfo(procInfo.pid)
	if err != nil {
		// PID is permanently dropped here — never retried
		wk.log.Debug("can't get container info for PID", "pid", procInfo.pid, "error", err)
		return ProcessAttrs{}, false
	}
	// ...
}

Environment

  • Beyla v2.6.3 / OBI (latest main has the same code path)
  • Kubernetes, containerd runtime
  • Prometheus scrape mode
  • DaemonSet deployment

User-visible symptom

  • A pod restarts or is created for the first time.
  • OBI stops reporting HTTP/gRPC metrics for that pod.
  • Other pods on the same node continue working normally.
  • Restarting OBI fixes the problem.
  • The issue is intermittent — depends on startup timing and node load.

Evidence

Log pattern during the failure:

msg="Pod added" component=discover.watcherKubeEnricher containers="[name:\"app-name\"]"
msg="new process" component=discover.watcherKubeEnricher pid=...
msg="can't get container info for PID" component=discover.watcherKubeEnricher pid=... error="/proc/.../cgroup: couldn't find any docker entry for process with PID ..."

Note the pod event shows containers="[name:\"app-name\"]" with no containerID — the kubelet has created the pod object but hasn't yet populated the container status.

Healthy control on the same node: Other pods on the same node have valid containerStatuses[].containerID and their cgroup paths parse correctly with the existing regex patterns. The failure is a timing race, not a parser incompatibility.

Why this is a race

  • Process discovery can happen before the container cgroup entry is written to /proc/<pid>/cgroup.
  • The window is narrow (milliseconds) and depends on node load, kubelet scheduling, and process scan interval.
  • Some pods self-heal because additional pod events re-trigger correlation for already-registered processes — but never for processes that failed initial lookup.
  • Restarting OBI always recovers because a fresh process scan re-discovers the PID after container metadata is available.

Expected behavior

If the first getContainerInfo(pid) fails transiently, the process should be kept in a pending state and retried:

  • On subsequent pod create/update events (metadata becoming available).
  • On a short periodic interval (fallback for cases where no new pod event arrives).
  • Pending entries should be cleaned up when the process terminates.

Suggested fix direction

In onNewProcess(), when getContainerInfo(pid) fails:

  1. Store the process in a pendingProcesses map[app.PID]ProcessAttrs instead of dropping it.
  2. Add a retryPendingProcesses() helper that iterates pending PIDs and retries getContainerInfo().
  3. Call retryPendingProcesses() from two triggers:
    • After pod create/update events in enrichPodEvent() (event-driven recovery).
    • On a short periodic ticker in the enrich() select loop (fallback recovery).
  4. On successful retry: register the process normally and emit EventCreated.
  5. On process termination: delete(pendingProcesses, pid) to prevent leaks.

This approach:

  • Does not change behavior for processes whose first lookup succeeds (the common case).
  • Uses the existing injectable containerInfoForPID var for testability.
  • Handles both timing scenarios: slow cgroup availability (ticker) and delayed pod metadata (pod events).

I have a working prototype tested against v2.6.3 and am happy to port it to the current pkg/appolly/discover/ layout and submit a PR.
