Summary
In Kubernetes deployments, OBI can intermittently stop reporting metrics for a restarted or newly created pod until OBI itself is restarted. The root cause is that a failed initial PID→container correlation is treated as permanent.
Failure sequence
- Process watcher discovers an application PID.
- `getContainerInfo(pid)` reads `/proc/<pid>/cgroup` but fails — container metadata is not yet ready (kubelet still initializing the container).
- `onNewProcess()` returns `false`, permanently dropping the PID from Kubernetes enrichment.
- Pod metadata becomes available shortly after, but the already-seen PID is never retried.
- Metrics for that process stop flowing.
- Restarting OBI forces a fresh process scan and recovers metrics.
Affected code
`pkg/appolly/discover/watcher_kube.go` — `onNewProcess()`:
```go
func (wk *watcherKubeEnricher) onNewProcess(procInfo ProcessAttrs) (ProcessAttrs, bool) {
	wk.mt.Lock()
	defer wk.mt.Unlock()
	containerInfo, err := wk.getContainerInfo(procInfo.pid)
	if err != nil {
		// PID is permanently dropped here — never retried
		wk.log.Debug("can't get container info for PID", "pid", procInfo.pid, "error", err)
		return ProcessAttrs{}, false
	}
	// ...
}
```
Environment
- Beyla v2.6.3 / OBI (latest `main` has the same code path)
- Kubernetes, containerd runtime
- Prometheus scrape mode
- DaemonSet deployment
User-visible symptom
- A pod restarts or is created for the first time.
- OBI stops reporting HTTP/gRPC metrics for that pod.
- Other pods on the same node continue working normally.
- Restarting OBI fixes the problem.
- The issue is intermittent — depends on startup timing and node load.
Evidence
Log pattern during the failure:
```
msg="Pod added" component=discover.watcherKubeEnricher containers="[name:\"app-name\"]"
msg="new process" component=discover.watcherKubeEnricher pid=...
msg="can't get container info for PID" component=discover.watcherKubeEnricher pid=... error="/proc/.../cgroup: couldn't find any docker entry for process with PID ..."
```
Note the pod event shows `containers="[name:\"app-name\"]"` with no `containerID` — the kubelet has created the pod object but hasn't yet populated the container status.
**Healthy control on the same node:** Other pods on the same node have valid `containerStatuses[].containerID` and their cgroup paths parse correctly with the existing regex patterns. The failure is a timing race, not a parser incompatibility.
Why this is a race
- Process discovery can happen before the container cgroup entry is written to `/proc/<pid>/cgroup` (a probe sketch after this list shows one way to observe this).
- The window is narrow (milliseconds) and depends on node load, kubelet scheduling, and process scan interval.
- Some pods self-heal because additional pod events re-trigger correlation for already-registered processes — but never for processes that failed initial lookup.
- Restarting OBI always recovers because a fresh process scan re-discovers the PID after container metadata is available.
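To make the window observable, a small standalone probe can poll `/proc/<pid>/cgroup` while a pod starts. This is a sketch, not OBI code; the PID argument handling and the `cri-containerd` substring match are illustrative assumptions:

```go
// raceprobe: poll /proc/<pid>/cgroup until a container-scoped entry appears.
// Illustrative only; not part of OBI.
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: raceprobe <pid>")
		return
	}
	path := fmt.Sprintf("/proc/%s/cgroup", os.Args[1])
	start := time.Now()
	for {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Println("process exited before a container entry appeared:", err)
			return
		}
		// On containerd nodes, a container-scoped cgroup path typically
		// embeds the container ID, e.g. ".../cri-containerd-<id>.scope".
		if strings.Contains(string(data), "cri-containerd") {
			fmt.Printf("container entry visible after %v\n", time.Since(start))
			return
		}
		time.Sleep(time.Millisecond)
	}
}
```

Run against a PID caught at container start, the reported delay approximates the window in which OBI's first lookup can fail.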
Expected behavior
If the first `getContainerInfo(pid)` fails transiently, the process should be kept in a pending state and retried:
- On subsequent pod create/update events (metadata becoming available).
- On a short periodic interval (fallback for cases where no new pod event arrives).
- Pending entries should be cleaned up when the process terminates.
Suggested fix direction
In `onNewProcess()`, when `getContainerInfo(pid)` fails (sketched in code below):
- Store the process in a `pendingProcesses map[app.PID]ProcessAttrs` instead of dropping it.
- Add a `retryPendingProcesses()` helper that iterates pending PIDs and retries `getContainerInfo()`.
- Call `retryPendingProcesses()` from two triggers:
  - After pod create/update events in `enrichPodEvent()` (event-driven recovery).
  - On a short periodic ticker in the `enrich()` select loop (fallback recovery).
- On successful retry: register the process normally and emit `EventCreated`.
- On process termination: `delete(pendingProcesses, pid)` to prevent leaks.
This approach:
- Does not change behavior for processes whose first lookup succeeds (the common case).
- Uses the existing injectable `containerInfoForPID` var for testability.
- Handles both timing scenarios: slow cgroup availability (ticker) and delayed pod metadata (pod events).
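A minimal, self-contained sketch of this direction follows. All type and field names besides `onNewProcess()`, `getContainerInfo()`, `retryPendingProcesses()`, and `enrichPodEvent()` are simplified stand-ins, not the actual `pkg/appolly/discover` definitions:

```go
// Sketch of the pending-retry mechanism. Types are simplified; the real
// enricher carries more state and uses app.PID.
package discover

import (
	"sync"
	"time"
)

type PID uint32

// ProcessAttrs mirrors the attributes carried per discovered process.
type ProcessAttrs struct {
	pid PID
	// ... other attributes elided
}

type containerInfo struct{ ID string }

type watcherKubeEnricher struct {
	mt               sync.Mutex
	pendingProcesses map[PID]ProcessAttrs // PIDs whose first lookup failed
	getContainerInfo func(PID) (containerInfo, error)
	emitCreated      func(ProcessAttrs) // stand-in for the real EventCreated emission
}

// onNewProcess parks PIDs whose container lookup fails instead of
// dropping them permanently.
func (wk *watcherKubeEnricher) onNewProcess(procInfo ProcessAttrs) (ProcessAttrs, bool) {
	wk.mt.Lock()
	defer wk.mt.Unlock()
	if _, err := wk.getContainerInfo(procInfo.pid); err != nil {
		wk.pendingProcesses[procInfo.pid] = procInfo
		return ProcessAttrs{}, false
	}
	// ... existing success path unchanged ...
	return procInfo, true
}

// retryPendingProcesses re-attempts correlation for parked PIDs. It would
// be called after pod create/update events in enrichPodEvent() and from
// the ticker below.
func (wk *watcherKubeEnricher) retryPendingProcesses() {
	wk.mt.Lock()
	defer wk.mt.Unlock()
	for pid, procInfo := range wk.pendingProcesses {
		if _, err := wk.getContainerInfo(pid); err == nil {
			delete(wk.pendingProcesses, pid) // safe while ranging in Go
			wk.emitCreated(procInfo)         // register normally, emit EventCreated
		}
	}
}

// runRetryTicker sketches the fallback trigger for the enrich() select
// loop: retry pending PIDs even when no new pod event arrives.
func (wk *watcherKubeEnricher) runRetryTicker(done <-chan struct{}) {
	ticker := time.NewTicker(2 * time.Second) // interval is illustrative
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			wk.retryPendingProcesses()
		}
	}
}

// onProcessTermination prevents pending entries from leaking.
func (wk *watcherKubeEnricher) onProcessTermination(pid PID) {
	wk.mt.Lock()
	defer wk.mt.Unlock()
	delete(wk.pendingProcesses, pid)
}
```

Deleting from `pendingProcesses` while ranging over it is safe in Go, so the retry pass needs no extra bookkeeping.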
I have a working prototype tested against v2.6.3 and am happy to port it to the current `pkg/appolly/discover/` layout and submit a PR.