[receiver/vcenter] receiver is creating performance issues with large deployments #47941

@atoulme

Component(s)

receiver/vcenter

What happened?

Description

A customer is seeing periodic gaps in performance metrics collected from vCenter Server 8.0.3.00700. The gaps occur during critical performance troubleshooting: the vcenter receiver fails to ingest data for windows of approximately 15 minutes, which removes visibility into the environment.

The root cause is resource exhaustion in the vCenter management services (vpxd/Tomcat), driven by oversized QueryPerf API requests from the OpenTelemetry (OTel) Collector. These large batch queries cause CPU and memory spikes, leading to service timeouts and HTTP 500 responses.

When an external monitoring tool issues QueryPerf calls without bounded batch sizes or explicit time boundaries, vCenter has to perform extensive database scans. If a query exceeds the vpxd.stats.maxQueryMetrics safety limit (default 256), vCenter rejects the request to protect database integrity, which produces the reported "Request processing is restricted by administrator" condition. See https://knowledge.broadcom.com/external/article?articleNumber=301449
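To make the limit concrete, here is a rough sketch of the arithmetic. It assumes the size of a QueryPerf call can be approximated as one metric per (entity, counter) pair, and it uses the limit value quoted above. The function names are hypothetical and are not part of the receiver or of any vSphere SDK.

```python
# Limit value as quoted in this issue; verify the actual
# vpxd.stats.maxQueryMetrics setting on your own vCenter.
MAX_QUERY_METRICS = 256


def estimated_query_metrics(entity_count: int, counters_per_entity: int) -> int:
    """Approximate the size of one QueryPerf call.

    Assumption: each (entity, counter) pair counts as one metric.
    """
    return entity_count * counters_per_entity


def exceeds_limit(entity_count: int, counters_per_entity: int,
                  limit: int = MAX_QUERY_METRICS) -> bool:
    """Return True when the estimated query size is above the limit."""
    return estimated_query_metrics(entity_count, counters_per_entity) > limit


if __name__ == "__main__":
    # 10 VMs x 20 counters = 200 metrics, under the limit.
    print(exceeds_limit(10, 20))
    # 50 VMs x 20 counters = 1000 metrics, over the limit.
    print(exceeds_limit(50, 20))
```

Under this approximation, a single call covering a few dozen VMs with a typical counter set would already be rejected, which matches the behavior described above for large deployments.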

Recommendation:

Stabilize the metric collection by performing the following configuration adjustments in the OTel collector:

Reduce Batch Sizes: Lower the max_metrics or perf_request_batch_size parameter to prevent single API calls from exceeding the vCenter limit (see "VMware probe Error Failed to execute - max query").
Increase Polling Interval: Adjust the collection interval from 60 seconds to 120 seconds to reduce the frequency of high-impact queries.

Optimize Query Scopes: Ensure all QueryPerf calls include a specific BEGIN_TIME to avoid full-table historical scans. See https://knowledge.broadcom.com/external/article?articleNumber=301449
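The batching and time-scoping recommendations could look roughly like the sketch below. This is an illustration only, not the receiver's current implementation: the helper names are hypothetical, the limit of 256 is the value quoted in this issue, and the query size is assumed to be entities multiplied by counters per entity.

```python
from datetime import datetime, timedelta, timezone


def max_entities_per_batch(counters_per_entity: int, limit: int = 256) -> int:
    """Largest entity count whose estimated query size stays within the limit."""
    if counters_per_entity <= 0:
        raise ValueError("counters_per_entity must be positive")
    # If one entity alone exceeds the limit, fall back to one entity per
    # batch; the counter list would then need to be split as well.
    return max(1, limit // counters_per_entity)


def split_into_batches(entities: list, counters_per_entity: int,
                       limit: int = 256) -> list:
    """Split the entity list into consecutive batches of bounded size."""
    size = max_entities_per_batch(counters_per_entity, limit)
    return [entities[i:i + size] for i in range(0, len(entities), size)]


def query_window(now: datetime, interval_seconds: int) -> tuple:
    """Return (begin, end) covering only the most recent collection interval."""
    return now - timedelta(seconds=interval_seconds), now


if __name__ == "__main__":
    vms = [f"vm-{i}" for i in range(100)]
    batches = split_into_batches(vms, counters_per_entity=20)
    print(len(batches), [len(b) for b in batches])

    begin, end = query_window(datetime.now(timezone.utc), interval_seconds=120)
    print(begin.isoformat(), end.isoformat())
```

Each batch would then map to one QueryPerf call with the begin time set from `query_window`, so that no single call exceeds the limit or asks for unbounded history.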

Steps to Reproduce

Run the vcenter receiver against a large vCenter deployment.

Analysis of the vCenter envoy-access.log and vpxd.log confirms a breakdown in communication between the OTel collector and vCenter service endpoints:
envoy-access.log: POST /vsanHealth 500 for VsanPerfQueryPerf.
vpxd.log: Frequent QueryPerf failures where the query size exceeds the defined vpxd.stats.maxQueryMetrics limit.
Trace Identifier: HTTP 500 errors on /sdk for SessionIsActive calls, indicating backend service congestion.

Collector version

v0.126.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

Log output

Additional context

No response
