Component(s)
receiver/vcenter
What happened?
Description
Customer is experiencing periodic drops in performance metrics from vCenter Server 8.0.3.00700. The gaps occur during critical performance troubleshooting: the vCenter receiver fails to ingest data for windows of roughly 15 minutes, impairing visibility into the environment.
The root cause is resource exhaustion on the vCenter management services (vpxd/Tomcat), driven by oversized QueryPerf API requests from the OpenTelemetry (OTel) collector. These large batch queries trigger CPU and memory spikes, leading to service timeouts and HTTP 500 responses.
When an external monitoring tool issues QueryPerf calls without optimized batching or explicit time boundaries, vCenter is forced to perform extensive database scans. If a query exceeds the vpxd.stats.maxQueryMetrics safety limit (default 256), vCenter drops the request to protect database integrity, producing the reported "Request processing is restricted by administrator" error (see https://knowledge.broadcom.com/external/article?articleNumber=301449).
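For illustration, below is a minimal govmomi sketch of a QueryPerf call that stays within these limits (govmomi is the library the vcenter receiver is built on). The queryBounded helper and the caller-supplied entity/metric IDs are hypothetical, not actual receiver code; the govmomi types and calls themselves are real.

```go
package vcenterquery

import (
	"context"
	"time"

	"github.com/vmware/govmomi/performance"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

// maxQueryMetrics mirrors the vCenter default vpxd.stats.maxQueryMetrics limit.
const maxQueryMetrics = 256

// queryBounded issues a QueryPerf request with an explicit time window and a
// metric count capped below the vpxd safety limit, so vCenter never performs
// an unbounded historical scan. Entity and metric IDs come from the caller.
func queryBounded(ctx context.Context, c *vim25.Client,
	entity types.ManagedObjectReference, metricIDs []types.PerfMetricId,
	window time.Duration) ([]types.BasePerfEntityMetricBase, error) {

	if len(metricIDs) > maxQueryMetrics {
		metricIDs = metricIDs[:maxQueryMetrics] // stay under the 256-metric default
	}
	end := time.Now()
	begin := end.Add(-window) // explicit begin time: no full-table scan

	spec := types.PerfQuerySpec{
		Entity:     entity,
		MetricId:   metricIDs,
		IntervalId: 20, // 20-second real-time stats interval
		StartTime:  &begin,
		EndTime:    &end,
	}
	return performance.NewManager(c).Query(ctx, []types.PerfQuerySpec{spec})
}
```

Capping the metric list and always passing StartTime keeps each request cheap enough that vpxd never has to reject it.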
Recommendation:
Stabilize the metric collection by performing the following configuration adjustments in the OTel collector:
Reduce Batch Sizes: Lower the max_metrics or perf_request_batch_size parameter so that a single API call cannot exceed the vCenter limit.
Increase Polling Interval: Adjust the collection interval from 60 seconds to 120 seconds to reduce the frequency of high-impact queries (see the configuration sketch after this list).
Optimize Query Scopes: Ensure all QueryPerf calls include a specific BEGIN_TIME to avoid full-table historical scans (see https://knowledge.broadcom.com/external/article?articleNumber=301449).
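As a starting point, here is a hedged configuration sketch. endpoint, username, password, collection_interval, and tls are documented vcenter receiver settings; the hostname and credential variables are placeholders. Note that max_metrics and perf_request_batch_size do not appear among the documented receiver options, so only the interval change is expressed here.

```yaml
receivers:
  vcenter:
    endpoint: https://vcenter.example.com   # placeholder address
    username: "${env:VCENTER_USERNAME}"     # placeholder credentials
    password: "${env:VCENTER_PASSWORD}"
    collection_interval: 2m                 # 120s, per the recommendation above
    tls:
      insecure_skip_verify: false
```

If query volume must be reduced further, the vpxd.stats.maxQueryMetrics limit can also be raised by a vCenter administrator per the Broadcom KB article above, though shrinking the client-side queries is the safer fix.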
Steps to Reproduce
Run the vcenter receiver against a large vCenter deployment.
Analysis of the vCenter envoy-access.log and vpxd.log confirms a breakdown in communication between the OTel collector and vCenter service endpoints:
envoy-access.log: POST /vsanHealth returns 500 for VsanPerfQueryPerf.
vpxd.log: Frequent QueryPerf failures where the query size exceeds the defined vpxd.stats.maxQueryMetrics limit.
Trace Identifier: HTTP 500 errors on /sdk for SessionIsActive calls, indicating backend service congestion.
Collector version
v0.126.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
Additional context
No response