Commit aa96510

Merge pull request #828 from pratik-mahalle/docs
docs: add monitoring considerations to planning and link from index
2 parents 6c766d7 + 6ee5cd5 commit aa96510

2 files changed

Lines changed: 101 additions & 0 deletions

content/en/cloud/self-hosted/planning/_index.md

Lines changed: 6 additions & 0 deletions
@@ -140,3 +140,9 @@ weight: 1
140 140
### Considerations of Air-Gapped Deployments
141 141

142 142
Layer5 acknowledges the importance of air-gapped deployments and ensures content support for such environments. Content registered should be available even in the absence of internet connectivity, thus aligning with Layer5's commitment to versatile deployment scenarios.
143+
144+
### Monitoring Considerations
145+
146+
Plan for comprehensive observability across your Layer5 Cloud deployment, including metrics, logs, tracing, dashboards, and alerting. Establish SLOs for latency, availability, and saturation; size telemetry storage appropriately; and ensure privacy and access controls for operational data.
147+
148+
See: [Monitoring](/cloud/self-hosted/planning/monitoring/)
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
1+
---
2+
title: Monitoring
3+
description: "Plan monitoring for Layer5 Cloud self-hosted deployments: metrics, logs, tracing, dashboards, and alerts."
4+
categories: [Self-Hosted]
5+
tags: [monitoring]
6+
weight: 4
7+
---
8+
9+
Monitoring is essential to operating a reliable Layer5 Cloud deployment. Plan for metrics, logs, traces, dashboards, alerting, and retention so that you can detect and resolve issues quickly, understand capacity, and meet compliance needs.
10+
11+
## Objectives
12+
13+
- Establish observability for core services (API, UI, real-time collaboration, identity, database, cache, ingress) and infrastructure (Kubernetes, nodes, storage, networking)
14+
- Provide dashboards for SLOs and golden signals (latency, traffic, errors, saturation)
15+
- Configure actionable alerts with clear ownership and runbooks
16+
- Size and retain telemetry data according to compliance and cost constraints
17+
18+
## Metrics
19+
20+
Collect system and application metrics. Common choices include Prometheus or any OpenMetrics-compatible backend.
21+
22+
- Kubernetes: kube-state-metrics, cAdvisor/node-exporter, API server, etcd, ingress controller
23+
- Layer5 Cloud services: HTTP latency and error rates, request throughput, worker queue depth, WebSocket/WebRTC health
24+
- Datastores: database query latency, connections, cache hit ratio
25+
26+
Recommended metrics and SLOs:
27+
28+
- Request success rate per route and service, with 5xx and 4xx error rates broken out; target ≥ 99.9% success over 30 days
29+
- p50/p90/p99 latency per route and service; budget aligned to user experience goals
30+
- Resource saturation: CPU, memory, pod restarts, HPA activity; queue length where applicable
31+
- Collaboration health: signaling availability, peer connection success, message delivery error rate
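To make the 99.9% over 30 days target concrete, the following minimal sketch converts an SLO target into an error budget. The function names are hypothetical, and what counts as a "failed" request (for example, 5xx responses only) is an assumption you would need to settle for your own deployment.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 99.9% over 30 days tolerates roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 250 failures out of 1,000,000 requests spends a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Tracking the remaining budget rather than the raw success rate makes it easier to decide when an alert should page someone and when it can wait.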
32+
33+
## Logs
34+
35+
Use a centralized, searchable logging stack (e.g., Loki, Elasticsearch, or a managed service). Ensure structured logs (JSON) for Layer5 Cloud services and infrastructure components.
36+
37+
- Retention tiers: short-term hot (7–14 days), longer-term warm/cold per compliance
38+
- Privacy: scrub/omit secrets and PII; apply data minimization and access control
39+
- Context: include request IDs, user/session IDs (where appropriate), and correlation IDs
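As an illustration of structured logs that carry request context and scrub secrets at the source, here is a minimal sketch using only the Python standard library. The `JsonFormatter` class, the `redact` helper, the `fields` attribute, and the list of sensitive key names are all hypothetical; they are not part of Layer5 Cloud.

```python
import json
import logging
import re

# Assumed redaction rule: mask values whose key name looks sensitive.
SENSITIVE_KEYS = re.compile(r"(password|token|secret|authorization)", re.IGNORECASE)


def redact(fields: dict) -> dict:
    """Replace the value of any sensitive-looking key with a placeholder."""
    return {k: ("[REDACTED]" if SENSITIVE_KEYS.search(k) else v) for k, v in fields.items()}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Context IDs are attached by the caller through `extra=...`.
            "request_id": getattr(record, "request_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        payload.update(redact(getattr(record, "fields", {})))
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("example.api")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "login attempt",
        extra={"request_id": "req-123", "fields": {"api_token": "abc", "route": "/login"}},
    )
```

Key-based redaction only catches secrets stored under recognizable names; values embedded in free-text messages need separate handling.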
40+
41+
## Tracing
42+
43+
Enable distributed tracing with OpenTelemetry to diagnose cross-service latency and failures.
44+
45+
- Propagate W3C Trace Context across ingress → services → dependencies
46+
- Sample rates: start with 1–5% head sampling; use tail-based sampling for errors/latency outliers
47+
- Storage/backends: Tempo/Jaeger/managed APM
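The sketch below illustrates how W3C Trace Context propagation and head sampling fit together: the sampling decision is made once when the trace starts, encoded in the `traceparent` flags, and carried unchanged to downstream hops. The helper names are hypothetical, and in practice an OpenTelemetry SDK would handle this for you; the code is only meant to show the mechanics.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def head_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic head sampling: keep a trace if the low 64 bits of its ID
    fall within the lowest `rate` fraction of the ID space."""
    return int(trace_id[-16:], 16) < int(rate * 2**64)


def new_traceparent(rate: float) -> str:
    """Start a trace at the edge and record the sampling decision in the flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if head_sampled(trace_id, rate) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Keep the trace ID and sampling flag; mint a new span ID for the outgoing hop."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        raise ValueError("malformed traceparent")
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    header = new_traceparent(rate=0.05)  # 5% head sampling
    print(header)
    print(child_traceparent(header))
```

Because the decision depends only on the trace ID, every service that applies the same rate reaches the same verdict. Tail-based sampling, by contrast, has to buffer spans until the trace completes, which is why it usually runs in a collector rather than in application code.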
48+
49+
## Dashboards
50+
51+
Provide Grafana (or equivalent) dashboards for:
52+
53+
- Service health overview: error rate, latency, throughput, saturation
54+
- Ingress and API gateway performance by route
55+
- Real-time collaboration: signaling uptime, peer connection success, message RTT
56+
- Identity/OIDC: login success, token issuance errors, external IdP health
57+
- Database/cache: latency, throughput, errors, saturation
58+
- Kubernetes: cluster/node/pod health, HPA activity, pending pods, eviction events
59+
60+
## Alerts
61+
62+
Create multi-level alerts (warning/critical) with clear runbooks and ownership.
63+
64+
- Availability: elevated 5xx rate or failure rate by route/service
65+
- Latency: p99 above budget for sustained periods
66+
- Saturation: CPU/memory pressure, pod crashloops, queue backlogs
67+
- Dependencies: database unreachable, cache error spikes, external IdP failures
68+
- Collaboration: signaling down, degraded connection success, message delivery failures
69+
70+
Alert destinations may include Slack, email, PagerDuty, or your incident tool. Include links to dashboards and logs in notifications.
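The "sustained periods" condition on the latency alert can be sketched as a streak check over recent p99 samples. This is an illustration only: the function name and the thresholds of 5 and 15 consecutive samples are assumptions, and a real deployment would typically express the same idea in the alerting system itself (for example, with the `for` clause of a Prometheus alerting rule).

```python
from typing import Sequence


def alert_level(
    p99_samples_ms: Sequence[float],
    budget_ms: float,
    warn_after: int = 5,
    crit_after: int = 15,
) -> str:
    """Return "ok", "warning", or "critical" based on how many of the most
    recent samples in a row exceeded the latency budget."""
    streak = 0
    for value in reversed(p99_samples_ms):
        if value <= budget_ms:
            break  # the run of over-budget samples ends here
        streak += 1
    if streak >= crit_after:
        return "critical"
    if streak >= warn_after:
        return "warning"
    return "ok"


if __name__ == "__main__":
    recent = [120.0] * 10 + [640.0] * 5  # last five samples are over a 500 ms budget
    print(alert_level(recent, budget_ms=500.0))
```

Requiring a run of consecutive breaches keeps a single slow scrape from paging anyone, at the cost of a slightly slower reaction to real incidents.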
71+
72+
## Sizing and Retention
73+
74+
Estimate telemetry volume early to avoid unexpected costs.
75+
76+
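- Metrics: active time series ÷ scrape interval gives samples ingested per second; multiply by bytes per sample and retention period, and downsample older data
77+
- Logs: average line size × events/sec gives bytes per second; multiply by retention period, and apply sampling/filters and retention tiers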
78+
- Traces: sample strategically; store only spans needed for SLOs and investigations
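A back-of-the-envelope estimate along these lines can be sketched as follows. The function names and all input figures (series count, bytes per sample, line size, event rate) are illustrative assumptions; actual storage per sample depends heavily on the backend and its compression.

```python
SECONDS_PER_DAY = 86_400


def metrics_bytes_per_day(active_series: int, scrape_interval_s: float, bytes_per_sample: float) -> float:
    """Each series yields one sample per scrape, so ingest rate is series / interval."""
    samples_per_second = active_series / scrape_interval_s
    return samples_per_second * bytes_per_sample * SECONDS_PER_DAY


def logs_bytes_per_day(avg_line_bytes: float, events_per_second: float) -> float:
    """Raw (uncompressed) log volume per day."""
    return avg_line_bytes * events_per_second * SECONDS_PER_DAY


def retained_bytes(bytes_per_day: float, retention_days: int) -> float:
    """Steady-state storage once the retention window is full."""
    return bytes_per_day * retention_days


if __name__ == "__main__":
    # Assumed inputs: 500k series scraped every 30 s at about 2 bytes per sample.
    metrics = metrics_bytes_per_day(500_000, 30, 2)
    # Assumed inputs: 300-byte lines at 2,000 events per second.
    logs = logs_bytes_per_day(300, 2_000)
    print(f"metrics: {metrics / 1e9:.2f} GB/day, {retained_bytes(metrics, 30) / 1e9:.1f} GB at 30 days")
    print(f"logs:    {logs / 1e9:.2f} GB/day, {retained_bytes(logs, 14) / 1e9:.1f} GB at 14 days")
```

Note that halving the scrape interval doubles metric volume, which is why the interval appears as a divisor rather than a multiplier.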
79+
80+
## Security and Compliance
81+
82+
- Restrict telemetry access by role; audit access to sensitive logs
83+
- Encrypt in transit and at rest; segregate prod/staging data
84+
- Redact secrets and PII at the source where possible
85+
86+
## Reference Architecture (example)
87+
88+
- Metrics: Prometheus + Alertmanager; long-term storage via remote-write (e.g., Thanos, Mimir)
89+
- Logs: Loki with LogQL saved queries (or Elasticsearch with its own saved searches) and retention tiers
90+
- Traces: Tempo/Jaeger with OpenTelemetry SDKs/collectors
91+
- Dashboards: Grafana with folders for platform, services, and business metrics
92+
93+
This setup is vendor-neutral; any component can be replaced with a managed offering from your cloud provider or APM vendor.
94+
95+
