# Observability & Monitoring Rules
Conventions for implementing structured logging, metrics, and distributed tracing across services to achieve full-stack observability and enable rapid incident response.
## Three Pillars
- Every service must emit all three observability signals: **logs** (what happened), **metrics** (how often/how long), **traces** (where time was spent).
- Use OpenTelemetry (OTEL) as the unified instrumentation standard — avoid vendor-specific SDKs in application code.
- Export signals to a single backend (Grafana stack, Datadog, New Relic) configured in infrastructure, not in application code.
## Structured Logging
- All logs must be structured JSON. Never use plain `console.log` or `print()` in production code.
- Required fields on every log line: `timestamp` (ISO-8601 UTC), `level`, `service`, `version`, `trace_id`, `span_id`, `message`.
- Use log levels correctly: `DEBUG` (development detail), `INFO` (notable events), `WARN` (recoverable issues), `ERROR` (failures requiring attention), `FATAL` (service cannot continue).
- Include context fields relevant to the operation: `user_id`, `request_id`, `order_id`, `duration_ms`.
- Never log PII, passwords, tokens, credit card numbers, or full request/response bodies containing sensitive data.
- Log at entry and exit of significant operations — not every function call.
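The rules above can be sketched as a small JSON log helper. This is a minimal illustration, not a prescribed implementation; the `SERVICE`/`VERSION` constants and the `log` function name are assumptions, and in practice these fields would come from your OTEL-aware logging library.

```python
import json
import sys
from datetime import datetime, timezone

SERVICE = "orders-api"  # assumed service name, for illustration only
VERSION = "1.4.2"       # assumed version

def log(level, message, trace_id="", span_id="", **context):
    """Emit one structured JSON log line carrying the required fields,
    plus any operation-specific context (user_id, request_id, duration_ms)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": SERVICE,
        "version": VERSION,
        "trace_id": trace_id,
        "span_id": span_id,
        "message": message,
        **context,  # caller-supplied context fields; never include PII here
    }
    line = json.dumps(record)
    sys.stdout.write(line + "\n")
    return line
```

Usage: `log("INFO", "order created", trace_id=tid, span_id=sid, order_id="o-123", duration_ms=42)` produces one machine-parseable line per event.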
## Metrics
- Instrument the four golden signals: **Latency** (p50/p95/p99), **Traffic** (req/s), **Errors** (error rate %), **Saturation** (CPU/memory/queue depth).
- Use standard metric types: Counter (monotonically increasing), Gauge (current value), Histogram (distribution).
- Name metrics with the format `{service}_{noun}_{unit}`, using Prometheus-style suffixes: `api_requests_total` (counters end in `_total`), `db_query_duration_seconds`, `cache_hit_ratio`.
- Add labels sparingly and keep cardinality bounded: `status_code`, `method`, and `route` are fine. Never use unbounded values such as `user_id` or `order_id` as metric labels.
- Define SLOs before instrumenting — instrument what you need to prove you're meeting them.
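To make the metric types concrete, here is an in-memory sketch of a Counter and a Histogram with low-cardinality labels. This is illustrative only; a real service would use the OTEL or Prometheus client library, and the bucket boundaries here are arbitrary assumptions.

```python
from collections import defaultdict

class Counter:
    """Monotonically increasing value, e.g. api_requests_total."""
    def __init__(self):
        self.values = defaultdict(float)  # keyed by a bounded label tuple

    def inc(self, labels=(), amount=1):
        self.values[labels] += amount

class Histogram:
    """Distribution of observations, e.g. db_query_duration_seconds."""
    def __init__(self, buckets=(0.005, 0.05, 0.5, 5.0)):
        self.buckets = buckets
        # one slot per bucket plus an overflow slot (non-cumulative, for simplicity)
        self.counts = [0] * (len(buckets) + 1)
        self.sum = 0.0

    def observe(self, value):
        self.sum += value
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

# Register instruments once; labels stay low-cardinality: (method, route, status_code)
requests_total = Counter()
requests_total.inc(labels=("GET", "/api/orders", "200"))
```

Note the label tuple holds only bounded values; putting a `user_id` there would create one time series per user.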
## Distributed Tracing
- Propagate trace context across all service boundaries: HTTP headers (`traceparent`), message queues (message attributes), gRPC metadata.
- Create spans for every external call: DB query, HTTP request, cache operation, message publish.
- Name spans clearly: `db.users.findById`, `http.POST /api/orders`, `cache.get user:{id}`.
- Add span attributes for important context: `db.system`, `db.statement` (without values for parameterized queries), `http.url`, `http.status_code`.
- Sample intelligently: always sample errors and slow requests; sample normal traffic at 1-10%.
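Context propagation over HTTP hinges on the W3C `traceparent` header, whose version-00 format is `00-{trace_id:32hex}-{span_id:16hex}-{flags:2hex}`. The sketch below builds and parses that header; helper names are assumptions, and a real service would let the OTEL SDK manage this.

```python
import re
import secrets

# W3C Trace Context, version 00: 00-<trace_id>-<span_id>-<flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(trace_id=None, sampled=True):
    """Build a traceparent header value; reuse trace_id for downstream calls."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars, new per hop
    flags = "01" if sampled else "00"             # bit 0 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Return (trace_id, span_id, sampled) or None if malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, (int(flags, 16) & 0x01) == 1
```

A downstream call keeps the same `trace_id` but mints a fresh `span_id`, which is what stitches the hops into one trace.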
## Alerting
- Define alerts based on SLOs, not raw metrics. Alert on error budget burn rate, not arbitrary thresholds.
- Every alert must have a runbook link in its description.
- Alerts must be actionable — if you can't describe what to do when the alert fires, the alert is wrong.
- Use alert severity levels: `critical` (page on-call immediately), `warning` (notify, no page), `info` (dashboard only).
- Avoid alert fatigue — review and tune alerts monthly. Silence noisy alerts only with a documented fix date.
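Burn-rate alerting can be reduced to simple arithmetic: burn rate is the observed error rate divided by the error budget (1 − SLO), so 1.0 means spending the budget exactly on schedule. The multi-window check and the 14.4 threshold below follow the commonly cited fast-burn configuration for a 30-day 99.9% SLO with 1h/5m windows; treat those numbers as assumptions to tune per SLO.

```python
def burn_rate(error_rate, slo):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    budget = 1.0 - slo
    return error_rate / budget

def should_page(long_window_rate, short_window_rate, slo, threshold=14.4):
    """Multi-window fast-burn check: page only when BOTH the long and short
    windows are burning above threshold, which filters out brief blips."""
    return (burn_rate(long_window_rate, slo) > threshold
            and burn_rate(short_window_rate, slo) > threshold)
```

Requiring both windows to exceed the threshold keeps a 30-second spike from paging anyone while still catching a sustained outage within minutes.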
## Health Endpoints
- Every service must expose `/health/live` (is the process alive?) and `/health/ready` (is it ready to serve traffic?).
- Liveness: check process is running — no external dependencies.
- Readiness: check DB connections, critical downstream services, cache connections.
- Return HTTP 200 on healthy, 503 on unhealthy. Include a JSON body with component statuses.
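A readiness handler is essentially an aggregation over component checks. This framework-agnostic sketch (function and field names are assumptions) maps check results to the 200/503 contract with a JSON body of component statuses.

```python
import json

def readiness(checks):
    """Aggregate component checks into (http_status, json_body).

    `checks` maps component name -> zero-arg callable returning True if healthy
    (e.g. a DB ping, a cache ping, a downstream health probe).
    """
    components = {}
    for name, check in checks.items():
        try:
            components[name] = "ok" if check() else "unhealthy"
        except Exception:
            # a check that raises counts as unhealthy, never as a crash
            components[name] = "unhealthy"
    healthy = all(status == "ok" for status in components.values())
    body = json.dumps({
        "status": "ok" if healthy else "unhealthy",
        "components": components,
    })
    return (200 if healthy else 503), body
```

The liveness endpoint, by contrast, would return 200 unconditionally as long as the process can answer, with no external checks at all.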
## Anti-Patterns
- Do not mint service-local correlation IDs — an ID is only useful if it is propagated across every service boundary.
- Do not log at DEBUG level in production by default — it will flood your log storage.
- Do not create or register metric instruments inside hot loops — register once at startup, then increment the existing counter; aggregate locally and flush periodically if increments are extremely frequent.
- Do not ignore error metrics for background jobs — they are as critical as request errors.
- Do not store traces for more than 15-30 days — they are large; archive aggregated data instead.