# Observability & Monitoring Rules
Conventions for implementing structured logging, metrics, and distributed tracing across services to achieve full-stack observability and enable rapid incident response.
## Three Pillars
- Every service must emit all three observability signals: **logs** (what happened), **metrics** (how often/how long), **traces** (where time was spent).
- Use OpenTelemetry (OTEL) as the unified instrumentation standard — avoid vendor-specific SDKs in application code.
- Export signals to a single backend (Grafana stack, Datadog, New Relic) configured in infrastructure, not in application code.
## Structured Logging
- All logs must be structured JSON. Never use plain `console.log` or `print()` in production code.
- Required fields on every log line: `timestamp` (ISO-8601 UTC), `level`, `service`, `version`, `trace_id`, `span_id`, `message`.
- Use log levels correctly: `DEBUG` (development detail), `INFO` (notable events), `WARN` (recoverable issues), `ERROR` (failures requiring attention), `FATAL` (service cannot continue).
- Include context fields relevant to the operation: `user_id`, `request_id`, `order_id`, `duration_ms`.
- Never log PII, passwords, tokens, credit card numbers, or full request/response bodies containing sensitive data.
- Log at entry and exit of significant operations — not every function call.
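The rules above can be sketched as a small JSON log helper. This is a minimal illustration, not a prescribed implementation; the `SERVICE`/`VERSION` constants and the `log` function name are assumptions, and in practice these fields would come from your OTEL-aware logging library.

```python
import json
import sys
from datetime import datetime, timezone

SERVICE = "orders-api"  # assumed service name, for illustration only
VERSION = "1.4.2"       # assumed version

def log(level, message, trace_id="", span_id="", **context):
    """Emit one structured JSON log line carrying the required fields,
    plus any operation-specific context (user_id, request_id, duration_ms)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": SERVICE,
        "version": VERSION,
        "trace_id": trace_id,
        "span_id": span_id,
        "message": message,
        **context,  # caller-supplied context fields; never include PII here
    }
    line = json.dumps(record)
    sys.stdout.write(line + "\n")
    return line
```

Usage: `log("INFO", "order created", trace_id=tid, span_id=sid, order_id="o-123", duration_ms=42)` produces one machine-parseable line per event.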
## Metrics
- Instrument the four golden signals: **Latency** (p50/p95/p99), **Traffic** (req/s), **Errors** (error rate %), **Saturation** (CPU/memory/queue depth).
- Use standard metric types: Counter (monotonically increasing), Gauge (current value), Histogram (distribution).
- Name metrics with the format `{service}_{noun}_{unit}`, using Prometheus-style suffixes: `api_requests_total` (counters end in `_total`), `db_query_duration_seconds`, `cache_hit_ratio`.
- Add labels sparingly and keep cardinality bounded: `status_code`, `method`, and `route` are fine. Never use unbounded values such as `user_id` or `order_id` as metric labels.
- Define SLOs before instrumenting — instrument what you need to prove you're meeting them.
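To make the metric types concrete, here is an in-memory sketch of a Counter and a Histogram with low-cardinality labels. This is illustrative only; a real service would use the OTEL or Prometheus client library, and the bucket boundaries here are arbitrary assumptions.

```python
from collections import defaultdict

class Counter:
    """Monotonically increasing value, e.g. api_requests_total."""
    def __init__(self):
        self.values = defaultdict(float)  # keyed by a bounded label tuple

    def inc(self, labels=(), amount=1):
        self.values[labels] += amount

class Histogram:
    """Distribution of observations, e.g. db_query_duration_seconds."""
    def __init__(self, buckets=(0.005, 0.05, 0.5, 5.0)):
        self.buckets = buckets
        # one slot per bucket plus an overflow slot (non-cumulative, for simplicity)
        self.counts = [0] * (len(buckets) + 1)
        self.sum = 0.0

    def observe(self, value):
        self.sum += value
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

# Register instruments once; labels stay low-cardinality: (method, route, status_code)
requests_total = Counter()
requests_total.inc(labels=("GET", "/api/orders", "200"))
```

Note the label tuple holds only bounded values; putting a `user_id` there would create one time series per user.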
## Distributed Tracing
- Propagate trace context across all service boundaries: HTTP headers (`traceparent`), message queues (message attributes), gRPC metadata.
- Create spans for every external call: DB query, HTTP request, cache operation, message publish.
- Name spans clearly: `db.users.findById`, `http.POST /api/orders`, `cache.get user:{id}`.
- Add span attributes for important context: `db.system`, `db.statement` (without values for parameterized queries), `http.url`, `http.status_code`.
- Sample intelligently: always sample errors and slow requests; sample normal traffic at 1-10%.
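Context propagation over HTTP hinges on the W3C `traceparent` header, whose version-00 format is `00-{trace_id:32hex}-{span_id:16hex}-{flags:2hex}`. The sketch below builds and parses that header; helper names are assumptions, and a real service would let the OTEL SDK manage this.

```python
import re
import secrets

# W3C Trace Context, version 00: 00-<trace_id>-<span_id>-<flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(trace_id=None, sampled=True):
    """Build a traceparent header value; reuse trace_id for downstream calls."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars, new per hop
    flags = "01" if sampled else "00"             # bit 0 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Return (trace_id, span_id, sampled) or None if malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, (int(flags, 16) & 0x01) == 1
```

A downstream call keeps the same `trace_id` but mints a fresh `span_id`, which is what stitches the hops into one trace.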
## Alerting
- Define alerts based on SLOs, not raw metrics. Alert on error budget burn rate, not arbitrary thresholds.
- Every alert must have a runbook link in its description.
- Alerts must be actionable — if you can't describe what to do when the alert fires, the alert is wrong.
- Use alert severity levels: `critical` (page on-call immediately), `warning` (notify, no page), `info` (dashboard only).
- Avoid alert fatigue — review and tune alerts monthly. Silence noisy alerts only with a documented fix date.
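Burn-rate alerting can be reduced to simple arithmetic: burn rate is the observed error rate divided by the error budget (1 − SLO), so 1.0 means spending the budget exactly on schedule. The multi-window check and the 14.4 threshold below follow the commonly cited fast-burn configuration for a 30-day 99.9% SLO with 1h/5m windows; treat those numbers as assumptions to tune per SLO.

```python
def burn_rate(error_rate, slo):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    budget = 1.0 - slo
    return error_rate / budget

def should_page(long_window_rate, short_window_rate, slo, threshold=14.4):
    """Multi-window fast-burn check: page only when BOTH the long and short
    windows are burning above threshold, which filters out brief blips."""
    return (burn_rate(long_window_rate, slo) > threshold
            and burn_rate(short_window_rate, slo) > threshold)
```

Requiring both windows to exceed the threshold keeps a 30-second spike from paging anyone while still catching a sustained outage within minutes.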
## Health Endpoints
- Every service must expose `/health/live` (is the process alive?) and `/health/ready` (is it ready to serve traffic?).
- Liveness: check process is running — no external dependencies.
- Readiness: check DB connections, critical downstream services, cache connections.
- Return HTTP 200 on healthy, 503 on unhealthy. Include a JSON body with component statuses.
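A readiness handler is essentially an aggregation over component checks. This framework-agnostic sketch (function and field names are assumptions) maps check results to the 200/503 contract with a JSON body of component statuses.

```python
import json

def readiness(checks):
    """Aggregate component checks into (http_status, json_body).

    `checks` maps component name -> zero-arg callable returning True if healthy
    (e.g. a DB ping, a cache ping, a downstream health probe).
    """
    components = {}
    for name, check in checks.items():
        try:
            components[name] = "ok" if check() else "unhealthy"
        except Exception:
            # a check that raises counts as unhealthy, never as a crash
            components[name] = "unhealthy"
    healthy = all(status == "ok" for status in components.values())
    body = json.dumps({
        "status": "ok" if healthy else "unhealthy",
        "components": components,
    })
    return (200 if healthy else 503), body
```

The liveness endpoint, by contrast, would return 200 unconditionally as long as the process can answer, with no external checks at all.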
## Anti-Patterns
- Do not mint service-local correlation IDs — an ID is only useful if it is propagated across every service boundary.
- Do not log at DEBUG level in production by default — it will flood your log storage.
- Do not create or register metric instruments inside hot loops — register once at startup, then increment the existing counter; aggregate locally and flush periodically if increments are extremely frequent.
- Do not ignore error metrics for background jobs — they are as critical as request errors.
- Do not store traces for more than 15-30 days — they are large; archive aggregated data instead.