Observability

Logging, request tracing, and usage reporting.

Where logs go

The app writes structured logs to stdout. On Azure App Service these are picked up by the App Service log stream and any attached log sink. Locally, they show up in your terminal.

The log level is controlled by LOG_LOGLEVEL for application packages (default DEBUG) and LOG_LOGLEVEL_3RDPARTY for third-party libraries (default WARNING). Both accept either a numeric level or a string ("debug", "info", "warning", "error", "critical").
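As a rough illustration of how the two variables might be interpreted, the sketch below parses a numeric or string level with the standard library's logging module. The function names and the assumption that the application's top-level package is called qfa are hypothetical; the actual configuration code may differ.

```python
import logging
import os


def parse_level(raw: str, default: int) -> int:
    """Turn '10', 'debug' or 'DEBUG' into a numeric logging level."""
    value = raw.strip()
    if not value:
        return default
    if value.isdigit():
        return int(value)
    level = logging.getLevelName(value.upper())
    # getLevelName returns a string such as "Level FOO" for unknown names.
    return level if isinstance(level, int) else default


def configure_levels(app_package: str = "qfa") -> None:
    """Apply LOG_LOGLEVEL to the app package and LOG_LOGLEVEL_3RDPARTY to the rest.

    The package name "qfa" is an assumption made for this sketch.
    """
    app_level = parse_level(os.environ.get("LOG_LOGLEVEL", ""), logging.DEBUG)
    third_party_level = parse_level(
        os.environ.get("LOG_LOGLEVEL_3RDPARTY", ""), logging.WARNING
    )
    # The root logger covers third-party libraries; the app logger overrides it.
    logging.getLogger().setLevel(third_party_level)
    logging.getLogger(app_package).setLevel(app_level)
```

Setting the level on the package logger rather than on each module keeps the override in one place, since child loggers inherit the nearest explicitly set level.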

Hard prohibitions

Some things must never appear in logs at any level:

  • Feedback record text / content

  • User-supplied prompt

  • The assembled system or user message sent to the LLM

  • LLM response text

  • API key values (protected by SecretStr)

The constants and helpers in qfa.utils make this easy to honour. When in doubt, log the character count or a hash, not the value.
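The "count or hash, not the value" rule could look something like the sketch below. The redacted function is a hypothetical helper written for illustration; it is not one of the helpers in qfa.utils, whose names are not listed here.

```python
import hashlib


def redacted(value: str) -> dict[str, object]:
    """Describe a sensitive string without exposing its content.

    Hypothetical helper: returns only the character count and a short
    SHA-256 prefix, which is enough to correlate identical inputs.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {"chars": len(value), "sha256_prefix": digest[:12]}


# Example: log the summary, never the prompt itself.
summary = redacted("some user-supplied prompt")
```

Note that an unsalted hash of a very short or predictable string can be guessed by brute force, so the character count alone is the more conservative choice for such values.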

Safe to log

Anything not on the prohibition list above is safe to log. The most useful fields are:

  • request_id (every response carries X-Request-ID)

  • tenant_id from the authenticated key

  • operation (analyze, summarize, …)

  • Record count, estimated tokens

  • Attempt numbers and retry reasons

  • Model name, latency, prompt_tokens, completion_tokens, cost

  • HTTP status codes

Request tracing

Every response includes an X-Request-ID header generated by the request-id middleware. The same ID appears in every log line for that request and in the error envelope’s request_id field, so a caller reporting a 502 can hand you the ID and you can grep the logs end-to-end.
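One common way to get the same ID into every log line is a context variable combined with a logging filter. The sketch below shows that pattern using only the standard library; it is an assumption about how such a middleware could work, not a description of the actual request-id middleware, and all names in it are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger = logging.getLogger("request_id_sketch")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger) -> dict[str, str]:
    """Stand-in for a middleware: set the ID, do the work, return the header."""
    request_id = uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("request started")
        return {"X-Request-ID": request_id}
    finally:
        # Restore the previous value so IDs never leak between requests.
        request_id_var.reset(token)
```

With a setup like this, searching the log output for the value of X-Request-ID returns every line emitted while that request was in flight.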

Usage queries

When DB_TRACK_USAGE=true, two endpoints expose aggregates over the llm_calls table:

  • GET /v1/usage — stats scoped to the caller’s tenant. Accepts from and to query parameters (ISO 8601, timezone-aware).

  • GET /v1/usage/all — stats across all tenants. Requires is_superuser=true.
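A caller might build the tenant-scoped query as in the sketch below, which rejects naive datetimes before they reach the API. The base URL and the usage_url function are placeholders; only the path and the from and to parameter names come from the description above.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def usage_url(base_url: str, start: datetime, end: datetime) -> str:
    """Build a GET /v1/usage URL with timezone-aware ISO 8601 bounds."""
    for value in (start, end):
        if value.tzinfo is None or value.utcoffset() is None:
            raise ValueError("from/to must be timezone-aware")
    # urlencode percent-escapes ":" and "+" in the ISO 8601 strings.
    query = urlencode({"from": start.isoformat(), "to": end.isoformat()})
    return f"{base_url.rstrip('/')}/v1/usage?{query}"


# Placeholder host; substitute the real deployment URL.
url = usage_url(
    "https://api.example.test",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 2, 1, tzinfo=timezone.utc),
)
```

Escaping matters here because an unescaped "+" in a UTC offset is commonly decoded as a space in query strings.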

The response contains counts and token totals per operation, plus simple latency distribution stats. Both endpoints are read-only and never mutate data.

If usage tracking is off entirely, both endpoints return 503 with code=usage_tracking_disabled. If it is on but the database is down, they return 503 with code=usage_backend_unavailable. The distinction matters: the first means "feature flag is off, redeploy needed"; the second means "transient, retry with backoff."
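On the client side, the two 503 codes call for different handling. The following sketch shows one way to encode that decision; the function names, the backoff parameters, and the assumption that the code value has already been extracted from the error envelope are all illustrative.

```python
from typing import Iterator

# Only the transient failure is worth retrying.
RETRYABLE_CODES = {"usage_backend_unavailable"}


def should_retry(status: int, code: str) -> bool:
    """Retry only a 503 whose code marks a transient backend outage."""
    return status == 503 and code in RETRYABLE_CODES


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at `cap`."""
    for attempt in range(attempts):
        yield min(cap, base * 2 ** attempt)
```

A caller would sleep for each yielded delay between attempts while should_retry returns True, and surface usage_tracking_disabled to an operator immediately, since retrying cannot turn the feature flag on.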

What’s not wired up yet

  • No APM, no metrics export (Prometheus / OpenTelemetry).

  • No log shipping to a central index — App Service log stream is it.

  • No alerting rules.

These are gaps to fill when operational maturity calls for them.