Cross-cutting concerns

Concerns that don’t belong to a single component; they show up at multiple layers.

Anonymisation round-trip

For every operation that reaches the LLM, the orchestrator wraps the user-facing text in an anonymise → call → de-anonymise sandwich:

sequenceDiagram
    participant route as Route handler
    participant orch as Orchestrator
    participant anon as PresidioAnonymizer
    participant llm as LLMPort

    route->>orch: analyze(request, deadline)
    orch->>anon: anonymize(user_message)
    anon-->>orch: (anonymised_text, mapping)
    orch->>llm: complete(system_message, anonymised_text, …)
    llm-->>orch: structured response
    orch->>anon: deanonymize(response_json, mapping)
    anon-->>orch: response_with_pii_restored
    orch-->>route: result

Notes:

The mapping lives in memory for the request and is discarded when the orchestrator method returns.
The de-anonymise step runs over the serialised response — substitutions are textual, so the round-trip is a string replacement, not a structured walk.
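
The round-trip above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the regex-based `anonymize` is a hypothetical stand-in for PresidioAnonymizer that only detects e-mail addresses, `fake_llm_complete` is a stub in place of LLMPort, and the `<EMAIL_n>` placeholder format is an assumption.

```python
import json
import re

# Hypothetical stand-in for PresidioAnonymizer: detects e-mail addresses only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def anonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each e-mail with a placeholder; return (text, placeholder -> original)."""
    mapping: dict[str, str] = {}
    seen: dict[str, str] = {}

    def _substitute(match: re.Match) -> str:
        original = match.group(0)
        if original not in seen:
            placeholder = f"<EMAIL_{len(seen) + 1}>"  # assumed placeholder format
            seen[original] = placeholder
            mapping[placeholder] = original
        return seen[original]

    return EMAIL_RE.sub(_substitute, text), mapping


def deanonymize(serialised: str, mapping: dict[str, str]) -> str:
    """Plain string replacement over the serialised response, not a structured walk."""
    for placeholder, original in mapping.items():
        serialised = serialised.replace(placeholder, original)
    return serialised


def fake_llm_complete(system_message: str, user_message: str) -> dict:
    """Stub LLM: echoes the (anonymised) user message in a structured response."""
    return {"summary": f"User wrote: {user_message}"}


def analyze(user_message: str) -> dict:
    anonymised_text, mapping = anonymize(user_message)
    response = fake_llm_complete("You are an analyst.", anonymised_text)
    restored = deanonymize(json.dumps(response), mapping)
    # The mapping is a local variable, so it is discarded when this function returns.
    return json.loads(restored)
```

One consequence of the textual approach: a restored value is spliced into serialised JSON as-is, so a real implementation would need to make sure originals containing quotes or backslashes do not break the document. The sketch sidesteps this by handling e-mail addresses only.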

Call context and usage tracking

qfa.services.call_context defines a ContextVar[CallContext | None] and a call_scope(tenant_id, operation) async context manager. Every public orchestrator method enters a scope; the TrackingLLMAdapter reads the context to attribute each LLM call to a tenant and operation.

Consequence: any new code path that calls LLMPort.complete outside an orchestrator method will raise MissingCallScopeError at runtime. The tracking adapter refuses to record a call it cannot attribute to a tenant and operation.

Deadlines, timeouts, retries

Layer	Concern	Mechanism
Route handler	Per-request deadline	`deadline = now(UTC) + 120s`, passed as an absolute `datetime` into the orchestrator
Orchestrator	Deadline check	Before each LLM call: if remaining time is negative, raise `AnalysisTimeoutError`
Adapter (`LiteLLMClient`)	Retry on transient errors	`tenacity.retry` with exponential backoff (1s→10s, 60s budget) for `LLMTimeoutError` and `LLMRateLimitError`
Adapter (`LiteLLMClient`)	Per-call timeout	Passed through to `litellm.acompletion(timeout=…)`
Adapter (`LiteLLMClient`)	Token budget guard	Estimates `len(text) / chars_per_token`; raises `FeedbackTooLargeError` if over `LLM_MAX_TOTAL_TOKENS`

Retry policy and token budget belong to the adapter because both are model-specific (different LiteLLM-routed models have different context windows and rate-limit behaviour).
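
The deadline check and the token budget guard from the table can be sketched as follows. The exception names and the `len(text) / chars_per_token` estimate come from the table; the helper names (`new_deadline`, `check_deadline`, `check_token_budget`) and the concrete values for `CHARS_PER_TOKEN` and `LLM_MAX_TOTAL_TOKENS` are assumptions for illustration. The retry layer is left out because it depends on tenacity.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class AnalysisTimeoutError(Exception):
    pass


class FeedbackTooLargeError(Exception):
    pass


REQUEST_BUDGET = timedelta(seconds=120)  # per-request deadline from the table
CHARS_PER_TOKEN = 4                      # assumed value
LLM_MAX_TOTAL_TOKENS = 1000              # assumed value


def new_deadline(now: Optional[datetime] = None) -> datetime:
    """Route handler: compute an absolute UTC deadline once per request."""
    now = now or datetime.now(timezone.utc)
    return now + REQUEST_BUDGET


def check_deadline(deadline: datetime, now: Optional[datetime] = None) -> float:
    """Orchestrator: call before each LLM request; returns seconds remaining."""
    now = now or datetime.now(timezone.utc)
    remaining = (deadline - now).total_seconds()
    if remaining < 0:
        raise AnalysisTimeoutError("request deadline exceeded")
    return remaining


def check_token_budget(
    text: str,
    chars_per_token: float = CHARS_PER_TOKEN,
    max_tokens: int = LLM_MAX_TOTAL_TOKENS,
) -> float:
    """Adapter: cheap character-based estimate, no tokenizer involved."""
    estimated = len(text) / chars_per_token
    if estimated > max_tokens:
        raise FeedbackTooLargeError(f"estimated {estimated:.0f} tokens")
    return estimated
```

Passing an absolute `datetime` rather than a duration means every layer measures against the same instant, so time spent in earlier steps is accounted for automatically.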

Error → HTTP mapping

The exception handlers in qfa.api.app translate domain errors into HTTP responses:

Exception	HTTP	`error.code`
Missing / invalid bearer token	401	`authentication_required`
Pydantic validation failure	422	`validation_error`
`FeedbackTooLargeError`	413	`payload_too_large`
`AnalysisTimeoutError`	504	`analysis_timeout`
`AnalysisError` (with “injection” in message)	422	`prompt_injection_detected`
`AnalysisError` (other)	502	`analysis_unavailable`
`LLMError`	502	`llm_unavailable`
`UsageRepositoryUnavailableError`	503	`usage_backend_unavailable`
Usage tracking disabled	503	`usage_tracking_disabled`
Unhandled `Exception`	500	`internal_error`

All responses share the same envelope shape with a server-generated request_id.
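
As a rough sketch, part of the table can be expressed as a plain function, independent of any web framework. The status codes and `error.code` values are taken from the table; the exception class hierarchy, the case-insensitive "injection" check, and the exact envelope fields (`error.message` in particular) are assumptions, since the real handlers in qfa.api.app are not shown here.

```python
import uuid
from typing import Optional


class FeedbackTooLargeError(Exception):
    pass


class AnalysisError(Exception):
    pass


class AnalysisTimeoutError(Exception):
    pass


class LLMError(Exception):
    pass


def map_exception(exc: Exception) -> tuple[int, str]:
    """Return (HTTP status, error.code) for a domain exception."""
    if isinstance(exc, FeedbackTooLargeError):
        return 413, "payload_too_large"
    if isinstance(exc, AnalysisTimeoutError):
        return 504, "analysis_timeout"
    if isinstance(exc, AnalysisError):
        # Assumption: the match on the message is case-insensitive.
        if "injection" in str(exc).lower():
            return 422, "prompt_injection_detected"
        return 502, "analysis_unavailable"
    if isinstance(exc, LLMError):
        return 502, "llm_unavailable"
    return 500, "internal_error"


def error_envelope(
    exc: Exception, request_id: Optional[str] = None
) -> tuple[int, dict]:
    """Build the shared envelope; the field layout is an assumed shape."""
    status, code = map_exception(exc)
    body = {
        "error": {"code": code, "message": str(exc)},
        "request_id": request_id or str(uuid.uuid4()),  # server-generated
    }
    return status, body
```

Checking the more specific exceptions first keeps the mapping correct even if some of these classes turn out to inherit from one another.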

Logging policy

Hard prohibitions — never log at any level:

Feedback record content (record.text / record.content)
User prompt (request.prompt) — log the character count instead
Assembled system or user messages sent to the LLM
LLM response text
API key values (protected by SecretStr)

Safe to log: request_id, tenant_id, operation, record counts, estimated tokens, attempt numbers, model name, durations, HTTP status codes, prompt_tokens, completion_tokens, cost.
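
One way to follow the policy is to build log fields through a helper that only ever receives sizes and identifiers. The sketch below is illustrative: the function names, the logger name and the `prompt_chars` / `record_count` field names are assumptions, not the project's actual log schema.

```python
import logging

logger = logging.getLogger("qfa.example")  # hypothetical logger name


def safe_log_fields(
    request_id: str,
    tenant_id: str,
    operation: str,
    prompt: str,
    records: list[str],
) -> dict:
    """Reduce sensitive inputs to counts before anything reaches the logger."""
    return {
        "request_id": request_id,
        "tenant_id": tenant_id,
        "operation": operation,
        "prompt_chars": len(prompt),   # character count, never the prompt itself
        "record_count": len(records),  # count, never record content
    }


def log_request(
    request_id: str,
    tenant_id: str,
    operation: str,
    prompt: str,
    records: list[str],
) -> dict:
    fields = safe_log_fields(request_id, tenant_id, operation, prompt, records)
    logger.info("analysis request received %s", fields)
    return fields
```

Routing every log call through such a helper makes the prohibition easy to review: the raw prompt and record text never appear as arguments to the logger.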

See Observability for what each log statement looks like at runtime.