Cross-cutting concerns#

Concerns that do not belong to a single component and show up at multiple layers.

Anonymisation round-trip#

For every operation that reaches the LLM, the orchestrator wraps the user-facing text in an anonymise → call → de-anonymise sandwich:

sequenceDiagram
    participant route as Route handler
    participant orch as Orchestrator
    participant anon as PresidioAnonymizer
    participant llm as LLMPort

    route->>orch: analyze(request, deadline)
    orch->>anon: anonymize(user_message)
    anon-->>orch: (anonymised_text, mapping)
    orch->>llm: complete(system_message, anonymised_text, …)
    llm-->>orch: structured response
    orch->>anon: deanonymize(response_json, mapping)
    anon-->>orch: response_with_pii_restored
    orch-->>route: result
    

Notes:

  • The mapping lives in memory for the request and is discarded when the orchestrator method returns.

  • The de-anonymise step runs over the serialised response — substitutions are textual, so the round-trip is a string replacement, not a structured walk.
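The round-trip in the diagram can be sketched in a few lines. This is a minimal illustration, not the project's `PresidioAnonymizer`: the `<PII_n>` placeholder format, the `pii_values` list (standing in for whatever Presidio detects) and the `llm_complete` callable are all assumptions made for the example.

```python
import json
from typing import Callable


def anonymize(text: str, pii_values: list[str]) -> tuple[str, dict[str, str]]:
    """Swap each known PII value for a placeholder and return the mapping.

    `pii_values` is a hypothetical stand-in for entity detection.
    """
    mapping: dict[str, str] = {}
    for index, value in enumerate(pii_values):
        placeholder = f"<PII_{index}>"  # assumed placeholder format
        if value in text:
            text = text.replace(value, placeholder)
            mapping[placeholder] = value
    return text, mapping


def deanonymize(serialised: str, mapping: dict[str, str]) -> str:
    """Restore PII by plain string replacement over the serialised response."""
    for placeholder, original in mapping.items():
        serialised = serialised.replace(placeholder, original)
    return serialised


def analyze(
    user_message: str,
    pii_values: list[str],
    llm_complete: Callable[[str], dict],
) -> dict:
    """Anonymise, call the model, then de-anonymise the serialised result."""
    anonymised, mapping = anonymize(user_message, pii_values)
    response = llm_complete(anonymised)  # the model only sees placeholders
    restored = deanonymize(json.dumps(response), mapping)
    # `mapping` goes out of scope here, so it never outlives the call.
    return json.loads(restored)
```

Because the substitution is textual, an original value containing a double quote or backslash would need JSON escaping before being spliced back in; the sketch ignores that case.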

Call context and usage tracking#

qfa.services.call_context defines a ContextVar[CallContext | None] and a call_scope(tenant_id, operation) async context manager. Every public orchestrator method enters a scope; the TrackingLLMAdapter reads the context to attribute each LLM call to a tenant and operation.

Consequence: any new code path that calls LLMPort.complete outside an orchestrator method will raise MissingCallScopeError at runtime. The tracking adapter refuses to record a call that has no tenant and operation attached.

Deadlines, timeouts, retries#

| Layer | Concern | Mechanism |
| --- | --- | --- |
| Route handler | Per-request deadline | deadline = now(UTC) + 120s, passed as an absolute datetime into the orchestrator |
| Orchestrator | Deadline check | Before each LLM call: if remaining time is negative, raise AnalysisTimeoutError |
| Adapter (LiteLLMClient) | Retry on transient errors | tenacity.retry with exponential backoff (1s→10s, 60s budget) for LLMTimeoutError and LLMRateLimitError |
| Adapter (LiteLLMClient) | Per-call timeout | Passed through to litellm.acompletion(timeout=…) |
| Adapter (LiteLLMClient) | Token budget guard | Estimates len(text) / chars_per_token; raises FeedbackTooLargeError if over LLM_MAX_TOTAL_TOKENS |
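The first two rows, an absolute deadline set by the route handler and checked by the orchestrator, might look roughly like this. The function names and the injectable `now` parameter are assumptions for the sketch; only the 120-second budget and `AnalysisTimeoutError` come from the table.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REQUEST_BUDGET = timedelta(seconds=120)


class AnalysisTimeoutError(Exception):
    """Raised when the request deadline has already passed."""


def new_deadline(now: Optional[datetime] = None) -> datetime:
    """Route handler side: compute an absolute UTC deadline."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now + REQUEST_BUDGET


def check_deadline(deadline: datetime, now: Optional[datetime] = None) -> float:
    """Orchestrator side: call before each LLM call.

    Returns the remaining seconds, or raises if the remaining time is negative.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    remaining = (deadline - now).total_seconds()
    if remaining < 0:
        raise AnalysisTimeoutError("request deadline exceeded")
    return remaining
```

Passing an absolute `datetime` rather than a duration means every layer measures against the same instant, regardless of how much time earlier steps consumed.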

Retry policy and token budget belong to the adapter because both are model-specific (different LiteLLM-routed models have different context windows and rate-limit behaviour).
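To make the two adapter-level mechanisms concrete, here is a rough sketch of the token estimate and of the delay schedule implied by "1s→10s, 60s budget". It is hand-rolled for illustration and does not use tenacity. The default of 4 characters per token and the reading of the budget as a cap on summed wait time are assumptions; the real retry budget may also count time spent inside each call.

```python
class FeedbackTooLargeError(Exception):
    """Raised when the estimated token count exceeds the configured budget."""


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Cheap estimate: character count divided by an assumed ratio."""
    return len(text) / chars_per_token


def guard_token_budget(
    text: str, max_total_tokens: int, chars_per_token: float = 4.0
) -> float:
    """Reject oversized input before any model call is made."""
    estimated = estimate_tokens(text, chars_per_token)
    if estimated > max_total_tokens:
        raise FeedbackTooLargeError(
            f"estimated {estimated:.0f} tokens exceeds {max_total_tokens}"
        )
    return estimated


def backoff_delays(
    initial: float = 1.0, maximum: float = 10.0, budget: float = 60.0
) -> list[float]:
    """Doubling delays capped at `maximum`, stopping before the summed
    waits would exceed `budget` (call durations are ignored here)."""
    delays: list[float] = []
    delay, total = initial, 0.0
    while total + delay <= budget:
        delays.append(delay)
        total += delay
        delay = min(delay * 2, maximum)
    return delays
```

With the defaults, `backoff_delays()` yields waits of 1, 2, 4, 8, 10, 10, 10 and 10 seconds, for 55 seconds of waiting in total.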

Error → HTTP mapping#

The exception handlers in qfa.api.app translate domain errors into HTTP responses:

| Exception | HTTP | error.code |
| --- | --- | --- |
| Missing / invalid bearer token | 401 | authentication_required |
| Pydantic validation failure | 422 | validation_error |
| FeedbackTooLargeError | 413 | payload_too_large |
| AnalysisTimeoutError | 504 | analysis_timeout |
| AnalysisError (with “injection” in message) | 422 | prompt_injection_detected |
| AnalysisError (other) | 502 | analysis_unavailable |
| LLMError | 502 | llm_unavailable |
| UsageRepositoryUnavailableError | 503 | usage_backend_unavailable |
| Usage tracking disabled | 503 | usage_tracking_disabled |
| Unhandled Exception | 500 | internal_error |

All responses share the same envelope shape with a server-generated request_id.
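The mapping for the named exception classes can be sketched framework-free. The exception classes below are local stand-ins with no inheritance between them, the envelope layout (`error.code` plus `request_id`) is inferred from the column header rather than confirmed, and the case-sensitive match on "injection" is an assumption. The authentication, validation and tracking-disabled rows are omitted because the source does not name their exception types.

```python
import uuid


class AnalysisError(Exception): ...
class AnalysisTimeoutError(Exception): ...
class FeedbackTooLargeError(Exception): ...
class LLMError(Exception): ...
class UsageRepositoryUnavailableError(Exception): ...


def classify(exc: Exception) -> tuple[int, str]:
    """Translate a domain error into (HTTP status, error.code)."""
    if isinstance(exc, FeedbackTooLargeError):
        return 413, "payload_too_large"
    if isinstance(exc, AnalysisTimeoutError):
        return 504, "analysis_timeout"
    if isinstance(exc, AnalysisError):
        # The source distinguishes the two cases by message content.
        if "injection" in str(exc):
            return 422, "prompt_injection_detected"
        return 502, "analysis_unavailable"
    if isinstance(exc, LLMError):
        return 502, "llm_unavailable"
    if isinstance(exc, UsageRepositoryUnavailableError):
        return 503, "usage_backend_unavailable"
    return 500, "internal_error"  # anything unhandled


def error_envelope(exc: Exception) -> tuple[int, dict]:
    """Build the response body with a server-generated request_id."""
    status, code = classify(exc)
    body = {"error": {"code": code}, "request_id": str(uuid.uuid4())}
    return status, body
```

Checking the more specific classes first keeps the mapping correct even if the real hierarchy makes some of them subclasses of `AnalysisError`.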

Logging policy#

Hard prohibitions — never log at any level:

  • Feedback record content (record.text / record.content)

  • User prompt (request.prompt) — log the character count instead

  • Assembled system or user messages sent to the LLM

  • LLM response text

  • API key values (protected by SecretStr)

Safe to log: request_id, tenant_id, operation, record counts, estimated tokens, attempt numbers, model name, durations, HTTP status codes, prompt_tokens, completion_tokens, cost.
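One way to follow this policy is to build log fields from an explicit allow-list, so that prompt text never reaches the logger. The helper names and the `prompt_chars` field name below are hypothetical; the example uses only the standard `logging` module.

```python
import logging

logger = logging.getLogger("qfa.example")  # hypothetical logger name


def safe_log_fields(
    request_id: str,
    tenant_id: str,
    operation: str,
    prompt: str,
    record_count: int,
) -> dict:
    """Return only fields that are safe to log.

    The prompt itself is dropped; only its character count is kept.
    """
    return {
        "request_id": request_id,
        "tenant_id": tenant_id,
        "operation": operation,
        "prompt_chars": len(prompt),
        "record_count": record_count,
    }


def log_request(
    request_id: str,
    tenant_id: str,
    operation: str,
    prompt: str,
    record_count: int,
) -> dict:
    """Emit one structured log line and return the fields that were logged."""
    fields = safe_log_fields(request_id, tenant_id, operation, prompt, record_count)
    logger.info("analysis requested", extra=fields)
    return fields
```

Constructing the field dictionary in one place makes the policy easy to review: anything not listed there cannot be logged by accident.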

See Observability for what each log statement looks like at runtime.