Runbooks¶
Short symptom → check → mitigate notes for common production incidents. Pair with Troubleshooting for local/dev issues and Deployment for probes and Kubernetes.
503 on GET /api/readyz¶
Symptom: Load balancer or user sees 503 from /api/readyz; Kubernetes may mark the pod NotReady.
Meaning: When FLUXLIT_STREAMLIT_UPSTREAM is set, readiness requires HTTP 2xx from GET on the Streamlit upstream root. Anything else (connection refused, timeout, 4xx/5xx/3xx) returns 503.
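The rule above can be reproduced outside the gateway when you want to test an upstream by hand. The following is a minimal sketch, not FluxLit's implementation: upstream_is_ready and probe_upstream are hypothetical names, and the sketch assumes the probe does not follow redirects, so a 3xx is reported as a failure.

```python
import urllib.error
import urllib.request
from typing import Optional


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so a 3xx is seen as-is."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def upstream_is_ready(upstream_status: Optional[int]) -> bool:
    """Apply the readiness rule: only an HTTP 2xx counts as ready.

    None stands for "no HTTP response" (connection refused, timeout).
    """
    return upstream_status is not None and 200 <= upstream_status < 300


def probe_upstream(url: str, timeout: float = 2.0) -> Optional[int]:
    """GET the upstream root and return its status code, or None on failure."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 3xx/4xx/5xx arrive here when redirects are not followed
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout
```

For example, upstream_is_ready(probe_upstream("http://127.0.0.1:8501/")) would be False while Streamlit is still starting; the address here is only a placeholder for whatever FLUXLIT_STREAMLIT_UPSTREAM holds.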
Checks
curl -v http://<pod-ip>:8000/api/readyz, then read the detail field in the JSON response.
curl -v http://<upstream-from-env>/, the same URL the probe uses (root path on Streamlit).
Gateway / Streamlit logs: slow start, crash loop, wrong upstream URL after reload.
fluxlit doctor <target> in a shell with the same env (if you can exec into the container).
Mitigations
Allow more startup time: raise the readiness probe's initialDelaySeconds, or reduce Streamlit startup cost.
Fix a wrong upstream (stale FLUXLIT_STREAMLIT_UPSTREAM / state file after restart).
If Streamlit is intentionally down for maintenance, accept the NotReady traffic drain or use a separate maintenance mode.
Blank Streamlit UI (white page, spinner forever)¶
Symptom: /api/healthz is 200 but / or /_stcore/... shows blank or endless loading.
Checks
Browser devtools Network tab: failed /_stcore/stream WebSocket or 4xx/5xx on static assets.
FLUXLIT_ROOT_PATH: must match the browser-visible prefix; a missing or wrong path breaks asset URLs (Configuration).
FLUXLIT_TRUST_PROXY: if behind TLS or subpath termination, scheme/host headers must match the public URL.
Gateway logs with FLUXLIT_ENABLE_GATEWAY_ACCESS_LOG=1 and request_id correlation (Observability).
Mitigations
Align root_path, proxy X-Forwarded-* headers, and nginx location blocks; see the docker/proxy-deployment/ smoke tests and Production TLS and edge headers.
Verify WebSocket upgrade through the edge proxy (timeouts, buffering).
WebSocket failures behind nginx / Traefik¶
Symptom: Streamlit disconnects, “Connection error”, or WS closes immediately.
Checks
Proxy Upgrade and Connection headers are passed through to the FluxLit port.
Idle / read timeouts on the proxy are longer than the Streamlit/WebSocket heartbeats; align with FLUXLIT_GATEWAY_WS_* settings (Configuration).
Subpath: the WebSocket URL must include the same prefix as HTTP (for example /myapp/_stcore/stream).
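To compare against what browser devtools shows, the expected WebSocket URL can be derived from the public origin and the subpath. This is a minimal sketch under stated assumptions: websocket_stream_url is a hypothetical helper, example.com is a placeholder host, and the inputs are assumed to be the values you configured for FLUXLIT_PUBLIC_BASE_URL and FLUXLIT_ROOT_PATH.

```python
from urllib.parse import urlsplit, urlunsplit


def websocket_stream_url(public_base_url: str, root_path: str = "") -> str:
    """Build the browser-visible Streamlit WebSocket URL under a subpath."""
    parts = urlsplit(public_base_url)
    # Pages served over https must use wss; plain http pages use ws.
    scheme = "wss" if parts.scheme == "https" else "ws"
    # Normalise "/myapp", "myapp/" and "myapp" to the same "/myapp" prefix.
    stripped = root_path.strip("/")
    prefix = f"/{stripped}" if stripped else ""
    return urlunsplit((scheme, parts.netloc, f"{prefix}/_stcore/stream", "", ""))
```

If the URL the browser actually opens lacks the prefix or uses ws:// on an https page, the mismatch is in the root path or proxy header configuration, not in Streamlit itself.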
Mitigations
Use the repo’s proxy-deployment nginx configs as a reference; increase proxy_read_timeout (nginx) or the equivalent setting on your proxy.
For TLS termination at the edge, confirm that wss:// and the certificates match what the browser uses.
Auth misconfig (401 / 403, login loops, fluxlit doctor FAIL)¶
Symptom: API returns 401/403; OIDC redirect errors; doctor reports JWT/auth FAIL.
Checks
pip install "fluxlit[auth]" in the image if using JWT/OIDC helpers.
Env: FLUXLIT_JWT_*, FLUXLIT_PUBLIC_BASE_URL, FLUXLIT_OIDC_*, clock skew (NTP).
FLUXLIT_INTERNAL_API_BASE is still loopback-safe and consistent with api_mount_path.
BFF / OIDC: the in-memory state store requires a single replica or an externalized store (Security architecture).
Mitigations
Fix issuer/audience/JWKS URL; rotate secrets per Secrets lifecycle.
For subpath deployments, set FLUXLIT_ROOT_PATH and FLUXLIT_PUBLIC_BASE_URL to the public origin.
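When chasing an issuer, audience, or clock-skew mismatch, it can help to look at the claims of a rejected token directly. The sketch below is a debugging aid only, not part of FluxLit: unverified_claims and claim_problems are hypothetical helpers, the 60-second leeway is an arbitrary assumption, and the signature is deliberately not checked, so never use this for authorization.

```python
import base64
import json
import time
from typing import Optional


def unverified_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (debugging only)."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def claim_problems(claims: dict, issuer: str, audience: str,
                   now: Optional[float] = None, leeway: float = 60.0) -> list:
    """List mismatches between token claims and the expected configuration."""
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != issuer:
        problems.append(f"iss {claims.get('iss')!r} != expected {issuer!r}")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if audience not in audiences:
        problems.append(f"aud {aud!r} does not include {audience!r}")
    if "exp" in claims and claims["exp"] + leeway < now:
        problems.append("token expired (check clock skew)")
    if "nbf" in claims and claims["nbf"] - leeway > now:
        problems.append("token not yet valid (check clock skew)")
    return problems
```

An empty list means the claims agree with the expected issuer and audience at the given time; a remaining 401/403 then points at signature or JWKS problems instead.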
Multi-replica: new Streamlit session after refresh¶
Symptom: Users report losing UI state or “starting over” after a refresh (F5), or after intermittent 503s / reconnects, but only when more than one FluxLit replica is behind the load balancer.
Meaning: Each replica has its own Streamlit process and in-memory st.session_state. Without sticky routing or a shared store (URL session + external SessionStore, or app-level persistence), the next request may hit a different replica.
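The effect is easy to see in a toy model. The sketch below is illustrative only: Replica stands in for a pod with its own in-memory state, round_robin models a load balancer without affinity, and sticky models affinity with a simple hash of the session id (real load balancers typically use a cookie, client IP, or connection instead).

```python
import hashlib
from itertools import cycle


class Replica:
    """Stands in for one pod with its own in-memory session state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sessions: dict = {}

    def handle(self, session_id: str) -> int:
        # Count requests seen for this session on this replica only.
        self.sessions[session_id] = self.sessions.get(session_id, 0) + 1
        return self.sessions[session_id]


def round_robin(replicas):
    """No affinity: each request goes to the next replica in turn."""
    ring = cycle(replicas)
    return lambda session_id: next(ring)


def sticky(replicas):
    """Affinity: the same session id always maps to the same replica."""
    def pick(session_id: str):
        digest = hashlib.sha256(session_id.encode()).digest()
        return replicas[digest[0] % len(replicas)]
    return pick


def counts_seen(route, session_id: str, requests: int = 4) -> list:
    """Send several requests and collect the per-replica counter each returns."""
    return [route(session_id).handle(session_id) for _ in range(requests)]
```

With two replicas, counts_seen(round_robin(...), "u1") gives [1, 1, 2, 2] because the state is split across pods, while counts_seen(sticky(...), "u1") gives [1, 2, 3, 4] because every request reaches the same in-memory state.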
Checks
Confirm replica count > 1 and whether the LB uses affinity (cookie / IP / connection).
If using FluxLit URL-session helpers, verify the store is shared across replicas (not an InMemorySessionStore per pod).
For OIDC BFF, confirm you are not relying on single-replica in-memory state without affinity; see Security architecture.
Mitigations
Add session affinity on the Service or ingress (see examples/kubernetes/service-session-affinity.example.yaml in the repo), or
move continuity data to an external SessionStore or database; see URL session continuity (no cookies) and Deployment (scaling checklist).
Scripted load and chaos (repository)¶
Repeatable scripts live under scripts/ in the repository:
| Script | Role |
|---|---|
| | Many GETs with |
| | Many GETs on |
| | SIGTERM → gateway exits within a bounded window. |
| | Kill Streamlit child → parent exits. |
| | Timeout, 413, and WebSocket drop behaviors. |
| | Many GETs on |
Run ./scripts/run_smoke_app.sh (or your app) in one terminal, then point BASE_URL at it. For CI-style signals, pair with Observability (metrics, gateway logs) and Deployment (readiness).
Soak methodology and baselines¶
Soak scripts are relative measurements: they report request counts, HTTP status mix (for soak_readyz.sh), and p50/p95/p99 latency in milliseconds over COUNT iterations. They do not assert absolute SLOs — operators should compare runs on the same machine class (for example the same Kubernetes node pool or CI runs-on label) and record commit SHA + date when publishing reference numbers.
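When comparing runs, it helps to know how such a latency summary can be computed. The sketch below uses the nearest-rank percentile method; that choice is an assumption for illustration, and the repository scripts may compute percentiles differently. The names percentile and summarize are hypothetical.

```python
import math


def percentile(samples_ms, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms) -> dict:
    """Return the count and p50/p95/p99 latencies for one run."""
    return {
        "count": len(samples_ms),
        "p50_ms": percentile(samples_ms, 50),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
    }
```

Because p99 over a small COUNT is dominated by one or two slow requests, compare tail percentiles only between runs with the same COUNT.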
What to record: COUNT, BASE_URL, PATH_SUFFIX, relevant FLUXLIT_* flags (metrics, proxy trust), CPU/memory snapshot if available, and the script’s final summary line or OUTPUT_FORMAT=json payload.
Under load, FluxLit emits: gateway access log extras (when enabled) with stable keys listed in Support matrix; gateway RED metrics (when FLUXLIT_ENABLE_GATEWAY_PROMETHEUS_METRICS=1); DEBUG lines on fluxlit.gateway for histogram observe failures (request still completes). Not emitted: USE-style host saturation metrics from FluxLit core — scrape the node or cAdvisor (Observability).
Correlation limits: request_id ties gateway access logs to the internal hop; Streamlit runs in a separate process, so its logs carry the same id only where the runtime forwards it. See the correlation section in Observability.