graph TB
subgraph internet["🌐 Public Internet — *.dloizides.com (real LE TLS)"]
users["End users / testers"]
end
subgraph prod["🟢 PROD — Hetzner prod (4 vCPU / 8 GB · k3s) — CPU req 94% / mem 83%"]
ingress["Traefik ingress + LE certs
ALL public URLs terminate here"]
subgraph prodapis["Product APIs (+ BFF) → shared saas-db"]
kefiapi["kefi-api / bff-kefi"]
tenant["tenant-api"]
quest["questioner-api"]
onlinemenu["onlinemenu-api"]
content["content-api"]
payment["payment-api"]
notif["notification-api (MT consumer)"]
end
subgraph prodweb["Web / SPA / static"]
kefiweb["kefi-web SPA · kefi-marketing"]
kefiland["kefi-landings (vestigial/crashloop)"]
erevna["erevna-web · katalogos-web"]
statics["~35 static apps/games · HQ"]
end
kc["keycloak + keycloak-db"]
subgraph prodstate["Stateful infra (hibernate-able → 0/0)"]
saasdb[("saas-db (shared; *-db aliases)")]
rmq["rabbitmq"]
sw["seaweedfs (S3)"]
maddy["maddy (mail)"]
pouenidb[("poueni-postgres")]
dynalux[("dynalux-db")]
end
amlmkt["aml-marketing (static landing — stays)"]
promagent["prometheus AGENT (remote-write)"]
pubjob["kefi publish Job: build-site→kaniko→rollout
CPU req 500m→200m (now schedules)"]
gproxy["selector-less Service + EndpointSlice
grafana.* · analytics.* · aml-* → staging NodePort"]
end
subgraph staging["🟡 STAGING — staging (WireGuard-only · 8 vCPU / 16 GB)"]
registry["Docker registry (staging)"]
graf["Grafana + Prometheus(store) + Loki + Alertmanager"]
umami["Umami + umami-db"]
amlstg["AML (migrated 2026-06-28): aml-screening+aml-postgres ·
aml-identity+identity-postgres · aml-ner — NodePort"]
swstg["seaweedfs (E2E report store)"]
subgraph e2e["E2E runners (CronJobs) — run HERE, TARGET prod"]
e2ekefi["kefi-lifecycle 03:00 (deadline 1800)"]
e2epoueni["poueni-reset 03:40"]
e2efull["full suite 04:00"]
end
end
users -->|HTTPS| ingress
ingress --> prodapis
ingress --> prodweb
ingress --> kc
ingress -->|analytics.* · grafana.* · aml-*| gproxy
gproxy -.->|WireGuard ~100ms| umami
gproxy -.->|WireGuard ~100ms| graf
gproxy -.->|WireGuard ~100ms| amlstg
promagent -.->|remote-write WG| graf
kefiapi -->|create Job| pubjob
pubjob -->|push image| registry
e2e -.->|test over public DNS| ingress
e2e -->|reports| swstg
prod -.->|pull images| registry
classDef off fill:#3d1414,stroke:#f85149,color:#ffd7d5;
classDef fix fill:#13331b,stroke:#3fb950,color:#aff5b4;
class kefiland off;
class pubjob fix;
class amlstg fix;
saas-db
grafana.dloizides.com proxied from prod
analytics.dloizides.com proxied from prod
aml-screening/identity.dloizides.com proxied from prod (via NodePort)
The publish build pods requested 500m CPU, but only ~250m was free, so they stayed Pending and the sites were never built. Fix: lowered the build-site and kaniko CPU requests to 200m and restarted kefi-api. Verified: the build completes in 39s and the lifecycle E2E passes.
E2E runner fixes: re-spaced the schedules to avoid canary-lock races, raised the kefi deadline from 900s to 1800s, pinned images, and set imagePullPolicy: IfNotPresent. The full prod suite is running now to produce a complete green/red result.
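The spacing between runs can be checked with a little arithmetic. The sketch below is illustrative only: it takes the start times from the diagram (kefi-lifecycle at 03:00, poueni-reset at 03:40), treats the kefi deadline as a hard upper bound on run time, and assumes both times are in the same time zone. Whether overlapping runs are what caused the canary-lock races is an assumption here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Start times taken from the diagram; only kefi-lifecycle has a stated deadline.
KEFI_START = "03:00"
NEXT_START = "03:40"  # poueni-reset


def latest_end(start_hhmm: str, deadline_seconds: int) -> datetime:
    """Latest moment a run can still be active, given its start and deadline."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    return start + timedelta(seconds=deadline_seconds)


def gap_minutes(start_hhmm: str, deadline_seconds: int, next_hhmm: str) -> float:
    """Minutes between one run's latest possible end and the next run's start."""
    next_start = datetime.strptime(next_hhmm, "%H:%M")
    end = latest_end(start_hhmm, deadline_seconds)
    return (next_start - end).total_seconds() / 60


# Compare the old (900s) and new (1800s) kefi deadlines.
for deadline in (900, 1800):
    print(f"deadline {deadline}s -> gap {gap_minutes(KEFI_START, deadline, NEXT_START)} min")
```

Under these assumptions the 1800s deadline still leaves a 10-minute gap before poueni-reset starts, compared with 25 minutes at the old 900s deadline.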
CPU requests are now the tight resource on prod, like memory before it. The 200m publish fix only buys margin: if CPU tightens further, publish breaks again. Strategic lever: offload AML to staging (same proxy pattern as Umami), gated on the test window.
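How thin that margin is can be estimated from the figures in the diagram. This is a rough sketch, not a measurement: it assumes the node's allocatable CPU equals the full 4 vCPU (real nodes reserve some for the system) and takes the "CPU req 94%" figure as the baseline before the publish Job is scheduled. The helper names are hypothetical.

```python
# Assumption: allocatable CPU equals the full 4 vCPU (4000m); real nodes reserve some.
ALLOCATABLE_MILLICORES = 4 * 1000
REQUESTED_FRACTION = 0.94  # "CPU req 94%" from the prod subgraph title


def free_millicores(allocatable: int, requested_fraction: float) -> int:
    """CPU requests still available to the scheduler, in millicores."""
    return round(allocatable * (1 - requested_fraction))


def fits(request_millicores: int, free: int) -> bool:
    """A pod can be scheduled on CPU only if its request fits in the free requests."""
    return request_millicores <= free


free = free_millicores(ALLOCATABLE_MILLICORES, REQUESTED_FRACTION)
print(f"free: {free}m")
print(f"500m fits: {fits(500, free)}")
print(f"200m fits: {fits(200, free)} (margin {free - 200}m)")
```

With these inputs the free pool is 240m, which is in line with the roughly 250m quoted above: a 500m request cannot be placed, while a 200m request fits with only about 40m to spare.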
Products showing "Running but not Ready" point to a stateful workload scaled to 0/0. First check: kubectl get statefulset -n dloizides, then scale the DB/broker back to 1 (DB first).
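The DB-first ordering can be sketched as a small helper that only prints the commands rather than running them. The workload names below are taken from the diagram's "Stateful infra" labels; whether they are the actual StatefulSet names in the cluster is an assumption, as is the db/broker classification.

```python
# Hypothetical names and kinds, based on the diagram's "Stateful infra" labels.
STATEFUL = {
    "rabbitmq": "broker",
    "saas-db": "db",
}

# Lower number scales first: databases before brokers.
ORDER = {"db": 0, "broker": 1}


def scale_up_commands(workloads: dict[str, str], namespace: str = "dloizides") -> list[str]:
    """Return kubectl scale commands, databases first, then brokers."""
    ordered = sorted(workloads, key=lambda name: (ORDER[workloads[name]], name))
    return [
        f"kubectl scale statefulset {name} --replicas=1 -n {namespace}"
        for name in ordered
    ]


for command in scale_up_commands(STATEFUL):
    print(command)
```

Printing the commands first keeps the operator in the loop; each line can be reviewed and then pasted into a shell once the `kubectl get statefulset` output confirms which workloads are at 0/0.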