🗺️ Platform Deployment Map

last verified 2026-06-28

PROD: CPU req 91% · mem 77% · STAGING: 16 GB box

📡 Open Live Status → · 🧭 Product Roadmap →

Real-time service health & recent deploys at status.dloizides.com; the product roadmap (flowchart · gantt · kanban) lives here too. This page is the static topology companion.

What runs where · live topology

graph TB
    subgraph internet["🌐 Public Internet — *.dloizides.com (real LE TLS)"]
        users["End users / testers"]
    end

    subgraph prod["🟢 PROD — Hetzner prod (4 vCPU / 8 GB · k3s) — CPU req 94% / mem 83%"]
        ingress["Traefik ingress + LE certs<br/>ALL public URLs terminate here"]
        subgraph prodapis["Product APIs (+ BFF) → shared saas-db"]
            kefiapi["kefi-api / bff-kefi"]
            tenant["tenant-api"]
            quest["questioner-api"]
            onlinemenu["onlinemenu-api"]
            content["content-api"]
            payment["payment-api"]
            notif["notification-api (MT consumer)"]
        end
        subgraph prodweb["Web / SPA / static"]
            kefiweb["kefi-web SPA · kefi-marketing"]
            kefiland["kefi-landings (vestigial/crashloop)"]
            erevna["erevna-web · katalogos-web"]
            statics["~35 static apps/games · HQ"]
        end
        kc["keycloak + keycloak-db"]
        subgraph prodstate["Stateful infra (hibernate-able → 0/0)"]
            saasdb[("saas-db (shared; *-db aliases)")]
            rmq["rabbitmq"]
            sw["seaweedfs (S3)"]
            maddy["maddy (mail)"]
            pouenidb[("poueni-postgres")]
            dynalux[("dynalux-db")]
        end
        amlmkt["aml-marketing (static landing — stays)"]
        promagent["prometheus AGENT (remote-write)"]
        pubjob["kefi publish Job: build-site→kaniko→rollout<br/>CPU req 500m→200m (now schedules)"]
        gproxy["selector-less Service + EndpointSlice<br/>grafana.* · analytics.* · aml-* → staging NodePort"]
    end

    subgraph staging["🟡 STAGING — staging (WireGuard-only · 8 vCPU / 16 GB)"]
        registry["Docker registry (staging)"]
        graf["Grafana + Prometheus(store) + Loki + Alertmanager"]
        umami["Umami + umami-db"]
        amlstg["AML (migrated 2026-06-28): aml-screening+aml-postgres ·<br/>aml-identity+identity-postgres · aml-ner — NodePort"]
        swstg["seaweedfs (E2E report store)"]
        subgraph e2e["E2E runners (CronJobs) — run HERE, TARGET prod"]
            e2ekefi["kefi-lifecycle 03:00 (deadline 1800)"]
            e2epoueni["poueni-reset 03:40"]
            e2efull["full suite 04:00"]
        end
    end

    users -->|HTTPS| ingress
    ingress --> prodapis
    ingress --> prodweb
    ingress --> kc
    ingress -->|analytics.* · grafana.* · aml-*| gproxy
    gproxy -.->|WireGuard ~100ms| umami
    gproxy -.->|WireGuard ~100ms| graf
    gproxy -.->|WireGuard ~100ms| amlstg
    promagent -.->|remote-write WG| graf
    kefiapi -->|create Job| pubjob
    pubjob -->|push image| registry
    e2e -.->|test over public DNS| ingress
    e2e -->|reports| swstg
    prod -.->|pull images| registry

    classDef off fill:#3d1414,stroke:#f85149,color:#ffd7d5;
    classDef fix fill:#13331b,stroke:#3fb950,color:#aff5b4;
    class kefiland off;
    class pubjob fix;
    class amlstg fix;

Legend: live (prod) · runs on staging, public URL proxied from prod · hibernate-able / scheduled sleep · off by design

Deployment inventory · status by host

🟢 PROD — Hetzner prod (public)

Hetzner prod · 4 vCPU / 8 GB · real LE TLS

tenant · kefi · questioner · onlinemenu · content · payment · notification — product APIs + BFFs → shared saas-db
keycloak (+ keycloak-db) — auth; all realms
kefi-web · erevna-web · katalogos-web — product SPAs
~35 static apps / games · hq.dloizides.com
saas-db · rabbitmq · seaweedfs · maddy · poueni-postgres · dynalux-db — stateful; scaled to 0/0 when hibernated → APIs go "Running but not Ready"
aml-marketing — static AML landing (stays on prod)
aml-screening.* · aml-identity.* — AML stack MIGRATED to staging 2026-06-28; public URLs proxied from prod
kefi-landings — vestigial; SPA replaced it (empty nginx root → crashloop)
kefi publish build Jobs — kaniko on Publish click; CPU req 200m (was 500m → Pending)
prometheus AGENT — scrapes prod, remote-writes to staging
grafana.* · analytics.* — public URL here, app on staging (WG)
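
The "public URL here, app on staging" pattern relies on a Service with no selector plus a hand-written EndpointSlice that points at the staging host. Below is a minimal Python sketch that builds the two manifests as JSON, which `kubectl apply -f` accepts. The object name `umami-proxy`, the address `10.0.0.2`, the port `30080`, and the function name are hypothetical placeholders, not values from this cluster; placing the proxy in the `dloizides` namespace is also an assumption.

```python
import json


def proxy_manifests(name: str, namespace: str, staging_ip: str, node_port: int):
    """Build a selector-less Service and the EndpointSlice that backs it."""
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # No "selector" key: Kubernetes will not manage endpoints itself.
            "ports": [
                {"name": "http", "protocol": "TCP", "port": 80, "targetPort": node_port}
            ],
        },
    }
    endpoint_slice = {
        "apiVersion": "discovery.k8s.io/v1",
        "kind": "EndpointSlice",
        "metadata": {
            "name": f"{name}-staging",
            "namespace": namespace,
            # This label ties the slice to the Service above.
            "labels": {"kubernetes.io/service-name": name},
        },
        "addressType": "IPv4",
        # The port name must match the Service port name.
        "ports": [{"name": "http", "protocol": "TCP", "port": node_port}],
        # Hypothetical WireGuard address of the staging host.
        "endpoints": [{"addresses": [staging_ip]}],
    }
    return service, endpoint_slice


# Placeholder values for illustration only.
svc, eps = proxy_manifests("umami-proxy", "dloizides", "10.0.0.2", 30080)
print(json.dumps(svc, indent=2))
print(json.dumps(eps, indent=2))
```

The Traefik route for the public hostname would then target the Service by name, exactly as it would for an in-cluster app.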

🟡 STAGING — staging (WireGuard-only)

WireGuard-only · 8 vCPU / 16 GB · no public TLS

Docker registry (staging) — prod pulls images from here
Grafana · Prometheus (store/query) · Loki · Alertmanager — prod agent remote-writes here; grafana.dloizides.com proxied from prod
Umami + umami-db — migrated 2026-06-27; analytics.dloizides.com proxied from prod
AML stack — migrated 2026-06-28: aml-screening+aml-postgres, aml-identity+identity-postgres, aml-ner; aml-screening/identity.dloizides.com proxied from prod (via NodePort)
seaweedfs — E2E canary report store
E2E CronJobs (target prod over public DNS) — kefi-lifecycle 03:00 · poueni-reset 03:40 · full suite 04:00 · all share one canary lock

Status & recent changes · 2026-06-28

✅ kefi self-serve publish — FIXED

Publish build pods requested 500m CPU but only ~250m was free, so they stuck in Pending and sites never built. Lowered build-site+kaniko requests to 200m and restarted kefi-api. Verified: build Completes in 39s, lifecycle E2E passes.
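
The scheduling arithmetic behind this fix can be sketched in Python. It assumes the node's allocatable CPU is the full 4 vCPU (4000m; real allocatable is usually slightly lower) and uses the 94% request figure quoted on this page, which gives 240m, in line with the ~250m above. Function names are hypothetical.

```python
def free_millicores(capacity_m: int, requested_pct: int) -> int:
    """CPU millicores not yet claimed by pod requests (integer math)."""
    return capacity_m * (100 - requested_pct) // 100


def schedulable(pod_request_m: int, free_m: int) -> bool:
    """The scheduler can place a pod only if its request fits in the free CPU."""
    return pod_request_m <= free_m


free = free_millicores(4000, 94)   # assumed: 4 vCPU node, 94% requested
print(free)                        # 240
print(schedulable(500, free))      # False: the old 500m request stays Pending
print(schedulable(200, free))      # True: the 200m request fits
```

With only about 40m to spare after a 200m request, any new workload that requests CPU on prod can push publish back into Pending.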

✅ prod-from-staging E2E runners — REPAIRED

Re-spaced schedules (to stop canary-lock races), raised the kefi deadline 900→1800s, pinned images, set imagePullPolicy: IfNotPresent. The full prod suite is running now to give a complete green/red result.
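
As a rough check on the re-spacing, the sketch below computes how much slack remains between kefi-lifecycle (03:00, deadline 1800s) and the next runner, poueni-reset (03:40). It assumes the deadline is a hard time limit on the Job, so the job is gone by start + deadline; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def latest_finish(start_hhmm: str, deadline_s: int) -> datetime:
    """Latest time a job can still be running, given a hard deadline."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    return start + timedelta(seconds=deadline_s)


def gap_minutes(start_hhmm: str, deadline_s: int, next_start_hhmm: str) -> float:
    """Minutes between one job's latest finish and the next job's start."""
    next_start = datetime.strptime(next_start_hhmm, "%H:%M")
    delta = next_start - latest_finish(start_hhmm, deadline_s)
    return delta.total_seconds() / 60


print(latest_finish("03:00", 1800).strftime("%H:%M"))  # 03:30
print(gap_minutes("03:00", 1800, "03:40"))             # 10.0
```

A negative gap would mean two runners could contend for the shared canary lock; the deadlines for poueni-reset and the full suite are not stated here, so only the first gap is checked.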

⚠️ prod is CPU-request saturated (94%)

CPU requests are now the tight resource, as memory was before. The 200m publish fix only buys margin: if CPU tightens, publish breaks again. Strategic lever: offload AML→staging (same proxy pattern as Umami), gated on the test window.

⚠️ Hibernation gotcha

Products showing "Running but not Ready" means a stateful workload is at 0/0. First check: kubectl get statefulset -n dloizides, then scale the DB/broker back to 1 (DB first).
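
The DB-first wake order can be sketched as a small Python helper that prints the kubectl commands to run, databases before brokers. It assumes the workloads are StatefulSets named `saas-db` and `rabbitmq` in the `dloizides` namespace; check the real names with the `kubectl get statefulset` command first. The helper name is hypothetical.

```python
def wake_commands(databases, brokers, namespace="dloizides"):
    """Return kubectl scale commands, databases first, then brokers."""
    ordered = list(databases) + list(brokers)
    return [
        f"kubectl scale statefulset {name} -n {namespace} --replicas=1"
        for name in ordered
    ]


# Assumed StatefulSet names; verify them before running anything.
for cmd in wake_commands(["saas-db"], ["rabbitmq"]):
    print(cmd)
# kubectl scale statefulset saas-db -n dloizides --replicas=1
# kubectl scale statefulset rabbitmq -n dloizides --replicas=1
```

Printing the commands instead of running them keeps the order reviewable before anything is scaled.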