Platform engineering · playbook

Kubernetes from scratch on bare metal. Reproducible in an afternoon.

10 sequenced scripts. Cilium eBPF + ArgoCD + External Secrets Operator + Hashicorp Vault + Cloudflared Zero Trust. No managed control plane, no tribal knowledge — just a cluster a teammate can re-create before their coffee goes cold.

3

Control-plane nodes

10

Bootstrap scripts

HA

From day one

~3 h

End-to-end bootstrap

Why not just use managed?

Managed Kubernetes (GKE / EKS) is great until the economics of your workload flip upside-down — egress heavier than compute, a fleet of bare-metal hosts already paid for, or a private-VLAN networking story you want to keep inside the data center.

The real question was never "managed vs self-hosted." It was: can we self-host in a way that's boring in production and reproducible from a Git repo? No manual `kubectl apply`. No secrets pasted into YAML. No one engineer holding the knowledge of how to stand the cluster up.

The bar: a teammate opens a fresh set of bare-metal hosts, runs 10 scripts in order, and by the end has a cluster identical to production — same ingress, same secrets flow, same GitOps pipeline. This playbook is my distillation of that bootstrap.

Stack, with reasoning

kubeadm

Control plane

3 control-plane nodes behind a load-balanced API endpoint. Classic, debuggable, no vendor lock-in. Worth the extra ceremony vs k3s when HA from day one is non-negotiable.

Cilium + Hubble

CNI — eBPF

`kubeProxyReplacement: Strict` removes kube-proxy overhead; Hubble gives network-layer observability for free. The catch on Hetzner vSwitch: MTU must be 1350 to fit the provider's 1400 envelope — that one took a day to debug the first time.

CoreDNS

DNS

Custom ConfigMap forwards internal domains to the private-VLAN resolvers and adds static hosts for services running outside the cluster (Postgres, Redis, etc.). One source of truth for name resolution.

NGINX Ingress Controller

Ingress

Labeled to a dedicated ingress node. Nothing fancy at Layer 7 — Cloudflared Zero Trust sits in front, so the cluster edge stays simple.

Cloudflared

Zero Trust ingress

Deployed in the `infra` namespace, HTTP/2 to origin. No public load balancer, no exposed IPs. Internal services reachable only through authenticated Cloudflare Access policies. Kills a whole class of attack surface.

ArgoCD

GitOps

Helm-installed, pulling from a deployments repo via SSH deploy key. Webhook-triggered reconciliation on every push. AppSets per environment. After bootstrap, no manual `kubectl apply` — ever.

External Secrets Operator + Hashicorp Vault

Secrets

ESO reads from Vault KV v2 via scoped AppRoles per environment. Vault runs on a dedicated bare-metal host, not in-cluster — one less chicken-and-egg at recovery time. Secrets flow: Vault → ESO → Kubernetes Secret → Pod. No secret ever lands in Git.

The 10-script bootstrap

Each script is idempotent, commented, and assumes a fresh Ubuntu 22.04 host. Run in order. If one fails, fix it, re-run.

  1. 1-prerequisites.sh

    Host setup

    Network, SSH, firewall, apt sources, container runtime (containerd).

  2. 2-initialize-kubeadm.sh

    Control-plane init

    `kubeadm init` with pod + service subnets, external API endpoint on a load-balanced hostname.

  3. 3-deploy-cilium.sh

    CNI

    Helm install Cilium with eBPF masquerading, Hubble UI, `kubeProxyReplacement: Strict`, MTU 1350.

  4. 3.1-join-worker-nodes.sh

    Worker join

    Generates join commands for workers. Labels ingress nodes so NGINX schedules there.

  5. 4-test-connectivity.sh

    Smoke test

    Deploys a test pod per node and verifies cross-node + private-VLAN routing before anything else runs.

  6. 5-kubernetes-dashboard.sh

    Dashboard

    Optional — read-only dashboard behind Cloudflared Access.

  7. 6-dns.sh

    CoreDNS config

    Applies the ConfigMap with internal VLAN forwarders and static hosts.

  8. 7-ingress.sh

    Ingress

    NGINX Ingress Controller Helm chart, labeled node placement, TLS via cert-manager + Cloudflare.

  9. 8-cloudflared-tunnel.sh

    Zero Trust ingress

    Cloudflared Deployment in `infra` namespace, tunnel token injected from Vault via ESO.

  10. 9-app-bootstrap.sh

    Bootstrap services

    Namespaces + ServiceAccounts + image-pull secrets + any bridge workloads your apps need before ArgoCD takes over.

  11. 10-argocd.sh

    GitOps

    Helm-installs ArgoCD, creates projects + AppSets per environment, configures the GitHub webhook.

The gotchas (write these down)

Cilium MTU 1350 inside Hetzner vSwitch 1400

Hetzner's vSwitch wraps packets in an extra encapsulation. Cilium defaults to the interface MTU (1400), and packets drop silently after node join. Set `MTU: 1350` in the Cilium Helm values. It belongs in a comment in the CNI script so the next person doesn't re-discover it at 1am.

enableMasqueradeToRouteSource: true

Without this, pod traffic exits with the pod IP (which the provider's vSwitch has no route back for). Setting it to `true` makes Cilium masquerade to the node's VLAN IP. Once you know, obvious. Before you know, baffling.

ArgoCD webhook requires GitHub auth

The `argocd` namespace needs a webhook secret plus a GitHub App deploy key. If the secret is wrong, ArgoCD falls back to 3-minute polling — fine for dev, painful for prod.

ESO ClusterSecretStore requires Vault AppRole bootstrap

Before ESO can read secrets, Vault needs the per-environment AppRoles configured with appropriate policies. Automate this with an Ansible playbook or a bootstrap script — don't do it by hand the first time, let alone the second.

Need this for your team?

I do Kubernetes bootstrap consulting: managed-to-bare-metal migrations, GitOps rollouts with ArgoCD + ESO + Vault, Cilium eBPF adoption, and Cloudflared Zero Trust ingress. Remote, paid engagements — typically 2–6 weeks per project. Based in Spain (CET) until October 2026, then Mexico (CST).

Email [email protected]

Stateful-migration playbook

The other hard-infra story: an Elasticsearch major-version migration on a live platform, zero regression, with a shape-compare safety net and a per-service SPEC → Build → QA workflow.