Last month, a guy in the OpenClaw Discord posted a screenshot of his monitoring dashboard. His AI agent had been down for eleven hours. Nobody noticed because the container was technically still "running" — the process hadn't crashed, it had just stopped responding.
This is the classic container zombie problem. The process is alive. The container is green. But the application inside is completely frozen. Docker and Kubernetes have no way to know unless you tell them how to check.
OpenClaw 2026.3.1 fixes this with built-in health check endpoints.
The zombie container problem
When you run an AI agent in Docker, the container runtime only monitors one thing: is the main process still running? If the PID exists, Docker considers the container healthy.
But AI agents fail in ways that don't kill the process. A stuck API call to your LLM provider. A memory leak that makes everything crawl. A network partition that disconnects your agent from Slack but leaves the gateway loop spinning on nothing.
In all these cases, your agent is effectively dead but Docker happily reports everything is fine. Your team sends messages to Slack. Nobody answers. For eleven hours.
Built-in endpoints that just work
The latest OpenClaw release adds four HTTP endpoints to the gateway:
- `/health` and `/healthz` — liveness probes. Return 200 if the gateway process is functioning.
- `/ready` and `/readyz` — readiness probes. Return 200 if the gateway is ready to handle requests.
These follow the Kubernetes probe conventions, but they work with any container orchestrator — Docker Compose, Docker Swarm, Nomad, plain Docker with HEALTHCHECK.
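For the plain-Docker case, the check can live in the image itself via a HEALTHCHECK instruction. A minimal sketch, assuming curl exists inside the image and the gateway listens on port 3000:

```dockerfile
FROM openclaw/openclaw:latest

# Mirror the probe settings used elsewhere in this post: check every 30s,
# allow 40s to boot, mark unhealthy after 3 consecutive failures.
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```

Baking the check into the image means every way you run the container gets it for free, at the cost of rebuilding the image to tune the intervals.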
The best part: zero configuration. Update OpenClaw and the endpoints are there. They use fallback routing, meaning if you already have a handler on /health for something else, your handler takes priority. No conflicts.
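After upgrading, you can smoke-test the endpoints yourself. A quick sketch, assuming the gateway listens on localhost:3000; adjust host and port to your deployment:

```shell
# Probe all four built-in endpoints and report pass/fail for each.
check() {
  # curl -f makes HTTP error statuses exit non-zero; -s silences progress output
  if curl -fs "http://localhost:3000$1" > /dev/null; then
    echo "$1 OK"
  else
    echo "$1 FAILED"
  fi
}

for path in /health /healthz /ready /readyz; do
  check "$path"
done
```

If `/health` answers but `/ready` doesn't, the gateway is up but still initializing, which is exactly the distinction the two probe types exist to make.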
Docker Compose setup
Here's what a production-ready docker-compose.yml looks like with health checks:
```yaml
services:
  openclaw:
    image: openclaw/openclaw:latest
    restart: unless-stopped
    volumes:
      - ./config:/home/user/.openclaw
    ports:
      - "3000:3000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
What this does: every 30 seconds, Docker curls the health endpoint. If it fails three times in a row, Docker flags the container as unhealthy. The start_period gives OpenClaw 40 seconds to boot up before the checks begin.

One caveat worth knowing: plain Docker and Compose only record the unhealthy status; they won't restart the container on their own. Docker Swarm restarts unhealthy tasks natively. With plain Compose, pair the check with a small watcher such as willfarrell/autoheal, which restarts any container whose health flips to unhealthy. Your agent goes zombie? The status flips, the watcher restarts it, and it's back within a couple of minutes.
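You can read the health state Docker derives from these checks with `docker inspect`. A sketch, assuming the container is named `openclaw`; Compose prefixes names with the project, so check `docker ps` for the real one:

```shell
# Read the health state Docker tracks for the container; falls back to
# "unknown" when docker or the container isn't available.
status=$(docker inspect --format '{{.State.Health.Status}}' openclaw 2>/dev/null \
  || echo "unknown")
echo "health: $status"

# Full probe history: timestamps, exit codes, and captured curl output.
docker inspect --format '{{json .State.Health}}' openclaw 2>/dev/null || true
```

The status cycles through "starting" during the start_period, then "healthy" or "unhealthy"; the JSON log is the first place to look when a check flaps.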
Kubernetes deployment
If you're running OpenClaw in Kubernetes — maybe alongside other services in a cluster — the probe configuration goes in your pod spec:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openclaw-agent
  template:
    metadata:
      labels:
        app: openclaw-agent
    spec:
      containers:
        - name: openclaw
          image: openclaw/openclaw:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 40
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /readyz
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 10
```
The liveness probe restarts the container if the gateway is frozen. The readiness probe removes the pod from the Service's endpoints while it's still starting up or temporarily overloaded; a failing readiness probe stops traffic but never triggers a restart. Standard Kubernetes patterns, but now they work out of the box with OpenClaw.
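Once the manifest is applied, kubectl can confirm the probes were wired up. A sketch; the `app=openclaw-agent` label is an assumption, so match it to whatever labels your pod template actually carries:

```shell
# Confirm Kubernetes registered the probes for the deployment.
# "|| true" keeps the script alive when no cluster is reachable.
if command -v kubectl > /dev/null 2>&1; then
  kubectl get pods -l app=openclaw-agent || true
  # Probe settings as the API server parsed them
  kubectl describe deployment openclaw-agent | grep -iE 'liveness|readiness' || true
  result="checked"
else
  result="kubectl not found; run this from a machine with cluster access"
fi
echo "$result"
```

Liveness-probe restarts also show up in `kubectl get events` and in the pod's restart count, which is worth glancing at after the first deploy.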
Why self-hosted agents need this more than cloud ones
Cloud AI platforms handle reliability for you. They have teams of SREs, auto-scaling groups, and monitoring stacks. When something breaks, they fix it.
When you self-host your AI agent, you are the SRE. And the first rule of being your own SRE is: automate the recovery before you need it.
Health checks are the foundation. They're not exciting. Nobody tweets about adding a HEALTHCHECK line to their Docker Compose file. But they're the difference between "my agent was down for eleven hours" and "my agent was down for ninety seconds and I didn't even notice."
Monitoring beyond health checks
Health checks handle the binary question: is it alive? For deeper observability, pair them with a few other tools:
Log aggregation. OpenClaw logs to stdout by default, which Docker captures. Pipe those logs to something searchable — Loki, the ELK stack, even a simple docker logs --follow in a tmux pane.
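Docker's default json-file logs also grow without bound, so it's worth adding rotation while you're in the compose file. A sketch; the size and file counts are arbitrary, tune them to your disk:

```yaml
services:
  openclaw:
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate each log file at 10 MB
        max-file: "3"     # keep at most 3 rotated files
```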
Uptime monitoring. Point an external service at your health endpoint. UptimeRobot, Healthchecks.io, or a simple cron job that curls the endpoint and alerts you on failure. This catches the case where the entire host machine goes down, not just the container.
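The cron-job version can be a few lines of shell. A sketch; the URL, log path, and alert action are placeholders for your own:

```shell
#!/bin/sh
# External uptime check, meant to run from cron on a different machine
# than the one hosting the agent.
URL="http://localhost:3000/health"

if curl -fsS --max-time 10 "$URL" > /dev/null 2>&1; then
  STATUS="up"
else
  STATUS="down"
  # Replace this with a real alert: mail, a Slack webhook, ntfy, etc.
  echo "$(date -u) $URL health check failed" >> /tmp/openclaw-alerts.log
fi
echo "$STATUS"
```

A crontab entry like `*/5 * * * * /usr/local/bin/check-openclaw.sh` runs it every five minutes. Pairing it with a dead-man's-switch service such as Healthchecks.io also tells you when the cron host itself dies.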
Resource limits. Set memory and CPU limits on your container. An AI agent processing large documents can spike in memory usage. Without limits, it can OOM-kill other services on the same host.
```yaml
services:
  openclaw:
    # ... other config
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'
```
The setup that doesn't wake you up at 3 AM
The whole point of running a personal AI agent is making your life easier. If that agent needs babysitting, it's creating work instead of eliminating it.
With health checks, container restart policies, and basic monitoring, your agent becomes self-healing. It crashes? Docker restarts it. It freezes? The health check catches it. The host reboots? The restart: unless-stopped policy brings everything back.
A guy I helped through OpenClaw Setup last week had been running his agent on bare metal — no containers, no health checks, no restart logic. He'd SSH into his server every morning to check if it was still running. We moved him to Docker Compose with health checks in about an hour. He hasn't SSH'd into that server since.
That's the goal. Set it up once, then forget about it. If you want that kind of setup running on your own machine by tonight, book a free 15-minute call and we'll get it done together.