
OpenClaw Health Checks: Keep Your AI Agent Alive in Docker and Kubernetes

H. · 5 min read

Last month, a guy in the OpenClaw Discord posted a screenshot of his monitoring dashboard. His AI agent had been down for eleven hours. Nobody noticed because the container was technically still "running" — the process hadn't crashed, it had just stopped responding.

This is the classic container zombie problem. The process is alive. The container is green. But the application inside is completely frozen. Docker and Kubernetes have no way to know unless you tell them how to check.

OpenClaw 2026.3.1 fixes this with built-in health check endpoints.

The zombie container problem

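When you run an AI agent in Docker, the container runtime monitors only one thing by default: is the main process still running? As long as that process exists, Docker reports the container as running. It has no opinion about whether the application inside is actually working.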

But AI agents fail in ways that don't kill the process. A stuck API call to your LLM provider. A memory leak that makes everything crawl. A network partition that disconnects your agent from Slack but leaves the gateway loop spinning on nothing.

In all these cases, your agent is effectively dead but Docker happily reports everything is fine. Your team sends messages to Slack. Nobody answers. For eleven hours.

Built-in endpoints that just work

The latest OpenClaw release adds four HTTP endpoints to the gateway. Three of them appear in the configurations in this article: /health, /healthz, and /readyz.

These follow the Kubernetes probe conventions, but they work with any container orchestrator — Docker Compose, Docker Swarm, Nomad, plain Docker with HEALTHCHECK.

The best part: zero configuration. Update OpenClaw and the endpoints are there. They use fallback routing, meaning if you already have a handler on /health for something else, your handler takes priority. No conflicts.
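The fallback idea can be pictured as a two-step lookup: user-registered routes are checked first, and the built-in health paths only answer when nothing else claims the path. The sketch below illustrates that ordering and is not OpenClaw's actual routing code; the names resolve and make_handler, the "ok" response body, and the route table shape are all assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Paths used in this article's examples; the "ok" body is an assumption.
BUILTIN_HEALTH_PATHS = {"/health", "/healthz", "/readyz"}


def resolve(path, user_routes):
    """Return (status, body) for a path, letting user routes win over built-ins."""
    if path in user_routes:
        # A handler the user registered takes priority.
        return user_routes[path]()
    if path in BUILTIN_HEALTH_PATHS:
        # Fallback: the built-in health response.
        return 200, "ok"
    return 404, "not found"


def make_handler(user_routes):
    """Build an HTTP handler class that serves GET requests through resolve()."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = resolve(self.path, user_routes)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep the demo quiet

    return Handler


if __name__ == "__main__":
    # A hypothetical user handler that already owns /health.
    routes = {"/health": lambda: (200, "custom health page")}
    ThreadingHTTPServer(("127.0.0.1", 3000), make_handler(routes)).serve_forever()
```

With this ordering, a request to /health reaches the custom handler, while /healthz and /readyz still get the built-in answer.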

Docker Compose setup

Here's what a production-ready docker-compose.yml looks like with health checks:

services:
  openclaw:
    image: openclaw/openclaw:latest
    restart: unless-stopped
    volumes:
      - ./config:/home/user/.openclaw
    ports:
      - "3000:3000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

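What this does: every 30 seconds, Docker curls the health endpoint from inside the container (this assumes curl is installed in the image). If the check fails three times in a row, Docker marks the container unhealthy. The start_period gives OpenClaw 40 seconds to boot: checks still run during that window, but failures don't count toward the three retries.

One caveat: plain Docker and Docker Compose only report the unhealthy status. The restart policy reacts when the process exits, not when a health check fails. Docker Swarm replaces unhealthy tasks automatically; on a single host you need something that watches for unhealthy containers and restarts them.

For most self-hosted setups, the health check plus a small watchdog is all you need. Your agent goes zombie? Docker flags it within about two minutes, and the watchdog brings it back.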
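Because a restart policy only reacts to a process that exits, acting on an unhealthy status on a single Docker host takes a small watchdog. Here is a minimal sketch, not something that ships with OpenClaw. It assumes the docker CLI is on the PATH and that the script is allowed to talk to the Docker daemon; the function names and the 30-second polling interval are choices made for this example.

```python
import subprocess
import time


def list_unhealthy(run=subprocess.run):
    """Return the names of containers whose health status is 'unhealthy'."""
    result = run(
        ["docker", "ps", "--filter", "health=unhealthy", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def restart_unhealthy(run=subprocess.run):
    """Restart every unhealthy container and return the names that were restarted."""
    names = list_unhealthy(run)
    for name in names:
        run(["docker", "restart", name], check=True)
    return names


if __name__ == "__main__":
    while True:
        for name in restart_unhealthy():
            print(f"restarted unhealthy container: {name}")
        time.sleep(30)  # polling interval, an arbitrary choice for this sketch
```

The run parameter exists so the logic can be exercised without a Docker daemon. In practice you would probably run this as a systemd service or use an existing autoheal-style container instead of maintaining your own loop.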

Kubernetes deployment

If you're running OpenClaw in Kubernetes — maybe alongside other services in a cluster — the probe configuration goes in your pod spec:

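apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-agent
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels.
  # The label value here is just an example name.
  selector:
    matchLabels:
      app: openclaw-agent
  template:
    metadata:
      labels:
        app: openclaw-agent
    spec:
      containers:
      - name: openclaw
        image: openclaw/openclaw:latest
        ports:
        - containerPort: 3000
        # Restart the container when the gateway stops answering.
        livenessProbe:
          httpGet:
            path: /healthz
            port: 3000
          initialDelaySeconds: 40
          periodSeconds: 30
        # Keep traffic away until the gateway reports ready.
        readinessProbe:
          httpGet:
            path: /readyz
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 10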

The liveness probe makes the kubelet restart the container if the gateway is frozen; with the default failure threshold of three, that happens after roughly 90 seconds of failed checks. The readiness probe removes the pod from the Service endpoints while it's still starting up or temporarily overloaded. Standard Kubernetes patterns, but now they work out of the box with OpenClaw.
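Both Docker's retries and Kubernetes' failureThreshold count consecutive failures, so a single slow response does not trigger a restart. The sketch below models that counting rule. It is a simplified illustration and not how either runtime is implemented; ProbeTracker and approx_detection_seconds are hypothetical names, and the estimate ignores probe timeouts and scheduling jitter.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Counts consecutive probe failures against a threshold."""

    failure_threshold: int = 3  # Kubernetes' default failureThreshold
    consecutive_failures: int = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True once the threshold is reached."""
        if ok:
            # Any success resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def approx_detection_seconds(period_seconds: int, failure_threshold: int = 3) -> int:
    """Rough time from a freeze to the threshold being hit (ignores timeouts)."""
    return period_seconds * failure_threshold


if __name__ == "__main__":
    tracker = ProbeTracker()
    # One blip, a recovery, then a real freeze.
    for ok in (False, True, False, False, False):
        if tracker.record(ok):
            print("threshold reached: restart the container")
    print(approx_detection_seconds(30), "seconds for the liveness probe above")
```

Lowering periodSeconds shortens the detection window at the cost of more probe traffic and a higher chance of restarting an agent that was only briefly busy.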

Why self-hosted agents need this more than cloud ones

Cloud AI platforms handle reliability for you. They have teams of SREs, auto-scaling groups, and monitoring stacks. When something breaks, they fix it.

When you self-host your AI agent, you are the SRE. And the first rule of being your own SRE is: automate the recovery before you need it.

Health checks are the foundation. They're not exciting. Nobody tweets about adding a healthcheck block to their Docker Compose file. But they're the difference between "my agent was down for eleven hours" and "my agent was down for ninety seconds and I didn't even notice."

Monitoring beyond health checks

Health checks handle the binary question: is it alive? For deeper observability, pair them with a few other tools:

Log aggregation. OpenClaw logs to stdout by default, which Docker captures. Pipe those logs to something searchable — Loki, the ELK stack, even a simple docker logs --follow in a tmux pane.

Uptime monitoring. Point an external service at your health endpoint. UptimeRobot, Healthchecks.io, or a simple cron job that curls the endpoint and alerts you on failure. This catches the case where the entire host machine goes down, not just the container.
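The cron-job variant can be as small as the script below, run from a different machine than the one hosting the agent. It is a minimal sketch using only the Python standard library: it treats any 2xx response within the timeout as healthy and exits non-zero otherwise, so cron or a wrapper script can alert on the failure. The agent.example.com address is a placeholder, and the assumption that a healthy gateway answers with a 2xx status is mine.

```python
import sys
import urllib.error
import urllib.request


def is_healthy(url, timeout=10.0):
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, DNS failures, timeouts.
        return False


if __name__ == "__main__":
    # Placeholder address; pass your own URL as the first argument.
    url = sys.argv[1] if len(sys.argv) > 1 else "http://agent.example.com:3000/health"
    if not is_healthy(url):
        print(f"health check failed: {url}", file=sys.stderr)
        sys.exit(1)
```

Scheduled every few minutes, the non-zero exit code and stderr line give you something to hook an email or chat notification onto.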

Resource limits. Set memory and CPU limits on your container. An AI agent processing large documents can spike in memory usage. Without limits, it can exhaust the host's memory, and the kernel's OOM killer may take down other services on the same host.

services:
  openclaw:
    # ... other config
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'

The setup that doesn't wake you up at 3 AM

The whole point of running a personal AI agent is making your life easier. If that agent needs babysitting, it's creating work instead of eliminating it.

With health checks, container restart policies, and basic monitoring, your agent becomes self-healing. It crashes? Docker restarts it. It freezes? The health check catches it. The host reboots? The restart: unless-stopped policy brings everything back.

A guy I helped set up OpenClaw last week through OpenClaw Setup had been running his agent on bare metal — no containers, no health checks, no restart logic. He'd SSH into his server every morning to check if it was still running. We moved him to Docker Compose with health checks in about an hour. He hasn't SSH'd into that server since.

That's the goal. Set it up once, then forget about it. If you want that kind of setup running on your own machine by tonight, book a free 15-minute call and we'll get it done together.
