
Sajja Sudhakararao


Docker Monitoring Without a Platform: docker stats + cgroups (DevOps)

When an incident hits a containerized service, you often don’t need a full observability stack to get traction. You need fast answers: Which container is hot? What resource is saturating? Is it an app problem or a limit problem?

This guide shows a practical monitoring workflow you can run from any Docker host:

  1. Docker-level commands (docker stats, docker inspect, docker logs)
  2. Host Linux tools (ps/top/free/df/iostat/ss/journalctl)
  3. Kernel primitives: cgroups (resource limits/accounting) and namespaces (isolation)

1) Start with docker stats (the fastest signal)

docker stats streams runtime metrics for containers, including CPU%, memory usage/limit, network I/O, and block I/O.

docker stats

Common workflows:

docker stats --no-stream          # Snapshot (good for scripts)
docker stats <container_name>     # Focus on one container

How to interpret it (in plain language)

  • CPU%: who’s burning compute right now.
  • MEM USAGE / LIMIT: how close you are to the memory ceiling.
  • NET I/O: traffic spikes, retries, or unusual egress.
  • BLOCK I/O: slow disks, chatty logging, or heavy read/write workloads.
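
For scripting or ranking, docker stats also accepts Go templates via --format. A small sketch (the column choice here is just a suggestion):

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"
# Rank containers by CPU% (header-free output, numeric sort on the second field)
docker stats --no-stream --format "{{.Name}} {{.CPUPerc}}" | sort -k2 -rn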

2) Jump from “container name” → “what is it?”

Once you identify a hot container, immediately gather identity + configuration.

docker ps
docker inspect <container> | less

Useful inspect questions:

  • What image/tag is running?
  • What env vars/config are set?
  • What ports and volumes are attached?
  • Are there memory/CPU limits configured?
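
docker inspect also takes -f/--format with Go templates, so you can pull just these answers instead of paging through the full JSON. A minimal sketch (exact field paths can differ slightly between Docker versions):

docker inspect -f '{{.Config.Image}}' <container>                          # image/tag
docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' <container>  # env vars
docker inspect -f '{{json .NetworkSettings.Ports}}' <container>            # port mappings
docker inspect -f '{{json .Mounts}}' <container>                           # volumes/binds
docker inspect -f 'mem={{.HostConfig.Memory}} nanocpus={{.HostConfig.NanoCpus}}' <container>  # limits (0 = unset)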

3) Logs: confirm symptoms fast

docker logs --tail 200 <container>
docker logs -f <container>

This is often enough to spot:

  • crash loops
  • OOM errors / memory pressure
  • upstream timeouts
  • DB connection exhaustion
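
If the logs point at memory pressure, you can confirm an OOM kill from container state and the kernel log. A quick sketch (the journalctl flags assume a systemd host):

docker inspect -f 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' <container>  # 137 usually means SIGKILL
journalctl -k --since "1 hour ago" | grep -iE "out of memory|oom"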

4) Understand why it’s happening: cgroups + namespaces (the mental model)

Docker relies on Linux kernel features:

  • Namespaces isolate views of processes, networking, mounts, etc.
  • cgroups control and account for resources like CPU, memory, and I/O.

Why this matters during incidents:

  • A container can be “slow” because it’s CPU-throttled, not because the app code suddenly got worse.
  • A container can restart because it hit its memory limit and the kernel’s OOM behavior targeted its processes.
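
You can read this accounting directly from the cgroup filesystem. The path depends on your cgroup version and driver; the sketch below assumes cgroup v2 with the systemd driver, where containers typically sit under system.slice (on cgroup v1 the same data lives under /sys/fs/cgroup/cpu/... and /sys/fs/cgroup/memory/... with different file names):

CID=$(docker inspect -f '{{.Id}}' <container>)
CG=/sys/fs/cgroup/system.slice/docker-${CID}.scope   # assumed layout; adjust for your host

cat "$CG/cpu.stat"        # nr_throttled / throttled_usec = evidence of CPU throttling
cat "$CG/memory.current"  # current usage in bytes
cat "$CG/memory.max"      # limit ("max" = unlimited)
cat "$CG/memory.events"   # oom / oom_kill counters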

5) Host-level confirmation (tie back to your Linux monitoring toolkit)

When docker stats shows a spike, verify on the host to avoid false conclusions.

CPU hogs

ps aux --sort=-%cpu | head -15

Memory pressure

free -h

Disk full / log explosions

df -h
du -sh /var/lib/docker/* 2>/dev/null | sort -h | tail -10
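
On the Docker side, the usual disk culprits are image/volume bloat and unbounded container logs (the default json-file driver keeps growing unless max-size/max-file are configured). A quick sketch:

docker system df                                                                  # images, containers, volumes, build cache
du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail -5    # biggest container logs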

Disk I/O saturation

iostat -x 1 3

Unexpected listeners / traffic patterns

ss -tuln

These host checks help you decide whether you’re dealing with a single container or a node-wide saturation problem.
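
To tie the two views together, map between containers and host processes. docker inspect exposes the container's main PID, docker top lists every process with its host PID, and /proc/<pid>/cgroup tells you which container a hot host PID belongs to (the path format varies with the cgroup driver):

docker inspect -f '{{.State.Pid}}' <container>   # container -> host PID of the main process
docker top <container>                           # all processes in the container, with host PIDs
cat /proc/<pid>/cgroup                           # host PID -> container (look for the docker id in the path)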

6) What to do with the data (action mapping)

Use the shortest safe path to stability:

1. CPU high + latency rising

  • If CPU is legitimately needed: scale out / add capacity.
  • If CPU is throttled: revisit the container's CPU limit (--cpus, --cpu-quota, or CPU shares); the problem may be the limit, not the app.

2. Memory near limit

  • If memory leak suspected: restart as mitigation + open an issue with heap profiling.
  • If the limit is too low for normal peaks: adjust it carefully and monitor (see the docker update sketch after this list).

3. Block I/O high

  • Check log volume and disk saturation; reduce noisy logs or move logs off disk.
  • Consider storage performance constraints and workload patterns.

4. Network I/O abnormal

  • Look for retries, timeouts, DDoS/abuse patterns, or upstream issues.
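
When the answer is "adjust the limit" and the container is managed by plain Docker, docker update can change CPU and memory constraints on a running container. A minimal sketch (the values are examples; for Compose- or orchestrator-managed containers, change the spec instead so the new limit survives recreation):

docker update --cpus 2 <container>                       # raise the CPU ceiling
docker update --memory 1g --memory-swap 2g <container>   # raise the memory limit (swap limit must be >= memory)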

7) Copy/paste triage sequence (5 minutes)

# 1) Find the hot container
docker stats --no-stream

# 2) Identify it
docker ps
docker inspect <container> | less

# 3) Check symptoms
docker logs --tail 200 <container>

# 4) Confirm on host (avoid guessing)
ps aux --sort=-%cpu | head -10
free -h
df -h
iostat -x 1 3
ss -tuln

What’s your most common container failure mode: OOM kills, CPU throttling, disk I/O, or network timeouts?
