When an incident hits a containerized service, you often don’t need a full observability stack to get traction. You need fast answers: Which container is hot? What resource is saturating? Is it an app problem or a limit problem?
This guide shows a practical, lightweight monitoring toolkit you can run from any Docker host:
- Docker-level commands (docker stats, docker inspect, docker logs)
- Host Linux tools (ps/top/free/df/iostat/ss/journalctl)
- Kernel primitives: cgroups (resource limits/accounting) and namespaces (isolation)
1) Start with docker stats (the fastest signal)
docker stats streams runtime metrics for containers, including CPU%, memory usage/limit, network I/O, and block I/O.
docker stats
Common workflows:
docker stats --no-stream # Snapshot (good for scripts)
docker stats <container_name> # Focus on one container
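If you need the snapshot in a machine-readable form (for a runbook script or a quick paste into an incident channel), docker stats accepts Go-template output via --format. A minimal sketch using standard template fields:
docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}'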
How to interpret it (in plain language)
- CPU%: who’s burning compute right now (it can exceed 100% on multi-core hosts, since usage is summed across cores).
- MEM USAGE / LIMIT: how close you are to the memory ceiling.
- NET I/O: traffic spikes, retries, or unusual egress.
- BLOCK I/O: slow disks, chatty logging, or heavy read/write workloads.
2) Jump from “container name” → “what is it?”
Once you identify a hot container, immediately gather identity + configuration.
docker ps
docker inspect <container> | less
Useful inspect questions:
- What image/tag is running?
- What env vars/config are set?
- What ports and volumes are attached?
- Are there memory/CPU limits configured?
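docker inspect dumps a wall of JSON; --format pulls out just the answers to those questions. A sketch using field paths from the standard inspect output (verify against your Docker version):
docker inspect --format '{{.Config.Image}}' <container>                # image/tag
docker inspect --format '{{json .Config.Env}}' <container>            # env vars
docker inspect --format '{{json .NetworkSettings.Ports}}' <container> # ports
docker inspect --format '{{json .Mounts}}' <container>                # volumes
docker inspect --format 'mem={{.HostConfig.Memory}} nanocpus={{.HostConfig.NanoCpus}}' <container> # limits (0 = unlimited)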
3) Logs: confirm symptoms fast
docker logs --tail 200 <container>
docker logs -f <container>
This is often enough to spot:
- crash loops
- OOM errors / memory pressure
- upstream timeouts
- DB connection exhaustion
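When the tail is noisy, narrow the time window and filter. Note that docker logs replays both stdout and stderr, so merge them before piping. A sketch (the grep pattern is a starting point, not an exhaustive list):
docker logs --since 15m --timestamps <container> 2>&1 | grep -iE 'error|oom|timeout|refused' | tail -20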
4) Understand why it’s happening: cgroups + namespaces (the mental model)
Docker relies on Linux kernel features:
- Namespaces isolate views of processes, networking, mounts, etc.
- cgroups control and account for resources like CPU, memory, and I/O.
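You can see both primitives from the host: every container is just a process tree whose namespaces and cgroup live under /proc. A sketch, assuming a mainstream Linux host:
# The container's main process, as seen by the host
pid=$(docker inspect --format '{{.State.Pid}}' <container>)
# Its namespaces (compare the inode numbers against another process)
ls -l /proc/$pid/ns
# Its cgroup membership
cat /proc/$pid/cgroup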
Why this matters during incidents:
- A container can be “slow” because it’s CPU-throttled, not because the app code suddenly got worse.
- A container can restart because it hit its memory limit and the kernel OOM killer terminated its processes.
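Both failure modes leave evidence in the container's cgroup files. The host-side path depends on your cgroup version and driver, so the sketch below reads them from inside the container instead (assumes a cgroup v2 host with the default private cgroup namespace, and an image that ships cat):
# Nonzero nr_throttled / throttled_usec = the CPU limit is biting
docker exec <container> cat /sys/fs/cgroup/cpu.stat
# The oom_kill counter increments when the kernel killed a process in this group
docker exec <container> cat /sys/fs/cgroup/memory.events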
5) Host-level confirmation (tie back to your Linux monitoring toolkit)
When docker stats shows a spike, verify on the host to avoid false conclusions.
CPU hogs
ps aux --sort=-%cpu | head -15
Memory pressure
free -h
Disk full / log explosions
df -h
du -sh /var/lib/docker/* 2>/dev/null | sort -h | tail -10
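With the default json-file log driver, container logs live under /var/lib/docker/containers and can quietly fill a disk. A sketch for finding and capping them (run as root; the truncate is a mitigation, not a fix):
# Biggest container log files
du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail -5
# Emergency mitigation: empty the offender in place (don't rm it while the container runs)
truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log
# Longer term: cap log size at container start
docker run --log-opt max-size=10m --log-opt max-file=3 <image>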
Disk I/O saturation
iostat -x 1 3
Unexpected listeners / traffic patterns
ss -tuln
These host checks help you decide whether you’re dealing with a single container or a node-wide saturation problem.
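One gap when you pivot to the host: ps shows raw PIDs, not container names. A sketch for mapping between them:
# Every container's main host PID next to its name
docker ps -q | xargs docker inspect --format '{{.State.Pid}} {{.Name}}'
# Or the other way: which cgroup (and therefore container) owns a hot PID?
cat /proc/<pid>/cgroup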
6) What to do with the data (action mapping)
Use the shortest safe path to stability:
1. CPU high + latency rising
- If CPU is legitimately needed: scale out / add capacity.
- If CPU is throttled: revisit the limits (requests/limits in Kubernetes; --cpus or --cpu-shares in plain Docker).
2. Memory near limit
- If memory leak suspected: restart as mitigation + open an issue with heap profiling.
- If the limit is too low for normal peaks: adjust it carefully and monitor (see the docker update sketch after this list).
3. Block I/O high
- Check log volume and disk saturation; reduce noisy logs or move logs off disk.
- Consider storage performance constraints and workload patterns.
4. Network I/O abnormal
- Look for retries, timeouts, DDoS/abuse patterns, or upstream issues.
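For items 1 and 2, plain Docker can adjust limits on a running container without recreating it, via docker update. A sketch (on some cgroup v1 setups, raising --memory also requires raising --memory-swap, and flag support varies by Docker version):
# Loosen the CPU and memory ceilings in place
docker update --cpus 2 --memory 1g --memory-swap 2g <container>
# Verify the new limits took effect
docker inspect --format 'mem={{.HostConfig.Memory}} nanocpus={{.HostConfig.NanoCpus}}' <container>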
7) Copy/paste triage sequence (5 minutes)
# 1) Find the hot container
docker stats --no-stream
# 2) Identify it
docker ps
docker inspect <container> | less
# 3) Check symptoms
docker logs --tail 200 <container>
# 4) Confirm on host (avoid guessing)
ps aux --sort=-%cpu | head -10
free -h
df -h
iostat -x 1 3
ss -tuln
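If you run this sequence often, capture it to a timestamped file so the evidence survives the incident. A minimal sketch (triage-snapshot.sh is a hypothetical name; adjust the command list to taste):
#!/usr/bin/env bash
# triage-snapshot.sh (hypothetical): dump the 5-minute triage to a file
out="triage-$(date +%Y%m%d-%H%M%S).txt"
{
  docker stats --no-stream
  docker ps
  ps aux --sort=-%cpu | head -10
  free -h
  df -h
  iostat -x 1 3
  ss -tuln
} > "$out" 2>&1
echo "Wrote $out"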
What’s your most common container failure mode: OOM kills, CPU throttling, disk I/O, or network timeouts?