When an incident hits a containerized service, you often don’t need a full observability stack to get traction. You need fast answers: Which container is hot? What resource is saturating? Is it an app problem or a limit problem?
This guide shows a practical, lightweight monitoring toolkit you can run from any Docker host:
- Docker-level commands (docker stats, docker inspect, docker logs)
- Host Linux tools (ps/top/free/df/iostat/ss/journalctl)
- Kernel primitives: cgroups (resource limits/accounting) and namespaces (isolation)
1) Start with docker stats (the fastest signal)
docker stats streams runtime metrics for containers, including CPU%, memory usage/limit, network I/O, and block I/O.
docker stats
Common workflows:
docker stats --no-stream # Snapshot (good for scripts)
docker stats <container_name> # Focus on one container
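If you need the snapshot in a machine-readable form (for a runbook script or a quick paste into an incident channel), docker stats accepts Go-template output via --format. A minimal sketch using standard template fields:
docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}'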
How to interpret it (in plain language)
- CPU%: who’s burning compute right now (it can exceed 100% on multi-core hosts, since usage is summed across cores).
- MEM USAGE / LIMIT: how close you are to the memory ceiling.
- NET I/O: traffic spikes, retries, or unusual egress.
- BLOCK I/O: slow disks, chatty logging, or heavy read/write workloads.
2) Jump from “container name” → “what is it?”
Once you identify a hot container, immediately gather identity + configuration.
docker ps
docker inspect <container> | less
Useful inspect questions:
- What image/tag is running?
- What env vars/config are set?
- What ports and volumes are attached?
- Are there memory/CPU limits configured?
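docker inspect dumps a wall of JSON; --format pulls out just the answers to those questions. A sketch using field paths from the standard inspect output (verify against your Docker version):
docker inspect --format '{{.Config.Image}}' <container>                # image/tag
docker inspect --format '{{json .Config.Env}}' <container>            # env vars
docker inspect --format '{{json .NetworkSettings.Ports}}' <container> # ports
docker inspect --format '{{json .Mounts}}' <container>                # volumes
docker inspect --format 'mem={{.HostConfig.Memory}} nanocpus={{.HostConfig.NanoCpus}}' <container> # limits (0 = unlimited)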
3) Logs: confirm symptoms fast
docker logs --tail 200 <container>
docker logs -f <container>
This is often enough to spot:
- crash loops
- OOM errors / memory pressure
- upstream timeouts
- DB connection exhaustion
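When the tail is noisy, narrow the time window and filter. Note that docker logs replays both stdout and stderr, so merge them before piping. A sketch (the grep pattern is a starting point, not an exhaustive list):
docker logs --since 15m --timestamps <container> 2>&1 | grep -iE 'error|oom|timeout|refused' | tail -20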
4) Understand why it’s happening: cgroups + namespaces (the mental model)
Docker relies on Linux kernel features:
- Namespaces isolate views of processes, networking, mounts, etc.
- cgroups control and account for resources like CPU, memory, and I/O.
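You can see both primitives from the host: every container is just a process tree whose namespaces and cgroup live under /proc. A sketch, assuming a mainstream Linux host:
# The container's main process, as seen by the host
pid=$(docker inspect --format '{{.State.Pid}}' <container>)
# Its namespaces (compare the inode numbers against another process)
ls -l /proc/$pid/ns
# Its cgroup membership
cat /proc/$pid/cgroup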
Why this matters during incidents:
- A container can be “slow” because it’s CPU-throttled, not because the app code suddenly got worse.
- A container can restart because it hit its memory limit and the kernel OOM killer terminated its processes.
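Both failure modes leave evidence in the container's cgroup files. The host-side path depends on your cgroup version and driver, so the sketch below reads them from inside the container instead (assumes a cgroup v2 host with the default private cgroup namespace, and an image that ships cat):
# Nonzero nr_throttled / throttled_usec = the CPU limit is biting
docker exec <container> cat /sys/fs/cgroup/cpu.stat
# The oom_kill counter increments when the kernel killed a process in this group
docker exec <container> cat /sys/fs/cgroup/memory.events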
5) Host-level confirmation (tie back to your Linux monitoring toolkit)
When docker stats shows a spike, verify on the host to avoid false conclusions.
CPU hogs
ps aux --sort=-%cpu | head -15
Memory pressure
free -h
Disk full / log explosions
df -h
du -sh /var/lib/docker/* 2>/dev/null | sort -h | tail -10
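With the default json-file log driver, container logs live under /var/lib/docker/containers and can quietly fill a disk. A sketch for finding and capping them (run as root; the truncate is a mitigation, not a fix):
# Biggest container log files
du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail -5
# Emergency mitigation: empty the offender in place (don't rm it while the container runs)
truncate -s 0 /var/lib/docker/containers/<id>/<id>-json.log
# Longer term: cap log size at container start
docker run --log-opt max-size=10m --log-opt max-file=3 <image>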
Disk I/O saturation
iostat -x 1 3
Unexpected listeners / traffic patterns
ss -tuln
These host checks help you decide whether you’re dealing with a single container or a node-wide saturation problem.
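One gap when you pivot to the host: ps shows raw PIDs, not container names. A sketch for mapping between them:
# Every container's main host PID next to its name
docker ps -q | xargs docker inspect --format '{{.State.Pid}} {{.Name}}'
# Or the other way: which cgroup (and therefore container) owns a hot PID?
cat /proc/<pid>/cgroup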
6) What to do with the data (action mapping)
Use the shortest safe path to stability:
1. CPU high + latency rising
- If CPU is legitimately needed: scale out / add capacity.
- If CPU is throttled: revisit the limits (requests/limits in Kubernetes; --cpus or --cpu-shares in plain Docker).
2. Memory near limit
- If memory leak suspected: restart as mitigation + open an issue with heap profiling.
- If the limit is too low for normal peaks: adjust it carefully and monitor (see the docker update sketch after this list).
3. Block I/O high
- Check log volume and disk saturation; reduce noisy logs or move logs off disk.
- Consider storage performance constraints and workload patterns.
4. Network I/O abnormal
- Look for retries, timeouts, DDoS/abuse patterns, or upstream issues.
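For items 1 and 2, plain Docker can adjust limits on a running container without recreating it, via docker update. A sketch (on some cgroup v1 setups, raising --memory also requires raising --memory-swap, and flag support varies by Docker version):
# Loosen the CPU and memory ceilings in place
docker update --cpus 2 --memory 1g --memory-swap 2g <container>
# Verify the new limits took effect
docker inspect --format 'mem={{.HostConfig.Memory}} nanocpus={{.HostConfig.NanoCpus}}' <container>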
7) Copy/paste triage sequence (5 minutes)
# 1) Find the hot container
docker stats --no-stream
# 2) Identify it
docker ps
docker inspect <container> | less
# 3) Check symptoms
docker logs --tail 200 <container>
# 4) Confirm on host (avoid guessing)
ps aux --sort=-%cpu | head -10
free -h
df -h
iostat -x 1 3
ss -tuln
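If you run this sequence often, capture it to a timestamped file so the evidence survives the incident. A minimal sketch (triage-snapshot.sh is a hypothetical name; adjust the command list to taste):
#!/usr/bin/env bash
# triage-snapshot.sh (hypothetical): dump the 5-minute triage to a file
out="triage-$(date +%Y%m%d-%H%M%S).txt"
{
  docker stats --no-stream
  docker ps
  ps aux --sort=-%cpu | head -10
  free -h
  df -h
  iostat -x 1 3
  ss -tuln
} > "$out" 2>&1
echo "Wrote $out"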
What’s your most common container failure mode: OOM kills, CPU throttling, disk I/O, or network timeouts?