The Dashboard Problem
What we built: 47 tiles showing metrics, logs, alerts, compliance, costs
What NOC uses: 3 tiles
Why: The dashboard answered "What's our resource count?" when the NOC was asking "What's broken?"
What NOC Teams Actually Need
Question #1: "What's Down Right Now?"
Not: 15 charts showing healthy services
Yes: List of failures, ranked by business impact
Question #2: "What Needs My Attention?"
Not: 200 active alerts
Yes: 5 critical alerts requiring human action
Question #3: "Is This Normal?"
Not: Current CPU usage
Yes: Current vs 7-day baseline with "normal range" shading
Dashboard Design That Works
Tile 1: Critical Incidents (Top Priority)
Query:
AzureActivity
| where Level == "Critical" or Level == "Error"
| where TimeGenerated > ago(1h)
| summarize Count = count(), FirstSeen = min(TimeGenerated) by ResourceGroup, OperationNameValue
| order by Count desc
| take 10
Display:
- Red alert icon
- Resource name
- Error count
- Time since first occurrence
- Business impact (if known)
Tile 2: Service Health Issues
Query (runs in Azure Resource Graph, not in a Log Analytics workspace):
ServiceHealthResources
| where type =~ "microsoft.resourcehealth/events"
| where properties.status == "Active"
| project ServiceName = properties.service,
          Issue = properties.title,
          Impact = properties.impact
Display:
- Azure service name
- Issue description
- Affected regions
- Link to status page
Tile 3: Failed Deployments
Query:
AzureActivity
| where OperationNameValue contains "Microsoft.Resources/deployments/write"
| where ActivityStatusValue in~ ("Failure", "Failed")  // current schema logs "Failure"; "Failed" kept for older records
| where TimeGenerated > ago(24h)
| project TimeGenerated, Caller, ResourceGroup, ErrorMessage = Properties
Display:
- Who tried to deploy
- What failed
- Error message
- Time
Tile 4: Abnormal Resource Consumption
Query:
Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer
| where AvgCPU > 85
Display:
- VM name
- Current CPU %
- Comparison to 7-day average
- Threshold breach time
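The tile query above returns only the current average, while the display list also calls for a 7-day comparison and a breach time. One possible way to produce those columns is sketched here. It assumes the same Perf counter has been collected for at least seven days; the column names BaselineCPU, DeltaFromBaseline, and BreachStart are illustrative and not part of the original dashboard.

```kql
// Sketch: last-hour CPU per VM next to that VM's own 7-day average.
// Assumption: same Perf table and counter as the tile query.
let baseline = Perf
    | where TimeGenerated between (ago(7d) .. ago(1h))
    | where CounterName == "% Processor Time"
    | summarize BaselineCPU = avg(CounterValue) by Computer;
Perf
| where TimeGenerated > ago(1h)
| where CounterName == "% Processor Time"
// BreachStart = first sample in the last hour that exceeded the 85% threshold
| summarize CurrentCPU = avg(CounterValue),
            BreachStart = minif(TimeGenerated, CounterValue > 85)
        by Computer
| where CurrentCPU > 85
| join kind=leftouter (baseline) on Computer
| extend DeltaFromBaseline = CurrentCPU - BaselineCPU
| project Computer,
          CurrentCPU = round(CurrentCPU, 1),
          BaselineCPU = round(BaselineCPU, 1),
          DeltaFromBaseline = round(DeltaFromBaseline, 1),
          BreachStart
| order by CurrentCPU desc
```

A VM that always runs near 90% will show a small DeltaFromBaseline, which is a hint that the fixed 85% threshold may be the wrong alarm for that machine.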
Tile 5: Budget Alerts
Query:
AzureActivity
| where OperationNameValue contains "Microsoft.Consumption"
| where Level == "Warning" or Level == "Error"
| where TimeGenerated > ago(24h)
Display:
- Subscription name
- Current spend
- Budget amount
- Forecast end-of-month
What NOT to Include
❌ Resource Counts
Why NOC doesn't care: "We have 847 VMs" doesn't help incident response
Who cares: Capacity planning team
Where it belongs: Monthly capacity review, not NOC dashboard
❌ Compliance Metrics
Why NOC doesn't care: "72% compliant with tag policy" isn't urgent at 2 AM
Who cares: Governance team
Where it belongs: Weekly governance report
❌ Cost Breakdown Charts
Why NOC doesn't care: "Compute is 45% of spend" doesn't help fix outages
Who cares: FinOps team
Where it belongs: Monthly cost review
❌ "Healthy" Status
Why NOC doesn't care: If it's working, they don't need to see it
Better: Only show failures. If dashboard is empty, everything's fine.
Real NOC Dashboard Example
Our 5-tile dashboard:
- Critical Alerts (red box, top-left)
  - Currently: 0
  - If >0: Shows alert details
- Service Health (orange box, top-right)
  - Currently: 1 (Azure DevOps degraded, East US)
  - Impact: Low
- Failed Deployments (yellow box, middle-left)
  - Last 24h: 3 failures
  - Links to logs
- High CPU VMs (yellow box, middle-right)
  - Currently: 2 VMs over 85%
  - Shows VM names, current %
- Budget Status (green box, bottom)
  - 67% of monthly budget used
  - 45% of month elapsed
  - Forecast: On track
Total tiles: 5
Time to understand status: 10 seconds
Dashboard Refresh Strategy
Real-Time Data (1-minute refresh)
- Critical alerts
- Service health
- High CPU/memory
Near Real-Time (5-minute refresh)
- Failed deployments
- Error logs
- Network issues
Hourly Refresh
- Budget status
- Backup failures
- Compliance alerts
Common Mistakes
❌ Mistake #1: Too Many Tiles
Problem: 47 tiles, can't see critical issues
Fix: Maximum 10 tiles, prioritize by urgency
❌ Mistake #2: Showing "Green"
Problem: "99% of services healthy" takes up space that failures need
Fix: Only show failures. Empty dashboard = everything's fine.
❌ Mistake #3: No Business Context
Problem: "VM-SQL-12 is down" (which app is that?)
Fix: Map VMs to apps in dashboard query
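A minimal sketch of that mapping, assuming the VM-to-app relationship is kept as a small inline lookup table. The AppMap table, the app names, and the impact levels are hypothetical. The Heartbeat table and the 5-minute silence threshold are also assumptions; the original does not say how "down" is detected.

```kql
// Hypothetical lookup table: replace with your own VM-to-app mapping
// (for example, generated from resource tags or a CMDB export).
let AppMap = datatable(Computer:string, App:string, Impact:string)
[
    "VM-SQL-12", "Order database (hypothetical)", "High",
    "VM-WEB-03", "Customer portal (hypothetical)", "Medium"
];
Heartbeat
| where TimeGenerated > ago(1d)
| summarize LastSeen = max(TimeGenerated) by Computer
// Assumption: a VM with no heartbeat for 5 minutes is treated as down
| where LastSeen < ago(5m)
| lookup kind=leftouter AppMap on Computer
| extend App = coalesce(App, "unmapped")
| project Computer, App, Impact, LastSeen
| order by LastSeen asc
```

Heartbeat may report Computer as a fully qualified name, so the keys in the lookup table need to match whatever form your workspace records. Rows that come back as "unmapped" are a useful to-do list for whoever owns the mapping.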
❌ Mistake #4: Metrics Without Baselines
Problem: "CPU is 45%" (is that normal?)
Fix: Show current vs 7-day average
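One way to answer "is that normal?" directly is to flag only the VMs whose current CPU falls outside their own recent range. The sketch below treats the 5th to 95th percentile of the previous seven days as the "normal range"; that band, the 15-minute current window, and the column names are assumptions chosen for illustration.

```kql
// Sketch: per-VM "normal range" from the previous 7 days of samples.
let band = Perf
    | where TimeGenerated between (ago(7d) .. ago(1h))
    | where CounterName == "% Processor Time"
    | summarize Low = percentile(CounterValue, 5),
                High = percentile(CounterValue, 95)
            by Computer;
Perf
| where TimeGenerated > ago(15m)
| where CounterName == "% Processor Time"
| summarize CurrentCPU = avg(CounterValue) by Computer
| join kind=inner (band) on Computer
| extend Status = case(CurrentCPU > High, "above normal",
                       CurrentCPU < Low, "below normal",
                       "normal")
// Keep the tile empty when everything is inside its usual range
| where Status != "normal"
| project Computer,
          CurrentCPU = round(CurrentCPU, 1),
          NormalLow = round(Low, 1),
          NormalHigh = round(High, 1),
          Status
```

Under this approach "CPU is 45%" only appears on the dashboard for a VM where 45% is unusual, which keeps the tile consistent with showing failures only.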
The "Empty Dashboard Is Good" Philosophy
Traditional thinking: Dashboard must always show data
Better thinking: Dashboard shows PROBLEMS
Result:
- Dashboard empty most of the time
- When something appears, it's urgent
- NOC knows exactly what to fix
Multi-Team Dashboard Strategy
Don't: One dashboard for everyone
Do: Separate dashboards per team:
NOC Dashboard
- Incidents requiring immediate action
- 5 tiles, 10-second understanding
FinOps Dashboard
- Cost trends
- Budget tracking
- Reservation coverage
Security Dashboard
- Security alerts
- Compliance violations
- Vulnerability scans
Capacity Dashboard
- Resource utilization
- Growth trends
- Forecast capacity needs
Full Dashboard Templates
Complete KQL queries, Azure Monitor Workbook templates, and multi-team dashboard architecture:
👉 Azure NOC Dashboard Complete Guide
Building dashboards for NOC teams? Show problems, not status. Empty dashboard = everything's working. That's success.