Designing metrics that make sense
When an alert goes off at 3 AM and you are staring at it with blurry eyes and only partial consciousness, the most important question is not why the alert fired. It is:
where should my attention go first to understand what’s happening?
- Is the system down?
- Are user requests failing?
- Is latency spiking?
- Is a critical 3rd party integration unreachable?
- Is this a traffic surge or a cascading failure?
Nothing is worse than staring at a dashboard during an incident and wondering if the alert is real or just a side-effect of bad instrumentation.
Prometheus has become the de facto standard for cloud-native monitoring. It is powerful, but it also forces architectural choices: Counter or Gauge? Histogram or Summary? Push or pull? Those choices dictate whether your dashboards provide clarity or noise during chaos.
Prometheus is designed for:
- Time-series monitoring
- Event counting & system behavior tracking
- Real-time alerting
- Observability in distributed systems
It uses a pull model, stores time-series data, and provides a powerful query language called PromQL.
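To make the pull model concrete, here is a minimal sketch in plain Go (standard library only; in practice the client_golang library generates this for you) of the text exposition format a Prometheus server pulls each time it scrapes GET /metrics on a process. The metric name matches the examples later in this post; everything else is illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// A counter that request handlers would increment on every request.
var httpRequestsTotal atomic.Int64

// renderMetrics produces the Prometheus text exposition format — the payload
// a Prometheus server pulls when it scrapes this process.
func renderMetrics() string {
	return fmt.Sprintf("# TYPE http_requests_total counter\nhttp_requests_total %d\n",
		httpRequestsTotal.Load())
}

func main() {
	httpRequestsTotal.Add(1) // simulate one handled request
	// In a real service this string would be served at /metrics.
	fmt.Print(renderMetrics())
}
```

The key point of the pull model: the application only exposes its current state; Prometheus decides when to collect it.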
Why Metric Design Matters
Averages can be misleading because they hide outliers and spiky data. An average (mean) cannot distinguish between a system where all users experience 100ms latency and one where half experience 1ms and the other half experience 200ms.
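That claim is easy to verify with a quick sketch in plain Go (made-up latency numbers, no Prometheus involved): a bimodal workload produces a mean that matches no user's actual experience, while the p95 exposes the slow half.

```go
package main

import (
	"fmt"
	"sort"
)

// mean returns the arithmetic average of the samples.
func mean(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// percentile returns the p-th percentile (0..1) using the nearest-rank method.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(p * float64(len(sorted)))
	if rank >= len(sorted) {
		rank = len(sorted) - 1
	}
	return sorted[rank]
}

func main() {
	// Bimodal latency: half the users see 1ms, half see 200ms.
	var latencies []float64
	for i := 0; i < 50; i++ {
		latencies = append(latencies, 1, 200)
	}
	fmt.Printf("mean=%.1fms p95=%.0fms\n", mean(latencies), percentile(latencies, 0.95))
	// → mean=100.5ms p95=200ms: the mean describes nobody's experience.
}
```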
Understanding rates, distributions, and system state is essential for accurate observability.
Prometheus Metric Types: When to Use Each
Prometheus provides four core metric types, and it is important to match each type to the question being asked.
| Type | Use Case | Example |
|---|---|---|
| Counter | Values that only increase | HTTP requests |
| Gauge | Values that go up & down | Memory usage |
| Histogram | Request duration & distributions | API latency |
| Summary | Client-side latency quantiles | Response time percentiles |
Counter
Counters are ideal for:
- HTTP request counts
- Jobs processed
- Error counts
- Retry counts
- Cache hits/misses
Code
httpRequestsTotal.Inc()
Query
rate(http_requests_total[5m])
Tip: Always use rate() or increase() with counters; raw values rarely provide insight.
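What rate() computes is, roughly, the per-second slope of the counter over the window. A sketch with illustrative numbers (the real rate() also corrects for counter resets and extrapolates to the window boundaries, which this does not):

```go
package main

import "fmt"

// counterRate approximates PromQL's rate(): the per-second increase of a
// counter between two scrapes taken windowSeconds apart.
func counterRate(earlier, later, windowSeconds float64) float64 {
	return (later - earlier) / windowSeconds
}

func main() {
	// http_requests_total was 1200 five minutes (300s) ago and is 1800 now.
	fmt.Printf("%.0f req/s\n", counterRate(1200, 1800, 300)) // → 2 req/s
}
```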
Gauge
Gauges are ideal for:
- CPU usage
- Memory consumption
- Active connections
- Queue depth
- Number of running pods
Code
queueDepth.Set(float64(noOfItems))
Query
myapp_queue_depth
Tip: If a metric only ever increases, it should be a counter; if it goes up and down, it should be a gauge.
Histogram
Histograms capture distributions across predefined buckets, enabling accurate percentile calculations on the server side.
Histograms are ideal for:
- Request latency
- Database query duration
- Payload sizes
- Job processing time
Code
requestDuration.Observe(duration)
Query
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
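Under the hood, histogram_quantile() works on the cumulative le buckets the histogram exports: find the bucket the target rank falls into, then interpolate linearly inside it. A simplified sketch in Go with made-up bucket counts (it skips the rate() step, +Inf handling, and other edge cases the real function covers):

```go
package main

import "fmt"

// bucket is one cumulative ("le") histogram bucket, as Prometheus stores them.
type bucket struct {
	upperBound float64 // the "le" label
	count      float64 // cumulative observations <= upperBound
}

// quantile mimics the core of histogram_quantile(): locate the bucket that
// contains the target rank, then interpolate linearly inside that bucket.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	prevBound, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return prevBound + (b.upperBound-prevBound)*(rank-prevCount)/(b.count-prevCount)
		}
		prevBound, prevCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// 50 requests finished under 100ms, 90 under 500ms, all 100 under 1s.
	b := []bucket{{0.1, 50}, {0.5, 90}, {1.0, 100}}
	fmt.Printf("p95 ≈ %.3fs\n", quantile(0.95, b))
}
```

This also shows why bucket boundaries matter: the p95 is an interpolation inside a bucket, so accuracy depends on how finely the buckets cover the latencies you care about.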
Summary
Summaries calculate percentiles on the client side and suit single-instance applications where pre-defining histogram buckets is impractical.
Summaries are ideal when:
- Running a single instance
- Aggregation across instances is not required
Because summary quantiles cannot be meaningfully aggregated across instances, histograms are typically preferred in microservices environments.
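The aggregation problem is easy to demonstrate: averaging per-instance percentiles is not the same as taking the percentile over all requests. A sketch with made-up latencies, using the nearest-rank method:

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th percentile via the nearest-rank method.
func p95(samples []float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	i := int(0.95 * float64(len(sorted)))
	if i >= len(sorted) {
		i = len(sorted) - 1
	}
	return sorted[i]
}

func main() {
	// Two instances; only instance A sees slow requests.
	instanceA := []float64{10, 10, 10, 10, 10, 10, 10, 10, 10, 500} // ms
	instanceB := []float64{10, 10, 10, 10, 10, 10, 10, 10, 10, 10}  // ms

	// What you'd get by averaging each instance's pre-computed summary p95...
	avgOfQuantiles := (p95(instanceA) + p95(instanceB)) / 2
	// ...versus the p95 over the raw, combined observations.
	global := p95(append(append([]float64(nil), instanceA...), instanceB...))

	fmt.Printf("avg of per-instance p95s: %.0fms, true global p95: %.0fms\n",
		avgOfQuantiles, global)
}
```

Histograms sidestep this: their bucket counters are plain counters, so they can be summed across instances before the quantile is computed.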
Labels
Labels allow metrics to be sliced and analyzed.
http_requests_total{method="POST", status="500"}
Use labels for:
- Status codes
- Endpoints
- Regions
- Service names
Tip: Avoid using labels for high-cardinality data like user IDs or request IDs, which can explode into an unbounded number of time series.
High cardinality severely impacts performance: it bloats Prometheus memory and slows down queries.
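The series count is multiplicative: every combination of label values becomes its own time series. A back-of-the-envelope sketch with illustrative cardinalities:

```go
package main

import "fmt"

// seriesCount estimates how many time series one metric name can produce:
// the product of each label's number of distinct values.
func seriesCount(labelCardinalities ...int) int {
	n := 1
	for _, c := range labelCardinalities {
		n *= c
	}
	return n
}

func main() {
	// Bounded labels: 5 methods x 10 status codes x 50 endpoints.
	fmt.Println("bounded:", seriesCount(5, 10, 50)) // 2500 series — fine
	// Add a user_id label with a million users and it explodes.
	fmt.Println("with user_id:", seriesCount(5, 10, 50, 1_000_000))
}
```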
Finally, observability is only useful when metrics answer operational questions quickly, especially under pressure. At 3 AM, the goal is not to admire dashboards. The goal is to understand reality:
- What are users experiencing?
- What changed?
- Where is the bottleneck?
- Is this a spike or a trend?
If dashboards feel noisy or alerts lack value, the issue is often design rather than tooling. Choosing the right metric type transforms Prometheus from a data collector into a system that provides clarity when it matters most.