
Designing metrics that make sense

· 4 min read

When an alert fires at 3 AM and you are blurry-eyed and only half conscious, the most important question is not why the alert fired. It is:

where should my attention go first to understand what’s happening?

  • Is the system down?
  • Are user requests failing?
  • Is latency spiking?
  • Is a critical 3rd party integration unreachable?
  • Is this a traffic surge or a cascading failure?

Nothing is worse than staring at a dashboard during an incident and wondering if the alert is real or just a side-effect of bad instrumentation.

Prometheus has become the de-facto standard for cloud-native monitoring. It is powerful, but it also forces architectural choices: Counter or Gauge? Histogram or Summary? Push or pull? Those choices dictate whether your dashboards provide clarity or noise during chaos.

Prometheus is designed for:

  • Time-series monitoring
  • Event counting & system behavior tracking
  • Real-time alerting
  • Observability in distributed systems

It uses a pull model, stores time-series data, and provides a powerful query language called PromQL.

Why Metric Design Matters

Averages can be misleading because they hide outliers and spiky data. An average (mean) cannot distinguish between a system where all users experience 100ms latency and one where half experience 1ms and the other half experience 200ms.
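To make this concrete, here is a stdlib-only sketch (illustrative, not tied to any Prometheus API) comparing the two distributions described above: one where every user sees 100ms, and one where half see 1ms and half see 200ms. The means are nearly identical even though the experiences are completely different.

```go
// Illustrative sketch: two latency distributions with nearly
// identical means but very different user experiences.
package main

import "fmt"

// mean returns the arithmetic average of the samples.
func mean(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// uniformLatencies: every user sees 100ms.
func uniformLatencies() []float64 {
	s := make([]float64, 100)
	for i := range s {
		s[i] = 100
	}
	return s
}

// bimodalLatencies: half the users see 1ms, half see 200ms.
func bimodalLatencies() []float64 {
	s := make([]float64, 100)
	for i := range s {
		if i < 50 {
			s[i] = 1
		} else {
			s[i] = 200
		}
	}
	return s
}

func main() {
	fmt.Printf("uniform mean: %.1fms\n", mean(uniformLatencies())) // 100.0ms
	fmt.Printf("bimodal mean: %.1fms\n", mean(bimodalLatencies())) // 100.5ms
}
```

The averages differ by half a millisecond, yet in the bimodal system half of all users wait 200ms. This is exactly the failure mode that percentiles and histograms exist to expose.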

Understanding rates, distributions, and system state is essential for accurate observability.

Prometheus Metric Types: When to Use Each

Prometheus provides four core metric types and it's important to match each type to the question being asked.

Type       Use Case                          Example
---------  --------------------------------  -------------------------
Counter    Values that only increase         HTTP requests
Gauge      Values that go up & down          Memory usage
Histogram  Request duration & distributions  API latency
Summary    Client-side latency quantiles     Response time percentiles

Counter

Counters are ideal for:

  • HTTP request counts
  • Jobs processed
  • Error counts
  • Retry counts
  • Cache hits/misses

Code

Go
httpRequestsTotal.Inc()

Query

promql
rate(http_requests_total[5m])

Tip: Always use rate() or increase() with counters; raw values rarely provide insight.
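The intuition behind rate() can be sketched by hand (a simplified illustration of the idea, not the actual PromQL implementation): given two counter samples, the per-second rate is the increase divided by the elapsed time, with a drop in value treated as a counter reset.

```go
package main

import "fmt"

// perSecondRate approximates what rate() computes for a counter:
// the increase between two samples divided by the elapsed seconds.
// A drop in value signals a counter reset (e.g. a process restart),
// in which case the new value itself is taken as the increase.
func perSecondRate(prev, curr, elapsedSeconds float64) float64 {
	increase := curr - prev
	if increase < 0 { // counter reset
		increase = curr
	}
	return increase / elapsedSeconds
}

func main() {
	// 1200 requests counted over a 60s window → 20 req/s.
	fmt.Println(perSecondRate(10000, 11200, 60)) // 20
	// Counter reset mid-window: the raw delta would be negative.
	fmt.Println(perSecondRate(500, 30, 60)) // 0.5
}
```

This is why the raw counter value is rarely useful on its own: it depends on how long the process has been running, while the rate answers the operational question "how busy is the system right now?".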

Gauge

Gauges are ideal for:

  • CPU usage
  • Memory consumption
  • Active connections
  • Queue depth
  • Number of running pods

Code

Go
queueDepth.Set(float64(noOfItems))

Query

promql
myapp_queue_depth

Tip: If a metric never decreases, it should be a counter; if it goes up and down, it should be a gauge.
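Gauge semantics are simple enough to sketch in a few lines of stdlib Go (illustrative only; real instrumentation would use prometheus.NewGauge from client_golang): unlike a counter, a gauge can be set to an arbitrary value and moved in both directions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Gauge mirrors the semantics of a Prometheus gauge for an
// integer value: it can be set, incremented, and decremented
// freely, because it represents current state, not a total.
type Gauge struct{ v atomic.Int64 }

func (g *Gauge) Set(n int64)  { g.v.Store(n) }
func (g *Gauge) Inc()         { g.v.Add(1) }
func (g *Gauge) Dec()         { g.v.Add(-1) }
func (g *Gauge) Value() int64 { return g.v.Load() }

func main() {
	var queueDepth Gauge
	queueDepth.Set(5) // five items enqueued
	queueDepth.Dec()  // one item processed
	fmt.Println(queueDepth.Value()) // 4
}
```

Because a gauge reports current state, querying its raw value (as in the myapp_queue_depth query above) is meaningful, whereas a raw counter value is not.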

Histogram

Histograms capture distributions across buckets, enabling accurate percentile calculations.

Histograms are ideal for:

  • Request latency
  • Database query duration
  • Payload sizes
  • Job processing time

Code

Go
requestDuration.Observe(duration)

Query

promql
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
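The idea behind histogram_quantile() can be sketched as follows (a simplified illustration that ignores the +Inf bucket and other edge cases PromQL handles): find the cumulative bucket containing the target rank, then linearly interpolate within it.

```go
package main

import "fmt"

// bucket mirrors one cumulative histogram bucket: the count of
// observations with value <= upperBound (like _bucket{le="..."}).
type bucket struct {
	upperBound float64
	count      float64 // cumulative
}

// quantile estimates a quantile from cumulative buckets using
// linear interpolation within the bucket that contains the target
// rank — the same idea histogram_quantile() applies to rate()'d
// bucket series. Simplified sketch: buckets must be sorted and
// non-empty, and the +Inf bucket is not modelled.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			// Interpolate within [lowerBound, b.upperBound].
			fraction := (rank - lowerCount) / (b.count - lowerCount)
			return lowerBound + (b.upperBound-lowerBound)*fraction
		}
		lowerBound, lowerCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// Latency buckets in seconds: 90 requests <= 0.1s,
	// 99 <= 0.5s, 100 <= 1s.
	buckets := []bucket{{0.1, 90}, {0.5, 99}, {1.0, 100}}
	fmt.Printf("p95 ≈ %.3fs\n", quantile(0.95, buckets))
}
```

This also shows why bucket boundaries matter: the estimate is only as precise as the bucket the target rank lands in, so buckets should be chosen around the latencies you actually care about (e.g. your SLO thresholds).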

Summary

Summaries are used to calculate percentiles on the client-side for single-instance applications where pre-defining histogram buckets is impractical.

Summaries are ideal when:

  • running a single instance
  • aggregation is not required

Summaries cannot be aggregated across instances, so histograms are typically preferred in microservices environments.

Labels

Labels allow metrics to be sliced and analyzed.

promql
http_requests_total{method="POST", status="500"}

Use labels for:

  • status codes
  • endpoints
  • regions
  • service names

Tip: Avoid using labels for high-cardinality data like user IDs or request IDs, which can explode into an unbounded number of time-series. High cardinality severely impacts performance by bloating memory and slowing down queries.
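The explosion is multiplicative, which a quick back-of-the-envelope calculation makes obvious (illustrative arithmetic, not a Prometheus API): the number of time-series for a metric is the product of the distinct values of each of its labels.

```go
package main

import "fmt"

// seriesCount shows why high-cardinality labels are dangerous:
// the number of time-series for a metric is the product of each
// label's distinct values, so one unbounded label multiplies
// everything else.
func seriesCount(labelCardinalities ...int) int {
	total := 1
	for _, c := range labelCardinalities {
		total *= c
	}
	return total
}

func main() {
	// 5 methods × 10 endpoints × 6 status codes = 300 series: fine.
	fmt.Println(seriesCount(5, 10, 6))
	// Add a user_id label with 100k users → 30 million series.
	fmt.Println(seriesCount(5, 10, 6, 100000))
}
```

Each of those series carries its own samples in memory and on disk, which is why a single user-ID label can take a healthy Prometheus server down.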

Finally, observability is useful when metrics answer operational questions quickly, especially under pressure. At 3 AM, the goal is not to admire dashboards. The goal is to understand reality:

  • What are users experiencing?
  • What changed?
  • Where is the bottleneck?
  • Is this a spike or a trend?

If dashboards feel noisy or alerts lack value, the issue is often design rather than tooling. Choosing the right metric type transforms Prometheus from a data collector into a system that provides clarity when it matters most.