Building for Scale: Observability & Resilience

Keep a busy service healthy. Measure it with metrics (prom-client), watch it with Prometheus, trace requests with OpenTelemetry, and survive failure with graceful degradation, rate limiting, and a circuit breaker.

Instrumentation

Why: instrumentation means adding small bits of code that measure what your app is doing: how many requests it handles, how long they take, how many fail. You cannot improve or fix what you cannot see. These measurements are called metrics, and prom-client is the most widely used Prometheus client library for Node.

$ pnpm add prom-client

// metrics.js — define what you want to measure
import client from 'prom-client'

// Collect default Node metrics for free (memory, CPU, event-loop lag)
client.collectDefaultMetrics()

// A counter only ever goes up — perfect for "how many requests"
export const requestCount = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
})

// A histogram records a distribution — perfect for "how long did it take"
export const requestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'route'],
})
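
To make "records a distribution" concrete, here is a minimal, dependency-free sketch of how a Prometheus-style histogram sorts observations into cumulative buckets. It illustrates the idea only and is not prom-client's implementation; `createHistogram` and the bucket bounds are hypothetical names and values chosen for this example.

```javascript
// Sketch only: a Prometheus-style histogram with cumulative buckets.
// createHistogram and its bounds are made up for this example.
function createHistogram(upperBounds) {
  const bounds = [...upperBounds].sort((a, b) => a - b)
  const counts = new Array(bounds.length).fill(0)
  let sum = 0
  let count = 0

  return {
    observe(value) {
      sum += value
      count += 1
      // Cumulative: an observation counts in every bucket whose bound is >= value
      bounds.forEach((le, i) => {
        if (value <= le) counts[i] += 1
      })
    },
    snapshot() {
      const buckets = bounds.map((le, i) => ({ le: String(le), count: counts[i] }))
      // The +Inf bucket always equals the total number of observations
      buckets.push({ le: '+Inf', count })
      return { buckets, sum, count }
    },
  }
}

// Four request durations in seconds, three bucket bounds
const durations = createHistogram([0.1, 0.5, 1])
for (const seconds of [0.05, 0.3, 0.7, 2]) durations.observe(seconds)
console.log(durations.snapshot().buckets)
```

Because the buckets are cumulative, the share of requests that finished within 0.5 seconds is the 0.5 bucket divided by the total count, here 2 of 4.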

Monitoring

Why: monitoring is watching those metrics over time and alerting a human when something looks wrong. The usual setup: your app exposes its numbers at a /metrics URL, a tool called Prometheus visits that URL every few seconds to record them, and Grafana draws the graphs and fires alerts (for example, "error rate above 5% for 5 minutes"). Your only job inside the app is to expose /metrics and update the numbers.

// app.js — measure every request and expose the numbers
import express from 'express'
import client from 'prom-client'
import { requestCount, requestDuration } from './metrics.js'

const app = express()

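// Time each request and count it once it finishes
app.use((req, res, next) => {
  const stop = requestDuration.startTimer()
  res.on('finish', () => {
    // Label with the matched route pattern (e.g. /users/:id), not the raw path:
    // req.path would create a new label value, and so a new time series, per URL
    const route = req.route ? req.baseUrl + req.route.path : 'unmatched'
    requestCount.inc({ method: req.method, route, status: res.statusCode })
    stop({ method: req.method, route })
  })
  next()
})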

// Prometheus scrapes (visits) this endpoint on a schedule
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.end(await client.register.metrics())
})

app.listen(3000)
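
It can help to see roughly what Prometheus reads when it scrapes /metrics. The sketch below renders one counter in the Prometheus text exposition format by hand. prom-client produces this output for you, so this is purely illustrative; `renderCounter`, `escapeLabelValue`, and the sample values are hypothetical.

```javascript
// Sketch only: render one counter in the Prometheus text format.
// renderCounter, escapeLabelValue and the sample data are hypothetical.
function escapeLabelValue(value) {
  // Backslash, double quote and newline must be escaped inside label values
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
}

function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`]
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',')
    lines.push(`${name}{${labelText}} ${value}`)
  }
  return lines.join('\n') + '\n'
}

const text = renderCounter('http_requests_total', 'Total HTTP requests', [
  { labels: { method: 'GET', route: '/users/:id', status: 200 }, value: 42 },
  { labels: { method: 'POST', route: '/orders', status: 500 }, value: 3 },
])
console.log(text)
```

Each distinct combination of label values becomes its own line, and its own time series in Prometheus, which is why labels should stay low-cardinality.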

Telemetry

Why: telemetry is data your app emits about itself so you can understand it from the outside. The most useful kind at scale is a trace — a record that follows one request as it hops between services, showing where the time went. OpenTelemetry (often shortened to OTel) is the vendor-neutral standard; it auto-instruments common libraries, so you get traces without rewriting your code.

$ pnpm add @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

// tracing.js — load this BEFORE the rest of your app
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'

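const sdk = new NodeSDK({
  // Auto-instruments Express, HTTP, database clients, and more
  instrumentations: [getNodeAutoInstrumentations()],
  // No exporter is set here. To send spans to a backend, pass a traceExporter
  // option or configure one through the OTEL_* environment variables.
})

sdk.start()
// Start your app with:  node --import ./tracing.js app.js
// (--import needs a recent Node release. ESM apps may also need OpenTelemetry's
// ESM loader hook; check the OpenTelemetry JS docs for your version.)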
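
What lets a trace "follow" a request between services is a small piece of context passed along with each call, usually the W3C `traceparent` HTTP header. OpenTelemetry handles this for you; the sketch below only illustrates the mechanism. `randomHex`, `startTrace`, and `childOf` are hypothetical helpers, not OpenTelemetry APIs.

```javascript
// Sketch only: propagate trace context with a W3C traceparent header.
// randomHex, startTrace and childOf are hypothetical helpers.
function randomHex(byteCount) {
  // Math.random is fine for a demo; real SDKs use a stronger random source
  let hex = ''
  for (let i = 0; i < byteCount; i++) {
    hex += Math.floor(Math.random() * 256).toString(16).padStart(2, '0')
  }
  return hex
}

// Header format: version-traceId-spanId-flags
// (2 hex chars, 32 hex chars, 16 hex chars, 2 hex chars)
function startTrace() {
  return `00-${randomHex(16)}-${randomHex(8)}-01`
}

// Called by a downstream service: keep the trace ID, mint a new span ID
function childOf(traceparent) {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(traceparent)
  if (!match) return startTrace() // missing or malformed header: start a new trace
  const [, traceId, , flags] = match
  return `00-${traceId}-${randomHex(8)}-${flags}`
}

const incoming = startTrace()        // created by the first service
const outgoing = childOf(incoming)   // sent on to the next service
console.log(incoming)
console.log(outgoing)
```

Both headers share the same trace ID, which is how a tracing backend stitches spans from different services into one timeline.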

Graceful Degradation

Why: graceful degradation means that when one piece breaks, the whole app does not. If a non-essential dependency (say, a recommendations service) is down, you return a sensible fallback instead of failing the page. The pattern: wrap the risky call in try/catch and always have a plan B.

// A product page that still works when recommendations are down
// (db, recommender, and logger stand in for your own modules)
async function getProductPage(id) {
  const product = await db.getProduct(id) // essential — let it throw

  let recommendations = []
  try {
    recommendations = await recommender.fetch(id) // nice-to-have
  } catch (err) {
    logger.warn('recommendations unavailable, showing none', { err })
    // fall back to an empty list instead of failing the whole page
  }

  return { product, recommendations }
}
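
A dependency that is slow can hurt a page as much as one that is down, so the same try/catch pattern is often paired with a time limit. The helper below is a minimal sketch of that idea, assuming the risky call returns a promise; `withFallback` and its parameters are hypothetical names, not part of any library.

```javascript
// Sketch only: run a risky async call with a time limit and a plan B.
// withFallback is a hypothetical helper.
async function withFallback(task, fallbackValue, timeoutMs) {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs,
    )
  })
  try {
    // Whichever settles first wins: the real result or the timeout
    return await Promise.race([task(), timeout])
  } catch (err) {
    // Failure or timeout: degrade to the fallback instead of throwing
    return fallbackValue
  } finally {
    clearTimeout(timer)
  }
}

// Usage inside getProductPage (recommender is your own module):
//   const recommendations = await withFallback(() => recommender.fetch(id), [], 500)
```

Note that the timeout only stops waiting; it does not cancel the underlying request.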

Throttling

Why: throttling (also called rate limiting) caps how many requests one client can make in a window of time. It protects your app from being overwhelmed — whether by a buggy client stuck in a retry loop or an abusive one. express-rate-limit is a drop-in middleware that returns HTTP 429 (Too Many Requests) once a caller goes over the limit.

$ pnpm add express-rate-limit

import express from 'express'
import rateLimit from 'express-rate-limit'

const app = express()

const limiter = rateLimit({
  windowMs: 60 * 1000, // the window: 1 minute
  limit: 100,          // max 100 requests per IP per window (named max before v7)
  standardHeaders: true,
  message: { error: 'Too many requests, please slow down.' },
})

app.use('/api', limiter) // protect the API routes
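
To show what "100 requests per IP per window" means mechanically, here is a minimal fixed-window counter. It is one common way to build a limiter, sketched for illustration, and not a description of how express-rate-limit works internally. `createFixedWindowLimiter` and `allow` are hypothetical names, and the clock is passed in so the behaviour is easy to follow.

```javascript
// Sketch only: a fixed-window rate limiter keyed by client (e.g. IP address).
// createFixedWindowLimiter is a hypothetical helper.
function createFixedWindowLimiter({ windowMs, limit }) {
  const hits = new Map() // key -> { count, windowStart }

  // Returns true if the request is allowed, false if it should get a 429
  return function allow(key, now = Date.now()) {
    const entry = hits.get(key)
    if (!entry || now - entry.windowStart >= windowMs) {
      // First request from this client, or its window expired: start a new one
      hits.set(key, { count: 1, windowStart: now })
      return true
    }
    entry.count += 1
    return entry.count <= limit
  }
}

const allow = createFixedWindowLimiter({ windowMs: 1000, limit: 2 })
console.log(allow('10.0.0.1', 0))    // first request in the window
console.log(allow('10.0.0.1', 1))    // second request, still within the limit
console.log(allow('10.0.0.1', 2))    // third request, over the limit
console.log(allow('10.0.0.1', 1000)) // window expired, counter starts again
```

An in-memory map like this lives in a single process; with several app instances behind a load balancer, each would count separately unless the counters move to a shared store.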

Circuit Breaker

Why: when a downstream service is failing, hammering it with retries makes things worse and ties up your own resources waiting for timeouts. A circuit breaker watches the failure rate and, once it crosses a threshold, "trips" — it stops calling the broken service for a while and fails fast (or returns a fallback) instead. After a cool-down it lets one test request through to check whether the service recovered. opossum is the common Node library.

$ pnpm add opossum

import CircuitBreaker from 'opossum'

// Wrap the unreliable call (callPaymentApi is your own async function)
const breaker = new CircuitBreaker(callPaymentApi, {
  timeout: 3000,                // a call taking >3s counts as a failure
  errorThresholdPercentage: 50, // trip once 50% of recent calls fail
  resetTimeout: 10000,          // after 10s, allow one test request through
})

// Plan B used while the circuit is open
breaker.fallback(() => ({ status: 'queued', note: 'payment delayed' }))

async function charge(order) {
  return breaker.fire(order) // fails fast instead of hanging when tripped
}
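
The closed, open, and half-open states are easier to follow as a small state machine. The sketch below is a deliberately simplified illustration, not opossum's implementation: it trips after a number of consecutive failures rather than a failure percentage, wraps synchronous functions only, and takes the clock as an argument. `createBreaker` and its options are hypothetical.

```javascript
// Sketch only: a minimal circuit breaker state machine.
// createBreaker is hypothetical and trips on consecutive failures.
function createBreaker({ failureThreshold, resetTimeoutMs }) {
  let state = 'closed'
  let failures = 0
  let openedAt = 0

  return {
    get state() {
      return state
    },
    call(fn, now = Date.now()) {
      if (state === 'open') {
        if (now - openedAt < resetTimeoutMs) {
          throw new Error('circuit open: failing fast')
        }
        state = 'half-open' // cool-down over: let one test call through
      }
      try {
        const result = fn()
        state = 'closed' // success closes the circuit and clears the count
        failures = 0
        return result
      } catch (err) {
        failures += 1
        // A failed test call, or too many failures in a row, opens the circuit
        if (state === 'half-open' || failures >= failureThreshold) {
          state = 'open'
          openedAt = now
        }
        throw err
      }
    },
  }
}
```

While the circuit is open, `call` throws immediately without invoking the wrapped function, which is the "fail fast" behaviour described above.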