Keep a busy service healthy. Measure it with metrics (prom-client), watch it with Prometheus, trace requests with OpenTelemetry, and survive failure with graceful degradation, rate limiting, and a circuit breaker.
Why: instrumentation means adding small bits of code that measure what your app is doing — how many requests it handles, how long they take, how many fail. You cannot improve or fix what you cannot see. These measurements are called metrics, and prom-client is the standard Node library for recording them.
$ pnpm add prom-client

// metrics.js: define what you want to measure
import client from 'prom-client'

// Collect default Node metrics for free (memory, CPU, event-loop lag)
client.collectDefaultMetrics()

// A counter only ever goes up, which suits "how many requests"
export const requestCount = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
})

// A histogram records a distribution, which suits "how long did it take"
export const requestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'route'],
})

Why: monitoring is watching those metrics over time and alerting a human when something looks wrong. The usual setup: your app exposes its numbers at a /metrics URL, a tool called Prometheus visits that URL every few seconds to record them, and Grafana draws the graphs and fires alerts (for example, "error rate above 5% for 5 minutes"). Your only job inside the app is to expose /metrics and update the numbers.
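To make the example alert concrete, here is a minimal sketch of the arithmetic behind an "error rate above 5%" rule. It is plain JavaScript and not how Prometheus or Grafana are implemented; errorRate, shouldAlert, and the { total, errors } snapshot shape are hypothetical names chosen for illustration. The idea is that counters only go up, so the rate for an interval comes from the difference between two consecutive scrapes.

```javascript
// Hypothetical helpers: not part of prom-client, Prometheus, or Grafana.
// Each snapshot is { total, errors }: cumulative counter values read at one scrape.
function errorRate(previous, current) {
  const requests = current.total - previous.total
  const errors = current.errors - previous.errors
  // A counter that went backwards suggests a process restart; skip that interval.
  if (requests <= 0 || errors < 0) return 0
  return errors / requests
}

// True only when every interval in the window is above the threshold,
// which mirrors an "above 5% for 5 minutes" style rule.
function shouldAlert(snapshots, threshold) {
  if (snapshots.length < 2) return false
  for (let i = 1; i < snapshots.length; i++) {
    if (errorRate(snapshots[i - 1], snapshots[i]) <= threshold) return false
  }
  return true
}

const scrapes = [
  { total: 1000, errors: 10 },
  { total: 1100, errors: 20 }, // 10 errors in 100 requests = 10%
  { total: 1200, errors: 28 }, // 8 errors in 100 requests = 8%
]
console.log(shouldAlert(scrapes, 0.05)) // true
```

Requiring every interval to exceed the threshold is what keeps a single bad scrape from paging someone.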
// app.js: measure every request and expose the numbers
import express from 'express'
import client from 'prom-client'
import { requestCount, requestDuration } from './metrics.js'

const app = express()

// Time each request and count it once it finishes
app.use((req, res, next) => {
  const stop = requestDuration.startTimer({ method: req.method })
  res.on('finish', () => {
    // Label with the matched route pattern (e.g. /users/:id), not the raw URL,
    // so every distinct ID does not create its own time series
    const route = req.route ? req.baseUrl + req.route.path : 'unmatched'
    requestCount.inc({ method: req.method, route, status: res.statusCode })
    stop({ route })
  })
  next()
})

// Prometheus scrapes (visits) this endpoint on a schedule
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.end(await client.register.metrics())
})

app.listen(3000)

Why: telemetry is data your app emits about itself so you can understand it from the outside. The most useful kind at scale is a trace: a record that follows one request as it hops between services, showing where the time went. OpenTelemetry (often shortened to OTel) is the vendor-neutral standard; it auto-instruments common libraries, so you get traces without rewriting your code.
$ pnpm add @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

// tracing.js: load this BEFORE the rest of your app
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'

const sdk = new NodeSDK({
  // Auto-instruments Express, HTTP, database clients, and more
  instrumentations: [getNodeAutoInstrumentations()],
})
sdk.start()

// Start your app with: node --import ./tracing.js app.js
// Note: when app.js uses ES modules, automatic instrumentation may also need
// OpenTelemetry's ESM loader hook; check the current OpenTelemetry JS docs.

Why: graceful degradation means that when one piece breaks, the whole app does not. If a non-essential dependency (say, a recommendations service) is down, you return a sensible fallback instead of failing the page. The pattern: wrap the risky call in try/catch and always have a plan B.
// A product page that still works when recommendations are down
async function getProductPage(id) {
  const product = await db.getProduct(id) // essential: let it throw
  let recommendations = []
  try {
    recommendations = await recommender.fetch(id) // nice-to-have
  } catch (err) {
    logger.warn('recommendations unavailable, showing none', { err })
    // fall back to an empty list instead of failing the whole page
  }
  return { product, recommendations }
}

Why: throttling (also called rate limiting) caps how many requests one client can make in a window of time. It protects your app from being overwhelmed, whether by a buggy client stuck in a retry loop or an abusive one. express-rate-limit is a drop-in middleware that returns HTTP 429 (Too Many Requests) once a caller goes over the limit.
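Before reaching for the library, it may help to see the idea in miniature. The following is a rough sketch of a fixed-window limiter in plain JavaScript; it is not express-rate-limit's implementation, and createFixedWindowLimiter, isAllowed, and the injectable now clock are hypothetical names introduced for illustration.

```javascript
// Illustrative only: express-rate-limit's internals may differ.
// Returns a function that answers "may this key make another request right now?"
function createFixedWindowLimiter({ windowMs, limit, now = Date.now }) {
  const hits = new Map() // key -> { windowStart, count }

  return function isAllowed(key) {
    const t = now()
    const entry = hits.get(key)
    // First request from this key, or its window has expired: start a new window.
    if (!entry || t - entry.windowStart >= windowMs) {
      hits.set(key, { windowStart: t, count: 1 })
      return true
    }
    entry.count += 1
    return entry.count <= limit
  }
}

// A fake clock keeps the demo deterministic.
let fakeTime = 0
const isAllowed = createFixedWindowLimiter({ windowMs: 60000, limit: 2, now: () => fakeTime })
console.log(isAllowed('1.2.3.4')) // true
console.log(isAllowed('1.2.3.4')) // true
console.log(isAllowed('1.2.3.4')) // false (third request in the same minute)
fakeTime = 60000
console.log(isAllowed('1.2.3.4')) // true (a new window has started)
```

A middleware built on this would respond with 429 whenever isAllowed returns false. The in-memory Map only works for a single process; running several instances would need a shared store.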
$ pnpm add express-rate-limit

import express from 'express'
import rateLimit from 'express-rate-limit'

const app = express()

const limiter = rateLimit({
  windowMs: 60 * 1000, // the window: 1 minute
  limit: 100, // max 100 requests per IP per window
  standardHeaders: true,
  message: { error: 'Too many requests, please slow down.' },
})

app.use('/api', limiter) // protect the API routes

Why: when a downstream service is failing, hammering it with retries makes things worse and ties up your own resources waiting for timeouts. A circuit breaker watches the failure rate and, once it crosses a threshold, "trips": it stops calling the broken service for a while and fails fast (or returns a fallback) instead. After a cool-down it lets one test request through to check whether the service recovered. opossum is the common Node library.
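The closed, open, and half-open states described above can be sketched as a small state machine. This is an illustration only, not opossum's implementation: createBreaker and its options are hypothetical names, the wrapped call is synchronous to keep the example short, and it trips on a count of consecutive failures rather than a failure percentage.

```javascript
// Illustrative only: a simplified, synchronous breaker, not opossum.
function createBreaker(action, { failureThreshold, resetTimeout, fallback, now = Date.now }) {
  let state = 'closed'
  let failures = 0
  let openedAt = 0

  return function fire(...args) {
    if (state === 'open') {
      // Still cooling down: fail fast without touching the broken service.
      if (now() - openedAt < resetTimeout) return fallback(...args)
      state = 'half-open' // cool-down over: let one test call through
    }
    try {
      const result = action(...args)
      state = 'closed'
      failures = 0
      return result
    } catch (err) {
      failures += 1
      // A failed test call, or too many failures in a row, opens the circuit.
      if (state === 'half-open' || failures >= failureThreshold) {
        state = 'open'
        openedAt = now()
      }
      return fallback(...args)
    }
  }
}

// Demo with a fake clock and a service that starts out broken.
let clock = 0
let healthy = false
let calls = 0
const flaky = () => {
  calls += 1
  if (!healthy) throw new Error('service down')
  return 'ok'
}
const fire = createBreaker(flaky, {
  failureThreshold: 2,
  resetTimeout: 10000,
  fallback: () => 'fallback',
  now: () => clock,
})

fire()
fire() // the second failure trips the breaker
console.log(fire()) // 'fallback', returned without calling flaky
console.log(calls) // 2
clock = 10000
healthy = true
console.log(fire()) // 'ok' (the test call succeeded, so the circuit closes)
```

The call counter shows why this helps: while the circuit is open, the failing service receives no traffic at all.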
$ pnpm add opossum

import CircuitBreaker from 'opossum'

// Wrap the unreliable call
const breaker = new CircuitBreaker(callPaymentApi, {
  timeout: 3000, // a call taking >3s counts as a failure
  errorThresholdPercentage: 50, // trip once 50% of recent calls fail
  resetTimeout: 10000, // after 10s, allow one test request through
})

// Plan B used while the circuit is open
breaker.fallback(() => ({ status: 'queued', note: 'payment delayed' }))

async function charge(order) {
  return breaker.fire(order) // fails fast instead of hanging when tripped
}