
Building for Scale: Observability & Resilience

Keep a busy service healthy. Measure it with metrics (prometheus-client), watch it with Prometheus, trace requests with OpenTelemetry, and survive failure with graceful degradation, rate limiting, and a circuit breaker.


Instrumentation

Why: instrumentation means adding small bits of code that measure what your app is doing: how many requests it handles, how long they take, how many fail. You cannot improve or fix what you cannot see. These measurements are called metrics, and prometheus-client is the official Prometheus client library for recording them in Python.

# Install:  pip install prometheus-client
# metrics.py — define what you want to measure
from prometheus_client import Counter, Histogram

# A counter only ever goes up — perfect for "how many requests"
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status'],
)

# A histogram records a distribution — perfect for "how long did it take"
REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'Request duration in seconds',
    ['method', 'path'],
)
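
To make "records a distribution" concrete, here is a minimal stdlib-only sketch of what a histogram keeps internally: a count per bucket, exposed cumulatively, plus a running sum. TinyHistogram and the bucket bounds are made up for illustration; this is not how prometheus-client is implemented, only the same idea in miniature.

```python
import bisect


class TinyHistogram:
    """Toy stand-in for a Prometheus-style histogram (hypothetical, stdlib only)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # bucket upper bounds, in seconds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is the +Inf bucket
        self.total = 0.0                             # sum of all observed values
        self.n = 0                                   # number of observations

    def observe(self, value):
        # bisect_left finds the first bound >= value, so bounds are inclusive
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.n += 1

    def cumulative(self):
        # Each bucket reports how many observations were <= its bound
        out, running = {}, 0
        for bound, count in zip(self.bounds + [float('inf')], self.counts):
            running += count
            out[bound] = running
        return out


hist = TinyHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.2, 0.3, 2.0):
    hist.observe(seconds)
print(hist.cumulative())  # {0.1: 1, 0.5: 3, 1.0: 3, inf: 4}
```

Because only bucket counts are stored, percentiles computed from a histogram are estimates, and their precision depends on how well the bucket bounds match your real latencies.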

Monitoring

Why: monitoring is watching those metrics over time and alerting a human when something looks wrong. The usual setup: your app exposes its numbers at a /metrics URL, Prometheus scrapes that URL on a fixed interval (15 seconds to a minute is typical) and stores the readings, and Grafana draws the graphs and fires alerts (for example, "error rate above 5% for 5 minutes"). Your only job inside the app is to expose /metrics and update the numbers.

# main.py — measure every request and expose the numbers
import time
from fastapi import FastAPI, Request
from fastapi.responses import Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from metrics import REQUEST_COUNT, REQUEST_DURATION

app = FastAPI()

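@app.middleware('http')
async def measure(request: Request, call_next):
    start = time.perf_counter()
    status = 500  # reported if the handler raises before producing a response
    try:
        response = await call_next(request)
        status = response.status_code
        return response
    finally:
        # Label by route template ('/items/{id}'), not the raw URL: one label
        # value per distinct URL would create an unbounded number of time series.
        route = request.scope.get('route')
        path = getattr(route, 'path', 'unmatched')
        REQUEST_DURATION.labels(request.method, path).observe(
            time.perf_counter() - start
        )
        REQUEST_COUNT.labels(request.method, path, str(status)).inc()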

# Prometheus scrapes (visits) this endpoint on a schedule
@app.get('/metrics')
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
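
The alert "error rate above 5% for 5 minutes" can look like magic, so here is a rough stdlib sketch of the arithmetic behind it. Counters only go up, so the interesting number is the difference between two scrapes. The function names, the one-reading-per-minute data, and the thresholds are all illustrative assumptions; in practice this logic lives in Prometheus or Grafana alert rules, not in your app.

```python
def error_ratio(prev, curr):
    """Share of requests that failed between two scrapes.

    prev and curr are (total_requests, failed_requests) counter readings.
    Counter resets (for example after a restart) are not handled here.
    """
    d_total = curr[0] - prev[0]
    d_failed = curr[1] - prev[1]
    return d_failed / d_total if d_total > 0 else 0.0


def should_alert(samples, threshold=0.05, hold=5):
    """True only if each of the last `hold` intervals exceeded the threshold."""
    ratios = [error_ratio(a, b) for a, b in zip(samples, samples[1:])]
    recent = ratios[-hold:]
    return len(recent) == hold and all(r > threshold for r in recent)


# One (total, failed) reading per minute: 2% errors at first, then 10%
scrapes = [(0, 0), (100, 2), (200, 12), (300, 22),
           (400, 32), (500, 42), (600, 52)]
print(should_alert(scrapes))      # True: five intervals in a row above 5%
print(should_alert(scrapes[:4]))  # False: not enough sustained history yet
```

Requiring the condition to hold for several intervals is what keeps a single bad minute from paging someone.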

Telemetry

Why: telemetry is data your app emits about itself so you can understand it from the outside. The most useful kind at scale is a trace — a record that follows one request as it hops between services, showing where the time went. OpenTelemetry (often shortened to OTel) is the vendor-neutral standard; it auto-instruments common libraries, so you get traces without rewriting your code.

# Install:  pip install opentelemetry-sdk opentelemetry-instrumentation-fastapi
# tracing.py — wire tracing into your FastAPI app
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

trace.set_tracer_provider(TracerProvider())
# ConsoleSpanExporter just prints traces; swap for an OTLP exporter in production
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

def setup_tracing(app):
    # Auto-creates a trace for every request, with no per-route code needed.
    # Call this once at startup, for example setup_tracing(app) in main.py.
    FastAPIInstrumentor.instrument_app(app)
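
If "a record that follows one request" feels abstract, this stdlib-only sketch shows the shape of the data: every unit of work is a span, spans in the same request share a trace ID, and each child remembers its parent. The span helper and the finished list are hypothetical teaching names, not OpenTelemetry APIs, and the sleeps stand in for real work.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

# Holds (trace_id, span_id) for the span that is currently open, if any
_current = ContextVar('current_span', default=None)
finished = []  # completed spans; a real exporter would ship these elsewhere


@contextmanager
def span(name):
    parent = _current.get()
    trace_id = parent[0] if parent else uuid.uuid4().hex  # new trace at the root
    span_id = uuid.uuid4().hex[:16]
    token = _current.set((trace_id, span_id))
    start = time.perf_counter()
    try:
        yield
    finally:
        _current.reset(token)  # restore the parent as the current span
        finished.append({
            'name': name,
            'trace_id': trace_id,
            'span_id': span_id,
            'parent_id': parent[1] if parent else None,
            'seconds': time.perf_counter() - start,
        })


with span('GET /product'):
    with span('db.get_product'):
        time.sleep(0.01)
    with span('recommender.fetch'):
        time.sleep(0.02)

for s in finished:
    print(s['name'], s['parent_id'], round(s['seconds'], 3))
```

The parent IDs are what let a tracing backend draw the familiar waterfall view and show which step the time went to.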

Graceful Degradation

Why: graceful degradation means that when one piece breaks, the whole app does not. If a non-essential dependency (say, a recommendations service) is down, you return a sensible fallback instead of failing the page. The pattern: wrap the risky call in try/except and always have a plan B.

# A product page that still works when recommendations are down
# (db and recommender are placeholders for your own data-access objects)
import logging

logger = logging.getLogger(__name__)

async def get_product_page(product_id: int):
    product = await db.get_product(product_id)  # essential — let it raise

    try:
        recommendations = await recommender.fetch(product_id)  # nice-to-have
    except Exception as err:
        logger.warning('recommendations unavailable: %s', err)
        recommendations = []  # fall back instead of failing the whole page

    return {'product': product, 'recommendations': recommendations}
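
A dependency that hangs is often worse than one that fails, because try/except alone never fires. The runnable sketch below adds a timeout to the same pattern. The two fetch functions are fake stand-ins, and the 0.05 second timeout and the degraded flag are illustrative choices rather than recommendations.

```python
import asyncio


async def fetch_product(product_id):
    # Stand-in for the essential database call
    return {'id': product_id, 'name': 'Widget'}


async def fetch_recommendations(product_id):
    # Stand-in for a dependency that has stopped responding
    await asyncio.sleep(10)
    return [product_id + 1]


async def get_product_page(product_id, timeout=0.05):
    product = await fetch_product(product_id)  # essential: let errors propagate
    try:
        # Give the nice-to-have call a deadline instead of waiting forever
        recs = await asyncio.wait_for(fetch_recommendations(product_id), timeout)
        degraded = False
    except (asyncio.TimeoutError, ConnectionError):
        recs, degraded = [], True  # plan B: empty list, and say so
    return {'product': product, 'recommendations': recs, 'degraded': degraded}


page = asyncio.run(get_product_page(42))
print(page)
```

Returning a degraded flag is optional, but it lets the caller or a metric record how often the fallback is being served.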

Throttling

Why: throttling (also called rate limiting) caps how many requests one client can make in a window of time. It protects your app from being overwhelmed — whether by a buggy client stuck in a retry loop or an abusive one. slowapi adds rate limiting to FastAPI and returns HTTP 429 (Too Many Requests) once a caller goes over the limit.

# Install:  pip install slowapi
# main.py — limit how often each client can call an endpoint
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

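# Track callers by IP. Behind a reverse proxy this is the proxy's address
# unless your server is configured to trust forwarded headers.
limiter = Limiter(key_func=get_remote_address)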
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get('/api/items')
@limiter.limit('100/minute')  # max 100 requests per IP per minute
def list_items(request: Request):
    return {'items': []}
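
To see what a limit like '100/minute' means mechanically, here is a small stdlib sketch of a fixed-window counter, one of the simplest rate limiting strategies. It is not a description of slowapi's internals; FixedWindowLimiter, the injected clock, and the tiny limit of 3 are assumptions chosen to keep the example short.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each time window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # (client, window number) -> requests seen; old windows are never
        # evicted here, which a long-running service would need to handle
        self.hits = defaultdict(int)

    def allow(self, client):
        window_no = int(self.clock() // self.window)
        key = (client, window_no)
        if self.hits[key] >= self.limit:
            return False  # the caller would answer with HTTP 429
        self.hits[key] += 1
        return True


now = [0.0]  # fake clock so the example does not have to wait a minute
demo_limiter = FixedWindowLimiter(limit=3, window_seconds=60, clock=lambda: now[0])

first_minute = [demo_limiter.allow('10.0.0.1') for _ in range(5)]
now[0] = 61.0
next_minute = demo_limiter.allow('10.0.0.1')
print(first_minute, next_minute)  # [True, True, True, False, False] True
```

A fixed window lets a client burst at the boundary between two windows; sliding-window and token-bucket strategies smooth that out at the cost of a little more bookkeeping.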

Circuit Breaker

Why: when a downstream service is failing, hammering it with retries makes things worse and ties up your own resources waiting for timeouts. A circuit breaker counts failures and, once they cross a threshold, "trips": it stops calling the broken service for a while and fails fast (or returns a fallback) instead. After a cool-down it lets one test request through to check whether the service recovered. The circuitbreaker library makes this a one-line decorator; it counts consecutive failures, so a single success resets the count.

# Install:  pip install circuitbreaker
from circuitbreaker import circuit

# After 5 consecutive failures the circuit "opens" and calls fail fast for
# 30 seconds, instead of hanging while the broken service times out.
@circuit(failure_threshold=5, recovery_timeout=30)
def call_payment_api(order):
    return payment_client.charge(order)  # raises on failure

def charge(order):
    try:
        return call_payment_api(order)
    except Exception:
        # Plan B: reached when the circuit is open (CircuitBreakerError)
        # and also when an individual call fails while it is still closed
        return {'status': 'queued', 'note': 'payment delayed'}
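
The decorator hides a small state machine: closed (calls pass), open (calls are rejected), and half-open (one trial call after the cool-down). The stdlib sketch below walks through those states with a fake clock. TinyBreaker and CircuitOpenError are hypothetical names for illustration, and the sketch is not the circuitbreaker library's implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class TinyBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError('failing fast')
            # Cool-down is over: half-open, so this call runs as a trial
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # any success closes the circuit
        return result


now = [0.0]       # fake clock
healthy = [False]
calls = []        # times at which the downstream service was really called


def flaky():
    calls.append(now[0])
    if not healthy[0]:
        raise ConnectionError('payment API down')
    return 'charged'


breaker = TinyBreaker(failure_threshold=2, recovery_timeout=30, clock=lambda: now[0])
outcomes = []
for t in (0, 1, 2, 40):
    now[0] = float(t)
    healthy[0] = t >= 40  # the service recovers before the last attempt
    try:
        outcomes.append(breaker.call(flaky))
    except CircuitOpenError:
        outcomes.append('rejected')
    except ConnectionError:
        outcomes.append('failed')

print(outcomes)  # ['failed', 'failed', 'rejected', 'charged']
print(calls)     # [0.0, 1.0, 40.0]: the rejected call never reached the service
```

The rejected call is the point of the pattern: while the circuit is open, the struggling service gets no traffic from you and your own workers are not tied up waiting on timeouts.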