Cloud Monitoring & Logging

See what your infrastructure is doing: read metrics, search and filter logs across services with Cloud Logging, route logs with sinks, and raise alerts when something crosses a threshold.

What the operations suite covers

Google Cloud's operations suite has two halves: Cloud Monitoring (metrics — numbers over time, plus alerts) and Cloud Logging (the searchable record of what happened). Why: without them you are flying blind — you cannot fix what you cannot see.

Logging and Monitoring are on by default. List recent log entries:

gcloud logging read "severity>=WARNING" --limit 10 --freshness 1h

Cloud Logging — search across everything

Cloud Logging collects logs from every service into one place you query with a filter language. Why: instead of SSHing into each VM, you search all logs at once — "errors from this VM in the last hour" — with a single query.

Find errors from one VM in the last hour

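# Replace INSTANCE_ID with the VM's numeric instance ID
gcloud logging read \
  'resource.type="gce_instance" AND resource.labels.instance_id="INSTANCE_ID" AND severity>=ERROR' \
  --limit 20 --freshness 1h \
  --format "table(timestamp, resource.labels.instance_id, textPayload)"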

Stream logs live as they arrive

gcloud alpha logging tail 'resource.type="cloud_run_revision"'

Log sinks — route logs somewhere

A log sink routes matching logs to a destination: a Cloud Storage bucket (cheap archive), BigQuery (analysis), or Pub/Sub (real-time processing). Why: you can keep logs past the default retention period (30 days in the _Default log bucket) for compliance, or feed them into your own analysis and processing pipelines.

Archive all WARNING+ logs to a Cloud Storage bucket

gcloud logging sinks create warn-archive \
  storage.googleapis.com/learn-uploads-7f3k \
  --log-filter "severity>=WARNING"

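Then grant the sink's writer identity permission to write to the bucket, for example with the Storage Object Creator role (the create command prints the service account to authorize). Until you do, the sink exists but no logs arrive in the bucket.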
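The grant step can be sketched as a small shell function. This is a minimal sketch, not the only way to do it: the function name grant_sink_writer is hypothetical, and the choice of roles/storage.objectCreator assumes the sink only needs to add objects to the bucket.

```shell
# Sketch: authorize a sink's writer identity on its destination bucket.
# grant_sink_writer is a hypothetical helper name, not part of gcloud.
grant_sink_writer() {
  sink_name="$1"   # e.g. warn-archive
  bucket="$2"      # e.g. learn-uploads-7f3k

  # Look up the service account the sink writes as. The value already
  # carries the "serviceAccount:" prefix that --member expects.
  writer_identity="$(gcloud logging sinks describe "$sink_name" \
    --format 'value(writerIdentity)')" || return 1

  # Storage Object Creator lets the identity add objects to the bucket.
  gcloud storage buckets add-iam-policy-binding "gs://$bucket" \
    --member "$writer_identity" \
    --role "roles/storage.objectCreator"
}
```

With the sink from the example above, the call would be grant_sink_writer warn-archive learn-uploads-7f3k.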

Metrics — numbers over time

A metric is a time series: a value sampled over time, such as CPU utilization, request count, or a count derived from logs. Google Cloud services publish many metrics automatically for the resources you create. Why read them: they tell you whether a VM is overloaded or errors are spiking. You can also define a "log-based metric" that counts log entries matching a filter.

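List available metric types for Compute Engine (via the Monitoring API)

curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/$(gcloud config get-value project)/metricDescriptors" \
  --data-urlencode 'filter=metric.type = starts_with("compute.googleapis.com")' \
  | grep '"type"' | head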

Create a log-based metric counting log entries at ERROR severity or higher

gcloud logging metrics create error_count \
  --description "Count of ERROR logs" \
  --log-filter "severity>=ERROR"

Alerting policies — get told when something is wrong

An alerting policy watches a metric and notifies a channel (email, SMS, Slack, PagerDuty) when it crosses a threshold. Why: you find out about high CPU or a flood of errors before users complain. You first create a notification channel, then the policy that uses it.

Create an email notification channel

gcloud beta monitoring channels create \
  --display-name "Ops email" \
  --type email \
  --channel-labels email_address=ops@example.com

Then create an alerting policy that references the channel: describe the policy in a JSON file and pass that file to gcloud alpha monitoring policies create with --policy-from-file, for example "VM CPU > 80% for 5m".
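A policy file for the "VM CPU > 80% for 5m" example might look like the sketch below. The function name create_cpu_alert, the file name cpu-policy.json, the 60-second alignment period, and the alpha release track are all assumptions made for illustration; depending on your gcloud version the command may also be available without the alpha prefix. CPU utilization is reported as a fraction, so 80% is written as 0.8.

```shell
# Sketch: write a threshold policy to a JSON file, then create it.
# create_cpu_alert and cpu-policy.json are hypothetical names.
create_cpu_alert() {
  # Full channel name: projects/PROJECT_ID/notificationChannels/CHANNEL_ID
  channel="$1"

  # Unquoted heredoc: $channel is expanded, \" is written literally.
  cat > cpu-policy.json <<EOF
{
  "displayName": "VM CPU > 80% for 5m",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "CPU utilization above 80%",
      "conditionThreshold": {
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.8,
        "duration": "300s",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN" }
        ]
      }
    }
  ],
  "notificationChannels": ["$channel"]
}
EOF

  # Release track is an assumption; adjust to what your gcloud offers.
  gcloud alpha monitoring policies create --policy-from-file cpu-policy.json
}
```

To find the channel name to pass in, list your channels with gcloud beta monitoring channels list --format "value(name)".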