Observability

Observability before scale: how to design systems you can actually debug

Published: August 18, 2026
Focus: Tracing
Area: Reliability

Teams usually discover the cost of weak observability too late. The system works, traffic grows, incidents appear, and suddenly nobody can answer the simplest question: what actually happened? That is the moment when logging turns into archaeology, metrics become decorative, and traces are either missing or too noisy to trust.

The observability work I trust is the kind that makes incidents shorter. Dashboards are useful, but only if the names, IDs, and event boundaries match how the system actually fails.

I have very little patience for observability that exists to look mature. A beautiful dashboard that does not change the next action is decoration. A plain log line with the right identifiers can save an hour. The point is not to collect signals. The point is to reduce uncertainty when people are tired, production is noisy, and the answer is not obvious.

The best systems I have worked on had a shared language for failure. People could say “this is a queue saturation issue,” or “this is a manifest freshness issue,” or “this is a downstream dependency timeout,” and everyone knew what evidence should exist. That shared language is architecture.

1 trace: should connect request, queue, worker, storage, and playback state.
3 classes: user error, dependency failure, and system saturation should never look identical.
Minutes: the goal is reducing diagnosis time, not collecting more charts.

Observability is not a dashboard exercise. It is an architectural property. A system is observable when engineers can infer internal behavior from the signals it emits, quickly, repeatedly, and under pressure. The important part is not the tooling itself. It is the discipline behind what gets measured, how it is named, and whether the emitted signals map cleanly to the system's actual behavior.

Start with questions, not tools

Before choosing a stack, define the operational questions the system should answer. Which dependency is responsible for the latency spike? Which queue is building pressure? Which release changed error behavior? Which device or region is producing playback failures? A good observability design begins as a question model, not as a shopping list of vendors.

When teams skip this step, they end up with broad but shallow visibility. They collect enormous volumes of logs and metrics, but none of it is shaped for diagnosis. A useful signal is one that resolves uncertainty. If a graph cannot narrow the search space during an incident, it is decoration.
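
Written down, a question model looks less like a document and more like data: each question names the signals that should answer it and the dimensions that narrow the search. A minimal sketch, with illustrative names:

type OperationalQuestion = {
  question: string;
  signals: string[];   // metrics, logs, or traces expected to answer it
  narrowBy: string[];  // dimensions that shrink the search space
};

const questionModel: OperationalQuestion[] = [
  {
    question: 'Which dependency is responsible for the latency spike?',
    signals: ['dependency_p95', 'dependency_error_rate'],
    narrowBy: ['dependency', 'operation', 'region'],
  },
  {
    question: 'Which queue is building pressure?',
    signals: ['backlog_age', 'queue_depth', 'retry_rate'],
    narrowBy: ['queue', 'worker_pool'],
  },
  {
    question: 'Which release changed error behavior?',
    signals: ['error_rate_by_failure_class'],
    narrowBy: ['release_version', 'failure_class'],
  },
];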

The question behind every signal

Before adding a metric, I ask: what decision does this make easier? If the answer is vague, the metric usually becomes noise. If the answer is concrete, the signal deserves a stable name, clear dimensions, and an owner.

logger.error('playback.segment_fetch_failed', {
  traceId,
  assetId,
  variant,
  cdn,
  region,
  statusCode,
  retryCount,
  manifestVersion,
});

Logs are contracts

Logs should never be treated as incidental output. They are operational contracts. Every high-value log event needs stable structure, domain-aware names, and enough context to connect it to the surrounding system. Random strings written from scattered code paths are not observability. They are future confusion stored in plain text.

The most useful logging systems distinguish event classes clearly: input validation failures, dependency failures, retries, state transitions, idempotency collisions, queue replays, timeout boundaries, playback errors, and policy denials should not collapse into the same generic error line. Error taxonomies reduce panic because they turn noise into categories.
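
A taxonomy does not need to be elaborate; it needs to be closed, so that nothing can fall back into a generic error line. A minimal sketch, with assumed class names:

// A closed failure-class taxonomy: exhaustive and mutually exclusive,
// so no event can hide inside a generic error.
type FailureClass =
  | 'validation'       // bad input; the caller's problem
  | 'dependency'       // downstream service, database, or CDN failed
  | 'saturation'       // queue, pool, or connection limit reached
  | 'timeout'          // a budget was exceeded at a boundary
  | 'state_violation'  // idempotency collision or illegal transition
  | 'policy';          // auth, entitlement, or rate-limit denial

// Requiring the class at the call site keeps humans from inferring it later.
function logFailure(event: string, failureClass: FailureClass, context: Record<string, unknown>): void {
  console.error(JSON.stringify({ event, failureClass, ...context }));
}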

Incident-grade log requirements

Stable event name

The event name should survive refactors. If dashboards depend on it, do not rename it casually.

Correlation identity

Every event crossing service boundaries needs trace, request, job, asset, or user-session identity.

Failure class

Do not make humans infer whether the event is validation, dependency, saturation, policy, or state violation.

Action hint

A good event points toward the next diagnostic question instead of merely announcing that something happened.
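
Taken together, the four requirements suggest an event shape like the one below. Field names are assumptions, not a standard:

// One possible contract for an incident-grade event.
type IncidentGradeEvent = {
  event: string;                     // stable name; dashboards depend on it
  traceId: string;                   // correlation identity across boundaries
  failureClass:
    | 'validation' | 'dependency' | 'saturation'
    | 'policy' | 'state_violation';  // never left for humans to infer
  hint?: string;                     // the next diagnostic question
  context: Record<string, unknown>;  // assetId, region, retryCount, ...
};

const example: IncidentGradeEvent = {
  event: 'playback.segment_fetch_failed',
  traceId: 'trace-id-from-context',
  failureClass: 'dependency',
  hint: 'Is the same CDN failing across regions?',
  context: { cdn: 'edge-eu', region: 'eu-west-1', retryCount: 2 },
};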

Metrics should map to system limits

Teams often measure what is easy instead of what is decisive. CPU, memory, and request count matter, but they are not enough. The best metrics are the ones that map directly to system limits: queue depth, retry rate, cache hit ratio, p95 latency, segment fetch failure rates, backlog age, saturation of worker pools, replication lag, or manifest generation time.

Every system has ceilings. Observability is the discipline of naming those ceilings before they are hit. Once those boundaries are defined, dashboards become operational instruments instead of status theater.
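
One way to name ceilings is to store the limit next to the metric, so dashboards and alerts share a single definition of "close to the edge." A sketch with placeholder numbers:

// Each limit metric carries its known boundary and the fraction of that
// boundary at which humans should be told. Values are placeholders.
const systemLimits = {
  queue_backlog_age_s:  { ceiling: 120, alertAtFraction: 0.7 },
  worker_pool_busy_pct: { ceiling: 100, alertAtFraction: 0.85 },
  replication_lag_s:    { ceiling: 30,  alertAtFraction: 0.5 },
};

function breachesBudget(metric: keyof typeof systemLimits, value: number): boolean {
  const limit = systemLimits[metric];
  return value >= limit.ceiling * limit.alertAtFraction;
}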

Tracing is about causality

Tracing matters because distributed systems fail through causality chains. A request times out because a downstream dependency stalls, which happens because a queue is saturated, which happens because a retry loop amplified a partial failure. Good traces preserve that chain. Bad traces only prove that many things happened at once.

The right tracing strategy is selective and intentional. Trace the paths that cross boundaries, allocate scarce resources, or mutate critical state. Do not aim for maximal span volume. Aim for decision-grade causality.

Span boundary | Why it matters | Common mistake
HTTP entrypoint | Defines the user-facing request and response budget. | Tracing only the controller and losing downstream causality.
Queue publish | Preserves intent when work leaves the synchronous path. | Generating a new trace and breaking the incident story.
Worker execution | Shows backlog wait, processing time, retries, and saturation. | Measuring execution but ignoring time spent waiting in queue.
External dependency | Separates internal latency from dependency behavior. | Logging timeout errors without dependency identity or retry state.
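
The queue-publish mistake is worth making concrete. This is a minimal sketch using the OpenTelemetry JS API: inject the active trace context into the message on publish, then extract it in the worker, so causality survives the asynchronous boundary. The queue client is a hypothetical stand-in:

import { context, propagation, trace } from '@opentelemetry/api';

const tracer = trace.getTracer('queue');

// Publisher: carry the active trace context in message headers instead of
// letting the worker start a fresh, disconnected trace.
function publish(queue: { send: (msg: object) => void }, payload: object): void {
  tracer.startActiveSpan('queue.publish', (span) => {
    const headers: Record<string, string> = {};
    propagation.inject(context.active(), headers);
    queue.send({ payload, headers });
    span.end();
  });
}

// Worker: restore the carried context, then open the processing span inside
// it, so backlog wait and execution show up on the same trace.
function handle(msg: { payload: object; headers: Record<string, string> }): void {
  const parent = propagation.extract(context.active(), msg.headers);
  context.with(parent, () => {
    tracer.startActiveSpan('queue.process', (span) => {
      // ...process msg.payload...
      span.end();
    });
  });
}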

Observability should follow the shape of the architecture

Clean observability depends on clean boundaries. If your services have weak contracts, inconsistent event names, and mixed responsibilities, no amount of instrumentation will save you. This is why observability improves when architecture improves. Clear service boundaries produce clearer traces, cleaner logs, and more legible metrics.

The reverse is also true: weak observability is often a sign that the architecture itself is muddled. If nobody can explain what should be measured, the system probably has unclear ownership or hidden coupling.

Instrumentation exposes organizational truth

Observability work has a habit of revealing awkward facts. A service that is impossible to instrument cleanly is often a service with too many jobs. A log taxonomy that nobody can agree on is often a domain model that nobody owns. A dashboard that needs twenty filters to answer one question is usually a sign that the system does not have clean boundaries.

This is why I do not treat observability as a layer added at the end. It is a test of architecture. When the traces are clean, the system is often clean. When the traces are chaotic, the code usually has the same shape. Instrumentation does not create the mess; it makes the mess visible.

Playback, streaming, and event systems need domain signals

Generic infrastructure metrics are not enough for media or event-driven platforms. A streaming product needs player startup time, rebuffer ratio, bitrate switch behavior, CDN miss patterns, manifest freshness, and segment delivery health. An event pipeline needs replay counts, poison-queue rates, deduplication collisions, and late-event compensation behavior.

Domain-specific signals turn observability into leverage. They let teams reason in product terms instead of only infra terms. That is how engineers move from “the service looks healthy” to “the user experience is healthy.”

const domainSignals = {
  playback: ['startup_ms', 'rebuffer_ratio', 'variant_switches', 'segment_p95'],
  graphql: ['resolver_p95', 'field_error_rate', 'query_complexity', 'n_plus_one_hits'],
  postgres: ['lock_wait_ms', 'replication_lag', 'slow_query_class', 'index_miss_rate'],
  queues: ['backlog_age', 'retry_rate', 'poison_messages', 'dedupe_collisions'],
};

GraphQL and PostgreSQL need observability at the boundary

In GraphQL systems, the expensive path is often hidden behind a friendly API shape. One request can trigger many resolvers, nested data access, cache misses, permission checks, and fan-out to services. If telemetry stops at “GraphQL request took 900ms,” the signal is too coarse. You need resolver timing, query shape, field-level errors, cache behavior, and database access grouped by operation.

PostgreSQL has the same problem from the other direction. A slow query is not only a database issue; it is often a product path, a GraphQL resolver, a data model decision, or a missing ownership boundary. The useful question is not “which query was slow?” It is “which user-facing behavior caused this query, under which cardinality, with which lock and cache conditions?”

GraphQL/PostgreSQL diagnostic model

Operation shape

Record operation name, query complexity, selected fields, depth, and high-cardinality arguments without leaking sensitive values.

Resolver timing

Measure resolver p95 and error rate independently so one nested field cannot hide inside aggregate request latency.

Query class

Group SQL by normalized query shape, not raw text, so regressions remain visible across dynamic parameters.

Lock context

Capture lock wait, transaction age, row count estimates, and index usage when latency crosses the budget.
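
Two of these practices fit in a few lines. The sketch below times a resolver under its own name and normalizes SQL into a query class; the metrics client is a stand-in, and the normalization is deliberately crude:

// Stand-in for a real metrics client.
const recordMetric = (name: string, value: number, dims: Record<string, string>): void => {
  console.log(name, value, dims);
};

// Resolver timing: latency and failure recorded under the resolver's own
// name, so one nested field cannot hide inside aggregate request latency.
function timedResolver<A extends unknown[], R>(
  name: string,
  resolver: (...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const start = Date.now();
    try {
      return await resolver(...args);
    } catch (err) {
      recordMetric('graphql_field_error', 1, { resolver: name });
      throw err;
    } finally {
      recordMetric('graphql_resolver_ms', Date.now() - start, { resolver: name });
    }
  };
}

// Query class: group SQL by normalized shape so dynamic parameters do not
// fragment the metric across thousands of raw texts.
function queryClass(sql: string): string {
  return sql
    .replace(/\s+/g, ' ')
    .replace(/'[^']*'/g, '?')
    .replace(/\b\d+\b/g, '?')
    .trim()
    .toLowerCase();
}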

Alerting should protect attention

Most alerting systems fail because they punish attention instead of protecting it. They trigger on symptom spikes without understanding whether the event is user-visible, sustained, or operationally relevant. Good alerting systems are conservative. They escalate meaningful risk, not every transient irregularity.

The purpose of alerting is not awareness. It is intervention. If the person receiving the alert does not know what action to take, the alert is not operationally complete.

alert: playback_rebuffer_rate_high
when: p95(rebuffer_ratio, 5m) > baseline * 1.35
group_by: [region, device_class, cdn]
runbook: isolate CDN path, compare manifest freshness, sample traces

A good alert is almost a short conversation with the person receiving it. It says what changed, why the change matters, how broad the blast radius might be, and where to look first. Anything less pushes cognitive work into the worst possible moment. During an incident, the system should do as much explanatory work as it can.

This is why I like alerts that carry dimensions rather than drama. Region, device class, CDN, query class, queue name, and release version are not details; they are the difference between panic and diagnosis. The alert should narrow the room, not simply turn on the lights.

SLOs are product promises translated into engineering math

Service-level objectives are often introduced as reliability bureaucracy, but the useful version is much simpler: what promise does this product make, and how much failure can it tolerate before users, revenue, or trust are meaningfully damaged? A playback platform, a GraphQL API, an attribution pipeline, and a billing workflow do not need the same SLO shape. Reliability has to follow the product surface.

I like SLOs that are close to user experience and far from vanity metrics. CPU usage is not an SLO. Request count is not an SLO. For a playback product, startup success, rebuffer ratio, license acquisition success, and manifest freshness can be SLO candidates. For a data pipeline, freshness, completeness, deduplication correctness, and replay latency matter more. For a GraphQL API, operation-level latency and error budgets are more useful than one global API number.

Surface | SLO candidate | Why it maps to user value
Playback | Successful startup under 2s by region/device class. | Users feel startup failure immediately.
GraphQL | p95 latency by named operation and complexity bucket. | Different operations have different product expectations.
PostgreSQL | Lock wait and transaction age for critical write paths. | Data integrity and user actions degrade before CPU looks scary.
Event pipeline | Freshness lag and replay completion time. | Analytics and attribution lose trust when data arrives late.
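
In code, that preference reads as a short list of promises per surface rather than one global reliability number. Targets and windows here are illustrative:

// Each SLO names its surface, the user-facing promise, the fraction of
// events that must keep it, and the measurement window.
const slos = [
  { surface: 'playback', sli: 'startup completes under 2s',           target: 0.995, window: '28d' },
  { surface: 'graphql',  sli: 'checkout operation within p95 budget', target: 0.999, window: '28d' },
  { surface: 'pipeline', sli: 'events fresher than 15 minutes',       target: 0.99,  window: '7d' },
];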

Cardinality is where good observability becomes expensive

High-cardinality dimensions are both necessary and dangerous. Without dimensions like region, device class, operation name, CDN, queue, or query class, you cannot isolate failures. With uncontrolled dimensions like raw user IDs, full URLs, arbitrary GraphQL variables, or unbounded error strings, observability costs explode and dashboards slow down. The skill is not avoiding cardinality. The skill is budgeting it.

I treat cardinality like schema design. Dimensions should have owners, expected ranges, retention rules, and a reason to exist. If a dimension does not change an operational decision, it probably does not belong in the hot metrics path. It may still belong in logs or traces with different retention. Not every signal needs to live in every system.

const allowedMetricDimensions = {
  playback_startup_ms: ['region', 'device_class', 'cdn', 'app_version'],
  graphql_operation_ms: ['operation_name', 'complexity_bucket', 'cache_state'],
  postgres_lock_wait_ms: ['query_class', 'table_group', 'transaction_type'],
};

// user_id, raw_url, query_text, and error_message belong in logs/traces,
// not high-volume metrics labels.

Sampling without destroying the incident

Sampling is unavoidable at scale, but careless sampling destroys exactly the evidence you need during rare failures. A one-percent trace sample may be fine for healthy traffic and useless for a low-volume error class. The right strategy is adaptive: keep representative samples for normal behavior, but increase retention for errors, slow paths, retries, and events crossing critical boundaries.

I prefer policies that encode operational value. Always keep traces for failed payments, playback startup failures, poison-queue messages, admin actions, schema migration paths, and severe latency outliers. Sample boring success aggressively. Keep enough healthy traffic to compare, but do not spend the same budget on every request. Observability is not democracy; some events are more important than others.

Sampling policy

Always keep

Errors, retries, timeouts, security denials, payment failures, playback startup failures, and long-tail latency outliers.

Sample normally

Healthy high-volume reads, cache hits, static delivery, and low-risk background work.

Escalate dynamically

Raise sample rates during deploys, incidents, regional anomalies, or sudden shifts in error taxonomy.

Expire deliberately

Keep expensive high-cardinality traces long enough for incident review, then reduce retention.
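
As a tail-based decision over a completed trace, the policy above compresses to something like this. Thresholds, rates, and path patterns are placeholders:

// Decide at the tail, after the outcome is known: failures and outliers are
// always kept, boring success is sampled, deploys raise the rate.
type TraceSummary = {
  error: boolean;
  retried: boolean;
  durationMs: number;
  path: string;
  deployInProgress: boolean;
};

function keepTrace(t: TraceSummary, baseRate = 0.01): boolean {
  if (t.error || t.retried) return true;                    // always keep failures
  if (t.durationMs > 2000) return true;                     // long-tail latency outliers
  if (/payment|admin|migration/.test(t.path)) return true;  // critical boundaries
  const rate = t.deployInProgress ? baseRate * 10 : baseRate;
  return Math.random() < rate;
}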

Runbooks are part of the observability surface

A runbook is not a document you write after the real work. It is part of the operational interface. If an alert points to a runbook and the runbook starts with vague advice, the system has not really explained itself. A good runbook should name the likely failure classes, the first three queries to run, the dashboards that matter, the dimensions to group by, the rollback criteria, and the escalation boundary.

The best runbooks are short because the system has good names. When services, events, metrics, and traces are coherent, the runbook does not need to teach the whole architecture during an incident. It only needs to guide the next decision. That is the standard I aim for: fewer words during the incident, more design before it.

Incident review should improve the system's vocabulary

Postmortems often focus on timelines and action items, but the most valuable incident reviews also improve language. What did we not have a name for? What failure class was hidden behind a generic error? Which dashboard forced us to infer instead of observe? Which log line told a human-readable story, and which one merely emitted a stack trace? Every serious incident should leave the system with better nouns.

I like incident reviews that separate technical failure, detection failure, explanation failure, and coordination failure. A system can fail technically but detect the problem quickly. It can detect the problem but explain it badly. It can explain the problem but route ownership to the wrong team. Each layer needs different remediation. Collapsing all of it into “add alert” is how teams repeat the same incident with better dashboards and the same confusion.

Review layer | Question | Useful output
Technical failure | Which invariant broke? | Code, config, capacity, or contract fix.
Detection failure | Why did we not know sooner? | SLO, alert, sampling, or telemetry change.
Explanation failure | Why was diagnosis slow? | Better event names, dimensions, traces, or runbook.
Coordination failure | Why was ownership unclear? | Escalation path, service boundary, or team contract.

Data lineage is observability for facts

Metrics and traces explain system behavior. Data lineage explains how facts became facts. In analytics, attribution, reporting, recommendation, and media metadata systems, lineage is not optional. If a number changes and nobody can trace it back through ingestion, normalization, deduplication, enrichment, aggregation, and presentation, the organization loses trust in its own data.

The lineage model does not need to be fancy at first. It needs stable event IDs, source identity, transformation version, schema version, processing time, and output identity. When a dashboard moves unexpectedly, engineers should be able to ask whether the source changed, the parser changed, the enrichment changed, the aggregation changed, or the dashboard query changed. Without lineage, every data incident becomes archaeology.

type LineageEvent = {
  source: 'player' | 'cdn' | 'billing' | 'graphql' | 'postgres';
  sourceEventId: string;
  schemaVersion: string;
  transformVersion: string;
  producedAt: string;
  outputDataset: string;
  quality: { deduped: boolean; late: boolean; corrected: boolean };
};

Multi-team observability needs governance without theater

Once several teams emit telemetry into the same platform, observability needs governance. Not heavy process, not committees for every metric name, but enough shared structure that signals remain comparable. If one team calls it `asset_id`, another calls it `videoId`, and another stores it only inside a JSON blob, cross-system diagnosis becomes expensive. Naming is operational infrastructure.

The governance I like is lightweight and practical: event naming rules, approved high-cardinality dimensions, metric ownership, retention tiers, dashboard review for critical paths, and a small set of golden signals per domain. This is not about control. It is about making sure the organization can still reason collectively when systems grow beyond one team's memory.

Telemetry governance primitives

Naming registry

Shared event names, metric names, dimensions, and ownership for critical product paths.

Retention tiers

Different retention for metrics, logs, traces, lineage events, and incident snapshots.

Domain contracts

Playback, GraphQL, PostgreSQL, queues, and billing each define required signals.

Review loop

Incident reviews update the vocabulary instead of only adding more alerts.
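
The lightest version of a naming registry is a file in a repository, reviewed like any other contract. Entries, owners, and tiers here are illustrative:

// Shared names, required dimensions, and ownership for critical signals.
const namingRegistry = {
  'playback.segment_fetch_failed': {
    owner: 'media-platform',
    requiredDims: ['asset_id', 'cdn', 'region'],
    retentionTier: 'hot',
  },
  graphql_operation_ms: {
    owner: 'api-platform',
    requiredDims: ['operation_name', 'complexity_bucket'],
    retentionTier: 'aggregate',
  },
};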

Ownership maps turn signals into action

Observability without ownership creates spectators. A dashboard turns red, several teams gather, and everyone has partial context. The missing artifact is often an ownership map: which team owns the product behavior, which team owns the service, which team owns the data model, which team owns the CDN or database dependency, and who can make a decision during an incident.

Ownership maps should be connected to telemetry. If an alert fires for `playback.segment_fetch_failed`, the system should know whether the likely owner is packaging, CDN configuration, origin service, player release, or entitlement. That mapping will never be perfect, but even an imperfect first route is better than a generic page to everyone.

Signal | First owner | Secondary owner
manifest_freshness_lag | Packaging pipeline | CDN/origin platform
graphql_operation_p95 | API/platform team | Domain service or PostgreSQL owner
queue_backlog_age | Worker/service owner | Database or downstream dependency owner
rebuffer_ratio_region_spike | Playback infrastructure | CDN routing or player team
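
Even an imperfect first route can live next to the telemetry. A sketch with placeholder team names:

// Map a firing signal to its likely first owner; fall back rather than
// paging everyone.
const ownershipMap: Record<string, { first: string; secondary: string }> = {
  manifest_freshness_lag:      { first: 'packaging-pipeline',   secondary: 'cdn-origin-platform' },
  graphql_operation_p95:       { first: 'api-platform',         secondary: 'domain-service-or-postgres' },
  queue_backlog_age:           { first: 'worker-service-owner', secondary: 'downstream-dependency' },
  rebuffer_ratio_region_spike: { first: 'playback-infra',       secondary: 'cdn-routing-or-player' },
};

function routeAlert(signal: string): string {
  return ownershipMap[signal]?.first ?? 'generic-oncall';
}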

Diagnosis workflows should be rehearsed

Teams rehearse deployments more often than diagnosis. That is backwards for systems where reliability matters. A diagnosis workflow is the path from symptom to likely cause: alert fires, dimensions narrow scope, traces confirm causality, logs explain failure class, metrics quantify blast radius, and the runbook points to a decision. If that path has never been rehearsed, it will break under pressure.

I like running small failure drills that do not require chaos theater. Pick a real historical incident and ask a new engineer to diagnose it using only the current observability surface. Where do they get stuck? Which names confuse them? Which dashboard is missing a dimension? Which runbook assumes context they do not have? That exercise is brutal and useful because it tests the system's memory.

Diagnosis drill

Symptom

Start from one user-visible symptom, not an internal hypothesis.

Scope

Group by region, release, device, operation, CDN, queue, or query class.

Causality

Use traces and lineage to prove the path, not merely correlate graphs.

Decision

End with rollback, mitigation, escalation, or documented no-action.

Error budgets should change engineering behavior

Error budgets are only useful if they affect decisions. If a team burns the budget and nothing changes, the SLO is decorative. A real budget changes release pace, review depth, operational focus, or investment priority. It gives teams a language for tradeoffs: we can ship faster because the system is healthy, or we need to slow feature work because reliability debt is now user-visible.

I prefer budgets tied to specific product surfaces rather than one global reliability number. A playback startup budget, a GraphQL checkout operation budget, and an attribution freshness budget should not be blended. Each has different users, consequences, and owners. Blended budgets hide pain in the average, which is the same mistake weak dashboards make.

const errorBudgetPolicy = {
  surface: 'playback_startup',
  window: '28d',
  // currentFailures and allowedFailures come from SLO queries over the window.
  burnRate: currentFailures / allowedFailures,
  actions: [
    { when: 'burnRate > 1.0', do: 'pause risky player releases' },
    { when: 'burnRate > 1.5', do: 'prioritize reliability fixes' },
    { when: 'regional spike', do: 'route incident to CDN/playback owner' },
  ],
};

Signal architecture: metrics, logs, traces, and lineage each have a job

Observability gets expensive and confusing when every signal type is asked to do every job. Metrics are for trends, budgets, aggregation, and alerting. Logs are for discrete events and domain context. Traces are for causality across boundaries. Lineage is for explaining how facts were produced. If a team tries to use logs as metrics, metrics as traces, or traces as data lineage, the platform becomes noisy and expensive.

A mature observability architecture assigns responsibility. A playback rebuffer spike should alert from metrics, narrow through dimensions, inspect causality through traces, explain domain detail through logs, and validate reporting impact through lineage. Each layer contributes something specific. None of them needs to carry the entire story alone.

Signal | Best at | Weak at | Retention shape
Metrics | Budgets, alerts, rates, aggregates. | Explaining one user's detailed path. | Longer retention, controlled dimensions.
Logs | Domain events, error context, discrete decisions. | High-volume numerical aggregation. | Tiered retention by event class.
Traces | Cross-service causality and timing. | Complete storage of every healthy request. | Sampled, with error/outlier preservation.
Lineage | Data correctness and transformation history. | Real-time alerting on hot request paths. | Long enough for audit and correction windows.

Cost observability for observability itself

Observability platforms need observability. If signal volume grows without ownership, teams eventually face a painful choice between losing visibility and accepting uncontrolled cost. The answer is not blindly reducing data. The answer is understanding which signals are valuable, which are duplicated, which dimensions are explosive, and which retention policies no longer match operational value.

I like tracking cost by team, service, signal type, event class, and cardinality source. The goal is not to shame teams for emitting telemetry. The goal is to make tradeoffs visible. A high-cost trace stream for a payment flow may be justified. A high-cost debug log for a healthy polling endpoint probably is not. Cost is a design signal.

const telemetryCost = {
  service: 'playback-api',
  team: 'media-platform',
  signal: 'logs',
  topDimensions: ['asset_id', 'region', 'cdn'],
  gbPerDay: 184,
  retainedDays: 14,
  action: 'move debug events to sampled tier',
};

Retention is an engineering decision

Retention is not a billing setting. It is an engineering decision about how long the organization needs memory. Security investigations, billing audits, attribution corrections, playback incident reviews, and product debugging all have different memory windows. Keeping everything forever is expensive. Dropping everything quickly is irresponsible. Retention should match the decision window.

A useful retention model starts with questions. How long after a playback incident do we need session detail? How long can attribution data be corrected? How long do billing disputes remain open? How long do we need raw traces after a deploy? The answers define tiers. Hot signals, incident snapshots, audit logs, lineage events, and aggregate metrics should not all share one retention policy.

Retention tiers

Hot diagnostic

High-detail traces/logs for recent deploys, active incidents, and short investigation windows.

Operational aggregate

Metrics and rollups kept long enough for trend, budget, and capacity analysis.

Audit memory

Security, billing, entitlement, and lineage records retained according to business and compliance windows.

Incident snapshot

Selected high-fidelity data preserved around major incidents for review and training.
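
Tiers become enforceable once they are configuration rather than tribal knowledge. Day counts here are placeholders, not recommendations:

// Each tier answers one question: how long does the organization need this
// memory in order to make the decision the tier exists for?
const retentionTiers = {
  hot_diagnostic:        { signals: ['traces', 'debug_logs'],    days: 14 },
  operational_aggregate: { signals: ['metrics', 'rollups'],      days: 400 },
  audit_memory:          { signals: ['billing_logs', 'lineage'], days: 2555 }, // compliance window
  incident_snapshot:     { signals: ['selected_traces', 'logs'], days: 365 },
};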

Observability is how systems stay teachable

One of the most overlooked benefits of strong observability is onboarding. Well-instrumented systems teach engineers how they behave. New team members can read traces, inspect logs, correlate metrics, and understand the living architecture without relying entirely on oral tradition.

This matters because long-lived systems outlast the people who first built them. Observability is part of the system's memory. It is the operational narrative that keeps infrastructure understandable years later.

Observability is one of the few engineering investments that compounds socially. It helps the system, but it also helps the next engineer understand what the previous engineer was thinking.

Final thought

Teams often think observability becomes important after scale. The opposite is closer to the truth. Observability is what allows scale to remain survivable. If the system cannot explain itself under pressure, every future incident becomes slower, more expensive, and more political.

Build the habit early. Give your systems names for what matters. Define boundaries, taxonomies, and budgets before traffic makes every unknown more painful. The strongest platforms are not only fast or scalable. They are legible when they fail.