altitudes® Cloud · Platform · AI Amsterdam · Rotterdam
[SERVICE] OPS · 4–8 weeks initial, continuous improvement
[07] / SERVICE — OBSERVABILITY

SLOs the team agrees with. Alerts on-call trusts.

Observability is not dashboards. It's the runbook the dashboard triggers. We build observability that changes behaviour: SLOs the team agrees with, alerts the on-call trusts, AI calls included as first-class telemetry.

[01] / CAPABILITIES _

What an observability engagement covers.

Five signal types, one correlated context — this is how metrics, logs, traces, profiles, and events converge into the picture you actually need.

[FIG.10 / OBSERVABILITY · PILLARS] FIVE PILLARS — ONE CONTEXT
  • [01] / SLO

    Service Level Objectives

    Per service, agreed with the team. Written in language a CFO can read. Reviewed monthly.

  • [02] / TELEMETRY

    OpenTelemetry-first instrumentation

    Traces, logs, metrics. Vendor-portable. No lock-in to a single APM.

  • [03] / ALERTS

    Alerts on-call trusts

    We delete more alerts than we add. Every alert links to a runbook. Pager fatigue is a metric.

  • [04] / AI

    AI telemetry, first-class

    Token, latency, quality, cost per use case. Hallucination class as a signal.

  • [05] / RUNBOOKS

    Runbooks the dashboard triggers

    Every alert ships with a runbook. Updated after every incident. Reviewed quarterly.

  • [06] / EVIDENCE

    Evidence pack for audit

    Same telemetry that runs the platform feeds DORA, ISO 27001 and SOC 2 evidence. One source of truth.

[INFO FLOW · 5 STAGES _]

The full telemetry pipeline, end to end: from instrumentation in your workload to the alert that wakes your engineer at 2 AM.

[FIG.11 / OBSERVABILITY · INFO FLOW]
01 / INSTRUMENTATION

Data starts inside the application

OpenTelemetry SDK emits metrics and distributed traces from your code. Structured JSON logs carry a correlation ID that follows a single request across every service boundary — no agent, no sidecar, no bolt-on.
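A minimal sketch of the correlation-ID idea, using only the Python standard library rather than the OpenTelemetry SDK. The logger name `checkout`, the `correlation_id` context variable and the `handle_request` helper are illustrative assumptions, not part of any specific engagement. In an OpenTelemetry setup, the trace ID of the active span would typically play this role.

```python
import contextvars
import io
import json
import logging
import uuid
from typing import Optional

# Hypothetical context variable holding the correlation ID of the current request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's correlation ID if one arrived, otherwise mint one."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        logger.info("request completed")
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request
    return cid
```

Every log line written while a request is being handled carries the same ID, so a downstream service that forwards the ID in its own calls produces log lines that can be joined to the originating request.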

02 / COLLECTION

One tier cleans the signal

Every telemetry stream passes through one collection tier before storage. Here it is sampled to cut volume, enriched with deployment metadata, and fanned out to the right backend — so stores receive clean, labelled data instead of a raw firehose.
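The three steps of the collection tier (sample, enrich, fan out) can be sketched in plain Python. This is an illustration of the logic, not a collector configuration: the `Signal` type, the `BACKENDS` routing table and the choice to sample traces only are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical routing table: which backend receives which signal kind.
BACKENDS = {"metric": "mimir", "log": "loki", "trace": "tempo"}


@dataclass
class Signal:
    kind: str                      # "metric", "log" or "trace"
    trace_id: str
    attributes: dict = field(default_factory=dict)


def keep(trace_id: str, ratio: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same verdict."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")      # uniform in [0, 2**64)
    return bucket < int(ratio * 2**64)


def process(signals, ratio: float, deployment: dict) -> dict:
    """Sample traces, enrich every kept signal, and group by target backend."""
    routed = {name: [] for name in BACKENDS.values()}
    for signal in signals:
        # Assumption for this sketch: only traces are sampled.
        if signal.kind == "trace" and not keep(signal.trace_id, ratio):
            continue
        signal.attributes.update(deployment)        # enrichment step
        routed[BACKENDS[signal.kind]].append(signal)  # fan-out step
    return routed
```

Hashing the trace ID instead of drawing a random number means every span of a given trace is kept or dropped together, even when the spans arrive at different collector instances.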

03 / STORAGE

Purpose-built stores, not one database

Metrics, logs, and traces have different query shapes. PromQL runs range aggregations on Mimir. Loki filters log lines by label and content with LogQL. Tempo traverses trace graphs. Forcing all three into one store means slow queries or wasted money.

04 / CORRELATION

One pane, three stores, no pivoting

Grafana becomes the single entry point. One dashboard joins a latency spike from Tempo with the log lines from Loki and the metric alert from Mimir — correlated in context, without switching tools or copy-pasting IDs between tabs.

05 / ACTION

The pipeline ends in a decision

SRE dashboards surface SLO burn rates in real time. On-call alerting fires when the error budget is actually being breached: not a minute earlier, not a minute later. Every alert traces back to a specific query and a specific owner.
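Burn rate is the observed error rate divided by the error rate the SLO allows. A minimal sketch of a two-window burn-rate check follows. The function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from the commonly cited pattern in which a 30-day budget loses 2 percent in one hour (14.4 × 1 h / 720 h = 0.02); your own thresholds would come out of the SLO design.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (errors / total) / allowed_error_rate


def should_page(long_window, short_window, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    Each window is an (errors, total) pair. The long window shows the burn is
    significant; the short window shows it is still happening right now.
    """
    long_burn = burn_rate(*long_window, slo_target)
    short_burn = burn_rate(*short_window, slo_target)
    return long_burn >= threshold and short_burn >= threshold
```

For a 99.9 percent SLO, 200 failures in 10,000 requests over the long window and 20 in 1,000 over the short window both give a burn rate of about 20, so `should_page((200, 10000), (20, 1000), 0.999)` pages. If the short window has already recovered, the same long window does not.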

[02] HOW WE RUN IT _

How an observability engagement runs.

From audit to handoff · 4–8 weeks initial, continuous improvement
[01] / AUDIT

Pager-fatigue audit, 1 week.

We measure the alert noise floor and the on-call load. Baseline written down.

⏱ 1 wk
[02] / DESIGN

SLOs and dashboards.

Per service, agreed with the team. Reviewed by your SRE lead.

⏱ 1 wk
[03] / BUILD

Instrumentation, 4–6 weeks.

OpenTelemetry, runbooks per alert, AI calls included. Pair-coded.

⏱ 4–6 wks
[04] / HANDOFF

Your SRE team owns it.

Documented, instrumented, with a quarterly review template you can run yourself.

⏱ 1 wk
[RELATED] PACKAGED SOLUTION · EXCELLENCE & BENCHMARKING

Excellence & Benchmarking.

Observability baseline plus benchmark against EU mid-market peers. Pager-fatigue index, SLO coverage, runbook quality scored. Outcome: a maturity number you can defend, and a focused improvement plan.

See the solution

Questions about observability.

[01] Which vendor?

We are OpenTelemetry-first. Backend is your choice: Grafana stack, VictoriaMetrics, Tempo, OpenSearch. We pick on signal quality and exit cost, not on logo.

[02] Can you keep our existing stack?

Often yes. The engagement is usually about what you instrument and how you alert, not about replacing the backend. We replace backends only when the data shape is wrong.

[03] How do you measure AI in the stack?

Token count per call, latency, output quality (auto-evaluated where possible, human-scored where not), and cost. Joined to the same trace as the rest of the service.
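A sketch of what one AI call could look like as a telemetry record, keyed by the same trace ID as the rest of the request. The `AICallRecord` fields, the `summarise` helper and the prices in `PRICE_PER_1K` are hypothetical values chosen for illustration, not real vendor pricing or a fixed schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prices per 1,000 tokens, for illustration only.
PRICE_PER_1K = {"input": 0.50, "output": 1.50}


@dataclass
class AICallRecord:
    trace_id: str                   # joins the call to the surrounding service trace
    use_case: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    quality_score: Optional[float]  # None until the call has been evaluated
    quality_source: str             # "auto", "human" or "pending"

    @property
    def cost(self) -> float:
        """Cost of this call under the assumed price table."""
        return (self.input_tokens / 1000 * PRICE_PER_1K["input"]
                + self.output_tokens / 1000 * PRICE_PER_1K["output"])


def summarise(records) -> dict:
    """Aggregate calls, tokens, cost and mean latency per use case."""
    summary = {}
    for r in records:
        s = summary.setdefault(
            r.use_case,
            {"calls": 0, "tokens": 0, "cost": 0.0, "latency_ms_total": 0.0},
        )
        s["calls"] += 1
        s["tokens"] += r.input_tokens + r.output_tokens
        s["cost"] += r.cost
        s["latency_ms_total"] += r.latency_ms
    for s in summary.values():
        s["mean_latency_ms"] = s.pop("latency_ms_total") / s["calls"]
    return summary
```

Keeping `quality_source` next to the score records whether a number came from an automatic evaluator or a human reviewer, so the two are not averaged together by accident.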

[04] What about cost of observability itself?

Tracked monthly. We routinely reduce observability spend by 30 to 40 percent during the engagement by deleting noise and adjusting sampling. Net of our fee, the engagement usually pays for itself.

[NEXT STEP]

Ready to talk observability?

Book the Platform Read, or email hello@altitudes.cloud

One call, one written summary, either way.