DevOps & Platform
13 min read · March 7, 2026

Migrating from Datadog and New Relic to OpenTelemetry: A Practical Guide

Vendor observability agents work well right up until they become a lock-in problem, a cost problem, or both. Here's how to migrate to OpenTelemetry without breaking production observability.

Ajeet Yadav
Platform & Cloud Engineer

I have seen the Datadog bill shock moment enough times that I can describe it precisely. A team starts with the agent, falls in love with the dashboards, enables APM, turns on log management, adds infrastructure metrics for every node — and then finance sends an email. The bill has tripled in six months and nobody can explain exactly why, because Datadog's pricing model is genuinely complex.

This is usually when teams start investigating OpenTelemetry. But the conversation often goes in circles: OTel is vendor-neutral, but is it operationally mature? Can we really replace Datadog? What do we lose?

This post is my honest, opinionated take on OpenTelemetry migration — not as a vendor-bashing exercise, but as a practical guide for platform engineers who need to make a real decision.


What OpenTelemetry Actually Is

OpenTelemetry is a CNCF project that standardizes how applications produce observability data — traces, metrics, and logs. It has two main parts:

  1. SDKs and instrumentation libraries — the code that runs inside your application and produces telemetry
  2. The OpenTelemetry Collector — a standalone binary that receives, processes, and exports telemetry data to one or more backends

The key insight: OpenTelemetry is a pipeline, not a destination. You still need a backend — Prometheus for metrics, Jaeger or Tempo for traces, Loki for logs, or a commercial platform like Honeycomb, Grafana Cloud, or even Datadog. What OTel gives you is a vendor-neutral data layer so you can change backends without re-instrumenting your applications.

This is the actual value proposition. Not "free observability." Not "replace Datadog." Decouple your instrumentation from your storage backend. That decoupling is what removes the lock-in.
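
To make the decoupling concrete, here is a minimal sketch (resource names are illustrative) of what the application side looks like: the service only knows about a nearby collector via the standard OTLP environment variables, and every backend decision lives in the collector's configuration.

```yaml
# Application Deployment (excerpt): the app's only observability dependency
# is "send OTLP to the local collector": no vendor agent, no API key.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector-agent.monitoring:4317"   # assumed collector Service
  - name: OTEL_SERVICE_NAME
    value: "payments-api"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production"
# Swapping Tempo for Honeycomb (or adding Datadog back) is a change to the
# collector's exporters, not to this manifest or the application code.
```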


The Collector Architecture

The Collector is the most important component to understand before you migrate anything. It's a pipeline with three stages:

  • Receivers — accept data in various formats (OTLP, Prometheus, Jaeger, Zipkin, Datadog agent protocol, etc.)
  • Processors — transform, filter, sample, or batch data
  • Exporters — send data to a backend

```yaml
# A minimal collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

exporters:
  otlp/tempo:
    endpoint: http://tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
```

Deployment Topology

In Kubernetes, I use a two-layer topology:

  1. DaemonSet collectors — run on every node, collect host metrics, node logs, and act as a local OTLP endpoint for applications
  2. Deployment collectors (the "gateway") — receive from DaemonSet collectors, do expensive processing (tail sampling, aggregation), and export to backends

This separation matters because tail-based sampling (making sampling decisions after seeing a complete trace) requires a collector that sees all spans for a trace. A DaemonSet doesn't see the full picture; a centralized deployment does.

```yaml
# DaemonSet collector — lightweight, local
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector-agent
  template:
    metadata:
      labels:
        app: otel-collector-agent
    spec:
      tolerations:
      - operator: Exists  # run on all nodes including control plane
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.96.0
        args: ["--config=/conf/config.yaml"]
        resources:
          limits:
            memory: 256Mi
            cpu: 200m
        volumeMounts:
        - name: config
          mountPath: /conf
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: otel-collector-agent-config
      - name: varlog
        hostPath:
          path: /var/log
```
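
The piece that ties the two layers together is the agent-to-gateway export. A minimal sketch, assuming the gateway sits behind a headless Service named otel-collector-gateway in the monitoring namespace: the contrib distribution's loadbalancing exporter hashes on trace ID, so every span of a given trace reaches the same gateway replica, which is exactly what tail sampling requires.

```yaml
# DaemonSet collector (excerpt): forward traces to the gateway tier.
exporters:
  loadbalancing:
    routing_key: traceID   # keep all spans of a trace on one gateway replica
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-collector-gateway.monitoring   # assumed headless Service

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]
```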

Auto-Instrumentation vs Manual

This is where I see teams make the wrong call most often. They reach for auto-instrumentation because it's fast and they want to avoid touching application code. Auto-instrumentation is genuinely useful, but it has real limitations.

Auto-instrumentation uses a Kubernetes operator (the OpenTelemetry Operator) to inject instrumentation at Pod startup via a mutating webhook. For Java, Python, and Node.js, this is mature and covers most HTTP/gRPC/database spans without any application changes.

```yaml
# Instrumentation resource: defines how the operator injects auto-instrumentation
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: payments
spec:
  exporter:
    endpoint: http://otel-collector-agent:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
```

```yaml
# Opt a deployment into auto-instrumentation
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
```

The limitations of auto-instrumentation:

  • No business context. You get spans for HTTP calls, database queries, and queue messages — but not for "what did this specific order processing logic actually do."
  • Framework support varies. If you're using a niche framework or a custom protocol, auto-instrumentation may miss it entirely.
  • Sampling decisions are made at the head (trace start), not the tail. This means you can't selectively keep traces for errors while dropping healthy traces, unless you route through a gateway collector with tail sampling.

Manual instrumentation is more work but gives you the spans and attributes that actually matter for debugging. A well-instrumented service emits spans for every logical operation with business-meaningful attributes:

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def process_payment(order_id: str, amount_cents: int) -> PaymentResult:
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        span.set_attribute("payment.currency", "USD")

        try:
            result = payment_gateway.charge(order_id, amount_cents)
            span.set_attribute("payment.result", result.status)
            return result
        except PaymentDeclinedException as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
```

My recommendation: start with auto-instrumentation to get visibility fast, then add manual instrumentation for the critical paths that matter for debugging. Don't try to manually instrument everything upfront — you'll never finish.


Migration Strategy: Traces First, Metrics Second, Logs Last

I've seen migrations fail when teams try to replace everything simultaneously. The right order is traces, then metrics, then logs. Here's why.

Phase 1: Traces (Weeks 1–4)

Traces have the clearest value proposition and the least overlap with existing systems. Start by deploying the OTel Collector as a sidecar in your gateway layer, accepting OTLP and forwarding to your chosen trace backend (Tempo, Jaeger, or Honeycomb).

Enable auto-instrumentation on non-critical services first. Validate that trace context is propagating correctly across service boundaries — this is the most common early failure. A broken trace that shows only one service instead of the full call chain is usually a propagator misconfiguration.
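
For services instrumented with the SDK directly (rather than through the operator), a sketch of the relevant settings using the standard OTel environment variables: the propagator list has to match across every service in the chain, or context stops flowing at the first mismatch and the trace splits.

```yaml
# Per-service SDK config via standard env vars (values are illustrative).
env:
  - name: OTEL_PROPAGATORS
    value: "tracecontext,baggage,b3"   # must be consistent across services
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_always_on"     # sample everything; let the gateway's tail sampling decide
```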

During this phase, run your vendor agent and OTel in parallel. Do not decommission Datadog APM until you've validated that you can reproduce the trace queries your team actually uses in the new system.
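
Parallel running is cheap to set up at the collector level. A sketch, assuming the contrib distribution and a Datadog API key in the environment: the same trace pipeline fans out to both Tempo and Datadog, so the team can compare the two systems on live traffic before cutting over.

```yaml
# Dual-ship traces during the validation window (illustrative endpoints).
exporters:
  otlp/tempo:
    endpoint: http://tempo:4317
    tls:
      insecure: true
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, datadog]
```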

Phase 2: Metrics (Weeks 5–10)

Metrics migration is harder than traces because your existing dashboards and alerts, the foundation of your SLO and error budget strategy, are built on vendor-specific metric names. Datadog metrics often have prefixes like aws.ec2.cpu or kubernetes.pods.running that don't map directly to Prometheus metric names.

Build a mapping table before you start. For each metric in your critical dashboards, identify:

  1. The source metric name in Datadog/New Relic
  2. The equivalent Prometheus/OTel metric name
  3. The label/tag differences

```yaml
# OTel Collector: transform Datadog metric names to Prometheus conventions
processors:
  metricstransform:
    transforms:
    - include: "system.cpu.usage"
      action: update
      new_name: "node_cpu_seconds_total"
    - include: "process.runtime.jvm.memory.usage"
      action: update
      new_name: "jvm_memory_used_bytes"
```

Don't migrate dashboards during this phase — rebuild them. Copy-pasting Datadog dashboard queries into Prometheus doesn't work because the data model is fundamentally different. Use this as an opportunity to remove the dashboards nobody looks at.

Phase 3: Logs (Weeks 11–16)

Log migration is the most disruptive phase because developers are most attached to log search. Move logs last, after traces and metrics have stabilized and the team trusts the new system.

The OpenTelemetry log data model is still maturing compared to traces and metrics. Using the OTel Collector's filelog receiver to collect and forward logs works well:

```yaml
receivers:
  filelog:
    include:
    - /var/log/pods/*/*/*.log
    start_at: beginning
    include_file_path: true
    operators:
    - type: json_parser
      timestamp:
        parse_from: attributes.time
        layout: '%Y-%m-%dT%H:%M:%S.%LZ'
    - type: move
      from: attributes.log
      to: body

exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true
```

Cardinality Gotchas

High cardinality kills metrics systems. This is the operational risk that bites teams most often during OTel migrations.

The problem: every unique combination of metric name + label values creates a new time series. If you add a user_id label to a request counter and you have 100k users, you just created 100k time series from one metric. Self-managed Prometheus memory and query cost scale with the number of active series, and most managed TSDBs enforce per-tenant series limits; exceed them and ingestion slows down or starts dropping data.

Vendor agents often hide this problem because they do cardinality management server-side. When you move to self-managed Prometheus, the problem becomes yours.

Rules I enforce:

  1. Never use unique IDs as label values. No user_id, order_id, request_id. These belong in trace attributes (high cardinality is fine there), not metric labels.
  2. Bucket dimensions, don't enumerate them. Instead of region=us-east-1a, use region=us-east-1. Instead of specific service versions, use version=stable / version=canary.
  3. Cap label cardinality in the Collector. Use the filter processor to drop metrics with high-cardinality labels before they hit your TSDB.

```yaml
processors:
  filter:
    metrics:
      datapoint:
      # Drop metrics with high-cardinality user_id attribute
      - 'attributes["user_id"] != nil'
```

Sampling Strategies

Sampling is how you make observability affordable at scale. The naive approach — sample every trace at 10% — works but loses exactly the traces you care most about (errors, slow requests).

Head-based sampling (sampling decision made at trace start): simple but dumb. You sample by trace ID, so errors are kept at the same low rate as routine successes, and the rare traces you most want to inspect are exactly the ones that tend to get dropped.

Tail-based sampling (decision made after the trace is complete): smart but expensive. You need a stateful collector that holds spans in memory until the trace is complete, then makes a decision. This is why the deployment-tier gateway collector matters.

```yaml
# Tail sampling policy: keep 100% of errors, 1% of successful traces
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
    - name: errors-policy
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: slow-traces-policy
      type: latency
      latency:
        threshold_ms: 1000
    - name: probabilistic-policy
      type: probabilistic
      probabilistic:
        sampling_percentage: 1
```

The decision_wait of 10 seconds means the collector holds spans in memory for up to 10 seconds before making a sampling decision. Size your gateway collector's memory accordingly: 100k buffered traces averaging ~1 KB each works out to ~100 MB just for the buffer, and multi-span traces push that higher.
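
What that sizing means for the gateway collector, as a rough sketch (numbers are illustrative, not a recommendation): keep memory_limiter comfortably above the sampling buffer, and keep the pod's memory limit above memory_limiter so the collector backpressures before the kubelet OOM-kills it.

```yaml
# Gateway collector memory_limiter, sized for the tail-sampling buffer above.
processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 1500        # well above the ~100 MB buffer estimate plus batching overhead
    spike_limit_mib: 400
# Give the pod a memory limit above limit_mib (e.g. 2Gi) so the limiter
# refuses data before the container is OOM-killed.
```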


Cost Implications

The honest answer on cost: OpenTelemetry doesn't automatically save money. It removes the vendor tax, but you replace it with operational costs.

What you actually pay for:

  • Storage: Prometheus TSDB, Tempo for traces, Loki for logs. Self-managed means you pay storage + compute. Managed means you pay the vendor (Grafana Cloud, AWS Managed Prometheus, etc.).
  • Ops time: Someone on your team now owns the observability infrastructure. In a small team, this is a real cost.
  • Collector compute: The Collector cluster isn't free to run, especially with tail sampling.

The break-even point for most teams is around $30–50k/year in observability spend. Below that, the engineering cost of running OTel infrastructure often exceeds what you save by leaving Datadog. Above that, the economics usually favor migration — but run the numbers for your situation.

The strategic value — vendor independence and the ability to route data to multiple backends simultaneously — exists regardless of cost. But don't migrate just to save money. Migrate because you want control over your telemetry pipeline.


What You Actually Lose

I won't pretend the migration is cost-free. You lose:

  • Datadog's ML-based anomaly detection and watchdog features. These are genuinely good and have no direct OTel equivalent.
  • Integrated infrastructure metrics + APM correlation. Grafana Stack does this now, but it's not as polished.
  • Datadog's browser RUM and synthetic monitoring. These are harder to replace.
  • New Relic's entity-aware alerting. Prometheus alerting is powerful but more manual.

If any of these are core to how your team works, factor them into the migration plan. The goal isn't to throw away everything — it's to instrument your applications with OTel so that backends become swappable choices rather than architectural commitments.


Thinking about migrating from a vendor observability agent to OpenTelemetry? Talk to us at Coding Protocols. We help platform teams design observability architectures that give them full control without sacrificing the visibility they depend on.

Related Topics

OpenTelemetry
Observability
Datadog
Kubernetes
Platform Engineering
DevOps
Monitoring
