Migrating from Datadog and New Relic to OpenTelemetry: A Practical Guide
Vendor observability agents work well right up until they become a lock-in problem, a cost problem, or both. Here's how to migrate to OpenTelemetry without breaking production observability.

I have seen the Datadog bill shock moment enough times that I can describe it precisely. A team starts with the agent, falls in love with the dashboards, enables APM, turns on log management, adds infrastructure metrics for every node — and then finance sends an email. The bill has tripled in six months and nobody can explain exactly why, because Datadog's pricing model is genuinely complex.
This is usually when teams start investigating OpenTelemetry. But the conversation often goes in circles: OTel is vendor-neutral, but is it operationally mature? Can we really replace Datadog? What do we lose?
This post is my honest, opinionated take on OpenTelemetry migration — not as a vendor-bashing exercise, but as a practical guide for platform engineers who need to make a real decision.
What OpenTelemetry Actually Is
OpenTelemetry is a CNCF project that standardizes how applications produce observability data — traces, metrics, and logs. It has two main parts:
- SDKs and instrumentation libraries — the code that runs inside your application and produces telemetry
- The OpenTelemetry Collector — a standalone binary that receives, processes, and exports telemetry data to one or more backends
The key insight: OpenTelemetry is a pipeline, not a destination. You still need a backend — Prometheus for metrics, Jaeger or Tempo for traces, Loki for logs, or a commercial platform like Honeycomb, Grafana Cloud, or even Datadog. What OTel gives you is a vendor-neutral data layer so you can change backends without re-instrumenting your applications.
This is the actual value proposition. Not "free observability." Not "replace Datadog." Decouple your instrumentation from your storage backend. That decoupling is what removes the lock-in.
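To make the decoupling concrete, here is a minimal Python sketch of the shape of the idea (this is not the OTel API; the class and function names are hypothetical): application code emits telemetry through one interface, and the backend behind it can change without touching instrumentation.

```python
from typing import Protocol

class SpanExporter(Protocol):
    """Anything that can receive finished spans (hypothetical interface)."""
    def export(self, span: dict) -> str: ...

class TempoExporter:
    def export(self, span: dict) -> str:
        return f"tempo <- {span['name']}"

class DatadogExporter:
    def export(self, span: dict) -> str:
        return f"datadog <- {span['name']}"

def handle_request(exporter: SpanExporter) -> str:
    # Application code depends only on the interface, never on a backend.
    return exporter.export({"name": "GET /checkout", "duration_ms": 42})

# Swapping backends is a wiring change, not a re-instrumentation.
print(handle_request(TempoExporter()))    # tempo <- GET /checkout
print(handle_request(DatadogExporter()))  # datadog <- GET /checkout
```

In real OTel the interface is OTLP and the swap happens in Collector config rather than application code, but the dependency direction is the same.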
The Collector Architecture
The Collector is the most important component to understand before you migrate anything. It's a pipeline with three stages:
- Receivers — accept data in various formats (OTLP, Prometheus, Jaeger, Zipkin, Datadog agent protocol, etc.)
- Processors — transform, filter, sample, or batch data
- Exporters — send data to a backend
```yaml
# A minimal collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

exporters:
  otlp/tempo:
    endpoint: http://tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
```
Deployment Topology
In Kubernetes, I use a two-layer topology:
- DaemonSet collectors — run on every node, collect host metrics, node logs, and act as a local OTLP endpoint for applications
- Deployment collectors (the "gateway") — receive from DaemonSet collectors, do expensive processing (tail sampling, aggregation), and export to backends
This separation matters because tail-based sampling (making sampling decisions after seeing a complete trace) requires a collector that sees all spans for a trace. A DaemonSet doesn't see the full picture; a centralized deployment does.
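This constraint is usually satisfied with trace-ID-aware load balancing between the DaemonSet layer and the gateway layer; the Collector's loadbalancing exporter implements it with consistent hashing. The core idea fits in a few lines of Python (the gateway list is illustrative, and a simple hash-mod stands in for consistent hashing):

```python
import hashlib

GATEWAYS = ["gateway-0:4317", "gateway-1:4317", "gateway-2:4317"]  # illustrative

def gateway_for(trace_id: str) -> str:
    """Route every span of a trace to the same gateway by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(GATEWAYS)
    return GATEWAYS[index]

# All spans sharing a trace ID land on the same gateway, so tail sampling
# sees the complete trace before it decides.
tid = "4bf92f3577b34da6a3ce929d0e0e4736"
assert gateway_for(tid) == gateway_for(tid)
```

The real exporter uses consistent hashing so that scaling the gateway pool up or down reshuffles as few traces as possible; a plain modulo would remap almost everything.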
```yaml
# DaemonSet collector — lightweight, local
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector-agent
  template:
    metadata:
      labels:
        app: otel-collector-agent
    spec:
      tolerations:
        - operator: Exists  # run on all nodes including control plane
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args: ["--config=/conf/config.yaml"]
          resources:
            limits:
              memory: 256Mi
              cpu: 200m
          volumeMounts:
            - name: config
              mountPath: /conf
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: otel-collector-agent-config
        - name: varlog
          hostPath:
            path: /var/log
```
Auto-Instrumentation vs Manual
This is where I see teams make the wrong call most often. They reach for auto-instrumentation because it's fast and they want to avoid touching application code. Auto-instrumentation is genuinely useful, but it has real limitations.
Auto-instrumentation uses a Kubernetes operator (the OpenTelemetry Operator) to inject instrumentation at Pod startup via a mutating webhook. For Java, Python, and Node.js, this is mature and covers most HTTP/gRPC/database spans without any application changes.
```yaml
# Instrumentation resource: namespace-wide defaults for injected SDKs
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: payments
spec:
  exporter:
    endpoint: http://otel-collector-agent:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
```

```yaml
# Opt a deployment into auto-instrumentation
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
```
The limitations of auto-instrumentation:
- No business context. You get spans for HTTP calls, database queries, and queue messages — but not for "what did this specific order processing logic actually do."
- Framework support varies. If you're using a niche framework or a custom protocol, auto-instrumentation may miss it entirely.
- Sampling decisions are made at the head (trace start), not the tail. This means you can't selectively keep traces for errors while dropping healthy traces, unless you route through a gateway collector with tail sampling.
Manual instrumentation is more work but gives you the spans and attributes that actually matter for debugging. A well-instrumented service emits spans for every logical operation with business-meaningful attributes:
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def process_payment(order_id: str, amount_cents: int) -> PaymentResult:
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        span.set_attribute("payment.currency", "USD")

        try:
            result = payment_gateway.charge(order_id, amount_cents)
            span.set_attribute("payment.result", result.status)
            return result
        except PaymentDeclinedException as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise
```
My recommendation: start with auto-instrumentation to get visibility fast, then add manual instrumentation for the critical paths that matter for debugging. Don't try to manually instrument everything upfront — you'll never finish.
Migration Strategy: Traces First, Metrics Second, Logs Last
I've seen migrations fail when teams try to replace everything simultaneously. The right order is traces, then metrics, then logs. Here's why.
Phase 1: Traces (Weeks 1–4)
Traces have the clearest value proposition and the least overlap with existing systems. Start by deploying the OTel Collector as a sidecar in your gateway layer, accepting OTLP and forwarding to your chosen trace backend (Tempo, Jaeger, or Honeycomb).
Enable auto-instrumentation on non-critical services first. Validate that trace context is propagating correctly across service boundaries — this is the most common early failure. A broken trace that shows only one service instead of the full call chain is usually a propagator misconfiguration.
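Context propagation rides on HTTP headers. With the default tracecontext propagator, every outgoing request carries a W3C traceparent header; if any service in the chain drops it or fails to parse it, the trace breaks at that hop. A quick sanity check you can run against captured headers (a hand-rolled parser, for illustration only; real SDKs do this for you):

```python
import re

# W3C trace context: version-traceid-parentid-flags, all lowercase hex
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Split a traceparent header into (trace_id, parent_span_id, flags),
    or return None if the header is malformed."""
    m = TRACEPARENT.match(header)
    if not m:
        return None
    return m.group(1), m.group(2), m.group(3)

parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# The trace_id stays constant across the whole call chain; span_id changes per hop.
assert parsed is not None and parsed[0] == "4bf92f3577b34da6a3ce929d0e0e4736"
```

If two services report different trace IDs for the same request, the header is being dropped (proxies and message queues are common culprits) or the services are configured with mismatched propagators.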
During this phase, run your vendor agent and OTel in parallel. Do not decommission Datadog APM until you've validated that you can reproduce the trace queries your team actually uses in the new system.
Phase 2: Metrics (Weeks 5–10)
Metrics migration is harder than traces because you likely have existing dashboards and alerts, often the foundation of your SLO and error budget strategy, built on vendor-specific metric names. Datadog metrics often have prefixes like aws.ec2.cpu or kubernetes.pods.running that don't map directly to Prometheus metric names.
Build a mapping table before you start. For each metric in your critical dashboards, identify:
- The source metric name in Datadog/New Relic
- The equivalent Prometheus/OTel metric name
- The label/tag differences
```yaml
# OTel Collector: transform Datadog metric names to Prometheus conventions
processors:
  metricstransform:
    transforms:
      - include: "system.cpu.usage"
        action: update
        new_name: "node_cpu_seconds_total"
      - include: "process.runtime.jvm.memory.usage"
        action: update
        new_name: "jvm_memory_used_bytes"
```
Don't migrate dashboards during this phase — rebuild them. Copy-pasting Datadog dashboard queries into Prometheus doesn't work because the data model is fundamentally different. Use this as an opportunity to remove the dashboards nobody looks at.
Phase 3: Logs (Weeks 11–16)
Log migration is the most disruptive phase because developers are most attached to log search. Move logs last, after traces and metrics have stabilized and the team trusts the new system.
The OpenTelemetry log data model is still maturing compared to traces and metrics, but the Collector's filelog receiver already works well for collecting and forwarding logs:
```yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: beginning
    include_file_path: true
    operators:
      - type: json_parser
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
      - type: move
        from: attributes.log
        to: body

exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true
```
Cardinality Gotchas
High cardinality kills metrics systems. This is the operational risk that bites teams most often during OTel migrations.
The problem: every unique combination of metric name + label values creates a new time series. If you add a user_id label to a request counter and you have 100k users, you just created 100k time series from one metric. Prometheus and most TSDBs have hard limits on cardinality; exceed them and ingestion starts dropping.
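The blow-up is multiplicative: the total series count is the product of each label's distinct-value count. A back-of-envelope check in Python (the cardinality numbers here are made up for illustration):

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Each unique combination of label values is its own time series."""
    return prod(label_cardinalities.values())

base = {"service": 20, "endpoint": 50, "status_code": 5}
assert series_count(base) == 5_000  # manageable

# Add a single user_id label with 100k distinct values:
base["user_id"] = 100_000
assert series_count(base) == 500_000_000  # TSDB-killing
```

In practice not every combination occurs, so the product is an upper bound, but one unbounded label is enough to push a healthy metric past any TSDB's limits.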
Vendor agents often hide this problem because they do cardinality management server-side. When you move to self-managed Prometheus, the problem becomes yours.
Rules I enforce:
- Never use unique IDs as label values. No `user_id`, `order_id`, `request_id`. These belong in trace attributes (high cardinality is fine there), not metric labels.
- Bucket dimensions, don't enumerate them. Instead of `region=us-east-1a`, use `region=us-east-1`. Instead of specific service versions, use `version=stable` / `version=canary`.
- Cap label cardinality in the Collector. Use the `filter` processor to drop metrics with high-cardinality labels before they hit your TSDB.
```yaml
processors:
  filter:
    metrics:
      datapoint:
        # Drop metrics with high-cardinality user_id attribute
        - 'attributes["user_id"] != nil'
```
Sampling Strategies
Sampling is how you make observability affordable at scale. The naive approach — sample every trace at 10% — works but loses exactly the traces you care most about (errors, slow requests).
Head-based sampling (sampling decision made at trace start): simple but dumb. You sample by trace ID modulo, so errors are undersampled at the same rate as successes.
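Head sampling fits in one function: the decision uses only the trace ID, which is all that exists when the trace starts, so errors and successes are kept at exactly the same rate. This is a sketch of the idea, not any SDK's exact algorithm:

```python
import random

def head_sample(trace_id_hex: str, rate: float = 0.10) -> bool:
    """Keep a trace iff its ID falls in the bottom `rate` slice of the
    64-bit ID space. Deterministic: every service computes the same
    answer for the same trace ID, so sampled traces stay complete."""
    threshold = int(rate * (1 << 64))
    return int(trace_id_hex[:16], 16) < threshold

# Over many random trace IDs, roughly 10% survive, error or not.
random.seed(7)
ids = [f"{random.getrandbits(128):032x}" for _ in range(50_000)]
kept = sum(head_sample(t) for t in ids)
assert 0.09 < kept / len(ids) < 0.11
```

The determinism is the important property: a probabilistic coin-flip per span would tear traces apart, because different services would keep different fragments of the same trace.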
Tail-based sampling (decision made after the trace is complete): smart but expensive. You need a stateful collector that holds spans in memory until the trace is complete, then makes a decision. This is why the deployment-tier gateway collector matters.
```yaml
# Tail sampling policy: keep 100% of errors, 1% of successful traces
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```
The decision_wait of 10 seconds means the collector holds spans in memory for up to 10 seconds before making a sampling decision. Size your gateway collector's memory accordingly: at ~1KB per span, 100k buffered traces cost ~100MB even at a single span per trace, and a workload averaging ten spans per trace needs on the order of 1GB for the buffer alone.
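A rough sizing helper, with the assumptions spelled out as parameters (spans per trace and bytes per span vary wildly between workloads, so measure yours before trusting any estimate):

```python
def tail_buffer_mb(traces_in_flight: int,
                   avg_spans_per_trace: int,
                   avg_span_bytes: int) -> float:
    """Rough memory needed to hold undecided traces during decision_wait."""
    return traces_in_flight * avg_spans_per_trace * avg_span_bytes / 1e6

# 100k traces, one ~1KB span each -> ~100 MB
assert tail_buffer_mb(100_000, 1, 1_000) == 100.0
# A more realistic 10 spans per trace -> ~1 GB; size the gateway for this
assert tail_buffer_mb(100_000, 10, 1_000) == 1_000.0
```

This estimate covers only the span buffer; leave headroom for the collector's own overhead and set memory_limiter below the container limit so spikes degrade gracefully instead of OOM-killing the gateway.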
Cost Implications
The honest answer on cost: OpenTelemetry doesn't automatically save money. It removes the vendor tax, but you replace it with operational costs.
What you actually pay for:
- Storage: Prometheus TSDB, Tempo for traces, Loki for logs. Self-managed means you pay storage + compute. Managed means you pay the vendor (Grafana Cloud, AWS Managed Prometheus, etc.).
- Ops time: Someone on your team now owns the observability infrastructure. In a small team, this is a real cost.
- Collector compute: The Collector cluster isn't free to run, especially with tail sampling.
The break-even point for most teams is around $30–50k/year in observability spend. Below that, the engineering cost of running OTel infrastructure often exceeds what you save by leaving Datadog. Above that, the economics usually favor migration — but run the numbers for your situation.
The strategic value — vendor independence and the ability to route data to multiple backends simultaneously — exists regardless of cost. But don't migrate just to save money. Migrate because you want control over your telemetry pipeline.
What You Actually Lose
I won't pretend the migration is cost-free. You lose:
- Datadog's ML-based anomaly detection and watchdog features. These are genuinely good and have no direct OTel equivalent.
- Integrated infrastructure metrics + APM correlation. Grafana Stack does this now, but it's not as polished.
- Datadog's browser RUM and synthetic monitoring. These are harder to replace.
- New Relic's entity-aware alerting. Prometheus alerting is powerful but more manual.
If any of these are core to how your team works, factor them into the migration plan. The goal isn't to throw away everything — it's to instrument your applications with OTel so that backends become swappable choices rather than architectural commitments.
Thinking about migrating from a vendor observability agent to OpenTelemetry? Talk to us at Coding Protocols. We help platform teams design observability architectures that give them full control without sacrificing the visibility they depend on.


