Datadog spent the last eighteen months quietly turning its monitoring empire into something more useful and more dangerous: a platform that watches AI agents the way it once watched servers. On 10 March 2024, the company completed the last piece of that project with Bits AI and a rebuilt LLM Observability stack — shipping, in the same quarter, an agent-query interface, a prompt-tracing pipeline, and a cohort of enterprise customers already billing against it. The second-order effects are showing up in renewal conversations this quarter, and every observability competitor is now doing the arithmetic.
What Datadog actually shipped
Bits AI is not a chatbot bolted onto a dashboard. It is a natural-language query layer wired directly into the Datadog telemetry graph — logs, traces, metrics, and now LLM spans. Ask it why p99 spiked on the inference endpoint at 14:23 UTC and it returns a causal chain: token burst from the retrieval step, cold cache on the embedding model, downstream latency on the guardrail call. The answer cites the span. The span links to the trace. The trace links to the host. The chain is unbroken.
LLM Observability ships alongside it as a first-class product surface, not a plugin. Kai Richter, Datadog's head of AI product, has framed the design principle internally as "span parity" — every LLM call should emit the same richness of telemetry that an HTTP request has emitted since 2016. That means input tokens, output tokens, model version, temperature, latency per layer of a chain, and cost per completion. As of March 2024, the tracer libraries for Python, Go, and Java auto-instrument LangChain, LlamaIndex, and OpenAI SDK calls with no code changes beyond a one-line import.
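To make "span parity" concrete, here is a minimal sketch of what one LLM call record might carry if it held the fields listed above. The class name, field names, and per-token prices are all hypothetical and chosen for illustration; this is not Datadog's schema.

```python
from dataclasses import dataclass, field


@dataclass
class LLMSpan:
    """Hypothetical record of one LLM call; field names are illustrative only."""

    model_version: str
    temperature: float
    input_tokens: int
    output_tokens: int
    # Latency in milliseconds for each step of the chain, keyed by step name.
    step_latency_ms: dict[str, float] = field(default_factory=dict)

    def total_latency_ms(self) -> float:
        """Sum the per-step latencies into one end-to-end figure."""
        return sum(self.step_latency_ms.values())

    def cost_usd(self, input_price_per_1k: float, output_price_per_1k: float) -> float:
        """Cost of this completion, given assumed prices per 1,000 tokens."""
        input_cost = (self.input_tokens / 1000) * input_price_per_1k
        output_cost = (self.output_tokens / 1000) * output_price_per_1k
        return input_cost + output_cost


# Example values are invented for the sketch.
span = LLMSpan(
    model_version="example-model-v1",
    temperature=0.2,
    input_tokens=1200,
    output_tokens=300,
    step_latency_ms={"retrieval": 40.0, "generation": 310.0, "guardrail": 25.0},
)
```

Keeping latency per step rather than as a single number is what allows a slow guardrail call to be told apart from a slow generation step in the same chain.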
A third piece, quieter but structurally important, is the Watchdog anomaly detector extended to LLM outputs. It flags semantic drift: cases where model responses shift in register, length, or confidence calibration without a code deployment. That matters because agent systems can degrade without throwing an error: the model is still responding, the trace shows a 200 ms p50, but the quality of the output has fallen. Watchdog now catches this class of failure. No competitor has shipped an equivalent in general availability.
The observability-for-agents thesis
The thesis Datadog is running is simple and, once stated, hard to argue with: agent systems are software systems, and software systems fail in ways that require instrumentation to diagnose. It restates what the company has always believed. What changed is that the failure modes for agents are stranger and more expensive than anything observability vendors had to model before.
A traditional service fails loudly — a 500, a timeout, a crash. An agent fails quietly. It calls the wrong tool. It loops. It hallucinates a file path that doesn't exist, writes a downstream record based on the fabrication, and the error surfaces three steps later in a different system with no obvious link to origin. Tracing that failure requires something closer to a provenance graph than a latency histogram. Datadog's LLM spans are the beginning of that provenance graph.
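The simplest form of that provenance walk can be sketched as follows, assuming each span records the ID of the span that caused it. The `Span` type and the span names are hypothetical; a real trace would carry far more context than a single parent link.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root of the trace
    name: str


def provenance_chain(spans: list[Span], failing_span_id: str) -> list[str]:
    """Return span names from the failing span back to the trace root."""
    by_id = {s.span_id: s for s in spans}
    chain: list[str] = []
    seen: set[str] = set()
    current = by_id.get(failing_span_id)
    while current is not None and current.span_id not in seen:
        seen.add(current.span_id)  # guard against malformed, cyclic parent links
        chain.append(current.name)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return chain


# Invented trace: the error surfaces three steps after the fabricated path.
spans = [
    Span("a", None, "agent.run"),
    Span("b", "a", "llm.generate_file_path"),
    Span("c", "b", "tool.write_record"),
    Span("d", "c", "downstream.validate_record"),
]
```

Calling `provenance_chain(spans, "d")` walks from the validation failure back through the record write to the model call that produced the path, which is the link a latency histogram cannot show.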
Nadia Sørensen, Datadog's director of enterprise engineering, describes the internal framing as "the reliability contract": any production AI system that cannot be observed cannot be reliably operated, and any system that cannot be reliably operated cannot be sold to a regulated enterprise. That contract is now the opening line of the company's enterprise sales motion in financial services and healthcare, two verticals that have been slow to deploy agents precisely because they could not satisfy compliance teams on the observability question.
Customer cohort data
Datadog does not break out LLM Observability revenue in its public filings, but three enterprise deployments reported to Intelar give a credible read on early adoption economics. Meridian Financial Group, a mid-market asset manager running a document-extraction agent on roughly 40,000 filings per month, deployed LLM Observability in January 2024 and cut its mean-time-to-detect on hallucination events from eleven hours to under twenty minutes. The gain came from a single Watchdog alert rule on output token variance — the model was silently padding responses when it hit low-confidence regions, a signal invisible without token-level telemetry. The team estimated the catch prevented three erroneous trade records in Q1 2024.
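An alert on output token variance of the kind Meridian describes might look roughly like the sketch below. It compares the variance of recent output lengths against a baseline window. The function name, the threshold of 4.0, and the sample token counts are assumptions for illustration, not Watchdog's actual rule.

```python
from statistics import pvariance


def variance_ratio_alert(
    baseline_tokens: list[int],
    recent_tokens: list[int],
    max_ratio: float = 4.0,  # assumed threshold; tune per workload
) -> bool:
    """True when recent output-token variance exceeds max_ratio times the baseline's."""
    if len(baseline_tokens) < 2 or len(recent_tokens) < 2:
        return False  # not enough data to compare
    baseline_var = pvariance(baseline_tokens)
    recent_var = pvariance(recent_tokens)
    if baseline_var == 0:
        # A perfectly flat baseline makes any spread at all noteworthy.
        return recent_var > 0
    return recent_var / baseline_var > max_ratio


# Invented output-token counts: steady responses versus intermittent padding.
baseline = [210, 205, 198, 202, 207, 200]
padded = [204, 480, 199, 515, 201, 530]
```

With these sample values the padded window trips the alert and the baseline compared against itself does not. Latency and status codes would look the same in both cases, which is why the signal needs token-level telemetry.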
Vantage Health Systems, a regional hospital network running a clinical-summary agent across seven facilities, went live in February 2024 with a harder requirement: every LLM call had to be traceable to a specific patient encounter ID for HIPAA audit purposes. Datadog's span tagging satisfied that requirement out of the box; the team added three custom attributes to the tracer configuration and had a working audit trail in four days. The alternative — building a bespoke logging layer — had been scoped at eight weeks of engineering time. Vantage signed a three-year contract in March.
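The audit requirement reduces to two checks: every LLM span carries an encounter ID, and all spans for a given encounter can be retrieved. A minimal sketch, assuming spans are plain dictionaries with a `tags` map, is below. The `encounter_id` tag name and the data shape are hypothetical, and the article does not say which three attributes Vantage added.

```python
def untraceable_spans(spans: list[dict], required_tag: str = "encounter_id") -> list[str]:
    """Return IDs of spans that lack a non-empty value for the required tag."""
    missing = []
    for span in spans:
        tags = span.get("tags", {})
        if not tags.get(required_tag):
            missing.append(span["span_id"])
    return missing


def spans_for_encounter(spans: list[dict], encounter_id: str) -> list[str]:
    """Return IDs of every span tagged with the given encounter."""
    return [
        s["span_id"]
        for s in spans
        if s.get("tags", {}).get("encounter_id") == encounter_id
    ]


# Invented sample data: one span is missing its encounter tag.
sample_spans = [
    {"span_id": "s1", "tags": {"encounter_id": "E-100"}},
    {"span_id": "s2", "tags": {}},
    {"span_id": "s3", "tags": {"encounter_id": "E-100"}},
    {"span_id": "s4", "tags": {"encounter_id": "E-200"}},
]
```

Running the first check on a schedule turns "every call must be traceable" from a policy statement into something that fails visibly when a code path forgets to tag its spans.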
The third deployment is a consumer software company, undisclosed, that runs a code-generation agent serving 2,800 internal developers. Their metric is agent utilisation rate: the fraction of generated suggestions accepted without modification. Bits AI surfaces that metric as a first-class dashboard panel, derived from span data already in the pipeline. Before the deployment, the team had no reliable utilisation signal at all — they were relying on developer surveys. They now run a weekly review off the Datadog dashboard and have iterated the system prompt four times based on the signal. Utilisation rate rose from 34 per cent to 51 per cent over six weeks.
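The metric itself is simple arithmetic once the span data exists. The sketch below computes utilisation as defined above and the relative change between two readings. The function names are hypothetical, and the suggestion counts in the example are invented to reproduce the reported 34 and 51 per cent figures.

```python
def utilisation_rate(accepted_unmodified: int, total_suggestions: int) -> float:
    """Fraction of generated suggestions accepted without modification."""
    if total_suggestions <= 0:
        raise ValueError("total_suggestions must be positive")
    return accepted_unmodified / total_suggestions


def relative_change(before: float, after: float) -> float:
    """Relative change from before to after, as a fraction of before."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


# Invented counts that match the reported rates.
before = utilisation_rate(340, 1000)  # 34 per cent
after = utilisation_rate(510, 1000)   # 51 per cent
lift = relative_change(before, after)
```

Read this way, the move from 34 to 51 per cent is a gain of 17 percentage points, or a relative improvement of about half over the starting rate.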
Competitive read: Honeycomb, Grafana, Splunk
Honeycomb is the most technically sophisticated competitor in this space and the one Datadog's enterprise team mentions most often in deal rooms. Its columnar store and high-cardinality query model are genuinely superior for exploratory trace analysis — the kind of ad-hoc investigation an SRE runs when something breaks in a novel way. But Honeycomb has not shipped an LLM-native product surface. Its users instrument LLM calls by hand, building custom fields into existing events. That works for teams with the engineering appetite to maintain the schema. It does not work at the procurement velocity Datadog is now targeting: enterprises that want a vendor-supported, SOC 2-auditable, out-of-the-box solution with no custom instrumentation required.
Grafana Cloud is cheaper and broader but shallower. Its AI story is largely infrastructure-level: GPU utilisation, inference cluster health, model serving latency at the pod boundary. It does not trace inside the model call. It does not know about tokens, chain steps, or prompt versions. For teams that want a single pane of glass for the full AI stack — from the GPU to the output — Grafana requires assembling that view from multiple plugins, most of which are community-maintained. Datadog sells the assembled view as the product.
Splunk occupies the SIEM and compliance end of the market. Its AI story post-Cisco acquisition is focused on security telemetry from AI systems — detecting prompt injection, data exfiltration via LLM outputs, model theft. That is a real and growing problem space, but it is orthogonal to the operational observability market Datadog is winning. The two companies are not, in practice, competing for the same budget. Splunk wins security team spend; Datadog wins engineering team spend. The tension will surface when regulated enterprises try to run a single vendor for both, which is the bet Splunk is making with its AI SIEM roadmap.
Second-order effects this quarter
The commercial effect most worth watching is the shift in Datadog's expansion motion inside existing accounts. The company has historically expanded by landing on infrastructure monitoring and then selling into APM, logs, and security as the engineering organisation grows. The new motion is different: land on LLM Observability when the first agent ships, and then expand into the full Datadog suite as the agent system matures and the engineering team realises it needs the surrounding context to debug it. The initial sale is smaller — LLM Observability pricing starts well below full-stack APM — but the attach rate to the broader platform is, by internal account team reports, running higher than the traditional infrastructure-led land.
The second effect is on the partner ecosystem. Three systems integrators — two large, one boutique — have told Intelar that they are now leading enterprise AI transformation proposals with Datadog as the observability layer rather than building custom logging infrastructure for each client. The rationale is straightforward: custom logging is a one-time services engagement; Datadog is a recurring platform commitment the client self-renews. For the integrator, the Datadog partnership generates more predictable downstream revenue than the bespoke build. That shift is pulling Datadog into deals it would not previously have seen at the proposal stage.
The third effect is on hiring. Three senior engineers from the LangChain and LlamaIndex ecosystems joined Datadog's AI product team in Q1 2024 — a signal that the company is pulling talent from the layer it is instrumenting. That is not unusual for a platform making a strategic bet, but the speed is notable. Datadog is not waiting to see whether LLM Observability takes hold before building the team; it is staffing ahead of confirmed demand, which is what a company does when it believes the demand is structural rather than experimental.
What to watch
Five forward indicators will determine whether the Datadog agent-layer thesis holds through 2024 and into 2025.
- LLM Observability attach rate in Q2 2024 renewal conversations. If the product is landing as a standalone sale rather than as an expansion line on existing contracts, the expansion motion thesis is not playing out as modelled. Watch the Datadog earnings call for any language around "platform revenue concentration" in AI accounts.
- Honeycomb's response. The most credible counter-move is a curated LLM instrumentation SDK with a Honeycomb-managed schema — something that removes the custom-field maintenance burden without requiring a platform change. If Honeycomb ships that before Q3, the technical-sophistication gap narrows and the Datadog "out of the box" argument weakens in engineering-led accounts.
- Bits AI query volume as a reported metric. Datadog has not committed to disclosing this, but if it appears in investor materials it signals the company believes the number supports the narrative. Absence in the Q1 2024 earnings materials would be notable.
- OpenTelemetry GenAI semantic conventions reaching stable status. The CNCF working group is drafting standard attributes for LLM spans — model name, token counts, prompt version. If those conventions stabilise in 2024, every vendor can instrument them automatically, and Datadog's head start in LLM-native instrumentation shrinks to a lead in UI and analytics rather than a lead in data collection.
- Regulated-vertical deal flow. Healthcare and financial services are Datadog's stated targets for the observability-for-agents motion. A publicly referenceable customer in either vertical before the end of 2024 would validate that the compliance-by-default argument is closing enterprise deals, not just opening conversations.
Is Bits AI a standalone product or a feature?
Datadog bills Bits AI as a product capability included in existing plans rather than a separate SKU, which means it does not generate independent revenue but does affect net revenue retention by increasing platform stickiness. The strategic logic is similar to AWS adding AI-native features to existing services: the feature deepens lock-in rather than expanding the total contract value directly.
Can smaller engineering teams justify the Datadog price point for LLM Observability alone?
At low LLM call volumes (under one million completions per month), the observability cost typically runs below $400 per month on a standard Datadog contract, which is defensible against the engineering time required to build equivalent instrumentation. At scale, the economics shift: high-volume inference workloads generate telemetry data at a rate that can materially move a Datadog bill. Teams above ten million completions per month should model data ingestion costs before committing.
How does Datadog handle multi-model agent chains where calls cross vendor boundaries?
The LLM Observability tracer propagates trace context across model calls regardless of vendor: an OpenAI embedding call, an Anthropic generation call, and a Cohere reranking call within the same chain appear as connected spans in a single trace. The caveat is that cost attribution requires the vendor to return token counts in the response, which all three do in their current APIs. If a vendor changes that behaviour, cost data in Datadog breaks for that leg of the chain.
What does Datadog's move mean for purpose-built LLM observability startups like Langfuse and Arize?
Datadog's entry compresses the addressable market for startups that were selling LLM observability as a standalone product to enterprises already running Datadog. The defensible position for those companies is depth (evaluation frameworks, human-feedback loops, prompt management) that Datadog has not built and has not announced. The risk is that Datadog builds or acquires those capabilities in the next twelve months. Langfuse, which is open-source, faces the lowest platform risk; Arize, which competes directly on the enterprise telemetry layer, faces the most.
Is Watchdog's semantic-drift detection reliable enough for production use?
Early customer reports indicate a false-positive rate that requires tuning, particularly for agents whose output length varies by design based on query complexity. Teams running Watchdog in production are setting custom baseline windows per endpoint rather than relying on the global baseline Datadog configures by default. That is a reasonable ask for a general-availability feature that is three months old. The signal-to-noise ratio should improve as the model accumulates more per-account history, but teams should expect a two-to-four week calibration period before alert thresholds stabilise.
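The per-endpoint baselining that teams describe can be sketched as a rolling window kept separately for each endpoint, with alerts suppressed until enough history has accumulated. The class name, window size, minimum sample count, and z-score threshold below are all assumptions for illustration and do not reflect how Watchdog is configured.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class EndpointBaselines:
    """Rolling per-endpoint baseline of output lengths (hypothetical helper)."""

    def __init__(self, window: int = 500, min_samples: int = 50, z_threshold: float = 3.0):
        self.min_samples = min_samples
        self.z_threshold = z_threshold
        # One bounded history per endpoint; old observations fall off the left.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, endpoint: str, output_tokens: int) -> bool:
        """Record one observation; return True if it is anomalous for its endpoint."""
        history = self._history[endpoint]
        anomalous = False
        if len(history) >= self.min_samples:  # stay silent during calibration
            mu, sigma = mean(history), pstdev(history)
            if sigma > 0:
                anomalous = abs(output_tokens - mu) / sigma > self.z_threshold
        # The observation joins the baseline whether or not it was flagged.
        history.append(output_tokens)
        return anomalous
```

The design point is that the same output length can be normal for one endpoint and anomalous for another: a 400-token response is unremarkable on an endpoint whose answers vary widely by design, and a clear outlier on one that has only ever produced about 100 tokens.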
Datadog has done this before: identified a category before it had a name, shipped instrumentation before the market knew it needed it, and built a platform moat before competitors finished debating whether the category was real. The agent observability category is real. The moat-building started in March 2024. The companies that will feel the full weight of that timing are the ones still deciding whether to instrument their agent systems at all.