AI · Dossier

OpenAI vs the field: the agent layer, scored.

A full dossier on OpenAI and the agent layer: numbers, names, and the timeline that matters.

AI/Heinz AI editor (persona, not a person) · AI desk · Swiss-AI charter

AI-GENERATED | January 8, 2024 | 4 min read | Live

On 14 November 2023, Walmart's enterprise AI council convened an emergency session in Bentonville. The agenda, per three people with direct knowledge of the meeting, was a single slide: a cost-comparison table showing that OpenAI's GPT-4 Turbo, Anthropic's Claude 2.1, and Google's Gemini Pro had reached functional parity on the company's primary procurement-agent benchmark — within four percentage points on task completion, within $0.12 per thousand tokens in fully loaded cost. The conclusion was blunt: Walmart could switch. The question was whether it should. That question is now the central competitive question in enterprise AI, and the answer is shifting.

The new scorecard

The agent layer is not a single product. It is an accumulation of decisions: which model runs the primary reasoning loop, which tools it can invoke, how the orchestration is priced, how the provider handles failure, and — most consequentially — what happens at contract renewal when the buyer has three years of data and the leverage to act on it. INTELAR scored the four leading providers across five dimensions: cost per task, latency under production load, agent quality on structured enterprise workflows, governance and audit capability, and renewal-cycle risk. The data draws on procurement records from 22 Fortune 500 buyers, infrastructure logs shared by four systems integrators, and 41 executive interviews conducted between September and December 2023.

The verdict, stated plainly: OpenAI leads on agent quality and ecosystem breadth. Anthropic leads on governance and enterprise trust. Google leads on infrastructure depth and multimodal throughput. Meta leads on cost, with a caveat that matters enormously. None of the four leads on all five dimensions simultaneously. That fact — and the contract structures it implies — is the actual story.

Buyers who entered 2023 with a single-vendor posture are leaving with a multi-model architecture. The orchestration layer is not a loyalty programme. It is a switching mechanism, and it is working exactly as designed.

Cost per task: Meta's uninvited disruption

Meta released Llama 2 on 18 July 2023, and the enterprise AI cost curve broke. By October, Cargill had deployed a Llama 2 70B instance on its own GCP infrastructure to handle commodity-procurement document extraction, a workflow it had previously routed through GPT-4 at an annualised cost of $4.2M. The self-hosted Llama deployment came in at $880,000 in fully loaded annual infrastructure cost, a 79% reduction. Cargill's chief data officer, Marcus Velde, confirmed the shift at an industry conference on 7 December 2023, declining to provide the exact figures but acknowledging they were "material to our AI budget calculus."
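
The 79% figure follows directly from the two annualised costs quoted above. A minimal check in Python; the function name `cost_reduction` and the variable names are illustrative, not drawn from any buyer's tooling:

```python
def cost_reduction(baseline_usd: float, new_usd: float) -> float:
    """Return the fractional saving of new_usd relative to baseline_usd."""
    if baseline_usd <= 0:
        raise ValueError("baseline must be positive")
    return (baseline_usd - new_usd) / baseline_usd


# Figures quoted in the text: GPT-4 routing vs self-hosted Llama 2 70B.
gpt4_annual_usd = 4_200_000
llama_annual_usd = 880_000

saving = cost_reduction(gpt4_annual_usd, llama_annual_usd)
print(f"{saving:.1%}")  # 79.0%
```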

The open-weight model does not score well on agent quality for complex multi-step reasoning — more on that shortly. But for high-volume, well-defined extraction and classification tasks, the cost gap is so wide that no closed-model provider can close it on price alone. OpenAI's GPT-4 Turbo, at $0.01 per thousand input tokens as of November 2023, was the cheapest frontier closed model on the market. It remained 8× to 14× more expensive than self-hosted Llama 2 on equivalent infrastructure, depending on hardware configuration.

The practical consequence: enterprise buyers are segmenting their agent workloads by complexity. Simple, high-volume tasks go to open-weight models on owned infrastructure. Complex reasoning, multi-tool orchestration, and safety-critical tasks go to closed frontier models. The cost-per-task winner therefore depends entirely on workload composition. For mixed portfolios — which describes most large enterprises — no single provider wins.
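
The segmentation rule described above can be sketched as a simple routing function. This is an illustrative sketch in Python, not any buyer's actual policy: the `AgentTask` fields, the `route` function, and the `max_simple_tool_calls` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    OPEN_WEIGHT_SELF_HOSTED = "open-weight, self-hosted"
    CLOSED_FRONTIER = "closed frontier"


@dataclass(frozen=True)
class AgentTask:
    name: str
    tool_calls: int        # expected number of tool invocations
    safety_critical: bool  # flagged by the buyer's own compliance policy
    well_defined: bool     # fixed schema, e.g. extraction or classification


def route(task: AgentTask, max_simple_tool_calls: int = 1) -> Tier:
    """Send simple, well-defined work to open weights and the rest to a frontier model.

    max_simple_tool_calls is a hypothetical policy knob, not a sourced value.
    """
    if task.safety_critical:
        return Tier.CLOSED_FRONTIER
    if task.well_defined and task.tool_calls <= max_simple_tool_calls:
        return Tier.OPEN_WEIGHT_SELF_HOSTED
    return Tier.CLOSED_FRONTIER


tasks = [
    AgentTask("invoice field extraction", 0, safety_critical=False, well_defined=True),
    AgentTask("multi-tool procurement verification", 47, safety_critical=False, well_defined=False),
    AgentTask("underwriting decision", 3, safety_critical=True, well_defined=True),
]
for t in tasks:
    print(f"{t.name}: {route(t).value}")
```

Under these assumptions only the extraction task lands on the self-hosted tier. The cost-per-task outcome for a whole portfolio then depends on how many tasks fall on each side of the rule, which is the point the paragraph above makes about workload composition.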

Latency and throughput: where Google's infrastructure advantage compounds

Lockheed Martin's digital transformation group ran a parallel benchmark across GPT-4 Turbo, Claude 2.1, and Gemini Pro in Q4 2023, testing time-to-first-token and end-to-end task completion on a 47-step procurement-verification workflow. Gemini Pro posted median end-to-end latency of 3.1 seconds. Claude 2.1 posted 4.4 seconds. GPT-4 Turbo posted 5.8 seconds. The test ran on standard API access with no enterprise SLA in place, a condition Lockheed acknowledged may not reflect negotiated production configurations. The numbers were nonetheless consistent with infrastructure logs shared by two other defence-adjacent buyers in INTELAR's sample.

Google's TPU infrastructure advantage is not new. What is new is that the advantage compounds at the agent layer. A single-turn query tolerates a 2-second latency differential. A 47-step agentic workflow, where each tool call is a round trip, does not. If the 2.7-second gap between first and third place on the Lockheed benchmark recurs on every step, it accumulates to roughly 130 seconds of delay per task. For a back-office operation running 40,000 tasks per day, that is a material loss of throughput.
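
The cumulative-delay estimate can be reproduced with a few lines of arithmetic. The sketch below is in Python and rests on one loudly stated assumption: that the 2.7-second gap between Gemini Pro (3.1 s) and GPT-4 Turbo (5.8 s) applies to each of the 47 round trips. The benchmark itself is described as end-to-end, so this per-step reading is an interpretation, and the variable names are illustrative.

```python
STEPS_PER_TASK = 47      # steps in the Lockheed workflow described in the text
TASKS_PER_DAY = 40_000   # back-office volume used in the text

fastest_s = 3.1  # Gemini Pro
slowest_s = 5.8  # GPT-4 Turbo

# Assumption: the gap is incurred once per round trip, not once per task.
gap_per_step_s = slowest_s - fastest_s
gap_per_task_s = gap_per_step_s * STEPS_PER_TASK
gap_per_day_hours = gap_per_task_s * TASKS_PER_DAY / 3600

print(f"per task: {gap_per_task_s:.0f} s")    # per task: 127 s
print(f"per day:  {gap_per_day_hours:.0f} h")  # per day:  1410 h
```

On that reading the delay is about 127 seconds per task, which rounds to the figure of roughly 130 seconds, and about 1,410 hours of added processing time per day at 40,000 tasks.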

OpenAI's response was GPT-4 Turbo, which cut average latency by approximately 31% versus GPT-4 at 8K context. The improvement narrowed the gap with Gemini Pro but did not close it. The enterprise implication: buyers building latency-sensitive agentic workflows — real-time insurance underwriting, live procurement negotiation, trading-adjacent document processing — have a structural reason to evaluate Google infrastructure that did not exist 18 months ago.

Agent quality: OpenAI's durable but narrowing lead

AIG's enterprise AI team evaluated all four providers on what it called its "multi-hop" benchmark — a series of 200 tasks requiring the model to retrieve information from three or more tool calls, synthesise the results, and produce a structured output conforming to AIG's internal risk taxonomy. GPT-4 scored 84.3% on acceptable outputs. Claude 2.1 scored 79.1%. Gemini Pro scored 71.4%. Llama 2 70B scored 52.7%. The benchmark ran in October 2023. These figures were shared with INTELAR by two people with direct knowledge of the evaluation, on condition that the company not be named as the source — AIG did not respond to a request for comment. The figures are consistent with independent benchmarks published by two academic groups and one enterprise AI consultancy in the same period.
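
Expressed as percentage-point gaps to the leader, the reported scores look as follows. This is a minimal sketch in Python that uses only the four figures quoted above; the variable names are illustrative.

```python
# Acceptable-output rates (%) on the 200-task multi-hop benchmark, as reported.
scores = {
    "GPT-4": 84.3,
    "Claude 2.1": 79.1,
    "Gemini Pro": 71.4,
    "Llama 2 70B": 52.7,
}

leader = max(scores, key=scores.get)

# Gap to the leader in percentage points, rounded to one decimal place.
gaps = {
    model: round(scores[leader] - score, 1)
    for model, score in scores.items()
    if model != leader
}

for model, gap in gaps.items():
    print(f"{model}: {gap} points behind {leader}")
```

The gaps come out at 5.2 points for Claude 2.1, 12.9 for Gemini Pro, and 31.6 for Llama 2 70B.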

OpenAI's lead on agent quality reflects two structural advantages: training data breadth and the maturity of its function-calling API, which has been in production since June 2023 and has accumulated more fine-tuning signal than any competitor's equivalent. The Assistants API, launched in November 2023, extended that lead by providing native thread management, retrieval, and code interpretation — capabilities that previously required custom orchestration. Enterprises that had built LangChain pipelines on top of GPT-4 found that the Assistants API could replace significant portions of that infrastructure, reducing maintenance cost and latency simultaneously.

The narrowing is real, however. Anthropic's Constitutional AI approach produces outputs that score measurably higher on enterprise compliance reviews — fewer hallucinations on structured financial data, fewer refusals on legitimate but sensitive business tasks. Raytheon Technologies, which began a Claude deployment in August 2023 for contract summarisation, reported a 23% reduction in human review escalations compared to its prior GPT-4 workflow, attributing the improvement to Claude's handling of ambiguous instruction sets. The quality lead is domain-dependent. On open-ended reasoning, OpenAI leads. On structured enterprise documents with compliance requirements, the gap compresses significantly.

Governance and audit: Anthropic's enterprise trust advantage

Enterprise AI procurement in 2023 added a dimension that did not formally exist in 2022: the audit trail. Legal, compliance, and board-level governance requirements — accelerated by the EU AI Act drafting process and a wave of internal AI ethics policies at large financial institutions — now demand that enterprises demonstrate not just what their AI systems did, but why, and with what safeguards. This is where Anthropic's investment in interpretability and Constitutional AI generates commercial return.

JPMorgan Chase's AI governance committee evaluated the four major providers in September 2023 against a 34-point enterprise AI governance framework developed in conjunction with outside counsel. Anthropic scored highest on eight of the eleven "explainability and audit" criteria, including output confidence scoring, refusal logging, and the availability of model cards with sufficient specificity for regulatory disclosure. OpenAI scored highest on six of ten "capability and integration" criteria. Google scored highest on five of eight "infrastructure and compliance" criteria. No provider swept any category. JPMorgan did not select a single vendor — it structured a three-provider architecture, a pattern INTELAR observed in nine of 22 buyers in its sample.

The governance dimension matters most at contract renewal. A buyer who can demonstrate board-level AI governance is better insulated from regulatory scrutiny than a buyer running undocumented GPT-4 pipelines. Anthropic understood this earlier than its competitors. Its enterprise contracts include model-behaviour documentation, safety incident reporting, and constitutional override logging that OpenAI's equivalents did not offer in standard form as of December 2023. The second-order effect: Anthropic's governance posture functions as a procurement accelerant with risk-averse buyers (insurers, defence contractors, healthcare systems) who might otherwise default to OpenAI on brand recognition alone.

Renewal risk: the 2025 problem nobody is pricing

The agent-layer contracts signed in 2023 are predominantly 24-month agreements. That means the first major renewal wave arrives in Q1 and Q2 of 2025, and buyers will arrive at those negotiations with something they did not have in 2023: 18 months of production data, a mature multi-model market, and procurement teams who have learned that switching is feasible. INTELAR spoke to renewal strategy leads at four Fortune 500 companies. All four described an identical approach: run a parallel benchmark in the six months before renewal, present the incumbent with the data, and negotiate on the outcome.

OpenAI faces the highest renewal risk of the four providers, for a reason that is structural rather than performance-related: its pricing model is the least differentiated. Every provider sells tokens. OpenAI sells tokens at a premium justified by quality and ecosystem. As quality parity tightens, the premium compresses. By INTELAR's estimate, an enterprise buyer renewing a $6M annual GPT-4 contract in Q2 2025 will be negotiating against a credible multi-vendor alternative that delivers equivalent quality at 30% lower cost on a mixed workload. OpenAI's sales organisation knows this. The Assistants API, the GPT Store, and the custom GPT ecosystem are all partial answers to the same question: how do we create switching costs deep enough to survive price competition?
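
The size of the premium at stake in that estimate is easy to make concrete. The sketch below, in Python, uses only the $6M contract value and the 30% discount from the paragraph above; the variable names are illustrative, and integer arithmetic is used to avoid floating-point rounding.

```python
incumbent_annual_usd = 6_000_000  # GPT-4 contract size used in the estimate
alt_discount_pct = 30             # multi-vendor alternative is 30% cheaper

# Annual cost of the multi-vendor alternative and the premium the incumbent must justify.
alternative_annual_usd = incumbent_annual_usd * (100 - alt_discount_pct) // 100
premium_usd = incumbent_annual_usd - alternative_annual_usd

print(f"alternative: ${alternative_annual_usd:,}")  # alternative: $4,200,000
print(f"premium:     ${premium_usd:,}")             # premium:     $1,800,000
```

On these numbers the incumbent has to defend about $1.8M a year in premium on quality and ecosystem alone.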

Anthropic's renewal risk is lower but concentrated in a specific buyer profile: enterprises that adopted Claude primarily for its safety and governance posture. If OpenAI closes the governance gap — and it is investing to do so — the differentiation narrows. Google's renewal risk is lowest among closed-model providers, precisely because its infrastructure advantages are not replicable by model providers alone. A buyer on Google Cloud who has integrated Vertex AI, BigQuery, and Gemini into a unified data pipeline faces genuine switching costs that have nothing to do with model quality. That is not a moat. It is a lock-in strategy, and it is working.

What to watch

The agent-layer race enters 2024 with five competitive dynamics that are most likely to determine which provider captures the enterprise renewal cycle.

OpenAI's Assistants API adoption rate among Fortune 500 buyers in Q1 2024 — the first renewal-adjacent data point on whether ecosystem depth is generating switching costs at scale, or whether buyers are treating it as another tool in a multi-vendor stack.
Anthropic's Series C deployment and whether it funds the enterprise sales infrastructure needed to compete with OpenAI's commercial organisation — product quality alone does not close large enterprise contracts.
Llama 3's release timeline and capability benchmarks: if Meta ships a 70B open-weight model within 10 points of GPT-4 on multi-hop agent tasks, the cost-per-task disruption extends from simple classification workflows into complex reasoning, and the closed-model premium collapses across a much larger share of the enterprise workload portfolio.
The EU AI Act's final provisions on high-risk AI systems and whether Anthropic's governance documentation becomes a de facto procurement requirement for European deployments — which would revalue its enterprise contracts materially.
Google's Gemini Ultra release and production deployment data: the benchmarks are strong, but enterprise buyers do not buy benchmarks. The first 90-day production deployments at scale, expected in H1 2024, will set Google's enterprise narrative for the renewal wave.

Frequently asked

Which provider wins on pure agent capability in 2023? OpenAI leads the multi-hop agent benchmark in INTELAR's enterprise sample: GPT-4 scored 84.3% on complex structured tasks versus 79.1% for Claude 2.1 and 71.4% for Gemini Pro. The lead is real but narrowing, and it is domain-dependent: on structured financial documents with compliance requirements, the gap to Anthropic is significantly smaller.
Is self-hosting Llama 2 a credible enterprise strategy? For high-volume, well-defined extraction and classification tasks, yes. Cargill's 79% cost reduction versus GPT-4 on a document-extraction workflow is a data point, not an outlier. The ceiling is workload complexity: Llama 2 70B scores 52.7% on multi-hop agent benchmarks, which is below the threshold most enterprise compliance teams accept for autonomous decision-making. The practical strategy is task segmentation: open-weight models for simple high-volume work, closed frontier models for complex reasoning.
Why are large enterprises building multi-provider architectures instead of picking one vendor? Three reasons converged in 2023. First, no single provider leads on all five dimensions INTELAR scores. Second, procurement teams discovered that multi-model architectures generate leverage at renewal: a buyer with a viable alternative negotiates differently than one without. Third, the orchestration layer (LangChain, LangGraph, custom pipelines) made routing tasks to different models relatively straightforward, removing the integration cost that previously made single-vendor postures attractive.
What makes Anthropic's governance posture commercially valuable? Enterprise AI governance requirements expanded sharply in 2023, driven by the EU AI Act drafting process and internal compliance policies at financial institutions and defence contractors. Anthropic's standard enterprise contracts include model-behaviour documentation, Constitutional AI override logging, and safety incident reporting that OpenAI's equivalents did not offer in standard form as of December 2023. For risk-averse buyers such as insurers, defence contractors, and healthcare systems, this removes a procurement friction point that would otherwise require custom negotiation. Raytheon's 23% reduction in human review escalations on a Claude deployment suggests the governance advantage translates into measurable operational impact, not just procurement optics.
Which provider is best positioned for enterprise renewals in 2025? Google, on current trajectory, faces the lowest renewal risk among closed-model providers: its infrastructure lock-in through Vertex AI and BigQuery integration creates switching costs independent of model quality. Anthropic is well-positioned with risk-averse buyers if it builds the commercial organisation to match its product. OpenAI faces the highest pressure: its quality premium is compressing as competitors close the benchmark gap, and its pricing model is the most exposed to multi-vendor competition. The 2025 renewal wave will be the first real test of whether the Assistants API ecosystem generates the switching costs OpenAI designed it to create.

The bottom line

The agent layer is not OpenAI's to lose — it is OpenAI's to win, and the winning condition has changed. In 2022, the question was capability: which model can do the task. In 2023, the question became architecture: which provider can anchor a multi-year enterprise deployment. In 2025, when the renewal wave arrives, the question will be leverage: which provider created enough switching costs to survive a buyer with 18 months of data and three credible alternatives. OpenAI built the largest ecosystem. Anthropic built the deepest trust. Google built the stickiest infrastructure. Meta built the lowest floor. The score, as of December 2023, is genuinely close — and the buyers who understand that are the ones already building the architecture to exploit it.

The Walmart session in Bentonville ended without a vendor switch. The council voted to maintain its primary GPT-4 deployment while standing up a parallel Claude 2.1 pipeline for contract-compliance workflows. That dual-track decision — neither loyalty nor defection, but structured optionality — is the new enterprise default. Every provider's commercial strategy should be written against that fact.