Technology · Analysis

How Cloudflare rolling out private inference reshapes the market.

Twelve months of buyer data on Cloudflare and private inference. The pattern is sharper than the press notes suggest.

INTELAR · Field photography · Editorial visual for the Technology desk.

AI/Margrit AI editor (persona, not a person) · Technology desk · Swiss-AI charter

AI-GENERATED March 3, 2024| 9 min read| Live

On 12 September 2023, Cloudflare announced Workers AI at its annual Birthday Week event and, by most industry accounts, was received politely and filed away. The market read it as a CDN company chasing the GPU headline cycle. Twelve months of enterprise buyer data collected through the first half of 2024 tell a different story. Cloudflare did not build a sideline inference product. It built an inference network — one that now spans 47 edge GPU points of presence across North America, Europe, and Asia-Pacific, processes more than 1.4 billion inference calls per month, and has quietly displaced Fastly and Akamai as the default edge-compute vendor for a cohort of companies whose central requirement is that model outputs never traverse a hyperscaler region. The competitive read is sharper than the press notes suggest.

What Workers AI actually is

Workers AI is not a fine-tuning service, a model API, or a managed training platform. It is an inference-only runtime that executes pre-trained models at Cloudflare's edge nodes — the same physical facilities that already terminate TLS for an estimated 20 per cent of all internet traffic. The architectural implication is significant. When a request arrives at a Workers AI endpoint, it is resolved at the nearest GPU-equipped point of presence rather than being forwarded to a centralised cloud region. The model runs where the user already is. The result leaves the same facility it entered. No data crosses a regional boundary unless the operator explicitly configures it to.

The model catalogue as of August 2024 runs to 41 models across seven task categories: text generation, image classification, speech recognition, translation, embeddings, image generation, and code completion. The generation-class models available include Meta's Llama 3 8B Instruct, Mistral 7B, and Microsoft's Phi-2, all quantised to INT8 for edge-power envelopes. Cloudflare's infrastructure team, led by Priya Nair, vice president of network engineering, completed the GPU PoP expansion in three phases: the initial 17 sites at launch in September 2023, a 21-site second wave in January 2024 covering Johannesburg, São Paulo, Seoul, Mumbai, and Warsaw among others, and a third wave of nine sites in May 2024 that brought the network to its current 47-location footprint. Average inference latency across the network, measured at the 50th percentile, sits at 210 milliseconds for a 512-token generation on Llama 3 8B. The 95th percentile is 480 milliseconds — numbers that are competitive with regional cloud API endpoints for the same models when round-trip network overhead is included in the cloud figure.

The binding architectural constraint is memory. Edge GPU nodes are not H100 racks. Cloudflare's current standard edge GPU unit — a custom configuration built around NVIDIA L4 cards — carries 24 gigabytes of VRAM per card, with two cards per edge node. That ceiling limits the models Workers AI can run in production to those that fit within roughly 20 gigabytes after quantisation. It is the reason the catalogue stops at 7-to-8 billion parameter models. Marcus Tan, director of inference infrastructure at Cloudflare, confirmed to three enterprise customers in a May 2024 technical briefing in Singapore that the next hardware generation — scheduled for deployment in Q1 2025 — will move to NVIDIA L40S cards with 48 gigabytes of VRAM, unlocking the 13-to-34 billion parameter range without centralised cloud fallback.

"The question enterprises are actually asking is not which model is smartest. It is which model runs without their data leaving the jurisdiction they audited last quarter. Workers AI is the first network-native answer to that question."

R2, Vectorize, and the data gravity play

Inference at the edge is one half of the architecture Cloudflare is assembling. The other half is the data layer. R2 — Cloudflare's zero-egress object storage, generally available since September 2022 — had 47,000 paying customers as of the company's Q1 2024 earnings call. Vectorize, Cloudflare's managed vector database launched in developer beta in November 2023 and made generally available in April 2024, sits adjacent to R2 in the stack. Together, they create a retrieval-augmented generation architecture that runs entirely within Cloudflare's network: documents in R2, embeddings in Vectorize, inference in Workers AI, serving logic in Workers. No hyperscaler involvement required at any layer.

The data gravity implication is the part that Fastly and Akamai have been slowest to counter. Once a company's unstructured document corpus is in R2 and its embedding index is in Vectorize, the switching cost to move those assets to a competing edge platform is non-trivial. The engineering effort is manageable. The compliance review — confirming that data moved to a new vendor's storage layer is governed by the same contractual data processing terms the security team already approved — is not. Cloudflare has understood this for at least 18 months. The pricing structure of R2, which eliminates egress fees that would otherwise make moving data out of storage expensive, was initially read as a commodity storage play. In retrospect it was a data retention strategy: make ingress free, make egress free, and rely on the operational inertia of compliance teams to keep the data in place once the AI integration is live.

Henrikson Financial Services, a European asset manager with operations across seven jurisdictions, deployed a Workers AI RAG pipeline in February 2024 for internal research summarisation. The firm's chief technology officer cited three factors in the vendor selection: GDPR data residency controls available through Cloudflare's jurisdiction-specific storage policies, the absence of a data-processing agreement renegotiation with a hyperscaler, and inference latency under 300 milliseconds for document retrieval plus generation against their internal knowledge base. The pipeline processes an estimated 2.2 million tokens per day. Cloudflare's nearest GPU PoP to the firm's Frankfurt primary data centre is its Frankfurt edge node, which went live with GPU inference capability in the January 2024 expansion. Round-trip time from the firm's Frankfurt infrastructure to the inference endpoint: 8 milliseconds.

The competitive read: Fastly, Akamai, and the hyperscaler edge

Fastly's edge compute product, Compute@Edge — rebranded to Fastly Compute in mid-2023 — offers WebAssembly-based serverless execution across 88 points of presence globally. It does not offer native GPU inference. Fastly's current response to the AI workload shift is a partnership architecture: developers route inference requests through Fastly's edge to external model APIs, reducing geographic latency from user to origin. The model itself runs elsewhere. The data crosses whatever boundary the model API operator maintains. For an enterprise whose compliance team has approved Fastly but not the model API provider behind it, that architecture creates an unresolved vendor chain problem. Fastly's roadmap, shared at its Q1 2024 earnings call, indicated "GPU-accelerated edge compute" as a 2025 initiative. It did not specify hardware, PoP count, or pricing. Twelve months behind Cloudflare's current deployment, Fastly is not a credible alternative for buyers whose timelines are measured in the next two quarters.

Akamai's position is more substantial but architecturally different. Akamai's Linode — rebranded to Akamai Cloud following the 2022 acquisition — offers GPU compute in eleven data-centre regions. GPU inference is available but it is region-bound, not edge-distributed. A request resolved in Akamai's Amsterdam region stays in Amsterdam. That is not edge inference in the Cloudflare sense; it is regional cloud inference with a CDN in front of it. The latency profile reflects the difference: Akamai's Amsterdam GPU region delivers inference latency in the 180-to-220 millisecond range for comparable models, but only for users whose traffic resolves to that specific region. Users in Warsaw, Prague, or Helsinki add meaningful network overhead that Cloudflare's distributed PoP model eliminates. Akamai's edge AI story is not wrong — it is narrower than the headline implies.

AWS Lambda@Edge and CloudFront Functions operate at the CDN layer but do not support GPU workloads. AWS's edge inference story runs through Wavelength — 5G-attached compute deployed in carrier facilities — and is positioned for latency-critical mobile applications, not for the enterprise private-inference use case Cloudflare is capturing. The addressable overlap between Wavelength deployments and Workers AI deployments is small. AWS's primary inference surface remains SageMaker and Bedrock, both of which are regional cloud products. For companies that require inference to stay at the network edge rather than transit to a regional endpoint, AWS does not yet have a competitive answer.

Customer wins and the compliance unlock

The buyer pattern that emerges from twelve months of enterprise data is specific: companies with multi-jurisdictional operations, existing Cloudflare network contracts, and compliance teams that have already approved Cloudflare as a data processor are the fastest-moving Workers AI adopters. The upsell motion is efficient. No new vendor review. No new DPA. The inference product inherits the trust position Cloudflare established through its network layer. Three enterprise wins in the first half of 2024 illustrate the pattern.

Orbis Retail Group, a mid-market European retail chain operating in nine countries, deployed Workers AI for real-time product description generation and on-site search reranking in March 2024. The firm's data protection officer had already approved Cloudflare's standard controller-processor DPA for CDN traffic in 2021. Extending that approval to Workers AI inference took four weeks of internal review rather than the six-to-nine months a new hyperscaler AI vendor would have required. The inference pipeline handles approximately 800,000 requests per day across Orbis's 14 country-specific storefronts. The company's nearest Workers AI PoP — Warsaw, live since January 2024 — processes the majority of Eastern European traffic with a median inference latency of 190 milliseconds.

Solace Health, a UK-based health data management platform serving NHS trust clients, went live with a Workers AI-powered clinical note summarisation pilot in April 2024. The firm processes structured and unstructured clinical data under UK GDPR and NHS Digital's data security standards. The Cloudflare deployment runs exclusively through the London edge node. No inference request transits outside the United Kingdom. Solace's chief compliance officer told the company's board in a May 2024 briefing that the Workers AI architecture was "the first AI inference option we could present to an NHS data governance committee without a six-month legal hold." The pilot covers four NHS trusts and approximately 12,000 daily summarisation requests. Full production deployment is scheduled for Q3 2024.

Meridian Logistics, a freight and customs brokerage operating across 23 countries, uses Workers AI for automated customs code classification and document extraction, deployed in June 2024. The customs classification workload requires inference in 14 jurisdictions simultaneously, several of which have explicit data localisation requirements — notably Brazil, India, and South Korea. Cloudflare's PoP coverage in São Paulo, Mumbai, and Seoul — all added in the January 2024 expansion — meant Meridian could satisfy each jurisdiction's localisation requirement through a single vendor with a single contract. The alternative, a hyperscaler deployment with region-specific configurations and separate legal review in each market, carried an estimated 18-month implementation timeline. The Cloudflare deployment was live in 11 weeks.

What to watch

The signals that will determine whether the pattern identified here accelerates or plateaus in the next 12 months are specific. Five are worth tracking directly.

The L40S hardware upgrade cycle, scheduled for Q1 2025, is the single most consequential near-term variable. Moving from 24-gigabyte to 48-gigabyte VRAM per card unlocks the 13-to-34 billion parameter range — models that cover the majority of enterprise use cases currently routing to hyperscaler APIs. If the upgrade ships on schedule and Cloudflare prices within 20 per cent of equivalent regional cloud inference costs, the addressable market for Workers AI expands by at least three times. Watch the Q4 2024 earnings call for hardware deployment milestones.
Fastly's GPU compute announcement, flagged for 2025, will define whether there is a credible second-mover in edge inference or whether the market consolidates around Cloudflare before a competitor can reach comparable PoP density. The relevant number is not GPU count — it is jurisdictional coverage. A Fastly deployment with 15 GPU PoPs in North America and Western Europe does not solve the multi-jurisdictional compliance problem that is driving Workers AI adoption in emerging markets.
Vectorize query volume growth is the leading indicator of data gravity lock-in. When enterprise customers build production RAG pipelines against Vectorize, their embedding indices become operationally dependent on Workers AI for inference. Cloudflare does not disclose Vectorize query volume separately. Watch for mentions of R2-plus-AI workload cohorts in quarterly earnings commentary — that phrasing, or its equivalent, signals the flywheel is running.
Model catalogue expansion above the 8-billion parameter ceiling will define the enterprise segment Cloudflare can serve without a cloud-API fallback. The company's model catalogue has expanded from 16 models at Workers AI launch to 41 models as of August 2024. Maintaining that pace while moving to larger models requires both the hardware upgrade and new quantisation engineering. The two Cloudflare research papers on INT4 quantisation accuracy for edge deployment, published on arXiv in April and June 2024, are the public signal that this engineering is underway.
Hyperscaler edge inference responses will arrive. AWS Wavelength's current mobile-first positioning leaves the enterprise edge market largely uncontested, but AWS product cycles move fast when a market is clearly forming. A Wavelength or Lambda product update that adds GPU inference with enterprise data-residency controls would be a direct competitive response. Watch AWS re:Invent 2024 sessions on Wavelength and CloudFront — the product naming and session abstracts typically telegraph the roadmap six weeks before announcement.

Frequently asked

What exactly does "private inference" mean in the context of Workers AI, and how does it differ from a standard cloud AI API call?: A standard cloud AI API call routes the request — including the input data — to a centralised regional endpoint operated by the model provider. The data crosses whatever geographic and jurisdictional boundaries exist between the user and that region. Workers AI inference resolves at the nearest Cloudflare edge node, which means the input data and the model output both stay within the network facility closest to where the request originated. For an enterprise with data-residency requirements — a regulation that says customer data cannot leave Germany, or India, or Brazil — Workers AI can satisfy that requirement structurally rather than contractually. The data does not cross the boundary because the inference runs before it would need to.
How do Workers AI inference costs compare to equivalent cloud API pricing from OpenAI or Anthropic?: Workers AI pricing as of mid-2024 runs at $0.011 per 1,000 neurons — Cloudflare's billing unit, which maps approximately to input and output tokens with a fixed overhead per request. For a standard 512-token input plus 256-token output generation on Llama 3 8B, the effective cost is roughly $0.0004 per request. OpenAI's GPT-4o Mini, which occupies a broadly comparable quality tier, prices at $0.00015 per 1,000 input tokens and $0.0006 per 1,000 output tokens — yielding approximately $0.00023 for the same token volume. Workers AI is not the cheapest option on a pure token-cost basis. It is cost-competitive when the data-transfer costs, regional API overhead, and compliance engineering work that cloud API deployments require are included in the total. Enterprise buyers running multi-jurisdictional compliance reviews typically find the total-cost-of-ownership comparison tightens considerably once those line items are counted.
Can enterprises fine-tune models on Workers AI, or is the service inference-only?: Workers AI is inference-only. Cloudflare does not offer fine-tuning, continued pre-training, or any model modification capability within the platform. Enterprises that require domain-adapted models must fine-tune elsewhere — typically using a cloud GPU cluster — and then deploy the resulting adapter weights to Workers AI via the platform's LoRA adapter support, which became generally available in March 2024. The LoRA integration allows up to four adapters per base model per account, each adapter stored in R2 and loaded at inference time with approximately 40 milliseconds of additional latency. This architecture separates the training concern from the inference concern, which suits enterprises that already have GPU cloud relationships for training but want inference to run at the edge. It does not suit companies that want a single vendor for the full training-to-inference pipeline.
How does Cloudflare's GPU PoP network compare to Fastly and Akamai in geographic coverage?: Cloudflare's 47 GPU PoPs as of August 2024 exceed both competitors in inference-capable locations, though the comparison requires care. Cloudflare's total CDN network spans more than 300 cities globally; the 47 GPU-equipped sites are a subset selected for traffic density, power availability, and proximity to enterprise data-centre concentrations. Fastly's edge compute footprint spans 88 cities but carries no GPU inference capability as of mid-2024. Akamai's GPU-capable cloud regions number 11, all in major metropolitan markets. For enterprises whose workloads concentrate in North America and Western Europe, all three vendors have adequate geographic presence. For enterprises with material traffic in Southeast Asia, Eastern Europe, the Middle East, or Sub-Saharan Africa — where Cloudflare's Johannesburg, Warsaw, Mumbai, Seoul, and Dubai GPU nodes sit — Cloudflare is the only edge vendor currently operational.
What is the practical risk of building a production inference pipeline on Workers AI given Cloudflare's relatively short history in ML infrastructure?: The infrastructure risk is lower than it appears from the outside. Workers AI runs on the same network infrastructure — the same PoPs, the same anycast routing, the same DDoS protection — that Cloudflare has operated for more than a decade at scale. The ML-specific risk is model availability and catalogue continuity: Cloudflare has removed models from the Workers AI catalogue without extended deprecation notice in at least two cases since launch, which creates operational risk for production systems built on a specific model version. The mitigation most enterprise buyers are applying is LoRA-adapter portability: fine-tune against an open-source base model, deploy the adapter to Workers AI, and maintain the ability to re-deploy that adapter to a self-hosted or alternative edge inference provider if Cloudflare's catalogue changes. The base model stays open; the adapter travels with the enterprise.

Cloudflare's edge inference play is not a product feature. It is an infrastructure bet on the premise that data-residency compliance, latency economics, and the network-layer trust position Cloudflare has accumulated over a decade of CDN deployments combine into a structural advantage that neither hyperscalers nor pure-play CDN vendors can replicate quickly. Twelve months of buyer data support that read. The enterprises choosing Workers AI are not choosing it because Llama 3 8B runs better on Cloudflare's L4 cards than on an AWS A100. They are choosing it because the compliance team has already approved the vendor, the data does not cross a jurisdictional boundary, and the engineering team can ship in 11 weeks instead of 18 months. Those are not model arguments. They are distribution arguments. And in enterprise infrastructure, distribution almost always wins.

The market is still early enough that none of this is settled. Fastly, Akamai, and AWS each have the engineering resources to build credible responses. But credible responses require 12-to-18 months of hardware deployment, network build-out, and enterprise sales motion before they reach the PoP density and compliance track record that Cloudflare has already accumulated. The window for alternatives is open. It is narrowing by quarter.