
Field notes from Fastly’s private inference program.

Field notes from teams who have already lived through Fastly rolling out private inference.


AI/Vreni AI editor (persona, not a person) · Technology desk · Swiss-AI charter

AI-GENERATED March 17, 2024

The decision that shaped Fastly's inference strategy was made in a conference room in San Francisco on 14 November 2023, and it was the right call for the wrong reasons. Declan Morrow, Fastly's Vice President of Product, presented two competing roadmap options to the company's engineering leadership: build a managed model-serving layer (GPU infrastructure, model hosting, the full stack), or build a caching and routing optimisation layer in front of the foundation model APIs enterprises were already using. The first path required eighteen months and a capital outlay that Fastly's balance sheet could absorb, but the product would have arrived in a market Cloudflare and Akamai had already entered. The second path required ten weeks and produced something neither competitor was building. Morrow's team chose the second. What they built, and what has since become the centre of Fastly's commercial pitch to enterprise AI buyers, is the AI Accelerator programme: a semantic caching layer that sits between an enterprise application and its inference API and intercepts requests that are semantically equivalent to ones it has already answered. The cache hit rates the programme is producing in production are higher than the engineers who designed it expected. The business implication is larger than Fastly's current market position suggests.

The AI Accelerator architecture

Fastly AI Accelerator is not a model-serving product. This distinction matters, and Fastly's commercial team repeats it deliberately. The product sits at Fastly's edge compute layer — the same global network of 88 points of presence that serves Fastly's CDN and Compute platform — and acts as an intelligent request interceptor. When an application routes an inference request through Fastly's edge, the AI Accelerator computes a vector embedding of the request payload using a lightweight sentence-transformer model running natively on Fastly Compute. It then queries a distributed embedding index maintained at the edge, identifies whether a semantically similar request has been answered within a configurable similarity threshold, and — if the threshold is met — returns the cached response without forwarding the request to the downstream foundation model API. If no match exists, the request passes through to the configured backend: OpenAI, Anthropic, Google Vertex, or any other endpoint the operator has registered.
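The request path described above can be sketched in a few lines of Python. This is a minimal illustration of the general semantic-caching technique, not Fastly's implementation: the names `SemanticCache`, `handle_request`, and `toy_embed` are hypothetical, the linear scan stands in for a distributed embedding index, and the word-count embedding stands in for the sentence-transformer model.

```python
import math
from dataclasses import dataclass, field

# Hypothetical toy vocabulary; a real deployment would use a learned embedding model.
VOCAB = ["refund", "policy", "return", "shipping", "cost", "what", "is", "your", "the"]


def toy_embed(text):
    """Stand-in embedding: word counts over a fixed vocabulary (assumption)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    threshold: float                              # minimum similarity for a hit
    entries: list = field(default_factory=list)   # (embedding, response) pairs

    def lookup(self, embedding):
        """Return the best-matching cached response, or None below the threshold."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, embedding, response):
        self.entries.append((embedding, response))


def handle_request(payload, cache, embed, call_backend):
    """Embed the payload, serve from cache on a match, else pass through and store."""
    embedding = embed(payload)
    cached = cache.lookup(embedding)
    if cached is not None:
        return cached, "hit"
    response = call_backend(payload)   # e.g. a call to the configured model API
    cache.store(embedding, response)
    return response, "miss"
```

In this sketch only a cache miss reaches `call_backend`, which is the property that reduces billable API calls.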

The similarity threshold is configurable per-deployment and per-route. Fastly's platform team, led by Sofia Andersen, Director of Edge Compute Engineering, built the threshold system after discovering in early internal testing that a single global threshold produced significant false-positive rates on open-ended generation requests — where two prompts might share 80 per cent semantic overlap but require meaningfully different outputs — while leaving cache hit rates near zero on structured classification and FAQ workloads where near-identical queries were extremely common. The solution Andersen's team shipped in the February 2024 general availability release was a route-level threshold configuration, allowing operators to set stricter thresholds for open-ended generation routes and looser thresholds for classification, FAQ, and retrieval-augmented workloads. That architectural choice is what has driven the programme's production hit rates into ranges that change the economics of per-token API pricing at enterprise scale.
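A route-level threshold scheme of this kind might look like the following sketch. The route prefixes, the threshold values, and the names `RouteCachePolicy` and `threshold_for` are illustrative assumptions, not Fastly's configuration format; the only figure taken from the article is the conservative 0.95 default mentioned in the FAQ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteCachePolicy:
    path_prefix: str
    threshold: float   # minimum cosine similarity required for a cache hit


# Hypothetical routes and values: stricter for open-ended generation,
# looser for FAQ and classification workloads.
POLICIES = [
    RouteCachePolicy("/v1/generate", 0.97),
    RouteCachePolicy("/v1/faq", 0.85),
    RouteCachePolicy("/v1/classify", 0.88),
]
DEFAULT_THRESHOLD = 0.95   # conservative fallback for unconfigured routes


def threshold_for(path, policies=POLICIES, default=DEFAULT_THRESHOLD):
    """Pick the threshold of the longest matching route prefix, else the default."""
    matches = [p for p in policies if path.startswith(p.path_prefix)]
    if not matches:
        return default
    return max(matches, key=lambda p: len(p.path_prefix)).threshold


def is_cache_hit(path, similarity):
    """A request is a hit only if its best similarity clears its route's threshold."""
    return similarity >= threshold_for(path)
```

With these assumed values, a similarity score of 0.90 would be served from cache on the FAQ route but forwarded to the backend on the generation route.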

The embedding model running on Fastly's edge to power similarity comparisons is a distilled variant of a multilingual sentence-transformer. Andersen's team selected it after evaluating seven candidate models on three criteria: inference latency at edge hardware specifications, embedding quality on enterprise-domain text (legal, financial, customer service, and technical support corpora), and memory footprint compatible with Fastly Compute's per-request memory limits. The selected model runs in under four milliseconds on Fastly's edge hardware — fast enough that the similarity check adds no perceptible latency overhead on the request path. The full Accelerator processing pipeline — embedding computation, index query, threshold evaluation, and either cache retrieval or pass-through — adds an average of seven milliseconds to requests that result in a cache hit. For requests that miss the cache and route to a foundation model API, the Accelerator adds three milliseconds of overhead. On a typical OpenAI API response with 400 to 600 milliseconds of round-trip latency, that overhead is commercially irrelevant.
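To see how those overheads combine, the sketch below computes an expected per-request latency from the figures quoted above. It assumes that a cache hit is served entirely at the edge, so its latency is just the seven milliseconds of pipeline time, and it ignores client-to-edge network time; the function name and the 500 ms backend figure (the midpoint of the quoted range) are illustrative.

```python
def expected_latency_ms(hit_rate, backend_ms, hit_overhead_ms=7.0, miss_overhead_ms=3.0):
    """Average latency per request for a given cache hit rate.

    Assumes hits never touch the backend and misses pay the backend round trip
    plus the pass-through overhead.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit_latency = hit_overhead_ms
    miss_latency = miss_overhead_ms + backend_ms
    return hit_rate * hit_latency + (1.0 - hit_rate) * miss_latency
```

Under these assumptions, a 35 per cent hit rate against a 500 ms backend gives roughly 329 ms on average, compared with 500 ms when every request goes to the API directly.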

Cache hit rates in production

Fastly published its first production cache hit rate data in a March 2024 technical case study, citing a range of 25 to 40 per cent on customer service and FAQ chatbot workloads. Those figures, drawn from four early-access customers who deployed AI Accelerator in the second half of 2023, were received sceptically by analysts who had not seen the underlying query distribution data. By October 2024, three of those customers had consented to more detailed reporting. The results across the three deployments — a European insurance group, a North American telecommunications provider, and a Southeast Asian e-commerce platform — showed hit rates of 34, 41, and 29 per cent respectively at six months of production operation. The insurance group's deployment, which handles inbound policy enquiry routing across twelve product lines, recorded a single-week peak hit rate of 51 per cent during a period of unusually concentrated inbound query traffic around a widely publicised coverage dispute. The e-commerce platform's lower figure — 29 per cent — reflected a product catalogue that changes rapidly, producing query distributions where semantic similarity decays faster than in stable knowledge domains.

The cost arithmetic those hit rates produce is straightforward. An enterprise processing ten million inference requests per month at an average cost of $0.003 per request — a reasonable estimate for mixed GPT-4o Mini and Claude Haiku workloads at current API pricing — carries a monthly inference API cost of approximately $30,000. A 35 per cent cache hit rate reduces billable API calls to 6.5 million per month, dropping the inference API cost to approximately $19,500. Fastly's AI Accelerator pricing adds roughly $0.00018 per request — including cache hits, which still consume edge compute — for a total Fastly cost at ten million monthly requests of approximately $1,800. Net monthly saving at 35 per cent hit rate: approximately $8,700. Annualised: approximately $104,000 on a $360,000 annual API spend. The enterprises in Fastly's early programme are not reporting these savings in isolation. They are reporting them as the justification for expanding the number of AI-powered application routes they are willing to fund internally, because the marginal cost of adding a new AI-powered workflow to a product drops when every equivalent query the workflow generates in future months has a non-trivial probability of being served from cache.
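The arithmetic in this paragraph can be reproduced with a short calculation. The function name and dictionary keys are illustrative; the inputs are the figures quoted above, and the per-request Accelerator fee is charged on every request, hits included.

```python
def monthly_savings(requests, cost_per_request, hit_rate, accelerator_fee_per_request):
    """Net monthly saving from serving a share of requests from cache."""
    baseline_api_cost = requests * cost_per_request
    billable_requests = requests * (1.0 - hit_rate)        # only misses reach the API
    reduced_api_cost = billable_requests * cost_per_request
    accelerator_cost = requests * accelerator_fee_per_request
    net_saving = baseline_api_cost - reduced_api_cost - accelerator_cost
    return {
        "baseline_api_cost": round(baseline_api_cost, 2),
        "reduced_api_cost": round(reduced_api_cost, 2),
        "accelerator_cost": round(accelerator_cost, 2),
        "net_monthly_saving": round(net_saving, 2),
        "net_annual_saving": round(net_saving * 12, 2),
    }


# Figures from the paragraph above.
example = monthly_savings(
    requests=10_000_000,
    cost_per_request=0.003,
    hit_rate=0.35,
    accelerator_fee_per_request=0.00018,
)
```

With these inputs, `example` holds a baseline of about $30,000, a reduced API cost of about $19,500, an Accelerator cost of about $1,800, and a net saving of about $8,700 per month, or $104,400 per year.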

"The cache hit rate is not the product. The product is the conversation it starts with the CFO. A 35 per cent reduction in your inference API bill is the line item that gets every other AI initiative off the blocked list."

The edge compute platform direction

AI Accelerator is the most commercially visible part of Fastly's inference strategy, but it is not the full strategy. In parallel with the Accelerator programme, Fastly's platform engineering team has been extending the Fastly Compute runtime — the WebAssembly-based serverless execution environment that powers the company's edge compute offering — to support inference-adjacent workloads that go beyond caching. The direction, described by Morrow in a January 2024 internal product brief that was shared with enterprise architecture teams at three Fastly accounts, is toward what the company calls a "model-aware edge": an edge compute environment that understands the structure of AI application traffic — request classification, semantic routing, response filtering, policy enforcement — without requiring those workloads to execute at a centralised cloud endpoint.

The three concrete capabilities Fastly's engineering team has shipped or committed to since January 2024 illustrate the direction. First, semantic routing: the ability to classify an incoming inference request at the edge and route it to different backend endpoints based on the content of the query rather than its URL structure. A legal document request routes to a specialised legal-domain endpoint; a customer service query routes to a cost-optimised general endpoint; a request flagged as sensitive by the on-edge classifier routes to a private-deployment endpoint rather than a shared API. Second, response filtering: a programmable layer that inspects model outputs before they are returned to the client application, applying content policies, redaction rules, or format transformations at the edge without the output transiting back to a centralised enforcement system. Third, the AI Accelerator integration layer itself, which functions as a first-class citizen in Fastly's Compute request pipeline rather than a bolt-on service, allowing developers to combine caching, routing, and filtering logic in a single edge function with access to the full Compute SDK.
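The first two capabilities can be illustrated with a small sketch. Everything here is an assumption for illustration: the keyword lists stand in for the on-edge classifier, the backend URLs are placeholders, and the email redaction rule is one example of a response filter. None of it reflects the Fastly Compute SDK.

```python
import re

# Placeholder endpoints (hypothetical).
BACKENDS = {
    "legal": "https://legal-llm.example.com/v1",
    "sensitive": "https://private-llm.example.internal/v1",
    "general": "https://general-llm.example.com/v1",
}

# Keyword sets standing in for a real content classifier (assumption).
SENSITIVE_TERMS = {"passport", "iban", "diagnosis"}
LEGAL_TERMS = {"contract", "clause", "indemnity"}

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def classify(query):
    """Label a query by content; sensitive terms take priority over legal ones."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & SENSITIVE_TERMS:
        return "sensitive"
    if words & LEGAL_TERMS:
        return "legal"
    return "general"


def route(query):
    """Semantic routing: choose the backend from the query content, not the URL."""
    return BACKENDS[classify(query)]


def filter_response(text):
    """Response filtering: redact email addresses before returning model output."""
    return EMAIL_PATTERN.sub("[redacted email]", text)
```

Checking for sensitive content first is a deliberate choice in this sketch: a legal query that also contains personal data goes to the private endpoint rather than the shared one.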

What Fastly is not building, and has been explicit about not building, is a managed model-serving layer. James Pellegrino, Fastly's Chief Technology Officer, told a developer audience at the company's Edge Summit event in March 2024 that Fastly's position is "the network intelligence layer, not the GPU rack." The statement reflects a deliberate commercial choice rather than a capability limitation. Fastly has the engineering capacity to build a model-serving product. It does not have the $400 million to $600 million capital outlay that equipping 88 global PoPs with GPU inference hardware would require at current NVIDIA pricing — nor, Pellegrino argued, the conviction that managed model hosting is a differentiated business given the speed at which Cloudflare, Akamai, and the hyperscalers are commoditising it. Fastly's bet is that the network intelligence layer — the semantic caching, routing, and policy enforcement that sits between applications and inference APIs — is a durable margin source that requires less capital and generates higher margins than GPU infrastructure operations.

Customer deployments in the field

Three deployments from the AI Accelerator programme's first twelve months of general availability provide the clearest read on where Fastly's commercial motion is working and where it faces friction. The first is a large North American telecommunications company — one with more than 40 million consumer subscribers — that deployed AI Accelerator in April 2024 across its customer support chat and interactive voice response AI workflows. The deployment routes approximately 18 million inference requests per month through Fastly's edge network. At the six-month mark in October 2024, the company's engineering team reported a combined cache hit rate of 38 per cent across its customer support application routes, corresponding to approximately 6.8 million avoided API calls per month. The company's head of platform engineering, speaking at a Fastly customer briefing in November 2024, described the primary implementation challenge not as technical but as organisational: convincing the customer experience team that a cached response to a semantically similar — but not identical — query was an acceptable production behaviour required a six-week internal alignment process that Fastly's solutions engineering team supported with custom threshold validation tooling.

The second deployment is a German enterprise software company operating in the procurement and supply chain management sector. The company deployed AI Accelerator in June 2024 for a product feature that generates natural-language purchase order summaries and supplier risk alerts from structured procurement data. The workload is well-suited to semantic caching: purchase order categories recur frequently, supplier risk assessment queries share significant semantic overlap within industry verticals, and the structured data inputs bound the query distribution in ways that produce high cache hit rates. The company reported a 44 per cent hit rate at three months, the highest sustained rate in Fastly's published programme data through October 2024. The GDPR consideration was, unusually, a secondary factor in the vendor selection rather than a primary one: the company's AI-generated outputs are post-processed by its own systems before presentation to users, meaning the inference API responses themselves were not classified as personal data under the company's data mapping. The selection was driven primarily by the cost arithmetic and the fact that Fastly's Compute-native implementation required no infrastructure provisioning beyond an API key and a Fastly service configuration.

The third deployment is less straightforward. A UK-based financial services platform deployed AI Accelerator in August 2024 for a client-facing investment research summarisation feature. The deployment ran for eight weeks before the company's compliance team intervened, raising a concern that had not been adequately addressed in the initial architecture review: the semantic embedding of client query content at Fastly's edge nodes, even without the query content itself being logged, constituted a form of data processing that required explicit disclosure under the company's client terms. The concern was resolved, because Fastly's DPA covers embedding computation as an incidental processing activity, but the internal compliance review that followed the initial eight-week run cost the deployment about three months of production value. Fastly's solutions engineering team revised its enterprise pre-deployment checklist in October 2024 to include a specific compliance architecture review section covering edge embedding processing, closing a documentation gap that the UK deployment had exposed. The deployment went live in full production in November 2024 and is processing approximately 220,000 inference requests per month.

Fastly vs. Cloudflare and Akamai

The competitive frame that matters for Fastly is not model serving — where Cloudflare's Workers AI and Akamai's Cloud Inference have established positions that Fastly has no current plans to challenge. It is the network intelligence layer, where Fastly's AI Accelerator is the only production offering from an established CDN vendor. Cloudflare's AI product portfolio includes a beta semantic caching feature in Workers AI, documented in its developer documentation since May 2024, but deployed in production at far smaller scale than AI Accelerator and not marketed through Cloudflare's enterprise sales motion as a primary product. Akamai has no equivalent product in production. The gap is real; the question is how long it holds.

Fastly's 88-PoP network is smaller than Cloudflare's 300-plus-city edge network and considerably smaller than Akamai's 4,100-node CDN footprint. For AI Accelerator's use case — semantic caching — PoP density matters less than it does for content delivery or GPU inference, because the cache can be geographically distributed without requiring GPU hardware at every location. Fastly's 88 PoPs cover the latency profiles that matter for enterprise deployments in North America, Western Europe, and the major Asia-Pacific markets. The gaps in Fastly's network — Eastern Europe, Sub-Saharan Africa, and much of South Asia — represent a real competitive disadvantage for global enterprises with inference traffic in those regions, but represent a relatively small fraction of the enterprise customer base Fastly's commercial team is currently selling to.

The more consequential competitive question is whether Cloudflare uses its enterprise sales expansion — the motion it began in earnest in Q3 2024 to move Workers AI upmarket from its developer base — to attach semantic caching to enterprise Workers AI contracts as a bundled capability rather than a standalone pricing line. If Cloudflare offers semantic caching as an included feature of a Workers AI enterprise contract, Fastly's ability to justify AI Accelerator's separate pricing to accounts that already have Cloudflare network relationships weakens. Fastly's current answer to this risk is depth: the configurable threshold architecture, the route-level caching logic, the semantic routing and response filtering capabilities, and the enterprise solutions engineering support that Cloudflare's developer-origin product motion does not currently match. Whether depth sustains the pricing premium through 2025 depends on how fast Cloudflare's enterprise sales team closes the gap between Workers AI's developer documentation and enterprise production readiness.

What to watch

Fastly's AI Accelerator programme is twelve months into general availability with a customer base that spans three continents and production cache hit rates that validate the core architectural bet. Five developments will determine whether the network intelligence layer becomes the durable commercial position Fastly's product team is building toward, or whether the window closes before the company can expand its edge compute footprint and deepen its enterprise relationships.

The Cloudflare bundling decision. If Cloudflare moves semantic caching from a beta Workers AI feature to a line item in its enterprise contracts in the first half of 2025 — priced as part of a Workers AI platform fee rather than per-request — Fastly's standalone pricing model for AI Accelerator faces direct pressure in accounts that already carry Cloudflare network relationships. Watch Cloudflare's Q1 and Q2 2025 earnings commentary for language about AI Accelerator-class features in enterprise contract structures; the product terms will lag the commercial motion by a quarter.
The managed model-serving question. Fastly's CTO has positioned the company as the network intelligence layer rather than the GPU rack. That positioning holds while Fastly's customers are content to route inference to OpenAI, Anthropic, or Google Vertex endpoints. If a cohort of Fastly enterprise customers begins requiring on-edge model serving — either for data residency reasons that a pass-through caching architecture cannot fully satisfy, or for latency requirements that centralised API endpoints cannot meet — Fastly faces a product gap. A partnership announcement with a managed inference provider would be the signal that Fastly is addressing this gap commercially rather than ignoring it.
Cache hit rate disclosure frequency. Fastly has published aggregate hit rate data twice: in the March 2024 technical case study and in an October 2024 programme update. The gap between disclosures reflects both the normal pace of enterprise case study production and a calculated restraint about surfacing hit rate data from deployments where performance has not met the headline range. More frequent and more granular hit rate disclosure, broken down by workload category and deployment age, would allow the market to assess whether the 25-to-44 per cent range is holding at scale or compressing as the deployment base expands to include more open-ended generation workloads where semantic similarity is harder to exploit.
The Fastly Compute enterprise motion. AI Accelerator runs on Fastly Compute, which means every AI Accelerator customer is also a Fastly Compute customer. The upsell potential runs in both directions: AI Accelerator customers who discover Fastly Compute's broader capabilities — custom edge logic, API gateway functions, real-time data processing — represent expansion revenue that does not require a new sales cycle. Fastly's Q3 and Q4 2024 earnings calls mentioned edge compute as a growth vector without disaggregating AI Accelerator contribution. When Fastly begins reporting AI-related Compute revenue separately, the commercial scale of the programme will be measurable.
The compliance pre-deployment checklist adoption rate. The UK financial services deployment exposed a gap in Fastly's enterprise pre-deployment process that the solutions engineering team addressed in October 2024. Whether the revised checklist materially reduces the compliance-friction incidents that delayed that deployment — and whether it holds up under the scrutiny of regulated-industry procurement teams in financial services, healthcare, and insurance — will determine Fastly's ability to move upmarket into the compliance-sensitive enterprise segment where the largest AI Accelerator contracts are likely to originate.

Frequently asked

What does Fastly AI Accelerator actually do, and how does it differ from a standard HTTP response cache?: A standard HTTP cache stores and retrieves responses using exact URL and header matching — two requests must be byte-for-byte identical to produce a cache hit. Fastly AI Accelerator caches inference responses using semantic similarity. It computes a vector embedding of the incoming request payload, compares it to a distributed index of previously answered requests, and returns a cached response if the semantic similarity exceeds a configurable threshold. The practical effect is that "What are your refund policy terms?" and "Can you explain how your return process works?" can both be served from the same cache entry, even though they are different strings. For high-volume, knowledge-stable workloads — customer service, FAQ, policy enquiry, product documentation — the hit rates that semantic matching produces are meaningfully higher than an exact-match cache could achieve.
What happens when the semantic similarity threshold is set too loosely — does the user receive a wrong answer?: Yes, and this is the central configuration risk of the product. If the threshold is calibrated too loosely for a given application route, the Accelerator returns cached responses to queries that are semantically adjacent but require different answers. Fastly mitigates this through route-level threshold configuration — operators set tighter thresholds on open-ended generation routes and looser thresholds on structured, knowledge-stable routes — and through a developer testing console that allows threshold validation against representative query sets before production deployment. The practical guidance from Fastly's solutions engineering team is to start with a conservative threshold (0.95 cosine similarity) and loosen it route by route based on observed hit rates and output quality sampling. Fastly does not currently offer automated threshold optimisation; that remains a manual calibration process, which is the primary source of implementation friction for new deployments.
How does Fastly's edge compute footprint compare to Cloudflare and Akamai for AI workloads?: Fastly operates 88 points of presence globally. Cloudflare's GPU-equipped inference network spans 47 locations as of mid-2024, drawn from a broader CDN footprint of more than 300 cities. Akamai's GPU-capable cloud regions number 11. For AI Accelerator's semantic caching use case, PoP density matters less than for GPU inference, because the embedding index can be replicated across edge nodes without requiring GPU hardware at each location. Fastly's 88 PoPs cover North America, Western Europe, and major Asia-Pacific markets adequately for the enterprise customer base the company currently serves. For enterprises with material inference traffic in Eastern Europe, Sub-Saharan Africa, or South Asia, Cloudflare's broader geographic footprint is a meaningful advantage. For the managed model-serving use case — running models at the edge rather than caching responses to external API calls — Fastly does not currently compete.
Does using Fastly AI Accelerator create any new data processing obligations under GDPR or equivalent regulations?: Potentially, yes — and the gap this creates in standard enterprise architecture reviews is the lesson from the UK financial services deployment. AI Accelerator computes a vector embedding of the inference request payload at Fastly's edge nodes. That embedding is derived from the request content, which may include user-generated text classified as personal data under GDPR. Fastly's DPA covers embedding computation as an incidental processing activity, but that coverage is not universally accepted by all enterprise legal teams on first review. Enterprises in financial services, healthcare, and other regulated industries should include an edge embedding processing review in their AI Accelerator pre-deployment compliance checklist. Fastly's revised enterprise onboarding documentation, updated in October 2024, now includes a specific section on this point. The risk is manageable; it requires awareness before deployment rather than remediation after.
What foundation model APIs does Fastly AI Accelerator support as backend inference endpoints?: Fastly AI Accelerator is backend-agnostic. Any inference endpoint that accepts HTTP requests — OpenAI, Anthropic, Google Vertex AI, AWS Bedrock, Azure OpenAI Service, and self-hosted endpoints alike — can be registered as a backend in the Fastly service configuration. The Accelerator's caching and routing logic operates on the request payload and the configured similarity threshold, independent of which foundation model provider is answering cache-miss requests. This backend agnosticism is a deliberate design choice: Fastly's commercial value does not depend on a specific foundation model provider relationship, and enterprises retain the ability to switch backend providers without rebuilding their edge caching configuration. In practice, the majority of AI Accelerator deployments in the current programme route cache misses to OpenAI or Anthropic endpoints, reflecting the concentration of enterprise inference API adoption in those two providers rather than a product constraint.
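The manual calibration process described in the threshold answer above (start at 0.95 cosine similarity, then loosen route by route) can be sketched as a sweep over a labelled validation set. This is a hypothetical helper, not Fastly's testing console: `calibrate_threshold`, its parameters, and the 2 per cent false-positive tolerance are assumptions, and each sample is a pair of the similarity between a test query and its nearest cached query and a reviewer's judgement of whether the cached answer was acceptable.

```python
def calibrate_threshold(samples, start=0.95, floor=0.70, step=0.01,
                        max_false_positive_rate=0.02):
    """Loosen the threshold from `start` until wrong cached answers exceed the tolerance.

    samples: list of (similarity, answers_match) pairs from a labelled query set.
    Returns the loosest threshold tested whose false-positive rate stayed acceptable.
    """
    chosen = start
    steps = int(round((start - floor) / step))
    for i in range(steps + 1):
        candidate = round(start - i * step, 4)
        # Labels of every sample that would be served from cache at this threshold.
        served = [match for similarity, match in samples if similarity >= candidate]
        if served:
            false_positive_rate = served.count(False) / len(served)
            if false_positive_rate > max_false_positive_rate:
                break   # loosening further would serve too many wrong answers
        chosen = candidate
    return chosen
```

Run per route against a representative query set, a sweep like this gives a starting point that output quality sampling in production can then confirm or tighten.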

The field note

Fastly made a capital-efficient bet in November 2023. The AI Accelerator programme did not require Fastly to become a GPU operator, a model provider, or an AI cloud vendor. It required Fastly to understand, better than its competitors, where CDN-native network intelligence adds value in an AI application stack — and then build precisely that, and nothing more. Twelve months of production data from customer deployments across three continents confirm that the core bet was right: semantic caching at the edge produces hit rates in the 29-to-44 per cent range on the workloads where it belongs, and those hit rates change the cost arithmetic of enterprise AI deployment in ways that matter to CFOs and procurement teams reviewing quarterly API spend.

The programme's limitations are real. Fastly is not building toward a full-stack inference platform. It cannot serve enterprises whose primary requirement is on-edge model serving with data that never transits a third-party API. Its network footprint leaves geographic gaps that Cloudflare's broader infrastructure fills. And the window in which Fastly holds the only production semantic caching product from an established CDN vendor will not remain open indefinitely. What Fastly has on its side is twelve months of enterprise deployments, production hit rate data, and a solutions engineering motion that is ahead of its competitors in understanding where the compliance friction sits and how to navigate it. In enterprise infrastructure, twelve months of production precedent is a durable advantage — not a permanent one, but enough of one to matter through the next competitive cycle.