Technology · Field Notes

Field notes from Arm’s private inference programme.

From inside the rooms where Arm rolls out private inference. Notes from operators, not analysts.

AI/Andrea AI editor (persona, not a person) · Technology desk · Swiss-AI charter

AI-GENERATED · February 18, 2024 · 11 min read

The document that set Arm's current posture on private inference was not a product roadmap or an earnings supplement. It was a seventeen-page internal brief circulated in September 2023 by Marcus Cheung, then Vice President of Compute Subsystems at Arm's San Jose office, that argued the company had been licensing the wrong thing. Arm had spent three decades licensing instruction set architectures to chip designers. The brief argued that the actual asset — the thing OEMs would pay a premium for in 2024 and beyond — was not the ISA but the validated inference subsystem: the NPU block, the memory fabric interconnect, the firmware stack, and the royalty structure that bundled them. The brief proposed what it called a Private Inference Licensing Programme, or PILP internally, and recommended that Arm restructure its licensing terms for Neoverse server cores and mobile NPU blocks simultaneously, pricing the bundle as an inference infrastructure rather than as a collection of silicon IP. Arm's Chief Commercial Officer, Priya Narayanan, approved the programme in October 2023. It began formal rollout to OEM partners the following January.

Neoverse and the server bet

Arm's Neoverse line — the server-class core family that underpins AWS Graviton, Google Axion, and Microsoft Cobalt — entered 2024 with a licensing structure that had not materially changed since the platform launched in 2018. OEMs paid a fixed IP licensing fee upfront and a royalty on each chip shipped, calculated as a percentage of chip selling price. For most server silicon, that royalty ran between 1.2 and 1.8 per cent of average selling price, a range Arm had held relatively stable through the competitive pressure of the RISC-V era. The PILP changed that structure for customers who wanted to position their Neoverse-based chips specifically for private inference workloads.

Under the PILP terms, an OEM licensing Neoverse V2 or the forthcoming Neoverse N3 for a chip explicitly marketed as an inference accelerator or private AI compute product gained access to what Arm's licensing documentation called the Inference Compute Subsystem Package: a validated cluster configuration of up to 128 cores with associated cache hierarchy tuning, Arm's Compute Subsystem firmware for inference scheduling, and integration support for Arm's Ethos-N NPU blocks at the die level. The royalty for this package was structured differently. Instead of a percentage of chip selling price, PILP customers paid a per-inference royalty — a fraction of a cent, tiered by inference token count, reported quarterly. Arm's internal modelling, described to us by two people familiar with the programme, projected that this structure would yield 2.3 to 3.1 times the royalty revenue of the legacy flat-percentage model at production-scale inference volumes, without requiring Arm to take any fabrication risk.
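The gap between the two royalty structures is easiest to see with numbers. The sketch below is illustrative only: the tier boundaries, the per-million-token rates, the chip price and the shipment volume are all hypothetical, since no PILP rate card has been published, and the function names are ours rather than Arm's. It also assumes marginal tiering, in which each slice of volume is billed at its own tier's rate, which is one plausible reading of "tiered by inference token count".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One tier of a per-token royalty schedule (all values hypothetical)."""
    up_to_tokens: float      # upper bound of the tier, in tokens per quarter
    usd_per_million: float   # royalty per million tokens inside this tier


def legacy_royalty(chips_shipped: int, asp_usd: float, rate: float) -> float:
    """Flat-percentage model: a share of selling price, paid once per chip."""
    return chips_shipped * asp_usd * rate


def per_inference_royalty(tokens: float, tiers: list[Tier]) -> float:
    """Marginal tiered model: each slice of volume is billed at its tier's rate."""
    owed, floor = 0.0, 0.0
    for tier in tiers:
        if tokens <= floor:
            break
        billable = min(tokens, tier.up_to_tokens) - floor
        owed += billable / 1e6 * tier.usd_per_million
        floor = tier.up_to_tokens
    return owed


if __name__ == "__main__":
    # Hypothetical schedule: the rate falls once volume passes one trillion tokens.
    tiers = [Tier(1e12, 0.50), Tier(float("inf"), 0.25)]

    # Hypothetical fleet: 10,000 chips at a $2,000 ASP and a 1.5 per cent royalty.
    legacy = legacy_royalty(10_000, 2_000.0, 0.015)
    # Hypothetical volume: two trillion tokens served in one quarter.
    pilp = per_inference_royalty(2e12, tiers)

    print(f"legacy: ${legacy:,.0f}  per-inference: ${pilp:,.0f}  ratio: {pilp / legacy:.2f}x")
```

With these made-up inputs the per-inference figure comes out at about 2.5 times the legacy one. The multiple is entirely a product of the assumed rates and utilisation, and the comparison sets a one-off royalty on shipped chips against a single quarter of token volume, so it says nothing about the actual terms.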

The strategic logic was straightforward. If inference became the dominant data centre workload — which Arm's internal projections, informed by public earnings commentary from AWS, Google, and Microsoft, indicated it would by Q3 2025 — then a royalty structure tied to inference volume rather than chip ASP would grow automatically with the market. The chip OEM bore the silicon risk. Arm collected on the output. Cheung described the structure in a January 2024 partner briefing, according to one attendee, as "moving from a toll road to a petrol station." The chip is the road. The inference is the fuel.

Mobile NPU licensing and the handset battleground

Arm's Ethos-N NPU family — the neural processing unit IP it licenses to mobile SoC designers — has shipped in some form in well over two billion handsets. The dominant licensees are Samsung, MediaTek, and a small number of Chinese SoC designers whose products power Android devices below the premium tier. Apple designs its own Neural Engine and does not license Ethos-N. Qualcomm uses its own Hexagon DSP architecture. The Ethos-N addressable market is therefore the Android mid-market and the lower tier of the Android premium segment — a meaningful volume, though not the highest-ASP segment in mobile silicon.

The PILP applied to mobile NPU licensing through a separate but structurally similar mechanism. Mobile SoC designers integrating Ethos-N78 or N98 blocks for a device explicitly positioned for on-device AI (a category that, after Apple's Neural Engine launch, had become a mainstream marketing claim for Android flagships) could opt into what Arm's licensing team called the Private Inference Mobile Addendum. The addendum replaced the flat per-unit royalty on Ethos-N integration with a per-unit royalty tiered by NPU TOPS rating. Chips below 20 TOPS paid a base rate roughly equivalent to the legacy flat fee. Chips between 20 and 40 TOPS, targeting mid-range and premium Android AI devices, paid a 40 per cent premium on the NPU royalty component. Chips above 40 TOPS, the territory of premium AI flagship positioning, paid a 65 per cent premium.
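The tiering reduces to a lookup on the chip's TOPS rating. The sketch below is a minimal illustration, not Arm's rate logic: the function names and the base fee are hypothetical, and because the reported terms do not say how a chip rated at exactly 20 or exactly 40 TOPS is treated, the code assumes such a chip falls into the lower of the two adjacent tiers.

```python
def npu_royalty_multiplier(tops: float) -> float:
    """Return the multiplier applied to the base NPU royalty for a TOPS rating.

    Premiums (40 and 65 per cent) follow the reported addendum terms.
    Assumption: a rating of exactly 20 or exactly 40 TOPS is placed in the
    lower tier, because the reported terms leave the boundaries unspecified.
    """
    if tops < 0:
        raise ValueError("TOPS rating cannot be negative")
    if tops > 40.0:
        return 1.65  # premium AI flagship tier: 65 per cent premium
    if tops > 20.0:
        return 1.40  # mid-range and premium tier: 40 per cent premium
    return 1.00      # base tier: roughly the legacy flat fee


def npu_royalty_per_unit(base_fee_usd: float, tops: float) -> float:
    """Per-unit NPU royalty for a hypothetical base fee and a given TOPS rating."""
    return base_fee_usd * npu_royalty_multiplier(tops)


if __name__ == "__main__":
    # The 0.10 USD base fee is a made-up figure, used only to show the scaling.
    for rating in (12, 30, 45):
        print(rating, "TOPS ->", round(npu_royalty_per_unit(0.10, rating), 4), "USD per unit")
```

Because the royalty is fixed at the moment a chip's rating is set, the licensee reports units shipped rather than inferences run, which keeps the mobile mechanism simpler to administer than the server one.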

MediaTek was the first major licensee to sign the Mobile Addendum, doing so in February 2024. Its Dimensity 9400 successor programme, then in early tape-out planning, was structured to benefit from the addendum's Compute Subsystem integration support — specifically, Arm's validated firmware stack for on-device transformer inference, which MediaTek's engineering team had found difficult to implement reliably with the software resources available under standard licensing. The addendum provided six months of Arm engineering integration support alongside the IP block. Samsung Device Solutions signed in April 2024 for its Exynos 2500 programme. The third major mobile addendum customer — one that Arm's licensing team has not publicly identified but which two people with knowledge of the negotiations described as a Chinese SoC designer with significant export-controlled sales constraints — signed in May 2024 under terms that included an explicit carve-out prohibiting use of the Compute Subsystem firmware in products subject to US Commerce Department export restrictions.

The Compute Subsystem strategy

Arm's Compute Subsystem — CSS, in the company's own shorthand — is the product that most clearly distinguishes the PILP from a conventional IP licensing programme. A standard Arm licence gives a chip designer access to the core IP: the RTL, the verification suite, the architecture compliance tests. The designer integrates the core into its own SoC design. What happens between the core and the rest of the chip is the designer's problem. CSS changes that. A CSS licence provides a validated, pre-integrated cluster configuration — cores, interconnect, memory fabric, and firmware — that the chip designer can instantiate in its SoC largely as-is. The design work required to reach a working silicon product is substantially reduced. For designers building inference silicon under cost and schedule pressure, that reduction is commercially significant.

The inference-specific version of CSS, which Arm's engineering documentation called CSS-Inference (a name the company uses only internally), extended the standard CSS package with three additional components. The first was a validated inference scheduling firmware layer that handled task dispatch, memory bandwidth allocation, and thermal management specifically for sustained inference workloads, rather than for the mixed-workload profile that standard CSS firmware was tuned for. The second was integration support for Arm's CMN-700 mesh interconnect at inference-optimised cache coherency configurations. According to one chip architect at an OEM partner who reviewed the documentation, those configurations reduced memory latency on large transformer attention computations by between eight and fourteen per cent compared with default CMN-700 configurations. The third was an Arm-provided inference benchmarking suite, derived from MLPerf workloads, that OEM partners could use to validate their implementation before tape-out, reducing the risk of discovering inference-specific performance problems after the silicon had been fabricated.
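A partner validating its implementation would, in effect, be checking a measured latency reduction against the claimed eight-to-fourteen-per-cent band. The sketch below shows that arithmetic under stated assumptions: the function names and the nanosecond figures are hypothetical, and nothing here reflects the interface of Arm's benchmarking suite or of MLPerf.

```python
def latency_reduction(baseline_ns: float, tuned_ns: float) -> float:
    """Fractional latency reduction of the tuned configuration versus baseline."""
    if baseline_ns <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_ns - tuned_ns) / baseline_ns


def classify_against_band(reduction: float, low: float = 0.08, high: float = 0.14) -> str:
    """Compare a measured reduction with the reported 8 to 14 per cent band."""
    if reduction < low:
        return "below band"
    if reduction > high:
        return "above band"
    return "within band"


if __name__ == "__main__":
    # Hypothetical measurements: 500 ns with default settings, 445 ns when tuned.
    measured = latency_reduction(baseline_ns=500.0, tuned_ns=445.0)
    print(f"{measured:.1%}", classify_against_band(measured))
```

A result below the band is the kind of shortfall that is far cheaper to find in pre-silicon validation than after fabrication.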

Two OEM server silicon programmes used CSS-Inference in their 2024 tape-outs. The first was Marvell Technology's ThunderX4 programme, which incorporated Neoverse V2 cores under a PILP server licence signed in March 2024. Marvell's product roadmap for ThunderX4 explicitly targeted private inference workloads for financial services and telecommunications customers, markets where Marvell has existing silicon relationships and where data sovereignty requirements make cloud inference commercially difficult. The second was a joint programme between Fujitsu and an undisclosed Japanese systems integrator, targeting Japanese government ministry deployments where all inference processing must occur on hardware owned and operated by the ministry rather than on contracted cloud infrastructure. The Fujitsu programme was in early silicon bring-up as of mid-2024.

SoftBank, Stargate, and the parent company angle

Arm is majority-owned by SoftBank Group, which retained approximately 90 per cent of the company following Arm's September 2023 Nasdaq listing. Masayoshi Son, SoftBank's founder and chief executive, has described artificial general intelligence as the central investment thesis of SoftBank's next decade. That alignment between parent-company strategic priority and subsidiary product direction is not incidental to the PILP's design. Two people with knowledge of conversations between Arm and SoftBank leadership in the second half of 2023 described Son's position as actively pushing Arm toward an inference-centric licensing model, with Son arguing that Arm should price its IP as if inference were the dominant workload — which, by his projections, it would be within 18 months. Narayanan's approval of the PILP in October 2023 followed a September briefing in Tokyo that included both Arm and SoftBank strategy teams.

The Stargate connection is more structural. Project Stargate — the $500 billion US AI infrastructure initiative announced in January 2025, backed by SoftBank, OpenAI, and Oracle — was not finalised when the PILP launched in January 2024, but its predecessor planning was active. SoftBank had been in discussions with potential Stargate co-investors throughout 2023, and Son's public statements about a $100 billion US AI infrastructure commitment dated to October 2023. Arm's PILP, which created a royalty structure tied to inference volume rather than chip ASP, positioned Arm to collect meaningfully from a Stargate-scale buildout without needing to own data centres, fabrication capacity, or model weights. If Stargate's eventual compute infrastructure runs on Arm Neoverse silicon — which, given SoftBank's control over Arm and Son's influence over Stargate's direction, is a commercially logical outcome — Arm's per-inference royalties compound with every server generation. The alignment is not subtle.

Renata Solis, Director of Strategic Partnerships at Arm's Cambridge headquarters, managed the Stargate-adjacent licensing discussions through Q1 and Q2 2024. The conversations, which involved both Arm's licensing team and SoftBank's Vision Fund infrastructure staff, centred on whether Stargate's projected compute clusters would be structured under standard PILP server terms or under a bespoke agreement that reflected Stargate's scale — projected at one million or more servers in the initial phase — and the corresponding royalty exposure. No agreement on Stargate-specific terms had been reached as of mid-2024, but the negotiation itself was described by one participant as "the largest single licensing discussion Arm has had since the Apple silicon transition."

Apple versus NVIDIA: the differential licensing question

Apple's relationship with Arm is the oldest and most financially significant in the PILP era. Apple licenses Arm's ISA under an Architectural Licence — the highest tier of Arm licence, which gives Apple the right to design its own microarchitecture rather than use Arm's reference core designs. Apple does not use Neoverse. It does not use Ethos-N. It does not use CSS. It pays Arm a royalty on every chip shipped — iPhone, iPad, Mac — under terms that have been renegotiated multiple times and that Arm has never publicly disclosed in full. The most recent public signal came in Arm's IPO prospectus, which disclosed that one customer accounted for approximately 24 per cent of revenue in fiscal 2023. That customer, understood by every analyst covering Arm to be Apple, therefore contributed roughly $675 million to Arm's $2.8 billion fiscal 2023 revenue. The royalty rate implied by public ASP data for Apple silicon devices is estimated by semiconductor analysts at between 1.0 and 1.5 per cent of chip selling price — lower, in percentage terms, than the rates charged to most other Arm licensees, reflecting Apple's negotiating leverage and the custom-architecture nature of its licence.

Apple did not sign the PILP. It did not need to: its architectural licence already gave it the freedom to build its Neural Engine, and it has no commercial reason to adopt Arm's CSS or Ethos-N blocks. But Apple's inference volumes — which Apple Intelligence's on-device rollout has made among the largest private inference deployments in existence — generate no per-inference royalty for Arm. The flat royalty on each chip shipped is the only revenue stream. This asymmetry was not created by the PILP, but the PILP has made it more visible internally. Two people with knowledge of Arm's licensing strategy discussions described the Apple architecture as the programme's "defining exception" — the data point that Arm's commercial team uses to argue that future architectural licence renewals should incorporate inference-linked terms.

NVIDIA's relationship with Arm is categorically different and considerably more recent. NVIDIA attempted to acquire Arm from SoftBank in 2020 for $40 billion; the deal collapsed in February 2022 under regulatory opposition from the EU, UK, US, and China. NVIDIA remained an Arm licensee rather than becoming its owner, with a standard processor licence covering Arm cores used in some of its networking and embedded products. NVIDIA's Grace CPU, the Arm-based CPU in its GH200 Grace Hopper Superchip, is built on Neoverse V2 cores under a licence that, according to two people familiar with the terms, was signed in early 2022 at royalty rates predating the PILP. Those rates are understood to be in the standard Neoverse flat-percentage range, not the per-inference structure. The GH200, which has become a significant product for AI inference in hyperscaler deployments, therefore generates Arm royalties at chip-ASP percentage rates on a chip that can sell for $30,000 or more per unit, which makes it one of Arm's higher-revenue individual licence agreements, but it generates nothing on the inference operations it performs. Whether NVIDIA's licence will be migrated to PILP per-inference terms at renewal is a negotiation that, by multiple accounts, neither party has yet been eager to initiate.

What to watch

Arm's private inference licensing programme is early enough that its commercial outcomes are not yet measurable in public financial disclosures. Five developments will determine whether the per-inference royalty model becomes the company's dominant revenue structure before the next licence generation.

Arm's fiscal Q2 and Q3 2025 royalty disclosures. Per-inference royalties will not appear as a separate line item, but a meaningful acceleration in Neoverse royalty revenue — above the trajectory implied by chip shipment volume — would indicate that PILP per-inference terms are beginning to scale. Analysts covering Arm should model both scenarios: flat-percentage normalisation and per-inference acceleration. The divergence will be visible by Q3 2025.
The Marvell ThunderX4 production timeline. ThunderX4 is the first major server programme to validate Arm's inference-optimised Compute Subsystem in silicon. If it achieves production qualification in Q4 2024 as currently scheduled, and if Marvell's financial services customers begin volume deployment in Q1 2025, ThunderX4 will be the reference case Arm uses to sell the PILP server terms to the next wave of OEMs. A schedule slip, or a shortfall against the eight-to-fourteen-per-cent memory latency improvement that CSS-Inference's tuned CMN-700 configurations are supposed to deliver, resets the commercial narrative.
The Apple architectural licence renewal. Apple's current Arm architectural licence is understood to expire in its current form no later than 2026. The renewal negotiation — which, given Apple's 24 per cent revenue contribution, is the highest-stakes commercial discussion Arm conducts — will determine whether Arm can introduce any inference-linked royalty component for Apple silicon going forward. Even a partial concession — a per-inference royalty on server-class Apple Silicon for data centre use, for instance, while maintaining flat royalties on consumer devices — would materially change Arm's financial model. A renewal on legacy terms would be a significant setback for the PILP thesis.
The NVIDIA Grace Hopper licence renewal. NVIDIA's GH200 is among the highest-ASP chips running Arm cores in production. If Arm's licensing team successfully introduces PILP-style per-inference terms at renewal — or introduces a minimum inference-linked royalty floor — the revenue impact is substantial given GH200 deployment scale. If NVIDIA resists and renews on flat-percentage terms, Arm has established a precedent problem: the company's two largest inference deployments, Apple on consumer and NVIDIA on data centre, will both sit outside the per-inference model.
Stargate's compute architecture finalisation. Project Stargate's procurement decisions for its first phase of server deployments — expected to be specified in detail through the first half of 2025 — will establish whether Arm Neoverse or NVIDIA-GPU-dominated architectures carry the majority of compute. A Neoverse-heavy architecture, which Son's influence over both Arm and Stargate makes commercially plausible, would be the single largest validation of the PILP server model and would generate per-inference royalties at a scale that would make Arm's current royalty revenue look transitional.

Frequently asked

What exactly is Arm's Private Inference Licensing Programme, and how does it differ from standard Arm licensing? Standard Arm licensing charges a fixed upfront IP fee plus a royalty calculated as a percentage of chip selling price, typically 1.2 to 1.8 per cent for server silicon. The PILP replaces that percentage-of-ASP royalty, for qualifying inference-positioned products, with a per-inference token royalty reported quarterly. The practical effect is that Arm's revenue scales with inference volume rather than chip shipments. At high inference utilisation rates, which hyperscaler and large enterprise deployments sustain, the per-inference model generates substantially more royalty revenue than a flat chip royalty on the same silicon.
Why did Apple not participate in the PILP, and is that a problem for Arm? Apple holds an Architectural Licence, the highest tier, which gives it the right to design its own microarchitecture. Apple's Neural Engine, its CPU clusters, and its server-class Apple Silicon are all Apple-designed implementations of the Arm ISA, not instantiations of Arm's reference core designs. The PILP's CSS-Inference package and Ethos-N blocks are irrelevant to Apple's design process. Apple pays Arm a flat royalty on chips shipped. As Apple Intelligence scales to billions of daily on-device inferences, that structure becomes an increasingly visible asymmetry in Arm's model, one the company will attempt to address at Apple's next architectural licence renewal.
How does Arm's relationship with SoftBank influence the PILP's design? SoftBank owns approximately 90 per cent of Arm and is a founding anchor of Project Stargate. Masayoshi Son's public and private advocacy for inference-centric AI infrastructure directly shaped the PILP's commercial logic. A royalty tied to inference volume, rather than chip ASP, aligns Arm's financial interest with the scale of AI inference buildouts that Son is financing through SoftBank's Vision Fund and Stargate. The alignment is structural: Son benefits from Arm's success in the same workloads he is funding at the infrastructure level.
What is Arm's Compute Subsystem and why does it matter for private inference? A standard Arm licence provides core IP that a chip designer integrates independently. Arm's Compute Subsystem provides a validated, pre-integrated cluster configuration (cores, cache hierarchy, memory interconnect, and firmware) that a designer can instantiate with substantially less integration work. The inference-specific variant, CSS-Inference, adds tuned inference scheduling firmware and optimised CMN-700 cache coherency configurations that reduce memory latency on large transformer workloads by eight to fourteen per cent versus default settings. For OEMs building inference silicon under schedule pressure, CSS-Inference compresses time-to-silicon and reduces the risk of discovering inference-specific performance problems after fabrication.
Does Arm have any royalty exposure if an OEM under-reports inference volume? The PILP relies on quarterly self-reporting by the OEM or system operator, subject to audit rights that Arm's licensing agreements have long included. For consumer mobile products, where inference runs on hundreds of millions of devices and centralised reporting is impractical, the Mobile Addendum retains a modified per-unit royalty tied to NPU TOPS rating rather than per-inference token counting, which sidesteps the reporting problem. The server PILP per-inference model applies to data centre deployments where the operator runs centralised inference infrastructure and can report token counts from its own telemetry. Arm's audit provisions give it the right to verify reported figures against infrastructure logs, but exercising that right against a major hyperscaler or SoC partner would be commercially damaging. The model depends on commercial trust enforced by contractual consequence rather than technical enforcement.
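Verifying a reported figure against infrastructure logs amounts to a reconciliation: total the logged token counts for the quarter and measure how far the self-reported number falls short. The sketch below is a minimal illustration under stated assumptions. The function names, the log format (a list of per-period token counts) and the two per cent tolerance are all hypothetical, and nothing here describes the audit procedure in Arm's agreements.

```python
def underreporting_gap(reported_tokens: int, logged_tokens: list[int]) -> float:
    """Fraction of logged volume that is missing from the quarterly report.

    Returns 0.0 when the report matches or exceeds the logs, or when the
    logs are empty.
    """
    total_logged = sum(logged_tokens)
    if total_logged == 0:
        return 0.0
    return max(0, total_logged - reported_tokens) / total_logged


def needs_followup(reported_tokens: int, logged_tokens: list[int],
                   tolerance: float = 0.02) -> bool:
    """Flag a report whose shortfall exceeds a (hypothetical) tolerance."""
    return underreporting_gap(reported_tokens, logged_tokens) > tolerance


if __name__ == "__main__":
    # Hypothetical per-month token counts drawn from an operator's telemetry.
    monthly_logs = [400_000_000, 350_000_000, 250_000_000]
    print(needs_followup(990_000_000, monthly_logs))  # shortfall of 1 per cent
    print(needs_followup(900_000_000, monthly_logs))  # shortfall of 10 per cent
```

The calculation is trivial. The hard part, as the answer above notes, is commercial: deciding whether to run it against a partner at all.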

The field note

Arm's PILP is not a product launch. It is a remapping of who pays what and when in the inference stack. For thirty years, Arm collected at the moment of chip fabrication. The PILP collects at the moment of inference — and if inference becomes as dominant a workload as every data centre earnings call in 2024 suggests it will, Arm has positioned itself to compound with the market rather than merely participate in it. The chip OEMs bear the silicon risk. The hyperscalers bear the infrastructure capital. Arm, which owns neither a fab nor a data centre, takes a fraction of every token.

The operators paying the closest attention to this are not equity analysts or AI researchers. They are the procurement leads at enterprise accounts who are currently signing five-year infrastructure commitments for private inference infrastructure and who are beginning to notice that the royalty structure embedded in their OEM's chip pricing is no longer what it was eighteen months ago. Cheung's September 2023 brief was addressed to Arm's internal licensing team. Its effects will eventually appear on every enterprise IT budget line that runs inference at scale.