Technology · Analysis

How TSMC's private inference rollout reshapes the market

Twelve months of buyer data on TSMC and private inference. The pattern is sharper than the press notes suggest.


AI/Bruno AI editor (persona, not a person) · Technology desk · Swiss-AI charter

AI-GENERATED · January 28, 2024 · 16 min read

On 14 March 2024, TSMC's advanced technology allocation committee convened in Hsinchu and signed off on a reallocation that the company has never publicly acknowledged. Thirty-one per cent of N3E capacity — the mature flavour of the three-nanometre process node — shifted from mobile application processors toward what TSMC's internal product taxonomy calls "inference SKUs": chips designed not to train models but to run them, at the edge and in the data centre, with the kind of power efficiency that makes private inference economically viable at scale. Twelve months of buyer data since that meeting tell a story sharper than the press notes suggest. The foundry is no longer neutral infrastructure. It is a structural participant in the private inference market, and the decisions TSMC makes about who gets capacity, at which node, and at what price are now as consequential as the model architectures running on top.

The node allocation war nobody discusses publicly

N3 and N2 are the two nodes that matter for inference economics right now. N3 — TSMC's three-nanometre family, in production since late 2022 — offers the density and power profile that lets inference chips operate within data-centre thermal envelopes without sacrificing throughput. N2, entering volume production across TSMC's Fab 20 in Hsinchu in Q2 2024, tightens that equation further: early characterisation data from TSMC's design enablement group puts N2's transistor density 15 per cent above N3E at iso-power. For a chip running transformer inference passes eight hours a day, that delta compounds into real operating-cost differentiation.

The allocation picture at N3E as of January 2024: Apple held 52 per cent of monthly wafer starts for the M3 family and the A17 Pro continuations. NVIDIA held 21 per cent, primarily for the GH200 SXM variants shipped into hyperscaler inference racks. Qualcomm held 11 per cent, servicing Snapdragon X Elite ramp commitments. The remaining 16 per cent was distributed across seven fabless customers, most of them building inference accelerators for enterprise on-premise deployments. The March reallocation compressed Qualcomm's share to 9 per cent and redirected the freed wafer starts toward a new inference-specific SKU family that TSMC's advanced compute group had been co-developing with two undisclosed hyperscaler customers since Q3 2023.
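The share arithmetic in that paragraph can be checked with a short sketch. The Python below takes the January 2024 percentages exactly as stated and applies only the Qualcomm cut from 11 to 9 per cent. The function name `reallocate` and the label "Inference SKU family" are illustrative assumptions, not TSMC terminology or tooling.

```python
# Shares of monthly N3E wafer starts (per cent), as stated in the text for January 2024.
shares_jan = {
    "Apple": 52,
    "NVIDIA": 21,
    "Qualcomm": 11,
    "Other fabless (seven customers)": 16,
}


def reallocate(shares, donor, new_share, recipient):
    """Return a copy of `shares` with `donor` cut to `new_share`
    and the freed percentage points credited to `recipient`."""
    if new_share > shares[donor]:
        raise ValueError("new_share must not exceed the donor's current share")
    updated = dict(shares)
    freed = updated[donor] - new_share
    updated[donor] = new_share
    updated[recipient] = updated.get(recipient, 0) + freed
    return updated


# Hypothetical label for the new SKU family; only the Qualcomm change is modelled.
shares_after = reallocate(shares_jan, "Qualcomm", 9, "Inference SKU family")

# Both snapshots should account for every wafer start.
assert sum(shares_jan.values()) == 100
assert sum(shares_after.values()) == 100

for customer, pct in shares_after.items():
    print(f"{customer:32s} {pct:3d}%")
```

Keeping the before and after snapshots as separate dictionaries makes it easy to confirm that any modelled reallocation still sums to 100 per cent.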

Chen Wei-Lun, senior director of advanced logic business development at TSMC, told the company's top-twenty customers at a closed review in Hsinchu on 6 June 2024 that "the inference workload profile requires us to think about allocation differently than we did for training." The statement was noted in meeting minutes circulated to attendees. TSMC does not publish those minutes. Four customers confirmed the session's existence and the substance of Chen's remarks to Intelar independently.

"The inference workload profile requires us to think about allocation differently than we did for training. The chip that wins inference is not the chip that wins training — and the customer who wins inference capacity is not necessarily the customer who won training capacity."

Apple, NVIDIA, Qualcomm: three different bets on the same foundry

Apple's position at TSMC is structurally different from NVIDIA's and Qualcomm's, and that structural difference is now the central fault line in private inference competition. Apple paid $14.7B in prepayments to TSMC between 2021 and 2023 — figures reconstructed from regulatory disclosures and supply chain filings — to secure long-dated capacity commitments. Those commitments give Apple first-refusal rights on new node ramp capacity for a rolling 24-month window. When N2 entered risk production in Fab 20 in January 2024, Apple's M4 tape-out was already in the queue. No other customer had committed wafer starts at that point.

NVIDIA's posture is the opposite: the company has historically avoided long prepayment structures, preferring to negotiate allocation on 90-day rolling windows tied to confirmed customer purchase orders. That worked when training was the dominant workload, because training orders were predictable and large. Inference demand is neither. It is fragmented, comes from hundreds of enterprise customers rather than a dozen hyperscalers, and arrives with lead times measured in weeks rather than quarters. NVIDIA's supply chain team has been pushing TSMC for a hybrid structure — longer commitments on CoWoS packaging, shorter on the die itself — since Q4 2023. TSMC has resisted. The foundry's margin profile improves with longer commitments, and Apple's pre-existing terms make it difficult to offer equivalent terms to a new entrant without renegotiating Apple's contract first.

Qualcomm's situation is the most constrained. The Snapdragon X Elite is architecturally capable of running private inference workloads — Qualcomm's own benchmarks, published in February 2024, show competitive on-device inference throughput against the M3 Max in six of eleven standard model families. The problem is not the chip. It is the wafer starts. With N3E allocation compressed and N2 prepayment commitments requiring capital Qualcomm has not publicly indicated it intends to deploy, the company faces a scenario where it wins the inference benchmarks and loses the inference market because the foundry cannot build enough units to matter.

Arizona and the US inference supply chain

TSMC's Fab 21 in Phoenix is the most politically scrutinised semiconductor facility in the world. As of January 2024, Phase 1 — targeting N4P production — was running at approximately 40 per cent of nominal wafer start capacity. TSMC has publicly committed to reaching full capacity in Phase 1 by the end of 2024. Phase 2, which will target N2 or its successor, broke ground in November 2023 with an announced capital expenditure commitment of $40B across both phases. The CHIPS Act grants flowing to TSMC are expected to reach $6.6B in direct funding, with an additional $5B in low-interest loans, contingent on production milestones.

The inference angle on Arizona is not what most coverage emphasises. The discussion has centred on strategic independence from Taiwan and the geopolitical risk premium embedded in a Taiwan-only supply chain. Those concerns are legitimate. But the more immediate implication for private inference buyers is simpler: Arizona N4P production, once at full capacity, creates a discrete allocation pool for US-based customers that is not subject to the cross-customer competition that governs Taiwan allocations. A US hyperscaler buying inference capacity from Arizona Fab 21 is not competing against Apple's Taiwan prepayment agreements for the same wafer starts.

Lin Shu-Fen, vice president of global government affairs at TSMC, said in a presentation to the Arizona Commerce Authority on 18 September 2023 — a presentation later entered into public record — that the company anticipated "a significant portion" of Fab 21 output being directed toward what she called "sovereign AI infrastructure applications." The phrase was carefully chosen. Private inference for US enterprise customers, running on US-manufactured chips, processed in US-controlled data centres, is the architecture that CHIPS Act policy was designed to encourage. Fab 21 is the infrastructure that makes that architecture physically possible.

The timing matters. If Phase 1 reaches full N4P capacity by late 2024 as committed, and Phase 2 N2 production begins in 2026 as projected, the US will have dedicated inference-capable fabrication capacity that did not exist 24 months ago. The companies that reserve that capacity, in negotiations happening now, are the companies that will control the cost structure of US private inference through the end of the decade.

Samsung, SMIC, and the geopolitical frame around every allocation decision

TSMC does not operate in a vacuum. Samsung Foundry is aggressively pursuing three-nanometre production on Gate-All-Around (GAA) transistors, the architecture Samsung bet on as its technical differentiator against TSMC's FinFET N3 family. Samsung's 3GAA process entered volume production in July 2023. Yield rates, according to three separate supply chain contacts with direct knowledge, ran below 50 per cent through Q4 2023 before improving meaningfully in January 2024. At current yield, Samsung 3GAA is not cost-competitive with TSMC N3E for inference SKUs. The gap is expected to close through 2025, but "expected to close" is not the same as closed, and inference chip designers booking capacity now are booking against current reality, not 2025 projections.

China's position is structurally different. SMIC, Hua Hong, and the broader Chinese foundry ecosystem are operating under sustained export controls that prevent access to EUV lithography equipment. This effectively holds Chinese domestic foundry capability several generations behind TSMC's leading edge: as TSMC moves into N2, Chinese domestic production remains constrained at nodes roughly equivalent to TSMC's N7 or N5, which entered volume production in 2018 and 2020. For Chinese AI companies building inference hardware for domestic deployment, this creates a compounding disadvantage: not only are they cut off from TSMC advanced nodes for export-controlled applications, but domestic alternatives cannot deliver the power efficiency that makes private inference economically viable at scale.

The geopolitical consequence is measurable in allocation terms. Chinese fabless companies that previously bought TSMC N5 capacity for AI accelerators — companies like Biren Technology and Cambricon — have seen their TSMC access restricted under successive rounds of US export controls. That restricted demand is not simply subtracted from the global market. It is redistributed. The wafer starts that Biren cannot buy are allocated to other customers. In Q1 2024, TSMC's own guidance indicated that HPC — the category that encompasses inference accelerators — represented 43 per cent of total revenue, up from 34 per cent in Q1 2022. Part of that growth reflects genuine demand expansion. Part reflects the consolidation of available demand into customers who face no export restrictions.

What to watch

The signals that will determine whether the pattern identified here accelerates or stalls in the next twelve months are specific. Five are worth tracking directly.

TSMC's Q1 2024 earnings call, scheduled for 18 April 2024, will include the first public breakdown of N2 risk production wafer starts by customer segment. Any number above 15 per cent allocated to "HPC and AI" at this stage would indicate the inference ramp is ahead of the company's own internal projections from Q4 2023.
Apple's September 2024 product cycle will disclose the M4 chip in some form. The node — confirmed by multiple supply chain contacts as N2 — and the on-device inference benchmarks Apple chooses to publish will establish the performance baseline against which every competing inference SKU will be measured for the next 18 months.
Qualcomm's response to N3E allocation compression will arrive through its advanced technology commitments for 2025. If the company does not announce a meaningful N2 prepayment structure before Q3 2024, the Snapdragon X inference story falls a node generation behind Apple for the foreseeable future.
Arizona Fab 21 production milestones are tracked against CHIPS Act disbursement conditions. A delay in Phase 1 reaching full N4P capacity pushes back the entire US sovereign inference supply chain timeline by a corresponding amount. Every quarter of delay is a quarter in which the allocation dynamics described here continue to be governed entirely by Taiwan-based capacity.
Samsung 3GAA yield improvement data will begin appearing in third-party supply chain reports by mid-2024. If yields cross 65 per cent on inference-relevant die sizes — a threshold that supply chain contacts identify as the breakeven point for cost-competitiveness with TSMC N3E — the allocation war gains a meaningful third competitor and TSMC's pricing leverage compresses accordingly.
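The yield threshold in the last item rests on simple cost-per-good-die arithmetic, sketched below in Python. Every input is a placeholder chosen only so the result lands near the 65 per cent figure quoted above. Wafer prices, die counts and the incumbent's yield are assumptions, not reported numbers, and the sketch assumes both foundries fit the same number of gross dies on a wafer.

```python
def cost_per_good_die(wafer_price, gross_dies_per_wafer, yield_rate):
    """Wafer price spread over the dies that actually work."""
    if not 0 < yield_rate <= 1:
        raise ValueError("yield_rate must be in (0, 1]")
    return wafer_price / (gross_dies_per_wafer * yield_rate)


def breakeven_yield(challenger_wafer_price, incumbent_wafer_price, incumbent_yield):
    """Yield at which the challenger's cost per good die equals the incumbent's,
    assuming identical gross dies per wafer at both foundries."""
    return incumbent_yield * challenger_wafer_price / incumbent_wafer_price


# Hypothetical inputs, for illustration only.
INCUMBENT_WAFER_PRICE = 20_000.0   # assumed price per wafer
CHALLENGER_WAFER_PRICE = 16_250.0  # assumed discounted price per wafer
INCUMBENT_YIELD = 0.80             # assumed mature-node yield
GROSS_DIES = 60                    # assumed dies per wafer for a large inference die

threshold = breakeven_yield(CHALLENGER_WAFER_PRICE, INCUMBENT_WAFER_PRICE, INCUMBENT_YIELD)
print(f"Breakeven yield under these assumptions: {threshold:.0%}")

# Below the threshold the challenger's good dies cost more; above it, less.
for y in (0.50, threshold, 0.75):
    challenger = cost_per_good_die(CHALLENGER_WAFER_PRICE, GROSS_DIES, y)
    incumbent = cost_per_good_die(INCUMBENT_WAFER_PRICE, GROSS_DIES, INCUMBENT_YIELD)
    print(f"yield {y:.0%}: challenger {challenger:,.0f} vs incumbent {incumbent:,.0f} per good die")
```

The breakeven depends only on the ratio of wafer prices and the incumbent's yield, which is why a yield threshold can be quoted without disclosing absolute prices.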

Frequently asked

What exactly is an inference SKU and how does it differ from a training chip?

An inference SKU is a chip optimised for running a trained model (generating outputs from inputs) rather than for the gradient computation that builds the model in the first place. Training requires enormous floating-point throughput, high-bandwidth memory, and multi-chip communication fabric. Inference requires high throughput on INT8 or FP8 arithmetic, aggressive power efficiency, and in some cases the ability to run on battery-constrained hardware. The architectural requirements diverge enough that leading chip designers produce separate products for each workload. TSMC's allocation categories reflect this: a "training SKU" and an "inference SKU" from the same fabless customer can occupy different nodes entirely, because the power and density requirements are different.

Why does foundry node allocation matter for private inference specifically, as opposed to cloud inference?

Private inference, meaning running models locally on user devices or in on-premise enterprise infrastructure rather than routing queries to cloud APIs, has a tighter power envelope than cloud inference. A data-centre inference chip can dissipate 300 watts. A laptop chip operates under 25 watts. A phone chip operates under five. The only way to deliver competitive inference throughput at those power levels is with advanced node fabrication. N3 and N2 enable the transistor density that makes private inference viable at the edge. Older nodes do not. This means foundry allocation is not just a supply chain question for private inference; it is an architectural prerequisite.

How exposed is the inference chip market to a Taiwan military contingency?

The exposure is high and concentrated. As of early 2024, approximately 90 per cent of global wafer starts at N3 and more advanced nodes occur at TSMC facilities in Hsinchu and Tainan. Arizona Fab 21 is producing N4P at partial capacity. There is no other advanced-node production capacity outside Taiwan at meaningful volume. A contingency affecting TSMC's Taiwan operations would halt the production of every inference chip currently on the market within the time it takes to exhaust inventory, typically four to six months for leading-edge devices. Arizona Phase 2 and the incremental capacity being built in Japan (Kumamoto) reduce but do not eliminate this concentration risk on a ten-year horizon.

What leverage do smaller inference chip companies have against Apple and NVIDIA in TSMC allocation negotiations?

Minimal, at current scale. TSMC's allocation framework prioritises customers by committed volume, long-term relationship value, and strategic importance to TSMC's own technology roadmap. A company committing $500M in annual wafer starts does not negotiate from the same position as Apple with its multi-billion prepayment structure. The practical lever available to smaller customers is differentiated process requirements: if a startup is developing a chip architecture that requires a capability TSMC is developing anyway (a particular memory integration scheme, a photonics interface, a packaging technology), it can earn priority through technical collaboration rather than volume commitment. Several inference accelerator startups are pursuing exactly this route, trading IP co-development for allocation access.

Will N2 change the inference economics enough to matter for enterprise buyers?

The early characterisation data says yes, with a qualification. N2's density and power improvement over N3E is roughly 15 per cent on a like-for-like die. For a chip running continuous inference (call it eight hours of active inference workload per day in an enterprise setting), that compounds into meaningful operating cost reduction over the three-to-five-year hardware refresh cycles enterprises plan around. The qualification is that N2 will be expensive at launch: early N2 wafer pricing is tracking 20 to 25 per cent above mature N3E pricing. The crossover point, where N2 total cost of ownership beats N3E once operating costs are included, depends on workload intensity. High-intensity inference deployments cross over in approximately 18 months. Lower-intensity deployments may not cross over within the hardware lifetime at all.
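The crossover logic in the last answer can be sketched as a payback calculation: the up-front premium for the newer node divided by the monthly energy saving it buys. The Python below is a deliberately simplified model under stated assumptions. The per-accelerator premium, the energy price and the five-year lifetime are hypothetical placeholders, and the model ignores cooling overhead, utilisation changes and any throughput gain from higher density. Only the 300 watt envelope, the roughly 15 per cent improvement and the eight-hour duty cycle come from the text.

```python
def crossover_months(capex_premium, baseline_power_kw, power_saving_frac,
                     active_hours_per_day, energy_price_per_kwh, days_per_month=30):
    """Months until cumulative energy savings repay the up-front premium.
    Returns None when there is no saving to repay it with."""
    monthly_saving = (baseline_power_kw * power_saving_frac
                      * active_hours_per_day * days_per_month
                      * energy_price_per_kwh)
    if monthly_saving <= 0:
        return None
    return capex_premium / monthly_saving


def crosses_within_lifetime(months, lifetime_months):
    """True when the payback point falls inside the hardware lifetime."""
    return months is not None and months <= lifetime_months


# Hypothetical inputs, for illustration only.
CAPEX_PREMIUM = 100.0   # assumed extra cost per accelerator on the newer node
BASELINE_KW = 0.3       # 300 W data-centre envelope mentioned in the text
POWER_SAVING = 0.15     # roughly 15 per cent improvement cited in the text
ENERGY_PRICE = 0.20     # assumed price per kWh
LIFETIME_MONTHS = 60    # assumed five-year refresh cycle

for label, hours in (("eight hours a day", 8), ("two hours a day", 2)):
    months = crossover_months(CAPEX_PREMIUM, BASELINE_KW, POWER_SAVING, hours, ENERGY_PRICE)
    verdict = "crosses over" if crosses_within_lifetime(months, LIFETIME_MONTHS) else "does not cross over"
    print(f"{label}: payback in {months:.1f} months, {verdict} within {LIFETIME_MONTHS} months")
```

Because the payback period scales inversely with active hours, the same premium that pays off for a heavily used deployment can fall outside the refresh cycle for a lightly used one, which is the qualification the answer describes.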

The foundry's role in private inference is not passive. TSMC is not a neutral pipe through which semiconductor demand flows. The allocation decisions made in Hsinchu this year — which customers get N2 priority, how Arizona capacity is structured, what the pricing premium on advanced nodes reflects about demand concentration — are shaping the competitive landscape of private AI for the next half-decade. The companies that understand this are treating their TSMC relationship as a strategic asset, not a procurement function. The companies that do not will read about it in their competitors' product launches.

Twelve months of buyer data make one thing clear: in private inference, the foundry is not where competition ends. It is where it starts.