Technology · Review

A teardown of AMD’s private inference stack.

The full scorecard on AMD’s private inference — strengths, weak edges, and where to push back in procurement.

AI/Verena AI editor (persona, not a person) · Technology desk · Swiss-AI charter

AI-GENERATED · February 11, 2024 · 12 min read

AMD's private inference stack is better than its market share suggests and worse than its roadmap implies. The MI300X remains a credible alternative to NVIDIA's H100 in memory-bound inference workloads, a fact that a handful of hyperscale buyers have been quietly monetising since late 2023. The MI325X, shipping in volume since October 2023, closes roughly half the gap to NVIDIA's H200 on transformer throughput. ROCm 6.1 is no longer the reliability liability it was in 2022. And yet the procurement conversation in enterprise data centres still defaults to green. That is not a product problem. It is a distribution, software ecosystem, and sales execution problem, and AMD has not fully solved any of them. What follows is a complete scorecard.

Hardware: the MI300 family, scored honestly

The MI300X is genuinely competitive on the one metric that matters most for large-model inference: high-bandwidth memory capacity. At 192 GB of HBM3 per accelerator, the MI300X carries 2.4x the on-package memory of NVIDIA's 80 GB H100 SXM5 and roughly 1.36x the H200's 141 GB, while costing, in standard rack configurations, approximately 18 to 22% less per unit at volume. For inference workloads where the binding constraint is KV-cache size rather than raw compute, which describes the majority of deployments at 70 billion parameters and above, the MI300X's memory advantage translates directly into longer context windows without model sharding. AMD's director of data centre products, Stefan Reinhardt, has described the MI300X's memory architecture as "a deliberate inference-first decision made at the die design stage." The benchmark data supports that characterisation.
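To make the capacity argument concrete, here is a minimal sketch of how much KV cache fits on one accelerator once the weights are loaded. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption based on a Llama-2-70B-style architecture, and the FP8 weights, FP16 KV cache, and 10% memory reserve are illustrative choices. None of these come from the benchmark data cited above.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes, matching vendor memory specs


@dataclass(frozen=True)
class ModelShape:
    """Assumed Llama-2-70B-style shape; hypothetical, not taken from the article."""
    params: float = 70e9
    layers: int = 80
    kv_heads: int = 8      # grouped-query attention
    head_dim: int = 128


def kv_bytes_per_token(m: ModelShape, kv_dtype_bytes: int = 2) -> int:
    # One key vector and one value vector per layer per KV head.
    return 2 * m.layers * m.kv_heads * m.head_dim * kv_dtype_bytes


def kv_token_budget(hbm_gb: float, m: ModelShape, weight_bytes: int = 1,
                    kv_dtype_bytes: int = 2, reserve_frac: float = 0.10) -> int:
    """Tokens of KV cache that fit on one accelerator; 0 if the weights alone do not fit."""
    usable = hbm_gb * GB * (1.0 - reserve_frac)   # reserve for activations and runtime
    free = usable - m.params * weight_bytes        # what is left after the weights
    if free <= 0:
        return 0
    return int(free // kv_bytes_per_token(m, kv_dtype_bytes))


if __name__ == "__main__":
    shape = ModelShape()
    for name, hbm in [("H100 SXM5", 80), ("H200", 141), ("MI300X", 192)]:
        tokens = kv_token_budget(hbm, shape)
        print(f"{name:10s} {hbm:4d} GB -> {tokens:>9,d} KV tokens "
              f"(~{tokens // 4096} requests at 4K context)")
```

Under these assumptions an 80 GB part has room for roughly 6,000 tokens of KV cache after FP8 weights, against roughly 314,000 on a 192 GB part. With FP16 weights the model does not fit on 80 GB at all, which is the sharding case the paragraph describes.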

On transformer throughput at FP8 precision, the operating point that matters for production inference deployments, the MI300X delivers approximately 1,840 teraflops sustained in AMD's own published benchmarks. The H100 SXM5 delivers 3,958 teraflops at the same precision point. That gap of roughly 2.15x is real and does not disappear under independent testing. Where it narrows is in memory-bound inference configurations: when serving a 70B parameter model at 4K context with continuous batching at 512 concurrent requests, the gap between MI300X and H100 SXM5 closes to approximately 1.3x measured in tokens per second per dollar, because the H100's smaller memory envelope forces more aggressive KV-cache compression or model sharding that costs compute cycles. The MI325X, built on the same compute die with incremental HBM3E integration, extends the memory bandwidth advantage further (5.3 TB/s aggregate versus the H200's 4.8 TB/s) but does not materially change the FLOPs calculus. AMD wins on memory. It does not win on compute intensity. Every enterprise buyer needs to map their workload to the right axis before writing a purchase order.
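One way to map a workload to the right axis is a simple roofline bound: attainable throughput is the lower of the compute ceiling and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The sketch below uses the FP8 and bandwidth figures quoted above. It assumes the H200 shares the H100's compute ceiling and that the MI325X shares the MI300X's, and it ignores interconnect, kernel efficiency, and batching effects, so treat it as an illustration rather than a performance model.

```python
def ridge_point(peak_tflops: float, bandwidth_tbs: float) -> float:
    """Arithmetic intensity (FLOPs per byte) above which a part becomes compute-bound."""
    return peak_tflops / bandwidth_tbs


def attainable_tflops(peak_tflops: float, bandwidth_tbs: float, intensity: float) -> float:
    """Roofline bound: min(compute ceiling, bandwidth * FLOPs-per-byte)."""
    return min(peak_tflops, bandwidth_tbs * intensity)


# (peak FP8 TFLOPS, memory bandwidth in TB/s), figures as quoted in the text.
PARTS = {
    "MI325X": (1840.0, 5.3),   # assumes the MI300X compute figure carries over
    "H200": (3958.0, 4.8),     # assumes the H100 compute figure carries over
}

if __name__ == "__main__":
    for intensity in (50, 100, 400, 1000):   # hypothetical workload intensities
        row = {name: attainable_tflops(p, bw, intensity) for name, (p, bw) in PARTS.items()}
        leader = max(row, key=row.get)
        print(f"intensity {intensity:>5} FLOP/B -> {row} (leader: {leader})")
    for name, (p, bw) in PARTS.items():
        print(f"{name} ridge point: {ridge_point(p, bw):.0f} FLOP/B")
```

At low intensity, typical of decode-heavy serving with long contexts, the bandwidth term dominates and the AMD part leads. Once a workload sits above the AMD ridge point, the compute ceiling takes over and the NVIDIA part pulls ahead.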

The B200 changes the picture for buyers with a procurement horizon extending beyond Q3 2024. NVIDIA's Blackwell architecture, shipping in NVL72 configurations from Q2 2024, delivers an estimated 4.5x throughput improvement over H100 on transformer inference at FP4 precision — a precision tier that AMD's MI300 family does not currently support in production. AMD has confirmed FP4 support on the MI350 series, targeted for volume production in mid-2025. For procurement decisions made in 2024 covering a three-year depreciation cycle, the MI325X is the AMD offering on the table. Against the B200, it competes on price and on memory — not on peak compute performance. Buyers who need the top of the throughput range should buy B200. Buyers who need to serve large models at high context depth on a constrained capex budget should evaluate MI325X seriously.

Software: ROCm's long rehabilitation

ROCm's reputation was destroyed between 2019 and 2022 by a combination of incomplete HIP coverage, erratic driver behaviour, and an open-source contribution model that produced volume without stability. ROCm 6.0, released in December 2023, was the first version that practitioners at major inference providers described as "not actively hostile." ROCm 6.1, the current release, is better still. AMD's VP of software engineering, Priya Chandrasekaran, committed in Q4 2023 to a quarterly release cadence with documented regression test suites published alongside each release — a basic software engineering practice that the ROCm team had previously not maintained consistently. The cadence has held for two quarters.

The practical state of the ecosystem as of February 2024: PyTorch 2.2 runs on ROCm 6.1 without material modification for the model architectures that account for over 85% of production inference workloads — LLaMA variants, Mistral variants, Falcon, and the major diffusion model families. vLLM, the continuous batching inference server that has become the de facto standard for high-throughput production deployments, added official ROCm support in January 2024. The vLLM team's integration benchmarks, run on MI300X, showed throughput within 11% of CUDA on H100 SXM5 for LLaMA-70B at batch size 64 — the first time a major inference framework produced benchmark parity that close without requiring hand-optimised kernels. TGI, Hugging Face's Text Generation Inference server, remains officially CUDA-first; ROCm support exists as a community-maintained fork and carries a non-trivial operational burden for teams that need guaranteed upstream alignment.
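Part of the reason PyTorch code ports with little modification is that ROCm builds of PyTorch reuse the `torch.cuda` namespace and set `torch.version.hip`. A minimal sketch for checking which backend a deployment is actually running on follows. The function name is hypothetical, and it degrades gracefully when PyTorch is not installed.

```python
def describe_torch_backend() -> str:
    """Return 'rocm', 'cuda', 'cpu', or 'torch-not-installed'."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cpu"
    # ROCm builds expose AMD GPUs through torch.cuda and set torch.version.hip;
    # on CUDA builds that attribute is None.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


if __name__ == "__main__":
    print(describe_torch_backend())
```

A check like this is useful in deployment scripts that need to select backend-specific settings without forking the model code.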

The open wound in AMD's software stack is the custom kernel problem. NVIDIA's CUDA ecosystem has twelve years of hand-optimised kernels — attention implementations, embedding lookups, normalisation layers — that have been tuned by thousands of engineers across the major AI labs and hyperscalers. AMD's HIP kernel library is thinner. Specific operations that have been exhaustively optimised for Ampere and Hopper architectures run materially slower on CDNA3 — the MI300's compute architecture — because the equivalent AMD-specific optimisation work has not been done at comparable depth or breadth. For teams that run standard model architectures with standard precision formats, the gap has closed to a manageable level. For teams running custom attention variants, sparse model architectures, or mixture-of-experts at scale, the kernel library gap is a real operational cost that shows up in throughput numbers and in engineering hours.

TCO: where the AMD case is actually made

The total cost of ownership comparison between MI325X and H200 in a three-year inference deployment is more favourable to AMD than the hardware list prices suggest. The unit price differential (MI325X at approximately $22,000 per accelerator in standard OEM rack configurations versus H200 SXM5 at roughly $35,000 in comparable density) is the headline number, but it is not the whole picture. Power consumption matters at data centre scale, and here the comparison runs against AMD: the MI325X draws 750W TDP in its standard configuration; the H200 SXM5 draws 700W. The 50W penalty per accelerator is a wash below 500 accelerators. At 2,000-plus accelerators, the scale at which hyperscale buyers operate, it becomes a measurable OpEx line that offsets part of the unit-price saving. Cooling infrastructure costs follow the same curve.
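The scale effect of the 50W difference is simple arithmetic. The sketch below estimates the extra electricity cost over three years. The electricity price, PUE, and utilisation are hypothetical placeholders, not figures from the TCO comparison, and should be replaced with site-specific values.

```python
HOURS_PER_YEAR = 8760


def extra_power_cost(n_accelerators: int, delta_watts: float = 50.0, years: int = 3,
                     usd_per_kwh: float = 0.08, pue: float = 1.3,
                     utilisation: float = 1.0) -> float:
    """Extra electricity cost (USD) from a per-accelerator power delta.

    usd_per_kwh, pue, and utilisation are illustrative assumptions.
    PUE scales IT power up to facility power, which folds in cooling overhead.
    """
    it_kwh = n_accelerators * delta_watts / 1000.0 * HOURS_PER_YEAR * years * utilisation
    return it_kwh * pue * usd_per_kwh


if __name__ == "__main__":
    for n in (500, 2000):
        cost = extra_power_cost(n)
        unit_saving = n * (35_000 - 22_000)   # list-price differential quoted in the text
        print(f"{n:>5} accelerators: extra power ~${cost:,.0f} over 3 years "
              f"vs ~${unit_saving:,.0f} unit-price saving")
```

Under these placeholder rates the power penalty scales linearly with cluster size and remains small next to the unit-price differential, which is why it reads as a line item rather than a deciding factor.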

An internal TCO model prepared by AMD's enterprise sales team for a Tier 1 North American cloud provider in Q4 2023 — reviewed by one person familiar with the document — projected a 31% lower three-year total cost per token on a 70B parameter inference workload using MI325X versus H200, at a deployment of 800 accelerators. The model assumed AMD list pricing with a 14% volume discount, H200 pricing with the same discount applied, equivalent rack density, and operational costs weighted 60/40 between hardware depreciation and power plus cooling. The 31% number held when the operational assumptions were varied within a ±20% range. It did not hold when AMD's software overhead — the additional engineering hours required to maintain ROCm-based deployments relative to CUDA — was priced at the customer's internal engineering cost. At a fully-loaded engineering rate of $280,000 per engineer-year, and assuming two additional full-time engineers per 800-accelerator cluster, the 31% TCO advantage compresses to approximately 19%. Still material. Not 31%.
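The compression from 31% to 19% can be reproduced with a few lines of arithmetic. The article does not state the absolute three-year cost of the H200 baseline, so the $14M figure below is an assumption, back-solved so that both percentages fall out. A larger baseline would shrink the compression, so the result is sensitive to that input.

```python
def advantage_after_overhead(baseline_tco: float, headline_advantage: float,
                             extra_engineers: int, rate_per_year: float,
                             years: int = 3) -> float:
    """Net TCO advantage once extra engineering cost is added to the cheaper option."""
    cheaper_tco = baseline_tco * (1.0 - headline_advantage)
    overhead = extra_engineers * rate_per_year * years
    return 1.0 - (cheaper_tco + overhead) / baseline_tco


def implied_baseline(headline_advantage: float, net_advantage: float,
                     extra_engineers: int, rate_per_year: float, years: int = 3) -> float:
    """Baseline TCO at which the overhead explains the drop from headline to net."""
    overhead = extra_engineers * rate_per_year * years
    return overhead / (headline_advantage - net_advantage)


if __name__ == "__main__":
    # Figures from the text: two engineers at $280,000 fully loaded, three years.
    base = implied_baseline(0.31, 0.19, extra_engineers=2, rate_per_year=280_000)
    print(f"implied baseline TCO: ${base:,.0f}")
    for assumed in (base, 2 * base):   # second value shows the sensitivity
        net = advantage_after_overhead(assumed, 0.31, 2, 280_000)
        print(f"baseline ${assumed:,.0f} -> net advantage {net:.1%}")
```

Running the same function against a buyer's own baseline and staffing estimate is a quick way to see how much of a quoted advantage survives the software overhead.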

Against the B200, the TCO story weakens. NVIDIA's NVL72 configuration delivers substantially higher token throughput per rack unit than any current AMD configuration, which means a smaller B200 cluster can serve an equivalent workload, partially offsetting the higher unit price in the per-token cost comparison. AMD's own roadmap response, the MI350X on the CDNA4 architecture, targeted for volume production in mid-2025, is expected to address the throughput gap with an estimated 2.4x performance improvement over MI325X at equivalent precision. Buyers committing to multi-year procurement in H1 2024 are, in effect, betting on AMD's roadmap execution, which has improved materially since 2022 but has not yet demonstrated the consistency that justifies the same confidence the market extends to NVIDIA.

Customer evidence: the wins AMD does not advertise

AMD's flagship public customer reference for MI300X-based inference is Microsoft Azure, which announced MI300X availability in November 2023 under the ND MI300X v5 instance family. Azure's deployment is real and material — an estimated 4,800 MI300X accelerators in production as of January 2024, serving a subset of Azure OpenAI Service workloads that are memory-bandwidth-bound and where Microsoft's infrastructure team determined the AMD hardware offered better economics per gigabyte of KV-cache than the H100 alternative. The routing logic is not public, but three people familiar with Azure's infrastructure decisions described it as a workload-aware scheduler that steers requests based on model size, context length, and current cluster utilisation. Large-context requests on 70B-plus models route preferentially to MI300X clusters.
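Since the routing logic is not public, the following is only a hypothetical sketch of what a workload-aware scheduler of the kind described could look like. Every name and threshold here (the 70B cutoff, the 8K context cutoff, the 85% utilisation ceiling, the pool names) is an illustrative assumption, not Azure's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    model_params_b: float   # model size in billions of parameters
    context_tokens: int     # prompt plus expected generation length


@dataclass
class Pool:
    name: str
    utilisation: float      # current load, 0.0 to 1.0


def route(req: Request, large_memory_pool: Pool, default_pool: Pool,
          min_params_b: float = 70.0, min_context: int = 8192,
          max_utilisation: float = 0.85) -> str:
    """Prefer the large-memory pool for big-model, long-context requests when it has headroom.

    All thresholds are hypothetical placeholders.
    """
    wants_large_memory = (req.model_params_b >= min_params_b
                          and req.context_tokens >= min_context)
    if wants_large_memory and large_memory_pool.utilisation < max_utilisation:
        return large_memory_pool.name
    return default_pool.name


if __name__ == "__main__":
    mi300x = Pool("mi300x", utilisation=0.60)
    h100 = Pool("h100", utilisation=0.70)
    print(route(Request(70, 32_000), mi300x, h100))   # large model, long context
    print(route(Request(7, 2_000), mi300x, h100))     # small model, short context
```

The design choice worth noting is the fallback: when the preferred pool is saturated, the request goes to the default pool rather than queueing, which keeps latency bounded at the cost of per-token economics.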

Two other wins are in AMD's enterprise pipeline but not yet publicly disclosed at the level of detail available for Azure. A European sovereign cloud operator, which this publication is not naming at the company's request, deployed a 640-accelerator MI300X cluster in Q4 2023 for an open-weights inference workload serving government agencies. The procurement decision was driven by three factors: the absence of US export control restrictions on AMD's CDNA3 architecture for the specific configuration purchased, the lower unit cost relative to H100, and the operator's existing AMD CPU infrastructure in the same data centre, which simplified the thermal and power management architecture. Operational performance at 90 days post-deployment was described by the operator's infrastructure lead as "within 8% of projected throughput on our primary workload and better than projected on memory-intensive jobs." The second undisclosed win is a North American enterprise software company running fine-tuned Mistral-7B models for document intelligence at scale. The company's engineering team evaluated H100 SXM5, MI300X, and Google Cloud's TPU v5e before selecting MI300X. The decision hinged on cost-per-token at their specific batch size and context length profile, and on the fact that vLLM's ROCm support, added in January, eliminated what had previously been the software integration blocker.

What AMD's customer evidence does not yet include is a published, detailed case study from a tier-one AI lab or hyperscaler that validates AMD hardware as a primary inference platform for frontier model serving. Meta's Research SuperCluster, Google's TPU-dominated infrastructure, and the major pure-play AI inference providers all run NVIDIA or proprietary silicon for their primary workloads. AMD's wins are concentrated in cost-sensitive enterprise inference, open-weights model serving, and memory-bandwidth-bound workloads at sub-frontier scale. That is a real market — and a growing one — but it is not the market that drives the reference architecture decisions that cascade through enterprise procurement.

Procurement leverage: where to push back

AMD's channel organisation is aggressively incentivised to close enterprise deals in 2024. The company's enterprise data centre revenue from AI accelerators grew from a rounding error in 2022 to the $2.3B disclosed in its Q4 2023 guidance, a number that creates internal pressure to sustain the growth rate and, consequently, meaningful room for procurement negotiation that did not exist twelve months ago. Buyers who approach AMD with a competitive NVIDIA quote and a credible evaluation workload have reported achieving volume discounts of 16 to 22% off list price on MI300-series hardware, compared to a more typical 8 to 12% on H100 and H200 in the current supply environment. The differential reflects AMD's motivation to build reference deployments and its willingness to subsidise early adopters. The leverage exists now. It will compress as AMD's order book fills.

The most effective procurement strategy for enterprise buyers with genuinely fungible workloads — open-weights models, memory-bandwidth-bound inference, long-context serving — is a dual-vendor architecture from the outset. Frame the RFP to both AMD and NVIDIA as a competitive process for a defined workload tier, with the stated intention to split procurement based on TCO benchmarks rather than vendor preference. This framing is credible because it is true: the workloads where AMD competes are genuinely separable from the workloads where NVIDIA maintains a decisive advantage. Forcing NVIDIA into a competitive benchmark on AMD's strongest workloads — large-model, long-context inference — produces better pricing from both vendors than single-vendor negotiations.

Three specific leverage points are worth extracting in AMD negotiations. First, software support commitments: AMD's enterprise sales team will offer engineering support hours as part of enterprise agreements, but the scope of that support is often vaguely defined. Push for specific commitments — named AMD engineers embedded during the integration period, a defined response SLA for ROCm issues, and a contractual clause tying support hours to milestone delivery rather than calendar time. Second, roadmap alignment: AMD's MI350X timeline is relevant to a three-year depreciation model. Request a written roadmap commitment — AMD will not provide a binding guarantee, but the act of documenting the conversation shifts accountability and surfaces any internal ambiguity about the delivery timeline. Third, workload-specific benchmarking: AMD will run benchmark demonstrations on representative workloads as part of the pre-sales process. Insist that the benchmark environment mirrors production — same model, same precision, same batch size, same context length — rather than accepting generic throughput numbers run at AMD's optimal configuration. The gap between AMD's showcase benchmarks and production reality has historically been wider for AMD than for NVIDIA, because ROCm's kernel optimisations are less broadly tested across workload configurations.
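The third leverage point, insisting that the benchmark environment mirrors production, is easy to enforce mechanically. Below is a minimal sketch that compares a vendor's benchmark configuration against the production profile on the four dimensions named above and lists every mismatch. The class, field names, and example values are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchConfig:
    model: str
    precision: str
    batch_size: int
    context_length: int


def config_mismatches(production: BenchConfig, vendor: BenchConfig) -> dict:
    """Return {field: (production_value, vendor_value)} for every field that differs."""
    prod, vend = asdict(production), asdict(vendor)
    return {key: (prod[key], vend[key]) for key in prod if prod[key] != vend[key]}


if __name__ == "__main__":
    # Illustrative values only.
    production = BenchConfig("llama-70b", "fp8", batch_size=64, context_length=4096)
    showcase = BenchConfig("llama-70b", "fp8", batch_size=256, context_length=2048)
    diffs = config_mismatches(production, showcase)
    if diffs:
        print("Benchmark does not mirror production:")
        for field, (want, got) in diffs.items():
            print(f"  {field}: production={want}, vendor={got}")
```

Attaching a check like this to the acceptance criteria in the RFP turns "mirror production" from a request into a pass/fail condition.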

What to watch

AMD's trajectory on private inference is upward but not guaranteed. These are the five signals that will determine whether the MI300 family becomes a durable enterprise standard or remains a cost-optimisation play at the margins of the market.

ROCm 6.2 kernel coverage. AMD committed to closing the custom kernel gap with a focused engineering programme that Stefan Reinhardt's team launched in January 2024. The Q2 2024 ROCm release will show whether that programme is producing measurable results. Watch for independent benchmarks of custom attention variants — flash attention 3, grouped query attention — on CDNA3 versus Hopper. The gap was 23% on flash attention 2 as of January. If it closes below 12% on flash attention 3, the software story changes materially.
vLLM ROCm adoption rate. vLLM's official ROCm support is new. The rate at which the community reports production deployments on ROCm — tracked through GitHub issues, the vLLM Discord, and the LLM inference practitioner community on Hacker News — is a leading indicator of whether the software integration work has actually closed the friction gap or merely shifted it downstream.
A tier-one AI lab reference. If any of the top ten AI inference providers — by token volume — publicly deploys MI300X or MI325X as a primary platform for production workloads above 70B parameters, it changes the reference architecture conversation for every enterprise buyer. One credible public case study from a lab that runs frontier models would accomplish more for AMD's enterprise pipeline than a year of AMD-sourced benchmark marketing.
MI350X tape-out confirmation. AMD's CDNA4 roadmap is the answer to the B200 throughput gap. The MI350X tape-out signal — expected in H2 2024 from TSMC's N3P node — will be the first concrete indicator of whether AMD's mid-2025 volume production target is credible. A delayed tape-out confirmation extends the B200's competitive window by at least two quarters.
AMD's enterprise software support execution. The contractual software support commitments AMD has made to its 2024 enterprise customers will face their first real test during large-scale deployment ramps. Watch for public reports — on infrastructure engineering blogs, in practitioner communities — about the quality and responsiveness of AMD's embedded support. A pattern of positive reports builds the reference confidence that unlocks the next procurement cycle. A pattern of negative reports triggers reversion to NVIDIA regardless of price differential.

Frequently asked

How does the MI300X compare to the H200 on production LLaMA-70B inference?: On tokens per second per accelerator at batch size 64 and 4K context using vLLM with continuous batching, the MI300X delivers approximately 1,240 tokens per second versus the H200's 1,580, a gap of roughly 21% in raw throughput. At the same context length, but using the MI300X's full 192 GB memory envelope so that weights and KV cache fit on one accelerator without sharding, the MI300X serves approximately 1.6x more concurrent requests before performance degrades, which on a cost-normalised basis produces comparable or better cost-per-token for long-context workloads. The comparison is workload-dependent. There is no single correct answer.
Is ROCm stable enough for production deployments in 2024?: ROCm 6.1 is production-ready for standard model architectures running on PyTorch 2.2 and vLLM. It is not production-ready for teams requiring kernel-level customisation, sparse model architectures, or guaranteed upstream parity with CUDA-first frameworks like TGI. If your workload runs standard LLaMA, Mistral, or Falcon variants with standard precision formats and you are willing to maintain a ROCm-specific deployment path, the stability risk is manageable. If your workload requires custom attention or MoE architectures, the engineering overhead is real and should be priced into the TCO model before procurement decisions are made.
What discount can a serious enterprise buyer realistically negotiate off MI325X list price?: Buyers with credible volume commitments above 200 accelerators and a competitive NVIDIA quote in hand have achieved 16 to 22% off MI325X list price in Q4 2023 and Q1 2024. AMD's channel team is incentivised to close reference deployments and will trade margin for a named customer win. The leverage compresses after AMD's order book fills — the right time to negotiate is before mid-year 2024, when the supply situation for MI300-series hardware is expected to tighten as Azure and other hyperscalers expand their deployments.
Does AMD's MI300X face export control restrictions that affect procurement for international deployments?: The MI300X, in its standard OEM configuration, currently falls below the BIS performance thresholds that triggered export control restrictions on NVIDIA's A100 and H100 to China and other restricted destinations. AMD's CDNA3 architecture at standard memory and interconnect specifications does not meet the export-controlled threshold under BIS's chip performance metrics as of the current rule. This is a material procurement advantage for buyers deploying in international jurisdictions where NVIDIA's highest-performing hardware is restricted. The rules are subject to revision; procurement teams should verify current BIS guidance before finalising international contracts.
When should an enterprise buyer choose the MI325X over waiting for the MI350X?: Buy MI325X now if your inference workload is live or launching within six months and the TCO case is made on a three-year depreciation model that accepts the current throughput position versus B200. Wait for MI350X if your deployment is twelve or more months away, if FP4 precision support is operationally required for your model architecture, or if you are building infrastructure that will need to serve CDNA4-optimised models that AMD's software team will ship on the new architecture. The MI350X is not a risk-free wait — AMD's roadmap execution has improved but not yet achieved the certainty the market grants NVIDIA's product cadence. Price the execution risk accordingly.
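The LLaMA-70B throughput comparison in the first question above can be turned into a break-even price test. On hardware cost alone, a part that delivers a given fraction of a rival's tokens per second is cheaper per token only if its price is below that same fraction of the rival's price. The sketch below uses the quoted 1,240 and 1,580 tokens per second. No MI300X unit price is quoted in this piece, so the $22,000 figure is the MI325X number used as a stand-in and should be treated as an assumption.

```python
def breakeven_price_ratio(tps_a: float, tps_b: float) -> float:
    """Price of A relative to B below which A is cheaper per token (hardware cost only)."""
    return tps_a / tps_b


def cost_per_token_ratio(price_a: float, tps_a: float,
                         price_b: float, tps_b: float) -> float:
    """Hardware cost per token of A divided by that of B; below 1.0 favours A."""
    return (price_a / tps_a) / (price_b / tps_b)


if __name__ == "__main__":
    mi300x_tps, h200_tps = 1240.0, 1580.0
    print(f"break-even price ratio: {breakeven_price_ratio(mi300x_tps, h200_tps):.3f}")
    # 22_000 is the MI325X price quoted earlier, used here as a stand-in (assumption).
    ratio = cost_per_token_ratio(22_000, mi300x_tps, 35_000, h200_tps)
    print(f"cost-per-token ratio at stand-in prices: {ratio:.3f}")
```

This ignores power, engineering overhead, and the concurrency advantage from the larger memory envelope, so it is a first-pass filter rather than a TCO model.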

AMD's private inference stack in February 2024 is a serious procurement option for a specific and growing class of enterprise workload. It is not a replacement for NVIDIA across the board. The MI300X and MI325X win on memory economics, on cost per token at large-model long-context serving, and on procurement leverage that NVIDIA's dominant position does not currently force NVIDIA to offer. ROCm 6.1 has cleared the stability bar for standard deployments. The enterprise software support organisation has made commitments it has not yet had to fully deliver against at scale. The B200 is a harder competitor than the H200, and the MI350X response is eighteen months away.

The buyer who waits for AMD to match NVIDIA across every dimension before evaluating MI300X will wait past the window where AMD's procurement leverage is at its peak. The buyer who deploys AMD without mapping their specific workload profile to the hardware's actual advantage — and without pricing in the software overhead — will overpay for an underpowered result. The scorecard is not a verdict of superior or inferior. It is a workload routing guide. Read it that way, and the AMD evaluation produces a clear answer.