
René Herzer

A guide for CIOs and CTOs who have decided against the cloud and in favor of operating in their own data center. You will learn how to derive your demand from user numbers and use cases, benchmark it against a reference capacity, and define a server and GPU configuration for the next 24 months — including a tested model-hardware matrix and a procurement template.
Introduction
What hardware do hospitals, municipalities, and government agencies need to run AI in their own data centers? This question reaches us multiple times per week. It comes from CIOs and CTOs who have made a decision: patient data, citizen data, and social data do not belong in a US cloud. They belong in their own data center.
This article is for you if you are at that point. It does not rehash the cloud-versus-on-prem debate — you have already settled that. Instead, you get a method that, after reading, lets you do five things:
Express your AI demand in tokens per day.
Benchmark that demand against a reference capacity.
Choose a realistic entry path that matches your budget.
Derive a server and GPU configuration for the next 24 months.
Write a briefing for your hardware vendor.
What you will not find here: a detailed comparison between hardware ownership and managed cloud options. That decision depends on more than hardware — it depends on your operational maturity, data center readiness, and organizational size. We treat it in a separate article on operating models.
Before we begin, it is worth highlighting three mistakes that repeatedly appear in AI infrastructure projects.
First, organizations size hardware around model names instead of actual usage. The relevant question is rarely which model you want to run. The relevant question is how many people will use it, how often, and for which tasks.
Second, many sizing exercises account only for the language model. In production environments, OCR, speech recognition, embedding models, and increasingly agentic workloads consume substantial resources as well.
Third, infrastructure is often sized for current demand rather than future adoption. In practice, successful AI deployments tend to grow much faster than initially expected.
The methodology in this article is designed to avoid exactly these mistakes.
At a Glance
Hardware sizing for AI does not mean counting CPU cores or RAM. It means knowing how many tokens per day your organization processes.
An employee actively using AI generates 30,000 to 200,000 tokens per day, depending on the use case.
A reference machine with 2× NVIDIA H200, running a 70B–120B class model on vLLM, produces around one billion tokens within 24 hours at full load. On 2× B200 (Blackwell), the same workload reaches roughly 1.8–2.0 billion tokens per day.
A GPU rarely runs only the language model. Embedding, speech recognition, OCR, and increasingly image generation and Large Table Models also consume VRAM. These models belong in your sizing calculation.
Context length determines KV cache requirements. With long contexts and many parallel users, the KV cache can become larger than the model itself.
Plan today with a factor of 3 to 5 over your calculated daily demand for classic use. With agentic workloads — which in June 2026 are no longer a forecast but production reality — plan with a factor of 10 to 20.
How cards are connected to each other (NVLink, PCIe) determines usable performance for large models — not VRAM totals alone.
Flagship models with 256k context (such as Kimi K2.5 or DeepSeek R2) form a separate hardware class. They deliver depth, not throughput — and require 8× H200 or 4× B200.
Small hardware delivers small use cases. That is not a weakness of the technology; it is a law of nature. Plan accordingly.
This guide follows three steps:
Estimate your demand in tokens.
Translate that demand into hardware.
Plan scaling headroom for the next 24 months.
Why Hardware Sizing for AI Works Differently Than for Traditional IT
When you size a web server, you think in CPU cores, RAM, and IOPS. The logic has been established for twenty years. AI servers follow different rules, and this is the most common stumbling block in procurement projects.
The metric that matters is not how many requests the CPU handles per second. It is how many tokens per second the GPU processes. A token is the smallest processing unit of a language model, roughly a word fragment. "Hospital" is about two tokens, an average English sentence 15 to 25. Every question, every answer, every analyzed document section gets converted into tokens and processed by the GPU.
The conclusion: if you know how many tokens your organization needs per day, you know your hardware. Without that number, you are buying blind.
CPU, system RAM, and storage are not unimportant on an AI server, but they are supporting cast. The GPU with its fast specialized memory (VRAM) determines whether your AI platform succeeds or fails.
Pillar 1: Your Demand in Tokens
The first task is expressing your demand in this unit. With a few benchmarks from practice, it works well.
How Much Is a Token in Your Daily Operations?
These values come from real deployments and serve as a starting point for your own estimates:
Use Case | Tokens per Operation |
|---|---|
Chat prompt (question + answer) | 500 – 2,000 |
Summary of a medical report or case note | 3,000 – 8,000 |
RAG retrieval with context and answer | 5,000 – 15,000 |
Heavy user per day (≈ 2h active use) | 50,000 – 200,000 |
Occasional user per day | 10,000 – 50,000 |
A Worked Example
An organization with 500 employees. How many tokens per day?
60% of employees use AI regularly → 300 active users.
Of those, 20% are heavy users: 60 people × 150,000 tokens = 9 million tokens/day.
The other 80% are occasional users: 240 people × 30,000 tokens = 7.2 million tokens/day.
Total: around 16.2 million tokens per day.
These numbers are assumptions, not predictions. Run the math with your own values — headcount, adoption rate, intensity of use.
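To rerun the worked example with your own values, a minimal Python sketch follows. The class and function names are illustrative, and the inputs are the assumptions from the example above (500 employees, 60% adoption, 20% heavy users), not measured data.

```python
from dataclasses import dataclass


@dataclass
class UserGroup:
    name: str
    headcount: int
    tokens_per_user_per_day: int


def daily_token_demand(groups):
    """Sum the daily tokens over all user groups."""
    return sum(g.headcount * g.tokens_per_user_per_day for g in groups)


# Assumptions taken from the worked example; replace with your own values.
employees = 500
adoption_rate = 0.60   # share of employees who use AI regularly
heavy_share = 0.20     # share of active users who are heavy users

active = round(employees * adoption_rate)
heavy = round(active * heavy_share)
occasional = active - heavy

groups = [
    UserGroup("heavy", heavy, 150_000),
    UserGroup("occasional", occasional, 30_000),
]

total = daily_token_demand(groups)
print(f"{active} active users, {total:,} tokens/day")
```

With the example values this prints 300 active users and 16,200,000 tokens per day. Changing the adoption rate or the per-user intensity shows quickly how sensitive the result is to each assumption.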
💡 Rule of thumb: For every 100 active employees, expect 5 to 10 million tokens per day with classic use (chat plus document analysis). Pure chat use sits at the lower end, intensive document work at the upper end.
If You Have No Empirical Values Yet
That is the rule, not the exception. Take the middle values from the table and work with them. It does not get more precise than that. You are planning an investment for 24 to 36 months.
☝️ Key Takeaway
If you can estimate your daily token demand, you can estimate your infrastructure demand.
For most organizations, hardware sizing becomes significantly easier once user numbers, adoption rates, and expected usage patterns are translated into tokens per day.
Pillar 2: From Tokens to Hardware
Once your demand estimate is in place, the hardware question becomes answerable. You need a reference value — a machine with known capacity that you can benchmark against.
The Reference Machines
📊 Reference A (Hopper): A machine with 2× NVIDIA H200, running a 70B–120B class model in FP8 or MXFP4 on the vLLM inference stack, produces around one billion tokens within 24 hours at full load.
📊 Reference B (Blackwell): The same workload on 2× NVIDIA B200 reaches roughly 1.8–2.0 billion tokens per day, thanks to NVFP4 support and higher memory bandwidth.
The example above (16.2 million tokens per day) uses around 1.6% of Reference A's capacity. An organization with 500 employees has massive headroom on a 2× H200 machine for growth and new applications.
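The comparison against the reference machines is a single division. The sketch below assumes the full-load figures quoted above and uses the lower end of the Blackwell range. The dictionary keys and the function name are illustrative.

```python
# Full-load reference capacities in tokens per 24 hours, as quoted above.
REFERENCE_TOKENS_PER_DAY = {
    "A (2x H200)": 1_000_000_000,
    "B (2x B200)": 1_800_000_000,  # lower end of the 1.8 to 2.0 billion range
}


def utilization(demand_tokens_per_day, capacity_tokens_per_day):
    """Fraction of a machine's full-load daily capacity that the demand uses."""
    return demand_tokens_per_day / capacity_tokens_per_day


demand = 16_200_000  # daily demand from the worked example
for name, capacity in REFERENCE_TOKENS_PER_DAY.items():
    share = utilization(demand, capacity)
    print(f"Reference {name}: {share:.1%} of full-load capacity")
```

The reference figures describe 24 hours at full load, while real demand usually concentrates in working hours, so a low percentage here is best read as a comfortable upper bound on headroom rather than a precise utilization forecast.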
Context Length — What Does It Mean in Pages?
Context length describes how many tokens of input and output combined a model processes. The number sounds abstract — a comparison with US Letter pages of English prose makes it tangible.
A US Letter page of English prose (Times New Roman 11pt, 1.15 line spacing, around 400 words) corresponds to roughly 550 tokens.
Context Length | US Letter Pages | Typical Application |
|---|---|---|
4,000 tokens (4k) | ~7 pages | Short chats, simple questions |
8,000 tokens (8k) | ~14 pages | Medical report + follow-up question |
16,000 tokens (16k) | ~29 pages | Multi-page case note, short RAG |
32,000 tokens (32k) | ~58 pages | Longer contracts, mid-range RAG retrieval |
65,000 tokens (65k) | ~118 pages | Patient file excerpt, complex research |
128,000 tokens (128k) | ~232 pages | Long conversations, agent sessions, case files |
200,000 tokens (200k) | ~363 pages | Very long agent sessions, large document bundles |
256,000 tokens (256k) | ~465 pages | Complete patient histories, large procurement files, multi-document legal analysis |
Why this matters: context length determines KV cache requirements, and at long contexts with many parallel users the KV cache can become larger than the model itself. When 20 employees work simultaneously with 32k context, the KV cache claims 40 to 80 GB of VRAM in addition to the model. At 256k context across 30 parallel users, the KV cache alone can exceed 100 GB, on top of the model weights.
A practical note: a model's maximum context (say 128k or 256k) is not the value you actually work with day to day. Most requests fall well below it. Plan for the context that covers 95% of your use cases — and keep the longer ones in mind as exceptions.
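The KV cache figures above can be approximated with the standard per-token formula: two vectors (key and value) per layer and KV head. The sketch below is a rough estimate under stated assumptions. The model dimensions are hypothetical and chosen only for illustration, every user is assumed to fill the full context, and engine-specific effects such as paged allocation or sliding-window attention are ignored.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies (factor 2 = key plus value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(concurrent_users, context_tokens, bytes_per_token):
    """Worst-case KV cache in GB if every user fills the whole context."""
    return concurrent_users * context_tokens * bytes_per_token / 1e9


# Hypothetical model dimensions, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 64

per_token_fp16 = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
per_token_fp8 = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

users, context = 20, 32_000
print(f"FP16 cache: {kv_cache_gb(users, context, per_token_fp16):.1f} GB")
print(f"FP8 cache:  {kv_cache_gb(users, context, per_token_fp8):.1f} GB")
```

For these assumed dimensions the estimate is about 47 GB in 16-bit precision and about 24 GB with an 8-bit cache. Models with more layers or more KV heads land correspondingly higher, which is why the real footprint should be measured on the target model.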
A GPU Rarely Carries Just the LLM
Here is a point missing from many sizing discussions: a production AI platform does not consist of a single model. Multiple models run in parallel on the same hardware:
Component | Function | VRAM Need (typical) |
|---|---|---|
LLM (language model) | Chat, summarization, RAG answer | 16 – 240 GB depending on model and quantization |
Embedding model | Vectorization for RAG | 1 – 8 GB (often runs on CPU) |
Speech recognition (Whisper) | Audio → text transcription | 3 – 10 GB |
OCR / vision model | Text recognition in scans and images | 8 – 90 GB |
TTS | Text → speech | 4 – 6 GB |
Image generation | Image models | 8 – 24 GB |
Large Table Models (emerging) | Table understanding and generation | Not yet established |
When your departments ask "Can we also do dictation from medical reports?" or "Can we make scanned files searchable?", the answer is: yes — if the corresponding models have space on the GPU.
A 2× H200 machine with 282 GB VRAM carries a 120B-class model (60 GB), Whisper (6 GB), a 30B vision model for OCR (~32 GB), and enough KV cache reserve for many parallel users. A configuration with less VRAM forces compromises — on the model, on context window, on user count, or on available functions.
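A simple budget makes the trade-off visible. The sketch below uses the component sizes named in the paragraph above and treats the two cards as one pool, which is a simplification: in practice each component also has to fit on the card it is placed on. The names are illustrative.

```python
TOTAL_VRAM_GB = 2 * 141  # 2x H200

# Weight footprints from the example above, in GB.
stack_gb = {
    "LLM, 120B class": 60,
    "Whisper": 6,
    "Vision model for OCR, 30B": 32,
}


def remaining_vram(total_gb, components):
    """VRAM left after loading all model weights."""
    return total_gb - sum(components.values())


weights_gb = sum(stack_gb.values())
free_gb = remaining_vram(TOTAL_VRAM_GB, stack_gb)
print(f"Weights: {weights_gb} GB, left for KV cache and runtime overhead: {free_gb} GB")
```

Adding a line to `stack_gb` for each model a department requests shows immediately how much KV cache reserve, and therefore how many parallel users, the new function costs.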
How GPU Memory Works Together — and When It Does Not
A question that surfaces in nearly every procurement conversation: "Can we just take three cards with 48 GB each and have 144 GB for a large model?" The honest answer: technically often not, and when possible, with significant performance losses.
The reason lies in how the cards are connected. With large models split across multiple GPUs (Tensor Parallelism), the cards constantly communicate with each other. This communication needs a fast connection — otherwise the data bus becomes a bottleneck and throughput collapses.
There are two connection types:
NVLink / NVSwitch: A direct high-speed connection between GPUs, at 600 to 1,800 GB/s depending on generation. This is the prerequisite for efficient Tensor Parallelism on large models.
PCIe: The standard bus in the server, at roughly 64 GB/s per direction and card (PCIe 5.0 x16). Sufficient for small models, a bottleneck for large ones.
Not every GPU supports NVLink. An overview of common cards:
GPU | NVLink | Suitability for Large Models (70B+) |
|---|---|---|
H100 / H200 (SXM form factor) | Yes, NVSwitch, 900 GB/s | Ideal |
H100 / H200 (PCIe form factor) | Only 2-card bridge, 600 GB/s | Limited |
B200 | Yes, NVLink 5, 1,800 GB/s | Ideal |
B300 (Blackwell Ultra) | Yes, NVLink 5, 1,800 GB/s | Ideal, 288 GB per GPU |
A100 (SXM) | Yes, 600 GB/s | Good, but aging |
L40S | No, PCIe only | Not recommended for 70B+ |
RTX 6000 Ada | No, PCIe only | Lab / evaluation only |
RTX 4090 / 5090 | No, removed by design | Not for production use |
Implication for procurement: when running a 70B or 120B class model, GPU choice is not just a VRAM question — it is a bus question. A configuration with 4× L40S sounds like 192 GB of total memory but is only of limited use for large models. A configuration with 2× H200 in SXM form factor with 282 GB and NVSwitch is the clearly better choice — even though the individual card costs more.
Another rule: Tensor Parallelism must divide the model's attention heads evenly. A model with 64 heads runs with 1, 2, 4, or 8 GPUs — not with 3 or 6. This influences the sensible number of cards per server.
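The divisibility rule is easy to check before ordering. This minimal sketch lists the GPU counts that divide a given number of attention heads evenly. The function name is illustrative, and inference engines may impose further constraints (for example on KV heads) that it does not cover.

```python
def valid_tensor_parallel_sizes(num_attention_heads, max_gpus):
    """GPU counts up to max_gpus that divide the attention heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]


print(valid_tensor_parallel_sizes(64, 8))  # [1, 2, 4, 8]
```

For a 64-head model in a server with up to eight slots, only 1, 2, 4, or 8 cards qualify, which matches the rule above.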
Replicas Instead of Tensor Parallelism: When the Model Fits on One GPU
A point that is missed in most sizing discussions: when a model fits on a single GPU, running one replica per GPU is often better than splitting one model across multiple GPUs.
The advantages:
No communication overhead between GPUs — each replica runs independently
Throughput scales linearly with GPU count: two GPUs deliver twice the throughput, four GPUs four times
NVLink becomes optional — PCIe is sufficient when GPUs don't talk to each other
Higher availability — if one GPU fails, the other replicas keep serving
A real example from production: a 5× H200 deployment running GPT-OSS-120B as five independent replicas (one per GPU) delivers more than 4 billion tokens per day in sustained operation, with peaks above 50,000 tokens per second. No tensor parallelism. The two-GPU replicas in this setup processed 2.10 and 2.11 billion tokens respectively — essentially perfectly balanced load distribution through simple round-robin routing.
When does tensor parallelism become necessary? When the model no longer fits on a single GPU. That is typically the case from the flagship class upward (1T MoE models, see below). For everything that fits on one H200 — which includes most 70B and 120B class models in FP8 or MXFP4 — replicas are the cleaner and more efficient architecture.
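The routing logic behind a replica setup can be very small. The following sketch illustrates round-robin distribution with a basic failover rule. It is not the routing layer of any particular product, the endpoint URLs are hypothetical, and a production setup would normally rely on an existing proxy or load balancer with real health checks.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hands out replicas in turn and skips those marked as down."""

    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._healthy = set(self._replicas)
        self._cycle = cycle(self._replicas)

    def mark_down(self, replica):
        self._healthy.discard(replica)

    def mark_up(self, replica):
        if replica in self._replicas:
            self._healthy.add(replica)

    def next_replica(self):
        if not self._healthy:
            raise RuntimeError("no healthy replica available")
        # At most one full pass is needed to find a healthy replica.
        for _ in range(len(self._replicas)):
            candidate = next(self._cycle)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy replica available")


# Hypothetical endpoints: one inference server per GPU.
router = RoundRobinRouter(
    f"http://gpu{i}.example.internal:8000" for i in range(5)
)
router.mark_down("http://gpu2.example.internal:8000")  # simulate a GPU failure
picks = [router.next_replica() for _ in range(8)]
print(picks)
```

After the simulated failure, requests keep flowing to the four remaining replicas at reduced total capacity, which is the availability advantage described in the list above.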
The Flagship Class: When Depth Matters More Than Volume
Most organizations are well served by the 70B–120B class. But there are use cases where a flagship model with 256k context and trillion-parameter MoE architecture (such as Kimi K2.5 or DeepSeek R2) makes a real difference:
Complete patient histories spanning years, analyzed in one pass.
Complex zoning or procurement procedures with hundreds of pages of supporting documents.
Multi-document legal analysis where context must be preserved across files.
Multimodal case review combining text, scans, and images.
This is a separate hardware class. Plan with:
8× H200 on an HGX board with NVSwitch — or alternatively 4× B200 (Blackwell with NVFP4).
Realistic throughput: 100 to 300 million tokens per day on such a node. Less than a 2× H200 machine with a 120B model.
You buy depth and context length, not volume.
The key insight: flagship deployment is not an upgrade path from the standard configuration — it is a parallel track for specific use cases. Organizations that need both should plan two distinct node types.
⚠️ Watch for the "INT4" label trap.
When a vendor or model card states that a flagship model is "INT4 quantized," that rarely means the whole model. With Kimi K2.5, for example, only the routed expert weights are INT4. Attention layers, shared experts, embeddings, and the LM head remain in BF16 (2 bytes per parameter). The expected ~250 GB turns into ~549 GB of actual weight memory. Always ask the vendor for the real VRAM footprint of all layers combined, in writing.
A practical operational tip in the same vein: with long contexts, `--kv-cache-dtype fp8` in vLLM halves KV cache memory at negligible quality loss. For 256k deployments, this should be the default.
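The label trap comes down to summing parameters times bytes per layer group instead of applying one precision to the whole model. The sketch below shows the arithmetic for a deliberately hypothetical 100B-parameter model. The parameter split is invented for illustration and does not describe any real model, and quantization metadata such as scaling factors is ignored.

```python
BYTES_PER_PARAM = {"int4": 0.5, "fp8": 1.0, "bf16": 2.0}


def weight_memory_gb(param_groups):
    """param_groups: list of (billions of parameters, dtype) tuples.

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return sum(billions * BYTES_PER_PARAM[dtype] for billions, dtype in param_groups)


# Hypothetical split of a 100B-parameter MoE model.
naive_gb = weight_memory_gb([(100, "int4")])            # label taken at face value
actual_gb = weight_memory_gb([(90, "int4"), (10, "bf16")])  # experts INT4, rest BF16

print(f"Naive estimate: {naive_gb:.0f} GB, mixed-precision footprint: {actual_gb:.0f} GB")
```

Even with only a tenth of the parameters left in BF16, the footprint in this example rises from 50 GB to 65 GB. Asking the vendor for the per-group breakdown lets you run the same sum on real numbers.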
A Note on Blackwell, B200 and B300
In June 2026, Blackwell is no longer a forecast but a real procurement option. NVIDIA's NVFP4 data format — a 4-bit floating-point representation with per-block scaling — allows a flagship model like Kimi K2.5 to run on 4× B200 instead of 8× H200 at comparable quality. The hardware count halves, and total cost of ownership for flagship-class deployments improves significantly.
The B300 (Blackwell Ultra) takes this further: 288 GB of HBM3e per GPU and 8 TB/s memory bandwidth. A single B300 has roughly twice the VRAM of an H200 — which fundamentally changes what a one- or two-GPU configuration can do.
NVFP4 requires Blackwell-generation tensor cores; Hopper-generation H200s do not support it. For new procurements in 2026, B200 or B300 deserve a serious comparison against H200 — not just for flagship use, but increasingly for the standard tier as well.
The Three Realistic Entry Paths
Before we talk about specific configurations, an honest acknowledgment: not every organization can or should buy flagship hardware. And not every use case requires it. But the inverse is equally true — undersized hardware will only ever support undersized use cases.
Small hardware delivers small use cases. That is not a weakness of the technology; it is a law of nature. Plan accordingly.
Here are three entry strategies for organizations with constrained budgets.
Path 1: Start Small, Grow Cleanly
Who this is for: Organizations with 50–500 employees, a clear pilot character, and willingness to expand in 12–18 months.
Hardware: 1× H200 (141 GB) or 1× B200 (192 GB) in a server with free slots for 1–3 additional GPUs.
What works:
Models up to ~30B parameters in full precision
70B models in FP8 quantization
120B-class MoE models like GPT-OSS-120B in MXFP4
Chat, summarization, RAG over mid-sized knowledge bases
Whisper for dictation in parallel
30–60 concurrent users realistic
What does not work:
Models above 120B parameters
Long contexts (>32k) with many parallel users
Agentic workloads at any meaningful scale
Investment logic: You buy a chassis that grows with you. The second GPU arrives when user count justifies it — not on speculation.
Path 2: Shared Infrastructure with Clear Task Separation
Who this is for: Organizations that want to run multiple smaller use cases in parallel, without needing a single "large" model.
Hardware: 2–4× L40S or equivalent (48 GB per card, PCIe).
What works:
Multiple smaller models in parallel on separate cards (one for chat, one for OCR, one for embedding)
Stable operation for 100–300 users in straightforward use cases
Very good ratio of total VRAM to investment
What does not work:
Splitting one large model across multiple cards (PCIe limit)
Running 70B+ models efficiently
Upgrade path toward flagship class
Investment logic: You buy breadth, not depth. Works well when use cases are known and stable. Works poorly when you later need a large model — then you have to buy new, not extend.
Path 3: Operate Jointly
Who this is for: Smaller municipalities, hospitals in association, districts with multiple facilities.
Model: One facility procures flagship hardware (e.g., 8× H200 or 4× B200), and several smaller facilities share capacity through a common platform, naturally with clean multi-tenancy and in a shared regional data center, not in someone else's cloud.
What works:
Access to models a single small facility could never finance
Investment costs distribute across participants
One professional operations team instead of five half-responsible IT teams
What needs to be clarified:
Governance: Who decides on model selection and updates?
Data separation: Technically clean, contractually clear
Cost distribution: By usage or by headcount?
Investment logic: Some Capex becomes Opex, but under your sovereignty, not that of a US hyperscaler. For many municipal operators, this is the only realistic path to flagship capability.
The Honest Lower Bound
Below 1× H200 (or equivalently: 1× L40S with significant limitations, 2× RTX 6000 Ada for lab use) there is no sensible production recommendation for on-premise hardware. Anything below that is lab, test, or PoC — not production. This boundary should be stated clearly because otherwise expectations form that the system cannot meet.
For organizations below this threshold — small municipalities, smaller offices, single departments — owning hardware is rarely the right answer. A managed private cloud option is usually the better path. We cover that decision in a separate article on operating models.
The Most Expensive Variant Is Rarely the Most Expensive Hardware
A machine bought too small gets replaced after 12 months — completely, not extended, because the chassis does not grow. Paying twice is more expensive than buying right once. "Starting small" is only honest when growth paths are documented: which GPUs fit the chassis, does the power supply carry it, does cooling handle full load, does the PCIe lane plan support it? If these questions are unanswered, "start small" is a synonym for "buy completely new later."
Pillar 3: Scaling Headroom for 24 Months
Anyone sizing exactly to current demand today buys again in 12 months — often with architectural breaks. Four reasons to plan bigger than your demand estimate suggests.
1. Adoption grows. Empirical observation from ongoing deployments: active use at least doubles within the first 12 months after going into production. What starts as a pilot with 50 users can reach 200 after a year. The driver is not hype but visibility: employees see colleagues working faster with AI and follow suit.
2. Use cases become more demanding. What starts as chat becomes RAG over the entire knowledge base. What starts as summarizing one document becomes comparative analysis over 20. Longer contexts, larger models, more parallel requests — tokens per user grow faster than user counts.
3. New model types arrive. Today LLM, embedding, Whisper, and OCR are the components. Within 12 to 24 months, TTS, image generation, and Large Table Models (LTM) join them in production. Every additional model consumes VRAM you plan for today.
4. Agentic workloads are no longer optional. In June 2026, agentic workloads are production reality — not a forecast. Applications where the LLM does not generate a single answer but plans multiple steps, calls tools, and processes results are increasingly the default.
What happens then: each agent step is its own LLM call, and each step carries the entire accumulated context. Token consumption grows not linearly with the number of steps but disproportionately — because context grows with every step.
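A small model shows why the growth is faster than linear. The sketch assumes that every step reprocesses the full accumulated context and then adds a fixed number of new tokens (model output plus tool result). The starting context and the per-step increment are hypothetical values, and compute savings from prefix caching in the inference engine are outside its scope.

```python
def agent_task_tokens(steps, base_context, tokens_added_per_step):
    """Total tokens processed across all LLM calls of one agent task."""
    total = 0
    context = base_context
    for _ in range(steps):
        # Each call reads the accumulated context and produces new tokens.
        total += context + tokens_added_per_step
        # The new tokens become part of the context for the next call.
        context += tokens_added_per_step
    return total


# Hypothetical values: 2,000 tokens of initial context, 1,500 new tokens per step.
for steps in (5, 20):
    print(f"{steps} steps: {agent_task_tokens(steps, 2_000, 1_500):,} tokens")
```

With these assumptions, 5 steps cost 32,500 tokens and 20 steps cost 355,000 tokens: four times the steps, but roughly eleven times the tokens.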
Documented orders of magnitude from current agentic system deployments:
Agent Type | Tokens per Task | Multiple vs. Chat |
|---|---|---|
Simple agent (3–5 steps, one tool) | 20,000 – 80,000 | 10 – 40× |
Mid-complexity (research, 10–20 steps) | 100,000 – 500,000 | 50 – 250× |
Long-running (coding, research over hours) | 1M – several M | 500×+ |
The same employee using 30,000 tokens per day in classic chat reaches 300,000 to 3,000,000 with agentic assistance. Load peaks shift because agents often run in the background, parallel to interactive chat. Multiple model sizes in parallel become the norm: a strong model for planning, a fast one for routine steps. Both need VRAM simultaneously.
💡 Rule of thumb: Plan today with a factor of 3 to 5 over your calculated daily demand for classic use. With agentic workloads in production — which in 2026 is the realistic default — plan with a factor of 10 to 20, or pick an architecture that allows fast expansion.
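Applied to a concrete demand estimate, the rule of thumb looks as follows. The sketch uses the 16.2 million tokens per day from the worked example earlier in this article and the full-load figure of Reference A as the benchmark. The names are illustrative.

```python
PLANNING_FACTORS = {"classic": (3, 5), "agentic": (10, 20)}
REFERENCE_A_TOKENS_PER_DAY = 1_000_000_000  # 2x H200 at full load


def planned_range(daily_tokens, profile):
    """Lower and upper planning target in tokens per day for a usage profile."""
    low, high = PLANNING_FACTORS[profile]
    return daily_tokens * low, daily_tokens * high


baseline = 16_200_000  # calculated daily demand from the worked example
for profile in PLANNING_FACTORS:
    low, high = planned_range(baseline, profile)
    print(
        f"{profile}: {low:,} to {high:,} tokens/day "
        f"({low / REFERENCE_A_TOKENS_PER_DAY:.1%} to "
        f"{high / REFERENCE_A_TOKENS_PER_DAY:.1%} of Reference A)"
    )
```

For the example organization, classic use plans for roughly 5 to 8 percent of Reference A, while agentic use plans for roughly 16 to 32 percent. The same machine still fits, but the headroom shrinks by an order of magnitude.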
Modular Architecture as a Scaling Path
Architecture decides whether expansion becomes painful. A proven separation:
LLM server with GPU resources for language and multimodal models.
RAG server with vector database, embedding pipeline, and document index (CPU-heavy, low GPU need).
Management server with user administration, audit log, monitoring, API gateway.
This separation allows the GPU component to scale independently from the rest. When token demand grows, GPU capacity is added. When the knowledge base grows, RAG capacity is added. A monolithic build forces you to touch the entire system at every expansion step.
Redundancy: What Happens When a GPU Fails?
GPUs in servers fail rarely, but they fail. With classic web servers, the answer to failure scenarios is well established: redundant power supplies, RAID, clusters. With AI servers, the answer is more complex because GPUs are expensive and not trivially mirrored.
Three redundancy levels seen in practice:
Component redundancy in the server. Redundant power supplies, hot-swap fans, ECC memory. This protects against common failure causes, not against a GPU defect.
Model redundancy across multiple GPUs. When two or more GPUs run replicas of the same model (see "Replicas Instead of Tensor Parallelism" above), the inference stack continues at reduced capacity if a GPU fails. This is one of the underrated advantages of the replica architecture.
Server redundancy with a second node. A second, identically configured server takes over when the first fails. This is the solution with real high availability — and the most expensive one.
For initial production operation, component redundancy plus a documented recovery plan (spare parts SLA, documented restart procedure) is enough. For business-critical applications — such as 24/7 medical dictation systems — planning starts at level 2, ideally with a second node.
Clarify during procurement: what downtime is tolerable? What response time does the vendor guarantee? Is a cold spare or hot spare available?
Expansion Stages
Three expansion stages belong on your checklist during hardware selection:
Stage 1: More GPUs in the same server. Server platforms differ significantly in how many GPUs they accommodate. Typical configurations offer 2, 4, or 8 GPU slots. What to watch:
PCIe lanes and generation: Modern GPUs use PCIe 5.0 x16. The server must provide enough lanes, or GPUs are throttled.
Free slots: When the server ships today with 2 of 4 possible GPUs, expansion is a plug-in step. When all slots are full, a new server is needed.
Power supply headroom: Every additional GPU draws 350 to 700 watts, and Blackwell SXM modules such as the B200 are specified at up to around 1,000 watts. The power supply must carry the expansion.
Cooling headroom: More GPUs produce more heat. Server cooling must handle full load of all planned GPUs — not only those installed today.
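The power question from the list above can be checked with simple arithmetic before the order is placed. All chassis values in this sketch are hypothetical placeholders. Take the usable power supply capacity (after redundancy), the base system draw, and the per-GPU draw from the vendor's data sheet.

```python
def power_headroom_watts(psu_capacity_w, base_system_w, gpu_count, watts_per_gpu):
    """Remaining power budget at full GPU load; negative means overcommitted."""
    return psu_capacity_w - (base_system_w + gpu_count * watts_per_gpu)


# Hypothetical chassis values, for illustration only.
PSU_CAPACITY_W = 4_000   # usable capacity with redundancy in place
BASE_SYSTEM_W = 800      # CPUs, RAM, storage, fans
WATTS_PER_GPU = 700

for gpus in (2, 4, 5):
    headroom = power_headroom_watts(PSU_CAPACITY_W, BASE_SYSTEM_W, gpus, WATTS_PER_GPU)
    status = "ok" if headroom >= 0 else "power supply too small"
    print(f"{gpus} GPUs: {headroom:+} W headroom ({status})")
```

In this example the chassis carries four GPUs with 400 W to spare, while a fifth would overload the power supply. The same check with the cooling capacity in place of the power supply answers the cooling question.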
Stage 2: Second server, same platform. When user count exceeds a single server's capacity, a second node joins. The AI platform must support this (load balancing between nodes). Prerequisites in the data center: rack space, power feed, network connection. With good initial layout, the second server fits the same rack without rework.
Stage 3: Multi-node cluster. With many thousands of users, multiple servers form a cluster — usually with InfiniBand for GPU-to-GPU communication across node boundaries. This is an architectural decision that does not come for free retroactively: InfiniBand switches and cables belong in the plan, the network in the layout.
Operational Realities That Surprise IT Teams
Beyond pure hardware sizing, two operational topics regularly trip up first-time deployments. Worth knowing during procurement:
CUDA driver and inference stack compatibility. Container images of inference engines (vLLM, TGI, etc.) carry their own CUDA runtime expectations. A mismatch between host driver version and container expectation produces hard-to-diagnose startup failures. Specify in the procurement that the vendor delivers a tested, working combination of driver, container, and inference stack — and documents which versions are supported.
Model loading times and orchestration timeouts. A 549 GB flagship model does not load in 30 seconds. With Kubernetes or similar orchestration, default health-check timeouts will kill the pod before it ever serves a request. Plan for startup windows of 20 to 30 minutes when flagship models are involved, and ensure your platform layer is configured accordingly.
These details belong on the operations team's checklist, not in the boardroom decision — but they should not surface for the first time on go-live day.
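For the orchestration timeout, Kubernetes offers startup probes: the container may keep failing the probe for `failureThreshold × periodSeconds` seconds before it is restarted. The helper below derives a threshold from a desired startup window. The function name and the default probe period are assumptions for illustration.

```python
import math


def startup_probe_settings(startup_window_minutes, period_seconds=30):
    """Probe values that tolerate the given startup window before a restart."""
    failure_threshold = math.ceil(startup_window_minutes * 60 / period_seconds)
    return {"periodSeconds": period_seconds, "failureThreshold": failure_threshold}


print(startup_probe_settings(30))  # {'periodSeconds': 30, 'failureThreshold': 60}
```

A 30-minute window with a 30-second probe period needs a failure threshold of 60. These two values go into the `startupProbe` section of the pod specification for the inference container.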
Questions to Ask Today
How many GPU slots does the proposed server have in total — and how many are populated?
Which PCIe generation and how many lanes per GPU?
What power supply headroom remains for additional GPUs?
What cooling headroom remains at full GPU population?
Is the server InfiniBand-ready (NICs, slots, cabling)?
What is the expansion path to a second node?
For flagship deployments: is the server an HGX-class board with NVSwitch, or are GPUs connected via PCIe only?
These questions cost nothing in the procurement document. They prevent expensive architectural breaks later.
📊 Concrete Model Recommendations
📅 The following recommendations are current as of June 2026.
We update this section quarterly. The methodology in the rest of this article remains unaffected by model updates.
The recommendations are organized by hardware tier, not by model. Pick the tier that matches your budget and use case, then choose from the models tested for that tier.
How We Select the Models in This List
Before the tables, a word on selection criteria. A production-ready local model must satisfy four conditions simultaneously:
Fast enough — sustained throughput for many parallel jobs, not just single-user benchmarks
Smart enough — quality at the level the use case requires; smaller models are faster but more error-prone
Reliable — clean structured output (JSON), dependable tool calls, predictable behavior across thousands of requests
Trustworthy — open weights, and the shipped weights match the evaluated weights (for models trained natively in low-precision formats like MXFP4, this is given; for post-hoc quantized models, often not)
The last point is subtle but important. Many published benchmarks are run on full-precision weights, while the weights you actually deploy are quantized afterward — sometimes with measurable quality loss not reflected in the benchmark. Models trained natively in their deployment format (such as GPT-OSS-120B in MXFP4) avoid this gap.
Hardware Tiers at a Glance
Tier | Hardware | Total VRAM | Realistic Use |
|---|---|---|---|
Lab / PoC | 2× RTX 6000 Ada or 1× L40S | 48–96 GB | Testing, evaluation, single-user demos |
Entry Production | 1× H200 or 1× B200 | 141–192 GB | 50–300 users, chat + RAG |
Standard Production | 2× H200 (NVLink) or 2× B200 | 282–384 GB | 300–2,000 users, full feature set |
Heavy Production | 4× H200 or 2–4× B200 | 564 / 384–768 GB | 2,000–10,000 users, long contexts |
Flagship | 8× H200 HGX or 4× B200 / B300 | 1,128 / 768–1,152 GB | 1T-parameter MoE, 256k context, agentic workloads |
Note on consumer GPUs (RTX 4090, RTX 5090): We do not list consumer cards as production tiers. They lack NVLink, ECC memory, and enterprise driver support. For lab and evaluation purposes they are usable; for production with patient or citizen data, they are not.
Tier 1 — Lab / PoC
Hardware: 2× RTX 6000 Ada (96 GB total) or 1× L40S (48 GB)
Purpose: Evaluation, single-user testing, model selection before procurement.
Model | Quantization | Context | Notes |
|---|---|---|---|
Llama 4 8B Instruct | FP8 | 128k | Solid all-rounder for testing |
Qwen 3 14B | Int8 | 128k | Strong multilingual baseline |
Mistral NeMo 12B | FP8 | 128k | Efficient, good for German |
GPT-OSS 20B | MXFP4 | 128k | Lightweight reasoning |
What this tier cannot do: Production-grade concurrency, long-context for many users, flagship models, agentic workloads at scale.
Tier 2 — Entry Production
Hardware: 1× H200 (141 GB) or 1× B200 (192 GB), in a chassis with free slots for expansion
Purpose: First production deployment for smaller organizations (50–300 active users).
Model | Quantization | Context | Concurrent Users | Notes |
|---|---|---|---|---|
Llama 4 70B Instruct | FP8 | 128k | 30–60 | Reference model for this tier |
Qwen 3 70B | FP8 | 128k | 30–60 | Best for multilingual / German workloads |
GPT-OSS 120B | MXFP4 | 128k | 20–40 | Higher quality, tighter VRAM; ~220–250 tokens/s decode |
Mistral Large 3 | Int8 | 128k | 25–50 | Strong for European languages |
DeepSeek V3 Lite | MXFP4 | 64k | 30–50 | MoE efficiency, good throughput |
Typical parallel model stack on 1× H200: LLM 70B FP8 (70 GB) + Whisper Large v3 Turbo (6 GB) + Qwen 3 VL 7B for OCR (14 GB) + embedding (2 GB) + KV cache reserve for 40 concurrent users at 16k context (30 GB).
What this tier cannot do: Models above 120B parameters, sustained long-context (>32k) for many users, flagship 1T-MoE models.
Tier 3 — Standard Production
Hardware: 2× H200 with NVLink (282 GB) or 2× B200 (384 GB)
Purpose: The workhorse tier for most hospitals and municipalities (300–2,000 active users).
Model | Quantization | Context | Concurrent Users | Tokens/Day (full load) |
|---|---|---|---|---|
Llama 4 70B Instruct | FP8 | 128k | 100–200 | ~1.0 B |
GPT-OSS 120B | MXFP4 | 128k | 80–150 | ~1.0 B |
Qwen 3 110B | FP8 | 128k | 80–150 | ~900 M |
DeepSeek V3 | MXFP4 | 128k | 100–180 | ~1.2 B |
Llama 4 405B | Q4 | 64k | 40–80 | ~500 M |
Reference benchmark: 2× H200 (NVLink) + GPT-OSS 120B (MXFP4) on vLLM = ~1 billion tokens per 24 hours at full load. Equivalent on 2× B200: ~1.8–2.0 billion tokens per 24 hours.
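To compare the reference benchmark with vendor figures, which are usually quoted in tokens per second, it helps to convert the daily volume into an average rate. The following is a minimal sketch. The function names and the 30% utilization figure are illustrative assumptions, not part of the benchmark.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tokens_per_second(tokens_per_day: float) -> float:
    """Average sustained rate implied by a daily token volume."""
    return tokens_per_day / SECONDS_PER_DAY


def daily_capacity(full_load_tokens_per_day: float, utilization: float) -> float:
    """Tokens per day served at a given average utilization (0..1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return full_load_tokens_per_day * utilization


if __name__ == "__main__":
    reference = 1_000_000_000  # 2x H200 + GPT-OSS 120B at full load (from the text)
    print(f"{tokens_per_second(reference):,.0f} tokens/s on average")
    # 30% utilization is an assumed example value, not a measured one.
    print(f"{daily_capacity(reference, 0.30):,.0f} tokens/day at 30% utilization")
```

One billion tokens per day corresponds to roughly 11,600 tokens per second sustained around the clock. Real workloads concentrate in working hours, so peak rates matter more than the daily average.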
Typical parallel stack on 2× H200: LLM 120B class (60 GB) + Whisper Large v3 Turbo (6 GB) + Qwen 3 VL 30B for document understanding (32 GB) + embedding model on GPU (2 GB) + KV cache for 150 concurrent users at 16k context (~120 GB). Total around 220 GB of 282 GB — healthy reserve.
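The VRAM budget above is simple addition, and it is worth redoing for your own model mix. A minimal sketch, using the component sizes quoted for the Tier 2 and Tier 3 stacks in this article; the helper name vram_budget is an assumption for illustration.

```python
def vram_budget(total_gb: float, components: dict[str, float]) -> dict[str, float]:
    """Sum component footprints and report what is left of the total VRAM."""
    used = sum(components.values())
    free = total_gb - used
    return {"used_gb": used, "free_gb": free, "free_share": free / total_gb}


# Component sizes in GB, taken from the stacks described in the text.
TIER2_1X_H200 = {
    "LLM 70B FP8": 70,
    "Whisper Large v3 Turbo": 6,
    "Qwen 3 VL 7B (OCR)": 14,
    "Embedding": 2,
    "KV cache, 40 users @ 16k": 30,
}
TIER3_2X_H200 = {
    "LLM 120B class": 60,
    "Whisper Large v3 Turbo": 6,
    "Qwen 3 VL 30B": 32,
    "Embedding": 2,
    "KV cache, 150 users @ 16k": 120,
}

if __name__ == "__main__":
    for name, total, stack in [
        ("Tier 2", 141, TIER2_1X_H200),
        ("Tier 3", 282, TIER3_2X_H200),
    ]:
        b = vram_budget(total, stack)
        print(f"{name}: {b['used_gb']} GB used, {b['free_gb']} GB free "
              f"({b['free_share']:.0%} reserve)")
```

With these numbers, the Tier 2 stack uses 122 of 141 GB and the Tier 3 stack 220 of 282 GB. A negative free_gb value means the planned stack does not fit and something has to move to a smaller quantization or another GPU.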
📍 Case study: A real high-throughput deployment
A university-affiliated facility currently operates 5× H200 in two servers running GPT-OSS-120B (MXFP4) via vLLM, using the replica architecture (one model instance per GPU, no tensor parallelism). Result: more than 4 billion tokens per day in sustained operation, peaks above 50,000 tokens per second, 2.76 million requests in one week. Stack: vLLM + LiteLLM proxy + PostgreSQL for usage tracking + Prometheus/Grafana for monitoring, fully containerized, behind the institution's firewall.
This is not the typical case: it is a research-intensive deployment under sustained heavy load with agentic workloads. For perspective: 4 billion tokens per day equals roughly 1.5 trillion tokens per year. A clinic with 500 active users at moderate usage reaches about 5 billion tokens per year, roughly 300× less. Even a university hospital with 5,000 active users and intensive agent use would land at around 150 billion tokens per year, a tenth of the example.
What the case shows is the upper bound of what compact hardware can deliver. For most organizations, this is reassuring: even with strong growth, you have years of headroom on a 2× H200 configuration.
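The comparison in the case study can be reproduced with a few lines of arithmetic. This sketch uses only the figures quoted above; the function names are illustrative assumptions.

```python
DAYS_PER_YEAR = 365
SECONDS_PER_DAY = 86_400


def annual_tokens(tokens_per_day: int) -> int:
    """Yearly volume if the daily volume is sustained every day."""
    return tokens_per_day * DAYS_PER_YEAR


def headroom_factor(reference_per_year: float, own_per_year: float) -> float:
    """How many times larger the reference volume is than your own."""
    return reference_per_year / own_per_year


if __name__ == "__main__":
    case_per_day = 4_000_000_000          # 5x H200 case study
    case_per_year = annual_tokens(case_per_day)
    clinic_per_year = 5_000_000_000       # 500 users, moderate usage
    hospital_per_year = 150_000_000_000   # 5,000 users, intensive agent use

    print(f"Case study: {case_per_year:,} tokens/year")
    print(f"Average rate: {case_per_day / SECONDS_PER_DAY:,.0f} tokens/s")
    print(f"Per GPU: {case_per_day // 5:,} tokens/day")
    print(f"vs clinic: {headroom_factor(case_per_year, clinic_per_year):.0f}x")
    print(f"vs hospital: {headroom_factor(case_per_year, hospital_per_year):.1f}x")
```

The exact yearly figure is 1.46 trillion tokens, which the text rounds to 1.5 trillion. The daily volume works out to an average of about 46,000 tokens per second, consistent with the reported peaks above 50,000.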
What this tier cannot do: True flagship deployments (Kimi K2.5, DeepSeek R2 full), 256k context for many parallel users, heavy agentic workloads.
Tier 4 — Heavy Production
Hardware: 4× H200 (564 GB) or 2–4× B200 (384–768 GB)
Purpose: Larger organizations (2,000–10,000 users) or document-heavy workflows.
Model | Quantization | Context | Concurrent Users |
|---|---|---|---|
Llama 4 405B | FP8 | 128k | 80–150 |
DeepSeek V3 | FP8 | 128k | 150–300 |
Qwen 3 235B (MoE) | FP8 | 128k | 120–250 |
GPT-OSS 120B | FP8 (higher precision) | 128k | 200–350 |
Two drivers usually push organizations from Tier 3 to Tier 4: more users (when peak concurrency exceeds what 2× H200 can serve at acceptable latency), and higher precision (running the same model at FP8 instead of MXFP4 measurably improves output quality for specialized domains like medical coding or legal analysis).
What this tier cannot do: Trillion-parameter MoE models with 256k context — that requires Tier 5.
Tier 5 — Flagship
Hardware: 8× H200 on HGX board with NVSwitch (1,128 GB) or 4× B200/B300 with NVLink 5 (768–1,152 GB)
Purpose: Use cases where context depth and model capability matter more than throughput.
Model | Quantization | Context | Notes |
|---|---|---|---|
Kimi K2.5 | INT4 (experts only) + BF16 | 256k | 1T-parameter MoE, native tool calling, reasoning |
DeepSeek R2 | FP8 | 128k | Strongest open reasoning model as of June 2026 |
Qwen 3 Max | FP8 | 256k | Best for multilingual long-context |
Llama 4 405B | BF16 | 128k | Maximum quality, no MoE |
Flagship models deliver depth, not volume. Expect 100–300 million tokens per day on 8× H200, ~20 tokens/second per request for generation, and 5–30 concurrent users at long context.
A 2× H200 + 120B configuration produces 3–10× more tokens per day than a flagship deployment. If your bottleneck is volume, stay in Tier 3. If your bottleneck is what the model can actually do with a complete patient history or a 400-page procurement file, you need Tier 5.
Multi-Modal Components
Models that complement the main LLM in a production stack:
Speech-to-Text
Model | VRAM | Notes |
|---|---|---|
Whisper Large v3 Turbo | ~6 GB | Current default, multilingual |
Whisper Large v4 | ~8 GB | Better German medical terminology |
Vision / OCR
Model | VRAM | Notes |
|---|---|---|
Qwen 3 VL 7B | ~14 GB | Entry-tier vision |
Qwen 3 VL 30B | ~32 GB | Standard for document understanding |
Llama 4 Vision 90B | ~90 GB | Highest quality, Tier 3+ only |
Embeddings
Model | VRAM | Notes |
|---|---|---|
BGE-M3 | ~2 GB | Multilingual default, often runs on CPU |
Qwen 3 Embedding 8B | ~8 GB | Higher quality for German / European languages |
Nomic Embed v2 | ~1 GB | Lightweight, good for high-throughput RAG |
Text-to-Speech (emerging)
Model | VRAM | Notes |
|---|---|---|
Kokoro TTS | ~4 GB | Lightweight, multilingual |
XTTS v3 | ~6 GB | Higher quality, voice cloning |
Quantization Reference
Format | Bits per Weight | VRAM vs FP16 | Quality Impact |
|---|---|---|---|
FP16 / BF16 | 16 | 100% | None (baseline) |
FP8 | 8 | 50% | Negligible for most tasks |
Int8 | 8 | 50% | Minimal |
MXFP4 (Hopper, Blackwell) | 4 | 25% | Small, well-tested for production |
NVFP4 (Blackwell only) | 4 | 25% | Smaller than MXFP4 due to finer-grained block scaling |
Q4 (GGUF, generic) | 4 | 25% | Noticeable for specialized domains |
Q3 / Q2 | 2–3 | 12–19% | Significant — lab use only |
Recommendation: Default to FP8 when VRAM allows. Use MXFP4/NVFP4 for production when memory is tight. Avoid Q3 and below for production.
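The "VRAM vs FP16" column translates directly into a first-order estimate of weight memory: parameter count times bits per weight, divided by eight. The sketch below is a simplification under stated assumptions. It ignores scaling-factor overhead, layers kept in higher precision, KV cache, and activations, so real checkpoints are typically somewhat larger.

```python
# Nominal bits per weight, as in the quantization table above.
BITS_PER_WEIGHT = {
    "BF16": 16,
    "FP16": 16,
    "FP8": 8,
    "INT8": 8,
    "MXFP4": 4,
    "NVFP4": 4,
    "Q4": 4,
}


def weight_memory_gb(params_billion: float, fmt: str) -> float:
    """First-order weight footprint in GB (1 GB = 10^9 bytes)."""
    bits = BITS_PER_WEIGHT[fmt.upper()]
    return params_billion * bits / 8


if __name__ == "__main__":
    for params, fmt in [(70, "FP8"), (120, "MXFP4"), (405, "Q4"), (405, "BF16")]:
        print(f"{params}B @ {fmt}: ~{weight_memory_gb(params, fmt):.1f} GB")
```

This reproduces the 70 GB figure for a 70B model at FP8 and the 60 GB figure for a 120B-class model at MXFP4 used in the parallel-stack examples. Treat the result as a lower bound and confirm the actual footprint with the inference stack you intend to run.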
Performance Levers Worth Knowing
Three settings consistently improve throughput and latency in production deployments — they cost nothing and should be standard:
Prefix caching. When system prompts are reused across requests (typical in clinical and administrative workloads), prefix caching delivers high hit rates and significantly lowers per-request latency. In vLLM, this is the --enable-prefix-caching flag.
FP8 KV cache. For long-context deployments, --kv-cache-dtype fp8 halves KV cache memory consumption with negligible quality loss. This enables more concurrent users or longer effective context on the same hardware.
Replica architecture instead of tensor parallelism (see earlier section). When the model fits on a single GPU, run one instance per GPU instead of splitting one model across multiple GPUs. The result is linear throughput scaling, no NVLink dependency, and higher availability.
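The three levers can be combined in one launch plan: one vLLM server per GPU, each with prefix caching and an FP8 KV cache. The sketch below only builds the shell command lines; it does not start anything. The model identifier, port numbering, and helper name are assumptions for illustration, and whether FP8 KV cache is supported depends on the model and vLLM version, so check this against your own stack.

```python
import shlex


def replica_commands(model: str, gpu_ids: list[int], base_port: int = 8000) -> list[str]:
    """Build one `vllm serve` command per GPU (replica architecture)."""
    commands = []
    for offset, gpu in enumerate(gpu_ids):
        args = [
            "vllm", "serve", model,
            "--port", str(base_port + offset),   # one port per replica
            "--enable-prefix-caching",           # reuse shared system prompts
            "--kv-cache-dtype", "fp8",           # halve KV cache memory
        ]
        # CUDA_VISIBLE_DEVICES pins each replica to exactly one GPU.
        commands.append(f"CUDA_VISIBLE_DEVICES={gpu} {shlex.join(args)}")
    return commands


if __name__ == "__main__":
    # "openai/gpt-oss-120b" is an assumed example model identifier.
    for cmd in replica_commands("openai/gpt-oss-120b", gpu_ids=[0, 1]):
        print(cmd)
```

A proxy or load balancer in front of the replicas (the case study above uses LiteLLM) then distributes requests across the ports. If one replica fails, the others keep serving, which is where the availability benefit comes from.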
What to Communicate to Your Vendor
With the information so far, you can write a briefing to your supplier that fits in a few lines:
We are planning an on-premise AI platform for around [X] active users over the next 24 months. Expected daily demand: around [Y] million tokens. We intend to run a [Z]B class model, in parallel with Whisper for dictation and a vision model for OCR.
Please propose a configuration with:
GPU class H200 (SXM), B200 or B300, with NVLink/NVSwitch
Total VRAM at least [N] GB
Expansion headroom for at least two additional GPUs without architectural changes
5 years of support, 4-hour on-site response
GDPR-compliant operation, no external telemetry
Documented and supported combination of GPU driver, container runtime, and inference stack
Please also state power and cooling requirements for our data center planning.
Nothing more is needed for the first inquiry. The supplier handles the rest in the proposal.
Does Hardware Ownership Pay Off?
The short answer: yes, almost always, if your organization is large enough to fill the hardware and handles sensitive data. Three rules of thumb from practice:
Hardware investment for a Standard Production tier (2× H200, ~60,000 € for GPUs plus ~25,000 € for the server) typically amortizes within 12–18 months compared to renting equivalent capacity from a managed AI platform — provided utilization exceeds roughly 30%.
Unlimited token usage is the decisive advantage, especially for agentic workloads where token consumption becomes unpredictable. Fixed hardware cost beats variable per-token billing once your usage grows.
Precondition: a data center capable of hosting GPUs (power, cooling, space) and an IT team willing to take on operations. Without both, managed options are usually the better path regardless of size.
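The amortization rule of thumb can be checked against your own quotes with a simple break-even calculation. In the sketch below, only the 85,000 € investment comes from the text. The managed-platform price and the monthly operating cost are hypothetical placeholders that you need to replace with real offers and your own power, cooling, and staffing figures.

```python
from typing import Optional


def break_even_months(
    capex_eur: float,
    managed_monthly_eur: float,
    own_monthly_opex_eur: float = 0.0,
) -> Optional[float]:
    """Months until owned hardware is cheaper than a managed alternative.

    Returns None when owning never pays off, i.e. when monthly operating
    cost is at least as high as the managed fee.
    """
    monthly_saving = managed_monthly_eur - own_monthly_opex_eur
    if monthly_saving <= 0:
        return None
    return capex_eur / monthly_saving


if __name__ == "__main__":
    capex = 60_000 + 25_000  # GPUs + server, Standard Production tier (from the text)
    # Hypothetical values, for illustration only:
    managed_fee = 7_000      # EUR per month for equivalent managed capacity
    own_opex = 1_000         # EUR per month for power, cooling, maintenance
    months = break_even_months(capex, managed_fee, own_opex)
    print(f"Break-even after ~{months:.1f} months")
```

With these placeholder values the break-even point lands at about 14 months, inside the 12 to 18 month range quoted above. The result is sensitive to the managed fee, so run it with several offers before drawing conclusions.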
The full economic comparison — own hardware versus managed private cloud versus other operating models — depends on more than hardware cost. It depends on organization size, data center maturity, compliance requirements, and operational appetite. We treat that decision in a separate article on operating models and organization size [Link to follow-up article].
If you have already decided that owning hardware is the right path for you, the rest of this article gives you what you need.
Help for Your Procurement
So you do not start from zero, here is a structured template to adapt to your situation. It does not replace legal advice and is not a complete specification, but it provides a starting point for comparable proposals.
1. Mandatory Requirements
GPU class and minimum VRAM per GPU (referencing the chosen tier above).
NVLink/NVSwitch for models from 70B parameters upward; HGX-class boards mandatory for flagship deployments.
Number of GPUs per server, compatible with planned model topologies (Tensor Parallelism factors 1, 2, 4, or 8 — or replica architecture, see above).
ECC memory for both GPU and system RAM.
Network connectivity: at least 2× 25 GbE, with InfiniBand HDR or NDR optional for multi-node setups.
Redundant power supplies, hot-swap fans.
Warranty and support period (suggested: 5 years, next business day).
Power and cooling requirements compatible with the existing data center infrastructure.
GDPR-compliant operation, no telemetry sent to external vendor clouds.
Documented and supported combination of GPU driver, container runtime, and inference stack.
2. Expansion Headroom
Number of free GPU slots in the proposed server.
PCIe generation and lane allocation per slot.
Power supply headroom for additional GPUs.
Cooling headroom at full GPU population.
InfiniBand preparation for later multi-node expansion.
3. Evaluation Criteria with Suggested Weighting
Criterion | Weighting |
|---|---|
Token throughput under defined load | 40% |
Scalability (path to expansion) | 25% |
Support, SLA, response times | 20% |
Price | 15% |
Adjust the weighting to your own priorities; do not copy it unchanged.
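Applying the weighting is a weighted sum over per-criterion scores. A minimal sketch follows, using the suggested weights from the table. The 0 to 10 scoring scale and the two vendor score sets are hypothetical and serve only to show the mechanics.

```python
# Suggested weights from the table above; adjust to your own priorities.
WEIGHTS = {
    "throughput": 0.40,
    "scalability": 0.25,
    "support": 0.20,
    "price": 0.15,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of criterion scores; weights must add up to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


if __name__ == "__main__":
    # Hypothetical scores on a 0-10 scale.
    proposals = {
        "Vendor A": {"throughput": 8, "scalability": 6, "support": 7, "price": 5},
        "Vendor B": {"throughput": 6, "scalability": 8, "support": 8, "price": 9},
    }
    for name, scores in sorted(proposals.items(), key=lambda p: -weighted_score(p[1])):
        print(f"{name}: {weighted_score(scores):.2f}")
```

Fixing the weights and the scoring scale before proposals arrive keeps the evaluation defensible, which matters in public procurement.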
4. Scope of Delivery
Hardware fully assembled and pre-configured.
On-site commissioning by the vendor.
Documentation in English (or your local language).
Training for internal administrators (suggested: 2 days).
Handover workshop with performance verification.
5. Service Levels
Response time on failure (e.g., 4h on-site during business days).
Spare parts availability throughout the warranty period.
Defined escalation path (name, phone number, deputy).
Availability guarantee for the overall solution (e.g., 99.5% during business hours).
6. Acceptance Criteria
Measurable performance values under defined load (e.g., "≥ X tokens/second at Y concurrent users and Z tokens context length with model M in quantization Q").
Stability test over 72 hours of continuous load.
VRAM utilization documented for all parallel models in operation.
Proof that expansion options work technically.
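The first two acceptance criteria can be written down as a machine-checkable target, so that the handover test yields a clear pass or fail. This is a sketch under assumptions: the class and function names are illustrative, and the numbers in the example are hypothetical placeholders for the X, Y, Z, M, and Q in your contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceTarget:
    model: str
    quantization: str
    concurrent_users: int
    context_tokens: int
    min_tokens_per_second: float
    min_test_hours: float = 72.0  # stability test duration from the list above


def evaluate(target: AcceptanceTarget, samples_tps: list[float], test_hours: float) -> list[str]:
    """Return a list of failure messages; an empty list means accepted."""
    failures = []
    if test_hours < target.min_test_hours:
        failures.append(
            f"test ran {test_hours} h, required {target.min_test_hours} h")
    if not samples_tps:
        failures.append("no throughput samples recorded")
    elif min(samples_tps) < target.min_tokens_per_second:
        failures.append(
            f"throughput dropped to {min(samples_tps)} tokens/s, "
            f"required {target.min_tokens_per_second}")
    return failures


if __name__ == "__main__":
    # Hypothetical target values for illustration.
    target = AcceptanceTarget("model M", "quantization Q", 100, 16_000, 5_000.0)
    print(evaluate(target, samples_tps=[5_400.0, 5_150.0, 5_300.0], test_hours=72.0))
    print(evaluate(target, samples_tps=[5_400.0, 4_700.0], test_hours=48.0))
```

Checking the minimum rather than the average throughput is a deliberate choice here: it catches thermal throttling or memory pressure that only appears late in a long run.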
This template is intentionally neutral. It requests performance, not brands. That protects you from lock-in and produces comparable proposals.
The goal of AI infrastructure planning is not to buy the largest possible server.
The goal is to provide enough capacity for real-world use cases while preserving room for growth.
Organizations that understand their token demand, model requirements, and expected adoption can make hardware decisions based on measurable requirements rather than assumptions.
Summary
Three pillars carry the hardware decision:
Express demand in tokens. Daily token demand follows from headcount, adoption rate, and use cases. As a rule of thumb, plan for around 5 to 10 million tokens per day for every 100 active users.
Benchmark demand against reference. A machine with 2× H200 and a 120B-class model on vLLM produces around one billion tokens per 24 hours; on 2× B200 roughly twice that. From this you derive which GPU class, how much VRAM, and how many GPUs belong in your configuration, leaving room alongside the LLM for embedding, Whisper, OCR, and future models. For large models, the connection between cards (NVLink instead of PCIe, or replica architecture) co-determines usable performance. For flagship models with 256k context, plan a separate node class with 8× H200 or 4× B200/B300, and verify the real VRAM footprint of "quantized" models in writing.
Plan scaling headroom. Factor 3 to 5 over today's demand for classic use, factor 10 to 20 with agentic workloads in production. Modular architecture, free GPU slots, power headroom, clear expansion paths.
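The three pillars combine into a short calculation: users to daily tokens, daily tokens times headroom, and the result measured against the reference capacity. The sketch below uses the rule-of-thumb figures from this summary. The function names and the 1,000-user example are illustrative assumptions.

```python
import math

REFERENCE_TOKENS_PER_DAY = 1_000_000_000  # 2x H200 + 120B-class model on vLLM


def daily_demand(active_users: int, tokens_per_100_users: float) -> float:
    """Daily token demand from the per-100-users rule of thumb."""
    return active_users / 100 * tokens_per_100_users


def reference_units(demand_per_day: float, headroom: float) -> int:
    """Number of 2x H200 reference machines needed, including headroom."""
    return math.ceil(demand_per_day * headroom / REFERENCE_TOKENS_PER_DAY)


if __name__ == "__main__":
    users = 1_000  # example organization size
    low = daily_demand(users, 5_000_000)    # lower rule-of-thumb bound
    high = daily_demand(users, 10_000_000)  # upper rule-of-thumb bound
    print(f"Today: {low:,.0f} to {high:,.0f} tokens/day")
    # Headroom factor 5 for classic use, 20 for agentic workloads.
    print(f"Classic use, factor 5: {reference_units(high, 5)} reference machine(s)")
    print(f"Agentic use, factor 20: {reference_units(high, 20)} reference machine(s)")
```

For 1,000 active users, today's demand of 50 to 100 million tokens per day fits a single reference machine even with a headroom factor of 5. With agentic workloads at factor 20, the upper bound reaches 2 billion tokens per day, which points to two reference machines or a move to B200-class hardware.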
Choose your entry path honestly: start small with a chassis that grows, build breadth with multiple smaller GPUs for known use cases, or operate jointly with peers when flagship capability matters but single-organization investment is out of reach. Below the threshold of one full H200, owning hardware rarely makes sense — consider managed options instead.
With this method, you answer your own sizing question. The VRAM calculator at apxml.com/tools/vram-calculator helps with detail checks.
basebox is an on-premise AI platform for organizations handling critical data — patient data, citizen data, social data, classified data. For a conversation about specific configurations for your situation, reach us through the usual channels.
