AI in Your Own Data Center: What Hardware Do Hospitals and Public Agencies Actually Need?

Jun 1, 2026

René Herzer



A guide for CIOs and CTOs who have decided against the cloud and in favor of operating in their own data center. You will learn how to derive your demand from user numbers and use cases, benchmark it against a reference capacity (≈ 1 billion tokens per day on 2× H200), and define a server and GPU configuration for the next 24 months — including a template for your procurement.

Introduction

What hardware do hospitals, municipalities, and government agencies need to run AI in their own data centers? This question reaches us several times a week. It comes from CIOs and CTOs who have made a decision: patient data, citizen data, and social data do not belong in a US cloud. They belong in their own data center.

This article is for you if you are at that point. It does not rehash the cloud-versus-on-prem debate, because you have already settled that. Instead, you get a method that lets you do four things:

  1. Express your AI demand in tokens per day.

  2. Benchmark that demand against a reference capacity.

  3. Derive a server and GPU configuration for the next 24 months.

  4. Write a briefing for your hardware vendor.

What you will not find here: dollar amounts. Hardware prices for AI components change monthly, GPU availability fluctuates, and new model generations shift reference points. Any price quoted today will be outdated in three months. What does not change is the underlying logic. That is what you get here.

At a Glance

  • Hardware sizing for AI does not mean counting CPU cores or RAM. It means knowing how many tokens per day your organization processes.

  • An employee actively using AI generates roughly 10,000 to 200,000 tokens per day, depending on intensity and use case.

  • A reference machine with 2× NVIDIA H200, running GPT-OSS-120B in Q4 quantization on vLLM, produces around one billion tokens within 24 hours at full load.

  • A GPU rarely runs only the language model. Embedding, speech recognition, OCR, and eventually image generation and Large Table Models also consume VRAM. These models belong in your sizing calculation.

  • Context length determines KV cache requirements. With long contexts and many parallel users, the KV cache can become larger than the model itself.

  • Plan today with a factor of 3 to 5 over your calculated daily demand. With agentic workloads in mind, more like 10 to 20.

  • How cards are connected to each other (NVLink, PCIe) determines usable performance for large models — not VRAM totals alone.

Why Hardware Sizing for AI Works Differently Than for Traditional IT

When you size a web server, you think in CPU cores, RAM, and IOPS. The logic has been established for twenty years. AI servers follow different rules, and this is the most common stumbling block in procurement projects.

The metric that matters is not how many requests the CPU handles per second. It is how many tokens per second the GPU processes. A token is the smallest processing unit of a language model, roughly a word fragment. "Hospital" is about two tokens; an average English sentence is 15 to 25. Every question, every answer, every analyzed document section gets converted into tokens and processed by the GPU.

The conclusion: if you know how many tokens your organization needs per day, you know your hardware. Without that number, you are buying blind.

CPU, system RAM, and storage are not unimportant on an AI server, but they are supporting cast. The GPU with its fast specialized memory (VRAM) determines whether your AI platform succeeds or fails.

Pillar 1: Your Demand in Tokens

The first task is expressing your demand in this unit. A few benchmark values from practice make that straightforward.

How Much Is a Token in Your Daily Operations?

These values come from real deployments and serve as a starting point for your own estimates:

| Use Case | Tokens per Operation |
|---|---|
| Chat prompt (question + answer) | 500 – 2,000 |
| Summary of a medical report or case note | 3,000 – 8,000 |
| RAG retrieval with context and answer | 5,000 – 15,000 |
| Heavy user per day (≈ 2h active use) | 50,000 – 200,000 |
| Occasional user per day | 10,000 – 50,000 |

A Worked Example

An organization with 500 employees. How many tokens per day?

  1. 60% of employees use AI regularly → 300 active users.

  2. Of those, 20% are heavy users: 60 people × 150,000 tokens = 9 million tokens/day.

  3. Of those, 80% are occasional users: 240 people × 30,000 tokens = 7.2 million tokens/day.

  4. Total: around 16.2 million tokens per day.

These numbers are assumptions, not predictions. Run the math with your own values — headcount, adoption rate, intensity of use.
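The worked example can be sketched as a small calculation so you can rerun it with your own values. The function name and parameter names are illustrative assumptions; the default numbers are the ones from the example above.

```python
def daily_token_demand(employees, adoption_rate, heavy_share,
                       heavy_tokens, occasional_tokens):
    """Estimate tokens per day from headcount and usage assumptions."""
    active = round(employees * adoption_rate)   # regular AI users
    heavy = round(active * heavy_share)         # heavy users among them
    occasional = active - heavy                 # everyone else who is active
    return heavy * heavy_tokens + occasional * occasional_tokens


# Values from the worked example: 500 employees, 60% adoption,
# 20% heavy users at 150,000 tokens, the rest at 30,000 tokens.
demand = daily_token_demand(500, 0.60, 0.20, 150_000, 30_000)
print(f"{demand / 1e6:.1f} million tokens/day")  # 16.2 million tokens/day
```

Changing a single assumption, for example the adoption rate, shows quickly how sensitive the total is to it.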

💡 Rule of thumb: For every 100 active employees, expect 5 to 10 million tokens per day with classic use (chat plus document analysis). Pure chat use sits at the lower end, intensive document work at the upper end.


If You Have No Empirical Values Yet

That is the rule, not the exception. Take the middle values from the table and work with them. More precision is not available at this stage, and it is not needed for an investment planned over 24 to 36 months.

Pillar 2: From Tokens to Hardware

Once your demand estimate is in place, the hardware question becomes answerable. You need a reference value — a machine with known capacity that you can benchmark against.

The Reference Machine

📊 Reference: A machine with 2× NVIDIA H200, running the open-weight model GPT-OSS-120B in Q4 quantization on the vLLM inference stack, produces around one billion tokens within 24 hours at full load.


Three reasons for this choice: the hardware is an enterprise-grade configuration that is available today, GPT-OSS-120B is an open-weight model in production use, and vLLM is a production-grade inference stack with documented throughput values.

The example above (16.2 million tokens per day) uses around 1.6% of this machine's capacity. An organization with 500 employees has massive headroom on a 2× H200 machine for growth and new applications.
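As a rough sketch, the comparison against the reference machine is a single division. The constant and function names are assumptions chosen for illustration; the capacity figure is the article's reference value, and the per-second number is simply that value spread evenly over 24 hours.

```python
REFERENCE_TOKENS_PER_DAY = 1_000_000_000  # 2x H200 reference value from the article
SECONDS_PER_DAY = 24 * 60 * 60


def utilization(demand_tokens_per_day, capacity=REFERENCE_TOKENS_PER_DAY):
    """Share of the reference capacity that a given daily demand uses."""
    return demand_tokens_per_day / capacity


print(f"{utilization(16_200_000):.1%}")  # 1.6%
print(f"{REFERENCE_TOKENS_PER_DAY / SECONDS_PER_DAY:,.0f} tokens/s on average")  # 11,574 tokens/s on average
```

Keep in mind that this is a 24-hour average. Demand concentrates in working hours, so peak utilization will be higher than the daily figure suggests.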

Context Length — What Does It Mean in Pages?

Context length describes how many tokens of input and output combined a model can process in one request. The number sounds abstract; a comparison with US Letter pages of English prose makes it tangible.

A US Letter page of English prose (Times New Roman 11pt, 1.15 line spacing, around 400 words) corresponds to roughly 550 tokens.

| Context Length | US Letter Pages | Typical Application |
|---|---|---|
| 4,000 tokens (4k) | ~7 pages | Short chats, simple questions |
| 8,000 tokens (8k) | ~14 pages | Medical report + follow-up question |
| 16,000 tokens (16k) | ~29 pages | Multi-page case note, short RAG |
| 32,000 tokens (32k) | ~58 pages | Longer contracts, mid-range RAG retrieval |
| 65,000 tokens (65k) | ~118 pages | Patient file excerpt, complex research |
| 128,000 tokens (128k) | ~232 pages | Long conversations, agent sessions, case files |
| 200,000 tokens (200k) | ~363 pages | Very long agent sessions, large document bundles |
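The page figures follow directly from the estimate of roughly 550 tokens per US Letter page. A minimal sketch, assuming whole pages are counted (rounded down); the names are illustrative:

```python
TOKENS_PER_PAGE = 550  # estimate from the article for one US Letter page

CONTEXTS = (4_000, 8_000, 16_000, 32_000, 65_000, 128_000, 200_000)


def pages_for(context_tokens, tokens_per_page=TOKENS_PER_PAGE):
    """Whole pages of English prose that fit into a context window."""
    return context_tokens // tokens_per_page


for ctx in CONTEXTS:
    print(f"{ctx:>7,} tokens ~ {pages_for(ctx)} pages")
```

For other languages or dense tables the tokens-per-page value will differ, so treat 550 as a starting point to replace with your own measurement.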

Why this matters: context length determines KV cache requirements, and at long contexts with many parallel users the KV cache can become larger than the model itself. When 20 employees work simultaneously with 32k context, the KV cache can claim 40 to 80 GB of VRAM in addition to the model, depending on model architecture and cache precision.

A practical note: a model's maximum context (say 128k) is not the value you actually work with day to day. Most requests fall well below it. Plan for the context that covers 95% of your use cases — and keep the longer ones in mind as exceptions.
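To see where a figure in that range can come from, here is a back-of-the-envelope sketch. It uses the standard estimate for attention caches: two vectors (key and value) per layer and token, each of size KV heads × head dimension. The architecture values below (36 layers, 8 KV heads, head dimension 64, 2 bytes per value) are illustrative assumptions, not figures from this article; read the real ones from your model's configuration. The sketch also ignores optimizations such as sliding-window attention or paged caches, so it is an upper-bound style estimate.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, users,
                   bytes_per_value=2):
    """Upper-bound KV cache size if every user fills the whole context."""
    # Factor 2: one key vector and one value vector per layer and token.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * users


# Hypothetical architecture values; 20 users at 32,000 tokens each.
total = kv_cache_bytes(layers=36, kv_heads=8, head_dim=64,
                       context_tokens=32_000, users=20)
print(f"{total / 1e9:.1f} GB")  # 47.2 GB
```

Halving `bytes_per_value` (an 8-bit KV cache) halves the result, which is why the KV cache quantization setting matters as much as the context length.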

A GPU Rarely Carries Just the LLM

Here is a point missing from many sizing discussions: a production AI platform does not consist of a single model. Multiple models run in parallel on the same hardware:

| Component | Function | VRAM Need (typical) |
|---|---|---|
| LLM (language model) | Chat, summarization, RAG answer | 16 – 240 GB depending on model and quantization |
| Embedding model | Vectorization for RAG | 1 – 2 GB (often runs on CPU) |
| Speech recognition (Whisper) | Audio → text transcription | 3 – 10 GB |
| OCR / vision model | Text recognition in scans and images | 8 – 16 GB |
| TTS (future) | Text → speech | 2 – 6 GB |
| Image generation (future) | Image models | 8 – 24 GB |
| Large Table Models (future) | Table understanding and generation | Not yet established |

When your departments ask "Can we also do dictation from medical reports?" or "Can we make scanned files searchable?", the answer is: yes — if the corresponding models have space on the GPU.

A 2× H200 machine with 282 GB VRAM carries the 120B model in Q4 (60 GB), Whisper (8 GB), a vision model for OCR (~16 GB), and enough KV cache reserve for many parallel users. A configuration with less VRAM forces compromises — on the model, on context window, on user count, or on available functions.
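The budget in the paragraph above can be written down as a simple subtraction. This is a sketch under simplifying assumptions: it treats the two cards as one memory pool and ignores runtime overhead such as activations and memory fragmentation. The dictionary keys and function name are illustrative.

```python
TOTAL_VRAM_GB = 2 * 141  # 2x H200 with 141 GB each = 282 GB

# Model footprints as stated in the article.
models_gb = {
    "LLM 120B in Q4": 60,
    "Whisper": 8,
    "OCR / vision model": 16,
}


def remaining_vram(total_gb, allocations):
    """VRAM left after loading all models, before KV cache and overhead."""
    return total_gb - sum(allocations.values())


free_gb = remaining_vram(TOTAL_VRAM_GB, models_gb)
print(f"{free_gb} GB left for KV cache and runtime overhead")  # 198 GB left for KV cache and runtime overhead
```

Adding a future component, say image generation, is then a one-line change that shows immediately how much KV cache reserve it costs.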

How GPU Memory Works Together — and When It Does Not

A question that surfaces in nearly every procurement conversation: "Can we just take three cards with 48 GB each and have 144 GB for a large model?" The honest answer: technically often not, and when possible, with significant performance losses.

The reason lies in how the cards are connected. With large models split across multiple GPUs (Tensor Parallelism), the cards constantly communicate with each other. This communication needs a fast connection — otherwise the data bus becomes a bottleneck and throughput collapses.

There are two connection types:

  1. NVLink / NVSwitch: A direct high-speed connection between GPUs, at 600 to 1,800 GB/s depending on generation. This is the prerequisite for efficient Tensor Parallelism on large models.

  2. PCIe: The standard bus in the server, at roughly 64 GB/s per direction and card (PCIe 5.0 x16). Sufficient for small models, a bottleneck for large ones.

Not every GPU supports NVLink. An overview of common cards:

| GPU | NVLink | Suitability for Large Models (70B+) |
|---|---|---|
| H100 / H200 (SXM form factor) | Yes, NVSwitch, 900 GB/s | Ideal |
| H100 / H200 (PCIe form factor) | Only 2-card bridge, 600 GB/s | Limited |
| B200 | Yes, NVLink 5, 1,800 GB/s | Ideal |
| A100 (SXM) | Yes, 600 GB/s | Good |
| L40S | No, PCIe only | Not recommended |
| RTX 6000 Ada | No, PCIe only | Not recommended |
| RTX 4090 | No, removed by design | Not for production use |

Implication for procurement: when running a 70B or 120B class model, GPU choice is not just a VRAM question — it is a bus question. A configuration with 4× L40S sounds like 192 GB of total memory but is only of limited use for large models. A configuration with 2× H200 in SXM form factor with 282 GB and NVSwitch is the clearly better choice — even though the individual card costs more.

Another rule: the tensor-parallel degree (the number of GPUs a model is split across) must divide the model's attention heads evenly. GPT-OSS-120B with 64 heads runs with 1, 2, 4, or 8 GPUs, not with 3 or 6. This influences the sensible number of cards per server.
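The divisibility rule is easy to check for any head count. A minimal sketch with an illustrative function name; it tests only the attention-head condition described above, while a real inference stack may impose further constraints.

```python
def valid_tp_degrees(attention_heads, max_gpus):
    """GPU counts up to max_gpus that divide the attention heads evenly."""
    return [n for n in range(1, max_gpus + 1) if attention_heads % n == 0]


print(valid_tp_degrees(64, 8))  # [1, 2, 4, 8]
```

Running it for the head count of the model you plan to operate tells you whether a 3-slot or 6-slot chassis would leave GPUs unusable for tensor parallelism.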

What to Communicate to Your Vendor

With the information so far, you can write a briefing to your supplier that fits on a few lines:

We are planning an on-premise AI platform for around [X] active users over the next 24 months. Expected daily demand: around [Y] million tokens. We intend to run a [Z]B class model, in parallel with Whisper for dictation and an OCR model.

Please propose a configuration with:

  • GPU class H200 (SXM) or equivalent with NVLink/NVSwitch

  • Total VRAM at least [N] GB

  • Expansion headroom for at least two additional GPUs without architectural changes

  • 5 years of support, 4-hour on-site response

  • GDPR-compliant operation, no external telemetry

Please also state power and cooling requirements for our data center planning.

Nothing more is needed for the first inquiry. The supplier handles the rest in the proposal.

For an idea of which range you are operating in:

| Organization Size | GPU Configuration (Example) | Total VRAM | Model Class |
|---|---|---|---|
| 50 – 300 employees | 1× H100 or H200 | 80 – 141 GB | 8B – 14B |
| 300 – 1,000 employees | 2× H100 or H200 | 160 – 282 GB | 70B in FP8 |
| 1,000 – 3,000 employees | 2× H200 (SXM) | 282 GB | 120B in Q4 |
| 3,000 – 10,000+ employees | 4–8× H200 / B200, multiple nodes | 500 GB+ | 120B + reserve |

A tested model-hardware matrix with concrete quantizations, context lengths, and user counts is maintained by basebox at docs.basebox.ai/on-premise/llm-recommendations. The list is updated continuously.

The Calculator for Detail Questions

For the question "Does my chosen model fit my GPU configuration with my context length and user count?", a free tool exists: the VRAM calculator at apxml.com.

💡 Mini-glossary: What the terms in the VRAM calculator mean

Before using the calculator, a quick look at the input fields helps. You do not need to be an ML engineer, but an understanding of the settings is useful.

  • Model: The language model you plan to operate. Parameter count (e.g., "70B" = 70 billion parameters) drives VRAM need the most.

  • Inference Quantization: How precisely the model stores its internal values. FP16 = full precision, highest VRAM consumption, best quality. INT8 = half VRAM, slight quality loss. Q4 / INT4 = one quarter, noticeable losses depending on the use case. For many production workloads, Q4 is a compromise between memory and quality.

  • KV Cache Quantization: While the model answers, it stores intermediate results (the Key-Value cache). This cache grows in proportion to context length. With long documents and many parallel users, the KV cache can exceed the model itself. FP16 = precise, Q8/Q4 = leaner.

  • Sequence Length: How many tokens the model processes at once — input plus output. A medical report or case note runs 4,000 to 8,000 tokens, a longer RAG request 16,000+. More context means more KV cache, which means more VRAM.

  • Batch Size: How many requests the GPU processes at the same time. Higher batch size raises throughput and VRAM need.

  • Concurrent Users: How many people send requests at the same time. Not your total headcount, but the peak load. Rule of thumb: 15–25% of your active users are concurrent at lunchtime peak.

  • Attention Structure / Positional Embeddings (MLA, RoPE, …): Technical properties of the model that the calculator pulls in automatically from the selected model. Nothing to set here.
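For the Concurrent Users field, the 15–25% rule of thumb from the list above translates into a simple range. A sketch with illustrative names; the 300 active users are the figure from the earlier worked example.

```python
def concurrent_users(active_users, low_share=0.15, high_share=0.25):
    """Range of simultaneous users at peak, per the 15-25% rule of thumb."""
    return round(active_users * low_share), round(active_users * high_share)


low, high = concurrent_users(300)
print(f"{low} to {high} concurrent users at peak")  # 45 to 75 concurrent users at peak
```

Enter the upper value in the calculator if you want the result to cover the lunchtime peak rather than the average.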

On interpreting the results: The calculator computes with the full context length — as if every user maxed out the window on every request. In practice, this is rarely the case. Chat requests are usually well below 4,000 tokens, many document analyses below 10,000. Actual VRAM need is typically below the calculated worst case. Use the calculator as an upper bound, not as an average.

Three values to change in practice: model, quantization, context length × concurrent users. Everything else stays on defaults.

The calculator is at apxml.com/tools/vram-calculator.

The Trade-Off in Quantization

Quantization from FP16 to Q4 reduces VRAM need by about 75% and costs answer quality. How much exactly depends on the model and the use case. For classic tasks (summarization, questions on documents, lookup functions), Q4 works in practice. For specialized terminology, medical coding, or complex legal analysis, testing with higher precision (Q8 or FP16) pays off. The difference shows up not in theory, but in operation with your own data.
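The 75% figure follows from the bit width per parameter. The sketch below estimates weight memory only, assuming a uniform bit width for all weights; real quantized checkpoints carry some additional overhead (for example scaling factors), so actual files are slightly larger. The 70B example size and the names are illustrative.

```python
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "Q4": 4}


def weight_memory_gb(params_billion, quantization):
    """Approximate VRAM for the weights alone, in GB."""
    return params_billion * BITS_PER_WEIGHT[quantization] / 8


for quant in BITS_PER_WEIGHT:
    print(f"70B in {quant}: {weight_memory_gb(70, quant):.0f} GB")

saving = 1 - weight_memory_gb(70, "Q4") / weight_memory_gb(70, "FP16")
print(f"Saving FP16 -> Q4: {saving:.0%}")  # Saving FP16 -> Q4: 75%
```

The same formula gives 60 GB for a 120B model in Q4, which matches the figure used for the reference machine.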

Pillar 3: Scaling Headroom for 24 Months

Anyone sizing exactly to current demand today buys again in 12 months — often with architectural breaks. Four reasons to plan bigger than your demand estimate suggests.

1. Adoption grows. Empirical observation from ongoing deployments: active use at least doubles within the first 12 months after going into production. A pilot that starts with 50 users can reach 200 after a year. The driver is not hype but visibility: employees see colleagues working faster with AI and follow suit.

2. Use cases become more demanding. What starts as chat becomes RAG over the entire knowledge base. What starts as summarizing one document becomes comparative analysis over 20. Longer contexts, larger models, more parallel requests — tokens per user grow faster than user counts.

3. New model types arrive. Today LLM, embedding, Whisper, and OCR are the components. Within 12 to 24 months, TTS, image generation, and Large Table Models (LTM) join them. Every additional model consumes VRAM that you need to plan for today.

4. Agentic workloads. Today you run chat and document analysis. Within your hardware depreciation period, agentic workloads arrive — applications where the LLM does not generate a single answer but plans multiple steps, calls tools, and processes results.

What happens then: each agent step is its own LLM call, and each step carries the entire accumulated context. Token consumption therefore grows faster than linearly with the number of steps, because the context grows with every step.
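A simplified model makes the effect visible. Assume each step processes the initial context plus everything added so far, and that every step adds a fixed number of tokens (tool result plus model output). Both the model and the numbers (2,000 tokens initial context, 1,500 tokens added per step) are illustrative assumptions, and the sketch ignores prompt caching, which reduces compute but not the context that has to be held.

```python
def agent_task_tokens(steps, initial_context, tokens_added_per_step):
    """Total tokens processed when every step re-reads the full context."""
    total = 0
    context = initial_context
    for _ in range(steps):
        context += tokens_added_per_step  # tool output and answer accumulate
        total += context                  # this call processes all of it
    return total


for steps in (5, 10, 20):
    print(steps, "steps:", f"{agent_task_tokens(steps, 2_000, 1_500):,}", "tokens")
# 5 steps: 32,500 tokens
# 10 steps: 102,500 tokens
# 20 steps: 355,000 tokens
```

Doubling the steps from 10 to 20 more than triples the tokens: under these assumptions the total grows roughly with the square of the step count.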

Documented orders of magnitude from agentic system deployments:

| Agent Type | Tokens per Task | Multiple vs. Chat |
|---|---|---|
| Simple agent (3–5 steps, one tool) | 20,000 – 80,000 | 10 – 40× |
| Mid-complexity (research, 10–20 steps) | 100,000 – 500,000 | 50 – 250× |
| Long-running (coding, research over hours) | 1M – several M | 1,000×+ |

The same employee using 30,000 tokens per day today reaches 300,000 to 3,000,000 with agentic assistance. Load peaks shift because agents often run in the background, parallel to interactive chat. Multiple model sizes in parallel become the norm: a strong model for planning, a fast one for routine steps. Both need VRAM simultaneously.

💡 Rule of thumb: Plan today with a factor of 3 to 5 over your calculated daily demand for classic use. If you expect agentic workloads within the hardware depreciation period, plan with a factor of 10 to 20 — or pick an architecture that allows fast expansion.
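Applied to the earlier worked example (16.2 million tokens per day), the planning factors give the following target ranges. A sketch with illustrative names; the reference capacity is the article's figure of around one billion tokens per day.

```python
REFERENCE_TOKENS_PER_DAY = 1_000_000_000  # 2x H200 reference value
DAILY_DEMAND = 16_200_000                 # worked example, 500 employees


def planned_capacity(daily_demand, factor_low, factor_high):
    """Target capacity range for a given pair of headroom factors."""
    return daily_demand * factor_low, daily_demand * factor_high


classic = planned_capacity(DAILY_DEMAND, 3, 5)
agentic = planned_capacity(DAILY_DEMAND, 10, 20)

for label, (low, high) in (("classic", classic), ("agentic", agentic)):
    share = high / REFERENCE_TOKENS_PER_DAY
    print(f"{label}: {low / 1e6:.1f} to {high / 1e6:.1f} million tokens/day "
          f"(up to {share:.1%} of the reference machine)")
```

For this organization even the agentic upper bound stays at about a third of the reference capacity on a daily average, before accounting for peak hours.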

Modular Architecture as a Scaling Path

Architecture decides whether expansion becomes painful. A proven separation:

  1. LLM server with GPU resources for language and multimodal models.

  2. RAG server with vector database, embedding pipeline, and document index (CPU-heavy, low GPU need).

  3. Management server with user administration, audit log, monitoring, API gateway.

This separation allows the GPU component to scale independently from the rest. When token demand grows, GPU capacity is added. When the knowledge base grows, RAG capacity is added. A monolithic build forces you to touch the entire system at every expansion step.

Redundancy: What Happens When a GPU Fails?

GPUs in servers fail rarely, but they fail. With classic web servers, the answer to failure scenarios is well established: redundant power supplies, RAID, clusters. With AI servers, the answer is more complex because GPUs are expensive and not trivially mirrored.

Three redundancy levels seen in practice:

  1. Component redundancy in the server. Redundant power supplies, hot-swap fans, ECC memory. This protects against common failure causes, not against a GPU defect.

  2. Model redundancy across multiple GPUs. When two or more GPUs run in the server, the inference stack can continue at reduced capacity if one GPU fails. Prerequisite: the model fits on a single GPU, or the platform supports automatic failover.

  3. Server redundancy with a second node. A second, identically configured server takes over when the first fails. This is the solution with real high availability — and the most expensive one.

For initial production operation, component redundancy plus a documented recovery plan (spare parts SLA, documented restart procedure) is enough. For business-critical applications, such as 24/7 medical dictation systems, planning starts at level 2 and ideally moves to level 3 with a second node.

Clarify during procurement: what downtime is tolerable? What response time does the vendor guarantee? Is a cold spare or hot spare available?

Expansion and Future-Readiness

Even if you start today with 200 active users, the question "what happens when we reach 2,000 or 10,000 in two years?" belongs in the procurement decision. AI adoption rarely runs linearly: a steep rise often follows the pilot when more departments join.

Three expansion stages to keep in mind during hardware selection:

Stage 1: More GPUs in the same server. Server platforms differ significantly in how many GPUs they accommodate. Typical configurations offer 2, 4, or 8 GPU slots. What to watch:

  • PCIe lanes and generation: Modern GPUs use PCIe 5.0 x16. The server must provide enough lanes, or GPUs are throttled.

  • Free slots: When the server ships today with 2 of 4 possible GPUs, expansion is a plug-in step. When all slots are full, a new server is needed.

  • Power supply headroom: Every additional GPU draws 350 to 700 watts. The power supply must carry the expansion.

  • Cooling headroom: More GPUs produce more heat. Server cooling must handle full load of all planned GPUs — not only those installed today.

Stage 2: Second server, same platform. When user count exceeds a single server's capacity, a second node joins. The AI platform must support this (load balancing between nodes). Prerequisites in the data center: rack space, power feed, network connection. With good initial layout, the second server fits the same rack without rework.

Stage 3: Multi-node cluster. With many thousands of users, multiple servers form a cluster, usually with InfiniBand for GPU-to-GPU communication across node boundaries. This architectural decision is expensive to retrofit: InfiniBand switches and cabling belong in the initial plan, and the network belongs in the rack layout.

Questions to Ask Today

  1. How many GPU slots does the proposed server have in total — and how many are populated?

  2. Which PCIe generation and how many lanes per GPU?

  3. What power supply headroom remains for additional GPUs?

  4. What cooling headroom remains at full GPU population?

  5. Is the server InfiniBand-ready (NICs, slots, cabling)?

  6. What is the expansion path to a second node?

These questions cost nothing in the procurement document. They prevent expensive architectural breaks later.

Help for Your Procurement

So you do not start from zero, here is a structured template to adapt to your situation. It does not replace legal advice and is not a complete specification, but it provides a starting point for comparable proposals.

1. Mandatory Requirements

  • GPU class and minimum VRAM per GPU (referencing the chosen configuration).

  • NVLink/NVSwitch for models from 70B parameters upward.

  • Number of GPUs per server, compatible with planned model topologies (Tensor Parallelism factors 1, 2, 4, or 8).

  • ECC memory for both GPU and system RAM.

  • Network connectivity: at least 2× 25 GbE, with InfiniBand HDR or NDR optional for multi-node setups.

  • Redundant power supplies, hot-swap fans.

  • Warranty and support period (suggested: 5 years next business day).

  • Power and cooling requirements compatible with the existing data center infrastructure (kW per rack, intake air, exhaust air, water cooling if applicable).

  • GDPR-compliant operation, no telemetry sent to external vendor clouds.

2. Expansion Headroom

  • Number of free GPU slots in the proposed server.

  • PCIe generation and lane allocation per slot.

  • Power supply headroom for additional GPUs.

  • Cooling headroom at full GPU population.

  • InfiniBand preparation for later multi-node expansion.

3. Evaluation Criteria with Suggested Weighting

| Criterion | Weighting |
|---|---|
| Token throughput under defined load | 40% |
| Scalability (path to expansion) | 25% |
| Support, SLA, response times | 20% |
| Price | 15% |

Adjust the weighting to your own priorities; do not copy it unchanged.
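To compare proposals with such a weighting, a weighted sum is one common approach. The sketch below assumes each criterion is scored on a 0 to 10 scale by your evaluation team; the criterion keys, the scale, and the scores for the example proposal are hypothetical.

```python
# Suggested weights in percent; they must add up to 100.
WEIGHTS = {"throughput": 40, "scalability": 25, "support": 20, "price": 15}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of per-criterion scores (same scale as the inputs)."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must add up to 100")
    return sum(weights[c] * scores[c] for c in weights) / 100


# Hypothetical scores for one vendor proposal on a 0-10 scale.
proposal_a = {"throughput": 8, "scalability": 6, "support": 7, "price": 5}
print(weighted_score(proposal_a))  # 6.85
```

Fixing the weights before the proposals arrive keeps the evaluation comparable and defensible.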

4. Scope of Delivery

  • Hardware fully assembled and pre-configured.

  • On-site commissioning by the vendor.

  • Documentation in English (or your local language).

  • Training for internal administrators (suggested: 2 days).

  • Handover workshop with performance verification.

5. Service Levels

  • Response time on failure (e.g., 4h on-site during business days).

  • Spare parts availability throughout the warranty period.

  • Defined escalation path (name, phone number, deputy).

  • Availability guarantee for the overall solution (e.g., 99.5% during business hours).

6. Acceptance Criteria

  • Measurable performance values under defined load (e.g., "≥ X tokens/second at Y concurrent users and Z tokens context length with model M in quantization Q").

  • Stability test over 72 hours of continuous load.

  • VRAM utilization documented for all parallel models in operation.

  • Proof that expansion options work technically.

This template is intentionally neutral. It requests performance, not brands. That protects you from lock-in and produces comparable proposals.

Summary

Three pillars carry the hardware decision:

  1. Express demand in tokens. From headcount, adoption rate, and use cases, daily token demand emerges. For every 100 active users, around 5 to 10 million tokens per day.

  2. Benchmark demand against a reference. A machine with 2× H200 and GPT-OSS-120B in Q4 on vLLM produces around one billion tokens per 24 hours. From this you derive which GPU class, how much VRAM, and how many GPUs belong in your configuration, so that embedding, Whisper, OCR, and future models have space alongside the LLM. For large models, the connection between cards (NVLink instead of PCIe) co-determines usable performance.

  3. Plan scaling headroom. Factor 3 to 5 over today's demand for classic use, factor 10 to 20 with agentic workloads in view. Modular architecture, free GPU slots, power headroom, clear expansion paths.

With this method, you answer your own sizing question. The VRAM calculator at apxml.com helps with detail checks. The tested model-hardware matrix at docs.basebox.ai/on-premise/llm-recommendations provides concrete configurations with measured values.

basebox is an on-premise AI platform for organizations handling critical data — patient data, citizen data, social data, classified data. For a conversation about specific configurations for your situation, reach us through the usual channels.

© 2026 basebox GmbH, Utting am Ammersee, Germany. All rights reserved.

Made in Bavaria | EU-compliant
