Oct 14, 2025

René Herzer
The central question in AI hardware procurement is: "What hardware should I buy?" The answer depends largely on an often overlooked factor: how much GPU memory (VRAM) the model actually requires. This guide explains why VRAM is the crucial sizing factor and how to calculate it correctly.
Why VRAM determines the hardware decision
With traditional server hardware, the focus is on CPU, RAM, and storage. With AI systems, the GPU and its VRAM are the limiting factor: unlike most other workloads, the complete AI model must fit entirely into GPU memory.
The Hardware Hierarchy in AI:
VRAM (critical): Determines which models can run at all
GPU Performance: Affects response speed
CPU/RAM: Supporting components
Storage: For documents and operating system
Practical consequence: a €30,000 H100 with 80 GB VRAM can run a 70B model (INT8-quantized), while a €3,000 RTX 4090 with 24 GB VRAM cannot load the same model, regardless of all other hardware specifications.
Concurrent Users: Realistic Usage Planning
A common planning error is equating total users with concurrent users. The actual concurrency is significantly lower.
Rules of Thumb for Concurrent Users
Office Environment (Standard Working Hours):
• 10 total users → 2-3 concurrent users
• 50 total users → 8-12 concurrent users
• 100 total users → 15-25 concurrent users
• 500 total users → 50-100 concurrent users
Intensive Usage (Research, Analysis Teams):
• 10 total users → 4-6 concurrent users
• 50 total users → 15-25 concurrent users
• 100 total users → 30-50 concurrent users
Call Center / Support (Continuous Usage):
• 10 total users → 6-8 concurrent users
• 50 total users → 30-40 concurrent users
• 100 total users → 60-80 concurrent users
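These ratios can be captured in a small helper for first estimates. A minimal sketch in Python, where the profile names and ratio ranges are assumptions derived from the rules of thumb above, not measured values:

```python
# Assumed concurrency ratios (low, high) per usage profile,
# derived from the rules of thumb above.
CONCURRENCY_RATIOS = {
    "office": (0.15, 0.25),      # standard working hours
    "intensive": (0.30, 0.50),   # research / analysis teams
    "support": (0.60, 0.80),     # call center, continuous usage
}

def estimate_concurrent_users(total_users: int, profile: str = "office") -> tuple[int, int]:
    """Return a (low, high) estimate of concurrent users for a usage profile."""
    low, high = CONCURRENCY_RATIOS[profile]
    return max(1, round(total_users * low)), max(1, round(total_users * high))

print(estimate_concurrent_users(100, "office"))     # -> (15, 25)
print(estimate_concurrent_users(50, "intensive"))   # -> (15, 25)
```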
Analyze Usage Patterns
Typical usage distribution throughout the day:
• 09:00-11:00: Peak time (relative load ~80%)
• 11:00-14:00: Moderate usage (relative load ~40%)
• 14:00-17:00: Second peak (relative load ~60%)
• Outside hours: Minimal usage (relative load 5-10%)
Sizing for different scenarios:
• Conservative: Size for 80% peak usage
• Balanced: Size for 60% peak usage with queues
• Cost-optimized: Size for 40% peak usage with longer wait times
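A sizing strategy can then be expressed as a factor applied to the estimated peak concurrency. A brief sketch, with the factors taken from the three options above (planning assumptions, not fixed rules):

```python
import math

# Sizing factors from the strategies above (assumed planning values).
SIZING_FACTORS = {"conservative": 0.8, "balanced": 0.6, "cost_optimized": 0.4}

def sized_concurrency(peak_concurrent_users: int, strategy: str = "balanced") -> int:
    """Number of parallel requests the hardware should be dimensioned for."""
    return math.ceil(peak_concurrent_users * SIZING_FACTORS[strategy])

print(sized_concurrency(25, "conservative"))  # -> 20 parallel requests
print(sized_concurrency(25, "balanced"))      # -> 15 parallel requests
```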
VRAM Calculation: The Decisive Factors
The theoretical foundations are important, but for precise calculations, using a VRAM calculator is recommended. This takes into account all relevant parameters and their interactions.
Use VRAM Calculator →
With this tool, different configurations can be tested and the exact VRAM requirements can be determined. The following sections explain the individual parameters of the calculator:
1. Model size and quantization
The number of model parameters determines the baseline requirement:
• 3B parameters: ~6 GB (FP16) or ~3 GB (INT8)
• 7B parameters: ~14 GB (FP16) or ~7 GB (INT8)
• 13B parameters: ~26 GB (FP16) or ~13 GB (INT8)
• 70B parameters: ~140 GB (FP16) or ~70 GB (INT8)
Quantization reduces VRAM consumption:
• FP16: Full precision, highest quality, maximum VRAM consumption
• INT8: Halved VRAM consumption, 2-5% quality loss
• INT4: Quartered VRAM consumption, 5-10% quality loss
Production systems typically use INT8 quantization for optimal balance of quality and resource consumption.
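A minimal sketch of the arithmetic behind these figures, assuming the common approximation of parameter count times bytes per parameter (weights only, without KV-Cache or runtime overhead):

```python
# Bytes per parameter for common quantization levels.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_weight_vram_gb(params_billion: float, quant: str = "int8") -> float:
    """Approximate VRAM (GB) for the model weights alone, excluding KV-Cache and overhead."""
    return params_billion * BYTES_PER_PARAM[quant]

print(model_weight_vram_gb(7, "fp16"))   # -> 14.0 GB, matching the list above
print(model_weight_vram_gb(70, "int8"))  # -> 70.0 GB
```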
2. Context Window (Sequence Length)
The context window defines the maximum number of tokens (roughly word fragments) that the model can process in a single request.
Typical context sizes and use cases:
• 2K Tokens: ~1 page of text, simple questions
• 8K Tokens: ~4 pages of text, medium documents
• 32K Tokens: ~15 pages of text, detailed analyses
• 128K Tokens: ~60 pages of text, very extensive documents
VRAM Impact: VRAM consumption rises sharply with context size because the KV-Cache grows with every token in the context; at long contexts the cache can dominate the total requirement.
Example DeepSeek-R1 3B:
• 2K Context: 8.02 GB VRAM
• 8K Context: ~12 GB VRAM
• 32K Context: ~20+ GB VRAM
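These figures can be approximated with the standard KV-Cache formula: 2 (key + value) × layers × KV heads × head dimension × bytes per value × tokens × parallel requests. A minimal sketch; the layer and head counts below are illustrative values for a small model, not DeepSeek-R1's actual configuration:

```python
def kv_cache_gb(seq_len: int, batch_size: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV-Cache size in GB (factor 2 = one key and one value tensor per token)."""
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch_size
    return bytes_total / 1e9

# Illustrative architecture for a small ~3B model (assumed values).
print(round(kv_cache_gb(seq_len=8192, batch_size=1, n_layers=28,
                        n_kv_heads=8, head_dim=128), 2))  # -> ~0.94 GB per 8K request (FP16)
```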
3. Batch Size (Simultaneous Processing)
The batch size corresponds to the number of concurrent users and determines how many requests can be processed in parallel.
Batch size effects:
• Batch size = 1: One user, minimal latency
• Batch size = 8: Eight parallel requests, moderate latency
• Batch size = 32: Maximum throughput, higher latency per request
VRAM consumption increases roughly linearly: each additional parallel request needs its own KV-Cache and activation memory.
Practical example (13B model, 8K context):
• 1 concurrent user: 15 GB VRAM
• 4 concurrent users: 18 GB VRAM
• 8 concurrent users: 22 GB VRAM
• 16 concurrent users: 30 GB VRAM
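The table follows a simple pattern: a fixed share for the model weights plus a roughly constant share per parallel request. A small sketch of that relationship, where the ~14 GB base and ~1 GB per request are read off the table above rather than derived:

```python
def batch_vram_gb(base_weights_gb: float, per_request_gb: float, concurrent_requests: int) -> float:
    """VRAM estimate: fixed model share plus a per-request KV-Cache/activation share."""
    return base_weights_gb + per_request_gb * concurrent_requests

for users in (1, 4, 8, 16):
    print(users, batch_vram_gb(14, 1.0, users))  # matches the 15/18/22/30 GB figures above
```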
4. KV-Cache Quantization
The KV-Cache stores the key and value tensors of already processed tokens so they do not have to be recomputed for each new token.
KV-Cache options:
• FP16: Standard precision, highest VRAM consumption
• INT8: 50% VRAM reduction, minimal quality loss
• INT4: 75% VRAM reduction, noticeable quality loss
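The effect on the cache is a straightforward change in bytes per stored value (2 for FP16, 1 for INT8, 0.5 for INT4). A short sketch with assumed model dimensions (40 layers, 8 KV heads of dimension 128, chosen for illustration only):

```python
# Key + value entries stored per token for the assumed model dimensions.
per_token_values = 2 * 40 * 8 * 128
for name, bytes_per_value in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gb = per_token_values * bytes_per_value * 32_768 * 8 / 1e9  # 32K context, 8 requests
    print(f"{name}: {gb:.1f} GB")  # INT8 halves and INT4 quarters the FP16 cache
```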
Practical Calculation Examples
Example 1: Small Organization (25 Total Users)
Usage Analysis:
• 25 total users (office environment)
• 5-8 concurrent users (peak time)
• Documents up to 5 pages (8K context)
• Standard quality requirements
VRAM Configuration:
• Model: 13B parameters (INT8)
• Context: 8K tokens
• Batch size: 8 (for peak usage)
• VRAM requirement: ~22 GB
• GPU recommendation: NVIDIA L4 (24 GB)
Example 2: Medium Organization (100 Total Users)
Usage Analysis:
• 100 total users (mixed usage)
• 20-25 concurrent users (peak time)
• Documents up to 15 pages (32K Context)
• High quality requirements
VRAM Configuration:
• Model: 70B parameters (INT4, so that the weights fit within 48 GB)
• Context: 32K tokens
• Batch size: 24 (for peak usage)
• VRAM requirement: ~45 GB
• GPU recommendation: NVIDIA L40S (48 GB)
Example 3: Large Organization (500 Total Users)
Usage Analysis:
• 500 total users (intensive usage)
• 80-100 concurrent users (peak time)
• Very extensive documents (128K Context)
• Highest quality requirements
VRAM Configuration:
• Model: 70B parameters (FP16)
• Context: 128K tokens
• Batch size: 32 per GPU
• VRAM requirement: ~75 GB per GPU
• GPU recommendation: 3× NVIDIA H100 (80 GB) in a cluster
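As a rough cross-check of Example 3, the per-GPU figure can be decomposed into a weight share (FP16 weights split across three GPUs) plus an assumed per-request share; the ~0.9 GB per request is an illustrative value chosen to match the stated total, not a measurement:

```python
weights_per_gpu_gb = 70 * 2.0 / 3   # 70B parameters × 2 bytes (FP16), split over 3 GPUs
cache_per_gpu_gb = 32 * 0.9         # 32 requests per GPU × assumed ~0.9 GB each
print(round(weights_per_gpu_gb + cache_per_gpu_gb, 1))  # -> ~75.5 GB per GPU
```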
Architecture flexibility in dimensioning
Modern AI platforms enable flexible distribution of components:
Monolithic Architecture:
• LLM, RAG and management on one server
• Lowest hardware requirements
• Limited scalability
Distributed Architecture:
• LLM server: Dedicated for text generation
• RAG server: Specialized for document processing
• Management server: Web interface and orchestration
Hybrid Approaches:
• RAG and management on one server
• LLM on dedicated server
• Optimal cost-benefit ratio
The architecture decision significantly influences VRAM calculation, as resources can be dimensioned more specifically in distributed systems.
Common Calculation Errors
Underestimating Concurrent Users:
• Problem: Only considering average usage
• Solution: Analyze peak times and scale accordingly
Overestimating Context Window:
• Problem: 128K context planned, only 8K actually used
• Solution: Analyze realistic document lengths
Not considering Quantization:
• Problem: FP16 calculation without checking INT8 alternatives
• Solution: Realistically assess quality requirements
Missing Safety Buffer:
• Problem: Exact calculation without reserves
• Solution: 15-25% buffer for unforeseen requirements
VRAM Optimization Strategies
Quantization Optimization:
• Model Weights: INT8 for production environments
• KV-Cache: INT8 for VRAM efficiency
• Activations: FP16 for numerical stability
Context Management:
• Sliding Window: Automatic removal of old tokens
• Document Chunking: Splitting of long documents
• Intelligent Caching: Pre-computation of frequent requests
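Document chunking is the simplest of these strategies to sketch: long documents are split into overlapping windows that each fit the planned context size. The example below counts whitespace-separated words as a crude stand-in for tokens; a real implementation would use the model's own tokenizer:

```python
def chunk_document(text: str, max_tokens: int = 1024, overlap: int = 128) -> list[str]:
    """Split a long document into overlapping chunks that fit a limited context window."""
    words = text.split()  # crude token stand-in; use the model tokenizer in practice
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # overlap so that context is not lost at chunk boundaries
    return chunks
```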
Batch Optimization:
• Dynamic Batching: Adjustment of batch size based on load
• Sequence Packing: Consolidation of short requests
Validation of the calculation
Cloud Testing Before Hardware Procurement:
• Hourly rental of GPU instances (€2-4/hour)
• Testing with realistic data and usage patterns
• Measurement of VRAM consumption and performance metrics
• Validation of different configurations
Monitoring Metrics:
• VRAM utilization under various loads
• Response times with different batch sizes
• Throughput with various context sizes
• Quality assessment at different quantization levels
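VRAM utilization can be sampled during such a cloud test while the load test runs. A minimal sketch using the pynvml bindings (NVIDIA's NVML package for Python); the sampling interval and duration are arbitrary choices:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

peak_used_gb = 0.0
for _ in range(60):  # sample once per second for one minute while the load test runs
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    used_gb = info.used / 1e9
    peak_used_gb = max(peak_used_gb, used_gb)
    print(f"VRAM used: {used_gb:.1f} / {info.total / 1e9:.1f} GB")
    time.sleep(1)

print(f"Peak VRAM during test: {peak_used_gb:.1f} GB")
pynvml.nvmlShutdown()
```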
Checklist for Hardware Sizing
Usage Analysis:
• Total number of users determined
• Concurrent users calculated based on usage type
• Peak times and usage patterns analyzed
• Growth projections for 12-24 months created
Requirements Analysis:
• Typical and maximum document lengths determined
• Quality requirements specified
• Performance requirements (latency vs. throughput) clarified
• Availability requirements defined
VRAM Calculation:
• Model size selected based on quality requirements
• Quantization for production environment considered (INT8)
• Context window dimensioned based on real documents
• Batch size calculated for peak concurrent users
• Safety buffer of 15-25% planned
Validation:
• Cloud test performed with calculated configuration
• Performance measured under realistic load
• VRAM consumption documented under various scenarios
• Alternative configurations evaluated
Documentation:
• VRAM requirements documented for hardware specification
• Architecture decisions made (monolithic vs. distributed)
• Scaling options planned for future expansions
• Dimensioning rationale prepared for stakeholders
Conclusion
Hardware sizing for on-premise AI begins with correct VRAM calculation. Through systematic analysis of usage patterns, realistic modeling of concurrent users, and thorough validation through cloud testing, costly wrong decisions can be avoided.
Investing a few hundred euros in cloud tests can prevent purchasing mistakes in the five-figure range and ensures that the procured hardware meets actual requirements.