Oct 14, 2025

René Herzer
The central question in AI hardware procurement is: "What hardware should I buy?" The answer depends largely on an often overlooked factor: how much GPU memory (VRAM) the model actually requires. This guide explains why VRAM is the crucial sizing factor and how to calculate it correctly.
Why VRAM determines the hardware decision
With traditional server hardware, the focus is on CPU, RAM, and storage. With AI systems, the GPU and its VRAM are the limiting factor: unlike most other workloads, the complete AI model must fit entirely into GPU memory.
The Hardware Hierarchy in AI:
VRAM (critical): Determines which models can run at all
GPU Performance: Affects response speed
CPU/RAM: Supporting components
Storage: For documents and operating system
Practical consequence: a €30,000 H100 with 80 GB VRAM can run a 70B model (INT8-quantized), while a €3,000 RTX 4090 with 24 GB VRAM cannot load the same model, regardless of all other hardware specifications.
Concurrent Users: Realistic Usage Planning
A common planning error is equating total users with concurrent users. The actual concurrency is significantly lower.
Rules of Thumb for Concurrent Users
Office Environment (Standard Working Hours):
• 10 total users → 2-3 concurrent users
• 50 total users → 8-12 concurrent users
• 100 total users → 15-25 concurrent users
• 500 total users → 50-100 concurrent users
Intensive Usage (Research, Analysis Teams):
• 10 total users → 4-6 concurrent users
• 50 total users → 15-25 concurrent users
• 100 total users → 30-50 concurrent users
Call Center / Support (Continuous Usage):
• 10 total users → 6-8 concurrent users
• 50 total users → 30-40 concurrent users
• 100 total users → 60-80 concurrent users
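These ratios can be captured in a small helper for first estimates. A minimal sketch in Python, where the profile names and ratio ranges are assumptions derived from the rules of thumb above, not measured values:

```python
# Assumed concurrency ratios (low, high) per usage profile,
# derived from the rules of thumb above.
CONCURRENCY_RATIOS = {
    "office": (0.15, 0.25),      # standard working hours
    "intensive": (0.30, 0.50),   # research / analysis teams
    "support": (0.60, 0.80),     # call center, continuous usage
}

def estimate_concurrent_users(total_users: int, profile: str = "office") -> tuple[int, int]:
    """Return a (low, high) estimate of concurrent users for a usage profile."""
    low, high = CONCURRENCY_RATIOS[profile]
    return max(1, round(total_users * low)), max(1, round(total_users * high))

print(estimate_concurrent_users(100, "office"))     # -> (15, 25)
print(estimate_concurrent_users(50, "intensive"))   # -> (15, 25)
```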
Analyze Usage Patterns
Typical usage distribution throughout the day:
• 09:00-11:00: Peak time (relative load ~80%)
• 11:00-14:00: Moderate usage (relative load ~40%)
• 14:00-17:00: Second peak (relative load ~60%)
• Outside hours: Minimal usage (relative load 5-10%)
Sizing for different scenarios:
• Conservative: Size for 80% peak usage
• Balanced: Size for 60% peak usage with queues
• Cost-optimized: Size for 40% peak usage with longer wait times
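A sizing strategy can then be expressed as a factor applied to the estimated peak concurrency. A brief sketch, with the factors taken from the three options above (planning assumptions, not fixed rules):

```python
import math

# Sizing factors from the strategies above (assumed planning values).
SIZING_FACTORS = {"conservative": 0.8, "balanced": 0.6, "cost_optimized": 0.4}

def sized_concurrency(peak_concurrent_users: int, strategy: str = "balanced") -> int:
    """Number of parallel requests the hardware should be dimensioned for."""
    return math.ceil(peak_concurrent_users * SIZING_FACTORS[strategy])

print(sized_concurrency(25, "conservative"))  # -> 20 parallel requests
print(sized_concurrency(25, "balanced"))      # -> 15 parallel requests
```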
VRAM Calculation: The Decisive Factors
The theoretical foundations are important, but for precise calculations, using a VRAM calculator is recommended. This takes into account all relevant parameters and their interactions.
Use VRAM Calculator →
With this tool, different configurations can be tested and the exact VRAM requirements can be determined. The following sections explain the individual parameters of the calculator:
1. Model size and quantization
The number of model parameters determines the baseline requirement:
• 3B parameters: ~6 GB (FP16) or ~3 GB (INT8)
• 7B parameters: ~14 GB (FP16) or ~7 GB (INT8)
• 13B parameters: ~26 GB (FP16) or ~13 GB (INT8)
• 70B parameters: ~140 GB (FP16) or ~70 GB (INT8)
Quantization reduces VRAM consumption:
• FP16: Full precision, highest quality, maximum VRAM consumption
• INT8: Halved VRAM consumption, 2-5% quality loss
• INT4: Quartered VRAM consumption, 5-10% quality loss
Production systems typically use INT8 quantization for optimal balance of quality and resource consumption.
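A minimal sketch of the arithmetic behind these figures, assuming the common approximation of parameter count times bytes per parameter (weights only, without KV-Cache or runtime overhead):

```python
# Bytes per parameter for common quantization levels.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_weight_vram_gb(params_billion: float, quant: str = "int8") -> float:
    """Approximate VRAM (GB) for the model weights alone, excluding KV-Cache and overhead."""
    return params_billion * BYTES_PER_PARAM[quant]

print(model_weight_vram_gb(7, "fp16"))   # -> 14.0 GB, matching the list above
print(model_weight_vram_gb(70, "int8"))  # -> 70.0 GB
```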
2. Context Window (Sequence Length)
The context window defines the maximum number of tokens (roughly word fragments) that the model can process in a single request.
Typical context sizes and use cases:
• 2K Tokens: ~1 page of text, simple questions
• 8K Tokens: ~4 pages of text, medium documents
• 32K Tokens: ~15 pages of text, detailed analyses
• 128K Tokens: ~60 pages of text, very extensive documents
VRAM Impact: VRAM consumption rises sharply with context size because the KV-Cache grows with every token in the context; at long contexts the cache can dominate the total requirement.
Example DeepSeek-R1 3B:
• 2K Context: 8.02 GB VRAM
• 8K Context: ~12 GB VRAM
• 32K Context: ~20+ GB VRAM
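These figures can be approximated with the standard KV-Cache formula: 2 (key + value) × layers × KV heads × head dimension × bytes per value × tokens × parallel requests. A minimal sketch; the layer and head counts below are illustrative values for a small model, not DeepSeek-R1's actual configuration:

```python
def kv_cache_gb(seq_len: int, batch_size: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV-Cache size in GB (factor 2 = one key and one value tensor per token)."""
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch_size
    return bytes_total / 1e9

# Illustrative architecture for a small ~3B model (assumed values).
print(round(kv_cache_gb(seq_len=8192, batch_size=1, n_layers=28,
                        n_kv_heads=8, head_dim=128), 2))  # -> ~0.94 GB per 8K request (FP16)
```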
3. Batch Size (Simultaneous Processing)
The batch size corresponds to the number of concurrent users and determines how many requests can be processed in parallel.
Batch size effects:
• Batch size = 1: One user, minimal latency
• Batch size = 8: Eight parallel requests, moderate latency
• Batch size = 32: Maximum throughput, higher latency per request
VRAM consumption increases roughly linearly: each additional parallel request needs its own KV-Cache and activation memory.
Practical example (13B model, 8K context):
• 1 concurrent user: 15 GB VRAM
• 4 concurrent users: 18 GB VRAM
• 8 concurrent users: 22 GB VRAM
• 16 concurrent users: 30 GB VRAM
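The table follows a simple pattern: a fixed share for the model weights plus a roughly constant share per parallel request. A small sketch of that relationship, where the ~14 GB base and ~1 GB per request are read off the table above rather than derived:

```python
def batch_vram_gb(base_weights_gb: float, per_request_gb: float, concurrent_requests: int) -> float:
    """VRAM estimate: fixed model share plus a per-request KV-Cache/activation share."""
    return base_weights_gb + per_request_gb * concurrent_requests

for users in (1, 4, 8, 16):
    print(users, batch_vram_gb(14, 1.0, users))  # matches the 15/18/22/30 GB figures above
```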
4. KV-Cache Quantization
The KV-Cache stores the key and value tensors of already processed tokens so they do not have to be recomputed for each new token.
KV-Cache options:
• FP16: Standard precision, highest VRAM consumption
• INT8: 50% VRAM reduction, minimal quality loss
• INT4: 75% VRAM reduction, noticeable quality loss
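The effect on the cache is a straightforward change in bytes per stored value (2 for FP16, 1 for INT8, 0.5 for INT4). A short sketch with assumed model dimensions (40 layers, 8 KV heads of dimension 128, chosen for illustration only):

```python
# Key + value entries stored per token for the assumed model dimensions.
per_token_values = 2 * 40 * 8 * 128
for name, bytes_per_value in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gb = per_token_values * bytes_per_value * 32_768 * 8 / 1e9  # 32K context, 8 requests
    print(f"{name}: {gb:.1f} GB")  # INT8 halves and INT4 quarters the FP16 cache
```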
Practical Calculation Examples
Example 1: Small Organization (25 Total Users)
Usage Analysis:
• 25 total users (office environment)
• 5-8 concurrent users (peak time)
• Documents up to 5 pages (8K context)
• Standard quality requirements
VRAM Configuration:
• Model: 13B parameters (INT8)
• Context: 8K tokens
• Batch size: 8 (for peak usage)
• VRAM requirement: ~22 GB
• GPU recommendation: NVIDIA L4 (24 GB)
Example 2: Medium Organization (100 Total Users)
Usage Analysis:
• 100 total users (mixed usage)
• 20-25 concurrent users (peak time)
• Documents up to 15 pages (32K Context)
• High quality requirements
VRAM Configuration:
• Model: 70B parameters (INT4, so that the weights fit within 48 GB)
• Context: 32K tokens
• Batch size: 24 (for peak usage)
• VRAM requirement: ~45 GB
• GPU recommendation: NVIDIA L40S (48 GB)
Example 3: Large Organization (500 Total Users)
Usage Analysis:
• 500 total users (intensive usage)
• 80-100 concurrent users (peak time)
• Very extensive documents (128K Context)
• Highest quality requirements
VRAM Configuration:
• Model: 70B parameters (FP16)
• Context: 128K tokens
• Batch size: 32 per GPU
• VRAM requirement: ~75 GB per GPU
• GPU recommendation: 3× NVIDIA H100 (80 GB) in a cluster
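As a rough cross-check of Example 3, the per-GPU figure can be decomposed into a weight share (FP16 weights split across three GPUs) plus an assumed per-request share; the ~0.9 GB per request is an illustrative value chosen to match the stated total, not a measurement:

```python
weights_per_gpu_gb = 70 * 2.0 / 3   # 70B parameters × 2 bytes (FP16), split over 3 GPUs
cache_per_gpu_gb = 32 * 0.9         # 32 requests per GPU × assumed ~0.9 GB each
print(round(weights_per_gpu_gb + cache_per_gpu_gb, 1))  # -> ~75.5 GB per GPU
```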
Architecture flexibility in dimensioning
Modern AI platforms enable flexible distribution of components:
Monolithic Architecture:
• LLM, RAG and management on one server
• Lowest hardware requirements
• Limited scalability
Distributed Architecture:
• LLM server: Dedicated for text generation
• RAG server: Specialized for document processing
• Management server: Web interface and orchestration
Hybrid Approaches:
• RAG and management on one server
• LLM on dedicated server
• Optimal cost-benefit ratio
The architecture decision significantly influences VRAM calculation, as resources can be dimensioned more specifically in distributed systems.
Common Calculation Errors
Underestimating Concurrent Users:
• Problem: Only considering average usage
• Solution: Analyze peak times and scale accordingly
Overestimating Context Window:
• Problem: 128K context planned, only 8K actually used
• Solution: Analyze realistic document lengths
Not considering Quantization:
• Problem: FP16 calculation without checking INT8 alternatives
• Solution: Realistically assess quality requirements
Missing Safety Buffer:
• Problem: Exact calculation without reserves
• Solution: 15-25% buffer for unforeseen requirements
VRAM Optimization Strategies
Quantization Optimization:
• Model Weights: INT8 for production environments
• KV-Cache: INT8 for VRAM efficiency
• Activations: FP16 for numerical stability
Context Management:
• Sliding Window: Automatic removal of old tokens
• Document Chunking: Splitting of long documents
• Intelligent Caching: Pre-computation of frequent requests
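Document chunking is the simplest of these strategies to sketch: long documents are split into overlapping windows that each fit the planned context size. The example below counts whitespace-separated words as a crude stand-in for tokens; a real implementation would use the model's own tokenizer:

```python
def chunk_document(text: str, max_tokens: int = 1024, overlap: int = 128) -> list[str]:
    """Split a long document into overlapping chunks that fit a limited context window."""
    words = text.split()  # crude token stand-in; use the model tokenizer in practice
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # overlap so that context is not lost at chunk boundaries
    return chunks
```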
Batch Optimization:
• Dynamic Batching: Adjustment of batch size based on load
• Sequence Packing: Consolidation of short requests
Validation of the calculation
Cloud Testing Before Hardware Procurement:
• Hourly rental of GPU instances (€2-4/hour)
• Testing with realistic data and usage patterns
• Measurement of VRAM consumption and performance metrics
• Validation of different configurations
Monitoring Metrics:
• VRAM utilization under various loads
• Response times with different batch sizes
• Throughput with various context sizes
• Quality assessment at different quantization levels
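VRAM utilization can be sampled during such a cloud test while the load test runs. A minimal sketch using the pynvml bindings (NVIDIA's NVML package for Python); the sampling interval and duration are arbitrary choices:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

peak_used_gb = 0.0
for _ in range(60):  # sample once per second for one minute while the load test runs
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    used_gb = info.used / 1e9
    peak_used_gb = max(peak_used_gb, used_gb)
    print(f"VRAM used: {used_gb:.1f} / {info.total / 1e9:.1f} GB")
    time.sleep(1)

print(f"Peak VRAM during test: {peak_used_gb:.1f} GB")
pynvml.nvmlShutdown()
```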
Checklist for Hardware Sizing
Usage Analysis:
• Total number of users determined
• Concurrent users calculated based on usage type
• Peak times and usage patterns analyzed
• Growth projections for 12-24 months created
Requirements Analysis:
• Typical and maximum document lengths determined
• Quality requirements specified
• Performance requirements (latency vs. throughput) clarified
• Availability requirements defined
VRAM Calculation:
• Model size selected based on quality requirements
• Quantization for production environment considered (INT8)
• Context window dimensioned based on real documents
• Batch size calculated for peak concurrent users
• Safety buffer of 15-25% planned
Validation:
• Cloud test performed with calculated configuration
• Performance measured under realistic load
• VRAM consumption documented under various scenarios
• Alternative configurations evaluated
Documentation:
• VRAM requirements documented for hardware specification
• Architecture decisions made (monolithic vs. distributed)
• Scaling options planned for future expansions
• Dimensioning rationale prepared for stakeholders
Conclusion
Hardware sizing for on-premise AI begins with correct VRAM calculation. Through systematic analysis of usage patterns, realistic modeling of concurrent users, and thorough validation through cloud testing, costly wrong decisions can be avoided.
Investing a few hundred euros in cloud tests can prevent purchasing mistakes in the five-figure range and ensures that the procured hardware meets actual requirements.