Local AI, Enterprise Scale: Practical Insights and Tooling for On-Premise LLMs


Jun 13, 2025

Felix Raab


Table of Contents

  1. Introduction

  2. Hardware

  3. Inference

  4. Testing

  5. Benchmarking

  6. Software and Tooling

  7. Conclusion

Introduction

basebox AI provides enterprise AI systems for both cloud and on-premises deployments. Many regulated industries require AI services to run entirely on-premises, which makes demand-based scaling harder than in the cloud: LLMs need powerful GPUs and significant compute resources, yet on-premises deployments must work within fixed hardware constraints, often completely offline. This article highlights some of the challenges of running on-premise AI services, presents concrete benchmarks, shares some of our learnings, and shows extra tooling we've been developing.

Hardware

LLMs require capable GPUs with considerable video memory (VRAM) for acceptable performance, except for very small models. One of the main GPU suppliers, NVIDIA, offers two distinct categories: consumer-grade GeForce RTX cards and enterprise-grade options like the RTX A6000 (workstation) or H100/L40S (data center). Hardware planning depends on several factors, but as a general guideline, design the system to be horizontally scalable so that more GPUs can be added at a later stage if needed. Key factors that affect the overall performance of the system are:

  • Model size: Roughly speaking, there are small, medium-sized, and large models, expressed in number of parameters (e.g. 8B, 70B, 405B); models can run with different quantizations (in simplified terms, the "precision" of the weights). With multiple creators, types, sizes, and quantization formats, the number of available model variants quickly becomes vast. The Hugging Face platform has become the main hub for publishing, browsing, searching, downloading, and even running models. Most open-source inference engines support a range of model architectures and formats.

  • Context size: Context is often discussed in terms of input tokens and output tokens. If users plan to send large amounts of text or big files to models and expect extensive model responses, memory and compute demands increase. Inference engines cap the maximum supported input tokens and quickly yield errors or crash if requests exceed what the hardware can handle.

  • Concurrent users: Whether 1 user or 8 users simultaneously interact with LLM services matters. More concurrent users can become a significant bottleneck since memory and compute requirements quickly increase. Inference engines have internal mechanisms to batch and queue requests. Depending on workload, users of the system may experience longer waiting times for responses (or in the worst case, the system crashes).


Those factors are all interdependent. For example, less powerful GPUs could run smaller models with bigger context sizes; more powerful GPUs could run medium-sized models with bigger context sizes but limited numbers of concurrent users. The vast and unfortunately somewhat "messy" model landscape and the aforementioned factors make it challenging to come up with a magic formula and concrete numbers. As we'll see later, nothing beats actual testing and benchmarking of concrete models on concrete hardware with a concrete inference engine.
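To get a feel for how these factors interact, a rough back-of-envelope memory estimate helps with first-pass sizing. The following is a minimal sketch, not part of our tooling; the layer and KV-dimension values in the example are assumptions based on a typical 70B architecture, and real engines (with paged attention, prefix caching, etc.) will deviate from this.

```python
def estimate_vram_gb(
    params_billion: float,    # model size, e.g. 70 for a 70B model
    bytes_per_param: float,   # ~2.0 for fp16/bf16, ~1.0 for 8-bit, ~0.5 for 4-bit
    n_layers: int,            # number of transformer layers
    kv_dim: int,              # n_kv_heads * head_dim (small for GQA models)
    context_tokens: int,      # max input + output tokens per request
    concurrent_requests: int,
    kv_bytes: float = 2.0,    # fp16 KV-cache entries
) -> float:
    """Very rough estimate: weights + KV cache, ignoring activations and
    engine overhead (add roughly 10-20% on top in practice)."""
    weights = params_billion * 1e9 * bytes_per_param
    # KV cache: 2 (keys and values) * layers * kv_dim * tokens * bytes, per request
    kv_cache = 2 * n_layers * kv_dim * context_tokens * kv_bytes * concurrent_requests
    return (weights + kv_cache) / 1e9


# Example: a 4-bit 70B model, 32k context, 4 concurrent requests
# (assumed architecture: 80 layers, 8 KV heads x 128 head dim).
print(f"{estimate_vram_gb(70, 0.5, 80, 1024, 32_000, 4):.0f} GB")  # ≈ 77 GB
```

Even this crude estimate shows why a single 80GB card gets tight once a quantized 70B model meets long contexts and several concurrent users, which matches the benchmark behavior reported further below.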

The good news: 1) high-end GPUs get cheaper as newer generations are released, and model architectures become more efficient; 2) some AI workloads can run on CPUs, and not all components necessarily need expensive GPUs. One example of such a component is a system for retrieval-augmented generation (RAG). A RAG system at its core is an information retrieval system ("search engine"), now typically a hybrid system in which classic full-text search meets semantic search. RAG systems are often used for document processing to work around LLM context size limitations, to reduce the hallucination problem, and to feed AI services with data that were not part of their training data (e.g. company internals). The vectors for semantic search are calculated by smaller and cheaper "embedding models" that run fine on multi-core CPUs. (RAG systems can still take advantage of GPUs if higher performance is needed or if more expensive operations such as vision models should run as part of an embedding pipeline.)
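To illustrate the CPU-friendly part of such a pipeline, here is a minimal embedding sketch using the sentence-transformers library; the model name is an illustrative choice, not a recommendation or part of basebox's stack.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# A small embedding model; runs acceptably on multi-core CPUs.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

chunks = [
    "Quarterly travel expense policy for employees.",
    "Incident response runbook for the internal ticketing system.",
    "Onboarding checklist for new engineering hires.",
]

# One dense vector per chunk; these would be stored in a vector index
# and combined with classic full-text search for hybrid retrieval.
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384) for this particular model
```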

Inference

Arguably the most important component of a local AI system is the one that actually runs the LLMs: the inference engine. There are a number of open-source projects, and not all of them are suitable for on-premise and enterprise usage. For example, in early testing, we found that the popular Ollama project works well for personal and home usage but is less suitable for production deployments, as its performance and scalability do not match those of other projects and it lacks certain enterprise features. More recently, there have been community discussions regarding its attribution practices towards the Llama.cpp project – organizations evaluating Ollama should consider these community dynamics as part of their assessment.

So what are the main criteria we have been looking for?

  • Production-readiness: Deployable onto custom hardware with at least NVIDIA support (AMD support often exists but is generally less mature), clear licensing, and built for handling workloads beyond single-user home usage; proven in real-world production deployments

  • Support and Community: Important since the landscape is rapidly evolving with new models and architectures getting constantly released; this would also exclude smaller projects with less adoption, smaller communities, less real-world usage, and overall higher risk of being abandoned

  • Model Support: The well-known state-of-the-art open-source models must be supported, with flexibility to run different flavors in different formats and quantizations (pre-quantized and on-the-fly quantized formats); ideally, multimodal and vision models are supported if that use case is important

  • Performance: The engine should support state-of-the-art performance with the latest algorithmic developments and optimizations around model architectures and how they make use of the hardware (this includes approaches such as flash attention/paged attention, tensor parallelism, continuous batching, etc.). A key factor for on-premise LLMs is how inference engines batch user requests to GPUs. By processing multiple requests together, GPUs achieve higher throughput, but this can increase response latency as requests wait for the batch to fill. The right balance between batching and latency is especially important for larger or more complex models, and directly impacts real-world performance and user experience.

  • Interfaces: APIs must be OpenAI compatible, at least at the basic level where typical chat interaction between users and "assistant" roles is supported; the LLM ecosystem revolves around this de facto standard, so lack of support would make integration more complicated and time-consuming

  • Offline Mode: Essential for on-premise deployments in regulated industries; startup and runtime execution must work without active internet connection where pre-downloaded models are loaded from disk

  • Docker Support: Since the inference engine might need to run as part of a bigger containerized system, an official Docker deployment option is not strictly necessary but definitely an advantage (also worth noting that NVIDIA GPU passthrough works with Docker on Linux but not on macOS due to platform limitations)

  • Other "enterprise features", for instance: Monitoring and integration with popular systems such as OpenTelemetry/Prometheus; structured outputs to guide model responses; running adapters for custom fine-tuned models


Two engines that largely meet the criteria above are vLLM and TGI (Text Generation Inference). The former is often used by cloud providers; the latter is spearheaded by Hugging Face and used in their own infrastructure. TGI strikes a good balance between production-readiness, performance (in benchmarks often faster than vLLM), and enterprise features. vLLM comes with more configuration options, at the expense of more overall complexity to manage. The next sections focus on our use of TGI and show some real-world benchmarks.
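As a quick illustration of the OpenAI-compatible interface point above, the following sketch sends a chat completion request to a locally running TGI instance; the base URL, port, and model identifier are assumptions for the example.

```python
# pip install openai
from openai import OpenAI

# TGI exposes an OpenAI-compatible chat API; the API key is unused locally,
# but the client library requires some value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server loaded
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize why on-premise LLMs need capacity planning."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```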

Testing

Testing inference engines presents a practical challenge: most teams lack access to multiple GPU configurations for evaluation. Without a rack of H100s at your disposal, you need a cost-effective approach that maximizes learning from limited hardware access.

  • Smoke Testing: This is a basic test where you boot the inference engine with a model and check that it starts up correctly and is ready to be called via its API. One of our learnings here is that this is unpredictable: different versions of models and different quantizations may or may not run, even if all configuration appears to be correct. We maintain an internal catalog of tested models and record the significant parameters (available VRAM, quantization, maximum input length) for each one. If a model does not run, error output and open issue trackers often contain clues. (A minimal smoke-test sketch follows this list.)

  • Performance Testing: Is performance acceptable when streaming responses via the API? As mentioned before, performance heavily depends on available VRAM, context sizes, and concurrent users. This is best measured with benchmark tooling that simulates real workloads and measures relevant metrics end-to-end (see next section). In particular, both latency (time to first token) and throughput (tokens per second) should be tracked, as batching and concurrency settings can significantly affect these results.

  • Quality Testing: Measuring quality in LLMs is notoriously challenging due to their non-deterministic behavior and issues like hallucinations. While highly quantized models may offer faster performance, they often result in higher "perplexity"—meaning the model appears more "confused"—and user-facing parameters such as "temperature" can significantly affect the output. Within the AI community, there is growing recognition that the usefulness of traditional "evals" (LLM evaluations) is limited. Methods like "LLM as Judge" (using LLMs to evaluate other LLM outputs) can introduce their own biases. Even when models perform well in automated tests, they may still struggle with certain prompts in real-world use, depending on the domain. As a result, human review remains essential. The testing process can be semi-automated, with prompts and context delivered to the API via specialized tools, and responses gathered for both automated and human evaluation. Tools such as Promptfoo help streamline this process and support security-related testing as well.
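As referenced in the smoke-testing item, such a check can be as small as the sketch below; the /health and /info routes match TGI (other engines may differ, and field names vary between TGI versions), and the port is an assumption.

```python
# pip install requests
import requests

BASE = "http://localhost:8080"  # local TGI instance (assumed port)

# 1) Readiness: returns 200 once the model is loaded and the engine can serve.
health = requests.get(f"{BASE}/health", timeout=10)
assert health.status_code == 200, f"engine not ready: HTTP {health.status_code}"

# 2) Sanity-check the deployed configuration (model id, max input length, ...).
#    Field names differ between TGI versions, hence the defensive .get() calls.
info = requests.get(f"{BASE}/info", timeout=10).json()
print(info.get("model_id"), info.get("max_input_tokens"))
```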


How do you test all of the above if you don't happen to own a number of expensive H100 or L40S machines? A growing number of hosting services provide access to on-demand GPUs billed by the hour, only while the machine is actually in use. Provider terms are sometimes unclear about billing for idle machines, so it is worth double-checking whether hourly billing still applies when machines are turned off. This can still get expensive, but if you plan the work beforehand and limit the time the machine is on, the cost is acceptable. For us, the procedure has often been: remotely boot the machine, log in and run the tests, put the machine back to sleep. Repeat for significant changes such as inference engine updates, new models, or different GPUs, and aim to automate parts of the process as you go.

Benchmarking

Benchmark numbers should always be taken with a grain of salt since test environments, parameters, and inputs often vary to such an extent that direct comparisons become less meaningful. The idea here is to give you a rough impression of what to expect when running popular models on popular professional GPUs. Metrics typically measured include latency and throughput or, as perceived by end users, "time to first token" and "tokens per second". To measure these numbers, we used both TGI's own benchmarking tool and an in-house tool.

We ran TGI with its default backend and the Llama and Qwen models in different sizes and in pre-quantized and on-the-fly quantized flavors. This excludes the GGUF model format as used by Llama.cpp/ollama. (The Llama.cpp backend for TGI introduces additional build complexities as it should be compiled with native CPU instructions on the target machines – on-premise deployment would get even more challenging, so for the time being we have not looked into supporting that.)

Our testing was conducted using TGI version 3.3.0 across three different (single!) GPU configurations: NVIDIA H100 80GB HBM3, NVIDIA L40S 46GB, and NVIDIA RTX A6000 48GB. We developed a custom benchmark tool that attempts to simulate actual usage patterns under varying load conditions by:

  • Performing requests against the OpenAI API-compatible chat completion endpoint

  • Ramping up concurrent users over a configurable time period

  • Rotating between different prompt scenarios: "best case" scenarios use regular small prompts with low token output values and low concurrent users, while "worst case" scenarios use large prompts from a prompts file (book chapters that should be summarized) with high output token values and more concurrent users

  • Measuring a number of metrics; here we only report end-to-end metrics as perceived by users, including time to first token (TTFT) and tokens per second (the reported token metrics are approximations computed with a default tokenizer); a condensed sketch of the request loop follows this list
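For illustration, a heavily condensed sketch of such a request loop is shown below; it is not our actual benchmark tool, and the ramp-up schedule, prompt, endpoint, and model name are placeholder assumptions. Token counts are approximated from the streamed chunks rather than a proper tokenizer.

```python
# pip install openai
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


async def one_user(user_id: int, start_delay: float, prompt: str, results: list):
    await asyncio.sleep(start_delay)  # staggered starts simulate the ramp-up
    t0 = time.perf_counter()
    ttft, tokens = None, 0
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - t0  # time to first token
            tokens += 1  # roughly one token per streamed chunk
    results.append((user_id, ttft, tokens / (time.perf_counter() - t0)))


async def main(concurrent_users: int = 4, ramp_up_s: float = 10.0):
    prompt = "Summarize the following chapter: ..."  # rotate real scenarios here
    results: list = []
    await asyncio.gather(*[
        one_user(i, i * ramp_up_s / concurrent_users, prompt, results)
        for i in range(concurrent_users)
    ])
    for user_id, ttft, tps in results:
        print(f"user {user_id}: TTFT {ttft:.2f}s, ~{tps:.1f} tokens/s")


asyncio.run(main())
```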


For example, TGI's benchmark tool looks like this:


[Image: TGI benchmark tool (basebox)]



Our benchmark tool outputs a report like this:


```
# Benchmark Results:

## Request Statistics:
Total Requests: 1
Successful Requests: 1
Error Rate: 0.00%
Requests/second: 0.23

## Timing Statistics:
Average Time To First Token (TTFT): 135.703209ms
Average Time Per Output Token (TPOT): 13.109962ms
Average Request Duration: 4.383331042s
Total Benchmark Duration: 4.383331042s

## Token Statistics:
Total Input Tokens: 6
Total Output Tokens: 324
Average Input Tokens/Request: 6.00
Average Output Tokens/Request: 324.00
Average Tokens/Second: 73.92
Note: Token counts estimated (API usage not always provided)

## Configuration:
Model: meta-llama/Llama-3.1-8B-Instruct
Concurrent Users: 1
API Endpoint: http://XXX.XXX.XXX.XXX:8080/v1
Iterations: 1
Max Tokens: 1024
```


Results Summary

NVIDIA H100 80GB HBM3

Hardware Configuration:

  • Ubuntu 22.04.5 LTS, Intel CPU (16 cores), 196GB RAM

  • NVIDIA H100 80GB HBM3

| Model | Quantization | Max Input Tokens | Scenario | Concurrent Users | TTFT | Tokens/sec | Max Output Tokens |
|---|---|---|---|---|---|---|---|
| Llama 3.3 70B | AWQ | 32,000 | Best Case | 1 | 131ms | 40.11 | 1,024 |
| Llama 3.3 70B | AWQ | 64,000 | Worst Case | 4 | 38.97s | 6.15 | 8,000 |

The H100 performs well for single users with the 70B model, achieving 131ms time to first token. With 4 concurrent users and large contexts, performance drops substantially to 38.97s TTFT, reflecting the memory constraints of heavy concurrent workloads.

NVIDIA L40S 46GB

Hardware Configuration:

  • Ubuntu 22.04.5 LTS, AMD CPU (48 cores), 283GB RAM

  • NVIDIA L40S 46GB

| Model | Quantization | Max Input Tokens | Scenario | Concurrent Users | TTFT | Tokens/sec | Max Output Tokens |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | EETQ (8-bit) | 32,000 | Best Case | 1 | 136ms | 73.92 | 1,024 |
| Llama 3.1 8B | EETQ (8-bit) | 64,000 | Worst Case | 4 | 7.11s | 19.88 | 8,000 |
| Qwen 2.5 32B | AWQ | 32,000 | Best Case | 1 | 177ms | 34.06 | 1,024 |
| Qwen 2.5 32B | AWQ | 32,000 | Worst Case | 4 | 38.73s | 6.28 | 8,000 |
| Qwen 2.5 14B | GPTQ Int-8 | 32,000 | Best Case | 1 | 136ms | 40.67 | 1,024 |
| Qwen 2.5 14B | GPTQ Int-8 | 32,000 | Worst Case | 4 | 11.77s | 10.87 | 8,000 |

The L40S handles different model sizes effectively. The 8B models deliver 73.92 tokens/second, while the 32B model achieves 34.06 tokens/second for single users. Under concurrent load, throughput decreases significantly – the 32B model drops to 6.28 tokens/second with 4 users.

NVIDIA RTX A6000 48GB

Hardware Configuration:

  • Ubuntu 22.04.5 LTS, Intel CPU (48 cores), 125GB RAM

  • NVIDIA RTX A6000 48GB

| Model | Quantization | Max Input Tokens | Scenario | Concurrent Users | TTFT | Tokens/sec | Max Output Tokens |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | EETQ (8-bit) | 32,000 | Best Case | 1 | 262ms | 81.01 | 1,024 |
| Llama 3.1 8B | EETQ (8-bit) | 64,000 | Worst Case | 4 | 6.06s | 20.21 | 8,000 |
| Llama 3.1 8B | EETQ (8-bit) | 64,000 | Worst Case | 8 | 2.11s | 30.40 | 8,000 |
| Llama 3.3 70B | AWQ | 8,000 | Best Case | 4 | 117ms | 19.41 | 1,024 |
| Qwen 2.5 14B | GPTQ Int-8 | 32,000 | Best Case | 1 | 85ms | 46.30 | 1,024 |
| Qwen 2.5 14B | GPTQ Int-8 | 32,000 | Worst Case | 4 | 17.54s | 10.04 | 8,000 |

The RTX A6000 performs well with 8B models, reaching 81.01 tokens/second. The 8-user test with Llama 3.1 8B achieved better throughput than the 4-user test (30.40 vs 20.21 tokens/second), which might suggest more efficient batching at higher concurrency. The 70B model runs but requires a reduced context window of 8,000 tokens.

Key Insights

Model Size vs. Performance: Smaller models (8B parameters) consistently deliver 2-4x higher throughput than their larger counterparts across all hardware configurations — a critical factor for concurrent user scenarios. The 14B models offer a good balance between capability and performance, while 70B models require careful consideration of concurrent user limits.

Quantization Impact: AWQ and EETQ quantization techniques enable larger models to run on less capable hardware while maintaining acceptable performance. GPTQ Int-8 shows good results for mid-sized models. The quantization impact on response quality and "perplexity" must eventually be assessed through human review.

Concurrent User Scaling: All configurations show significant performance degradation as concurrent users increase, particularly with large context windows. This highlights the importance of proper capacity planning for production deployments. Note that mixture-of-experts models (such as DeepSeek) tend to require larger batch sizes to achieve efficient GPU utilization, which can further increase latency in low-concurrency or on-premise scenarios.

Hardware Considerations: While the H100 offers the best raw performance, the L40S and RTX A6000 provide more cost-effective alternatives for many use cases, especially when running appropriately sized models with suitable quantization. (As a side note: with the H200 available now, H100 prices have dropped.)

Software and Tooling

The user interface for interacting with LLM services is a key part of an AI system. Although some ready-to-use open-source chat user interfaces exist, chances are they won't serve you well out of the box, and you would end up heavily modifying an off-the-shelf solution to address your needs. Enterprise environments often have specific requirements around security and access control, compliance, and integration with existing systems. You also might want to enable organizations to create and customize their own domain-specific LLM apps and expose them to other units.

For basebox, running the various components we need with their dependencies through Docker has been essential. The challenge is that you have to manage a locally distributed system with all of its complexities when components interface with each other and processing is mostly asynchronous. In addition, on-premise deployments pose some extra challenges around installation, hardware and software compatibility and diagnostics, or offline capabilities. This is where custom tooling comes in. For basebox, the need for such tools has organically emerged as part of the development process, for instance:

  • bbsetup: a terminal UI installer that walks admins through the on-premise installation process and comes with pre-packaged default models

  • basecheck: a diagnostics tool that checks hardware capabilities, software requirements and connectivity such as GPU drivers, Docker, ports, internet and SMTP server connectivity, proxy settings

  • mget: a command-line tool and integration library that efficiently downloads and manages model files based on a declarative description (a minimal download sketch follows this list)

  • lgen: a license generation command-line tool and integration library for controlling the allowed number of users, token budgets, expiration times

  • ragsrv: an offline and CPU-compatible RAG system with document processing pipelines, hybrid full-text and semantic search, APIs and webhooks
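To illustrate the kind of pre-download step that a tool like mget automates (this is not basebox's implementation), the sketch below fetches a model snapshot from Hugging Face into a local directory that an inference engine can later load fully offline; the repository ID and target path are assumptions.

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download all files of a model repository for later offline use.
# On the air-gapped target machine, the inference engine is then pointed
# at this directory (and outbound calls to the Hub are disabled).
local_path = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",   # gated model: requires an HF token
    local_dir="/srv/models/llama-3.1-8b-instruct",
)
print(f"Model files available at {local_path}")
```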

Conclusion

On-premises AI deployments require careful balance: fixed GPU investments must support rapidly evolving models without cloud scaling flexibility. Our experience with basebox shows there's no one-size-fits-all solution — hardware and model choices depend entirely on specific workloads and must be validated through systematic testing. With maturing inference engines and better tooling, on-premise AI is increasingly viable for organizations with regulatory or data sovereignty needs.
