# OnPremize

> On-prem AI for enterprise codebases
> Summary version: [llms.txt](https://onpremize.com/llms.txt)
> Last-Updated: 2026-06-05T21:50:33.302Z

OnPremize is an enterprise AI code intelligence platform that deploys entirely inside your network. It combines retrieval-augmented generation (RAG) over your codebase with LoRA fine-tuning, so models learn your architecture, patterns, and conventions. No code or data leaves your infrastructure.

## Core Modules

### Code-Aware RAG
Hybrid dense + sparse search over indexed repositories using BGE-M3 embeddings and Qdrant vector storage. Retrieves contextually relevant code snippets for AI-assisted Q&A. Uses Reciprocal Rank Fusion (RRF) to merge dense and sparse search results for higher relevance.

### LoRA Fine-Tuning
Train lightweight LoRA adapters on your codebase. Supports multiple model families (Qwen2.5-Coder, StarCoder2, and others) via a pluggable port-and-adapter architecture. Uses Unsloth for optimized training where supported, and standard PEFT otherwise. Generates SFT datasets from your repositories including fill-in-the-middle, architecture Q&A, and git patch formats.

### OpenAI-Compatible API
Drop-in replacement for existing AI tooling. The server exposes both OpenAI and Anthropic API formats, making integration with editors, CI/CD pipelines, and custom workflows straightforward.

### Agent Workflows
Multi-step tool-calling traces and architecture Q&A dataset generation for continuous model improvement. Supports synthetic data generation pipelines to keep models current with evolving codebases.

## Deployment Modes

- **On-premise bare metal**: Direct installation on your hardware
- **Kubernetes**: Containerized deployment with orchestration
- **VPC**: Cloud-hosted but within your virtual private cloud
- **Air-gapped**: Fully offline operation with no outbound internet access. Offline artifact bundles (containers, model weights, dependencies) can be transferred via your approved process.

## Architecture

The system uses a unified server design:

- **Server** (port 8000): Handles requests, RAG retrieval, and routes LLM calls
- **LLM Worker** (optional, port 8001): Separate process for GPU inference, enabled via configuration
- **Qdrant**: Vector database for hybrid search (dense + sparse vectors)
- **BGE-M3**: Embedding model producing 1024-dim dense vectors + sparse lexical weights

LLM inference can run in-process or via a separate worker process for resource isolation. Models are loaded lazily on first request (direct mode) or eagerly at worker startup.

## Technical Stack

- Python 3.12, FastAPI
- PyTorch, Transformers, PEFT, TRL
- Qdrant vector database
- BGE-M3 embeddings (BAAI/bge-m3)
- Qwen2.5-Coder, StarCoder2 (open-weight LLMs)
- Unsloth (optimized training)
- 4-bit quantization support for reduced GPU memory

## Security & Compliance

- All processing happens within your network boundary
- API key authentication supported
- No telemetry or external data transmission
- Configurable logging detail for privacy requirements
- SSO (SAML/OIDC), RBAC, and audit logging on the roadmap
- Supports air-gapped operation with no internet dependency

## Hardware Requirements

- **Pilot**: Single GPU node (e.g., NVIDIA A10 or A100)
- **Production**: Multiple GPU nodes for inference and fine-tuning workloads
- Exact requirements scoped during evaluation based on repo size, user count, and latency targets

## Packages

- **Team**: Core RAG + search for small teams
- **Business**: Adds fine-tuning, multi-repo support, and priority support
- **Enterprise**: Full platform with air-gapped deployment, SLA, and dedicated support

## Frequently Asked Questions

**Does any code or data leave our network?**
No. OnPremize runs inside your infrastructure (VPC, on-prem, or air-gapped). Prompts, code context, and embeddings are processed within your boundary.

**Can you run fully air-gapped?**
Yes. Offline artifact bundles are provided for transfer via your approved process. Day-to-day operation does not require outbound internet access.

**What models are supported?**
Open-weight models such as Qwen2.5-Coder and StarCoder2. The adapter system is pluggable for adding new model families.

**How do citations work?**
Responses include citations back to source files (paths and line ranges) so engineers can verify. The system can express uncertainty rather than guessing when retrieval support is insufficient.

## Primary Sources

- [Air-Gapped AI Code Assistant](https://onpremize.com/solutions/air-gapped-ai-code-assistant)
- [On-Prem RAG for Source Code](https://onpremize.com/solutions/on-prem-rag-for-source-code)
- [LoRA Fine-Tuning for Private Code](https://onpremize.com/platform/lora-fine-tuning-for-private-code)
- [On-Prem AI Governance](https://onpremize.com/security/on-prem-ai-governance)
- [Machine-Readable Entity Facts](https://onpremize.com/ai-entity.json)

## Pages

- [Home](https://onpremize.com)
- [Privacy Policy](https://onpremize.com/privacy)
- [Terms of Service](https://onpremize.com/terms)
- [Brand Kit](https://onpremize.com/brand)

## Contact

- General: ping@onpremize.com
- Sales: ping@onpremize.com
- Security: ping@onpremize.com