# OnPremize

> On-prem AI for enterprise codebases

> Summary version: [llms.txt](https://onpremize.com/llms.txt)

> Last-Updated: 2026-06-05T21:50:33.302Z

OnPremize is an enterprise AI code intelligence platform that deploys entirely inside your network. It combines retrieval-augmented generation (RAG) over your codebase with LoRA fine-tuning, so models learn your architecture, patterns, and conventions. No code or data leaves your infrastructure.

## Core Modules

### Code-Aware RAG

Hybrid dense + sparse search over indexed repositories using BGE-M3 embeddings and Qdrant vector storage. Retrieves contextually relevant code snippets for AI-assisted Q&A. Uses Reciprocal Rank Fusion (RRF) to merge dense and sparse search results for higher relevance.

### LoRA Fine-Tuning

Train lightweight LoRA adapters on your codebase. Supports multiple model families (Qwen2.5-Coder, StarCoder2, and others) via a pluggable port-and-adapter architecture. Uses Unsloth for optimized training where supported, and standard PEFT otherwise. Generates SFT datasets from your repositories including fill-in-the-middle, architecture Q&A, and git patch formats.

### OpenAI-Compatible API

Drop-in replacement for existing AI tooling. The server exposes both OpenAI and Anthropic API formats, making integration with editors, CI/CD pipelines, and custom workflows straightforward.

### Agent Workflows

Multi-step tool-calling traces and architecture Q&A dataset generation for continuous model improvement. Supports synthetic data generation pipelines to keep models current with evolving codebases.

## Deployment Modes

- **On-premise bare metal**: Direct installation on your hardware
- **Kubernetes**: Containerized deployment with orchestration
- **VPC**: Cloud-hosted but within your virtual private cloud
- **Air-gapped**: Fully offline operation with no outbound internet access. Offline artifact bundles (containers, model weights, dependencies) can be transferred via your approved process.
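
The Reciprocal Rank Fusion step described under Code-Aware RAG can be sketched as follows. This is a minimal illustration of the general RRF technique, not OnPremize's actual implementation: the function name `reciprocal_rank_fusion`, the constant `k=60` (a commonly used default), and the file paths in the example are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (embedding) search and a sparse (lexical) search.
dense_hits = ["auth/session.py", "auth/tokens.py", "db/models.py"]
sparse_hits = ["auth/tokens.py", "api/routes.py", "auth/session.py"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only rank positions, it needs no score normalization between the dense and sparse retrievers. Documents that rank well in both lists rise to the top of the fused result.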
## Architecture

The system uses a unified server design:

- **Server** (port 8000): Handles requests, RAG retrieval, and routes LLM calls
- **LLM Worker** (optional, port 8001): Separate process for GPU inference, enabled via configuration
- **Qdrant**: Vector database for hybrid search (dense + sparse vectors)
- **BGE-M3**: Embedding model producing 1024-dim dense vectors + sparse lexical weights

LLM inference can run in-process or via a separate worker process for resource isolation. Models are loaded lazily on first request (direct mode) or eagerly at worker startup.

## Technical Stack

- Python 3.12, FastAPI
- PyTorch, Transformers, PEFT, TRL
- Qdrant vector database
- BGE-M3 embeddings (BAAI/bge-m3)
- Qwen2.5-Coder, StarCoder2 (open-weight LLMs)
- Unsloth (optimized training)
- 4-bit quantization support for reduced GPU memory

## Security & Compliance

- All processing happens within your network boundary
- API key authentication supported
- No telemetry or external data transmission
- Configurable logging detail for privacy requirements
- SSO (SAML/OIDC), RBAC, and audit logging on the roadmap
- Supports air-gapped operation with no internet dependency

## Hardware Requirements

- **Pilot**: Single GPU node (e.g., NVIDIA A10 or A100)
- **Production**: Multiple GPU nodes for inference and fine-tuning workloads
- Exact requirements scoped during evaluation based on repo size, user count, and latency targets

## Packages

- **Team**: Core RAG + search for small teams
- **Business**: Adds fine-tuning, multi-repo support, and priority support
- **Enterprise**: Full platform with air-gapped deployment, SLA, and dedicated support

## Frequently Asked Questions

**Does any code or data leave our network?**
No. OnPremize runs inside your infrastructure (VPC, on-prem, or air-gapped). Prompts, code context, and embeddings are processed within your boundary.

**Can you run fully air-gapped?**
Yes. Offline artifact bundles are provided for transfer via your approved process. Day-to-day operation does not require outbound internet access.

**What models are supported?**
Open-weight models such as Qwen2.5-Coder and StarCoder2. The adapter system is pluggable for adding new model families.

**How do citations work?**
Responses include citations back to source files (paths and line ranges) so engineers can verify. The system can express uncertainty rather than guessing when retrieval support is insufficient.

## Primary Sources

- [Air-Gapped AI Code Assistant](https://onpremize.com/solutions/air-gapped-ai-code-assistant)
- [On-Prem RAG for Source Code](https://onpremize.com/solutions/on-prem-rag-for-source-code)
- [LoRA Fine-Tuning for Private Code](https://onpremize.com/platform/lora-fine-tuning-for-private-code)
- [On-Prem AI Governance](https://onpremize.com/security/on-prem-ai-governance)
- [Machine-Readable Entity Facts](https://onpremize.com/ai-entity.json)

## Pages

- [Home](https://onpremize.com)
- [Privacy Policy](https://onpremize.com/privacy)
- [Terms of Service](https://onpremize.com/terms)
- [Brand Kit](https://onpremize.com/brand)

## Contact

- General: ping@onpremize.com
- Sales: ping@onpremize.com
- Security: ping@onpremize.com