Deploying Open-Source Models: A Practical Production Guide
Most teams underestimate the operational gap between a working prototype and a production deployment. Choose managed inference (Together AI, Replicate, Modal) when you want to avoid owning GPU infrastructure. Self-host with vLLM or TGI when you have strict data residency requirements or need custom server logic. Either way, instrument your deployment before users hit it and treat model versioning like software versioning.
Getting an open-source model to produce useful output on your laptop is not the same problem as running that model in production for real users. The gap between the two is where most teams lose significant time. Inference servers crash under bursty load. Model versions drift between environments. Cold starts on serverless infrastructure add 30-second delays that kill user sessions. Observability is an afterthought until something breaks and you cannot figure out what.
This guide documents the decisions and trade-offs I have worked through deploying Llama, Mistral, and Qwen variants in production. I will cover the actual choices in the order you encounter them, with specific numbers where I have them.
The Real Problem With Production Deployments
A prototype runs one request at a time. Production runs concurrent requests with varied input lengths, users who will retry aggressively when something is slow, and a requirement to stay running across model updates and infrastructure incidents.
Three categories of problems show up once you move past a demo:
Throughput and latency under load. A single Llama 3.1 70B request on an A100 might complete in 3 seconds. Ten concurrent requests on the same GPU without proper batching can take 25-35 seconds each. Inference servers handle this differently, and picking the wrong one can mean 10x latency degradation.
Operational overhead. Self-hosting a model means owning the GPU, the server process, the container image, the autoscaler, the monitoring, and the update pipeline. This is weeks of engineering work before your first user request.
Model lifecycle management. Open-source models release new versions constantly. Llama 3.1 replaced 3.0 within months. Your production deployment needs a way to test, promote, and roll back model versions without downtime.
Step 1: Make the Self-Host vs. Managed Decision First
This decision constrains everything downstream, so make it explicitly rather than drifting into self-hosting by default.
Choose managed inference (Together AI, Replicate, Modal) if:
- You do not have a dedicated ML infrastructure engineer.
- Your data can leave your environment (check your compliance requirements explicitly, not just generally).
- Your primary workload uses models from the standard open-source catalog: Llama, Mistral, Qwen, DeepSeek, Gemma.
- Your team’s time is more expensive than the inference cost markup.
Choose self-hosted inference (vLLM, Text Generation Inference) if:
- Strict data residency requirements prohibit sending data to third-party APIs.
- You need to run custom or fine-tuned weights that no managed provider supports.
- Your request volume is high enough that the cost of GPU time on your own infrastructure beats the managed API markup. At roughly 50 million tokens per day, the math often shifts toward self-hosting.
- You need to instrument the model serving layer in ways managed APIs do not expose.
Most early-stage teams choose managed inference and move to self-hosting only when they hit a specific constraint. This is the right default. Running your own GPU fleet is an infrastructure problem that competes with your product roadmap for engineering time.
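The break-even arithmetic behind the volume criterion above is worth sketching. All rates in this snippet are illustrative assumptions, not quotes; substitute your provider's actual pricing:

```python
# Break-even estimate: the daily token volume at which renting your own GPUs
# costs the same as paying a managed API per token. All rates below are
# illustrative assumptions -- plug in real quotes before deciding.
MANAGED_USD_PER_M_TOKENS = 1.80  # assumed blended managed-API rate
GPU_USD_PER_HOUR = 2.00          # assumed A100 80GB cloud rate
N_GPUS = 2                       # e.g., a 70B model with tensor parallelism

def break_even_tokens_per_day(price_per_m: float, gpu_hr: float, n_gpus: int) -> float:
    """Daily tokens at which self-hosted GPU cost equals managed API cost."""
    self_hosted_daily = gpu_hr * n_gpus * 24
    return self_hosted_daily / price_per_m * 1_000_000

tokens = break_even_tokens_per_day(MANAGED_USD_PER_M_TOKENS, GPU_USD_PER_HOUR, N_GPUS)
print(f"break-even: ~{tokens / 1e6:.0f}M tokens/day")
```

Note that this naive number ignores peak-traffic headroom and redundancy, both of which require extra GPU capacity, so the practical break-even volume sits higher than the raw arithmetic suggests.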
Step 2: Choose an Inference Server
If you go the self-hosted route, your two serious options are vLLM and Hugging Face Text Generation Inference (TGI). Both support the major open-source model architectures and expose OpenAI-compatible APIs.
vLLM is the more widely deployed option in production settings. Its core advantage is PagedAttention, an attention mechanism that manages KV cache memory more efficiently than naive implementations. In practice, this means you can serve more concurrent requests on the same GPU without out-of-memory errors. vLLM supports continuous batching, which means requests are processed as a stream rather than in fixed-size batches, reducing the queuing latency that hurts short requests when they share a batch with long ones.
A minimum production-viable vLLM deployment for Llama 3.1 70B requires two A100 80GB GPUs with tensor parallelism enabled. Single-GPU deployment of a 70B model forces 8-bit or 4-bit quantization, which measurably degrades output quality on reasoning tasks.
TGI (Text Generation Inference) is Hugging Face’s production server. It integrates tightly with the Hugging Face model hub, making it easier to pull weights and stay current with model releases. TGI’s flash attention implementation is solid, and it has better native support for some model families (Falcon, BLOOM) that vLLM handles less cleanly. For most teams running Llama or Mistral variants, vLLM is the stronger choice; TGI’s main advantage is the Hugging Face ecosystem integration.
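Because both servers expose OpenAI-compatible APIs, your application code can stay server-agnostic. A minimal client sketch using only the standard library; the base URL, port, and model name are placeholders:

```python
import json
from urllib import request

# Minimal client for an OpenAI-compatible chat endpoint. vLLM and TGI both
# speak this format; the base URL and model name below are placeholders.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3.1-70B-Instruct"

def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # always set an explicit output cap
        "temperature": 0.2,
    }

def chat(prompt: str) -> str:
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Keeping the payload construction in its own function makes it easy to swap backends later: moving from self-hosted vLLM to a managed provider is then a change of `BASE_URL` and credentials, not a rewrite.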
Step 3: Set Up Model Versioning From Day One
Model versioning is the part teams skip and regret. When a new Llama release drops and you want to test it, you need a process that does not involve manually swapping weights on your production server.
The approach that works:
- Tag your model deployments explicitly. Do not run "latest." Pin your deployment to a specific model version (e.g., meta-llama/Llama-3.1-70B-Instruct at a specific commit hash) and store that reference in your deployment config, not just in your head.
- Maintain a staging environment that mirrors production. Run model version upgrades in staging first. Test your complete prompt suite against the new model. Output format regressions are common across model versions, even within the same model family.
- Build a rollback path before you need it. On managed platforms like Together AI, switching model versions is a config change. On self-hosted vLLM, you need the previous weights available and a deployment process that can swap them without a cold restart. Test the rollback path before you ship the first upgrade.
- Log the model version with every request. When you investigate a quality issue three months later, you need to know which model version produced it. This takes one field in your request log and saves hours of debugging.
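The last point really is a one-field change. A sketch of a request log entry that carries the pinned model reference (the repo@commit format and the hash shown are illustrative, not real values):

```python
import json
import time
import uuid

# Per-request logging that records the pinned model version.
# MODEL_REF should come from your deployment config, not be hardcoded;
# the repo@commit value here is an illustrative placeholder.
MODEL_REF = "meta-llama/Llama-3.1-70B-Instruct@9f1c3d2"

def log_request(prompt_tokens: int, output_tokens: int, latency_ms: float) -> str:
    """Serialize one request's metadata as a JSON log line."""
    entry = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "model_version": MODEL_REF,  # the one field that saves hours later
        "prompt_tokens": prompt_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
    }
    return json.dumps(entry)
```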
Step 4: Instrument Before Users Arrive
The minimum observability setup for a production model deployment:
- Request latency: Time to first token (TTFT) and total generation time, logged at p50, p95, and p99.
- Token throughput: Input and output tokens per request, so you can identify unexpectedly long inputs that cause latency spikes.
- Error rate: Model errors, rate limit errors, and infrastructure errors tracked separately. A 2% error rate that is entirely rate limits tells you something different than a 2% error rate from inference failures.
- Model version: Which model version served each request.
- Cost per request: On managed APIs, this is straightforward from token counts. On self-hosted, track GPU utilization and map it to cost using your actual hardware or cloud pricing.
Grafana with a Prometheus exporter from your inference server is a reasonable open-source stack for this. vLLM exposes a /metrics endpoint compatible with Prometheus out of the box.
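If you need the percentile math before a metrics backend is in place, the standard library covers it. A minimal sketch computing p50/p95/p99 from raw latency samples (in production these come from your metrics pipeline; the inputs here are synthetic):

```python
from statistics import quantiles

def latency_percentiles(samples: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 from raw latency samples (e.g., milliseconds)."""
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = quantiles(samples, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

The same function works for TTFT and total generation time; log both, since a healthy TTFT with a bad total time points at generation length rather than queuing.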
Step 5: Load Test Before Launch
Run a load test that reflects your actual traffic patterns before opening the endpoint to users. A simple benchmark that sends requests sequentially tells you little; you need concurrent load that matches your expected p95 traffic.
For a typical production deployment, test at 1x, 3x, and 5x your estimated peak concurrent users. Watch for:
- Latency degradation as concurrency increases (healthy systems degrade gradually; broken systems fall off a cliff).
- Out-of-memory errors on the GPU side (vLLM will log these; TGI will return 500 errors).
- Queue depth growth under sustained load (indicates your server cannot keep up with arrival rate).
locust is a reasonable load testing tool for HTTP inference endpoints. Artillery is another option if your team already uses it for other services.
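Before reaching for a full tool, note that the shape of the test is simple. A sketch of a concurrency sweep using a thread pool against a stubbed request function (swap fake_request for a real HTTP call to your endpoint; the latencies here are synthetic):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request() -> float:
    """Stand-in for one inference call; returns latency in ms."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.03))  # synthetic inference latency
    return (time.perf_counter() - start) * 1000

def run_at_concurrency(n_concurrent: int, total: int = 50) -> dict:
    """Fire `total` requests with `n_concurrent` in flight; report percentiles."""
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        latencies = sorted(pool.map(lambda _: fake_request(), range(total)))
    return {
        "concurrency": n_concurrent,
        "p50_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[int(len(latencies) * 0.95) - 1],
    }

for c in (1, 3, 5):  # 1x, 3x, 5x of an assumed peak concurrency of 1
    print(run_at_concurrency(c))
```

With a real endpoint behind fake_request, plot p95 against concurrency: a gradual slope is the healthy degradation described above, while a sharp knee marks the point where your server saturates.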
Common Mistakes
Skipping streaming on latency-sensitive endpoints. If a response takes 8 seconds to generate, users experience a blank UI for 8 seconds unless you stream tokens. vLLM, TGI, Together AI, and Modal all support server-sent events (SSE) streaming. Wire it up from the start; adding it later means refactoring your client handling.
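On the client side, an SSE stream from an OpenAI-compatible server arrives as "data: " lines carrying JSON deltas, terminated by "data: [DONE]". A sketch of assembling them (the field names match the OpenAI-style chat stream that vLLM and TGI emit; treat them as assumptions for other servers):

```python
import json

def parse_sse_lines(lines: list[str]) -> str:
    """Assemble token deltas from OpenAI-style SSE stream lines."""
    tokens = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separators
        body = line[len("data: "):]
        if body == "[DONE]":
            break  # stream terminator
        delta = json.loads(body)["choices"][0]["delta"]
        if "content" in delta:
            tokens.append(delta["content"])
    return "".join(tokens)
```

In a real client these lines arrive incrementally over the HTTP response; render each delta as it lands rather than waiting for the join at the end.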
Using a single GPU for a 70B model. Quantization gets you there on paper, but the quality hit is real and it shows up exactly in the tasks you care about most. Two A100s in tensor-parallel mode running FP16 is the correct setup for a 70B model in production.
Not setting a max token limit per request. An unconstrained request where a user sends an unusually long prompt can consume a GPU for minutes and queue all other requests behind it. Set a hard max_new_tokens limit appropriate to your use case and return an error for inputs that exceed your context budget.
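A sketch of that guard, with illustrative limits and a crude whitespace split standing in for your model's real tokenizer:

```python
# Request-level guards: reject over-budget inputs, cap output length.
# Both limits are illustrative; the whitespace split is a crude stand-in
# for your model's actual tokenizer.
MAX_INPUT_TOKENS = 6_000
MAX_NEW_TOKENS = 1_024

def validate_request(prompt: str, requested_max_new_tokens: int) -> int:
    """Return the max_new_tokens to use, or raise if the input is over budget."""
    approx_tokens = len(prompt.split())  # replace with a real token count
    if approx_tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"input of ~{approx_tokens} tokens exceeds budget of {MAX_INPUT_TOKENS}"
        )
    return min(requested_max_new_tokens, MAX_NEW_TOKENS)
```

Enforce this before the request reaches the inference queue, so an oversized prompt costs one rejected HTTP call instead of minutes of GPU time.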
Treating the managed API cost as fixed. Together AI, Modal, and Replicate all offer reserved capacity at lower rates than on-demand pricing. If your traffic is consistent rather than bursty, reserved pricing can reduce costs by 30-60%. Run three months of on-demand usage before evaluating reserved capacity, since you need real utilization data to size the reservation.
Copying prompts directly from development to production without a system prompt audit. System prompts that work in a chat interface often fail in a production API context where input format, output expectations, and failure modes differ. Review your complete prompt suite in staging before each model version change.
Action Items for Today
If you are setting up a new deployment:
- Decide self-hosted vs. managed using the criteria above. Write the decision down with the rationale so you can revisit it when your requirements change.
- If managed: set up a Together AI account, run 100 test requests against your primary model, and verify your output format is stable. Budget 2-3 hours.
- If self-hosted: provision a two-GPU A100 instance on Lambda Labs or RunPod, install vLLM, and start the OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model <your-model>) before writing any application code.
- Add model version logging to your first request. Do not wait until you need it.
- Set a load test goal: define the concurrent user count you need to support at p95 < 5 seconds, and do not declare the deployment production-ready until you have hit that number in a test.
The operational work here is not glamorous, but it is the difference between a model deployment that runs reliably for 18 months and one that your team spends nights debugging three weeks after launch. Most of the work is in the setup; once the infrastructure is stable, the day-to-day overhead is low.