
LoRA Fine-Tuning Tutorial: Step-by-Step With Together AI

A step-by-step walkthrough for fine-tuning open-source LLMs with LoRA on Together AI, covering dataset preparation, API configuration, hyperparameter selection, and the common mistakes that waste training budget.


Full fine-tuning a 7B parameter model requires roughly 28 GB of GPU VRAM just to hold the weights in float32, and on the order of 112 GB once you add gradients and Adam optimizer states, before counting activations. That is a multi-GPU node most teams do not have sitting around, and it is unnecessary for most production fine-tuning tasks.

LoRA (Low-Rank Adaptation) sidesteps this by freezing the pre-trained weights and inserting small trainable matrices into the attention layers. The result is a model that behaves as though it were fine-tuned on your data but was trained at a fraction of the cost. A LoRA job on Llama 3.1 8B typically runs on a single A100 and completes in under an hour for 1,000 examples. Together AI exposes this as a managed API call, which means you can run a fine-tuning job without owning or renting a single GPU directly.

This tutorial walks through the full process: preparing your dataset, running the job via the Together AI API, evaluating the result, and the specific mistakes I have seen eat training budgets.


The Problem LoRA Solves

Base language models are trained to complete text in a general way. They are not trained to consistently produce your specific output format, follow your classification taxonomy, or match your brand’s response style. You can push them toward desired behavior with prompts, but prompts have limits: they consume tokens on every request, they can be overridden by unusual inputs, and they do not fundamentally change how the model weights represent your problem.

Full fine-tuning would fix this, but updating all parameters in a 7B model requires too much memory and compute for most teams to run regularly. LoRA solves the cost problem by decomposing the weight updates into low-rank matrices. Instead of updating a weight matrix W directly, LoRA learns two much smaller matrices A and B whose product forms the update, so the effective weight becomes W + BA. For a typical attention layer with rank=8, this reduces trainable parameters by more than 99% compared to full fine-tuning. The quality difference on narrow tasks is small — often within 2-3 percentage points of full fine-tuning, according to the original Hu et al. paper.
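To make that reduction concrete, here is the arithmetic for a single Llama-style 4096×4096 attention projection (dimensions are illustrative, not pulled from a specific config file):

```python
# Trainable parameters for one d x d attention projection:
# full fine-tuning vs. a rank-8 LoRA adapter.
d_model = 4096   # hidden size in Llama 3.1 8B
r = 8            # LoRA rank

full_params = d_model * d_model           # updating W directly
lora_params = d_model * r + r * d_model   # A (r x d) and B (d x r)

reduction = 1 - lora_params / full_params
print(f"full: {full_params:,}  lora: {lora_params:,}  saved: {reduction:.2%}")
```

Per layer this works out to roughly 99.6% fewer trainable parameters; across the whole model, where non-attention layers stay entirely frozen, the trainable share shrinks further still.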

Together AI’s fine-tuning API supports LoRA-based training as the default method for most supported models, including Llama 3.1 8B, Mistral 7B, and Qwen 2.5 7B. You pay per token processed during training, not per GPU-hour, which makes cost predictable before you start.


Step 1: Prepare Your Dataset

Dataset quality determines outcome quality. This is not a cliche — it is the specific finding from every fine-tuning job I have run. A model trained on 300 high-quality, consistent examples outperforms one trained on 2,000 sloppy examples with label noise.

Together AI expects training data in JSONL format with a specific conversation structure. Each line represents one training example:

{"messages": [{"role": "system", "content": "You extract structured data from unstructured text."}, {"role": "user", "content": "Patient presented with moderate lower back pain for 3 weeks."}, {"role": "assistant", "content": "{\"symptom\": \"lower back pain\", \"severity\": \"moderate\", \"duration\": \"3 weeks\"}"}]}

The messages array must include at least one user turn and one assistant turn. The assistant turn is what the model learns to produce. A system prompt is optional but strongly recommended — it gives the model consistent context for what it is supposed to do.

Minimum dataset size: Together AI requires at least 1 example, but I have not seen useful results below 100. For classification tasks, 200-500 examples per class tends to be the practical minimum for stable results. For generation tasks (summarization, reformatting), 300-800 examples typically covers enough variation to generalize.

How to build the dataset: If you have existing human-generated examples, use them. If you are bootstrapping from scratch, use GPT-4o or Claude to generate candidate examples and then review them for accuracy before including them in training. Do not train on unreviewed model outputs — you will amplify whatever errors the generator made.

Split your data: Hold out 10-20% for validation. Together AI automatically evaluates validation loss during training, and you need this to know when training has converged versus when it has overfit.
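A minimal split script, assuming your examples live in a local `training_data.jsonl` (the file names here are illustrative):

```python
import json
import random

def split_examples(examples, val_frac=0.15, seed=42):
    """Shuffle and split a list of examples into (train, validation)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = max(1, int(val_frac * len(shuffled)))
    return shuffled[n_val:], shuffled[:n_val]

if __name__ == "__main__":
    with open("training_data.jsonl") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    train, val = split_examples(examples)
    for path, rows in [("train.jsonl", train), ("validation.jsonl", val)]:
        with open(path, "w") as out:
            for ex in rows:
                out.write(json.dumps(ex) + "\n")
```

Fixing the seed keeps the split reproducible, so reruns of later fine-tuning jobs are evaluated against the same held-out examples.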


Step 2: Upload Your Dataset

Upload the training file using the Together AI Files API:

import together

client = together.Together(api_key="your-api-key")

with open("training_data.jsonl", "rb") as f:
    response = client.files.upload(
        file=("training_data.jsonl", f),
        purpose="fine-tune"
    )

file_id = response.id
print(f"Uploaded file ID: {file_id}")

Together AI validates the file on upload and will return an error if the format is wrong. Common issues: missing messages key, role values other than system/user/assistant, or non-UTF8 characters in the content.
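You can catch most of those problems locally before the upload round-trip. A sketch of a pre-flight check (not the official validator):

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def check_example(line, lineno):
    """Return a list of format problems for one JSONL line (empty if OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return [f"line {lineno}: not valid JSON"]
    messages = record.get("messages")
    if not isinstance(messages, list):
        return [f"line {lineno}: missing 'messages' array"]
    problems = []
    roles = [m.get("role") for m in messages]
    problems += [f"line {lineno}: bad role {r!r}" for r in roles if r not in ALLOWED_ROLES]
    if "user" not in roles or "assistant" not in roles:
        problems.append(f"line {lineno}: needs a user turn and an assistant turn")
    return problems
```

Run it over every line of the file and fix anything it flags before uploading.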


Step 3: Launch the Fine-Tuning Job

job = client.fine_tuning.create(
    training_file=file_id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    n_epochs=3,
    learning_rate=1e-5,
    lora=True,
    lora_r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    suffix="my-extraction-model"
)

print(f"Job ID: {job.id}")
print(f"Status: {job.status}")

Key parameters and what they actually do:

n_epochs is how many times the model sees your full dataset. For small datasets (under 500 examples), 3-5 epochs is standard. For larger datasets, 1-2 epochs is often enough. More epochs do not mean better results — after a point, the model memorizes your training set and validation loss starts climbing.

learning_rate at 1e-5 is conservative. The Together AI docs suggest 1e-4 as the default. I have found 1e-5 to 5e-5 works better for instruction-following tasks where you want to preserve the base model’s general capabilities. At 1e-4, you can overwrite capabilities the base model already had.

lora_r is the rank of the LoRA matrices. Rank 8 is a common starting point. Higher rank (16, 32) increases model capacity but also memory and training cost. For most task-specific fine-tuning, rank 8 performs comparably to rank 32 at about a quarter of the training cost.

lora_alpha is typically set to 2× the rank. It controls how strongly the low-rank update is weighted relative to the frozen weights: the update is scaled by alpha / r. The alpha = 2 × r pairing is a widely used heuristic rather than a rule from the original paper, and it is where the 8/16 pairing above comes from.
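The role of rank and alpha is easiest to see in the forward pass itself. A toy numpy sketch with tiny dimensions (B is zero-initialized, which is why training starts from the unmodified base model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4            # toy sizes; a real Llama projection is 4096 x 4096
alpha = 2 * r           # the common alpha = 2r pairing

W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

x = rng.normal(size=(d,))

# LoRA forward pass: base output plus the scaled low-rank update B @ A.
h = W @ x + (alpha / r) * (B @ (A @ x))
```

Because B starts at zero, h equals W @ x before the first optimizer step; the adapter only deviates from the base model as B is trained.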


Step 4: Monitor and Evaluate

Poll for job status:

import time

while True:
    status = client.fine_tuning.retrieve(job.id)
    print(f"Status: {status.status} | Train loss: {status.train_loss}")
    if status.status in ["completed", "failed", "cancelled"]:
        break
    time.sleep(30)

When the job completes, Together AI provides the fine-tuned model ID in the format your-account/model-name. You can immediately run inference against it using the standard chat completions endpoint.

Evaluating the result: Do not rely solely on training loss. Run your fine-tuned model against your held-out validation set and measure what you actually care about — F1 score for classification, exact match for structured extraction, or human ratings for generation quality. Training loss tells you the model learned the examples; it does not tell you whether it generalized.
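For a structured-extraction task like the example above, exact match after JSON normalization is a reasonable starting metric. A sketch (function and argument names are illustrative):

```python
import json

def exact_match_rate(predictions, references):
    """Fraction of predictions equal to the reference after JSON parsing,
    so key order and whitespace differences do not count as misses."""
    hits = 0
    for pred, ref in zip(predictions, references):
        try:
            hits += json.loads(pred) == json.loads(ref)
        except json.JSONDecodeError:
            pass  # unparseable model output counts as a miss
    return hits / len(references)
```

Comparing parsed objects rather than raw strings avoids penalizing the model for cosmetic differences in serialization.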

[Table: LoRA rank settings, their parameter counts, training cost multipliers, and typical quality impact on narrow tasks — LoRA training parameters and their effect on training cost vs. model quality.]

Common Mistakes That Waste Your Budget

Training on inconsistent labels. If your assistant responses vary in format for equivalent inputs — sometimes JSON, sometimes plain text, sometimes JSON with extra fields — the model will learn to be inconsistent. Audit a random 10% of your training examples before uploading. Fix label inconsistencies before training, not after.
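One quick way to run that audit, assuming JSON-formatted assistant outputs (a sketch; adapt the consistency check to your own output format):

```python
import json
import random

def audit_assistant_outputs(examples, frac=0.1, seed=0):
    """Sample a fraction of examples; return the key sets of assistant
    outputs that parse as JSON, plus any outputs that fail to parse."""
    rng = random.Random(seed)
    sample = rng.sample(examples, max(1, int(frac * len(examples))))
    key_sets, failures = [], []
    for ex in sample:
        last = [m["content"] for m in ex["messages"] if m["role"] == "assistant"][-1]
        try:
            key_sets.append(frozenset(json.loads(last)))
        except json.JSONDecodeError:
            failures.append(last)
    return key_sets, failures
```

If `failures` is non-empty or the key sets disagree, fix the labels before training rather than hoping the model averages over the noise.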

Not setting a system prompt. Models trained without a system prompt learn to respond to bare user messages. At inference time, if you add a system prompt, the model behaves differently than during training. Use the same system prompt in training that you will use in production.

Using the wrong base model. Instruct-tuned models (ending in -Instruct) are fine-tuned on conversation data and follow instructions by default. Base models are raw pre-trained weights and respond to completion prompts, not chat-style prompts. If your training data uses messages format, you almost always want an instruct model. Training a base model with chat-format data produces inconsistent results.

Stopping too early because loss looks good. Training loss decreasing does not mean validation loss is decreasing. If you do not hold out a validation set, you cannot tell whether the model is generalizing or memorizing. Set aside 15-20% of your examples as validation before uploading.

Expecting fine-tuning to inject new factual knowledge. LoRA fine-tuning changes how the model responds, not what it knows. If your task requires the model to know facts that are not in its pre-training data (recent events, proprietary internal documents), fine-tuning will not fix this. That is a retrieval problem, not a fine-tuning problem.


Step 5: Deploy and Compare

Once the job completes, your fine-tuned model is available via the standard Together AI inference API. The model ID follows the pattern your-account/model-name-suffix. Call it the same way you would call any hosted model:

response = client.chat.completions.create(
    model="your-account/meta-llama-3.1-8b-instruct-reference-my-extraction-model",
    messages=[
        {"role": "system", "content": "You extract structured data from unstructured text."},
        {"role": "user", "content": "Patient reported severe headache starting yesterday afternoon."}
    ]
)

Run A/B comparisons against your baseline (base model with your best prompt). Measure on 50-100 fresh examples not in your training or validation set. If the fine-tuned model does not outperform the prompted base model on this held-out set, you have a data quality problem or the task is not a good fit for fine-tuning.
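One way to structure that comparison, with the example fields and metric as placeholders (a sketch; `client` is the same `together.Together` instance as above):

```python
def score_model(client, model_id, examples, system_prompt, metric):
    """Run one model over fresh examples and score it with the given metric.
    Each example is a dict with 'input' and 'expected' fields (illustrative)."""
    predictions = []
    for ex in examples:
        resp = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": ex["input"]},
            ],
        )
        predictions.append(resp.choices[0].message.content)
    return metric(predictions, [ex["expected"] for ex in examples])

# Same harness, same metric, same examples -- only the model ID changes:
# baseline = score_model(client, base_model_id, examples, prompt, metric)
# tuned    = score_model(client, finetuned_model_id, examples, prompt, metric)
```

Keeping the harness identical for both models means any score difference comes from the weights, not from evaluation drift.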


Action Items You Can Take Today

  1. Audit your use case first. Does your task require consistent output format, speed, or cost reduction at volume? Those are LoRA fine-tuning’s strengths. Does your task require broad knowledge, reasoning across long context, or frequent updates? Those favor RAG or better prompting.

  2. Start with 100 examples. Build a small, clean dataset of 100 examples with consistent formatting. Run a quick job on Together AI (it will cost under $0.50) and evaluate the output before investing in a larger dataset.

  3. Establish your baseline. Before fine-tuning, measure your base model with your best system prompt on your validation set. Write down the number. This is what fine-tuning needs to beat to justify the effort.

  4. Pick rank 8. Unless you have a specific reason to go higher, start with lora_r=8, lora_alpha=16, n_epochs=3, learning_rate=1e-5. This is a reasonable starting configuration for most narrow tasks. Adjust based on validation loss curves, not intuition.

  5. Keep your fine-tuned model names meaningful. Together AI uses a suffix parameter to name your models. Use a naming scheme that includes the task and date — extraction-v1-2026-02 — so you can track which version of the model is in production and what data it was trained on.
