Prompting vs Fine-Tuning vs RAG vs RL
Your VP of Product is keen to get AI-powered features shipped by next quarter. Your engineering team is properly split: some are advocating for fine-tuning Llama 3, others insist Retrieval-Augmented Generation (RAG) is the only scalable way forward, your prompt engineering hire says “just use better context,” and someone just came back from a conference banging on about Reinforcement Learning (RL).
They all have compelling arguments, vendor benchmarks, and strong opinions. Here’s the rub: they’re all spot on. And they’re all talking nonsense.
The answer depends entirely on your specific bottleneck. To make the right choice, you need to look beyond the API bill and consider the Total Cost of Ownership (TCO).
- Hidden Latency Costs: RAG retrieval steps can add 300ms–500ms to every request, directly impacting user conversion rates.
- The “Context Bloat” Tax: RAG often bloats prompts with 2,000+ tokens of context, making per-query costs 5x higher than a fine-tuned model that needs zero examples.
- Maintenance Reality: Fine-tuning a 7B parameter model might cost <$500 in compute, but maintaining the training dataset can cost $50k/year in engineering hours.
This guide offers a framework to figure out the TCO yourself, without the sales pitch.
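Before we get to definitions, here is the “context bloat” tax above as a quick back-of-envelope sketch. The token price and traffic figures are invented for illustration; only the 2,000-token context figure comes from the list above.

```python
# Back-of-envelope: why a context-stuffed RAG prompt can cost ~5x more per query
# than a fine-tuned model that needs no in-context examples.
# The price and traffic figures below are assumptions for illustration only.

PRICE_PER_1K_INPUT_TOKENS = 0.01   # assumed $/1K input tokens
QUERIES_PER_MONTH = 1_000_000      # assumed traffic

rag_prompt_tokens = 2_000 + 500       # retrieved context + instructions and query
finetuned_prompt_tokens = 500         # short instructions only, no stuffed context

def monthly_input_cost(prompt_tokens: int) -> float:
    """Monthly spend on input tokens alone (output tokens ignored for simplicity)."""
    return prompt_tokens / 1_000 * PRICE_PER_1K_INPUT_TOKENS * QUERIES_PER_MONTH

rag = monthly_input_cost(rag_prompt_tokens)              # $25,000/month
finetuned = monthly_input_cost(finetuned_prompt_tokens)  # $5,000/month
print(f"RAG: ${rag:,.0f}/mo, fine-tuned: ${finetuned:,.0f}/mo ({rag / finetuned:.0f}x)")
```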
The Big Four: A CTO’s Definitions
Before we dive into the costs (in pounds and pence, or dollars and sense), let’s establish what we’re actually comparing.
1. Prompting (Context Engineering)
This is where you craft detailed instructions and shove data into the request window. Think of the model as a brilliant but literal intern; it stays general-purpose, you just get better at asking nicely. It’s best for instructions, formatting, logic puzzles, and getting something out the door by Friday.
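In practice, “asking nicely” looks something like the sketch below, assuming an OpenAI-style chat client; the model name, product, and prompt wording are placeholders.

```python
# Prompting: no training, no retrieval, just precise instructions in every request.
# Assumes the OpenAI Python SDK; the model name and prompt text are placeholders.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a support assistant for Acme Payments. "  # hypothetical product
    "Answer in at most three sentences and cite the relevant docs page. "
    "If you are not sure, say so and link to /docs/support."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable general-purpose chat model
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How do I issue a partial refund?"},
    ],
)
print(response.choices[0].message.content)
```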
2. RAG (Retrieval-Augmented Generation)
Here, you give the model access to a dynamic library. You hunt down relevant text from your database (often using a vector database) and paste it into the prompt at runtime. It’s essentially an open-book exam. The model remains general, but the facts it uses are bespoke. This is your go-to for proprietary data and real-time facts.
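A minimal sketch of the open-book exam at runtime: embed the query, find the closest chunks, and paste them into the prompt. In-memory cosine similarity stands in for a real vector database, and the document chunks are invented.

```python
# RAG at runtime: retrieve the most relevant snippets, then answer with them in the prompt.
# Minimal sketch: in-memory cosine similarity stands in for a real vector database.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

# In reality these chunks come from your knowledge base, embedded offline.
doc_chunks = ["Refunds settle within 5-10 business days...", "To rotate an API key..."]
doc_vectors = embed(doc_chunks)

def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed([query])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [doc_chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```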
3. Fine-tuning (The Specialist)
Fine-tuning involves updating the model’s internal weights on a specific dataset. You aren’t just giving it instructions; you’re changing its brain. It’s like sending that intern to medical school. They come back knowing exactly how to write a clinical note without being told. It is the king of specific formats and styles, and it cuts latency by removing the need for long instructions in every request.
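The expensive part is rarely the training run; it is curating the examples. Below is a sketch of the kind of supervised pairs you would assemble, in the chat-style JSONL format most hosted fine-tuning services accept; the clinical-note examples are invented.

```python
# Fine-tuning: the real work is curating (prompt, ideal response) pairs.
# Sketch of a chat-style JSONL training file; the examples here are invented.
import json

training_examples = [
    {
        "messages": [
            {"role": "system", "content": "Write a clinical note from the transcript."},
            {"role": "user", "content": "Patient reports a mild headache for three days..."},
            {"role": "assistant", "content": "SUBJECTIVE: Three-day history of mild headache..."},
        ]
    },
    # ...hundreds to thousands more, reviewed by domain experts
    # (this is the $50k/year maintenance line item from the intro)
]

with open("train.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
```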
4. Reinforcement Learning (RL/RLHF)
You create a reward system to punish or praise the model’s outputs, training it to optimise for a score. This is research-grade engineering. Unless you are OpenAI, Anthropic, or deep in robotics, you probably don’t need this.
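For intuition only: the whole idea is “sample outputs, score them, nudge the model towards higher scores”. The toy reward below is the easy part; the policy-update machinery (reward models, PPO, DPO and friends) is the research-grade engineering this section warns about.

```python
# RL in one idea: define a reward, score sampled outputs, and update the model to
# prefer higher-scoring ones. Only a (deliberately crude) reward function is shown;
# the update step is where the research-grade difficulty lives.
def reward(output: str) -> float:
    score = 0.0
    if len(output) < 500:                  # prefer concise answers
        score += 1.0
    if not output.strip().endswith("?"):   # prefer answers over counter-questions
        score += 0.5
    return score
```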
Real-World Case Studies: Where the Costs Lie
Theory is great, but where does the rubber meet the road? Let’s analyze a few common business problems and see how our framework applies.
Case Study A: The Developer Documentation Helper (Stripe’s Approach)
Problem: You’re a payments company with thousands of API endpoints. Developers need fast, accurate answers from your ever-changing documentation to resolve integration issues. A slow or wrong answer means a lost customer.
- The Winner: RAG.
- Why: The knowledge base (API docs, tutorials, guides) is constantly updated. Fine-tuning on this content would be slow and expensive to keep current. Basic prompting fails because the full documentation set is millions of tokens. RAG allows the model to search the latest documentation at query time and provide source links for verification.
- The Cost Reality: Stripe’s engineering cost wasn’t in training a model, but in building a world-class retrieval pipeline. This includes:
  - Embedding Costs: Running all your documentation through an embedding API (like text-embedding-3-large) one time.
  - Vector Database: Paying for a managed vector database (like Pinecone, or rolling their own on a service like AWS OpenSearch) to store and query these embeddings.
  - Latency Engineering: Optimizing the retrieval step to be under 200ms, as search speed is critical for user experience. The cost is measured in developer salaries, not GPU hours (see the sketch below).
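The ingestion side of that pipeline looks roughly like this. The chunking rule, batch size, and the vector_store interface are assumptions (a generic upsert/query object, not any vendor’s SDK); the embedding call matches the OpenAI Python client.

```python
# Offline ingestion: chunk the docs, embed them in batches, and load a vector store.
# `vector_store` is a generic stand-in with upsert()/query() methods, not a vendor SDK.
import time
from openai import OpenAI

client = OpenAI()

def chunk(text: str, max_chars: int = 2_000) -> list[str]:
    """Naive fixed-size chunking; real pipelines split on headings and sections."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def ingest(docs: dict[str, str], vector_store, batch_size: int = 64) -> None:
    records = [
        {"id": f"{doc_id}:{i}", "text": c}
        for doc_id, body in docs.items()
        for i, c in enumerate(chunk(body))
    ]
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        resp = client.embeddings.create(
            model="text-embedding-3-large",
            input=[r["text"] for r in batch],
        )
        vector_store.upsert(
            [(r["id"], d.embedding, {"text": r["text"]}) for r, d in zip(batch, resp.data)]
        )

def timed_retrieve(vector_store, query_vector, k: int = 5):
    """Wrap retrieval with a timer: the sub-200ms budget is where the salaries go."""
    start = time.perf_counter()
    hits = vector_store.query(query_vector, top_k=k)
    print(f"retrieval took {(time.perf_counter() - start) * 1000:.0f}ms")
    return hits
```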
Case Study B: The Code Completion Assistant (GitHub Copilot’s Approach)
Problem: You want to build a tool that suggests relevant code snippets to developers as they type, deeply understanding the nuances of dozens of programming languages and frameworks.
- The Winner: Fine-tuning.
- Why: This is a behavioral problem, not a knowledge one. The model doesn’t need to look up facts; it needs to behave like a senior developer. It needs to have the patterns, syntax, and idioms of code baked into its weights. RAG would be far too slow to provide real-time suggestions, and the context of a single file might not be enough.
- The Cost Reality: The initial model (OpenAI Codex, a GPT-3 descendant) was trained on a massive corpus of public code. GitHub then fine-tuned it on a curated, high-quality dataset.
- Upfront Cost: Millions of dollars in initial training and fine-tuning compute.
- Marginal Cost: Because the specialized model is so efficient, the cost per suggestion is extremely low, allowing them to offer it at a flat monthly fee. A prompt-based approach would be orders of magnitude more expensive at their scale (billions of suggestions per month).
Case Study C: The E-commerce Support Chatbot (DoorDash’s Problem)
Problem: A restaurant owner is trying to update their menu via a support chat. They might ask “how do I add soup?” or “my seasonal pumpkin spice latte isn’t showing up”. The answers are in the merchant handbook, but the chatbot needs to be empathetic and guide them through specific UI steps.
- The Winner: A Hybrid Approach (RAG + Fine-tuning).
- Why: They started with prompting, but it was unreliable.
  - RAG First: They first implemented RAG to pull in the most relevant sections from their merchant help center. This solved the knowledge problem, ensuring answers were factually correct.
  - Fine-tune for Behaviour: However, the tone was generic. The RAG output was often too verbose and didn’t sound like a helpful agent. They then fine-tuned a smaller model (like Llama 3) on thousands of example conversations. This taught the model the specific style and behaviour of a good support agent: how to be concise, empathetic, and action-oriented.
- The Cost Reality: The final solution uses RAG to fetch knowledge and a fine-tuned model to present it. This gives the best of both worlds:
  - RAG keeps the knowledge fresh.
  - Fine-tuning keeps the inference cost low and the user experience high-quality. The fine-tuned model requires very little prompt instruction (“You are a helpful assistant. Here is the context: [CONTEXT_FROM_RAG]”), making it cheaper and faster than stuffing examples into a giant prompt (see the sketch below).
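The wiring is roughly as follows: retrieval supplies fresh facts, and a small fine-tuned model turns them into an on-brand answer with an almost empty prompt. The retrieve() stub and the model identifier are placeholders, and the call assumes an OpenAI-compatible endpoint in front of your fine-tuned model.

```python
# Hybrid: RAG fetches fresh knowledge, a fine-tuned model presents it.
# The model id is a placeholder for your own fine-tuned checkpoint served behind an
# OpenAI-compatible endpoint; retrieve() is a stub for the vector-search step.
from openai import OpenAI

client = OpenAI()  # point base_url at wherever the fine-tuned model is hosted

def retrieve(question: str) -> list[str]:
    """Stub for vector search over the merchant help center."""
    return ["To add a menu item, open Menu Manager and choose 'Add Item'..."]  # invented

def support_reply(question: str) -> str:
    context = "\n\n".join(retrieve(question))  # RAG keeps the knowledge fresh
    resp = client.chat.completions.create(
        model="ft-llama-3-support-agent",  # placeholder fine-tuned model id
        messages=[
            # The fine-tuned model already knows the tone and format, so the
            # prompt carries almost nothing beyond the retrieved context.
            {"role": "system", "content": f"You are a helpful assistant. Here is the context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```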
TCO Comparison Cheat Sheet
When calculating the TCO, use this formula: $$\text{TCO} = (\text{Eng Hours} \times \text{Rate}) + \text{Compute}_{\text{train}} + (\text{Inference Volume} \times \text{Cost}_{\text{query}}) + \text{Maintenance}$$
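Here is the same formula as a calculator you can argue over in a planning meeting; every input below is an assumption to replace with your own numbers.

```python
# TCO = (eng hours x rate) + training compute + (query volume x cost per query) + maintenance
# All inputs are illustrative assumptions; swap in your own numbers.
def tco(eng_hours: float, rate: float, training_compute: float,
        queries_per_year: float, cost_per_query: float, maintenance: float) -> float:
    return eng_hours * rate + training_compute + queries_per_year * cost_per_query + maintenance

# First-year comparison with invented numbers:
prompting = tco(eng_hours=40,  rate=150, training_compute=0,
                queries_per_year=5_000_000, cost_per_query=0.02,  maintenance=5_000)
fine_tune = tco(eng_hours=400, rate=150, training_compute=500,
                queries_per_year=5_000_000, cost_per_query=0.004, maintenance=50_000)
print(f"Prompting: ${prompting:,.0f}  Fine-tuning: ${fine_tune:,.0f}")
# Prompting: $111,000  Fine-tuning: $130,500 - the crossover point depends on query volume.
```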
| Approach | Time to MVP | Initial Cost | Marginal Cost (per query) | Maintenance Burden |
|---|---|---|---|---|
| Prompting | Hours | S | XL | Low (Prompts drift) |
| RAG | Weeks | M | L | High (Data pipelines break) |
| Fine-tuning | Weeks | L | S | Medium (Data curation) |
| RL | Months | XL | M | Extreme |
The Bottom Line
You should almost always start with Prompting. It requires the least engineering graft, even though it carries the highest marginal cost. As discussed in The Economics of AI, the landscape is shifting towards efficiency, but premature optimisation is the root of all evil.
1. Start with Prompting. If it works but is too expensive, move to step 2.
2. Analyse the Failure (see the sketch after this list):
   - Knowledge Problem? (It doesn’t know the facts) -> Add RAG.
   - Behaviour/Cost Problem? (It won’t follow the format or is too verbose) -> Fine-tune.
3. Avoid RL until you have a dedicated research team and deep pockets.
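For teams that like their heuristics executable, here is the same flow as a small function; it is a rule of thumb, not a law.

```python
# The decision flow above as a heuristic, not a law.
def next_step(prompting_works: bool, failure_mode: str | None = None) -> str:
    if prompting_works:
        return "Ship it. Revisit only if marginal cost becomes the bottleneck."
    if failure_mode == "knowledge":   # it doesn't know the facts
        return "Add RAG."
    if failure_mode == "behaviour":   # wrong format, tone, verbosity, or cost
        return "Fine-tune a smaller model."
    return "Diagnose the failure properly before reaching for RL."
```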
Remember, the TCO isn’t just the bill you get from OpenAI or AWS. The TCO includes the weeks your team spends debugging a vector search algorithm instead of building product features. Choose the path that optimises for engineering velocity first, and compute efficiency second.