Prompt Library for Model Evaluation and Benchmarking
Seven ready-to-use prompts for building LLM evaluation pipelines: rubric generation, test case creation, LLM-as-judge scoring, faithfulness checks, and stakeholder result summaries. Tested with Claude and GPT-4o.
Evaluating models in production is harder than the benchmarks make it look. MMLU and HumanEval tell you something about a model’s general capability, but they tell you almost nothing about whether it will handle your specific use case well after the 47th edge case shows up. The only reliable way to know is to build your own eval suite — test cases that reflect the distribution of inputs your system actually sees, with scoring criteria that match what “good” means for your task.
The prompts in this library target a gap I kept running into: the meta-work of evaluation. Writing rubrics, generating test cases, designing scoring systems, and summarizing results all consume engineering time before the actual model testing even starts. AI handles this scaffolding work surprisingly well, which frees up time for the parts that require judgment: deciding what to measure and interpreting the results.
I tested all prompts using Claude (claude-sonnet-4-6) and ChatGPT (GPT-4o). Outputs shown are paraphrased, not verbatim, because the goal is to show what kind of result each prompt produces, not to index AI output in an article about AI evaluation. These are starting points — your rubric needs to reflect your task definition, and your test cases need to reflect your actual input distribution.
Prompt 1: Writing an Eval Rubric for a Specific Task
The scenario: you’re evaluating a customer support model that answers questions about a SaaS product’s pricing and feature tiers. You need a rubric that human raters can apply consistently, and you’re tired of writing rubrics from scratch for every new task.
The prompt:
Create a 5-dimension evaluation rubric for rating AI responses to customer support questions about SaaS product pricing and feature comparisons.
Task context: The AI reads from a knowledge base containing plan tier descriptions (Starter, Pro, Business, Enterprise), pricing pages, and FAQ documents. Users ask questions like "Does the Pro plan include SSO?" or "What's the difference between Pro and Business for a 10-person team?"
For each dimension, provide:
- A 1-sentence description of what it measures
- Scoring scale (1-5) with anchors at 1, 3, and 5
- 1 example of a 1-score response and 1 example of a 5-score response for that dimension
Dimensions: factual accuracy, completeness, conciseness, tone, and appropriate scope (not going beyond what the knowledge base supports).
What it produced (paraphrased):
A rubric with clean 1-5 scales for each dimension. The factual accuracy dimension anchored 1-scores on statements that directly contradict the knowledge base, 3-scores on responses with one technically correct but misleading statement, and 5-scores on fully accurate responses. The scope dimension was well-constructed: it explicitly penalized responses that speculate about features not in the knowledge base, which is the failure mode that actually matters in this context.
Grade: Good Starting Point, Needs Calibration
The rubric is structurally sound and the example anchors are useful. The problem with generic rubrics is that raters disagree on edge cases, and the AI’s examples are generic enough that ambiguous real-world outputs fall between anchors. Before running it with human raters, pilot it on 20-30 real outputs and find where raters disagree. That’s where you add sharper examples. The AI gives you the frame; your data provides the examples that make it reliable.
One specific issue: factual accuracy and scope tend to collapse in practice because inaccuracies in this task type usually come from going out-of-scope. Consider combining them or treating one as a binary pre-check before scoring the others.
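The pilot itself is cheap to instrument. A minimal sketch of a chance-corrected agreement check (Cohen's kappa, stdlib only; the rater score lists are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement.
    A low pilot kappa usually means the rubric anchors need sharper
    examples at the score points where raters diverge."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected)

# Pilot scores from two raters on the same 10 outputs (made-up numbers)
a = [5, 4, 3, 5, 2, 4, 3, 5, 4, 2]
b = [5, 4, 2, 5, 2, 3, 3, 5, 4, 1]
kappa = cohens_kappa(a, b)
```

The outputs where the two raters disagree are exactly the cases to turn into new rubric anchors.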
Prompt 2: Generating Test Cases from Examples
The scenario: you have 5 labeled examples of good responses for an information extraction task. You need a few dozen more test cases with varied complexity, but writing them manually takes 3-4 hours.
The prompt:
I'm building a test suite for an information extraction model. The model reads B2B sales call transcripts and extracts: (1) expressed pain points, (2) mentioned competitors, (3) budget signals (any mention of cost, budget, or pricing sensitivity), (4) timeline indicators.
Here are 3 example inputs with expected extractions:
[example 1]
[example 2]
[example 3]
Generate 12 additional test cases with varied difficulty:
- 3 cases where one of the 4 fields is empty/absent from the transcript
- 3 cases where information is ambiguous (e.g., a mentioned tool might be a competitor or a partner)
- 3 cases with high noise (transcription errors, filler words, topic switching)
- 3 cases where the same field appears multiple times with potentially contradictory information
For each case, provide the synthetic transcript excerpt and the expected extraction with brief reasoning for ambiguous cases.
What it produced (paraphrased):
12 test cases with good variation. The ambiguous competitor cases were the strongest — it created scenarios where a prospect mentioned a company that could plausibly be either a competitor or an integration partner, with appropriate notes about why both classifications were defensible. The contradictory information cases were realistic: a prospect mentions a Q2 timeline early in the call, then says “realistically we’re looking at end of year” near the end.
The noise cases were weaker. The AI simulated transcription errors by inserting obvious typos, but real ASR errors produce phonetically plausible substitutions — “Salesforce” becoming “sales force” or budget signals getting half-swallowed mid-sentence. If noise handling is what you’re actually testing, write those cases manually using real ASR outputs from your pipeline.
Grade: Useful, Verify Against Real Data
For tasks with a clear input structure, AI-generated test cases work well for expanding a small seed set. The risk is distribution mismatch: the AI imagines a plausible input space based on your examples, not based on what your system actually sees in production. Run 100 real inputs through your model and compare the error distribution to your synthetic test cases. You’ll find failure modes the synthetic cases missed. Use AI-generated cases for coverage breadth; use real examples for the failure modes that matter.
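One way to run that comparison: label each failing output with a failure-mode category in both sets, then look at where the frequencies diverge. A sketch (category names and counts are illustrative):

```python
from collections import Counter

def distribution_gap(real_labels, synthetic_labels):
    """Compare failure-mode frequencies between real and synthetic test
    sets; returns categories sorted by largest proportion gap, i.e. the
    failure modes your synthetic cases over- or under-represent."""
    def proportions(labels):
        counts = Counter(labels)
        total = len(labels)
        return {k: v / total for k, v in counts.items()}

    p_real = proportions(real_labels)
    p_syn = proportions(synthetic_labels)
    categories = set(p_real) | set(p_syn)
    gaps = {c: abs(p_real.get(c, 0) - p_syn.get(c, 0)) for c in categories}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative failure labels from 10 real and 10 synthetic failures
real = ["scope"] * 6 + ["accuracy"] * 3 + ["format"] * 1
synthetic = ["scope"] * 3 + ["accuracy"] * 5 + ["format"] * 2
gaps = distribution_gap(real, synthetic)
```

The top of the list tells you which failure modes to write manual cases for first.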
Prompt 3: LLM-as-Judge Scoring Prompt
The scenario: you’re running automated evals using Claude as a judge to score outputs from a smaller model. You need a scoring prompt that produces consistent, calibrated scores that are parseable by your eval pipeline.
The prompt:
You are an objective evaluator for AI model outputs. I will give you a task instruction, a user query, and a model response.
Score the response on a 1-10 scale for [DIMENSION: e.g., factual accuracy].
Scoring rules:
- Be consistent: comparable responses in different evaluations should receive the same score
- Response length should not affect your score; longer is not better
- Do not give partial credit for correct structure if the content is wrong
Output format — JSON only, no other text:
{
  "score": integer 1-10,
  "reasoning": "1-3 sentences max",
  "critical_flaw": "single sentence describing the most significant failure, or null if none"
}
Task instruction: [TASK_INSTRUCTION]
User query: [USER_QUERY]
Model response: [MODEL_RESPONSE]
What it produced (paraphrased):
Consistent JSON outputs with scores, reasoning, and the critical_flaw field. The reasoning was usually one sentence identifying the key quality signal. The critical_flaw field surfaced specific failure modes that aggregated scores bury — on a batch of 200 test cases, sorting by critical_flaw content identified a pattern where the judge was consistently flagging the same class of error, which pointed to a prompt issue rather than a model capability issue.
Grade: Production-Ready with One Adjustment
LLM-as-judge works well for relative comparisons but drifts on absolute scales without calibration examples. Without anchors, the same judge model tends to compress scores to the 5-8 range because it avoids extreme scores by default. Add 3-5 calibration examples to the prompt showing what a 3, 6, and 9 actually look like for your specific task. This takes 20 minutes and cuts inter-run variance significantly.
The JSON format with critical_flaw is the right output structure for automated pipelines. It parses cleanly and the flaw field is directly actionable when debugging. If you’re running this through Braintrust or LangSmith, the critical_flaw values aggregate into natural failure categories without manual tagging.
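A pipeline-side validator keeps malformed judge outputs from silently corrupting aggregates. A minimal sketch matching the schema above (retry logic is left to the caller):

```python
import json

def parse_judge_output(raw: str):
    """Parse and validate one judge response. Returns the dict on
    success, None on any schema violation, so the caller can retry
    the judge call or flag the sample for review."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    score = data.get("score")
    if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 10:
        return None
    if not isinstance(data.get("reasoning"), str):
        return None
    flaw = data.get("critical_flaw")
    if flaw is not None and not isinstance(flaw, str):
        return None
    return data
```

Counting how often this returns None is itself a useful health metric for the judge prompt.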
Prompt 4: Capability Gap Analysis After A/B Testing
The scenario: you ran GPT-4o and Llama 3.1 70B on 500 inputs. Human raters scored 100 samples per model. You have the scores and a rough failure taxonomy, and you need to figure out where to invest prompt engineering effort.
The prompt:
I ran an A/B test comparing GPT-4o and Llama 3.1 70B on [TASK_TYPE]. Human raters scored 100 samples from each model on accuracy, completeness, and format adherence (1-5 scale).
Results:
- GPT-4o: accuracy 4.2, completeness 3.8, format 4.6
- Llama 3.1 70B: accuracy 3.4, completeness 3.1, format 4.5
I've categorized the 30 lowest-scoring Llama responses by failure mode:
[FAILURE_MODE_LIST_WITH_COUNTS]
Analyze which failure modes account for most of the accuracy gap. For the top 2 failure modes, suggest 3 targeted prompt engineering interventions each. For each intervention, explain what it targets and how I would verify it worked without re-running the full human eval.
What it produced (paraphrased):
A breakdown identifying that the accuracy gap concentrated in 2 of the 5 failure modes I’d listed, with interventions that targeted each specifically. The suggestions were concrete — not “improve your system prompt” but specific techniques: adding explicit negative instructions (“Do not cite information not present in the provided context”), adding a chain-of-thought step before the final answer for reasoning-heavy subtasks, and including 2-3 few-shot examples specifically for the failure mode pattern.
The verification suggestions were also actionable: run the modified prompt on the 30 lowest-scoring examples from the original eval, score them with the LLM judge from Prompt 3, and compare scores rather than re-running the full human eval.
Grade: High Value for Model Selection Decisions
This prompt pattern is most useful when you’ve already done the manual categorization work. Someone has to read the low-scoring outputs and name the failure patterns — that’s 30-60 minutes of work per hundred samples that can’t be shortcut. Once you have the taxonomy, the AI is good at connecting failure patterns to interventions and generating hypotheses worth testing. Not all interventions will work; budget for 2-3 rounds of testing before you give up on one approach.
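The cheap verification loop (re-score only the original failures with an LLM judge) can be sketched as below. `run_model` and `judge` are hypothetical callables standing in for your model call and your judge; storing `original_score` on each failed case is an assumption about your data layout:

```python
def verify_intervention(failed_cases, run_model, judge):
    """Re-run only the original low scorers under the modified prompt
    and compare mean judge scores, instead of repeating the full
    human eval.
    failed_cases: list of dicts with "input" and "original_score" keys
    run_model(case_input) -> response text (hypothetical callable)
    judge(case_input, response) -> int score 1-10 (hypothetical callable)"""
    before = [c["original_score"] for c in failed_cases]
    after = [judge(c["input"], run_model(c["input"])) for c in failed_cases]
    return sum(after) / len(after) - sum(before) / len(before)
```

A positive delta means the intervention moved the needle on the cases it targeted; confirm on a fresh sample before trusting it, since you are optimizing against known failures.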
Prompt 5: Regression Test Design After a Model Switch
The scenario: your LLM provider is deprecating the model version your pipeline runs on. You’re evaluating a switch from GPT-4o to GPT-4o-mini to reduce inference costs. You need a regression test plan before touching production traffic.
The prompt:
We're migrating a production pipeline from GPT-4o to GPT-4o-mini to reduce inference costs by approximately 75%. The pipeline handles [TASK_DESCRIPTION].
Current performance on our eval set:
- Task accuracy: 91%
- Hallucination rate: 4%
- Format compliance: 99%
- P95 latency: 2.1 seconds
Design a regression test plan for this migration. Include:
1. Which metrics to prioritize and why (given the specific task)
2. Minimum sample sizes for each test phase, with reasoning
3. Specific test case types that should stress-test the weaker model
4. Pass/fail criteria for proceeding to production rollout
5. A rollback decision rule with a specific numeric threshold
What it produced (paraphrased):
A structured test plan with phased rollout (10% → 25% → 100% of traffic), specific sample sizes by metric, and concrete pass/fail thresholds. The stress test cases it suggested — multi-step reasoning chains, instructions with conflicting requirements, long-context retrieval, low-resource language inputs — are well-known weak points for smaller models and appropriate for most task types.
Grade: Good Structure, Check the Statistics
The plan structure is correct and the phased rollout approach is right. The sample size calculations need independent verification. For detecting a change in hallucination rate from 4% to 6% with 80% statistical power, you need roughly 1,000 samples per phase — the AI suggested 500, which is underpowered for a 2-point shift. Use a sample size calculator and treat the AI’s suggestions as ballpark figures.
The rollback threshold the AI set (“any 1% increase in hallucination rate”) will trigger false rollbacks from noise. Set your threshold based on what change in performance actually has a business impact, not on the smallest detectable effect. A 1% increase in hallucination rate on a pipeline that generates internal summaries matters less than the same change on a pipeline that generates customer-facing responses.
Prompt 6: Faithfulness Evaluation for RAG Pipelines
The scenario: your RAG pipeline summarizes customer contracts. You need automated detection of whether summaries stay faithful to the source document, at a claim level rather than a document level.
The prompt:
You are evaluating whether an AI-generated summary is faithful to its source document.
Definition: a claim is faithful if it is explicitly stated in the source or is a direct logical inference with no additional assumptions. A claim is unfaithful if it adds details not in the source, even if those details would typically be true.
Source document: [SOURCE_TEXT]
Generated summary: [SUMMARY_TEXT]
For each factual claim in the summary:
1. State the claim
2. Find the specific sentence(s) in the source that support it
3. If no supporting sentence exists, flag as "unsupported"
4. If the claim contradicts the source, flag as "contradicted"
Output a JSON array. Each item: {"claim": string, "status": "supported"|"unsupported"|"contradicted", "source_text": string or null, "severity": "low"|"medium"|"high"}
Severity for unsupported/contradicted claims: high if a reader would likely rely on this claim for a decision, medium if it adds context without affecting decisions, low if it is purely decorative.
What it produced (paraphrased):
Claim-level faithfulness assessments in structured JSON. The severity ratings tracked with the intuitive importance of each claim. On a contract summary test case, it correctly flagged a payment terms claim that was a plausible inference but not explicitly stated — exactly the type of error that creates problems when someone acts on the summary without reading the source.
Grade: Production-Ready
This is one of the highest-value evaluation prompts for RAG pipelines. Claim-level output makes the results actionable — you know which claims to investigate, not just a binary “hallucinated: yes/no.” The severity field is useful for triage when a batch run surfaces 50 flagged claims across 20 documents.
Two operational caveats: the model’s definition of “direct logical inference” is broader than most production teams want, so add 2-3 examples of inferences that should be flagged as unsupported to tighten the definition. And JSON reliability degrades on very long source documents — chunk documents over roughly 4,000 tokens before passing them through this prompt.
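A simple pre-chunker for that second caveat, splitting on paragraph boundaries with a chars-per-token approximation (swap in your model's real tokenizer if you have one; the 4-chars-per-token figure is a rough heuristic):

```python
def chunk_document(text: str, max_tokens: int = 4000, chars_per_token: float = 4.0):
    """Split a source document into chunks under max_tokens, breaking
    only on paragraph boundaries so claims stay attached to their
    supporting sentences. Token count is approximated from length."""
    max_chars = int(max_tokens * chars_per_token)
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # a single oversized paragraph becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```

Run the faithfulness prompt once per chunk and concatenate the JSON arrays; claims that span chunk boundaries are the residual risk of this approach.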
Prompt 7: Writing a Model Evaluation Summary for Stakeholders
The scenario: you’ve finished a two-week evaluation comparing two models for a production deployment. Engineering leadership and the product team need a decision summary, and they don’t want to read a spreadsheet.
The prompt:
I'm writing an internal evaluation summary for a model deployment decision. The audience is engineering leadership and the product team — not ML specialists.
Model A: GPT-4o, $0.0025 per 1K input tokens
Model B: Claude claude-sonnet-4-6, $0.003 per 1K input tokens
Task: [TASK_DESCRIPTION]
Test set: 1,200 examples, 15% human-rated
Results:
- Quality (human preference): GPT-4o 78%, Claude 84%
- Hallucination rate: GPT-4o 6.2%, Claude 3.1%
- P95 latency: GPT-4o 1.8 seconds, Claude 2.4 seconds
- Cost at 1B tokens per month: GPT-4o $2,500, Claude $3,000
The team is leaning toward GPT-4o for cost reasons. Write a 400-word summary presenting the full trade-off, with a clear recommendation. Don't minimize the cost difference, but make the recommendation based on what the data actually supports.
What it produced (paraphrased):
A 380-word summary that framed the trade-off as a cost-per-error calculation rather than a headline cost comparison. At 6.2% vs 3.1% hallucination rate across roughly 1M responses per month, the difference is roughly 31,000 additional incorrect responses per month with GPT-4o. The summary quantified the downstream cost of those errors against the $500 monthly cost difference, and recommended Claude on the grounds that the reduction in error rate justified the delta if errors had any measurable downstream impact.
Grade: Useful Template, Your Numbers Required
The cost-per-error framing is the right way to present this trade-off to non-technical stakeholders, and the AI arrived at it without being told to. The structure — headline numbers, trade-off framing, recommendation with explicit conditions — is worth adapting for any model comparison.
Two things to verify: the downstream cost estimate for incorrect responses needs to come from your actual operational data (the AI will invent a plausible-sounding number if you don’t provide one), and the final recommendation belongs to you, not the model. You know context about vendor relationships, technical debt, and roadmap constraints that the AI doesn’t.
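The arithmetic behind the cost-per-error framing is worth making explicit so stakeholders can audit it. A sketch, assuming roughly 1M responses per month (a stand-in figure; use your own traffic and error-cost data):

```python
responses_per_month = 1_000_000   # assumption: your monthly response volume
rate_a, rate_b = 0.062, 0.031     # hallucination rates from the eval
cost_delta = 3_000 - 2_500        # Claude premium per month, dollars

extra_errors = (rate_a - rate_b) * responses_per_month
breakeven_cost_per_error = cost_delta / extra_errors  # dollars per error
# If handling one incorrect response costs more than this break-even
# figure downstream, the cheaper model is the more expensive choice.
```

At these numbers the break-even is under two cents per error, which is why the headline cost comparison is misleading on its own.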
What Works
The prompts that produce the most usable outputs share a pattern: a well-defined output format, a specific task scope, and enough context that the AI isn’t filling in blanks. The LLM-as-judge prompt and the faithfulness evaluation prompt are closest to production-ready because the task is bounded — score this response, classify this claim. The rubric and test case generation prompts require more iteration because they’re starting points, not endpoints.
What Doesn’t Work
Asking an AI to evaluate model outputs without a defined rubric produces low-value feedback. Vague prompts like “score this response on quality” generate plausible-sounding assessments that aren’t reproducible across runs or raters. Every prompt in this library required defining what “good” means before the AI could assess it. That definition has to come from someone who understands the task.
AI also can’t replace the manual work of categorizing real failures. Before you can use Prompt 4 to analyze a performance gap, someone has to read the low-scoring outputs and name the failure patterns. What the AI can do is help analyze patterns you’ve already identified and suggest interventions worth testing.
Tips for Customizing These Prompts
Add calibration examples to any scoring prompt. Three examples showing what a low, mid, and high score look like for your specific task dramatically improve consistency. Without anchors, most models compress scores toward the center of the scale.
Specify output format explicitly for automated pipelines. JSON with named keys is more reliable than prose. Include a null case for optional fields (like critical_flaw in Prompt 3) to prevent parsing errors when the field doesn’t apply.
Version your eval prompts the same way you version your production prompts. When your judge model is updated by the provider, scores can shift even if the thing being evaluated hasn’t changed. Pin your judge model version if your platform supports it — both Braintrust and LangSmith support this — and run regression checks on your scoring prompts after provider updates.
Common Mistakes in Model Evaluation
Using benchmark scores instead of task-specific evals. A model scoring 90% on MMLU tells you nothing about whether it handles your specific edge cases. Build at least 100 task-specific test cases before relying on any model for a production workload.
Evaluating on a static test set for too long. After 3-4 weeks, production inputs drift away from your test distribution. Collect new examples weekly and rotate them into your eval set. If you’re using PromptFoo, schedule automated test runs against fresh samples.
Skipping LLM-as-judge calibration. An uncalibrated judge gives you relative comparisons (Model A better than Model B) but not reliable absolute scores. Calibration takes two hours and makes scores meaningful across eval runs and across time.
Treating a single model’s scores as ground truth on subjective tasks. For tasks like “summarize this in the user’s voice,” a single judge has high variance. Use 3 independent judgments and take the median for tasks with inherent subjectivity.
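The median-of-3 pattern takes a few lines to wire up; a minimal sketch (the spread threshold is an arbitrary starting point, not a recommendation):

```python
import statistics

def aggregate_judgments(scores, spread_threshold=3):
    """Median of independent judge scores (1-10 scale), robust to one
    outlier judgment. Also flags the sample for human review when the
    judges disagree widely, since the median hides that disagreement."""
    median_score = statistics.median(scores)
    needs_review = (max(scores) - min(scores)) >= spread_threshold
    return median_score, needs_review
```

Tracking how often `needs_review` fires gives you a rough subjectivity measure for the task itself.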