
From Theory to Practice: Automated Evals for AI Agents with Langfuse

· 33 min read
Bradley Taylor
Founder & CEO

📚 AI Agent Evaluation Series - Part 3 of 5

  1. Observability & Evals: Why They Matter ←
  2. Human-in-the-Loop Evaluation ←
  3. Implementing Automated Evals ← You are here
  4. Debugging AI Agents →
  5. Human Review Training Guide →


We covered why observability and evals matter for AI agents in Part 1 of this series. Now let's get practical: how do you actually implement automated evaluations that run continuously, catch regressions before users do, and give you the confidence to ship faster?

This guide walks through setting up automated evals with Langfuse—from basic quality checks to sophisticated LLM-as-a-judge evaluations. And if you're using Answer Agent, you're already halfway there: Langfuse tracing is built-in, so you can skip the instrumentation headache and jump straight to measuring quality.

The Evaluation Mindset: Continuous Improvement Over Perfection

Before diving into implementation, let's establish the right mental model. Automated evals aren't about achieving perfection on day one—they're about building a system that gets better every week.

Here's the mindset shift:

Old way (waterfall):

  1. Build agent
  2. Test manually
  3. Deploy
  4. Hope for the best
  5. React to user complaints

New way (continuous improvement):

  1. Build agent with basic observability
  2. Collect real traces from production
  3. Define eval criteria based on actual failures
  4. Automate evaluation on new traces
  5. Iterate on prompts, tools, and models with data-driven insights
  6. Repeat weekly

The key difference? You're not trying to predict every failure mode upfront. You're building a feedback loop that learns from real usage and systematically eliminates issues over time.

Need help setting up your continuous improvement loop? Let's design your evaluation strategy →

Why Langfuse for Automated Evals?

Langfuse has become one of the most widely adopted open-source platforms for AI agent observability and evals. Here's why it stands out:

1. Unified Platform

One tool handles both observability (traces, spans, generations) and evaluations (scores, experiments, datasets). You're not stitching together multiple systems.

2. Flexible Evaluation Methods

  • LLM-as-a-judge: Use Claude, GPT-4, or any model to evaluate quality
  • Manual annotation: Human reviewers score production traces
  • Custom Python evaluators: Write your own scoring logic
  • Deterministic checks: Rule-based validation (format, length, required fields)

3. Production-Ready Infrastructure

  • Self-hosted or cloud
  • Handles millions of traces
  • Real-time dashboards
  • API access for custom integrations

4. Built for Iteration

  • Datasets for regression testing
  • Experiments for A/B testing prompts
  • Version tracking for prompts and models
  • Time-series metrics to track improvement

5. Framework Agnostic

Works with LangChain, LlamaIndex, raw OpenAI calls, Anthropic SDK—anything that generates traces.

Evaluating which observability platform fits your stack? Get a comparison analysis →

Answer Agent Users: You're Already Set Up

If you're using Answer Agent, here's the good news: Langfuse tracing is already built into every agent you run. No instrumentation code needed.

Answer Agent automatically captures:

  • All LLM calls with full prompts and completions
  • Tool invocations with inputs and outputs
  • Agent reasoning steps and decision chains
  • Token usage and latency metrics
  • Session grouping for multi-turn conversations

How It Works Behind the Scenes

Answer Agent's Langfuse integration happens at the framework level. When you configure a Langfuse credential in the UI, the system:

  1. Injects callback handlers into every LangChain execution
  2. Traces all components (agents, tools, retrievers, chains)
  3. Groups by session to track full conversations
  4. Adds metadata (user ID, organization ID, agent version)
  5. Handles errors gracefully (tracing failures don't break your agent)

Here's what the code looks like under the hood:

// Answer Agent automatically creates Langfuse handlers for all agents
import CallbackHandler from 'langfuse-langchain'
import { Langfuse } from 'langfuse'

// When you add Langfuse credentials in the UI, Answer Agent does this:
const langfuse = new Langfuse({
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  baseUrl: 'https://cloud.langfuse.com',
  sdkIntegration: 'Flowise'
})

const handler = new CallbackHandler({
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  baseUrl: 'https://cloud.langfuse.com',
  sessionId: chatflowId,
  userId: userId,
  metadata: {
    organizationId: organizationId,
    agentVersion: '1.0',
    environment: 'production'
  }
})

// This handler is automatically passed to all LangChain components
const agent = await createAgent({
  callbacks: [handler, ...otherCallbacks]
})

Translation: You just configure credentials in the UI. Answer Agent handles the rest. Every agent run is automatically traced, with zero code from you.

Setting Up Langfuse in Answer Agent

  1. Get Langfuse credentials:

    • Sign up at cloud.langfuse.com or self-host
    • Create a new project
    • Copy your Public Key and Secret Key
  2. Add credentials in Answer Agent:

    • Go to Credentials → Add Credential
    • Select "Langfuse API"
    • Enter Public Key, Secret Key, and Endpoint
    • Save
  3. Attach to your agent:

    • Open your agent/chatflow
    • Add the "LangFuse" analytics node
    • Connect your saved credential
    • Deploy

That's it. Every conversation from now on appears in your Langfuse dashboard with full trace detail.

Using Answer Agent and need help with Langfuse setup? Get a configuration walkthrough →

Setting Up Your First Automated Eval

Now that traces are flowing into Langfuse, let's set up automated quality checks. We'll start simple and build up to sophisticated evaluations.

Step 1: Define What "Good" Looks Like

Before writing any code, answer these questions:

  • What does success look like for this agent?
    • Example: "Agent provides accurate product recommendations based on user preferences"
  • What are common failure modes?
    • Example: "Agent hallucinates product features" or "Agent recommends out-of-stock items"
  • How would a human evaluate quality?
    • Example: "Check if recommended products match the user's stated preferences"

Write these down. This is your evaluation criteria, and it will evolve as you learn more about production behavior.
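
One lightweight way to keep these answers actionable is to store them as a small, version-controlled spec that your evaluators can import later. A minimal sketch, where the structure and field names are illustrative rather than any Langfuse schema:

# eval_criteria.py - a hypothetical, version-controlled record of what "good" means
EVAL_CRITERIA = {
    "agent": "product-recommendation-agent",
    "success_definition": "Recommends in-stock products that match stated preferences",
    "failure_modes": [
        "hallucinated product features",
        "recommends out-of-stock items",
        "ignores stated budget",
    ],
    "human_check": "Do the recommended products match the user's stated preferences?",
}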

Step 2: Create a Dataset from Production Traces

Langfuse lets you curate datasets from real production data. Here's how:

In the Langfuse UI:

  1. Go to Traces and filter for interesting examples:

    • Traces with low user satisfaction scores
    • Traces where the agent used tools incorrectly
    • Edge cases that surprised you
  2. Click on a trace → Add to Dataset

  3. Name your dataset (e.g., "Product Recommendation Test Cases")

  4. Add expected outputs or evaluation criteria

Via the API:

from langfuse import Langfuse

langfuse = Langfuse(
    secret_key="sk-lf-...",
    public_key="pk-lf-...",
    host="https://cloud.langfuse.com"
)

# Create a dataset
langfuse.create_dataset(name="product-recommendations")

# Add production traces to dataset
langfuse.create_dataset_item(
    dataset_name="product-recommendations",
    input={
        "user_query": "I need a laptop for video editing under $2000",
        "user_preferences": {
            "budget": 2000,
            "use_case": "video_editing"
        }
    },
    expected_output={
        "should_include": ["GPU specs", "RAM >= 16GB", "price < $2000"],
        "should_not_include": ["out_of_stock_items", "gaming_laptops"]
    },
    metadata={
        "original_trace_id": "trace-abc-123",
        "user_feedback": "helpful"
    }
)

Pro tip: Start with 20-30 representative examples. You can always add more later as you discover new edge cases.

Step 3: Write Your First Evaluator

Let's start with a simple LLM-as-a-judge evaluator that checks if the agent's response matches the user's preferences.

import json

from langfuse import Langfuse
from langfuse.decorators import observe
from openai import OpenAI

langfuse = Langfuse()
openai = OpenAI()

@observe(as_type="generation")
def evaluate_product_recommendation(input_query: str, agent_output: str, expected_criteria: dict) -> dict:
    """
    Evaluates if the agent's product recommendation matches user preferences.
    Returns a score from 0-10 and reasoning.
    """

    eval_prompt = f"""You are an expert evaluator assessing an AI agent's product recommendation.

User Query: {input_query}

Agent's Recommendation: {agent_output}

Evaluation Criteria:
- Must include: {expected_criteria.get('should_include', [])}
- Must not include: {expected_criteria.get('should_not_include', [])}

Rate the recommendation on a scale of 0-10 where:
- 10 = Perfect match, all criteria met
- 7-9 = Good match, minor issues
- 4-6 = Partial match, some criteria missed
- 0-3 = Poor match, major issues

Respond in JSON format:
{{
  "score": <number 0-10>,
  "reasoning": "<brief explanation>",
  "criteria_met": [<list of met criteria>],
  "criteria_missed": [<list of missed criteria>]
}}"""

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an expert evaluator. Always respond with valid JSON."},
            {"role": "user", "content": eval_prompt}
        ],
        response_format={"type": "json_object"},
        temperature=0.2
    )

    result = response.choices[0].message.content
    return json.loads(result)  # Parse the JSON string to a dict (never eval() model output)

# Run evaluation on a dataset item
dataset_item = langfuse.get_dataset_item("product-recommendations", "item-1")

eval_result = evaluate_product_recommendation(
    input_query=dataset_item.input["user_query"],
    agent_output="[Agent's actual response]",
    expected_criteria=dataset_item.expected_output
)

# Send score back to Langfuse
langfuse.score(
    trace_id=dataset_item.metadata["original_trace_id"],
    name="product_recommendation_quality",
    value=eval_result["score"],
    comment=eval_result["reasoning"]
)

Step 4: Automate Evaluation with Langfuse SDK

Now let's run this evaluator automatically on all new production traces:

import json
import os

from langfuse import Langfuse
from openai import OpenAI

# Initialize clients
langfuse = Langfuse(
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    host="https://cloud.langfuse.com"
)
openai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def evaluate_trace(trace_id: str):
    """Evaluate a specific trace and add score to Langfuse."""

    # Fetch trace details
    trace = langfuse.fetch_trace(trace_id)

    # Extract input and output
    user_input = trace.input
    agent_output = trace.output

    # Run LLM-as-judge evaluation
    eval_prompt = f"""Evaluate this AI agent interaction for quality.

User Input: {user_input}
Agent Output: {agent_output}

Rate from 0-10 based on:
- Accuracy: Is the information correct?
- Relevance: Does it address the user's question?
- Helpfulness: Does it provide actionable guidance?

Respond with JSON: {{"score": <0-10>, "reasoning": "<explanation>"}}"""

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
        temperature=0.2
    )

    result = json.loads(response.choices[0].message.content)

    # Send score to Langfuse
    langfuse.score(
        trace_id=trace_id,
        name="quality_score",
        value=result["score"],
        comment=result["reasoning"]
    )

    return result

# Evaluate recent traces
traces = langfuse.fetch_traces(limit=10, order_by="desc")
for trace in traces:
    eval_result = evaluate_trace(trace.id)
    print(f"Trace {trace.id}: Score {eval_result['score']}/10")

Step 5: Run Evaluations on a Schedule

To catch regressions early, run evaluations automatically on a schedule (e.g., every hour or after every 100 new traces).

Option A: Simple cron job

# crontab -e
0 * * * * /usr/bin/python3 /path/to/run_evals.py

Option B: GitHub Actions (for dataset regression testing)

name: Run Langfuse Evals

on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours
  workflow_dispatch: # Manual trigger

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install langfuse openai

      - name: Run evaluations
        env:
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/run_evals.py

      - name: Post results
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: 'Eval Regression Detected',
              body: 'Automated evals failed. Check Langfuse dashboard for details.'
            })

Option C: AWS Lambda (serverless)

import json
import os
from datetime import datetime, timedelta

from langfuse import Langfuse

def lambda_handler(event, context):
    """AWS Lambda function to evaluate recent traces."""

    langfuse = Langfuse(
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"]
    )

    # Fetch traces from the last hour
    traces = langfuse.fetch_traces(
        from_timestamp=datetime.now() - timedelta(hours=1)
    )

    results = []
    for trace in traces:
        # evaluate_trace as defined in Step 4; keep only the numeric score
        score = evaluate_trace(trace.id)["score"]
        results.append({"trace_id": trace.id, "score": score})

    # Alert on low scores
    low_scores = [r for r in results if r["score"] < 5]
    if low_scores:
        # Send alert via SNS, Slack, etc.
        send_alert(f"Found {len(low_scores)} traces with quality issues")

    return {
        'statusCode': 200,
        'body': json.dumps({
            'evaluated': len(results),
            'low_quality': len(low_scores)
        })
    }

Advanced Eval Patterns

Once you have basic evals running, here are some advanced patterns to level up your evaluation strategy.

Pattern 1: Multi-Criteria Evaluation

Instead of a single quality score, evaluate multiple dimensions:

def evaluate_multi_criteria(trace_id: str):
    """Evaluate trace across multiple quality dimensions."""

    trace = langfuse.fetch_trace(trace_id)

    criteria = {
        "accuracy": "Is the information factually correct?",
        "relevance": "Does it answer the user's question?",
        "clarity": "Is the response clear and well-structured?",
        "safety": "Does it avoid harmful or biased content?",
        "completeness": "Does it provide enough detail?"
    }

    for criterion_name, criterion_prompt in criteria.items():
        score = evaluate_single_criterion(
            trace,
            criterion_name,
            criterion_prompt
        )

        langfuse.score(
            trace_id=trace_id,
            name=f"eval_{criterion_name}",
            value=score["value"],
            comment=score["reasoning"]
        )

This gives you granular insights: "Accuracy is great, but clarity needs work."
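
The loop above assumes an evaluate_single_criterion helper without defining it. Here is a minimal sketch of what it could look like, reusing the LLM-as-judge pattern from Step 3 (the prompt wording and model choice are illustrative, and it assumes the openai client initialized earlier):

import json

def evaluate_single_criterion(trace, criterion_name: str, criterion_prompt: str) -> dict:
    """Score one quality dimension of a trace with an LLM judge (illustrative helper)."""
    eval_prompt = f"""Evaluate this AI agent interaction on one dimension: {criterion_name}.

Question to answer: {criterion_prompt}

User Input: {trace.input}
Agent Output: {trace.output}

Respond with JSON: {{"value": <0-10>, "reasoning": "<explanation>"}}"""

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
        temperature=0.2
    )
    return json.loads(response.choices[0].message.content)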

Pattern 2: Comparative Evaluation (A/B Testing)

Test if your new prompt actually improves quality:

from langfuse import Langfuse

langfuse = Langfuse()

# Run experiment with two prompt versions
experiment = langfuse.create_experiment(
    name="prompt-optimization-v2",
    dataset_name="product-recommendations"
)

# Version A: Current prompt
results_a = []
for item in dataset:
    output_a = run_agent_with_prompt(item.input, prompt_version="v1")
    score_a = evaluate_output(item.input, output_a, item.expected)
    results_a.append(score_a)

    experiment.log_run(
        variant="prompt_v1",
        input=item.input,
        output=output_a,
        scores={"quality": score_a}
    )

# Version B: New prompt
results_b = []
for item in dataset:
    output_b = run_agent_with_prompt(item.input, prompt_version="v2")
    score_b = evaluate_output(item.input, output_b, item.expected)
    results_b.append(score_b)

    experiment.log_run(
        variant="prompt_v2",
        input=item.input,
        output=output_b,
        scores={"quality": score_b}
    )

# Compare results
avg_a = sum(results_a) / len(results_a)
avg_b = sum(results_b) / len(results_b)

print(f"Prompt v1 average score: {avg_a}")
print(f"Prompt v2 average score: {avg_b}")
print(f"Improvement: {((avg_b - avg_a) / avg_a * 100):.2f}%")

View side-by-side comparisons in Langfuse's experiment UI.

Pattern 3: Retrieval Quality Evaluation (for RAG agents)

If your agent uses RAG (retrieval-augmented generation), evaluate retrieval quality separately:

import json

def evaluate_retrieval_quality(trace_id: str):
    """Evaluate if retrieved documents are relevant to the query."""

    trace = langfuse.fetch_trace(trace_id)

    # Extract retrieval span
    retrieval_span = [s for s in trace.spans if s.name == "retrieval"][0]

    query = retrieval_span.input["query"]
    retrieved_docs = retrieval_span.output["documents"]

    # LLM-as-judge: Rate each document's relevance
    relevance_scores = []
    for doc in retrieved_docs:
        eval_prompt = f"""Rate how relevant this document is to the query.

Query: {query}

Document: {doc['content'][:500]}...

Score 0-10 where 10 = perfectly relevant, 0 = completely irrelevant.

JSON response: {{"score": <0-10>, "reasoning": "<explanation>"}}"""

        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": eval_prompt}],
            response_format={"type": "json_object"}
        )

        result = json.loads(response.choices[0].message.content)
        relevance_scores.append(result["score"])

    # Average relevance score
    avg_relevance = sum(relevance_scores) / len(relevance_scores)

    langfuse.score(
        trace_id=trace_id,
        name="retrieval_relevance",
        value=avg_relevance,
        comment=f"Avg relevance of {len(retrieved_docs)} documents"
    )

    return avg_relevance

Pattern 4: User Feedback as Ground Truth

The gold standard: let real users tell you what's good.

# When user provides feedback (thumbs up/down)
@app.post("/feedback")
def record_feedback(trace_id: str, feedback: str, user_comment: str = ""):
    """Record user feedback as a score in Langfuse."""

    score_value = 1.0 if feedback == "thumbs_up" else 0.0

    langfuse.score(
        trace_id=trace_id,
        name="user_satisfaction",
        value=score_value,
        comment=user_comment,
        data_type="BOOLEAN"
    )

    # If negative feedback, flag for review
    if score_value == 0.0:
        langfuse.add_to_dataset(
            dataset_name="negative-feedback",
            trace_id=trace_id
        )

    return {"status": "recorded"}

Then correlate user feedback with your automated eval scores to validate that your evals predict real-world satisfaction.

Monitoring Eval Results: Dashboards and Alerts

Langfuse provides built-in dashboards, but you can also build custom monitoring:

Dashboard Metrics to Track

  1. Average eval score over time (are you improving?)
  2. Score distribution (how consistent is quality?)
  3. Percentage of traces below threshold (e.g., score < 7)
  4. Eval score by agent version (did the latest change help or hurt?)
  5. Eval score by user segment (do power users get better results?)
  6. Correlation between eval scores and user feedback (do your evals match reality?)

Setting Up Alerts

import os
from datetime import datetime, timedelta

from slack_sdk import WebClient

slack = WebClient(token=os.environ["SLACK_TOKEN"])

def check_quality_regression():
    """Check if recent quality scores dropped below threshold."""

    # Fetch eval scores from last 24 hours
    recent_scores = langfuse.get_scores(
        name="quality_score",
        from_timestamp=datetime.now() - timedelta(days=1)
    )

    avg_score = sum(s.value for s in recent_scores) / len(recent_scores)

    # Alert if average drops below 7.0
    if avg_score < 7.0:
        slack.chat_postMessage(
            channel="#ai-agent-alerts",
            text=f"🚨 Quality regression detected! Avg score: {avg_score:.2f}/10 (threshold: 7.0)\n\nView traces: https://cloud.langfuse.com/project/[id]/traces"
        )

Run this check hourly via cron or AWS EventBridge.

Real-World Eval Workflow: Week-by-Week Iteration

Here's what continuous improvement looks like in practice:

Week 1: Baseline

  • Deploy agent with Langfuse tracing
  • Collect 1000+ production traces
  • Review traces manually, note failure patterns
  • Create first dataset (20-30 examples)
  • Write first evaluator (simple quality check)
  • Baseline score: 6.8/10

Week 2: Fix Low-Hanging Fruit

  • Identify top 3 failure modes from evals
  • Fix obvious bugs (tool config, prompt issues)
  • Re-run evals on dataset
  • New score: 7.4/10 ⬆️

Week 3: Prompt Optimization

  • Run A/B test: current prompt vs. improved version
  • Evaluator shows v2 is 12% better
  • Deploy v2 to production
  • New score: 7.9/10 ⬆️

Week 4: Add Multi-Criteria Evals

  • Break quality into dimensions (accuracy, relevance, clarity)
  • Discover: accuracy is great (8.5/10), but clarity needs work (6.2/10)
  • Focus optimization on response structure
  • New clarity score: 7.8/10 ⬆️
  • Overall score: 8.2/10 ⬆️

Week 5+: Continuous Monitoring

  • Evals run automatically on all production traces
  • Alert fires if score drops below 7.5
  • Dataset grows to 100+ examples from real edge cases
  • Regression testing prevents new changes from breaking old fixes
  • Steady improvement continues...

Common Pitfalls and How to Avoid Them

Pitfall 1: Over-Optimizing for Your Evals

Problem: You improve eval scores but user satisfaction doesn't change.

Solution: Always validate eval scores against real user feedback. If your eval says 9/10 but users complain, your eval criteria are wrong.

Pitfall 2: Dataset Drift

Problem: Your dataset becomes stale as your agent evolves.

Solution: Continuously add new edge cases from production. Aim to add 5-10 new examples per week.

Pitfall 3: Ignoring Statistical Significance

Problem: You declare victory after testing on 5 examples.

Solution: Use at least 30+ examples for meaningful comparisons. Calculate confidence intervals.

from scipy import stats

def compare_with_confidence(scores_a, scores_b):
    """Compare two sets of eval scores with statistical test."""
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)

    if p_value < 0.05:
        print(f"Statistically significant improvement (p={p_value:.4f})")
    else:
        print(f"No significant difference (p={p_value:.4f})")

Pitfall 4: Eval Prompt Ambiguity

Problem: Your LLM-as-judge gives inconsistent scores because the eval prompt is vague.

Solution: Be extremely specific in eval prompts. Include examples of 10/10, 5/10, and 0/10 responses.
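
For example, a judge prompt that embeds short anchor examples at different score levels tends to score far more consistently than a bare "rate 0-10" instruction. A minimal sketch (the product details in the anchors are invented for illustration; the template is meant to be filled with .format()):

# Illustrative judge prompt with anchor examples at three score levels
eval_prompt_template = """Rate the agent's answer from 0-10.

Anchor examples:
- 10/10: "The X200 has an RTX 4060, 32GB RAM, and costs $1,799, within your $2,000 budget."
  (specific, accurate, matches every stated preference)
- 5/10: "The X200 is a good laptop for video editing."
  (relevant but vague; ignores budget and gives no specs)
- 0/10: "The X200 ships with 512GB of RAM and is free this week."
  (fabricated specs and pricing)

Answer to rate:
{agent_output}

Respond with JSON: {{"score": <0-10>, "reasoning": "<explanation>"}}"""

# Usage: eval_prompt_template.format(agent_output=agent_output)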

Pitfall 5: Ignoring Cost

Problem: Running evals on every production trace with GPT-4 costs $500/day.

Solution:

  • Use cheaper models for evals (GPT-4o-mini, Claude Haiku)
  • Sample traces (e.g., evaluate 10% randomly)
  • Use deterministic checks where possible (cheaper than LLM calls)

Next Steps: Building Your Eval Strategy

This week:

  1. Set up Langfuse tracing (or verify it's working in Answer Agent)
  2. Collect 100+ production traces
  3. Manually review traces and document 3-5 failure patterns
  4. Create your first dataset with 20 examples

Next week:

  1. Write your first evaluator (start simple: single quality score)
  2. Run it on your dataset
  3. Calculate baseline average score
  4. Identify 2-3 specific improvements to make

Month 1:

  1. Automate evaluation on new traces (hourly cron job)
  2. Set up alerts for quality regressions
  3. Grow dataset to 50+ examples
  4. Run your first A/B test (prompt optimization)

Month 2+:

  1. Add multi-criteria evals (accuracy, relevance, safety, etc.)
  2. Implement user feedback collection
  3. Correlate eval scores with real user satisfaction
  4. Build custom dashboards for your team

Want a personalized roadmap for your evaluation strategy? Book a planning session →

What's Next in the Series?

Now that you have automated evals running, you'll need to debug when things go wrong and train human reviewers to validate quality. Here's where to go next:

🐛 Debugging When Evals Fail

Debugging AI Agents with Langfuse →

Learn to trace failures step-by-step, analyze root causes using spans and observations, and validate that your fixes actually work. This post covers:

  • Understanding traces and sessions for multi-turn debugging
  • Using evaluation data to identify systematic failures
  • Real-world debugging workflows from problem to fix

Need help debugging production issues? Let's talk →

👥 Training Human Reviewers

Human Review Training Guide →

Your automated evals need human validation. This complete training manual shows domain experts how to:

  • Score AI outputs using the 1-5 rubric
  • Provide feedback that improves the system
  • Balance efficiency with quality (1-2 minutes per review)

🔄 Understanding Human-in-the-Loop

Human at the Center →

See how human expertise and automated evals work together to create reliable AI systems. Learn the feedback loop that turns sporadic wins into consistent performance.


Conclusion: Evals Are Your Competitive Advantage

Automated evaluations transform AI agents from unpredictable black boxes into systems you can confidently improve and deploy. While your competitors are manually testing and hoping for the best, you're systematically measuring, iterating, and compounding improvements every week.

Langfuse provides the infrastructure. Answer Agent gives you the tracing for free. The only thing left is defining what "good" means for your use case—and that's the valuable, differentiated work that only you can do.

Start small. Add one evaluator this week. Measure one thing. Then build on it. In three months, you'll have a robust eval system that catches regressions before users do and guides every architectural decision with data.

Ready to implement automated evals for your AI agents? Schedule a demo to see how Answer Agent + Langfuse can accelerate your agent development.


Frequently Asked Questions

What's the difference between observability and evals?

Observability tells you what happened in your agent's execution: which tools were called, what prompts were sent, what responses came back, and how long each step took. It's about understanding the mechanics of your system.

Evals (evaluations) tell you how well it happened: was the response accurate? Did it follow instructions? Was it helpful? Evals measure quality and outcomes.

Think of observability as your car's dashboard (speed, RPM, fuel level) and evals as a driving instructor grading your performance (did you follow traffic rules, drive smoothly, arrive safely?).

Together, they create a feedback loop: observability shows you what to improve, and evals measure if your improvements actually worked.

How many evaluation examples do I need in my dataset?

Start with 20-30 representative examples to cover your main use cases and common edge cases. This is enough to:

  • Catch major regressions when you change prompts or models
  • Identify obvious failure patterns
  • Get a directional signal on quality improvements

As you grow, aim for 50-100+ examples to:

  • Achieve statistical significance in A/B tests
  • Cover long-tail edge cases
  • Build confidence in automated decisions

Key principle: Continuously add new examples from production failures. Your dataset should evolve as your agent evolves. Budget 30 minutes per week to review production traces and add 5-10 new test cases.
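
To make that weekly habit concrete, a short script can copy the traces you flag during review into the dataset from Step 2. A minimal sketch reusing the fetch_trace and create_dataset_item calls shown earlier (the trace IDs are placeholders for the ones you flag):

from langfuse import Langfuse

langfuse = Langfuse()

# Trace IDs flagged during the weekly manual review (placeholders)
flagged_trace_ids = ["trace-abc-123", "trace-def-456"]

for trace_id in flagged_trace_ids:
    trace = langfuse.fetch_trace(trace_id)
    langfuse.create_dataset_item(
        dataset_name="product-recommendations",  # reuse the dataset from Step 2
        input=trace.input,
        metadata={"original_trace_id": trace_id, "added_reason": "weekly review"},
    )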

Should I use GPT-4 or GPT-4o-mini for LLM-as-a-judge evaluations?

GPT-4o-mini is usually the right choice for these reasons:

  1. Cost: 15x cheaper than GPT-4 ($0.15/1M input tokens vs. $2.50/1M)
  2. Speed: 2-3x faster responses
  3. Quality: Sufficient for most evaluation tasks (rating 0-10, checking criteria)

Use GPT-4 or Claude Sonnet when:

  • Evaluating complex reasoning chains
  • Judging nuanced quality dimensions (e.g., tone, empathy)
  • Your eval accuracy must match expert human reviewers

Pro tip: Start with GPT-4o-mini. If you find eval scores don't correlate with user feedback, upgrade to GPT-4 for that specific evaluator. Most teams run 80%+ of evals on cheaper models.

How often should automated evals run?

It depends on your production volume and iteration speed:

High-volume production (1000+ traces/day):

  • Run evals every 1-6 hours
  • Evaluate a random sample (10-20%) to control costs
  • Alert immediately on quality drops below threshold

Medium volume (100-1000 traces/day):

  • Run evals once or twice daily
  • Evaluate most or all traces
  • Review dashboard daily for trends

Low volume / development phase:

  • Run evals on-demand after each deployment
  • Evaluate all traces
  • Manually review results before next iteration

Regression testing (dataset-based):

  • Run on every code/prompt change (CI/CD)
  • Run daily as a safety check
  • Run before production deploys (gate)

Key principle: Evals should run frequently enough to catch regressions before users do, but not so frequently that cost becomes prohibitive.

What's the difference between Langfuse datasets and production traces?

Production traces are real user interactions captured automatically as your agent runs. They show actual usage patterns, edge cases, and failures that happen in the wild. Every conversation is a trace.

Datasets are curated collections of examples used for regression testing and experiments. You create datasets by:

  1. Selecting interesting production traces
  2. Manually creating test cases
  3. Importing examples from other sources

Think of datasets as your "unit tests" for agent quality—stable, versioned examples you run repeatedly to ensure changes don't break existing behavior.

Use production traces for:

  • Discovering new failure modes
  • Monitoring real-world quality trends
  • Understanding actual user needs

Use datasets for:

  • A/B testing prompt changes
  • Preventing regressions
  • Validating improvements against known cases

Can I use custom Python code for evaluators, or do I have to use LLM-as-a-judge?

You can absolutely use custom Python code for evaluations. In fact, you should use a mix of evaluation methods:

Deterministic/Rule-based evaluators (Python code):

def evaluate_response_format(output: str) -> dict:
    """Check if output matches required format."""
    score = 0.0
    issues = []

    # Check required fields
    if "recommendation:" in output.lower():
        score += 0.5
    else:
        issues.append("Missing 'recommendation' section")

    # Check length
    if 100 <= len(output) <= 500:
        score += 0.5
    else:
        issues.append(f"Length {len(output)} outside range 100-500")

    return {
        "score": score,
        "issues": issues
    }

Best for: Format validation, length checks, required field detection, deterministic criteria.

LLM-as-judge evaluators: Best for: Semantic quality, relevance, accuracy, tone, nuanced judgments.

Recommendation: Use deterministic checks for what you can check deterministically (cheaper, faster, more reliable), and use LLM-as-judge for subjective quality assessments.

How do I handle evaluation disagreements when using LLM-as-a-judge?

LLM evaluators can be inconsistent, especially with vague criteria. Here's how to improve reliability:

1. Use temperature 0.0-0.2 for evaluation calls (more deterministic).

2. Provide explicit scoring rubrics in your eval prompt:

Rate from 0-10 where:
- 10: Perfect response, all criteria met, no issues
- 8-9: High quality, minor issues only
- 6-7: Acceptable, some issues but usable
- 4-5: Below standard, multiple issues
- 0-3: Unacceptable, major problems

3. Use structured output (JSON mode) to force consistent format:

response_format={"type": "json_object"}

4. Run multiple evaluators and average scores (ensemble approach):

scores = [
    gpt4_mini_evaluator(trace),
    claude_haiku_evaluator(trace),
    gpt4_evaluator(trace)
]
final_score = sum(scores) / len(scores)

5. Validate against human judgments: Periodically have humans score the same traces and check correlation. If correlation < 0.7, your eval prompt needs work.

What if my eval scores look great but users are still unhappy?

This is a critical signal that your evaluation criteria don't match real user needs. Here's how to diagnose and fix it:

Step 1: Collect user feedback Add thumbs up/down buttons to your agent interface and record explicit user satisfaction:

langfuse.score(
    trace_id=trace_id,
    name="user_satisfaction",
    value=1.0 if thumbs_up else 0.0
)

Step 2: Correlate eval scores with user feedback

from scipy.stats import pearsonr

# Fetch traces with both eval scores and user feedback
# (fetch_traces_with_scores is a placeholder for your own query helper)
traces_with_both = fetch_traces_with_scores(
    eval_score_name="quality_score",
    user_score_name="user_satisfaction"
)

# Calculate correlation
eval_scores = [t.eval_score for t in traces_with_both]
user_scores = [t.user_score for t in traces_with_both]

correlation, _ = pearsonr(eval_scores, user_scores)
print(f"Correlation: {correlation}")  # Should be > 0.7

Step 3: Investigate mismatches Find traces where eval score is high but user feedback is negative:

mismatches = [
    t for t in traces_with_both
    if t.eval_score > 8.0 and t.user_score == 0.0
]

# Manually review these traces to understand the gap

Step 4: Update evaluation criteria Based on what you learn, adjust your eval prompts to match what users actually care about. Often, the mismatch reveals that you're optimizing for the wrong things (e.g., verbosity when users want conciseness).

Should I evaluate every production trace or just a sample?

Sample if:

  • You have high volume (1000+ traces/day)
  • Eval costs are significant (using GPT-4 or complex multi-step evals)
  • You need real-time dashboards but don't need per-trace scores

Evaluate everything if:

  • Volume is manageable (< 500 traces/day)
  • Using cheap evaluators (GPT-4o-mini, Claude Haiku, or deterministic checks)
  • You want per-trace scores for debugging or user support

Sampling strategies:

  1. Random sampling: Evaluate 10-20% of traces uniformly
  2. Stratified sampling: Ensure you sample across different user segments, times of day, agent versions
  3. Intelligent sampling: Evaluate 100% of traces with negative user feedback, low confidence scores, or error flags + 10% of others

Code example:

import random

def should_evaluate(trace: dict) -> bool:
    """Decide if this trace should be evaluated."""

    # Always evaluate if user gave negative feedback
    if trace.get("user_feedback") == "negative":
        return True

    # Always evaluate if agent had low confidence
    if trace.get("confidence_score", 1.0) < 0.7:
        return True

    # Otherwise, sample 10% randomly
    return random.random() < 0.10

How do I evaluate multi-turn conversations vs. single interactions?

For single-turn interactions (one user input → one agent response), evaluate the final output directly.

For multi-turn conversations, you have three approaches:

Approach 1: Evaluate each turn independently

scores = []
for turn in conversation.turns:
    score = evaluate_single_turn(turn.user_input, turn.agent_output)
    langfuse.score(trace_id=turn.trace_id, name="turn_quality", value=score)
    scores.append(score)

# Average across turns
avg_score = sum(scores) / len(scores)

Approach 2: Evaluate conversation outcomes

# Did the full conversation achieve the user's goal?
eval_prompt = f"""
User's initial goal: {conversation.first_message}

Full conversation: {conversation.all_turns}

Did the agent successfully help the user achieve their goal? Rate 0-10.
"""

Approach 3: Hybrid (turn quality + overall outcome)

# Evaluate each turn (did the agent respond appropriately?)
turn_scores = [evaluate_turn(t) for t in conversation.turns]
avg_turn_score = sum(turn_scores) / len(turn_scores)

# Evaluate overall outcome (did conversation succeed?)
outcome_score = evaluate_conversation_outcome(conversation)

# Combined score: weight the outcome more heavily than per-turn quality
final_score = 0.7 * outcome_score + 0.3 * avg_turn_score

Recommendation: For task-oriented agents (e.g., customer support, booking systems), focus on overall outcome. For open-ended agents (e.g., brainstorming, coaching), evaluate per-turn quality.

Can I use Langfuse if I'm not using LangChain or LlamaIndex?

Yes! Langfuse is framework-agnostic. While it has first-class integrations with LangChain and LlamaIndex, you can use it with:

  • Raw OpenAI SDK calls
  • Anthropic SDK (Claude)
  • Custom agent frameworks
  • Any Python/TypeScript code

How to trace custom code:

from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def my_custom_agent(user_input: str) -> str:
    """Custom agent logic without LangChain."""

    # Your custom code here
    prompt = build_custom_prompt(user_input)

    # Trace the LLM call
    with langfuse.generation(name="llm_call") as gen:
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        gen.end(output=response.choices[0].message.content)

    # Trace tool calls
    with langfuse.span(name="tool_execution") as span:
        result = execute_custom_tool(response)
        span.end(output=result)

    return result

Answer Agent's advantage: If you're using Answer Agent with its built-in Langfuse integration, all tracing happens automatically regardless of framework. But if you're building custom agents, you can still use Langfuse—it just requires manual instrumentation.

How do I prevent eval scores from being gamed or manipulated?

If your evals directly influence decisions (e.g., bonuses, promotions, product launches), there's risk of gaming the system. Here's how to maintain integrity:

1. Use multiple evaluation methods (can't game them all):

  • LLM-as-judge scores
  • Deterministic checks
  • User feedback
  • Expert human review (gold standard)

2. Don't rely solely on automated evals for high-stakes decisions. Use them to guide decisions, not make them.

3. Randomly audit eval results with human reviewers:

# Randomly select 5% of traces for human review
audit_sample = random.sample(evaluated_traces, k=int(len(evaluated_traces) * 0.05))

# Compare human scores to automated scores
for trace in audit_sample:
    human_score = get_human_review(trace.id)
    auto_score = trace.eval_score

    if abs(human_score - auto_score) > 2.0:
        flag_for_investigation(trace.id, "Large human-auto score discrepancy")

4. Monitor for suspicious patterns:

  • Sudden spikes in eval scores without corresponding user feedback improvement
  • Eval scores that are always exactly 10/10 (suspiciously perfect)
  • Scores that don't correlate with downstream metrics (retention, NPS, etc.)

5. Use hidden test sets that developers don't have access to (like Kaggle's public vs. private leaderboards).

What's the relationship between Langfuse scores and Answer Agent analytics?

Langfuse is a dedicated observability and evaluation platform. It provides:

  • Detailed trace capture (every LLM call, tool use, reasoning step)
  • Evaluation infrastructure (LLM-as-judge, datasets, experiments)
  • Long-term trend analysis
  • Custom dashboards

Answer Agent provides built-in analytics for operational monitoring:

  • Chatflow execution counts
  • Token usage and costs
  • Error rates
  • Basic performance metrics

How they work together:

Answer Agent automatically sends all traces to Langfuse (when configured). Langfuse provides the deep evaluation and quality analysis layer on top of Answer Agent's operational metrics.

Think of it as:

  • Answer Agent analytics: "Is my agent running? How many requests? Any errors?"
  • Langfuse: "Is my agent providing high-quality responses? Where are the quality issues? Are improvements working?"

You use both. Answer Agent for real-time ops monitoring, Langfuse for quality deep dives and continuous improvement.

How do I get started with evals if I'm completely new to this?

Week 1: Foundation

Day 1-2: Set up tracing

  • If using Answer Agent: Add Langfuse credential, attach to agent
  • If custom code: Instrument with Langfuse SDK
  • Verify traces appear in Langfuse dashboard

Day 3-4: Collect data

  • Run your agent with real or test queries
  • Aim for 50-100 traces
  • Manually review traces in Langfuse UI

Day 5: Identify patterns

  • Note 3-5 examples of good responses
  • Note 3-5 examples of bad responses
  • Write down what makes them good or bad

Week 2: First Evaluator

Day 1-2: Create dataset

  • Add 10 good examples to a Langfuse dataset
  • Add 10 bad examples
  • Write expected criteria for each

Day 3-4: Write evaluator

  • Start with a simple LLM-as-judge evaluator
  • Use the code examples from this article
  • Test on your 20 examples manually

Day 5: Calculate baseline

  • Run evaluator on all 20 examples
  • Calculate average score
  • This is your baseline (e.g., 6.5/10)

Week 3: First Improvement

Day 1-2: Identify one improvement

  • Based on eval results, pick one specific thing to fix
  • Example: "Agent is too verbose" or "Agent misses key details"

Day 3-4: Make the change

  • Modify prompt, adjust settings, or fix bug
  • Re-run evaluator on dataset

Day 5: Measure impact

  • Compare new scores to baseline
  • If improved, deploy to production
  • If not, try something else

Week 4+: Automate and Expand

  • Set up automated eval runs (cron job)
  • Add more examples to dataset (aim for 50+)
  • Add more evaluators (multi-criteria)
  • Monitor trends over time

Key principle: Start small and simple. One evaluator, 20 examples, one metric. Build from there.

How much does running automated evals cost?

Cost depends on your evaluation strategy, but here are realistic estimates:

Low-cost setup (recommended for most teams):

  • Evaluator: GPT-4o-mini ($0.15 per 1M input tokens)
  • Eval prompt length: ~500 tokens per evaluation
  • Cost per eval: ~$0.0001 (0.01 cents)
  • 1000 evals/day: $0.10/day or $3/month

Medium-cost setup (high quality):

  • Evaluator: GPT-4 ($2.50 per 1M input tokens)
  • Eval prompt length: ~500 tokens per evaluation
  • Cost per eval: ~$0.0015 (0.15 cents)
  • 1000 evals/day: $1.50/day or $45/month

High-volume production:

  • 10,000 traces/day
  • Sample 10% for eval (1000 evals/day)
  • Using GPT-4o-mini
  • $3/month

Optimization strategies:

  1. Use cheaper models for simple checks (GPT-4o-mini, Claude Haiku)
  2. Sample traces instead of evaluating 100%
  3. Cache eval results for dataset regression tests
  4. Use deterministic checks where possible (free!)
  5. Batch evaluations to reduce API overhead

Bottom line: For most teams, eval costs are $10-50/month—negligible compared to the value of catching quality issues before users do.
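
To sanity-check these estimates against your own volume, the arithmetic is simple enough to script. A minimal sketch, where the token count and price are assumptions to replace with your model's current pricing (input tokens only):

def monthly_eval_cost(traces_per_day: int, sample_rate: float, tokens_per_eval: int, price_per_1m_tokens: float) -> float:
    """Rough monthly eval spend in dollars (input tokens only)."""
    evals_per_day = traces_per_day * sample_rate
    cost_per_eval = tokens_per_eval / 1_000_000 * price_per_1m_tokens
    return evals_per_day * cost_per_eval * 30

# 10,000 traces/day, 10% sample, ~500 tokens/eval, GPT-4o-mini at $0.15/1M input tokens
print(f"${monthly_eval_cost(10_000, 0.10, 500, 0.15):.2f}/month")  # ~$2.25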

Can I run Langfuse on-premises or do I have to use the cloud?

You can self-host Langfuse on your own infrastructure. This is common for:

  • Enterprise teams with strict data residency requirements
  • Highly regulated industries (healthcare, finance)
  • Organizations that can't send data to third-party clouds

Self-hosting options:

  1. Docker Compose (simple local/small deployments)
  2. Kubernetes (production-grade, scalable)
  3. Cloud VMs (AWS, GCP, Azure)

See Langfuse self-hosting docs for detailed setup guides.

Cloud vs. self-hosted trade-offs:

  • Setup: 5 minutes (Cloud) vs. hours to days (self-hosted)
  • Maintenance: zero, fully managed (Cloud) vs. ongoing, you manage it (self-hosted)
  • Scaling: automatic (Cloud) vs. manual (self-hosted)
  • Cost: free tier plus usage-based pricing (Cloud) vs. infrastructure plus engineering time (self-hosted)
  • Data control: data lives on Langfuse's servers (Cloud) vs. on your infrastructure (self-hosted)
  • SLA: Langfuse's uptime (Cloud) vs. your uptime (self-hosted)

Recommendation: Start with Langfuse Cloud (free tier) to validate the eval workflow. If data residency becomes a requirement later, migrate to self-hosted. Don't prematurely optimize infrastructure.

What happens if my evaluator gives false positives (rates bad outputs as good)?

False positives are a serious problem—they give you false confidence and let quality issues slip through. Here's how to detect and fix them:

Detection:

  1. Compare eval scores to user feedback:

    • Find traces where eval score > 8 but user feedback is negative
    • Manually review those traces to understand what the evaluator missed
  2. Spot check high scores:

    • Randomly review traces that scored 9-10/10
    • If you disagree with the score, your eval criteria are too loose
  3. Look for suspiciously perfect scores:

    • If everything scores 10/10, your evaluator isn't discriminating
    • Good evals should show a distribution (some 6s, some 7s, some 9s)

Fixes:

  1. Tighten eval criteria:

# Before (too loose)
"Rate the response quality from 0-10"

# After (more specific)
"Rate from 0-10 based on:
- Accuracy: All facts must be verifiable (0 = hallucinations present)
- Completeness: Must address every part of user's question
- Format: Must use requested structure (bullet points, numbered list, etc.)

Deduct 2 points for any hallucination.
Deduct 1 point for missing any part of the question."

  2. Use multiple evaluators:

# Require consensus from multiple models
scores = [
    gpt4_mini_eval(trace),
    claude_haiku_eval(trace),
    gpt4_eval(trace)
]

# Flag if there's disagreement
if max(scores) - min(scores) > 3:
    flag_for_human_review(trace)

  3. Add ground truth checks:

# If you have reference materials, check grounding
def evaluate_with_ground_truth(output: str, reference_docs: list):
    """Verify claims against source documents."""
    claims = extract_claims(output)

    for claim in claims:
        if not verify_claim_in_docs(claim, reference_docs):
            return {"score": 0, "reason": f"Unverified claim: {claim}"}

    return {"score": 10, "reason": "All claims verified"}

Key principle: Validate your evaluators regularly against human judgment. Evals are only valuable if they accurately reflect real quality.

How do I evaluate agent behavior that requires domain expertise to judge?

For specialized domains (medical, legal, technical, etc.), generic LLM-as-judge evaluators may not have the expertise to accurately evaluate quality. Here are strategies:

Strategy 1: Build domain-specific eval prompts

eval_prompt = f"""You are a medical professional evaluating an AI health assistant's response.

Patient question: {user_input}
AI response: {agent_output}

Evaluate based on these medical criteria:
- Medical accuracy: Are diagnoses/treatments evidence-based?
- Safety: Does it include appropriate disclaimers (not medical advice, see doctor)?
- Completeness: Does it address symptoms, potential causes, and next steps?
- Clarity: Is medical terminology explained for laypersons?

Rate 0-10 and cite specific medical issues if score < 8.

Note: You are evaluating medical correctness, NOT providing medical advice to the patient.
"""

Strategy 2: Use expert human reviewers

  • Select a subset of traces for expert review
  • Have domain experts score them
  • Use these expert scores to:
    1. Validate automated eval accuracy
    2. Train more accurate evaluators
    3. Build a "gold standard" dataset

Strategy 3: Use retrieval-augmented evaluation

def evaluate_with_domain_knowledge(output: str, knowledge_base: list):
    """Evaluate agent output against authoritative domain sources."""

    # Retrieve relevant authoritative sources
    relevant_sources = retrieve_sources(output, knowledge_base)

    # LLM-as-judge with domain context
    eval_prompt = f"""You are evaluating an AI assistant's response for technical accuracy.

AI Response: {output}

Authoritative Sources:
{format_sources(relevant_sources)}

Rate technical accuracy 0-10. Deduct points for any claims that contradict authoritative sources.
"""

    score = llm_evaluate(eval_prompt)
    return score

Strategy 4: Multi-stage evaluation

  1. Stage 1: Automated check for obvious issues (format, safety disclaimers, length)
  2. Stage 2: LLM-as-judge for general quality
  3. Stage 3: Human expert review for traces that pass stages 1-2 but are borderline (score 6-8)

Recommendation: Start with automated evals to catch obvious issues (90% of problems). Use expert review for the remaining 10% where nuance matters.
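
A minimal sketch of that three-stage gate, where the helper functions (passes_deterministic_checks, llm_judge_score, queue_for_expert_review) are placeholders for the checks described above:

def triage_trace(trace) -> str:
    """Route a trace through staged evaluation; helpers are placeholders."""
    # Stage 1: cheap deterministic checks (format, safety disclaimers, length)
    if not passes_deterministic_checks(trace.output):
        return "fail"

    # Stage 2: general LLM-as-judge quality score
    score = llm_judge_score(trace)
    if score < 6:
        return "fail"
    if score >= 9:
        return "pass"

    # Stage 3: borderline scores (6-8) go to a human domain expert
    queue_for_expert_review(trace)
    return "needs_expert_review"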
