
2 posts tagged with "Evaluation"

Methods for evaluating AI


Human Review Training: Scoring AI Agents in Langfuse for Better Evaluation

64 min read
Bradley Taylor
Founder & CEO

📚 AI Agent Evaluation Series - Part 5 of 5

  1. Observability & Evals: Why They Matter ←
  2. Human-in-the-Loop Evaluation ←
  3. Implementing Automated Evals ←
  4. Debugging AI Agents ←
  5. Human Review Training Guide ← You are here

If you're a domain expert who's been asked to review AI agent responses in Langfuse, this guide is for you. You don't need to be a technical expert or an AI specialist—you just need to bring your domain knowledge and judgment to help improve the AI system.

This is training content designed to help you understand exactly what to look for, how to score responses, and how to provide feedback that makes a real difference. Think of this as your handbook for becoming an effective AI reviewer.

What you'll learn:

  • The exact 1-5 scoring rubric with real examples
  • How to use Langfuse's annotation queue efficiently
  • Copy-paste comment templates for common scenarios
  • Best practices for consistent, high-quality reviews
  • Why your 1-2 minutes of judgment matters more than spending hours

Let's get you ready to make a meaningful impact.
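As a rough illustration of where reviewer scores end up (not part of the original post): most reviewers will work entirely in Langfuse's annotation queue UI, but scores can also be attached to traces programmatically. The sketch below assumes the Langfuse Python SDK's v2-style `score` call with credentials read from environment variables; the method name and arguments differ between SDK versions, and the trace ID and score name are placeholders.

```python
# Minimal sketch: attaching a 1-5 human review score to an existing trace.
# Assumes the Langfuse Python SDK (v2-style API) with credentials set via
# LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST env vars.
# Trace ID, score name, and comment below are illustrative only.

from langfuse import Langfuse

langfuse = Langfuse()

# Record the rubric value plus a short reviewer comment on one trace.
langfuse.score(
    trace_id="replace-with-a-real-trace-id",  # placeholder
    name="human_review",                      # hypothetical score name
    value=4,                                  # 1-5 rubric value
    comment="Accurate and on-topic; could cite the relevant policy section.",
)

# Make sure queued events are sent before the script exits.
langfuse.flush()
```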

Debugging AI Agents with Langfuse: Observability & Evals That Actually Work

34 min read
Bradley Taylor
Founder & CEO

📚 AI Agent Evaluation Series - Part 4 of 5

  1. Observability & Evals: Why They Matter ←
  2. Human-in-the-Loop Evaluation ←
  3. Implementing Automated Evals ←
  4. Debugging AI Agents ← You are here
  5. Human Review Training Guide →

Building AI agents is exciting. Debugging them when they fail in production? Not so much.

Here's the problem: AI agents don't fail like traditional software. There's no stack trace pointing to line 47. Instead, you get vague responses, hallucinations, or worse—confidently incorrect answers. Your users see the failure, but you have no idea why the agent decided to call the wrong tool, ignore context, or make up facts.

The solution? Observability and evaluation systems built specifically for AI.

In this guide, we'll show you how to use Langfuse to debug AI agents effectively. You'll learn how to trace agent execution, analyze LLM calls, build evaluation datasets, and implement automated checks that catch issues before your users do. Whether you're running simple RAG pipelines or complex multi-agent systems, these techniques will help you ship reliable AI applications.
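As a rough, hedged sketch of what tracing agent execution can look like (not taken from the post itself): the Langfuse Python SDK offers an `@observe` decorator that records each decorated function as a span within a trace, so you can see which step of the agent produced a bad intermediate result. The import path and decorator options vary between SDK versions (this uses the v2-style `langfuse.decorators` module), and the agent functions below are hypothetical stand-ins.

```python
# Minimal tracing sketch, assuming the Langfuse Python SDK's @observe
# decorator. Each decorated call shows up as a span in the resulting trace.

from langfuse.decorators import observe


@observe()
def retrieve_context(question: str) -> str:
    # Stand-in for a retrieval step; recorded as a nested span.
    return "retrieved documents (placeholder)"


@observe()
def answer_question(question: str) -> str:
    # The top-level call becomes the trace; nested calls appear as spans,
    # which is what lets you pinpoint where the agent went wrong.
    context = retrieve_context(question)
    return f"Answer based on: {context}"


if __name__ == "__main__":
    print(answer_question("What does our refund policy cover?"))
```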
