Evaluations

Evals are automated test cases that verify your agent behaves correctly. They live in your agent directory and run via the CLI.

```sh
# Run all evals for the default agent
npx @lobu/cli@latest eval
# Run a specific eval
npx @lobu/cli@latest eval ping
# Run with a different model
npx @lobu/cli@latest eval --model anthropic/claude-sonnet-4
# CI mode (JSON output, exit 1 on failure)
npx @lobu/cli@latest eval --ci --output results.json
```

The gateway must be running (npx @lobu/cli@latest run) before running evals.

Eval files are YAML, stored in agents/{name}/evals/. Each file defines a test case with one or more conversational turns and assertions.

agents/my-agent/evals/ping.yaml

```yaml
name: ping
description: Agent responds to a greeting
turns:
  - content: "Hello, are you there?"
    assert:
      - type: contains
        value: "hello"
        options: { case_insensitive: true }
```

A fuller eval that runs several trials, references a rubric file, and weights its assertions:

agents/my-agent/evals/follows-instructions.yaml

```yaml
name: follows-instructions
description: Agent follows formatting instructions without adding unrequested content
trials: 3
timeout: 60
tags: [behavioral]
rubric: follows-instructions.rubric.md
scoring:
  pass_threshold: 0.8
turns:
  - content: "List exactly 3 benefits of remote work. Use bullet points."
    assert:
      - type: regex
        value: "^[\\s\\S]*[-•].*[-•].*[-•]"
        weight: 0.5
      - type: llm-rubric
        value: "Lists exactly 3 benefits (not 2, not 4+), uses bullet points"
        weight: 0.5
```

Test context retention across multiple messages:

```yaml
name: context-retention
description: Agent remembers context across turns
trials: 3
timeout: 60
tags: [behavioral, multi-turn]
turns:
  - content: "My name is Alice and I work at Acme Corp."
  - content: "What company do I work at?"
    assert:
      - type: contains
        value: "Acme"
        weight: 0.5
      - type: llm-rubric
        value: "Correctly recalls Acme Corp from the previous message"
        weight: 0.5
  - content: "And what's my name?"
    assert:
      - type: contains
        value: "Alice"
```

Turns without assert are sent but not graded, which makes them useful for setup messages.

Eval fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | string | required | Eval name (used in reports) |
| `description` | string | | What this eval tests |
| `trials` | number | 3 | Number of times to run (for statistical confidence) |
| `timeout` | number | 120 | Per-turn timeout in seconds |
| `tags` | string[] | | Tags for filtering (e.g., smoke, behavioral) |
| `rubric` | string | | Path to a rubric markdown file (relative to eval file) |
| `scoring.pass_threshold` | number | 0.8 | Minimum score (0–1) for a trial to pass |
| `turns` | array | required | Conversational turns (min 1) |

Turn fields:

| Field | Type | Description |
| --- | --- | --- |
| `content` | string | The user message to send |
| `assert` | array | Assertions to check against the agent’s response |

Assertion fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `type` | string | required | contains, regex, or llm-rubric |
| `value` | string | required | The value to check (substring, regex pattern, or grading criteria) |
| `weight` | number | 1 | Relative weight in scoring |
| `options.case_insensitive` | boolean | false | Case-insensitive match (for contains) |
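
To make the documented defaults concrete, here is a minimal Python sketch that fills them into an eval definition that has already been parsed into a dict. The helper `normalize_eval` is hypothetical and written only for illustration. It is not part of the Lobu CLI, and the CLI's own validation may behave differently.

```python
from copy import deepcopy

# Defaults taken from the field tables in this page.
EVAL_DEFAULTS = {"trials": 3, "timeout": 120}
DEFAULT_PASS_THRESHOLD = 0.8
DEFAULT_WEIGHT = 1
ASSERTION_TYPES = ("contains", "regex", "llm-rubric")


def normalize_eval(raw: dict) -> dict:
    """Return a copy of an eval definition with documented defaults applied.

    Hypothetical helper for illustration; not the CLI's implementation.
    """
    for field in ("name", "turns"):
        if field not in raw:
            raise ValueError(f"missing required field: {field}")
    if not raw["turns"]:
        raise ValueError("turns must contain at least one turn")

    spec = deepcopy(raw)
    for key, value in EVAL_DEFAULTS.items():
        spec.setdefault(key, value)
    spec.setdefault("scoring", {}).setdefault("pass_threshold", DEFAULT_PASS_THRESHOLD)

    for turn in spec["turns"]:
        # A turn with no assertions is sent but not graded.
        for assertion in turn.setdefault("assert", []):
            for field in ("type", "value"):
                if field not in assertion:
                    raise ValueError(f"assertion missing required field: {field}")
            if assertion["type"] not in ASSERTION_TYPES:
                raise ValueError(f"unknown assertion type: {assertion['type']}")
            assertion.setdefault("weight", DEFAULT_WEIGHT)
            if assertion["type"] == "contains":
                assertion.setdefault("options", {}).setdefault("case_insensitive", False)
    return spec


# The ping eval from earlier, expressed as a dict.
ping = normalize_eval({
    "name": "ping",
    "turns": [{
        "content": "Hello, are you there?",
        "assert": [{"type": "contains", "value": "hello",
                    "options": {"case_insensitive": True}}],
    }],
})
print(ping["trials"], ping["timeout"], ping["scoring"]["pass_threshold"])  # 3 120 0.8
print(ping["turns"][0]["assert"][0]["weight"])  # 1
```

Values set explicitly in the eval file, such as `case_insensitive: true` here, are left untouched; only missing fields receive defaults.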

contains — checks if the agent’s response includes a substring.

```yaml
- type: contains
  value: "Acme Corp"
  options: { case_insensitive: true }
```

regex — tests the response against a regular expression (case-insensitive by default).

```yaml
- type: regex
  value: "\\d{3}-\\d{4}" # matches a phone number pattern
```
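
The two string-based assertion types can be approximated in a few lines of Python. This is a rough sketch, not the CLI's implementation. The function names are hypothetical. It assumes that a regex may match anywhere in the response rather than having to match the whole response. It also uses Python's `re` dialect, which may differ in places from the regex engine the CLI uses.

```python
import re


def check_contains(response: str, value: str, case_insensitive: bool = False) -> int:
    """Score 1 if `value` occurs in `response`, else 0."""
    if case_insensitive:
        return int(value.lower() in response.lower())
    return int(value in response)


def check_regex(response: str, pattern: str) -> int:
    """Score 1 if `pattern` matches anywhere in `response`, else 0.

    Case-insensitive by default, as described for the regex type.
    Matching anywhere (search) is an assumption of this sketch.
    """
    return int(re.search(pattern, response, flags=re.IGNORECASE) is not None)


response = "Sure! You can reach Acme Corp at 555-0134."
print(check_contains(response, "acme corp"))                         # 0
print(check_contains(response, "acme corp", case_insensitive=True))  # 1
print(check_regex(response, r"\d{3}-\d{4}"))                         # 1
```

The first call scores 0 because contains is case-sensitive unless `case_insensitive` is set.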

llm-rubric — sends the response to an LLM for qualitative grading. Use this for subjective criteria that can’t be captured with string matching.

```yaml
- type: llm-rubric
  value: "Response is friendly, acknowledges the user's question, and provides a helpful answer"
```

For more detailed grading, create a rubric file. It’s a markdown document with criteria the LLM evaluates against.

agents/my-agent/evals/follows-instructions.rubric.md

```markdown
# Instruction Following
## Direct Compliance
- Agent addresses the specific request, not a tangential topic
- Response format matches the formatting instructions given
- Exact count requested is respected (no more, no fewer)
## Boundary Respect
- Agent does not add unrequested features or disclaimers
- No unsolicited follow-up questions
## Tone
- Professional and helpful
- No unnecessary apologies or hedging
```

Reference it from your eval:

```yaml
rubric: follows-instructions.rubric.md
```

When a rubric is present, the trial score is split evenly: 50% comes from the rubric score and 50% from the weighted assertion scores.

  • Each assertion produces a score of 0 or 1, weighted by weight
  • Trial score = weighted average of all assertion scores (+ rubric if present)
  • A trial passes if score >= pass_threshold (default 0.8)
  • The eval pass rate = fraction of trials that passed
  • Multiple trials (default 3) provide statistical confidence against non-deterministic responses
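
The scoring rules above can be worked through in a short Python sketch. It reflects one reading of those rules, not the CLI's actual code. The function names are hypothetical, and it assumes the rubric score lies on the same 0–1 scale as the assertion scores.

```python
from typing import Optional, Sequence, Tuple


def assertion_score(results: Sequence[Tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs, where each score is 0 or 1."""
    total_weight = sum(weight for _, weight in results)
    if total_weight == 0:
        raise ValueError("at least one assertion with non-zero weight is required")
    return sum(score * weight for score, weight in results) / total_weight


def trial_score(results: Sequence[Tuple[float, float]],
                rubric_score: Optional[float] = None) -> float:
    """Assertion score alone, or a 50/50 blend with the rubric score if present."""
    base = assertion_score(results)
    if rubric_score is None:
        return base
    return 0.5 * base + 0.5 * rubric_score


def pass_rate(scores: Sequence[float], pass_threshold: float = 0.8) -> float:
    """Fraction of trials whose score meets the threshold."""
    return sum(1 for s in scores if s >= pass_threshold) / len(scores)


# Three trials of an eval with two assertions weighted 0.5 each.
# The rubric scores (1.0, 0.75, 0.75) are made-up illustrative values.
scores = [
    trial_score([(1, 0.5), (1, 0.5)], rubric_score=1.0),   # both assertions pass
    trial_score([(1, 0.5), (0, 0.5)], rubric_score=0.75),  # second assertion fails
    trial_score([(1, 0.5), (1, 0.5)], rubric_score=0.75),  # both assertions pass
]
print(scores)                       # [1.0, 0.625, 0.875]
print(round(pass_rate(scores), 2))  # 0.67
```

In the second trial, one failed assertion halves the assertion score to 0.5, and the blended score of 0.625 falls below the default threshold of 0.8, so two of the three trials pass.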

CLI flags:

| Flag | Description |
| --- | --- |
| `-a, --agent <id>` | Agent ID (defaults to first in lobu.toml) |
| `-g, --gateway <url>` | Gateway URL (default: from .env or http://localhost:8080) |
| `-m, --model <model>` | Model to evaluate (e.g., anthropic/claude-sonnet-4) |
| `--trials <n>` | Override trial count for all evals |
| `--ci` | CI mode: JSON output, exit code 1 on any failure |
| `--output <file>` | Write results to a JSON file |
| `--list` | List available evals without running them |

Results are automatically saved to agents/{name}/evals/.results/ as JSON after each run. A comparison report is generated at agents/{name}/evals/evals-report.md showing:

  • Model comparison table (pass rate, avg score, latency, tokens)
  • Rubric details per model
  • Failed trial transcripts with trace IDs (for debugging via observability)

Run evals with different --model values to build a comparison across providers.
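
One way to script such a comparison is to invoke the CLI once per model. The sketch below uses only the flags documented on this page. The helper names `build_eval_command` and `run_comparison` are hypothetical, and the gateway is assumed to be running already.

```python
import subprocess
from typing import Dict, List, Optional, Sequence

CLI = ["npx", "@lobu/cli@latest", "eval"]


def build_eval_command(model: str, eval_name: Optional[str] = None,
                       trials: Optional[int] = None, ci: bool = False) -> List[str]:
    """Build the argv list for one eval run against a given model."""
    cmd = list(CLI)
    if eval_name:
        cmd.append(eval_name)
    cmd += ["--model", model]
    if trials is not None:
        cmd += ["--trials", str(trials)]
    if ci:
        cmd.append("--ci")
    return cmd


def run_comparison(models: Sequence[str], eval_name: Optional[str] = None) -> Dict[str, int]:
    """Run the evals once per model and return each run's exit code.

    Uses --ci so that a failing eval produces exit code 1, as documented.
    """
    exit_codes = {}
    for model in models:
        completed = subprocess.run(build_eval_command(model, eval_name, ci=True), check=False)
        exit_codes[model] = completed.returncode
    return exit_codes


print(" ".join(build_eval_command("anthropic/claude-sonnet-4", eval_name="ping")))
# npx @lobu/cli@latest eval ping --model anthropic/claude-sonnet-4
```

Calling `run_comparison` with a list of model IDs would run the evals for each in turn. Any model ID other than anthropic/claude-sonnet-4 is a placeholder to replace with one your gateway supports.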

```
agents/my-agent/
  evals/
    ping.yaml
    context-retention.yaml
    follows-instructions.yaml
    follows-instructions.rubric.md
    .results/                             # auto-generated
      openrouter-claude-sonnet_1234.json
      gemini-gemini-pro_5678.json
    evals-report.md                       # auto-generated comparison
```