Evaluations
Evals are automated test cases that verify your agent behaves correctly. They live in your agent directory and run via the CLI.
Quick start
```bash
# Run all evals for the default agent
npx @lobu/cli@latest eval

# Run a specific eval
npx @lobu/cli@latest eval ping

# Run with a different model
npx @lobu/cli@latest eval --model anthropic/claude-sonnet-4

# CI mode (JSON output, exit 1 on failure)
npx @lobu/cli@latest eval --ci --output results.json
```

The gateway must be running (npx @lobu/cli@latest run) before running evals.
Eval file format
Eval files are YAML, stored in agents/{name}/evals/. Each file defines a test case with one or more conversational turns and assertions.
Minimal example
```yaml
name: ping
description: Agent responds to a greeting

turns:
  - content: "Hello, are you there?"
    assert:
      - type: contains
        value: "hello"
        options: { case_insensitive: true }
```

Full example
```yaml
name: follows-instructions
description: Agent follows formatting instructions without adding unrequested content
trials: 3
timeout: 60
tags: [behavioral]
rubric: follows-instructions.rubric.md

scoring:
  pass_threshold: 0.8

turns:
  - content: "List exactly 3 benefits of remote work. Use bullet points."
    assert:
      - type: regex
        value: "^[\\s\\S]*[-•].*[-•].*[-•]"
        weight: 0.5
      - type: llm-rubric
        value: "Lists exactly 3 benefits (not 2, not 4+), uses bullet points"
        weight: 0.5
```

Multi-turn example
Test context retention across multiple messages:
```yaml
name: context-retention
description: Agent remembers context across turns
trials: 3
timeout: 60
tags: [behavioral, multi-turn]

turns:
  - content: "My name is Alice and I work at Acme Corp."

  - content: "What company do I work at?"
    assert:
      - type: contains
        value: "Acme"
        weight: 0.5
      - type: llm-rubric
        value: "Correctly recalls Acme Corp from the previous message"
        weight: 0.5

  - content: "And what's my name?"
    assert:
      - type: contains
        value: "Alice"
```

Turns without assert are sent but not graded, which is useful for setup messages.
Schema reference
| Field | Type | Default | Description |
|---|---|---|---|
| name | string | required | Eval name (used in reports) |
| description | string | - | What this eval tests |
| trials | number | 3 | Number of times to run (for statistical confidence) |
| timeout | number | 120 | Per-turn timeout in seconds |
| tags | string[] | - | Tags for filtering (e.g., smoke, behavioral) |
| rubric | string | - | Path to a rubric markdown file (relative to eval file) |
| scoring.pass_threshold | number | 0.8 | Minimum score (0–1) for a trial to pass |
| turns | array | required | Conversational turns (min 1) |

Each entry in turns has the following fields:

| Field | Type | Description |
|---|---|---|
| content | string | The user message to send |
| assert | array | Assertions to check against the agent’s response (optional) |
Assertion
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | required | contains, regex, or llm-rubric |
| value | string | required | The value to check (substring, regex pattern, or grading criteria) |
| weight | number | 1 | Relative weight in scoring |
| options.case_insensitive | boolean | false | Case-insensitive match (for contains) |
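To make the fields above concrete, here is a minimal TypeScript sketch of how the two string-based assertion types could be evaluated. It is not Lobu's implementation: the `Assertion` interface and `checkStringAssertion` function are hypothetical names introduced for illustration, and the regex branch assumes case-insensitive matching via the `i` flag, since regex is described as case-insensitive by default. The llm-rubric type is left out because it requires a model call.

```typescript
// Hypothetical types mirroring the assertion fields in the table above.
type AssertionType = "contains" | "regex" | "llm-rubric";

interface Assertion {
  type: AssertionType;
  value: string;
  weight?: number; // documented default: 1
  options?: { case_insensitive?: boolean }; // documented default: false
}

// Returns 1 if the response satisfies the assertion, 0 otherwise.
// Only contains and regex are handled in this sketch.
function checkStringAssertion(assertion: Assertion, response: string): 0 | 1 {
  switch (assertion.type) {
    case "contains": {
      const insensitive = assertion.options?.case_insensitive ?? false;
      const haystack = insensitive ? response.toLowerCase() : response;
      const needle = insensitive ? assertion.value.toLowerCase() : assertion.value;
      return haystack.includes(needle) ? 1 : 0;
    }
    case "regex":
      // Assumption: the "i" flag models "case-insensitive by default".
      return new RegExp(assertion.value, "i").test(response) ? 1 : 0;
    default:
      throw new Error(`Unsupported assertion type in this sketch: ${assertion.type}`);
  }
}

const greeting: Assertion = {
  type: "contains",
  value: "hello",
  options: { case_insensitive: true },
};
console.log(checkStringAssertion(greeting, "Hello! I'm here.")); // 1

const phone: Assertion = { type: "regex", value: "\\d{3}-\\d{4}" };
console.log(checkStringAssertion(phone, "Call 555-1234 today")); // 1
```

Note that without case_insensitive, the contains check in this sketch is an exact, case-sensitive substring match, so "hello" would not match "Hello".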
Assertion types
contains: checks if the agent’s response includes a substring.

```yaml
- type: contains
  value: "Acme Corp"
  options: { case_insensitive: true }
```

regex: tests the response against a regular expression (case-insensitive by default).

```yaml
- type: regex
  value: "\\d{3}-\\d{4}" # matches a phone number pattern
```

llm-rubric: sends the response to an LLM for qualitative grading. Use this for subjective criteria that can’t be captured with string matching.

```yaml
- type: llm-rubric
  value: "Response is friendly, acknowledges the user's question, and provides a helpful answer"
```

Rubrics
For more detailed grading, create a rubric file. It’s a markdown document with criteria the LLM evaluates against.

```markdown
# Instruction Following

## Direct Compliance
- Agent addresses the specific request, not a tangential topic
- Response format matches the formatting instructions given
- Exact count requested is respected (no more, no fewer)

## Boundary Respect
- Agent does not add unrequested features or disclaimers
- No unsolicited follow-up questions

## Tone
- Professional and helpful
- No unnecessary apologies or hedging
```

Reference it from your eval:

```yaml
rubric: follows-instructions.rubric.md
```

When a rubric is present, the rubric score and the combined assertion score each contribute 50% of the trial score.
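The 50/50 weighting can be illustrated with a small TypeScript sketch. This is not taken from Lobu's source: `combineWithRubric` is a hypothetical helper, and it assumes both the rubric score and the assertion score are already normalized to the 0–1 range.

```typescript
// Hypothetical helper: blends the assertion score with an optional rubric score.
// Assumption: both inputs are normalized to the range 0..1.
function combineWithRubric(assertionScore: number, rubricScore?: number): number {
  if (rubricScore === undefined) {
    // No rubric configured: the assertion score stands on its own.
    return assertionScore;
  }
  // Rubric present: each side contributes half of the trial score.
  return 0.5 * assertionScore + 0.5 * rubricScore;
}

console.log(combineWithRubric(1)); // 1
console.log(combineWithRubric(1, 0.5)); // 0.75
```

Under these assumptions, a trial whose assertions all pass can still fall below the default pass_threshold of 0.8 if the rubric grade is low enough, as in the second call.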
Scoring
- Each assertion produces a score of 0 or 1, weighted by its weight
- Trial score = weighted average of all assertion scores (+ rubric if present)
- A trial passes if score >= pass_threshold (default 0.8)
- The eval pass rate = fraction of trials that passed
- Multiple trials (default 3) provide statistical confidence against non-deterministic responses
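The scoring rules above can be worked through in a short TypeScript sketch. The names `AssertionResult`, `trialScore`, and `passRate` are hypothetical and only illustrate the arithmetic; rubric blending is omitted, and the handling of empty inputs (returning 0) is an assumption rather than documented behavior.

```typescript
// Hypothetical shape for one graded assertion within a trial.
interface AssertionResult {
  score: 0 | 1; // each assertion either passes (1) or fails (0)
  weight: number; // relative weight, documented default 1
}

// Weighted average of assertion scores for a single trial.
function trialScore(results: AssertionResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0; // assumption: nothing to grade scores 0
  const weighted = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weighted / totalWeight;
}

// Fraction of trials whose score meets the threshold.
function passRate(trialScores: number[], passThreshold: number = 0.8): number {
  if (trialScores.length === 0) return 0; // assumption: no trials means 0
  const passed = trialScores.filter((s) => s >= passThreshold).length;
  return passed / trialScores.length;
}

// Three trials of an eval with two assertions weighted 0.5 each.
const trials: AssertionResult[][] = [
  [{ score: 1, weight: 0.5 }, { score: 1, weight: 0.5 }], // both pass
  [{ score: 1, weight: 0.5 }, { score: 0, weight: 0.5 }], // one fails
  [{ score: 1, weight: 0.5 }, { score: 1, weight: 0.5 }], // both pass
];

const scores = trials.map(trialScore);
console.log(scores); // [ 1, 0.5, 1 ]
console.log(passRate(scores)); // 2 of 3 trials pass
```

With two equally weighted assertions, a single failure drops the trial to 0.5, which is below the default threshold, so the eval's pass rate here is two thirds.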
CLI options
| Flag | Description |
|---|---|
| -a, --agent <id> | Agent ID (defaults to first in lobu.toml) |
| -g, --gateway <url> | Gateway URL (default: from .env or http://localhost:8080) |
| -m, --model <model> | Model to evaluate (e.g., anthropic/claude-sonnet-4) |
| --trials <n> | Override trial count for all evals |
| --ci | CI mode: JSON output, exit code 1 on any failure |
| --output <file> | Write results to a JSON file |
| --list | List available evals without running them |
Results and reports
Results are automatically saved to agents/{name}/evals/.results/ as JSON after each run. A comparison report is generated at agents/{name}/evals/evals-report.md showing:
- Model comparison table (pass rate, avg score, latency, tokens)
- Rubric details per model
- Failed trial transcripts with trace IDs (for debugging via observability)
Run evals with different --model values to build a comparison across providers.
Directory structure
```
agents/my-agent/
  evals/
    ping.yaml
    context-retention.yaml
    follows-instructions.yaml
    follows-instructions.rubric.md
    .results/                            # auto-generated
      openrouter-claude-sonnet_1234.json
      gemini-gemini-pro_5678.json
    evals-report.md                      # auto-generated comparison
```