Evaluations

Evals for Lobu agents run through promptfoo, a mature, vendor-neutral LLM eval framework, via the published @lobu/promptfoo-provider package. promptfoo supplies the runner, the assertion library (regex, contains, llm-rubric, factuality, context-recall, and others), the reporter, the web viewer, and CI integration. The provider connects all of that to your Lobu agent.

# 1. Install promptfoo + the Lobu provider in your project.
bun add -D promptfoo @lobu/promptfoo-provider
# 2. Boot your gateway (in another terminal).
npx @lobu/cli@latest run
# 3. Mint a token + run evals.
export LOBU_TOKEN=$(npx @lobu/cli@latest token)
bunx promptfoo eval -c agents/<agent-id>/evals/promptfooconfig.yaml
bunx promptfoo view

promptfoo view opens a comparison grid in your browser, which is useful both for debugging individual cases and for screen-shared demos.

# agents/<agent-id>/evals/promptfooconfig.yaml
description: Smoke evals
providers:
  - id: 'package:@lobu/promptfoo-provider:LobuProvider'
    config:
      agent: <agent-id>
      # gateway: http://localhost:8787  # defaults to LOBU_GATEWAY env
      # token: ...                      # defaults to LOBU_TOKEN env
defaultTest:
  options:
    provider: anthropic:messages:claude-haiku-4-5-20251001
prompts:
  - '{{query}}'
tests:
  - description: ping
    vars:
      query: 'Hello, are you there?'
    assert:
      - type: regex
        value: 'hello|hi\b|hey|yes|here|ready'
        weight: 0.3
      - type: llm-rubric
        value: 'Response is friendly, acknowledges the greeting, and matches the agent persona.'
        weight: 0.7

providers[].id uses promptfoo’s package: protocol, which takes the form package:<npm-name>:<exported-class>. As long as @lobu/promptfoo-provider is installed where Node can resolve it, this loads the package’s LobuProvider class.
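
To make the id format concrete, here is a minimal sketch that splits such an id into its two parts. parsePackageId and PackageRef are hypothetical names used only for illustration; this is not promptfoo's actual resolver, which may behave differently.

```typescript
// Hypothetical helper for illustration; promptfoo's own resolver may differ.
interface PackageRef {
  packageName: string; // npm package name, possibly scoped (e.g. "@scope/name")
  exportName: string; // name of the exported provider class
}

function parsePackageId(id: string): PackageRef {
  const prefix = "package:";
  if (!id.startsWith(prefix)) {
    throw new Error(`Expected an id starting with "${prefix}", got: ${id}`);
  }
  const rest = id.slice(prefix.length);
  // Everything before the last colon is the package name,
  // everything after it is the exported class name.
  const sep = rest.lastIndexOf(":");
  if (sep <= 0 || sep === rest.length - 1) {
    throw new Error(`Expected package:<npm-name>:<exported-class>, got: ${id}`);
  }
  return { packageName: rest.slice(0, sep), exportName: rest.slice(sep + 1) };
}

const ref = parsePackageId("package:@lobu/promptfoo-provider:LobuProvider");
console.log(ref.packageName); // @lobu/promptfoo-provider
console.log(ref.exportName); // LobuProvider
```

A typo in either half of the id typically surfaces as a module-resolution or missing-export error when promptfoo starts, so it is worth checking the id first when the provider fails to load.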

| key | env fallback | required | notes |
| --- | --- | --- | --- |
| agent | LOBU_AGENT | yes | agent id registered with the gateway |
| gateway | LOBU_GATEWAY | no | defaults to http://localhost:8787 |
| token | LOBU_TOKEN | yes | bearer token from lobu token |
| provider | | no | override the LLM provider for this session |
| model | | no | override the LLM model |
| timeoutMs | | no | per-call timeout (default 120000) |
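
The fallback rules in the table can be sketched as a small pure function. This is an illustration of the documented behaviour only, not the provider's source: resolveConfig and both interface names are hypothetical, and the precedence shown (explicit config first, then environment variable, then default) is an assumption.

```typescript
// Hypothetical types mirroring the config keys in the table above.
interface LobuProviderConfig {
  agent?: string;
  gateway?: string;
  token?: string;
  provider?: string;
  model?: string;
  timeoutMs?: number;
}

interface ResolvedConfig {
  agent: string;
  gateway: string;
  token: string;
  provider?: string;
  model?: string;
  timeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Assumed precedence: explicit config value, then env var, then default.
function resolveConfig(config: LobuProviderConfig, env: Env): ResolvedConfig {
  const agent = config.agent ?? env.LOBU_AGENT;
  const token = config.token ?? env.LOBU_TOKEN;
  if (!agent) throw new Error("agent is required (config.agent or LOBU_AGENT)");
  if (!token) throw new Error("token is required (config.token or LOBU_TOKEN)");
  return {
    agent,
    token,
    gateway: config.gateway ?? env.LOBU_GATEWAY ?? "http://localhost:8787",
    provider: config.provider, // no env fallback listed in the table
    model: config.model, // no env fallback listed in the table
    timeoutMs: config.timeoutMs ?? 120000,
  };
}

const resolved = resolveConfig({ agent: "my-agent" }, { LOBU_TOKEN: "example-token" });
console.log(resolved.gateway); // http://localhost:8787
console.log(resolved.timeoutMs); // 120000
```

Failing fast on a missing agent or token keeps the error close to its cause, instead of surfacing later as a rejected gateway request.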

promptfoo ships a large assertion library; the ones most useful for Lobu agent evals:

| Assertion | When to use |
| --- | --- |
| contains / icontains / regex | Deterministic checks for required tokens, IDs, dates, names |
| equals / is-json | Strict output shape |
| llm-rubric | Behavioural grading: tone, format compliance, instruction following |
| factuality | Output factually consistent with a reference answer |
| similar / levenshtein | Fuzzy match against expected output |
| cost / latency | Budget enforcement |

See promptfoo’s assertions docs for the full set.
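
The ping example earlier weights a regex check at 0.3 and an llm-rubric check at 0.7. As a rough sketch of how such weights can combine into one score, the snippet below computes a weighted average. It assumes the combined score is a weighted mean of per-assertion scores. AssertionResult and weightedScore are hypothetical names, and promptfoo's actual pass/fail logic (for example, any threshold setting) is described in its own docs.

```typescript
// Hypothetical shape for a graded assertion; not promptfoo's internal type.
interface AssertionResult {
  type: string;
  score: number; // 0..1, where 1 is a full pass
  weight: number;
}

// Weighted mean of the assertion scores (assumed combination rule).
function weightedScore(results: AssertionResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weighted / totalWeight;
}

// The regex matched, while the rubric grader awarded half marks.
const score = weightedScore([
  { type: "regex", score: 1, weight: 0.3 },
  { type: "llm-rubric", score: 0.5, weight: 0.7 },
]);
console.log(score.toFixed(2)); // 0.65
```

Giving the rubric the larger weight means the deterministic check acts as a cheap sanity signal while the behavioural grade dominates the result.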

promptfoo expands tests: into one test case per entry. When you have many cases, load them from a JSONL file instead of listing them inline:

tests: file://./cases/specific.jsonl

Each row’s fields become vars available as {{var_name}} substitutions in prompts and in the assertion value.
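
As an illustration, the sketch below builds a couple of rows, serializes them as JSONL, and shows how a {{query}} placeholder would be filled from a row. The row fields (query, expected) are hypothetical, and render is a simplified stand-in for promptfoo's templating (promptfoo uses Nunjucks), so it only handles plain {{name}} placeholders.

```typescript
// Hypothetical case rows; the field names are illustrative only.
type CaseRow = Record<string, string>;

const rows: CaseRow[] = [
  { query: "Hello, are you there?", expected: "hello" },
  { query: "Are you ready to start?", expected: "ready" },
];

// JSONL is one JSON object per line, with no enclosing array.
function toJsonl(items: CaseRow[]): string {
  return items.map((item) => JSON.stringify(item)).join("\n") + "\n";
}

// Simplified stand-in for promptfoo's templating:
// replaces {{name}} with the matching row field, or "" if absent.
function render(template: string, vars: CaseRow): string {
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (_match: string, name: string) => (name in vars ? vars[name] : ""),
  );
}

console.log(toJsonl(rows));
for (const row of rows) {
  console.log(render("{{query}}", row));
}
```

Writing the output of toJsonl to a file such as cases/specific.jsonl would give the tests: entry above something to load.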

The canonical reference is examples/personal-finance/evals/promptfooconfig.yaml. It exercises a real agent with two single-turn evals: ping (persona check) and tax-year-anchoring (UK fiscal-year boundary, two independent cases).

  • Multi-turn evals are not yet first-class. @lobu/promptfoo-provider invokes the agent with a single user message per test case. For sequential conversations, either flatten the transcript into one prompt (“user said earlier: X; now they say: Y”) or wait for a planned vars.transcript extension to the provider.
  • RAG-specific assertions (context-recall, context-faithfulness, custom tool-call checks) are not wired up. The gateway’s SSE protocol doesn’t surface tool calls yet, so the provider can’t populate metadata.toolCalls / metadata.retrievedContext. Tracked as a follow-up gateway change.
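
For the transcript-flattening workaround in the first bullet, a minimal sketch might look like the following. Turn and flattenTranscript are hypothetical names, and the exact wording of the flattened prompt is an assumption you should adapt to your agent.

```typescript
// Hypothetical transcript turn; not a type exported by the provider.
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Collapse earlier turns plus the current message into one prompt string.
function flattenTranscript(history: Turn[], current: string): string {
  if (history.length === 0) return current;
  const lines = history.map(
    (t) => `${t.role === "user" ? "User" : "Assistant"} said earlier: ${t.content}`,
  );
  return lines.concat(`User now says: ${current}`).join("\n");
}

const prompt = flattenTranscript(
  [
    { role: "user", content: "My name is Sam." },
    { role: "assistant", content: "Nice to meet you, Sam." },
  ],
  "What is my name?",
);
console.log(prompt);
```

The resulting string can be passed as the query var of a single test case. Note that the agent sees it as one message, so this checks how it handles restated context rather than true session memory.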

promptfoo writes JSON, JUnit, and HTML reports; see promptfoo eval --output. The GitHub Action reporter annotates failing assertions on pull requests.

For CI:

bunx promptfoo eval -c agents/<agent-id>/evals/promptfooconfig.yaml \
  --output results.json --no-share
# exits non-zero on any failed assertion