# Agent Evaluation
Testing and evaluation solve different problems:
- Tests protect behavior/contracts in deterministic conditions.
- Evaluation compares quality over datasets (accuracy, latency, token usage) across models/prompts.
Use both.
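The distinction can be sketched in code. Below, the assertion is a test (a deterministic contract that passes or fails), while the loop is an evaluation (an aggregate metric over a dataset). `normalizeTicket`, `classify`, and the dataset rows are illustrative toy examples, not part of any API:

```typescript
// A test protects a deterministic contract: same input, same output, pass or fail.
function normalizeTicket(subject: string): string {
  return subject.trim().toLowerCase()
}
console.assert(normalizeTicket('  REFUND  ') === 'refund')

// An evaluation aggregates quality over a dataset: the result is a metric, not a single verdict.
const dataset = [
  { input: 'Refund please', expected: 'refund' },
  { input: 'Where is my order?', expected: 'shipping' },
]
const classify = (input: string) => (input.toLowerCase().includes('refund') ? 'refund' : 'shipping')
const accuracy = dataset.filter(row => classify(row.input) === row.expected).length / dataset.length
```

A test result gates a merge; an accuracy number like this is compared across runs to catch regressions.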
## What to evaluate and why
| Metric | What it answers | Typical decision supported |
|---|---|---|
| accuracy / success rate | does output match expected intent? | prompt/model regression checks |
| duration (min/max/avg) | how fast is the agent? | sync vs async route suitability |
| token usage | how expensive is context+output size? | memory strategy and model sizing |
| failure rate | how often runs fail? | retry policy and guardrail tuning |
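All four metrics can be derived from per-run samples. A minimal sketch, assuming each sample records success, duration, and token counts (the `Sample` shape and field names here are illustrative, not a PURISTA API):

```typescript
type Sample = { success: boolean; durationMs: number; promptTokens: number; completionTokens: number }

function summarize(samples: Sample[]) {
  const durations = samples.map(s => s.durationMs)
  const total = samples.length
  const passed = samples.filter(s => s.success).length
  return {
    accuracy: passed / total,              // success rate over the dataset
    failureRate: (total - passed) / total, // drives retry/guardrail tuning
    durationMs: {
      min: Math.min(...durations),
      max: Math.max(...durations),
      avg: durations.reduce((a, b) => a + b, 0) / total,
    },
    tokens: samples.reduce(
      (acc, s) => ({
        prompt: acc.prompt + s.promptTokens,
        completion: acc.completion + s.completionTokens,
        total: acc.total + s.promptTokens + s.completionTokens,
      }),
      { prompt: 0, completion: 0, total: 0 },
    ),
  }
}
```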
## Evaluation output contract
The dataset shape is intentionally left to the application.
Evaluation output should be JSON and stable over time so CI can diff it.
Typical fields:
- suite name and timestamp
- case-level pass/fail
- duration statistics (min/max/avg/total)
- token usage statistics (prompt/completion/total)
- aggregate accuracy/failure rate
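An illustrative result document covering those fields might look like the object below. The key names are an example shape chosen for this sketch, not a fixed schema; what matters is that they stay stable across runs so CI can diff them:

```typescript
// Illustrative evaluation output; stable keys make CI diffs meaningful.
const exampleResult = {
  suite: 'support-regression',
  timestamp: '2026-03-04T10:00:00Z',
  cases: [{ id: 'refund-1', success: true, durationMs: 812 }],
  duration: { minMs: 812, maxMs: 812, avgMs: 812, totalMs: 812 },
  tokens: { prompt: 154, completion: 42, total: 196 },
  accuracy: 1,
  failureRate: 0,
}
const serialized = JSON.stringify(exampleResult, null, 2)
```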
## Example flow

```ts
import { createEvaluationResult, diffEvaluationResults, validateDataset } from '@purista/ai/evaluation'
import { z } from 'zod/v4'
import { extendApi } from '@purista/core'
const datasetSchema = extendApi(
z.object({
id: z.string(),
prompt: z.string(),
expectedSubstring: z.string(),
}),
{ title: 'Support Eval Row' },
)
// loadJson: application-specific helper that reads and parses a JSON file (not part of PURISTA)
const dataset = await validateDataset(datasetSchema.array(), await loadJson('support-regression.json'))
const samples = []
for (const row of dataset) {
const started = performance.now()
const envelopes = await invokeAgent({
eventBridge,
agentName: 'supportAgent',
agentVersion: '1',
payload: { prompt: row.prompt, message: row.prompt, history: [], attachments: [] },
})
const finalMessage = envelopes
.map(e => e.frame)
.filter((f): f is { kind: 'message'; content: string; final?: boolean } => f.kind === 'message')
.findLast(f => f.final === true)?.content
samples.push({
id: row.id,
expected: row.expectedSubstring,
actual: finalMessage ?? '',
success: (finalMessage ?? '').includes(row.expectedSubstring),
durationMs: performance.now() - started,
})
}
const current = createEvaluationResult({
workload: 'supportAgent',
manifestVersion: '1',
dataset: 'support-regression',
samples,
})
// previousRun: the stored result JSON from the last evaluation run
const diff = diffEvaluationResults(previousRun, current)
```

## Storage recommendation
Store results under a predictable structure:
```text
evaluations/
  support-regression/
    2026-03-04T10-00-00Z.json
    2026-03-05T10-00-00Z.json
```

This makes comparisons and regressions visible in pull requests and release checks.
## CI usage
- Run evaluation as part of a dedicated pipeline stage (not necessarily every unit-test run).
- Keep threshold checks explicit (for example: accuracy must not drop by more than 2%).
- Publish JSON artifacts for later inspection and trend dashboards.
