Agent Evaluation

Testing and evaluation solve different problems:

  • Tests protect behavior/contracts in deterministic conditions.
  • Evaluation compares quality over datasets (accuracy, latency, token usage) across models/prompts.

Use both.
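The difference can be sketched in a few lines (both functions are hypothetical, illustrative helpers, not part of any PURISTA API):

```ts
// A test pins one deterministic contract: it either passes or fails.
function testGreetingContract(output: string): boolean {
  return output.startsWith('Hello')
}

// An evaluation aggregates quality over a whole dataset and reports a score.
function evaluateAccuracy(results: { expected: string; actual: string }[]): number {
  if (results.length === 0) return 0
  const passed = results.filter(r => r.actual.includes(r.expected)).length
  return passed / results.length
}
```

A test gates a merge; an accuracy score trends over time and supports model or prompt decisions.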

What to evaluate and why

| Metric | What it answers | Typical decision supported |
| --- | --- | --- |
| accuracy / success rate | Does the output match expected intent? | prompt/model regression checks |
| duration (min/max/avg) | How fast is the agent? | sync vs. async route suitability |
| token usage | How expensive is context + output size? | memory strategy and model sizing |
| failure rate | How often do runs fail? | retry policy and guardrail tuning |
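All four aggregates can be derived from per-run samples. A minimal sketch (the sample field names here are assumptions, not a PURISTA type):

```ts
interface RunSample {
  success: boolean
  failed: boolean
  durationMs: number
  totalTokens: number
}

// Collapse raw per-run samples into the aggregate metrics from the table above.
function summarize(samples: RunSample[]) {
  const durations = samples.map(s => s.durationMs)
  return {
    accuracy: samples.filter(s => s.success).length / samples.length,
    failureRate: samples.filter(s => s.failed).length / samples.length,
    durationMs: {
      min: Math.min(...durations),
      max: Math.max(...durations),
      avg: durations.reduce((a, b) => a + b, 0) / samples.length,
    },
    totalTokens: samples.reduce((a, s) => a + s.totalTokens, 0),
  }
}
```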

Evaluation output contract

Dataset shape is intentionally up to the application.
Evaluation output should be JSON and stable over time so CI can diff it.

Typical fields:

  • suite name and timestamp
  • case-level pass/fail
  • duration statistics (min/max/avg/total)
  • token usage statistics (prompt/completion/total)
  • aggregate accuracy/failure rate
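Concretely, a stored result could look like the object below. The field names are illustrative; the exact shape produced by `createEvaluationResult` may differ, but the invariants (stable keys, plain JSON values) are what matter for diffing:

```ts
// Illustrative result shape: stable keys and plain JSON values so CI can diff runs.
const exampleResult = {
  suite: 'support-regression',
  timestamp: '2026-03-04T10:00:00Z',
  cases: [
    { id: 'case-1', success: true, durationMs: 812 },
    { id: 'case-2', success: false, durationMs: 1204 },
  ],
  duration: { min: 812, max: 1204, avg: 1008, total: 2016 },
  tokens: { prompt: 420, completion: 180, total: 600 },
  accuracy: 0.5,
  failureRate: 0.5,
}
```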

Example flow

```ts
import { createEvaluationResult, diffEvaluationResults, validateDataset } from '@purista/ai/evaluation'
import { z } from 'zod/v4'
import { extendApi } from '@purista/core'

const datasetSchema = extendApi(
  z.object({
    id: z.string(),
    prompt: z.string(),
    expectedSubstring: z.string(),
  }),
  { title: 'Support Eval Row' },
)

// loadJson is an application-specific helper that reads and parses a JSON file.
const dataset = await validateDataset(datasetSchema.array(), await loadJson('support-regression.json'))

const samples: { id: string; expected: string; actual: string; success: boolean; durationMs: number }[] = []
for (const row of dataset) {
  const started = performance.now()
  // eventBridge is the application's configured event bridge instance.
  const envelopes = await invokeAgent({
    eventBridge,
    agentName: 'supportAgent',
    agentVersion: '1',
    payload: { prompt: row.prompt, message: row.prompt, history: [], attachments: [] },
  })

  // Take the content of the last frame that is marked as the final message.
  const finalMessage = envelopes
    .map(e => e.frame)
    .filter((f): f is { kind: 'message'; content: string; final?: boolean } => f.kind === 'message')
    .findLast(f => f.final === true)?.content

  samples.push({
    id: row.id,
    expected: row.expectedSubstring,
    actual: finalMessage ?? '',
    success: (finalMessage ?? '').includes(row.expectedSubstring),
    durationMs: performance.now() - started,
  })
}

const current = createEvaluationResult({
  workload: 'supportAgent',
  manifestVersion: '1',
  dataset: 'support-regression',
  samples,
})

// previousRun is an earlier evaluation result loaded from storage.
const diff = diffEvaluationResults(previousRun, current)
```

Storage recommendation

Store results under a predictable structure:

```text
evaluations/
  support-regression/
    2026-03-04T10-00-00Z.json
    2026-03-05T10-00-00Z.json
```

This makes comparisons and regressions visible in pull requests and release checks.
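Persisting a run into that layout can be sketched with Node's filesystem API (the helpers below are hypothetical; colons are replaced because they are not valid in file names on all platforms):

```ts
import { mkdirSync, writeFileSync } from 'node:fs'
import { join } from 'node:path'

// Build a filesystem-safe, lexically sortable file name from a timestamp,
// e.g. 2026-03-04T10-00-00Z.json
function resultPath(baseDir: string, dataset: string, timestamp: Date): string {
  const name = timestamp.toISOString().replace(/:/g, '-').replace(/\.\d+Z$/, 'Z')
  return join(baseDir, dataset, `${name}.json`)
}

// Write one evaluation result into evaluations/<dataset>/<timestamp>.json
function storeResult(baseDir: string, dataset: string, result: unknown): string {
  const path = resultPath(baseDir, dataset, new Date())
  mkdirSync(join(baseDir, dataset), { recursive: true })
  writeFileSync(path, JSON.stringify(result, null, 2))
  return path
}
```

Lexically sortable names mean a plain directory listing already shows the run history in order.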

CI usage

  • Run evaluation as part of a dedicated pipeline stage (not necessarily every unit-test run).
  • Keep threshold checks explicit (for example: accuracy must not drop by more than 2%).
  • Publish JSON artifacts for later inspection and trend dashboards.
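An explicit threshold gate can be as small as the sketch below (the `accuracy` field is an assumption about your stored result shape, not a PURISTA contract):

```ts
// Fail the pipeline stage when accuracy drops by more than the allowed margin.
function checkAccuracyThreshold(
  previous: { accuracy: number },
  current: { accuracy: number },
  maxDrop = 0.02,
): void {
  const drop = previous.accuracy - current.accuracy
  if (drop > maxDrop) {
    throw new Error(`accuracy dropped by ${(drop * 100).toFixed(1)}% (limit ${maxDrop * 100}%)`)
  }
}
```

Throwing (and exiting non-zero) makes the regression visible as a failed CI stage instead of a silently shrinking number in a dashboard.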