Know your prompts work before production.
The harness ships built-in eval helpers for comparing prompt candidates against a fixed input set. They are deterministic, local, cancellable, and CI-friendly, and they need no hosted eval platform, dataset service, or annotation queue.
A local comparator, not a full eval platform
The eval helpers answer one question: given these prompt variants and these example inputs, which candidate performs better? That's it.
Good for:
- Comparing two or more prompt strings against a fixed example set
- Running deterministic checks in unit or CI tests
- Validating scorer logic before wiring a product eval layer
- Smoke-testing output structure after a model upgrade
Not included:
- Dataset storage and management
- Prompt version history and experiment runs
- Human annotation queues
- Optimization loops and regression dashboards
- LLM-as-judge scoring (bring your own scorer callback)
For a full eval platform: use CloudGrid AI Evaluation to manage datasets, experiment results, comparisons, optimization runs, and per-item score records. The harness helpers are the inner loop — call them from your CloudGrid or product eval workflow and persist whatever they return.
When the eval loop becomes a product workflow.
CloudGrid is the companion observability and AI evaluation platform for teams that want the harness inner loop connected to production evidence. It keeps datasets, expected outputs, observed results, trace links, aggregate metrics, comparisons, and optimization candidates in one reviewable project.
Explore CloudGrid AI Evaluation →
Candidates × items → scores
The helper evaluates candidates in order. For each candidate it runs every item through your runCandidate callback, then scores the output with your scorer callback. Results are aggregated and returned sorted by mean score.
import {
  evaluatePromptCandidates,
  evaluateDeterministicScorer,
} from '@purista/harness'

// `session` is an existing harness session, created elsewhere in your app
const abort = new AbortController()
const scores = await evaluatePromptCandidates({
  candidates: [
    { id: 'brief', prompt: 'Answer in one short paragraph.' },
    { id: 'detailed', prompt: 'Answer with details and citations.' },
  ],
  items: [
    {
      id: 'item-1',
      input: { question: 'Can I deploy on Friday?' },
      expected: 'change freeze',
    },
    {
      id: 'item-2',
      input: { question: 'What is the rollback procedure?' },
      expected: 'revert',
    },
  ],
  signal: abort.signal,
  // Runs one item through the agent; the signal is forwarded to the model call
  runCandidate: async (candidate, item, signal) => {
    signal.throwIfAborted()
    return session.agents.answerer.prompt({
      question: item.input.question,
    }, { signal })
  },
  // Passes when output.answer contains the expected text (case-insensitive)
  scorer: async (target) =>
    evaluateDeterministicScorer({
      type: 'contains',
      path: '/answer',
      value: String(target.expected),
      caseInsensitive: true,
    }, target),
})
Return shape
// scores: CandidateScore[]
// sorted: mean score desc, pass rate desc, id asc
[
  {
    candidateId: 'detailed',
    meanScore: 1.0,
    passRate: 1.0,
    itemCount: 2,
    scorerCount: 2,
  },
  {
    candidateId: 'brief',
    meanScore: 0.5,
    passRate: 0.5,
    itemCount: 2,
    scorerCount: 2,
  },
]
Sorting is deterministic. Tie-breaking uses candidateId lexicographic order. Use stable ids so CI output is reproducible across runs.
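The ordering described above can be sketched as a plain comparator. This is a minimal illustration, not the harness's actual implementation: `ItemResult`, `aggregate`, and `rankCandidates` are hypothetical names, and it assumes that `meanScore` is the arithmetic mean of the item scores and `passRate` the fraction of passing items.

```typescript
// Hypothetical per-item result; the harness itself returns aggregates only.
interface ItemResult {
  score: number
  passed: boolean
}

// Mirrors the fields in the return shape above (scorerCount omitted).
interface CandidateScore {
  candidateId: string
  meanScore: number
  passRate: number
  itemCount: number
}

// Assumption: mean of item scores, and fraction of items that passed.
function aggregate(candidateId: string, results: ItemResult[]): CandidateScore {
  const n = results.length
  const total = results.reduce((sum, r) => sum + r.score, 0)
  const passes = results.filter((r) => r.passed).length
  return {
    candidateId,
    meanScore: n === 0 ? 0 : total / n,
    passRate: n === 0 ? 0 : passes / n,
    itemCount: n,
  }
}

// Mean score desc, then pass rate desc, then candidateId asc.
function rankCandidates(scores: CandidateScore[]): CandidateScore[] {
  return [...scores].sort(
    (a, b) =>
      b.meanScore - a.meanScore ||
      b.passRate - a.passRate ||
      (a.candidateId < b.candidateId ? -1 : a.candidateId > b.candidateId ? 1 : 0),
  )
}

const ranked = rankCandidates([
  aggregate('detailed', [{ score: 1, passed: true }, { score: 1, passed: true }]),
  aggregate('concise', [{ score: 1, passed: true }, { score: 0, passed: false }]),
  aggregate('brief', [{ score: 1, passed: true }, { score: 0, passed: false }]),
])
console.log(ranked.map((s) => s.candidateId)) // [ 'detailed', 'brief', 'concise' ]
```

Because `brief` and `concise` tie on both mean score and pass rate, only the id comparison separates them, which is why stable ids matter for reproducible CI output.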
Four deterministic scorer types
evaluateDeterministicScorer is a pure function. No async, no model calls, no side effects. Pass it a scorer definition and a target, get a score back. Use JSON Pointer paths to select into nested output.
JSON Pointer selected value contains a substring.
evaluateDeterministicScorer({
  type: 'contains',
  path: '/answer',
  value: 'change freeze',
  caseInsensitive: true, // optional
}, target)
JSON Pointer selected value matches a regular expression.
evaluateDeterministicScorer({
  type: 'regex',
  path: '/status',
  pattern: '^(approved|rejected)$',
  flags: 'i', // optional: 'i' | 'm' | 'im'
}, target)
Two JSON Pointer selected values in the output are deeply equal.
evaluateDeterministicScorer({
  type: 'attribute-equality',
  leftPath: '/category',
  rightPath: '/expectedCategory',
}, target)
Output conforms to a JSON Schema subset. Validates structure when you care about shape, not content.
evaluateDeterministicScorer({
  type: 'json-schema',
  schema: {
    type: 'object',
    required: ['answer', 'citations'],
    properties: {
      answer: { type: 'string' },
      citations: { type: 'array' },
    },
    additionalProperties: false,
  },
}, target)
JSON Pointer paths. Use RFC 6901 JSON Pointer syntax: /answer selects output.answer, /items/0/title selects the first item's title. When the pointer is missing, the scorer returns passed: false with evidence.reason: "missing_pointer"; it does not throw.
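To make the pointer behaviour concrete, here is a small sketch of RFC 6901 resolution plus a "contains"-style check built on top of it. It is an illustration only and does not reproduce the harness internals: `resolvePointer` and `containsSketch` are hypothetical names, and the choices to return `score: 0` for a missing pointer and to stringify non-string values are assumptions.

```typescript
// Result of resolving a pointer: either a value or "not found".
type Resolved = { found: true; value: unknown } | { found: false }

// Resolve an RFC 6901 JSON Pointer such as "/items/0/title".
function resolvePointer(doc: unknown, pointer: string): Resolved {
  if (pointer === '') return { found: true, value: doc } // empty pointer = whole document
  if (!pointer.startsWith('/')) return { found: false }
  let current: unknown = doc
  for (const raw of pointer.slice(1).split('/')) {
    // RFC 6901 escapes: "~1" means "/", "~0" means "~" (decode "~1" first)
    const key = raw.replace(/~1/g, '/').replace(/~0/g, '~')
    if (Array.isArray(current)) {
      if (!/^(0|[1-9]\d*)$/.test(key)) return { found: false }
      const index = Number(key)
      if (index >= current.length) return { found: false }
      current = current[index]
    } else if (
      typeof current === 'object' &&
      current !== null &&
      Object.prototype.hasOwnProperty.call(current, key)
    ) {
      current = (current as Record<string, unknown>)[key]
    } else {
      return { found: false }
    }
  }
  return { found: true, value: current }
}

interface ScoreResult {
  score: number
  passed: boolean
  evidence: Record<string, unknown>
}

// Hypothetical "contains" check: a missing pointer fails without throwing.
function containsSketch(
  output: unknown,
  path: string,
  value: string,
  caseInsensitive = false,
): ScoreResult {
  const resolved = resolvePointer(output, path)
  if (!resolved.found) {
    return { score: 0, passed: false, evidence: { reason: 'missing_pointer' } }
  }
  const haystack = String(resolved.value) // assumption: non-strings are stringified
  const passed = caseInsensitive
    ? haystack.toLowerCase().includes(value.toLowerCase())
    : haystack.includes(value)
  return { score: passed ? 1 : 0, passed, evidence: { path } }
}

const sampleOutput = {
  answer: 'No deployments during Change Freeze.',
  items: [{ title: 'Rollback' }],
}
console.log(resolvePointer(sampleOutput, '/items/0/title')) // { found: true, value: 'Rollback' }
console.log(containsSketch(sampleOutput, '/answer', 'change freeze', true).passed) // true
console.log(containsSketch(sampleOutput, '/summary', 'x').evidence) // { reason: 'missing_pointer' }
```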
Bring your own scoring logic
The scorer callback is fully application-defined. Compose multiple deterministic checks, call a secondary model for LLM-as-judge, or run a business rule. The harness doesn't care — it just aggregates the scores you return.
Compose multiple checks
scorer: async (target) => {
  // Check 1: the answer must mention 'policy'
  const hasAnswer = evaluateDeterministicScorer(
    { type: 'contains', path: '/answer', value: 'policy' },
    target
  )
  // Check 2: citations must be an array
  const hasCitations = evaluateDeterministicScorer(
    {
      type: 'json-schema',
      schema: {
        type: 'object',
        required: ['citations'],
        properties: {
          citations: { type: 'array' },
        },
      },
    },
    target
  )
  // Combine: both must pass
  const passed = hasAnswer.passed && hasCitations.passed
  return {
    score: passed ? 1 : 0,
    passed,
    evidence: { hasAnswer: hasAnswer.passed, hasCitations: hasCitations.passed },
  }
}
LLM-as-judge scorer
scorer: async (target, signal) => {
  signal.throwIfAborted()
  // Call a judge agent for semantic quality
  const judgment = await session.agents.judge.prompt({
    response: target.output,
    question: target.input,
  }, { signal })
  return {
    score: judgment.qualityScore,
    passed: judgment.qualityScore >= 0.7,
    evidence: { reasoning: judgment.reasoning },
  }
}
Cost note. LLM-as-judge scorers call the model once per candidate × item. For large eval sets, prefer deterministic scorers in CI and reserve model-based scoring for final validation or ambiguous cases.
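Since the harness only aggregates whatever the scorer returns, a composed scorer does not have to be all-or-nothing. The sketch below shows one possible partial-credit variation on the composed scorer earlier in this section. `combineChecks` and `passThreshold` are hypothetical names, not part of the harness, and scoring by the fraction of passing checks is just one reasonable choice.

```typescript
// Shape of a single check result, matching the fields used in the examples above.
interface CheckResult {
  score: number
  passed: boolean
  evidence?: Record<string, unknown>
}

// Score = fraction of named checks that passed; pass when that fraction
// reaches passThreshold (default 1, which means every check must pass).
function combineChecks(
  checks: Record<string, CheckResult>,
  passThreshold = 1,
): CheckResult {
  const entries = Object.entries(checks)
  const evidence: Record<string, boolean> = {}
  let passedCount = 0
  for (const [name, result] of entries) {
    evidence[name] = result.passed
    if (result.passed) passedCount += 1
  }
  const score = entries.length === 0 ? 0 : passedCount / entries.length
  return {
    score,
    passed: entries.length > 0 && score >= passThreshold,
    evidence,
  }
}

const combined = combineChecks(
  {
    hasAnswer: { score: 1, passed: true },
    hasCitations: { score: 0, passed: false },
  },
  0.5,
)
console.log(combined.score, combined.passed) // 0.5 true
```

With the default threshold of 1 this behaves like the both-must-pass scorer; lowering the threshold lets the mean score separate candidates that fail different checks.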
Cancellable. Nothing persisted.
AbortSignal cancellation
evaluatePromptCandidates requires an AbortSignal. It checks the signal before scheduling each candidate/item pair and passes the same signal to both callbacks. Your callbacks should forward it into model or tool calls.
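The "check before each candidate/item pair" behaviour can be pictured with a simplified loop. This is a synchronous sketch under stated assumptions, not the harness's code: `forEachPair` is a hypothetical name, the real helper is asynchronous, and the error it raises on cancellation may differ.

```typescript
interface Candidate {
  id: string
  prompt: string
}

interface Item {
  id: string
  input: unknown
}

// Visit every candidate/item pair in order, checking the signal before each one.
function forEachPair(
  candidates: Candidate[],
  items: Item[],
  signal: AbortSignal,
  visit: (candidate: Candidate, item: Item) => void,
): void {
  for (const candidate of candidates) {
    for (const item of items) {
      if (signal.aborted) {
        throw new Error(`aborted before ${candidate.id}/${item.id}`)
      }
      visit(candidate, item)
    }
  }
}

const controller = new AbortController()
const visited: string[] = []
let stoppedAt = ''
try {
  forEachPair(
    [{ id: 'brief', prompt: 'Answer briefly.' }],
    [
      { id: 'item-1', input: {} },
      { id: 'item-2', input: {} },
    ],
    controller.signal,
    (candidate, item) => {
      visited.push(`${candidate.id}/${item.id}`)
      controller.abort() // cancel after the first pair has run
    },
  )
} catch (error) {
  stoppedAt = (error as Error).message
}
console.log(visited) // [ 'brief/item-1' ]
console.log(stoppedAt) // aborted before brief/item-2
```

A check between pairs cannot interrupt a model call that is already in flight, which is why forwarding the signal into the callbacks matters.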
const abort = new AbortController()
// cancel after 30s in CI
setTimeout(() => abort.abort(), 30_000)
const scores = await evaluatePromptCandidates({
  signal: abort.signal,
  runCandidate: async (candidate, item, signal) => {
    signal.throwIfAborted() // check early
    return session.agents.answerer.prompt(
      item.input,
      { signal } // forward to model call
    )
  },
  // ...
})
Privacy: nothing stored
The eval helpers return aggregate scores only. They do not emit prompt text, model output, expected values, context, or per-item score records to telemetry in any form.
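Persisting results. If you need per-item scores, experiment history, or annotation records, store them in your application layer: persist the returned aggregates, and record per-item results inside your own scorer callback, since the helpers return aggregates only. The harness helpers are stateless by design.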
Gate your build on prompt quality
Use a fake model provider in CI to keep eval tests fast and free. Switch to a live provider for nightly or pre-release runs where you want real model output.
import { describe, it, expect } from 'vitest'
import {
  evaluatePromptCandidates,
  evaluateDeterministicScorer,
} from '@purista/harness'

// Deterministic fake: no real model calls, no cost
const fakeProvider = {
  id: 'fake',
  genAiSystem: 'fake',
  async object(req) {
    const q = req.messages.at(-1)?.content ?? ''
    return {
      object: {
        answer: q.includes('freeze') ? 'No deployments during change freeze.' : 'Deployment allowed.',
        citations: [],
      },
      usage: { inputTokens: 1, outputTokens: 1, totalTokens: 2 },
      finishReason: 'stop',
    }
  },
}
describe('prompt eval: change policy', () => {
  it('both prompts pass the policy check against the fake provider', async () => {
    // createAppHarness is your application's own harness factory (not shown here)
    const harness = createAppHarness(fakeProvider)
    const session = await harness.getSession('eval-test')
    const scores = await evaluatePromptCandidates({
      candidates: [
        { id: 'brief', prompt: 'Answer briefly.' },
        { id: 'detailed', prompt: 'Answer with policy context.' },
      ],
      items: [
        { id: 'item-1', input: { question: 'freeze date?' }, expected: 'freeze' },
      ],
      signal: new AbortController().signal,
      runCandidate: async (candidate, item) =>
        session.agents.answerer.prompt(item.input),
      scorer: async (target) =>
        evaluateDeterministicScorer({
          type: 'contains', path: '/answer',
          value: String(target.expected), caseInsensitive: true,
        }, target),
    })
    // runCandidate ignores candidate.prompt and the fake is prompt-blind, so both
    // candidates tie here; the id tie-break then ranks 'brief' first.
    expect(scores.map((s) => s.candidateId)).toEqual(['brief', 'detailed'])
    expect(scores.every((s) => s.passRate === 1)).toBe(true)
  })
})
Test the whole system, not just prompts.
The testing guide covers fake providers, contract tests, streaming assertions, MCP fakes, and review gate tests.