Know your prompts work before production.

The harness ships built-in eval helpers for comparing prompt candidates against a fixed input set. Deterministic, local, cancellable, and CI-friendly — without a hosted eval platform, dataset service, or annotation queue.

Scope

A local comparator, not a full eval platform

The eval helpers answer one question: given these prompt variants and these example inputs, which candidate performs better? That's it.

Use it for

Comparing two or more prompt strings against a fixed example set
Running deterministic checks in unit or CI tests
Validating scorer logic before wiring a product eval layer
Smoke-testing output structure after a model upgrade

Not meant for

Dataset storage and management
Prompt version history and experiment runs
Human annotation queues
Optimization loops and regression dashboards
LLM-as-judge scoring (bring your own scorer callback)

For a full eval platform: use CloudGrid AI Evaluation to manage datasets, experiment results, comparisons, optimization runs, and per-item score records. The harness helpers are the inner loop — call them from your CloudGrid or product eval workflow and persist whatever they return.

CloudGrid integration

When the eval loop becomes a product workflow.

CloudGrid is the companion observability and AI evaluation platform for teams that want the harness inner loop connected to production evidence. It keeps datasets, expected outputs, observed results, trace links, aggregate metrics, comparisons, and optimization candidates in one reviewable project.

Explore CloudGrid AI Evaluation →

Execution model

Candidates × items → scores

The helper evaluates candidates in order. For each candidate it runs every item through your runCandidate callback, then scores the output with your scorer callback. Results are aggregated per candidate and returned sorted by mean score (descending), with pass rate and then candidate id as tie-breakers.

evaluatePromptCandidates (TypeScript)

import {
  evaluatePromptCandidates,
  evaluateDeterministicScorer,
} from '@purista/harness'

const abort = new AbortController()

const scores = await evaluatePromptCandidates({
  candidates: [
    { id: 'brief',    prompt: 'Answer in one short paragraph.' },
    { id: 'detailed', prompt: 'Answer with details and citations.' },
  ],

  items: [
    {
      id: 'item-1',
      input: { question: 'Can I deploy on Friday?' },
      expected: 'change freeze',
    },
    {
      id: 'item-2',
      input: { question: 'What is the rollback procedure?' },
      expected: 'revert',
    },
  ],

  signal: abort.signal,

  runCandidate: async (candidate, item, signal) => {
    signal.throwIfAborted()
    return session.agents.answerer.prompt({
      question: item.input.question,
    }, { signal })
  },

  scorer: async (target) =>
    evaluateDeterministicScorer({
      type: 'contains',
      path: '/answer',
      value: String(target.expected),
      caseInsensitive: true,
    }, target),
})

Return shape

snippet.ts (TypeScript)

// scores: CandidateScore[]
// sorted: mean score desc, pass rate desc, id asc

[
  {
    candidateId: 'detailed',
    meanScore: 1.0,
    passRate: 1.0,
    itemCount: 2,
    scorerCount: 2,
  },
  {
    candidateId: 'brief',
    meanScore: 0.5,
    passRate: 0.5,
    itemCount: 2,
    scorerCount: 2,
  },
]

Sorting is deterministic. Tie-breaking uses candidateId lexicographic order. Use stable ids so CI output is reproducible across runs.
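The ordering rule can be made concrete with a small comparator. This is a minimal sketch of the documented order (mean score descending, pass rate descending, id ascending), not the harness's internal code; the name compareCandidateScores is hypothetical, and plain code-unit string comparison is an assumption about what "lexicographic" means here.

```typescript
// Shape mirrors the fields shown in the return example above.
interface CandidateScore {
  candidateId: string
  meanScore: number
  passRate: number
  itemCount: number
  scorerCount: number
}

// Hypothetical comparator: mean score desc, then pass rate desc, then id asc.
function compareCandidateScores(a: CandidateScore, b: CandidateScore): number {
  if (a.meanScore !== b.meanScore) return b.meanScore - a.meanScore
  if (a.passRate !== b.passRate) return b.passRate - a.passRate
  // Code-unit comparison keeps the order independent of the runtime locale.
  if (a.candidateId < b.candidateId) return -1
  if (a.candidateId > b.candidateId) return 1
  return 0
}

const ranked: CandidateScore[] = [
  { candidateId: 'detailed', meanScore: 0.5, passRate: 0.5, itemCount: 2, scorerCount: 2 },
  { candidateId: 'brief', meanScore: 0.5, passRate: 0.5, itemCount: 2, scorerCount: 2 },
  { candidateId: 'verbose', meanScore: 1, passRate: 1, itemCount: 2, scorerCount: 2 },
].sort(compareCandidateScores)

console.log(ranked.map((s) => s.candidateId)) // [ 'verbose', 'brief', 'detailed' ]
```

Two candidates with identical scores always come out in the same order, which is why renaming an id between runs can change which entry lands at index 0.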

Built-in scorers

Four deterministic scorer types

evaluateDeterministicScorer is a pure function. No async, no model calls, no side effects. Pass it a scorer definition and a target, get a score back. Use JSON Pointer paths to select into nested output.

contains

JSON Pointer selected value contains a substring.

snippet.ts (TypeScript)

evaluateDeterministicScorer({
  type: 'contains',
  path: '/answer',
  value: 'change freeze',
  caseInsensitive: true,  // optional
}, target)

regex

JSON Pointer selected value matches a regular expression.

snippet.ts (TypeScript)

evaluateDeterministicScorer({
  type: 'regex',
  path: '/status',
  pattern: '^(approved|rejected)$',
  flags: 'i',  // optional: 'i' | 'm' | 'im'
}, target)

attribute-equality

Two JSON Pointer selected values in the output are deeply equal.

snippet.ts (TypeScript)

evaluateDeterministicScorer({
  type: 'attribute-equality',
  leftPath: '/category',
  rightPath: '/expectedCategory',
}, target)

json-schema

Output conforms to a JSON Schema subset. Validates structure when you care about shape, not content.

snippet.ts (TypeScript)

evaluateDeterministicScorer({
  type: 'json-schema',
  schema: {
    type: 'object',
    required: ['answer', 'citations'],
    properties: {
      answer: { type: 'string' },
      citations: { type: 'array' },
    },
    additionalProperties: false,
  },
}, target)

JSON Pointer paths. Use RFC 6901 JSON Pointer syntax: /answer selects output.answer, /items/0/title selects the first item's title. When the pointer is missing, the scorer returns passed: false with evidence.reason: "missing_pointer" — it does not throw.
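To make the pointer rules concrete, here is a minimal sketch of RFC 6901 resolution that reports a missing pointer instead of throwing. It is an illustration, not the harness's resolver: resolvePointer and PointerResult are hypothetical names, and treating a pointer without a leading slash as missing is an assumption of this sketch.

```typescript
type PointerResult =
  | { found: true; value: unknown }
  | { found: false; reason: 'missing_pointer' }

const missing: PointerResult = { found: false, reason: 'missing_pointer' }

// Hypothetical helper: walks a JSON value along an RFC 6901 pointer.
function resolvePointer(doc: unknown, pointer: string): PointerResult {
  if (pointer === '') return { found: true, value: doc } // empty pointer = whole document
  if (!pointer.startsWith('/')) return missing // assumption: malformed is reported as missing

  let current: unknown = doc
  for (const raw of pointer.slice(1).split('/')) {
    // RFC 6901 escaping: "~1" decodes to "/", then "~0" decodes to "~".
    const token = raw.replace(/~1/g, '/').replace(/~0/g, '~')
    if (Array.isArray(current)) {
      // Array indices are digits only, with no leading zeros.
      if (!/^(0|[1-9]\d*)$/.test(token)) return missing
      const index = Number(token)
      if (index >= current.length) return missing
      current = current[index]
    } else if (typeof current === 'object' && current !== null) {
      if (!Object.prototype.hasOwnProperty.call(current, token)) return missing
      current = (current as Record<string, unknown>)[token]
    } else {
      return missing // cannot descend into a string, number, boolean, or null
    }
  }
  return { found: true, value: current }
}

const output = { answer: 'ok', items: [{ title: 'first' }], 'a/b': 1 }
console.log(resolvePointer(output, '/items/0/title')) // { found: true, value: 'first' }
console.log(resolvePointer(output, '/a~1b'))          // { found: true, value: 1 }
console.log(resolvePointer(output, '/nope'))          // { found: false, reason: 'missing_pointer' }
```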

Custom scorers

Bring your own scoring logic

The scorer callback is fully application-defined. Compose multiple deterministic checks, call a secondary model for LLM-as-judge, or run a business rule. The harness doesn't care — it just aggregates the scores you return.

Compose multiple checks

Composite scorer (TypeScript)

scorer: async (target) => {
  // Check 1: answer must mention "policy"
  const hasAnswer = evaluateDeterministicScorer(
    { type: 'contains', path: '/answer', value: 'policy' },
    target
  )

  // Check 2: citations must be an array
  const hasCitations = evaluateDeterministicScorer(
    {
      type: 'json-schema',
      schema: {
        type: 'object',
        required: ['citations'],
        properties: {
          citations: { type: 'array' },
        },
      },
    },
    target
  )

  // Combine: both must pass
  const passed = hasAnswer.passed && hasCitations.passed
  return {
    score: passed ? 1 : 0,
    passed,
    evidence: { hasAnswer: hasAnswer.passed, hasCitations: hasCitations.passed },
  }
}

LLM-as-judge scorer

LLM-as-judge scorer (TypeScript)

scorer: async (target, signal) => {
  signal.throwIfAborted()

  // Call a judge agent for semantic quality
  const judgment = await session.agents.judge.prompt({
    response: target.output,
    question: target.input,
  }, { signal })

  return {
    score: judgment.qualityScore,
    passed: judgment.qualityScore >= 0.7,
    evidence: { reasoning: judgment.reasoning },
  }
}

Cost note. LLM-as-judge scorers call the model once per candidate × item. For large eval sets, prefer deterministic scorers in CI and reserve model-based scoring for final validation or ambiguous cases.
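One way to keep judge cost down is to run a cheap deterministic check first and call the judge only when it passes. The sketch below shows that pattern under assumptions: gateJudge, SyncCheck, and AsyncScorer are hypothetical names, and ScoreResult simply mirrors the score / passed / evidence shape returned by the custom scorers above.

```typescript
interface ScoreResult {
  score: number
  passed: boolean
  evidence?: Record<string, unknown>
}

type SyncCheck<T> = (target: T) => ScoreResult
type AsyncScorer<T> = (target: T, signal: AbortSignal) => Promise<ScoreResult>

// Hypothetical wrapper: only pays for the judge when the cheap check passes.
function gateJudge<T>(cheapCheck: SyncCheck<T>, judge: AsyncScorer<T>): AsyncScorer<T> {
  return async (target, signal) => {
    const gate = cheapCheck(target)
    if (!gate.passed) return gate // fail fast, no model call
    return judge(target, signal)
  }
}
```

With this shape, outputs that fail a structural check never reach the model, so the number of judge calls is at most candidates × items and usually lower.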

Cancellation & privacy

Cancellable. Nothing persisted.

AbortSignal cancellation

evaluatePromptCandidates requires an AbortSignal. It checks the signal before scheduling each candidate/item pair and passes the same signal to both callbacks. Your callbacks should forward it into model or tool calls.

Cancellation (TypeScript)

const abort = new AbortController()

// cancel after 30s in CI
setTimeout(() => abort.abort(), 30_000)

const scores = await evaluatePromptCandidates({
  signal: abort.signal,
  runCandidate: async (candidate, item, signal) => {
    signal.throwIfAborted()   // check early
    return session.agents.answerer.prompt(
      item.input,
      { signal }              // forward to model call
    )
  },
  // ...
})
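A bare setTimeout stays pending after a fast eval run finishes, which can delay process exit in CI. The following is a minimal sketch of a deadline wrapper that clears the timer on completion; withDeadline is a hypothetical helper, not part of the harness. In runtimes that support it, the standard AbortSignal.timeout(ms) is an alternative that needs no manual cleanup.

```typescript
// Hypothetical helper: runs `run` with a signal that aborts after `ms`,
// and always clears the timer so nothing lingers after the work is done.
async function withDeadline<T>(
  ms: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), ms)
  try {
    return await run(controller.signal)
  } finally {
    clearTimeout(timer)
  }
}
```

Usage would look like `await withDeadline(30_000, (signal) => evaluatePromptCandidates({ signal, /* ... */ }))`, with the rest of the options unchanged.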

Privacy — nothing stored

The eval helpers return aggregate scores only. They do not emit prompt text, model output, expected values, context, or per-item score records to telemetry in any form.

Not stored: prompt candidate text

Not stored: item inputs and expected values

Not stored: model output content

Not stored: per-item score evidence

Returned only: candidateId, meanScore, passRate, itemCount, scorerCount

Persisting results. If you need per-item scores, experiment history, or annotation records — store the return value in your application layer. The harness helpers are stateless by design.
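Because only aggregates come back, per-item records have to be captured inside your own callbacks. Below is a minimal sketch of a scorer wrapper that collects each result into an array you own; recordingScorer and ItemRecord are hypothetical names, and the generic target type stands in for whatever the harness passes to your scorer.

```typescript
interface ScoreResult {
  score: number
  passed: boolean
  evidence?: Record<string, unknown>
}

type Scorer<T> = (target: T, signal: AbortSignal) => Promise<ScoreResult>

interface ItemRecord<T> {
  target: T
  result: ScoreResult
}

// Hypothetical wrapper: forwards to the real scorer and keeps a copy of each result.
function recordingScorer<T>(scorer: Scorer<T>, sink: ItemRecord<T>[]): Scorer<T> {
  return async (target, signal) => {
    const result = await scorer(target, signal)
    sink.push({ target, result })
    return result
  }
}
```

The collected records contain inputs and outputs, so where and whether to store them is your application's decision, not the harness's.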

CI integration

Gate your build on prompt quality

Use a fake model provider in CI to keep eval tests fast and free. Switch to a live provider for nightly or pre-release runs where you want real model output.

CI eval test (TypeScript)

import { describe, it, expect } from 'vitest'
import {
  evaluatePromptCandidates,
  evaluateDeterministicScorer,
} from '@purista/harness'

// Deterministic fake — no real model calls, no cost
const fakeProvider = {
  id: 'fake',
  genAiSystem: 'fake',
  async object(req) {
    const q = req.messages.at(-1)?.content ?? ''
    return {
      object: {
        answer: q.includes('freeze') ? 'No deployments during change freeze.' : 'Deployment allowed.',
        citations: [],
      },
      usage: { inputTokens: 1, outputTokens: 1, totalTokens: 2 },
      finishReason: 'stop',
    }
  },
}

describe('prompt eval — change policy', () => {
  // The fake ignores candidate.prompt, so both candidates produce the same
  // output and tie. This test checks the wiring and pass rate, not the ranking.
  it('both candidates pass the policy check against the fake provider', async () => {
    // createAppHarness is your application's own harness factory (not shown here).
    const harness = createAppHarness(fakeProvider)
    const session = await harness.getSession('eval-test')

    const scores = await evaluatePromptCandidates({
      candidates: [
        { id: 'brief',    prompt: 'Answer briefly.' },
        { id: 'detailed', prompt: 'Answer with policy context.' },
      ],
      items: [
        { id: 'item-1', input: { question: 'freeze date?' }, expected: 'freeze' },
      ],
      signal: new AbortController().signal,
      runCandidate: async (candidate, item) =>
        session.agents.answerer.prompt(item.input),
      scorer: async (target) =>
        evaluateDeterministicScorer({
          type: 'contains', path: '/answer',
          value: String(target.expected), caseInsensitive: true,
        }, target),
    })

    expect(scores).toHaveLength(2)
    expect(scores.every((s) => s.passRate === 1)).toBe(true)
  })
})
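To gate a build on thresholds rather than on which candidate ranks first, a small pure helper can turn the returned scores into a list of failures. This is a sketch under assumptions: checkQualityGate and GateThresholds are hypothetical names, and the threshold values are whatever your team chooses.

```typescript
// Only the fields the gate needs; the harness result carries more.
interface CandidateScore {
  candidateId: string
  meanScore: number
  passRate: number
}

interface GateThresholds {
  minMeanScore: number
  minPassRate: number
}

// Hypothetical helper: returns human-readable failures, empty when the gate passes.
function checkQualityGate(
  scores: CandidateScore[],
  candidateId: string,
  thresholds: GateThresholds,
): string[] {
  const entry = scores.find((s) => s.candidateId === candidateId)
  if (!entry) return [`candidate "${candidateId}" is missing from the results`]

  const failures: string[] = []
  if (entry.meanScore < thresholds.minMeanScore) {
    failures.push(`meanScore ${entry.meanScore} is below ${thresholds.minMeanScore}`)
  }
  if (entry.passRate < thresholds.minPassRate) {
    failures.push(`passRate ${entry.passRate} is below ${thresholds.minPassRate}`)
  }
  return failures
}
```

In a test, `expect(checkQualityGate(scores, 'detailed', { minMeanScore: 0.8, minPassRate: 0.9 })).toEqual([])` fails the build with a readable message when the production prompt regresses.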

Test the whole system, not just prompts.

The testing guide covers fake providers, contract tests, streaming assertions, MCP fakes, and review gate tests.
