Know your prompts work before production.

The harness ships built-in eval helpers for comparing prompt candidates against a fixed input set. They are deterministic, local, cancellable, and CI-friendly, and they need no hosted eval platform, dataset service, or annotation queue.

Scope

A local comparator, not a full eval platform

The eval helpers answer one question: given these prompt variants and these example inputs, which candidate performs better? That's it.

Use it for
  • Comparing two or more prompt strings against a fixed example set
  • Running deterministic checks in unit or CI tests
  • Validating scorer logic before wiring a product eval layer
  • Smoke-testing output structure after a model upgrade
Not meant for
  • Dataset storage and management
  • Prompt version history and experiment runs
  • Human annotation queues
  • Optimization loops and regression dashboards
  • LLM-as-judge scoring (bring your own scorer callback)

For a full eval platform: use CloudGrid AI Evaluation to manage datasets, experiment results, comparisons, optimization runs, and per-item score records. The harness helpers are the inner loop — call them from your CloudGrid or product eval workflow and persist whatever they return.

CloudGrid integration

When the eval loop becomes a product workflow.

CloudGrid is the companion observability and AI evaluation platform for teams that want the harness inner loop connected to production evidence. It keeps datasets, expected outputs, observed results, trace links, aggregate metrics, comparisons, and optimization candidates in one reviewable project.

1
Execution model

Candidates × items → scores

The helper evaluates candidates in order. For each candidate it runs every item through your runCandidate callback, then scores the output with your scorer callback. Results are aggregated and returned sorted by mean score.

evaluatePromptCandidates typescript
import {
  evaluatePromptCandidates,
  evaluateDeterministicScorer,
} from '@purista/harness'

const abort = new AbortController()

const scores = await evaluatePromptCandidates({
  candidates: [
    { id: 'brief',    prompt: 'Answer in one short paragraph.' },
    { id: 'detailed', prompt: 'Answer with details and citations.' },
  ],

  items: [
    {
      id: 'item-1',
      input: { question: 'Can I deploy on Friday?' },
      expected: 'change freeze',
    },
    {
      id: 'item-2',
      input: { question: 'What is the rollback procedure?' },
      expected: 'revert',
    },
  ],

  signal: abort.signal,

  runCandidate: async (candidate, item, signal) => {
    signal.throwIfAborted()
    return session.agents.answerer.prompt({
      question: item.input.question,
    }, { signal })
  },

  scorer: async (target) =>
    evaluateDeterministicScorer({
      type: 'contains',
      path: '/answer',
      value: String(target.expected),
      caseInsensitive: true,
    }, target),
})

Return shape

snippet.ts typescript
// scores: CandidateScore[]
// sorted: mean score desc, pass rate desc, id asc

[
  {
    candidateId: 'detailed',
    meanScore: 1.0,
    passRate: 1.0,
    itemCount: 2,
    scorerCount: 2,
  },
  {
    candidateId: 'brief',
    meanScore: 0.5,
    passRate: 0.5,
    itemCount: 2,
    scorerCount: 2,
  },
]

Sorting is deterministic: mean score descending, then pass rate descending, then candidateId in ascending lexicographic order as the final tie-breaker. Use stable ids so CI output is reproducible across runs.
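
The aggregation and ordering described here can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the ItemResult and CandidateSummary shapes and the summarize function are hypothetical names, and the sketch assumes each candidate × item pair yields one score and one passed flag.

```typescript
// Hypothetical per-item result; the harness's internal types are not shown in this guide.
interface ItemResult {
  candidateId: string
  score: number
  passed: boolean
}

interface CandidateSummary {
  candidateId: string
  meanScore: number
  passRate: number
  itemCount: number
}

// Group per-item results by candidate, compute the aggregates, then rank.
function summarize(results: ItemResult[]): CandidateSummary[] {
  const byCandidate = new Map<string, ItemResult[]>()
  for (const r of results) {
    const bucket = byCandidate.get(r.candidateId) ?? []
    bucket.push(r)
    byCandidate.set(r.candidateId, bucket)
  }

  const summaries: CandidateSummary[] = []
  byCandidate.forEach((rows, candidateId) => {
    const total = rows.reduce((sum, r) => sum + r.score, 0)
    const passes = rows.filter((r) => r.passed).length
    summaries.push({
      candidateId,
      meanScore: total / rows.length,
      passRate: passes / rows.length,
      itemCount: rows.length,
    })
  })

  // Mean score desc, then pass rate desc, then id asc (plain code-unit comparison).
  return summaries.sort(
    (a, b) =>
      b.meanScore - a.meanScore ||
      b.passRate - a.passRate ||
      (a.candidateId < b.candidateId ? -1 : a.candidateId > b.candidateId ? 1 : 0),
  )
}
```

Because the last comparison key is the id, two candidates with identical aggregates always come back in the same order, which is what keeps CI snapshots stable.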

2
Built-in scorers

Four deterministic scorer types

evaluateDeterministicScorer is a pure function. No async, no model calls, no side effects. Pass it a scorer definition and a target, get a score back. Use JSON Pointer paths to select into nested output.

contains

JSON Pointer selected value contains a substring.

snippet.ts typescript
evaluateDeterministicScorer({
  type: 'contains',
  path: '/answer',
  value: 'change freeze',
  caseInsensitive: true,  // optional
}, target)

regex

JSON Pointer selected value matches a regular expression.

snippet.ts typescript
evaluateDeterministicScorer({
  type: 'regex',
  path: '/status',
  pattern: '^(approved|rejected)$',
  flags: 'i',  // optional: 'i' | 'm' | 'im'
}, target)

attribute-equality

Two JSON Pointer selected values in the output are deeply equal.

snippet.ts typescript
evaluateDeterministicScorer({
  type: 'attribute-equality',
  leftPath: '/category',
  rightPath: '/expectedCategory',
}, target)

json-schema

Output conforms to a JSON Schema subset. Validates structure when you care about shape, not content.

snippet.ts typescript
evaluateDeterministicScorer({
  type: 'json-schema',
  schema: {
    type: 'object',
    required: ['answer', 'citations'],
    properties: {
      answer: { type: 'string' },
      citations: { type: 'array' },
    },
    additionalProperties: false,
  },
}, target)

JSON Pointer paths. Use RFC 6901 JSON Pointer syntax: /answer selects output.answer, /items/0/title selects the first item's title. When the pointer is missing, the scorer returns passed: false with evidence.reason: "missing_pointer" — it does not throw.
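
To make the pointer semantics concrete, here is a small RFC 6901 resolver. It is a sketch for illustration only: resolvePointer and PointerResult are hypothetical names, and the harness's own resolution code may differ in edge cases.

```typescript
type PointerResult = { found: true; value: unknown } | { found: false }

// Resolve an RFC 6901 JSON Pointer against an already-parsed JSON value.
function resolvePointer(doc: unknown, pointer: string): PointerResult {
  if (pointer === '') return { found: true, value: doc } // empty pointer selects the whole document
  if (!pointer.startsWith('/')) return { found: false }

  let current: unknown = doc
  for (const raw of pointer.slice(1).split('/')) {
    // Unescape per RFC 6901: "~1" becomes "/" first, then "~0" becomes "~".
    const token = raw.replace(/~1/g, '/').replace(/~0/g, '~')

    if (Array.isArray(current)) {
      // Array indices are "0" or digits without a leading zero.
      if (!/^(0|[1-9]\d*)$/.test(token)) return { found: false }
      const index = Number(token)
      if (index >= current.length) return { found: false }
      current = current[index]
    } else if (typeof current === 'object' && current !== null) {
      if (!Object.prototype.hasOwnProperty.call(current, token)) return { found: false }
      current = (current as Record<string, unknown>)[token]
    } else {
      return { found: false } // cannot descend into a primitive
    }
  }
  return { found: true, value: current }
}
```

A scorer built on a resolver like this can map found: false to a failed result with a missing_pointer reason instead of throwing, which matches the behaviour described above.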

3
Custom scorers

Bring your own scoring logic

The scorer callback is fully application-defined. Compose multiple deterministic checks, call a secondary model for LLM-as-judge, or run a business rule. The harness doesn't care — it just aggregates the scores you return.

Compose multiple checks

Composite scorer typescript
scorer: async (target) => {
  // Check 1: answer must mention "policy"
  const hasAnswer = evaluateDeterministicScorer(
    { type: 'contains', path: '/answer', value: 'policy' },
    target
  )

  // Check 2: citations must be an array
  const hasCitations = evaluateDeterministicScorer(
    {
      type: 'json-schema',
      schema: {
        type: 'object',
        required: ['citations'],
        properties: {
          citations: { type: 'array' },
        },
      },
    },
    target
  )

  // Combine: both must pass
  const passed = hasAnswer.passed && hasCitations.passed
  return {
    score: passed ? 1 : 0,
    passed,
    evidence: { hasAnswer: hasAnswer.passed, hasCitations: hasCitations.passed },
  }
}
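
If all-or-nothing scoring is too coarse, the same checks can be combined into partial credit. The helper below is a hypothetical sketch (combineChecks is not part of the harness) and assumes each check result exposes a passed flag, as the deterministic scorer results above do.

```typescript
interface CheckResult {
  score: number
  passed: boolean
}

// Combine named checks: score is the fraction that passed, passed requires all of them.
function combineChecks(checks: Record<string, CheckResult>) {
  const entries = Object.entries(checks)
  const passedCount = entries.filter(([, result]) => result.passed).length

  const evidence: Record<string, boolean> = {}
  for (const [name, result] of entries) {
    evidence[name] = result.passed
  }

  return {
    score: entries.length === 0 ? 0 : passedCount / entries.length,
    passed: entries.length > 0 && passedCount === entries.length,
    evidence,
  }
}
```

Inside a scorer callback this would be called as combineChecks({ hasAnswer, hasCitations }), so a candidate that satisfies one of two checks scores 0.5 rather than 0.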

LLM-as-judge scorer

LLM-as-judge scorer typescript
scorer: async (target, signal) => {
  signal.throwIfAborted()

  // Call a judge agent for semantic quality
  const judgment = await session.agents.judge.prompt({
    response: target.output,
    question: target.input,
  }, { signal })

  return {
    score: judgment.qualityScore,
    passed: judgment.qualityScore >= 0.7,
    evidence: { reasoning: judgment.reasoning },
  }
}

Cost note. LLM-as-judge scorers call the model once per candidate × item. For large eval sets, prefer deterministic scorers in CI and reserve model-based scoring for final validation or ambiguous cases.
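
A quick back-of-the-envelope estimate shows how fast this grows. The function below is a hypothetical helper, and it assumes exactly one model call per runCandidate invocation and one per judge invocation; retries or tool calls would add to the total.

```typescript
// Estimate model calls for one eval run (assumes one call per callback invocation).
function estimateModelCalls(candidates: number, items: number, useJudge: boolean) {
  const pairs = candidates * items
  const judgeCalls = useJudge ? pairs : 0
  return {
    candidateCalls: pairs, // one runCandidate call per candidate × item
    judgeCalls, // one judge call per candidate × item when enabled
    total: pairs + judgeCalls,
  }
}

// 3 candidates × 200 items with a judge: 600 candidate calls + 600 judge calls.
const estimate = estimateModelCalls(3, 200, true)
console.log(estimate.total) // 1200
```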

4
Cancellation & privacy

Cancellable. Nothing persisted.

AbortSignal cancellation

evaluatePromptCandidates requires an AbortSignal. It checks the signal before scheduling each candidate/item pair and passes the same signal to both callbacks. Your callbacks should forward it into model or tool calls.

Cancellation typescript
const abort = new AbortController()

// cancel after 30s in CI
setTimeout(() => abort.abort(), 30_000)

const scores = await evaluatePromptCandidates({
  signal: abort.signal,
  runCandidate: async (candidate, item, signal) => {
    signal.throwIfAborted()   // check early
    return session.agents.answerer.prompt(
      item.input,
      { signal }              // forward to model call
    )
  },
  // ...
})
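
The deadline pattern can be tried without the harness. The sketch below uses only the standard AbortController API; fakeModelCall and runWithDeadline are hypothetical stand-ins for a model call and an eval run. It also clears the deadline timer once the run ends, so a finished run does not leave a timer pending.

```typescript
// Stand-in for a model call that honours an AbortSignal (hypothetical helper).
function fakeModelCall(ms: number, signal: AbortSignal): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason)
      return
    }
    const onAbort = () => {
      clearTimeout(timer)
      reject(signal.reason)
    }
    const timer = setTimeout(() => {
      signal.removeEventListener('abort', onAbort)
      resolve('done')
    }, ms)
    signal.addEventListener('abort', onAbort, { once: true })
  })
}

// Run `pairs` sequential calls under a deadline and report how far the run got.
async function runWithDeadline(
  pairs: number,
  deadlineMs: number,
): Promise<{ completed: number; aborted: boolean }> {
  const abort = new AbortController()
  const deadline = setTimeout(() => abort.abort(), deadlineMs)
  let completed = 0
  try {
    for (let i = 0; i < pairs; i++) {
      abort.signal.throwIfAborted() // check before starting each pair
      await fakeModelCall(20, abort.signal) // forward the signal into the call
      completed++
    }
    return { completed, aborted: false }
  } catch (err) {
    if (abort.signal.aborted) return { completed, aborted: true }
    throw err // unrelated failures still surface
  } finally {
    clearTimeout(deadline) // a finished run should not leave the timer pending
  }
}
```

On runtimes that support it, AbortSignal.timeout(ms) is a shorter way to get a signal that aborts after a delay, at the cost of not being able to cancel manually.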

Privacy — nothing stored

The eval helpers return aggregate scores only. They do not emit prompt text, model output, expected values, context, or per-item score records to telemetry in any form.

Not stored
  • Prompt candidate text
  • Item inputs and expected values
  • Model output content
  • Per-item score evidence
Returned only
  • candidateId, meanScore, passRate, itemCount, scorerCount

Persisting results. If you need per-item scores, experiment history, or annotation records — store the return value in your application layer. The harness helpers are stateless by design.
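
One way to capture per-item records in the application layer is to wrap the scorer callback before handing it to the helper. This is a sketch under assumptions: the ScoreTarget and ScoreResult shapes are inferred from the examples in this guide, and withRecording and ScoreRecord are hypothetical names, not harness exports.

```typescript
// Shapes inferred from the examples in this guide; treat them as assumptions.
interface ScoreTarget {
  input?: unknown
  output: unknown
  expected?: unknown
}

interface ScoreResult {
  score: number
  passed: boolean
  evidence?: Record<string, unknown>
}

type Scorer = (target: ScoreTarget, signal: AbortSignal) => Promise<ScoreResult>

interface ScoreRecord {
  scoredAt: string
  target: ScoreTarget
  result: ScoreResult
}

// Wrap a scorer so every result is also appended to an application-owned sink.
function withRecording(scorer: Scorer, sink: ScoreRecord[]): Scorer {
  return async (target, signal) => {
    const result = await scorer(target, signal)
    sink.push({ scoredAt: new Date().toISOString(), target, result })
    return result
  }
}
```

The sink here is an in-memory array; in practice it would be whatever store your eval workflow already uses. Whatever you record this way is your application's data, so the privacy guarantees above apply only to what the helpers themselves emit.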

5
CI integration

Gate your build on prompt quality

Use a fake model provider in CI to keep eval tests fast and free. Switch to a live provider for nightly or pre-release runs where you want real model output.

CI eval test typescript
import { describe, it, expect } from 'vitest'
import {
  evaluatePromptCandidates,
  evaluateDeterministicScorer,
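} from '@purista/harness'
// createAppHarness is your application's own harness factory; import it from wherever your app defines it.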

// Deterministic fake — no real model calls, no cost
const fakeProvider = {
  id: 'fake',
  genAiSystem: 'fake',
  async object(req) {
    const q = req.messages.at(-1)?.content ?? ''
    return {
      object: {
        answer: q.includes('freeze') ? 'No deployments during change freeze.' : 'Deployment allowed.',
        citations: [],
      },
      usage: { inputTokens: 1, outputTokens: 1, totalTokens: 2 },
      finishReason: 'stop',
    }
  },
}

describe('prompt eval — change policy', () => {
  it('detailed prompt passes the policy check', async () => {
    const harness = createAppHarness(fakeProvider)
    const session = await harness.getSession('eval-test')

    const scores = await evaluatePromptCandidates({
      candidates: [
        { id: 'brief',    prompt: 'Answer briefly.' },
        { id: 'detailed', prompt: 'Answer with policy context.' },
      ],
      items: [
        { id: 'item-1', input: { question: 'freeze date?' }, expected: 'freeze' },
      ],
      signal: new AbortController().signal,
      runCandidate: async (candidate, item) =>
        session.agents.answerer.prompt(item.input),
      scorer: async (target) =>
        evaluateDeterministicScorer({
          type: 'contains', path: '/answer',
          value: String(target.expected), caseInsensitive: true,
        }, target),
    })

    // The fake answers from the question alone, so both candidates tie and
    // the id tie-break puts 'brief' first. Look the candidate up by id
    // instead of relying on rank.
    const detailed = scores.find((s) => s.candidateId === 'detailed')
    expect(detailed?.passRate).toBe(1)
  })
})

Test the whole system, not just prompts.

The testing guide covers fake providers, contract tests, streaming assertions, MCP fakes, and review gate tests.