Test agents without calling external APIs.

Every layer of a harness application is testable in isolation. Inject a fake model provider. Contract-test your state store. Assert streaming event sequences. Run fake MCP servers in CI. No real API keys needed.

Test pyramid

Four levels, clear boundaries

Start at the bottom. Each level is independently testable. Climb the pyramid only when the previous level passes.

Unit tests

Zod schemas, TypeScript tool handlers, skill frontmatter, eval scorer definitions, custom adapter logic

Pure functions. No harness, no session, no model calls.

Contract tests

StateStore adapter, MemoryAdapter, SandboxSession, model provider adapter

Use the shared contract test helpers from @purista/harness/testing.

Integration tests

Full session agent runs, workflow sequences, streaming event assertions, MCP runner, review gate flows

Use a fake model provider. Real sessions, real sandboxes, fake model responses.

Live smoke tests

Real provider end-to-end, production config validation

Gate-kept by API key presence. Run nightly or pre-release only.
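
One way to gate the live level on key presence is a small predicate that the test file consults before registering the suite. The sketch below is an assumption, not part of the harness: the variable name `LIVE_PROVIDER_API_KEY` and the helper `liveTestsEnabled` are hypothetical.

```typescript
// Hypothetical name of the environment variable holding the real provider key.
const LIVE_KEY_VAR = 'LIVE_PROVIDER_API_KEY'

type Env = Record<string, string | undefined>

// Returns true only when a non-blank key is present, so an empty
// CI secret does not accidentally enable the live suite.
function liveTestsEnabled(env: Env): boolean {
  const key = env[LIVE_KEY_VAR]
  return typeof key === 'string' && key.trim().length > 0
}
```

With Vitest, the predicate can feed `describe.skipIf(!liveTestsEnabled(process.env))`, so the suite is reported as skipped rather than failed when no key is configured.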

Level 3 — integration tests

Replace the model with a fake provider

A fake model provider is a plain object that implements the ModelProvider interface. It returns deterministic responses. Real sessions, real sandboxes, real tool calls — only the model output is controlled.

Minimal fake provider

Fake provider test (TypeScript)

const fakeProvider = {
  id: 'fake',
  genAiSystem: 'fake',
  async object(req) {
    return {
      object: {
        answer: 'Fake answer from test fixture.',
        citations: [],
      },
      usage: { inputTokens: 1, outputTokens: 1, totalTokens: 2 },
      finishReason: 'stop',
    }
  },
}

const harness = defineHarness({ name: 'test' })
  .models({
    fast: {
      provider: fakeProvider,
      model: 'fake-model',
      capabilities: ['object'],
    },
  })
  .agents(({ agent }) => ({
    answerer: agent({
      model: 'fast',
      input: z.object({ question: z.string() }),
      output: z.object({ answer: z.string(), citations: z.array(z.string()) }),
      instructions: 'Answer concisely.',
    }),
  }))
  .build()

const session = await harness.getSession('test-user')

await expect(
  session.agents.answerer.prompt({ question: 'hello?' })
).resolves.toMatchObject({ answer: 'Fake answer from test fixture.' })

Input-aware fake

Inspect the request messages to return different responses for different inputs — useful for testing branching workflows.

Input-aware fake (TypeScript)

const inputAwareFake = {
  id: 'fake',
  genAiSystem: 'fake',
  async object(req) {
    const last = req.messages.at(-1)?.content ?? ''
    const isCritical = last.toString().toLowerCase()
      .includes('critical')

    return {
      object: {
        priority: isCritical ? 'P1' : 'P3',
        owner: isCritical ? 'on-call' : 'team-backlog',
        nextAction: isCritical
          ? 'page on-call immediately'
          : 'add to backlog',
      },
      usage: { inputTokens: 1, outputTokens: 1, totalTokens: 2 },
      finishReason: 'stop',
    }
  },
}

The harness also ships fakeModelProvider() and fakeMemoryAdapter() helpers from @purista/harness/testing for common fixture patterns.
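
For runs that call the model more than once, a scripted fake that replays a fixed list of responses and records each request can be easier to assert against than an input-aware one. The following is a minimal sketch under assumptions: `scriptedFake`, `FakeRequest`, and `FakeResult` are hypothetical local names, and the real ModelProvider interface may require more fields than shown in the examples above.

```typescript
// Local stand-ins for the harness types; the real ModelProvider
// interface may carry more fields than this sketch assumes.
interface FakeRequest {
  messages: { role: string; content: unknown }[]
}

interface FakeResult {
  object: unknown
  usage: { inputTokens: number; outputTokens: number; totalTokens: number }
  finishReason: string
}

// Builds a provider that replays `script` in order and records every request.
function scriptedFake(script: unknown[]) {
  const calls: FakeRequest[] = []
  let next = 0

  return {
    id: 'fake',
    genAiSystem: 'fake',
    calls,
    async object(req: FakeRequest): Promise<FakeResult> {
      calls.push(req)
      if (next >= script.length) {
        // Fail loudly instead of silently repeating the last response.
        throw new Error(`scriptedFake: no response left for call ${next + 1}`)
      }
      return {
        object: script[next++],
        usage: { inputTokens: 1, outputTokens: 1, totalTokens: 2 },
        finishReason: 'stop',
      }
    },
  }
}
```

After the run, `fake.calls` holds the requests in order, so a test can assert both how many model calls were made and what each one contained.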

Streaming assertions

Assert event sequences from .stream()

The streaming API emits typed RunEvent values. Collect them in an array and assert the sequence, tool calls, and final output.

Collect and assert events

Streaming event assertions (TypeScript)

const events: RunEvent[] = []

for await (const event of session.agents.answerer.stream({
  question: 'What is the policy?',
})) {
  events.push(event)
}

// lifecycle
expect(events.at(0)?.type).toBe('run.started')
expect(events.at(-1)?.type).toBe('run.finished')

// tool calls
const toolEvents = events.filter(e => e.type === 'tool.started')
expect(toolEvents).toHaveLength(1)
expect(toolEvents[0].toolId).toBe('search_docs')

// final output
const finished = events.find(e => e.type === 'run.finished')
expect(finished?.output).toMatchObject({
  answer: expect.any(String),
  citations: expect.any(Array),
})

Event sequence for tool use

run.started: First event. Always emitted.

agent.started: Agent loop begins.

tool.started: One per tool call. Has toolId, toolName, input.

tool.finished: Tool output returned. Has output and durationMs.

agent.finished: Model returned final validated output.

run.finished: Last event. Has output, usage, durationMs.

Workflows emit the same outer run.started / run.finished envelope plus agent.started / agent.finished for each agent invocation inside.
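
Asserting exact array positions becomes brittle once a run emits extra events between the ones a test cares about. A possible alternative is to check that the expected types appear in order as a subsequence. The helper below is a sketch, not a harness export: `firstMissingInOrder` and `RunEventLike` are hypothetical names.

```typescript
// Minimal stand-in for the harness RunEvent type; only `type` is needed here.
interface RunEventLike {
  type: string
}

// Returns the first expected event type that does not appear in order,
// or undefined when `expected` is an ordered subsequence of the stream.
function firstMissingInOrder(
  events: RunEventLike[],
  expected: string[],
): string | undefined {
  let cursor = 0
  for (const wanted of expected) {
    const found = events.findIndex((e, i) => i >= cursor && e.type === wanted)
    if (found === -1) return wanted
    cursor = found + 1
  }
  return undefined
}
```

In a test this reads as `expect(firstMissingInOrder(events, ['run.started', 'tool.started', 'tool.finished', 'run.finished'])).toBeUndefined()`, and a failure names the event that was missing or out of order.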

Unit tests — tools

Test tool handlers in isolation

TypeScript tool handlers are plain async functions. Pass a minimal context object and assert both successful output and validation failure behaviour — no harness or session needed.

Happy path

Tool happy path (TypeScript)

import { policyLookupHandler } from './tools/policyLookup'

it('returns policy text for known topic', async () => {
  const ctx = {
    logger: console,
    signal: new AbortController().signal,
    toolId: 'policy_lookup',
    sandbox: undefined,
  }

  const result = await policyLookupHandler(
    ctx,
    { topic: 'deployment' }
  )

  expect(result).toMatchObject({
    text: expect.stringContaining('change freeze'),
  })
})

Validation and error paths

Tool error paths (TypeScript)

it('rejects unknown topic', async () => {
  const ctx = {
    logger: console,
    signal: new AbortController().signal,
    toolId: 'policy_lookup',
    sandbox: undefined,
  }

  await expect(
    policyLookupHandler(ctx, { topic: 'does-not-exist' })
  ).rejects.toThrow('Policy not found')
})

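it('respects cancellation', async () => {
  const abort = new AbortController()
  abort.abort()

  // build a context whose signal is already aborted
  const ctx = {
    logger: console,
    signal: abort.signal,
    toolId: 'policy_lookup',
    sandbox: undefined,
  }

  await expect(
    policyLookupHandler(ctx, { topic: 'deployment' })
  ).rejects.toThrow()
})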
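
For orientation, a handler that would satisfy the three tests above might look like the sketch below. It is a hypothetical implementation: the source does not show `policyLookupHandler`, so the context shape, the in-memory `POLICIES` table, and the error messages are assumptions inferred from the assertions.

```typescript
// Hypothetical context shape, reduced to what this handler reads.
// A real AbortSignal satisfies `{ aborted: boolean }` structurally.
interface ToolContext {
  signal: { aborted: boolean }
  toolId: string
}

// Hypothetical fixture data standing in for a real policy source.
const POLICIES: Record<string, string> = {
  deployment: 'Deployments pause during the change freeze.',
}

async function policyLookupHandler(
  ctx: ToolContext,
  input: { topic: string },
): Promise<{ text: string }> {
  // Check cancellation first so an aborted run does no work.
  if (ctx.signal.aborted) throw new Error('Aborted')

  const text = POLICIES[input.topic]
  if (text === undefined) throw new Error(`Policy not found: ${input.topic}`)

  return { text }
}
```

Because the handler only depends on its arguments, the test context can stay as small as the fields the handler actually touches.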

Unit tests — eval scorers

Validate scorer definitions before running evals

Import evaluateDeterministicScorer directly to unit-test scorer definitions without a harness, session, or candidate loop.

Scorer unit tests (TypeScript)

import { evaluateDeterministicScorer } from '@purista/harness'

it('contains scorer passes when value is present', async () => {
  const result = evaluateDeterministicScorer(
    {
      type: 'contains',
      path: '/answer',
      value: 'change freeze',
      caseInsensitive: true,
    },
    {
      input: { question: 'Can I deploy on Friday?' },
      output: { answer: 'No — there is a change freeze.' },
    }
  )

  expect(result).toMatchObject({ score: 1, passed: true })
})

it('contains scorer fails gracefully on missing pointer', async () => {
  const result = evaluateDeterministicScorer(
    { type: 'contains', path: '/missing', value: 'foo' },
    { input: {}, output: { answer: 'something' } }
  )

  expect(result.passed).toBe(false)
  expect(result.evidence?.reason).toBe('missing_pointer')
})

Test all four types

contains: value present / absent / case handling

regex: pattern match / no match / flags

attribute-equality: equal values / unequal / missing pointers

json-schema: valid shape / missing required / additional properties

Test scorers before candidate runs. A wrong pointer path or invalid schema returns passed: false silently — catching it in a unit test is faster than debugging it mid-eval.
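
To make the missing-pointer failure mode concrete, here is a simplified local model of a contains scorer. It is not the library's implementation of `evaluateDeterministicScorer`: `scoreContains` and `resolvePointer` are hypothetical names, and the result shape only mirrors the fields used in the tests above.

```typescript
interface ContainsScorer {
  type: 'contains'
  path: string // JSON Pointer into the output, e.g. '/answer'
  value: string
  caseInsensitive?: boolean
}

interface ScoreResult {
  score: number
  passed: boolean
  evidence?: { reason: string }
}

// Resolves a JSON Pointer (RFC 6901) against a value; undefined when missing.
function resolvePointer(root: unknown, pointer: string): unknown {
  if (pointer === '') return root
  let current: unknown = root
  for (const raw of pointer.split('/').slice(1)) {
    // Unescape in the order required by RFC 6901: ~1 first, then ~0.
    const key = raw.replace(/~1/g, '/').replace(/~0/g, '~')
    if (current === null || typeof current !== 'object') return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return current
}

function scoreContains(scorer: ContainsScorer, output: unknown): ScoreResult {
  const target = resolvePointer(output, scorer.path)
  if (target === undefined) {
    // A wrong path fails without throwing, which is why it is easy to miss.
    return { score: 0, passed: false, evidence: { reason: 'missing_pointer' } }
  }
  let haystack = String(target)
  let needle = scorer.value
  if (scorer.caseInsensitive) {
    haystack = haystack.toLowerCase()
    needle = needle.toLowerCase()
  }
  const passed = haystack.includes(needle)
  return { score: passed ? 1 : 0, passed }
}
```

A path of `/anwser` instead of `/answer` takes the missing-pointer branch and scores 0 on every candidate, which is the kind of mistake a unit test surfaces immediately.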

Contract tests — MCP

Test MCP tools with local fake servers

Use fixture servers for contract tests. Prove the harness correctly executes commands through the sandbox, validates schemas, and handles errors — without a real MCP server.

Stdio MCP contract tests

Stdio MCP contract tests (TypeScript)

// things to assert for mcp_stdio tools:

// 1. command runs through sandbox executor
it('runs command through sandbox exec', async () => {
  const result = await session.agents.diagrammer.prompt(input)
  expect(result.diagramXml).toBeDefined()
})

// 2. fails cleanly when sandbox has no executor
it('throws SandboxNoExecutorError without executor', async () => {
  const noExecHarness = defineHarness({ name: 'test' })
    .sandbox(inMemorySandbox()) // file-only, no exec
    .tools({ drawio: { kind: 'mcp_stdio', ... } })
    .build()

  const s = await noExecHarness.getSession('t')
  await expect(s.agents.diagrammer.prompt(input))
    .rejects.toThrow('SandboxNoExecutorError')
})

// 3. validates input schema before calling server
it('rejects invalid tool input', async () => {
  await expect(
    session.agents.diagrammer.prompt({ badInput: true })
  ).rejects.toThrow('ValidationError')
})

HTTP MCP contract tests

HTTP MCP contract tests (TypeScript)

// things to assert for mcp_http tools:

// 1. auth failure surfaces as McpAuthError
it('throws McpAuthError on 401', async () => {
  const harness = buildHarnessWithHttpMcp({
    url: 'http:___PH2___
  })
  await expect(session.agents.search.prompt(input))
    .rejects.toThrow('McpAuthError')
})

// 2. malformed response surfaces as McpProtocolError
it('throws McpProtocolError on bad response', async () => {
  await expect(session.agents.search.prompt(input))
    .rejects.toThrow('McpProtocolError')
})

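// 3. server output is validated against the tool's output schema
it('rejects invalid MCP server output', async () => {
  // fixture server returns output that does not match the schema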
  await expect(session.agents.search.prompt(input))
    .rejects.toThrow('ValidationError')
})

// 4. isError results surface as ToolError
it('throws ToolError when MCP signals isError', async () => {
  await expect(session.agents.search.prompt(input))
    .rejects.toThrow('ToolError')
})
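
The four HTTP cases above each need a fixture server that misbehaves in one specific way. One way to keep those fixtures small is a pure function that maps a failure mode to an HTTP response, which a local test server then returns. This is a sketch under assumptions: `fixtureResponse` and the mode names are hypothetical, the bodies use the JSON-RPC 2.0 envelope, and `isError` is set on the tool result as in the MCP tool-call result shape.

```typescript
// Failure modes a fixture server can simulate; the names are hypothetical.
type FixtureMode = 'unauthorized' | 'malformed' | 'tool-error' | 'ok'

interface FixtureResponse {
  status: number
  body: string
}

// Pure response builder for a fake HTTP MCP endpoint. `requestId` is the
// JSON-RPC id of the incoming tool-call request.
function fixtureResponse(mode: FixtureMode, requestId: number): FixtureResponse {
  switch (mode) {
    case 'unauthorized':
      // Exercises the auth-failure path.
      return { status: 401, body: '' }
    case 'malformed':
      // Valid HTTP, but not a JSON-RPC envelope.
      return { status: 200, body: 'not json' }
    case 'tool-error':
      // Well-formed envelope whose tool result reports a failure.
      return {
        status: 200,
        body: JSON.stringify({
          jsonrpc: '2.0',
          id: requestId,
          result: { isError: true, content: [{ type: 'text', text: 'lookup failed' }] },
        }),
      }
    case 'ok':
      return {
        status: 200,
        body: JSON.stringify({
          jsonrpc: '2.0',
          id: requestId,
          result: { content: [{ type: 'text', text: 'ok' }] },
        }),
      }
  }
}
```

In test setup, a server created with Node's `http.createServer` can write the returned status and body, and the tool's URL can point at that server's local port.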

Integration tests — review gates

Prove human-in-the-loop flows work correctly

Review gates are stateful flows. Four invariants must hold. Test each one explicitly.

INVARIANT

No mutation before approval

Run the workflow, assert write tools were not called before the review decision resolves.

INVARIANT

Review questions visible and submittable

Assert answer choices are returned and that each can be submitted independently.

INVARIANT

Decisions are idempotent

Submit the same decision twice. Assert the second submission succeeds or returns a clear idempotency signal.

INVARIANT

Stale ids fail cleanly

Submit a decision with a stale review id or run id. Assert a meaningful error, not a silent no-op.

No mutation before approval (TypeScript)

it('does not write before review approval', async () => {
  const writesSpy = vi.fn()

  const harness = defineHarness({ name: 'test' })
    .tools({
      write_wiki: {
        description: 'Write to wiki.',
        input: z.object({ content: z.string() }),
        output: z.object({ ok: z.boolean() }),
        handler: async (_ctx, input) => {
          writesSpy(input)
          return { ok: true }
        },
      },
    })
    .workflows(({ workflow }) => ({
      propose_and_review: workflow({
        input: z.object({ topic: z.string() }),
        output: z.object({ approved: z.boolean() }),
        handler: async (ctx) => {
          const proposal = await ctx.agents.writer(ctx.input)
          const decision = await ctx.reviewGate(proposal)
          if (decision.approved) {
            await ctx.tools.write_wiki({ content: proposal.content })
          }
          return { approved: decision.approved }
        },
      }),
    }))
    .build()

  const session = await harness.getSession('test')

  // pause at review gate — do not approve
  const run = session.workflows.propose_and_review.stream({ topic: 'AI policy' })
  for await (const event of run) {
    if (event.type === 'review.requested') break
  }

  expect(writesSpy).not.toHaveBeenCalled()
})
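
The idempotency and stale-id invariants can be easier to reason about with a small model of the decision log in mind. The class below is a hypothetical in-memory model, not the harness's review API: `ReviewLog`, its methods, and the result statuses are assumptions chosen to show the behaviour the tests should pin down.

```typescript
type Decision = 'approved' | 'rejected'

type SubmitResult =
  | { status: 'recorded' }
  | { status: 'duplicate' }
  | { status: 'error'; reason: 'unknown_review' | 'conflicting_decision' }

// In-memory model of a review gate's decision log.
class ReviewLog {
  private open = new Set<string>()
  private decided = new Map<string, Decision>()

  // Called when a run pauses at a review gate.
  request(reviewId: string): void {
    this.open.add(reviewId)
  }

  submit(reviewId: string, decision: Decision): SubmitResult {
    const previous = this.decided.get(reviewId)
    if (previous !== undefined) {
      // Same decision again is an explicit idempotency signal;
      // a different decision is rejected rather than overwriting.
      return previous === decision
        ? { status: 'duplicate' }
        : { status: 'error', reason: 'conflicting_decision' }
    }
    if (!this.open.has(reviewId)) {
      // Stale or unknown ids produce an error, never a silent no-op.
      return { status: 'error', reason: 'unknown_review' }
    }
    this.open.delete(reviewId)
    this.decided.set(reviewId, decision)
    return { status: 'recorded' }
  }
}
```

Whatever the real API returns, the tests for these two invariants come down to the same three submissions: first decision, repeated decision, and a decision for an id that was never issued.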

Contract tests — adapters

Validate custom adapters with shared contracts

The harness ships shared contract test suites from @purista/harness/testing. Run them against your custom adapters before publishing to verify they behave exactly as the core expects.

State store

stateStoreContract (TypeScript)

import { stateStoreContract } from '@purista/harness/testing'
import { myStateStore } from './myStateStore'

describe('myStateStore', () => {
  stateStoreContract(() => myStateStore({
    url: process.env.TEST_DB_URL,
  }))
})

Validates session CRUD, run write/read, message append, event persistence, and concurrent-write behaviour.

Memory adapter

memoryAdapterContract (TypeScript)

import { memoryAdapterContract } from '@purista/harness/testing'
import { myMemoryAdapter } from './myMemoryAdapter'

describe('myMemoryAdapter', () => {
  memoryAdapterContract(() => myMemoryAdapter({
    url: process.env.TEST_REDIS_URL,
  }))
})

Validates scope isolation, read/write round-trips, list with prefix, delete, TTL expiry, and search when declared.

Sandbox

sandboxContract (TypeScript)

import { sandboxContract } from '@purista/harness/testing'
import { myCustomSandbox } from './myCustomSandbox'

describe('myCustomSandbox', () => {
  sandboxContract(() => myCustomSandbox())
})

Validates file read/write, executor availability declaration, cancellation, and snapshot/resume when declared.
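
All three contract helpers above follow the same pattern: a suite that takes a factory and runs identical checks against whatever the factory returns. The miniature below illustrates that pattern only. It is not how the harness suites are written: `KeyValueAdapter`, `runKeyValueContract`, and `mapAdapter` are hypothetical, and the interface is synchronous for brevity where real adapters are likely asynchronous.

```typescript
// Hypothetical, synchronous key-value adapter used only for illustration.
interface KeyValueAdapter {
  set(key: string, value: string): void
  get(key: string): string | undefined
  delete(key: string): void
  list(prefix: string): string[]
}

// Runs the same checks against any implementation and returns the
// names of the checks that failed.
function runKeyValueContract(create: () => KeyValueAdapter): string[] {
  const failures: string[] = []
  const check = (name: string, ok: boolean) => {
    if (!ok) failures.push(name)
  }

  // Each check gets a fresh adapter so state cannot leak between checks.
  const a = create()
  a.set('user/1', 'x')
  check('read returns written value', a.get('user/1') === 'x')

  const b = create()
  b.set('user/1', 'x')
  b.set('team/1', 'y')
  check('list filters by prefix', b.list('user/').join(',') === 'user/1')

  const c = create()
  c.set('k', 'v')
  c.delete('k')
  check('delete removes key', c.get('k') === undefined)

  return failures
}

// Reference implementation used to sanity-check the contract itself.
function mapAdapter(): KeyValueAdapter {
  const data = new Map<string, string>()
  return {
    set: (k, v) => { data.set(k, v) },
    get: (k) => data.get(k),
    delete: (k) => { data.delete(k) },
    list: (p) => Array.from(data.keys()).filter((k) => k.startsWith(p)),
  }
}
```

Passing a factory rather than an instance is the design choice that matters: it lets the suite create a clean adapter per check, which is also why the harness helpers above are called with `() => myStateStore(...)` and not with a shared instance.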

Coverage gate

The harness core enforces a coverage gate on its own test suite: statements 80%, branches 75%, functions 80%, lines 80%. Adapter packages should enforce their own gates before publishing.
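
An adapter package could mirror those thresholds in its own test runner configuration. The fragment below is a sketch assuming the package uses Vitest 1.0 or later, where coverage thresholds live under `coverage.thresholds`; the source does not show how the harness core configures its own gate.

```typescript
// vitest.config.ts (assumed file name and test runner)
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    coverage: {
      // Same numbers as the harness core gate; the run fails below any of them.
      thresholds: {
        statements: 80,
        branches: 75,
        functions: 80,
        lines: 80,
      },
    },
  },
})
```

Running the suite with coverage enabled then fails the build when any metric drops under its threshold.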

Production-ready. Check the list.

Review the security model and production checklist before deploying to make sure nothing slips through.
